Hello Everyone!
Well, we had a nice Friday lined up – we went live with our upgrade to Sitecore 9.1 on Thursday night (we follow No Deploy Friday too), everything tested fine from a functional standpoint, and it all looked great.
Then suddenly, on Friday morning, right before our status check call with the stakeholders – BAM – all the functionality relying on Solr search was down!
Challenge:
None of the search functionality on the website was working.
What we did as part of troubleshooting, and the pattern that helped us solve it
Well, to begin with, we started thinking about what had suddenly gone wrong. Everything was fine during testing, so we had no doubts about the functional side of things. Instead, we started brainstorming from the environment/configuration point of view:
- Could it be the VIPs/Load Balancer?
- Could it be monitoring systems?
- Is the Solr Cloud up and running?
As any Sitecore developer would, we went and checked the Sitecore logs on the CM and CD servers. In the log files we noticed some unusual entries like the ones below:
9440 2019:09:06 01:39:35 ERROR The operation has timed out
Exception: SolrNet.Exceptions.SolrConnectionException
Message: The operation has timed out
Source: SolrNet
at SolrNet.Impl.SolrConnection.Get(String relativeUrl, IEnumerable`1 parameters)
at SolrNet.Impl.SolrQueryExecuter`1.Execute(ISolrQuery q, QueryOptions options)
at Sitecore.ContentSearch.SolrProvider.LinqToSolrIndex`1.GetResult(SolrCompositeQuery compositeQuery, QueryOptions queryOptions)
Nested Exception
Exception: System.Net.WebException
Message: The operation has timed out
Source: System
at System.Net.HttpWebRequest.GetResponse()
at HttpWebAdapters.Adapters.HttpWebRequestAdapter.GetResponse()
at SolrNet.Impl.SolrConnection.GetResponse(IHttpWebRequest request)
at SolrNet.Impl.SolrConnection.Get(String relativeUrl, IEnumerable`1 parameters)
Next, we found another log entry:
9888 2019:09:06 01:57:05 ERROR <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 Server Error</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /solr/MYCUSTOMINDEX/select. Reason:
<pre> Server Error</pre></p><h3>Caused by:</h3><pre>java.lang.OutOfMemoryError: Java heap space
</pre>
</body>
</html>
Exception: SolrNet.Exceptions.SolrConnectionException
Message: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 Server Error</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /solr/MYCUSTOMINDEX/select. Reason:
<pre> Server Error</pre></p><h3>Caused by:</h3><pre>java.lang.OutOfMemoryError: Java heap space
</pre>
</body>
</html>
Source: SolrNet
at SolrNet.Impl.SolrConnection.Get(String relativeUrl, IEnumerable`1 parameters)
at SolrNet.Impl.SolrQueryExecuter`1.Execute(ISolrQuery q, QueryOptions options)
at Sitecore.ContentSearch.SolrProvider.LinqToSolrIndex`1.GetResult(SolrCompositeQuery compositeQuery, QueryOptions queryOptions)
Nested Exception
Exception: System.Net.WebException
Message: The remote server returned an error: (500) Internal Server Error.
Source: System
at System.Net.HttpWebRequest.GetResponse()
at HttpWebAdapters.Adapters.HttpWebRequestAdapter.GetResponse()
at SolrNet.Impl.SolrConnection.GetResponse(IHttpWebRequest request)
at SolrNet.Impl.SolrConnection.Get(String relativeUrl, IEnumerable`1 parameters)
So now we knew it was something with Solr – and it looked like it was running out of memory.
This also matched the pattern of the search functionality stopping suddenly.
We checked the Solr cloud and, as it was not responding, we stopped all the Solr nodes and ZooKeepers, started the ZooKeepers first and then the Solr nodes, and the search functionality started working again.
We then checked the Solr nodes to prove the point, and what we saw matched the theory/pattern.
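For reference, the restart sequence we followed can be sketched as below – stop Solr everywhere, bring the ZooKeeper ensemble back first, then start Solr against it. The hostnames and ports are illustrative assumptions; adjust them to your environment and to however your ZooKeeper service is managed:

```shell
REM On each Solr server: stop all local Solr nodes
bin\solr.cmd stop -all

REM On each ZooKeeper node: restart the service
REM (e.g. via zkServer.cmd, or your service manager)

REM Once the ensemble is healthy, start each Solr node in cloud mode,
REM pointing it at the ensemble (hostnames below are hypothetical)
bin\solr.cmd start -cloud -z zk1:2181,zk2:2181,zk3:2181 -p 8983
```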
But this was not the solution – it was only a temporary fix; the real issue was the OutOfMemoryError.
The JVM heap was already 91% utilized on all the Solr nodes within just 12 minutes of starting them.
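Rather than eyeballing the admin dashboard, heap pressure like this can also be checked from Solr's admin API. Below is a minimal sketch (the endpoint is Solr's standard /solr/admin/info/system handler; the base URL and the 90% threshold are illustrative assumptions):

```python
import json
from urllib.request import urlopen


def heap_utilization(system_info: dict) -> float:
    """Return JVM heap utilization as a percentage, given the parsed
    JSON from Solr's /solr/admin/info/system endpoint."""
    raw = system_info["jvm"]["memory"]["raw"]  # "used" / "max" in bytes
    return 100.0 * raw["used"] / raw["max"]


def check_solr_heap(base_url: str, warn_at: float = 90.0) -> float:
    """Fetch system info from one Solr node and warn if heap usage is high."""
    with urlopen(f"{base_url}/solr/admin/info/system?wt=json") as resp:
        info = json.load(resp)
    pct = heap_utilization(info)
    if pct >= warn_at:
        print(f"WARNING: JVM heap at {pct:.0f}% on {base_url}")
    return pct
```

Running a check like this against each node right after a restart would have surfaced our 91% utilization without waiting for the timeouts to appear.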
So we knew that to solve it, we had to increase this memory.
Solution:
We searched online and found this useful documentation guide: https://lucene.apache.org/solr/guide/6_6/taking-solr-to-production.html#TakingSolrtoProduction-MemoryandGCSettings
There, we found the value to change in solr.in.cmd: the SOLR_JAVA_MEM parameter.
The default is SOLR_JAVA_MEM=-Xms512m -Xmx512m
We changed it to SOLR_JAVA_MEM=-Xms9g -Xmx10g
Which means an initial heap of 9 GB, allowed to grow to a maximum of 10 GB.
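For reference, the relevant line in solr.in.cmd ends up looking roughly like this (9/10 GB is what suited our servers, not a universal recommendation – size the heap to your machine's RAM and index load):

```shell
REM solr.in.cmd (Windows): the value is set without quotes
set SOLR_JAVA_MEM=-Xms9g -Xmx10g
```

On Linux the equivalent setting lives in solr.in.sh and takes the quoted form SOLR_JAVA_MEM="-Xms9g -Xmx10g".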
Also, when we checked this parameter in solr.in.cmd, it was accompanied by a comment suggesting that the Java heap be increased as needed to support your indexing and query load – which helped us gain confidence in what we were doing.
After making this change, we kept monitoring Solr and found that it occupied a minimum of around 1.4 GB during low traffic and around 2.5 GB during high traffic – still comfortably below the configured cap.
Credits:
Thanks to my colleagues Kiran Patil, Yogini Zope & John Schjolberg for troubleshooting with me and solving this issue.
References:
- https://lucene.apache.org/solr/guide/6_6/taking-solr-to-production.html#TakingSolrtoProduction-MemoryandGCSettings
Happy Troubleshooting! 🙂