Every day hundreds of thousands of users come to Redbubble searching for art works. Given the large number of works on Redbubble – we host millions of works for our ever-growing community of artists – it’s important that our search engine return good results, because our users sure ain’t going to paginate through millions of works!
Standing between us and our quest to produce relevant search results was our 3-year-old Solr 3.6. It wasn’t so much its age that bothered us as its lack of boolean and relevance functions. Add the fact that Solr 3.6’s higher memory demand was causing occasional performance problems, and we had a solid case to say goodbye to this fella.
So we went ahead and upgraded to 4.10. (In case you’re wondering, we didn’t go all the way to 5.x because a Solr upgrade cannot skip a major version.) The migration obligingly threw the proverbial spanner in the works, so we are sharing our experience here for the benefit of others.
Java and web container
Let’s get the easy bits out of the way. Solr 4.10 requires at least Java 7, but you might as well get Java 8 (the latest at the time of writing). We use the Tomcat web container to run Solr, and we found that Solr 4.10 will happily run on Tomcat 6 (old!). That said, use the newest Tomcat you can.
Theoretically it is possible to do a rolling upgrade of each Solr instance in place, starting from a slave and finishing with the master. This had been our plan initially but we soon decided that this technique was too risky.
Instead, we spun up an independent set of servers so we could set up the new Solr cluster offline. This allowed us to break things – let’s be honest, nobody gets everything right the first time – without impacting our capacity to serve traffic. Just as importantly, it gave us an isolated platform on which to run tests before going live. Please see the Testing section below for more information.
Another useful design decision was to keep the same Master-Slave topology (as opposed to the cloud topology) to minimise differences between the old and new Solr clusters.
Solr 3.6 stores its index differently from Solr 4.10. Of all the upgrade tasks we had to do, migrating the Solr index from 3.6 to 4.10 was the trickiest to get right. The only way we know of to migrate the index reliably is to do the following dance:
- Stop the Solr 4.10 Master and Slaves
- Ensure Solr 4.10 Master and Slaves index is empty
- Freeze Solr 3.6 Master’s index (stop further writes)
- Copy Solr 3.6 Master’s index to Solr 4.10 Master (using good old tar and scp)
- Start Solr 4.10 Master
- Optimise Solr 4.10 Master
- Restart Solr 4.10 Master
- Start Solr 4.10 Slaves
We discovered that migrating the data by making the Solr 4.10 Master a repeater breaks its slaves’ ability to replicate.
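One way to confirm that the slaves are still replicating once they are started is to compare each slave's index version and generation with the master's. The following is a minimal Python sketch, not the procedure used in this migration. It assumes the replication handler is mounted at `/replication` on each core and that its `command=details` JSON response exposes `indexVersion` and `generation` under a `details` key; verify both against your own Solr before relying on it. The function names are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_details(core_url):
    """Fetch replication details from a Solr core (core_url is a placeholder you supply)."""
    url = core_url.rstrip("/") + "/replication?command=details&wt=json"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def index_state(details_response):
    """Reduce a details response to (indexVersion, generation).

    Assumes both fields sit under the top-level "details" key.
    """
    details = details_response["details"]
    return (details["indexVersion"], details["generation"])


def out_of_sync(master_state, slave_states):
    """Return the names of slaves whose state differs from the master's, sorted."""
    return sorted(name for name, state in slave_states.items() if state != master_state)
```

In use, you would call `index_state(fetch_details(url))` for the master and for each slave, then pass the results to `out_of_sync`. An empty list means every slave reports the same version and generation as the master. A slave that stays in the list well past its poll interval is worth investigating.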
The eventual switchover went smoothly with zero downtime and no nasty surprises, thanks to various tests that we did beforehand.
Below is a summary of what we tested:
|What was tested|Comment|
|Index replication works|We compared the index size, generation and version|
|Search results remain the same (i.e. not accidentally altered through misconfiguration, replication issues, etc.)|We built a Ruby script to run searches against Solr 3.6 and 4.10 and check that both result sets were identical|
|Response time over a period of at least 24 hours is not worse|We used siege to hit one of the Solr 4.10 slaves for 24 hours with realistic search traffic and monitored its response time through NewRelic|
|JVM memory consumption over a period of at least 24 hours is stable|JVM memory issues tend to rear their heads after prolonged heavy use, so we piggy-backed on the test above to look for memory leaks or excessive GC (garbage collection) pauses|
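To illustrate the search-result comparison, here is a minimal Python sketch of the idea. It is not the Ruby script mentioned in the table. It assumes both clusters expose the standard `/select` handler with JSON output and that the unique key field is called `id`; the function names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def extract_ids(response):
    """Pull the ordered list of document IDs out of a Solr JSON select response."""
    return [doc["id"] for doc in response["response"]["docs"]]


def search_ids(core_url, query, rows=50):
    """Run a query against one Solr core and return the ordered result IDs."""
    params = urlencode({"q": query, "fl": "id", "rows": rows, "wt": "json"})
    with urlopen(core_url.rstrip("/") + "/select?" + params, timeout=10) as resp:
        return extract_ids(json.load(resp))


def mismatched_queries(queries, old_search, new_search):
    """Return the queries whose ordered result IDs differ between the two clusters."""
    return [q for q in queries if old_search(q) != new_search(q)]
```

A run might look like `mismatched_queries(queries, lambda q: search_ids(OLD_URL, q), lambda q: search_ids(NEW_URL, q))`, where `OLD_URL` and `NEW_URL` are placeholders for your own core URLs. Comparing ordered IDs is deliberately strict. If scoring changes between versions legitimately reorder ties, comparing the IDs as sets may be a more forgiving starting point.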
We saw that the response time of the newer JVM + Tomcat + Solr stack was lower (quicker), and that there was no memory leak. We did, however, have to redistribute JVM memory and increase Eden space to keep GC activity low (3-4% of CPU time).
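As a rough illustration of how a GC overhead figure can be estimated, the Python sketch below parses the output of `jstat -gcutil <pid> <interval>` and computes the fraction of wall-clock time spent in GC between the first and last samples. This is an approximation under stated assumptions, not necessarily how the 3-4% figure above was measured. It assumes the output has a single header row, that the `GCT` column is cumulative GC time in seconds, and that samples are evenly spaced. The function name is illustrative.

```python
def gc_overhead(jstat_output, interval_seconds):
    """Estimate the fraction of wall-clock time spent in GC from `jstat -gcutil` output.

    The GCT column is located by header name, so the column layout
    (which differs between Java 7 and Java 8) does not matter.
    """
    lines = [ln.split() for ln in jstat_output.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    if len(rows) < 2:
        raise ValueError("need at least two samples to measure a delta")
    gct = header.index("GCT")
    elapsed = (len(rows) - 1) * interval_seconds
    return (float(rows[-1][gct]) - float(rows[0][gct])) / elapsed
```

For example, if `GCT` rises from 2.500 to 2.700 between two samples taken 5 seconds apart, the estimated overhead is about 0.04, or 4%.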
See also Solr tuning at Redbubble
Test your upgraded Solr for performance, correctness and replicability. Replication in particular can be tricky to set up. You don’t have to be fancy with your test tools – we got by with pretty standard stuff like siege, NewRelic and jstat.