The New SolrCloud: Overview

Just the other day we wrote about Sensei, the new distributed, real-time, full-text search database built on top of Lucene, and here we are again writing about another “new” distributed, real-time, full-text search server also built on top of Lucene: SolrCloud.

In this post we’ll share some interesting SolrCloud bits and pieces that matter mostly to those working with large data and query volumes, but that all search lovers should find really interesting, too.  If you have any questions about what we wrote (or did not write!) in this post, please leave a comment – we’re good at following up on comments!  Or just ask @sematext!

Please note that the functionality described in this post is now part of trunk in the Lucene and Solr SVN repository.  This means it will be available when Lucene and Solr 4.0 are released, but you can also use the trunk version just like we did, if you don’t mind living on the bleeding edge.

Recently, we were once again given the opportunity to work with big data (massive may actually be more descriptive of this data volume) stored in an HBase cluster and make it searchable. We needed to design a scalable search cluster capable of elastically handling future data volume growth.  Because of the huge data volume and high search rates, our search system required the index to be sharded.  We also wanted the indexing to be as simple as possible, and we wanted a stable, reliable, and very fast solution. The one thing we did not want to do was reinvent the wheel.  At this point you may ask why we didn’t choose ElasticSearch, especially since we use ElasticSearch a lot at Sematext.  The answer is that we started the engagement with this particular client a whiiiiile back, when ElasticSearch wasn’t where it is today.  And while ElasticSearch does have a number of advantages over the old master-slave Solr, with SolrCloud being in the trunk now, Solr is again a valid choice for very large search clusters.

And so we took the opportunity to use SolrCloud and some of its features not present in previous versions of Solr.  In particular, we wanted to make use of Distributed Indexing and Distributed Searching, both of which SolrCloud makes possible. In the process we looked at a few JIRA issues, such as SOLR-2358 and SOLR-2355, and we got familiar with relevant portions of SolrCloud source code.  This confirmed SolrCloud would indeed satisfy our needs for the project and here we are sharing what we’ve learned.

Our Search Cluster Architecture

Basically, we wanted the search cluster to look like this:

SolrCloud App Architecture

Simple? Yes, we like simple.  Who doesn’t!  But let’s peek inside that “Solr cluster” box now.

SolrCloud Features and Architecture

Some of the nice things about SolrCloud are:

  • centralized cluster configuration
  • automatic node fail-over
  • near real time search
  • leader election
  • durable writes

Furthermore, SolrCloud can be configured to:

  • have multiple index shards
  • have one or more replicas of each shard

Shards and Replicas are arranged into Collections. Multiple Collections can be deployed in a single SolrCloud cluster.  A single search request can search multiple Collections at once, as long as they are compatible. The diagram below shows a high-level picture of how SolrCloud indexing works.

SolrCloud Shards, Replicas, Replication

As the above diagram shows, documents can be sent to any SolrCloud node/instance in the SolrCloud cluster.  Documents are automatically forwarded to the appropriate Shard Leader (labeled as Shard 1 and Shard 2 in the diagram). This is done automatically, and documents are sent in batches between Shards. If a Shard has one or more replicas (labeled Shard 1 replica and Shard 2 replica in the diagram), the document gets replicated to each of them.  Unlike in traditional master-slave Solr setups, where index/shard replication is performed periodically in batches, replication in SolrCloud is done in real-time.  This is how Distributed Indexing works at a high level.  We simplified things a bit, of course – for example, there is no ZooKeeper or overseer shown in our diagram.

Setup Details

All configuration files are stored in ZooKeeper.  If you are not familiar with ZooKeeper, you can think of it as a distributed file system where SolrCloud configuration files are stored. When the first Solr instance in a SolrCloud cluster is started, configuration files need to be sent to ZooKeeper, and one needs to specify how many shards there should be in the cluster. Once this Solr instance/node is running, one can start additional Solr instances/nodes and point them to the ZooKeeper instance (ZooKeeper is actually typically deployed as a quorum of 3, 5, or more instances in production environments).  And voilà – the SolrCloud cluster is up!  I must say, it’s quite simple and straightforward.
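
To make this more concrete, here is a minimal sketch of pointing a ZooKeeper-aware SolrJ client from trunk at such a cluster.  The class name reflects the trunk-era SolrJ API, and the ZooKeeper addresses and collection name are made up:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    public class CloudConnectExample {
      public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper quorum rather than to any single Solr node;
        // the client learns the cluster layout (shards, replicas, leaders) from ZooKeeper.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");
        // From here on, the client can index and query without
        // knowing the address of any individual Solr node.
      }
    }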

Shard Replicas in SolrCloud serve multiple purposes.  They provide fault tolerance in the sense that when (not if!) a single Solr instance/node containing a portion of the index goes down, you still have one or more replicas of the data that was served by that instance elsewhere in the cluster, and thus you still have the whole data set and no data loss.  They also allow you to spread query load over more servers, thus making the cluster capable of handling higher query rates.

Indexing

As you saw above, the new SolrCloud really simplifies Distributed Indexing.  Document distribution between Shards and Replicas is automatic and real-time.  There is no master server one needs to send all documents to. A document can be sent to any SolrCloud instance and SolrCloud takes care of the rest. Because of this, there is no longer a SPOF (Single Point of Failure) in Solr.  Previously, Solr master was a SPOF in all but the most elaborate setups.
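
Here is what that looks like in practice – a minimal SolrJ sketch reusing the ZooKeeper-aware client shown earlier (collection and field names are made up):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudIndexExample {
      public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "Hello SolrCloud");

        // The document can be sent to any node; SolrCloud forwards it to the
        // appropriate Shard Leader, which replicates it to the Shard's replicas.
        server.add(doc);
        server.commit();
      }
    }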

Querying

One can query SolrCloud a few different ways:

  • One can query a single Shard, which is just like querying a single Solr instance in a traditional setup.
  • The second option is to query a single Collection (i.e., search all shards holding pieces of a given Collection’s index).
  • The third option is to only query some of the Shards by specifying their addresses or names.
  • Finally, one can query multiple Collections, assuming they are compatible and Solr can merge the results they return.

As you can see, lots of choices!
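
Here is a rough SolrJ sketch of some of these options.  The shard and collection names are made up; the shards and collection parameters reflect the trunk-era way of targeting specific Shards or multiple Collections:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    public class CloudQueryExample {
      public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        // Query the whole Collection (all of its Shards).
        SolrQuery all = new SolrQuery("*:*");
        server.query(all);

        // Query only some of the Shards by name.
        SolrQuery some = new SolrQuery("*:*");
        some.set("shards", "shard1,shard2");
        server.query(some);

        // Query multiple (compatible) Collections at once.
        SolrQuery multi = new SolrQuery("*:*");
        multi.set("collection", "collection1,collection2");
        server.query(multi);
      }
    }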

Administration with Core Admin

In addition to the standard core admin parameters, there are some new ones available in SolrCloud. These new parameters let one (see the sketch after this list):

  • create new Shards for an existing Collection
  • create a new Collection
  • add more nodes
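
As a sketch, creating a new core and attaching it to an existing Collection and Shard boils down to a Core Admin HTTP call along these lines (the host, core, collection, and shard names are made up; the parameter names follow the SolrCloud wiki):

    import java.io.InputStream;
    import java.net.URL;

    public class CoreAdminExample {
      public static void main(String[] args) throws Exception {
        // CREATE a new core that registers itself as part of an existing
        // Collection and Shard; SolrCloud picks up the change via ZooKeeper.
        String url = "http://solr-host:8983/solr/admin/cores"
            + "?action=CREATE"
            + "&name=collection1_shard3_replica1"
            + "&collection=collection1"
            + "&shard=shard3";
        try (InputStream in = new URL(url).openStream()) {
          // Print the response; a real client would parse the XML/JSON status.
          in.transferTo(System.out);
        }
      }
    }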

The Future

If you look at the New SolrCloud Design wiki page (http://wiki.apache.org/solr/NewSolrCloudDesign) you will notice that not all planned features have been implemented yet. There are still things like cluster re-balancing and monitoring to be done (if you are using SolrCloud already and want to monitor its performance, let us know if you want early access to SPM for SolrCloud).  Now that SolrCloud is in the Solr trunk, it should see more user and developer attention.  We look forward to using SolrCloud in more projects in the future!

@sematext

Lucene & Solr Year 2011 in Review

The year 2011 is coming to an end and it’s time to reflect on the past 12 months.  Without further fluff, let’s look back and summarize all the significant events that happened in the Lucene and Solr world over the course of the last dozen months. In the next few paragraphs we’ll go over major changes in Lucene and Solr, new blood, relevant conferences, and books.

We should start by pointing out that this year Apache Lucene celebrated its 10th anniversary as an Apache Software Foundation project.  Lucene itself is actually over 10 years old.  Otis is one of the very few people from the early years who is still around.  While we didn’t celebrate any Solr anniversaries this year, we should note that Solr, too, has been around for quite a while and is in fact approaching its 6th year at the ASF!

This year saw numerous changes and additions in both Lucene and Solr.  As a matter of fact, we’d venture to say we saw more changes in Lucene & Solr this year than in any one year before.  In that sense, both projects are very much like wine – getting better with time. Let’s take a look at a few of the most significant changes in 2011.

The much anticipated Near Real-Time search (NRT) functionality has arrived.  What this means is that documents that were just added to a Lucene/Solr index can immediately be made visible in search results.  This is big!  Of course, work on NRT is still in progress, but NRT is ready and you, like a number of our clients, should start using it.
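
On the Lucene side, the NRT pattern can be sketched like this, using the Lucene 3.5-era API (the directory, analyzer, and field setup are just placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NrtExample {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(new RAMDirectory(),
            new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

        Document doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);

        // Open a near-real-time reader straight from the writer: the just-added
        // document is searchable without a full commit to disk.
        IndexReader reader = IndexReader.open(writer, true);
        IndexSearcher searcher = new IndexSearcher(reader);
        // ... search ...

        // Later, cheaply reopen to pick up further changes.
        IndexReader newReader = IndexReader.openIfChanged(reader, writer, true);
        if (newReader != null) {
          reader.close();
          reader = newReader;
        }
      }
    }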

Field Collapsing was one of the most watched and voted-for JIRA issues for many months.  This functionality was implemented this year, and now Lucene and Solr users can group results on the basis of a field or a query. In addition, you can control the groups and even do faceting calculations on groups instead of single documents. A rather powerful feature.
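
On the Solr side, turning grouping on is just a matter of a few request parameters – a sketch with a made-up field name:

    import org.apache.solr.client.solrj.SolrQuery;

    public class GroupingExample {
      public static void main(String[] args) {
        SolrQuery query = new SolrQuery("laptop");
        query.set("group", true);            // turn grouping on
        query.set("group.field", "brand");   // one group per brand value
        query.set("group.limit", 3);         // top 3 documents per group
        System.out.println(query);
      }
    }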

From Lucene users’ perspective it is also worth noting that Lucene finally got a faceting module.  Until now, faceting was available only in Solr.  If you are a pure Lucene user, you no longer need Solr to calculate facets.

In the past, modeling parent-child relationships in Lucene and Solr indices was not really possible – one had to flatten everything.  No longer – if you need to model a parent-child relationship in your index, you can use the Join contrib module.  This Join functionality lets you join parent and child documents at query-time, while relying on some assumptions about how documents were indexed.
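
Here is a rough sketch of the idea using the 3.4-era join module – treat it as an illustration rather than a recipe, since class and parameter names changed as the module evolved.  The key assumption is that each parent is indexed together with its children as a single block, children first:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.join.BlockJoinQuery;

    public class JoinSketch {
      static void indexBlock(IndexWriter writer, List<Document> children, Document parent)
          throws Exception {
        // Children first, parent last: the join module relies on this ordering.
        List<Document> block = new ArrayList<Document>(children);
        block.add(parent);
        writer.addDocuments(block);
      }

      static Query childrenToParents(Query childQuery) {
        // Filter identifying parent documents (e.g. all docs with type:parent).
        Filter parents = new CachingWrapperFilter(
            new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
        // Matches parents whose child documents match childQuery.
        return new BlockJoinQuery(childQuery, parents, BlockJoinQuery.ScoreMode.Avg);
      }
    }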

Good and broad language support is hugely important for any search solution, and this year was good for Lucene and Solr in that department: the KStemFilter English stemmer was added, full Unicode 4 support was added, new Japanese and Chinese support was added, a new stemmer-protection mechanism was added, work on reducing the synonym filter’s RAM consumption was done, etc.  Another big addition was integration with Hunspell, which enables language-specific processing for all languages supported by OpenOffice.  That’s a lot of new languages we can now handle with Lucene and Solr! And there is more.

Lucene 3.5.0 introduced a significantly reduced term dictionary memory footprint. Big!  Lucene now uses 3 to 5 times less memory when dealing with the term dictionary, so it is even less RAM-hungry than before.

If you use Lucene and need to page through a lot of results, you may run into problems. That’s why the searchAfter method was introduced in Lucene 3.5.0, solving the deep paging problem once and for all!
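
A minimal sketch of the pattern (Lucene 3.5 API; the query is arbitrary):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class DeepPagingExample {
      static void pageThrough(IndexReader reader) throws Exception {
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new MatchAllDocsQuery();

        TopDocs page = searcher.search(query, 100);
        while (page.scoreDocs.length > 0) {
          // ... process page.scoreDocs ...
          ScoreDoc last = page.scoreDocs[page.scoreDocs.length - 1];
          // Resume after the last hit instead of re-collecting all earlier pages.
          page = searcher.searchAfter(last, query, 100);
        }
      }
    }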

There is also a new, fast and reliable Term Vector-based highlighter that both Lucene and Solr can use.
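
A sketch of using it straight from Lucene – it assumes the highlighted field was indexed with term vectors, positions, and offsets, and the field name is made up:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
    import org.apache.lucene.search.vectorhighlight.FieldQuery;

    public class HighlightExample {
      static String highlight(IndexReader reader, Query query, int docId) throws Exception {
        FastVectorHighlighter highlighter = new FastVectorHighlighter();
        FieldQuery fieldQuery = highlighter.getFieldQuery(query);
        // Best 100-character fragment of the "content" field, with match tags.
        return highlighter.getBestFragment(fieldQuery, reader, docId, "content", 100);
      }
    }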

Dismax is great, but the Extended Dismax query parser added to Solr is even better – it extends the Dismax query parser’s functionality and can further improve the quality of search results.
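
Enabling it is just a matter of request parameters – a sketch with made-up field names:

    import org.apache.solr.client.solrj.SolrQuery;

    public class EdismaxExample {
      public static void main(String[] args) {
        SolrQuery query = new SolrQuery("solr cloud");
        query.set("defType", "edismax");            // use the Extended Dismax parser
        query.set("qf", "title^2.0 description");   // fields to search, with boosts
        System.out.println(query);
      }
    }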

You can now also sort by function (imagine sorting results by their distance from a point), and there is new spatial search support with filtering.
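
For example, sorting results by distance from a point can be sketched like this (the location field name is made up; geodist, sfield, and pt are the spatial parameters that arrived with 3.1):

    import org.apache.solr.client.solrj.SolrQuery;

    public class SpatialSortExample {
      public static void main(String[] args) {
        SolrQuery query = new SolrQuery("*:*");
        query.set("sfield", "location");        // a location-type field in the schema
        query.set("pt", "40.7143,-74.0060");    // reference point (lat,lon)
        query.set("sort", "geodist() asc");     // sort by the distance function
        System.out.println(query);
      }
    }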

Solr also got new suggest/autocomplete functionality based on an FST automaton, which significantly reduces the memory needed for such functionality.  If you need this for your search application, have a look at Sematext’s AutoComplete – it has additional functionality that lots of our customers like.

While not yet officially released, the new transaction log support provides Solr with a real-time get operation – as soon as you add a document you can retrieve it by ID.  This will also be used for recovering nodes in SolrCloud.
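
Over HTTP this can be sketched as follows – the /get handler name comes from the trunk-era example configuration, and the host and document ID are made up:

    import java.io.InputStream;
    import java.net.URL;

    public class RealTimeGetExample {
      public static void main(String[] args) throws Exception {
        // Fetch a just-indexed document by ID, even before a commit,
        // thanks to the transaction log.
        String url = "http://solr-host:8983/solr/get?id=1";
        try (InputStream in = new URL(url).openStream()) {
          in.transferTo(System.out);
        }
      }
    }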

And talking about SolrCloud…  We’ve covered SolrCloud on this blog before in Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search, and we’ll be covering it again soon.  In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components, such as ZooKeeper, that make the creation of distributed, cluster-based software/services easier.  Some of the core functionality: there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic, and in general things will be much more dynamic.  SolrCloud has not been released yet, but Solr developers are working on it and the codebase is seeing good progress.  We’ve used SolrCloud in a few recent engagements with our customers and were pleased by what we saw.

After the development of those two projects was merged back in 2010, we saw a speed-up in development and releases. Lucene and Solr committers released five(!) new versions of both projects this year. In March, Lucene and Solr 3.1 were released with Unicode 4 support, ReusableTokenStream, spatial search, the vector-based highlighter, the Extended Dismax parser, and many more features and bug fixes. Then, less than 3 months later(!), on June 4th, version 3.2 was released. This release introduced the new and much-desired results grouping module, NRTCachingDirectory, and highlighting performance improvements. Just one month later, on July 1st, Lucene and Solr 3.3 were introduced. That release included the KStem stemmer, new Spellchecker implementations, Field Collapsing in Solr, and RAM usage reduction for the autocomplete mechanism. By the end of summer there was another release, this time version 3.4, released on the 14th of September. Pure Lucene users got what Solr had been able to do for a very long time – the long-awaited faceting module, contributed by IBM. Version 3.4 also included the new Join functionality, the ability to turn off query and filter caches, and faceting calculation for Field Collapsing. The last release of Lucene and Solr saw the light of day in late November. Version 3.5.0 brought a huge memory reduction when dealing with term dictionaries, deep paging support, the SearcherManager and SearcherLifetimeManager classes, language identification provided by Tika, and sortMissingFirst and sortMissingLast support for TrieFields.

During the last 12 months we attended three major conferences focused on search and big data themes. Lucene Revolution took place in San Francisco in May. Otis gave a talk titled “Search Analytics: What? Why? How?” (slides) during the first day. There were a number of other good talks there, and the complete conference agenda is available at http://lucenerevolution.com/2011/agenda. Some videos are available as well. Next came the Berlin Buzzwords conference, a more grass-roots conference which took place between the 4th and 10th of June. Otis gave an updated version of his “Search Analytics: What? Why? How?” talk. If you want to know more, check the conference’s official site – http://berlinbuzzwords.de. The last conference, focused exclusively on Lucene and Solr, was Lucene Eurocon 2011 in sunny and tourist-filled Barcelona, between the 17th and 20th of October. And guess what – we were there again (surprise!), this time in slightly larger numbers. Otis gave a talk about “Search Analytics: Business Value & BigData NoSQL Backend” (video, slides) and Rafał gave a talk on a pretty popular topic – “Explaining & Visualizing Solr ‘explain’ information” (video, slides).

No open source project can endure without regular injections of new blood. This year, the Lucene and Solr development team was joined by a number of new people whose names may look familiar to you:

These 7 men are now Lucene and Solr committers and we look forward to our next year’s Year in Review post, where we hope to go over the good things these people will have brought to Lucene and Solr in 2012.

You know an open source project is successful when a whole book is dedicated to it.  You know a project is very successful when more than one book and more than one publisher cover it.  There were no new editions of Lucene in Action (amazon, manning) this year, but our own Rafał Kuć published his Solr 3.1 Cookbook (amazon) in July.  Rafał’s cookbook includes a number of recipes that can make your life easier when it comes to solving common problems with Apache Solr. Another book, Apache Solr 3 Enterprise Search Server (amazon) by David Smiley and Eric Pugh was published in November. This is a major update to the first edition of the book and it covers a wide range of functionalities of Apache Solr.

@sematext