Solr vs. ElasticSearch: Part 2 – Data Handling

[Note: for those of you who don’t have the time or inclination to go through all the technical details, here’s a high-level, up-to-date (2015) Solr vs. Elasticsearch overview]

In the previous part of the Solr vs. ElasticSearch series we talked about the general architecture of these two great search engines based on Apache Lucene. Today, we will look at their ability to handle your data and perform indexing and language analysis.

Data Indexing

Apart from using the Java API exposed by both ElasticSearch and Apache Solr, you can index data using an HTTP call. To index data in ElasticSearch you need to prepare your data in JSON format. Solr also allows that, but in addition it lets you use other formats like the default XML or CSV. Importantly, different formats have different performance characteristics, and the faster ones come with some limitations. For example, indexing documents in CSV format is considered to be the fastest, but you can’t use field value boosting while using that format. Of course, one will usually use some kind of library or the Java API to index data, as one doesn’t typically store data in a form that can be sent straight to the search engine (at least in most cases that’s true).
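To make the format differences concrete, here is a minimal Python sketch that serializes the same document into the three Solr input formats mentioned above. The document fields and the helper names (`to_solr_json`, `to_solr_xml`, `to_solr_csv`) are made up for illustration; the sketch only builds request bodies and does not send anything to a server.

```python
import csv
import io
import json
from xml.etree import ElementTree as ET

# Hypothetical sample document; field names are assumptions for illustration.
doc = {"id": "1", "name": "Sample document", "price": "100"}


def to_solr_json(docs):
    """JSON body: a list of flat documents."""
    return json.dumps(docs)


def to_solr_xml(docs):
    """XML body: <add><doc><field name="...">value</field></doc></add>."""
    add = ET.Element("add")
    for d in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in d.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def to_solr_csv(docs):
    """CSV body: a header row with field names, then one row per document."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(docs[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(docs)
    return buf.getvalue()


print(to_solr_json([doc]))
print(to_solr_xml([doc]))
print(to_solr_csv([doc]))
```

The CSV body carries only a header and values, which is one way to see why per-field options such as boosts have no place to go in that format.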

More About ElasticSearch

It is worth noting that ElasticSearch supports two additional things that Solr does not – nested documents and multiple document types inside a single index.

The nested documents functionality lets you create more than a flat document structure. For example, imagine you index documents that are bound to some group of users. In addition to document contents, you would like to store which users can access that document. And this is where we run into a little problem – this data changes over time. If you were to store document content and users inside a single index document, you would have to reindex the whole document every time the list of users who can access it changes in any way. Luckily, with ElasticSearch you don’t have to do that – you can use nested document types and then use appropriate queries for matching. In this example, a nested document would hold a list of users with document access rights. Internally, nested documents are indexed as separate index documents stored inside the same index. ElasticSearch ensures they are indexed in a way that allows it to use fast join operations to get them. In addition to that, these documents are not shown when using standard queries, and you have to use a nested query to get them, which is a very handy feature.
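As a rough sketch of what the access-rights example could look like, the Python below builds a mapping that declares a nested field and a nested query against it. The type name `doc`, the field `users`, its `name` property, and the helper functions are assumptions for illustration, and the exact mapping syntax may differ between ElasticSearch versions.

```python
import json


def nested_mapping(type_name, nested_field, properties):
    """Mapping body that marks one field as a nested object."""
    return {
        type_name: {
            "properties": {
                nested_field: {"type": "nested", "properties": properties}
            }
        }
    }


def nested_query(path, inner_query):
    """Search body that runs inner_query against the nested documents."""
    return {"query": {"nested": {"path": path, "query": inner_query}}}


# Hypothetical names: type "doc", nested field "users" with a "name" property.
mapping = nested_mapping("doc", "users", {"name": {"type": "string"}})
query = nested_query("users", {"term": {"users.name": "alice"}})

print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```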

Multiple types of documents per index allow just what the name says – you can index different types of documents inside the same index. This is not possible with Solr, as you have only one schema in Solr per core. In ElasticSearch you can filter, query, or facet on document types. You can make queries against all document types or just choose a single document type (both with Java API and REST).
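With the REST API, the choice between all types and specific types is expressed in the request path. The small Python helper below sketches that; the index name `sematext` and the type `doc` are borrowed from the update example later in this post, while the second type `comment` and the function name are hypothetical.

```python
def search_path(index, types=None):
    """Build the _search path for an index, optionally limited to types."""
    if not types:
        # No type given: the query runs against all document types.
        return "/{}/_search".format(index)
    # One or more types, comma separated.
    return "/{}/{}/_search".format(index, ",".join(types))


print(search_path("sematext"))
print(search_path("sematext", ["doc"]))
print(search_path("sematext", ["doc", "comment"]))
```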

Index Manipulation

Let’s look at the ability to manage your indices/collections using the HTTP API of both Apache Solr and ElasticSearch.

Solr

Solr lets you control all cores that live inside your cluster with the CoreAdmin API – you can create cores, rename, reload, or even merge them into another core. In addition to the CoreAdmin API, Solr enables you to use the collections API to create, delete or reload a collection. The collections API uses the CoreAdmin API under the hood, but it’s a simpler way to control your collections. Remember that you need to have your configuration pushed into the ZooKeeper ensemble in order to create a collection with a new configuration.
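Both APIs are driven by plain HTTP requests with an `action` parameter. The Python sketch below only assembles such URLs; the host, the core name `core0`, the collection name `products`, and the helper names are assumptions, and the available parameters should be checked against the Solr version in use.

```python
from urllib.parse import urlencode

# Assumed local Solr instance; adjust to your own setup.
SOLR = "http://localhost:8983/solr"


def core_admin_url(action, **params):
    """URL for a CoreAdmin API call, e.g. RELOAD or RENAME."""
    query = urlencode({"action": action, **params})
    return "{}/admin/cores?{}".format(SOLR, query)


def collections_url(action, **params):
    """URL for a collections API call, e.g. CREATE, DELETE or RELOAD."""
    query = urlencode({"action": action, **params})
    return "{}/admin/collections?{}".format(SOLR, query)


print(core_admin_url("RELOAD", core="core0"))
print(collections_url("CREATE", name="products",
                      numShards=2, replicationFactor=1))
```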

When it comes to Solr, there is additional functionality that is in the early stages of work, although it’s functional – the ability to split your shards. After applying the patch available in SOLR-3755 you can use a SPLIT action to split your index and write it to two separate cores. If you look at the mentioned JIRA issue, you’ll see that once this is committed Solr will have the ability not only to create new replicas, but also to dynamically re-shard the indices. This is huge!

ElasticSearch

One of the great things in ElasticSearch is the ability to control your indices using the HTTP API. We will talk about it extensively in the last part of the series, but I have to mention it here, too. In ElasticSearch you can create indices on the live cluster and delete them. During creation you can specify the number of shards an index should have, and you can decrease and increase the number of replicas with nothing more than a single API call. You cannot change the number of shards yet. Of course, you can also define mappings and analyzers during index creation, so you have all the control you need to index a new type of data into your cluster.
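A minimal sketch of the two calls described above, written as Python helpers that return the HTTP method, path and JSON body. The index name `sematext` is taken from the update example later in this post; the helper names and the shard and replica counts are made up for illustration.

```python
import json


def create_index_request(index, shards, replicas):
    """Request that creates an index with a fixed number of shards."""
    body = {"settings": {"number_of_shards": shards,
                         "number_of_replicas": replicas}}
    return ("PUT", "/" + index, json.dumps(body))


def update_replicas_request(index, replicas):
    """Request that changes only the replica count of a live index."""
    body = {"index": {"number_of_replicas": replicas}}
    return ("PUT", "/{}/_settings".format(index), json.dumps(body))


print(create_index_request("sematext", shards=2, replicas=1))
print(update_replicas_request("sematext", replicas=2))
```

Note that only the replica count appears in the second request; the shard count is fixed at creation time.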

Partial Document Updates

Both search engines support partial document updates. This is not the true partial document update that everyone has been after for years – this is really just normal document reindexing, but performed on the search engine side, so it feels like a real update.
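The idea of "reindexing on the server side" can be illustrated with a toy in-memory index in Python. This is not how either engine is implemented; `ToyIndex` and its methods are invented here purely to show that a partial update reads the stored document, merges the change, and indexes the whole document again.

```python
class ToyIndex:
    """Toy stand-in for a search index that keeps stored documents."""

    def __init__(self):
        self._stored = {}
        self.reindex_count = 0  # how many full documents were indexed

    def index(self, doc_id, doc):
        """Index (or overwrite) a complete document."""
        self._stored[doc_id] = dict(doc)
        self.reindex_count += 1

    def partial_update(self, doc_id, changes):
        """Rebuild the full document from stored data, then reindex it."""
        full = dict(self._stored[doc_id])  # needs the stored original
        full.update(changes)
        self.index(doc_id, full)

    def get(self, doc_id):
        return dict(self._stored[doc_id])


idx = ToyIndex()
idx.index("1", {"id": "1", "name": "Sample", "price": 50})
idx.partial_update("1", {"price": 100})
print(idx.get("1"), idx.reindex_count)
```

The sketch also shows why the original content has to be kept on the server: without it there is nothing to rebuild the document from.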

Solr

Let’s start with the requirements – because this functionality reconstructs the document on the server side, you need to have your fields set as stored and you need to have the _version_ field available in your index structure. Then you can update a document with a simple API call, for example:

curl 'localhost:8983/solr/update' -H 'Content-type:application/json' -d '[{"id":"1","price":{"set":100}}]'
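If you build such requests from code, the body is just a list of documents in which each updated field maps a modifier to a value. The Python helper below is a small sketch that reproduces the body from the curl call above; the function name `atomic_update` is made up for illustration.

```python
import json


def atomic_update(doc_id, **ops):
    """Build an update body; each keyword is field=(modifier, value)."""
    doc = {"id": doc_id}
    for field, (modifier, value) in ops.items():
        doc[field] = {modifier: value}
    return json.dumps([doc])


# Same update as the curl example: set price to 100 on document 1.
payload = atomic_update("1", price=("set", 100))
print(payload)
```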

ElasticSearch

In the case of ElasticSearch you need to have the _source field enabled for the partial update functionality to work. The _source field is a special ElasticSearch field that stores the original JSON document. This functionality doesn’t have add/set/delete commands, but instead lets you use a script to modify a document. For example, the following command updates the same document that we updated with the above Solr request:

curl -XPOST 'localhost:9200/sematext/doc/1/_update' -d '{
    "script" : "ctx._source.price = price",
    "params" : {
        "price" : 100
    }
}'

Multilingual Data Handling

As we mentioned previously, and as you probably know, both ElasticSearch and Solr use Apache Lucene to index and search data. But, of course, each search engine has its own Java implementation that interacts with Lucene. This is also the case when it comes to language handling. Apache Solr 4.0 beta has the advantage over ElasticSearch because it can handle more languages out of the box. For example, my native language, Polish, is supported by Solr out of the box (with two different filters for stemming), but not by ElasticSearch. On the other hand, there are many plugins for ElasticSearch that enable support for languages not supported by default, though still not as many as we can find supported in Solr out of the box. It’s also worth mentioning that there are commercial analyzers that plug into Solr (and Lucene), but none that we are aware of work with ElasticSearch… yet.

Supported Languages

For the full list of languages supported by those two search engines please refer to the following pages:

Analysis Chain Definition

Of course, both Apache Solr and ElasticSearch allow you to define a custom analysis chain by specifying your own analyzer/tokenizer and list of filters that should be used to process your data. However, the difference between ElasticSearch and Solr is not only in the list of supported languages. ElasticSearch allows one to specify the analyzer per document and per query. So, if you need to use a different analyzer for each document in the index you can do that in ElasticSearch. The same applies to queries – each query can use a different analyzer.
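As a hedged sketch of what this might look like in request bodies, the Python below builds a mapping that tells ElasticSearch to read the analyzer name from a document field, and a query that names its own analyzer. The `_analyzer` mapping with a `path` setting reflects our understanding of ElasticSearch releases from this period and should be treated as an assumption; the type name `doc`, the field `language_analyzer`, the analyzer name and the helper functions are hypothetical.

```python
import json


def per_document_analyzer_mapping(type_name, path):
    """Mapping that picks the analyzer from a field of each document.

    Assumption: the _analyzer/path mapping is available in the
    ElasticSearch version in use.
    """
    return {type_name: {"_analyzer": {"path": path}}}


def query_with_analyzer(text, analyzer):
    """Query body that names the analyzer to apply to the query text."""
    return {"query": {"query_string": {"query": text,
                                       "analyzer": analyzer}}}


mapping = per_document_analyzer_mapping("doc", "language_analyzer")
query = query_with_analyzer("wyszukiwanie", "polish")

print(json.dumps(mapping))
print(json.dumps(query))
```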

Results Grouping

One of the most requested features for Apache Solr was result grouping. It was highly anticipated for Solr and it is still anticipated for ElasticSearch, which doesn’t yet have field grouping as of this writing. You can see the number of +1 votes in the following issue: https://github.com/elasticsearch/elasticsearch/issues/256. You can expect grouping to be supported in ElasticSearch after changes introduced in 0.20. If you are not familiar with results grouping – it allows you to group results based on the value of a field, value of a query, or a function and return matching documents as groups. You can imagine grouping results of restaurants on the value of the city field and returning only five restaurants for each city. A feature like this may be handy in some situations. Currently, for the search engines we are talking about, only Apache Solr supports results grouping out of the box.
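The restaurant example can be sketched in a few lines of Python. The first function is a toy, client-side illustration of what grouping by a field value with a per-group limit means; it is not Solr's implementation. The second part builds the Solr request parameters for field grouping. The sample documents and the query term are made up.

```python
from collections import OrderedDict
from urllib.parse import urlencode


def group_results(docs, field, limit):
    """Group ranked docs by a field value, keeping `limit` docs per group."""
    groups = OrderedDict()
    for doc in docs:  # docs are assumed to be in ranking order already
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups


# Hypothetical search results, already sorted by score.
docs = [
    {"name": "A", "city": "Boston"},
    {"name": "B", "city": "Boston"},
    {"name": "C", "city": "Austin"},
]
grouped = group_results(docs, "city", limit=1)

# Solr request parameters for grouping on the city field, five docs per group.
params = urlencode({"q": "restaurant", "group": "true",
                    "group.field": "city", "group.limit": 5})
print(grouped)
print(params)
```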

Prospective Search

One thing Apache Solr completely lacks compared to ElasticSearch is the functionality called Percolator in ElasticSearch. Imagine a search engine that, instead of storing documents in the index, stores queries and lets you check which stored/indexed queries match each new document being indexed. Sounds handy, right? For example, this is useful when people want to watch out for any new documents (think Social Media, News, etc.) matching their topics of interest, as described through queries. This functionality is also called Prospective Search; some call it Pub-Sub or Stored Searches. At Sematext we’ve implemented this a few times for our clients using Solr, but ElasticSearch has this functionality built in. If you want to know more about ElasticSearch Percolator see http://www.elasticsearch.org/blog/2011/02/08/percolator.html.
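The inversion of roles between documents and queries is easy to show with a toy example. The Python class below is not the ElasticSearch Percolator and does not use its API; it is an invented, in-memory sketch in which each stored query is simply a set of terms that must all appear in an incoming document.

```python
class ToyPercolator:
    """Toy prospective search: store queries, match incoming documents."""

    def __init__(self):
        self._queries = {}

    def register(self, name, required_terms):
        """Store a query as a set of terms that must all be present."""
        self._queries[name] = {t.lower() for t in required_terms}

    def percolate(self, text):
        """Return the names of stored queries that match the new document."""
        tokens = set(text.lower().split())
        return sorted(name for name, terms in self._queries.items()
                      if terms <= tokens)


p = ToyPercolator()
p.register("solr-news", ["solr"])
p.register("es-release", ["elasticsearch", "release"])

print(p.percolate("New ElasticSearch release announced"))
print(p.percolate("Solr and ElasticSearch release notes"))
```

Each new document is checked against the registered queries, so a subscriber can be notified as soon as matching content arrives.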

What’s Next?

In the next part of the series we will focus on comparing the ability to query your indices and leverage the full text search capabilities of Apache Solr and ElasticSearch. We will also look at the possibility to influence Lucene scoring algorithms during query time. Till next time 🙂

@kucrafal, @sematext

Author: Rafał Kuć

Sematext engineer, book author, trainer, speaker.

25 thoughts on “Solr vs. ElasticSearch: Part 2 – Data Handling”

lkafle says:

September 4, 2012 at 5:13 AM

Reblogged this on lava kafle kathmandu nepal.

Vadim Kisselmann says:

September 4, 2012 at 7:17 AM

Thanks for this post Rafał !
A question: can you tell us something more about “Prospective Search” and the possibilities to run it with Solr? Did you plan to open source your implementation on Sematext?
We use “Prospective Search” too, with EmbeddedSolrServer. But it’s not really stable, we observe unexplained memory consumption problems.

Best Regards
Vadim

1. sematext says:
  
  September 4, 2012 at 10:15 AM
  
  @Vadim, our Prospective Search implementations with Solr were all work for hire, so we can’t open-source any of them, unfortunately.
  
  1. vkisselmann says:
    
    September 6, 2012 at 3:32 AM
    
    No problem:) I used the Solr MailingList and hope to get an answer.
    
Jörg Prante says:

September 4, 2012 at 10:12 AM

Hi Rafał,

you write “On the other hand, there are many plugins for ElasticSearch that enable support for languages not supported by default, though still not as many as we can find supported in Solr out of the box. ”
It would have been nice to list all the analysis plugins available and how easy they can be installed (elasticsearch-analysis-stempel for polish is not in Elasticsearch core only because it is not in the Lucene core jar)
I would like to improve this situation by writing more Elasticsearch analysis plugins, but I did not do much research about what is lacking. Maybe you have a list or something what is in Solr and what is missing in Elasticsearch? This would be very helpful.

Best regards
Jörg

1. Rafał Kuć says:
  
  September 4, 2012 at 11:09 AM
  
  Jörg, I know about the different plugins for language analysis, but I focused (in most cases) on out of the box functionality, which is why I didn’t mention them. I plan to focus on the plugins and related topics in the final part of the series, along with the administration API review.
  
  As for the list, I don’t have anything like that now, but I’ll create it and will publish it in the last part of the posts series. Sounds OK ? 🙂
  
Mark says:

September 13, 2012 at 2:17 PM

Interesting stuff! 🙂
I’m currently trying to setup Cassandra as my datastore and I want to use Solr to provide search & facetting.
In the past I worked with MSSQL and Solr and it was easy to have Solr index the MSSQL DB. But I have seen no examples or tutorials on how to have Solr index Cassandra.

I read about Sollandra, but I’ve also heard it’s not maintained frequently and is missing a lot of features from Solr.
I also read about Datastax, but that is a commercial solution and I have like no budget.

Is there an open-source well maintained solution or other recommended approach?

1. sematext says:
  
  September 13, 2012 at 2:21 PM
  
  @Mark There is no Cassandra ==> Solr indexer a la MySQL ==> Solr indexer (or really Solr+DIH). Solandra seems dead indeed, so I think that leaves you with DataStax, which is not free, and it’s not really a Cassandra ==> Solr solution, but more like a (Cassandra+Solr) package. However, Cassandra ==> Solr indexer should not really be that hard to write if you are OK with the 2 data stores being out of sync most of the time (assuming the data in Cassandra changes often).
  
sematext says:

November 1, 2012 at 2:49 PM

A question on behalf of Mark Bennett (via dev@lucene.apache.org):

Let me ask you a question back. We really appreciate your ongoing series on Solr vs. ElasticSearch (I haven’t dove into ES yet). Looking at your section on indexing (http://blog.sematext.com/2012/09/04/solr-vs-elasticsearch-part-2-data-handling/), can ES be as precise and flexible about creating highly customized tokens? When I initially heard about schema-less and read their use cases, I had the impression that ES was more for mainstream use-cases, but your review got me thinking maybe there’s a lot more there?

1. Rafał Kuć says:
  
  November 2, 2012 at 3:45 AM
  
  You can take the schema-less approach and use ElasticSearch without going into the structure of the data you have, but that’s only one approach. Of course you can define a detailed index structure and choose the tokenizer and filters. You can also specify index and query time analyzers, so in this case you have all the flexibility you need. You can also turn off automatic field creation and stick to the index structure you’ve defined no matter what.
  
  To give an example, this is what an analyzer definition looks like in the elasticsearch.yml file (you can also submit an analyzer definition with a PUT mappings request):
  
  index:
    analysis:
      analyzer:
        en:
          type: english
          filter: [standard, asciifolding, stop, lowercase, englishFilter]
      filter:
        englishFilter:
          type: kstem
  
Renee says:

November 7, 2012 at 4:46 PM

does solr re-shard (shard split) need to re-index split documents into the new shard? will that be very time consuming assuming the shard needs to be split is a large one (which is usually the case and why we need split )
thanks

sematext says:

November 7, 2012 at 4:54 PM

Solr can’t split shards just yet. When it does gain this capability it should not require re-indexing of the content of those shards and should mostly be about moving split shards around the cluster and writing them to disk on their target node. But we’ll see when this actually gets implemented.

1. Renee says:
  
  November 7, 2012 at 5:31 PM
  
  thought i saw some thing mentioned as ‘copy-n-delete’ for this feature, assuming copy the documents indexed since certain time over to new shard and delete from the old shard… hope it will not involve with re-index or any performance killer
  
  1. Rafał Kuć says:
    
    November 7, 2012 at 5:42 PM
    
    Today at ApacheCon there was a short discussion about it – splitting shards will be possible in the future (work is in progress); however, it will still require a high amount of system resources. Imagine that to split a single shard you will have to create two new, smaller ones, which requires rewriting the data to disk and probably moving it around the cluster later. I assume that it will be possible, but not recommended, just like optimizing the index to force the segments to merge into one big segment.
    
    1. Renee says:
      
      November 8, 2012 at 1:01 AM
      
      thanks for the insight … usually the system should have a way to monitor the size of the shard and prevent it from growing into a big one …
      
      1. sematext says:
        
        November 8, 2012 at 11:58 AM
        
        @Renee – I think this will happen, just not in 2012 🙂 But I am pretty sure we’ll see it in 2013.
Renee says:

November 7, 2012 at 5:40 PM

I’m interested in the nested document types … seems that can be used to work out the issue in solr when changing a flag field the whole document need to be re-indexed. If I need to flip a flag on batch of documents based on a query, it sounds like I can do it in ES without re-index that batch of document, I could just set the flag as I could in database?
thanks!

1. Rafał Kuć says:
  
  November 7, 2012 at 6:05 PM
  
  How often do you need to change that flag? Maybe it would be good to use the document update feature. If not, I would rather look at parent – child, because you can easily update a single child. Nested documents are faster, but require you to reindex the whole document along with the nested ones, while parent – child documents are a bit slower, but a single child can be reindexed on its own.
  
  1. Renee says:
    
    November 8, 2012 at 1:13 AM
    
    unfortunately the document update will require all fields be stored which is impossible with our data, also underneath it is still a re-index.
    I am not familiar with ES yet, I guess parent-child is implemented with using separate index (schema) then the query results will be ‘joined’ … I doubt about the performance if we need paginating through the search results with a parent-child…
    
    1. Rafał Kuć says:
      
      November 8, 2012 at 1:58 AM
      
      It depends on your data, but parent – child calculations will take some resources and thus performance will be lower than with queries against single ‘standard’ documents.
      
      1. Renee says:
        
        November 15, 2012 at 5:11 PM
        
        we prototyped on using split schema for documents that need to flip flag all the time. So the fields that are changed from time to time are indexed into a different core with different schema, while the big static document is indexed into separate core. so when the flag fields are updated, only the ‘small’ core is re-indexed. At query time, we aggregate the results using the ID.
        
        This approach works well except when we need the results to display at UI, which needs pagination. In that case it requires a lot of resources and the performance sacrifice.
        
        that is why I wonder if ES’ parent-child implementation going to have same issue.
Renee says:

November 8, 2012 at 1:28 AM

” if you need to use a different analyzer for each document in the index you can do that in ElasticSearch. ” — I think Solr allows you to do this as well… by using language identifier and it allows you specify a different fieldtype to use the language specific analyzer.

1. Rafał Kuć says:
  
  November 8, 2012 at 1:59 AM
  
  But in Solr you specify it in the schema, and each document field of a certain type will be analyzed by the same analyzer no matter what the document content is. In ElasticSearch the analyzer can be chosen on the basis of a field value.
  
  1. Renee says:
    
    November 15, 2012 at 4:43 PM
    
    actually you can customize a update processor factory and plugin to processor chain. you can make the processor does language identifier for each indexing document field and based on the identified language direct them to use different field type (which uses language specific analyzer). This is what we have in our system and works well in production.
    
    1. Rafał Kuć says:
      
      November 15, 2012 at 5:54 PM
      
      Yeap, you can customize, but we are talking about out of the box functionalities and you don’t need anything special for that functionality to work in ElasticSearch.
      