Solr vs. ElasticSearch: Part 6 – User & Dev Communities

[Note: for those of you who don’t have the time or inclination to go through all the technical details, here’s a high-level, up-to-date (2015) Solr vs. Elasticsearch overview]

One of the questions after my talk at the recent ApacheCon EU was what I thought about the communities of the two search engines I was comparing. Not surprisingly, this is also a question we often address in our consulting engagements. As part of our Apache Solr vs. ElasticSearch post series, we decided to step away from the technical aspects of SolrCloud vs. ElasticSearch and look at the communities gathered around these two projects. If you haven’t read the previous posts about Apache Solr vs. ElasticSearch, here are pointers to all of them:

Solr vs ElasticSearch: Part 5 – Management API Capabilities

[Note: for those of you who don’t have the time or inclination to go through all the technical details, here’s a high-level, up-to-date (2015) Solr vs. Elasticsearch overview]

In previous posts, all listed below, we’ve discussed general architecture, full text search capabilities and facet aggregation possibilities. However, until now we have not discussed any of the administration and management options, or the things you can do on a live cluster without a restart. So let’s get into it and see what Apache Solr and ElasticSearch have to offer.

Solr vs ElasticSearch: Part 4 – Faceting

[Note: for those of you who don’t have the time or inclination to go through all the technical details, here’s a high-level, up-to-date (2015) Solr vs. Elasticsearch overview]

Solr 4 (aka SolrCloud) has just been released, so it’s the perfect time to continue our ElasticSearch vs. Solr series. In the last three parts of the series we gave a general overview of the two search engines, then looked at data handling and at their full text search capabilities. In this part we look at how these two engines handle faceting.

Solr vs ElasticSearch: Part 3 – Searching

[Note: for those of you who don’t have the time or inclination to go through all the technical details, here’s a high-level, up-to-date (2015) Solr vs. Elasticsearch overview]

In the last two parts of the series we looked at the general architecture and how data can be handled in both Apache Solr 4 (aka SolrCloud) and ElasticSearch and what the language handling capabilities of both enterprise search engines are like. In today’s post we will discuss one of the key parts of any search engine – the ability to match queries to documents and retrieve them.

Solr vs. ElasticSearch: Part 2 – Data Handling

[Note: for those of you who don’t have the time or inclination to go through all the technical details, here’s a high-level, up-to-date (2015) Solr vs. Elasticsearch overview]

In the previous part of the Solr vs. ElasticSearch series we talked about the general architecture of these two great search engines based on Apache Lucene. Today, we will look at their ability to handle your data and perform indexing and language analysis.

  1. Solr vs. ElasticSearch: Part 1 – Overview
  2. Solr vs. ElasticSearch: Part 2 – Indexing and Language Handling
  3. Solr vs. ElasticSearch: Part 3 – Searching
  4. Solr vs. ElasticSearch: Part 4 – Faceting
  5. Solr vs. ElasticSearch: Part 5 – Management API Capabilities
  6. Solr vs. ElasticSearch: Part 6 – User & Dev Communities Compared

Solr vs. ElasticSearch: Part 1 – Overview

[Note: for those of you who don’t have the time or inclination to go through all the technical details, here’s a high-level, up-to-date (2015) Solr vs. Elasticsearch overview]

Good Solr vs. ElasticSearch coverage is long overdue. At Sematext we make good use of our own Search Analytics and pay attention to what people search for. Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch, and this SolrCloud vs. ElasticSearch question is something we regularly address in our search consulting engagements.

As the Apache Lucene 4.0 release approaches, and with it the Solr 4.0 release, we thought it would be beneficial to take a deeper look and compare the two leading open source search engines built on top of Lucene: Apache Solr and ElasticSearch. Because the topic is very wide and can go deep, we are publishing our research as a series of blog posts starting with this post, which provides a general overview of the functionality provided by both search engines.

  1. Solr vs. ElasticSearch: Part 1 – Overview
  2. Solr vs. ElasticSearch: Part 2 – Indexing and Language Handling
  3. Solr vs. ElasticSearch: Part 3 – Searching
  4. Solr vs. ElasticSearch: Part 4 – Faceting
  5. Solr vs. ElasticSearch: Part 5 – Management API Capabilities
  6. Solr vs. ElasticSearch: Part 6 – User & Dev Communities Compared

ActionGenerator – Part Two

In the previous part of this two-part series about ActionGenerator we showed how to develop and run your own ActionGenerator. Today, we want to show you which action generators are available for you to use out of the box, and we want to share some insights about the future of this project.

ActionGenerator for ElasticSearch

The ag-player-es project contains all the code specific to action generators for ElasticSearch. You can find two sinks implemented: one for sending simple queries to ElasticSearch and the other for indexing data. They both use the ElasticSearch REST API, so no dependencies are needed for them to work. The other piece available in ag-player-es is a set of three players, configured and ready to use:

  • SimpleEsPlayerMain – ActionGenerator for running random queries to ElasticSearch.
  • DictionaryEsPlayerMain – ActionGenerator for running queries to ElasticSearch. Queries are generated using a provided dictionary.
  • DictionaryDataEsPlayerMain – ActionGenerator for indexing data to ElasticSearch. Fields content is generated using provided dictionary.

ActionGenerator for Solr

As with ag-player-es, ag-player-solr contains all the code specific to action generators for Apache Solr. In this project you can find two sink implementations: one for sending queries to Apache Solr and one for indexing data. The first one uses the Solr HTTP API and the other one uses XML to index data to Solr. As with ag-player-es, no dependencies are needed, so you should be able to use the action generator for Solr with all recent Solr versions. Apart from that, there are three players configured and ready for use:

  • DictionaryDataSolrPlayerMain – ActionGenerator for indexing data to Apache Solr. Fields content is generated using provided dictionary.
  • DictionarySolrPlayerMain –  ActionGenerator for running queries to Apache Solr. Queries are generated using provided dictionary.
  • RandomQueriesSolrPlayerMain – ActionGenerator for running random queries to Apache Solr.

Using ActionGenerator for ElasticSearch

Let’s concentrate on the players available for ElasticSearch in ActionGenerator. As we wrote above, there are three main players for ElasticSearch that can be used out of the box.

SimpleEsPlayerMain

The simplest of the three generators available. It lets you send random queries to a given index, run against a given field. In order to use this player, you need to provide the following parameters:

  • ElasticSearch base URL
  • ElasticSearch index name
  • Name of the field queries should be run against
  • Number of events that should be generated

For example, you could run the following command to send 1000 queries against the text field of the documents index on your local ElasticSearch instance:
java -cp ag-player-es-0.1.0-withdeps.jar \
com.sematext.ag.es.SimpleEsPlayerMain http://localhost:9200/ documents text 1000

DictionaryEsPlayerMain

The second generator that lets you run queries uses a dictionary to generate the text of your queries. It is similar to SimpleEsPlayerMain, except for the dictionary usage. In order to use this player, you need to provide the same parameters as for SimpleEsPlayerMain plus one additional parameter:

  • Dictionary path

For example, you could run the following command to send 1000 queries against the text field of the documents index on your local ElasticSearch instance. Queries would be generated using the dict.txt dictionary (each line containing a different query string):
java -cp ag-player-es-0.1.0-withdeps.jar com.sematext.ag.es.DictionaryEsPlayerMain http://localhost:9200/ documents text 1000 dict.txt

DictionaryDataEsPlayerMain

The one and only player that enables you to index data to your ElasticSearch instance. Just as the player discussed above, DictionaryDataEsPlayerMain also works with the help of a dictionary. You need to provide the following parameters in order to use this player:

  • ElasticSearch base URL
  • ElasticSearch index name
  • ElasticSearch type name
  • Number of events that should be generated
  • Dictionary path
  • One or more fields and their types

So, for example, if you would like to index 100,000 documents to the documents index under the document type on your local ElasticSearch instance, you could run the following:
java -cp ag-player-es-0.1.0-withdeps.jar com.sematext.ag.es.DictionaryDataEsPlayerMain http://localhost:9200/ documents document 100000 dict.txt id:numeric title:text likes:numeric

Right now, the following field types are available:

  • numeric
  • text

We plan to add more types in the future. Pull requests with patches are welcome!
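
To make the name:type arguments more concrete, here is a minimal sketch of how such arguments could be parsed and validated. It is not taken from the ActionGenerator source: the class FieldSpecSketch and its parse method are hypothetical names used only for illustration, and the sketch relies on nothing more than the name:type format and the two field types listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of ActionGenerator: parses "name:type" arguments. */
final class FieldSpecSketch {

    /** Returns field name -> field type, keeping the order given on the command line. */
    static Map<String, String> parse(String... specs) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String spec : specs) {
            int colon = spec.indexOf(':');
            // Reject arguments without a colon, or with an empty name or type
            if (colon <= 0 || colon == spec.length() - 1) {
                throw new IllegalArgumentException("Expected name:type but got: " + spec);
            }
            String name = spec.substring(0, colon);
            String type = spec.substring(colon + 1);
            // Only the two field types listed in this post are accepted
            if (!type.equals("numeric") && !type.equals("text")) {
                throw new IllegalArgumentException("Unsupported field type: " + type);
            }
            fields.put(name, type);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Same field arguments as in the indexing command above
        System.out.println(parse("id:numeric", "title:text", "likes:numeric"));
        // prints {id=numeric, title=text, likes=numeric}
    }
}
```

Rejecting unknown types up front is a design choice of this sketch; the real player may handle them differently.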

Using ActionGenerator for Apache Solr

The players available for Apache Solr are similar to the ones available for ElasticSearch, but let’s quickly look at them, too.

RandomQueriesSolrPlayerMain

This player is similar to the SimpleEsPlayerMain – it also generates random queries, but to Apache Solr. In order to use it, you need to provide the following parameters:

  • Apache Solr core search handler URL
  • The field queries should be run against
  • Number of queries to be generated

For example, you could run the following command to send 1000 queries against the name field of the documents core of your local Apache Solr instance:
java -cp ag-player-solr-0.1.0-withdeps.jar com.sematext.ag.solr.RandomQueriesSolrPlayerMain http://localhost:8983/solr/documents name 1000

DictionarySolrPlayerMain

DictionarySolrPlayerMain is similar to RandomQueriesSolrPlayerMain, except that it uses a dictionary to generate queries. In order to use this player you need to provide one additional parameter compared to the ones you provided to RandomQueriesSolrPlayerMain:

  • Dictionary path

For example, you could run the following command to send 1000 queries against the name field of the documents core of your local Apache Solr instance. Queries would be generated using the dict.txt dictionary (each line containing a different query string):
java -cp ag-player-solr-0.1.0-withdeps.jar com.sematext.ag.solr.DictionarySolrPlayerMain http://localhost:8983/solr/documents name 1000 dict.txt

DictionaryDataSolrPlayerMain

The last player I’d like to tell you about is the one that lets you index data to your Apache Solr instance. DictionaryDataSolrPlayerMain is similar to its ElasticSearch counterpart. You need to provide the following parameters in order to use this player:

  • Apache Solr core update handler URL
  • Number of events that should be generated
  • Dictionary path
  • One or more fields and their types

For example, if you would like to index 100000 documents to the documents core of your local Apache Solr instance, you could run the following:
java -cp ag-player-solr-0.1.0-withdeps.jar com.sematext.ag.solr.DictionaryDataSolrPlayerMain http://localhost:8983/solr/documents/update/ 100000 dict.txt id:numeric title:text likes:numeric

Let’s omit the description of field types, as it was provided in the DictionaryDataEsPlayerMain description above.

Calculating Metrics

Currently, the AbstractHttpSink class has the ability to gather metrics about how the system to which you are sending events is behaving. You can choose between two methods of metrics output: to the standard output or to a file. To enable metrics tracking you need to pass the -DenableMetrics=true parameter when running ActionGenerator. This parameter enables metrics gathering and writes the metrics to standard output. To write the metrics to a file instead, you need to pass the -DmetricsType=file parameter. In addition, you need to specify which directory the metrics should be written to, by passing the -DmetricsDir=/path/to/output/dir/ parameter. Please remember that the directory needs to be created before running your action generator.
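
Because these are -D flags, they are JVM system properties, so on the command line they have to appear before the player class name (for example right after java), otherwise they are passed to the player as ordinary arguments. The sketch below shows one way the three properties could be read and combined. It is only an illustration under that assumption: MetricsFlagsSketch and describe are hypothetical names, and the actual logic inside AbstractHttpSink may differ.

```java
import java.io.File;

/** Hypothetical illustration of reading the metrics flags; not the AbstractHttpSink code. */
final class MetricsFlagsSketch {

    /** Describes where metrics would go for the given property values (any may be null). */
    static String describe(String enabled, String type, String dir) {
        // Boolean.parseBoolean(null) is false, so a missing flag means "disabled"
        if (!Boolean.parseBoolean(enabled)) {
            return "metrics disabled";
        }
        // Standard output is the default unless -DmetricsType=file is given
        if (!"file".equals(type)) {
            return "metrics to standard output";
        }
        // The output directory has to exist before the run starts
        if (dir == null || !new File(dir).isDirectory()) {
            return "metrics directory missing: " + dir;
        }
        return "metrics to files under " + dir;
    }

    public static void main(String[] args) {
        // Values set with -DenableMetrics=..., -DmetricsType=..., -DmetricsDir=...
        System.out.println(describe(
                System.getProperty("enableMetrics"),
                System.getProperty("metricsType"),
                System.getProperty("metricsDir")));
    }
}
```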

Maven Artifacts

The libraries for all the projects that make up ActionGenerator can be found in the Sonatype Maven repository (http://oss.sonatype.org/content/repositories/releases/) under the following dependencies:

Action Generator

<dependency>
   <groupId>com.sematext.ag</groupId>
   <artifactId>ag-player</artifactId>
   <version>0.1.0</version>
</dependency>

Action Generator for ElasticSearch

<dependency>
   <groupId>com.sematext.ag</groupId>
   <artifactId>ag-player-es</artifactId>
   <version>0.1.0</version>
</dependency>

Action Generator for Solr

<dependency>
   <groupId>com.sematext.ag</groupId>
   <artifactId>ag-player-solr</artifactId>
   <version>0.1.0</version>
</dependency>

Plans for the Future

We plan to release action generators for SenseiDB, as well as expand the number of sinks and players ready to be used out of the box. Of course, as always, patches are welcome and if you find any problems with ActionGenerator or if you identify missing features, please open an issue.

ActionGenerator, Part One

In this post we’ll introduce you to ActionGenerator, one of several open source projects we are working on. ActionGenerator lets you generate actions (you can also think of actions as events) from an action source and play those actions with ActionGenerator’s action player to one of the sinks. The rest is done by ActionGenerator. ActionGenerator comes with several action sources and sinks, but one can easily implement custom action sources and sinks and play them with ActionGenerator. Let’s dig into the details.

ElasticSearch Shard Placement Control

In this post you will learn how to control ElasticSearch shard placement.  If you are coming to our ElasticSearch talk at BerlinBuzzwords, we’ll be talking about this and more.

If you’ve ever used ElasticSearch you probably know that you can set it up to have multiple shards and replicas of each index it serves. This can be very handy in many situations. With the ability to have multiple shards of a single index we can deal with indices that are too large for efficient serving by a single machine. With the ability to have multiple replicas of each shard we can handle higher query load by spreading replicas over multiple servers. Of course, replicas are useful for more than just spreading the query load, but that’s another topic. In order to shard and replicate, ElasticSearch has to figure out where in the cluster it should place shards and replicas, that is, which server/node each shard or replica should be placed on.

ElasticSearch Cache Usage

We’ve been doing a ton of work with ElasticSearch. Not long ago, we had a few situations where ElasticSearch would “eat” all the JVM heap memory we gave it. It was so hungry, we could not feed it enough memory to keep it happy. It was insatiable. After some troubleshooting and looking at SPM for ElasticSearch (btw. we released a new version of the SPM agent earlier this week, so if you don’t have it, go grab agent v1.5.0) we figured out the cause: the default ElasticSearch field cache setting was not quite right for our deployment. In this post we’ll share our experience on this topic, explain why this was happening and how to minimize the negative effect of large field caches.

ElasticSearch Cache Types

There are two types of caches in ElasticSearch whose behavior you can control. The first is the filter cache, which is responsible for caching the results of filters used in your queries. This is very handy, because after a filter is run once, ElasticSearch will subsequently use the values stored in the filter cache, saving precious disk I/O operations and thus speeding up query execution. There are two main implementations of the filter cache in ElasticSearch:

  1. node filter cache (default)
  2. index filter cache

The node filter cache is an LRU cache, which means that the least recently used items will be evicted when the filter cache is full. Its size can be limited either to a percentage of the total memory allocated to the Java process or to an exact amount of memory. The second type of filter cache is the index filter cache. It is not recommended for use because you can’t predict (in most cases) how much memory it will use, since that depends on which shards are located on which node. In addition, you can’t control the amount of memory used by the index filter cache; you can only set its expiration time and the maximum number of entries in that cache.
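
If LRU eviction is new to you, the following minimal sketch shows the idea using the JDK’s LinkedHashMap. It is not ElasticSearch’s filter cache implementation: the class name LruCacheSketch and the filter names are made up for illustration, and the sketch limits the number of entries, whereas the node filter cache described above is limited by memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative LRU cache; not ElasticSearch's actual filter cache implementation. */
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every insertion; returning true evicts the least recently used entry
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("filterA", "bitsA");
        cache.put("filterB", "bitsB");
        cache.get("filterA");          // touch filterA, so filterB becomes the least recently used
        cache.put("filterC", "bitsC"); // cache is full, filterB is evicted
        System.out.println(cache.keySet()); // prints [filterA, filterC]
    }
}
```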

The second type of cache in ElasticSearch is the field data cache. This cache is used for sorting and faceting in most cases. It loads all values from the field you sort or facet on and then provides calculations on the basis of the loaded values. You can imagine that the cost of building such a cache for a large amount of data might be very high. And it is. Apart from the type (which can be either resident or soft) you can control two additional parameters of the field data cache: the maximum number of entries in it and its expiration time.

The Defaults

The default implementation for the filter cache is the node filter cache, with its size set to a maximum of 20% of the memory allocated to the Java process. As you can imagine there is nothing to worry about: if the cache fills up, the least recently used cache entries will get evicted. You can then consider adding more RAM to make the node filter cache bigger, or you can live with evictions. That’s perfectly acceptable.

On the other hand we have the default settings for ElasticSearch field data cache – it is a resident cache with unlimited size. Yes, unlimited. The cost of rebuilding this cache is very high and thus you must know how much memory it can use – you must control your queries and watch what you sort on and on which fields you do the faceting.

What Happens When You Don’t Control Your Cache Size?

This is what can happen when you don’t control your field data cache size:

As you can see in the chart above, the field data cache jumped to more than 58 GB, which is enormous. Yes, we got an OutOfMemory exception during that time.

What Can You Do?

There are actually three things you can do to make your field data cache use less memory:

Control its Size and Expiration Time

When using the default, resident field data cache type, you can set its size and expiration time. However, please remember that there are situations when you need the field data cache to hold values for the particular field you are sorting or faceting on. In order to change the field data cache size, you specify the following property:

index.cache.field.max_size

It specifies the maximum number of entries in that cache per Lucene segment. It doesn’t limit the amount of memory the field data cache can use, so you have to do some testing to ensure the queries you are using won’t result in an OutOfMemory exception.

The other property you can set is the expiration time. It defaults to -1, which means that cache entries do not expire. In order to change that, you must set the following property:

index.cache.field.expire

So if, for example, you would like to have a maximum of 50k field data cache entries per segment and you would like those entries to expire after 10 minutes, you would set the following property values in the ElasticSearch configuration file:

index.cache.field.max_size: 50000
index.cache.field.expire: 10m

Change its Type

The other thing you can do is change the field data cache type from the default resident to soft. Why does that matter? ElasticSearch uses Google Guava libraries to implement its cache. The soft type wraps cache values in soft references, which means that whenever memory is needed the garbage collector may clear those references, even if the cached values would still be useful to later queries. This means that when you start hitting the heap memory limit, the JVM will release those soft references with the use of the garbage collector instead of immediately throwing an OutOfMemory exception. More about soft references can be found at:

http://docs.oracle.com/javase/6/docs/api/java/lang/ref/SoftReference.html

So in order to change the default field data cache type to soft you should add the following property to ElasticSearch configuration file:

index.cache.field.type: soft
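
To see what wrapping cache values in soft references means in practice, here is a minimal sketch built directly on java.lang.ref.SoftReference. It is an illustration only, not the Guava-based cache ElasticSearch uses; SoftValueCacheSketch and the field name are hypothetical. The point to notice is that a reader of such a cache always has to be prepared for a value to be gone.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Illustrative soft-valued cache; ElasticSearch itself relies on Guava for this. */
final class SoftValueCacheSketch {
    private final Map<String, SoftReference<long[]>> entries = new HashMap<>();

    void put(String field, long[] values) {
        // The map holds the values only softly, so the GC may reclaim them under memory pressure
        entries.put(field, new SoftReference<>(values));
    }

    /** Returns the cached values, or null if they were never cached or were cleared by the GC. */
    long[] get(String field) {
        SoftReference<long[]> ref = entries.get(field);
        if (ref == null) {
            return null;
        }
        long[] values = ref.get();
        if (values == null) {
            entries.remove(field); // drop the stale wrapper left behind by the GC
        }
        return values;
    }

    public static void main(String[] args) {
        SoftValueCacheSketch cache = new SoftValueCacheSketch();
        long[] timestamps = {1L, 2L, 3L};
        cache.put("timestamp", timestamps);
        long[] cached = cache.get("timestamp");
        System.out.println(cached == null
                ? "cleared by GC, values must be reloaded"
                : "cache hit, " + cached.length + " values");
    }
}
```

The trade-off is that a cleared entry has to be rebuilt on the next query that needs it, which, as noted earlier, is expensive for field data.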

Change Your Data

The last thing you can do requires much more effort than only changing the ElasticSearch configuration: you may want to change your data. Look at your index structure, look at your queries and think. Maybe you can lowercase some string data and this way reduce the number of unique values in the field? Maybe you don’t need your dates to be precise down to a second; maybe a minute or even an hour would do? Of course, when doing some faceting operations you can set the granularity, but the data will still be loaded into memory. So if there are parts of your data that can be changed in a way that will result in lower memory consumption, you should consider it. As a matter of fact, that is exactly what we did!
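
The effect of both ideas on the number of unique values can be sketched in a few lines. This is an illustration with made-up sample data, not the code we ran on our own indices; CardinalitySketch, normalize and toMinute are hypothetical names, and in a real deployment the same normalization would happen before or during indexing.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: shows how normalizing values lowers the number of unique terms. */
final class CardinalitySketch {

    /** Lowercases a string value so that case variants collapse into one term. */
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    /** Drops seconds and smaller units so that timestamps within a minute become equal. */
    static Instant toMinute(Instant timestamp) {
        return timestamp.truncatedTo(ChronoUnit.MINUTES);
    }

    public static void main(String[] args) {
        List<String> tags = List.of("Java", "java", "JAVA", "Solr");
        Set<String> rawTags = new HashSet<>(tags);
        Set<String> normalizedTags = new HashSet<>();
        for (String tag : tags) {
            normalizedTags.add(normalize(tag));
        }

        List<Instant> times = List.of(
                Instant.parse("2012-06-01T10:15:01Z"),
                Instant.parse("2012-06-01T10:15:30Z"),
                Instant.parse("2012-06-01T10:16:02Z"));
        Set<Instant> minutes = new HashSet<>();
        for (Instant time : times) {
            minutes.add(toMinute(time));
        }

        System.out.println("unique tags: " + rawTags.size() + " -> " + normalizedTags.size());
        // prints unique tags: 4 -> 2
        System.out.println("unique timestamps: " + times.size() + " -> " + minutes.size());
        // prints unique timestamps: 3 -> 2
    }
}
```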

Caches After Some Changes

After we made some changes in our ElasticSearch configuration/deployment, this is what the field data cache usage looked like:

As you can see, the cache dropped from 58 GB down to 37 GB.  Impressive drop!  After these changes we stopped running into OutOfMemory exception problems.

Summary

You have to remember that the default settings for field data cache in ElasticSearch may not be appropriate for you. Of course, that may not be the case in your deployment. You may not need sorting, apart from the default based on Lucene scoring and you may not need faceting on fields with many unique terms. If that’s the case for you, don’t worry about field data cache. If you have enough memory for holding the fields data for your facets and sorting then you also don’t need to change anything regarding the cache setup from the default ElasticSearch configuration. What you need to remember is to monitor your JVM heap memory usage and cache statistics, so you know what is happening in your cluster and react before things get worse.

One More Thing

The charts you see in the post are taken from SPM (Scalable Performance Monitoring) for ElasticSearch.  SPM is currently free and, as you can see, we use it extensively in our client engagements.  If you give it a try, please let us know what you think and what else you would like to see in it.

@sematext (Like working with ElasticSearch?  We’re hiring!)