Book: Elasticsearch in Action

Lee Hinman and I have been working hard on a book called Elasticsearch in Action. The book has now entered the Manning Early Access Program (MEAP) which allows you to subscribe to chapters as they become available. You can also use a forum to ask questions or give us feedback before the book goes into print.

Please tweet about Book: Elasticsearch in Action

This is an intermediate level book designed for anyone who’s writing applications that use Elasticsearch or who is responsible for managing Elasticsearch in a production environment. It will offer great coverage of all the key features exposed through the HTTP API, and all code samples will be in bash using curl. This way, it will be easy for you to translate them in any language and to get help from the Elasticsearch community.

Elasticsearch in Action

If you’re interested in the MEAP of Elasticsearch in Action, Manning is sharing a 50% off promotion code active until July 30: mlelastico

Elasticsearch Refresh Interval vs Indexing Performance

Elasticsearch is near-realtime, in the sense that when you index a document, you need to wait for the next refresh for that document to appear in a search. Refreshing is an expensive operation and that is why by default it’s made at a regular interval, instead of after each indexing operation. This interval is defined by the index.refresh_interval setting, which can go either in Elasticsearch’s configuration, or in each index’s settings. If you use both, index settings override the configuration. The default is 1s, so newly indexed documents will appear in searches after 1 second at most.

Because refreshing is expensive, one way to improve indexing throughput is by increasing refresh_interval. Less refreshing means less load, and more resources can go to the indexing threads. How does all this translate into performance? Below is what our benchmarks revealed when we looked at it through the SPM lens.

Please tweet about Elasticsearch refresh interval vs indexing performance

Test conditions

For this benchmark, we indexed apache logs in bulks of 3000 each, on 2 threads. Those logs would go in one index, with 3 shards and 1 replica, hosted by 3 m1.small Amazon EC2 instances. The Elasticsearch version used was 0.90.0.

As indexing throughput is a priority for this test, we also made some configuration changes in this direction:

  • index.store.type: mmapfs. Because memory-mapped files make better use of OS caches
  • indices.memory.index_buffer_size: 30%. Increased from the default 10%. Usually, the more buffers, the better, but we don’t want to overdo it
  • index.translog.flush_threshold_ops: 50000. This makes commits from the translog to the actual Lucene indices less often than the default setting of 5000

Results

First, we’ve indexed documents with the default refresh_interval of 1s. Within 30 minutes, 3.6M new documents were indexed, at an average of 2K documents per second. Here’s how indexing throughput looks like in SPM for Elasticsearch:

refresh_interval: 1s

Then, refresh_interval was set to 5s. Within the same 30 minutes, 4.5M new documents were indexed at an average of 2.5K documents per second. Indexing thoughput was increased by 25%, compared to the initial setup:

refresh_interval: 5s

The last test was with refresh_interval set to 30s. This time 6.1M new documents were indexed, at an average of 3.4K documents per second. Indexing thoughput was increased by 70% compared to the default setting, and 25% from the previous scenario:

refresh_interval: 30s

Other metrics from SPM, such as load, CPU, disk I/O usage, memory, JVM or Garbage Collection didn’t change significantly between runs. So configuring the refresh_interval really is a trade-off between indexing performance and how “fresh” your searches are. What settings works best for you? It depends on the requirements of your use-case. Here’s a summary of the indexing throughput values we got:

  • refresh_interval: 1s   – 2.0K docs/s
  • refresh_interval: 5s   – 2.5K docs/s
  • refresh_interval: 30s3.4K docs/s

Recipe: rsyslog + Elasticsearch + Kibana

In this post you’ll see how you can take your logs with rsyslog and ship them directly to Elasticsearch (running on your own servers, or the one behind Logsene Elasticsearch API)  in such a way that you can use Kibana to search, analyze and make pretty graphs out of them.

Please tweet about Recipe: rsyslog + Elasticsearch + Kibana.

This is especially useful when you have a lot of servers logging [a lot of data] to their syslog daemons and you want a way to search them quickly or do statistics on the logs. You can use rsyslog’s Elasticsearch output to get your logs into Elasticsearch, and Kibana to visualize them. The only challenge is to get your rsyslog configuration right, so your logs end up where Kibana is expecting them. In this post, we’ll get you through the basics; if you want a more complete tutorial (including how to parse unstructured data and use rsyslog as a high-performance buffer), we suggest to have a look at our updated recipe here.

Note: that if this sort of stuff excites you, we are both hiring (from devops and core product engineering to marketing and sales) and working on Logsene – a log and data analytics product/service to complement SPM.

Getting all the ingredients

Here’s what you’ll need:

  • a recent version of rsyslog (v7+, if you ask me. The Elasticsearch output is available since 6.4.0). You can download and compile it yourself, or you can get it from the RHEL/CentOS or Ubuntu repositories provided by the maintainers
  • the Elasticsearch output plugin for rsyslog. If you compile rsyslog from sources, you’ll need to add the –enable-elasticsearch parameter to the configure script. If you use the repositories, just install the rsyslog-elasticsearch package
  • Elasticsearch :). You have a DEB and a RPM there, which should get you started in no time. If you choose the tar.gz archive, you might find the installation instructions useful
  • Kibana 3 and a web server to serve it. There are installation instructions on the GitHub page. To get started quickly, you can just clone the repository somewhere, then go into the “kibana” directory:

git clone https://github.com/elasticsearch/kibana.git

cd kibana

Then, you’ll probably need to edit config.js to change the Elasticsearch host name from “localhost” to the actual FQDN of the host that’s running Elasticsearch. This applies even if Kibana is on the same machine as Elasticsearch. “localhost” only works if your browser is on the same machine as Elasticsearch, because Kibana talks to Elasticsearch directly from your browser.

Finally, you can serve the Kibana page with any HTTP server you prefer. If you want to get started quickly, you can try SimpleHTTPServer, which should be embedded to any recent Python, by running this command from the “kibana” directory:

python -m SimpleHTTPServer

Putting them all together

Kibana is, by default, expecting Logstash to send logs to Elasticsearch. So “putting them all together” here means “configuring rsyslog to send logs to Elasticsearch in the same manner Logstash does”. And Logstash, by default, has some particular ways when it comes to naming the indices and formatting the logs:

  • indices should be formatted like logstash-YYYY.MM.DD. You can change the pattern Kibana is looking for, but we won’t do that here
  • logs must have a timestamp, and that timestamp must be stored in the @timestamp field. It’s also nice to put the message part in the message field – because Kibana shows it by default

To satisfy the requirements above, here’s a rsyslog configuration that should work for sending your local syslog logs to Elasticsearch in a Logstash/Kibana-friendly way:

module(load="imuxsock")             # for listening to /dev/log
module(load="omelasticsearch") # for outputting to Elasticsearch
# this is for index names to be like: logstash-YYYY.MM.DD
template(name="logstash-index"
  type="list") {
    constant(value="logstash-")
    property(name="timereported" dateFormat="rfc3339" position.from="1" position.to="4")
    constant(value=".")
    property(name="timereported" dateFormat="rfc3339" position.from="6" position.to="7")
    constant(value=".")
    property(name="timereported" dateFormat="rfc3339" position.from="9" position.to="10")
}

# this is for formatting our syslog in JSON with @timestamp
template(name="plain-syslog"
  type="list") {
    constant(value="{")
      constant(value="\"@timestamp\":\"")     property(name="timereported" dateFormat="rfc3339")
      constant(value="\",\"host\":\"")        property(name="hostname")
      constant(value="\",\"severity\":\"")    property(name="syslogseverity-text")
      constant(value="\",\"facility\":\"")    property(name="syslogfacility-text")
      constant(value="\",\"tag\":\"")   property(name="syslogtag" format="json")
      constant(value="\",\"message\":\"")    property(name="msg" format="json")
    constant(value="\"}")
}
# this is where we actually send the logs to Elasticsearch (localhost:9200 by default)
action(type="omelasticsearch"
    template="plain-syslog"
    searchIndex="logstash-index"
    dynSearchIndex="on")

After restarting rsyslog, you can go to http://host-serving-Kibana:8000/ in your browser and start searching and graphing your logs:

Kibana screenshot
Digging into syslog with Kibana

Tips

Now that you got the essentials working, here are some tips that might help you go even further with your centralized logging setup:

  • you might not want to put the new rsyslog and omelasticsearch on all your servers. In this case you can forward them over the network to a central rsyslog that has omelasticsearch, and push your logs to Elasticsearch from there. Some information on forwarding logs via TCP can be found here and here
  • you might want rsyslog to buffer your logs (in memory, on disk, or some combination of the two), in case Elasticsearch is not available for some reason. Buffering will also help performance, as you can send messages in bulks instead of one by one. There’s a reference on buffers with rsyslog&omelasticsearch here
  • you might want to parse JSON-formatted (CEE) syslog messages. If you’re using them, check our earlier post on the subject

If you don’t want to worry about any of that, you might want to check out Logsene. This is our new data & log analytics service, where you can just send your syslog messages (CEE-formatted or not) and not worry about running and maintaining a logging cluster in house. We’ll index them for you, and provide a nice interface to search and graph those logs. We also expose an Elasticsearch HTTP API, so Logsene plays nicely with Logstash, rsyslog+omelasticsearch, Kibana, and virtually any other logging tool that can send logs to Elasticsearch.