Solr vs. ElasticSearch: Part 1 – Overview

[Note: for those of you who don’t have the time or inclination to go through all the technical details, here’s a high-level, up-to-date (2015) Solr vs. Elasticsearch overview]

——-

Good Solr vs. ElasticSearch coverage is long overdue.  At Sematext we make good use of our own Search Analytics and pay attention to what people search for.  Not surprisingly, lots of people are wondering when to choose Solr and when to choose ElasticSearch, and this SolrCloud vs. ElasticSearch question is something we regularly address in our search consulting engagements.

As the Apache Lucene 4.0 release approaches, and with it the Solr 4.0 release as well, we thought it would be beneficial to take a deeper look at and compare the two leading open-source search engines built on top of Lucene – Apache Solr and ElasticSearch. Because the topic is broad and deep, we are publishing our research as a series of blog posts, starting with this one, which provides a general overview of the functionality offered by both search engines.

  1. Solr vs. ElasticSearch: Part 1 – Overview
  2. Solr vs. ElasticSearch: Part 2 – Indexing and Language Handling
  3. Solr vs. ElasticSearch: Part 3 – Searching
  4. Solr vs. ElasticSearch: Part 4 – Faceting
  5. Solr vs. ElasticSearch: Part 5 – Management API Capabilities
  6. Solr vs. ElasticSearch: Part 6 – User & Dev Communities Compared


HBase FuzzyRowFilter: Alternative to Secondary Indexes

In this post we’ll explain the usage of FuzzyRowFilter, which can help in many situations where secondary index solutions seem to be the only way to avoid full table scans.

Background

When it comes to HBase, the way you design your row key affects everything. It is a common pattern to use a composite row key which consists of several parts, e.g. userId_actionId_timestamp. This allows for fast fetching of rows (or a single row) based on start/stop row keys, which have to be a prefix of the row keys you want to select. E.g. one may select the last login time of userX by specifying the row key prefix “userX_login_”, or the last action of userX by fetching the first row with prefix “userX_”. These partial row key scans work very fast and do not require scanning the whole table: HBase storage is optimized to make them fast.
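
To illustrate, here is a minimal sketch of such a partial row key scan using the plain HBase client API; the table name and key values are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "user_actions"); // hypothetical table name

// Select userX's login records: the start row is the key prefix itself, the
// stop row is the prefix with its last byte incremented (an exclusive upper
// bound, so the scan covers exactly the rows starting with "userX_login_").
byte[] startRow = Bytes.toBytes("userX_login_");
byte[] stopRow = Bytes.toBytes("userX_login_");
stopRow[stopRow.length - 1]++;

ResultScanner scanner = table.getScanner(new Scan(startRow, stopRow));
for (Result row : scanner) {
  // rows arrive sorted by key; if timestamps are encoded so that keys sort
  // chronologically, the first/last row here is the earliest/latest login
}
scanner.close();
table.close();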

Problem

However, there are cases when you need to fetch data based on key parts which happen to be in the middle of the row key. In the example above, you may want to find the last logged-in users. When you don’t know the first parts of the key, a partial row key scan turns into a full table scan, which might be very slow and resource-intensive.

Possible Solution #1: Data Redundancy

One possible way around this is to build secondary indexes by creating redundant rows which hold the same data as the original ones but with a different ordering of the key parts (e.g. actionId_timestamp). This solution may not be suitable for some because of its cons:

  • storing extra indexes (for N indexes this usually means storing N times more data) results in a lot more data on disk
  • storing (and serving) extra rows puts additional load on the cluster during writing and reading (extra blocks competing for cache space, etc.)
  • writing/updating/deleting several rows is not an atomic operation in HBase

Possible Solution #2: Integrated Secondary Indexes

Another way to attack the problem is to use a secondary index mechanism integrated into HBase which doesn’t rely on data redundancy, e.g. something like IHBase. The problem here is that there’s no out-of-the-box solution to be used. This may change with the addition of the newer coprocessors functionality (see e.g. HBASE-2038 or this), but as of now existing solutions have their own limitations and drawbacks, while newer solutions are yet to be completed.

Suggested Solution

First of all, I have to say that the solution suggested below is not a silver bullet. Its performance may be very bad, even close to a full table scan in some cases, and it cannot be applied in every situation described in the Background and Problem sections above. But in many cases, depending on your data, this simple solution can be used to avoid the burden of secondary indexes and still allow for very fast scans. In many other cases it can significantly speed up your full table scans.

The suggested solution is not new and is quite simple, but HBase users usually overlook it, though they shouldn’t.

Fast-Forwarding in Server-side Filter

Recent HBase versions (0.90+, I believe) include a mechanism that allows skipping whole ranges of rows when scanning with a server-side filter. The data of the skipped rows may not even be read from disk. Based on the current row key, the filter can tell the scanner to advance to the row with a specific key, jumping over many rows which are simply skipped. For example, this makes it possible to perform fast full-table scans (or large partial key scans) when there’s enough information about the key and the data to provide efficient hints for skipping many rows during the scan.

Most of the time you’ll have to implement your own custom filter that performs fast-forwarding. Hint for these cases: refer to org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT in the HBase sources.
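
For illustration, here is a minimal sketch of such a custom fast-forwarding filter, written against the 0.94-era Filter API; the class name and the seek-to-a-given-prefix logic are our own example, not something shipped with HBase. A real, deployable filter would also need Writable serialization and would have to be on the region servers’ classpath:

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Skips ahead to the first row starting with a given prefix instead of
// stepping through all the rows that precede it one by one.
public class SeekToPrefixFilter extends FilterBase {
  private final byte[] prefix;

  public SeekToPrefixFilter(byte[] prefix) {
    this.prefix = prefix;
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    // While the current row sorts before the prefix, ask the scanner to
    // fast-forward to the hint below rather than reading row by row.
    int cmp = Bytes.compareTo(kv.getBuffer(), kv.getRowOffset(), kv.getRowLength(),
                              prefix, 0, prefix.length);
    return cmp < 0 ? ReturnCode.SEEK_NEXT_USING_HINT : ReturnCode.INCLUDE;
    // (rows sorting after the prefix would be cut off by a stop row in a real scan)
  }

  @Override
  public KeyValue getNextKeyHint(KeyValue currentKV) {
    // The hint: the first possible KeyValue of the first row with our prefix.
    return KeyValue.createFirstOnRow(prefix);
  }
}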

FuzzyRowFilter

FuzzyRowFilter is one of the handy available filters that performs fast-forwarding based on a fuzzy row key mask provided by the user. It will be available out of the box in the next HBase release, but you can already download its sources from HBASE-6509 (use the latest patch) and use it like any other custom filter: there’s no need to patch HBase, as it relies on existing functionality.

FuzzyRowFilter takes a row key and mask info as parameters. In the example above, if we want to find the last logged-in users and the row key format is userId_actionId_timestamp (where userId has a fixed length of, say, 4 chars), the fuzzy row key we are looking for is “????_login_”. This translates into the following params for FuzzyRowFilter:

import java.util.Arrays;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

FuzzyRowFilter rowFilter = new FuzzyRowFilter(
    Arrays.asList(
        new Pair<byte[], byte[]>(
            Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_login_"), // fuzzy key: 4 "any" bytes + "_login_"
            new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0})));    // mask: 1 = non-fixed, 0 = fixed

I.e. the row key to compare with is provided as the first byte array; at the byte positions where any value is allowed, “\x00” is set (which translates into (byte) 0). To tell which positions are fixed and which are not, a second byte array (the mask info) is provided, with zeroes at positions whose values are “fixed” and ones at the “non-fixed” positions.

Thus one can define different fuzzy row key masks, including those with “non-fixed” positions anywhere in the middle of the key. E.g.: “hb?se” or “??pred?ce”.

Note that FuzzyRowFilter accepts more than one mask: if a row key satisfies at least one of them, the row will be included in the result. E.g. by providing the masks “????_login_” and “????_register_” we can find the last logged-in and registered users.
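
For example, a sketch of passing both of those masks to a single filter and attaching it to a scan (each mask byte array sized to match its fuzzy key):

import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

Scan scan = new Scan();
scan.setFilter(new FuzzyRowFilter(Arrays.asList(
    new Pair<byte[], byte[]>(                               // "????_login_"
        Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_login_"),
        new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0}),
    new Pair<byte[], byte[]>(                               // "????_register_"
        Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_register_"),
        new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}))));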

How It Works

In the example above, with the mask “????_login_”, the scan initially navigates to the first row of the table. It is likely to be a user-action record (let the userId be “0001”), but the action may not be “login”. In that case, since the filter knows the “current” user (“0001”) and the action it is looking for, it tells the scan to jump to the row with the key “0001_login_”. Many rows may be skipped this way (if we track other user actions apart from “login”, there are likely many more other user-action records than user logins). The scan then reads user login records until it hits a record with an action that is not login, say “0001_logout”. At that point the filter knows there’s no point in scanning the rest of this user’s records and tells the scanner to jump to the next user, “0002_login_”, and continue from there. Note: there might be no user “0002”; the filter knows nothing about users, it simply suggests the next user id by incrementing the current one. In that case the scan automatically jumps to the next existing user, and the steps above are repeated.

Limitations & Performance Considerations

As you have probably already figured out from the example above, FuzzyRowFilter can only be applied if userId has a fixed length. While it is usually not hard to design the row key format so that its parts have fixed lengths (at least the parts we need to mask with “???”), in many situations it may be problematic.

The efficiency of using FuzzyRowFilter (or any other fast-forwarding filter) is determined by how many records the filter can actually skip and how many jumps it has to make to skip them.

The performance of a FuzzyRowFilter-based scan usually depends on the cardinality of the fuzzy part. E.g. in the example above, if the number of users is in the several hundreds to several thousands, the scan should be very fast: there will be only several hundred or thousand “jumps”, and a huge number of rows may be skipped. If the cardinality is high, the scan can take a lot of time. The worst-case scenario is when you have N records and N users, i.e. one record per user: in this case there’s simply nothing to skip.

Even when the performance of a full-table scan with the help of FuzzyRowFilter is not good enough for serving online data, it has still proven very efficient for feeding data from HBase into MapReduce jobs. Don’t overlook this!
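
As a sketch of that setup (the job, mapper, and table names here are hypothetical), wiring the filter into a table-input MapReduce job looks roughly like this:

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Scan scan = new Scan();
scan.setCaching(500);        // batch more rows per RPC, typical for MR scans
scan.setCacheBlocks(false);  // don't evict hot data from the block cache
scan.setFilter(rowFilter);   // the FuzzyRowFilter built earlier

Job job = new Job(conf, "last-logins"); // hypothetical job name, conf as usual
TableMapReduceUtil.initTableMapperJob(
    "user_actions",                // hypothetical source table
    scan,
    LoginMapper.class,             // hypothetical TableMapper subclass
    ImmutableBytesWritable.class,  // mapper output key class
    Result.class,                  // mapper output value class
    job);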

Summary

There are times when you design the row key for data to be stored in HBase and, because of very different data access patterns, feel the need for secondary indexes. In such situations, consider relying on FuzzyRowFilter for some of the data reading use-cases. Depending on your data, with small adjustments to the row key format (sometimes none are even needed), you can benefit from very fast fetching of records where you previously needed to perform full table scans or very large partial key scans.

Plug: if this sort of stuff interests you, we are hiring people who know and love to work with Hadoop, HBase, MapReduce…

@abaranau

Announcing Scalable Performance Monitoring (SPM) for JVM

Up until now, SPM existed in several flavors for monitoring Solr, HBase, ElasticSearch, and Sensei. Besides metrics specific to a particular system type, all these SPM flavors also monitor OS and JVM statistics.  But what if you want to monitor any Java application?  Say your custom Java application runs in some container, an application server, or from the command line?  You don’t really want to be forced to look at blank graphs that are really meant for stats from one of the above-mentioned systems.  This was one of our own itches, and we figured we were not the only ones craving to scratch it, so we put together a flavor of SPM for monitoring just the JVM and (operating) system metrics.

Now SPM lets you monitor OS and JVM performance metrics of any Java process through the following 5 reports, along with all other SPM functionality like integrated Alerts, email Subscriptions, etc.  If you are one of the many existing SPM users, these graphs should look very familiar.

JVM: heap, thread stats

We are not including it here, but the JVM report includes an additional and valuable Garbage Collection graph if you are using Java 7.

Garbage Collection: collection time & count

CPU & Memory: CPU stats breakdown, system load, memory stats breakdown, swap

Disk: I/O rates, disk space used & free

Network: I/O rates

To start monitoring, you need a valid Sematext Apps account, which you can get free of charge here. After that, define a new SPM JVM system, download the installation package, and proceed with the installation.  The whole process should take around 10 minutes (assuming you are not multitasking or suffering from ADD, that is).

The installation process is as simple as always and is described on the installer download page. After the installation is done, the monitor is enabled by adding just the following to the command line of the Java process/application you want to monitor:

-Dcom.sun.management.jmxremote \
-javaagent:/spm/spm-monitor/lib/spm-monitor-jvm-1.6.0-withdeps.jar=/spm/spm-monitor/conf/spm-monitor-config-YourSystemTokenHere-default.xml

For example, if my application is com.sematext.Snoopy I could run it with SPM parameters as shown here:

java -Dcom.sun.management.jmxremote -javaagent:/spm/spm-monitor/lib/spm-monitor-jvm-1.6.0-withdeps.jar=/spm/spm-monitor/conf/spm-monitor-config-YourSystemTokenHere-default.xml com.sematext.Snoopy

Once the installation is complete, the stats should start to appear in SPM within just a few minutes.

Happy monitoring!

@sematext

Announcing HBase Refcard

We’re happy to announce the very first HBase Refcard proudly authored by two guys from Sematext.  We hope people will find the HBase Refcard useful in their work with HBase, along with the wonderful Apache HBase Reference Guide.  If you think the refcard is missing some important piece of information that deserves to be included or that it contains superfluous content, please do let us know! (e.g., via comments here)