Sensei: distributed, realtime, semi-structured database

Once upon a time there was no decent open-source search engine. Then, at the very beginning of this millennium, Doug Cutting gave us Lucene. Several years later Yonik Seeley wrote Solr. In 2010 Shay Banon released ElasticSearch. And just a few days ago John Wang and his team at LinkedIn announced Sensei 1.0.0 (also known as SenseiDB). Here at Sematext we’ve been aware of Sensei for a while now (2 years?) and are happy to have one more great piece of search software available for our own needs and those of our customers. As a matter of fact, we are so excited about Sensei that we’ve already started hacking on adding support for Sensei to SPM, our Scalable Performance Monitoring tool and service! Since Sensei is brand new, we asked John to tell us a little more about it.

Please tell us a bit about yourself.

My name is John Wang, and I am the search architect at LinkedIn.com. I am the creator and the current lead for the Sensei project.

Could you describe Sensei for us?

Sensei is an open-source, elastic, realtime, distributed database with native support for searching and navigating both unstructured text and structured data. Sensei is designed to handle complex semi-structured queries on very large and rapidly changing datasets.

It was written by the content search team at LinkedIn to support LinkedIn Homepage and Signal.

The core engine is also used for LinkedIn search properties, e.g. people search, recruiter system, job and company search pages.

Why did you write Sensei instead of using Lucene or Solr?

Sensei leverages Lucene.

We weren’t able to leverage Solr because of the following requirements:

High update requirement: tens of thousands of updates per second into the system.
A truly distributed solution: Solr’s current distributed story has a SPOF at the master, and Solr Cloud is not yet complete.
Complex faceting support, not just standard terms-based faceting. We needed to facet on the social graph, dynamic time ranges, and many other interesting faceting scenarios. Faceting behavior also needs to be highly customizable, which is not available via Solr.

What does Sensei do that existing open-source search solutions don’t provide?

Consider Sensei if your application has the following characteristics:

High update rates
Non-trivial semi-structured query support

Who should stick with Solr or ElasticSearch instead of using Sensei?

The feature sets, as well as the limitations, of all these systems don’t overlap fully. Depending on your application, if you are building on certain features in one system and it is working out, then I would suggest you stick with it. But for Sensei, our philosophy is to consider performance ahead of features.

Are there currently any Sensei users other than LinkedIn that you are aware of?

We have seen some activities on the mailing list indicating deployments outside of LinkedIn, but I don’t know the specifics. This is a new project and we are keeping track of its usage on http://senseidb.github.com/sensei/usage.html, so let us know if you are using Sensei and want to be listed there.

What are Sensei’s biggest weaknesses and how and when do you plan on addressing them?

Let me address this question by providing a few limitations of Sensei:

Each document inserted into Sensei must have a unique identifier (UID) of type long. Though this can be inconvenient, this is a decision made for performance reasons. We have no immediate plans for addressing this, but we are thinking about it.
For columns defined in numeric format, e.g. int, float, long…, we don’t yet support negative numbers. We will have support for negative numbers very soon.
Static schema. Dynamic schema is something we find useful, and we will support it in the near future.

What’s next for Sensei as a project?

We will continue iterating on Sensei on both the performance and feature front. See below for a list of things we are looking at.

What are some of the key features you plan on adding to Sensei in the coming months?

This may not be a comprehensive list, but it gives you an idea of the areas we are working on:

Relevance toolkit
Built-in time and geo type columns
Parent-node type documents
Attribute type faceting (name-value pairs)
Online rebalancing
Online reindexing
Parameter secondary store (e.g. activities on a document, social gestures, etc.)
Dynamic schemata
Support for aggregation functions, e.g. AVG, MIN, MAX, etc.

The Relevance toolkit sounds interesting. Could you tell us a bit about what this will do and if, by any chance, this might in any way be close to the idea behind Apache Open Relevance?

This is a big feature for 1.1.0. I am not familiar with Open Relevance to comment. The idea behind relevance toolkit is to allow you to specify a model with the query. One important usage for us is to be able to perform relevance tuning against fast-flowing production data. Waiting for things to be redeployed to production after relevance model changes does not work if the production data is changing in real-time, like tweets.

Maybe some specific tech questions – feel free to skip the ones that you think are not important or you just don’t feel like answering.

What is the role of Norbert in Sensei?

Sensei currently uses Norbert, whose maintainer is one of our main developers, as a cluster manager and for RPC between a Broker and Sensei nodes. A Broker is a servlet embedded in each Sensei node. Norbert is used as a message transport to Sensei nodes. Norbert is an elegant wrapper around Zookeeper for cluster management. We do have plans to create an abstraction around this component to allow pluggability for other cluster managers.

When I first saw SQL-like query on Sensei’s home page I thought it was purely for illustration purposes, but now I realize it is actually very real!

BQL (Browse Query Language) is a SQL variant for querying Sensei. It is very real; we plan for BQL to be a standard way to query Sensei.

Can you share with us any Sensei performance numbers?

We have published some performance numbers at http://senseidb.com/performance.html

We have created a separate Github repository containing all our performance evaluation code at:

https://github.com/kwei/search-perf

Does Sensei have a SPOF (Single Point Of Failure)?

No – assuming a Sensei cluster contains more than 1 replica of each document. This is one important design goal of Sensei: every Sensei node in the cluster acts independently in both consuming data as well as handling queries. See the following answers for details.

What has to happen for data loss to occur?

Data loss occurs only if you have data store corruption on all replicas. If only 1 replica is corrupted, you can always recover from other replicas.

Sensei by design assumes a data source that is ordered and versioned, e.g., a persistent queue. Each Sensei node persists the version for each commit. Thus, to recover, data events can be replayed from that version.

In production at LinkedIn, this is very handy to ensure data consistency when bouncing nodes.
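
To make the replay idea concrete, here is a minimal Python sketch. It is not Sensei code: the Event and Node classes and the catch_up function are hypothetical names invented for illustration, and the "stream" is just a list standing in for an ordered, versioned source such as a persistent queue.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    version: int  # position in the ordered, versioned stream
    uid: int      # document UID
    doc: dict     # document payload


@dataclass
class Node:
    """Hypothetical stand-in for one indexing node (not a Sensei class)."""
    index: dict = field(default_factory=dict)
    committed_version: int = 0  # version persisted with the last commit


def catch_up(node, stream):
    """Replay every event newer than the node's committed version.

    Returns the number of events applied. Events at or below the
    committed version are skipped because they are already indexed.
    """
    applied = 0
    for event in stream:
        if event.version <= node.committed_version:
            continue
        node.index[event.uid] = event.doc
        node.committed_version = event.version  # "commit" after each event
        applied += 1
    return applied
```

A node that consumed only the first part of the stream before being bounced can call catch_up again with the full stream and ends up in the same state as a node that never went down.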

You mention recovery from other replicas and recovery by replaying data events from a specific version. Does that mean once a copy of a document makes it into Sensei in order to recover lost replicas for that document Sensei does not need to reach out to the originator of the data and is self-sufficient, so to speak? Or does replaying mean getting the lost data from an external data store?

The data stream is external. So to catch up from an older version, Sensei would just replay the data events from the stream, starting at that version. But if an entire data replica is lost, a manual copy from other replicas is required (for now).

What happens if a node in a cluster fails?

When a node fails, Zookeeper notifies the other cluster event listeners in the cluster, which means the Broker. The Broker keeps the state of the current cluster node topology, and subsequent queries will be routed to the live replicas, thus avoiding sending requests to the failed node. If all nodes for one replica are down, then partial results are returned.
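
The routing behavior described here can be sketched in a few lines of Python. This is an illustration under assumptions, not Sensei's Broker: the Broker class, its method names, and the node names are hypothetical, and on_node_down stands in for whatever callback the cluster manager would fire. The sketch also interprets "partial results" as "some partition has no live node", which is an assumption.

```python
class Broker:
    """Hypothetical broker that tracks which live nodes serve which partitions."""

    def __init__(self, topology):
        # topology: node id -> partitions served by that node
        self.topology = {node: set(parts) for node, parts in topology.items()}
        self.live = set(self.topology)

    def on_node_down(self, node):
        # Would be triggered by a cluster-manager notification.
        self.live.discard(node)

    def route(self, partitions):
        """Pick one live node per partition.

        Returns (plan, missing): plan maps partition -> node, and missing
        lists partitions with no live node, i.e. a partial result.
        """
        plan, missing = {}, []
        for partition in partitions:
            candidates = sorted(
                n for n in self.live if partition in self.topology[n]
            )
            if candidates:
                # A real broker would load-balance; this takes the first.
                plan[partition] = candidates[0]
            else:
                missing.append(partition)
        return plan, missing
```

Once a node is marked down, later calls to route simply never consider it, and a partition with no remaining live node is reported as missing instead of failing the whole query.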

What happens when the cluster reaches its capacity? Can one simply add more nodes to the cluster and Sensei auto-magically takes care of the rest or does one have to manually rebalance shards, or…?

Depending on how data is sharded:

If the over-sharding technique is used, then adding nodes to the cluster is trivial. New nodes just specify which shards they want to handle. Each node’s sensei.properties lists the partitions it should handle, e.g., sensei.node.partitions=1,3,8

If you use a sharding strategy where data migration is not needed as new data flows into the system, e.g., sharding by time or consecutive UID, then expanding the cluster is also trivial.

If the sharding strategy requires data migration, e.g. mod-based sharding, then cluster rebalancing is required. This is coming in a future release, and we already have designs for online data rebalancing. For now, one has to reindex all data in order to reshard and rebalance.
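
A small Python sketch may help show why these cases differ. All function names here are hypothetical; only the comma-separated format of the sensei.node.partitions value comes from the answer above. The first part measures how many documents change shard when a mod-based scheme goes from 4 to 5 shards; the second shows over-sharding, where the UID-to-partition mapping never changes and only whole partitions move between nodes.

```python
def shard_of(uid, num_shards):
    """Mod-based sharding: the shard is the UID modulo the shard count."""
    return uid % num_shards


def moved_fraction(uids, old_shards, new_shards):
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(
        1 for uid in uids
        if shard_of(uid, old_shards) != shard_of(uid, new_shards)
    )
    return moved / len(uids)


def parse_partitions(value):
    """Parse a value such as '1,3,8' into a list of partition ids."""
    return [int(p) for p in value.split(",")]


def owner_of(uid, num_partitions, node_partitions):
    """Over-sharding: partition count is fixed, nodes claim whole partitions."""
    partition = shard_of(uid, num_partitions)
    for node, partitions in node_partitions.items():
        if partition in partitions:
            return node
    return None


# Going from 4 to 5 mod-based shards relocates most documents.
print(moved_fraction(range(1000), 4, 5))  # 0.8

# With 8 fixed partitions, adding node n3 only reassigns partitions 6 and 7.
before = {"n1": parse_partitions("0,1,2,3"), "n2": parse_partitions("4,5,6,7")}
after = {"n1": [0, 1, 2, 3], "n2": [4, 5], "n3": [6, 7]}
print(owner_of(13, 8, before), owner_of(13, 8, after))  # n2 n2
print(owner_of(15, 8, before), owner_of(15, 8, after))  # n2 n3
```

In the mod-based case 80% of the sample UIDs land on a different shard, which is why a full reindex is needed today. In the over-sharded case no document changes partition, so growing the cluster is a matter of configuration.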

Since Sensei is an eventually consistent system, how does one ensure the search client gets consistent results (e.g. when paging through results or filtering results with facets)?

On the Sensei request object, there is a routing parameter. Typically this routing parameter is set to the value of the search session. By default, Sensei applies consistent hashing on the routing parameter to make sure the same replica is used for queries with the same routing parameter or search session.
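
As an illustration of the technique, here is a generic consistent-hash ring in Python. It is a sketch, not Sensei's implementation: the ReplicaRing class, the use of MD5, and the number of ring points per replica are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ReplicaRing:
    """Hypothetical consistent-hash ring over replica names."""

    def __init__(self, replicas, points_per_replica=64):
        # Each replica gets several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(points_per_replica)
        )
        self._keys = [point for point, _ in self._ring]

    def pick(self, routing_param):
        """Return the replica owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._keys, _hash(routing_param))
        return self._ring[idx % len(self._ring)][1]
```

Calling pick with the same search session id always returns the same replica, so paging and facet refinement within a session see one consistent view. If a replica disappears, only the sessions that were mapped to it get re-routed; all other sessions keep their replica.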

How does one upgrade Sensei? Can one perform an online upgrade of a Sensei cluster without shutting down the whole cluster? Are new versions of Sensei backwards compatible?

Yes, subsets of the cluster can be shut down, and dynamic routing via Zookeeper would take place. This is very useful when we are pushing out new builds in canary mode to ensure stability and compatibility.

Does Sensei require a schema or is it schemaless?

Sensei requires a schema. But we do have plans to make Sensei schema dynamic like ElasticSearch.

Does Sensei have support for things like Spatial Search, Function Queries, Parent-Child data, or JOIN?

We have in the works a relevance toolkit which should cover features of Solr’s Function Queries.

We also have plans to support Spatial Search and Parent-Child data.

We don’t have immediate plans to support Joins.

How does one talk to Sensei? Are there existing client libraries?

The Sensei cluster exposes two REST endpoints: REST/JSON and BQL.
The packaging also includes Java and Python clients (also Ruby if resourcing works out), along with JavaScript helpers for using the REST/JSON API in a web application.

Does Sensei have an administrative/management UI?

Sensei comes with a web application that helps with building queries against the cluster. We use it to tweak relevance models as well as to instrument an online cluster.

JMX is also exposed to administer the cluster.

In the configuration users can turn on other types of data reporting to other clusters, e.g. RRD, log etc.

Big thank you to John and his team for releasing and open-sourcing Sensei and for taking the time to answer all our questions.

— @sematext

The State of Solandra – Summer 2011

A little over 18 months ago we talked to Jake Luciani about Lucandra – a Cassandra-based Lucene backend. Since then Jake has moved away from raw Lucene and married Cassandra with Solr, which is why Lucandra now goes by Solandra. Let’s see what Jake and Solandra are up to these days.

What is the current status of Solandra in terms of features and stability?

Solandra has gone through a few iterations. The first was Lucandra, which partitioned data by terms and used Thrift to communicate with Cassandra. This worked for a few big use cases, mainly managing an index per user, and garnered a number of adopters. But it performed poorly when you had very large indexes with many dense terms, due to the number and size of remote calls needed to fulfill a query.

Last summer I started off on a new approach based on Solr that would address Lucandra’s shortcomings: Solandra. The core idea of Solandra is to use Cassandra as a foundation for scaling Solr. It achieves this by embedding Solr in the Cassandra runtime and using the Cassandra routing layer to auto-shard an index across the ring (by document). This means good random distribution of data for writes (using Cassandra’s RandomPartitioner) and good search performance, since individual shards can be searched in parallel across nodes (using SolrDistributedSearch). Cassandra is responsible for sharding, replication, failover and compaction. The end user now gets a single scalable component for search, without changing APIs, which will scale in the background for them. Since search functionality is performed by Solr, it will support anything Solr does.

I gave a talk recently on Solandra and how it works: http://blip.tv/datastax/scaling-solr-with-cassandra-5491642
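
The "search shards in parallel, then merge" pattern mentioned above can be sketched generically in Python. This is not Solandra or Solr code: the shards are plain dictionaries, the scoring is a naive term count, and search_shard and distributed_search are hypothetical names used only for this example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, k):
    """Return the top-k (score, doc_id) hits from one shard.

    The score is a toy term count, standing in for a real relevance score.
    """
    hits = [(text.split().count(term), doc_id) for doc_id, text in shard.items()]
    return heapq.nlargest(k, (hit for hit in hits if hit[0] > 0))


def distributed_search(shards, term, k=3):
    """Query every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term, k), shards))
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))


shards = [
    {"d1": "solr solr cassandra", "d2": "lucene"},
    {"d3": "solr"},
    {"d4": "solr solr solr"},
]
print(distributed_search(shards, "solr", k=2))  # [(3, 'd4'), (2, 'd1')]
```

Each shard only needs to return its own top k hits, because no document outside a shard's local top k can appear in the global top k.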

Are you still the sole developer of Solandra? How much time do you spend on Solandra?
Have there been any external contributions to Solandra?

I still am responsible for the majority of the code, however the Solandra community is quite large with over 500 github followers and 60 forks. I receive many useful bug reports and patches through the community. Late last year I started working at DataStax (formerly Riptano) to focus on Apache Cassandra. DataStax is building a suite of products and services to help customers use Cassandra in production and incorporate Cassandra into existing enterprise infrastructure. Solandra is a great example of this. We currently have a number of customers using Solandra and we encourage people interested in using Solandra to reach out to us for support.

What are the most notable differences between Solandra and Solr?

The primary difference is the ability to grow your Solr infrastructure seamlessly using Cassandra. I purposely want to avoid altering the Solr functionality, since the primary goal here is to make it easy for users to migrate to and from Solandra and Solr. That being said, Solandra does offer some unique features for managing millions of indexes. One is that different Solr schemas can be injected at runtime using a RESTful interface. Another is that Solandra supports the concept of virtual Solr Cores, which share the same core but are treated as different indexes. For example, if you have a core called “inbox” you can create an index per user like “inbox.otis” or “inbox.jake” just by changing the endpoint URL.

Finally, Solandra has a bulk loading interface that makes it easy to index large amounts of data at a time (one known cluster indexes at ~4-5MB of text per second).

What are the most notable differences between Solandra and Elastic Search?

ElasticSearch is more mature and offers a similar architecture for scaling search though not based on Cassandra or Solr. I think ElasticSearch’s main weakness is it requires users to scrap their existing code and tools to use it. On the other hand, Solandra provides a scalable platform built on Solr and lets you grow with it.

Solandra doesn’t use the Lucene index file format, so it will grow to support millions of indexes. Systems like Solr and ElasticSearch create a directory per index, which makes managing millions of indexes very hard. The flipside is that a lot of performance tweaks are lost by not using the native file format; most of the current work on Solandra relates to improving single node performance.

Solandra is a single component that gives you search AND NoSQL database, and is therefore much easier to manage from the operations perspective IMO.

What do you plan on adding to Solandra that will make it clearly stand out from Solr or Elastic Search?

Solandra will continue to grow with Solr (4.0 will be out in the future), as well as with Cassandra. Right now Solandra’s real-time search is limited by the performance of Solr’s field cache implementation. By incorporating Cassandra triggers I think we can remove this bottleneck and get really impressive real-time performance at scale, due to how Solandra pre-allocates shards.

Also, since the Solr index is stored in the Cassandra datamodel, you can now apply some really interesting features of Cassandra to Solr, such as expiring indexes and triggered searches.

When should one use Solandra?

If you say yes to any of the following you should use Solandra:

I have more data than can fit on a single box
I have potentially millions of indexes
I need improved indexing with multi-master writes
I need multi-datacenter search
I am already using Cassandra and Solr
I am having trouble managing my Solr cluster

When should one not use Solandra?

If you are happy with your Solr system today and you have enough capacity to scale the size and number of indexes comfortably then there is no need to use Solandra. Also, Solandra is under active development so you should be prepared to help diagnose unknown issues. Also note that if you require search features that are currently not supported by Solr distributed search, you should not use Solandra.

Are there known problems with Solandra that users should be aware of?

Yes, currently the index sizes can be much larger in Solandra than in Solr (in some cases 10x). This is due to how Solandra indexes data as well as Cassandra’s file format. Cassandra 1.0 includes compression, so that will help quite a bit.

Also, since consistency in Solandra is tunable, it requires your application to consider the implications of writing data at lower consistencies.

Finally, one thing that keeps coming up quite often is users assuming Solandra auto-indexes the data you already have in Cassandra, since Solandra builds on Cassandra. This is not the case. Data must be indexed and searched through the traditional Solr APIs.
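
On the tunable consistency point, the usual rule of thumb for Cassandra-style replication is that a read is guaranteed to overlap the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. The Python sketch below shows only that arithmetic; the function names are hypothetical, and how Solandra maps Solr operations onto consistency levels is not covered by the interview.

```python
def quorum(replication_factor):
    """Smallest majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def read_sees_latest_write(write_replicas, read_replicas, replication_factor):
    """True when every read set must overlap every write set (W + R > N)."""
    return write_replicas + read_replicas > replication_factor
```

With a replication factor of 3, writing and reading at quorum (2 + 2 > 3) guarantees overlap, while writing and reading a single replica (1 + 1) does not, so a search issued right after a low-consistency write may miss it.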

Is anyone using Solandra in production? What is the biggest production deployment in terms of # docs, data size on filesystem, query rate?

Solandra is now in production with a handful of users I know of. Some others are in the testing/pre-production stage. But it’s still a small number AFAIK.

The largest Solandra cluster I know of is on the order of ~5 nodes, ~10TB of data with ~100k indexes and ~2B documents.

If you had to do it all over, what would you do differently?

I’m really excited with the way Lucandra/Solandra has evolved over the past year. It’s been a great learning experience and has allowed me to work with technologies and people I’m really, really excited about. I don’t think I’d change a thing, great software takes time.

When is Solandra 1.0 coming out and what is the functionality/issues that remain to be implemented before 1.0?

I don’t really use the 1.0 moniker as people tend to assume too much when they read that. I think once Solandra is fully documented, supports things like Cassandra based triggers for indexing and search, and has an improved on disk format, I’d be comfortable calling Solandra 0.9 🙂

Thank you Jake. We are looking forward to Solandra 0.9 then.

Elastic Search: Distributed, Lucene-based Search Engine

Here at Sematext we are constantly looking to expand our horizons and acquire new knowledge (as well as our search team – see Sematext jobs page – we are always on the lookout for people with passion for search and, yes, we are heavy users of ElasticSearch!), especially in and around our main domains of expertise – search and analytics. That is why today we are talking with Shay Banon about his latest project: ElasticSearch. If his name sounds familiar to you, that’s likely because Shay is also known for his work on GigaSpaces and Compass.

What is the history of Elastic Search in terms of:

When you got the idea?

How long you had that brewing in your head?

When did you first start cutting code for it?

I have been thinking about something along the lines of what elasticsearch has turned out to be for a few years now. As you may know, I am the creator of Compass (http://www.compass-project.org), which I started more than 7 years ago, and the aim of Compass was to ease the integration of search into any Java application.

When I developed Compass, I slowly started to add more and more features to it. For example, Compass, from the get go, supported mapping of Objects to a Search Engine (OSEM – http://www.compass-project.org/docs/2.2.0/reference/html/core-osem.html). But, it also added a JSON to Search Engine mapping layer (JSEM – http://www.compass-project.org/docs/2.2.0/reference/html/core-jsem.html) as slowly JSON was becoming a de-facto standard wire protocol.

Another aspect that I started to tackle was Compass support for a distributed mode. GigaSpaces (http://www.compass-project.org/docs/2.2.0/reference/html/needle-gigaspaces.html), Coherence (http://www.compass-project.org/docs/2.2.0/reference/html/needle-coherence.html), and Terracotta (http://www.compass-project.org/docs/2.2.0/reference/html/needle-terracotta.html) are attempts at solving that. All support using a distributed Lucene Directory implementation (scaling the storage), but, as you know Lucene, sometimes this is not enough. With GigaSpaces, the integration took another step with sharding the index itself and using “map/reduce” to search on nodes.

The last important aspect of Compass is its integration with different mechanisms to handle content and make it searchable. For example, it has very nice integration with JPA (Hibernate, TopLink, OpenJPA), which means any change you make to the database through JPA is automatically indexed. Another integration point was with data grids such as GigaSpaces and Coherence: any change done to them gets applied to the index.

But still, Compass is a library that gets embedded in your Java app, and its solutions for distributed search are far from ideal. So, I started to play around with the idea of creating a Compass Search server that would use its mapping support (JSON) and expose itself through a RESTful API.

Also, I really wanted to try and tackle the distributed search problem. I wanted to create a distributed search solution that is inspired, in its distributed model, from how current state of the art data grids work.

So, about 7 months ago I wrote the first line of elasticsearch and have been hacking on it ever since.

The inevitable: How is Elastic Search different from Solr?

To be honest, I never used Solr. When I was looking around for current distributed search solutions, I took a brief look at Solr’s distributed model, and was shocked that this is what people need to deal with in order to build a scalable search solution (that was 7 months ago, so maybe things have changed). While looking at Solr’s distributed model I also noticed the very problematic “REST” API it exposes. I am a strong believer in having the product talk the domain model, and not the other way around. ElasticSearch is very much a domain driven search engine, and I explain it more here: http://www.elasticsearch.com/blog/2010/02/12/yourdatayoursearch.html. You will find this attitude throughout elasticsearch APIs.

Is there a feature-for-feature comparison to Solr that would make it easier for developers of new search applications to understand the differences and choose the right tool for the job?

There isn’t one, and frankly, I am not enough of a Solr expert to create such a list. What I hope is that people who work with both will create such a list, hopefully with input from both projects.

When would one want (or need) to use Elastic Search instead of Solr and vice versa?

As far as I am concerned, elasticsearch is being built to be a complete distributed, RESTful, search solution, with all the features anyone would ever want from a search server. So, I really don’t see a reason why someone would choose Solr over ElasticSearch. To be honest, with today’s data scale, and the slow move to the cloud (or “cloud architectures”) you *need* a search engine that you can scale, and I really don’t understand how one would work with Solr’s distributed model, but that might just be me and I am spoiled by what I expect from distributed solutions because of my data grid background.

What was the reason for simply not working with the Solr community and enhancing Solr? (discussion, patches…) Are some of Elastic Search’s features simply not implementable in Solr?

First, the challenge. Writing a data grid level distributed search engine is quite challenging to say the least (note, I am not talking about data grid features, such as transactions and so on, just the data grid distributed model).

Second, building something on top of an existing codebase will never be as good as building something from scratch. For example, elasticsearch has a highly optimized, asynchronous transport layer to communicate between nodes (which the native Java client uses), and a highly modular core where almost anything is pluggable. These are things that are very hard to introduce or change with an existing codebase and existing developers. It’s much simpler to write it from scratch.

We see more and more projects using Github. What is your reason for choosing Git for SCM and Github for Elastic Search’s home?

Well, there is no doubt that Git is much nicer to work with than SVN thanks to its distributed nature (and I like things that are distributed 🙂 ). As for GitHub, I think that it’s currently the best project hosting service out there. You really feel like people out there know developers and what developers need. As a side note, I am a sucker for eye candy, and it simply looks good.

We see the community already created a Perl client. Are there other client libraries in the works?

Yeah, so there is an excellent Perl client (http://github.com/clintongormley/ElasticSearch.pm) which Clinton Gormley has been developing (he has also been an invaluable source of suggestions/bugs to the development of elasticsearch). There are several more including erlang, ruby, python and PHP (all listed here http://www.elasticsearch.com/products/). Note, thanks to the fact that elasticsearch has a rich, domain driven, JSON API, writing clients to it is very simple since most times there is no need to perform any mappings, especially with dynamic languages.

We realize it is super early for Elastic Search, but what is the biggest known deployment to date?

Yes, it is quite early. But, I can tell you that some not that small sites (10s of millions of documents) are already playing with elasticsearch successfully. Not sure if I can disclose any names, but once they go live, I will try and get them to post something about it.

What are Elastic Search’s future plans? Is there a roadmap?

The first things to get into ElasticSearch are more search engine features. The features are derived from the features that are already exposed in Compass, including some new aspects such as geo/local search.

Another interesting aspect is making elasticsearch more cloud provider friendly. For example, elasticsearch’s persistent store is designed in a write-behind fashion, and I would love to get one that persists the index to Amazon S3 or Rackspace CloudFiles (for more information on how persistence works with elasticsearch, see http://www.elasticsearch.com/blog/2010/02/16/searchengine_time_machine.html).

NoSQL is also an avenue that I would love to explore. In a similar concept to how Compass works with JPA / Data Grids, I would like to make the integration of search with NoSQL solutions simpler. It should be as simple as this: you do something against the NoSQL solution, and it automatically gets applied to elasticsearch as well. Thanks to the fact that the elasticsearch model is very much domain driven, and very similar to what NoSQL uses, the integration should be simple. As an example, TerraStore already comes with an integration module that applies to elasticsearch any changes done to TerraStore. I blogged my thoughts about search and NoSQL here http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html.

If you have additional questions for Shay about Elastic Search, please feel free to leave them in comments, and we will ask Shay to use comments to answer them.