The New SolrCloud: Overview

Just the other day we wrote about Sensei, the new distributed, real-time, full-text search database built on top of Lucene, and here we are again writing about another “new” distributed, real-time, full-text search server also built on top of Lucene: SolrCloud.

In this post we’ll share some interesting SolrCloud bits and pieces that matter mostly to those working with large data and query volumes, but that all search lovers should find really interesting, too.  If you have any questions about what we wrote (or did not write!) in this post, please leave a comment – we’re good at following up on comments!  Or just ask @sematext!

Please note that the functionality described in this post is now part of trunk in the Lucene and Solr SVN repository.  This means that it will be available when Lucene and Solr 4.0 are released, but you can also use the trunk version just like we did, if you don’t mind living on the bleeding edge.

Recently, we were given the opportunity to once again work with big data (massive may actually be more descriptive of this data volume) stored in an HBase cluster and make it searchable. We needed to design a scalable search cluster capable of elastically handling future data volume growth.  Because of the huge data volume and high search rates, our search system required the index to be sharded.  We also wanted the indexing to be as simple as possible, and we wanted a stable, reliable, and very fast solution. The one thing we did not want to do was reinvent the wheel.  At this point you may ask why we didn’t choose ElasticSearch, especially since we use ElasticSearch a lot at Sematext.  The answer is that we started the engagement with this particular client a whiiiiile back, when ElasticSearch wasn’t where it is today.  And while ElasticSearch does have a number of advantages over the old master-slave Solr, with SolrCloud being in the trunk now, Solr is again a valid choice for very large search clusters.

And so we took the opportunity to use SolrCloud and some of its features not present in previous versions of Solr.  In particular, we wanted to make use of Distributed Indexing and Distributed Searching, both of which SolrCloud makes possible. In the process we looked at a few JIRA issues, such as SOLR-2358 and SOLR-2355, and got familiar with the relevant portions of the SolrCloud source code.  This confirmed SolrCloud would indeed satisfy our needs for the project, and here we are sharing what we’ve learned.

Our Search Cluster Architecture

Basically, we wanted the search cluster to look like this:

SolrCloud App Architecture

Simple? Yes, we like simple.  Who doesn’t!  But let’s peek inside that “Solr cluster” box now.

SolrCloud Features and Architecture

Some of the nice things about SolrCloud are:

  • centralized cluster configuration
  • automatic node fail-over
  • near real time search
  • leader election
  • durable writes

Furthermore, SolrCloud can be configured to:

  • have multiple index shards
  • have one or more replicas of each shard

Shards and Replicas are arranged into Collections. Multiple Collections can be deployed in a single SolrCloud cluster.  A single search request can search multiple Collections at once, as long as they are compatible. The diagram below shows a high-level picture of how SolrCloud indexing works.

SolrCloud Shards, Replicas, Replication

As the above diagram shows, documents can be sent to any SolrCloud node/instance in the SolrCloud cluster.  Documents are automatically forwarded to the appropriate Shard Leader (labeled as Shard 1 and Shard 2 in the diagram). This is done automatically and documents are sent in batches between Shards. If a Shard has one or more Replicas (labeled Shard 1 replica and Shard 2 replica in the diagram), the document also gets replicated to each of those Replicas.  Unlike in traditional master-slave Solr setups, where index/shard replication is performed periodically in batches, replication in SolrCloud is done in real time.  This is how Distributed Indexing works at a high level.  We simplified things a bit, of course – for example, there is no ZooKeeper or overseer shown in our diagram.

Setup Details

All configuration files are stored in ZooKeeper.  If you are not familiar with ZooKeeper, you can think of it as a distributed file system where SolrCloud configuration files are stored. When the first Solr instance in a SolrCloud cluster is started, its configuration files need to be sent to ZooKeeper and one needs to specify how many shards there should be in the cluster. Then, once this Solr instance/node is running, one can start additional Solr instances/nodes and point them to the ZooKeeper instance (ZooKeeper is actually typically deployed as a quorum of 3, 5, or more instances in production environments).  And voilà – the SolrCloud cluster is up!  I must say, it’s quite simple and straightforward.

Shard Replicas in SolrCloud serve multiple purposes.  They provide fault tolerance in the sense that when (not if!) a single Solr instance/node containing a portion of the index goes down, you still have one or more replicas of the data that was served by that instance elsewhere in the cluster, and thus you still have the whole data set and no data loss.  They also allow you to spread query load over more servers, thus making the cluster capable of handling higher query rates.

Indexing

As you saw above, the new SolrCloud really simplifies Distributed Indexing.  Document distribution between Shards and Replicas is automatic and real-time.  There is no master server one needs to send all documents to. A document can be sent to any SolrCloud instance and SolrCloud takes care of the rest. Because of this, there is no longer a SPOF (Single Point of Failure) in Solr.  Previously, Solr master was a SPOF in all but the most elaborate setups.
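
To make this concrete, here is a minimal indexing sketch using the SolrJ client from trunk. The CloudSolrServer class and method names reflect what trunk SolrJ exposes at the time of writing and may still change before 4.0, and the ZooKeeper addresses, collection, and field names are made up for the example:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrCloudIndexingExample {
        public static void main(String[] args) throws Exception {
            // The client is pointed at the ZooKeeper ensemble, not at any particular Solr
            // node -- it reads the cluster state from ZooKeeper just like the nodes do.
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", "SolrCloud distributed indexing");

            // It does not matter which node ends up receiving this request: SolrCloud
            // forwards the document to the appropriate Shard Leader, which replicates it
            // to that shard's Replicas in real time.
            solr.add(doc);
            solr.commit();
        }
    }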

Querying

One can query SolrCloud a few different ways:

  • One can query a single Shard, which is just like querying a single, non-distributed Solr instance.
  • The second option is to query a single Collection (i.e., search all shards holding pieces of a given Collection’s index).
  • The third option is to only query some of the Shards by specifying their addresses or names.
  • Finally, one can query multiple Collections assuming they are compatible and Solr can merge results they return.

As you can see, lots of choices!
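
Here is a rough sketch of the last three options using the trunk SolrJ client. The CloudSolrServer class and the shards / collection request parameters reflect the SolrCloud wiki and trunk code at the time of writing, so treat the exact names as assumptions; the ZooKeeper addresses, collection names, and query are made up:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SolrCloudQueryExample {
        public static void main(String[] args) throws Exception {
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");

            // Second option: query the whole Collection -- SolrCloud fans the request out
            // to one replica of every shard and merges the results.
            QueryResponse all = solr.query(new SolrQuery("user:jack"));
            System.out.println("whole collection: " + all.getResults().getNumFound());

            // Third option: restrict the request to some shards (the "shards" parameter).
            SolrQuery someShards = new SolrQuery("user:jack");
            someShards.set("shards", "shard1,shard2");
            System.out.println("two shards: " + solr.query(someShards).getResults().getNumFound());

            // Fourth option: search several compatible Collections at once
            // (the "collection" parameter).
            SolrQuery twoCollections = new SolrQuery("user:jack");
            twoCollections.set("collection", "collection1,collection2");
            System.out.println("two collections: " + solr.query(twoCollections).getResults().getNumFound());
        }
    }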

Administration with Core Admin

In addition to the standard core admin parameters there are some new ones available in SolrCloud. These new parameters let one:

  • create new Shards for an existing Collection
  • create a new Collection
  • add more nodes
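
For example, creating an additional core and attaching it to an existing Collection/Shard is a plain Core Admin HTTP call. The sketch below uses only standard Java; the action, name, collection, and shard parameter names follow the SolrCloud wiki at the time of writing and may still change before the 4.0 release, and the host, core, and collection names are made up:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CoreAdminCreateExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical host, core, collection, and shard names.
            URL url = new URL("http://localhost:8983/solr/admin/cores"
                    + "?action=CREATE&name=mycollection_shard3_replica1"
                    + "&collection=mycollection&shard=shard3");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            System.out.println("CoreAdmin CREATE returned HTTP " + conn.getResponseCode());

            // The response body is a small XML status message; a 200 means the new core
            // was registered in ZooKeeper and now serves that shard of the collection.
            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            for (String line; (line = reader.readLine()) != null; ) {
                System.out.println(line);
            }
            reader.close();
        }
    }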

The Future

If you look at the New SolrCloud Design wiki page (http://wiki.apache.org/solr/NewSolrCloudDesign) you will notice that not all planned features have been implemented yet. There are still things like cluster re-balancing or monitoring (if you are using SolrCloud already and want to monitor its performance, let us know if you want early access to SPM for SolrCloud) to be done.  Now that SolrCloud is in the Solr trunk, it should see more user and developer attention.  We look forward to using SolrCloud in more projects in the future!

@sematext

Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search

As promised in Part 1 of Solr Digest, Spring-Summer 2011, in this Part 2 post we’ll summarize what’s new with Solr’s Near-Real-Time Search support and Solr Cloud (if you love clouds and search with some big data on the side, get in touch). Let’s first examine what is being worked on for Solr Cloud and what else is in the queue for the near future. A good overview of what is currently functional can be found in the old Solr Cloud wiki page. Also, there is now another wiki page covering New Solr Cloud Design, which we find quite useful.  The individual pieces of Solr Cloud functionality that are being worked on are as follows:

  • Work is still in progress on Distributed Indexing and Shard distribution policy. Patches exist, although they are now over 6 months old, so you can expect to see them updated soon.
  • As part of the Distributed Indexing effort, shard leader functionality deals with leader election and with publishing the information about which node is the leader of which shard in ZooKeeper, in order to notify all interested parties.  The development is pretty active here and initial patches already exist.
  • At some point in the future, Replication Handler may become cloud aware, which means it should be possible to switch the roles of masters and slaves, master URLs will be able to change based on cluster state, etc. The work hasn’t started on this issue.
  • Another feature Solr Cloud will have is automatic splitting and migrating of indices. The idea is that when some shard’s index becomes too large, or the shard itself starts having bad query response times, we should be able to split off parts of that index and migrate them to (or merge them with) indices on other (less loaded) nodes. Again, the work on this hasn’t started yet.  Once this is implemented, one will be able to split and move/merge indices using the Solr Core Admin as described in SOLR-2593.
  • To achieve more efficiency in search and gain control over where exactly each document gets indexed, you will be able to define a custom shard lookup mechanism. This way, you’ll be able to limit execution of search requests to only those shards that are known to hold the target documents, thus making the query more efficient and faster.  This, along with the above-mentioned shard distribution policy, is akin to the routing functionality in ElasticSearch (see the sketch right after this list for the general idea).
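
None of this routing machinery exists in Solr yet, but the idea is easy to picture. The sketch below is purely conceptual (it is not a Solr API): a routing key is hashed to pick a shard at index time, and a query that carries the same key can then be sent to just that one shard instead of fanning out to all of them.

    // Conceptual sketch only -- this is not a Solr API; the routing work described
    // above is still being designed.  It just illustrates hash-based shard lookup.
    public final class SimpleShardRouter {
        private final int numShards;

        public SimpleShardRouter(int numShards) {
            this.numShards = numShards;
        }

        /** Pick a shard for a document based on its routing key (e.g. a user id). */
        public int shardFor(String routingKey) {
            // Mask the sign bit so the result is always a valid, non-negative shard index.
            return (routingKey.hashCode() & Integer.MAX_VALUE) % numShards;
        }
    }

A query that carries the same routing key could then be limited to shardFor(key) instead of being broadcast to every shard, which is exactly the efficiency gain described above.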

On to NRT:

  • There is now a new wiki page dedicated to Solr NRT Search. In short, NRT Search will be available in Solr 4.0 and the work currently in progress is already available on the trunk. The first new functionality that enables NRT Search in Solr is called “soft-commit”.  A soft commit is a light version of a regular commit, which means that it avoids the costly parts of a regular commit, namely the flushing of documents from memory to disk, while still allowing searches to see new documents. It appears that a common way of using this will be issuing a soft-commit every second or so, to make Solr behave as close to NRT as possible, while also issuing a “hard-commit” automatically every 1-10 minutes (see the sketch after this list). The “hard-commit” will still be needed so that the latest index changes are persisted to storage. Otherwise, in case of a crash, changes since the last “hard-commit” would be lost.
  • Initial steps in supporting NRT Search in Solr were done in Re-architect Update Handler. Some old issues Solr had were dealt with, like waiting for background merges to finish before opening a new IndexReader, blocking of new updates while a commit is in progress, and a problem where it was possible for multiple IndexWriters to be open on the same index. The work was done on the solr2193 branch and that is the place where the spinoffs of this issue will continue to move Solr even closer to NRT.
  • One of the spinoffs of the Update Handler rearchitecture is SOLR-2565, which provides further improvements on the above-mentioned issue.  New issues dealing with other related functionality will be opened along the way, while SOLR-2566 looks to serve as an umbrella issue for NRT Search in Solr.
  • Partially related to NRT Search is the new Transaction Log implemented in Solr under SOLR-2700. The goal is to provide durability of updates, while also supporting features like the already committed Realtime get.  Transaction logs are implemented in various other search solutions such as ElasticSearch and Zoie, so Simon Willnauer started a good thread about the possibility of generalizing this new Transaction Log functionality so that it is not limited to Solr, but exposed to other users and applications, such as Lucene, too.
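
As a small illustration of the soft-commit idea from the first bullet above, here is a SolrJ sketch. The three-argument commit(waitFlush, waitSearcher, softCommit) overload is part of the NRT work on trunk, so treat the exact signature (as well as the ZooKeeper address and field names) as assumptions:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SoftCommitExample {
        public static void main(String[] args) throws Exception {
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "tweet-12345");
            doc.addField("text", "soft commits make this searchable almost immediately");
            solr.add(doc);

            // Soft commit: reopen the searcher so the document becomes visible, but skip
            // the expensive flush to disk that a regular ("hard") commit performs.
            solr.commit(false, false, true);

            // A hard commit still has to happen periodically (e.g. via autoCommit every
            // 1-10 minutes) so that the changes are durable on disk.
        }
    }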

We hope you found this post useful.  If you have any questions or suggestions, please leave a comment, and if you want to follow us, we are @sematext on Twitter.

Elastic Search: Distributed, Lucene-based Search Engine

Here at Sematext we are constantly looking to expand our horizons and acquire new knowledge (as well as expand our search team – see the Sematext jobs page – we are always on the lookout for people with a passion for search and, yes, we are heavy users of ElasticSearch!), especially in and around our main domains of expertise – search and analytics.  That is why today we are talking with Shay Banon about his latest project: ElasticSearch.  If his name sounds familiar to you, that’s likely because Shay is also known for his work on GigaSpaces and Compass.
  • What is the history of Elastic Search in terms of:
  • When you got the idea?
  • How long you had that brewing in your head?
  • When did you first start cutting code for it?
I have been thinking about something along the lines of what elasticsearch has turned out to be for a few years now. As you may know, I am the creator of Compass (http://www.compass-project.org), which I started more than 7 years ago, and the aim of Compass was to ease the integration of search into any Java application.
When I developed Compass, I slowly started to add more and more features to it. For example, Compass, from the get go, supported mapping of Objects to a Search Engine (OSEM – http://www.compass-project.org/docs/2.2.0/reference/html/core-osem.html). But, it also added a JSON to Search Engine mapping layer (JSEM – http://www.compass-project.org/docs/2.2.0/reference/html/core-jsem.html) as JSON was slowly becoming a de-facto standard wire protocol.
Another aspect that I started to tackle was Compass support for a distributed mode. GigaSpaces (http://www.compass-project.org/docs/2.2.0/reference/html/needle-gigaspaces.html), Coherence (http://www.compass-project.org/docs/2.2.0/reference/html/needle-coherence.html), and Terracotta (http://www.compass-project.org/docs/2.2.0/reference/html/needle-terracotta.html) are attempts at solving that. All support using a distributed Lucene Directory implementation (scaling the storage), but, as you know with Lucene, sometimes this is not enough. With GigaSpaces, the integration took another step with sharding the index itself and using “map/reduce” to search on nodes.
The last important aspect of Compass is its integration with different mechanisms to handle content and make it searchable. For example, it has very nice integration with JPA (Hibernate, TopLink, OpenJPA), which means any change you make to the database through JPA is automatically indexed. Another integration point was with data grids such as GigaSpaces and Coherence: any change done to them gets applied to the index.
But still, Compass is a library that gets embedded in your Java app, and its solutions for distributed search are far from ideal. So, I started to play around with the idea of creating a Compass Search server that would use its mapping support (JSON) and expose itself through a RESTful API.
Also, I really wanted to try and tackle the distributed search problem. I wanted to create a distributed search solution that is inspired, in its distributed model, from how current state of the art data grids work.
So, about 7 months ago I wrote the first line of elasticsearch and have been hacking on it ever since.
  • The inevitable: How is Elastic Search different from Solr?
To be honest, I never used Solr. When I was looking around for current distributed search solutions, I took a brief look at Solr’s distributed model, and was shocked that this is what people need to deal with in order to build a scalable search solution (that was 7 months ago, so maybe things have changed). While looking at Solr’s distributed model I also noticed the very problematic “REST” API it exposes. I am a strong believer in having the product talk the domain model, and not the other way around. ElasticSearch is very much a domain driven search engine, and I explain it more here: http://www.elasticsearch.com/blog/2010/02/12/yourdatayoursearch.html. You will find this attitude throughout the elasticsearch APIs.
  • Is there a feature-for-feature comparison to Solr that would make it easier for developers of new search applications to understand the differences and choose the right tool for the job?
There isn’t one, and frankly, I am not that expert with Solr to create such a list. What I hope is that people who work with both will create such a list, hopefully with input from both projects.
  • When would one want (or need) to use Elastic Search instead of Solr and vice versa?
As far as I am concerned, elasticsearch is being built to be a complete, distributed, RESTful search solution, with all the features anyone would ever want from a search server. So, I really don’t see a reason why someone would choose Solr over ElasticSearch. To be honest, with today’s data scale, and the slow move to the cloud (or “cloud architectures”), you *need* a search engine that you can scale, and I really don’t understand how one would work with Solr’s distributed model, but that might just be me and I am spoiled by what I expect from distributed solutions because of my data grid background.
  • What was the reason for simply not working with the Solr community and enhancing Solr? (discussion, patches…)  Are some of Elastic Search’s features simply not implementable in Solr?
First, the challenge. Writing a data grid level distributed search engine is quite challenging, to say the least (note, I am not talking about data grid features, such as transactions and so on, just data grids’ distributed model).
Second, building something on top of an existing codebase will never be as good as building something from scratch. For example, elasticsearch has a highly optimized, asynchronous transport layer to communicate between nodes (which the native Java client uses), and a highly modular core where almost anything is pluggable. These are things that are very hard to introduce or change with an existing codebase and existing developers. It’s much simpler to write it from scratch.
  • We see more and more projects using GitHub.  What is your reason for choosing Git for SCM and GitHub for Elastic Search’s home?
Well, there is no doubt that Git is much nicer to work with than SVN thanks to its distributed nature (and I like things that are distributed 🙂 ). As for GitHub, I think that it’s currently the best project hosting service out there. You really feel like the people out there know developers and what developers need. As a side note, I am a sucker for eye candy, and it simply looks good.
  • We see the community has already created a Perl client.  Are there other client libraries in the works?
Yeah, so there is an excellent Perl client (http://github.com/clintongormley/ElasticSearch.pm) which Clinton Gormley has been developing (he has also been an invaluable source of suggestions/bugs for the development of elasticsearch). There are several more, including Erlang, Ruby, Python, and PHP clients (all listed here: http://www.elasticsearch.com/products/). Note, thanks to the fact that elasticsearch has a rich, domain driven, JSON API, writing clients for it is very simple since most times there is no need to perform any mappings, especially with dynamic languages.
  • We realize it is super early for Elastic Search, but what is the biggest known deployment to date?
Yes, it is quite early. But, I can tell you that some not that small sites (10s of millions of documents) are already playing with elasticsearch successfully. Not sure if I can disclose any names, but once they go live, I will try and get them to post something about it.
  • What are Elastic Search future plans, is there a roadmap?
The first things to get into ElasticSearch are more search engine features. The features are derived from the features that are already exposed in Compass, including some new aspects such as geo/local search.
Another interesting aspect is making elasticsearch more cloud provider friendly. For example, elasticsearch’s persistent store is designed in a write-behind fashion, and I would love to get one that persists the index to Amazon S3 or Rackspace CloudFiles (for more information on how persistence works with elasticsearch, see here: http://www.elasticsearch.com/blog/2010/02/16/searchengine_time_machine.html).
NoSQL is also an avenue that I would love to explore. In a similar concept to how Compass works with JPA / Data Grids, I would like to make the integration of search with NoSQL solutions simpler. It should be as simple as: you do something against the NoSQL solution, and it automatically gets applied to elasticsearch as well. Thanks to the fact that the elasticsearch model is very much domain driven, and very similar to what NoSQL uses, the integration should be simple. As an example, TerraStore already comes with an integration module that applies to elasticsearch any changes done to TerraStore. I blogged my thoughts about search and NoSQL here: http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html.
If you have additional questions for Shay about Elastic Search, please feel free to leave them in comments, and we will ask Shay to use comments to answer them.

Introducing Cloud MapReduce

The following post is the introduction to Cloud MapReduce (CMR) written by Huan Liu, CMR’s main author and the Research Manager at Accenture Technology Labs.

MapReduce is a programming model (borrowed from functional programming languages) and its associated implementation, and it was first proposed by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production index system was converted to MapReduce. Since then, it has quickly proven to be applicable to a wide range of problems. For example, roughly 10,000 MapReduce programs had been written at Google by June 2007, and there were 2,217,000 MapReduce job runs in the month of September 2007. MapReduce has also found wide application outside of the Google environment.

Cloud MapReduce is another implementation of the MapReduce programming model. Back in late 2008, we saw the emergence of a cloud Operating System (OS) – a set of software managing a large cloud infrastructure rather than an individual PC. We asked ourselves the following questions: what if we build systems on top of a cloud OS instead of directly on bare metal? Can we dramatically simplify system design? We thought we would try implementing MapReduce as a proof of concept; thus, the Cloud MapReduce project was born. At the time, Amazon was the only one that had a “complete” cloud OS, so we built on top of it. In the course of the project, we encountered a lot of problems working with the Amazon cloud OS, most of which could be attributed to the weaker consistency model it presents. Fortunately, we were able to work through all the issues and successfully built MapReduce on top of the Amazon cloud OS. The end result surprised us somewhat. Not only were we able to have a simpler design, but we were also able to make it more scalable, more fault tolerant, and faster than Hadoop.

Why Cloud MapReduce

There are already several MapReduce implementations, including Hadoop, which is already widely used. So the natural question to ask is: why another implementation? The answer is that all previous implementations essentially copy what Google described in their original MapReduce paper, but we need to explore alternative implementation approaches for the following reasons:

1) Patent risk. MapReduce is a core technology in Google. By using MapReduce, Google engineers are able to focus on their core algorithms, rather than being bogged down by parallelization details. As a result, MapReduce greatly increases their productivity. It is no surprise that Google would patent such a technology to maintain its competitive advantage. The MapReduce patent covers the Google implementation as described in its paper. Since CMR only implements the programming model, but has a totally different architecture and implementation, it poses a minimal risk w.r.t. the MapReduce patent. This is particularly important for enterprise customers who are concerned about potential risks.

2) Architectural exploration. Google only described one implementation of the MapReduce programming model in its paper. Is it the best one? What are the tradeoffs of using a different one? CMR is the first to explore a completely different architecture. In the following, I will describe what is unique about CMR’s architecture.

Architectural principle and advantages

CMR advocates component decoupling: separate out common components as independent cloud services. If we can separate out a common component as a stand-alone cloud service, the component not only can be leveraged for other systems, but it can also evolve independently. As we have seen in other contexts (e.g., SOA, virtualization), decoupling enables faster innovation.

CMR currently uses existing components offered by Amazon, including Amazon S3, SimpleDB, and SQS. By leveraging the concept of component decoupling, CMR achieves a couple of key advantages.

A fully distributed architecture. Since each component is a smaller project, it is easier to build it as a fully distributed system. Amazon has done so for all its services (S3, SimpleDB, SQS). Building on what Amazon has done, we were able to build a fully distributed MapReduce implementation with only 3,000 lines of code. A fully distributed architecture has several advantages over a master/slave architecture. First, it is more fault tolerant. Many enterprise customers are not willing to adopt something with a single point of failure, especially for their mission-critical data. Second, it is more scalable. In one comparison study, we were able to stress the master node in Hadoop so much that CMR showed a 60x performance advantage.

More efficient data shuffling between Map and Reduce. CMR uses queues as the intermediate point between Map and Reduce, which enables Map to “push” data to Reduce (rather than Reduce “pulling” data from Map). This design is similar to what is used in parallel databases, so it inherits the benefits of efficient data transfer as a result of pipelining. However, unlike in parallel databases, by using tagging, filtering, and a commit mechanism, CMR still maintains the fine-grained fault tolerance property offered by other MapReduce implementations. The majority of CMR’s performance gain (aside from the 60x gain, which comes from stressing the master node) comes from this optimization.
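
To illustrate the push-style shuffle, here is a purely conceptual sketch. It is not CMR’s actual API, and it uses an in-memory BlockingQueue where CMR would use SQS: each map task writes its tagged intermediate pairs straight onto the queue for the target reduce partition, so reducers can start consuming while maps are still running, and the tag lets a later commit/filter step discard output from failed or duplicate map attempts.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;

    // Conceptual sketch only -- not CMR's actual API.
    public class PushShuffleSketch {

        /** An intermediate record tagged with the map task that produced it, so a commit
         *  step can later filter out output from failed or duplicate map attempts. */
        public static class Tagged {
            public final String mapTaskId;
            public final String key;
            public final String value;
            public Tagged(String mapTaskId, String key, String value) {
                this.mapTaskId = mapTaskId;
                this.key = key;
                this.value = value;
            }
        }

        /** A word-count style map task that pushes each (word, 1) pair directly onto the
         *  reduce queue for its partition, instead of writing it to local disk. */
        public static void map(String mapTaskId, String record,
                               List<BlockingQueue<Tagged>> reduceQueues) throws InterruptedException {
            for (String word : record.split("\\s+")) {
                int partition = (word.hashCode() & Integer.MAX_VALUE) % reduceQueues.size();
                reduceQueues.get(partition).put(new Tagged(mapTaskId, word, "1"));
            }
        }
    }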

CMR is particularly attractive in a cloud environment due to its native integration with the cloud. Hadoop, on the other hand, is designed for a static environment inside an enterprise. If run in a cloud, Hadoop introduces additional overhead. For example, after launching a cluster of virtual machines, Amazon’s Elastic MapReduce (pay-per-use Hadoop) has to configure and set up Hadoop on the cluster, then copy data from S3 to the Hadoop file system before it can start data processing. At the end, it also has to copy results back to S3. All these steps are additional overhead that CMR does not incur, because CMR starts processing right away on S3 data directly; no cluster configuration or data copying is necessary.

CMR’s vision

Although CMR currently only runs on Amazon components, we envision that it will support a wide range of components in the future, including other clouds, such as Microsoft Windows Azure. There are a number of very interesting open source projects already, such as jclouds, libcloud, deltacloud and Dasein, that are building a layer of abstraction on top of various cloud services to hide their differences. Such middleware would make it much easier for CMR to support a large number of cloud components.

At the same time, we are also looking at how to build these components and deploy them locally inside an enterprise. Although several products, such as vCloud and Eucalyptus, provide cloud services inside an enterprise, their current versions are limited to the compute capability. There are other cloud services, such as storage and queuing, that an enterprise has to deploy to provide a full cloud capability to its internal customers. At Accenture Technology Labs, we are helping to address some pieces of the puzzle. For example, we have started a research project to design a fully distributed and scalable queue service, which is similar to SQS in functionality, but explores a different tradeoff point.

Synergy with Hadoop

Although, on the surface, CMR may seem to compete with Hadoop, there is actually quite a bit of synergy between the two projects. First, they are both moving toward the vision of component decoupling. In the recent 0.20.1 release of Hadoop, the HDFS file system was separated out as an independent component. This makes a lot of sense because HDFS is useful as a stand-alone component to store large data sets, even if the users are not interested in MapReduce at all. Second, there are lessons to be learned from each project. For example, CMR points the way on how to “push” data from Map to Reduce to streamline data transfer without sacrificing fine-grained fault tolerance. Similarly, Hadoop supports rich data types beyond simple strings, which is something that CMR will for sure inherit in the near future.

Hopefully, by now, I have convinced you that CMR is something that is at least worth a look. In the next post (coming in two weeks), I will follow up with a practical post showing step by step how to write a CMR application and how to run it. The current thinking is that I will demonstrate how to perform certain data analytics with data from search-hadoop.com, but I am open to suggestions. If you have suggestions, I would appreciate it if you could post them in the comments below.