Introducing Cloud MapReduce

The following post is the introduction to Cloud MapReduce (CMR) written by Huan Liu, CMR’s main author and the Research Manager at Accenture Technology Labs.

MapReduce is a programming model (borrowed from functional programming languages) and its associated implementation. It was first proposed by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production indexing system was converted to MapReduce. Since then, it has quickly proven applicable to a wide range of problems: roughly 10,000 MapReduce programs had been written at Google by June 2007, and 2,217,000 MapReduce jobs ran in September 2007 alone. MapReduce has also found wide application outside of the Google environment.

Cloud MapReduce is another implementation of the MapReduce programming model. Back in late 2008, we saw the emergence of a cloud Operating System (OS): a set of software managing a large cloud infrastructure rather than an individual PC. We asked ourselves the following questions: what if we build systems on top of a cloud OS instead of directly on bare metal? Could we dramatically simplify system design? We decided to try implementing MapReduce as a proof of concept; thus, the Cloud MapReduce project was born. At the time, Amazon was the only provider with a “complete” cloud OS, so we built on top of it. In the course of the project, we encountered many problems working with the Amazon cloud OS, most of which could be attributed to the weaker consistency model it presents. Fortunately, we were able to work through all of them and successfully built MapReduce on top of the Amazon cloud OS. The end result surprised us somewhat: not only is the design simpler, but it is also more scalable, more fault tolerant, and faster than Hadoop.

Why Cloud MapReduce

There are already several MapReduce implementations, including Hadoop, which is already widely used. So the natural question to ask is: why another implementation? The answer is that all previous implementations essentially copy what Google described in its original MapReduce paper, and we believe alternative implementation approaches are worth exploring for the following reasons:

1) Patent risk. MapReduce is a core technology in Google. By using MapReduce, Google engineers are able to focus on their core algorithms, rather than being bogged down by parallelization details. As a result, MapReduce greatly increases their productivity. It is no surprise that Google would patent such a technology to maintain its competitive advantage. The MapReduce patent covers the Google implementation as described in its paper. Since CMR only implements the programming model, but has a totally different architecture and implementation, it poses a minimal risk w.r.t. the MapReduce patent. This is particularly important for enterprise customers who are concerned about potential risks.

2) Architectural exploration. Google only described one implementation of the MapReduce programming model in its paper. Is it the best one? What are the tradeoffs if using a different one? CMR is the first to explore a completely different architecture. In the following, I will describe what is unique about CMR’s architecture.

Architectural principle and advantages

CMR advocates component decoupling: separate out common components as independent cloud services. If we can separate out a common component as a stand-alone cloud service, the component not only can be leveraged for other systems, but it can also evolve independently. As we have seen in other contexts (e.g., SOA, virtualization), decoupling enables faster innovation.

CMR currently uses existing components offered by Amazon, including Amazon S3, SimpleDB, and SQS. By leveraging the concept of component decoupling, CMR achieves a couple of key advantages.

A fully distributed architecture. Since each component is a smaller project, it is easier to build it as a fully distributed system. Amazon has done so for all of its services (S3, SimpleDB, SQS). Building on what Amazon has done, we were able to build a fully distributed MapReduce implementation with only 3,000 lines of code. A fully distributed architecture has several advantages over a master/slave architecture. First, it is more fault tolerant. Many enterprise customers are not willing to adopt something with a single point of failure, especially for their mission-critical data. Second, it is more scalable. In one comparison study, we were able to stress Hadoop’s master node so much that CMR showed a 60x performance advantage.

More efficient data shuffling between Map and Reduce. CMR uses queues as the intermediate point between Map and Reduce, which enables Map to “push” data to Reduce (rather than having Reduce “pull” data from Map). This design is similar to what is used in parallel databases, so it inherits the benefit of efficient data transfer through pipelining. However, unlike parallel databases, by using tagging, filtering and a commit mechanism, CMR still maintains the fine-grained fault tolerance offered by other MapReduce implementations. The majority of CMR’s performance gain (aside from the 60x gain, which comes from stressing the master node) comes from this optimization.
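To make the push model concrete, here is a minimal conceptual sketch in Java of the tagging/filtering/commit idea. It is not CMR’s actual code: CMR uses SQS queues and SimpleDB commit records and handles buffering and duplicate detection more carefully; the class and field names here are made up purely for illustration.

import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.LinkedBlockingQueue;

public class PushShuffleSketch {
    // Each record is tagged with the id of the map attempt that produced it.
    static class Tagged {
        final String mapAttemptId, key, value;
        Tagged(String mapAttemptId, String key, String value) {
            this.mapAttemptId = mapAttemptId; this.key = key; this.value = value;
        }
    }

    // Stand-ins for an SQS reduce queue and the SimpleDB commit records.
    static final BlockingQueue<Tagged> reduceQueue = new LinkedBlockingQueue<Tagged>();
    static final Set<String> committedMaps = new ConcurrentSkipListSet<String>();

    // A map task pushes its output as soon as it is produced (pipelining)...
    static void emit(String mapAttemptId, String key, String value) throws InterruptedException {
        reduceQueue.put(new Tagged(mapAttemptId, key, value));
    }

    // ...and writes a commit record only after it finishes successfully.
    static void commit(String mapAttemptId) {
        committedMaps.add(mapAttemptId);
    }

    // The reducer filters by the commit list: records from failed or duplicate
    // map attempts are never passed to the user's reduce function. The real
    // implementation defers records whose map attempt has not committed yet.
    static void reduceStep() throws InterruptedException {
        Tagged t = reduceQueue.take();
        if (committedMaps.contains(t.mapAttemptId)) {
            // feed (t.key, t.value) to the user's reduce function here
        } else {
            reduceQueue.put(t); // not committed yet: put it back and retry later
        }
    }
}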

CMR is particularly attractive in a cloud environment due to its native integration with the cloud. Hadoop, on the other hand, is designed for a static environment inside an enterprise. If run in a cloud, Hadoop introduces additional overhead. For example, after launching a cluster of virtual machines, Amazon’s Elastic MapReduce (pay-per-use Hadoop) has to configure and set up Hadoop on the cluster, then copy data from S3 to the Hadoop file system before it can start processing. At the end, it also has to copy results back to S3. All of these steps are additional overhead that CMR does not incur, because CMR starts processing S3 data directly right away; no cluster configuration or data copying is necessary.

CMR’s vision

Although CMR currently only runs on Amazon components, we envision that it will support a wide range of components in the future, including other clouds, such as Microsoft Windows Azure. There are a number of very interesting open source projects already, such as jclouds, libcloud, deltacloud and Dasein, that are building a layer of abstraction on top of various cloud services to hide their differences. These middleware layers would make it much easier for CMR to support a large number of cloud components.

At the same time, we are also looking at how to build these components and deploy them locally inside an enterprise. Although several products, such as vCloud and Eucalyptus, provide cloud services inside an enterprise, their current version is limited to the compute capability. There are other cloud services, such as storage and queue, that an enterprise has to deploy to provide a full cloud capability to its internal customers. At Accenture Technology Labs, we are helping to address some pieces of the puzzle. For example, we have started a research project to design a fully distributed and scalable queue service, which is similar to SQS in functionality, but exploring a different tradeoff point.

Synergy with Hadoop

Although, on the surface, CMR may seem to compete with Hadoop, there is actually quite a bit of synergy between the two projects. First, they are both moving toward the vision of component decoupling. In the recent 0.20.1 release of Hadoop, the HDFS file system is separated out as an independent component. This makes a lot of sense because HDFS is useful as a stand-alone component for storing large data sets, even if the users are not interested in MapReduce at all. Second, there are lessons to be learned from each project. For example, CMR points the way on how to “push” data from Map to Reduce to streamline data transfer without sacrificing fine-grained fault tolerance. Similarly, Hadoop supports rich data types beyond simple strings, which is something that CMR will surely inherit in the near future.

Hopefully, by now, I have convinced you that CMR is something that is at least worth a look. In the next post (coming in two weeks), I will follow up with a practical post showing step by step how to write a CMR application and how to run it. The current thinking is that I will demonstrate how to perform certain data analytics with data from search-hadoop.com, but I am open to suggestions. If you have suggestions, I would appreciate it if you could post them in the comments below.

Lucandra / Solandra: A Cassandra-based Lucene backend

In this guest post Jake Luciani (@tjake) introduces Lucandra (Update: now known as Solandra – see our State of Solandra post), a Cassandra-based backend for Lucene (Update: now integrated with Solr instead).
Update: Jake will be giving a talk about Lucandra in the April 2010 NY Search & Discovery Meetup.  Sign up!
Update 2: Slides from the Lucandra meetup talk are on line: http://www.jroller.com/otis/entry/lucandra_presentation
For most users, the trickiest part of deploying a Lucene based solution is managing and scaling storage, reads, writes and index optimization. Solr and Katta (among others) offer ways to address these, but still require quite a lot of administration and maintenance. This problem is not specific to Lucene. In fact most data management applications require a significant amount of administration.
In response to this problem of managing and scaling large amounts of data, the “nosql” movement has started to become more popular. One of the most popular and widely used “nosql” systems is Cassandra, an Apache Software Foundation project originally developed at Facebook.

What is Cassandra?

Cassandra is a scalable and easy-to-administer column-oriented data store, modeled after Google’s BigTable, but built by the designers of Amazon’s S3. One of the big differentiators of Cassandra is that it does not rely on a global file system the way HBase and BigTable do. Rather, Cassandra uses decentralized peer-to-peer “Gossip”, which means two things:
  1. It has no single point of failure, and
  2. Adding nodes to the cluster is as simple as pointing them at any one live node

Cassandra also has built-in multi-master writes, replication, rack awareness, and can handle downed nodes gracefully. Cassandra has a thriving community and is currently being used at companies like Facebook, Digg and Twitter to name a few.

Enter Lucandra

Lucandra is a Cassandra backend for Lucene. Since Cassandra’s original use within Facebook was for search, integrating Lucene with Cassandra seemed like a “no brainer”. Lucene’s core design makes it fairly simple to strip away and plug in custom Analyzer, Writer, Reader, etc. implementations. Rather than trying to build a Lucene Directory implementation on top of the data store, as some backends do (DbDirectory, for example), our approach was to implement an IndexReader and IndexWriter directly on top of Cassandra.

Here’s how Terms and Documents are stored in Cassandra. A Term is a composite key made up of the index, field, and term, with the document id as the column name and the position vector as the column value.
      Term Key                    ColumnName   Value
      "indexName/field/term" => { documentId , positionVector }
      Document Key
      "indexName/documentId" => { fieldName , value }
Cassandra allows us to pull ranges of keys and groups of columns so we can really tune the performance of reads as well as minimize network IO for each query. Also, since writes are indexed and replicated by Cassandra we don’t need to worry about optimizing the indexes or reopening the index to see new writes. This means we get a soft real-time distributed search engine.
There is an impact on Lucandra searches when compared to native Lucene searches. In our testing we see that Lucandra’s IndexReader is ~10% slower than the default IndexReader. However, this is still quite acceptable to us given what you get in return.
For writes, Lucandra is slower than regular Lucene, since every term is effectively written under its own key. Luckily, this will improve with the next version of Cassandra, which will allow writes to be batched across keys.
One other major caveat: there is no term scoring in the current code. We simply haven’t needed it yet. Adding it would be relatively trivial, via another column.
To see Lucandra in action you can try out the Twitter search app http://sparse.ly that is built on Lucandra. This service uses the Lucandra store exclusively and does not use any sort of relational or other type of database.

Lucandra in Action

Using Lucandra is extremely simple, and switching a regular Lucene search application to use Lucandra is a matter of just a few lines of code. Let’s have a look.

First we need to create the connection to Cassandra


import lucandra.CassandraUtils;
import lucandra.IndexReader;
import lucandra.IndexWriter;
...
Cassandra.Client client = CassandraUtils.createConnection();

Next, we create Lucandra’s IndexWriter and IndexReader, and Lucene’s own IndexSearcher.


IndexWriter indexWriter = new IndexWriter("bookmarks", client);
IndexReader indexReader = new IndexReader("bookmarks", client);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

From here on, you work with IndexWriter and IndexSearcher just like you would in vanilla Lucene. Look at the BookmarksDemo for the complete class.
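To round the demo out, here is a hedged sketch of indexing one document and running a query against it. The field names are made up; the addDocument(Document, Analyzer) signature is an assumption based on the BookmarksDemo (Lucandra’s IndexWriter is not Lucene’s), while the searching part is standard Lucene 2.9/3.0 API.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;
...
// Index a bookmark (hypothetical fields).
Document doc = new Document();
doc.add(new Field("url", "http://lucene.apache.org", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("title", "Apache Lucene", Field.Store.YES, Field.Index.ANALYZED));
// Assumed signature; check the BookmarksDemo for the exact call.
indexWriter.addDocument(doc, new StandardAnalyzer(Version.LUCENE_29));

// Searching goes through the plain Lucene API.
TopDocs results = indexSearcher.search(new TermQuery(new Term("title", "lucene")), 10);
for (ScoreDoc hit : results.scoreDocs) {
    System.out.println(indexSearcher.doc(hit.doc).get("url"));
}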

What’s next? Solandra!

Now that we have Lucandra, we can use it with anything built on Lucene. For example, we can integrate Lucandra with Solr and simplify Solr administration. In fact, this has already been attempted and we plan to support it in our code soon.

Mahout Digest, February 2010

Last month we published the Lucene Digest, followed by the Solr Digest, and wrapped the month with the HBase Digest (just before Yahoo posted their report showing Cassandra beating HBase in their benchmarks!).  We are starting February with a fresh Mahout Digest.

When covering Mahout, it seems logical to group topics following Mahout’s own groups of core algorithms.  Thus, we’ll follow that grouping in this post, too:

  • Recommendation Engine (Taste)
  • Clustering
  • Classification

There are, of course, some common concepts, some overlap, like n-grams.  Let’s talk n-grams for a bit.

N-grams

There has been a lot of talk about n-gram usage through all of the major subject areas on the Mahout mailing lists. This makes sense, since n-gram-based language models are used in various areas of statistical Natural Language Processing.  An n-gram is a subsequence of n items from a given sequence of “items”. The “items” in question can be anything, though most commonly n-grams are made up of character or word/token sequences.  Lucene’s n-gram support, provided through NGramTokenizer, tokenizes an input String into character n-grams and can be useful when building character n-gram models from text. When there is an existing Lucene TokenStream and a character n-gram model is needed, NGramTokenFilter can be applied to the TokenStream.  Word n-grams are sometimes referred to as “shingles”.  Lucene helps there, too.  When word n-gram statistics or a model are needed, ShingleFilter or ShingleMatrixFilter can be used.
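As an example, here is a small sketch of producing character n-grams with Lucene’s NGramTokenizer, using the 2.9/3.0-era TokenStream API (TermAttribute); the input string is arbitrary.

import java.io.StringReader;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class CharNGramExample {
    public static void main(String[] args) throws Exception {
        // 2- and 3-character n-grams of the input string
        NGramTokenizer ngrams = new NGramTokenizer(new StringReader("mahout"), 2, 3);
        TermAttribute term = ngrams.addAttribute(TermAttribute.class);
        while (ngrams.incrementToken()) {
            System.out.println(term.term());   // ma, ah, ho, ..., mah, aho, ...
        }
    }
}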

Classification

The usage of character n-grams in the context of classification and, more specifically, the possibility of applying Naive Bayes to character n-grams instead of word/term n-grams, is discussed here. Since the Naive Bayes classifier is a probabilistic classifier that can treat features of any type, there is no reason it could not be applied to character n-grams, too. Using a character n-gram model instead of a word model in text classification could result in more accurate classification of shorter texts.  Our language identifier is a good example of such a classifier (though it doesn’t use Mahout), and it provides good results even on short texts; try it.

Clustering

Non-trivial word n-grams (aka shingles) extracted from a document can be useful for document clustering. Similar to the usage of a document’s term vectors, this thread proposes using non-trivial word n-grams as a foundation for clustering. For extracting word n-grams or shingles from a document, Lucene’s ShingleAnalyzerWrapper is suggested; it wraps the previously mentioned ShingleFilter around another Analyzer.  Since clustering (grouping similar items) is an unsupervised type of machine learning, it is always interesting to validate clustering results. In clustering there is no reference training or, more importantly, reference test data, so evaluating how well a clustering algorithm works is not a trivial task. Although good clustering results are intuitive and often easy to evaluate visually, it is hard to implement an automated test. Here is an older thread about validating Mahout’s clustering output, which resulted in an open JIRA issue.
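A hedged sketch of the shingle extraction mentioned above, using ShingleAnalyzerWrapper from Lucene’s contrib analyzers around a StandardAnalyzer (the field name and text are arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ShingleExample {
    public static void main(String[] args) throws Exception {
        // Wrap StandardAnalyzer so it also emits word n-grams up to size 3.
        Analyzer analyzer = new ShingleAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29), 3);
        TokenStream ts = analyzer.tokenStream("body", new StringReader("apache mahout machine learning"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term());   // unigrams plus shingles like "apache mahout"
        }
    }
}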

Recommendation Engine

There is an interesting thread about content-based recommendation: what content-based recommendation really is and how it should be defined. So far, Mahout has only a Collaborative Filtering-based recommendation engine, called Taste.  Two different approaches are presented in that thread. One approach treats content-based recommendation as a Collaborative Filtering problem, or a generalized Machine Learning problem, where item similarity is based on Collaborative Filtering applied to items’ attributes or ‘related’ users’ attributes (usual Collaborative Filtering treats an item as a black box).  The other approach is to treat content-based recommendation as a “generalized search engine” problem: matching the same or similar queries makes two items similar. Just think of queries composed of, say, keywords extracted from a user’s reading or search history and this will start making sense.  If items have enough textual content, then content-based analysis (similar items are those that have similar term vectors) seems like a good approach for implementing content-based recommendation.  This is actually nothing novel (people have been (ab)using Lucene, Solr, and other search engines as “recommendation engines” for a while), but content-based recommendation is a recently discussed topic of possible Mahout expansion.  Algorithms in Mahout tend to run on top of Hadoop as MapReduce jobs, but in the current release Taste does not have a MapReduce version. You can read more about the MapReduce Collaborative Filtering implementation in Mahout’s trunk. If you are in need of a working recommendation engine (that is, a whole application built on top of the core recommendation engine libraries), have a look at Sematext’s recommendation engine.
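Since “similar items are those that have similar term vectors” is easy to make concrete, here is a tiny, self-contained sketch of cosine similarity over term-frequency maps; a real system would of course use Lucene/Mahout vectors rather than plain HashMaps, and the item contents below are made up.

import java.util.HashMap;
import java.util.Map;

public class TermVectorSimilarity {
    // Cosine similarity between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> item1 = new HashMap<String, Integer>();
        item1.put("lucene", 3); item1.put("search", 2);
        Map<String, Integer> item2 = new HashMap<String, Integer>();
        item2.put("lucene", 1); item2.put("mahout", 4);
        System.out.println(cosine(item1, item2)); // higher score = more similar content
    }
}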

In addition to Mahout’s basic machine learning algorithms, there are discussions and development in directions that don’t fall under any of the above categories, such as collocation extraction. Phrase extractors often use a word n-gram model for co-occurrence frequency counts. Check the thread about collocations, which resulted in a JIRA issue and a first implementation. You can also find more details on how the log-likelihood ratio can be used in the context of collocation extraction in this thread.

Of course, anyone interested in Mahout should definitely read Mahout in Action (we got ourselves a MEAP copy recently) and keep an eye on the features planned for the upcoming 0.3 release.

HBase Digest, January 2010

Here at Sematext we are making more and more use of the Hadoop family of projects. We are expanding our digest post series with this HBase Digest, adding it to the existing Lucene and Solr coverage. For those of you who want to stay up to date with HBase community discussions and benefit from the knowledge-packed conversations that happen there, but can’t follow all those high-volume Hadoop mailing lists, we also include brief mailing list coverage.

  • HBase 0.20.3 has just been released. Nice way to end the month. It includes fixes for a huge number of bugs, fixes for EC2-related issues, and a good number of improvements. HBase 0.20.3 uses the latest 3.2.2 version of ZooKeeper. We should also note that another distributed, column-oriented database from Apache was released a few days ago, too – Cassandra 0.5.0.
  • An alternative indexed HBase implementation (HBASE-2037) was reported as completed (and included in 0.20.3). It speeds up scans by adding indexes to regions rather than secondary tables.
  • HBql was announced this month, too. It is an abstraction layer for HBase that introduces a SQL dialect for HBase and JDBC-like bindings, i.e., a more familiar API for HBase users. Thread…
  • Ways of integrating with HBase (instantiating HTable) on the client side: a template for HBase Data Access (for integration with the Spring framework), simple Java Beans mapping for HBase, and the HTablePool class.
  • HbaseExplorer – an open-source web application that helps with simple HBase data administration and monitoring.
  • There was a discussion about the possibilities of splitting the process of importing very large data volumes into HBase in separate steps. Thread…
  • To get any parallelization, you have to start multiple JVMs in the current Hadoop version. Thread…
  • Tips for increasing HBase write speed: use random int keys to distribute the load between RegionServers (see the sketch after this list); use a multi-process client instead of a multi-threaded client; set a higher heap size in conf/hbase-env.sh and give it as much as you can without swapping; consider LZO compression to hold the same amount of data in fewer regions per server. Thread…
  • Some advice on hardware configuration for the case of managing 40-50K records/sec write speed. Thread…
  • A secondary index can go out of sync with the base table in case of I/O exceptions during commit (when using the transactional contrib). Handling of such exceptions in the transactional layer should be revised. Thread…
  • A Configuration instance (as well as an HBaseConfiguration instance) is not thread-safe, so do not change it when sharing it between threads. Thread…
  • What is the minimal number of boxes for an HBase deployment? Covers both HA and non-HA options, which deployments can share the same boxes, etc. Thread…
  • Optimizing random reads: using client-side multi-threading will not improve reads greatly according to some tests, but there is an open JIRA issue HBASE-1845 related to batch operations. Thread…
  • Exploring possibilities for server-side data filtering. Discussed classpath requirements for that and the variants for filters hot-deploy. Thread…
  • How-to: Configure table to keep only one version of data. Thread…
  • Recipes and hints for scanning more than one table in Map. Thread…
  • Managing timestamps for Put and Delete operations to avoid unwanted overlap between them. Thread…
  • The amount of replication should have no effect on read performance, whether using a scanner or random access. Thread…
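To make the write-speed tip about randomized keys concrete, here is a minimal sketch against the HBase 0.20 client API; the table name, column family, and payloads are made up, and the actual gains depend on your key distribution and region layout.

import java.util.Random;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomKeyWriteSketch {
    public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        HTable table = new HTable(conf, "events");    // hypothetical table with family "data"
        table.setAutoFlush(false);                     // buffer puts client-side

        Random random = new Random();
        for (int i = 0; i < 100000; i++) {
            // A random int prefix spreads consecutive writes across RegionServers
            // instead of hammering the region that owns the "latest" key range.
            byte[] rowKey = Bytes.toBytes(random.nextInt() + "-" + i);
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("data"), Bytes.toBytes("value"), Bytes.toBytes("payload-" + i));
            table.put(put);
        }
        table.flushCommits();
    }
}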

Did you really make it this far down?  🙂 If you are into search, see January 2010 Digests for Lucene and Solr, too.

Please, share your thoughts about the Digest posts as they come. We really like to know whether they are valuable.  Please tell us what format and style you prefer and, really, any other ideas you have would be very welcome.


UIMA Talk at NY Search & Discovery Meetup

This coming February, the NY Search & Discovery meetup will be hosting Dr. Pablo Duboue from IBM Research and his talk about UIMA.

The following is an excerpt from the blurb about Pablo’s talk:

“… In this talk, I will briefly present UIMA basics before discussing full UIMA systems I have been involved in the past (including our Expert Search system in TREC Enterprise Track 2007). I will be talking about how UIMA supported the construction of our custom NLP tools. I will also sketch the new characteristics of the UIMA Asynchronous Scaleout (UIMA AS) subproject that enable UIMA to run Analysis Engines in thousands of machines… “

The talk/presentation will be on February 24, 2010.  Sign up (free) at http://www.meetup.com/NYC-Search-and-Discovery/calendar/12384559/ .

Solr Digest, January 2010

Similar to our Lucene Digest post, we’ll occasionally cover recent developments in the Solr world. Although Solr 1.4 was released only two months ago, work on new features for 1.5 (located in the svn trunk) is in full gear:

1) GeoSpatial search features have been added to Solr. The work is based on the Local Lucene project, donated to the Lucene community a while back and converted to a Lucene contrib module. The features to be incorporated into Solr will allow one to implement things like:

  • filter by a bounding box – find all documents that match some specific area
  • calculate the distance between two points
  • sort by distance
  • use the distance to boost the score of documents (this is different from sorting and would be used when you want some other factors to affect the ordering of documents)

The main JIRA issue is SOLR-773, and it is scheduled to be implemented in Solr 1.5.

2) Solr Autocomplete component – autocomplete functionality is a very common requirement. Solr itself doesn’t have a component that provides it (unlike spellchecking, which SpellCheckComponent provides, although in limited form; more on this in another post or, if you are curious, have a look at the DYM ReSearcher). There are a few common approaches to solving this problem, for instance by using the recently added TermsComponent (for more information, check this excellent showcase by Matt Weber; see also the example request below).  These approaches, however, have some limitations. The first limitation that comes to mind is spellchecking and correction of misspelled words. You can see such a feature in Firefox’s Google search bar: if you type “Washengton”, you’ll still get “Washington” offered, while the TermsComponent (and some other) approaches fail here.

The aim of SOLR-1316 is to provide an autocomplete component in Solr out of the box. Since it is still in development, you can’t quite rely on it yet, but it is scheduled to be released with Solr 1.5. In the meantime, you can check another Sematext product, which offers AutoComplete functionality with a few more advanced features. It is a constantly evolving product whose features have been heavily based on real-life customer feedback.  It uses an approach unlike that of TermsComponent and is, therefore, faster. Also, one of the features currently in development (and soon to be released) is spell-correction of queries.
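For reference, a TermsComponent-based prefix lookup (the approach mentioned above) in Solr 1.4 looks roughly like the request below; the field name is made up, and it assumes a request handler registered at /terms, as in the example solrconfig.xml:

http://localhost:8983/solr/terms?terms=true&terms.fl=title&terms.prefix=wash&terms.limit=10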

3) One very nice addition in Solr 1.5 is edismax, or Extended Dismax. Dismax is very useful for situations where you want to let searchers just enter a few free-text keywords (think Google), without field names, logical operators, etc.  Extended Dismax was contributed by Lucid Imagination.  Here are some of its features:

  • Supports full Lucene query syntax in the absence of syntax errors
  • Supports “and”/”or” to mean “AND”/”OR” in Lucene syntax mode
  • When there are syntax errors, improved smart partial escaping of special characters is done to prevent them; in this mode, fielded queries, +/-, and phrase queries are still supported
  • Improved proximity boosting via word bi-grams; this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field
  • Advanced stopword handling: stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity-boosting part. If a query consists of all stopwords (e.g. “to be or not to be”), then all will be required
  • Supports the “boost” parameter, which is like the dismax bf param but multiplies the function query instead of adding it in
  • Supports pure negative nested queries: a query like +foo (-foo) will match all documents

You can check the development in JIRA issue SOLR-1553.

4) Up until version 1.5, distributed Solr deployments depended not only on Solr features (like replication or sharding), but also on external systems, like load balancers. Now, ZooKeeper is used to provide a Solr-specific naming service (check JIRA issue SOLR-1277 for the details). The features we’ll get look exciting:

  • Automatic failover (i.e. when a server fails, clients stop trying to index to or search it and use a different server)
  • Centralized configuration management (i.e. new solrconfig.xml or schema.xml propagate to a live Solr cluster)
  • Optionally allow shards of a partition to be moved to another server (i.e. if a server gets hot, move the hot segments out to cooler servers).

We’ll cover some of these topics in more detail in future installments. Thanks for reading!

Lucene Digest, January 2010

In this debut Lucene Digest post we’ll cover some of the recent interesting happenings in Lucene-land.

  • For anyone who missed it, Lucene 3.0.0 is out (November 2009).  The difference between 3.0.0 and the previous release is that 3.0.0 has no deprecated code (big cleanup job) and that the support for Java 1.4 was finally dropped!
  • As of this writing, there is only 1 JIRA issue targeted for 3.0.1 and 104 targeted for 3.1 Lucene release.
  • Luke goes hand in hand with Lucene, and Andrzej quickly released a version of Luke that uses the Lucene 3.0.0 jars.  There were only minor changes in the code (and none in functionality), related to the API changes between Lucene 2.9.1 and 3.0.
  • As usual, the development is happening on Lucene’s trunk, but there is now also a “flex branch” in svn, where features related to Flexible Indexing (not quite up to date) are happening.  Lucene’s JIRA also has a Flex Branch “version”, and as of this writing there are 6 big issues being worked on there.
  • The new Lucene Connectors Framework (or LCF for short) subproject is in the works.  As you can guess from the name, LCF aims to provide a set of connectors to various content repositories, such as relational databases, SharePoint, EMC Documentum, file systems, Windows file shares, various Content Management Systems, even feeds and Web sites.  LCF is based on a code donation from MetaCarta and will be going through the ASF Incubator. We expect it to graduate from the Incubator into a regular Lucene TLP subproject later this year.
  • Spatial search is hot, or at least that’s what it looks like from inside Lucene/Solr.  Both projects’ developers and contributors are busy adding support for spatial/geo search.  Just in Lucene alone, there are currently 21 open JIRA issues aimed at spatial search.  Lucene’s contrib/spatial is where spatial search lives.  We’ll cover Solr’s support for spatial search in a future post.
  • Robert Muir and friends have added some final-state automata technology to Lucene and dramatically improved regular expression and wildcard query performance.  Fuzzy queries may benefit from this in the future, too.  The ever industrious Robert covered this in a post with more details.

These are just some highlights from Luceneland.  More Lucene goodness coming soon.  Please use comments to tell us whether you find posts like this one useful or whether you’d prefer a different angle, format, or something else.

Sponsoring NYC Search & Discovery Meetup

We were busy in 2009, and so was Otis, who organized the NYC Search & Discovery Meetup, which we happily sponsor.  So far a few interesting talks/presentations have taken place and more are scheduled for 2010 (presenters for the next few months are already known).  Also, Otis and Daniel gave a Faceted Search presentation (slides) at the NY CTO Club in December 2009, and Otis presented Lucene (slides and video) at the Semantic Web Meetup in October 2009.  More to come in 2010!

Lucene TLP Digests

We promised we’d blog more in 2010.  One of the things you’ll see on this blog in 2010 is regularly irregular periodic Digests of interesting or significant developments in various projects under the Lucene Top Level Project.  The thinking is that with the ever-increasing traffic on various Lucene TLP mailing lists, we are not the only ones having a hard time keeping up.  Since we have to closely follow nearly all Lucene TLP projects for our work anyway, we may as well try to help others by periodically publishing posts summarizing key developments.  We’ll tag and categorize our Digest posts appropriately and consistently, so you’ll be able to subscribe to them directly if that’s all you are interested in.  We’ll try to get @sematext to tweet as well, so you can follow us there, too.