HBase Digest, February 2010

The first HBase Digest post received very good feedback from the community. We continue using HBase at Sematext and thus continue covering the status of the HBase project with this post.

  • Added Performance Evaluation for IHBase. The PE does a good job of showing what IHBase is good and bad at.
  • The transactional contrib stuff is geared more toward short-duration transactions, but it should be possible to share transaction state across machines if certain rules are kept in mind. Thread…
  • Choosing between Thrift and REST connectors for communicating with HBase outside of Java is explained in this thread.
  • How to properly set up ZooKeeper to be used by HBase (how many instances/resources should be dedicated to it, etc.) is discussed in this thread. You can find some more info in this one, too.
  • Yahoo Research has developed a benchmarking tool for “Cloud Serving Systems”. In their paper describing the tool which they intend to open source soon, they compare four “Cloud Serving Systems” and HBase is one of them. Please, also read the explanation from HBase dev team about the numbers inside this paper.
  • HBase trunk has been migrated to a Maven build system.
  • A new branch was opened for an updated 0.20 version that runs on Hadoop 0.21. This lets 0.20.3 or 0.20.2 clients operate against HBase running on HDFS 0.21 (with durable WAL, etc.) without any change on the client side. Thread…
  • Since Hadoop 0.21 isn’t going to be released soon and the HBase team is waiting for critical changes (HDFS-265, HDFS-200, etc.) to be applied to make HBase users’ lives easier, HBase trunk is likely to support both 0.21 and the patched 0.20 versions of Hadoop. There was a discussion about the naming convention for HBase releases with regard to this, which also touches on plans for which features to include in the nearest releases.
  • Cloudera’s latest release now includes HBase-0.20.3.
  • Exploring possible solutions to “write only top N rows from reduce outcome”. Thread…
  • A new handy binary comparator was added that only compares up to the length of the supplied byte array.
  • These days, HBase developers are working hard on the very sweet “Multi data center replication” feature. It is aimed at 0.21 and will support federated deployments where someone might have terascale (or beyond) clusters in more than one geography and would want the system to handle replication between the clusters/regions.

We’d also like to introduce a small FAQ and FA (frequent advice) section to save some time for the HBase dev team, which is very supportive on the mailing lists.

  • How to move/copy all data from one HBase cluster to another? If you stop the source cluster, then you can distcp the /hbase directory to the other cluster. Done. A perfect copy.
  • Is there a way to get the row count of a table from the Java API? There is no single-call method to do that; the actual row count isn’t stored anywhere. You can use the “count” command from the HBase shell, which iterates over all records and may take a lot of time to complete. The count can also be obtained via a table scan or a distributed count (usually a MapReduce job).
  • I’m editing property X in hbase-default.xml to… You should edit hbase-site.xml, not hbase-default.xml.
  • Inserting row keys with an incremental ID is usually not a good idea, since sequential writes are usually slower than random writes (they all hit the same region at a time). If you can’t find a natural row key (which is good for scans), use a UUID.
  • Apply HBASE-2180 patch to increase random read performance in case of multiple concurrent clients.
  • How can I perform a “select * from tableX where columnY=Z”-like query in HBase? You’ll need to use a Scan along with a SingleColumnValueFilter (see the sketch after this list). But this isn’t quick; it’s like performing a SQL query on a column that isn’t indexed: the more data you have, the longer it will take. There is no support for secondary indexes in HBase core, so you need to use one of the contribs (2 are available in 0.20.3: src/contrib/indexed and src/contrib/transactional). Another option is maintaining the indexes yourself.
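
To make the last item above more concrete, here is a minimal sketch of such a filtered scan against the HBase 0.20.x client API. The table, column family, qualifier and value names ("tableX", "family", "columnY", "Z") are just placeholders taken from the question; in particular, "family" is a hypothetical column family.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScanExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "tableX");

    Scan scan = new Scan();
    // Keep only rows where family:columnY equals "Z". Every region is still
    // scanned, so this gets slower as the table grows.
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("family"), Bytes.toBytes("columnY"),
        CompareOp.EQUAL, Bytes.toBytes("Z")));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        System.out.println(row);  // process each matching row
      }
    } finally {
      scanner.close();
    }
  }
}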

Some other efforts that you may find interesting:

Solr Digest, February 2010

This second installment of Solr Digest (see Solr January Digest) will cover 8 topics, some of which are quite new and some with very long history (and still uncertain future).

So, here we go:

1. solr.ISOLatin1AccentFilterFactory is a commonly used filter factory which replaces accented characters in the ISO Latin 1 charset with their unaccented versions (for instance, ‘à’ is replaced with ‘a’). However, the underlying Lucene filter, ISOLatin1AccentFilter, is already deprecated in favor of ASCIIFoldingFilter in Lucene 2.9 (BTW, the Solr 1.4 release uses Lucene 2.9.1, while trunk with the future Solr 1.5 uses Lucene 2.9.2) and has been deleted from Lucene 3.0. Of course, Solr already has a filter factory for the replacement, solr.ASCIIFoldingFilterFactory, so it would probably be wise to start using it in your Solr schemata if you are still using the old ISOLatin1AccentFilter. Functionality-wise, there are no differences between these two filters, except that ASCIIFoldingFilter covers a superset of ISO Latin 1, meaning it converts everything ISOLatin1AccentFilter was converting and some more.
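
If you are curious what the folding does outside of Solr, here is a minimal Lucene-level sketch (assuming Lucene 2.9.x on the classpath); the class name and sample text are made up for illustration:

import java.io.StringReader;
import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class FoldingExample {
  public static void main(String[] args) throws Exception {
    // "déjà vu à la carte" should come out as "deja vu a la carte"
    TokenStream ts = new ASCIIFoldingFilter(
        new WhitespaceTokenizer(new StringReader("déjà vu à la carte")));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.print(term.term() + " ");
    }
    ts.close();
  }
}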

2. DataImportHandler became multithreaded – after gaining all sorts of functionality over time, DataImportHandler got a performance boost. Your multicore servers will be happy to try it :). As part of JIRA issue SOLR-1352, a patch was created and committed to trunk, so you can expect this functionality in the Solr 1.5 release, or you can already try it with one of the Solr 1.5 nightly builds.

3. Script based UpdateRequestProcessorFactory – one very interesting feature still in development (JIRA issue SOLR-1725) is support for a script-based UpdateRequestProcessorFactory. It will depend on Java 6 script engine support (so Java 5-based Solr installations will not benefit here, although an upgrade to Java 6 is definitely recommended) and will be very easy to use. The scripts will have to be placed under the SOLR_HOME/conf directory and their names will be defined in solrconfig.xml, like this:


<updateRequestProcessorChain name="script">
  <processor>
    <str name="scripts">updateProcessor.js</str>
    <lst name="params">
      <bool name="boolValue">true</bool>
      <int name="intValue">3</int>
    </lst>
  </processor>
</updateRequestProcessorChain>

Implementations would also be simple; here is an example of updateProcessor.js (copied from the patch which brings this functionality):


function processAdd(cmd) {
  functionMessages.add("processAdd1");
}

function processDelete(cmd) {
  functionMessages.add("processDelete1");
}

function processMergeIndexes(cmd) {
  functionMessages.add("processMergeIndexes1");
}

function processCommit(cmd) {
  functionMessages.add("processCommit1");
}

function processRollback(cmd) {
  functionMessages.add("processRollback1");
}

function finish() {
  functionMessages.add("finish1");
}

4. Similar to SolrJ API for communicating with Solr, there are numerous Solr clients for other languages, especially the dynamic scripting languages. As with all scripting languages, one of the main advantages over using pure Java is simplicity and development speed. You just write a few lines of code and immediately run the script — no need for compiling. At Sematext we find them especially handy when making changes to Solr installations, for quickly testing if Solr behaves as we expect.  One excellent solution for all Ruby lovers is RSolr.  Coincidentally, RSolr will be covered in the upcoming Solr in Action.

5. Field Collapsing – this is a very frequently needed feature, but one without a satisfactory solution in Solr. There is a long history behind this functionality; everything started back when Solr was at version 1.3, with issue SOLR-236. It was never committed to svn, so you basically had to pick one of the many patches available in JIRA and apply it to your distribution. Since Solr was constantly developing, patches would pretty quickly become obsolete, so new versions would be created. Even when you found the correct patch for your Solr version, you would get occasional errors, so this surely wasn’t good enough for enterprise customers.

Recently, renewed effort has been invested into this issue and there are plans for this feature to finally be included in Solr 1.5. However, the current implementation still isn’t good enough (some users report OutOfMemory errors), so it seems we’ll have to wait a while longer for an enterprise-quality “field collapsing” solution in Solr.

In light of the problems with the SOLR-236 solution, a new JIRA issue, SOLR-1773, was created. Its goal is to provide a “lightweight” implementation of this feature. There is already a patch containing this implementation, and some measurements show the approach has potential, but it still isn’t ready for serious deployments. The same approach is also implemented in SOLR-1682.

As you can see, work to provide field collapsing is underway, but we’re still some time away from committed code.

6. SystemStatsRequestHandler – designed to provide the statistics from stats.jsp to clients that access Solr through APIs like SolrJ or RSolr, it is being developed as JIRA issue SOLR-1750. It is destined to be included in Solr 1.5, but for now it is available as a Java class attached to the issue. Before it is committed to svn, it might get another name.

7. While Lucene just saw its 2.9.2 and 3.0.1 versions released, Solr trunk already has the latest Lucene 2.9.*, as described in this thread.

8. We’ve saved the best for last.  If you could have one feature in Solr… Check out this informative thread to see what people want from Solr that Solr doesn’t already have.  What do you want from Solr? Post your Solr desires in comments.

Hadoop Digest, February 2010

We published the HBase Digest last month, but this is our first ever Hadoop Digest, in which we cover the HDFS and MapReduce pieces of the Hadoop ecosystem.  Before we get started, let us point out that we recently published a guest post titled Introduction to Cloud MapReduce, which should be interesting to all users of Hadoop, as well as its developers.

As of this writing, there are 34 open issues in JIRA scheduled for the 0.21.0 release, most of them considered “major” and 4 considered “critical” or “blocker”. There is quite a lot of work to do before 0.21.0 is out.  Hadoop developers are working hard, while at the same time providing tons of very helpful answers and advice on the mailing lists. Please find a summary of the most interesting discussions, along with information on current Hadoop API usage, below.

  • After several rejections, the USPTO granted Google a patent for MapReduce. Find the community reactions in this thread and in this one.
  • What are security mechanisms in HDFS and what should we expect in the near future? Presentation, Design Document, Thread…
  • An attempt was made to get Hadoop into the Debian Linux distribution. All relevant links and summary can be found in this thread.
  • Consider using LZO compression, which allows a compressed file to be split across Map tasks. GZIP is not splittable.
  • Use the Python-based scripts to utilize EBS for NameNode and DataNode storage and get persistent, restartable Hadoop clusters running on EC2. The old scripts (in src/contrib/ec2) will be deprecated.
  • Do not rely on the uniqueness of objects in the “values” parameter when implementing reduce(T key, Iterable<T> values, Context context); the same object instances can be reused by the framework. Thread…
  • In order for long-running tasks not to be aborted, use the Reporter object to provide a “task heartbeat”. If a task goes longer than 600 seconds (the default) without reporting progress, the framework assumes it is stalled and kills it. See the reducer sketch after this list.
  • Setting up DNS lookup properly (caching DNS servers, reverse DNS setup) for a big cluster to avoid DNS requests traffic flood is discussed in this thread.
  • Setting an output compression codec other than the default programmatically is explained in this thread.
  • What are the version compatibility rules for Hadoop distributions? Read the hot discussion here.
  • Critical issue HDFS-101 (DFS write pipeline: DFSClient sometimes does not detect second DataNode failure) was reported and fixed (and compatible with DFSClient 0.20.1) and will be included in 0.20.2.
  • The Text type is meant for use when you have a UTF-8-encoded string. Creating a Text object from a byte array that is not proper UTF-8 is likely to result in an exception or data mangling; BytesWritable should be used for arbitrary binary data instead.
  • How to make a particular “section of code” run in only one of the mappers? (Or how to share some flag state between tasks running on different machines?) Thread…
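
Putting two of the items above together (value-object reuse and the task heartbeat), here is a rough reducer sketch against the new org.apache.hadoop.mapreduce API; the class and key/value types are made up for illustration, and in the old API you would call reporter.progress() instead of context.progress():

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LongWorkReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // The framework reuses the same Text instance on every iteration,
      // so copy the value if you need to hold on to it.
      Text copy = new Text(value);

      // ... potentially long-running work on 'copy' ...

      // Report progress so the framework does not assume the task is
      // stalled and kill it after mapred.task.timeout (600s by default).
      context.progress();
    }
    context.write(key, new Text("done"));
  }
}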

We would also like to add a small FAQ section here covering common user questions.

  1. MR. Is there a way to cancel/kill the job?
    Invoke command: hadoop job -kill jobID
  2. MR. How to get the name of the file that is being used for the map task?
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    String sFileName = fileSplit.getPath().getName();
  3. MR. When framework splits a file, can some part of a line fall in one split and the other part in some other split?
    In general, a file split may break records; it is the responsibility of the record reader to present each record as a whole. If you use the standard InputFormats, the framework will make sure complete records are presented as <key, value> pairs.
  4. HDFS. How to view text content of SequenceFile?
    A SequenceFile is not a text file, so you cannot view its content with the UNIX cat command. Use the hadoop command: hadoop fs -text <src>
  5. HDFS. How to move file from one dir to another using Hadoop API?
    Use FileSystem#rename(Path, Path) to move files (see the snippet after this list). The copy methods will leave you with two copies of the same file.
  6. Cluster setup. Some of my nodes are in the blacklist, and I want to reuse them again. How can I do that?
    Restarting the trackers removes them from the blacklist.
  7. General. What command should I use to…? How should I use command X?
    Please refer to the Commands Guide page.
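
For question 5 above, a minimal sketch (the paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveFileExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // rename() moves the file and returns false if the move could not be done
    boolean moved = fs.rename(new Path("/user/alice/input/data.txt"),
                              new Path("/user/alice/archive/data.txt"));
    System.out.println("moved: " + moved);
  }
}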

There were also several efforts (be patient, some of them are still somewhat rough) that might be of interest:

  • JRuby on Hadoop is a thin wrapper around Hadoop’s Mapper / Reducer for JRuby, not to be confused with Hadoop Streaming.
  • Stream-to-hdfs is a simple utility for streaming stdin to a file in HDFS.
  • Crane manages Hadoop cluster using Clojure.
  • Piglet is a DSL for writing Pig Latin scripts in Ruby.

Thank you for reading us! We highly appreciate all feedback to our Digests, so tell us what you like or dislike.

Run FAST to Open-Source Search

Ever since Microsoft announced they are halting all development of FAST for Linux, starting now, every single organization involved in search has had something to say on this topic.  Here are our thoughts on that.

Apparently 80% of FAST users are Linux users.  We won’t speculate what is really behind this seemingly crazy decision to turn off the FAST@Linux dollar faucet.  Over at Kellblog, the Mark Logic CEO already itemized whose door the current FAST@Linux customers might want to knock on next, depending on what they are using FAST for.  Unfortunately, he forgot one important angle there, one key option for FAST@Linux users that in this day and age should absolutely not be ignored.  After every storm comes sunshine.  The same is happening here.  While being forced to start thinking about changing one’s search solution probably isn’t pleasant, it does open another important door, a big opportunity one should not ignore. The question we would pose to FAST customers is the following: is there something you’ve always wanted to do with FAST, but never could?  Here is your chance to change that! That door that just opened… walk through it, take a look around, you just may like it.  Read on.

While FAST@Linux users may be experiencing mental turbulence now, they know there are open-source tools and solutions out there that are more scalable and faster than FAST and don’t involve any crazy per-document, per-query, per-xyz licensing fees.  (The part about superior performance refers to Lucene and solutions built on top of it, like Solr.  Several years ago the former C*O of FAST called me up with a search business proposal.  One of the bits he mentioned is that his/FAST engineers had tested Lucene and found it faster than their own search components.) More importantly, these tools are completely open and give FAST@Linux customers the opportunity to finally get whatever they could never get from FAST.  Sure, there will be some trade-offs (e.g. some of these tools may not have some of the nice GUI pieces FAST has, but that is changing… eh, fast).  But the key bit is that these tools are not only free to get and use, they are nearly infinitely flexible.  They have their source opened to all the eyeballs of the world.  They have open-minded communities (or at least the ones we are involved in at Apache Lucene, Solr, Nutch, etc. are).  Development plans are all an open book.  Any organization’s engineers are welcome to jump in and either contribute (nearly finished) new components, or collaborate to develop new functionality, or simply explain use cases and request functionality to support them, or even fork the whole project.  One would be crazy to choose the last option, of course; it would be a sub-optimal use of the benefits of choosing an open-source solution.  Flexibility is one of the key benefits of using open-source.  You don’t like how relevance is computed?  Plug in your own secret sauce!  You don’t like how data is stored on disk?  Change it to your liking!

If you have to get off FAST@Linux in the coming years, think very, very, VERY hard before you go to another closed enterprise search vendor.  By going to another closed enterprise search vendor, what have you achieved?  You may be able to do what you never could do with FAST before, but there might be some other functionality you’ll have to give up because the new vendor does not have it, does not plan on adding it, and you cannot add it yourself because most of the solution is a black box with tiny holes poked in some of its sides, so you can only take a kind of pretend peek inside.  This would not be an improvement.  This would be a missed opportunity!

Here are some of the key benefits of going with open-source search tools and solutions:

  • No up-front fees
  • No increasing fees due to growth
  • Flexibility to change anything and everything
  • Large and growing user and developer communities
  • Security of commercial-grade support

Microsoft’s decision to drop FAST development for Linux is a blessing in disguise. It may not be pleasant right now, but it is actually a good opportunity for FAST customers.  Choose wisely now and in the coming years, and avoid being put on the spot again.

Update: after posting this we came across a blog post that refers to a search service switching from FAST to Solr and cutting costs 400%.  That is, the cost of their new Solr-powered search is 25% of the cost of their old FAST-powered search.

Introducing Cloud MapReduce

The following post is the introduction to Cloud MapReduce (CMR) written by Huan Liu, CMR’s main author and the Research Manager at Accenture Technology Labs.

MapReduce is a programming model (borrowed from functional programming languages) and its associated implementation. It was first proposed by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production index system was converted to MapReduce. Since then, it has quickly proven to be applicable to a wide range of problems. For example, roughly 10,000 MapReduce programs had been written at Google by June 2007, and there were 2,217,000 MapReduce job runs in the month of September 2007 alone. MapReduce has also found wide application outside of Google.

Cloud MapReduce is another implementation of the MapReduce programming model. Back in late 2008, we saw the emergence of the cloud Operating System (OS), a set of software managing a large cloud infrastructure rather than an individual PC. We asked ourselves the following questions: what if we build systems on top of a cloud OS instead of directly on bare metal? Can we dramatically simplify system design? We thought we would try implementing MapReduce as a proof of concept; thus, the Cloud MapReduce project was born. At the time, Amazon was the only vendor with a “complete” cloud OS, so we built on top of it. In the course of the project, we encountered a lot of problems working with the Amazon cloud OS, most of which could be attributed to the weaker consistency model it presents. Fortunately, we were able to work through all the issues and successfully built MapReduce on top of the Amazon cloud OS. The end result surprised us somewhat: not only did we arrive at a simpler design, but we were also able to make it more scalable, more fault tolerant and faster than Hadoop.

Why Cloud MapReduce

There are already several MapReduce implementations, including Hadoop, which is already widely used. So the natural question to ask is: why another implementation? The answer is that all previous implementations essentially copy what Google described in its original MapReduce paper, but we need to explore alternative implementation approaches for the following reasons:

1) Patent risk. MapReduce is a core technology in Google. By using MapReduce, Google engineers are able to focus on their core algorithms, rather than being bogged down by parallelization details. As a result, MapReduce greatly increases their productivity. It is no surprise that Google would patent such a technology to maintain its competitive advantage. The MapReduce patent covers the Google implementation as described in its paper. Since CMR only implements the programming model, but has a totally different architecture and implementation, it poses a minimal risk w.r.t. the MapReduce patent. This is particularly important for enterprise customers who are concerned about potential risks.

2) Architectural exploration. Google only described one implementation of the MapReduce programming model in its paper. Is it the best one? What are the tradeoffs if using a different one? CMR is the first to explore a completely different architecture. In the following, I will describe what is unique about CMR’s architecture.

Architectural principle and advantages

CMR advocates component decoupling: separate out common components as independent cloud services. If we can separate out a common component as a stand-alone cloud service, the component not only can be leveraged for other systems, but it can also evolve independently. As we have seen in other contexts (e.g., SOA, virtualization), decoupling enables faster innovation.

CMR currently uses existing components offered by Amazon, including Amazon S3, SimpleDB, and SQS. By leveraging the concept of component decoupling, CMR achieves a couple of key advantages.

A fully distributed architecture. Since each component is a smaller project, it is easier to build it as a fully distributed system. Amazon has done so for all of its services (S3, SimpleDB, SQS). Building on what Amazon has done, we were able to build a fully distributed MapReduce implementation in only 3,000 lines of code. A fully distributed architecture has several advantages over a master/slave architecture. First, it is more fault tolerant: many enterprise customers are not willing to adopt something with a single point of failure, especially for their mission-critical data. Second, it is more scalable: in one comparison study, we were able to stress the master node in Hadoop so much that CMR showed a 60x performance advantage.

More efficient data shuffling between Map and Reduce. CMR uses queues as the intermediate point between Map and Reduce, which enables Map to “push” data to Reduce (rather than Reduce “pulling” data from Map). This design is similar to what is used in parallel databases, so it inherits the efficient data transfer that comes from pipelining. However, unlike parallel databases, by using tagging, filtering and a commit mechanism, CMR still maintains the fine-grained fault tolerance offered by other MapReduce implementations. The majority of CMR’s performance gain (aside from the 60x gain, which comes from stressing the master node) comes from this optimization.

CMR is particularly attractive in a cloud environment due to its native integration with the cloud. Hadoop, on the other hand, is designed for a static environment inside an enterprise. When run in a cloud, Hadoop introduces additional overhead. For example, after launching a cluster of virtual machines, Amazon’s Elastic MapReduce (pay-per-use Hadoop) has to configure and set up Hadoop on the cluster, then copy data from S3 to the Hadoop file system before it can start data processing. At the end, it also has to copy results back to S3. All of these steps are additional overhead that CMR does not incur, because CMR starts processing right away, directly on S3 data; no cluster configuration or data copying is necessary.

CMR’s vision

Although CMR currently only runs on Amazon components, we envision that it will support a wide range of components in the future, including other clouds, such as Microsoft Windows Azure. There are a number of very interesting open-source projects already, such as jclouds, libcloud, deltacloud and Dasein, that are building a layer of abstraction on top of various cloud services to hide their differences. This middleware would make it much easier for CMR to support a large number of cloud components.

At the same time, we are also looking at how to build these components and deploy them locally inside an enterprise. Although several products, such as vCloud and Eucalyptus, provide cloud services inside an enterprise, their current versions are limited to compute capability. There are other cloud services, such as storage and queues, that an enterprise would have to deploy to provide full cloud capability to its internal customers. At Accenture Technology Labs, we are helping to address some pieces of the puzzle. For example, we have started a research project to design a fully distributed and scalable queue service, similar to SQS in functionality but exploring a different tradeoff point.

Synergy with Hadoop

Although, on the surface, CMR may seem to compete with Hadoop, there is actually quite a bit of synergy between the two projects. First, they are both moving toward the vision of component decoupling. In the recent 0.20.1 release of Hadoop, the HDFS file system was separated out as an independent component. This makes a lot of sense, because HDFS is useful as a stand-alone component for storing large data sets even if the users are not interested in MapReduce at all. Second, there are lessons to be learned from each project. For example, CMR points the way on how to “push” data from Map to Reduce to streamline data transfer without sacrificing fine-grained fault tolerance. Similarly, Hadoop supports rich data types beyond simple strings, which is something that CMR will surely inherit in the near future.

Hopefully, by now, I have convinced you that CMR is something that is at least worth a look. In the next post (coming in two weeks), I will follow up with a practical post showing step by step how to write a CMR application and how to run it. The current thinking is that I will demonstrate how to perform certain data analytics with data from search-hadoop.com, but I am open to suggestions. If you have suggestions, I would appreciate if you could post them in comments below.

Lucandra / Solandra: A Cassandra-based Lucene backend

In this guest post Jake Luciani (@tjake) introduces Lucandra (Update: now known as Solandra – see our State of Solandra post), a Cassandra-based backend for Lucene (Update: now integrated with Solr instead).
Update: Jake will be giving a talk about Lucandra at the April 2010 NY Search & Discovery Meetup.  Sign up!
Update 2: Slides from the Lucandra meetup talk are online: http://www.jroller.com/otis/entry/lucandra_presentation
For most users, the trickiest part of deploying a Lucene based solution is managing and scaling storage, reads, writes and index optimization. Solr and Katta (among others) offer ways to address these, but still require quite a lot of administration and maintenance. This problem is not specific to Lucene. In fact most data management applications require a significant amount of administration.
In response to this problem of managing and scaling large amounts of data, the “nosql” movement has become more popular. One of the most popular and widely used “nosql” systems is Cassandra, an Apache Software Foundation project originally developed at Facebook.

What is Cassandra?

Cassandra is a scalable and easy-to-administer column-oriented data store, modeled after Google’s BigTable, but built by the designers of Amazon’s S3. One of the big differentiators of Cassandra is that it does not rely on a global file system as HBase and BigTable do. Rather, Cassandra uses decentralized peer-to-peer “Gossip”, which means two things:
  1. It has no single point of failure, and
  2. Adding nodes to the cluster is as simple as pointing it to any one live node

Cassandra also has built-in multi-master writes, replication, rack awareness, and can handle downed nodes gracefully. Cassandra has a thriving community and is currently being used at companies like Facebook, Digg and Twitter to name a few.

Enter Lucandra

Lucandra is a Cassandra backend for Lucene. Since Cassandra’s original use within Facebook was for search, integrating Lucene with Cassandra seemed like a “no brainer”. Lucene’s core design makes it fairly simple to strip away and plug in custom Analyzer, Writer, Reader, etc. implementations. Rather than building a Lucene Directory implementation as some backends do (DbDirectory, for example), our approach was to implement an IndexReader and IndexWriter directly on top of Cassandra.

Here’s how Terms and Documents are stored in Cassandra. A Term is a composite key made up of the index, field and term, with the document id as the column name and the position vector as the column value.
      Term Key                    ColumnName   Value
      "indexName/field/term" => { documentId , positionVector }
      Document Key
      "indexName/documentId" => { fieldName , value }
Cassandra allows us to pull ranges of keys and groups of columns so we can really tune the performance of reads as well as minimize network IO for each query. Also, since writes are indexed and replicated by Cassandra we don’t need to worry about optimizing the indexes or reopening the index to see new writes. This means we get a soft real-time distributed search engine.
There is an impact on Lucandra searches when compared to native Lucene searches. In our testing we see that Lucandra’s IndexReader is ~10% slower than the default IndexReader. However, this is still quite acceptable to us given what you get in return.
For writes, Lucandra is comparatively slow next to regular Lucene, since every term is effectively written under its own key. Luckily, this will be fixed in the next version of Cassandra, which will allow batched writes for keys.
One other major caveat: there is no term scoring in the current code. It simply hasn’t been needed yet. Adding it would be relatively trivial, via another column.
To see Lucandra in action you can try out the Twitter search app http://sparse.ly, which is built on Lucandra. This service uses the Lucandra store exclusively and does not use any sort of relational or other database.

Lucandra in Action

Using Lucandra is extremely simple and switching a regular Lucene search application to use Lucandra is a matter of just several lines of code. Let’s have a look.

First we need to create the connection to Cassandra


import lucandra.CassandraUtils;
import lucandra.IndexReader;
import lucandra.IndexWriter;
...
Cassandra.Client client = CassandraUtils.createConnection();

Next, we create Lucandra’s IndexWriter and IndexReader, and Lucene’s own IndexSearcher.


IndexWriter indexWriter = new IndexWriter("bookmarks", client);
IndexReader indexReader = new IndexReader("bookmarks", client);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

From here on, you work with IndexWriter and IndexSearcher just like you do in vanilla Lucene. Look at the BookmarksDemo for the complete class.

What’s next? Solandra!

Now that we have Lucandra, we can use it with anything built on Lucene. For example, we can integrate Lucandra with Solr and simplify our Solr administration. In fact, this has already been attempted and we plan to support this in our code soon.

Mahout Digest, February 2010

Last month we published the Lucene Digest, followed by the Solr Digest, and wrapped the month with the HBase Digest (just before Yahoo posted their report showing Cassandra beating HBase in their benchmarks!).  We are starting February with a fresh Mahout Digest.

When covering Mahout, it seems logical to group topics following Mahout’s own groups of core algorithms.  Thus, we’ll follow that grouping in this post, too:

  • Recommendation Engine (Taste)
  • Clustering
  • Classification

There are, of course, some common concepts, some overlap, like n-grams.  Let’s talk n-grams for a bit.

N-grams

There has been a lot of talk about n-gram usage throughout all of the major subject areas on the Mahout mailing lists. This makes sense, since n-gram-based language models are used in various areas of statistical Natural Language Processing.  An n-gram is a subsequence of n items from a given sequence of “items”. The “items” in question can be anything, though most commonly n-grams are made up of character or word/token sequences.  Lucene’s n-gram support, provided through NGramTokenizer, tokenizes an input String into character n-grams and can be useful when building character n-gram models from text. When there is an existing Lucene TokenStream and a character n-gram model is needed, NGramTokenFilter can be applied to the TokenStream.  Word n-grams are sometimes referred to as “shingles”.  Lucene helps there, too: when word n-gram statistics or a model is needed, ShingleFilter or ShingleMatrixFilter can be used.
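
To make the character vs. word n-gram distinction concrete, here is a small sketch using those Lucene classes (NGramTokenizer and ShingleFilter live in the Lucene contrib analyzers; the class name and sample strings are made up for illustration):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NGramExamples {
  public static void main(String[] args) throws Exception {
    // Character bigrams: "mahout" -> ma, ah, ho, ou, ut
    printTokens(new NGramTokenizer(new StringReader("mahout"), 2, 2));

    // Word bigrams (shingles): "apache mahout in action" ->
    // "apache mahout", "mahout in", "in action" (plus the unigrams)
    TokenStream words = new WhitespaceTokenizer(new StringReader("apache mahout in action"));
    printTokens(new ShingleFilter(words, 2));
  }

  private static void printTokens(TokenStream ts) throws Exception {
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.term());
    }
    ts.close();
  }
}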

Classification

The usage of character n-grams in the context of classification and, more specifically, the possibility of applying Naive Bayes to character n-grams instead of word/term n-grams, is discussed here. Since the Naive Bayes classifier is a probabilistic classifier that can treat features of any type, there is no reason it could not be applied to character n-grams, too. Using a character n-gram model instead of a word model in text classification can result in more accurate classification of shorter texts.  Our language identifier is a good example of such a classifier (though it doesn’t use Mahout) and it provides good results even on short texts; try it.

Clustering

Non-trivial word n-grams (aka shingles) extracted from a document can be useful for document clustering. Similar to the usage of a document’s term vectors, this thread proposes using non-trivial word n-grams as a foundation for clustering. For the extraction of word n-grams or shingles from a document, Lucene’s ShingleAnalyzerWrapper is suggested; ShingleAnalyzerWrapper wraps the previously mentioned ShingleFilter around another Analyzer.  Since clustering (grouping similar items) is an example of unsupervised machine learning, it is always interesting to validate clustering results. In clustering there is no reference training or, more importantly, reference test data, so evaluating how well some clustering algorithm works is not a trivial task. Although good clustering results are intuitive and often easy to evaluate visually, it is hard to implement an automated test. Here is an older thread about validating Mahout’s clustering output, which resulted in an open JIRA issue.

Recommendation Engine

There is an interesting thread about content-based recommendation: what content-based recommendation really is, and how it should be defined. So far, Mahout has only a Collaborative Filtering based recommendation engine, called Taste.  Two different approaches are presented in that thread. One approach treats content-based recommendation as a Collaborative Filtering problem or a generalized Machine Learning problem, where item similarity is based on Collaborative Filtering applied to an item’s attributes or ‘related’ users’ attributes (usual Collaborative Filtering treats an item as a black box).  The other approach is to treat content-based recommendation as a “generalized search engine” problem: here, matching the same or similar queries makes two items similar. Just think of queries composed of, say, keywords extracted from a user’s reading or search history and this will start making sense.  If items have enough textual content, then content-based analysis (similar items are those that have similar term vectors) seems like a good approach for implementing content-based recommendation.  This is actually nothing novel (people have been (ab)using Lucene, Solr, and other search engines as “recommendation engines” for a while), but content-based recommendation is a recently discussed topic for possible Mahout expansion.  All algorithms in Mahout tend to run on top of Hadoop as MapReduce jobs, but in the current release Taste does not have a MapReduce version. You can read more about the MapReduce Collaborative Filtering implementation in Mahout’s trunk. If you are in need of a working recommendation engine (that is, a whole application built on top of the core recommendation engine libraries), have a look at Sematext’s recommendation engine.
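
For readers who haven’t tried Taste yet, here is a minimal user-based Collaborative Filtering sketch, roughly as the Taste API looks in current Mahout releases; the ratings.csv file (userID,itemID,preference per line), the user ID and the neighborhood size are made-up examples, and class names may differ slightly between versions:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 5 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}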

In addition to Mahout’s basic machine learning algorithms, there are discussions and development in directions which don’t fall under any of the above categories, such as collocation extraction. Phrase extractors often use a word n-gram model for co-occurrence frequency counts. Check the thread about collocations, which resulted in a JIRA issue and a first implementation. You can also find more details on how the log-likelihood ratio can be used in the context of collocation extraction in this thread.

Of course, anyone interested in Mahout should definitely read Mahout in Action (we got ourselves a MEAP copy recently) and keep an eye on the features planned for the upcoming 0.3 release.