Solr wins the 2010 Bossie Award for the best open source applications

We do a ton of work with Solr for our clients, so it is great to see that Apache Solr won this year’s Bossie Award for best open source applications, an award from InfoWorld:

While search engines have transformed the online world as we know it, there is no doubt that companies and research groups can be well served by running their own search engines and creating custom presentations of results. Solr gives them the tools to do this in a fast, scalable implementation that handles rich documents easily, and it can run on any platform that supports Java. It also offers distributed search, replication of results, and developer access via numerous languages and protocols.

Other winners include Drupal, WordPress, and Alfresco.  Congratulations to all the winners!

HBase Case-Study: Using HBaseTestingUtility for Local Testing & Development

Motivation

As HBase becomes more mature, there is a growing demand for tools and methods that make the development process easier. Here at Sematext (@sematext) we’ve gone through our own per aspera ad astra learning process, in addition to Cloudera’s Hadoop trainings and certifications. In this post we share what we’ve learned and show how one can use HBaseTestingUtility for this.

Suppose there is a system that processes data stored in HBase and displays the stored data via a reporting application. Data processing is done using Hadoop MapReduce jobs. During development, it would be desirable to be able to:

  • debug MapReduce jobs in an IDE
  • run the reporting application locally (on a developer’s machine, without setting up a cluster), with the possibility of debugging in an IDE
  • easily access data stored in HBase for debugging purposes (“easily” meaning “naturally”, as if all rows were in a text file)

Disclaimer

The described use case and solution are just one option, an option that makes use of HBaseTestingUtility and the underlying “mini” clusters. Depending on the context, this solution might not be optimal, but it is a good fit for presenting the ideas. This solution and this post should encourage developers to look at HBase’s unit-test sources when constructing their own tests and/or when looking for ways to make debugging and development easier.

Problem Details

In our example there are two tables in HBase: one with raw data and another with processed data.  Let’s call them RawDataTable and ProcessedDataTable. We import data into RawDataTable via a simple importing MapReduce job which initially takes data from a log file. Subsequently, another MapReduce job processes the data in that table and stores the outcome in ProcessedDataTable. We use HBase Scan and Get operations to access the processed data from the client.

Solution

As stated in the javadocs, HBaseTestingUtility is a “facility for testing HBase”. Its description comes with a bit more explanation: “Create an instance and keep it around doing HBase testing. This class is meant to be your one-stop shop for anything you might need testing. Manages one cluster at a time only.” In this post we describe one possible way to use it to achieve the goals described above.

Processing Data

Step 1: Init cluster.

The following code starts a “local” cluster and creates two tables:

private final HBaseTestingUtility testUtil = new HBaseTestingUtility();
private HTable rawDataTable;
private HTable processedDataTable;
…
void initCluster() throws Exception {
  testUtil.getConfiguration().addResource("hbase-site-local.xml");
  testUtil.getConfiguration().reloadConfiguration();
  // start mini hbase cluster
  testUtil.startMiniCluster(1);
  // create tables
  rawDataTable = testUtil.createTable(RAW_TABLE_NAME, RAW_TABLE_COLUMN_FAMILIES);
  processedDataTable = testUtil.createTable(PROCESSED_TABLE_NAME, PROCESSED_TABLE_COLUMN_FAMILIES);
  // start MR cluster
  testUtil.startMiniMapReduceCluster();
}

testUtil.startMiniCluster(1) means starting a cluster with one datanode and one region server. You can start a cluster with a greater number of servers for test purposes.
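
The RAW_TABLE_* and PROCESSED_TABLE_* constants are not shown in the snippet above; one possible way to define them is sketched below (the actual table and column family names here are assumptions, not taken from the real system):

import org.apache.hadoop.hbase.util.Bytes;
…
// table names and column families passed to testUtil.createTable(...)
private static final byte[] RAW_TABLE_NAME = Bytes.toBytes("RawDataTable");
private static final byte[][] RAW_TABLE_COLUMN_FAMILIES =
    new byte[][] { Bytes.toBytes("raw") };
private static final byte[] PROCESSED_TABLE_NAME = Bytes.toBytes("ProcessedDataTable");
private static final byte[][] PROCESSED_TABLE_COLUMN_FAMILIES =
    new byte[][] { Bytes.toBytes("data") };

HBaseTestingUtility.createTable(byte[], byte[][]) then creates each table with the given column families on the mini cluster.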

Step 2: Import Data

We use a simple map-only job for the import. Please refer to the org.apache.hadoop.hbase.mapreduce.ImportTsv class for an example of such a job. The following code runs the job, which uses locally stored files (e.g. a part of the log file of reasonable size), on the just-created cluster:

String[] importJobArgs = new String[] {RAW_TABLE_NAME, "file://" + inputFile};
if (!MyImportJob.createSubmittableJob(testUtil.getConfiguration(), importJobArgs).waitForCompletion(true)) {
  System.exit(1);
}
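
For readers who want a more complete picture, here is a minimal sketch of what such a map-only import job factory might look like, modeled on org.apache.hadoop.hbase.mapreduce.ImportTsv. The class name MyImportJob comes from the snippet above, but the mapper, the assumed log line format, and the column family/qualifier names are our own illustrative assumptions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MyImportJob {

  // assumed log format: rowKey<TAB>rest-of-the-line
  public static class LogLineMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      Put put = new Put(Bytes.toBytes(parts[0]));
      put.add(Bytes.toBytes("raw"), Bytes.toBytes("line"), Bytes.toBytes(parts[1]));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static Job createSubmittableJob(Configuration conf, String[] args)
      throws IOException {
    String tableName = args[0];
    Path inputPath = new Path(args[1]);
    Job job = new Job(conf, "import-" + tableName);
    job.setJarByClass(MyImportJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, inputPath);
    job.setMapperClass(LogLineMapper.class);
    // map-only job writing directly into the HBase table via TableOutputFormat
    TableMapReduceUtil.initTableReducerJob(tableName, null, job);
    job.setNumReduceTasks(0);
    return job;
  }
}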

Step 3: Process Data

To process the data in RawDataTable, we run an appropriate MapReduce job in the same way as during the import:

if (!ProcessLogsJob.createSubmittableJob(testUtil.getConfiguration(), processLogsJobArgs).waitForCompletion(true)) {
  System.exit(1);
}
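
Similarly, here is a sketch of what ProcessLogsJob.createSubmittableJob() might look like: a job that scans RawDataTable and writes results into ProcessedDataTable. The mapper logic and the table/column family names are assumptions, kept trivial just to show the wiring via TableMapReduceUtil:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class ProcessLogsJob {

  public static class RawDataMapper
      extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // placeholder "processing": copy raw cells into the processed table's family
      Put put = new Put(row.get());
      for (KeyValue kv : value.raw()) {
        put.add(Bytes.toBytes("data"), kv.getQualifier(), kv.getValue());
      }
      context.write(row, put);
    }
  }

  public static Job createSubmittableJob(Configuration conf, String[] args)
      throws IOException {
    Job job = new Job(conf, "process-logs");
    job.setJarByClass(ProcessLogsJob.class);
    // read every row of the raw table and feed it to the mapper
    TableMapReduceUtil.initTableMapperJob("RawDataTable", new Scan(),
        RawDataMapper.class, ImmutableBytesWritable.class, Put.class, job);
    // IdentityTableReducer simply writes the emitted Puts into the processed table
    TableMapReduceUtil.initTableReducerJob("ProcessedDataTable",
        IdentityTableReducer.class, job);
    return job;
  }
}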

Step 4: Persist Processed Data

Since we need the processed data while developing and debugging the reporting application, we persist it in a local file. In order to have “easy” access to this data during debugging, it makes sense to store the table data in a text file in a readable form (so that we can use “grep” and other handy commands). So we actually write to two files at once. The Result class implements the Writable interface, so there is a natural way to serialize its data.

BufferedWriter bw = ...;    // human-readable text file (for grep & friends)
DataOutputStream dos = ...; // binary file with Writable-serialized Results
ResultScanner rs = processedDataTable.getScanner(new Scan());
Result next = rs.next();
while (next != null) {
  next.write(dos);                         // binary form, used to reload the table later
  bw.write(getHumanReadableString(next));  // readable form, used for eyeballing the data
  bw.newLine();
  next = rs.next();
}
rs.close();
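
The getHumanReadableString() helper is not part of HBase; one simple way to implement it (an assumption, shown only for completeness) is to flatten all KeyValues of a Result into a single tab-separated line:

private String getHumanReadableString(Result result) {
  StringBuilder sb = new StringBuilder(Bytes.toString(result.getRow()));
  for (KeyValue kv : result.raw()) {
    sb.append('\t')
      .append(Bytes.toString(kv.getFamily())).append(':')
      .append(Bytes.toString(kv.getQualifier())).append('=')
      .append(Bytes.toString(kv.getValue()));
  }
  return sb.toString();
}

With such a format, grep-ing for a row key or a column value in the text file is straightforward.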

After this step, the processed data is stored on the local disk and can be used for running the reporting application. Importing and processing of data is performed locally and is thus easier to debug.
In order to add extra processed data incrementally to the already stored data, instead of rewriting it from scratch, we need to load it from the file after cluster initialization, as described in the following section.

Fetching Data

In order to make the reporting application run on the “local” cluster instead of the “true” one, we create an alternative HTable factory. The reporting application code uses a single HTable object, instantiated by the factory, during its whole lifecycle – this is a best practice that minimizes the creation of HTable objects.
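
A minimal sketch of such a factory could look like the following. The class and method names are assumptions; the only important point is that for local runs the factory hands out HTable instances bound to the mini cluster’s configuration (testUtil.getConfiguration()), while in production it would be backed by the real cluster’s configuration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;

public class ReportingHTableFactory {

  private final Configuration conf;

  public ReportingHTableFactory(Configuration conf) {
    // for local development, pass testUtil.getConfiguration() here
    this.conf = conf;
  }

  public HTable getTable(String tableName) throws IOException {
    // note: on older HBase versions this constructor may expect an HBaseConfiguration
    return new HTable(conf, tableName);
  }
}

Switching the reporting application between the “local” and the “true” cluster then becomes just a matter of which Configuration the factory is built with.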

Step 1: Init cluster.

This step is exactly the same as described previously.

Step 2: Load processed data.

We use the file created during the data processing stage to load the data back into the just-initialized cluster:

DataInputStream dis = ...;
Result next = new Result();
next.readFields(dis);
while (next.getRow() != null) {
  Put put = new Put(next.getRow());
  for (KeyValue kv : next.raw()) {
    put.add(kv);
  }
  processedDataTable.put(put);
  next = new Result();
  try {
    next.readFields(dis);
  } catch (EOFException e) {
    // reached the end of the file
    break;
  }
}

After the data is loaded, the constructed processedDataTable can be used by the reporting application code. The app can now also be started and debugged easily from an IDE.
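
Although not shown in the steps above, it is also worth shutting the “mini” clusters down once the debugging session is over; a minimal sketch:

void shutdownCluster() throws Exception {
  // stop the MR cluster first, then the HBase/DFS/ZooKeeper mini clusters
  testUtil.shutdownMiniMapReduceCluster();
  testUtil.shutdownMiniCluster();
}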

Next Steps

Internally, HBaseTestingUtility makes use of a whole bunch of “mini” clusters: MiniZooKeeperCluster, MiniDFSCluster, MiniHBaseCluster and MiniMRCluster. Refer to the unit-test implementations in the source code of the respective projects for more examples of how to use them.

Thank you for reading, we hope you found this useful.  Follow @sematext on Twitter to be notified of new posts on Hadoop, HBase, Lucene, Solr, Mahout, and other related topics.

Solr Digest, July 2010

As usual, July was one of the slower months in the Solr world; however, we managed to find a few interesting topics for our readers.

  • An interesting feature might be added with SOLR-1979 (Create LanguageIdentifierUpdateProcessor). It would provide the ability to handle text in different languages differently (think about stemming during analysis, for instance) and to do it automatically. This issue was just created, so the work on it and any usable patches are coming some time in the future. However, if you need something working now, Sematext has a few products with similar multilingual functionality, for instance, Multilingual Indexer or its cousin Language Identifier.
  • Another interesting feature might come with SOLR-1980 (Implement boundary match support). This will enable one to specify that a query should match only at the start or at the end of a field (or be an exact match), not somewhere in the middle, which could provide more relevant search results in some specific cases. This issue is also in its infancy and has no patches yet, so we’ll have to wait and see how it progresses.
  • Ever wanted Solr to store as the value of some field something other than the raw input value? (Remember, when you search Solr, you search on analyzed and indexed values; when you fetch the content of some field, you get the raw input value added to that field, not its analyzed version.) A patch for that already exists in one rather fresh JIRA issue – SOLR-1997 (Store internal value instead of input one).
  • Getting ready to start using Solr, but unsure about which version you should use? Don’t worry, the confusion about Solr’s versioning started this spring (see the Solr May 2010 Digest), but things have stabilized lately. The latest release is the fairly recent 1.4.1, which is basically the 1.4 version with many bugfixes. The next release version is 3.1, which can be found on the branch_3x branch. You can find its nightly build versions here. The trunk is still used for “unstable” development and the future 4.0 version. For more information, check these recent threads on the Solr mailing list: here and here.
  • Many will probably agree that Solr’s SpellCheckComponent isn’t very useful in real-life applications. One of the main problems is that it handles multi-word queries poorly: it creates its suggestion as a collation of the best suggestion for each word of the query, so you often get suggestions which have 0 hits. Also, it doesn’t return important information about the suggested query, like how many hits such a query would generate and what results it would give. Some of these issues could be fixed some day with SOLR-2010 (Improvements to SpellCheckComponent Collate functionality). The first version of the patch is already provided. However, if you’d like to use such functionality in your Solr production today, you might consider a much more sophisticated and production-ready component developed by Sematext – DYM ReSearcher – you can see DYM ReSearcher in action on Search-Lucene.com, for example.
  • One minor piece of functionality was added to QueryElevationComponent – an option to return only the specified (elevated) results. It was added with JIRA issue SOLR-1966 and is already committed to 3.x and trunk.

We hope that this was enough to satisfy your Solr appetite.  Hopefully, we’ll dig up more interesting topics for you in August.  Until then you can keep up with us via @sematext on Twitter.

HBase Digest, July 2010

Big news first: HBase 0.20.6 is out and available for download. It fixes only 8 issues (including 2 blockers), but some of them might be significant in particular cases (like scan recovery in case of region server failure). You can find the release notes here. A message from the HBase dev team: “we recommend that all users, particularly those running 0.20.4, upgrade”.

A very sweet piece of functionality is under active development right now (and it looks like it’s nearly complete).  This new functionality makes it possible to take HBase table snapshots: HBASE-50. This might be extremely useful in production. The design plan and implementation are looking so good that “committers should read it as they might learn something”.

Community news & trends:

  • Summary notes from the HBase meetup (#11) at Facebook give a comprehensive overview of development activities and what’s coming in future releases. Slides are available here.
  • Welcome the official HBase Blog.
  • It is strongly recommended to use HDFS patched with HDFS-630 with HBase. To save time, you can use Cloudera’s distribution: HDFS-630 is included in 0.20.1+169.89 (the latest CDH2, i.e. CDH2u1) as well as in both betas of CDH3.
  • From the operations standpoint, are the setup and maintenance of an HBase cluster fairly complex tasks? Can a single person manage it? People share their experiences in this thread.
  • It makes sense to disable the WAL (e.g. via Put.setWriteToWAL(false)) during a one-time large import into HBase; of course, this makes the import less reliable, but it can be acceptable for a one-off import given the resulting speedup (see the short snippet after this list).
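
As an illustration of the WAL tip from the last bullet, a tiny sketch (the table variable, row key, column family and qualifier below are made up):

Put put = new Put(Bytes.toBytes("row-0001"));
put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("some value"));
// skip the write-ahead log for this edit: faster, but the edit is lost if the region server crashes
put.setWriteToWAL(false);
table.put(put);  // 'table' is an HTable opened for the import target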

Notable efforts:

  • HBase RowLog Library: a component to build WALs and queues backed by HBase.
  • Lily is out: meet the Proof of Architecture release. Lily is a cloud-scalable, NoSQL-based content store and search repository built on top of Apache HBase and SOLR.

FAQ:

  • Why are recently added/modified records not present in the results of scan and get operations? How can one make them available?
    Check the autoFlush option: autoFlush=false causes the client to accumulate puts without sending them to the server. Gets and scans only talk to the server and thus ignore the client-side write cache. You can either set autoFlush to true or call HTable.flushCommits() before reading the data (see the snippet below).
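
A small sketch of the behavior described above (the table variable, row key, family and qualifier are made up):

Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
table.setAutoFlush(false);   // puts are buffered on the client instead of being sent one by one
table.put(put);              // sits in the client-side write buffer for now
// a Get or Scan issued at this point will not see "row1" yet
table.flushCommits();        // push the buffered puts to the server
// alternatively, leave autoFlush at its default (true) so every put is sent immediately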

Are you on Twitter?  You can follow @sematext on Twitter.

Hadoop Digest, July 2010

Strong moves towards the 0.21 Hadoop release “detected”: 0.21 Release Candidate 0 was released and tested. A number of issues were identified, and with that the roadmap to the next candidate is set. Tom White has been hard at work and is acting as the release engineer for the 0.21 release.

Community trends and discussions:

  • Hadoop Summit 2010 slides and videos are available here.
  • In case you’re at the design stage of a Hadoop cluster aimed at working with text-based and/or structured data, you should read the “Text files vs SequenceFiles” thread.
  • Thinking of decreasing HDFS replication factor to 2? This thread might be useful to you.
  • Managing workflows of Sqoop, Hive, Pig, and MapReduce jobs with Oozie (Hadoop workflow engine from Yahoo!) is explained in this post.
  • The 2nd edition of “Hadoop: The Definitive Guide” is now in production.  Again, Tom White in action.

Small FAQ:

  • How do you efficiently process (large) XML documents in Hadoop MapReduce?
    Take a look at Mahout’s XmlInputFormat in case StreamXmlRecordReader doesn’t do a good job for you. The former has gotten a lot of positive feedback from the community.
  • What are the ways of importing data to HDFS from remote locations? I need this process to be well-managed and automated.
    Here are just some of the options. First you should look at the available HDFS shell commands. For large inter/intra-cluster copying, distcp might work best for you. For moving data from an RDBMS you should check Sqoop. To automate moving (constantly produced) data from many different locations, refer to Flume. You might also want to look at Chukwa (a data collection system for monitoring large distributed systems) and Scribe (a server for aggregating log data streamed in real time from a large number of servers).

Hey, follow @sematext if you are on Twitter and RT!