Solr Digest, March 2010

As you probably already know, there were some big changes in the Lucene/Solr world this month. They were already mentioned in the Lucene Digest post for March, but here we'll summarize the changes related to Solr:

  • Lucene and Solr are now merged in svn and the repositories are already created (you can check out the interesting discussion on this thread)
  • The next Solr release could have a new major version number. There are conflicting opinions on what the version should be called: 1.5, 2.0, or 3.1 (following the name of the next Lucene release). We are putting our money on it being version 2.0
  • Solr will soon move to Java 6
  • Deprecations in Solr code will finally be removed, which means the transition to the new version will not be as easy for users as it was from version 1.3 to 1.4
  • Solr and Lucene will not be changing their names; they remain two separate products

One interesting addition to the Solr world is a new iPhone app called SolrStation. If you ever need to remotely administer your Solr installation and you also own an iPhone or Android-based phone, here is a tool you might find useful. It allows administrators to:

  • Administer Solr installs
  • Manage data imports
  • Control replication

One thing to be careful about is security: what happens when you lose your phone? The “lucky” finder will have access to your Solr installation. After some back-and-forth with Chris of SolrStation, Chris agreed that additional security in the form of a pass code would be a good addition and is putting it on the roadmap. Soon we’ll have a special post dedicated solely to SolrStation, covering this interesting product in more detail.

Solr just got integrated with Grails in the form of a plugin. This plugin integrates the Grails domain model with Solr and aims to provide the functionality of the already existing Grails Searchable plugin and add some more. It is still in its infancy, so it is missing some key features, like returning a list of domain objects in the result set, but integration with Solr should improve Grails searching with capabilities like clustering, replication, scalability, facets…

Another interesting addition to Solr is the Zoie plugin. It provides real-time update functionality for Solr 1.4. We’ll have a special post on Zoie, so we will not go into details here.

In the Solr Digest for February 2010 we covered the RSolr scripting client for Ruby. This time we suggest looking at Solr Flux. This command-line interface for Solr speaks SQL, so if you ever wanted to use SQL-like syntax to insert or delete documents or run a Solr query, this tool will let you do exactly that.

If you have good knowledge of Solr and Mahout and want to contribute to Solr, check this thread – there is an opportunity with Google Summer of Code.

Thank you for reading.  You can also follow us on Twitter – @sematext.

HBase Digest, March 2010

We waited until the end of the month hoping to include coverage of the new HBase 0.20.4 release, but the HBase developers are still working on it. This release will contain a lot of critical fixes and enhancements, so stay tuned.

Typically, our HBase Digest posts consist of three main parts: project status and a summary of the most interesting mailing list discussions, other projects’ efforts & announcements related to HBase technology, and a FAQ section that aims to save the very responsive HBase developers from answering the same questions again and again. Please feel free to provide feedback on this coverage format in the post comments.

  • A must-read HBase & HDFS presentation from Todd Lipcon of Cloudera that was part of “HUG9: March HBase User Group at Mozilla”. Links to other presentations are here. The meeting was followed by a nice discussion on Hadoop (and therefore HBase) reliability with regard to EC2. People shared a lot of useful information about hosting opportunities for one’s HBase setup.
  • A very interesting discussion covers various use cases HBase is a good fit for.
  • Some notes on what settings to adjust when running HBase on a machine with low RAM can be found in this thread.
  • Good questions from a person/team evaluating HBase and good answers in this thread. Such discussions periodically appear on the mailing lists and, given the great responsiveness of HBase committers, are well worth reading for those who are thinking about using HBase or are already using it, like we are.
  • The warning we already shared with readers through our Hadoop Digest (March): avoid upgrading your clusters to Sun JVM 1.6.0u18 and stick with 1.6.0u16 for a while — that update has proven to be very stable.
  • One more explanation of the difference between indexed (IHBase) and transactional (THBase) indices.
  • Deleting a row and putting another one with the same key at the same time (i.e. performing a “clean update”) can cause unexpected results if not done properly. There are currently several ways to make this process safer. If you face this problem, please share your experience with the HBase developers on the user mailing list; they will be happy to consider your case when developing a solution to the issue in the next release.
  • Making column names/keys shorter can result in ~20-30% RAM savings, and visible storage savings too. An even bigger advantage comes from defining the right schema and column families. More advice in this thread; see the sketch after this list.
  • What are the options for connecting to HBase running on EC2 from outside the Amazon cloud using a Java library? Thread…
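
To make the point about short column names concrete, here is a minimal sketch against the HBase 0.20 client API. The table name, row key, family “d” and qualifiers “n”/“e” are made-up illustrations; since family and qualifier bytes are stored with every single cell, short names translate directly into RAM and storage savings.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ShortColumnNames {
  public static void main(String[] args) throws Exception {
    // Hypothetical "users" table with a single one-letter column family "d" (data)
    HTable table = new HTable(new HBaseConfiguration(), "users");

    Put put = new Put(Bytes.toBytes("user#0001"));
    // One-letter qualifiers ("n" = name, "e" = email) instead of verbose ones
    put.add(Bytes.toBytes("d"), Bytes.toBytes("n"), Bytes.toBytes("Jane Doe"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("e"), Bytes.toBytes("jane@example.com"));
    table.put(put);
  }
}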

Most notable efforts:

  • Lucehbase: Lucene Index on HBase, ported from Lucandra. Please find more info on this topic in the comments to our super popular Lucandra: A Cassandra-based Lucene backend post.
  • Elephant Bird: Twitter’s library of LZO and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, HBase miscellanea, etc. The majority of these are in production at Twitter running over rather big data every day.

Small FAQ:

  1. How to back up HBase data?
    You can either do exports at the HBase API level (à la the Export class), or you can force-flush all your tables and do an HDFS-level copy of the /hbase directory (using distcp, for example).
  2. Is there a way to perform multiple get (put, delete)?
    There is work being done on that; please refer to HBASE-1845. The patch is available for version 0.20.3. For batched puts, which are already available, see the sketch after this FAQ.
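
On the put side, the 0.20 client API already accepts a list of Puts, which gives you a form of batching today; for the rest, HBASE-1845 is the one to watch. A minimal sketch with made-up table and column names:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "events");

    // Collect many Puts and hand them to the client in one call
    List<Put> puts = new ArrayList<Put>();
    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("value-" + i));
      puts.add(put);
    }
    table.put(puts);        // one client call instead of 1000
    table.flushCommits();   // make sure buffered edits reach the region servers
  }
}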

Thank you for reading! You can also follow us on Twitter – @sematext.

Mahout Digest, March 2010

In this Mahout Digest we’ll summarize what went on in the Mahout world since our last post back in February.

There has been some talk on the mailing list about Mahout becoming a top level project (TLP). Indeed, the decision to go TLP has been made (see the Lucene March Digest to find out about other Lucene subprojects going for TLP) and this will probably happen soon, now that Mahout 0.3 is released. Check the discussion thread on Mahout as TLP and follow the discussion on what the PMC will look like. Also, Sean Owen has been nominated as Mahout PMC Chair. There’s been some logo tweaking, too.

There has been a lot of talk on the Mahout mailing list about Google Summer of Code and project ideas related to Mahout. Check the full list of Mahout’s GSOC project ideas or take up the invitation to write up your own GSOC idea for Mahout!

Since Sematext is such a big Solr shop, we find the proposal to integrate Mahout clustering or classification with Solr quite interesting. You can find more details in the MAHOUT-343 JIRA issue. One example of classification integrated with Solr, or actually a classifier as a Solr component, is Sematext’s Multilingual Indexer. Among other things, the Multilingual Indexer uses our Language Identifier to classify documents and individual fields based on language.

When talking about classification we should point out a few more interesting developments. There is an interesting thread on the implementation of new classification algorithms and the overall classifier architecture. In the same context of classifier architecture, there is the MAHOUT-286 JIRA issue on how (and when) Mahout’s Bayes classifier will be able to support classification of non-text (non-binary) data like numeric features. If you are interested in using decision forests to classify new data, check this wiki page and this JIRA issue and patch.

In the previous post we discussed the application of n-grams in collocation identification, and now there is a wiki page where you can read more on how Mahout handles collocations. Also, check the memory management improvements in collocation identification here. Of course, if you think you need more features in key phrase identification and extraction, check Sematext’s Key Phrase Extractor demo – it does more than collocations, can be extended, etc.

Finally, two new committers, Drew Farris and Benson Margulies, have been added to the list of Mahout committers.

That’s all for now from the Mahout world.  Please feel free to leave comments or if you have any questions – just ask, we are listening!

Hadoop Digest, March 2010

Main news first: Hadoop 0.20.2 was released! The list of changes may be found in the release notes here. Related news:

To get the freshest insight into the 0.21 release plans, check out this thread and its continuation.

More news on releases:

High availability is one of the hottest topics in Hadoop land nowadays. An umbrella HDFS-1064 JIRA issue has been created to track discussions/issues related to HDFS NameNode availability. While there are a lot of questions about eliminating the single point of failure, Hadoop developers are more concerned with minimizing downtime (including downtime for upgrades and restart time) than with getting rid of SPOFs, since long downtime is the real pain for those who manage a cluster. There is some work on adding a hot standby that might help with planned upgrades. Please find some thoughts and a bit of explanation on this topic in a thread that started with the “Why not consider ZooKeeper for the NameNode?” question. Next time we see “How do Hadoop developers feel about the SPOF?” come up on the mailing list, we’ll put it in a special FAQ section at the bottom of this digest. 🙂

We already reported in our latest Lucene Digest (March) that various Lucene projects have started discussions on their mailing lists about becoming top level Apache projects. This tendency (motivated by the Apache board’s warnings about Hadoop and Lucene becoming umbrella projects) has sparked discussions at HBase, Avro, Pig, and ZooKeeper as well.

Several other notable items from the mailing lists:

  • An important note from Todd Lipcon that we’d like to pass on to our readers: avoid upgrading your clusters to Sun JVM 1.6.0u18 and stick with 1.6.0u16, which has proven to be very stable, for a while. Please read the complete discussion around it here.
  • Storing Custom Java Objects in Hadoop Distributed Cache is explained here.
  • Here is a bit of explanation of the fsck command output.
  • Several users shared their experience with issues running Hadoop on a Virtualized O/S vs. the Real O/S in this thread.
  • Those who are thinking about using Hadoop as a base for academic research (both students and professors) might find a lot of useful links (public datasets, sources of problems, existing research) in this discussion.
  • Hadoop security features are in high demand among the users and community. Developers will be working hard on deploying authentication mechanisms this summer. You can monitor the progress via HADOOP-4487.

This time a very small FAQ section:

  1. How can I request a larger heap for Map tasks?
    By including -Xmx in mapred.child.java.opts (e.g. -Xmx1024m); see the sketch after this FAQ.
  2. How to configure and use LZO compression?
    Take a look at http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/.
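
A minimal sketch of both answers in job-configuration form. The heap size, the codec list, and the com.hadoop.compression.lzo.* classes from the hadoop-lzo library are illustrative assumptions that depend on your setup; the LZO part also assumes the jars and native libraries described in the Cloudera post above are installed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class JobTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf(new Configuration(), JobTuning.class);

    // FAQ 1: give each map/reduce child JVM a 1 GB heap
    conf.set("mapred.child.java.opts", "-Xmx1024m");

    // FAQ 2: register the LZO codecs (requires the hadoop-lzo jars and natives)
    conf.set("io.compression.codecs",
             "org.apache.hadoop.io.compress.GzipCodec,"
           + "org.apache.hadoop.io.compress.DefaultCodec,"
           + "com.hadoop.compression.lzo.LzoCodec,"
           + "com.hadoop.compression.lzo.LzopCodec");
    conf.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec");

    // Optionally compress intermediate map output with LZO to cut shuffle I/O
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec", "com.hadoop.compression.lzo.LzoCodec");
  }
}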

Thank you for reading us! Please feel free to provide feedback on the format of the digests or anything else, really.

Nutch Digest, March 2010

This is the first post in our Nutch Digest series, so a little introduction to Nutch seems in order. Nutch is a multi-threaded and, more importantly, distributed Web crawler with distributed content processing (parsing, filtering), a full-text indexer, and a search runtime. Nutch is at version 1.0 and the community is now working towards a 1.1 release. Nutch is a large-scale, flexible Web search engine, which includes several types of operations. In this post we’ll present new features and mailing list discussions as we describe each of these operations.

Crawling

Nutch starts crawling from a given “seed list” (a list of seed URLs) and iteratively follows useful/interesting outlinks, thus expanding its link database. When talking about Nutch as a crawler, it is important to distinguish between two different approaches: focused or vertical crawling, and whole-Web or wide crawling. Each approach has a different set-up and different issues which need to be addressed. At Sematext we’ve done both vertical and wide crawling.

When using Nutch at large scale (whole-Web crawling and dealing with, e.g., billions of URLs), generating a fetchlist (a list of URLs to crawl) from the crawlDB (the link database) and updating the crawlDB with new URLs tends to take a lot of time. One solution is to keep such operations to a minimum by generating several fetchlists in one pass over the crawlDB and then updating the crawlDB only once for several segments (a segment is the set of data generated by a single fetch iteration). An implementation of a Generator that generates several fetchlists at once was created in NUTCH-762. To find out whether this feature will be included in the 1.1 release and when version 1.1 will be released, check here.

One more issue related to whole-Web crawling which often pops up on the Nutch mailing list is crawling authenticated websites, so here is a wiki page on the subject.

When using Nutch for vertical/focused crawls, one often ends up with very slow fetch performance at the end of each fetch iteration. An iteration typically starts with high fetch speed, but the speed drops significantly over time and keeps dropping, and dropping, and dropping. This is a known problem. It is caused by the fetch run containing a small number of sites, some of which may have many more pages than others and may be much slower than others. Crawler politeness (the crawler politely waits before hitting the same domain again), combined with the fact that the number of distinct domains in the fetchlist often drops rapidly during one fetch, causes the fetcher to wait a lot. More on this and on overall Fetcher2 performance (Fetcher2 is the default fetcher in Nutch 1.0) can be found in NUTCH-721.

To be able to support whole-Web crawling, Nutch also needs a scalable data processing mechanism. For this purpose Nutch uses Hadoop’s MapReduce for processing and HDFS for storage.

Content storing

Nutch uses Hadoop’s HDFS as fully distributed storage which creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster, enabling reliable and fast computations even on large data volumes. Currently, Nutch uses Hadoop 0.20.1, but will be upgrading to Hadoop 0.20.2 in version 1.1.

In NUTCH-650 you can find out more about the ongoing effort (and progress) to use HBase as the Nutch storage backend. This should simplify Nutch storage and make URL/page processing more efficient thanks to the features of HBase (data is mutable and indexed by keys/columns/timestamps).

Content processing

As we noted before, Nutch does a lot of content processing, like parsing (parsing downloaded content to extract text) and filtering (extracting only URLs which match filtering requirements). Nutch is moving away from its own processing tools and delegating content parsing and MIME type detection to Tika through the Tika plugin; you can read more in NUTCH-766.

Indexing and Searching

Nutch uses Lucene for indexing, currently version 2.9.1, but there is a push to upgrade Lucene to 3.0.1. Nutch 1.0 can also index directly to Solr, thanks to the Nutch-Solr integration. More on this, and on how using Solr with Nutch worked before the integration, can be found here. The integration now moves to Solr 1.4, because Solr 1.4 has StreamingUpdateSolrServer, which simplifies the way documents are buffered before being sent to the Solr instance. Another improvement in this integration was a change to SolrIndexer to commit only once after all reducers have finished; see NUTCH-799.
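
To illustrate why StreamingUpdateSolrServer helps, here is a minimal SolrJ sketch; the URL, field names, and buffer sizes are made-up illustrations rather than Nutch code. Documents are queued and streamed to Solr by background threads instead of being sent one request at a time, and a single commit is issued at the end, mirroring the SolrIndexer change from NUTCH-799.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexing {
  public static void main(String[] args) throws Exception {
    // Buffer up to 100 documents and send them with 2 background threads
    SolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 2);

    for (int i = 0; i < 10000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("title", "Document " + i);
      server.add(doc);   // queued and streamed, no per-document round trip
    }

    server.commit();     // commit once at the end
  }
}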

Some of the patches discussed here, and a number of other high-quality patches, were contributed by Julien Nioche, who was added as a Nutch committer in December 2009.

One more thing, keep an eye on an interesting thread about Nutch becoming a Top Level Project (TLP) at Apache.

Thank you for reading, and if you have any questions or comments, leave them below and we’ll respond promptly!

Lucene Digest, March 2010

Welcome to another edition of the monthly Lucene Digest post.

As reported by @lucene, Lucene and Solr have merged.  This pretty big change didn’t happen overnight.  As a matter of fact, the Lucene/Solr developers went through a pretty intense and heated discussion and several rounds of voting before deciding on this.  The decision was not unanimous.  If you feel like reading the lengthy discussions, you can find them all on Search-Lucene.com.  The discussions took place on the general@lucene list.  Happy reading!  We should note that the upcoming Lucene Meetup will include an explanation of the Lucene/Solr merger.

This is not the only big change that happened in Luceneland this month.  There is a whole series of similar changes in various states:

  • Tika is going TLP (Top Level Project in Apache Software Foundation parlance).  You can see the brief discussion followed by the unanimous voting.
  • Mahout is also going TLP.  That discussion already happened last month and a few less interesting details were discussed this month.  In short, Mahout is moving out from under Lucene and becoming its own TLP.
  • The Nutch TLP discussion has been started, but no conclusions have been reached.  Nutch needs more love and attention.

By the time we publish the Lucene April Digest the Lucene landscape may look quite different.  We’ll make sure to update you on what happened, what went where, and how things are going to look in the future.  Got questions or suggestions?  Please leave them in comments and thank you for reading.

Lucene Digest, February 2010

Publishing a February Lucene Digest in March?  Nonsense, ha?  Blame it on Sematext keeping us busy and on the short month of February.  At least we got the Solr Digest and HBase Digest out in time!

  • Well, the first thing to know about Lucene now is that Lucene 2.9.2 and 3.0.1 have been released.  They contain fixes you’ll want to have, so go grab a new release.  While you are at it, you may also be interested in seeing a discussion about Lucene upgrades that emerged after the release announcement was made a few days ago.
  • Guess what?  Lucene in Action, 2nd edition is in production.  What this means is that the LIA2 authors are done working on the manuscript and that the Manning Publications people are preparing it for print and distribution.  You got your MEAP already, right?  LIA2 covers the Lucene 3.* API.
  • If Lucene in Action 2nd ed is not enough for you, note that another Lucene book is in the works: Lucene in Practice.  Hey, John Wang, Jake & Co., does LIP have a URL?  For now, the only URL I have is to the LIP source code: http://code.google.com/p/lucene-book/.
  • If you missed the popular Lucandra post, have a look at it now.  Lucandra from @tjake looks interesting.  Note that we’ll have a talk about Lucandra soon – keep an eye on the NY Search & Discovery Meetup.
  • Lucene developers are a super disciplined bunch.  Look how well unit-tested Lucene is in the Lucene Clover Report.
  • LinkedIn is one of the bigger Lucene users out there, and they’ve been publishing a lot about that.  Zoie and Bobo Browse are two projects you’ll see covered in Lucene in Action 2, but here is a LinkedIn Search presentation.
  • Search, search, and more search: more and more search frameworks are being built on top of Lucene.  Solr is the biggest and most well known, of course, but certainly not the only one:
  • And talking about Katta (Lucene – or Hadoop MapFiles, or any content which can be split into shards – in the cloud), the new release is out:

The key changes in the 0.6 release, among dozens of bug fixes:
– upgrade lucene to 3.0
– upgrade zookeeper to 3.2.2
– upgrade hadoop to 0.20.1
– generalize katta for serving shard-able content (lucene is one implementation, hadoop mapfiles another one)
– basic lucene field sort capability
– more robust zookeeper session expiration handling
– throttling of shard deployment (kb/sec configurable) to have a stable search while deploying
– load test facility
– monitoring facility
– alpha version of web-gui

See the full list of changes at
http://oss.101tec.com/jira/secure/ReleaseNote.jspa?projectId=10000&styleName=Html&version=10010

  • Full-text search and spatial search go hand in hand.  Both Lucene and Solr have seen work in the spatial search area, and now a new Apache project called Spatial Information Systems (SIS) has been proposed and approved.  SIS will enter the Apache Software Foundation via the Incubator.