Lucene and Solr: 2010 in Review

Lucene has been around for 10+ years and Solr for 4+ years.  It’s amazing that, mature as these tools are, there is still very rapid development and improvement going on.  We are not talking about polishing of the APIs or minor tweaks here and there, but serious development in the heart of both of these tools.  Knowing this, it’s even more amazing to hear commercial search vendors spread FUD about tools like Lucene or Solr not being ready for serious business, large scale, high performance, etc.  Those 5,000-6,000 daily downloads of Lucene/Solr/Nutch/etc. (see the graph, scroll down on the page) must be from people who simply don’t know better than to download this free stuff…

But let’s not go down that path.  Below are some of the Lucene & Solr highlights from 2010.

The Merge

The Lucene and Solr code bases were merged early in 2010.  The development mailing lists were merged as well, but the user lists remained separate, as did release artifacts.  The code repository went through a major reorganization that added a “modules” section, which currently hosts only the analysis package (numerous analyzers, tokenizers, and stemmers – over 400 Java classes so far).  Why is this good?  Because tools like our Key Phrase Extractor can now depend on just the analysis jar instead of the whole Lucene jar if all they really need is, say, access to Lucene’s tokenizers.  In short, things are working out well after the merge.
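For example, a tool that only needs tokenization can pull in just the analysis jar.  Here is a minimal sketch of consuming a Lucene TokenStream, assuming the Lucene 3.0-era API (on trunk the attribute classes are evolving, e.g. TermAttribute is giving way to CharTermAttribute, so details may differ):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class TokenizeSketch {
      public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // Tokenize a snippet of text; the field name is only used to pick analysis rules.
        TokenStream stream = analyzer.tokenStream("body",
            new StringReader("Lucene and Solr merged in 2010"));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
          System.out.println(term.term());
        }
        stream.close();
      }
    }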

Code, Releases…

In 2010 Lucene saw six releases: 3.0.1, 3.0.2, and 3.0.3 on the 3.x line, plus 2.9.2, 2.9.3, and 2.9.4 on the 2.9 line.  Solr 1.4.1 was released, too.  The Subversion repository got some new branches, which essentially means parallel development at an increased pace, more experimentation, more freedom to change the code, etc.  Ultimately it’s the users of Lucene and Solr who reap the major benefits of this.  In 2011 we’ll most likely see Lucene 4.0, as well as the SolrCloud version of Solr, both of which will bring speed improvements, a lower memory footprint, flexible indexing, and a bunch of other good stuff that we’ll write about in our Lucene Digests and Solr Digests on this blog in 2011.

Top Level Projects, Incubator, New Sub-Project

Three former Lucene sub-projects became Top Level Projects: Mahout, Nutch, and Tika.  Mahout 0.3 and 0.4 were released.  Nutch 1.1 and 1.2 were released, and work is under way to get Nutch 2.0 out in 2011.  Nutch 2.0 includes some major improvements, such as the use of HBase for storage.  After some semi-stagnation, it feels like Nutch is getting more love from contributors and developers again.  Tika is developing rapidly and releasing rapidly, too, with releases 0.6, 0.7, and 0.8 happening in 2010 and 0.9 already being mentioned on the mailing list.

The Lucene ecosystem got a new sub-project in 2010: ManifoldCF (previously known as Lucene Connectors Framework).  The code was donated by MetaCarta and includes connectors for various enterprise data sources, such as Microsoft SharePoint and EMC Documentum, as well as the file system, the Web, and RSS and Atom feeds.  Importantly, ManifoldCF includes a security model and has the ability to index documents with Solr.

At the same time, Lucy (the Lucene C port) went to the Incubator, and Lucene.Net is on its way there, too.  In short, both projects need to work on building a more active development community.

Conferences

Lucene Eurocon, held in Praha last May, was the first Lucene-focused conference.  It was followed by Lucene Revolution in Boston in October, where we presented how we built search-lucene.com and search-hadoop.com.

Books

Lucene in Action, 2nd edition, was published by Manning, and a book on Solr was published by Packt.  Mahout in Action is nearly done, and Tika in Action is in the works.  A book on Nutch is also getting started.

Search-Lucene.com

We built the Lucene/Solr-powered search-lucene.com and its sister site search-hadoop.com, where one can search all mailing list archives, JIRA issues, source code, javadoc, wiki, and web site content for all (sub-)projects at once.  One can also facet on sub-projects, data sources, and authors; get short links to any mailing list message, handy for sharing; view mailing list messages in threaded or non-threaded view; and see search terms highlighted not only on the search results page, but also in the mailing list messages themselves (click on that “book on Nutch” link above for an example).

If you’d like to keep up with Lucene and Solr news in 2011, as well as keep an eye on Nutch, Mahout, and Tika, you can follow @lucene on Twitter – a low-volume source of key developments in these projects.

Nutch Digest, May 2010

With May almost over, it’s time for our regular monthly Nutch Digest.  As you know, Nutch became a top-level project, and the first related changes are already visible.  The Nutch site has moved to its new home at nutch.apache.org, and the mailing lists (user and dev) have moved to new addresses: user@nutch.apache.org and dev@nutch.apache.org.  Old subscriptions were automatically moved to the new lists, so subscribers don’t have to do anything (apart from changing mail filters, perhaps).  Even though Nutch is no longer a Lucene sub-project, we will continue to index its content (mailing lists, JIRA, wiki, web site) and keep it searchable over on Search-Lucene.com.  We’ve helped enough organizations with their Nutch clusters that it makes sense for us at Sematext, at least, to have a handy place to find all things Nutch.

There is a vote under way on an updated release candidate (RC3) for the Apache Nutch 1.1 release.  The major differences between this candidate and RC2 are several bug fixes (NUTCH-732, NUTCH-815, NUTCH-814, and NUTCH-812) and one improvement (NUTCH-816).  Nutch 1.1 is expected shortly!

Among the changes related to Nutch development during May, it’s important to note that Nutch will be using Ivy in its builds, and that there is one change regarding Nutch’s language identification: the code for Japanese changed from “jp” to “ja”.  The former is Japan’s country code, while the latter is the language code for Japanese.

There’s been a lot of talk on the Nutch mailing list about the first book on Nutch, which Dennis Kubes has started writing.  We look forward to reading it, Dennis!

Nutch developers were busy with the TLP-related changes and with preparations for the Nutch 1.1 release this month, so this Digest is a bit thinner than usual.  Follow @sematext on Twitter.

Nutch Digest, April 2010

In the first part of this Nutch Digest we’ll go through new and useful features of the upcoming Nutch 1.1 release, while in the second part we’ll focus on developments and plans for the next big Nutch milestone, Nutch 2.0.  But let’s start with a few informational items.

  • Nutch has been approved by the ASF board to become a Top Level Project (TLP) in the Apache Software Foundation.  The changing of Nutch mailing lists, URL, etc. will start soon.

Nutch 1.1 will be officially released any day now, so here is a walk-through of the Nutch 1.1 release features:

  • Nutch release 1.1 uses Tika 0.7 for parsing and MimeType detection
  • Hadoop 0.20.2 is used for job distribution (Map/Reduce) and for the distributed file system (HDFS)
  • On the indexing and search side, Nutch 1.1 uses either Lucene 3.0.1 with its own search application or Solr 1.4
  • Some of the new features included in release 1.1 were discussed in a previous Nutch Digest. For example, the alternative generator, which can generate several segments in one pass over the crawlDB, is included in release 1.1. We used a flavour of this patch in our most recent Nutch engagement, which involved a super-duper vertical crawl. Also, the improvement to SolrIndexer, which now commits only once after all reducers have finished, is included in Nutch 1.1.
  • Some of the new and very useful features have not been mentioned before. For example, Fetcher2 (now renamed to just Fetcher) was changed to implement Hadoop’s Tool interface. With this change it is possible to override parameters from configuration files, like nutch-site.xml or hadoop-site.xml, on the command line (see the sketch after this list).
  • If you’ve done some focused or vertical crawling, you probably know that one or a few unresponsive hosts can slow down the entire fetch, so one very useful feature added to Nutch 1.1 is the ability to skip queues (which roughly translate to hosts) for URLs getting repeated exceptions. We made good use of that here at Sematext, in the Nutch project we just completed in April 2010.
  • Another improvement included in the 1.1 release, related to Nutch-Solr integration, comes in the form of an improved Solr schema that allows field mapping from Nutch to the Solr index.
  • One useful addition to Nutch’s injector is new functionality that allows users to inject metadata into the CrawlDB. Sometimes you need additional data, related to each URL, to be stored. Such external knowledge can later be used (e.g. indexed) by a custom plug-in. If we can all agree that storing arbitrary data in the CrawlDB (with the URL as the primary key) can be very useful, then migration to database-oriented storage (like HBase) is only a logical step. This makes a good segue to the second part of this Digest…
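To illustrate the Tool change mentioned in the list above, here is a minimal sketch of the pattern that Hadoop’s Tool interface and ToolRunner enable; the class and its use of the fetcher.threads.fetch property are only for illustration, not the actual Fetcher code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class FetcherToolSketch extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        // ToolRunner has already merged any -D key=value arguments into this
        // Configuration, overriding values from nutch-site.xml / hadoop-site.xml.
        Configuration conf = getConf();
        int threads = conf.getInt("fetcher.threads.fetch", 10);
        System.out.println("Would fetch with " + threads + " threads");
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new FetcherToolSketch(), args));
      }
    }

With the real Fetcher, the same mechanism should allow something like bin/nutch fetch -Dfetcher.threads.fetch=20 <segment>, with no configuration file edits needed.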

In the second half of this Digest we’ll focus on the future of Nutch, starting with Nutch 2.0.  Plans and ideas for the next Nutch release can be found on the mailing list under the Nutch 2.0 roadmap thread and on the official wiki page.

Nutch is slowly replacing some of its home-grown functionality with best-of-breed products: it uses Tika for parsing, Solr for indexing/searching, and HBase for storing various types of data.  The migration to Tika is already included in the Nutch 1.1 release, and the exclusive use of Solr as the (enterprise) search engine makes sense.  For months we have been telling clients and friends that we predicted Nutch would deprecate its own Lucene-based search web application in favour of Solr, and that time has finally come.  Solr offers much more functionality, configurability, performance, and ease of integration than Nutch’s simple search web application.  We are happy Solr users ourselves – we use it to power search-lucene.com.

Storing data in HBase instead of directly in HDFS has all of the usual benefits of storing data in a database instead of a file system.  Structured (fetched and parsed) data is not split into segments (file system directories), so data can be accessed easily and time-consuming segment merges can be avoided, among other things.  As a matter of fact, we are about to engage in a project that involves this exact functionality: the marriage of Nutch and HBase.  Naturally, we are hoping we can contribute this work back to Nutch, possibly through NUTCH-650.
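As an illustration of the access pattern this enables, here is a minimal sketch against the HBase 0.20-era client API; the “webpage” table and “p” column family are hypothetical, not the actual Nutch schema:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WebpageStoreSketch {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "webpage");

        // Update a single page in place; no segment rewrite required.
        Put put = new Put(Bytes.toBytes("http://example.com/"));
        put.add(Bytes.toBytes("p"), Bytes.toBytes("status"), Bytes.toBytes("fetched"));
        table.put(put);
        table.flushCommits();

        // Random access by URL, the row key.
        Result row = table.get(new Get(Bytes.toBytes("http://example.com/")));
        System.out.println(Bytes.toString(
            row.getValue(Bytes.toBytes("p"), Bytes.toBytes("status"))));
      }
    }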

Of course, when you add a persistence layer to an application, there is always the question of whether it is acceptable for it to be tied to one back-end (database) or whether it is better to have an ORM layer on top of the datastore.  Such an ORM layer would allow different backends to be used to store the data.  And guess what?  Such an ORM, initially focused on HBase and Nutch, and later on Cassandra and other column-oriented databases, is in the works already!  Check out the evaluation of ORM frameworks that support non-relational column-oriented datastores and RDBMSs, and the development of an ORM framework that, while initially using Nutch as the guinea pig, already lives its own decoupled life over at http://github.com/enis/gora.

That’s all from us on Nutch’s present and future for this month; stay tuned for more Nutch news next month!  And of course, as usual, feel free to leave any comments or questions – we appreciate any and all feedback.  You can also follow @sematext on Twitter.

Nutch Digest, March 2010

This is the first post in the Nutch Digest series, so a little introduction to Nutch seems in order.  Nutch is a multi-threaded and, more importantly, distributed Web crawler with distributed content processing (parsing, filtering), a full-text indexer, and a search runtime.  Nutch is at version 1.0, and the community is now working towards a 1.1 release.  Nutch is a large-scale, flexible Web search engine whose work involves several types of operations.  In this post we’ll present new features and mailing list discussions as we describe each of these operations.

Crawling

Nutch starts crawling from a given “seed list” (a list of seed URLs) and iteratively follows useful/interesting outlinks, thus expanding its link database.  When talking about Nutch as a crawler, it is important to distinguish between two different approaches: focused or vertical crawling, and whole Web or wide crawling.  Each approach has a different set-up and different issues that need to be addressed.  At Sematext we’ve done both vertical and wide crawling.

When using Nutch at large scale (whole Web crawling, dealing with e.g. billions of URLs), generating a fetchlist (a list of URLs to crawl) from the crawlDB (the link database) and updating the crawlDB with new URLs tends to take a lot of time.  One solution is to limit such operations to a minimum by generating several fetchlists in one pass over the crawlDB and then updating the crawlDB only once for several segments (a segment being the set of data generated by a single fetch iteration).  An implementation of a Generator that generates several fetchlists at once was created in NUTCH-762.  To find out whether this feature will be included in the 1.1 release, and when version 1.1 will be released, check here.

One more issue related to whole Web crawling that often pops up on the Nutch mailing list is crawling of authenticated websites, so here is the wiki page on this subject.

When using Nutch for vertical/focused crawls, one often ends up with very slow fetch performance at the end of each fetch iteration.  An iteration typically starts with a high fetch speed, but the speed drops significantly over time and keeps dropping, and dropping, and dropping.  This is a known problem.  It is caused by the fetch run containing a small number of sites, some of which may have many more pages than others and may be much slower than others.  Crawler politeness (the crawler politely waits before hitting the same domain again), combined with the fact that the number of distinct domains in the fetchlist often drops rapidly during one fetch, causes the fetcher to spend much of its time waiting.  For example, with a politeness delay of 5 seconds and only 3 distinct hosts left in the fetchlist, throughput cannot exceed 0.6 pages per second, no matter how many fetcher threads are running.  More on this and on overall Fetcher2 performance (Fetcher2 is the default fetcher in Nutch 1.0) can be found in NUTCH-721.

To be able to support whole Web crawling, Nutch also needs a scalable data processing mechanism.  For this purpose Nutch uses Hadoop’s MapReduce for processing and HDFS for storage.

Content storing

Nutch uses Hadoop’s HDFS as fully distributed storage; HDFS creates multiple replicas of data blocks and distributes them across compute nodes throughout a cluster to enable reliable and fast computations even on large data volumes.  Currently Nutch uses Hadoop 0.20.1, but it will be upgrading to Hadoop 0.20.2 in version 1.1.

In NUTCH-650 you can find out more about the ongoing effort (and progress) to use HBase as the Nutch storage backend.  This should simplify Nutch’s storage and make URL/page processing more efficient thanks to HBase’s features (data is mutable and indexed by keys/columns/timestamps).

Content processing

As we noted before, Nutch does a lot of content processing, like parsing (processing downloaded content to extract text) and filtering (keeping only URLs that match the filtering requirements).  Nutch is moving away from its own processing tools and delegating content parsing and MimeType detection to Tika via the Tika plugin; you can read more in NUTCH-766.
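For reference, here is roughly what delegating parsing and MimeType detection to Tika looks like, as a sketch against the Tika 0.7-era API (not the actual plugin code):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaParseSketch {
      public static void main(String[] args) throws Exception {
        InputStream stream = new FileInputStream(args[0]);
        AutoDetectParser parser = new AutoDetectParser(); // detects the MimeType and picks a parser
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata, new ParseContext());
        System.out.println(metadata.get(Metadata.CONTENT_TYPE)); // detected MIME type
        System.out.println(handler.toString());                  // extracted plain text
        stream.close();
      }
    }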

Indexing and Searching

Nutch uses Lucene for indexing, currently version 2.9.1, though there is a push to upgrade Lucene to 3.0.1.  Nutch 1.0 can also index directly to Solr, thanks to the Nutch-Solr integration.  More on this, and on how using Solr with Nutch worked before the Solr-Nutch integration, can be found here.  The integration now upgrades to Solr 1.4, because Solr 1.4 has a StreamingUpdateSolrServer, which simplifies the way documents are buffered before being sent to the Solr instance.  Another improvement in this integration was a change to SolrIndexer to commit only once after all reducers have finished (NUTCH-799).
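A minimal sketch of what this looks like with SolrJ 1.4 (the URL and field names here are made up, not Nutch’s actual schema):

    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexSketch {
      public static void main(String[] args) throws Exception {
        // Buffer up to 100 documents and send them using 4 background threads.
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/");
        doc.addField("content", "page text goes here");
        server.add(doc);

        // Commit once at the end, not after every document or batch.
        server.commit();
      }
    }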

Some of the patches discussed here, along with a number of other high-quality patches, were contributed by Julien Nioche, who was added as a Nutch committer in December 2009.

One more thing, keep an eye on an interesting thread about Nutch becoming a Top Level Project (TLP) at Apache.

Thank you for reading!  If you have any questions or comments, leave them in the comments and we’ll respond promptly!

Hiring Lucene, Solr, Nutch, Hadoop, NLP People

Hear, hear!

We are looking for people passionate about search/information retrieval, natural language processing, machine learning, text analytics, recommendation engines, and related topics.  Please see the Sematext Jobs page for a bit more information.  If you enjoy working with Lucene, Solr, Nutch, Hadoop, HBase, or any of the other technologies listed on the Sematext Jobs page or, more generally, you enjoy working in any of the related fields, please get in touch.

We are a small, private company based in New York City, with people on multiple continents and clients from all around the globe.