January 2011

Just the other day, we posted the Lucene & Solr highlights in our Lucene & Solr: 2010 in Review post, and now it’s time to really conclude 2010 in the Solr world with the December Solr Digest. Although one might expect the festive period to take its toll on Solr development velocity, that wasn’t the case at all. Open source never sleeps. Here are the most interesting highlights:

Interesting features in development

In our July Digest we mentioned the LanguageIdentifierUpdateProcessor feature, which is being proposed under JIRA issue SOLR-1979. Patches are starting to appear on that issue, so if you’re interested in this feature, take a look.
Solr’s spatial capabilities are being further refined. With issue SOLR-2268, Solr will get “support for Point in Polygon searches”. This should enable features like “for a given point, return all documents which contain a polygon inside of which that point lies” and “for a given polygon, return all documents which have a point contained inside of that polygon”. Of course, negated versions of these searches will be supported too. The work is in its early stages: one patch is attached, but it should be treated only as a general pointer to how this will be implemented, nothing more.
Support for the “ColognePhonetic” encoder is being added to PhoneticFilterFactory. Since “ColognePhonetic” will be added to Commons Codec 1.4.1, the patch provided in SOLR-2276 will wait until that version is released.
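To make the two Point in Polygon queries described above more concrete, here is a minimal Python sketch. It is not taken from the SOLR-2268 patch and says nothing about how Solr will implement the feature; it uses the classic ray-casting test on flat x/y coordinates, and the function names and the in-memory "documents" are hypothetical, for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: cast a horizontal ray to the right of `point`
    and count edge crossings; an odd count means the point is inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def docs_with_point_inside(polygon: Sequence[Point],
                           doc_points: Dict[str, Point]) -> List[str]:
    """'For a given polygon, return all documents which have a point inside it.'"""
    return [doc_id for doc_id, pt in doc_points.items()
            if point_in_polygon(pt, polygon)]


def docs_with_polygon_around(point: Point,
                             doc_polygons: Dict[str, Sequence[Point]]) -> List[str]:
    """'For a given point, return all documents whose polygon contains it.'"""
    return [doc_id for doc_id, poly in doc_polygons.items()
            if point_in_polygon(point, poly)]


# Hypothetical data: a 4x4 square and three documents with one point each.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
doc_points = {"a": (1.0, 1.0), "b": (5.0, 1.0), "c": (3.5, 3.9)}
print(docs_with_point_inside(square, doc_points))  # ['a', 'c']
```

The negated versions mentioned above would simply invert the test in each list comprehension. Real geographic data also needs care around the date line and the poles, which this flat-plane sketch ignores.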

Interesting new features

Solr is getting JOIN functionality, sort of. As part of SOLR-2272, Solr got a working patch that provides SQL JOIN-like functionality. Of course, this is not exactly the JOIN you might know from SQL, but it is probably the closest thing to it that can be implemented in Solr. It is likely that this feature will be integrated into Lucene as well, since it makes no sense to have it strictly in Solr. It is also likely that this feature will be expanded in the future; currently it has only one algorithm and supports the many-to-many type of JOIN.
As part of SolrCloud, a new feature, SolrCloud distributed indexing, will be added to Solr some day. SOLR-2293 will likely be the JIRA home for this feature. Before SolrCloud, anyone using distributed indexing had to write custom logic to handle the distribution of documents over the various shards in the cluster. With SolrCloud, this will be transparent to clients. SolrCloud will also include some out-of-the-box distribution algorithms, and plugging in custom algorithms should be easy. However, don’t hold your breath waiting for this feature. At the moment, there is only a JIRA issue (and some guidelines in the Wiki) related to it.
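For readers wondering what the “custom logic” for distributing documents over shards typically looks like today, here is a minimal Python sketch of one common approach: hashing the document id to pick a shard. This is only an illustration under our own assumptions, not the algorithm SolrCloud will ship; the shard URLs and function names are hypothetical, and actually sending each batch to its shard is left out.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used to keep the mapping stable across runs.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(docs: Iterable[dict],
              shard_urls: Sequence[str]) -> Dict[str, List[dict]]:
    """Group documents into one batch per shard URL."""
    batches: Dict[str, List[dict]] = defaultdict(list)
    for doc in docs:
        url = shard_urls[shard_for(doc["id"], len(shard_urls))]
        batches[url].append(doc)
    return dict(batches)


# Hypothetical shard URLs; each batch would be posted to its shard's
# update handler by the indexing client.
SHARDS = [
    "http://shard1.example.com:8983/solr",
    "http://shard2.example.com:8983/solr",
    "http://shard3.example.com:8983/solr",
]
docs = [{"id": f"doc-{i}", "title": f"Document {i}"} for i in range(10)]
for url, batch in partition(docs, SHARDS).items():
    print(url, [d["id"] for d in batch])
```

Because the same id always hashes to the same shard, updates and deletes for a document land where the original was indexed. The weakness of plain modulo hashing is that changing the number of shards remaps most documents, which is one reason a built-in, pluggable distribution scheme is attractive.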

Miscellaneous

Solr’s Jetty is now upgraded to the latest version, 6.1.26. The change was committed to the 3_x branch and trunk but, of course, that doesn’t mean you have to upgrade your Jetty too. It just means that the files (various jars and a few XML files) under /solr/example got upgraded. Details of this change can be found in SOLR-2265.
Did you experience any problems with DataImportHandler and its multi-threaded option? If so, you are not the only one. More details about the nature of the problem can be found in SOLR-2186, and it appears that SOLR-2233 already contains a patch that might help in the case of a JDBC data source. That patch contains a few other DataImportHandler multithreading fixes. Nothing related to this issue has been committed to trunk or the 3_x branch yet.
Many problems people encounter with Solr are related to OutOfMemory errors. There were many interesting discussions on the mailing list in December; two on this topic that you might find interesting are “OutOfMemory GC: GC overhead limit exceeded – Why isn’t WeakHashMap getting collected?” and “Memory use during merges (OOM)”.
If you find the “bf” parameter somewhat limited (it doesn’t accept complex nested expressions with lots of whitespace), you might find the patch from SOLR-2014 useful. It appears that this is a common problem: another similar bug report (although this actually isn’t a bug), SOLR-2267, was also opened recently. However, keep in mind that the “bf” parameter will likely be deprecated in the future, since “bq” can be used to achieve everything “bf” can, and more.
In our November Digest, we mentioned SOLR-2154 and Solr’s problem with multi-valued spatial fields. What we missed was SOLR-2155, which actually provides a patch for such problems. Although it isn’t committed, the patch appears to be functional, so give it a try.
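Coming back to the “bf” whitespace limitation mentioned above: until a patch is committed, one client-side workaround is to strip whitespace from the function expression before sending it. The Python sketch below assumes the limitation comes from the “bf” value being split on whitespace, and that the expression contains no quoted string literals; the helper names and the last_modified field are hypothetical.

```python
from urllib.parse import urlencode


def compact_function(expr: str) -> str:
    """Remove all whitespace from a function expression.

    Assumes the expression has no quoted string literals whose
    spaces would need to be preserved.
    """
    return "".join(expr.split())


def dismax_params(user_query: str, boost_function: str) -> str:
    """Build a URL-encoded query string for a dismax request."""
    return urlencode({
        "defType": "dismax",
        "q": user_query,
        "bf": compact_function(boost_function),
    })


# A readable, whitespace-heavy expression (last_modified is a hypothetical field).
readable = "recip( ms(NOW, last_modified), 3.16e-11, 1, 1 )"
print(compact_function(readable))  # recip(ms(NOW,last_modified),3.16e-11,1,1)
print(dismax_params("solr digest", readable))
```

This keeps the readable form in your application code while sending a compact, whitespace-free function to Solr.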

And that would be all for Solr in 2010. We’ll be back with the new Solr Digest in a month. Follow @sematext for other interesting search news.

Lucene has been around for 10+ years and Solr for 4+ years. It’s amazing that, even though these tools are so mature, very rapid development and improvement is still going on. We are not talking about polishing APIs or minor tweaks here and there, but serious development in the heart of both tools. Knowing this, it’s even more amazing to hear commercial search vendors spread FUD about tools like Lucene or Solr not being ready for serious business, large scale, high performance, etc. Those 5000-6000 daily downloads of Lucene/Solr/Nutch/etc. (see the graph, scroll down on the page) must be from people who simply don’t know better than to download this free stuff…

But let’s not go down that path. Below are some of the Lucene & Solr highlights from 2010.

The Merge

The Lucene and Solr code bases were merged early in 2010. The development mailing lists merged, but the user lists remained separate, as did the release artifacts. The code repository went through a major reorganization, resulting in the addition of the “modules” section, which currently hosts only the analysis package. That package contains numerous analyzers, tokenizers, and stemmers: over 400 Java classes so far. Why is this good? Because tools like our Key Phrase Extractor can now use just the jar from the analysis package, instead of the whole Lucene jar, if all they really want is access to Lucene’s tokenizers, for example. In short, things are working out well after the merge.

Code, Releases…

In 2010 Lucene saw three 3.0.x releases (3.0.1, 3.0.2, and 3.0.3), as well as 2.9.2, 2.9.3, and 2.9.4. Solr 1.4.1 was released, too. The Subversion repository got some new branches, which essentially means parallel development at an increased pace, more experimentation, more freedom to change the code, etc. Ultimately it’s the users of Lucene and Solr who reap the major benefits from this. In 2011 we’ll most likely see Lucene 4.0, as well as a SolrCloud version of Solr, both of which will bring speed improvements, a lower memory footprint, flexible indexing, and a bunch of other good stuff that we’ll write about in our Lucene Digests and Solr Digests on this blog in 2011.

Top Level Projects, Incubator, New Sub-Project

Three former Lucene sub-projects became Top Level Projects: Mahout, Nutch, and Tika. Mahout 0.3 and 0.4 were released. Nutch 1.1 and 1.2 were released and work is under way to get Nutch 2.0 out in 2011. This new Nutch 2.0 includes some major improvements, such as great use of HBase. After some semi-stagnation, it feels like Nutch is getting some more love from contributors and developers. Tika is developing rapidly and also releasing rapidly with releases 0.6, 0.7, and 0.8 happening in 2010 and 0.9 being mentioned on the mailing list already.

The Lucene ecosystem got a new sub-project in 2010: ManifoldCF (previously known as Lucene Connectors Framework). The code was donated by MetaCarta and it includes connectors for various enterprise data sources, such as Microsoft SharePoint or EMC Documentum, as well as the file system, Web, or RSS and Atom feeds. Importantly, ManifoldCF includes a Security Model and has the ability to index documents with Solr.

At the same time, Lucy (the Lucene C port) went to the Incubator. Lucene.Net is on its way to the Incubator, too. In short, both projects need to work on building a more active development community.

Conferences

Lucene Eurocon, the first Lucene-focused conference, took place last May in Prague, followed by Lucene Revolution in October in Boston, where we presented how we built search-lucene.com and search-hadoop.com.

Books

Lucene in Action 2nd edition was published by Manning and a book on Solr was published by Packt. Mahout in Action is nearly done, and Tika in Action is in the works. A book on Nutch is also getting started.

Search-Lucene.com

We built search-lucene.com, a Lucene/Solr-powered site, and its sister site search-hadoop.com. There one can search all mailing list archives, JIRA issues, source code, javadoc, wiki, and web site content for all (sub-)projects at once; facet on sub-projects, data sources, and authors; get short links for any mailing list message, handy for sharing; and view mailing list messages in threaded or non-threaded view. Search terms are highlighted not only on the search results page, but also in the mailing list messages themselves (click on that “book on Nutch” link above for an example), etc.

If you’d like to keep up with Lucene and Solr news in 2011, as well as keep an eye on Nutch, Mahout, and Tika, you can follow @lucene on Twitter – a low volume source of key developments in these projects.