Mahout Digest, February 2010

Last month we published the Lucene Digest, followed by the Solr Digest, and wrapped the month with the HBase Digest (just before Yahoo posted their report showing Cassandra beating HBase in their benchmarks!).  We are starting February with a fresh Mahout Digest.

When covering Mahout, it seems logical to organize topics according to Mahout’s own groups of core algorithms, so that is the grouping we’ll follow in this post, too:

  • Recommendation Engine (Taste)
  • Clustering
  • Classification

There are, of course, some concepts common to all three and some overlap, n-grams being a good example.  Let’s talk about n-grams for a bit.

N-grams

There has been a lot of talk about n-gram usage across all of the major subject areas on the Mahout mailing lists. This makes sense, since n-gram-based language models are used in various areas of statistical Natural Language Processing.  An n-gram is a subsequence of n items from a given sequence of “items”. The “items” in question can be anything, though most commonly n-grams are made up of character or word/token sequences.  Lucene’s n-gram support, provided through NGramTokenizer, tokenizes an input String into character n-grams and can be useful when building character n-gram models from text. When there is an existing Lucene TokenStream and a character n-gram model is needed, NGramTokenFilter can be applied to the TokenStream.  Word n-grams are sometimes referred to as “shingles”.  Lucene helps there, too.  When word n-gram statistics or a word n-gram model are needed, ShingleFilter or ShingleMatrixFilter can be used.
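
For illustration, here is a minimal sketch of character n-gram extraction with NGramTokenizer (Java, written against the Lucene 2.9/3.0-era analysis API; the attribute class and constructor details may differ in other Lucene versions):

    import java.io.StringReader;

    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class CharacterNGramExample {
      public static void main(String[] args) throws Exception {
        // Break the input into character 2-grams and 3-grams
        NGramTokenizer tokenizer =
            new NGramTokenizer(new StringReader("mahout"), 2, 3);
        TermAttribute term = tokenizer.addAttribute(TermAttribute.class);
        while (tokenizer.incrementToken()) {
          // Prints ma, ah, ho, ou, ut, mah, aho, ... (ordering may vary by version)
          System.out.println(term.term());
        }
      }
    }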

Classification

Usage of character n-grams in the context of classification and, more specifically, the possibility of applying Naive Bayes to character n-grams instead of word/term n-grams is discussed here. Since the Naive Bayes classifier is a probabilistic classifier that can treat features of any type, there is no reason it could not be applied to character n-grams, too. Using a character n-gram model instead of a word model in text classification could result in more accurate classification of shorter texts.  Our language identifier is a good example of such a classifier (though it doesn’t use Mahout) and it provides good results even on short texts, so give it a try.
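
To make the idea concrete, here is a toy multinomial Naive Bayes classifier that uses character 3-grams as its features (plain Java with no Mahout dependency; uniform class priors and add-one smoothing are assumed, and all names are made up for illustration):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    /** Toy multinomial Naive Bayes over character 3-grams (illustration only). */
    public class CharNGramNaiveBayes {

      private final Map<String, Map<String, Integer>> gramCounts =
          new HashMap<String, Map<String, Integer>>();
      private final Map<String, Integer> totalGrams = new HashMap<String, Integer>();
      private final Set<String> vocabulary = new HashSet<String>();

      public void train(String label, String text) {
        Map<String, Integer> counts = gramCounts.get(label);
        if (counts == null) {
          counts = new HashMap<String, Integer>();
          gramCounts.put(label, counts);
          totalGrams.put(label, 0);
        }
        for (String gram : ngrams(text, 3)) {
          Integer c = counts.get(gram);
          counts.put(gram, c == null ? 1 : c + 1);
          totalGrams.put(label, totalGrams.get(label) + 1);
          vocabulary.add(gram);
        }
      }

      /** Picks the label maximizing the sum of log P(gram | label), add-one smoothed. */
      public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : gramCounts.keySet()) {
          double score = 0.0;
          for (String gram : ngrams(text, 3)) {
            Integer c = gramCounts.get(label).get(gram);
            double count = (c == null) ? 0.0 : c;
            score += Math.log((count + 1.0) / (totalGrams.get(label) + vocabulary.size()));
          }
          if (score > bestScore) {
            bestScore = score;
            best = label;
          }
        }
        return best;
      }

      private static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); i++) {
          grams.add(text.substring(i, i + n));
        }
        return grams;
      }
    }

Train it on a handful of labeled snippets per class and call classify() on a short string; character-level features hold up reasonably well even where word-level features would be too sparse.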

Clustering

Non-trivial word n-grams (aka shingles) extracted from a document can be useful for document clustering. Similar to the usage of a document’s term vectors, this thread proposes using non-trivial word n-grams as a foundation for clustering. For extraction of word n-grams, or shingles, from a document, Lucene’s ShingleAnalyzerWrapper is suggested. ShingleAnalyzerWrapper wraps the previously mentioned ShingleFilter around another Analyzer.  Since clustering (grouping similar items) is an example of unsupervised machine learning, it is always interesting to validate clustering results. In clustering there is no reference training set or, more importantly, reference test set, so evaluating how well a clustering algorithm works is not a trivial task. Although good clustering results are intuitive and often easy to evaluate visually, it is hard to implement an automated test. Here is an older thread about validating Mahout’s clustering output which resulted in an open JIRA issue.
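
A minimal sketch of the shingle extraction step (Java, Lucene 2.9/3.0-era contrib analyzers; the field name and sample text are made up):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class ShingleExample {
      public static void main(String[] args) throws Exception {
        // Wrap StandardAnalyzer so it also emits word n-grams up to 3 words long
        Analyzer analyzer = new ShingleAnalyzerWrapper(
            new StandardAnalyzer(Version.LUCENE_29), 3);
        TokenStream stream = analyzer.tokenStream(
            "body", new StringReader("apache mahout machine learning"));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
          // Prints unigrams and shingles: apache, apache mahout, apache mahout machine, ...
          System.out.println(term.term());
        }
      }
    }

Counting these shingles per document (and dropping the trivial, very frequent ones) yields the vectors that the thread suggests clustering on.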

Recommendation Engine

There is an interesting thread about content-based recommendation: what content-based recommendation really is and how it should be defined. So far, Mahout has only a Collaborative Filtering-based recommendation engine, called Taste.  Two different approaches are presented in that thread. One approach treats content-based recommendation as a Collaborative Filtering problem or a generalized Machine Learning problem, where item similarity is based on Collaborative Filtering applied to an item’s attributes or to ‘related’ users’ attributes (usual Collaborative Filtering treats an item as a black box).  The other approach is to treat content-based recommendation as a “generalized search engine” problem. Here, matching the same or similar queries is what makes two items similar. Just think of queries composed of, say, keywords extracted from a user’s reading or search history and this will start making sense.  If items have enough textual content, then content-based analysis (similar items are those that have similar term vectors) seems like a good approach for implementing content-based recommendation.  This is actually nothing novel (people have been (ab)using Lucene, Solr, and other search engines as “recommendation engines” for a while), but content-based recommendation is a recently discussed topic of possible Mahout expansion.

All algorithms in Mahout tend to run on top of Hadoop as MapReduce jobs, but in the current release Taste does not have a MapReduce version. You can read more about the MapReduce Collaborative Filtering implementation in Mahout’s trunk. If you are in need of a working recommendation engine (that is, a whole application built on top of the core recommendation engine libraries), have a look at Sematext’s recommendation engine.
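
For reference, a minimal non-distributed Taste setup looks roughly like this (Java, against the org.apache.mahout.cf.taste API as of the 0.2/0.3 timeframe; exact signatures may differ between releases, and the preferences.csv file with userID,itemID,preference lines is made up):

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteExample {
      public static void main(String[] args) throws Exception {
        // Preferences in CSV form: userID,itemID,preferenceValue
        DataModel model = new FileDataModel(new File("preferences.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 5 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 5);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }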

In addition to Mahout’s basic machine learning algorithms, there are discussions and development in directions which don’t fall under any of the above categories, such as collocation extraction. Phrase extractors often use a word n-gram model for co-occurrence frequency counts. Check the thread about collocations, which resulted in a JIRA issue and the first implementation. Also, you can find more details on how the log-likelihood ratio can be used in the context of collocation extraction in this thread.
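
The log-likelihood ratio test itself is easy to compute from a 2x2 contingency table of bigram counts. Here is a self-contained sketch of Dunning’s G² statistic as it is typically used for collocation scoring (plain Java, written here for illustration rather than copied from Mahout; the counts in main() are made up):

    /**
     * Log-likelihood ratio (Dunning's G^2) for a 2x2 contingency table:
     *   k11 = count of bigram (A B)
     *   k12 = count of bigrams (A, not B)
     *   k21 = count of bigrams (not A, B)
     *   k22 = count of all other bigrams
     * Higher scores indicate pairs that co-occur more often than chance.
     */
    public class LogLikelihoodRatio {

      public static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values caused by floating point rounding
        return 2.0 * Math.max(0.0, rowEntropy + colEntropy - matrixEntropy);
      }

      // Unnormalized Shannon entropy: xLogX(sum) - sum of xLogX(count)
      private static double entropy(long... counts) {
        long sum = 0;
        double xLogXSum = 0.0;
        for (long count : counts) {
          sum += count;
          xLogXSum += xLogX(count);
        }
        return xLogX(sum) - xLogXSum;
      }

      private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
      }

      public static void main(String[] args) {
        // Made-up co-occurrence counts for a candidate collocation
        System.out.println(llr(110, 2442, 111, 29114));
      }
    }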

Of course, anyone interested in Mahout should definitely read Mahout in Action (we got ourselves a MEAP copy recently) and keep an eye on the features planned for the upcoming 0.3 release.

HBase Digest, January 2010

Here at Sematext we are making more and more use of the Hadoop family of projects. We are expanding our digest post series with this HBase Digest, adding it to the existing Lucene and Solr coverage. For those of you who want to be up to date with HBase community discussions and benefit from the knowledge-packed discussions that happen in that community, but can’t follow all those high-volume Hadoop mailing lists, we also include a brief mailing list coverage.

  • HBase 0.20.3 has just been released. Nice way to end the month.  It includes fixes for a huge number of bugs, fixes for EC2-related issues, and a good number of improvements. HBase 0.20.3 uses the latest 3.2.2 version of ZooKeeper.  We should also note that another distributed and column-oriented database from Apache was released a few days ago, too – Cassandra 0.5.0.
  • An alternative indexed HBase implementation (HBASE-2037) was reported as completed (and included in 0.20.3). It speeds up scans by adding indexes to regions rather than secondary tables.
  • HBql was announced this month, too. It is an abstraction layer for HBase that introduces an SQL dialect and JDBC-like bindings for HBase, i.e. a more familiar API for HBase users. Thread…
  • Ways of integrating with HBase (instantiating HTable) on the client-side: Template for HBase Data Access (for integration with the Spring framework), simple Java Beans mapping for HBase, and the HTablePool class.  There is a basic client-side sketch after this list.
  • HbaseExplorer – an open-source web application that helps with simple HBase data administration and monitoring.
  • There was a discussion about the possibilities of splitting the process of importing very large data volumes into HBase into separate steps. Thread…
  • To get any parallelization, you have to start multiple JVMs in the current Hadoop version. Thread…
  • Tips for increasing the HBase write speed: use random int keys to distribute the load across RegionServers; use a multi-process client instead of a multi-threaded client; set a higher heap space in conf/hbase-env.sh and give it as much as you can without swapping; consider LZO compression to hold the same amount of data in fewer regions per server. Thread…
  • Some advice on hardware configuration for the case of handling a 40-50K records/sec write load. Thread…
  • A secondary index can go out of sync with the base table in case of I/O exceptions during commit (when using the transactional contrib). Handling of such exceptions in the transactional layer should be revised. Thread…
  • A Configuration instance (as well as an instance of HBaseConfiguration) is not thread-safe, so do not change it while sharing it between threads. Thread…
  • What is the minimal number of boxes for an HBase deployment? The thread covers both HA and non-HA options, which deployments can share the same boxes, etc. Thread…
  • Optimizing random reads: using client-side multi-threading will not improve reads greatly according to some tests, but there is an open JIRA issue HBASE-1845 related to batch operations. Thread…
  • Exploring possibilities for server-side data filtering. The thread discusses the classpath requirements for that and the options for hot-deploying filters. Thread…
  • How-to: configure a table to keep only one version of data (also shown in the sketch after this list). Thread…
  • Recipes and hints for scanning more than one table in a Map task. Thread…
  • Managing timestamps for Put and Delete operations to avoid unwanted overlap between them. Thread…
  • The amount of replication should have no effect on read performance, whether using a scanner or random access. Thread…
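
As referenced above, here is a bare-bones client-side sketch against the HBase 0.20-era Java API (the table, family, and qualifier names are made up; HBaseConfiguration picks up hbase-site.xml from the classpath, and HTablePool could be used instead of instantiating HTable directly). It creates a table whose column family keeps only one version of each cell, then writes and reads back a single cell:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
      public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();

        // Create the "events" table with a "d" family that keeps only one version per cell
        HBaseAdmin admin = new HBaseAdmin(conf);
        if (!admin.tableExists("events")) {
          HTableDescriptor descriptor = new HTableDescriptor("events");
          HColumnDescriptor family = new HColumnDescriptor("d");
          family.setMaxVersions(1);
          descriptor.addFamily(family);
          admin.createTable(descriptor);
        }

        // Write and read back a single cell
        HTable table = new HTable(conf, "events");
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("message"), Bytes.toBytes("hello hbase"));
        table.put(put);

        Result result = table.get(new Get(Bytes.toBytes("row-1")));
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("d"), Bytes.toBytes("message"))));
      }
    }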

Did you really make it this far down?  🙂 If you are into search, see January 2010 Digests for Lucene and Solr, too.

Please share your thoughts about the Digest posts as they come out. We would really like to know whether they are valuable.  Please tell us what format and style you prefer and, really, any other ideas you have would be very welcome.

UIMA Talk at NY Search & Discovery Meetup

This coming February, the NY Search & Discovery meetup will be hosting Dr. Pablo Duboue from IBM Research, who will give a talk about UIMA.

The following is an excerpt from the blurb about Pablo’s talk:

“… In this talk, I will briefly present UIMA basics before discussing full UIMA systems I have been involved in the past (including our Expert Search system in TREC Enterprise Track 2007). I will be talking about how UIMA supported the construction of our custom NLP tools. I will also sketch the new characteristics of the UIMA Asynchronous Scaleout (UIMA AS) subproject that enable UIMA to run Analysis Engines in thousands of machines… “

The talk/presentation will be on February 24, 2010.  Sign up (free) at http://www.meetup.com/NYC-Search-and-Discovery/calendar/12384559/ .

Solr Digest, January 2010

Similar to our Lucene Digest post, we’ll occasionally cover recent developments in the Solr world. Although Solr 1.4 was released only two months ago, work on new features for 1.5 (located in the svn trunk) is in full swing:

1) GeoSpatial search features have been added to Solr. The work is based on the Local Lucene project, which was donated to the Lucene community a while back and converted to a Lucene contrib module. The features to be incorporated into Solr will allow one to implement things like:

  • filter by a bounding box – find all documents that match some specific area
  • calculate the distance between two points
  • sort by distance
  • use the distance to boost the score of documents (this is different from sorting by distance and would be used when you want other factors to affect the ordering of documents as well)

The main JIRA issue is SOLR-773, and it is scheduled to be implemented in Solr 1.5.

2) Solr Autocomplete component – Autocomplete functionality is a very common requirement. Solr itself doesn’t have a component that provides such functionality (unlike spellchecking, which SpellCheckComponent provides, although in a limited form; more on this in another post or, if you are curious, have a look at the DYM ReSearcher). There are a few common approaches to solving this problem, for instance, by using the recently added TermsComponent (for more information, check this excellent showcase by Matt Weber).  These approaches, however, have some limitations. The first limitation that comes to mind is spellchecking and correction of misspelled words. You can see such a feature in Firefox’s Google search bar: if you type “Washengton”, you’ll still get “Washington” offered, while the TermsComponent (and some other) approaches fail here.
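
For illustration, with the stock /terms handler enabled, the TermsComponent approach boils down to a prefix request like the following (the host, field name, and prefix are made up):

    http://localhost:8983/solr/terms?terms=true&terms.fl=title&terms.prefix=wash&terms.limit=10

The response is simply a list of indexed terms from the title field starting with “wash”, together with their document frequencies, which the front-end can render as suggestions.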

The aim of SOLR-1316 is to provide an autocomplete component in Solr out of the box. Since it is still in development, you can’t quite rely on it yet, but it is scheduled to be released with Solr 1.5. In the meantime, you can check another Sematext product which offers AutoComplete functionality with a few more advanced features. It is a constantly evolving product whose features have been heavily based on real-life/customer feedback.  It uses an approach unlike that of TermsComponent and is, therefore, faster. Also, one of the features currently in development (and soon to be released) is spell-correction of queries.

3) One very nice addition in Solr 1.5 is edismax, or Extended Dismax. Dismax is very useful for situations where you want to let searchers just enter a few free-text keywords (think Google), without field names, logical operators, etc.  Extended Dismax was contributed thanks to Lucid Imagination.  Here are some of its features:

  • Supports the full Lucene query syntax in the absence of syntax errors
  • Supports “and”/”or” to mean “AND”/”OR” in Lucene syntax mode
  • When there are syntax errors, improved smart partial escaping of special characters is done to prevent them; in this mode, fielded queries, +/-, and phrase queries are still supported
  • Improved proximity boosting via word bi-grams; this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field
  • Advanced stopword handling: stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part; if a query consists of all stopwords (e.g. to be or not to be), then all will be required
  • Supports the “boost” parameter, which is like the dismax bf param but multiplies the function query instead of adding it in
  • Supports pure negative nested queries, so a query like +foo (-foo) will match all documents

You can check the development in JIRA issue SOLR-1553.
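
As a rough illustration, with the trunk version of this parser a request might look like the following (defType=edismax selects the parser; the field names and the popularity-based boost function are made up):

    http://localhost:8983/solr/select?defType=edismax&q=solr+digest&qf=title^3+body&boost=log(popularity)

Here qf spreads the free-text query across the title and body fields (boosting title matches), and the boost parameter multiplies the relevance score by a function query, as described above.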

4) Up until version 1.5, distributed Solr deployments depended not only on Solr features (like replication or sharding), but also on some external systems, like load-balancers. Now, ZooKeeper is used to provide a Solr-specific naming service (check JIRA issue SOLR-1277 for the details). The features we’ll get look exciting:

  • Automatic failover (i.e. when a server fails, clients stop trying to index to or search it and use a different server)
  • Centralized configuration management (i.e. new solrconfig.xml or schema.xml propagate to a live Solr cluster)
  • Optionally allow shards of a partition to be moved to another server (i.e. if a server gets hot, move the hot segments out to cooler servers).

We’ll cover some of these topics in more detail in future installments. Thanks for reading!

Lucene Digest, January 2010

In this debut Lucene Digest post we’ll cover some of the recent interesting happenings in Lucene-land.

  • For anyone who missed it, Lucene 3.0.0 is out (November 2009).  The difference between 3.0.0 and the previous release is that 3.0.0 has no deprecated code (a big cleanup job) and that support for Java 1.4 was finally dropped!
  • As of this writing, there is only 1 JIRA issue targeted for 3.0.1 and 104 targeted for 3.1 Lucene release.
  • Luke goes hand in hand with Lucene, and Andrzej quickly released a version of Luke that uses the Lucene 3.0.0 jars.  There were only minor changes in the code (and none in functionality), related to the changes in API between Lucene 2.9.1 and 3.0.
  • As usual, the development is happening on Lucene’s trunk, but there is now also a “flex branch” in svn, where features related to Flexible Indexing (not quite up to date) are happening.  Lucene’s JIRA also has a Flex Branch “version”, and as of this writing there are 6 big issues being worked on there.
  • The new Lucene Connectors Framework (or LCF for short) subproject is in the works.  As you can guess from the name, LCF aims to provide a set of connectors to various content repositories, such as relational databases, SharePoint, EMC Documentum, file systems, Windows file shares, various Content Management Systems, and even feeds and web sites.  LCF is based on a code donation from MetaCarta and will be going through the ASF Incubator. We expect it to graduate from the Incubator into a regular Lucene TLP subproject later this year.
  • Spatial search is hot, or at least that’s what it looks like from inside Lucene/Solr.  Both projects’ developers and contributors are busy adding support for spatial/geo search.  Just in Lucene alone, there are currently 21 open JIRA issues aimed at spatial search.  Lucene’s contrib/spatial is where spatial search lives.  We’ll cover Solr’s support for spatial search in a future post.
  • Robert Muir and friends have added some finite-state automata technology to Lucene and dramatically improved regular expression and wildcard query performance.  Fuzzy queries may benefit from this in the future, too.  The ever industrious Robert covered this in a post with more details.

These are just some highlights from Lucene-land.  More Lucene goodness is coming soon.  Please use comments to tell us whether you find posts like this one useful or whether you’d prefer a different angle, format, or something else.

Lucene TLP Digests

We promised we’d blog more in 2010.  One of the things you’ll see on this blog in 2010 is regularly irregular periodic Digests of interesting or significant developments in various projects under the Lucene Top Level Project.  The thinking is that, with the ever-increasing traffic on various Lucene TLP mailing lists, we are not the only ones having a hard time keeping up.  Since we have to closely follow nearly all Lucene TLP projects for our work anyway, we may as well try and help others by periodically publishing posts summarizing key developments.  We’ll tag and categorize our Digest posts appropriately and consistently, so you’ll be able to subscribe to them directly if that’s all you are interested in.  We’ll try to get @sematext to tweet as well, so you can follow us there, too.

Sematext Blog Introduction

Here it is – Sematext’s new and shiny blog.

We’ll be writing about topics that are dear and important to us – search (both web search and enterprise search), text analytics, natural language processing (sentiment detection, named entity recognition…), machine learning, information gathering (e.g. web crawling), information extraction, e-discovery, recommendation engines, etc.  There will be a lot of talk about tools we use regularly – Lucene, Solr, Nutch, Mahout and Taste, Hadoop, HBase and friends, and more.

To subscribe, use the orange feed icon or just go to http://feeds.feedburner.com/SematextBlog.  If you are a Twitter user, you can follow @sematext on Twitter, too.