Lucene / Solr for Academia: PhD Thesis Ideas

If you are a Lucene or Solr user or developer, please read on; we’d like to hear from you.  If you use a different search tool, please keep reading, too.  And if you have 5 minutes of free time, we’d like to hear from you as well! 😉

Short version:

We are looking for your suggestions for advanced features that tools like Lucene, Solr, etc. could or should have, but unfortunately don’t have today, and that could make good topics for a Master’s or PhD thesis.  Some of us here at Sematext are PhD candidates and are looking for suggestions that could result in working code ready to be contributed to open-source.  Plus, we are trying to go beyond that and involve the academic community, as described below.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but keep in mind that we are looking for advanced functionality, not features that would be too simple or small to serve as research/thesis topics.  Feel free to pass the link to friends and colleagues who you think would be interested in this or may want to make suggestions.

Longer version:

We are in the early stages of collaborating with academia in areas such as IR/IE/ML/NLP.  What we’d like to do is involve the academic community, but with the explicit intention of producing research whose day-one goal is an implementation that will get integrated into a specific, non-academic system.  Thus, we’d like to come up with very real, very practical problems or deficiencies in existing IR/IE/ML/NLP systems that are non-trivial, that call for academic-style research, and that then require real hacking in order to produce at least a working prototype/proof of concept (PoC). Our hope is that such a PoC could then be truly integrated, and maybe even improved upon, by industry people.

This may be too abstract and vague, so here is an example:

  • Say the target is Lucene and IR.
  • Say we identify that ability to do X is missing from Lucene.
  • Say that X is non-trivial, that it’s nobody’s immediate itch, and thus won’t be implemented by anyone in the Lucene community in the next N months.
  • Say that X involves advanced functionality that could benefit from relatively advanced and/or new research coming out of academia, and is thus something that could be a part of someone’s PhD thesis.
  • Say we find a PhD candidate with adequate background knowledge and interest in X.
  • N months later we could have a working PoC of X.

We are hoping that by doing this we can help everyone:

  • The future PhD candidate will have a real-world (not made-up) problem to solve and existing code (Lucene) to hack on.
  • The Lucene community will get X.
  • The Lucene community may gain a good contributor or committer down the road.

As facilitators of this, we will try hard to work with academia and teach them the “open-source ways”, which includes teaching how to work effectively with the specific open-source community (to the extent this is permitted by one’s academic institution), so that the research and the real-world needs stay aligned.

So… at this point we are looking for suggestions of various interesting and practical advanced topics that have both an academic and an industry facet.  And, with this debut blog post, we are specifically turning to the IR/Lucene/Solr community at large for suggestions.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but keep in mind that we are looking for advanced functionality, not features that would be too simple or small to serve as research/thesis topics. Feel free to pass the link to friends and colleagues who you think would be interested in this or may want to make suggestions.

Thank you!

Solr Digest, October 2010

Another busy month is behind us.  There were plenty of interesting topics, so let’s get started:

Already committed functionality

Interesting functionality in development

  • Faceting is heavily used functionality, but occasionally people find that some form of it is missing. Hierarchical faceting is one such form: it has been in development for a very long time, but despite a few posted patches it is still not part of the Solr distribution, and it hasn’t seen much activity lately. There is another, similar issue – the Pivot (aka Decision Tree) Faceting Component – which should come to life as a separate component. There is renewed effort to make it usable, so eventually we’ll see expanded faceting support in Solr (see the sketch below for what a pivot facet request might look like).
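For the curious, here is roughly what a pivot facet request might look like from SolrJ, based on the patches posted on the pivot faceting JIRA issue. The facet.pivot parameter name, the example field names, and the response shape are assumptions drawn from those patches and may change before anything is committed, so treat this as a sketch, not gospel:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PivotFacetSketch {
  public static void main(String[] args) throws Exception {
    // Plain SolrJ client; the URL and field names are just examples.
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("*:*");
    query.setFacet(true);
    // facet.pivot is the parameter used by the posted patches; it takes
    // a comma-separated field list and returns counts nested field by
    // field -- a "decision tree" of facet counts.
    query.set("facet.pivot", "category,manufacturer");

    QueryResponse response = server.query(query);
    System.out.println(response); // inspect the raw nested counts
  }
}
```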

Interesting new functionality

  • Extending SchemaField with custom attributes is being dealt with in the issue Custom SchemaField object.
  • Improving search relevance is always a big issue (and represents a good part of what Sematext does in client engagements), no matter how good Solr and Lucene relevance is out of the box.  One very useful addition to our search relevance arsenal could come from the Anti-phrasing feature. The idea is that some word sequences in a query are irrelevant to the query’s meaning (like “Where can I find” or “Where is”) and could/should be ignored when searching the index. This JIRA issue is still very fresh, so don’t hold your breath waiting for the implementation to show up next week, although we are bound to see this feature in one of the future Solr releases.
  • If you often work with financial data, you might find the patches from the Money FieldType issue useful. The new field type will support point and range queries, sorting, and exchange rates.
  • Lucene’s ICUTokenizer is useful for multilingual tokenizing, but until recently there was no support for it in Solr. The issue Provide Solr FilterFactory for Lucene ICUTokenizer will provide a factory that lets us use it from Solr. Bingo! The patch already exists, so you can try it out now, and additional functionality will be added over time (a minimal sketch of driving the tokenizer directly from Java follows this list).  If you need multilingual support in Solr, have a look at Sematext‘s popular Multilingual Indexer.
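While the Solr factory is still baking, you can already drive the tokenizer from plain Java. A minimal sketch, assuming the Lucene ICU analysis contrib and ICU4J are on the classpath (on older 3.0-era branches the attribute class is TermAttribute rather than CharTermAttribute):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ICUTokenizerDemo {
  public static void main(String[] args) throws Exception {
    // Mixed-script text: ICUTokenizer applies per-script segmentation
    // rules via ICU4J, so the Japanese part is segmented sensibly, too.
    String text = "Lucene in action. 検索エンジンです。";

    ICUTokenizer tokenizer = new ICUTokenizer(new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```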

Miscellaneous

  • One of the favorite topics, which we also cover frequently, is the ongoing confusion about Solr versions. October didn’t disappoint: this topic was discussed on the mailing lists again. Here is one such thread – Which version of Solr to use?. Let us summarize the key parts. Solr 1.5 will probably never be released.  The branch_3x is a stable branch from which the next version, Solr 3.1, will likely be released.  The trunk contains a relatively stable, but still development, version of what will one day become Solr 4.0.
  • If you provide faceting functionality in your application, here is a small (but interesting) discussion that might give you a few ideas about how to optimize it – Faceting and first letter of fields (a sketch of one common approach follows this list).
  • It appears that Solr has problems running on Tomcat 7. These problems are not specific to a particular version of Solr; they affect all versions. To learn more, start with Problems running on tomcat and SOLR-2022.
  • Replication between a Solr master and slave running different versions of Solr is broken, as you can see in the issue Cross-version replication broken by new javabin format. The cause is the new javabin format, so in cases like the one described in this issue (master 1.4.1, slave 3x) you’ll encounter problems. Keep that in mind if you plan cross-version replication for some reason.
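About that first-letter faceting discussion: the usual trick is to index the first letter of a field into its own small field at index time (e.g. via a copyField plus a pattern-based analysis chain) and facet on that, instead of faceting on the full field and trimming values client-side. A minimal SolrJ sketch, with name_first_letter being our hypothetical single-character field:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FirstLetterFacetSketch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Facet on the tiny single-letter field; faceting on a field with
    // a handful of distinct values is far cheaper than faceting on the
    // full "name" field and post-processing the results.
    SolrQuery query = new SolrQuery("*:*");
    query.setFacet(true);
    query.addFacetField("name_first_letter");
    query.setFacetMinCount(1);

    QueryResponse response = server.query(query);
    FacetField letters = response.getFacetField("name_first_letter");
    System.out.println(letters.getValues()); // e.g. [a (120), b (87), ...]
  }
}
```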

These were the most interesting highlights for the month of October. Thank you for reading Sematext Blog and following @sematext on Twitter.


Mahout Digest, October 2010

We’ve been very busy here at Sematext, so we haven’t covered Mahout over the last few months.  We are pleased with what’s been keeping us busy, but we are not happy about our irregular Mahout Digests.  We covered the last (0.3) release with all of its features, and we are not going to miss covering a very important milestone for Mahout: release 0.4 is out! In this digest we’ll summarize the most important changes in Mahout since the last digest and add some perspective.

Before we dive into Mahout, please note that we are looking for people with Machine Learning skills and Mahout experience (as well as good Lucene/Solr search people).  See our Hiring Search and Data Analytics Engineers post.

This Mahout release brings sweeping changes around model refactoring and the command-line interface, aimed at improving integration and consistency (easier access to Mahout operations via the command line). The command-line interface is now pretty much standardized for working with all the various options, which makes it easier to run and use (see the sketch below). Interfaces are better and more consistent across algorithms, and there have been many small fixes, improvements, refactorings, and clean-ups. Details on what’s included can be found in the release notes, and the download is available from the Apache Mirrors.
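To illustrate, bin/mahout is a thin wrapper around the MahoutDriver class, so every algorithm is launched the same way: the first argument names the job, the rest are that job’s (now consistent) options. A small sketch; the kmeans option names in the comment are from memory and worth double-checking against the --help output of your build:

```java
import org.apache.mahout.driver.MahoutDriver;

public class MahoutCliDemo {
  public static void main(String[] args) throws Throwable {
    // Print the standardized option list for one of the jobs; this is
    // what "bin/mahout kmeans --help" does under the hood.
    MahoutDriver.main(new String[] { "kmeans", "--help" });

    // A real run then follows the same consistent -i/-o pattern, e.g.
    // (flags assumed, verify with --help):
    // MahoutDriver.main(new String[] {
    //     "kmeans", "-i", "vectors/", "-o", "kmeans-out/",
    //     "-c", "initial-clusters/", "-k", "10", "-x", "20", "-cl" });
  }
}
```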

Now let’s add some context to various changes and new features.

GSoC projects

Mahout wrapped up its Google Summer of Code projects; two of them completed successfully:

  • EigenCuts spectral clustering implementation on Map-Reduce for Apache Mahout (addresses issue MAHOUT-328); the proposal and implementation details can be found in MAHOUT-363
  • Hidden Markov Models based sequence classification (proposal for a summer-term university project); the proposal and implementation details are in MAHOUT-396

Two projects did not complete due to lack of student participation, and one remains in progress.

Clustering

The biggest additions in the clustering department are the EigenCuts clustering algorithm (the GSoC project mentioned above) and MinHash-based clustering, which we covered as a possible GSoC suggestion in one of our previous digests. MinHash clustering was implemented, but not as a GSoC project. In the first digest in the Mahout series we covered problems related to the evaluation of clustering results (an unsupervised learning issue), so another big addition to Mahout’s clustering are the Cluster Evaluation Tools, featuring the new ClusterEvaluator (which uses Mahout in Action code for inter-cluster density and similar code for intra-cluster density, computed over a set of representative points rather than the entire clustered data set) and CDbwEvaluator, which offers new ways to evaluate clustering effectiveness.

Logistic Regression

Online learning capabilities such as a Stochastic Gradient Descent (SGD) algorithm implementation are now part of Mahout. Logistic regression is a model used for predicting the probability of the occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person will have a heart attack within a specified time period might be predicted from knowledge of the person’s age, sex, and body mass index. Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer’s propensity to purchase a product or cancel a subscription. The Mahout implementation uses Stochastic Gradient Descent (SGD); see the initial request and development history in MAHOUT-228. The new sequential logistic regression training framework supports feature-vector encoding for high-speed vectorization without a pre-built dictionary. You can find more details on Mahout’s logistic regression wiki page.
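Here is a minimal, self-contained sketch of the new SGD classifier, written against the 0.4 API as we read it (classes from the org.apache.mahout.classifier.sgd and org.apache.mahout.math packages; the tiny heart-attack training set is made up purely for illustration):

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdSketch {
  public static void main(String[] args) {
    // 2 categories (event / no event), 4 features:
    // bias, age, sex, body mass index -- mirroring the example above.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 4, new L1());
    learner.learningRate(0.1);

    // Tiny made-up training set: {bias, age, sex, bmi} -> outcome.
    double[][] examples = {
        {1, 63, 1, 31}, {1, 45, 0, 22}, {1, 70, 1, 29}, {1, 30, 0, 24}
    };
    int[] outcomes = {1, 0, 1, 0};

    // SGD is an online learner: just keep feeding it examples.
    for (int pass = 0; pass < 100; pass++) {
      for (int i = 0; i < examples.length; i++) {
        learner.train(outcomes[i], new DenseVector(examples[i]));
      }
    }

    // For the two-class case, classifyScalar returns P(category 1).
    Vector patient = new DenseVector(new double[] {1, 58, 1, 33});
    System.out.println("P(event) = " + learner.classifyScalar(patient));
  }
}
```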

Math

There has been a lot of cleanup done in the math module (you can check the details in the Cleanup Math discussion on the ML), lots of it related to the untested Colt framework integration (and deprecated code in the Colt framework). The discussion resulted in several pieces of the Colt framework getting promoted to tested status (QRDecomposition, in particular).

Classification

In addition to speedups and bug fixes, the main new features in classification are new classifiers (new classification algorithms) and more open/uniform input data formats (vectors). The most important changes are:

  • New SGD classifier
  • An experimental new type of Naive Bayes classifier (using vectors) and feature reduction options for the existing Naive Bayes classifier (variable-length coding of vectors)
  • New VectorModelClassifier allows any set of clusters to be used for classification (clustering as input for classification)
  • Random forests can now be saved and used to classify new data. Read more on how to build a random forest and how to use it to classify new cases on this dedicated wiki page.

Recommendation Engine

The most important changes in this area are related to distributed similarity computations, which can be used in Collaborative Filtering (or other areas like clustering, for example). An implementation of a Map-Reduce job that computes item-item similarities for item-based Collaborative Filtering, based on the algorithm suggested in Elsayed et al.: Pairwise Document Similarity in Large Collections with MapReduce, can be found in MAHOUT-362. A generalization of the algorithm, prompted by a mailing list discussion, led to the implementation of a Map-Reduce job that computes pairwise similarities of the rows of a matrix using a customizable similarity measure (with implementations already provided for co-occurrence, Euclidean distance, log-likelihood, Pearson correlation, Tanimoto coefficient, and cosine). More on the distributed version of any item similarity function (previously available only in a non-distributed implementation) can be found in MAHOUT-393. With pairwise similarity computation defined, RecommenderJob has evolved into a fully distributed item-based recommender (the implementation depends on how the pairwise similarities are computed). You can read more about the distributed item-based recommender in MAHOUT-420, and a sketch of running it follows below.
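As a taste of the fully distributed recommender, the sketch below launches RecommenderJob as a standard Hadoop tool. The class name is from the Mahout sources; the option names are our assumptions from memory, so check the job’s --help output against your build before relying on them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class DistributedRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Input is the usual userID,itemID[,preference] CSV on HDFS; the
    // similarity flag selects one of the pairwise measures mentioned
    // above. Option names here are assumptions -- verify with --help.
    ToolRunner.run(new Configuration(), new RecommenderJob(), new String[] {
        "--input", "ratings/",
        "--output", "recommendations/",
        "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION"
    });
  }
}
```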

Implementations of distributed operations on very large matrices are very important for a scalable machine learning library that supports large data sets. For example, when term vectors are built from textual documents/content, they tend to have high dimensionality. If we consider a term-document matrix where each row represents a term and each column represents a document, we obviously end up with a high-dimensional matrix. The same thing occurs in Collaborative Filtering: it uses a user-item matrix containing ratings as its values, where each row corresponds to a user and each column represents an item. Again we have a high-dimensional matrix, and a sparse one at that.

In both cases (term-document matrix and user-item matrix) we are dealing with high dimensionality that needs to be reduced while preserving as much of the information as possible. Obviously we need some sort of matrix operation that produces a lower-dimensional matrix with the important information preserved. For example, a large matrix may be approximated in lower dimensions using Singular Value Decomposition (SVD), as sketched below.
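Concretely, the reduction described above is the truncated SVD; in standard linear algebra notation (nothing Mahout-specific):

```latex
% SVD of an m x n matrix A (term-document or user-item): A = U \Sigma V^T.
% Keeping only the k largest singular values yields the best rank-k
% approximation of A in the least-squares sense (Eckart-Young):
A \;\approx\; A_k \;=\; U_k \,\Sigma_k\, V_k^{T},
\qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1,\dots,\sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}.
```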

It’s obvious that we need a (Java) matrix framework capable of fundamental matrix decompositions. JAMA is a great example of a widely used linear algebra package for matrix operations, capable of SVD and other fundamental matrix decompositions (WEKA, for example, uses JAMA for matrix operations). Operations on high-dimensional matrices always require heavy computation, and this produces steep hardware requirements for any production ML system. This is where Mahout, which features distributed operations on large matrices, should be the production choice for Machine Learning algorithms over frameworks like JAMA, which, although great, cannot distribute its operations.

In a typical recommendation setup, users often ‘have’ (i.e., have used or interacted with) only a few items from the whole item set (which can be very large), which leads to sparse user-item matrices. Mahout’s (0.4) distributed Lanczos SVD implementation is particularly useful for finding decompositions of such very large sparse matrices.

News and Roadmap

All of the new distributed similarity/recommender implementations covered in the previous paragraphs were contributed by Sebastian Schelter, and in recognition of this important work he was elected a new Mahout committer.

The book “Mahout in Action”, published by Manning, now has 15 of its 16 chapters complete and will soon enter final review.

That’s all from us for now.  Any comments/questions/suggestions are more than welcome, and until the next Mahout Digest keep an eye on Mahout’s roadmap for 0.5 or the discussion about what Mahout is missing to become a production-stable (1.0) framework.  We’ll see you next month – @sematext.