Lucene / Solr for Academia: PhD Thesis Ideas

If you are a Lucene or Solr user or developer, please read on; we’d like to hear from you.  If you use a different search tool, please keep reading as well.  And if you have 5 minutes of free time, we’d like to hear from you, too! 😉

Short version:

We are looking for your suggestions for advanced features that tools like Lucene, Solr, etc. could or should have but don’t have today, and that could make good topics for a Master’s or PhD thesis.  Some of us here at Sematext are PhD candidates and are looking for suggestions that could result in working code ready to be contributed to open source.  Plus, we are trying to go beyond that and involve the academic community, as described below.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but please keep in mind that we are looking for advanced functionality, not features that would be too simple or small to serve as research/thesis topics.  Feel free to pass the link to friends and colleagues who you think would be interested in this or may want to make suggestions.

Longer version:

We are in the early stages of collaborating with academia in areas such as IR/IE/ML/NLP.  What we’d like to do is involve the academic community, but with the explicit intention of producing research whose goal from day one is an implementation that will get integrated into a specific, non-academic system.  Thus, we’d like to come up with very real, very practical problems or deficiencies in existing IR/IE/ML/NLP systems that are not simple, that call for academic-style work, and that then require real hacking to produce at least a working prototype/proof of concept (PoC).  Our hope is that such a PoC could then be truly integrated, and maybe even improved upon, by industry people.

This may be too abstract and vague, so how about an example.

  • Say the target is Lucene and IR.
  • Say we identify that the ability to do X is missing from Lucene.
  • Say that X is non-trivial, that it’s nobody’s immediate itch, and thus won’t be implemented by anyone in the Lucene community in the next N months.
  • Say that X involves advanced functionality that could benefit from relatively advanced and/or new research coming out of academia, and is thus something that could be a part of someone’s PhD thesis.
  • Say we find a PhD candidate with adequate background knowledge and interest in X.
  • N months later we could have a working PoC of X.

We are hoping that by doing this we can help everyone:

  • The future PhD will have a non-made-up, real-world problem to solve and existing code (Lucene) to hack on.
  • The Lucene community will get X.
  • The Lucene community may get a good contributor or committer down the road.

As facilitators of this, we will try hard to work with academia and teach the “open-source ways”, which includes teaching how to work effectively with the specific open-source community (to the extent this is permitted by one’s academic institution), so that the research and the real-world needs are aligned.

So, at this point we are looking for suggestions of interesting and practical advanced topics that have both an academic and an industry facet.  And, with this debut blog post, we are specifically turning to the IR/Lucene/Solr community at large to make suggestions.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but please keep in mind that we are looking for advanced functionality, not features that would be too simple or small to serve as research/thesis topics.  Feel free to pass the link to friends and colleagues who you think would be interested in this or may want to make suggestions.

Thank you!

5 thoughts on “Lucene / Solr for Academia: PhD Thesis Ideas”

  1. This is a great idea!

    I recently completed my MS at UC Irvine, where I wrote my thesis on part of a query tagger for search queries. I implemented query segmentation and classification as a Solr component and used it on a search engine for federal research grants, http://researchwatch.net. A search query like “energy stanford university” would get transformed into the filter query: Grant Abstract: “energy” and Organization: “stanford university”

    I’m a little busy with some projects but I plan to open source the library.

    1. Great. I think this is a good example of something that could be done as research/thesis (how to do query tagging well) with a concrete implementation (a Solr SearchComponent, I’m guessing, in this case).

  2. Idea: Luke is a great tool. To complement it, a tool that monitors which queries are being executed, what resources they use, and what the top 10 queries are would be a good addition, something like SQL Server DMV/Oracle V$Session.
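
To make the query-tagging example from the first comment a bit more concrete, here is a minimal sketch of the idea in Python. It is not the commenter's implementation (which was a Solr component and has not been published here); it just assumes a small, hypothetical dictionary of organization names, does a greedy longest-match segmentation, and treats every leftover token as abstract text. The field names `abstract` and `organization` are also made up for illustration.

```python
# Hypothetical gazetteer of known organization names (lowercased).
# A real tagger would likely use a much larger dictionary or a trained classifier.
KNOWN_ORGANIZATIONS = {"stanford university", "uc irvine"}


def tag_query(query, organizations=KNOWN_ORGANIZATIONS, max_len=4):
    """Segment a query and tag each segment with a (hypothetical) field."""
    tokens = query.lower().split()
    orgs, free_text = [], []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in organizations:
                match = candidate
                i += n
                break
        if match:
            orgs.append(match)
        else:
            # Tokens that match nothing are treated as abstract text.
            free_text.append(tokens[i])
            i += 1
    return {"abstract": free_text, "organization": orgs}


def to_filter_query(tags):
    """Render the tags as a fielded query string (field names are assumptions)."""
    clauses = [f'abstract:"{term}"' for term in tags["abstract"]]
    clauses += [f'organization:"{org}"' for org in tags["organization"]]
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(to_filter_query(tag_query("energy stanford university")))
```

The interesting research questions start where this sketch stops: handling ambiguous segments, names that are not in any dictionary, and ranking competing interpretations of the same query.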
