Internships at Sematext in 2012

Enthusiastic students world-wide who are interested in the following (and related) areas are invited to get in touch about an internship position at Sematext:

  • Search / Information Retrieval
  • Data Analytics (not limited to Search Analytics)
  • Machine Learning
  • Data Mining
  • Text Analytics
  • Natural Language Processing
  • Information Gathering
  • Information Extraction
  • Distributed and Parallel Computing
  • Cloud Computing
  • Performance Monitoring

The above topics represent our areas of interest, business, or expertise, but we are open to other subject areas as well.

Sematext HQ is in Brooklyn, NY, USA, but we are a very geographically distributed organization whose members are spread over several countries and continents.  As such, we welcome students from all across the globe.  Students in or near New York are welcome to spend time in our space, while others are welcome to do the internship remotely.  We have an existing and fruitful relationship with an academic institution in Europe and are not new to working with students remotely.  Internship positions are available year-round, but are subject to student demand and our capacity.  Key technologies we work with are listed on our Technology page, some of our past clients are on our Clients page, and some of our open source projects are described on our well-hidden Open Source Projects page.

Lucene / Solr for Academia: PhD Thesis Ideas

If you are a Lucene or Solr user or developer, please read on; we’d like to hear from you.  If you use a different search tool, please also keep reading.  And if you have 5 minutes of free time, we’d like to hear from you, too! 😉

Short version:

We are looking for your suggestions for advanced features that tools like Lucene, Solr, etc. could or should have, but unfortunately don’t have today, and that could be good topics for a Master’s or PhD thesis.  Some of us here at Sematext are PhD candidates and are looking for suggestions that could result in working code ready to be contributed to open source.  Plus, we are trying to go beyond that and involve the academic community, as described below.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but please keep in mind that we are looking for advanced functionality, not features that would be too simple or small to serve as research/thesis topics.  Feel free to pass the link to friends and colleagues you think would be interested in this or may want to make suggestions.

Longer version:

We are in the early stages of collaborating with academia in areas such as IR/IE/ML/NLP.  What we’d like to do is involve the academic community, but with the explicit intention of producing research whose day-one goal is an implementation that will get integrated into a specific, non-academic system.  Thus, we’d like to come up with very real, very practical problems or deficiencies in existing IR/IE/ML/NLP systems that are not simple, that call for academic-style research, and that then require real hacking in order to produce at least a working prototype/proof of concept (PoC).  Our hope is that such a PoC could then be truly integrated, and maybe even improved upon, by industry people.

This may be too abstract and vague, so here is an example.

  • Say the target is Lucene and IR.
  • Say we identify that the ability to do X is missing from Lucene.
  • Say that X is non-trivial, that it’s nobody’s immediate itch, and thus won’t be implemented by anyone in the Lucene community in the next N months.
  • Say that X involves advanced functionality that could benefit from relatively advanced and/or new research coming out of academia, and is thus something that could be a part of someone’s PhD thesis.
  • Say we find a PhD candidate with adequate background knowledge and interest in X.
  • N months later we could have a working PoC of X.

We are hoping that by doing this we can help everyone:

  • The future PhD will have a non-made-up, real-world problem to solve and existing code (Lucene) to hack on.
  • The Lucene community will get X.
  • The Lucene community may get a good contributor or committer down the road.

As facilitators of this, we will try hard to work with academia and teach students the “open-source ways”, which includes teaching them how to work effectively with the specific open-source community (to the extent this is permitted by one’s academic institution), so that the research and the real-world needs are aligned.

So... at this point we are looking for suggestions of various interesting and practical advanced topics that have both an academic and an industry facet.  And, with this debut blog post, we are specifically turning to the IR/Lucene/Solr community at large to make suggestions.  Please add your suggestions to the Lucene / Solr Wishlist public spreadsheet, but please keep in mind that we are looking for advanced functionality, not features that would be too simple or small to serve as research/thesis topics.  Feel free to pass the link to friends and colleagues you think would be interested in this or may want to make suggestions.

Thank you!