October 2010

Key Phrases for Better Search: Smart Content Presentation

We are 3 for 3 this month – 3 talks at 3 different conferences – Lucene Revolution (see our presentation), Hadoop World (see our presentation), and Smart Content (see full agenda). The last conference was a small one-day conference here in New York, organized by Seth Grimes. It turns out there are tons of vendors in the text analytics / “semantic” analysis space who all do more or less the same thing – Named Entity Recognition, Classification, Clustering, Key Phrase Extraction, etc. Sematext is not in that space, though we do have a Classifier, a Language Identifier, and a Key Phrase Extractor. It is this last tool, the Key Phrase Extractor, that I made use of in the presentation. But enough talk, here is our presentation:

Hiring Search and Data Analytics Engineers

We are growing and looking for smart people to join us either in an “elastic”, on-demand, per-project, or more permanent role:

Lucene/Solr expert who…

Has built non-trivial applications with Lucene or Solr or Elastic Search, knows how to tune them, and can design systems for large volume of data and queries
Is familiar with (some of the) internals of Lucene or Solr or Elastic Search, at least on the high level (yeah, a bit of an oxymoron)
Has a systems/ops bent or knows how to use performance-related UNIX and JVM tools for analyzing disk IO, CPU, GC, etc.

Data Analytics expert who…

Has used or built tools to process and analyze large volumes of data
Has experience using HDFS and MapReduce, and has ideally also worked with HBase, or Pig, or Hive, or Cassandra, or Voldemort, or Cascading or…
Has experience using Mahout or other similar tools
Has interest or background in Statistics, or Machine Learning, or Data Mining, or Text Analytics or…
Has interest in growing into a Lead role for the Data Analytics team

We like to dream that we can find a person who gets both Search and Data Analytics, and ideally wants or knows how to marry them.

Ideal candidates also have the ability to:

Write articles on interesting technical topics (that may or may not relate to Lucene/Solr) on Sematext Blog or elsewhere
Create and give technical talks/presentations (at conferences, local user groups, etc.)

Additional personal and professional traits we really like:

Proactive and analytical: takes initiative, doesn’t wait to be asked or told what to do and how to do it
Self-improving and motivated: acquires new knowledge and skills, reads books, follows relevant projects, keeps up with changes in the industry…
Self-managing and organized: knows how to parcel work into digestible tasks, organizes them into Sprints, updates and closes them, keeps team members in the loop…
Realistic: good estimator of time and effort (i.e. knows how to multiply by 2)
Active in OSS projects: participates in open source community (e.g. mailing list participation, patch contribution…) or at least keeps up with relevant projects via mailing list or some other means
Follows good development practices: from code style to code design to architecture
Productive, gets stuff done: minimal philosophizing and over-designing

Here are some of the Search things we do (i.e. that you will do if you join us):

Work with external clients on their Lucene/Solr projects. This may involve anything from performance troubleshooting to development of custom components, to designing highly scalable, high performance, fault-tolerant architectures. See our services page for common requests.
Provide Lucene/Solr technical support to our tech support customers
Work on search-related products and services

A few words about us:

We work with search and big data (Lucene, Solr, Nutch, Hadoop, MapReduce, HBase, etc.) on a daily basis. Our projects with external clients range from 1 week to several months. Some clients are small startups, some are large international organizations. Some are top secret. New customers knock on our door regularly and this keeps us busy at pretty much all times. When we are not busy with clients we work on our products. We run search-lucene.com and search-hadoop.com. We participate in open-source projects and publish monthly Digest posts that cover Lucene, Solr, Nutch, Mahout, Hadoop, and HBase. We don’t write huge spec docs, we work in sprints, we multitask, and try our best to be agile. We send people to conferences, trainings (Hadoop, HBase, Cassandra), and certifications (2 of our team members are Cloudera Certified Hadoop Developers).

We are a small and mostly office-free, highly distributed team that communicates via email, Skype voice/IM, and BaseCamp. Some of our developers are in Eastern Europe, so we are especially open to new team members being in that area, but we are also interested in good people world-wide, from South America to the Far East.

Interested? Please send your resume to jobs @ sematext.com.

Search Analytics: Hadoop World Presentation

After our Lucene Revolution talk in Boston, we got ready for last week’s Hadoop World conference in New York. Like at Lucene Revolution, we presented to a packed room of 200+ people. The topic of our talk was the Search Analytics tool we’ve built with the help of Flume, HBase, MapReduce, and other open-source tools, and which we are now starting to use for search-hadoop.com and search-lucene.com. If you couldn’t make it to Hadoop World, have a look at our presentation below. And if you’d like to work on Search, Analytics, and related areas, we’re looking for good people world-wide – see our jobs page. Enjoy!

ProjectHub: Lucene Revolution Presentation

Over the past few weeks we’ve been to two conferences: Lucene Revolution in Boston and Hadoop World in New York. We presented at both. At Lucene Revolution we talked about how we built search-hadoop.com and search-lucene.com. The fact that our presentation room was packed despite a couple of other interesting talks being given at the same time tells us this stuff is interesting to people (or at least the title and the brief description in the conference schedule were attractive). For those of you who were unable to make it to Boston, we are sharing our presentation below. And for those of you who like to work on Search, Analytics, Machine Learning, and related areas, we’re looking for good people world-wide – see our jobs page. Enjoy!

Sematext at Lucene Revolution 2010

We are packing our bags and going to Boston to the Lucene Revolution conference. If you are a current or a past client of Sematext, please try to spot Otis and say hello.

Our presentation is Lucene Ecosystem Tools for Hadoop Ecosystem Search and we’ll be presenting at the end of Day 1 (tomorrow, Thursday October 7th). In the presentation we’ll be talking about how we’ve built search-hadoop.com and search-lucene.com, how we used Solr, Tika, Droids, Digester, etc.

See you there!

Solr Digest, September 2010

It is a busy time of year here at Sematext – we have 3 different presentations to prepare for 3 different conferences (2 down, 1 more to go!), so we’re a bit late with our digests. Nevertheless, we managed to compile a list of interesting topics in the Solr world:

Already committed functionality

Solr was upgraded to use Tika 0.7 – SOLR-1819 – the fix was applied to 1.4.2, 3.1 and 4.0 versions. Of course, Tika 0.8 is going to happen in the not very distant future.
If you’re still using the old rsync-based replication and need to throttle the transfer rate, have a look at the patch contributed in JIRA issue SOLR-2099. Unfortunately, if you’re using the 1.4 Java-based replication, there is currently no way to throttle replication.
If you are using the new spatial capabilities in Solr, you might have noticed some incorrect calculations. One of them is fixed – Spatial filter is not accurate – on the 3.1 and 4.0 branches.
Another minor but useful addition – function queries can now be defined in terms of other request parameters. Check JIRA issue “full parameter dereferencing for function queries”. It is already implemented in 3.1 and 4.0 and is ready to be used. Here is a short example from JIRA (check how the add function is defined and note the v1 and v2 request parameters):

http://localhost:8983/solr/select?defType=func&fl=id,score&q=add($v1,$v2)&v1=mul(2,3)&v2=10

Can we say, Solr Calculator, eh?
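To make the dereferencing concrete, here is a minimal Python sketch. It is not Solr code: it rebuilds the request URL from the example above with the standard library, and then mimics the substitution of $v1 and $v2 with a toy evaluator that understands only add, mul, $name references, and numeric literals. The helper names build_url, split_args, and evaluate are made up for illustration.

```python
import re
from urllib.parse import urlencode

# Request parameters copied from the example URL above.
PARAMS = {
    "defType": "func",
    "fl": "id,score",
    "q": "add($v1,$v2)",
    "v1": "mul(2,3)",
    "v2": "10",
}

def build_url(params, base="http://localhost:8983/solr/select"):
    # Leave $ ( ) , unescaped so the URL stays readable, as in the example.
    return base + "?" + urlencode(params, safe="$(),")

def split_args(body):
    # Split a function's argument list on top-level commas only.
    parts, current, depth = [], [], 0
    for ch in body:
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current.append(ch)
    parts.append("".join(current))
    return parts

def evaluate(expr, params):
    # Toy stand-in for Solr's function parser (illustration only).
    expr = expr.strip()
    if expr.startswith("$"):
        # Dereference: look the expression up in another request parameter.
        return evaluate(params[expr[1:]], params)
    match = re.fullmatch(r"(add|mul)\((.*)\)", expr)
    if match:
        name, body = match.groups()
        left, right = (evaluate(arg, params) for arg in split_args(body))
        return left + right if name == "add" else left * right
    return float(expr)

print(build_url(PARAMS))
print(evaluate(PARAMS["q"], PARAMS))  # add(mul(2,3),10) -> 16.0
```

With the query above, the value 16 should come back as the score of every matching document, since the function does not depend on any document field.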

Interesting functionalities in development for some time

Ever wanted to add some custom fields to a response, although they were not stored in your Solr index? You could always create a custom response writer which would add those fields (although it would probably be a “dirty” copy of one of Solr’s existing response writers). However, we all know that doesn’t sound like the right way to code. One JIRA issue might deliver a correct way some day – Allow components to add fields to outgoing documents. We say “some day”, since this functionality has been in development for quite some time now and, although it has some patches (currently non-functional, it seems), it is probably not very close to being completed. But it will be handy to have once it’s done.

Interesting new functionalities

Highlighter could get one frequently requested improvement – Highlighter fragement/formatter for returning just the matching terms – we believe this will be a useful addition, although we don’t expect it very soon.
One potentially useful feature for all of you who use HDFS – DIH should be able read data directly from HDFS for indexing. This issue already contains some working code, although it is a question if the fix will become a part of standard Solr distribution. Still, if you’re using Solr 1.4.1 and you have data in HDFS that you want to index with Solr, have a look at this contribution.
Another improvement related to replication is in SOLR-2117 – Allow slaves to replicate at different times. This should be useful to anyone who has long (and therefore heavy) warmup periods on their slaves after replication. This way, you can have your slaves replicate at different times and, at the time of replication, just take the replicating slave offline (to avoid degradation of response times). Be careful though, there is a downside: for some time (limited, but still), your slaves will serve different data. A patch is available for the 4.0 version.
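The staggering idea can be sketched independently of Solr. The Python snippet below is only an illustration and says nothing about how the SOLR-2117 patch is actually configured. Given a hypothetical list of slave names and a shared poll interval, it spreads the replication start offsets evenly across the interval, and it also reports the longest time the first and last slave can be serving different data. It assumes replication plus warmup on one slave fits inside a single slot.

```python
from datetime import timedelta

def staggered_offsets(slaves, poll_interval):
    # Give every slave its own evenly spaced start offset within the interval.
    if not slaves:
        raise ValueError("need at least one slave")
    slot = poll_interval / len(slaves)
    return {name: slot * index for index, name in enumerate(slaves)}

def max_data_skew(slaves, poll_interval):
    # Longest gap between the first and the last slave picking up a new index.
    offsets = staggered_offsets(slaves, poll_interval)
    return max(offsets.values()) - min(offsets.values())

# Hypothetical setup: three slaves polling every 30 minutes.
slaves = ["slave1", "slave2", "slave3"]
interval = timedelta(minutes=30)

for name, offset in staggered_offsets(slaves, interval).items():
    print(name, "starts replicating at +", offset)
print("slaves may serve different data for up to", max_data_skew(slaves, interval))
```

Shorter slots reduce the time your slaves disagree, but each slot still has to be long enough to cover the warmup of the slave that is out of rotation.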

Miscellaneous

Some more information about current Solr branches, future versions, etc can be found in these ML threads: “Version stability [was: svn branch issues]” and “Solr 3.1”.
Have you recently asked yourself a question like this: morelikethis – “stored=true” is necessary? You might find the answer in this thread useful – check what this link has to say about it.
One extremely useful thread you’ll want to keep in your bookmarks (and read it from time to time) – Tuning Solr caches with high commit rates (NRT)

So, we had a little bit of everything from Solr this month. Until late October (or the start of November), when the next Solr Digest arrives, stay tuned to @sematext, where we tweet other interesting stuff on a wider set of topics from time to time.