April 2011

Sematext at Berlin Buzzwords 2011

As part of Sematext’s Summer 2011 Conference Tour, we are going to be visiting good old Europe and giving a talk at Berlin Buzzwords. This is the second year for Berlin Buzzwords, “a conference for developers and users of open source software projects, focusing on the issues of scalable search, data-analysis in the cloud and NoSQL-databases. Berlin Buzzwords presents more than 30 talks and presentations of international speakers specific to the three tags ‘search’, ‘store’ and ‘scale’”. Last year, one of us from Sematext went there as an attendee. This year, three of us are going and one of us is giving a talk – @OtisG will be speaking about Search Analytics on June 6th. That’s the first day of the conference, and we are first in line to talk at 11:00 AM, right after the morning coffee. Doug Cutting and Ted Dunning will be giving keynotes. Some of us may also be there for some of the Hackathons/Workshops before and/or after the conference. If you are going to be there and would like to meet up, please let us know! @sematext.

For more information on this topic read about Sematext Search Analytics service.

Training: Solr Performance Tuning and Monitoring

Quick announcement!

In addition to presenting at Open Source Search Conference in June, we’ll also be doing a super-cheap half-day training on Solr Performance Tuning & Monitoring. You can sign up here.

In this tutorial you will learn how to squeeze the most performance out of your Solr cluster. We’ll cover performance at both indexing and query time; dealing with large volumes of data, high query rates, and the combination of the two; and the various index sharding architectures that can improve search performance, including in multi-data center setups. We’ll cover an array of best practices, tips and tricks we regularly use in our engagements with clients, from various configuration settings to querying efficiently, all of which one should employ to get the most out of Solr. You will also learn how to monitor your Solr cluster’s performance with command-line tools and a visual monitoring solution specifically designed for Solr performance monitoring.

Prerequisites:

Basic knowledge of Solr, its configuration and setup.

Details:

Cost: $100
When: June 14, 2011, 9:00 a.m.-1:00 p.m.
Bonus: Lunch will be provided.
Register here

If you are interested in Solr Performance Monitoring, please read about Sematext Scalable Performance Monitoring service.

Hiring: Data Mining, Analytics, Machine Learning Hackers

If you want to work with search, big data mining, analytics, and machine learning, and you are a positive, proactive, independent creature, please keep reading. We are looking for devops to hack on Sematext’s new products and services, as well as provide services to our growing list of clients. Working knowledge of Mahout or a statistics/machine learning/data mining background would be a major plus.

Skills & experience (the more of these you have under your belt the better):

Data mining and/or machine learning (Mahout or …)
Big data (HBase or Cassandra or Hive or …)
Search (Solr or Lucene or Elastic Search or …)

More about an ideal you:

You are well organized, disciplined, and efficient
You don’t wait to be told what to do and don’t need hand-holding
You are reliable, friendly, have a positive attitude, and don’t act like a prima donna
You have an eye for detail, don’t like sloppy code, poor spelelling and typous
You are able to communicate complex ideas in a clear fashion in English, clean and well designed code, or pretty diagrams

Optional bonus points:

You like to write or speak publicly about technologies relevant to what we do
You are an open-source software contributor

A few words about us:

We work with search and big data (Lucene, Solr, Nutch, Hadoop, MapReduce, HBase, etc.) on a daily basis and we present at conferences. Our projects with external clients range from 1 week to several months. Some clients are small startups, some are large international organizations. Some are top secret. New customers knock on our door regularly and this keeps us busy at pretty much all times. When we are not busy with clients we work on our products. We run search-lucene.com and search-hadoop.com. We participate in open-source projects and publish monthly Digest posts that cover Lucene, Solr, Nutch, Mahout, Hadoop, Hive, and HBase. We don’t write huge spec docs, we work in sprints, we multitask, and try our best to be agile. We send people to conferences, trainings (Hadoop, HBase, Cassandra), and certifications (2 of our team members are Cloudera Certified Hadoop Developers).

We are a small and mostly office-free, highly distributed team spanning 3 continents and 6 countries. We communicate via email, Skype voice/IM, and BaseCamp. Some of our developers are in Eastern Europe, so we are especially open to new team members being in that area, but we are also interested in good people world-wide, from South America to the Far East.

Interested? Please send your resume to jobs @ sematext.com and feel free to check out our other positions.

Solr Digest, February-March 2011

We Sematexters have been very busy over the past few months, so we missed Solr’s February Digest. This one will therefore be a bit longer than usual. Let’s get started…

First, some major news: Solr 3.1 is officially released! The details of the announcement can be found here. We covered most of the new features in our digests already, so we’ll keep it short:

Numeric range facets (similar to date faceting)
New spatial search, including spatial filtering, boosting and sorting capabilities
Example Velocity driven search UI at http://localhost:8983/solr/browse
A new termvector-based highlighter
Extended dismax (edismax) query parser, which addresses some missing features in the dismax query parser along with some extensions
Several more components now support distributed mode: TermsComponent, SpellCheckComponent
A new Auto Suggest component
Ability to sort by functions
JSON document indexing
CSV response format
Apache UIMA integration for metadata extraction
Leverages Lucene 3.1 and its inherent optimizations and bug fixes as well as new analysis capabilities
Numerous improvements, bug fixes, and optimizations

You can start your download :).

Already committed features

post.jar got improved – JIRA issue improve post.jar to handle non UTF-8 files removed some of its very old limitations
the Jetty server included in the Solr distribution didn’t support UTF-8. This is now solved, and the fresh 3.1 version already contains the fix

Interesting features in development

as part of SolrCloud, distributed indexing is being implemented in JIRA issue SOLR-2358. You can already see the work in progress in the initial patch, but you can also check SOLR-2341 which deals with shard distribution policies which will be available in Solr 4.0
If you ever wanted to add custom fields (not existing in the index) to Solr responses, you couldn’t do that from Solr components. There were other ways to achieve such functionality (for instance, customizing the response writer class), but it looks like we’ll get this ability inside components, too. Needless to say, that would be much more natural. The issue Allow components to add fields to outgoing documents provides the umbrella for this new functionality. Although it is already closed, there are a few sub-issues in which the actual pieces of logic will be implemented.
if you have problems with case-sensitive searches in wildcard queries, you might take a look at the patch provided in JIRA issue Case Insensitive Search for Wildcard Queries
although Solr got its first solid spatial implementation in version 3.1, many people have run into its limitations. One of them is surely the case where documents have multivalued spatial fields. We already wrote about SOLR-2155 in our December digest, but work under that issue hasn’t stopped and keeps evolving. It is likely that it will become a part of the standard Solr distribution, and Lucene could get it incorporated, too. If you need spatial search you may want to watch this issue.

Interesting new features

one common problem when using Solr’s default spellchecker or auto-suggest is filtering suggestions based on what a given user can see (for instance, depending on the region in which the user resides). JIRA issue Doc Filters for Auto-suggest, spell checking, terms component, etc. proposes a feature which would help here. So far, no work has been done there, though we believe we’ll get to see some progress in the future. While we are at it, in case you need such a feature in Auto-suggest now, you might take a look at our in-house Search Auto-Complete solution, which you can see in action on search-lucene.com and search-hadoop.com.
just like there are default components for SearchHandlers (which are used by default for every new search handler, unless overridden), update processors will get a similar feature. JIRA issue Let some UpdateProcessors be default without explicitly configuring them will make sure that some important update processors are available by default to your UpdateRequestProcessorChain.
one great new feature could be added to Solr – the ability not to cache a filter. JIRA issue SOLR-2429 will deal with this. Many Solr users will be happy to optimize their cache performance once this feature becomes available.

Miscellaneous

some interesting thoughts on spellchecker can be found in ML thread My spellchecker experiment and much more on that topic in the related blog
should you use ASCIIFoldingFilter or MappingCharFilter when dealing with accents? Interesting discussion in thread Should ASCIIFoldingFilter be deprecated? could help you decide which one is right for you
interesting idea for Solr’s admin UI can be found in this ML thread. The community’s reception was very good, so we also got the issue Solr Admin Interface, reworked as the home for this new work.
anyone using Solr’s UIMA (Unstructured Information Management Architecture) contrib might be interested to know that its wiki page got a major improvement – more docs to read!
we might be a bit late on this, but there is still some time left – Google’s Summer of Code applications can be submitted until 8th April. Check this ML thread for some details. And don’t forget that Sematext is sponsoring interns, too!
new Solr/Lucene users should take a look at the Refcard provided by Erik Hatcher in ML thread [infomercial] Lucene Refcard at DZone
some deep thoughts on Solr/Lucene’s release process by some of the key people can be found here Brainstorming on Improving the Release Process. Related to that is a JIRA issue Define Test Plan for 4.0 which will… eh, contain some info about Test plan for 4.0 release, obviously. Also, check the TestPlans wiki page that’s in the making.

Although there were some other interesting topics, we have to stop somewhere. Until next month, you’ll find us on Twitter.

Sematext at Lucene Revolution 2011

Last year at Lucene Revolution in October in Boston, we shared how we built search-lucene.com and search-hadoop.com. In May of this year, we’ll again be talking at Lucene Revolution about another topic very dear to us at Sematext – Search Analytics (abstract). The full conference agenda is available. Start picking sessions to attend.

This year’s Lucene Revolution is extra interesting because Sematext is also sponsoring the conference. In addition to that, it’s great to see a couple of our customers presenting this year!

If you are coming to the conference, don’t be afraid to say hello. And if San Francisco is too far this year and you are on the east coast of the US in mid-June, you can also catch us at the Open Source Search Conference. And if you are in Europe, you’ll see us there in June of this year, too. Until then, so long from @sematext.

For more information on this topic read about Sematext Search Analytics service.