Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search

As promised in Part 1 of Solr Digest, Spring-Summer 2011, in this Part 2 post we’ll summarize what’s new with Solr’s Near-Real-Time Search support and Solr Cloud (if you love clouds and search with some big data on the side, get in touch). Let’s first examine what is being worked on for Solr Cloud and what else is in the queue for the near future. A good overview of what is currently functional can be found in the old Solr Cloud wiki page. Also, there is now another wiki page covering New Solr Cloud Design, which we find quite useful. The individual pieces of Solr Cloud functionality that are being worked on are as follows:

Work is still in progress on Distributed Indexing and Shard distribution policy. Patches exist, although they are now over 6 months old, so you can expect to see them updated soon.
As part of the Distributed Indexing effort, shard leader functionality deals with leader election and with publishing information in ZooKeeper about which node is the leader of which shard, so that all interested parties are notified. Development is pretty active here and initial patches already exist.
At some point in the future, Replication Handler may become cloud aware, which means it should be possible to switch the roles of masters and slaves, master URLs will be able to change based on cluster state, etc. The work hasn’t started on this issue.
Another feature Solr Cloud will have is automatic splitting and migrating of indices. The idea is that when some shard’s index becomes too large or the shard itself starts having bad query response times, we should be able to split off parts of that index and migrate them to (or merge them with) indices on other, less loaded nodes. Again, the work on this hasn’t started yet. Once this is implemented, one will be able to split and move/merge indices using a Solr Core Admin action as described in SOLR-2593.
To achieve more efficiency in search and gain control over where exactly each document gets indexed to, you will be able to define a custom shard lookup mechanism. This way, you’ll be able to limit execution of search requests to only some shards that are known to hold target documents, thus making the query more efficient and faster. This, along with the above mentioned shard distribution policy, is akin to routing functionality in ElasticSearch.
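To make the custom shard lookup idea a bit more concrete, here is a minimal sketch in Python. It is not Solr Cloud's API (which is still being designed); the function names, the `customer_id` routing field and the use of an MD5-based hash are all our own assumptions, chosen only to illustrate how a stable routing key lets a search hit a single shard instead of all of them.

```python
import hashlib

def shard_for(routing_key: str, num_shards: int) -> int:
    """Map a routing key to a shard index using a stable hash."""
    digest = hashlib.md5(routing_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def index_document(shards, doc, routing_field="customer_id"):
    """Store the document on the shard chosen by its routing key (hypothetical field)."""
    shard_id = shard_for(str(doc[routing_field]), len(shards))
    shards[shard_id].append(doc)
    return shard_id

def search(shards, routing_key, predicate):
    """Query only the one shard that can hold documents for this routing key."""
    shard_id = shard_for(str(routing_key), len(shards))
    return [d for d in shards[shard_id] if predicate(d)]

# Four in-memory lists stand in for four shards.
shards = [[] for _ in range(4)]
index_document(shards, {"id": 1, "customer_id": "acme", "title": "invoice"})
index_document(shards, {"id": 2, "customer_id": "acme", "title": "receipt"})
index_document(shards, {"id": 3, "customer_id": "globex", "title": "invoice"})

hits = search(
    shards,
    "acme",
    lambda d: d["customer_id"] == "acme" and d["title"] == "invoice",
)
print([d["id"] for d in hits])
```

The trade-off is the usual one for routing: searches that know the routing key touch one shard, while searches that don't still have to fan out to all of them.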

On to NRT:

There is now a new wiki page dedicated to Solr NRT Search. In short, NRT Search will be available in Solr 4.0, and the work currently in progress is already available on the trunk. The first new functionality that enables NRT Search in Solr is called “soft-commit”. A soft commit is a light version of a regular commit: it avoids the costly parts of a regular commit, namely the flushing of documents from memory to disk, while still allowing searches to see new documents. It appears that a common way of using this will be to issue a soft-commit every second or so, to make Solr behave as close to NRT as possible, while also issuing a “hard-commit” automatically every 1-10 minutes. A “hard-commit” is still needed so that the latest index changes are persisted to storage. Otherwise, in case of a crash, changes made since the last “hard-commit” would be lost.
Initial steps in supporting NRT Search in Solr were done in Re-architect Update Handler. Some old issues Solr had were dealt with, like waiting for background merges to finish before opening a new IndexReader, blocking of new updates while commit is in progress and a problem where it was possible that multiple IndexWriters were open on the same index. The work was done on solr2193 branch and that is the place where the spinoffs of this issue will continue to move Solr even closer to NRT.
One of the spinoffs of the Update Handler rearchitecture is SOLR-2565, which provides further improvements on the above mentioned issue. New issues to deal with other related functionality will be opened along the way, while SOLR-2566 looks to serve as an umbrella issue for NRT Search in Solr.
Partially related to NRT Search is the new Transaction Log implemented in Solr under SOLR-2700. The goal is to provide durability of updates, while also supporting features like the already committed Realtime get. Transaction logs are implemented in various other search solutions such as ElasticSearch and Zoie, so Simon Willnauer started a good thread about the possibility of generalizing this new Transaction Log functionality so that it is not limited to Solr, but exposed to other users and applications, such as Lucene, too.
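The difference between soft and hard commits described above can be illustrated with a toy model. This is only a sketch of the visibility and durability semantics, assuming an in-memory list stands in for the index and another list stands in for the disk; the `ToyIndex` class and its method names are hypothetical and have nothing to do with Solr's actual update handler code.

```python
class ToyIndex:
    """Toy model: 'searchable' is what queries see, 'durable' is what survives a crash."""

    def __init__(self):
        self.pending = []     # added, but not yet visible to searches
        self.searchable = []  # visible to searches (in memory only)
        self.durable = []     # flushed to "disk"

    def add(self, doc):
        self.pending.append(doc)

    def soft_commit(self):
        """Make pending documents searchable without flushing them."""
        self.searchable.extend(self.pending)
        self.pending.clear()

    def hard_commit(self):
        """Make pending documents searchable and persist everything."""
        self.soft_commit()
        self.durable = list(self.searchable)

    def crash_and_restart(self):
        """Only what a hard commit persisted is still there after a crash."""
        self.pending = []
        self.searchable = list(self.durable)

    def search(self, term):
        return [d for d in self.searchable if term in d]


index = ToyIndex()
index.add("solr cloud")
index.hard_commit()                     # persisted

index.add("solr nrt")
before_soft = index.search("nrt")       # not visible yet
index.soft_commit()
after_soft = index.search("nrt")        # visible, but not persisted

index.crash_and_restart()
after_crash = index.search("solr")      # the soft-committed document is gone
```

A transaction log, as discussed above, is one way to close the gap this model exposes between "searchable" and "durable".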

We hope you found this post useful. If you have any questions or suggestions, please leave a comment, and if you want to follow us, we are @sematext on Twitter.

Solr Digest, Spring-Summer 2011, Part 1

No, Solr Digests are not dead, we’ve just been crazily busy at Sematext (yes, we are hiring!). Since our last Solr Digest not one but two new Solr releases have been made, 3.2 in June and 3.3 in July, and version 3.4 is imminent – voting is already in progress, so you can expect a new release pretty soon. Also, there were a number of interesting developments on the trunk (future 3.x and 4.0 versions). Therefore, we will be publishing two Solr Digests this time. This first Digest covers general developments in the Solr world, while the sequel will be more focused on two features drawing a lot of attention: Solr Cloud and Near Real Time search.

Let’s get started with a short overview of announced news in 3.2 and 3.3. First, 3.2 brought us:

Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
TermQParserPlugin, useful when generating filter queries from terms returned by field faceting or terms component
DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
Improvements to the UIMA and Carrot2 integrations
Highlighting performance improvements
A test-framework jar for easy testing of Solr extensions
Bugfixes and improvements from Apache Lucene 3.2
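As an illustration of the first 3.2 item above, the sketch below builds (but does not send) a JSON update request that passes commitWithin and overwrite as request parameters rather than inside the JSON body. The `/solr/update/json` path, the host and port, and the document fields are assumptions based on a typical example setup; check your own solrconfig.xml for the actual handler path.

```python
import json
from urllib.parse import urlencode

def build_json_update(base_url, doc, commit_within_ms=None, overwrite=None):
    """Return (url, body) for a JSON update carrying options as request parameters."""
    params = []
    if commit_within_ms is not None:
        params.append(("commitWithin", str(commit_within_ms)))
    if overwrite is not None:
        params.append(("overwrite", "true" if overwrite else "false"))
    url = base_url + ("?" + urlencode(params) if params else "")
    # Command-style JSON body: a single "add" wrapping one document.
    body = json.dumps({"add": {"doc": doc}})
    return url, body

url, body = build_json_update(
    "http://localhost:8983/solr/update/json",   # assumed handler path
    {"id": "1", "title": "hello"},              # hypothetical document
    commit_within_ms=10000,
    overwrite=False,
)
print(url)
print(body)
```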

With 3.3 we got:

Grouping / Field Collapsing
A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption
KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English
Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See Mike’s cool Lucene segment merging video
Important bugfixes, including extremely high RAM usage in spellchecking
Bugfixes and improvements from Apache Lucene 3.3
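For readers new to Grouping / Field Collapsing, the first 3.3 item above, here is a toy sketch of the idea: given results already sorted by relevance, keep only the top few documents per distinct value of some field. This is our own simplified illustration, not Solr's implementation, and the `author` field and the `collapse` helper are hypothetical.

```python
def collapse(ranked_docs, field, limit=1):
    """Group ranked documents by a field, keeping at most `limit` docs per group."""
    groups = {}
    for doc in ranked_docs:
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups

# Documents are assumed to arrive sorted by descending score.
ranked = [
    {"id": 1, "author": "erik", "score": 0.9},
    {"id": 2, "author": "erik", "score": 0.8},
    {"id": 3, "author": "otis", "score": 0.7},
    {"id": 4, "author": "erik", "score": 0.6},
]

groups = collapse(ranked, "author", limit=2)
for author, docs in groups.items():
    print(author, [d["id"] for d in docs])
```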

Let’s now look at other interesting stuff. We’ll start with DataImportHandler and its bug fixes. As you’ll notice, there are quite a few of them (and we didn’t even list them all!) so we advise using all available patches.

Already committed features

A replication bug fix – “replication reserves commit-point forever if using replicateAfter=startup”. SOLR-2469 brought the fix to version 3.2 and to the future 4.0 (trunk). This problem caused an unnecessary (and huge) buildup in the number of index files on the slaves.
A bug fix for DataImportHandler – DIH does not commit if only Deletes are processed. When the special commands $deleteDocById and/or $deleteDocByQuery were used and no documents were updated, DIH didn’t call commit. The fix is available in 3.4 and 4.0.
Also – DataImportHandler multi-threaded option throws exception. The problem would happen when the threads attribute was used. The fix for this is available in 3.4 and 4.0. Related to this is another fixed issue – DIH multi threaded mode does not resolves attributes correctly – which is also available in 3.4 and 4.0.
The Join feature got committed to the trunk (future 4.0 version). It can also perform cross-core joins now, which can be very useful. However, this feature also initiated some heated discussions, which can be seen in SOLR-2272. The root cause was the fact that this feature was committed only to Solr while Lucene got none of it. Of course, it might get refactored and included in Lucene too in the future, but this discussion shows the divisions which still existed between the Solr and Lucene communities back then.
While we’re talking about Join feature, it might be worth mentioning a patch in SOLR-2604 which back-ports it to 3.x version. Be careful though, it was created for version 3.2 more than two months ago, so a few more adjustments after applying this patch might be needed.
Function Queries got new if(), exists(), and(), or(), not(), xor() and def() functions. The fix is committed to trunk so you’ll be able to use it in 4.0.
As can be seen from the Solr 3.3 announcement, one of the longest living Solr issues is finally closed for good :). SOLR-236 – Field Collapsing – along with SOLR-2524 finally brings field collapsing to 3_x and future 4.0 versions.
Since grouping/field collapsing was added to Solr, we should be able to use faceting in combination with it. Issue SOLR-2665 – Solr Post Group Faceting – brought exactly that to 3.4 and 4.0.
Ever wanted to have more control over what gets stored in the cache? SOLR-2429 will bring exactly that starting with the next Solr release – 3.4. It is simple to use, just add cache=false to your queries like this: fq={!frange l=10 u=100 cache=false}mul(popularity,price). Note that with this new functionality you can prevent either a filter or a query from being cached, while document caching still remains out of request-time control.
If you’re using JMX to observe the state of your Solr installation, you might have encountered a problem when reloading Solr cores – it appears that JMX beans didn’t survive those reloads in the past versions. The fix is created and is available in future 3_x and trunk releases.
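Since the cache=false filter mentioned above contains characters that need URL encoding, here is a small sketch of how such a request URL could be assembled from a client. The host, port and `/solr/select` path are assumptions based on the stock example setup; the filter itself is the one quoted above. The code only builds the URL and decodes it again, it does not contact Solr.

```python
from urllib.parse import urlencode, parse_qs

def select_url(base_url, q, filters):
    """Build a select URL with one fq parameter per filter."""
    params = [("q", q)] + [("fq", fq) for fq in filters]
    return base_url + "?" + urlencode(params)

uncached_filter = "{!frange l=10 u=100 cache=false}mul(popularity,price)"
url = select_url("http://localhost:8983/solr/select", "*:*", [uncached_filter])
print(url)

# Decoding the query string shows the filter survives the round trip unchanged.
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["fq"])
```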

Interesting features in development

To achieve case-insensitive search with wildcard queries you could use a patch supplied under issue SOLR-2438. It has to be said that this isn’t committed to svn and it is hard to say whether it ever will be, since there is a similar issue, SOLR-219, on which work started 4 years ago.
Multithreaded faceting might bring some performance improvements. At the moment an initial patch exists, but more work will be needed here and it still isn’t clear how big an improvement we could expect in real-world conditions. Still, it is worth keeping an eye on this issue.
We all know that Solr’s Spatial support has its limitations. One of them is specifying bounding box which isn’t based on point distance, effectively making it limited to a circular shape. Under SOLR-2609 we might get support for exactly this.
For anyone interested in which direction Spatial support might evolve, we suggest checking Lucene Spatial Playground. It continues the great work done in SOLR-2155 which provided extension to initial GeoSpatial support in Solr by adding multivalued spatial fields. At some point, SOLR-2155 might get the goodness from LSP. Also, another thing to check would be a thread on Lucene Spatial Future.

Interesting new features

Support for Lucene’s Surround Parser is added to Solr in issue SOLR-2703. The patch is already committed to the trunk.
Solr will get the ability to use configuration like analyzer type=”phrase”. Lucene’s Query Parsers recently got a simpler way to use different analyzer based on the query string. One example is usage of double quotes where one can decide that instead of current meaning in Lucene/Solr world – specifying a phrase to be searched for – it should have a meaning like in Google’s search engine – find this exact wording. Patch for this exists and can be applied on the trunk (it depends on Lucene trunk).
SOLR-2593 aims to provide a new Solr core admin action – ‘split’ – for splitting index. It would be used in case some core got too big or in any other case you might find it necessary. Lucene already has a similar function.

Miscellaneous

Oracle released Java 7 about a month ago, but we advise against using it yet. JVM crashes and index corruption are issues likely to be encountered with it. For more information, visit this URL
As anticipated for some time, Java 5 support got axed from Lucene 4.0 (trunk). You can expect similar stuff for Solr too.
Solr’s build system has been reworked now. Among other things, this implies changes in directory structure in Solr project. For example, solr/src/ doesn’t exist any more and its old subdirs /java and /test are now in solr/core/. The changes are already applied to the trunk and 3_x which holds the next 3.4 version. For more details, see SOLR-2452.
A handy Solr architecture diagram can be found in ML thread
Solr’s Admin UI is being refreshed with the work in JIRA issue SOLR-2399 (we already wrote about it) and its spin-off SOLR-2667. Some of this stuff is already committed (on the trunk), so you may want to inspect the changes. More details can be found in the wiki, where you can also get a sneak peek of the upcoming changes.

And that would be all for part one of the Solr Spring-Summer 2011 Digest edition from @sematext. Part two of the Spring-Summer Digest is coming in a few days – stay tuned!

Solr Digest, February-March 2011

We Sematexters have been very busy over the past few months, so we missed Solr’s February Digest. This one will therefore be a bit longer than usual. Let’s get started…

First, some major news : Solr 3.1 is officially released! The details of the announcement can be found here. We covered most of the new features in our digests already, so we’ll keep it short:

Numeric range facets (similar to date faceting)
New spatial search, including spatial filtering, boosting and sorting capabilities
Example Velocity driven search UI at http://localhost:8983/solr/browse
A new termvector-based highlighter
Extended dismax (edismax) query parser which addresses some missing features in the dismax query parser along with some extensions
Several more components now support distributed mode: TermsComponent, SpellCheckComponent
A new Auto Suggest component
Ability to sort by functions
JSON document indexing
CSV response format
Apache UIMA integration for metadata extraction
Leverages Lucene 3.1 and its inherent optimizations and bug fixes as well as new analysis capabilities
Numerous improvements, bug fixes, and optimizations

You can start your download :).

Already committed features

post.jar got improved – JIRA issue improve post.jar to handle non UTF-8 files removed some of its very old limitations
The Jetty server included in the Solr distribution didn’t support UTF-8. This is now solved, and the fresh 3.1 version already contains the fix

Interesting features in development

As part of SolrCloud, distributed indexing is being implemented in JIRA issue SOLR-2358. You can already see the work in progress in the initial patch, but you can also check SOLR-2341, which deals with shard distribution policies which will be available in Solr 4.0
If you ever wanted to add custom fields (not existing in the index) to Solr responses, you couldn’t have done that from Solr components. There were other ways to achieve such functionality (for instance, customizing the response writer class), but it looks like we’ll get this ability inside components, too. No need to say how much more natural that would be. Anyway, issue Allow components to add fields to outgoing documents provides the umbrella for this new functionality. Although it is already closed, there are a few sub-issues in which the actual pieces of logic will be implemented.
If you have a problem with case sensitive searches in wildcard queries, you might take a look at a patch provided in JIRA issue Case Insensitive Search for Wildcard Queries
Although Solr got its first solid spatial implementation in version 3.1, many people have run into its limitations. One of them is surely the case where documents have multivalued spatial fields. We already wrote about SOLR-2155 in our December digest, but work under that issue hasn’t stopped and keeps evolving. It is likely that it will become a part of the standard Solr distribution and Lucene could get it incorporated, too. If you need spatial search you may want to watch this issue.

Interesting new features

One common problem when using Solr’s default spellchecker or auto-suggest is filtering of suggestions based on what some user can see (for instance, depending on the region in which your user resides). JIRA issue Doc Filters for Auto-suggest, spell checking, terms component, etc. proposes a feature which would help here. No work has been done there so far, though we believe we’ll get to see some progress in the future. While we are at it, in case you need such a feature in Auto-suggest now, you might take a look at our in-house Search Auto-Complete solution, which you can see in action on search-lucene.com and search-hadoop.com.
Just like there are default components for SearchHandlers (which are used by default for every new search handler, unless overridden), update processors will get a similar feature. JIRA issue Let some UpdateProcessors be default without explicitly configuring them will take care that some important update processors are available by default to your UpdateRequestProcessorChain.
One great new feature could be added to Solr – the ability not to cache a filter. JIRA issue SOLR-2429 will deal with this. Many Solr users will be happy to optimize their cache performance when this feature is available some day.

Miscellaneous

Some interesting thoughts on spellchecker can be found in ML thread My spellchecker experiment and much more on that topic in the related blog
Should you use ASCIIFoldingFilter or MappingCharFilter when dealing with accents? An interesting discussion in thread Should ASCIIFoldingFilter be deprecated? could help you decide which one is right for you
An interesting idea for Solr’s admin UI can be found in this ML thread. The community’s reception was very good, so we also got the Solr Admin Interface, reworked issue as the home for this new work.
Anyone using Solr’s UIMA (Unstructured Information Management Architecture) contrib might be interested to know that its wiki page got a major improvement – more docs to read!
We might be a bit late on this, but there is still some time left – Google’s Summer of Code applications can be submitted until 8th April. Check this ML thread for some detail. And don’t forget that Sematext is sponsoring interns, too!
New Solr/Lucene users should take a look at the Refcard provided by Erik Hatcher in ML thread [infomercial] Lucene Refcard at DZone
Some deep thoughts on Solr/Lucene’s release process by some of the key people can be found here: Brainstorming on Improving the Release Process. Related to that is a JIRA issue Define Test Plan for 4.0 which will… eh, contain some info about the Test plan for the 4.0 release, obviously. Also, check the TestPlans wiki page that’s in the making.

Although there were some other interesting topics, we have to stop somewhere. Until next month, you’ll find us on Twitter.

Hive Digest, March 2011

Welcome to the first Hive digest!

Hive is a data warehouse built on Hadoop. Initially developed by Facebook, it’s been under the Apache umbrella for about 2 years and has seen very active development. Last year there were 2 major releases which introduced loads of features and bug fixes. Now Hive 0.7.0 has just been released and is packed with goodness.

Hive 0.6.0

Hive 0.6.0 was released October last year. Some of its most interesting features included

Better skew joins.
Views were added.
Database/schema support was added to Hive QL.
Integration with HBase was added, making it possible to read HBase tables via Hive and bulk load Hive tables into HBase.
There were multiple improvements making it easier to work with partitions, including multi partition inserts and archiving of partitions.

Hive 0.7.0

Hive 0.7.0 has just been released! Some of the major features include:

Indexing has been implemented, although index types are currently limited to compact indexes. This feature opens up lots of potential for future improvements, such as HIVE-1694, which aims to use indexes to accelerate query execution for GROUP BY, ORDER BY, JOINS and other misc cases, and HIVE-1803, which will implement bitmap indexing.
Security features have been added with authorisation and authentication.
There is now an optional concurrency model which makes use of Zookeeper, so tables can now be locked during writes. It is disabled by default, but can be enabled using hive.support.concurrency=true in the config.

And many other small improvements including:

Making databases more useful, you can now select across a database.
The Hive command line interface has gotten some love and now supports auto-complete.
There’s now support for HAVING clauses, so users no longer have to do nested queries in order to apply a filter on group by expressions.

and much more.
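To show what the new HAVING support mentioned above saves you, here is a small runnable comparison of the old nested-query workaround and the HAVING form. Note the assumptions: we use Python's built-in SQLite as a stand-in because it runs anywhere, not Hive itself, and the `page_views` table and its data are made up. The SQL shown is plain enough that the HiveQL equivalent should look very similar, but we have not run it against Hive here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 1), ("home", 2), ("home", 3), ("about", 1), ("docs", 2), ("docs", 3)],
)

# Without HAVING: aggregate in a subquery, then filter in the outer query.
nested_sql = """
SELECT page, views FROM (
    SELECT page, COUNT(*) AS views FROM page_views GROUP BY page
) t
WHERE views >= 2
ORDER BY page
"""

# With HAVING: the filter on the aggregate sits next to the GROUP BY.
having_sql = """
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page
HAVING COUNT(*) >= 2
ORDER BY page
"""

nested_rows = conn.execute(nested_sql).fetchall()
having_rows = conn.execute(having_sql).fetchall()
print(nested_rows)
print(having_rows)
```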

You can download Hive 0.7.0 from here and you can follow @sematext on Twitter.

Solr Digest, January 2011

Welcome to the second season of Sematext’s monthly Solr Digests. Once again, we compiled a list of most interesting topics in Solr world for the previous month:

Already committed features

A bug related to using PHPSerialized response writer in sharded environment was fixed and committed in SOLR-2307. It affected all recent Solr versions (trunk, 3_x, 1.4.1,…) and the fix is committed to 3_x branch and trunk. In case you’re stuck with an older version of Solr, you can try applying the patch manually; it should be doable.
One old JIRA issue Enable sorting by Function Query is finally closed and committed to 3_x and trunk.
A problem with race condition in StreamingUpdateSolrServer got its fixes before, however it appears that issue wasn’t fixed completely. Now another fix is committed to 3_x and trunk, so if you use this feature, we advise picking up the fix.

Interesting features in development

Support for complex syntax (e.g. wildcards) in phrase queries is being brought to Lucene. In case you’re interested, you can take a look at LUCENE-1823 or LUCENE-1486 which was another try at similar functionality. These issues have been in development for a long time and still aren’t finished, although patches exist. Similar feature for Solr is developed under SOLR-1604, where you can also find some patches. However, we think it is a bit unclear if any of these issues will ever be committed to Lucene/Solr, so if you’re interested, check the progress on them occasionally and don’t hold your breath.

Interesting new features

Solr might get improved per-field similarity integration into schema.xml. Currently, in Solr’s schema only global SimilarityProvider can be defined.

Miscellaneous

Anyone having performance problems when using large start and rows parameters could benefit from looking at issue SOLR-2218. You can find some advice on how to deal with the problem using existing Solr capabilities.
An interesting patch from issue Modify default solrconfig parameters via JMX aims to provide more flexibility in configuring Solr
As usual, one common question is related to Solr/Lucene versions and release dates. In ML threads [Solr4.0] Release Date, Lucene 3.1 Release Proposal, Release schedule Lucene 4? and Is solr 4.0 ready for prime time? (or other ways to use geo distance in search) you can find more. In short, 3.1 is the next version and might happen soon (March is being mentioned). 4.0 is a major release with many features not present in 3.1 and it is not likely that we will get it soon. Another ML thread provides insight into future release strategy.
As usual, heated discussions are being held over Maven in the Lucene/Solr world. If you’re interested in reading what the community thinks about Maven’s place in Lucene/Solr, we recommend reading the (very long!) ML thread Let’s drop Maven Artifacts ! Discussions like this might eventually lead to Maven being dropped, and as a matter of fact, some sort of voting is already done in ML thread [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors? where you can make your voice heard.
A few interesting conferences are slowly approaching : Lucene Revolution 2011 is Coming – May 25 & 26 and Berlin Buzzwords 2011. Also, note that applications for Google’s Summer of Code must be submitted by the end of February.

And that’s all for January.

Solr Digest, December 2010

Just the other day, we posted the Lucene & Solr highlights in our Lucene & Solr: 2010 in Review post, and now it’s time to really conclude 2010 in Solr world with December Solr Digest. Although one might expect festive period to take its toll on the Solr development velocity, it wasn’t like that at all. Open source never sleeps. Here are the most interesting highlights:

Interesting features in development

In our July Digest we mentioned the LanguageIdentifierUpdateProcessor feature which is being proposed under JIRA issue SOLR-1979. Some artifacts in the form of patches are starting to appear attached to that issue, so if you’re interested in this feature, take a look.
Solr’s spatial capabilities are being further refined. With issue SOLR-2268, Solr will get “support for Point in Polygon searches”. This should enable features like “for a given point, return all documents which contain a polygon inside of which that point lays” and “for a given polygon, return all documents which have a point contained inside of that polygon“. Of course, negated versions of such feature will be supported. The work is in early stages, one patch is attached, but it can be used only as a general pointer about how this thing will be implemented, nothing else.
Support for “ColognePhonetic” encoder was added to PhoneticFilterFactory. Since “ColognePhonetic” will be added to Commons Codec 1.4.1, the patch provided in SOLR-2276 will wait until that version gets released.
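For those wondering what a Point in Polygon search boils down to, here is a minimal sketch of the classic ray-casting test, applied to both query directions quoted above. This is a textbook technique on a flat plane, not the code from the SOLR-2268 patch, and the document names and coordinates are invented for the example.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray to the right of the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]

# "For a given polygon, return all documents which have a point inside it."
point_docs = {"cafe": (2, 2), "museum": (5, 2)}
docs_in_square = sorted(
    name for name, p in point_docs.items() if point_in_polygon(p, square)
)

# "For a given point, return all documents whose polygon contains it."
polygon_docs = {
    "downtown": square,
    "harbor": [(10, 10), (12, 10), (12, 12), (10, 12)],
}
docs_containing_point = sorted(
    name for name, poly in polygon_docs.items() if point_in_polygon((1, 1), poly)
)

print(docs_in_square, docs_containing_point)
```

Points lying exactly on an edge are a known grey area for this test, and real geographic data also has to deal with the curvature of the Earth, which this sketch ignores.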

Interesting new features

Solr is getting JOIN functionality – sort of. As part of SOLR-2272 , Solr got a working patch that provides SQL JOIN-like functionality. Of course, this is not exactly the JOIN you might know from SQL, but it is probably the closest thing to it which can be implemented in Solr. It is likely that this feature will be integrated into Lucene as well – it makes no sense to have it strictly in Solr. It is also likely that this feature will be expanded in the future; currently it has only one algorithm and supports many-to-many type of JOIN.
As part of SolrCloud, a new feature, SolrCloud distributed indexing, will be added to Solr some day. SOLR-2293 will likely be the JIRA home for this feature. Before SolrCloud, anyone using distributed indexing had to create custom logic to handle the distribution of documents over the various shards in the cluster. With SolrCloud, this will be transparent to the clients. Also, SolrCloud will include some out-of-the-box distribution algorithms, while, of course, plugging in custom algorithms will be easy to accomplish. However, don’t hold your breath waiting for this feature. At the moment, only a JIRA issue (and some guidelines in the Wiki) exists for this feature.
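Going back to the JOIN-like functionality from SOLR-2272 described above, the following toy sketch shows the general idea as we understand it: run a query, collect the values of a "from" field from the matching documents, then return the documents whose "to" field holds one of those values. The `join` helper, the field names and the sample documents are all hypothetical; this illustrates the semantics only and says nothing about how the patch is implemented.

```python
def join(docs, from_field, to_field, matches):
    """Return docs whose to_field equals the from_field of any doc matching the query."""
    keys = {d[from_field] for d in docs if from_field in d and matches(d)}
    return [d for d in docs if d.get(to_field) in keys]

docs = [
    {"id": "apple", "type": "manufacturer"},
    {"id": "belkin", "type": "manufacturer"},
    {"id": "p1", "type": "product", "name": "ipod", "manu_id": "apple"},
    {"id": "p2", "type": "product", "name": "cable", "manu_id": "belkin"},
]

# "Manufacturers of products named ipod": join from manu_id to id.
manufacturers = join(docs, "manu_id", "id", lambda d: d.get("name") == "ipod")
print([d["id"] for d in manufacturers])
```

Unlike a SQL JOIN, the result here contains only documents from the "to" side; nothing from the matching products is merged into them.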

Miscellaneous

Solr’s Jetty is now upgraded to the latest 6.1.26 version. The change was committed to 3_x branch and trunk but, of course, it doesn’t mean you have to upgrade your Jetty too. It means just that stuff (various jars and few xmls) under /solr/example got upgraded. Details of this change can be found in SOLR-2265.
Did you experience any problems with DataImportHandler and its multi-threaded option? If so, you are not the only one. More details about the nature of the problem can be found in SOLR-2186, while it appears that SOLR-2233 already contains a patch which might help in case of JDBC data source. That patch contains a few other DataImportHandler multithreading fixes. Nothing related to this issue has been committed to trunk or 3_x branch yet.
Many problems people encounter with Solr are related to OutOfMemory errors. There were many interesting discussions on the ML in December, and two of them on this topic are worth a look: OutOfMemory GC: GC overhead limit exceeded – Why isn’t WeakHashMap getting collected? and Memory use during merges (OOM)
If you found “bf” parameter somewhat limited – it doesn’t accept complex nested expressions with lots of whitespaces – you might find patch from SOLR-2014 useful. It appears that this is a common problem: another similar bug report (although this actually isn’t a bug) was also opened recently – SOLR-2267. However, consider that “bf” parameter will likely be deprecated in the future, since “bq” could be used to achieve everything “bf” can and more.
In our November Digest, we mentioned SOLR-2154 and Solr’s problem with multi-valued spatial fields. What we missed was SOLR-2155, which actually provides a patch for such problems. Although it isn’t committed, the patch appears to be functional, so give it a try.

And that would be all for Solr in 2010. We’ll be back with the new Solr Digest in a month. Follow @sematext for other interesting search news.

Solr Digest, November 2010

It is time for the last Solr Digest of 2010; the next Digest will be published some time in January 2011. This was not a month with too many interesting developments, so here we bring to your attention only the few more interesting bits. Here we go…

Already committed features

Anyone working with Polish language will be happy to hear that factory for Polish stemmer is committed to 3_x and trunk.

Interesting features in development

Ever had problems with Solr’s sharding when one of the shards fails? One interesting patch that will help you in such cases is attached to JIRA issue Solr should be able to keep on truckin’ if a shard fails during a distributed search. Nice issue name, huh? Another old issue related to this problem, Return partial results when a connection to a shard is refused, also contains a patch. We didn’t try these patches, but one of them should be suitable if you experience problems like this.
Work is being done on Improve analyzer/version handling in Solr. In the future, this should notify users about things like deprecated or old version of APIs they currently use and that will be removed in the future versions. Check out this JIRA issue for more details.
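The "keep on truckin'" behaviour for failed shards mentioned above can be sketched roughly as follows. This is a toy coordinator written for illustration, not the logic from either patch; the response keys (`partial`, `failed_shards`) and the shard functions are hypothetical.

```python
def distributed_search(shards, query):
    """Query every shard, collecting hits and remembering which shards failed."""
    hits, failed = [], []
    for name, search_fn in shards.items():
        try:
            hits.extend(search_fn(query))
        except ConnectionError:
            # Instead of failing the whole request, note the failure and move on.
            failed.append(name)
    return {"hits": hits, "partial": bool(failed), "failed_shards": failed}

def healthy_shard(query):
    return ["doc matching %r from shard1" % query]

def broken_shard(query):
    raise ConnectionError("connection refused")

result = distributed_search({"shard1": healthy_shard, "shard2": broken_shard}, "solr")
print(result)
```

Flagging the response as partial matters: the client gets results from the shards that answered, but can still tell that some documents may be missing.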

Miscellaneous

A fix for a feature that was committed earlier this year – Enable sorting by Function Query – is close to being committed. This is big one! There were some problems with it: functions weren’t weighted, function query wasn’t being properly parsed, some deprecated bits of code were used, etc. Patch is already posted, so if you are eager to use this functionality you can start by applying the patch yourself.

If you’re thinking about using NRT (near-real time) search capabilities of future Solr 4.0, some food for thought may be found in this ML thread – Possibilities of (near) real time search with solr.

Many people are using Spatial Search features recently introduced in Solr. If you’re considering that too, be careful about one limitation: there is no Spatial support for multi-valued fields. So, if you have multi-valued spatial fields and you’d like to do some sorting on them, you’ll end up with incorrect results. The feature we’re describing here can be found in some other search tools, though, like Elastic Search, so Solr might be getting it too some day. You can check if there is some progress with this in JIRA issues like SOLR-2154

There is a major bug in DataImportHandler – it doesn’t release JDBC connections. It appears that this issue isn’t related to any particular database, so this is an obvious bug in DIH. Check this JIRA issue for updates.

If you prefer git over svn, you might be interested in Solr’s git repository recently set up. Check this ML thread to learn more about it.

So long until 2011, Solr Digest readers! Follow @sematext on Twitter for other stuff from Sematext.

Solr Digest, October 2010

Another busy month is behind us. There were plenty of interesting topics, so let’s get started:

Already committed functionality

One rather common problem people had with Highlighter was the fact that it ignored the q.alt parameter. This is now fixed in issue Highlighter doesn’t support q.alt and is already committed to branch_3x and trunk.
A bug where unregistered searchers were not being closed is fixed as part of issue SOLR-2179 and is also already committed.
A minor but useful feature is also committed to trunk and branch_3x – Propose adding field to the admin GUI to indicate the status of HTTP caching.
A fix related to thread safety in StreamingUpdateSolrServer is committed to trunk, branch_3x and 1.4 – more about it in SOLR-2192.

Interesting functionality in development

Faceting is heavily used functionality, but occasionally people find they’re missing some form of faceting. Hierarchical faceting is one such thing. It has been in development for a very long time but, despite a few posted patches, is still not a part of the Solr distribution, and it hasn’t seen much activity lately. There is another similar issue – Pivot (aka Decision Tree) Faceting Component – which should come to life as a separate component. However, there is renewed effort to make it usable, so eventually we’ll see expanded faceting support in Solr.

Interesting new functionality

Extending SchemaField with custom attributes is being dealt with in the issue Custom SchemaField object.
Improving search relevance is always a big issue (and represents a good part of what Sematext does in client engagements), no matter how good Solr and Lucene relevance is out of the box. One very useful addition to our search relevance arsenal could come from the Anti-phrasing feature. The idea is that some word sequences in a query are irrelevant to the query meaning (like “Where can I find” or “Where is”) and could/should be ignored while searching the index. This JIRA issue is still very fresh, so don’t hold your breath waiting for the implementation to become available next week, although we are bound to see this feature in one of the future Solr releases.
If you often work with financial data, you might find patches from the issue Money FieldType useful. The new field type will support point and range queries, sorting, and exchange rates.
Lucene’s ICUTokenizer is useful for multilingual tokenizing but until recently there was no support for it in Solr. The issue Provide Solr FilterFactory for Lucene ICUTokenizer will provide a filter factory which will enable us to use it from Solr. Bingo! The patch already exists, so it can be tried right away. Additional new functionality will be added over time. If you need multilingual support in Solr, have a look at Sematext’s popular Multilingual Indexer.
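To make the anti-phrasing idea mentioned above a bit more concrete, here is a minimal Python sketch of stripping filler phrases from a query before it is sent to the search engine. It is not the code from the JIRA issue: the phrase list and the function name strip_anti_phrases are hypothetical, and a real implementation would likely live inside a Solr query parser or search component.

```python
import re

# Hypothetical anti-phrase list; a real deployment would curate its own.
ANTI_PHRASES = ["where can i find", "where is", "how do i"]


def strip_anti_phrases(query, anti_phrases=ANTI_PHRASES):
    """Remove known filler phrases (whole words, case-insensitive) from a query."""
    cleaned = query
    # Try longer phrases first so a short phrase does not pre-empt a longer one.
    for phrase in sorted(anti_phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by removed phrases.
    return " ".join(cleaned.split())


print(strip_anti_phrases("Where can I find cheap flights"))
```

Only the remaining terms would then be passed on as the actual query, so the filler words no longer influence matching or scoring.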

Miscellaneous

One of the favorite topics, which we also cover frequently, is related to the ongoing confusion about Solr versions. October didn’t disappoint: this topic was discussed on mailing lists again. So, here is one such thread – Which version of Solr to use?. Let us summarize the key parts. Solr 1.5 will probably never be released. The branch_3x is a stable version from which the next Solr 3.1 version will likely be released. The trunk contains a relatively stable, but still development, version of what will become Solr 4.0 one day.
If you provide faceting functionality in your application, here is a small (but interesting) discussion that might give you a few ideas about how to optimize it – Faceting and first letter of fields.
It appears that Solr has problems running on Tomcat 7. These problems are not related to a particular version of Solr, but to all versions. To learn more, start with Problems running on tomcat and SOLR-2022.
The replication between Solr master and slave when they’re running different versions of Solr is broken, as you can see in issue Cross-version replication broken by new javabin format. The cause is the new javabin format, so in cases like the one described in this issue (master 1.4.1, slave 3x), you’ll encounter problems. Keep that in mind if you plan cross-version replication for some reason.

These were the most interesting highlights for the month of October. Thank you for reading Sematext Blog and following @sematext on Twitter.

Mahout Digest, October 2010

We’ve been very busy here at Sematext, so we haven’t covered Mahout during the last few months. We are pleased with what’s been keeping us busy, but are not happy about our irregular Mahout Digests. We covered the last (0.3) release with all of its features and we are not going to miss covering a very important milestone for Mahout: release 0.4 is out! In this digest we’ll summarize the most important changes in Mahout since the last digest and add some perspective.

Before we dive into Mahout, please note that we are looking for people with Machine Learning skills and Mahout experience (as well as good Lucene/Solr search people). See our Hiring Search and Data Analytics Engineers post.

This Mahout release brings overall changes regarding model refactoring and command line interface to Mahout aimed at improving integration and consistency (easier access to Mahout operations via the command line). The command line interface is pretty much standardized for working with all the various options now, which makes it easier to run and use. Interfaces are better and more consistent across algorithms and there have been many small fixes, improvements, refactorings, and clean-ups. Details on what’s included can be found in the release notes and download is available from the Apache Mirrors.

Now let’s add some context to various changes and new features.

GSoC projects

Mahout completed its Google Summer of Code projects and two were completed successfully:

EigenCuts spectral clustering implementation on Map-Reduce for Apache Mahout (addresses issue MAHOUT-328), proposal and implementation details can be found in MAHOUT-363
Hidden Markov Models based sequence classification (proposal for a summer-term university project), proposal and implementation details in MAHOUT-396

Two projects did not complete due to lack of student participation and one remains in progress.

Clustering

The biggest additions in the clustering department are the EigenCuts clustering algorithm (a project from GSoC) and MinHash based clustering, which we covered as one of the possible GSoC suggestions in one of the previous digests. MinHash clustering was implemented, but not as a GSoC project. In the first digest from the Mahout series we covered problems related to the evaluation of clustering results (an unsupervised learning issue), so a big addition to Mahout’s clustering is the Cluster Evaluation Tools, featuring the new ClusterEvaluator (which uses Mahout In Action code for inter-cluster density and similar code for intra-cluster density over a set of representative points, not the entire clustered data set) and CDbwEvaluator, which offers new ways to evaluate clustering effectiveness.
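For readers unfamiliar with MinHash, the following Python sketch shows the general technique behind it: each set is reduced to a short signature of minimum hash values, and the fraction of matching signature positions estimates the Jaccard similarity of the original sets. This is an illustration only, not Mahout’s implementation; the function names and the use of seeded MD5 hashes are assumptions made for the example.

```python
import hashlib


def minhash_signature(items, num_hashes=16):
    """Return a MinHash signature for a non-empty collection of strings.

    Each of the num_hashes positions holds the minimum value of one seeded
    hash function over all items.
    """
    signature = []
    for seed in range(num_hashes):
        min_value = min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        )
        signature.append(min_value)
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions, an estimate of the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = {"solr", "lucene", "search", "index"}
doc_b = {"solr", "lucene", "search", "mahout"}
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

Items whose signatures (or parts of them) collide can then be grouped into the same cluster without comparing every pair of items directly.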

Logistic Regression

Online learning capabilities such as the Stochastic Gradient Descent (SGD) algorithm implementation are now part of Mahout. Logistic regression is a model used for predicting the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person’s age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences as well as in marketing applications such as prediction of a customer’s propensity to purchase a product or cease a subscription. The Mahout implementation uses Stochastic Gradient Descent (SGD); see MAHOUT-228 for the initial request and development. The new sequential logistic regression training framework supports a feature vector encoding framework for high speed vectorization without a pre-built dictionary. You can find more details on Mahout’s logistic regression wiki page.
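The two ideas described above, SGD training of a logistic regression model and dictionary-free feature encoding, can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not Mahout’s code: the function names, the CRC32-based hashing, the bucket count and the learning rate are all chosen just for the example.

```python
import math
import zlib


def hash_features(features, num_buckets=32):
    """Encode {feature name: value} into a fixed-size vector without a dictionary."""
    vector = [0.0] * num_buckets
    for name, value in features.items():
        # The bucket index comes from a hash of the name, so no dictionary is built.
        index = zlib.crc32(name.encode("utf-8")) % num_buckets
        vector[index] += value
    return vector


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, num_buckets=32, learning_rate=0.1, epochs=200):
    """Train logistic regression weights with one gradient step per example."""
    weights = [0.0] * num_buckets
    for _ in range(epochs):
        for features, label in examples:
            x = hash_features(features, num_buckets)
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            error = label - prediction
            for i, xi in enumerate(x):
                if xi:
                    weights[i] += learning_rate * error * xi
    return weights


def predict(weights, features):
    """Predicted probability of the positive class for one example."""
    x = hash_features(features, len(weights))
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))
```

Because every example updates the weights immediately, the model can be trained on a stream of data without holding the whole training set in memory, which is what makes this approach attractive for online learning. Note that distinct feature names may share a bucket; with enough buckets this is usually an acceptable trade-off.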

Math

There has been a lot of cleanup done in the math module (you can check details in the Cleanup Math discussion on the ML), lots of it related to the untested Colt framework integration (and deprecated code in the Colt framework). The discussion resulted in several pieces of the Colt framework getting promoted to a tested status (QRdecomposition, in particular).

Classification

In addition to speedups and bug fixes, the main new features in classification are new classifiers (new classification algorithms) and more open and uniform input data formats (vectors). The most important changes are:

New SGD classifier
Experimental new type of Naive Bayes classifier (using vectors) and feature reduction options for the existing Naive Bayes classifier (variable length coding of vectors)
New VectorModelClassifier allows any set of clusters to be used for classification (clustering as input for classification)
A random forest can now be saved and used to classify new data. Read more on how to build a random forest and how to use it to classify new cases on this dedicated wiki page.

Recommendation Engine

The most important changes in this area are related to distributed similarity computations which can be used in Collaborative Filtering (or other areas like clustering, for example). An implementation of a Map-Reduce job which computes item-item similarities for item-based Collaborative Filtering, based on the algorithm suggested in Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce, can be found in MAHOUT-362. A generalization of the algorithm, based on the mailing list discussion, led to an implementation of a Map-Reduce job which computes pairwise similarities of the rows of a matrix using a customizable similarity measure (with implementations already provided for Cooccurrence, Euclidean Distance, Loglikelihood, Pearson Correlation, Tanimoto coefficient, Cosine). More on the distributed version of any item similarity function (which was available in a non-distributed implementation before) can be found in MAHOUT-393. With pairwise similarity computation defined, RecommenderJob has evolved into a fully distributed item-based recommender (the implementation depends on how the pairwise similarities are computed). You can read more on the distributed item-based recommender in MAHOUT-420.
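As a rough, single-machine illustration of what “pairwise similarities of the rows of a matrix with a customizable similarity measure” means, consider the Python sketch below. It is not the Map-Reduce job from Mahout: rows are plain dictionaries of non-zero entries, the function names are ours, and the Tanimoto coefficient is computed in its simple set-based form over non-zero columns.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two sparse rows given as {column: value} dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tanimoto(a, b):
    """Tanimoto coefficient on the sets of non-zero columns of two rows."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def pairwise_similarities(rows, similarity=cosine):
    """Compute similarity(row_i, row_j) for every unordered pair of rows."""
    return {
        (i, j): similarity(rows[i], rows[j])
        for i, j in combinations(sorted(rows), 2)
    }


# Rows are items, columns are users, values are ratings (made-up data).
ratings = {
    "item1": {"u1": 5.0, "u2": 3.0},
    "item2": {"u1": 4.0, "u2": 2.0, "u3": 1.0},
    "item3": {"u3": 5.0},
}
print(pairwise_similarities(ratings, similarity=tanimoto))
```

The all-pairs loop is quadratic in the number of rows, which is exactly why a distributed version becomes necessary once the matrix is large.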

The implementation of distributed operations on very large matrices is very important for a scalable machine learning library that supports large data sets. For example, when a term vector is built from a textual document, term vectors tend to have high dimension. Now, if we consider a term-document matrix where each row represents a term from the document(s) and each column represents a document, we obviously end up with a high dimensional matrix. The same thing occurs in Collaborative Filtering: it uses a user-item matrix containing ratings as matrix values, where each row corresponds to a user and each column represents an item. Again we have a high dimensional matrix that is sparse.

Now, in both cases (term-document matrix and user-item matrix) we are dealing with high matrix dimensionality that needs to be reduced while preserving as much information as possible. Obviously we need some sort of matrix operation that will provide a lower dimensional matrix with the important information preserved. For example, a high dimensional matrix may be approximated in lower dimensions using Singular Value Decomposition (SVD).
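In standard notation (the symbols below are ours, not taken from the Mahout documentation), the SVD-based approximation mentioned above keeps only the k largest singular values of a matrix A of rank r:

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \qquad
\| A - A_k \|_F^{2} = \sum_{i=k+1}^{r} \sigma_i^{2}
```

Here the singular values are ordered so that σ_1 ≥ σ_2 ≥ … ≥ σ_r > 0, and u_i and v_i are the corresponding left and right singular vectors. The last identity says that the approximation error is determined by the discarded singular values, so when they are small, most of the information in A survives in the rank-k matrix A_k.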

It’s obvious that we need some (Java) matrix framework capable of fundamental matrix decompositions. JAMA is a great example of a widely used linear algebra package for matrix operations, capable of SVD and other fundamental matrix decompositions (WEKA, for example, uses JAMA for matrix operations). Operations on high dimensional matrices always require heavy computation, and this places high hardware requirements on any ML production system. This is where Mahout, which features distributed operations on large matrices, should be the production choice for Machine Learning algorithms over frameworks like JAMA, which, although great, cannot distribute its operations.

In a typical recommendation setup users often ‘have’ (used/interacted with) only a few items from the whole item set (which can be very large), which leads to user-item matrices being sparse. Mahout’s (0.4) distributed Lanczos SVD implementation is particularly useful for finding decompositions of very large sparse matrices.

News and Roadmap

All of the new distributed similarity/recommender implementations we analyzed in the previous paragraphs were contributed by Sebastian Schelter and, in recognition of this important work, he was elected as a new Mahout committer.

The book “Mahout in Action”, published by Manning, has reached 15/16 chapters complete and will soon enter final review.

This is all from us for now. Any comments/questions/suggestions are more than welcome and until the next Mahout digest keep an eye on Mahout’s road map for 0.5 or the discussion about what Mahout is missing to become a production stable (1.0) framework. We’ll see you next month – @sematext.

Solr Digest, September 2010

It is a busy time of year here at Sematext – we have 3 different presentations to prepare for 3 different conferences (2 down, 1 more to go!), so we’re a bit late with our digests. Nevertheless, we managed to compile a list of interesting topics in the Solr world:

Already committed functionality

Solr was upgraded to use Tika 0.7 – SOLR-1819 – the fix was applied to 1.4.2, 3.1 and 4.0 versions. Of course, Tika 0.8 is going to happen in the not very distant future.
If you’re still using old rsync based replication and have a need to throttle transfer rate, have a look at a patch contributed in JIRA issue SOLR-2099. Unfortunately, if you’re using 1.4 Java based replication, there is currently no way to throttle replication.
If you are using new spatial capabilities in Solr, you might have noticed some incorrect calculations. One of them is fixed – Spatial filter is not accurate – on 3.1 and 4.0 branches.
Another minor but useful addition – function queries can now be defined in terms of other request parameters. Check JIRA issue “full parameter dereferencing for function queries”. It is already implemented in 3.1 and 4.0 and is ready to be used. Here is a short example from JIRA (check how the add function is defined and note the v1 and v2 request parameters):

http://localhost:8983/solr/select?defType=func&fl=id,score&q=add($v1,$v2)&v1=mul(2,3)&v2=10

Can we say, Solr Calculator, eh?
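If you want to build such requests programmatically, the Python sketch below assembles the same URL as in the example above using only the standard library. The helper name function_query_url is hypothetical, and the host and port are simply the ones from the example. Since add($v1,$v2) is dereferenced to add(mul(2,3),10), every returned document should get a score of 16.

```python
from urllib.parse import urlencode


def function_query_url(base_url, function, **params):
    """Build a Solr function query URL; extra keyword arguments become
    request parameters that the function can reference as $name."""
    query = {"defType": "func", "fl": "id,score", "q": function}
    query.update(params)
    # Keep '$', '(', ')' and ',' readable instead of percent-encoding them.
    return base_url + "?" + urlencode(query, safe="$(),")


url = function_query_url(
    "http://localhost:8983/solr/select",
    "add($v1,$v2)",
    v1="mul(2,3)",
    v2="10",
)
print(url)
```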

Interesting functionalities in development for some time

Ever wanted to add some custom fields to a response, although they were not stored in your Solr index? You could always create a custom response writer which would add those fields (although it would probably be a “dirty” copy of one of Solr’s existing response writers). However, we all know that it doesn’t sound like the right way to code. One JIRA issue might deliver a correct way some day – Allow components to add fields to outgoing documents. We say “some day”, since this functionality has been in development for quite some time now and, although it has some patches (currently non-functional, it seems), is probably not very near being completed. But it will be handy to have once it’s done.

Interesting new functionalities

Highlighter could get one frequently requested improvement – Highlighter fragement/formatter for returning just the matching terms – we believe this will be a useful addition, although we don’t expect it very soon.
One potentially useful feature for all of you who use HDFS – DIH should be able read data directly from HDFS for indexing. This issue already contains some working code, although it is a question if the fix will become a part of standard Solr distribution. Still, if you’re using Solr 1.4.1 and you have data in HDFS that you want to index with Solr, have a look at this contribution.
Another improvement related to replication is in SOLR-2117 – Allow slaves to replicate at different times. This should be useful to anyone who has long (and therefore heavy) warmup periods on their slaves after replication. This way, you can have your slaves replicate at different times and, at the time of replication, just take the replicating slave offline (to avoid degradation of response times). Be careful though, there is a downside: for some time (limited, but still), your slaves will serve different data. A patch is available for the 4.0 version.

Miscellaneous

Some more information about current Solr branches, future versions, etc can be found in these ML threads: “Version stability [was: svn branch issues]” and “Solr 3.1”.
Have you recently asked yourself a question like this: morelikethis – “stored=true” is necessary? You might find the answer in this thread useful – check what this link has to say about it.
One extremely useful thread you’ll want to keep in your bookmarks (and read it from time to time) – Tuning Solr caches with high commit rates (NRT)

So, we had a little bit of everything from Solr this month. Until late October (or the start of November), when the new Solr Digest arrives, stay tuned to @sematext, where we tweet other interesting stuff on a wider set of topics from time to time.