Announcing Scalable Performance Monitoring (SPM) for JVM

Up until now, SPM existed in several flavors for monitoring Solr, HBase, ElasticSearch, and Sensei. Besides metrics specific to a particular system type, all of these SPM flavors also monitor OS and JVM statistics.  But what if you want to monitor any Java application?  Say a custom Java application that runs in a container, in an application server, or from the command line?  You don’t really want to be forced to look at blank graphs that are meant for stats from one of the above-mentioned systems.  This was one of our own itches, and we figured we were not the only ones craving to scratch it, so we put together a flavor of SPM for monitoring just the JVM and Operating System metrics.

Now SPM lets you monitor OS and JVM performance metrics of any Java process through the following 5 reports, along with all other SPM functionality, like integrated Alerts, email Subscriptions, etc.  If you are one of the many existing SPM users, these graphs should look very familiar.

JVM: heap, thread stats

We are not including it here, but the JVM report includes an additional and valuable Garbage Collection graph if you are using Java 7.

Garbage Collection: collection time & count

CPU & Memory: CPU stats breakdown, system load, memory stats breakdown, swap

Disk: I/O rates, disk space used & free

Network: I/O rates

To start monitoring, you need a valid Sematext Apps account, which you can get free of charge here. After that, define a new SPM JVM System, download the installation package, and proceed with the installation.  The whole process should take around 10 minutes (assuming you are not multitasking or suffering from ADD, that is).

The installation process is as simple as always and is described on the installer download page. Once the installation is done, the monitor is enabled by adding just the following to the command line of the Java process/application you want to monitor:

-Dcom.sun.management.jmxremote \
-javaagent:/spm/spm-monitor/lib/spm-monitor-jvm-1.6.0-withdeps.jar=/spm/spm-monitor/conf/spm-monitor-config-YourSystemTokenHere-default.xml

For example, if my application is com.sematext.Snoopy I could run it with SPM parameters as shown here:

java -Dcom.sun.management.jmxremote -javaagent:/spm/spm-monitor/lib/spm-monitor-jvm-1.6.0-withdeps.jar=/spm/spm-monitor/conf/spm-monitor-config-YourSystemTokenHere-default.xml com.sematext.Snoopy

Once you are finished with the installation, the stats should start to appear in SPM within just a few minutes.

Happy monitoring!

@sematext

Solr Digest, April 2010

Another month is almost over, so it is time for our regular monthly Solr Digest. This time we’ll focus on interesting JIRA issues, so let’s start:

  • Issue SOLR-1860 intends to improve stopwords list handling in Solr, based on recent Lucene’s stopwords lists additions to all language analyzers. The work hasn’t started just yet (there are no patches to try), so we’ll need to be patient before actually using it.
  • Ever had problems with HTTP authentication in a distributed Solr environment? Until now, it worked only when querying a single Solr server. JIRA issue SOLR-1861 solves such problems and allows specification of credentials for each shard, while in the absence of credential info it falls back to the default functionality (no credentials). The patch is already attached to the issue and can be used with Solr 1.4.
  • If you have used Solr’s MoreLikeThisComponent, you may have noticed that its output lacks any info explaining why it recommended a particular item. The patch in issue SOLR-860 deals with that and improves the MLT Component by adding debug info, like this (copied from JIRA):

"debug":{
"moreLikeThis":{
"IW-02":{
"rawMLTQuery":"",
"boostedMLTQuery":"",
"realMLTQuery":"+() -id:IW-02"},
"SOLR1000":{
"rawMLTQuery":"",
"boostedMLTQuery":"",
"realMLTQuery":"+() -id:SOLR1000"},
"F8V7067-APL-KIT":{
"rawMLTQuery":"",
"boostedMLTQuery":"",
"realMLTQuery":"+() -id:F8V7067-APL-KIT"},
"MA147LL/A":{
"rawMLTQuery":"features:2 features:0 features:lcd features:x features:3",
"boostedMLTQuery":"features:2 features:0 features:lcd features:x features:3",
"realMLTQuery":"+(features:2 features:0 features:lcd features:x features:3) -id:MA147LL/A"}},

This issue is marked to be included in Solr 3.1.

  • If you ever got a requirement like “some users should be able to access these documents while being forbidden to access some others”, Solr wasn’t able to help you much. Recently, document-level security has been the subject of 2 JIRA issues. In SOLR-1834 you can find a patch which is already running in a production environment, while another approach to the same problem (also with an attached patch) is presented in SOLR-1872 (the latter currently adds security only on select queries; delete is not supported yet).
  • SolrCloud brings exciting new capabilities to Solr, some of them already mentioned in our Solr Digest posts (for instance, check Solr Digest January 2010). SolrCloud functionality is getting committed to trunk; you can monitor the progress in SOLR-1873.  This is big!
  • When working with Solr, you have to explicitly configure it to lowercase indexed tokens and query strings, so that uppercased versions of words match their lowercase versions (for instance, so that the query Sematext matches SEMATEXT, sematext, and Sematext). However, there is an old JIRA issue, SOLR-219, designated to be fixed in Solr 1.5, which would automatically make Solr smart enough for searches to be case-insensitive.
  • One common source of confusion for first-time Solr users is dismax and its relation to the default query operator defined in schema.xml. In reality, the default query operator has no effect on how dismax works. Also, with dismax you can’t directly use the AND and OR operators, but you can achieve similar functionality with dismax’s mm (minimum should match) parameter. Its default value is 100% (meaning that all clauses must match, which is equivalent to using the AND operator between all clauses). If you want OR-operator functionality, you would simply set its value to 1 (meaning one matching clause is enough). The confusion arises from the fact that even if your default query operator in schema.xml is OR, dismax by default behaves as if it were AND. Issue SOLR-1889 should deal with that and derive the default mm value from the default query operator in schema.xml, which will make Solr behave more consistently for new users.
  • Another old JIRA issue, SOLR-571, got its first patch a few days ago. The patch allows autowarmCount values to be specified as percentages of the cache size (for instance, 50% means only the top half of cached queries gets autowarmed) instead of as an absolute amount.
  • Solr 1.4 introduced ClusteringComponent, which can cluster search results and documents. Through plugins, it allows any clustering engine to be implemented. One such engine was recently unveiled: lsa4solr, which is based on Latent Semantic Analysis. This engine depends on a development version of Solr 1.3 and Clojure 1.2, so take a look if you’re interested in clustering.
  • And last, but not least, for all Solr enthusiasts, an interesting webinar is on schedule for April 29th: “Practical Search with Solr: Beyond just looking it up”. You can find more about it here.
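
To make the dismax mm behavior mentioned above concrete, here is a sketch of two queries (hypothetical host, core, and field names); the first behaves like AND between all clauses, the second like OR:

```
# mm=100% (the default): all clauses must match, i.e. AND-like behavior
http://localhost:8983/solr/select?defType=dismax&qf=title+body&q=solr+digest&mm=100%25

# mm=1: a single matching clause is enough, i.e. OR-like behavior
http://localhost:8983/solr/select?defType=dismax&qf=title+body&q=solr+digest&mm=1
```

mm also accepts more nuanced values (e.g. “2&lt;75%”), so you are not limited to strict AND/OR semantics.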

Remember, you can also follow us on Twitter: @sematext.  Until next month!

Solr Digest, March 2010

As you probably already know, there were some big changes in the Lucene/Solr world. They were already mentioned in the Lucene Digest post for March, but here we’ll summarize the changes related to Solr:

  • Lucene and Solr are now merged in svn, and the repositories have already been created (you can check the interesting discussion in this thread)
  • The next Solr release could have a new major version number. There are conflicting opinions on whether the version should be called 1.5, 2.0, or 3.1 (following the name of the next Lucene release).  We are putting our money on it being version 2.0
  • Solr will soon move to Java 6
  • Deprecations in Solr code will finally be removed, which means the transition to the new version will not be as easy for users as it was from version 1.3 to 1.4
  • Solr and Lucene will not be changing their names; they remain two separate products

One interesting addition to the Solr world is the new iPhone app called SolrStation. If you ever need to remotely administer your Solr installation and you own an iPhone or Android-based phone, here is a tool you might find useful. It allows administrators to:

  • Administer Solr installs
  • Manage data imports
  • Control replication

One thing to be careful about is security: what happens when you lose your phone? The “lucky” finder will have access to your Solr installation.  After some back-and-forth with Chris of SolrStation, Chris agreed that additional security in the form of a pass code would be a good addition and is putting it on the roadmap.  Soon, we’ll have a special post dedicated solely to SolrStation, to cover this interesting product in more detail.

Solr just got integrated with Grails in the form of a plugin. This plugin integrates the Grails domain model with Solr and aims to provide the functionality of the already existing Grails Searchable plugin, plus some more. It is still in its infancy, so it is missing some key features, like returning a list of domain objects in the result set, but integration with Solr should improve Grails searching with capabilities like clustering, replication, scalability, facets…

Another interesting addition to Solr is the Zoie plugin. It provides real-time update functionality for Solr 1.4. We’ll have a special post on Zoie, so we won’t go into details here.

In Solr Digest February 2010, we covered the RSolr scripting client for Ruby. This time we suggest looking at Solr Flux. This command-line interface for Solr speaks SQL, so if you ever wanted to use SQL-like syntax to insert or delete documents or run a Solr query, this tool will let you do exactly that.

If you have a good knowledge of Solr and Mahout and want to contribute to Solr, check this thread; there is an opportunity with Google Summer of Code.

Thank you for reading.  You can also follow us on Twitter – @sematext.

Solr Digest, February 2010

This second installment of Solr Digest (see Solr January Digest) will cover 8 topics, some of which are quite new and some with a very long history (and a still uncertain future).

So, here we go:

1. solr.ISOLatin1AccentFilterFactory is a commonly used filter which replaces accented characters in the ISO Latin 1 charset with their unaccented versions (for instance, ‘à’ is replaced with ‘a’). However, the underlying Lucene filter, ISOLatin1AccentFilter, is already deprecated in favor of ASCIIFoldingFilter in Lucene 2.9 (BTW, the Solr 1.4 release uses Lucene 2.9.1, while trunk with the future Solr 1.5 uses Lucene 2.9.2) and has been deleted from Lucene 3.0. Of course, Solr already has a filter factory for the replacement, solr.ASCIIFoldingFilterFactory, so it would probably be wise to start using it in your Solr schemata if you are still using the old ISOLatin1AccentFilter. Functionality-wise, there are no differences between these two filters, except that ASCIIFoldingFilter covers a superset of ISO Latin 1, meaning it converts everything ISOLatin1AccentFilter was converting, and more.
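
Switching is a small change in schema.xml; here is a minimal fieldType sketch (the type name and tokenizer choice are just for illustration):

```
<fieldType name="text_folded" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- replaces the deprecated solr.ISOLatin1AccentFilterFactory -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Remember to reindex after changing the analyzer chain, so that already-indexed tokens get folded the same way as query-time tokens.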

2. DataImportHandler became multithreaded – after being filled with various functionalities over time, DataImportHandler got a performance boost. Your multicore servers will be happy to try it :). As part of JIRA issue SOLR-1352, the patch was created and committed to trunk, so you can expect this functionality in the Solr 1.5 release, or you can already try it with one of the Solr 1.5 nightly builds.

3. Script-based UpdateRequestProcessorFactory – one very interesting feature still in development (JIRA issue SOLR-1725) is support for a script-based UpdateRequestProcessorFactory. It will depend on the Java 6 script engine support (so Java 5 based Solr installations will not benefit here, although an upgrade to Java 6 is definitely recommended) and will be very easy to use. The scripts will have to be placed under the SOLR_HOME/conf directory and their names will be defined in solrconfig.xml, like this:


<updateRequestProcessorChain name="script">
  <processor>
    <str name="scripts">updateProcessor.js</str>
    <lst name="params">
      <bool name="boolValue">true</bool>
      <int name="intValue">3</int>
    </lst>
  </processor>
</updateRequestProcessorChain>

Implementations will also be simple; here is an example of updateProcessor.js (copied from the patch which brings this functionality):


function processAdd(cmd) {
  functionMessages.add("processAdd1");
}

function processDelete(cmd) {
  functionMessages.add("processDelete1");
}

function processMergeIndexes(cmd) {
  functionMessages.add("processMergeIndexes1");
}

function processCommit(cmd) {
  functionMessages.add("processCommit1");
}

function processRollback(cmd) {
  functionMessages.add("processRollback1");
}

function finish() {
  functionMessages.add("finish1");
}

4. Similar to the SolrJ API for communicating with Solr, there are numerous Solr clients for other languages, especially the dynamic scripting languages. As with all scripting languages, one of the main advantages over pure Java is simplicity and development speed: you just write a few lines of code and immediately run the script, with no need for compiling. At Sematext we find them especially handy when making changes to Solr installations, for quickly testing whether Solr behaves as we expect.  One excellent solution for all Ruby lovers is RSolr.  Coincidentally, RSolr will be covered in the upcoming Solr in Action.

5. Field Collapsing – this is a very frequently needed feature, but one without a satisfactory solution in Solr. The functionality has a long history: everything started while Solr was at version 1.3, with issue SOLR-236. It was never committed to svn, so you basically had to pick one of the many patches available in JIRA and apply it to your distribution. Since Solr was constantly developing, patches would pretty quickly become obsolete, so new versions would be created. Even when you found the correct patch for your Solr version, you would get occasional errors, so this surely wasn’t good enough for enterprise customers.

Recently, renewed effort has been invested into this issue, and there are plans for the feature to finally be included in Solr 1.5. However, the current implementation still isn’t good enough; there are OutOfMemory reports from some users, so it seems we’ll have to wait some more to get an enterprise-quality “field collapsing” solution in Solr.

In light of the problems with the SOLR-236 solution, a new JIRA issue, SOLR-1773, was created. Its goal is to provide a “lightweight” implementation of this feature. There is already a patch containing this implementation, along with some measurements which show the approach has potential, but it still isn’t ready for serious deployments. The same approach is also implemented in SOLR-1682.

As you can see, work to provide field collapsing is underway, but we’re still some time away from committed code.

6. SystemStatsRequestHandler – designed to provide the statistics from stats.jsp to clients which access Solr through APIs like SolrJ or RSolr, it is being developed as JIRA issue SOLR-1750. It is slated to be included in Solr 1.5, but for now it is available as a Java class attached to the issue. Before it is committed to svn, it might get another name.

7. While Lucene just saw its 2.9.2 and 3.0.1 releases, Solr trunk already has the latest Lucene 2.9.*, as described in this thread.

8. We’ve saved the best for last.  If you could have one feature in Solr… Check out this informative thread to see what people want from Solr that Solr doesn’t already have.  What do you want from Solr? Post your Solr desires in comments.

Solr Digest, January 2010

Similar to our Lucene Digest post, we’ll occasionally cover recent developments in the Solr world. Although Solr 1.4 was released only two months ago, work on new features for 1.5 (located in the svn trunk) is in full swing:

1) GeoSpatial search features have been added to Solr. The work is based on the Local Lucene project, donated to the Lucene community a while back and converted into a Lucene contrib module. The features to be incorporated into Solr will allow one to implement things like:

  • filter by a bounding box – find all documents that match some specific area
  • calculate the distance between two points
  • sort by distance
  • use the distance to boost the score of documents (this is different from sorting and would be used when you want some other factors to affect the ordering of documents)

The main JIRA issue is SOLR-773, and it is scheduled to be implemented in Solr 1.5.
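
The exact query syntax was still being settled in SOLR-773 at the time of writing, but the spatial filtering this work leads to looks roughly like the following sketch (hypothetical host; field name and coordinates are made up):

```
# Filter to documents whose "store" location lies within 5 km of the given point,
# then sort the results by distance from that point:
http://localhost:8983/solr/select?q=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=5&sort=geodist()+asc
```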

2) Solr Autocomplete component – autocomplete functionality is a very common requirement. Solr itself doesn’t have a component which provides such functionality (unlike spellchecking, which SpellcheckComponent provides, although in limited form; more on this in another post or, if you are curious, have a look at the DYM ReSearcher). There are a few common approaches to solving this problem, for instance, using the recently added TermsComponent (for more information, check this excellent showcase by Matt Weber).  These approaches, however, have some limitations. The first limitation that comes to mind is spellchecking and correction of misspelled words. You can see such a feature in Firefox’s Google bar: if you type “Washengton”, you’ll still get “Washington” offered, while the TermsComponent (and some other) approaches fail here.

The aim of SOLR-1316 is to provide an autocomplete component in Solr out of the box. Since it is still in development, you can’t quite rely on it yet, but it is scheduled to be released with Solr 1.5. In the meantime, you can check another Sematext product which offers AutoComplete functionality with a few more advanced features. It is a constantly developing product whose features have been heavily based on real-life customer feedback.  It uses an approach unlike that of TermsComponent and is, therefore, faster. Also, one of the features currently in development (and soon to be released) is spell correction of queries.
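
For comparison, the TermsComponent approach mentioned above boils down to a simple prefix lookup; a sketch (hypothetical host and field name, and TermsComponent must be registered in solrconfig.xml):

```
# Suggest up to 10 indexed terms from the "name" field that start with "wash";
# a misspelled prefix like "washe" would simply return nothing:
http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=wash&terms.limit=10
```

This is exactly why it cannot offer “Washington” for “Washengton”: prefix matching has no notion of spelling similarity.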

3) One very nice addition in Solr 1.5 is edismax, or Extended Dismax. Dismax is very useful for situations where you want to let searchers just enter a few free-text keywords (think Google), without field names, logical operators, etc.  Extended Dismax is contributed thanks to Lucid Imagination.  Here are some of its features:

  • Supports the full Lucene query syntax in the absence of syntax errors
  • Supports “and”/“or” to mean “AND”/“OR” in Lucene syntax mode
  • When there are syntax errors, improved smart partial escaping of special characters is done to prevent them; in this mode, fielded queries, +/-, and phrase queries are still supported
  • Improved proximity boosting via word bi-grams; this prevents the problem of needing 100% of the words in the document to get any boost, as well as of having all of the words in a single field
  • Advanced stopword handling: stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part; if a query consists of all stopwords (e.g. “to be or not to be”), then all will be required
  • Supports the “boost” parameter, like the dismax bf param, but multiplies the function query instead of adding it in
  • Supports pure negative nested queries, so a query like +foo (-foo) will match all documents

You can check the development in JIRA issue SOLR-1553.
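
Here is a sketch of what a couple of the features above could look like in practice once edismax lands (hypothetical host and field names):

```
# Lowercase "and" is treated as the AND operator, and full Lucene syntax is allowed:
http://localhost:8983/solr/select?defType=edismax&qf=title+body&q=ipod+and+video

# The multiplicative "boost" parameter (unlike the additive dismax bf):
http://localhost:8983/solr/select?defType=edismax&qf=title+body&q=ipod&boost=log(popularity)
```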

4) Up until version 1.5, distributed Solr deployments depended not only on Solr features (like replication and sharding), but also on external systems, like load balancers. Now, ZooKeeper is used to provide a Solr-specific naming service (check JIRA issue SOLR-1277 for the details). The features we’ll get look exciting:

  • Automatic failover (i.e., when a server fails, clients stop trying to index to or search it and use a different server)
  • Centralized configuration management (i.e. new solrconfig.xml or schema.xml propagate to a live Solr cluster)
  • Optionally allow shards of a partition to be moved to another server (i.e. if a server gets hot, move the hot segments out to cooler servers).

We’ll cover some of these topics in more detail in future installments. Thanks for reading!