ZooKeeper Poll Results

We’ve collected 50 votes in our ZooKeeper Usage Poll over the last few days. Here are the results so far:

66% of people use ZooKeeper directly
Another 16% use ZooKeeper indirectly
18% do not use ZooKeeper at all

This puts total ZooKeeper usage (direct plus indirect) at 82%. But:

Direct ZooKeeper usage at 66% seems a little high, and indirect usage at only 16% doesn’t feel quite right. ZooKeeper is used by Hadoop, HBase, SolrCloud, Kafka, Storm, and so many other popular distributed systems that one would expect indirect usage to be much higher than direct usage.
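
As a rough check on how much weight 50 votes can carry, the sketch below converts the percentages back into vote counts and computes an approximate 95% margin of error. It assumes a simple random sample and the normal approximation, neither of which an open, tweet-driven poll really satisfies, so the real uncertainty is probably larger. The function names are illustrative only.

```python
import math

TOTAL_VOTES = 50  # votes collected so far, per the post
SHARES = {"direct": 0.66, "indirect": 0.16, "none": 0.18}

def vote_count(share, total=TOTAL_VOTES):
    """Convert a reported share back into a whole number of votes."""
    return round(share * total)

def margin_of_error(share, total=TOTAL_VOTES, z=1.96):
    """Approximate 95% margin of error (normal approximation).

    Assumes a simple random sample, which an open online poll is not.
    """
    return z * math.sqrt(share * (1 - share) / total)

for answer, share in SHARES.items():
    print(f"{answer}: {vote_count(share)} votes, "
          f"{share:.0%} +/- {margin_of_error(share):.0%}")
```

With only 50 votes, the 66% figure carries a margin of roughly 13 percentage points either way, so the split between direct and indirect usage could shift noticeably as more votes come in.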

What’s your take on these numbers?

Announcement: ZooKeeper Performance Monitoring in SPM

You don’t see him, but he is present. He is all around us. He keeps things running. No, we are not talking about Him, nor about The Force. We are talking about Apache ZooKeeper, the under-appreciated, rarely talked-about, yet super-critical component of almost all distributed systems we’ve come to rely on – Hadoop, HBase, Solr, Kafka, Storm, and so on. Our SPM, Search Analytics, and Logsene all use ZooKeeper, and we are not alone – check our ZooKeeper poll.

We’re happy to announce that SPM can now monitor Apache ZooKeeper! This means everyone using SPM to monitor Hadoop, HBase, Solr, Kafka, Sensei, and other applications that rely on ZooKeeper can now use the same monitoring and alerting tool – SPM – to monitor their ZooKeeper instances.

Here’s a glimpse into what SPM for ZooKeeper provides – click on the image to see the full view or look at the actual SPM live demo:

Please tell us what you think – @sematext is always listening! Is there something SPM doesn’t monitor that you would really like to monitor? Please vote for tech to monitor!

Want to build highly distributed big data apps with us? We’re hiring good engineers (not just for positions listed on our jobs page), and we’re sitting on a heap of some pretty juicy big data!

Poll: Are You Using ZooKeeper?

In the last decade the world of distributed computing has exploded, and Apache ZooKeeper is often at the center of it, which is why we just added ZooKeeper monitoring to SPM. Let’s see what percentage of us use ZooKeeper.

Please tweet so we can collect a large number of votes and get a statistically representative sample.

Hadoop Digest, August 2010

The biggest announcement of the year: Apache Hadoop 0.21.0 has been released and is available for download here. Over 1300 issues have been addressed since 0.20.2; you can find details for Common, HDFS and MapReduce. A note from Tom White, who did an excellent job as release manager: “Please note that this release has not undergone testing at scale and should not be considered stable or suitable for production. It is being classified as a minor release, which means that it should be API compatible with 0.20.2.” Please find a detailed description of what’s new in the 0.21.0 release here.

Community trends & news:

A new branch, hadoop-0.20-security, is being created. Apart from the security features, which are in high demand, it will include improvements and fixes from over 12 months of work by Yahoo!. The new security features are going to be a very valuable and welcome contribution (also discussed before).
A thorough discussion of approaches to backing up HDFS data can be found in this thread.
Hive voted to become a Top Level Apache Project (TLP) (also here). Note that we’ll keep Hive under Search-Hadoop.com even after Hive goes TLP.
Pig voted to become a TLP too (also here). Note that we’ll keep Pig under Search-Hadoop.com even after Pig goes TLP.
Tip: if you define a Hadoop object (e.g. Partitioner) as implementing Configurable, then its setConf() method will be called once, right after it gets instantiated.
For those new to ZooKeeper and pressed for time, here you can find the shortest ZooKeeper description — only 4 sentences short!
A good read: the “Avoiding Common Hadoop Administration Issues” article.

Notable efforts:

Howl: Common metadata layer for Hadoop’s MapReduce, Pig, and Hive (yet another contribution from Yahoo!)
PHP library for Avro: includes schema parsing, Avro data file and string IO.
avro-scala-compiler-plugin: aims to auto-generate Avro serializable classes based on some simple case class definitions

FAQ:

How to programmatically determine the names of the files in a particular Hadoop/HDFS directory?
Use the FileSystem and FileStatus APIs. Detailed examples are in this thread.
How to restrict HDFS space usage?
Please refer to the HDFS Quotas Guide.
How to pass parameters determined at run-time (i.e. not hard-coded) to Hadoop objects (like Partitioner, Writable, etc.)?
One option is to define a Hadoop object as implementing Configurable. In this case its setConf() method will be called once, right after it gets instantiated, and you can use the “native” Hadoop configuration to pass the parameters you need.
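
The pattern in the last answer is easier to see in miniature. The sketch below is plain Python, not Hadoop code: `Configurable`, `new_instance`, and `ModuloPartitioner` are hypothetical stand-ins that mimic a framework creating an object through its no-argument constructor and then calling the equivalent of setConf() once with the job configuration. The key `demo.partition.modulus` is made up for the example.

```python
class Configurable:
    """Hypothetical stand-in for the idea behind Hadoop's Configurable."""

    def set_conf(self, conf):
        raise NotImplementedError


class ModuloPartitioner(Configurable):
    """Illustrative partitioner whose modulus is only known at run time."""

    def __init__(self):
        # The framework builds the object with no arguments,
        # so nothing can be passed in through the constructor.
        self.modulus = None

    def set_conf(self, conf):
        # Called once by the framework, right after instantiation.
        self.modulus = int(conf["demo.partition.modulus"])  # made-up key

    def get_partition(self, key):
        return key % self.modulus


def new_instance(cls, conf):
    """Mimic the framework: instantiate, then configure if supported."""
    obj = cls()
    if isinstance(obj, Configurable):
        obj.set_conf(conf)
    return obj


# The job driver puts run-time parameters into the configuration...
job_conf = {"demo.partition.modulus": "4"}
# ...and the framework hands them to the object it creates.
partitioner = new_instance(ModuloPartitioner, job_conf)
print(partitioner.get_partition(10))  # 10 % 4 == 2
```

In a real job the same roles are played by the job’s configuration object and the framework’s own instantiation code; the point of the sketch is only that the object never needs constructor arguments.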

Hadoop Digest, May 2010

Big news: HBase and Avro have become Apache Top Level Projects (TLPs)! The initial discussion happened when our previous Hadoop Digest was published, so you can find links to the threads there. The question of whether or not to become a TLP caused some pretty heated debates in the Hadoop subprojects’ communities. You might find it interesting to read the discussions of the vote results for HBase and ZooKeeper. Chris Douglas was kind enough to sum up the Hadoop subprojects’ responses to becoming TLPs in his post. We are happy to say that all subprojects that became TLPs are still fully searchable via our search-hadoop.com service.

More news:

Great news: Google granted a MapReduce patent license to Hadoop.
The Chukwa team announced the release of Chukwa 0.4.0, their second public release. This release fixes many bugs, improves documentation, and adds several more collection tools, such as the ability to collect UDP packets.
HBase 0.20.4 was released. More info in our May HBase Digest!
A new Chicago-area Hadoop User Group was organized.

Good-to-know nuggets shared by the community:

Dedicate a separate partition to Hadoop file space – do not use the “/” (root) partition. Setting the dfs.datanode.du.reserved property is not enough to limit the space used by Hadoop, since it limits only HDFS usage, not MapReduce’s.
Cloudera’s Support Team shares some basic hardware recommendations in this post. Read more on properly allocating and accounting for RAM for specific parts of the system (and thus avoiding swapping) in this thread.
Find a couple of pieces of advice about how to save seconds when you need a job to be completed in tens of seconds or less in this thread.
Use Combiners to increase performance when the majority of Map output records have the same key.
Useful tips on how to implement Writable can be found in this thread.
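
To see why a Combiner helps when most map output shares a few keys, here is a small plain-Python simulation. It is not Hadoop code; `map_words`, `combine`, and the sample data are illustrative. It counts how many records would have to leave a single map task with and without local pre-aggregation.

```python
from collections import Counter

def map_words(lines):
    """Map phase: emit one (word, 1) record per word."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(records):
    """Combiner: sum values per key locally, before the shuffle."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return sorted(totals.items())

# Sample input for one map task; most records share the key "error".
lines = ["error error warn", "error error error", "info error"]

map_output = map_words(lines)
combined = combine(map_output)

print(len(map_output), "records without a combiner")
print(len(combined), "records with a combiner")
print(combined)
```

The reducer ends up with the same totals either way. This only works because summation is associative and commutative, which is the usual condition for reusing reducer-style logic as a combiner.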

Notable efforts:

Cascalog: Clojure-based query language for Hadoop inspired by Datalog.
pomsets: computational workflow management system for your public and/or private cloud.
hiho: a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable

FAQ:

How can I attach external libraries (jars) which my jobs depend on?
You can put them in a “lib” subdirectory of your jar root directory. Alternatively, you can use the DistributedCache API.
How to recommission DataNode(s) in Hadoop?
Remove the hostname from your dfs.hosts.exclude file and run ‘hadoop dfsadmin -refreshNodes’. Then start the DataNode process on the ‘recommissioned’ DataNode again.
How to configure log placement under a specific directory?
You can specify the log directory in the environment variable HADOOP_LOG_DIR. It is best to set this variable in conf/hadoop-env.sh.

Thank you for reading us, and if you are a Twitter addict, you can now follow @sematext, too!

Hadoop Digest, March 2010

Main news first: Hadoop 0.20.2 was released! The list of changes may be found in the release notes here. Related news:

Maven artifacts have been pushed to repository.apache.org.
This version has entered the Debian unstable repository.
Cloudera officially announced the CDH2 release (as well as CDH3 Beta 1).

For the latest insight into the 0.21 release plans, check out this thread and its continuation.

More news on releases:

Pig 0.6.0 is out. This release includes performance and memory usage improvements, a new Accumulator interface for UDFs, and many bug fixes. Release notes available at http://hadoop.apache.org/pig/releases.html.
ZooKeeper 3.3.0 is out. See the announcement and release details.

High availability is one of the hottest topics in Hadoop land nowadays. An umbrella JIRA issue, HDFS-1064, has been created to track discussions and issues related to HDFS NameNode availability. While there are a lot of questions about eliminating the single point of failure, Hadoop developers are more concerned about minimizing downtime (including downtime for upgrades and restart time) than about getting rid of SPOFs, since long downtime is the real pain for those who manage the cluster. There is some work on adding a hot standby that might help with planned upgrades. Please find some thoughts and a bit of explanation on this topic in a thread that started with the question “Why not to consider Zookeeper for the NameNode?”. Next time we see “How Hadoop developers feel about SPOF?” come up on the mailing list, we’ll put it in a special FAQ section at the bottom of this digest. 🙂

We already reported in our latest Lucene Digest (March) that various Lucene projects have started discussions on their mailing lists about becoming Top Level Apache projects. This trend (motivated by the Apache board’s warnings about Hadoop and Lucene becoming umbrella projects) prompted similar discussions in the HBase, Avro, Pig, and ZooKeeper communities as well.

Several other notable items from MLs:

An important note from Todd Lipcon that we’d like to pass on to our readers: avoid upgrading your clusters to Sun JVM 1.6.0u18; stick with 1.6.0u16 for now, which has proved to be very stable. Please read the complete discussion around it here.
Storing custom Java objects in the Hadoop Distributed Cache is explained here.
Here is a bit of explanation of the fsck command output.
Several users shared their experience with issues running Hadoop on a virtualized OS vs. a real OS in this thread.
Those who are thinking about using Hadoop as a base for academic research work (both students and professors) might find a lot of useful links (public datasets, sources for problems, existing research) in this discussion.
Hadoop security features are in high demand among the users and community. Developers will be working hard on deploying authentication mechanisms this summer. You can monitor the progress via HADOOP-4487.

This time a very small FAQ section:

How can I request a larger heap for Map tasks?
By including an -Xmx option in mapred.child.java.opts.
How to configure and use LZO compression?
Take a look at http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/.

Thank you for reading us! Please feel free to provide feedback on the format of the digests or anything else, really.