Hive Digest, March 2011

Welcome to the first Hive digest!

Hive is a data warehouse built on Hadoop. Initially developed by Facebook, it has been under the Apache umbrella for about 2 years and has seen very active development. Last year there were 2 major releases, which introduced loads of features and bug fixes. Now Hive 0.7.0 has just been released and is packed with goodness.

Hive 0.6.0

Hive 0.6.0 was released in October last year. Some of its most interesting features included:

Better skew joins.
Views were added.
Database/schema support was added to Hive QL.
Integration with HBase was added, making it possible to read HBase tables via Hive and to bulk load Hive tables into HBase.
There were multiple improvements making it easier to work with partitions, including multi-partition inserts and archiving of partitions.

Hive 0.7.0

Hive 0.7.0 has just been released! Some of the major features include:

Indexing has been implemented; index types are currently limited to compact indexes. This feature opens up lots of potential for future improvements, such as HIVE-1694, which aims to use indexes to accelerate query execution for GROUP BY, ORDER BY, JOINs and other miscellaneous cases, and HIVE-1803, which will implement bitmap indexing.
Security features have been added with authorisation and authentication.
There is now an optional concurrency model which makes use of ZooKeeper, so tables can now be locked during writes. It is disabled by default, but can be enabled by setting hive.support.concurrency=true in the config.
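
As a rough illustration of the concurrency setting above, here is how it might look in hive-site.xml. The hive.support.concurrency property is the one named in the release notes. The hive.zookeeper.quorum entry and its host names are placeholders assumed for illustration, so adjust them to your own ZooKeeper ensemble.

```xml
<!-- hive-site.xml (sketch, not a complete configuration) -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <!-- Assumed for illustration: placeholder ZooKeeper hosts -->
  <name>hive.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
```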

And many other small improvements including:

Databases are now more useful: you can select across a database.
The Hive command line interface has gotten some love and now supports auto-complete.
There’s now support for HAVING clauses, so users no longer have to write nested queries in order to apply a filter on GROUP BY expressions.

and much more.

You can download Hive 0.7.0 from here and you can follow @sematext on Twitter.

Google Summer of Code and Intern Sponsoring

Are you a student and looking to do some fun and rewarding coding this summer? Then join us for the 2011 Google Summer of Code!

The application deadline is in less than a month! Lucene has identified initial potential projects, but this doesn’t mean you can’t also pick your own. If you need additional ideas, look at our Lucene / Solr for Academia: PhD Thesis Ideas (or just the spreadsheet if you don’t want to read the what and the why). Just be sure to discuss your idea with the community first (send an email to dev@lucene.apache.org).

We should also add that, separately from GSoC, Sematext would be happy to sponsor good students and interns interested in work on projects involving search (Lucene, Solr), machine learning & analytics (Mahout), big data (Hadoop, HBase, Hive, Pig, Cassandra), and related areas. We are a virtual and geographically distributed organization whose members are spread over several countries and continents and we welcome students from all across the globe. For more information please inquire within.

Hadoop Digest, August 2010

The biggest announcement of the year: Apache Hadoop 0.21.0 has been released and is available for download here. Over 1300 issues have been addressed since 0.20.2; you can find details for Common, HDFS and MapReduce. A note from Tom White, who did an excellent job as release manager: “Please note that this release has not undergone testing at scale and should not be considered stable or suitable for production. It is being classified as a minor release, which means that it should be API compatible with 0.20.2.” Please find a detailed description of what’s new in the 0.21.0 release here.

Community trends & news:

A new branch, hadoop-0.20-security, is being created. Apart from the security features, which are in high demand, it will include improvements and fixes from over 12 months of work by Yahoo!. The new security features are going to be a very valuable and welcome contribution (also discussed before).
There is a thorough discussion of approaches to backing up HDFS data in this thread.
Hive has voted to become a Top Level Apache Project (TLP) (also here). Note that we’ll keep Hive under Search-Hadoop.com even after Hive goes TLP.
Pig has voted to become a TLP too (also here). Note that we’ll keep Pig under Search-Hadoop.com even after Pig goes TLP.
Tip: if you define a Hadoop object (e.g. Partitioner) as implementing Configurable, then its setConf() method will be called once, right after it gets instantiated.
For those new to ZooKeeper and pressed for time, here you can find the shortest ZooKeeper description: only 4 sentences long!
A good read: the “Avoiding Common Hadoop Administration Issues” article.

Notable efforts:

Howl: a common metadata layer for Hadoop’s MapReduce, Pig, and Hive (yet another contribution from Yahoo!)
PHP library for Avro: includes schema parsing, plus Avro data file and string IO.
avro-scala-compiler-plugin: aims to auto-generate Avro-serializable classes from simple case class definitions

FAQ:

How to programmatically determine the names of the files in a particular Hadoop/HDFS directory?
Use the FileSystem and FileStatus APIs. Detailed examples are in this thread.
How to restrict HDFS space usage?
Please refer to the HDFS Quotas Guide.
How to pass parameters determined at run-time (i.e. not hard-coded) to Hadoop objects (like Partitioner, Writable, etc.)?
One option is to define a Hadoop object as implementing Configurable. In this case, its setConf() method will be called once, right after it gets instantiated, and you can use the “native” Hadoop configuration to pass the parameters you need.
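
The Configurable pattern described in the last answer can be sketched in plain Java. This is a simplified stand-in, not the Hadoop API itself. The Configurable interface below is a hypothetical miniature of org.apache.hadoop.conf.Configurable and uses a Map where Hadoop passes a Configuration object. OffsetPartitioner and the key sketch.partition.offset are made up for illustration. The newInstance() helper only mimics the step the framework performs for you (in Hadoop this is done by ReflectionUtils.newInstance): create the object, then call setConf() once.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.hadoop.conf.Configurable (assumption: simplified,
// the real interface takes a Configuration object rather than a Map).
interface Configurable {
    void setConf(Map<String, String> conf);
    Map<String, String> getConf();
}

// Hypothetical partitioner-like object whose behaviour depends on a run-time parameter.
class OffsetPartitioner implements Configurable {
    static final String OFFSET_KEY = "sketch.partition.offset"; // made-up key

    private Map<String, String> conf;
    private int offset;

    @Override
    public void setConf(Map<String, String> conf) {
        this.conf = conf;
        // Read the parameter once, when the configuration is handed over.
        String raw = conf.get(OFFSET_KEY);
        this.offset = (raw == null) ? 0 : Integer.parseInt(raw);
    }

    @Override
    public Map<String, String> getConf() {
        return conf;
    }

    int getPartition(int keyHash, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even for negative hashes.
        return Math.floorMod(keyHash + offset, numPartitions);
    }
}

class ConfigurableSketch {
    // Mimics the framework: instantiate, then call setConf() once if applicable.
    static <T> T newInstance(Class<T> cls, Map<String, String> conf) {
        try {
            java.lang.reflect.Constructor<T> ctor = cls.getDeclaredConstructor();
            ctor.setAccessible(true);
            T obj = ctor.newInstance();
            if (obj instanceof Configurable) {
                ((Configurable) obj).setConf(conf);
            }
            return obj;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(OffsetPartitioner.OFFSET_KEY, "3"); // value decided at run time
        OffsetPartitioner p = newInstance(OffsetPartitioner.class, conf);
        System.out.println(p.getPartition(4, 5)); // (4 + 3) mod 5, prints 2
    }
}
```

In a real job the driver would put the value into the job's Configuration before submission, and the object would read it back inside setConf() in the same way.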