HBase Real-time Analytics & Rollbacks via Append-based Updates (Part 2)

This is the second part of a 3-part post series in which we describe how we use HBase at Sematext for real-time analytics with an append-only updates approach.

In our previous post we explained the problem in detail with the help of example and touched on the suggested solution idea. In this post we will go through solution details as well as briefly introduce the open-sourced implementation of the described approach.

Cons

Below are the major drawbacks of the suggested solution. Usually there are ways to reduce their effect on your system depending on the specific case (e.g. by tuning HBase appropriately or by adjusting parameters involved in data compaction logic).

merging on the fly takes time. Properly configuring periodic updates processing is a key to keeping data fetching fast.
when performing compaction, scanning of many records that don’t need to be compacted can happen (already compacted or “alone-standing” records). Compaction can usually be performed only on data written after the previous compaction which allows to use efficient time-based filters to reduce the impact here.

Solving these issues may be implementation-specific. We’ll bring them up again when talking about our implementation in the follow up post.

Implemenation: Meet HBaseHUT

Suggested solution was implemented and open-sourced as HBaseHUT project. HBaseHUT will be covered in the follow up post shortly.

If you like this sort of stuff, we’re looking for Data Engineers!

HBase Real-time Analytics & Rollbacks via Append-based Updates

In this part 1 of a 3-part post series we’ll describe how we use HBase at Sematext for real-time analytics and how we can perform data rollbacks by using an append-only updates approach.

Some bits of this topic were already covered in Deferring Processing Updates to Increase HBase Write Performance and some were briefly presented at BerlinBuzzwords 2011 (video). We will also talk about some of the ideas below during HBaseCon-2012 in late May (see Real-time Analytics with HBase). The approach described in this post is used in our production systems (SPM & SA) and the implementation was open-sourced as HBaseHUT project.

Problem we are Solving

While HDFS & MapReduce are designed for massive batch processing and with the idea of data being immutable (write once, read many times), HBase includes support for additional operations such as real-time and random read/write/delete access to data records. HBase performs its basic job very well, but there are times when developers have to think at a higher level about how to utilize HBase capabilities for specific use-cases. HBase is a great tool with good core functionality and implementation, but it does require one to do some thinking to ensure this core functionality is used properly and optimally. The use-case we’ll be working with in this post is a typical data analytics system where:

new data are continuously streaming in
data are processed and stored in HBase, usually as time-series data
processed data are served to users who can navigate through most recent data as well as dig deep into historical data

Although the above points frame the use-case relatively narrowly, the approach and its implementation that we’ll describe here are really more general and applicable to a number of other systems, too. The basic issues we want to solve are the following:

increase record update throughput. Ideally, despite high volume of incoming data changes can be applied in real-time . Usually. due to the limitations of the “normal HBase update”, which requires Get+Put operations, updates are applied using batch-processing approach (e.g. as MapReduce jobs). This, of course, is anything but real-time: incoming data is not immediately seen. It is seen only after it has been processed.
ability to roll back changes in the served data. Human errors or any other issues should not permanently corrupt data that system serves.
ability to fetch data interactively (i.e. fast enough for inpatient humans). When one navigates through a small amount of recent data, as well as when selected time interval spans years, the retrieval should be fast.

Here is what we consider an “update”:

addition of a new record if no records with same key exists
update of an existing record with a particular key

Let’s take a look at the following example to better understand the problem we are solving.

Example Description

Briefly, here are the details of an example system:

System collects metrics from a large number of sensors (N) very frequently (each second) and displays them on chart(s) over time
User needs to be able to select small time intervals to display on a chart (e.g. several minutes) as well as very large spans (e.g. several years)
Ideally, data shown to user should be updated in real-time (i.e. user can see the most recent state of the sensors)

Note that even if some of the above points are not applicable to your system the ideas that follow may still be relevant and applicable.

Possible “direct” Implementation Steps

The following steps are by no means the only possible approach.

Step 1: Write every data point as a new record or new column(s) in some record in HBase. In other words, use a simple append-only approach. While this works well for displaying charts with data from short time intervals, showing a year (there are about 31,536,000 seconds in one year) worth of data may be too slow to call the experience “interactive”.

Step 2: Store extra records with aggregated data for larger time intervals (say 1 hour, so that 1 year = 8,760 data points). As new data comes in continuously and we want data to be seen in real-time, plus we cannot rely on data coming in a strict order, say because one sensor had network connectivity issues or we want to have ability to import historical data from a new data source, we have to use update operations on those records that hold data for longer intervals. This requires a lot of Get+Put operations to update aggregated records and this means degradation in performance — writing to HBase in this fashion will be significantly slower compared to using the append-only approach described in Step 1. This may slow writes so much that a system like this may not actually be able to keep up with the volume of the incoming data. Not good.

Step 3: Compromise real-time data analysis and process data in small batches (near real-time). This will decrease the load on HBase as we can process (aggregate) data more efficiently in batches and can reduce the number of update (Get+Put) operations. But do we really want to compromise real-time analytics? No, of course not. While it may seem OK in this specific example to show data for bigger intervals with some delay (near real-time), in real-world systems this usually affects other charts/reports, such as reports that need to show total, up to date figures. So no, we really don’t want to compromise real-time analytics if we don’t have to. In addition, imagine what happens if something goes wrong (e.g. wrong data was fed as input, or application aggregates data incorrectly due to a bug or human error). If that happens we will not be able to easily roll back recently written data. Utilizing native HBase column versions may help in some cases, but in general, when we want greater control over rollback operation a better solution is needed.

Use Versions for Rolling Back?

Recent additions in managing cell versions make cell versioning even more powerful than before. Things like HBASE-4071 make it easy to store historical data without big data overhead by cleaning old data efficiently. While it seems obvious to use versions (native HBase feature) for allowing rolling back data changes, we cannot (and do not want to) rely heavily on cell versions here. The main reason for that is that it is just not very effective when dealing with lots of versions for a given cell. When update history for a record/cell becomes very long this requires many versions for a given cell. Versions are managed and navigated as a simple list in HBase (as opposed to using a Map-like structure that is used for records and columns) so managing long lists of versions is less efficient than having a bunch of separate records/columns. Besides, using versions will not help us with Get+Put situation and we are aiming to kill these two birds with one rock with the solution we are about to describe. One could try to use append-only updates approach described below and use cells versions as update log, but this would again bring us to managing long lists in a non-efficient way.

Suggested Solution

Given the example above, our suggested solution can be described as follows:

replace update (Get+Put) operations at write time with simple append-only writes and defer processing of updates to periodic jobs or perform aggregations on the fly if user asks for data earlier than individual additions are processed.

The idea is simple and not necessarily novel, but given the specific qualities of HBase, namely fast range scans and high write throughput, this approach works very well. So well, in fact, that we’ve implemented it in HBaseHUT and have been using it with success in our production systems (SPM & SA).

So, what we gain here is:

high update throughput
real-time updates visibility: despite deferring the actual updates processing, user always sees the latest data changes
efficient updates processing by replacing random Get+Put operations with processing whole sets of records at a time (during the fast scan) and eliminating redundant Get+Put attempts when writing first data item
ability to roll back any range of updates
avoid data inconsistency problems caused by tasks that fail after only partially updating data in HBase without doing rollback (when using with MapReduce, for example)

In part 2 post we’ll dig into the details around each of the above points and we’ll talk more about HBaseHUT, which makes all of the above possible. If you like this sort of stuff, we’re looking for Data Engineers!

Data Engineer Position at Sematext International

If you’ve always wanted to work with Hadoop, HBase, Flume, and friends and build massively scalable, high-throughput distributed systems (like our Search Analytics and SPM), we have a Data Engineer position that is all about that! If you are interested, please send your resume to jobs@sematext.com.

Responsibilities:

Versatile architect and developer – design and build large, high performance,scalable data processing systems using Hadoop, HBase, and other big data technologies
DevOps fan – run and tune large data processing production clusters
Tool maker – develop ops and management tools
Open source participant – keep up with development in areas of cloud and distributed computing, NoSQL, Big Data, Analytics, etc.

Pluses:

solid Math, Statistics, Machine Learning, or Data Mining is not required but is a big plus
experience with Analytics, OLAP, Data Warehouse or related technologies is a big plus
ability and desire to expand and lead a data engineering team
ability to think both business and engineering
ability to build products and services based on observed client needs
ability to present in public, at meetups, conferences, etc.
ability to contribute to blog.sematext.com
active participation in open-source communities
desire to share knowledge and teach
positive attitude, humor, agility, high integrity, and low ego, attention to detail

Location:

New York

Relevant pointers:

Wanted Dead or Alive: Search Engineer with Client-facing Skills

We are on a lookout for a strong Search Engineer with interest and ability to interact with clients and with potential to build and lead local and/or remote development teams. By “client-facing” we really mean primarily email, phone, Skype.

A person in this role needs to be able to:

design large scale search systems
have solid knowledge of either Solr or ElasticSearch or both
efficiently troubleshoot performance, relevance, and other search-related issues
speak and interact with clients

Pluses – beyond pure engineering:

ability and desire to expand and lead a development/consulting teams
ability to think both business and engineering
ability to build products and services based on observed client needs
ability to present in public, at meetups, conferences, etc
ability to contribute to blog.sematext.com
active participation in online search communities
attention to detail
desire to share knowledge and teach
positive attitude, humor, agility

We’re small and growing. Our HQ is in Brooklyn, but our team is spread over 4 continents. If you follow this blog you know we have deep expertise in search and big data analytics and that our team members are conference speakers, book authors, Apache members, open-source contributors, etc. While we are truly international, this particular opening is in New York. Speaking of New York, some of our New York City clients that we are allowed to mention are Etsy, Gilt, Tumblr, Thomson Reuters, Simon & Schuster (more on http://sematext.com/clients/index.html).

Relevant pointers:

If you are interested, please send over some information about yourself, your CV, and let’s talk.

Berlin Buzzwords 2012 – Three Talks from Sematext

Last year was our first time at Berlin Buzzwords. We gave 1 full talk about Search Analytics (video) and 2 lightning talks (video, video). We saw a number of good talks, too. We also took part in a HBase Hackathon organized by Lars George in Groupon’s Berlin offices and even found time to go clubbing. So in hopes of paying Berlin another visit this year, a few of us at Sematext (@sematext) submitted talk proposals. Last week we all got acceptance emails, so this year there will be 3 talks from 3 Sematextans at Berlin Buzzwords! Here is what we’ll be talking about:

Rafał: Scaling Massive ElasticSearch Clusters

This talk describes how we’ve used ElasticSearch to build massive search clusters capable of indexing several thousand documents per second while at the same time serving a few hundred QPS over billions of documents in well under a second. We’ll talk about building clusters that continuously grow in terms of both indexing and search rates. You will learn about finding cluster nodes that can handle more documents, about managing shard and replica allocation and prevention of unwanted shard rebalancing, about avoiding expensive distributed queries, etc. We’ll also describe our experience doing performance testing of several ElasticSearch clusters and will share our observations about what settings affect search performance and how much. In this talk you’ll also learn how to monitor large ElasticSearch clusters, what various metrics mean, and which ones to pay extra attention to.

Alex: Real-time Analytics with HBase

HBase can store massive amounts of data and allow random access to it – great. MapReduce jobs can be used to perform data analytics on a large scale – great. MapReduce jobs are batch jobs – not so great if you are after Real-time Analytics. Meet append-only writes approach that allows going real-time where it wasn’t possible before.

In this talk we’ll explain how we implemented “update-less updates” (not a typo!) for HBase using append-only approach. This approach shines in situations where high data volume and velocity make random updates (aka Get+Put) prohibitively expensive. Apart from making Real-time Analytics possible, we’ll show how the append-only approach to updates makes it possible to perform rollbacks of data changes, and avoid data inconsistency problems caused by tasks in MapReduce jobs that fail after only partially updating data in HBase. The talk is based on Sematext’s success story of building a highly scalable, general purpose data aggregation framework which was used to build Search Analytics and Performance Monitoring services. Most of the generic code needed for append-only approach described in this talk is implemented in our HBaseHUT open-source project.

Otis: Large Scale ElasticSearch, Solr & HBase Performance Monitoring

This talk has all the buzzwords covered: big data, search, analytics, realtime, large scale, multi-tenant, SaaS, cloud, performance… and here is why:

In this talk we’ll share the “behind the scenes” details about SPM for HBase, ElasticSearch, and Solr, a large scale, multi-tenant performance monitoring SaaS built on top of Hadoop and HBase running in the cloud. We will describe all its backend components, from the agent used for performance metrics gathering, to how metrics get sent to SPM in the cloud, how they get aggregated and stored in HBase, how alerting is implemented and how it’s triggered, how we graph performance data, etc. We’ll also point out the key metrics to watch for each system type. We’ll go over various pain-points we’ve encountered while building and running SPM, how we’ve dealt with them, and we’ll discuss our plans for SPM in the future.

We hope to see some of you in Berlin. If these topics are of interest to you, but you won’t be coming to Berlin, feel free to get in touch, leave comments, or ping @sematext. And if you love working with things our talks are about, we are hiring world-wide!

Sematext Presenting Open Source Search Safari at ESS 2012

We are continuing our “new tradition” of presenting at Enterprise Search Summit (ESS) conferences. We presented at ESS in 2011 (see http://blog.sematext.com/2011/11/02/search-analytics-at-enterprise-search-summit-fall-2011-presentation/). This year we’ll be giving a talk titled Open Source Search Safari, in which Otis (@otisg) will present a number of open-source search solutions – Lucene, Solr, ElasticSearch, and Sensei, plus maybe one or two others (suggestions?). We’ll also be chained to booth 26 where we’ll be showcasing our search-vendor neutral Search Analytics service (service is currently still free, feel free to use it all you want), along with some of our other search-related products.

ESS East will be held May 15-16 in our own New York City. If you are a past or prospective client or customer of ours, please get in touch if you are interested in attending ESS at a discount.

HBaseWD: Avoid RegionServer Hotspotting Despite Sequential Keys

In HBase world, RegionServer hotspotting is a common problem. We can describe this problem with a single sentence: while writing records with sequential row keys allows the most efficient reading of data range given the start and stop keys, it causes undesirable RegionServer hotspotting at write time. In this 2-part post series we’ll discuss the problem and show you how to avoid this infamous problem.

Problem Description

Records in HBase are sorted lexicographically by the row key. This allows fast access to an individual record by its key and fast fetching of a range of data given start and stop keys. There are common cases where you would think row keys forming a natural sequence at write time would be a good choice because of types of queries that will fetch the data later. For example, we may want to associate each record with a timestamp so that later we can fetch records from a particular time range. Examples of such keys are:

time-based format: Long.MAX_VALUE – new Date().getTime()
increasing/decreasing sequence: ”001”, ”002”, ”003”,… or ”499”, ”498”, ”497”, …

But writing records with such naive keys will cause hotspotting because of how HBase writes data to its Regions.

RegionServer Hotspotting

When records with sequential keys are being written to HBase all writes hit one Region. This would not be a problem if a Region was served by multiple RegionServers, but that is not the case – each Region lives on just one RegionServer. Each Region has a pre-defined maximal size, so after a Region reaches that size it is split in two smaller Regions. Following that, one of these new Regions takes all new records and then this Region and the RegionServer that serves it becomes the new hotspot victim. Obviously, this uneven write load distribution is highly undesirable because it limits the write throughput to the capacity of a single server instead of making use of multiple/all nodes in the HBase cluster. The uneven load distribution can be seen in Figure 1. (chart courtesy of SPM for HBase):

We can see that while one server was sweating trying to keep up with writes, others were “resting”. You can find some more information about this problem in HBase Reference Guide.

Solution Approach

So how do we solve this problem? The cases discussed here assume that we don’t have all data we want to write to HBase all at once, but rather that the data are arriving continuously. In case of bulk import of data into HBase the best solutions, including those that avoid hotspotting, are described in bulk load section of HBase documentation. However, if you are like us at Sematext, and many organizations nowadays are, the data keeps streaming in and needs processing and storing. The simplest way to avoid single RegionServer hotspotting in case of continuously arriving data would be to simply distribute writes over multiple Regions by using random row keys. Unfortunately, this would compromise ability to do fast range scans using start and stop keys. But that is not the only solution. The following simple approach solves the hotspotting issue while at the same time preserving the ability to fetch data by start and stop key. This solution, mentioned multiple times on HBase mailing lists and elsewhere is to salt row keys with a prefix. For example, consider constructing the row key using this:

new_row_key = (++index % BUCKETS_NUMBER) + original_key

For the visual types among us, that may result in keys looking as shown in Figure 2.

HBase Row Key Prefix Salting — Figure 2. HBase row key prefix salting

Here we have:

index is the numeric (or any sequential) part of the specific record/row ID that we later want to use for record fetching (e.g. 1, 2, 3 ….)
BUCKETS_NUMBER is the number of “buckets” we want our new row keys to be spread across. As records are written, each bucket preserves the sequential notion of original records’ IDs
original_key is the original key of the record we want to write
new_row_key is the actual key that will be used when writing a new record (i.e. “distributed key” or “prefixed key”). Later in the post the “distributed records” term is used for records which were written with this “distributed key”.

Thus, new records will be split into multiple buckets, each (hopefully) ending up in a different Region in the HBase cluster. New row keys of bucketed records will no longer be in one sequence, but records in each bucket will preserve their original sequence. Of course, if you start writing into an empty HTable, you’ll have to wait some time (depending on the volume and velocity of incoming data, compression, and maximal Region size) before you have several Regions for a table. Hint: use pre-splitting feature for the new table to avoid the wait time. Once writes using the above approach kick in and start writing to multiple Regions your “slaves load” chart should look better.

Figure 3. HBase RegionServer evenly distributed write load

Scan

Since data is placed in multiple buckets during writes, we have to read from all of those buckets when doing scans based on “original” start and stop keys and merge data so that it preserves the “sorted” attribute. That means BUCKETS_NUMBER more Scans and this can affect performance. Luckily, these scans can be run in parallel and performance should not degrade or might even improve — compare the situation when you read 100K sequential records from one Region (and thus one RegionServer) with reading 10K records from 10 Regions and 10 RegionServers in parallel!

Get/Delete

To get or delete a single record by original key may need to perform 1 or up to BUCKETS_NUMBER Get operations depending on the logic we used for prefix generation. E.g. when using “static” hash as prefix, given the original key we may precisely identify the prefixed key. In case we used random prefix we will have to perform Get for each of the possible buckets. The same goes for Delete operations.

MapReduce Input

Since we still want to benefit from data locality, the implementation of feeding “distributed” data to a MapReduce job will likely break the order in which data comes to mappers. This is at least true for the current HBaseWD implementation (see below). Each map task will process data for a particular bucket. Of course, records will be in same order based on original keys within a bucket. However, since two records which were meant to be “near each other” based on their original key may have fallen into different buckets, the will be fed to different map tasks. Thus, if the mapper assumes records come in the strict/original sequence, we will be hurt, since the order will be preserved only within each bucket, but not globally.

Increased Number of Map Tasks

When using data (written using the suggested approach) as a MapReduce input (with start and/or stop key provided) the splits number will likely to be increased (depends on the implementation). For current HBaseWD implementation you’ll get BUCKETS_NUMBER times more splits compared to “regular” MapReduce with same parameters. This is due to the same logic for data selection as with simple Scan operation, as described above. As the result, MapReduce jobs will have BUCKETS_NUMBER times more map tasks. This should not decrease performance if BUCKETS_NUMBER is reasonably not too high (when MR job initialization & cleanup work takes more time than processing itself). Moreover, in many use-cases having many more mappers helps improve performance. Many users reported more mappers having a positive impact given that standard HTable input based MapReduce job usually has too few map tasks (one per Region) which cannot be changed without extra coding.

Another strong signal the suggested approach and its implementation could help is if in your application, in addition to writing records with sequential keys, the application also continuously processes newly written data delta using MapReduce . In such use-cases when data is written sequentially (not using any artificial distribution) and is being processed relatively frequently, the delta to be processed resides in just a few Regions (or perhaps in even just one Region if the write load is not high, if maximal Region size is high, and processing batches are very frequent).

Solution Implementation: HBaseWD

We implemented the solution described above and open-sourced it as a small HBaseWD project. We say small because HBaseWD is really self-contained and really simple to integrate into an existing code due to support for native HBase client API (see examples below). HBaseWD project was first presented at BerlinBuzzwords 2011(video) and is currently used in a number of production systems.

Configuring Distribution

Simple Even Distribution

Distributing records with sequential keys to be distributed in up to Byte.MAX_VALUE buckets (single byte is added in front of a key):

byte bucketsCount = (byte) 32; // distributing into 32 buckets
RowKeyDistributor keyDistributor =  new RowKeyDistributorByOneBytePrefix(bucketsCount);
Put put = new Put(keyDistributor.getDistributedKey(originalKey));
... // add values
hTable.put(put);

Hash-Based Distribution

Another useful RowKeyDistributor is RowKeyDistributorByHashPrefix. Please see example below. It creates the “distributed key” based on original key value so that later when you have original key and want to update the record you can calculate distributed key without having to call HBase (too see what bucket it is in). Or, you can perform a single Get operation when original key is known (instead of reading from all buckets).

AbstractRowKeyDistributor keyDistributor =
     new RowKeyDistributorByHashPrefix(
            new RowKeyDistributorByHashPrefix.OneByteSimpleHash(15));

You can use your own hashing logic here by implementing this simple interface:

public static interface Hasher extends Parametrizable {
  byte[] getHashPrefix(byte[] originalKey);
  byte[][] getAllPossiblePrefixes();
}

Custom Distribution Logic

HBaseWD is designed to be flexible especially when it comes to supporting custom row key distribution approaches. In addition to the above mentioned ability to implement custom hashing logic to be used with RowKeyDistributorByHashPrefix, one can define custom row key distribution logic by extending AbstractRowKeyDistributor abstract class whose interface is super simple:

public abstract class AbstractRowKeyDistributor implements Parametrizable {
  public abstract byte[] getDistributedKey(byte[] originalKey);
  public abstract byte[] getOriginalKey(byte[] adjustedKey);
  public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
  ... // some utility methods
}

Common Operations

Scan

Performing a range scan over data:

Scan scan = new Scan(startKey, stopKey);
ResultScanner rs = DistributedScanner.create(hTable, scan, keyDistributor);
for (Result current : rs) {
  ...
}

Configuring MapReduce Job

Performing MapReduce job over the data chunk specified by Scan:

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "testMapreduceJob");
Scan scan = new Scan(startKey, stopKey);
TableMapReduceUtil.initTableMapperJob("table", scan,
RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
// Substituting standard TableInputFormat which was set in
// TableMapReduceUtil.initTableMapperJob(...)
job.setInputFormatClass(WdTableInputFormat.class);
keyDistributor.addInfo(job.getConfiguration());

What’s Next?

In the next post we’ll cover:

Integration into already running production systems
Changing distribution logic in running systems
Other “advanced topics”

If you’ve read this far you must be interested in HBase. And since we (@sematext) are interested in people with interest in HBase, here are some open positions at Sematext, some nice advantages we offer and “problems” we are into that you may want to check out.

Oh, and all HBase performance charts you see in this post are from our SPM service which uses HBase and, you guessed it, HBaseWD too! 😉

Sematext Presenting Real-time Analytics at HBaseCon 2012

HBaseCon 2012 is the first-ever HBase-focused conference happening this May in San Francisco. I’m happy to say that Sematext will be there as both a sponsor (likely) and a presenter (definitely) along Facebook, Cloudera, Adobe, Tumblr (nice to see our clients there!) and others. Alex will be presenting our work on HBaseHUT, one of Sematext open-sourced projects that were spun out of our work on Scalable Performance Monitoring (for HBase, Solr, ElasticSearch, Sensei, etc.) and Search Analytics.

HBaseHUT makes real-time analytics with HBase possible today — pull requests welcome!

Month: April 2012

HBase Real-time Analytics & Rollbacks via Append-based Updates (Part 2)

Suggested Solution

High Update Throughput

Real-time Updates Visibility

Efficient Updates Processing

Better Handling of Update Load Peaks

Ability to Rollback Updates

Automatic Handling of Task Failures which Write Data to HBase

Cons

Implemenation: Meet HBaseHUT

HBase Real-time Analytics & Rollbacks via Append-based Updates

Problem we are Solving

Use Versions for Rolling Back?

Suggested Solution

Data Engineer Position at Sematext International

Wanted Dead or Alive: Search Engineer with Client-facing Skills

Berlin Buzzwords 2012 – Three Talks from Sematext

Sematext Presenting Open Source Search Safari at ESS 2012

HBaseWD: Avoid RegionServer Hotspotting Despite Sequential Keys

Problem Description

RegionServer Hotspotting

Solution Approach

Solution Implementation: HBaseWD

Configuring Distribution

Simple Even Distribution

Hash-Based Distribution

Custom Distribution Logic

Common Operations

Scan

Configuring MapReduce Job

What’s Next?

Sematext Presenting Real-time Analytics at HBaseCon 2012

Suggested Solution

High Update Throughput

Real-time Updates Visibility

Efficient Updates Processing

Better Handling of Update Load Peaks

Ability to Rollback Updates

Automatic Handling of Task Failures which Write Data to HBase

Cons

Implemenation: Meet HBaseHUT

Rate this:

Problem we are Solving

Use Versions for Rolling Back?

Suggested Solution

Rate this:

Rate this:

Rate this:

Rate this:

Rate this:

Problem Description

RegionServer Hotspotting

Solution Approach

Solution Implementation: HBaseWD

Configuring Distribution

Simple Even Distribution

Hash-Based Distribution

Custom Distribution Logic

Common Operations

Scan

Configuring MapReduce Job

What’s Next?

Rate this:

Rate this: