
HBase – Row level timestamp consistency in HBase

January 6, 2013 | March 27th, 2017

At its core, HBase is a distributed, persistent, sparse, column-oriented, multidimensional data repository. Multiple dimensions of data are supported by keeping multiple versions of each HBase table cell. This allows each cell to be uniquely identified by three keys: row key, column, and version.

The following is a simplified view of how data is stored within HBase:

ROW KEY | VERSION | CF1:Q1  | CF2:Q1  | CF3:Q1
12345   | TS1     | Value1  | Value1  | Value1
        | TS2     | Value2  | Value2  | Value2*
        | TS3     | Value3* | Value3* |

* shaded cell

The shaded cells represent the latest version of the row with rowkey=12345, that is, the newest cell in each column. The data in these cells is what HBase returns when the table is queried for the latest version of the row. In this example, the latest version of row 12345 looks like this: rowkey=12345, CF1:Q1=Value3, CF2:Q1=Value3, CF3:Q1=Value2.
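The default read described above can be sketched in a few lines of Python. This is only an in-memory model, not HBase client code: the dictionary layout, the integer versions 1 to 3 standing in for TS1 to TS3, and the helper name latest_per_column are all assumptions made for illustration.

```python
# Each cell is addressed by (row key, column, version).
# The integer versions 1, 2, 3 stand in for TS1, TS2, TS3.
cells = {
    ("12345", "CF1:Q1", 1): "Value1",
    ("12345", "CF2:Q1", 1): "Value1",
    ("12345", "CF3:Q1", 1): "Value1",
    ("12345", "CF1:Q1", 2): "Value2",
    ("12345", "CF2:Q1", 2): "Value2",
    ("12345", "CF3:Q1", 2): "Value2",
    ("12345", "CF1:Q1", 3): "Value3",
    ("12345", "CF2:Q1", 3): "Value3",
}


def latest_per_column(cells, row_key):
    """Return {column: value}, taking the newest version of every column."""
    newest = {}  # column -> (version, value)
    for (row, column, version), value in cells.items():
        if row != row_key:
            continue
        if column not in newest or version > newest[column][0]:
            newest[column] = (version, value)
    return {column: value for column, (_, value) in newest.items()}


latest = latest_per_column(cells, "12345")
print(latest)
```

Note that CF3:Q1 still resolves to Value2 here, because each column is resolved independently of the others.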

HBase does not persist null values. Unlike in relational databases, a null value in HBase is represented simply by the absence of a value in the column. This makes it hard to implement an update that removes a column value (update by deletion) without either physically deleting the value from the table or writing some custom representation of null into the cell that holds the now-deprecated value. Both solutions have certain shortcomings:

  • Physically deleting the data from an HBase cell requires explicit handling of version identifiers (timestamps) in order to issue Delete requests for the deprecated cells. With this approach we would also lose the ability to track the data over time.
  • Writing explicit null values adds data overhead for each null value. As the data evolves over time, the memory footprint of our data also grows because of all the null values that need to be written.

Our solution was to introduce a row-level version: the timestamp of the most recent cell in a row is treated as the row-level timestamp. Only data with that timestamp is read from HBase, and data with older timestamps is filtered out.

Let’s assume the following sequence of events:

(TS1) Row 12345 is added with values: CF1:Q1=Value1, CF2:Q1=Value1, CF3:Q1=Value1

(TS2) Row 12345 is updated with values: CF1:Q1=Value2, CF2:Q1=Value2, CF3:Q1=Value2

(TS3) Finally, row 12345 is updated with values: CF1:Q1=Value3, CF2:Q1=Value3, and CF3:Q1 is deleted.
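As a rough sketch of these three writes, the Python model below stamps every column of an update with one shared timestamp and expresses the deletion of CF3:Q1 simply by leaving that column out. The put_row helper and the in-memory dictionary are hypothetical stand-ins for the data layer, and the integers 1 to 3 stand in for TS1 to TS3.

```python
def put_row(cells, row_key, timestamp, values):
    """Write every supplied column of one row with the same timestamp."""
    for column, value in values.items():
        cells[(row_key, column, timestamp)] = value


cells = {}  # (row key, column, timestamp) -> value

# TS1: the row is added with all three columns.
put_row(cells, "12345", 1, {"CF1:Q1": "Value1", "CF2:Q1": "Value1", "CF3:Q1": "Value1"})
# TS2: all three columns are updated.
put_row(cells, "12345", 2, {"CF1:Q1": "Value2", "CF2:Q1": "Value2", "CF3:Q1": "Value2"})
# TS3: CF3:Q1 is "deleted" by omitting it; no delete request and no null marker.
put_row(cells, "12345", 3, {"CF1:Q1": "Value3", "CF2:Q1": "Value3"})

for key in sorted(cells):
    print(key, cells[key])
```

Under this assumption nothing is physically removed at write time, so the older CF3:Q1 cells remain in storage.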

The HBase representation after these steps is exactly the same as in the first table, but the difference is in how our data layer interprets this row:

ROW KEY | VERSION | CF1:Q1 | CF2:Q1 | CF3:Q1
12345   | TS1     | Value1 | Value1 | Value1
        | TS2     | Value2 | Value2 | Value2
        | TS3     | Value3 | Value3 |

Our custom data layer interprets the above data as a row in which only the CF1:Q1 and CF2:Q1 columns have a value, because only these columns carry the latest timestamp (TS3). The value in the CF3:Q1 cell is discarded, as it carries an older timestamp (TS2).
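The interpretation rule can be sketched as follows. This is a minimal in-memory model under stated assumptions, not the actual data layer: read_row is a hypothetical name, the cells live in a plain dictionary, and the integers 1 to 3 stand in for TS1 to TS3.

```python
def read_row(cells, row_key):
    """Return only the columns whose timestamp equals the row-level timestamp."""
    row_cells = {
        (column, ts): value
        for (row, column, ts), value in cells.items()
        if row == row_key
    }
    if not row_cells:
        return {}
    # The newest timestamp found anywhere in the row is the row-level timestamp.
    row_ts = max(ts for (_, ts) in row_cells)
    return {column: value for (column, ts), value in row_cells.items() if ts == row_ts}


# Same content as the table: CF3:Q1 has no cell at timestamp 3.
stored = {
    ("12345", f"CF{n}:Q1", ts): f"Value{ts}"
    for ts in (1, 2, 3)
    for n in (1, 2, 3)
    if not (ts == 3 and n == 3)
}

row = read_row(stored, "12345")
print(row)
```

Because CF3:Q1 has no cell at the row-level timestamp, it is treated as absent even though an older value is still stored.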

It is important to mention that data from older versions of the row (column qualifiers with older timestamps) should be cleaned up on a regular basis.
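One possible shape for such a cleanup pass is sketched below, again as an in-memory model. The prune_stale_cells helper is hypothetical, and dropping everything older than the row-level timestamp is an assumption; a real job might keep a window of history instead. An explicit pass like this seems to be needed because a per-column version limit would keep the stale CF3:Q1 cell, which is still the newest version within its own column.

```python
def prune_stale_cells(cells, row_key):
    """Delete cells of a row that are older than its row-level timestamp.

    Returns the number of cells removed.
    """
    timestamps = [ts for (row, _, ts) in cells if row == row_key]
    if not timestamps:
        return 0
    row_ts = max(timestamps)
    stale = [key for key in cells if key[0] == row_key and key[2] < row_ts]
    for key in stale:
        del cells[key]
    return len(stale)


# Same content as the table: CF3:Q1 has no cell at timestamp 3.
table = {
    ("12345", f"CF{n}:Q1", ts): f"Value{ts}"
    for ts in (1, 2, 3)
    for n in (1, 2, 3)
    if not (ts == 3 and n == 3)
}

removed = prune_stale_cells(table, "12345")
print(removed, sorted(table))
```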

This approach enables us to achieve row level timestamp consistency when reading data from HBase using our custom data layer.