At its core, HBase is a distributed, persistent, sparse, column-oriented, multidimensional data repository. Multiple dimensions of data are supported by keeping multiple versions of each HBase table cell. This allows each cell to be uniquely identified by three keys: row key, column, and version.
The following is a simplified view of how data is stored within HBase:
ROW KEY | VERSION | CF1:Q1 | CF2:Q1 | CF3:Q1
12345 | TS1 | Value1 | Value1 | Value1
12345 | TS2 | Value2 | Value2 | Value2
12345 | TS3 | Value3 | Value3 | (empty)
The most recent cell in each column makes up the latest version of the row with rowkey=12345, and the data in these cells is what HBase returns when the table is queried for the latest version of the row. In this example, the latest value for row 12345 is: rowkey = 12345, CF1:Q1=Value3, CF2:Q1=Value3, CF3:Q1=Value2.
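The default read behaviour described above can be illustrated with a minimal sketch. It does not use the HBase client API. The dictionary `cells`, the helper `read_latest`, and the integer timestamps 1, 2 and 3 (standing in for TS1, TS2 and TS3) are all assumptions made for illustration.

```python
# Hypothetical in-memory model of an HBase table: every cell is addressed by
# three keys (row key, column, version). Integers 1..3 stand in for TS1..TS3.
cells = {
    ("12345", "CF1:Q1", 1): "Value1",
    ("12345", "CF2:Q1", 1): "Value1",
    ("12345", "CF3:Q1", 1): "Value1",
    ("12345", "CF1:Q1", 2): "Value2",
    ("12345", "CF2:Q1", 2): "Value2",
    ("12345", "CF3:Q1", 2): "Value2",
    ("12345", "CF1:Q1", 3): "Value3",
    ("12345", "CF2:Q1", 3): "Value3",
    # No CF3:Q1 cell at version 3: a null is simply the absence of a cell.
}


def read_latest(cells, row_key):
    """Return the newest value of each column in the row, column by column."""
    latest = {}  # column -> (timestamp, value)
    for (row, column, ts), value in cells.items():
        if row != row_key:
            continue
        if column not in latest or ts > latest[column][0]:
            latest[column] = (ts, value)
    return {column: value for column, (ts, value) in latest.items()}


print(read_latest(cells, "12345"))
```

Because the newest cell is picked per column, CF3:Q1 still resolves to Value2 even though the most recent write to the row did not include that column.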
Because HBase does not persist null values (unlike relational databases), a null value in HBase is represented simply by the absence of a value in the column. This makes it hard to implement an update that removes a value without either physically deleting the value from the table or writing some custom representation of null into the cell that holds the now-deprecated value. Both solutions have shortcomings:
- Physically deleting the data from an HBase cell requires explicit handling of version identifiers (timestamps) in order to issue Delete requests for the deprecated cells. With this approach we would also lose the ability to track how the data changed over time.
- Writing explicit null values adds data overhead for each null value. As the data evolves over time, the footprint of our data would also grow because of all the null values that need to be written.
The solution was to introduce a row-level version: the timestamp of the latest cell in a row is treated as the row-level timestamp, only data with that timestamp is read from HBase, and data with older timestamps is filtered out.
Let’s assume the following:
(TS1) Row 12345 is added with values: CF1:Q1=Value1, CF2:Q1=Value1, CF3:Q1=Value1
(TS2) Row 12345 is updated with values: CF1:Q1=Value2, CF2:Q1=Value2, CF3:Q1=Value2
(TS3) Finally, row 12345 is updated with values: CF1:Q1=Value3, CF2:Q1=Value3, and CF3:Q1 is deleted.
The HBase representation after these steps is exactly the same as in the first table, but the difference is in how our data layer interprets this row:
ROW KEY | VERSION | CF1:Q1 | CF2:Q1 | CF3:Q1
12345 | TS1 | Value1 | Value1 | Value1
12345 | TS2 | Value2 | Value2 | Value2
12345 | TS3 | Value3 | Value3 | (empty)
Our custom data layer will interpret the above data as a row in which only the CF1:Q1 and CF2:Q1 columns have a value, as only these columns carry the latest timestamp (TS3). The value in the CF3:Q1 cell will be discarded, as it carries an older timestamp (TS2).
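The row-level interpretation can be sketched as follows. This is an illustrative in-memory model, not the actual data layer. The names `cells` and `read_row_version` are hypothetical, and integers 1 to 3 stand in for TS1 to TS3. The sketch assumes that every update writes all of its non-null columns with one shared timestamp, as in the three steps listed earlier.

```python
# Hypothetical in-memory model: (row key, column, version) -> value.
cells = {
    ("12345", "CF1:Q1", 1): "Value1",
    ("12345", "CF2:Q1", 1): "Value1",
    ("12345", "CF3:Q1", 1): "Value1",
    ("12345", "CF1:Q1", 2): "Value2",
    ("12345", "CF2:Q1", 2): "Value2",
    ("12345", "CF3:Q1", 2): "Value2",
    ("12345", "CF1:Q1", 3): "Value3",
    ("12345", "CF2:Q1", 3): "Value3",
}


def read_row_version(cells, row_key):
    """Return (row timestamp, columns) using the newest cell timestamp as the
    row-level version; cells with any older timestamp are filtered out."""
    row_cells = [
        (column, ts, value)
        for (row, column, ts), value in cells.items()
        if row == row_key
    ]
    if not row_cells:
        return None, {}
    row_ts = max(ts for _, ts, _ in row_cells)
    columns = {column: value for column, ts, value in row_cells if ts == row_ts}
    return row_ts, columns


row_ts, columns = read_row_version(cells, "12345")
print(row_ts, columns)
```

Here CF3:Q1 is missing from the result, which is how the deletion at TS3 becomes visible to the reader without a Delete request or an explicit null marker.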
It is important to mention that data from older versions of the row (column qualifiers with older timestamps) should be cleaned up on a regular basis.
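As a rough sketch of what such a cleanup pass might select, the snippet below lists every cell whose timestamp is older than the row-level timestamp. The model and the names `cells` and `stale_cells` are hypothetical. How the stale cells are actually removed from HBase, and how often, is not specified here and is left out.

```python
# Hypothetical in-memory model: (row key, column, version) -> value.
cells = {
    ("12345", "CF1:Q1", 1): "Value1",
    ("12345", "CF2:Q1", 1): "Value1",
    ("12345", "CF3:Q1", 1): "Value1",
    ("12345", "CF1:Q1", 2): "Value2",
    ("12345", "CF2:Q1", 2): "Value2",
    ("12345", "CF3:Q1", 2): "Value2",
    ("12345", "CF1:Q1", 3): "Value3",
    ("12345", "CF2:Q1", 3): "Value3",
}


def stale_cells(cells, row_key):
    """Return the keys of all cells in the row that are older than the
    row-level timestamp (the newest timestamp found in the row)."""
    versions = [ts for (row, _, ts) in cells if row == row_key]
    if not versions:
        return []
    row_ts = max(versions)
    return sorted(
        key for key in cells if key[0] == row_key and key[2] < row_ts
    )


to_delete = stale_cells(cells, "12345")
# Simulate the cleanup on the in-memory model only.
remaining = {key: value for key, value in cells.items() if key not in to_delete}
print(len(to_delete), sorted(remaining))
```

Note that the stale CF3:Q1 cell at TS2 is the newest cell in its own column, so it has to be selected by comparing against the row-level timestamp rather than the column's own history.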
This approach enables us to achieve row level timestamp consistency when reading data from HBase using our custom data layer.