HBase backup, anyone?

By Sanel ZUKAN  //  Big Data  //  Feb 10

Sort of intro

Remember the ‘undelete’ command from MS-DOS? I certainly do, and I often curse modern file systems and their designers for no longer providing it.

Maybe we are living in a more complex world than before and this ‘feature’ cannot be easily implemented any more (for some filesystems that is indeed true), but that must not be an excuse for not seeking alternatives.

Why? Well, how often do you do any kind of backup? I can’t remember the last time I did. I rely on my OS and hardware to keep things (and me) happy, and I have cried more than once when they failed. And they do fail.

I have an excuse: my important files are not on my drive, they are on a remote repository which is fault tolerant and which (I make sure) has backups. Daily, monthly, any kind, as long as I can be sure I can pull things back from a certain point in time.

But this approach is often not applicable for big companies, for various reasons: sensitive data, large amounts of data that change (or not) often, even national laws (in some countries it is mandatory to keep backups of your data so investigators can inspect content for a specific period of time).

Where HBase fits here

Above, we got a small picture of why backups matter. But how does HBase fit into it?

HBase (backed by Hadoop) is designed to be a fault-tolerant system, able to run on commodity hardware and to self-heal when things start to fall apart. Thanks to HDFS (the Hadoop Distributed File System) and its configurable replication factor, we can be assured that every block stored on a DFS node (or Datanode) will have a given number of replicas, reducing the chance of data loss.
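
Just to make the replication factor a bit more concrete, here is a minimal sketch that asks HDFS how many replicas a given file actually has; the class name and the path argument are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: prints the replication factor HDFS reports for a file.
public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // e.g. an HFile somewhere under /hbase, or any other path on the cluster
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        System.out.println(args[0] + " has " + status.getReplication() + " replicas");
    }
}
```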

Maybe this Hadoop design inspired the HBase folks not to consider any kind of backup approach. Even today (as has often happened before), if you ask this question on the HBase mailing list, someone will simply say it is not needed and that adding more Datanodes will, theoretically, decrease the chance of data loss, as copies will exist in multiple places.

Well, in theory something can work flawlessly, but in practice it is a totally different story. For example, what if the Namenode gets corrupted somehow (I had this case once)? As it is a single point of failure for the whole cluster, you can freely close your business and throw away everything you have (ok, you can always do something, depending on how important the HBase data is to you).

Or, what if something totally different happens where HBase can’t help, like the validity of data inside tables? In my previous company (which also used HBase for its business) we had a nasty bug in an application, causing it to (sometimes) write wrong things into a table. The content was encrypted with a strong algorithm, making restoration (by admins) practically impossible.

To keep things short, this content was part of a so-called user profile, and after this happened, the user account became useless. Imagine one day Facebook or Gmail sending you a mail telling you your account is non-restorable due to a bug inside the application. I would not use it any more, that is for sure.

Ok, anything?

When you google for ‘hbase backup’, the first hit will be the popular HBase Backup Options post, which lists the possible options for doing some kind of backup.

Immediately after reading it, you will figure out that backup on HBase is a really complex issue. Backup on traditional, relational databases is a complex but well explored topic. On the other hand, HBase can have tables of tens or thousands of gigabytes with fast throughput (that is what it is made for, isn’t it?), rendering backup techniques from the relational world questionable.

Even the approaches listed on that page have flaws: they are time consuming and often make the cluster quite busy (or even unusable) during the backup phase. For example, one of our contractors has a quite large HBase database and backup is done with the Export (org.apache.hadoop.hbase.mapreduce.Export) MapReduce job, which takes a couple of hours to complete. Should I mention that the exported table(s) will not be consistent? The backup runs on a live system, so rows that were already exported can change while the MapReduce job is still in the middle of the table.
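
Export is normally started from the command line, but under the hood it is just an ordinary MapReduce job. Here is a rough sketch of driving it from Java, assuming the 0.92/0.94-era API; the table name and output path are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.Export;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver: exports table "mytable" to the HDFS directory /backup/mytable.
public class TableBackup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Same arguments the command-line tool takes: <tablename> <outputdir>
        Job job = Export.createSubmittableJob(conf, new String[] { "mytable", "/backup/mytable" });

        // This is the part that takes hours on a large table and keeps the cluster busy.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```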

Salvation can be found in the cluster replication feature. If correctly set up, HBase will slowly, without spending much of the cluster’s resources, update changed rows on a backup cluster. Details on how it works can be found at the given link.

This solution again comes with some costs. First, you must have a second cluster, presumably the same size as the original. You should also take into account the network usage for replication, which is executed when HBase determines it is appropriate (usually when a row has changed). Luckily, replication can also be run on demand.
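
To give a rough idea of what “correctly set up” means, here is a sketch using (what I assume is) the 0.92-era client API; hbase.replication must also be set to true in hbase-site.xml on both clusters, and the peer id, ZooKeeper quorum, table and family names below are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.replication.ReplicationAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical setup: ship edits of family "data" in table "mytable" to a backup cluster.
public class ReplicationSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Register the backup cluster as peer "1"; the key is its
        // ZooKeeper quorum, client port and znode parent.
        ReplicationAdmin replication = new ReplicationAdmin(conf);
        replication.addPeer("1", "backup-zk1,backup-zk2,backup-zk3:2181:/hbase");

        // Mark the column family with global replication scope, so its
        // edits are queued for shipping to the peer.
        // (In practice you would fetch the table's existing descriptor and
        // only flip the scope, instead of building a fresh HColumnDescriptor.)
        HBaseAdmin admin = new HBaseAdmin(conf);
        admin.disableTable("mytable");
        HColumnDescriptor family = new HColumnDescriptor("data");
        family.setScope(HConstants.REPLICATION_SCOPE_GLOBAL);
        admin.modifyColumn(Bytes.toBytes("mytable"), family);
        admin.enableTable("mytable");
    }
}
```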

But replication will not save your backup cluster from accepting whatever the original one sends, including malformed data inside table rows caused by application bugs.

Ideally, a snapshot mechanism could solve these problems, but the HBASE-50 issue is almost as popular as HBase itself. There is some work on it, but the last commit was more than a year ago and I’m not sure how up to date it is with major HBase changes.

Not that bad

I will conclude this post on a less negative note than it may seem. The new HBase version(s) come with a really cool feature: coprocessors. They give us mortals access to HBase internals without spending considerable time learning a not-so-small codebase. With coprocessors you can do really cool (and dangerous) things, which I (or one of my colleagues) will try to explore in upcoming posts.
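
Just for a taste before those posts, here is a minimal sketch of a region observer, assuming the 0.92-era coprocessor API; the class name and the logging are made up for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical observer: runs inside the region server and sees every Put
// before it is written -- exactly the kind of hook a backup or validation
// tool could use.
public class LoggingObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, boolean writeToWAL) throws IOException {
        System.out.println("about to write row: " + Bytes.toString(put.getRow()));
    }
}
```

Such an observer is typically wired in through the hbase.coprocessor.region.classes property (or attached to a single table descriptor) and then runs for every write hitting the region server.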

Next time, I’m hoping to cover replication in a bit more practical detail. All the hardcore technicalities are already covered in the HBase Replication document ;)