Debunking the most common big data backup and recovery myths

Platform-provided mechanisms such as replicas and snapshots are not sufficient to ensure proper data protection and to minimize downtime

This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.

Big data has become a priority for most organizations, which are increasingly aware of the central role data can play in their success. But firms continue to struggle with how best to protect, manage and analyze data within today's modern architectures. Failing to do so can result in extended downtime and data loss that cost an organization millions of dollars.

Unlike traditional data platforms (Oracle, SQL Server, etc.), which are managed by IT professionals, big data platforms (Hadoop, Cassandra, Couchbase, HPE Vertica, etc.) are often managed by engineering or DevOps groups. As a result, some common misconceptions around big data backup and recovery need to be cleared up.

Some of the most common myths include:

Myth #1: Multiple replicas of data eliminate the need for separate big data backup/recovery tools.  Most big data platforms create multiple copies of data and distribute those copies across different servers or racks. This redundancy protects data against hardware failures. However, it does not protect against user errors, accidental deletions or data corruption, because those errors quickly propagate to all copies of the data.

Myth #2: Lost data can be quickly and easily rebuilt from the original raw data.  This might work if you still have all the raw data needed to rebuild the lost data. In most cases, though, that raw data has been deleted or is not easily accessible. Even when it is available, rebuilding lost data at big data scale can take weeks, consume significant engineering resources and result in extended downtime for big data users.

Myth #3: Backing up a petabyte of big data is not economical or practical.  Periodic full backups of a petabyte of data can take weeks and require infrastructure investments north of half a million dollars. However, there are a few ways to mitigate these issues. You can identify the subset of data that is most valuable to the organization and back up only that data. Adopting newer backup techniques, such as deduplication to store backups efficiently, incremental-forever backups that transfer only changed data, and commodity servers, will also help reduce costs and shorten backup windows.
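The deduplication and incremental-forever techniques mentioned above can be sketched with a toy content-addressed store. Everything here is an illustrative assumption, not a production design: the 4 MiB chunk size, the in-memory dictionaries and the `DedupStore` class are all hypothetical.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB chunks (illustrative choice)

class DedupStore:
    """Toy content-addressed backup store: identical chunks are kept once."""

    def __init__(self):
        self.chunks = {}     # sha256 hex digest -> chunk bytes
        self.manifests = []  # one ordered list of digests per backup run

    def backup(self, data: bytes):
        """Record one backup run, storing only previously unseen chunks."""
        manifest = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Incremental-forever: a chunk already in the store is not
            # transferred or stored again, only referenced.
            self.chunks.setdefault(digest, chunk)
            manifest.append(digest)
        self.manifests.append(manifest)
        return manifest

    def restore(self, run: int) -> bytes:
        """Reassemble the data for a given backup run from its manifest."""
        return b"".join(self.chunks[d] for d in self.manifests[run])
```

Two backups of largely identical data then cost roughly one copy of storage plus the changed chunks, which is the economic argument behind both techniques.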

Myth #4: Remote disaster recovery copies can serve as a backup copy.  It is prudent to keep a copy of the data in a remote data center to protect against large-scale disasters such as fires and earthquakes. This is typically done by periodically replicating data from the production data center to the disaster recovery data center. However, all changes made in the production data center are propagated to the disaster recovery site, including accidental deletions and database or application corruptions. As a result, the disaster recovery copy cannot serve as a backup copy: it lacks the point-in-time copies you can roll back to.

Myth #5: Writing backup/recovery scripts for big data is easy.  Scripting can work if you have engineering resources, a small amount of data and just one big data platform. Most organizations, however, have tens to hundreds of terabytes of big data spread over multiple platforms. It is not easy to write, test and maintain scripts for these environments. Scripts have to be written for each platform being backed up (e.g., one script for Hadoop, another for Cassandra). They have to be tested at scale and retested whenever platform versions change (e.g., an upgrade from Cassandra 2.1 to 2.2). In some cases, scripts must be periodically updated to support new platform features, new APIs, new data types, etc.

Most organizations do not realize the hidden costs and expertise needed to write good backup scripts for big data platforms. The recovery process is even harder and more error-prone, since it involves locating the right backup copies, copying the data back to the appropriate nodes and applying platform-specific recovery procedures.
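As a rough illustration of why each platform needs its own script, the sketch below maps platforms to their native tooling (Cassandra's `nodetool snapshot`, Hadoop's `distcp`). The command templates and the `build_backup_command` helper are hypothetical, and the sketch deliberately omits everything a real script must handle: consistency, retries, exit-code checks, version differences and cluster topology.

```python
import subprocess

# Each platform ships its own backup tooling, so a per-platform command
# table is unavoidable. Templates here are illustrative only.
BACKUP_COMMANDS = {
    # Cassandra: take a per-node snapshot with a given tag
    "cassandra": ["nodetool", "snapshot", "-t", "{tag}"],
    # Hadoop: copy HDFS data between clusters with distcp
    "hadoop": ["hadoop", "distcp", "{src}", "{dst}"],
}

def build_backup_command(platform: str, **params) -> list:
    """Return the argv for a platform's backup step without running it."""
    try:
        template = BACKUP_COMMANDS[platform]
    except KeyError:
        raise ValueError("no backup recipe for platform %r" % platform)
    return [arg.format(**params) for arg in template]

def run_backup(platform: str, **params) -> None:
    """Execute the backup command (a real script would also log and alert)."""
    subprocess.run(build_backup_command(platform, **params), check=True)
```

Every entry in that table also has a matching, platform-specific restore procedure, which is where most of the maintenance burden actually lands.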

Myth #6: Big data backup/recovery operating costs are very small.  Beyond periodically maintaining and testing scripts, backup and recovery carries additional costs, including:

  • People cost: someone must be responsible for running scripts, verifying that backups succeed, debugging failures, performing ad hoc recoveries, etc.
  • Storage cost: the spend needed to store backup copies
  • Downtime cost: the time it takes an admin to locate the right backup copies and restore the data to the desired state

These costs can add up significantly, especially as the big data environment grows larger and more complex.
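One way to make the three cost categories concrete is a back-of-the-envelope model. Every figure below is an assumed, illustrative input, not data from the article:

```python
def annual_backup_cost(admin_hours_per_week: float,
                       hourly_rate: float,
                       backup_tb: float,
                       storage_cost_per_tb_year: float,
                       downtime_hours_per_year: float,
                       downtime_cost_per_hour: float) -> float:
    """Rough annual total across people, storage and downtime costs."""
    people = admin_hours_per_week * 52 * hourly_rate
    storage = backup_tb * storage_cost_per_tb_year
    downtime = downtime_hours_per_year * downtime_cost_per_hour
    return people + storage + downtime

# Illustrative inputs: 5 admin hours/week at $75/hr, 200 TB of backups
# at $100/TB/year, and 8 hours of recovery downtime at $10,000/hr.
total = annual_backup_cost(admin_hours_per_week=5, hourly_rate=75,
                           backup_tb=200, storage_cost_per_tb_year=100,
                           downtime_hours_per_year=8,
                           downtime_cost_per_hour=10_000)
```

Note that in this toy model the downtime term dominates, which matches the article's emphasis on recovery time rather than storage spend.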

Myth #7: Snapshots are an effective backup mechanism for big data.  Snapshots (the state of data frozen at a particular point in time) are sometimes used as backup copies to protect against user errors or application corruption. There are a few considerations when using platform or storage snapshots for backup.

First, snapshots can be used to automate the backup process, but when using storage snapshots, extra manual steps are needed to ensure consistency of the backup data and metadata. Second, snapshots are only space-efficient when data is not changing rapidly. On big data platforms the rate of change is high, and background processes such as compaction only add to it. As a result, snapshots require significant storage overhead (as much as 50%) to keep even a few point-in-time copies.

Finally, recovering from snapshots is a tedious, manual process. The admin or DBA has to identify the snapshot files that correspond to the data that needs to be restored (e.g., a keyspace or table) and restore them to the respective nodes in the cluster. Any mistake during the restore process can cause permanent data loss.
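Even the first step, mapping a keyspace and table to the right snapshot files, takes care. The sketch below assumes a Cassandra-style on-disk layout (`data/<keyspace>/<table>-<id>/snapshots/<tag>/...`); both the helper and the layout details are assumptions for illustration.

```python
from pathlib import PurePosixPath

def files_for_restore(snapshot_files, keyspace, table, tag):
    """Select the snapshot files for one keyspace/table/tag combination.

    Assumes a Cassandra-style layout where table directories are named
    '<table>-<uuid>', so we match on the table-name prefix.
    """
    selected = []
    for f in snapshot_files:
        parts = PurePosixPath(f).parts
        if ("snapshots" in parts and tag in parts
                and keyspace in parts
                and any(p == table or p.startswith(table + "-")
                        for p in parts)):
            selected.append(f)
    return selected
```

Picking the wrong file here, or copying it to the wrong node, is exactly the kind of mistake the article warns can cause permanent data loss.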

In summary, organizations that are deploying big data platforms and applications must realize the importance of backing up their data. Platform-provided mechanisms such as replicas and snapshots are not sufficient to ensure proper data protection and to minimize downtime. Proper backup and recovery requires some investment but is well worth it given the role big data plays in driving business value.

Organizations should be aware of the hidden costs of developing homegrown solutions and should deploy the right technologies to meet their recovery point objectives (RPO) and recovery time objectives (RTO). Going without a backup/recovery solution for big data is not an option, because events such as human error and data corruption will happen. It is not a question of if, but when.

By making data always available, Talena’s award-winning software helps companies improve their business agility while greatly reducing overall capital and operating costs. For more information, visit


Copyright © 2016 IDG Communications, Inc.