Sizing your disk backup and deduplication system to avoid future missteps

This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.

Correctly sizing a disk backup system with deduplication to meet your current and future needs is an important part of your data protection strategy. If you ask the right questions upfront and analyze each aspect of your environment that impacts backup requirements, you can avoid the consequences of buying an undersized system that quickly exceeds capacity.

First and foremost, it's important to understand that this sizing exercise is different from the process of sizing a primary storage system. In primary storage you can simply say, "I have 8TB to store and so I will buy 10TB." In disk-based backup with deduplication, a sizing exercise must be conducted based on a number of factors. Here's what to consider:

* Data types. The data types you have directly impact the deduplication ratio and therefore the system you need. If your mix of data types is conducive to deduplication and yields high deduplication ratios (e.g., 50:1), then the deduplicated data will occupy less storage space and you need a smaller system. If you have a mix of data that does not deduplicate well (e.g., 10:1 or less data reduction), then you will need a much larger system. What matters is the deduplication ratio achieved in a real-world environment with a real mix of data types.

* Deduplication method. The deduplication method has a significant impact on deduplication ratio. Not all deduplication approaches are created equal.

  • Zone-level with byte comparison or alternatively 8KB block-level with variable length content splitting will get the best deduplication ratios. The average is a 20:1 deduplication ratio with a general mix of data types.
  • 64KB and 128KB fixed block will produce the lowest deduplication ratio, as the blocks are too big to find many repetitive matches. The average is a 7:1 deduplication ratio.
  • 4KB fixed block will get close to the zone-level and variable-length approaches but often suffers a performance hit. A 13:1 deduplication ratio is the average with a general mix of data types.
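To see what those averages mean for capacity, here is a minimal sketch that divides the amount of backup data retained by each average ratio quoted above. The dictionary keys and the function name are illustrative labels, not vendor terms, and the averages apply to a general mix of data types, so a real environment may land well away from them.

```python
# Average deduplication ratios quoted in this article for a general data mix.
# The keys are illustrative labels chosen for this sketch.
AVERAGE_RATIOS = {
    "zone_level_or_8kb_variable": 20.0,
    "fixed_4kb": 13.0,
    "fixed_64kb_or_128kb": 7.0,
}


def stored_after_dedup_tb(retained_tb, ratio):
    """Disk space needed for retained_tb of backup data at a given ratio."""
    return retained_tb / ratio


if __name__ == "__main__":
    retained_tb = 160.0  # e.g., 10TB of full backups kept for 16 weeks
    for method, ratio in AVERAGE_RATIOS.items():
        needed = stored_after_dedup_tb(retained_tb, ratio)
        print(f"{method}: about {needed:.1f}TB on disk")
```

For the same 160TB of retained backups, the gap between a 20:1 and a 7:1 method is roughly a threefold difference in disk.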

* Retention. The number of weeks of retention you keep impacts deduplication ratio as well. The longer the retention, the more repetitive data the deduplication system sees, so the deduplication ratio increases as retention increases. Most vendors will say that they get a deduplication ratio of 20:1, but when you do the math, that typically assumes a retention period of about 16 weeks. If you keep only two weeks of retention, you may only get about a 4:1 reduction.

Here is an example to highlight this: If you have 10TB of data and you keep four weeks of retention, then without deduplication you would store about 40TB of data (10TB x 4 weeks). With deduplication, assuming a 2% weekly change rate, you would store about 5.6TB of data, so the deduplication ratio is about 7.1:1 (40TB ÷ 5.6TB = 7.1:1). However, if you have 10TB of data and you keep 16 weeks of retention, then without deduplication you would store about 160TB of data (10TB x 16 weeks). With deduplication, assuming a 2% weekly change rate, you would store about 8TB of data, which is a deduplication ratio of 20:1 (160TB ÷ 8TB = 20:1).
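The arithmetic in that example can be reproduced with a short sketch. One assumption is added here that the example does not state: the first full backup is reduced about 2:1 (10TB stored as 5TB), and each later week adds only the changed 2%. That model is a guess, but it yields the 5.6TB and 8TB figures above. The function and parameter names are illustrative.

```python
def stored_tb(full_tb, weeks, weekly_change=0.02, first_full_reduction=2.0):
    """Estimate deduplicated disk usage for `weeks` of retained weekly fulls.

    Assumption (not stated in the article): the first full is reduced by
    first_full_reduction, and every later week adds only the changed data.
    """
    first_full = full_tb / first_full_reduction
    later_weeks = (weeks - 1) * full_tb * weekly_change
    return first_full + later_weeks


def dedup_ratio(full_tb, weeks, **model):
    """Data retained without deduplication divided by data stored with it."""
    return (full_tb * weeks) / stored_tb(full_tb, weeks, **model)


if __name__ == "__main__":
    for weeks in (2, 4, 16):
        print(
            f"{weeks} weeks: about {stored_tb(10, weeks):.1f}TB stored, "
            f"ratio about {dedup_ratio(10, weeks):.1f}:1"
        )
```

Under the same assumptions, two weeks of retention comes out a little under 4:1, in line with the figure mentioned earlier.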

* Rotation. Your backup rotation will also impact the size of the disk-based backup system with deduplication that you need. If you are doing rolling full backups each night, then you need a larger system than if you are doing incremental backups of files during the week and then a weekend full backup. Rotation schemes are usually:

  • Database and email
    • Full backup on Monday, Tuesday, Wednesday, Thursday, weekend
  • File data
    • Incrementals forever or optimized synthetics -- copies only changed files each night, with no weekend full
    • Incrementals -- copies changed files each night, with a full backup of all files on the weekend
    • Differentials -- copies files each night that have changed since the last full backup, with a full backup of all files on the weekend
    • Rolling fulls -- Breaks total full backup into a subset and backs up a portion of the full backup each night (e.g., if the full backup is 30TB, then back up 10TB each night and keep rotating on a three-day schedule)

Because the backup rotation scheme you use changes how much data is sent to the disk-based backup system, it also impacts the system size you require.
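A rough sketch of how much data each rotation scheme sends to the backup system in a week follows. Everything numeric here is an assumption for illustration: four weeknight jobs plus one weekend job, a fixed daily change rate, and changes that touch different files each day (so differentials grow linearly). The scheme names and the function are hypothetical, and rolling fulls are left out because their volume depends on how the full is split.

```python
def weekly_ingest_tb(full_tb, daily_change, scheme, weeknights=4):
    """Estimate TB sent to the backup system per week, before deduplication."""
    changed = full_tb * daily_change  # data changed per day (assumed constant)
    if scheme == "nightly_full":
        # Full backup every weeknight plus the weekend (database and email).
        return full_tb * (weeknights + 1)
    if scheme == "incremental":
        # Changed files each weeknight, then a weekend full.
        return changed * weeknights + full_tb
    if scheme == "differential":
        # Each night resends everything changed since the last full.
        return sum(changed * night for night in range(1, weeknights + 1)) + full_tb
    if scheme == "incremental_forever":
        # Changed files on every job, no weekend full.
        return changed * (weeknights + 1)
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    schemes = ("nightly_full", "incremental", "differential", "incremental_forever")
    for scheme in schemes:
        print(f"{scheme}: about {weekly_ingest_tb(10, 0.01, scheme):.1f}TB per week")
```

These are volumes sent before deduplication; how much lands on disk still depends on the data types, method and retention discussed above.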

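* Cross protection. Whether data is replicated one way or cross-replicated between sites changes the capacity each site needs. Two scenarios: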

  • Sizing Scenario 1: You are backing up data at Site A and replicating to Site B for disaster recovery. For example, if Site A is 10TB and Site B is just for disaster recovery, then a system that can handle 10TB at Site A and 10TB at Site B is required.
  • Sizing Scenario 2: If backup data is kept at both Site A (e.g., 10TB) and Site B (e.g., 6TB), and the data from Site A is replicated to Site B while the data from Site B is cross-replicated to Site A, then a larger system is required on both sides.
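The two scenarios can be sketched as a simple sum per site. This assumes each site must handle its own backup data plus whatever is replicated in from the other site, measured in the same pre-deduplication terms as the 10TB and 6TB figures above; the function name is illustrative.

```python
def site_capacity_tb(local_tb, replicated_in_tb=0.0):
    """Backup data a site must handle: its own plus data replicated to it."""
    return local_tb + replicated_in_tb


if __name__ == "__main__":
    # Scenario 1: Site A backs up 10TB; Site B only receives the replica.
    print("Scenario 1, Site A:", site_capacity_tb(10.0))
    print("Scenario 1, Site B:", site_capacity_tb(0.0, replicated_in_tb=10.0))

    # Scenario 2: Site A holds 10TB, Site B holds 6TB, cross-replicated.
    print("Scenario 2, Site A:", site_capacity_tb(10.0, replicated_in_tb=6.0))
    print("Scenario 2, Site B:", site_capacity_tb(6.0, replicated_in_tb=10.0))
```

In the cross-replicated case both sites end up handling the combined 16TB, which is why neither side can be sized for its local data alone.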

Bottom line for sizing a system

In summary, dozens of possible scenarios impact the sizing of a system, including:

  • How much data is in your full backup? What percentage of the data is compressed (including media files), encrypted, database, unstructured?
  • What is the required retention period in weeks/months on-site?
  • What is the required retention period in weeks/months off-site?
  • What is the nightly backup rotation?
  • Is data being replicated one way only or backed up from multiple sites and cross-replicated?
  • Other considerations unique to your environment

When working with a vendor, ensure it offers a tool that allows it to calculate the exact size of the system you need based on all of the above.

It is very common to see an organization acquire a disk backup system with deduplication and, in a few short months, have it fill up because the system was undersized, retention was longer than planned, the rotation scheme put more data into the system, the deduplication method had a low deduplication ratio, or the data types did not deduplicate well.

Because disk-based backup with deduplication is not simply primary storage, vendors should have the proper tools to help you size the system correctly.

Keep in mind that with this sizing exercise, your goal is to avoid either (a) buying a costly system that has excessive capacity relative to your needs today, or (b) buying an undersized system that will require a forklift upgrade once data growth exceeds its capacity.

By carefully evaluating each of your scenarios and requirements upfront, and asking the right questions of the vendor, you can choose the right system that will positively impact backup performance for the long term.

ExaGrid Systems is the leader in cost-effective scalable disk-based backup solutions with data deduplication. For more information, visit www.exagrid.com.
