Deduplication: Stop repeating yourself

New techniques can save disk space and speed backups.

Data deduplication, data reduction, commonality factoring, capacity optimized storage: whatever you call it, it is a process designed to make network backups to disk faster and more economical.

The idea is to eliminate large amounts of redundant data that can chew up disk space. Proponents also say it lets you keep more data online longer on the same amount of disk.

In deduplication, as data is backed up to a disk-based virtual tape library (VTL) appliance, a catalog of the data is built. This catalog, or repository, indexes individual pieces of data in a file or block of information, assigns each a metadata reference that is used to rebuild the file if it needs to be recovered, and stores the data on disk. The catalog also is used on subsequent backups to identify which data elements are unique. Nonunique data elements are not backed up; unique ones are committed to disk.

For instance, a 20-slide PowerPoint file is initially backed up. The user then changes a single slide in the file, saves the file and e-mails it to 10 counterparts. When a traditional backup occurs, the entire PowerPoint file and its 10 e-mailed copies are backed up. In deduplication, after the PowerPoint file is modified, only the unique element of data (the single changed slide) is backed up, requiring significantly less disk capacity.

“The data-reduction numbers are great,” says Randy Kerns, an independent storage analyst. “Most vendors are quoting a 20-to-1 capacity reduction by only storing uniquely changed data.”

Data deduplication uses a couple of methods to identify unique information. Some vendors use a cryptographic algorithm called hashing to tell whether data is unique. The algorithm is applied to the data and compared with previously calculated hashes. Other vendors, such as Diligent, use a pattern-matching and differencing algorithm that identifies duplicate data. Diligent says this method is more efficient, because it is less CPU- and memory-intensive.
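The hashing approach can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the DedupStore class, the 8-byte fixed chunk size and the choice of SHA-256 are all assumptions made for the example. It mirrors the PowerPoint scenario above, with each 8-byte chunk standing in for one slide.

```python
import hashlib

CHUNK_SIZE = 8  # assumed fixed chunk size; one chunk stands in for one slide


class DedupStore:
    """Toy catalog (hypothetical): maps each chunk's hash to the stored chunk."""

    def __init__(self):
        self.chunks = {}   # hash -> chunk bytes (the data committed to "disk")
        self.recipes = {}  # backup name -> ordered hashes (metadata references)

    def backup(self, name, data):
        """Store only chunks the catalog has not seen; return bytes newly written."""
        written = 0
        recipe = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.chunks:  # unique element: commit it
                self.chunks[key] = chunk
                written += len(chunk)
            recipe.append(key)          # duplicates cost only a reference
        self.recipes[name] = recipe
        return written

    def restore(self, name):
        """Rebuild a backed-up file from its list of references."""
        return b"".join(self.chunks[key] for key in self.recipes[name])


# A 20-"slide" file, then the same file with one slide changed.
slides = [f"slide{n:03d}".encode() for n in range(20)]
original = b"".join(slides)
slides[7] = b"SLIDE007"
edited = b"".join(slides)

store = DedupStore()
first_written = store.backup("original", original)
second_written = store.backup("edited", edited)
copies_written = [store.backup(f"copy{i}", edited) for i in range(10)]
```

In this toy run the first backup writes all 160 bytes, the edited file adds 8, and the 10 e-mailed copies add nothing. A traditional backup of the same 12 files would write 1,920 bytes.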

Data deduplication software is being deployed either on disk-based backup appliances or VTL boxes that emulate the operations of a tape library. Among the vendors implementing deduplication on appliances are Asigra, Avamar, Copan Systems, Data Domain, Diligent, Exagrid and Sepaton. Vendors such as ADIC (since acquired by Quantum), Falconstor and Microsoft provide deduplication software for implementation on other vendors’ industry-standard servers or appliances.

Kevin Fiore, vice president and director of enterprise engineering at Thomas Weisel Partners in Boston, has seen the advantages of deduplication.

“We were looking to replace our tape backup environment and get rid of the problems associated with tape,” says Fiore, who uses six Data Domain DD400 Enterprise Series disk-based backup appliances.

“To get 30 days of backup data online, we were looking at having to buy 60 to 80 terabytes of disk,” Fiore says. “With Data Domain disk-based appliance, the worst we get is a compression ratio of 19-to-1. On one site we get a 39-to-1 compression ratio.”

Fiore says that deduplication is also helping him redefine how he treats his data.

“Now we can keep data online for 40 to 45 days,” Fiore says. “The data we would need to restore, the databases or Exchange data, is now online longer and the data we wouldn’t retrieve isn’t.”

Another reason for deduplicating data is to reduce the amount of data being replicated across sites for disaster recovery.

James Wonder, director of online technology for the American Institute of Physics in Melville, N.Y., backs up and replicates data to another site.

“One of the main reasons I bought Sepaton’s VTL is their roadmap for deduplication,” says Wonder, who backs up 20TB of data.

“To replicate data to another site takes a pretty big pipe. With Sepaton’s DeltaStor [deduplication], we don’t need to have a huge pipe, because we are replicating less data over time.” Sepaton’s deduplication, which resides on its S2100-ES2 VTL appliance, is in beta test and scheduled to be available in December.

Steven Bilby, director of IT for the Cherokee Nation Enterprises in Catoosa, Okla., is an Avamar customer who also uses deduplication to reduce the amount of data he backs up. He says he hopes to build replication capability to a remote disaster-recovery site in Tahlequah, Okla., next year.

“The commonality factoring reduces the amount of data we back up and replicate,” says Bilby, who is backing up 6TB of data. “Once we did the full backup and then subsequent backups, we saw a reduction in the data we were backing up of 99%.”
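The ratios and percentages quoted by vendors and users describe the same saving in two ways. A quick conversion, sketched here with two hypothetical Python helpers (the function names are illustrative only), makes the figures comparable.

```python
def ratio_to_percent_saved(ratio: float) -> float:
    """A reduction of ratio-to-1 leaves 1/ratio of the data on disk."""
    return (1 - 1 / ratio) * 100


def percent_saved_to_ratio(percent: float) -> float:
    """A saving of percent% leaves (100 - percent)% of the data on disk."""
    return 1 / (1 - percent / 100)


twenty_to_one = ratio_to_percent_saved(20)   # about 95% saved
worst_case = ratio_to_percent_saved(19)      # a little under 95% saved
best_case = ratio_to_percent_saved(39)       # a little over 97% saved
bilby_ratio = percent_saved_to_ratio(99)     # about 100-to-1
```

By this arithmetic, a 99% reduction corresponds to roughly 100-to-1, well above the 20-to-1 figure most vendors quote.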

Dedupe differentiation

Data deduplication differs from compression in that compression looks only for repeating patterns of information and reduces them. Brad O’Neill, senior analyst with the Taneja Group, offers this example: The pattern of data ‘123412341234123412341234’ would be compressed to ‘6 1234’ or 6x1234, roughly a fivefold compression of 24 digits. Deduplication would reduce the unique data initially to four digits (1234), and subsequent backups would recognize that no additional unique data was being transmitted, so nothing more would be backed up.
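O’Neill’s example can be reproduced with a short Python sketch. Both functions are hypothetical and deliberately simplistic: compress_repeats handles only a string made of one repeated pattern, and dedup_backup assumes four-character chunks and an in-memory set as the catalog.

```python
def compress_repeats(data: str) -> str:
    """Collapse a string made of one repeated pattern into 'COUNTxPATTERN'."""
    for size in range(1, len(data) // 2 + 1):
        count, remainder = divmod(len(data), size)
        if remainder == 0 and data[:size] * count == data:
            return f"{count}x{data[:size]}"
    return data  # no whole-string repetition found; leave unchanged


def dedup_backup(data: str, catalog: set, chunk_size: int = 4) -> list:
    """Return only the chunks the catalog has not seen, and record them."""
    new_chunks = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        if chunk not in catalog:
            catalog.add(chunk)
            new_chunks.append(chunk)
    return new_chunks


data = "1234" * 6                      # '123412341234123412341234'
compressed = compress_repeats(data)    # '6x1234'

catalog = set()
first_backup = dedup_backup(data, catalog)    # ['1234']
second_backup = dedup_backup(data, catalog)   # []
```

Compression pays its cost again on every backup, while the deduplicated second backup sends nothing because the catalog already holds the only unique chunk.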

Deduplication also differs from incremental backups in that deduplication backs up only the byte-level changes. In incremental backups, entire files or blocks of information are backed up when they change. For instance, in a file, a user changes the single word ‘Bob’ to ‘Steve’ and saves the file. When the system backs up this data incrementally, rather than just backing up the unique data (‘Steve’), it backs up the entire file. Data-deduplication technology would recognize that ‘Steve’ is the only unique element of the file and back up only that.
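The Bob-to-Steve contrast can be illustrated with Python’s standard difflib module. This is a sketch of the general differencing idea under assumed names (incremental_backup, changed_spans) and an invented sample sentence; it is not how any particular product works, and a real system would also record where each change belongs.

```python
import difflib


def incremental_backup(old: str, new: str) -> str:
    """File-level incremental: if anything changed, back up the whole file."""
    return new if new != old else ""


def changed_spans(old: str, new: str) -> list:
    """Differencing: return only the pieces of `new` that differ from `old`."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [
        new[j1:j2]
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag in ("replace", "insert")
    ]


old_file = "Dear Bob, the report is attached."
new_file = "Dear Steve, the report is attached."

whole_file = incremental_backup(old_file, new_file)  # the entire new file
unique_data = changed_spans(old_file, new_file)      # ['Steve']
```

The incremental path resends the full sentence, while the differencing path isolates the one changed word.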

The sizes of the catalog and cache are also important in differentiating deduplication products.

“The efficiency of deduplication technology all comes down to how the index is architected and how large it is,” O’Neill says. “For instance, Diligent spends a lot of time talking about the speed and size of its index — that it’s small and resides completely in RAM.”

Data deduplication takes place by one of two methods: in-line or postprocessing. With in-line processing, data is deduplicated as it is backed up; in postprocessing, data is deduplicated after it is backed up.
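The difference between the two methods is one of timing, which a small Python sketch can show. The function names, the SHA-256 keys and the idea of counting peak chunks on disk are assumptions for illustration, not claims about any vendor’s product.

```python
import hashlib


def chunk_key(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def inline_backup(chunks, store):
    """Deduplicate each chunk as it arrives; duplicates never reach disk."""
    for chunk in chunks:
        store.setdefault(chunk_key(chunk), chunk)
    return len(store)  # peak number of chunks held on disk


def postprocess_backup(chunks, store):
    """Land the full backup on disk first, then deduplicate in a second pass."""
    staging = list(chunks)            # raw backup written in full
    peak = len(store) + len(staging)  # disk holds everything before cleanup
    for chunk in staging:
        store.setdefault(chunk_key(chunk), chunk)
    staging.clear()                   # staging space is reclaimed afterward
    return peak


backup_stream = [b"A", b"B", b"A", b"A", b"C"]

inline_store, post_store = {}, {}
inline_peak = inline_backup(backup_stream, inline_store)
post_peak = postprocess_backup(backup_stream, post_store)
```

Both stores end up holding the same three unique chunks. In this toy model the postprocessing path briefly holds all five chunks and needs a second pass over the data.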

Analysts say there is not much difference in outcome between the two methods.

“The in-line vendors make claims about performance and scalability; the postprocessing vendors are generally making the same claims,” O’Neill says. “From everything I see, it comes down to the particular workload profile of the user. One of the disadvantages of postprocessing is it can potentially extend the time it takes to back up the data.”

Asigra, Avamar, Data Domain, Diligent, Falconstor and Microsoft all use in-line processing; Copan and Sepaton use postprocessing. ADIC can use either.

Getting rid of repetition

A variety of vendors employ data deduplication or reduction in their appliances.

Company/software | Where software runs | In-line or postprocessing implementation
ADIC/Rocksoft Blocklets | Deployed with other vendors' storage appliances | Either
Asigra/Televaulting | Windows, Linux or Unix server | In-line
Avamar/Commonality Factoring | Axion appliance | In-line
Copan Systems/future product | Revolution Appliance | Postprocessing
Data Domain/Capacity Optimized Storage (COS) | DD400 Enterprise Series or DDX Scalable COS Array | In-line
Diligent/Hyperfactor | ProtecTIER virtual tape library appliance | In-line
Falconstor/Single Instance Repository | Used on virtual tape libraries from EMC, IBM, McData, Sun | In-line
Microsoft/Single Instance Storage | Windows Storage Server R2 appliances | In-line
Sepaton/DeltaStor | S2100-ES2 virtual tape library appliance | Postprocessing


Copyright © 2006 IDG Communications, Inc.
