The down and dirty on data deduplication

At its core, data deduplication is a simple concept. Stored data is parsed for duplicate sequences, and when duplicates are found, a pointer to the first instance is inserted in place of the duplicated data.

For example, with a product that supports data deduplication, a backup of an Exchange server in which 20 recipients have received the same attachment stores only the first instance of that attachment; all the other copies point back to it.
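The attachment example can be sketched in a few lines of Python. This is a minimal illustration of the idea, not any vendor's implementation: each object is reduced to a content hash, identical content is stored once, and later copies become pointers to the first instance.

```python
import hashlib

store = {}      # content hash -> actual bytes (stored once)
pointers = []   # one entry per logical object, e.g. per attachment

def save(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    if key not in store:      # first instance: store the bytes
        store[key] = data
    pointers.append(key)      # every copy: store only a pointer
    return key

attachment = b"...contents of the shared attachment..."
for _ in range(20):           # 20 recipients receive the same file
    save(attachment)

print(len(pointers))          # 20 logical copies
print(len(store))             # 1 physical copy
```

Twenty logical copies exist, but only one set of bytes is kept on disk; the other 19 cost only the size of a pointer.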

Under this scheme, the many parts of different files that are similar need to be stored only once. For instance, if the first few lines of a document contain the path name of the document, that name will generally be the same for all the documents in a folder.

If the path name is 40 characters long and the first 29 are the same for all of the files, then in every file after the first, those 29 bytes are replaced with a pointer. Many file types also share structural elements, and a PowerPoint or PDF version of a document may contain the same text as the original Word file, so the same strings of text recur across many documents.
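The savings from that shared prefix are easy to estimate. Assuming a folder of 100 such files and an 8-byte pointer (both figures are illustrative; the article specifies neither), the back-of-envelope arithmetic looks like this:

```python
# Back-of-envelope savings from a shared 29-byte prefix
# (file count and pointer size are assumptions for illustration)
files = 100
shared_prefix = 29   # identical leading bytes in every file
pointer_size = 8     # bytes needed to reference the first instance

# Every file after the first replaces 29 bytes with an 8-byte pointer.
saved = (files - 1) * (shared_prefix - pointer_size)
print(saved)         # 2079 bytes saved across the folder
```

A 29-byte string is a trivial win on its own; the point is that a backup set contains millions of such recurring strings, and the savings compound.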

Deduplication can occur at the file level, the block level or the sub-block level, which some vendors call the "blocklet" and others the "chunklet." The smaller the chunk of data, the greater the effect of deduplication, though at the expense of additional processing and larger index databases.

If deduplication occurs at the file level, an entire file must be identical to another to be deduplicated. If running at the block level, a whole block of data -- whether 512 bytes or 4,096 bytes -- must be identical for the pointer to be placed. If running at the blocklet level, as few as a couple dozen identical characters can be replaced with a pointer, producing much higher effective compression ratios. The point of diminishing returns is reached when the space used to index and process these short strings becomes greater than the savings from replacing duplicate strings with short pointers.
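The effect of granularity can be demonstrated with a small sketch. The fixed-size chunking and the sample data below are assumptions for illustration (real products use more sophisticated, often variable-size chunking): two files share a 4KB prefix and a 64-byte run inside otherwise different tails, and the measured dedup ratio improves as the chunk shrinks.

```python
import hashlib

def dedup_ratio(files, chunk_size):
    """Logical bytes divided by unique bytes stored, when each file is
    cut into fixed-size chunks and duplicate chunks are stored once."""
    total = sum(len(f) for f in files)
    unique = {}
    for data in files:
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            unique[hashlib.sha256(chunk).hexdigest()] = len(chunk)
    return total / sum(unique.values())

# Two files sharing a 4KB prefix and a 64-byte run in different tails.
prefix = b"".join(i.to_bytes(2, "big") for i in range(2048))  # 4,096 bytes
shared_tail = b"0123456789abcdef" * 4                         # 64 bytes
a = prefix + shared_tail + b"-one"
b = prefix + shared_tail + b"-two"

print(round(dedup_ratio([a, b], 1 << 20), 2))  # 1.0: file level, files differ
print(round(dedup_ratio([a, b], 512), 2))      # 1.97: blocks find the prefix
print(round(dedup_ratio([a, b], 32), 2))       # 2.0: blocklets find the tail too
```

At the file level the two files differ, so nothing is saved; 512-byte blocks capture the shared prefix; 32-byte "blocklets" also capture the shared tail, at the cost of an index entry per 32 bytes, which is the diminishing-returns tradeoff described above.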

In addition to the level at which deduplication is performed, the other major difference among the virtual tape library devices tested is whether deduplication occurs in-line, as data is moved, or after the fact, in a postprocessing pass.

Both approaches have the potential to cause problems. In-line processing can limit overall network throughput, while postprocessing uses more disk space initially, until the deduplication process is complete.



Copyright © 2007 IDG Communications, Inc.