The down and dirty on data deduplication

At its core, data deduplication is a simple concept. Stored data is parsed for duplicate sequences, and when duplicates are found, a pointer to the first instance is inserted in place of the duplicated data.

For example, with a product that supports data deduplication, a backup of an Exchange server in which 20 recipients have received the same attachment stores only the first instance of that attachment; all the others point back to it.
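The attachment scenario can be sketched in a few lines of Python. This is only an illustration, not how any particular backup product works: the `SingleInstanceStore` class, the mailbox names, and the attachment contents are all made up, and using a SHA-256 digest as the "pointer" is an assumption.

```python
import hashlib


class SingleInstanceStore:
    """Toy store that keeps one copy of each distinct payload (hypothetical)."""

    def __init__(self):
        self.blobs = {}     # digest -> payload bytes, stored once
        self.pointers = {}  # item name -> digest, standing in for the pointer

    def put(self, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        # Only the first instance is stored; later copies just point to it.
        self.blobs.setdefault(digest, payload)
        self.pointers[name] = digest

    def get(self, name):
        # Follow the pointer back to the single stored copy.
        return self.blobs[self.pointers[name]]


store = SingleInstanceStore()
attachment = b"quarterly report " * 1000  # made-up attachment contents

# The same attachment lands in 20 mailboxes.
for i in range(20):
    store.put(f"mailbox{i:02d}/report.doc", attachment)

logical_bytes = 20 * len(attachment)
stored_bytes = sum(len(blob) for blob in store.blobs.values())
print(len(store.blobs), logical_bytes, stored_bytes)
```

Every mailbox can still read its attachment through `get`, but the payload is held only once; the store pays for 20 small pointers instead of 20 full copies.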

Under this scheme, the many parts of different files that are similar need to be stored only once. For instance, if the first few lines of a document contain the path name of the document, that name will generally be the same for all the documents in a folder.

If the path name is 40 characters long and the first 29 characters are the same for all of the files, then in every file after the first, those 29 bytes are replaced with a pointer. The same strings of text recur in many documents for two reasons: many types of files have structural elements that are similar from file to file, and PowerPoint or PDF documents may contain the same text as the original Word document.
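The path-name arithmetic can be worked through with a small sketch. The folder and file names below are invented so that each path is 40 characters with a 29-character shared prefix, one byte per character is assumed, and the 4-byte pointer size is purely an assumption, since no figure is given here.

```python
import os.path

POINTER_SIZE = 4  # assumed pointer size in bytes (not a figure from the article)

# Hypothetical 29-character folder and 11-character file names.
folder = "/srv/shares/projects/2007/q3/"
names = ["budget1.doc", "agenda7.doc", "minutes.doc"]

paths = [folder + name for name in names]
assert all(len(p) == 40 for p in paths)

# The longest run of leading characters shared by every path.
prefix = os.path.commonprefix(paths)

# Every file after the first swaps the shared bytes for one pointer.
saved = (len(paths) - 1) * (len(prefix) - POINTER_SIZE)
print(len(prefix), saved)
```

With three files, two of them give up 29 bytes and gain a pointer each. The saving per duplicated string is small, which is why the payoff depends on how often such strings recur across a large data set.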

Harbaugh is a freelance writer and IT consultant in Redding, Calif. He can be reached at logan@lharba.com.

Copyright © 2007 IDG Communications, Inc.