The down and dirty on data deduplication

By Logan Harbaugh, Network World
June 04, 2007 12:06 AM ET
At its core, data deduplication is a simple concept. Stored data is parsed for duplicate sequences, and when duplicates are found, a pointer to the first instance is inserted in place of the duplicated data.

For example, using a product that supports data deduplication, a backup of an Exchange server in which 20 recipients have received the same attachment would store only the first instance of that attachment with all others pointing back to it.
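The attachment example can be sketched in a few lines of Python. This is only an illustration: the article does not say how any particular product detects duplicates, so the use of a SHA-256 digest as the "pointer", the `SingleInstanceStore` class, and the mailbox names are all assumptions made for the sketch.

```python
import hashlib


class SingleInstanceStore:
    """Toy store that keeps one physical copy of each distinct payload."""

    def __init__(self):
        self.blobs = {}     # digest -> payload bytes, stored once
        self.pointers = {}  # item name -> digest (the "pointer")

    def put(self, name, payload):
        # Assumption: identical digests are treated as identical content.
        digest = hashlib.sha256(payload).hexdigest()
        # Keep the payload only the first time this content is seen.
        self.blobs.setdefault(digest, payload)
        self.pointers[name] = digest

    def get(self, name):
        # Follow the pointer back to the single stored instance.
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
attachment = b"quarterly-report" * 1000  # stand-in for the shared attachment

# Twenty recipients each hold "their own" copy of the same attachment.
for i in range(20):
    store.put(f"mailbox-{i}/report.ppt", attachment)

logical_bytes = 20 * len(attachment)   # what a naive backup would write
physical_bytes = store.stored_bytes()  # what the toy store actually keeps
print(logical_bytes, physical_bytes)
```

Every mailbox entry can still be read back in full through `get`, but only one copy of the attachment occupies space. A production system would also need to handle details this sketch ignores, such as deleting the shared copy only after the last pointer to it is gone.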

Under this scheme, the many parts of different files that are similar need to be stored only once. For instance, if the first few lines of a document contain the path name of the document, that name will generally be the same for all the documents in a folder.

If the path name is 40 characters long, and the first 29 are the same for all of the files, those 29 bytes are replaced with a pointer in every file after the first one. The same strings of text recur in many documents for two reasons: many types of files have structural elements that are similar from file to file, and PowerPoint or PDF documents may contain the same text as the original Word document.
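To put rough numbers on the path-name example, the short calculation below compares the bytes needed with and without deduplication. The 40-character path and 29-character shared prefix come from the example above; the folder size of 100 files and the 4-byte pointer are hypothetical values chosen only for illustration, and one character is counted as one byte, as in the example.

```python
def path_bytes_stored(num_files, path_len=40, shared_prefix=29, pointer_size=4):
    """Return (bytes without dedup, bytes with dedup) for the path names.

    pointer_size is a hypothetical figure; real products vary.
    """
    without_dedup = num_files * path_len
    # The first file keeps its full path. Every later file stores a
    # pointer to the shared prefix plus its own unique tail.
    unique_tail = path_len - shared_prefix
    with_dedup = path_len + (num_files - 1) * (pointer_size + unique_tail)
    return without_dedup, with_dedup


without_dedup, with_dedup = path_bytes_stored(100)
print(without_dedup, with_dedup)  # 4000 1525
```

Under these assumptions the path names shrink from 4,000 bytes to 1,525. The saving depends on the pointer being smaller than the sequence it replaces: if the pointer were as large as the 29 shared bytes, nothing would be gained.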
