Network World - Backing up servers and workstations to tape can be a cumbersome process, and restoring data from tape even more so. While backing up to disk-based storage is faster and easier, and probably more reliable, it can also be more expensive.
One way to get the best of both worlds is to back up to disk-based storage that uses deduplication, which increases efficiency by storing only one copy of each unique piece of data.
While the process originally worked at the file level, many products now work at the block or sub-block (chunk) level, which means that even files that are only partly identical can be deduplicated, saving the space their common parts would otherwise consume.
For instance, say someone opens a document, makes a few changes, then sends the new version to a dozen people. With file-level deduplication, the old and new versions are distinct files, so both are stored in full, though only one copy of the new version is kept no matter how many recipients save it. With block-level or sub-block-level deduplication, only the first document plus the blocks that changed between the two versions are stored.
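The block-level mechanism can be sketched in a few lines of Python. This is a minimal illustration, not taken from any of the reviewed products: blocks are keyed by their SHA-256 hash, and a "recipe" of hashes is all that is kept per file.

```python
import hashlib

def dedup_store(data: bytes, chunk_size: int, store: dict) -> list:
    """Split data into fixed-size blocks and keep only blocks not already
    in `store` (keyed by SHA-256). Returns the list of block hashes (the
    "recipe") needed to rebuild the data later."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        block = data[i:i + chunk_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # store the block only if unseen
        recipe.append(digest)
    return recipe

def rebuild(recipe: list, store: dict) -> bytes:
    return b"".join(store[d] for d in recipe)

# A document and a lightly edited copy: 12 distinct 1KB blocks, one changed.
store = {}
original = b"".join(bytes([i]) * 1024 for i in range(12))
edited = original[:5 * 1024] + b"\xff" * 1024 + original[6 * 1024:]

r1 = dedup_store(original, 1024, store)
r2 = dedup_store(edited, 1024, store)
# Storing the edited copy added only the one changed block, not all 12.
assert len(store) == 13
assert rebuild(r1, store) == original and rebuild(r2, store) == edited
```

Storing the edited copy costs one block of extra space rather than a full second copy, which is exactly the saving described above.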
There is some debate about the optimum granularity: file-level deduplication is the least efficient, block-level is better, and sub-block (chunk) level better still. However, the smaller the chunks, the more processing is required, and the bigger the indexes that keep track of duplicates become. Some systems tune this trade-off by using variable-size chunks, depending on the type of data being stored.
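Variable-size (content-defined) chunking also solves a problem fixed-size blocks have: inserting a few bytes shifts every block boundary after the edit, so nothing downstream deduplicates. The sketch below illustrates the idea under simplified assumptions; the rolling-style hash is a toy, where real systems use Rabin or Gear fingerprints.

```python
import random

def cdc_chunks(data: bytes, avg_bits: int = 11,
               min_size: int = 512, max_size: int = 8192) -> list:
    """Variable-size (content-defined) chunking: cut wherever a hash of
    recent bytes matches a boundary pattern, so boundaries follow the
    content itself. Average chunk size is roughly 2**avg_bits bytes."""
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # old bytes shift out after ~32 steps
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Insert a few bytes near the front: fixed-size blocks downstream would all
# shift and stop deduplicating, but content-defined boundaries re-synchronize.
random.seed(1)
data = bytes(random.randrange(256) for _ in range(100_000))
before = cdc_chunks(data)
after = cdc_chunks(data[:1000] + b"EDIT" + data[1000:])
shared = set(before) & set(after)
assert len(shared) > len(before) // 2   # most chunks are still identical
```

The cost of this resilience is the extra hashing per byte and a larger index, which is the processing/index trade-off described above.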
The good news is that deduplication works well - in our tests, every product was able to create a second copy of a volume using less than 1% additional space, and to back up a copy of the test volume with 4,552 changed files totaling 31.7GB using no more than 32GB of additional space, and in some cases a good deal less.
Deduplication was originally used only for backups - since backups tend to be run regularly and usually contain mostly the same data as the last backup, very high efficiencies can be obtained with deduplication. Now, however, deduplication is beginning to be seen in primary storage and other applications as well, such as the deduplication of snapshots and replication.
There are two main types of deduplication, in-line and post-processing. In-line examines data as it is sent to the storage system and stores only what is not already there. Post-processing stores incoming data immediately and then scans all the data on the system at regular intervals to find and remove duplicate chunks.
In-line requires less storage, while post-processing requires a 'landing area' where data can be stored until it is deduplicated. On the other hand, because it must keep up with high-speed streams of data, in-line requires considerably more processing power, which is expensive, while storage space is relatively cheap. Post-processing might be scheduled once a day, after the backup window closes - and since backups are typically run during the periods of lowest activity, that often means the start of the business day. Since the deduplication storage isn't typically used for anything other than backups, this doesn't impact users.
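The contrast between the two approaches can be sketched as follows. Both classes are illustrative only, with an in-memory dictionary standing in for a real backup appliance's chunk store and index.

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

class InlineDedup:
    """In-line: hash each chunk on arrival and write only unseen chunks.
    No landing area needed, but every incoming stream pays the hashing
    and index-lookup cost at backup speed."""
    def __init__(self):
        self.store = {}                  # fingerprint -> unique chunk
    def write(self, chunks):
        for c in chunks:
            self.store.setdefault(fingerprint(c), c)

class PostProcessDedup:
    """Post-processing: land raw data at full speed, then deduplicate
    later (e.g. once a day, after the backup window closes)."""
    def __init__(self):
        self.landing = []                # raw, possibly duplicated chunks
        self.store = {}
    def write(self, chunks):
        self.landing.extend(chunks)      # fast path: no hashing at all
    def run_dedup_pass(self):            # scheduled for off-hours
        for c in self.landing:
            self.store.setdefault(fingerprint(c), c)
        self.landing.clear()

backup1 = [b"alpha" * 200, b"beta" * 200, b"gamma" * 200]
backup2 = [b"alpha" * 200, b"beta" * 200, b"delta" * 200]  # one chunk changed

inline = InlineDedup()
inline.write(backup1); inline.write(backup2)
assert len(inline.store) == 4            # duplicates never hit disk

post = PostProcessDedup()
post.write(backup1); post.write(backup2)
assert len(post.landing) == 6            # landing area holds everything
post.run_dedup_pass()
assert len(post.store) == 4 and not post.landing
```

Both end up storing the same four unique chunks; the difference is whether the hashing cost is paid at backup speed (in-line) or deferred to a scheduled pass at the price of temporary landing-area space (post-processing).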