Deduplication is arguably the biggest advancement in backup technology in the last two decades. It is single-handedly responsible for enabling the shift from tape to disk for the bulk of backup data, and its popularity only increases with each passing day. Understanding the different kinds of deduplication, also known as dedupe, is important for anyone evaluating backup technology.

What is data deduplication?

Dedupe is the identification and elimination of duplicate blocks within a dataset. It is similar to compression, but where compression only finds redundant data within a single file, deduplication can find redundant blocks between files of different types, in different directories, even on different servers in different locations.

For example, a dedupe system might identify the unique blocks in a spreadsheet and back them up. If you update the spreadsheet and back it up again, it should identify the segments that have changed and back up only those. Then if you email it to a colleague, it should recognize the same blocks in your Sent Mail folder, in their Inbox, and even on their laptop's hard drive if they save it locally. It does not need to back up these additional copies of the same segments; it simply records their location.

How does deduplication work?

The usual way dedupe works is that the data to be deduped is chopped up into what most vendors call chunks. A chunk is one or more contiguous blocks of data.
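To make the chunking step concrete, here is a minimal sketch (not taken from any particular product) of the simplest possible scheme, fixed-size chunking; as the next paragraph notes, real dedupe products use far more sophisticated, usually variable-size, boundaries:

```python
def chunk_fixed(data: bytes, chunk_size: int = 8 * 1024 * 1024):
    """Split a byte stream into fixed-size chunks (8 MB by default)."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]
```

Fixed-size chunking is cheap but fragile: inserting a single byte near the start of a file shifts every subsequent chunk boundary, which is one reason products invest so heavily in smarter boundary selection.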
Where and how the chunks are divided is the subject of many patents, but suffice it to say that each product creates a series of chunks that are then compared against all chunks previously seen by a given dedupe system.

The comparison works by running each chunk through a deterministic cryptographic hashing algorithm, such as SHA-1 or SHA-256, which produces a value called a hash. For example, if you enter "The quick brown fox jumps over the lazy dog" into a SHA-1 hash calculator, you get the following hash value: 2FD4E1C67A2D28FCED849EE1BB76E7391B93EB12. (You can try this yourself here: https://passwordsgenerator.net/sha1-hash-generator/.)

If the hashes of two chunks match, the chunks are considered identical, because even the smallest change to a chunk changes its hash. A SHA-1 hash is only 160 bits, so if storing a 160-bit hash lets you skip an 8 MB chunk you have already stored, you save almost 8 MB every time that chunk is backed up again. This is why dedupe is such a space saver.

Target deduplication

Target dedupe is the most common type of dedupe sold today. The idea is that you buy a target dedupe disk appliance and send your backups to its network share, or to its virtual tape drives if the product is a virtual tape library (VTL). The chunking and comparison steps are all done on the target; none is done on the source. This lets you get the benefits of dedupe without changing your backup software.

This incremental approach allowed many companies to switch from tape to disk as their primary backup target. Most customers copied the backups to tape for offsite purposes. Some advanced customers with larger budgets used the replication abilities of these target dedupe appliances to replicate their backups offsite. A good dedupe system can reduce the size of a typical full backup by 99%, and the size of an incremental backup by 90%, making replication of all backups possible. (Within reason, of course.
Not everyone has enough bandwidth to handle this level of replication.)

Source deduplication

Source dedupe happens on the backup client – at the source – hence the name source, or client-side, dedupe. The chunking process happens on the client, which then passes each chunk's hash to the backup server for the lookup. If the backup server says a given chunk is unique, the chunk is transferred to the backup server and written to disk. If the backup server says a given chunk has been seen before, it doesn't even need to be transferred. This saves both bandwidth and storage space.

One criticism of source dedupe is that creating the hashes is a resource-intensive operation requiring a lot of CPU power. While this is true, it is generally offset by a significant reduction in the CPU needed to transfer the backup, since more than 90% of all chunks will be duplicates on any given backup.

The bandwidth savings also allow source dedupe to run where target dedupe cannot. For example, it allows companies to back up laptops and mobile devices, all of which connect over the Internet. Backing up such devices with a target dedupe system would require an appliance local to each device being backed up. This is why source dedupe is the preferred method for remote backup.

There aren't as many installations of source dedupe in the field as there are of target dedupe, for several reasons. One reason is that target dedupe products have been on the market, and stable, longer than most source dedupe products. But perhaps the biggest reason is that target dedupe can be implemented incrementally (i.e., keeping the same backup software and just changing the target), whereas source dedupe usually requires a wholesale replacement of your backup system.
Finally, not all source-dedupe implementations are created equal, and some had a bit of a rocky road along the way.

Deduplication pluses and minuses

The main advantage of target dedupe is that you can use it with virtually any backup software, as long as it is one the appliance supports. The downside is that you need an appliance everywhere you're going to back up, even if it's just a virtual appliance. The main advantage of source dedupe is the opposite: you can back up from literally anywhere. That flexibility can create situations where backup speeds meet your needs but restore speeds don't, so be sure to take restore performance into consideration.
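The hash-and-lookup mechanism that target and source dedupe share can be illustrated with a toy content-addressed store. This is a sketch for illustration only: `DedupeStore` is a made-up name, the index here is an in-memory dictionary, and real products persist the index and operate at vastly larger scale.

```python
import hashlib

class DedupeStore:
    """Toy chunk store: each unique chunk is written exactly once."""

    def __init__(self):
        self.chunks = {}  # SHA-1 hex digest -> chunk bytes

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in self.chunks:  # new chunk: store the data
            self.chunks[digest] = chunk
        return digest                  # duplicate: record only the hash

store = DedupeStore()
refs = [store.put(c) for c in (b"chunk-A", b"chunk-B", b"chunk-A")]
# Three chunks were "backed up," but only two unique chunks consume space.
```

Running `hashlib.sha1(b"The quick brown fox jumps over the lazy dog").hexdigest()` reproduces (in lowercase) the hash value shown earlier in this article.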