*Keeping track of large blocks of content using hash
values only can be a a way to save bandwidth and
build exciting new applications. Follow some
important rules to use hashes correctly and securely.*

## Prologue

It was 1996, the bandwidth between Australia and the rest of the world was miserable, and Andrew Tridgell had a problem. He wanted to synchronize source code located in Australia with source code on machines around the world, but sending patches was annoying and error-prone, and just sending all the files was painfully slow. Most people would have just waited a few years for trans-Pacific bandwidth to improve; instead Tridgell wrote `rsync`, the first known instance of content-based addressing (also known as content-addressed storage, or compare-by-hash), an innovation which eventually spread to software like BitTorrent, `git`, and many document storage products on the market.

Used properly, content-based addressing dramatically reduces bandwidth use, simplifies code, and improves security. But used when it doesn't make sense, it can reduce performance, increase bandwidth usage, and create new security risks. Figuring out when content-based addressing is good and when it is very, very bad requires an understanding of both systems programming and mathematics. What's a programmer to do? In this article, we'll lay out the considerations in language that any programmer (and even some managers) can understand.

## Content-based addressing: The basics

Let's start with an example problem. Say you have an MP3 music collection, and your friend sends you a link to a new MP3 but doesn't give you the title. If it's already in your collection, you don't want to bother downloading and storing a second copy, so you want to quickly find out if it is identical to one of your other MP3s. If you only have a few MP3s, you'll just directly compare the new MP3 to each of your existing MP3s. This takes an amount of time proportional to the number of MP3s you already have, and if you have a lot of MP3s (and we know you do), it will run too slowly and you'll look for a smarter algorithm. A method for quickly finding a particular item that is nearly as old as computer science is called the hash table. There are a thousand variations on the hash table, but we'll describe a very simple example, and then use this to understand the tradeoffs of content-based addressing.

Instead of directly comparing the entire MP3 with every other MP3, we could run a function that takes the MP3 data as its input, and outputs a fixed-size signature based on the input. The goal is to generate something like a thumbnail version of a digital photo - the signature doesn't have all the information in the original, but it's a useful hint as to what the whole file contains. For example, we could just XOR together every 4 bytes together - the core of the CRC32 hash function. Hash functions are designed to generate these signatures, or hash values. The same input will always have the same hash value, and conversely, different hash values will always have different inputs. So, instead of comparing all the file data, you'll keep a table of hash values of the MP3s you already have. Then you can ask your friend to send you the CRC32 of the new MP3 and look it up in your hash table. If you don't have an MP3 with that hash value, then you know you don't have your friend's MP3, and it's worthwhile downloading.

## Hash collisions

What happens if the hash value of the new MP3 matches one already in your database? While the same input always has the same hash value, different inputs may have identical hash values - a hash collision. For example, the 4-byte XOR of an input consisting of the two 32-bit numbers 0x12340000 and 0x00005678 will be 0x12345678, and the 4-byte XOR of 0x12345678 and 0x00000000 will also be 0x12345678. Because the number of inputs is infinite, and the number of hash values are finite, collisions are inevitable. In fact, some of your own MP3s may already have colliding hash values. The easiest way to deal with this is to make each entry in the hash table a list of MP3s that have that same hash value, along with pointers to their entire data. When you find two MP3s with the same hash value, you compare the whole files to see if they are really the same, or just have the same hash value. Now, instead of comparing with every file, you only compare the MP3 with the few files that have the same hash value - much faster than comparing directly with every single file.

Now comes the interesting part. Sometimes, comparing the file against another file even once is prohibitively expensive, as when the files are located on different machines separated by low-bandwidth networks - say you're at an Antarctic research station and your friend's MP3 has to be trickled through the shared satellite link at a few kilobits per second. (Or maybe you're just still on dial-up.) If you had her send you the hash of the MP3, and you didn't have any file with that same hash, you'd know it was worth downloading the MP3. The problem is when the hash matches up to an existing MP3 in your database - how do you know whether or not they are the same MP3 without slowly and painfully downloading the new MP3 and comparing it? Hash functions have collisions, it's a fact of life. But if you could find a hash function that almost never had collisions, then you could, with very high probability, be sure that if the hashes are the same, the files they represent are also the same. If that probability is high enough, you would feel safe assuming that a collision will never occur in practice (bravely accepting the risk of never hearing that hi-lar-i-ous Weird Al Yankovic song).

## Enter cryptographic hash functions

Fortunately, hash functions that are collision-resistant (simply put, very difficult to find inputs with same output) have already been developed for another purpose - they are called cryptographic hash functions and are used for message authentication, digital signatures, obscuring passwords, and more. As programmers, we can "borrow" these functions and use them for other purposes. So now your algorithm for detecting whether two MP3s are the same is very simple: Calculate and store the cryptographic hash of all your MP3s. When you want to know if your friend's MP3 is new, you ask for its cryptographic hash and then compare it to the hashes of all your other MP3s. If it is the same as any of the existing MP3s, you assume it's the same MP3 and don't bother downloading or storing it. Otherwise, you start the download and kill time photographing the penguins outside the research station.

This, in a nutshell, is what is variously called content-based addressing (CBA), compare-by-hash (CBH), or content-addressed storage (CAS). Just calculate the cryptographic hash of a piece of data (often at a granularity smaller than an entire file) and use the resulting hash value as the address of the data. Besides saving on network bandwidth, CBA eliminates the need for a top-down address assignment framework - everyone calculates the same addresses without having to talk to anyone else - which is useful for peer-to-peer networks like BitTorrent and distributed hash tables. It also simplifies implementation - saving state is optional, since all the addresses can be regenerated from the content itself.

Content-based addressing also can have serious downsides, depending on the application and the state of cryptographical research. First, cryptographic hashes are expensive to compute, so in many applications where the cost to transfer or compare data is low, CBA will only slow down the application. If you wanted to compare MP3s on your local hard drive, more mundane techniques, such as traditional hash tables, would be much faster than CBA in most cases. Second, the collision-resistance of a cryptographic hash function degrades over time - the cryptographic hash function that was collision-resistant when the program was first written will no longer be collision-resistant in a few years, after other cryptographers figure out how to find hash collisions quickly (hence, the term "collision-resistant" rather than "collision-free").

In the rest of this article, we'll first review cryptographic hash functions, examining their origin, computational cost, mathematical properties, and "life cycle" as they transition from newly proposed to reliably collision-resistant to broken. Then we'll explore in detail how to judge when content-based addressing will improve an application, and what effect the breaking of the chosen cryptographic hash function will have on the application. With this information under your belt, you can confidently judge when to use content-based addressing.

## Cryptographic hash functions for programmers

In this section, we'll give a quick 'n' dirty overview of the aspects of cryptographic hash functions most relevant to programmers using content-based addressing. For those interested in a deeper dive, see the Further Reading section. A wealth of detail is available on the web these days, including most publications and pre-prints, and of course, there's always Applied Cryptography (although it's woefully out of date when it comes to cryptographic hash functions and many pine for a third edition).

Most programmers are familiar with the idea of cryptographically signing a document - you use an encryption algorithm to generate a signature using the document as input. Other people can verify your signature given the proper key. The problem is that signing is often a very slow and expensive calculation, proportional to the length of the input. Instead of signing the entire document, it would be nicer to sign a short, fixed-size signature of the document. This signature would have to be easy for everyone to calculate given the original document, but at the same time be very hard for anyone to find another document that has the same signature (in technical parlance, second pre-image resistance). It should also be hard to find a document that has a given signature (pre-image resistance), or any pair of documents with the same signature, ever (collision resistance).

Cryptographic hash functions were designed to fill this and similar needs. The major tradeoffs in their design, as visible to a programmer, are:

**Complexity of calculation:**Too simple, and the hash is easily broken. Too complex, and the hash takes too long to calculate.**Size of output:**Too small, and brute-force attacks are too easy. Too big and the cost of storing and sending the hash value is too large.

Often, the better the cryptograhic hash function, the larger the hash value, and the longer it takes to calculate. Hash functions designed to detecting random corruption are usually much faster and have shorter outputs.

## Computational cost

To get a rough idea how much CPU time calculating cryptographic hashes requires, let's take a high level tour of MD5. (MD5 has been fairly thoroughly broken but makes a good example.) Dan Kaminsky describes the MD5 algorithm at a high-level in MD5 To Be Considered Harmful Someday:

In what's referred to as a Merkle-Damgard construction, MD5 starts with an arbitrary initial state 128 bits in length. 512 bits of input data are "stirred" into this 128 bit state, with a new, massively shuffled 128 bit value as the result. 512 more bits are constantly stirred in, over and over, until there's no further data. 64 bits more are appended to the data stream to explicitly reflect the amount of data being hashed with another round of MD5 being done if need be (if there wasn't enough room in a previous round to hash in that 64 bits), and the final 128 bit value after all the stirring is complete is christened the MD5 hash.

The stirring operation involves 64 rounds of operation on each 512 bits of the message (technically, 4 rounds of 16 operations each). The output of each round feeds into the next round - that is, the rounds can't be easily parallelized. Newer, more modern hash functions often use 80 or more rounds of mixing. The overall message is that a cryptographic hash function involves a significant computational cost and multiple operations for every block of data processed.

## Mathematics and cryptographic hash functions

A prime number is always a prime number, and no advance in mathematical knowledge will ever change that. But the collision-resistance of a hash function changes over time as cryptography improves. Cryptographic hash functions, are, at heart, human technology, invented by humans to confound other humans limited by current technology and mathematical knowledge. (If you doubt this, ask yourself when the worldwide cryptography community stopped considering SHA-0 a one-way function - and when the NSA did.)

While cryptographic hash functions are mathematical functions, their collision-resistance properties are not mathematically proven - and are often disproven by example, when the hash is broken. Here's what Bruce Schneier has to say about one-way functions, which include cryptographic hash functions, on page 29 of Applied Cryptography: