Skip Links

Network World

  • Social Web 
  • Email 
  • Close

(Comma separation for multiple addresses)
Your Message:

The code monkey's guide to cryptographic hashes for content-based addressing

Keeping track of large blocks of content using hash values only can be a a way to save bandwidth and build exciting new applications. Follow some important rules to use hashes correctly and securely.
By Valerie Henson, LinuxWorld.com
November 15, 2007 07:50 PM ET
  • Share/Email
  • Tweet This
  • Comment
  • Print

Prologue

It was 1996, the bandwidth between Australia and the rest of the world was miserable, and Andrew Tridgell had a problem. He wanted to synchronize source code located in Australia with source code on machines around the world, but sending patches was annoying and error-prone, and just sending all the files was painfully slow. Most people would have just waited a few years for trans-Pacific bandwidth to improve; instead Tridgell wrote rsync, the first known instance of content-based addressing (also known as content-addressed storage, or compare-by-hash), an innovation which eventually spread to software like BitTorrent, git, and many document storage products on the market.

Used properly, content-based addressing dramatically reduces bandwidth use, simplifies code, and improves security. But used when it doesn't make sense, it can reduce performance, increase bandwidth usage, and create new security risks. Figuring out when content-based addressing is good and when it is very, very bad requires an understanding of both systems programming and mathematics. What's a programmer to do? In this article, we'll lay out the considerations in language that any programmer (and even some managers) can understand.

Content-based addressing: The basics

Let's start with an example problem. Say you have an MP3 music collection, and your friend sends you a link to a new MP3 but doesn't give you the title. If it's already in your collection, you don't want to bother downloading and storing a second copy, so you want to quickly find out if it is identical to one of your other MP3s. If you only have a few MP3s, you'll just directly compare the new MP3 to each of your existing MP3s. This takes an amount of time proportional to the number of MP3s you already have, and if you have a lot of MP3s (and we know you do), it will run too slowly and you'll look for a smarter algorithm. A method for quickly finding a particular item that is nearly as old as computer science is called the hash table. There are a thousand variations on the hash table, but we'll describe a very simple example, and then use this to understand the tradeoffs of content-based addressing.

Instead of directly comparing the entire MP3 with every other MP3, we could run a function that takes the MP3 data as its input, and outputs a fixed-size signature based on the input. The goal is to generate something like a thumbnail version of a digital photo - the signature doesn't have all the information in the original, but it's a useful hint as to what the whole file contains. For example, we could just XOR together every 4 bytes together - the core of the CRC32 hash function. Hash functions are designed to generate these signatures, or hash values. The same input will always have the same hash value, and conversely, different hash values will always have different inputs. So, instead of comparing all the file data, you'll keep a table of hash values of the MP3s you already have. Then you can ask your friend to send you the CRC32 of the new MP3 and look it up in your hash table. If you don't have an MP3 with that hash value, then you know you don't have your friend's MP3, and it's worthwhile downloading.

  • Share/Email
  • Tweet This
  • Comment
  • Print

Comment
Login
Forgot your account info?
Add comment
Anonymous comments subject to approval. Register here for member benefits.
Have a NetworkWorld account? Log in here. Register now for a free account.

Videos

rssRss Feed