LinkedIn open sources its database change capture system

LinkedIn has just open sourced its Databus real-time database change capture system, and Backblaze has released version 3.0 of its open source low-cost, high-density storage solution. Gibbs is excited!

OK, lots of interesting stuff for you this week. First up, LinkedIn has open sourced a system called Databus, a real-time database change capture system that provides a "timeline-consistent stream of change capture events ... grouped in transactions, in source commit order."

This code has been in use inside LinkedIn since 2011, where it is used to "propagate profile, connection, company updates, and many other databases at LinkedIn."

Databus is designed for speed, providing end-to-end latencies of milliseconds and "throughput of thousands of change events per second per server while supporting infinite lookback capabilities and rich subscription functionality."

The infinite lookback feature is said to allow client applications to create a copy of any or all changes to the source database without placing any extra load on the source. It also apparently allows clients to stop and restart acquiring updates so that client-side processing demands or performance limitations can be handled.

The Databus architecture feeds updates from the host database connector to Databus Relays. These relays buffer the serialized change data events in memory and, on demand, send change events to Databus Clients and Databus Bootstrap Producers.

The Databus Bootstrap Producers are, in effect, archives of change data events which are stored in separate MySQL databases.


LinkedIn's open source Databus architecture

When new Databus Clients connect, they first query a Databus Bootstrap Server for "look back data" (change data that is older than the change data currently stored by the Databus Relays). Then, when a client has "caught up" with events so that it is current on changes, it switches to a Databus Relay for realtime change data events.
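
To make that hand-off concrete, here is a minimal Python sketch of the idea. It is not Databus code and does not use the Databus API: the class names, the `catch_up` function and the use of consecutive integer sequence numbers are all assumptions made for illustration, and for simplicity the archive is fed directly rather than from the relay.

```python
from collections import deque


class Relay:
    """Hypothetical relay: keeps only the newest events in a bounded in-memory buffer."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest events are evicted automatically

    def publish(self, seq, payload):
        self.buffer.append((seq, payload))

    def oldest_seq(self):
        return self.buffer[0][0] if self.buffer else None

    def events_after(self, seq):
        return [e for e in self.buffer if e[0] > seq]


class BootstrapStore:
    """Hypothetical archive: keeps every event so clients can look back arbitrarily far."""

    def __init__(self):
        self.log = []

    def archive(self, seq, payload):
        self.log.append((seq, payload))

    def events_after(self, seq):
        return [e for e in self.log if e[0] > seq]


def catch_up(last_seen_seq, relay, bootstrap):
    """Pick the source a client should read from, given the last event it has seen.

    Assumes sequence numbers are consecutive integers, so the relay can serve the
    client only if the next event the client needs is still in the relay's buffer.
    """
    oldest = relay.oldest_seq()
    if oldest is not None and last_seen_seq >= oldest - 1:
        return "relay", relay.events_after(last_seen_seq)
    return "bootstrap", bootstrap.events_after(last_seen_seq)


# Ten changes flow through a relay that can hold only the three most recent.
relay, bootstrap = Relay(capacity=3), BootstrapStore()
for seq in range(1, 11):
    relay.publish(seq, f"change-{seq}")
    bootstrap.archive(seq, f"change-{seq}")

source, events = catch_up(0, relay, bootstrap)    # brand-new client, needs everything
source2, events2 = catch_up(8, relay, bootstrap)  # nearly current client
print(source, len(events), source2, len(events2))
```

The new client is sent to the archive because the relay has long since dropped the early changes, while the nearly current client reads straight from the relay's memory and never touches the archive or the source database.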

Databus is independent of the source database but a connector is required to interface with the host database. This release provides an Oracle connector and a MySQL connector is due to be released "soon."

It would be interesting to see a high performance database like NuoDB, which I discussed a few weeks ago and which was designed as an Oracle "drop in" replacement, paired with Databus to provide a distributed, low-management-overhead database solution with realtime backup and realtime update delivery. An interesting possibility of this architecture is that clients could perform analytics on any or all of the data over any historical period without impacting the host database performance at all.

So, with this combination of databases echoing databases you'll be needing some serious storage for your non-realtime backend, right? All that analytic and historical stuff chews up disk space like there's no tomorrow.

You'll probably be looking for something in the 100TB range and, while you could go out and buy from the big guys such as NetApp or Drobo, you'll certainly be in for some sticker shock.

On the other hand, if you're like Backblaze, an online backup company, you'll find yourself thinking, "Wow, that kind of pricing will break the bank," so you might choose to do what Backblaze did: Build your own high-density storage systems.

I had a very interesting talk with Backblaze's CEO Gleb Budman, and he told me they had, indeed, looked at the big boys and worked out that storage from NetApp or Drobo would cost them 10 times the cost of the raw hard disk drives. Given that Backblaze offers unlimited online storage (still priced at a measly $5 per month since they opened the doors!), that wasn't a formula that would work for them.

After some head scratching, Backblaze decided to "roll their own" and designed a workhorse storage platform from off-the-shelf components. They then did something remarkable: In 2009 they open sourced the design!

They initially thought that a handful of people would be interested but, to their surprise, there was a huge amount of interest (their blog announcing the open source design got 1 million hits) because, come on, who wouldn't get excited about 67TB of storage for less than $8,000? (OK, perhaps your family and friends might not be but you get it, right?)

Now, roll forward three years and Backblaze has just published its latest iteration of what they call the Backblaze Storage Pod: Storage Pod 3.0.


Backblaze's open source Storage Pod 3.0 design

This latest version packs up to a total of 45 drives on nine five-port multiplier backplanes, along with a Supermicro MBD-X9SCL-F motherboard and an Intel Core i3-2100 processor with 8GB of RAM, into a 4U custom rackmount case with a 760 watt power supply, six fans, and two Gigabit Ethernet ports. Fully stocked with 4TB drives, which can be had for $195 each, a 180TB pod built by you will set you back about $11,000 or $59.54 per TB. Now, that's cheap!
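
If you want to check that arithmetic yourself, here is a quick back-of-the-envelope sketch in Python using only the figures quoted above. The variable names are mine, and the "everything else" number is simply what is left over after the drives; it is an inference, not a figure Backblaze supplied.

```python
DRIVES_PER_POD = 45
DRIVE_CAPACITY_TB = 4
DRIVE_PRICE_USD = 195
QUOTED_COST_PER_TB_USD = 59.54

capacity_tb = DRIVES_PER_POD * DRIVE_CAPACITY_TB      # raw capacity of a full pod
drive_cost_usd = DRIVES_PER_POD * DRIVE_PRICE_USD     # what the drives alone cost

# Total pod cost implied by the quoted per-TB price (the "about $11,000").
implied_pod_cost_usd = QUOTED_COST_PER_TB_USD * capacity_tb

# Whatever is left after the drives: case, backplanes, motherboard, power and so on.
implied_non_drive_cost_usd = implied_pod_cost_usd - drive_cost_usd

print(f"Capacity: {capacity_tb} TB")
print(f"Drives: ${drive_cost_usd:,}")
print(f"Implied pod total: ${implied_pod_cost_usd:,.0f}")
print(f"Implied non-drive hardware: ${implied_non_drive_cost_usd:,.0f}")
```

In other words, the drives account for roughly four-fifths of the price of a pod, which is exactly the point of the design.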

Should you think that using 4TB drives is gilding the lily, Backblaze notes: "While it looks like the cost of a 4TB drive system is more expensive [than using 3TB drives], when you factor in rack space, electricity, installation labor, etc., the long-term cost for Backblaze leans towards using 4TB drives. Our monthly cost for a full rack of Storage Pods with 3TB drives is $0.63 per TB, while a full rack of Storage Pods with 4TB drives is $0.47 per TB. When you factor all the costs together, it takes about five months for us to recover the extra cost encountered when building 4 TB based Storage Pods."

It's up to you what you want to run on the processor to manage and deliver the storage, but you might go the same way Backblaze has: a customized version of Debian Linux with the Ext4 filesystem, to which they have added their own proprietary storage management software that allows new pods to be automatically integrated into their system (you'll have to figure that bit out for yourself).


Backblaze's online backup service currently uses 40 petabytes of custom storage

The backup service offered by Backblaze is really impressive. A small download for Windows or OS X, a fast and painless installation, and you're backing up. Backblaze doesn't throttle your backups or recoveries in any way; they'll transfer data either way just as fast as your connection allows with encryption, de-duplication and compression.

Need to restore? You can download files via Backblaze's Web interface or have them ship you a 64GB USB drive via FedEx for $99, or a 3TB hard drive for $189 (you get to keep the drive). Versioning is supported along the same lines as Apple's Time Machine (incremental backups with hourly changes for the past 24 hours, daily changes for the past month, weekly changes for the quarter, and quarterly changes for the year with restoration of all duplicated files to their correct locations).
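
That tiered versioning scheme is easy to picture as a lookup from a version's age to how finely it is kept. The sketch below is only an illustration of the tiers as described above, not Backblaze's or Apple's implementation; the function name and the exact boundaries (30 days for a month, 91 for a quarter, 365 for a year) are my assumptions.

```python
from datetime import timedelta

# Tier boundaries are assumptions: "month" = 30 days, "quarter" = 91 days, "year" = 365 days.
RETENTION_TIERS = [
    (timedelta(hours=24), "hourly"),
    (timedelta(days=30), "daily"),
    (timedelta(days=91), "weekly"),
    (timedelta(days=365), "quarterly"),
]


def retention_granularity(age):
    """Return how finely versions of this age are kept, or None if older than a year."""
    for limit, granularity in RETENTION_TIERS:
        if age <= limit:
            return granularity
    return None


for age in (timedelta(hours=3), timedelta(days=10), timedelta(days=60), timedelta(days=200)):
    print(age, "->", retention_granularity(age))
```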

For its elegance and cost-effectiveness, the Backblaze online backup service is remarkable and gets a Gearhead rating of 5 out of 5.

So, did the Pod hardware design work out for Backblaze? You bet! The company was bootstrapped by the founders and last year, after five years in business and with a storage capacity of an impressive 40 petabytes, Backblaze decided it finally needed to raise a VC round for $5 million to accelerate their growth ... and that wasn't nearly as much as VCs were reportedly willing to fund them.

Could the Pod 3.0 design work for you? Sure. It's not designed for your superfast primary storage, but as a secondary storage solution it's way cool and very cost-effective. You can buy just the Pod 2.0 cases from 45 Drives for $872 each or completely assembled systems with a somewhat upgraded specification for $5,395 each (Pod 3.0 versions are due in the near future).

For any company looking at optimizing its storage costs, the Backblaze Pod 3.0, whether home-built or purchased from the likes of 45 Drives, is a very compelling way to go. If you're using any of these systems or considering doing so, let me know ...

Gibbs is always short of storage in Ventura, Calif. Store your thoughts with him and follow him on Twitter (@quistuipater) and on Facebook (quistuipater). And check out the Tech Predictions blog.

Copyright © 2013 IDG Communications, Inc.