Data warehouse appliances and columnar DBMS

When I listed 5 data management technologies you probably should know more about, two of them were data warehouse appliances and columnar DBMS. And that was before today's announcement that Microsoft is buying data warehouse appliance vendor DATAllegro. Now those technologies are even more important. They make data warehousing vastly more practical and affordable, all across the cost spectrum - people, system purchase/maintenance, even electric power. And so they enable new types of analysis, such as the worldwide rush of telecom companies to actually put their whole call detail records (CDRs) directly online. The result is a classic market "disruption," in which the established vendors (e.g., Microsoft, Oracle) are getting increasingly marginalized by upstarts, and the only way they can keep pace is to compete with the newcomers on their own terms (e.g., by acquiring one and going head-to-head).

The main technological points behind the disruption are:

Disk rotation speed is the big bottleneck, and therefore OLTP-style random database access is terrible for data warehousing. Hard disks were introduced in 1956, spinning 1200 times per minute. Half a century later, their rotational speed has only increased by a factor of 12.5. If you want to do data warehousing with a row-based architecture, you need to optimize it very differently (more table scans and sequential reads, fewer random seeks) than is appropriate for online transaction processing.
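A quick back-of-envelope calculation makes the point. The hardware figures below (15,000 RPM drive, seek time, transfer rate, row width) are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope comparison of random vs. sequential disk access.
# All hardware numbers below are illustrative assumptions, not benchmarks.

ROTATIONAL_SPEED_RPM = 15_000   # assumed fast enterprise drive
AVG_SEEK_MS = 3.5               # assumed average seek time
SEQ_THROUGHPUT_MB_S = 120       # assumed sequential read rate
ROW_SIZE_BYTES = 200            # assumed average row width
NUM_ROWS = 10_000_000           # rows a big analytic query must touch

# Rotational latency: on average, half a revolution per random read.
half_rotation_ms = (60_000 / ROTATIONAL_SPEED_RPM) / 2

# OLTP-style random access: one seek plus half a rotation per row
# (generously ignoring caching).
random_secs = NUM_ROWS * (AVG_SEEK_MS + half_rotation_ms) / 1000

# Warehouse-style table scan: stream all the rows sequentially.
scan_secs = (NUM_ROWS * ROW_SIZE_BYTES) / (SEQ_THROUGHPUT_MB_S * 1_000_000)

print(f"random access: ~{random_secs / 3600:.1f} hours")
print(f"table scan:    ~{scan_secs:.0f} seconds")
```

Even with these rough numbers, fetching ten million rows one seek at a time takes hours, while scanning them sequentially takes seconds - which is why warehouse-oriented row stores favor scans over indexes.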

Functional database sizes (for identical user data) vary greatly. To get decent data warehouse performance at all, Oracle et al. did a great job of developing clever indexing schemes. But those take up a lot of space on disk; a data warehouse's functional size can easily be 6X or more the amount of actual user data it contains. By way of contrast, the upstarts can have 2-5X compression (in the case of row stores) or 5-10X+ (in the case of column stores), along with having many fewer indexes. The result can be functional databases that are much smaller than the raw data. Huge differences in disk use translate to major differences in equipment cost, power consumption, and - since disk access is the big bottleneck in large queries - warehouse performance itself.
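The arithmetic behind those ratios is simple. Using the 6X index expansion and the compression ranges cited above (the 10 TB of user data is an assumed example):

```python
# Rough "functional database size" arithmetic. The 6X index expansion
# and the compression ranges come from the text; 10 TB of user data is
# an assumed example.

user_data_tb = 10

traditional = user_data_tb * 6     # heavy indexing: 6X expansion
row_store   = user_data_tb / 3.5   # midpoint of 2-5X compression
col_store   = user_data_tb / 7.5   # midpoint of 5-10X+ compression

print(f"traditional row store + indexes: {traditional:.0f} TB")
print(f"compressed row store:            {row_store:.1f} TB")
print(f"compressed column store:         {col_store:.1f} TB")
```

A 60 TB footprint versus a 1-3 TB one is the difference between a room full of disk arrays and a single rack - and, since disk access is the bottleneck, a comparable difference in query speed.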

When it comes to data warehousing, MPP (Massively Parallel Processing)/shared-nothing architectures blow away shared-everything ones. But Oracle and SQL Server are shared-everything, while Teradata and almost all the newer startups (e.g., Netezza, DATAllegro, Vertica, ParAccel, Infobright, Greenplum, Kognitio) are shared-nothing.
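The core idea of shared-nothing is easy to sketch: rows are hash-partitioned on a key so that each node owns a disjoint slice of the data on its own disks, scans its slice in parallel, and ships only small partial results to a coordinator. The node count, toy hash function, and CDR-flavored rows below are all invented for illustration:

```python
# Minimal sketch of shared-nothing (MPP) data placement and aggregation.
# Node count, hash function, and row data are invented for illustration.
from collections import defaultdict

NUM_NODES = 4

def node_for(key: str) -> int:
    # Deterministic toy hash so a given key always lands on the same
    # node; real systems use much better hash functions.
    return sum(key.encode()) % NUM_NODES

# Distribute rows: each node owns a disjoint slice, on its own disks.
nodes = defaultdict(list)
for call_id, duration in [("cdr-001", 42), ("cdr-002", 17), ("cdr-003", 95)]:
    nodes[node_for(call_id)].append((call_id, duration))

# Each node aggregates its slice locally; the coordinator merges the
# small partial results - no shared disk or memory required.
total = sum(sum(d for _, d in rows) for rows in nodes.values())
print(total)  # 154
```

Because nothing is shared, adding nodes adds both disk bandwidth and CPU in proportion - which is exactly what scan-heavy warehouse queries need.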

Columnar architectures have interesting benefits. Traditional relational databases arrange data on disk in rows. For updates, that's clearly the right way to go, because when you make a change you only need to make it in one place (transactions of course aside). But when doing queries, you have to retrieve a lot of information you don't want along with all the stuff that you do. What's more, as I noted above, columnar vendors find it easier to get great compression results than row-based vendors do.
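A toy example shows why columns compress so well: values within one column are homogeneous and often repetitive, so even simple run-length encoding collapses them. The table data below is invented for illustration:

```python
# Toy illustration of row vs. column layout and why columns compress.
# The table contents are invented example data.

rows = [("NYC", "voice"), ("NYC", "voice"), ("NYC", "sms"), ("LA", "sms")]

# Row layout stores each tuple together (good for single-record updates).
# Column layout pivots the table so each column's values sit contiguously.
columns = list(zip(*rows))

def rle(values):
    """Run-length encode a sequence: [(value, run_length), ...]."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

for col in columns:
    print(rle(col))
# [('NYC', 3), ('LA', 1)]
# [('voice', 2), ('sms', 2)]
```

Real column stores use fancier encodings than run-length, but the principle is the same: homogeneous, sorted, repetitive column data squeezes down far better than interleaved rows do.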

I've written about all these points at great length on DBMS2.
