Skip Links

Network World

Curt Monash

Data warehouse appliances and columnar DBMS

By CurtMonash on Thu, 07/24/08 - 2:45pm.

When I listed 5 data management technologies you probably should know more about, two of them were data warehouse appliances and columnar DBMS. And that was before today's announcement that Microsoft is buying data warehouse appliance vendor DATAllegro. Now those technologies are even more important. They make data warehousing vastly more practical and affordable, all across the cost spectrum - people, system purchase/maintenance, even electric power. And so they enable new types of analysis, such as the worldwide rush of telecom companies to actually put their whole call detail records (CDRs) directly online. And so we're having a classical market "disruption," in which the established vendors (i.e., Microsoft, Oracle) are getting increasingly marginalized by upstarts, and the only way they can keep going is to compete with the newcomers on their own terms (e.g., by acquiring one and going head-to-head).

The main technological points behind the disruption are:

Disk rotation speed is the big bottleneck, and therefore OLTP-style random database access is terrible for data warehousing. Hard disks were introduced in 1956, spinning 1200 times per minute. Half a century later, their rotational speed has only increased by a factor of 12.5. If you want to do data warehousing with a row-based architecture, you need to optimize it very differently (more table scans and sequential reads, fewer random seeks) than is appropriate for online transaction processing.

Functional database sizes (for identical user data) vary greatly. To get decent data warehouse performance at all, Oracle et al. did a great job of developing clever indexing schemes. But those take up a lot of space on disk; a data warehouse's functional size can easily be 6X or more the amount of actual user data it contains. By way of contrast, the upstarts can have 2-5X compression (in the case of row stores) or 5-10X+ (in the case of column stores), along with having many fewer indexes. The result can be functional databases that are much smaller than the raw data. Huge differences in disk use translate to major differences in equipment cost, power consumption, and - since disk access is the big bottleneck in large queries - warehouse performance itself.

When it comes to data warehousing, MPP (Massively Parallel Processing)/shared-nothing architectures blow away shared-everything ones. But Oracle and SQL Server are shared-everything, while Teradata and almost all the newer startups (e.g., Netezza, DATAllegro, Vertica, ParAccel, Infobright, Greenplum, Kognitio) are shared-nothing.

Columnar architectures have interesting benefits. Traditional relational databases arrange data on disk in rows. For updates, that's clearly the right way to go, because when you make a change you only need make it to one place (transactions of course aside). But when doing queries, you have to retrieve a lot of information you don't want along with all the stuff that you do. What's more, as I noted above, columnar vendors find it easier to get great compression results than row-based vendors do.

I've written about all these points are great length on DBMS2.

Related links

About A World of Bytes

Curt Monash is a leading analyst of and strategic advisor to the software industry. Praised by Lawrence J. Ellison for his "unmatched insight into technology and marketplace trends," Curt was the software/services industry's #1 ranked stock analyst while at PaineWebber, Inc., where he served as a First Vice President until 1987. He subsequently co-founded Evernet, Inc., a $40 million networking systems integrator. Since 1990, he has owned and operated Monash Research, an analysis and advisory firm covering software-intensive sectors of the technology industry. In that period he also has been co-founder, president, or chairman of several other technology startups.

Curt has served as a strategic advisor to many well-known firms, including Oracle, Microsoft, SAP, AOL, CA, and Netezza. Curt earned a Ph.D. in mathematics (Game Theory) from Harvard University. He has held faculty positions in mathematics, economics and public policy at Harvard, Yale, and Suffolk universities.

 

Most Discussed Posts

On The Web
LinkedIn