Big Data – Terabytes in SQL Server

Another interesting session I attended at the PASS Summit in Seattle was one concerning dealing with VLDBs in SQL Server. It was called “Big Data – Working with Terabytes in SQL Server” presented by Andrew Novick. As you would expect, the session was packed. The definition of a very large database is a moving target. At the moment it seems to be anything over a terabyte. Judging by the audience, VLDBs are very common these days.

Of course, soon we’ll be talking about Petabytes (1 Petabyte=1024TB) and Microsoft did just that during its Keynote sessions at the conference. We were treated to a demo of a 150+ TB database distributed across 24 servers and 192 cores giving a reasonable response time of 15 seconds for running an SSRS summary report using the new Datallegro technology that is proposed for Project Madison (expected in 2010) (spontaneous applause).

Andrew Novick’s presentation got into how to load millions of rows on a daily basis to an already huge database. The daily load scenario is a common one for a Data Warehouse that is required as the foundation for Business Intelligence (OLAP) systems. The Data Warehouse is a relational structure that serves as a staging area for the high performing multi-dimensional database or “cube” that is generated and populated using Analysis Services.

Andrew highlighted some example performance metrics from his own tests. For instance, to load a million rows into an existing 400 million row table with 12 indexes took around 12 hours. However, to load a million rows into an empty table then add the 12 indexes after the fact took 5 minutes. That’s a big difference. Also, he let us know that to backup a full 10TB Database at 10GB per minute took 16 hours.

He let us know that his recommendation for a solution consisted of 4 tactics:

1) use as many files on as many drives as possible: this is for raw hardware performance.

 2) Insert new rows into empty unindexed tables: as shown above, the result is high performance bulk inserts.

3) use partitioned tables: either Data Partitioning or Distributed Partitioned Views

4) use Read-only File groups: this will speed up backups

I’ll be discussing each of these tactics in my next blog entries.

Watch this space!



Recent posts:

Accumulation or Regression in SQL Server?

SQLCAT – awesome resource!

Microsoft re-open the loop hole - thankfully!

Micrsoft keep their promises - SP3 + CU1 available!

Join the Network World communities on Facebook and LinkedIn to comment on topics that are top of mind.

Copyright © 2009 IDG Communications, Inc.

SD-WAN buyers guide: Key questions to ask vendors (and yourself)