Network World
Friday, August 22, 2008
DNSstuff.com
Get information about your IP
IP Information
50+ On-demand DNS and network tools

Brian Egler's SQL Server Strategies

Microsoft Subnet

Navigation

Slowly Changing Dimensions – a cool new transform in SSIS

SQL Server Integration Services (SSIS) has superseded DTS in SQL 2005 and has some great new features beyond better performance, more scalability and an integrated debugging facility. As if that was not enough, we have been given a whole new set of pre-defined tasks and transforms. But wait; we can also write our own tasks and transforms, making SSIS a truly industrial strength extensible ETL tool that can now compete against the Informaticas of the world. One of the predefined transforms that you might find useful is the Slowly Changing Dimension (SCD) transform.

One of the major uses of ETL (Extract, Transform and Load) tools is to populate Data Warehouses. A Data Warehouse (or its smaller brother: a Data Mart) is the relational staging area between our operational databases and our high-performing multi-dimensional databases we call Cubes (as humans we can only think in 3 dimensions). The Data Warehouse is designed around Facts, Dimensions, Attributes and Hierarchies. A fact (or measure) is something we want to measure about our business like Revenue or units sold. A dimension allows us to "slice and dice" facts in different ways, for instance, by year, by month, by product, by customer, by region. The giveaway is that when a business analyst uses the word "by" they invariably follow it with a dimension. Remember this in your next requirements meeting. Attributes allow you to flesh out dimensions with extra characteristics like customer demographics such as age and salary range. Hierarchies are multiple dimensions related to each other like Category and Sub-Category. So once we have designed this Data Warehouse, how do we get data into it? Enter the ETL tool of choice. However, it is usually a lot more complex than just moving data from our operational systems. Often we need to perform data manipulation, data cleansing and/or standardization. Enter your favorite Transformations. These vary from Data Conversion to Split, Merge and "Fuzzy" Lookup.

My favorite this month is the SCD (Slowly Changing Dimension) transform. An SCD is a special type of dimension where data is changing slowly but predictably. An example is when a customer moves from one region to another, what happens to our historical analysis? Should we assume that a customer has always lived in their current region (Type 1 SCD) or should we accurately track the historical attribute (Type 2 SCD)? If Sales Reps are compensated on regional sales, you had better go with the latter. Or when a student moves to a new school, should their old exam scores be reassigned to the new school (Type 1) or should we tell it like it is (Type 2)?

Here's an example of a Type 1 SCD:
CREATE TABLE [dbo].[DimCustomer](
[CustomerKey] [int] NOT NULL,
[FirstName] [nvarchar](50) NULL,
[LastName] [nvarchar](50) NULL,
[GeographyKey] [int] NOT NULL,
CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED
( [CustomerKey] ASC ))

In a Type 1 SCD, we only have knowledge of the current values. Effectively we don't care what happened in the past, we only care about the present.

In order to track a Type 2 SCD, we need some extra columns. First of all, we need a Start Date and End Date so we can keep track of exactly when that customer was in a particular region (GeographyKey). Because we will need to store multiple rows for a customer that has moved regions, we can't use the normal CustomerKey as our Primary Key in the Data Warehouse because of duplicates. We have to introduce a "surrogate key" so that we can have multiple rows for a particular customer. In fact, we need one row for each time they move to a new region. The Surrogate Key can have an arbitrary value as long as it's unique. The Start Date and End Date columns will tell us exactly when they were in that region, and the one row with no End Date (NULL) shows us where they live currently.
Here's an example of a Type2 SCD:
CREATE TABLE [dbo].[DimCustomer](
[CustomerSK] [int] IDENTITY(1,1) NOT NULL,
[CustomerKey] [int] NOT NULL,
[FirstName] [nvarchar](50) NULL,
[LastName] [nvarchar](50) NULL,
[GeographyKey] [int] NOT NULL,
[StartDate] [datetime] NULL,
[EndDate] [datetime] NULL,
CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED
( [CustomerSK] ASC ))

If we can get data into our Data Warehouse in this format, then our historical analysis will be accurate. For instance, when analyzing sales by region by year, results will come out accurately as a matter of course. This is where the SCD Transform and its associated Wizard helps out. It will prompt for your type requirements (Type 1 -Changing Attribute, Type 2 - Historical Attribute) and your Business Key (the Primary Key from the operational system). Based on your answers, it will generate some pretty complex logic that compares data changes from the previous incremental load and will generate extra rows accordingly. Once the data is in the Data Warehouse, you are just one step from generating a cube that is valid and accurate. That's where Analysis Services takes over.

Then it's Business Intelligence as usual.
Cheers Brian

See recent blog posts...

Bill Gates is giving us a break with SQL Server licensing – or is he?

Database failure? Don’t update your resume - You can be the hero!

Database failure? Don’t update your resume - You can be the hero!


About Brian Egler

Brian D. Egler, MCITP-DBA/MCSE/MCT, is currently an instructor with Global Knowledge, teaching various Microsoft training courses such as MCSE, MCITP-DBA and other SQL Server courses. He is a SQL specialist and an expert on Exchange, Windows, .Net and XML. Egler has been a technical instructor for 16 years and has more than 10 years experience with SQL Server, data modeling, database design, application development including IMS, DB2, Sybase. In addition, he is member of the Project Management Institute.

RSS feed

Egler's archive.

Global Knowledge sponsors a monthly giveaway on Microsoft Subnet. Check out the Microsoft Subnet home page for details.

Global Knowledge offers a comprehensive catalog of Microsoft courses, including:

MCITP: Database Administrator Boot Camp
SQL Server 2005 Administration
MCITP: Enterprise Administrator Boot Camp
More Microsoft Courses

Microsoft Subnet

RSS feed Microsoft news RSS feed

Advertisement: