How Slow is Slow? Slowly Changing Dimensions in SSAS

SCD's are key component of a well-designed Data Warehouse

An important subject that seems to be omitted in many Business Intelligence training courses is the Slowly Changing Dimension or SCD. When I teach BI classes, I make a point of covering the topic and it probably gets more students excited about BI more than any other topic. And believe me, that’s easier said than done. Excited and Business Intelligence don’t always fit in the same sentence. .. OK, so we know a Data Warehouse is made up of a relational design involving Facts, Dimensions, Attributes and Hierarchies arranged in a “Star” or “Snowflake” schema. Facts are typically numeric values we want to use to assess the business such as Sales Revenue, costs, profit margins etc. If facts are what we want to measure, then dimensions are how we want to analyze the facts; for instance, Sales Revenue by Year, by Quarter, by Month, by Customer or by Region. Attributes are added to the dimension tables to flesh out the dimension to give it more meaning, for example customer demographics such as Age or Salary Range. Hierarchies are multiple dimensions that are related to each other such as Year/Quarter/Month or Customer/Region. So what’s a Slowly Changing Dimension? As the name suggests, it’s a dimension that changes slowly and predictably. The big decision we have to make is: do we care? When a customer moves from one region to another, what should happen to their previous orders? If we are not careful, they will end up appearing under the new region and go missing from the old one. Not bad if you are a sales rep for the new region but disastrous if you are the sales rep who actually made those sales. Another example is when analyzing schools: when a student moves to another school should their previous exam results apply to the new school? Of course not. (Depends on whether they are a good or bad student I hear you say? Shame on you…). We want to be accurate in all such cases, but with standard dimensions we may fall into the trap of losing sight of that history. At the dimension attribute level there are at least 3 types of SCD. Type 1 SCD means the attribute is a “changing attribute” but we only care about the most current value. Type 2 means it is a “historical attribute” and we very much care about maintaining historical accuracy. Type 3 is for the rare attribute where we only care about the original and current value but not those changes in between, sometimes called “First and Last”. There are some other types but then we are getting too academic for this time of the day… If we do nothing about our dimension design, we will end up with all Type 1 attributes. This might be OK for Customer Last Name which may change but as long as we don’t need to analyze how many “Smiths” bought a particular product then Type 1 should be just fine. Sales By Region or Scores by School are classic examples where Type 2 SCD design is definitely needed. And the trick to enable this design is both simple and ingenious. For the customer who moves from one region to another, or the student who moves to a new school we need to create a row in the dimension table for each move with a corresponding Start Date and End Date to indicate the period they were there. However, relational constraints limit the Primary Key to a single unique value per row so, enter the concept of the “Surrogate Key”. Within the Data Warehouse design, the original business key from the operational system becomes just another attribute. Then we add a surrogate key as the primary key and have the system generate it to be unique. Now we can have multiple rows for a customer or student, with accurate start and end dates to indicate the history of the moves. The single row without an end date value will indicate the current location. SSIS has a Slowly Changing Dimension Transform which helps us with the incremental load. And here is the ingenious part: we only need to join to the fact table using the surrogate key because, as long as we load the data correctly with our incremental load processes, the end user will be able to analyze Customer by Region or Scores by School with confidence of accuracy even when analyzing over time-periods. The “magic” is in the data load. Don’t worry; I was skeptical too, until I tested I out. It’s a “eureka moment” when you see it working for the first time. The Business Analysts then perform their analysis as normal, using high performing cubes or multi-dimensional databases, oblivious to the underlying complexity using their “drag and drop” GUI in Excel and it’s “Business Intelligence as Usual” except with deadly accuracy. Cheers, Brian.

Join the Network World communities on Facebook and LinkedIn to comment on topics that are top of mind.

Copyright © 2011 IDG Communications, Inc.

IT Salary Survey: The results are in