Slowly Changing Dimension (SCD) is great for tracking historical changes to dimension attributes. SCDs have evolved over the years and besides the conventional type 1 (update), type 2 (add row) and type 3 (add column), now there are extensions up to type 7 including type 0. Almost every DW/BI project has at least few type 2 dimensions where a change to an attribute causes the current dimension record to be end dated and creates a new record with the new value.
While type 2 dimensions are great to track attribute history, you should also consider the following in your design decisions:
- ETL for type 2 dimensions is relatively complex. At the time of writing, out of box SSIS SCD component has performance issues and you have to rely on alternative workarounds.
- Date range SSIS lookup for type 2 SCD uses partial cache instead of full cache
- Type 2 SCD requires more disk space.
In some instances type 2 SCD may not be appropriate. In this post, I would like to mention few common anti-patterns I have encountered around type 2 SCD.
Anti-pattern 1: Modeling almost every dimension as type 2 SCD
This is a usually a consequence of inadequate business analysis and/or over engineering. Not every dimension is type 2 SCD. Do not over engineer to cater to a future requirement that may never occur. If there is no business need to capture history, neither should you. Do explain to business what it means to capture attribute history and ask the question before you design.
Anti-pattern 2: No Business Effective Date, only DW Effective Date
If tracking changes is important to business, most source systems would already have mechanism to capture effective date of change either directly or indirectly. Dimensions that do not have business effective date and have only DW effective date instead is often an indication that the dimension is not a type 2 candidate. It is important to know when the business change happened is as opposed to when the DW knew about the change. The DW effective date is an artificially manufactured date that will be lost if the dimension is re-initialized.
Anti-pattern 3: Tracking every attribute’s history
Even if your dimension is a legit type 2 SCD, not every attribute needs to be tracked for historical changes.
- Data changes to some attributes do not represent a change in state. For e.g. an update to date of birth is mostly likely in response to incorrect data entry and hence an in-place update is most appropriate rather than end date the old record and create a new record with correct DOB.
- Some of the attributes are there for information purposes only. If an attribute is used as drill-through field rather than as slicer, only current value is relevant. For e.g. phone number or email id does not need history tracking.
- Some attributes exist in source for system use only. These attributes need not be promoted into the dimension in first place.
Anti-pattern 4: Track history, but only use current dimension record
After all the hard work, some implementations only ever use the current dimension record to link up to fact tables which contradicts the very reason to track attribute history as type 2 SCD.
5 thoughts on “Common Type 2 SCD Anti-patterns”
1 – You have a good point, however, I would argue that I have witnessed first-hand business users who change their mind mid-project. It is always easier to discard unused data than it is to try and recreate lost data. In addition, there are a number of design patterns that make the type SCD very cookie cutter that can be rapid development.
2 – There are a number of legacy COBOL systems that do not have the ability to track changes. I have worked with a number of OLTP systems that unless you pull the trigger audit tables you would not find a changed date. So, I’d recommend revisiting this thought.
3 – Easy way to track every attribute, generate a HASHBYTE for the row based on the concatenation of the entire field. Then just evaluate the HASHBYTE. I can send you the dynamic SQL to create this in SQL server, or Teradata.
4- In order to do point in time analysis, you will need to have the historical data. All retailers want to be able to track and understand the historic patterns of their users. With that opinion, you’ve excluded a pretty large market not to mention, airlines, hotels, gaming etc…
I would recommend taking some deeper introspection of the type II SCD concept, while I agree it can be overused, it does provide an excellent way to track changes.
Tim, thanks for visiting my blog and leaving your comment. This post is not intended to question the capability of type 2 SCD. Neither does it seek technical solutions to implement type 2 SCD. There are other posts out there that provide technical solutions – https://bennyaustin.wordpress.com/2010/05/29/alternatives-to-ssis-scd-wizard-component/. On the contrary this post only identifies common design mistakes when implementing type 2 SCD. Type 2 SCD is a great asset to any BI/DW IF implemented correctly for the right use case.
Also pretty much any logistical flow tracking analytics requires point in time history, which pretty much includes all warehousing of MRO and physical/material warehousing operations (see what I did there) in manufacturing contexts; or any type of asset sub-component tracking; biostatistics; etc. Especially in the age of IoT sensors and ubiquitous geospatial tagging, historical data is king.
cowbert, if there is a need for type 2 SCD by all means go for it.
Nice one, Benny!