SSIS: Optimizing the ETL for Late Arriving Dimensions

 

Late Arriving Dimension/Early Arriving Fact as the name implies is a dimension record which is not available when the ETL loads the related fact record but becomes available at a later time.

This scenario could happen for a variety of reasons;

  • It might be a perfectly valid business process. For e.g. in an emergency ward, the patient interventions could be recorded first as priority while the patient details that is collected could be recorded into the system at a later time.
  • Sometimes due to operational reasons, the data might arrive at different times from their respective source systems. For e.g. The dimensions could be extracted from Master Data repository while the facts are extracted directly from the transaction system at different periods.
  • It might be due a failed ETL run.

The ETL assigns an unknown surrogate key for the Late Arriving Dimension in the related fact table (such as -1) when it does not find the dimension record. The fact table is then revisited later at set intervals to re-determine whether a valid dimension record is now available and the unknown surrogate key is updated with a real surrogate key, if available.

Typical challenges for the SSIS package that does the re-determination of the surrogate key are

  • Scanning potentially big fact tables for the few records with unknowns surrogate key.
  • Natural Key/ Business Key of the dimension record is not usually stored in fact record. It has to be sourced again to re-determine the surrogate key.
  • Might involve row-by-row updates for the re-determined records.

Few tips to tackle the above challenges and optimize the ETL for Late Arriving Dimensions are discussed below

1. Use Filtered Index on Fact Tables

Since the unknown surrogate key is well-defined subset of data, use Filtered Index to fetch fact records that have unknown SK.  Filtered Index is smaller in size and has accurate statistics resulting in better query plan and significant improvement in query response time. This avoids table scan.

--Filtered Index
CREATE NONCLUSTERED INDEX FX_CLIENT
ON dbo.FactEmergency (EVENT_NK,EVENT_START_DATE) WHERE (CLIENT_SK =-1)

Structure the source query of the SSIS dataflow to use the Filtered Index.

2. Use Skinny Staging Tables

Fact tables usually do not store the Natural Key of the dimension. So the Natural Key of the dimension has to be fetched from the source system. Create a skinny staging table that maps the fact record to dimension’s natural key and the re-determined surrogate key. Using the example of the patient in an Emergency ward, the staging table could be something like this

CREATE Table Staging.EmergencyClient
(
EVENT_NK Varchar(20) null,
EVENT_START_DATE Datetime,
CLIENT_ID Varchar(20),
CLIENT_SK INT null
)

The dataflow could be something like the one shown in screen capture 1

Screen Capture 1 – ETL for Late Arriving Dimension

Note that in the above SSIS package,

  • The Natural/Business Key of the dimension (CLIENT_ID in this case) has to be fetched from the source system
  • The ETL has to cater for the scenario where the client information may not be available yet from the source system. So do the lookup only if client’s Natural Key is available for the previously unknown clients.
  • Insert into the staging table only when the dimension surrogate key is found. In this way the staging table has only the records that require update.

3. Avoid Row-by-row updates

By populating the staging table only when a genuine dimension record is found, the staging table contains only the records that require update, the fact record can be updated in single batch.

–Single Update Statement to Avoid Row By Row Update

--Single Update Statement to Avoid Row By Row Update
UPDATE emerg
SET emerg.CLIENT_SK =stage.CLIENT_SK
FROM dbo.FactEmergency AS emerg
	INNER JOIN Staging.EmergencyClient As Stage
	ON stage.EVENT_NK =emerg.EVENT_NK
WHERE emerg.CLIENT_SK=-1

The use of staging tables facilitates batch updates instead of row-by-row updates.

These optimizations would keep ETL for Late Arriving Dimensions in good shape.

 

Related Posts

Unclogging the Fact Table Surrogate Key Pipeline

3 thoughts on “SSIS: Optimizing the ETL for Late Arriving Dimensions

  1. I agree Benny and also I have situation where a dimension assignment is derived through complex set of logic including mapping of multiple source columns via boolean logic maintained in MDS and then populates existing junk dimension. Bottom line is generating inferred dimension rows is not possible.

  2. Since updating the fact tables is more costly than updating the dimension table, I think it is more efficient to first load the dimension with new natural key and set the rest of attributes in dimension table to ‘null’ or some other default. Use the new generated surrogate key to insert in the fact table. Later parse the dimension tables for any such infered rows and update them with the actual data if it has arrived.

    1. Good idea and will work if the natural key of the dim is known at the time of inserting the fact record. In most late arriving instances this will not be the case.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s