This blog article describes a commonly-encountered practical problem when you are creating a dimensional model.
With a star schema, it’s a fixed rule of dimensional modeling that every foreign key on every fact table must join onto its associated dimension.
But what happens if you are presented with new fact data that does not join onto its dimension table? There are several options, which this post will explore.
The following video walks you through what to do with a late arriving dimension or early arriving facts using Snowflake.
Late Arriving Dimension or Early Arriving Fact?
In this sense it actually doesn’t matter which way around you see it. Either you’ve acquired some fact data before all the necessary dimensions, or else you’ve acquired the dimension data after it was needed for the facts.
Here’s a fragment of the bus matrix associated with Matillion’s civil aviation data model:
This means that every fact_flight record must have two valid foreign keys to the dim_airport dimension: one for the origin airport and one for the destination.
If those joins don’t fully work, there are three common solutions:
- Putting fact records into suspense
- Using a special “unknown” dimension
- Complete the dimension later
Putting fact records into suspense
This approach involves simply storing the incoming fact data in a separate table ready for re-processing later. It’s sometimes called a “suspense” or “retry” table.
This is a good choice in some specific cases, but suffers from being fundamentally unpredictable. There’s no indication of when the missing dimension data might arrive.
In fact, in the worst case, it might never arrive. The fact data would cycle endlessly through re-try after re-try and would never appear before end-users. It’s usually considered worse to not know about something at all – rather than knowing it happened but not having all the details.
Using a special “unknown” dimension
This is a very valuable technique, and is widely used for optional relationships. For example in an aviation bus matrix there might be a “Delay Reason” dimension, containing a list of reasons for delayed departures.
For flights that were not delayed, it’s simply not appropriate to store a “delay reason”. You would link these records to a specially-created dim_delay_reason record with a “N/A” description. Often there’s no need even to do a lookup, because the special records are allocated meaningful (usually negative) surrogate keys.
Similarly, for late-arriving dimensions, you can use a specially-created dimension record with description “Not known yet”.
This works well for late-arriving dimensions, although with two main drawbacks:
- When the dimension data eventually does arrive, you have to remember to go back and update any fact records which are pointing at “Not known yet”. Updates like these tend to be computationally expensive and are prone to errors in logic.
- You would be forced to store the natural key alongside the surrogate key in the Fact record. This is not a problem in 3-tier or Data Vault architectures, but does affect 2-tier models.
Complete the dimension later
There’s an implication with a “Late Arriving” Dimension that you know what it should be, but it just hasn’t been loaded yet. Specifically this means that you know the natural key, but have not been able to find it in the dimension table.
This is good news! Using just the natural key, you can still create dimension records which have not arrived yet. Set all the description columns to the value “Unknown”, “N/A” or blank, and update them later when you know the real values.
Advantages of this include:
- There’s no risk of fact data becoming lost in suspense.
- It’s a simple update to go back and fill in a few missing columns in a dimension. In fact, almost all dimension update logic will probably already be doing this.
- The target star schema is always as up-to-date as it can be. In fact, showing “unknown” values in a report is a good way to get business users on your side. They will help bang the necessary heads together to get your late-arriving dimension data to start arriving on time!
This method is often considered best, and is explained in more detail in the next section.
Proactive checking with Matillion
So how do you know that your dimension is “late”? Most commonly, you only discover this when you’re doing the final join which sets the surrogate keys for new fact records. It’s surely rather a large detour to go back to the start, add the missing dimensions, and then re-try the join again?
No pun intended, but that “reactive” way of completing the dimension is happening too late in the data processing pipeline. It requires conditional branching and repetition. You can deal with the problem earlier, in a different, more proactive way.
Detecting missing dimensions early
There’s a very easy way to find late-arriving dimensions in a Matillion Transformation job. A late-arriving dimension has a natural key which exists in the new fact data, but which does not yet exist in the dimension.
Start with the new fact data:
Find every column which holds the dimension’s natural keys. In this example, every flight has an Origin and a Destination airport code, so the data flow needs to be considered twice.
With a Matillion ELT pipeline you can let the database do all the hard work. Replicate the input data, choosing the Origin at the top and the Destination at the bottom, and feed them into a Unite component. The database will output a unique list of all the airport codes which have appeared either as an Origin or a Destination.
It’s common to write this out to another temporary table for further processing. In this case the Rewrite Table creates a new database table named stg_iata_code.
Now that you have all the dimension keys that are going to be needed, it’s a simple operation to exclude all those which already exist.
Choose the newly-calculated stg_iata_code as the main table, and left-join to the existing dimension. Filter out any records where the natural key is already present in the dimension by choosing records where the dimension’s key is null or blank.
Hopefully there are none! But if this filter does reveal late-arriving dimensions, you’ll need to add a Calculator to set defaults for the other columns, and a Table Update to append the late-arriving records to the dimension.
Star Schema append
In the Calculator component, choose defaults according to agreed business rules. For example:
After this job has finished, you are guaranteed that the dimension is not missing any late-arriving values. You can safely join the new flight records and get back a real surrogate key for every single one.
At some point in the future your ordinary dimension update (not shown in this article) will hopefully update the ‘Unknown’ to its real value, but this will not require any extra action on your part.
[callout-content title=”Want to find out more about Matillion ETL? Select your data warehouse below to learn more”]
[button title=”Amazon Redshift” link=”https://matillion.com/etl-for-redshift/etl-redshift/”]
[button title=”Google BigQuery” link=”https://matillion.com/etl-for-bigquery/etl-bigquery/”]
[button title=”Snowflake” link=”https://matillion.com/etl-for-snowflake/etl-snowflake/”]