- Blog
- 07.04.2024
- Product, Data Fundamentals
Schema Drift on Database Tables

This article is part three of a three-part series on schema drift. It demonstrates how to design data pipelines in Matillion Data Productivity Cloud that detect changes to a source database table's schema and decide whether to proceed with the load or flag the change.
Initial database load, identify drift, and load the Bronze layer
The Matillion Data Productivity Cloud is designed with a Medallion architecture in mind. Source data is extracted from the source database and loaded directly into the cloud data warehouse with the source schema intact; this landing area is known as the Raw or Bronze layer. Because the Bronze layer is designed to match the source schema, this article demonstrates how to keep the two in sync.
Step 1: Initial stage data from a database table
Start with a Database Query component configured to ingest a table from a source database. This example reads from SQL Server and writes to Snowflake, using a variable to identify the source table and naming the staged table ${var_table_name}_stg (e.g. 'customer_stg').
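In Matillion this is configured through the component UI rather than code, but the naming convention can be sketched as follows (function names here are illustrative, not Matillion APIs):

```python
# Hypothetical sketch of what the Database Query component does:
# read every column of the source table and land it in the warehouse
# as <table>_stg, preserving the source schema.

def stage_table_name(var_table_name: str) -> str:
    """Mirror the component's target-table naming, e.g. 'customer' -> 'customer_stg'."""
    return f"{var_table_name}_stg"

def build_extract_sql(var_table_name: str) -> str:
    """Basic mode pulls every source column, preserving the source schema."""
    return f"SELECT * FROM {var_table_name}"

print(stage_table_name("customer"))   # customer_stg
```

Because the query selects every column, any column added to or dropped from the source table flows straight into the stage table, which is what makes the later metadata comparison necessary.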
Step 2: Identify schema/metadata drift
Now the data has been loaded, it's time to check for schema drift.
- This step captures the stage schema metadata into a Snowflake table, saves that metadata to a grid variable, and checks whether the target table exists.
- If the target table does not exist, it runs the initial stage-to-target transformation, which creates the table on the first pass. After that pipeline runs, it updates a project-level variable to indicate that the initial run is complete.
- If the target table exists, the target schema metadata is captured into a second table.
- The last step, the transformation pipeline, compares the two schema tables. Note: include the 'End Success' component on the initial-run branch so the main pipeline's final step, the source-to-target comparison, doesn't run on the first pass.
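The branching logic above can be sketched as a simple function (names are hypothetical; in Matillion this is an If component checking table existence):

```python
# Illustrative sketch of Step 2's control flow, not a Matillion API.
# target_tables stands in for the set of tables already present in the
# warehouse; target_table is the table this run would write to.

def choose_branch(target_tables: set[str], target_table: str) -> str:
    """Return which pipeline branch runs next."""
    if target_table not in target_tables:
        # First run: create the target from the stage, then End Success,
        # so the source-to-target comparison step is skipped.
        return "xfrm_initial_stg_to_tgt"
    # Target exists: capture its metadata too, then compare the two schemas.
    return "compare_stage_and_target_metadata"

print(choose_branch({"orders"}, "customer"))     # xfrm_initial_stg_to_tgt
print(choose_branch({"customer"}, "customer"))   # compare_stage_and_target_metadata
```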
Step 2a: xfrm_initial_stg_to_tgt
The initial stage-to-target transformation writes the staged data into the target table. On the first run, it uses a Rewrite Table component, which creates the target table dynamically. See the Notes section just before the conclusion for how to add extra metadata columns.
Step 2b: compare stage and target metadata
The stage and target metadata tables created in the previous step are fed into the 'Detect Changes' component, which identifies any new, changed, or deleted metadata rows (that is, columns of the underlying table) and writes each set out to its own table.
Step 2c: Flag drift or run stage to target transformation
- This step first checks the new-columns and deleted-columns tables created in the previous step to determine whether either contains any rows.
- If there are changes, a webhook alert is sent and the job ends in failure.
- If there aren't any changes, it proceeds to run the xfrm_stg_to_tgt pipeline.
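The flag-or-proceed decision above can be sketched like this (the alert mechanism is stubbed; the message format is an assumption, not a Matillion specific):

```python
# Sketch of Step 2c: fail with an alert if drift was detected,
# otherwise proceed to the stage-to-target load.

def flag_or_proceed(new_cols: set[str], deleted_cols: set[str],
                    send_alert=print) -> str:
    """Raise on drift (after alerting), else name the next pipeline to run."""
    if new_cols or deleted_cols:
        send_alert(f"Schema drift detected: new={sorted(new_cols)}, "
                   f"deleted={sorted(deleted_cols)}")
        raise RuntimeError("Pipeline ended in failure: schema drift")
    return "xfrm_stg_to_tgt"

print(flag_or_proceed(set(), set()))  # xfrm_stg_to_tgt
```

Failing the job on drift is a deliberate design choice: it forces a human to review the change before any mismatched data reaches the Bronze layer.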
Step 3: xfrm_stg_to_tgt
If the target table already exists, then the new stage data is appended to the existing target table.
This step can also be more complex, containing additional processing to maintain history, as long as the stage and target tables retain the same structure. See the Notes section for more details on how to do this.
Notes
If you want additional columns, such as load timestamps and source metadata, add them in the initial Database Query component so that the stage and target tables match and can be compared in the later step.
Set the component's mode to 'Advanced' and enter a SQL query that adds the metadata fields to the initial stage query: select all columns from the source, use the current timestamp to populate stage_load_dttm, and use constants to populate source_schema and source_table. These constants can also be variables if you are iterating through different schemas and tables.
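A query of that shape might look like the following, sketched here as a string builder (the schema and table names are examples; in Matillion they would typically come from pipeline variables):

```python
# Illustrative only: builds an 'Advanced' mode stage query with the extra
# metadata columns described above. Not a Matillion API.

def advanced_stage_query(schema: str, table: str) -> str:
    return (
        "SELECT t.*, "
        "CURRENT_TIMESTAMP AS stage_load_dttm, "
        f"'{schema}' AS source_schema, "
        f"'{table}' AS source_table "
        f"FROM {schema}.{table} t"
    )

print(advanced_stage_query("dbo", "customer"))
```

Because these columns are added at the staging step, they appear identically in both the stage and target tables, so the metadata comparison still sees matching schemas.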
Summary
Schema drift is a critical challenge in data management. This article has demonstrated a way to detect and manage schema drift, focusing on automating the stage-to-target load that populates the Bronze or Raw layer. By keeping the initial source and target tables in the same structure, you can compare the two tables before appending data to the target.
Identifying schema drift within this system involves several steps focused on dynamic table handling and comparison of schemas. Discrepancies between the current data schema and the target schema indicate schema drift. The system performs metadata comparisons using saved schemas from previous loads and the newly extracted schema from the current table. If differences are detected, the pipeline flags the schema drift and can optionally halt the process to alert system administrators or data engineers for manual intervention. This approach ensures that schema changes do not go unnoticed and that data integrity is maintained. This is critical for analytical accuracy and operational efficiency in data-driven businesses.
If you'd like to try out the Matillion Data Productivity Cloud for yourself, follow the link below to sign up for a free trial.
Catch up on:
Part One: Schema Drift for Excel Files
Part Two: Schema Drift for Dynamic S3 Files
Angie Hastings
Senior Sales Engineer