Visit Matillion AI Playground at Snowflake Data Cloud Summit 24

Find out more

Data Replication: A No-Nonsense Guide to Full Loading in a Medallion Data Architecture

For those tasked with managing data replication, full data loading is the most straightforward strategy. This article explains how and why. To help with clarity, I'll explain using the commonly used Medallion data architecture as an example.

Understanding the Medallion Data Architecture

The term "Medallion Data Architecture" was raised to prominence primarily by Databricks. It is a comprehensive blueprint for overall structuring within a Data Lakehouse or Cloud Data Warehouse. This design philosophy classifies data into three distinct layers: bronze, silver, and gold. Pipelines govern the data flowing between the layers from bronze to gold.

Data is first replicated - copied - from its source into the foundational bronze layer. This step doesn't change any aspect of the data but provides a single unified technology interface for the data team to access everything they need. It also safeguards against disruptions such as temporary connectivity issues or the loss of historical data.

Next, the data is transitioned into the silver layer. This is a consolidated, standardized, and system-neutral representation of data from all the diverse sources. Performing this integration requires data transformation to address the inevitable inconsistencies caused by having many different source applications. Data models in the silver layer are concise and succinct data structures devoid of redundancy. Every single data definition resides in just one place. This makes data easy to find and unambiguous for downstream users in the next layer.

The silver layer is an efficient and compact central repository, but its compactness means that data retrieval can be complex - requiring many relational joins. This makes it less suitable for direct end-user consumption. This is where the gold layer becomes valuable as a presentation layer, aiming to enhance the accessibility of silver layer data. Structural rearrangements make the data much more user-friendly during this second data transformation stage. A star schema is the most common choice of data model in the gold layer.

This small table summarizes the three logical layers in a Medallion data architecture.

Easily consumable?xx

The Medallion data architecture

Simple Data Replication: Mastering the Full Load Strategy

This uncomplicated approach involves first duplicating the entire dataset from source into the bronze Medallion layer in the target database.

In the initial bronze layer, data remains in its initial, system-specific format. Following extraction and loading, the data then undergoes transformation into a generic, system-neutral structure in the silver layer. These transformations, often involving datatype changes and adherence to naming standards, convert the raw data into a more generic target schema.

Updates to the silver layer table can be executed through a straightforward truncate-and-insert method, rapidly replacing all data in the target table.

Alternatively, a "merge" or "insert/update" operation can be used based on a primary key. This is a more robust approach regarding DataOps, but it requires a unique identifier in the source table.

The result is a fully replicated data set, ready to use in the silver Medallion layer.

Alternatives to full data loading

Simplicity is the forte of the full load strategy, making it a pragmatic choice for data synchronization in many cases. But it comes with a caveat: simplicity can sometimes translate to suboptimal performance. This tradeoff is particularly noticeable when dealing with larger datasets. Repeatedly transferring entire datasets from place to place is resource-intensive and can lead to lengthy runtimes.

The alternative involves being more careful about what data to extract from the source system. If an item of data has not changed, don't copy it into the target again. This approach is known as either

  • Incremental or "high water mark" loading - in which we ask the source for all changes since a point in time. Or
  • Change data capture (CDC) - in which the source continuously pushes changes to the destination

These will be discussed in more detail in a future article.

Next steps

Full data loading is a simple and effective way to get started quickly. It is particularly valuable when synchronization of entire datasets is imperative. Even if you eventually switch to a more sophisticated incremental or CDC load strategy (look out for an article to follow on this subject), keeping a full load mechanism in reserve for cases when large structural changes have occurred in the source system is still useful.

Take a look at an accompanying article you would like to gain a more technical understanding of full-load data replication.

Data Productivity Cloud users can also download a set of three pipelines that demonstrate a full data replication strategy using a Medallion data architecture.

Ian Funnell
Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.