DataOps Harmony: Incremental Loading for Optimal Data Engineering Performance

Efficient and reliable data loading is one of the foundations of an up-to-date and responsive data infrastructure. Incremental, or "high water mark", data loading can be key to achieving this goal.

In this article, I will dive into the concepts behind incremental high-water mark loading and its benefits, especially in contrast to full data loading. I will talk about real data transfers, as opposed to virtualization, as information is moved from a source system to a target cloud data platform.

Understanding Incremental High Water Mark Loading

The key concept behind incremental loading is to extract and load only the data that has changed since the last successful load. The "high water mark" in this strategy is the point in time up to which the target is in sync with the source; any updates made in the source system after that point are not yet reflected in the target.

The steps to achieve incremental data loading work in a loop like this:

Incremental loading steps

One circuit around these three steps performs one incremental load.

Step 1 - Request changed data

In step 1, a data pipeline extracts all the changed records from the source and loads them into the target cloud data platform.

This step involves sending the high-water mark as a parameter when querying the source system. It's important that the parameterized query is "pushed down" to the source system, as I will explain shortly.

When deployed as part of the commonly used Medallion data architecture (see below), the destination in the target cloud data platform is a transient object in the bronze layer.
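
As an illustration, here is a minimal Python sketch of step 1, assuming DB-API style connections with sqlite3-style "?" placeholders. The orders table, its columns, and the bronze_orders landing table are all illustrative names, not part of any particular product:

```python
from datetime import datetime

# Illustrative source query: "orders" and "last_modified" are assumed
# names. Binding the high-water mark as a parameter means the WHERE
# clause is pushed down and evaluated by the source system itself.
EXTRACT_SQL = """
    SELECT order_id, customer_id, amount, last_modified
    FROM orders
    WHERE last_modified > ?
"""

def extract_changes(source_conn, high_water_mark: datetime):
    """Step 1a: pull only the rows changed since the last successful load."""
    cursor = source_conn.cursor()
    cursor.execute(EXTRACT_SQL, (high_water_mark,))
    return cursor.fetchall()

def land_in_bronze(target_conn, rows):
    """Step 1b: land the extracted rows in a transient bronze-layer table."""
    cursor = target_conn.cursor()
    cursor.execute("DELETE FROM bronze_orders")  # transient landing area
    cursor.executemany(
        "INSERT INTO bronze_orders (order_id, customer_id, amount, last_modified)"
        " VALUES (?, ?, ?, ?)",
        rows,
    )
    target_conn.commit()
```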

Step 2 - Merge changes

The target cloud data platform must manage its own permanent copy of all the source data since it is only being fed with recent changes. This step involves merging these changes into the replicated copy.

In Medallion data architecture terms, the full, replicated copy of the source data exists in permanent storage in the silver layer.
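
As a sketch of step 2, a standard SQL MERGE folds the landed changes into the permanent silver-layer copy. The exact MERGE syntax varies between platforms, and silver_orders is again an illustrative name:

```python
# Upsert the bronze changes into the permanent silver copy: update rows
# that already exist (matched on the unique identifier), insert the rest.
MERGE_SQL = """
    MERGE INTO silver_orders AS tgt
    USING bronze_orders AS src
      ON tgt.order_id = src.order_id
    WHEN MATCHED THEN UPDATE SET
      customer_id   = src.customer_id,
      amount        = src.amount,
      last_modified = src.last_modified
    WHEN NOT MATCHED THEN INSERT
      (order_id, customer_id, amount, last_modified)
      VALUES (src.order_id, src.customer_id, src.amount, src.last_modified)
"""

def merge_changes(target_conn):
    """Step 2: merge the freshly landed changes into the replicated copy."""
    cursor = target_conn.cursor()
    cursor.execute(MERGE_SQL)
    target_conn.commit()
```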

Step 3 - Set new high-water mark

The last step is to query the target cloud data platform to find the latest timestamp from the recently loaded data within that source. This value acts as the high-water mark parameter when the process starts again with the next loop.

This step adds resilience in DataOps terms, since it depends upon the successful completion of step 2. If anything failed during step 2, step 1 would otherwise not know about it.

When a high-water mark load runs for the first time, there is no high-water mark! So to make the process work, the high-water mark parameter must default to a timestamp long in the past. This makes the very first load act as a full load.
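
Step 3 and the first-run default can be sketched together like this, continuing the same assumed names and reusing the functions from the two sketches above. Note that the new mark is read from the target's silver layer, so it only advances if step 2 actually committed:

```python
from datetime import datetime, timezone

# First run: no high-water mark exists yet, so default to a timestamp
# long in the past, which makes the first load behave as a full load.
DEFAULT_HIGH_WATER_MARK = datetime(1900, 1, 1, tzinfo=timezone.utc)

def next_high_water_mark(target_conn):
    """Step 3: derive the next high-water mark from the target platform.

    A failed merge in step 2 leaves the old mark in place, so the next
    circuit simply re-extracts the same changes - this is where the
    DataOps resilience comes from.
    """
    cursor = target_conn.cursor()
    cursor.execute("SELECT MAX(last_modified) FROM silver_orders")
    (latest,) = cursor.fetchone()
    return latest if latest is not None else DEFAULT_HIGH_WATER_MARK

def run_incremental_load(source_conn, target_conn):
    """One circuit around the three-step loop."""
    hwm = next_high_water_mark(target_conn)    # step 3 result (or default)
    rows = extract_changes(source_conn, hwm)   # step 1: extract changes
    land_in_bronze(target_conn, rows)          # step 1: land in bronze
    merge_changes(target_conn)                 # step 2: merge into silver
```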

The Medallion Data Architecture

Although the name is more recent, the layered pattern behind the Medallion Data Architecture has been widely adopted since the 1980s, and it has consistently proven to be a sound solution across data integration scenarios.

The three layers in a Medallion Data Architecture can be implemented in any database or data lake, and have been known by different names over the years:

| Bronze | Silver | Gold |
| --- | --- | --- |
| Operational | Data Warehouse | Departmental |
| Staging | Atomic | Data Mart |
| ODS | Normalized | Star Schema |
| Landing | Integration | Denormalized |

Data integration always requires data transformation. The key to the success of the Medallion Data Architecture has been differentiating between data transformation for integration, and data transformation for presentation.

Benefits of Incremental High-water Mark Loading

The main feature of incremental loading is its surgical approach: asking the source system only for the changes, every time. This leads to three main benefits:

Improved Performance

Reducing the amount of data processed during each iteration means faster loading times. Superior performance is possible because incremental loading minimizes the impact on system resources, allowing for more responsive data pipelines.

Reduced Resource Consumption

Compared with full data loading, incremental loading consumes fewer resources. Beyond the performance gains, this also makes it a more cost-effective solution for data engineering teams.

As I mentioned earlier, it's important that the high-water mark parameter is "pushed down" to the source system during the extraction process. It is far more efficient in terms of data transfer when the source system does the filtering, not the target cloud data platform.
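
The contrast is easy to see in a sketch (same illustrative names as before): both functions return the same rows, but only the first lets the source system do the filtering:

```python
def extract_pushed_down(source_conn, high_water_mark):
    """Efficient: the WHERE clause runs on the source system, so only
    the changed rows ever cross the network."""
    cursor = source_conn.cursor()
    cursor.execute(
        "SELECT order_id, customer_id, amount, last_modified"
        " FROM orders WHERE last_modified > ?",
        (high_water_mark,),
    )
    return cursor.fetchall()

def extract_then_filter(source_conn, high_water_mark):
    """Wasteful: every row is transferred, then most are discarded on
    the target side - the same result with far more data movement."""
    cursor = source_conn.cursor()
    cursor.execute("SELECT order_id, customer_id, amount, last_modified FROM orders")
    return [row for row in cursor.fetchall() if row[3] > high_water_mark]
```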

Near Real-time Data Availability

By loading only the changed data, incremental high-water mark loading enables near-real-time updates. This ensures that the latest information is readily available for analysis and decision-making.

Alternatives to incremental data loading

Performance is the main advantage of incremental loading, but it comes at a price: data engineers need to build pipelines sophisticated enough to repeatedly run the three steps I outlined earlier.

Incremental loading also imposes three very specific conditions on the data source, as the sketch after this list illustrates:

  • It must have a unique identifier so that the merge in step 2 is possible
  • All records must have a high-water mark property that the source system maintains reliably
  • The source system must be able to efficiently find the subset of records that have been modified after the high water mark
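
In concrete terms, a source table that meets all three conditions might look like the following sketch (illustrative DDL, held as a string to match the Python register of the earlier examples):

```python
SOURCE_TABLE_DDL = """
CREATE TABLE orders (
    order_id      BIGINT PRIMARY KEY,  -- condition 1: unique identifier
    customer_id   BIGINT,
    amount        DECIMAL(12, 2),
    last_modified TIMESTAMP NOT NULL   -- condition 2: reliably maintained
);

-- Condition 3: an index so the source system can efficiently find the
-- subset of records modified after the high-water mark.
CREATE INDEX idx_orders_last_modified ON orders (last_modified);
"""
```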

Considering these constraints, you may think about full data loading as an alternative.

Full data loading simply involves loading the entire dataset every time, regardless of changes. It's relatively resource-intensive and time-consuming, but it may nevertheless be suitable, particularly for small datasets, where the performance benefits of incremental loading are likely to be marginal at best.

One last consideration is related to deleted records. Incremental data loading strategies usually have difficulties with deletes since records that no longer exist can't be included in the extract and load of step 1. Full data loads are a simple way to solve this problem.
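
As a final sketch, here is a full load that rebuilds the replicated copy from scratch. Because nothing survives from the previous load, rows deleted at the source simply never reappear in the target (same assumed names and DB-API style as above):

```python
def full_reload(source_conn, target_conn):
    """Replace the entire silver copy with the current source contents.

    No delete detection is needed: the table is rebuilt from scratch,
    so source-side deletions disappear from the target automatically.
    """
    src = source_conn.cursor()
    src.execute("SELECT order_id, customer_id, amount, last_modified FROM orders")
    rows = src.fetchall()

    tgt = target_conn.cursor()
    tgt.execute("DELETE FROM silver_orders")
    tgt.executemany(
        "INSERT INTO silver_orders (order_id, customer_id, amount, last_modified)"
        " VALUES (?, ?, ?, ?)",
        rows,
    )
    target_conn.commit()
```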

Next steps

Incremental high-water mark loading is particularly useful when dealing with large datasets, as it minimizes the amount of data that must be processed, resulting in faster and more efficient loading.

Incremental data loading is the next step up in sophistication from full loading. For more information, take a look at an accompanying article containing more technical details of incremental data replication.

Data Productivity Cloud users can also download a set of three pipelines that demonstrate incremental data replication using a Medallion data architecture.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.