Moving data around is a fact of life in modern organizations. And doing it as efficiently as possible is a growing concern for data professionals. That’s why we’re hearing a lot about Change Data Capture, or CDC, these days. CDC is a strategy that helps companies integrate data faster, while minimizing strain on system resources, so that the data can be moved to a location where it can be used for analysis and other tasks as quickly as possible.
What is Change Data Capture and How Does it Work?
Change data capture (CDC) is a process that’s used to identify and capture the changes made to an organization’s source databases. Most companies move their application data to secondary locations, such as business intelligence applications or cloud data warehouses, before querying it. This is because querying an application directly could negatively impact performance. For example, you wouldn’t want to run a bunch of resource-intensive queries against an application that’s used in real time to serve customers. You might impact performance and negatively affect the customer interaction. So the data has to be put somewhere else before it can be analyzed. But moving an entire database has become too time-consuming in the age of big data.
With CDC, you don’t have to copy the entire database, just the changes between the source and target systems. This is incredibly important as the size of our transactional databases increases. By allowing you to detect, capture, and move only the changed data sets, CDC cuts the amount of time required for data integration. It also allows you to perform data integration more frequently. Instead of copying the entire database once per 24-hour period, for example, you can use CDC to run smaller, more frequent data integration jobs. CDC therefore reduces the amount of resources required for data warehousing, all while enabling continuous data integration.
What are the Different Methods of Change Data Capture?
There are numerous CDC methods, as well as commercial CDC solutions.
One method is using database triggers. Triggers can be called before or after certain events, such as when a database table is updated. These triggers can then capture the changes in a changelog, which is then used to update the secondary data source. However, this method can impact database performance because the triggers will run as each change to the database is executed. In fact, you could potentially double the amount of time required for each database transaction, since any transaction results in two separate processes.
Table differencing is a method for CDC that entails moving a copy of an entire database table from the source database to the secondary system, and then using queries to compare the copy with an older version of the table. There are a couple of challenges with this method. Copying entire source tables can be extremely resource-intensive and therefore costly. Plus, the amount of computing power involved to run the queries to compare the two tables is significant.
Log-based CDC uses a database’s transaction log to identify the changed data. Transactional databases’ management systems include a transaction log that logs every change that’s made to the database. In a log-based CDC system, this transaction log can then be scanned to identify the changes so they can be applied to the secondary system, such as the business intelligence application or cloud data warehouse. One challenge with this method is that different databases have different formats for their transaction logs. So if you’re dealing with multiple data sources in multiple source systems, you may not be able to use a one-size-fits-all approach for each log scan.
With this method, metadata fields, such as version numbers or timestamps, are used to determine what has been changed in the source system. You can then run a query to find the data that has recently updated metadata. One issue with this method is that metadata can sometimes change even when the data itself hasn’t changed. For example, metadata might be updated when data is viewed but not updated. So the amount of data selected would be larger than necessary, increasing the time and resources needed to update the secondary system.
How Can Change Data Capture Help Improve Your ETL Process?
Using CDC is an important part of making your ETL processes as efficient as possible. Here are several reasons why good CDC practices can improve your ETL.
CDC minimizes the resources required for ETL processes because it limits the amount of data that’s going to be extracted, transformed, and loaded. With less data to work on, your ETL processes will run faster. Less processing time translates to reduced costs.
Before CDC became common practice, organizations might run one big data integration job within a 24-hour period, usually overnight to minimize impact to users. But using CDC practices along with your ETL software, it’s possible to run smaller, more frequent data integration jobs. This helps get your analysts closer to the real-time data that they want.
By using CDC combined with an ETL tool or data loader, you can establish a data replication strategy to support your disaster recovery and high availability requirements. Maybe you need a production database, a test database and a backup. With CDC and ETL, it’s faster and easier to keep these different versions of your databases in synch.
CDC and ETL can be used together to analyze transactions to support fraud detection. For example, you could analyze transactions to determine if a customer’s credit card is being used from more than one location at the same time, which could signal fraudulent activity.
Want to Learn More About Change Data Capture?
Given the increasing volume and complexity of data, and the speed and scale needed to handle it, the only place you can compete effectively—and cost-effectively—is in the cloud. Matillion provides a complete data integration and transformation solution that is purpose-built for the cloud and cloud data warehouses.
Change Data Capture (CDC) in Matillion is a great feature that will help keep your cloud data warehouse up to date with a source database (including MySQL, PostgreSQL, Oracle, or Microsoft SQL Server). The CDC feature in Matillion is only available in Matillion ETL for Amazon Redshift and Matillion ETL for Snowflake on AWS. Each of these cloud data platforms use AWS DMS (Data Management Service) and S3 to check for updates to the source database and update the relevant tables within Matillion. Request a demo to learn more about the CDC features in Matillion ETL for Snowflake and Matillion ETL for Amazon Redshift.