What is Change Data Capture and Why is it Important?
Change Data Capture (CDC) extracts data changes in a source database and ingests those changes into cloud storage in near-real time. CDC is more efficient and faster than batch data ingestion, making it the go-to solution for data teams and analysts who need to get data into the cloud and analyze it quickly.
There are many methods of implementing CDC. Use cases like fraud detection, real-time marketing campaigns, operational analytics, AI/ML modeling, and more not only need data quickly but also need to capture every data change for complete analytics. Choosing the CDC method for your use case is a strategic decision that needs to be made before researching vendor solutions.
Why is Change Data Capture important?
As databases grow exponentially over time, performing large, batch-based operations can become inefficient, slow, and resource intensive. These large batch extractions can put a strain on the performance of the source database itself, potentially impacting write performance of new data records.
Monitoring and extracting changes as they occur with CDC simplifies the replication process, is highly efficient, and consumes fewer compute resources in the database, so there is minimal, if any, performance impact. CDC also keeps datasets consistent and up to date across all the downstream data stores and processing that rely on this data. The incremental change data can be loaded in microbatches (for example, every minute) instead of waiting on large batch jobs that may run only once a day or every few hours.
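The microbatch pattern described above can be sketched as a simple loop: on each interval, pull only the changes since the previous run and ship that delta downstream. This is a minimal illustration; `extract_changes` and `load_to_warehouse` are hypothetical placeholders, not part of any real product API.

```python
import time

def run_microbatches(extract_changes, load_to_warehouse,
                     interval_seconds=60, cycles=3):
    """Minimal microbatch loader loop (illustrative only).

    extract_changes(cursor) -> (changes, new_cursor): returns only the
        rows changed since `cursor` (None on the first run).
    load_to_warehouse(changes): ships the delta downstream.
    """
    last_cursor = None
    for _ in range(cycles):
        changes, last_cursor = extract_changes(last_cursor)
        if changes:
            # Only the incremental delta is loaded, never a full table scan.
            load_to_warehouse(changes)
        time.sleep(interval_seconds)
```

Because each run carries only the delta, the source database is never asked to re-export data it has already handed over, which is where the efficiency gain over large batch jobs comes from.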
What are good Change Data Capture use cases?
Change Data Capture has been a popular way to replicate and back up mission-critical databases for many years. Today, there are a growing number of use cases that take advantage of the near-real time nature of CDC. CDC is ideal for rapidly changing data, since it extracts and ingests data with extremely low latency. A sampling of popular use cases includes:
- Cloud migration: Faster and more accurate data migration
- Operational analytics: Faster data ingestion, which leads to faster decisions
- Fraud detection: Faster and more accurate detection of potential fraud
- Real-time marketing campaigns: Increased customer engagement
- AI/ML: Reduced cycle times and more accurate models
- Replication: Automated simple or complex data integrations
- Audit: Recreated data from any state of the business at any point in time
What CDC methods are available?
There are several CDC methods, including timestamps, versioning, and status indicators on each database row, or a combination of these. These methods take a point-in-time snapshot of the database and capture the changes since the last snapshot.
Timestamps on rows: Any row in a database table with a timestamp more recent than the last time data was captured is considered to have changed.
Version numbers on rows: Any row with a higher version than the current table version is considered to have changed.
Status indicators on rows: Any row with a changed status indicator (TRUE) is considered to have changed.
Time/Version/Status on rows: It is not uncommon to see several of these methods combined in a single system. Using time, version, and status together is not redundant or superfluous; each element acts as an extra check that helps ensure no change goes undetected.
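The first of these methods, timestamps on rows, can be sketched with a small query against a change-tracking column. The table, columns, and data below are illustrative assumptions, not taken from any specific system.

```python
import sqlite3

# Hypothetical "orders" table with an application-maintained updated_at column.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        amount REAL,
        updated_at TEXT  -- ISO-8601 timestamp set on every write
    )
""")
conn.executemany(
    "INSERT INTO orders (id, amount, updated_at) VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T09:00:00"),
     (2, 25.0, "2024-01-01T11:30:00"),
     (3, 40.0, "2024-01-01T12:15:00")],
)

def capture_changes(conn, last_captured_at):
    """Return rows stamped after the previous capture run."""
    cur = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (last_captured_at,),
    )
    return cur.fetchall()

# Only rows with a timestamp newer than the last snapshot count as changed.
changed = capture_changes(conn, "2024-01-01T10:00:00")
print(changed)  # rows 2 and 3
```

Version numbers and status flags work the same way: the `WHERE` clause compares a version column against the last captured version, or filters on a changed-status indicator, instead of comparing timestamps.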
Other CDC methods include adding trigger logic to the database that copies each change into a separate log table as it happens, and coding logic into applications that detects and captures data changes.
Triggers on tables: In this approach, triggers log events that happen to the transactional table into another queue table that can later be “played back”.
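A minimal sketch of the trigger approach: an `AFTER UPDATE` trigger copies every change into a separate queue table that can be replayed later. SQLite is used here only for illustration; the table and trigger names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);

    -- Separate queue table holding one row per change, in order.
    CREATE TABLE change_queue (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        account_id INTEGER,
        old_balance REAL,
        new_balance REAL
    );

    -- The trigger fires on every UPDATE and logs the before/after images.
    CREATE TRIGGER accounts_after_update
    AFTER UPDATE ON accounts
    BEGIN
        INSERT INTO change_queue (account_id, old_balance, new_balance)
        VALUES (OLD.id, OLD.balance, NEW.balance);
    END;
""")

conn.execute("INSERT INTO accounts VALUES (1, 100.0)")
conn.execute("UPDATE accounts SET balance = 150.0 WHERE id = 1")
conn.execute("UPDATE accounts SET balance = 125.0 WHERE id = 1")

# Unlike snapshot methods, every individual change is preserved for replay.
queue = conn.execute("SELECT * FROM change_queue ORDER BY seq").fetchall()
print(queue)  # [(1, 1, 100.0, 150.0), (2, 1, 150.0, 125.0)]
```

Note that the trigger body executes inside every transaction that touches the table, which is exactly why this method adds overhead to the source database, as discussed below.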
Event programming: Adding capture code to an application at the points where data changes is another method of signaling that data has changed.
And finally, there are log-based scanners that monitor database transaction logs, which record every update made to the database.
Log scanners: Most database management systems manage a transaction log that records changes made to the database contents and to metadata. This method is specific to a particular database management system. Most databases do not document the internal format of their transaction logs, although some provide programming interfaces (APIs) to their transaction logs (e.g. Oracle, DB2, SQL/MP, SQL/MX and SQL Server 2008).
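From a consumer's point of view, a log scanner emits an ordered stream of change events that can be replayed to reconstruct state. The sketch below replays such a stream against an in-memory replica. The JSON envelope (`op`, `before`, `after` fields) is a simplified assumption loosely modeled on formats used by log-based CDC tools, not the output of any specific product.

```python
import json

# A hypothetical ordered stream of change events from a log scanner.
raw_events = [
    '{"op": "c", "table": "orders", "after": {"id": 1, "amount": 10.0}}',
    '{"op": "u", "table": "orders", "before": {"id": 1, "amount": 10.0},'
    ' "after": {"id": 1, "amount": 12.5}}',
    '{"op": "d", "table": "orders", "before": {"id": 1, "amount": 12.5}}',
]

def apply_event(state, event):
    """Replay one change event against a replica keyed by row id."""
    op = event["op"]
    if op in ("c", "u"):
        # Create/update: upsert the row's "after" image.
        row = event["after"]
        state[row["id"]] = row
    elif op == "d":
        # Delete: drop the key from the "before" image.
        state.pop(event["before"]["id"], None)
    return state

replica = {}
for raw in raw_events:
    apply_event(replica, json.loads(raw))
print(replica)  # {} -- the row was created, updated, then deleted
```

Because the log records every operation, including the intermediate update and the delete, the consumer sees the full history rather than only the final state.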
What are common issues with CDC?
The problem with the first four CDC methods above is that they take snapshots at specified points in time. If a data element changes multiple times between snapshots, only the last change is captured. The interim changes are completely missed and lost forever. This has huge implications for use cases that depend on analyzing every change event, such as fraud detection and AI/ML modeling.
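The lost-update problem can be made concrete with a small sketch: the same row changes twice between two capture runs, but diffing the snapshots surfaces only the final state. The row shape and timestamps are illustrative.

```python
# A hypothetical order row tracked by snapshot-based CDC.
row = {"id": 1, "status": "placed", "updated_at": "2024-01-01T10:00:00"}

snapshot_1 = dict(row)  # capture run at 10:05

# Two changes land between capture runs...
row.update(status="paid",    updated_at="2024-01-01T10:10:00")
row.update(status="shipped", updated_at="2024-01-01T10:20:00")  # overwrites it

snapshot_2 = dict(row)  # capture run at 11:05

# Diffing the snapshots shows only the last state; the "paid" event is gone.
print(snapshot_1["status"], "->", snapshot_2["status"])
```

A fraud model or audit trail reading these snapshots would never learn the order was ever in the "paid" state, which is precisely the gap log-based CDC closes.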
Trigger-based CDC can impact the performance of the source database because triggers run on the database tables as data changes are made. With every transaction, it takes compute cycles to record the change in a separate table, so the system is slowed by this extra processing. This may also cause a slight delay, which affects latency.
Event programming is a complex undertaking. This method involves writing the capture code into applications outside of the database. For example, when data changes, the application executes additional code to capture the change and record it into a separate table. This can get complicated, impact the performance of the application, and require application code changes for every schema change and every new change event in the database.
What are the advantages of log-based CDC?
Increasingly, log-based CDC is being implemented for new use cases. Many databases capture every change as it occurs and log an entry in the database log files. Log-based CDC scanners monitor the log files to capture every change in real time. You don’t miss changes, as you can with point-in-time methods. And you don’t need to code complex logic into applications or downstream analytics platforms.
Log-based CDC provides the lowest latency among the different CDC options. It minimizes resource consumption in the source database and has little impact on the day-to-day operation of the database. And, there is no need to change the database schema to add a column for timestamp, version number, or status, and no need for complicated, high maintenance application code.
Matillion Data Loader’s Change Data Capture
Matillion Data Loader with Change Data Capture leverages low-level database change logs, or log-based CDC, to ingest consistent, accurate, and up-to-date data. It captures every change, not just point-in-time snapshots that miss interim updates. With Matillion Data Loader, you can create and manage both batch and CDC data pipelines in the same user interface. You don’t need a separate solution for each.
Change Data Capture in Matillion Data Loader happens through a hybrid SaaS architecture. CDC pipeline setup and management occurs in the intuitive, wizard-based SaaS interface of Matillion Data Loader. A CDC Agent is created using the credentials you provide. The Agent is deployed into a container in your Virtual Private Cloud (VPC) or in your on-premises data environment. The Agent executes the data extraction and data loading, and all data remains secure in your data environment. No data lands in a Matillion data store.
Matillion’s CDC integrates with Matillion ETL to run data transformations. This ensures that data is not only up to date, but also analytics-ready in just minutes.