Near-real-time data loading with change data capture
Gone are the days when data could only be analyzed after the nightly, hours-long batch loading completed. Data today is dynamic—it changes constantly throughout the day. Modern enterprises and decision makers don’t want to wait for big data drops. They want data ‘now’ to inform business decisions, and the most time-sensitive data is contained in operational databases.
Most organizations replicate their operational databases to secondary locations, such as a cloud data warehouse, in scheduled batches before querying the data. This has several advantages. First, it offloads queries from the write-optimized database to a query-optimized data warehouse. Second, operational data can be combined with other data for advanced analytics. And third, the source database is protected from rogue queries that may impact its performance.
But for most enterprises, waiting for daily or even hourly batch loading to complete fits neither their business model nor the ever-emerging use cases that require fresh, real-time data. For example, best practice in marketing personalization is to analyze a customer's current engagement data alongside historical data to deliver the right offer while the customer is still engaged. Waiting an hour to see which products they clicked on, or to learn that they abandoned their cart, is simply not an option.
However, if data is replicated constantly from the source database, enterprises can incorporate fresh and historical data into their analytics, gain insights into what is happening now, and make confident, data-driven decisions in near-real time.
Using change data capture for near-real-time analytics
You might have already heard of “change data capture,” or “CDC.” It has long been used to replicate data from mission-critical databases for backup, disaster recovery, and downstream analytics. But as the need for real-time data has grown across all business areas, particularly for operational data, CDC has become far more widely adopted.
CDC works by identifying and capturing changes made to source databases: a new customer order arrived, an invoice was just paid, or a web visitor clicked on various products or downloaded several content pieces. The incremental change data is replicated immediately, instead of waiting on batch job schedules. This makes CDC ideal for near-real-time analytics use cases like marketing personalization, AI/ML model development, fraud detection, inventory management, dynamic pricing, and much more.
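To make this concrete, a captured change event is typically represented as a small record containing the operation type, the affected table, and the row images before and after the change. The sketch below is a generic illustration under assumed names, not any vendor's actual event format:

```python
# Illustrative sketch of a CDC change event: each database modification
# becomes a record describing the operation and the row images.
# Field names here are assumptions for the example, not a real schema.

from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class ChangeEvent:
    table: str             # source table the change applies to
    op: str                # "insert", "update", or "delete"
    before: Optional[dict] # row image before the change (None for inserts)
    after: Optional[dict]  # row image after the change (None for deletes)
    ts: float = field(default_factory=time.time)  # capture timestamp

# A new customer order arriving would be captured as:
event = ChangeEvent(
    table="orders",
    op="insert",
    before=None,
    after={"order_id": 1001, "customer": "acme", "total": 49.99},
)
```

Downstream consumers can then replay such events in order to reconstruct the source table, or feed them directly into analytics.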
Because changes are extracted continuously, changed data reaches its destination with low latency and can be analyzed almost immediately. Continuous replication also keeps downstream data stores consistent with the source database in a timely fashion.
How to implement change data capture efficiently and effectively
There are many ways to implement and manage change data capture. Many database administrators add “high water mark” columns in their data tables to indicate when a row has been added, deleted, or updated. A high water mark could be a timestamp, a status indicator, a version number, or any combination of these indicators. However, to find changes since the last replication, a query reads the data tables to find the high water marks. This is inefficient and puts a measurable strain on the database. It is also ineffective. If multiple changes to the same data occur in between queries, only the last change is captured—the interim changes are lost. This can have huge implications for use cases like fraud detection that depend on every data change.
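A minimal sketch of the high-water-mark approach, using SQLite and an assumed `updated_at` timestamp column, shows both how polling works and why interim changes are lost (table and column names are illustrative):

```python
import sqlite3

# High-water-mark CDC sketch: poll for rows changed since the last sync.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)

def changes_since(watermark):
    """Return rows modified after the given high-water mark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the newest change seen.
    new_mark = max((r[2] for r in rows), default=watermark)
    return rows, new_mark

# Two changes to the same row land between polls...
conn.execute("INSERT INTO orders VALUES (1, 'placed', 100)")
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 200 WHERE id = 1")

rows, mark = changes_since(0)
# ...but the poll sees only the final state: the 'placed' -> 'shipped'
# transition is invisible, which matters for use cases like fraud detection.
print(rows)   # [(1, 'shipped', 200)]
```

Note that the query also scans the table on every poll, which is the source of the performance strain described above.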
A more efficient way to implement CDC is to query database log files. Many of the largest and most popular databases capture every change event in log files or change tables. Monitoring and extracting changes from log files is very efficient because it has little to no effect on database performance. It is also very effective because every change event is captured—no more missing data.
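By contrast, a log-based approach tails an append-only change log, so every intermediate state survives. In this toy sketch a Python list stands in for a database's write-ahead log or change table:

```python
# Log-based CDC sketch: the database appends every change to a log, and
# the capture process reads the log from its last-consumed position.
# The in-memory list is a stand-in for a real write-ahead log.

change_log = []

def write(table, op, row):
    """Simulate the database appending a change record to its log."""
    change_log.append({"table": table, "op": op, "row": row})

def capture(from_position):
    """Read all log entries since the last-consumed position."""
    entries = change_log[from_position:]
    return entries, len(change_log)

# The same two changes that a high-water-mark query would collapse:
write("orders", "insert", {"id": 1, "status": "placed"})
write("orders", "update", {"id": 1, "status": "shipped"})

events, pos = capture(0)
# Both changes are captured, including the interim 'placed' state.
print([e["op"] for e in events])   # ['insert', 'update']
```

Reading the log also never touches the live tables, which is why this technique has little to no effect on database performance.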
For more information about the pros and cons of these and other CDC techniques, refer to the ebook Why You Need Change Data Capture: Frictionless, Real-Time Data Ingestion.
How to create a change data capture pipeline
CDC data pipelines are often created in-house by data engineers with help from database administrators. Writing the code for a CDC pipeline is not an easy task. A recent study showed that it takes 4-6 weeks to build a new connector pipeline and 1 week per quarter to maintain it. It requires specialized skills that are no longer easy to find, and it adds a significant delay in getting data into the hands of those who need it.
However, it’s now possible to create CDC pipelines much faster and more easily. Matillion Data Loader with Change Data Capture provides a SaaS-based platform for creating CDC pipelines without coding. The intuitive, wizard-based UI steps users through the configuration and setup of an executable agent and the CDC data pipeline.
The platform creates a CDC Agent that runs in a container within the customer’s private cloud and monitors database log files. When change events occur, the Agent safely and securely extracts the changed data from the log files and loads it into cloud storage. This unique hybrid SaaS architecture keeps all data and access credentials within the customer environment to satisfy data sovereignty policies; no data or passwords are ever exposed.
Matillion CDC meets the test for an efficient and effective CDC implementation. It efficiently captures and streams change events as they occur without impacting database performance. All changed data is loaded into cloud storage, creating an immutable copy of the source data. This enables an exact replica to be created at any point in time. It is also effective since every change event is captured, including source schema changes which are automatically picked up and propagated to the data destination—reducing broken pipelines caused by schema drift.
Once the raw data is in cloud storage, data scientists, AI/ML developers, and other raw data users can immediately put the data to use to perfect their algorithms and models. In addition, the raw data can be transformed and loaded into a cloud data warehouse using Matillion ETL, making the data business-ready within minutes for data analysts and decision makers.
Matillion CDC supports PostgreSQL, MySQL, Oracle, and Microsoft SQL Server as source databases, with Db2 for IBM i coming soon. It loads data into Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage. Matillion ETL can then extract the data from cloud storage and load business-ready data into Snowflake, Amazon Redshift, and Delta Lake on Databricks.
To learn more about Matillion CDC and see it in action, you can watch the on-demand session Smart, Safe, Real Time: Data Loading and Change Data Capture. The session discusses advantages of CDC and shows a brief demo of an end-to-end CDC pipeline.
Getting started with change data capture
When you are ready to get started with CDC, register for Matillion Data Loader. Once inside the Matillion Data Loader application, click on Agents in the left navigation pane to create a new CDC Agent definition. Once this definition has been completed, you will need to deploy and configure the Agent container into your chosen cloud service provider environment with the unique parameters provided.
When you have an Agent running and successfully connected, you are ready to create your CDC data pipeline. Click Start Enterprise Trial at the top right, then add a pipeline for your CDC Agent: select your source database, provide configuration and access credential details, and select the schemas and tables you want to replicate. Then choose your storage destination, provide a pipeline name, and create away!
Once your CDC pipeline has been deployed, the Matillion CDC Agent will do an initial data load and immediately start capturing and delivering change data to your chosen cloud storage. Then, you can leverage the power of Matillion ETL Shared Jobs to transform and load the change data into your selected data warehouse.
If you need assistance at any point, please reach out to us and we will gladly assist you.
Change data capture enables many new use cases and puts your operational data to use in minutes. Accelerate your data productivity. Extract and load operational data into your cloud data platform in near-real time, without coding, using Matillion Data Loader Change Data Capture.