How to supercharge data team productivity
Few things frustrate data analysts and business decision makers more than waiting on data. Without it, work stalls, and there is little they can do about it. IT and data engineers are overwhelmed with new data requests. Many large enterprises deal with more than 1,000 different data sources, and legacy data tools just don’t cut it anymore. Architects struggle to integrate data models and make everything gel. There has to be a better way.
Fortunately, there is. There is a new class of data tools emerging to help solve the frustrations and delays of connecting to new data sources. These cloud-based tools provide pre-built connectors to popular data sources, and also provide a way to connect to an exploding number of SaaS applications through a RESTful API connector. The goal is simple—make data loading as quick and easy as possible.
But a plethora of tools does not make an enterprise hum with precision. Quite often, disparate tools from multiple vendors complicate the data environment, hurt data team productivity, and delay data delivery to those who need it. Separate tools for batch loading and change data capture (CDC) loading require data teams to become experts in two different user experiences. What is needed is a single data loading platform that lets users create and manage both batch and CDC pipelines.
Batch data pipelines
Batch pipelines have been around for decades. They are used when analytics is not time sensitive, like running monthly, weekly, or daily reports based on historical data. Batch jobs extract data from a data source and load it into a destination on a set schedule. Because batch data extraction puts a load on the source system, batch jobs are most often run at night to avoid workload conflicts. These large jobs may take hours to complete.
Incremental batch loading uses “high water marks” in additional columns within the database tables to indicate when a row has changed. High water marks can be timestamps, status flags, versions, or any combination of these indicators. The incremental batch job queries the database and extracts only those rows whose high water mark has been set since the previous extraction. This reduces the amount of data extracted, thereby reducing the performance impact on the database, but there is still an impact.
One disadvantage, however, is that some data changes may be missed. For example, if a data item changes multiple times in between the set time intervals, or snapshots, only the last change is captured—interim changes are lost. This can be OK for some use cases, but for other use cases it is not acceptable to miss any data updates.
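The high-water-mark mechanics described above can be sketched with a minimal, self-contained example. This is an illustrative simulation (the table, column names, and `incremental_extract` helper are invented for this sketch, not Matillion's implementation); in a real pipeline the filter would be a SQL query against the source database.

```python
from datetime import datetime

# Simulated source table: each row carries an updated_at "high water mark"
# column. A real pipeline would run something like:
#   SELECT * FROM orders WHERE updated_at > :last_watermark
SOURCE_TABLE = [
    {"id": 1, "amount": 100, "updated_at": datetime(2024, 1, 1, 2, 0)},
    {"id": 2, "amount": 250, "updated_at": datetime(2024, 1, 2, 2, 0)},
    {"id": 3, "amount": 75,  "updated_at": datetime(2024, 1, 3, 2, 0)},
]

def incremental_extract(table, last_watermark):
    """Return only rows changed since the previous run, plus the new watermark."""
    changed = [row for row in table if row["updated_at"] > last_watermark]
    new_watermark = max(
        (row["updated_at"] for row in changed), default=last_watermark
    )
    return changed, new_watermark

# First run: an early watermark pulls everything (the initial full load).
rows, wm = incremental_extract(SOURCE_TABLE, datetime(2023, 12, 31))

# Second run: nothing has changed since the stored watermark, so the job
# extracts zero rows instead of re-reading the whole table.
rows2, wm2 = incremental_extract(SOURCE_TABLE, wm)
```

Note that the watermark only records the *latest* state of each row, which is exactly why interim changes between runs are lost, as described above.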
Change data capture pipelines
Change data capture (CDC) pipelines have also been around for a very long time. They are often used for database backups, disaster recovery, data migrations, and for time-sensitive analytics. CDC jobs can also use high water marks within the database tables to identify changed rows. Queries against the database tables or partitions extract the changed data and load it into a destination. CDC pipelines run continuously, rather than only at set time intervals.
Log-based CDC pipelines monitor database logs rather than the database tables themselves. The database must be able to capture change events into log files or change files. CDC techniques that monitor logs virtually eliminate any performance impact on the source system.
For more information on these and other CDC methods, including pros and cons of each, check out the ebook Why You Need Change Data Capture: Frictionless, Real-Time Data Ingestion.
One advantage of CDC pipelines is that every change event is captured, unlike incremental batch pipelines that may miss multiple changes to data in between snapshots. Given this and their extremely low latency, CDC pipelines support time-sensitive use cases like marketing personalization, fraud detection, dynamic pricing, clickstream analytics, and many more.
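To make the contrast with incremental batch concrete, here is a minimal sketch of replaying a change log against a destination. The log format and `apply_change_events` helper are invented for illustration (real log-based CDC reads the database's own transaction log); the point is that every event, including interim updates, is preserved.

```python
# Simulated database change log: every event is recorded in order, including
# the interim "shipped" update that a batch snapshot taken at the end of the
# day would never see.
CHANGE_LOG = [
    {"op": "insert", "id": 1, "data": {"status": "new"}},
    {"op": "update", "id": 1, "data": {"status": "shipped"}},    # interim change
    {"op": "update", "id": 1, "data": {"status": "delivered"}},
    {"op": "insert", "id": 2, "data": {"status": "new"}},
    {"op": "delete", "id": 2, "data": None},
]

def apply_change_events(events, destination):
    """Replay log events against the destination, preserving event order."""
    history = []  # full change history, usable for time-sensitive analytics
    for event in events:
        history.append(event)
        if event["op"] == "delete":
            destination.pop(event["id"], None)
        else:
            destination[event["id"]] = event["data"]
    return history

dest = {}
log_history = apply_change_events(CHANGE_LOG, dest)
# dest holds only the latest state, while log_history retains all five events.
```

A batch job sampling this table would report row 1 only as "delivered"; the change log additionally captures that it passed through "shipped", and that row 2 existed at all.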
The need for a single data loading platform
Enterprise data teams have a difficult job. They face constant pressure from incoming requests to connect to, extract, and load data from countless data sources. A study showed that if a data team hard codes its data pipelines, a new connector takes 4-6 weeks to build and another week every quarter to maintain: adjusting to source schema changes, fixing broken pipelines, and so on. This is a daunting task with a dozen data sources, let alone hundreds or a thousand or more.
Data loading tools alleviate most of the pain. With some basic configuration and access credentials, they automatically generate the data pipeline code. Some even help adjust for source schema changes like a new column or an updated data type. Very few, however, handle both batch and CDC pipelines in a single tool.
This means that data engineers must become experts in multiple tools, switching between them to see a full picture of their data environment. This impacts data team productivity and often delays data delivery to data analysts, data scientists, AI/ML modellers, and ultimately to business managers who make decisions based on data.
Matillion Data Loader
Matillion Data Loader removes all the barriers to extract source system data and ingest it into your cloud data platforms, for both batch and CDC pipelines. It is the perfect platform for enterprises that have data coming in from an ever-increasing number of sources and need faster, simpler access to data, insights, and analytics.
It empowers data engineers, architects, and even data analysts, AI/ML developers, and data scientists to build robust batch and CDC pipelines in minutes, without writing code. Through an intuitive, wizard-based UI, users configure their data pipelines, deploy them, and monitor status. Pipelines get created faster and data gets delivered in a timely manner.
“Matillion Data Loader has a simple user interface making data loading easy for analysts and engineers alike.”
Senior Data Warehouse Lead, Cimpress
Using the Matillion Data Loader SaaS user interface, users can set up a new batch pipeline through pre-built connectors or a custom connector to any RESTful API. To experience Matillion Data Loader batch loading for yourself, take a quick product tour through our interactive demo.
Change data capture loading
“Using Matillion Data Loader change data capture, we unlock the ability to deliver complex insights significantly faster. Matillion Data Loader led to a 95% improvement in time-to-complete over previous solutions, which is a real game changer for us.”
Matillion Data Loader Change Data Capture uses log-based CDC to capture every change event as it occurs from the database log files of PostgreSQL, MySQL, Oracle, Microsoft SQL Server, and Db2 for IBM i databases. This provides low-latency data movement, scales as needed, and does not impact database performance.
A unique feature of Matillion CDC is its hybrid SaaS architecture. A CDC Agent and the CDC pipeline are configured in the SaaS interface. The CDC Agent is deployed into a container within the customer’s private cloud or on-premises data environment. The Agent extracts data from the source database and loads the data in cloud storage. This happens entirely within the customer’s private cloud to satisfy data sovereignty policies—no data or access credentials are ever exposed.
See it in action
We put together a less formal demo of Matillion Data Loader so you can see it in action. In this video, we address a common scenario that we often see: combining real-time product inventory information and CRM data. We take you through batch loading Salesforce data into Snowflake, and CDC loading PostgreSQL data into Amazon S3, then transforming and loading it into Snowflake.
Matillion Data Loader makes data loading easier and more self-service, making your data productive, faster, across the enterprise. If you are ready to unlock more data from multiple sources at speed and scale, try Matillion Data Loader today for free. Once you register, you can load up to a million rows of batch data for free every month, or start an Enterprise trial to test out change data capture in your enterprise.