
Delta Lake on Databricks: CDC and Batch Ingestion with Matillion Data Loader

As enterprises strive to become data-driven, choosing a cloud data environment is an important strategic decision. The Databricks Lakehouse Platform combines the data structure and data management features of a data warehouse with the low-cost storage of data lakes. Delta Lake on Databricks builds a Databricks Lakehouse architecture on top of existing cloud storage. Delta Lake provides a reliable, performant, and secure data storage and management system that is now used by thousands of organizations to process exabytes of data on a monthly basis.

Once the data environment is chosen, filling it with data is the next critical step. Matillion Data Loader is a SaaS-based data pipeline tool that lets individuals quickly and easily ingest data at scale into Delta Lake on Databricks, without coding. Whether you want to set up a high-volume batch pipeline or a low-latency Change Data Capture (CDC) pipeline, you can use the wizard-based pipeline builder to extract data from popular data sources and start your Delta Lake on Databricks ingestion within minutes. You don’t need separate solutions for batch and CDC loading.

Delta Lake on Databricks Change Data Capture

Change Data Capture extracts data changes from a source database and ingests those changes into cloud storage in near-real-time. For example, you can analyze online purchases as they happen and make near-real-time recommendations as part of your marketing campaign. Because CDC moves only the rows that changed rather than reloading whole tables, it typically delivers fresher data with less load on the source than batch ingestion, making it the go-to solution for data teams and analysts who need to get data into the cloud and analyze it quickly.

Matillion Data Loader with CDC is a near-real-time data ingestion architecture that, along with Matillion ETL for Databricks, offers a complete end-to-end solution for capturing, transforming, and ingesting change data into Delta Lake on Databricks. 

Matillion Data Loader provides an intuitive no-code/low-code user experience to quickly extract data from widely used source databases, such as Microsoft SQL Server, PostgreSQL, and Oracle. It uses log-based CDC, which reads the low-level database change logs, to ingest consistent, accurate, and up-to-date data. It captures every change, not just point-in-time snapshots that miss interim updates.
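To see why reading the change log captures more than comparing snapshots, consider the minimal Python sketch below. It is an illustration only, not Matillion's implementation: the `ChangeEvent` type, the `apply_events` and `snapshot_diff` helpers, and the sample order data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in a hypothetical database change log."""
    op: str                # "insert", "update", or "delete"
    key: int               # primary key of the affected row
    value: Optional[str]   # new row value; None for deletes


def apply_events(state: Dict[int, str], events: List[ChangeEvent]) -> Dict[int, str]:
    """Replay log events, in order, onto a copy of the table state."""
    result = dict(state)
    for event in events:
        if event.op == "delete":
            result.pop(event.key, None)
        else:
            result[event.key] = event.value
    return result


def snapshot_diff(before: Dict[int, str], after: Dict[int, str]) -> Dict[int, Optional[str]]:
    """Compare two point-in-time snapshots; only the final value per key is visible."""
    changed = {}
    for key in before.keys() | after.keys():
        if before.get(key) != after.get(key):
            changed[key] = after.get(key)
    return changed


# Hypothetical change log: order 1 moves through three states,
# order 2 is created and deleted between the two snapshots.
log = [
    ChangeEvent("insert", 1, "cart"),
    ChangeEvent("update", 1, "paid"),
    ChangeEvent("update", 1, "shipped"),
    ChangeEvent("insert", 2, "cart"),
    ChangeEvent("delete", 2, None),
]

before: Dict[int, str] = {}
after = apply_events(before, log)

print("events in log:", len(log))
print("snapshot diff:", snapshot_diff(before, after))
```

In this sketch the log holds five events, while the snapshot comparison reports a single change (order 1 ending up as "shipped"). The intermediate "cart" and "paid" states, and the whole lifetime of order 2, are invisible to the snapshot approach.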

Delta Lake on Databricks CDC through Matillion Data Loader happens in a hybrid SaaS architecture. CDC pipeline setup and management occur in the intuitive, wizard-based SaaS interface of Matillion Data Loader. A CDC Agent is created to access your data source and cloud storage destination using the credentials you provide. The Agent is deployed into a container in your Virtual Private Cloud (VPC) or in your on-premises data environment. The Agent executes the data extraction and data loading, so all data remains secure in your data environment. No data lands in a Matillion data store.

Once data is in your cloud storage, Matillion’s CDC integrates with Matillion ETL to run data transformations and load the transformed data into your Delta Lake on Databricks lakehouse. This ensures that data is not only up to date, but also analytics-ready in just minutes.

Batch Loading into Delta Lake on Databricks

The most commonly used data ingestion technique is batch loading. Data is loaded in batches at periodic intervals, such as once per day or once per hour. Companies typically use batch ingestion for large datasets that don’t require near-real-time analysis. For example, a business that wants to explore the correlation between SaaS subscription renewals and customer support tickets could ingest the related data on a daily basis.

Matillion Data Loader extracts data from select data sources and loads the data directly into Delta Lake on Databricks (now available in Public Preview) with minimal configuration and complexity. This saves time for data scientists, analysts, and line of business managers in data preparation, allowing businesses to move enormous amounts of data into the cloud to analyze and gain insights fast.

Configuring Delta Lake on Databricks as a batch destination.

 

Through the SaaS-based user interface, you simply supply the credentials to access the source and the Delta Lake destination and then set the frequency of your batch loading job. Sources can be as diverse as Salesforce, Google Sheets, Snowflake, PostgreSQL, MySQL, Oracle, and more. To facilitate traceability, audit columns (a batch ID and an update timestamp) are added before the staged data is written to the final target Delta Lake table. Depending on your source data model, incremental loading using a high-water mark is also supported, so that only changed data is loaded after the initial load.
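The high-water mark idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the product's internal logic: the `incremental_batch` function, the `updated_at` cursor column, and the audit column names `batch_id` and `update_timestamp` are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


def incremental_batch(
    source_rows: List[Row],
    high_water_mark: Optional[datetime],
    batch_id: int,
    loaded_at: datetime,
    cursor: str = "updated_at",  # assumed name of the source's change-tracking column
) -> Tuple[List[Row], Optional[datetime]]:
    """Select rows newer than the high-water mark and stamp them with audit columns."""
    if high_water_mark is None:
        # Initial load: no mark yet, so take everything.
        selected = list(source_rows)
    else:
        selected = [row for row in source_rows if row[cursor] > high_water_mark]

    # Add the audit columns (hypothetical names) before staging.
    staged = [
        {**row, "batch_id": batch_id, "update_timestamp": loaded_at}
        for row in selected
    ]

    # Advance the mark to the newest cursor value seen; keep the old mark if nothing changed.
    new_mark = max((row[cursor] for row in selected), default=high_water_mark)
    return staged, new_mark


source = [
    {"id": 1, "updated_at": datetime(2023, 1, 1)},
    {"id": 2, "updated_at": datetime(2023, 1, 2)},
]
run_time = datetime(2023, 1, 4)

# First run loads everything and records the mark.
first, mark_after_first = incremental_batch(source, None, batch_id=1, loaded_at=run_time)

# A new row arrives in the source; the second run picks up only that row.
source.append({"id": 3, "updated_at": datetime(2023, 1, 3)})
second, mark_after_second = incremental_batch(
    source, mark_after_first, batch_id=2, loaded_at=run_time
)

print([row["id"] for row in first])
print([row["id"] for row in second])
```

The approach depends on the source exposing a column that increases whenever a row changes, which is why support varies with the source data model. Rows deleted at the source are not detected by a high-water mark alone.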

Active batch pipeline from Google Sheets to Delta Lake on Databricks.

 

Once your data pipeline is set up and running, you will see the “Active” status in the Matillion Data Loader interface, and your batch data will load on the schedule you set.

Ready to get started? 

Matillion Data Loader is well suited to organizations that have data coming in from an ever-increasing number of sources and need faster, simpler access to insights and analytics.

To learn more about Matillion Data Loader for Delta Lake on Databricks, request a demo.

To get started with Matillion Data Loader, simply register today at dataloader.matillion.com.

 
