
Integrate data from YugabyteDB to Databricks using Matillion

The YugabyteDB to Databricks connector moves data into Databricks within minutes and keeps your information current, with no manual coding or complex ETL processes required.


Extracting data from YugabyteDB to Databricks

Extracting data from YugabyteDB is a key step in integrating your operational databases with advanced analytics platforms such as Databricks. By moving selected datasets from YugabyteDB into Databricks, organizations can unlock the full potential of cloud-scale analytics, machine learning, and business intelligence tools. This article will guide you through the end-to-end process of extracting data from YugabyteDB and loading it into Databricks. We will begin by outlining how to create an identity in YugabyteDB to enable secure access to your data. For readers using Matillion for data integration, we will cover how to check for, or acquire, the necessary JDBC driver to connect to YugabyteDB. Establishing reliable network connectivity between the source database and the Databricks environment is essential, and we will examine best practices for this critical step. Finally, we will demonstrate efficient approaches for querying and extracting data—both for initial full loads and for ongoing incremental data refreshes. By the end of this article, you will be equipped with practical knowledge to facilitate smooth and efficient data transfer from YugabyteDB to Databricks, powering your analytic workloads with fresh operational data.


What is YugabyteDB?

YugabyteDB is a distributed, open-source SQL database designed to address the needs of cloud-native applications that require both horizontal scalability and strong consistency. Built using a sharded architecture inspired by Google Spanner, YugabyteDB offers high availability and global distribution of data, supporting multi-region and multi-cloud deployments. It combines the power of PostgreSQL-compatible APIs (YSQL) for rich relational queries with high throughput, low-latency performance, and automatic failover capabilities. YugabyteDB also features tunable consistency levels, distributed transactions, and seamless scalability, making it well-suited for microservices, internet-scale OLTP workloads, and modern SaaS applications that demand resilience and continuous availability.


What is Databricks?

Databricks is a cloud-based platform that unifies data engineering, collaborative data science, and analytics on its Lakehouse architecture. Powered by Delta Lake, it offers ACID transactions, scalable metadata, and unified batch and streaming data processing—bridging the gap between data lakes and warehouses. Users manage schemas and query data efficiently via SQL, while Databricks supports diverse file formats and integrates with Python, Scala, and R. This enables professionals to build ETL pipelines, machine learning models, and real-time analytics with robust security and scalability in the cloud.

Why Move Data from YugabyteDB into Databricks

Unlocking Advanced Analytics: The Benefits of Replicating YugabyteDB Data to Databricks

A data engineer or architect may consider copying data from YugabyteDB to Databricks for several key reasons. YugabyteDB often holds valuable, real-time transactional data that can drive meaningful business insights. However, the true potential of this data is realized when it is integrated and analyzed alongside information from other sources, such as data lakes, legacy databases, or external APIs. By leveraging Databricks for the data integration and analytical workloads, organizations can perform complex transformations, aggregations, and advanced analytics in a scalable, distributed environment. This approach helps ensure that the operational performance of YugabyteDB is not adversely affected by heavy analytical queries, as these are offloaded onto Databricks’ optimized processing infrastructure. Ultimately, this strategy enables organizations to unlock significant value from their data assets while maintaining the stability and responsiveness of their transactional systems.

Creating a User in YugabyteDB

YugabyteDB is a distributed SQL database compatible with the PostgreSQL wire protocol, so user and role management is performed using standard PostgreSQL statements such as `CREATE ROLE` and `CREATE USER`.

Prerequisites

  • Administrative privileges, typically exercised via the `ysqlsh` shell or another PostgreSQL-compatible client.
  • A connection to your YugabyteDB cluster.

Step 1: Connect to YugabyteDB

Use the `ysqlsh` command-line tool:

```shell
ysqlsh -h <hostname> -p <port> -U <admin_user> -d <database>
```

Replace `<hostname>`, `<port>`, `<admin_user>`, and `<database>` with your cluster's details.

Step 2: Create a User

You can create a user using the following SQL command:

```sql
CREATE USER username WITH PASSWORD 'your_password';
```

Example:

```sql
CREATE USER app_user WITH PASSWORD 'S3cureP@ss!';
```

This command creates a new user named `app_user` with the specified password.

Step 3: Assign Privileges (Optional)

To enable the new user to connect to specific databases or perform actions on database objects, grant appropriate privileges. For example, to allow usage of the database, you might use:

```sql
GRANT CONNECT ON DATABASE mydb TO app_user;
```

To grant the ability to create, read, update, and delete data on a specific schema:

```sql
GRANT USAGE ON SCHEMA public TO app_user;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO app_user;
```

Step 4: Verification

The new user can now connect using their credentials:

```shell
ysqlsh -U app_user -h <hostname> -p <port> -d <database>
```


Note: In YugabyteDB, `CREATE USER` is essentially shorthand for `CREATE ROLE ... LOGIN`: it creates a role with the `LOGIN` attribute already set. Aside from that, the two statements are equivalent.

For more details, refer to YugabyteDB Documentation: User and Role Management.

Installing the JDBC Driver

At the time of writing, the required JDBC driver for YugabyteDB is not bundled with Matillion Data Productivity Cloud due to licensing or redistribution restrictions. As a result, you will need to manually obtain and install this driver to enable connectivity. Follow the steps below to download and upload the necessary JDBC driver.

Step 1: Download the JDBC Driver

  1. Navigate to the PostgreSQL JDBC page: YugabyteDB uses the PostgreSQL wire protocol, so you can use a compatible PostgreSQL JDBC driver. Visit https://jdbc.postgresql.org/.
  2. Select a Type 4 JDBC Driver: Type 4 drivers are the recommended choice as they are platform-independent and communicate directly using the database's network protocol.
  3. Download the driver JAR: Find the latest stable version of the Type 4 JDBC driver and download the JAR file to your local machine. This file is usually named in the format `postgresql-<version>.jar`.

Step 2: Upload the JDBC Driver to Matillion Data Productivity Cloud

  1. Review the official documentation: For specific steps on uploading external drivers, consult the guide at Matillion: Uploading External Drivers.
  2. Access the driver management interface: Log in to the relevant Matillion environment and navigate to the area where external JDBC drivers can be uploaded.
  3. Upload the JAR file: Follow the prompts to upload the previously downloaded JDBC driver JAR file. Be sure to select the correct option or location that allows Matillion to utilize this driver in connector configurations.

Step 3: Configure and Use the Driver

Once the JDBC driver has been uploaded, it becomes available to reference when configuring database connections in your pipelines. Following these steps ensures that Matillion Data Productivity Cloud can interact with your YugabyteDB database using the PostgreSQL-compatible JDBC driver you installed.
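Because YugabyteDB speaks the PostgreSQL wire protocol, the JDBC connection URL follows the standard PostgreSQL pattern. Below is a minimal sketch of how such a URL is assembled; the host and database names are hypothetical placeholders, and 5433 is YSQL's default port:

```python
# Sketch: building a PostgreSQL-compatible JDBC URL for YugabyteDB.
# The host and database below are hypothetical placeholders.
def build_jdbc_url(host: str, database: str, port: int = 5433) -> str:
    """Return a JDBC URL usable with the postgresql-<version>.jar driver."""
    return f"jdbc:postgresql://{host}:{port}/{database}"

url = build_jdbc_url("yb-node1.example.com", "mydb")
print(url)  # jdbc:postgresql://yb-node1.example.com:5433/mydb
```

Additional driver properties (SSL mode, search path, and so on) can be appended as standard PostgreSQL JDBC query parameters.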

Checking network connectivity

To ensure that Matillion Data Productivity Cloud can connect to your YugabyteDB database, you must verify that the database accepts incoming connections based on your chosen deployment configuration:

  • Full SaaS Agent Configuration: The YugabyteDB database must allow incoming connections from the source IP addresses listed in the Matillion Data Productivity Cloud documentation.

  • Hybrid SaaS Deployment: The YugabyteDB database must accept incoming connections from the IP address ranges used by your own Virtual Private Cloud (VPC). You can find tools to help you check network access and identify relevant IPs at the Matillion Exchange network check utility.

If you configure access to YugabyteDB using a DNS host name instead of an IP address, ensure that your selected Full SaaS or Hybrid SaaS agent can successfully resolve the database’s address to its underlying IP address. Network issues can occur if the agent cannot perform DNS lookup for the database.
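A quick way to diagnose the two failure modes above (DNS resolution and network reachability) is to test them separately. The sketch below does this with Python's standard `socket` module; the hostname used is a hypothetical placeholder, and 5433 is YSQL's default port:

```python
# Sketch: separately checking DNS resolution and TCP reachability for a
# YugabyteDB host. The hostname below is a hypothetical placeholder.
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if DNS lookup of the database host succeeds."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def can_connect(hostname: str, port: int = 5433, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the database port succeeds."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = "yb-node1.example.com"  # replace with your database host
    print("DNS ok:", can_resolve(host))
    print("TCP ok:", can_connect(host))
```

If DNS resolution succeeds but the TCP connection fails, the problem is usually a firewall or allow-list rule rather than name resolution.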

Querying Data from YugabyteDB

This guide provides instructions for technical users on querying data from a YugabyteDB database. The instructions include SQL examples and recommendations for initial and incremental extract patterns, especially relevant when YugabyteDB data is integrated with platforms such as Databricks.


Example YugabyteDB Queries

YugabyteDB is PostgreSQL-compatible; thus, standard SQL `SELECT` statements can be used:

```sql
-- Select all columns from the employees table
SELECT * FROM employees;

-- Select specific columns
SELECT id, first_name, last_name, department FROM employees;

-- Get rows where department is 'finance'
SELECT * FROM employees WHERE department = 'finance';

-- Retrieve recent changes based on a last_update timestamp
SELECT * FROM employees WHERE last_update > '2024-06-01 00:00:00';

-- Aggregate: Count employees per department
SELECT department, COUNT(*) FROM employees GROUP BY department;
```


Datatype Considerations

When moving data from YugabyteDB to Databricks, be aware that datatype conversion may occur. Common conversions include:

  • `uuid` and `jsonb` in YugabyteDB may be mapped to `STRING` or serialized text in Databricks.
  • YugabyteDB's `timestamp with time zone` may convert to Databricks' `TIMESTAMP` (interpreted as UTC).
  • Numeric or array types may be cast to compatible Databricks types (e.g., `DECIMAL`, `ARRAY<STRING>`).

It is recommended to inspect data mappings if precision or special types are critical in your workflow.
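One lightweight way to make such an inspection systematic is to keep the expected source-to-target mapping in a small lookup table and validate extracted schemas against it. The sketch below is an assumption to adapt to your own pipeline, not an authoritative mapping:

```python
# Sketch: a lookup of common YugabyteDB (PostgreSQL) to Databricks SQL type
# conversions. The entries reflect the conversions discussed above and are
# assumptions to verify against your actual pipeline, not a definitive map.
YB_TO_DATABRICKS = {
    "uuid": "STRING",
    "jsonb": "STRING",
    "timestamp with time zone": "TIMESTAMP",
    "numeric": "DECIMAL(38,18)",
    "text[]": "ARRAY<STRING>",
}

def map_type(yb_type: str) -> str:
    """Return the expected Databricks type, defaulting to STRING for unknowns."""
    return YB_TO_DATABRICKS.get(yb_type.lower(), "STRING")

print(map_type("UUID"))                      # STRING
print(map_type("timestamp with time zone"))  # TIMESTAMP
```

Running every source column through a check like this before the first load makes precision loss (for example on `numeric` columns) visible early instead of after data lands in Databricks.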


Loading Patterns: Initial vs Incremental

Initial Data Load

For a one-off "full" load, you typically fetch all data from a table without a filter. In ETL tools (such as Matillion), use the Database Query component with no `WHERE` clause.

Example:

```sql
SELECT * FROM orders;
```

Usage:
  • Run this query once to capture the initial dataset.
  • Transfer or process the entire result set in Databricks.

Incremental Data Loads

For routine updates, load only data that has changed since the last extraction, using a suitable filter (e.g., `last_update`, `modified_at`, or an integer primary key).

Example incremental load query using a timestamp:

```sql
SELECT * FROM orders WHERE last_update > '{last_extracted_timestamp}';
```

Or incremental by unique integer:

```sql
SELECT * FROM orders WHERE order_id > {last_max_order_id};
```

Notes:
  • The same Database Query component is used. Only the SQL changes: you add a `WHERE` clause with your filter.
  • Maintain the last extracted value in your ETL/control system to use in the next extraction.

For further details and incremental load design, refer to Matillion’s Incremental Load Data Replication Strategy.
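The watermark pattern above can be sketched end to end in a few lines. The example below simulates it with SQLite so it runs anywhere; against YugabyteDB you would issue the same `WHERE order_id > ?` query through a PostgreSQL-compatible driver, and the watermark would live in your ETL/control system rather than a local variable:

```python
# Sketch: incremental extraction using an integer-key watermark, simulated
# with SQLite for portability. Table and column names match the examples above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 9.99), (2, 19.99), (3, 4.99)],
)

def extract_incremental(conn, last_max_order_id: int):
    """Fetch only rows added since the last watermark, and return the new one."""
    rows = conn.execute(
        "SELECT order_id, amount FROM orders WHERE order_id > ? ORDER BY order_id",
        (last_max_order_id,),
    ).fetchall()
    # Carry the highest extracted key forward as the next run's watermark.
    new_watermark = rows[-1][0] if rows else last_max_order_id
    return rows, new_watermark

rows, watermark = extract_incremental(conn, last_max_order_id=1)
print(rows)       # [(2, 19.99), (3, 4.99)]
print(watermark)  # 3
```

The key design point is that the watermark is persisted after each successful load, so a failed run simply re-extracts from the last committed value.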


Pro tip: Always validate datatype compatibility between YugabyteDB and downstream platforms (like Databricks) when designing your ETL/ELT logic.

Data Integration Architecture

A key advantage of the ELT (Extract, Load, Transform) architecture is the separation of data loading from integration and transformation tasks. By loading the data into Databricks in advance, organizations can divide and conquer the data integration challenge, first focusing on reliably ingesting data, and then addressing transformation and integration as a distinct, subsequent step. Since successful data integration requires systematic data transformation, the most efficient approach is to implement transformation pipelines, which can automate and orchestrate complex workflows. Furthermore, the ELT model allows all transformation and integration operations to occur within the target Databricks database itself. This not only enables faster, on-demand, and highly scalable processing—leveraging Databricks' powerful compute engine—but also eliminates the need for additional data processing infrastructure and associated costs.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.