Integrate data from YugabyteDB to Snowflake using Matillion

Our YugabyteDB to Snowflake connector seamlessly transfers your data to Snowflake within minutes, ensuring it remains up-to-date without requiring manual coding or the management of complex ETL scripts.


Extracting data from YugabyteDB to Snowflake

Extracting data from YugabyteDB is a vital step for organizations looking to leverage the analytical strengths of Snowflake while maintaining transactional operations on YugabyteDB. This article provides a practical guide for data engineers and architects on how to efficiently transfer data from YugabyteDB into Snowflake, enabling robust reporting and analytics. We will begin by outlining how to create an identity in YugabyteDB, which establishes the necessary permissions for data extraction. For users working with Matillion, we will explain how to check for an existing JDBC driver compatible with YugabyteDB, and how to acquire one if needed. Additionally, we will discuss key considerations for ensuring secure and reliable network connectivity between YugabyteDB (the source) and Snowflake (the target). Finally, the process of querying and extracting data—both for the initial load and subsequent incremental updates—will be explored in detail. By following this guide, you will be equipped to optimize your data movement pipeline and maximize the value of your organization’s data assets.


What is YugabyteDB?

YugabyteDB is an open-source, distributed SQL database designed for high availability, linear scalability, and global data distribution. It features a PostgreSQL-compatible query layer (YSQL) and a Cassandra-compatible API (YCQL), enabling seamless integration with existing applications. Internally, YugabyteDB leverages a distributed, sharded storage engine based on a Raft consensus protocol, ensuring strong consistency and automatic failover. Its architecture supports multi-region and multi-cloud deployments, making it suitable for mission-critical workloads requiring zero downtime and low-latency access. Advanced features such as online schema changes, distributed ACID transactions, and point-in-time recovery further enhance its suitability for modern cloud-native environments.


What is Snowflake?

Snowflake is a cloud-native data platform with a unique multi-cluster shared data architecture that separates compute and storage for scalable, concurrent data processing. Available on AWS, Azure, and Google Cloud, it handles structured and semi-structured data (e.g., JSON, Avro, Parquet) and delivers efficient analytics with automatic tuning and minimal maintenance. Snowflake enables independent scaling of compute, supporting batch ETL, ad hoc queries, and real-time data sharing without resource contention. Built-in security—including automatic encryption, fine-grained access control, and auditing—meets enterprise compliance needs, making Snowflake a versatile, high-performance choice for modern data workloads.

Why Move Data from YugabyteDB into Snowflake

Unlocking Analytics: The Value of Transferring Data from YugabyteDB to Snowflake

A data engineer or architect might choose to copy data from YugabyteDB into Snowflake for several compelling reasons. Firstly, YugabyteDB often contains operational data that is highly valuable for analytical and business intelligence purposes. However, the true potential of this data is frequently realized only when it is integrated with datasets from other sources, enabling comprehensive analysis and richer insights. By leveraging Snowflake as the platform for such integration and analysis, organizations can take advantage of its scalable processing capabilities without placing additional analytical or computational load on the YugabyteDB cluster. This separation ensures that the performance of critical transactional workloads in YugabyteDB is maintained, while analysts and data scientists can work efficiently within the Snowflake environment.

Creating a User in YugabyteDB

To create a user in YugabyteDB, you will use standard PostgreSQL user (role) management commands, as YugabyteDB is PostgreSQL-compatible. Below are the steps and example SQL scripts for creating a new user.

1. Connect to YugabyteDB

First, connect to your YugabyteDB cluster using a database client such as `psql`:

```bash
psql -h <hostname> -p <port> -U <admin_user> -d <database>
```

Replace `<hostname>`, `<port>`, `<admin_user>`, and `<database>` with your environment’s values.

2. Create a User

To create a new user (role) named `app_user` with a password, use the following SQL:

CREATE ROLE app_user WITH LOGIN PASSWORD 'StrongPasswordHere';

3. Grant Privileges

By default, a new user has no privileges on any database objects. You can grant specific privileges as needed:

To grant CONNECT on a database:

GRANT CONNECT ON DATABASE mydb TO app_user;

To grant usage on a schema:

GRANT USAGE ON SCHEMA public TO app_user;

To grant privileges on specific tables:

GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE my_table TO app_user;
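If the user will be used purely for extraction, it can be simpler to grant read access across an entire schema rather than table by table. A sketch using standard PostgreSQL grant syntax, which YugabyteDB's YSQL layer also accepts; `public` and `app_user` are the example names used above:

```sql
-- Grant read access to every existing table in the schema
GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_user;

-- Ensure tables created later are also readable by app_user
ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT ON TABLES TO app_user;
```

For an extraction-only user, omitting `INSERT`, `UPDATE`, and `DELETE` follows the principle of least privilege.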

4. Verify the User Creation

List the users (roles) in your YugabyteDB cluster:

\du

This will display all roles and their attributes.


Note: User and role management in YugabyteDB works the same way as in PostgreSQL. Users are implemented as roles with the `LOGIN` attribute. Always choose strong passwords and follow best practices for privilege management.

Installing the JDBC Driver

At the time of writing, the YugabyteDB JDBC driver is not bundled with Matillion Data Productivity Cloud, primarily due to licensing and redistribution restrictions. This means it is necessary to manually obtain and install the required driver to enable connectivity between Matillion and your YugabyteDB database.

To proceed, follow these steps:

1. Download the JDBC Driver

  • Visit the official PostgreSQL JDBC downloads page: https://jdbc.postgresql.org/
  • Locate the most recent, stable release of the driver.
  • Select the Type 4 JDBC driver (“pgjdbc”), the recommended pure Java driver; it works with YugabyteDB because the YSQL layer is wire-compatible with the PostgreSQL protocol.
  • Download the `.jar` file to your local system. Take note of the file location for use in the next steps.

2. Install the JDBC Driver in Matillion Data Productivity Cloud

Since this JDBC driver is not packaged with Matillion by default, you will need to upload the driver manually. Follow the process described in the official documentation:
Uploading External Drivers

  • Log in to the Matillion Data Productivity Cloud user interface and access your Agent’s configuration.
  • Use the ‘Upload External Driver’ feature to add the downloaded `.jar` file.
  • Confirm the driver has been registered within your Agent’s environment.

Keep in mind that after the upload, the Agent may require a restart—or the driver may need to be assigned to a specific connector template, depending on your configuration.

3. Configure and Use the Driver inside Matillion Designer

Once the JDBC driver is installed and recognized by the Matillion Agent, you can leverage it to create database connections and orchestrate YugabyteDB queries within your pipelines.

Detailed usage instructions are documented here:
Using Database Query Component in Matillion Designer

  • Configure connection parameters, referencing your YugabyteDB host, port, database name, credentials, and any additional required properties.
  • Verify connectivity and start building your data workflows using Matillion’s Designer interface.

If you encounter errors during upload or database connectivity, consult the Matillion troubleshooting steps and YugabyteDB documentation to resolve compatibility or configuration issues.
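When filling in the connection parameters, the PostgreSQL JDBC driver expects a standard `jdbc:postgresql://` URL. A sketch of what the connection string might look like; the placeholder values are illustrative, and note that YugabyteDB's YSQL API listens on port 5433 by default rather than PostgreSQL's 5432:

```
jdbc:postgresql://<hostname>:5433/<database>?ssl=true&sslmode=require
```

The SSL parameters shown are optional and depend on how your cluster is configured; consult your YugabyteDB deployment's TLS settings before enabling them.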

Checking network connectivity

When configuring connectivity between Matillion Data Productivity Cloud and your YugabyteDB database, you must ensure that the database accepts incoming network connections from the appropriate sources based on your deployment type:

  • Full SaaS agent configuration: YugabyteDB must allow inbound connections from the set of outgoing IP addresses used by Matillion. The latest list of IP addresses to allow can be found here: Matillion Allowed IP Addresses.

  • Hybrid SaaS deployment: YugabyteDB should be configured to accept incoming connections from the IP addresses of your own virtual private cloud (VPC). You can find utilities to help you determine your relevant VPC addresses—and to verify connectivity—at Matillion Network Access Check.

Additionally, if your YugabyteDB database is referenced using a DNS hostname (instead of a direct IP address), ensure that the relevant Matillion Full SaaS or Hybrid SaaS agent is able to resolve that address successfully. This may require configuring DNS resolution within your network or agent environment.
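Before building pipelines, it can save time to confirm reachability with a quick manual check from the agent's network. A sketch using `psql`, matching the client used earlier; the host, database, and user values are placeholders, and 5433 is YSQL's default port:

```bash
# Verify DNS resolution, TCP connectivity, and authentication in one step
psql "host=<hostname> port=5433 dbname=<database> user=app_user sslmode=require" -c "SELECT 1;"
```

A successful `SELECT 1` confirms the network path, DNS resolution, and credentials all work before you involve Matillion components.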

Querying Data from YugabyteDB

This guide explains how to query data from YugabyteDB using SQL `SELECT` statements, with an emphasis on data loading strategies and considerations for integration with Snowflake.


Example YugabyteDB SELECT Queries

Below are some common `SELECT` query patterns for YugabyteDB:

```sql
-- Select all columns from a table
SELECT * FROM customers;

-- Select specific columns with a condition
SELECT customer_id, name, city
FROM customers
WHERE city = 'San Francisco';

-- Aggregate data
SELECT city, COUNT(*) AS num_customers
FROM customers
GROUP BY city;

-- Join between tables
SELECT o.order_id, c.name, o.order_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';
```


Data Type Conversion: YugabyteDB & Snowflake

When loading data from YugabyteDB to Snowflake, be aware that certain data types may automatically convert. For example:

  • YugabyteDB `timestamp` may become `TIMESTAMP_NTZ` in Snowflake.
  • YugabyteDB `uuid` can become a `VARCHAR`.
  • YugabyteDB number types (`int`, `bigint`, `numeric`) convert to the closest numeric types in Snowflake.

Always review the mapping of source and destination data types to prevent loading errors or incorrect data representations.
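As an illustration, a source table in YugabyteDB might be paired with a Snowflake target definition along these lines. This is a sketch only; the table and column names are hypothetical, and the exact mappings should be validated against your driver and load tooling:

```sql
-- Hypothetical YugabyteDB (YSQL) source table
CREATE TABLE customers (
    customer_id  uuid PRIMARY KEY,
    name         text,
    city         text,
    balance      numeric(12,2),
    last_updated timestamp
);

-- One possible Snowflake target definition
CREATE TABLE customers (
    customer_id  VARCHAR(36),     -- uuid arrives as text
    name         VARCHAR,
    city         VARCHAR,
    balance      NUMBER(12,2),    -- numeric maps to NUMBER
    last_updated TIMESTAMP_NTZ    -- timestamp without time zone
);
```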


Data Loading Patterns

One-Off Initial Load

The recommended approach is to first run a comprehensive, one-time load that brings all records into Snowflake.

Pattern: Use a `SELECT` statement without a `WHERE` filter in the source Database Query component.

Example:

SELECT *
FROM orders;

Incremental Loads

Subsequent loads should bring in only new or changed records. Use the same Database Query component but add a `WHERE` filter based on a tracking column (such as `last_updated` or a monotonically increasing `id`).

Pattern: Add a filter condition to select only records newer than the last load timestamp.

Example with a timestamp:

SELECT *
FROM orders
WHERE last_updated > :LAST_LOAD_TIMESTAMP;

Example with an auto-incrementing ID:

SELECT *
FROM orders
WHERE order_id > :LAST_MAX_ID;
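One common way to obtain the filter value is to query the current high-water mark from the Snowflake side before each incremental run, then pass it into the component as a variable. A sketch, using the same placeholder table and columns as above:

```sql
-- Run against Snowflake to find the newest record already loaded
SELECT MAX(last_updated) AS last_load_timestamp FROM orders;

-- Or, when tracking by an auto-incrementing ID
SELECT MAX(order_id) AS last_max_id FROM orders;
```

The result feeds the `:LAST_LOAD_TIMESTAMP` or `:LAST_MAX_ID` parameter on the next run, so each incremental extract picks up exactly where the previous one stopped.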

Review Incremental Load Data Replication Strategy (Matillion) for more details.


Tip: Both initial and incremental loads should be performed via the same Database Query component. On the initial load, omit the filter; on incremental runs, supply the filter parameter. This streamlines your ETL or CDC processes and makes it easier to manage in production.

Data Integration Architecture

Loading data in advance of integration is a core principle of the ELT (Extract, Load, Transform) architecture, providing a “divide and conquer” approach by splitting the overall problem into two manageable steps: first loading the data into the Snowflake database, and then performing transformation and integration tasks separately. This structure allows organizations to efficiently handle large and diverse data sources without being constrained by a single process bottleneck. Data integration inherently requires the transformation of data—cleaning, enriching, and merging it from different sources—and this is best accomplished using robust data transformation pipelines that automate and orchestrate these complex operations. An additional advantage of the ELT approach is that both data transformation and integration processes take place directly inside the Snowflake environment. This not only ensures fast, on-demand, and highly scalable performance but also eliminates the need for expensive, separate data processing infrastructure, allowing organizations to leverage Snowflake’s native compute resources for all aspects of data integration.
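In practice, the “T” of ELT then runs entirely in Snowflake SQL once the raw tables are loaded. A sketch of one possible post-load transformation, reusing the hypothetical `orders` and `customers` tables from the earlier examples:

```sql
-- Build an integrated reporting table inside Snowflake
CREATE OR REPLACE TABLE customer_order_summary AS
SELECT
    c.customer_id,
    c.name,
    c.city,
    COUNT(o.order_id)  AS order_count,
    SUM(o.order_total) AS total_spend
FROM customers c
LEFT JOIN orders o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name, c.city;
```

Because this statement executes on Snowflake's own compute, it places no load on the YugabyteDB cluster and can be scheduled as a transformation step after each load completes.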

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.