Extracting data from YugabyteDB to Snowflake
Extracting data from YugabyteDB is a vital step for organizations looking to leverage the analytical strengths of Snowflake while maintaining transactional operations on YugabyteDB. This article provides a practical guide for data engineers and architects on how to efficiently transfer data from YugabyteDB into Snowflake, enabling robust reporting and analytics. We will begin by outlining how to create an identity in YugabyteDB, which establishes the necessary permissions for data extraction. For users working with Matillion, we will explain how to check for an existing JDBC driver compatible with YugabyteDB, and how to acquire one if needed. Additionally, we will discuss key considerations for ensuring secure and reliable network connectivity between YugabyteDB (the source) and Snowflake (the target). Finally, the process of querying and extracting data—both for the initial load and subsequent incremental updates—will be explored in detail. By following this guide, you will be equipped to optimize your data movement pipeline and maximize the value of your organization’s data assets.
What is YugabyteDB?
YugabyteDB is an open-source, distributed SQL database designed for high availability, linear scalability, and global data distribution. It features a PostgreSQL-compatible query layer (YSQL) and a Cassandra-compatible API (YCQL), enabling seamless integration with existing applications. Internally, YugabyteDB leverages a distributed, sharded storage engine based on a Raft consensus protocol, ensuring strong consistency and automatic failover. Its architecture supports multi-region and multi-cloud deployments, making it suitable for mission-critical workloads requiring zero downtime and low-latency access. Advanced features such as online schema changes, distributed ACID transactions, and point-in-time recovery further enhance its suitability for modern cloud-native environments.
What is Snowflake?
Snowflake is a cloud-native data platform with a unique multi-cluster shared data architecture that separates compute and storage for scalable, concurrent data processing. Available on AWS, Azure, and Google Cloud, it handles structured and semi-structured data (e.g., JSON, Avro, Parquet) and delivers efficient analytics with automatic tuning and minimal maintenance. Snowflake enables independent scaling of compute, supporting batch ETL, ad hoc queries, and real-time data sharing without resource contention. Built-in security—including automatic encryption, fine-grained access control, and auditing—meets enterprise compliance needs, making Snowflake a versatile, high-performance choice for modern data workloads.
Why Move Data from YugabyteDB into Snowflake
Unlocking Analytics: The Value of Transferring Data from YugabyteDB to Snowflake
A data engineer or architect might choose to copy data from YugabyteDB into Snowflake for several compelling reasons. Firstly, YugabyteDB often contains operational data that is highly valuable for analytical and business intelligence purposes. However, the true potential of this data is frequently realized only when it is integrated with datasets from other sources, enabling comprehensive analysis and richer insights. By leveraging Snowflake as the platform for such integration and analysis, organizations can take advantage of its scalable processing capabilities without placing additional analytical or computational load on the YugabyteDB cluster. This separation ensures that the performance of critical transactional workloads in YugabyteDB is maintained, while analysts and data scientists can work efficiently within the Snowflake environment.
Creating a User in YugabyteDB
To create a user in YugabyteDB, you will use standard PostgreSQL user (role) management commands, as YugabyteDB is PostgreSQL-compatible. Below are the steps and example SQL scripts for creating a new user.
1. Connect to YugabyteDB
First, connect to your YugabyteDB cluster using a database client such as `psql`:

```bash
psql -h <hostname> -p <port> -U <admin_user> -d <database>
```
Replace `<hostname>`, `<port>`, `<admin_user>`, and `<database>` with your environment’s values.
2. Create a User
To create a new user (role) named `app_user` with a password, use the following SQL:

```sql
CREATE ROLE app_user WITH LOGIN PASSWORD 'StrongPasswordHere';
```
3. Grant Privileges
By default, a new user has no privileges on any database objects. You can grant specific privileges as needed:
To grant CONNECT on a database:

```sql
GRANT CONNECT ON DATABASE mydb TO app_user;
```

To grant usage on a schema:

```sql
GRANT USAGE ON SCHEMA public TO app_user;
```

To grant privileges on specific tables:

```sql
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE my_table TO app_user;
```
4. Verify the User Creation
List the users (roles) in your YugabyteDB cluster:
```sql
\du
```
This will display all roles and their attributes.
Note: User and role management in YugabyteDB is performed similarly to PostgreSQL. Users are implemented as roles with the `LOGIN` attribute. Always choose strong passwords and follow best practices for privilege management.
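The user-creation and grant steps above can also be generated programmatically, which is handy when provisioning the same extraction role across environments. The sketch below is illustrative only: the role, database, and table names are placeholders, and it grants SELECT alone, since read-only access is typically sufficient for an extraction user.

```python
# Generate the role-creation and grant statements described above.
# All identifiers and the password are illustrative placeholders;
# read-only (SELECT) access is usually enough for extraction.

def extraction_user_ddl(user: str, database: str, tables: list[str]) -> list[str]:
    """Return SQL statements that create a read-only extraction role."""
    stmts = [
        f"CREATE ROLE {user} WITH LOGIN PASSWORD 'StrongPasswordHere';",
        f"GRANT CONNECT ON DATABASE {database} TO {user};",
        f"GRANT USAGE ON SCHEMA public TO {user};",
    ]
    for table in tables:
        stmts.append(f"GRANT SELECT ON TABLE {table} TO {user};")
    return stmts

for stmt in extraction_user_ddl("app_user", "mydb", ["orders", "customers"]):
    print(stmt)
```

Run the emitted statements through `psql` (or `\du` afterwards) to confirm the role was created as expected.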
Installing the JDBC Driver
At the time of writing, the YugabyteDB JDBC driver is not bundled with Matillion Data Productivity Cloud, primarily due to licensing and redistribution restrictions. This means it is necessary to manually obtain and install the required driver to enable connectivity between Matillion and your YugabyteDB database.
To proceed, follow these steps:
1. Download the JDBC Driver
- Visit the official PostgreSQL JDBC downloads page: https://jdbc.postgresql.org/
- Locate the most recent, stable release of the driver.
- Select the Type 4 JDBC driver (“pgjdbc”), as this is the recommended pure Java driver for database connectivity and maximum compatibility.
- Download the `.jar` file to your local system. Take note of the file location for use in the next steps.
2. Install the JDBC Driver in Matillion Data Productivity Cloud
Since this JDBC driver is not packaged with Matillion by default, you will need to upload the driver manually. Follow the process described in the official documentation:
Uploading External Drivers
- Log in to the Matillion Data Productivity Cloud user interface and access your Agent’s configuration.
- Use the ‘Upload External Driver’ feature to add the downloaded `.jar` file.
- Confirm the driver has been registered within your Agent’s environment.
Keep in mind that after the upload, the Agent may require a restart—or the driver may need to be assigned to a specific connector template, depending on your configuration.
3. Configure and Use the Driver inside Matillion Designer
Once the JDBC driver is installed and recognized by the Matillion Agent, you can leverage it to create database connections and orchestrate YugabyteDB queries within your pipelines.
Detailed usage instructions are documented here:
Using Database Query Component in Matillion Designer
- Configure connection parameters, referencing your YugabyteDB host, port, database name, credentials, and any additional required properties.
- Verify connectivity and start building your data workflows using Matillion’s Designer interface.
If you encounter errors during upload or database connectivity, consult the Matillion troubleshooting steps and YugabyteDB documentation to resolve compatibility or configuration issues.
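Because YugabyteDB’s YSQL layer is PostgreSQL-compatible, the connection string follows the standard PostgreSQL JDBC URL format. A minimal sketch of assembling one is below; the host and database names are placeholders, 5433 is YugabyteDB’s default YSQL port, and the SSL properties shown are common choices rather than required settings.

```python
# Build a PostgreSQL-style JDBC URL for YugabyteDB's YSQL endpoint.
# Host/database values are placeholders; 5433 is the default YSQL port.

def build_jdbc_url(host: str, database: str, port: int = 5433,
                   ssl: bool = True) -> str:
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    if ssl:
        # Common properties for encrypted connections; adjust to your setup.
        url += "?ssl=true&sslmode=require"
    return url

print(build_jdbc_url("yb.example.com", "mydb"))
# jdbc:postgresql://yb.example.com:5433/mydb?ssl=true&sslmode=require
```

The resulting URL, together with the username and password created earlier, is what you supply in the Matillion connection parameters.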
Checking network connectivity
When configuring connectivity between Matillion Data Productivity Cloud and your YugabyteDB database, you must ensure that the database accepts incoming network connections from the appropriate sources based on your deployment type:
- **Full SaaS agent configuration:** YugabyteDB must allow inbound connections from the set of outgoing IP addresses used by Matillion. The latest list of IP addresses to allow can be found here: Matillion Allowed IP Addresses.
- **Hybrid SaaS deployment:** YugabyteDB should be configured to accept incoming connections from the IP addresses of your own virtual private cloud (VPC). You can find utilities to help you determine your relevant VPC addresses—and to verify connectivity—at Matillion Network Access Check.
Additionally, if your YugabyteDB database is referenced using a DNS hostname (instead of a direct IP address), ensure that the relevant Matillion Full SaaS or Hybrid SaaS agent is able to resolve that address successfully. This may require configuring DNS resolution within your network or agent environment.
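Before configuring the pipeline, it can save time to verify both DNS resolution and TCP reachability of the database endpoint from the agent’s environment. A small sketch, assuming a placeholder hostname and the default YSQL port:

```python
# Quick reachability check for a YugabyteDB endpoint: verifies that the
# hostname resolves and that a TCP connection can be opened on the YSQL
# port. The host below is a placeholder for your actual endpoint.
import socket

def can_reach(host: str, port: int = 5433, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failure, connection refused, and timeout.
        return False

if __name__ == "__main__":
    print(can_reach("yb.example.com"))
```

A `False` result points at either firewall/allow-list rules or DNS configuration, which narrows down the checks described above.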
Querying Data from YugabyteDB
This guide explains how to query data from YugabyteDB using SQL `SELECT` statements, with an emphasis on data loading strategies and considerations for integration with Snowflake.
Example YugabyteDB SELECT Queries
Below are some common `SELECT` query patterns for YugabyteDB:
```sql
-- Select all columns from a table
SELECT * FROM customers;

-- Select specific columns with a condition
SELECT customer_id, name, city
FROM customers
WHERE city = 'San Francisco';

-- Aggregate data
SELECT city, COUNT(*) AS num_customers
FROM customers
GROUP BY city;

-- Join between tables
SELECT o.order_id, c.name, o.order_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';
```
Data Type Conversion: YugabyteDB & Snowflake
When loading data from YugabyteDB to Snowflake, be aware that certain data types may automatically convert. For example:
- YugabyteDB `timestamp` may become `TIMESTAMP_NTZ` in Snowflake.
- YugabyteDB `uuid` can become a `VARCHAR`.
- YugabyteDB number types (`int`, `bigint`, `numeric`) convert to the closest numeric types in Snowflake.
Always review the mapping of source and destination data types to prevent loading errors or incorrect data representations.
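One way to make that review systematic is to keep the expected mappings in a small lookup table that your pipeline validation can check against. The sketch below reflects typical defaults only, not an exhaustive or authoritative mapping; verify it against the behavior of your own tooling.

```python
# Illustrative mapping of common YugabyteDB (YSQL/PostgreSQL) types to
# typical Snowflake equivalents. Not exhaustive or authoritative; verify
# against your loading tool's actual conversions.
YB_TO_SNOWFLAKE = {
    "timestamp":   "TIMESTAMP_NTZ",
    "timestamptz": "TIMESTAMP_TZ",
    "uuid":        "VARCHAR",
    "int":         "NUMBER(38,0)",
    "bigint":      "NUMBER(38,0)",
    "numeric":     "NUMBER",
    "text":        "VARCHAR",
    "boolean":     "BOOLEAN",
}

def snowflake_type(yb_type: str) -> str:
    """Look up the Snowflake type, defaulting to VARCHAR for unknown types."""
    return YB_TO_SNOWFLAKE.get(yb_type.lower(), "VARCHAR")
```

Comparing this table against the column types your load actually produces catches silent conversions before they cause downstream errors.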
Data Loading Patterns
One-Off Initial Load
The recommended approach is to first run a comprehensive, one-time load that brings all records into Snowflake.
Pattern:
- Use a SELECT statement without a `WHERE` filter in the source Database Query component.
Example:
```sql
SELECT * FROM orders;
```
Incremental Loads
Subsequent loads should bring in only new or changed records. Use the same Database Query component but add a `WHERE` filter based on a tracking column (such as `last_updated` or a monotonically increasing `id`).
Pattern:
- Add a filter condition to select only records newer than the last load timestamp.
Example with a timestamp:

```sql
SELECT * FROM orders WHERE last_updated > :LAST_LOAD_TIMESTAMP;
```

Example with an auto-incrementing ID:

```sql
SELECT * FROM orders WHERE order_id > :LAST_MAX_ID;
```
Review Incremental Load Data Replication Strategy (Matillion) for more details.
Tip: Both initial and incremental loads should be performed via the same Database Query component. On the initial load, omit the filter; on incremental runs, supply the filter parameter. This streamlines your ETL or CDC processes and makes it easier to manage in production.
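The single-component pattern in the tip above can be sketched as one query builder that omits the filter when no watermark exists yet and applies it on incremental runs. Table and column names below are illustrative; in a real pipeline, pass the watermark as a bind variable rather than interpolating it into the SQL string.

```python
# Build the extraction query for the pattern above: no filter on the
# first run, a watermark filter on incremental runs. Names are
# illustrative; production code should use bind variables, not
# string interpolation, to avoid SQL injection.
from typing import Optional

def build_extract_query(table: str, tracking_column: str,
                        last_value: Optional[str] = None) -> str:
    query = f"SELECT * FROM {table}"
    if last_value is not None:  # incremental run: only new/changed rows
        query += f" WHERE {tracking_column} > '{last_value}'"
    return query + ";"

print(build_extract_query("orders", "last_updated"))
# SELECT * FROM orders;
print(build_extract_query("orders", "last_updated", "2024-01-01 00:00:00"))
```

After each incremental run, persist the maximum observed `last_updated` (or `order_id`) value so it can serve as the watermark for the next run.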
Data Integration Architecture
Loading data in advance of integration is a core principle of the ELT (Extract, Load, Transform) architecture, providing a “divide and conquer” approach by splitting the overall problem into two manageable steps: first loading the data into the Snowflake database, and then performing transformation and integration tasks separately. This structure allows organizations to efficiently handle large and diverse data sources without being constrained by a single process bottleneck. Data integration inherently requires the transformation of data—cleaning, enriching, and merging it from different sources—and this is best accomplished using robust data transformation pipelines that automate and orchestrate these complex operations. An additional advantage of the ELT approach is that both data transformation and integration processes take place directly inside the Snowflake environment. This not only ensures fast, on-demand, and highly scalable performance but also eliminates the need for expensive, separate data processing infrastructure, allowing organizations to leverage Snowflake’s native compute resources for all aspects of data integration.