
Integrate data from YugabyteDB to Amazon Redshift using Matillion

Our YugabyteDB to Redshift connector gets data flowing into Redshift within minutes, keeping your information current without the need for manual coding or intricate ETL scripts.


Extracting data from YugabyteDB to Amazon Redshift

Extracting data from YugabyteDB is an important step for organizations looking to leverage the scalability of distributed SQL in combination with the analytics capabilities of cloud data warehouses like Amazon Redshift. In this article, we will walk you through the essential stages of orchestrating a smooth data pipeline between YugabyteDB and Redshift. We will begin by explaining how to create an identity within YugabyteDB to securely enable access for data extraction activities. For teams using Matillion for orchestration, we will review how to verify if the appropriate JDBC driver is available for YugabyteDB, or how to acquire it if necessary. Ensuring reliable network connectivity between the source (YugabyteDB) and the target (Redshift) is a critical prerequisite, so we will cover configuration considerations on both ends. Finally, we will guide you through querying and migrating your data, discussing both initial full-load strategies and ongoing incremental extraction techniques to support up-to-date, efficient data movement. By the end of this article, you should be equipped with practical steps and considerations to effectively bring your YugabyteDB data into Redshift, unlocking advanced analytics and reporting capabilities.


What is YugabyteDB?

YugabyteDB is an open-source, distributed SQL database designed for cloud-native applications needing high scalability, strong consistency, and global distribution. It uses a hybrid architecture separating query processing (YSQL/YCQL APIs) from distributed storage, inspired by Google Spanner and backed by the Raft protocol for consistency and fault tolerance. YugabyteDB offers distributed ACID transactions, automatic sharding, and synchronous replication for low-latency, horizontally scalable operations across cloud or commodity hardware. Its PostgreSQL-compatible YSQL API lets developers reuse existing tools and skills, making YugabyteDB ideal for organizations modernizing transactional workloads without sacrificing data integrity or developer productivity.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service by AWS, designed for rapid querying and analysis of structured data using SQL. Built on PostgreSQL, Redshift leverages columnar storage, advanced compression, and parallel processing to optimize query performance across vast datasets. It supports seamless integration with a broad ecosystem of analytics and ETL tools, enabling efficient data ingestion, transformation, and retrieval. Redshift’s architecture includes features such as result caching, automatic scaling, and concurrency scaling to handle variable workloads with low latency. Security, availability, and data durability are ensured through encryption, automated backups, and replicated storage across multiple availability zones.

Why Move Data from YugabyteDB into Amazon Redshift

Unlocking Analytical Insights: Copying Data from YugabyteDB to Redshift

A data engineer or architect may choose to copy data from YugabyteDB into Amazon Redshift for several compelling reasons. YugabyteDB often houses valuable, operational data that, when combined with information from other enterprise sources, can yield deeper business insights and support advanced analytics. However, performing heavy analytical queries or complex data integration tasks directly on the operational YugabyteDB cluster can introduce undesirable load, potentially impacting application performance and reliability. By extracting data into Redshift—a service optimized for fast, scalable analytics—organizations can integrate YugabyteDB data with other sources and perform resource-intensive analysis without burdening mission-critical transactional systems. This approach ensures operational stability while unlocking the full analytic potential of the data.

Creating a User in YugabyteDB

To create a new user in YugabyteDB, you will use SQL commands such as CREATE ROLE along with role attributes to control permissions. YugabyteDB is PostgreSQL-compatible, so you can follow familiar CREATE ROLE or CREATE USER patterns. The following instructions assume you are connected to YugabyteDB's ysqlsh shell or another SQL client with administrator privileges.

1. Connect to Your YugabyteDB Cluster

ysqlsh -h <hostname> -p <port> -U <admin_user> -d <database>

Replace <hostname>, <port>, <admin_user>, and <database> as required.

2. Create the User

You can use either CREATE ROLE or CREATE USER. CREATE USER is essentially shorthand for CREATE ROLE with the LOGIN attribute.

CREATE ROLE appuser WITH LOGIN PASSWORD 'StrongPassword123';

or:

CREATE USER appuser WITH PASSWORD 'StrongPassword123';

This creates a user with the ability to log in, but with no additional privileges.

3. Grant Privileges

After creating the user, grant appropriate privileges based on your use case.

For example, to give access to all objects in a schema:

GRANT USAGE ON SCHEMA public TO appuser;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO appuser;
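One caveat worth noting: GRANT ... ON ALL TABLES IN SCHEMA applies only to tables that exist when the statement runs. In PostgreSQL-compatible databases, you can also set default privileges so that tables created later are covered automatically. A sketch, again assuming the public schema and the appuser role:

```
ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO appuser;
```

Without this, tables created after the initial GRANT would be invisible to appuser until privileges are granted again.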

To allow the user to create databases:

ALTER ROLE appuser CREATEDB;

To give superuser rights (use with caution):

ALTER ROLE appuser SUPERUSER;

4. (Optional) Set Role Attributes

You can further specify role attributes, such as the user’s ability to inherit privileges or create other roles:

ALTER ROLE appuser WITH INHERIT;           -- default, role inherits privileges
ALTER ROLE appuser WITH CREATEROLE;        -- allows user to create other roles

5. List Existing Users and Roles

To view users and their attributes:

\du

This command works inside ysqlsh and displays information about all roles.


Note: Always use strong passwords and grant only the privileges necessary for your users’ tasks.

Installing the JDBC Driver

At the time of writing, the JDBC driver required for YugabyteDB is not included by default within Matillion Data Productivity Cloud. This is due to licensing or redistribution restrictions, so you will need to obtain and install the driver manually before establishing any database connections.

1. Downloading the JDBC Driver

The preferred JDBC driver for YugabyteDB is a Type 4 driver, which is a pure Java implementation and does not require any native libraries. YugabyteDB is compatible with the PostgreSQL JDBC driver, which you can obtain directly from the official PostgreSQL JDBC website. There, locate and download the latest release of the Type 4 PostgreSQL JDBC driver (the .jar file).

2. Installing the JDBC Driver in Matillion Data Productivity Cloud

Once the JDBC driver is downloaded, you will need to upload it to your Matillion Agent as an external driver. Matillion's documentation provides clear step-by-step instructions for this process.

Carefully follow the outlined steps to ensure your driver is placed correctly and available for use within your Matillion environment.

3. Using the YugabyteDB JDBC Driver

After the driver is installed, you can configure your database queries in Matillion Data Productivity Cloud using the new YugabyteDB connection. Refer to Matillion's documentation for detailed instructions on establishing database connections and configuring queries with the JDBC driver you installed.
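Because YugabyteDB speaks the PostgreSQL wire protocol, the connection URL follows the standard PostgreSQL JDBC URL format. The following is a sketch, with placeholder host and database names; note that YugabyteDB's YSQL API listens on port 5433 by default, not PostgreSQL's usual 5432:

```
jdbc:postgresql://<hostname>:5433/<database>
```

Credentials (such as the appuser role created earlier) are typically supplied through Matillion's connection settings rather than embedded in the URL.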

Checking network connectivity

To ensure successful integration between Matillion Data Productivity Cloud and your YugabyteDB database, you must confirm that your YugabyteDB instance accepts incoming connections from the correct sources. The configuration requirements vary depending on your Matillion deployment type:

  • Full SaaS Agent Configuration:
    YugabyteDB must allow incoming connections from the specific IP addresses used by Matillion Data Productivity Cloud’s SaaS infrastructure. You can find the current list of these IP addresses in Matillion's documentation. Make sure your firewall or access control lists are updated to permit traffic from all relevant IP addresses.

  • Hybrid SaaS Deployment:
    In a Hybrid SaaS configuration, network traffic to YugabyteDB will originate from within your own virtual private cloud (VPC). Ensure that your YugabyteDB database is configured to accept connections from the internal IP address ranges of your VPC environment. To check network connectivity and troubleshoot any issues, you can use utilities available at Matillion Exchange.

Additionally, if you are referencing your YugabyteDB database using a DNS hostname, the Matillion agent (either Full SaaS or Hybrid SaaS) must be able to resolve that DNS address to connect successfully. Make sure that the DNS name you provide is resolvable from the relevant network environment.
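Before configuring the connection in Matillion, it can save time to verify DNS resolution and TCP reachability from the relevant network environment. The following is a minimal sketch using Python's standard library; the hostname is a placeholder, and port 5433 (YugabyteDB's default YSQL port) is an assumption you should adjust to match your deployment.

```python
import socket

def check_connectivity(host: str, port: int, timeout: float = 5.0) -> bool:
    """Resolve a hostname and attempt a TCP connection to the given port."""
    try:
        # DNS resolution check (returns an IPv4 address)
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        print(f"DNS resolution failed for {host}")
        return False
    try:
        # TCP reachability check
        with socket.create_connection((ip, port), timeout=timeout):
            print(f"{host} ({ip}) is reachable on port {port}")
            return True
    except OSError:
        print(f"{host} ({ip}) resolved, but port {port} is not reachable")
        return False

# Example: YugabyteDB's YSQL API listens on port 5433 by default
# check_connectivity("yugabyte.example.internal", 5433)
```

If DNS resolution fails in a Hybrid SaaS deployment, check your VPC's resolver configuration; if resolution succeeds but the port is unreachable, revisit your firewall and security-group rules.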

Querying Data from YugabyteDB for Integration Workloads

When working with YugabyteDB, a distributed SQL database, external data integration and replication processes often require extracting data using SQL queries. Below you will find guidance and best practices for building such queries, especially in the context of ETL tools (like Matillion) targeting further ingestion into data warehouses such as Amazon Redshift.

Example YugabyteDB SELECT Queries

Below are some practical examples that demonstrate querying data from YugabyteDB tables:

```
-- Select all columns from a customers table
SELECT * FROM customers;

-- Select specific columns where a condition is met
SELECT id, name, email FROM customers WHERE signup_date >= '2024-01-01';

-- Use aggregation and grouping
SELECT country, COUNT(*) AS user_count FROM customers GROUP BY country;

-- Join two tables
SELECT o.id, o.total_amount, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```

Considerations for Datatype Conversion

YugabyteDB is PostgreSQL-compatible, but some differences may exist between YugabyteDB and Amazon Redshift datatypes.
Common conversion considerations include:

  • timestamp in YugabyteDB may map to timestamp without time zone or timestamp with time zone in Redshift.
  • YugabyteDB text maps to Redshift VARCHAR.
  • YugabyteDB supports JSONB; Redshift supports SUPER, but explicit transformation may be required.
  • Numeric precision/scale may differ (e.g., numeric/decimal types).
  • Always review and map source-to-target datatypes to prevent loading issues.
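If you script your schema migration, the considerations above can be captured as a simple lookup table in your orchestration code. The mapping below is an illustrative sketch with common defaults, not an authoritative conversion table; always verify lengths, precision, and scale against your actual schema.

```python
# Illustrative source-to-target type mapping for a YugabyteDB -> Redshift
# pipeline. These are common defaults, not an exhaustive mapping.
YUGABYTE_TO_REDSHIFT = {
    "text": "VARCHAR(65535)",      # Redshift VARCHAR needs an explicit max length
    "timestamp": "TIMESTAMP",      # timestamp without time zone
    "timestamptz": "TIMESTAMPTZ",  # timestamp with time zone
    "jsonb": "SUPER",              # may need explicit transformation on load
    "numeric": "DECIMAL(38, 9)",   # confirm precision/scale per column
    "boolean": "BOOLEAN",
    "bigint": "BIGINT",
}

def map_type(source_type: str) -> str:
    """Return the Redshift type for a YugabyteDB type, defaulting to VARCHAR."""
    return YUGABYTE_TO_REDSHIFT.get(source_type.lower(), "VARCHAR(65535)")
```

A default of VARCHAR(65535) is a safe catch-all for unrecognized types, at the cost of wider-than-necessary columns.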

Loading Patterns: Initial vs Incremental Loads

For high-performance and scalable data ingestion, it's recommended to use an initial full load followed by periodic incremental loads. The Database Query component provides a flexible means to accomplish both.

Initial Load

  • Pattern: Extract all data from the source table.
  • Query Structure: No WHERE clause or filter; all data is selected.

-- Example for Initial Load
SELECT * FROM orders;

Incremental Load

  • Pattern: Extract only new or modified data since the last load.
  • Query Structure: Use a WHERE clause to filter data (usually by timestamp or increasing ID).

-- Assume last_loaded_at is the latest processed order timestamp
SELECT * FROM orders
WHERE updated_at > '{{last_loaded_at}}';

Here, {{last_loaded_at}} is a placeholder to be replaced with the appropriate value from your ETL orchestration logic.

Reusing the Database Query Component

In ETL tools, use the same Database Query component for both initial and incremental queries: simply parameterize the WHERE clause so that it is empty on the initial load and active in subsequent runs.

-- Pseudocode pattern
SELECT * FROM your_table
{{WHERE_CLAUSE}}

Set WHERE_CLAUSE to empty ('') for the initial load, and to something like WHERE updated_at > '{{last_loaded_at}}' for incremental loads.
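In orchestration code, this pattern reduces to a small helper that emits no filter on the first run and a watermark filter afterwards. The sketch below is illustrative: build_extract_query is a hypothetical helper, and in production the watermark value should come from trusted pipeline metadata (or a parameterized query) rather than raw string interpolation.

```python
from typing import Optional

def build_extract_query(table: str, last_loaded_at: Optional[str] = None) -> str:
    """Full extract when no watermark exists; incremental extract otherwise."""
    query = f"SELECT * FROM {table}"
    if last_loaded_at is not None:
        # Incremental: only rows modified since the last successful load
        query += f" WHERE updated_at > '{last_loaded_at}'"
    return query

# Initial load (no watermark recorded yet)
print(build_extract_query("orders"))

# Subsequent incremental load, using the stored watermark
print(build_extract_query("orders", "2024-06-01 00:00:00"))
```

After each successful incremental run, record the maximum updated_at value processed so the next run can use it as the new watermark.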


For more detailed patterns and real-life implementation, refer to Matillion Exchange's Incremental Load Article.

Data Integration Architecture

Loading data into Amazon Redshift prior to integration is an application of the “divide and conquer” principle, effectively splitting the data integration process into two distinct steps. This approach is a hallmark of the ELT (Extract, Load, Transform) architecture, where raw data is first loaded and staged within the Redshift database before any further integration or transformation is performed. Since data integration frequently involves complex transformations to harmonize and enrich incoming datasets, leveraging a data transformation pipeline within Redshift becomes the optimal strategy. One of the significant advantages of ELT is that both transformation and integration operations are executed directly within the target Redshift environment. This design ensures rapid, on-demand, and scalable processing, utilizing Redshift's native compute resources and parallel query execution capabilities. As a result, organizations benefit from efficient data operations without incurring the additional costs or complexity associated with external data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.