Integrate data from CockroachDB to Amazon Redshift using Matillion

Our CockroachDB to Redshift connector seamlessly transfers your data to Redshift within minutes, ensuring it remains current without the need for manual coding or managing complex ETL scripts.


Extracting data from CockroachDB to Amazon Redshift

Extracting data from CockroachDB is a common requirement for organizations looking to integrate or analyze their transactional data within scalable analytics platforms such as Amazon Redshift. A successful extract, transform, and load (ETL) process involves careful consideration of several preparatory and operational steps to ensure a secure, efficient, and reliable data transfer. In this article, we will provide a step-by-step guide for moving data from CockroachDB to Redshift, with emphasis on the following key areas:

- **Creating an identity in CockroachDB:** Establish the necessary user permissions and roles to facilitate secure data extraction.
- **Checking or acquiring the JDBC driver (for Matillion users):** Ensure that Matillion ETL can connect to CockroachDB by verifying or installing the appropriate JDBC driver.
- **Network connectivity from source to target:** Validate connectivity between your CockroachDB instance and Redshift environment to prevent unforeseen network issues during data transfer.
- **Querying data, initially and incrementally:** Develop effective strategies for both initial full extracts and ongoing incremental data loads.

By following the guidance in this article, you will be well-equipped to design and implement a robust pipeline for moving your data from CockroachDB to Redshift, whether you use Matillion or another ETL solution.


What is CockroachDB?

CockroachDB is a distributed SQL database designed for high availability, strong consistency, and effortless horizontal scaling. Architected to tolerate network or node failures, CockroachDB replicates data automatically across multiple nodes and regions using the Raft consensus protocol, thus minimizing the risk of data loss and downtime. It offers a familiar PostgreSQL-compatible interface, enabling seamless integration with existing tools and applications, while providing transactional guarantees via serializable isolation. CockroachDB’s architecture enables operational simplicity, supporting elastic scale-out and self-healing capabilities, which make it well-suited for globally distributed, latency-sensitive workloads demanding continuous uptime.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service designed for high-performance analytics and reporting workloads. It utilizes columnar storage, massively parallel processing (MPP), and advanced query optimization techniques to efficiently handle complex SQL queries across large datasets. Redshift is closely integrated with the AWS ecosystem, offering features such as seamless data loading from Amazon S3, support for federated queries across multiple data sources, and comprehensive security controls including encryption and VPC support. With its ability to scale compute and storage independently, organizations can optimize for both workload performance and cost-effectiveness, making Redshift well-suited for modern data warehousing needs.

Why Move Data from CockroachDB into Amazon Redshift

Advantages of Replicating CockroachDB Data to Redshift for Advanced Analytics

A data engineer or architect may choose to copy data from CockroachDB into Amazon Redshift for several strategic reasons. CockroachDB is often the source of valuable operational data generated from distributed applications, but its true value is realized when this data is analyzed in the broader context of other business sources. By integrating CockroachDB data with information stored in additional systems, organizations can uncover deeper insights and make more informed decisions. Utilizing Redshift for this integration is beneficial because it is specifically designed for large-scale analytics workloads, enabling complex queries and data processing without imposing extra computational burden on CockroachDB. This approach preserves the performance and responsiveness of CockroachDB for transactional workloads, while leveraging Redshift’s analytical capabilities to derive actionable intelligence from integrated data.

Creating an Identity in CockroachDB

This guide explains how to create a user (or identity) in a CockroachDB database via SQL. CockroachDB manages users as roles, which are granted login privileges.

Prerequisites

  • Access to a CockroachDB SQL client (e.g., the cockroach sql CLI, or a SQL connection via a tool such as DBeaver).
  • Sufficient administrative privileges (typically via the root user or a role with the CREATEROLE privilege).

1. Connect to the Database

Open your SQL shell and connect as an administrative user. For example:

cockroach sql --host=<hostname> --user=root

2. Create a User

Users are represented as roles with the LOGIN option enabled.

CREATE USER <username>;

Or, equivalently:

CREATE ROLE <username> WITH LOGIN;

Example:

CREATE USER alice;

3. Set a Password (Optional but Recommended)

Set a password for the newly created user:

ALTER USER <username> WITH PASSWORD '<strong_password>';

Example:

ALTER USER alice WITH PASSWORD 'S3cretP@ssw0rd!';

4. Verify the User

Query the list of users (roles with the LOGIN privilege):

SELECT username FROM system.users;

5. (Optional) Grant Database Privileges

After creating the user, you may want to grant access to a specific database:

GRANT <privilege> ON DATABASE <database_name> TO <username>;

Example:

GRANT CONNECT ON DATABASE movr TO alice;

For fine-grained permissions (e.g., on tables):

GRANT SELECT, INSERT ON TABLE movr.customers TO alice;


Refer to the official CockroachDB SQL documentation for advanced options and best practices around user management.
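The steps above can be collected into a small helper that emits the DDL statements in order. This is an illustrative Python sketch of our own (the function names and identifier check are not part of CockroachDB or Matillion); in practice, run the statements through your SQL client and supply the password as a bind parameter rather than interpolating it into the string.

```python
import re

def quote_ident(name: str) -> str:
    """Validate and quote a SQL identifier to avoid injection in generated DDL."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid identifier: {name!r}")
    return f'"{name}"'

def user_setup_statements(username: str, database: str) -> list[str]:
    """Return the user-creation DDL from the steps above, in execution order."""
    u, db = quote_ident(username), quote_ident(database)
    return [
        f"CREATE USER {u};",
        f"ALTER USER {u} WITH PASSWORD %s;",  # pass the password as a bind parameter
        f"GRANT CONNECT ON DATABASE {db} TO {u};",
    ]
```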

Installing the JDBC Driver

At the time of writing, the JDBC driver for CockroachDB is not included by default within the Matillion Data Productivity Cloud platform. This omission is due to licensing or redistribution restrictions that prevent Matillion from packaging or distributing the CockroachDB JDBC driver directly with the product. As a result, you will need to manually download and install the driver to enable connectivity.

Step 1: Download the CockroachDB JDBC Driver

Visit the CockroachDB official documentation at the following link to obtain the current JDBC driver:
https://www.cockroachlabs.com/docs/stable/connect-to-cockroachdb.html

When selecting your driver, ensure you choose a Type 4 JDBC driver. Type 4 drivers are written entirely in Java and communicate directly with the database over the network, making them the preferred and most broadly compatible option for integration with Matillion Data Productivity Cloud.

Step 2: Install the JDBC Driver in Matillion Data Productivity Cloud

Once you have downloaded the CockroachDB JDBC driver (usually as a JAR file), you will need to upload and register it within the Matillion environment.

Please refer to Matillion's official documentation for a detailed, step-by-step guide to upload third-party JDBC drivers:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

Follow the platform-specific instructions to ensure that the driver is correctly placed and recognized by Matillion's service agent or runtime environment. This usually involves navigating to an administrative interface and uploading the JAR, or placing the file in a specified directory, followed by a system or agent restart if required.

Step 3: Configure and Use the JDBC Driver

After successful installation, you can use the new connection option within Matillion Data Productivity Cloud to configure and use the CockroachDB driver for your data integration jobs.

Detailed usage instructions, including how to connect to your database and construct data queries, can be found in Matillion's Database Query documentation:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

These guides describe the creation of database connections, specifying driver options, and integrating with database query components as part of your orchestration and transformation workflows in the platform.

Checking Network Connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your CockroachDB database, you must confirm that your database allows incoming connections according to your deployment configuration.

Note: If your CockroachDB database connection uses a DNS address (rather than a direct IP address), ensure that the Full SaaS or Hybrid SaaS agent can resolve this DNS name to the appropriate IP address. Proper DNS resolution is required for the agent to establish a connection with your database.
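One quick way to pre-flight the DNS and TCP requirements described above is a short check run from the machine (or agent host) that will make the connection. This Python sketch is an illustrative helper of our own, not part of Matillion; 26257 is CockroachDB's default SQL port.

```python
import socket

def can_resolve_and_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if `host` resolves via DNS and `port` accepts a TCP connection."""
    try:
        sockaddr = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4]
    except socket.gaierror:
        return False          # DNS resolution failed
    try:
        with socket.create_connection(sockaddr[:2], timeout=timeout):
            return True       # TCP handshake succeeded
    except OSError:
        return False          # port closed, filtered, or unreachable

# Example: can_resolve_and_reach("my-cockroach.example.internal", 26257)
```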

Querying Data from CockroachDB

This guide explains how to query data from a CockroachDB database, focusing on patterns for one-off initial loads and incremental loads, particularly when moving data into systems such as Amazon Redshift. We also discuss relevant SQL examples and important considerations regarding datatype conversion.


SQL SELECT Examples for CockroachDB

To retrieve data from CockroachDB, use standard SQL SELECT statements. Here are some example queries:

```sql
-- Select all columns from the orders table
SELECT * FROM orders;

-- Select specific columns with filtering
SELECT id, customer_id, created_at
FROM orders
WHERE status = 'processed';

-- Aggregate function example
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
```


Datatype Conversion Considerations

When transferring data from CockroachDB to Amazon Redshift (or other databases), datatype conversions may be required. Some common conversions include:

| CockroachDB Type | Amazon Redshift Equivalent |
| ---------------- | -------------------------- |
| SERIAL or INT    | INTEGER                    |
| STRING           | VARCHAR                    |
| TIMESTAMP        | TIMESTAMP                  |
| DECIMAL          | NUMERIC                    |
| BOOL             | BOOLEAN                    |

Note: Always validate compatibility and test conversions, especially for complex datatypes or when dealing with timezone and encoding nuances.
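As a starting point, the mappings above can be encoded as a simple lookup. This Python helper is a hypothetical sketch of our own; note that it drops length and precision modifiers (e.g. STRING(50)), which you would need to carry over to the Redshift column definition yourself.

```python
# Mapping taken from the conversion table above
COCKROACH_TO_REDSHIFT = {
    "SERIAL": "INTEGER",
    "INT": "INTEGER",
    "STRING": "VARCHAR",
    "TIMESTAMP": "TIMESTAMP",
    "DECIMAL": "NUMERIC",
    "BOOL": "BOOLEAN",
}

def redshift_type(cockroach_type: str) -> str:
    """Map a CockroachDB type name to its Redshift equivalent per the table above."""
    base = cockroach_type.split("(")[0].strip().upper()   # e.g. STRING(50) -> STRING
    try:
        return COCKROACH_TO_REDSHIFT[base]
    except KeyError:
        raise ValueError(f"no known Redshift mapping for {cockroach_type!r}")
```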


One-off Initial Load vs. Incremental Load

A common data replication pattern is:

  1. One-off Initial Load: Extract all required data from CockroachDB and transfer it into the target system.
  2. Incremental Load: After the initial load, move only new or changed data by filtering the data source.

Both load types typically use the same Database Query component (such as in ETL/ELT tools).

Initial Load

During the initial load, you generally do not use a filter clause. For example:

SELECT * FROM orders;

This fetches all records from the orders table, moving the entire dataset. Use this approach only once at the beginning of your data pipeline process.

Incremental Load

In incremental loads, you extract only new or updated records, based on a suitable filter (e.g., a timestamp column). For example:

SELECT * FROM orders WHERE updated_at > '2024-06-01 00:00:00';

In practice, you would replace '2024-06-01 00:00:00' with a dynamic value reflecting the last successful load time. This ensures your sync process is efficient and up-to-date.
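Because both load modes differ only in the filter clause, they can share a single query builder with the watermark as the only variable. A minimal Python sketch (the function name and the updated_at column are illustrative assumptions, not part of Matillion):

```python
from datetime import datetime
from typing import Optional

def extract_query(table: str, watermark: Optional[datetime]) -> str:
    """Build the extract query: full scan for the initial load, filtered for incremental."""
    base = f"SELECT * FROM {table}"
    if watermark is None:
        return base + ";"   # one-off initial load: no filter clause
    # In a real pipeline, pass the watermark as a bind parameter, not a string literal.
    return base + f" WHERE updated_at > '{watermark:%Y-%m-%d %H:%M:%S}';"
```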

For detailed information on incremental loads and strategies, refer to the Incremental Load Data Replication Strategy article on Matillion Exchange.


Tip: Always use the same Database Query component for both initial and incremental loads for consistency and maintainability. The only difference will be the presence or absence of the filter clause.


Data Integration Architecture

Loading data in advance of integration is a practical way to divide and conquer complex data workflows, as it splits the process into two manageable steps: first, extracting and loading data into the Redshift database, and second, integrating and transforming it as needed. This approach illustrates one of the principal advantages of the ELT (Extract, Load, Transform) architecture. Within this framework, data integration inherently involves transforming data to meet the requirements of analytical or operational use cases, and the most efficient method for accomplishing this is through data transformation pipelines that can automate and streamline the process. Another significant benefit of the ELT architecture is that data transformation and integration take place directly within the target Redshift environment. This enables processing that is fast, scalable, and self-service, eliminating the need for separate, dedicated data transformation infrastructure and associated costs.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.