Extracting data from CockroachDB to Amazon Redshift
Extracting data from CockroachDB is a common requirement for organizations looking to integrate or analyze their transactional data within scalable analytics platforms such as Amazon Redshift. A successful extract, transform, and load (ETL) process involves careful consideration of several preparatory and operational steps to ensure a secure, efficient, and reliable data transfer. In this article, we will provide a step-by-step guide for moving data from CockroachDB to Redshift, with emphasis on the following key areas:

- **Creating an identity in CockroachDB:** Establish the necessary user permissions and roles to facilitate secure data extraction.
- **For Matillion users, checking or acquiring the JDBC driver:** Ensure that Matillion ETL can connect to CockroachDB by verifying or installing the appropriate JDBC driver.
- **Network connectivity from source to target:** Validate connectivity between your CockroachDB instance and Redshift environment to prevent unforeseen network issues during data transfer.
- **Querying data, initially and incrementally:** Develop effective strategies for both initial full extracts and ongoing incremental data loads.

By following the guidance in this article, you will be well-equipped to design and implement a robust pipeline for moving your data from CockroachDB to Redshift, whether you use Matillion or another ETL solution.
What is CockroachDB?
CockroachDB is a distributed SQL database designed for high availability, strong consistency, and effortless horizontal scaling. Architected to tolerate network or node failures, CockroachDB replicates data automatically across multiple nodes and regions using the Raft consensus protocol, thus minimizing the risk of data loss and downtime. It offers a familiar PostgreSQL-compatible interface, enabling seamless integration with existing tools and applications, while providing transactional guarantees via serializable isolation. CockroachDB’s architecture enables operational simplicity, supporting elastic scale-out and self-healing capabilities, which make it well-suited for globally distributed, latency-sensitive workloads demanding continuous uptime.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service designed for high-performance analytics and reporting workloads. It utilizes columnar storage, massively parallel processing (MPP), and advanced query optimization techniques to efficiently handle complex SQL queries across large datasets. Redshift is closely integrated with the AWS ecosystem, offering features such as seamless data loading from Amazon S3, support for federated queries across multiple data sources, and comprehensive security controls including encryption and VPC support. With its ability to scale compute and storage independently, organizations can optimize for both workload performance and cost-effectiveness, making Redshift well-suited for modern data warehousing needs.
Why Move Data from CockroachDB into Amazon Redshift
Advantages of Replicating CockroachDB Data to Redshift for Advanced Analytics
A data engineer or architect may choose to copy data from CockroachDB into Amazon Redshift for several strategic reasons. CockroachDB is often the source of valuable operational data generated from distributed applications, but its true value is realized when this data is analyzed in the broader context of other business sources. By integrating CockroachDB data with information stored in additional systems, organizations can uncover deeper insights and make more informed decisions. Utilizing Redshift for this integration is beneficial because it is specifically designed for large-scale analytics workloads, enabling complex queries and data processing without imposing extra computational burden on CockroachDB. This approach preserves the performance and responsiveness of CockroachDB for transactional workloads, while leveraging Redshift’s analytical capabilities to derive actionable intelligence from integrated data.
Creating an Identity in CockroachDB
This guide explains how to create a user (or identity) in a CockroachDB database via SQL. CockroachDB manages users as roles, which are granted login privileges.
Prerequisites
- Access to a CockroachDB SQL client (e.g., the `cockroach sql` CLI, or a SQL connection via a tool such as DBeaver).
- Sufficient administrative privileges (typically via the `root` user or a role with the `CREATEROLE` privilege).
1. Connect to the Database
Open your SQL shell and connect as an administrative user. For example:
```shell
cockroach sql --host=<hostname> --user=root
```
2. Create a User
Users are represented as roles with the `LOGIN` option enabled.

```sql
CREATE USER <username>;
```

Or, equivalently:

```sql
CREATE ROLE <username> WITH LOGIN;
```

Example:

```sql
CREATE USER alice;
```
3. Set a Password (Optional but Recommended)
Set a password for the newly created user:
```sql
ALTER USER <username> WITH PASSWORD '<strong_password>';
```

Example:

```sql
ALTER USER alice WITH PASSWORD 'S3cretP@ssw0rd!';
```
4. Verify the User
Query the list of users (roles with the `LOGIN` privilege):

```sql
SELECT username FROM system.users;
```
5. (Optional) Grant Database Privileges
After creating the user, you may want to grant access to a specific database:
```sql
GRANT <privilege> ON DATABASE <database_name> TO <username>;
```

Example:

```sql
GRANT CONNECT ON DATABASE movr TO alice;
```
For fine-grained permissions (e.g., on tables):
```sql
GRANT SELECT, INSERT ON TABLE movr.customers TO alice;
```
Refer to the official CockroachDB SQL documentation for advanced options and best practices around user management.
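The steps above can also be scripted when provisioning many environments. As an illustrative sketch (the function name, the `etl_reader` user, and the placeholder values are our own, not from the CockroachDB docs), a small Python helper can assemble the provisioning SQL for a read-only extraction user:

```python
def provisioning_statements(username: str, password: str,
                            database: str, tables: list[str]) -> list[str]:
    """Build the CREATE USER / ALTER USER / GRANT statements from the
    steps above, in execution order.

    NOTE: illustration only -- in production, pass the password as a
    bind parameter via your driver rather than interpolating it into SQL.
    """
    stmts = [
        f"CREATE USER {username};",
        f"ALTER USER {username} WITH PASSWORD '{password}';",
        f"GRANT CONNECT ON DATABASE {database} TO {username};",
    ]
    # Read-only access is usually sufficient for an extraction identity
    for table in tables:
        stmts.append(f"GRANT SELECT ON TABLE {database}.{table} TO {username};")
    return stmts

# Example: provision a hypothetical read-only user for the movr database
for stmt in provisioning_statements("etl_reader", "S3cretP@ssw0rd!",
                                    "movr", ["customers"]):
    print(stmt)
```

Each returned statement can then be executed through any PostgreSQL-compatible driver connected as an administrative user.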
Installing the JDBC Driver
At the time of writing, the JDBC driver for CockroachDB is not included by default within the Matillion Data Productivity Cloud platform. This omission is due to licensing or redistribution restrictions that prevent Matillion from packaging or distributing the CockroachDB JDBC driver directly with the product. As a result, you will need to manually download and install the driver to enable connectivity.
Step 1: Download the CockroachDB JDBC Driver
Visit the CockroachDB official documentation at the following link to obtain the current JDBC driver:
https://www.cockroachlabs.com/docs/stable/connect-to-cockroachdb.html
When selecting your driver, ensure you select a Type 4 JDBC driver. Type 4 drivers are written entirely in Java and communicate directly with the database over the network, making them the preferred and most widely compatible option for integration with Matillion Data Productivity Cloud.
Step 2: Install the JDBC Driver in Matillion Data Productivity Cloud
Once you have downloaded the CockroachDB JDBC driver (usually as a JAR file), you will need to upload and register it within the Matillion environment.
Please refer to Matillion's official documentation for a detailed, step-by-step guide to upload third-party JDBC drivers:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
Follow the platform-specific instructions to ensure that the driver is correctly placed and recognized by Matillion's service agent or runtime environment. This usually involves navigating to an administrative interface and uploading the JAR, or placing the file in a specified directory, followed by a system or agent restart if required.
Step 3: Configure and Use the JDBC Driver
After successful installation, you can use the new connection option within Matillion Data Productivity Cloud to configure and use the CockroachDB driver for your data integration jobs.
Detailed usage instructions, including how to connect to your database and construct data queries, can be found in Matillion's Database Query documentation:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
These guides describe the creation of database connections, specifying driver options, and integrating with database query components as part of your orchestration and transformation workflows in the platform.
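Because CockroachDB speaks the PostgreSQL wire protocol, its JDBC connection URL follows the standard pgJDBC format. As a hedged sketch (the helper function and hostname are our own; consult the CockroachDB and Matillion documentation for the exact parameters your deployment requires), the URL you supply to the driver configuration can be assembled like this:

```python
def build_jdbc_url(host: str, database: str, port: int = 26257,
                   sslmode: str = "verify-full") -> str:
    """Assemble a pgJDBC-style connection URL for CockroachDB.

    26257 is CockroachDB's default SQL port; sslmode values mirror
    PostgreSQL's (e.g. 'require', 'verify-full', or 'disable' for
    insecure local test clusters).
    """
    return f"jdbc:postgresql://{host}:{port}/{database}?sslmode={sslmode}"

print(build_jdbc_url("my-cluster.example.com", "movr"))
# jdbc:postgresql://my-cluster.example.com:26257/movr?sslmode=verify-full
```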
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your CockroachDB database, you must confirm that your database allows incoming connections according to your deployment configuration:
- **Full SaaS agent configuration:** Ensure your CockroachDB database permits inbound connections from the Matillion Data Productivity Cloud public IP addresses. The list of required IP addresses can be found at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/
- **Hybrid SaaS deployment:** Ensure that your CockroachDB database accepts incoming connections from the addresses within your own virtual private cloud (VPC), where your Matillion agent instance is running. To help verify network connectivity, you can refer to the tools provided here: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Note: If your CockroachDB database connection uses a DNS address (rather than a direct IP address), ensure that the Full SaaS or Hybrid SaaS agent can resolve this DNS name to the appropriate IP address. Proper DNS resolution is required for the agent to establish a connection with your database.
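Both requirements in the note above, DNS resolution and TCP reachability, can be checked from the agent's network with a short script. This is an illustrative sketch using only the Python standard library; the hostname is a placeholder:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if `host` resolves via DNS and accepts a TCP
    connection on `port` within `timeout` seconds."""
    try:
        addr = socket.gethostbyname(host)  # DNS resolution check
        with socket.create_connection((addr, port), timeout=timeout):
            return True                    # TCP handshake succeeded
    except OSError:                        # covers both DNS and connect failures
        return False

# CockroachDB's default SQL port is 26257
if not can_reach("my-cluster.example.com", 26257):
    print("Agent cannot reach the database -- check firewall rules and DNS.")
```

Run this from the same network as the Full SaaS or Hybrid SaaS agent; a failure here will reproduce as a connection error in your Matillion jobs.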
Querying Data from CockroachDB
This guide explains how to query data from a CockroachDB database, focusing on patterns for one-off initial loads and incremental loads, particularly when moving data into systems such as Amazon Redshift. We also discuss relevant SQL examples and important considerations regarding datatype conversion.
SQL SELECT Examples for CockroachDB
To retrieve data from CockroachDB, use standard SQL `SELECT` statements. Here are some example queries:

```sql
-- Select all columns from the orders table
SELECT * FROM orders;

-- Select specific columns with filtering
SELECT id, customer_id, created_at
FROM orders
WHERE status = 'processed';

-- Aggregate function example
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
```
Datatype Conversion Considerations
When transferring data from CockroachDB to Amazon Redshift (or other databases), datatype conversions may be required. Some common conversions include:
| CockroachDB Type | Amazon Redshift Equivalent |
|---|---|
| SERIAL or INT | INTEGER |
| STRING | VARCHAR |
| TIMESTAMP | TIMESTAMP |
| DECIMAL | NUMERIC |
| BOOL | BOOLEAN |
Note: Always validate compatibility and test conversions, especially for complex datatypes or when dealing with timezone and encoding nuances.
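The mapping table above can be encoded directly in pipeline code so mismatches surface early rather than failing mid-load. A minimal sketch (the dictionary mirrors the table above; extend it for your own schemas, and handle unmapped types explicitly rather than guessing):

```python
# CockroachDB -> Amazon Redshift type mapping, mirroring the table above
CRDB_TO_REDSHIFT = {
    "SERIAL": "INTEGER",
    "INT": "INTEGER",
    "STRING": "VARCHAR",
    "TIMESTAMP": "TIMESTAMP",
    "DECIMAL": "NUMERIC",
    "BOOL": "BOOLEAN",
}

def redshift_type(crdb_type: str) -> str:
    """Translate a CockroachDB column type to its Redshift equivalent.
    Raises KeyError for unmapped types so conversion gaps fail loudly."""
    base = crdb_type.strip().upper()
    if base not in CRDB_TO_REDSHIFT:
        raise KeyError(f"No Redshift mapping defined for {crdb_type!r}")
    return CRDB_TO_REDSHIFT[base]

print(redshift_type("string"))  # VARCHAR
```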
One-off Initial Load vs. Incremental Load
A common data replication pattern is:
- One-off Initial Load: Extract all required data from CockroachDB and transfer it into the target system.
- Incremental Load: After the initial load, move only new or changed data by filtering the data source.
Both load types typically use the same Database Query component (such as in ETL/ELT tools).
Initial Load
During the initial load, you generally do not use a filter clause. For example:
```sql
SELECT * FROM orders;
```

This fetches all records from the `orders` table, moving the entire dataset. Use this approach only once at the beginning of your data pipeline process.
Incremental Load
In incremental loads, you extract only new or updated records, based on a suitable filter (e.g., a timestamp column). For example:
```sql
SELECT * FROM orders WHERE updated_at > '2024-06-01 00:00:00';
```

In practice, you would replace `'2024-06-01 00:00:00'` with a dynamic value reflecting the last successful load time. This ensures your sync process is efficient and up-to-date.
For detailed information on incremental loads and strategies, refer to the Incremental Load Data Replication Strategy article on Matillion Exchange.
Tip: Always use the same Database Query component for both initial and incremental loads for consistency and maintainability. The only difference will be the presence or absence of the filter clause.
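As the tip above notes, initial and incremental extraction reduce to one query template with an optional filter. As a hedged sketch (the table and watermark column names are placeholders, not part of any Matillion API), the logic might look like:

```python
from typing import Optional

def build_extract_query(table: str, watermark_column: str,
                        last_loaded: Optional[str] = None) -> str:
    """Build the extraction query: no filter for the one-off initial
    load, a watermark filter for every incremental load thereafter."""
    query = f"SELECT * FROM {table}"
    if last_loaded is not None:
        query += f" WHERE {watermark_column} > '{last_loaded}'"
    return query

print(build_extract_query("orders", "updated_at"))
# SELECT * FROM orders
print(build_extract_query("orders", "updated_at", "2024-06-01 00:00:00"))
# SELECT * FROM orders WHERE updated_at > '2024-06-01 00:00:00'
```

After each successful run, persist the maximum `updated_at` value that was loaded and pass it as `last_loaded` on the next run.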
Data Integration Architecture
Loading data in advance of integration is a practical way to divide and conquer complex data workflows, as it splits the process into two manageable steps: first, extracting and loading data into the Redshift database, and second, integrating and transforming it as needed. This approach illustrates one of the principal advantages of the ELT (Extract, Load, Transform) architecture. Within this framework, data integration inherently involves transforming data to meet the requirements of analytical or operational use cases, and the most efficient method for accomplishing this is through data transformation pipelines that can automate and streamline the process. Another significant benefit of the ELT architecture is that data transformation and integration take place directly within the target Redshift environment. This enables processing that is fast, scalable, and self-service, eliminating the need for separate, dedicated data transformation infrastructure and associated costs.