Integrate data from CrateDB to Amazon Redshift using Matillion

Our CrateDB to Redshift connector enables seamless and timely data transfers to Redshift, automatically keeping your information current without the need for manual coding or complex ETL processes.


Extracting data from CrateDB to Amazon Redshift

Extracting data from CrateDB is a fundamental step in integrating this high-performance, distributed SQL database with other data platforms such as Amazon Redshift. Whether you are migrating large volumes of historical data, setting up regular data pipelines, or enabling real-time analytics, this process requires careful planning and consideration of both security and performance. In this article, we will guide you through the key stages required to move your data efficiently and securely from CrateDB to Redshift. Specifically, we will cover:

  • **Creating an identity in CrateDB:** Establishing proper user credentials and permissions to ensure secure data access.
  • **For Matillion users, checking or acquiring the JDBC driver:** Verifying that you have the necessary JDBC driver to enable seamless connectivity between Matillion ETL and CrateDB.
  • **Ensuring network connectivity:** Configuring your systems to allow reliable and secure network communication between CrateDB (the source) and Redshift (the target).
  • **Querying data, both initially and incrementally:** Constructing queries to handle both the initial full data load and subsequent incremental updates, optimizing the process for accuracy and efficiency.

By following these steps, you will be well-equipped to extract, transform, and load data from CrateDB into Redshift, unlocking new possibilities for analytics and business intelligence.


What is CrateDB?

CrateDB is a distributed SQL database engineered for high-performance querying and real-time analytics on massive volumes of machine data and time-series information. Utilizing a shared-nothing architecture, CrateDB seamlessly combines the scalability and flexibility of NoSQL systems with the familiar querying capabilities of standard SQL, supporting full-text search and geospatial queries natively. It is built on top of Elasticsearch, leveraging Lucene for storage and indexing, and provides horizontal scalability through automatic data sharding and replication. CrateDB is particularly well-suited for IoT, monitoring, and log analytics applications, where ingestion rates and low-latency analytical queries are critical, and offers a user-friendly RESTful API alongside standard PostgreSQL wire protocol compatibility for broad integration support.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse designed for OLAP and large-scale analytics. Built on PostgreSQL, it offers familiar SQL interfaces with optimizations like columnar storage, advanced compression, and massively parallel processing (MPP) for high-performance queries. Redshift Spectrum allows querying exabytes of semi-structured data in Amazon S3 without data loading. Its architecture includes workload management, automatic backups, data encryption, and cross-region replication, providing scalability and robust security. This makes Redshift ideal for enterprises needing efficient, secure, and scalable analytics infrastructure.

Why Move Data from CrateDB into Amazon Redshift

The Advantages of Transferring Data from CrateDB to Redshift for Analytics

A data engineer or architect may wish to copy data from CrateDB into Amazon Redshift for several strategic reasons. Firstly, CrateDB likely contains data that is potentially valuable for analysis, and unlocking this value often requires integrating it with additional datasets residing elsewhere within the organization. By consolidating data from CrateDB alongside other sources in Redshift, analysts can perform more comprehensive and insightful queries, enabling richer business intelligence. Furthermore, by utilizing Redshift for such integrations and analytical workloads, the operational burden placed on the CrateDB instance is minimized. This approach helps to preserve CrateDB’s performance for ongoing transactional workloads, ensuring that analytics do not interfere with day-to-day operations.

Creating a User in CrateDB

This guide explains how to create a user identity in CrateDB using SQL statements. The following steps require sufficient privileges, such as access via an administrative account (e.g., the `crate` superuser). You may execute these statements using any compatible SQL client (e.g., the `crash` shell, pgAdmin, or any PostgreSQL-compatible client).


1. Create a User

Use the `CREATE USER` SQL statement, replacing `<username>` with the desired username.

CREATE USER <username>;

Example:

CREATE USER alice;


2. Set the User Password

Set the user's password for authentication:

ALTER USER <username> SET (password = '<plaintext-password>');

Example:

ALTER USER alice SET (password = 'securepassword123');

Note: Passwords are stored securely by CrateDB, but choose strong passwords and handle with care.


3. Grant Privileges to the User

New users have no privileges by default. Grant the required privileges, such as `DQL`, `DML`, `DDL`, or `ALL`, on the cluster, specific schemas, or tables.

Grant all privileges on a specific schema:

GRANT ALL PRIVILEGES ON SCHEMA <schema_name> TO <username>;

Example:

GRANT ALL PRIVILEGES ON SCHEMA doc TO alice;

Or grant specific privileges (e.g., `DQL` for data querying):

GRANT DQL ON TABLE doc.mytable TO alice;


4. Verify User Creation

To check existing users:

SELECT name FROM sys.users;

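
The user-setup steps above can be scripted. Below is a minimal Python sketch that generates the three statements; the username, password, and schema are illustrative placeholders, and in practice you would execute the statements through a CrateDB client rather than print them.

```python
# Generate the CrateDB user-setup statements from steps 1-3.
# All identifiers below are illustrative placeholders.

def user_setup_statements(username: str, password: str, schema: str) -> list:
    """Return the SQL statements that create a user, set its password,
    and grant it full privileges on one schema."""
    return [
        f"CREATE USER {username};",
        f"ALTER USER {username} SET (password = '{password}');",
        f"GRANT ALL PRIVILEGES ON SCHEMA {schema} TO {username};",
    ]

for stmt in user_setup_statements("alice", "securepassword123", "doc"):
    print(stmt)
```

Keeping the statements in one place like this makes it easy to review exactly which privileges the extraction user receives before running them.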


Installing the JDBC driver

To connect Matillion Data Productivity Cloud to CrateDB, you will need to manually download and install the CrateDB JDBC driver. This step is necessary because, at the time of writing, the CrateDB JDBC driver is not included with Matillion Data Productivity Cloud by default, due to licensing or redistribution restrictions.

Follow these steps to set up the CrateDB JDBC driver with Matillion Data Productivity Cloud:

  1. **Download the JDBC driver**

     • Visit the official CrateDB JDBC documentation page.
     • Locate the download option for the JDBC driver, ensuring you select a "Type 4 JDBC driver" for best compatibility.
     • Download the latest available `.jar` file for the driver to your local machine.

  2. **Upload the JDBC driver to Matillion Data Productivity Cloud**

     • Refer to Matillion's documentation for installing external drivers: Uploading External Drivers.
     • Follow the documented steps to upload the downloaded CrateDB JDBC `.jar` file to your Matillion agent.
     • Ensure the agent restarts (or reloads its driver list, if supported) so the new JDBC driver becomes available.

  3. **Configure usage in Matillion Data Productivity Cloud**

     • Once installation is complete, configure a database connection or job in Matillion Data Productivity Cloud using the newly installed CrateDB JDBC driver.
     • For guidance on how to configure and use the new connection or driver inside the Designer, refer to Matillion's database query usage documentation.

By following these steps, you will be able to integrate CrateDB as a source or target within your data workflows in Matillion Data Productivity Cloud utilizing the provided JDBC driver.
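
Once the driver is installed, the connection is identified by a JDBC URL. The sketch below assembles one for illustration; the `jdbc:crate://` scheme, default port 5432 (CrateDB's PostgreSQL wire-protocol port), and property names are assumptions here, so verify the exact format against the CrateDB JDBC documentation for your driver version.

```python
# Assemble an illustrative CrateDB JDBC connection string.
# Hostname, port, and property names are placeholders, not verified defaults.

def cratedb_jdbc_url(host: str, port: int = 5432, **props: str) -> str:
    """Build a jdbc:crate:// URL with optional query-string properties."""
    url = f"jdbc:crate://{host}:{port}/"
    if props:
        url += "?" + "&".join(f"{k}={v}" for k, v in sorted(props.items()))
    return url

print(cratedb_jdbc_url("crate.example.com", user="alice"))
```

The resulting string is what you would paste into the connection URL field of the Matillion Database Query component.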

Checking network connectivity

To ensure successful connectivity between the Matillion Data Productivity Cloud and your CrateDB database, you must configure your firewall or network security rules to allow incoming connections based on your deployment type:

  • Full SaaS Agent Configuration:
    If you are using Matillion's Full SaaS agent, you must permit incoming connections to your CrateDB database from the IP addresses specified by Matillion. The current list of authorized source IP addresses is maintained here. Be sure to reference the latest list and allow inbound access from all listed addresses.

  • Hybrid SaaS Deployment:
    For a Hybrid SaaS deployment, incoming connections should be allowed from your organization’s own virtual private cloud (VPC) where the Hybrid agent is running. Matillion provides network connectivity verification tools at this link to help you test and validate your setup.

Additionally, if you are referencing your CrateDB database using a DNS hostname rather than a direct IP address, the relevant Matillion agent (Full SaaS or Hybrid SaaS) must be able to resolve this DNS address. Please ensure that DNS resolution is possible from the environment where the agent resides.
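
A quick way to confirm the DNS requirement above is to attempt resolution from the environment where the agent runs. The sketch below uses Python's standard library; `"localhost"` stands in for your actual CrateDB hostname.

```python
# Check that a hostname resolves from this environment, as the Matillion
# agent must be able to do for a DNS-referenced CrateDB database.
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False

print(can_resolve("localhost"))
```

If this returns False for your CrateDB hostname, fix DNS (or fall back to a direct IP address) before configuring the Matillion connection.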

Querying Data from CrateDB

This guide demonstrates how to query data from a CrateDB database using SQL `SELECT` statements, highlights considerations around datatype conversions (especially when integrating with Amazon Redshift), and outlines established loading patterns for data extraction workflows.


Example CrateDB SELECT Statements

Below are example queries illustrating typical read operations:

```
-- Select all columns and rows from the "sales" table
SELECT * FROM sales;

-- Select specific columns
SELECT order_id, customer_name, total_amount FROM sales;

-- Filtering records with a WHERE clause
SELECT * FROM sales WHERE order_date >= '2024-01-01';

-- Aggregate functions with GROUP BY
SELECT customer_id, COUNT(*) AS order_count FROM sales GROUP BY customer_id;

-- Joining two tables
SELECT c.customer_name, o.order_id, o.total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
```


Datatype Conversion: CrateDB ↔ Redshift

When migrating or syncing data from CrateDB to Redshift, be aware that datatype conversions may be necessary due to differences and limitations in each platform's type systems. For example:

  • CrateDB's `OBJECT` type is not supported natively in Redshift and may need flattening or serialization to text (`VARCHAR`).
  • Numeric precision and scale may vary between the two platforms.
  • Date and timestamp types may require format conversion.
  • Boolean values: ensure proper mapping (CrateDB `BOOLEAN` ↔ Redshift `BOOLEAN`).

Plan and test type conversions during ETL development to avoid data integrity issues.
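
One common way to handle the `OBJECT` case above is to serialize it to JSON text before loading into a Redshift `VARCHAR` column. The sketch below pairs an illustrative type map with such a conversion; the mappings are example defaults for the cases just listed, not an official or exhaustive list.

```python
# Illustrative CrateDB -> Redshift type map and OBJECT serialization.
# These mappings are example defaults, not an official conversion table.
import json

CRATE_TO_REDSHIFT = {
    "BOOLEAN": "BOOLEAN",
    "TEXT": "VARCHAR(65535)",
    "INTEGER": "INTEGER",
    "BIGINT": "BIGINT",
    "DOUBLE PRECISION": "DOUBLE PRECISION",
    "TIMESTAMP WITH TIME ZONE": "TIMESTAMPTZ",
    "OBJECT": "VARCHAR(65535)",  # no native equivalent: serialize to JSON text
}

def to_redshift_value(crate_type: str, value):
    """Serialize an OBJECT to a JSON string; pass other values through."""
    if crate_type == "OBJECT":
        return json.dumps(value, sort_keys=True)
    return value

print(to_redshift_value("OBJECT", {"city": "Berlin", "zip": "10115"}))
```

Serializing to JSON text keeps nested structure queryable later (for example via Redshift's JSON functions), whereas flattening into separate columns trades that flexibility for simpler SQL.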


Data Loading Strategy: Initial and Incremental Loads

The optimal approach to move or replicate data from CrateDB is:

  1. A one-off initial load
  2. Incremental loads for ongoing syncs

Use the same Database Query component for both, but alter the query filtering logic as follows:

Initial Load

  • There is no filter clause.
  • The query selects all relevant data from the source table.

-- Initial full load; no filter on last-modified/created
SELECT * FROM sales;

Incremental Loads

  • A filter clause applies, usually on a monotonically increasing column (`id`, `updated_at`, or `created_at`).
  • Only retrieves new or changed records since the last load.

-- Incremental load; filter on "updated_at"
SELECT * FROM sales
WHERE updated_at > '${last_loaded_timestamp}';

  • Ensure the filtered column is indexed for optimal performance.
  • Store and update the watermark (e.g., last timestamp) after each batch.
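
The two query patterns and the watermark handling can be combined in a small helper. The sketch below is illustrative, using the `sales` table and `updated_at` column from the examples above; in a real pipeline the watermark would be persisted (for example in a Matillion variable or a control table), not held in memory.

```python
# Build the initial vs. incremental extraction query from a stored watermark.
# Table and column names are illustrative; persistence of the watermark
# between runs is out of scope for this sketch.

def build_extract_query(table, watermark_col, last_loaded=None):
    """Full load when no watermark exists; filtered load afterwards."""
    if last_loaded is None:
        return f"SELECT * FROM {table};"
    return (f"SELECT * FROM {table} "
            f"WHERE {watermark_col} > '{last_loaded}';")

# First run: no watermark stored yet -> full load.
print(build_extract_query("sales", "updated_at"))
# Later runs: only rows changed since the stored watermark.
print(build_extract_query("sales", "updated_at", "2024-01-01T00:00:00Z"))
```

After each batch completes, the maximum `updated_at` seen in that batch becomes the new watermark for the next run.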

For more details and patterns, see Incremental Load Data Replication Strategy - Matillion Exchange.



Data Integration Architecture

Loading data into Amazon Redshift in advance of integrating it is a practical way to tackle data integration, effectively dividing the process into two distinct steps: loading (Extract and Load phases) and transforming (the Transform phase). This “divide and conquer” strategy is a core benefit of the ELT (Extract, Load, Transform) architecture, making complex data workflows easier to manage. Data integration typically requires extensive data transformation, such as cleaning, joining, and reshaping data, and the most efficient way to perform these operations is by using data transformation pipelines. In the ELT model, these transformations and integrations are carried out directly inside the target Redshift database. This approach offers significant advantages: it is fast, supports on-demand scaling, and leverages Redshift’s parallel processing capabilities. As a result, organizations benefit from reduced infrastructure costs, since there is no need to provision or pay for a separate data processing environment outside the database.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.