Integrate data from CrateDB to Snowflake using Matillion

Our CrateDB to Snowflake connector enables seamless, automatic data transfers to Snowflake within minutes, ensuring your data stays current without the need for manual coding or managing intricate ETL scripts.


Extracting data from CrateDB to Snowflake

Extracting data from CrateDB is a common requirement for organizations seeking to leverage advanced analytics and data warehousing capabilities offered by platforms such as Snowflake. Ensuring a smooth and reliable data pipeline between these systems is crucial for accurate, up-to-date insights and streamlined business processes. In this article, we will guide you through the essential steps for successfully extracting data from CrateDB and loading it into Snowflake. The process includes creating an identity within CrateDB for secure data access, and—if you are using Matillion—verifying or acquiring the required JDBC driver to establish connectivity. We will also discuss the necessary network configurations to allow seamless data transfer between source and target systems. Finally, we will walk through how to query data from CrateDB, addressing both the extraction of initial datasets and subsequent incremental updates to keep your Snowflake environment synchronized. By the end of this article, you will be equipped with practical strategies and technical know-how to efficiently move data from CrateDB to Snowflake, enabling more effective data-driven decision making.


What is CrateDB?

CrateDB is a distributed SQL database engineered for handling large-scale machine data and real-time analytics. Built on a shared-nothing architecture, it leverages a combination of columnar storage, integrated search functionality via Elasticsearch libraries, and sharding to achieve high horizontal scalability and low-latency queries across substantial data volumes. CrateDB supports full ANSI SQL, allowing users to perform complex joins, aggregations, and sub-queries, and exposes native compatibility with standard PostgreSQL drivers. Its architecture facilitates indexing and querying both structured and semi-structured data, including JSON, making it well-suited for IoT platforms, time-series data, and event-driven architectures that require ingest rates of millions of records per second and fast analytical responses.
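Because CrateDB exposes native PostgreSQL wire compatibility, standard PostgreSQL client libraries can connect to it directly. The sketch below (host name and credentials are placeholders) builds a libpq-style connection string using CrateDB's default PostgreSQL-protocol port 5432 and its default `doc` schema; a client such as `psycopg` could then consume it:

```python
# Sketch: CrateDB speaks the PostgreSQL wire protocol, so a standard
# libpq-style connection string is enough for most PostgreSQL clients.
# The host name below is a placeholder.

def cratedb_dsn(host: str, user: str = "crate", password: str = "",
                port: int = 5432, schema: str = "doc") -> str:
    """Build a libpq-style connection string for CrateDB.

    Port 5432 is CrateDB's default PostgreSQL-protocol port, 'crate'
    its default superuser, and 'doc' its default schema.
    """
    auth = f"user={user}" + (f" password={password}" if password else "")
    return f"host={host} port={port} dbname={schema} {auth}"

dsn = cratedb_dsn("cratedb.example.com")
print(dsn)
# A real client call would then be, e.g.: psycopg.connect(dsn)
```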


What is Snowflake?

Snowflake is a cloud-native, fully managed data platform offering scalable, high-performance data warehousing. Its architecture separates compute, storage, and services, letting users scale each independently. A multi-cluster design supports many concurrent workloads for ad hoc and advanced analytics without resource contention. Snowflake handles semi-structured data (JSON, Avro, Parquet) and features strong data sharing and governance, easing integration and collaboration. Its SQL-based interface, support for major clouds, and automated maintenance lower operational overhead, making Snowflake an attractive option for modern data engineering and analytics.

Why Move Data from CrateDB into Snowflake

Unlocking Advanced Analytics: Copying Data from CrateDB to Snowflake

A data engineer or architect might choose to copy data from CrateDB into Snowflake for several compelling reasons. Firstly, CrateDB often contains valuable operational or time-series data that can drive significant business insights when properly leveraged. However, the true value of this data is frequently realized when it is integrated and analyzed alongside data from other sources, such as customer databases, financial systems, or external datasets. Transferring data into Snowflake enables organizations to perform comprehensive analytics, advanced data modeling, and reporting in a centralized environment. Importantly, utilizing Snowflake for these intensive analytical workloads avoids placing additional computational or query stress on CrateDB, which is typically optimized for high-ingest, low-latency operations rather than complex, resource-intensive queries. This approach ensures both systems operate efficiently within their strengths, supporting robust data integration without compromising the performance of operational data storage.

Creating a User in CrateDB

CrateDB supports role-based access control (RBAC) through user management. You can create users (called "roles" in CrateDB) directly via SQL statements.

Prerequisites

  • Ensure you are connected to a CrateDB cluster as a user with sufficient privileges (typically a superuser or an administrator).
  • Use CrateDB 5.6 or later for the `CREATE ROLE` syntax shown below; earlier versions manage users with `CREATE USER` instead.

Steps to Create a User

  1. Connect to your CrateDB instance:
    Use the `crash` command-line client, `psql`, or any SQL client that can connect to CrateDB.

  2. Create a New Role (User):
    Run the following SQL command to create a new user, replacing `<username>` with the desired username:

CREATE ROLE <username>;

For example, to create a user named `analytics_user`:

CREATE ROLE analytics_user;

  3. Set a Password (Optional but recommended):
    Associate a password with the new user for secure authentication. For example, to set the password `s3cret`:

ALTER USER analytics_user SET (password = 's3cret');

  4. Grant Privileges (Optional):
    To allow the user to perform specific actions (such as querying or modifying data), grant the appropriate privileges. For example, to allow the user to read and write on a specific table:

GRANT DQL, DML ON TABLE my_schema.my_table TO analytics_user;

Or to grant cluster-wide administrative privileges:

GRANT ALL PRIVILEGES TO analytics_user;

Example: Complete User Creation Script

```
-- Create a new role
CREATE ROLE analytics_user;

-- Set the user password
ALTER USER analytics_user SET (password = 's3cret');

-- Grant privileges for a specific table
GRANT DQL, DML ON TABLE my_schema.my_table TO analytics_user;
```

Note:
- CrateDB uses the term ROLE throughout; a role can act as an individual user or as a group-like collection of privileges granted to other roles.
- Privilege changes take effect immediately; no separate reload command is required.

See the official CrateDB documentation for more information on user management and RBAC: https://crate.io/docs/crate/reference/en/latest/general/user-management.html
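If you script user provisioning, it is worth guarding against unsafe identifiers. The following hypothetical helper (the function name and validation rule are our own, not part of CrateDB) templates the three statements from the steps above, rejecting usernames that are not plain lowercase identifiers and escaping single quotes in the password per standard SQL string-literal rules:

```python
import re

def user_setup_sql(username: str, password: str, table: str) -> list[str]:
    """Generate the CrateDB user-setup statements shown above.

    A conservative identifier check avoids SQL injection when the
    username comes from configuration; quotes in the password are
    doubled, the standard SQL escape for string literals.
    """
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", username):
        raise ValueError(f"unsafe username: {username!r}")
    pw = password.replace("'", "''")
    return [
        f"CREATE ROLE {username};",
        f"ALTER USER {username} SET (password = '{pw}');",
        f"GRANT DQL, DML ON TABLE {table} TO {username};",
    ]

for stmt in user_setup_sql("analytics_user", "s3cret", "my_schema.my_table"):
    print(stmt)
```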

Installing the JDBC driver

To connect Matillion Data Productivity Cloud to a CrateDB database, you need to install the CrateDB JDBC driver manually. At the time of writing, this driver is not bundled by default with Matillion Data Productivity Cloud, owing to licensing or redistribution restrictions. Follow the steps below to obtain and install the driver in your Matillion environment.

1. Download the CrateDB JDBC Driver

  • Go to the official CrateDB documentation page for JDBC drivers: CrateDB JDBC Driver Download
  • Look for a Type 4 JDBC driver for CrateDB, as this driver is platform-independent and connects directly from Java.
  • Download the latest available `crate-jdbc` JAR file to your local machine.

2. Upload the JDBC Driver to Matillion Data Productivity Cloud

  • Follow the official Matillion documentation for external driver installation: Uploading External Drivers
  • Use your downloaded CrateDB JDBC driver file and upload it to your Matillion instance, as described in these instructions.
  • Ensure that the uploaded JAR is available for your Data Productivity Cloud agent(s) as appropriate.

3. Verify and Use the Loaded Driver

  • Once the driver is uploaded and made available, refer to the official Matillion guide to configure and use database connectors: Database Query - Usage Instructions
  • You can now set up SQL Query components or custom database connectors using the newly uploaded CrateDB JDBC driver following those steps.

By following these instructions, you can enable connectivity between Matillion Data Productivity Cloud and a CrateDB database using the officially supported JDBC driver.
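Once the driver is installed, the component will need a JDBC connection URL. As a rough sketch (the host is a placeholder, and you should confirm the exact scheme and supported properties against the CrateDB JDBC driver documentation), the URL follows the `jdbc:crate://<host>:<port>/` convention, with 5432 as the default PostgreSQL-protocol port:

```python
# Sketch: assemble the JDBC URL for the Matillion Database Query
# component. Verify the URL scheme and any connection properties
# (e.g. TLS settings) against the crate-jdbc documentation.

def crate_jdbc_url(host: str, port: int = 5432, ssl: bool = False) -> str:
    url = f"jdbc:crate://{host}:{port}/"
    if ssl:
        # pgjdbc-style property; confirm against the driver docs.
        url += "?ssl=true"
    return url

print(crate_jdbc_url("cratedb.example.com"))
```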

Checking network connectivity

To ensure successful communication between Matillion Data Productivity Cloud and your CrateDB database, you must configure your CrateDB instance to allow incoming network connections from Matillion. The required configuration depends on your deployment type:

  • Full SaaS Agent Configuration:
    Allow incoming connections from the IP addresses maintained by Matillion for the Data Productivity Cloud. The up-to-date list of these IP addresses can be found here: Allowing IP Addresses. Update your CrateDB firewall or access controls to permit connections from these IP addresses.

  • Hybrid SaaS Deployment:
    In this configuration, you must allow incoming connections from your own virtual private cloud (VPC), where your Matillion Hybrid SaaS agent is deployed. Ensure that your CrateDB database permits network traffic from your VPC. You can use the utilities provided by Matillion to help test and verify network connectivity: Check Network Access Utility.

Additionally, if your CrateDB database is referenced by a Domain Name System (DNS) address rather than a direct IP address, make sure that your Full SaaS or Hybrid SaaS agent is able to resolve the DNS address correctly. This is essential to establish network connectivity to your database.
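The two failure modes above, DNS resolution and firewall access, can be checked from any machine in the same network position as your agent. A minimal sketch using only the Python standard library (host and port are placeholders; 5432 is CrateDB's PostgreSQL port and 4200 its HTTP port):

```python
import socket

def check_reachability(host: str, port: int,
                       timeout: float = 5.0) -> tuple[bool, str]:
    """Resolve the DNS name, then attempt a TCP connection,
    mirroring the two failure modes discussed above."""
    try:
        addr = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4]
    except socket.gaierror as exc:
        return False, f"DNS resolution failed: {exc}"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"reachable at {addr[0]}:{port}"
    except OSError as exc:
        return False, f"TCP connection failed: {exc}"

# Example with a placeholder host:
ok, detail = check_reachability("cratedb.example.com", 5432, timeout=2.0)
print(ok, detail)
```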

Querying Data from a CrateDB Database

This guide covers querying data from a CrateDB database, including example SQL queries, notes on datatype conversion between CrateDB and Snowflake, and best practices for initial and incremental data loads using a Database Query component.


Example CrateDB Queries

Below are sample SQL `SELECT` statements you might use to query data from CrateDB:

```
-- Select all columns from a table
SELECT * FROM customers;

-- Select specific columns
SELECT id, name, email FROM customers;

-- Filter data using a WHERE clause
SELECT * FROM orders WHERE order_date >= '2024-06-01';

-- Aggregate example: count the number of rows
SELECT COUNT(*) FROM transactions WHERE status = 'complete';

-- Joining tables example
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'shipped';
```

Datatype Conversion: CrateDB → Snowflake

When transferring data from CrateDB to Snowflake, datatype conversion may occur. While both databases support a wide range of datatypes, there are some differences in their type systems. Always review CrateDB's Supported Data Types and ensure they're compatible with Snowflake's Data Types.

Tips:
- Text-based types (`TEXT`, `STRING`) generally convert cleanly.
- Numeric types (`INTEGER`, `FLOAT`, `DOUBLE`) may have precision differences.
- Timestamps and dates should be reviewed for timezone and precision compatibility.
- Arrays and objects may need to be serialized or mapped to variant structures.
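An illustrative (and deliberately non-exhaustive) mapping along these lines can be expressed as a simple lookup, falling back to Snowflake's `VARIANT` for anything semi-structured or unmapped. The pairings below are our reading of the two type systems, so verify them against both vendors' type documentation before relying on them:

```python
# Illustrative CrateDB -> Snowflake type mapping; not exhaustive,
# and the pairings should be verified against both vendors' docs.
CRATE_TO_SNOWFLAKE = {
    "TEXT": "VARCHAR",
    "BOOLEAN": "BOOLEAN",
    "SMALLINT": "SMALLINT",
    "INTEGER": "INTEGER",
    "BIGINT": "BIGINT",
    "REAL": "FLOAT",                          # precision may differ
    "DOUBLE PRECISION": "DOUBLE",
    "TIMESTAMP WITH TIME ZONE": "TIMESTAMP_TZ",
    "TIMESTAMP WITHOUT TIME ZONE": "TIMESTAMP_NTZ",
    "ARRAY": "ARRAY",
    "OBJECT": "OBJECT",                       # or VARIANT for free-form JSON
    "IP": "VARCHAR",                          # no native IP type in Snowflake
}

def map_type(crate_type: str) -> str:
    """Fall back to VARIANT so unmapped semi-structured values are
    preserved rather than rejected."""
    return CRATE_TO_SNOWFLAKE.get(crate_type.upper(), "VARIANT")

print(map_type("object"))     # OBJECT
print(map_type("geo_shape"))  # VARIANT (fallback)
```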

Pattern for Initial and Incremental Loads

A best-practice pattern for loading data is to perform a full, once-off initial load, then proceed with incremental loads. This pattern ensures a complete dataset upfront, followed by regular updates with only new or modified data.

  • Database Query Component: Use the same component for both initial and incremental loads.

Initial Load

  • Perform a load without a filter to bring in all available data.

-- Example: Load all records
SELECT * FROM users;

Incremental Load

  • Add a filter clause based on a high-water mark, such as a modification timestamp or integer PK/ID, to only retrieve new or changed records.

-- Example: Load new/updated records since last run
SELECT * FROM users WHERE updated_at > '2024-06-01T00:00:00';

Replace `'2024-06-01T00:00:00'` with a variable representing the most recent successful load.

For further details on incremental load strategies, refer to Matillion's Incremental Load and Data Replication Strategy.

Data Integration Architecture

A key advantage of using an ELT (Extract, Load, Transform) architecture in Snowflake is its ability to "divide and conquer" the integration process by splitting it into two distinct steps: first loading the raw data into Snowflake, and then performing integration and transformations within the database itself. By loading the data in advance of any integration, you create a clear separation of concerns, which simplifies troubleshooting and improves process flexibility. Data integration in this architecture necessitates various data transformation tasks, and the optimal way to perform these is by constructing data transformation pipelines, which automate and orchestrate each step in a manageable and scalable fashion. Another major advantage of ELT in Snowflake is that all data transformation and integration activities happen inside the Snowflake environment. This ensures rapid, on-demand, and highly scalable performance, and eliminates the need to invest in or maintain separate data processing infrastructure—since Snowflake takes care of the underlying computation requirements for you.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.