Integrate data from Apache Cassandra to Amazon Redshift using Matillion

Our Apache Cassandra to Redshift connector enables seamless and timely data transfer to Redshift, eliminating the need for manual coding or complex ETL processes.


Extracting data from Apache Cassandra to Amazon Redshift

Extracting data from Apache Cassandra is a key requirement for organizations that wish to combine NoSQL sources with their cloud data warehouse for advanced analytics and reporting. In this article, we will guide you through the process of transferring data from Apache Cassandra into Amazon Redshift, ensuring a reliable and scalable workflow. We will begin by explaining how to set up an identity in Apache Cassandra, necessary for secure, authenticated access. For users of Matillion, we will discuss how to verify the presence of a suitable JDBC driver for Cassandra, or how to acquire one if needed. Next, we will address the critical aspect of network connectivity between your source (Cassandra) and target (Redshift) environments to facilitate a smooth transfer. Finally, we will cover strategies for querying and extracting your data—starting with a full initial load, and then implementing incremental loading to keep Redshift in sync with any changes in Cassandra. By the end of this article, you will have a practical roadmap for integrating Cassandra data with Amazon Redshift.


What is Apache Cassandra?

Apache Cassandra is a highly scalable, distributed NoSQL database built for managing large data volumes across multiple servers with no single point of failure. Using a peer-to-peer, decentralized architecture, it delivers high availability, fault tolerance, and tunable consistency to suit diverse workload needs. Data is partitioned and replicated with consistent hashing, and multi-datacenter support offers geographic resiliency. Cassandra’s flexible wide-column schema efficiently handles structured, semi-structured, or unstructured data, making it ideal for time-series workloads, real-time analytics, and distributed applications needing robust scalability and continuous uptime.


What is Amazon Redshift?

Amazon Redshift is a fully managed, cloud-based data warehouse service designed for large-scale data analytics and storage. Built on PostgreSQL, Redshift supports standard SQL queries and integrates seamlessly with a broad ecosystem of data ingestion, transformation, and business intelligence tools. Its architecture leverages massively parallel processing (MPP), columnar storage, and data compression to optimize query performance for petabyte-scale datasets. Redshift Spectrum further extends capabilities by enabling direct querying of exabytes of data stored in Amazon S3 without loading it into the warehouse. With features like automatic vacuuming, workload management, and elastic scaling, Redshift provides a robust platform for handling complex analytical workloads typical of enterprise environments.

Why Move Data from Apache Cassandra into Amazon Redshift

Unlocking Advanced Analytics: The Case for Copying Data from Apache Cassandra to Amazon Redshift

Apache Cassandra often stores operational and transactional data that holds significant potential value for analytics and business intelligence. However, the true benefit is realized when this data is integrated with information from other sources, such as customer relationship systems, sales databases, or external datasets. By copying data from Apache Cassandra into Amazon Redshift, an environment purpose-built for complex queries and data integration, organizations can perform comprehensive analytics without imposing additional workload or latency on the source Cassandra database. This approach allows companies to leverage the strengths of both platforms: maintaining high performance and availability for transactional operations in Cassandra while enabling rich analytical processing and cross-source insights in Redshift.

Creating a User in Apache Cassandra

To create a user (identity) in Apache Cassandra, you must have user management enabled via the `PasswordAuthenticator`. The following instructions assume you have superuser access and that Cassandra authentication is enabled.

1. Connect to Cassandra

Connect to your Cassandra cluster using `cqlsh` as a superuser. For example:

```bash
cqlsh -u cassandra -p your_superuser_password
```

2. Create a New User

Use the `CREATE USER` CQL command to add a user. Specify the username and password, and choose whether the user should have superuser privileges.

```sql
-- Create a regular user
CREATE USER example_user WITH PASSWORD 'user_password' NOSUPERUSER;

-- Create a superuser
CREATE USER admin_user WITH PASSWORD 'admin_password' SUPERUSER;
```

Note: Passwords must be enclosed in single quotes. In Cassandra 2.2 and later, `CREATE USER` is retained for backwards compatibility; `CREATE ROLE` with `LOGIN = true` is the preferred equivalent.

3. Assign Permissions (Optional)

After creating the user, you can grant permissions for specific keyspaces, tables, or the entire cluster.

```sql
-- Grant SELECT and MODIFY permissions on a keyspace to a user
GRANT SELECT, MODIFY ON KEYSPACE your_keyspace TO example_user;
```

4. List Existing Users

To verify that users exist:

```sql
LIST USERS;
```

5. Enable Authentication (if needed)

Ensure authentication and authorization are enabled in your `cassandra.yaml` configuration:

```yaml
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
```

Reminder: Any configuration changes require a Cassandra restart.

Installing the JDBC Driver

The Apache Cassandra JDBC driver is not bundled by default with the Matillion Data Productivity Cloud, largely due to licensing and redistribution restrictions. As a result, if you intend to connect to Cassandra from Matillion, you are required to manually download and install the driver into your environment. The following steps will guide you through obtaining and installing the JDBC driver for use with Matillion Data Productivity Cloud.

1. Download the JDBC Driver

  • Visit the Simba driver download page at: https://www.simba.com/drivers/cassandra-jdbc-odbc/
  • Locate and select the Type 4 JDBC driver for Apache Cassandra. Type 4 indicates a “pure Java” implementation, which is preferable because it provides platform independence and easier integration with cloud services like Matillion.
  • Complete any registration or acceptance of license agreements as required by Simba.
  • Download the driver package (typically supplied as a ZIP or JAR file).

2. Prepare for Installation

Before proceeding with installation, make sure you have:

  • Access to your Matillion Data Productivity Cloud environment (Cloud Agent or installation location)
  • The downloaded Cassandra JDBC driver file

3. Install the Driver in Matillion

  • Refer to the Matillion documentation for uploading external drivers: Uploading External Drivers
  • Follow the step-by-step instructions that walk you through the process of uploading the driver to your Matillion Agent or designated environment.
  • When prompted, upload the Cassandra JDBC JAR file you downloaded earlier.

4. Connecting and Using the Driver

Once the driver has been successfully uploaded and is available in your Matillion environment, consult the usage instructions to set up database connections and queries. Detailed steps are documented here: Database Query Usage

These instructions will cover configuring new connections to your Apache Cassandra database using the newly installed JDBC driver, as well as how to issue queries and work with Cassandra data within Matillion Data Productivity Cloud.

Checking network connectivity

To ensure successful network connectivity between the Matillion Data Productivity Cloud and your Apache Cassandra database, you must allow incoming connections from the appropriate sources for your deployment type (Full SaaS or Hybrid SaaS).

Additionally, if you are referencing your Apache Cassandra database using a DNS address, make sure that the Full SaaS or Hybrid SaaS agent is able to resolve this address correctly. DNS resolution is required for establishing the connection to your database instance.
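As a quick preflight check (a sketch, not Matillion functionality — the host name is a placeholder, and 9042 is Cassandra's default CQL native-protocol port), you can verify both DNS resolution and TCP reachability from the machine the agent runs on:

```python
import socket

def check_cassandra_reachable(host: str, port: int = 9042, timeout: float = 5.0):
    """Return (resolved_ip, reachable) for a Cassandra host.

    First resolves the DNS name, then attempts a TCP connection to the
    CQL native-protocol port (9042 by default).
    """
    ip = socket.gethostbyname(host)  # raises socket.gaierror if DNS fails
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False
```

If DNS resolution raises an error here, the agent will not be able to reach the database either, regardless of firewall rules.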

Querying Data from Apache Cassandra: Technical Guide

Example Apache Cassandra Queries

Apache Cassandra uses CQL (Cassandra Query Language) for querying data, which is similar to SQL but has some syntactical differences and limitations:

```sql
-- Selecting all columns from a table
SELECT * FROM keyspace_name.table_name;

-- Selecting specific columns
SELECT column1, column2 FROM keyspace_name.table_name;

-- Filtering results by partition key (recommended pattern)
SELECT * FROM keyspace_name.table_name WHERE partition_key = 'some_value';

-- With clustering key for more refined queries
SELECT * FROM keyspace_name.table_name
    WHERE partition_key = 'some_value' AND clustering_key = 'other_value';

-- Limiting results
SELECT * FROM keyspace_name.table_name WHERE partition_key = 'some_value' LIMIT 100;
```

Note: In Cassandra, filter conditions must include the partition key. Filtering by non-indexed, non-partition columns will result in errors or requires the `ALLOW FILTERING` clause (discouraged for performance).
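To make the partition-key rule concrete, here is a hypothetical helper that assembles a CQL SELECT string; the column and keyspace names are illustrative, and real code should use the driver's bound parameters rather than string interpolation:

```python
def build_cql_select(keyspace, table, partition_filter=None,
                     columns=None, limit=None):
    """Build a CQL SELECT string following the partition-key pattern.

    partition_filter: dict of column -> value; Cassandra expects the
    partition key to be present whenever filtering. Values are inlined
    as literals here for illustration only.
    """
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM {keyspace}.{table}"
    if partition_filter:
        conds = " AND ".join(f"{c} = '{v}'" for c, v in partition_filter.items())
        query += f" WHERE {conds}"
    if limit:
        query += f" LIMIT {int(limit)}"
    return query + ";"
```

For example, `build_cql_select("ks", "events", {"user_id": "42"}, limit=100)` yields a query of the recommended shape shown above.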

Datatype Conversion: Cassandra vs. Redshift

When moving data between Cassandra and Amazon Redshift (e.g., for analytics workloads), you may experience datatype conversions. Example mappings:

| Cassandra Data Type | Amazon Redshift Data Type |
| --- | --- |
| text | VARCHAR |
| timestamp | TIMESTAMP |
| int | INTEGER |
| uuid | VARCHAR or CHAR(36) |
| blob | VARBYTE |
| boolean | BOOLEAN |
| decimal | DECIMAL |

Custom logic may be required for complex types (e.g., sets, lists, maps), and precision may vary for types like `decimal` or `timestamp`.
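As an illustration, the mapping above can be captured in a small lookup, with collection types serialized to JSON strings before loading into Redshift. This is a sketch under simplified assumptions (unknown types default to VARCHAR; the collection handling is one possible convention, not Matillion behavior):

```python
import json

# Cassandra-to-Redshift type mapping (mirrors the table above).
TYPE_MAP = {
    "text": "VARCHAR",
    "timestamp": "TIMESTAMP",
    "int": "INTEGER",
    "uuid": "CHAR(36)",
    "blob": "VARBYTE",
    "boolean": "BOOLEAN",
    "decimal": "DECIMAL",
}

def redshift_type(cassandra_type: str) -> str:
    """Map a Cassandra column type to a Redshift type.

    Collection types (set<...>, list<...>, map<...>) have no direct
    scalar equivalent, so they are stored as JSON text here.
    """
    base = cassandra_type.strip().lower()
    if base.startswith(("set<", "list<", "map<")):
        return "VARCHAR"  # serialized to JSON before loading
    return TYPE_MAP.get(base, "VARCHAR")  # default for unmapped types

def serialize_value(value):
    """Serialize collection values to JSON strings; pass scalars through."""
    if isinstance(value, (set, frozenset)):
        return json.dumps(sorted(value))
    if isinstance(value, (list, dict)):
        return json.dumps(value)
    return value
```

A `uuid` could equally be mapped to `VARCHAR`; `CHAR(36)` is chosen here because the canonical UUID text form has a fixed length.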

Initial Load vs. Incremental Load Patterns

A commonly recommended pattern is to use a two-step data load approach:

  1. Initial (One-time) Loading: Load all existing records from the source Cassandra table to the destination (e.g., Redshift).
  2. Incremental Subsequent Loading: Ingest only new or changed records since the last successful load.

The Matillion Incremental Load Strategy fits this pattern and is widely adopted in ETL pipelines.

Using the Database Query Component

Both initial and incremental loads utilize the Database Query component. The difference is in the use of the filter (WHERE clause):

  • Initial Load: No filter is specified, to select all data.

    SELECT * FROM keyspace_name.table_name;

  • Incremental Load: A WHERE clause is included (e.g., based on a timestamp or incremental ID) to fetch only new or changed data:

    SELECT * FROM keyspace_name.table_name 
        WHERE updated_at > '2024-06-13T00:00:00Z';

    or, equivalently, with a parameter:

    SELECT * FROM keyspace_name.table_name 
        WHERE last_modified >= :last_batch_time;

Use the value of `last_batch_time` (or equivalent) as a parameter to drive incremental loads. This approach minimizes data movement and adheres to best practices in scalable data engineering across Cassandra and Redshift.

For more on incremental load strategy and best practices, refer to the Matillion Incremental Load Diagram and Details.
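The watermark pattern above can be sketched in a few lines of Python. The `updated_at` column name is an assumption about your table (filtering on a non-partition-key column in Cassandra also assumes a secondary index or `ALLOW FILTERING`), and the watermark comparison relies on ISO-8601 strings sorting chronologically:

```python
def incremental_query(keyspace: str, table: str, watermark: str,
                      ts_column: str = "updated_at") -> str:
    """Build the incremental-load query from the last successful watermark."""
    return (f"SELECT * FROM {keyspace}.{table} "
            f"WHERE {ts_column} > '{watermark}';")

def next_watermark(rows, ts_column: str = "updated_at") -> str:
    """Advance the watermark to the newest timestamp seen in this batch.

    Persist the returned value after a successful load so the next run
    picks up only rows changed since then.
    """
    return max(row[ts_column] for row in rows)
```

After each successful load, store the value returned by `next_watermark` (for example, in a pipeline variable) and pass it back in as `watermark` on the next run.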

Data Integration Architecture

Loading data in advance of integration is a robust divide-and-conquer strategy, separating the ingestion of raw data from the subsequent integration and transformation processes. This two-step approach is a key advantage of the Extract, Load, and Transform (ELT) architecture. In Redshift, data integration necessarily involves transforming disparate data sources into a unified format, which is most effectively achieved using automated data transformation pipelines. A further benefit of the ELT model is that these transformation and integration steps occur directly within the target Redshift database. This design ensures that operations are fast, scalable, and available on-demand, eliminating the need to provision and maintain separate, costly data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.