Integrate data from Apache Cassandra to Snowflake using Matillion

Our Apache Cassandra to Snowflake connector enables seamless, automatic data migration and synchronization to Snowflake within minutes, eliminating the need for manual coding or complex ETL processes.


Extracting data from Apache Cassandra to Snowflake

Extracting data from Apache Cassandra is a common requirement for organizations seeking to consolidate operational data into cloud-based platforms such as Snowflake for advanced analytics and reporting. Achieving a seamless data transfer involves several preparatory and technical steps to ensure both security and efficiency throughout the process. In this article, we will guide you through each stage of extracting data from Apache Cassandra and loading it into Snowflake.

We will start by demonstrating how to create an appropriate identity in Apache Cassandra to securely access your data. For users of Matillion, one of the most popular ETL platforms, we will outline how to check for, or acquire, a suitable JDBC driver to facilitate the integration. Next, we will address how to establish trustworthy and performant network connectivity between your source (Cassandra) and target (Snowflake) environments. Finally, we will explain techniques for querying Cassandra data, covering both the initial large-scale extraction and ongoing incremental updates.

By following this article, you will gain the knowledge needed to set up a robust data pipeline from Apache Cassandra to Snowflake, enabling enhanced data-driven insights for your organization.


What is Apache Cassandra?

Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle massive amounts of data across many commodity servers with no single point of failure. Originating from work at Facebook and later open-sourced, Cassandra employs a peer-to-peer architecture where each node has identical responsibility, enabling continuous availability and fault tolerance even during node outages or network partitions. The system supports tunable consistency levels, allowing engineers to balance between availability and correctness based on specific application needs. Utilizing a partitioned row store with a flexible schema, Cassandra is well-suited for real-time big data applications, supporting efficient read and write operations, horizontal scaling, and multi-datacenter replication for robust disaster recovery.


What is Snowflake?

Snowflake is a cloud-native data platform offering scalable data warehousing and analytics across AWS, Azure, and Google Cloud. Its multi-cluster, shared-data architecture separates compute from storage, enabling dynamic scaling and concurrent workloads. Supporting structured and semi-structured data (JSON, Parquet, Avro), Snowflake provides a fully managed, ANSI SQL-compliant interface for ingestion, transformation, and querying. Built-in security includes automatic encryption, fine-grained access control, and auditing. Features like zero-copy cloning, time travel, and secure data sharing help organizations streamline analytics and collaborate securely at scale.

Why Move Data from Apache Cassandra into Snowflake

Unlocking Deeper Insights: The Case for Moving Data from Apache Cassandra to Snowflake for Analytics

A data engineer or architect may consider copying data from Apache Cassandra into Snowflake for several compelling reasons. Data stored in Apache Cassandra is often of substantial value, containing high-velocity transactional records, time-series events, or customer activities that are critical for analytics and business insight. However, the true potential of this data emerges when it is integrated with information from other systems, such as enterprise data warehouses, CRM platforms, or external datasets. Conducting such integration and analysis directly within Cassandra can pose challenges, as it is optimized for high-throughput transactional workloads, and attempting to run complex analytical queries may degrade its performance or impact its primary use cases. By offloading data to Snowflake, which is purpose-built for scalable analytics and data integration, organizations can efficiently combine Cassandra data with other sources without burdening the operational database. This approach not only protects the performance of Cassandra but also leverages Snowflake’s advanced analytical and data transformation capabilities for richer business insights.

Creating a User in Apache Cassandra

To create a new user (identity) in an Apache Cassandra database, you must have access to a cluster where authentication is enabled. Users are created with CQL (Cassandra Query Language), using the `CREATE USER` statement.

Prerequisites

  • You have administrator-level access to your Cassandra cluster.
  • Authentication and authorization are enabled (`authenticator: PasswordAuthenticator`) in `cassandra.yaml`.
  • You can connect to the database using `cqlsh` or another CQL client.

Step-by-Step Instructions

1. Connect to Cassandra

Use the `cqlsh` client and authenticate as a user with sufficient permissions (typically, the `cassandra` superuser):

```bash
cqlsh -u cassandra -p your_admin_password
```

2. Create a New User

Use the following CQL command to create a user. Replace `new_user` and `new_password` with the desired username and password. You can choose whether the user is a `SUPERUSER` or not (`NOSUPERUSER`):

```
-- To create a regular user
CREATE USER new_user WITH PASSWORD 'new_password' NOSUPERUSER;

-- To create a superuser (with all privileges)
CREATE USER new_user WITH PASSWORD 'new_password' SUPERUSER;
```

Example:

```
CREATE USER analytics_app WITH PASSWORD 'SecurePassword42!' NOSUPERUSER;
```

3. Grant Permissions (Optional)

By default, new users do not have any permissions. Grant them permissions as needed:

```
-- Grant ALL permissions on a keyspace to the new user
GRANT ALL PERMISSIONS ON KEYSPACE my_keyspace TO analytics_app;

-- Or grant limited permissions, for example, SELECT only:
GRANT SELECT ON KEYSPACE my_keyspace TO analytics_app;
```

4. Verify User Creation

List all users in the system to confirm that the new user was created:

```
LIST USERS;
```

5. Connect as the New User

Test the new user's credentials:

```bash
cqlsh -u analytics_app -p SecurePassword42!
```

Important Notes

  • Passwords must be enclosed in single quotes and are case sensitive.
  • User management statements require `SUPERUSER` privileges.
  • Modifications, such as changing a password, can be accomplished with the `ALTER USER` statement.
  • In recent Cassandra versions, `CREATE USER` and `LIST USERS` are maintained as backward-compatible aliases for the role-based statements (`CREATE ROLE`, `LIST ROLES`).

For further details, reference: Apache Cassandra 4.x Documentation - Authentication.

Installing the JDBC Driver

The Apache Cassandra JDBC driver is required to connect Matillion Data Productivity Cloud to your Cassandra database. Due to licensing and redistribution restrictions, this JDBC driver is not bundled with the product and must be downloaded and installed manually by each user.

Step 1: Download the JDBC Driver

  1. Navigate to the official Simba website to download the Cassandra JDBC driver:
    https://www.simba.com/drivers/cassandra-jdbc-odbc/

  2. On the download page, locate the Type 4 JDBC driver. Type 4 drivers are preferred as they are platform-independent and communicate directly with the database server over the network.

  3. Download the latest available version of the Type 4 Cassandra JDBC driver.

Step 2: Prepare for Installation

The file you download will typically be a single JAR file. Ensure you have access to the Matillion Agent or Workstation environment if you need to perform the installation.

Step 3: Install the JDBC Driver into Matillion

  1. Consult Matillion’s official documentation for uploading external drivers:
    Uploading External Drivers

  2. Follow the steps as outlined. This involves accessing the Matillion Data Productivity Cloud platform, navigating to the appropriate section for managing database drivers, and uploading the downloaded JAR file.

  3. After uploading, the driver will become available for use in database connections managed by the agent.

Step 4: Configure Connection and Usage

For detailed usage instructions, refer to Matillion’s documentation on running database queries via JDBC:
Database Query Documentation

This guide will help you set up new connections, define connection parameters, and begin querying your Cassandra database through the Matillion Data Productivity Cloud interface.

Note: Always ensure you are complying with licensing agreements associated with the JDBC driver and review the distributor’s license terms when downloading.

Checking network connectivity

To enable Matillion Data Productivity Cloud to connect to your Apache Cassandra database, you must ensure that incoming network connections are permitted. The configuration depends on your deployment type (Full SaaS or Hybrid SaaS).

Additionally, if you are connecting to Apache Cassandra using a DNS hostname (rather than a direct IP address), you must ensure that the agent (either Full SaaS or Hybrid SaaS) is able to resolve the DNS address properly. This means that all required DNS and network firewall rules must be correctly configured so that the agent host can access and resolve the Cassandra database endpoint.
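Before configuring the connection in Matillion, you can verify both DNS resolution and port reachability from the agent host with a short script like this (a minimal sketch; the hostname is a placeholder for your environment, and 9042 is Cassandra's default CQL native transport port):

```python
import socket

def check_cassandra_reachable(host, port=9042, timeout=5):
    """Check that the host resolves in DNS and accepts TCP connections on the CQL port."""
    try:
        # Resolve the DNS name first, so DNS failures are reported separately
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return False, f"DNS resolution failed: {exc}"
    try:
        # Attempt a plain TCP connection to the native transport port
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"TCP connection to {host}:{port} succeeded"
    except OSError as exc:
        return False, f"TCP connection failed: {exc}"
```

A `DNS resolution failed` result points at name resolution or firewall DNS rules; a `TCP connection failed` result points at network firewall rules between the agent and the Cassandra endpoint.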

Querying Data from an Apache Cassandra Database

This guide outlines how to retrieve data from an Apache Cassandra database, including example `SELECT` queries, considerations for datatype conversion with Snowflake targets, and best practices for initial and incremental load patterns. For more information on incremental load strategies, refer to the Matillion Exchange article.


Example Cassandra `SELECT` Queries

Below are some example SQL-like `SELECT` statements for retrieving data from a Cassandra table:

```
-- Return all rows and columns from the users table
SELECT * FROM keyspace_name.users;

-- Select specific columns
SELECT user_id, first_name, email FROM keyspace_name.users;

-- Add a filtering WHERE clause (note: the column used in WHERE must be part
-- of the primary key, or ALLOW FILTERING is required)
SELECT * FROM keyspace_name.users WHERE user_id = 'a123';

-- Example with ALLOW FILTERING (use cautiously; can impact performance)
SELECT * FROM keyspace_name.users WHERE email = '[email protected]' ALLOW FILTERING;
```

Note: Cassandra is not a relational database; filtering and sorting work differently compared to traditional SQL databases.


Datatype Conversion: Cassandra ↔ Snowflake

When extracting data from Cassandra to load into Snowflake, be aware of datatype conversions:

| Cassandra Type | Snowflake Type (recommended) |
| --- | --- |
| text | VARCHAR |
| int | NUMBER(10, 0) |
| bigint | NUMBER(19, 0) |
| timestamp | TIMESTAMP_NTZ |
| uuid | VARCHAR |
| boolean | BOOLEAN |
| float | FLOAT |
| decimal | NUMBER |
| list/set/map | VARIANT (or flatten as JSON) |

Always test and validate data types to avoid data integrity issues during ETL.
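When generating Snowflake DDL from a Cassandra schema, the mapping above can be captured in a small helper (a sketch based on the table; collection types such as `list<text>` are routed to `VARIANT`, and unknown types fall back to `VARCHAR` as an assumption, not a rule):

```python
# Recommended Cassandra -> Snowflake type mapping (see table above)
CASSANDRA_TO_SNOWFLAKE = {
    "text": "VARCHAR",
    "int": "NUMBER(10, 0)",
    "bigint": "NUMBER(19, 0)",
    "timestamp": "TIMESTAMP_NTZ",
    "uuid": "VARCHAR",
    "boolean": "BOOLEAN",
    "float": "FLOAT",
    "decimal": "NUMBER",
}

def snowflake_type(cassandra_type: str) -> str:
    """Map a Cassandra column type to a recommended Snowflake type."""
    base = cassandra_type.strip().lower()
    # Collection types (list<...>, set<...>, map<...>) are loaded as VARIANT
    if base.startswith(("list", "set", "map")):
        return "VARIANT"
    # Fallback for types not covered by the table above
    return CASSANDRA_TO_SNOWFLAKE.get(base, "VARCHAR")
```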


Recommended Load Pattern: Initial & Incremental Loads

Best Practice:
To efficiently replicate data from Cassandra to Snowflake (or any other target), use a one-off initial load followed by incremental loads. In both cases, use the same Database Query component or process for consistency.

1. Initial Data Load

  • Query: No filter clause; extracts the entire table:
    `SELECT * FROM keyspace_name.table_name;`
  • Use Case: At the start of a project or data replication cycle to capture all existing historical data.

2. Incremental Data Load

  • Query: Includes a filter (e.g., a WHERE clause) based on change detection, such as a timestamp, a `last_updated` column, or a surrogate key:
    `SELECT * FROM keyspace_name.table_name WHERE last_updated > '2024-06-01 00:00:00';`
  • The filter column must be indexed or part of the primary key for acceptable performance.

  • Use Case: Scheduled extracts following the initial load to fetch only newly inserted or modified records.
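Because both patterns share the same query shape, they can be produced from a single stored watermark. The helper below is a hypothetical sketch (function and parameter names are illustrative); `ALLOW FILTERING` is appended only when the watermark column is not indexed, and should be used with care:

```python
def build_extract_query(keyspace, table, watermark_col=None, last_watermark=None,
                        needs_allow_filtering=False):
    """Build the CQL for an initial (no watermark) or incremental extract."""
    query = f"SELECT * FROM {keyspace}.{table}"
    if watermark_col and last_watermark:
        # Incremental load: only rows changed since the last recorded watermark
        query += f" WHERE {watermark_col} > '{last_watermark}'"
        if needs_allow_filtering:
            # Needed when the filter column is not indexed; can be expensive
            query += " ALLOW FILTERING"
    return query + ";"
```

After each incremental run, persist the maximum watermark value seen so the next scheduled extract can resume from it.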

For a detailed walkthrough of implementing incremental loads, refer to the Matillion Exchange article on incremental load strategy.


Tip: Always design your Cassandra tables and queries with your querying and load pattern in mind for optimal performance and maintainability.

Data Integration Architecture

Loading data into Snowflake ahead of integration represents a "divide and conquer" approach, in which the process is neatly separated into two distinct stages: first, loading, then transforming and integrating the data. This is one of the core advantages of the ELT (Extract, Load, Transform) architecture. Within this architecture, data integration necessarily involves data transformation, and the most effective way to accomplish this is through the use of dedicated data transformation pipelines. These pipelines automate the complex operations required to reconcile and standardize data from diverse sources. Furthermore, ELT’s design means that both transformation and integration take place directly inside the target Snowflake database itself. As a result, these processes benefit from Snowflake’s high performance, scalability, and on-demand compute model—eliminating the need to maintain, manage, or pay separately for external data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.