Extracting data from Apache Cassandra to Snowflake
Extracting data from Apache Cassandra is a common requirement for organizations seeking to consolidate operational data into cloud-based platforms such as Snowflake for advanced analytics and reporting. Achieving a seamless data transfer involves several preparatory and technical steps to ensure both security and efficiency throughout the process. In this article, we will guide you through each stage of extracting data from Apache Cassandra and loading it into Snowflake. We will start by demonstrating how to create an appropriate identity in Apache Cassandra to securely access your data. For users of Matillion—one of the most popular ETL platforms—we will outline how to check for, or acquire, a suitable JDBC driver to facilitate the integration. Next, we will address the important topic of establishing trustworthy and performant network connectivity between your source (Cassandra) and target (Snowflake) environments. Finally, we will explain techniques for querying Cassandra data, covering both initial large-scale extraction as well as ongoing incremental updates. By following this article, you will gain the knowledge needed to set up a robust data pipeline from Apache Cassandra to Snowflake, enabling enhanced data-driven insights for your organization.
What is Apache Cassandra?
Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle massive amounts of data across many commodity servers with no single point of failure. Originating from work at Facebook and later open-sourced, Cassandra employs a peer-to-peer architecture where each node has identical responsibility, enabling continuous availability and fault tolerance even during node outages or network partitions. The system supports tunable consistency levels, allowing engineers to balance between availability and correctness based on specific application needs. Utilizing a partitioned row store with a flexible schema, Cassandra is well-suited for real-time big data applications, supporting efficient writes/read operations, horizontal scaling, and multi-datacenter replication for robust disaster recovery.
What is Snowflake?
Snowflake is a cloud-native data platform offering scalable data warehousing and analytics across AWS, Azure, and Google Cloud. Its multi-cluster, shared-data architecture separates compute from storage, enabling dynamic scaling and concurrent workloads. Supporting structured and semi-structured data (JSON, Parquet, Avro), Snowflake provides a fully managed, ANSI SQL-compliant interface for ingestion, transformation, and querying. Built-in security includes automatic encryption, fine-grained access control, and auditing. Features like zero-copy cloning, time travel, and secure data sharing help organizations streamline analytics and collaborate securely at scale.
Why Move Data from Apache Cassandra into Snowflake
Unlocking Deeper Insights: The Case for Moving Data from Apache Cassandra to Snowflake for Analytics
A data engineer or architect may consider copying data from Apache Cassandra into Snowflake for several compelling reasons. Data stored in Apache Cassandra is often of substantial value, containing high-velocity transactional records, time-series events, or customer activities that are critical for analytics and business insight. However, the true potential of this data emerges when it is integrated with information from other systems, such as enterprise data warehouses, CRM platforms, or external datasets. Conducting such integration and analysis directly within Cassandra can pose challenges, as it is optimized for high-throughput transactional workloads, and attempting to run complex analytical queries may degrade its performance or impact its primary use cases. By offloading data to Snowflake, which is purpose-built for scalable analytics and data integration, organizations can efficiently combine Cassandra data with other sources without burdening the operational database. This approach not only protects the performance of Cassandra but also leverages Snowflake’s advanced analytical and data transformation capabilities for richer business insights.
Creating a User in Apache Cassandra
To create a new user (identity) in an Apache Cassandra database, you must have access to a cluster where authentication is enabled. Users are created with CQL (Cassandra Query Language), using the `CREATE USER` statement. Note that since Cassandra 2.2, `CREATE USER` is maintained as an alias for the more general `CREATE ROLE` statement.
Prerequisites
- You have administrator-level access to your Cassandra cluster.
- Authentication and authorization are enabled (`authenticator: PasswordAuthenticator` in `cassandra.yaml`).
- You can connect to the database using `cqlsh` or another CQL client.
Step-by-Step Instructions
1. Connect to Cassandra
Use the `cqlsh` client and authenticate as a user with sufficient permissions (typically, the `cassandra` superuser):

```bash
cqlsh -u cassandra -p your_admin_password
```
2. Create a New User
Use the following CQL command to create a user. Replace `new_user` and `new_password` with the desired username and password. You can choose whether the user is a `SUPERUSER` or not (`NOSUPERUSER`):

```sql
-- To create a regular user
CREATE USER new_user WITH PASSWORD 'new_password' NOSUPERUSER;

-- To create a superuser (with all privileges)
CREATE USER new_user WITH PASSWORD 'new_password' SUPERUSER;
```
Example:

```sql
CREATE USER analytics_app WITH PASSWORD 'SecurePassword42!' NOSUPERUSER;
```
3. Grant Permissions (Optional)
By default, new users do not have any permissions. Grant them permissions as needed:
```sql
-- Grant ALL permissions on a keyspace to the new user
GRANT ALL PERMISSIONS ON KEYSPACE my_keyspace TO analytics_app;

-- Or grant limited permissions, for example, SELECT only:
GRANT SELECT ON KEYSPACE my_keyspace TO analytics_app;
```
4. Verify User Creation
List all users in the system to confirm that the new user was created:
```sql
LIST USERS;
```
5. Connect as the New User
Test the new user's credentials:
```bash
cqlsh -u analytics_app -p SecurePassword42!
```
Important Notes
- Passwords must be enclosed in single quotes and are case sensitive.
- User management statements require `SUPERUSER` privileges.
- Modifications, such as changing a password, can be accomplished with the `ALTER USER` statement.
For further details, refer to the Apache Cassandra 4.x documentation on authentication.
Installing the JDBC Driver
The Apache Cassandra JDBC driver is required to connect Matillion Data Productivity Cloud to your Cassandra database. Due to licensing and redistribution restrictions, this JDBC driver is not bundled with the product and must be downloaded and installed manually by each user.
Step 1: Download the JDBC Driver
- Navigate to the official Simba website to download the Cassandra JDBC driver: https://www.simba.com/drivers/cassandra-jdbc-odbc/
- On the download page, locate the Type 4 JDBC driver. Type 4 drivers are preferred as they are platform-independent and communicate directly with the database server over the network.
- Download the latest available version of the Type 4 Cassandra JDBC driver.
Step 2: Prepare for Installation
The file you download will typically be a single JAR file. Ensure you have access to the Matillion Agent or Workstation environment if you need to perform the installation.
Step 3: Install the JDBC Driver into Matillion
- Consult Matillion’s official documentation for uploading external drivers: Uploading External Drivers
- Follow the steps as outlined. This involves accessing the Matillion Data Productivity Cloud platform, navigating to the appropriate section for managing database drivers, and uploading the downloaded JAR file.
- After uploading, the driver will become available for use in database connections managed by the agent.
Step 4: Configure Connection and Usage
For detailed usage instructions, refer to Matillion’s documentation on running database queries via JDBC:
Database Query Documentation
This guide will help you set up new connections, define connection parameters, and begin querying your Cassandra database through the Matillion Data Productivity Cloud interface.
Note: Always ensure you are complying with licensing agreements associated with the JDBC driver and review the distributor’s license terms when downloading.
Checking network connectivity
To enable Matillion Data Productivity Cloud to connect to your Apache Cassandra database, you must ensure that incoming network connections are permitted. The configuration depends on your deployment type:
- Full SaaS agent configuration: Allow incoming connections to your Apache Cassandra database from the specific IP addresses used by Matillion services. The current list of IP addresses is provided at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/
- Hybrid SaaS deployment: Allow incoming connections from the network of your own virtual private cloud (VPC), where the Hybrid SaaS agent is deployed. For assistance in identifying your source IP or validating network access, use the utilities available here: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Additionally, if you are connecting to Apache Cassandra using a DNS hostname (rather than a direct IP address), you must ensure that the agent (either Full SaaS or Hybrid SaaS) is able to resolve the DNS address properly. This means that all required DNS and network firewall rules must be correctly configured so that the agent host can access and resolve the Cassandra database endpoint.
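As a quick sanity check before configuring the agent, the reachability of the Cassandra endpoint can be verified from any host on the same network path. The following is a minimal sketch using only the Python standard library; the hostname shown is a placeholder for your own endpoint, and 9042 is Cassandra's default native-transport port:

```python
import socket

def resolve_host(hostname: str) -> list[str]:
    """Return the IP addresses a hostname resolves to (verifies DNS setup)."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def can_reach(hostname: str, port: int = 9042, timeout: float = 5.0) -> bool:
    """Attempt a TCP connection to the Cassandra native-protocol port."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage (replace with your own Cassandra endpoint):
# print(resolve_host("cassandra.example.internal"))
# print(can_reach("cassandra.example.internal", 9042))
```

A successful DNS resolution combined with a failed TCP connection usually points at a firewall or security-group rule rather than a naming problem.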
Querying Data from an Apache Cassandra Database
This guide outlines how to retrieve data from an Apache Cassandra database, including example `SELECT` queries, considerations for datatype conversion with Snowflake targets, and best practices for initial and incremental load patterns. For more information on incremental load strategies, refer to the Matillion Exchange article.
Example Cassandra `SELECT` Queries
Below are some example CQL `SELECT` statements for retrieving data from a Cassandra table:

```sql
-- Return all rows and columns from the users table
SELECT * FROM keyspace_name.users;

-- Select specific columns
SELECT user_id, first_name, email FROM keyspace_name.users;

-- Add a filtering WHERE clause
-- (note: columns used in WHERE must be part of the primary key, or use ALLOW FILTERING)
SELECT * FROM keyspace_name.users WHERE user_id = 'a123';

-- Example with ALLOW FILTERING (use cautiously; can impact performance)
SELECT * FROM keyspace_name.users WHERE email = '[email protected]' ALLOW FILTERING;
```
Note: Cassandra is not a relational database; filtering and sorting work differently compared to traditional SQL databases.
Datatype Conversion: Cassandra ↔ Snowflake
When extracting data from Cassandra to load into Snowflake, be aware of datatype conversions:
| Cassandra Type | Snowflake Type (recommended) |
|---|---|
| `text` | `VARCHAR` |
| `int` | `NUMBER(10, 0)` |
| `bigint` | `NUMBER(19, 0)` |
| `timestamp` | `TIMESTAMP_NTZ` |
| `uuid` | `VARCHAR` |
| `boolean` | `BOOLEAN` |
| `float` | `FLOAT` |
| `decimal` | `NUMBER` |
| `list`/`set`/`map` | `VARIANT`, or flatten as JSON |
Always test and validate data types to avoid data integrity issues during ETL.
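These mappings can be encoded directly in extraction tooling so that target DDL is generated consistently. The following is an illustrative Python sketch based on the table above; the function and dictionary names are our own, not part of any Matillion or Snowflake API, and precision/scale should be adjusted to your schema:

```python
# Recommended type mappings from the table above (hypothetical helper, not a library API).
CASSANDRA_TO_SNOWFLAKE = {
    "text": "VARCHAR",
    "int": "NUMBER(10, 0)",
    "bigint": "NUMBER(19, 0)",
    "timestamp": "TIMESTAMP_NTZ",
    "uuid": "VARCHAR",
    "boolean": "BOOLEAN",
    "float": "FLOAT",
    "decimal": "NUMBER",
    "list": "VARIANT",
    "set": "VARIANT",
    "map": "VARIANT",
}

def snowflake_type(cassandra_type: str) -> str:
    """Map a Cassandra column type to a recommended Snowflake type.
    Parameterized collections such as list<text> reduce to their base name."""
    base = cassandra_type.split("<", 1)[0].strip().lower()
    try:
        return CASSANDRA_TO_SNOWFLAKE[base]
    except KeyError:
        raise ValueError(f"No mapping defined for Cassandra type: {cassandra_type}")

def build_create_table(table: str, columns: dict[str, str]) -> str:
    """Generate a Snowflake CREATE TABLE statement from Cassandra column types."""
    cols = ", ".join(f"{name} {snowflake_type(ctype)}" for name, ctype in columns.items())
    return f"CREATE TABLE {table} ({cols});"
```

For example, `build_create_table("users", {"user_id": "uuid", "signup": "timestamp"})` yields a `CREATE TABLE` statement with `VARCHAR` and `TIMESTAMP_NTZ` columns.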
Recommended Load Pattern: Initial & Incremental Loads
Best Practice:
To efficiently replicate data from Cassandra to Snowflake (or any other target), use a one-off initial load followed by incremental loads. In both cases, use the same Database Query component or process for consistency.
1. Initial Data Load
- Query: No filter clause; extracts the entire table.

```sql
SELECT * FROM keyspace_name.table_name;
```

- Use Case: At the start of a project or data replication cycle to capture all existing historical data.
2. Incremental Data Load
- Query: Includes a filter (e.g., a WHERE clause) based on a change-detection column (e.g., a `last_updated` timestamp or a surrogate key).

```sql
SELECT * FROM keyspace_name.table_name WHERE last_updated > '2024-06-01 00:00:00';
```

- The filter column must be indexed or part of the primary key for performance.
- Use Case: Scheduled extracts following the initial load to fetch only newly inserted or modified records.
For a detailed walkthrough of implementing incremental loads, refer to the Matillion Exchange article on incremental load strategy.
Tip: Always design your Cassandra tables and queries with your querying and load pattern in mind for optimal performance and maintainability.
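Because the two load patterns differ only in the WHERE clause, they can share one query builder. Below is a minimal, illustrative Python sketch; the function names and the `last_updated` watermark column are assumptions matching the examples above, and depending on your table design the incremental form may additionally require a secondary index on the watermark column (or `ALLOW FILTERING`):

```python
from datetime import datetime
from typing import Any, Optional

def build_extract_query(table: str,
                        watermark_column: str = "last_updated",
                        last_watermark: Optional[datetime] = None) -> str:
    """Full-table query for the initial load when no watermark is known;
    otherwise a watermark-filtered query for incremental loads."""
    query = f"SELECT * FROM {table}"
    if last_watermark is not None:
        ts = last_watermark.strftime("%Y-%m-%d %H:%M:%S")
        query += f" WHERE {watermark_column} > '{ts}'"
    return query + ";"

def next_watermark(rows: list[dict[str, Any]],
                   watermark_column: str = "last_updated"):
    """Highest watermark value seen in this batch; persist it for the next run."""
    return max((row[watermark_column] for row in rows), default=None)
```

After each incremental run, store the value returned by `next_watermark` (for example, in a control table or pipeline variable) and pass it back as `last_watermark` on the next scheduled extract.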
Data Integration Architecture
Loading data into Snowflake ahead of integration represents a "divide and conquer" approach, in which the process is neatly separated into two distinct stages: first, loading, then transforming and integrating the data. This is one of the core advantages of the ELT (Extract, Load, Transform) architecture. Within this architecture, data integration necessarily involves data transformation, and the most effective way to accomplish this is through the use of dedicated data transformation pipelines. These pipelines automate the complex operations required to reconcile and standardize data from diverse sources. Furthermore, ELT’s design means that both transformation and integration take place directly inside the target Snowflake database itself. As a result, these processes benefit from Snowflake’s high performance, scalability, and on-demand compute model—eliminating the need to maintain, manage, or pay separately for external data processing infrastructure.
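To make the two-stage split concrete, the sketch below lists the kind of statements an ELT pipeline issues. All names are hypothetical (a `@cassandra_stage` external stage, staging and target tables, a `VARIANT` column `raw`): the load step only copies data in unchanged, while all transformation SQL runs inside Snowflake on its own compute.

```python
def elt_statements(staging_table: str, target_table: str) -> list[str]:
    """Return the two ELT stages as SQL strings: (1) load raw data into a
    staging table, (2) transform and integrate inside the target database."""
    # Stage 1 - Load: copy extracted files into staging as-is (no transformation).
    load = (
        f"COPY INTO {staging_table} FROM @cassandra_stage "
        "FILE_FORMAT = (TYPE = 'JSON');"
    )
    # Stage 2 - Transform: reshape semi-structured rows using Snowflake compute.
    transform = (
        f"INSERT INTO {target_table} "
        "SELECT raw:user_id::VARCHAR, raw:last_updated::TIMESTAMP_NTZ "
        f"FROM {staging_table};"
    )
    return [load, transform]
```

Because both statements execute inside Snowflake, the pipeline itself only orchestrates; no external transformation infrastructure is required.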