Extracting data from Apache Cassandra to Amazon Redshift
Extracting data from Apache Cassandra is a key requirement for organizations that wish to combine NoSQL sources with their cloud data warehouse for advanced analytics and reporting. In this article, we will guide you through the process of transferring data from Apache Cassandra into Amazon Redshift, ensuring a reliable and scalable workflow. We will begin by explaining how to set up an identity in Apache Cassandra, necessary for secure, authenticated access. For users of Matillion, we will discuss how to verify the presence of a suitable JDBC driver for Cassandra, or how to acquire one if needed. Next, we will address the critical aspect of network connectivity between your source (Cassandra) and target (Redshift) environments to facilitate a smooth transfer. Finally, we will cover strategies for querying and extracting your data—starting with a full initial load, and then implementing incremental loading to keep Redshift in sync with any changes in Cassandra. By the end of this article, you will have a practical roadmap for integrating Cassandra data with Amazon Redshift.
What is Apache Cassandra?
Apache Cassandra is a highly scalable, distributed NoSQL database built for managing large data volumes across multiple servers with no single point of failure. Using a peer-to-peer, decentralized architecture, it delivers high availability, fault tolerance, and tunable consistency to suit diverse workload needs. Data is partitioned and replicated with consistent hashing, and multi-datacenter support offers geographic resiliency. Cassandra’s flexible wide-column schema efficiently handles structured, semi-structured, or unstructured data, making it ideal for time-series workloads, real-time analytics, and distributed applications needing robust scalability and continuous uptime.
What is Amazon Redshift?
Amazon Redshift is a fully managed, cloud-based data warehouse service designed for large-scale data analytics and storage. Built on PostgreSQL, Redshift supports standard SQL queries and integrates seamlessly with a broad ecosystem of data ingestion, transformation, and business intelligence tools. Its architecture leverages massively parallel processing (MPP), columnar storage, and data compression to optimize query performance for petabyte-scale datasets. Redshift Spectrum further extends capabilities by enabling direct querying of exabytes of data stored in Amazon S3 without loading it into the warehouse. With features like automatic vacuuming, workload management, and elastic scaling, Redshift provides a robust platform for handling complex analytical workloads typical of enterprise environments.
Why Move Data from Apache Cassandra into Amazon Redshift
Unlocking Advanced Analytics: The Case for Copying Data from Apache Cassandra to Amazon Redshift
Apache Cassandra often stores operational and transactional data that holds significant potential value for analytics and business intelligence. However, the true benefit is realized when this data is integrated with information from other sources, such as customer relationship systems, sales databases, or external datasets. By copying data from Apache Cassandra into Amazon Redshift, an environment purpose-built for complex queries and data integration, organizations can perform comprehensive analytics without imposing additional workload or latency on the source Cassandra database. This approach allows companies to leverage the strengths of both platforms: maintaining high performance and availability for transactional operations in Cassandra while enabling rich analytical processing and cross-source insights in Redshift.
Similar connectors
Creating a User in Apache Cassandra
To create a user (identity) in Apache Cassandra, you must have user management enabled via `PasswordAuthenticator`. The following instructions assume you have superuser access and that Cassandra authentication is enabled.
1. Connect to Cassandra
Connect to your Cassandra cluster using `cqlsh` as a superuser. For example:

```bash
cqlsh -u cassandra -p your_superuser_password
```
2. Create a New User
Use the `CREATE USER` CQL command to add a user. Specify the username and password, and choose whether the user should have superuser privileges.
```
-- Create a regular user
CREATE USER example_user WITH PASSWORD 'user_password' NOSUPERUSER;

-- Create a superuser
CREATE USER admin_user WITH PASSWORD 'admin_password' SUPERUSER;
```
Note: Passwords must be enclosed in single quotes.
3. Assign Permissions (Optional)
After creating the user, you can grant permissions for specific keyspaces, tables, or the entire cluster.
```
-- Grant SELECT and MODIFY permissions on a keyspace to a user
GRANT SELECT, MODIFY ON KEYSPACE your_keyspace TO example_user;
```
4. List Existing Users
To verify that the users exist:

```
LIST USERS;
```
5. Enable Authentication (if needed)
Ensure authentication and authorization are enabled in your `cassandra.yaml` configuration:

```yaml
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
```
Reminder: Any configuration changes require a Cassandra restart.
Installing the JDBC Driver
The Apache Cassandra JDBC driver is not bundled by default with the Matillion Data Productivity Cloud, largely due to licensing and redistribution restrictions. As a result, if you intend to connect to Cassandra from Matillion, you are required to manually download and install the driver into your environment. The following steps will guide you through obtaining and installing the JDBC driver for use with Matillion Data Productivity Cloud.
1. Download the JDBC Driver
- Visit the Simba driver download page at: https://www.simba.com/drivers/cassandra-jdbc-odbc/
- Locate and select the Type 4 JDBC driver for Apache Cassandra. Type 4 indicates a “pure Java” implementation, which is preferable because it provides platform independence and easier integration with cloud services like Matillion.
- Complete any registration or acceptance of license agreements as required by Simba.
- Download the driver package (typically supplied as a ZIP or JAR file).
2. Prepare for Installation
Before proceeding with installation, make sure you have:
- Access to your Matillion Data Productivity Cloud environment (Cloud Agent or installation location)
- The downloaded Cassandra JDBC driver file
3. Install the Driver in Matillion
- Refer to the Matillion documentation for uploading external drivers: Uploading External Drivers
- Follow the step-by-step instructions that walk you through the process of uploading the driver to your Matillion Agent or designated environment.
- When prompted, upload the Cassandra JDBC JAR file you downloaded earlier.
4. Connecting and Using the Driver
Once the driver has been successfully uploaded and is available in your Matillion environment, consult the usage instructions to set up database queries and connections:
- Detailed steps are documented here: Database Query Usage
These instructions will cover configuring new connections to your Apache Cassandra database using the newly installed JDBC driver, as well as how to issue queries and work with Cassandra data within Matillion Data Productivity Cloud.
Checking network connectivity
To ensure successful network connectivity between the Matillion Data Productivity Cloud and your Apache Cassandra database, you must allow incoming connections from the appropriate sources, depending on your deployment type:
- Full SaaS deployment: Configure your Apache Cassandra database to allow incoming connections from the Matillion Data Productivity Cloud IP addresses. The current list of IP addresses can be found at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/
- Hybrid SaaS deployment: Allow incoming connections from your own virtual private cloud (VPC), where your Hybrid SaaS agent is deployed. To assist with verifying network access and connectivity, you can use the utilities available here: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Additionally, if you are referencing your Apache Cassandra database using a DNS address, make sure that the Full SaaS or Hybrid SaaS agent is able to resolve this address correctly. DNS resolution is required for establishing the connection to your database instance.
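Before configuring the connection in Matillion, it can save time to verify DNS resolution and TCP reachability from the agent's network yourself. The sketch below is a minimal, hedged example; it assumes Cassandra's default CQL native-transport port 9042, and the host name is a placeholder you would replace with your own contact point:

```python
import socket

def check_cassandra_reachable(host: str, port: int = 9042, timeout: float = 5.0) -> bool:
    """Check DNS resolution and TCP reachability for a Cassandra contact point.

    9042 is Cassandra's default CQL native-transport port; adjust if your
    cluster listens on a non-standard port.
    """
    try:
        # Step 1: confirm the agent can resolve the DNS name.
        addr_info = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        print(f"DNS resolution failed for {host}")
        return False
    # Step 2: attempt a TCP handshake against the first resolved address.
    family, socktype, proto, _, sockaddr = addr_info[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
        return True
    except (socket.timeout, OSError):
        print(f"TCP connection to {sockaddr} failed")
        return False
```

A `False` result here typically means a firewall or security-group rule is blocking the Matillion agent, or the DNS name is not resolvable from the agent's network.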
Querying Data from Apache Cassandra: Technical Guide
Example Apache Cassandra Queries
Apache Cassandra uses CQL (Cassandra Query Language) for querying data, which is similar to SQL but has some syntactical differences and limitations:
```
-- Selecting all columns from a table
SELECT * FROM keyspace_name.table_name;

-- Selecting specific columns
SELECT column1, column2 FROM keyspace_name.table_name;

-- Filtering results by partition key (recommended pattern)
SELECT * FROM keyspace_name.table_name WHERE partition_key = 'some_value';

-- With clustering key for more refined queries
SELECT * FROM keyspace_name.table_name WHERE partition_key = 'some_value' AND clustering_key = 'other_value';

-- Limiting results
SELECT * FROM keyspace_name.table_name WHERE partition_key = 'some_value' LIMIT 100;
```
Note: In Cassandra, filter conditions should include the partition key. Filtering on non-indexed, non-partition-key columns will result in errors unless you add the `ALLOW FILTERING` clause, which is discouraged for performance reasons.
Datatype Conversion: Cassandra vs. Redshift
When moving data between Cassandra and Amazon Redshift (e.g., for analytics workloads), you may experience datatype conversions. Example mappings:
| Cassandra Data Type | Amazon Redshift Data Type |
|---|---|
| `text` | `VARCHAR` |
| `timestamp` | `TIMESTAMP` |
| `int` | `INTEGER` |
| `uuid` | `VARCHAR` or `CHAR(36)` |
| `blob` | `VARBYTE` |
| `boolean` | `BOOLEAN` |
| `decimal` | `DECIMAL` |
Custom logic may be required for complex types (e.g., sets, lists, maps), and precision may vary for types like `decimal` or `timestamp`.
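The mapping table above can be encoded as a small lookup when generating target DDL. This is an illustrative sketch, not a complete type system: the choice of `CHAR(36)` for `uuid` and of Redshift's `SUPER` type for collection types (`list`, `set`, `map`) are assumptions you may want to adjust to your own modeling conventions.

```python
# Illustrative mapping from common Cassandra (CQL) types to Redshift types,
# following the conversion table above.
CQL_TO_REDSHIFT = {
    "text": "VARCHAR",
    "timestamp": "TIMESTAMP",
    "int": "INTEGER",
    "uuid": "CHAR(36)",
    "blob": "VARBYTE",
    "boolean": "BOOLEAN",
    "decimal": "DECIMAL",
}

def redshift_column_type(cql_type: str) -> str:
    """Return a Redshift type for a CQL type, defaulting to VARCHAR."""
    base = cql_type.strip().lower()
    # Collection types such as list<text> or map<text, int> have no direct
    # Redshift equivalent; the semi-structured SUPER type is one pragmatic
    # target, at the cost of needing PartiQL to query the nested values.
    if base.startswith(("list<", "set<", "map<")):
        return "SUPER"
    return CQL_TO_REDSHIFT.get(base, "VARCHAR")
```

For example, `redshift_column_type("list<text>")` yields `SUPER`, while unrecognized types fall back to `VARCHAR` rather than failing the load.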
Initial Load vs. Incremental Load Patterns
A commonly recommended pattern is to use a two-step data load approach:
- Initial (One-time) Loading: Load all existing records from the source Cassandra table to the destination (e.g., Redshift).
- Incremental Subsequent Loading: Ingest only new or changed records since the last successful load.
The Matillion Incremental Load Strategy fits this pattern and is widely adopted in ETL pipelines.
Using the Database Query Component
Both initial and incremental loads utilize the Database Query component. The difference is in the use of the filter (WHERE clause):
- Initial Load: No filter is specified, so all data is selected: `SELECT * FROM keyspace_name.table_name;`
- Incremental Load: A WHERE clause is included (e.g., based on a timestamp or incremental id) to fetch only new or changed data: `SELECT * FROM keyspace_name.table_name WHERE updated_at > '2024-06-13T00:00:00Z';` or `SELECT * FROM keyspace_name.table_name WHERE last_modified >= :last_batch_time;`
Use the value of `last_batch_time` (or equivalent) as a parameter to drive incremental loads. This approach minimizes data movement and adheres to best practices in scalable data engineering across Cassandra and Redshift.
For more on incremental load strategy and best practices, refer to the Matillion Incremental Load Diagram and Details.
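The high-water-mark logic behind the two load patterns can be sketched as follows. This is a minimal illustration, not Matillion's implementation: the table name and the `updated_at` column are placeholders carried over from the examples above, and in Cassandra such a range filter is only efficient when the data model supports it (e.g., `updated_at` is a clustering column).

```python
from datetime import datetime, timezone
from typing import Optional

TABLE = "keyspace_name.table_name"  # placeholder, as in the examples above

def build_extract_query(last_batch_time: Optional[datetime]) -> str:
    """Build the CQL for an initial or incremental extract.

    With no high-water mark recorded yet, fall back to a full initial load;
    on subsequent runs, filter on the assumed `updated_at` column so only
    rows changed since the last successful batch are fetched.
    """
    if last_batch_time is None:
        return f"SELECT * FROM {TABLE};"
    ts = last_batch_time.strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"SELECT * FROM {TABLE} WHERE updated_at > '{ts}';"
```

After each successful load, the pipeline would persist the maximum `updated_at` value it observed and pass it in as `last_batch_time` on the next run.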
Data Integration Architecture
Loading data in advance of integration is a robust divide-and-conquer strategy, separating the ingestion of raw data from the subsequent integration and transformation processes. This two-step approach is a key advantage of the Extract, Load, and Transform (ELT) architecture. In Redshift, data integration necessarily involves transforming disparate data sources into a unified format, which is most effectively achieved using automated data transformation pipelines. A further benefit of the ELT model is that these transformation and integration steps occur directly within the target Redshift database. This design ensures that operations are fast, scalable, and available on-demand, eliminating the need to provision and maintain separate, costly data processing infrastructure.