Extracting data from Apache Cassandra to Databricks
Extracting data from Apache Cassandra is an essential step for organizations seeking to integrate and analyze their data in cloud platforms such as Databricks. Transitioning data between these systems can be complex, involving various configuration and security considerations. This article provides a step-by-step guide to ensure a smooth data extraction and loading process from Cassandra into Databricks. We will begin by outlining how to create an identity in Apache Cassandra, which is required for secure data access. For Matillion users, we will describe how to verify the presence of the appropriate JDBC driver or obtain it if necessary. Ensuring reliable network connectivity between your Cassandra source and the Databricks target is also critical, and we will detail the recommended approaches to achieve this. Finally, we will walk through querying data from Cassandra—both for an initial complete load and for capturing changes incrementally. By following these steps, you can efficiently and securely move your data from Apache Cassandra to Databricks, empowering further analysis and insight.
What is Apache Cassandra?
Apache Cassandra is a highly scalable, distributed NoSQL database designed for handling large structured data volumes across multiple servers without a single point of failure. Its peer-to-peer architecture, multi-datacenter replication, and decentralized design ensure high availability and fault tolerance. Cassandra uses a partitioned row store, supports flexible schemas via CQL (Cassandra Query Language), and offers tunable consistency. Features like automatic data distribution and robust horizontal scaling make it ideal for time-series, sensor, and recommendation workloads, as well as other high-velocity data needs. It is widely used for mission-critical applications requiring continuous uptime, linear scalability, and demanding transactional or analytical processing.
What is Databricks?
Databricks is a cloud-optimized analytics platform built on Apache Spark, designed for large-scale data processing and machine learning. It supports both structured and unstructured data, using formats like Delta Lake that provide ACID transactions and scalable metadata management. Ideal for collaborative data engineering and data science, Databricks integrates with major storage services (e.g., Amazon S3, Azure Data Lake Storage) and supports APIs in Python, Scala, R, and SQL. Its interactive workspace enables versioned notebooks, advanced job scheduling, and robust security, delivering reliable, scalable analytics for enterprise needs.
Why Move Data from Apache Cassandra into Databricks
Unlocking Analytics: Copying Data from Apache Cassandra to Databricks
Data engineers and architects may seek to copy data from Apache Cassandra into Databricks for several compelling reasons. While Apache Cassandra holds potentially valuable data, its full value is often realized only when integrated with information from other sources. Leveraging Databricks facilitates this integration, allowing organizations to combine data from Cassandra with various structured and unstructured data sources for advanced analytics, reporting, and machine learning. Importantly, by utilizing Databricks for the integration and subsequent processing, the additional computational workload is shifted away from Cassandra itself, preserving its performance for its primary roles in high-availability data ingestion and retrieval. This approach enables richer insights and more scalable analytics workflows without compromising the efficiency of the operational database.
Creating a User in Apache Cassandra
To manage access and security in Apache Cassandra, you can create users with custom roles and passwords. Below are the instructions and example CQL scripts for creating a new user.
Prerequisites
- Ensure you have access to a Cassandra node, ideally via `cqlsh`.
- Log in with a user that has sufficient privileges (such as `cassandra`, or a user with `CREATE USER` permissions).
1. Enable Authentication and Authorization (Optional)
Note: By default, Cassandra authentication may be disabled. To enable internal authentication and role-based access control, set the following in `cassandra.yaml` and restart Cassandra:

```yaml
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
```
2. Creating a User
The recommended approach in Apache Cassandra 3.x and later is to create a role that acts as a user; roles superseded the legacy `CREATE USER` syntax.
CREATE ROLE your_username WITH PASSWORD = 'some_secure_password' AND LOGIN = true;
- `your_username`: Replace this with the desired username.
- `some_secure_password`: Replace this with a strong password of your choosing.
- `LOGIN = true`: Allows this role to be used for user authentication.
Example
CREATE ROLE alice WITH PASSWORD = 'w0nderland!123' AND LOGIN = true;
3. Granting Permissions (Optional)
By default, new users do not have any permissions. You must explicitly grant permissions.
GRANT SELECT ON KEYSPACE your_keyspace TO alice;
GRANT MODIFY ON TABLE your_keyspace.your_table TO alice;
- Replace `your_keyspace` and `your_table` with the appropriate keyspace and table names.
4. List Existing Users/Roles
To see the list of users (roles) defined:
LIST ROLES;
Or for a specific user:
LIST ROLES OF alice;
5. Connect as the New User
Exit the current session and connect with the new user's credentials:
cqlsh -u alice -p w0nderland!123
Note: Always ensure strong password practices, and only grant required permissions following the principle of least privilege.
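When provisioning extract users across several environments, the statements above can be generated from a template rather than typed by hand. The following is a minimal Python sketch; the helper name and the read-only grant choice are illustrative, not part of Cassandra's tooling. Note that CQL escapes a single quote inside a string literal by doubling it.

```python
def provision_extract_user(username: str, password: str, keyspace: str) -> list[str]:
    """Render the CQL statements for a read-only extract role."""
    # Single quotes inside CQL string literals are escaped by doubling them.
    safe_pw = password.replace("'", "''")
    return [
        f"CREATE ROLE {username} WITH PASSWORD = '{safe_pw}' AND LOGIN = true;",
        f"GRANT SELECT ON KEYSPACE {keyspace} TO {username};",
    ]

# Example: generate the statements for the 'alice' user from the section above
for stmt in provision_extract_user("alice", "w0nderland!123", "your_keyspace"):
    print(stmt)
```

The statements can then be executed via `cqlsh` or any driver session by an administrative role.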
Installing the JDBC driver
The Matillion Data Productivity Cloud empowers users to connect to a broad array of data sources. However, at the time of writing, the JDBC driver for Apache Cassandra is not included in Matillion by default due to licensing or redistribution restrictions. To establish a Cassandra connection, users must manually obtain and install the appropriate JDBC driver.
Follow these instructions to install the required Apache Cassandra JDBC driver within Matillion Data Productivity Cloud:
- Download the Apache Cassandra JDBC Driver
- Visit the Simba drivers page at https://www.simba.com/drivers/cassandra-jdbc-odbc/.
- Locate the JDBC driver, specifically selecting a Type 4 JDBC driver, as this variant does not require native libraries and is generally preferred for cloud and containerized environments.
- Download the appropriate archive or `.jar` file, ensuring it corresponds with your target Cassandra version and platform requirements.
- Upload the JDBC Driver to Matillion
- Since Matillion Data Productivity Cloud does not provide the Cassandra JDBC driver by default, you will need to upload the driver to your instance.
- Comprehensive, step-by-step instructions for uploading external drivers can be found at https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/. These instructions provide details on how to package and deploy the JDBC driver via the Agent interface in Matillion, including accepted file formats and common troubleshooting tips.
- Connecting and Using the Driver
- After successfully uploading the driver, create a new database connection in Matillion using the provided interface.
- For detailed guidance on configuring your database query components and utilizing the new Cassandra connection, refer to the usage documentation at https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
- Follow the prompts within the Matillion Designer to select the JDBC driver, input the necessary Cassandra connection details, and test the setup before creating database queries.
Note: Always consult the licensing terms for the JDBC driver to ensure compliance with your organization’s policies.
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your Apache Cassandra database, you must verify and configure your network settings appropriately for one of the following deployment types:
- Full SaaS Agent Configuration: Your Apache Cassandra database must allow incoming connections from the specific IP addresses used by Matillion Data Productivity Cloud. Refer to the up-to-date list of required IP addresses in the Matillion documentation, and ensure that your Cassandra security groups, firewall rules, or access control lists permit traffic from these IPs.
- Hybrid SaaS Deployment: In a Hybrid SaaS model, connections will originate from your own virtual private cloud (VPC). Make sure your Cassandra database is configured to accept incoming connections from the IP range(s) or network(s) that represent your VPC. For tools to help you check whether your database is correctly accessible from your VPC, see the Check Network Access utilities provided by Matillion.
Additionally, if you are referencing the Apache Cassandra database by DNS name (rather than by direct IP address), ensure that the agent (in either Full SaaS or Hybrid SaaS mode) is able to resolve the DNS address successfully. This may require making sure that the DNS name is publicly resolvable, or that any necessary private DNS configuration is correctly set up in your environment.
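Before configuring the agent, it can help to sanity-check DNS resolution and TCP reachability from the network where the agent will run. A small stdlib-only sketch follows; the hostname is a placeholder, and 9042 is Cassandra's default CQL native transport port.

```python
import socket

def can_reach(host: str, port: int = 9042, timeout: float = 3.0) -> bool:
    """Return True if the host resolves and accepts a TCP connection on the port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, connection refusal, and timeout
        return False

# Example: probe a Cassandra node (replace with your own hostname or IP)
print(can_reach("cassandra.example.internal"))
```

A `False` result distinguishes only "unreachable"; check DNS resolution separately (e.g. with `socket.getaddrinfo`) if you need to tell a name-resolution failure apart from a firewall block.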
Querying Data from an Apache Cassandra Database
This guide outlines how to query data from an Apache Cassandra database, with SQL SELECT examples, notes on datatype conversion to Databricks, and best practices for implementing initial and incremental loads using a Database Query component.
Sample Cassandra Query: SQL SELECT Statement
Cassandra uses CQL (Cassandra Query Language), which resembles SQL. Below are some example queries:
```sql
-- Query all records from a table
SELECT * FROM keyspace_name.table_name;

-- Query specific columns
SELECT id, name, email FROM keyspace_name.table_name;

-- Query with a filter
SELECT * FROM keyspace_name.table_name WHERE id = '12345';

-- Query with a time-based filter
SELECT * FROM keyspace_name.table_name WHERE last_modified >= '2024-06-21 00:00:00';
```
Note: Cassandra requires `WHERE` clauses to filter on the partition key; filtering on other columns requires a secondary index or `ALLOW FILTERING`, which can be slow on large tables.
Datatype Conversion: Cassandra and Databricks
When extracting data from Apache Cassandra into Databricks, note the following:
- Datatypes in Cassandra (e.g., `uuid`, `timestamp`, `int`, `text`) may not have direct equivalents in Databricks/Spark.
- Databricks will attempt to infer and map datatypes, e.g., Cassandra `timestamp` → Spark `TimestampType`, Cassandra `text` → Spark `StringType`.
- Additional conversion logic may be required for complex types (e.g., `map`, `list`, `set`).
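When planning a target schema, the common mappings can be captured in a simple lookup table. The sketch below reflects typical connector behaviour rather than a guaranteed contract; the helper name, the Spark SQL type strings, and the fallback choice are illustrative assumptions.

```python
# Typical Cassandra -> Spark SQL type mappings. Actual inference is done by
# the connector at read time, so treat this table as planning documentation.
CASSANDRA_TO_SPARK = {
    "uuid": "string",          # Spark has no native UUID type
    "text": "string",
    "varchar": "string",
    "int": "int",
    "bigint": "bigint",
    "timestamp": "timestamp",
    "boolean": "boolean",
    "double": "double",
    "list<text>": "array<string>",
    "map<text, int>": "map<string, int>",
    "set<text>": "array<string>",  # Spark has no set type; sets arrive as arrays
}

def spark_type_for(cassandra_type: str) -> str:
    """Look up the Spark SQL type for a Cassandra column type."""
    return CASSANDRA_TO_SPARK.get(cassandra_type.lower(), "string")  # assumed fallback

print(spark_type_for("timestamp"))  # timestamp
print(spark_type_for("set<text>"))  # array<string>
```

Falling back to `string` for unknown types is a conservative default that avoids load failures at the cost of deferring conversion to the transformation stage.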
Pattern for Loading Data: Initial and Incremental Loads
The recommended approach for importing data is to perform an initial (once-off) bulk load, followed by subsequent incremental loads. Both patterns use the same Database Query component, differing primarily in their SQL `WHERE` clause usage.
Initial Load (Full Table)
- The Database Query component queries the entire table, without a filter clause.
- Example:
SELECT * FROM keyspace_name.table_name;
Incremental Load (Filtered by Change Tracking)
- The Database Query component uses a `WHERE` clause to select only records that have changed or been created since the last successfully processed value.
- A typical filter for time-based incremental loading:

```sql
SELECT * FROM keyspace_name.table_name WHERE last_modified > '2024-06-21 00:00:00';
```

- The field used in the filter (e.g., `last_modified`) should be indexed or part of the primary key to maintain performance.
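The incremental pattern amounts to tracking a high-water mark between runs and substituting it into the filter. A minimal Python sketch follows; the table and column names are placeholders, and in Matillion the watermark would typically be held in a pipeline variable rather than in Python.

```python
from datetime import datetime

def incremental_query(table: str, watermark_col: str, last_seen: datetime) -> str:
    """Build the CQL that selects rows changed since the previous run's high-water mark."""
    ts = last_seen.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} WHERE {watermark_col} > '{ts}';"

# Example: next run picks up everything after the last successfully loaded timestamp
q = incremental_query("keyspace_name.table_name", "last_modified",
                      datetime(2024, 6, 21, 0, 0, 0))
print(q)  # SELECT * FROM keyspace_name.table_name WHERE last_modified > '2024-06-21 00:00:00';
```

After each successful run, the stored watermark is advanced to the maximum `last_modified` value loaded, so the following run retrieves only newer rows.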
Read more about incremental load data replication strategies at Matillion Exchange: Incremental Load.
Data Integration Architecture
Loading data in advance of integration is an effective "divide and conquer" strategy that separates data loading from transformation and integration, turning the process into two manageable steps. This staged approach is a fundamental advantage of the ELT (Extract, Load, Transform) architecture. Data integration always requires some degree of transformation to harmonize, cleanse, or enrich incoming datasets. The most robust and adaptable way to accomplish this is by using data transformation pipelines, which define, automate, and orchestrate transformation logic systematically. A further benefit of the ELT architecture is that these transformation and integration tasks occur directly within the target Databricks database. As a result, data processing becomes fast, on-demand, and highly scalable, leveraging Databricks’ native compute power and eliminating the need—and cost—for additional external data processing infrastructure.