Extracting data from Datastax Enterprise to Databricks
Extracting data from Datastax Enterprise is an essential step for organisations seeking to unlock the full value of their data through advanced analytics and machine learning. By efficiently moving data into Databricks, teams can leverage a scalable and collaborative platform for processing and analysis. This article provides a practical guide to establishing a reliable data integration pipeline from Datastax Enterprise to Databricks. We begin by detailing the creation of a dedicated identity within Datastax Enterprise to ensure secure and manageable access. For users who work with Matillion for data orchestration, we will review the process of checking for – or acquiring – the appropriate JDBC driver, which is required for seamless connectivity. Next, we examine the network considerations necessary to enable secure, performant data movement between Datastax Enterprise and Databricks. Finally, we discuss techniques for querying data, addressing both the initial extraction and subsequent incremental updates to ensure the pipeline remains efficient and up to date. Whether you are setting up a migration for the first time or refining your organisation’s data integration workflows, the following steps provide a secure and robust foundation for extracting data from Datastax Enterprise and loading it into Databricks.
What is Datastax Enterprise?
DataStax Enterprise (DSE) is a distributed, highly scalable database platform built on Apache Cassandra, designed for handling large-scale, mission-critical data workloads with high availability and no single point of failure. DSE enhances Cassandra’s core features with advanced capabilities such as integrated search (via Apache Solr), in-memory computing, operational analytics (through Apache Spark), and enterprise-grade security. It offers support for multi-model data, seamless multi-cloud and hybrid deployments, and unified management tools, making it suitable for use cases requiring low latency, high throughput, and robust fault tolerance. Additionally, DataStax provides enterprise support, automated management, and performance optimization features, enabling organizations to deploy resilient and scalable real-time applications efficiently.
What is Databricks?
Databricks Lakehouse, or Databricks SQL, is a unified analytics platform built on Apache Spark and optimized for the cloud. It merges the governance and reliability of data warehouses with the scalability of data lakes. Using the open-source Delta Lake format, Databricks ensures ACID transactions, scalable metadata, and schema enforcement on AWS S3, Azure, and Google Cloud. The platform supports SQL, Python, Scala, and R, facilitating interactive and automated analytics for diverse datasets. With strong security, machine learning support, and broad integration capabilities, Databricks provides a collaborative environment for modern, end-to-end data analytics and lakehouse implementations.
Why Move Data from Datastax Enterprise into Databricks
Unlocking Advanced Analytics: The Benefits of Copying Data from Datastax Enterprise to Databricks
A data engineer or architect may wish to copy data from Datastax Enterprise into Databricks for several compelling reasons. Datastax Enterprise often stores large volumes of operational and transactional data that are rich in business value, but this value is fully realized only when such data can be integrated with information from other sources across the organization. By moving data into Databricks, teams can leverage its advanced analytics and machine learning capabilities to combine and analyze datasets from disparate origins, uncovering deeper insights that would not be accessible in isolation. Furthermore, using Databricks to handle computationally intensive analytics and integration tasks ensures that the primary Datastax environment remains dedicated to its core operational role, thereby preventing analytic workloads from impacting the performance or reliability of the transactional system. This approach enables organizations to maximize the value of their data assets while maintaining system stability and scalability.
Creating a User in Datastax Enterprise
To create a user (an identity) in Datastax Enterprise, you must have sufficient privileges, such as `CREATE USER`. The following steps assume you are already connected to your Datastax cluster using cqlsh (the Cassandra Query Language Shell) or another suitable CQL client, and that `PasswordAuthenticator` is enabled.
1. Connect to cqlsh
Open your terminal and connect to the cluster:
```bash
cqlsh -u <admin_user> -p <admin_password> <host>
```
Replace `<admin_user>`, `<admin_password>`, and `<host>` with the appropriate values.
2. Create a New User
Use the `CREATE USER` CQL statement to create a new user with a password:
```
CREATE USER username WITH PASSWORD 'plain_text_password' [SUPERUSER | NOSUPERUSER];
```
- `username`: The desired username for the new user.
- `'plain_text_password'`: The password (in plain text) you are setting for this user.
- `SUPERUSER` (optional): Grants superuser privileges to the user.
- `NOSUPERUSER` (optional): Ensures the user does not have superuser privileges.
Example
```
CREATE USER analyst_user WITH PASSWORD 'str0ngP@ssw0rd' NOSUPERUSER;
```
This command creates a new user named `analyst_user` with the password `str0ngP@ssw0rd`, without superuser privileges.
3. Grant Permissions (Optional)
After creating the user, you may need to grant appropriate permissions. For example, to grant SELECT permission on a keyspace:
```
GRANT SELECT ON KEYSPACE my_keyspace TO analyst_user;
```
4. Verify the User
To list all users in the cluster, execute:
```
LIST USERS;
```
Notes
- Passwords must meet any password complexity requirements defined by your organization.
- Do not use weak or default passwords in a production environment.
- Only superusers can create or drop other users.
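The steps above can also be scripted. The sketch below is a minimal, hypothetical helper that assembles the CQL statements shown in this section; in practice you would execute the returned statements through a CQL client such as the DataStax Python driver, and the function name and signature here are illustrative assumptions, not part of any official API.

```python
def build_user_cql(username, password, superuser=False, keyspace=None):
    """Assemble the CQL statements for creating a user and, optionally,
    granting SELECT on a keyspace.

    Escapes single quotes in the password by doubling them, per CQL
    string-literal rules. Returns a list of statements to run through
    any CQL client.
    """
    escaped = password.replace("'", "''")  # CQL escapes ' by doubling it
    role = "SUPERUSER" if superuser else "NOSUPERUSER"
    statements = [f"CREATE USER {username} WITH PASSWORD '{escaped}' {role};"]
    if keyspace:
        statements.append(f"GRANT SELECT ON KEYSPACE {keyspace} TO {username};")
    return statements
```

For example, `build_user_cql("analyst_user", "str0ngP@ssw0rd", keyspace="my_keyspace")` reproduces the `CREATE USER` and `GRANT SELECT` statements used earlier in this section.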
Installing the JDBC driver
At the time of writing, the official Datastax Enterprise JDBC driver is not bundled by default with Matillion Data Productivity Cloud due to licensing or redistribution restrictions. This guide will walk you through downloading the appropriate driver and integrating it with your Matillion environment.
Step 1: Locate and Download the JDBC Driver
- Navigate to the Datastax drivers download page: https://downloads.datastax.com/#drivers.
- On the downloads page, look specifically for the Type 4 JDBC driver for Datastax Enterprise, as this ensures direct connectivity from the platform without dependence on external applications or native libraries.
- Download the JDBC driver JAR file to your local system, ensuring that you have the correct version that aligns both with your Datastax Enterprise deployment and any compatibility requirements outlined by Matillion.
Step 2: Install the JDBC Driver into Matillion Data Productivity Cloud
Matillion Data Productivity Cloud allows the use of external JDBC drivers by uploading the JAR file to the Agent. Detailed installation steps are provided in the Matillion documentation:
Uploading External Drivers to Matillion Agent
- Follow the documented process on how to upload and register an external driver JAR using the Matillion Agent interface.
- Ensure that you have the necessary permissions to manage external drivers in your Matillion environment.
- After uploading, verify that the driver appears in the external drivers list and is ready for use with database query orchestration or transformation components.
Step 3: Configure Usage in Database Query Components
Once the driver is installed, you can connect to Datastax Enterprise via Matillion’s Database Query components. Usage guidance is explained in the Database Query documentation:
Instructions for Using Database Query
- Ensure you reference the uploaded Datastax JDBC driver when configuring your connection.
- Provide the necessary connection information, such as driver class name, JDBC URL, credentials, and any additional properties required for secure and reliable communication with your Datastax Enterprise cluster.
By completing these steps, you will enable connectivity between Matillion Data Productivity Cloud and your Datastax Enterprise environment using the officially supported JDBC driver.
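To make the connection properties concrete, here is a hypothetical sketch that assembles a JDBC URL from its parts. The `jdbc:cassandra://` scheme, default port 9042, and query-string property syntax are assumptions for illustration only; the exact URL format and driver class name vary by driver vendor, so check your driver's documentation before configuring the component.

```python
def build_jdbc_url(host, port=9042, keyspace=None, **props):
    """Assemble a Cassandra-style JDBC URL.

    The scheme and property syntax are driver-dependent placeholders;
    consult your JDBC driver's documentation for the real format.
    """
    url = f"jdbc:cassandra://{host}:{port}"
    if keyspace:
        url += f"/{keyspace}"
    if props:
        # Sort properties so the output is deterministic
        url += "?" + "&".join(f"{k}={v}" for k, v in sorted(props.items()))
    return url
```

A call such as `build_jdbc_url("dse.example.com", keyspace="my_keyspace", localdatacenter="dc1")` shows the kind of value you would paste into the component's JDBC URL field, alongside the driver class name and credentials.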
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your Datastax Enterprise database, you must verify that the database allows incoming connections from the appropriate sources, based on your deployment configuration:
Full SaaS Agent Configuration:
If you’re using a Full SaaS agent, you must allow incoming connections from the specific IP addresses that the Matillion platform uses. The complete and up-to-date list of these IP addresses can be found here: Matillion Data Productivity Cloud - Allowing IP Addresses. Configure your Datastax Enterprise database firewall or access control rules to permit connections from these IPs.
Hybrid SaaS Deployment:
If your organization has deployed Matillion Data Productivity Cloud using a Hybrid SaaS agent, incoming connections to Datastax Enterprise will originate from your own virtual private cloud (VPC). Therefore, you should ensure that your Datastax database is configured to accept connections from your VPC's IP address range. For tools and guidance to verify network access from your VPC to the database, you can refer to the utilities provided here: Matillion Exchange - Check Network Access.
DNS Resolution Requirement:
If your Datastax Enterprise database is configured with a DNS address (rather than an IP), the Full SaaS or Hybrid SaaS agent must be able to resolve this DNS address successfully. Make sure appropriate DNS records are in place and accessible from the networking environment in which your Matillion agent runs.
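A quick way to confirm the DNS requirement is to attempt resolution from the same network environment as the agent. This is a minimal sketch using Python's standard library; run it wherever your Matillion agent runs.

```python
import socket

def can_resolve(hostname):
    """Return True if this environment can resolve the hostname via DNS."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

If `can_resolve("your-dse-host.example.com")` returns False from the agent's environment, fix the DNS records or resolver configuration before troubleshooting anything else in the connection.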
Querying Data from a Datastax Enterprise Database
Example Datastax Enterprise Queries (CQL SELECT Statements)
To query data from Datastax Enterprise (DSE), use the Cassandra Query Language (CQL). Here are basic examples:
```
-- Select all rows from the 'users' table
SELECT * FROM my_keyspace.users;

-- Select specific columns
SELECT user_id, first_name, last_name FROM my_keyspace.users;

-- Filtered query with a WHERE clause (uuid literals are unquoted in CQL)
SELECT * FROM my_keyspace.users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Range selection with a filter (may require a suitable index or ALLOW FILTERING
-- unless created_at is a clustering column)
SELECT * FROM my_keyspace.orders WHERE created_at >= '2024-06-01' AND created_at < '2024-07-01';
```
Datatype Conversion Considerations
When transferring data between Datastax Enterprise and analytic platforms like Databricks, be aware that datatype conversion may occur. For example:
- `uuid` in DSE may become a `STRING` in Databricks.
- `timestamp` maps to `TIMESTAMP`, but check for time zone compatibility.
- Collections (`set`, `list`, `map`) in DSE may be converted to arrays or maps in Databricks.
- Numeric types (e.g., `bigint`, `decimal`) may be interpreted differently.
Always verify the mapped schema to prevent unwanted data loss or misinterpretation.
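One way to make that verification systematic is to keep a small mapping table and compare it against the schema Databricks actually infers. The mapping below is an illustrative sketch of common defaults, not an authoritative list; the exact conversions depend on your driver and Databricks version, so treat every entry as an assumption to be checked.

```python
# Illustrative DSE -> Databricks SQL type mapping (verify for your versions)
DSE_TO_DATABRICKS = {
    "uuid": "STRING",
    "timestamp": "TIMESTAMP",     # confirm time zone handling (UTC vs session zone)
    "set": "ARRAY",
    "list": "ARRAY",
    "map": "MAP",
    "bigint": "BIGINT",
    "decimal": "DECIMAL(38,18)",  # precision/scale defaults may differ
}

def expected_type(dse_type):
    """Look up the expected Databricks type for a DSE column type.

    Strips any element-type parameters (e.g. 'list<int>' -> 'list') and
    falls back to STRING for unknown types, which is a common lossy default.
    """
    base = dse_type.split("<")[0].strip().lower()
    return DSE_TO_DATABRICKS.get(base, "STRING")
```

Comparing `expected_type(...)` for each source column against `DESCRIBE TABLE` output in Databricks surfaces surprises (e.g. a `decimal` landing with the wrong scale) before they become silent data loss.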
Best Loading Pattern: Initial + Incremental Loads
The recommended strategy is to:
- Perform a one-off initial load: Copy the complete dataset into your target system.
- Set up incremental loads: Periodically pull only new or changed data.
Both loads can use the same Database Query component by changing only the filtering logic.
Initial Load
During the initial load, you typically query all available rows:
```
SELECT * FROM my_keyspace.orders;
```
No filter is applied, so you receive the complete dataset.
Incremental Load
For incremental loads, add a filter to your CQL statement based on time or an incrementing column (often `updated_at` or a numeric `id`):
```
SELECT * FROM my_keyspace.orders WHERE updated_at > '2024-06-14 00:00:00';
```
This loads only records added/modified since the specified timestamp.
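Because the initial and incremental variants differ only in the WHERE clause, one pipeline can serve both by templating the filter on a stored watermark. A minimal sketch, assuming an `updated_at` column and a timestamp watermark carried between runs (the function and parameter names are illustrative, not a Matillion API):

```python
def build_extract_query(table, watermark_column=None, watermark=None):
    """Return the full-load query, or an incremental one when a watermark is given."""
    query = f"SELECT * FROM {table}"
    if watermark_column and watermark is not None:
        query += f" WHERE {watermark_column} > '{watermark}'"
    return query + ";"
```

The first run calls `build_extract_query("my_keyspace.orders")` for the complete dataset; each later run passes the highest `updated_at` value seen so far, reproducing the incremental query above.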
Reference: For more on incremental loading patterns, see Matillion's Incremental Load: Data Replication Strategy.
Note: Queries shown use CQL, which is SQL-like but specific to Datastax Enterprise/Cassandra. Always adapt to your schema and environment.
Data Integration Architecture
Loading data into Databricks in advance of integration exemplifies a “divide and conquer” approach that splits the overall process into two distinct stages: loading and transformation. This two-step method is a fundamental advantage of the ELT (Extract, Load, Transform) architecture, as it separates data ingestion from the operational complexities of data integration. Data integration requires transformation—such as cleansing, enrichment, and validation—to deliver data in a suitable form for analytics and business intelligence. The most effective way to carry out these transformations is through dedicated data transformation pipelines, which orchestrate, automate, and manage complex dataflows. Furthermore, ELT architecture leverages the scalability of the Databricks platform by performing all transformations and integrations natively within the target Databricks database. This reduces latency, enables on-demand processing, and provides near real-time scalability, eliminating the need for separate data processing infrastructure and the associated costs.