Extracting data from Apache Druid to Databricks
Extracting data from Apache Druid is a valuable capability for organizations seeking to combine Druid's high-performance analytics with the scalability and processing power of Databricks. Moving data between these two platforms opens opportunities for advanced analytics, machine learning, and broader operational use. However, the process involves several important steps, from configuring user access and ensuring network connectivity through to designing robust, efficient data extraction pipelines.

In this article, we will guide you through the practical steps necessary to extract data from Apache Druid and load it into Databricks. We will cover the following key topics:

- **Creating an identity in Apache Druid:** Setting up the necessary user accounts and permissions to enable secure data access.
- **For Matillion users, checking or acquiring the JDBC driver:** Ensuring you have the correct Druid JDBC driver configured in Matillion to facilitate data extraction.
- **Network connectivity from source to target:** Verifying and establishing reliable network connections between the Druid instance and the Databricks environment.
- **Querying data, initially and incrementally:** Crafting initial full loads as well as incremental queries to efficiently migrate and synchronize data.

By the end of this article, you will have a clear and actionable understanding of how to set up and maintain a data extraction workflow from Apache Druid to Databricks, equipping your team for enhanced analytics and data-driven decision-making.
What is Apache Druid?
Apache Druid is a high-performance, real-time analytics database optimized for fast OLAP queries on large data sets. Using column-oriented storage, distributed computing, and advanced indexing (inverted indices, bitmap filtering), it delivers low-latency aggregation and filtering over billions of rows. Druid's architecture features scalable node types (data, query, and master nodes) for fault tolerance and resource efficiency. It ingests both batch and streaming data (e.g., from Apache Kafka and Amazon S3), making it ideal for time-series, event-driven, and user analytics that need sub-second query responses, even under heavy concurrent workloads.
What is Databricks?
The Databricks database is part of the Databricks Lakehouse Platform, unifying data engineering, science, and analytics. It uses Delta Lake, providing ACID transactions, scalable metadata, and support for both batch and streaming on cloud storage solutions like AWS S3, Azure Data Lake, and Google Cloud. Databricks databases serve as schema containers for organizing tables, views, and artifacts, allowing users to define structure and access data via SQL, Python, R, or Scala. Integration with Databricks Unity Catalog enables centralized governance, fine-grained access control, and data lineage, making it ideal for secure, scalable, and collaborative data solutions.
Why Move Data from Apache Druid into Databricks
Unlocking Analytical Power: Copying Data from Apache Druid to Databricks
Apache Druid often holds data that is potentially valuable, such as real-time event streams or analytical aggregates, which can serve as key inputs for broader data analysis initiatives. However, to fully realize this value, it is frequently necessary to integrate Druid’s data with other sources—such as enterprise data warehouses, external APIs, or historical datasets. By copying data from Apache Druid into Databricks, data engineers and architects can leverage Databricks’ scalable platform for complex data integration, transformation, and advanced analytics, all while avoiding the risk of overburdening the Druid cluster itself. This separation of workloads ensures that Druid remains optimized for fast ingestion and low-latency querying, while Databricks handles the heavier, more resource-intensive processing required to blend and analyze data from multiple systems.
Creating an Identity in Apache Druid
Apache Druid handles user authentication and authorization through its security extensions. To create a user (called an "Identity"), you must have enabled and configured the "Basic Security" extension. These instructions assume the Basic Security extension is in use and you have admin access.
Step 1: Activate the Basic Security Extensions
Ensure you have enabled the following extension in your Druid `druid.extensions.loadList` configuration:

```json
["druid-basic-security"]
```

Restart your Druid cluster components after making configuration changes.
Step 2: Use the API to Create a User
Druid provides a REST API to manage users. Use a tool such as `curl` to send commands to the Coordinator or Overlord node, depending on your cluster setup.
Create a User
Replace `<user_name>` with the desired username and `<authenticator_name>` with the name of your configured Basic authenticator:

```bash
curl -u admin:admin_password -X POST \
  http://<COORDINATOR_HOST>:<PORT>/druid-ext/basic-security/authentication/db/<authenticator_name>/users/<user_name>
```
Note:

- The default port is 8081 for the Coordinator.
- Ensure that `admin:admin_password` has sufficient privileges to create users.
Set the User Password
```bash
curl -u admin:admin_password -X POST \
  -H 'Content-Type: application/json' \
  -d '{"password":"<user_password>"}' \
  http://<COORDINATOR_HOST>:<PORT>/druid-ext/basic-security/authentication/db/<authenticator_name>/users/<user_name>/credentials
```

This sets the password for your new user.
Step 3: (Optional) Assign Roles and Permissions
- Create a Role (if it does not exist):

  ```bash
  curl -u admin:admin_password -X POST \
    http://<COORDINATOR_HOST>:<PORT>/druid-ext/basic-security/authorization/db/<authorizer_name>/roles/<role_name>
  ```

- Assign Permissions to the Role: see the Druid security documentation for how to specify resources and actions (e.g., read/write on datasources).

- Assign the Role to the User:

  ```bash
  curl -u admin:admin_password -X POST \
    http://<COORDINATOR_HOST>:<PORT>/druid-ext/basic-security/authorization/db/<authorizer_name>/users/<user_name>/roles/<role_name>
  ```
Additional Notes
- All API requests should be performed over HTTPS in a production environment.
- Replace `<COORDINATOR_HOST>` and `<PORT>` according to your deployment.
- For full details, review Druid's security documentation.

Apache Druid does not support conventional SQL `CREATE USER` statements; user management is handled through its REST API when basic security is enabled.
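To confirm the new identity works end to end, you can issue a test query against Druid's SQL endpoint using basic authentication. The sketch below uses only the Python standard library; the URL, username, and password are placeholders for your own deployment, not values from this article's cluster:

```python
import base64
import json
import urllib.request

# Hypothetical endpoint -- replace with your Router/Broker address.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

def build_sql_request(url: str, sql: str, user: str, password: str) -> urllib.request.Request:
    """Build an authenticated POST request for Druid's SQL API."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    body = json.dumps({"query": sql}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )

req = build_sql_request(DRUID_SQL_URL, "SELECT 1", "new_user", "new_password")
# urllib.request.urlopen(req)  # uncomment to send against a live cluster
```

If the credentials are wrong or the role lacks read permission on the datasource, Druid returns an HTTP 401/403, which is a quick way to validate the role assignments made above.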
Installing the JDBC driver
At the time of writing, the JDBC driver for Apache Druid is not bundled by default with Matillion Data Productivity Cloud. This is due to licensing or redistribution restrictions that require customers to download and install the driver manually. The following instructions will guide you through obtaining and uploading the JDBC driver for use with your Matillion workflows.
Downloading the JDBC Driver
- Obtain the Driver

  Visit the official Apache Druid SQL documentation at https://druid.apache.org/docs/latest/querying/sql.html. Look for the download link for the JDBC driver on the page, and in particular select a Type 4 JDBC driver (also known as a "pure Java" driver) for maximum compatibility and straightforward deployment.

- Download and Extract

  Download the JAR file for the driver to your local machine. If the file is compressed (as a .zip or .tar.gz), extract it to obtain the actual `.jar` JDBC driver file.
Uploading the Driver to Matillion Data Productivity Cloud
To enable Matillion to connect to Apache Druid via JDBC, you need to upload the `.jar` driver file to your Matillion agent:
- Refer to Matillion Documentation

  Detailed instructions for installing external JDBC drivers can be found in the Matillion Data Productivity Cloud documentation: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

- Follow the Steps on the Documentation Page:

  - Access your Matillion Agent server as described in the documentation.
  - Locate the correct directory for uploading external JDBC drivers.
  - Place the downloaded Druid JDBC `.jar` file into that directory.
  - Restart the relevant agent or service to load the new driver, if required by the platform.
Configuring and Using the Driver in Matillion
After uploading the Apache Druid JDBC driver, refer to the Matillion usage guide for setup details and to begin connecting to your Druid data source:
- Usage instructions are available here:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
From here, you can configure new database endpoints, create queries, and integrate Druid data as part of your Matillion workflows.
Checking network connectivity
To enable secure and reliable network connectivity between Matillion Data Productivity Cloud and your Apache Druid database, you must ensure that Apache Druid is configured to accept incoming connections based on your deployment type:
- Full SaaS Agent Configuration:

  Allow incoming connections from the Matillion Data Productivity Cloud SaaS agents. You will need to add the appropriate IP addresses to your firewall or access control list on the Apache Druid side. The latest list of IP addresses to permit can be found at https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/.

- Hybrid SaaS Deployment:

  In a Hybrid SaaS deployment, ensure that your Apache Druid instance allows network access from your own Virtual Private Cloud (VPC) where the Matillion Hybrid SaaS agent is running. You can use the utilities and information at https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/ to help verify and troubleshoot network access from your VPC to Apache Druid.
Additionally, if your Apache Druid database hostname is referenced using a DNS name, make sure that the relevant agent (Full SaaS or Hybrid SaaS) has network access to resolve this DNS address. This typically means the agent must be able to reach the DNS servers responsible for your database's domain.
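A quick way to verify both DNS resolution and TCP reachability from the agent's environment is a short script like the one below. It is a minimal sketch using only the Python standard library; the host and port in the example call are hypothetical (substitute your Druid Router or Broker address):

```python
import socket

def check_druid_reachable(host: str, port: int, timeout: float = 5.0) -> tuple[bool, str]:
    """Check that a Druid host resolves in DNS and accepts TCP connections."""
    try:
        ip = socket.gethostbyname(host)  # DNS resolution step
    except socket.gaierror as exc:
        return False, f"DNS resolution failed: {exc}"
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True, f"Connected to {host} ({ip}) on port {port}"
    except OSError as exc:
        return False, f"TCP connection failed: {exc}"

# Hypothetical host/port -- replace with your own deployment's values.
ok, detail = check_druid_reachable("localhost", 8888, timeout=1.0)
print(detail)
```

A failed DNS step points at resolver configuration, while a failed TCP step points at firewalls or security groups, so the two-stage check narrows down where connectivity is broken.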
Querying Data from Apache Druid
Apache Druid supports querying data using standard SQL, enabling seamless integration for data analysis and transformation workflows. Below are instructions and examples illustrating how to query Druid, recommended best practices for loading data, and considerations when pairing Druid with Databricks.
Example: SQL SELECT Statements for Apache Druid
Here are some standard SQL queries you might run against a Druid datasource:
```
-- Retrieve all rows from a datasource
SELECT * FROM "wiki" LIMIT 100;

-- Retrieve aggregated event count per day
SELECT FLOOR(__time TO DAY) AS day, COUNT(*) AS event_count
FROM "wiki"
GROUP BY 1
ORDER BY day DESC;

-- Filter by dimension and aggregate
SELECT page, COUNT(*) AS edits
FROM "wiki"
WHERE "countryName" = 'United States'
GROUP BY page
ORDER BY edits DESC
LIMIT 10;
```
Note:
"wiki" in this example is a Druid datasource (similar to a table).
Datatype Conversions Between Apache Druid and Databricks
When extracting data from Druid to load into Databricks, be aware of datatype conversions:
| Druid SQL Type | Typical Databricks Spark Type | Notes |
|---|---|---|
| STRING | STRING | Compatible |
| LONG, INTEGER | BIGINT, INTEGER | Compatible |
| FLOAT, DOUBLE | FLOAT, DOUBLE | Compatible |
| __time (TIMESTAMP) | TIMESTAMP | Use explicit casts if needed |
| ARRAY, MAP (beta/limited) | ARRAY, MAP (Databricks DataFrames) | Check whether serialization/flattening is needed |
| NULL | NULL | Compatible |
Always verify column datatypes after loading and perform explicit casting in your queries if necessary.
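As a rough illustration, the mapping above can be expressed as a small lookup used when generating Databricks DDL from a Druid schema. The mapping dictionary and the `generate_ddl` helper are hypothetical conveniences, not part of either product's API:

```python
# Illustrative Druid SQL -> Databricks SQL type mapping (see table above).
DRUID_TO_DATABRICKS = {
    "STRING": "STRING",
    "LONG": "BIGINT",
    "INTEGER": "INT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "TIMESTAMP": "TIMESTAMP",
}

def generate_ddl(table: str, columns: dict) -> str:
    """Build a CREATE TABLE statement from a {column_name: druid_type} dict.

    Unknown Druid types fall back to STRING; verify and cast explicitly after loading.
    """
    cols = ",\n  ".join(
        f"{name} {DRUID_TO_DATABRICKS.get(dtype, 'STRING')}"
        for name, dtype in columns.items()
    )
    return f"CREATE TABLE {table} (\n  {cols}\n)"

print(generate_ddl("wiki", {"__time": "TIMESTAMP", "page": "STRING", "added": "LONG"}))
```

Even with a mapping like this, run a post-load check of the resulting column types, since defaulting unknown types to STRING hides mismatches rather than fixing them.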
Data Loading Pattern: Initial Load Followed by Incremental Loads
The recommended approach for syncing data from Druid uses two load strategies:
1. Initial Load (Full Load)

   - Use the Database Query component (or equivalent access method).
   - Run a SELECT statement with no filter clause. Example: `SELECT * FROM "wiki"`
   - This loads all existing data at once.

2. Incremental Load

   - Use the same Database Query component.
   - Add an appropriate filter clause to load only new/changed data. Example, using a timestamp column (usually `__time`): `SELECT * FROM "wiki" WHERE "__time" > TIMESTAMP '2024-06-01 00:00:00'`
   - Incremental conditions should reference a column that identifies new/modified records. Typically, `__time` (`TIMESTAMP`) is used.
Read more on incremental load and replication strategies at the Matillion Exchange Article: Incremental Load Data Replication Strategy.
Tip: Use the same query structure for both initial and incremental loads, varying only the presence of the filtering clause in your `WHERE` statement. This promotes reusable, maintainable pipelines.
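This pattern can be sketched as a small helper that emits the same SELECT for both load types, appending a `WHERE` clause only when a high-water mark from the previous run is known. The function and variable names here are illustrative, not a Matillion or Druid API:

```python
from typing import Optional

def build_extract_query(datasource: str, watermark: Optional[str] = None) -> str:
    """Return a Druid SQL query: full load if no watermark, incremental otherwise."""
    query = f'SELECT * FROM "{datasource}"'
    if watermark is not None:
        # Only rows newer than the last successfully loaded __time value.
        query += f" WHERE \"__time\" > TIMESTAMP '{watermark}'"
    return query

print(build_extract_query("wiki"))                         # initial load
print(build_extract_query("wiki", "2024-06-01 00:00:00"))  # incremental load
```

In practice the watermark would be the maximum `__time` value recorded after each successful load, stored in a control table or pipeline variable, so each run picks up exactly where the last one finished.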
Data Integration Architecture
Loading data in advance of integration, as practiced in Extract, Load, Transform (ELT) architectures, allows organizations to divide and conquer the data pipeline by splitting it into two distinct phases: first loading raw data into the target system, then performing integration and transformation as a separate step. This approach streamlines the workflow and adds flexibility in data processing. Since data integration inherently requires transforming data to align disparate sources, transformation pipelines (automated sequences of tasks) provide the most effective and reliable way to execute these operations at scale.

Furthermore, a key advantage of ELT architecture within Databricks lies in conducting all transformation and integration activities inside the Databricks database itself. Processing can then leverage the native scalability and speed of the Databricks platform, allowing for on-demand performance without the cost or complexity of maintaining external processing infrastructure.