Integrate data from Apache Druid to Databricks using Matillion

Our Apache Druid to Databricks connector enables seamless, rapid data migration to Databricks, ensuring your information remains current without requiring manual coding or intricate ETL processes.


Extracting data from Apache Druid to Databricks

Extracting data from Apache Druid is a valuable capability for organizations seeking to combine Druid's high-performance analytics with the scalability and processing power of Databricks. Moving data between these two platforms opens opportunities for advanced analytics, machine learning, and broader operational use. However, this process involves several important steps, from configuring user access to ensuring network connectivity, all the way through to designing robust and efficient data extraction pipelines.

In this article, we will guide you through the practical steps necessary to extract data from Apache Druid and load it into Databricks. We will cover the following key topics:

- **Creating an identity in Apache Druid:** Setting up the necessary user accounts and permissions to enable secure data access.
- **For Matillion users, checking or acquiring the JDBC driver:** Ensuring you have the correct Druid JDBC driver configured in Matillion to facilitate data extraction.
- **Network connectivity from source to target:** Verifying and establishing reliable network connections between the Druid instance and the Databricks environment.
- **Querying data, initially and incrementally:** Crafting initial full loads as well as incremental queries to efficiently migrate and synchronize data.

By the end of this article, you will have a clear and actionable understanding of how to set up and maintain a data extraction workflow from Apache Druid to Databricks, equipping your team for enhanced analytics and data-driven decision-making.


What is Apache Druid?

Apache Druid is a high-performance, real-time analytics database optimized for fast OLAP queries on large data sets. Using column-oriented storage, distributed computing, and advanced indexing (inverted indices, bitmap filtering), it delivers low-latency aggregation and filtering over billions of rows. Druid's architecture features scalable node types—data, query, and master nodes—for fault tolerance and resource efficiency. It ingests both batch and streaming data (e.g., from Apache Kafka and Amazon S3), making it ideal for time-series, event-driven, and user analytics that need sub-second query responses, even under heavy concurrent workloads.


What is Databricks?

The Databricks database is part of the Databricks Lakehouse Platform, unifying data engineering, science, and analytics. It uses Delta Lake, providing ACID transactions, scalable metadata, and support for both batch and streaming on cloud storage solutions like AWS S3, Azure Data Lake, and Google Cloud. Databricks databases serve as schema containers for organizing tables, views, and artifacts, allowing users to define structure and access data via SQL, Python, R, or Scala. Integration with Databricks Unity Catalog enables centralized governance, fine-grained access control, and data lineage, making it ideal for secure, scalable, and collaborative data solutions.

Why Move Data from Apache Druid into Databricks

Unlocking Analytical Power: Copying Data from Apache Druid to Databricks

Apache Druid often holds data that is potentially valuable, such as real-time event streams or analytical aggregates, which can serve as key inputs for broader data analysis initiatives. However, to fully realize this value, it is frequently necessary to integrate Druid’s data with other sources—such as enterprise data warehouses, external APIs, or historical datasets. By copying data from Apache Druid into Databricks, data engineers and architects can leverage Databricks’ scalable platform for complex data integration, transformation, and advanced analytics, all while avoiding the risk of overburdening the Druid cluster itself. This separation of workloads ensures that Druid remains optimized for fast ingestion and low-latency querying, while Databricks handles the heavier, more resource-intensive processing required to blend and analyze data from multiple systems.

Creating an Identity in Apache Druid

Apache Druid handles user authentication and authorization through its security extensions. To create a user (called an "Identity"), you must have enabled and configured the "Basic Security" extension. These instructions assume the Basic Security extension is in use and you have admin access.

Step 1: Activate the Basic Security Extensions

Ensure you have enabled the following extension in your Druid `druid.extensions.loadList` configuration:

```json
["druid-basic-security"]
```
Restart your Druid cluster components after making configuration changes.
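Enabling the extension alone is not sufficient: an authenticator and an authorizer must also be configured in `common.runtime.properties`. The following is a minimal sketch based on Druid's basic-security documentation; the names `MyBasicMetadataAuthenticator` and `MyBasicMetadataAuthorizer` are placeholders, the passwords are examples only, and the metadata-store credentials validator is assumed:

```properties
druid.auth.authenticatorChain=["MyBasicMetadataAuthenticator"]
druid.auth.authenticator.MyBasicMetadataAuthenticator.type=basic
druid.auth.authenticator.MyBasicMetadataAuthenticator.initialAdminPassword=admin_password
druid.auth.authenticator.MyBasicMetadataAuthenticator.initialInternalClientPassword=internal_password
druid.auth.authenticator.MyBasicMetadataAuthenticator.credentialsValidator.type=metadata
druid.auth.authenticator.MyBasicMetadataAuthenticator.authorizerName=MyBasicMetadataAuthorizer
druid.auth.authorizers=["MyBasicMetadataAuthorizer"]
druid.auth.authorizer.MyBasicMetadataAuthorizer.type=basic
```

Consult Druid's security documentation for the full set of required properties (including escalator configuration) for your deployment.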

Step 2: Use the API to Create a User

Druid provides a REST API to manage users. Use a tool such as `curl` to send commands to the Coordinator or Overlord node, depending on your cluster setup.

Create a User

Replace `<user_name>` with the desired username and `<AUTHENTICATOR_NAME>` with the name of your configured basic authenticator:

```bash
curl -u admin:admin_password -X POST \
  http://<COORDINATOR_HOST>:<PORT>/druid-ext/basic-security/authentication/db/<AUTHENTICATOR_NAME>/users/<user_name>
```

Note:
  • The default port for the Coordinator is `8081`.
  • Ensure that `admin:admin_password` has sufficient privileges to create users.

Set the User Password

```bash
curl -u admin:admin_password -X POST \
  -H 'Content-Type: application/json' \
  -d '{"password":"<user_password>"}' \
  http://<COORDINATOR_HOST>:<PORT>/druid-ext/basic-security/authentication/db/<AUTHENTICATOR_NAME>/users/<user_name>/credentials
```

This sets the password for your new user.

Step 3: (Optional) Assign Roles and Permissions

  1. Create a Role (if it does not exist), replacing `<AUTHORIZER_NAME>` with the name of your configured basic authorizer:

    ```bash
    curl -u admin:admin_password -X POST \
      http://<COORDINATOR_HOST>:<PORT>/druid-ext/basic-security/authorization/db/<AUTHORIZER_NAME>/roles/<role_name>
    ```

  2. Assign Permissions to the Role: see the Druid security documentation for how to specify resources and actions (e.g., read/write on datasources).

  3. Assign the Role to the User:

    ```bash
    curl -u admin:admin_password -X POST \
      http://<COORDINATOR_HOST>:<PORT>/druid-ext/basic-security/authorization/db/<AUTHORIZER_NAME>/users/<user_name>/roles/<role_name>
    ```

Additional Notes

  • All API requests should be performed over HTTPS in a production environment.
  • Replace `<COORDINATOR_HOST>` and `<PORT>` according to your deployment.
  • For full details, review Druid's security documentation.

Apache Druid does not support conventional SQL `CREATE USER` statements; user management is handled through its REST API when basic security is enabled.
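The REST calls above can also be scripted. The following Python sketch wraps user creation and password assignment using only the standard library; the host, port, and authenticator name are placeholders you must adapt to your deployment:

```python
import base64
import json
import urllib.request

BASE = "http://localhost:8081"  # Coordinator host:port (placeholder)
AUTHENTICATOR = "MyBasicMetadataAuthenticator"  # your configured authenticator name

def user_url(name: str) -> str:
    """Endpoint for managing a single Basic Security user."""
    return f"{BASE}/druid-ext/basic-security/authentication/db/{AUTHENTICATOR}/users/{name}"

def basic_auth_header(user: str, password: str) -> str:
    """Encode credentials for HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

def create_user(name: str, password: str, admin: str, admin_pw: str) -> None:
    """Create the user, then set its password in a second call."""
    auth = {"Authorization": basic_auth_header(admin, admin_pw)}
    urllib.request.urlopen(
        urllib.request.Request(user_url(name), method="POST", headers=auth))
    body = json.dumps({"password": password}).encode()
    urllib.request.urlopen(urllib.request.Request(
        user_url(name) + "/credentials", data=body, method="POST",
        headers={**auth, "Content-Type": "application/json"}))
```

A scripted approach like this is easier to keep in version control and re-run idempotently than ad hoc `curl` commands.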

Installing the JDBC driver

At the time of writing, the JDBC driver for Apache Druid is not bundled by default with Matillion Data Productivity Cloud. This is due to licensing or redistribution restrictions that require customers to download and install the driver manually. The following instructions will guide you through obtaining and uploading the JDBC driver for use with your Matillion workflows.

Downloading the JDBC Driver

  1. Obtain the Driver
    Visit the official Apache Druid SQL documentation at https://druid.apache.org/docs/latest/querying/sql.html. Druid's SQL interface is served over JDBC via the Apache Calcite Avatica driver; choose a Type 4 (pure-Java) driver for maximum compatibility and straightforward deployment.

  2. Download and Extract
    Download the JAR file for the driver to your local machine. If the file is compressed (as a .zip or .tar.gz), extract it to obtain the actual `.jar` JDBC driver file.

Uploading the Driver to Matillion Data Productivity Cloud

To enable Matillion to connect to Apache Druid via JDBC, you need to upload the `.jar` driver file to your Matillion agent:

  1. Refer to Matillion Documentation
    Detailed instructions for installing external JDBC drivers can be found in the Matillion Data Productivity Cloud documentation:
    https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

  2. Follow the Steps on the Documentation Page:
    • Access your Matillion Agent server as described in the documentation.
    • Locate the correct directory for uploading external JDBC drivers.
    • Place the downloaded Druid JDBC `.jar` file into that directory.
    • Restart the relevant agent or service to load the new driver, if required by the platform.

Configuring and Using the Driver in Matillion

After uploading the Apache Druid JDBC driver, refer to the Matillion usage guide for setup details and to begin connecting to your Druid data source:

From here, you can configure new database endpoints, create queries, and integrate Druid data as part of your Matillion workflows.

Checking network connectivity

To enable secure and reliable network connectivity between Matillion Data Productivity Cloud and your Apache Druid database, you must ensure that Apache Druid is configured to accept incoming connections from the Matillion agent, according to your deployment type (Full SaaS or Hybrid SaaS).

Additionally, if your Apache Druid database hostname is referenced using a DNS name, make sure that the relevant agent (Full SaaS or Hybrid SaaS) has network access to resolve this DNS address. This typically means the agent must be able to reach the DNS servers responsible for your database's domain.
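As a quick sanity check before configuring the pipeline, you can verify DNS resolution and TCP reachability from the agent's host. A minimal Python sketch (the hostname and port below are placeholders for your Druid endpoint):

```python
import socket

def can_resolve(host: str) -> bool:
    """Return True if the local resolver can translate the hostname to an IP."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Placeholders: replace with your Druid Broker/Coordinator host and port.
    print(can_resolve("druid.example.internal"))
    print(can_connect("druid.example.internal", 8082))
```

If resolution fails but connectivity to the raw IP succeeds, the problem is DNS reachability from the agent rather than firewalling.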

Querying Data from Apache Druid

Apache Druid supports querying data using standard SQL, enabling seamless integration for data analysis and transformation workflows. Below are instructions and examples illustrating how to query Druid, recommended best practices for loading data, and considerations when pairing Druid with Databricks.


Example: SQL SELECT Statements for Apache Druid

Here are some standard SQL queries you might run against a Druid datasource:

```sql
-- Retrieve all rows from a datasource
SELECT * FROM "wiki" LIMIT 100;

-- Retrieve aggregated event count per day
SELECT FLOOR(__time TO DAY) AS day, COUNT(*) AS event_count
FROM "wiki"
GROUP BY 1
ORDER BY day DESC;

-- Filter by dimension and aggregate
SELECT page, COUNT(*) AS edits
FROM "wiki"
WHERE "countryName" = 'United States'
GROUP BY page
ORDER BY edits DESC
LIMIT 10;
```

Note:
"wiki" in this example is a Druid datasource (similar to a table).


Datatype Conversions Between Apache Druid and Databricks

When extracting data from Druid to load into Databricks, be aware of datatype conversions:

| Druid SQL Type | Typical Databricks Spark Type | Notes |
| --- | --- | --- |
| `STRING` | `STRING` | Compatible |
| `LONG`, `INTEGER` | `BIGINT`, `INTEGER` | Compatible |
| `FLOAT`, `DOUBLE` | `FLOAT`, `DOUBLE` | Compatible |
| `__time` (`TIMESTAMP`) | `TIMESTAMP` | Use explicit casts if needed |
| `ARRAY`, `MAP` (beta/limited) | `ARRAY`, `MAP` (Databricks DataFrames) | Check whether serialization/flattening is needed |
| `NULL` | `NULL` | Compatible |

Always verify column datatypes after loading and perform explicit casting in your queries if necessary.
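The table above can be encoded as a small lookup when generating target DDL. A minimal sketch; the mapping choices (e.g., `LONG` to `BIGINT`) follow the table, and the default-to-`STRING` fallback is an assumption you may want to tighten:

```python
# Illustrative Druid-to-Spark type mapping based on the table above.
DRUID_TO_SPARK = {
    "STRING": "STRING",
    "LONG": "BIGINT",
    "INTEGER": "INTEGER",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "TIMESTAMP": "TIMESTAMP",  # e.g. the __time column
}

def spark_type(druid_type: str) -> str:
    """Map a Druid SQL type to a Databricks/Spark SQL type, defaulting to STRING."""
    return DRUID_TO_SPARK.get(druid_type.upper(), "STRING")

def ddl_for(table: str, columns: dict) -> str:
    """Render a CREATE TABLE statement from {column: druid_type} pairs."""
    cols = ", ".join(f"{name} {spark_type(t)}" for name, t in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols})"
```

For example, `ddl_for("wiki", {"page": "STRING", "added": "LONG"})` renders a two-column table with `added` typed as `BIGINT`.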


Data Loading Pattern: Initial Load Followed by Incremental Loads

The recommended approach for syncing data from Druid uses two load strategies:

1. Initial Load (Full Load)

  • Use the Database Query component (or equivalent access method).
  • Run a SELECT statement with no filter clause, for example:
    `SELECT * FROM "wiki"`
  • This loads all existing data at once.

2. Incremental Load

  • Use the same Database Query component.
  • Add a filter clause so that only new or changed data is loaded, for example using a timestamp column (usually `__time`):
    `SELECT * FROM "wiki" WHERE "__time" > TIMESTAMP '2024-06-01 00:00:00'`
  • Incremental conditions should reference a column that identifies new or modified records; typically `__time` (`TIMESTAMP`) is used.

Read more on incremental load and replication strategies at the Matillion Exchange Article: Incremental Load Data Replication Strategy.


Tip: Use the same query structure for both initial and incremental loads, varying only the presence of the filtering clause in your `WHERE` statement. This promotes reusable, maintainable pipelines.

Data Integration Architecture

Loading data in advance of integration, as practiced in Extract, Load, Transform (ELT) architectures, allows organizations to divide and conquer the data pipeline by splitting it into two distinct phases: first loading raw data into the target system, and then performing integration and transformation as a separate step. This approach not only streamlines the workflow but also adds flexibility to data processing. Since data integration inherently requires transforming data to align disparate sources, transformation pipelines (automated sequences of tasks) provide the most effective and reliable way to execute these operations at scale.

Furthermore, a key advantage of ELT architecture within Databricks lies in conducting all transformation and integration activities inside the Databricks database itself. Processing can therefore leverage the native scalability and speed of the Databricks platform, delivering on-demand performance without the cost or complexity of maintaining external processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.