Extracting data from Apache Druid to Snowflake
Extracting data from Apache Druid is a common requirement for organizations seeking to integrate real-time analytics with cloud-based data warehousing solutions such as Snowflake. Moving data between these systems allows users to take advantage of Snowflake’s advanced analytics, business intelligence integrations, and scalable storage, while still benefiting from Druid’s fast ingest and query capabilities. In this article, we will provide a step-by-step guide for exporting data from Apache Druid and loading it into Snowflake. The process involves several key steps. First, we will describe how to create or configure an identity in Apache Druid to ensure secure connectivity. Next, for users of Matillion—the popular ETL tool—we will outline how to check for, or acquire, the correct JDBC driver required to connect to Druid. We will then address the essential network configuration needed to facilitate smooth data transfer between Druid and Snowflake. Finally, we will discuss best practices for querying data, covering both the initial full extraction and subsequent incremental updates. Whether you are migrating data for the first time or looking to automate ongoing data pipelines, this article aims to provide you with practical guidance for a streamlined and reliable extraction process.
What is Apache Druid?
Apache Druid is a high-performance, distributed analytics database designed for fast, sub-second OLAP queries on large-scale, real-time, and historical data. Its column-oriented storage format, combined with advanced indexing, enables efficient interactive exploration, aggregation, and filtering of multi-dimensional data. Druid's architecture separates ingestion, query, and storage layers, supporting both streaming and batch data ingestion from sources such as Apache Kafka and Hadoop. Engineered for scalability and fault-tolerance, Druid can handle petabyte-scale datasets across clusters, making it a popular choice for powering analytics applications, dashboards, and user-facing data experiences where low-latency query performance is critical.
What is Snowflake?
Snowflake is a cloud-native data platform renowned for its scalable, multi-cluster architecture and separation of storage and compute, which enables dynamic resource allocation and high concurrency for diverse workloads. Built to leverage the elasticity of AWS, Azure, and Google Cloud, Snowflake manages structured and semi-structured data with support for ANSI SQL, and it seamlessly ingests data formats such as JSON, Avro, and Parquet. Unlike traditional databases, Snowflake’s architecture allows instant provisioning of virtual warehouses, enabling concurrent processing without contention. Its robust security features, automatic scaling, metadata management, and built-in failover make it a preferred choice for data warehousing, analytics, data sharing, and data lake scenarios in modern enterprises.
Why Move Data from Apache Druid into Snowflake
Unlocking Analytical Insights: The Case for Copying Data from Apache Druid to Snowflake
A data engineer or architect may consider copying data from Apache Druid into Snowflake for several compelling reasons. First, Apache Druid typically contains valuable, high-granularity event or time-series data that can provide critical insights when analyzed effectively. By integrating this data with other sources residing in Snowflake, such as relational databases, application logs, or third-party datasets, organizations can derive richer, more comprehensive analytics and unlock new business value. Moreover, performing this integration natively within Snowflake avoids placing additional analytical query workloads on the Druid cluster itself, thereby ensuring Druid’s performance remains optimized for its primary use case—fast, real-time data ingestion and exploration. This architectural approach enables organizations to fully leverage their existing Druid datasets without compromising on performance or scalability.
Creating a User Identity in Apache Druid
Apache Druid users and roles are managed via the SQL-based security model ("SQL security" introduced in Druid 24.0), which stores identities and permissions in metadata tables accessible through SQL commands. Below are the steps to create a new user identity using Druid's SQL interface.
Prerequisites
- Your Druid cluster security configuration must enable SQL-based security management.
- You must have sufficient privileges (the ADMIN role) to create users.
- The SQL client must be connected to a Coordinator or Broker node.
Step 1: Create a User
To create a user, use the CREATE USER SQL statement. Replace new_user with your desired username:
CREATE USER new_user;
Step 2 (Optional): Create a Role
Although not strictly necessary, defining roles allows for easier permission management:
CREATE ROLE analyst_role;
Step 3 (Optional): Grant a Role to a User
Assign the new (or existing) role to your user:
GRANT analyst_role TO new_user;
Step 4: Grant Privileges to a User or Role
Example: Grant the ability to query all datasources:
GRANT SELECT ON DATASOURCE * TO analyst_role;
-- OR grant privileges directly to the user:
GRANT SELECT ON DATASOURCE * TO new_user;
Step 5: Review Existing Users
To list existing users in the system:
SELECT * FROM sys.user_privileges;
Note: All user, role, and privilege definitions are stored in Druid’s druid_system schema.
Further Reference
For production deployments, consider also configuring authentication (e.g., Basic Authentication or LDAP) and linking stored identities to external authentication mechanisms. User creation alone does not set a password—see Druid’s Basic Auth Extension documentation for user authentication and credential management.
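If you do configure the Basic Security extension, users and their passwords are managed through a REST API on the Coordinator rather than through SQL. As a rough sketch of that flow, the snippet below creates a user and then assigns it a password; the Coordinator URL and authenticator name are assumptions, so substitute the values from your own cluster configuration.

```python
import json
import urllib.request

# Hypothetical values: replace with your Coordinator URL and the name of
# the Basic Authenticator configured in your cluster.
COORDINATOR = "http://coordinator:8081"
AUTHENTICATOR = "MyBasicMetadataAuthenticator"


def user_url(base, authenticator, user):
    """Build the Basic Security API path for managing a single user."""
    return f"{base}/druid-ext/basic-security/authentication/db/{authenticator}/users/{user}"


def create_user_with_password(base, authenticator, user, password, admin_auth):
    """Create a user, then set its credentials, via the Coordinator API."""
    # Step 1: create the user (empty POST body).
    req = urllib.request.Request(user_url(base, authenticator, user),
                                 data=b"", method="POST")
    req.add_header("Authorization", admin_auth)
    urllib.request.urlopen(req)
    # Step 2: assign a password; 'iterations' controls the hashing cost.
    body = json.dumps({"password": password, "iterations": 5000}).encode()
    req = urllib.request.Request(user_url(base, authenticator, user) + "/credentials",
                                 data=body, method="POST")
    req.add_header("Authorization", admin_auth)
    req.add_header("Content-Type", "application/json")
    urllib.request.urlopen(req)
```

The admin_auth argument is whatever Authorization header value your admin identity requires (for Basic auth, a base64-encoded username:password pair).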
Installing the JDBC Driver
The JDBC driver for Apache Druid enables connectivity between Matillion Data Productivity Cloud and an Apache Druid database. Please be aware that, at the time of writing, this driver is not bundled with Matillion Data Productivity Cloud by default. This is due to licensing and redistribution restrictions. As a result, you will need to manually download and install the appropriate JDBC driver before you can use Druid as a data source.
1. Download the Apache Druid JDBC Driver
- Visit the official Apache Druid documentation site: https://druid.apache.org/docs/latest/querying/sql.html
- Note that Druid’s JDBC connectivity is provided by the Apache Calcite Avatica driver; follow the links on that page to download the Avatica driver version appropriate for your Druid release.
- Avatica is a Type 4 (pure-Java) JDBC driver, which is typically preferred for ease of configuration and broad compatibility with Java applications.
2. Install the Driver into Matillion Data Productivity Cloud
Because the Druid driver is not present by default, it must be uploaded manually to your Matillion Data Productivity Cloud environment:
- Review the driver upload process as detailed in the Matillion documentation: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- Follow the outlined steps to upload the downloaded Druid JDBC driver (.jar file) through the Matillion web interface or agent, as appropriate for your deployment.
3. Configure Usage
Once the Druid JDBC driver is uploaded and recognized by Matillion Data Productivity Cloud, you can proceed to create and configure database query components to interact with your Druid instance:
- Refer to the following documentation for using JDBC connections in queries: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
Following these steps ensures that you have installed the necessary JDBC driver and are prepared to work with Apache Druid as a source within your Matillion Data Productivity Cloud projects.
Checking network connectivity
To enable proper communication between Matillion Data Productivity Cloud and your Apache Druid database, you must ensure that the database allows incoming connections from the appropriate source, based on your deployment configuration:
- Full SaaS agent configuration: Make sure your Apache Druid database allows incoming connections from the static IP addresses documented here. These IP addresses should be whitelisted in your network security configuration (such as firewall or security group settings) to permit access by Matillion’s managed service.
- Hybrid SaaS deployment: Allow incoming connections from the IP address ranges associated with your own virtual private cloud (VPC). This ensures that the agent running within your VPC can connect to your Apache Druid instance. You can find useful utilities to help check and troubleshoot network access on the Matillion Exchange.
- DNS resolution: If your Apache Druid database is referenced by a DNS hostname rather than a direct IP address, ensure the relevant Matillion agent, in either Full SaaS or Hybrid SaaS configuration, can resolve that hostname. If necessary, verify network and DNS settings to avoid connectivity issues.
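A quick pre-flight check of both DNS resolution and TCP reachability can save debugging time before configuring the agent. The sketch below uses only the Python standard library; the hostname and port are placeholders (8082 is the default Druid Broker port), so substitute your own values.

```python
import socket


def check_connectivity(host, port, timeout=5.0):
    """Resolve a hostname and attempt a TCP connection, reporting which
    stage (DNS or network) fails, if any."""
    try:
        addrs = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return f"DNS resolution failed for {host}: {exc}"
    resolved = sorted({a[4][0] for a in addrs})
    try:
        # If the TCP handshake succeeds, the port is reachable.
        with socket.create_connection((host, port), timeout=timeout):
            return f"OK: {host} resolved to {resolved}, port {port} reachable"
    except OSError as exc:
        return f"Resolved {host} to {resolved}, but port {port} unreachable: {exc}"


# Example (hypothetical hostname):
# print(check_connectivity("druid-broker.internal", 8082))
```

Run this from the same network location as the Matillion agent, since that is the path the actual data transfer will take.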
Querying Data from an Apache Druid Database
Apache Druid supports SQL-based querying, making it accessible to users familiar with relational databases. Below are instructions and examples for querying data, points about datatype conversions between Druid and Snowflake, and recommended patterns for data loading.
1. Example SQL SELECT Statements for Druid
Apache Druid's SQL interface allows for querying with standard SQL. Here are some common examples:
Basic SELECT
SELECT * FROM my_datasource;
Filtering with WHERE
SELECT timestamp, user_id, purchase_amount FROM my_datasource WHERE purchase_amount > 100;
Aggregations and GROUP BY
SELECT user_id, COUNT(*) AS purchase_count, SUM(purchase_amount) AS total_spent FROM my_datasource GROUP BY user_id;
Time-based Filtering
SELECT * FROM my_datasource WHERE __time >= TIMESTAMP '2024-06-01 00:00:00' AND __time < TIMESTAMP '2024-07-01 00:00:00';
Limit Results
SELECT * FROM my_datasource ORDER BY __time DESC LIMIT 10;
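Outside of a JDBC client, the same queries can be issued over Druid's SQL HTTP endpoint (POST /druid/v2/sql/ on a Broker or Router), which is handy for ad hoc verification of what an extraction will return. The broker URL and datasource name below are assumptions; adjust them to your cluster.

```python
import json
import urllib.request


def build_sql_request(sql, result_format="object"):
    """Build the JSON body Druid's SQL endpoint expects."""
    return json.dumps({"query": sql, "resultFormat": result_format}).encode()


def druid_sql(broker_url, sql):
    """POST a SQL query to Druid's SQL HTTP endpoint and return parsed rows.
    broker_url should point at a Broker (default port 8082) or Router."""
    req = urllib.request.Request(
        broker_url + "/druid/v2/sql/",
        data=build_sql_request(sql),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example (requires a reachable cluster):
# rows = druid_sql("http://broker:8082", "SELECT * FROM my_datasource LIMIT 10")
```

With resultFormat set to "object", each row comes back as a JSON object keyed by column name, which maps cleanly onto downstream staging logic.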
2. Datatype Conversion Between Druid and Snowflake
When moving data between Apache Druid and Snowflake, be aware of datatype conversions and incompatibilities. Notable considerations:
- TIMESTAMP: Druid stores timestamps in UTC in the __time column. Make sure to adjust for time zone conversions when writing to/from Snowflake.
- VARCHAR (STRING in Druid): String data is compatible, but Snowflake supports larger string sizes.
- DOUBLE/FLOAT: Floating point types are supported by both, but ensure precision isn't lost.
- ARRAY, MAP, COMPLEX TYPES: Druid supports nested data; flatten these when loading into a Snowflake table.
Review the Snowflake-Druid data type mapping documentation during migration or ETL design.
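Two of the conversions above, flattening nested values and normalizing UTC timestamps, can be illustrated with a small sketch. The helper names and the nested-column shape are illustrative, not part of any Druid or Snowflake API.

```python
from datetime import datetime, timezone


def flatten(record, prefix=""):
    """Flatten nested dicts (e.g. Druid complex/nested columns) into flat
    columns suitable for a Snowflake table:
    {"geo": {"city": "NY"}} -> {"geo_city": "NY"}."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=name + "_"))
        else:
            out[name] = value
    return out


def parse_druid_time(value):
    """Druid serializes __time as an ISO-8601 UTC string; parse it into a
    timezone-aware datetime so it loads correctly into Snowflake's
    TIMESTAMP_TZ without an implicit local-time shift."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
```

Keeping timestamps timezone-aware end to end avoids the classic off-by-N-hours bug when the ETL host's local timezone differs from UTC.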
3. Initial and Incremental Data Loads with the Same Query Pattern
Best Practice: To keep your target database synchronized, use a one-off initial load followed by faster, smaller incremental loads. Use the same "Database Query" component and adapt only the filter clause.
Initial Load
- Description: Extracts the entire dataset with no filter.
- SQL Example:
SELECT *
FROM my_datasource
- This loads all records into your target system (e.g., Snowflake).
Incremental Load
- Description: Retrieves only new or modified records since the last load using a filter, typically on a time or incrementing key column.
- SQL Example (incremental, using timestamp):
SELECT *
FROM my_datasource
WHERE __time > TIMESTAMP '2024-06-10 14:00:00'
- The filter tracks what has already been loaded.
- Adapt the timestamp or key value with each run to fetch only the latest changes.
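The pattern above amounts to maintaining a high-watermark value between runs. As a minimal sketch (the function names and watermark storage are assumptions; in practice the watermark would be persisted in a control table or pipeline variable):

```python
def incremental_query(datasource, watermark):
    """Build the extraction SQL: a full load when no watermark exists yet,
    otherwise only rows newer than the last loaded __time."""
    sql = f"SELECT * FROM {datasource}"
    if watermark is not None:
        sql += f" WHERE __time > TIMESTAMP '{watermark}'"
    return sql


def advance_watermark(rows, current):
    """After a successful load, move the watermark to the newest __time seen.
    ISO-style 'YYYY-MM-DD hh:mm:ss' strings compare correctly as text."""
    times = [row["__time"] for row in rows]
    return max(times + ([current] if current else []))
```

The first run (watermark of None) produces the initial full load; every later run reuses the same query component with only the filter value changed.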
For deeper guidance, consult the Matillion article on incremental load strategy.
Note: Always ensure the datatype mapping and timezone handling are properly dealt with during extract, transform, and load operations between Druid and Snowflake.
Data Integration Architecture
Loading data in advance of integration provides a practical way to divide and conquer complex integration tasks by splitting the process into two distinct steps: first, data extraction and loading, and then data transformation and integration. This approach exemplifies the advantages of the ELT (Extract, Load, Transform) architecture, where data is ingested into the Snowflake database before any transformation occurs. Data integration inherently requires extensive data transformation to align disparate data sources, standardize formats, and prepare data for analysis. The most effective way to accomplish these transformations is through data transformation pipelines, which orchestrate and automate the required processing steps. A notable benefit of the ELT model is that all data transformations and integration occur directly within the Snowflake database. This in-database processing ensures rapid, on-demand scalability and eliminates the need for separate data processing infrastructure, allowing organizations to contain costs and reduce operational complexity.