Extracting data from Apache Druid to Amazon Redshift
Extracting data from Apache Druid is an essential step for organizations seeking to consolidate analytics or reporting workflows within downstream platforms, such as Amazon Redshift. A well-structured data pipeline allows users to leverage Druid’s high-speed ingestion and real-time capabilities, while benefitting from Redshift’s powerful analytical engine. In this article, we provide a step-by-step guide to transferring data from Apache Druid into Redshift. We begin by detailing how to create an appropriate user identity within Apache Druid to enable secure access. For those using Matillion as their ETL tool, we explain how to check for, or acquire, the necessary JDBC driver for efficient connectivity between platforms. We also address key considerations for establishing robust network connectivity between your Druid and Redshift environments. Finally, we outline best practices for querying data—both for initial full loads and for ongoing incremental updates—ensuring your integration is both accurate and up to date. Whether you are setting up this workflow for the first time or seeking to optimize your existing process, this article equips you with the foundational steps for a successful Druid to Redshift data transfer.
What is Apache Druid?
Apache Druid is a high-performance real-time analytics database designed for fast, scalable querying and ingestion of large, event-driven datasets. It employs a column-oriented storage layout and supports approximate and exact aggregations, filtering, and complex group-bys over petabyte-scale data. Druid architectures typically consist of distinct server types (such as data, query, and master nodes) which allow compute and storage to scale independently. Its indexing and compression mechanisms enable low-latency OLAP workloads, making it well-suited for time-series analysis and interactive dashboards. Druid integrates with various ingestion systems including Kafka, Hadoop, and cloud object stores, and exposes an HTTP/JSON API alongside SQL support that is translated into its native query engine.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse built on PostgreSQL, designed for large-scale storage and complex analytics. It uses columnar storage, data compression, and parallel processing for optimized query performance across massive datasets. Redshift integrates with AWS tools like S3, Athena, and QuickSight, and supports standard SQL, materialized views, and result caching. Its architecture features a leader node distributing queries to compute nodes for scalability and high throughput. Administrators benefit from automated backups, encryption at rest and in transit, and detailed access controls, making Redshift a robust, enterprise-ready analytics solution.
Why Move Data from Apache Druid into Amazon Redshift
Benefits of Transferring Data from Apache Druid to Redshift for Advanced Analytics
A data engineer or architect may wish to copy data from Apache Druid into Amazon Redshift for several compelling reasons. First, Apache Druid often contains valuable, high-granularity event or time-series data that can provide meaningful business insights when properly analyzed. However, the true potential of this data is often realized only when it is integrated with data from other sources—such as customer profiles, transaction records, or external market data—enabling richer analytics and more informed decision-making. By transferring data into Redshift, which is designed for large-scale data warehousing and complex SQL querying, organizations can perform advanced data integration and analytics across multiple datasets without imposing additional query workload or operational complexity on their Apache Druid clusters. This separation of duties helps maintain the performance and responsiveness of Druid for real-time analytics while leveraging Redshift's capabilities for broader and deeper exploration of integrated data.
Creating an Identity in Apache Druid
By default, Apache Druid does not include built-in SQL commands for user management. Instead, authentication and authorization are provided by the druid-basic-security extension, which can also delegate to external identity providers such as LDAP. When basic security is enabled, users (identities), roles, and permissions are managed at runtime through an HTTP API served by the Coordinator.
The instructions below assume that the Basic Security extension is enabled and properly configured on your Druid cluster.
1. Enable Basic Security (if not already done)
Ensure your `conf/druid/_common/common.runtime.properties` contains entries along these lines:

```properties
druid.extensions.loadList=[..., "druid-basic-security", ...]
druid.auth.authenticatorChain=["my-basic-authenticator"]
druid.auth.authenticator.my-basic-authenticator.type=basic
druid.auth.authorizers=["my-basic-authorizer"]
druid.auth.authorizer.my-basic-authorizer.type=basic
```

Replace `"my-basic-authenticator"` and `"my-basic-authorizer"` with your chosen names. Additional properties (initial admin password, credentials validator, escalator settings) are described in the Druid Basic Security documentation.
Restart Druid services after making configuration changes.
2. Create a User
To create a new user (identity), call the basic security API on the Coordinator. You must authenticate as the admin (superuser) account to do this. The user must be created in the authenticator, given credentials, and also registered with the authorizer so roles can be assigned:

```shell
# Create the user 'analyst' in the authenticator
curl -u admin:ADMIN_PASSWORD -X POST \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authentication/db/my-basic-authenticator/users/analyst"

# Set the user's password
curl -u admin:ADMIN_PASSWORD -X POST -H 'Content-Type: application/json' \
  -d '{"password": "SecurePa$$word1", "iterations": 5000}' \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authentication/db/my-basic-authenticator/users/analyst/credentials"

# Register the same identity with the authorizer
curl -u admin:ADMIN_PASSWORD -X POST \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authorization/db/my-basic-authorizer/users/analyst"
```
3. Granting Role(s) to the User
Typically, you will grant privileges via roles. First create a role, then attach permissions to it, and finally assign the role to the user:

```shell
# Create a new role called 'analyst_role'
curl -u admin:ADMIN_PASSWORD -X POST \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authorization/db/my-basic-authorizer/roles/analyst_role"

# Grant READ permission on all datasources to the role
curl -u admin:ADMIN_PASSWORD -X POST -H 'Content-Type: application/json' \
  -d '[{"resource": {"type": "DATASOURCE", "name": ".*"}, "action": "READ"}]' \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authorization/db/my-basic-authorizer/roles/analyst_role/permissions"

# Assign the role to the user
curl -u admin:ADMIN_PASSWORD -X POST \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authorization/db/my-basic-authorizer/users/analyst/roles/analyst_role"
```
4. Listing Users
To view all users known to the authenticator:

```shell
curl -u admin:ADMIN_PASSWORD \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authentication/db/my-basic-authenticator/users"
```
5. Altering or Dropping a User
To change a user's password, POST new credentials for that user:

```shell
curl -u admin:ADMIN_PASSWORD -X POST -H 'Content-Type: application/json' \
  -d '{"password": "NewStrongerPassw0rd2", "iterations": 5000}' \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authentication/db/my-basic-authenticator/users/analyst/credentials"
```

To remove a user, delete it from both the authenticator and the authorizer:

```shell
curl -u admin:ADMIN_PASSWORD -X DELETE \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authentication/db/my-basic-authenticator/users/analyst"
curl -u admin:ADMIN_PASSWORD -X DELETE \
  "http://COORDINATOR_HOST:8081/druid-ext/basic-security/authorization/db/my-basic-authorizer/users/analyst"
```
Notes
- Replace usernames, passwords, and permission patterns to fit your security policy.
- Passwords are hashed using PBKDF2 before being stored.
- Druid stores user and role data in its metadata storage.
- These API calls require admin (superuser) credentials and are served by the Coordinator (port 8081 by default).
- For more complex integration with LDAP, SAML, or other identity backends, consult the Druid Security documentation.
Installing the JDBC driver
To connect Matillion Data Productivity Cloud to an Apache Druid database, you must manually install the JDBC driver. At the time of writing, this driver is not bundled with Matillion Data Productivity Cloud by default due to licensing or redistribution restrictions. The following steps outline how to download and install the correct JDBC driver and configure Matillion for Druid connectivity.
1. Download the Druid JDBC Driver
- Visit the official Apache Druid SQL documentation.
- Locate the section related to JDBC connectivity.
- Look specifically for a Type 4 JDBC driver (a pure Java driver), as this is generally the preferred and most widely supported driver type for third-party integrations like Matillion.
- Download the latest stable version of the JDBC driver (typically provided as a `.jar` file).
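At the time of writing, Druid's SQL JDBC connectivity is provided by the Apache Calcite Avatica driver. For later configuration, the driver class and connection URL take the following shape (host and port are placeholders; 8888 is the default router port):

```
Driver class: org.apache.calcite.avatica.remote.Driver
JDBC URL:     jdbc:avatica:remote:url=http://ROUTER_HOST:8888/druid/v2/sql/avatica/
```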
2. Install the JDBC Driver into Matillion
Once you have the Druid JDBC driver `.jar` file:
- Follow the step-by-step instructions to upload external JDBC drivers to Matillion Data Productivity Cloud found at: Uploading External Drivers
- Ensure the `.jar` file is correctly uploaded and registered, making it available to your Matillion environment.
- You may need to restart the relevant components or agents in Matillion for the new driver to be recognized.
3. Configure and Use the Connection
After installing the driver, refer to the usage instructions under: Database Query Component Usage for guidance on configuring a Druid connection and running queries from within Matillion.
By following the above steps, you can enable product integrations directly with your Druid cluster using the external JDBC driver.
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your Apache Druid database, you must configure your Druid instance to accept incoming connections; the source addresses to allow depend on your deployment type. Ensure the following requirements are met:
- Full SaaS deployment: Allow incoming connections to your Druid database from the IP addresses listed at: Matillion Security - Allowing IP Addresses. Configure your firewall or network access controls to permit these IP addresses.
- Hybrid SaaS deployment: Allow incoming connections from the IP ranges associated with your own virtual private cloud (VPC). If you need to verify connectivity or determine which IP addresses are being used, you can find helpful utilities here: Matillion Exchange - Check Network Access.
Additionally, if your Apache Druid database is being referenced by a DNS name instead of a static IP address, ensure that the Full SaaS or Hybrid SaaS agent has network access to resolve the DNS address. This may require allowing outbound DNS (port 53) access and confirming that the relevant DNS servers are reachable from the agent environment.
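To verify reachability from the agent environment, a short script can confirm that the Druid host name resolves and the SQL port accepts TCP connections. This is a minimal sketch; the host and port are placeholders (Druid brokers commonly listen on 8082, routers on 8888):

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout.

    A DNS resolution failure also returns False, so this checks both
    name resolution and firewall reachability in one call.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical Druid endpoint; replace with your own host and port:
# print(can_reach("druid.example.internal", 8082))
```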
Querying Data from Apache Druid
This guide describes how to query data from an Apache Druid database using SQL, touches on datatype considerations when syncing with Amazon Redshift, and outlines the recommended approach for initial and incremental loads using a Database Query component.
Apache Druid SQL Query Examples
Apache Druid supports SQL for querying data in a familiar way. Here are some example queries:
Example: Select All Data
SELECT * FROM "my_datasource"
Example: Select With Filtering and Aggregation
SELECT "country", COUNT(*) AS "count" FROM "my_datasource" WHERE "__time" >= TIMESTAMP '2024-01-01 00:00:00' GROUP BY "country" ORDER BY "count" DESC
Example: Select Distinct Values
SELECT DISTINCT "user_id" FROM "my_datasource"
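Outside of a JDBC client, the same SQL can also be run against Druid's HTTP SQL API (a POST to `/druid/v2/sql` with a JSON body). The sketch below uses only the Python standard library; the router URL is a placeholder:

```python
import json
import urllib.request

# Hypothetical router endpoint; adjust host and port for your cluster.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

def build_sql_request(query, url=DRUID_SQL_URL):
    """Build a POST request for Druid's SQL API with a JSON body."""
    body = json.dumps({"query": query, "resultFormat": "object"}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

def run_query(query):
    """Execute the query against Druid and return rows as a list of dicts."""
    with urllib.request.urlopen(build_sql_request(query)) as resp:
        return json.loads(resp.read())

# Requires a reachable Druid router:
# rows = run_query('SELECT "country", COUNT(*) AS "count" FROM "my_datasource" GROUP BY 1')
```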
Datatype Conversion: Druid and Redshift
When moving data between Druid and Redshift, be aware that datatypes may not match exactly and implicit conversions might occur. For example:
- Druid numeric types (`DOUBLE`, `FLOAT`, `LONG`) may be loaded into Redshift as `DOUBLE PRECISION`, `REAL`, or `BIGINT`.
- Druid's `STRING` maps to Redshift's `VARCHAR`.
- Druid's `TIMESTAMP` type should load as Redshift's `TIMESTAMP`.
Validate your schema mappings to avoid issues caused by datatype differences.
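One way to make such mappings explicit is to centralize them in pipeline code. The sketch below is illustrative only; the `VARCHAR` length and the fallback type are assumptions to verify against the DDL your pipeline actually generates:

```python
# Illustrative Druid-to-Redshift type mapping (assumed defaults, not
# authoritative -- verify against your driver and target schema).
DRUID_TO_REDSHIFT = {
    "DOUBLE": "DOUBLE PRECISION",
    "FLOAT": "REAL",
    "LONG": "BIGINT",
    "STRING": "VARCHAR(65535)",   # Redshift's maximum VARCHAR length
    "TIMESTAMP": "TIMESTAMP",
}

def redshift_type(druid_type, default="VARCHAR(65535)"):
    """Look up the Redshift column type for a Druid type, defaulting to
    VARCHAR for unmapped types (e.g. COMPLEX sketch columns)."""
    return DRUID_TO_REDSHIFT.get(druid_type.upper(), default)
```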
Initial vs. Incremental Loads Using Database Query
The best practice is to perform:
- Once-off Initial Load
This captures all available records at the current point in time and is usually performed as a full extract with no filters.
```sql
-- Initial load: query everything
SELECT *
FROM "my_datasource"
```
- Incremental Loads
After the initial load, regularly pull only new or updated records by applying a filter condition (commonly on the timestamp or an update marker). This ensures efficient data synchronization.
```sql
-- Incremental load: fetch new data only
SELECT *
FROM "my_datasource"
WHERE "__time" > TIMESTAMP '2024-06-01 00:00:00'
```
You use the same Database Query component for both scenarios; the difference lies in whether a filter clause is provided. For more details, see the Incremental Load Data Replication Strategy.
Note: Always validate your filter logic for incremental loads to prevent missing or duplicating data during syncs.
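Since the initial and incremental queries differ only by the filter clause, the query text can be derived from a single stored high-water mark. A minimal sketch (in practice the watermark would be persisted between runs, for example in a Redshift control table):

```python
def incremental_query(datasource, watermark=None):
    """Return a Druid SQL extract query: unfiltered for the initial load,
    or filtered on __time when a high-water mark is supplied."""
    base = f'SELECT * FROM "{datasource}"'
    if watermark is None:
        return base  # initial load: full extract, no filter
    return base + f" WHERE \"__time\" > TIMESTAMP '{watermark}'"
```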
Data Integration Architecture
Loading data into Amazon Redshift in advance of integration exemplifies the "divide and conquer" approach inherent to an Extract, Load, Transform (ELT) architecture. By first loading raw data batches into Redshift, and then performing integration and transformation as separate, distinct steps, organizations can manage complexity more effectively and improve reliability. Since data integration tasks almost always require transformations (such as combining fields, joining datasets, or reformatting data), the best practice is to handle these requirements in data transformation pipelines, which provide structured, repeatable processes.
A further key advantage of the ELT model is that these transformation and integration routines run entirely within the Redshift environment. This delivers high performance thanks to Redshift's native processing power, and allows transformations to be executed on demand and to scale as needed. Most importantly, these benefits come without the need for additional data processing servers or cloud infrastructure, offering both technical efficiency and significant cost savings.
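The division of steps described above can be sketched as an orchestration skeleton. The helper callables here are hypothetical stand-ins for real extract, load, and transform components (in Matillion these would be pipeline components rather than Python functions):

```python
def elt_run(extract, load, transform_sqls, run_sql):
    """Run an ELT cycle as three distinct phases:
    E: pull rows from the source (Druid),
    L: land the raw batch in the warehouse (Redshift),
    T: run transformation SQL inside the warehouse itself."""
    rows = extract()              # E: extract from the source
    load(rows)                    # L: load raw data, untransformed
    for sql in transform_sqls:    # T: in-database transformations
        run_sql(sql)
```

Keeping the load step free of transformation logic is what lets each phase be retried, audited, and scaled independently.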
Loading data into Amazon Redshift in advance of integration exemplifies the “divide and conquer” approach inherent to an Extract, Load, Transform (ELT) architecture. By first loading raw data batches into Redshift, and then separately performing the integration and transformation as distinct steps, organizations can manage complexity more effectively and improve reliability. Since data integration tasks always require transformations—such as combining fields, joining datasets, or reformatting data—the best practice is to handle these requirements using data transformation pipelines, which provide structured, repeatable processes. A further key advantage of the ELT model is that these transformations and integration routines occur entirely within the Redshift environment. This not only delivers high performance due to Redshift’s native processing power, but also allows transformations to be executed on-demand and to scale as needed. Most importantly, these benefits are achieved without the need for additional data processing servers or cloud infrastructure, offering both technical efficiency and significant cost savings.