Integrate data from QuestDB to Databricks using Matillion

The QuestDB to Databricks connector enables seamless data transfer to Databricks within minutes and ensures your data remains up to date, all without the need for manual coding or complex ETL scripting.


Extracting data from QuestDB to Databricks

Extracting data from QuestDB is a crucial step for organizations seeking to integrate fast, real-time analytics from time-series data with powerful cloud-based platforms like Databricks. This article provides a practical guide for users aiming to seamlessly move data from QuestDB into Databricks, ensuring efficient and secure workflows. We will begin by explaining how to create an identity in QuestDB, establishing proper access control for data extraction. For Matillion users, we will detail the steps required to check for, or acquire, the appropriate JDBC driver that enables connectivity.

Next, the article will address vital considerations for securing and testing network connectivity between QuestDB (the source) and Databricks (the target), ensuring that data can move smoothly between environments. Finally, we will cover methods to query and extract data from QuestDB, both for an initial full load and for ongoing incremental loads, helping you keep your analytics platform up to date with minimal effort. Whether you are new to data integration or looking to enhance existing workflows, this article will provide the foundational steps for extracting and loading data efficiently from QuestDB to Databricks.


What is QuestDB?

QuestDB is a high-performance, open-source time-series database written in Java for real-time analytics on large datasets. Using a column-oriented storage model, zero-GC design, and vectorized execution, it achieves sub-millisecond query response times on billions of records. QuestDB supports standard SQL with time-series extensions (sampling, interpolation, window functions), and ingests data via the PostgreSQL and InfluxDB Line protocols or HTTP. Optimized for multi-core CPUs, it delivers high throughput, low latency, and strong scalability. Suitable for finance, IoT telemetry, and monitoring, QuestDB efficiently handles demanding workloads with advanced memory management and parallel execution.


What is Databricks?

Databricks delivers a robust, cloud-native data platform based on Apache Spark. Its Databricks Lakehouse Platform merges data warehouse reliability with the flexibility and cost-efficiency of data lakes. Through Delta Lake, it offers ACID transactions, schema enforcement, and granular governance, efficiently managing large-scale structured and unstructured data. Databricks supports SQL, Python, Scala, and R, seamlessly integrating with major cloud storage providers. This unifies scalable data engineering, analytics, and AI workflows, streamlining data pipelines, machine learning, and collaborative analytics within a single workspace.

Why Move Data from QuestDB into Databricks

Unlocking Analytical Power: Copying Data from QuestDB to Databricks

A data engineer or architect may wish to copy data from QuestDB into Databricks for several compelling reasons. QuestDB is often used to store high-throughput time-series data, and this data may contain significant business value. However, the true potential of this data is frequently realized only when it is integrated with information from other sources, such as operational databases, data lakes, or external APIs. By transferring QuestDB data into Databricks, professionals can leverage its advanced data integration, analytics, and machine learning capabilities without putting additional processing load on the QuestDB instance itself. This approach ensures QuestDB continues to serve real-time workloads at optimal performance, while Databricks provides a scalable environment for deeper analysis and cross-source data enrichment.

Creating an Identity in QuestDB

As of June 2024, QuestDB does not support multi-user authentication or role-based access control at the database level. QuestDB is designed for simplicity and high performance, with the assumption that it is operated in secure, trusted environments. The database does not provide SQL statements such as `CREATE USER`, nor any ability to manage users, roles, or permissions natively within the database itself.

Typical Setup in QuestDB

  • All clients connect to the database with the same level of access.
  • Popular interaction methods: REST API, InfluxDB Line Protocol, Postgres Wire Protocol.
  • Security is typically managed externally, e.g., via:
    • Network policies (firewalls, VPC rules)
    • Application proxying and authentication
    • Container orchestration/service mesh authorization

Sample: Restricting Access at Network Level

To control "who" can access QuestDB, you might restrict incoming connections. For example, a Docker Compose setup might look like this, limiting access to only your subnet:

```yaml
services:
  questdb:
    image: questdb/questdb:latest
    ports:
      - "9000:9000"   # REST & GUI
      - "8812:8812"   # Postgres wire
    networks:
      my_internal_network:
        ipv4_address: 172.28.1.10

networks:
  my_internal_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16
```

_(Only containers in `my_internal_network` can access QuestDB in this setup.)_

Application-Level Authentication

If you require user identities, best practice is:

  • Implement authentication within your own application.
  • Connect to QuestDB from your backend only (not directly from clients).
  • Use application logic to authenticate users and authorize their access.
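As a sketch of this pattern, the backend below authenticates and authorizes each request against a simple in-memory access list before any query reaches QuestDB. The user names, the `ACL` table, and the helper functions are illustrative assumptions, not QuestDB features; QuestDB itself only ever sees the backend's single connection.

```python
# Illustrative backend-side authorization gate in front of QuestDB.
# Users and permissions exist only in the application layer.

# Hypothetical ACL: which application users may read which tables.
ACL = {
    "analyst": {"trades"},
    "admin": {"trades", "orders"},
}

def is_authorized(user: str, table: str) -> bool:
    """Return True if the application user may query the given table."""
    return table in ACL.get(user, set())

def build_query(user: str, table: str) -> str:
    """Authorize first, then produce the SQL the backend would send to QuestDB."""
    if not is_authorized(user, table):
        raise PermissionError(f"{user} may not read {table}")
    return f"SELECT * FROM {table};"
```

In a real deployment the backend would execute the returned SQL over the Postgres wire protocol (port 8812) and stream results to the client; clients never hold a direct connection to QuestDB.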

Monitoring and Logging

While QuestDB does not track users, you can log connection sources at the system (e.g., Docker, OS firewall, load balancer) or application level.


Note: For enterprise-grade authentication, user management, and auditing, consider alternate databases or await future QuestDB roadmap updates, as user security is a frequently requested feature but was not yet available as of the noted date.

Installing the JDBC driver

At the time of writing, the QuestDB JDBC driver is not included by default in Matillion Data Productivity Cloud. This is due to licensing and redistribution restrictions, which prevent the driver from being bundled directly with the platform. Therefore, you must manually obtain and install the driver before you can use QuestDB data sources within Matillion.

To proceed, follow these steps:

  1. Download the QuestDB JDBC Driver
    The official download and reference page for QuestDB’s JDBC driver is available at https://questdb.io/docs/reference/api/jdbc/.
    When choosing a driver, look for the “Type 4” JDBC driver, which is pure Java and does not require any native libraries. This type is best suited for use with Matillion Data Productivity Cloud and is generally the most straightforward for cloud environments.

  2. Prepare for Installation
    Ensure that you have access to the appropriate Matillion environment, with sufficient permissions to upload external drivers. You will typically need the downloaded `.jar` file containing the QuestDB JDBC driver.

  3. Install the JDBC Driver to Matillion Data Productivity Cloud
    Matillion requires user-supplied JDBC drivers to be uploaded manually. The process for uploading and installing external drivers is detailed in the Matillion documentation: Uploading External Drivers.
    Follow the steps provided at the above link to upload the QuestDB driver jar file to your Matillion Agent or Cloud environment.

  4. Using the QuestDB JDBC Driver
    Once the driver is uploaded and available, you can configure and use connection components to interact with QuestDB. For instructions on creating and managing database connections, refer to the usage documentation: Database Query Usage.

By ensuring you download the correct driver type and follow the supported installation method, you will be able to establish connectivity between Matillion Data Productivity Cloud and QuestDB.
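Because QuestDB exposes a Postgres wire protocol endpoint (port 8812 by default), the JDBC connection string typically follows the PostgreSQL URL form. The helper below is a small sketch of that shape; the host name is a placeholder, and the defaults (port `8812`, database `qdb`) assume a stock QuestDB install.

```python
def questdb_jdbc_url(host: str, port: int = 8812, database: str = "qdb") -> str:
    """Build a PostgreSQL-style JDBC URL for QuestDB's Postgres wire endpoint.

    QuestDB speaks the Postgres wire protocol, so PostgreSQL-compatible
    JDBC drivers connect with a URL of this shape.
    """
    return f"jdbc:postgresql://{host}:{port}/{database}"
```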

Checking network connectivity

To enable successful integration between Matillion Data Productivity Cloud and your QuestDB database, you must ensure that the QuestDB instance allows incoming network connections from the appropriate sources. The configuration depends on your deployment type:

  • Full SaaS Agent Configuration:
    You must allow incoming connections from the set of Matillion-owned IP addresses. The definitive list of these IP addresses is maintained at this documentation page. Update your QuestDB network or firewall rules to permit connections from these addresses.

  • Hybrid SaaS Deployment:
    In this setup, you must permit incoming connections from your own virtual private cloud (VPC), where your Matillion agent is running. To help identify your agent’s egress IP and verify connectivity, you can use the network utilities available on the Matillion Exchange Network Check page.

Additionally, if your QuestDB database connection uses a hostname (DNS), rather than a direct IP address, you must ensure that the Matillion Full SaaS or Hybrid SaaS agent can resolve this name using DNS. This may involve configuring appropriate DNS routing or private zones depending on your network setup.
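Before wiring up the pipeline, it can help to verify both conditions (DNS resolution and TCP reachability) from wherever the agent runs. The following stdlib-only sketch checks each independently; the host name and port are placeholders for your QuestDB endpoint.

```python
import socket

def resolve_host(host: str):
    """Return the IP addresses a hostname resolves to, or None if DNS fails."""
    try:
        infos = socket.getaddrinfo(host, None)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return None

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP handshake to host:port (e.g. 8812, QuestDB's Postgres wire port)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `resolve_host` returns `None`, fix DNS routing first; if resolution works but `can_connect` fails, the problem is a firewall or network-policy rule.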

Querying Data from a QuestDB Database

This guide details how to query data from a QuestDB database, and covers best practices for combining initial and incremental data loads, especially when integrating with platforms like Databricks. See the "Incremental Load Data Replication Strategy" for additional context.


Example QuestDB SQL Queries

You can run standard SQL queries against QuestDB.

Query all rows and columns in a table:

```sql
SELECT * FROM trades;
```

Apply filters to select specific rows:

```sql
SELECT * FROM trades
WHERE symbol = 'AAPL' AND price > 100;
```

Aggregate data:

```sql
SELECT symbol, AVG(price) AS avg_price
FROM trades
GROUP BY symbol;
```

Get most recent records using QuestDB’s timestamp:

```sql
SELECT * FROM trades
ORDER BY ts DESC
LIMIT 10;
```

Select data within a time range:

```sql
SELECT * FROM trades
WHERE ts >= '2024-06-01T00:00:00.000000Z'
  AND ts < '2024-07-01T00:00:00.000000Z';
```
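Besides the Postgres wire protocol, queries like these can also be issued over QuestDB's HTTP API on port 9000, whose `/exec` endpoint runs a SQL statement and returns JSON. The helper below only builds the request URL (the base URL is a placeholder for your deployment); the SQL must be URL-encoded.

```python
from urllib.parse import urlencode

def questdb_exec_url(base: str, query: str) -> str:
    """Build a URL for QuestDB's HTTP /exec endpoint, which executes a SQL
    query and returns the result set as JSON."""
    return f"{base}/exec?{urlencode({'query': query})}"
```

The resulting URL can be fetched with any HTTP client to spot-check a query before configuring it in a pipeline component.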


Data Type Conversion Considerations

  • When querying from QuestDB and integrating with Databricks, be aware of automatic data type conversion that can occur between the two platforms.
    • Example: QuestDB’s `TIMESTAMP` may map to Spark’s `TIMESTAMP` type.
    • Some numeric precision or representation (e.g., `SYMBOL`, `FLOAT`) may differ. Always check and test your pipeline to avoid type mismatches.
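One way to make these conversions explicit is to keep a small, reviewable mapping table in the pipeline. The mapping below is an illustrative assumption, not an authoritative list; verify each entry against what your pipeline actually produces.

```python
# Illustrative (not exhaustive or authoritative) mapping of QuestDB column
# types to Spark SQL types; always verify against your pipeline's output.
QUESTDB_TO_SPARK = {
    "TIMESTAMP": "TIMESTAMP",
    "SYMBOL": "STRING",   # SYMBOL is an interned string; Spark has no equivalent
    "FLOAT": "FLOAT",     # 32-bit in both, but confirm precision expectations
    "DOUBLE": "DOUBLE",
    "LONG": "BIGINT",
    "INT": "INT",
    "BOOLEAN": "BOOLEAN",
    "STRING": "STRING",
}

def map_type(questdb_type: str) -> str:
    """Return the assumed Spark SQL type for a QuestDB type, defaulting to STRING."""
    return QUESTDB_TO_SPARK.get(questdb_type.upper(), "STRING")
```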

Initial and Incremental Load Pattern

The most robust pattern for extracting data from QuestDB—especially for replication into platforms like Databricks—is to use an initial load for historical data followed by incremental loads for new data.

  • Both use the same Database Query component for consistency.
  • The initial load retrieves all records without a filter.
  • Incremental loads apply a filter (typically on a monotonically increasing column like `ts` or an `id`).

1. Initial Load

No filter clause:

```sql
SELECT * FROM trades;
-- Fetches all records from the "trades" table
```

2. Incremental Load

With a filtering clause, fetch only new records past a tracked point (e.g., the last processed timestamp or ID):

```sql
SELECT * FROM trades
WHERE ts > '2024-06-21T00:00:00.000000Z';
```

_(Replace the timestamp with your most recent loaded value.)_

For more information, refer to Matillion's Incremental Load Data Replication Strategy.


Tip: Always track the last processed value (an incremental key such as `ts` or a primary `id`) and parameterize your queries for each incremental run for reliable, idempotent replication.
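The pattern above can be sketched as a small watermark helper: with no stored watermark it emits the initial full-load query, otherwise the filtered incremental query. The table and column names come from the examples above; in practice the watermark would live in a pipeline variable or state table rather than a function argument.

```python
from typing import Optional

def build_extract_query(table: str, ts_column: str, last_value: Optional[str]) -> str:
    """Return the full-load query when no watermark exists, otherwise an
    incremental query that fetches only rows past the watermark."""
    if last_value is None:
        return f"SELECT * FROM {table};"
    return f"SELECT * FROM {table} WHERE {ts_column} > '{last_value}';"
```

After each successful run, store the maximum `ts` value loaded and pass it as `last_value` on the next run; re-running with the same watermark re-fetches the same window, which keeps the replication idempotent.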

Data Integration Architecture

Loading data in advance of integration serves as an effective "divide and conquer" strategy by splitting the overall project into two distinct steps—a fundamental advantage of the Extract, Load, Transform (ELT) architecture. With ELT, data is first loaded into the Databricks environment before any transformation takes place, thereby simplifying management and streamlining workflows. Data integration inherently involves transforming datasets to suit analytic or operational needs, and the most robust method for accomplishing this is through dedicated data transformation pipelines. These pipelines automate and orchestrate processes such as data cleansing, enrichment, and standardization. Additionally, by performing all transformations and integration inside the Databricks database itself, the ELT pattern ensures that operations are fast, scalable, and available on demand. This in-database processing eliminates the need for separate data transformation infrastructure, providing substantial cost savings and operational efficiency.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.