Integrate data from Kdb+ (KX Systems) to Databricks using Matillion

Our kdb+ to Databricks connector enables seamless and timely data transfer to Databricks, automatically keeping your data current without the need for manual coding or intricate ETL processes.


Extracting data from Kdb+ (KX Systems) to Databricks

Extracting data from kdb+ is a critical task for organisations seeking to integrate high-performance time-series data with powerful analytics platforms such as Databricks. This article provides a concise guide for users who need to transfer data from kdb+ into Databricks, whether for ad hoc analysis, reporting, or broader data workflows. We will begin by outlining how to create a database identity in kdb+, ensuring proper authentication and authorisation for data access. For those leveraging Matillion as part of their ETL pipeline, we will explain how to check for the presence of the required JDBC driver or, if needed, how to acquire and configure it. Establishing secure and reliable network connectivity between your kdb+ source and Databricks target is another fundamental step—this article will outline best practices for ensuring data can move freely and safely across environments. Finally, we will cover techniques for both initial full data extraction and incremental data loads, supporting efficient and timely data synchronisation. Whether you are new to kdb+ and Databricks integration or seeking to optimise your existing pipelines, this guide aims to provide practical, step-by-step instructions to streamline your workflow.


What is Kdb+ (KX Systems)?

kdb+ is a high-performance, column-oriented database system renowned for its extremely fast handling of time-series data, making it a preferred choice in capital markets and financial analytics. Developed by KX Systems, kdb+ consists of a tightly integrated database and its expressive vector programming language, q, which enables users to write complex queries and analytics with minimal code. Its in-memory and on-disk components facilitate real-time as well as historical data analysis, supporting both OLAP and OLTP workloads. kdb+ distinguishes itself through efficient compression, superior temporal queries, and the ability to process billions of records per second, making it particularly suitable for market data feeds, tick data storage, and algorithmic trading systems.


What is Databricks?

The Databricks lakehouse, accessed through Databricks SQL, is a key part of the Databricks platform, unifying data warehousing and analytics in the cloud. It combines Apache Spark and Delta Lake to deliver transactional support and ACID compliance on cloud object storage. This enables scalable, high-performance SQL queries for workloads ranging from BI to machine learning. Features like automated schema management, partitioning, caching, and data skipping streamline data engineering tasks. Databricks empowers organizations to store, query, and analyze large datasets efficiently while maintaining strong data governance and simplifying advanced analytics workflows.

Why Move Data from Kdb+ (KX Systems) into Databricks

Unlocking Advanced Analytics: The Case for Copying Data from kdb+ to Databricks

A data engineer or architect may wish to copy data from kdb+ into Databricks for several compelling reasons. First, kdb+ often stores high-frequency, time-series data that can be extremely valuable for analytics and decision-making. By integrating this data with other sources in Databricks, organizations can unlock new insights that are not possible when the data remains siloed. Furthermore, conducting complex data transformations and analytics directly within Databricks, rather than on kdb+, allows for scalable processing capabilities without overburdening the kdb+ environment. This approach preserves the performance of kdb+ for its primary transactional and analytical workloads, while leveraging Databricks’ powerful ecosystem to derive greater value from integrated enterprise data.

Creating a User Identity in kdb+

In native kdb+, there is no built-in concept of database-level users or roles as found in traditional RDBMSs (such as Postgres or MySQL). User authentication is instead managed at the process or connection level, typically by configuring the q process to require a username and, optionally, a password.

The following instructions outline the typical approach for setting up user access in a kdb+ environment.


1. Run q With User Authentication

When starting the q process, use the `-u` command-line flag to enforce username/password authentication. The flag takes the path of a password file containing one `usr:pwd` entry per line.

Example:

```bash
q -u users.txt yourdb.q
```

- `users.txt`: the password file listing the permitted users.
- `yourdb.q`: your kdb+ script or data load file.

Multiple Users:
Add one line per user to the file, for example:

```
alice:password1
bob:password2
carol:password3
```
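If you maintain a shared password file, a quick validation step can catch malformed entries before q rejects a connection. The following Python sketch is purely illustrative (it is not part of kdb+ or Matillion) and assumes the plain `user:password` line format shown above:

```python
import re

def validate_users_file(text: str) -> list:
    """Return the usernames found in a kdb+ -u password file.

    Raises ValueError on any non-blank line that is not in user:password form.
    Illustrative helper only; not part of kdb+ itself.
    """
    users = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry:
            continue  # allow blank lines
        if not re.fullmatch(r"[^:\s]+:[^\s]+", entry):
            raise ValueError(f"line {lineno}: expected user:password, got {entry!r}")
        users.append(entry.split(":", 1)[0])
    return users

print(validate_users_file("alice:password1\nbob:password2\ncarol:password3"))
# prints ['alice', 'bob', 'carol']
```

Running this against the file before starting q makes a typo in the file an immediate, visible failure rather than a confusing login error later.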


2. Controlling Access in q Scripts

You can also implement your own logic within q code to restrict functionality based on the authenticated username, which is accessible via

.z.u
within a running session.

Example: Restricting Sensitive Functionality

```q
secureFunction:{[arg]
  / .z.u returns the caller's username as a symbol
  if[not .z.u in `alice`bob; '"Access denied"];
  / function code here
  arg*2
 }
```


3. Using OS-Level and Network Controls

For finer access control:

- Rely on operating system users and permissions (for file system or process execution).
- Restrict which users can start and stop kdb+ processes.
- Limit network access to trusted hosts via firewalls, or bind the listening port to a specific interface (for example, `-p localhost:5001` to accept loopback connections only).


4. Managing Database Identities in Tables (Optional)

If you wish to model application-level users inside the database, define a table to store user metadata.

Example: User Identity Table in qSQL

```q
/ email addresses contain "@", so they are cast to symbols with `$
/ (the example.com addresses are placeholders)
users:([]
  username:`alice`bob`carol;
  email:`$("alice@example.com";"bob@example.com";"carol@example.com");
  role:`admin`analyst`guest
 )
```

You can then write access logic in q functions referencing information from this table.
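The same access logic can live in whichever application layer reads this table. As a hypothetical illustration (the usernames, roles, and example.com addresses are placeholders), a role check outside q might look like:

```python
# Hypothetical application-level view of the users table above.
USERS = {
    "alice": {"email": "alice@example.com", "role": "admin"},
    "bob":   {"email": "bob@example.com",   "role": "analyst"},
    "carol": {"email": "carol@example.com", "role": "guest"},
}

def has_role(username: str, required: str) -> bool:
    """True if the user exists and holds the required role."""
    user = USERS.get(username)
    return user is not None and user["role"] == required

print(has_role("alice", "admin"))  # prints True
print(has_role("carol", "admin"))  # prints False
```

Keeping the user metadata in one table and the checks in small functions makes it straightforward to audit who can reach which operations.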


5. Summary Table of Approaches

| Level | Method |
| --- | --- |
| Connection | q process `-u` flag with a user file |
| In-process | Logic based on `.z.u` (authenticated username) |
| Application | Users/roles table with custom q logic |
| System/Network | OS permissions, firewalls, bind address |

Note:
Credential security must be considered carefully, as kdb+'s native authentication is basic. For production use, especially over public networks, consider additional security measures, such as running behind SSH tunnels or reverse proxies with authentication.

For further details, refer to the official kdb+ process guide.

Installing the JDBC Driver for kdb+ in Matillion Data Productivity Cloud

At the time of writing, the kdb+ JDBC driver is not shipped by default with Matillion Data Productivity Cloud, due to licensing and redistribution restrictions. Therefore, users who wish to connect Matillion to kdb+ must manually download and install the appropriate driver.

To proceed, follow these steps:

1. Download the JDBC Driver
Visit the official KX documentation site to download the JDBC driver for kdb+:
https://code.kx.com/q/interfaces/jdbc/
When selecting a driver, prioritize the Type 4 JDBC driver, as it does not require native libraries or platform-dependent components.

2. Install the Driver in Matillion Data Productivity Cloud
After downloading the driver JAR file, you will need to upload it into your Matillion agent. Matillion provides step-by-step instructions for installing external JDBC drivers at the following URL:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
Review these instructions carefully to ensure the driver is placed in the correct directory and recognized by the platform.

3. Use the Driver in Matillion
Once the driver is installed (and the Matillion agent restarted, if required), you can create and configure database connections to kdb+ using Matillion Designer. Usage guidance, including configuring connections and leveraging database components, is documented here:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
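When configuring the connection you will need a JDBC URL for the kdb+ process. The KX JDBC documentation describes a URL of the form `jdbc:q:host:port`; the small helper below sketches that form, but verify the exact scheme against your driver version before relying on it:

```python
def kdb_jdbc_url(host: str, port: int) -> str:
    """Build a kdb+ JDBC connection URL.

    Assumes the jdbc:q:host:port form described in the KX JDBC docs;
    check your driver's documentation for the scheme it expects.
    """
    if not (0 < port < 65536):
        raise ValueError(f"invalid port: {port}")
    return f"jdbc:q:{host}:{port}"

print(kdb_jdbc_url("kdb-prod.internal", 5001))
# prints jdbc:q:kdb-prod.internal:5001
```

The hostname `kdb-prod.internal` and port `5001` are placeholders; substitute the host and listening port of your own q process.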

Checking network connectivity

Before connecting Matillion Data Productivity Cloud to your kdb+ database, you must ensure that the database is configured to accept incoming network connections. The required configuration depends on your deployment type:

  • Full SaaS agent configuration:
    You must allow incoming connections from the specific IP addresses used by Matillion's SaaS infrastructure. The current list of these IP addresses can be found here: Allowing IP addresses | Matillion Documentation. Update your kdb+ firewall or network access controls to permit connections from these IPs.

  • Hybrid SaaS deployment:
    You should allow incoming connections from your own virtual private cloud (VPC) where the Matillion agent is running. Configure your kdb+ network policies or firewalls to permit traffic from the relevant VPC CIDR blocks. If you need to verify network access, helpful checking utilities are available here: Check Network Access | Matillion Exchange.

Additionally, if your kdb+ database is referenced using a DNS hostname rather than a static IP address, the Matillion agent (whether Full SaaS or Hybrid SaaS) must be able to resolve that DNS name. Ensure the appropriate DNS resolution configuration is in place so that the agent can locate the kdb+ server.
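A quick way to verify both DNS resolution and firewall rules from the machine running the agent is a plain TCP connection test. This Python sketch (the host and port are assumptions for your environment) returns True only when a connection to the kdb+ listener can actually be opened:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection to host:port (e.g. a kdb+ listener).

    Returns False for refused or timed-out connections and for
    hostnames that fail to resolve.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A False result usually points to a missing firewall or security-group rule, or to a DNS name the agent cannot resolve.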

Querying Data from a kdb+ Database

This guide provides instructions for technical users on querying data from kdb+ using SQL SELECT statements, highlights considerations for datatype conversion with Databricks, and outlines best practices for initial and incremental data loads using a Database Query component.


Example SQL SELECT Statements in kdb+

While q is the native language for kdb+, it also supports an SQL-like syntax for querying tables:

```
-- Selecting all columns from the trades table
SELECT * FROM trades

-- Selecting specific columns with a WHERE clause
SELECT sym, price, size FROM trades WHERE price > 100

-- Retrieving data within a certain time range
SELECT * FROM trades WHERE time BETWEEN 09:30:00 AND 16:00:00

-- Selecting with aggregation
SELECT sym, AVG(price) AS avg_price FROM trades GROUP BY sym
```

When using interfaces such as ODBC/JDBC or integration platforms, these queries can be specified as SQL statements.


Note on Datatype Conversion with Databricks

When transferring data between kdb+ and Databricks, datatype conversion may occur due to differences in supported datatypes. For example:

  • kdb+ `float` may be mapped to Databricks `double`
  • kdb+ timestamp types may be converted to the standard Databricks TIMESTAMP
  • Symbol types in kdb+ may be translated to string/varchar in Databricks

Always verify datatype mappings in your integration tool and apply explicit casting if necessary.
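As a rough reference, the mapping below sketches how common kdb+ column types often land in Databricks SQL. It is illustrative and deliberately not exhaustive; always confirm against the actual behaviour of your integration tool:

```python
# Illustrative (not exhaustive) kdb+ -> Databricks SQL type mapping.
KDB_TO_DATABRICKS = {
    "boolean":   "BOOLEAN",
    "short":     "SMALLINT",   # 2-byte integer
    "int":       "INT",
    "long":      "BIGINT",
    "real":      "FLOAT",      # kdb+ real is 4-byte
    "float":     "DOUBLE",     # kdb+ float is 8-byte, so DOUBLE, not FLOAT
    "char":      "STRING",
    "symbol":    "STRING",
    "timestamp": "TIMESTAMP",
    "date":      "DATE",
}

def target_type(kdb_type: str) -> str:
    """Look up the Databricks type, defaulting to STRING for unmapped types."""
    return KDB_TO_DATABRICKS.get(kdb_type, "STRING")
```

Defaulting unknown types to STRING is a conservative choice for a first load; apply explicit casts once the mapping is confirmed.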


Load Patterns: Initial vs. Incremental

Best Practice:

- Use a once-off initial load to bring the full dataset into the target system.
- Follow up with incremental loads to replicate only new or changed data.

Both operations can use the same Database Query component; only the query’s filter condition changes.
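That pattern can be sketched as a single query builder in which only the optional watermark filter differs between the two load types (the table and column names are illustrative):

```python
from typing import Optional

def build_extract_query(table: str,
                        watermark_column: Optional[str] = None,
                        watermark_value: Optional[str] = None) -> str:
    """Build the extraction query for either load pattern.

    With no watermark arguments, return the full-table query used for the
    initial load; with both, return the incremental variant.
    """
    query = f"SELECT * FROM {table}"
    if watermark_column and watermark_value:
        query += f" WHERE {watermark_column} > '{watermark_value}'"
    return query

print(build_extract_query("trades"))
# prints SELECT * FROM trades
print(build_extract_query("trades", "last_update_time", "2024-06-20 00:00:00.000"))
# prints SELECT * FROM trades WHERE last_update_time > '2024-06-20 00:00:00.000'
```

Because both loads share one template, switching from initial to incremental mode is a configuration change rather than a pipeline rewrite.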

Initial Load

  • The query retrieves the entire dataset.
  • No filter is applied.

SELECT * FROM trades

Matillion Database Query Example:
Set the SQL statement with no `WHERE` clause.


Incremental Load

  • The query includes a filter, typically on a timestamp or an increasing numeric column (e.g., tradeId, update_time), to collect only new data.

SELECT * FROM trades WHERE last_update_time > '2024-06-20 00:00:00.000'

Matillion Database Query Example:
Use a variable (e.g., `${last_run_time}`) as a filter:

SELECT * FROM trades WHERE last_update_time > '${last_run_time}'

Update `${last_run_time}` after each load to ensure only new records are retrieved on subsequent runs.
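If you track the watermark outside Matillion's own job variables, a minimal state store is enough. The sketch below is a hypothetical example: the JSON file name, location, and timestamp format are assumptions to adapt to your environment:

```python
import json
from pathlib import Path

STATE_FILE = Path("last_run_time.json")  # hypothetical state location

def read_watermark(default: str = "1970-01-01 00:00:00.000") -> str:
    """Return the stored high-water mark, or the default on the first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_run_time"]
    return default

def write_watermark(value: str) -> None:
    """Persist the maximum timestamp seen in the batch just loaded."""
    STATE_FILE.write_text(json.dumps({"last_run_time": value}))
```

Reading the watermark before the query and writing it back only after a successful load means a failed run is simply retried with the old value, so no records are skipped.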

For more on this approach, see the Incremental Load / Data Replication Strategy article.


Refer to your specific integration documentation to adapt these patterns to your environment, and always test queries for correctness and efficiency.

Data Integration Architecture

Loading data in advance of integration exemplifies the "divide and conquer" approach by splitting the overall data integration process into two manageable steps: loading raw data into the target environment first and then transforming and integrating it as needed. This sequencing is a core advantage of the ELT (Extract, Load, Transform) architecture. Data integration inherently requires considerable data transformation, and the most efficient way to accomplish these transformations is through well-structured data transformation pipelines. A further benefit of the ELT approach is that these transformations and integrations take place within the target Databricks database itself. This in-database processing ensures the operations are fast, on-demand, and scalable, while also eliminating the need to invest in and maintain separate data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.