Integrate data from MonetDB to Databricks using Matillion

Our MonetDB to Databricks connector streamlines data transfer in just minutes, ensuring your data remains up to date automatically—no manual coding or complex ETL processes required.


Extracting data from MonetDB to Databricks

Extracting data from MonetDB is an essential step for organizations seeking to leverage the power of modern analytics platforms such as Databricks. MonetDB's high-performance, columnar architecture makes it a popular choice for storing and managing large datasets, but organizations frequently wish to integrate this data into the broader analytics and machine learning capabilities offered by Databricks.

This article provides a step-by-step guide to extracting data from MonetDB and loading it into Databricks. We begin by creating a user in MonetDB with the permissions required to access the data. Next, for users employing Matillion as their integration tool, we explain how to check for and, if necessary, install the JDBC driver required for connectivity to MonetDB.

Reliable network connectivity between MonetDB and Databricks is critical for successful data transfer, and we discuss strategies and best practices to achieve it. Finally, we demonstrate how to query data from MonetDB, both for an initial bulk load and for subsequent incremental updates, enabling efficient data synchronization. By following this guide, you will be able to confidently extract your MonetDB data and operationalize it within the Databricks environment.


What is MonetDB?

MonetDB is an open-source relational database management system designed specifically for high-performance analytics on large datasets. It employs a column-oriented storage model, which enables efficient vectorized query processing and optimizes disk I/O for analytical workloads, such as those common in data warehousing and business intelligence environments. MonetDB’s architecture supports automatic indexing, adaptive query optimization, and leveraging modern hardware features like large memory and multi-core CPUs. Its rich SQL:2003 support is complemented by extensions for geospatial data, scientific analytics, and integration with programming languages such as Python and R, making it suitable for a wide range of data-intensive applications.


What is Databricks?

Databricks, built on the Lakehouse Platform, unifies data lake flexibility and data warehouse reliability for scalable, cost-efficient data management. Powered by Apache Spark and integrated with Delta Lake, it supports ACID transactions, schema enforcement, and time travel, ensuring consistency and integrity for large, dynamic datasets. Its cloud-native design promotes collaboration among data engineers, analysts, and data scientists through easy data access, optimized caching, and automated cluster management. Databricks excels at advanced analytics and machine learning at enterprise scale, while providing strong governance and high availability across multi-cloud environments.

Why Move Data from MonetDB into Databricks

Unlocking Advanced Analytics: Migrating Data from MonetDB to Databricks

A data engineer or architect may wish to copy data from MonetDB into Databricks for several compelling reasons. Firstly, MonetDB typically holds data that is potentially valuable for analytical purposes, often as a result of ongoing business operations or legacy systems. However, the true value of this data is often realized only when it is integrated with information from other sources, enabling comprehensive analysis and more informed decision-making. By transferring data into Databricks, professionals can leverage the platform’s powerful data integration and processing capabilities to combine MonetDB data with a broad spectrum of other datasets. Importantly, conducting this integration and analysis within Databricks rather than on MonetDB helps minimize additional processing demands on the MonetDB system, thus ensuring that its performance for operational workloads remains unaffected while still unlocking the potential of its data.

Creating a User in MonetDB

Adding a new user to a MonetDB database involves executing SQL commands, either through the `mclient` command-line tool or within any compatible SQL interface. The instructions below assume you have appropriate rights (for example, as a database administrator or with user management privileges).

1. Connect to the Database

Open your terminal or preferred SQL client and connect to the desired MonetDB database:

mclient -u monetdb -d <database_name>

Replace `<database_name>` with the name of your database. The `monetdb` user is the default administrator account.

2. Create a New User

Execute the following SQL statement to add a new user. Make sure to choose a secure password.

CREATE USER <username> WITH PASSWORD '<strong_password>' NAME '<display_name>' SCHEMA <schema>;

  • `<username>`: The system login name for the new user.
  • `<strong_password>`: The chosen password, enclosed in single quotes.
  • `<display_name>`: The full name or description for the user.
  • `<schema>`: The default schema for this user (commonly `sys`).

Example:

CREATE USER alice WITH PASSWORD 'Secur3P@ssw0rd' NAME 'Alice Smith' SCHEMA sys;

3. Grant Privileges

By default, a new user has minimal privileges. Grant the necessary permissions according to your requirements. For example, to grant all privileges on a specific schema:

GRANT ALL ON SCHEMA <schema> TO <username>;

Example:

GRANT ALL ON SCHEMA sys TO alice;

You may grant more granular permissions as needed, such as `SELECT`, `INSERT`, `UPDATE`, or `DELETE` on specific tables:

GRANT SELECT, INSERT ON <schema>.<table> TO <username>;

Example:

GRANT SELECT, INSERT ON sys.orders TO alice;

4. Verify the User

List the users to confirm the new identity has been successfully created:

SELECT * FROM sys.users;

Additional Notes

  • Only the database administrator (by default the `monetdb` account) or a user granted equivalent user-management rights can create new users.
  • Passwords are case-sensitive and should be stored securely.
  • Consult the MonetDB SQL Reference for advanced user and permission management.

Installing the JDBC Driver

At the time of writing, the MonetDB JDBC driver is not included by default with Matillion Data Productivity Cloud due to licensing or redistribution restrictions. Therefore, before you can connect to a MonetDB database from Matillion, you will need to manually download and install the relevant JDBC driver.

Downloading the MonetDB JDBC Driver

  1. Navigate to the official MonetDB download page for JDBC drivers: https://www.monetdb.org/Documentation/JDBC.
  2. Look for the latest stable version of the MonetDB JDBC driver. Select the Type 4 JDBC driver, the pure Java implementation, which is generally recommended for compatibility and performance.
  3. Download the JDBC driver's `.jar` file to a location on your local machine.

Installing the Driver into Matillion Data Productivity Cloud

Matillion Data Productivity Cloud supports uploading external JDBC drivers for database connectivity. To install the MonetDB JDBC driver:

  1. Follow the official Matillion instructions for uploading external drivers found at: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/.
  2. Use the upload interface to add your downloaded MonetDB `.jar` file.
  3. Complete any additional configuration required as detailed in the Matillion documentation, ensuring the driver is properly registered for use within your environment.

Further Usage in Matillion

Once the MonetDB JDBC driver is installed, you can begin using it in your database components and queries. For instructions on configuring and using database connectivity within Matillion Data Productivity Cloud, refer to: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.

If you encounter any issues with connecting to your MonetDB instance, ensure your driver is the correct Type 4 JDBC version and has been uploaded according to the provided instructions.
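Once uploaded, the driver is referenced by its class name and a JDBC connection URL. The class name, URL scheme, and default port below are assumptions based on the MonetDB JDBC documentation (older driver releases used `nl.cwi.monetdb.jdbc.MonetDriver`); verify them against the driver version you downloaded:

```python
# Illustrative helper for assembling a MonetDB JDBC connection URL.
# Driver class and URL format are assumptions; confirm against your
# downloaded driver's documentation.
MONETDB_DRIVER_CLASS = "org.monetdb.jdbc.MonetDriver"

def monetdb_jdbc_url(host: str, database: str, port: int = 50000) -> str:
    # MonetDB's default listener port is 50000.
    return f"jdbc:monetdb://{host}:{port}/{database}"

print(monetdb_jdbc_url("db.example.com", "analytics"))
# jdbc:monetdb://db.example.com:50000/analytics
```

Use the resulting URL and driver class when configuring the database connection in Matillion.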

Checking network connectivity

To enable communication between Matillion Data Productivity Cloud and your MonetDB database, you must ensure that the MonetDB instance is configured to allow incoming network connections from the appropriate sources. The required configuration depends on your deployment type:

  • Full SaaS Agent Configuration:
    If you are using the Full SaaS agent, your MonetDB database must accept connections from the set of static IP addresses provided by Matillion; the current list is published in the Matillion documentation. Update your network access rules or firewall to permit inbound connections from these addresses.

  • Hybrid SaaS Deployment:
    For a Hybrid SaaS deployment, you need to allow inbound connections from your own virtual private cloud (VPC) where the Matillion agent is running. To verify network access or for useful tools to assist with this process, visit the Matillion network utilities page.

Additionally, if you reference your MonetDB instance using a DNS hostname rather than a direct IP address, ensure that your selected Matillion agent (whether Full SaaS or Hybrid SaaS) is able to resolve the DNS address correctly. This is important for maintaining reliable connectivity to your database.
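As a sanity check for the Full SaaS case, you can verify that a given agent source address falls inside the ranges you have allow-listed. Here is a small Python sketch using the standard `ipaddress` module; the CIDR ranges shown are placeholders, not Matillion's actual addresses:

```python
import ipaddress

# Placeholder ranges -- substitute the static IP list published by Matillion.
ALLOWED_CIDRS = ["203.0.113.0/24", "198.51.100.0/24"]

def is_allowed(source_ip: str) -> bool:
    # True if the source IP falls within any allow-listed CIDR range.
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in ALLOWED_CIDRS)

print(is_allowed("203.0.113.42"))  # True
print(is_allowed("192.0.2.10"))    # False
```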

How to Query Data from a MonetDB Database

This guide explains how to query data from a MonetDB database, with attention to initial and incremental data loads, query examples, and considerations for integration with platforms like Databricks.


Querying Data: SQL SELECT Statement Examples

Below are common MonetDB SQL queries:

```
-- Retrieve all records from a table
SELECT * FROM sales_orders;

-- Select specific columns
SELECT order_id, customer_id, amount FROM sales_orders;

-- Filter using a WHERE clause
SELECT * FROM sales_orders WHERE order_date >= '2024-01-01';

-- Aggregate and group data
SELECT customer_id, SUM(amount) AS total_spent
FROM sales_orders
GROUP BY customer_id
ORDER BY total_spent DESC;
```


Datatype Conversion Between MonetDB and Databricks

When moving data between MonetDB and Databricks, be aware that datatype conversions may be necessary. Common MonetDB datatypes (e.g., `INTEGER`, `DECIMAL`, `VARCHAR`) will be mapped to their closest equivalents in Databricks (using Spark SQL datatypes). For example:

| MonetDB Type | Databricks (Spark SQL) Type |
| ------------ | --------------------------- |
| INTEGER      | INTEGER                     |
| DOUBLE       | DOUBLE                      |
| VARCHAR(n)   | STRING                      |
| TIMESTAMP    | TIMESTAMP                   |
| BOOLEAN      | BOOLEAN                     |

Check your sync tool's documentation for automatic mapping and identify any manual type conversions that may be needed in your ETL workflows.
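To illustrate, here is a simple Python sketch of such a mapping; the mapping table is indicative only, and your sync tool's actual behavior may differ:

```python
import re

# Indicative MonetDB -> Spark SQL type mapping; extend to match your tool.
TYPE_MAP = {
    "INTEGER": "INTEGER",
    "BIGINT": "BIGINT",
    "DOUBLE": "DOUBLE",
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "TIMESTAMP": "TIMESTAMP",
    "DATE": "DATE",
    "BOOLEAN": "BOOLEAN",
}

def to_spark_type(monetdb_type: str) -> str:
    # Strip any length suffix, e.g. VARCHAR(255) -> VARCHAR.
    base = re.sub(r"\(.*\)$", "", monetdb_type.strip()).upper()
    if base == "DECIMAL":
        # Preserve precision and scale for DECIMAL, e.g. DECIMAL(18,2).
        return monetdb_type.strip().upper()
    return TYPE_MAP.get(base, "STRING")  # fall back to STRING

print(to_spark_type("VARCHAR(255)"))   # STRING
print(to_spark_type("DECIMAL(18,2)"))  # DECIMAL(18,2)
```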


Data Loading Best Pattern: Initial and Incremental Loads

The optimal pattern for loading data from MonetDB employs an initial full load, followed by recurring incremental loads. Both operations can be accomplished using the same Database Query component (such as in Matillion, or similar ETL tools).

  • Initial Load:
    No filter is required. The entire table or dataset is selected and loaded.

Example Initial Load Query (no filter):

SELECT * FROM sales_orders;

  • Incremental Load:
    Uses a filter (such as `WHERE updated_at >= ?`) to load only records that have changed since the last extract. This reduces data transfer and speeds up the process.

Example Incremental Load Query (with filter):

SELECT *
  FROM sales_orders
  WHERE updated_at >= '2024-06-01 00:00:00';


Use the same Database Query component for both loading procedures; simply modify the SQL to include or omit the incremental filter as appropriate. This ensures consistency and simplifies ongoing data extraction and synchronization tasks.
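The pattern above can be sketched as a single query builder that adds the watermark filter only when a previous extract has recorded a high-water mark. Table and column names are illustrative:

```python
from datetime import datetime
from typing import Optional

def build_extract_query(table: str, watermark_column: str,
                        last_watermark: Optional[datetime]) -> str:
    # Initial load: no watermark yet, so no filter is applied.
    # Incremental load: filter on rows changed since the last extract.
    query = f"SELECT * FROM {table}"
    if last_watermark is not None:
        ts = last_watermark.strftime("%Y-%m-%d %H:%M:%S")
        query += f" WHERE {watermark_column} >= '{ts}'"
    return query + ";"

# Initial load: no watermark recorded yet.
print(build_extract_query("sales_orders", "updated_at", None))
# Incremental load: resume from the last extract's high-water mark.
print(build_extract_query("sales_orders", "updated_at", datetime(2024, 6, 1)))
```

After each successful extract, persist the maximum `updated_at` value seen and pass it as the watermark on the next run.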

Data Integration Architecture

Loading data in advance of integration is a foundational advantage of the ELT (Extract, Load, Transform) architecture, as it allows the overall process to be divided into two distinct steps: first, ingesting raw data into your Databricks database, and then performing any necessary integration or transformation operations within the same environment. This separation makes the workload more manageable and flexible. Data transformation—a critical aspect of integration—is most efficiently performed using well-structured data transformation pipelines, which can be orchestrated and automated in Databricks.

Another key benefit of the ELT architecture lies in the fact that data transformation and integration occur directly inside the target Databricks database. This enables fast, on-demand, and scalable processing that leverages the power of Databricks' optimized computing resources, eliminating the need for additional, external data processing infrastructure and associated costs.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.