
Integrate data from Exasol to Databricks using Matillion

Our Exasol to Databricks connector enables seamless, up-to-date data transfers to Databricks in just minutes, eliminating the need for manual coding or the management of complex ETL scripts.


Extracting data from Exasol to Databricks

Extracting data from Exasol is a common requirement for organizations looking to leverage the advanced analytics capabilities of Databricks. Ensuring a smooth and efficient data transfer process is critical for maximizing the value of your data assets. This article provides a step-by-step guide to moving data from Exasol to Databricks, covering essential preparation and execution stages.

We will begin by discussing how to create an identity in Exasol, which is necessary for secure and authorized access. For users of the Matillion ETL platform, we will outline how to check for the presence of an appropriate JDBC driver or acquire one if needed. Network connectivity between the source (Exasol) and the target (Databricks) is crucial, so we will review key considerations to ensure a reliable connection. Finally, we will explore best practices for querying data, including strategies for initial data loads and subsequent incremental updates, to maintain both performance and data integrity.

By the end of this article, you will have a clear understanding of how to efficiently extract data from Exasol and load it into Databricks, whether you are working with full datasets or ongoing data updates.


What is Exasol?

Exasol is a high-performance, in-memory, columnar relational database management system (RDBMS) designed specifically for advanced analytics and business intelligence workloads. Leveraging massively parallel processing (MPP) architecture, Exasol achieves exceptionally fast query execution and highly scalable performance, making it suitable for handling large-scale data sets and complex analytical queries. The platform supports standard SQL, integrates natively with various ETL tools and BI platforms, and provides connectivity to data science environments through dedicated APIs for languages such as Python, R, and Java. Additionally, Exasol offers robust features for workload management, security, and high availability, making it a compelling choice for enterprises seeking efficient, real-time data analytics across diverse industries.


What is Databricks?

Databricks is a cloud-based unified analytics platform built on Apache Spark, designed for data engineering, machine learning, and collaborative analytics. Central to its database capability is Delta Lake, enabling ACID transactions, scalable metadata, and unified batch/stream data management for data lakes. With managed compute clusters and intelligent caching, users can create scalable ETL pipelines, perform complex analytics, and support data versioning and schema evolution. Databricks integrates with AWS S3, Azure, and Google Cloud storage, leveraging Spark’s distributed processing and automatic scaling. It empowers teams to efficiently handle large structured or unstructured datasets via collaborative notebooks, strong security features, and automated cluster management.

Why Move Data from Exasol into Databricks

Unlocking Analytics: The Benefits of Copying Data from Exasol into Databricks

A data engineer or architect may wish to copy data from Exasol into Databricks for several compelling reasons. Exasol often holds data that is potentially valuable for analytics, decision-making, or machine learning initiatives. However, the true value of this data can be realized by integrating it with information from other sources—such as cloud data lakes, transactional systems, or third-party datasets. By using Databricks as the platform for this integration, organizations can leverage its scalable processing capabilities while avoiding the risk of overburdening the Exasol database with complex transformational workloads. This approach not only preserves the performance of core Exasol operations but also facilitates richer analytics and insights by harnessing Databricks' advanced data engineering and collaborative features.

Creating a User in Exasol Database

To create a new user (identity) in an Exasol database, you will use SQL statements executed via a SQL client such as EXAplus, DBeaver, the SQL editor in EXAoperation, or any similar tool connected to your Exasol instance. You must have sufficient privileges, typically as a database administrator or a user with the `CREATE USER` privilege.

Step 1: Connect to the Database

Ensure that you are connected to your Exasol database with a user account that has the necessary privileges to create other users.

Step 2: Create the User

You can create a new user with the `CREATE USER` statement. Specify the username and an initial password. Note that passwords are case-sensitive, while unquoted usernames are folded to uppercase. Example:

CREATE USER my_new_user IDENTIFIED BY 'Str0ngInitPass!';

  • Replace `my_new_user` with the desired username.
  • Change `'Str0ngInitPass!'` to a secure initial password that meets your requirements.

Step 3: Grant Privileges to the User

By default, the new user will have no privileges. You must explicitly grant them the required privileges, such as access to specific schemas, the ability to create sessions, or administrative roles.

For example, to grant the ability to create a session (i.e., to log in):

GRANT CREATE SESSION TO my_new_user;

To grant privileges on a specific schema (e.g., to allow SELECT, INSERT, UPDATE and DELETE on all tables in the schema MY_SCHEMA):

GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA MY_SCHEMA TO my_new_user;

You can also grant an existing role. For example, Exasol's built-in `DBA` role confers full administrative rights, so grant it with caution:

GRANT DBA TO my_new_user;

Step 4: (Optional) Force Password Change on First Login

It is a best practice to require the user to change their password upon first use:

ALTER USER my_new_user PASSWORD EXPIRE;

Step 5: (Optional) Set Additional User Properties

You can further configure user properties, such as resource limits or password policies, using the `ALTER USER` statement.


Note: SQL keywords in Exasol are case-insensitive, and unquoted object names are folded to uppercase; quoted object names are case-sensitive. Passwords should always be managed securely.
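Steps 2 through 5 can be assembled into a small provisioning script. The sketch below only builds the SQL text shown above; the helper name and defaults are illustrative and not part of any Exasol or Matillion API:

```python
def build_user_provisioning_sql(username, password, schema,
                                grant_login=True, expire_password=True):
    """Assemble the Exasol statements from Steps 2-5 as a list of SQL strings."""
    statements = [f"CREATE USER {username} IDENTIFIED BY '{password}';"]
    if grant_login:
        # Without CREATE SESSION the new user cannot log in.
        statements.append(f"GRANT CREATE SESSION TO {username};")
    # Object privileges on the working schema.
    statements.append(
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA {schema} TO {username};"
    )
    if expire_password:
        # Force a password change on first login (Step 4).
        statements.append(f"ALTER USER {username} PASSWORD EXPIRE;")
    return statements

for stmt in build_user_provisioning_sql("my_new_user", "Str0ngInitPass!", "MY_SCHEMA"):
    print(stmt)
```

Generating the statements as strings keeps credentials out of version control if you source them from a secrets manager before execution.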

Installing the JDBC Driver

To enable connectivity between Matillion Data Productivity Cloud and your Exasol database, you must manually install the Exasol JDBC driver. Please note that, at the time of writing, this driver is not distributed by default with Matillion Data Productivity Cloud due to third-party licensing and redistribution restrictions imposed by Exasol.

Follow the steps below to obtain and install the driver:

1. Download the Exasol JDBC Driver

  • Access the Exasol JDBC driver downloads at: https://www.exasol.com/portal/display/DOWNLOADS/JDBC+Drivers
  • On the downloads page, search for the Type 4 JDBC driver for your intended version of the Exasol database. The Type 4 driver is a pure Java driver and does not require native libraries, making it the preferred option for Matillion Data Productivity Cloud integrations.
  • Download the latest stable release of the Type 4 JDBC driver JAR file to your local work environment.

2. Install the JDBC Driver in Matillion Data Productivity Cloud

  • Installation steps for uploading external JDBC drivers are detailed here: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
  • In summary:
      • Log in to Matillion Data Productivity Cloud as a user with the necessary permissions.
      • Navigate to the section for uploading external drivers (see the URL above for step-by-step instructions).
      • Upload the previously downloaded JDBC driver JAR file, specifying Exasol as the target driver where required.

3. Next Steps: Using the JDBC Driver

  • Once installed, you can configure Matillion Data Productivity Cloud to connect to Exasol and issue database queries.
  • Refer to the product's usage documentation for setting up connections and working with the integrated JDBC driver: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

By following these steps, you will have enabled Exasol JDBC connectivity within your Matillion Data Productivity Cloud environment using the officially distributed third-party driver.
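Before uploading, it can be worth confirming that the downloaded JAR really contains the driver class. The sketch below assumes Exasol's documented driver class name, `com.exasol.jdbc.EXADriver`; verify it against the release notes for your driver version:

```python
import zipfile

# Path of the driver class inside the JAR (an assumption; check your driver's docs).
DRIVER_CLASS = "com/exasol/jdbc/EXADriver.class"

def jar_contains_driver(jar_file, class_path=DRIVER_CLASS):
    """Return True if the JAR archive contains the expected driver class entry."""
    with zipfile.ZipFile(jar_file) as jar:
        return class_path in jar.namelist()
```

A quick check like `jar_contains_driver("exasol-jdbc.jar")` catches a corrupted or wrong download before you spend time debugging the upload in Matillion.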

Checking network connectivity

To enable successful connections between Matillion Data Productivity Cloud and your Exasol database, you must ensure that your Exasol instance is configured to allow inbound network traffic from the appropriate source. Where that traffic originates depends on your deployment architecture: from Matillion's hosted infrastructure when using a Full SaaS agent, or from your own network when using a Hybrid SaaS agent.

Additionally, if the Exasol database is referenced by a DNS name (rather than a direct IP address), ensure that whichever agent you are using (Full SaaS or Hybrid SaaS) can resolve the database's DNS address to the correct underlying IP address. Proper DNS resolution is required for successful connectivity.
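Both requirements (DNS resolution and an open port) can be sanity-checked from the agent's network with a short script. This is a hedged sketch: 8563 is Exasol's usual default client port, but confirm the port for your deployment.

```python
import socket

EXASOL_DEFAULT_PORT = 8563  # Exasol's usual default; adjust for your deployment

def check_exasol_reachability(host, port=EXASOL_DEFAULT_PORT, timeout=5.0):
    """Return (resolved_ips, reachable) for the given Exasol host.

    First resolves the DNS name, then attempts a TCP connection; a failure
    at either step indicates a problem the agent would also encounter.
    """
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return [], False  # DNS resolution failed
    ips = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return ips, True
    except OSError:
        return ips, False  # name resolved, but port not reachable
```

Run this from the same network segment as your agent; a success from your laptop does not prove the agent can reach the database.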

Querying Data from an Exasol Database

This guide outlines how to query data from an Exasol database and provides guidance for handling data loads, including patterns for initial and incremental data replication. It also highlights considerations for datatype conversion between Exasol and Databricks.


Examples of Exasol SQL SELECT Statements

Below are examples of common SELECT queries in Exasol SQL syntax:

```
-- Select all columns from the "sales" table
SELECT * FROM sales;

-- Select specific columns
SELECT order_id, customer_id, total_amount
FROM sales;

-- Aggregate data: count orders per customer
SELECT customer_id, COUNT(order_id) AS order_count
FROM sales
GROUP BY customer_id;

-- Join two tables
SELECT c.name, s.total_amount
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
WHERE s.order_date >= '2024-01-01';

-- Use a WHERE clause for filtering
SELECT * FROM sales
WHERE total_amount > 1000;
```

Datatype Conversion: Exasol ↔ Databricks

When exchanging data between Exasol and Databricks (for example, using ETL tools), be aware that some datatypes do not have a direct correspondence and conversions may occur. For instance:

| Exasol Datatype | Databricks Corresponding Type | Notes |
| --- | --- | --- |
| VARCHAR / NVARCHAR | STRING | Length limits may differ |
| DECIMAL | DECIMAL | Precision and scale may change |
| DATE / TIMESTAMP | DATE / TIMESTAMP | Timezone handling may vary |
| BOOLEAN | BOOLEAN | Supported, but source truth values may vary |
| GEOMETRY | Not directly supported | May require conversion to WKT or WKB |

Tip: Always validate the data schema and test your pipeline to ensure compatibility between source and destination.
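That validation can be partially automated with a simple lookup. The sketch below mirrors the mapping table above; the dictionary and function names are illustrative, and the choice to land `GEOMETRY` as a `STRING` (e.g., WKT) is an assumption you should adapt to your pipeline:

```python
# Illustrative mapping derived from the table above; extend for your schema.
EXASOL_TO_DATABRICKS = {
    "VARCHAR": "STRING",
    "NVARCHAR": "STRING",
    "DECIMAL": "DECIMAL",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
    "BOOLEAN": "BOOLEAN",
    # GEOMETRY has no direct Databricks equivalent; land it as WKT text.
    "GEOMETRY": "STRING",
}

def map_column_type(exasol_type):
    """Map an Exasol type declaration (e.g. 'VARCHAR(2000)') to a Databricks type."""
    base = exasol_type.split("(")[0].strip().upper()
    try:
        return EXASOL_TO_DATABRICKS[base]
    except KeyError:
        raise ValueError(f"No known Databricks mapping for Exasol type {exasol_type!r}")
```

Failing fast on unknown types, rather than silently defaulting to `STRING`, makes schema drift in the source visible at load time.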

Initial Load vs. Incremental Load Pattern

A robust and scalable pattern for data ingestion is to:

  1. Perform a one-off initial load of the complete data set.
  2. Follow up with ongoing incremental loads to retrieve only data that has changed (new or updated records).

Use the Database Query component (e.g., in Matillion ETL) for both initial and incremental data loads.

Initial Load

  • Approach: Retrieve all required data from the source table.
  • Query Example: (No filter in the WHERE clause.)

SELECT *
FROM sales;

  • Component Configuration: Do not set a filter clause in the `Database Query` component.

Incremental Load

  • Approach: Retrieve only new or changed data since the last successful load, using a filter column such as an updated timestamp or incrementing ID.
  • Query Example: (Includes a filter condition.)

SELECT *
FROM sales
WHERE last_modified > '2024-06-10 00:00:00';

  • Component Configuration: Specify a filter clause in the `Database Query` component to select only the delta data. Update the filter value after every successful load.

Further Reading:
Refine your incremental load strategy by reviewing Incremental Load Data Replication Strategy.


Note: Each ETL tool may have different best practices and settings for Exasol and Databricks integration; always consult your platform’s official documentation.

Data Integration Architecture

Loading data in advance of integration represents a "divide and conquer" approach to data management: the process is split into two distinct steps, which is an important advantage of the ELT (Extract, Load, Transform) architecture. In this model, raw data is first loaded into the target Databricks platform, and only then are the necessary integration and transformations applied. Data integration invariably requires transformation, such as aligning schemas, cleansing, and standardizing records, which is most effectively performed using dedicated data transformation pipelines. Another significant benefit of the ELT approach is that both transformation and integration run natively inside the Databricks platform, so they can leverage Databricks' fast, scalable, on-demand processing and avoid the need for separate, costly data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.