Extracting data from Exasol to Databricks
Extracting data from Exasol is a common requirement for organizations looking to leverage the advanced analytics capabilities of Databricks. Ensuring a smooth and efficient data transfer process is critical for maximizing the value of your data assets. This article provides a step-by-step guide to moving data from Exasol to Databricks, covering essential preparation and execution stages. We will begin by discussing how to create an identity in Exasol, which is necessary for secure and authorized access. For users of the Matillion ETL platform, we will outline how to check for the presence of an appropriate JDBC driver or acquire one if needed. Network connectivity between the source (Exasol) and the target (Databricks) is crucial, so we will review key considerations to ensure a reliable connection. Finally, we will explore best practices for querying data, including strategies for initial data loads and subsequent incremental updates, to maintain both performance and data integrity. By the end of this article, you will have a clear understanding of how to efficiently extract data from Exasol and load it into Databricks, whether you are working with full datasets or ongoing data updates.
What is Exasol?
Exasol is a high-performance, in-memory, columnar relational database management system (RDBMS) designed specifically for advanced analytics and business intelligence workloads. Leveraging massively parallel processing (MPP) architecture, Exasol achieves exceptionally fast query execution and highly scalable performance, making it suitable for handling large-scale data sets and complex analytical queries. The platform supports standard SQL, integrates natively with various ETL tools and BI platforms, and provides connectivity to data science environments through dedicated APIs for languages such as Python, R, and Java. Additionally, Exasol offers robust features for workload management, security, and high availability, making it a compelling choice for enterprises seeking efficient, real-time data analytics across diverse industries.
What is Databricks?
Databricks is a cloud-based unified analytics platform built on Apache Spark, designed for data engineering, machine learning, and collaborative analytics. Central to its database capability is Delta Lake, enabling ACID transactions, scalable metadata, and unified batch/stream data management for data lakes. With managed compute clusters and intelligent caching, users can create scalable ETL pipelines, perform complex analytics, and support data versioning and schema evolution. Databricks integrates with AWS S3, Azure, and Google Cloud storage, leveraging Spark’s distributed processing and automatic scaling. It empowers teams to efficiently handle large structured or unstructured datasets via collaborative notebooks, strong security features, and automated cluster management.
Why Move Data from Exasol into Databricks
Unlocking Analytics: The Benefits of Copying Data from Exasol into Databricks
A data engineer or architect may wish to copy data from Exasol into Databricks for several compelling reasons. Exasol often holds data that is potentially valuable for analytics, decision-making, or machine learning initiatives. However, the true value of this data can be realized by integrating it with information from other sources—such as cloud data lakes, transactional systems, or third-party datasets. By using Databricks as the platform for this integration, organizations can leverage its scalable processing capabilities while avoiding the risk of overburdening the Exasol database with complex transformational workloads. This approach not only preserves the performance of core Exasol operations but also facilitates richer analytics and insights by harnessing Databricks' advanced data engineering and collaborative features.
Creating a User in Exasol Database
To create a new user (identity) in an Exasol database, you will use SQL statements executed via SQL clients such as EXAplus, DBeaver, the SQL editor in EXAoperation, or any similar tool connected to your Exasol instance. You must have sufficient privileges, typically as a database administrator or as a user with the CREATE USER privilege.
Step 1: Connect to the Database
Ensure that you are connected to your Exasol database with a user account that has the necessary privileges to create other users.
Step 2: Create the User
You can create a new user with the CREATE USER statement. Specify the username and an initial password. As with other Exasol identifiers, names are case-insensitive unless quoted. Example:
CREATE USER my_new_user IDENTIFIED BY 'Str0ngInitPass!';
- Replace my_new_user with the desired username.
- Change 'Str0ngInitPass!' to a secure initial password that meets your requirements.
Step 3: Grant Privileges to the User
By default, the new user will have no privileges. You must explicitly grant them the required privileges, such as access to specific schemas, the ability to create sessions, or administrative roles.
For example, to grant the ability to create a session (i.e., to log in):
GRANT CREATE SESSION TO my_new_user;
To grant privileges on a specific schema (e.g., to allow SELECT, INSERT, UPDATE and DELETE on all tables in the schema MY_SCHEMA):
GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA MY_SCHEMA TO my_new_user;
Granting an existing role (e.g., the built-in DBA role for full database administration; use with caution):
GRANT DBA TO my_new_user;
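A less privileged alternative is to bundle the schema privileges into a dedicated role and grant that role to the user. A minimal sketch, using an illustrative role name analyst_role:
```
-- Create a role that bundles the required schema privileges
CREATE ROLE analyst_role;
GRANT SELECT ON SCHEMA MY_SCHEMA TO analyst_role;
-- Grant the role to the new user
GRANT analyst_role TO my_new_user;
```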
Step 4: (Optional) Force Password Change on First Login
It is a best practice to require the user to change their password upon first use:
ALTER USER my_new_user PASSWORD EXPIRE;
Step 5: (Optional) Set Additional User Properties
You can further configure user properties, such as resource limits or password policies, using the ALTER USER statement.
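For example, the sketch below resets the user's password and, as a labeled assumption, sets a per-user password expiry policy; option availability varies by Exasol version, so verify against your version's documentation:
```
-- Reset the user's password (administrator form)
ALTER USER my_new_user IDENTIFIED BY 'An0therStr0ngPass!';
-- Assumption: per-user PASSWORD_EXPIRY_POLICY is available in your Exasol version
ALTER USER my_new_user SET PASSWORD_EXPIRY_POLICY = '90 DAYS';
```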
Note: All SQL keywords in Exasol are case-insensitive, but object names are case-sensitive if quoted. Passwords should always be managed securely.
Installing the JDBC Driver
To enable connectivity between Matillion Data Productivity Cloud and your Exasol database, you must manually install the Exasol JDBC driver. Please note that, at the time of writing, this driver is not distributed by default with Matillion Data Productivity Cloud due to third-party licensing and redistribution restrictions imposed by Exasol.
Follow the steps below to obtain and install the driver:
1. Download the Exasol JDBC Driver
- Access the Exasol JDBC driver downloads at: https://www.exasol.com/portal/display/DOWNLOADS/JDBC+Drivers
- On the downloads page, search for the Type 4 JDBC driver for your intended version of the Exasol database. The Type 4 driver is a pure Java driver and does not require native libraries, making it the preferred option for Matillion Data Productivity Cloud integrations.
- Download the latest stable release of the Type 4 JDBC driver JAR file to your local work environment.
2. Install the JDBC Driver in Matillion Data Productivity Cloud
- Installation steps for uploading external JDBC drivers are detailed here: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- In summary:
- Log in to Matillion Data Productivity Cloud as a user with the necessary permissions.
- Navigate to the section for uploading external drivers (see the URL above for step-by-step instructions).
- Upload the previously downloaded JDBC driver JAR file, specifying Exasol as the target destination as required.
3. Next Steps: Using the JDBC Driver
- Once installed, you can configure Matillion Data Productivity Cloud to connect to Exasol and issue database queries.
- Refer to the product's usage documentation for setting up connections and working with the integrated JDBC driver: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
By following these steps, you will have enabled support for Exasol JDBC connectivity within your Matillion Data Productivity Cloud environment using the officially referenced and up-to-date third-party artefact.
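For reference, the Exasol Type 4 driver's main class and JDBC URL take the following form, where <host> is your Exasol cluster address and 8563 is Exasol's default port:
```
Driver class: com.exasol.jdbc.EXADriver
JDBC URL:     jdbc:exa:<host>:8563
```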
Checking network connectivity
To enable successful connections between Matillion Data Productivity Cloud and your Exasol database, you must ensure that your Exasol instance is configured to allow inbound network traffic from the appropriate source, depending on your deployment architecture:
- Full SaaS agent configuration: Configure your Exasol database to accept incoming connections from the IP addresses used by the Matillion Data Productivity Cloud SaaS agents. The current list of required IP addresses can be found at the following URL: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/.
- Hybrid SaaS deployment: Configure your Exasol database networking or firewall rules to accept incoming connections from your organization's virtual private cloud (VPC). For guidance and utilities to help test connectivity and network access from your hybrid environment, see: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
Additionally, if the Exasol database is referenced by a DNS name (rather than a direct IP address), ensure that whichever agent you are using (Full SaaS or Hybrid SaaS) can resolve the database's DNS address to the correct underlying IP address. Proper DNS resolution is required for successful connectivity.
Querying Data from an Exasol Database
This guide outlines how to query data from an Exasol database and provides guidance for handling data loads, including patterns for initial and incremental data replication. It also highlights considerations for datatype conversion between Exasol and Databricks.
Examples of Exasol SQL SELECT Statements
Below are examples of common SELECT queries in Exasol SQL syntax:
```
-- Select all columns from the "sales" table
SELECT * FROM sales;

-- Select specific columns
SELECT order_id, customer_id, total_amount FROM sales;

-- Aggregate data: count orders per customer
SELECT customer_id, COUNT(order_id) AS order_count
FROM sales
GROUP BY customer_id;

-- Join two tables
SELECT c.name, s.total_amount
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
WHERE s.order_date >= '2024-01-01';

-- Use a WHERE clause for filtering
SELECT * FROM sales WHERE total_amount > 1000;
```
Datatype Conversion: Exasol ↔ Databricks
When exchanging data between Exasol and Databricks (for example, using ETL tools), be aware that some datatypes do not have a direct correspondence and conversions may occur. For instance:
| Exasol Datatype | Corresponding Databricks Type | Notes |
|---|---|---|
| VARCHAR / NVARCHAR | STRING | Length limits may differ |
| DECIMAL | DECIMAL | Precision and scale may change |
| DATE/TIMESTAMP | DATE/TIMESTAMP | Timezone handling may vary |
| BOOLEAN | BOOLEAN | Supported, but source truth values may vary |
| GEOMETRY | Not directly supported | May require conversion to WKT or WKB |
Tip: Always validate the data schema and test your pipeline to ensure compatibility between source and destination.
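Where a mapping is lossy or ambiguous, one option is to make the conversion explicit in the extraction query itself. A minimal sketch, assuming hypothetical columns order_ts and geo_region on the sales table, and assuming your Exasol version allows casting GEOMETRY to VARCHAR (yielding WKT):
```
SELECT
    -- Pin precision and scale so the DECIMAL arrives in Databricks as intended
    CAST(total_amount AS DECIMAL(18,2)) AS total_amount,
    -- Render the timestamp as text to sidestep timezone ambiguity (hypothetical column)
    TO_CHAR(order_ts, 'YYYY-MM-DD HH24:MI:SS') AS order_ts_text,
    -- Assumption: casting GEOMETRY to VARCHAR returns its WKT form (hypothetical column)
    CAST(geo_region AS VARCHAR(2000000)) AS geo_region_wkt
FROM sales;
```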
Initial Load vs. Incremental Load Pattern
A robust and scalable pattern for data ingestion is to:
- Perform a once-off initial load of the complete data set.
- Follow up with ongoing incremental loads to retrieve only data that has changed (new or updated records).
Use the Database Query component (e.g., in Matillion Data Productivity Cloud) for both initial and incremental data loads.
Initial Load
- Approach: Retrieve all required data from the source table.
- Query Example: (No filter in the WHERE clause.)
SELECT * FROM sales;
- Component Configuration: Do not set a filter clause in the Database Query component.
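For very large source tables, the initial load can optionally be split into ordered chunks to keep each extract manageable. A sketch, assuming order_id is a stable, unique ordering key:
```
-- First chunk of a chunked initial load; increase the OFFSET for each subsequent chunk
SELECT * FROM sales ORDER BY order_id LIMIT 100000 OFFSET 0;
```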
Incremental Load
- Approach: Retrieve only new or changed data since the last successful load, using a filter column such as an updated timestamp or incrementing ID.
- Query Example: (Includes a filter condition.)
SELECT * FROM sales WHERE last_modified > '2024-06-10 00:00:00';
- Component Configuration: Specify a filter clause in the Database Query component to select only the delta data. Update the filter value after every successful load.
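To make the hand-off concrete: a common pattern is to capture a high-water mark in Exasol after each run, then merge the extracted delta into the target table in Databricks. A sketch, assuming the delta lands in a Databricks staging table named sales_staging and that order_id uniquely identifies a row:
```
-- In Exasol: capture the high-water mark to use as the next run's filter value
SELECT MAX(last_modified) AS high_water_mark FROM sales;

-- In Databricks SQL: merge the staged delta into the target Delta table
MERGE INTO sales AS t
USING sales_staging AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```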
Further Reading:
Refine your incremental load strategy by reviewing Incremental Load Data Replication Strategy.
Note: Each ETL tool may have different best practices and settings for Exasol and Databricks integration; always consult your platform’s official documentation.
Data Integration Architecture
Loading data in advance of integration represents a "divide and conquer" approach to data management by splitting the process into two distinct steps, an important advantage of the ELT (Extract, Load, Transform) architecture. In this model, raw data is first loaded into the target Databricks database, and only then is the necessary integration and transformation applied. Data integration invariably requires transformation—such as aligning schemas, cleansing, and standardizing records—which is most effectively performed using dedicated data transformation pipelines. Another significant benefit of the ELT approach is that both transformation and integration run natively inside the Databricks platform. This means they can leverage Databricks’ fast, scalable, and on-demand processing capabilities, avoiding the need for separate, costly data processing infrastructure.
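As a brief illustration of the "transform after load" step, raw Exasol extracts landed in Databricks can be reshaped entirely with Databricks SQL. The schema and table names below (raw.sales, analytics.sales_clean) are hypothetical:
```
-- Databricks SQL: derive a cleansed, analytics-ready table from the raw landing table
CREATE OR REPLACE TABLE analytics.sales_clean AS
SELECT
    order_id,
    customer_id,
    CAST(total_amount AS DECIMAL(18,2)) AS total_amount,
    to_date(order_date) AS order_date
FROM raw.sales
WHERE total_amount IS NOT NULL;
```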