Integrate data from HSQLDB (HyperSQL Database Engine) to Databricks using Matillion

Our HSQLDB to Databricks connector enables seamless data transfer to Databricks within minutes, ensuring your data remains current without requiring manual coding or complex ETL scripts.


Extracting data from HSQLDB (HyperSQL Database Engine) to Databricks

Extracting data from HSQLDB is a key step when migrating or integrating data with modern platforms such as Databricks. Whether you are moving legacy information for analytics, supporting a data warehouse migration, or enabling real-time insights, understanding the end-to-end process is essential. In this article, we will walk through the critical stages required to efficiently extract data from an HSQLDB database and load it into Databricks.

We will begin by outlining how to create an appropriate database identity in HSQLDB, ensuring that you have the credentials necessary for secure access. For Matillion ETL users, we will discuss how to verify the availability of the required JDBC driver, and outline steps to acquire it if needed. Next, we will cover the network configuration required, highlighting how to establish connectivity between your HSQLDB source and your Databricks target environment.

Finally, we will delve into techniques for querying data from HSQLDB, using both initial full extracts and incremental loading approaches to keep your Databricks datasets up to date. Through this guide, you will gain the practical knowledge needed to make your HSQLDB to Databricks data pipelines reliable and efficient.


What is HSQLDB (HyperSQL Database Engine)?

HSQLDB (HyperSQL DataBase) is a lightweight, open-source Java relational database noted for its speed, small size, and full JDBC and SQL support. It offers both in-memory and disk-based storage, making it ideal for Java applications, prototyping, unit testing, and lightweight deployments. HSQLDB features a wide range of SQL capabilities such as triggers, stored procedures, and advanced constraints, with compatibility for major databases to ease migration. Its minimal configuration and easy integration with Java projects make HSQLDB popular among developers needing an efficient and reliable database solution.


What is Databricks?

The Databricks database is an integral component of the Databricks Lakehouse Platform, designed to unify data analytics, data engineering, machine learning, and data warehousing workloads. Built atop Apache Spark, it utilizes a scalable, distributed architecture and employs Delta Lake technology to provide ACID transactions, scalable metadata handling, and unified streaming and batch data processing. Data within a Databricks database is organized into schemas (databases) and tables, and can be queried using standard SQL, Python, Scala, or R. Advanced features, such as automated optimization, indexing, time travel, and reliable data versioning, enable efficient management of large-scale datasets, making it well-suited for complex analytic workflows in enterprise environments.

Why Move Data from HSQLDB (HyperSQL Database Engine) into Databricks

Unlocking Analytics: Transferring Data from HSQLDB to Databricks

A data engineer or architect may wish to copy data from HSQLDB into Databricks for several significant reasons. Firstly, HSQLDB may hold data that is potentially valuable to the organization, but that value can only be fully realized when the data is accessible within a modern analytics platform. By integrating HSQLDB data with other sources in Databricks, engineers can unlock new insights and enable more comprehensive analyses that are not possible with siloed data. Furthermore, performing data integration and processing in Databricks, rather than directly on HSQLDB, avoids adding additional workload to the transactional database, ensuring HSQLDB remains responsive and efficient for its primary operational tasks. This approach not only enhances analytical capabilities but also preserves system performance and reliability.

Creating an Identity in HSQLDB

To create a new user (identity) in an HSQLDB database, you use the CREATE USER SQL statement. This statement allows you to define a username and password, and optionally grant administrative privileges to the new user. The following steps provide guidance on creating a database user in HSQLDB.


1. Connect to the Database as an Administrator

Before creating a user, you must connect as a user with sufficient privileges, typically the default SA (System Administrator) account. You can launch the HSQLDB Database Manager GUI:

java -cp hsqldb.jar org.hsqldb.util.DatabaseManagerSwing

or connect from the command line using SqlTool:

java -jar sqltool.jar --inlineRc=url=jdbc:hsqldb:file:yourdb,user=SA,password=

2. Create a New User

Use the CREATE USER statement with the desired username and password. Usernames are case-insensitive, while passwords are case-sensitive. Replace newuser and 'strong_password' with your desired values:

CREATE USER newuser PASSWORD 'strong_password';

3. Grant Privileges to the User

After creating the user, grant the necessary privileges. Privileges can be granted on specific tables or as global administrative rights. For example:

3.1. Allow Login Only

A newly created user can connect to the database, but cannot access any schema objects until privileges are granted. Note that in HSQLDB the USAGE privilege applies to sequences, domains, and similar objects rather than tables; for example:

GRANT USAGE ON SEQUENCE your_sequence TO newuser;

3.2. Grant All Privileges

To grant full rights on the entire database (useful for development), grant the built-in DBA role:

GRANT DBA TO newuser;

3.3. Grant Table-Specific Privileges

To grant rights on specific tables:

GRANT SELECT, INSERT ON your_table TO newuser;

4. Optionally, Assign Administrative Rights

To make the user an administrator:

CREATE USER adminuser PASSWORD 'admin_pass' ADMIN;

Alternatively, promote an existing user to administrator by granting the DBA role:

GRANT DBA TO newuser;


Note:

  • Passwords must be enclosed in single quotes.
  • To see all users: SELECT * FROM INFORMATION_SCHEMA.SYSTEM_USERS;
  • User and privilege changes take effect immediately.

Refer to the HSQLDB User Guide for further details and advanced options.
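The user-setup statements above can be assembled programmatically when provisioning extraction credentials. The sketch below is illustrative; the function name and escaping approach are our own, not part of HSQLDB.

```python
# Sketch: build the CREATE USER / GRANT statements from this section as one
# script. HSQLDB requires the password in single quotes, so embedded quotes
# are doubled.

def build_user_setup_sql(username: str, password: str, admin: bool = False) -> list:
    """Return the CREATE USER (and optional GRANT DBA) statements."""
    quoted_password = "'" + password.replace("'", "''") + "'"
    statements = ["CREATE USER {} PASSWORD {}".format(username, quoted_password)]
    if admin:
        # Granting the built-in DBA role makes the user an administrator.
        statements.append("GRANT DBA TO {}".format(username))
    return statements

print(build_user_setup_sql("newuser", "strong_password"))
# → ["CREATE USER newuser PASSWORD 'strong_password'"]
```

Run each returned statement through your SQL client of choice while connected as SA.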

Installing the JDBC Driver

The Matillion Data Productivity Cloud platform supports connectivity to a wide range of databases through JDBC drivers. However, at the time of writing, the JDBC driver for the HSQLDB database is not included by default in Matillion Data Productivity Cloud due to licensing or redistribution restrictions. Therefore, you must manually download and install this driver before you can use HSQLDB data sources within Matillion.

Downloading the HSQLDB JDBC Driver

To obtain the required JDBC driver, navigate to the official HSQLDB project site:

  • http://hsqldb.org/

On the downloads page, look for the most recent and stable release. For optimal compatibility and minimal external dependencies, select the Type 4 JDBC driver. Type 4 drivers are pure Java implementations and are recommended for use with platforms like Matillion.

When downloading, you will typically be provided with a ZIP or JAR package. Once the download completes, extract the downloaded files and locate the appropriate .jar file (for example, hsqldb.jar) that contains the JDBC driver implementation.
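Before uploading, it can be worth confirming that the jar you extracted really contains the HSQLDB JDBC driver class (org.hsqldb.jdbc.JDBCDriver). Since a .jar is a zip archive, a short sketch like the following can inspect it; the helper name is ours:

```python
# Sketch: check a jar archive for a given .class entry using the standard
# zipfile module (a .jar file is a zip archive).
import zipfile

def jar_contains_class(jar_path: str, class_entry: str) -> bool:
    """Return True if the jar archive contains the given entry."""
    with zipfile.ZipFile(jar_path) as jar:
        return class_entry in jar.namelist()

# Illustrative usage; adjust the path to where you extracted the download.
# jar_contains_class("hsqldb.jar", "org/hsqldb/jdbc/JDBCDriver.class")
```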

Uploading the Driver to Matillion Data Productivity Cloud

To install the driver into your Matillion environment, follow the official documentation for uploading external drivers:

  • https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

This documentation outlines the process of uploading the .jar file to your Matillion agent or environment. Make sure you have the necessary permissions and are working with the correct environment when you perform this step.

Using the HSQLDB Driver in Matillion

After installation, the HSQLDB JDBC driver will appear as an available option when configuring database query components or establishing new connections in Matillion Data Productivity Cloud.

For detailed guidance on how to use an uploaded driver for database connectivity and query execution, consult the official usage instructions:

  • https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

This resource provides step-by-step instructions on setting up connections, configuring parameters, and using database query tasks with your newly installed driver.

Checking network connectivity

To enable successful integration between Matillion Data Productivity Cloud and your HSQLDB database, you must ensure that the database is configured to allow incoming network connections from the appropriate source: for a Full SaaS deployment, connections originate from the Matillion-hosted agent, while for a Hybrid SaaS deployment they originate from the network where your own agent runs.

Additionally, if you are connecting to the HSQLDB database using a DNS hostname rather than an IP address, ensure that the Full SaaS or Hybrid SaaS agent is able to resolve the FQDN or hostname to the correct IP address. This typically means that required DNS servers or services must be reachable from the agent’s network environment.
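A quick TCP probe from the agent's network can confirm both DNS resolution and reachability before you configure the connection in Matillion. The sketch below assumes the HSQLDB Server's default listener port, 9001; adjust the host and port to your deployment:

```python
# Sketch: probe whether host:port is reachable from this machine. A successful
# TCP connect confirms both name resolution and network connectivity.
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Illustrative usage (hypothetical hostname):
# can_reach("hsqldb.internal.example.com", 9001)
```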

Querying Data from an HSQLDB Database

This guide provides technical instructions for querying data from an HSQLDB database, with SQL examples and best practices for integration scenarios involving dataset synchronization (e.g., moving data to Databricks). Guidance includes establishing robust initial and incremental data load patterns using a Database Query component.


Example HSQLDB SQL SELECT Statements

You can use standard SQL to query tables and views from HSQLDB.

Selecting all columns from a table:

SELECT * FROM employee;

Selecting specific columns with a filter:

SELECT id, name, hire_date FROM employee WHERE department = 'Engineering';

Aggregating data:

SELECT department, COUNT(*) AS num_employees
FROM employee
GROUP BY department;

Joining tables:

SELECT e.id, e.name, d.dept_name 
FROM employee e
JOIN department d ON e.department_id = d.id;


Datatype Conversion: HSQLDB to Databricks

When transferring data from HSQLDB to Databricks, be aware that datatype conversion may occur, as each platform has its own supported datatypes.

  • VARCHAR in HSQLDB may map to STRING in Databricks.
  • DATE and TIMESTAMP types often map as expected, but always validate proper timezone handling.
  • NUMERIC and DECIMAL conversions usually preserve precision, but extremely high scales could be truncated.
  • BOOLEAN types might convert to BOOLEAN without issue, but test NULL handling explicitly.

Always consult the documentation of your integration platform for definitive type-mapping rules.
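The mapping notes above can be captured as a simple lookup table. The sketch below is illustrative only; verify the authoritative mapping against your integration tool's documentation before relying on it:

```python
# Sketch: an illustrative HSQLDB-to-Databricks type map reflecting the notes
# above. This is an assumption-laden example, not a definitive mapping.
HSQLDB_TO_DATABRICKS = {
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "NUMERIC": "DECIMAL",
    "DECIMAL": "DECIMAL",
    "DOUBLE": "DOUBLE",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
    "BOOLEAN": "BOOLEAN",
}

def map_type(hsqldb_type: str) -> str:
    """Look up the Databricks type for an HSQLDB type name."""
    base = hsqldb_type.split("(")[0].strip().upper()  # drop e.g. (255)
    return HSQLDB_TO_DATABRICKS.get(base, "STRING")   # fall back to STRING

print(map_type("VARCHAR(255)"))  # → STRING
```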


Data Load Pattern: Initial and Incremental Loads

For scalable integration (such as ETL or data replication), the best practice is to adopt an initial full load followed by incremental loads, both usually using the same Database Query component (e.g., in ETL tools like Matillion, Apache NiFi, or custom scripts).

  • Initial Load: Retrieves the entire dataset.
  • Incremental Load: Retrieves only new or changed records, based on a reliable filter (like a modified timestamp).

1. Initial Load SQL Example

During the initial load, do not use a WHERE clause; fetch the entire table:

SELECT * FROM employee;

2. Incremental Load SQL Example

For incremental loading, add a WHERE clause to fetch only recent records. For example, using a last-updated timestamp column:

SELECT * FROM employee
WHERE last_modified > '2024-06-01 00:00:00';

Update the filter value each load based on the latest timestamp you have processed.

Reference: Incremental Load Data Replication Strategy – Matillion Exchange

Tip: Use parameterization so the last_modified cutoff is supplied dynamically at runtime.
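The high-watermark pattern above can be sketched as a small query builder: with no stored watermark it emits the initial full load, and on subsequent runs it filters on the last processed cursor value. The table and column names come from the examples; the helper is ours:

```python
# Sketch: build the initial or incremental query for the high-watermark
# pattern described above.
from typing import Optional

def build_incremental_query(table: str, cursor_column: str,
                            watermark: Optional[str]) -> str:
    """Return the full-load query when no watermark exists, else the
    incremental query filtered on the last processed cursor value."""
    if watermark is None:
        return "SELECT * FROM {}".format(table)  # initial full load
    # Incremental load: only rows changed since the last run.
    return "SELECT * FROM {} WHERE {} > '{}'".format(table, cursor_column, watermark)

print(build_incremental_query("employee", "last_modified", None))
# → SELECT * FROM employee
print(build_incremental_query("employee", "last_modified", "2024-06-01 00:00:00"))
# → SELECT * FROM employee WHERE last_modified > '2024-06-01 00:00:00'
```

After each successful load, persist the maximum cursor value you processed and pass it in as the next run's watermark.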


Using these query patterns, you can reliably perform initial and ongoing data captures from HSQLDB databases in hybrid and data pipeline architectures.

Data Integration Architecture

Loading data in advance of integration is a powerful approach commonly employed in modern data platforms, notably those adopting an ELT (Extract, Load, Transform) architecture. This method effectively divides the data integration problem into two manageable steps: first, data is ingested and loaded into the target environment, and then integration and transformation are performed as needed. By decoupling loading from transformation, organizations gain flexibility and are better able to meet diverse data requirements.

In practice, data integration necessarily involves data transformation—standardizing, cleansing, and joining disparate sources—and the optimal way to manage this process is through robust data transformation pipelines. These pipelines ensure that integrations are repeatable, transparent, and compliant with evolving business logic.

Another significant advantage of the ELT approach is that both transformation and integration take place directly within the target Databricks database. This design ensures that data processing is fast, performed on-demand, and inherently scalable with the Databricks platform. Importantly, because all processing occurs within the Databricks environment, organizations avoid the costs and complexity of maintaining additional data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.