Extracting data from H2 Database to Databricks
Extracting data from H2 Database is a common requirement for teams migrating legacy data, integrating analytics platforms, or building modern data pipelines. Whether your end goal is robust reporting, advanced analytics, or simply centralizing data for easier access, getting your H2 data into platforms like Databricks is an important step. In this article, we will guide you through the essential stages of this process. We begin by creating an identity in the H2 database to manage access and permissions securely. Next, if you are a Matillion user, we will walk you through checking for—and if needed, acquiring—the appropriate JDBC driver to ensure seamless connectivity. We will then discuss the key considerations for establishing reliable network connectivity between your H2 source and the Databricks target environment. Finally, we will show you how to query your data, both for an initial full extraction and for incremental updates, to optimize ongoing synchronization and minimize resource usage. By following this guide, you will be able to efficiently move data from H2 Database to Databricks, supporting both initial migrations and ongoing integration needs.
What is H2 Database?
H2 Database is a lightweight, Java-based relational database management system renowned for its small footprint, high performance, and ease of integration into Java applications. Operating in both embedded and server modes, H2 offers full compliance with ANSI SQL standards, robust support for transactions, and a range of features such as in-memory tables, encryption, and pluggable authentication. Its JDBC API compatibility makes it a popular choice for development, testing, and prototyping environments, while its straightforward configuration—typically requiring only a single JAR file—simplifies deployment. Furthermore, H2 includes a web-based console for query execution and database management, enhancing developer productivity in varied use cases, from unit testing to lightweight production applications.
What is Databricks?
Databricks is a unified analytics platform built on Apache Spark, offering scalable big data processing and robust database capabilities. Its Delta Lake storage delivers fast, ACID-compliant transactions with schema enforcement for reliable data lakes. Databricks lets users process structured, semi-structured, or unstructured data using SQL or Spark, with collaborative notebooks in Python, Scala, R, and SQL. It supports workflows for data engineering, machine learning, and BI. Natively integrating with AWS S3, Azure Data Lake Storage, and Google Cloud Storage, Databricks gives enterprises a secure, cloud-based environment for advanced analytics and comprehensive data management.
Why Move Data from H2 Database into Databricks
Unlocking Analytics: Transferring Data from H2 Database to Databricks
A data engineer or architect might choose to copy data from an H2 Database into Databricks for several compelling reasons. Firstly, the H2 Database may hold information that is potentially valuable for analysis or business intelligence initiatives, but its utility is often limited when confined to its original environment. By integrating data from the H2 Database with other enterprise datasets within Databricks, organizations can uncover deeper insights and realize the full potential of their data through advanced analytics and machine learning capabilities. Additionally, conducting this integration within Databricks, rather than directly querying the H2 system, helps avoid placing additional processing strain on the transactional H2 Database, thereby ensuring its performance and stability for operational workloads while simultaneously enabling large-scale data analysis.
Creating a User in H2 Database
To create a new user in an H2 Database, you typically use the SQL CREATE USER statement. The following instructions assume you already have access to your H2 database via the H2 Console or a compatible client.
1. Log in as an Admin User
Connect to the H2 database as a user with sufficient privileges (usually the SA user by default).
2. Run the CREATE USER
SQL Statement
The general syntax to create a new user is:
CREATE USER <username> PASSWORD '<password>';
Example:
CREATE USER alice PASSWORD 'securePassword123';
This command creates a user named alice with the specified password. Passwords are case-sensitive and should be strong.
3. Grant Privileges (Optional)
By default, a new user does not have any privileges on existing schemas or tables. Grant privileges explicitly as needed:
GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA PUBLIC TO alice;
This example grants common privileges on the default PUBLIC schema to the user alice.
4. Verify the User
You can test logging in with the newly created user via an H2 client or console, using the new username and password.
Note:
- The ability to create users may only be available if the H2 database is running in server mode, not in in-memory or embedded mode.
- Only administrative users (like SA) are permitted to create other users.
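If you automate user creation from a script, note that CREATE USER is DDL, so the username cannot be passed as a bound parameter and must be validated before interpolation. The sketch below (plain Python, function name and validation pattern are illustrative, not part of H2) builds the two statements shown above after checking the username against a conservative identifier pattern:

```python
import re

def build_create_user_sql(username: str, password: str) -> list:
    """Build CREATE USER / GRANT statements for H2.

    DDL identifiers cannot be bound as query parameters, so the
    username is validated against a conservative identifier pattern
    before interpolation. Single quotes in the password are doubled,
    following standard SQL string-literal escaping.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", username):
        raise ValueError("unsafe username: %r" % username)
    escaped_pw = password.replace("'", "''")
    return [
        f"CREATE USER {username} PASSWORD '{escaped_pw}';",
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA PUBLIC TO {username};",
    ]

for stmt in build_create_user_sql("alice", "securePassword123"):
    print(stmt)
```

Execute the returned statements over your JDBC connection while logged in as an administrative user.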
Installing the JDBC driver
At the time of writing, the JDBC driver for the H2 Database is not included by default with Matillion Data Productivity Cloud due to licensing or redistribution restrictions. Therefore, it is necessary to manually download and install the H2 JDBC driver before you can use it in your Matillion environment.
1. Download the H2 JDBC Driver
To obtain the latest version of the H2 JDBC driver, visit the official H2 Database download page:
https://www.h2database.com/html/download.html
When downloading, ensure you select the Type 4 JDBC driver, which is a pure Java driver and generally recommended for broad compatibility and optimal integration with Matillion.
2. Upload the Driver to Matillion Data Productivity Cloud
After downloading the driver (JAR file), follow the installation instructions provided by Matillion to correctly upload and register the external JDBC driver. Comprehensive, step-by-step guidance can be found here:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
This documentation outlines how to upload external drivers, where to place the JDBC JAR file, and how to make the driver available for use within your environment.
3. Configure and Use the Driver
Once the H2 JDBC driver has been successfully uploaded and installed, you can proceed to configure connections to your H2 Database within Matillion. For detailed instructions on usage and establishing connections, refer to:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
This resource covers connection setup, configuration parameters, and examples of how to leverage your new JDBC connection within Matillion workflows.
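When filling in the connection parameters, you will need an H2 JDBC URL. For server mode, the URL takes the form jdbc:h2:tcp://&lt;host&gt;:&lt;port&gt;/&lt;database-path&gt;, where 9092 is H2's default TCP server port. A small helper to assemble it (the host and database path below are hypothetical placeholders):

```python
def h2_jdbc_url(host: str, database_path: str, port: int = 9092) -> str:
    """Assemble a server-mode H2 JDBC URL.

    H2's default TCP server port is 9092. database_path is the path
    the H2 server exposes, e.g. '~/appdb' or '/data/appdb'.
    """
    return f"jdbc:h2:tcp://{host}:{port}/{database_path}"

# Hypothetical host and database path, for illustration only:
print(h2_jdbc_url("h2.internal.example.com", "~/appdb"))
# → jdbc:h2:tcp://h2.internal.example.com:9092/~/appdb
```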
Checking network connectivity
Before establishing a connection between Matillion Data Productivity Cloud and your H2 database, you must ensure that the database is configured to allow incoming connections from the appropriate sources, depending on your deployment type:
- Full SaaS Agent Configuration: Allow incoming connections from the IP addresses maintained by Matillion. The current list of required IP addresses is published at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/ Ensure your H2 Database firewall or network access control lists (ACLs) permit connections from these IP ranges.
- Hybrid SaaS Deployment: Allow incoming connections from your own Virtual Private Cloud (VPC), as this is where the Matillion agent will be running. To test and verify network access from your VPC to the H2 Database, you can use the network test utilities available at: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Additionally, if your H2 database is referenced using a DNS hostname rather than a direct IP address, ensure that the Full SaaS or Hybrid SaaS agent can resolve that hostname. This may require updating DNS configurations or providing appropriate resolver access within your environment.
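Before configuring the connection itself, it can save debugging time to confirm, from the network where the agent runs, that the H2 host both resolves and accepts TCP connections. A minimal stand-alone check in Python (the hostname below is a placeholder; 9092 is H2's default TCP server port):

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver can turn the hostname into an IP."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder hostname; substitute your H2 server's address.
host = "h2.internal.example.com"
print("DNS resolves:", can_resolve(host))
print("TCP reachable:", can_resolve(host) and can_connect(host, 9092))
```

If DNS resolution fails but a direct IP connects, the problem is resolver access; if resolution succeeds but the TCP connection times out, look at firewalls and ACLs.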
Querying Data from an H2 Database
This document provides guidance for technical users on querying data from an H2 Database, with practical SQL examples and proven data loading patterns. This advice considers environments where H2 Database is used with other systems (such as Databricks), which may involve datatype conversions.
Example H2 Database Queries
You can perform queries in H2 Database just like you would with standard SQL. Here are some SELECT statement examples:
```
-- Retrieve all rows from a table
SELECT * FROM employees;

-- Retrieve specific columns and apply a filter
SELECT id, name, department FROM employees WHERE department = 'Engineering';

-- Use aggregation
SELECT department, COUNT(*) AS count FROM employees GROUP BY department;

-- Join two tables
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department = d.id
WHERE d.location = 'London';
```
Note: H2 SQL syntax aims for compatibility with standard SQL, but check the H2 documentation for specifics.
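Because these queries are standard SQL, their shape can be exercised against any SQL engine. The snippet below uses Python's built-in sqlite3 module purely as a stand-in for H2 (the schema and rows are invented for illustration; run the real queries against H2 over JDBC):

```python
import sqlite3

# In-memory SQLite database standing in for H2; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, department TEXT);
    INSERT INTO employees VALUES
        (1, 'Ada',   'Engineering'),
        (2, 'Grace', 'Engineering'),
        (3, 'Edgar', 'Sales');
""")

# Filtered select, mirroring the H2 example above
rows = conn.execute(
    "SELECT id, name, department FROM employees WHERE department = ?",
    ("Engineering",),
).fetchall()
print(rows)

# Aggregation, mirroring the H2 GROUP BY example above
counts = dict(conn.execute(
    "SELECT department, COUNT(*) FROM employees GROUP BY department"
))
print(counts)  # counts per department
```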
Datatype Conversion Between H2 Database and Databricks
When transferring data from H2 Database to Databricks, be aware that datatype conversions may occur. For example:
- H2 VARCHAR may be mapped to STRING in Databricks
- H2 INTEGER or BIGINT to Databricks INT or BIGINT
- H2 DATE, TIME, and TIMESTAMP types to Databricks DATE or TIMESTAMP
- BLOB/CLOB fields may require special handling
Always review and test your pipeline to validate datatype compatibility and to prevent data loss or corruption.
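One way to keep these conversions explicit and testable is to maintain the mapping as data in your pipeline code. The table below is an illustrative, non-exhaustive sketch; verify each mapping against your own schema, especially for BLOB/CLOB and precision-sensitive types:

```python
# Illustrative H2 -> Databricks type mapping. Not exhaustive; review
# precision (DECIMAL) and large-object (BLOB/CLOB) columns case by case.
H2_TO_DATABRICKS = {
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "DOUBLE": "DOUBLE",
    "DECIMAL": "DECIMAL",
    "BOOLEAN": "BOOLEAN",
    "DATE": "DATE",
    "TIME": "TIMESTAMP",  # Databricks has no TIME type
    "TIMESTAMP": "TIMESTAMP",
}

def target_type(h2_type: str) -> str:
    """Look up the Databricks type for an H2 type, defaulting to STRING."""
    return H2_TO_DATABRICKS.get(h2_type.upper(), "STRING")

print(target_type("varchar"))  # STRING
print(target_type("CLOB"))     # falls back to STRING; may need special handling
```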
Data Loading Patterns: Initial Load and Incremental Loads
A robust and efficient loading pattern is to perform:
- A once-off initial load to capture all existing data
- Ongoing incremental loads to capture only new or updated records
This approach applies to integrations where you use a Database Query component (like those found in ETL tools).
Initial Load Example
The initial load fetches all records. The SQL query includes no WHERE filter, so it retrieves the entire dataset:
SELECT * FROM employees;
In your Database Query component, leave the filter/query clause empty or as broad as possible.
Incremental Load Example
For incremental loads, apply a filter so you only pick up new or changed records since the last extraction. The filter often relies on a column such as a timestamp (last_modified) or a unique incremental ID.
SELECT * FROM employees WHERE last_modified > '2024-05-31 23:59:59';
In your ETL workflow, parameterize the filter value so each run picks up only data added or changed since the last run (see more on incremental load patterns).
Best Practice: Use the same Database Query component for both initial and incremental loads; just adapt the query filter based on the phase (initial vs incremental).
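The initial-versus-incremental pattern above can be sketched as a single parameterized extraction function, where a missing watermark means "full load". SQLite stands in for H2 here; the table and watermark column mirror the examples above, but the rows are invented for illustration:

```python
import sqlite3

# SQLite stands in for H2; data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, last_modified TEXT);
    INSERT INTO employees VALUES
        (1, 'Ada',   '2024-05-30 10:00:00'),
        (2, 'Grace', '2024-06-01 09:15:00');
""")

def extract(conn, watermark=None):
    """Initial load when watermark is None; otherwise incremental.

    The same query serves both phases: only the WHERE filter changes,
    matching the single-component pattern described above.
    """
    if watermark is None:
        return conn.execute("SELECT * FROM employees").fetchall()
    return conn.execute(
        "SELECT * FROM employees WHERE last_modified > ?", (watermark,)
    ).fetchall()

initial = extract(conn)                             # full extraction
incremental = extract(conn, "2024-05-31 23:59:59")  # only newer rows
print(len(initial), len(incremental))  # 2 1
```

After each run, persist the highest last_modified value seen and pass it as the watermark for the next run.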
Data Integration Architecture
Loading data in advance of integration represents an important "divide and conquer" strategy: by first extracting and loading raw data into the Databricks database, data engineers can then perform the more complex data transformation and integration as a distinct, second phase. This separation of concerns is a central advantage of the ELT (Extract, Load, Transform) architecture, as it lets teams address extraction, storage, and transformation independently, increasing flexibility and efficiency. A key aspect of successful integration is data transformation, and data transformation pipelines are considered the best practice for this task, as they enable structured, repeatable, and auditable processing of data at scale. Another significant benefit of ELT in the Databricks environment is that all transformation and integration activities occur directly inside the target Databricks database itself, using high-performance, on-demand compute resources. This architecture ensures integration jobs are both fast and scalable, while eliminating the need to provision or manage separate data processing infrastructure, ultimately reducing both complexity and cost.