Extracting data from Apache Derby (Java DB) to Databricks
Extracting data from Apache Derby is a common task for organizations looking to consolidate or analyze their data on modern cloud platforms such as Databricks. Whether you are migrating historical records or setting up ongoing data pipelines, a clear and structured approach is essential for success. In this article, we will guide you through the key steps required to move your data efficiently and securely from Apache Derby to Databricks. The sections will cover:

- **Creating an identity in Apache Derby:** Setting up appropriate credentials to ensure secure access to your data.
- **Checking or acquiring the JDBC driver (for Matillion users):** Ensuring that your ETL tool is equipped with the necessary software to connect to Apache Derby.
- **Establishing network connectivity:** Verifying that your systems can communicate with each other, from your source database to your target Databricks environment.
- **Querying data, initially and incrementally:** Techniques for extracting a first full dataset and subsequently capturing changes for ongoing data synchronization.

With these foundations, you'll be equipped to implement a robust data pipeline between Apache Derby and Databricks.
What is Apache Derby (Java DB)?
Apache Derby is a lightweight, open-source, relational database management system (RDBMS) fully implemented in Java, making it cross-platform and easy to embed in Java applications. It supports standard SQL, ACID transactions, multi-user access, and JDBC integration. Derby runs in both server and embedded modes, making it ideal for desktop, mobile, or small server applications where a large-scale RDBMS is unnecessary. Distributed under the Apache License and actively maintained by the Apache Software Foundation, Derby excels in development, testing, and production environments requiring a small footprint, ease of deployment, and straightforward integration with Java.
What is Databricks?
The Databricks Lakehouse platform unifies data lake and warehouse capabilities for modern analytics and AI. Built on Apache Spark and Delta Lake, it supports ACID transactions, scalable metadata, unified streaming and batch processing, and various data formats (Parquet, ORC, JSON). Databricks integrates with storage like Azure Data Lake, AWS S3, and Google Cloud. It features collaborative notebooks, direct SQL queries, workload management, and fine-grained security, making it ideal for enterprises aiming to simplify data engineering, enable real-time analytics, and accelerate AI across distributed environments.
Why Move Data from Apache Derby (Java DB) into Databricks
Unlocking Advanced Analytics: Transferring Data from Apache Derby to Databricks
A data engineer or architect may find it advantageous to copy data from Apache Derby into Databricks for several reasons. Apache Derby, despite being a lightweight, embedded database, often contains data that is potentially valuable for further analysis and decision-making. However, the true value of this data is realized when it is integrated with information from other sources, enabling comprehensive analytics, advanced reporting, and machine learning workflows. By bringing Derby data into Databricks, organizations can leverage Databricks’ powerful data integration and processing capabilities without increasing the workload or operational complexity on the Derby environment. This approach not only preserves the performance and stability of the Derby instance but also enables seamless collaboration and data processing at scale within the Databricks ecosystem.
Creating a User in Apache Derby
Derby uses built-in authentication and authorization mechanisms. User identities and passwords are typically defined using system properties, rather than explicit SQL user creation commands. This approach is suitable for embedded and client/server applications.
Below, you'll find instructions to set up a Derby database with user authentication, create users, and assign appropriate privileges.
1. Enabling User Authentication
To require users to authenticate, set the following properties. You can specify them:

- As startup properties in the `derby.properties` file (preferred for development or server use)
- By calling the `SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY` system procedure within a SQL session (for database-internal settings)
- As JVM system properties on the command line (for some environments)

```properties
# derby.properties (place alongside your Derby database)
derby.connection.requireAuthentication=true
derby.authentication.provider=BUILTIN
```
2. Defining Users and Passwords
Define users via system properties. For each user, specify a property of this form:

```properties
derby.user.<username>=<password>
```

For example, to create the users `alice` and `bob`:

```properties
derby.user.alice=securePassword1
derby.user.bob=securePassword2
```

Add these lines after the authentication properties in your `derby.properties` file.

Note: Derby does not provide standard SQL `CREATE USER` or `ALTER USER` statements.
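Putting the two steps together, a minimal `derby.properties` file that enables BUILTIN authentication with the two example users might look like this (the user names and passwords are illustrative):

```properties
# Require all connections to authenticate via Derby's BUILTIN provider
derby.connection.requireAuthentication=true
derby.authentication.provider=BUILTIN

# Example user definitions (illustrative credentials -- use secrets management in practice)
derby.user.alice=securePassword1
derby.user.bob=securePassword2
```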
3. Connecting as a Specific User
When connecting to your Derby database, supply the username and password via the JDBC URL or the appropriate connection parameters.
JDBC URL example:

```java
// Java code to connect as 'alice'
String url = "jdbc:derby:/path/to/mydb;user=alice;password=securePassword1";
Connection conn = DriverManager.getConnection(url);
```

Or, with the `ij` SQL client command line:

```
ij> connect 'jdbc:derby:/path/to/mydb;user=alice;password=securePassword1';
```
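The connection URL can also be assembled programmatically. The sketch below uses a small hypothetical helper, `derbyUrl`, to build an embedded-mode Derby JDBC URL with credentials; the helper name and the database path are illustrative, not part of Derby's API:

```java
// Sketch: build a Derby JDBC connection URL with credentials.
// derbyUrl is a hypothetical helper, not part of the Derby API.
public class DerbyUrlExample {
    static String derbyUrl(String dbPath, String user, String password) {
        // Derby connection attributes are appended as ;key=value pairs
        return "jdbc:derby:" + dbPath + ";user=" + user + ";password=" + password;
    }

    public static void main(String[] args) {
        String url = derbyUrl("/path/to/mydb", "alice", "securePassword1");
        System.out.println(url);
        // With the Derby driver on the classpath, you would then call:
        // Connection conn = DriverManager.getConnection(url);
    }
}
```

For a network-server connection the URL prefix would instead be `jdbc:derby://host:port/dbname`, with the same attribute syntax.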
4. Assigning Privileges
By default, users must be granted permissions to access database objects. For example, to grant `alice` SELECT access on table `EMPLOYEES`:

```sql
GRANT SELECT ON EMPLOYEES TO alice;
```

To grant all table-level privileges at once:

```sql
GRANT ALL PRIVILEGES ON HR.EMPLOYEES TO alice;
```

Note that Derby grants privileges per table or routine; there is no schema-wide GRANT statement, so each table must be granted individually.
You must issue these commands while connected as a user with sufficient privileges (typically, the database owner/creator).
Further Reading:
- Derby Developer's Guide: Working with user authentication
- Derby Reference Manual: GRANT statement
Installing the JDBC driver
At the time of writing, the Apache Derby JDBC driver is not included by default in Matillion Data Productivity Cloud installations due to licensing or redistribution restrictions. You will need to manually download and install the driver to enable JDBC connectivity with Apache Derby.
Step 1: Download the Apache Derby JDBC Driver
To obtain the appropriate JDBC driver, navigate to the official Apache Derby downloads page:
Look specifically for the Type 4 JDBC driver, which is implemented entirely in Java and is the most portable option.
When downloading, ensure you choose the version of the driver that corresponds to the version of Derby you intend to use.
Step 2: Install the Driver in Matillion Data Productivity Cloud
Once the JDBC driver JAR file has been downloaded to your local environment, you will need to upload it to your Matillion Agent as an external driver.
Refer to the following documentation for step-by-step instructions on uploading and managing external JDBC drivers in your environment:
Follow the outlined procedures to ensure the driver is correctly placed and recognized by your agent.
Step 3: Configure and Use the Driver in Matillion
After installation, consult the product documentation for guidance on how to connect to your Apache Derby database and construct queries using the newly installed driver:
This documentation provides information on configuring connections, specifying connection strings, and utilizing the JDBC driver within your workflows.
By following these steps, you'll ensure secure and functional connectivity between Matillion Data Productivity Cloud and your Apache Derby database instance via JDBC.
Checking network connectivity
To ensure successful communication between Matillion Data Productivity Cloud and your Apache Derby database, you must confirm that the database is configured to allow incoming network connections, depending on your agent deployment model:

- **Full SaaS agent:** You must configure your Apache Derby database to allow inbound connections from the IP addresses used by Matillion Data Productivity Cloud's SaaS agents. The currently required IP address list is maintained at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/. Make sure all these addresses are permitted through your firewall or security group settings.
- **Hybrid SaaS deployment:** When using a Hybrid SaaS agent, incoming connections should be allowed from within your own Virtual Private Cloud (VPC) or equivalent network segment. Verify that firewall rules or network ACLs allow access from your VPC to the Derby database server. To help validate connectivity and troubleshoot, you can use the available utilities at: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
Additionally, if the Apache Derby database is referenced by its DNS hostname instead of an IP address, the respective Full SaaS or Hybrid SaaS agent must be able to resolve this DNS name to the appropriate IP address. Ensure that all required DNS records are accessible and correctly configured for your deployment type.
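As a quick sanity check from the agent's network, a short Java sketch can verify both requirements above: that the Derby host name resolves via DNS and that its port accepts TCP connections (the Derby Network Server listens on port 1527 by default). The host name below is illustrative:

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

public class DerbyReachability {
    // Returns true if the host name resolves via DNS
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Returns true if a TCP connection to host:port succeeds within timeoutMs
    static boolean reachable(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = "derby.example.internal"; // illustrative host name
        System.out.println("resolves:  " + resolves(host));
        System.out.println("reachable: " + reachable(host, 1527, 2000));
    }
}
```

If resolution fails for the Full SaaS agent but succeeds inside your VPC, the host is likely only registered in private DNS.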
Querying Data from an Apache Derby Database
This guide explains how to query data from an Apache Derby database, with SQL examples, and discusses best practices for integration with analytical platforms like Databricks, especially in the context of ETL and data migration patterns.
Basic SQL SELECT Query Examples
Below are common SQL SELECT queries for Apache Derby. These statements can be executed within tools like the `ij` utility, via JDBC in applications, or through ETL platforms.
```sql
-- Select all columns from a table
SELECT * FROM employees;

-- Select specific columns
SELECT first_name, last_name, hire_date FROM employees;

-- Filtering results with a WHERE clause
SELECT * FROM orders WHERE status = 'SHIPPED';

-- Aggregating results
SELECT department, COUNT(*) AS num_employees
FROM employees
GROUP BY department;

-- Joining tables
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2023-01-01';
```
Datatype Conversion Between Apache Derby and Databricks
When migrating data or querying across platforms (e.g., moving data from Derby into Databricks Delta Lake), datatype conversion may occur. For example:

- Apache Derby's `INTEGER` maps to Databricks' `INT`.
- `TIMESTAMP` columns should be checked for compatible timezone behavior.
- `DECIMAL(p,s)` types are supported, but precision handling should be verified.
- Derby's `VARCHAR` fields may need max length adjustment depending on Databricks destination schemas.
- Nullability and some default values may require extra handling in transformation code.
Testing with small data samples is recommended to validate datatype compatibility in your pipeline.
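To make the mapping concrete, here is an illustrative sketch of a Derby-to-Databricks type mapping function. The mappings shown are a reasonable starting point, not an authoritative or exhaustive table; verify each one against your actual schemas:

```java
import java.util.Map;

public class DerbyTypeMapping {
    // Illustrative mapping of common Derby SQL types to Databricks SQL types.
    // Not exhaustive; verify against your actual schemas before relying on it.
    static final Map<String, String> MAPPING = Map.of(
            "INTEGER", "INT",
            "BIGINT", "BIGINT",
            "SMALLINT", "SMALLINT",
            "DOUBLE", "DOUBLE",
            "VARCHAR", "STRING",      // Databricks STRING is unbounded; check source max lengths
            "CHAR", "STRING",
            "TIMESTAMP", "TIMESTAMP", // confirm timezone semantics on both sides
            "DATE", "DATE",
            "DECIMAL", "DECIMAL"      // carry over precision and scale explicitly
    );

    static String toDatabricksType(String derbyType) {
        // Strip any length/precision suffix, e.g. VARCHAR(255) -> VARCHAR
        String base = derbyType.toUpperCase().replaceAll("\\(.*\\)", "").trim();
        return MAPPING.getOrDefault(base, "STRING"); // fall back to STRING
    }
}
```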
Query Pattern: Initial and Incremental Loads
A common, production-grade data load strategy involves:
1. Once-Off Initial Load (Full Table)
- Use the Database Query component without a filter in the `WHERE` clause.
- For example, to load all customers into Databricks:

```sql
SELECT * FROM customers;
```
This loads the full dataset, ensuring a clean baseline.
2. Incremental Loads (Delta/Changed Data Only)
- Apply a filter to fetch only new or modified records.
- Typically, you use a date/timestamp column or auto-increment key as a filter.
- For example, to fetch only orders created or updated since a specific timestamp:

```sql
SELECT * FROM orders WHERE last_modified > '2024-06-01 00:00:00';
```
The same Database Query component and query logic is usable for both load types; the only difference is the presence or omission of the filtering clause.
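The pattern can be sketched as a small query-building helper: with no watermark you get the initial full load, and with a watermark you get the incremental filter. The table and column names are illustrative, and a real pipeline would store the watermark in the ETL tool and bind it via a `PreparedStatement` rather than concatenating strings:

```java
public class IncrementalQueryBuilder {
    // Builds a full-load query when watermark is null, otherwise an
    // incremental query filtered on the watermark column.
    // Sketch only: a real pipeline should bind the watermark as a
    // PreparedStatement parameter instead of concatenating strings.
    static String buildQuery(String table, String watermarkColumn, String watermark) {
        String base = "SELECT * FROM " + table;
        if (watermark == null) {
            return base; // initial full load
        }
        return base + " WHERE " + watermarkColumn + " > '" + watermark + "'";
    }
}
```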
Read more detailed strategy and usage recommendations in Matillion’s Incremental Load Data Replication Strategy article.
Tip: Always thoroughly test both initial and incremental loads, specifically validating row count comparisons and handling of deletions or late-arriving data.
Data Integration Architecture
Loading data in advance of integration represents an effective "divide and conquer" strategy by segmenting the data integration process into two distinct steps. This approach, a key tenet of the Extract, Load, Transform (ELT) architecture, allows organizations to load raw data into Databricks before performing the necessary integration. Efficient data integration requires robust transformation, and the optimal method for this is through data transformation pipelines that can be automated and scaled as needed. A core advantage of the ELT approach is that transformation and integration take place entirely inside the target Databricks database. This enables processing to be fast, on-demand, and highly scalable, eliminating the need for – and associated costs of – external data processing infrastructure.