Integrate data from Apache Derby (Java DB) to Amazon Redshift using Matillion

Our Apache Derby to Redshift connector enables seamless data migration to Redshift within minutes, ensuring your information remains current without requiring manual coding or complex ETL scripts.


Extracting data from Apache Derby (Java DB) to Amazon Redshift

Extracting data from Apache Derby is a common prerequisite for organizations seeking to leverage the analytical power and scalability of cloud data warehouses such as Amazon Redshift. This article outlines a streamlined process for efficiently migrating your data from Apache Derby to Redshift, with practical steps tailored to users of Matillion ETL. In the following sections, we will begin by creating an identity within Apache Derby to facilitate secure access. For Matillion users, we will explain how to check for the required JDBC driver—or acquire it if not already present. We will also address how to ensure reliable network connectivity between your Derby instance and Redshift, a crucial step for smooth data transfer. Finally, we will discuss best practices for querying your data—covering both initial full extracts and ongoing incremental updates—to maximize reliability and performance during your migration. By following this guide, you will establish a robust foundation for seamless data extraction and loading between Apache Derby and Amazon Redshift.


What is Apache Derby (Java DB)?

Apache Derby is a lightweight, open-source, relational database management system (RDBMS) fully implemented in Java and distributed under the Apache License, Version 2.0. Designed for embedded scenarios, Derby operates with minimal memory and disk footprint, making it well-suited for Java applications that require an in-process database without the overhead of a traditional client-server architecture. Derby supports standard SQL, JDBC, and transactions, and offers features such as multi-user concurrent access and security controls. Its ease of integration, cross-platform compatibility, and active community support make it a practical choice for prototyping, testing, or embedding within Java applications where simplicity and efficiency are priorities.


What is Amazon Redshift?

Amazon Redshift is a fully managed, cloud-based data warehouse built on PostgreSQL, designed for large-scale analytics. It features columnar storage, massively parallel processing (MPP), and advanced query optimization for high-performance querying on petabyte-scale data. Redshift integrates with AWS services like S3, Glue, and IAM, supporting robust ETL, security, and access control. With Redshift Spectrum, users can query data directly in S3 without moving it. Offering automated backups, encryption, workload management, scalability, and pay-as-you-go pricing, Redshift is a popular choice for organizations seeking business intelligence and advanced analytics capabilities in the cloud.

Why Move Data from Apache Derby (Java DB) into Amazon Redshift

Unlocking Powerful Analytics: Transferring Data from Apache Derby to Redshift

A data engineer or architect may wish to copy data from Apache Derby into Redshift for several compelling reasons. Firstly, Apache Derby may house unique or historical data that holds significant potential value for business analytics or decision-making processes. However, the true value of this data can often be realized only when it is integrated with information from other sources, enabling comprehensive analysis and richer business insights. Redshift, as a cloud-based data warehouse, provides a powerful platform for such data integration owing to its scalability and performance. By transferring data into Redshift, organizations can perform complex analyses and join Derby data with datasets from other operational systems, all without burdening the Apache Derby database. This approach not only safeguards the transactional performance of Derby but also leverages Redshift’s optimized infrastructure for analytical workloads, facilitating efficient data-driven strategies.

Creating a User Identity in Apache Derby Database

Apache Derby does not use a traditional `CREATE USER` SQL statement. Instead, Derby handles user authentication by validating user credentials (username and password) against configured properties, so users are "created" by specifying their credentials in Derby's property system. The configuration method depends on the authentication scheme in use (e.g., property-based, LDAP). This guide demonstrates how to configure user credentials using Derby's property-based authentication.

Step 1: Enable User Authentication

To require authentication, set the following Derby property in your `derby.properties` file (located in the database system directory) or via SQL as a database property:

```properties
derby.connection.requireAuthentication=true
```

Step 2: Define Users Using Derby Properties

Define the user credentials by adding a property line for each user you wish to enable. The pattern is:

```properties
derby.user.<username>=<password>
```

For example, to create a user named `appuser` with the password `secretpwd`, add the following line to your `derby.properties` file:

```properties
derby.user.appuser=secretpwd
```

Alternatively, you can set these properties at the database level with the `SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY` system procedure (this requires database owner privileges when SQL authorization is enabled; Derby has no SYSDBA role):

```sql
CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY(
    'derby.user.appuser',
    'secretpwd'
);
```

Note: For changes to take effect, restart the database.

Step 3: (Optional) Specify the Authentication Provider

You may specify the authentication mechanism explicitly, such as Derby's `BUILTIN` provider:

```properties
derby.authentication.provider=BUILTIN
```

Or, via SQL:

```sql
CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY(
    'derby.authentication.provider',
    'BUILTIN'
);
```
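
For embedded deployments, Derby also reads system-wide properties from JVM system properties, so the same settings can be established programmatically before the engine boots. A minimal sketch of that alternative (the class name is illustrative, and hard-coding a password is for demonstration only):

```java
public class DerbySecuritySetup {
    public static void main(String[] args) {
        // Derby treats JVM system properties as system-wide Derby properties,
        // an alternative to derby.properties for embedded deployments.
        // They must be set before the Derby engine is booted.
        System.setProperty("derby.connection.requireAuthentication", "true");
        System.setProperty("derby.authentication.provider", "BUILTIN");
        System.setProperty("derby.user.appuser", "secretpwd");

        System.out.println(System.getProperty("derby.connection.requireAuthentication"));
    }
}
```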

Step 4: Connect to the Database as the New User

Once configured and the database is restarted, applications or users may connect using:

```
jdbc:derby://localhost:1527/myDB;user=appuser;password=secretpwd
```

or, in embedded mode:

```
jdbc:derby:myDB;user=appuser;password=secretpwd
```
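
From Java, the same connection is made through JDBC. The sketch below assumes the Derby network server is running on localhost:1527 with the client driver (`derbyclient.jar`) on the classpath; it simply reports success or failure rather than assuming either:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DerbyLogin {
    public static void main(String[] args) {
        // Network-server URL; adjust host, port, and database name for your setup.
        String url = "jdbc:derby://localhost:1527/myDB";
        try (Connection conn = DriverManager.getConnection(url, "appuser", "secretpwd")) {
            System.out.println("Connected as appuser");
        } catch (SQLException e) {
            // Lands here if the server is not running or the Derby client
            // driver is not on the classpath.
            System.out.println("Connection failed: " + e.getMessage());
        }
    }
}
```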

Example `derby.properties` File

```properties
# Require authentication
derby.connection.requireAuthentication=true

# Use Derby's built-in authentication mechanism
derby.authentication.provider=BUILTIN

# Define users
derby.user.appuser=secretpwd
derby.user.analytics=report2024
```

With these settings, both `appuser` and `analytics` can authenticate to the Derby database.

Installing the JDBC driver

At the time of writing, the Apache Derby JDBC driver is not included with Matillion Data Productivity Cloud by default. This is due to licensing or redistribution restrictions that prevent bundling the driver in the product distribution. If you require connectivity to Apache Derby databases, you will need to manually download and upload the necessary JDBC driver to your Matillion environment.

Follow the instructions below to download and install the JDBC driver for Apache Derby:

  1. Download the JDBC Driver
     • Go to the official Apache Derby download page at https://db.apache.org/derby/derby_downloads.html.
     • Locate a Derby version that is compatible with your database.
     • Download the Type 4 JDBC driver for Apache Derby. A Type 4 driver is pure Java and is preferred for its ease of integration and portability within Matillion.

  2. Install the JDBC Driver in Matillion Data Productivity Cloud
     • Review Matillion's instructions for uploading and using external JDBC drivers: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
     • These instructions explain how and where to upload the JDBC driver `.jar` files within your Matillion instance, and how to verify that the driver has been registered successfully.

  3. Configure and Use the Driver
     • Once the JDBC driver for Derby has been installed, refer to Matillion's documentation on using custom drivers in database integrations: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
     • These instructions guide you through establishing database connections, setting authentication details, and executing queries via the new JDBC driver.

By following these steps and using the linked documentation, you can successfully connect Matillion Data Productivity Cloud to your Apache Derby database with a manually supplied JDBC driver.

Checking network connectivity

To allow Matillion Data Productivity Cloud to connect to your Apache Derby database, you must ensure that the database allows incoming connections based on your deployment configuration:

  • Full SaaS agent: You must configure your network firewall or security group to allow incoming connections to the Apache Derby database from the IP addresses listed here: Matillion Allowed IP Addresses.
  • Hybrid SaaS deployment: You must allow incoming connections from your own virtual private cloud (VPC). To verify network accessibility, use the tools and utilities found here: Matillion Exchange - Check Network Access.

If your Apache Derby database is referenced using a DNS name, ensure that the Full SaaS or Hybrid SaaS agent can resolve the DNS address to the correct IP. This may require appropriate DNS configuration, such as providing access to your internal DNS servers if the database uses a private DNS name.
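
Whichever deployment you use, it helps to confirm that the Derby network server port (1527 by default) is actually reachable from the agent's network before configuring the connection in Matillion. A small, hypothetical probe:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class DerbyPortCheck {
    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean canReach(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Replace with your Derby host; 1527 is the default network server port.
        System.out.println(canReach("localhost", 1527, 2000));
    }
}
```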

Querying Data from an Apache Derby Database

This guide provides technical guidance on how to query data from an Apache Derby database, including SQL query examples, notes on datatype conversion with Amazon Redshift, and best practices for supporting both initial and incremental loads using the Database Query component.


Example Apache Derby SQL SELECT Queries

The core operation for extracting data from Apache Derby is the `SELECT` statement. Here are some illustrative examples:

```sql
-- Retrieve all rows and columns from "ORDERS"
SELECT * FROM ORDERS;

-- Select specific columns
SELECT ORDER_ID, CUSTOMER_NAME, ORDER_DATE FROM ORDERS;

-- Filter results with a WHERE clause
SELECT * FROM ORDERS WHERE ORDER_DATE >= '2023-01-01';

-- Aggregate results with GROUP BY
SELECT CUSTOMER_ID, COUNT(*) AS ORDER_COUNT
FROM ORDERS
GROUP BY CUSTOMER_ID;
```

Datatype Conversion Considerations (Apache Derby to Redshift)

When querying from Apache Derby and loading into Amazon Redshift, remember:

  • Some Derby datatypes do not map directly to Redshift datatypes. For example:
    • Derby `INTEGER` converts naturally to Redshift `INTEGER`.
    • Derby `VARCHAR` converts to Redshift `VARCHAR` (also called `CHARACTER VARYING`).
    • Derby `TIMESTAMP`, `DATE`, and `TIME` values may require format adjustments when loading into Redshift.
  • Always review and align target table DDLs in Redshift to ensure seamless loading.
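
When generating Redshift DDL from Derby metadata, it can help to centralize the type mapping in one place. The mapping below is a sketch only: the entries are common choices, not an official Matillion or Redshift specification, and the wide-`VARCHAR` fallback is an assumption to adjust for your data:

```java
import java.util.Map;

public class DerbyToRedshiftTypes {
    // Common Derby -> Redshift type choices; verify against your actual schema.
    static final Map<String, String> TYPE_MAP = Map.of(
        "SMALLINT", "SMALLINT",
        "INTEGER", "INTEGER",
        "BIGINT", "BIGINT",
        "DOUBLE", "DOUBLE PRECISION",
        "VARCHAR", "VARCHAR",
        "CHAR", "CHAR",
        "DATE", "DATE",
        "TIMESTAMP", "TIMESTAMP",
        "BOOLEAN", "BOOLEAN"
    );

    // Falls back to a wide VARCHAR for types with no direct Redshift equivalent
    // (e.g. Derby CLOB); 65535 bytes is the Redshift VARCHAR maximum.
    static String redshiftType(String derbyType) {
        return TYPE_MAP.getOrDefault(derbyType.toUpperCase(), "VARCHAR(65535)");
    }

    public static void main(String[] args) {
        System.out.println(redshiftType("double"));
        System.out.println(redshiftType("CLOB"));
    }
}
```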

Initial vs Incremental Data Loads: Usage Patterns

Recommended Load Pattern:

- Initial Load: Extract all required data at once.
- Incremental Loads: Periodically extract only new or changed data.
- Both patterns use the same Database Query component but differ in their SQL query construction.

Example: Initial Load (Full Table, No Filter)

The Database Query in this stage has no filter clause.

```sql
SELECT ORDER_ID, ORDER_DATE, TOTAL_AMOUNT
FROM ORDERS;
```

Example: Incremental Load (Filtered Data)

Here, the Database Query includes a filter: for example, selecting only records added since the last load, based on an `ORDER_DATE` column:

```sql
SELECT ORDER_ID, ORDER_DATE, TOTAL_AMOUNT
FROM ORDERS
WHERE ORDER_DATE > '2024-06-03 00:00:00';
```
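
In practice the cutoff timestamp should not be hard-coded: store the high-water mark from the previous run (in Matillion, typically a pipeline variable) and substitute it into the query each time. A hypothetical helper illustrating the pattern (the class and method names are illustrative, not a Matillion API):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class IncrementalQueryBuilder {
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Builds the filtered SELECT used for an incremental load, wrapping the
    // watermark in Derby's TIMESTAMP() function to form a timestamp value.
    static String incrementalQuery(String table, String watermarkColumn,
                                   LocalDateTime lastLoaded) {
        return "SELECT ORDER_ID, ORDER_DATE, TOTAL_AMOUNT FROM " + table
             + " WHERE " + watermarkColumn
             + " > TIMESTAMP('" + lastLoaded.format(TS) + "')";
    }

    public static void main(String[] args) {
        System.out.println(incrementalQuery(
            "ORDERS", "ORDER_DATE", LocalDateTime.of(2024, 6, 3, 0, 0)));
    }
}
```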


Tip: This pattern supports robust, repeatable data extraction workflows in ELT/ETL pipelines with seamless Apache Derby to Amazon Redshift transitions.

Data Integration Architecture

Loading data into Amazon Redshift in advance of integration exemplifies the "divide and conquer" strategy by separating the process into two distinct stages: extraction and loading first, then transformation and integration afterward. This is a core advantage of the ELT (Extract, Load, Transform) architecture, which helps to simplify workflows and avoid complications that arise from trying to transform data before loading it. Data integration inevitably involves data transformation, and the most effective approach is to use purpose-built data transformation pipelines that can efficiently reshape, cleanse, and enrich data within Redshift. Another significant benefit of the ELT approach is that all transformation and integration tasks occur directly inside the target Redshift database. This means these operations are executed quickly and on-demand, leveraging Redshift's powerful compute resources, ensuring scalability, and eliminating the need to maintain or pay for separate data processing infrastructures outside the database.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.