Extracting data from NuoDB to Databricks
Extracting data from NuoDB is a valuable step in enabling advanced analytics, reporting, or integration with modern platforms such as Databricks. Efficient data extraction and loading workflows can unlock the full potential of your organizational data by bridging traditional RDBMS systems with cloud-based analytics environments. In this article, we will guide you through a systematic approach to migrating or synchronizing data from NuoDB to Databricks. We begin by covering the creation of an identity in NuoDB to manage secure and trackable data access. Next, we outline the requirements for connecting via JDBC, with special attention to users leveraging Matillion for orchestration—making sure the correct JDBC driver is in place. Network connectivity considerations between the source (NuoDB) and the target (Databricks or your ingestion layer) are then discussed to help you avoid common pitfalls. Finally, we examine best practices for querying your data, including strategies for both full initial extracts and incremental data loading. Whether you are building a one-off migration pipeline or establishing a recurring data integration workflow, this article will provide clear, actionable guidance for each critical step in the process.
What is NuoDB?
NuoDB is a distributed, cloud-native relational database designed to deliver SQL compliance alongside elastic scalability for modern applications. It utilizes a multi-node, peer-to-peer architecture that separates transactional processing (transaction engines) from storage management (storage managers), enabling seamless scaling both horizontally and vertically without sacrificing consistency or performance. NuoDB employs a durable, ACID-compliant approach, and supports continuous availability and online schema changes, making it well-suited for mission-critical workloads in dynamic cloud or hybrid environments. Its support for automation, containerization, and integration with orchestration platforms like Kubernetes further aligns NuoDB with contemporary DevOps and microservices architectures.
What is Databricks?
The Databricks database is an integral component of the Databricks Lakehouse Platform, designed to unify data warehousing and AI capabilities on a single, cloud-based architecture. Built on top of Apache Spark, it provides robust, scalable storage and querying functionality using Delta Lake, an open-source storage layer that ensures ACID compliance and reliable data versioning. Databricks supports both structured and semi-structured data, enabling efficient ETL workflows, interactive analytics, and machine learning operations at scale. Through features like managed tables, time travel, and seamless integration with popular data formats, the Databricks database simplifies the management of large datasets while maintaining high performance and data governance standards.
Why Move Data from NuoDB into Databricks
The Value of Transferring Data from NuoDB to Databricks for Advanced Analytics
A data engineer or architect may wish to copy data from NuoDB into Databricks for several compelling reasons. Firstly, the data residing in NuoDB often holds significant value but may be underutilized when kept in isolation. By integrating this data with information from multiple other sources in Databricks, organizations can derive deeper insights, unlock new analytical capabilities, and support more comprehensive decision-making processes. Furthermore, performing complex data processing or analytics tasks natively within NuoDB could impose additional workload and potentially impact transactional performance. By offloading these workloads to Databricks, which is specifically designed for scalable analytics and big data processing, teams ensure that NuoDB remains dedicated to its primary transactional responsibilities while still enabling value extraction from its data in a high-performance, cost-effective environment.
Creating a User in NuoDB
To provide user-level access control in NuoDB, you can create database users (also called identities). Users are typically managed using SQL commands through the NuoDB SQL client (`nuosql`) or any SQL interface connected to your NuoDB database.
Below are step-by-step instructions for creating a user in NuoDB.
Prerequisites
- Ensure you have administrative privileges (for example, connecting as the `dba` user).
- You have access to the `nuosql` command line or an equivalent SQL interface.
1. Connect to the Database
Use the following command to start a SQL session, replacing `<db-name>`, `<db-host>`, and `<db-port>` as required:

```
nuosql <db-name>@<db-host>:<db-port> -u dba -p <admin-password>
```
2. Create a User
NuoDB manages login identities with the `CREATE USER` SQL statement. The basic syntax is:

```
CREATE USER <username> IDENTIFIED BY '<password>';
```
Example: to create a user named `appuser` with the password `Complex!Passw0rd`:

```
CREATE USER appuser IDENTIFIED BY 'Complex!Passw0rd';
```
3. Grant Privileges
After creating the user, grant appropriate privileges so they can operate within the database (e.g., `CONNECT` and `SELECT`).

Example: to allow the user to connect and select from all tables:

```
GRANT CONNECT TO appuser;
GRANT SELECT ON ALL TABLES TO appuser;
```
Note: Adjust privileges per your security policy and requirements.
4. Verify User Creation
You can verify the user by querying the `SYSTEM.USERS` system table:

```
SELECT * FROM SYSTEM.USERS;
```
This will list all users currently defined in the database.
5. Test User Login (Optional)
To ensure the new identity works, try connecting as the newly created user:
```
nuosql <db-name>@<db-host>:<db-port> -u appuser -p Complex!Passw0rd
```
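If you provision users as part of an automated pipeline, the SQL from steps 2 and 3 can be generated programmatically. The sketch below is a minimal Python helper, not part of NuoDB's tooling; `provision_user_sql` is a hypothetical name, and a real implementation would execute the returned statements through your SQL client or driver connection rather than printing them.

```python
import re

def provision_user_sql(username, password):
    """Build the CREATE USER / GRANT statements from the steps above.

    The username is validated against a conservative identifier pattern,
    and single quotes in the password are doubled (standard SQL escaping),
    so untrusted input cannot inject extra SQL.
    """
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", username):
        raise ValueError("unsafe username: %r" % username)
    escaped_pw = password.replace("'", "''")
    return [
        "CREATE USER %s IDENTIFIED BY '%s';" % (username, escaped_pw),
        "GRANT CONNECT TO %s;" % username,
        "GRANT SELECT ON ALL TABLES TO %s;" % username,
    ]

for stmt in provision_user_sql("appuser", "Complex!Passw0rd"):
    print(stmt)
```

Keeping the statements in one helper makes it easy to apply the same provisioning policy consistently across environments.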
Installing the JDBC Driver
At the time of writing, the JDBC driver for the NuoDB database is not included by default in Matillion Data Productivity Cloud. This omission is due to the licensing and redistribution restrictions associated with the NuoDB JDBC driver. As a result, installing the necessary driver is a manual process.
Downloading the NuoDB JDBC Driver
1. Access the NuoDB Downloads Portal: navigate to https://www.nuodb.com/downloads.
2. Locate the JDBC driver: on the downloads page, look for the JDBC driver; if multiple options are presented, choose the Type 4 driver. The Type 4 driver is a pure Java implementation that does not require native libraries and is suitable for cloud-based integration scenarios like Matillion Data Productivity Cloud.
3. Download the driver: follow the licensing steps as required by NuoDB and download the `.jar` file for the JDBC driver to your local machine.
Installing the Driver in Matillion Data Productivity Cloud
With the JDBC driver `.jar` file downloaded, you will need to upload it into your Matillion Data Productivity Cloud environment. To do this, refer to the official Matillion documentation on external driver installation.
This guide provides detailed steps for:
- Accessing the Matillion Agent.
- Uploading the `.jar` file via the user interface.
- Restarting services or validating the installation, if required.
Be sure to review any additional prerequisites or compatibility notes in the linked instructions, as these may vary depending on your deployment configuration and version of Matillion Data Productivity Cloud.
Additional Usage Guidance
Once the NuoDB JDBC driver has been successfully uploaded and installed, refer to Matillion's connector documentation for configuring and using the NuoDB connection in your data workflows. It provides step-by-step usage instructions, including how to specify connection details for your NuoDB instance and how to use the driver within data pipelines in Matillion Data Productivity Cloud.
By following these instructions, you can extend your Matillion Data Productivity Cloud environment to connect with NuoDB databases using the JDBC protocol, subject to licensing compliance and any organizational security policies.
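Wherever you configure the connection, you will need a JDBC URL for your NuoDB instance. The sketch below assembles one in Python following the `jdbc:com.nuodb://<host>:<port>/<database>` pattern; the host, port, database name, and `schema` property shown are placeholders, and you should confirm the exact URL format and supported properties against the documentation for your driver version.

```python
def nuodb_jdbc_url(host, port, database, **props):
    """Assemble a NuoDB JDBC connection URL.

    Extra keyword arguments are appended as ?key=value connection
    properties, sorted for deterministic output.
    """
    url = "jdbc:com.nuodb://%s:%d/%s" % (host, port, database)
    if props:
        url += "?" + "&".join(
            "%s=%s" % (k, v) for k, v in sorted(props.items())
        )
    return url

# Placeholder values; substitute your own host, port, and database.
print(nuodb_jdbc_url("nuodb.internal", 48004, "proddb", schema="public"))
# jdbc:com.nuodb://nuodb.internal:48004/proddb?schema=public
```

Centralizing URL construction like this avoids subtle typos when the same connection details are reused across pipelines.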
Checking network connectivity
To ensure successful connectivity between the Matillion Data Productivity Cloud and your NuoDB database, you must confirm that the NuoDB database allows incoming connections from the appropriate sources, depending on your deployment type:
- Full SaaS agent configuration: you must allow incoming connections from the set of IP addresses used by the Matillion Data Productivity Cloud. These addresses are published and kept up to date at https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/. Ensure the access rules or firewalls protecting your NuoDB instance permit traffic from these IP addresses.
- Hybrid SaaS deployment: in this configuration, your Matillion agent runs within your own infrastructure (virtual private cloud, or VPC). You should configure your NuoDB database to accept incoming connections from your VPC's IP range. To help validate connectivity from your VPC to your NuoDB database, use the network checking tools and guidance available at https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
In addition, if your NuoDB database is referenced using a DNS hostname (rather than a direct IP address), your Matillion Full SaaS or Hybrid SaaS agent must be able to resolve the given hostname to the correct IP address. Ensure there are no DNS resolution issues that would prevent successful connections.
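Both checks above, DNS resolution and TCP reachability, can be smoke-tested from the machine that will run your agent before you configure anything in Matillion. A minimal sketch using only the Python standard library; the hostname and port are placeholders for your own NuoDB endpoint:

```python
import socket

def resolve_dns(hostname):
    """Return the IP address the hostname resolves to, or None on failure."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholders: substitute your NuoDB host and SQL port.
host, port = "nuodb.example.internal", 48004
print("DNS resolves:", resolve_dns(host))
print("TCP reachable:", check_tcp(host, port, timeout=2.0))
```

If DNS resolves but the TCP check fails, the problem is almost certainly a firewall or security-group rule rather than name resolution, which narrows the troubleshooting quickly.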
Querying Data from a NuoDB Database
This guide explains how to query data from a NuoDB database, provides SQL SELECT statement examples, and outlines loading strategies—highlighting data type considerations when working with platforms like Databricks.
Example: NuoDB SQL SELECT Statements
To query data from a NuoDB database, you can use standard SQL SELECT statements. Here are some common examples:
```
-- Select all rows from a table
SELECT * FROM customers;

-- Select specific columns and filter with a WHERE clause
SELECT customer_id, name, email
FROM customers
WHERE status = 'active';

-- Aggregate data with GROUP BY
SELECT status, COUNT(*) AS total
FROM orders
GROUP BY status;

-- Join two tables
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```
Datatype Conversion Considerations
When transferring data between NuoDB and other platforms such as Databricks, datatype conversion may occur. Not all data types map directly; for example:
| NuoDB Datatype | Possible Databricks Equivalent |
|---|---|
| STRING | STRING |
| INTEGER | INT |
| FLOAT/DOUBLE | FLOAT/DOUBLE |
| TIMESTAMP | TIMESTAMP |
| BOOLEAN | BOOLEAN |
Be aware of possible precision or format differences, especially for date, timestamp, and numeric types.
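A mapping table like the one above can be captured as a small lookup used when generating target schemas. The sketch below is an illustrative starting point only, not an authoritative conversion chart: the extra entries (`VARCHAR`, `BIGINT`) and the `DECIMAL`/`NUMERIC` handling are assumptions you should verify against your actual source schema and Databricks SQL types.

```python
# Illustrative NuoDB -> Databricks (Spark SQL) type mapping, based on
# the table above; extend per the types present in your schema.
NUODB_TO_DATABRICKS = {
    "STRING": "STRING",
    "VARCHAR": "STRING",
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "TIMESTAMP": "TIMESTAMP",
    "BOOLEAN": "BOOLEAN",
}

def to_databricks_type(nuodb_type):
    """Map a NuoDB column type to a Databricks SQL type, keeping
    precision/scale suffixes such as DECIMAL(12,2) intact."""
    base = nuodb_type.strip().upper()
    if base.startswith("DECIMAL") or base.startswith("NUMERIC"):
        # Spark SQL spells fixed-point types DECIMAL(p,s).
        return base.replace("NUMERIC", "DECIMAL")
    # Fall back to STRING as a lossless (if untyped) default.
    return NUODB_TO_DATABRICKS.get(base, "STRING")

print(to_databricks_type("integer"))       # INT
print(to_databricks_type("NUMERIC(10,2)")) # DECIMAL(10,2)
```

Falling back to `STRING` for unknown types is a deliberately conservative choice: it never loses data, at the cost of deferring proper typing to a later transformation step.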
Loading Patterns: Initial and Incremental Loads
The best practice for populating your target system (e.g., Databricks) from NuoDB is:
- Once-off Initial Load: Full extract of existing data.
- Incremental Loads: Load only new or changed records, improving efficiency and reducing duplicates.
Both load types use the same Database Query component (e.g., Matillion's Database Query), but are differentiated by their SQL filter clauses.
Initial Load Example
The initial load fetches all data from the source table, so no `WHERE` clause is applied:

```
SELECT * FROM customers;
```
Incremental Load Example
For ongoing data loads, you should include a filter, such as a timestamp or an incrementing ID, to extract only new or updated rows. For example:
```
SELECT *
FROM customers
WHERE last_updated > '${last_loaded_timestamp}';
```

`last_updated` should be a column in your table indicating change time. `${last_loaded_timestamp}` is typically a parameter set by your ETL tool to remember the last successfully loaded value.
For an in-depth guide, see Matillion's Incremental Load Data Replication Strategy.
Note: ensure the filter condition for incremental loads correctly matches your data's change semantics. Both initial and incremental loads should use the same database connector/component for simplicity and consistency.
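The high-water-mark pattern behind these two load types can be sketched end to end: render the query from the last stored watermark (a missing watermark means initial load), then advance the watermark from the rows actually fetched. The function and column names below are placeholders; in Matillion, the watermark would typically live in a pipeline variable or metadata table rather than local Python state.

```python
def render_incremental_query(table, change_col, watermark):
    """Build the extract query; watermark=None means initial (full) load."""
    if watermark is None:
        return "SELECT * FROM %s;" % table
    return "SELECT * FROM %s WHERE %s > '%s';" % (table, change_col, watermark)

def advance_watermark(rows, change_col, current):
    """Return the newest change value seen, to persist for the next run."""
    values = [row[change_col] for row in rows]
    if not values:
        return current  # nothing fetched; keep the old watermark
    newest = max(values)
    return newest if current is None else max(newest, current)

# Simulated runs: the first load is full, the second is incremental.
print(render_incremental_query("customers", "last_updated", None))
rows = [
    {"customer_id": 1, "last_updated": "2024-05-01 10:00:00"},
    {"customer_id": 2, "last_updated": "2024-05-02 09:30:00"},
]
wm = advance_watermark(rows, "last_updated", None)
print(render_incremental_query("customers", "last_updated", wm))
```

Note that advancing the watermark only after a successful load is what makes reruns safe: a failed run leaves the old value in place, so the next run re-extracts the same window.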
Data Integration Architecture
Loading data in advance of integration enables organizations to divide and conquer the data integration challenge by splitting the process into two distinct steps: first, ingestion (loading data) and second, transformation and integration. This division is a key advantage of the Extract, Load, and Transform (ELT) architecture, as it enables raw data to be quickly loaded into the Databricks database before any transformation is performed. Data integration inherently requires data transformation, and the most effective way to facilitate this is by implementing robust data transformation pipelines. These pipelines systematically prepare and combine data into formats suitable for analytics or downstream operations. Another significant benefit of the ELT approach is that all transformation and integration steps occur directly within the Databricks environment. This on-platform processing is fast, supports on-demand scaling, and eliminates the need for additional, external data processing infrastructure, thereby reducing overall costs and simplifying operational management.