Extracting data from Progress OpenEdge RDBMS to Databricks
Extracting data from Progress OpenEdge RDBMS is a critical step for organizations seeking to modernize their analytics capabilities by leveraging scalable cloud platforms like Databricks. Such data migrations and integrations can pose challenges due to differences in database architectures, connectivity requirements, and data movement strategies. This article provides a practical guide to facilitate this process, from establishing a secure connection to implementing efficient data extraction methods. Specifically, we will cover how to create an identity within Progress OpenEdge RDBMS to authorize data access; for Matillion users, how to confirm or obtain the appropriate JDBC driver required for connectivity; how to configure network access to enable secure data transmission from OpenEdge to Databricks; and finally, best practices for querying data both for initial bulk loads and ongoing incremental updates. By following these steps, you will enable a robust pipeline for migrating and synchronizing your Progress OpenEdge data with Databricks.
What is Progress OpenEdge RDBMS?
The Progress OpenEdge RDBMS is a high-performance, transactional, relational database management system designed to meet the needs of mission-critical business applications. It offers robust multi-user concurrency, integrated application security, advanced replication, and comprehensive data integrity features. OpenEdge supports a dynamic schema, allowing developers to modify tables and relationships with minimal downtime, and seamlessly integrates with Progress' Advanced Business Language (ABL) for rapid application development. With built-in support for multi-tenancy, high availability through OpenEdge Replication, and scalable architecture—from on-premises to cloud deployments—the OpenEdge RDBMS is well-suited for organizations requiring flexible, reliable, enterprise-class data management solutions.
What is Databricks?
Databricks is a unified data analytics platform that provides a highly scalable and collaborative environment for data engineering, data science, and machine learning workloads. Built atop Apache Spark, Databricks integrates cloud-based data storage, such as Delta Lake, for ACID-compliant, high-performance data management. It supports a range of data access paradigms, including SQL, Python, Scala, and R, facilitating end-to-end analytics workflows. Databricks’ database capabilities enable efficient ingestion, transformation, and querying of both structured and unstructured data, with features such as auto-scaling, optimized runtime, and seamless integration with commonly used data sources and BI tools. This robust architecture allows organizations to process and analyze massive datasets with reliability, governance, and interactive collaboration.
Why Move Data from Progress OpenEdge RDBMS into Databricks
Unlocking Analytics: The Case for Copying Data from Progress OpenEdge RDBMS to Databricks
A data engineer or architect may wish to copy data from the Progress OpenEdge RDBMS into Databricks for several compelling reasons. Progress OpenEdge RDBMS frequently contains data that is potentially valuable, often underpinning critical business processes. However, the true value of this data can be fully realized only when it is integrated with information from other sources, facilitating richer insights and advanced analytics. By performing data integration and analysis within Databricks, organizations gain the ability to efficiently unify disparate data sources—including cloud and on-premises systems—without increasing the operational workload on the production OpenEdge RDBMS itself. This approach ensures that the transactional performance and reliability of the source system are preserved, while leveraging the scalable processing power of Databricks for data transformation, cleansing, and analytics.
Creating a User in Progress OpenEdge RDBMS
Progress OpenEdge RDBMS does not manage user accounts internally in the same way as some other RDBMSs, such as Oracle or SQL Server. Instead, OpenEdge defers authentication to the operating system or external security mechanisms by default. However, you can use OpenEdge’s Security Administration features to register identities, assign permissions, and, when using SQL access, create SQL-level users and roles.
Below are two common scenarios for creating and managing users in OpenEdge:
1. Creating a SQL User (SQL-92 User) via SQL
To enable SQL access (for ODBC/JDBC or SQL scripting), you need to create a user identity in the OpenEdge SQL catalog. This is typically performed using the `CREATE USER` statement, executed against the database through, for example, the SQL Explorer utility (`sqlexp`).
```sql
-- Replace 'testuser' and 'password123' as appropriate
CREATE USER 'testuser', 'password123';
```
Notes:
- Use secure passwords that comply with your organization's policies.
- The created SQL user identity can be associated with SQL privileges and roles.
Grant Permissions to the User
```sql
GRANT SELECT, INSERT, UPDATE, DELETE
ON PUB.customer
TO testuser;
```
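When a read-only extraction user needs access to many tables, it can be convenient to generate the grant statements rather than type them by hand. The sketch below is purely illustrative; the table names and privilege lists are hypothetical examples, not objects from your database.

```python
# Illustrative helper: generate one GRANT statement per table for a SQL user.
# Table names and privileges below are hypothetical examples.

def grant_statements(user, tables, privileges=("SELECT",)):
    """Build a GRANT statement for each table in the PUB schema."""
    privs = ", ".join(privileges)
    return [f"GRANT {privs} ON PUB.{table} TO {user};" for table in tables]

for stmt in grant_statements("testuser", ["customer", "orders"],
                             privileges=["SELECT", "INSERT", "UPDATE", "DELETE"]):
    print(stmt)
```

The generated statements can then be run through `sqlexp` or any SQL client connected to the database.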
2. Registering an ABL User (Data Administration)
For ABL (Advanced Business Language) clients, user accounts typically correspond to operating system users. Database-level user information (such as users and their properties) can be managed using the Data Administration Tool (Data Dictionary) or programmatically through ABL code.
Using Data Administration Tool
- Connect to the database from Data Administration.
- Navigate to Admin > Security > Edit User List.
- Add the operating system username (e.g., `DOMAIN\username` or simply `username`), and optionally assign User Levels or Security Groups.
Programmatically with ABL
```abl
DEFINE VARIABLE hSec AS HANDLE NO-UNDO.

hSec = SESSION:GET-SECURITY-POLICY().
hSec:REGISTER-USER("john_doe").
hSec:SET-USER-GROUP("john_doe", "MyGroup").
```
Note: These examples assume you use the OpenEdge Security Policy API (available in certain OpenEdge releases). Adapt as needed for your environment.
Additional Considerations
- For detailed audit and authentication control, consider configuring OpenEdge Authentication Gateway.
- Always consult your organization's security policies and the OpenEdge documentation matching your database version for compliance and further options.
- Use `GRANT` and `REVOKE` to manage SQL-based privileges, as required.
For more details, refer to the Progress OpenEdge SQL Reference Guide.
Installing the JDBC driver
At the time of writing, the JDBC driver for the Progress OpenEdge RDBMS database is not included within Matillion Data Productivity Cloud by default. This is due to licensing and redistribution restrictions imposed by Progress Software. Therefore, users must manually download and install the JDBC driver before establishing connections to OpenEdge databases from Matillion.
Downloading the Progress OpenEdge JDBC Driver
- Navigate to the official Progress Software downloads page: https://www.progress.com/openedge/downloads
- Locate and download the "Type 4 JDBC driver" for OpenEdge. Type 4 drivers are recommended due to their platform independence and direct connectivity to the target database over TCP/IP.
- After downloading, extract the driver files if they are provided as a compressed archive. The main driver file will typically have a `.jar` extension.
Installing the JDBC Driver in Matillion Data Productivity Cloud
- Follow the detailed installation instructions published at this URL: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- Typically, the process involves uploading the JDBC driver `.jar` file to the Matillion agent environment, either through the user interface or via a specific directory. Please refer to the above documentation for the precise steps required in your deployment.
Using the Driver in Matillion
Once the driver is installed, you can use it to connect from within Matillion's Data Productivity Cloud, including leveraging JDBC connection capabilities for visual database connectors or code-driven workflows.
To get started with OpenEdge connections, please review the usage guidance at: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
This documentation includes details on configuring connection properties, authentication, and usage examples for querying databases via JDBC within Matillion.
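Before configuring Matillion, it can help to verify the driver and credentials with a direct JDBC connection from outside Matillion. The sketch below assumes the Progress DataDirect Type 4 driver conventions (driver class `com.ddtek.jdbc.openedge.OpenEdgeDriver` and a `jdbc:datadirect:openedge://...` URL); verify both against your driver's documentation, and note that host, port, database, and jar path are placeholders.

```python
# Sketch of a direct JDBC connection check using the downloaded driver jar.
# Driver class and URL format follow Progress DataDirect conventions (verify
# against your driver documentation); all connection values are placeholders.

def openedge_jdbc_url(host, port, database):
    """Build a DataDirect-style OpenEdge JDBC URL."""
    return f"jdbc:datadirect:openedge://{host}:{port};databaseName={database}"

def connect_openedge(host, port, database, user, password, driver_jar):
    """Open a JDBC connection (requires: pip install jaydebeapi, plus a JVM)."""
    import jaydebeapi  # imported lazily so the URL helper works without it
    return jaydebeapi.connect(
        "com.ddtek.jdbc.openedge.OpenEdgeDriver",  # driver class (verify)
        openedge_jdbc_url(host, port, database),
        [user, password],
        driver_jar,  # e.g. the path where you extracted the .jar file
    )
```

If this connection succeeds with the same credentials and URL you intend to use in Matillion, any remaining connection failures are likely network- or configuration-related rather than driver-related.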
Checking network connectivity
To ensure successful data integration with Matillion Data Productivity Cloud, you must make sure that the Progress OpenEdge RDBMS database accepts incoming connections according to your deployment configuration:
- Full SaaS Agent Configuration: Allow inbound connections from all the Matillion IP addresses specified at this URL. Update the database server’s network access controls (for example, firewall or security group settings) with these IP addresses to permit communication from the Matillion cloud service.
- Hybrid SaaS Deployment: Incoming traffic should be permitted from within your own virtual private cloud (VPC) where the Matillion Hybrid SaaS agent resides. For guidance on confirming connectivity and troubleshooting network access, you can visit the network check utilities on the Matillion Exchange.
Additionally, if you reference your Progress OpenEdge RDBMS database by its DNS hostname (instead of a direct IP address), ensure that the relevant Matillion agent—whether Full SaaS or Hybrid SaaS—is able to successfully resolve this DNS name. Proper DNS resolution is required for establishing and maintaining connections to your database.
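The two checks above (DNS resolution and TCP reachability) can be verified from any machine on the agent's network with a short script. This is a minimal sketch; the hostname and port you pass in should be your actual database server and its OpenEdge SQL broker port.

```python
# Quick network check: can we resolve the database hostname, and can we open
# a TCP connection to the OpenEdge SQL port? Run this from the same network
# as the Matillion agent. Hostname and port are placeholders you supply.
import socket

def resolve_host(hostname):
    """Return the IP address for a hostname, or None if resolution fails."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `resolve_host` returns `None`, fix DNS first; if resolution works but `can_connect` returns `False`, review firewalls, security groups, and the database broker configuration.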
How to Query Data from a Progress OpenEdge RDBMS Database
Example: Progress OpenEdge SQL SELECT Statements
To retrieve data from a Progress OpenEdge database, you can use standard SQL queries via the OpenEdge SQL interface or compatible ODBC/JDBC drivers. Some SQL syntax examples:
```sql
-- Retrieve all records from a table
SELECT * FROM pub.customer;

-- Select specific columns
SELECT customer_num, name, address FROM pub.customer;

-- Filter data
SELECT * FROM pub.orders WHERE order_date >= '2024-01-01';

-- Aggregate functions
SELECT COUNT(*) AS total_orders FROM pub.orders WHERE shipped = TRUE;
```
Remark: The `pub` schema prefix is frequently used with OpenEdge databases.
Datatype Conversion with Databricks
When querying and loading data from Progress OpenEdge RDBMS into Databricks, be aware that datatype mapping occurs:
- OpenEdge data types (e.g., `INTEGER`, `CHARACTER`, `DATE`) are automatically translated to Databricks-compatible types (e.g., `int`, `string`, `date`).
- Some complex or proprietary types may require explicit conversion/casting, either in the SQL query (using `CAST`), during ETL, or in Databricks.
Example of explicit casting:
```sql
SELECT customer_num, CAST(balance AS DOUBLE) AS balance_amount FROM pub.customer;
```
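When building an extraction pipeline, it is often useful to centralize the type mapping in one place. The table below is an illustrative sketch of typical OpenEdge-to-Databricks conversions, not an exhaustive or authoritative list; check the actual mappings produced by your driver and tooling.

```python
# Illustrative mapping from common OpenEdge SQL types to Databricks SQL types.
# These pairs are typical conversions, not an authoritative specification.
OPENEDGE_TO_DATABRICKS = {
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "CHARACTER": "STRING",
    "VARCHAR": "STRING",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
    "NUMERIC": "DECIMAL(38,18)",  # choose precision/scale to fit your data
    "LOGICAL": "BOOLEAN",
}

def map_type(openedge_type):
    """Return the Databricks type for an OpenEdge type, defaulting to STRING."""
    return OPENEDGE_TO_DATABRICKS.get(openedge_type.upper(), "STRING")
```

Defaulting unknown types to `STRING` is a conservative choice: it preserves the raw value so it can be cast explicitly later in Databricks.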
Pattern: Initial and Incremental Loads
The recommended approach uses:
- Once-off Initial Load: Extract the entire dataset.
- Incremental Loads: Extract only data changed or added since the last load.
Both should use the same Data Integration tool/component (such as the Database Query component). The main difference is the presence of a filtering clause in the incremental phase.
Initial Load Example
Database Query component SQL:
```sql
SELECT * FROM pub.customer;
```
No filter: All data is extracted.
Incremental Load Example
Introduce a filter on a column that tracks changes (e.g., `updated_at` or `last_modified`). More details are available from Matillion.
Database Query component SQL:
```sql
SELECT * FROM pub.customer WHERE updated_at > '2024-06-01 00:00:00';
```
Update the filter value based on the last successfully loaded timestamp from Databricks.
For more on incremental loading strategies using ETL tools, refer to the Matillion article.
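The incremental pattern above boils down to high-water-mark logic: remember the latest successfully loaded change timestamp, and use it to build the next extract query. This is a minimal sketch of that logic; the table and column names are the hypothetical ones used in the examples above, and in a real pipeline the watermark would be persisted (for example, in a control table in Databricks) rather than held in memory.

```python
# Sketch of the high-water-mark pattern behind incremental loads.
# Table/column names are the hypothetical examples used in this article.

def incremental_query(table, column, last_loaded):
    """Build the incremental extract query from the stored watermark."""
    return f"SELECT * FROM {table} WHERE {column} > '{last_loaded}'"

def next_watermark(rows, column="updated_at"):
    """After a successful load, the new watermark is the max change timestamp."""
    return max(row[column] for row in rows)

# Example: two extracted rows advance the watermark to the later timestamp.
rows = [
    {"updated_at": "2024-06-01 08:15:00"},
    {"updated_at": "2024-06-02 12:30:00"},
]
print(next_watermark(rows))  # the value to store for the next run
```

Only advance the watermark after the load into Databricks has completed successfully; otherwise a failed run could silently skip rows.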
Data Integration Architecture
Loading data in advance of integration is a proven strategy for simplifying complex data integration workflows, allowing organizations to divide and conquer the problem by separating it into two distinct steps: data loading (Extract and Load, or EL) and data transformation (Transform, or T). This approach exemplifies the key advantages of the ELT (Extract, Load, Transform) architecture. Once the raw data is loaded into your Databricks database, integration tasks that require data transformation are best accomplished through robust data transformation pipelines, which ensure data quality, repeatability, and scalability. Another notable benefit of the ELT architecture is that all data processing—especially transformation and integration—takes place inside the target Databricks environment. This makes processing both fast and scalable, as Databricks leverages its built-in compute resources on-demand, and eliminates the need for maintaining and paying for additional, external data processing infrastructure.
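On the Databricks side, the EL step of this architecture can be sketched as a JDBC read landed into a Delta table, with all transformation deferred to later pipelines. The sketch below assumes a Databricks notebook or job where a `SparkSession` is available; the JDBC URL, source table, and target table names are placeholders, and the driver class follows the DataDirect OpenEdge convention (verify against your driver documentation).

```python
# Sketch of the EL step: read the OpenEdge table over JDBC and land it as a
# Delta table, leaving transformation to downstream pipelines. Connection
# values and table names are placeholders.

def jdbc_read_options(url, table, user, password):
    """Options for spark.read.format('jdbc'); driver class per the
    DataDirect OpenEdge driver (verify for your driver version)."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.ddtek.jdbc.openedge.OpenEdgeDriver",
    }

def load_to_delta(spark, options, target_table):
    """Extract and load the raw data; transform later inside Databricks."""
    df = spark.read.format("jdbc").options(**options).load()
    df.write.format("delta").mode("overwrite").saveAsTable(target_table)
```

Keeping the landed copy raw (no transformation during the read) is what makes the T step repeatable: transformations can be rerun against the Delta table without touching the source OpenEdge system again.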