Extracting data from Progress OpenEdge RDBMS to Amazon Redshift
Extracting data from Progress OpenEdge RDBMS is a crucial task for organizations seeking to leverage the advanced analytics and scalable storage available in modern cloud platforms such as Amazon Redshift. Moving data efficiently and securely from OpenEdge to Redshift involves several important steps, from establishing the necessary user identities and ensuring proper connectivity to configuring your integration tools and implementing effective querying methods. In this article, we will guide you through the essential components of a robust extract, load, transform (ELT) process:

- **Creating an identity in Progress OpenEdge RDBMS** to ensure secure and authorized data access;
- **Checking or acquiring the appropriate JDBC driver** within Matillion, which is often used to orchestrate this type of data movement;
- **Establishing reliable network connectivity** between your source OpenEdge system and the Redshift target database;
- **Querying data efficiently**, covering both initial full loads and subsequent incremental updates.

By following these steps, you will be able to construct a secure, reliable, and scalable workflow for moving your Progress OpenEdge data into Redshift, enabling your organization to fully benefit from modern data analytics capabilities.
What is Progress OpenEdge RDBMS?
Progress OpenEdge RDBMS is a high-performance relational database built for mission-critical applications with high concurrency and scalability. It uses advanced multi-version concurrency control for efficient transactions and minimal locking. Native integration with Progress ABL enables streamlined application development and deployment. Features include online maintenance, dynamic schema changes, robust disaster recovery (e.g., asynchronous replication), comprehensive security, advanced indexing, and broad platform support. These capabilities ensure data integrity, high availability, and reliable performance for both on-premises and cloud deployments.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse optimized for fast queries and analysis of structured and semi-structured data. Using columnar storage and parallel processing, Redshift delivers high performance and storage efficiency, ideal for OLAP workloads. It supports standard SQL, integrates with the AWS ecosystem, and offers features like Redshift Spectrum for S3 querying, automated backups, workload management, and strong security. Built on PostgreSQL, Redshift ensures compatibility with many analytics tools while offering scalable options such as RA3 node types and managed data sharing for enhanced elasticity and flexibility.
Why Move Data from Progress OpenEdge RDBMS into Amazon Redshift
Unlocking Advanced Analytics: The Value of Replicating Progress OpenEdge RDBMS Data to Amazon Redshift
Progress OpenEdge RDBMS often contains critical operational data that holds significant potential value for analytical and business intelligence initiatives. However, the true worth of this data is realized when it is integrated with information from other systems, enabling comprehensive analysis, enriched reporting, and the discovery of deeper insights. By copying data from Progress OpenEdge RDBMS into Amazon Redshift, organizations can seamlessly combine it with diverse sources without imposing additional workload or performance overhead on the transactional production system. Leveraging Redshift for data integration and analytics ensures that the operational integrity of Progress OpenEdge is maintained, while unlocking the capability to process large-scale queries and run advanced analytics in a highly scalable and efficient environment.
Creating a User in a Progress OpenEdge RDBMS Database
In Progress OpenEdge RDBMS, user authentication and management are typically handled outside the database, for example, via the operating system or directory services (such as LDAP). However, if you are using SQL capabilities, you can create database users by defining and authorizing identities within the database using the SQL CREATE USER statement.
Below are step-by-step instructions to create a user in a Progress OpenEdge RDBMS database using SQL scripting.
1. Connect to the Database via SQL
Before creating a user, connect to your OpenEdge database using a SQL tool such as `sqlexp` (SQL Explorer) or an external client (for example, `isql` with the ODBC/JDBC driver):

```bash
sqlexp -db <db-name> -H <host> -S <port>
```
2. Create a New Database User (SQL Identity)
To create a new user (identity) within the database, use the `CREATE USER` SQL statement:

```sql
CREATE USER username IDENTIFIED BY 'user_password';
```

Example:

```sql
CREATE USER johndoe IDENTIFIED BY 'StrongP@ssw0rd';
```

- `username`: Replace with the desired user name.
- `'user_password'`: Replace with a secure password.
Note: Enforcing database authentication requires that the database be configured for SQL access authentication. Some environments rely exclusively on connection profiles or external identity providers.
3. Grant Privileges to the New User
By default, the new user will have limited access. Explicitly grant database privileges as needed:

```sql
GRANT CONNECT TO johndoe;
GRANT SELECT, INSERT, UPDATE, DELETE ON tablename TO johndoe;
```

- Replace `tablename` with the relevant table you want the user to access.

Example:

```sql
GRANT CONNECT TO johndoe;
GRANT SELECT, INSERT ON customers TO johndoe;
```
4. List Existing Users (Optional)
To verify the user was created, you can query the system catalog:
```sql
SELECT user_name FROM sysprogress.sysusers;
```
Additional Notes
- If your Progress OpenEdge installation uses OS or directory authentication, SQL user management may be restricted or handled externally.
- Make sure to review your organization's user management and security best practices before managing users directly via SQL.
Installing the JDBC Driver
The Progress OpenEdge JDBC driver enables connectivity between Matillion Data Productivity Cloud and the Progress OpenEdge RDBMS. Due to licensing or redistribution restrictions at the time of writing, this JDBC driver is not bundled with Matillion by default and must be downloaded and installed manually.
Step 1: Download the JDBC Driver
- Navigate to the Progress OpenEdge Downloads page.
- Locate the appropriate JDBC driver, giving preference to the Type 4 JDBC driver, which is fully implemented in Java and does not require native libraries.
- Complete any registration or agreement steps as prompted, and download the relevant JDBC driver JAR file for your operating system and database version.
Step 2: Install the Driver into Matillion Data Productivity Cloud
- Refer to the official guide on uploading external drivers to Matillion Data Productivity Cloud Agents.
- Follow the documented process to upload the downloaded Progress OpenEdge JDBC driver JAR file to the relevant Matillion agent(s). Ensure that the driver file is accessible to the agent for your workspace.
- Once the driver is uploaded, verify that it appears in your agent’s list of available JDBC drivers.
Step 3: Configure and Use the Driver
Instructions for using the Progress OpenEdge JDBC driver in your data pipelines and queries within Matillion can be found in the Database Query component documentation. Use this resource to configure database connections and interact with your Progress OpenEdge RDBMS from within Matillion Data Productivity Cloud.
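For reference, a connection URL for the Progress DataDirect Type 4 OpenEdge JDBC driver typically takes the following form. The host, port, and database name below are placeholders; confirm the exact URL syntax and driver class against the documentation shipped with your driver version:

```
URL:          jdbc:datadirect:openedge://db-host:5001;databaseName=mydb
Driver class: com.ddtek.jdbc.openedge.OpenEdgeDriver
```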
Checking network connectivity
To enable Matillion Data Productivity Cloud to connect to your Progress OpenEdge RDBMS database, you must ensure that the database can accept incoming network connections, based on your deployment configuration:
- **Full SaaS Agent Configuration:** The Progress OpenEdge RDBMS database must allow incoming connections from the IP addresses used by Matillion's SaaS agents. You can view the latest list of required IP addresses at this documentation page. Ensure these addresses are added to your firewall allowlist or access control list.
- **Hybrid SaaS Deployment:** If you are using a Hybrid SaaS deployment, incoming connections to the Progress OpenEdge RDBMS database should be allowed from your own Virtual Private Cloud (VPC). This setup enables the on-premises or customer-controlled agent to communicate securely with your database. To assist with verifying connectivity from your environment, you can use the network utility tools provided by Matillion at this resource.
Additionally, if the Progress OpenEdge RDBMS database is being referenced using a DNS address (rather than a static IP), the Full SaaS or Hybrid SaaS agent must have DNS resolution enabled for the relevant hostname. Ensure that the agent can correctly resolve and reach the database server using its DNS address to avoid connection issues.
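As a quick sanity check before configuring the agent, you can verify DNS resolution and TCP reachability of the database endpoint from the agent's network. The sketch below is a generic check, not a Matillion tool; the host and port are placeholders for your own OpenEdge server:

```python
import socket

def check_connectivity(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if `host` resolves and accepts a TCP connection on `port`."""
    try:
        # DNS resolution; fails fast if the agent cannot resolve the hostname.
        addr_info = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    family, socktype, proto, _, sockaddr = addr_info[0]
    try:
        # Attempt the actual TCP handshake against the resolved address.
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
        return True
    except OSError:
        return False
```

For example, `check_connectivity("oedb.example.internal", 5001)` returning `False` points to a firewall, allowlist, or DNS problem that must be resolved before the pipeline can run.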
Querying Data from Progress OpenEdge RDBMS
This guide provides technical instructions for querying data from a Progress OpenEdge RDBMS database. It includes example SQL SELECT statements, notes on datatype conversion to Redshift, and best practices for performing initial and incremental data loads using a Database Query component.
Example Progress OpenEdge SQL SELECT Queries
Progress OpenEdge RDBMS supports standard SQL queries. Here are some examples:
```sql
-- Basic SELECT statement
SELECT CustomerID, Name, Balance FROM pub.Customer;

-- Filtering with WHERE clause
SELECT OrderID, OrderDate, TotalAmount
FROM pub.Orders
WHERE OrderDate >= '2024-01-01';

-- Selecting with aggregate functions
SELECT SalesRep, COUNT(*) AS NumOrders, SUM(TotalAmount) AS TotalSales
FROM pub.Orders
GROUP BY SalesRep;

-- Joining tables
SELECT c.CustomerID, c.Name, o.OrderID, o.TotalAmount
FROM pub.Customer c
JOIN pub.Orders o ON c.CustomerID = o.CustomerID
WHERE o.OrderDate >= '2024-01-01';
```
Datatype Conversion (OpenEdge to Redshift)
When querying data for migration or ETL between Progress OpenEdge and Amazon Redshift, be aware that datatype conversions may be required. For example:
| Progress OpenEdge | Amazon Redshift | Notes |
|---|---|---|
| CHAR, VARCHAR | VARCHAR | Variable-length strings |
| INTEGER | INTEGER | Integer numbers |
| DECIMAL, MONEY | DECIMAL | Use same or higher precision |
| DATE | DATE | Supported |
| DATETIME | TIMESTAMP | Use Redshift TIMESTAMP |
Check and test all conversions, especially for date/time and large numeric fields, as certain differences may require additional transformation logic.
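When such conversions are automated, it helps to keep the mapping in one place. The sketch below mirrors the table above and falls back to `VARCHAR` for unmapped types as a conservative default; extend the dictionary for the other OpenEdge types present in your schema:

```python
# Illustrative OpenEdge -> Redshift type mapping, mirroring the table above.
OPENEDGE_TO_REDSHIFT = {
    "CHAR": "VARCHAR",
    "VARCHAR": "VARCHAR",
    "INTEGER": "INTEGER",
    "DECIMAL": "DECIMAL",
    "MONEY": "DECIMAL",
    "DATE": "DATE",
    "DATETIME": "TIMESTAMP",
}

def redshift_type(openedge_type: str) -> str:
    """Return the Redshift type for an OpenEdge type.

    Unmapped types default to VARCHAR as a safe catch-all; review such
    columns manually before loading, per the note above.
    """
    return OPENEDGE_TO_REDSHIFT.get(openedge_type.upper(), "VARCHAR")
```

For instance, `redshift_type("DATETIME")` yields `"TIMESTAMP"`, matching the table above.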
Best Pattern: Initial and Incremental Loads
The recommended approach for efficiently querying and transferring data is a two-step process:
- **Initial (one-off) load:** Extract all required data in a single operation. No filter clause is applied.
- **Incremental loads:** Move only new or updated records since the previous load. A filter clause is used (e.g., on a timestamp or ID).
Use the same Database Query component (such as in an ETL tool) for both load types, adjusting the SQL WHERE clause as follows:
Initial Load Query (No Filter)
```sql
SELECT * FROM pub.Orders;
```
Incremental Load Query (With Filter)
```sql
SELECT * FROM pub.Orders
WHERE OrderDate > '2024-06-01'; -- Replace with a dynamic value
```
Reference: Matillion Incremental Load and Data Replication Strategy
- During incremental loads, the filter should dynamically reference the last successfully loaded position, such as a timestamp or auto-incrementing primary key.
Tip: Consistency in column selection and ordering is critical between the initial and incremental loads to enable seamless data replication and avoid schema mismatches.
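The two load modes can be sketched as one small helper that derives the extraction SQL from the stored high-water mark. The table, column, and function names here are illustrative; in a real pipeline the watermark would come from persisted pipeline state, and values should be bound as query parameters rather than interpolated into the SQL string:

```python
from typing import Optional

def build_extract_query(table: str, watermark_column: str,
                        last_value: Optional[str] = None) -> str:
    """Build the extraction query for an initial or incremental load.

    With no watermark (first run) all rows are selected; on later runs only
    rows past the last successfully loaded position are fetched.
    """
    query = f"SELECT * FROM {table}"
    if last_value is not None:
        # Incremental load: filter on the watermark column.
        query += f" WHERE {watermark_column} > '{last_value}'"
    return query + ";"
```

On the first run, `build_extract_query("pub.Orders", "OrderDate")` produces the unfiltered query; subsequent runs pass the highest `OrderDate` value already loaded.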
Data Integration Architecture
Loading data in advance of integration exemplifies the "divide and conquer" approach by splitting the process into two manageable steps: first, raw data is ingested into the Redshift database, and then data integration is performed afterward. This decoupled workflow is a key advantage of the ELT (Extract, Load, Transform) architecture, allowing organizations to quickly store data regardless of its initial format or source.

Data integration necessitates transforming raw data into meaningful, structured information, and the optimal way to achieve this is by leveraging data transformation pipelines. These pipelines apply the necessary modifications and aggregations before the data is used for analytics or reporting.

Another significant benefit of the ELT architecture is that both transformation and integration activities take place within the target Redshift environment itself. As a result, data processing becomes faster, more scalable, and immediately accessible on demand. Furthermore, by avoiding reliance on separate processing infrastructure, organizations can reduce costs and streamline operational complexity.