Extracting data from Oracle Database to Amazon Redshift
Extracting data from Oracle Database is a common yet critical step in building modern analytics pipelines, especially when migrating or synchronizing data to cloud platforms like Amazon Redshift. This article provides a step-by-step guide to streamline this process, with a particular focus on users leveraging Matillion ETL. We will begin by detailing how to create an appropriate identity in the Oracle Database to ensure secure and efficient access for extraction tasks. Next, for those using Matillion, we will explain how to check for—or acquire—the necessary JDBC driver to enable a seamless connection between Oracle and Matillion. The article will also outline the essential network connectivity requirements to ensure reliable, secure communication between your Oracle source database and Amazon Redshift target. Finally, we'll cover the process of querying data: first through a full extraction of initial datasets, and subsequently by implementing incremental data loads for ongoing synchronization. Whether you are undertaking a migration project or automating recurring data integration, this guide will help you navigate the key steps involved.
What is Oracle Database?
Oracle Database is a robust, enterprise-grade relational database management system (RDBMS) developed by Oracle Corporation. Recognized for its scalability, reliability, and advanced feature set, Oracle Database supports both transactional and analytical workloads, and provides comprehensive tools for database administration, security, and backup and recovery. Offering multi-model data management capabilities, it enables users to work with structured, semi-structured, and unstructured data, and is highly optimized for deployment in both on-premises and cloud environments. Advanced features such as Real Application Clusters (RAC), Automatic Storage Management (ASM), and pluggable databases deliver high availability, flexible resource utilization, and simplified management, making Oracle Database a preferred choice for mission-critical applications across diverse industries.
What is Amazon Redshift?
Amazon Redshift is a fully managed, cloud-based data warehouse built for fast, scalable analytics on large datasets. Using a columnar architecture and Massively Parallel Processing (MPP), it delivers high-performance queries for complex business intelligence. Redshift integrates seamlessly with AWS, supports both structured and semi-structured data via Amazon S3 and Redshift Spectrum, and works with various data and BI tools. Key features include advanced compression, automatic vacuuming, concurrency scaling, and workload management. Redshift enables efficient, cost-effective querying across diverse sources without needing to move your data.
Why Move Data from Oracle Database into Amazon Redshift
Unlocking Advanced Analytics: The Value of Copying Data from Oracle to Redshift
A data engineer or architect might choose to copy data from Oracle Database into Amazon Redshift for several compelling reasons. Firstly, Oracle Database often stores business-critical information that can yield valuable insights when analyzed effectively. Integrating this data with information from other sources—such as transactional systems, log files, or cloud-native applications—enables organizations to gain a more comprehensive and actionable view of their operations. By leveraging Amazon Redshift for this integration and subsequent analysis, teams can fully utilize Oracle’s data without imposing additional workload or performance overhead on the source database. This approach not only preserves Oracle’s primary function for transactional processing but also maximizes the value derived from its data in a scalable, high-performance analytics environment.
Creating an Identity in Oracle Database
Creating a user in Oracle Database involves several steps, including connecting as a privileged user, executing a `CREATE USER` statement, and (optionally) granting privileges. This guide provides concise instructions suitable for DBA and technical audiences.
Prerequisites
- An accessible Oracle Database instance
- DBA privileges (e.g., access via a user such as `sys` or `system`, or another user with the `CREATE USER` privilege)
1. Connect as a Privileged User
Connect to the database as a user with sufficient privileges, usually as `sysdba`:

```bash
sqlplus sys@ORCL as sysdba
```

Or for a local connection:

```bash
sqlplus / as sysdba
```
2. Create the User
Replace `<username>` and `<password>` with the desired username and a secure password. Assign a default tablespace (e.g., `USERS`) and a temporary tablespace (e.g., `TEMP`).

```sql
CREATE USER <username>
  IDENTIFIED BY <password>
  DEFAULT TABLESPACE USERS
  TEMPORARY TABLESPACE TEMP
  QUOTA UNLIMITED ON USERS;
```

Example:

```sql
CREATE USER app_user
  IDENTIFIED BY StrongPwd#2024
  DEFAULT TABLESPACE USERS
  TEMPORARY TABLESPACE TEMP
  QUOTA UNLIMITED ON USERS;
```
3. Grant Permissions
Typically, new users require at least `CREATE SESSION` to log in. You may also grant object creation privileges, or assign roles such as `DBA` (with caution).

```sql
-- Basic connect privilege
GRANT CREATE SESSION TO <username>;

-- (Optional) Privileges to create objects:
GRANT CREATE TABLE, CREATE VIEW, CREATE PROCEDURE, CREATE SEQUENCE TO <username>;
```

Example:

```sql
GRANT CREATE SESSION TO app_user;
GRANT CREATE TABLE, CREATE VIEW TO app_user;
```
4. (Optional) Grant Roles
For full administrative access (not usually recommended for application users):

```sql
GRANT DBA TO <username>;
```
5. Verify the User
Connect using the new user's credentials:

```bash
sqlplus app_user@ORCL
```

or

```bash
sqlplus app_user/StrongPwd#2024@ORCL
```
Notes
- Always use strong, complex passwords for database accounts.
- Minimize granted privileges—apply the principle of least privilege.
- If connecting remotely, ensure the listener and network configuration permit access.
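The statements above can also be generated from configuration rather than typed by hand. The sketch below is a minimal illustration (not part of any official tooling): a helper that builds the `CREATE USER` and `GRANT` statements from parameters, which you could then run via SQL*Plus or a database driver. The function name and defaults are assumptions for this example.

```python
def build_user_ddl(username, password, privileges=("CREATE SESSION",),
                   default_ts="USERS", temp_ts="TEMP"):
    """Build the CREATE USER and GRANT statements described above.

    Note: never interpolate untrusted input into DDL; the values here
    are assumed to come from a trusted configuration source.
    """
    statements = [
        f"CREATE USER {username} IDENTIFIED BY {password} "
        f"DEFAULT TABLESPACE {default_ts} TEMPORARY TABLESPACE {temp_ts} "
        f"QUOTA UNLIMITED ON {default_ts}"
    ]
    if privileges:
        statements.append(f"GRANT {', '.join(privileges)} TO {username}")
    return statements

# Example: reproduce the app_user setup from the steps above.
for stmt in build_user_ddl("app_user", "StrongPwd#2024",
                           privileges=("CREATE SESSION", "CREATE TABLE",
                                       "CREATE VIEW")):
    print(stmt + ";")
```

Keeping DDL generation in one place makes it easier to apply the least-privilege principle consistently across environments.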
Checking the JDBC driver
The JDBC driver for Oracle Database is already installed in Matillion Data Productivity Cloud, so there is no need to download or install it separately. To confirm the presence of the JDBC driver, follow these steps:
- Create a new Orchestration pipeline in your Matillion Data Productivity Cloud project.
- Add a "Database Query" component to your pipeline.
- In the settings for the "Database Query" component, locate the Database Type dropdown menu.
- Look for Oracle Database in the list of available database types. If it appears in the dropdown, the JDBC driver is installed and ready to use.
If you need to verify which version of the JDBC driver is currently in use, please refer to the detailed instructions provided here: Database Driver Versions.
Checking network connectivity
To enable successful connectivity between Matillion Data Productivity Cloud and your Oracle Database, you must ensure that the database is configured to allow incoming connections according to the type of Matillion deployment you are using:
- Full SaaS agent: If you are using a Full SaaS agent configuration, the Oracle Database must allow incoming connections from the IP addresses listed in the Matillion Data Productivity Cloud IP Allowlist documentation.
- Hybrid SaaS deployment: For Hybrid SaaS deployments, incoming connections from your own virtual private cloud (VPC) must be permitted. To help check network access from your VPC, you can use the utilities Matillion provides in its Network Access Checker.
Additionally, if your Oracle Database is referenced via a DNS hostname (rather than a direct IP address), ensure that the Full SaaS or Hybrid SaaS agent is able to resolve the hostname to the correct IP address, as DNS resolution is required for connectivity.
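Both requirements (DNS resolution and TCP reachability) can be sanity-checked from the agent's network before configuring anything in Matillion. The sketch below is a minimal illustration using only Python's standard library; the hostname shown is a placeholder, and port 1521 is assumed because it is the default Oracle listener port — substitute your own values.

```python
import socket

def can_resolve(hostname):
    """Return True if this host can resolve the hostname via DNS."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

def can_connect(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder hostname; 1521 is the default Oracle listener port):
# can_resolve("oracle-db.example.internal")
# can_connect("oracle-db.example.internal", 1521)
```

If `can_resolve` fails but the IP address works directly, the issue is DNS configuration on the agent side rather than firewall rules.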
Querying Data from an Oracle Database
This guide provides technical instructions for querying data from an Oracle Database using SQL `SELECT` statements, and discusses how to integrate these queries into data loading workflows such as those in Matillion ETL, including datatype conversion and loading strategies.
Oracle Database SQL Query Examples
To query data in Oracle Database, use the standard `SELECT` statement syntax. Here are some common examples:
```sql
-- Select all columns from an "employees" table
SELECT * FROM employees;

-- Select specific columns with a filter
SELECT employee_id, first_name, last_name, hire_date
FROM employees
WHERE department_id = 10;

-- Aggregate function usage example
SELECT department_id, COUNT(*) AS employee_count
FROM employees
GROUP BY department_id;

-- Join two tables
SELECT e.employee_id, e.first_name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;
```
Datatype Conversion: Oracle Database vs Redshift
When loading data from Oracle Database into Amazon Redshift, different database systems' datatypes may require explicit conversion. Examples of common Oracle-to-Redshift datatype mappings include:
| Oracle Datatype | Redshift Equivalent |
|---|---|
| `NUMBER(p,s)` | `DECIMAL(p,s)` |
| `VARCHAR2(n)` | `VARCHAR(n)` |
| `DATE` | `DATE` |
| `TIMESTAMP` | `TIMESTAMP` |
| `CLOB` | `VARCHAR(65535)` |
| `BLOB` | Not directly supported |
Careful review of your schema, and use of CAST/CONVERT functions in SQL or your ETL tool, may be necessary to ensure compatibility.
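To automate this mapping across many columns, a small lookup table can translate Oracle type names into Redshift DDL types. The sketch below is an illustration under the assumption that precision, scale, and length carry over directly; real schemas need case-by-case review (e.g., `NUMBER` declared without precision, or `CLOB` values longer than 65535 bytes).

```python
# Lookup derived from the mapping table above; anything missing
# (e.g., BLOB) deliberately raises so it gets manual review.
ORACLE_TO_REDSHIFT = {
    "NUMBER": "DECIMAL",
    "VARCHAR2": "VARCHAR",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
    "CLOB": "VARCHAR(65535)",
}

def to_redshift_type(oracle_type, precision=None, scale=None, length=None):
    """Translate an Oracle column type into a Redshift DDL type string."""
    base = ORACLE_TO_REDSHIFT.get(oracle_type.upper())
    if base is None:
        raise ValueError(f"No direct Redshift equivalent for {oracle_type}")
    if base == "DECIMAL" and precision is not None:
        return f"DECIMAL({precision},{scale or 0})"
    if base == "VARCHAR" and length is not None:
        return f"VARCHAR({length})"
    return base

# Examples:
# to_redshift_type("NUMBER", precision=10, scale=2)  -> "DECIMAL(10,2)"
# to_redshift_type("VARCHAR2", length=100)           -> "VARCHAR(100)"
```

Raising on unmapped types (rather than guessing) surfaces columns such as `BLOB` early, before a load fails downstream.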
Data Load Patterns: Initial and Incremental Loads
The best practice for replicating data from Oracle Database is to use a two-phase data load strategy with the same `Database Query` component:
- Initial Load: Perform a one-time extraction that targets all existing data.
- Incremental Loads: Subsequently, load only changes (new or updated rows) since the last extraction.
Initial Load
On the first run, your `Database Query` should not use any row-based filters; extract the entire dataset:

```sql
SELECT * FROM employees;
```
Incremental Loads
For future loads, filter on a suitable column (typically a timestamp or an auto-incrementing key):

```sql
SELECT * FROM employees
WHERE hire_date > TO_DATE(:last_extracted, 'YYYY-MM-DD HH24:MI:SS');
```

Replace `:last_extracted` with the appropriate last-extraction value, often managed by your ETL workflow.
For a comprehensive strategy discussion, refer to Matillion Incremental Load Data Replication Strategy.
Note: Ensure the chosen filter column (e.g., `hire_date`, `last_updated`, or a surrogate key) reliably reflects new or updated data for correct incremental loading. Reusing the same `Database Query` component and changing only the filter clause aligns with best practice for Oracle-to-Redshift ETL processes.
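The two-phase pattern above can be sketched in a few lines. This is a minimal illustration, not Matillion's internal behavior: the table name, filter column, and date format are taken from the examples above, and the watermark is assumed to be stored by the surrounding workflow between runs.

```python
def build_extraction_query(table, filter_column, last_extracted=None,
                           date_format="YYYY-MM-DD HH24:MI:SS"):
    """Return the SQL for the Database Query component.

    last_extracted is None on the initial load (full extract); on
    incremental runs it holds the watermark from the previous run.
    The watermark is inlined as a literal here purely for illustration;
    a real workflow would pass it as a bind variable or job parameter.
    """
    query = f"SELECT * FROM {table}"
    if last_extracted is not None:
        query += (f" WHERE {filter_column} > "
                  f"TO_DATE('{last_extracted}', '{date_format}')")
    return query

# Initial load (no filter):
print(build_extraction_query("employees", "hire_date"))
# Incremental load (watermark from the previous run):
print(build_extraction_query("employees", "hire_date",
                             "2024-01-31 00:00:00"))
```

After each successful run, the workflow would record the maximum `hire_date` it loaded and feed that value into the next run's `last_extracted`.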
Data Integration Architecture
Loading data in advance of integration is a core strategy for dividing and conquering the data integration challenge, by breaking the process into two distinctive steps: loading and then transforming the data. This modular approach forms a key advantage of the ELT (Extract, Load, Transform) architecture, as it allows raw data to be quickly ingested before any complex integration logic is applied. Effective data integration inherently requires robust data transformations, and the most reliable method for achieving this is through the use of structured data transformation pipelines. These pipelines ensure consistency, traceability, and efficiency in processing large datasets. Notably, another significant benefit of the ELT architecture is that all data transformation and integration tasks are performed inside the target Redshift database itself. As a result, transformations are fast, occur on-demand as needed, and scale seamlessly as data volumes grow, all while eliminating the need to invest in, maintain, or manage separate data processing infrastructure.