Extracting data from Oracle Database to Databricks
Extracting data from Oracle Database is a critical process for organizations seeking to leverage their operational data within modern analytic platforms such as Databricks. Seamless integration enables richer insights and timely decision-making but requires careful attention to authentication, connectivity, and data transfer mechanisms. In this article, we will guide you through the end-to-end process of moving data from Oracle Database into Databricks. We will begin by showing how to create an identity in Oracle Database, ensuring secure and appropriate access for extraction activities. For those using Matillion as an ETL tool, we will cover how to check for—or acquire—the appropriate JDBC driver necessary for Oracle connectivity. Network connectivity between source (Oracle) and target (Databricks) is a crucial consideration, and we will outline best practices for verifying and establishing this connection. Finally, we will explore strategies for querying data, including both initial full loads and ongoing incremental extractions, to maintain an up-to-date, efficient pipeline. Whether you are just starting to integrate Oracle with Databricks or looking to optimize your current workflow, this article will provide the key steps and considerations needed for a successful data migration.
What is Oracle Database?
Oracle Database, developed by Oracle Corporation, is a powerful RDBMS designed for enterprise-level transactional processing, data warehousing, and mixed workloads. It features multitenant architecture, automatic storage management, and strong data security, making it ideal for mission-critical applications. Oracle ensures high availability and consistency with technologies like Real Application Clusters (RAC) and Data Guard. It supports on-premises, cloud, and hybrid deployments. Comprehensive SQL and PL/SQL capabilities, along with advanced monitoring, tuning, and automation tools, offer developers and administrators the flexibility to manage complex data needs efficiently across diverse environments.
What is Databricks?
Databricks is a unified analytics platform built atop Apache Spark, offering scalable cloud compute and managed data services. It enables large-scale data engineering, science, and machine learning by integrating with cloud storage (AWS S3, Azure Data Lake, Google Cloud Storage) and leveraging Delta Lake, which adds ACID transactions, schema enforcement, and time travel—transforming storage into a reliable data lakehouse. Users interact via notebooks, REST APIs, and IDEs using SQL, Python, Scala, or R. Databricks supports both batch and real-time processing and connects easily to BI tools and applications, combining the flexibility of cloud storage with database-like reliability and performance.
Why Move Data from Oracle Database into Databricks
The Benefits of Transferring Oracle Data to Databricks for Advanced Analytics
A data engineer or architect may seek to copy data from Oracle Database into Databricks for several compelling reasons. Firstly, Oracle Database often contains data that is of significant potential value to the organization, including transactional records, customer information, and business-critical metrics. By extracting this data and integrating it within Databricks alongside information from other sources, such as cloud storage, IoT devices, or third-party APIs, organizations can uncover deeper insights and enhance analytics capabilities that would not be possible by analyzing siloed datasets alone. Furthermore, leveraging Databricks as the platform for data integration and processing allows the heavy computational workload to be handled in a scalable cloud environment. This approach greatly reduces the processing burden on the operational Oracle Database, minimizing the risk of performance degradation or downtime for critical business applications that rely on Oracle.
Creating a User in Oracle Database
This guide provides the steps required for creating a new user (database identity) in an Oracle Database environment. The instructions assume you have the necessary administrative privileges (such as DBA role or CREATE USER privilege).
Step 1: Connect to the Database as an Administrator
Connect to the Oracle Database using SQL*Plus, SQLcl, or any database access tool, as a user with sufficient privileges:
```bash
sqlplus sys@ORCL as sysdba
```

Or, if authenticated as another user with DBA rights:

```bash
sqlplus admin_username@ORCL
```
Step 2: Create a User
To create a user, execute the `CREATE USER` statement. Specify a username and authentication method. You must also assign the user a default tablespace and an optional temporary tablespace.
Example:
```sql
CREATE USER hr_user IDENTIFIED BY StrongPassword_123
  DEFAULT TABLESPACE users
  TEMPORARY TABLESPACE temp;
```
Parameters:

- `hr_user`: The username to be created.
- `StrongPassword_123`: Replace with a secure password.
- `DEFAULT TABLESPACE users`: Schema objects will be placed in the USERS tablespace unless otherwise specified.
- `TEMPORARY TABLESPACE temp`: The TEMP tablespace used for the user's sorting operations.
Note: Verify the tablespaces (e.g., USERS, TEMP) exist, or adjust as appropriate for your environment.
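If you are unsure which tablespaces exist, you can list them first. This sketch assumes you have privileges to read the DBA_TABLESPACES dictionary view:

```sql
-- List available tablespaces before assigning them to the new user
SELECT tablespace_name, status, contents
FROM dba_tablespaces;
```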
Step 3: Grant Required Privileges
A newly-created user cannot create objects or log in until granted appropriate privileges.
Minimal Required Privileges Example:
Enabling schema object creation and login:
```sql
GRANT CREATE SESSION, CREATE TABLE TO hr_user;
```
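If the user's only role is data extraction, a read-only setup may be preferable to `CREATE TABLE`. A minimal sketch, assuming the source tables live in a hypothetical HR schema (adjust the schema and table names to your environment):

```sql
-- Login privilege plus read-only access to specific source tables
GRANT CREATE SESSION TO hr_user;
GRANT SELECT ON hr.employees TO hr_user;
GRANT SELECT ON hr.orders TO hr_user;
```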
Optional Administrative Privileges Example:
If the user needs DBA rights:
```sql
GRANT DBA TO hr_user;
```
Grant Quotas (Optional):
If you want to control user storage limits, grant a quota on the tablespace:
```sql
ALTER USER hr_user QUOTA 100M ON users;
```
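To confirm the quota took effect, you can query the DBA_TS_QUOTAS dictionary view (a MAX_BYTES value of -1 indicates an unlimited quota):

```sql
-- Check the storage quota assigned to the new user
SELECT username, tablespace_name, bytes, max_bytes
FROM dba_ts_quotas
WHERE username = 'HR_USER';
```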
Step 4: Verify User Access
Log in as the new user to ensure everything works as expected:
```bash
sqlplus hr_user@ORCL
```
The new user should now be able to connect (if `CREATE SESSION` was granted) and create objects (if appropriate object privileges were granted).
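While logged in as the new user, you can also confirm which system privileges are active in the current session:

```sql
-- Run as hr_user to list the privileges active in this session
SELECT * FROM session_privs;
```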
Checking the JDBC driver
The JDBC driver for the Oracle Database is already installed in Matillion Data Productivity Cloud—there is no need to manually download or install it.
To confirm that the JDBC driver is present:
- **Create an Orchestration Pipeline:** In Matillion Data Productivity Cloud, begin by creating a new Orchestration pipeline.
- **Add the Database Query Component:** Drag and drop the "Database Query" component onto the pipeline workspace.
- **Check for Oracle Database:** In the component's Properties panel, open the "Database Type" dropdown list. If you see "Oracle Database" as an option, the JDBC driver is installed and available for use.
For instructions on how to check which version of the JDBC driver is installed, refer to the Matillion documentation here: Database Driver Versions.
Checking network connectivity
To ensure successful communication between Matillion Data Productivity Cloud and your Oracle Database, you must configure the database to allow incoming network connections from the appropriate sources, based on your deployment type:
- **Full SaaS Agent Configuration:** You need to allow incoming connections from the set of IP addresses maintained by Matillion. The complete and up-to-date list of these IP addresses is available here: Matillion allowed IP addresses. Make sure these IPs have permission to connect to your Oracle Database.
- **Hybrid SaaS Deployment:** If you are using a Hybrid SaaS setup, you must permit incoming connections from resources within your own virtual private cloud (VPC), based on where your Matillion Hybrid SaaS Agent is deployed. To assist with connectivity and firewall rule checks, you can use these tools: Matillion Check Network Access Utilities.
If your Oracle Database is referenced via a DNS hostname, it is also essential that the Full SaaS or Hybrid SaaS Agent can resolve this DNS address to the correct IP. This requires that the agent has access to appropriate DNS servers and that any necessary DNS records are correctly configured and accessible.
Querying Data from an Oracle Database
This guide explains how to query data from an Oracle Database, with practical SQL examples and tips for initial and incremental data extraction patterns. This is particularly relevant when integrating with platforms such as Databricks, where datatype conversion might also play a role.
Example Oracle Database Query Statements
Below are typical `SELECT` statement examples used when querying data from an Oracle Database:
```sql
-- Select all columns from a table
SELECT * FROM employees;

-- Select specific columns
SELECT employee_id, first_name, hire_date FROM employees;

-- Apply filtering using WHERE clause
SELECT * FROM orders WHERE order_status = 'Shipped';

-- Aggregate data with GROUP BY
SELECT department_id, COUNT(*) FROM employees GROUP BY department_id;
```
Datatype Conversion Considerations
When extracting data from Oracle Database to platforms like Databricks, be aware that datatype conversions may occur. For example:
- `NUMBER` in Oracle might map to `DOUBLE` or `DECIMAL` in Databricks.
- `DATE` or `TIMESTAMP` types should be carefully checked for formatting consistency.
- String types, such as `VARCHAR2`, generally map directly to `STRING` in Databricks, but character set compatibility should be considered.
Test and validate data after migration to ensure integrity.
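Before relying on automatic mappings, it can help to inspect the Oracle column definitions directly. A sketch using the ALL_TAB_COLUMNS dictionary view, assuming the EMPLOYEES table from the earlier examples:

```sql
-- Inspect Oracle column datatypes before mapping them in Databricks
SELECT column_name, data_type, data_length, data_precision, data_scale
FROM all_tab_columns
WHERE table_name = 'EMPLOYEES';
```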
Initial Load vs. Incremental Load Pattern
The recommended data loading strategy is a one-time initial load, followed by ongoing incremental loads. This ensures both data completeness and efficiency. You can use the same Database Query component for both operations, adjusting the query as needed.
Initial Load
During the initial load, the Database Query component executes without any filter conditions, extracting the full dataset:
```sql
SELECT * FROM customers;
```
Incremental Load
On subsequent (incremental) loads, a WHERE clause is added to filter for new or updated records only. Typically, this uses a high-watermark column such as a primary key or a last-modified timestamp.
For example, to load only rows newer than the last extraction (`:last_extracted_ts` being supplied dynamically):
```sql
SELECT *
FROM customers
WHERE last_modified > TO_TIMESTAMP(:last_extracted_ts, 'YYYY-MM-DD HH24:MI:SS');
```
This approach minimises data movement and processing time for routine loads. For more on this strategy, see the Incremental Load Data Replication Strategy on Matillion Exchange.
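One common way to obtain the watermark for the next run is to capture the maximum value of the high-watermark column during each extraction. A minimal sketch, assuming the same customers table and last_modified column as above; the result would be stored (for example, in a pipeline variable) and supplied as `:last_extracted_ts` on the next run:

```sql
-- Capture the high-watermark at extraction time for the next incremental run
SELECT TO_CHAR(MAX(last_modified), 'YYYY-MM-DD HH24:MI:SS') AS high_watermark
FROM customers;
```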
Tip: Always use parameterized queries and proper data type handling with your Database Query components to ensure reliability and performance during both initial and incremental loads.
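On the Databricks side, incremental batches are typically merged into the target Delta table rather than simply appended, so that updated rows overwrite their earlier versions. A sketch in Databricks SQL, assuming a hypothetical staging table `customers_increment` and `customer_id` as the key:

```sql
-- Upsert the extracted increment into the target Delta table
MERGE INTO customers AS target
USING customers_increment AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```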
Data Integration Architecture
Loading data in advance of integration represents a "divide and conquer" strategy by splitting the process into two discrete steps—a hallmark advantage of the ELT (Extract, Load, Transform) architecture. In this approach, raw data is first loaded into the target environment, such as the Databricks database, before any integration or transformation occurs. Effective data integration requires robust data transformation, and the most efficient way to achieve this is through data transformation pipelines that automate and manage the complex business logic necessary for analysis and reporting. Another advantage of the ELT model is that all data transformations and integration tasks occur within the Databricks database itself. This localized processing allows for fast, on-demand, and scalable operations, without the need to provision or pay for additional, dedicated data processing infrastructure outside of your Databricks environment.
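As an illustration of the "T" in ELT running entirely inside Databricks, here is a minimal sketch in Databricks SQL; the `raw_customers` source table and the column names are hypothetical:

```sql
-- Transform raw Oracle data into a curated table inside Databricks
CREATE OR REPLACE TABLE curated_customers AS
SELECT
  customer_id,
  INITCAP(TRIM(first_name)) AS first_name,
  CAST(last_modified AS TIMESTAMP) AS last_modified_ts
FROM raw_customers
WHERE customer_id IS NOT NULL;
```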