Extracting data from OrientDB to Databricks
Extracting data from OrientDB is a valuable way to leverage the rich information stored in your graph database, enabling advanced analytics and data processing in modern platforms such as Databricks. To help you achieve a seamless data pipeline between OrientDB and Databricks, this article will guide you through the essential steps for extraction and transfer. We will begin by demonstrating how to create an appropriate identity in OrientDB, ensuring secure and controlled access to your source data. For users of Matillion, a leading ETL solution, we will review how to verify or acquire the necessary JDBC driver to facilitate the connection between systems. Proper network connectivity is also crucial; we will outline considerations for establishing reliable communication between OrientDB and Databricks. Finally, the article will cover strategies for querying your data, both for an initial full load and for efficient incremental updates. By following these steps, you will be positioned to successfully integrate OrientDB data into your Databricks environment for further analysis and insight.
What is OrientDB?
OrientDB is a multi-model, open-source NoSQL database management system that uniquely enables the integration of graph, document, object, and key/value models within a single database engine. Engineered for high performance, scalability, and flexibility, OrientDB allows organizations to natively traverse relationships between records, supporting advanced graph analytics while simultaneously offering document store features such as ACID transactions, schema-less architecture, and security controls. It supports SQL-like query syntax extended with graph operations, facilitating adoption for teams familiar with relational databases but requiring the power of graph traversal capabilities. OrientDB can operate in both a distributed and multi-master replicated environment, enabling horizontal scaling and robust fault tolerance for mission-critical applications.
What is Databricks?
Databricks provides an optimized data management layer built on top of Apache Spark, offering a robust, scalable database solution for both structured and unstructured data. Leveraging Delta Lake, Databricks ensures ACID compliance, scalable metadata handling, and efficient storage through its transactional storage engine. The platform natively supports SQL, Python, Scala, and R, facilitating advanced analytics, machine learning, and streaming workloads on large datasets. With its seamless integration into cloud environments and support for open data formats like Parquet and ORC, Databricks enables collaborative, end-to-end data engineering and analytics pipelines, while its unified workspace promotes efficient development and operationalization of data-driven solutions.
Why Move Data from OrientDB into Databricks
Unlocking Analytical Power: Copying Data from OrientDB to Databricks for Enhanced Insights
A data engineer or architect might choose to copy data from OrientDB into Databricks for several compelling reasons. Firstly, OrientDB often contains critical operational or legacy data that holds substantial potential value for analytics, reporting, or machine learning initiatives. However, this value is significantly amplified when OrientDB data is integrated with information from other sources, enabling more comprehensive analysis and deeper insights. By leveraging Databricks as the platform for this integration, engineers can perform ETL (Extract, Transform, Load) operations and complex analytics without adding any additional processing load to the OrientDB instance itself. This approach ensures that production systems remain responsive while still unlocking the rich analytical capabilities that Databricks provides.
Creating a User in OrientDB
To create a new user in an OrientDB database, you must have access with sufficient privileges (typically as an administrator). Users in OrientDB are managed through the `OUser` class in the security schema. Below are step-by-step instructions and example SQL scripts to guide you through the process.
1. Connect to the Database
Establish a connection to the desired database using OrientDB's console or studio.
Via Console:
```shell
orientdb> CONNECT remote:localhost/yourDatabase admin yourAdminPassword
```
Via Studio:
Navigate to the database and open the SQL console.
2. Create the User
Run the following SQL command to create a new user by inserting a record into the `OUser` class:
```
INSERT INTO OUser SET name = 'newuser', password = 'new_password', status = 'ACTIVE', roles = ['reader']
```
- `name`: Desired username.
- `password`: User's password (OrientDB will automatically hash it).
- `status`: Set to `'ACTIVE'` (or `'SUSPENDED'` to restrict access).
- `roles`: One or more default roles (`reader`, `writer`, `admin`), or your own custom role(s).
Example:
```
INSERT INTO OUser SET name = 'jdoe', password = 'JDoeS3cret!', status = 'ACTIVE', roles = ['writer']
```
3. Assign Multiple Roles (Optional)
Users can have multiple roles. For example, to assign both `reader` and `writer` roles:
```
INSERT INTO OUser SET name = 'poweruser', password = 'SuperS3cret!', status = 'ACTIVE', roles = ['reader','writer']
```
4. Verify the New User
To check that the user was created, run:
```
SELECT FROM OUser WHERE name = 'newuser'
```
Replace `'newuser'` with your actual username.
5. User Authentication
The new user can now connect to the database with their username and password.
Notes:
- Passwords are stored as hashes by default in OrientDB.
- To create custom roles, define them in the `ORole` class before assigning them to users.
- Rights and privileges depend on assigned roles. Alter roles as needed for your security policies.
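The user-creation statements above can also be generated programmatically when provisioning many accounts. The sketch below builds the same `INSERT INTO OUser` statement shown in step 2; the quote-escaping convention (backslash) is an assumption to keep the literals well-formed, and executing the statement against a live server is left to whatever client you use.

```python
# Sketch: compose the INSERT INTO OUser statement from parameters.
# Only statement construction is shown; executing it requires a client
# connection to a running OrientDB server.

def build_create_user_sql(name: str, password: str, roles: list[str],
                          status: str = "ACTIVE") -> str:
    """Build the OUser INSERT statement shown in step 2 above."""
    # Escape single quotes so the string literals stay well-formed
    # (backslash escaping is assumed here).
    esc = lambda s: s.replace("'", "\\'")
    role_list = ",".join(f"'{esc(r)}'" for r in roles)
    return (
        f"INSERT INTO OUser SET name = '{esc(name)}', "
        f"password = '{esc(password)}', status = '{status}', "
        f"roles = [{role_list}]"
    )

print(build_create_user_sql("jdoe", "JDoeS3cret!", ["writer"]))
# -> INSERT INTO OUser SET name = 'jdoe', password = 'JDoeS3cret!', status = 'ACTIVE', roles = ['writer']
```

Because the statement is just a string, the same helper works whether you submit it through the console, Studio, or a driver.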
Installing the JDBC driver
At the time of writing, the JDBC driver for OrientDB is not included by default in Matillion Data Productivity Cloud. This is due to licensing or redistribution restrictions, which require end users to obtain and install the driver themselves if they wish to connect to OrientDB.
To install the JDBC driver and make it available in Matillion Data Productivity Cloud, follow these steps:
- Download the JDBC Driver
Visit the OrientDB download page at https://orientdb.com/download/. On this page, locate and download the Type 4 JDBC driver for your platform. A Type 4 JDBC driver is preferred as it communicates with the database over the network using the database's native protocol, without requiring any native libraries or additional software.
- Review and Accept License Terms
Before proceeding with the download, review any license agreements or restrictions posted by OrientDB to ensure compliance with their terms.
- Upload and Install the Driver in Matillion
With the driver JAR file downloaded, follow the instructions in the Matillion documentation to upload and install external JDBC drivers:
- Detailed instructions can be found here: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- The process will guide you through uploading the OrientDB JDBC JAR file to the Matillion Agent, configuring the driver, and making it available for use in your Matillion environment.
- Configuring and Using the Driver
Once the driver is installed, refer to the following documentation for information on using database connections with custom JDBC drivers in Matillion:
- Usage instructions: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
These resources provide guidance on creating and configuring a new database connection to OrientDB using the JDBC driver, and utilizing it within your Matillion Data Productivity Cloud projects.
Please ensure you retain a copy of the OrientDB JDBC driver and its associated documentation within your technical records to help with future maintenance and troubleshooting.
Checking network connectivity
To establish connectivity between Matillion Data Productivity Cloud and your OrientDB database, you must ensure that the OrientDB instance is configured to accept incoming network connections. The requirements vary depending on your deployment model:
- Full SaaS Agent Configuration:
Allow access from the IP addresses listed in the Matillion Data Productivity Cloud security documentation. Your OrientDB database firewall and network policies must permit incoming connections from these IPs for successful communication.
- Hybrid SaaS Deployment:
You need to allow incoming connections from your own virtual private cloud (VPC), where your Hybrid SaaS agent is deployed. To confirm and troubleshoot network connectivity from your environment, you can use the utilities available at Matillion Exchange: Check Network Access.
Additionally, if you connect to your OrientDB database using a DNS name rather than a static IP address, ensure that the Full SaaS or Hybrid SaaS agent can properly resolve the DNS address to an IP. Configure firewall and DNS policies as needed to allow name resolution and network traffic.
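When diagnosing firewall or DNS problems, a quick reachability probe can rule out the basics before you touch agent configuration. This is a minimal sketch: the port 2424 shown is OrientDB's default binary-protocol port, and the host name is a placeholder for your own endpoint.

```python
# Sketch: probe DNS resolution and TCP reachability of an OrientDB endpoint.
# 2424 is OrientDB's default binary port; substitute your own host/port.
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if the name resolves and a TCP connection succeeds."""
    try:
        # This performs the DNS resolution step discussed above.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # name resolution failed
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # connection refused, filtered, or timed out

print(can_reach("localhost", 2424))
```

A `False` result distinguishes two failure modes: a resolution failure points at DNS policy, while a connect failure points at firewall or network rules.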
Querying Data from an OrientDB Database
This guide provides technical instructions on querying data from an OrientDB database, focusing on SQL-like SELECT examples. It also includes considerations for datatype conversion (e.g., when extracting data to Databricks) and recommended patterns for initial and incremental loads using a Database Query component.
Example OrientDB Queries
OrientDB uses a SQL-like syntax for data queries. Here are some SELECT query examples:
```
-- Select all records from the "Person" class
SELECT * FROM Person;

-- Retrieve specific fields from a record
SELECT name, age FROM Person WHERE city = 'London';

-- Traverse an edge to a related class (a graph-style join)
SELECT name, out('WorksAt').name AS company FROM Person WHERE age > 30;

-- Aggregate: count the number of people in each city
SELECT city, COUNT(*) AS city_count FROM Person GROUP BY city;
```
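With the JDBC driver from the earlier section installed, queries like these can also be issued from Python. The sketch below is hedged: `jaydebeapi` is a third-party package, and the driver class name (`com.orientechnologies.orient.jdbc.OrientJdbcDriver`) and `jdbc:orient:` URL prefix follow the OrientDB JDBC driver's conventions; verify both against the driver version you downloaded. Only the URL builder runs without a server.

```python
# Sketch: run an OrientDB query over JDBC from Python.
# Requires `pip install jaydebeapi` plus the downloaded OrientDB JDBC JAR;
# driver class name and URL prefix are taken from the driver's conventions
# and should be checked against your driver version.

def build_jdbc_url(host: str, database: str, port: int = 2424) -> str:
    """Compose a remote OrientDB JDBC URL (2424 is the default binary port)."""
    return f"jdbc:orient:remote:{host}:{port}/{database}"

def fetch_people(jar_path: str, host: str, database: str,
                 user: str, password: str) -> list:
    import jaydebeapi  # third-party; deferred so the URL builder stays standalone
    conn = jaydebeapi.connect(
        "com.orientechnologies.orient.jdbc.OrientJdbcDriver",
        build_jdbc_url(host, database),
        [user, password],
        jars=jar_path,
    )
    cur = conn.cursor()
    try:
        cur.execute("SELECT name, age FROM Person WHERE city = 'London'")
        return cur.fetchall()
    finally:
        cur.close()
        conn.close()

print(build_jdbc_url("localhost", "yourDatabase"))
# -> jdbc:orient:remote:localhost:2424/yourDatabase
```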
Datatype Conversion Considerations
When extracting data from OrientDB and loading into external systems like Databricks, be aware of possible datatype mapping or conversion. For example:
- OrientDB's `LINK`, `EMBEDDEDLIST`, and `EMBEDDEDMAP` types may map to stringified JSON or ARRAY types in Databricks.
- `DATE` and `DATETIME` values in OrientDB may require conversion to the corresponding Databricks formats.
- Binary fields should be handled consistently with their equivalent representation in your analytics platform.
Test your queries and review schema mappings to prevent data loss or misinterpretation during ETL.
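One common pattern for these conversions is to normalize each extracted value into a type Databricks ingests cleanly before loading. This is a sketch under stated assumptions: embedded lists and maps become JSON strings, dates become ISO-8601 strings, and binary fields become hex strings; the field names are illustrative only.

```python
# Sketch: coerce OrientDB values into Databricks-friendly Python types.
# Mappings chosen here (JSON strings, ISO-8601, hex) are one reasonable
# convention, not the only one; field names are illustrative.
import json
from datetime import datetime, date

def normalize_value(value):
    """Normalize a single extracted value for loading into Databricks."""
    if isinstance(value, (dict, list)):        # EMBEDDEDMAP / EMBEDDEDLIST
        return json.dumps(value, sort_keys=True)
    if isinstance(value, (datetime, date)):    # DATETIME / DATE
        return value.isoformat()
    if isinstance(value, (bytes, bytearray)):  # binary fields
        return value.hex()
    return value                               # scalars pass through

record = {
    "name": "jdoe",
    "tags": ["reader", "writer"],
    "profile": {"city": "London"},
    "last_updated": datetime(2024, 1, 15, 9, 30),
}
normalized = {k: normalize_value(v) for k, v in record.items()}
print(normalized["last_updated"])  # -> 2024-01-15T09:30:00
```

Applying the same normalizer to every row keeps the target schema stable across initial and incremental loads.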
Patterns for Initial and Incremental Loads
The best practice is to perform a one-off full (initial) load of data, followed by periodic incremental loads that capture only new or changed records. The Database Query component supports both approaches.
Initial Load Example
For an initial load, extract the full dataset with no filter clause:
```
SELECT * FROM Person;
```
- All records are returned.
- Use this once to bootstrap the downstream database or warehouse.
Incremental Load Example
During incremental loads, apply a filter (e.g., based on a `last_updated` timestamp):
```
SELECT * FROM Person WHERE last_updated > :last_run_timestamp;
```
- Replace `:last_run_timestamp` with the last successful load time; this limits the result set to new or updated records.
- Use the same Database Query component for both initial and incremental loads, parameterizing the query as needed.
For more implementation details, see the guide: Incremental Load Data Replication Strategy.
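The two load patterns above differ only in the presence of the watermark filter, so a single query builder can serve both. A minimal sketch, assuming `last_updated` is the watermark column and that its stored format matches the timestamp string you pass in:

```python
# Sketch: build the initial or incremental extraction query from a single
# helper. Passing no watermark yields the full load; passing the last
# successful run time yields the incremental load. The timestamp format is
# an assumption -- match it to how last_updated is stored in your class.

def build_extract_query(class_name, last_run_timestamp=None):
    """Full load when no watermark is known, filtered load otherwise."""
    base = f"SELECT * FROM {class_name}"
    if last_run_timestamp is None:
        return base  # initial load: no filter clause
    return f"{base} WHERE last_updated > '{last_run_timestamp}'"

print(build_extract_query("Person"))
# -> SELECT * FROM Person
print(build_extract_query("Person", "2024-01-15 09:30:00"))
# -> SELECT * FROM Person WHERE last_updated > '2024-01-15 09:30:00'
```

Persisting the watermark after each successful run (for example in a pipeline variable or control table) is what makes the incremental branch safe to re-run.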
Data Integration Architecture
Loading data ahead of integration is a classic "divide and conquer" strategy, allowing organizations to split the data pipeline into two distinct steps—a concept central to the Extract, Load, and Transform (ELT) architecture. This separation means raw data is ingested and stored in the target environment (such as Databricks) before performing any transformations or integration. Carrying out data transformation as part of integration is essential, and the most effective approach leverages data transformation pipelines, which provide automation, monitoring, and scalability. Another notable advantage of the ELT architecture is that all data transformation and integration processes are executed directly within the target Databricks database. This not only ensures that transformations are performed rapidly and on demand, but also scales efficiently without the need for additional data processing infrastructure, leading to significant cost savings and operational simplicity.