Integrate data from GraphDB (Ontotext) to Databricks using Matillion

Our GraphDB to Databricks connector seamlessly transfers your data to Databricks within minutes, ensuring it remains current without the need for manual coding or complex ETL scripts.


Extracting data from GraphDB (Ontotext) to Databricks

Extracting data from GraphDB is a crucial step for organizations seeking to leverage their semantic data within modern analytics platforms such as Databricks. By enabling this data movement, teams can integrate knowledge graphs into advanced data processing pipelines, facilitating deeper insights and more flexible reporting. This article provides a step-by-step guide to efficiently move data from GraphDB to Databricks.

We will begin by describing the process of creating an identity (user or service account) in GraphDB with the appropriate permissions for data extraction. For those using Matillion as their ETL tool, we will explain how to verify the presence of a compatible JDBC driver or acquire it if necessary. Network connectivity considerations are also essential, and we will cover how to ensure secure and reliable communication between your source (GraphDB) and target (Databricks) environments.

Finally, we will outline the process of querying data—starting with an initial full extract and then implementing incremental data extraction strategies for ongoing synchronization. Whether you are setting up this integration for the first time or looking to enhance your current workflow, this guide will help you navigate the required steps with clarity and confidence.


What is GraphDB (Ontotext)?

GraphDB, developed by Ontotext, is a high-performance semantic graph database built for managing and querying large-scale knowledge graphs and Linked Data. It supports W3C RDF and SPARQL standards, enabling efficient storage and complex queries for use cases like semantic search, data integration, text mining, and knowledge management. GraphDB offers native inferencing for logical conclusions using ontologies, robust ACID transaction support, horizontal scaling via clustering, and integration with data visualization tools. Renowned for performance and scalability, GraphDB is widely used in industries such as publishing, healthcare, and government to power intelligent, data-driven solutions.


What is Databricks?

Databricks is a cloud-based Unified Data Analytics Platform centered on Apache Spark. It offers an interactive workspace for data engineers, scientists, and analysts to collaborate on large-scale data processing, analytics, and machine learning. Core features include managed Spark clusters, optimized storage via Delta Lake (with ACID transactions and unified batch/streaming data), and seamless integration with AWS, Azure, and Google Cloud. Databricks simplifies big data infrastructure, supports many data connectors, and accelerates SQL analytics with its own Databricks SQL engine, letting organizations unify data sources and speed up data-driven insights.

Why Move Data from GraphDB (Ontotext) into Databricks

Unlocking Advanced Analytics: The Case for Copying Data from GraphDB to Databricks

A data engineer or architect might wish to copy data from GraphDB into Databricks for several compelling reasons. Firstly, GraphDB stores data with rich relationships and connections, which are potentially valuable for analytics, machine learning, or reporting purposes. By integrating this graph data with other enterprise data sources within Databricks, professionals can unlock deeper insights and derive greater business value by performing advanced analytics on a unified dataset. Furthermore, using Databricks as the platform for data integration and processing helps to minimize the computational workload on the production GraphDB system. This approach ensures that complex or resource-intensive queries are executed within Databricks' scalable environment, preserving the performance and reliability of the operational graph database.

Creating an Identity in GraphDB

This guide describes how to create a new user in a GraphDB database using the built-in user management functionality provided by GraphDB's Workbench (web interface) and REST API. User management ensures secure and organized access control, especially in multi-user environments.

1. Using the GraphDB Workbench (Web Interface)

  1. Log in with an account that has administrative privileges.
  2. From the main menu, navigate to Users & Security > Users.
  3. Click the Add User button.
  4. Fill out the form:
    • Username: Choose a unique login name for the new user.
    • Password: Enter and confirm a strong password.
    • Roles: Assign the desired role(s) (e.g., `READ`, `WRITE`, `REPO_MANAGER`, or `ADMIN`).
  5. Click Create to create the new user.

2. Using the REST API

GraphDB exposes its user management via a REST API. You can create a user by sending a POST request to the `/rest/security/users` endpoint.

Example Request (using `curl`)

```bash
curl -X POST "http://localhost:7200/rest/security/users" \
  -u admin:admin_password \
  -H "Content-Type: application/json" \
  -d '{
        "userName": "newuser",
        "password": "secure_password",
        "grantedAuthorities": ["READ", "WRITE"]
      }'
```

Notes:
  • Replace `admin:admin_password` with your admin account credentials.
  • You can assign different roles as needed (`READ`, `WRITE`, `REPO_MANAGER`, `ADMIN`).

Example JSON Payload

```json
{
  "userName": "newuser",
  "password": "secure_password",
  "grantedAuthorities": ["READ", "WRITE"]
}
```

3. Programmatic Creation via SQL-like Syntax

GraphDB does not natively support user creation via SQL (such as `CREATE USER` statements). Please use the Workbench or REST API.

Additional Considerations

  • Confirm your changes by listing users via the Workbench or the REST endpoint (`GET /rest/security/users`).
  • Always use a secure connection (HTTPS) when managing users on a remote or production server.
  • Password reset and further user management actions are available via similar interfaces.
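The REST call shown in the `curl` example above can also be scripted. The sketch below builds the same request with Python's standard library; the endpoint and payload mirror the example, while the hostname and credentials are placeholders you must replace for your own deployment.

```python
import json
import urllib.request
from base64 import b64encode

GRAPHDB_URL = "http://localhost:7200"  # placeholder: point this at your GraphDB instance

def build_create_user_request(admin_user, admin_password, new_user, new_password, roles):
    """Build a POST request to /rest/security/users, mirroring the curl example."""
    payload = {
        "userName": new_user,
        "password": new_password,
        "grantedAuthorities": roles,
    }
    # Basic auth header, equivalent to curl's -u admin:admin_password
    credentials = b64encode(f"{admin_user}:{admin_password}".encode()).decode()
    return urllib.request.Request(
        url=f"{GRAPHDB_URL}/rest/security/users",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )

# To actually create the user (requires a reachable GraphDB instance):
# with urllib.request.urlopen(build_create_user_request(
#         "admin", "admin_password", "newuser", "secure_password",
#         ["READ", "WRITE"])) as resp:
#     print(resp.status)
```

Separating request construction from sending keeps the logic easy to verify before pointing it at a production server over HTTPS.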

Installing the JDBC Driver

The GraphDB JDBC driver is not distributed with Matillion Data Productivity Cloud by default. This is due to licensing and/or redistribution restrictions. If you wish to connect Matillion Data Productivity Cloud to a GraphDB database via JDBC, you must manually download and install the appropriate driver.

Follow these steps to download and install the GraphDB JDBC driver for use in Matillion Data Productivity Cloud:

1. Download the GraphDB JDBC Driver

  1. Navigate to the official GraphDB JDBC connector documentation: https://graphdb.ontotext.com/documentation/jdbc-connector.html
  2. On that page, locate and download the latest available Type 4 JDBC driver. Type 4 drivers are strongly recommended as they are platform-independent and do not require additional libraries.
  3. Save the downloaded `.jar` file to a local directory for later upload.

2. Prepare for JDBC Driver Installation in Matillion

Due to the external nature of this driver, you must manually upload it to your Matillion Data Productivity Cloud Agent.

  1. Review the official Matillion documentation for uploading and installing external JDBC drivers: Uploading External Drivers.
  2. Follow the instructions provided to upload the `.jar` file(s) for the GraphDB JDBC driver to your Matillion Agent environment.
  3. This process may differ depending on whether your agent runs in AWS, Azure, or a local environment. Review and follow the details applicable to your deployment model.
  4. Ensure that the driver `.jar` is properly recognized and available to your Matillion Data Productivity Cloud Agents before proceeding.

3. Configure and Use the Driver

After successful upload and installation of the JDBC driver, consult the following Matillion documentation for step-by-step guidance on connecting to and querying GraphDB from within the product:
Database Query Usage Instructions

This will provide you with details about using the GraphDB JDBC connection in your pipelines or transformation workflows.

Note: Always confirm you are using a compatible JDBC driver version for both GraphDB and Matillion Data Productivity Cloud to ensure connectivity and avoid runtime issues.

Checking network connectivity

To establish successful connectivity between Matillion Data Productivity Cloud and your GraphDB database, it is essential to ensure that the database allows incoming network connections from the Matillion agent, in line with your chosen deployment configuration.

Note:
If your GraphDB instance is referenced by a DNS hostname, ensure that the Matillion agent (whether Full SaaS or Hybrid SaaS) is able to resolve that DNS address successfully. Connectivity will fail if the agent cannot translate the DNS name to an IP address.
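A quick way to validate both steps (DNS resolution and TCP reachability) from the agent host is a small script like the sketch below. The default port 7200 is GraphDB's standard HTTP port; adjust the host and port to match your deployment.

```python
import socket

def check_connectivity(host, port=7200, timeout=5):
    """Verify that a hostname resolves and that its GraphDB port accepts TCP connections.

    Returns the resolved IPv4 address on success; raises a descriptive error
    if either DNS resolution or the TCP handshake fails.
    """
    try:
        ip = socket.gethostbyname(host)  # DNS resolution step
    except socket.gaierror as exc:
        raise RuntimeError(f"DNS resolution failed for {host!r}: {exc}") from exc
    # TCP handshake step; raises OSError (e.g. timeout, refused) on failure
    with socket.create_connection((ip, port), timeout=timeout):
        pass
    return ip

# Example (assumes a GraphDB instance listening locally):
# print(check_connectivity("graphdb.internal.example.com"))
```

If the first step fails, fix DNS for the agent environment before investigating firewalls or security groups; if only the second fails, the name resolves but the port is blocked or the service is down.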

Querying Data from GraphDB in a Technical Workflow

This guide explains how to query data from a GraphDB database using SQL SELECT statements and describes an efficient loading strategy for downstream analytics systems (like Databricks). It covers data type conversion specifics and demonstrates best practices for initial and incremental loads.


1. Querying GraphDB with SQL SELECT Statements

Although GraphDB is fundamentally an RDF triple store typically queried with SPARQL, it can often be integrated with systems that express requests as (or translate them to) familiar SQL `SELECT` statements. Here are representative examples adapted for a relational-like querying interface for GraphDB data:

Example 1: Select all elements from a table/view

```sql
SELECT *
FROM Person
```

Example 2: Select specific columns with filtering

```sql
SELECT name, age
FROM Person
WHERE age > 30
```

Example 3: Aggregation and grouping

```sql
SELECT department, COUNT(*) AS staff_total
FROM Employee
GROUP BY department
```

Example 4: Joining tables/views

```sql
SELECT p.name, t.title
FROM Person AS p
JOIN Task AS t ON p.id = t.assigned_to
WHERE t.status = 'Open'
```

Note: If querying through Databricks or an ETL tool, these queries may be translated or mapped to SPARQL or underlying table equivalents in GraphDB.
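For reference, Example 2 might translate to a SPARQL query along these lines when executed natively against GraphDB. The `ex:` prefix and property names are hypothetical; your knowledge graph's actual ontology determines the real IRIs.

```sparql
PREFIX ex: <http://example.org/>

SELECT ?name ?age
WHERE {
  ?person a ex:Person ;
          ex:name ?name ;
          ex:age ?age .
  FILTER (?age > 30)
}
```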


2. Datatype Conversion: GraphDB vs. Databricks

When moving data from GraphDB to Databricks, watch for datatype discrepancies. Typical conversion issues include:

| GraphDB Datatype | Databricks/SQL Datatype |
|------------------|-------------------------|
| `xsd:string`     | `STRING`                |
| `xsd:integer`    | `BIGINT` or `INT`       |
| `xsd:decimal`    | `DECIMAL(precision)`    |
| `xsd:dateTime`   | `TIMESTAMP`             |
| `xsd:date`       | `DATE`                  |
| `xsd:boolean`    | `BOOLEAN`               |

Tip: Always review field mappings and test sample loads to mitigate conversion errors.
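The mapping table above can be captured as a simple lookup when generating target schemas. This is an illustrative sketch, not an official Matillion or Databricks mapping; the `DECIMAL(38,18)` precision is an assumption you should tune to your data.

```python
# Lookup from RDF (XSD) literal types to Databricks SQL column types,
# following the conversion table above.
XSD_TO_DATABRICKS = {
    "xsd:string":   "STRING",
    "xsd:integer":  "BIGINT",
    "xsd:decimal":  "DECIMAL(38,18)",  # assumption: choose precision/scale for your values
    "xsd:dateTime": "TIMESTAMP",
    "xsd:date":     "DATE",
    "xsd:boolean":  "BOOLEAN",
}

def databricks_type(xsd_type, default="STRING"):
    """Return the target column type for an XSD datatype, falling back to STRING."""
    return XSD_TO_DATABRICKS.get(xsd_type, default)
```

Defaulting unknown types to `STRING` is a common safe fallback, since any RDF literal can be represented losslessly as text and cast later in a transformation step.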


3. Initial vs. Incremental Load Patterns

The best-practice data loading pattern for analytics (particularly in Matillion or Databricks pipelines) is:

  • Initial Load: One-time full extraction of the relevant data
  • Incremental Loads: Ongoing extraction of only new or changed records

Use the same Database Query component for both.

3.1 Initial Load Example

On the first load, omit all filter conditions so the full dataset is ingested:

```sql
SELECT id, name, last_modified
FROM Customer
```

  • No WHERE or filter clause is used.
  • Use this only once: during the initial system population.

3.2 Incremental Load Example

For subsequent (incremental) loads, include a filter on a "change tracking" column (such as `last_modified` or an incrementing key):

```sql
SELECT id, name, last_modified
FROM Customer
WHERE last_modified > '${last_load_marker}'
```

  • The filter clause ensures only new or updated records since the last load time are selected.
  • `${last_load_marker}` is typically a variable or parameter passed in from ETL orchestration.
  • This pattern optimizes data movement by transferring only what has changed.
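The two patterns differ only in the presence of the watermark filter, so orchestration code can generate both from one template. This is a sketch of that idea; the table, columns, and marker value are the hypothetical ones from the examples above, and in Matillion the marker would come from a pipeline variable rather than a hard-coded string.

```python
def build_extract_query(table, columns, marker_column=None, last_marker=None):
    """Build the initial or incremental SELECT for a Database Query component.

    With no marker supplied, returns the full-extract query (initial load);
    with a marker, appends the change-tracking filter (incremental load).
    """
    query = f"SELECT {', '.join(columns)}\nFROM {table}"
    if marker_column and last_marker is not None:
        query += f"\nWHERE {marker_column} > '{last_marker}'"
    return query

# Initial load: no filter, full dataset
initial = build_extract_query("Customer", ["id", "name", "last_modified"])

# Incremental load: filter on the stored high-water mark
incremental = build_extract_query(
    "Customer", ["id", "name", "last_modified"],
    marker_column="last_modified",
    last_marker="2024-01-01T00:00:00Z",
)
```

After each incremental run, the orchestration layer should persist the maximum `last_modified` value it loaded, so the next run's marker picks up exactly where the previous one left off.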

Further Reading:
For more details and best practices, see Matillion's Incremental Load Data Replication Strategy.


By strictly separating initial from incremental loads and using parameterized filter clauses, you maximize efficiency and reliability when synchronizing GraphDB with analytical targets like Databricks.

Data Integration Architecture

Loading data in advance of integration exemplifies the "divide and conquer" principle by separating the data loading and transformation steps, a key advantage of the ELT (Extract, Load, Transform) architecture. Within this framework, raw data is first loaded into the Databricks environment, allowing practitioners to then focus solely on data integration and transformation tasks as a distinct phase. This separation is efficient because data transformation — an essential aspect of integration — is best managed using data transformation pipelines. These pipelines streamline complex workflows, ensuring data is cleansed, standardized, and prepared before integration. Furthermore, the ELT architecture enables both transformation and integration operations to be performed natively within the target Databricks database. This offers significant benefits: operations are fast, can be triggered on demand, and easily scale to accommodate large data volumes, all without incurring additional expenses for external data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.