
Integrate data from Virtuoso to Databricks using Matillion

The Virtuoso to Databricks connector enables seamless and timely data transfer to Databricks, ensuring your information remains current without the need for manual coding or complex ETL processes.


Extracting data from Virtuoso to Databricks

Extracting data from Virtuoso is a critical step for organizations seeking to leverage their semantic data in modern analytics platforms such as Databricks. To ensure a successful data migration or integration, it is important to address several foundational tasks. This article will guide you through the key steps involved in extracting data from Virtuoso and loading it into Databricks. We will begin by creating an identity within Virtuoso, establishing the necessary credentials for secure data access. For those using Matillion, we will demonstrate how to check for an existing JDBC driver or acquire one as needed to enable connectivity. The article will also address considerations for ensuring reliable network connectivity between Virtuoso (the source) and Databricks (the target). Finally, we will explore strategies for querying data—both for initial loads and for ongoing incremental updates. By following this structured approach, you will be prepared to execute a smooth and efficient data transfer from Virtuoso to Databricks, laying the groundwork for advanced data analytics and business intelligence.


What is Virtuoso?

Virtuoso is a high-performance, multi-model data server developed by OpenLink Software, notable for its robust support of relational, graph, and document-centric data management within a unified engine. Designed for flexible and interoperable data integration, Virtuoso efficiently handles SQL, SPARQL, and RDF workloads, making it well suited for Semantic Web, linked data, data federation, and traditional relational database applications. Its architecture allows for seamless horizontal scaling and includes advanced features such as ACID-compliant transactions, fine-grained access controls, and high-availability configurations. Additionally, Virtuoso offers extensible web application services, enabling secure hosting of data-driven services and APIs directly within the database environment.


What is Databricks?

Databricks is a unified analytics platform built on Delta Lake, offering ACID transaction support for data lakes. Its cloud-based architecture enables high-performance, scalable DataFrame operations for massive datasets, integrating easily with Apache Spark, MLlib, BI tools, and supporting both streaming and batch data processing. Databricks simplifies structured and unstructured data management, provides fine-grained data governance, time travel for historical analysis, and operates natively on AWS, Azure, and Google Cloud. The platform enables collaborative analytics and machine learning workflows, helping teams speed up data-driven decisions while ensuring enterprise-grade security and compliance.

Why Move Data from Virtuoso into Databricks

The Benefits of Replicating Virtuoso Data into Databricks for Advanced Analytics

A data engineer or architect may wish to copy data from Virtuoso to Databricks for several compelling reasons. Firstly, Virtuoso contains data that is potentially valuable for analysis or decision-making, but its value can be significantly enhanced when integrated with data from other sources. By leveraging Databricks, which is designed for scalable data analytics and processing, engineers can efficiently combine data from Virtuoso with disparate datasets without incurring additional load or performance concerns on the Virtuoso system itself. This approach not only preserves Virtuoso’s operational efficiency but also empowers organizations to extract deeper insights and maximize the usefulness of their diverse data assets.

Creating a User in Virtuoso Database

To create a new user (identity) in your Virtuoso database, you will use SQL commands issued through the Virtuoso Interactive SQL interface (ISQL), the Virtuoso Conductor, or any appropriate database client. The steps outlined below assume you have administrative privileges.

1. Connect to the Database

Access the Virtuoso ISQL shell from your system terminal:

isql 1111 dba your_dba_password

  • 1111 is the default Virtuoso port; replace it if your instance runs on a different port.
  • dba is the default administrator account; supply its password.

Alternatively, use Virtuoso Conductor (web interface) or another SQL client as appropriate.

2. Create the User Account

Execute the following SQL statement in the connected session to create the new user, replacing username and password with your chosen values.

CREATE USER username IDENTIFIED BY 'password';

Example

CREATE USER analyst IDENTIFIED BY 'StrongPassw0rd!';

3. Grant Privileges to the User

After creation, the user has no privileges. Grant specific permissions such as connecting, reading, writing, or administering database objects.

Basic CONNECT privilege

GRANT CONNECT TO username;

Example

GRANT CONNECT TO analyst;

Grant Additional Privileges As Needed

For example, to give the user the ability to SELECT from a schema:

GRANT SELECT ON SCHEMA schema_name TO username;

Or to grant all privileges on a specific table:

GRANT ALL ON TABLE schema_name.table_name TO username;

4. Verifying the User

List all database users to confirm creation:

SELECT * FROM DB.DBA.SYS_USERS;


Note:
- Quoted identifiers in Virtuoso are case-sensitive; unquoted identifiers are not. Follow your security and naming conventions.
- Always use strong passwords when creating user accounts.

Installing the JDBC Driver

At the time of writing, the Virtuoso JDBC driver is not bundled with Matillion Data Productivity Cloud by default. This is due to licensing and redistribution restrictions, so you will need to manually download and install it to enable connectivity.

Downloading the Virtuoso JDBC Driver

  1. Access the Official Download Page:
    Visit the Virtuoso JDBC driver download page at:
    https://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtuosoJDBC

  2. Select the Appropriate Driver:
    Among the options, look specifically for the Type 4 JDBC driver. Type 4 drivers are pure Java implementations and are usually preferred for Matillion integrations due to their portability and performance.

  3. Download the Driver File:
    Download the JAR file for the Virtuoso Type 4 JDBC driver to your local machine. Ensure you comply with any license terms presented during the download process.

Installing the Driver into Matillion Data Productivity Cloud

  1. Review Installation Documentation:
    For detailed, up-to-date instructions on installing external JDBC drivers, refer to the Matillion documentation here:
    https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

  2. Upload the Driver JAR:
    Using your Matillion Agent user interface, navigate to the section for uploading external drivers (sometimes labeled as “Manage Agents” or “Driver Upload”). Follow the prompts to upload the Virtuoso JDBC driver JAR you downloaded earlier.

  3. Verify Installation:
    After upload, verify that the driver is listed as available. You may be required to restart the Matillion Agent or service for the new driver to be recognized.

Using the Driver in Matillion Data Productivity Cloud

Once successfully uploaded, you can configure database connections and use the Virtuoso JDBC driver within your Matillion data pipelines and components. For information on creating and managing database queries using external JDBC drivers in Matillion, consult the following resource:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
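
As a rough illustration, Virtuoso Type 4 JDBC connection strings generally follow the jdbc:virtuoso://host:port/KEY=VALUE/... form described by OpenLink. The helper below is a hypothetical sketch (the function name, hostname, and defaults are our own, not Matillion or OpenLink APIs); confirm the exact parameter names against the Virtuoso JDBC documentation for your driver version.

```python
def virtuoso_jdbc_url(host, port=1111, user="dba", password="dba"):
    """Build a Virtuoso JDBC URL of the form
    jdbc:virtuoso://host:port/UID=.../PWD=...
    (verify parameter names against the OpenLink driver docs)."""
    return f"jdbc:virtuoso://{host}:{port}/UID={user}/PWD={password}"

# Hypothetical host, using the 'analyst' user created earlier:
print(virtuoso_jdbc_url("virtuoso.example.com", 1111, "analyst", "StrongPassw0rd!"))
```

A string in this shape is what you would paste into the connection URL field of a component that accepts an external JDBC driver.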

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your Virtuoso database, you must allow incoming connections to the Virtuoso database according to your deployment type:

  • Full SaaS agent configuration:
    Allow incoming connections from the Matillion-managed IP addresses; the list of required IP addresses is published in the Matillion documentation.

  • Hybrid SaaS deployment:
    Allow incoming connections from your own Virtual Private Cloud (VPC) where the agent is running. You can find helpful tools for testing network access at Matillion Exchange: Check Network Access.

Additionally, if your Virtuoso database host is specified using a DNS name rather than an IP address, ensure that the agent (whether Full SaaS or Hybrid SaaS) is able to resolve the DNS address correctly. Network or firewall rules must not block the necessary DNS queries required for hostname resolution.
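
Before configuring the connection in Matillion, it can help to confirm basic TCP reachability from the machine where the agent runs. The sketch below is a generic socket test (the function name is our own; the host and port are placeholders for your Virtuoso instance):

```python
import socket

def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.
    A False result suggests a firewall rule, DNS failure, or wrong port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder host/port; substitute your Virtuoso endpoint:
print(can_reach("virtuoso.example.com", 1111))
```

Because socket.create_connection resolves the hostname first, a False result here also catches the DNS-resolution failures mentioned above.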

Querying Data from a Virtuoso Database

This guide explains how to query data from a Virtuoso database using SQL SELECT statements, describes datatype conversion considerations when interacting with systems like Databricks, and details best practices for initial and incremental data loads using the Database Query component.

1. Example Virtuoso SQL SELECT Queries

Virtuoso supports standard SQL along with some extensions. Below are example queries:

```
-- Select all columns from a table called customers
SELECT * FROM customers;

-- Select specific columns
SELECT id, name, email FROM customers;

-- Filter results with WHERE
SELECT id, name FROM customers WHERE signup_date > '2023-01-01';

-- Use aggregation
SELECT status, COUNT(*) AS count FROM orders GROUP BY status;

-- Join example
SELECT c.id, c.name, o.order_id, o.amount
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.status = 'COMPLETE';
```

2. Datatype Conversion Considerations

When moving data between Virtuoso and Databricks (or other systems), datatype differences may arise. Some datatype handling tips:

  • Virtuoso’s VARCHAR, INTEGER, DECIMAL, and DATE types generally map well to Databricks data types.
  • Long or custom types such as Virtuoso’s LONG VARCHAR may need explicit casting, e.g. CAST(column_name AS VARCHAR(255)).
  • Timestamps in Virtuoso might need adjustment to Databricks' expected format (such as UTC alignment).

Example:

SELECT id, CAST(comment AS VARCHAR(255)) AS comment FROM feedback;
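
If Virtuoso timestamps are stored as naive local-time strings, the UTC alignment mentioned above can be done in a pipeline step before loading. The snippet below is a sketch under two assumptions that are ours, not Virtuoso defaults: a 'YYYY-MM-DD HH:MM:SS' string format and a known source time zone.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc_iso(ts, source_tz="America/New_York"):
    """Parse a naive timestamp string assumed to be in source_tz
    and render it as a UTC timestamp string for Databricks."""
    naive = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    aware = naive.replace(tzinfo=ZoneInfo(source_tz))
    return aware.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(to_utc_iso("2023-06-01 12:00:00"))  # noon EDT becomes 16:00 UTC
```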

3. Initial and Incremental Loading Pattern

The recommended ETL pattern is to perform:

  1. Once-Off Initial Load:
    The Database Query component runs without filtering, extracting the full dataset.

  2. Incremental Loads:
    Subsequent loads only extract data modified or created since the last load, using a filter (e.g., based on a timestamp or incrementing key).

Both scenarios use identical Database Query components, differing only in the WHERE clause.

Initial Load Example

No filter is applied; all data is extracted:

SELECT id, name, signup_date FROM customers;

Incremental Load Example

A filter is applied, referencing a variable for the last loaded timestamp:

SELECT id, name, signup_date
FROM customers
WHERE signup_date > '${last_loaded_date}';

  • Replace ${last_loaded_date} with your ETL orchestration system’s timestamp parameter.

For more guidance, refer to Matillion’s documentation on incremental loading.
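
The initial/incremental pattern above can be sketched as a small helper that chooses the WHERE clause based on whether a watermark value exists yet. The function and parameter names here are illustrative only, not Matillion APIs; in practice the watermark value would come from your orchestration variable such as ${last_loaded_date}.

```python
def build_extract_query(table, columns, watermark_col, last_loaded=None):
    """Return the initial-load query when no watermark exists,
    otherwise an incremental query filtered on the watermark column."""
    base = f"SELECT {', '.join(columns)} FROM {table}"
    if last_loaded is None:
        return base + ";"
    return f"{base} WHERE {watermark_col} > '{last_loaded}';"

# Initial load (no watermark yet), then an incremental run:
print(build_extract_query("customers", ["id", "name", "signup_date"], "signup_date"))
print(build_extract_query("customers", ["id", "name", "signup_date"], "signup_date", "2023-01-01"))
```

Keeping both runs in one code path mirrors the article's point that the two scenarios use identical Database Query components, differing only in the WHERE clause.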

Data Integration Architecture

Loading data into Databricks before performing integration is an effective way to divide and conquer the data integration process, as it splits the overall workload into two manageable steps: loading and transforming the data. This method, central to the Extract-Load-Transform (ELT) architecture, offers a significant advantage by enabling organizations to quickly bring raw data into the centralized Databricks database without needing to transform it immediately. Data transformation—a core requirement for effective data integration—is then performed using data transformation pipelines, which provide a systematic and reliable way to process and enrich the data. Another key benefit of the ELT approach is that both the data transformation and integration tasks are executed within the Databricks environment itself. This not only accelerates processing due to Databricks’ highly performant compute capabilities but also ensures transformations are available on demand and at scale. Furthermore, by transforming data directly within Databricks, organizations avoid the cost and complexity of maintaining additional data processing infrastructure, leading to a more streamlined, cost-effective integration solution.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.