Integrate data from AllegroGraph to Databricks using Matillion

The AllegroGraph to Databricks connector enables seamless, up-to-date data transfers to Databricks within minutes, eliminating the need for manual coding or complex ETL scripts.

Extracting data from AllegroGraph to Databricks

Extracting data from AllegroGraph is a key step for organizations looking to leverage semantic data in advanced analytics platforms such as Databricks. This article provides a practical guide to moving data from an AllegroGraph triple store into Databricks, outlining the important steps and considerations involved in the process.

We begin by covering the creation of an appropriate identity in AllegroGraph, ensuring that the data export is secure and compliant with access controls. For those using Matillion as the ETL tool, we explain how to check for, and if necessary acquire, the correct JDBC driver to facilitate a direct connection between AllegroGraph and Databricks. Next, we address the critical aspect of network connectivity, detailing what is required for the source and target systems to communicate efficiently and securely. Finally, we guide you through querying your data, covering both initial full extractions and the subsequent incremental loads that keep your analytics platform in sync with ongoing changes.

By following these steps, you will be able to establish a reliable and maintainable data pipeline between AllegroGraph and Databricks, setting the stage for deeper data analysis and insight generation.


What is AllegroGraph?

AllegroGraph, developed by Franz Inc., is an enterprise graph database for storing and querying complex, interconnected data using RDF and SPARQL. It supports both property graph and RDF data models, ACID transactions, geospatial and temporal reasoning, and multi-modal indexing. AllegroGraph scales horizontally in distributed environments and offers semantic reasoning via OWL and RDFS inferencing. It integrates with big data tools and machine learning frameworks, and supports JSON-LD, labeled property graphs, and RESTful APIs. This makes it suitable for knowledge graph construction, enterprise data integration, and advanced analytics across healthcare, finance, and government intelligence domains.

What is Databricks?

The Databricks Lakehouse Platform is a unified analytics engine that merges data warehouse and data lake features. Built on Apache Spark, it enables scalable processing and management of structured and unstructured data. Key innovations like Delta Lake provide ACID transactions, schema enforcement, and time travel for robust, reliable pipelines. Databricks offers a collaborative workspace, streamlining data engineering, science, and machine learning workflows. Users can interact using SQL, Python, R, or Scala, and integrate with diverse data sources and visualization tools. This platform simplifies big data tasks, making analytics more accessible and efficient for organizations.

Why Move Data from AllegroGraph into Databricks

Unlocking Advanced Analytics: The Case for Copying Data from AllegroGraph to Databricks

A data engineer or architect may consider copying data from AllegroGraph into Databricks for several compelling reasons. First and foremost, AllegroGraph often contains rich, potentially valuable data—such as semantic graphs or linked data—that can provide deep insights when analyzed effectively. By integrating such data with information from other sources within Databricks, organizations can unlock its full value, enabling more comprehensive analytics and revealing patterns that may not be apparent within the isolated dataset. Furthermore, performing the data integration and analytical workloads within Databricks helps minimize the strain on the AllegroGraph database itself. This approach ensures that resource-intensive operations, such as complex joins or large-scale processing, do not adversely impact AllegroGraph’s performance or availability, thereby maintaining system reliability while facilitating advanced data analysis.

Creating an Identity in AllegroGraph

To manage access to your AllegroGraph repositories, you need to create users (identities) and assign appropriate privileges. The process can be completed using the AllegroGraph WebView interface, the agtool command-line utility, or through HTTP requests to the REST API. Below are step-by-step instructions for these common methods.


1. Using the AllegroGraph WebView Interface

  1. Log in to the AllegroGraph WebView at `http://<your-server>:<port>/`.
  2. Navigate to "Users" in the administration panel.
  3. Click the "Add User" button.
  4. Fill in the required details:
    • Username: The login name for the user.
    • Password: Set the initial password.
    • Optionally, configure description, email, etc.
  5. Set the global privileges and repository-specific privileges as needed.
  6. Click "Create" to finish.

2. Using the agtool Command-Line Utility

If you prefer to script or automate user management, use the `agtool user` command:

```bash
agtool user create <username> <password> \
    --description "Optional Comment" \
    --roles admin \
    --email [email protected]
```

Replace `<username>` and `<password>` with the desired credentials. Adjust options as required:

  • Use `--roles user` or `--roles admin` depending on intended privileges.

Example:

```bash
agtool user create alice secretpw --roles user --email [email protected]
```


3. Using the HTTP REST API

You can directly issue HTTP requests to the AllegroGraph server. The following example shows how to create a user using `curl` (suitable for scripting):

```bash
curl -u <admin>:<admin-password> \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"id":"bob", "password":"P@ssw0rd!", "roles":["user"], "email":"[email protected]"}' \
    http://<host>:<port>/admin/users
```
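For scripted deployments, the same request can be built in Python with the standard library alone. This is a hedged sketch: the `/admin/users` endpoint and payload fields mirror the curl example above, and the host, port (10035 is AllegroGraph's default), and credentials are placeholder assumptions to replace with your own values.

```python
import base64
import json
import urllib.request

def build_create_user_request(base_url, admin, admin_password, user):
    """Build (but do not send) the POST request that creates a user.

    Endpoint and payload shape follow the curl example in this article;
    verify them against your server's REST API documentation.
    """
    credentials = base64.b64encode(
        f"{admin}:{admin_password}".encode()
    ).decode()
    return urllib.request.Request(
        f"{base_url}/admin/users",
        data=json.dumps(user).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Equivalent of curl's -u admin:password flag
            "Authorization": f"Basic {credentials}",
        },
    )

# Placeholder values; send the request with urllib.request.urlopen(req)
req = build_create_user_request(
    "http://localhost:10035", "admin", "admin-password",
    {"id": "bob", "password": "P@ssw0rd!",
     "roles": ["user"], "email": "[email protected]"},
)
```

Separating request construction from sending keeps the credential handling testable without a live server.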

Parameters:

  • `id`: username
  • `password`: initial password
  • `roles`: list including "user" or "admin"
  • `email`: (optional) user email

Replace `<admin>`, `<admin-password>`, `<host>`, and `<port>` with the appropriate values.


Notes

  • Administrative privileges are required to create new user identities.
  • Password policies are configurable via AllegroGraph's administration settings.
  • Users and their privileges can be modified or removed at any time using similar steps.

For further detail, consult the AllegroGraph User Authentication documentation.

Installing the JDBC Driver

The AllegroGraph JDBC driver enables connectivity between Matillion Data Productivity Cloud and the AllegroGraph database. However, due to licensing or redistribution restrictions, this driver is not bundled with Matillion by default and must be downloaded and installed manually.

To install the AllegroGraph JDBC driver into Matillion Data Productivity Cloud, follow these steps:

1. Obtain the JDBC Driver

Visit the official AllegroGraph JDBC documentation and download page at:
https://franz.com/agraph/support/documentation/current/jdbc/

On the download page, select the Type 4 JDBC driver package, as this type is the most suitable for cloud data platform integrations such as Matillion. The Type 4 driver is a "pure" Java driver and does not require any native client libraries.

2. Install the Driver in Matillion Data Productivity Cloud

Once you have downloaded the appropriate `.jar` file for the AllegroGraph JDBC Type 4 driver, you need to upload and install it in your Matillion environment.

Refer to Matillion’s official guide for uploading external JDBC drivers here:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

This guide provides step-by-step instructions specific to uploading and registering the driver jar file in your Matillion Agent, including setting driver class names and confirming successful registration.

3. Use the Driver

After installing the driver, you can configure and use it in your Matillion Data Productivity Cloud projects by following the standard procedures for creating and managing database integrations.

For guidance on using custom database drivers in pipeline or Designer components, consult the usage documentation at:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

These instructions will help you to set up and validate connections to AllegroGraph using the newly installed JDBC driver within your Matillion workflows.
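To illustrate how the connection settings fit together, the sketch below assembles a JDBC-style connection URL from host, port, and repository name. Note the hedging: the `jdbc:allegrograph://` scheme and the catalog/repository path layout are illustrative assumptions, not a confirmed format; the real URL syntax is defined by the driver you downloaded, so check its documentation before configuring the Matillion connection.

```python
from urllib.parse import quote

def build_jdbc_url(host, port, repository, catalog=None):
    """Assemble a JDBC-style URL for an AllegroGraph repository.

    The scheme and path structure here are assumptions for
    illustration; consult the downloaded driver's documentation
    for the actual URL format it expects.
    """
    path = f"/repositories/{quote(repository)}"
    if catalog:
        # Repositories may live inside a named catalog
        path = f"/catalogs/{quote(catalog)}" + path
    return f"jdbc:allegrograph://{host}:{port}{path}"

url = build_jdbc_url("agraph.example.com", 10035, "people")
# Paste the resulting URL into the Database Query component's
# connection settings, together with the driver class name you
# registered when uploading the jar.
```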

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your AllegroGraph database, you must configure your network to allow incoming connections to the database. The required configuration depends on your deployment type:

  • Full SaaS Agent Configuration:
    You must allow incoming connections to your AllegroGraph database from the IP addresses used by Matillion's SaaS agents. The full and current list of required IP addresses is maintained here: Matillion Allowing IP Addresses Documentation. Add these IP addresses to your allowlist or firewall rules in your AllegroGraph environment.

  • Hybrid SaaS Deployment:
    Incoming connections should be allowed from your own Virtual Private Cloud (VPC) where the Matillion agent is running. This typically involves configuring your AllegroGraph firewall or network security group to accept inbound connections from your specified VPC CIDR block or agent IP addresses.

To help verify network connectivity, you can use utilities available at the Matillion Exchange: Check Network Access.

DNS Resolution:
If your AllegroGraph database is referenced via a DNS hostname instead of a direct IP address, ensure that the DNS record is publicly resolvable (for Full SaaS) or resolvable from within your VPC (for Hybrid SaaS). The relevant Matillion agent must be able to resolve the database hostname in order to establish a connection.
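Before configuring the pipeline, a quick TCP-level pre-flight check run from the machine hosting the Matillion agent can verify DNS resolution and firewall/allowlist rules in one step. This standard-library Python sketch is generic; the hostname and port in the comment are placeholders for your own deployment.

```python
import socket

def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds.

    Exercises both DNS resolution and any firewall or allowlist
    rules between this machine and the database.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False

# Example (placeholder host; 10035 is AllegroGraph's default port):
# can_reach("agraph.example.com", 10035)
```

If this returns False for the Full SaaS configuration, re-check that Matillion's published agent IP ranges are on your allowlist.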

Querying Data from an AllegroGraph Database

This guide provides step-by-step instructions for querying data from an AllegroGraph database, with a focus on integration patterns, potential datatype conversions (e.g., AllegroGraph to Databricks), and best practices for initial and incremental data loads using the Database Query component.


1. AllegroGraph Query Examples (SQL-Style SELECT Statements)

While AllegroGraph natively supports SPARQL for querying RDF graphs, it also offers Prolog-based querying and, through AGWebView, an SQL-like SELECT interface for simpler data extraction. Here are example AllegroGraph queries in SQL SELECT syntax:

```sql
-- Example: Select all records from a table
SELECT * FROM people;

-- Example: Select specific columns from a class (table-equivalent)
SELECT name, age, occupation FROM people;

-- Example: Apply a WHERE clause to filter data
SELECT * FROM people WHERE age >= 30;

-- Example: Limit the number of results
SELECT name, email FROM people LIMIT 10;
```

Replace `people` with the IRI or table equivalent in your schema.


2. Datatype Conversion (AllegroGraph vs. Databricks)

When extracting data from AllegroGraph to load into Databricks, datatype conversion may be necessary. Consider the following mapping examples:

| AllegroGraph Datatype | Typical SQL/Databricks Equivalent |
|---|---|
| xsd:string | STRING / VARCHAR |
| xsd:int | INT |
| xsd:float | FLOAT |
| xsd:dateTime | TIMESTAMP / DATETIME |
| xsd:boolean | BOOLEAN |

Conversion is often handled automatically by ETL/ELT tools, but review your schemas for custom or unsupported types.
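A mapping like the one above can also be encoded directly in pipeline code, which makes the fallback behaviour for custom types explicit. This Python sketch is illustrative only: the extra `xsd:integer` and `xsd:double` entries and the STRING fallback are assumptions to adjust for your own schema.

```python
# Illustrative mapping from XSD datatypes (as used in AllegroGraph
# RDF literals) to Databricks SQL column types.
XSD_TO_DATABRICKS = {
    "xsd:string": "STRING",
    "xsd:int": "INT",
    "xsd:integer": "BIGINT",    # assumption: unbounded integers -> BIGINT
    "xsd:float": "FLOAT",
    "xsd:double": "DOUBLE",     # assumption: not in the table above
    "xsd:dateTime": "TIMESTAMP",
    "xsd:boolean": "BOOLEAN",
}

def databricks_type(xsd_type, default="STRING"):
    """Look up the Databricks column type for an XSD datatype.

    Falls back to STRING for custom or unsupported types, since any
    RDF literal can at least be carried across as text.
    """
    return XSD_TO_DATABRICKS.get(xsd_type, default)
```

Defaulting unknown types to STRING keeps the load from failing; you can tighten the column types in a later transformation step.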


3. Using the Database Query Component for Initial and Incremental Loads

The most effective approach for loading data from AllegroGraph into another platform (such as Databricks) is a one-off initial load followed by regular incremental loads. Both patterns use the same Database Query component, and this pairing is especially common in integration platforms like Matillion.

Initial Load Pattern

  • Purpose: Extract all existing data at once (full load).
  • SQL Query: No filter clause is present.

```sql
SELECT * FROM people;
```

Invoke this in the Database Query component to load the full dataset into your staging/processing environment.

Incremental Load Pattern

  • Purpose: Extract only records that have changed or been added since the last run.
  • SQL Query: Include a filter clause, typically on a timestamp or an auto-incrementing ID.
  • Resource: Matillion Incremental Load Strategy

```sql
SELECT * FROM people
WHERE modified_date > '${last_load_date}';
```

  • `${last_load_date}` is dynamically set to the most recent timestamp (tracked in your orchestration logic).

Best Practices:

  • Use the same Database Query component definition for both initial and incremental loads.
  • During initial loads, omit the filter to retrieve all data.
  • During incremental loads, dynamically generate the WHERE clause, filtering on your change tracking field.

For more reading on incremental load strategies, visit:
Matillion Community: Incremental Load and Data Replication Strategy

Data Integration Architecture

Loading data in advance of integration is a fundamental strategy that exemplifies the "divide and conquer" approach within modern data engineering. By splitting the process into two clear steps, first loading raw data into the Databricks environment and then transforming and integrating it later, you take full advantage of the ELT (Extract, Load, Transform) architecture. Data integration inherently relies on efficient data transformation, and the most effective way to achieve this is through purpose-built data transformation pipelines. These pipelines operationalize complex transformation logic, making the process both repeatable and manageable.

ELT provides further advantages by moving data transformation and integration workloads into Databricks itself. This in-database processing is fast, on-demand, and highly scalable, leveraging Databricks' native capabilities; moreover, it eliminates the need for, and the cost of, external data processing infrastructure, further simplifying data workflows.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.