Integrate data from Neo4j to Databricks using Matillion

The Neo4j to Databricks connector streamlines the transfer of your data to Databricks within minutes, ensuring it remains current without the need for manual coding or complex ETL processes.


Extracting data from Neo4j to Databricks

Extracting data from Neo4j is an essential step for organizations seeking to maximize the value of their graph databases by integrating insights into advanced analytics platforms such as Databricks. Transferring data from Neo4j to Databricks enables more robust analysis, reporting, and machine learning workflows. In this article, we will guide you through the key stages of this process, beginning with creating a read-only identity in Neo4j to facilitate secure data access. For users working with Matillion, we will outline how to check for the required JDBC driver and acquire it if necessary. Given that data extraction often spans multiple environments, we will also cover strategies for verifying and ensuring the appropriate network connectivity between your Neo4j source and Databricks target. Finally, we will demonstrate approaches for querying and loading data—both for initial full extracts and for subsequent incremental updates. By following these steps, readers will be equipped to build a reliable, repeatable data pipeline from Neo4j to Databricks.


What is Neo4j?

Neo4j is a widely used native graph database designed to efficiently store, manage, and query highly connected data. Unlike traditional relational databases that model data in tables, Neo4j employs nodes, relationships, and properties to naturally represent networked information, thereby excelling at use cases such as knowledge graphs, social networks, and fraud detection. Its query language, Cypher, allows for expressive and intuitive traversal of complex graph structures. Neo4j is ACID-compliant, ensuring data integrity, and supports both transactional and analytical workloads. Additionally, it offers robust scalability options, ranging from single-node deployments to enterprise-grade, horizontally scalable clusters with support for high availability and security features.


What is Databricks?

Databricks is a cloud-based lakehouse platform that unifies analytics and machine learning workloads atop the Delta Lake format, providing ACID transaction support, scalable metadata handling, and robust data governance features. By leveraging its collaborative workspace and native support for streaming and batch data, Databricks facilitates performant ETL pipelines, interactive SQL analytics, and distributed machine learning on a single platform. Its integration with the Lakehouse architecture allows organizations to store raw, structured, and semi-structured data efficiently while maintaining data reliability and consistency, making it suitable for enterprise-scale analytics and real-time processing scenarios.

Why Move Data from Neo4j into Databricks

Unlocking Analytics: The Case for Copying Neo4j Data into Databricks

A data engineer or architect might wish to copy data from Neo4j into Databricks for several compelling reasons. First, Neo4j often stores rich, interconnected data that can provide valuable insights when analyzed alongside other sources. By integrating Neo4j data with diverse datasets within Databricks, organizations can unlock and amplify the potential value embedded in relationships and patterns that might otherwise remain isolated. Furthermore, leveraging Databricks as the platform for data integration and processing ensures that the additional computational workload does not impact the performance of the production Neo4j database. This approach preserves Neo4j’s responsiveness for transactional operations, while enabling scalable analytics and advanced processing in Databricks.

Creating a User in Neo4j

To manage access to your Neo4j database securely, it is best practice to create individual user accounts with appropriate permissions. This process is performed using Cypher statements through the Neo4j Cypher Shell, Browser, or other client interfaces by someone with administrative privileges.

Prerequisites

  • You must have administrative privileges (be a member of the admin role).
  • An active connection to the Neo4j database, either via the Cypher Shell or Neo4j Browser.

Steps to Create a User

1. Use the `CREATE USER` Command

The `CREATE USER` Cypher administrative command creates a new user with a specified password. You can optionally include the `SET PASSWORD CHANGE [NOT] REQUIRED` clause to control whether the user must change their password at the first login (the default is `CHANGE REQUIRED`).

```cypher
CREATE USER alice SET PASSWORD 'securepassword' CHANGE NOT REQUIRED;
```

Replace alice with your desired username and securepassword with the initial password for the user. For production environments, choose a strong password.

2. Optionally, Require Password Change at First Login

To force the user to change the password upon first login (recommended for security), use:

```cypher
CREATE USER bob SET PASSWORD 'initialPassword' CHANGE REQUIRED;
```

3. Grant Roles to the User

After creating the user, they will have no access by default. Use the `GRANT ROLE` command to assign roles such as `reader`, `publisher`, `architect`, or `admin`:

```cypher
GRANT ROLE reader TO alice;
```

It is best practice to grant only the minimal permissions required.

Example: Complete Workflow

```cypher
// Step 1: Create user
CREATE USER charlie SET PASSWORD 'TempPass123!' CHANGE REQUIRED;

// Step 2: Grant appropriate role(s)
GRANT ROLE publisher TO charlie;
```

List Existing Users

To verify that your user was created:

```cypher
SHOW USERS;
```

This will display all users with their corresponding roles and status.


Note: Cypher administrative commands require Neo4j 4.x or newer. Always consult the Neo4j documentation for your deployment specifics.
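The workflow above can also be scripted. The sketch below builds the Cypher administrative statements for a new read-only user; the function, the `etl_reader` username, and the commented-out connection details are illustrative, and actually executing the statements requires the official `neo4j` Python driver, admin credentials, and the `system` database.

```python
# Sketch: assemble the Cypher admin statements for a read-only user.
# Username/password values are placeholders; execution (commented out
# below) requires the official neo4j Python driver and admin rights.

def build_user_statements(username: str, password: str, role: str = "reader"):
    """Return the Cypher admin statements to create a user and grant a role."""
    return [
        f"CREATE USER {username} SET PASSWORD '{password}' CHANGE REQUIRED;",
        f"GRANT ROLE {role} TO {username};",
    ]

# To execute against a live instance (illustrative, not run here):
# from neo4j import GraphDatabase
# with GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "pw")) as driver:
#     for stmt in build_user_statements("etl_reader", "TempPass123!"):
#         driver.execute_query(stmt, database_="system")

for stmt in build_user_statements("etl_reader", "TempPass123!"):
    print(stmt)
```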

Installing the JDBC Driver

At the time of writing, the Neo4j JDBC driver is not bundled with Matillion Data Productivity Cloud due to licensing and redistribution restrictions. Users who wish to connect to Neo4j databases must manually download and install the appropriate JDBC driver.

Downloading the Neo4j JDBC Driver

  1. Visit the official Neo4j JDBC Driver download page:
    https://neo4j.com/developer/neo4j-jdbc/
  2. From the available options, select a Type 4 JDBC driver, as this is a pure Java implementation and provides the broadest compatibility and ease of deployment.

Uploading the Driver to Matillion Data Productivity Cloud

Once you have downloaded the appropriate JDBC .jar file, it needs to be uploaded to your Matillion Agent. Matillion provides step-by-step installation instructions for external JDBC drivers, available at:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

Be sure to follow the guide carefully, ensuring that the driver is placed in the correct directory and that your Agent is restarted or refreshed as described in the documentation.

Using the Driver

After installation, the JDBC driver will be available for use within Matillion Data Productivity Cloud. To learn how to configure database connections and leverage the driver in your workflows, refer to:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

This resource outlines how to specify JDBC URLs, credentials, and other necessary parameters to query your Neo4j database from within Matillion Data Productivity Cloud.
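As a starting point, a Neo4j JDBC connection URL generally takes a form like the one sketched below. The exact scheme varies by driver version (some releases use `jdbc:neo4j:bolt://...`), so treat the host, port, and URL shape here as assumptions and confirm them against the documentation bundled with the driver you downloaded.

```python
# Sketch: assemble a Neo4j JDBC URL for the Matillion Database Query
# component. The URL scheme is illustrative and depends on the driver
# version; verify against your driver's own documentation.

def neo4j_jdbc_url(host: str, port: int = 7687, database: str = "neo4j") -> str:
    """Build a JDBC URL of the form jdbc:neo4j://host:port/database."""
    return f"jdbc:neo4j://{host}:{port}/{database}"

# Placeholder host for illustration:
print(neo4j_jdbc_url("my-neo4j-host.example.com"))
# jdbc:neo4j://my-neo4j-host.example.com:7687/neo4j
```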

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your Neo4j database, you must configure your database (and any firewalls or security groups in front of it) to allow incoming connections from the Matillion agent, according to your deployment type.

Additionally, if you are referencing your Neo4j database using a DNS name (rather than a static IP), you must also ensure that the Full SaaS or Hybrid SaaS agent can resolve this DNS address to connect successfully.
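A quick way to verify reachability is a plain TCP check against the Neo4j Bolt port (7687 by default), run from the network where the Matillion agent executes. The sketch below is a minimal helper; the host name is a placeholder.

```python
# Sketch: TCP reachability check for the Neo4j Bolt port (default 7687).
# Run from the network where the Matillion agent executes; the host below
# is a placeholder.
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder host):
# print(port_reachable("my-neo4j-host.example.com", 7687))
```

Note that a successful TCP connection only proves the network path; authentication and driver configuration are still verified by the Database Query component itself.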

Querying Data from Neo4j: Technical Instructions

Example Neo4j Queries (SQL SELECT Equivalents)

When accessing data from a Neo4j database, the standard query language is Cypher, which has some conceptual similarities to SQL. Below are examples contrasting SQL `SELECT` statements with their Cypher equivalents:

| SQL Example | Cypher Example |
| --- | --- |
| `SELECT * FROM Person` | `MATCH (p:Person) RETURN p` |
| `SELECT name, age FROM Person WHERE age > 30` | `MATCH (p:Person) WHERE p.age > 30 RETURN p.name, p.age` |
| `SELECT * FROM Movie m JOIN Person p ON m.director_id = p.id` | `MATCH (p:Person)-[:DIRECTED]->(m:Movie) RETURN p, m` |

Note:

  • In Cypher, patterns are described using nodes `( )` and relationships `[ ]`.
  • The `WHERE` clauses in SQL map directly to Cypher's `WHERE` filters.
  • `RETURN` in Cypher is analogous to SQL's `SELECT`.

Datatype Conversion between Neo4j and Databricks

When integrating Neo4j with platforms like Databricks, datatype conversion may occur due to differing datatype systems. For example:

  • Neo4j `Integer` may map to Databricks `LongType` or `IntegerType`.
  • Neo4j `Float` maps to Databricks `DoubleType` or `FloatType`.
  • Neo4j `String` maps to Databricks `StringType`.
  • Neo4j-specific types, such as `Point`, may require transformation or serialization to a compatible Databricks type.
Always review the integration documentation and perform data validation during initial data loads.
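The mappings above can be captured in a small lookup table to plan a load. The table below is an assumption for illustration (the target names follow Spark SQL's type system, which Databricks uses); the conversions your JDBC driver actually performs should be confirmed during the initial load.

```python
# Sketch: an illustrative Neo4j -> Databricks (Spark SQL) datatype map.
# This table is an assumption for planning only; validate the actual
# conversions performed by your driver during the first load.

NEO4J_TO_SPARK = {
    "Integer": "LongType",    # Neo4j integers are 64-bit
    "Float":   "DoubleType",  # Neo4j floats are 64-bit
    "String":  "StringType",
    "Boolean": "BooleanType",
    "Point":   None,          # no direct equivalent; serialize (e.g. to WKT/JSON)
}

def spark_type_for(neo4j_type: str) -> str:
    """Look up the target type, flagging types that need transformation."""
    target = NEO4J_TO_SPARK.get(neo4j_type)
    return target if target is not None else "needs transformation"

print(spark_type_for("Integer"))  # LongType
print(spark_type_for("Point"))    # needs transformation
```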

Recommended Data Loading Pattern: Initial and Incremental Loads

The safest and most efficient pattern for querying and loading data from Neo4j to another system (such as Databricks) involves:

  1. Initial (Once-Off) Load:
     • Objective: Bring in all current records from Neo4j.
     • Method: Use the Database Query component without a filtering clause.
     • Example Cypher:

     ```cypher
     MATCH (n:Person) RETURN n
     ```

     • Example pseudo-SQL: `SELECT * FROM Person`

  2. Incremental (Ongoing) Loads:
     • Objective: Regularly update only the changed/new data.
     • Method: Use the same Database Query component, but add a filter to retrieve only records modified after the last load (e.g., by timestamp).
     • Example Cypher:

     ```cypher
     MATCH (n:Person)
     WHERE n.updatedAt > $last_success_time
     RETURN n
     ```

     • Example pseudo-SQL: `SELECT * FROM Person WHERE updatedAt > ?`

Process Notes:

  • Both initial and incremental loads should use the same Database Query logic/component, varying only the filter clause.
  • Maintain a tracking mechanism (such as a watermark/timestamp in a control table) to identify new/modified records for incremental loads.
  • For more details about implementing incremental load strategies, see this Matillion article: https://exchange.matillion.com/articles/incremental-load-data-replication-strategy/
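The watermark mechanism can be sketched as follows. The records and the watermark handling are simulated in memory purely for illustration; in a real pipeline the watermark lives in a control table and the filtered query runs through the Database Query component.

```python
# Sketch: watermark-based incremental load logic, simulated in memory.
# In practice the watermark is persisted in a control table and the
# Cypher query runs via the Database Query component (illustrative only).
from datetime import datetime

def incremental_batch(records, last_success_time):
    """Return records modified after the watermark, plus the new watermark."""
    fresh = [r for r in records if r["updatedAt"] > last_success_time]
    new_watermark = max((r["updatedAt"] for r in fresh), default=last_success_time)
    return fresh, new_watermark

records = [
    {"name": "alice", "updatedAt": datetime(2024, 1, 1)},
    {"name": "bob",   "updatedAt": datetime(2024, 3, 1)},
]

# Initial load: an epoch-start watermark pulls everything.
batch, wm = incremental_batch(records, datetime(1970, 1, 1))
print(len(batch), wm)  # 2 2024-03-01 00:00:00

# Next run: only rows newer than the stored watermark come back.
batch, wm = incremental_batch(records, wm)
print(len(batch))  # 0
```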

Tip: Whether loading into Databricks or another analytics platform, validate your target schema/data types for compatibility after each transfer. This ensures successful end-to-end data movement and integrity.

Data Integration Architecture

Loading data into your system ahead of the integration process exemplifies the "divide and conquer" approach, as it allows you to separate data ingestion from later data integration and transformation steps. This is a key advantage of the Extract, Load, and Transform (ELT) architecture. Data integration itself typically requires comprehensive data transformation, best managed using dedicated transformation pipelines that automate and orchestrate the necessary steps. A further benefit of the ELT approach is that both data transformation and integration are performed directly within the target Databricks database. This structure ensures that data processing is fast, occurs on-demand, and easily scales as data volumes grow, all without the need to invest in or manage additional processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.