Extracting data from TigerGraph to Databricks
Extracting data from TigerGraph is an essential task for organizations looking to leverage graph-based insights within modern data platforms such as Databricks. Seamless integration between TigerGraph and Databricks enables powerful analytics and machine learning workflows by making complex, connected data easily accessible for further processing. This article provides a step-by-step guide to efficiently move data from TigerGraph into Databricks. The instructions include creating an appropriate identity in TigerGraph to manage data access and, for Matillion users, ensuring that the correct JDBC driver is available or acquiring it if necessary. We will also address the critical aspect of establishing network connectivity between the source (TigerGraph) and the target (Databricks environment). Finally, you will learn how to query data for an initial load and also set up processes for capturing incremental updates. By following these steps, you will be able to build and maintain a robust data pipeline from TigerGraph to Databricks, ensuring data consistency, security, and performance throughout the process.
What is TigerGraph?
TigerGraph is a high-performance, scalable, native parallel graph database designed to handle complex and large-scale graph analytics. Its architecture leverages parallelism and a native graph storage format to efficiently store, traverse, and process billions of vertices and edges in real-time. TigerGraph supports the GSQL query language, which is both expressive and optimized for deep link analytics required in use cases such as fraud detection, recommendation engines, and supply chain management. It provides ACID compliance, distributed deployment, dynamic schema evolution, and integration with enterprise data infrastructure. TigerGraph's emphasis on scalability and performance makes it well-suited for technical applications demanding low-latency graph traversals and high-throughput analytics.
What is Databricks?
Databricks is a unified analytics platform that builds on Apache Spark to deliver scalable, cloud-based big data processing and machine learning. It uses Delta Lake to add ACID transactions, scalable metadata, and data versioning to cloud storage. Users manage data with familiar SQL syntax and features like schema evolution, time travel, and efficient batch or streaming ingestion. Databricks integrates with major cloud storage, supports interactive notebooks, robust APIs, and strong security, enabling collaborative development of robust data pipelines and advanced analytics at scale.
Why Move Data from TigerGraph into Databricks
Unlocking Analytical Power: Copying Data from TigerGraph to Databricks for Advanced Insights
A data engineer or architect may choose to copy data from TigerGraph into Databricks for several strategic reasons. Although TigerGraph contains data that is potentially valuable—especially due to its ability to represent complex relationships and connected data—its full value is often realized when this data is integrated with information from other enterprise sources. By moving data into Databricks, organizations gain the ability to combine and analyze TigerGraph's data alongside data from disparate systems, facilitating richer analytics and deeper insights. Moreover, by leveraging Databricks for these integration and processing tasks, the operational workload on the TigerGraph database itself is minimized. This approach allows TigerGraph to remain focused on its core functions as a high-performance graph database, while Databricks handles computationally intensive transformations and analyses, resulting in a more scalable and efficient data architecture.
Creating an Identity in TigerGraph
To grant access and manage privileges in a TigerGraph database, you can create new users (also called identities). User management operations are performed via the GraphStudio UI or, more typically for technical audiences, through GSQL command line scripting. Below are instructions to create a user via the GSQL command line interface.
Prerequisites
- Access to the TigerGraph server where the GSQL CLI is available.
- Sufficient privileges (typically, an `admin` user or a user assigned the `USER_CREATOR` privilege).
Steps to Create a User
- Log in to the GSQL CLI
gsql
- Create a New User
Use the `CREATE USER` statement, specifying a username and password. Optionally, assign the user to one or more roles.
CREATE USER username PASSWORD 'user_password';
- Replace `username` with your desired username (case-insensitive; letters, numbers, `_`, and `-` are allowed).
- Replace `user_password` with a secure password (it must meet the server's complexity policy).
- (Optional) Assign Roles to the User
To grant privileges, assign one or more predefined or custom roles to the user on a graph. The basic syntax is:
GRANT ROLE role ON GRAPH graph_name TO username;
For example, to grant the `queryreader` role on a graph named `MyGraph`:
GRANT ROLE queryreader ON GRAPH MyGraph TO username;
Common built-in roles include:
- `reader`
- `writer`
- `queryreader`
- `designer`
- `admin`
- Verify User Creation
To see a list of all users:
SHOW USERS;
Example Script
-- Create a new user
CREATE USER alice PASSWORD 'YourSecur3P@ss!';
-- Assign 'writer' and 'queryreader' roles for the MyGraph graph
GRANT ROLE writer ON GRAPH MyGraph TO alice;
GRANT ROLE queryreader ON GRAPH MyGraph TO alice;
Note: All user management commands must be run from an account with sufficient privilege. Ensure `alice` selects an appropriately strong password.
Refer to the TigerGraph documentation for further details and password requirements.
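For repeatable environments, the user-creation steps above can be scripted. The sketch below is plain Python with no TigerGraph dependency: it only assembles the GSQL statements (the username, password, graph, and roles shown are illustrative), which you would then run through the `gsql` CLI.

```python
def build_user_setup_gsql(username: str, password: str,
                          graph: str, roles: list[str]) -> str:
    """Assemble the GSQL statements that create a user and grant roles."""
    statements = [f"CREATE USER {username} PASSWORD '{password}';"]
    for role in roles:
        statements.append(f"GRANT ROLE {role} ON GRAPH {graph} TO {username};")
    return "\n".join(statements)

# Illustrative values, matching the example script above.
script = build_user_setup_gsql("alice", "YourSecur3P@ss!",
                               "MyGraph", ["writer", "queryreader"])
print(script)
```

In practice you would feed the generated script to `gsql` from your provisioning tooling, keeping the password in a secrets store rather than in source code.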
Installing the JDBC driver
At the time of writing, the TigerGraph JDBC driver is not bundled by default in Matillion Data Productivity Cloud, due to licensing or redistribution restrictions. Follow the process below to acquire and install the driver in your environment.
1. Download the TigerGraph JDBC Driver
- Visit the official TigerGraph documentation page for JDBC drivers: https://docs.tigergraph.com/dev/gsql-queries/jdbc-driver
- Identify and download the Type 4 JDBC driver for TigerGraph, as it is recommended for compatibility and ease of deployment.
- Type 4 drivers are pure Java implementations and do not require native libraries.
2. Prepare for Installation
- Ensure you have appropriate access to your Matillion Data Productivity Cloud Agent, which is required to upload custom JDBC drivers.
3. Install the JDBC Driver into Matillion Data Productivity Cloud
- Follow the step-by-step agent driver upload instructions as described in the Matillion documentation: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- Upload the downloaded TigerGraph JDBC driver JAR file to the correct location as per the agent documentation.
- Restart or refresh any agent or application components that require initialization with new drivers, as instructed in the documentation.
4. Configure and Use the Driver
For guidance on using the custom (external) JDBC driver with database integration components in Matillion Data Productivity Cloud, refer to the official usage instructions:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
Be sure to specify the correct driver class, endpoints, and credentials appropriate to your TigerGraph environment during configuration.
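As an illustration of what that configuration involves, the helper below assembles the pieces a JDBC-based tool typically asks for. The driver class name `com.tigergraph.jdbc.Driver`, the `jdbc:tg:` URL scheme, and the default port 14240 are assumptions based on the TigerGraph JDBC driver's published conventions; verify them against the driver version you actually downloaded.

```python
def build_tigergraph_jdbc_config(host: str, port: int = 14240,
                                 username: str = "", password: str = "",
                                 use_tls: bool = False) -> dict:
    """Assemble the connection settings a JDBC-based tool typically asks for.

    The driver class and URL scheme below are assumptions; confirm them
    against the documentation shipped with your driver JAR.
    """
    scheme = "https" if use_tls else "http"
    return {
        "driver_class": "com.tigergraph.jdbc.Driver",  # class inside the driver JAR
        "jdbc_url": f"jdbc:tg:{scheme}://{host}:{port}",
        "username": username,
        "password": password,
    }

# Illustrative endpoint and credentials.
cfg = build_tigergraph_jdbc_config("tg.example.internal", username="alice")
print(cfg["jdbc_url"])
```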
Checking network connectivity
Before connecting Matillion Data Productivity Cloud to your TigerGraph database, you must ensure that network connectivity is properly configured for your deployment type.
- Full SaaS agent configuration: If you are using the Full SaaS agent configuration, you must allow incoming connections to your TigerGraph database from the set of IP addresses used by Matillion Data Productivity Cloud. The list of IP addresses is maintained in the Matillion documentation. Ensure that your firewall or security group rules permit access from these IPs.
- Hybrid SaaS deployment: For Hybrid SaaS deployments, connections to TigerGraph originate from your own virtual private cloud (VPC) where the Matillion Hybrid SaaS agent is deployed. Therefore, you must allow incoming connections from your VPC's relevant subnets or IP addresses. To assist with network configuration and troubleshooting, you can use the network utilities documented by Matillion.
Additionally, if your TigerGraph database endpoint is specified using a DNS name (rather than a raw IP address), it is essential that the Full SaaS or Hybrid SaaS agent can correctly resolve the DNS address. Make sure that there are no DNS resolution issues or restrictions that could prevent Matillion from accessing the TigerGraph host.
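A quick pre-flight check of both DNS resolution and TCP reachability can save debugging time before wiring up the pipeline. The sketch below uses only the Python standard library; the hostname shown is illustrative, and port 14240 is a common TigerGraph REST endpoint port — substitute whatever your deployment actually exposes.

```python
import socket

def check_endpoint(host: str, port: int, timeout: float = 5.0) -> tuple[bool, str]:
    """Resolve a DNS name, then attempt a TCP connection to host:port."""
    try:
        # Step 1: DNS resolution (catches the DNS issues mentioned above).
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return False, f"DNS resolution failed for {host}: {exc}"
    try:
        # Step 2: TCP reachability (catches firewall / security-group issues).
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"Connected to {host}:{port}"
    except OSError as exc:
        return False, f"TCP connection to {host}:{port} failed: {exc}"

# Illustrative hostname; replace with your TigerGraph endpoint.
ok, message = check_endpoint("tigergraph.example.internal", 14240)
print(message)
```

Run this from the same network location as the Matillion agent, since that is where the connection will originate.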
Querying Data from a TigerGraph Database
TigerGraph is a graph database that uses its own query language, GSQL, rather than SQL. However, when integrating with ETL tools or external analytics platforms like Databricks, it is common to map or transform TigerGraph queries into SQL-like `SELECT` statements for data extraction. Below are practical instructions and examples for querying data from TigerGraph, along with guidance on handling datatype conversion and designing efficient data loads.
Example: TigerGraph Queries & SQL SELECT Statements
TigerGraph GSQL Query Example
Suppose you have a simple graph schema containing `Person` vertices and `Knows` edges. A GSQL query (using SYNTAX v2 pattern matching) to fetch all persons and their friends may look like this:
CREATE QUERY get_friends() FOR GRAPH MyGraph SYNTAX v2 {
  result = SELECT f FROM Person:p -(Knows>)- Person:f;
  PRINT result;
}
Equivalent SQL SELECT Statement
If you are extracting TigerGraph data into a relational system (like Databricks), you might represent the query as:
SELECT p.id AS person_id, f.id AS friend_id
FROM Person p
JOIN Knows k ON p.id = k.from_id
JOIN Person f ON k.to_id = f.id
Note:
Many integration and ETL tools present connectors that internally translate SQL-like queries to GSQL or vice versa, abstracting away the query language differences.
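To make the graph-to-relational correspondence concrete, the sketch below applies the same join logic in plain Python to a tiny, hypothetical in-memory extract of `Person` vertices and `Knows` edges (all names and values are illustrative):

```python
# Hypothetical in-memory extract of a Person/Knows graph.
persons = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"},
           {"id": 3, "name": "Carol"}]
knows = [{"from_id": 1, "to_id": 2}, {"from_id": 3, "to_id": 1}]

def to_relational_rows(persons, knows):
    """Reproduce the SQL join: one (person_id, friend_id) row per Knows edge."""
    by_id = {p["id"]: p for p in persons}
    return [(e["from_id"], e["to_id"])
            for e in knows
            if e["from_id"] in by_id and e["to_id"] in by_id]

print(to_relational_rows(persons, knows))  # → [(1, 2), (3, 1)]
```

Each edge becomes one relational row, which is exactly the shape a tabular target like Databricks expects.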
Datatype Conversion Between TigerGraph and Databricks
When transferring data between TigerGraph and Databricks, datatype conversion is important. Here are some typical mappings:
| TigerGraph Type | Databricks/Spark SQL Type | Notes |
|---|---|---|
| STRING | STRING | Direct mapping. |
| INT | INT | Direct mapping; may upcast to BIGINT. |
| FLOAT | FLOAT / DOUBLE | Use DOUBLE in Databricks as needed. |
| DATETIME | TIMESTAMP | May require explicit conversion. |
| BOOL | BOOLEAN | Direct mapping. |
Best Practices:
- Always validate the schema mapping after loading to ensure data fidelity.
- Use explicit casts in Databricks Spark SQL if auto-detection fails.
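When auto-detection fails, the explicit casts can be generated mechanically from a type-mapping table. The sketch below is a plain-Python illustration that emits a Spark SQL `SELECT` with one `CAST` per column, following the mapping above; the table and column names are hypothetical.

```python
# Mapping from TigerGraph types to Spark SQL types, following the table above.
TG_TO_SPARK = {
    "STRING": "STRING",
    "INT": "BIGINT",       # upcasting guards against TigerGraph's 64-bit INTs
    "FLOAT": "DOUBLE",
    "DATETIME": "TIMESTAMP",
    "BOOL": "BOOLEAN",
}

def build_cast_select(table: str, columns: dict[str, str]) -> str:
    """Emit a Spark SQL SELECT that explicitly casts each column."""
    casts = [f"CAST({name} AS {TG_TO_SPARK[tg_type]}) AS {name}"
             for name, tg_type in columns.items()]
    return f"SELECT {', '.join(casts)} FROM {table}"

# Hypothetical staging table and its TigerGraph source types.
sql = build_cast_select("person_raw",
                        {"id": "INT", "name": "STRING", "last_updated": "DATETIME"})
print(sql)
```

Generating casts from one mapping table keeps the conversion rules in a single place, which simplifies schema validation after each load.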
Best Pattern for Data Loads: Initial then Incremental
To synchronize data from TigerGraph into Databricks (or any analytics environment), the best approach is:
- Once-off initial load: pulls all records (no filters).
- Subsequent incremental loads: pull only new or updated records by using filter predicates.
- Use the same database query component: configure your integration (via ETL tools or JDBC/ODBC queries) to use the same component for both steps, modifying only the query's filter clause as needed.
Example: SQL SELECT for Initial and Incremental Load
Initial Load Query (no filter)
SELECT id, name, last_updated FROM Person
Incremental Load Query (with filter)
SELECT id, name, last_updated
FROM Person
WHERE last_updated > '{{last_loaded_timestamp}}'
`{{last_loaded_timestamp}}` should be dynamically set at runtime (from a control table or parameter store).
References:
For detailed strategies on incremental extraction, see Matillion Exchange: Incremental Load Data Replication Strategy.
Tip: When working with larger datasets or production systems, always test your initial and incremental queries for correctness and performance before deploying them in automated pipelines.
Data Integration Architecture
Loading data in advance of integration, as practiced in ELT (Extract, Load, Transform) architectures, allows teams to divide and conquer data processing by splitting it into two distinct steps: loading raw data first, then transforming and integrating it as needed. This staged approach simplifies the workflow and enhances data management flexibility within Databricks. Data integration inherently requires robust transformation capabilities, and data transformation pipelines are the optimal tool for this task, as they efficiently orchestrate and automate complex transformation logic. A significant advantage of the ELT model in Databricks is that both data transformation and integration are performed directly inside the target Databricks database. This in-database processing enables fast, on-demand, and highly scalable transformations, eliminating the need for separate data processing infrastructure and the associated costs, while leveraging the underlying performance of the Databricks platform.