Integrate data from TiDB to Databricks using Matillion

Our TiDB to Databricks connector seamlessly transfers your data to Databricks within minutes, ensuring it remains up-to-date—eliminating the need for manual coding or complex ETL processes.


Extracting data from TiDB to Databricks

Extracting data from TiDB is a foundational step in integrating data from this high-performance, distributed SQL database into broader analytics platforms such as Databricks. Whether you are modernizing your data pipeline or enabling large-scale, cloud-based analytics, moving data efficiently and reliably is essential. In this article, we will guide you through the process of transferring data from TiDB to Databricks. You will learn how to set up an identity in TiDB that has the appropriate permissions for data extraction. For Matillion users, we will cover how to verify or obtain the correct JDBC driver for seamless connectivity. We will also discuss key considerations around establishing secure network connectivity between the source (TiDB) and target (Databricks) environments. Finally, we will outline methods for querying your data—both for an initial data load and for ongoing incremental updates—ensuring your Databricks workspace is kept up to date with the latest information from TiDB.


What is TiDB?

TiDB is an open-source, distributed SQL database developed by PingCAP, designed to provide both strong consistency and horizontal scalability for transactional and analytical workloads. It features a hybrid transaction/analytical processing (HTAP) architecture, integrating a TiKV key-value storage layer and a columnar TiFlash component for real-time analytics. TiDB supports standard MySQL protocols, enabling seamless adoption for existing MySQL applications while offering transparent sharding and automatic failover to achieve high availability. With built-in distributed transactions and Raft-based replication, TiDB ensures data integrity and resilience, making it suitable for cloud-native, mission-critical applications requiring elastic scaling and high performance.


What is Databricks?

Databricks is a cloud-based data platform built on Apache Spark, enabling scalable analytics and machine learning. It introduces the 'lakehouse' model, combining data lakes and warehouses for unified storage, processing, and governance. Supporting formats like Parquet, Delta Lake, and ORC, and offering ACID transactions via Delta Lake, Databricks lets users efficiently ingest, transform, and query large datasets using SQL, Python, Scala, or R. Its collaborative workspace, native cloud storage connectors, and integration with major data tools make Databricks a comprehensive solution for modern big data and AI workflows.

Why Move Data from TiDB into Databricks

Unlocking Advanced Analytics: The Benefits of Integrating TiDB with Databricks

A data engineer or architect may wish to copy data from TiDB into Databricks for several compelling reasons. TiDB often contains business-critical or operational data that holds significant potential value, but this value is best realized when that data is integrated with information from other sources—for example, marketing, sales, or IoT datasets. By leveraging Databricks for this integration, organizations can enrich their analytic capabilities without imposing additional query or computational workload on the TiDB platform itself. Databricks provides a scalable environment for advanced analytics and machine learning, allowing the organization to extract deeper insights while ensuring that day-to-day operations on TiDB remain performant and unaffected.

Creating an Identity in TiDB

To create a user (an "identity") in TiDB, you use standard SQL statements similar to those in MySQL. The following instructions describe the process.

Prerequisites

  • You must have sufficient privileges (the `CREATE USER` or `GRANT` privilege) to create users in the TiDB cluster.
  • You have access to the TiDB server, typically via the MySQL client or another MySQL-compatible client tool.

Step 1: Log in to TiDB

Connect to TiDB using a privileged account (such as `root`):

```
mysql -h <tidb-host> -P <tidb-port> -u root -p
```

Replace `<tidb-host>` and `<tidb-port>` with the address and port of your TiDB instance.
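In scripted environments, the same login command can be assembled programmatically. Below is a minimal Python sketch; the helper name and its defaults are illustrative, not part of TiDB or Matillion. It defaults to port 4000, TiDB's standard client port, and uses `-p` so the password is prompted for rather than exposed in the process list.

```python
def build_mysql_command(host, user, port=4000, database=None):
    """Assemble the argument list for a mysql CLI invocation.

    TiDB speaks the MySQL wire protocol, so the standard mysql client
    works; 4000 is TiDB's default client port.
    """
    cmd = ["mysql", "-h", host, "-P", str(port), "-u", user, "-p"]
    if database:
        cmd.append(database)  # optionally select a database at login
    return cmd

# Example: connect to a local TiDB instance as root
print(build_mysql_command("127.0.0.1", "root"))
```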

Step 2: Create a New User

Use the `CREATE USER` statement to create a new user. The general syntax is:

```
CREATE USER '<username>'@'<host>' IDENTIFIED BY '<password>';
```

  • `<username>`: The name of the user to create.
  • `<host>`: The host from which the user can connect (use `%` for any host).
  • `<password>`: The password for the user.

Example:

To create a user named `developer` that can connect from any host, with the password `DevStrongPwd123`:

```
CREATE USER 'developer'@'%' IDENTIFIED BY 'DevStrongPwd123';
```

Step 3: Grant Privileges

After creating the user, assign the necessary privileges using the `GRANT` statement. For example, to grant all privileges on a specific database (`mydb`):

```
GRANT ALL PRIVILEGES ON mydb.* TO 'developer'@'%';
```

Or, for read-only access:

```
GRANT SELECT ON mydb.* TO 'developer'@'%';
```
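When provisioning extraction users across several environments, Steps 2 and 3 can be generated from parameters rather than typed by hand. The following Python sketch is hypothetical (not a Matillion or TiDB feature) and simply emits the two statements shown above for a read-only extraction user:

```python
def extraction_user_sql(username, password, database, host="%"):
    """Return the CREATE USER / GRANT statements for a read-only
    extraction user, matching Steps 2 and 3 above."""
    return [
        f"CREATE USER '{username}'@'{host}' IDENTIFIED BY '{password}';",
        f"GRANT SELECT ON {database}.* TO '{username}'@'{host}';",
    ]

# Print the statements for review before running them in a MySQL client
for stmt in extraction_user_sql("developer", "DevStrongPwd123", "mydb"):
    print(stmt)
```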

Step 4: Apply the Privilege Changes

Privileges take effect immediately in TiDB. However, in some MySQL-compatible systems, you may need to run:

```
FLUSH PRIVILEGES;
```

In TiDB, this is generally not required.

Step 5: Verify the User

You can verify the creation and privileges with the following queries:

```
-- List all users
SELECT User, Host FROM mysql.user;

-- Show granted privileges for the new user
SHOW GRANTS FOR 'developer'@'%';
```


For further customization, refer to the TiDB User Account Management documentation.

Installing the JDBC driver

At the time of writing, the JDBC driver for the TiDB database is not bundled with Matillion Data Productivity Cloud by default. This is primarily due to licensing or redistribution restrictions. As a result, you’ll need to download and install the TiDB JDBC driver manually before you can integrate Matillion with your TiDB systems.

To obtain the necessary driver, start from the official TiDB website at: https://pingcap.com/. Because TiDB is compatible with the MySQL wire protocol, a MySQL Type 4 JDBC driver is commonly used to connect to it; a Type 4 driver is preferable because it is pure Java and connects directly to the database without native client libraries, which generally gives better performance and simpler deployment.

Once you have downloaded the appropriate TiDB JDBC driver file (typically a JAR archive), you can upload and install it into Matillion Data Productivity Cloud by following the step-by-step installation process described in the official documentation. See these installation instructions: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/. This guide will outline how to upload the JAR file to your Matillion environment and validate the installation.

After installation, you can follow the official usage instructions to configure and use the newly installed driver within database query components and pipelines, as detailed here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
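When configuring the driver, the connection is identified by a JDBC URL. Since TiDB uses the MySQL protocol, the URL follows the familiar `jdbc:mysql://` format with TiDB's default port 4000. A small sketch for assembling it; the helper name and defaults are illustrative, and `useSSL` is a standard MySQL Connector/J property:

```python
def tidb_jdbc_url(host, database, port=4000, use_ssl=True):
    """Build a MySQL-protocol JDBC URL for TiDB (default port 4000)."""
    url = f"jdbc:mysql://{host}:{port}/{database}"
    if use_ssl:
        url += "?useSSL=true"  # request an encrypted connection
    return url

print(tidb_jdbc_url("tidb.example.com", "mydb"))
# jdbc:mysql://tidb.example.com:4000/mydb?useSSL=true
```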

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your TiDB database, verify that your TiDB instance accepts incoming connections (by default on port 4000) from the environment where the connection originates: from Matillion's cloud infrastructure for a Full SaaS deployment, or from the network where your agent runs for a Hybrid SaaS deployment. Firewall and security-group rules must permit this traffic.

Additionally, if you are connecting to your TiDB database using a DNS hostname rather than a direct IP address, make sure that the Full SaaS or Hybrid SaaS agent is able to resolve the hostname to the correct endpoint. This may require appropriate DNS configuration or firewall rules to permit DNS resolution from the agent’s environment.
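A quick way to validate both DNS resolution and TCP reachability from the agent's environment is a short script like the one below. The hostname and port are placeholders for your own instance, and the function name is illustrative:

```python
import socket

def check_reachable(host, port=4000, timeout=3.0):
    """Resolve the hostname, then attempt a TCP connection to the
    TiDB port. Returns (resolved_ips, reachable)."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        ips = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return [], False          # DNS resolution failed
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return ips, True      # TCP handshake succeeded
    except OSError:
        return ips, False         # resolved, but port unreachable

print(check_reachable("localhost", port=4000))
```

If the address list comes back empty, fix DNS first; if the address resolves but the port is unreachable, look at firewall rules between the agent and TiDB.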

Querying Data from a TiDB Database

This guide provides instructions for querying data from a TiDB database, with examples of SQL SELECT statements and best practices for integrating with environments like Databricks. It also describes how to perform initial and incremental data loads using the same Database Query component. For further information on incremental load strategies, refer to the dedicated article on Matillion Exchange.


Example: Basic TiDB SQL SELECT Statements

You can query data from TiDB using standard SQL. Here are several examples:

```
-- Select all columns from a table
SELECT * FROM sales_orders;

-- Select specific columns
SELECT order_id, customer_id, order_date FROM sales_orders;

-- Filter results with WHERE
SELECT * FROM sales_orders WHERE order_date >= '2024-01-01';

-- Aggregate and group data
SELECT customer_id, SUM(order_total) AS total_spent
FROM sales_orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
```


Datatype Conversion between TiDB and Databricks

When extracting data from TiDB for loading into environments like Databricks, be aware that datatype conversions may occur. For example:

  • TiDB `INT` often maps to Databricks `LongType`
  • TiDB `VARCHAR` maps to Databricks `StringType`
  • TiDB `DECIMAL` maps to Databricks `DecimalType`
  • TiDB `DATETIME` might map to Databricks `TimestampType`

Check both systems' documentation to ensure that data types are correctly handled during data transfers.
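The pairings above can be encoded as a simple lookup when validating a source schema before transfer. Note these are typical Spark SQL inferences rather than an official specification, so verify them against the documentation for your versions; the helper below is an illustrative sketch:

```python
# Typical TiDB -> Databricks (Spark SQL) type pairings from the list above
TIDB_TO_SPARK = {
    "INT": "LongType",
    "VARCHAR": "StringType",
    "DECIMAL": "DecimalType",
    "DATETIME": "TimestampType",
}

def map_column_types(columns):
    """Map {column: tidb_type} to Spark types; unknown types map to
    None so they can be flagged for manual review."""
    return {name: TIDB_TO_SPARK.get(t.upper()) for name, t in columns.items()}

print(map_column_types({"order_id": "INT", "order_date": "DATETIME"}))
```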


Initial and Incremental Load Patterns

The recommended pattern is to perform a one-time initial load of data, followed by ongoing incremental loads. The same "Database Query" component—such as in Matillion or similar data pipeline tools—should be used for both.

Initial Load Example

During the first full load, retrieve the entire dataset. No filter clause is required:

```
SELECT * FROM sales_orders;
```

Or, for specific columns:

```
SELECT order_id, customer_id, order_total, order_date FROM sales_orders;
```

Incremental Load Example

For subsequent loads, add a filter (typically on a timestamp or incremental key) so only new or modified rows are fetched:

```
-- Assuming order_date or last_updated is the incremental key
SELECT * FROM sales_orders
WHERE order_date > '2024-06-01';
```

Or, using a parameter passed in automatically by the pipeline:

```
SELECT * FROM sales_orders
WHERE last_updated > '${LAST_SUCCESSFUL_LOAD_TIMESTAMP}';
```

Note: Replace the placeholder with the actual variable or parameter as required by your ETL tool or integration flow.
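The initial and incremental queries differ only in the WHERE clause, so some teams template them with a high-water-mark value tracked between runs. A hedged Python sketch of that pattern; the function and variable names are illustrative, not Matillion syntax:

```python
def incremental_query(table, key_column=None, last_value=None):
    """Return a full-load query when no watermark is known, otherwise
    an incremental query filtering on the high-water-mark column."""
    base = f"SELECT * FROM {table}"
    if key_column is None or last_value is None:
        return f"{base};"  # initial load: no filter clause
    return f"{base} WHERE {key_column} > '{last_value}';"

print(incremental_query("sales_orders"))                              # initial load
print(incremental_query("sales_orders", "last_updated", "2024-06-01"))  # incremental
```

After each successful run, store the maximum `last_updated` value seen so the next run can pass it back in as `last_value`.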

For more details on designing an optimal incremental load process, see the Matillion Exchange article.


Data Integration Architecture

One of the key benefits of the ELT (Extract, Load, Transform) architecture is its "divide and conquer" approach, where data is first loaded into Databricks before any integration work is started. By splitting the process into separate loading and integration steps, organizations gain both flexibility and operational efficiency. Data integration itself often requires substantial transformation, and the most effective way to manage this is through data transformation pipelines, which are designed to handle tasks such as data cleansing, reshaping, and enrichment in a robust and systematic manner. Another significant advantage of the ELT approach is that these transformation and integration operations take place directly within the target Databricks environment. This not only leverages the scalability and speed of the Databricks platform, allowing transformations and integrations to occur on demand, but also eliminates the need for specialized external data processing infrastructure, thereby reducing complexity and operational costs.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.