Extracting data from TiDB to Databricks
Extracting data from TiDB is a foundational step in integrating data from this high-performance, distributed SQL database into broader analytics platforms such as Databricks. Whether you are modernizing your data pipeline or enabling large-scale, cloud-based analytics, moving data efficiently and reliably is essential. In this article, we will guide you through the process of transferring data from TiDB to Databricks. You will learn how to set up an identity in TiDB that has the appropriate permissions for data extraction. For Matillion users, we will cover how to verify or obtain the correct JDBC driver for seamless connectivity. We will also discuss key considerations around establishing secure network connectivity between the source (TiDB) and target (Databricks) environments. Finally, we will outline methods for querying your data—both for an initial data load and for ongoing incremental updates—ensuring your Databricks workspace is kept up to date with the latest information from TiDB.
What is TiDB?
TiDB is an open-source, distributed SQL database developed by PingCAP, designed to provide both strong consistency and horizontal scalability for transactional and analytical workloads. It features a hybrid transaction/analytical processing (HTAP) architecture, integrating a TiKV key-value storage layer and a columnar TiFlash component for real-time analytics. TiDB supports standard MySQL protocols, enabling seamless adoption for existing MySQL applications while offering transparent sharding and automatic failover to achieve high availability. With built-in distributed transactions and Raft-based replication, TiDB ensures data integrity and resilience, making it suitable for cloud-native, mission-critical applications requiring elastic scaling and high performance.
What is Databricks?
Databricks is a cloud-based data platform built on Apache Spark, enabling scalable analytics and machine learning. It introduces the 'lakehouse' model, combining data lakes and warehouses for unified storage, processing, and governance. Supporting formats like Parquet, Delta Lake, and ORC, and offering ACID transactions via Delta Lake, Databricks lets users efficiently ingest, transform, and query large datasets using SQL, Python, Scala, or R. Its collaborative workspace, native cloud storage connectors, and integration with major data tools make Databricks a comprehensive solution for modern big data and AI workflows.
Why Move Data from TiDB into Databricks
Unlocking Advanced Analytics: The Benefits of Integrating TiDB with Databricks
A data engineer or architect may wish to copy data from TiDB into Databricks for several compelling reasons. TiDB often contains business-critical or operational data that holds significant potential value, but this value is best realized when that data is integrated with information from other sources—for example, marketing, sales, or IoT datasets. By leveraging Databricks for this integration, organizations can enrich their analytic capabilities without imposing additional query or computational workload on the TiDB platform itself. Databricks provides a scalable environment for advanced analytics and machine learning, allowing the organization to extract deeper insights while ensuring that day-to-day operations on TiDB remain performant and unaffected.
Creating an Identity in TiDB
To create a user (an "identity") in TiDB, you use standard SQL statements similar to those in MySQL. The following instructions describe the process.
Prerequisites
- You must have sufficient privileges (the `CREATE USER` or `GRANT` privilege) to create users in the TiDB cluster.
- You have access to the TiDB server, typically via the MySQL client or another MySQL-compatible client tool.
Step 1: Log in to TiDB
Connect to TiDB using a privileged account (such as `root`):

mysql -h <tidb-host> -P <tidb-port> -u root -p

Replace `<tidb-host>` and `<tidb-port>` with the address and port of your TiDB instance.
Step 2: Create a New User
Use the `CREATE USER` statement to create a new user. The general syntax is:

CREATE USER '<username>'@'<host>' IDENTIFIED BY '<password>';

- `<username>`: The name of the user to create.
- `<host>`: The host from which the user can connect (use `%` for any host).
- `<password>`: The password for the user.
Example:
To create a user named `developer` that can connect from any host, with the password `DevStrongPwd123`:
CREATE USER 'developer'@'%' IDENTIFIED BY 'DevStrongPwd123';
Step 3: Grant Privileges
After creating the user, assign the necessary privileges using the `GRANT` statement. For example, to grant all privileges on a specific database (`mydb`):
GRANT ALL PRIVILEGES ON mydb.* TO 'developer'@'%';
Or, for read-only access:
GRANT SELECT ON mydb.* TO 'developer'@'%';
Step 4: Apply the Privilege Changes
Privileges take effect immediately in TiDB. However, in some MySQL-compatible systems, you may need to run:
FLUSH PRIVILEGES;
In TiDB, this is generally not required.
Step 5: Verify the User
You can verify the creation and privileges with the following queries:
```
-- List all users
SELECT User, Host FROM mysql.user;

-- Show granted privileges for the new user
SHOW GRANTS FOR 'developer'@'%';
```
For further customization, refer to the TiDB User Account Management documentation.
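If you manage many extraction users, the statements above can also be assembled in code. The sketch below (Python; the helper names `create_user_sql` and `grant_select_sql` are invented for illustration) only builds the SQL strings — executing them still requires a MySQL-compatible client or driver:

```python
def create_user_sql(username: str, host: str, password: str) -> str:
    """Build a CREATE USER statement like the one shown in Step 2."""
    return f"CREATE USER '{username}'@'{host}' IDENTIFIED BY '{password}';"

def grant_select_sql(database: str, username: str, host: str) -> str:
    """Build a read-only GRANT statement, suitable for extraction workloads."""
    return f"GRANT SELECT ON {database}.* TO '{username}'@'{host}';"

# Statements for the 'developer' user from the example above
print(create_user_sql("developer", "%", "DevStrongPwd123"))
print(grant_select_sql("mydb", "developer", "%"))
```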
Installing the JDBC driver
At the time of writing, the JDBC driver for the TiDB database is not bundled with Matillion Data Productivity Cloud by default. This is primarily due to licensing or redistribution restrictions. As a result, you’ll need to download and install the TiDB JDBC driver manually before you can integrate Matillion with your TiDB systems.
To obtain the necessary driver, visit the official TiDB website at: https://pingcap.com/. When searching for the appropriate driver, look specifically for a Type 4 JDBC driver for TiDB, as this type provides better performance and a direct, all-Java connectivity to the database.
Once you have downloaded the appropriate TiDB JDBC driver file (typically a JAR archive), you can upload and install it into Matillion Data Productivity Cloud by following the step-by-step installation process described in the official documentation. See these installation instructions: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/. This guide will outline how to upload the JAR file to your Matillion environment and validate the installation.
After installation, you can follow the official usage instructions to configure and use the newly installed driver within database query components and pipelines, as detailed here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your TiDB database, you must verify that your TiDB instance accepts incoming connections appropriate to your deployment type:
- Full SaaS agent configuration: You must configure your TiDB database to allow incoming connections from the fixed IP addresses used by Matillion. The required list of IP addresses can be found at https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/.
- Hybrid SaaS deployment: If you are using a Hybrid SaaS setup, your TiDB database should permit connections originating from your own virtual private cloud (VPC) where the Matillion agent runs. To help identify network connectivity issues or source addresses, you can use the network access checking utilities provided here: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
Additionally, if you are connecting to your TiDB database using a DNS hostname rather than a direct IP address, make sure that the Full SaaS or Hybrid SaaS agent is able to resolve the hostname to the correct endpoint. This may require appropriate DNS configuration or firewall rules to permit DNS resolution from the agent’s environment.
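As a quick pre-flight check, you can test DNS resolution and TCP reachability of the TiDB endpoint from the agent's environment. The following is a minimal sketch (the function name is invented, and the default port 4000 — TiDB's usual MySQL-protocol port — is an assumption; substitute your own endpoint details):

```python
import socket

def check_tidb_endpoint(host: str, port: int = 4000, timeout: float = 5.0):
    """Return (ok, detail) after checking DNS resolution, then TCP reachability."""
    try:
        ip = socket.gethostbyname(host)  # can the agent resolve the hostname?
    except socket.gaierror as exc:
        return False, f"DNS resolution failed: {exc}"
    try:
        # can the agent open a TCP connection to the resolved address?
        with socket.create_connection((ip, port), timeout=timeout):
            return True, f"TCP connection to {ip}:{port} succeeded"
    except OSError as exc:
        return False, f"TCP connection to {ip}:{port} failed: {exc}"
```

A `False` result on the DNS step points to missing DNS configuration; a `False` result on the TCP step usually points to a firewall or allow-list issue.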
Querying Data from a TiDB Database
This guide provides instructions for querying data from a TiDB database, with examples of SQL SELECT statements and best practices for integrating with environments like Databricks. It also describes how to perform initial and incremental data loads using the same Database Query component. For further information on incremental load strategies, refer to the dedicated article on Matillion Exchange.
Example: Basic TiDB SQL SELECT Statements
You can query data from TiDB using standard SQL. Here are several examples:
```
-- Select all columns from a table
SELECT * FROM sales_orders;

-- Select specific columns
SELECT order_id, customer_id, order_date FROM sales_orders;

-- Filter results with WHERE
SELECT * FROM sales_orders WHERE order_date >= '2024-01-01';

-- Aggregate and group data
SELECT customer_id, SUM(order_total) AS total_spent
FROM sales_orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000;
```
Datatype Conversion between TiDB and Databricks
When extracting data from TiDB for loading into environments like Databricks, be aware that datatype conversions may occur. For example:
- TiDB `INT` often maps to Databricks `LongType`
- TiDB `VARCHAR` maps to Databricks `StringType`
- TiDB `DECIMAL` maps to Databricks `DecimalType`
- TiDB `DATETIME` might map to Databricks `TimestampType`
Check both systems' documentation to ensure that data types are correctly handled during data transfers.
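One way to keep such conversions explicit is a small lookup table in your pipeline code. The mapping below is illustrative only (the name `map_tidb_type` and the fallback behavior are assumptions; verify each entry against your driver and the Databricks documentation):

```python
# Illustrative TiDB -> Databricks (Spark SQL) type mapping; verify before relying on it.
TIDB_TO_DATABRICKS = {
    "INT": "LongType",
    "BIGINT": "LongType",
    "VARCHAR": "StringType",
    "DECIMAL": "DecimalType",
    "DATETIME": "TimestampType",
}

def map_tidb_type(tidb_type: str) -> str:
    """Map a TiDB column type to a Databricks type, ignoring length/precision suffixes."""
    base = tidb_type.split("(", 1)[0].strip().upper()  # e.g. 'varchar(255)' -> 'VARCHAR'
    return TIDB_TO_DATABRICKS.get(base, "StringType")  # fall back to string for unknown types
```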
Initial and Incremental Load Patterns
The recommended pattern is to perform a one-time initial load of data, followed by ongoing incremental loads. The same "Database Query" component—such as in Matillion or similar data pipeline tools—should be used for both.
Initial Load Example
During the first full load, retrieve the entire dataset. No filter clause is required:
SELECT * FROM sales_orders;
Or, for specific columns:
SELECT order_id, customer_id, order_total, order_date FROM sales_orders;
Incremental Load Example
For subsequent loads, add a filter (typically on a timestamp or incremental key) so only new or modified rows are fetched:
-- Assuming order_date or last_updated is the incremental key
SELECT * FROM sales_orders WHERE order_date > '2024-06-01';
Or, for an automatically passed parameter:
SELECT * FROM sales_orders
WHERE last_updated > '${LAST_SUCCESSFUL_LOAD_TIMESTAMP}';
Note: Replace the placeholder with the actual variable or parameter as required by your ETL tool or integration flow.
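The initial/incremental pattern above can be captured in a small helper that switches on whether a high-water mark exists yet. This is a sketch (the function and parameter names are invented, and a production pipeline should bind the watermark as a query parameter rather than interpolating it into the SQL string):

```python
from typing import Optional

def build_extract_query(table: str, key_column: str,
                        last_watermark: Optional[str] = None) -> str:
    """Return a full-load query on the first run, an incremental one thereafter."""
    query = f"SELECT * FROM {table}"
    if last_watermark is None:
        return query + ";"  # initial load: no filter clause
    # incremental load: only rows changed since the last successful run
    return f"{query} WHERE {key_column} > '{last_watermark}';"
```

On each successful run, persist the maximum `key_column` value seen and pass it back in as `last_watermark` next time.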
For more details on designing an optimal incremental load process, see the Matillion Exchange article.
Data Integration Architecture
One of the key benefits of the ELT (Extract, Load, Transform) architecture is its “divide and conquer” approach, where data is first loaded into the Databricks database before any integration work is started. By splitting the process into separate loading and integration steps, organizations gain both flexibility and operational efficiency. Data integration itself often requires substantial transformation, and the most effective way to manage this is through the use of data transformation pipelines, which are designed to handle tasks such as data cleansing, reshaping, and enrichment in a robust and systematic manner. Another significant advantage of the ELT approach is that these data transformations and integration operations take place directly within the target Databricks environment. This not only leverages the scalability and speed of the Databricks platform—allowing transformations and integrations to occur on-demand—but also eliminates the need for specialized, external data processing infrastructure, thereby reducing complexity and operational costs.
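The load/transform split described above can be sketched as two independent stages. This toy example (pure Python, no Spark, with all names invented) only illustrates the ordering: land raw rows first, then run transformation steps inside the target:

```python
def extract_and_load(extract, load):
    """Stage 1 (EL): copy raw rows from the source into the target untransformed."""
    rows = extract()
    load(rows)
    return len(rows)

def transform(rows, *steps):
    """Stage 2 (T): run cleansing/reshaping/enrichment steps in the target environment."""
    for step in steps:
        rows = [step(row) for row in rows]
    return rows

# Toy usage: 'extract' from a list, 'load' into a staging list, then transform.
staged = []
loaded = extract_and_load(lambda: [{"amount": "10"}, {"amount": "25"}], staged.extend)
cleaned = transform(staged, lambda r: {**r, "amount": int(r["amount"])})
```

Because the stages are decoupled, a failed transformation can be rerun against the already-landed data without re-extracting from TiDB.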