Integrate data from TiDB to Snowflake using Matillion

Our TiDB to Snowflake connector streamlines data transfer to Snowflake within minutes, ensuring your data remains current—all without manual coding or complex ETL processes.


Extracting data from TiDB to Snowflake

Extracting data from TiDB is an essential task for organizations seeking to integrate their operational data with cloud-based analytics platforms such as Snowflake. Efficiently moving data from TiDB, a distributed SQL database, to Snowflake enables advanced analytics, reporting, and business intelligence use cases. In this article, we will guide you through a practical approach for extracting data from TiDB and loading it into Snowflake.

We will begin by detailing the steps required to create an identity within TiDB, ensuring appropriate permissions and secure access for data extraction. Next, for users leveraging Matillion as their ETL tool, we will discuss how to check for or acquire the necessary JDBC driver to establish connectivity with TiDB. Network connectivity between your TiDB source and Snowflake target is a key prerequisite, so we will outline best practices for establishing and verifying secure data connections.

Finally, we will demonstrate methods for querying data from TiDB—both for initial full loads and for ongoing incremental extracts. By the end of this article, you will have a clear understanding of how to configure, extract, and move your data from TiDB into Snowflake efficiently and securely.


What is TiDB?

TiDB, developed by PingCAP, is an open-source, distributed SQL database offering horizontal scalability, high availability, and strong consistency for mission-critical workloads. Its hybrid transactional/analytical processing (HTAP) model supports simultaneous OLTP and OLAP tasks without duplicating data. TiDB is MySQL compatible, enabling smooth migration with minimal changes. Powered by TiKV, a distributed transactional key-value store, it guarantees ACID compliance. Dynamic data scheduling ensures balanced workload and resilience. Cloud-native and high-performance, TiDB suits modern, data-intensive applications, making it a strong choice for organizations needing flexibility and robust analytics and transaction capabilities within a unified platform.


What is Snowflake?

Snowflake is a cloud-native data platform offering scalable, high-performance data warehousing and analytics. Its unique multi-cluster architecture separates compute from storage, enabling independent resource scaling for diverse workloads and concurrent users without performance loss. Fully managed, Snowflake eliminates administrative tasks and integrates seamlessly with AWS, Azure, and Google Cloud. It supports ANSI SQL, semi-structured data (like JSON and Parquet), robust security, role-based access, and advanced features such as time travel, zero-copy cloning, and native data sharing, making it a comprehensive solution for modern data engineering and analytics needs.

Why Move Data from TiDB into Snowflake

Unlocking Analytics: Copying Data from TiDB to Snowflake for Enhanced Insights

A data engineer or architect may wish to copy data from TiDB into Snowflake for several compelling reasons. Firstly, TiDB often contains operational and transactional data that is potentially valuable for analytics and business intelligence purposes. However, this value is best realized when the TiDB data is integrated with information from other critical sources such as sales, marketing, or external datasets. Performing such integration within Snowflake allows organizations to harness powerful analytics and reporting capabilities without imposing additional workload or query complexity on the TiDB system itself. This approach not only protects TiDB’s primary function as a transactional database but also leverages Snowflake’s scalable and flexible architecture to aggregate, transform, and analyze diverse data sets more efficiently.

Creating a User in TiDB

To create a new user in a TiDB database, you use SQL statements similar to those in MySQL, as TiDB is compatible with the MySQL protocol and syntax. You must have the `CREATE USER` privilege, or the `INSERT` privilege on the `mysql` database, to create new users.

Follow the steps below to create a user:

1. Connect to TiDB

Use a MySQL client or any compatible SQL tool to connect to your TiDB server:

```shell
mysql -h <tidb-host> -P <tidb-port> -u <root-user> -p
```

Replace `<tidb-host>`, `<tidb-port>`, and `<root-user>` with your actual hostname, port, and administrator username.

2. Create a New User

Execute the `CREATE USER` statement with the desired username and password:

```sql
CREATE USER 'new_user'@'%' IDENTIFIED BY 'user_password';
```

  • `'new_user'` is the username for the new account.
  • `'%'` specifies that the user can connect from any host. Replace it with a specific host IP or hostname if required.
  • `'user_password'` is the password for the user.

Example:

```sql
CREATE USER 'app_user'@'localhost' IDENTIFIED BY 'S3cur3P@ssword!';
```

3. (Optional) Grant Privileges

After creating the user, you typically grant privileges to allow them to interact with databases or tables:

```sql
GRANT SELECT, INSERT, UPDATE ON your_database.* TO 'app_user'@'localhost';
```

  • `SELECT, INSERT, UPDATE` are example privileges; adjust as needed.
  • `your_database.*` specifies the scope (all tables in the named database).

In TiDB, privileges granted with `GRANT` take effect immediately. If you instead modify the privilege tables directly (for example, with `INSERT` statements against the `mysql` database), reload them by running:

```sql
FLUSH PRIVILEGES;
```

4. Verify User Creation

To confirm the user was created successfully:

```sql
SELECT user, host FROM mysql.user WHERE user = 'app_user';
```

This statement returns the username and host for the created account.


Note: Always follow your organization's security policies for username, password selection, and privilege assignment.
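To keep the account name, host, and privilege list defined in one place, the provisioning statements from Steps 2 and 3 can be rendered from parameters. This is a minimal illustrative sketch, not part of TiDB or Matillion tooling; the helper name `render_user_sql` is an assumption.

```python
def render_user_sql(user: str, host: str, password: str,
                    privileges: list[str], database: str) -> list[str]:
    """Return the CREATE USER and GRANT statements for one TiDB account."""
    return [
        # Step 2: create the account, restricted to the given host
        f"CREATE USER '{user}'@'{host}' IDENTIFIED BY '{password}';",
        # Step 3: grant only the privileges the pipeline actually needs
        f"GRANT {', '.join(privileges)} ON {database}.* TO '{user}'@'{host}';",
    ]

for stmt in render_user_sql("app_user", "localhost", "S3cur3P@ssword!",
                            ["SELECT", "INSERT", "UPDATE"], "your_database"):
    print(stmt)
```

Generating the statements this way makes it easy to review exactly which privileges an extraction account receives before running them against the cluster.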

Installing the JDBC driver

At the time of writing, the JDBC driver required for connecting Matillion Data Productivity Cloud to TiDB is not bundled with the Matillion environment by default. This is primarily due to licensing or redistribution restrictions associated with the driver. To enable TiDB connectivity, users must manually obtain and install the driver as an external resource.

1. Downloading the TiDB JDBC Driver

  1. Navigate to the TiDB website to obtain the latest release of the JDBC driver.
  2. On the downloads section, search explicitly for the Type 4 JDBC driver. Type 4 drivers are recommended because they are pure Java implementations and do not require any native libraries.
  3. Download the appropriate JAR file for the TiDB JDBC driver to your local machine.

2. Installing the Driver in Matillion Data Productivity Cloud

Matillion Data Productivity Cloud allows for the manual upload and management of external JDBC drivers using the Matillion Agent. The driver JAR file should be uploaded following Matillion’s documented process.

  • Refer to the step-by-step guidance at Installing External Drivers.
  • Ensure that you upload the driver with the correct permissions and follow any Matillion-specific configuration steps, such as restarting agents or specifying the driver class name in the Matillion interface if required.
  • Verify that the JDBC driver appears in your agent’s driver list after installation.

3. Using the Driver for Database Integration

Once installed, you may proceed to configure and use the TiDB connection within Matillion Data Productivity Cloud:

  • Consult the official Database Query Usage Documentation for detailed instructions on creating and managing database connections, using the installed TiDB JDBC driver.
  • During connection setup, select the driver and fill in any required connection properties as per your TiDB environment setup.

Following these steps ensures secure and robust integration of TiDB with your Matillion Data Productivity Cloud environment.

Checking network connectivity

To ensure successful communication between Matillion Data Productivity Cloud and your TiDB database, verify that your TiDB instance allows incoming connections from the appropriate source for your deployment configuration: the Matillion-hosted agent for Full SaaS deployments, or your own agent's network for Hybrid SaaS deployments.

In both configurations, if your TiDB database is referenced by a DNS hostname rather than a direct IP address, ensure that the Full SaaS or Hybrid SaaS agent can resolve the hostname. This may require that the DNS record is publicly accessible or that the necessary DNS configuration is in place within your environment.
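Both checks (hostname resolution and port reachability) can be run as a quick pre-flight test from the agent's host. This is a hedged sketch using only the Python standard library; the hostname and port below are placeholders, and TiDB's default client port of 4000 is assumed.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if `host` resolves via DNS and `port` accepts a TCP connection."""
    try:
        addr = socket.gethostbyname(host)  # DNS resolution check
    except socket.gaierror:
        return False                       # hostname does not resolve
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success, an errno on failure
        return sock.connect_ex((addr, port)) == 0

# Example (placeholder hostname): can_reach("tidb.example.internal", 4000)
```

A `False` result distinguishes only "unreachable", not why; follow up with your network team to determine whether DNS, firewall rules, or security groups are the blocker.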

Querying Data from TiDB: Technical Instructions

This document explains how to query data from a TiDB database using SQL `SELECT` statements. It also covers key considerations for data type conversion with Snowflake and best practices for initial and incremental loads using the Database Query component. For background on incremental loads, read Matillion's Incremental Load Data Replication Strategy.


Basic TiDB SQL SELECT Queries

To query data from a TiDB database, use SQL `SELECT` statements. The examples below demonstrate typical patterns:

```sql
-- Select all columns from a table
SELECT * FROM users;

-- Select specific columns
SELECT id, name, email FROM users;

-- Filter rows
SELECT * FROM orders WHERE order_status = 'completed';

-- Aggregate data
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2024-06-01'
GROUP BY customer_id;
```

Notes on TiDB and Snowflake Data Type Conversion

When extracting data from TiDB and loading into Snowflake, note that data types may differ. For example:

  • TiDB INTEGER may map to Snowflake NUMBER
  • TiDB VARCHAR may map to Snowflake STRING
  • TiDB DATETIME may map to Snowflake TIMESTAMP_NTZ

Carefully review and explicitly cast columns if needed, especially for dates or custom types, to ensure compatibility during ETL processes.

```sql
-- Example: Explicitly casting DATETIME to VARCHAR for Snowflake compatibility
SELECT id, CAST(created_at AS CHAR) AS created_at_str FROM events;
```
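A small lookup table makes these mappings explicit in pipeline code and flags columns that need a `CAST`. This is illustrative only; the mapping covers the examples discussed above, and the fallback to `STRING` for unmapped types is an assumption you should replace with your own policy.

```python
# Common TiDB -> Snowflake type mappings from this article.
TIDB_TO_SNOWFLAKE = {
    "INTEGER": "NUMBER",
    "VARCHAR": "STRING",
    "DATETIME": "TIMESTAMP_NTZ",
}

def target_type(tidb_type: str) -> str:
    """Return the Snowflake type for a TiDB column type.

    Unmapped types fall back to STRING as a conservative, lossless default.
    """
    return TIDB_TO_SNOWFLAKE.get(tidb_type.upper(), "STRING")
```

Extend the dictionary to match your schema before relying on it for table creation.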

Initial and Incremental Data Loads Using Database Query

The recommended strategy for loading data from TiDB into downstream systems like Snowflake is:

  1. Perform a once-off initial load (full data extract).
  2. Perform ongoing incremental loads (extracting only new or changed data).
  3. Use the same Database Query component for both, altering only the filter clause.

Example: Initial Load Query

For the initial load, use no filter clause and extract all data:

```sql
SELECT * FROM orders;
```

Example: Incremental Load Query

For incremental load, add a filter clause. Common patterns:

  • Use a `last_updated` or similar timestamp column.
  • Pass the last successful run time into the query as a variable.

```sql
SELECT * FROM orders
WHERE last_updated > '${last_run_timestamp}';
```

Replace `${last_run_timestamp}` with your actual parameter for tracking incremental loads.

For more on incremental loading strategies, see Matillion Exchange: Incremental Load Data Replication Strategy.
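The initial-versus-incremental pattern above can be sketched as a single query builder that switches on whether a previous run time exists. This is a minimal sketch under stated assumptions: the pipeline records the last successful run time somewhere (e.g. a Matillion variable or a control table), and the names `build_extract_query` and `ts_column` are illustrative.

```python
from datetime import datetime
from typing import Optional

def build_extract_query(table: str, ts_column: str,
                        last_run: Optional[datetime]) -> str:
    """Full extract when no previous run is recorded, else incremental."""
    base = f"SELECT * FROM {table}"
    if last_run is None:
        return f"{base};"  # initial load: no filter clause
    stamp = last_run.strftime("%Y-%m-%d %H:%M:%S")
    return f"{base} WHERE {ts_column} > '{stamp}';"  # incremental load

print(build_extract_query("orders", "last_updated", None))
print(build_extract_query("orders", "last_updated", datetime(2024, 6, 1)))
```

Because both loads share one template, the Database Query component configuration stays identical and only the filter variable changes between runs, which is exactly the reuse the three-step strategy above relies on.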


By following these patterns, you efficiently query and move data from TiDB, ensure data type compatibility, and streamline ongoing synchronization with systems such as Snowflake.

Data Integration Architecture

One of the key advantages of the ELT (Extract, Load, Transform) architecture is the ability to load data into Snowflake in advance of integration, effectively dividing the overall challenge into two manageable steps: first loading the data, then performing integration and transformation. This approach not only simplifies the integration process but also provides flexibility and control over each phase.

Data integration inherently requires robust data transformation, and the optimal way to achieve this in Snowflake is through data transformation pipelines, which automate and streamline complex transformation tasks. Additionally, a major benefit of the ELT model is that both transformation and integration of data occur directly inside the Snowflake database. This design ensures rapid, on-demand, and highly scalable processing without the overhead or cost of maintaining separate data processing infrastructure, making it a cost-effective solution for modern data workflows.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.