Extracting data from MySQL to Snowflake
Extracting data from MySQL is an essential process for many organizations seeking to leverage the advanced analytics and scalability features offered by modern cloud-based data warehouses such as Snowflake. Successfully transferring data between these environments requires thoughtful planning and setup to ensure accuracy, reliability, and security throughout the pipeline. In this article, we will guide you through the key steps necessary to accomplish a successful MySQL-to-Snowflake data transfer. We will begin by walking through the creation of a dedicated identity in MySQL, which is fundamental for managing data access permissions. For organizations using Matillion as their ETL tool, we will explain how to check for—or acquire—the required JDBC driver to enable MySQL connectivity. Next, we will discuss considerations related to establishing secure and efficient network connectivity between your source (MySQL) and target (Snowflake) environments. Finally, we will explore strategies for querying your MySQL data, covering both the initial data load and methods for performing incremental updates to ensure your Snowflake environment remains current. By following these steps, you will be well prepared to build a robust integration tailored to your organization's analytical and operational requirements.
What is MySQL?
MySQL is a popular open-source relational database management system (RDBMS) using SQL for data definition and queries. Built on a client-server model, it supports transactional processing, ACID compliance (via InnoDB), and high concurrency for multi-user environments. Known for its reliability, scalability, and speed, MySQL powers many web and enterprise applications, especially within the LAMP stack (Linux, Apache, MySQL, PHP/Python/Perl). Its extensible design, wide community support, and compatibility with various languages and operating systems make it a crucial tool for modern, data-driven applications.
What is Snowflake?
Snowflake is a cloud-native data warehousing platform designed to enable secure, scalable, and high-performance analytics across diverse data sets. Unlike traditional databases, Snowflake architecture separates compute, storage, and services layers, allowing independent scaling and efficient resource utilization. The platform employs a multi-cluster, shared-data approach, supporting concurrent workloads without contention and facilitating seamless data sharing among users or organizations. With support for structured and semi-structured data (such as JSON, Avro, and Parquet), Snowflake provides automatic performance tuning, robust security controls, and integration with major cloud providers—including AWS, Azure, and Google Cloud—making it a versatile solution for modern data engineering and analytics workflows.
Why Move Data from MySQL into Snowflake
Unlocking Powerful Analytics: The Benefits of Copying Data from MySQL to Snowflake
A data engineer or architect may wish to copy data from MySQL into Snowflake for several compelling reasons. Firstly, MySQL often contains data that is potentially valuable for analytics or business intelligence, but its native environment may not be optimized for large-scale querying or integration. By moving data from MySQL into Snowflake, professionals can integrate that data with information from other sources, such as cloud applications, files, or additional databases—unlocking deeper insights than could be derived from MySQL alone. In addition, conducting data transformations and analytical queries directly within Snowflake rather than on the source MySQL database prevents increased operational load on the MySQL system, ensuring continued application performance and reliability. This approach leverages Snowflake’s scalable compute and storage capabilities, enabling more efficient and comprehensive data analysis.
Similar connectors
Creating a User in MySQL
This guide explains how to create a new user in a MySQL database and assign appropriate privileges, using SQL statements. The instructions assume you have administrative access to the MySQL server.
1. Access the MySQL Command Line
Log in to the MySQL server as the `root` user or another user with sufficient privileges:
mysql -u root -p
Enter your root password when prompted.
2. Create the User
To create a new user, use the `CREATE USER` statement. Substitute `username`, `host`, and `password` with appropriate values.
CREATE USER 'username'@'host' IDENTIFIED BY 'password';
- `'username'`: Replace with the desired username.
- `'host'`: Use `'localhost'` to limit access to the local machine, or `'%'` to allow connections from any host.
- `'password'`: Specify a strong password.
Example:
Create a user named `appuser` who can connect from any host, using a secure password.
CREATE USER 'appuser'@'%' IDENTIFIED BY 'SecureP@ssw0rd!';
3. Grant Privileges
After creating the user, assign the necessary privileges. For example, to grant all privileges on a specific database:
GRANT ALL PRIVILEGES ON database_name.* TO 'username'@'host';
Example:
GRANT ALL PRIVILEGES ON mydatabase.* TO 'appuser'@'%';
To grant only specific privileges (e.g., SELECT, INSERT):
GRANT SELECT, INSERT ON mydatabase.* TO 'appuser'@'%';
4. Apply Changes
Optionally, flush the privileges. Account-management statements such as `CREATE USER` and `GRANT` take effect immediately, so this step is only strictly required when the grant tables are modified directly, but it is a harmless precaution:
FLUSH PRIVILEGES;
5. (Optional) Verify the User
To confirm that the user has been created:
SELECT user, host FROM mysql.user WHERE user = 'appuser';
Note:
As a security best practice, avoid weak passwords and grant each user only the minimum privileges necessary.
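If you provision database users from scripts rather than an interactive session, the statements above can be generated programmatically. The sketch below is a minimal, hypothetical helper (the function name and parameters are illustrative, not part of any library); note that in real code the input values must come from a trusted source or be properly escaped, since they are interpolated into SQL.

```python
# Minimal sketch: build the CREATE USER and GRANT statements shown above.
# Values are interpolated directly, so they must be trusted or escaped.

def build_user_setup_sql(username: str, host: str, password: str,
                         database: str, privileges: str = "SELECT") -> list[str]:
    """Return the statements needed to create a MySQL user and grant
    the given privileges on all tables of one database."""
    return [
        f"CREATE USER '{username}'@'{host}' IDENTIFIED BY '{password}';",
        f"GRANT {privileges} ON {database}.* TO '{username}'@'{host}';",
    ]

for stmt in build_user_setup_sql("appuser", "%", "SecureP@ssw0rd!", "mydatabase"):
    print(stmt)
```

Defaulting to `SELECT` rather than `ALL PRIVILEGES` keeps the helper aligned with the least-privilege note above.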
Installing the JDBC Driver
The MySQL JDBC driver, which enables connectivity between Matillion Data Productivity Cloud and MySQL databases, is not included by default in Matillion deployments. This exclusion is due to licensing or redistribution restrictions associated with the driver. To use MySQL as a data source or target, you must manually download and install the JDBC driver as an external dependency within your Matillion environment. The following instructions apply at the time of writing.
Step 1: Download the MySQL JDBC Driver
- Navigate to the official MySQL Connector/J download page: https://dev.mysql.com/downloads/connector/j/
- On the download page, locate and select the latest stable release of the connector.
- Prefer the Type 4 JDBC driver, which is a pure Java implementation and does not require native libraries.
- Download the relevant “Platform Independent” version: this will typically be provided as a zip or tar.gz archive. Extract the archive to locate the JDBC driver JAR file (for example, `mysql-connector-java-8.x.x.jar`).
Step 2: Install the JDBC Driver into Matillion Data Productivity Cloud
Matillion provides a process for uploading external JDBC drivers to your environment. Follow the official Matillion installation instructions here: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/.
In brief, you will generally be required to:
- Log in to your Matillion Data Productivity Cloud environment.
- Access the section for agent configuration or driver management.
- Upload the MySQL JDBC JAR file to the designated location, following on-screen prompts or documentation guidance.
- Restart or refresh your Matillion Agent if instructed, to ensure the new driver is picked up and available for use.
Step 3: Configure Your Database Query Components
Once the driver is installed, you can begin designing and running queries against your MySQL databases within Matillion. For instructions on how to utilize the newly installed JDBC driver inside your pipelines (including information such as connection URLs, authentication, and supported features), refer to the official Matillion usage guide here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
Checking network connectivity
Before connecting Matillion Data Productivity Cloud to your MySQL database, you must ensure that the database allows incoming network connections according to your deployment configuration:
- Full SaaS agent configuration: You need to allow incoming connections to your MySQL database from the list of IP addresses provided by Matillion. These IP addresses can be found in the Matillion documentation: Allowing IP addresses. Make sure these addresses are permitted in your database’s network access controls (such as firewall or security group settings).
- Hybrid SaaS deployment: In this scenario, you should allow incoming connections from your own Virtual Private Cloud (VPC). The Matillion agent runs within your VPC, so ensure the necessary security group and firewall rules are configured to allow network traffic from your internal resources to the MySQL database. To help verify network access from your VPC, refer to the tools available here: Matillion Exchange: Check network access.
Additionally, if your MySQL database is referenced using a Domain Name System (DNS) address rather than a direct IP address, you must ensure that the Full SaaS or Hybrid SaaS agent can resolve the database’s DNS name to its corresponding IP address. Proper DNS resolution is critical for connectivity; verify that external or internal DNS configuration allows this, depending on your network setup.
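A quick way to confirm DNS resolution from the environment where the agent runs is a short resolution check. The sketch below uses Python's standard library; the hostname is a placeholder, so substitute your actual MySQL endpoint.

```python
# Minimal sketch: check whether a database hostname resolves to an IP
# address from this environment. Run it from the same network context
# as the Matillion agent to get a meaningful result.
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the hostname resolves to an IP address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Substitute your MySQL endpoint, e.g. "mysql.internal.example.com"
print(can_resolve("localhost"))  # resolves via the hosts file in most environments
```

If this returns `False` for your database hostname, fix the internal or external DNS configuration before troubleshooting firewall rules.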
Querying Data from a MySQL Database
This guide provides instructions and best practices for querying data from MySQL, especially when integrating or migrating data to another platform such as Snowflake. Examples include SQL `SELECT` statements, handling datatype conversions, and performing initial plus incremental loads using the Database Query component.
Example MySQL SELECT Statements
Here are basic examples of retrieving data from a MySQL database:
```
-- Select all columns from a table
SELECT * FROM employees;

-- Select specific columns
SELECT first_name, last_name, hire_date FROM employees;

-- Apply a filter for incremental loads (see section below)
SELECT employee_id, first_name, hire_date FROM employees
WHERE last_updated > '2024-06-01';
```
Datatype Conversion to Snowflake
When migrating data between MySQL and Snowflake, be aware that datatype conversions may occur. For instance:
- MySQL `DATETIME` can be mapped to Snowflake `TIMESTAMP_NTZ`
- MySQL `VARCHAR` maps to Snowflake `VARCHAR`
- Numeric types may need precision adjustments
It's important to review the datatype mapping documentation for Snowflake to ensure fidelity of your data during migration.
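For planning purposes, the common mappings can be captured in a simple lookup table. The sketch below is illustrative and deliberately simplified (the dictionary contents are assumptions based on typical conversions, not an authoritative mapping); always verify against the Snowflake documentation before relying on it.

```python
# Illustrative lookup of common MySQL column types to Snowflake equivalents.
# Simplified for planning; precision/scale of numeric types and the handling
# of time zones for TIMESTAMP columns need case-by-case review.
MYSQL_TO_SNOWFLAKE = {
    "DATETIME":  "TIMESTAMP_NTZ",
    "DATE":      "DATE",
    "VARCHAR":   "VARCHAR",
    "TEXT":      "VARCHAR",
    "INT":       "NUMBER(10,0)",
    "BIGINT":    "NUMBER(19,0)",
    "DECIMAL":   "NUMBER",
    "DOUBLE":    "FLOAT",
    "BLOB":      "BINARY",
}

def map_type(mysql_type: str) -> str:
    """Look up the Snowflake type for a MySQL type, defaulting to VARCHAR."""
    return MYSQL_TO_SNOWFLAKE.get(mysql_type.upper(), "VARCHAR")

print(map_type("datetime"))  # TIMESTAMP_NTZ
```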
Initial and Incremental Load Patterns
The recommended pattern for loading data is:
- Initial Load: Copy the full dataset once.
- Incremental Load: Subsequently load only new or modified records, often determined by a timestamp or incrementing key.
Both steps use the same Database Query component, differing only in whether a filter (such as a `WHERE` clause) is present.
Initial Load (No Filter Clause)
In an initial load, the SQL statement typically has no filter clause, thus selecting all data:
SELECT * FROM employees;
Incremental Load (With Filter Clause)
For incremental loads, include a filter to select only records that are new or changed since the last load. For example:
SELECT * FROM employees WHERE last_updated > '2024-06-01 00:00:00';
Tip: The filter value is typically parameterized, using the high-watermark (for example, a timestamp) recorded at the last successful load.
By structuring your loads this way, you ensure efficient, performant, and consistent data movement between MySQL and your target system.
Data Integration Architecture
Loading data into Snowflake in advance of integration is a classic way to "divide and conquer" the data integration challenge, as it separates the process into two manageable steps: first, data is ingested or loaded into the database (Extract and Load), and then transformed as needed (Transform). This separation is a key advantage of the ELT (Extract, Load, Transform) architecture, as opposed to the traditional ETL approach. Data integration in Snowflake nearly always requires some level of data transformation—such as restructuring, cleansing, or joining data from multiple sources—and the most effective way to implement these transformations is through data transformation pipelines. Another significant advantage of the ELT approach is that all transformation and integration operations occur within the Snowflake platform itself. This means the process benefits from Snowflake’s native scalability, performance, and on-demand data processing capabilities. Furthermore, since all computing resources are provisioned dynamically within the database environment, there is no need to invest in or maintain separate data processing infrastructure, resulting in both operational simplicity and cost savings.