Integrate data from MySQL to Amazon Redshift using Matillion

Our MySQL to Redshift connector seamlessly transfers your data to Redshift within minutes, ensuring it remains current without the need for manual coding or managing complex ETL processes.


Extracting data from MySQL to Amazon Redshift

Extracting data from MySQL is a common requirement for organizations seeking to integrate disparate data sources or to leverage the advanced analytical capabilities available in Amazon Redshift. Successfully transferring data between these platforms requires careful planning and attention to detail at each stage of the process. In this article, we will systematically guide you through the essential steps involved in moving data from MySQL to Redshift.

We begin by explaining how to create an identity in MySQL, which is necessary to establish secure access. For users of Matillion, a popular ETL tool, we will discuss the importance of verifying or acquiring the appropriate JDBC driver to enable seamless connectivity. Additionally, we address key considerations around ensuring reliable network connectivity between your MySQL source and Redshift target. Finally, we will cover the fundamentals of querying your data—both for the initial full load and for subsequent incremental updates—enabling you to maintain an up-to-date and robust data pipeline. Whether you are undertaking a one-time data migration or setting up ongoing synchronization, this step-by-step guide will provide you with the foundational knowledge required to achieve a successful integration.


What is MySQL?

MySQL is a popular open-source relational database management system known for reliability, scalability, and strong performance, especially in web applications. It uses SQL for data management and supports ACID properties for robust transactions. MySQL’s client-server architecture lets multiple users access databases over a network. It offers various storage engines like InnoDB and MyISAM, suiting both transaction-heavy and read-focused applications. Its flexibility, ecosystem integration, and active community make MySQL a key part of the LAMP stack and a preferred data management solution for organizations seeking dependability and versatility.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse from AWS, designed for fast analytics on large datasets using standard SQL and existing BI tools. Leveraging columnar storage and massively parallel processing (MPP), it delivers high query performance for both structured and semi-structured data. Redshift integrates with AWS services and third-party ETL tools, supports data sharing, integrates with data lakes, and provides security features like VPC isolation, encryption, and auditing. With Redshift Spectrum, users can directly query exabytes of data in Amazon S3 without loading it into Redshift, offering a flexible, scalable platform for modern analytics workloads.

Why Move Data from MySQL into Amazon Redshift

Unlocking Advanced Analytics: The Case for Migrating Data from MySQL to Redshift

A data engineer or architect may choose to copy data from MySQL into Amazon Redshift for several compelling reasons. First, operational MySQL databases often contain transactional or historical data that is potentially valuable for deeper analysis and business intelligence initiatives. By integrating MySQL data with information from other sources—such as CRM systems, application logs, or marketing platforms—in a centralized Redshift data warehouse, organizations can derive richer insights and unlock greater value from their combined datasets. Furthermore, using Redshift for such integration and analysis tasks ensures that complex queries and large-scale analytical workloads are handled by a system purpose-built for high-performance analytics, thereby safeguarding the operational MySQL database from additional workload and performance degradation.

Creating an Identity in MySQL

The following instructions explain how to create a new user (an identity) in a MySQL database. This process requires appropriate privileges, typically granted to administrative users.

1. Log In to the MySQL Server

Open your terminal or command prompt, then connect to the MySQL server using an account with sufficient administrative privileges:

mysql -u root -p
Enter your password when prompted.

2. Create a New User

To create a user, use the CREATE USER statement. Replace username, host, and password with your desired values:

CREATE USER 'username'@'host' IDENTIFIED BY 'password';

  • 'username': the name for the new MySQL user
  • 'host': the host from which the user is allowed to connect (e.g., 'localhost', '%' for any host, or a specific IP)
  • 'password': a strong password
Example: To create a user named appuser who can connect only from localhost with the password p@ssw0rd!:

CREATE USER 'appuser'@'localhost' IDENTIFIED BY 'p@ssw0rd!';

3. Grant Privileges

Assign required privileges to the new user using GRANT. For example, to grant all privileges on a specific database (mydatabase):

GRANT ALL PRIVILEGES ON mydatabase.* TO 'appuser'@'localhost';

Alternatively, to grant only specific privileges (e.g., SELECT and INSERT):

GRANT SELECT, INSERT ON mydatabase.* TO 'appuser'@'localhost';

4. Apply Changes

Note that account-management statements such as CREATE USER and GRANT take effect immediately; FLUSH PRIVILEGES is strictly required only if you modify the grant tables directly. Running it is nonetheless harmless:

FLUSH PRIVILEGES;

5. Optional: Verify User Creation

List all users to verify that your user has been created:

SELECT user, host FROM mysql.user;

This displays all user identities configured in your MySQL server.
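If you provision database accounts as part of automated environment setup, the statements above can be assembled programmatically. The sketch below is a minimal illustration using the example account from this section; the helper name is hypothetical, and naive string interpolation like this is only safe with trusted, hard-coded inputs — never with user-supplied values.

```python
def build_user_setup(username, host, password, database, privileges):
    """Assemble the CREATE USER / GRANT / FLUSH statements shown above.

    Only safe with trusted inputs: values are interpolated directly
    into SQL text, so do not pass untrusted strings here.
    """
    privs = ", ".join(privileges)
    return [
        f"CREATE USER '{username}'@'{host}' IDENTIFIED BY '{password}';",
        f"GRANT {privs} ON {database}.* TO '{username}'@'{host}';",
        "FLUSH PRIVILEGES;",
    ]

# Example: the 'appuser' account used in this section
statements = build_user_setup("appuser", "localhost", "p@ssw0rd!",
                              "mydatabase", ["SELECT", "INSERT"])
for stmt in statements:
    print(stmt)
```

In practice you would execute these statements through your MySQL client or driver of choice rather than printing them.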

Installing the JDBC driver

At the time of writing, Matillion Data Productivity Cloud does not include the MySQL JDBC driver by default. This is due to licensing or redistribution restrictions associated with the MySQL Connector/J. As a result, users are required to manually download and install the JDBC driver before utilizing MySQL connectivity features within Matillion Data Productivity Cloud.

Downloading the MySQL JDBC (Connector/J) Driver

You can acquire the MySQL JDBC driver (Connector/J) directly from Oracle’s official repository. Please visit the following URL to download the latest official version:

https://dev.mysql.com/downloads/connector/j/

When navigating the download options, locate and select the platform-independent version of the connector, which is typically distributed as a .zip or compressed archive. Ensure you are downloading a Type 4 JDBC driver, which operates entirely in Java and is preferred for compatibility and portability.

Uploading the JDBC Driver to Matillion Data Productivity Cloud

To install the downloaded MySQL JDBC driver, you will need to upload the JAR file to your Matillion Data Productivity Cloud Agent. Detailed, step-by-step instructions are available in the official Matillion documentation here:

https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

Follow these instructions carefully to ensure the driver is correctly recognized and made available to relevant Matillion components.

Using the MySQL JDBC Driver within Matillion

Once the driver has been installed via the procedure above, you can configure and use database connectors in your data pipelines or workflows. Usage instructions and further configuration guidance can be found at:

https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

These guides will walk you through creating database queries and data movement tasks specifically tailored for MySQL, leveraging the JDBC driver you have installed.
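When configuring the connector, you will typically supply a JDBC connection URL in the standard Connector/J form jdbc:mysql://host:port/database, optionally followed by query parameters. The small helper below simply assembles a URL in that format; the hostname and database name are placeholders, not values from your environment.

```python
def mysql_jdbc_url(host, database, port=3306, params=None):
    """Build a MySQL Connector/J JDBC URL: jdbc:mysql://host:port/db?k=v."""
    url = f"jdbc:mysql://{host}:{port}/{database}"
    if params:
        url += "?" + "&".join(f"{k}={v}" for k, v in params.items())
    return url

# Placeholder host and database; useSSL is a Connector/J connection property
print(mysql_jdbc_url("db.example.com", "mydatabase", params={"useSSL": "true"}))
# jdbc:mysql://db.example.com:3306/mydatabase?useSSL=true
```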

Checking network connectivity

To enable Matillion Data Productivity Cloud to connect to your MySQL database, you must ensure that the database accepts incoming network connections from the agent, wherever it runs in your chosen deployment architecture (Full SaaS or Hybrid SaaS).

Additionally, if you are referencing your MySQL database using a domain name (DNS), ensure that the Full SaaS or Hybrid SaaS agent has external DNS resolution enabled and can successfully resolve the hostname to the correct IP address. Failure to resolve the database hostname will prevent successful connection and data operations.
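Before troubleshooting inside Matillion, it can help to confirm reachability from a machine on the same network as the agent. This standalone sketch (not part of Matillion) checks that the database hostname resolves and that a TCP connection to the MySQL port succeeds; the helper names are our own.

```python
import socket

def can_reach(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result means either DNS resolution failed, the port is
    blocked by a firewall or security group, or nothing is listening.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS errors, timeouts, and refused connections
        return False
```

For example, `can_reach("mysql.internal.example.com")` returning False would point at DNS or firewall configuration rather than at MySQL credentials.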

Querying Data from a MySQL Database

This guide provides technical instructions on querying data from a MySQL database using SQL SELECT statements, with a specific focus on data migration and incremental loading strategies. It also covers potential datatype conversions between MySQL and Amazon Redshift, and how to structure database extraction flows for both initial and incremental loads.


Example MySQL SELECT Queries

Basic query:

SELECT * FROM customers;

Limit results:

SELECT id, first_name, last_name
FROM customers
LIMIT 100;

Conditional query:

SELECT *
FROM orders
WHERE order_date >= '2024-01-01';

Joining tables:

SELECT o.id, o.order_date, c.first_name, c.last_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_status = 'SHIPPED';

Selecting with aggregate function:

SELECT customer_id, COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
HAVING total_orders > 5;


Datatype Conversion: MySQL to Redshift

When transferring or synchronizing data from MySQL to Amazon Redshift, keep in mind that some data types may convert differently. For example:

  • MySQL VARCHAR(n) maps to Redshift VARCHAR(n).
  • MySQL DATETIME and TIMESTAMP map to Redshift TIMESTAMP.
  • MySQL TINYINT(1) (often used as a boolean) may require explicit conversion, since MySQL stores it as a small integer while Redshift provides a native BOOLEAN type.
  • MySQL TEXT maps to Redshift VARCHAR; Redshift has no unbounded text type, so choose an appropriate maximum length (up to VARCHAR(65535)).
  • MySQL DECIMAL(p,s) maps to Redshift DECIMAL(p,s).

Always review Redshift datatype requirements and perform transformations if necessary during ETL.
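The mappings above can be expressed as a small lookup function. This is an illustrative sketch covering only the types listed in this section, not an exhaustive converter — always check the Redshift documentation for edge cases such as character-set width or unsupported types.

```python
import re

def mysql_to_redshift(mysql_type):
    """Map common MySQL column types to Redshift equivalents (illustrative only)."""
    t = mysql_type.strip().upper()
    if t == "TINYINT(1)":           # conventionally a boolean flag in MySQL
        return "BOOLEAN"
    if t in ("DATETIME", "TIMESTAMP"):
        return "TIMESTAMP"
    if t == "TEXT":                 # Redshift VARCHAR maximum is 65535 bytes
        return "VARCHAR(65535)"
    m = re.fullmatch(r"(VARCHAR|DECIMAL)\((.+)\)", t)
    if m:                           # VARCHAR(n) and DECIMAL(p,s) carry over as-is
        return f"{m.group(1)}({m.group(2)})"
    return t                        # fall back to the original type name

print(mysql_to_redshift("TINYINT(1)"))    # BOOLEAN
print(mysql_to_redshift("DATETIME"))      # TIMESTAMP
print(mysql_to_redshift("DECIMAL(10,2)")) # DECIMAL(10,2)
```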


Loading Strategies: Initial vs Incremental Loads

A recommended pattern for synchronizing MySQL data to targets like AWS Redshift is:

  1. Once-off initial load: extract all data from the MySQL table. No filter clause is necessary in the SELECT statement.

  2. Incremental loads: after the initial extraction, periodically fetch only the new or modified records. This uses a filter clause, often on a column such as updated_at or id.

Both loading approaches utilize the same Database Query component in your ETL tool or integration platform (e.g. Matillion).

Initial Load Example:

-- No WHERE clause: fetch all records
SELECT *
FROM orders;

Incremental Load Example:

-- WHERE clause filters new or changed records
SELECT *
FROM orders
WHERE updated_at > '2024-06-01 00:00:00';

Tip: Store the maximum value of your filter column (e.g., the latest updated_at or id) after each load to use in the next incremental query.
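This watermark pattern can be sketched in a few lines. The example below persists the last-seen updated_at value to a local JSON file and builds the next incremental query from it; the file name and helper functions are our own invention for illustration — in Matillion you would typically hold this value in a pipeline variable instead.

```python
import json
from pathlib import Path

STATE_FILE = Path("watermark.json")  # hypothetical location for the stored watermark

def load_watermark(default="1970-01-01 00:00:00"):
    """Read the last-seen updated_at value, or a default for the initial load."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["updated_at"]
    return default

def save_watermark(value):
    """Persist the highest updated_at seen in the load that just finished."""
    STATE_FILE.write_text(json.dumps({"updated_at": value}))

def incremental_query(watermark):
    """Build the incremental SELECT shown above from the stored watermark."""
    return f"SELECT * FROM orders WHERE updated_at > '{watermark}';"

# After a load whose newest row had updated_at = 2024-06-01 00:00:00:
save_watermark("2024-06-01 00:00:00")
print(incremental_query(load_watermark()))
# SELECT * FROM orders WHERE updated_at > '2024-06-01 00:00:00';
```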

For an in-depth guide on this pattern, see the official Matillion article on incremental load data replication strategy.


Data Integration Architecture

Loading data into Amazon Redshift in advance of integration is a hallmark of the ELT (Extract, Load, Transform) architecture, which helps to "divide and conquer" the data integration process by splitting it into two distinct steps: loading and transformational integration. By first loading data into Redshift and then performing integration within the database, organizations can leverage the power of SQL-based data transformation pipelines to efficiently cleanse, join, and reformat data as required for downstream analytics. This approach not only simplifies orchestration and monitoring of transformation workflows but also ensures that all data integration tasks are executed inside Redshift itself. As a result, organizations benefit from high performance, on-demand scalability, and cost efficiency, since data does not need to be moved out to external processing frameworks, and there is no need to provision or license additional data transformation infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.