Extracting data from MySQL to Amazon Redshift
Extracting data from MySQL is a common requirement for organizations seeking to integrate disparate data sources or to leverage the advanced analytical capabilities available in Amazon Redshift. Successfully transferring data between these platforms requires careful planning and attention to detail at each stage of the process. In this article, we will systematically guide you through the essential steps involved in moving data from MySQL to Redshift. We begin by explaining how to create an identity in MySQL, which is necessary to establish secure access. For users of Matillion, a popular ETL tool, we will discuss the importance of verifying or acquiring the appropriate JDBC driver to enable seamless connectivity. Additionally, we address key considerations around ensuring reliable network connectivity between your MySQL source and Redshift target. Finally, we will cover the fundamentals of querying your data—both for the initial full load and for subsequent incremental updates—enabling you to maintain an up-to-date and robust data pipeline. Whether you are undertaking a one-time data migration or setting up ongoing synchronization, this step-by-step guide will provide you with the foundational knowledge required to achieve a successful integration.
What is MySQL?
MySQL is a popular open-source relational database management system known for reliability, scalability, and strong performance, especially in web applications. It uses SQL for data management and supports ACID properties for robust transactions. MySQL’s client-server architecture lets multiple users access databases over a network. It offers various storage engines like InnoDB and MyISAM, suiting both transaction-heavy and read-focused applications. Its flexibility, ecosystem integration, and active community make MySQL a key part of the LAMP stack and a preferred data management solution for organizations seeking dependability and versatility.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse from AWS, designed for fast analytics on large datasets using standard SQL and existing BI tools. Leveraging columnar storage and massively parallel processing (MPP), it delivers high query performance for both structured and semi-structured data. Redshift integrates with AWS services and third-party ETL tools, supports data sharing, integrates with data lakes, and provides security features like VPC isolation, encryption, and auditing. With Redshift Spectrum, users can directly query exabytes of data in Amazon S3 without loading it into Redshift, offering a flexible, scalable platform for modern analytics workloads.
Why Move Data from MySQL into Amazon Redshift
Unlocking Advanced Analytics: The Case for Migrating Data from MySQL to Redshift
A data engineer or architect may choose to copy data from MySQL into Amazon Redshift for several compelling reasons. First, operational MySQL databases often contain transactional or historical data that is potentially valuable for deeper analysis and business intelligence initiatives. By integrating MySQL data with information from other sources—such as CRM systems, application logs, or marketing platforms—in a centralized Redshift data warehouse, organizations can derive richer insights and unlock greater value from their combined datasets. Furthermore, using Redshift for such integration and analysis tasks ensures that complex queries and large-scale analytical workloads are handled by a system purpose-built for high-performance analytics, thereby safeguarding the operational MySQL database from additional workload and performance degradation.
Creating an Identity in MySQL
The following instructions explain how to create a new user (an identity) in a MySQL database. This process requires appropriate privileges, typically granted to administrative users.
1. Log In to the MySQL Server
Open your terminal or command prompt, then connect to the MySQL server using an account with sufficient administrative privileges:
mysql -u root -p
Enter your password when prompted.
2. Create a New User
To create a user, use the CREATE USER statement. Replace username, host, and password with your desired values:
CREATE USER 'username'@'host' IDENTIFIED BY 'password';
- 'username': the name for the new MySQL user
- 'host': the host from which the user is allowed to connect (e.g., 'localhost', '%' for any host, or a specific IP)
- 'password': a strong password
Example: To create a user named appuser who can connect only from localhost with the password p@ssw0rd!:
CREATE USER 'appuser'@'localhost' IDENTIFIED BY 'p@ssw0rd!';
3. Grant Privileges
Assign the required privileges to the new user using GRANT. For example, to grant all privileges on a specific database (mydatabase):
GRANT ALL PRIVILEGES ON mydatabase.* TO 'appuser'@'localhost';
Alternatively, to grant only specific privileges (e.g., SELECT and INSERT):
GRANT SELECT, INSERT ON mydatabase.* TO 'appuser'@'localhost';
4. Apply Changes
Finalize the process by executing:
FLUSH PRIVILEGES;
Note that privileges assigned with CREATE USER and GRANT take effect immediately; FLUSH PRIVILEGES is only strictly required after modifying the grant tables directly, but it is harmless to run.
5. Optional: Verify User Creation
List all users to verify that your user has been created:
SELECT user, host FROM mysql.user;
This displays all user identities configured in your MySQL server.
Installing the JDBC driver
At the time of writing, Matillion Data Productivity Cloud does not include the MySQL JDBC driver by default. This is due to licensing or redistribution restrictions associated with the MySQL Connector/J. As a result, users are required to manually download and install the JDBC driver before utilizing MySQL connectivity features within Matillion Data Productivity Cloud.
Downloading the MySQL JDBC (Connector/J) Driver
You can acquire the MySQL JDBC driver (Connector/J) directly from Oracle’s official repository. Please visit the following URL to download the latest official version:
https://dev.mysql.com/downloads/connector/j/
When navigating the download options, locate and select the platform-independent version of the connector, which is typically distributed as a .zip or other compressed archive. Ensure you are downloading a Type 4 JDBC driver, which operates entirely in Java and is preferred for compatibility and portability.
Uploading the JDBC Driver to Matillion Data Productivity Cloud
To install the downloaded MySQL JDBC driver, you will need to upload the JAR file to your Matillion Data Productivity Cloud Agent. Detailed, step-by-step instructions are available in the official Matillion documentation here:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
Follow these instructions carefully to ensure the driver is correctly recognized and made available to relevant Matillion components.
Using the MySQL JDBC Driver within Matillion
Once the driver has been installed via the procedure above, you can configure and use database connectors in your data pipelines or workflows. Usage instructions and further configuration guidance can be found at:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
These guides will walk you through creating database queries and data movement tasks specifically tailored for MySQL, leveraging the JDBC driver you have installed.
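When configuring a MySQL connection that uses this driver, Connector/J expects a JDBC URL. The host, port, and database name below are placeholders to adapt to your environment; consult the Connector/J documentation for the full list of supported URL parameters:

```
jdbc:mysql://your-mysql-host:3306/your_database
```

Port 3306 is the MySQL default; substitute your own value if the server listens elsewhere.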
Checking network connectivity
To enable Matillion Data Productivity Cloud to connect to your MySQL database, you must ensure that the database allows incoming network connections according to your chosen deployment architecture:
- Full SaaS Agent Configuration: Configure your MySQL database firewall or network security settings to allow incoming connections from the IP addresses used by Matillion Data Productivity Cloud. The complete and most up-to-date list of required IP addresses can be found at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/
- Hybrid SaaS Deployment: In this scenario, incoming connections to the MySQL database should be allowed from the IP addresses within your own virtual private cloud (VPC), where the Matillion Hybrid SaaS agent is running. To help check network routes, port availability, and connection success, you can use the network check utilities provided at: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Additionally, if you are referencing your MySQL database using a domain name (DNS), ensure that the Full SaaS or Hybrid SaaS agent has external DNS resolution enabled and can successfully resolve the hostname to the correct IP address. Failure to resolve the database hostname will prevent successful connection and data operations.
Querying Data from a MySQL Database
This guide provides technical instructions on querying data from a MySQL database using SQL SELECT statements, with a specific focus on data migration and incremental loading strategies. It also covers potential datatype conversions between MySQL and Amazon Redshift, and how to structure database extraction flows for both initial and incremental loads.
Example MySQL SELECT Queries
Basic query:
SELECT * FROM customers;
Limit results:
SELECT id, first_name, last_name FROM customers LIMIT 100;
Conditional query:
SELECT * FROM orders WHERE order_date >= '2024-01-01';
Joining tables:
SELECT o.id, o.order_date, c.first_name, c.last_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_status = 'SHIPPED';
Selecting with aggregate function:
SELECT customer_id, COUNT(*) AS total_orders
FROM orders
GROUP BY customer_id
HAVING total_orders > 5;
Datatype Conversion: MySQL to Redshift
When transferring or synchronizing data from MySQL to Amazon Redshift, keep in mind that some data types may convert differently. For example:
- MySQL VARCHAR(n) maps to Redshift VARCHAR(n).
- MySQL DATETIME and TIMESTAMP map to Redshift TIMESTAMP.
- MySQL TINYINT(1) (often used as a boolean) can be converted to Redshift's native BOOLEAN type, or loaded as SMALLINT if the raw 0/1 values should be preserved.
- MySQL TEXT maps to Redshift VARCHAR; note that Redshift VARCHAR columns are limited to 65535 bytes, so very long TEXT values may require truncation.
- MySQL DECIMAL(p,s) maps to Redshift DECIMAL(p,s).
Always review Redshift datatype requirements and perform transformations if necessary during ETL.
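To illustrate this mapping step, the sketch below translates a few common MySQL column types into Redshift equivalents. The mapping covers only the types discussed above and is an assumption for illustration, not a complete conversion rule set.

```python
import re


def mysql_to_redshift_type(mysql_type: str) -> str:
    """Map a MySQL column type to an illustrative Redshift equivalent."""
    t = mysql_type.strip().upper()
    if t == "TINYINT(1)":
        return "BOOLEAN"          # common convention: TINYINT(1) stores 0/1 flags
    if t in ("DATETIME", "TIMESTAMP"):
        return "TIMESTAMP"
    if t == "TEXT":
        return "VARCHAR(65535)"   # Redshift VARCHAR max length is 65535 bytes
    m = re.fullmatch(r"VARCHAR\((\d+)\)", t)
    if m:
        return f"VARCHAR({m.group(1)})"
    m = re.fullmatch(r"DECIMAL\((\d+),\s*(\d+)\)", t)
    if m:
        return f"DECIMAL({m.group(1)},{m.group(2)})"
    raise ValueError(f"No mapping defined for {mysql_type}")


print(mysql_to_redshift_type("tinyint(1)"))    # BOOLEAN
print(mysql_to_redshift_type("VARCHAR(255)"))  # VARCHAR(255)
```

In practice an ETL tool performs this mapping for you, but knowing the rules helps when reviewing generated target tables.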
Loading Strategies: Initial vs Incremental Loads
A recommended pattern for synchronizing MySQL data to targets like AWS Redshift is:
- Once-off Initial Load:
  - Extract all data from the MySQL table.
  - No filter clause is necessary in the SELECT statement.
- Incremental Loads:
  - After the initial extraction, periodically fetch only the new or modified records.
  - This uses a filter clause, often on a column such as updated_at or id.
Both loading approaches utilize the same Database Query component in your ETL tool or integration platform (e.g. Matillion).
Initial Load Example:
-- No WHERE clause: fetch all records
SELECT * FROM orders;
Incremental Load Example:
-- WHERE clause filters new or changed records
SELECT * FROM orders WHERE updated_at > '2024-06-01 00:00:00';
Tip: Store the maximum value of your filter column (e.g., the latest updated_at or id) after each load to use in the next incremental query.
For an in-depth guide on this pattern, see the official Matillion article on incremental load data replication strategy.
Data Integration Architecture
Loading data into Amazon Redshift in advance of integration is a hallmark of the ELT (Extract, Load, Transform) architecture, which helps to "divide and conquer" the data integration process by splitting it into two distinct steps: loading and transformational integration. By first loading data into Redshift and then performing integration within the database, organizations can leverage the power of SQL-based data transformation pipelines to efficiently cleanse, join, and reformat data as required for downstream analytics. This approach not only simplifies orchestration and monitoring of transformation workflows but also ensures that all data integration tasks are executed inside Redshift itself. As a result, organizations benefit from high performance, on-demand scalability, and cost efficiency, since data does not need to be moved out to external processing frameworks, and there is no need to provision or license additional data transformation infrastructure.