Extracting data from MariaDB to Amazon Redshift
Extracting data from MariaDB is a critical step in many modern data integration and analytics workflows. When transferring this data to a cloud-based data warehouse such as Amazon Redshift, several factors must be considered to ensure a smooth and secure process. This article provides a structured guide to facilitate the extraction and loading of data from MariaDB into Redshift. We will begin by detailing how to create the necessary database identity (user) in MariaDB, with appropriate permissions for extraction duties. Next, for those using Matillion, we will explain how to check whether the required JDBC driver is present and, if needed, how to acquire and configure it. Ensuring reliable network connectivity between your MariaDB source and Redshift target is our third focus, as this is crucial for a successful data transfer. Finally, we will explore approaches for querying your data – both for your initial full extracts and for efficient incremental updates in subsequent transfers. Whether you are setting up your pipeline for the first time or optimizing a recurring ETL workflow, this article aims to provide clarity and actionable guidance at each stage.
What is MariaDB?
MariaDB is an open-source relational database management system (RDBMS), created in 2009 by MySQL’s original developers as a MySQL fork following Oracle’s acquisition. Highly compatible with MySQL, MariaDB enables straightforward migration while offering additional storage engines (like Aria, XtraDB, and ColumnStore), better performance, and advanced security. Features include comprehensive JSON support, dynamic columns, and enhanced query optimization, making it well suited for high-availability, mission-critical applications. Advanced replication options and an active ecosystem ensure regular updates, rapid adoption of new features, and robust support for modern workloads. Its transparent, community-driven development model ensures ongoing innovation and reliability.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the AWS cloud, designed to efficiently run large-scale analytic workloads using columnar storage and parallel processing. It leverages SQL for querying and integrates seamlessly with a broad range of business intelligence and ETL tools. Redshift supports complex queries across structured and semistructured data, providing fast query performance through techniques such as data compression, query optimization, and zone mapping. Its architecture allows for cost-effective scalability by decoupling storage and compute resources, while features like Redshift Spectrum enable direct querying of data in Amazon S3, further enhancing data lake analytics capabilities.
Why Move Data from MariaDB into Amazon Redshift
Unlocking Advanced Analytics: The Case for Copying Data from MariaDB to Amazon Redshift
A data engineer or architect may wish to copy data from MariaDB into Redshift for several compelling reasons. Firstly, MariaDB often contains data that is potentially valuable for analytics and business decision-making, but in its operational environment, it may be siloed or underutilized. By integrating data from MariaDB with information from other sources in a centralized environment, organizations can reveal deeper insights and correlations that remain hidden when datasets are isolated. Leveraging Amazon Redshift for this integration enables high-performance, large-scale analytic queries without introducing additional workload or performance bottlenecks to the MariaDB source system. In this way, valuable operational data can be harnessed for analytics and reporting without disrupting core business operations on MariaDB.
Creating an Identity in MariaDB: User Account Creation
To interact with a MariaDB database, each client requires a user account. This account defines access rights and can be scoped to specific hosts. The following instructions guide you through creating a new user in MariaDB.
Prerequisites
- Ensure you have access to a MariaDB server with sufficient privileges, typically as the `root` user or another account with `CREATE USER` privileges.
Steps to Create a New User
1. Connect to the MariaDB Server
Log in to your MariaDB instance using the command-line client:
mysql -u root -p
- When prompted, enter your password.
2. Create a User
Use the `CREATE USER` statement to define a new user. The general syntax is:
CREATE USER 'username'@'host' IDENTIFIED BY 'password';
- `username`: The desired database username.
- `host`: The host from which this user may connect (e.g., `'localhost'`, `'%'` for any host, or a specific IP address).
- `password`: The password for this user.
Example: Create a user named `appuser` who can connect from any host, with the password `S3cur3P@ss`:
CREATE USER 'appuser'@'%' IDENTIFIED BY 'S3cur3P@ss';
3. Grant Privileges (Recommended)
At creation, the user has no privileges. To allow meaningful operations, grant appropriate privileges. For example, to allow all privileges on a specific database:
GRANT ALL PRIVILEGES ON exampledb.* TO 'appuser'@'%';
Or, to grant only SELECT and INSERT on a single table:
GRANT SELECT, INSERT ON exampledb.orders TO 'appuser'@'%';
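When an extraction user needs read access to many tables, it can help to script the grants rather than write each statement by hand. The sketch below only builds the SQL text (it does not execute anything); the database, table, and user names are the illustrative ones used above.

```python
# Sketch: generate least-privilege GRANT statements for an extraction user.
# Database, table, and user names are illustrative placeholders.

def build_grants(database, tables, user, host="%", privileges=("SELECT",)):
    """Return one GRANT statement per table, limited to the given privileges."""
    priv_list = ", ".join(privileges)
    return [
        f"GRANT {priv_list} ON {database}.{table} TO '{user}'@'{host}';"
        for table in tables
    ]

statements = build_grants("exampledb", ["customers", "orders"], "appuser")
for stmt in statements:
    print(stmt)
# GRANT SELECT ON exampledb.customers TO 'appuser'@'%';
# GRANT SELECT ON exampledb.orders TO 'appuser'@'%';
```

Restricting the extraction account to `SELECT` on only the tables it needs keeps the blast radius small if the credentials ever leak.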
4. Apply Privilege Changes
Reload privileges to ensure they are active:
FLUSH PRIVILEGES;
(As of MariaDB 10.0.11, privilege tables are refreshed automatically after `GRANT`, `REVOKE`, or `CREATE USER` statements, but it is good practice to use `FLUSH PRIVILEGES` in scripts for compatibility.)
5. Test the New User
Disconnect from MariaDB by typing:
EXIT;
Log in using the new credentials to verify:
mysql -u appuser -p -h 127.0.0.1 exampledb
Enter the password when prompted.
For more advanced requirements and host specificity, consult the MariaDB documentation on account management.
Checking the JDBC driver
The MariaDB JDBC driver comes pre-installed with Matillion Data Productivity Cloud, so there is no need to download or manually add the driver.
To confirm the presence of the MariaDB JDBC driver in your Matillion environment:
1. Create a new Orchestration pipeline: In the Matillion Designer, start a new pipeline by selecting the Orchestration pipeline type.
2. Add a "Database Query" component: Drag and drop the "Database Query" component onto the canvas in your pipeline.
3. Check the Database Type dropdown: Open the properties of the "Database Query" component and find the 'Database Type' dropdown menu. If the MariaDB option is listed here, the JDBC driver is installed and ready to use.
If you need to check which version of the MariaDB JDBC driver is available, detailed instructions are provided in the documentation: Database Driver Versions.
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your MariaDB database, you must configure your MariaDB instance to allow incoming connections from the correct network sources, based on your Matillion deployment type:
- Full SaaS Agent Configuration: Allow incoming connections from the IP addresses specified for Matillion Data Productivity Cloud. The current list of required IP addresses can be found in the Matillion documentation: Allowing IP Addresses.
- Hybrid SaaS Deployment: Allow incoming connections from your own virtual private cloud (VPC), where your Matillion agent is deployed. To verify connectivity and find useful tools, visit the Matillion Exchange utility: Check Network Access.
Additionally, if your MariaDB instance is referenced using a DNS hostname instead of a direct IP address, ensure that your chosen agent (whether Full SaaS or Hybrid SaaS) is able to resolve the DNS address properly. This may require appropriate DNS configuration and network routing to ensure that hostname lookups succeed.
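As a quick sanity check before configuring the pipeline, you can verify both DNS resolution and TCP reachability from the agent's network. The sketch below uses only the Python standard library; the hostname `mariadb.example.internal` is a placeholder for your own MariaDB endpoint (3306 is the default port).

```python
# Sketch: verify that an agent host can resolve and reach a MariaDB endpoint.
# Replace the hostname below with your own MariaDB host; 3306 is the default port.
import socket

def check_endpoint(host, port, timeout=5.0):
    """Resolve the hostname, then attempt a TCP connection.
    Returns (resolved_ips, reachable)."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        ips = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return [], False  # DNS resolution failed
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return ips, True
    except OSError:
        return ips, False  # name resolved, but TCP connection refused or timed out

ips, ok = check_endpoint("mariadb.example.internal", 3306, timeout=2.0)
print(f"resolved: {ips or 'DNS failure'}, reachable: {ok}")
```

A DNS failure here points to missing hostname configuration or routing for the agent, while "resolved but unreachable" usually indicates a firewall or security-group rule blocking the MariaDB port.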
Querying Data from a MariaDB Database
This guide explains how to query data from a MariaDB database, particularly in the context of data loading workflows such as initial and incremental loads. It also highlights considerations for datatype conversion when migrating data to platforms like Amazon Redshift.
1. Example MariaDB SELECT Queries
SQL SELECT statements are used to query data from MariaDB tables. Below are some sample queries:
```
-- Select all columns from a table
SELECT * FROM customers;

-- Select specific columns
SELECT id, name, email FROM customers;

-- Filter results with a WHERE clause
SELECT * FROM orders WHERE status = 'SHIPPED';

-- Aggregate data
SELECT country, COUNT(*) AS customer_count
FROM customers
GROUP BY country;
```
2. Datatype Conversion: MariaDB to Redshift
When extracting data from MariaDB and loading it into Amazon Redshift, be aware that datatype conversion may occur between the two systems. For example:
- MariaDB `VARCHAR` maps to Redshift `VARCHAR`
- MariaDB `INT` maps to Redshift `INTEGER`
- MariaDB `DATETIME`/`TIMESTAMP` maps to Redshift `TIMESTAMP`
Review your schema and consider edge cases such as default values and precision to avoid data truncation or incompatibility.
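The common mappings above can be captured in a small lookup table for schema review. The sketch below covers only a handful of frequent cases and deliberately simplifies precision and length handling; it is not an exhaustive conversion chart.

```python
# Sketch: a partial MariaDB-to-Redshift type mapping for schema review.
# Only a few common types are listed; length/precision modifiers are dropped.

MARIADB_TO_REDSHIFT = {
    "VARCHAR": "VARCHAR",
    "INT": "INTEGER",
    "BIGINT": "BIGINT",
    "DATETIME": "TIMESTAMP",
    "TIMESTAMP": "TIMESTAMP",
    "DECIMAL": "DECIMAL",
    "TEXT": "VARCHAR(65535)",  # Redshift has no TEXT type; use the max VARCHAR
}

def map_type(mariadb_type):
    """Map a MariaDB column type name to a Redshift equivalent.

    Strips any length/precision suffix, e.g. VARCHAR(255) -> VARCHAR.
    """
    base = mariadb_type.split("(")[0].strip().upper()
    try:
        return MARIADB_TO_REDSHIFT[base]
    except KeyError:
        raise ValueError(f"No mapping defined for MariaDB type: {mariadb_type}")

print(map_type("INT"))       # INTEGER
print(map_type("datetime"))  # TIMESTAMP
```

In practice you would extend the dictionary with the types your schema actually uses and carry lengths and precisions through rather than discarding them.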
3. Data Loading Patterns
The recommended approach is to use a two-phase data loading pattern:
- Initial Load: Retrieve the full dataset in a single batch
- Incremental Loads: Retrieve only the new or changed records since the previous load
These can be accomplished using the Database Query component in ETL tools such as Matillion.
Initial Load: No Filter Clause
For the initial load, the query extracts all available data from the source MariaDB table. No filter clause is used:
SELECT * FROM customers;
Incremental Load: With Filter Clause
For subsequent loads, use a `WHERE` filter to extract only new or updated rows. Typically, this involves filtering on a timestamp or an incrementing identifier:
SELECT * FROM customers WHERE last_modified > NOW() - INTERVAL 1 DAY;
or
SELECT * FROM customers WHERE id > {{last_loaded_id}};
Replace `last_loaded_id` or the date expression with the value from your tracking mechanism.
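The watermark pattern behind these two queries can be sketched as follows. The table and column names (`customers`, `last_modified`) follow the examples above, and the in-memory dictionary stands in for wherever your pipeline actually persists its high-water mark.

```python
# Sketch: build initial vs. incremental extraction queries from a stored watermark.
# The watermark store here is an in-memory dict standing in for real pipeline state.

def build_extract_query(table, watermark_column, watermark_value=None):
    """Return a full-extract query when no watermark exists, else a filtered one."""
    base = f"SELECT * FROM {table}"
    if watermark_value is None:
        return base  # initial load: no filter clause
    return f"{base} WHERE {watermark_column} > '{watermark_value}'"

state = {}  # previously persisted high-water marks, keyed by table name

# Initial load: no watermark recorded yet, so extract everything.
q1 = build_extract_query("customers", "last_modified", state.get("customers"))

# After the load, record the max last_modified observed in the extracted batch.
state["customers"] = "2024-06-01 00:00:00"

# Incremental load: only rows changed since the stored watermark.
q2 = build_extract_query("customers", "last_modified", state.get("customers"))

print(q1)  # SELECT * FROM customers
print(q2)  # SELECT * FROM customers WHERE last_modified > '2024-06-01 00:00:00'
```

Updating the watermark only after a load has committed successfully is what makes the pattern safe to re-run: a failed run leaves the old watermark in place, so the next run simply retries the same window.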
For a detailed explanation of this pattern, refer to the Incremental Load / Data Replication Strategy article.
By following this approach, you ensure efficient data extraction and migration from MariaDB to Redshift, handling datatype conversions and optimizing data transfer with full and incremental loads as required.
Data Integration Architecture
Loading data in advance of integration is an effective "divide and conquer" strategy, allowing organizations to decouple the transfer of raw data from its transformation and integration. This two-step process is central to the Extract, Load, and Transform (ELT) architecture, which first loads the data into the target database—such as Amazon Redshift—before applying transformation logic. Data integration inherently involves transforming data to fit analytical requirements, and the most efficient approach to this is through robust data transformation pipelines that automate and orchestrate complex processing workflows. A further advantage of the ELT architecture is that these transformation and integration steps take place directly within the Redshift database. As a result, the process benefits from Redshift’s optimized, scalable compute resources, delivering high performance and on-demand execution. This approach also eliminates the need for separate data processing infrastructure, reducing both operational complexity and costs.