Extracting data from MySQL to Databricks
Extracting data from MySQL is a common requirement for organizations seeking to leverage the scalable analytics, collaborative tools, and machine learning capabilities offered by platforms like Databricks. Successfully moving data from MySQL into Databricks involves several important considerations, including user management, connectivity, and data loading techniques. In this article, we will guide you through the essential steps for setting up and executing a robust data pipeline from MySQL to Databricks. We will cover how to create a dedicated identity in MySQL to secure and audit your data extraction, and, for those using Matillion, how to check for or acquire the appropriate JDBC driver for MySQL. Ensuring reliable and secure network connectivity between your data source and target platform will also be discussed. Lastly, we will explain methods for querying data—both for an initial large transfer and subsequent incremental updates—to support ongoing data ingestion in your Databricks environment. By following this guide, you will be well-equipped to efficiently and securely transfer MySQL data into Databricks, laying the foundation for advanced data analysis and reporting.
What is MySQL?
MySQL is a popular open-source relational database management system (RDBMS) that uses SQL for data definition and manipulation. Known for reliability and performance, it powers major sites like Facebook and Wikipedia. MySQL supports multiple storage engines (e.g., InnoDB, MyISAM), transactions, replication, partitioning, and indexing, making it ideal for both small projects and large enterprises. It integrates well with numerous programming languages and platforms, aiding efficient development of scalable, data-heavy applications. Backed by an active community and commercial support from Oracle, MySQL ensures ongoing enhancements and long-term stability.
What is Databricks?
The Databricks database, built on Delta Lake, is a unified analytics platform for efficient, large-scale data management and processing. Leveraging a distributed architecture and tight integration with Apache Spark, it supports high-performance querying, analytics, and machine learning across structured, semi-structured, and unstructured data. Databricks ensures data integrity via transaction support, schema enforcement, and ACID compliance, while features like time travel enable version control. As a managed cloud service, it offers optimized resource management and collaborative tools for data engineering, science, and business intelligence teams, blending the strengths of data lakes and warehouses for scalability and reliability.
Why Move Data from MySQL into Databricks
Unlocking Powerful Analytics: The Case for Copying MySQL Data into Databricks
A data engineer or architect may find it advantageous to copy data from MySQL into Databricks for several reasons. Firstly, MySQL frequently contains valuable business data, such as transactional records, customer profiles, or operational metrics, which are essential for deeper analysis and decision-making. By integrating MySQL data with information from other disparate sources within Databricks, organizations can uncover new insights, enrich analytics, and develop a more comprehensive view of their operations, customers, or products. Furthermore, performing data transformations and analytics within Databricks, rather than directly on the MySQL database, helps minimize the risk of straining the production environment; this avoids placing additional workload on MySQL, thereby protecting the performance and stability of systems critical for day-to-day business activities.
Creating a User in MySQL
The following instructions explain how to create a user in a MySQL database. These steps assume you have the required administrative privileges.
1. Log In to MySQL
Open your terminal and access the MySQL prompt as the root user or another user with sufficient privileges:
mysql -u root -p

You will be prompted for the root password.
2. Create a User
Use the CREATE USER SQL statement to create a new user. Replace username and password with your chosen values.

CREATE USER 'username'@'host' IDENTIFIED BY 'password';

- 'username' is the new user's name.
- 'host' specifies where the user can connect from (use 'localhost' for local connections, '%' for any host, or specify an IP address).
- 'password' is the user's password.
Example:
CREATE USER 'analytics_user'@'localhost' IDENTIFIED BY 'S3cureP@ssword!';
3. Grant Privileges (Optional but Recommended)
Assign permissions to the new user. For example, to grant all privileges on a specific database:
GRANT ALL PRIVILEGES ON database_name.* TO 'username'@'host';
Example:
GRANT SELECT, INSERT, UPDATE, DELETE ON sales_data.* TO 'analytics_user'@'localhost';
Privileges assigned with GRANT take effect immediately. If you ever modify the grant tables directly (for example, with INSERT or UPDATE against the mysql system tables), reload them with:
FLUSH PRIVILEGES;
4. Verify the User
You can confirm the user was created by querying the mysql.user table:
SELECT User, Host FROM mysql.user;
Installing the JDBC driver
At the time of writing, the JDBC driver for MySQL database is not included by default with Matillion Data Productivity Cloud due to licensing or redistribution restrictions. To enable connectivity between Matillion Data Productivity Cloud and MySQL, you will need to manually download and install the JDBC driver.
Step 1: Download the JDBC Driver
- Visit the official MySQL Connector/J download page: https://dev.mysql.com/downloads/connector/j/
- Locate and download the platform-independent, compressed archive file containing the Connector/J library.
- When selecting the download, prefer the Type 4 JDBC driver (also known as the "pure Java" driver), as it does not require native library dependencies and is fully compatible with the Java runtime used by Matillion Data Productivity Cloud.
Step 2: Install the Driver into Matillion Data Productivity Cloud
- After downloading, extract the archive if necessary. The extracted folder will include a JAR file, typically named in the format mysql-connector-java-x.x.xx.jar.
- Follow the official Matillion instructions to upload and install external JDBC drivers. The documentation provides a step-by-step guide at this URL: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
- Upload the MySQL connector JAR file to your Matillion Agent as described in those instructions.
Step 3: Configure and Use MySQL Connectivity
After the installation is complete, you are ready to create and use MySQL data sources within your Matillion projects. For detailed directions on configuring and using the MySQL database connection within Matillion Data Productivity Cloud, refer to the usage instructions provided in the official documentation: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/.
Please ensure that you follow all licensing conditions required by MySQL when downloading and using their JDBC driver.
Checking network connectivity
To successfully connect Matillion Data Productivity Cloud to your MySQL database, you must ensure that your MySQL instance allows incoming network connections from the appropriate sources, based on your deployment configuration.
- Full SaaS Agent Configuration: You need to permit incoming connections from the static IP addresses used by Matillion Data Productivity Cloud’s cloud agents. The complete list of these addresses can be found at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/. Update your MySQL instance's network access controls (such as firewalls or security groups) to allow traffic from these IPs on the port used by MySQL (default: 3306).
- Hybrid SaaS Deployment: In this configuration, the Hybrid SaaS agent runs within your own virtual private cloud (VPC). You must ensure your MySQL instance accepts incoming connections from the VPC where the agent is deployed. Utilities and guidance for verifying connectivity from your VPC are available at: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
Additionally, if you reference your MySQL database using a DNS hostname (rather than a direct IP address), the relevant Matillion agent—whether Full SaaS or Hybrid SaaS—must be able to resolve that DNS name to an IP address. Ensure that DNS resolution is possible from the Matillion agent environment by verifying any applicable DNS settings or custom domain configurations.
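As a quick sanity check before configuring the agent, you can verify both DNS resolution and TCP reachability of the MySQL port from the agent's environment. The following is a minimal Python sketch, not an official Matillion utility; the hostname is a placeholder for your own MySQL endpoint:

```python
import socket

def check_mysql_reachable(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Resolve the hostname, then attempt a plain TCP connection to the MySQL port."""
    try:
        # DNS resolution: fails fast if the agent cannot resolve the hostname.
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        print(f"DNS resolution failed for {host}")
        return False
    try:
        # TCP handshake only; no MySQL authentication is attempted.
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        print(f"Could not reach {ip}:{port}")
        return False

if __name__ == "__main__":
    # Replace with your MySQL host; "localhost" is only an illustration.
    print(check_mysql_reachable("localhost"))
```

A True result confirms the network path and port are open; it does not validate the MySQL credentials or the JDBC driver configuration.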
Querying Data from MySQL Databases
This guide explains how to query data from a MySQL database, including SELECT statement examples, notes on datatype conversion with Databricks, and best practices for implementing initial and incremental data loads.
1. Example MySQL SELECT Statements
Here are some basic examples of querying data using SQL in MySQL:
```
-- Retrieve all columns and all rows from the "customers" table
SELECT * FROM customers;

-- Retrieve specific columns from the "orders" table
SELECT order_id, order_date, total_amount FROM orders;

-- Retrieve records with filters and sort the results
SELECT employee_id, name, department
FROM employees
WHERE department = 'Sales'
ORDER BY name ASC;

-- Compute aggregate values (e.g., total sales by product)
SELECT product_id, SUM(quantity_sold) AS total_sold
FROM sales
GROUP BY product_id
HAVING total_sold > 100;
```
2. Datatype Conversion with Databricks
When transferring data between MySQL and Databricks, some datatype conversion may occur. For example:
- MySQL DATETIME and TIMESTAMP may convert to Databricks TIMESTAMP.
- MySQL TINYINT(1) is commonly converted to Databricks BOOLEAN.
- MySQL TEXT, VARCHAR, and CHAR generally convert to Databricks STRING.
Pay close attention to compatibility between the source MySQL datatypes and the target Databricks types to avoid unexpected issues.
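The common conversions above can be captured in a small lookup table. The Python sketch below is a simplified illustration of that mapping, not an exhaustive or authoritative list of what any particular connector does:

```python
# Typical MySQL -> Databricks type conversions; a simplified illustration,
# not an exhaustive or authoritative mapping.
MYSQL_TO_DATABRICKS = {
    "DATETIME": "TIMESTAMP",
    "TIMESTAMP": "TIMESTAMP",
    "TINYINT(1)": "BOOLEAN",
    "TINYINT": "TINYINT",
    "TEXT": "STRING",
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "DECIMAL": "DECIMAL",
}

def databricks_type(mysql_type: str) -> str:
    """Map a MySQL column type to its usual Databricks equivalent."""
    key = mysql_type.strip().upper()
    # Try the exact type first (this catches TINYINT(1)), then fall back to
    # the base name with any length/precision suffix, e.g. VARCHAR(255), stripped.
    if key in MYSQL_TO_DATABRICKS:
        return MYSQL_TO_DATABRICKS[key]
    base = key.split("(")[0]
    return MYSQL_TO_DATABRICKS.get(base, "STRING")

print(databricks_type("VARCHAR(255)"))  # STRING
print(databricks_type("TINYINT(1)"))    # BOOLEAN
```

Checking the exact type before the stripped base name matters here, because TINYINT(1) carries a different conventional meaning (boolean) than TINYINT with other display widths.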
3. Best Practice: Initial and Incremental Loads
The recommended pattern for querying and loading data into analytics platforms is:
- Initial (Once-Off) Load
  Load all data from the source table. Example query using the Database Query component, with no filter clause:

  SELECT * FROM orders;

- Incremental Loads
  After the initial load, only new or updated records are extracted, typically those above or after a certain primary key or timestamp. Example query with a filter clause for incremental load:

  SELECT * FROM orders WHERE order_date > '2024-01-01 00:00:00';

  Replace the filter value with an appropriate variable or watermark from your last load.
The Incremental Load - Data Replication Strategy documentation provides more detail on implementing this process in an ETL workflow.
Note: Use the same Database Query component/template for both initial and incremental loads. The difference is that the initial load’s query does not include a filter, while the incremental load’s query must filter for records not previously extracted.
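The watermark pattern described above can be sketched in a few lines of Python. This example uses the standard-library sqlite3 module as a stand-in for a MySQL connection, and the table and column names are illustrative; the same logic applies when the query runs through a JDBC-based component:

```python
import sqlite3

def extract_orders(conn, watermark=None):
    """Initial load when watermark is None; otherwise an incremental load
    returning only rows newer than the last recorded watermark."""
    if watermark is None:
        sql, params = "SELECT order_id, order_date FROM orders", ()
    else:
        sql, params = ("SELECT order_id, order_date FROM orders "
                       "WHERE order_date > ?", (watermark,))
    rows = conn.execute(sql, params).fetchall()
    # New watermark: the highest order_date seen, carried into the next run.
    new_watermark = max((r[1] for r in rows), default=watermark)
    return rows, new_watermark

# Demo with an in-memory database standing in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-01 09:00:00"), (2, "2024-01-02 10:30:00")])

rows, wm = extract_orders(conn)            # initial load: all rows
conn.execute("INSERT INTO orders VALUES (3, '2024-01-03 08:15:00')")
new_rows, wm = extract_orders(conn, wm)    # incremental: only order 3
print(len(rows), len(new_rows))            # prints "2 1"
```

In a production pipeline the watermark would be persisted between runs (for example, in a control table or a pipeline variable) rather than held in memory.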
Data Integration Architecture
Loading data in advance of integration illustrates the "divide and conquer" approach, where the data integration process is split into two distinct steps: first, loading the raw data, and second, integrating and transforming that data. This strategy is a core advantage of the ELT (Extract, Load, Transform) architecture. Data transformation is an essential part of integration, and the recommended best practice is to use robust data transformation pipelines to automate and coordinate these changes systematically. Another significant benefit of the ELT approach is that all necessary data transformation and integration activities are performed directly inside the target Databricks database. This means operations are fast, scalable, and available on-demand, taking full advantage of Databricks’ processing power. Moreover, because all processing happens within Databricks, there is no need for additional or external data processing infrastructure, avoiding extra costs and complexity.