Extracting data from MariaDB to Databricks
Extracting data from MariaDB is an essential task for organizations seeking to leverage their existing data assets in modern analytics environments such as Databricks. Seamless migration of data supports advanced analytics, machine learning, and better decision-making by making legacy data readily available for processing. In this article, we will guide you through the fundamental steps to successfully extract data from MariaDB and load it into Databricks. Whether you are preparing for your first data pipeline or optimizing a recurring workflow, these steps will help ensure reliability and security throughout the process. We will cover how to create the necessary database identity in MariaDB, confirm that you have the correct JDBC driver (with special attention for Matillion users), and ensure proper network connectivity between your source (MariaDB) and target (Databricks) systems. Finally, we will discuss best practices for querying your data, covering both one-time initial data loads and incremental updates to keep your Databricks environment current. Let’s explore each of these stages in detail to make your data extraction journey efficient and successful.
What is MariaDB?
MariaDB is an open-source relational database management system (RDBMS) renowned for its compatibility with MySQL, from which it was originally forked. Designed for performance, scalability, and robustness, MariaDB is engineered with advanced storage engines, dynamic replication features, and built-in security enhancements suitable for mission-critical applications. It offers ACID compliance, full SQL support, and extensibility through plugins and numerous tools, making it highly adaptable to diverse deployment scenarios. The active development community and backing by the MariaDB Foundation ensure frequent updates, transparent governance, and ongoing innovation, positioning MariaDB as a compelling choice for modern data infrastructure needs.
What is Databricks?
Databricks is a unified analytics platform built on a scalable lakehouse architecture that merges data lake and data warehouse functionalities. It uses Delta Lake to add ACID transactions, efficient metadata management, and versioning to cloud storage like AWS S3, Azure Data Lake, and Google Cloud. Powered by Apache Spark, Databricks supports fast, distributed batch processing, SQL analytics, streaming, and machine learning. Collaborative workspaces, strong security, and governance features make Databricks ideal for enterprise data engineering, analytics, and data science workloads.
Why Move Data from MariaDB into Databricks
Unlocking Advanced Analytics: The Benefits of Copying Data from MariaDB to Databricks
A data engineer or architect might wish to copy data from MariaDB into Databricks for several compelling reasons. Firstly, MariaDB typically holds operational data that is potentially very valuable for analytics and business intelligence purposes. By integrating the data stored in MariaDB with information from other sources—such as cloud storage, data lakes, or external databases—organizations can unlock richer insights and identify trends that would not be apparent from a single source alone. Moreover, performing this data integration within Databricks provides a scalable and flexible environment for data processing and analysis. Importantly, this approach helps to minimize the computational load on the MariaDB system itself, preserving its performance for transactional operations and freeing it from the strain of complex analytical queries.
Creating a User in MariaDB
To enable secure and controlled access to a MariaDB database, you must create individual users with tailored privileges. Follow the step-by-step instructions and example SQL scripts below.
1. Connect to MariaDB as an Administrative User
You need administrative credentials (such as the root user) to create new users.
mysql -u root -p
Enter the root password when prompted.
2. Create a New User
The general syntax to create a new user is:
CREATE USER 'username'@'host' IDENTIFIED BY 'password';
username: The name of the new user you want to create.
host: The host from which the user can connect (e.g., 'localhost', '%' for any host, or a specific IP address).
password: The password for the user.
Example:
CREATE USER 'app_user'@'localhost' IDENTIFIED BY 'S3cureP@ssw0rd';
This statement creates a user named app_user who can connect only from localhost.
3. Grant Privileges to the User
Users require appropriate privileges to perform actions on databases. The most common command is:
GRANT privileges ON database.table TO 'username'@'host';
privileges: The type of permissions to grant (e.g., ALL PRIVILEGES, SELECT, INSERT, UPDATE).
database.table: The specific database and table, or *.* for all databases and tables.
Granting All Privileges Example:
GRANT ALL PRIVILEGES ON example_db.* TO 'app_user'@'localhost';
Granting Specific Privileges Example:
GRANT SELECT, INSERT ON example_db.* TO 'app_user'@'localhost';
4. Apply Changes
In modern MariaDB versions, CREATE USER and GRANT take effect immediately; FLUSH PRIVILEGES is only strictly required if you modify the grant tables directly, but running it is harmless:
FLUSH PRIVILEGES;
5. Test the New User
To verify the user creation, try logging in with the new credentials:
mysql -u app_user -p -h localhost
Enter the password you specified earlier.
Note:
Always choose strong, unique passwords and restrict user privileges following the principle of least privilege.
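The statements above can also be generated programmatically when you provision users for many pipelines. Below is a minimal, hypothetical Python helper that assembles the same CREATE USER, GRANT, and FLUSH PRIVILEGES statements; the function name and its parameters are illustrative, not part of MariaDB or any client library.

```python
# Hypothetical helper that builds the user-setup SQL shown above.
# It only assembles statement strings; executing them against MariaDB
# (e.g., via a client session) is left to your tooling of choice.

def user_setup_statements(username, host, password, database,
                          privileges=("SELECT",)):
    """Build least-privilege user-setup statements for MariaDB."""
    return [
        f"CREATE USER '{username}'@'{host}' IDENTIFIED BY '{password}';",
        f"GRANT {', '.join(privileges)} ON {database}.* TO '{username}'@'{host}';",
        "FLUSH PRIVILEGES;",
    ]

for stmt in user_setup_statements("app_user", "localhost", "S3cureP@ssw0rd",
                                  "example_db", ("SELECT", "INSERT")):
    print(stmt)
```

Keeping the privilege list explicit, as in the example, makes it easy to audit that each user follows the principle of least privilege.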
Checking the JDBC driver
The JDBC driver for MariaDB is already installed in Matillion Data Productivity Cloud, so there is no need to download or install it manually. To confirm that the JDBC driver is present, follow these steps:
- Create a new Orchestration pipeline in Matillion.
- Add a Database Query component to your pipeline.
- In the properties of the Database Query component, locate the Database Type dropdown.
- Look for MariaDB in the list of available database types.
If you see MariaDB listed, the JDBC driver is present and ready for use.
For instructions on how to check the version of the JDBC driver, visit the official documentation at: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/#database-driver-versions.
Checking network connectivity
To ensure successful connectivity between Matillion Data Productivity Cloud and your MariaDB database, you must configure your MariaDB database to accept incoming network connections according to your agent deployment type:
- Full SaaS Agent Configuration: If you are using the Full SaaS agent, you need to allow incoming connections from the IP addresses listed in the official Matillion documentation: Allowing IP Addresses. Make sure you review this list regularly, as allowed IP addresses may change over time.
- Hybrid SaaS Deployment: If you are using a Hybrid SaaS deployment, you must allow incoming connections from your own Virtual Private Cloud (VPC) where the Matillion agent is running. You can find utilities to help identify and check network access from your VPC at Check Network Access.
Additionally, if the MariaDB database is referenced using a DNS hostname, ensure that the Full SaaS or Hybrid SaaS agent is able to resolve this DNS address. This may require updating DNS settings or network policies so that the agent can access the required DNS servers.
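To sanity-check both requirements from the agent's network, you can run a quick probe. The sketch below, using only Python's standard socket module, checks DNS resolution and TCP reachability of the database port; the hostname shown is a placeholder for your own endpoint, and 3306 is assumed to be the default MariaDB port.

```python
# Minimal connectivity probe: verifies that a host resolves via DNS and
# that a TCP connection to the MariaDB port can be opened. Hostname and
# port are placeholders; substitute your own deployment's values.
import socket

def can_resolve(host):
    """Return True if the hostname resolves via DNS."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def can_connect(host, port=3306, timeout=2):
    """Return True if a TCP connection to the given port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = "mariadb.example.internal"  # placeholder hostname
    print("DNS ok:", can_resolve(host))
    print("TCP ok:", can_connect(host))
```

If DNS resolution fails but the IP address connects directly, the fix is on the DNS side (agent DNS settings or network policy) rather than the firewall.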
How to Query Data from a MariaDB Database
This guide explains how to query data from a MariaDB database, focusing on best practices for initial and incremental data loading, SQL examples, and considerations for cross-platform datatype conversions (such as with Databricks).
Example MariaDB SELECT Statements
The most basic query retrieves all columns and rows from a table:
SELECT * FROM employees;
To return specific columns:
SELECT employee_id, first_name, last_name, hire_date FROM employees;
To filter results using a WHERE clause:
SELECT * FROM employees WHERE department = 'Sales' AND status = 'Active';
To sort results:
SELECT * FROM employees ORDER BY hire_date DESC;
To aggregate data:
SELECT department, COUNT(*) AS employee_count FROM employees GROUP BY department;
To join tables:
SELECT e.employee_id, e.first_name, d.department_name FROM employees e JOIN departments d ON e.department_id = d.department_id;
Datatype Conversion: MariaDB to Databricks
When moving data from MariaDB to Databricks, datatypes may require conversion. Some differences to be aware of:
- MariaDB DATETIME, DATE, or TIMESTAMP types typically convert to Databricks TIMESTAMP or DATE.
- MariaDB VARCHAR and TEXT become Databricks STRING.
- MariaDB numeric types, like INT, BIGINT, and DECIMAL, generally map directly, but review precision and scale.
- MariaDB BOOLEAN may become Databricks BOOLEAN or TINYINT, depending on connector settings.
Always review and test datatype mappings to avoid data loss or unexpected behavior.
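One way to make that review concrete is to encode the expected mappings in a small lookup table you can check your loaded schema against. The sketch below is illustrative only: the exact mappings a given connector applies can differ, so treat the dictionary contents as assumptions to verify.

```python
# Illustrative mapping of MariaDB column types to expected Databricks
# types, reflecting the conversions described above. Actual behavior
# depends on your connector settings; verify against a loaded sample.
MARIADB_TO_DATABRICKS = {
    "DATETIME":  "TIMESTAMP",
    "TIMESTAMP": "TIMESTAMP",
    "DATE":      "DATE",
    "VARCHAR":   "STRING",
    "TEXT":      "STRING",
    "INT":       "INT",
    "BIGINT":    "BIGINT",
    "DECIMAL":   "DECIMAL",   # check that precision/scale survive the load
    "BOOLEAN":   "BOOLEAN",   # some connectors surface this as TINYINT
}

def expected_target_type(mariadb_type):
    """Look up the expected Databricks type for a MariaDB column type."""
    return MARIADB_TO_DATABRICKS.get(mariadb_type.upper())
```

Comparing this table against the schema Databricks actually infers is a quick way to catch silent truncation or type drift before it reaches downstream pipelines.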
Data Loading Patterns: Initial and Incremental Loads
A recommended approach for loading data from MariaDB is:
- Initial (Full) Load: Load all historical data in one operation.
- Incremental Load: On subsequent runs, only extract new or updated records.
Both loads can be handled with the same Database Query component in your ETL tool.
- Initial Load: Use a query without a filter:
SELECT * FROM employees;
- Incremental Load: Add a filter, such as on a timestamp or incrementing ID column:
SELECT * FROM employees
WHERE last_modified >= '2024-06-01 00:00:00';
Replace '2024-06-01 00:00:00' with the latest value already loaded to avoid duplicating or missing data.
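The two load patterns can be captured in a single query builder that switches on whether a watermark has been saved yet. This is a sketch of the pattern, not a Matillion API: the function name, and the assumption that the watermark is a last_modified timestamp persisted between runs, are both illustrative.

```python
# Sketch of the watermark pattern for initial vs. incremental loads.
# On the first run no watermark exists, so a full load is issued; on
# later runs the saved watermark filters for new or updated rows.

def build_load_query(table, watermark_column, last_watermark=None):
    """Build a full-load query, or an incremental one if a watermark exists."""
    query = f"SELECT * FROM {table}"
    if last_watermark is not None:
        query += f" WHERE {watermark_column} >= '{last_watermark}'"
    return query

# First run: full load
print(build_load_query("employees", "last_modified"))
# Later runs: only rows touched since the saved watermark
print(build_load_query("employees", "last_modified", "2024-06-01 00:00:00"))
```

Using >= rather than > trades a small amount of re-read overlap for safety against rows sharing the same timestamp as the watermark; deduplicate downstream if exact-once semantics matter.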
Further details on this pattern: Incremental Load Data Replication Strategy (Matillion Exchange)
Data Integration Architecture
Loading data into the system in advance of integration allows organizations to break down the data integration challenge into more manageable steps, a key benefit of the Extract, Load, Transform (ELT) architecture. By first loading raw data into the Databricks database, teams can subsequently perform integration and transformation activities with greater flexibility and control. Effective data integration hinges on robust data transformation, which is best achieved using dedicated data transformation pipelines that automate, standardize, and simplify complex data preparation tasks. Another significant advantage of the ELT approach is that all transformation and integration occurs within the target Databricks database. This not only ensures fast, on-demand, and scalable processing by leveraging Databricks’ native compute capabilities, but also minimizes the need for external data processing infrastructure—allowing organizations to optimize costs and streamline operations.