
Integrate data from MariaDB to Databricks using Matillion

Our MariaDB to Databricks connector enables seamless and timely data transfer to Databricks within minutes, automatically synchronizing updates without the need for manual coding or complex ETL processes.


Extracting data from MariaDB to Databricks

Extracting data from MariaDB is an essential task for organizations seeking to leverage their existing data assets in modern analytics environments such as Databricks. Seamless migration of data supports advanced analytics, machine learning, and better decision-making by making legacy data readily available for processing.

In this article, we will guide you through the fundamental steps to successfully extract data from MariaDB and load it into Databricks. Whether you are preparing for your first data pipeline or optimizing a recurring workflow, these steps will help ensure reliability and security throughout the process. We will cover how to create the necessary database identity in MariaDB, confirm that you have the correct JDBC driver (with special attention for Matillion users), and ensure proper network connectivity between your source (MariaDB) and target (Databricks) systems.

Finally, we will discuss best practices for querying your data, covering both one-time initial data loads and incremental updates to keep your Databricks environment current. Let’s explore each of these stages in detail to make your data extraction journey efficient and successful.


What is MariaDB?

MariaDB is an open-source relational database management system (RDBMS) renowned for its compatibility with MySQL, from which it was originally forked. Designed for performance, scalability, and robustness, MariaDB is engineered with advanced storage engines, dynamic replication features, and built-in security enhancements suitable for mission-critical applications. It offers ACID compliance, full SQL support, and extensibility through plugins and numerous tools, making it highly adaptable to diverse deployment scenarios. The active development community and backing by the MariaDB Foundation ensure frequent updates, transparent governance, and ongoing innovation, positioning MariaDB as a compelling choice for modern data infrastructure needs.


What is Databricks?

Databricks is a unified analytics platform built on a scalable lakehouse architecture that merges data lake and data warehouse functionalities. It uses Delta Lake to add ACID transactions, efficient metadata management, and versioning to cloud storage like AWS S3, Azure Data Lake, and Google Cloud. Powered by Apache Spark, Databricks supports fast, distributed batch processing, SQL analytics, streaming, and machine learning. Collaborative workspaces, strong security, and governance features make Databricks ideal for enterprise data engineering, analytics, and data science workloads.

Why Move Data from MariaDB into Databricks

Unlocking Advanced Analytics: The Benefits of Copying Data from MariaDB to Databricks

A data engineer or architect might wish to copy data from MariaDB into Databricks for several compelling reasons. Firstly, MariaDB typically holds operational data that is potentially very valuable for analytics and business intelligence purposes. By integrating the data stored in MariaDB with information from other sources—such as cloud storage, data lakes, or external databases—organizations can unlock richer insights and identify trends that would not be apparent from a single source alone. Moreover, performing this data integration within Databricks provides a scalable and flexible environment for data processing and analysis. Importantly, this approach helps to minimize the computational load on the MariaDB system itself, preserving its performance for transactional operations and freeing it from the strain of complex analytical queries.

Creating a User in MariaDB

To enable secure and controlled access to a MariaDB database, you must create individual users with tailored privileges. Follow the step-by-step instructions and example SQL scripts below.

1. Connect to MariaDB as an Administrative User

You need administrative credentials (such as the root user) to create new users.

mysql -u root -p
Enter the root password when prompted.

2. Create a New User

The general syntax to create a new user is:

CREATE USER 'username'@'host' IDENTIFIED BY 'password';

  • username: The name of the new user you want to create.
  • host: The host from which the user can connect (e.g., 'localhost', '%' for any host, or a specific IP address).
  • password: The password for the user.

Example:

CREATE USER 'app_user'@'localhost' IDENTIFIED BY 'S3cureP@ssw0rd';

This statement creates a user named app_user who can connect only from localhost.

3. Grant Privileges to the User

Users require appropriate privileges to perform actions on databases. The most common command is:

GRANT privileges ON database.table TO 'username'@'host';

  • privileges: The permissions to grant (e.g., ALL PRIVILEGES, SELECT, INSERT, UPDATE).
  • database.table: The specific database and table, or *.* for all databases and tables.

Granting All Privileges Example:

GRANT ALL PRIVILEGES ON example_db.* TO 'app_user'@'localhost';

Granting Specific Privileges Example:

GRANT SELECT, INSERT ON example_db.* TO 'app_user'@'localhost';

4. Apply Changes

Flush the privileges to ensure that the changes take effect immediately. Note that account-management statements such as CREATE USER and GRANT normally take effect on their own; FLUSH PRIVILEGES is strictly required only when the privilege tables are modified directly, but running it does no harm:

FLUSH PRIVILEGES;

5. Test the New User

To verify the user creation, try logging in with the new credentials:

mysql -u app_user -p -h localhost
Enter the password you specified earlier.


Note:
Always choose strong, unique passwords and restrict user privileges following the principle of least privilege.
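The steps above can also be scripted. Here is a minimal Python sketch that assembles the same CREATE USER, GRANT, and FLUSH statements for a given account; the `build_user_setup` helper is purely illustrative and not part of MariaDB or Matillion:

```python
def build_user_setup(username: str, host: str, password: str,
                     database: str, privileges: str = "SELECT") -> list:
    """Generate the SQL statements from the steps above for one account.

    Caution: in real deployments, send credentials through a parameterized
    client API rather than interpolating them into SQL text.
    """
    return [
        f"CREATE USER '{username}'@'{host}' IDENTIFIED BY '{password}';",
        f"GRANT {privileges} ON {database}.* TO '{username}'@'{host}';",
        "FLUSH PRIVILEGES;",
    ]

# Reproduce the example account from this section.
for stmt in build_user_setup("app_user", "localhost", "S3cureP@ssw0rd",
                             "example_db", "SELECT, INSERT"):
    print(stmt)
```

Each generated statement can then be executed against MariaDB as the administrative user.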

Checking the JDBC driver

The JDBC driver for MariaDB is already installed in Matillion Data Productivity Cloud, so there is no need to download or install it manually. To confirm that the JDBC driver is present, follow these steps:

  1. Create a new Orchestration pipeline in Matillion.
  2. Add a Database Query component to your pipeline.
  3. In the properties of the Database Query component, locate the Database Type dropdown.
  4. Look for MariaDB in the list of available database types.

If you see MariaDB listed, the JDBC driver is present and ready for use.

For instructions on how to check the version of the JDBC driver, visit the official documentation at: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/#database-driver-versions.

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your MariaDB database, you must configure your MariaDB database to accept incoming network connections according to your agent deployment type:

  • Full SaaS Agent Configuration:
    If you are using the Full SaaS agent, you need to allow incoming connections from the IP addresses listed in the official Matillion documentation: Allowing IP Addresses. Make sure you review this list regularly, as allowed IP addresses may change over time.

  • Hybrid SaaS Deployment:
    If you are using a Hybrid SaaS deployment, you must allow incoming connections from your own Virtual Private Cloud (VPC) where the Matillion agent is running. You can find utilities to help identify and check network access from your VPC at Check Network Access.

Additionally, if the MariaDB database is referenced using a DNS hostname, ensure that the Full SaaS or Hybrid SaaS agent is able to resolve this DNS address. This may require updating DNS settings or network policies so that the agent can access the required DNS servers.
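Both conditions above (DNS resolution and an open network path) can be verified from the agent's environment with a short script. This sketch uses only Python's standard library; the function names are illustrative, and MariaDB's default port of 3306 is assumed:

```python
import socket

def dns_resolves(host: str) -> bool:
    """Check that this environment can resolve the database's DNS hostname."""
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        return False

def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Check that a TCP connection to the database port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical hostname; MariaDB listens on 3306 by default):
# print(dns_resolves("db.example.internal"))
# print(port_reachable("db.example.internal", 3306))
```

If DNS resolution succeeds but the port is unreachable, the problem is usually a firewall rule or security group rather than the database itself.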

How to Query Data from a MariaDB Database

This guide explains how to query data from a MariaDB database, focusing on best practices for initial and incremental data loading, SQL examples, and considerations for cross-platform datatype conversions (such as with Databricks).


Example MariaDB SELECT Statements

The most basic query retrieves all columns and rows from a table:

SELECT * FROM employees;

To return specific columns:

SELECT employee_id, first_name, last_name, hire_date FROM employees;

To filter results using a WHERE clause:

SELECT * FROM employees
WHERE department = 'Sales' AND status = 'Active';

To sort results:

SELECT * FROM employees
ORDER BY hire_date DESC;

To aggregate data:

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;

To join tables:

SELECT e.employee_id, e.first_name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;


Datatype Conversion: MariaDB to Databricks

When moving data from MariaDB to Databricks, datatypes may require conversion. Some differences to be aware of:

  • MariaDB DATETIME, DATE, and TIMESTAMP types typically convert to Databricks TIMESTAMP or DATE.
  • MariaDB VARCHAR and TEXT become Databricks STRING.
  • MariaDB numeric types such as INT, BIGINT, and DECIMAL generally map directly, but review precision and scale.
  • MariaDB BOOLEAN may become Databricks BOOLEAN or TINYINT, depending on connector settings.
Always review and test datatype mappings to avoid data loss or unexpected behavior.
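As a starting point for such a review, the mappings above can be captured in a simple lookup table. The following Python sketch is an assumption based on common connector defaults, not the exact mapping any particular tool uses:

```python
# Illustrative MariaDB -> Databricks type mapping (common connector defaults).
MARIADB_TO_DATABRICKS = {
    "DATETIME": "TIMESTAMP",
    "TIMESTAMP": "TIMESTAMP",
    "DATE": "DATE",
    "VARCHAR": "STRING",
    "TEXT": "STRING",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "DECIMAL": "DECIMAL",   # precision/scale must still be reviewed separately
    "BOOLEAN": "BOOLEAN",   # some connectors emit TINYINT instead
}

def map_type(mariadb_type: str) -> str:
    """Map a MariaDB column type (e.g. 'VARCHAR(255)') to a Databricks type.

    Strips any length/precision suffix and falls back to STRING for
    unrecognized types, which is a lossless (if untyped) default.
    """
    base = mariadb_type.split("(")[0].strip().upper()
    return MARIADB_TO_DATABRICKS.get(base, "STRING")
```

Running each source column's declared type through such a table makes mismatches visible before the first load rather than after.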


Data Loading Patterns: Initial and Incremental Loads

A recommended approach for loading data from MariaDB is:

  1. Initial (Full) Load: Load all historical data in one operation.
  2. Incremental Load: On subsequent runs, only extract new or updated records.

Both loads can be handled with the same Database Query component in your ETL tool.

  • Initial Load: Use a query without a filter:

SELECT * FROM employees;

  • Incremental Load: Add a filter, such as on a timestamp or incrementing ID column:

SELECT * FROM employees
WHERE last_modified >= '2024-06-01 00:00:00';

Replace '2024-06-01 00:00:00' with the latest value already loaded to avoid duplicating or missing data.

Further details on this pattern: Incremental Load Data Replication Strategy (Matillion Exchange)
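The full-load and incremental-load queries above differ only by the watermark filter, so a single query builder can serve both runs. This Python sketch uses a hypothetical `build_extract_query` helper to show the pattern:

```python
from typing import Optional

def build_extract_query(table: str,
                        watermark_column: Optional[str] = None,
                        last_watermark: Optional[str] = None) -> str:
    """Return a full-load query, or an incremental one when a watermark is given."""
    query = f"SELECT * FROM {table}"
    if watermark_column and last_watermark:
        # >= keeps rows that share the boundary timestamp;
        # deduplicate those rows downstream if needed.
        query += f" WHERE {watermark_column} >= '{last_watermark}'"
    return query

# Initial load, then a later incremental run:
print(build_extract_query("employees"))
print(build_extract_query("employees", "last_modified", "2024-06-01 00:00:00"))
```

After each incremental run, persist the maximum `last_modified` value seen and pass it as `last_watermark` on the next run.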

Data Integration Architecture

Loading data into the system in advance of integration allows organizations to break down the data integration challenge into more manageable steps, a key benefit of the Extract, Load, Transform (ELT) architecture. By first loading raw data into the Databricks database, teams can subsequently perform integration and transformation activities with greater flexibility and control. Effective data integration hinges on robust data transformation, which is best achieved using dedicated data transformation pipelines that automate, standardize, and simplify complex data preparation tasks.

Another significant advantage of the ELT approach is that all transformation and integration occurs within the target Databricks database. This not only ensures fast, on-demand, and scalable processing by leveraging Databricks’ native compute capabilities, but also minimizes the need for external data processing infrastructure—allowing organizations to optimize costs and streamline operations.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.