Integrate data from MemSQL (SingleStore) to Databricks using Matillion

Our SingleStore to Databricks connector seamlessly transfers your data to Databricks in just minutes, ensuring it remains current—all without the need for manual programming or managing intricate ETL scripts.


Extracting data from MemSQL (SingleStore) to Databricks

Extracting data from SingleStore is an essential task for organizations looking to leverage advanced analytics capabilities within platforms like Databricks. By moving data efficiently from SingleStore to Databricks, users can unlock powerful insights and integrate seamlessly with modern data workflows. This article provides a step-by-step guide to transferring data from SingleStore into Databricks, following best practices at each stage.

We will begin by outlining how to create an identity in SingleStore for secure data access. For Matillion users, we will cover how to acquire and install the JDBC driver that provides connectivity between the platforms. Next, we will address the critical aspect of network connectivity, showing how to configure secure and reliable communication from the source (SingleStore) to the target (Databricks). Finally, we will explore querying strategies for both initial full data loads and ongoing incremental updates, enabling efficient and timely data extraction.

Whether you are setting up the process for the first time or optimizing an existing integration, this article will equip you with the knowledge needed to extract and load data between SingleStore and Databricks successfully.


What is MemSQL (SingleStore)?

SingleStore is a high-performance, distributed SQL database designed for operational analytics and real-time data processing workloads. It integrates both rowstore and columnstore engines within a unified platform, enabling efficient handling of transactional (OLTP) and analytical (OLAP) queries. SingleStore achieves this through a shared-nothing architecture, widespread support for standard SQL, and in-memory processing capabilities that facilitate low-latency data access at scale. Its compatibility with MySQL wire protocol ensures seamless integration with existing tools, while features like horizontal scalability, high availability, and native support for JSON and geospatial data broaden its applicability for modern data-intensive applications.


What is Databricks?

Databricks is a unified data analytics platform built on Apache Spark, designed for scalable collaboration with large-scale data. Centered on the Lakehouse architecture, it merges data lake flexibility and data warehouse performance, supporting storage formats like Delta Lake, Parquet, and ORC. Databricks enables ACID transactions, time travel, and schema evolution for analytics, and integrates with cloud sources including AWS S3, Azure Data Lake, and Google Cloud Storage. Its collaborative notebooks, strong security, and extensive APIs support machine learning, ETL workflows, and real-time analytics, making it a leading choice for production data environments.

Why Move Data from MemSQL (SingleStore) into Databricks

Unlocking Advanced Analytics: Copying Data from SingleStore to Databricks

A data engineer or architect might choose to copy data from SingleStore into Databricks for several compelling reasons. Firstly, the data housed in SingleStore often contains potentially valuable insights, particularly when it is combined with data from other sources such as cloud data lakes, external APIs, or legacy databases. By integrating SingleStore's data within Databricks, organizations can unlock the full value of their information through comprehensive analytics, advanced machine learning, and unified business intelligence. Moreover, performing these integration and analytical operations within Databricks, rather than directly on SingleStore, prevents additional strain on the SingleStore database, preserving its performance for transactional workloads and thus maintaining operational efficiency. This approach ensures data-driven decision-making capabilities without compromising the stability or responsiveness of the primary data store.

Creating an Identity in SingleStore: User Creation Instructions

To create a new user in a SingleStore database, you use the standard SQL CREATE USER statement. The instructions below outline how to add a user, set a password, and grant privileges as needed.

1. Connect to the SingleStore Database

Start by connecting to your SingleStore environment using your preferred SQL client. For example:

mysql -u admin -p -h <host> -P <port>

Replace <host>, <port>, and admin with your cluster's details.

2. Create a User

You can create a user with the following SQL syntax:

CREATE USER 'newuser'@'%' IDENTIFIED BY 'strong_password';

  • 'newuser' is the username you want to create.
  • '%' allows this user to connect from any host (replace with a specific IP or hostname if necessary).
  • 'strong_password' is the user's password.

Example:
To create reporting_user, who can connect from any location:

CREATE USER 'reporting_user'@'%' IDENTIFIED BY 'mypassw0rd!';

3. Grant Privileges to the User

After creating the user, assign the appropriate privileges so they can interact with the database:

GRANT SELECT, INSERT ON your_database.* TO 'reporting_user'@'%';

Common examples:

  • Grant all privileges on a database:

GRANT ALL PRIVILEGES ON example_db.* TO 'reporting_user'@'%';

4. Apply Privilege Changes

Ensure all changes take effect by running:

FLUSH PRIVILEGES;

5. Verify the User

List users to confirm creation:

SELECT user, host FROM information_schema.users;


Note: Always use strong, unique passwords and only grant the minimum required privileges to follow security best practices.

Installing the JDBC Driver

At the time of writing, the SingleStore JDBC driver is not included by default in Matillion Data Productivity Cloud due to licensing or redistribution restrictions. Therefore, you will need to manually download and install the driver before you can create database connections to SingleStore. Please follow the instructions below to complete this process.

Step 1: Download the SingleStore JDBC Driver

Navigate to the official SingleStore documentation page for JDBC drivers:
https://docs.singlestore.com/managed-service/en/reference/jdbc-driver.html

  • On this page, you will find information on the latest supported JDBC driver versions.
  • Look specifically for a Type 4 JDBC driver, as this type is preferred for use in Matillion Data Productivity Cloud.
  • Download the appropriate JAR file to your local environment.

Step 2: Install the Driver in Matillion Data Productivity Cloud

Because the JDBC driver is not distributed with Matillion by default, you will need to upload the JAR file to your Matillion Agent. Installation instructions for uploading external JDBC drivers can be found in detail here:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

Briefly, the process involves:

  • Accessing your Matillion Agent management interface.
  • Navigating to the section for JDBC or external driver uploads.
  • Uploading the SingleStore JDBC JAR file you downloaded.
  • Assigning the driver to the environments where you plan to use SingleStore connections.

Make sure to review the official Matillion documentation (linked above) for any environment-specific instructions or prerequisites, such as Agent restarts or Java compatibility considerations.

Step 3: Configure and Use the Driver

Once the SingleStore JDBC driver has been successfully uploaded and registered in your Matillion environment, you can proceed to configure database connections and put the driver to use.

Comprehensive instructions for setting up and using JDBC drivers within Matillion Data Productivity Cloud, including connection string formats and database query components, are available here:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

Be sure to reference these instructions when creating connections to your SingleStore databases, to ensure correct driver utilization and optimal performance.

With the above steps completed, your Matillion Data Productivity Cloud environment will be set up for working with SingleStore using its official JDBC driver.

Checking network connectivity

Before connecting Matillion Data Productivity Cloud to your SingleStore database, you must ensure that the database allows incoming connections from the appropriate sources, depending on your deployment type:

  • Full SaaS Agent Configuration:
    If you are using a Full SaaS agent, you need to allow incoming connections to your SingleStore database from Matillion's IP addresses. The list of required IP addresses is available at the following URL:
    Matillion IP Addresses.

  • Hybrid SaaS Deployment:
    For Hybrid SaaS configurations, your SingleStore database should permit incoming connections from the IP addresses belonging to your own virtual private cloud (VPC). To verify and troubleshoot connectivity from your VPC, you can use the utilities provided here:
    Check Network Access.

Additionally, if your SingleStore database is referenced using a DNS name instead of a direct IP address, the agent (either Full SaaS or Hybrid SaaS) must be able to resolve the DNS address successfully. Make sure any necessary DNS resolution requirements are met to avoid connection issues.
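Before wiring up the full pipeline, it can be useful to verify DNS resolution and TCP reachability from the machine where your agent runs. The sketch below is a minimal pre-check using only the Python standard library; the hostname and port are placeholders for your own cluster details (3306 is SingleStore's default MySQL-protocol port).

```python
# Minimal connectivity pre-check for a SingleStore endpoint.
# Hostname and port below are placeholders for your cluster.
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the hostname resolves via DNS."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check DNS resolution first, then the TCP port itself.
if can_resolve("singlestore.example.com"):
    print(can_connect("singlestore.example.com", 3306))
else:
    print("DNS resolution failed")
```

A successful TCP connection does not guarantee that authentication will succeed, but a failure here quickly isolates firewall or DNS issues before you debug the Matillion configuration.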

Querying Data from a SingleStore Database

This guide provides practical instructions for querying data from a SingleStore database, including SQL SELECT examples, insights on datatype conversion between SingleStore and Databricks, and guidance on best practices for initial and incremental data loads.


Example SingleStore Queries

To retrieve data from SingleStore, use standard SQL queries.

1. Select All Rows:

SELECT * FROM orders;

2. Select Specific Columns:

SELECT order_id, customer_id, order_total FROM orders;

3. Filter with WHERE Clause:

SELECT * FROM orders
WHERE order_date >= '2024-06-01';

4. Join Tables:

SELECT o.order_id, c.customer_name, o.order_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

5. Aggregate Data:

SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id;


Datatype Conversion Between SingleStore and Databricks

SingleStore and Databricks may use different internal datatypes for corresponding data. When transferring or querying data between the platforms (e.g., via a JDBC connector), implicit or explicit type mappings can occur.
  • Numeric types (e.g., DOUBLE, INT) are generally compatible, but check for range or precision differences.
  • String types (VARCHAR, TEXT) may map to the Databricks STRING type.
  • Date/Time columns (DATE, DATETIME, TIMESTAMP) may require explicit conversion or formatting.
  • Always review datatype mappings to ensure accurate data replication, especially when automating data pipelines.


Best Pattern: Initial and Incremental Loads

The recommended approach when querying SingleStore for ETL/ELT purposes is to:

  • Perform a one-time initial load of your dataset.
  • Subsequently use incremental loads to keep data up-to-date.

Database Query Component Pattern: use the same query component for both initial and incremental loads, modifying only the WHERE clause as described below.

1. Initial Load (No Filter)

For an initial, full-table load, use a query without any filter. Example:

SELECT * FROM orders;

2. Incremental Load (Filtered by Watermark Column)

For subsequent incremental loads, include a filter clause based on a watermark column (e.g., last_updated or an auto-incrementing id). Example:

SELECT * FROM orders
WHERE last_updated > '${last_loaded_timestamp}';

Here, ${last_loaded_timestamp} would be dynamically populated with the value from the last successful load.
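The watermark pattern above can be sketched as follows. This is a minimal illustration, not Matillion's internal behavior: the table and column names are placeholders, watermark storage is reduced to an in-memory value, and SQL escaping is simplified.

```python
# Hypothetical sketch of the watermark pattern: choose between a full initial
# load and a filtered incremental load based on the last loaded timestamp.
from datetime import datetime
from typing import Optional

def build_query(table: str, watermark_col: str,
                last_loaded: Optional[datetime]) -> str:
    """Return a full-load query on the first run, an incremental one afterwards."""
    if last_loaded is None:
        # Initial load: no filter, pull the entire table.
        return f"SELECT * FROM {table};"
    # Incremental load: only rows changed since the last successful load.
    ts = last_loaded.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} WHERE {watermark_col} > '{ts}';"

print(build_query("orders", "last_updated", None))
print(build_query("orders", "last_updated", datetime(2024, 6, 1)))
```

After each successful load, the stored watermark would be advanced to the maximum watermark value observed in the extracted batch, so the next run picks up only newer rows.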

For more in-depth information on incremental load strategies, consult the Matillion documentation on incremental loads:
https://exchange.matillion.com/articles/incremental-load-data-replication-strategy/

Data Integration Architecture

Loading data in advance of integration embodies a "divide and conquer" philosophy: it separates the overall data integration process into two distinct steps, staging (Extract and Load) and transformation (Transform). This separation is a key advantage of the Extract, Load, and Transform (ELT) architecture, allowing organizations to optimize each step independently. Data integration inherently requires transforming raw datasets into a unified, usable format, and the most effective way to achieve this within modern data ecosystems is through robust data transformation pipelines. These pipelines can accommodate complex business logic, facilitate ongoing maintenance, and ensure scalability. A further strength of ELT lies in performing all data transformation and integration tasks directly within the target Databricks environment. This centralized processing model not only enables rapid, on-demand data handling but also provides seamless scalability, eliminating the need for, and the cost of, external data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.