Integrate data from StarRocks to Databricks using Matillion

The StarRocks to Databricks connector seamlessly transfers your data to Databricks within minutes, ensuring it remains current without requiring manual coding or the management of complex ETL scripts.

Extracting data from StarRocks to Databricks

Extracting data from StarRocks is a key step in leveraging its high-performance analytics for further processing and exploration in platforms such as Databricks. Moving data seamlessly and securely between these systems requires careful planning to ensure data integrity and performance. This article guides you through the essential steps to extract data from StarRocks and load it into Databricks, enabling you to unlock greater value from your data assets:

  • Creating an appropriate identity in StarRocks, with the permissions needed for controlled data extraction.
  • For users working within Matillion, checking for an existing JDBC driver or acquiring the correct one to ensure smooth connectivity.
  • Best practices for establishing reliable connectivity between your source (StarRocks) and target (Databricks) environments, accounting for security and firewall considerations.
  • Efficient data extraction techniques, outlining how to query data both for initial loads and incremental updates.

By following this guide, you will be equipped to design a robust and scalable data pipeline from StarRocks to Databricks.


What is StarRocks?

StarRocks is a high-performance, massively parallel analytic database designed to support real-time, ad hoc analysis on large volumes of data. Leveraging a columnar storage engine and vectorized execution, StarRocks delivers ultra-fast query performance for both traditional OLAP and emerging hybrid workloads, including real-time analytics and flexible federated queries. Its architecture is built around strong consistency and ease of scaling, offering seamless support for distributed deployments on commodity hardware. StarRocks features native compatibility with MySQL protocols and standard SQL syntax, simplifying integration with existing BI tools and data pipelines. Advanced optimizations, such as cost-based and rule-based query optimizers, further enhance performance, making it a suitable choice for demanding enterprise analytics environments.


What is Databricks?

The Databricks database, built on the Lakehouse Platform, unifies analytics and machine learning by blending data warehouse features with the flexibility of data lakes. Using Delta Lake, it supports ACID transactions, scalable metadata, and time travel for versioning within the lake architecture. Users can efficiently store and query structured and semi-structured data at scale with SQL or Apache Spark. Optimizations like data compaction, file pruning, and caching enhance performance and reliability for large-scale data engineering, BI, and real-time analytics. Databricks integrates with common data sources and cloud infrastructure, enabling collaborative workflows and streamlined data governance throughout the lifecycle.

Why Move Data from StarRocks into Databricks

Benefits of Copying Data from StarRocks to Databricks for Advanced Analytics

A data engineer or architect might choose to copy data from StarRocks into Databricks for several compelling reasons. First, the data stored in StarRocks is often of considerable value—containing transactional, analytical, or operational insights that could be pivotal for strategic decision-making. To fully realize this value, it is often beneficial to integrate StarRocks data with information from other sources, enabling richer analysis, advanced machine learning, and comprehensive business intelligence. By performing this data integration within Databricks, rather than directly on StarRocks, organizations can avoid imposing additional computational workloads on their StarRocks clusters. This approach ensures the performance and stability of StarRocks for its primary use cases while leveraging Databricks’ scalable processing and expansive ecosystem to derive deeper insights from the integrated data.

Creating an Identity in StarRocks

To add a new user (identity) in a StarRocks database, you use the `CREATE USER` SQL statement. Users in StarRocks are associated with authentication credentials and optional connection constraints, and you can assign one or more privileges for data access and management.

Syntax

CREATE USER 'username'@'host' IDENTIFIED BY 'password';

  • `'username'`: The name for the new user.
  • `'host'`: The host from which the user is allowed to connect. Use `'%'` for any host, or specify an IP address or hostname.
  • `'password'`: The plaintext password for the user.

Example: Create a User

CREATE USER 'report_reader'@'%' IDENTIFIED BY 'StrongPa$$w0rd';

This command creates a user named `report_reader` who can connect from any host, authenticating with the password `StrongPa$$w0rd`.

Creating a User Restricted to a Specific Host

CREATE USER 'etl_user'@'192.168.1.100' IDENTIFIED BY 'etl_strong_pw';

This user (`etl_user`) will only be able to log in from the IP address `192.168.1.100`.

Assigning Privileges to the New User

By default, a new user has no privileges. To allow a user to perform actions, grant them privileges explicitly using the `GRANT` statement:

GRANT SELECT ON my_database.* TO 'report_reader'@'%';

This statement permits `report_reader` to execute `SELECT` queries on all tables in `my_database`.

Viewing Existing Users

To list all users in the system:

SHOW USERS;

Or, to view the grants associated with a specific user:

SHOW GRANTS FOR 'report_reader'@'%';

Installing the JDBC Driver

At the time of writing, the StarRocks JDBC driver is not bundled by default with Matillion Data Productivity Cloud, primarily due to licensing or redistribution restrictions. As such, you will need to manually download and install the driver before StarRocks can be used as a data source within the Matillion platform.

To proceed, follow these steps:

  1. Download the Driver
     • Visit the official StarRocks JDBC documentation page.
     • Locate and download the Type 4 JDBC driver, as it provides a pure Java implementation best suited for JDBC connectivity in most environments.

  2. Upload the Driver to Matillion Data Productivity Cloud
     • Obtain the latest `.jar` file for the JDBC driver from the location above.
     • Follow the step-by-step instructions outlined in Uploading external drivers to add the JDBC driver to your Matillion agent. This usually involves accessing the Matillion agent's management interface and uploading the downloaded `.jar` file.

  3. Configure and Use the Driver
     • After the upload completes and the driver is available, you can use it to connect to StarRocks.
     • For guidance on how to configure database queries using an external driver, refer to the documentation for Database Query in Matillion Data Productivity Cloud.
By following these steps and referencing the supplied documentation links, you enable connectivity from Matillion Data Productivity Cloud to your StarRocks database deployment via JDBC.
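Once the driver is installed, you will need a JDBC connection string. The sketch below shows how one can be assembled; it assumes the MySQL-compatible URL format (StarRocks speaks the MySQL wire protocol) and the default FE query port 9030. The hostname and database name are placeholders; substitute your own values.

```python
def starrocks_jdbc_url(host: str, database: str, port: int = 9030) -> str:
    """Build a JDBC connection URL for a StarRocks FE node.

    StarRocks is compatible with the MySQL wire protocol, so the URL
    follows the familiar jdbc:mysql:// format. Port 9030 is the default
    FE query port; adjust it if your deployment differs.
    """
    return f"jdbc:mysql://{host}:{port}/{database}"

# Hypothetical host and database for illustration
print(starrocks_jdbc_url("starrocks-fe.example.com", "my_database"))
```

Keeping the URL construction in one place makes it easy to swap hosts or ports between development and production environments.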

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your StarRocks database, you must verify that your StarRocks database is configured to allow incoming connections based on your deployment type:

  • Full SaaS Agent Configuration:
    Your StarRocks database must allow incoming connections from the IP addresses used by Matillion Data Productivity Cloud Full SaaS agents. You can find the up-to-date list of required IP addresses at: Matillion Allowing IP Addresses.

  • Hybrid SaaS Deployment:
    In a Hybrid SaaS setup, the connection originates from within your own virtual private cloud (VPC). You must ensure that your StarRocks database allows incoming connections from the relevant IP addresses in your VPC. To help check access from your environment, you can use utilities found here: Matillion Exchange Network Access Checker.

Additionally, if your StarRocks database is referenced by a DNS hostname rather than a static IP, ensure that the Matillion Full SaaS or Hybrid SaaS agent is able to resolve the DNS address successfully to establish the connection.
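Before configuring the connection in Matillion, it can save time to confirm basic TCP reachability from the machine where the agent runs. The following is a minimal sketch using only the Python standard library; the host and port in the comment are placeholders, with 9030 assumed as the default StarRocks FE query port.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only verifies network reachability (DNS resolution, routing,
    firewall rules); it does not authenticate against the database.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hypothetical host; 9030 is the assumed default FE query port):
# can_reach("starrocks-fe.example.com", 9030)
```

If this check fails, revisit the IP allow-listing and DNS considerations above before troubleshooting the driver configuration itself.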

Querying Data from a StarRocks Database

This guide will illustrate how to query data from a StarRocks database, including best practices for initial and incremental loads, and data considerations when integrating with tools such as Databricks.


Example StarRocks SQL SELECT Statements

You can query data from StarRocks using standard SQL. Typical queries include:

```
-- Simple SELECT from a table
SELECT * FROM orders;

-- Selecting specific columns
SELECT order_id, customer_id, purchase_date FROM orders;

-- Filtering data
SELECT * FROM customers WHERE city = 'New York';

-- Aggregation example
SELECT product_id, COUNT(*) AS total_orders
FROM order_lines
GROUP BY product_id
ORDER BY total_orders DESC;

-- Date range filter for incremental load
SELECT * FROM transactions WHERE transaction_date > '2024-06-01 00:00:00';
```

Data Type Conversion Considerations

When extracting or loading data between StarRocks and Databricks:

  • Be aware that datatype conversion may occur, especially with types such as `DATE`, `TIMESTAMP`, or `DECIMAL`.
  • Confirm numeric and date/time compatibility; for example, StarRocks `DATETIME` may map to Databricks `TIMESTAMP`, but formats and supported ranges can differ.
  • Always validate your data after migration for non-lossy conversions, especially with `STRING`/`VARCHAR` and floating-point columns.
  • Consider explicitly casting columns in your SQL statements when compatibility is critical:

SELECT CAST(order_amount AS DOUBLE) AS order_amount FROM orders;
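To keep these conversions consistent across a pipeline, it can help to centralize the type mapping. The table below is an illustrative sketch, not an authoritative reference: the pairings reflect common defaults, but precision, scale, and supported ranges vary by version, so verify against the StarRocks and Databricks documentation for your deployment.

```python
# Illustrative (not exhaustive) mapping of common StarRocks column types
# to their usual Databricks / Delta Lake counterparts. Verify against the
# documentation for your specific versions before relying on it.
STARROCKS_TO_DATABRICKS = {
    "BOOLEAN": "BOOLEAN",
    "INT": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE",
    "DECIMAL": "DECIMAL",     # precision/scale limits can differ
    "DATE": "DATE",
    "DATETIME": "TIMESTAMP",  # supported ranges and precision can differ
    "VARCHAR": "STRING",
    "CHAR": "STRING",
    "STRING": "STRING",
}

def target_type(starrocks_type: str) -> str:
    """Look up the likely Databricks type for a StarRocks column type.

    Falls back to STRING for unmapped types, which is lossless for
    representation but loses type semantics, so flag such columns
    for manual review.
    """
    return STARROCKS_TO_DATABRICKS.get(starrocks_type.upper(), "STRING")
```

A mapping like this can drive automated schema validation after each load, catching silent conversions before they reach downstream consumers.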

Best Pattern: Initial and Incremental Loads

The recommended pattern when transferring data (e.g., for analytics platforms like Databricks) is:

  1. One-off Initial Load: Extract all data without any filters.
  2. Incremental Load: Subsequently extract only new or changed data, using an appropriate filter.

Both types of load use the same Database Query component as the extraction mechanism; only the configuration changes depending on the load type.

Initial Load Query Example

No filter clause; extracts all records:

SELECT * FROM orders;

Incremental Load Query Example

A filter clause is added (usually a timestamp or incremental key):

SELECT * FROM orders
WHERE updated_at > '2024-07-01 00:00:00';

The database query structure (e.g., in your ETL tool) remains consistent; only the `WHERE` clause changes according to your high-water mark or last extraction point.

For details on incremental loading strategies, refer to Matillion’s incremental load data replication strategy.
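The initial/incremental pattern above can be sketched as a small query builder. This is a simplified illustration: the table and column names are placeholders, and a real pipeline would persist the high-water mark between runs (for example, in a control table) and use parameterized queries rather than string interpolation.

```python
from typing import Optional

def build_extract_query(table: str,
                        watermark_column: Optional[str] = None,
                        last_value: Optional[str] = None) -> str:
    """Build the extraction query for an initial or incremental load.

    With no watermark, the query performs a full (initial) load; with a
    watermark column and the last extracted value, it selects only rows
    changed since the previous run. In production, prefer parameterized
    queries over string interpolation to avoid SQL injection.
    """
    query = f"SELECT * FROM {table}"
    if watermark_column and last_value:
        query += f" WHERE {watermark_column} > '{last_value}'"
    return query

# Initial load: no filter, extracts all records
print(build_extract_query("orders"))
# Incremental load: filter from the stored high-water mark
print(build_extract_query("orders", "updated_at", "2024-07-01 00:00:00"))
```

Because only the `WHERE` clause differs, the same extraction component can serve both load types, with the watermark value updated after each successful run.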


References:
- StarRocks Documentation
- Matillion Article: Incremental Load Strategy

Data Integration Architecture

Loading data in advance of integration exemplifies the "divide and conquer" approach, as it splits the process into two distinct steps: first ingesting raw data, then transforming and integrating it as needed. This staged workflow is a key advantage of the Extract, Load, and Transform (ELT) architecture, as it enables flexibility and efficiency in managing large and diverse datasets. Effective data integration almost always relies on data transformation, and data transformation pipelines, such as those built using Databricks Workflows or Delta Live Tables, provide a robust framework for orchestrating these complex processes. A further benefit of the ELT model in the context of Databricks is that data transformation and integration are performed directly within the target Databricks database. This approach delivers on-demand, high-performance, and scalable data processing, while minimizing operational overhead by removing the need for separate data transformation infrastructure or external computation resources.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.