Extracting data from ClickHouse to Databricks
Extracting data from ClickHouse is an essential step for organizations looking to leverage the advanced analytics and processing capabilities of Databricks. Whether you are planning a one-time migration or establishing an ongoing data pipeline, understanding the correct procedures is crucial for secure and efficient data transfers. This article guides you through the key steps to move data from ClickHouse to Databricks. We start by creating an appropriate identity in ClickHouse to ensure secure access. Next, we outline how to verify or acquire the necessary JDBC driver, particularly for users working with Matillion Data Productivity Cloud. We also cover the network connectivity considerations that enable reliable communication between your source and target environments. Finally, we describe practical approaches to querying data, addressing both full (initial) loads and the incremental updates needed to keep analytics in Databricks up to date. By the end of this guide, you will have a clear understanding of the process and be well-equipped to implement a robust data extraction and loading workflow between ClickHouse and Databricks.
What is ClickHouse?
ClickHouse is an open-source, column-oriented database management system designed for high-performance online analytical processing (OLAP) workloads. Originally developed at Yandex, ClickHouse excels at ingesting and querying large volumes of data with sub-second response times, making it well-suited for real-time analytics applications such as event data analysis, monitoring, and business intelligence. Its architecture leverages columnar storage, data compression, and advanced indexing techniques to achieve efficient disk I/O and optimal CPU utilization. ClickHouse supports SQL-like queries and offers features such as sharding, replication, and horizontal scalability, enabling robust deployments in distributed environments. Integration with various data ingestion and visualization tools further enhances its utility in modern data pipelines.
What is Databricks?
Databricks provides a unified analytics platform built on Apache Spark, featuring the Databricks Lakehouse—a system blending data warehousing reliability with the scalability of data lakes. Powered by Delta Lake’s ACID-compliant storage, it supports both batch and streaming data pipelines for analytics, BI, and machine learning. Users can interact using SQL, Python, or Scala, while benefiting from auto-scaling, data versioning, optimization tools (like Z-Ordering and file compaction), and integration with diverse data sources. This cloud-native architecture enables organizations to securely manage the entire data lifecycle—from ingestion to analysis—in a collaborative environment.
Why Move Data from ClickHouse into Databricks
Leveraging Data Movement from ClickHouse to Databricks for Advanced Analytics
A data engineer or architect may wish to copy data from ClickHouse into Databricks for several compelling reasons. First, data stored in ClickHouse is often rich and potentially valuable, comprising high-volume analytical datasets that could provide significant business insights. However, the true value of this data frequently emerges when it is integrated with information from other disparate sources, enabling more comprehensive analytics and improved decision-making. By leveraging Databricks for data integration, professionals can efficiently merge and process data without imposing additional workload or performance constraints on the ClickHouse database itself. This approach ensures that ClickHouse remains responsive for its primary analytical workloads, while Databricks provides a powerful, scalable environment for unified data processing and advanced analytics.
Creating an Identity in ClickHouse
ClickHouse provides a robust access control system for managing user identities and their permissions. Users can be created and managed using SQL statements, which are executed with appropriate privileges (typically as an administrator).
Prerequisites
- Ensure that user management is enabled in your ClickHouse server configuration (access_control_path must be set).
- You need the CREATE USER privilege to create new users.
Creating a User
Use the CREATE USER SQL command to add a new user. The command syntax allows you to specify authentication methods, network restrictions, and other options.
Basic Example
Create a user named reporting_user with a specified password:
CREATE USER reporting_user IDENTIFIED BY 'strong_password';
Example: User with Host Restriction
Limit the user login to a specific IP address or network:
CREATE USER reporting_user IDENTIFIED BY 'strong_password' HOST IP '10.0.1.20'; -- Only allow connection from this IP
Or, for a network range:
CREATE USER reporting_user IDENTIFIED BY 'strong_password' HOST IP '10.0.1.0/24';
Example: User with LDAP Authentication
If using external LDAP authentication:
CREATE USER ldap_user IDENTIFIED WITH ldap SERVER 'ldap_server_name';
Granting Privileges
By default, new users do not have access to any objects. Grant necessary permissions explicitly, for example:
GRANT SELECT ON database_name.table_name TO reporting_user;
To grant broader access, such as all tables within a database:
GRANT SELECT ON database_name.* TO reporting_user;
Listing Users
To view existing users:
SHOW USERS;
For additional authentication methods or advanced options, refer to the ClickHouse documentation on user management.
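Putting these steps together, a minimal end-to-end setup for a read-only extraction identity might look like the following sketch. The user name, password, and network range are illustrative; adjust them to your environment:

```sql
-- Create a user restricted to one subnet, grant read-only access, and verify
CREATE USER reporting_user IDENTIFIED BY 'strong_password' HOST IP '10.0.1.0/24';
GRANT SELECT ON database_name.* TO reporting_user;
SHOW GRANTS FOR reporting_user;  -- confirm the grant is in place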
Installing the JDBC Driver
At the time of writing, the ClickHouse JDBC driver is not shipped by default with Matillion Data Productivity Cloud. This is due to licensing or redistribution restrictions that prevent including the driver package in the core distribution. As a result, you will need to manually obtain and install the driver in your environment.
1. Download the ClickHouse JDBC Driver
- Navigate to the official ClickHouse JDBC driver repository: https://github.com/ClickHouse/clickhouse-jdbc.
- Download the latest stable release of the driver. When selecting a driver, prioritize a Type 4 JDBC driver, as this type is 100% Java-based and does not require native libraries, ensuring compatibility with most Java environments.
- The release page on the repository typically provides pre-built .jar binaries for convenient download.
2. Install the JDBC Driver into Matillion Data Productivity Cloud
- Follow the detailed installation instructions provided by Matillion at Uploading External Drivers. These instructions guide you through the process of uploading a third-party JDBC jar file into the Matillion Data Productivity Cloud agent environment.
- Ensure that the driver jar you upload matches the required version for both the Matillion environment and your ClickHouse database to avoid compatibility issues.
- After the upload, review the driver's visibility and confirm that it is available for use in the Agent interface as described in the installation documentation.
3. Configure Database Connections Using the New Driver
- Once the driver is installed, you can proceed to configure ClickHouse connections using the built-in database query features in Matillion Data Productivity Cloud.
- For details on establishing connections and running queries with the ClickHouse JDBC driver, consult the official usage documentation here: Database Query Usage Guide.
- The guide will help you to select the correct driver, enter the required connection parameters, and test connectivity to your ClickHouse instance.
By following these steps, you will be able to integrate ClickHouse with Matillion Data Productivity Cloud via its JDBC interface.
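When entering connection parameters, the ClickHouse JDBC driver expects a URL of roughly the following shape. The host name and database are placeholders, and the exact properties supported vary by driver version, so confirm against the driver's own documentation:

```
jdbc:clickhouse://clickhouse-host.example.com:8123/analytics_db?ssl=true
```

Port 8123 is ClickHouse's default HTTP interface port, which recent versions of the JDBC driver use by default; TLS-secured deployments commonly use port 8443 together with ssl=true.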
Checking network connectivity
To ensure successful integration between Matillion Data Productivity Cloud and your ClickHouse database, you must verify that the database accepts incoming connections from authorized sources based on your selected deployment model:
- Full SaaS agent configuration: Configure your ClickHouse database to allow incoming connections from the IP addresses listed in Matillion’s security documentation. This ensures that Matillion’s managed agents can communicate with your database endpoints.
- Hybrid SaaS deployment: In this configuration, you must permit incoming connections from your own virtual private cloud (VPC) containing the Hybrid SaaS agent. If you are unsure how to check network connectivity from your VPC, utilities are available at Matillion Exchange network check.
Note: If you are referencing your ClickHouse database using a DNS hostname rather than a static IP address, ensure that the relevant Matillion Full SaaS agent or your Hybrid SaaS agent can successfully resolve the DNS name. Failure to resolve the database address will prevent the connector from establishing a connection.
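As a quick diagnostic, DNS resolution and basic TCP reachability can be checked from the agent's environment before involving the JDBC driver at all. Below is a minimal sketch in Python; the host name is a placeholder, and 8123 is ClickHouse's default HTTP port:

```python
import socket

def check_clickhouse_reachable(host: str, port: int = 8123, timeout: float = 5.0) -> bool:
    """Return True if `host` resolves and accepts a TCP connection on `port`."""
    try:
        # Resolve the hostname first -- this is the step that fails when the
        # agent cannot resolve a DNS name for the database.
        family, socktype, proto, _, sockaddr = socket.getaddrinfo(
            host, port, proto=socket.IPPROTO_TCP
        )[0]
    except socket.gaierror:
        return False
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return True
        except OSError:
            return False
```

If this returns False, investigate firewall rules and DNS configuration before debugging the driver or connector settings themselves.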
How to Query Data from a ClickHouse Database
This guide covers querying data from a ClickHouse database using SQL, including considerations for datatype conversion with Databricks and best practices for loading data efficiently.
Querying Data with SQL SELECT Statements
You can retrieve data from ClickHouse tables using standard SQL SELECT queries. Here are some common examples:
Select All Columns from a Table
SELECT * FROM sales_orders;
Select Specific Columns with a Filter
SELECT order_id, customer_id, amount FROM sales_orders WHERE amount > 1000;
Aggregate Query Example
SELECT customer_id, SUM(amount) AS total_amount FROM sales_orders WHERE order_date >= '2024-01-01' GROUP BY customer_id;
Ordering and Limiting Results
SELECT * FROM sales_orders ORDER BY order_date DESC LIMIT 100;
Datatype Conversion: ClickHouse and Databricks
Be aware that datatypes may not always match perfectly between ClickHouse and Databricks.
- For example, ClickHouse's DateTime may convert to Databricks' TIMESTAMP.
- Numeric types like UInt32 may become BIGINT or INTEGER.
- Strings, arrays, and other types may also map differently.
Check your data mappings and test the queries to ensure datatype compatibility during data transfers. You may need to use CAST() in SQL to explicitly convert types:
SELECT order_id, CAST(order_date AS String) AS order_date_str FROM sales_orders;
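To make the mapping explicit in pipeline code, a lookup table can help. The Python sketch below covers only the types mentioned above and is an illustrative assumption, not an authoritative mapping; verify it against the conversions your driver actually performs:

```python
# Hypothetical ClickHouse -> Databricks type mapping; verify against the
# actual conversions performed by your JDBC driver and load tooling.
CLICKHOUSE_TO_DATABRICKS = {
    "DateTime": "TIMESTAMP",
    "Date": "DATE",
    "UInt32": "BIGINT",   # unsigned 32-bit values need a wider signed type
    "Int32": "INT",
    "Float64": "DOUBLE",
    "String": "STRING",
}

def map_type(clickhouse_type: str) -> str:
    """Map a ClickHouse type name to a Databricks SQL type name."""
    # Arrays map element-wise, e.g. Array(String) -> ARRAY<STRING>
    if clickhouse_type.startswith("Array(") and clickhouse_type.endswith(")"):
        return f"ARRAY<{map_type(clickhouse_type[6:-1])}>"
    return CLICKHOUSE_TO_DATABRICKS.get(clickhouse_type, "STRING")
```

Defaulting unknown types to STRING is a conservative choice: it never loses data, and the value can be cast downstream once the correct target type is confirmed.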
Loading Data: Initial and Incremental Loads
The ideal data loading strategy is:
- One-time Initial Load
- Ongoing Incremental Loads
Both processes can (and should) use the same Database Query component for consistency.
Initial Load Example
During the initial load, you generally want to select the entire table, with no filter:
SELECT * FROM sales_orders;
This pulls all records into your destination system (such as Databricks).
Incremental Load Example
After the initial load, incremental loads only extract new or modified records. Use a filter clause referencing a high watermark (typically a timestamp or incrementing ID):
SELECT *
FROM sales_orders
WHERE order_date > '{{last_loaded_order_date}}';
Here, {{last_loaded_order_date}} is a variable storing the most recent value already loaded.
For details on incremental loading strategies, refer to Matillion's guide on incremental load data replication.
Note: Always test your queries for expected results, and verify that filters for incremental loads use columns that efficiently support change tracking (like timestamps or monotonic IDs).
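The high-watermark pattern above can be sketched in a few lines of Python. An in-memory list stands in for the ClickHouse table, and the column names are illustrative:

```python
# Simulated rows from sales_orders; in practice these come from the
# incremental SELECT against ClickHouse.
orders = [
    {"order_id": 1, "order_date": "2024-01-05"},
    {"order_id": 2, "order_date": "2024-02-10"},
    {"order_id": 3, "order_date": "2024-03-15"},
]

def extract_incremental(rows, last_loaded_order_date):
    """Return rows newer than the watermark, plus the new watermark value."""
    # ISO-8601 date strings compare correctly as plain text.
    new_rows = [r for r in rows if r["order_date"] > last_loaded_order_date]
    new_watermark = max(
        (r["order_date"] for r in new_rows), default=last_loaded_order_date
    )
    return new_rows, new_watermark

batch, watermark = extract_incremental(orders, "2024-02-01")
# `watermark` is persisted after each run and becomes the
# {{last_loaded_order_date}} variable on the next run
```

Note that the watermark only advances when new rows arrive; re-running with an unchanged watermark extracts nothing, which is exactly the idempotent behavior an incremental load needs.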
Data Integration Architecture
Loading data in advance of integration is a fundamental design advantage of the ELT (Extract, Load, Transform) architecture, as it enables organizations to divide and conquer the complexities of data management by separating the loading and integration steps. By first loading raw data into the target environment, Databricks users can then leverage powerful data transformation pipelines to process and integrate this information efficiently. These pipelines enable sophisticated data transformation workflows, which are essential for ensuring clean, unified datasets ready for analysis. A further advantage of the ELT approach in Databricks is that data transformation and integration are executed directly within the target Databricks database. This not only accelerates processing and supports on-demand pipeline execution, but it also ensures scalability by utilizing the database’s native compute resources, eliminating the need to invest in additional data processing infrastructure.