Extracting data from ClickHouse to Snowflake
Extracting data from ClickHouse is an increasingly common requirement for organizations seeking to integrate disparate data sources and leverage modern cloud data platforms such as Snowflake. As data-driven decision-making becomes more critical, the need to move data between systems efficiently, securely, and with minimal operational friction continues to grow. In this article, we will guide you through a practical workflow for extracting data from ClickHouse and loading it into Snowflake. Our discussion covers the essential steps required for a robust and reliable integration:

- **Creating an identity in ClickHouse:** We'll review how to set up appropriate user credentials to enable secure and authorized access to your ClickHouse instance.
- **For Matillion users, checking or acquiring the JDBC driver:** If you are working with Matillion ETL, ensuring that the correct JDBC driver for ClickHouse is present is a key prerequisite.
- **Network connectivity from source to target:** Establishing and validating the necessary network paths between ClickHouse and Snowflake is vital for successful data transfer.
- **Querying data, initially and incrementally:** Finally, we'll explore strategies for querying your data, both for the initial migration and for ongoing incremental loads.

By following this approach, you'll gain a clear understanding of best practices for a secure and efficient data extraction workflow from ClickHouse to Snowflake.
What is ClickHouse?
ClickHouse is an open-source, column-oriented database management system designed for high-performance, real-time analytical processing of large volumes of data. Engineered with a focus on speed and scalability, ClickHouse supports massively parallel processing and can execute complex analytical queries on petabytes of data with minimal latency. Its columnar storage format ensures efficient data compression, vectorized query execution, and minimizes disk I/O, making it particularly well-suited for OLAP workloads such as business intelligence, monitoring, and log analytics. ClickHouse features SQL-like query syntax, support for distributed and replicated clusters, and a robust ecosystem of connectors and integration tools, allowing seamless interoperability with popular data ingestion, visualization, and orchestration platforms.
What is Snowflake?
Snowflake is a fully-managed, cloud-native data platform for data warehousing, analytics, and sharing across multiple cloud providers. Its multi-cluster, shared-data architecture separates storage and compute, enabling independent scaling and performance with simultaneous queries. Supporting structured and semi-structured formats (JSON, Avro, Parquet), it offers robust pipelines and compression. The SQL-based interface, zero-copy cloning, time travel, and role-based access control provide strong data management and security. Snowflake’s marketplace and secure data sharing support real-time collaboration within and between organizations, making it a versatile solution for modern data engineering and analytics needs.
Why Move Data from ClickHouse into Snowflake
Benefits of Replicating ClickHouse Data into Snowflake for Advanced Analytics
A data engineer or architect may choose to copy data from ClickHouse into Snowflake for several strategic reasons. First, ClickHouse often contains valuable operational data that can provide significant insights when analyzed alongside information from other systems. By moving this data into Snowflake, organizations can integrate ClickHouse datasets with data from additional sources, thus unlocking richer analytics and bringing out the true value of their data assets. Furthermore, performing data integration and complex queries directly within Snowflake means that the compute-intensive workloads do not impact the performance of the ClickHouse cluster, which may be optimized for high-speed inserts and real-time analytics. This approach allows for enhanced data exploration and reporting capabilities in Snowflake, while preserving the efficiency and stability of the production ClickHouse environment.
Creating an Identity in ClickHouse
This guide describes how to create a user account in ClickHouse and grant basic access privileges.
1. Prerequisites
- SQL access to the ClickHouse server with administrator privileges.
- The server is configured to manage users either through `users.xml` or as SQL entities (SQL-driven user management, available in modern ClickHouse versions; refer to your deployment's documentation).
- This guide uses SQL-based user management.
2. Creating a User
To create a new user, use the `CREATE USER` statement. You can specify user properties such as the password, authentication method, allowed host(s), and default profile.
```
CREATE USER IF NOT EXISTS new_user
IDENTIFIED BY 'StrongPassword123!'
HOST ANY;
```
- Replace `new_user` with the desired username.
- Replace `StrongPassword123!` with a secure password.
Note: The `HOST` clause restricts the allowed host(s) from which the user can connect. Use `ANY` to permit all hosts, or provide hostname(s)/IP(s) as required.
3. Granting Privileges
By default, a new user has no access to any databases or tables. Use `GRANT` to assign privileges. For example, to allow `SELECT` and `INSERT` on the `mydb.mytable` table:
```
GRANT SELECT, INSERT ON mydb.mytable TO new_user;
```
For all tables in a database:
```
GRANT SELECT, INSERT ON mydb.* TO new_user;
```
Grant additional privileges (e.g., `ALTER`, `DROP`) as needed.
4. Optional: Specify a User Profile
Profiles in ClickHouse control quotas, settings, and restrictions. Assign a profile if your organization uses them:
```
CREATE USER another_user
IDENTIFIED BY 'AnotherPassword!'
SETTINGS PROFILE 'default';
```
5. Verifying the User
To verify that the user was created:
```
SHOW USERS;
```
Or, to view details of a specific user:
```
SHOW CREATE USER new_user;
```
Note: You may require the `CREATE USER` privilege to create users. For more information about user management, see the ClickHouse documentation: User Management.
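If you need to provision the same user and grants across several environments, the statements above can be generated programmatically. The following is a minimal Python sketch; the helper function names are illustrative and not part of any ClickHouse client library, and the generated SQL matches the examples in this guide:

```python
# Build the CREATE USER / GRANT DDL shown in sections 2-3 as strings,
# so the same definitions can be applied to multiple ClickHouse servers.

def create_user_ddl(user: str, password: str, host: str = "ANY") -> str:
    """Compose a CREATE USER statement like the example above."""
    return (
        f"CREATE USER IF NOT EXISTS {user} "
        f"IDENTIFIED BY '{password}' "
        f"HOST {host};"
    )

def grant_ddl(user: str, privileges: list, target: str) -> str:
    """Compose a GRANT statement for the given privileges and target."""
    return f"GRANT {', '.join(privileges)} ON {target} TO {user};"

statements = [
    create_user_ddl("new_user", "StrongPassword123!"),
    grant_ddl("new_user", ["SELECT", "INSERT"], "mydb.mytable"),
]
for stmt in statements:
    print(stmt)
```

Each generated string can then be executed against the server with whichever ClickHouse client your team uses.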
Installing the JDBC driver
At the time of writing, the ClickHouse JDBC driver is not bundled with Matillion Data Productivity Cloud due to licensing or redistribution restrictions. Therefore, you must manually obtain and install the driver to enable connectivity between Matillion and your ClickHouse database.
Step 1: Download the ClickHouse JDBC Driver
- Access the official GitHub repository for the ClickHouse JDBC driver: https://github.com/ClickHouse/clickhouse-jdbc.
- On the repository page, locate the latest release and download the driver file. Preferably, select the Type 4 JDBC driver, as it is a pure Java implementation and well-suited for most environments where Matillion runs.
Step 2: Install the Driver in Matillion Data Productivity Cloud
- Review and follow the official Matillion documentation on installing external JDBC drivers: Installing external drivers in Matillion Data Productivity Cloud.
- This guide will walk you through the process of uploading the downloaded ClickHouse JDBC JAR file to the Matillion Agent, ensuring it is available for use in data integration workflows.
Step 3: Configure and Use the Driver
- After installation, refer to the database query usage instructions to configure and test your ClickHouse connection within Matillion Data Productivity Cloud.
- This documentation includes details on creating database integrations, querying your data, and troubleshooting connectivity issues.
By following these steps, you will enable support for ClickHouse within your Matillion environment through the addition of its Type 4 JDBC driver.
Checking network connectivity
To ensure that Matillion Data Productivity Cloud can connect to your ClickHouse database, you must allow incoming network connections according to your deployment configuration:
- **Full SaaS agent:** Configure your ClickHouse database's allowlist or firewall settings to permit inbound connections from all IP addresses listed in the following documentation: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/.
- **Hybrid SaaS deployment:** Allow inbound connections from your own virtual private cloud (VPC) where the Matillion Hybrid SaaS agent is deployed. To assist with connectivity testing and troubleshooting, use the utilities provided here: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/.
Additionally, if your ClickHouse database endpoint is specified using a DNS hostname rather than a static IP address, verify that the corresponding agent (Full SaaS or Hybrid SaaS) is able to resolve the DNS address correctly for reliable connectivity.
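A quick way to sanity-check both DNS resolution and port reachability before configuring anything in Matillion is a plain TCP connection test. The sketch below assumes nothing about your deployment; the hostname is a placeholder, and 8123 and 9000 are ClickHouse's default HTTP and native protocol ports:

```python
# Basic TCP reachability check for a ClickHouse endpoint.
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    DNS resolution failures and refused/timed-out connections both raise
    OSError subclasses, so they all report as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage (hostname is a placeholder):
#   can_connect("clickhouse.example.com", 8123)   # HTTP interface
#   can_connect("clickhouse.example.com", 9000)   # native protocol
```

If this check fails from the machine running the agent but succeeds elsewhere, the problem is almost certainly the allowlist or routing configuration described above rather than the driver.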
Querying Data from ClickHouse Database
To extract data from a ClickHouse database, you use SQL syntax that will be familiar to anyone with experience in other relational databases. Here are core concepts and examples to guide your querying process, including guidance on integrating these processes with tools such as Matillion and handling specific scenarios like initial and incremental data loads.
ClickHouse Query Examples
The fundamental pattern to query data from ClickHouse is the `SELECT` statement:
```
-- Select all fields from a table
SELECT * FROM my_table;

-- Select specific columns
SELECT user_id, session_start, event_type FROM event_log;

-- Aggregate functions and filtering
SELECT country, count(*) AS users
FROM users
WHERE signup_date >= '2024-01-01'
GROUP BY country
ORDER BY users DESC
LIMIT 10;
```
Handling Datatype Conversion (ClickHouse ↔ Snowflake)
When transferring data between ClickHouse and Snowflake, be aware that some datatypes do not map perfectly—automatic conversion may be required. Common considerations:
| ClickHouse Type | Common Snowflake Equivalent |
|---|---|
| Int32, UInt64 | NUMBER |
| Decimal(10,2) | NUMBER(10,2) |
| Date, DateTime | DATE, TIMESTAMP_NTZ |
| String | VARCHAR |
| Float64 | FLOAT |
Always check and, if necessary, CAST fields explicitly in your query for compatibility:
```
SELECT
    user_id::String AS user_id,
    CAST(balance AS Decimal(10,2)) AS balance_usd
FROM accounts;
```
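When generating target DDL for many tables, it helps to encode the mapping table above in code. This Python sketch covers only the types discussed here (extend the dictionary for your own schema), and the function name is illustrative:

```python
# Map ClickHouse type names to Snowflake equivalents, per the table above.
import re

TYPE_MAP = {
    "Int32": "NUMBER",
    "UInt64": "NUMBER",
    "Date": "DATE",
    "DateTime": "TIMESTAMP_NTZ",
    "String": "VARCHAR",
    "Float64": "FLOAT",
}

def to_snowflake_type(ch_type: str) -> str:
    """Return the Snowflake type for a ClickHouse type name."""
    # Decimal(p,s) keeps its precision and scale as NUMBER(p,s).
    m = re.fullmatch(r"Decimal\((\d+),\s*(\d+)\)", ch_type)
    if m:
        return f"NUMBER({m.group(1)},{m.group(2)})"
    try:
        return TYPE_MAP[ch_type]
    except KeyError:
        raise ValueError(f"No mapping defined for ClickHouse type {ch_type!r}")
```

Failing loudly on unmapped types (rather than defaulting to VARCHAR) makes schema drift visible at build time instead of after a bad load.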
Initial vs Incremental Loads
Best practice is to use a one-off initial load followed by incremental loads using the same Database Query component, modifying only the `WHERE` filter for the incremental loads.
1. Initial Load
For the initial load, you typically extract the full source table without any filtering. For example:
```
SELECT * FROM orders;
```
Set up your Database Query component (e.g., in Matillion ETL) without any filter clause at this stage.
2. Incremental Load
For subsequent incremental loads, use a filter to fetch only new or changed records—commonly based on a timestamp or an incrementing primary key. For example:
```
SELECT *
FROM orders
WHERE updated_at > '{{last_loaded_timestamp}}';
```
Here, `{{last_loaded_timestamp}}` should be replaced by a variable or value representing the cutoff from the previous load.
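The watermark-driven pattern described above can be sketched in plain Python. In Matillion the cutoff value would normally live in a pipeline variable; the file-based watermark store and helper names below are purely illustrative:

```python
# Build the initial and incremental extraction queries, persisting a
# watermark (the highest updated_at seen so far) between runs.
import json
from pathlib import Path
from typing import Optional

WATERMARK_FILE = Path("orders_watermark.json")  # illustrative location

def build_query(table: str, cursor_column: str, last_value: Optional[str]) -> str:
    """Full extract when no watermark exists; otherwise filter on the cursor."""
    if last_value is None:
        return f"SELECT * FROM {table};"
    return f"SELECT * FROM {table} WHERE {cursor_column} > '{last_value}';"

def load_watermark() -> Optional[str]:
    if WATERMARK_FILE.exists():
        return json.loads(WATERMARK_FILE.read_text())["last_value"]
    return None

def save_watermark(value: str) -> None:
    WATERMARK_FILE.write_text(json.dumps({"last_value": value}))

# First run: no watermark, so this produces the initial full load.
print(build_query("orders", "updated_at", load_watermark()))

# After a successful load, record the max updated_at seen in the batch.
save_watermark("2024-06-01 00:00:00")

# Subsequent runs produce the incremental query.
print(build_query("orders", "updated_at", load_watermark()))
```

Only update the watermark after the load into Snowflake has succeeded; otherwise a failed run would silently skip the rows it never delivered.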
Read more about this incremental load pattern and how to implement it in ETL pipelines:
https://exchange.matillion.com/articles/incremental-load-data-replication-strategy/
Note: Maintain consistent column ordering and datatypes in queries for both initial and incremental loads to ensure seamless downstream integration.
Data Integration Architecture
Loading data into Snowflake in advance of integration is a "divide and conquer" approach that separates the process into two manageable steps: first, loading the raw data, and second, integrating and transforming it as required. This approach exemplifies the benefits of the ELT (Extract, Load, Transform) architecture, which contrasts with traditional ETL methods by deferring transformation until after the data is available in the target system.

Effective data integration necessitates data transformation, and the optimal method for achieving this is through data transformation pipelines: automated workflows designed to systematically cleanse, structure, and enrich the data.

A key advantage of the ELT architecture is that all transformation and integration operations occur directly inside the Snowflake database. As a result, these processes are fast, scalable, and on-demand, allowing organizations to make immediate use of the powerful and elastic computing resources offered by Snowflake, without the need for separate data processing infrastructure.