Extracting data from Apache Kylin to Snowflake
Extracting data from Apache Kylin is an essential step when organizations aim to take advantage of powerful analytics platforms like Snowflake. By enabling seamless data movement from Kylin’s OLAP cubes to Snowflake, businesses can unlock advanced reporting, deeper cross-source analysis, and scalable cloud-native compute capabilities. In this article, we will guide you through the process of exporting data from Apache Kylin and loading it into Snowflake. The following topics will be addressed:

- **Creating an identity in Apache Kylin:** We'll outline the steps to create and configure the necessary user or service account to permit secure data extraction.
- **Acquiring the JDBC driver for Matillion users:** If you are using Matillion for orchestration, we'll show you how to verify that the required JDBC driver is available, or, if not, how to obtain it.
- **Ensuring network connectivity:** We'll discuss the prerequisites for connecting your source (Kylin) and target (Snowflake) environments, including necessary firewall and networking considerations.
- **Querying data (initial and incremental):** Finally, we'll detail best practices for extracting data both in the context of a full, initial load as well as for ongoing incremental updates.

Whether you're performing a one-time migration or setting up automated data pipelines, following these steps will help ensure a smooth and reliable integration between Apache Kylin and Snowflake.
What is Apache Kylin?
Apache Kylin is an open-source distributed analytics engine designed to provide extremely fast online analytical processing (OLAP) on large-scale datasets, primarily on Hadoop and cloud environments. It precomputes multidimensional cubes from data stored in sources such as Apache Hive or relational databases, dramatically accelerating complex analytical queries by reducing them to efficient lookups. Kylin supports standard SQL for query operations and integrates seamlessly with business intelligence tools via ODBC, JDBC, and RESTful APIs. Leveraging techniques such as bitmap indexing, aggregation, and sharding, Apache Kylin can support sub-second query responses over datasets containing billions of rows, making it suitable for enterprise data warehousing and real-time analytics scenarios.
What is Snowflake?
Snowflake is a cloud-native data platform known for its multi-cluster shared data architecture, which separates storage from compute for flexible, on-demand scaling. Running natively on AWS, Azure, and Google Cloud, it seamlessly integrates with these platforms and supports SQL queries alongside semi-structured data like JSON, Parquet, and Avro. Snowflake features automatic clustering, result caching, robust security, and minimal administrative overhead. Its zero-maintenance infrastructure enables data sharing across organizations without moving data. Widely used for data warehousing, lakes, and sharing, Snowflake’s rapid elasticity and automatic optimization have led to broad adoption for diverse analytical workloads.
Why Move Data from Apache Kylin into Snowflake
Advantages of Copying Data from Apache Kylin to Snowflake for Analytics
A data engineer or architect may wish to copy data from Apache Kylin into Snowflake for several important reasons. First, the data stored within Apache Kylin is often rich and potentially valuable, encompassing multidimensional OLAP cubes that power analytical insights. However, the true utility of this data is realized when it is integrated with information from other enterprise sources. Consolidating data from Apache Kylin alongside additional datasets within Snowflake enables comprehensive analytics, richer business intelligence, and the generation of new insights that would not be possible using Apache Kylin in isolation. Furthermore, leveraging Snowflake's robust data warehousing capabilities for integration and analysis prevents additional workload from being placed on the Apache Kylin instance itself, thereby preserving its performance and reliability for its core OLAP functions. This approach not only secures operational stability but also maximizes the value derived from enterprise data assets.
Creating an Identity in Apache Kylin
This guide provides step-by-step instructions on how to create a user (identity) in Apache Kylin. Apache Kylin does not store users directly in the OLAP engine but instead manages user authentication and authorization via its Web UI or RESTful APIs, using either the built-in user management system or external authentication (e.g., LDAP).
Prerequisites
- You must have administrator privileges in the Apache Kylin environment.
- Kylin should be running and accessible.
- For these instructions, it is assumed that Kylin is configured for internal user management (the default mode, not LDAP).
Creating a User via Apache Kylin REST API
1. Prepare User Details
Define the username and password for the new user. You may also assign roles.
2. Construct the API Request
Use the `/kylin/api/user/authentication` endpoint to create a new user.
```bash
curl -X POST "http://<KYLIN_HOST>:<KYLIN_PORT>/kylin/api/user/authentication" \
  -H "Content-Type: application/json" \
  -u ADMIN_USERNAME:ADMIN_PASSWORD \
  -d '{
    "username": "new_user",
    "password": "user_password",
    "authorities": ["ROLE_ANALYST"]
  }'
```
- Replace `<KYLIN_HOST>` and `<KYLIN_PORT>` with your server details.
- Replace `ADMIN_USERNAME` and `ADMIN_PASSWORD` with administrator credentials.
- Adjust `authorities` to roles such as `ROLE_ADMIN`, `ROLE_ANALYST`, or `ROLE_MODELER` as needed.
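If you are scripting user creation rather than calling curl by hand, it can help to separate request assembly from request submission. The sketch below builds the same URL, headers, and JSON body as the curl command above using only the Python standard library; the function name and the `KYLIN_BASE` placeholder are illustrative, and the endpoint path is taken from this guide, so verify it against your Kylin version before use.

```python
import base64
import json

KYLIN_BASE = "http://<KYLIN_HOST>:<KYLIN_PORT>"  # replace with your server details

def build_create_user_request(admin_user, admin_password,
                              username, password, authorities):
    """Assemble the URL, headers, and JSON body for the user-creation call
    shown above. Nothing is sent; pass the result to any HTTP client."""
    url = f"{KYLIN_BASE}/kylin/api/user/authentication"
    # HTTP Basic auth header, equivalent to curl's -u flag
    token = base64.b64encode(f"{admin_user}:{admin_password}".encode()).decode()
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    }
    body = json.dumps({
        "username": username,
        "password": password,
        "authorities": authorities,
    })
    return url, headers, body

url, headers, body = build_create_user_request(
    "ADMIN", "KYLIN", "new_user", "user_password", ["ROLE_ANALYST"])
```

Keeping assembly separate from submission makes the payload easy to log and unit-test before you point it at a live Kylin instance.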
Creating a User via Apache Kylin Web UI
- Log in to the Kylin Web UI with administrator credentials.
- Go to System > User Management.
- Click Add User.
- Enter the username and password.
- Assign roles (e.g., ANALYST, MODELER, ADMIN).
- Click Save.
Listing Existing Users via REST API
You can retrieve the current list of users as follows:
```bash
curl -X GET "http://<KYLIN_HOST>:<KYLIN_PORT>/kylin/api/user/users" \
  -u ADMIN_USERNAME:ADMIN_PASSWORD
```
Additional Notes
- Apache Kylin does not use SQL scripts for user creation; all user management is handled via the UI or API.
- If LDAP is configured, manage users via your LDAP directory. Roles in Kylin can still be managed through the UI or API.
Installing the JDBC driver
If you plan to connect Matillion Data Productivity Cloud to an Apache Kylin database, you must manually install the JDBC driver for Kylin. At the time of writing, this driver is not bundled with Matillion by default, primarily due to licensing or redistribution restrictions. The following steps provide guidance for downloading and installing the JDBC driver so that you can configure this connection.
1. Download the Apache Kylin JDBC Driver
Start by downloading the latest JDBC driver for Apache Kylin. Visit the official Apache Kylin downloads page using this link: https://kylin.apache.org/download/
On the downloads page, look specifically for the Type 4 JDBC driver. Preference should be given to Type 4 drivers as they are entirely written in Java, require no native dependencies, and are generally the most compatible with cloud and containerized environments like Matillion Data Productivity Cloud.
Download the JDBC driver `.jar` file and save it to your local environment, ready for upload.
2. Install the JDBC Driver into Matillion
Since the Kylin JDBC driver is not included out-of-the-box with Matillion Data Productivity Cloud, you must upload it as an external driver to your Matillion Agent. Follow the official instructions for uploading custom or third-party JDBC drivers to the Matillion platform here:
https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
Refer to this documentation for the step-by-step process to ensure proper placement and activation of the JDBC driver within your agent environment. The process typically involves accessing the agent's interface, selecting the correct driver slot, and uploading your `.jar` file.
3. Configure and Use the Driver
After installing the JDBC driver, use the agent’s configuration to specify the connection parameters to your Kylin instance. For full guidance on configuring and testing your Kylin connections, consult the Matillion documentation:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
Follow the product’s guidelines for entering your JDBC URL, credentials, and any additional connection properties required for Apache Kylin. After following the configuration instructions, you will be ready to query data from Apache Kylin in your Matillion Data Productivity Cloud environment.
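The most important connection parameter is the JDBC URL itself. The helper below sketches the URL shape commonly used by the Apache Kylin JDBC driver (`jdbc:kylin://host:port/project`); treat the exact format as an assumption and confirm it against the documentation shipped with the driver version you downloaded.

```python
def kylin_jdbc_url(host, port, project):
    """Build a Kylin JDBC URL of the assumed form jdbc:kylin://host:port/project.
    Verify this format against the docs for your downloaded driver version."""
    return f"jdbc:kylin://{host}:{port}/{project}"

# 'kylin.internal' and 'sales_project' are hypothetical names for illustration
print(kylin_jdbc_url("kylin.internal", 7070, "sales_project"))
# jdbc:kylin://kylin.internal:7070/sales_project
```

Generating the URL programmatically (rather than pasting it) keeps host, port, and project consistent across environments when you template your Matillion configuration.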
Checking network connectivity
To enable Matillion Data Productivity Cloud to connect to your Apache Kylin database, you must ensure that the database allows incoming network connections based on your deployment type:
- **Full SaaS agent configuration:** You must allow incoming connections to your Apache Kylin database from the specific IP addresses used by Matillion’s Full SaaS agents. The full list of required IP addresses is provided here: Matillion SaaS IP addresses. Ensure these IP addresses are whitelisted in your network or firewall configuration.
- **Hybrid SaaS deployment:** For hybrid configurations, your Apache Kylin instance should permit incoming connections from the network associated with your own Virtual Private Cloud (VPC). For assistance identifying your network settings or verifying connections, you can use the utilities available at: Matillion Network Access Checker.
Additionally, if your Apache Kylin database is referenced using a DNS hostname (rather than a direct IP address), the Matillion Full SaaS or Hybrid SaaS agent must be able to resolve this DNS address. You may need to ensure appropriate DNS records are available and reachable from the Matillion environment.
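Before debugging inside Matillion, it is often quicker to confirm DNS resolution and TCP reachability from a host on the same network as the agent. This minimal stdlib sketch performs both checks; it is a diagnostic aid, not part of the Matillion product.

```python
import socket

def check_endpoint(host, port, timeout=5.0):
    """Resolve the hostname, then attempt a TCP connection to host:port.
    Returns (resolved_ip, reachable). A failed resolution raises
    socket.gaierror, mirroring what the agent would hit with a bad DNS record."""
    ip = socket.gethostbyname(host)  # DNS resolution step
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False
```

Run it with your Kylin host and port (commonly 7070 for the REST/query endpoint, though your deployment may differ): `check_endpoint("<KYLIN_HOST>", 7070)`. A `gaierror` points to DNS, while `(ip, False)` points to firewalling or the service being down.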
Querying Data from Apache Kylin
This guide details how to query data from Apache Kylin using SQL and discusses integration patterns, especially in the context of loading data into systems like Snowflake.
Example Queries in Apache Kylin SQL
Apache Kylin exposes cube data via an SQL interface, similar to other analytic databases. Examples:
Basic Query:
SELECT country, SUM(sales_amount) AS total_sales FROM sales_data GROUP BY country ORDER BY total_sales DESC;
Query with Filters:
SELECT product_category, AVG(order_amount) AS average_order FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01' GROUP BY product_category;
Incremental Load - Using a Filter Suppose you want only the rows updated after a certain timestamp:
SELECT * FROM sales_data WHERE last_updated >= '2024-06-01 00:00:00';
Datatype Conversion Notes (Kylin → Snowflake)
When querying Kylin for transfer to databases like Snowflake (often via ETL tools), note that datatype conversions may be necessary, because supported data types can differ. For example:
| Apache Kylin Type | Closest Snowflake Type |
|---|---|
| `BIGINT` | `NUMBER(38,0)` |
| `VARCHAR` | `STRING` |
| `DATE` | `DATE` |
| `TIMESTAMP` | `TIMESTAMP_NTZ` |
| `DOUBLE` | `FLOAT` |
Check your ETL tool's documentation about field mappings and consider explicit type casting in queries if needed.
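Where explicit casting is needed, the mapping table above can be captured as data and used to generate `CAST` expressions for your extraction query. This is a minimal sketch; the mapping covers only the types listed in the table, so extend it for the types your cubes actually use.

```python
# Illustrative mapping taken from the table above; extend as needed.
KYLIN_TO_SNOWFLAKE = {
    "BIGINT": "NUMBER(38,0)",
    "VARCHAR": "STRING",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP_NTZ",
    "DOUBLE": "FLOAT",
}

def cast_expression(column, kylin_type):
    """Emit a Snowflake CAST for one column, falling back to the bare
    column name when the type is unmapped or already matches."""
    target = KYLIN_TO_SNOWFLAKE.get(kylin_type.upper())
    if target is None or target == kylin_type.upper():
        return column
    return f"CAST({column} AS {target}) AS {column}"

print(cast_expression("sales_amount", "DOUBLE"))
# CAST(sales_amount AS FLOAT) AS sales_amount
```

Driving casts from a single mapping keeps the conversion rules in one place, so a change to the target schema does not require hunting through hand-written SQL.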
Querying Patterns: Initial and Incremental Loads
Best practice: Use a staged ETL strategy comprising an initial full load followed by periodic incremental loads from Kylin into your destination system (like Snowflake).
- Component: Use the same Database Query component for both load types.
- Read more: Incremental Load / Data Replication Strategy - Matillion Exchange
Initial Load
- Load all available data.
- Query pattern: No WHERE filter.
SELECT * FROM sales_data;
Incremental Load
- Load only new or changed data based on a tracked field (e.g., `last_updated`, `id`, or date).
- Query pattern: Add a WHERE clause to filter results.
SELECT * FROM sales_data WHERE last_updated > '2024-06-01 00:00:00';
- Replace `'2024-06-01 00:00:00'` with your most recent loaded value.
For more detailed strategies, see Matillion's guide on incremental loading.
Data Integration Architecture
Loading data into Snowflake prior to integration exemplifies a “divide and conquer” approach by splitting the process into two distinct steps: loading and then transforming the data as needed. This separation is a core advantage of the ELT (Extract, Load, Transform) architecture, which stands in contrast to traditional ETL methods. Transforming and integrating data effectively requires robust data transformation pipelines, which organize, automate, and orchestrate data cleanup, enrichment, and merging routines. In the ELT model, these pipelines are executed directly within the Snowflake database, leveraging its high-performance processing capabilities. As a result, data transformation and integration become fast, on-demand, and highly scalable, all without the need to purchase or maintain additional data processing infrastructure external to Snowflake.