Extracting data from Apache Kylin to Amazon Redshift
Extracting data from Apache Kylin is a valuable step in integrating big data analytics with cloud-based platforms such as Amazon Redshift. Whether you are consolidating data lakes, building out your analytics pipeline, or seeking scalable cloud storage, establishing a seamless data flow between these systems is essential. This article outlines a practical approach for extracting data from Apache Kylin and loading it into Redshift, leveraging tools such as Matillion where appropriate. We begin by guiding you through the process of creating an appropriate identity in Apache Kylin to authenticate extraction operations securely. For Matillion users, we also review the steps to ensure you have access to the necessary JDBC driver, a critical component for establishing connectivity. Network connectivity between the two platforms, a factor often overlooked but crucial in production environments, is addressed with detailed recommendations. Lastly, we walk through the process of querying data from Kylin, covering both the initial data load and ongoing incremental extraction to keep the data in Redshift up to date. By following this guide, you will be equipped to build a robust, repeatable workflow for moving data efficiently from Apache Kylin to Amazon Redshift.
What is Apache Kylin?
Apache Kylin is a distributed, open-source analytics data warehouse for ultra-fast OLAP queries on large datasets. Built on Hadoop, it uses pre-computed multidimensional cubes and columnar storage to speed up SQL queries over sources like Hive, HBase, or cloud data lakes. Kylin integrates with BI tools via ANSI-SQL and JDBC/ODBC, supporting high concurrency, sub-second latency, and petabyte-scale scalability. Advanced features include intelligent cube building, real-time streaming ingestion, and distributed processing, enabling efficient complex aggregations and drill-down analysis, making it ideal for interactive analytics and business intelligence in big data environments.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse built on PostgreSQL, enabling fast SQL querying and advanced analytics on large datasets. Its Massively Parallel Processing (MPP) architecture distributes complex queries across multiple nodes, boosting performance for high-volume workloads. Redshift supports columnar storage, data compression, automatic backups, and federated queries with S3 and AWS Lake Formation integration. Security features include VPC isolation, encryption at rest/in transit, and IAM integration. Seamlessly integrating with various BI and ETL tools, Redshift offers scalable, high-performance, and secure cloud data warehousing for enterprise needs.
Why Move Data from Apache Kylin into Amazon Redshift
Optimizing Analytics: Reasons for Transferring Data from Apache Kylin to Redshift
A data engineer or architect may want to copy data from Apache Kylin into Amazon Redshift for several strategic reasons. Firstly, the data residing in Apache Kylin is often the result of complex aggregation and modeling, making it highly valuable for further analysis. However, the true potential of this dataset emerges when it can be integrated and cross-referenced with data from a variety of other sources, something that Redshift—being a fully managed, scalable cloud data warehouse—excels at. By transferring data into Redshift, engineers can perform comprehensive analytics and business intelligence activities without imposing additional query loads on the Apache Kylin environment, thereby preserving its performance for core OLAP tasks. This architecture allows organizations to maximize the value of their data assets while maintaining efficiency and scalability in their analytics infrastructure.
Creating a User in Apache Kylin
This guide details the steps for creating a user (identity) in Apache Kylin. It is intended for engineers and administrators responsible for managing Kylin's access controls.
Note: Apache Kylin manages user identities at the application layer, not through direct SQL commands to its underlying storage system. User and role management is typically performed via the Kylin web UI or Kylin REST APIs.
Prerequisites
- Ensure you have `ADMIN` privileges in Kylin.
- Confirm the Kylin server is running.
Method 1: Using the Kylin Web UI
- Log in to the Kylin web interface.
- Navigate to System → Security → User Management.
- Click New User.
- Enter Username, Password, and (optionally) assign one or more roles (e.g., `ADMIN`, `MODELER`, or `ANALYST`).
- Click Save.
Method 2: Using the Kylin REST API
You can also automate user creation with a REST API call.
1. Prepare the API Request
The following is an example `cURL` command to create a new user:
```bash
curl -X POST "http://<KYLIN_HOST>:<PORT>/kylin/api/user" \
  -H "Content-Type: application/json" \
  -u ADMIN_USERNAME:ADMIN_PASSWORD \
  -d '{
    "username": "newuser",
    "password": "UserPass123!",
    "authorities": ["ROLE_ANALYST"]
  }'
```
Replace:
- `<KYLIN_HOST>` and `<PORT>` with your Kylin instance information.
- `ADMIN_USERNAME` and `ADMIN_PASSWORD` with your admin credentials.
- `newuser`, `UserPass123!`, and the roles as appropriate.
2. Example JSON Payload
```json
{
  "username": "newuser",
  "password": "UserPass123!",
  "authorities": ["ROLE_ANALYST"]
}
```
Possible roles include:
- `ROLE_ADMIN`
- `ROLE_MODELER`
- `ROLE_ANALYST`
Method 3: Scripting with Python (Optional Example)
Here is a sample Python script for creating a user (fill in the placeholders with your own values):

```python
import requests

admin_user = 'ADMIN_USERNAME'
admin_password = 'ADMIN_PASSWORD'
host = 'http://<KYLIN_HOST>:<PORT>'

user_data = {
    'username': 'newuser',
    'password': 'UserPass123!',
    'authorities': ['ROLE_ANALYST']
}

# Create the user via Kylin's REST API, authenticating as an admin
response = requests.post(
    f"{host}/kylin/api/user",
    json=user_data,
    auth=(admin_user, admin_password)
)
print(response.status_code, response.text)
```
Additional Notes
- Assign multiple roles as required via the `authorities` array.
- If Kylin is configured with an external authentication system such as LDAP or SSO, provision users according to your organization’s process, as direct creation may not be permitted.
- Users created via these methods will be managed internally by Kylin.
For more information, see the official Apache Kylin documentation.
Installing the JDBC driver
At the time of writing, the JDBC driver for Apache Kylin is not bundled within Matillion Data Productivity Cloud by default. This is primarily due to licensing or redistribution restrictions that prevent automatic distribution with the platform. As a result, administrators are responsible for sourcing and installing this driver manually.
To install the Apache Kylin JDBC driver, follow these steps:
1. **Obtain the JDBC Driver**
   - Navigate to the official Apache Kylin download page: https://kylin.apache.org/download/.
   - On this page, locate the section containing client drivers and look specifically for a Type 4 JDBC driver package. Type 4 drivers are preferred as they are platform-independent and connect directly to the database via network protocols.
   - Download the latest compatible JDBC driver JAR file for your environment.
2. **Install the Driver into Matillion Data Productivity Cloud**
   - Access the Matillion Data Productivity Cloud Admin or Agent interface as outlined in the platform documentation.
   - Follow the driver upload instructions available at: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
   - Upload the Apache Kylin JDBC driver JAR file to your Agent or designated driver location as specified in your platform configuration.
3. **Configure and Use the Driver**
   - After the driver has been successfully uploaded and registered, you may proceed to configure database connections or queries utilizing this driver within Matillion Data Productivity Cloud.
   - Usage and configuration guidance can be found here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
   - Ensure that any references to the Kylin JDBC driver align with the driver library and class names provided by your downloaded version.
Always verify you are observing your organization’s security and compliance requirements when handling and installing external JDBC drivers.
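Once the JAR is installed, it can be useful to smoke-test the driver from outside Matillion. The sketch below uses the third-party `jaydebeapi` package and assumes the standard Kylin driver class `org.apache.kylin.jdbc.Driver` and a `jdbc:kylin://<host>:<port>/<project>` URL; verify both against the driver version you downloaded. The host, port, project, credentials, and JAR path are placeholders.

```python
# Hedged sketch: query Kylin over JDBC from Python via jaydebeapi
# (pip install jaydebeapi). Driver class and URL format assume the
# standard Apache Kylin JDBC driver; confirm against your JAR.

def kylin_jdbc_url(host: str, port: int, project: str) -> str:
    """Build a Kylin JDBC URL of the form jdbc:kylin://host:port/project."""
    return f"jdbc:kylin://{host}:{port}/{project}"

def query_kylin(host, port, project, user, password, jar_path, sql):
    """Run one SQL statement against Kylin and return all rows."""
    import jaydebeapi  # imported here so the URL helper stays dependency-free
    conn = jaydebeapi.connect(
        "org.apache.kylin.jdbc.Driver",          # class inside the Kylin JAR
        kylin_jdbc_url(host, port, project),
        [user, password],
        jar_path,
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        rows = cur.fetchall()
        cur.close()
        return rows
    finally:
        conn.close()
```

A hypothetical call would look like `query_kylin("kylin.example.com", 7070, "my_project", "ANALYST_USER", "secret", "/drivers/kylin-jdbc.jar", "SELECT 1")`.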
Checking network connectivity
To establish connectivity between Matillion Data Productivity Cloud and an Apache Kylin database, you must ensure that your network and security group settings allow incoming connections from the appropriate source, depending on your deployment configuration:
- **Full SaaS agent configuration:** Your Apache Kylin database must allow incoming connections from the IP addresses provided by Matillion. You can find the most up-to-date list of these IP addresses here: Matillion Allowed IP Addresses.
- **Hybrid SaaS deployment:** In this scenario, you should permit incoming connections from your own virtual private cloud (VPC) where the Matillion Hybrid SaaS agent resides. To help confirm that network connectivity is properly configured from your VPC to Apache Kylin, you can use helpful utilities available at: Network Access Check Utility.
Additionally, if your Apache Kylin database is referenced using a DNS hostname rather than a static IP address, you must ensure that the Matillion agent (either Full SaaS or Hybrid SaaS) can resolve the DNS name. This may require allowing DNS traffic and ensuring your DNS records are properly configured and accessible from the agent’s network environment.
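Both checks (DNS resolution and the TCP connection itself) can be scripted from the agent's network environment. A minimal sketch, where the hostname and port are placeholders for your Kylin instance:

```python
# Rough connectivity probe: can this machine resolve the Kylin hostname
# and open a TCP connection to the Kylin port?
import socket

def check_kylin_reachable(host: str, port: int, timeout: float = 5.0):
    """Return (dns_ok, tcp_ok) for the given host and port."""
    try:
        socket.getaddrinfo(host, port)   # DNS resolution
    except socket.gaierror:
        return False, False
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, True            # port reachable
    except OSError:
        return True, False               # resolved, but no TCP connection
```

If `dns_ok` is `False`, fix DNS records or allow DNS traffic first; if only `tcp_ok` is `False`, review security groups and firewall rules between the agent and Kylin.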
Querying Data from Apache Kylin
This guide explains how to query data from an Apache Kylin database using SQL, as well as strategies for initial and incremental data loads, especially in ETL tools like Matillion. Guidance is included for those integrating with Amazon Redshift.
Example Apache Kylin SQL Queries
Apache Kylin supports a subset of ANSI SQL. Here are common query examples:
1. Select Specific Columns
```sql
SELECT customer_id, total_amount, sales_date
FROM sales_cube
WHERE sales_date >= '2024-01-01'
ORDER BY sales_date DESC;
```
2. Aggregate Data Example
```sql
SELECT region, SUM(total_amount) AS total_sales
FROM sales_cube
WHERE sales_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY region
ORDER BY total_sales DESC;
```
3. Filter Data with Conditions
```sql
SELECT product_id, COUNT(*) AS cnt
FROM sales_cube
WHERE product_category = 'Electronics'
  AND sales_date >= '2024-06-01'
GROUP BY product_id;
```
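Besides JDBC, queries like these can be submitted programmatically through Kylin's REST query endpoint (`POST /kylin/api/query`). The sketch below follows that endpoint's documented request shape; the host, project, and credentials are placeholders, and you should confirm the endpoint details against your Kylin version.

```python
# Hedged sketch: run SQL via Kylin's REST query endpoint instead of JDBC.

def build_query_payload(sql: str, project: str, limit: int = 1000) -> dict:
    """Request body for Kylin's /kylin/api/query endpoint."""
    return {"sql": sql, "project": project, "limit": limit}

def query_kylin_rest(host, project, sql, user, password, limit=1000):
    """POST the query with HTTP basic auth and return the parsed JSON."""
    import requests  # third-party: pip install requests
    resp = requests.post(
        f"{host}/kylin/api/query",
        json=build_query_payload(sql, project, limit),
        auth=(user, password),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()  # contains column metadata and result rows
```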
Datatype Conversion: Kylin ↔ Redshift
When querying Kylin for downstream ingestion into Amazon Redshift, datatype conversion may be necessary.
| Apache Kylin | Amazon Redshift | Notes |
|---|---|---|
| STRING | VARCHAR | |
| INTEGER | INTEGER | |
| BIGINT | BIGINT | |
| BOOLEAN | BOOLEAN | |
| DATE | DATE | |
| TIMESTAMP | TIMESTAMP | |
| DECIMAL/DOUBLE | FLOAT8/NUMERIC | May need explicit cast in SELECT |
Avoid subtle pitfalls: for example, when Apache Kylin returns a BIGINT but the Redshift column expects an INTEGER, apply an explicit cast in your Kylin SQL:

```sql
SELECT CAST(id AS INTEGER) AS id FROM example_table;
```
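The mapping table above can be codified so your extraction queries generate casts automatically. A small illustrative sketch (the table and column names are hypothetical, and the mapping mirrors the table above):

```python
# Generate a Kylin SELECT list that casts columns to Redshift-compatible
# types, casting only where the Kylin and Redshift type names differ.
KYLIN_TO_REDSHIFT = {
    "STRING": "VARCHAR",
    "INTEGER": "INTEGER",
    "BIGINT": "BIGINT",
    "BOOLEAN": "BOOLEAN",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
    "DOUBLE": "FLOAT8",
    "DECIMAL": "NUMERIC",
}

def select_with_casts(table: str, columns: dict) -> str:
    """columns maps column name -> Kylin type; cast only on mismatch."""
    parts = []
    for name, kylin_type in columns.items():
        target = KYLIN_TO_REDSHIFT.get(kylin_type.upper(), "VARCHAR")
        if target == kylin_type.upper():
            parts.append(name)
        else:
            parts.append(f"CAST({name} AS {target}) AS {name}")
    return f"SELECT {', '.join(parts)} FROM {table}"
```

For example, `select_with_casts("sales_cube", {"id": "BIGINT", "amount": "DOUBLE"})` produces `SELECT id, CAST(amount AS FLOAT8) AS amount FROM sales_cube`.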
Initial Load vs. Incremental Load Pattern
A robust pattern for extracting data from Apache Kylin (especially into Redshift or other warehouses) is to:
- Perform a once-off initial load
- Follow it with ongoing incremental loads
For both, utilize the same Database Query (or equivalent ETL) component, only changing the filter clause.
Initial Load
- No filter clause (extracts all data)
- Example:
SELECT * FROM sales_cube;
Incremental Load
- Add a filter clause based on a high-watermark column (such as a timestamp or primary key).
- Example: if you're tracking new records by `last_updated_at`:

```sql
SELECT * FROM sales_cube WHERE last_updated_at > '${last_extracted_time}';
```

In Matillion or other ETL jobs, `${last_extracted_time}` would be parameterized.
See further details in the Matillion Exchange article on incremental load strategy.
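The initial/incremental pattern above amounts to toggling a single filter clause. A minimal sketch (the table and column names mirror the examples above; in a real pipeline, prefer your tool's bound parameters over string interpolation):

```python
# Build the extraction SQL for both load patterns: a full scan for the
# initial load, and a high-watermark filter for incremental loads.
from typing import Optional

def build_extract_sql(table: str,
                      watermark_column: Optional[str] = None,
                      last_extracted: Optional[str] = None) -> str:
    """Initial load when no watermark is supplied; incremental otherwise."""
    sql = f"SELECT * FROM {table}"
    if watermark_column and last_extracted is not None:
        sql += f" WHERE {watermark_column} > '{last_extracted}'"
    return sql
```

Calling `build_extract_sql("sales_cube")` yields the initial-load query, while `build_extract_sql("sales_cube", "last_updated_at", "2024-06-01")` yields the incremental variant.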
Best Practices
- **Database Query Component:** Use the same logic and transformation steps for both loads; only change the filter clause in the query.
- **Consistency:** After the initial load, always keep the `WHERE` clause in sync with your change-tracking/high-watermark logic.
- **Type Safety:** Explicitly cast types in Kylin queries to match Redshift schemas where mismatches are observed.
For more advanced modeling or further query optimization, refer to the Apache Kylin Documentation.
Data Integration Architecture
Loading data into Amazon Redshift in advance of integration is a hallmark of the ELT (Extract, Load, Transform) architecture, where the process is intentionally divided into two distinct stages: first loading the raw data, and then performing the necessary transformations within the database. This approach allows organizations to "divide and conquer" by decoupling the data loading and integration phases, simplifying management and improving overall efficiency. Data integration itself often relies on transformations—tasks such as joining tables, aggregating data, or harmonizing disparate formats—which are most effectively carried out through dedicated data transformation pipelines. One of the principal advantages of the ELT architecture is that all transformation and integration steps occur inside the Redshift environment. As a result, transformations run quickly, execute on-demand, and scale with Redshift's compute resources. Moreover, organizations benefit from not having to maintain separate data processing infrastructure, since all computational tasks are handled natively in the data warehouse, offering both performance and cost advantages.