Integrate data from Apache Kylin to Amazon Redshift using Matillion

Our Apache Kylin to Redshift connector streamlines data transfer to Redshift within minutes, ensuring your data remains current without the need for manual coding or complex ETL scripts.


Extracting data from Apache Kylin to Amazon Redshift

Extracting data from Apache Kylin is a valuable step in integrating big data analytics with cloud-based platforms such as Amazon Redshift. Whether you are consolidating data lakes, building out your analytics pipeline, or seeking scalable cloud storage, establishing a seamless data flow between these systems is essential. This article outlines a practical approach for extracting data from Apache Kylin and loading it into Redshift, leveraging tools such as Matillion where appropriate. We will begin by guiding you through the process of creating an appropriate identity in Apache Kylin to authenticate extraction operations securely. For Matillion users, we will also review the steps to ensure you have access to the necessary JDBC driver, a critical component for establishing connectivity. Network connectivity between both platforms—a factor often overlooked but crucial in production environments—will be addressed with detailed recommendations. Lastly, we will walk through the processes of querying data from Kylin: covering both initial data loads and setting up for ongoing, incremental extraction to ensure data remains up to date in Redshift. By following this guide, you will be equipped to build a robust and repeatable workflow for moving data efficiently from Apache Kylin to Amazon Redshift.


What is Apache Kylin?

Apache Kylin is a distributed, open-source analytics data warehouse for ultra-fast OLAP queries on large datasets. Built on Hadoop, it uses pre-computed multidimensional cubes and columnar storage to speed up SQL queries over sources like Hive, HBase, or cloud data lakes. Kylin integrates with BI tools via ANSI-SQL and JDBC/ODBC, supporting high concurrency, sub-second latency, and petabyte-scale scalability. Advanced features include intelligent cube building, real-time streaming ingestion, and distributed processing, enabling efficient complex aggregations and drill-down analysis, making it ideal for interactive analytics and business intelligence in big data environments.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse built on PostgreSQL, enabling fast SQL querying and advanced analytics on large datasets. Its Massively Parallel Processing (MPP) architecture distributes complex queries across multiple nodes, boosting performance for high-volume workloads. Redshift supports columnar storage, data compression, automatic backups, and federated queries with S3 and AWS Lake Formation integration. Security features include VPC isolation, encryption at rest/in transit, and IAM integration. Seamlessly integrating with various BI and ETL tools, Redshift offers scalable, high-performance, and secure cloud data warehousing for enterprise needs.

Why Move Data from Apache Kylin into Amazon Redshift

Optimizing Analytics: Reasons for Transferring Data from Apache Kylin to Redshift

A data engineer or architect may want to copy data from Apache Kylin into Amazon Redshift for several strategic reasons. Firstly, the data residing in Apache Kylin is often the result of complex aggregation and modeling, making it highly valuable for further analysis. However, the true potential of this dataset emerges when it can be integrated and cross-referenced with data from a variety of other sources, something that Redshift—being a fully managed, scalable cloud data warehouse—excels at. By transferring data into Redshift, engineers can perform comprehensive analytics and business intelligence activities without imposing additional query loads on the Apache Kylin environment, thereby preserving its performance for core OLAP tasks. This architecture allows organizations to maximize the value of their data assets while maintaining efficiency and scalability in their analytics infrastructure.

Creating a User in Apache Kylin

This guide details the steps for creating a user (identity) in Apache Kylin. It is intended for engineers and administrators responsible for managing Kylin's access controls.

Note: Apache Kylin manages user identities at the application layer, not through direct SQL commands to its underlying storage system. User and role management is typically performed via the Kylin web UI or Kylin REST APIs.

Prerequisites

  • Ensure you have ADMIN privileges in Kylin.
  • Confirm the Kylin server is running.

Method 1: Using the Kylin Web UI

  1. Log in to the Kylin web interface.
  2. Navigate to System > Security > User Management.
  3. Click New User.
  4. Enter Username, Password, and (optionally) assign one or more roles (e.g., ADMIN, MODELER, or ANALYST).
  5. Click Save.

Method 2: Using the Kylin REST API

You can also automate user creation with a REST API call.

1. Prepare the API Request

The following is an example cURL command to create a new user:

bash
curl -X POST "http://<KYLIN_HOST>:<PORT>/kylin/api/user" \
     -H "Content-Type: application/json" \
     -u ADMIN_USERNAME:ADMIN_PASSWORD \
     -d '{
           "username": "newuser",
           "password": "UserPass123!",
           "authorities": ["ROLE_ANALYST"]
         }'

Replace:

  • <KYLIN_HOST> and <PORT> with your Kylin instance information.
  • ADMIN_USERNAME and ADMIN_PASSWORD with your admin credentials.
  • newuser, UserPass123!, and the roles as appropriate.

2. Example JSON Payload

json
{
  "username": "newuser",
  "password": "UserPass123!",
  "authorities": ["ROLE_ANALYST"]
}

Possible roles include:

  • ROLE_ADMIN
  • ROLE_MODELER
  • ROLE_ANALYST

Method 3: Scripting with Python (Optional Example)

Here is a sample Python script for creating a user:

```python
import requests

admin_user = 'ADMIN_USERNAME'
admin_password = 'ADMIN_PASSWORD'
host = 'http://<KYLIN_HOST>:<PORT>'

user_data = {
    'username': 'newuser',
    'password': 'UserPass123!',
    'authorities': ['ROLE_ANALYST']
}

response = requests.post(
    f"{host}/kylin/api/user",
    json=user_data,
    auth=(admin_user, admin_password)
)
print(response.status_code, response.text)
```


Additional Notes

  • Assign multiple roles as required via the authorities array.
  • If Kylin is configured with an external authentication system such as LDAP or SSO, provision users according to your organization’s process, as direct creation may not be permitted.
  • Users created via these methods will be managed internally by Kylin.

For more information, see the official Apache Kylin documentation.

Installing the JDBC driver

At the time of writing, the JDBC driver for Apache Kylin is not bundled within Matillion Data Productivity Cloud by default. This is primarily due to licensing or redistribution restrictions that prevent automatic distribution with the platform. As a result, administrators are responsible for sourcing and installing this driver manually.

To install the Apache Kylin JDBC driver, follow these steps:

  1. Obtain the JDBC Driver
     • Navigate to the official Apache Kylin download page: https://kylin.apache.org/download/.
     • On this page, locate the section containing client drivers and look specifically for a Type 4 JDBC driver package. Type 4 drivers are preferred as they are platform-independent and connect directly to the database via network protocols.
     • Download the latest compatible JDBC driver JAR file for your environment.

  2. Install the Driver into Matillion Data Productivity Cloud
     • Access the Matillion Data Productivity Cloud Admin or Agent interface as outlined in the platform documentation.
     • Follow the driver upload instructions available at: https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/
     • Upload the Apache Kylin JDBC driver JAR file to your Agent or designated driver location as specified in your platform configuration.

  3. Configure and Use the Driver
     • After the driver has been successfully uploaded and registered, you may configure database connections or queries that use this driver within Matillion Data Productivity Cloud.
     • Usage and configuration guidance can be found here: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/
     • Ensure that any references to the Kylin JDBC driver align with the driver library and class names provided by your downloaded version.
Always verify you are observing your organization’s security and compliance requirements when handling and installing external JDBC drivers.
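Before uploading, it can also be worth sanity-checking that the downloaded JAR actually contains the Kylin JDBC driver class (conventionally org.apache.kylin.jdbc.Driver; confirm the exact class name against your driver version's documentation). A minimal sketch:

```python
import zipfile

def jar_contains_class(jar_path, class_name):
    """Return True if the JAR (a zip archive) contains the given class entry."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# Example usage (the JAR filename is hypothetical):
# jar_contains_class("kylin-jdbc-5.0.0.jar", "org.apache.kylin.jdbc.Driver")
```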

Checking network connectivity

To establish connectivity between Matillion Data Productivity Cloud and an Apache Kylin database, you must ensure that your network and security group settings allow incoming connections from the appropriate source, depending on your deployment configuration:

  • Full SaaS agent configuration:
    Your Apache Kylin database must allow incoming connections from the IP addresses provided by Matillion. You can find the most up-to-date list of these IP addresses here: Matillion Allowed IP Addresses.

  • Hybrid SaaS deployment:
    In this scenario, you should permit incoming connections from your own virtual private cloud (VPC) where the Matillion Hybrid SaaS agent resides. To help confirm that network connectivity is properly configured from your VPC to Apache Kylin, you can use helpful utilities available at: Network Access Check Utility.

Additionally, if your Apache Kylin database is referenced using a DNS hostname rather than a static IP address, you must ensure that the Matillion agent (either Full SaaS or Hybrid SaaS) can resolve the DNS name. This may require allowing DNS traffic and ensuring your DNS records are properly configured and accessible from the agent’s network environment.
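Both checks (DNS resolution and port reachability) can be scripted from the agent's environment. The sketch below is a generic connectivity probe; the host and port values are placeholders to replace with your Kylin instance details:

```python
import socket

def check_kylin_reachable(host, port, timeout=5.0):
    """Resolve the hostname via DNS, then attempt a TCP connection to the port."""
    try:
        addresses = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return False, f"DNS resolution failed: {exc}"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"Connected to {host}:{port} (resolved {addresses[0][4][0]})"
    except OSError as exc:
        return False, f"TCP connection failed: {exc}"

# Example (replace with your Kylin host and port):
# ok, detail = check_kylin_reachable("kylin.example.internal", 7070)
```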

Querying Data from Apache Kylin

This guide explains how to query data from an Apache Kylin database using SQL, as well as strategies for initial and incremental data loads, especially in ETL tools like Matillion. Guidance is included for those integrating with Amazon Redshift.


Example Apache Kylin SQL Queries

Apache Kylin supports a subset of ANSI SQL. Here are common query examples:

1. Select Specific Columns

SELECT customer_id, total_amount, sales_date
FROM sales_cube
WHERE sales_date >= '2024-01-01'
ORDER BY sales_date DESC;

2. Aggregate Data Example

SELECT region, SUM(total_amount) AS total_sales
FROM sales_cube
WHERE sales_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY region
ORDER BY total_sales DESC;

3. Filter Data with Conditions

SELECT product_id, COUNT(*) AS cnt
FROM sales_cube
WHERE product_category = 'Electronics'
  AND sales_date >= '2024-06-01'
GROUP BY product_id;


Datatype Conversion: Kylin ↔ Redshift

When querying Kylin for downstream ingestion into Amazon Redshift, datatype conversion may be necessary.

  Apache Kylin     Amazon Redshift   Notes
  ---------------  ----------------  --------------------------------
  STRING           VARCHAR
  INTEGER          INTEGER
  BIGINT           BIGINT
  BOOLEAN          BOOLEAN
  DATE             DATE
  TIMESTAMP        TIMESTAMP
  DECIMAL/DOUBLE   FLOAT8/NUMERIC    May need explicit cast in SELECT

Watch for subtle pitfalls: for example, when Apache Kylin returns a BIGINT but Redshift expects an INTEGER, apply an explicit cast in your Kylin SQL:

SELECT CAST(id AS INTEGER) AS id
FROM example_table
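When generating Redshift DDL or validating extracted columns programmatically, the mapping above can be encoded directly. A minimal sketch (the dictionary mirrors the table above; extend it for your schemas):

```python
# Kylin source type -> Redshift target type (mirrors the mapping table above)
KYLIN_TO_REDSHIFT = {
    "STRING": "VARCHAR",
    "INTEGER": "INTEGER",
    "BIGINT": "BIGINT",
    "BOOLEAN": "BOOLEAN",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
    "DECIMAL": "NUMERIC",
    "DOUBLE": "FLOAT8",
}

def redshift_type(kylin_type):
    """Map a Kylin column type to its Redshift equivalent.

    Raises KeyError for unmapped types so mismatches surface early.
    """
    return KYLIN_TO_REDSHIFT[kylin_type.upper()]

print(redshift_type("double"))  # FLOAT8
```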


Initial Load vs. Incremental Load Pattern

A robust pattern for extracting data from Apache Kylin (especially into Redshift or other warehouses) is to:

  1. Perform a once-off initial load
  2. Follow it with ongoing incremental loads

For both, utilize the same Database Query (or equivalent ETL) component, only changing the filter clause.

Initial Load

  • No filter clause (extracts all data)
  • Example:
    SELECT * FROM sales_cube;

Incremental Load

  • Add a filter clause based on a high-watermark column (such as a timestamp or primary key).
  • Example: if you're tracking new records by last_updated_at:

    SELECT *
      FROM sales_cube
      WHERE last_updated_at > '${last_extracted_time}';

    In Matillion or other ETL jobs, ${last_extracted_time} would be parameterized.

See further details in the Matillion Exchange article on incremental load strategy.
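Outside of Matillion, the same high-watermark pattern can be sketched in plain Python. The example below substitutes the watermark into the query and derives the next watermark from the extracted rows; rows_from_kylin is a stand-in for whatever client (JDBC, REST) you use to execute SQL against Kylin, and the column names are the hypothetical ones used above:

```python
INCREMENTAL_SQL = (
    "SELECT * FROM sales_cube "
    "WHERE last_updated_at > TIMESTAMP '{watermark}' "
    "ORDER BY last_updated_at"
)

def extract_increment(rows_from_kylin, watermark):
    """Run one incremental extraction cycle.

    rows_from_kylin: callable taking a SQL string and returning rows as dicts.
    Returns the extracted rows and the new high-water mark; if no rows were
    returned, the previous watermark is kept.
    """
    sql = INCREMENTAL_SQL.format(watermark=watermark)
    rows = rows_from_kylin(sql)
    new_watermark = max((r["last_updated_at"] for r in rows), default=watermark)
    return rows, new_watermark
```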


Best Practices

  • Database Query Component:
    Use the same logic and transformation steps for both loads, changing only the filter clause in the query.

  • Consistency:
    After the initial load, always keep the WHERE clause in sync with your change-tracking/high-watermark logic.

  • Type Safety:
    Explicitly cast types in Kylin queries to match Redshift schemas where mismatches are observed.


For more advanced modeling or further query optimization, refer to the Apache Kylin Documentation.

Data Integration Architecture

Loading data into Amazon Redshift in advance of integration is a hallmark of the ELT (Extract, Load, Transform) architecture, where the process is intentionally divided into two distinct stages: first loading the raw data, and then performing the necessary transformations within the database. This approach allows organizations to "divide and conquer" by decoupling the data loading and integration phases, simplifying management and improving overall efficiency. Data integration itself often relies on transformations—tasks such as joining tables, aggregating data, or harmonizing disparate formats—which are most effectively carried out through dedicated data transformation pipelines. One of the principal advantages of the ELT architecture is that all transformation and integration steps occur inside the Redshift environment. As a result, transformations run quickly, execute on-demand, and scale with Redshift's compute resources. Moreover, organizations benefit from not having to maintain separate data processing infrastructure, since all computational tasks are handled natively in the data warehouse, offering both performance and cost advantages.
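As a concrete illustration of the two ELT stages, the sketch below composes the kind of statements involved: a COPY to load raw data from S3 into a staging table, followed by a transformation that runs entirely inside Redshift. All table, bucket, and role names are hypothetical, and the transform query reuses the sales_cube aggregation from earlier:

```python
def build_elt_statements(staging_table, target_table, s3_uri, iam_role):
    """Compose the two ELT stages as Redshift SQL statements.

    Stage 1 loads the raw extract; stage 2 transforms it inside Redshift.
    """
    load_sql = (
        f"COPY {staging_table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV;"
    )
    transform_sql = (
        f"CREATE TABLE {target_table} AS "
        f"SELECT region, SUM(total_amount) AS total_sales "
        f"FROM {staging_table} GROUP BY region;"
    )
    return load_sql, transform_sql

load_sql, transform_sql = build_elt_statements(
    "stg_sales_cube", "sales_by_region",
    "s3://my-bucket/kylin-export/",
    "arn:aws:iam::123456789012:role/redshift-load",
)
print(load_sql)
print(transform_sql)
```

Each statement would then be submitted to Redshift by your orchestration tool; the point of the split is that the second statement scales with Redshift's compute rather than with any external processing infrastructure.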

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.