Integrate data from GridGain to Amazon Redshift using Matillion

Our GridGain to Redshift connector enables seamless, timely data transfer to Redshift, eliminating the need for manual programming or complex ETL script management.


Extracting data from GridGain to Amazon Redshift

Extracting data from GridGain is a crucial step for teams seeking to leverage high-performance, in-memory data in downstream analytics platforms such as Amazon Redshift. Moving data efficiently and reliably requires careful attention to both the source and the target environments, as well as to the tools facilitating the transfer. In this article, we will walk through the process of moving data from GridGain to Redshift. We begin by demonstrating how to create an appropriate identity in GridGain to ensure secure and auditable data access. For those leveraging Matillion for ETL, we address how to check for, or acquire, the necessary JDBC driver to connect to GridGain. Next, we will discuss important considerations regarding network connectivity, ensuring data can flow smoothly from source to target. Finally, we will examine methods for querying data from GridGain, covering both the initial extraction and strategies for ongoing incremental loads. By following these steps, you will be able to establish a robust, repeatable process for loading key data from GridGain into Redshift, maximizing the value of your data assets.


What is GridGain?

GridGain is a high-performance, in-memory computing platform built on top of Apache Ignite, designed for real-time data processing and analytics at scale. Leveraging a distributed, memory-centric architecture, GridGain offers ACID-compliant transactional support, advanced SQL querying capabilities, and native integration with both disk and cloud storage for persistence and durability. Its architecture enables horizontal scalability, resilience, and low-latency access to petabyte-scale data sets, making it suitable for use cases such as high-frequency trading, fraud detection, and real-time analytics. GridGain also supports comprehensive security features, advanced clustering, and zero-downtime rolling upgrades, making it ideal for mission-critical, enterprise-grade deployments.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the AWS cloud, architected for fast query performance on large datasets. Built upon PostgreSQL, Redshift employs columnar storage, data compression, and zone mapping to optimize storage efficiency and accelerate analytic workloads. It utilizes Massively Parallel Processing (MPP), enabling concurrent execution of complex queries across multiple nodes. Redshift integrates seamlessly with a wide range of AWS services and business intelligence tools, supporting standard SQL and offering features such as automatic scaling, workload management, and failover capabilities. Its support for both classic provisioned clusters and the serverless deployment model allows for flexible adaptation to varying workloads and cost requirements.

Why Move Data from GridGain into Amazon Redshift

Unlocking Analytics: The Case for Copying Data from GridGain to Redshift

A data engineer or architect may choose to copy data from GridGain into Amazon Redshift for several compelling reasons. GridGain’s in-memory data grid can contain high-value, real-time operational data essential for business insights. However, the true value of this data is most often realized when it is integrated with information from other enterprise sources, such as transactional databases, CRM systems, or external data feeds. By copying GridGain data into Redshift, data professionals can leverage Redshift’s powerful analytics and integration capabilities—enabling complex queries, data warehousing, and cross-source analysis. This approach also ensures that the demanding workloads associated with large-scale data integration do not impact GridGain’s operational performance or latency, as data engineers avoid executing resource-intensive processes directly on the operational GridGain cluster. Consequently, the organization benefits from a more flexible, performant, and scalable analytics ecosystem.

Creating a User in GridGain Database

To manage security and access in GridGain, users are created and managed through SQL commands. This document provides step-by-step instructions on how to create a user (identity) in GridGain using SQL scripts.

Prerequisites

  • Make sure authentication is enabled in your GridGain cluster.
  • You must have administrative privileges to create new users.

Steps to Create a User

  1. Connect to the GridGain SQL client
    You can connect to the GridGain cluster using SQLLine, a JDBC client (such as DBeaver or SQuirreL SQL), or the GridGain Web Console.

  2. Use the CREATE USER SQL command

The syntax for creating a new user is as follows:

CREATE USER <username> WITH PASSWORD '<password>';

  • <username>: The desired username.
  • <password>: The password for the new user, enclosed in single quotes.

Example:

CREATE USER analyst1 WITH PASSWORD 'StrongPassw0rd!';

  3. Grant Permissions (Optional)
    By default, new users have no permissions. Grant roles or privileges as necessary:

    • To grant all privileges on a schema or database:

      GRANT ALL ON my_schema.* TO analyst1;

    • To grant select permission on a specific table:

      GRANT SELECT ON my_schema.orders TO analyst1;

Notes

  • Password policies (complexity, expiration) depend on your cluster configuration.
  • Created users are authenticated on all nodes in the cluster.
  • To display existing users:
    SELECT * FROM INFORMATION_SCHEMA.USERS;
  • To delete a user:
    DROP USER analyst1;
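If you provision users as part of automated setup, it helps to generate the CREATE USER statement programmatically rather than hand-editing scripts. The following Python sketch is purely illustrative (it is not part of GridGain's tooling): it validates the username and doubles any single quotes in the password, the standard SQL escape for quotes inside a string literal.

```python
def build_create_user(username: str, password: str) -> str:
    """Build a GridGain CREATE USER statement.

    The username is checked against Python's identifier rules as a
    simple guard; single quotes in the password are doubled, the
    standard SQL escape inside a string literal.
    """
    if not username.isidentifier():
        raise ValueError(f"unsafe username: {username!r}")
    escaped = password.replace("'", "''")
    return f"CREATE USER {username} WITH PASSWORD '{escaped}';"
```

For example, `build_create_user("analyst1", "StrongPassw0rd!")` reproduces the statement shown above, and a password containing a quote is escaped safely.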

Refer to the GridGain Security documentation for additional details and advanced configuration options.

Installing the JDBC Driver

At the time of writing, the GridGain (Apache Ignite) JDBC driver is not bundled by default with Matillion Data Productivity Cloud due to licensing or redistribution restrictions. To use the JDBC integration, you need to manually download and install the appropriate driver (preferably a Type 4 JDBC driver) before connecting Matillion to your GridGain database.

Step 1: Download the GridGain JDBC Driver

  1. Visit the Apache Ignite JDBC Driver download page.
  2. Review the available JDBC driver options, and download the Type 4 JDBC driver. Type 4 drivers are "pure Java" drivers and generally offer easier integration and better portability, making them the preferred choice for most environments.
  3. Save the downloaded driver JAR file to your local machine.

Step 2: Install the JDBC Driver in Matillion Data Productivity Cloud

  1. Follow the official process to upload external JDBC drivers in the agent component of Matillion Data Productivity Cloud. Detailed instructions can be found in the Matillion guide for uploading external drivers.
  2. Ensure the GridGain JDBC driver JAR is uploaded successfully and is available to the agent that will be executing database connections.
  3. After installation, you may need to restart the Matillion agent or refresh the agent’s driver cache so that Matillion can recognize the newly added JDBC driver.

Step 3: Configure and Use the JDBC Connection

  1. With the driver installed, refer to the Matillion Designer's Database Query documentation for step-by-step guidance on configuring a database connection and running queries using the newly uploaded driver.
  2. When prompted for JDBC URL and connection parameters, ensure you supply the correct values corresponding to your GridGain environment.

By following these steps, you can enable JDBC connectivity from Matillion Data Productivity Cloud to your GridGain database.
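The JDBC URL you supply in Matillion follows the Apache Ignite thin-driver format, `jdbc:ignite:thin://host:port[/schema][?param=value&...]`, with 10800 as the default thin-client port. The helper below is a minimal, illustrative sketch for assembling such a URL; the hostname, schema, and parameter in the example are placeholders for your own environment.

```python
from typing import Optional


def gridgain_jdbc_url(host: str, port: int = 10800,
                      schema: Optional[str] = None,
                      **params: str) -> str:
    """Assemble a thin-client JDBC URL for GridGain / Apache Ignite.

    10800 is the default thin-client port; any extra connection
    properties are appended as ?name=value pairs.
    """
    url = f"jdbc:ignite:thin://{host}:{port}"
    if schema:
        url += f"/{schema}"
    if params:
        url += "?" + "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    return url
```

For instance, `gridgain_jdbc_url("gridgain.internal", schema="PUBLIC", distributedJoins="true")` yields a URL ready to paste into the Database Query component's connection settings.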

Checking network connectivity

To establish a successful connection between Matillion Data Productivity Cloud and your GridGain database, you must ensure that your GridGain database allows incoming connections based on your deployment configuration:

  • Full SaaS Agent Configuration: Configure your GridGain database firewall and network security rules to allow incoming connections from the Matillion Data Productivity Cloud IP addresses. You can find the latest list of required IP addresses at this link.

  • Hybrid SaaS Deployment: Make sure your GridGain database allows incoming connections from your own Virtual Private Cloud (VPC). This typically involves enabling network access from the VPC subnets where your Matillion agent is running. To assist with checking your database’s accessibility, you can use tools found on the Matillion Network Access Check Exchange page.

Additionally, if you are referencing your GridGain database using a DNS hostname, you must ensure that the Full SaaS or Hybrid SaaS agent is able to resolve the DNS address correctly. This may require that your DNS settings and any required private DNS zones are visible and accessible to the agent.
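Before debugging driver configuration, it is worth confirming basic TCP reachability from the machine running the agent. This small Python sketch (an illustrative pre-flight check, not a Matillion feature) attempts a TCP connection to the GridGain thin-client port, 10800 by default; a successful connect also confirms DNS resolution of the hostname.

```python
import socket


def port_reachable(host: str, port: int = 10800,
                   timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    A quick pre-flight check that the agent can reach the GridGain
    thin-client port before configuring the JDBC connection.
    Resolving the hostname is part of the check, so a DNS failure
    also shows up as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run this from the agent's network (for a Hybrid SaaS deployment, from inside the VPC) to distinguish firewall or DNS problems from driver misconfiguration.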

Querying Data from GridGain Database

This guide explains how to query data from a GridGain in-memory database, particularly when synchronizing with target systems such as Amazon Redshift. You'll learn how to write SQL queries for both initial and incremental loads, as well as considerations around datatype conversions.

Example SQL SELECT Statements for GridGain

GridGain supports standard ANSI SQL. When querying tables, use the familiar SELECT statement. Here are some examples:

Selecting All Columns:

SELECT * FROM Person;

Selecting Specific Columns:

SELECT id, name, age FROM Person;

Using Filter Conditions:

SELECT * FROM Orders WHERE status = 'SHIPPED' AND order_date >= '2024-01-01';

Aggregations:

SELECT product_id, COUNT(*) AS order_count FROM Orders GROUP BY product_id;

Datatype Conversion Between GridGain and Redshift

When moving data from GridGain to Redshift, be aware of potential datatype conversion issues, especially during ETL operations. For example:

| GridGain SQL Type | Example             | Closest Redshift Type | Notes                    |
| ----------------- | ------------------- | --------------------- | ------------------------ |
| INT               | 123                 | INTEGER               | Direct mapping           |
| VARCHAR           | 'abc'               | VARCHAR               | Make sure length matches |
| DOUBLE            | 1.23                | DOUBLE PRECISION      | Direct mapping           |
| TIMESTAMP         | 2024-01-01 01:02:03 | TIMESTAMP             | Timezone issues possible |

Always verify that your ETL tool or connector handles conversions as expected.
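One way to make the mapping explicit and auditable is to encode it as a small lookup, for example when generating Redshift DDL from GridGain table metadata. The sketch below is a hypothetical starting point mirroring the table above; extend the dictionary for the types actually present in your schema.

```python
# Hypothetical mapping mirroring the conversion table above; extend it
# for the GridGain types present in your own schema.
GRIDGAIN_TO_REDSHIFT = {
    "INT": "INTEGER",
    "BIGINT": "BIGINT",
    "VARCHAR": "VARCHAR",
    "DOUBLE": "DOUBLE PRECISION",
    "TIMESTAMP": "TIMESTAMP",
    "BOOLEAN": "BOOLEAN",
}


def redshift_type(gridgain_type: str) -> str:
    """Look up the closest Redshift type for a GridGain SQL type.

    Raises ValueError for unmapped types so that gaps surface at
    pipeline-build time rather than as silent load failures.
    """
    try:
        return GRIDGAIN_TO_REDSHIFT[gridgain_type.upper()]
    except KeyError:
        raise ValueError(f"no mapping for GridGain type {gridgain_type!r}")
```

Failing loudly on unmapped types is deliberate: a silent fallback (for example, defaulting everything to VARCHAR) tends to hide truncation and timezone problems until they reach downstream reports.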

Best Pattern: Initial vs Incremental Loads

When syncing data, the most robust approach is:

  1. Once-off Initial Load: Load all relevant data from GridGain into Redshift without filters.
  2. Incremental Loads: Subsequently, only load rows from GridGain that have changed or been added since the previous sync.

Both load types can use the same Database Query component.

SQL Example: Initial Load

No filters are applied; all rows are fetched:

SELECT * FROM Orders;

SQL Example: Incremental Load

Apply a filter, e.g., on an updated_at timestamp:

SELECT * FROM Orders WHERE updated_at > '${last_sync_time}';

Replace ${last_sync_time} with the actual timestamp from your last successful sync.
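The initial/incremental pattern can be sketched as a single query builder, where the absence of a stored watermark signals the first run. This Python example is illustrative only; the table and column names are the ones used above, and in practice the watermark would be persisted between runs (for example in a pipeline variable or a control table).

```python
from datetime import datetime, timezone
from typing import Optional


def incremental_query(table: str, watermark_column: str,
                      last_sync_time: Optional[datetime]) -> str:
    """Build the extraction query for a sync run.

    With no watermark (first run) this produces the full initial
    load; afterwards only rows changed since the previous sync are
    selected.
    """
    if last_sync_time is None:
        return f"SELECT * FROM {table};"
    ts = last_sync_time.strftime("%Y-%m-%d %H:%M:%S")
    return (f"SELECT * FROM {table} "
            f"WHERE {watermark_column} > '{ts}';")
```

Calling `incremental_query("Orders", "updated_at", None)` returns the initial-load query; passing the timestamp of the last successful sync returns the filtered incremental query.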

Using the Database Query Component

  • For the initial load, configure your Database Query without a WHERE clause to retrieve all data.
  • For incremental loads, edit the Database Query to add a filter, such as a WHERE condition on an updated_at or last_modified column.

By following this approach, you maximize efficiency and consistency when integrating GridGain with Redshift or any other target data platform.

Data Integration Architecture

Loading data in advance of integration represents a strategic application of the "divide and conquer" principle by clearly separating the data ingestion and transformation stages. This method is a key advantage of the ELT (Extract, Load, Transform) architecture, as it allows raw data to be loaded quickly and then integrated in a controlled, systematic manner. Data integration inevitably requires transformation—data must be cleaned, standardized, and combined from multiple sources—and the most effective way to accomplish this is by leveraging data transformation pipelines. These automated sequences of transformation steps facilitate complex integration logic with efficiency and repeatability. A further benefit of the ELT approach is that both data transformation and integration occur entirely within the target Redshift database environment. This not only provides fast, on-demand, and highly scalable processing but also eliminates the need for separate data processing infrastructure, reducing operational complexity and associated costs.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.