
Integrate data from Google AlloyDB to Amazon Redshift using Matillion

Our Google AlloyDB to Redshift connector streamlines data migration by efficiently transferring and continuously updating your data in Redshift, eliminating the need for manual coding or intricate ETL scripts.


Extracting data from Google AlloyDB to Amazon Redshift

Extracting data from Google AlloyDB is a crucial step for organizations looking to integrate or analyze their data using modern data platforms such as Amazon Redshift. Moving data efficiently and securely requires careful consideration of identity management, connectivity, and data extraction methods.

In this article, we will guide you through the essential steps for a successful data migration or synchronization workflow. Specifically, we will cover how to create the necessary identity in Google AlloyDB to enable authentication and access control. For users leveraging Matillion for ETL operations, we will discuss how to check for the required JDBC driver or acquire it if needed. Establishing reliable network connectivity between AlloyDB and Redshift is critical, and we will outline best practices for configuring and verifying this connection. Finally, we will examine approaches for querying data, from the initial full extraction to setting up incremental data loads to keep Redshift in sync with AlloyDB.

By the end of this article, you will have a clear understanding of the key steps and considerations involved in extracting data from Google AlloyDB and loading it into Redshift, ensuring both accuracy and efficiency throughout the process.


What is Google AlloyDB?

Google AlloyDB is a fully managed, PostgreSQL-compatible database service designed to provide high performance, reliability, and scalability for demanding enterprise workloads. Leveraging Google's cloud infrastructure, AlloyDB enhances PostgreSQL with adaptive algorithms, storage optimizations, and integrated artificial intelligence for tasks such as autopilot indexing and workload forecasting. The service offers automatic backups, seamless high availability through multi-zone replication, and near-zero downtime maintenance, supporting both transactional and analytical processing. AlloyDB is compatible with PostgreSQL tooling and extensions, enabling organizations to migrate existing PostgreSQL applications with minimal changes and benefit from substantial improvements in throughput, latency, and operational efficiency.


What is Amazon Redshift?

Amazon Redshift is a fully managed, cloud-based data warehouse service designed for large-scale data analytics workloads. Built on PostgreSQL, Redshift extends its functionality and performance through columnar storage, advanced compression, and parallel query processing, enabling users to efficiently execute complex queries against petabyte-scale data sets. The service integrates seamlessly with a variety of AWS services, supports standard SQL, and provides robust security features, including network isolation, encryption at rest and in transit, and fine-grained access controls. Optimized for both batch and real-time analytics, Redshift allows organizations to rapidly scale compute and storage resources as needed, making it a preferred platform for business intelligence, reporting, and data science workloads.

Why Move Data from Google AlloyDB into Amazon Redshift

Unlocking Analytics: The Case for Copying Data from Google AlloyDB to Amazon Redshift

A data engineer or architect may find compelling reasons to copy data from Google AlloyDB into Amazon Redshift. Google AlloyDB often stores operational data that is rich in potential value, particularly for analytics and reporting purposes. However, the true utility of this data emerges when it is combined with information from other organizational sources to facilitate more comprehensive analysis. By integrating data from Google AlloyDB into Redshift—a platform specifically optimized for high-performance analytics—they can leverage Redshift’s advanced processing capabilities to generate insights without imposing additional workload or operational overhead on the AlloyDB environment. This strategic approach ensures that transaction processing on AlloyDB remains unencumbered, while Redshift handles the complex and resource-intensive analytical queries.

Creating an Identity in Google AlloyDB

To create a user (i.e., a database role with login privileges) in a Google AlloyDB database, you generally use standard PostgreSQL SQL commands. Google AlloyDB is a fully managed, PostgreSQL-compatible database, so management of users and privileges follows familiar PostgreSQL patterns.

Below are the steps to create a user:


1. Connect to Your AlloyDB Database

You can connect to AlloyDB using a PostgreSQL-compatible client such as `psql`, or through any PostgreSQL GUI tool.

```bash
psql "host=<INSTANCE_IP> port=5432 dbname=<DATABASE_NAME> user=<ADMIN_USER>"
```

Replace the placeholder values with your actual instance IP, database name, and admin user.
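If your pipeline assembles the connection details elsewhere, the same keyword/value string that psql accepts can be built programmatically. A minimal pure-Python sketch, using placeholder values in place of a real AlloyDB instance:

```python
# Sketch: compose a libpq-style keyword/value connection string, as used in
# the psql example above. The host, database, and user values are placeholders
# for your actual AlloyDB instance details.

def conninfo(host: str, dbname: str, user: str, port: int = 5432) -> str:
    """Return a keyword/value connection string accepted by psql and libpq."""
    return f"host={host} port={port} dbname={dbname} user={user}"

print(conninfo("10.0.0.5", "appdb", "admin_user"))
# → host=10.0.0.5 port=5432 dbname=appdb user=admin_user
```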


2. Create a New User

Use the following SQL command to create a new user. In PostgreSQL, users are referred to as "roles". Use the `CREATE ROLE` statement with the `LOGIN` attribute:

```sql
CREATE ROLE myuser WITH LOGIN PASSWORD 'mysecretpassword';
```

- Replace `myuser` with your desired username.
- Replace `'mysecretpassword'` with a secure password.
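If you are scripting role creation, the statement above can be rendered programmatically. A minimal pure-Python sketch, with manual quoting shown for illustration only (in production, prefer a driver-side helper such as psycopg's `sql` module rather than building SQL strings by hand):

```python
# Sketch: render a CREATE ROLE statement for a new AlloyDB login role.
# Manual quoting is illustrative; real deployments should use a driver's
# identifier/literal escaping helpers instead of string formatting.

def create_role_sql(username: str, password: str) -> str:
    """Return a CREATE ROLE statement with the identifier and literal quoted."""
    ident = '"' + username.replace('"', '""') + '"'      # quote the role name
    literal = "'" + password.replace("'", "''") + "'"    # escape the password literal
    return f"CREATE ROLE {ident} WITH LOGIN PASSWORD {literal};"

print(create_role_sql("myuser", "mysecretpassword"))
# → CREATE ROLE "myuser" WITH LOGIN PASSWORD 'mysecretpassword';
```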


3. Grant Privileges (Optional)

By default, the new user does not have any privileges on databases or objects. To allow the user to connect to a specific database:

```sql
GRANT CONNECT ON DATABASE <your_database> TO myuser;
```

If you want the user to have more specific privileges (such as creating tables), you can grant those as needed:

```sql
GRANT CREATE ON DATABASE <your_database> TO myuser;
```

To allow access to specific schemas and tables:

```sql
GRANT USAGE ON SCHEMA public TO myuser;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO myuser;
```


4. (Optional) Change Password for Existing User

If you need to update the password for an existing user:

```sql
ALTER ROLE myuser WITH PASSWORD 'newsecretpassword';
```


Note: Always use sufficiently strong passwords and review privilege assignments carefully to maintain security compliance.

Installing the JDBC Driver

To connect Matillion Data Productivity Cloud to Google AlloyDB, you need to install the required JDBC driver manually. At the time of writing, this JDBC driver is not bundled with Matillion Data Productivity Cloud because of licensing or redistribution restrictions set by the driver provider. The following instructions outline how to obtain and install the appropriate driver for use within your Matillion environment.

1. Downloading the Correct JDBC Driver

Google AlloyDB is compatible with the PostgreSQL JDBC driver. You can obtain the official driver package from the PostgreSQL community site at https://jdbc.postgresql.org/.
When downloading, ensure you select a Type 4 JDBC driver (typically distributed as a .jar file), as this type is platform-independent and does not require any client-side libraries or middleware.

2. Preparing the JDBC Driver for Installation

Once you've downloaded the latest stable Type 4 JDBC driver, it will typically be delivered as a `.jar` file. Keep this file accessible on your workstation or a network location before proceeding to the installation step.

3. Uploading the JDBC Driver to Matillion Data Productivity Cloud

Since the driver is not packaged with Matillion by default, you are required to upload it using your Matillion Agent. For detailed, step-by-step guidance, refer to the uploading external drivers documentation.
The process entails:

  • Accessing your Matillion Agent console.
  • Navigating to the section for uploading external (third-party) JDBC drivers.
  • Selecting and uploading the `.jar` file downloaded in the previous step.
  • Restarting the Agent if necessary to apply the changes.

4. Using the Uploaded JDBC Driver

After successful upload and registration of the driver, it will be available for use in connections and components that interact with databases within Matillion. For instructions on defining data sources, configuring connection properties specific to the AlloyDB/PostgreSQL JDBC driver, and utilizing database-querying components, refer to the database query usage documentation.

Following these steps ensures that your Matillion Data Productivity Cloud instance can communicate securely and efficiently with your Google AlloyDB resources.

Checking network connectivity

To ensure successful integration between Matillion Data Productivity Cloud and your Google AlloyDB database, you must verify that the database is configured to allow incoming network connections from the appropriate sources, depending on your deployment mode:

  • Full SaaS agent configuration:
    Configure your Google AlloyDB database to allow incoming connections from the specific IP addresses used by Matillion’s SaaS platform. The list of current IP addresses can be found here: Matillion IP Addresses.

  • Hybrid SaaS deployment:
    In a Hybrid SaaS setup, connectivity is established from within your own virtual private cloud (VPC). You need to allow incoming connections to your AlloyDB instance from your VPC network. Helpful tools and utilities for checking and troubleshooting network access are available at: Matillion Exchange: Check Network Access.

Additionally, if your Google AlloyDB database is referenced using a DNS hostname (rather than a direct IP address), you must ensure that the Full SaaS or Hybrid SaaS agent is able to successfully resolve that DNS address. Inaccessible or misconfigured DNS can prevent Matillion agents from establishing a connection, even if all firewall and network rules are otherwise correct.
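As a quick pre-flight check, the DNS-resolution requirement above can be verified from the agent host before attempting a JDBC connection. A minimal Python sketch (the hostname below is a stand-in for your AlloyDB endpoint's DNS name):

```python
# Sketch: verify that the host running the Matillion agent can resolve the
# AlloyDB endpoint's DNS name before attempting a JDBC connection on port 5432.
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 5432)) > 0
    except socket.gaierror:
        return False

# "localhost" stands in for your AlloyDB DNS name here
print(can_resolve("localhost"))
```

A `False` result here points to a DNS problem rather than a firewall rule, which narrows troubleshooting considerably.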

Querying Data from Google AlloyDB

Google AlloyDB is a fully managed, PostgreSQL-compatible database service. This guide demonstrates how to query data from AlloyDB using SQL, covers datatype conversion considerations (especially when integrating with Amazon Redshift), and outlines best practices for initial and incremental data loads using a single Database Query component.


Example: SQL SELECT Statements for Google AlloyDB

Below are example SQL queries that you can use with AlloyDB:

```sql
-- Query all rows and columns
SELECT * FROM customers;

-- Select specific columns
SELECT id, first_name, last_name, signup_date FROM customers;

-- Apply a filter clause
SELECT id, email FROM customers WHERE is_active = true;

-- Aggregate data
SELECT country, COUNT(*) AS customer_count
FROM customers
GROUP BY country;
```


Datatype Conversion Considerations (AlloyDB to Redshift)

When replicating data from AlloyDB (PostgreSQL-based) to Amazon Redshift, be aware of datatype conversions. Some key examples:

| AlloyDB Type | Redshift Equivalent |
| --- | --- |
| `SERIAL` | `INTEGER IDENTITY` |
| `VARCHAR` | `VARCHAR` |
| `TIMESTAMP` | `TIMESTAMP` |
| `JSONB` | `VARCHAR` or `SUPER`* |
| `BOOLEAN` | `BOOLEAN` |
| `BYTEA` | `VARBYTE` |

\* `SUPER` is only available on Amazon Redshift RA3 node types or Redshift Serverless.

Additional information on supported data types can be found in both products' official docs. It’s advisable to CAST columns as needed in your query:

```sql
-- Example: Explicit casting for Redshift compatibility
SELECT
    id,
    CAST(data_field AS VARCHAR(4000)) AS data_field,
    updated_at
FROM my_table;
```
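When generating target DDL or casts in a pipeline, the mapping table above can be expressed as a simple lookup. A hypothetical Python sketch (the function name and the VARCHAR fallback for unknown types are illustrative choices, not part of either product's API):

```python
# Sketch: a lookup mirroring the AlloyDB -> Redshift type mappings above.
# JSONB maps to SUPER only when the target supports it (RA3 node types or
# Redshift Serverless); otherwise it falls back to VARCHAR per the note above.

TYPE_MAP = {
    "SERIAL": "INTEGER IDENTITY",
    "VARCHAR": "VARCHAR",
    "TIMESTAMP": "TIMESTAMP",
    "JSONB": "SUPER",
    "BOOLEAN": "BOOLEAN",
    "BYTEA": "VARBYTE",
}

def redshift_type(alloydb_type: str, super_available: bool = True) -> str:
    """Map an AlloyDB (PostgreSQL) type name to its Redshift equivalent."""
    target = TYPE_MAP.get(alloydb_type.upper(), "VARCHAR")  # default fallback
    if target == "SUPER" and not super_available:
        return "VARCHAR"  # SUPER needs RA3 node types or Redshift Serverless
    return target

print(redshift_type("jsonb", super_available=False))
# → VARCHAR
```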


Best Pattern: One-Off Initial and Incremental Loads

The recommended loading strategy is:

  1. Initial Load: Fully replicate all required data from AlloyDB to Redshift in a one-off operation.
  2. Incremental Loads: Repeatedly sync only new or updated records after the initial load.

Both operations should use the same Database Query component, but with a key difference:

  • Initial Load: No filter clause.
  • Incremental Load: Includes a filter clause, typically based on a timestamp or update column.

Refer to Matillion’s article for detailed incremental load strategies: Incremental Load Data Replication Strategy

Example SQL Patterns

Initial Load (No Filter)

```sql
SELECT id, name, last_modified_at
FROM product_catalog;
```

Incremental Load (With Filter)

Supposing `last_modified_at` is used to track updated records:

```sql
SELECT id, name, last_modified_at
FROM product_catalog
WHERE last_modified_at > '{{last_loaded_timestamp}}';
```

Replace `{{last_loaded_timestamp}}` with the appropriate variable from your ETL tool, such as Matillion.


Use the Database Query component in both cases, changing only the SQL to implement or omit the filter as needed for your load type.
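Since the two load types differ only in the filter clause, that difference can be made explicit in a small helper. A hypothetical Python sketch (the function and variable names are illustrative, not Matillion APIs; the column and table names follow the examples above):

```python
# Sketch: build the SELECT for the Database Query component, adding the
# watermark filter only for incremental runs. In practice the watermark value
# would come from an ETL variable such as {{last_loaded_timestamp}}.
from typing import Optional

def build_query(incremental: bool, last_loaded_timestamp: Optional[str] = None) -> str:
    """Return the extraction query, filtered by watermark for incremental loads."""
    query = "SELECT id, name, last_modified_at FROM product_catalog"
    if incremental:
        if last_loaded_timestamp is None:
            raise ValueError("incremental load requires a watermark timestamp")
        query += f" WHERE last_modified_at > '{last_loaded_timestamp}'"
    return query

print(build_query(False))
print(build_query(True, "2024-01-01 00:00:00"))
```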

Data Integration Architecture

Loading data into Redshift in advance of integration represents a classic “divide and conquer” strategy, breaking the data integration process into two manageable steps. This approach aligns with the Extract, Load, Transform (ELT) architecture, where raw data is first loaded into Redshift before being transformed. By employing efficient data transformation pipelines within the database, organizations can apply business logic, data cleansing, and aggregation tasks in a systematic, repeatable manner. One clear advantage of the ELT architecture is that all data transformation and integration occurs inside the target Redshift environment rather than relying on external processing. This enables fast, on-demand, scalable performance and removes the need for additional data processing infrastructure, resulting in both operational simplicity and cost savings.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.