Integrate data from Google BigQuery to Amazon Redshift using Matillion

Our Google BigQuery to Redshift connector enables seamless and timely data transfers to Redshift, ensuring your information remains current without the need for manual coding or managing complex ETL processes.

Extracting data from Google BigQuery to Amazon Redshift

Extracting data from Google BigQuery is a common requirement for organizations that wish to consolidate analytics, support reporting workflows, or leverage additional processing features available in other data warehouses, such as Amazon Redshift. This article provides a concise guide to efficiently moving your data from Google BigQuery to Redshift, with a particular focus on Matillion users.

We will begin by outlining how to create a suitable identity in Google BigQuery to authorize and manage data extraction securely. Next, we will guide Matillion users on how to verify or acquire the appropriate JDBC driver to facilitate communication between platforms. We will then address the vital considerations for establishing reliable network connectivity from the source (BigQuery) to the target (Redshift). Finally, we will illustrate best practices for querying your data—covering both full (initial) data loads and ongoing incremental updates—to ensure efficient and consistent data transfer. By the end of this article, you will have a clear understanding of each step required to establish and maintain a seamless, repeatable data pipeline between Google BigQuery and Amazon Redshift.


What is Google BigQuery?

Google BigQuery is a fully managed, serverless data warehouse solution designed for large-scale data analytics. Built on Google's Dremel technology, BigQuery utilizes a columnar storage format and a distributed architecture, enabling high-speed SQL querying over petabyte-scale datasets. It separates compute and storage, allowing users to independently scale resources according to workload requirements. BigQuery supports standard SQL, integrates seamlessly with other Google Cloud services, and offers features such as streaming ingestion, partitioned and clustered tables, and built-in machine learning capabilities via BigQuery ML. Its robust security, fine-grained access controls, and encrypted data storage make it well-suited for handling sensitive enterprise workloads in a cloud-native environment.


What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud, designed for high-performance analytics on structured and semi-structured data. Leveraging columnar storage and massively parallel processing (MPP) architecture, Redshift efficiently executes complex queries across large datasets, making it well-suited for business intelligence, reporting, and data lake integration scenarios. The platform supports standard SQL syntax and integrates with a wide array of data ingestion, transformation, and visualization tools. With features like automatic scaling, encryption, workload management, and Redshift Spectrum for querying data directly in Amazon S3, Redshift offers robust capabilities for handling analytical workloads, optimizing query performance, and reducing administrative overhead for data engineering teams.

Why Move Data from Google BigQuery into Amazon Redshift

The Case for Copying Data from Google BigQuery to Redshift for Enhanced Analytics

A data engineer or architect might choose to copy data from Google BigQuery into Amazon Redshift for several reasons. Firstly, Google BigQuery often holds a wealth of valuable data that organizations wish to leverage for analytical and reporting purposes. However, the true potential of this data is often realized only when it is integrated with other data sources, such as operational databases or data lakes, to produce richer insights and more comprehensive analyses. By transferring the data into Redshift, organizations can perform complex data integration and transformation operations within Redshift itself, thereby avoiding additional processing workloads and costs on the BigQuery environment. This approach not only optimizes resource utilization across platforms but also streamlines data workflows for downstream analytics, reporting, and business intelligence activities.

Creating an Identity in Google BigQuery

In Google BigQuery, you do not directly create database "users" as in traditional databases. Instead, you grant identities (such as Google accounts, Google Groups, or service accounts) specific roles and permissions at the project, dataset, or table level by using Google Cloud IAM (Identity and Access Management). Below are the steps to grant BigQuery access to a new user.

1. Identify the User’s Email Address

You need the user’s Google account email address (for individuals), group email address (for multiple users), or service account email.

Example:

user@example.com

2. Assign BigQuery IAM Roles

You can assign roles using the Google Cloud Console, gcloud CLI, or Terraform/IaC tools. The following example demonstrates how to grant the BigQuery Data Editor role to a user at the dataset level using SQL and the Google Cloud Console.

Option A: Grant Permission Using SQL (bq Command Line or Console)

Google BigQuery supports SQL-based access control through the GRANT statement, for datasets only. For example, to grant BigQuery Data Viewer access to a user on a specific dataset:

GRANT `roles/bigquery.dataViewer`
ON SCHEMA my_project.my_dataset
TO "user:user@example.com";

You may also use:

GRANT `roles/bigquery.dataEditor`
ON SCHEMA my_project.my_dataset
TO "user:user@example.com";

Option B: Google Cloud Console

  1. Navigate to your project in Google Cloud Console.
  2. In the sidebar, go to BigQuery.
  3. Locate the dataset you want to share.
  4. Click the Sharing icon next to the dataset.
  5. Click Add Principal.
  6. Enter the user's email address (e.g., user@example.com).
  7. Select a role (e.g., BigQuery Data Editor).
  8. Click Save.

Option C: Using gcloud CLI

gcloud projects add-iam-policy-binding my_project \
  --member="user:user@example.com" \
  --role="roles/bigquery.dataEditor"

Replace `my_project`, `my_dataset`, and `user@example.com` with your actual resource names and email address.

Common BigQuery IAM Roles

  • roles/bigquery.dataViewer — Read-only access (can query data)
  • roles/bigquery.dataEditor — Read/write access (can query and load data)
  • roles/bigquery.dataOwner — Full control, including deleting tables and datasets

Note: After assigning permissions, there is no further step required to "create" the user in BigQuery itself. The user is managed through Google Cloud IAM.
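If you need to grant several roles at once (for example, to a dedicated service account used by your extraction pipeline), the gcloud commands above can be generated from a short script. This is an illustrative sketch; the project name, service account, and role list are placeholders:

```python
from typing import List

def iam_binding_commands(project: str, member: str, roles: List[str]) -> List[str]:
    """Compose one gcloud add-iam-policy-binding command per role."""
    return [
        f"gcloud projects add-iam-policy-binding {project} "
        f'--member="{member}" --role="{role}"'
        for role in roles
    ]

# Example: a hypothetical service account that will query data and run jobs
cmds = iam_binding_commands(
    "my_project",
    "serviceAccount:etl@my_project.iam.gserviceaccount.com",
    ["roles/bigquery.dataViewer", "roles/bigquery.jobUser"],
)
for cmd in cmds:
    print(cmd)
```

Review the generated commands before running them, since IAM bindings apply project-wide at this level.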

Installing the JDBC Driver

The Google BigQuery JDBC driver is required when connecting Matillion Data Productivity Cloud to external BigQuery data sources. At the time of writing, the BigQuery JDBC driver is not shipped with Matillion Data Productivity Cloud by default, as there are licensing and redistribution restrictions imposed by the driver vendor. To facilitate BigQuery connectivity, you must manually download and install the JDBC driver.

Downloading the Google BigQuery JDBC Driver

  1. Navigate to the Official Download Page:
    Access the Google BigQuery JDBC driver from the official provider at https://cloud.google.com/bigquery/providers/simba-drivers/.

  2. Locate the Type 4 JDBC Driver:
    On the Simba driver download page, look specifically for a Type 4 JDBC driver. Type 4 drivers are purely Java-based and are generally the preferred option due to their portability and ease of deployment.

  3. Download the Driver Package:
    Download the latest compatible JDBC driver package for your platform as instructed on the provider’s download page. You may be required to accept license terms prior to initiating the download.

Installing the JDBC Driver in Matillion Data Productivity Cloud

Once you have downloaded the appropriate driver .jar files, you are ready to upload and install them into your Matillion environment.

  1. Prepare the Driver Files:
    Extract the downloaded package if necessary, and ensure you have the relevant .jar (and any supplementary) files readily accessible.

  2. Follow the Product Driver Upload Procedure:
    For step-by-step instructions on adding external drivers to Matillion Data Productivity Cloud, follow the guidance documented here:
    https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

The instructions will guide you through the process of uploading the driver files to the Matillion Agent or relevant component within your environment and verifying their correct recognition by the platform.

  3. Restart as Instructed:
    Some configuration changes may require a restart of your Matillion Agent for the driver to be recognized. Carefully follow any prompts or instructions in the documentation.

Next Steps: Configuring and Using the Driver

Once the driver is successfully installed, you can begin to use it to connect to your BigQuery projects. Matillion provides detailed guidance for configuring database connections and leveraging the Database Query component, available at:
https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

Refer to these instructions for configuring authentication, specifying connection properties, and authoring queries to BigQuery through Matillion Data Productivity Cloud.
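As a sketch of what the resulting connection string looks like, a Simba BigQuery JDBC URL can be assembled from a handful of properties. The property names below (ProjectId, OAuthType, OAuthServiceAcctEmail, OAuthPvtKeyPath) follow the Simba driver's service-account authentication pattern; verify them against the documentation shipped with the driver version you downloaded:

```python
def bigquery_jdbc_url(project_id: str, sa_email: str, key_path: str) -> str:
    """Assemble a Simba BigQuery JDBC URL for service-account
    (OAuthType=0) authentication. Property names are taken from the
    Simba driver docs; confirm them for your driver version."""
    host = "https://www.googleapis.com/bigquery/v2:443"
    props = {
        "ProjectId": project_id,
        "OAuthType": "0",  # 0 = service-account key authentication
        "OAuthServiceAcctEmail": sa_email,
        "OAuthPvtKeyPath": key_path,
    }
    return ("jdbc:bigquery://" + host + ";"
            + ";".join(f"{k}={v}" for k, v in props.items()) + ";")

url = bigquery_jdbc_url(
    "my_project",
    "etl@my_project.iam.gserviceaccount.com",  # hypothetical account
    "/opt/keys/sa.json",                        # hypothetical key path
)
```

In Matillion, these values are normally entered as connection properties rather than hand-built into a URL, but seeing the assembled string can help when troubleshooting authentication errors.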

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your Google BigQuery environment, you must make sure that incoming network connections are permitted for your chosen deployment configuration (Full SaaS or Hybrid SaaS).

Additionally, if your Google BigQuery instance is being accessed via a DNS hostname (rather than an IP address), you must ensure that the Full SaaS or Hybrid SaaS agent has network access to DNS services and can resolve the database’s hostname correctly.
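A quick way to confirm hostname resolution from the agent's environment is a small DNS check. This sketch uses only Python's standard library; `www.googleapis.com` is the endpoint the BigQuery JDBC driver typically connects to:

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if DNS resolution succeeds for the given hostname."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Example: check the endpoint the BigQuery JDBC driver connects to:
#   can_resolve("www.googleapis.com")
```

If resolution fails, check the agent's DNS configuration (e.g., /etc/resolv.conf or your container network settings) before investigating firewall rules.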

Querying Data from Google BigQuery

Writing SQL Queries for Google BigQuery

Google BigQuery uses standard SQL for querying structured data. Here are some example SELECT statements:

```sql
-- Select all columns from a table
SELECT *
FROM `my_project.my_dataset.my_table`;

-- Select specific columns with a WHERE clause
SELECT id, name, created_at
FROM `my_project.my_dataset.my_table`
WHERE active = TRUE;

-- Aggregate data and use grouping
SELECT country, COUNT(*) AS user_count
FROM `my_project.my_dataset.users`
GROUP BY country;
```

Datatype Conversion: Google BigQuery vs. Amazon Redshift

When moving data between BigQuery and Redshift, be aware that some data types do not match perfectly:

| BigQuery Datatype | Redshift Equivalent |
| --- | --- |
| STRING | VARCHAR |
| BYTES | VARBYTE |
| INTEGER | BIGINT |
| FLOAT | DOUBLE PRECISION |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| DATETIME | TIMESTAMP |
| TIMESTAMP | TIMESTAMP |
| ARRAY, STRUCT, JSON | Varies (often VARCHAR, or Redshift's SUPER type for semi-structured data) |

You may need to explicitly cast or transform values, especially for nested (ARRAY/STRUCT) or JSON data, when moving between systems.
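In an automated pipeline, the table above can be encoded as a simple lookup. The mapping below is an illustrative sketch derived from that table; adjust the fallback (SUPER vs. VARCHAR) to match how you handle nested data:

```python
# Illustrative mapping based on the datatype table above; extend as needed.
BQ_TO_REDSHIFT = {
    "STRING": "VARCHAR",
    "BYTES": "VARBYTE",
    "INTEGER": "BIGINT",
    "INT64": "BIGINT",
    "FLOAT": "DOUBLE PRECISION",
    "FLOAT64": "DOUBLE PRECISION",
    "BOOLEAN": "BOOLEAN",
    "BOOL": "BOOLEAN",
    "DATE": "DATE",
    "DATETIME": "TIMESTAMP",
    "TIMESTAMP": "TIMESTAMP",
}

def redshift_type(bq_type: str, fallback: str = "SUPER") -> str:
    """Map a BigQuery type to its Redshift equivalent; nested or JSON
    types with no direct match fall back to SUPER (or VARCHAR)."""
    return BQ_TO_REDSHIFT.get(bq_type.upper(), fallback)
```

A helper like this is useful when generating Redshift DDL from BigQuery table metadata, keeping the conversion rules in one place.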

Recommended Loading Pattern

The best practice for loading data is to perform a one-time initial load, followed by recurring incremental loads. Both operations use the same Database Query component with different query clauses.

Initial Load

  • Pattern: No filter clause; all data is selected.
  • Example Query:

    SELECT *
    FROM `my_project.my_dataset.my_table`;

  • This loads the full table’s contents into Redshift or another destination.

Incremental Load

  • Pattern: A filter clause fetches only records that are new or changed since the last load.
  • Example Query: Using updated_at as a change-tracking column:

    SELECT *
    FROM `my_project.my_dataset.my_table`
    WHERE updated_at > TIMESTAMP('2024-01-01 00:00:00');

  • Replace the timestamp or filter value dynamically in production, using the value from your last successful load.

Using the Matillion Database Query Component

  • In both initial and incremental loads, use the same component but change the SQL query to add a WHERE clause for incremental extraction.
  • Filter fields should correspond to reliable change-tracking columns (e.g., modified_at, updated_at, id, etc.).

For further reading on the incremental load pattern and best practices:
Incremental Load – Data Replication Strategy (Matillion Exchange)
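The two query patterns above can be generated from one helper that falls back to a full extract when no watermark has been recorded yet. This is an illustrative sketch; in Matillion you would typically supply the watermark value through a pipeline variable rather than hard-coding it:

```python
from typing import Optional

def extraction_query(table: str, watermark_column: str,
                     last_loaded: Optional[str]) -> str:
    """Build the BigQuery extraction query.

    last_loaded: ISO timestamp of the last successful load,
    or None for the initial (full) load.
    """
    base = f"SELECT * FROM `{table}`"
    if last_loaded is None:
        return base + ";"  # initial load: no filter clause
    return (f"{base} WHERE {watermark_column} > "
            f"TIMESTAMP('{last_loaded}');")

# Initial load
print(extraction_query("my_project.my_dataset.my_table", "updated_at", None))
# Incremental load from a recorded watermark
print(extraction_query("my_project.my_dataset.my_table", "updated_at",
                       "2024-01-01 00:00:00"))
```

Persisting the high-water mark after each successful run (for example, in a control table) is what makes the incremental pattern repeatable.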

Data Integration Architecture

A key advantage of the ELT (Extract, Load, Transform) architecture is that it divides the data integration process into two distinct steps by loading data into the Redshift database in advance of integration. This “divide and conquer” approach simplifies data workflows and enhances flexibility; raw data is first loaded into staging tables, and integration is performed as a separate step. Since data integration requires transformation to standardize, clean, and enrich information, utilizing data transformation pipelines is the most effective strategy to orchestrate and automate these complex, repetitive operations. Furthermore, a notable benefit of the ELT model is that all data transformation and integration occur directly inside the target Redshift database. This makes the process fast, on-demand, and inherently scalable, eliminating the need for expensive, separate data processing infrastructure outside of Redshift.
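As a concrete illustration of the "load to staging, then integrate" step, the statement below merges staged rows into a target table entirely inside Redshift. Table and column names are hypothetical, and Redshift's MERGE syntax should be checked against the documentation for the version you run:

```python
# Hypothetical staging-to-target integration step, executed inside Redshift
# by your transformation pipeline after the raw load completes.
MERGE_SQL = """
MERGE INTO analytics.users
USING staging.users_stage AS s
ON analytics.users.id = s.id
WHEN MATCHED THEN
    UPDATE SET name = s.name, updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (id, name, updated_at) VALUES (s.id, s.name, s.updated_at);
"""
```

Because the merge runs on Redshift's own compute, the integration step scales with the warehouse rather than requiring separate processing infrastructure, which is the point of the ELT model described above.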

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.