
Integrate data from Google BigQuery to Snowflake using Matillion

Our Google BigQuery to Snowflake connector swiftly and seamlessly transfers your data to Snowflake, ensuring it remains current without the need for manual coding or complex ETL processes.


Extracting data from Google BigQuery to Snowflake

Extracting data from Google BigQuery is a critical step for organizations seeking to integrate diverse data sources or migrate analytical workloads to cloud data platforms such as Snowflake. This article guides you through the process of securely and efficiently transferring data from BigQuery into Snowflake, ensuring compliance, performance, and scalability throughout. We will begin by discussing how to create and configure an appropriate identity in Google BigQuery, which is essential for authorizing data access and extraction. Next, if you are utilizing Matillion for orchestration, we will demonstrate how to check for or acquire the correct JDBC driver to facilitate a reliable connection between platforms. Ensuring robust network connectivity is another foundational requirement. We will outline the key considerations and steps to establish secure communication between Google BigQuery and Snowflake, minimizing latency and enhancing data transfer consistency. Finally, the article will explore best practices for querying data, both for your initial large-volume extraction and for setting up efficient incremental data loads. By following the outlined procedures, you will be equipped to implement a scalable, secure, and automated data extraction pipeline from Google BigQuery to Snowflake.


What is Google BigQuery?

Google BigQuery is a fully managed, serverless data warehouse on Google Cloud Platform, built for high-performance SQL analysis of petabyte-scale datasets. Using Dremel’s distributed architecture, it enables fast queries and decouples storage from compute for flexible scaling. Data can be ingested via streaming or batch, and it integrates with other Google services like Dataflow and Pub/Sub. BigQuery offers automatic optimization, clustering, and partitioning, while its pay-as-you-go model lets organizations analyze massive structured data volumes with minimal infrastructure management.


What is Snowflake?

Snowflake is a cloud-native data platform that delivers data warehousing, data lake, and data sharing capabilities through a fully managed Software-as-a-Service (SaaS) architecture. Built on top of scalable cloud infrastructure, Snowflake separates compute and storage, allowing elastic resource provisioning and workload isolation. It supports ANSI SQL, offers powerful semi-structured data handling with native support for formats such as JSON, Avro, and Parquet, and automates infrastructure management tasks, including tuning, scaling, and maintenance. Additionally, Snowflake features secure data sharing across accounts, role-based access control, end-to-end encryption, and broad support for integration with popular ETL, BI, and data science tools, making it a robust solution for modern data engineering and analytics workloads.

Why Move Data from Google BigQuery into Snowflake

Unlocking Advanced Analytics: The Strategic Value of Copying Data from Google BigQuery to Snowflake

A data engineer or architect may choose to copy data from Google BigQuery into Snowflake for several compelling reasons. Firstly, Google BigQuery often houses high-value data that, if utilized effectively, can yield significant business insights. However, the true potential of this data is often realized when it is integrated with other disparate data sources—many of which may already reside in Snowflake or are more readily accessible through Snowflake’s robust integration capabilities. By moving data into Snowflake, teams can streamline analytics and achieve richer, more comprehensive views across sources. Additionally, conducting integration and transformation workloads in Snowflake, rather than directly in BigQuery, alleviates processing burdens on the original BigQuery environment. This approach not only prevents performance degradation for existing BigQuery workloads but also leverages Snowflake’s elastic compute power to support advanced data processing, enabling organizations to maximize both platforms efficiently.

Creating an Identity in Google BigQuery

BigQuery itself does not have a concept of local, standalone database "users" like traditional relational database systems (for example, there is no `CREATE USER` SQL command). Instead, user identity and authorization in BigQuery are managed through Google Cloud Identity and Access Management (IAM).

To enable a user, group, or service account to work with BigQuery datasets or resources, you assign IAM roles at the project or dataset level. Below, you'll find step-by-step instructions for granting a user access using both the Google Cloud Console and the `bq` command-line tool.


Granting User Access to BigQuery

1. Identify the User

You’ll need the user’s Google account email address, a Google group email, or a service account email address, for example `[email protected]` or `bq-reader@[PROJECT_ID].iam.gserviceaccount.com`.

2. Choose the Appropriate IAM Role

Scope roles as narrowly as necessary (see the Google Cloud IAM documentation for guidance). Common predefined BigQuery roles:

  • BigQuery Data Viewer (`roles/bigquery.dataViewer`): read access
  • BigQuery Data Editor (`roles/bigquery.dataEditor`): read and write access
  • BigQuery User (`roles/bigquery.user`): run jobs, create datasets
  • BigQuery Admin (`roles/bigquery.admin`): full control

You can also create custom roles for finer-grained control.
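If the predefined roles are too broad, a custom role can be created with `gcloud`. The role ID and permission set below are illustrative only; tailor them to your own requirements:

```shell
# Create a minimal read-only custom role (role ID and permission list are illustrative)
gcloud iam roles create bigqueryReadOnly \
  --project=[PROJECT_ID] \
  --title="BigQuery Read Only" \
  --permissions=bigquery.jobs.create,bigquery.tables.get,bigquery.tables.getData,bigquery.datasets.get
```

The custom role can then be bound to a principal with `gcloud projects add-iam-policy-binding`, just like a predefined role.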


Using Google Cloud Console

  1. Navigate to the Google Cloud Console.
  2. Select your project at the top of the page.
  3. In the navigation menu, go to IAM & Admin > IAM.
  4. Click Grant Access.
  5. In the New principals field, enter the user’s email address.
  6. In the Select a role dropdown, select BigQuery > choose the appropriate role.
  7. Click Save.

Using the `bq` Command-Line Tool

Below is an example of granting the BigQuery Data Viewer role to `[email protected]` at the project level. Note that project-level IAM bindings are managed with `gcloud`, while `bq` manages dataset-level access:

```shell
# (Optional) Show the project's details
bq show --format=prettyjson --project_id=[PROJECT_ID]

# Grant the BigQuery Data Viewer role at the project level
gcloud projects add-iam-policy-binding [PROJECT_ID] \
  --member='user:[email protected]' \
  --role='roles/bigquery.dataViewer'
```

To grant access to a dataset specifically:

```shell
bq update --dataset --source <(echo '{
  "access": [
    {
      "role": "READER",
      "userByEmail": "[email protected]"
    }
  ]
}') [PROJECT_ID]:[DATASET_ID]
```

Note that `bq update --source` replaces the dataset's entire access list, so in practice you should first export the current list with `bq show --format=prettyjson`, add the new entry, and re-apply the full document.

Replace `[PROJECT_ID]` and `[DATASET_ID]` with your own values.


Granting Access to a Service Account with SQL (for Authorized Views)

You may occasionally need to grant dataset access for a service account—often for authorized views. Example (using the Cloud Console, not regular SQL):

  1. Go to the BigQuery page in the Console.
  2. Click on your dataset.
  3. In the Details pane, click Sharing > Permissions.
  4. Click Add principal, set the service account email in New principals, assign an appropriate BigQuery role, and click Save.

Note: All access to BigQuery resources requires an existing Google identity (user, service account, or group). BigQuery and Google Cloud do not create or manage these identities—they must already exist in your Google Cloud environment.
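If the service account does not exist yet, it can be created with `gcloud` first. The account name and display name below are placeholders:

```shell
# Create a service account (the name and display name are illustrative)
gcloud iam service-accounts create bq-reader \
  --display-name="BigQuery reader for authorized views"

# The resulting email address takes the form:
#   bq-reader@[PROJECT_ID].iam.gserviceaccount.com
```

That email address is what you enter in the New principals field when sharing the dataset.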

Installing the JDBC driver

To enable connectivity between Matillion Data Productivity Cloud and Google BigQuery, you will need to install the BigQuery JDBC driver manually. At the time of writing, this driver is not bundled with Matillion Data Productivity Cloud by default due to licensing and redistribution restrictions imposed by the driver’s vendor. The following instructions provide a step-by-step guide to downloading the correct driver and installing it within the Matillion environment.

1. Obtain the BigQuery JDBC Driver

  1. Visit the official Simba Drivers for Google BigQuery download page:
    https://cloud.google.com/bigquery/providers/simba-drivers/

  2. Look for the JDBC driver version designated as a Type 4 driver (also referred to as a pure Java implementation). This is preferred for cloud-based integrations such as with Matillion.

  3. Download the appropriate JDBC driver packaged as a `.jar` file. Ensure you comply with all licensing terms as specified by Google and Simba.

2. Upload the Driver to Matillion Data Productivity Cloud

  1. Follow the instructions provided by Matillion to upload external JDBC drivers to your cloud agent:
    https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

  2. Make sure the uploaded driver is accessible to the agent that will be running your database integrations.

  3. After uploading, restart the agent if required to ensure the new driver is recognized by Matillion.

3. Configure BigQuery Connections in Matillion

With the driver installed, you can now proceed to set up and use BigQuery as a data source or destination within Matillion Data Productivity Cloud:

  1. Follow the official usage instructions to configure and utilize BigQuery connections:
    https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

Refer to Matillion’s documentation for any parameter settings, authentication methods, or connection property requirements specific to your BigQuery setup.
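For illustration, a Simba BigQuery JDBC connection string generally takes the following shape. The project ID, service account email, and key path are placeholders, and the exact property names can vary by driver version, so verify them against the driver's install guide:

```
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=[PROJECT_ID];OAuthType=0;OAuthServiceAcctEmail=[SERVICE_ACCOUNT_EMAIL];OAuthPvtKeyPath=/path/to/keyfile.json
```

Here `OAuthType=0` selects service account authentication; other values enable user-based OAuth flows.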

Checking network connectivity

To ensure that Matillion Data Productivity Cloud can communicate with your Google BigQuery database, you must configure network access according to your chosen deployment method:

  • Full SaaS agent configuration:
    You need to allow incoming connections from Matillion Data Productivity Cloud by allowlisting the IP addresses listed in the Matillion documentation. Ensure that all relevant IP addresses are included in your Google BigQuery network configuration to allow access from the Matillion agent.

  • Hybrid SaaS deployment:
    In this scenario, connections to Google BigQuery originate from your organization’s own virtual private cloud (VPC). Make sure your Google BigQuery project allows incoming connections from your VPC. If you need to verify network access, or for additional connectivity utilities, refer to the tools described in the Matillion documentation.

Additionally, if your Google BigQuery database is referenced using a DNS hostname, ensure that the Full SaaS or Hybrid SaaS agent can successfully resolve the DNS address. This may require that appropriate DNS servers or forwarders are accessible from the environment where the agent is running.
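A quick way to sanity-check DNS resolution and HTTPS reachability from the agent's environment is shown below; the hostname is the standard BigQuery API endpoint, and the exact HTTP status returned for the bare path may vary (a 404 still proves connectivity):

```shell
# Check that the BigQuery API hostname resolves from this environment
nslookup bigquery.googleapis.com

# Check that an HTTPS connection can be established and print the status code
curl -sS -o /dev/null -w "%{http_code}\n" https://bigquery.googleapis.com
```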

Querying Data from Google BigQuery

This guide explains how to query data from a Google BigQuery database, with sample SQL queries, notes on datatype conversions (especially relevant when integrating with Snowflake), and best practices for initial and incremental loads using a Database Query component.


1. Basic SELECT Query Examples

The standard syntax for querying BigQuery is SQL-based. Below are sample queries:

```sql
-- Select all columns from a public dataset
SELECT *
FROM `bigquery-public-data.samples.natality`;

-- Select specific columns with filters
SELECT name, gender
FROM `project.dataset.users`
WHERE signup_date >= '2023-01-01'
ORDER BY name
LIMIT 100;

-- Aggregate with GROUP BY
SELECT state, COUNT(*) AS num_births
FROM `bigquery-public-data.samples.natality`
GROUP BY state
HAVING num_births > 10000;
```


2. Notes on Datatype Conversion: BigQuery vs Snowflake

When migrating or integrating between BigQuery and Snowflake, keep in mind:

  • STRING (BigQuery) maps to VARCHAR (Snowflake)
  • BYTES (BigQuery) maps to BINARY (Snowflake)
  • INTEGER (BigQuery, a 64-bit integer) maps to NUMBER(38,0) in Snowflake, for which INTEGER and BIGINT are aliases
  • FLOAT (BigQuery) maps to FLOAT (Snowflake)
  • BOOLEAN (BigQuery) maps to BOOLEAN (Snowflake)
  • TIMESTAMP and DATETIME types require specific attention to time zone semantics

Tip: Always cast columns explicitly when datatype discrepancies may arise.

```sql
-- Example: Explicit casting
SELECT CAST(user_id AS STRING) AS user_id_str
FROM `project.dataset.users`;
```


3. Initial vs Incremental Load Patterns

The recommended approach is:

  • Perform a one-off initial load that captures all existing data, with no filter clause.
  • Follow up with incremental loads that retrieve only new/changed data since the last load, using a filter clause.

The Matillion exchange article – Incremental Load: Data Replication Strategy provides an in-depth explanation of this approach.

Initial Load Example

```sql
-- Loads the entire table: no filter clause required
SELECT *
FROM `project.dataset.sales_orders`;
```

Incremental Load Example

```sql
-- Loads only records modified since the last batch
SELECT *
FROM `project.dataset.sales_orders`
WHERE modified_at > TIMESTAMP('2024-06-01 00:00:00');
```

Both queries use the same Database Query component, but only the incremental load includes a filter (e.g., on `modified_at`, `updated_at`, or a surrogate key).
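In practice, the incremental cutoff is usually derived from the target rather than hard-coded. As a sketch, assuming the raw data has already been landed in a hypothetical Snowflake table `analytics.raw.sales_orders`:

```sql
-- Read the current high watermark from the Snowflake target;
-- the result is then substituted into the BigQuery incremental filter.
SELECT MAX(modified_at) AS last_loaded_at
FROM analytics.raw.sales_orders;
```

Orchestration tools typically store this watermark in a variable between pipeline runs.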


For detailed implementation steps and filter patterns, always refer to the relevant ELT tool and consult Google BigQuery documentation as well as the Matillion exchange article linked above.

Data Integration Architecture

Loading data into Snowflake in advance of integration—a core aspect of the Extract, Load, Transform (ELT) architecture—enables organizations to divide and conquer the integration challenge by breaking it into two manageable steps: first loading the raw data, and then performing integrations and transformations as needed. This separation simplifies development and troubleshooting, and also unlocks greater flexibility for managing evolving data requirements. To effectively transform and integrate this data, modern solutions use scalable data transformation pipelines, which allow for the coordinated execution of complex data processing tasks. A further advantage of the ELT approach is that transformation and integration workloads run directly inside the target Snowflake database. This design ensures fast, on-demand processing that can automatically scale with your needs, eliminating the need to provision or maintain separate processing infrastructure, and reducing both operational complexity and total cost of ownership.
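For instance, once raw BigQuery data has been landed, a transformation step can run entirely inside Snowflake. The table and column names below are illustrative:

```sql
-- Build an integrated reporting table from raw landed data (names are illustrative)
CREATE OR REPLACE TABLE analytics.marts.daily_sales AS
SELECT
    o.order_date,
    c.region,
    SUM(o.amount) AS total_amount
FROM analytics.raw.sales_orders AS o
JOIN analytics.raw.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY o.order_date, c.region;
```

Because this statement executes on a Snowflake virtual warehouse, its compute scales independently of the load step.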

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.