Integrate data from Google Cloud Spanner to Snowflake using Matillion

Our Google Cloud Spanner to Snowflake connector enables seamless, code-free data transfers in minutes, automatically keeping your Snowflake data current without complex ETL processes.


Extracting data from Google Cloud Spanner to Snowflake

Extracting data from Google Cloud Spanner is an essential step for organizations aiming to leverage the strengths of multiple cloud platforms for analytics and reporting. Many teams move their operational data from Spanner to a powerful analytics data warehouse like Snowflake, unlocking advanced querying, reporting, and business intelligence capabilities. Successful data extraction, particularly when integrating with tools like Matillion, requires careful planning and coordination between services.

In this article, we guide you through the end-to-end process of extracting data from Google Cloud Spanner and loading it into Snowflake. We begin by detailing how to create and configure an identity in Google Cloud Spanner to ensure secure, authorized access. For users working with Matillion, we outline the steps required to check for or acquire the JDBC driver that facilitates the connection to Spanner. We then cover network connectivity considerations, ensuring secure and reliable data transfer between your Google Cloud and Snowflake environments. Finally, we explore strategies for querying your data efficiently, both for an initial data load and for recurring, incremental extraction. By following these steps, you will be well equipped to establish a robust, scalable data pipeline from Google Cloud Spanner to Snowflake.


What is Google Cloud Spanner?

Google Cloud Spanner is a fully managed, horizontally scalable, and strongly consistent relational database service offered on Google Cloud Platform. It is architected to combine the benefits of traditional relational database management systems (RDBMS), such as SQL support and ACID transactions, with the high availability and global distribution capabilities characteristic of NoSQL systems. Cloud Spanner leverages Google's proprietary TrueTime API to achieve global consistency and external strong consistency across multiple geographic regions, allowing for seamless scaling without manual sharding or complex replication setups. Its automatic replication, load balancing, and multi-version concurrency control make it particularly well-suited for mission-critical, high-throughput applications that require maximum uptime and data integrity.


What is Snowflake?

Snowflake is a cloud-native data platform that offers a fully managed, multi-cluster architecture designed for scalability, concurrency, and high performance. Unlike traditional RDBMS and legacy data warehouses, Snowflake separates compute, storage, and services layers, enabling users to independently scale resources according to workload demands. Its support for semi-structured data (e.g., JSON, Avro, Parquet) alongside structured data, as well as features such as automatic query optimization, zero-management infrastructure, and near-instantaneous cloning, make it particularly well suited for analytics, data warehousing, and data sharing scenarios. Snowflake operates entirely on major public clouds—AWS, Azure, and Google Cloud—allowing organizations to manage data across regions and clouds seamlessly while enforcing robust security and compliance policies.

Why Move Data from Google Cloud Spanner into Snowflake

Unlocking Advanced Analytics: The Case for Copying Data from Google Cloud Spanner to Snowflake

A data engineer or architect may wish to copy data from Google Cloud Spanner into Snowflake for several compelling reasons. First, the data stored within Google Cloud Spanner is often highly valuable, as it may include critical operational or transactional information central to an organization’s activities. However, the true potential of this data is often realized only when it is integrated and analyzed alongside information from other sources, enabling more comprehensive business insights and advanced analytics. By performing this data integration within Snowflake—a platform built for scalable analytics—organizations can leverage Snowflake’s capabilities without imposing additional query workload or resource constraints on their production-grade Cloud Spanner environment. This approach preserves Spanner’s performance and reliability while maximizing the value of stored data through rich, combined analytical use cases.

Creating an Identity (User) in Google Cloud Spanner

Google Cloud Spanner does not manage users inside the database in the traditional sense (unlike MySQL or PostgreSQL). Instead, it relies on Google Cloud IAM (Identity and Access Management) to control access at the instance or database level. However, it is common to create a user table within Spanner to manage application-level users.

Below are instructions for both methods:

  • Granting access to a user via Google Cloud IAM
  • Creating an application-level user table within a Spanner database


1. Grant Access to a User via Google Cloud IAM

  1. Identify the user's email address.
  2. Open the Google Cloud Console.
  3. Navigate to IAM & Admin > IAM.
  4. Click +Grant access.
  5. Enter the email address of the user.
  6. Assign one of the following roles, depending on the required level of access:
    • Cloud Spanner Viewer (roles/spanner.viewer)
    • Cloud Spanner Database User (roles/spanner.databaseUser)
    • Cloud Spanner Database Admin (roles/spanner.databaseAdmin)
    • Cloud Spanner Admin (roles/spanner.admin)
  7. Optional: restrict the scope to a specific Spanner instance or database by specifying a resource.
  8. Click Save.

Note: Permissions granted via IAM are managed outside the database and cannot be modified with Spanner SQL.


2. Create an Application-Level User Table

You can also create a table to store application-level users and their information. This is not tied to authentication/authorization, but is useful for application logic.

Example: Create a Users table

CREATE TABLE Users (
  UserId       STRING(36) NOT NULL,    -- UUID or unique ID
  UserName     STRING(64) NOT NULL,    -- user's display name
  Email        STRING(128) NOT NULL,   -- user's email address
  CreatedAt    TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true)
) PRIMARY KEY(UserId);

Example: Insert a User

INSERT INTO Users (UserId, UserName, Email, CreatedAt)
VALUES (
  'b1a63b10-1c0c-4c53-a7f3-01dc96f95824',
  'Jane Doe',
  '[email protected]',
  PENDING_COMMIT_TIMESTAMP()
);
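The UserId value in the insert above is a version 4 UUID in its canonical 36-character text form. If your application generates these identifiers client-side, the Python standard library is sufficient; this sketch is independent of Spanner itself and is only one way to produce IDs that fit the STRING(36) column:

```python
import uuid

def new_user_id() -> str:
    """Generate a random (version 4) UUID in its canonical
    36-character text form, matching a STRING(36) column."""
    return str(uuid.uuid4())

user_id = new_user_id()
print(user_id)  # e.g. 'b1a63b10-1c0c-4c53-a7f3-01dc96f95824'
```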

Note: Authentication to Spanner itself is still managed by IAM as described in step 1. The Users table is for application logic only.

Installing the JDBC driver

At the time of writing, the Google Cloud Spanner JDBC driver is not included by default in Matillion Data Productivity Cloud. This is due to licensing or redistribution restrictions, which require users to obtain the driver directly from the vendor and handle installation manually. Follow these instructions to locate, download, and add the appropriate driver.

Step 1: Acquire the JDBC driver

You can download the Google Cloud Spanner JDBC driver from the official repository:

https://github.com/GoogleCloudPlatform/cloud-spanner-jdbc

When downloading, ensure you obtain a Type 4 JDBC driver, as this version is fully implemented in Java and does not require additional software or native libraries, maximizing compatibility and ease of deployment in cloud or containerized environments.
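Once the driver is in place, it connects using a connection URL of the documented form jdbc:cloudspanner:/projects/PROJECT/instances/INSTANCE/databases/DATABASE. As a quick sanity check before configuring Matillion, a small sketch (plain string formatting, no Spanner libraries involved) can assemble that URL; the project, instance, and database names below are placeholders:

```python
def spanner_jdbc_url(project: str, instance: str, database: str) -> str:
    """Assemble a Cloud Spanner JDBC connection URL in the format
    documented for the driver:
    jdbc:cloudspanner:/projects/<p>/instances/<i>/databases/<d>
    """
    for name in (project, instance, database):
        if not name or "/" in name:
            raise ValueError(f"invalid resource name: {name!r}")
    return (
        "jdbc:cloudspanner:/projects/{}/instances/{}/databases/{}"
        .format(project, instance, database)
    )

# Placeholder resource names:
print(spanner_jdbc_url("my-project", "my-instance", "my-database"))
```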

Step 2: Upload the driver to Matillion Data Productivity Cloud

Because the Spanner JDBC driver is not bundled within Matillion, you need to manually upload it. Matillion provides instructions for uploading external JDBC drivers on this page:

Uploading external drivers — Matillion documentation

Follow these instructions closely to ensure the driver is recognized by your Matillion agent and is available for use in orchestration jobs or database connectivity setup.

Step 3: Use the driver in Matillion

Once the driver is uploaded and registered, you can begin using it within your orchestration and transformation jobs in the Matillion Designer. Configuration and usage details for establishing database connections are available here:

Database Query usage instructions — Matillion documentation

Following these steps will enable your Matillion deployment to connect to and interact with Google Cloud Spanner using the official, up-to-date JDBC driver.

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your Google Cloud Spanner database, the Matillion agent must be able to reach your Spanner endpoint. What this requires depends on your deployment configuration:

If your Google Cloud Spanner instance is accessed using a DNS hostname, make sure that the Matillion Full SaaS or Hybrid SaaS agent can resolve that DNS address. This may require appropriate DNS configuration or firewall rules, depending on your network setup.
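As a quick pre-flight check from the environment running your agent, you can confirm that a hostname resolves before troubleshooting deeper network issues. A minimal sketch using only the Python standard library (the hostname below is a placeholder; substitute the endpoint your agent must reach):

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False

# Placeholder endpoint; replace with the host your agent connects to.
print(can_resolve("spanner.googleapis.com"))
```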

Querying Data from a Google Cloud Spanner Database

This guide provides instructions for querying data from a Google Cloud Spanner instance via SQL, and details best practices for extracting data to external systems (such as Snowflake), including a once-off initial load followed by incremental loads.

Example: SQL SELECT Statements for Google Cloud Spanner

Execute queries using SQL syntax. Common examples are shown below.

Simple Table Selection

SELECT *
FROM Users;

Filtered Query

SELECT user_id, email, signup_timestamp
FROM Users
WHERE is_active = TRUE
ORDER BY signup_timestamp DESC
LIMIT 100;

Aggregate Query

SELECT country, COUNT(*) AS user_count
FROM Users
WHERE signup_timestamp >= '2024-01-01'
GROUP BY country;

Join Across Tables

SELECT u.user_id, u.email, o.order_id, o.amount
FROM Users AS u
INNER JOIN Orders AS o
ON u.user_id = o.user_id
WHERE o.created_at > '2024-06-01';

Datatype Conversion Considerations

When transferring data from Google Cloud Spanner to other database systems such as Snowflake, note that datatype conversions may be required.
Some examples include:

Spanner Datatype | Snowflake Datatype | Notes
---------------- | ------------------ | --------------------------
STRING           | VARCHAR            | Direct mapping
BOOL             | BOOLEAN            | Direct mapping
BYTES            | BINARY             | Direct mapping
TIMESTAMP        | TIMESTAMP_NTZ      | Timezone handling required
DATE             | DATE               | Direct mapping
INT64            | NUMBER (INTEGER)   | Direct mapping
FLOAT64          | FLOAT              | Precision may differ
ARRAY            | VARIANT            | Requires careful handling
STRUCT           | VARIANT/OBJECT     | May need flattening

Consult official Spanner to Snowflake documentation for the latest details.
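Applied in a pipeline, the table above is essentially a lookup. A sketch of that mapping as a Python dictionary follows; the dictionary and the map_type helper are illustrative only, not part of any Matillion, Spanner, or Snowflake API:

```python
# Approximate Spanner -> Snowflake type mapping, per the table above.
SPANNER_TO_SNOWFLAKE = {
    "STRING":    "VARCHAR",
    "BOOL":      "BOOLEAN",
    "BYTES":     "BINARY",
    "TIMESTAMP": "TIMESTAMP_NTZ",  # timezone handling required
    "DATE":      "DATE",
    "INT64":     "NUMBER",
    "FLOAT64":   "FLOAT",
    "ARRAY":     "VARIANT",        # requires careful handling
    "STRUCT":    "VARIANT",        # may need flattening to OBJECT
}

def map_type(spanner_type: str) -> str:
    """Translate a Spanner type name such as 'STRING(36)' or
    'ARRAY<INT64>' to an approximate Snowflake type."""
    base = spanner_type.upper().split("(")[0].split("<")[0]
    try:
        return SPANNER_TO_SNOWFLAKE[base]
    except KeyError:
        raise ValueError(f"no mapping for Spanner type {spanner_type!r}")

print(map_type("STRING(36)"))    # VARCHAR
print(map_type("ARRAY<INT64>"))  # VARIANT
```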

Extracting Data: Initial and Incremental Load Pattern

A recommended approach to load data from Spanner into analytics platforms like Snowflake is to:

  1. Perform a once-off initial load (full dataset).
  2. Follow with incremental loads to ingest only new or changed data.

Both steps typically use the same Database Query component (e.g., ETL tool or connector).

1. Initial Load

The query for an initial load does not include a filter clause — all data is extracted.

SELECT * FROM Orders;

2. Incremental Load

The query for incremental loads includes a filter, often based on a timestamp or unique ID column.

SELECT *
FROM Orders
WHERE updated_at > @last_loaded_timestamp;

The parameter (@last_loaded_timestamp) holds the timestamp of the most recent record already imported. This pattern allows efficient ongoing synchronization.

See the Incremental Load & Data Replication Strategy article for deeper guidance.

Both the initial and incremental loads use the same connection and component; only the query's filter clause differs.
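The watermark pattern described above can be expressed in a few lines of orchestration logic. The sketch below stands in for a real Spanner query with an in-memory list; the extract_new_rows helper and its watermark handling illustrate the pattern rather than any particular connector's API:

```python
from datetime import datetime

def extract_new_rows(rows, last_loaded_timestamp):
    """Return rows with updated_at strictly after the watermark, plus the
    new watermark (mirrors WHERE updated_at > @last_loaded_timestamp)."""
    fresh = [r for r in rows if r["updated_at"] > last_loaded_timestamp]
    new_watermark = max((r["updated_at"] for r in fresh),
                        default=last_loaded_timestamp)
    return fresh, new_watermark

# Stand-in for the Orders table.
orders = [
    {"order_id": 1, "updated_at": datetime(2024, 6, 1)},
    {"order_id": 2, "updated_at": datetime(2024, 6, 15)},
    {"order_id": 3, "updated_at": datetime(2024, 7, 1)},
]

# Initial load: watermark at its minimum, so everything is extracted.
watermark = datetime.min
batch, watermark = extract_new_rows(orders, watermark)
print(len(batch), watermark)  # 3 2024-07-01 00:00:00

# Incremental load: only rows newer than the watermark come back.
orders.append({"order_id": 4, "updated_at": datetime(2024, 7, 10)})
batch, watermark = extract_new_rows(orders, watermark)
print(len(batch), watermark)  # 1 2024-07-10 00:00:00
```

In a real pipeline the watermark would be persisted between runs (for example, in a pipeline variable or a control table) rather than held in memory.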

Data Integration Architecture

One of the key strengths of the Extract, Load, and Transform (ELT) architecture is its ability to handle complex data integration challenges by decomposing the process into manageable steps. By loading data into Snowflake first, organizations can divide and conquer the integration problem: data is ingested up front, leaving transformation and integration for a subsequent stage. This staged approach streamlines operations and ensures that raw data is immediately available for post-processing.

Crucially, data integration relies on comprehensive transformation workflows, and the optimal way to achieve this is through configurable transformation pipelines. With ELT, these pipelines are executed directly within the target Snowflake database. This design provides several significant benefits: transformations can be performed rapidly, on demand, and at scale, leveraging Snowflake's high performance and elasticity. Furthermore, because all processing occurs natively within the Snowflake environment, there is no need for separate data processing infrastructure, reducing both cost and operational complexity.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.