Extracting data from Google AlloyDB to Snowflake
Extracting data from Google AlloyDB is a crucial step for organizations seeking to leverage the advanced analytical capabilities of Snowflake. Successfully moving data between these two platforms involves a series of technical considerations to ensure security, efficiency, and ongoing synchronization. In this article, we will guide you through the essential steps required to facilitate this process. First, we will demonstrate how to create an appropriate identity in Google AlloyDB, which is necessary for secure access. Next, for users of the Matillion ETL platform, we will outline the process of checking for—or acquiring—the required JDBC driver to establish a connection. We will then discuss the network connectivity prerequisites to ensure reliable and secure data transfer between source and target environments. Finally, we will explain how to perform both initial and incremental data extraction, enabling you to keep Snowflake up to date with changes from Google AlloyDB.
What is Google AlloyDB?
Google AlloyDB is a fully managed, PostgreSQL-compatible database service designed for enterprise workloads requiring high performance, scalability, and reliability. By leveraging machine learning–powered storage management and intelligent caching, AlloyDB achieves substantial improvements over standard PostgreSQL, with Google citing transactional performance more than four times faster and significantly larger gains for analytical queries. With features such as automatic failover and backup, integrated high availability, seamless multi-region replication, and compatibility with existing PostgreSQL tools and extensions, AlloyDB allows organizations to modernize their infrastructure while minimizing operational overhead. Its disaggregated architecture, which decouples storage from compute, enables dynamic scaling and efficient resource utilization for demanding workloads.
What is Snowflake?
Snowflake is a fully managed, cloud-native data platform for large-scale warehousing, analytics, and data sharing. Its architecture separates compute and storage for independent scaling and high concurrency, running on AWS, Azure, and Google Cloud. Snowflake supports semi-structured data (JSON, Parquet, Avro) via its VARIANT type, and features automatic scaling, zero-copy cloning, secure sharing, and robust security (end-to-end encryption, role-based access). Integration with major BI tools and standard SQL support make it ideal for ETL workflows and ad hoc analytics. Snowflake helps organizations efficiently manage vast datasets and minimize complexity, making it a top choice for modern data needs.
Why Move Data from Google AlloyDB into Snowflake
Unlocking Analytics: The Benefits of Copying Data from Google AlloyDB to Snowflake
A data engineer or architect may choose to copy data from Google AlloyDB into Snowflake for several compelling reasons. Google AlloyDB often hosts highly valuable operational data, such as transactional records or user activity logs, which can provide critical insights when properly analyzed. However, the greatest value is realized when this data is integrated and combined with information from other disparate sources, enabling advanced analytics and comprehensive business intelligence. By leveraging Snowflake's robust data integration and analytical capabilities, organizations can avoid the potential performance impacts and increased workload that would result from running complex queries directly on their AlloyDB instance. In this way, the data can be analyzed at scale in Snowflake without disrupting AlloyDB’s ongoing transactional operations.
Creating a User in Google AlloyDB
To create a new user (database role) in a Google AlloyDB cluster, follow these steps. Google AlloyDB is built on PostgreSQL, so the process closely parallels standard PostgreSQL methods.
Prerequisites
- You need access to a `psql` client or suitable database administration tool (e.g., Cloud Shell, a Compute Engine instance, or your local machine with the correct connections configured).
- You must have sufficient privileges (generally as a database superuser or as a user with `CREATEROLE` privileges).
- You should know the connection details for your AlloyDB cluster (host, port, database name, user credentials).
1. Connect to Your AlloyDB Instance
Use the `psql` command-line utility to connect:
psql "host=<ALLOYDB_HOST> port=<PORT> dbname=<DATABASE_NAME> user=<ADMIN_USER> sslmode=require"
Replace:
- `<ALLOYDB_HOST>`: The endpoint of your AlloyDB instance.
- `<PORT>`: The cluster or instance port (typically 5432).
- `<DATABASE_NAME>`: Name of the database where you want to manage users.
- `<ADMIN_USER>`: A username with the required privileges.
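The same connection can also be scripted. Below is a minimal sketch that assembles a libpq-style connection string matching the `psql` example above; the `fetch_server_version` helper assumes the third-party `psycopg2` package is installed (any PostgreSQL driver that accepts a DSN would work equally well), and all endpoint values are placeholders:

```python
def build_dsn(host: str, port: int, dbname: str, user: str,
              sslmode: str = "require") -> str:
    """Assemble a libpq-style connection string like the psql example."""
    return (f"host={host} port={port} dbname={dbname} "
            f"user={user} sslmode={sslmode}")

def fetch_server_version(dsn: str) -> str:
    """Open a connection and return the server version string.

    Requires the psycopg2 package (an assumption; substitute your
    preferred PostgreSQL driver if needed).
    """
    import psycopg2
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT version();")
            return cur.fetchone()[0]

# Example usage (placeholder values):
# print(fetch_server_version(
#     build_dsn("<ALLOYDB_HOST>", 5432, "<DATABASE_NAME>", "<ADMIN_USER>")))
```

The password is deliberately omitted from the DSN; drivers will pick it up from a prompt, environment variable, or `.pgpass` file, which avoids hard-coding credentials.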
2. Create the User
In PostgreSQL (and AlloyDB), users are commonly referred to as "roles".
Below is the SQL syntax to create a user with password authentication:
CREATE USER example_user WITH PASSWORD 'example_password';
Replace `example_user` and `example_password` with the desired username and a strong password.
3. Grant Privileges (Optional)
By default, the new user will not have any privileges on existing databases or tables. Grant privileges as required. For example, to grant the user the ability to connect to a database:
GRANT CONNECT ON DATABASE your_database TO example_user;
To grant usage on a specific schema:
GRANT USAGE ON SCHEMA public TO example_user;
To grant access only to specific tables:
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE your_table TO example_user;
4. (Optional) Assign Role Attributes
To give the user additional abilities, such as creating databases or roles, run:
ALTER USER example_user CREATEDB;   -- Allows creating databases
ALTER USER example_user CREATEROLE; -- Allows creating more roles/users
5. Managing Password Expiry (Optional)
You can enforce password expiration for the user:
ALTER ROLE example_user VALID UNTIL '2025-12-31';
Note: Google recommends adhering to best security practices when creating and managing database users, including using strong, unique passwords and granting only necessary privileges.
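Steps 2–4 above can also be automated. The sketch below builds the same SQL statements (so they can be reviewed before execution) and applies them through a driver connection; `psycopg2` and all names (`example_user`, `your_database`) are assumptions, not requirements:

```python
def user_setup_statements(user: str, password: str, database: str) -> list:
    """Return the CREATE USER / GRANT statements from steps 2-4.

    Note: values are interpolated directly here for readability only.
    In production, quote identifiers and pass the password via your
    driver's parameter-binding or escaping utilities.
    """
    return [
        f"CREATE USER {user} WITH PASSWORD '{password}';",
        f"GRANT CONNECT ON DATABASE {database} TO {user};",
        f"GRANT USAGE ON SCHEMA public TO {user};",
    ]

def apply_statements(dsn: str, statements: list) -> None:
    """Run the statements against AlloyDB (assumes psycopg2 is installed)."""
    import psycopg2
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # execute each DDL statement immediately
    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)
    conn.close()

# Example usage (placeholder DSN and names):
# apply_statements(
#     "host=<ALLOYDB_HOST> dbname=<DATABASE_NAME> user=<ADMIN_USER> sslmode=require",
#     user_setup_statements("example_user", "example_password", "your_database"))
```

Keeping statement generation separate from execution makes it easy to log or dry-run the exact SQL before granting privileges.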
Installing the JDBC driver
When connecting Matillion Data Productivity Cloud to a Google AlloyDB database, you need to first install a compatible JDBC driver. At the time of writing, this driver is not bundled with the Matillion product due to licensing or redistribution restrictions. Therefore, a manual download and installation process is required.
1. Downloading the JDBC Driver
- Navigate to the official PostgreSQL JDBC driver page: https://jdbc.postgresql.org/
- Locate and download the latest Type 4 JDBC driver (also known as the pure Java driver). The Type 4 driver is recommended for stable connectivity and compatibility with AlloyDB.
- Ensure you are downloading the correct `.jar` file for your environment.
2. Installing the JDBC Driver in Matillion Data Productivity Cloud
- Since Matillion Data Productivity Cloud does not bundle this driver by default, you must manually upload the `.jar` file.
- Refer to the official Matillion guide on uploading and installing external JDBC drivers: Uploading external drivers
- Follow the instructions on that page, which will walk you through the upload process using the Matillion agent interface.
- After uploading, verify that the driver has been successfully added and is visible within the Matillion Data Productivity Cloud environment.
3. Using the JDBC Driver with Database Query Components
- To utilize the newly installed AlloyDB (or compatible PostgreSQL) JDBC driver, review the detailed usage documentation provided by Matillion: Database Query component instructions
- This guide includes steps to create connections and configure database queries using the custom JDBC driver.
- Make sure to reference this documentation to ensure proper configuration of connectivity settings within your Matillion Data Productivity Cloud environment.
Following these steps ensures that your Matillion environment can access Google AlloyDB using a supported JDBC driver. This process may need to be repeated as new driver versions become available or if you update your database connectivity requirements.
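Before uploading, it can be worth a quick sanity check that the downloaded driver file is intact. A JAR is a ZIP archive, so Python's standard `zipfile` module can validate it; the file name below is a placeholder for whichever driver version you downloaded:

```python
import zipfile

def is_valid_jar(path: str) -> bool:
    """Return True if the file exists and is a readable ZIP/JAR archive."""
    try:
        with zipfile.ZipFile(path) as jar:
            return jar.testzip() is None  # None means no corrupt members
    except (zipfile.BadZipFile, OSError):
        return False

# Placeholder file name -- use the driver file you actually downloaded.
print(is_valid_jar("postgresql-driver.jar"))
```

A truncated download (a common cause of opaque "driver not found" errors later) will fail this check immediately.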
Checking network connectivity
Before you connect Matillion Data Productivity Cloud to a Google AlloyDB database, you must ensure that the database is configured to allow incoming connections based on your Matillion deployment type:
- Full SaaS Agent Configuration: You need to allow incoming connections from the IP addresses used by the Matillion Data Productivity Cloud service. The complete and up-to-date list of required IP addresses can be found at: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/
- Hybrid SaaS Deployment: In this configuration, the connection to your Google AlloyDB database will originate from your own virtual private cloud (VPC). You must ensure that Google AlloyDB allows incoming connections from your VPC's IP address range. For utilities to help check VPC network access and connectivity, visit: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Additionally, if you reference your Google AlloyDB database using a DNS hostname (rather than an IP address), the Matillion Full SaaS or Hybrid SaaS agent must be able to resolve the database's DNS address. Ensure that DNS resolution is possible from the agent’s environment to avoid connectivity issues.
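The two requirements above (TCP reachability and DNS resolution) can be verified from the agent's environment with a short standard-library script. The hostname and port below are placeholders for your actual AlloyDB endpoint:

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Check that this environment can resolve the database hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Attempt a TCP connection to the database endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoint -- substitute your AlloyDB host and port.
host, port = "alloydb.example.internal", 5432
print("DNS resolution:", can_resolve(host))
print("TCP reachability:", can_connect(host, port))
```

If DNS resolution succeeds but the TCP check fails, the problem is almost always a firewall or allow-list rule rather than the agent itself.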
Querying Data from Google AlloyDB
This guide explains how to query data from a Google AlloyDB database using SQL and discusses best practices for data loading—both initial and incremental. The audience is expected to be technically proficient with SQL and ETL/ELT tools such as Matillion.
Example: Google AlloyDB SQL SELECT Statements
Run SQL queries against your AlloyDB instance as you would for other PostgreSQL databases.
Simple SELECT Statement
SELECT id, full_name, email FROM customers;
SELECT with WHERE Clause
SELECT order_id, order_date, total FROM orders WHERE order_date >= '2024-01-01';
SELECT with Aggregation
SELECT status, COUNT(*) AS num_orders FROM orders GROUP BY status;
Datatype Conversion in ETL Scenarios
When transferring data between Google AlloyDB and Snowflake, pay special attention to data type equivalence and conversion. For example, numeric types, timestamps, and text types may have subtle differences. An example mapping:
| AlloyDB (PostgreSQL) | Snowflake Equivalent |
|---|---|
| INTEGER | NUMBER |
| BIGINT | NUMBER(38,0) |
| VARCHAR(n) | VARCHAR(n) |
| TEXT | STRING |
| TIMESTAMP | TIMESTAMP_NTZ |
Tip: Always assess conversion for columns involving dates, numeric precision, and boolean logic to ensure data integrity.
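When generating Snowflake DDL programmatically, the mapping table above can be encoded directly. This sketch covers only the types listed (plus a pass-through default) and is not an exhaustive mapping:

```python
# Mapping from the table above; extend as needed for your schemas.
PG_TO_SNOWFLAKE = {
    "integer": "NUMBER",
    "bigint": "NUMBER(38,0)",
    "text": "STRING",
    "timestamp": "TIMESTAMP_NTZ",
}

def map_type(pg_type: str) -> str:
    """Translate an AlloyDB/PostgreSQL type name to its Snowflake equivalent."""
    t = pg_type.strip().lower()
    if t.startswith("varchar"):
        return t.upper()  # VARCHAR(n) carries over with the same length
    # Fall back to the original name; Snowflake accepts many ANSI type names.
    return PG_TO_SNOWFLAKE.get(t, pg_type.upper())

print(map_type("BIGINT"))        # NUMBER(38,0)
print(map_type("varchar(255)"))  # VARCHAR(255)
```

Per the tip above, treat this as a starting point: columns with nonstandard precision, time zones, or boolean semantics deserve a case-by-case review.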
Patterns for Data Loading: Initial and Incremental Loads
A robust data loading strategy utilizes:
- Initial (Full) Load: Extracts the complete dataset—no WHERE or filter clause.
- Incremental Load: Extracts only new or changed records since the last load—uses a filter clause.
Both processes should use the same Database Query component to maintain consistency.
Initial Load: No Filter
For a first-time, full refresh from AlloyDB to (for example) Snowflake:
SELECT * FROM customers;
There are no constraints—the entire table is read and migrated.
Incremental Load: With Filter
Subsequent, regular loads fetch only the latest data. Common patterns filter on a timestamp column (e.g., `updated_at`):
SELECT *
FROM customers
WHERE updated_at > '{{last_loaded_timestamp}}';
- Replace `{{last_loaded_timestamp}}` with the timestamp recorded at the end of the previous load.
- Ensure the column used in the filter (here, `updated_at`) is indexed to maximize performance.
For further patterns and advanced incremental replication strategies, see Matillion Article: Incremental Load Data Replication Strategy.
Note:
For both initial and incremental loads, the same Database Query component can be used; only the presence of a WHERE clause differentiates the query patterns.
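Outside of Matillion, the same initial-vs-incremental pattern can be sketched as a small watermark helper: the last loaded timestamp is persisted between runs, and its absence triggers a full load. The file path and the `customers`/`updated_at` names are illustrative assumptions:

```python
import os
from typing import Optional

WATERMARK_FILE = "customers_watermark.txt"  # illustrative state location

def build_extract_query(table: str, ts_column: str,
                        last_loaded: Optional[str]) -> str:
    """Full load when no watermark exists; incremental filter otherwise."""
    if last_loaded is None:
        return f"SELECT * FROM {table};"
    return (f"SELECT * FROM {table} "
            f"WHERE {ts_column} > '{last_loaded}';")

def read_watermark(path: str = WATERMARK_FILE) -> Optional[str]:
    if not os.path.exists(path):
        return None  # first run -> initial (full) load
    with open(path) as f:
        return f.read().strip()

def write_watermark(value: str, path: str = WATERMARK_FILE) -> None:
    """Record the high-water mark at the end of a successful load."""
    with open(path, "w") as f:
        f.write(value)

print(build_extract_query("customers", "updated_at", None))
print(build_extract_query("customers", "updated_at", "2024-01-01 00:00:00"))
```

Writing the watermark only after the load commits in Snowflake ensures that a failed run is simply retried from the previous watermark rather than silently skipping rows.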
By following these patterns and adjusting queries and ETL logic as described, you can efficiently and reliably query and migrate data from Google AlloyDB.
Data Integration Architecture
Loading data into Snowflake in advance of performing integration tasks represents a classic "divide and conquer" approach, neatly separating the process into two distinct steps. This sequential methodology, central to the Extract-Load-Transform (ELT) architecture, allows data engineers to first ingest raw data as-is and then apply necessary integration processes afterward. Since effective data integration inevitably requires data transformation—such as cleansing, enrichment, and joining from multiple sources—using well-defined data transformation pipelines within Snowflake is considered best practice. These pipelines orchestrate and automate the transformation steps, ensuring consistency and efficiency. A further advantage of the ELT architecture is that all data transformation and integration operations are executed directly within the Snowflake environment itself. This on-platform processing eliminates the need for additional external data processing infrastructure, thereby resulting in fast, on-demand, and highly scalable data integration while reducing complexity and cost.