Integrate data from Google AlloyDB to Databricks using Matillion

Our Google AlloyDB to Databricks connector enables seamless and timely data transfer to Databricks within minutes, ensuring your data remains current without requiring manual coding or managing intricate ETL processes.


Extracting data from Google AlloyDB to Databricks

Extracting data from Google AlloyDB is an essential step for organizations looking to analyze their transactional data using modern data platforms such as Databricks. Whether you are migrating large datasets, building data pipelines for analytics, or enabling real-time reporting, it is important to ensure that the data movement process is secure, efficient, and reliable.

In this article, we will walk through the process of extracting data from Google AlloyDB and loading it into Databricks. We will begin by discussing how to create an identity in Google AlloyDB, which is necessary for secure access and data extraction. For Matillion users, we will cover how to check for the appropriate JDBC driver or acquire it if needed, ensuring that your ETL tool can connect seamlessly to AlloyDB. Next, we will address the key considerations for establishing network connectivity between your AlloyDB instance and the Databricks environment, focusing on both security and performance best practices. Finally, we will look at how to query data effectively—covering both initial data loads and ongoing incremental updates—to ensure your data pipeline remains up to date.

By the end of this article, you will be equipped with the knowledge needed to leverage Google AlloyDB as a reliable source for your Databricks-based analytics and data engineering workflows.


What is Google AlloyDB?

Google AlloyDB is a fully managed, PostgreSQL-compatible relational database service designed to deliver enhanced performance, scale, and availability for enterprise workloads. Built on Google Cloud, AlloyDB leverages cloud-native architecture and deep integration with advanced Google infrastructure, including proprietary storage and intelligent caching, to achieve significantly better throughput and lower latency than standard PostgreSQL deployments. AlloyDB provides robust support for read scalability, zero-downtime patching, automated backups, high availability with automatic failover, and integrated machine learning capabilities. Its compatibility with existing PostgreSQL tools and extensions simplifies migration and modernization for organizations seeking to leverage the benefits of managed cloud databases without sacrificing familiar development paradigms.


What is Databricks?

Databricks is a unified data analytics platform built on Apache Spark that streamlines big data and AI solution development. It integrates natively with structured, semi-structured, and unstructured data, scaling to petabytes. The Databricks Lakehouse combines data warehouse reliability with data lake scalability, supporting analytics and machine learning in one place. Databricks enables data engineers and analysts to manage databases, tables, and views with SQL, leveraging Delta Lake for ACID transactions, scalable metadata, and unified batch-plus-streaming processing. This collaborative environment supports version-controlled development and efficient, end-to-end data pipeline management, making advanced analytics and machine learning seamless and scalable.

Why Move Data from Google AlloyDB into Databricks

Unlocking Analytics Potential: The Value of Copying Google AlloyDB Data into Databricks

A data engineer or architect may wish to copy data from Google AlloyDB into Databricks for several compelling reasons. First, the data residing in Google AlloyDB is often valuable, representing crucial business transactions, user activity, or operational records. However, the full potential of this data is typically realized only when it is integrated with information from other sources, such as cloud data lakes, legacy systems, or third-party datasets. By bringing AlloyDB data into Databricks, organizations can perform advanced analytics, machine learning, and data transformation at scale, drawing richer insights from comprehensive datasets. Furthermore, utilizing Databricks for data integration and processing offloads resource-intensive tasks from Google AlloyDB, protecting the performance and responsiveness of AlloyDB’s transactional workloads while enabling complex data operations in an optimized environment.

Creating a User in Google AlloyDB

To manage access control in Google AlloyDB, you create users within the database, similar to PostgreSQL. The following instructions guide you through creating a new user account using SQL scripts.

Prerequisites

  • You must have appropriate privileges (typically a user with `CREATEROLE` or superuser rights) to create new users.
  • Connect to your AlloyDB instance using your preferred PostgreSQL-compatible client (such as `psql`, DBeaver, or Cloud Shell).

Step 1: Connect to the Database

Use the `psql` command-line tool or your database client's connection interface, providing the database endpoint, username, and password.

psql -h [ALLOYDB_HOST] -U [ADMIN_USER] -d [DATABASE_NAME]

Replace the placeholders with your AlloyDB instance details.

Step 2: Create the User

Once connected, run the following SQL statement to create a new user. Optionally, you can specify a password.

CREATE USER new_user_name WITH PASSWORD 'strong_password';

  • Replace `new_user_name` with your desired username.
  • Replace `strong_password` with a secure password.

Step 3: Grant Privileges (Optional)

By default, the user has no privileges on existing databases or tables. Grant access as required. For example, to grant basic read and write access on all tables in a schema:

GRANT CONNECT ON DATABASE [DATABASE_NAME] TO new_user_name;
GRANT USAGE ON SCHEMA public TO new_user_name;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO new_user_name;

Step 4: Confirm User Creation

To verify the user has been created, query the `pg_roles` catalog:

\du

or:

SELECT rolname FROM pg_roles;

The new user should appear in the list.


Note: For advanced user management—such as specifying roles, login restrictions, or group membership—refer to the PostgreSQL documentation, as AlloyDB is compatible with most PostgreSQL features.

Installing the JDBC Driver

At the time of writing, the JDBC driver required for connecting to Google AlloyDB is not included natively in Matillion Data Productivity Cloud. This is due to licensing or redistribution restrictions, so you will need to download and install it manually.

To complete the installation, follow these steps:

1. Download the JDBC Driver

Google AlloyDB is PostgreSQL-compatible and uses the PostgreSQL JDBC driver (also known as "pgJDBC"). The Type 4 driver (a pure Java implementation) is the only one required, and it is the recommended choice.

  • Visit https://jdbc.postgresql.org/ to access the official downloads page.
  • Download the latest stable release of the Type 4 JDBC driver in `.jar` format.

2. Upload the Driver to Matillion Data Productivity Cloud

Since the driver is not bundled, you will need to add it as an external driver to your Matillion environment.

3. Configure Usage in Database Query Components

Once the driver installation is complete, you can configure and utilize it within Matillion Data Productivity Cloud to connect and interact with your AlloyDB database.

  • Usage instructions are provided here: Database Query component usage
  • Ensure you reference the newly uploaded JDBC driver in your Matillion connection configuration.

By following the steps above and consulting the relevant documentation, you can enable Google AlloyDB support within your Matillion Data Productivity Cloud deployments.

Checking network connectivity

To enable connectivity between Matillion Data Productivity Cloud and your Google AlloyDB database, you must ensure that the AlloyDB instance allows incoming network connections from the appropriate sources for your deployment configuration.

Additionally, if your Google AlloyDB instance is referenced using a DNS hostname, ensure that the Full SaaS or Hybrid SaaS agent can resolve the DNS address. This may require appropriate DNS configuration or resolution from within your VPC or environment.
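As a quick sanity check, the reachability requirements above can be verified from the agent's network with a short script. The sketch below uses only the Python standard library; the hostname and port shown are placeholders for your own instance details:

```python
# Illustrative pre-flight check: verify that this host can resolve the
# AlloyDB hostname and open a TCP connection on the PostgreSQL port (5432).
import socket

def check_connectivity(host: str, port: int = 5432, timeout: float = 5.0) -> dict:
    """Return a small report on DNS resolution and TCP reachability."""
    report = {"host": host, "port": port, "resolves": False, "reachable": False}
    try:
        # DNS resolution: the same step the JDBC driver must complete first.
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        report["resolves"] = True
        report["addresses"] = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return report  # DNS failed; no point attempting TCP
    try:
        with socket.create_connection((host, port), timeout=timeout):
            report["reachable"] = True
    except OSError:
        pass  # name resolved but port not reachable (firewall, routing, instance down)
    return report

if __name__ == "__main__":
    # Placeholder hostname; substitute your AlloyDB endpoint.
    print(check_connectivity("localhost"))
```

If DNS resolution fails, fix name resolution in the agent's VPC first; if the name resolves but the TCP connection fails, review firewall and network configuration on the AlloyDB side.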

Querying Data from Google AlloyDB

Google AlloyDB is a fully managed, PostgreSQL-compatible database service. To extract data from AlloyDB, you use PostgreSQL-compatible SQL queries. This guide demonstrates querying AlloyDB data, managing type conversions when integrating with Databricks, and recommended patterns for initial and incremental data loads using ETL tools.

1. Example SQL SELECT Statements

Below are sample queries you can run against a table named

sales_orders
in AlloyDB:

```
-- Select all columns and rows
SELECT * FROM sales_orders;

-- Select specific columns
SELECT order_id, customer_name, order_total FROM sales_orders;

-- Filter rows by a column value
SELECT order_id, order_total FROM sales_orders WHERE order_date >= '2024-01-01';

-- Aggregate sales by customer
SELECT customer_id, SUM(order_total) AS total_sales FROM sales_orders GROUP BY customer_id;
```
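To see these statements run end to end, the sketch below uses Python's built-in `sqlite3` module as a local stand-in for AlloyDB, since the SQL shown is portable between the two engines. The table definition and sample rows are invented for illustration; against AlloyDB you would issue the same statements over a PostgreSQL connection.

```python
# Illustrative only: sqlite3 stands in for AlloyDB so the example is
# self-contained; the queries mirror the article's sales_orders examples.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_orders (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER,
        customer_name TEXT,
        order_total   REAL,
        order_date    TEXT
    )
""")
conn.executemany(
    "INSERT INTO sales_orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 101, "Acme Ltd", 250.0, "2023-12-15"),
        (2, 101, "Acme Ltd", 100.0, "2024-02-01"),
        (3, 102, "Globex",   75.5,  "2024-03-10"),
    ],
)

# Filter rows by a column value (same WHERE clause as above)
recent = conn.execute(
    "SELECT order_id, order_total FROM sales_orders WHERE order_date >= '2024-01-01'"
).fetchall()
print(recent)   # the two 2024 orders

# Aggregate sales by customer
totals = conn.execute(
    "SELECT customer_id, SUM(order_total) AS total_sales "
    "FROM sales_orders GROUP BY customer_id"
).fetchall()
print(totals)
```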

2. Data Type Conversion: AlloyDB and Databricks

Data types in AlloyDB (PostgreSQL) may not have direct equivalents in Databricks (which uses Spark SQL). Common conversions include:

  • `timestamp with time zone` (AlloyDB) → `timestamp` (Databricks)
  • `numeric` (AlloyDB) → `decimal` (Databricks)
  • `jsonb` (AlloyDB) → `string` (Databricks)
  • Arrays (`text[]`, etc.) may be handled as delimited strings or arrays in Databricks
Tip: Always review Databricks' mapping for source PostgreSQL/AlloyDB data types when designing ETL pipelines.
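One way to make these conversions explicit in a pipeline is a small mapping helper. The sketch below is a hypothetical illustration: the mapping table, the default `decimal` precision, and the string fallback are assumptions to validate against the Databricks documentation for your runtime version.

```python
# Hypothetical type-mapping helper for pipeline design; the mapping table
# and fallbacks are assumptions, not an authoritative Databricks reference.
import re

PG_TO_SPARK = {
    "timestamp with time zone": "timestamp",
    "timestamp without time zone": "timestamp",
    "jsonb": "string",
    "json": "string",
    "text": "string",
    "integer": "int",
    "bigint": "bigint",
    "boolean": "boolean",
    "double precision": "double",
}

def map_pg_type(pg_type: str) -> str:
    """Map a PostgreSQL/AlloyDB type name to a Spark SQL type name."""
    t = pg_type.strip().lower()
    # numeric(p, s) keeps its precision and scale as decimal(p, s)
    m = re.fullmatch(r"numeric\((\d+),\s*(\d+)\)", t)
    if m:
        return f"decimal({m.group(1)},{m.group(2)})"
    if t == "numeric":
        return "decimal(38,18)"  # assumed default; tune for your data
    # arrays such as text[] are flattened to a delimited string here
    if t.endswith("[]"):
        return "string"
    return PG_TO_SPARK.get(t, "string")  # fall back to string

print(map_pg_type("numeric(10,2)"))             # decimal(10,2)
print(map_pg_type("timestamp with time zone"))  # timestamp
print(map_pg_type("text[]"))                    # string
```

Driving column definitions from a single mapping function like this keeps the conversion policy in one place, so a change (for example, loading `jsonb` as a Databricks `variant` instead of `string`) only has to be made once.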

3. Initial and Incremental Loads: Best Pattern

The recommended pattern to replicate data from AlloyDB uses an initial full load, followed by incremental loads:

a. Initial Load

  • Use the Database Query component (such as in Matillion or another ETL tool) with no filter clause to extract all records.
  • Example (full table load):

SELECT * FROM sales_orders;

b. Incremental Load

  • For subsequent loads, only fetch data updated or created since the last load. This is done by adding a suitable filter clause using a timestamp or incrementing key.

SELECT * FROM sales_orders
    WHERE updated_at > '${lastSuccessfulLoadTimestamp}';

  • Replace `${lastSuccessfulLoadTimestamp}` with the value recorded after your last ETL run.

Note: Use the same Database Query component for both initial and incremental loads, adjusting only the query filter.

Read more on this strategy: Incremental Load Data Replication Strategy (Matillion Exchange)
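The initial-plus-incremental pattern can be sketched in a few lines. The example below uses Python's `sqlite3` so it is self-contained; in a real pipeline the connection would point at AlloyDB, and the watermark would be persisted by your ETL tool (for example, as a pipeline variable) rather than held in memory.

```python
# Minimal sketch of the high-water-mark pattern; sqlite3 stands in for
# AlloyDB so the example runs anywhere. Table and column names mirror the
# article's sales_orders example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_orders (order_id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO sales_orders VALUES (?, ?)", [
    (1, "2024-01-01T00:00:00"),
    (2, "2024-01-15T09:30:00"),
])

def extract(conn, last_watermark=None):
    """Initial load when last_watermark is None, incremental otherwise."""
    if last_watermark is None:
        rows = conn.execute(
            "SELECT order_id, updated_at FROM sales_orders").fetchall()
    else:
        rows = conn.execute(
            "SELECT order_id, updated_at FROM sales_orders WHERE updated_at > ?",
            (last_watermark,)).fetchall()
    # Record the new watermark only after the load succeeds.
    new_watermark = max((r[1] for r in rows), default=last_watermark)
    return rows, new_watermark

# Initial load fetches everything and records the high-water mark.
rows, wm = extract(conn)
print(len(rows), wm)   # 2 rows; watermark is the latest updated_at seen

# A later change arrives; the incremental load picks up only that row.
conn.execute("INSERT INTO sales_orders VALUES (3, '2024-02-01T12:00:00')")
rows, wm = extract(conn, last_watermark=wm)
print(rows)
```

Note that the watermark is advanced only from rows actually returned, so a failed run leaves it unchanged and the next run re-fetches the missed window.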


By following these patterns, you can robustly and efficiently extract data from Google AlloyDB for downstream analytics and machine learning use cases.

Data Integration Architecture

One of the key advantages of the ELT (Extract, Load, Transform) architecture is its approach to dividing the data integration process into discrete, manageable steps. By loading data into the Databricks environment prior to integrating it, organizations can "divide and conquer" complex integration tasks—first ensuring that all relevant data is available, and then focusing on transformation and integration within the platform. Data integration inherently requires transformation, and the most effective way to manage this is through the use of data transformation pipelines, which allow for repeatable, scalable, and automated processing of diverse data sets.

An additional benefit of the ELT approach is that all transformation and integration are performed directly within the target Databricks database. This not only maximizes performance by leveraging Databricks's distributed processing capabilities, but also enables on-demand scalability and alleviates the need for additional data movement or investment in separate data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.