Integrate data from Google BigQuery to Databricks using Matillion

Our Google BigQuery to Databricks connector seamlessly transfers your data to Databricks within minutes, ensuring it remains up to date without requiring manual coding or complex ETL processes.


Extracting data from Google BigQuery to Databricks

Extracting data from Google BigQuery is a fundamental step when organizations seek to integrate their cloud-based analytics with other modern data platforms, such as Databricks. This process enables teams to capitalize on the powerful processing capabilities of Databricks while leveraging data stored in BigQuery. In this article, we will guide you through the essential steps of securely and efficiently moving your data from Google BigQuery into Databricks. We will begin by outlining how to establish the necessary identity in Google BigQuery to enable trusted connections. Next, for users leveraging Matillion, we will review how to check for, or acquire, the appropriate JDBC driver required to facilitate the connection. We will also discuss the networking considerations involved in establishing proper connectivity between the source and target environments. Finally, we will detail best practices for querying your data—both for the initial full load and for ongoing incremental updates. By following this guide, you will be able to streamline your data workflow and ensure reliable, scalable data transfers between Google BigQuery and Databricks.


What is Google BigQuery?

Google BigQuery is a fully managed, serverless data warehouse built for large-scale analytics. Using the Dremel engine, it runs ANSI SQL queries efficiently across petabytes of structured and semi-structured data. Its architecture separates compute and storage for elastic scaling and cost control, offering pay-per-query or flat-rate pricing. BigQuery features native Google Cloud integration, built-in machine learning (BigQuery ML), data partitioning, clustering, streaming ingestion, and supports standard JDBC/ODBC connectors. These capabilities make it a robust, flexible platform for enterprises requiring high-performance analytics and easy integration with modern workflows.


What is Databricks?

Databricks is a unified data analytics platform built on top of Apache Spark, integrating data engineering, science, and analytics into a collaborative environment. Its database capabilities leverage a robust Delta Lake storage layer, offering ACID transactions, scalable metadata handling, and efficient data versioning. Databricks databases, often referred to as schemas, organize and manage structured and semi-structured data, supporting SQL queries, streaming, and advanced machine learning workflows. Tight integration with cloud object storage and native support for lakehouse architecture enable seamless ingestion, processing, and analysis of massive datasets, making Databricks a versatile solution for modern data-driven organizations.

Why Move Data from Google BigQuery into Databricks

The Case for Replicating Google BigQuery Data in Databricks for Analytics

A data engineer or architect may decide to copy data from Google BigQuery into Databricks for several strategic reasons. First, BigQuery often stores data that is potentially valuable for analytics or machine learning, but its true value is often realized when integrated with data residing in other systems or platforms. By bringing BigQuery data into Databricks, organizations can seamlessly combine it with structured, semi-structured, or unstructured data from a wide variety of sources, enabling richer insights and more comprehensive data models. Additionally, leveraging Databricks for this integration and subsequent analytics ensures that the computational workload remains off of BigQuery, helping to manage costs and maintain performance for other users and processes within the Google Cloud environment. This approach ultimately supports scalable and flexible data engineering pipelines, optimized for both performance and collaboration.

Creating an Identity in Google BigQuery

Google BigQuery does not manage users directly within the database using SQL scripts like traditional RDBMS platforms. Instead, identity and access are controlled through Google Cloud IAM (Identity and Access Management), where you assign roles and permissions to Google identities (such as Google accounts, service accounts, or Google Groups).

Below are steps to grant a user access to your BigQuery project or dataset.

1. Identify the User or Service Account

You will need the email address of the user, Google Group, or service account you wish to grant access to, e.g., [email protected].

2. Assigning Roles via Google Cloud Console

  1. Open the Google Cloud Console.
  2. Navigate to IAM & Admin > IAM.
  3. Click Add.
  4. Enter the user’s email address.
  5. Select a BigQuery role (e.g., BigQuery Data Viewer, BigQuery Data Editor, or BigQuery Admin).
  6. Click Save.

Alternatively, to grant dataset-level access:

  • Go to BigQuery in the console.
  • Find and select your dataset.
  • Click SHARE DATASET.
  • Add the member’s email and assign a dataset-level permission (e.g., roles/bigquery.dataViewer).

3. Granting Access Using the gcloud CLI

You can also use the Google Cloud CLI to assign roles:

```shell
gcloud projects add-iam-policy-binding [PROJECT_ID] \
  --member="user:[email protected]" \
  --role="roles/bigquery.dataEditor"
```

  • Replace [PROJECT_ID] with your Google Cloud project ID.
  • Adjust the role as required.

4. Granting Access to a Specific Dataset Using SQL

Within BigQuery, you can execute SQL to grant dataset-level access to specific users using the GRANT statement (for datasets that use access controls):

```sql
GRANT `roles/bigquery.dataEditor`
ON SCHEMA `project_id.dataset_id`
TO "user:[email protected]";
```

Note: This requires that you have the proper IAM permissions to modify dataset access.

Common BigQuery Roles

  • roles/bigquery.dataViewer — Read-only access
  • roles/bigquery.dataEditor — Read and write access to data
  • roles/bigquery.user — Ability to run queries, create datasets

Additional roles (see documentation for details):
https://cloud.google.com/bigquery/docs/access-control

Important Considerations

  • “Users” in BigQuery are Google Cloud identities.
  • Role assignments can be given at the project or dataset level.
  • There is no in-database CREATE USER statement as in traditional SQL databases; user management is externalized via Google Cloud IAM.

Installing the JDBC Driver

For Matillion Data Productivity Cloud users integrating with Google BigQuery, a JDBC driver is required to enable connectivity between the platform and your BigQuery instance. Due to current licensing or redistribution restrictions, the BigQuery JDBC driver is not bundled with Matillion Data Productivity Cloud. Instead, users must download and upload it manually as an external driver to their Matillion environment.

The following steps outline the process for obtaining and installing the Google BigQuery JDBC driver:

Step 1: Download the BigQuery JDBC Driver

  1. Visit the official Google BigQuery JDBC driver download page, provided by Simba Technologies, at:
    https://cloud.google.com/bigquery/providers/simba-drivers/

  2. On the download page, identify the download for the Type 4 JDBC driver. Type 4 drivers are pure Java drivers, ideal for cloud-based platforms such as Matillion Data Productivity Cloud.

  3. Review the licensing terms to ensure compliance before proceeding with the download.

  4. Download the driver package (typically a .jar file) to a secure and accessible location on your device.

Step 2: Upload the Driver to Matillion Data Productivity Cloud

  1. Refer to the Matillion Data Productivity Cloud documentation for instructions on uploading external drivers:
    https://docs.matillion.com/data-productivity-cloud/agent/docs/uploading-external-drivers/

  2. Using the guidance provided:
    • Access the Matillion Agent interface or appropriate configuration area as described.
    • Locate the option to upload a new JDBC driver.
    • Upload the previously downloaded BigQuery JDBC driver .jar file.
    • Ensure that the driver is enabled and properly registered within the Agent environment.

  3. After uploading, refresh or restart any relevant services as per the documentation to ensure the new driver is recognized and ready for use.

Step 3: Connecting to BigQuery

  1. Once installed, you can configure and use the new driver to connect to Google BigQuery within your Matillion Data Productivity Cloud workflows.

  2. For specifics on how to utilize the driver in database query orchestration, follow Matillion’s usage guide:
    https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/

By following these steps, you ensure that Matillion Data Productivity Cloud can establish secured and fully functional JDBC connections to Google BigQuery using your custom-installed JDBC driver.
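Once the driver is in place, the connection itself is defined by a JDBC URL. As a rough sketch of what that URL looks like for the Simba BigQuery driver (property names follow the driver's documented format, but verify them against the installation guide for your driver version; all values below are placeholders):

```python
# Sketch: assemble a Simba-style BigQuery JDBC connection string.
# Property names follow the driver's documented key=value format;
# every value here is a placeholder, not a real credential.
def build_bigquery_jdbc_url(project_id: str,
                            service_account_email: str,
                            key_path: str) -> str:
    host = "https://www.googleapis.com/bigquery/v2"
    props = {
        "ProjectId": project_id,
        "OAuthType": "0",  # 0 = service-account (key file) authentication
        "OAuthServiceAcctEmail": service_account_email,
        "OAuthPvtKeyPath": key_path,
    }
    prop_str = ";".join(f"{k}={v}" for k, v in props.items())
    return f"jdbc:bigquery://{host}:443;{prop_str};"

url = build_bigquery_jdbc_url(
    "my-project",
    "[email protected]",
    "/secure/key.json",
)
print(url)
```

Keeping the URL assembly in one place makes it easy to swap authentication modes (for example, moving from a key file to workload identity) without touching the rest of the pipeline configuration.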

Checking network connectivity

To ensure successful connectivity between Matillion Data Productivity Cloud and your Google BigQuery database, you must configure your network settings to allow incoming connections based on your deployment type:

  • Full SaaS agent configuration:
    Allow incoming connections from the set of IP addresses used by the Matillion Data Productivity Cloud SaaS agents. The current list of required IP addresses can be found here. You must add these addresses to your allowlist or firewall rule set as needed for BigQuery access.

  • Hybrid SaaS deployment:
    Incoming connections should be permitted from the IP addresses that originate within your own Virtual Private Cloud (VPC). This allows the Hybrid SaaS agent to communicate directly with Google BigQuery. If you need to validate your setup, you can use a suite of network access checking utilities available at the Matillion Exchange.

Additionally, if the Google BigQuery database is being accessed via a DNS hostname, ensure that your Full SaaS or Hybrid SaaS agent can successfully resolve the DNS address to connect. Proper DNS resolution is critical for establishing and maintaining connectivity.
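A quick way to confirm DNS resolution from the agent's environment is a small standalone check like the following sketch (the hostname shown is only an example; substitute whichever endpoint your driver actually connects to):

```python
import socket

# Sketch: verify that this environment can resolve a hostname before
# attempting a JDBC connection. A False result points to a DNS or
# network-egress problem rather than a driver or credential issue.
def can_resolve(hostname: str) -> bool:
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

# Example hostname only -- use the endpoint your JDBC URL targets.
print(can_resolve("www.googleapis.com"))
```

If resolution fails, check the agent's DNS configuration and any private-DNS or firewall rules before debugging the JDBC connection itself.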

Querying Data from a Google BigQuery Database

This guide explains how to query data from a Google BigQuery database, including example SQL queries, considerations for data type compatibility (especially when interacting with Databricks), and best patterns for initial and incremental loading.


Examples of Google BigQuery SQL SELECT Statements

To retrieve data from BigQuery, you typically use standard SQL. Below are example queries, assuming a dataset called my_dataset and a table called sales_orders.

```sql
-- Select all columns from the table
SELECT *
FROM `project_id.my_dataset.sales_orders`;

-- Select specific columns and apply filters
SELECT order_id, customer_id, amount, order_date
FROM `project_id.my_dataset.sales_orders`
WHERE amount > 100;

-- Aggregate results
SELECT customer_id, SUM(amount) AS total_spent
FROM `project_id.my_dataset.sales_orders`
GROUP BY customer_id;
```

Note: Replace project_id, my_dataset, and sales_orders with your actual project, dataset, and table names.


Datatype Conversion Between Google BigQuery and Databricks

When loading or querying data from BigQuery into Databricks, watch out for data type compatibility. For example:

  • BigQuery STRING <--> Databricks STRING
  • BigQuery INT64 <--> Databricks LONG
  • BigQuery FLOAT64 <--> Databricks DOUBLE
  • BigQuery TIMESTAMP <--> Databricks TIMESTAMP
  • BigQuery BOOLEAN <--> Databricks BOOLEAN

Opaque or complex datatypes such as BigQuery's STRUCT or ARRAY may require additional handling or flattening to work correctly in Databricks. Always review automatic type mappings and consider explicit casting in your SELECT statements, for example:

```sql
SELECT
    CAST(order_id AS STRING) AS order_id_str,
    SAFE_CAST(amount AS FLOAT64) AS order_amount
FROM `project_id.my_dataset.sales_orders`;
```
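The type correspondences listed above can also be captured in a small lookup table, which is handy when generating target DDL for many tables at once. This is an illustrative sketch only; confirm the actual mappings your driver and load process produce:

```python
# Illustrative mapping of common BigQuery types to Databricks (Spark SQL)
# types, per the correspondence table above. Complex types such as
# STRUCT and ARRAY are deliberately absent: they need per-case handling.
BQ_TO_DATABRICKS = {
    "STRING": "STRING",
    "INT64": "LONG",
    "FLOAT64": "DOUBLE",
    "TIMESTAMP": "TIMESTAMP",
    "BOOLEAN": "BOOLEAN",
}

def map_type(bq_type: str) -> str:
    """Return the Databricks type for a BigQuery type, or fail loudly."""
    try:
        return BQ_TO_DATABRICKS[bq_type.upper()]
    except KeyError:
        raise ValueError(
            f"No automatic mapping for {bq_type}; handle it explicitly"
        )

print(map_type("INT64"))   # LONG
print(map_type("float64")) # DOUBLE
```

Failing loudly on unmapped types is deliberate: silently defaulting a STRUCT or ARRAY column to STRING tends to surface much later as corrupted analytics rather than as a clear load error.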


Data Loading Patterns: Initial & Incremental Loads

The recommended pattern for loading data from BigQuery is:

  1. Initial Load: One-time full copy (typically used to populate a target system for the first time).
  2. Incremental Loads: Ongoing, smaller updates (based on new or changed records).

A Matillion Exchange article provides detailed guidance on this approach.

Using the Database Query Component

Use the same Database Query component for both load types, but vary your SQL WHERE clause:

  • Initial Load: Run the query without any filters.

```sql
SELECT *
FROM `project_id.my_dataset.sales_orders`;
```

  • Incremental Load: Add a filter using a high water mark (e.g., a last-modified timestamp or incrementing key).

```sql
SELECT *
FROM `project_id.my_dataset.sales_orders`
WHERE updated_at > '2024-06-01 00:00:00'; -- replace with a dynamic value
```

The value for the filter (updated_at > ...) should be parameterized based on the last successful load.
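The initial/incremental pattern can be sketched as a small query builder driven by the stored high water mark. The table and column names are the placeholders used throughout this article; in Matillion the watermark value would typically come from a job variable rather than a local one:

```python
from typing import Optional

# Sketch: derive the load query from the last recorded high water mark.
# A missing watermark (None) means no successful load has happened yet,
# so we fall back to the unfiltered initial-load query.
def build_load_query(table: str, hwm_column: str,
                     last_loaded: Optional[str]) -> str:
    base = f"SELECT * FROM `{table}`"
    if last_loaded is None:
        return base + ";"  # initial load: full copy, no filter
    return f"{base} WHERE {hwm_column} > '{last_loaded}';"

print(build_load_query("project_id.my_dataset.sales_orders",
                       "updated_at", None))
print(build_load_query("project_id.my_dataset.sales_orders",
                       "updated_at", "2024-06-01 00:00:00"))
```

After each successful incremental run, the stored watermark should be advanced to the maximum updated_at value actually loaded, so the next run picks up exactly where this one left off.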


For more information on best practices and advanced strategies, consult Matillion’s incremental load guide.

Data Integration Architecture

Loading data in advance of integration represents a strategic 'divide and conquer' approach, breaking the overall task into two manageable steps: first loading raw data into the Databricks database, and then transforming and integrating it as needed. This separation is a key advantage of the ELT (Extract, Load, Transform) architecture. Data integration inherently involves transforming data to meet analytical or operational requirements, and data transformation pipelines offer an efficient and reliable mechanism to automate these processes. Another significant benefit of the ELT architecture is that both transformation and integration activities take place directly within the target Databricks database environment. This ensures seamless, rapid, and scalable processing of data, allowing organizations to perform transformations on-demand without incurring the costs or administrative overhead associated with additional external data processing infrastructure.

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.