Connecting Matillion ETL to Google BigQuery: A Guide

CTB 0

This guide is a walk-through of how to connect Matillion ETL to Google BigQuery.

 

In Matillion ETL, the metadata for connecting to Google BigQuery is held in an artifact known as an Environment. Matillion ETL Environments can also hold additional information that is used during data extraction and loading.

 

Prerequisites for connecting to Google BigQuery

The prerequisites for connecting Matillion ETL to Google BigQuery are:

 

 

You must enable APIs for the GCP project, including at least the following:

  • The BigQuery API
  • The Identity and Access Management (IAM) API

 

To check which APIs have been enabled, follow APIs and services from your GCP console like this:

 

 

Google BigQuery identifiers

All Google Cloud services – including Google BigQuery – are managed within the context of a Google Cloud project. You can find your project name from the dropdown list at the top of the GCP console as shown below. The dropdown lists all the projects you can access.

 

 

Every project has two alternative unique identifiers:

 

  • Project ID – alphanumeric, you can influence the value when creating a project
  • Project Number – long integer, generated automatically

 

To list all your project IDs and numbers you can use Google Cloud Shell. Launch it from the GCP console by following the Activate Cloud Shell link at the top right of the screen.

 

Once your cloud shell terminal becomes active, run a gcloud projects list command. You will find all the project IDs listed, along with their project number:

 

 

To list the datasets within your currently selected project, you can run a bq ls command from the cloud shell.

 

 

By default only the first 50 datasets are listed. If you have more than that, you can raise the limit by adding for example –max_results 500.

 

How to connect a Matillion ETL Environment to Google BigQuery

To create or edit a Matillion ETL Environment, start from the main Project menu, or from the context menu in the Environments panel.

 

 

This action will pop up a Create or an Edit Environment window.

 

In this dialog, choose a name for the Environment, and optionally set any AWS, Azure, or GCP credentials that you intend to use.

 

Then choose your Default Project and Default Dataset from the dropdown lists. While you are completing this information, make certain of two things:

 

  • The dropdown lists for the Default Project and Default Dataset both show values that you can select
  • After selecting the values, you get a success message when you press the Test button

 

If the dropdown lists are empty, you will need to re-check the prerequisites. It is likely that either your GCP project does not have the necessary APIs enabled, or that the service account is insufficiently privileged.

 

Press the Test button and make sure you see a success message before finishing the Environment Setup dialog

 

Connectivity metamodel for Matillion ETL and Google BigQuery

Matillion ETL stores Google BigQuery connection details inside an Environment object. The Environment contains the project ID and the dataset ID.

 

Environments can optionally also hold additional cloud provider information, in the form of AWS, Azure, and GCP credentials. This information is used by various orchestration components such as the Data Transfer.

 

Matillion ETL Environments exist within a Project and a Group. Orchestration and Transformation Jobs are versioned within Matillion ETL projects. Jobs always execute within the context of exactly one Environment.

 

 

For Google, the core administrative entity is the account, also known as an organization. You create projects within an account, and they are allocated a globally unique project ID. All project contents are completely isolated from each other by default.

 

You create Principals – such as Service Accounts – within projects. Service Accounts can be  granted Roles, such as the ability to use Google BigQuery. Matillion uses Service Accounts to obtain GCP privileges at runtime.

 

You can enable individual APIs – such as the BigQuery API – for projects. Datasets can be created within a Project provided the BigQuery API has been enabled, and that billing has been switched on for the project.

 

 

Once you’ve connected Matillion ETL and Google BigQuery

 

You can start to build workflows in Matillion ETL

 

Learn more about using Matillion ETL with Google BigQuery

 

If you want to read more about how Matillion ETL and Google BigQuery work together in the cloud, download our ebook, Optimizing BigQuery: A Real-World Guide

 

Download the ebook