Blog

Matillion ETL for BigQuery Supports Structured Data with Structs and Arrays

Modern companies with a diverse set of data-generating applications are collecting more varieties of semi-structured and nested data. This data needs to be stored in a data warehouse or data lake until you are ready to utilize it for analytics. Data will often need to be flattened into a tabular format to facilitate joining to other data and for analysis, not to mention to support downstream reporting and analytics packages that don’t understand semi-structured data. Matillion ETL version 1.40 now supports the ability to load and flatten Structs (nested fields) and Arrays (repeated fields) in BigQuery as well as create Structs and Arrays as required.

In this blog, we will look at how you can use Matillion support for BigQuery Structs and Arrays to better handle and utilize your semi-structured and nested data. This blog uses an example data set containing information about the users of a fictional system. This data is held in a semi-structured format such as the below:

 

Matillion ETL for BigQuery - Semi-structured format JSON

Create Table with Structs

We will load the data set into a table in BigQuery using the Cloud Storage Load component. First, the structure of the table that we will use to load the data into will need to be defined using the Create Table component. Within the Table Metadata grid property, there is now an option available to Define Nested Metadata.

 

Matillion ETL for BigQuery-Create Table-Table Metadata

 

Check this box and the tree view becomes available. New columns can be added using the right-click option:

 

Matillion ETL for BigQuery-Table Metadata-Add Columns

 

New columns can be defined as Fields, Structs or Arrays and this can be used to build up a nested tree to represent the structure of the data. Below we have created a structure for the user data referenced above, which contains a nested Struct called ‘name’ and three arrays:

  • ‘Tags’, a string array
  • ‘Range’, an integer array
  • ‘Friends’, an array of structs

 

Matillion ETL for BigQuery - Create Table - Define Nested Metadata

 

Data can now be loaded into this table using the Cloud Storage Load component:

 

Matillion ETL for BigQuery- Load Table into Cloud Storage Load

 

Using a Transformation job, you can view the top-level Struct and see that it has been flattened out but the ‘name’ struct is still stored as nested data as are the arrays:

 

Matillion ETL for BigQuery- Transformation Job - View Structs

 

Matillion ETL for BigQuery - Transformation job - Structs and Arrays

 

Extract Nested Data

To completely flatten the data, we can use the new Extract Nested Data component. In this component, simply select the columns to be flattened. For example, the picture below shows how this will flatten the ‘name’ struct and the tags and range array:

 

Matillion ETL for BigQuery - Extract Nested Data - Column Names

 

By clicking on the component and sampling the result shows, we can confirm that there are a lot more rows available. This is because each array element will become one row via an unnest. Here we have jumped from 6 rows to 300 rows:

 

 

Construct Struct

As well as the ability to flatten nested data, we have added a new component which will create structs. For example, suppose we want all contact details for the user in a struct. This can be done with the new Construct Struct component. Simply select the columns to go into the Struct and Matillion will create it:

 

Matillion ETL for BigQuery - Construct Struct

 

 

Matillion ETL for BigQuery - Construct Struct - Column Mapping

 

Sampling the data will show the new Struct has been created and it contains 3 fields:

 

Matillion ETL for BigQuery - Create New Structs

 

Now that your data has been flattened, structured and written back out to Google BigQuery, it is ready for further transformation and analysis as needed.

To see in action how these new features work in Matillion, watch our video on Managing Semi-Structured and Nested Data in Matillion ETL for BigQuery.

 

 

 

Thank you to Darlington for all of his hard work on this.