Modern companies with a diverse set of data-generating applications are collecting more varieties of semi-structured and nested data. This data needs to be stored in a data warehouse or data lake until you are ready to utilize it for analytics. Data will often need to be flattened into a tabular format to facilitate joining to other data and for analysis, not to mention to support downstream reporting and analytics packages that don’t understand semi-structured data. Matillion ETL version 1.40 now supports the ability to load and flatten Structs (nested fields) and Arrays (repeated fields) in BigQuery as well as create Structs and Arrays as required.
In this blog, we will look at how you can use Matillion support for BigQuery Structs and Arrays to better handle and utilize your semi-structured and nested data. This blog uses an example data set containing information about the users of a fictional system. This data is held in a semi-structured format such as the below:
Create Table with Structs
We will load the data set into a table in BigQuery using the Cloud Storage Load component. First, the structure of the table that we will use to load the data into will need to be defined using the Create Table component. Within the Table Metadata grid property, there is now an option available to Define Nested Metadata.
Check this box and the tree view becomes available. New columns can be added using the right-click option:
New columns can be defined as Fields, Structs or Arrays and this can be used to build up a nested tree to represent the structure of the data. Below we have created a structure for the user data referenced above, which contains a nested Struct called ‘name’ and three arrays:
- ‘Tags’, a string array
- ‘Range’, an integer array
- ‘Friends’, an array of structs
Data can now be loaded into this table using the Cloud Storage Load component:
Using a Transformation job, you can view the top-level Struct and see that it has been flattened out but the ‘name’ struct is still stored as nested data as are the arrays:
Extract Nested Data
To completely flatten the data, we can use the new Extract Nested Data component. In this component, simply select the columns to be flattened. For example, the picture below shows how this will flatten the ‘name’ struct and the tags and range array:
By clicking on the component and sampling the result shows, we can confirm that there are a lot more rows available. This is because each array element will become one row via an unnest. Here we have jumped from 6 rows to 300 rows:
As well as the ability to flatten nested data, we have added a new component which will create structs. For example, suppose we want all contact details for the user in a struct. This can be done with the new Construct Struct component. Simply select the columns to go into the Struct and Matillion will create it:
Sampling the data will show the new Struct has been created and it contains 3 fields:
Now that your data has been flattened, structured and written back out to Google BigQuery, it is ready for further transformation and analysis as needed.
To see in action how these new features work in Matillion, watch our video on Managing Semi-Structured and Nested Data in Matillion ETL for BigQuery.
Thank you to Darlington for all of his hard work on this.