Data in a MongoDB database can have many nested elements, including both arrays and structs. Some of the top-level objects can be flattened via the MongoDB component, but this still leaves many nested arrays in the data. Here we look at how to flatten out this data in a Transformation job using Snowflake’s JSON handling functionality.
An example of the format is below. Here the many nested arrays and structs can be seen.
Snowflake has native functionality for handling the nested JSON using the Flatten function. This expands out the nested data in a table to create one row per array or struct which is flattened. Each array or struct element can also be directly referred to using Snowflake’s JSON Query syntax.Matillion has a Flatten Variant component which takes advantage of the Flatten and JSON functionality available in Snowflake and makes handling this nested data easy.
Load the MongoDB data to S3
Step one is to extract the data from MongoDB load the data into a Snowflake table. The Snowflake table should be a single column table with the column type VARIANT. This is because the Snowflake JSON functions need to be performed on a variant. We recommend creating the table with the single variable column before load and then using the Matillion MongoDB Query component to load the data from the MongoDB database into the existing table. The contents of the MongoDB collection are loaded as a single column table due to the SELECT document(*) statement:
Now the data exists in a Snowflake table, it can be flattened using the Flatten Variant component.
The arrays and structs to be flattened are defined in the Column Flattens property:
This expands out the 1000 rows of data loaded into over 6 million records by cross joining each array back onto the original table.
The Column Mapping property is used to define which items from the variant column to map into separate columns. From within here nested items can be extracted and their data types can be defined:
Below is a sample of the flattened data:
Here we looked at how to use the power of Snowflake to handle nested data from a MongoDB source. The benefits of using Matillion to Flatten the data include:
- Simple and fast to implement
- Fast job runs
- A more scalable process which is more accommodating for changing data structures
Export of the job and the sample data set is available here.