
    How to Flatten Nested Arrays from MongoDB with Matillion ETL for Amazon Redshift

    Problem Statement

    A customer came to us with a large volume of data in MongoDB containing many nested elements, including both arrays and structs. Their previous workflow loaded this data into an Amazon Redshift table. The MongoDB component within Matillion ETL for Amazon Redshift would flatten the top-level objects, but this still left many nested arrays. A Transformation job was therefore used to flatten the remaining arrays and JSON structures by cross joining to a sequencer table to pick out each array element (a sketch of this pattern follows the list below). This process made extensive use of the following Amazon Redshift functions:

    • JSON_EXTRACT_ARRAY_ELEMENT_TEXT
    • JSON_EXTRACT_PATH_TEXT
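
    A minimal sketch of that pattern, using hypothetical table and column names (raw_documents holds the semi-flattened documents, and seq holds the integers 0 through N-1):

        -- Cross join every document row to the sequencer so that each
        -- array index up to the array's length yields one output row.
        SELECT r.id,
               JSON_EXTRACT_PATH_TEXT(
                 JSON_EXTRACT_ARRAY_ELEMENT_TEXT(r.orders, seq.i),
                 'order_id'
               ) AS order_id
        FROM raw_documents r
        CROSS JOIN seq
        WHERE seq.i < JSON_ARRAY_LENGTH(r.orders);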


    This approach to processing the data had several drawbacks:

    • Slow performance
    • Risk of truncated data, because extracted JSON values were constrained by Amazon Redshift varchar column limits
    • Risk of missing data if the sequencer table contained fewer rows than the longest array

    This is what the job looked like in Matillion ETL for Amazon Redshift:


    An example of the data format is below, showing the many nested arrays and structs.
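
    For illustration, here is a hypothetical document of this shape; the invented field names are reused in the examples that follow:

        {
          "_id": "0001",
          "name": { "given": "Jane", "family": "Doe" },
          "orders": [
            { "order_id": "A100", "price": 9.99 },
            { "order_id": "A101", "price": 24.50 }
          ]
        }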

    Using Amazon Redshift Spectrum

    Since Matillion ETL for Amazon Redshift is purpose-built for Amazon Redshift, it can leverage native Spectrum functionality to handle the nested JSON. Further details on Amazon Redshift Spectrum are available in the AWS documentation.

    Matillion can take advantage of this functionality by defining the external table to match the nested JSON and then using the Nested Data Load component to reference the external nested data and load it into an internal table.

    Here we discuss a simple three-step approach which flattens the nested structures and arrays and makes the job more scalable:

    Upload data files to S3

    Step one is to extract the data from MongoDB and upload it to S3 as an Amazon Redshift Spectrum table. This can be done using the Matillion MongoDB Query component with an external schema selected. The contents of the MongoDB collection are loaded as a single-column table due to the SELECT document(*) statement:
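
    For the hypothetical customers collection, the component's query takes this form:

        -- document(*) returns each MongoDB document whole, so the
        -- collection lands in Spectrum as a single-column table.
        SELECT document(*) FROM customers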


    Define Nested Columns

    Now that the data exists in a Spectrum table, it's time to define the nested columns. This is done by defining a new external table on top of the files which exist in S3. In the Matillion Create External Table component, define the Table Metadata with nested columns matching the JSON structure from MongoDB:
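
    The resulting DDL would look something like this sketch, assuming the hypothetical document shape above, with invented schema and bucket names and the OpenX JSON SerDe as one common choice for JSON files:

        -- Nested columns are declared with STRUCT and ARRAY types so
        -- Spectrum can navigate into the JSON.
        CREATE EXTERNAL TABLE spectrum_schema.customers (
          _id    VARCHAR(24),
          name   STRUCT<given:VARCHAR(50), family:VARCHAR(50)>,
          orders ARRAY<STRUCT<order_id:VARCHAR(10), price:DOUBLE PRECISION>>
        )
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
        LOCATION 's3://my-bucket/mongo-export/';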


    This allows each element in each struct or array to be individually referenced.

    Create an internal Amazon Redshift table

    The Nested Data Load component can be used to flatten this out into a non-Spectrum table. In this component, select the columns required in the output table, the name of the new table to create, and the join conditions. Joins are required to capture all of the array elements against their corresponding parent values.

    Note: this means the number of rows will increase, since each original row produces one output row per element in its array.
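
    In plain SQL, the equivalent of what this component produces is a query along these lines, continuing the hypothetical customers example (Amazon Redshift extends the FROM clause so a query can range over arrays in nested Spectrum tables):

        -- LEFT JOIN ... ON TRUE unnests the orders array while keeping
        -- parent rows whose array is empty; each element becomes a row.
        CREATE TABLE customers_flat AS
        SELECT c._id,
               c.name.given,
               c.name.family,
               o.order_id,
               o.price
        FROM spectrum_schema.customers c
        LEFT JOIN c.orders o ON TRUE;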


    Sample data

    Below is a sample of the flattened data in a Transformation job:
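
    Continuing the hypothetical example, the two order elements in the single input document become two output rows:

        _id  | given | family | order_id | price
        -----+-------+--------+----------+------
        0001 | Jane  | Doe    | A100     |  9.99
        0001 | Jane  | Doe    | A101     | 24.50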

    Running the MongoExport Command

    An alternative to using the MongoDB component is to run mongoexport, which exports the data directly from MongoDB to a file that can then be stored in S3. This can be done with a simple Bash script run from a Bash Script component. In our testing with the nested data, this method was quicker than the MongoDB component:
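
    A minimal sketch of such a script, with hypothetical host, database, collection, and bucket names (mongoexport writes one JSON document per line, which suits the external table definition above):

        # Export the collection to newline-delimited JSON, then copy to S3.
        mongoexport --host mongo.example.com --db mydb \
          --collection customers --out /tmp/customers.json
        aws s3 cp /tmp/customers.json s3://my-bucket/mongo-export/customers.json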


    Conclusion

    Here we looked at how to use the power of Amazon Redshift Spectrum to handle nested data from MongoDB.

    After implementing the job changes, our customer achieved the following benefits:

    • Faster job runs
    • Simpler and easier to follow job development
    • Fewer runs where data was unintentionally truncated
    • A more scalable process that better accommodates changing data structures



    Learn more about Amazon Redshift Spectrum in our ebook.