API data integration in a Designer Orchestration Pipeline: From raw data to structured tables

In earlier parts of this series, I created a Matillion Custom Connector to send REST API calls to the SpaceX data API. I then used iterators and variables to parameterize those calls and load the raw responses from the API into Snowflake tables.

Now that we have the raw data in Snowflake, this blog will outline how to leverage Transformation pipelines and logic in Matillion Designer to extract the raw responses from the API and turn this raw semi-structured data into rows and columns in Snowflake.

Handling semi-structured data from REST APIs

Matillion's Custom Connector will often extract the raw JSON / nested data in the API response automatically when it loads the data into a table in Snowflake.

Using part 2 as an example, we ended up with a few ‘mission_*’ tables, such as the mission_6C42550 table. Directly after running the SpaceX Custom Connector in part 2, the table data ended up like this in Snowflake without any intervention or custom transformation logic:

As we can see, the table already contains rows and columns for the corresponding response; the data has been broken out for us automatically. Awesome! But what can we do when the Designer/Custom Connector does not break the table data into rows and columns for us?

In most cases, when the data is not extracted natively, it will be loaded into a Snowflake table with a single column (typically named ‘DataValue’) of type VARIANT and one or more rows. Each row will contain a raw JSON response from the API, such as:
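To make this concrete, here is a minimal Python sketch of what one of those rows looks like conceptually: the entire JSON document sits in a single DataValue column. The mission values below are illustrative, not actual rows from the SpaceX API.

```python
import json

# Hypothetical raw API response as it might land in a single
# VARIANT column (conventionally named DataValue).
raw_response = json.dumps({
    "mission_name": "Telstar 19V",
    "mission_id": "F4F83DE",
    "manufacturers": ["SSL"],
    "payload_ids": ["Telstar 19V"],
})

# Each table row holds the whole JSON document in one column:
# there are no per-key columns yet.
row = {"DataValue": raw_response}
print(row["DataValue"])
```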

Luckily, Designer has native components to extract this nested data and turn this raw JSON data into rows and columns using a transformation pipeline.

To do this, we will read the table that contains the raw data using a Table Input component. Then, we will drag and drop an Extract Nested Data component onto the canvas and start to configure it:

The main configuration point of the Extract Nested Data component is the ‘Columns’ property, which inspects the response and creates a column for each key in the JSON returned by the API. It also provides the option to alter the data types of the elements in the JSON structure, so you can shape the table to your requirements.
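In spirit, the component turns each single-column JSON row into a multi-column record, one column per top-level key. A minimal Python sketch of that idea (this is an illustration of the concept, not Designer's actual implementation):

```python
import json

def extract_nested(raw_rows):
    """Flatten single-column JSON rows into multi-column records,
    one column per top-level key in the document."""
    records = []
    for raw_row in raw_rows:
        doc = json.loads(raw_row["DataValue"])
        records.append({key: doc.get(key) for key in doc})
    return records

# Example usage with a hypothetical raw row:
rows = [{"DataValue": '{"mission_name": "Iridium NEXT", "payload_ids": ["Iridium 1"]}'}]
print(extract_nested(rows))
```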

Autofill nested data

Open the columns property in the Extract Nested Data component and click on ‘Autofill’. Autofill will automatically parse the raw JSON in the table and provide a breakdown of the elements in the JSON object held in the table:
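Conceptually, Autofill walks the JSON document and derives a column name and data type for each element. A rough Python sketch of that kind of inference, using a simplified mapping from Python types to Snowflake-style type names (the mapping is an assumption for illustration, not Autofill's actual rules):

```python
import json

def autofill_columns(raw_json):
    """Infer a column name and a rough data type for each
    top-level key, similar in spirit to what Autofill derives."""
    type_map = {
        str: "VARCHAR",
        bool: "BOOLEAN",
        int: "NUMBER",
        float: "FLOAT",
        list: "VARIANT",   # nested structures stay semi-structured
        dict: "VARIANT",
    }
    doc = json.loads(raw_json)
    return {key: type_map.get(type(value), "VARIANT")
            for key, value in doc.items()}

# Example usage with a hypothetical response:
print(autofill_columns('{"mission_name": "SES-12", "flight": 1, "payload_ids": ["SES-12"]}'))
```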

While Autofill makes this process exceptionally easy for users, it's always recommended that you double-check the autofill results to ensure that all elements are present and all data types are accurate. 

If you need to alter any of the configuration Autofill puts in place, click the three dots next to an element. This lets you change either the alias of the element or its data type:

Once everything looks as expected, click the Save button in the column editor.

The other main property I want to call attention to in the component is the ‘Include Input Columns’ property:

This property allows you to include the raw input columns in the extracted table if needed. For example, with the component fully configured and Include Input Columns set to Yes, sampling the data from the component produces the following output:

Notice that the DataValue column is present in the output. Setting the property to No produces the exact same extracted data, but without the DataValue column:
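Extending the earlier flattening sketch, the property behaves like a simple toggle: keep the raw column alongside the extracted ones, or drop it. Again, this is an illustration of the behavior, not Designer's implementation:

```python
import json

def extract(rows, include_input_columns):
    """Flatten JSON rows; optionally keep the raw DataValue
    column alongside the extracted columns, mirroring the
    'Include Input Columns' property."""
    out = []
    for row in rows:
        record = dict(json.loads(row["DataValue"]))
        if include_input_columns:
            record["DataValue"] = row["DataValue"]
        out.append(record)
    return out

rows = [{"DataValue": '{"mission_name": "SES-12"}'}]
print(extract(rows, include_input_columns=True))   # raw column kept
print(extract(rows, include_input_columns=False))  # raw column dropped
```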

Now that we have all of the logic in place to orchestrate API calls with dynamic parameters and transform the data, let's look at how we can build one cohesive pipeline to schedule this work.

Looking back at the parent orchestration pipeline (which uses the SpaceX API Custom Connector), let's run the transformation pipeline directly after the data loads. To do this, we will use a Run Transformation component: drag it onto the canvas like any other component and configure it to run our transformation pipeline that extracts the nested data from the API. It should run only after a successful execution of the Custom Connector, which is controlled by the green connector between the Custom Connector component and the Run Transformation component:
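The control flow expressed by that green connector can be sketched in a few lines of Python: the transformation step runs only if the load step succeeds. The step functions and names below are illustrative stand-ins, not Designer APIs.

```python
def run_pipeline(load_step, transform_step):
    """Run the load (Custom Connector) step; run the transformation
    only on success, mirroring the green 'on success' connector."""
    try:
        load_step()
    except Exception as err:
        print(f"Load failed, skipping transformation: {err}")
        return False
    transform_step()
    return True

# Example usage with dummy steps:
run_pipeline(lambda: print("loading raw API data..."),
             lambda: print("extracting nested data..."))
```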

The Run Transformation component is powerful because it lets us orchestrate transformation logic exactly when it is required, without having to run or schedule it individually.

This is now a complete ETL pipeline and can be published/scheduled as needed!

Conclusion and further reading

As we've seen throughout this blog series, Designer's orchestration and transformation pipelines make it very easy to interact with APIs using configurable parameters and to process the responses. Designer pipelines also offer the flexibility to build custom logic, so development isn't forced down a single path for handling specific data.

The aim of this blog series has been to cover the fundamentals of this topic, so that you can quickly get to this point with your own data and start processing it further as you require!

Sign up for a free trial to try out these techniques on your own data.

Kevin Kirkpatrick

Associate Delivery Solution Architect
