Half a day with Maia. A working pipeline by the end.

Register

Making sense out of Semi-Structured data

Parsing JSON with the Extract Nested Data component within Matillion Data Productivity Cloud connected to Snowflake simplifies the parsing for many semi-structured data patterns.  

The JSON format has become a more popular format for semi-structured data, primarily because it is more consistent containing all key:value pairs. JSON handles repeating elements by containing them in an array as a value of a key:value pair.  

For this article, I am using the same example data set that was used in part one on XML only this sample data is represented as JSON. I also walk you through how to convert the XML to JSON to simplify parsing XML. 

{
	"items": {
		"item": [
			{
				"name": "Cake",
				"ppu": "0.55",
				"batters": {
					"batter": [
						{
							"_id": "1001",
							"__text": "Regular"
						},
						{
							"_id": "1002",
							"__text": "Chocolate"
						},
						{
							"_id": "1003",
							"__text": "Blueberry"
						},
						{
							"_id": "1003",
							"__text": "Devil's Food"
						}
					]
				},
				"topping": [
					{
						"_id": "5001",
						"__text": "None"
					},
					{
						"_id": "5002",
						"__text": "Glazed"
					},
					{
						"_id": "5005",
						"__text": "Sugar"
					},
					{
						"_id": "5007",
						"__text": "Powdered Sugar"
					},
					{
						"_id": "5006",
						"__text": "Chocolate with Sprinkles"
					},
					{
						"_id": "5003",
						"__text": "Chocolate"
					},
					{
						"_id": "5004",
						"__text": "Maple"
					}
				],
				"_id": "0001",
				"_type": "donut"
			},
			{
				"name": "Raised",
				"ppu": "0.55",
				"batters": {
					"batter": {
						"_id": "1001",
						"__text": "Regular"
					}
				},
				"topping": [
					{
						"_id": "5001",
						"__text": "None"
					},
					{
						"_id": "5002",
						"__text": "Glazed"
					},
					{
						"_id": "5005",
						"__text": "Sugar"
					},
					{
						"_id": "5003",
						"__text": "Chocolate"
					},
					{
						"_id": "5004",
						"__text": "Maple"
					}
				],
				"_id": "0002",
				"_type": "donut"
			},
			{
				"name": "Buttermilk",
				"ppu": "0.55",
				"batters": {
					"batter": [
						{
							"_id": "1001",
							"__text": "Regular"
						},
						{
							"_id": "1002",
							"__text": "Chocolate"
						}
					]
				},
				"_id": "0003",
				"_type": "donut"
			},
			{
				"name": "Bar",
				"ppu": "0.75",
				"batters": {
					"batter": {
						"_id": "1001",
						"__text": "Regular"
					}
				},
				"topping": [
					{
						"_id": "5003",
						"__text": "Chocolate"
					},
					{
						"_id": "5004",
						"__text": "Maple"
					}
				],
				"fillings": {
					"filling": [
						{
							"name": "None",
							"addcost": "0",
							"_id": "7001"
						},
						{
							"name": "Custard",
							"addcost": "0.25",
							"_id": "7002"
						},
						{
							"name": "Whipped Cream",
							"addcost": "0.25",
							"_id": "7003"
						}
					]
				},
				"_id": "0004",
				"_type": "bar"
			},
			{
				"name": "Twist",
				"ppu": "0.65",
				"batters": {
					"batter": {
						"_id": "1001",
						"__text": "Regular"
					}
				},
				"topping": [
					{
						"_id": "5002",
						"__text": "Glazed"
					},
					{
						"_id": "5005",
						"__text": "Sugar"
					}
				],
				"_id": "0005",
				"_type": "twist"
			},
			{
				"name": "Filled",
				"ppu": "0.75",
				"batters": {
					"batter": {
						"_id": "1001",
						"__text": "Regular"
					}
				},
				"topping": [
					{
						"_id": "5002",
						"__text": "Glazed"
					},
					{
						"_id": "5007",
						"__text": "Powdered Sugar"
					},
					{
						"_id": "5003",
						"__text": "Chocolate"
					},
					{
						"_id": "5004",
						"__text": "Maple"
					}
				],
				"fillings": {
					"filling": [
						{
							"name": "Custard",
							"addcost": "0",
							"_id": "7002"
						},
						{
							"name": "Whipped Cream",
							"addcost": "0",
							"_id": "7003"
						},
						{
							"name": "Strawberry Jelly",
							"addcost": "0",
							"_id": "7004"
						},
						{
							"name": "Rasberry Jelly",
							"addcost": "0",
							"_id": "7005"
						}
					]
				},
				"_id": "0006",
				"_type": "filled"
			}
		]
	}
}

Extract Nested Data 

We start by using the Extract Nested Data component, which simplifies parsing semi-structured data. In this example, we’re using several of them to traverse the nested elements.  

First, the JSON file is loaded into a table called donut_json, which contains a single column defined as a variant “data_value.” 

Next, configure the Columns property of the Extract Nested component. I used “Autofill”’ and let the component identify the structure of the JSON. I have deselected all the columns and chosen to pass through the Item attributes and element values.

In the example, I also passed through the Filling element, keeping it a variant for further processing downstream. 

Since the topping elements are repeating at the first level, the component has flattened toppings into separate rows automatically, so I was able to select the element value level for toppings.  

Another property to call out is the Outer join property on the Configuration tab. Since all of the elements do not exist for every item, I needed to set Outer Join = “Yes.”  This will retain all the rows for all items, even though only two items have Fillings.  

Flatten Variant 

The Flatten Variant component is used to flatten arrays. Although the Extract Nested Data component can sometimes be used, the Flatten Variant lets you explicitly break a column into more rows than the original extract nested data if you are seeking further granularity from the extract nested component. 

The batter element in this example has two formats, so I have to treat the Batter array differently by using a Flatten Variant component to parse the array of batters into separate rows. The initial Extract Nested Data component created a new row for each item and each topping.

From there, we want a new row for each item, topping and batter. I tested the batter element to determine if it’s an array, by using the IS_ARRAY() function in a Calculator component.  

IS_ARRAY("items_item-element_batters_batter")

After that, Flatten the array into separate rows per batter element before extracting the attributes.  

  • Set the Column Flatten property to read the batter array column
  • In the column mappings, use the flatten alias to map to an output variant column  

Finally, we bring all the rows back together, remove unwanted columns, and write to a new table.

  • The Unite component unions all the rows back together
  • The Rename component allows us to remove any unwanted fields, like the arrays, and rename and reorder the fields 
  • The Rewrite component writes to a new table  

The resulting final pipeline is much simpler than the previous XML one.  

Convert XML to JSON

Our example pipeline started with a file that was already in a JSON format.  However, if you have an XML file that needs to be converted and you would like to convert the XML to JSON inside a pipeline, you’ll use the code below. 

Create an Orchestration Pipeline

First, I created a separate Orchestration pipeline that contains a SQL Script component to create a Snowflake UDF using the code below. This code calls a Snowflake Snowpark package called “xmltodict.” 

Our example XML_to_JSON Python code follows.

CREATE OR REPLACE FUNCTION XML_TO_JSON(xml_value string)
 RETURNS VARCHAR
 LANGUAGE PYTHON
 RUNTIME_VERSION=3.8
 HANDLER = 'convert_xml_to_json'
 PACKAGES = ('xmltodict')
 COMMENT = 'UDF to convert XML string into JSON'
AS
$$

import xmltodict
import json

def convert_xml_to_json(xml):
 return(json.dumps(xmltodict.parse(xml)))
$$;
Parse With the Calculator Component

Next in my Transformation pipeline, I called the procedure in a Calculator component. The parse_json function formats the JSON so it’s readable. 

PARSE_JSON(XML_TO_JSON("data_value"::VARIANT))

Normalizing Semi-Structured Data 

Semi-structured files typically contain data that has been nested, and we often want to store that data in a structured format more friendly to analytics and reporting. Many times, as we flatten out deeply nested data, we end up with a multi-join or cartesian join where all upper-level elements of the file are joined with all nested elements of the file.

It’s important to understand the data and the relationships between the elements.  

The donut data file is a small example, so there is no need to break it up into several tables. However, real-world examples are often very large when flattened. In these cases, we need to evaluate the data contained in the JSON response and determine the best model to represent the data in different tables.  

Looking at the final flattened table from the pipeline shown above, we see that for donut item_id = “0001” there are four different batters and four toppings. That results in 16 rows to display every combination possible. This is where understanding the data and what you want to achieve is very important.  

You might have a model that is analysing all the possible combinations of donuts, but more likely, there is another file or API endpoint that shows the sale of those donuts that has metrics such as sales and supply of raw materials based on demand.  

In this case, the example data file provides all possible combinations of donuts sold at the donut shop.  Using this file, we could design a dimensional model, storing all the distinct batters in a batter dimension table and all of the toppings and fillings in corresponding dimension tables.

Perhaps there is a fact table in the middle that shows the price based on each combination of batter, toppings, and fillings. Another fact table could show the actual sales of specific donuts sold with total quantity, price, and sales using the same dimension tables already built. 

Splitting Nested Arrays into Separate Streams

In order to split the dimensions into separate tables, the first Extract Nested Data component will pass the full element as a variant downstream in order to start to split out the different datasets into separate streams.  

In the first Extract Nested component, there is a sample of the elements chosen in the Columns property, and I chose the item-level attributes, which will be carried into each downstream component. I also chose the filling element and batters Variant columns, which pass along the whole element and allow for further parsing in the next Extract component.  

Passing the different Variant columns downstream facilitates different processing and transformation rules for each element. 

Carrying the item level attributes, the dimension IDs, and the fact metrics such as price per unit and filling added cost, we join all the distinct combinations of item/batter, item/topping, and item/filling back together. Note that the distinct components are important for removing duplicates before joining. Without these, the join resulted in a multi-join and 17k rows. 

Oversized Semi-Structured Files 

You may also run into scenarios where the original file has to be split because it exceeds the single-column variant size of 16MB. In this case, it is best to go back to the API call to filter the response.  

Many APIs offer parameters, such as date range, to filter the number of rows.  Another option is to create two API calls where you choose less data for the output by splitting the data in half. For this option, you must bring over the header or top-level information in each API call so that the data can be stitched together after parsing.   

Summary

The Matillion Data Productivity Cloud for Snowflake makes it simple to parse a JSON file or convert an XML to JSON and then parse to store the data. It also allows flattening all the way or how to break into a dimensional model.  

If your data is in JSON format, parsing is easier using the Extract Nested Data component, which automatically flattens the first level of repeating elements and one level of embedded repeating elements. If there are more nested elements or arrays, a combination of the Extract Nested Data and Flatten Variant components is used.  

The important thing to note with semi-structured data files is that they usually represent a combination of tables and fields that originated from a normalized model, which represents data and relationships all in a single file format.

When parsing or flattening these data files, they can become very large as they display every combination of the data contained in the file. Understanding the use case for analyzing the data will help you to model, parse, and break apart the data for better analytic efficiency. 

Angie Hastings
Angie Hastings

Senior Sales Engineer

Ready to get moving?

See how quickly your team can start delivering business-ready data, with Matillion.