Making sense out of Semi-Structured data

Many modern data sources are now being delivered by APIs where the delivered data in the response is semi-structured data formatted as XML or JSON. SOAP APIs deliver data in an XML format, whereas the newer, more popular REST APIs are typically JSON. 

In this series, we will show examples of extracting XML and JSON and converting those flattened data structures into an analytics-friendly model using Matillion’s Data Productivity Cloud for Snowflake.  

In this first part of the series, I will cover how to parse XML data using the Snowflake functions of GET() and XMLGET().  

Remember to read part two, where I will convert the XML to JSON, parse the JSON using the flattened variant, extract nested components, and demonstrate how to remodel the data from flattened XML or JSON to a normalized model. 

Extracting XML

Although XML is not as common as JSON, it comes up enough that it is helpful to understand how to parse XML. XML is made up of different structured components. It has attributes which are key:value pairs such as “id”=”0001”.  It also contains elements with a value surrounded by XML tags such as <name>Cake</name>.   

Matillion Data Productivity Cloud offers a way to flatten nested arrays of elements and extract attributes and elements using native Snowflake functions. I will work with the sample XML data below. I chose this XML sample because it has several levels of repeating elements along with their attributes.

Example XML file:

<items>
	<item id="0001" type="donut">
		<name>Cake</name>
		<ppu>0.55</ppu>
		<batters>
			<batter id="1001">Regular</batter>
			<batter id="1002">Chocolate</batter>
			<batter id="1003">Blueberry</batter>
			<batter id="1003">Devil's Food</batter>
		</batters>
		<topping id="5001">None</topping>
		<topping id="5002">Glazed</topping>
		<topping id="5005">Sugar</topping>
		<topping id="5007">Powdered Sugar</topping>
		<topping id="5006">Chocolate with Sprinkles</topping>
		<topping id="5003">Chocolate</topping>
		<topping id="5004">Maple</topping>
	</item>

	<item id="0002" type="donut">
		<name>Raised</name>
		<ppu>0.55</ppu>
		<batters>
			<batter id="1001">Regular</batter>
		</batters>
		<topping id="5001">None</topping>
		<topping id="5002">Glazed</topping>
		<topping id="5005">Sugar</topping>
		<topping id="5003">Chocolate</topping>
		<topping id="5004">Maple</topping>
	</item>
	<item id="0003" type="donut">
		<name>Buttermilk</name>
		<ppu>0.55</ppu>
		<batters>
			<batter id="1001">Regular</batter>
			<batter id="1002">Chocolate</batter>
		</batters>
	</item>
	<item id="0004" type="bar">
		<name>Bar</name>
		<ppu>0.75</ppu>
		<batters>
			<batter id="1001">Regular</batter>
		</batters>
		<topping id="5003">Chocolate</topping>
		<topping id="5004">Maple</topping>
		<fillings>
			<filling id="7001">
				<name>None</name>
				<addcost>0</addcost>
			</filling>
			<filling id="7002">
				<name>Custard</name>
				<addcost>0.25</addcost>
			</filling>
			<filling id="7003">
				<name>Whipped Cream</name>
				<addcost>0.25</addcost>
			</filling>
		</fillings>
	</item>
	<item id="0005" type="twist">
		<name>Twist</name>
		<ppu>0.65</ppu>
		<batters>
			<batter id="1001">Regular</batter>
		</batters>
		<topping id="5002">Glazed</topping>
		<topping id="5005">Sugar</topping>
	</item>
	<item id="0006" type="filled">
		<name>Filled</name>
		<ppu>0.75</ppu>
		<batters>
			<batter id="1001">Regular</batter>
		</batters>
		<topping id="5002">Glazed</topping>
		<topping id="5007">Powdered Sugar</topping>
		<topping id="5003">Chocolate</topping>
		<topping id="5004">Maple</topping>
		<fillings>
			<filling id="7002">
				<name>Custard</name>
				<addcost>0</addcost>
			</filling>
			<filling id="7003">
				<name>Whipped Cream</name>
				<addcost>0</addcost>
			</filling>
			<filling id="7004">
				<name>Strawberry Jelly</name>
				<addcost>0</addcost>
			</filling>
			<filling id="7005">
				<name>Rasberry Jelly</name>
				<addcost>0</addcost>
			</filling>
		</fillings>
	</item>
</items>
Nested Elements

Looking at our example collapsed to show only the top level, it contains six repeating elements which have nested elements.  

First, I convert the XML into an array and then use the Flatten Variant component to create a separate row for each repeating element. Let’s break down the process.

To start, I loaded the XML into a table called ‘donut_xml’, which has a single column called ‘data_value’, which is defined as a variant data type. 

XML to Array - using the Calculator component, use the TO_ARRAY() function to convert the “data_value” variant to an array.

TO_ARRAY("data_value":"$")

Flatten Item -  The Flatten Variant component to flatten the arrays, producing a separate row for each array element. In this case, we will end up with six rows, one for each item.  

Set the ‘column flattens’ property to read the input ‘xml_array’ and give it an alias. If you don’t specify a property to flatten, it will create a row for each element in the array.  Specifying a column property will set the repeating element at the specified property level. 

Don’t forget to set the ‘column mapping’ property to map the flattened result and name the new row. In this case, I wanted each ‘item’ element to be in its own row. 

The result of the flatten component contains one row per element, in this case “item”. That said, the result is still an xml element, which needs further parsing. 

Next, extract the item level attributes and elements using the Calculator component with the snowflake xml functions GET() or XMLGET().

Here’s a quick rundown of the difference between an attribute and an element so you can easily determine which function to use:  

  • Attributes are specified as a key value pair, such as id=”0001” or type=”donut” in the example below.  
  • Elements are inside tags such as <name>Cake</name> in the example below. Elements can also be nested arrays like the <batters> example, where the batters element contains a repeating element.  

In the case of batters, sometimes it contains a single batter element and sometimes it contains an array of repeating batters elements.  Elements can also be themselves repeating such as <topping …topping>.

 

Let’s break down this example and name each component of the item element: 

  • “id” - attribute of the top level 
  • “type” - attribute of the top level 
  • “batters” - an embedded array with repeating elements that contain 1 attribute and a text value
    • “id” - an attribute of batters element
    • element text 
  • “topping” - repeating element

Using the Calculator component, define a new column for each element within the item array. Keep the embedded arrays at the array level for now.  

To extract an array element, use the XMLGET() function. This function takes in the array and the name of the element being extracted—“$” indicates the value, and “@” indicates the key or label of that value and the data type being cast.  

In the examples below, the first gets the name value out of the element and casts it as a string. The second extracts the batter array and casts it as a variant. 

XMLGET("item_xml", 'name'):"$"::string
XMLGET("item_xml", 'batters'):"$"::variant

To extract attributes, use the GET() function. In this case, use the @notation to identify the attributes.

GET("item_xml", '@id')::string

Next, we will repeat the process to flatten the fillings and batters arrays using the Flatten Variant component to create new rows for each repeating element in the array. Then, I’ll extract each element and attribute for that element.  

I created a branch to flatten batter, fillings, and toppings in separate streams and then joined them all back together at the end. Toppings is a repeating element at the first level and not inside an array, so I needed to convert the xml to an array first before flattening. 

In this case, the toppings repeating element is at the same level as the batters, fillings, and other elements within the items array, so I needed to use a filter to keep only the toppings rows after I flattened them. Next, I extracted the attributes from the toppings element in the same way as above using the GET() function. 

Conclusion  

This article demonstrated a method for parsing nested XML by using the Flatten Variant component and the Calculator component in Matillion Data Productivity Cloud. The final pipeline looks like this: 

In this pipeline, I used the Flatten Variant component to flatten each item element to the row level. I then extracted the attributes and element values associated with the item.  

Next, I parsed out each repeating element into a different stream, again using a Flatten Variant component for each embedded array, followed by a Calculator component to extract the attributes and element values once each element was contained in its own row. Finally, I used a Unite component and a Join component to bring the results back together to a final flattened table.  

In the next part of this series, I will parse the same example in JSON format. After converting the XML file to JSON, I can simplify my pipeline by using the Extract Nested Data component in Matillion Data Productivity Cloud. Stay tuned!  

Interested to try for yourself? Sign up for a free trial! 

Angie Hastings
Angie Hastings

Senior Sales Engineer

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.