Loading and Flattening XML Data with a Snowflake File Format

XML (eXtensible Markup Language) is simple and usable across all platforms. It is commonly used to encode documents or serialize data. There are compelling reasons to load XML data into your data infrastructure. As businesses continue to deliver services online, more XML data is generated, containing rich, complex, and hierarchical information that can offer deep insights into - for example - customer behaviors and business operations.

To gain the most value out of this potential wealth of information, it is necessary to flatten the XML data. Flattening involves transforming this hierarchical and nested data into a relational structure that can be more easily queried and analyzed with common SQL-based tools. By doing so, businesses are better able to tap into knowledge contained in the data, fueling more informed decision-making and planning.

In this article, I'll use a convenient public-facing XML data source in the form of this RSS news blog feed from AWS and show how to process it using the Matillion data productivity cloud.

You can also follow along in your own Data Productivity Cloud account by using two downloadable pipelines from the Matillion Exchange that process XML API data in Snowflake using a File Format.

If you download that RSS feed yourself, you'll find the raw data looks something like this:

An XML source data sample from the AWS news blog

This RSS data is typical of many REST (Representational State Transfer) style APIs that are used in web development. OData is another example of a protocol that uses XML as its core language for data communication, and looks very similar in practice.

Regardless of source, the first step in extracting and loading XML is to define a Snowflake File Format.

Snowflake File Format objects

In Snowflake, a file format is a representation of data that includes information such as the file type, formatting options, and compression method. In other words, a file format defines how the data is stored, and guides how to interpret this data subsequently. This is how file formats help load data from files into Snowflake tables. By the way, they can help with unloading, too, although it's not a topic for this article.

File formats are named database objects. They are created using a SQL DDL statement. Matillion's low code interface has a component that creates a file format.

Creating a File Format with Matillion

Once created, the file format can be invoked during a COPY operation to transfer the RSS data - cached in cloud storage - into a Snowflake table.

This Snowflake table acts as an intermediary, holding the XML data in one VARIANT column, and making it readily accessible for future use.

You can probably just about see the document structure from the screenshot of the XML above. There's a "channel" that contains "item" objects that represent blog articles. In reality, you'll find that XML files won't always be conveniently indented like this, and will often be harder to read.

To un-nest the items from the channel and make them accessible requires a lateral flatten transformation.

What is a Lateral Flatten in Snowflake?

Snowflake's LATERAL FLATTEN operation is a data transformation step that plays a vital role in making XML data accessible. This transformation changes the data's granularity from a single record to as many items as are present in the data.

A lateral flatten un-nests - or "flattens" - the nested XML data, effectively transposing the data structure from wide to narrow. This helps to make data much more approachable and manageable for downstream analysis and reporting. Here's how it looks in the user interface:

A lateral flatten transformation in Matillion

Note that this is a high code variation of the Extracted Nested Data component that does something very similar for JSON data.

After flattening the XML into item records, the last step is to merge it into a permanent Snowflake table, using the low code Table Update component you can see in the screenshot above. Now it's ready for analysis!

Try it yourself

You can download pipelines from the Matillion Exchange to run this loading and flattening process yourself. Two ways to load, flatten and transform XML data are available:

Why Choose Matillion for XML Handling?

Matillion's Data Productivity Cloud contains all the data loading, data orchestration, and data transformation tools that you need for this kind of sophisticated, multi-step operation. Matillion offers an all-encompassing solution to handle XML data effectively.

There are high-code and low-code options for handling most data engineering tasks, from writing raw SQL to running pushdown Python scripts. This flexibility makes the process of extracting, loading, and transforming XML files seamlessly scalable. The development lifecycle is git based on fostering collaboration among data teams.

Designed with flexibility and large-scale data engineering in mind, Matillion leverages the processing power of your cloud data platform, allowing you to handle any data type or source with ease. Matillion's data productivity cloud makes seemingly daunting tasks manageable, helping you turn data into a powerful ally for your business decisions.