How Matillion ETL for Snowflake Supports External Tables
At their recent summit, Snowflake announced support for External Tables. This represents an advancement in data storage as users can now store, query and manage data both in their Snowflake data warehouse and out in object cloud storage.
External Tables allow Snowflake users to query data across their Snowflake cloud data warehouse and data in external storage objects, bringing users closer to their data, in a performant and cost efficient manner.
We are happy to announce that External Tables are fully supported in Matillion ETL for Snowflake, allowing users to bring their data together and transform it, all from within Matillion. In this blog, we will walk you through how to use the new External Tables component.
See how External Table support works in Matillion ETL for Snowflake.
Why Use External Tables
Key business data often exists in files stored in AWS S3 or Azure Blob Storage which can be regularly updated by end users, suppliers or partners. Additionally, data is also often stored in the cloud as a Data Lake. Traditionally, this business data is loaded into the data warehouse to be combined with other data for analysis. This results in having two copies of the data which can cause data inconsistencies due to constantly reloading the data using the data warehouse resources, as well as a potential time lag in loading the data. A solution to this problem is to store the data in files in S3 or Azure Blob Storage and query these from the data warehouse, either directly via end-user queries or, more likely, during a larger data transformation. Snowflake supports this in the form of Snowflake External Tables. Matillion ETL for Snowflake has full support for Snowflake External Tables, enabling users to perform External Table transformations directly from within Matillion.
In order to query a file directly in S3, or Azure Blob Storage, an External Table definition needs to be created referencing a Snowflake Stage. A Snowflake Stage is a reference to either an external location on S3 or Azure Blob Storage or an internal Snowflake location where files are stored. Associated credentials to access these files are required. Matillion often uses the provided instance credentials, however, a Snowflake Stage is a long-lived object that may exist far longer than temporary instance credentials. For this reason, permanent credentials are required.
Either an Internal Stage or an External Stage can be used to store files to be used in External Tables. These can be managed within Matillion from the Environment Panel. Right click on the Environment and select “Manage Stages”:
A pop-up will show all available stages, the schema they are associated with, whether they are Internal or External and the URL if applicable:
New Stages can be added here and existing ones deleted. A stage is typically either an S3 bucket or Azure Blob Storage or a folder within these so multiple files can be held within one stage.
Create an External Table
Matillion has a Create External Table component to create the External Table definition. This is done from an Orchestration job. An example job is below:
The Create External Table component allows users to specify the name of the new external table to be created, the stage location, the format of the files in the stage and any columns the table should be partitioned on. You should consider partitioning to improve the speed of queries on the External Table.
The example job also pulls in files from another external source such as Google Cloud Storage, sFTP, Windows Fileshare, HTTPS using the Data Transfer component and adds these files into the location referenced by the stage used for the External Table. A Refresh Table component is required to run a refresh on the table so data from all files are returned when the external table is queried.
Querying the Data
The External Table can now be used in any Transformation job as any other Snowflake table would be used. The tables can be explored using the environments panel. Here, all External Tables are listed with their associated metadata:
Dragging this table on to the canvas brings up a configured Table Input component. Sampling the table shows the available data in JSON format:
This component can then be used with any Transformation components downstream.
Flattening the Data
A common requirement is to flatten out the different data elements into a tabular structure. The example table used in this blog contains details of customer orders and invoices. The structure of the data is below:
Each row of data has customer information, details of items purchased and one or more payment options. There may be a requirement to flatten this data out into 2 new tables. One for customer orders and another for associated payments. This can be done with the Flatten Variant component. Further details on how to configure this component are available in the blog article here.
An example of the job to do this is here:
Here, the sample of the items is taken on the Flatten Variant component and the 50 rows of data from the External Table are expanded into 140 rows with one row of data per item. This data can then be written into either a new External Table or a new internal Snowflake table, as per the example above.
For Matillion ETL for Snowflake customers, we have outlined above how to take advantage of Snowflake’s recently released External Tables features from directly within Matillion. Users can create jobs from within Matillion to directly reference data held in files in S3 or Azure Blob Storage and also perform transformations on that data. This prevents unnecessary data being loaded into the Snowflake database and also allows users who don’t have access to Snowflake to update the data when required.
Author Note: A special thank you to a few of our Matillion developers – Nick, Sam, Vlad, Tom and Greg, in particular – for all the work they did to bring External Tables support to Matillion ETL for Snowflake.