Performance Benefits of S3 load vs Query components

While designing and implementing pipelines to ingest data into your data warehouse, the main goal (for me, at least) is to get the data loaded into the data warehouse as quickly as possible. Faster initial loading means the transformation of the raw data set can be performed sooner – which can lead to faster business reporting and delivery. 

This blog will outline an approach to help get the raw/source data ingested into the warehouse faster by loading data sets into cloud storage (S3/Blob) rather than using a query component.

What’s happening behind the scenes 

First, let's briefly outline what’s happening while loading data from an ordinary Matillion query component:

Configuration of the component takes place, and the specified query is populated in the component. When the component runs, the query that was configured is sent to the source system, and the results are returned to Matillion. 

The query results are initially stored in memory, and the data set is then staged in cloud storage (S3/Blob). Finally, the staged file that holds the query results is then loaded into the warehouse, generally via a bulk load command specific to the cloud warehouse. 

Loading directly from cloud storage into the data warehouse is extremely fast, so the majority of the time spent with this approach is sending the query to the source system, fetching the results back over the network (which may have a varying degree of latency) and creating the staged file in cloud storage.

Optimization data set exports

Often, with source systems, there are methods of automating data set exports, specifically to cloud storage such as S3 or Blob storage. With this available, you can leverage the automated export ability to ‘shortcut’ the process of connecting to a source system, issuing a query, retrieving the results, and creating a file in cloud storage. 

With the dataset to be loaded into your cloud data warehouse being automatically exported into cloud storage, the use of cloud storage loads, such as S3 load or Azure Blob storage load can be used to load the data instead of the query component. 

Example

Recently, I was testing a daily load from Salesforce Marketing Cloud (SFMC) that consisted of 5-20 million rows of data. Here are some statistics for following the traditional design pattern of using the Salesforce Marketing Cloud query component to query and load the data into Snowflake:

Here are some statistics on using SFMC automation to export the daily data into S3 and then using the S3 load component to load the data directly into Snowflake:

 
Data loading with Salesforce Marketing Cloud Query component:

In this screenshot, we are loading data from Salesforce Marketing Cloud directly and into Snowflake. After the load has been completed, we are then running a transformation pipeline to further modify the data in Snowflake.

Loading exported data from Salesforce Marketing Cloud:

In this screenshot, we can see that the overall pipeline design is very similar to the traditional method of using a query component. The main difference with this approach is that the query component is replaced by the S3 load component, which will load the data into Snowflake directly from S3 instead of the Salesforce Marketing Cloud. It's important to note that the run transformation component is also invoking the same transformation job in both screenshots. 

Considerations

It is important to keep in mind that it is possible that there may be a service charge from a source vendor to automatically set up exports into S3 or another cloud storage location. To understand any cost implications, it's best to reach out to the vendor of the source system to consult with them to see what the process may be, and what price (if any) may be incurred. 

For more tips and documentation on optimizing data set exports, be sure to read through:
  1. S3 load
  2. Salesforce Marketing Cloud Query component

Using cloud storage like S3 for data ingestion into your data warehouse greatly speeds up the process compared to traditional query methods. Automating data exports to cloud storage reduces load times from hours to minutes, as shown with Salesforce Marketing Cloud.

This method accelerates data transformation and reporting, improving business decision-making. Be aware of potential vendor charges for automated exports and consult your vendor for cost details. By adopting these techniques, you can streamline data ingestion for faster, more reliable data availability.

Kevin Kirkpatrick
Kevin Kirkpatrick

Associate Delivery Solution Architect