Crossing the finish line: Executing AI workloads using dbt and Matillion

Welcome to the third and final installment of the Stepping into data with Matillion and dbt series! The first two articles - Lacing Up: Introduction to dbt Core and Matillion Data Productivity Cloud and Hitting the ground running: Testing data quality using dbt and Matillion - covered the following areas of overlap between Matillion and dbt functionality:

  • Why the dbt Core integration should be used in conjunction with Matillion Data Productivity Cloud. Matillion is the data integration platform for cloud-native data processes. Marketing lingo aside, Matillion serves as the application to extract data from a variety of source systems - files, on-premise databases, and API-based applications - and facilitate the loading of that data into the Cloud Data Platform of choice. Once data has landed, users have a choice of pushdown transformation via: a) low-code Transformation pipelines or b) orchestration of dbt projects via the dbt Core component. The dbt Core component lets users begin transforming that data into a usable state the moment it finishes loading, using any dbt command they wish.
  • Orchestration of dbt data quality tests with embedded dependencies on downstream actions. Tests can be leveraged to ensure data cleanliness and lack of staleness through the use of packages and base functionality like source freshness. Based on the results of those tests, connectors between Matillion components dynamically dictate the next steps in the pipeline.
  • Email notifications with test failure metrics in the email body. With the combination of dbt’s store_failures config and Matillion’s Query Result to Scalar component to derive the count of failures from a Snowflake table, the results of dbt tests are communicated in email notifications.

This article will expand on the above and apply the concept of modularity to prepare data for AI workloads.

About Modularity and AI Workloads

Modularity, as defined by dbt, is the breaking down of complex tasks into smaller, independent units. DRY - an acronym for Don't Repeat Yourself - is the name of the game. The segmentation of workflows leads to a higher degree of predictability and quicker identification of any issues, which ultimately brings teams to production on a much shorter timescale. Modularity is such a foundational concept in dbt that there is even an entire learning course on the subject!

This article will cover a number of dbt and Matillion-based “modularity hacks” along the way.

Artificial Intelligence (AI) - you’ve heard of it, you may have seen it, and now is your time to learn how to incorporate it into a pipeline workflow. Matillion’s Data Productivity Cloud is the platform for preparing data for AI workloads through the embedding of functionality in pipeline-based components. As Snowflake is the target for the loading and transformation of any data, users will learn about the use of the AI-based Snowflake Cortex functionality housed in the form of Sentiment and Completions components.

As referenced in the last two articles, you can load the pipeline of reference and many more quickstart pipelines directly to your own Data Productivity Cloud environment from the Best Practice Pipelines repository.

Let’s get started!

What goes upstream must come downstream

Matillion is a platform best utilized when metadata is driving the pipeline. As covered in the Testing Data Quality Using dbt and Matillion article, variables were introduced as the conduits to pass metadata into dbt commands within the dbt Core component.

Building on that concept, our Charlie’s Shoe Emporium analyst wishes to utilize metadata to drive the entire pipeline, not just which dbt models to run. By specifying state abbreviation parameters (FL, CO, OR) at the top of the pipeline, the analyst wishes to isolate only Florida, Colorado, and Oregon retail stores’ sales using the following workflow:

The Charlie’s Shoe Emporium analyst wants to build a workflow driven entirely by metadata specified upstream. In this case, that metadata is the set of state abbreviations (CO, OR, FL) added to a Matillion grid variable (gv_state_list). This grid variable then feeds the following modules:

  • Module 1 - Create Sales Summary Materializations
    • CSV Files containing shoe sales data are loaded from S3 to a table in Snowflake only for the states contained in gv_state_list.
    • dbt build is run against this table, crafting models summarizing the sales for each state and testing data hygiene.
  • Module 2 - Generate Snowflake Cortex Insights
    • dbt run to generate a dimensional model for each state’s sales data
    • State-specific metadata (email recipient list, greeting, Cortex prompt) loaded to a grid variable (gv_prompt_email_details).
    • Snowflake Cortex is used to produce sentiment analysis for customer reviews, then prompt the Mistral-7b model to summarize the state store performance vs other states.
  • Module 3 - Build & Send Emails with Cortex Insights
    • Using email and AI prompt metadata, emails are populated to summarize and embed the Cortex insights into a humorous email for the managers of Charlie's Shoe Emporium stores.

No sweat, right? Let’s break down these modules.

Module 1 - Create sales summary materializations

Following along with the theme of using metadata to drive a modular-based workflow, the Charlie’s Shoe Emporium analyst will incorporate the art of Matillion grid variables to achieve their aim of loading data and creating dbt models only for the states they establish as values.

So, what are Grid Variables?

Grid Variables are just temporary tables local to Matillion - that’s it! What makes them special is that they can hold rows of values that can then be fed as properties in components. The iteration over the rows of values allows for operations in Matillion to perform the same base functionality (modularity!) in the component only for those values set in the grid variable.

The grid variable, which will dictate the downstream workflow, will be about as simple as it gets; it will be called gv_state_list and use a single column to identify only the states the workflow pertains to. As the first step in the workflow, the analyst uses the Append to Grid component to add the values FL, CO, and OR as the abbreviations for the states of Florida, Colorado, and Oregon.

With that grid variable now populated with the state abbreviations, the analyst will choose to load the data from S3 for only those states. This will be achieved by feeding the values of the grid variable into another job using the Set Grid Variables property.  

Metadata-driven data loading

The datasets utilized in the workflow are all contained in flat files within AWS S3 buckets. These files will be loaded using a combination of the S3 Load component and the Grid Iterator in Matillion. As the name implies, the Grid Iterator will “loop” through the list of three values in gv_state_list established in the last step and pass each value - the state abbreviation - into the S3 Load component. 

As seen in the clip below, the Pattern field utilizes the pv_state variable to indicate which state’s shoe sales file will be loaded. Each iteration will populate its state value—CO, for example—into pv_state to only load the Colorado sales data. This combination of Grid Iterator + S3 Load will amount to three total iterations of the same process of loading a file from AWS S3, the only difference in any run being the state being loaded in each iteration.

One dbt command to rule them all

With the data now having been loaded from AWS S3 via iteration over the state values, the analyst is ready to unleash a whole host of dbt functionality on the data. From within the 3a-crossing-finish-line node in their dbt project, the analyst wishes to complete this entire list of functionality:

  • Build the incremental model STG_CHARLIES_SALES_ALLSTATES, which will join all raw tables loaded in the previous step. As an incremental model, only rows of data with non-already-existing TRANSACTION_IDs will be loaded.
  • Utilize the tests defined in the schema.yml file to test the following qualities of the STG_CHARLIES_SALES_ALLSTATES model:
    • Row count does not exceed 75,000
    • The model consists of exactly 21 columns
    • Column TRANSACTION_ID is unique
    • Column PRODUCT_PRICE values are between 50 and 220
    • Column PRODUCT_NAME has no more than 15 distinct values
       
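The bullets above can be sketched as dbt project files. The following is a hypothetical sketch only: the source name, the non-listed column casing, and the use of the dbt_expectations package are assumptions for illustration, not taken from the article’s actual project.

```sql
-- models/3a-crossing-finish-line/STG_CHARLIES_SALES_ALLSTATES.sql (sketch)
-- Incremental model: on incremental runs, only rows whose TRANSACTION_ID
-- does not already exist in the target table are loaded.
{{ config(materialized='incremental', unique_key='transaction_id') }}

select *
from {{ source('raw', 'charlies_sales') }}  -- assumed source name

{% if is_incremental() %}
  -- keep only new TRANSACTION_IDs; {{ this }} is the already-built table
  where transaction_id not in (select transaction_id from {{ this }})
{% endif %}
```

The five tests could then be declared in schema.yml. The test names below are real dbt_expectations tests, but their use in this project is an assumption:

```yaml
# models/3a-crossing-finish-line/schema.yml (sketch)
models:
  - name: STG_CHARLIES_SALES_ALLSTATES
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          max_value: 75000          # row count does not exceed 75,000
      - dbt_expectations.expect_table_column_count_to_equal:
          value: 21                 # exactly 21 columns
    columns:
      - name: TRANSACTION_ID
        tests:
          - unique
      - name: PRODUCT_PRICE
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 50
              max_value: 220
      - name: PRODUCT_NAME
        tests:
          - dbt_expectations.expect_column_unique_value_count_to_be_between:
              max_value: 15         # no more than 15 distinct values
```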

Rather than running each piece of functionality command-by-command, the user will utilize a single dbt command to achieve the aim of running all of the above: dbt build. Simply put, dbt build runs every bit of indicated functionality in the dbt project, end to end, in the order defined by its dependencies.

Within the Matillion dbt Core component, the user sets the dbt Command field to dbt build --select 3a-crossing-finish-line. Doing so will run all of the functionality above, as defined in the 3a-crossing-finish-line node of their dbt project.

Behold the power of dbt and Matillion-based modularity! Using the combination of grid variables and the dbt build command, the Charlie’s Shoe Emporium analyst has loaded sales data only for the states indicated upstream, tested the quality of that data, and built materializations summarizing the dataset across all states.

With the summaries now materialized, the analyst will further utilize the same state abbreviation metadata to drive another workflow found in Module 2. Read further to learn how to iterate over the state values to build state-specific materializations and employ Snowflake Cortex functionality to uncover insights from within the data.

Module 2 - Generate Snowflake Cortex insights

In Module 1, the analyst began by creating and populating a Matillion grid variable with only the state abbreviations - FL, CO, and OR - that should undergo downstream processing. Whereas a grid iterator was previously used to load sales data only for those states, this module will begin by showing how a grid iterator can iterate over entire pipelines, not just components.

The Charlie’s Shoe Emporium analyst connects the Grid Iterator to the Run Regional Models AI Reporting Orchestration pipeline to iterate over the entire underlying workflow.

Looping through each state one by one, the Run Regional Models AI Reporting pipeline aims to perform a handful of actions:

  • dbt run to generate a dimensional model for each state’s sales data
  • State-specific metadata (email recipient list, greeting, Cortex prompt) loaded to a grid variable (gv_prompt_email_details).
  • Snowflake Cortex is used to produce sentiment analysis for customer reviews, then prompt the Mistral-7b model to summarize the state store performance vs other states.

dbt macros to create state-specific models

The first step of the pipeline involves the dbt run command to generate models specific to each state. To make the process as modular as possible, the analyst has designed a workflow resting on the foundation of dbt macros. dbt macros are reusable blocks of SQL (similar to functions), often with variables to define whatever is unique about a specific run. In this case, that unique quality will be the state abbreviation being passed to the WHERE clause of the SQL statement in the macro, to act as a filter.

Here is a diagram of what will be performed, with each step having a color to define which application is performing the action:

dbt run will be the first step of the iterated pipeline, the result of each of the three iterations being a materialized table in Snowflake. These tables are named very similarly - DIM_CHARLIES_SALES_<state abbreviation> - and have the same exact columns, the only difference being the state whose data is being referenced. Please see an example of such a model here, with the macro being referenced in each model here. The creation of the models will take place for each state, thus completing the first round of actions in the Run Regional Models AI Reporting pipeline.
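The macro-driven pattern described above could look roughly like the following. This is a sketch under stated assumptions - the macro name and the filter column name are hypothetical; the article links to the actual macro and model files.

```sql
-- macros/state_sales.sql -- hypothetical macro name
-- Reusable block of SQL: the state abbreviation is passed in as an
-- argument and applied as a WHERE-clause filter.
{% macro state_sales(state_abbreviation) %}
select *
from {{ ref('STG_CHARLIES_SALES_ALLSTATES') }}
where store_state = '{{ state_abbreviation }}'  -- assumed column name
{% endmacro %}
```

Each per-state model, e.g. DIM_CHARLIES_SALES_CO, then reduces to a single macro call:

```sql
-- models/DIM_CHARLIES_SALES_CO.sql (sketch)
{{ state_sales('CO') }}
```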

The next step in the pipeline is to declare state-specific details, which will ultimately be embedded in the reporting emails. Again, this process will include Matillion grid variables; however, as opposed to the single-column variable, the structure of these grid variables will be slightly more complex.

As seen above, this grid variable called gv_prompt_email_details holds information such as the recipient list, subject line, greeting, and other information specific to that state. Know your audience, right? Along with the email details, note the addition of state-specific prompts, which will be used in the Snowflake Cortex functionality.

The Charlie’s Shoe Emporium analyst designs a method to evaluate which state value is being passed in the current iteration using the If component, then appends those state-specific qualities using an Append to Grid component.

This combination of If and Append to Grid components has set up the process by which state-specific email and LLM prompt detail have been mapped to the gv_prompt_email_details grid variable. With the metadata now mapped, the analyst can confidently continue to work to employ Snowflake Cortex to make inferences about the sales summary and state-specific materializations.

Snowflake Cortex

It can’t be overstated that Snowflake is the underlying platform for every action taking place throughout the Charlie’s Shoe Emporium workflow. Data is being loaded from AWS S3 to tables in Snowflake; the dbt tests for data quality are being conducted on existing tables in Snowflake; and any transformations run via the dbt build and run commands take place on tables in Snowflake to materialize still more tables in Snowflake.

I think you get the drift! Along with serving as a data warehouse for its customers, Snowflake is the platform through which those customers can leverage AI and ML-based functionality on that data, categorized as Cortex. And through the use of the Cortex components in Matillion (Embed, Completions, Extract Answer, Sentiment, Summarize, Translate), users have native capabilities to build Cortex functionality as specific steps in a Transformation pipeline.

The Charlie’s Shoe Emporium analyst designs a process to first tabulate sentiment on only the rows in the state sales data that contain reviews. Once tabulated, a second Cortex component is used to prompt the Mistral-7b model to perform an evaluation of that state’s performance relative to the other states. As at the beginning of Module 2, which involved iterating over an entire pipeline, the same approach is used to pass the state-specific LLM prompt to the underlying Transformation pipeline.

Without going overboard on detail, the Transformation pipeline has two goals: 1) apply sentiment analysis to the reviews found within the DIM_CHARLIES_SALES models created via dbt run, and 2) apply the Mistral-7b model to summarize the sentiment and the sales summary models created in Module 1.

Using the Table Input and Filter components, the dataset is limited to rows where the REVIEW column is populated. Connecting that dataset to the Cortex Sentiment component yields a numerical value between -1 and 1 communicating the positivity or negativity of each review.
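Under the hood, the Cortex Sentiment component wraps Snowflake’s SNOWFLAKE.CORTEX.SENTIMENT function. A rough SQL equivalent of the Filter + Sentiment steps might look like this - a sketch, not the SQL Matillion actually generates, with the Florida table name taken from the article:

```sql
-- Score sentiment only for rows that actually contain a review.
-- SENTIMENT returns a float between -1 (negative) and 1 (positive).
select
    transaction_id,
    review,
    snowflake.cortex.sentiment(review) as review_sentiment
from DIM_CHARLIES_SALES_FL
where review is not null;
```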

The measure of sentiment has now been supplied in the analyst’s dataset via the Cortex Sentiment component. With the sentiment values in place, the analyst now wishes to aggregate the sentiment ratings into an average and then join it with the sales summary information materialized in Module 1. The resulting dataset is the following:

Here is what is being communicated on the Charlie’s Shoe Emporium sales data for the state of Florida:

  • Average sentiment across all reviews is an ever-so-slightly lower than neutral rating of -0.08
  • Combined revenue across all retail stores in Florida totaled $245,445
  • Combined sales across all retail stores in Florida totaled 1,933 units
  • Florida is ranked #1 for revenue between all other state totals

The dataset has now been fully prepared for the second layer of Cortex functionality. Armed with information effectively communicating the performance of the retail stores across all of Florida, the analyst will use the Mistral-7b model to answer the following:

I want a response that does 3 things: 1) Please take a look at the rank column - this column denotes how well the Florida stores performed against the Colorado and Oregon stores. Please give a snarky review of this ranking in the tone of a stereotypical Floridian, 1 being the best, 2 right in the middle, and 3 being the worst. 2) Also take a look at the avg_sentiment_REVIEW which communicates the sentiment of reviews given to the stores in the state in the current year. Again, please give a snarky summary of the sentiment score. 3)  The Florida General Managers are also looking for creative ways to market the shoe sales to local sports teams. Please give ideas on how to be able to do so.

To achieve this aim, the analyst will utilize the Cortex Completions component to perform the prompt and save the response to a new table in Snowflake. The setup of the component is quite simple - the only required arguments are the model (mistral-7b) and the context for the “user” being prompted (You are an unofficial reviewer of the sales performance of fictional shoe company, Charlie's Shoe Emporium), plus the optional system prompt, which for Florida is listed out above.
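The component wraps Snowflake’s SNOWFLAKE.CORTEX.COMPLETE function. A rough, hand-written SQL equivalent of that configuration is sketched below - the output table name, source table name, and prompt_text column are assumptions for illustration:

```sql
-- Prompt mistral-7b with the reviewer context plus the state-specific
-- prompt pulled from the grid variable, saving responses to a new table.
create or replace table CHARLIES_CORTEX_INSIGHTS as
select
    state,
    snowflake.cortex.complete(
        'mistral-7b',
        concat(
            'You are an unofficial reviewer of the sales performance of ',
            'fictional shoe company, Charlie''s Shoe Emporium. ',
            prompt_text   -- assumed column holding the state-specific prompt
        )
    ) as cortex_response
from CHARLIES_SALES_SUMMARY;
```

COMPLETE also accepts a structured array of role-tagged messages plus an options object; the simple string form shown here is the most compact way to combine the context and prompt.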

And here is what is being returned by mistral-7b:

1) Well, y'all, take a gander at these Florida store rankings! I'm just floored, absolutely floored, that we're sitting at a disappointing number 3. I mean, come on, Colorado and Oregon might have their mountains and hipsters, but we've got sunshine, beaches, and... well, let's not forget our world-class palm trees! I'm sure there's a simple explanation for this ranking like maybe our stores are too busy selling flip-flops and sunscreen to focus on selling shoes. But hey, no worries, we'll show those northerners how it's done!

2) And now, let me tell you about the avg_sentiment_REVIEW for our Florida stores. I'd say it's a real rollercoaster of emotions down here. With an average sentiment score of -0.08, it's clear that our customers are as excited about our shoes as a gator is about a swamp tour. I guess we'll need to step up our game if we want them to stop complaining about the fit, the price, and the selection. Maybe we could start by offering free key lime pie with every purchase?

3) As for marketing shoe sales to local sports teams, I've got a few ideas that'll make those coaches and players come running to our stores. First, we could offer custom team colors and logos on select shoe models. Second, we could sponsor a local sports team and provide them with free shoes for the entire season. Lastly, we could host a "Meet the Team" event at our stores, where fans can meet their favorite players and get autographs on their new shoes. And if all else fails, we could always offer a discount for anyone wearing a team jersey. Go team, go sales!

Spicy yet informative response from mistral-7b! But as it exists, the responses for all states are captured in a table in Snowflake. To make this more actionable and place the information in the hands of decision-makers at Charlie’s Shoe Emporium, the analyst decides the last piece of the puzzle is to embed these insights in emails.

Module 3 - Build & send emails with Cortex Insights

Hang on, we’ve saved the best for last. The previous two modules focused on preparing the data - this last one will key in on what Charlie’s Shoe Emporium wishes to do with that data. The last bit of functionality to briefly discuss is the sending of email notifications to the regional general managers of the retail store locations. The analyst wishes to create an email template with a consistent format across the different regions but have the tone and information reflect that of each region. Within the Body field of the Send Email component, the analyst supplies the following:

As already covered, anything wrapped in ${} denotes the use of a variable. But how are the variable values established?

Leading up to the Send Email component there are three separate workflows assisting in the embedding of values. In the order of the highlighted components above:

  • Append to Grid - conditionally loads the gv_prompt_email_details grid variable with state-specific information related to the emailing of GMs (recipient list, greeting, etc) and Cortex prompt.
  • Query Result to Scalar - loads information from the CHARLIES_SALES_SUMMARY table materialized way back in Module 1, such as the state ranking, # units sold, and $ revenue.
  • Grid Iterator - feeds the above details into the Send Email component.

With the mechanisms in place to embed state-specific details into emails and send them to the appropriate parties, here is how the pipeline runs from beginning to end, with Colorado as the example:

A clearer look at that final step, the email content:

Mistral-7b is leaning hard into Colorado’s western heritage, giving users a chuckle while also suggesting strategies to take advantage of the given situation. Armed with this information, the GMs of the Colorado retail locations can gain a perspective on how to navigate their respective regions of the company going forward.

Summary

Whew! We’ve reached the end of our journey to data productivity. As demonstrated, Matillion can serve as the single orchestration layer of a process involving ingestion, transformation, and application of that data in a useful context. Specifically, the orchestration of AWS S3, dbt Core, Snowflake Cortex, and SMTP-based email notifications can make for an efficient, incredibly powerful piece of Charlie’s Shoe Emporium’s data stack. Or, yours!

Your next steps

Thank you for reading!

Karey Graham

Snr Manager of Tech Services - PS & Partners
