Scale your data team’s output by up to 100x. We'd love to prove it.

Challenge Maia at Snowflake Summit

Hitting the ground running: Testing data quality using dbt and Matillion

Welcome back to the race! In the first installment of the Stepping into data with Matillion and dbt series - Lacing up: An introduction to dbt Core and Matillion Data Productivity Cloud—we covered why a user would consider the Matillion Data Productivity Cloud as the platform to provide the data for, as well as to orchestrate a dbt project.

To sum up, Matillion serves as the binding for any data Extract-Load-Transform process; native tooling allows users to extract and load data from source systems to a Cloud Data Platform, then utilize built-in pushdown SQL capabilities or dbt integration to transform the data into a production-ready state.

However, data engineers currently find themselves in a crisis. These practitioners of data pipelines must often take a “wait and see” approach to the quality of the data they are putting into production. 

They may have designed the ingestion of the data perfectly and employed top security protocols to “wow” the stakeholders of whatever that data is communicating. This could be the population of a dashboard for the eyes of C-level executives or training of a custom model using proprietary data. 

But no matter the application, a robust framework is absolutely critical to ensuring the quality of such data. Without such a framework, the message remains the same—“garbage in, garbage out.”

This second entry in the Stepping into Data with Matillion and dbt series will take a deeper look into how dbt tests can be integrated into a pipeline-building framework. These dbt tests offer out-of-box functionality to validate the quality and hygiene of the raw data loaded via Matillion connectors or models built upstream by dbt. 

Based on the results of such tests, Matillion can enable the communication of test results in automated emails. Should all go as planned, those encumbered data engineers will be able to breathe easy at the end of the day, knowing their pipelines ensure and communicate data quality.

Here’s a breakdown of what we’ll cover:

  • Building pipeline architecture for analytics engineering
  • dbt tests and the role they play in ensuring data quality
  • Designing a dbt testing framework in Matillion
  • Dynamic, event-based email notifications 

As referenced in the last article, you can load these and many more pipelines directly to your own Data Productivity Cloud environment from the Best Practice Pipelines repository

Let’s get started!

On the road again…

We will resume the workflow that began with the loading of data and running of dbt models in the Getting Started with Matillion and dbt article. 

In the second step of the workflow, our Charlie’s Shoe Emporium analyst wishes to employ dbt tests to validate the hygiene of the Florida shoe sales data loaded in the last article’s workflow. They also would like to add a series of dependencies involving emails based on the success or failure of dbt commands.

Laying the track: Pipeline dependencies

Our analyst is a real stickler for ensuring data quality and receiving notifications if reality doesn’t meet expectations. They wish to build the pipeline by performing the following:

  • Initially run the dbt compile and dbt list commands to respectively validate the SQL compiled and which nodes will be run in the project.
  • If both commands succeed, a dbt test command will be issued to perform the data quality tests on a model built in the first article.
  • If either of the dbt compile and dbt list commands were to fail, an email will be sent communicating the failure.
  • Emails will also be sent in either scenario of the data quality tests passing or failing.
  • Based on success or failure emails being sent, the job itself will show as either a success or failure.

The diagram below communicates the series of actions and dependencies our analyst will build into a Matillion pipeline:

Within Matillion, connections can be made between components with embedded success/failure dependencies. As our user is now a seasoned veteran of dragging, dropping, and connecting components on their canvas, they speedrun the buildout of the pipeline:

All the dependencies indicated in the diagram above are now captured in the pipeline:

“Hygiene is Two-Thirds of Health”: dbt Tests

The pipeline has been set into place; let’s dig into what the implemented data quality testing framework will consist of. The above is a popular Lebanese proverb about the importance of cleanliness. 

And they nailed it as it relates to analytics engineering; the validation of data hygiene is absolutely essential to a healthy end-to-end data pipeline. Luckily for us, dbt tests are an incredibly effective way to easily build such validation. 

Out of the box, dbt ships with four generic data tests already defined: unique, not_null, accepted_values, and relationships. However, custom tests can be unlocked by either writing singular tests or using a wide array of community-curated packages (more on packages later).

Our analyst is rightfully paranoid about the quality of the data in the STG_CHARLIES_SALES view. Specifically they want to ensure that:

  • The TRANSACTION_ID column has only unique values and no nulls.
  • The SALE_INDICATOR column exists in the dataset and only has true or false values.
  • The PRODUCT_NAME column has no more than 15 distinct values.
  • The age of the dataset, communicated in LAST_UPDATED column, is no more than 7 days old.

In order to perform this validation, the shoe sales analyst will construct a sources.yml file to reference the STG_CHARLIES_SALES view materialized in the Lacing Up: Getting Started with Matillion and dbt article. This sources.yml file contains a series of tests to be conducted on the STG_CHARLIES_SALES view, housed in a new node called 2-hitting-ground-running.

Within the 2-hitting-ground-running node, each test will be dictated in the sources.yml file, which references columns in STG_CHARLIES_SALES, and any failures will be saved to a materialization in Snowflake via the store_failures: true config. Here’s what the sources.yml file consists of:

The diagram above references all six tests embedded in the sources.yml file. Note the use of the “dbt_expectations” within the SALE_INDICATOR and PRODUCT_NAME columns’ tests; this is referencing a package that can be downloaded and used in dbt projects. 

Packages are one of the most compelling features of dbt, as they can be built by any user to add custom functionality. In this case, the dbt_expectations package is being loaded to allow custom tests to be embedded.

To utilize packages such as dbt_expectations in the dbt Core component in Matillion, all a user needs to do is have a packages.yml file in their dbt project with the named packages and applicable versions. Upon runtime of any command within the dbt Core component, the packages will be automatically installed. No separate step involving the dbt deps command is necessary!

Now that the content of the data quality testing framework has been discussed let’s embed it within a Matillion pipeline.

Getting testy in DPC Designer

The analyst now has the dbt testing configuration in place to begin thinking about how the scripts will be implemented in a Matillion pipeline. Within the Matillion dbt Core component, a user can run as many or as few tests as desired, and if any of the tests were to fail, so too does the component in Matillion. 

This failure, as communicated by the component, will then enable downstream actions via a failure connector.

 

First to be discussed is how the tests being contained in a specific node will play a role here. As illustrated in the above section, the tests are contained in a new node called 2-hitting-ground-running. 

To select only those tests in the node via the dbt command, the analyst begins by creating a new variable called dbt_model with 2-hitting-ground-running as the value. Note: if you’re new to variables, the first article in this series—Lacing Up: Getting Started with Matillion and dbt—covers them in detail.

The analyst must append this variable to any command in the dbt Core component to run only those actions found within the 2-hitting-ground-running node. 

Within the first component, they set the dbt Command property to dbt list -s ${dbt_model}, which:

  • dbt list: lists all resources within the node
  • -s: selects the node in the dbt project
  • ${dbt_model}: Matillion variable containing the node name

Running the component, they receive the following in the log, which confirms only those resources that would be run in the 2-hitting-ground-running node:

The analyst then does the same with the following dbt Core component to run the dbt compile command, which runs successfully after validating the project code:

With these two commands, the analyst has confirmed which tests will be run within the selected node in dbt, as well as any code compiled within the project. 

The next step in the pipeline is to utilize the dbt test command, which will kick off the six data quality tests against the columns in STG_CHARLIES_SALES. As the pipeline is designed, the component running the tests will only run in the event of both the dbt list and dbt compile commands succeeding.

The dbt Core component will exist with the identical set of configurations as the prior two components, the only change being the dbt Command property being set as dbt test -s ${dbt_model}.

Success on the tests from top to bottom! But wait a second—there is no source freshness test in the log, which will allow the analyst to validate that the data is not stale. So rather than leaving the pipeline as is, the analyst edits it to run both the dbt test and dbt source freshness commands in parallel. 

The analyst decides to set another dbt Core component underneath the dbt test component. After setting the command to dbt source freshness… 

…it renders the following output:

As seen in the highlighted green section, the freshness test also passed with flying colors. The following has been confirmed: 

  • Validation of the resources to be run
  • SQL has compiled with no errors
  • Data quality verified by the passing dbt tests
  • Data staleness is not an issue

Our analyst has gained the confidence to now begin building a series of automated email notifications, some of which will contain analytics about our dataset.

From me to you: Automated email notifications in Matillion

The above sections have confirmed the use of tests configured in dbt to validate the cleanliness of the shoe sales dataset. This section will follow the success or failure of those tests to provide users with dynamic notifications they need to understand the state of the data.

Within a Matillion Orchestration pipeline, users can embed notification technologies as components. The ability to automatically send notifications can keep stakeholders abreast of the state of their pipelines and be set dynamically with success or failure connections between components. 

At the time of writing, the notifications available in Data Productivity Cloud include:

In the above workflow, there are three different emails which are to be sent in the following scenarios:

  • Send dbt Setup Failure Email: sent when either the dbt debug or dbt list commands fail
  • Send dbt Test Failure Email: sent when any of the dbt tests fail
  • Send dbt Test Success Email: sent when all of the dbt test succeed

But how do you configure the Send Email component? It relies upon SMTP, a widely-adopted technical standard for the sending of emails. Through SMTP, Matillion can trigger the sending of an email from many popular clients such as Gmail, Outlook, Apple Mail, and Yahoo Mail, with a user’s email address as the sender. 

In this case, our user wishes to send an email from their Gmail account. By using GMass.co, they can easily configure the account and be given an SMTP password that will be used in Matillion.

Our analyst then begins configuring their Send Email component.

In the component, the first block of properties (To Recipients, CC Recipients, Subject, Message, Sender Address and Reply Address) can be configured to whatever the given use case is. 

The second set (SMTP Username, SMTP Password, SMTP Hostname, SMTP Port, Enable SSL TLS, and Enable StartTLS) must be set in respect of the email provider being utilized. 

For example, the SMTP Username and Password must be unique to the user and smtp.gmail.com (SMTP Hostname); 465 (SMTP Port); Yes (Enable SSL TLS); No (Enable StartTLS) must be set exactly as such for any use of a Gmail SMTP server.

The metadata is the message: Embedding metrics in email notifications

Great! So, we’ve figured out how to handle notifications when all goes well. But what about if the dbt tests were to fail, and the analyst wants to know why the tests failed?

While the previous Send Email components are configured with hard-coded values, email bodies can also be populated with values from variables. The use of variables can make the body of the email dynamic and can communicate metrics as extracted from Snowflake. 

Our analyst wishes to capture & communicate such metrics, which communicate how many failures the tests are rendering. Using the store_results config in the test definition in the sources.yml file, the analyst has saved failures from an accepted_values test to a transient table in Snowflake, as communicated in the dbt tests log.

The analyst can map that number of failures to a Matillion variable and then communicate the failure count in the body of the email. Using the Query Result to Scalar component, the analyst is able to pass a SQL statement to Snowflake and map the result to variables in Matillion. The following query will be passed in the component to map the failure value (“true) and the count of “true” values (7896).

Then, when setting the Message of the Send Email component, the variables can be embedded…

… and the rendered email shows us not only the failure of this test—in this case, the value true being outside an accepted values test—and the count of failures in the database object being tested.
 

These are very simple examples of not only the use of the Send Email component but also how to dynamically populate the bodies with the results of the dbt tests. This wraps up the second workflow in which our analyst has implemented and set up notifications for their dbt testing framework!

Conclusion

In this second part of the Stepping into Data with Matillion and dbt series, we’ve taken significant strides in learning how to enhance a pipeline by integrating user-configured data quality checks and embedded notifications with failure information. 

By leveraging dbt’s testing capabilities, we learned how to build a data pipeline that is not only functional but also reliable, maintaining the highest standards of data hygiene. Additionally, we learned how Matillion’s powerful orchestration tools can automate the communication of test results, providing stakeholders with real-time insights into the pipeline's health.

With these tools in place, the analyst at Charlie’s Shoe Emporium can confidently rely on data pipelines to deliver accurate and timely information essential for making informed business decisions.

Read part three, the next and final installment of the Stepping into data with Matillion and dbt series, where the following will be covered:

  • Preparation and deployment of Snowflake Cortex workloads through dbt and Matillion functionality
  • Parameterization of Matillion components provided by Grid Variables
  • Using dbt build to materialize objects and test data quality with a single command

See you there!

Interested in having a go? Sign up for a free trial!

Karey Graham
Karey Graham

Snr Manager of Tech Services - PS & Partners

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.