Ask a Solution Architect: How Do I Build a Data-Driven Workflow?

  • Dan D’Orazio, Solution Architect
  • August 30, 2019

Q: I’ve heard a lot about the benefits of a data-driven workflow. How do I create one in Matillion? 

Before we dive into a practical discussion on how to build a data-driven workflow, I think it’s valuable to briefly discuss what we mean by ‘data-driven’. The premise, at a high level, is to create workflows that take advantage of design patterns and parameterization, and that encourage a write-once-use-many philosophy. In doing so, we can inject implementation details into these workflows at runtime, driving their execution with the data that’s injected. Data-driven workflows have a number of benefits when compared to individual, static workflows.

  • Scalability: A pattern-based, data-driven approach, by its very nature, enables the reuse of existing workflows, so future development efforts that require the same functionality can be efficient, if not negligible.
  • Maintainability: A consistent, reusable process is less prone to errors than a manually created, one-off workflow. This is especially true when the workflow is complex.
  • Flexibility: When it’s time to modify the design, or, in the (unlikely) event that you find a defect, a reusable workflow can be updated, tested, and applied globally.

There are many ways to create a data-driven workflow. The examples below are not meant to be all-encompassing, but rather to inspire the imagination. However, there are some foundational elements that are likely to be used in nearly all data-driven workflows, namely Variables and Iterators. Shared Jobs can be created and used to enhance data-driven workflow functionality, but they are not strictly required to create the desired effect.

*These examples use the AdventureWorks sample database for Microsoft SQL Server.

Building a data-driven workflow: Basic iteration  

We’ll start with a very simple example and comparison.

[Screenshots: a basic iteration built as a manual workflow (left) and the same iteration using a Fixed Iterator (right)]

Both orchestrations above perform the same function. They both read from the same source database, query three tables, and land that data into our data warehouse. The job on the left requires a Database Query component for each new source table. 

The job on the right uses a Fixed Iterator to set a collection of Job Variables (below) which are referenced in the Database Query. We need only add a new row with related values to the iterator to accommodate a new source table. Existing job variables for hostname, database, etc. could be stored just as easily, and perhaps preferably, in Environment Variables.
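To make this concrete, the SQL Query property of the Database Query component can reference those job variables using Matillion’s ${...} substitution syntax. A minimal sketch is below; the variable names (source_schema, source_table) are placeholders for whatever your iterator defines, and connection details such as hostname and database are supplied to the component’s other properties in the same way.

    -- Hypothetical SQL Query property of the Database Query component.
    -- ${source_schema} and ${source_table} are job variables set by the iterator.
    SELECT *
    FROM   ${source_schema}.${source_table}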

[Screenshot: the Fixed Iterator properties]

This rather vanilla design may not be production-ready, but I think it provides the appropriate contrast between a data-driven approach and its static counterpart.

External metadata

Now let’s add on to our example by moving the data contained in the fixed iterator into our data warehouse.

We’ll start by creating a table in the data warehouse using a Create Table component. This table will store the values currently held in our fixed iterator. We’ll also expand on the attributes slightly, to provide more flexibility. We’ll populate the table with some initial values using a SQL Script component.
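To help visualize the result, here is a rough, Snowflake-flavored sketch of the equivalent DDL and seed data. In the job itself the table is built with the Create Table component and the inserts run from the SQL Script component; the column names, types, and sample rows below are assumptions for illustration only.

    -- Hypothetical shape of the metadata table (your attributes may differ).
    CREATE TABLE IF NOT EXISTS ETL_METADATA (
        source_schema  VARCHAR,
        source_table   VARCHAR,
        target_table   VARCHAR,
        active_flag    BOOLEAN DEFAULT TRUE
    );

    -- Seed it with a few AdventureWorks tables, mirroring the fixed iterator.
    INSERT INTO ETL_METADATA (source_schema, source_table, target_table, active_flag)
    VALUES ('Person', 'Person',           'PERSON',             TRUE),
           ('Sales',  'SalesOrderHeader', 'SALES_ORDER_HEADER', TRUE),
           ('Sales',  'Customer',         'CUSTOMER',           TRUE);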

[Screenshot: the Create Table component]

Next, we’ll replace the Fixed Iterator with a Table Iterator, set the Target Table property to the ETL_METADATA table created in the previous step, and map the columns to the appropriate job variables (we’ve created some additional ones to hold new attributes). 

[Screenshot: the Database Query driven by a Table Iterator]

The job looks (almost) exactly the same, but the metadata that drives our iterator is now stored in, and retrieved from, the data warehouse. To extract from a new table in the source system, we just add a new row to the ETL_METADATA table, and the next run will include that source as an additional iteration.
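For example, registering a new source could look something like this (using the same hypothetical columns from the sketch above):

    -- Add a new AdventureWorks source; it will be picked up on the next run.
    INSERT INTO ETL_METADATA (source_schema, source_table, target_table, active_flag)
    VALUES ('Production', 'Product', 'PRODUCT', TRUE);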

Pretty slick.

If you’d prefer to use a custom query to populate your iterator, you can create a similar pattern using the Query Result to Grid, Grid Variable, and Grid Iterator components in place of the Table Iterator.

[Screenshot: the orchestration job using the Query Result to Grid and Grid Iterator]

The Query Result to Grid populates your Grid Variable and the Grid Iterator consumes it. This gives you more control over the criteria by which your Grid Variable is populated, allowing you, for example, to add conditional logic that determines whether a row in your ETL_METADATA table is a candidate for selection.
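A hypothetical query for the Query Result to Grid might look like the one below, where an assumed active_flag column filters which metadata rows become iterations:

    -- Only active metadata rows are loaded into the Grid Variable.
    SELECT source_schema,
           source_table,
           target_table
    FROM   ETL_METADATA
    WHERE  active_flag = TRUE;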

Bringing it all together

Our example works great if you’re only interested in iterating over a single component. In practice, however, it’s likely that we’d want some additional components to run for each iteration: capturing some statistics on success and/or sending a notification on failure, perhaps.

To create the child Orchestration Job, we’ll copy the existing job (so that it includes all of our Job and Grid Variables) and remove the Iterator and the Query Result to Grid, leaving only the Database Query. We’ll use CloudWatch to log the row count on success and SNS to send a notification on failure (or the equivalent components for your platform). Remember, the real aim here is to show how you can iterate over a child job and pass the relevant information down to it; your ideal design for what this job does could be quite different.

[Screenshot: the child Orchestration Job]

Next, we’ll remove the Database Query from the original job that was copied, and replace it with the new Orchestration Job that we just built, attaching it to the iterator. 

[Screenshot: the new Orchestration Job attached to the iterator]

The last thing we have to do is pass the variables down from the parent into the child. We do that by setting the Set Scalar Variables property of the Run Orchestration component. The variables in each job are named exactly the same (because we made a copy), so we set each ‘Value’ to the corresponding parent variable (below).

[Screenshot: the Set Scalar Variables property of the Run Orchestration component]

Finally, we can test our failure notifications by adding a row to our metadata table with details that won’t be found in our source system, and confirm that our RowCount metric is logged in CloudWatch.
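A hypothetical failure-test row, using the same assumed columns as before, could simply reference a table that doesn’t exist in the source:

    -- This table doesn't exist in AdventureWorks, so the Database Query
    -- should fail and trigger the SNS notification.
    INSERT INTO ETL_METADATA (source_schema, source_table, target_table, active_flag)
    VALUES ('Sales', 'TableThatDoesNotExist', 'FAILURE_TEST', TRUE);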

Summary

And there we have it. Adding additional source tables is as easy as adding rows in our ETL_METADATA table. We can be reasonably assured that each will run consistently, adhering to the pattern that we’ve designed, while also allowing us to make adjustments in a single place, should the need arise.

While the examples above only make use of Orchestration Jobs, it’s important to point out that these methods can be applied in Transformations as well. For additional details, check out our Using Grid Variables to Apply Business Rules in a Transformation Job support page.

*The referenced jobs were created with version 1.40.10 of Matillion for Snowflake on AWS. Some slight modifications may be necessary in other product versions.

Here are job files for all Matillion ETL products, for you to download and try out:

Data-driven workload: Amazon Redshift

Data-driven workload: Snowflake for AWS

Data-driven workload: Snowflake for Microsoft Azure

Data-driven workload: Google BigQuery