Build dbt pipelines in 4 easy steps with Matillion

Data transformation is a critical part of modern analytics, and pairing dbt (data build tool) with Matillion can streamline the process significantly. This blog will explain why dbt pipelines matter, walk through the Matillion setup process, and demonstrate how to integrate dbt within Matillion for effective data transformation.

The importance of building dbt pipelines for data integration and analytics 

dbt is a command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively. It’s particularly popular for analytics on top of data warehouses like Snowflake, BigQuery, and Redshift. Building dbt pipelines for data integration and analytics offers several significant advantages:

1. Modularity and Reusability:

   - dbt allows you to organize your transformations into modular "models" that represent specific business logic or analysis. These models can be reused across different projects, promoting consistency and reducing redundancy.

2. Version Control and Collaboration:

   - dbt integrates seamlessly with version control systems like Git. This enables teams to collaborate on analytics projects more effectively, track changes over time, and roll back to previous versions if needed.

3. Dependency Management:

   - dbt automatically manages dependencies between different models. When you update a source table or change the logic in one model, dbt intelligently identifies the downstream dependencies and ensures that impacted models are rebuilt in the correct order.

4. Incremental Builds:

   - dbt supports incremental model builds, meaning it only processes the data that has changed since the last run. This significantly improves the efficiency of the data transformation process, especially on large datasets, by avoiding unnecessary recomputation (see the model sketch after this list).

5. Documentation:

   - dbt allows you to document your models using a combination of SQL comments and Markdown. This documentation is then automatically generated and can be viewed through a web interface. This promotes transparency and makes it easier for team members to understand the purpose and usage of different data models.

6. Testing:

   - dbt supports the implementation of tests on your data models. These tests can be written in SQL and run automatically during the transformation process, helping ensure the accuracy and integrity of your data and providing confidence in the results of your analytics (a sample singular test follows this list).

7. Data Quality Assurance:

   - With built-in testing capabilities, dbt helps maintain data quality by allowing you to define and enforce expectations about your data. This is crucial for analytics and reporting, where the accuracy of insights directly depends on the quality of the underlying data.

8. Scalability:

   - As your data grows and your analytics requirements evolve, dbt provides a scalable solution. It's designed to handle large and complex datasets, making it suitable for organizations with diverse and growing data needs.

9. Collaboration Between Analysts and Engineers:

   - dbt promotes collaboration between data analysts and engineers by providing a common framework for managing SQL-based transformations. Analysts can focus on defining business logic, while engineers can handle the underlying data infrastructure.
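
To make the modularity, incremental build, and testing points concrete, here is a minimal sketch of a dbt model and a singular test. The model, column, and file names (fct_orders, stg_orders, order_total, and so on) are hypothetical placeholders, not files from the repository used later in this article:

    -- models/fct_orders.sql: a hypothetical incremental model.
    -- ref() pulls in another model and is also how dbt learns the
    -- dependency graph between models.
    {{ config(materialized='incremental', unique_key='order_id') }}

    select
        order_id,
        customer_id,
        order_total,
        updated_at
    from {{ ref('stg_orders') }}
    {% if is_incremental() %}
        -- on incremental runs, only process rows changed since the last build
        where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}

And a singular test, which dbt runs as part of dbt test and marks as failed if the query returns any rows:

    -- tests/assert_no_negative_totals.sql: a hypothetical singular test
    select order_id
    from {{ ref('fct_orders') }}
    where order_total < 0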

The four-step setup process with Matillion 

Step 1: Snowflake Deployment and Setup

You must first provision your target Cloud Data Warehouse (CDW) account. For the purposes of this article, Snowflake is our destination CDW of choice.

If you are using your personal account, you must have ACCOUNTADMIN access. If you do not have ACCOUNTADMIN access, you can deploy a trial Snowflake environment.

Enter the following into a Snowflake worksheet:
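
The exact script depends on your environment, but a minimal setup looks something like the following sketch; the object names DBT_WH, DBT_DB, and DBT_SCHEMA are placeholders, so adjust them to suit:

    -- Run as ACCOUNTADMIN; all object names below are placeholders
    USE ROLE ACCOUNTADMIN;

    -- A small warehouse for the dbt transformations
    CREATE WAREHOUSE IF NOT EXISTS DBT_WH
        WAREHOUSE_SIZE = 'XSMALL'
        AUTO_SUSPEND = 60
        AUTO_RESUME = TRUE;

    -- The target database and schema the dbt models will build into
    CREATE DATABASE IF NOT EXISTS DBT_DB;
    CREATE SCHEMA IF NOT EXISTS DBT_DB.DBT_SCHEMA;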

Step 2: Matillion ETL Deployment and Setup

You have two options to access Matillion ETL:

Matillion Hub Deployment

Matillion ETL can be launched in your VPC using the instructions here.

Snowflake Partner Connect

Alternatively, Matillion ETL can be launched through Snowflake Partner Connect. Partner Connect accounts are two-week trials; launch instructions follow:

  1. Within your Snowflake environment, navigate to Admin → Partner Connect, then click the “Matillion ETL” tile.
  2. Once the partner account has been created, click Activate.
  3. Matillion ETL will launch in a new tab in your browser, and you will receive an email containing a link to the instance and credentials to log in.

All dbt libraries and binaries come pre-installed on your Matillion instance.

Step 3: Git repository for dbt scripts

Matillion’s ability to execute dbt scripts depends on the scripts being present in a Git repository. Ensure that you have access to a Git repository containing all the dbt scripts. For the purposes of this article, we are going to use the dbt_matillion_repo GitHub repository, where we have some sample scripts. This is a public repository, so anyone with a provisioned GitHub account can access the artifacts.
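
The exact contents of the repository will vary, but a dbt project repo generally looks something like this (file names here are illustrative, not the contents of dbt_matillion_repo):

    dbt_project.yml          -- project name, paths, and model configs
    models/
        sources.yml          -- declarations of the raw source tables
        stg_orders.sql       -- staging and mart models (SQL + Jinja)
        fct_orders.sql
    tests/
        assert_no_negative_totals.sql   -- singular data tests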

Step 4: dbt in Matillion

It is now time to put our dbt project files to work. We must first set up the process that fetches the dbt scripts from the GitHub repo. This involves the Manage External File Sources menu and the Sync File Source component.

Once the dbt project files have been loaded, we can then run commands against them within Matillion ETL. Let’s get started.

Manage External File Sources

Here we will create a profile for the GitHub repository we wish to sync with.

  1. Go to Project > Manage External File Sources.
  2. Click the + button to add a new source and fill in the following:
    • Source Name: <A user-friendly name for your source>
    • Remote URI: <The link to your dbt Git repository>
    • Username: <Your Git username credential>
    • Password: <Your Git password credential>
    • Branch: <The Git branch holding the dbt scripts>

Click OK to save the profile.

Sync File Source

  1. Type dbt into the Components pane search bar to bring up the two dbt-related components.
  2. Drag the Sync File Source component onto the workspace and drop it as the next step after the Transformation job.
  3. Within the component’s properties, select dbt_matillion Repo from the External File Source dropdown.
  4. Right-click the component and select Run Component. If all is well, the task will report a “Successfully synced file source” message.

dbt component

Let’s begin running the fetched dbt project files against the dataset. We will start with a dbt debug command, which validates several things, such as warehouse connectivity and whether our configuration files work together. Let’s first create some variables that will be used in the Run dbt Command component.

Create two job variables, jv_target_database and jv_target_schema; these will be mapped to dbt environment variables in the Run dbt Command component.
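
Inside a dbt project, environment variables are read with the env_var() Jinja function. As a hypothetical illustration (this project does the equivalent in its sources.yml and dbt_project.yml config files, and the variable and table names below are placeholders):

    -- Hypothetical model showing how dbt reads environment variables
    select *
    from {{ env_var('DBT_TARGET_DATABASE') }}.{{ env_var('DBT_TARGET_SCHEMA') }}.raw_orders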

Run dbt Command Component

  1. Find the Run dbt Command component, drag it from the Components pane onto the workspace, and connect it to the Sync File Source component.
  2. Rename the component to dbt debug.
  3. Set the External File Source to the profile we created in the Manage External File Sources section.
  4. Set Command to dbt debug.
  5. Open the Map Environment Variables menu and map the Matillion job variables (jv_target_database and jv_target_schema) to the dbt environment variables referenced in the sources.yml and dbt_project.yml config files.
  6. Right-click the dbt debug component and select Run Component.
  7. Within the test message, we can see that all checks have passed, confirming the configuration files work together and that we can build our dbt model.

Beyond this, you can use the dbt component to invoke all types of commands. A full list of these commands can be found here.
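
For example, a few commonly used commands you could set in the Command property include (availability may vary by Matillion version):

    dbt run            # build all models in the project
    dbt test           # run the project's data tests
    dbt seed           # load CSV files from the seeds directory
    dbt docs generate  # compile the project documentation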

The dbt scripts that were pulled down in the Sync File Source step can be referenced in the dbt component. For full documentation, please refer here.

About the Author

Patrice Massieh is a senior technical integration consultant at Matillion Professional Services. He has over 25 years of experience in software development, solution architecture design and delivery, and client and project management.

Patrice Massieh

Senior Technical Integration Consultant