Blog

Error Handling in Matillion ETL – Best Practices

This is the first blog in a three-part series on Matillion ETL error handling options. This blog describes some best practice approaches to making your job more resilient to errors. The second blog in this series will show you how to set alerts and the third will look at logging errors.

Why is Error Handling important?

ETL jobs in complex cloud environments fail for a wide variety of reasons. While this is to be expected, what is important is that the system can safely recover from errors, they are logged appropriately, and that system administrators can be notified immediately.

Matillion logs any errors encountered by the application in the Server Logs which can be downloaded from the Admin Menu. If you are encountering an error this is a great place to start your troubleshooting. However, there may not always be sufficient information – especially if an error is generated from a source system and not Matillion ETL itself.

For these scenarios, Matillion has an Auto Debug feature on the Data Stager components with a verbosity setting that allows you to capture much more information to help with troubleshooting.

This is great for determining the root cause of an error, but does not in anyway improve the resiliency of a job or notify an administrator, and can be too verbose for simple logging. That needs to be built into your jobs and the following best practices help achieve that.

Make Jobs Idempotent

The concept of idempotence is that if a job fails and is re-run, it doesn’t matter where it failed since re-running the job will return things into a consistent state. We have described this in detail in our technical article Building Jobs for a High Availability Cluster but it is a good practice to ensure all of your jobs are built in this way.

This includes using transaction control components. A good example of this is a Type II Slowly Changing Dimension job that first detects changes between the new data and the dimension table, then proceeds to close off changed records and then append new/changed records. This type of job needs to be spread across multiple transformations because the closing of records must happen before the append or the detect changes will not be comparing the correct data. In this scenario you would never want one transformation to complete without the other so you can use the transaction control components Begin, Commit & Rollback to make this job idempotent as in the example below.

 

Error Handling ETL - Best Practices - Make Jobs Idempotent

 

Implement a Retry Pattern

Applications built on cloud infrastructure can be sensitive to transient faults that occur such as a momentary loss of network connectivity, services which are temporarily unavailable, and timeouts that can occur when the service is busy or under heavy load.

For example, API throttling is common among SaaS applications to ensure performance and availability is not impacted by too many requests for data. These types of errors are usually self-correcting and will benefit from an appropriate retry pattern.

 

Error Handling ETL - Best Practices - Retry Pattern

 

The Retry Component can be attached to any component, and it is a good practice to use wherever components are communicating across networks such as Data Stagers or on sub-jobs which are idempotent.

Note: Take care when using the Retry Component inside a transaction block (between the Begin and Commit components). If an error occurs on the machine or platform Matillion is running on, retrying will also fail. This is because the transaction has not yet been rolled back. It is more sensible to attach the Retry Component to a Run Orchestration component where that job contains its own transaction logic.

Create Reusable & Shared Jobs

You will want to repeat your error handling process in multiple jobs, and across multiple projects, and often numerous times within the same job. To this end, the goal should be to create a parameterized job that can be easily dropped into existing jobs. This works by using the export tab on any component and passing some metadata into the sub-job, more on this in the second blog in this series, Error Handling Options in Matillion ETL – Alerting & Audit Tables. You could then convert this job into a Shared Job. Shared Jobs allow users to bundle entire workflows into a single custom component and then use those custom components anywhere else.

 

Error Handling ETL - Best Practices - Create reusable and shared jobs

 

Force jobs to end in failure

When you connect components via the on failure nub you are effectively handling that error. So, when the job completes it will be recorded in the task history as a successful run. Therefore, it is recommended that you use the End Failure Component after handling your errors.

 

Error Handling ETL - Best Practices - End failure

 

In the above screenshot, the Cleanup component is running a SQL statement. This returns the entities into the state before the ‘Create Flights Fact’ component was executed. This will likely run successfully but we still want the job to end in failure.

Other Options for Error Handling

Alerting & Audit Tables
Creating a Shared Job

Snowflake Resources

Admin Menu – Download Server Log
Auto Debug Feature
Data Stagers
Building Jobs for a High Availability Cluster
Detects Changes Component
Begin component
Commit Component
Rollback Component
Retry Component
Run Orchestration
Shared Jobs
End Failure Component

 

BigQuery Resources

Admin Menu – Download Server Log
Auto Debug Feature
Data Stagers
Building Jobs for a High Availability Cluster
Detects Changes Component
Begin component
Commit Component
Rollback Component
Retry Component
Run Orchestration
Shared Jobs
End Failure Component

 

Want to see how you can handle your ETL job errors? Select your data warehouse below to request a demonstration.