Identifying a rogue schedule in Matillion ETL
We were recently called to investigate an issue with the data for one of our customers – it kept disappearing!
On the first couple of occasions we decided it must have been user error and simply replaced the data. However, after another occurrence, some further investigation was required!
For the purposes of this article, we have recreated the issue using our raw_flights data. To demonstrate some of the functionality of Matillion, we use a table with 123 million rows of data covering all flights over the United States since 1987. This data is initially stored as a series of gzipped files held in an S3 bucket and is loaded, using the S3 Load Generator tool, into a Redshift table called raw_flights. Recently, this raw_flights table has been empty when data should have been available.
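Under the hood, this kind of load boils down to a Redshift COPY from S3. A minimal sketch of the equivalent statement is below – the bucket path and IAM role are placeholders, and the real options the S3 Load Generator produces will depend on your file format:

```python
# Sketch of the Redshift COPY statement behind an S3 load of gzipped CSVs.
# The bucket path and IAM role below are illustrative placeholders.
def build_copy_statement(table, s3_prefix, iam_role):
    """Return a COPY statement for gzipped CSV files under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "GZIP\n"
        "DELIMITER ','\n"
        "IGNOREHEADER 1;"
    )

sql = build_copy_statement(
    "raw_flights",
    "s3://example-flights-bucket/raw_flights/",    # placeholder bucket
    "arn:aws:iam::123456789012:role/RedshiftLoad"  # placeholder role
)
print(sql)
```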
Identifying the Culprit
As no one would admit to manually deleting the data, we assumed that a Matillion ETL job was being run to delete the data, or had accidentally been scheduled to do so. With over 20 projects currently in the affected Matillion environment, it isn’t feasible to manually search through each project to review the task history, so in order to identify the job causing the issue we needed to query the data via the Matillion REST API.
The Matillion REST API allows programmatic access to both the project and task data from Matillion. Further details are available via our Support Center here. For this example we used the Matillion API Profile which comes built into the product and provides details of the Run History. The API Query component was used in an Orchestration Job to connect to the Matillion API Profile and pull the Run History Summary and the Run History Details into two new Amazon Redshift tables:
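The run history comes back from the API as nested JSON, which the API Query component flattens into table rows. The same flattening can be sketched in Python – note the field names below are illustrative, so check the Matillion API Profile for the exact schema your version returns:

```python
# Flatten a nested run-history payload into flat rows suitable for loading
# into a Redshift table. Field names here are illustrative placeholders --
# check the Matillion API Profile for the actual schema.
def flatten_run_history(payload):
    rows = []
    for task in payload:
        for component in task.get("components", []):
            rows.append({
                "task_id": task["id"],
                "job_name": task["jobName"],
                "state": task["state"],
                "component_name": component["name"],
                "row_count": component.get("rowCount", 0),
            })
    return rows

sample = [{
    "id": 1, "jobName": "Load Flights", "state": "SUCCESS",
    "components": [{"name": "Flights Table", "rowCount": 123000000}],
}]
print(flatten_run_history(sample))
```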
This allows us to analyse the data in a Transformation Job. Using the Sample data tab on the Table Input Component, we can review the data available:
The Run History Summary contains details of the jobs which have run, how they started, whether they ran successfully or not, the start and end time of the job and the details of the environment which they ran in.
The Run History Details goes down to a component level to show how long the component took to run, the number of rows of data inserted, whether the component ran successfully or not and any associated message.
We used a Join component to join these two sets of data together so we could see each component next to its associated job. This gave us 2,657 rows of data – far too many to review manually. We needed more information to help!
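The join itself is straightforward: the two datasets share a task identifier, so each detail row can be matched to its summary row. A plain-Python sketch, with illustrative column names:

```python
# Inner-join run-history summary and detail rows on a shared task id,
# mirroring what the Join component does. Column names are illustrative.
def join_summary_details(summary_rows, detail_rows):
    summary_by_id = {row["task_id"]: row for row in summary_rows}
    joined = []
    for detail in detail_rows:
        summary = summary_by_id.get(detail["task_id"])
        if summary is not None:
            # Merge the two rows so the component appears next to its job.
            joined.append({**summary, **detail})
    return joined

summary = [{"task_id": 1, "job_name": "Load Flights", "start_time": 1528113600000}]
details = [{"task_id": 1, "component_name": "Flights Table", "state": "SUCCESS"}]
print(join_summary_details(summary, details))
```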
A SQL Component in the Transformation Job was used to query the svl_statementtext table, which holds the details of every SQL statement run on the database. By narrowing the selection down to just the statements that dropped the raw_flights table, we discovered that the table was being dropped at midday every day:
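A simplified version of that query is below, along with the midday check applied to the resulting timestamps. Note that svl_statementtext splits long statements across multiple rows, so a production query may need to reassemble them first:

```python
# Simplified form of the SQL Component query against svl_statementtext,
# which records every statement run on the Redshift cluster.
DROP_QUERY = """
SELECT starttime, text
FROM svl_statementtext
WHERE text ILIKE '%drop table%raw_flights%'
ORDER BY starttime;
"""

from datetime import datetime

def ran_at_midday(timestamps):
    """Return only the timestamps that fall exactly on 12:00."""
    return [ts for ts in timestamps if ts.hour == 12 and ts.minute == 0]

drops = [datetime(2018, 6, 4, 12, 0), datetime(2018, 6, 5, 12, 0),
         datetime(2018, 6, 3, 9, 30)]
print(ran_at_midday(drops))
```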
This suggests it was a scheduled job causing the problem.
Returning to the Transformation of the Run History and Run History Details, a filter was added to show only Orchestration Jobs, only components that were executing (rather than validating), and only components which ran successfully:
This left us 150 rows to evaluate.
A second filter was added to show only rows named Flights or Create, as it was likely that a Create Table component had been used to replace the raw_flights table, and that the component had either been renamed to Flights or left with its default name. Finally, a Calculator component was added to convert the start time from a timestamp to a readable date:
From the remaining 34 rows, this allowed us to identify those which had run at exactly midday over the last few days:
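Taken together, the filter and calculator steps amount to a small pipeline. A sketch in Python, with illustrative column names and an epoch-millisecond start time as the Matillion API returns:

```python
from datetime import datetime, timezone

def filter_runs(rows):
    """Keep successful, executing Orchestration components named Flights or
    Create, and convert the epoch-ms start time to a readable date.
    Column names and values here are illustrative placeholders."""
    out = []
    for row in rows:
        if (row["job_type"] == "ORCHESTRATION"
                and row["phase"] == "EXECUTION"
                and row["state"] == "SUCCESS"
                and any(w in row["component_name"] for w in ("Flights", "Create"))):
            started = datetime.fromtimestamp(row["start_time"] / 1000, tz=timezone.utc)
            out.append({**row, "started_at": started.strftime("%Y-%m-%d %H:%M")})
    return out

rows = [
    {"job_type": "ORCHESTRATION", "phase": "EXECUTION", "state": "SUCCESS",
     "component_name": "Flights Table", "start_time": 1528113600000},
    {"job_type": "TRANSFORMATION", "phase": "EXECUTION", "state": "SUCCESS",
     "component_name": "Flights Table", "start_time": 1528113600000},
]
print(filter_runs(rows))  # only the Orchestration row survives, at 12:00 UTC
```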
The results gave us one component, called Flights Table, run in a job at 12PM for the last few days. More crucially, we also had the job name and the project name, so we were able to simply go into that project, select Manage Schedules and disable the offending schedule:
Preventing the Issue from Recurring
Further investigation into why the job was deleting the data (and not recreating it as it should) revealed that the source files used to load the data had been deleted from the S3 bucket. To prevent the issue from recurring, a Lambda function was created to monitor the S3 bucket for deleted files.
When a file is deleted, the Lambda trigger fires and puts a message on an SQS queue. This SQS message is then picked up by Matillion, and a job is started to replace the data and reload the table. Further details on using a Lambda function to trigger an ETL job are available here.
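A minimal sketch of such a handler is below. The queue URL is a placeholder, and the message body is illustrative – the real payload must match the format the Matillion SQS listener expects for your group, project and job:

```python
import json

# Placeholder queue URL -- substitute the queue your Matillion instance listens on.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/matillion-reload"

def lambda_handler(event, context, client=None):
    """Fire on S3 ObjectRemoved events and queue a reload message."""
    if client is None:  # created lazily so the module imports without AWS credentials
        import boto3
        client = boto3.client("sqs")
    sent = 0
    for record in event.get("Records", []):
        message = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
            "action": "reload_raw_flights",  # illustrative; match your listener's format
        }
        client.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))
        sent += 1
    return sent

class StubSQS:
    """Stand-in SQS client so the handler can be exercised locally."""
    def __init__(self):
        self.messages = []
    def send_message(self, QueueUrl, MessageBody):
        self.messages.append(MessageBody)

event = {"Records": [{"s3": {"bucket": {"name": "flights-bucket"},
                             "object": {"key": "raw_flights/part-001.gz"}}}]}
stub = StubSQS()
print(lambda_handler(event, None, client=stub))  # 1 message queued
```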
The Matillion job uses an S3 Put component to copy all of the files back from a backup S3 bucket and then an S3 Load to load the files back into the raw_flights table:
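Outside Matillion, the restore step could equally be scripted against the S3 API. A sketch of the copy-back (bucket names are placeholders, and a stub client stands in for boto3 so the logic can be run locally; real buckets with many objects would need the list_objects_v2 paginator):

```python
def restore_objects(s3, backup_bucket, live_bucket, prefix):
    """Copy every object under a prefix from the backup bucket to the live
    bucket. Handles a single listing page; large buckets need pagination."""
    listing = s3.list_objects_v2(Bucket=backup_bucket, Prefix=prefix)
    contents = listing.get("Contents", [])
    for obj in contents:
        s3.copy_object(Bucket=live_bucket, Key=obj["Key"],
                       CopySource={"Bucket": backup_bucket, "Key": obj["Key"]})
    return len(contents)

class StubS3:
    """Minimal stand-in for boto3's S3 client, enough to exercise the restore."""
    def __init__(self, keys):
        self._keys = keys
        self.copies = []
    def list_objects_v2(self, Bucket, Prefix):
        return {"Contents": [{"Key": k} for k in self._keys if k.startswith(Prefix)]}
    def copy_object(self, Bucket, Key, CopySource):
        self.copies.append((CopySource["Bucket"], Key, Bucket))

stub = StubS3(["raw_flights/part-001.gz", "raw_flights/part-002.gz", "other/file.txt"])
print(restore_objects(stub, "flights-backup", "flights-live", "raw_flights/"))  # 2
```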
This article used the Matillion API and a Transformation Job to investigate an issue with a schedule that had been set to run when it shouldn’t have. It then used a combination of a Lambda function and a Matillion ETL job triggered from an SQS message to ensure that, even if the schedule is accidentally set again, the data will be replaced immediately.