Matillion Data Quality Framework
When you’re using Matillion to build data processing jobs, you are acting as a data consumer. There are upstream systems (probably many) in which primary producers are actively creating data. But by the time that data becomes your input, it’s read-only and non-negotiable.
Delivering good-quality data helps create accurate measurements and forecasts, and enables the business to make sound, auditable decisions. Nevertheless, data quality is not a goal in itself. If a good decision can still be made when half the data is missing, then half is enough. If there are known inaccuracies but they average out, then it’s accurate enough.
How do you determine exactly what is “good enough”? There are often attempts to reduce data quality to a single numeric value, but such numbers are close to meaningless. As this article will argue, context is everything, and a context-free score is at best unhelpful.
The Matillion data processing system is certainly a great place to thoroughly check the data that’s provided to it. It’s important to audit and report back on data that doesn’t look right, because:
- There may actually be real, and unknown, problems at source
- Nobody else will (or even can) check data in this way
- You may just be reading it wrongly
This article will describe some common methods for checking source data quality in Matillion data transformation jobs.
While implementing these checks, remember to check first of all that you’re interpreting the data correctly, because in fact the source data is actually 100% perfect…
Source data is always perfect!
If a live source system is good enough to run the business operations, then by definition its data is fit for purpose and is perfect.
With regard to timeliness, there’s always a tradeoff between immediacy and integration. You can only integrate data sources once they have all become available. If customers need more timely data then you’ll either need to tune the data processing to make it faster, or else stop trying to integrate the output with other, less-frequently available data.
Bearing in mind the above, the first rule of data quality is to keep an open dialog with business representatives, to discuss any problems which you think exist in the data.
You can start this by trying to quantify exactly what you think is wrong. What measurements are applicable to data quality, and how can you check them using Matillion?
Data Quality measurements are subjective
There are many ways to define objective, numeric data quality values, but they are not useful outside technical circles.
For example, an “airline flight” fact table might contain one million records. In fully 20% of these, the NumberOfLemonSoakedPaperNapkins column has no value! Also, 2% of the records have no destination airport. Obviously the lack of a destination is by far the more serious data quality problem, but this would not be reflected in a simple numeric metric.
As a second example, consider the consequences of accidental duplication. It would be immediately disastrous if an “aircraft reference” table contained just a single duplicate row. The business would not be able to reliably answer the enormously significant question “how many A380s do we own?”
In contrast, a table of Twitter posts might come to contain many thousands of duplicate records, perhaps as the result of an accidental double load. Despite the numerically large number of duplicates, the effect would probably be no more than a temporary and statistically insignificant uplift in the number of comments about the lack of paper napkins.
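To see why a single numeric score misleads, consider a naive “completeness” metric that weights every column equally. This is a hypothetical sketch (the column names and missing-value fractions are taken from the flight example above, purely for illustration):

```python
def completeness(missing_fraction_by_column):
    """Naive data-quality score: average non-missing fraction,
    with every column weighted equally."""
    fractions = [1.0 - m for m in missing_fraction_by_column.values()]
    return sum(fractions) / len(fractions)

# 20% of the napkin counts are missing, but only 2% of the destinations:
score = completeness({
    "NumberOfLemonSoakedPaperNapkins": 0.20,
    "DestinationAirport": 0.02,
})
print(round(score, 2))  # 0.89
```

The single score of 0.89 gives no hint that one of the two gaps is trivial and the other is serious; that judgment lives entirely in the business context.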
For these reasons, we don’t recommend that you try to publish objective data quality metrics out to business users. Focus instead on the fact that poor data quality is a business problem which may be helped by technical solutions.
We’ll look next at some typical technical solutions, which you can easily implement with Matillion in commonly-encountered scenarios.
Solutions for common problems
This section describes three situations which are always at or near the top of the list of most frequently encountered data quality problems.
Missing attributes

We’ll focus on missing attributes here, remembering that if you think there are missing records, it usually means you have not yet found the data source from which they need to be loaded!
The best technique in a Matillion Transformation job is to use “complementary filters”: add a Replicate component leading into two equal and opposite Filter components. This ensures that every individual row of data always goes down exactly one of the two possible paths.
The first, “happy path” filter might check that the mandatory attribute is present (for example, that the destination airport is not null).
This is the data that you want to keep, so the pipeline continues to normal downstream processing.
The second, “problem path” filter must be exactly opposite (in this example, that the destination airport is null).
The problem data should be written to an audit trail, for further investigation and discussion.
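The complementary-filter pattern can be sketched outside Matillion too. This hypothetical Python version (the rows and the destination_airport column are invented for illustration) splits rows on one condition and its exact negation, so every row lands on exactly one path:

```python
# Hypothetical flight rows; one is missing its mandatory destination.
rows = [
    {"flight": "MT101", "destination_airport": "LHR"},
    {"flight": "MT102", "destination_airport": None},
    {"flight": "MT103", "destination_airport": "JFK"},
]

def is_happy(row):
    """The 'happy path' condition; the problem path is its exact negation."""
    return row["destination_airport"] is not None

happy_path   = [r for r in rows if is_happy(r)]      # continues downstream
problem_path = [r for r in rows if not is_happy(r)]  # written to an audit trail

# Because the two filters are exact opposites, no row is lost or duplicated:
assert len(happy_path) + len(problem_path) == len(rows)
```

The closing assertion is the whole point of the technique: a single filter can silently drop rows, whereas a complementary pair accounts for every one.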
Before rejecting records, always make very sure it’s not just that you are missing a business rule. More on this subject shortly.
Duplicate records

When you find a “duplicate” record, you must first rule out the possibility that your data model is at fault.
For example, suppose you have two ticket records for the same person on the same flight, but with completely different seat numbers. Nobody can sit in two seats simultaneously, so this is a duplicate, right? In reality, seat allocations can be changed, and this data represents the history of updates. The model must allow for this.
In cases like the duplicated aircraft, where you really do have to remove duplicate records, the steps are:
- Agree the unique identifier
- Use an analytic function, such as ROW_NUMBER()
- Apply complementary filters on the value of the analytic function
In Matillion, you can use a Calculator component to add the analytic function, with an expression along the lines of ROW_NUMBER() OVER (PARTITION BY "tail_number" ORDER BY "load_time") — the column names here are illustrative.
Matillion also has a built-in “Rank” component which you can use for the same purpose.
Having applied the analytic function, you can then use another pair of complementary filters to remove records with duplicate keys from the “happy path” processing.
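The three steps can be sketched in Python as follows. This is a hypothetical illustration: the “aircraft reference” rows, the tail_number key, and the load timestamps are all invented, and itertools.groupby stands in for the warehouse’s ROW_NUMBER() window function:

```python
from itertools import groupby

# Hypothetical duplicated aircraft rows; "tail_number" is the agreed unique key.
rows = [
    {"tail_number": "G-XLEA", "model": "A380", "loaded": "2019-01-01"},
    {"tail_number": "G-XLEA", "model": "A380", "loaded": "2019-01-02"},  # duplicate
    {"tail_number": "G-XLEB", "model": "A380", "loaded": "2019-01-01"},
]

# Step 2: emulate ROW_NUMBER() OVER (PARTITION BY tail_number ORDER BY loaded).
rows.sort(key=lambda r: (r["tail_number"], r["loaded"]))
numbered = []
for _, group in groupby(rows, key=lambda r: r["tail_number"]):
    for n, row in enumerate(group, start=1):
        numbered.append({**row, "row_number": n})

# Step 3: complementary filters on the value of the analytic function.
happy_path   = [r for r in numbered if r["row_number"] == 1]
problem_path = [r for r in numbered if r["row_number"] > 1]  # audit these
print(len(happy_path), len(problem_path))  # 2 1
```

Note that the ORDER BY clause decides which copy survives, so it should itself come from a business rule (for example, keep the earliest-loaded record).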
But you may also be able to take advantage of “duplicate” records, for example to build a slowly changing dimension. The history of changes may be an opportunity to provide the business with valuable insights into its operations.
Date format disagreements

If your incoming flight reservation data includes a booking for “2009-01-11”, does that mean January 11th (the American interpretation), or November 1st (the Australian or British interpretation)? You’ll need to find the business rule which enables you to interpret the values correctly, rather than risk relying on defaults.
Another common issue in this context is strings with special meanings, which can be mixed in with ordinary “date” values.
For these “magic strings”, you will need to find and follow an appropriate business rule. For example:
- “TOMORROW” – follow the business rule and treat this as “system date + 1”
- Nonsense value (or entirely missing) – follow the business rule and leave it NULL
You can resolve the datatype disagreement and apply the business rules in a Matillion Calculator component. A suitable expression uses a pattern match to check that the string could conceivably be a date, and if so converts it using the explicit format mask YYYY-MM-DD.
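A hypothetical Python equivalent of such an expression might look like this, combining the pattern match, the explicit YYYY-MM-DD format mask, and the two business rules listed above (the rules themselves are assumptions for the sake of the example):

```python
import re
from datetime import date, timedelta

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def interpret_date(value, today):
    """Apply hypothetical business rules to an incoming 'date' string."""
    if value == "TOMORROW":                 # magic string: system date + 1
        return today + timedelta(days=1)
    if value and DATE_PATTERN.match(value):
        # Explicit format mask YYYY-MM-DD, never a locale-dependent default.
        year, month, day = map(int, value.split("-"))
        try:
            return date(year, month, day)
        except ValueError:                  # e.g. "2009-13-40": not a real date
            return None
    return None                             # nonsense or missing: leave it NULL

today = date(2019, 6, 1)
print(interpret_date("2009-01-11", today))  # 2009-01-11
print(interpret_date("TOMORROW", today))    # 2019-06-02
print(interpret_date("N/A", today))         # None
```

In a real Matillion job the equivalent logic would run as a SQL CASE expression in the target warehouse, but the shape of the decision is the same.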
Applying business rules
Simple business rules can be defined using combinations of Matillion Filter and Calculator components.
It’s possible to combine multiple components to implement much more sophisticated business rules.
For example, if the “distance” attribute is missing from a flight record, fill it in with the most commonly found distance value for other flights between the same origin and destination, in either direction.
As a data consumer, your first assumption in a Matillion Transformation job must be that the source data is correct.
The most important part of a data quality framework is having an active dialog with business representatives, where outliers and unusual-looking data values can be discussed. Simply publishing objective, numeric “data quality metrics” is not helpful in this process.
Actively questioning the input data will help you establish the business rules to handle unexpected data points.
This article has demonstrated various ways to detect and deal with unusual input data, and how to apply business rules.