Matillion uses R integration to investigate Volkswagen
A few years ago Volkswagen were famously caught using defeat devices to influence emission testing. They undertook a technical intervention inside the engine of every affected vehicle to stop this, and claimed that there would be no side effects.
As an interested 3rd party, and in some cases vehicle owners too, we decided to check this for ourselves!
We used Matillion ETL for Amazon Redshift to load a set of fuel economy data into the database, and to transform it ready for analysis. Then after installing the statistical package R on the Matillion compute instance, used its statistics library to check if there had been any adverse impact on fuel economy.
Here’s how we did it, and what we found out…
Matillion ETL for Amazon Redshift runs as a compute instance in your Amazon Virtual Private Cloud, although you can happily use all of Matillion’s functionality without even knowing this.
However, Matillion is not a SaaS service, and you have full administrative control over the compute instance. This means you can install and use a huge variety of additional software packages, including in this case the free statistical package R.
So to install R, and an accompanying database library, we logged onto the compute instance and issued a few administrative commands. These are fully documented in an accompanying technical document and YouTube video.
Fuel economy data
We had a good quality company car refuelling record stretching back several years, and crucially spanning the date of Volkswagen’s technical intervention.
Every record in the data set included a date, a mileometer reading and a number of gallons. This meant we could easily add a miles per gallon calculation to each record.
The date of Volkswagen’s technical intervention split the data set nearly down the middle, so there was a good sample of records either side. Individual refills would of course have some variability due to traffic conditions, but with that many records there would be plenty of opportunity for the figures to average out over time.
Would there be a noticeable difference in fuel economy before and after? More importantly, would any difference be statistically significant? In statistical terms the “null hypothesis” in this situation is that there’s no real difference in economy. To prove or disprove this required R’s specialist statistical functions, operating on data that Matillion loaded into Redshift.
Loading the data
The first step was to extract the data from its source location (Excel) and load it into Redshift: the “E” and “L” of “ELT”. This was easily accomplished using an Excel Query Orchestration component.
We had 103 records in total: 64 from before the technical intervention and 39 afterwards.
We needed to split the 103 records up into two separate data sets in order to run the necessary statistics function. This was of course the “T” part of “ELT”
To do this required four commonly-used data preparation techniques.
- Join to summary of self
- Derive new fields in just one place
- Multiplex with complementary filters
- Present using views
Join to summary of self
All the records were ordered by the RowId column in the Excel import, and one of the records (rowid 66) was flagged as the “First refill after intervention”.
Records before 66 belong to the “before” set, and after that point to the “after” set.
The way to split the data in this way in a transformation job is to use the same source twice: once just as itself (demo_efficiency data) and once aggregated down to a single record containing the value 66 (shown by the demo_efficiency Agg table input).
The Filter picks out the changeover point, and the following Aggregate ensures that there’s only every one record, just in case the source data accidentally includes more than one changeover.
While technically a SQL Join operation, the join to self actually doesn’t need a real join clause, and a constant expression such as 1=1 works just fine.
At this point, we had added an “intevention_rowid” with the fixed value of 66 on every row, ready for the next step.
Derive new fields in just one place
In this instance the decision point on how to split up the data was simple, and could easily be expressed in a single Calculator component.
We used a CASE statement to add another new column named dataset_ident.
This simple example serves to illustrate an important pattern: try to only perform derivations – no matter how complex – in just one place.
Multiplex with complementary filters
The last part of splitting the data into two also demonstrated the common and useful pattern of multiplexing a data set using a Replicate component, and having complementary filters on either path.
The two equal and opposite filters ensured that every source record only passed only one of the two filters, and on into the Create View component.
Present using views
It would have been technically fine to use a Rewrite Table operation as the final step of the transformation layer.
But as you can see from the screenshot above, we decided to use two named views instead. All the transformation logic (or “heavy lifting”) gets performed by Redshift in either case, but the advantage of views is that they automatically refresh. You don’t have to remember to reload them if the source data changes.
After running the job once, we had two data sets: one view with 64 rows of “before” data:
.. and one view with 39 rows of “after” data:
The data was now ready for statistical analysis!
The statistical analysis
Having previously installed R onto the Matillion compute instance, it was easy to invoke it using a Bash Script component in an Orchestration job.
The JDBC org.postgresql.Driver is pre-installed in Matillion so we were able to use it to acquire the prepared data from Redshift.
The main logic of the Bash Script involved selecting the mpg figure from the two named views, and then using an R library to run an independent sample to compare the two.
Output from the t-test was sent to the Matillion task console where we could check the results.
In statistical terminology, the p-value is the probability that there’s no difference between the averages of the two samples, and that any observed difference is just the result of chance.
With our data, the p-value indicated that it was very likely that there was no difference between the two. We were unable to disprove the “null hypothesis” described earlier.
There was a drop in fuel economy after Volkswagen’s technical intervention, as shown by a decrease in average miles per gallon.
- It was not a statistically significant difference, and
- With just one small data set we could not in any case prove that a reduction in average mpg was caused by changes made to the engine. It might have been influenced by different driving routes or patterns, the time of year, or other factors beyond our control
Looks like we won’t be phoning the lawyers just yet!
Download our free eBook below for loads of useful advice and best practices on how to optimize your Amazon Redshift setup