How Did They Build That? Creating a Data Visualization with the Help of Matillion ETL
Check out this dashboard. If you are working with data in the cloud, chances are you are going to want to visualize your data in a dashboard similar to this one. Having data and analyzing it is one thing; but the business value really comes when you can bring that data back to the stakeholders who need it, either by data visualization in a product like Tableau Public (in this case) or by bringing it back into an operational system.
The key to having a useful data visualization? Useful data, made so by cloud data transformation.
The Matillion flight data visualization: How did we build it?
If you’ve had a Matillion demo or seen one of our webinars, this data might look familiar to you. This dashboard, showing the delay and cancellation rates of domestic flights in the United States, is one of our most popular examples of using Matillion ETL to help build a solid data visualization.
How did we use Matillion ETL’s data transformation capabilities to design this visualization on Tableau Public? In this blog post, we’ll walk you through the steps.
There are two main architectural themes in this article:
- Use the processing power and scalability of Snowflake to aggregate the data to a manageable size.
- Use Matillion ETL to implement the sophisticated business rules which give a precise definition to the “delay” and “cancellation rate” KPIs.
We’ll demonstrate some development and performance techniques which will hopefully be useful while you’re working with Tableau and Matillion ETL.
Step 1: Acquire the raw data
First, we need the data. The underlying flight data is published here by the Bureau of Transportation Statistics (BTS) and can be downloaded month by month. It’s a simple but large dataset, totalling more than 6Gb while highly compressed.
Three closely related data sets are required for this dashboard:
You can download the transactional flight data from the BTS website right back to 1988. Every record identifies the carrier, and includes the IATA codes of the departure and arrival airports.
The airport data is available from the same source, although in this example it has been published via a hosted REST API for convenience. Every record has the latitude and longitude of the airport, which we’ll use in the main Tableau visualization to draw the flight paths between the airports.
Every flight is operated by exactly one carrier. For this dashboard we will just need the carrier code and the name. For example the carrier code abbreviation AA means American Airlines.
Data Extraction and Loading
In data integration, data extraction and loading is the important first step. But it’s fairly easy and straightforward. In this example, the flight data is held in cloud storage and the airport codes behind a REST API. Below are screenshots from the Matillion Orchestration Job.
At runtime, Snowflake makes short work of the bulky flight data. This Matillion component loads nearly 200 million rows into the database in around 60 seconds, depending on which size virtual warehouse you choose.
Rather than being in files, the airport codes are available via a REST API. With Matillion we often use two components for this purpose.
The Check API Access component is an extensibility feature downloaded as a Shared Job from the Matillion exchange. It verifies that Matillion has network connectivity to the API.
The Load Airport Codes component is a standard Matillion API Query, which efficiently queries the REST API and transfers the data into Snowflake.
Step 2: Define the KPI measures in Tableau
Next, you need to define your KPIs. The Tableau dashboard permits you to choose which of the two KPI fields to display.
The % Cancelled is the simpler of the two. The BTS flight data conveniently includes a field which indicates if a flight was cancelled. So the % Cancelled KPI is just the total flight count divided by the number that were cancelled. This can be calculated in the Tableau dashboard provided we supply a flight counter and a cancelled-flight counter.
The % Delayed KPI is slightly more complex, because it depends what you mean by “delayed”! This is endlessly arguable, but we’ll settle on the definition that a flight is delayed if its actual elapsed time is greater than 20% longer than the scheduled time.
There are a few very short domestic flights, including some that are scheduled as low as 10 minutes. It seems unfair to criticize a 2 minute delay on a 10 minute flight, so we’ll add a rule that we’ll only count flights where the scheduled airtime is longer than 30 minutes.
So the % Delayed KPI can also be calculated in the Tableau dashboard provided we output a delayed-flight counter too.
One of the main intentions of the dashboard is to highlight routes with exceptional delay or cancellation rates. Routes with very small numbers of flights overall can produce misleading output. For example one unlucky cancellation from among just 10 flights per year would show up at the same rate as 1,000 cancellations per 10,000 flights. To lessen the impact of outliers we will implement a rule to only consider routes with at least one flight per month.
One more thing you need to consider before continuing: It sounds unnecessary to state what is actually meant by “one flight”. But nevertheless it is vital to define this, because the definition underpins both KPIs. There’s no guarantee that the flight data reported by the BTS actually is at the desired granularity. So for this article we’ll apply the business rule that “one flight” is the unique combination of the following five fields:
- Flight Date
- Origin airport
- Destination airport
- Scheduled Departure Time
- Flight Number (which differentiates between multiple carriers operating the same route with the same timing)
As a sanity check, if you show up at the airport on the right date, with a ticket that shows a flight number, a departure time, and a destination, you’re going to get on the right flight!
When you dig further into this, there’s an implied requirement that those five fields have real values. Again it sounds superfluous, but it does mean we’ll need to
- Reject any records that don’t have values for all those five fields
- Reject any records that have duplicates of all those five fields
Duplication is a small problem in the BTS data, accounting for only around 0.0004% of records. But in total the filtering reduces the candidate record count down from 193,412,807 rows to only 55,240,645 rows.
When a report does not adequately specify all of the definitions, the developers have to make their own minds up when they build the logic. This is the root cause of many data quality problems. Multi-tier data architectures can help address this problem too; but that’s a subject for another article.
Meanwhile, in the next section, we’ll look at how to implement those business rules and KPIs in a Matillion Transformation Job.
Step 3: Transform the data in Matillion ETL
Again, data extraction and loading is the first step in preparing the data, and it’s the easy part. Proper data transformation is the hard part, and the most critical step in making that data useful so it yields valuable insight.
Data transformation done right is highly sophisticated, but doesn’t need to be complicated if you have the right tool for the job. Here’s a screenshot of the whole Matillion Transformation Job, which implements the business rules and calculates the flight delay and cancellation rate KPIs. As you can see, visually it’s very straightforward. But that simple interface sits on top of some fairly complex logic.
There are many parts to the logic of this job, and we’ll dive into the details of each one.
The data transformation begins with the Flights in Geo component. This component does the bulk of the filtering. It restricts the data to only include 2010 and onwards, and within the required geographical area. The visualization is for the lower 48 states only, so flights to or from Alaska and Hawaii can be excluded by only considering airports where the longitude is > -130. (No offense to Alaska and Hawaii, just trying to keep the visualization area simple for demo purposes!) The BTS domestic data does include some flights to and from Mexico and the Caribbean so these can be excluded by applying a filter on airports where latitude is > 24.
It’s always a good idea to check the results, so a glance at the Sample row count reveals that we have 64,497,685 rows to work with.
As a sanity check, we know the BTS data goes back to 1988 and we’re filtering to roughly one decade so we would expect to see about one third of the original 193 million rows here.
Of note to Matillion developers, this component demonstrates one possible application of a Common Table Expression in a Snowflake query.
If you download the job and look at it closely yourself, you’ll notice that this component is the only place where the airport reference data is used. Where are the latitudes and longitudes that we’ll need later to draw the flight paths?
It would be technically fine to add the spatial data at this point. However, it would result in a significantly larger Tableau extract, with the consequent risk of causing performance problems with the visualization. So in this example, we’ve taken the slightly unusual step of postponing the denormalization of the coordinates and performing it entirely client side. More on this subject in the Tableau Techniques section later.
The Cancellation KPI
As described earlier, this is a simple calculation for Tableau to perform in conjunction with its dynamic filtering.
Through Matillion, we just need to use a Calculator component to define a cancelled-flight counter:
...and a simple flight counter as the denominator.
The Delay KPI
We have an agreed definition of what makes a flight count as “delayed.” The actual elapsed time must be at least 20 percent longer than the scheduled time.
But as is normal with a large data set, there are some oddities and outliers. Sometimes the actual elapsed time is missing, and sometimes it’s negative. For a few records, the ratio is either very high or very low.
So the Matillion logic is split into two parts:
- An expression that calculates the delay proportion
- An expression that flags the flight as “delayed” if the proportion is larger than 1.2 (i.e. if it’s at least 20% greater than the scheduled flight time)
Within the Calculate derived fields component, a SQL expression to calculate a new field named delay_proportion:
CASE WHEN "ActualElapsedTime" IS NULL THEN 1 WHEN "ActualElapsedTime" > 5 * "CRSElapsedTime" THEN 5 WHEN "CRSElapsedTime" > 5 * "ActualElapsedTime" THEN 0.2 ELSE ROUND("ActualElapsedTime" / NULLIF("CRSElapsedTime",0), 1) END :: NUMBER(4,1)
Then within the Quality Indicator component, another SQL expression to convert it into a counter:
CASE WHEN "delay_proportion" >= 1.2 THEN 1 ELSE 0 END :: BYTEINT
Join as a filter
Along with the delay_proportion calculation, the Quality Indicator component also has a filtering role. A calculated field named :_indicator is introduced here that flags all records where:
- Any of the five mandatory fields are missing
- Flight time is 30 minutes or less
- The origin is the same as the destination
The dq_indicator is set to 1 for every record that passes all the above checks.
Not every airline operator is required to report on-time performance. The BTS only requires this data from carriers that have at least 0.5% of total domestic scheduled-service passenger revenue.
A join is the perfect way to remove any carriers that don’t report delay statistics. In addition to a join, the “Filter valid only” join component performs another task. Its join clause looks like this:
"f"."IATA_CODE_Reporting_Airline" = "c"."code" AND "f"."dq_indicator" = 1
So, at the same time it filters the carrier codes, this component can also remove any other data we do not wish to consider.
Of note to Matillion developers, the Carrier Codes component is not a real table: it’s a hardcoded list of the 17 reporting airlines from the BTS website. It’s usually not good practice to hardcode values using a Fixed Flow component like this. But in this case, we’re literally just copying data from the BTS website with a CTRL-C, so it seems reasonable to simply paste it into a virtual database table with a CTRL-V!
The last data filtering operation comprises two components that are commonly used together for this purpose:
We configure the Rank component to generate a ROW_NUMBER based on what we decided would be the five-part unique identifier for flights. This is known as the partition key:
The subsequent filter component keeps only records where dup_seq is 1. That will have the effect of retaining all records taht were unique already (i.e. most of them), and keeping only one instance of records that were duplicated any number of times.
Aggregating the data
The aggregation component generates a GROUP BY in SQL, and aggregates the data up to the required level. It’s a grouping up to Year, Carrier, Origin, and Destination, which will permit us to use those four filters in the Tableau dashboard.
The Aggregation is set to the required granularity:
The three row counters we have defined are all measures that can be summed across any dimension, so they are added under the Sum aggregation type.
Implementing a CTAS
The last component in the Matillion Transformation Job simply saves the aggregated data into a named Snowflake table.
After all the filtering and aggregation, the original 193 million records are reduced to just over 80,000 rows, a factor of over 2000. Tableau Public can handle a single file containing this data with no problem at all.
Exporting to cloud storage
There are three data sets that the Tableau dashboard will make use of:
- The 80,000 rows of aggregated data from the Snowflake table created in the previous section
- Latitudes by airport (used in the Tableau Techniques section below)
- Longitudes by airport
For portability, we’ll export them into cloud storage using S3 Unload components.
Now all of the data is ready for use. In the next section, we’ll look at the Tableau visualization techniques.
Step 4: Visualize the data in Tableau
Now that the data is transformed, we want to use Tableau data visualization techniques to bring it to life.
There’s only one Data Source, which is the agg_delay_stats CSV file that was extracted in the previous section.
Because we have the base rowcounts, the KPI expressions are easy to create in Tableau. That way, they will update automatically as we interact with the filters in the dashboard.
The DelayPercent calculated field is:
100 * [Long Delay Counter] / [Counter]
Similarly, the CanxPercent calculated field is:
100 * [Cancelled Counter] / [Counter]
Don’t forget to set the Default Properties / Aggregation to Average for both these.
The dashboard incorporates them both into a swappable measure, using the technique described in Tableau’s online help documentation here.
Deduplication in the spatial layer
Drawing the flight arcs is the most expensive operation that this dashboard is going to perform on Tableau Public, so we want to make it as easy as possible.
One thing that will help is to group together all flights operating the same route, regardless of direction. In other words it would be wasteful to get Tableau to draw the DEN-JFK arc, and then draw a second JFK-DEN arc completely covering the first one.
Instead we’ll merge them using an alphabetically sorted route name. First add a calculated field AlphaSortedOrigin:
IF [Dest] < [Origin] THEN [Dest] ELSE [Origin] END
Then, similarly add a calculated field AlphaSortedDest:
IF [Dest] > [Origin] THEN [Dest] ELSE [Origin] END
Now combine the two in a third calculated field called RouteName:
[AlphaSortedOrigin] + ' - ' + [AlphaSortedDest]
RouteName will be added to the Detail of the Marks card, and will appear in the tooltip. The AlphaSortedOrigin and AlphaSortedDest will determine the position of the arcs, as we’ll see next.
Client side coordinate denormalization
The main worksheet in the viz is the map of the lower 48 states, showing the selected flight route arcs. Recall in the Spatial Filtering section above we decided not to add the latitudes and longitudes to the data extract. This roughly halves the size of the data extract, so is a great savings.
There are only a few thousand distinct airport codes, so it’s easily within the realm of a calculated field in Tableau. It’s far too much to accurately type by hand, but a Matillion Transformation Job to the rescue…
The key is the Expression generators Calculator component. Two new Snowflake fields are added:
'WHEN ''' || "iata" || ''' THEN ' || TO_CHAR("lat")
'WHEN ''' || "iata" || ''' THEN ' || TO_CHAR("long")
The resulting lines are valid syntactically to use in a Tableau derived field. It’s probably easier to show from the data sample than to explain:
You can simply copy and paste the text into a new calculated Tableau field based on he AlphaSortedOrigin above.
Continue exactly the same for DestLat, OriginLong and DestLong, and there’s now enough information to construct the spatial feature named RouteMark using some of Tableau’s spatial functions.
It’s arguably more convenient to copy and paste from a CSV file. Hence the S3 Unload v_airport_lat and S3 Unload v_airport_long tasks in the Matillion Orchestration Job shown in the Export to cloud storage section above.
Matillion developers hopefully also noticed another rank-and-filter component combination to remove any duplicated codes, which would be hard to pick out manually.
And that’s how you do it!
To recap, we used Matillion and Snowflake to filter and aggregate a dataset of roughly 200 million rows down to size that’s manageable for a desktop tool.
We spent some time and care to create two clearly defined KPIs on the data. Most of that work happened in Matillion Transformation jobs.
Importing the much smaller, summarized data into Tableau enabled us to build an interactive dashboard showing the KPIs over various dimensions.
We ended with a short demonstration of how to generate Tableau expressions using Matillion ETL. This is a development technique you might consider in order to achieve optimal performance when users are interacting with your Tableau dashboard.
Want to see how to transform and visualize your data in the cloud? Request a demo of Matillion ETL.
Manager of Developer Relations
Data Mesh vs. Data Fabric: Which Approach Is Right for Your Organization? Part 3
In our recent exploration, we've thoroughly analyzed two key ...eBooks
10 Best Practices for Maintaining Data Pipelines
Mastering Data Pipeline Maintenance: A Comprehensive GuideBeyond ...News
Matillion Adds AI Power to Pipelines with Amazon Bedrock
Data Productivity Cloud adds Amazon Bedrock to no-code generative ...