It is nearly time for the Oscars! With the nominations recently announced the inevitable buzz over who the winners might be has commenced. We decided to conduct some data exploration and investigation to see if we could identify any trends that may suggest which film will be named the ‘Best Picture’ at the 2018 Oscars.
We will use Matillion ETL for BigQuery to load a number of different datasets into a BigQuery data warehouse using pre-built Matillion ETL components. Then we will build out Transformation jobs on data from previous years’ winners with the hope of finding some correlations or trends that will help us make a data-informed guess at who the 2018 ‘Best Picture’ winner may be.
The purpose of this activity is to show how you can use Matillion ETL for BigQuery to explore and investigate your data. One of the key benefits of Matillion ETL is that it eliminates the need to hand-code ETL jobs, freeing up your Data Analysts’ time for exactly that, data analysis. A similar process can be applied to any datasets where you want to investigate data to derive new insights. Data mining with Matillion ETL can help with this. For example, you could use some of the methods discussed below to anticipate customer habits, predict usage on IoT devices, or identify key variables that suggest certain behaviours – all from the existing data you already have.
Our first job was to identify existing sources that we could use. For your purposes, this data will probably already exist in your BigQuery data warehouse or Cloud Storage Bucket.
We downloaded all of the nominations as a CSV file and uploaded it to a Cloud Storage Bucket. Using the Cloud Storage Load Generator we generated the components to create a blank table in BigQuery and load the contents of the CSV file into it:
By running this job, information about the 2018 Nominees for ‘Best Picture’ were loaded into BigQuery.
The 2018 Best Picture Nominations:
- Call Me by Your Name
- Darkest Hour
- Get Out
- Lady Bird
- Phantom Thread
- The Post
- The Shape of Water
- Three Billboards Outside Ebbing, Missouri
OMDb Data API
www.omdbapi.com offers an unofficial API for the Internet Movie Database (IMDb). It allows searching for a film by title to retrieve information about the film such as genre, director, actor, plot and most importantly for this, ratings. Three average user ratings are available from IMDb, Rotten Tomatoes and Metacritic.
The OMDb API is a REST API which returns JSON formatted data, hence you can use the Matillion API Query component to bring this data into BigQuery. We created an API profile in Matillion to bring back the rating fields and also some additional data:
This information will be written to a table in BigQuery, as per the job below.
Box Office Data Excel
In addition to the ratings data, we have obtained some Box Office data to add to our analysis. We have one Excel Sheet per month of 2017 data showing the films ranked by box office takings:
This data is loaded into Matillion using the Excel Query component with an iterator to iterate through every tab of data and add a new column with the month name.
The above data loads made the 2018 nominations data and data from previous years’ winners available in BigQuery for exploration and investigation.
Before we can start deriving value from the data, we need to prepare it for investigation.
In order to query the OMDb API, we first of all need to clean the nominations data and reformat the film name into a URL encoded string to pass into the API. This was done using a simple Transformation Job in Matillion:
The names of the nominated films are brought in from an existing table in BigQuery and passed into the first Calculator component to clean these names and trim any blanks from them.
The Transpose Rows component pivots the data so each film only has one row of data in the table. If a film has been nominated in multiple categories, these will appear in a comma separated list in the category column.
The second calculator produces a URL encoded film name by replacing the spaces with “%20” signs.
The Rank component adds an identifier to each record. This identifier is used in conjunction with the iterator for when the API calls is made and the tables are created with the OMDb data.
The output component Create View saves the output of this Transformation into a view so it can be used elsewhere in Matillion.
From a list of previous winners from the year 2000 onwards, we created a job similar to the nominations to URL encode the data and then passed this into our API Query component to get the details of the ratings from the OMDb API:
A Table Iterator is used to iterate over the previous winners and pass the film titles into the API Query. This gave us all the ratings for the previous winners. Using this data we will be able to calculate an average rating as a measurement for analysis.
Using the prepared data, we will look to see if data about previous winners provides any indications about predicting a potential winner in 2018. Based on the data at hand, we looked at Box Office Takings, Average Rating, Primary Country of Origin, and Primary Genre.
Box Office Takings
A Matillion Transformation job is used to bring together the three data sets and transform it and to view the ratings together with the box office takings:
Analysing the output shows us that the highest rated film is Call Me by Your Name, however this does not have high box office takings, having only taken $9m to date, whereas Lady Bird only has a slightly lower average rating and has taken over 4x as much at the box office. If we’re considering mostly box office takings, Dunkirk is the winner closely followed by Get Out. However, both of these have been out since early or mid 2017. Unfortunately, this data does not seem to lend itself to any clear suggestions.
We also considered some limitations with the box office takings data. Looking at the released dates of the nominated films, some are yet to be released. Furthermore, if we look to compare box office takings from this year to previous, we would need to account for inflation and industry growth, i.e. films that made more in 2017 than in 2002 might not be worth more. Also there is a question about whether opening weekend vs. overall takings are more significant. Box office takings therefore proposed more questions than solutions. Therefore we are unable to derive any confident conclusions from this data set. This may be an area for future research and investigation.
So at the moment the data we have doesn’t give much away. Let’s bring in some additional data from previous winners…
Average film rating seems like a logical variable as it suggests likeability and popularity.
Now we will review the information we have on previous winners to see if we can use this to predict future results. Starting off with investigating the public ratings. We built a dashboard in Google Data Studio to chart the average ratings of the previous winner by year:
We can see that, apart from the 2005 winner (Crash) which had a rating of 74, all others had a high rating between 86 and 91. So based on this logic it’s unlikely that The Post (average rating 81)or Darkest Hour (average rating 78) will be the 2018 winners.
Primary Country of Origin
While Average Rating helps us to narrow down our list we need a deeper analysis to get a shortlist.
So now we’re going to do some segmentation analysis, by comparing the previous winners categorisation to the current nominations. First, we look at the primary country of origin for the film. On the left we have a pie chart showing the previous winners broken down by country. Here we can clearly see that films from the USA have dominated the ‘Best Picture’ category in previous years. Therefore if we use the USA’s past performance as an indicator, then it is fair to expect a film from the USA to win this year. The bar chart on the right from Google Data Studio shows the country break down of the current nominations. Again we can see that films from the USA make up the majority of the nominations. If we exclude films from outside the USA, we have 5 potential winners:
- Darkest Hour
- The Shape of Water
- Lady Bird
- The Post
- Phantom Thread
Due to overall lower average ratings, however, we have previously discounted The Post and Darkest Hour. This leaves us with three contenders still in the running: The Shape of Water, Lady Bird and Phantom Thread.
To further narrow down the list to a single potential winner, we need to look at some more data. We also had some data about genre. Continuing on with the segmentation analysis we can use this data to further reduce our shortlist, in the hopes of identifying a front runner. Again we have a pie chart on the left showing the primary genre, according to IMDb, for previous winners. Unlike with country of origin, there is a more even distribution across the genres. Crime, Drama and Biography come out on top, with 23.5%. To the right, we have the current nominations split by genre. In line with the findings from previous years, Biography and Drama are the most popular.
Biography is also discounted with the exclusion of The Post and Darkest Hour. We can see that Lady Bird is a Comedy and The Shape of Water is an Adventure film. Assuming again that past history is an indicator to future success, we think it would be unlikely for either of the films to be named the winner based on this variable. This leaves us with one remaining film, Phantom Thread, a Drama, as the expected winner.
‘Best Picture’ at the 2018 Oscars
Based on the data at hand, we expect Phantom Thread to win ‘Best Picture’ at the 2018 Oscars.
Areas for Hypothesis and Future Research
We fully admit that this data exploration activity was purely a bit of fun analysis, and therefore has several limitations. We could delve further into the analysis by reviewing some of the other supplementary data we have on the films. For example, we could have looked at director or main actors and actresses or even do some further analysis around the running time, etc. Also we did not consider the ranking of the variables. For example, genre could be a better indicator of winning than average rating. It is important that when you explore your own data you have a clear understanding of your data and variables in order to derive real value from it. The possibilities are endless where the data is available.