Investigating Football Data for World Cup Insights with Matillion ETL for Amazon Redshift
The excitement about the upcoming FIFA World Cup is reaching fever pitch. So we decided to take a look at some of the data. Even more so than other international sporting events, the World Cup produces a seemingly endless list of variables. This makes predicting a winner or favorite incredibly difficult.
With so much data, where do you start? We find a lot of the businesses we work with ask themselves the same question. Matillion ETL reduces development time. This frees up more time to mine for insights that you might not have previously had time to investigate. In this blog, we will use Matillion ETL for Amazon Redshift to explore football data beyond goals and stats to see if we can identify any variables with hidden insights. This same analysis can be done with Matillion ETL for BigQuery and Matillion ETL for Snowflake.
Does player nationality and domestic leagues affect performance in the World Cup?
We decided to look at player nationality data. Football (or soccer!) is arguably the largest international sport. Players often move around globally and play in a number of different countries throughout their careers. The World Cup is one of few instances where players represent their ‘home’ countries. This got us thinking – does the nationality of players and the domestic league in which they play affect performance in the World Cup? For example, a country with a high export rate of players may benefit from having a large number of internationally trained players. This is because they may have experience playing against players in their potential World Cup opponents squad. Alternatively, teams that have a low export rate have more experience playing with and against each other on a domestic level. This information may also provide insights into how effective ‘Homegrown Player Rules’ might be.
Most of the qualifying teams come from Europe (14), so we focussed on this region as a sample. We built a job in Matillion ETL for Amazon Redshift to bring in European National team data. This would give us the data we need to analyze the domestic leagues’ current season.
Processing the Data
So let’s get started. Using data from the football-data.org REST API to look at current season data (2017/2018), the first thing we will do is generate an API Profile in Matillion to retrieve the domestic teams in the top leagues from each country. We will then generate a second API Profile to retrieve player details for each team, including their nationality.
The job below will now retrieve the data. We will be able to re-use this job and run it for each National Team.
- The first API Query component, seen in the image above obtains all teams in a specific domestic league
- The ‘world cup’ transformation job cleanses the data. We are doing this so we can use it later down the line to feed into the iterator to obtain details of all players in each country
- Finally, we collate all of this together in one table. From there we can transform the data using a Transformation job in Matillion ETL
This will aggregate the data to show the split of nationalities within each national league:
A sample table of player nationality by country is below. This is the output of the Orchestration job called “team analysis” which gives one line of data per player showing the date of birth, player nationality, and the country they play domestic football for:
This data is then aggregated to show by country the count of native players and the count of non-native players.
We then pass this data through a Window Calculation component so we can see the total number of players by country. The Calculator component within Matillion ETL for Amazon Redshift calculates the percentage of native players per league. We write this data back into a new table using the Rewrite Table component:
Analyzing the Data
Visualizing this data in QuickSight, we can clearly see that by far England has the fewest native players in their domestic league with only 32% of players identifying as English. France, on the other hand, has 52% and the Netherlands has 57%:
Now let’s look at the historical importance of this.
2014 World Cup
We’ve got an Excel Spreadsheet containing details of every player who played for every team in the top league of our countries in their domestic 2013/14 season:
We load this into Amazon Redshift, to perform our analysis using the Excel Query component:
This transform does a similar calculation to the above transform and aggregates the data to calculate the percentage of native players per league.
The results are interesting. England and Portugal were once again the countries with the lowest percentage of native players in their top domestic league, with 40% and 49% respectively. Comparing this to Spain, who had 66% in 2014 and 57% in 2018.
So 4 years ago, a high proportion of the Spanish league was made up of Spanish players. For both England and Portugal, however, a bigger proportion of the league was formed of players not eligible to play for the national team. This is the same scenario for this year.
What does this mean in the context of World Cup results? The FIFA World Cup 2014 was won by Germany with the Netherlands coming second. England, Spain, Italy, and Portugal didn’t get beyond the Group Stages, with France being knocked out in the quarterfinals. From these results, there does not seems to be a correlation between what domestic league a player plays in and their national team performance.
2010 World Cup
We will bring in some additional data from 2010 to further investigate.
In 2010, Spain had the highest proportion of native players (71%) in their top domestic league. England and Portugal had significantly fewer with 42% and 48% respectively. This is interesting as Spain won the 2010 FIFA World Cup. Neither Portugal nor England got beyond the last 16.
This might suggest that the percentage of native players might need to be very high to impact performance, if it does have any impact at all. At the 2018 World Cup in Russia, we will be sure to watch out for teams that have over 70% national league players comprising their squad as this might indicate enhanced performance.
2018 World Cup
That being said, we have probably ruled out nationality as a strong indicator of performance for the World Cup. Based on the evidence, the Premier League may want to reconsider their ‘Homegrown Player Rules’ as this might not necessarily improve English football. Either way, we’re very much looking forward to the FIFA World Cup 2018.
Want to investigate your data for new insights? You can learn more about Matillion ETL for Amazon Redshift by arranging a free 1hr demonstration, or get started with a free 14-day trial.