Scale your data team’s output by up to 100x. We'd love to prove it.

Challenge Maia at Snowflake Summit

Who are the most successful Olympic athletes in history?

To mark the opening of the 2024 summer Olympics, this article contains a short piece of data analysis to find the most successful olympians of all time.

To follow along, you'll just need a Snowflake account and a SQL client such as SnowSQL.

Load Historical Olympic Athlete Data

Two datafiles will be used in this analysis:

  • athlete_events.csv.gz - all the athletes in every Olympic event, as a CSV with lines separated by CRLF, and one header line. The granularity is one record per athlete, per Games.
  • noc_regions.csv.gz - the National Olympic Committee country codes, as a CSV separated by a single CR character, and also with one header line. The granularity is one record per NOC country.

Begin by downloading the files, and saving them locally:

aws s3 cp s3://devrel.matillion.com/data/structured/olympics/athlete_events.csv.gz .
aws s3 cp s3://devrel.matillion.com/data/structured/olympics/noc_regions.csv.gz .

If you don't have the "aws" CLI installed, you can use HTTPS URLs instead, for example using curl:

curl -o athlete_events.csv.gz https://s3.eu-west-1.amazonaws.com/devrel.matillion.com/data/structured/olympics/athlete_events.csv.gz
curl -o noc_regions.csv.gz https://s3.eu-west-1.amazonaws.com/devrel.matillion.com/data/structured/olympics/noc_regions.csv.gz

Upload the files into the current user's Snowflake stage using SnowSQL:

PUT file://athlete_events.csv.gz @~/olympics;
PUT file://noc_regions.csv.gz @~/olympics;

Snowflake schema inference

Now create a table to hold the data using schema inference. First a file format for the Olympic athlete data:

CREATE OR REPLACE FILE FORMAT ff_olymp_ath
TYPE = CSV
COMPRESSION = AUTO
RECORD_DELIMITER = '\r\n'
FIELD_DELIMITER = ','
PARSE_HEADER = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"';

Then the table DDL:

CREATE OR REPLACE TABLE olympic_athletes USING TEMPLATE
    (SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
     WITHIN GROUP (ORDER BY ORDER_ID)
     FROM TABLE(INFER_SCHEMA(LOCATION => '@~/olympics/athlete_events.csv.gz',
                             FILE_FORMAT => 'FF_OLYMP_ATH')));

Once the table has been created, load the data with a COPY command:

COPY INTO olympic_athletes
    FROM '@~/olympics/athlete_events.csv.gz'
    FILE_FORMAT = (FORMAT_NAME= 'FF_OLYMP_ATH')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

The National Olympic Committee country codes are in a slightly different format, in which the line delimiter is a single CR character.

CREATE OR REPLACE FILE FORMAT ff_olymp_noc
TYPE = CSV
COMPRESSION = AUTO
RECORD_DELIMITER = '\r'
FIELD_DELIMITER = ','
PARSE_HEADER = TRUE
FIELD_OPTIONALLY_ENCLOSED_BY = '"';

Then the DDL for the country codes table:

CREATE OR REPLACE TABLE noc_region USING TEMPLATE
    (SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
     WITHIN GROUP (ORDER BY ORDER_ID)
     FROM TABLE(INFER_SCHEMA(LOCATION => '@~/olympics/noc_regions.csv.gz',
                             FILE_FORMAT => 'FF_OLYMP_NOC')));

Once the table has been created, load the data with another COPY command:

COPY INTO noc_region
    FROM '@~/olympics/noc_regions.csv.gz'
    FILE_FORMAT = (FORMAT_NAME= 'FF_OLYMP_NOC')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

Analyzing athlete performance

Now that the two staging tables have been created and loaded, they can be joined on the common NOC country code column.

The NOC code table has three columns, so the actual country code has to be derived using the notes where they are set, and falling back to the region. The SQL expression is:

COALESCE("notes", "region")

The golden rule of data integration

As every data engineer knows, the golden rule of data integration is that where there's a rule, some data will come along and break it.

This dataset is no exception: some of the athlete data contains a NOC code that is not present in the lookup table. You can verify with the following SQL query:

SELECT NOC FROM olympic_athletes
EXCEPT
SELECT NOC FROM noc_region;

This means the tables will need a left join, and the missing country code will just have to be marked as unknown. The SQL query to integrate the tables is as follows:

SELECT a."Name", a."Games", a."Medal", a."NOC", COAL ESCE(r."NOCName", 'Unknown') AS "NOCName"
FROM olympic_athletes a
LEFT JOIN (SELECT r.NOC, COAL ESCE(r."notes", r."region") AS "NOCName" FROM noc_region r) r
ON a."NOC" = r."NOC";

How to use a WITH clause

The above join will be the foundation of all the subsequent analysis. So for simplicity I recommend converting it into a WITH clause. An added bonus is that you can now easily check how many records are being returned:

WITH data AS (
    SELECT a."Name", a."Games", a."Medal", a."NOC", COAL ESCE(r."NOCName", 'Unknown') AS "NOCName"
    FROM olympic_athletes a
    LEFT JOIN (SELECT r.NOC, COAL ESCE(r."notes", r."region") AS "NOCName" FROM noc_region r) r
    ON a."NOC" = r."NOC")
SELECT COUNT(*)
FROM data;

When you run this yourself you should find 271116 rows: the same number as in the original olympic_athletes table.

Classifying athlete performance

In this article I'll judge success by the number of medals won, where a gold medal is worth 3 points, a silver 2, and a bronze 1 point.

In SQL the classification is done with a CASE expression like this:

CASE "Medal" WHEN 'Gold' THEN 3 WHEN 'Silver' THEN 2 WHEN 'Bronze' THEN 1 ELSE 0 END

RANK() vs DENSE_RANK() vs ROW_NUMBER()

In order to rank the athletes by overall performance, it's necessary to generate some kind of ordering in the results. The next piece of SQL introduces a second WITH clause, and demonstrates the difference between the RANK, DENSE_RANK and ROW_NUMBER analytic functions:

WITH data AS (
        SELECT a."Name", a."Games", a."Medal", a."NOC", COAL ESCE(r."NOCName", 'Unknown') AS "NOCName"
        FROM olympic_athletes a
        LEFT JOIN (SELECT r.NOC, COAL ESCE(r."notes", r."region") AS "NOCName" FROM noc_region r) r
        ON a."NOC" = r."NOC"),
    medals AS (
        SELECT "Name", SUM(CASE "Medal" WHEN 'Gold' THEN 3 WHEN 'Silver' THEN 2 WHEN 'Bronze' THEN 1 ELSE 0 END) AS "sum_Medal_Points"
        FROM data
        GROUP BY "Name")
SELECT "Name", "sum_Medal_Points",
       RANK() OVER (ORDER BY "sum_Medal_Points" DESC) AS "Rank",
       DENSE_RANK() OVER (ORDER BY "sum_Medal_Points" DESC) AS "DenseRank",
       ROW_NUMBER() OVER (ORDER BY "sum_Medal_Points" DESC) AS "RowNumber"
FROM medals
ORDER BY 5;

RANK, DENSE_RANK and ROW_NUMBER all agree who is in first place. It's only lower down in the listings that the differences appear. To illustrate, here are the first 12 rows:

RANK() vs DENSE_RANK() vs ROW_NUMBER()

In summary:

  • RANK - marks equal records as equal, and leaves gaps in the sequence
  • DENSE_RANK - ditto but does not leave gaps in the sequence
  • ROW_NUMBER - produces a dense, always-ascending sequence

In any case there's no argument about first place in this dataset, which goes to Michael Phelps 🏊 by a considerable margin. It's going to be a tough record to beat!

Running the Olympic athlete analysis in a graphical SQL builder

You may be interested to know that it's easy to build logic like this in a data integration platform with a graphical user interface. The equivalent data pipeline in the Matillion data productivity cloud looks like this:

Olympic athlete analysis in the Matillion data productivity cloud

This graphical representation of the logic builds up the same kind of pushdown SQL statement that I showed in the previous section.

If you're not so confident writing SQL logic, the Matillion platform comes with a built-in copilot that can create the components in response to a freehand text prompt. Here's the one for the "Filter Medal Winners" component:

Matillion copilot example

Conclusion

The raw data files I listed at the beginning of the article are in a publicly available Amazon S3 cloud storage bucket, so you can use them to build your own analysis.

You can try out the Matillion data productivity cloud at zero cost for 14 days and build your own graphical data integration pipelines on your own data.

If you are already a Matillion data productivity cloud user, you can download the Olympic Athlete Data Analysis pipelines from the Matillion Exchange and run them yourself.

Ian Funnell
Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.