Liverpool versus Manchester United. Cats versus dogs. Coke versus Pepsi. Rivalries have persisted throughout the ages. But for gamers, not many are more contested than Xbox versus Playstation. With the release of both the new Xbox Series X and PlayStation 5 (PS5) consoles drawing near, the competition is heating up again. In the latest battle of the consoles, which platform will emerge victorious?
We could attempt to guess the outcome by examining the desirable features of both. (Full disclosure, I am biased toward the current PS4.) But why not use data? With the help of Matillion ETL, I analyzed historical sales data for both platforms to see if I could predict a winner. Don’t worry, my bias definitely did not affect the outcome of this comparison!
First, I needed to find the data. In kaggle, I found PS4 and Xbox One video game sales data from 2013 through 2020 (click on the links to download the data). By finding the average sales of video games for each platform during those years and comparing them against each other, I hoped to reach a solid conclusion.
Where does Matillion ETL fit in?
Matillion ETL helps you migrate and load data into a cloud data warehouse such as Amazon Redshift, Google BigQuery, Snowflake, or Azure Synapse Analytics. You are then able to transform the data into a suitable format for analytics. This is all done within the cloud data warehouse (CDW) environment. Let’s not forget to mention that you can run these Matillion ETL instances on multiple cloud platforms: Amazon Web Services(AWS), Microsoft Azure and Google Cloud Platform(GCP). But keep in mind that different cloud providers support different CDWs. For example, Snowflake runs on AWS, Azure and GCP, whereas Amazon Redshift runs on AWS alone. Likewise for Google BigQuery and GCP. When it comes to analytics like these to predict outcomes, a good ETL product like Matillion ETL ensures that the data going in for analytics will be transformed, well prepared, and yield optimal results.
Orchestrating the historical data within Snowflake
Within Snowflake on an AWS instance, I used Matillion ETL to ingest and analyse the data by creating an Orchestration Job. An Orchestration Job consists of several components that can retrieve data from other services and load that data into a table. For example, these components can be called Data Stagers, Connectors, or Data Loaders..
Loading the data
First, I created a Snowflake table. This table is where the data that we will load will sit. To get your historical data that we obtained from kaggle into the cloud, upload them into an S3 bucket of your choice. To connect to these tables, I used an S3 Load component, which extracts the data from CSV files stored in an S3 bucket and automatically loads them into the table. Loading the CSV data into the table was quick and efficient with no hassle at all.
Transforming the data for analytics
Once the data was available in Snowflake, it was ready to transform to prepare it for analytics. I created a Transformation Job that would allow me to find the sum of regional video game sales by year, then work out the average sales across all seven years to determine the popularity of the PS4 and Xbox One. These next steps are actually a visualisation or representation of a series of SQL queries. By using the drag-and-drop components, you save a lot of time that would otherwise be spent hand-coding these functions in SQL. Instead, you simply drag the components onto the Matillion canvas and alter the property values to suit your needs.
The following screenshot is the full Transformation Job:
Breaking down the Transformation Job
What do each of these components do? Let’s look at the Transformation Job step by step.
The tabular data
The data is already set up in a tabular form, which means we can move straight to transforming it. As I’m working, you can sample and see the data at each stage and check out the row count, which is a really helpful feature. What you see here is the structure of the PS4 video game sales. The Xbox One game sales table has the same structure and column names.
The Aggregate component
Using the Aggregate component, I worked out the sum of the sales in each region by year. I looked at sales in four regions: North America, Europe, Japan and the rest of the world. Rather than typing out an SQL query (which is time consuming), you simply add the columns you want to aggregate and the type of aggregation you want to use:
Then you can view the SQL at each stage as it is automatically generated:
The Transpose Columns component
The Transpose Columns component is very useful when you have several columns that can be grouped into a single column. In this instance, the output from the aggregation placed the sum of sales for each region in its own column, grouped by the year. What we can do now is reshape the data by outputting each set of input columns to a single output column:
This makes the next step, calculating the data, much easier.
The Calculator component
Using the Calculator component, I can find the average of seven years of sales data from the ‘region_sum_sales’ output column. The Transpose component helped in simplifying the expression, as we only have to reference the one column in this calculation:
Joining the data
Now we have two data sources that need to be joined together for one final analysis. The Join component joins the columns on a column key. Since the two tables have different values, I used a full join to receive data from both tables, even though they do not match. You can also choose between the following join types:
Converting the value to compare results
Finally, we convert the value into a percentage. Related to this step, another benefit of the Calculator component is that you do not need to remember both the function names and the syntax, as the information is provided under the functions folders.
In the screen shot, you can see that I converted the value into a percentage, using a simple percentage multiplication:
ROUND(100 * “PS4 7 year Sales Avg” , 2) || ‘%’
The functions will change based on the CDW that you are using.
Remember in the beginning, when we created an empty table in the Orchestration Job? Now, we will place the output of the data here in the results.
Which console prevailed?
Of course, what really matters is which console will win?
As a recap, I took the video games sales data from the PS4 and Xbox One from the years 2013-2020 and found the sum of the sales in the regions per year. We then placed the sum of the sales into one column to make it easier to find the overall average sales across all regions. To come to the final output, we joined the two sets of data from the Xbox One and PS4, changed that into a percentage and placed it into the ‘results’ table.
And the winner is…
PS4 Popularity = 21.19%
Xbox One Popularity = 20.19%
According to the average of the historical sales data over seven years, the PS4 was the clear winner with 21.9% percent higher video game sales. (Note that the percentages are small because the data is spread out across seven years. If I were to find the average sales for three years, the percentages would be higher. Still, a clear margin of victory!)
So according to the data, we can assume that the PS5 will win out over the Xbox Series X in the upcoming product releases. I am excited that the data shows that the PS5 will be more popular. I will definitely be part of the crowd that will contribute to this popularity. Regardless of my personal excitement and findings, there’s nothing to say that it may be the complete opposite!
Jobs used for this blog
Download the jobs used to orchestrate and transform the data in this blog.
Learn more about Matillion ETL
Want to know more about what Matillion ETL can do for your data? Request a demo.