The Agentic Advantage series: Real-world AI success stories from leading CDAOs

Register Now

Star Schema vs Snowflake

Data warehousing is the cornerstone of modern data analytics, providing a centralized repository for integrating, transforming, and storing vast amounts of data from diverse sources. A well-architected data warehouse streamlines data management and enhances data quality, making it readily accessible for analytical and reporting purposes.

The best overall architecture for a data warehouse is multi-layered. This structure, often referred to as a medallion data architecture, typically encompasses three tiers: an initial layer for data acquisition, a middle layer for data integration, and a presentation layer for data querying. Each layer employs different data modeling techniques to address unique challenges and optimize maintenance.

In particular, the presentation (gold) layer usually leverages either Star Schemas or Snowflake Schemas (or a combination of both!) for efficient querying and reporting.

Different modeling techniques are used in the earlier layers to set the stage for the refined presentation layer:

  • The silver tier often employs 3NF or Data Vault modeling techniques
  • The bronze tier using source-specific models like semi-structured, normalized, or OBT (One Big Table) extracts

These data tiers work together to ensure a robust data warehousing solution.

What is a Star Schema?

A Star Schema is a type of data model employed predominantly in the presentation (gold) tier of a multi-tier data architecture. This data modeling approach uses just two entity types: facts and dimensions.

  • Facts represent measurable quantities or metrics, such as sales revenue or units sold
  • Dimensions provide context by which we can measure these facts, such as time, product, and location.

Because of this simple approach, Star Schemas are also widely known as "dimensional" models.

The distinguishing characteristic of a Star Schema is - naturally - its star-like shape when visualized. At the center of this model lies a single fact table, to store all the quantitative data. Surrounding the fact table are many dimension tables, each representing a different context for analysis. This layout necessitates minimal relational joins for querying, thereby optimizing query performance and simplifying navigation.

Within a Star Schema, some dimensions like date, time, and location are universal and can be reused across different business domains. Other dimensions are tailored specifically to the business context and reporting needs.

A Star Schema maintains only one level of join between facts and dimensions. This rule delivers optimal processing speed, simplicity, and ease of understanding. Here's a diagram showing a fragment of a star schema, focusing on the "time" dimension that includes day, month, and year:

Fragment of a Star Schema

A full data warehouse design will consist of many fact tables, each in its own individual star schema. One key goal of a well-designed Star Schema is the reusability of dimension tables by multiple fact tables. This facilitates reliable cross-referencing between events, aiding in comprehensive analysis and insight generation. A useful tool in this design process is the Bus Matrix, a grid depicting the relationships and reuse of dimension tables across various fact tables. This is an elegant way to describe a coherent and powerful star schema design.

What is a Snowflake Schema?

A Snowflake Schema is another variety of dimensional modeling widely used in data warehousing. Again, it is based on facts and dimensions. Snowflake Schemas are built upon the principle of organizing data into a structured format most conducive to query efficiency and data integrity.

The key differentiator of a Snowflake Schema is to normalize dimensional tables to reduce redundancy and repetition. This results in a layout where dimension tables are connected to one central fact table, but dimensions can themselves be split into additional sub-dimensions.

So in a Snowflake Schema, rather than having single flat dimension tables, each dimension can be broken down into multiple related tables. For example, in a sales data warehouse, the time dimension might be normalized into separate tables for year, month, and day. This normalization reduces data redundancy, conserving storage and facilitating easier maintenance of the data.

As you saw above, a Star Schema has fully denormalized dimensions. That means each dimension table is directly linked to the fact table without further subdivision. Queries are simple and fast.

A Snowflake Schema exhibits normalization, requiring more joins but achieving a more hierarchical and organized layout. Here's a diagram showing a fragment of a snowflake schema, focusing on the "time" dimension again. Note how that one table has been normalized out into three separate dimension tables for day, month and year:

Fragment of a Snowflake Schema

Consequently, the equivalent Snowflake Schema may be more challenging to query but offers advantages in storage efficiency and minimizes redundancy.

Understanding the trade-offs between these schema types is essential for designing a data warehouse tailored to specific performance and maintenance requirements.

Star Schema vs Snowflake Schema

Star Schema and Snowflake Schema data models are both employed in data warehousing to structure data for efficient querying and reporting. Within this fundamental similarity in purpose, the differences are mainly in their level of normalization.

Star Schemas are characterized by a highly denormalized structure. They consist of a central fact table surrounded by a single layer of dimension tables, looking exactly like a star. This high level of denormalization ensures simple, fast query performance because fewer joins are required. It's user-friendly for business analysts and simplifies complexities for developers, making it popular in exploratory data analytics where speed is crucial.

Referring back to the "time" dimension model earlier, here's what the data in that dimension looks like. Note all the repetition in the Month and Year columns. This is how denormalization manifests in a star schema dimension table.

Key

Date

Month

Year

191124

April 28 2009

April 2009

2009

191125

April 29 2009

April 2009

2009

191126

April 30 2009

April 2009

2009

191127

May 1 2009

May 2009

2009

191128

May 2 2009

May 2009

2009

etc

Time dimension in a Star Schema

Assuming you are deploying a star schema into a cloud data warehouse, there's no need to be too afraid of the seemingly wasteful large-scale repetition of data. Modern data warehouse platforms contain storage optimizations specifically designed to make dimension table storage compact, keeping queries fast and efficient.

An equivalent Snowflake Schema is a more normalized variant. Dimension tables are split into multiple related tables, creating a network reminiscent of a snowflake shape. Each dimension can have more general sub-dimensions, minimizing redundancy and storage overhead. This does come at the expense of more complex queries.

Here's how "time" is represented in a more normalized way in a snowflake schema. Note how one table turns into three, requiring less space overall but with a more complex structure.

Day Key

Date

Month Key

191124

April 28 2009

42165

191125

April 29 2009

42165

191126

April 30 2009

42165

191127

May 1 2009

42166

191128

May 2 2009

42166

etc

Day dimension in a Snowflake Schema

Month Key

Date

Year Key

42165

April 2009

220218

42166

May 2009

220218

etc

Month dimension in a Snowflake Schema

Year Key

Year

220218

2009

etc

Year dimension in a Snowflake Schema

A snowflake schema reduces repetition with a more normalized design, reducing the chance of data anomalies and improving data integrity. Less can go wrong with the ETL processes because they do not have to create so much duplicated data.

This approach is advantageous in environments where data quality and storage efficiency are more critical than the simplest query access, such as in highly regulated industries.

Summary

In summary, while both Star and Snowflake Schema design approaches serve the purpose of organizing data for analytical processing, the Star Schema’s denormalization favors performance and simplicity, whereas the Snowflake Schema’s normalization prioritizes data integrity and storage efficiency.

Keep in mind that - despite a common misconception - on their own, a Star Schema or a Snowflake Schema are not a data warehouse. They are specific data modeling techniques, best applied to data that has already been transformed and integrated at an earlier stage in the data architecture.

Further reading

Matillion is the data pipeline platform that empowers data teams to build and manage pipelines faster for AI and Analytics at scale, making it an excellent choice for building dimensional models in data warehouses, whether using Star Schemas or Snowflake Schemas. Matillion's comprehensive features include productivity-boosting pre-built components, code-optional flexibility, first-class Git integration, and hybrid SaaS deployment options. Start a 14 day trial for free.

Ian Funnell
Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.