Don't confuse the Kimball methodology with Data Architecture

The shift to cloud computing has changed the way companies handle data. Moving away from on-prem databases towards SaaS applications has left data siloed across numerous platforms. This is known as application proliferation. Alongside this, the rise of new digital tools has resulted in diverse data formats and processing methods - a phenomenon called format proliferation. Together, these are major challenges for modern data management.

Despite this, data is more valuable than ever. Thanks to generative AI, companies can now automate the extraction of insights from unstructured data, which has opened new doors for analysis and business intelligence.

Of course, it remains mandatory for companies to stay compliant with governance, risk, and regulatory requirements, ensuring that their data practices align with industry standards while safeguarding against potential risks.

The growing complexity of the modern data landscape means there is a greater risk than ever of creating solutions that don't meet long-term business needs.

An important way to tackle this is to address data challenges one at a time, by deploying a multi-tier data architecture.

Architecting to reduce risk

Imagine a ship with watertight bulkheads. If one part floods, the rest stays dry. A multi-tier data architecture exists for the same reason. By segmenting data into isolated layers, the common hazards - such as schema drift, data quality errors, or security breaches - are confined to one area. A problem can be addressed in just one place, preventing it from spreading across your entire data ecosystem.
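To make the bulkhead idea concrete, here is a minimal Python sketch of one such hazard, schema drift, being contained at a tier boundary. The column names, the EXPECTED_COLUMNS contract, and the split_on_drift function are hypothetical and chosen only for illustration; a real pipeline would likely use whatever validation its platform provides.

```python
# Hypothetical contract for one source: the columns downstream tiers rely on
EXPECTED_COLUMNS = {"cust_id", "full_name"}

def split_on_drift(rows):
    """Separate rows that match the expected columns from rows whose schema has drifted."""
    accepted, quarantined = [], []
    for row in rows:
        if set(row) == EXPECTED_COLUMNS:
            accepted.append(row)
        else:
            quarantined.append(row)
    return accepted, quarantined

rows = [
    {"cust_id": "17", "full_name": "Ada Lovelace"},
    {"cust_id": "18", "name": "Alan Turing"},  # column renamed upstream
]

accepted, quarantined = split_on_drift(rows)
print(len(accepted), "accepted;", len(quarantined), "quarantined")
```

The drifted row is held back in one place, so later tiers only ever see rows in the shape they expect.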

A good starting point is to consider three logical tiers: data acquisition, integration, and presentation.

Data Acquisition

This involves gathering data without worrying about integration. Staging data first - either selected portions or complete historical records - simplifies the process and enhances scalability.

The acquisition tier addresses format proliferation. The work is largely mechanical, using methods such as batch processing and streaming, and it lays the foundation for the more complex transformations that follow.
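As a rough illustration, the Python sketch below stages records from two differently formatted sources without trying to reconcile them. The sources (a CSV export and a JSON payload), the field names, and the stage_records function are all assumptions made for the example, not part of any particular product.

```python
import csv
import io
import json
from datetime import datetime, timezone

def stage_records(source_name, records, loaded_at=None):
    """Wrap raw records with load metadata; the payload itself is stored untouched."""
    loaded_at = loaded_at or datetime.now(timezone.utc).isoformat()
    return [
        {
            "source": source_name,
            "loaded_at": loaded_at,
            "payload": json.dumps(record, sort_keys=True),
        }
        for record in records
    ]

# Hypothetical source 1: a CSV export from a SaaS application
crm_csv = "cust_id,full_name\n17,Ada Lovelace\n"
crm_rows = list(csv.DictReader(io.StringIO(crm_csv)))

# Hypothetical source 2: a JSON payload from a billing API
billing_json = '[{"customerNumber": "C-17", "amountCents": 1250}]'
billing_rows = json.loads(billing_json)

# Both formats land in the same staging shape; no integration is attempted yet
staged = stage_records("crm", crm_rows) + stage_records("billing", billing_rows)
for row in staged:
    print(row["source"], row["payload"])
```

Keeping the payload verbatim means the raw history can be reprocessed later if the integration rules change.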

Data Integration

After collection, the next tier addresses the issue of data silos. Data from multiple sources is combined and transformed to ensure it all works together meaningfully.

Silos are removed by transforming each source to follow the same standardized rules, for example shared keys, names, and units, so that all the data takes a consistent form, ready for collective analysis.
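A minimal Python sketch of this idea follows, assuming two hypothetical sources that describe the same customers with different column names and key formats. The Customer class, the mapping functions, and the "first source wins" rule are illustrative assumptions; real integration rules are usually richer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Customer:
    """The single, standardized form that every source is mapped into."""
    customer_key: str
    name: str

def from_crm(row):
    # Hypothetical CRM layout: numeric id and a "full_name" column
    return Customer(
        customer_key=f"C-{int(row['cust_id'])}",
        name=row["full_name"].strip().title(),
    )

def from_billing(row):
    # Hypothetical billing layout: prefixed id and a "customerName" column
    return Customer(
        customer_key=row["customerNumber"].upper(),
        name=row["customerName"].strip().title(),
    )

crm_rows = [{"cust_id": "17", "full_name": "ada lovelace"}]
billing_rows = [
    {"customerNumber": "c-17", "customerName": " Ada Lovelace "},
    {"customerNumber": "C-42", "customerName": "Alan Turing"},
]

# Combine both sources under one key; the first source seen for a key wins
integrated = {}
for customer in [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]:
    integrated.setdefault(customer.customer_key, customer)

for key in sorted(integrated):
    print(key, integrated[key].name)
```

After the mapping, customer 17 from the CRM and customer c-17 from billing resolve to the same record, which is what removes the silo.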

Data Presentation

Finally, the Data Presentation tier makes data easy to use. By this point, data has been gathered from different sources and integrated. Now it must be presented effectively.

The data presentation tier must be easy to adapt to changing business needs, so it should consist of simple structures that are easy to maintain. Star schemas are widely used here because they simplify data access and enhance flexibility.
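For readers who have not seen one, here is a small star schema sketched in Python using the standard library's sqlite3 module. The table names (fact_sales, dim_product, dim_date), columns, and sample rows are invented for illustration; the point is only the shape, with one fact table joined to descriptive dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT, year INTEGER);

-- The fact table holds measurements, keyed by the dimensions
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024), (20240102, '2024-01-02', 2024);
INSERT INTO fact_sales VALUES (1, 20240101, 3, 30.0), (1, 20240102, 2, 20.0), (2, 20240101, 1, 15.0);
""")

# A typical query: join the fact table to a dimension and aggregate
revenue_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()

print(revenue_by_product)
```

Queries against this shape tend to follow the same pattern of joining, grouping, and summing, which is a large part of why the structure is easy to use and to adapt.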

A core principle is using one definition in multiple ways; for instance, sales and R&D can view "widget" data differently but use the same definition. This is what makes reports add up.
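The Python sketch below illustrates the principle with made-up data: one shared definition of each widget, grouped two different ways. The attribute names (product_line, research_program) and the figures are hypothetical.

```python
# One shared definition of each widget (hypothetical attributes)
widgets = {
    "W1": {"name": "Widget A", "product_line": "Consumer", "research_program": "Alpha"},
    "W2": {"name": "Widget B", "product_line": "Consumer", "research_program": "Beta"},
    "W3": {"name": "Widget C", "product_line": "Industrial", "research_program": "Beta"},
}

# Measurements recorded against the shared widget identifiers
units_sold = [("W1", 10), ("W2", 5), ("W3", 7), ("W1", 3)]

def total_by(attribute):
    """Sum units sold, grouped by any attribute of the shared widget definition."""
    totals = {}
    for widget_id, units in units_sold:
        group = widgets[widget_id][attribute]
        totals[group] = totals.get(group, 0) + units
    return totals

sales_view = total_by("product_line")      # how sales groups widgets
rnd_view = total_by("research_program")    # how R&D groups widgets

print(sales_view, rnd_view)
```

The two views slice the data differently, but because both are built on the same widget definition, their grand totals are identical.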

You may also consider a semantic layer at this point, to separate data definition from visualization aspects. This allows for versatile analyses without having to embed complex logic into visualization tools.
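As a simplified sketch of that separation, the Python below keeps measure definitions in one place and lets presentation code request them by name. The MEASURES dictionary and the evaluate and render_text functions are hypothetical stand-ins for this example, not the interface of any real semantic layer product.

```python
# Hypothetical semantic layer: measure definitions live in one place
MEASURES = {
    "revenue": lambda rows: sum(r["quantity"] * r["unit_price"] for r in rows),
    "units": lambda rows: sum(r["quantity"] for r in rows),
    "average_price": lambda rows: (
        sum(r["quantity"] * r["unit_price"] for r in rows)
        / sum(r["quantity"] for r in rows)
    ),
}

def evaluate(measure, rows):
    """Look up a measure by name and compute it over the given rows."""
    return MEASURES[measure](rows)

def render_text(measure, rows):
    """Presentation code: it formats a value but contains no business logic."""
    return f"{measure}: {evaluate(measure, rows):.2f}"

rows = [
    {"quantity": 3, "unit_price": 10.0},
    {"quantity": 1, "unit_price": 15.0},
]

for name in MEASURES:
    print(render_text(name, rows))
```

If the definition of revenue changes, it changes once in the measure definitions, and every report that asks for it picks up the new logic.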

Kimball vs Inmon

Ralph Kimball's approach defines the data team's goal in terms of its visible output: a set of linked star schemas, each focused on a main topic. Describing them as "linked" means that they share common data definitions. This kind of data model helps make the data easy to access and use.

Bill Inmon, on the other hand, views the data team's goal more broadly, in terms of subject areas. The aim is to have all the information about each topic stored together. In other words, this vision is all about data integration, which is fundamental to ensuring accurate reporting.

While these approaches may seem different, the reality is that they complement each other:

  • Kimball: Use star schema data models in a Data Presentation tier, to make data easy to access and use
  • Inmon: Have a Data Integration tier, to organize and manage data effectively over the long term

Notice that, on their own, neither a star schema nor a semantic layer is a substitute for the well-organized, integrated data supplied by the earlier data tiers.

Summary and further reading

Star schemas are usually the most visible manifestation of data warehousing, so it's easy to mix up creating a dimensional model with organizing and integrating data. Star schemas are best deployed after data has been fully integrated.

A well-organized multi-tier data architecture makes it quick and easy to create different star schema views for analysis that retain data quality, without the need for complicated data changes. Adding a semantic layer can help keep things organized by separating how data is measured from how it's shown.

Tiering your data architecture helps to manage data challenges and risks efficiently. The Data Presentation tier meets constantly changing business needs. Data Acquisition and Integration tiers create a stable foundation for new technologies such as generative AI.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell
