- Blog
- 12.05.2024
- Data Fundamentals
Multi-Tier Architecture in a Data Warehouse

The shift to cloud computing in the recent decade has caused a major change in how companies manage data. Traditional methods where everything was stored in an on-site RDBMS are fading. Now, data is commonly spread out across many SaaS applications, with the average company using more than 200 such services. All these diverse applications produce data that does not necessarily interoperate well. This trend is known as application proliferation.
At the same time, new digital tools have introduced many different ways in which data can be stored and processed. This situation, known as format proliferation, requires data teams to work with a wide range of data handling techniques - like Generative AI, Machine Learning, Parquet, Apache Iceberg, JSON, SDKs, REST APIs, OAuth, and Streaming to name just a few.
These twin challenges are tough for data teams. At the same time, they must also adapt and adjust while the business needs and the source applications continually change.
The solution - outlined in this article - is to isolate the challenges and deal with them independently. This involves implementing a multi-tier data architecture.
Here is a broad summary of the three data tiers this article will describe:
| Data Tier | Acquiring data | Integrating data | Presentation |
| Reason for tier | Deal with format proliferation | Deal with application proliferation | Make data easy to consume |
| Data is integrated? | x | ✔ | ✔ |
| Data is easy to use? | x | x | ✔ |
Acquiring data
The focus in this tier is just getting hold of the data, leaving the problem of integration for later.
Loading data in its separate step greatly simplifies the task of acquisition, and makes it highly scalable. It also decouples downstream data integration pipelines, so they are unaffected by transient problems such as changing access paths. Whichever way it's done, data loading is a largely mechanical process best kept free of complex transformations.
You may choose to extract only the data that are required for the immediate task at hand. This is often called "staging" the data. Alternatively, you may decide to extract all the source data without filtering, including all the history. This creates a full copy of the source data in the target data platform, which is known as an Operational Data Store (ODS).
ETL (Extract, Transform, Load)
In both ELT and ETL, the "E" and "L" refer to data acquisition, meaning extracting data from the source and loading it into the target data platform.
Batch Processing
In batch data acquisition mode, many records are moved in a single operation. Batch processing using SQL or Spark is optimal on modern data platforms. High watermark loading is a common strategy, in which the source system is queried periodically for all new records since the last batch.
Streaming or Real-time Data Processing
In contrast to batch processing, streaming acquires records as soon as they are added. When the source is a database, this approach is known as Change Data Capture (CDC).
Data Lakehouse
A data management architecture that combines the highly scalable storage of a data lake with the transactional and analytic capabilities of a traditional data warehouse. The result is a unified platform that can handle structured, semi-structured, and unstructured data.
Zero ETL
Similar to how a Data Lakehouse makes files in cloud storage appear as database tables, Zero ELT means virtualizing a table in a "source" relational database into another ("target") data platform such that it appears to exist within the target.
Semi-Structured data
Semi-structured and Unstructured data is stored as-is during acquisition. Interpreting it is a transformation task that happens during data integration.
Integrating data
Once all the data has been made available in one place, this second data architecture tier is concerned with making data from diverse sources work together - integrating it.
When data from many different sources is simply copied into a single location, it does not automatically become integrated or ready for use. This was the fundamental problem with Data Lakes. True integration means making data from different sources work together in a meaningful way.
Successful data integration involves transforming data. This means applying business rules to ensure it can all be compared and analyzed collectively. Data transformation is the vital step that ensures the information all makes sense in combination.
Data Silos
A silo is data that is isolated and kept separate from all other systems and databases. Whenever a new application is introduced to an organization, its data usually does not work well with data from other applications - making it a potential new data silo.
Data silos are the consequence of application proliferation, which is the reason that the integration process requires its own data architecture tier.
Data Integration
The process of combining data from various sources into a unified view. Data integration requires data transformation because new puzzle pieces don't naturally fit with the others until they are reshaped. Data transformation - the "T" in ETL - usually involves normalization.
Normalization
The database design process of organizing data to reduce redundancy and improve data integrity.
Third Normal Form (3NF) and Data Vault are widely used data modeling techniques for normalization. Normalization tends to create rather technical data models with many joins.
Unstructured Data
Historically unavailable for analytic purposes, unstructured data like emails, documents, and social media posts holds the value that needs modern artificial intelligence (AI) and machine learning (ML) tools to unlock.
Data Quality
Expresses the overall condition of data, based on factors including accuracy, completeness, and reliability.
Arguably data quality is perfect to begin with because it's being used to successfully run business operations. The job of data transformation and integration is to read, interpret and understand it correctly, applying the same business rules that the source systems are using.
It can be more helpful to think instead in terms of how easy it is to use - or consume - data. Newly integrated, normalized data is high quality, but it can be hard to understand...
Architecturally what is needed is a third data tier to present the newly integrated data in a way that's easy to comprehend and use.
Data Presentation
The presentation tier is where data becomes user-friendly.
Earlier the Acquisition and Integration tiers dealt with the twin problems of having many data sources, using many different data formats. Now the data needs to get to where it can be made useful.
Two design considerations stand out most importantly in the data presentation tier:
- It must be able to adapt quickly to changing business needs, whether they're short-term or long-term. The structures must be kept simple and easy to update.
- One definition from the Integration tier can be used in many places during presentation. For example, sales might track widgets by region, while R&D looks at widgets by-product, both using the same definition of "widget."
Database views are an ideal technical solution in the Data Presentation tier. For simplicity of access, presentation tiers almost always use Star Schemas.
Star Schema
Sometimes known as Dimensional Modeling, this is a type of database design that features a central fact table connected to dimension tables. When the same dimension table is used in many different fact tables, it becomes known as a Bus Matrix.
One sign of a well-designed Integration tier data model is when it is easy to create a virtual star schema as a set of views. Relatively light data transformation should be needed, mainly involving joins and aggregations.
Snowflake Schema
A slightly more complex variation on Star Schema design, in which one dimension table can be normalized into a hierarchy of multiple related dimension tables.
Data Mart
A subset of a star schema, usually containing several related fact tables, and targeted at a specific group or department within an organization.
Semantic Layer
A semantic layer can be added to help separate the definition of metrics from visualization and governance. This is sometimes known as Headless BI. Separation in this way allows for more flexible analyses. It helps avoid building a lot of logic directly into a BI tool from where it's difficult to re-use.
However, it's important to understand that a semantic layer is not a replacement for data integration. Semantic layers also work best when expressed in terms of the facts and dimensions of a Star Schema. In other words, multiple data tiers are still required to consolidate data and get the most from a semantic layer.
Conclusion
Implementing multiple data tiers in your data warehouse architecture is a way to address the challenges faced by modern data teams: format and application proliferation - on top of constant change, and the need to do more with less.
Having multiple tiers enables you to solve problems one at a time.
Data can be hard to access, so extract it first into your data platform or lakehouse. This creates a solid foundation upon which to build the next layers.
The interchange between the Integration and Presentation tiers is the key to serving different teams with consistent data. It makes reports add up. It also provides a reliable source for working with generative AI.
Presenting one integrated definition in many places is known as DRY - Don't Repeat Yourself - which is just as important in data engineering as in the rest of information technology.
Matillion users can download a set of demonstration ETL pipelines that build out a multi-tier data architecture as a hands-on example.
Ian Funnell
Data Alchemist
Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell
Featured Resources
ETL and SQL: How They Work Together in Modern Data Integration
Explore how SQL and ETL power modern data workflows, when to use SQL scripts vs ETL tools, and how Matillion blends automation ...
Learn more WhitepapersUnlocking Data Productivity: A DataOps Guide for High-performance Data Teams
Download the DataOps White Paper today and start building data pipelines that are scalable, reliable, and built for success.
Learn more BlogWebhooks and Pushdown Python: Building Interactive and Efficient Data Applications
Part 5 of our blog series demonstrating the art of the possible, using Matillion products and features to build the MatiHelper ...
Learn more
Share: