Why do we lose control of our data? Much of it boils down to one thing: We have so much data from so many sources, we simply can’t keep track of where stuff lives.
You’ve probably experienced this firsthand. It’s a common dilemma. Even the basic sources of data expand by multitudes as an enterprise grows. How do you begin to keep track of these sources and their data flow? Building a data model within your cloud data warehouse (CDW) is a big step toward taming the data beast.
A map of your data landscape
A data model describes your entire enterprise from a data point of view. It’s a representation of the data objects in your CDW, the associations between these objects, and the rules in place. A data warehouse is in fact nothing more than the sum of its parts. You can segment the parts and pieces of a warehouse into various persistence layers to generate the outputs you need. But in order to understand what is available from your CDW, you need a data model. When you skip the step of data modeling, you limit your ability to leverage our data assets efficiently, because you can’t find them quickly – or at all.
How many times have you worked a project, gotten about 40 percent done, and then asked a clarifying question, only to hear, “Oh, we already have that”? You then proceed to explain how this variation is different from the previous build and how none of the old transformations will work. This happens every day in business and it costs organizations, time, money and effort. We can reduce this confusion by simply adding the data model back into our development process.
The benefits of a data model
The cloud data warehouse is not just a collection of data. Rather it is a living growing model of the business processes that generate data that the organization values. The complexity lies in the various levels of data ingestion both on premises and off. Using data models can greatly simplify complex data flows.
There are several significant benefits to creating a data model, both for a data team and for the business in general.
- A data model provides a set of instructions for understanding relational and non-relational data flowing through the environment
- It highlights business rules
- Data modeling enables quicker time to market for long term data assets
- Data can be modeled for scalability
- Data can be modeled for performance
- The visual documentation that data modeling creates helps you understand how actual data elements and business processes relate to one another
Making sense of data objects
One of the biggest benefits of creating a data model is being able to identify and track all of the data objects required by the database.
Big data has created a whole new set of opportunities for business intelligence today. We can leverage big data from data lakes, data oceans, APIs, and several other storage possibilities. But not everything in a data lake is big data.
Modeling API sources is important to understanding the business intelligence life cycle. With a data model, you can provide a schema definition that supports moving from semi-structured to structured formats and physical storage in the enterprise. Most data in organizations is generated outside the company via third-party data vendors, which makes API modeling even more important.
Many (okay, most) organizations pass around excel spreadsheets or other comma delimited files. These files get manipulated or changed over time, which results in data loss or data drift to a useless state.
On-premises, one-off RDBMS instances are everywhere today. Because business moves so fast, we tend to build solutions in a vacuum and then have to migrate that system at some point in time. Data modeling helps us better manage these instances and leverage them when we need them.
In short, you need to know what you have, and where, before you can organize it properly. Going through a data modeling exercise helps immensely.
Data models are a team effort
For creating a solid data model, communication and input from the team is key. Early collaboration and consideration on a data model will yield a superior end result. As data engineers and modelers, the cloud has made our jobs more efficient and much much faster. Working in this environment, we have all the tools needed typically within one development IDE. We can generate messages, set queues, link hundreds of existing flows to another all without even considering the end result. But a clear understanding of data modeling can start any CDW project on the right path.
Types of data models
When contemplating a data model, there are several types to consider. Which one you choose depends on your business goals and what actions you will take based on the model.
A conceptual data model is used to describe concepts, rules, and processes that are needed to support the business. In addition to identifying the data needed, the conceptual model also tracks business events and their related performance. And while a conceptual model focuses on the data elements needed for business processes, it doesn’t concern process flow or the characterization of data types.
A logical data model describes the process in more detail as it relates to other business processes within the organization. The logical model is an attempt to move from a conceptual model into the logical steps of a process and the data attributes it generates. In addition, the logical model will typically include any relationships between concepts and are split by subject area.
A physical data model is the actual representation of the process data elements grouped by entity for storage in a database management system (DBMS). The model contains all relationships and entities, data types, keys, foreign keys, indexes – any DBMS feature that would be included in the schema build would be included in this model. This model is derived from the conceptual and logical models and should provide a third pass at reviewing the build requirements. In most cases the physical model will not look much like the logical model. Rather it contains entities and relationships that support performant data storage \ retrieval of enterprise data.
Creating data layers from a data model
Understanding the power of the cloud, we can leverage the various data layers in a CDW to make the best use of our resources, time, dollars and effort. If we understand each of our sources and their structure, we can build an environment to support a scalable, solution that reduces our costs over time and delivers higher value to our organization and users of our data.
Let’s take a look at how to use a data model to drive an effective CDW build, and what layers we might use to store different data assets. The layers represent the power of the CDW and help developers navigate the environment. We know that the most effective strategy for loading a CDW is ELT: First, extract the data from the source exactly as it is. Then load the data exactly as it came, regardless of data type specification. Finally, transform the data to enhance and enrich it into a usable format for 3NF or Presentation storage. Based on ELT, we can look at the layers and how we can leverage them to make the most use of our CDW.
In the Load layer, you are simply dumping the data into the CDW. You don’t need to spend time on hard-typing columns or coming up with complex structures. Matillion was purpose built to leverage the MPP power of the CDW. So when using Matillion you’ll find our components create temporary database objects and then drop them once the process restarts. In addition to using temporary objects to leverage CDW compute power we also land data in the load layer using the optimized copy commands of the various CDW providers.
In this layer, you can begin to hard-type data and build permanent structures that maintain some metadata and load and process control information. Within the Stage layer, you can also begin to formulate conformed structures to reuse downstream in the 3NF or Presentation layers.
3NF / Data Vault
In this layer, you will reconstruct your data into a generalized third normal form representation. Here you are combining data from multiple sources and setting orders of precedence for what data is your source of truth. This step builds on the Stage layer with conformity to dimensions and denormalization of data where necessary.
In this layer, you present star schema dimensional data. To realize the full benefit of the build, you want to avoid snowflaking or creating dimensions of dimensions. The best way to avoid this pitfall is solid early data modeling to build conformed dimensions. At the Presentation level, the data is finally seen as the asset it truly is and given the proper attention. If you can be diligent in earlier layers in the model, you will see greater ROI from your CDW build once you reach this stage.
Data models and democratization
Many of today’s Citizen Data Professionals have limited insight into the enterprise dataset available to them. They have a great depth of knowledge within their department. But they rarely have insights into other groups. They generally find out about new data attributes from hallway conversations, a chance meeting with another co-worker, or maybe an All Hands where some new project is announced. So by having a corporate data model for your CDW that includes all of your on premises RDBMS, you can promote data democratization and make data available to a greater audience sooner.
How Matillion ETL helps with data modeling
Having a data warehouse is more than just hosting data in the cloud. In order to make use of that data we need to enrich and enhance the data so that it is meaningful to the end user. Here is where Matillion really shines.
Using Matillion you will find components easy to use and understand. The visual interface and use of the properties tab limits typing and in many cases the tool will provide the values needed from a drop down list. You will find components that match your use cases, like CDC or change detection and aggregation. They all come ready to use out of the box. Matillion ETL helps provide the organization that helps you model all of your data and use it more effectively in your organization.
Learn how to progress on your data journey
Building a data model for your cloud data warehouse is one way to gain control of your data and move toward achieving faster time to insight for your whole organization. To determine where you are on the cloud data journey and how to move to the next level, take our interactive assessment today.