
5 Key Differences Between a Data Lake vs Data Warehouse

A data lake is not a direct replacement for a data warehouse; they are complementary technologies that serve different use cases, with some overlap. Most organizations that have a data lake will also have a data warehouse. The following sections compare the properties of a data lake with those of a traditional BI architecture (a data warehouse and a separate ETL server).

 

1. Data in Data Lakes is stored in its native format

Data can be loaded and accessed faster because it does not need to go through an initial transformation process. In a traditional relational database, data needs to be processed and transformed before it is stored.
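As a minimal sketch of what "native format" means in practice, the snippet below lands files in a lake folder byte for byte, with no parsing or transformation on the way in. The folder layout (source, then date), the `land_raw` helper, and the sample payloads are all hypothetical and chosen only for illustration; a temporary local directory stands in for real lake storage.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(root: Path, source: str, day: date, filename: str, payload: bytes) -> Path:
    """Write the payload unchanged under root/<source>/<YYYY-MM-DD>/<filename>."""
    target_dir = root / source / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no schema, no transformation
    return target


# A temporary directory stands in for the data lake (assumption for this sketch).
lake_root = Path(tempfile.mkdtemp())

# Two sources with different formats land side by side, each kept as it arrived.
csv_payload = b"id,name\n1,Ana\n"
json_payload = b'{"id": 2, "name": "Bo"}\n'

csv_path = land_raw(lake_root, "crm", date(2021, 3, 1), "contacts.csv", csv_payload)
json_path = land_raw(lake_root, "webshop", date(2021, 3, 1), "orders.json", json_payload)
```

Because nothing is parsed at load time, a CSV export and a JSON feed can be ingested by the same code path; any interpretation of the contents is deferred until someone reads the data.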

2. Data in Data Lakes can be accessed flexibly

Data scientists, engineers, and analysts can access data much more quickly than would be possible in a traditional BI architecture. Data lakes increase agility and provide more opportunities for data exploration and proof-of-concept activities, as well as self-service business intelligence, all within your privacy and security settings.

3. Data Lakes Provide Schema-on-Read Access

Traditional data warehouses employ Schema-on-Write. This requires an upfront data modeling exercise to define the schema for the data. All data requirements, from all data users, need to be known upfront to ensure the models and schemas produce usable data for all parties. As you unearth new requirements, you may have to redefine your models.

Schema-on-Read, conversely, allows the schema to be developed and tailored on a case-by-case basis. The schema is developed and projected on the data sets required for a particular use case. Once the schema has been developed, it can be kept for future use or discarded when no longer needed.
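To make the idea concrete, here is a small sketch of schema-on-read in plain Python. The raw events, the field names, and the `read_with_schema` helper are hypothetical examples, not part of any particular product. The same untouched raw data is read twice, each time with a schema tailored to one use case.

```python
import json

# Raw events kept exactly as they arrived (JSON lines). Fields vary per record.
RAW_EVENTS = """\
{"user": "ana", "amount": "19.99", "country": "PT", "ts": "2021-03-01T10:00:00"}
{"user": "bo", "amount": "5.00", "ts": "2021-03-01T11:30:00", "coupon": "SPRING"}
{"user": "ana", "amount": "42.50", "country": "PT", "ts": "2021-03-02T09:15:00"}
"""


def read_with_schema(raw_text, schema):
    """Project a schema onto raw JSON lines at read time.

    `schema` maps an output field name to (source key, converter, default).
    Keys that the schema does not mention are simply ignored.
    """
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        row = {}
        for field, (key, convert, default) in schema.items():
            value = record.get(key)
            row[field] = convert(value) if value is not None else default
        rows.append(row)
    return rows


# Schema for a revenue analysis: only who paid and how much, as a number.
REVENUE_SCHEMA = {
    "user": ("user", str, None),
    "amount": ("amount", float, 0.0),
}

# Schema for a marketing analysis over the very same raw data.
MARKETING_SCHEMA = {
    "user": ("user", str, None),
    "coupon": ("coupon", str, "NONE"),
    "country": ("country", str, "UNKNOWN"),
}

revenue_rows = read_with_schema(RAW_EVENTS, REVENUE_SCHEMA)
marketing_rows = read_with_schema(RAW_EVENTS, MARKETING_SCHEMA)
total_revenue = sum(row["amount"] for row in revenue_rows)
```

Neither schema had to exist when the events were stored, and either one can be kept for reuse or thrown away after the analysis. With schema-on-write, by contrast, the coupon and country fields would have had to be modeled before the first record was loaded.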

4. Data Lakes Provide Decoupled Storage and Compute

When you separate storage from compute, you can better optimize your costs by matching storage tiers to access frequency. The separation allows your business to archive raw data on less expensive tiers while keeping transformed, analytics-ready data on faster ones. This kind of data preparation also makes it much easier to run experiments and exploratory analysis with new technologies. Traditional data warehouses and ETL servers have tightly coupled storage and compute, meaning that if you need to increase storage capacity you also need to expand compute, and vice versa.
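The cost effect of tiering by access frequency can be sketched with a little arithmetic. Everything in the example below is an assumption made for illustration: the tier names, the per-GB prices, the read-count thresholds, and the datasets are invented and do not reflect any vendor's pricing.

```python
from dataclasses import dataclass

# Hypothetical prices per GB per month; NOT real vendor pricing.
TIER_PRICE_PER_GB = {"hot": 0.025, "cool": 0.010, "archive": 0.002}


@dataclass
class Dataset:
    name: str
    size_gb: float
    reads_per_month: int


def choose_tier(ds: Dataset) -> str:
    """Pick a storage tier from access frequency (thresholds are illustrative)."""
    if ds.reads_per_month >= 30:
        return "hot"
    if ds.reads_per_month >= 1:
        return "cool"
    return "archive"


def monthly_storage_cost(datasets, tier_for=choose_tier) -> float:
    """Sum the monthly storage cost with each dataset placed on its chosen tier."""
    return sum(ds.size_gb * TIER_PRICE_PER_GB[tier_for(ds)] for ds in datasets)


datasets = [
    Dataset("raw_clickstream", size_gb=10_000, reads_per_month=0),
    Dataset("monthly_exports", size_gb=2_000, reads_per_month=2),
    Dataset("curated_sales", size_gb=500, reads_per_month=200),
]

tiered_cost = monthly_storage_cost(datasets)
all_hot_cost = monthly_storage_cost(datasets, tier_for=lambda ds: "hot")
print(f"tiered: {tiered_cost:.2f}  all on hot tier: {all_hot_cost:.2f}")
```

Compute does not appear in the calculation at all, which is the point of decoupling: the storage bill follows data volume and access patterns, while compute can be sized separately for the queries that actually run.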

 

 

Data Lake                        | Traditional On-Premises Data Warehouse
Data stored in native format     | Data requires transformation
Can store unlimited data forever | Expensive to store large volumes
Schema-on-read                   | Schema-on-write
Decoupled storage & compute      | Tightly coupled storage & compute

 

5. Data Lakes Go With Cloud Data Warehouses

While data lakes and data warehouses both contribute to the same strategy, data lakes pair better with cloud data warehouses. ESG research shows that roughly 35-45% of organizations are actively considering cloud for functions like Hadoop, Spark, databases, data warehouses, and analytics applications. This trend is growing because of the benefits of cloud computing, such as massive economies of scale, reliability and redundancy, security best practices, and easy-to-use managed services. Cloud data warehouses combine these benefits with traditional data warehouse functionality to deliver increased performance and capacity and to reduce the administrative burden of maintenance.

 

The comparison table outlines where best to store your various data sources.

 

 

 

To learn more about data lakes and how to optimize your data analytics, download our eBook, ‘The Essential Guide to Data Lakes: Designing Data Lakes to Optimize Analytics’.
