There’s a lot of buzz about the lakehouse. Companies like Databricks are betting that enterprises no longer want to choose between a cloud data warehouse and a data lake, or run two different environments. And we think they’re onto something.
What is a lakehouse? It’s a new data management paradigm that combines the capabilities of data warehouses and data lakes. The lakehouse has the data structures and data management features of a data warehouse, but stores data directly on the kind of low-cost storage used for data lakes. In our new ebook, Guide to the Lakehouse: Unite Your Teams in the Cloud to Bridge the Information Gap, we look at several reasons why a lakehouse makes a lot of sense for data teams doing modern analytics.
The type of data we collect is changing
Traditionally, we worked mostly with structured data from databases and applications, which could be stored in a data warehouse and, once transformed, used for traditional business intelligence and analytics. But more and more data is semi-structured or unstructured, flooding into the business from IoT sensors and devices or from video, audio, and other multimedia. Certain data teams have always worked with this type of data, but as it proliferates, multiple teams will find that they need to work with semi-structured or unstructured data in addition to structured data.
The lakehouse blurs the line between structured and unstructured, enabling you to store all types of raw data in one location (like Databricks), while still having a storage layer on top (like Delta Lake) to provide transactional views of data and structured data management and analytics when needed.
We’re doing more data science, machine learning, and artificial intelligence
Machine learning and artificial intelligence are no longer on the far horizon – they’re getting closer to being a reality for all organizations. There are several reasons for the shift. First of all, with both the volume and diversity of data rapidly increasing, it’s simply no longer possible for humans to analyze all of it. Organizations are turning to machine learning and artificial intelligence in order to keep up with that enormous volume of data and make sense of it.
Second, as customers demand more customized, personalized experiences, companies need to model more data and identify more attributes to be able to provide customers with what they want, when they want it…even before they know they want it. ML and AI are potentially becoming the keys to unlock the insights in their data to enhance existing revenue streams or develop new ones. As a result, more companies are relying on data scientists to quickly build models and data products. And those data scientists need access to massive stores of structured and unstructured data to build advanced analytics models. The lakehouse provides that data playground for data scientists.
Data engineers and data scientists both need fast access to shared, secure, connected data
You could provide the large data stores that data science requires in a traditional data lake, yes. But that leaves out half of the data equation. Data engineers need access to the same data for analytics and business intelligence. Unfortunately, even if data engineers and data scientists use the same data, they're probably doing it in different silos: data engineers in a structured cloud data warehouse and data scientists in an unstructured data lake. This data dichotomy can lead to duplication, inaccuracies, and inefficiencies.
A lakehouse can break down those silos between data and data professionals so everyone has access to the same shared, secure, connected data and different teams are pulling from the same analytics-ready datasets. With a shared system, data is more reliable, processes are more efficient, and systems are more cost-effective.
Enterprises need to accelerate time to value for their data science teams
In data science, moving fast is critical, for two reasons:
- Data science is not, in fact, an exact science. Data scientists need to be explorers and investigators. And they need to fail fast and iterate. Following leads, pulling on threads, and trying all the things is the only path to real discovery and insight.
- From a business perspective, as soon as you hire a data scientist, the clock is ticking. The average tenure for a data scientist is two years. And in the 2020 IDG Marketpulse survey, data professionals estimated that they spent around 45 percent of their time preparing data for analytics.
For an enterprise, that’s some scary math. If you can’t get a data scientist up and running, working with agility, and adding value as quickly as possible, it’s possible they’ll barely be up to speed before they’re on to the next gig.
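To make that math concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the two-year figure and the 45 percent estimate quoted above. Applying the survey share evenly across a data scientist's whole time in the role is our simplifying assumption, and the function and constant names are purely illustrative.

```python
# Illustrative arithmetic only. The inputs are the two figures quoted in the text;
# spreading the 45 percent share evenly over the full two years is an assumption.
AVERAGE_TENURE_MONTHS = 24  # "two years"
DATA_PREP_SHARE = 0.45      # "around 45 percent" of time spent preparing data


def months_split(tenure_months: float, prep_share: float) -> tuple[float, float]:
    """Return (months spent preparing data, months left for everything else)."""
    prep = tenure_months * prep_share
    return prep, tenure_months - prep


prep_months, other_months = months_split(AVERAGE_TENURE_MONTHS, DATA_PREP_SHARE)
print(f"Preparing data:  {prep_months:.1f} months")
print(f"Everything else: {other_months:.1f} months")
```

Under those assumptions, close to 11 of the 24 months would go to preparing data rather than to exploration and modeling.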
In a lakehouse, data scientists have ready access to quality data. And more importantly, they can collaborate more closely with the data engineers who can help them productionize the valuable work they are doing into repeatable data products. And a cloud-native ELT platform like Matillion ETL for Delta Lake on Databricks can help these teams prepare data faster and collaborate more effectively to realize quicker time to value for data science initiatives.
Learn more about the lakehouse: Download the ebook
For the full story on the value of a lakehouse, and the value of a cloud-native ELT solution like Matillion ETL to help achieve faster time to insight and innovation, download our latest ebook, Guide to the Lakehouse: Unite Your Teams in the Cloud to Bridge the Information Gap.