MythBuster: Would life be better without ETL?
During the re:Invent 2022 keynote, AWS CEO Adam Selipsky touted a “zero ETL future”. AWS introduced integration between its Aurora relational database and the Redshift cloud data warehouse that lets users link Aurora data from multiple clusters to a Redshift instance.
Who would not want data ready in near-real time for advanced analytics, machine learning, and other use cases? Would that really be an ideal state for most organizations and their data environments? Let’s look at both sides of the issue.
The case for ‘Zero-ETL’
Zero-ETL, as AWS has defined it, is essentially automating the extraction and loading of data from source to destination and currently encompasses the movement of data from an Aurora database cluster or clusters to a Redshift cloud data warehouse.
The case for Zero-ETL is that it would enable organizations that rely exclusively on the Amazon ecosystem for their data analytics environment to give their data teams more seamless access to data, with much less manual work or management, and make it easier to perform near-real-time analytics on that data.
The case against ‘Zero-ETL’
In concept, ‘Zero-ETL’ sounds great. Who wouldn’t want all of their data instantly ready for analytics without any work? But the concept does not match reality.
ETL is a three-phase process and includes extraction (E) of the data from one or more sources, transformation (T) of that data so that it is clean, sanitized, scrubbed, etc., and then the loading (L) of that data into a destination where it can be analyzed and bring value to an organization. Zero-ETL, as currently constituted, glosses over a number of these critical processes.
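The three phases can be sketched in a few lines of Python. This is only a minimal illustration, not any vendor's implementation: the sample rows, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the analytics destination.

```python
import sqlite3

# Hypothetical source rows, standing in for an extract from an operational system.
SOURCE_ROWS = [
    {"order_id": 1, "customer": "  Acme Corp ", "amount": "100.50"},
    {"order_id": 2, "customer": "globex", "amount": "n/a"},
    {"order_id": 3, "customer": "Initech", "amount": "42"},
]


def extract():
    """E: read raw records from the source (here, an in-memory list)."""
    return list(SOURCE_ROWS)


def transform(rows):
    """T: trim and standardize names, parse amounts, drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip records whose amount is not numeric
        clean.append((row["order_id"], row["customer"].strip().title(), amount))
    return clean


def load(rows, conn):
    """L: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform, and load in order against the given connection."""
    load(transform(extract()), conn)
```

Dropping the `transform` call from `run_pipeline` would still move the data, but the untrimmed names and the non-numeric amount would arrive in the destination as-is.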
The reality is that the data environment for most organizations is not exclusive to a single vendor or cloud. Most organizations have their source data in multiple databases or applications from multiple vendors, whether it be an Oracle database for customer orders, Salesforce for CRM, or Workday for HR. Many also rely on a multi-cloud, and often multi-data platform, strategy for storing their data and for analytics. To maximize its utility, Zero-ETL would tie organizations to a single ecosystem, resulting in vendor lock-in.
What about the ‘T’?
Zero-ETL focuses on only two of the three parts of the ETL equation: the extract and the load. But what about the transform? Transformation enriches data while also cleansing and standardizing it. It is, in fact, what makes the data valuable and ready for analytics, machine learning, and many other use cases.
Transformation provides clarity. Without transformation, there is confusion and chaos, much as there was 30 years ago, before data integration and ETL forged the path for modern business.
Transformation is what gives value to the data that is being moved. It enables data from multiple sources to be normalized for easy analysis, then enriched and returned to its place of origin (i.e. reverse ETL), so those operational systems can also benefit from the value that the transformed data provides.
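As a rough illustration of what normalizing data from multiple sources can involve, the sketch below maps records from two systems onto one common schema. Everything here is an assumption made for the example: the field names (`AccountId`, `cust_id`, and so on), the date formats, and the target schema are hypothetical and do not reflect any real product's data model.

```python
from datetime import datetime


def normalize_crm(record):
    """Map a hypothetical CRM record (US-style dates) onto the common schema."""
    return {
        "customer_id": record["AccountId"].strip().upper(),
        "email": record["Email"].strip().lower(),
        "signup_date": datetime.strptime(record["CreatedDate"], "%m/%d/%Y")
        .date()
        .isoformat(),
    }


def normalize_orders(record):
    """Map a hypothetical order-system record (ISO dates) onto the common schema."""
    return {
        "customer_id": record["cust_id"].strip().upper(),
        "email": record["email_addr"].strip().lower(),
        "signup_date": datetime.strptime(record["first_order"], "%Y-%m-%d")
        .date()
        .isoformat(),
    }


def integrate(crm_records, order_records):
    """Combine both sources into one dict keyed by customer_id.

    When a customer appears in both sources, the CRM record wins.
    """
    combined = {}
    for rec in map(normalize_orders, order_records):
        combined[rec["customer_id"]] = rec
    for rec in map(normalize_crm, crm_records):
        combined[rec["customer_id"]] = rec
    return combined
```

Once both sources share identifiers, casing, and date formats, the same customer can be matched across systems, which is what makes joined analysis (and writing enriched values back to the source) practical.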
Modern enterprises are moving their data to the cloud to take advantage of the speed, processing, and advanced analytics capabilities of today’s cloud data platforms like Redshift, Snowflake, and Databricks. Eliminating transformation from the ETL process would negate many of the benefits of a cloud infrastructure. It would be much harder to do complex analytics on data from multiple, disparate sources and would handcuff organizations to a single provider for all their data needs.
Conclusion: We need a new ETL paradigm – Zero-toil ETL
The premise of Zero-ETL is that by automating the movement of data within the Amazon ecosystem – from an Aurora database to a Redshift cloud data warehouse – you can get your data ready for analytics, in near real time, with virtually no effort. While this is an admirable goal, getting data business- and analytics-ready extends well beyond copying data between database systems in a single-vendor environment.
In the real world, critical business data comes from dozens, if not hundreds, of sources. It’s essential to get all this data into your analytics infrastructure and, once there, into business- and analytics-ready formats as quickly as possible. Modern enterprises also balance multiple public cloud environments (AWS, Azure, GCP), as well as multiple cloud data platforms (Snowflake, Databricks, Redshift, etc.). This requires connecting to and extracting data from multiple source systems, loading that data into one or more cloud data platforms, and, most importantly, transforming that data to ensure that data from multiple sources is integrated and enriched to maximize its value.
“Matillion sees data movement as an enabling technology that gets you to the value-add of transformation,” said Matthew Scullion, the CEO and founder of Matillion. “It’s great that platforms like AWS are making it easier to move data between them, but it does not change the need to refine or transform that data into something useful for the business.”
ETL is not going away as long as there is a need to transform data to bring out its value. What we need is to make ETL easier – a new paradigm: Not Zero-ETL, but Zero-toil ETL.
This is where Matillion Data Productivity Cloud comes in. Matillion makes it easy to extract data from virtually any data source and load it to your cloud platform of choice. We then leverage the compute and storage of your cloud platform to transform the data with unprecedented scalability and performance. And the whole process is made as easy as possible with Matillion’s low-code/no-code functionality and integration of the native functionality of destination cloud data platforms.
We all want to end the ETL toil. But there will always be a need to move data, integrate it, transform it, and orchestrate data pipelines across platforms. Matillion makes that process easier.
Begin your data journey now with the Matillion Data Productivity Cloud.