Is it a typing mistake? That’s a typical reaction from many data professionals familiar with the acronym ‘ETL’—short for ‘Extract, Transform and Load’—when they first encounter the subtly-different acronym ‘ELT’.
But no, it’s not a typo. It’s those same familiar activities of Extract, Transform and Load—but carried out in a different, and more efficient, order. An order that leverages the Cloud, and that leverages columnar data store technology, and that leverages the massively parallel processing capability of a columnar data store.
And if you’re using Amazon Redshift, then a massively parallel columnar data store is what you’re dealing with. So ELT, rather than ETL, can make a lot of sense.
Let’s take a look at why.
Extract, Transform and Load.
First, let’s remind ourselves about exactly what ETL is, and why we need it.
Primarily, ETL is about taking data from a data source, applying any transformations that may be required, and then loading it into a data warehouse ready for running reports and queries against.
The ‘extract’ part can be relatively simple, if it’s a single data source such as an ERP database. Or it can be a bit more complicated if it’s an ERP database plus a number of line of business systems or external third-party data sources. But either way, hook up the right connectors, and start the extract.
Likewise with the ‘transform’ activity. This can be either a straightforward denormalisation of the normalised data contained in an ERP or line of business database, or something a little more involved, such as converting units of measure to a common bases, or harmonising customer names.
Finally, there’s the ‘load’ task: we have the data, we’ve transformed it, and now it goes into the data warehouse.
Either way—straightforward or more complicated—ETL is characterised by a lot of I/O activity, a lot of string processing and variable transformation, and a lot of data parsing.
Put another way, it’s very compute-intensive.
Extract, Load, Transform.
And that’s really the point about ELT versus ETL: taking a compute-intensive activity, and doing it where it makes most sense. Which typically won’t be in an on-premise server that is perhaps already under pressure with its regular transaction-handling role.
That’s right. The ‘extract’ activity is the same with ELT or ETL. The ‘load’ activity is the same, too, apart from the fact that what is being loaded is the un-transformed data. But the ‘transform’ activity is very different, because it’s taking place in the Cloud, inside Redshift.
And Redshift, let’s remember, is a columnar database, so index and record location operations are vastly quicker. And it’s also a massively parallel database, so the required transformations are carried out in parallel, not sequentially, with multiple nodes handling multiple transformations at the same time.
So it’s quicker and more efficient—which is exactly what you want a compute-intensive activity to be, of course.
ELT versus ETL: the bottom line.
That said, ELT isn’t especially novel.
Putting compute-intensive activities in Big Iron has always been seen as sensible—for those that could afford the price of a hefty mainframe or high-end server. But today, that Big Iron is accessible to any business with the price of an affordable, monthly subscription on AWS.
Likewise, Redshift isn’t the first massively parallel columnar database: IBM, HP, and Oracle have had offerings for years.
But those offerings have been on-premise, and expensive, with an entry point denominated in hundreds of thousands of dollars. Today, yet again, Amazon Redshift is accessible to any business with the price of a low cost monthly subscription.
At Matillion, we think ELT makes compelling sense. Do you?
If you’re an existing Amazon Redshift user, or even if you’re just evaluating tools, download our free eBook below for loads of useful advice and best practices on how to get the most out of your data warehouse setup.