Here at Matillion we have a breadth of customers, from startups to large enterprises, all taking advantage of the Extract-Load-Transform (ELT) approach for delivering quick results. They are using Matillion for an equally wide range of purposes: everything from customer behaviour analytics, financial analysis, and even reducing the cost of synthesising DNA.
At the same time we’ve also noticed that some customers are struggling to get value and insight from the semi-structured contents of their data lake: especially from sources like JSON, XML, text files and logfiles.
Why is this a problem? Isn’t ELT working for those customers? In this fascinating and rapidly-evolving area, is there actually any benefit in using ELT at all?
First let’s prepare the ground by examining the definitions which underpin both ETL and ELT.
- E – extracting data from source, and usually also implying that:
– there are multiple sources
– the data is staged into a data lake, for low storage costs & convenient access
- T – transforming (i.e. converting) the raw data into a format that’s suitable for reporting and analytics. This typically includes:
– Enforcing consistency (currencies, timezones, units of measurement)
– Applying business rules
– Enriching and validating (dealing with missing values, duplicates)
– Converting to a simpler data model (usually a star schema)
- L – loading data into a target platform (e.g. a relational database)
So “ETL” means you extract data from its source(s), transform it en route, and finally load it into a target database for reporting
Whereas “ELT” means you extract data from source, load it unchanged into a target platform, and then transform it afterwards, to make it ready for use
There’s also an important implied assertion that:
- ETL is highly targeted (always driven by known requirements), and so only extracts certain very specific data from sources
- ELT is in contrast, less selective. The preference is to extract all the source data, and then allow the user to decide later what’s needed for reporting
The underlying hardware and data flows differ significantly between ETL and ELT.
With ETL, the data is extracted from source, and transformed by an ETL engine en route to its permanent home in the Data Store, which is usually a relational database.
The ETL engine is basically a compute resource, and needs to be powerful enough to handle the large amounts of data which are going to be transformed. Remember that “powerful” also means expensive!
With ELT, the data is extracted from source and first of all loaded into the Data Store. It immediately becomes convenient to access, won’t disappear, and easy to audit.
ELT then re-uses the power of the Data Store to do the transformations which get the data ready for presentation. With Matillion this involves running push-down SQL, and means that you only need one powerful piece of infrastructure.
Our tools for Redshift and BigQuery both use the ELT approach. So let’s follow this path and work through a practical example.
The Tax Return
Here in the United Kingdom, the tax year runs from April to April. This is so the Queen gets the cash in time for her mid-year birthday.
So for the dual purpose of writing this blog article, and doing my annual tax return, l went through the house and “extracted” all my paperwork before “loading” it into a single, centralized Data Store (my desk). Here it is:
Great! All the data is in one place, so now I just need to do a few simple transformations – like sums and aggregations – and my tax return is pretty much done?
Wrong! Significant work needs to be done before it’s even going to be possible to start working on this pile of input!
- Find the invoices and receipts
- Note the incoming and outgoing amounts
- Correlate to the bank statements
- Decipher the papers that the dog chewed or the kids scribbled on
- Determine if there’s anything missing or duplicated
By the way, it’s the “ETL Lie” too, because exactly the same applies.
With ETL, determining the source schema is often confusingly blended with the task of data extraction. It therefore has the illusion of zero cost. Also, ETL has historically tended to selectively focus only on the “best” source data – i.e. typically data that’s already relational, and has a clearly defined schema.
Source organization and Relationalization
Hopefully many readers are laughing at my shambolic filing system, and have already solved the initial part of the tax return problem by filing their paperwork properly in the first place.
If you’re in this situation then that’s great! Your source data schema is already defined, your organisation very likely has good information governance and master data management in place. You can immediately proceed with ELT (or ETL) and start to use and integrate that data.
But maybe you’re not in control of your source data formats, and they change frequently and without notice. Or perhaps your company has very informal and unstructured procedures for storing data, with ad hoc data models. In this case you really will need to address the problem, and sort out your data before use. This step is known as relationalization, and it applies equally to ETL and ELT.
How to relationalize?
If the need arises, there are a wide range of file- or stream-based technologies which can address this problem.
Solutions such as Hadoop, Cloud Dataproc and AWS EMR can be used to impose some order onto disparate input, using the MapReduce programming model.
For streaming data, services such as Pub/Sub and Kinesis Firehose can capture, batch and write to persistent storage. You must provide a suitable transformation function (for example using Lambda) to interpret and restructure the input formats.
New services like AWS Glue are becoming available to guide and even automate some aspects of data and schema discovery.
These techniques are needed to deal with the fact that
- You first have to identify what the source data actually is (e.g. differentiate between all the receipts and all the bank statements)
- Input formats may vary (e.g. invoices always contain specific information such as date, currency and amount; but they can differ greatly in appearance and structure)
- Formats often change over time
- The dog chewed some (i.e. attributes may be missing)
- The kids scribbled on some (i.e. robust error handling is needed)
If your input data is semi-structured or non-relational, then both ETL and ELT have a pre-requisite in that some “sorting-out” must first be performed.
The “lie” is that this can happen by itself, without effort.
At Matillion, we welcome new technologies such as those mentioned above. We don’t try to compete with those specialist services: instead we encourage you to incorporate them into your infrastructure! Matillion already has conceptually identical integrations with services such as S3, Google Cloud storage, RDS, Google SQL, Spanner, AWS SQS and SNS.
We believe that ELT is the best architecture, with superior scalability and at lower cost compared to ETL. When used in combination with data interpretation and relationalization services, this end-to-end approach enables you to take advantage of a data set that’s both wide and deep.
Download our free eBook below for loads of useful advice and best practices on how to optimize your Amazon Redshift setup