We talk a lot about modern ETL, as opposed to the way that traditional ETL has worked since the technology emerged a few decades ago. But how has ETL actually changed? Here are a few ways in which the two differ, and why it may be time to update the way you think about – and perform – ETL.
The basics of the ETL process
As its name suggests, the ETL process does three things:
- It extracts raw data from multiple siloed sources.
- It transforms that data to ensure consistency at its destination, regardless of origin. This transformation typically includes changing the data’s format; standardizing values such as currencies, units of measurement, and time zones; enriching and validating the data to eliminate missing values and duplicates; and applying business rules.
- It loads data into a target system, generally a data warehouse, for use in analytics and reporting.
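The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the source records, currency rates, and table names are all hypothetical, and SQLite stands in for a real data warehouse.

```python
import sqlite3

# Extract: raw records from two hypothetical siloed sources,
# each with its own conventions (and an overlapping record).
source_a = [{"order_id": 1, "amount": 20.0, "currency": "USD"},
            {"order_id": 2, "amount": 15.0, "currency": "EUR"}]
source_b = [{"order_id": 2, "amount": 15.0, "currency": "EUR"},  # duplicate of source_a's order 2
            {"order_id": 3, "amount": 900.0, "currency": "JPY"}]

RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "JPY": 0.007}  # illustrative rates only

def transform(records):
    """Standardize every amount to USD and eliminate duplicate orders."""
    seen, clean = set(), []
    for r in records:
        if r["order_id"] in seen:
            continue  # drop duplicates before they reach the target
        seen.add(r["order_id"])
        clean.append((r["order_id"], round(r["amount"] * RATES_TO_USD[r["currency"]], 2)))
    return clean

# Load: write the conformed rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount_usd REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", transform(source_a + source_b))
total = conn.execute("SELECT SUM(amount_usd) FROM orders").fetchone()[0]
```

Because the transform runs before the load, the target table only ever contains clean, deduplicated rows in a single currency – exactly the "consistency at its destination" the process is designed to guarantee.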
ETL emerged in the early 1990s as data-generating systems proliferated and data warehouses came into use. Companies realized that integrating data from all of their systems would provide them with a trove of strategic and operational business intelligence (BI), but only if the data was clean, structured, high quality, and integrated in a single location. This created a demand for software tools able to retrieve data from various sources and process it for its intended use before loading it into a data warehouse.
In the cloud and big data era, legacy ETL solutions designed only for on-premises data stores are now giving way to modern ETL tools built to handle data no matter where it lives – on premises, in a data center, or in the cloud.
What is the traditional ETL process?
Traditionally, ETL required specialized IT staff able to develop and use on-premises databases and data pipelines. The process tended to be time-consuming, relying on lengthy batch processing sessions. IT needed to know the target schema at the time of loading in order to transform the data to match that schema while in transit. Traditional ETL technology was also hard to scale, so as data volumes grew, teams often sacrificed the granularity of raw data for the sake of performance.
What are the challenges of traditional ETL?
Traditional ETL methods work only with relational databases and are poorly suited to unstructured data. The more data sources an ETL project draws from, the more hands-on work it requires from data engineers and developers. Meanwhile, changes or customizations to the technology stack make it fragile and unable to adapt to a rapidly changing business environment.
In addition, growing amounts of data in a traditional on-premises data warehouse required high-speed disk space to store large data sets, fast processors to perform calculations, and fast memory to perform data operations efficiently. Most IT organizations couldn’t afford to purchase and use those compute resources only for ETL projects. Therefore, they had to minimize impact on other systems by batch processing data only when system traffic was low. As a result, ETL was slow and costly, and BI users couldn’t perform real-time analysis or even perform ETL outside of scheduled batches.
As the volume and variety of data sources people want to track and query have exploded, companies are increasingly moving their data to cloud data warehouses. Traditional ETL was never designed for the cloud, and it can’t scale to keep up. The resulting bottlenecks delay reports and analytics, which makes it difficult to get timely data for informed decision making. In the era of Big Data, anyone still using traditional ETL ultimately increases their risk of missing opportunities and losing revenue.
What is the modern ETL process?
Traditional ETL is too inflexible, expensive, and slow for the agile, data-driven modern organization. The modern ETL process that has evolved to take its place is cloud-based, which makes it both fast and elastically scalable. It allows a company to extract data and start moving it into a single, centralized data repository before beginning the transformation process – and to continue to load and process data rapidly in near real time.
Modern ETL leverages the benefits of the cloud to provide accessibility without sacrificing security and delivers easy scalability at a manageable cost. SaaS-based pipeline tools enable you to leave backups, encryption, security, and infrastructure issues to the vendor and focus on moving your data to the cloud. Full-featured, cloud-deployed ETL products (like Matillion ETL) enable you to take advantage of the speed, scale, savings, and simplicity of the cloud while still maintaining control over security, governance, and compliance.
Here are some ways in which modern ETL is superior to its traditional predecessor:
Flexible tools for different data sources
Modern ETL tools are capable of importing and exporting both structured data and unstructured data from virtually any source, from spreadsheets to IoT sensors. They can also scale in a timely, cost-effective way to accommodate fluctuating workloads.
Tools for both on-premises and cloud data warehouses
Modern ETL tools are built to integrate with both on-premises and cloud data warehouses, including Amazon Redshift, Snowflake, and Google BigQuery. As new data warehouses emerge, vendors add connectors to support them as well.
Modern ETL solutions are designed to capture streaming data and integrate with data platforms to support real-time data pipelines and on-the-fly schema changes – which ensures that your analysts and business decision makers have constant, unlimited access to all your data, all the time.
The modern ETL process lets you decide whether to transform data before or after loading it into the target data warehouse – or both. For example, you might choose to use the power of cloud-based data warehousing for tasks like joins and complex calculations, and perform enrichment and privacy-related transformations while the data is in flight. Adapting your data pipeline to make the most sense for your particular needs ensures high performance, especially for modern data scenarios such as business intelligence, artificial intelligence, and machine learning.
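The load-then-transform order described above can be sketched with a few lines of Python, again with SQLite standing in as a stand-in for a cloud warehouse such as Redshift, Snowflake, or BigQuery. Table names, column names, and the conversion rate are all illustrative; the point is only that raw data lands first and the target engine does the transformation work afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load first: raw rows land in a staging table untouched,
# so ingestion never waits on transformation logic.
conn.execute("CREATE TABLE staging_sales (region TEXT, amount_eur REAL)")
conn.executemany("INSERT INTO staging_sales VALUES (?, ?)",
                 [("north", 100.0), ("north", 50.0), ("south", 200.0)])

# Transform afterwards, inside the target: the warehouse engine
# does the heavy lifting (aggregation and a hypothetical
# EUR-to-USD conversion at an illustrative rate of 1.1).
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, ROUND(SUM(amount_eur) * 1.1, 2) AS total_usd
    FROM staging_sales
    GROUP BY region
""")

rows = conn.execute(
    "SELECT region, total_usd FROM sales_by_region ORDER BY region").fetchall()
```

Pushing the transformation into the warehouse engine like this is what lets the pipeline exploit the warehouse's own compute for joins and complex calculations, rather than doing that work in transit.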
How do traditional ETL and modern ETL compare?
| | Traditional ETL | Modern ETL |
| --- | --- | --- |
| Sources & data types | Designed for relational databases and other traditional data sources | Designed to handle both structured and semi-structured data from a broad range of sources and targets |
| Hardware requirements | Has its own engines for performing data transformations, which generally requires an investment in servers, storage, and other specific hardware and the data center space in which to put them | The vendor absorbs the cost of hardware, infrastructure, and management in the cloud, for a potentially significant reduction in your own IT costs and footprint |
| Flexibility | More inflexible about types of sources and targets, changes in schemas, and location and variety of possible transformations | Seamlessly integrates data from many types of sources to many types of targets, both on premises and in the cloud |
| Processing method | Batch data processing | Real-time streaming and rapid batch data processing |
| Security | You handle security with your own resources and on your own network | Flexibility regarding security: with SaaS, you can opt for the vendor to manage security in the cloud; with a cloud-deployed solution, you can retain more control over security preferences |
How Matillion can help
Matillion offers modern ETL tools that help companies overcome their traditional ETL challenges. Find out more about how ETL has changed and how it’s continuing to evolve in “What is ETL? The Ultimate Guide” and start learning to create an ETL strategy that will transform not just your data, but your business.