If you’re reading this, you’ve probably heard the term “ETL” thrown around in relation to data, data warehousing, and analytics. It stands for “Extract, Transform, and Load.”
If you have data from multiple sources that you want to bring into a centralized database, you need to:
Extract the data from its original source, whether that is another database or an application
Transform the data by cleaning it up, deduplicating it, combining it, and otherwise getting it ready to load
Load the data into the target database
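As a minimal sketch, the three steps above might look like this in Python, using sqlite3 as a stand-in target database (the table, column names, and sample records are illustrative, not from any particular tool):

```python
import sqlite3

def extract(source_rows):
    """Extract: pull raw records from a source system (here, an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Transform: trim whitespace, normalize case, and deduplicate on email."""
    seen, cleaned = set(), []
    for name, email in rows:
        email = email.strip().lower()
        if email not in seen:
            seen.add(email)
            cleaned.append((name.strip().title(), email))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned records into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, email TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

source = [("ada lovelace ", "ADA@example.com"), ("Ada Lovelace", "ada@example.com")]
conn = sqlite3.connect(":memory:")
load(transform(extract(source)), conn)
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # the duplicate is dropped
```

In a real pipeline, each step would be a connector or a job in the ETL tool rather than a function, but the division of labor is the same.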
Typically, one ETL tool handles all three of these steps, and it is a critical part of ensuring that the data required for reporting, analytics, and, now, machine learning and artificial intelligence is complete and usable. But the nature of ETL, the data it handles, and where the process takes place have all evolved tremendously over the last decade, and the right ETL software is more critical than ever.
ETL has roots in the 1970s and the rise of centralized data repositories. But it wasn’t until the late 1980s and early 1990s, when data warehouses took center stage, that we saw the creation of purpose-built tools to help load data into these new warehouses. Early adopters needed a way to “extract” data from siloed systems, “transform” it into the destination format, and “load” it. The first ETL tools were primitive, but they got the job done. Granted, the amount of data they handled was modest by today’s standards.
As the amount of data grew, data warehouses grew, and ETL software tools proliferated and became more sophisticated. But through the end of the 20th century, data storage and transformation were done primarily in on-premises data warehouses. Then something happened that changed the way we look at data storage and processing forever.
The amount of data we generate and collect continues to balloon at an exponential rate. What’s more, we have increasingly sophisticated tools that enable us to use all of our data to gain real insights into our business and customers. Traditional data warehouse infrastructure can’t scale to hold and process that much data, at least not cost-effectively or quickly. If we want to perform high-speed, sophisticated analytics and intelligence on all of our data, the cloud is the only place to do it. Cloud data warehouses such as Amazon Redshift, Snowflake, and Google BigQuery can scale up and down elastically to accommodate virtually any amount of data. A cloud data warehouse also supports massively parallel processing (MPP), which coordinates huge workloads across horizontally scalable clusters of computational resources. On-premises infrastructures simply don’t have that speed or scalability. The cloud changes how we handle data and how we define and deliver ETL.
A lot of people ask, now that we’re in the cloud, why do we need ETL? Is it still important? The answer is, “Yes. Absolutely.” ETL has several business benefits that go beyond simply extracting, cleaning, conforming, and delivering data from Point A (source) to Point B (destination):
ETL helps businesses gain deep historical context with data.
It provides a consolidated view of data, for easier analysis and reporting.
It improves productivity with repeatable processes that don’t require heavy hand-coding.
It improves the data accuracy and auditability that most businesses require for compliance with regulations and standards.
The reason you need ETL in the cloud is the same reason you need it in a traditional data warehouse. Your data still needs to be brought to a central repository–now from more sources than ever, in structured and semi-structured form. Those huge stores of data need to be transformed into formats best suited for analysis. ETL prepares data for fast access and thus fast insight. Data must be collected and prepared for use in business intelligence tools, such as data visualization software, or else it’s no more useful in the cloud than it would be sitting in its raw format in some data center.
While the purpose of ETL is the same, the process and tools are changing. Most traditional ETL software extracts and transforms data before ever loading it into a data warehouse. While you can still use a traditional ETL product for that process in the cloud, you shouldn’t. Here’s why:
In terms of data and workload management, the cloud is far more scalable in storage and processing than traditional data warehouse infrastructure. But traditional ETL software performance doesn’t automatically improve in the cloud. Traditional ETL tools are unlikely to take advantage of the native improvements and best practices that a cloud data warehouse offers. In fact, they are more likely to treat the cloud warehouse like a traditional warehouse, which can reproduce the same performance bottlenecks and leave you asking what added value a move to the cloud provides. Cloud-native ELT (instead of ETL) is built to leverage the best features of a cloud data warehouse: elastic scalability as needed, massively parallel processing of many jobs at once, and the ability to spin up and tear down jobs quickly.
In the cloud, the proper order of the three traditional ETL steps also changes. There’s no need to clean up data on dedicated ETL hardware before loading it into your data warehouse. Instead, the cloud creates the perfect conditions for “push-down” ELT architecture:
Extract the data from multiple sources and connectors
Load it as-is into the cloud data warehouse
Transform it using the power and scalability of the target cloud platform
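The push-down pattern above can be sketched with sqlite3 standing in for the cloud data warehouse (table names and sample rows are illustrative): the raw records are landed as-is, and the transformation is expressed in SQL that the warehouse engine itself executes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the records as-is in a raw staging table --
# no transformation on an intermediate ETL server.
conn.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("north", "100.50"), ("north", "49.50"), ("south", "200.00")],
)

# Transform: push the work down to the warehouse engine as SQL.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY region
""")

for row in conn.execute("SELECT region, total FROM sales_by_region ORDER BY region"):
    print(row)
```

The key design point is that the Python code never touches the data during the transform step; it only submits SQL, so the warehouse’s own scalability does the heavy lifting.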
The future is ELT. If you are still on-premises and your data is predictable, coming from only a few sources, then traditional ETL still works. However, that is becoming less and less the case as more businesses commit to a cloud or hybrid data architecture.
ETL and ELT tools can help with a variety of data management tasks, often in tandem with other tools and technologies.
Most fundamentally, ETL tools help businesses combine structured and unstructured data retrieved from source systems and land it in a data warehouse. Often the raw data is manipulated into table structures optimized for reporting, allowing otherwise hidden insights to surface through analytics or visualization software. ETL can, for example, combine the name, place, and pricing data used in business operations with incoming transactional data: retail sales, banking deposits and withdrawals, healthcare claims, and more.
ETL also helps businesses migrate data from legacy warehouses to the cloud. And it can be used to consolidate data from two different entities. For example, ETL can combine data as part of a business merger. Or it can join data from business partners or vendors into one target system.
Traditional operational and transactional data only scratches the surface of the data that most companies collect today. The amount of Big Data flowing into companies from the Internet of Things (IoT), social media, video, log mining, and more is staggering. But businesses need this broad scope of data to gain a competitive edge, to understand context, and to make accurate decisions. ETL vendors must constantly add new transformation capabilities and connectors to evolve as data types and data sources also evolve and multiply.
Companies are moving away from traditional data warehouses, structured master data, and traditional ETL tools. For example, many choose to load and transform their structured and unstructured data in Hadoop, increasingly in the cloud. Hadoop is one environment where data engineers can increase the speed and scalability of their ETL processes.
A centralized Hadoop repository scales much more easily than a traditional data warehouse. What’s more, Hadoop is open-source and seen by many as a low-cost alternative to traditional data technology. Some Hadoop-friendly ETL tools include prebuilt libraries of ETL transformations for faster data processing in Hadoop.
Since the dawn of data analytics, we’ve relied on data professionals to integrate and transform data for us, and then give us the business information we need to make decisions. Today, business users need access to data and business intelligence faster, often in real time, and can’t wait for someone else to prepare it. Self-service data preparation enables business users and non-technical data professionals to extract, transform, load, and run business intelligence operations on their own. Decision makers don’t need to wait for data engineers to provision data for data analysts. Analysts who self-serve gain the agility to pull critical data any time they need it, not just on a set schedule. Overall, businesses spend less time on data preparation and more on deriving insights and fostering innovation. IT professionals gain time back as well by enabling the rest of the business to access and analyze their own data.
ETL tools can significantly improve the quality of data. The ability to standardize and automate ETL processes that move data from a source system into the data warehouse means that you’re less likely to have dirty data leading to decisions based on inaccurate information and insights. ETL tools can also integrate with dedicated data quality tools, and often will include features for data quality-related functions such as data mapping and data lineage.
It’s important to know where data comes from, and metadata often holds the key to unlock that data lineage information. In an ETL process, when a tool extracts source data to bring it into a destination system such as a cloud data warehouse or a data lake, the tool also collects metadata that is critical for the target system to have when doing business intelligence activities such as business process modeling, data modeling, and reporting. A developer doesn’t need to create data structures again in the repository, since the ETL tool captures and stores that information. In the ETL process, metadata is stored in a metadata repository, from which it can be manipulated, queried, and retrieved.
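One way to picture a metadata repository is as a catalog table populated alongside the data itself. The sketch below uses sqlite3 and its `PRAGMA table_info` introspection as a stand-in for a tool capturing column names and types during load; the table names and schema are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, placed_at TEXT)")

# A simple metadata repository: one row per column of each loaded table.
conn.execute("""
    CREATE TABLE metadata_repository (
        table_name TEXT, column_name TEXT, column_type TEXT
    )
""")

def capture_metadata(conn, table):
    """Record each column's name and declared type alongside the loaded data."""
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
    for _, name, col_type, *_ in conn.execute(f"PRAGMA table_info({table})").fetchall():
        conn.execute(
            "INSERT INTO metadata_repository VALUES (?, ?, ?)",
            (table, name, col_type),
        )

capture_metadata(conn, "orders")
for row in conn.execute("SELECT * FROM metadata_repository"):
    print(row)
```

Downstream tools can then query `metadata_repository` to rebuild data structures or trace lineage without re-inspecting the source.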
Traditionally, an ETL tool extracts data from one or several Online Transaction Processing (OLTP) databases, also known as “transactional databases.” OLTP applications contain a high volume of transactional data that needs transformation and integration with operational data in order to be useful in data analysis and business intelligence.
That data is generally extracted into a staging area, storage that sits between the data source and the data target. In that staging area, the ETL tool transforms data, cleansing, joining, and otherwise optimizing it for analysis.
The tool then loads data into a Decision Support System (DSS) database, where BI teams can run queries on it and present results and reports to business users to help them make decisions and set strategy.
Often, despite the usefulness of ETL tools, the ETL process is still somewhat messy and complicated, since data itself is messy and complicated. Traditional ETL still requires a fair amount of labor from data professionals in the form of hand-coding, re-processing, and other manual maintenance tasks.
For modern data analytics and ELT, the cloud changed everything forever. While there is still some business intelligence and ELT that occurs in on-premises data warehouses, only the cloud has the combination of speed, scalability, and practicality required for handling enormous amounts of structured and semi-structured data from dozens or hundreds of sources.
Powerful cloud data warehouses such as Amazon Redshift, Snowflake, and Google BigQuery don’t need external resources (like an intermediary ETL server) to perform transformations; those operations are best done within the cloud data warehouse itself. Data can be analyzed from pre-calculated OLAP summaries, which further simplifies and speeds up the ETL process (or, at this point, the ELT process, since the data is loaded and then transformed in the cloud data warehouse).
Transformations and data modeling happen in SQL, a language that BI professionals, data scientists, and analysts all have in common.
ETL can be an incredibly complex process, and there are some inherent challenges. Keep these challenges in mind, and how they may affect your business, as you choose an ETL product. Different products take different approaches, but what you select depends on your business requirements and how you use data.
Scalability is one of the most important features in a modern ETL tool. The amount of data that businesses collect is only going to continue to grow. You may move data in batches now, but will that always hold true for your business? How many jobs can you run? You need to be able to scale ETL processes and capacity, infinitely if possible. Take advantage of the cloud. When it comes to ETL and data, think big – and fast.
Another big ETL challenge is ensuring that the data you transform is accurate and complete. Manual coding and changes or failure to plan and test before running an ETL job can sometimes introduce errors, including loading duplicates, missing data, and other issues. An ETL tool can reduce the need for hand-coding and help cut down on errors. Data accuracy testing can help spot inconsistencies and duplicates, and monitoring features can help identify instances where you are dealing with incompatible data types and other data management issues.
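A data accuracy check of the kind described above can be as simple as scanning a batch for duplicate keys and missing required values before loading. The sketch below is a generic illustration, not any particular tool’s API; the field names and sample records are made up.

```python
def accuracy_report(rows, key_field, required_fields):
    """Flag duplicate keys and missing required values before loading."""
    seen, duplicates, missing = set(), [], []
    for row in rows:
        key = row[key_field]
        if key in seen:
            duplicates.append(key)
        seen.add(key)
        for field in required_fields:
            if not row.get(field):
                missing.append((key, field))
    return {"duplicates": duplicates, "missing": missing}

batch = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": "", "country": "DE"},               # missing email
    {"id": 1, "email": "c@example.com", "country": "FR"},  # duplicate id
]
report = accuracy_report(batch, "id", ["email", "country"])
print(report)
```

An ETL tool would typically run checks like this automatically on every job and surface the findings through monitoring, rather than requiring hand-rolled code.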
Data is growing in volume. But more importantly, it’s growing in complexity. A single enterprise can be handling diverse data from hundreds, or even thousands, of data sources. These can include structured and semi-structured sources, real-time sources, flat files, CSVs, S3 buckets, streaming sources, and whatever comes along next. Some of this data is best transformed in batches, while other data is better served by streaming, continuous transformation. Handling each type of data in the most effective and practical manner can be an enormous challenge.
In general, there are four different types of ETL tools available today. Some are designed to work in an on-premises data environment, some in the cloud, and some purport to do both. Which ones you choose depends largely on where your data is and on your business requirements.
Until very recently, batch processing in on-premises tools was the only practical way to do ETL for a lot of businesses. Historically, processing large volumes of data took a lot of time and resources and could easily tax a company’s compute power and storage during business hours, so it made more sense for enterprises to run that processing in batches during off-hours. While some modern tools support streaming data, most cloud-native and open-source ETL tools still do batch processing; they are just less limited in when they can do it and how quickly.
While originally ETL jobs happened on-premises, data is now moving into the cloud, and so is ETL. Several cloud-native ETL applications have cropped up that can extract and load data from sources directly into a cloud data warehouse. They can then transform data using the power and the scale of the cloud – a critical requirement when you’re dealing with Big Data. These ETL tools can be deployed directly into your cloud infrastructure (like Matillion) or hosted in the cloud as a SaaS.
Open source ETL tools are a low-cost alternative to commercially packaged ETL solutions, and are a practical option for many businesses. While some open source projects only support one aspect of ETL, such as extracting data, others perform more than one function. Some common open source tools include Apache Airflow, Apache Kafka, and Apache NiFi. One downside is that many open source projects are not designed to handle the data complexities that modern enterprises face, and may lack support for complex data transformations and desirable features such as change data capture (CDC). In addition, getting support for open source tools can be more challenging than for commercial tools with dedicated support teams.
Batch processing is fine for some data updates. But more often today, we demand real-time access to data from different sources. If you are collaborating within Google Docs, you don’t want to see edits and feedback a day later. If you are working in finance, waiting even a few hours to see transactions and transfers is unacceptable with today’s time-sensitive requirements. Real-time demand increasingly requires that we process data in real time, with a distributed model and streaming capabilities, rather than in batches. There are a number of streaming ETL tools available, both commercially and via open source. However, just because you can do ETL in real time doesn’t necessarily mean you should: there are use cases where batch processing is still more efficient and simpler for handling large amounts of data.
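The batch-versus-streaming trade-off can be illustrated with a toy example: a batch transform must wait for the full input before producing anything, while a streaming transform (here, a Python generator standing in for a streaming pipeline) emits each result as soon as its record arrives. The doubling "transformation" and the simulated slow source are purely illustrative.

```python
import time

def batch_transform(rows):
    """Batch: collect the full set, then transform it in one pass."""
    return [r * 2 for r in rows]

def streaming_transform(source):
    """Streaming: transform each record as it arrives, yielding results immediately."""
    for r in source:
        yield r * 2

def slow_source():
    for r in (1, 2, 3):
        time.sleep(0.01)  # simulate records trickling in from a live feed
        yield r

# Batch pays the full wait before any result is available...
print(batch_transform(list(slow_source())))
# ...while streaming emits each result as soon as its record arrives.
for out in streaming_transform(slow_source()):
    print(out)
```

For large historical reloads, the batch version is simpler and often more efficient; the streaming version earns its complexity only when consumers genuinely need each record within seconds of its arrival.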
There are several different ways that ETL is most commonly used.
Enterprises have traditionally used ETL to collect data from various sources, transform it into a consistent, analytics-ready format and load it into a data warehouse, where business intelligence teams can analyze it for business purposes.
With the advent of cloud computing, businesses have been migrating data to the cloud in general and, in particular, migrating data to cloud data warehouses to gain faster time to insight. Cloud-native ETL tools use the advantages of the cloud, including speed and scale, to load data directly to the cloud and transform it within the cloud infrastructure, which enables data professionals to save time and money.
While machine learning and AI aren’t totally commonplace yet in business, many enterprises are starting to explore how to incorporate them into analytics and data science. The cloud is the only practical solution for large-scale machine learning and AI operations. In addition, both techniques require large datastores for analytical model building and training, and for automated data analysis. Cloud-based ELT tools (rather than traditional ETL) are essential to both migrating large amounts of data to the cloud and transforming them to be analytics-ready.
Customers interact with businesses today on multiple channels, logging multiple interactions and transactions per day, or even per hour. It can be difficult for marketers to get a view across all of these channels to understand customer needs and behavior. ETL software can be critical to collecting and integrating customer data from eCommerce, social networking, web sites, mobile applications, and other platforms. It can also help integrate other contextual data so marketers can apply hyper-personalization, improve user experience, offer incentives, and more.
One of the fastest growing sources of data for businesses right now is connected devices and systems that are part of the IoT. Whether we are talking about wearable devices or embedded sensors in places, vehicles, or equipment, the IoT is producing astronomical volumes of data, and experts expect that volume to grow at a 28.5 percent compound annual growth rate through 2025. ETL technology, especially cloud-native ELT, will be absolutely essential to integrating and transforming data from IoT sources.
ETL is also critical to database replication, whether you are moving data from source databases as a one-time operation, or as an ongoing process. Often this involves moving data from an on-premises data warehouse to a cloud data warehouse, but as more enterprises move to the cloud, it can mean moving from one cloud infrastructure or cloud service provider to another. This is why it’s important to have an ETL or ELT tool that not only works in the cloud, but is flexible across multiple cloud platforms.
One thing that will never change is that enterprises need to analyze data to provide business intelligence that enables managers and stakeholders to make informed decisions. In order for those decisions to be truly informed, they need to be based on all of an organization’s data, not just as much as a legacy data architecture can handle. The cloud data warehouse is becoming an essential element for data analytics and business intelligence, and thus cloud-native ETL technology is also critical for information management and enabling faster time to insight.
The cloud and big data are no longer the future of ETL. They are the present. Nine out of 10 enterprises say they already have part of their data in the cloud, and nearly all enterprises say they have present or future plans for cloud data migration. The amount of data we collect, whether it’s structured operational data or a firehose of data from the Internet of Things, is starting to outgrow our ability to handle it in traditional, on-premises data warehouses. So what is the future of ETL? Here are a few things we can expect out of the next decade of data transformation and management:
Data won’t just continue to grow; the amount of data we see will mushroom in the next decade. In our IDG MarketPulse survey, respondents reported data expanding at an average of 63 percent per month, with 12 percent of respondents saying their data is growing at 100 percent per month.
The Internet of Things will continue to expand and play an increasing role in business and our lives. Therefore, we will continue to outgrow legacy systems and need to move to the cloud, and we will increasingly require cloud-native tools to help us manage, integrate, and transform that data.
Preparing data for machine learning and artificial intelligence will become a more critical use case for ETL as next-best-action and digital assistant technologies continue to expand.
In the future, data won’t just be for the data professionals. Businesses want – and need – employees to make data-driven decisions. That means centralizing data and employing tools that reduce manual processes and speed time to insight. It also means that different business units will need different kinds of ETL tools. Businesses may use full data transformation capabilities in IT, pipeline tools for business users, and both batch and streaming capabilities, depending on the demand for real-time information. Overall, the more entire organizations can self-serve to gain actionable insights, the greater their competitive edge will be.
In order to stay competitive, a business needs to do three things:
Utilize all of the data it collects, structured and semi-structured, operational, and transactional, to gain the maximum amount of insight for innovation and decision making.
Move to a platform where it can effectively collect and analyze all of this data. Right now, a cloud data warehouse is the most practical solution from the perspective of speed, scale, and cost.
Invest in an ETL tool that can help it extract data and transform it for cloud data analytics as quickly as possible to reduce time to insight.
For today’s enterprises, data, a cloud data warehouse, and the right ETL solution are three equally important legs of a forward-thinking business intelligence strategy. Without any one of these things, it’s going to be very difficult for an enterprise to remain relevant in the next five years.
There are four things an ETL tool needs to do:
It needs to be cloud-native. Tools built for traditional data warehouses don’t translate. They were built for a platform with limitations, and can’t take full advantage of the speed and scalability of the cloud.
It needs to go deep. A cloud-based ETL tool needs to do all of the things that a traditional, full-scale, grown-up enterprise ETL tool does – and then some. It needs to be able to take all kinds of data – structured or semi-structured, in the cloud or on-premises – from all kinds of data sources and join them together.
It needs to be transformative. Just moving data from left to right in a pipeline isn’t the whole job. It isn’t even the most critical part of the job. Organizations need a tool that was specifically created to transform that data and get it analytics-ready.
It needs to be flexible. It’s rare that a business has data and operations in only one cloud. Or that it doesn’t need to move information from one cloud to another. Choose a tool that can move with you. Also, the job of transforming is almost never done. You continually innovate at the transformation layer, as the business changes and as new questions require new answers and insights.
Matillion ETL is data transformation for cloud data warehouses. As a cloud-native solution deployed into your cloud infrastructure, Matillion leverages the speed and scale of the cloud to consolidate large data sets and quickly transform your data to make it analytics ready.
To learn more about Matillion and ELT technology, download our ebook, From ETL to ELT.