Business intelligence and analytics have seen a bit of a revolution in recent years. With cloud data platforms like Snowflake and Databricks starting a new arms race in the storage and compute space, ETL processing has shifted in favor of ELT-based workloads.
At the risk of oversimplifying, the result has been much the same as it ever was, only with more data and faster processing. Companies produce dashboards that tell them how well they are performing in key areas of their business, analysts pore over data sets looking for ways to improve marketing and sales programs, and decision-makers can see trends or segments emerge within their data. Until recently.
A paradigm shift in ETL
As a Product Manager at Matillion, I’m often talking to data teams at the forefront of the industry, and it’s fascinating to hear about the problems these teams are solving. I have heard the same problems often enough to be confident that a fundamental paradigm shift is occurring in the way we think about ETL/ELT, or data pipelines more generally.
One such problem that can be better solved through data is lead scoring. I’ve always been interested in how sales teams pick the prospects to go after. I’m not giving away secrets by saying that sales sometimes complain about the quality of leads. It’s a well-worn path. You’ll hear salespeople claim that the leads are not qualified and that they can’t afford to waste their time on the wrong target opportunities, which leads us to ask: “Can we use our data better to find the best opportunities and have sales focus just on those?”
One recent example came from a Matillion customer who provides financial services software. They use Matillion ETL to pull web traffic data into Amazon Redshift and then prepare that data to be consumed by a machine learning algorithm running on Amazon SageMaker, which produces a lead score. Matillion ETL then syncs this score into Salesforce so the sales team can prioritize accounts to focus on. Why Salesforce and not a business intelligence dashboard? Because Salesforce is where the sales team lives every day. It’s the right data. Right place. Right time.
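To make the shape of that pipeline concrete, here is a minimal sketch in Python. The scoring weights stand in for the SageMaker model, and the field name Lead_Score__c and the record layout are hypothetical, not the customer’s actual schema:

```python
# Illustrative stand-in for the pipeline above: web-traffic features in,
# a lead score out, records shaped for a Salesforce sync.
# Weights and field names are invented for the example.

def score_lead(page_views: int, demo_requests: int, minutes_on_site: float) -> int:
    """Toy stand-in for the SageMaker model: a weighted sum capped at 100."""
    raw = 2 * page_views + 25 * demo_requests + 1.5 * minutes_on_site
    return min(100, round(raw))

def to_salesforce_record(account_id: str, score: int) -> dict:
    """Shape a record for an upsert into a (hypothetical) Lead_Score__c field."""
    return {"Id": account_id, "Lead_Score__c": score}

traffic = [
    {"account": "001A", "page_views": 12, "demo_requests": 1, "minutes_on_site": 9.5},
    {"account": "001B", "page_views": 3, "demo_requests": 0, "minutes_on_site": 1.0},
]

records = [
    to_salesforce_record(
        row["account"],
        score_lead(row["page_views"], row["demo_requests"], row["minutes_on_site"]),
    )
    for row in traffic
]
print(records)
```

In the real pipeline the scoring happens in SageMaker and the final upsert is handled by Matillion ETL; the point is simply that the enriched value lands back in the operational system, not in a dashboard.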
Evolutions in cloud data platforms
Not so long ago, I used to be the person on-site with customers installing and configuring their data architecture, including discussing the reasons why they needed a data lake. This was back in the day when the Big Data hype was at its peak and Hadoop/MapReduce was becoming the framework for customers who were building a modern data platform.
Things have moved on since then. The data lake has moved from on-prem to cloud, Spark is the standard for big data processing, and, well, cloud data warehouses are conquering the world. Inherent issues with data lakes are now being addressed with lakehouse technology.
What has emerged is a new breed of cloud platform that separates storage and compute to enable a much more agile approach to data analytics. We are seeing the lakehouse/warehouse/sharehouse combo across all major vendors, with the vision going well beyond just a cloud data warehouse. Snowflake describes its Cloud Data Platform as a platform “for all your data and all your essential workloads, with boundless and seamless data collaboration”: an essential element of a worldwide Data Cloud.
With evolution comes new challenges
Every time there are these types of shifts in technology and what’s possible, we see new challenges and opportunities present themselves.
Challenge #1: Data sync
With Customer 360, Master Data Management, and Machine Learning all being performed in modern data platforms like Snowflake and Databricks, the challenge now is, “How can we operationalize these insights?” Businesses now have data sets that tell them how likely a customer is to churn, what the most appropriate nurture flows are, and how to best handle customers based on their sentiment. The problem is, end-users and automated systems don’t have access to the right data at the right time. Enter Reverse ETL, a term that’s emerged as businesses try to solve this problem of getting that enriched data back into the operational systems so it can be leveraged and its value realized.
Cloud-based ETL or ELT is traditionally thought of as loading data into a cloud data warehouse and transforming it for analytics. But in 2019, we began to see a lot of customers looking for ‘data sync’ or output capabilities that take the curated data within a data warehouse and output it either to an operational database such as SQL Server or to a SaaS application such as Salesforce. These outward-flowing pipelines are often event-based, triggered when an upstream transformation completes or a new record is loaded, rather than running on a fixed schedule.
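The event-based pattern can be sketched in a few lines. This is a minimal, generic publish/subscribe model, not any particular product’s API; the event name and handler are invented for illustration:

```python
# Minimal sketch of an event-based data sync, as opposed to a fixed
# schedule: each newly loaded record fires the outward pipeline.
# The event name and handler below are hypothetical.

from typing import Callable

class EventBus:
    """Tiny in-process pub/sub: handlers subscribe to named events."""
    def __init__(self):
        self._handlers: dict[str, list[Callable]] = {}

    def subscribe(self, event: str, handler: Callable) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def publish(self, event: str, payload: dict) -> None:
        for handler in self._handlers.get(event, []):
            handler(payload)

synced = []

def sync_to_operational_system(record: dict) -> None:
    # In a real pipeline this would call the Salesforce or SQL Server API.
    synced.append(record)

bus = EventBus()
bus.subscribe("record_loaded", sync_to_operational_system)

# A load completing publishes the event; the sync runs immediately.
bus.publish("record_loaded", {"id": 1, "churn_risk": 0.82})
print(synced)
```

The design choice being illustrated is that the sync is driven by the data arriving, so the operational system never waits for the next scheduled run.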
Technically speaking, a point-to-point integration is not a hard problem to solve. Sure, APIs can be brittle, and integrations need ongoing maintenance. But API standards have improved a lot in recent years, so much so that Matillion can provide a single connector that will work for almost any REST API. As a result, we’re seeing the data loading/pipeline space become increasingly commoditized. New products tackling the unload side of the pipeline are emerging at a fast pace, too.
This is all great news. It helps businesses solve the problem of data syncing and realize even more value from their data. However, because this problem has not been adequately solved by some of the incumbent solutions, a swath of new products has emerged to disrupt the status quo. That’s led to many data teams creating a cobbled-together pipeline that includes 3, 4, even 5 different products. There’s no guarantee of interoperability between them and the interfaces could change at any moment, which makes this approach inherently brittle. This brings me to the next challenge, which is trying to orchestrate all of this, an endeavor that often means bringing in yet another tool.
Challenge #2: Orchestration
Say you have all of these different products in a single architecture, all reading, transforming, or writing data to many different systems. How feasible is it to have a single view of your entire data estate and assess its quality, security, and lineage? How much more difficult is it to build in observability when each stage of the process is another product with varying levels of API integration (and some with none at all)? Unlike the point-to-point integration, this is not an easy problem to solve.
You can only really begin to assess the fitness of your entire data estate when you have a full view of the end-to-end process. You need the right error reporting and observability built in; when something goes wrong, you need to be able to track down the cause. Products like Matillion ETL provide that full end-to-end capability. To me, this is the definition of a data fabric: seamless integration across different data silos under a single unified experience, in contrast to the patchwork quilt architecture.
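The kind of observability described above can be illustrated with a toy orchestrator: every step runs under one supervisor that records status and timing, so a failure is traceable to the step that caused it. The step names and the failure below are made up for the example:

```python
# Sketch of single-orchestrator observability: one runner records the
# status of every step, so a failure points straight at its cause.

import time

def run_pipeline(steps):
    """Run named steps in order; record per-step status; stop on failure."""
    results = []
    for name, fn in steps:
        start = time.time()
        try:
            fn()
            results.append({"step": name, "status": "ok",
                            "seconds": round(time.time() - start, 3)})
        except Exception as exc:
            results.append({"step": name, "status": "error", "cause": str(exc)})
            break  # downstream steps depend on this one, so stop here
    return results

def extract(): pass
def transform(): raise ValueError("bad column type")  # simulated failure
def load(): pass

report = run_pipeline([("extract", extract), ("transform", transform), ("load", load)])
for row in report:
    print(row)
```

Contrast this with the patchwork architecture: if extract, transform, and load each live in a different product, no single report like this exists, and tracing the failure means stitching together logs from several tools.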
The path to a true data fabric
With the evolution of cloud data platforms and products designed to contribute to a true data fabric, we are in a better position than ever to have useful data that drives real value through better decisions. When you’re on a mission to make data useful, just be careful not to create more problems than you solve.