- Blog
- 08.26.2024
- Data Fundamentals
Don't get swamped! Deploying advanced SQL to conquer data chaos

There's always functional overlap in every part of a technology stack, and ETL and ELT are no different. Using sophisticated freehand SQL in tools that are primarily focused on reporting, analytics, and visualization can make translating concepts easier.
The majority of ETL-type data transformation and integration tasks work most scalably and cost-effectively when implemented in the database in ELT style. Ultimately, this means translating business logic into SQL statements. So, writing the equivalent piece of SQL directly into a visualization tool works as a kind of ELT virtualization.
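As a minimal sketch of this "ELT virtualization" pattern (table and column names here are invented for illustration, using SQLite as a stand-in warehouse), the business logic is a single query you could paste straight into a BI tool's SQL box instead of materializing it in the database:

```python
import sqlite3

# In-memory stand-in for the warehouse; table/column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount REAL);
    INSERT INTO raw_orders VALUES
        (1, 'emea ', 120.0),   -- trailing space: typical source-data noise
        (2, 'EMEA',  80.0),
        (3, 'amer', 200.0);
""")

# The kind of freehand query one might paste into a visualization tool:
# cleanup (TRIM/UPPER) and aggregation happen in one statement, in place of
# a maintained ELT pipeline that would materialize the cleaned data.
virtualized_elt = """
    SELECT UPPER(TRIM(region)) AS region, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY UPPER(TRIM(region))
    ORDER BY region
"""
print(conn.execute(virtualized_elt).fetchall())
# [('AMER', 200.0), ('EMEA', 200.0)]
```

The query produces the right numbers, but the cleanup logic lives only inside that one report, which is exactly the trade-off the rest of this post examines.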
It certainly works, and many of us have probably seen this done or, like me, done it ourselves on behalf of clients (with some monster SQL queries). But does this seemingly simple and obvious approach actually end up being that straightforward? There are broader considerations and consequences to take into account.
A good place to start is to ask why it's so difficult in the first place. Why don't top-quality analytics just fall into place immediately with a few simple SELECT statements against raw application data? The answer is that the source data is the original data swamp.
Source data is the data swamp
The concept of a Data Lake - a centralized repository to store all of an organization's raw source data for later analysis - at first seemed like a revolutionary solution. The catalyst was the availability of low-cost cloud storage options, and the need was amplified by the proliferation of source systems, mainly SaaS and PaaS applications, themselves hosted by the same cloud providers.
Unfortunately, most of these new Data Lakes quickly turned into disorganized repositories filled with redundant, irrelevant, and low-quality data: the very definition of a data swamp.
Media coverage of high-profile cases, where organizations struggled to extract meaningful insights from their vast data stores, brought the phenomenon into the public consciousness. "Data swamp" has become a cautionary term.
But why does this happen?
How data swamps happen
It's easy to build a Data Lake: just copy all your source data into one place, exactly as-is with no changes.
However, when we look closely at the resulting Data Lake and discover that the data is a swamp, there's only one possible conclusion. Nothing changed about the data on the way in, so the data must have been a swamp in the first place.
Why is source data such a swamp? It comes down to the way systems are designed and architected. The focus is almost exclusively on processes: what is talking to what. System design and architecture tend to neglect the data: what is actually being said.
As a result of process-oriented thinking, all your many independent source systems contribute to your data swamp, with each data source having its own unique structure and quirks. The accumulated data debt has to be repaid at some point. To access its true value, data needs integration, which in turn requires data transformation.
Challenges of deswampifying data
If you’re lucky, none of the above applies and you haven’t encountered the data swamp yet. The information and insights you need are just falling out of a few simple queries made directly against one single source system.
In that case, I would say don't hesitate; go ahead and build those reports!
But bear in mind that in most cases, fundamental insights can only really be obtained by combining data from multiple sources and cross-referencing them. This requires some data integration, which in turn means doing some data transformation.
If you really don't have many data sources to consider, then one of two things is likely true.
- You're lucky: your situation is an outlier compared with most enterprises, which have hundreds of data sources.
- Someone else has already done the ETL on your behalf. Maybe you are already looking at the contents of a data warehouse after ETL processes have been applied. Or maybe some shadow IT is going on to achieve the same result.
Writing an analytic SQL query that spans multiple systems normally presents significant challenges. The main issue is data heterogeneity, in which varying formats, structures, and standards complicate the merging process.
At the same time, inconsistencies, duplicates, and inaccuracies can easily propagate throughout the system if not adequately managed. Plus, scalability, data security and regulatory compliance further muddy the waters.
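To make the heterogeneity problem concrete, here is a toy sketch (source systems, tables, and conventions are all invented; SQLite stands in for the real platforms). A CRM stores mixed-case emails and US-style dates, while billing stores lowercase emails, so the merge only works after normalizing each source's quirks inside the query itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two invented source systems with incompatible conventions.
    CREATE TABLE crm_contacts (email TEXT, signup_date TEXT);   -- 'MM/DD/YYYY'
    CREATE TABLE billing_accounts (email TEXT, plan TEXT);      -- lowercase emails
    INSERT INTO crm_contacts VALUES ('Ana@Example.com', '03/15/2024');
    INSERT INTO billing_accounts VALUES ('ana@example.com', 'pro');
""")

# Lowercase the join key and rewrite the US-style date as ISO-8601,
# all inline, before the two sources can be cross-referenced.
query = """
    SELECT LOWER(c.email) AS email,
           substr(c.signup_date, 7, 4) || '-' ||
           substr(c.signup_date, 1, 2) || '-' ||
           substr(c.signup_date, 4, 2) AS signup_date,
           b.plan
    FROM crm_contacts c
    JOIN billing_accounts b ON LOWER(c.email) = LOWER(b.email)
"""
print(conn.execute(query).fetchall())
# [('ana@example.com', '2024-03-15', 'pro')]
```

With two tables this is manageable; across dozens of real sources, each needing its own normalization rules, this inline approach is how queries balloon to hundreds of lines.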
So, attempting to derive value directly from the source (i.e., directly from the data swamp) is naturally complex and fraught with data quality issues. The term "deswampification" describes the painstaking process of cleaning and normalizing the data and interpreting it correctly.
In my experience it's not long before you are looking at enormously complex pieces of SQL, perhaps hundreds or thousands of lines long.
The problem of independent interpretations
When the data swamp is interpreted inside a BI tool using big chunks of sophisticated SQL, every single report or analysis will require its own individual deswampification process. This quickly creates a maintenance nightmare as every report developer must handle the same data issues independently.
When every BI report operates with its own set of transformations and data quality rules, it becomes increasingly difficult to maintain consistent data interpretations across the enterprise.
This means that while using a BI tool to tackle this challenge may be temporarily effective, it tends to create long-term inconsistencies and high maintenance costs.
Typical symptoms are discrepancies between reports and contradictory insights that hamper the organization's ability to make informed decisions. In the long term, this fragmented approach ensures a perpetually high-maintenance environment.
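As a toy illustration of such discrepancies (table and values invented, SQLite as the stand-in database), two report authors applying their own inline cleanup rules can disagree on a number as basic as "how many UK customers do we have":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_customers (id INTEGER, country TEXT);
    INSERT INTO raw_customers VALUES
        (1, 'UK'), (2, 'uk '), (3, 'United Kingdom'), (4, 'Uk');
""")

# Report A's author handles case and whitespace, but not the long-form name.
report_a = conn.execute("""
    SELECT COUNT(*) FROM raw_customers
    WHERE UPPER(TRIM(country)) = 'UK'
""").fetchone()[0]

# Report B's author maps the long-form name, but forgets case and whitespace.
report_b = conn.execute("""
    SELECT COUNT(*) FROM raw_customers
    WHERE country IN ('UK', 'United Kingdom')
""").fetchone()[0]

print(report_a, report_b)  # 3 2 -- two "UK customer counts" that contradict
```

Neither author is wrong by their own rules; the contradiction exists only because the deswampification logic was duplicated instead of shared.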
Multi-tier data architecture: A sustainable solution
There's no way to avoid the data swamp problem without addressing the root cause, which is process-oriented design and architecture. Deploying a plethora of process-oriented systems means that the underlying data will always need to be "deswampified" at some point. It's a hard thing to do, but it should only have to be done once.
The long-term sustainable solution is to divide and conquer. Adopt a multi-tiered data architecture, such as the medallion architecture, together with a robust ETL process to transform the data through the tiers as it becomes integrated and easier to use. This is the best strategy to ensure consistency, with the lowest ongoing maintenance efforts, enabling the most streamlined and reliable analytical process.
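As a minimal sketch of the tiered idea (names and data invented; real implementations would use a cloud warehouse and a proper ETL tool rather than SQLite views), the key point is that cleanup lives in one maintained definition that every downstream consumer shares:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Bronze: raw landing data, exactly as received from the source.
    CREATE TABLE bronze_orders (order_id INTEGER, region TEXT, amount TEXT);
    INSERT INTO bronze_orders VALUES
        (1, ' emea', '120.50'),
        (1, ' emea', '120.50'),   -- duplicate delivery from the source
        (2, 'AMER',  '99.99');

    -- Silver: deswampification happens ONCE, in one maintained definition.
    CREATE VIEW silver_orders AS
        SELECT DISTINCT order_id,
               UPPER(TRIM(region)) AS region,
               CAST(amount AS REAL) AS amount
        FROM bronze_orders;

    -- Gold: business-ready aggregate that every report consumes unchanged.
    CREATE VIEW gold_revenue_by_region AS
        SELECT region, SUM(amount) AS revenue
        FROM silver_orders
        GROUP BY region;
""")
print(conn.execute(
    "SELECT region, revenue FROM gold_revenue_by_region ORDER BY region"
).fetchall())
# [('AMER', 99.99), ('EMEA', 120.5)]
```

Every BI tool now queries the gold tier with plain SELECTs, so fixing a data quality rule means editing one silver-tier definition rather than hunting through every report.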
In summary, while it may seem feasible to rely solely on hand-coded SQL inside BI tools for analytics directly on the source data, this approach is fraught with challenges. Deswampifying data ad hoc leads to long-term inconsistencies and high maintenance overheads.
Ian Funnell
Data Alchemist
Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell