07.24.2025
ETL Architecture and Design: Essential Steps and Patterns for Modern Data Pipelines
Wikipedia describes a design pattern as "… the re-usable form of a solution to a design problem." You might be thinking "well, that makes complete sense," but what's more likely is that blurb told you nothing at all. The keywords in that sentence are reusable, solution, and design.
ETL architecture and design represent the backbone of successful data engineering, providing reusable solutions to common data pipeline challenges. Whether you're building traditional on-premises systems or modern cloud-native architectures, these proven patterns will make your job easier and your data healthier.
Reuse What Works
Reuse happens organically. We build off previous knowledge, implementations, and failures. One example would be in using variables: the first time we code, we may explicitly target an environment. Later, we may find we need to target a different environment. Making the environment a variable gives us the opportunity to reuse the code that has already been written and tested.
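To make the idea concrete, here is a minimal Python sketch of that variables example. The connection strings, the `ETL_ENV` variable name, and the environment names are all invented for illustration; the point is simply that the target is resolved from a variable rather than hard-coded, so the same tested code runs in every environment.

```python
import os

# Hypothetical per-environment settings; in real life these would come from
# a config store or secrets manager, not a literal dict.
CONNECTIONS = {
    "dev": "db-dev.internal:5432/warehouse",
    "qa": "db-qa.internal:5432/warehouse",
    "prod": "db-prod.internal:5432/warehouse",
}

def connection_string(env=None):
    """Resolve the target from a variable instead of a hard-coded value."""
    env = env or os.environ.get("ETL_ENV", "dev")
    return CONNECTIONS[env]
```

The first version of this code would have had `"db-dev.internal:5432/warehouse"` baked in; parameterizing it is what lets the already-tested extract logic be reused as-is against QA and production.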
In modern ETL architecture, this reusability extends beyond variables to entire infrastructure patterns. With Infrastructure as Code tools like Terraform or CloudFormation, we can template entire ETL architectures for reuse across environments. Container-based ETL designs enable us to package transformation logic that runs consistently from development through production.
Similarly, a design pattern is a foundation, or prescription for a solution that has worked before. The solution solves a problem – in our case, we'll be addressing the need to acquire data, cleanse it, and homogenize it in a repeatable fashion. This requires design; some thought needs to go into it before starting.
Modern ETL design patterns encompass not just the processing steps, but the entire architectural framework: how components communicate, how systems scale, how failures are handled, and how operations are automated.
ETL Architecture Fundamentals: First, Some Housekeeping
With batch processing comes numerous best practices, which I'll address here and there, but only as they pertain to the pattern. Batch processing is by far the most prevalent technique for performing ETL tasks, because it is the fastest, and it is what most modern data applications and appliances are designed to accommodate.
However, modern ETL architecture increasingly blends batch and streaming approaches. While this discussion focuses on batch-oriented processing, today's architectures often include real-time components through tools like Apache Kafka, AWS Kinesis, or change data capture (CDC) systems. Streaming and record-by-record processing, while once considered separate concerns, now complement batch processing in hybrid architectures.
Understanding Your ETL Architecture Context
Now that we've decided we are going to process data in batches, we need to figure out the details of the target warehouse, application, data lake, archive…you get the idea. What is the end system doing? What does it support? How are end users interacting with it? All of these things will impact the final phase of the pattern – publishing.
Modern ETL architecture requires understanding not just the target system, but the entire data ecosystem:
- Cloud vs. on-premise deployment models and their implications
- Data governance and compliance requirements that affect architecture decisions
- Scalability and performance expectations that drive infrastructure choices
- Integration patterns with existing systems and future roadmap considerations
- Cost optimization strategies for cloud-native deployments
How we publish the data will vary and will likely involve a bit of negotiation with stakeholders, so be sure everyone agrees on how you're going to progress. In cloud environments, this often includes discussions about data mesh patterns, API-first designs, and self-service analytics capabilities.
Step 1: Extract and Stage Raw Source Data
I've been building ETL processes for roughly 20 years now, and with ETL or ELT, rule numero uno is copy source data as-is. Don't pre-manipulate it, cleanse it, mask it, convert data types … or anything else. Simply copy the raw data set exactly as it is in the source. Why?
Permissions: The source system is typically not one you control. Your access, features, control, and so on can't be guaranteed from one execution to the next.
Impact: Source systems typically have a different use case than the system you are building. Running excessive steps in the extract process negatively impacts the source system and ultimately its end users.
Locking: Many sources will require you to "lock" a resource while reading it. If you are reading it repeatedly, you are locking it repeatedly, forcing others to wait in line for the data they need.
Debugging/Troubleshooting: Having the raw data at hand in your environment will help you identify and resolve issues faster. Troubleshooting while data is moving is much more difficult.
Auditing: Local raw data gives you a convenient mechanism to audit, test, and validate throughout the entire ETL process.
While it may seem convenient to transform the data as you extract it, in the long run that shortcut will create more work and headaches.
Modern Extract and Staging Patterns
In cloud-native ETL architecture, this principle extends to leverage modern tools and patterns:
Data Lake Storage: Use cost-effective object storage (AWS S3, Azure Data Lake, Google Cloud Storage) to store raw data in its native format, enabling schema-on-read capabilities.
Change Data Capture (CDC): Implement CDC tools like Debezium, AWS DMS, or cloud-native services to capture incremental changes without impacting source systems.
API-First Extraction: Design extraction processes that expose data through well-defined APIs, enabling multiple consumers without repeated source system impact.
Incremental Loading Patterns: Use watermark columns, timestamps, or hash-based change detection to minimize data movement and processing overhead.
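A watermark-based incremental extract can be sketched as follows. SQLite and the `orders` table are stand-ins for a real source, and in practice the watermark would be persisted in a control table between runs rather than held in a variable; the shape of the logic is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2025-07-01T08:00:00"),
    (2, "2025-07-02T09:30:00"),
    (3, "2025-07-03T11:45:00"),
])

# High-water mark recorded at the end of the previous run.
last_watermark = "2025-07-01T23:59:59"

# Pull only rows changed since the last run, instead of the full table.
changed = conn.execute(
    "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
    (last_watermark,),
).fetchall()

# Advance the watermark to the newest timestamp we just extracted.
new_watermark = max(row[1] for row in changed)
```

Each run touches the source once and moves only the delta, which is exactly the "minimize trips to the source" principle from the extract step.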
Step 2: Triage and Validate the Data
Now that you have your data staged, it is time to give it a bath. This is where all of the tasks that filter out or repair bad data occur. "Bad data" is the number one problem we run into when we are building and supporting ETL processes. Taking out the trash up front will make subsequent steps easier. Some rules you might apply at this stage include ensuring that dates are not in the future, or that account numbers don't have alpha characters in them. Whatever your particular rules, the goal of this step is to get the data in optimal form before we do the real transformations.
Sooner is Better
Tackle data quality right at the beginning. Batch processing is often an all-or-nothing proposition – one hyphen out of place or a multi-byte character can cause the whole process to screech to a halt. (Ideally, we want it to fail as fast as possible, that way we can correct it as fast as possible.) As you develop (and support), you'll identify more and more things to correct with the source data – simply add them to the list in this step.
I like to approach this step in one of two ways:
- Add a "bad record" flag and a "bad reason" field to the source table(s) so you can qualify and quantify the bad data and easily exclude those bad records from subsequent processing. You might build a process to do something with this bad data later. NOTE: This method does assume that an incomplete (but clean!) data set is okay for your target.
- Apply corrections using SQL by performing an "insert into .. select from" statement. This keeps all of your cleansing logic in one place, and you are doing the corrections in a single step, which will help with performance. An added bonus is that by inserting into a new table, you can convert to the proper data types simultaneously. You can always break these into multiple steps if the logic gets too complex, but remember that more steps mean more processing time.
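Here is a small sketch of both approaches combined, with SQLite standing in for a real database (the tables, columns, and rules are invented for illustration): bad records are flagged with a reason so they can be qualified and quantified, and then a single insert .. select excludes them while converting data types in the same pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_accounts (acct_no TEXT, balance TEXT, opened TEXT)")
conn.executemany("INSERT INTO raw_accounts VALUES (?, ?, ?)", [
    ("1001", "250.75", "2023-04-01"),
    ("10A2", "99.10",  "2023-05-10"),   # alpha character in account number
    ("1003", "12.00",  "2099-01-01"),   # date in the future
])

# Approach 1: flag bad records in place, with a reason.
conn.execute("ALTER TABLE raw_accounts ADD COLUMN bad_flag INTEGER DEFAULT 0")
conn.execute("ALTER TABLE raw_accounts ADD COLUMN bad_reason TEXT")
conn.execute("""
    UPDATE raw_accounts SET bad_flag = 1, bad_reason = 'alpha in acct_no'
    WHERE acct_no GLOB '*[A-Za-z]*'
""")
conn.execute("""
    UPDATE raw_accounts SET bad_flag = 1, bad_reason = 'future date'
    WHERE bad_flag = 0 AND opened > date('now')
""")

# Approach 2: one insert..select that excludes bad rows and fixes types at once.
conn.execute("CREATE TABLE clean_accounts (acct_no INTEGER, balance REAL, opened TEXT)")
conn.execute("""
    INSERT INTO clean_accounts
    SELECT CAST(acct_no AS INTEGER), CAST(balance AS REAL), opened
    FROM raw_accounts WHERE bad_flag = 0
""")
clean = conn.execute("SELECT acct_no, balance FROM clean_accounts").fetchall()
```

The raw table still holds all three rows, flags and all, so the two excluded records can be audited or repaired later.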
Modern Data Quality Architecture
Contemporary ETL design incorporates automated data quality tools and practices:
Data Quality Frameworks: Tools like Great Expectations, dbt tests, or cloud-native services (AWS Deequ, Google Cloud Data Quality) provide automated validation and monitoring.
Statistical Profiling: Establish baselines for data distributions, patterns, and relationships to automatically detect anomalies and drift.
Data Observability: Implement monitoring systems that track data quality metrics over time, providing alerts when thresholds are exceeded or patterns change.
Quality Gates: Design checkpoints throughout your ETL architecture where data must meet quality standards before proceeding to the next stage.
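A quality gate can be as simple as a set of named checks that must all pass before the batch moves on. This is a hand-rolled sketch, not any particular framework's API; the check names, sample rows, and hard-coded run date are illustrative.

```python
def run_quality_gate(rows, checks):
    """Return the names of failed checks; an empty list means the gate is open."""
    return [name for name, check in checks.items() if not check(rows)]

batch = [
    {"acct_no": "1001", "opened": "2023-04-01"},
    {"acct_no": "1003", "opened": "2024-11-20"},
]

# "2025-07-24" stands in for the run date, which a real gate would compute.
checks = {
    "non_empty": lambda rows: len(rows) > 0,
    "no_future_dates": lambda rows: all(r["opened"] <= "2025-07-24" for r in rows),
    "acct_no_numeric": lambda rows: all(r["acct_no"].isdigit() for r in rows),
}

failures = run_quality_gate(batch, checks)
```

As you discover new ways the source data can go wrong, each one becomes another named check in the dict, which is the "add them to the list in this step" habit described above.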
One exception to executing the cleansing rules: there may be a requirement to fix data in the source system so that other systems can benefit from the change. Again, having the raw data available makes identifying and repairing that data easier.
NOTE: You likely have metadata columns to help with debugging, auditing, and so forth. Populating and managing those fields will vary with your specific needs, but the pattern should remain the same.
Steps 3 Through N: Transformation Architecture
Finally, we get to do some transformation! Transformations can be trivial, and they can also be prohibitively complex. Transformations can do just about anything – even our cleansing step could be considered a transformation. A common task is to apply references to the data, making it usable in a broader context with other subjects. Ultimately, the goal of transformations is to get us closer to our required end state.
Take It Step by Step
I like to apply transformations in phases, just like the data cleansing process. I add keys to the data in one step. I add new, calculated columns in another step. I merge sources and create aggregates in yet another step. Keeping each transformation step logically encapsulated makes debugging much, much easier. And not just for you, but also for the poor soul who is stuck supporting your code, who will certainly appreciate a consistent, thoughtful approach.
Be Consistent
Apply consistent and meaningful naming conventions and add comments where you can – every breadcrumb helps the next person figure out what is going on. And while you're commenting, be sure to answer the "why," not just the "what". We know it's a join, but why did you choose to make it an outer join?
Depending on the number of steps, processing times, preferences, or otherwise, you might choose to combine some transformations, which is fine, but be conscious that you are adding complexity each time you do so. You may or may not choose to persist data into a new stage table at each step. If you do write the data at each step, be sure to give yourself a mechanism to delete (truncate) data from previous steps (but not the raw data) to keep your disk footprint minimal.
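To illustrate the phased approach, here is a Python sketch with one function per phase: keys first, then derived columns, then aggregation. The field names and the 8% tax rate are invented; the takeaway is that each concern lives in its own small, debuggable step, and the steps compose in order.

```python
def add_surrogate_keys(rows):
    """Step 1: assign a surrogate key per distinct customer."""
    keys = {}
    for r in rows:
        keys.setdefault(r["cust_id"], len(keys) + 1)
    return [{**r, "cust_key": keys[r["cust_id"]]} for r in rows]

def add_derived_columns(rows):
    """Step 2: compute new, calculated columns (hypothetical 8% tax)."""
    return [{**r, "amount_with_tax": round(r["amount"] * 1.08, 2)} for r in rows]

def aggregate_by_customer(rows):
    """Step 3: roll up to one total per customer key."""
    totals = {}
    for r in rows:
        totals[r["cust_key"]] = totals.get(r["cust_key"], 0) + r["amount_with_tax"]
    return totals

raw = [{"cust_id": "C1", "amount": 100.0},
       {"cust_id": "C2", "amount": 50.0},
       {"cust_id": "C1", "amount": 25.0}]

result = aggregate_by_customer(add_derived_columns(add_surrogate_keys(raw)))
```

If the aggregates look wrong, you can inspect the output of each step in isolation, which is exactly the debugging benefit of keeping the phases encapsulated.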
Modern Transformation Design Patterns
Contemporary ETL architecture embraces several proven transformation patterns:
Medallion Architecture: Organize transformations into Bronze (raw), Silver (cleaned), and Gold (business-ready) layers, each with specific purposes and data quality standards.
SQL-First Transformations: Tools like dbt, Dataform, or cloud-native SQL engines enable version-controlled, testable transformation logic that runs where your data lives.
Microservices Transformation: Break complex transformations into independent services that can be developed, deployed, and scaled separately.
Event-Driven Transformations: Use message queues and event streams to trigger transformations based on data availability or business events.
Finally! Time to Publish
Remember when I said that it's important to discover/negotiate the requirements by which you'll publish your data? There are a few techniques you can employ to accommodate the rules, and depending on the target, you might even use all of them.
Drop and Reload
This is exactly what it sounds like. If you've taken care to ensure that your shiny new data is in top form and you want to publish it in the fastest way possible, this is your method. You drop or truncate your target, then you insert the new data. However, this has serious consequences if it fails mid-flight. You can alleviate some of the risk by reversing the process: create and load a new target, then rename tables (replacing the old with the new) as a final step.
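Here is what the safer, reversed version looks like as a SQLite sketch (table names invented): the replacement table is fully built and loaded first, so a mid-flight failure leaves the old data untouched, and the renames happen only at the very end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, total REAL)")
conn.execute("INSERT INTO sales VALUES ('east', 100.0)")

# Build and fully load the replacement before touching the live table.
conn.execute("CREATE TABLE sales_new (region TEXT, total REAL)")
conn.executemany("INSERT INTO sales_new VALUES (?, ?)",
                 [("east", 120.0), ("west", 80.0)])

# Swap via renames as the final step; a failure before this point is harmless.
conn.execute("ALTER TABLE sales RENAME TO sales_old")
conn.execute("ALTER TABLE sales_new RENAME TO sales")
conn.execute("DROP TABLE sales_old")

published = conn.execute("SELECT region, total FROM sales ORDER BY region").fetchall()
```

In a real warehouse you would wrap the renames in a transaction where the engine supports it, so readers never see a missing table.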
The "Surgical" Method
Generally best suited to dimensional and aggregate data. Here, during our last transformation step, we identify our "publish action" (insert, update, delete, skip…). From there, we apply those actions accordingly. This is the most unobtrusive way to publish data, but also one of the more complicated ways to go about it.
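A minimal sketch of the idea, with a plain dict standing in for the target table and invented keys: during the last transform step each staged row is tagged with its publish action, and the apply step simply dispatches on that tag.

```python
# Current state of the (hypothetical) target table: key -> value.
target = {"C1": 100.0, "C2": 50.0, "C3": 75.0}

# Output of the final transform step, each row tagged with a publish action.
staged = [
    {"key": "C1", "value": 100.0, "action": "skip"},    # unchanged, leave alone
    {"key": "C2", "value": 60.0,  "action": "update"},
    {"key": "C4", "value": 20.0,  "action": "insert"},
    {"key": "C3", "value": None,  "action": "delete"},
]

# Apply each action surgically; untouched rows are never rewritten.
for row in staged:
    if row["action"] in ("insert", "update"):
        target[row["key"]] = row["value"]
    elif row["action"] == "delete":
        target.pop(row["key"], None)
```

In SQL terms the same dispatch is often done with a MERGE (or separate insert/update/delete statements) keyed on the action column.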
The "Append" Method
This is particularly relevant to aggregations and facts. Your first step should be a delete that removes any rows matching the batch you are about to load. In a perfect world this would always delete zero rows, but hey, nobody's perfect, and we often have to reload data.
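Sketched with SQLite (table and column names invented), the delete-then-insert is what makes the load safe to re-run: the normal case deletes zero rows, and a reload simply replaces the earlier batch instead of duplicating it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_totals (load_date TEXT, total REAL)")
conn.executemany("INSERT INTO daily_totals VALUES (?, ?)",
                 [("2025-07-22", 500.0), ("2025-07-23", 610.0)])

def publish_append(conn, load_date, total):
    """Delete any rows for the batch being loaded, then append."""
    conn.execute("DELETE FROM daily_totals WHERE load_date = ?", (load_date,))
    conn.execute("INSERT INTO daily_totals VALUES (?, ?)", (load_date, total))

publish_append(conn, "2025-07-24", 730.0)   # normal case: the delete removes 0 rows
publish_append(conn, "2025-07-24", 745.0)   # reload: replaces the earlier batch
```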
Blue/Green
This methodology fully publishes into a production environment using one of the aforementioned methods, but the result doesn't become "active" until a "switch" is flipped. The switch can be implemented in numerous ways (schemas, synonyms, connections…), but there are always at least two production environments: one active, and one being prepared behind the scenes that is then activated via the switch.
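One way to implement the switch is a view that consumers always query, repointed once the inactive copy is fully loaded. A SQLite sketch (names invented); note the flip here is two statements, so a real implementation would wrap it in a transaction or use a synonym where the engine provides one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two physical copies; consumers only ever query the `report` view (the switch).
conn.execute("CREATE TABLE report_blue (metric TEXT, value REAL)")
conn.execute("CREATE TABLE report_green (metric TEXT, value REAL)")
conn.execute("INSERT INTO report_blue VALUES ('revenue', 100.0)")
conn.execute("CREATE VIEW report AS SELECT * FROM report_blue")

# Prepare green behind the scenes, then flip the switch.
conn.execute("INSERT INTO report_green VALUES ('revenue', 110.0)")
conn.execute("DROP VIEW report")
conn.execute("CREATE VIEW report AS SELECT * FROM report_green")

active = conn.execute("SELECT value FROM report WHERE metric = 'revenue'").fetchone()[0]
```

On the next cycle, blue becomes the staging copy and the view flips back, so you always have a fully published environment to fall back to.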
Modern Publishing Patterns
Contemporary ETL architecture includes additional publishing strategies:
Streaming Publication: Use change streams, event sourcing, or message queues to publish data changes in real-time as they occur.
API-First Publication: Expose data through well-designed APIs rather than direct database access, enabling better governance and access control.
Data Product Publishing: In data mesh architectures, publish data as discoverable, self-describing products with clear ownership and SLAs.
Multi-Channel Publishing: Simultaneously publish to multiple targets (warehouse, operational systems, analytics platforms) based on consumer needs.
Another best practice around publishing is to have the data prepared (transformed) exactly how it is going to be in its end state. I call this the "final" stage. Just like you don't want to mess with raw data before extracting, you don't want to transform (or cleanse!) while publishing.
ETL Architecture Patterns for Modern Environments
Traditional ETL Architecture
The classic pattern remains relevant for many use cases:
- Sequential processing through dedicated servers
- Strong data validation before loading
- Optimized for structured data with known schemas
- Ideal for complex transformations requiring significant processing power
Modern ELT Architecture
Leverages the computational power of modern data warehouses:
- Parallel loading of raw data into target systems
- Transformation logic executed within cloud data warehouses
- SQL-first development with tools like dbt
- Cost-effective processing using elastic warehouse compute
Hybrid ETL/ELT Architecture
Combines both approaches for optimal efficiency:
- Selective pre-processing for performance optimization
- Raw data preservation with computed transformations
- Flexible processing patterns based on data characteristics
- Multiple processing engines coordinated through unified orchestration
Serverless ETL Architecture
Eliminates infrastructure management overhead:
- Function-as-a-Service for event-driven processing
- Automatic scaling from zero to massive parallel processing
- Pay-per-execution cost optimization
- Built-in fault tolerance and retry mechanisms
DataOps and Operational Excellence
Modern ETL architecture embraces DataOps practices for reliable operations:
Version Control: Treat ETL pipelines as code with proper branching, testing, and release management.
Automated Testing: Implement unit tests for transformation logic, integration tests for end-to-end workflows, and data quality tests for outputs.
Continuous Integration/Deployment: Automate validation and deployment processes with proper staging and production promotion workflows.
Monitoring and Observability: Track pipeline performance, data quality metrics, and business KPIs with comprehensive dashboards and alerting.
Documentation as Code: Maintain data catalogs, lineage information, and business context alongside pipeline code.
From ETL to Modern Data Architecture
For years I have applied this pattern in traditional on-premises environments as well as modern, cloud-oriented environments. It mostly seems like common sense, but the pattern provides explicit structure, while being flexible enough to accommodate business needs.
Perhaps someday we can get past the semantics of ETL/ELT by calling it ETP, where the "P" is Publish. But regardless of terminology, the fundamental principles remain:
Being smarter about the "Extract" step by minimizing the trips to the source system will instantly make your process faster and more durable. Organizing your transformations into small, logical steps will make your code extensible, easier to understand, and easier to support. It might even help with reuse as well. And having an explicit publishing step will lend you more control and force you to consider the production impact up front.
Modern ETL architecture extends these principles with cloud-native capabilities, real-time processing options, and operational excellence practices. Whether you're implementing traditional patterns or cutting-edge architectures, these foundational steps will make your job easier and your data healthier, while also creating a framework that yields better insights for the business more quickly and with greater accuracy.
Key Takeaways for ETL Architecture and Design
- Preserve the fundamentals: The four-step pattern (Extract, Triage, Transform, Publish) remains the foundation of successful ETL design
- Embrace modern tools: Leverage cloud-native services, SQL-first frameworks, and automated quality tools while maintaining proven principles
- Design for operations: Implement comprehensive monitoring, testing, and deployment automation from the start
- Plan for evolution: Build architectures that can adapt to changing requirements and incorporate new technologies
- Focus on business value: Every architectural decision should ultimately serve the goal of delivering reliable, high-quality data products
Whether you're building your first ETL pipeline or architecting enterprise-scale data platforms, these proven patterns provide the foundation for success.
ETL Architecture and Design FAQs
What is ETL architecture?
ETL architecture refers to the design and structure of how data is extracted from source systems, transformed into a usable format, and loaded into a target destination like a data warehouse or lake. A well-designed ETL architecture enables efficient data processing, ensures data quality, and supports scalability across batch, real-time, and cloud-native workflows.
What are the key components of an ETL architecture?
The key components of an ETL architecture typically include:
- Extract: Connectors to source systems (databases, APIs, files).
- Transform: Data cleaning, enrichment, formatting, and business rule application.
- Load: Writing to a target destination such as a cloud data warehouse.
Modern architectures may also include orchestration tools, monitoring dashboards, and metadata management layers.
What is the difference between ETL and ELT?
In ETL, data is transformed before it's loaded into the destination system. In ELT, raw data is first loaded into a cloud warehouse or data lake, and then transformed using in-warehouse compute power. ELT is ideal for modern cloud-native stacks due to better scalability, faster performance, and reduced data movement.
How does ETL architecture support data governance?
ETL architecture supports data governance through metadata management, audit logging, role-based access controls (RBAC), data lineage tracking, and automated quality checks. These features help ensure compliance with regulations like GDPR, HIPAA, and CCPA by enabling transparency and control over data flows.
What are some best practices for designing ETL pipelines?
Best practices include:
- Using modular, reusable pipeline components.
- Implementing Change Data Capture (CDC) for incremental loads.
- Leveraging parallel processing and partitioning.
- Automating error handling and retries.
- Monitoring with observability tools (e.g., logs, metrics, alerts).
- Applying version control and CI/CD for pipeline code.
Why use a modular ETL architecture?
Modular architecture breaks down ETL pipelines into smaller, reusable components (e.g., extractors, transformers, loaders). This improves fault isolation, simplifies debugging, supports testing and CI/CD workflows, and enables easier scaling across data teams.
