Building an ETL Design Pattern: The Essential Steps
Wikipedia describes a design pattern as being “… the re-usable form of a solution to a design problem.” You might be thinking “well that makes complete sense”, but what’s more likely is that blurb told you nothing at all. The keywords in the sentence above are reusable, solution and design.
Reuse what works
Reuse happens organically. We build off previous knowledge, implementations, and failures. One example would be in using variables: the first time we code, we may explicitly target an environment. Later, we may find we need to target a different environment. Making the environment a variable gives us the opportunity to reuse the code that has already been written and tested.
Similarly, a design pattern is a foundation, or prescription, for a solution that has worked before. The solution solves a problem – in our case, we’ll be addressing the need to acquire data, cleanse it, and homogenize it in a repeatable fashion. This requires design; some thought needs to go into it before starting.
Creating an ETL design pattern: First, some housekeeping
With batch processing come numerous best practices, which I’ll address here and there, but only as they pertain to the pattern. Batch processing is by far the most prevalent technique for performing ETL tasks, because it is typically the fastest way to move large volumes of data, and it is what most modern data applications and appliances are designed to accommodate. This entire post is about batch-oriented processing. Streaming and record-by-record processing, while viable methods of processing data, are out of scope for this discussion.
Let’s get this train rollin’
Now that we’ve decided we are going to process data in batches, we need to figure out the details of the target warehouse, application, data lake, archive…you get the idea. What is the end system doing? What does it support? How are end users interacting with it? All of these things will impact the final phase of the pattern – publishing. How we publish the data will vary and will likely involve a bit of negotiation with stakeholders, so be sure everyone agrees on how you’re going to progress.
Step 1: Copy raw source data
I’ve been building ETL processes for roughly 20 years now, and with ETL or ELT, rule numero uno is copy source data as-is. Don’t pre-manipulate it, cleanse it, mask it, convert data types … or anything else. Simply copy the raw data set exactly as it is in the source. Why?
- The source system is typically not one you control. Your access, features, control, and so on can’t be guaranteed from one execution to the next.
- Source systems typically have a different use case than the system you are building. Running excessive steps in the extract process negatively impacts the source system and ultimately its end users.
- Many sources will require you to “lock” a resource while reading it. If you are reading it repeatedly, you are locking it repeatedly, forcing others to wait in line for the data they need.
- Having the raw data at hand in your environment will help you identify and resolve issues faster. Troubleshooting while data is moving is much more difficult.
- Local raw data gives you a convenient mechanism to audit, test, and validate throughout the entire ETL process.
While it may seem convenient to start with transformation, in the long run, it will create more work and headaches.
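As a rough illustration of copying data as-is, here is a minimal Python sketch that lands a source extract in a raw staging table with every column stored as text. The table name `raw_transactions`, the column names, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for whatever staging area you actually use.

```python
import csv
import io
import sqlite3


def stage_raw_rows(conn, table, header, rows):
    """Copy rows into a raw staging table with every column stored as TEXT."""
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()


# Hypothetical source extract. The future date, the alpha account number, and
# the non-numeric amount are copied untouched, exactly as the source sent them.
source_extract = io.StringIO(
    "account_no,txn_date,amount\n"
    "1001,2023-01-15,25.00\n"
    "10A2,2999-12-31,abc\n"
)

reader = csv.reader(source_extract)
header = next(reader)
conn = sqlite3.connect(":memory:")
stage_raw_rows(conn, "raw_transactions", header, list(reader))

raw_rows = conn.execute("SELECT * FROM raw_transactions").fetchall()
print(raw_rows)
```

Storing everything as text is one way to avoid accidental type conversion during the copy; the bad values are still there to inspect when the triage step runs.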
Step 2: Triage the data
Now that you have your data staged, it is time to give it a bath. This is where all of the tasks that filter out or repair bad data occur. “Bad data” is the number one problem we run into when we are building and supporting ETL processes. Taking out the trash up front will make subsequent steps easier. Some rules you might apply at this stage include ensuring that dates are not in the future, or that account numbers don’t have alpha characters in them. Whatever your particular rules, the goal of this step is to get the data in optimal form before we do the real transformations.
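The two example rules above could be sketched in Python roughly as follows. The field names `txn_date` and `account_no`, the ISO date format, and the function name `triage_record` are assumptions made for illustration; your own rules will differ.

```python
from datetime import date


def triage_record(record, today):
    """Return a list of reasons the record is bad; an empty list means clean."""
    reasons = []
    try:
        txn_date = date.fromisoformat(record["txn_date"])
        if txn_date > today:
            reasons.append("date is in the future")
    except ValueError:
        reasons.append("date is not a valid ISO date")
    if not record["account_no"].isdigit():
        reasons.append("account number contains non-digit characters")
    return reasons


today = date(2024, 1, 1)  # fixed here so the example is repeatable
good = {"account_no": "1001", "txn_date": "2023-01-15"}
bad = {"account_no": "10A2", "txn_date": "2999-12-31"}

print(triage_record(good, today))  # []
print(triage_record(bad, today))
```

Returning the reasons, rather than a simple pass/fail, makes it easier to quantify which rules are catching the most records.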
Sooner is better
Tackle data quality right at the beginning. Batch processing is often an all-or-nothing proposition – one hyphen out of place or a multi-byte character can cause the whole process to screech to a halt. (Ideally, we want it to fail as fast as possible, that way we can correct it as fast as possible.) As you develop (and support), you’ll identify more and more things to correct with the source data – simply add them to the list in this step.
I like to approach this step in one of two ways:
- Add a “bad record” flag and a “bad reason” field to the source table(s) so you can qualify and quantify the bad data and easily exclude those bad records from subsequent processing. You might build a process to do something with this bad data later. NOTE: This method does assume that an incomplete (but clean!) data set is okay for your target.
- Apply corrections using SQL by performing an “insert into .. select from” statement. This keeps all of your cleansing logic in one place, and you are doing the corrections in a single step, which will help with performance. An added bonus is that by inserting into a new table, you can convert to the proper data types at the same time. You can always break these into multiple steps if the logic gets too complex, but remember that more steps mean more processing time.
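Here is a small sketch of both approaches, run against an in-memory SQLite database from Python. The table and column names (`raw_transactions`, `bad_record`, `bad_reason`, `clean_transactions`) and the fixed cutoff date are hypothetical. For brevity the example chains the two approaches, so the “insert into .. select from” skips the flagged rows; in practice you might use only one of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_transactions (
        account_no TEXT, txn_date TEXT, amount TEXT,
        bad_record INTEGER DEFAULT 0, bad_reason TEXT
    );
    INSERT INTO raw_transactions (account_no, txn_date, amount) VALUES
        ('1001', '2023-01-15', ' 25.00 '),
        ('10A2', '2023-02-01', '10.00'),
        ('1003', '2999-12-31', '7.50');
""")

# Approach 1: flag bad records and record the reason, leaving the data in place.
conn.execute("""
    UPDATE raw_transactions
    SET bad_record = 1, bad_reason = 'account number has non-digit characters'
    WHERE account_no GLOB '*[^0-9]*'
""")
# ISO dates stored as text compare correctly as strings; the cutoff is fixed
# here only to keep the example repeatable.
conn.execute("""
    UPDATE raw_transactions
    SET bad_record = 1, bad_reason = 'date is in the future'
    WHERE bad_record = 0 AND txn_date > '2024-01-01'
""")

# Approach 2: cleanse and convert data types in one "insert into .. select from".
conn.execute(
    "CREATE TABLE clean_transactions (account_no INTEGER, txn_date TEXT, amount REAL)"
)
conn.execute("""
    INSERT INTO clean_transactions (account_no, txn_date, amount)
    SELECT CAST(account_no AS INTEGER), txn_date, CAST(TRIM(amount) AS REAL)
    FROM raw_transactions
    WHERE bad_record = 0
""")
conn.commit()

flagged = conn.execute(
    "SELECT account_no, bad_reason FROM raw_transactions "
    "WHERE bad_record = 1 ORDER BY account_no"
).fetchall()
clean = conn.execute("SELECT * FROM clean_transactions").fetchall()
print(flagged)
print(clean)
```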
One exception to executing the cleansing rules: there may be a requirement to fix data in the source system so that other systems can benefit from the change. Again, having the raw data available makes identifying and repairing that data easier.
NOTE: You likely have metadata columns to help with debugging, auditing, and so forth. How you populate and manage those fields will vary with your specific needs, but the pattern should remain the same.
Steps 3 through n: Transformation
Finally, we get to do some transformation! Transformations can be trivial, and they can also be prohibitively complex. Transformations can do just about anything – even our cleansing step could be considered a transformation. A common task is to apply references to the data, making it usable in a broader context with other subjects. Ultimately, the goal of transformations is to get us closer to our required end state.
Take it step by step
I like to apply transformations in phases, just like the data cleansing process. I add keys to the data in one step. I add new, calculated columns in another step. I merge sources and create aggregates in yet another step. Keeping each transformation step logically encapsulated makes debugging much, much easier. And not just for you, but also for the poor soul who is stuck supporting your code, who will certainly appreciate a consistent, thoughtful approach.
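To make the phased approach concrete, here is a minimal sketch in plain Python where each phase is its own function. The function names, the sample rows, and the `customer_keys` lookup are all hypothetical; the point is only that each step can be run and inspected on its own.

```python
def add_keys(rows, customer_keys):
    """Step 1: attach a surrogate key looked up from reference data."""
    return [{**row, "customer_key": customer_keys[row["customer"]]} for row in rows]


def add_calculated_columns(rows):
    """Step 2: derive new columns from existing ones."""
    return [{**row, "total": row["quantity"] * row["unit_price"]} for row in rows]


def aggregate_by_customer(rows):
    """Step 3: roll the detail rows up to one total per customer key."""
    totals = {}
    for row in rows:
        key = row["customer_key"]
        totals[key] = totals.get(key, 0) + row["total"]
    return totals


# Hypothetical cleansed input and reference data.
cleansed = [
    {"customer": "acme", "quantity": 2, "unit_price": 10},
    {"customer": "acme", "quantity": 1, "unit_price": 5},
    {"customer": "globex", "quantity": 3, "unit_price": 4},
]
customer_keys = {"acme": 1, "globex": 2}

keyed = add_keys(cleansed, customer_keys)
calculated = add_calculated_columns(keyed)
totals = aggregate_by_customer(calculated)
print(totals)  # {1: 25, 2: 12}
```

Because each intermediate result (`keyed`, `calculated`) is kept, a bad total can be traced back to the step that introduced it.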
Apply consistent and meaningful naming conventions and add comments where you can – every breadcrumb helps the next person figure out what is going on. And while you’re commenting, be sure to answer the “why,” not just the “what”. We know it’s a join, but why did you choose to make it an outer join?
Depending on the number of steps, processing times, preferences or otherwise, you might choose to combine some transformations, which is fine, but be aware that you are adding complexity each time you do so. You may or may not choose to persist data into a new stage table at each step. If you do write the data at each step, be sure to give yourself a mechanism to delete (truncate) data from previous steps (but not the raw data) to keep your disk footprint minimal.
Finally! Time to Publish
Remember when I said that it’s important to discover/negotiate the requirements by which you’ll publish your data? There are a few techniques you can employ to accommodate the rules, and depending on the target, you might even use all of them.
Drop and reload
This is exactly what it sounds like. If you’ve taken care to ensure that your shiny new data is in top form and you want to publish it in the fastest way possible, this is your method. You drop or truncate your target, then you insert the new data. However, this has serious consequences if it fails mid-flight: the target is left empty or only partially loaded. You can alleviate some of the risk by reversing the process: create and load a new target, then rename the tables (replacing the old with the new) as a final step.
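The safer “load a new target, then rename” variant might look something like this sketch, again using SQLite from Python. The table names `sales`, `sales_new`, and `sales_old` are hypothetical, and whether table renames can be wrapped in a transaction depends on your database; SQLite happens to allow it.

```python
import sqlite3

# isolation_level=None lets us manage the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('north', 100.0)")

new_rows = [("north", 120.0), ("south", 80.0)]

# Build and load the replacement while the current target stays untouched.
conn.execute("CREATE TABLE sales_new (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_new VALUES (?, ?)", new_rows)

# Swap the tables as the final step, inside one transaction.
conn.execute("BEGIN")
try:
    conn.execute("ALTER TABLE sales RENAME TO sales_old")
    conn.execute("ALTER TABLE sales_new RENAME TO sales")
    conn.execute("DROP TABLE sales_old")
    conn.execute("COMMIT")
except sqlite3.Error:
    conn.execute("ROLLBACK")
    raise

published = conn.execute("SELECT * FROM sales ORDER BY region").fetchall()
print(published)
```

If the load into `sales_new` fails, the original `sales` table has not been touched, which is the whole point of reversing the order.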
The “surgical” method
Generally best suited to dimensional and aggregate data. Here, during our last transformation step, we identify our “publish action” (insert, update, delete, skip…). From there, we apply those actions accordingly. This is the most unobtrusive way to publish data, but also one of the more complicated ways to go about it.
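One way to picture the “publish action” idea is the following plain-Python sketch, which compares staged rows to the target by key and then applies the resulting actions. The function names and sample data are hypothetical, and it assumes the staged data is a full snapshot, so a key missing from the staged set means “delete”. With incremental feeds you would need a different rule for deletes.

```python
def plan_publish_actions(staged, target):
    """Compare staged rows to the target (both keyed by business key)."""
    actions = {}
    for key, row in staged.items():
        if key not in target:
            actions[key] = "insert"
        elif target[key] != row:
            actions[key] = "update"
        else:
            actions[key] = "skip"
    for key in target:
        if key not in staged:
            actions[key] = "delete"
    return actions


def apply_publish_actions(staged, target, actions):
    """Apply each planned action to the target in place."""
    for key, action in actions.items():
        if action in ("insert", "update"):
            target[key] = staged[key]
        elif action == "delete":
            del target[key]


target = {
    1: {"name": "Acme", "tier": "gold"},
    2: {"name": "Globex", "tier": "silver"},
    3: {"name": "Initech", "tier": "bronze"},
}
staged = {
    1: {"name": "Acme", "tier": "gold"},
    2: {"name": "Globex", "tier": "gold"},
    4: {"name": "Umbrella", "tier": "silver"},
}

actions = plan_publish_actions(staged, target)
apply_publish_actions(staged, target, actions)
print(actions)
```

Planning the actions before applying them also gives you something to log and audit: you know how many inserts, updates, deletes, and skips a run will perform before it touches the target.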
The “append” method
This is particularly relevant to aggregations and facts. Your first step should be a delete that removes data you are going to load. In a perfect world this would always delete zero rows, but hey, nobody’s perfect and we often have to reload data.
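A minimal sketch of the delete-then-insert idea, assuming a hypothetical `daily_sales` fact table keyed by a `sales_date` batch column. The function returns the number of rows the delete removed, so you can see when a run was actually a reload.

```python
import sqlite3


def publish_batch(conn, batch_date, rows):
    """Delete anything already loaded for batch_date, then insert the new rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        deleted = conn.execute(
            "DELETE FROM daily_sales WHERE sales_date = ?", (batch_date,)
        ).rowcount
        conn.executemany(
            "INSERT INTO daily_sales (sales_date, region, amount) VALUES (?, ?, ?)",
            [(batch_date, region, amount) for region, amount in rows],
        )
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sales_date TEXT, region TEXT, amount REAL)")

first_run = publish_batch(conn, "2024-01-01", [("north", 100.0), ("south", 80.0)])
reload_run = publish_batch(conn, "2024-01-01", [("north", 110.0), ("south", 80.0)])
row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]

print(first_run, reload_run, row_count)  # 0 2 2
```

The first run deletes zero rows, as it would in a perfect world; the reload deletes the two rows it is about to replace, so the table never ends up with duplicates.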
A fourth approach, commonly known as blue/green deployment, fully publishes into a production environment using the aforementioned methods, but the data doesn’t become “active” until a “switch” is flipped. The switch can be implemented in numerous ways (schemas, synonyms, connection…), but there are always at least two production environments: one active, and one being prepared behind the scenes that is then published by flipping the switch.
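As one possible implementation of the switch, the sketch below uses a view as a stand-in for a synonym: consumers always query `sales`, and the view is repointed from one physical table to the other. The names `sales_blue`, `sales_green`, and `flip_switch` are hypothetical, and the table name passed to `flip_switch` is assumed to come from trusted code, not user input.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE sales_blue (region TEXT, amount REAL);
    CREATE TABLE sales_green (region TEXT, amount REAL);
    INSERT INTO sales_blue VALUES ('north', 100.0);
    CREATE VIEW sales AS SELECT * FROM sales_blue;
""")


def flip_switch(conn, new_active_table):
    """Repoint the view that consumers read from, in a single transaction."""
    conn.execute("BEGIN")
    try:
        conn.execute("DROP VIEW sales")
        conn.execute(f"CREATE VIEW sales AS SELECT * FROM {new_active_table}")
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise


# Prepare the inactive copy behind the scenes; readers still see the old data.
conn.executemany(
    "INSERT INTO sales_green VALUES (?, ?)", [("north", 120.0), ("south", 80.0)]
)
before = conn.execute("SELECT * FROM sales ORDER BY region").fetchall()

flip_switch(conn, "sales_green")
after = conn.execute("SELECT * FROM sales ORDER BY region").fetchall()

print(before)
print(after)
```

Flipping back is the same call with the other table name, which is what makes this approach attractive when a publish needs to be undone quickly.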
Another best practice around publishing is to have the data prepared (transformed) exactly how it is going to be in its end state. I call this the “final” stage. Just like you don’t want to mess with raw data before extracting, you don’t want to transform (or cleanse!) while publishing.
From ELT to ETP?
For years I have applied this pattern in traditional on-premises environments as well as modern, cloud-oriented environments. It mostly seems like common sense, but the pattern provides explicit structure, while being flexible enough to accommodate business needs. Perhaps someday we can get past the semantics of ETL/ELT by calling it ETP, where the “P” is Publish.
Being smarter about the “Extract” step by minimizing the trips to the source system will instantly make your process faster and more durable. Organizing your transformations into small, logical steps will make your code extensible, easier to understand, and easier to support. It might even help with reuse as well. And having an explicit publishing step will give you more control and force you to consider the production impact up front. The steps in this pattern will make your job easier and your data healthier, while also creating a framework to yield better insights for the business more quickly and with greater accuracy.