Scalable Data Architecture: Lean on your Cloud Data Warehouse

Large datasets require strategic thinking to maximize performance without runaway costs. Consider a large dataset of flight records spanning multiple years and thousands of routes. With the right cloud-based tools and optimized workflows, this data can be prepared for analysis both effectively and efficiently. In this post, we’ll explore three key strategies using Matillion’s Data Productivity Cloud:
- Dynamic Warehouse Scaling: Swiftly adapt compute power to handle large data loads without overspending.
- Zero-Copy Cloning: Safely experiment with transformations without risking the integrity of your original tables.
- Concurrent Transformation Handling: Run multiple transformations in parallel to reduce runtime and costs.
By combining these techniques, you can fine-tune both performance and cost-effectiveness in your cloud data workflows.
Dynamic Warehouse Scaling: Save Costs Without Sacrificing Performance
In this example, we’re running an analytics-centric transformation pipeline that has several complex rank and window calculations. These workloads are time-consuming and comparatively expensive.
Running the pipeline on a smaller warehouse reduces cost per minute but slows performance. A larger warehouse runs the pipeline faster but costs more – especially when running continuously. This is where the Data Productivity Cloud’s Alter Warehouse component comes in, allowing you to dynamically scale warehouse size right when you need it.
Note the grey link in Pipeline B. This is referred to as an Unconditional link: it fires whether the previous component succeeds or fails, guaranteeing the warehouse is scaled back to XSMALL regardless of outcome.
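Conceptually, this scale-up/scale-down pattern corresponds to standard Snowflake DDL. A minimal sketch, assuming a hypothetical warehouse named COMPUTE_WH:

```sql
-- Scale up just before the expensive transformation runs
ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = 'MEDIUM';

-- ... transformation pipeline executes here ...

-- Scale back down afterwards (the Unconditional link ensures this
-- step runs even if the transformation fails)
ALTER WAREHOUSE COMPUTE_WH SET WAREHOUSE_SIZE = 'XSMALL';
```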
| Warehouse Size | Runtime | Snowflake Credits Consumed | Matillion Credits Consumed | Approx. Snowflake Cost | Approx. Matillion Data Productivity Cloud Cost | Approx. Total Cost |
| --- | --- | --- | --- | --- | --- | --- |
| XSMALL | 5m 6s | 0.085 | 0.34 | $0.26 | $0.93 | $1.19 |
| MEDIUM | 2m 14s | 0.148 | 0.15 | $0.44 | $0.41 | $0.85 |
When estimating pipeline costs, it’s crucial to consider the cost implications of all technologies involved. Interestingly, the comparison above demonstrates that, while running Snowflake at a MEDIUM warehouse size incurred higher costs due to increased credits consumed, the reduced runtime led to lower Matillion credit usage, ultimately resulting in a lower overall cost. As always, testing specific use cases is essential to determine the most cost-effective configuration.
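To make the comparison concrete, here is a sketch of the arithmetic behind the table above. The per-credit rates (roughly $3.00 per Snowflake credit and $2.73 per Matillion credit) are assumptions back-calculated from the figures shown; actual rates vary by contract, edition, and region:

```sql
-- Assumed rates derived from the table above; check your own pricing
SELECT
    0.085 * 3.00 + 0.34 * 2.73 AS XSMALL_TOTAL_COST,  -- roughly $1.18
    0.148 * 3.00 + 0.15 * 2.73 AS MEDIUM_TOTAL_COST;  -- roughly $0.85
```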
Why Isn’t the Speed Gain Linear?
Even though a MEDIUM warehouse offers 4x the compute power of XSMALL, the performance improvement is typically 2–2.5x for a workload like this. The gain depends on factors such as:
- Query Complexity: Some operations, like the above window functions, reach diminishing returns when scaling compute power.
- Data Architecture: Snowflake’s parallel processing excels with partitioned data or multiple smaller files. A single large dataset may not fully utilize additional compute resources.
Best Practices and Key Takeaways
- Use smaller warehouses for routine operations to minimize costs.
- Use the Alter Warehouse component to scale up temporarily for heavy loads.
- Run tests to determine the optimal balance between performance and cost.
- Ensure warehouses automatically scale down after heavy loads.
Continue the fun by exploring dynamic scaling in your own environment! Start by testing with a smaller dataset or workload and scale up as needed to optimize performance and cost.
Zero-Copy Cloning for Rapid, Risk-Free Experimentation
Consider scenarios where you need to refine and test calculations before pushing them to production. Zero-copy cloning in Snowflake provides a safe environment for this. Instead of duplicating the entire dataset, you create a metadata-only clone that references the original data. This keeps costs low while letting you experiment freely.
Creating a Clone with an Orchestration Pipeline:
In Matillion’s Data Productivity Cloud, you can use a SQL Script component in your Orchestration pipeline to spawn a test environment. This instantly generates a test table referencing the production data, ready for modifications without endangering the source.
```sql
CREATE OR REPLACE TABLE DB.SCHEMA.FLIGHTS_CLONE CLONE DB.SCHEMA.FLIGHTS;
```
Experimenting with a Transformation Pipeline:
In a transformation pipeline, we’ll experiment with changes in cost. We might want to adjust rounding methods and/or fuel cost estimates and observe the impact on aggregated totals. Below you will note the code in a Calculator component including two different rounding methods.
```sql
ESTIMATED_FUEL_COST_FLOOR   = FLOOR("DISTANCE" * 8.50)
ESTIMATED_FUEL_COST_CEILING = CEIL("DISTANCE" * 8.50)
```
We’ve now tested new calculations on production-scale data. If changes are needed, we can continue with the existing cloned table or drop it and create a fresh one – no harm done. This iterative development cycle lets developers perfect transformations before rolling them out. Also, although clearly outside best practices, in an emergency, creating a clone to test a quick change before applying it to a production table can be a lifesaver. Of course, such a shortcut should only be taken when the urgency outweighs the risks.
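For instance, a quick way to quantify the difference between the two rounding methods is to aggregate both expressions on the clone. A sketch, reusing the clone created earlier:

```sql
-- Compare the aggregate impact of the two rounding methods
SELECT
    SUM(FLOOR("DISTANCE" * 8.50)) AS TOTAL_FUEL_COST_FLOOR,
    SUM(CEIL("DISTANCE" * 8.50))  AS TOTAL_FUEL_COST_CEILING
FROM DB.SCHEMA.FLIGHTS_CLONE;
```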
What about Views, Materialized Views, and Dynamic Tables?
You can absolutely create views, materialized views and dynamic tables using Matillion’s Data Productivity Cloud! However, for rapid, risk-free experimentation, Zero-copy cloning in Snowflake is the clear winner. Unlike views, which rely on the base table and can slow down query performance, and materialized views or dynamic tables, which incur storage and compute costs to maintain, zero-copy cloning creates an independent, instant snapshot of your data with no initial extra storage costs. This allows for flexible, isolated experimentation without impacting the source data or production systems. Its simplicity, cost-efficiency, and independence make it the perfect tool for scalable, iterative workflows with Matillion’s Data Productivity Cloud.
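The distinction is easy to see in SQL. A view is a stored query over the live base table, while a clone is an independent, writable snapshot (object names follow the earlier example):

```sql
-- A view re-reads the base table on every query and cannot be
-- modified independently of it
CREATE OR REPLACE VIEW DB.SCHEMA.FLIGHTS_VIEW AS
SELECT * FROM DB.SCHEMA.FLIGHTS;

-- A zero-copy clone shares storage initially but diverges safely:
-- writes to the clone never touch the source table
CREATE OR REPLACE TABLE DB.SCHEMA.FLIGHTS_CLONE CLONE DB.SCHEMA.FLIGHTS;
```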
Concurrent Transformation Handling for Faster Throughput
Once you’ve optimized your data warehouse for scaling and validated your transformations, the next challenge is throughput. Often, you’ll need multiple levels of aggregation – such as state-level metrics and airport-level metrics. Running these tasks back-to-back can double both runtime and expense.
Matillion Data Productivity Cloud’s orchestration pipelines can run multiple transformation pipelines concurrently. Instead of running a state-level aggregation first and then an airport-level aggregation, trigger both at once:
- Create Separate Transformation Pipelines: One for state_metrics and one for airport_metrics.
- Add Both to the Orchestration Canvas: Connect them to the same start event.
- Execute in Parallel: With adequate warehouse capacity, both transformations finish in roughly the same time it would take to run just one, effectively halving total runtime and Matillion cost.
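As a sketch, the two pipelines might emit SQL along these lines (column names such as ORIGIN_STATE and ORIGIN_AIRPORT are hypothetical). Because each statement runs from its own pipeline, the warehouse can execute them concurrently:

```sql
-- Pipeline 1: state-level aggregation
CREATE OR REPLACE TABLE DB.SCHEMA.STATE_METRICS AS
SELECT ORIGIN_STATE, COUNT(*) AS FLIGHT_COUNT, SUM("DISTANCE") AS TOTAL_DISTANCE
FROM DB.SCHEMA.FLIGHTS
GROUP BY ORIGIN_STATE;

-- Pipeline 2: airport-level aggregation, triggered from the same start event
CREATE OR REPLACE TABLE DB.SCHEMA.AIRPORT_METRICS AS
SELECT ORIGIN_AIRPORT, COUNT(*) AS FLIGHT_COUNT, SUM("DISTANCE") AS TOTAL_DISTANCE
FROM DB.SCHEMA.FLIGHTS
GROUP BY ORIGIN_AIRPORT;
```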
Sequential Processing vs. Concurrent Processing
Concurrent processing reduces execution time by running tasks in parallel. In the above example, this potentially completes the work in half the time and at half the cost compared to sequential processing.
A Note on Transformation Pipelines:
While a single transformation pipeline might visually show branching paths, these typically translate into sequential SQL operations. True concurrency occurs when separate transformation pipelines are orchestrated in parallel. By managing tasks at the orchestration level, you ensure both time and cost savings are realized.
Not Concurrent Processing!
Though this transformation pipeline is visually branched, it isn’t an example of true concurrency.
How Matillion’s Data Productivity Cloud Leverages Your Cloud Data Warehouse for Concurrency
Matillion’s Data Productivity Cloud works directly with your cloud data warehouse, offloading transformation tasks to be executed on the platform. While the above examples focus on Snowflake, Matillion Data Productivity Cloud leverages the scalable architecture of all supported CDWs to handle concurrent queries and pipelines efficiently. By orchestrating multiple transformation pipelines in parallel, it distributes workloads across the CDW’s compute resources, enabling fast and concurrent execution without the need for additional infrastructure management.
Bringing It All Together
By applying Dynamic Warehouse Scaling, Zero-Copy Cloning, and Concurrent Transformation Handling, you engineer a holistic, cost-effective, and high-performance data architecture. You can handle large data loads efficiently, test and refine calculations without production risk, and run multiple transformations simultaneously. By combining these techniques, Matillion’s Data Productivity Cloud allows you to maximize your cloud data warehouse's capabilities, balancing performance, scalability, and cost-efficiency in a unified framework.
The result? A streamlined environment where you only pay for what you need, when you need it, while confidently delivering analytics-ready data faster than ever.
Read part 1 of this series here: 4 Ways to “Love your End Users” with Matillion Data Productivity Cloud
David Baldwin
Founder, GiddyUp Data | Data Integration & Analytics Trainer