Top 15 Amazon Redshift Best Practices for ETL Processing


As more organizations turn to cloud data warehouses, they’re also finding that they need to optimize them to get the best performance out of their ETL processes. Amazon Redshift offers the speed, performance, and scalability required to handle exponential growth in data volumes. Check out the following Amazon Redshift best practices to help you get the most out of Amazon Redshift and your ETL processes.

 

1. Optimize your workload management

 

Amazon Redshift includes workload management (WLM) queues that allow you to define multiple queues for your different workloads and to manage the runtimes of the queries you execute. Using WLM, you can create separate queues for ETL, reporting, and superusers, for example. Amazon recommends keeping the total WLM query slot count (concurrency) across all of your queues to around 15 or fewer, which can help keep your ETL execution times consistent.
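
As a rough sketch, assuming you’ve configured a queue whose definition includes a query group named 'etl' (the queue and group names here are just illustrations), an ETL session can route its queries to that queue like this:

    -- Route this session's queries to the WLM queue configured for the 'etl' query group
    set query_group to 'etl';

    -- ... run the ETL statements for this job ...

    -- Return to default queue routing once the job is done
    reset query_group;

    -- Review the current WLM queue (service class) configuration
    select * from stv_wlm_service_class_config;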

 

2. Take steps to improve your COPY functions

 

Because of the way massively parallel processing (MPP) databases are configured, the different compute nodes divide the work of ingesting data. The nodes themselves contain a certain number of slices, depending on the node type of the cluster. To help divide the work of loading data equally among the slices, it’s best to COPY from multiple files of equal size, rather than a single large source file. Copying from a single file or from files of vastly different sizes causes some slices to have to do more work than others, which slows down the entire copy process. 
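
For example, if a job writes its extract as several similarly sized, compressed parts under a common key prefix, a single COPY can load them all in parallel. The bucket, table, and IAM role names below are placeholders:

    -- Loads every object whose key starts with 'sales/part_', spreading the work across slices
    copy sales
    from 's3://my-etl-bucket/sales/part_'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    delimiter '|'
    gzip;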

 

3. Minimize the number of commits in your ETL jobs

 

ETL jobs typically require multiple steps. If you execute a commit after each step, it can slow down your job because commits are resource-intensive. If you surround the steps of your ETL job with a BEGIN…END statement, then you can perform a single commit after all of the steps.
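
A minimal sketch of that pattern, with placeholder table and bucket names, looks like this; the only commit happens at the END:

    begin;

    -- Step 1: stage the incoming batch
    copy stage_sales
    from 's3://my-etl-bucket/sales/part_'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    gzip;

    -- Step 2: apply the staged rows to the target table
    delete from sales using stage_sales where sales.sale_id = stage_sales.sale_id;
    insert into sales select * from stage_sales;

    end;  -- a single commit covers all of the steps above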

 

4. Maintain your tables

 

It’s important to run the VACUUM command after large delete operations to deal with fragmentation. The DELETE command doesn’t reclaim the space that was occupied by the deleted rows; running VACUUM reclaims that otherwise unused space. Table owners and superusers can run VACUUM to keep table queries performing well.

 

Running the ANALYZE command after ETL jobs complete is also a good practice. Doing so gives Amazon Redshift’s query optimizer the statistics it needs to determine how to run queries most efficiently.

 

Amazon Redshift provides an Analyze and Vacuum schema utility that helps automate these functions. Because these operations can be resource-intensive, it may be best to run them during off-hours to avoid impacting users.
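
If you run these steps yourself at the end of an ETL job, a minimal version (with a placeholder table name) is simply:

    -- Reclaim the space left behind by deleted rows and re-sort the table
    vacuum sales;

    -- Refresh the statistics the query optimizer uses to plan queries
    analyze sales;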

 

5. Load data in sort key order

 

When you load data in sort key order, you minimize the need for the VACUUM command. Loading in sort key order means that each new batch of data follows the existing rows in your table, so the table stays sorted as it grows.
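
As an illustration, assume a fact table sorted on a timestamp column and daily batches that only contain rows newer than anything already loaded (all names are placeholders):

    create table sales (
        sale_id     bigint,
        customer_id integer,
        amount      decimal(12,2),
        saletime    timestamp
    )
    sortkey (saletime);

    -- Each daily batch arrives in saletime order, so new rows are appended
    -- after the existing ones and little or no re-sorting is needed
    copy sales
    from 's3://my-etl-bucket/sales/2020-01-15/part_'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    gzip;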

 

6. Create a data loading strategy

Amazon Redshift allows you to load data from multiple source systems using Amazon Simple Storage Service (Amazon S3). Amazon S3 is an object storage service that’s part of the Amazon Web Services platform. It can be used for a variety of data storage purposes, including data lakes, data warehousing, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.

 

A manifest file will make loads from multiple sources go more smoothly. A manifest is a text file in JSON format that lists the URL of each file in Amazon S3 that should be loaded. You can also make use of temporary staging tables in case you need to perform simple transformations before loading the data into its final tables.
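
A rough sketch, with placeholder bucket, file, and role names; the manifest contents are shown in comments, followed by a COPY that references it with the MANIFEST option:

    -- Contents of s3://my-etl-bucket/manifests/sales.manifest (JSON, not SQL):
    -- {
    --   "entries": [
    --     {"url": "s3://my-etl-bucket/sales/part_0000.gz", "mandatory": true},
    --     {"url": "s3://my-etl-bucket/sales/part_0001.gz", "mandatory": true}
    --   ]
    -- }

    copy sales
    from 's3://my-etl-bucket/manifests/sales.manifest'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    manifest
    gzip;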

 

7. Take advantage of Amazon Redshift Spectrum

 

Amazon Redshift Spectrum is a feature of the Amazon Redshift data warehouse that enables you to run SQL queries on data that is stored in Amazon S3. Amazon Redshift Spectrum essentially enables you to perform analytics on data stored external to the Amazon Redshift database.

 

Spectrum is well suited to accommodate spikes in your data storage requirements that often impact ETL processing times, especially when staging data in Amazon S3. You can run queries against that data using Amazon Redshift Spectrum as if it were in Redshift.
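
A minimal sketch, assuming the staged files are already registered as a table in the AWS Glue Data Catalog (the schema, database, table, and role names are placeholders):

    -- Point an external schema at the catalog database that describes the S3 data
    create external schema spectrum_stage
    from data catalog
    database 'etl_staging'
    iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
    create external database if not exists;

    -- Query the staged S3 data alongside a local Redshift table
    select s.sale_id, s.amount, c.customer_name
    from spectrum_stage.staged_sales s
    join customers c on c.customer_id = s.customer_id;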

 

8. Compress your data files before loading

 

Compress the files you load into your data warehouse using gzip, lzop, bzip2, or Zstandard. The CPU time spent compressing is regained in reduced bandwidth requirements and faster data transfer times. If you are using Amazon S3, compressing your files before they’re loaded into S3 also decreases Amazon Redshift Spectrum query times and therefore reduces costs for both storage and query execution.
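
The COPY command just needs to be told which compression format the source files use. For example, assuming the files were compressed with Zstandard before being uploaded (names are placeholders):

    copy clickstream
    from 's3://my-etl-bucket/clicks/part_'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    zstd;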

 

9. Monitor your ETL processes

 

Monitoring is always a smart move. We monitor all types of processes and applications, so make sure you’re monitoring your ETL processes too. Amazon provides several monitoring scripts and system tables for Amazon Redshift that can help you check in on the status of your ETL processes. Staying on top of monitoring can help you find problems early, before they start to impact the performance of your Amazon Redshift cluster.
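
Alongside those scripts, the system tables are an easy place to start. As a simple example, this query (the 24-hour window and 20-row limit are arbitrary choices) surfaces the slowest recent queries:

    select query,
           trim(querytxt) as sql_text,
           starttime,
           datediff(seconds, starttime, endtime) as duration_seconds
    from stl_query
    where starttime > dateadd(hour, -24, getdate())
    order by duration_seconds desc
    limit 20;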

 

10. Use UNLOAD rather than SELECT

 

If you need to extract large numbers of rows, use the UNLOAD command rather than SELECT. SELECT is resource-intensive and time-consuming. Using SELECT for retrieving large result sets can put a burden on the leader node, which can impact performance. UNLOAD, on the other hand, distributes the work among compute nodes, making it more efficient.
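
A minimal sketch, with placeholder names; PARALLEL ON writes one or more files per slice so the compute nodes share the export work:

    unload ('select * from sales where saletime >= ''2020-01-01''')
    to 's3://my-etl-bucket/exports/sales_'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    gzip
    parallel on;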

 

11. Review your execution plans

 

The EXPLAIN command allows you to review query execution plans without actually running the query. You can use it to check for performance issues before running queries that might impact your cluster. One thing to keep in mind is that, unlike in traditional RDBMSs, MERGE joins perform better than HASH joins in Amazon Redshift. Therefore, wherever possible, eliminate the HASH joins in your execution plans in favor of MERGE joins by joining on columns that are included in both tables’ sort and distribution keys.
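
For example (table and column names are placeholders), you can check the join strategy before running the query and look for “Merge Join” rather than “Hash Join” in the output:

    explain
    select f.sale_id, d.customer_name
    from fact_sales f
    join dim_customer d on d.customer_id = f.customer_id;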

 

12. Scale your cluster as needed

 

If you have checked on the health of your queries and they still aren’t performing well, it’s possible that the issue is actually the amount of resources available. The beauty of cloud data warehouses is that you can simply scale them up to gain access to more computing power. Consider scaling up to a more powerful cluster to meet your response time requirements. As volumes grow, you can scale to adapt to more and more data, which lets you take advantage of pay-as-you-go cloud economics.

 

13. Scaling down is also an option

 

Scaling goes both ways and scaling down is a great way to reduce costs. You can shrink an existing cluster so that it consumes fewer resources and therefore costs less. This may be something that your database developers choose to do overnight or on weekends when traffic is lower.

 

14. Design tables to improve performance

 

Amazon Redshift is an MPP platform, and designing tables to take advantage of this parallelism is critical to good performance across large datasets. Amazon Redshift provides several distribution styles, including ALL, EVEN, and KEY distribution. The distribution style you choose should depend on how you intend to use the table. A good rule of thumb is to use ALL distribution for smaller dimension tables and KEY (or EVEN) distribution for fact tables, which helps make joining facts to dimensions more efficient.
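
A quick sketch of that pattern, with illustrative table and column names: a small dimension replicated to every node, and a large fact table distributed (and sorted) on the join key:

    create table dim_customer (
        customer_id   integer,
        customer_name varchar(100)
    )
    diststyle all;

    create table fact_sales (
        sale_id     bigint,
        customer_id integer,
        amount      decimal(12,2),
        saletime    timestamp
    )
    diststyle key
    distkey (customer_id)
    sortkey (customer_id);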

 

15. Take advantage of change data capture

 

Change data capture (CDC) is a technique used to identify and capture the changes made to your source databases and then replicate those changes in Redshift. The Amazon Redshift COPY command is used to move data into Redshift, but it can be resource-intensive – especially when loading data that has already been loaded. CDC methods minimize the amount of data you’re moving around, which dramatically improves the performance of your ETL software.
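
How the changes are captured on the source side depends on your CDC tooling, but applying a captured batch in Redshift is often a staged merge. A minimal sketch with placeholder names:

    begin;

    -- Stage only the rows that changed since the last batch
    create temp table sales_changes (like sales);

    copy sales_changes
    from 's3://my-etl-bucket/cdc/sales/batch_20200115_'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    gzip;

    -- Replace the old versions of the changed rows, then insert the new versions
    delete from sales using sales_changes where sales.sale_id = sales_changes.sale_id;
    insert into sales select * from sales_changes;

    end;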

Want to Learn More About Amazon Redshift?

 

Matillion is an industry-leading data transformation solution that was originally introduced on Amazon Web Services for the Amazon Redshift cloud data warehouse. Delivering a true end-to-end data transformation solution (not just data migration or data preparation), Matillion provides an instant-on experience to get you up and running in just a few clicks, a pay-as-you-go billing model to cut out lengthy procurement processes, as well as an intuitive user interface to minimize technical pain and speed up time to results.

 

Matillion ETL for Amazon Redshift, which is available on the AWS Marketplace, has the platform’s best practices baked in and adds warehouse-specific functionality, so you get the most out of Redshift.

 

Matillion Data Loader allows you to effortlessly load source system data into your cloud data warehouse. It’s a free SaaS-based data integration tool that provides quick access to your data, helping accelerate innovation and make faster, better business decisions.