WRITTEN IN COLLABORATION WITH SAUNAK CHANDRA, SENIOR SOLUTIONS ARCHITECT, AMAZON WEB SERVICES
Amazon Redshift is one of the most popular, fast, scalable, and simple cloud Data Warehouses built to serve workloads at any scale. If you are bringing data into Amazon Redshift, Matillion is a purpose-built data transformation that deploys into your own AWS cloud environment and seamlessly integrates with Amazon Redshift.
With Matillion, you can extract your data from your many sources and load that data into Amazon Redshift in one swift set. Matillion can then generate a SQL query via a graphical UI to transform data. The query is executed in Amazon Redshift, taking advantage of the platform’s speed and scalability.
5 top tips to optimize Amazon Redshift performance
When dealing with large amounts of data, you need to make sure you have good table designs and efficient queries. You need to choose table designs and query generation in accordance with Amazon Redshift MPP and columnar architecture. If you don’t, may end up underutilizing Redshift processing power.
To ensure your jobs are optimized for performance, here are the Top 5 tips for improving the performance of your data workloads when you use Matillion within Amazon Redshift.
1. Apply query project and filter
The more unnecessary bytes of data that have to be read and stored as part of a query, the longer overall execution time will be. Due to Amazon Redshift’s columnar architecture, it’s important to “project” the necessary columns in the select clause and apply a filter. This will drastically reduce the memory requirement for the query.
When using a Table Input Component, only select the columns that are needed in the transformation.
Make use of Filter Components to add WHERE clauses, which further reduce the amount of data being queried. Using a WHERE clause based on the primary sort column of the largest table will help minimize scanning time.
Behind the scenes, Amazon Redshift uses zone map, which is the metadata containing the maximum and minimum value of the column stored in the disk block. Note that Amazon Redshift uses a 1 MB disk block to store the column data.
When the table is an external table, apply the filter to the table prior to any join with other tables.
2. Fine-tune table design
Matillion ETL for Amazon Redshift allows the user to specify the distribution style and sort key on components that create tables. Selecting the wrong distribution style can have a significant impact on performance, as it may result in distributing or broadcasting the data across Amazon Redshift compute nodes when joining tables.
Amazon Redshift automatically manages data distribution style across the compute nodes to adapt to the size of data. During CREATE TABLE, all you need to do is to skip the DISTSTYLE option and Amazon Redshift will choose the appropriate distribution style. This will make the table design process seamless and query performance improve automatically without any manual table redesign.
If you have some known query patterns, such as joining of large fact table and a large dimension table, you can apply a KEY distribution style. With DISTSTYLE Key, Amazon Redshift co-locates frequently joined tables in the same compute node. This reduces network distribution and broadcasting when tables are joined on the same column. Both tables should have the joining column as DISTKEY.
The KEY STYLE is best suited for:
- Joining a table with another table.
- When the joining column is highly cardinal, for example an ID column.
- Large fact tables
- Large dimension tables
You can choose the Distribution Style while setting up the table metadata in Matillion’s Create Table component:
Sorting enables efficient handling of range-restricted predicates. For example, using sort keys in a WHERE clause can reduce the amount of blocks scanned, per this example from the AWS documentation:
“If a table stores five years of data sorted by date and a query specifies a date range of one month, up to 98 percent of the disk blocks can be eliminated from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned.”
COMPOUND sort keys consist of all the columns specified in the order they are listed.
COMPOUND sort keys are best suited for queries:
- Where queries filter on multiple columns on the table.
- Where the primary sort column is always used in the filter.
- That make use of GROUP BY, ORDER BY, and PARTITION BY
- Where the sort key includes a date, timestamp, or identity column
INTERLEAVED sort keys give equal weight to each column in the sort key.
INTERLEAVED sort keys are best for queries:
- With highly selective restrictive predicates
- Where there is no dominant sort key
- In almost all cases, the COMPOUND sort key will suffice and will perform better than INTERLEAVED sort key in the long term. The INTERLEAVED sort key requires additional maintenance—avoid using it unless absolutely necessary.
- When data from the source is ingested in chronological order, make the date column as sort key. Before loading data into the table, sort the new rows by the date column. This will reduce the need to run a vacuum on the table.
You can choose the Sort Keys and Style while setting up the table metadata in Matillion’s Create Table component:
Amazon Redshift automatically applies column encoding during COPY command. With COPY command option COMUPDATE PRESET Redshift will choose the most appropriate column compression, while loading data into an empty table and it will set the sort keys column compression as RAW.
Pro tip: Under no circumstances should you encode the SORTKEY: keep it as RAW.
3. Optimize memory usage
When a query uses more than its allocated amount of memory, it will start consuming disk space to store intermediary results.
Implementing Tips 1 and 2 together is the best way to avoid unnecessarily going to disk. However, when doing that is not enough, you can alter the Workload Management (WLM) slots, which will let you allocate more memory to a query.
WLM in Amazon Redshift allocates memory to slots in a queue equally. Assigning several slots to one query gives that query access to the memory for all of those slots.
You can identify which queries may benefit from increasing the number of slots by using the query below:
#Find recent queries that went to disk select query, substring from svl_qlog join svl_query_summary using(query) where starttime > date(getdate()) - interval '1 day' and is_diskbased = 't';
You should be able to match the query substring to that of a Write component in Matillion, giving you the job that benefits from extra slots.
In an orchestration job, use the Alter Session WLM Slots Component to increase the number of slots available for transformation jobs that run after this component.
Pro tip: Redshift now has the “Auto WLM” feature, which can be turned on at the cluster level. With AutoWLM, Redshift determines the current workload on the cluster and tunes concurrency and resource allocation accordingly. When queries that require large amounts of resources are in the system (for example, hash joins between large tables), the concurrency is lower. When lighter queries (such as inserts, deletes, scans, or simple aggregations) are submitted, concurrency is higher.
4. Schedule Vacuum (if needed)
Redshift runs VACUUM DELETE, which removes the rows marked for deletion by a prior DELETE command. VACUUM DELETE runs automatically in the background when the workload is low on the cluster.
A table that goes through frequent DMLs (INSERT, UPDATE, or DELETE) may need additional maintenance apart from what Redshift AUTO VACUUM provides. (An example would be a table storing sensor or streaming data coming from IoT devices several times a day) For these cases, you can schedule manual VACUUM on the table after you finish the ELT process.
You can increase performance for frequently updated tables by re-sorting. Simply build a maintenance job in Matillion ETL to run on a schedule, check for tables that need to be re-sorted and then run VACUUM on those tables.
Note that when you use an Interleaved Sort Key, if the distribution of the values in the sort key changes, or skews, as rows are added, the sort strategy will no longer be optimal, and the performance benefit of sorting will degrade.
The query below shows tables that use interleaved sort keys:
select tbl as tbl_id, stv_tbl_perm.name as table_name, col, interleaved_skew, last_reindex from svv_interleaved_columns, stv_tbl_perm where svv_interleaved_columns.tbl = stv_tbl_perm.id and interleaved_skew is not null;
The value of interleaved_skew is a ratio that indicates the amount of skew–a 1 indicates no skew. A value greater than 1 indicates that the table needs re-indexing. Usually, if the skew ratio is greater than 1.4, you’ll want to reindex. You can build a job to automatically identify and reindex your tables.
5. Optimize table stats
Load components automatically refresh statistics by default after they complete (you can also turn off this option). But for tables that are modified over time, after every DML you should update the statistical metadata that the query planner uses to build and choose optimal plans. To do so, you analyze your tables by running the ANALYZE command.
Redshift will analyze tables automatically in the background during light workloads. Turn the auto_analyze parameter to True in the cluster parameter group.
Matillion ETL for Amazon Redshift makes it easy to identify all tables that have out-of-date statistics and automatically update them on a schedule.
Optimize Amazon Redshift Performance: Things to keep in mind
To optimize Amazon Redshift performance, you should always look to reduce the number of bytes of data a query needs to read and store, keep the table sorted as much as possible, and reduce the amount of data movement between nodes.
Keeping these things in mind and implementing these five tips will help you achieve both these goals and produce a noticeable improvement in performance, especially with larger data sets.