Blog| Matillion ETL for Amazon Redshift

    5 Top Tips to Improve Amazon Redshift Performance

    Matillion offers a data integration solution for Amazon Redshift. It does this by extracting and loading the data in one swift set, making it available in Amazon Redshift for transformation. Matillion generates a SQL query to transform data which is executed in Amazon Redshift.

    When dealing with large amounts of data it is important to make sure you have good table designs and efficient queries. If you don’t you will at best, have suboptimal transformations and at worst, failed jobs due to running out of resources.

    Here are my top 5 tips for improving the performance of your ELT workloads.

     

    Top 5 tips to Improve Amazon Redshift Performance

     

    1. Reduce the amount of queried data

    The more bytes of data that have to be read and stored as part of a query will increase the overall execution time.

    When using a Table Input Component only select the columns that are needed in the transformation.

    Make use of Filter Components to add WHERE clauses which further reduce the amount of data. Using a WHERE clause based on the primary sort column of the largest table will help minimize scanning time.

     

    2. Design tables properly

    Matillion ETL for Amazon Redshift allows the user to specify the distribution style and sort key on components that create tables. Selecting the wrong distribution style can have a big impact on performance by forcing Amazon Redshift to move data unnecessarily between nodes when joining tables.

    Distribution Styles

    With a KEY distribution style the records are distributed across the nodes according to the values in the select column.

    The KEY STYLE is best suited for:

    • Large fact tables
    • Large dimension tables

    AWS states, “if you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.”

    With an EVEN distribution style the rows are distributed in round-robin fashion across nodes in the cluster.

    The EVEN DISTRIBUTION STYLE is best suited for:

    • Temporary/staging table
    • No obvious joining key or good candidate column

    With an ALL distribution, a copy of the entire table is stored on each node in the cluster.

    The ALL STYLE is best suited for:

    • Slowly changing data

     

    Top 5 tips to Improve Amazon Redshift Performance - All styles

     

    Sort Keys

    Sorting enables efficient handling of range-restricted predicates. An example of how using sort keys in a WHERE clause can reduce the amount of blocks scanned is shown in this example from the Amazon documentation:

    “If a table stores five years of data sorted by date and a query specifies a date range of one month, up to 98 percent of the disk blocks can be eliminated from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned.”

    INTERLEAVED sort keys give equal weight to each column in the sort key.

    INTERLEAVED sort keys are best suited for queries:

    • With highly selective restrictive predicates
    • There is no dominant sort key

    COMPOUND sort keys are made up of all the columns specified in the order they are listed.

    COMPOUND sort keys are best suited for queries:

    • Where one sort key is dominant over the others, make sure this is first in the column list.
    • That makes use of GROUP BY, ORDER BY & PARTITION BY
    • Sort key includes a date, timestamp or identity column

    3. Try to avoid queries going to disk

    When a query uses more than its allocated amount of memory it will start consuming disk space to store intermediary results, this can often result in disk full errors.

    Implementing Tips 1 and 2 together is the best way to avoid unnecessarily going to disk. However, when that is not enough, you can alter the Workload Management (WLM) slots. In this way you can allocate more memory to a query.

    WLM in Amazon Redshift allocates memory to slots in a queue equally, assigning several slots to one query gives that query access to the memory for all of those slots.

    To identify which queries may benefit from increasing the number of slots can be found using the below query.

     

    #Find recent queries that went to disk
    select query, substring
    from svl_qlog join svl_query_summary using(query)
    where starttime > date(getdate()) - interval '1 day'
    and is_diskbased = 't';

     

    You should be able to match the query substring to that of a Write component in Matillion giving you the job that benefits from extra slots.

    In an orchestration job, use the Alter Session WLM Slots Component to increase the number of slots available for transformation jobs that run after this component.

     

    Top 5 tips to Improve Amazon Redshift Performance - Alter Warehouse

     

    4. Vacuum tables regularly

    VACUUM commands can be intensive for your cluster to run. Therefore, it’s best if you can schedule these during periods of low usage on your cluster or after ELT process has finished.

    Tables that have unsorted rows can see performance increases by re-sorting. You can easily build a maintenance job in Matillion ETL to run on a schedule and check for tables that need to be re-sorted and then run the VACUUM on those tables.

    Also when you use an Interleaved Sort Key, if the distribution of key values changes, or skews, as rows are added, the sort strategy will no longer be optimal, and the performance benefit of sorting will degrade.

    Amazon suggests if the skew ratio is greater than 1.4 then you can benefit from a reindex. Again this article shows how to build a job to automatically identify and reindex your tables.

     

    Top 5 tips to Improve Amazon Redshift Performance - Analyze tables

     

    5. Keep table stats up to date

    Load components automatically refresh statistics after they have completed by default (although this option can be turned off) but for tables that are modified over time, it is recommended that you update the statistical metadata that the query planner uses to build and choose optimal plans. To do so, you analyze your tables by running the ANALYZE command.

    Matillion ETL for Amazon Redshift makes it easy to identify all tables that have out of date statistics and automatically update them on a schedule.

     

    Top 5 tips to Improve Amazon Redshift Performance - Vacuum

     

    To summarize, you should always look to reduce the number of bytes of data a query needs to read and store, and reduce the amount of data movement between nodes. The above tips all contribute to achieving this and should produce a noticeable improvement in performance, especially with larger data sets when implemented.

    Useful Links

    Query Performance Improvement Opportunity
    Choosing Dist Sort
    Sorting Data
    Using SVL Query Summary
    Vacuum Reindexing
    Analyzing Tables

     

    Try Matillion ETL to improve your Amazon Redshift performance. Arrange a free demo, or start a free 14-day trial.