Boost Amazon Redshift Performance with Best Practice Schema Design
Based on our real-world experience of using Redshift, there are a number of important best practices you should consider. Here we take a look at 3 ways you can optimise Amazon Redshift performance using schema design.
The Amazon Redshift best practice documentation contains dozens of recommendations.
There are some best practices that, in our opinion, you absolutely have to implement in order to optimise Amazon Redshift performance. There are others which, provided you've got the fundamentals right, you can judiciously ignore.
Best practice 1 – Let ‘Copy’ choose compression encodings
What is it?
Our first best practice tip for improving Amazon Redshift performance involves using the ‘copy’ command.
The ‘Copy’ command is one of the ways of getting data into Redshift (and the one we suggest you use). You can ask it to automatically choose the best column encoding (compression) settings for the data it is uploading, rather than having to decide yourself.
This one’s an important one.
Firstly, we absolutely agree with the Amazon Redshift best practice. Do let ‘Copy’ choose the best column encoding scheme for you. The ‘Copy’ command is clever, and it knows more than you do!
However, there’s a ‘gotcha’ to be aware of, which could affect Amazon Redshift performance. The ‘Copy’ command will do a great job of choosing the right column encoding scheme, but it only does so on the first upload of data into an empty table.
As such, ensure that the first time you use ‘Copy’ to upload data into an empty table, you use a significant (and crucially, representative) data set, which Redshift can then evaluate to best set the column encodings.
If you just upload a few lines of test data, Redshift is not going to know how best to optimise the compression for your real-life workload.
Redshift does know best (as long as you give it some representative data).
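As a sketch of what this looks like in practice, the ‘Copy’ command's COMPUPDATE option controls automatic compression analysis (it applies only when the target table is empty and has no encodings already set). The table name, S3 path and IAM role below are placeholders for illustration:

```sql
-- Load a representative initial data set; COMPUPDATE ON asks Redshift
-- to analyse the incoming data and pick the best compression encoding
-- for each column. This only happens on the first load into an empty table.
COPY sales_fact
FROM 's3://my-bucket/sales/initial-load/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
COMPUPDATE ON;

-- Inspect the encodings 'Copy' chose for each column:
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'sales_fact';
```

If the table was loaded before encodings could be chosen, `ANALYZE COMPRESSION sales_fact;` will report recommended encodings for the existing data.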
Best practice 2 – Choose the Best Distribution Style and Key
What is it?
Our second best practice involves choosing the best distribution key to optimise Amazon Redshift performance.
The distribution style is how the data is distributed across the nodes. For instance, a distribution style of ‘All’ copies the data across all nodes.
When you apply distribution style at table level, i.e. for each table in your cluster, you tell Redshift how you want to distribute it: All, Even or Key.
We have found that the distribution style you specify is critical to ensuring good performance for queries with joins.
The options you choose here also have an impact on data storage requirements, required cluster size and the length of time it takes to execute the ‘Copy’ command (i.e. to upload data into Redshift).
We typically set the distribution style to ALL for smaller dimension tables, e.g. a date dimension with only a few thousand entries.
We set EVEN for tables that are not joined with other tables or are only joined to tables with ALL style specified. For example, a fact table with joins to small dimensions (because each of the small dimensions is already set to ‘All’).
And if we have a very large dimension, we distribute both the dimension and any fact table associated with it on their join column. You can (currently) only optimise for a single large dimension, so if we have a second large dimension we would either take the storage hit and distribute it ALL, or denormalise its columns into the fact table.
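The three choices above can be sketched as DDL. These table definitions are hypothetical examples, not a prescribed schema; the table and column names are assumptions for illustration:

```sql
-- Small dimension: replicate a full copy to every node.
CREATE TABLE dim_date (
    date_key  INT NOT NULL,
    full_date DATE NOT NULL
)
DISTSTYLE ALL;

-- Large dimension and its fact table: co-locate rows on the join
-- column so joins on customer_key need no data redistribution.
CREATE TABLE dim_customer (
    customer_key  INT NOT NULL,
    customer_name VARCHAR(100)
)
DISTSTYLE KEY DISTKEY (customer_key);

CREATE TABLE fact_sales (
    customer_key INT NOT NULL,
    date_key     INT NOT NULL,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_key);

-- Table with no joins (or joins only to ALL-style tables):
-- spread rows evenly across nodes in round-robin fashion.
CREATE TABLE staging_events (
    event_id BIGINT,
    payload  VARCHAR(1000)
)
DISTSTYLE EVEN;
```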
If you don’t get this right, it can be a killer for your Amazon Redshift performance! So if a query is running very slowly, or more slowly than you would like, optimise here.
Best practice 3 – Choose the Best Sort Key
What is it?
Our final tip to improve Amazon Redshift performance is choosing the best sort key. If you specify a sort key, Redshift maintains the data in a table in the order of that column. The data is sorted within each node’s portion of the table rather than globally, so each cluster node maintains its own slice in sort order.
Choosing a sort key can optimise Amazon Redshift performance in a number of ways.
The first is data filtering. If your where-clause filters on a sort-key-column, entire blocks of data are skipped. This is possible because Redshift stores data in blocks, and each block header records the minimum and maximum value of the sort key within that block. If your filter is outside of that range, the entire block can be skipped. Neat.
Another possible optimisation is when joining two tables when one, or both, tables are sorted on their join keys. In that case, the data can be read in matching order and a merge-join becomes possible without a separate sort-step. If you are joining a large dimension to a large fact table, this is going to help a lot, since neither would fit into a hash-table (the other available efficient join strategy). In fact, in this example, sorting and distributing on the same key is even better.
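Putting both optimisations together, a sketch of sorting and distributing a large dimension and fact table on the same join key might look like this (table names are hypothetical; whether the planner actually picks a merge join depends on the tables staying largely sorted, so run VACUUM after large loads):

```sql
-- Distribute AND sort both tables on the join column: joins can be
-- resolved node-locally with a merge join, no redistribution or sort step.
CREATE TABLE dim_customer (
    customer_key  INT NOT NULL,
    customer_name VARCHAR(100)
)
DISTKEY (customer_key)
SORTKEY (customer_key);

CREATE TABLE fact_sales (
    customer_key INT NOT NULL,
    sale_date    DATE NOT NULL,
    amount       DECIMAL(12,2)
)
DISTKEY (customer_key)
SORTKEY (customer_key);

-- EXPLAIN should show a merge join rather than a hash join
-- once both tables are loaded and sorted:
EXPLAIN
SELECT d.customer_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_customer d ON d.customer_key = f.customer_key
GROUP BY d.customer_name;
```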
To optimise Amazon Redshift performance: Optimise for joins first, then optimise for filtering.
For more information on improving Amazon Redshift performance, download our free guide below.