Choosing the right distribution styles is crucial when it comes to optimising AWS Redshift performance, and if you don’t get this right, it can lead to a significant slowdown in query performance.
The distribution style is how the data is distributed across the nodes in AWS Redshift. For instance, a distribution style of ‘All’ copies the data across all nodes.
When you apply distribution style at table level i.e. for each table in your cluster, you tell AWS Redshift how you want to distribute it… All, Even or Key.
We have found that how you specify distribution style is super important in terms of ensuring good query performance for queries with joins.
The options you choose here also have an impact on data storage requirements, required cluster size and the length of time it takes to execute the ‘Copy’ command (i.e. to upload data into AWS Redshift).
There are 3 different distribution styles and it’s important to understand each one, as well as how they work together.
Redshift distribution styles: All
‘All’ is the simplest distribution style. If you set a distribution style of ‘All’, you instruct Redshift to simply make a copy of the table to every node in the cluster.
The upside of this is that when you are asking the cluster to return a query which includes a join, each node executing that join definitely has a local copy of the table you have distributed using ‘All’. As such, AWS Redshift does not have to get involved copying the required data across the network from node to node, to complete the query. If a particular node was tasked with completing part of a joined query and didn’t have a required table locally, it would have to get the data it needed across the network, negatively and significantly affecting query performance.
The downside of using ‘All’ is that you have a copy of the table on every node in the cluster – taking up space, and increasing the length of time that it takes to use the ‘Copy’ command to upload data into Redshift, and ultimately meaning that you’ll need a larger cluster.
Redshift distribution styles: Even
‘Even’ – Specifying ‘Even’ distribution spreads the table rows over all the nodes in the cluster, well, evenly!
Queries involving that table are then distributed over the cluster with each slice on each node working to provide the answer in parallel.
With no joins involved, this is a good choice, but when joins are involved then the rows matched by different tables involved in the join may not all be on the same node and need to be distributed over the network. Of course, if you join an ‘Even’ and an ‘All’ table together, no redistribution is required because the rows of the ‘All’ table are available everywhere.
Redshift distribution styles: Key
‘Key’ – With a key distribution set, you specify a column to distribute on and then, cleverly, AWS Redshift ensures that all the rows with the same value of that key are placed on the same node.
Key distribution across nodes is really important. Let’s say you had a table of costs, with a column called department code. If you set the table to be distributed across the cluster using the ‘department code’ column as the key, Redshift will ensure that all the costs for a given department are neatly placed onto a single given node. As such, if you then issue a query asking for the total costs, by department, Redshift knows that each node will return the complete result for a given department. It therefore has no further processing to do and can resolve the query efficiently.
Another good use would be to optimise a join between two large tables, for instance, a large dimension table, such as ‘customers’ on an online retail site, and a large sales transaction table on that same site. In this example you could have lots (e.g. millions) of rows in the customers table, and even more (e.g. tens or hundreds of millions) of rows in the sales transactions table.
Based on our extensive experience of using AWS Redshift, here are some tips for the best-practice use of distribution styles.
We typically set the distribution style to ALL for smaller dimension tables, e.g. a date dimension with only a few thousand entries.
We set EVEN for tables that are not joined with other tables or are only joined to tables with ALL style specified. For example, a fact table with joins to small dimensions (because each of the small dimensions is already set to ‘All’).
And if we have a very large dimension we will DISTRIBUTE both the dimension and any fact associated with it on their join column. You can (currently) only optimise for a single large dimension, so if we have a second large dimension we would take the storage-hit and distribute ALL, or design the dimension columns into the fact.
For more best-practice advice on optimizing AWS Redshift performance, download our free eBook below