When considering best practices for Amazon Redshift, it is really useful to understand exactly how Redshift works under the hood.
In this article we examine the key fundamentals of the Redshift columnar database engine and how it does its stuff. We explore how clusters are arranged in Redshift and how the data itself is stored.
How Clusters Are Arranged In Redshift
- Cluster: the cluster is the overall instance or configuration of Redshift. You cannot have just a Redshift server, only a Redshift cluster. A cluster is made up of a leader node and one or more nodes. At cluster level you decide the storage technology for your Redshift implementation – SSD or magnetic.
- Leader Node: The leader node is what you and your application treat as the database. It is the front-end facade to the complexity of the Redshift cluster behind the scenes. Behind the leader node sit one or more nodes that actually do the work – but what your application sees is just one, nice, simple leader node. You can only have one leader node per Redshift cluster. It looks, to the outside world, like a Postgres database.
- Nodes: Data is distributed across nodes, and an individual node is roughly analogous to a virtual machine. How data is distributed across the nodes depends on the schema design. A lot of the best practices surrounding redshift are about getting the right data into the right nodes, for optimum performance. In layman’s terms, the more nodes you have, the more potentially powerful your Redshift cluster is. However, if you don’t follow the best practices properly, additional nodes can yield negligible (or in fact negative) real world performance and scalability gains. Obviously, it is worth bearing in mind that the more nodes your Redshift cluster has, the more expensive it is to run. However, you can re-size the cluster at any time (as long as you don’t reduce the cluster size below the amount of storage that you need.)
- Slices: The type of node that you select governs the number of slices that each node has. A slice is roughly analogous to a processor (or core) allocated to working on the data stored on that node. The correct use of slices allows a node to make use of its multiple cores. When the cluster allocates work to a node, the node can further split this work down across its available cores/CPUs, assuming that the data is structured in a way so that it can be practically and efficiently split up.
How Data Is Stored In Redshift
If you were implementing a data warehouse in a traditional relational database technology – MS SQL Server or MySQL, for example – then you would design a star-schema, fill it with data, and then index the fields that users want to filter, group by, and use. But because you don’t know in advance what fields those are (users have an annoying habit of wanting to use the one you hadn’t thought of), then you end up indexing everything. At which point, you have got two full copies of your data: one in the main tables, and one in the indices.
And so, columnar data stores (like Redshift, Vertica, Netezza and Teradata), at a very simplified level, admit in advance to themselves that every column is going to need an index, and thereby do away with the main data store. They are, in effect, therefore, just an index for every column.
As a by-product of this indexing, all the values of the same type and those which have similar values, are organised next to each other in the indices. As such, compression in columnar data stores is far more efficient than in traditional RDBMs.
This is all relevant to your use of best practices in Redshift. For instance, when you are defining a table in Redshift, you may, if you wish, choose from one of 11 different compression strategies (Redshift calls them ‘column encodings’) for each column in your table. Picking the right one will impact your storage and as a consequence, can also impact performance. Or, if you follow other best practices, you can leave Redshift to pick the correct one for you.
Amazon Redshift best practices
You can find Amazon’s own set of Redshift best practices here:
Based on our real-world experience of using Redshift, some of these are more important than others. There are some best practices that, in our opinion, you absolutely have to implement. There are others which, if you’ve done some other stuff right, you can judiciously ignore.
To help you understand which best practices you NEED to follow, we’ve created our own ‘Real World Guide’ to Amazon Redshift Advice and Best Practices.
Download your copy below.