Amazon Web Services (AWS) has changed forever how IT infrastructure can be delivered, doing so on-demand, scalably, quickly, and cost effectively. And now, Amazon Redshift is doing exactly the same for data warehousing and Big Data analytics.
It offers a massively parallel columnar data store that can deal with billions of rows of data—but one that can be up and running in a few minutes, and operated for a few cents an hour. As such, it’s no surprise that Amazon Redshift is AWS’s fastest ever growing service.
But what is Amazon Redshift exactly? How does it work, and how can it benefit you?
At its simplest, Amazon Redshift is a combination of two important technologies. First, it’s a columnar data store (also called a column-oriented database); and second, it also uses massively parallel processing (MPP).
Let’s take each of those in turn.
What is Amazon Redshift? Column-oriented Databases/Columnar Data Stores Explained
A column-oriented DBMS is a type of database management system that stores data as sections of columns of data, instead of as rows as you would more find in a conventional RDBMS. Column-orientated databases are particularly useful for data warehouses, analytics and Big Data applications, as they resolve queries very quickly and reduce the need for joins.
For instance, let’s say you’ve got a table of data like this:
A normal RDBMS, like MySQL, SQL Server or Oracle, will typically store the data under-the-hood something like this:
Each row, one at a time, serialised, and keyed by an internally maintained row_id.
The RDBMS and this row-orientated data storage format is optimised to be really efficient at returning single rows. This suits common use cases in transactional OLTP systems where you maybe accessing a specific row (e.g. a customer or a product) to update it.
But crucially, row-orientated databases are not as efficient at performing operations over the whole data set. In the above example, for instance, the database would have to inspect the entire data set to identify customers with credit limits between 6,000 and 18,000. In anything but the smallest of tables this is going to mean lot of slow disk operations to resolve the answer for you.
The strategy to mitigate this in a normal database is of course to build an index. An index over our sample table’s credit_limit column might look like this:
As the index is smaller, it requires fewer IO operations to scan it, making it faster. The index can also be sorted on the column value, which makes range queries like our “credit limit between x & y” query faster still.
But indexes have their downsides. They add overhead to the RDBMS in terms of storage, processing, and IO. And as database designers, we have to predict how the user is going to use the database, so that we can have indexes built to support their queries. That’s not so bad in a transactional system, as we know what queries the software will want to make on the database. But in a BI, data warehouse or analytical application, we may want to “slice and dice” by almost any field.
In a columnar data store like Amazon Redshift, the way data is stored is flipped around. Our sample table would be stored like this:
It’s been swapped around. Each row now represents a column.
But there’s another more subtle (and more important) difference– the data has become the key. In a normal database, the Row_ID which is internally maintained is the key to the indexed data. Here, the data is the key. This means that if records have identical values in some of their fields (like in our example, there are two customers called Smith) then this can be reflected more efficiently, like in this example:
…;Smith:001, 003;White:002; Jones:004;…
This has some pretty dramatic results when it comes to returning the answers to analytical queries. For many common searches, like “find all the customers called Smith”, the result can be returned with a single database operation. Counts, averages and other mathematical operations benefit in a similar fashion. Compared to the IO-intensive way that a traditional RDBMS would answer the same question, the difference is transformational.
There are other benefits too. The way the data is stored naturally optimises it for compression. And of course you don’t need any indexes again, ever.
Columnar databases are nothing new. You may have heard of Vertica, Teradata and Infobright, for instance. But Amazon Redshift delivers column-orientated technology on an as-a-service basis, making it affordable, easy, and fast to get up and running with.
Better still, it also pairs columnar database technology with awesome clustering capability, enabling massively parallel processing.
So let’s now look at massively parallel processing, and how that can benefit you.
What is Amazon Redshift? Massively Parallel Processing (MPP) Explained
In computer science, massively parallel processing means using large numbers of processors or computers to perform coordinated computations in parallel simultaneously. In terms of answering the question, what is Amazon Redshift, it means clustering.
Leveraging the underlying services provided by AWS and EC2, Redshift deploys as a cluster. In fact you cannot deploy a single node or server or Amazon Redshift, only a cluster.
An Amazon Redshift cluster is made up of a leader, with nodes behind it. So data is spread across the cluster (depending on the sort key strategy you specify for your tables) to optimise its ability to resolve queries in parallel.
When you issue a query to Amazon Redshift, that query is split and shared across the nodes of the cluster, which leads to near linear scalability. Correctly configured, an 8-node Amazon Redshift cluster will resolve many queries approximately twice as fast as a 4-node cluster. ‘Correctly configured’ is the operative phrase however—and for far more detail here, download our free e-book on the best practice configuration of Amazon Redshift.
Of course the cluster is deployed onto AWS, which means that all the benefits of IAAS and Cloud computing that the AWS platform offers are also passed on to Redshift. Nodes in clusters have “slices” which represent the processors on the underlying EC2 instances. You can choose between conventional (cheaper, bigger) or solid state (SSD – higher performance) disk storage technology. You can deploy, in a couple of clicks, to the availability zone (geographic location of the data centre) of your choice. And you can snapshot, rebuild, backup, re-size and turn on or off your Redshift cluster, all at the touch of a button. And for all this of course, you are only charged hourly.
What is Amazon Redshift? Distilled
What is Amazon Redshift? Simply put, it is a massively parallel, column-orientated database deployed on the AWS platform, with all the benefits that that in itself brings. And it is especially suitable for OLAP style workloads such as data warehouses, analytics, big data and ELT. Better still, it’s almost linearly scalable, and it can be got up and running quickly, easily and cheaply.
In short, no wonder Amazon Redshift is the fastest growing service on AWS.
To learn more about how Amazon Redshift works under the hood, download our free e-book.