There are a number of ways to load data into Amazon Redshift, but when loading data into a table, the most efficient is the COPY command.
It is the most efficient method because the COPY command leverages Amazon Redshift's massively parallel processing (MPP) architecture, allowing it to read and load data in parallel from a number of sources at once.
In this article we look at Amazon's own suggestions for using the COPY command. Then, drawing on our own real-world experience of Amazon Redshift, we give our own advice on the matter.
For more information on the work we do around Amazon Redshift, visit our Amazon Redshift Partner page.
Use a COPY command to Load Data into Amazon Redshift
What is it?
A COPY command is used to get data into a table from various input sources, notably Amazon S3 and SSH.
You should definitely use this, and have it do automatic column encoding at the same time.
COPY is miles faster than single-row INSERT statements, and as it's likely that you'll be using Amazon Redshift with large volumes of data, you absolutely want to be using the COPY command.
There are things that actually slow down the COPY that are still good things to do – for example, setting a sort key will slow down the load into a table, but will (potentially) improve the performance of many queries subsequently made against it.
It’s really the only way to get large amounts of data loaded in the first place.
For Amazon's own best-practice recommendations around using the COPY command to load data into Amazon Redshift, click here.
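As a minimal sketch of what this looks like in practice (the table, bucket path and IAM role below are hypothetical placeholders), a basic COPY from Amazon S3 with automatic column encoding might be:

```sql
-- Load a table from S3, letting Redshift pick column encodings
-- automatically on an initially empty table (COMPUPDATE ON).
-- Table, bucket and IAM role names here are placeholders.
COPY sales
FROM 's3://my-bucket/load/sales/data'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
COMPUPDATE ON;
```

COMPUPDATE ON asks Redshift to analyse the incoming data and apply compression encodings as part of the load, which is the "automatic column encoding" mentioned above.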
Use a Single COPY Command to Load from Multiple Files
What is it?
The COPY command doesn't have to name a single file – it can reference many at once, via a common key prefix or a manifest file, which are then loaded as if they were all one big file.
You should absolutely use this function.
When loading from S3, split the data into a number of chunks roughly equal to (or greater than) the number of slices in the cluster, i.e. give every slice some work to do.
The load will then happen in parallel, with each slice taking data from its own dedicated file.
Why pay for an 8-node / 16-slice cluster, then force the COPY down to a single thread?
You can compress the files using gzip or lzop to save time uploading the files. COPY is then able to speed up the load process by uncompressing the files as they are read.
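For example (the file names, bucket and IAM role are hypothetical), splitting the data into gzipped parts that share a key prefix lets a single COPY load them all in parallel:

```sql
-- venue.txt.1.gz, venue.txt.2.gz, ... all share the key prefix
-- below, so one COPY loads every part, spreading the work across
-- the cluster's slices. GZIP tells Redshift to decompress on load.
COPY venue
FROM 's3://my-bucket/load/venue.txt'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;
```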
Compress your data files with gzip or lzop
What is it?
Input data can be stored in compressed form and decompressed by Redshift during the load (copy) operation.
Compression/decompression is a CPU-intensive task, but compared to ferrying gigabytes of data across the network (and especially from remote networks to get data up into S3), we find that the cost of compression/decompression is more than paid back in reduced bandwidth and data transfer time. (This really depends on your bandwidth of course).
We load data over SSH as well as from S3, and gzip is still possible there too (and still saves bandwidth). The trick is, when generating the data on the source system, to pipe it through gzip before returning it to STDOUT (and to tell Redshift you've done that via the GZIP option on the COPY command, of course!).
If you can, you should.
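As a hedged sketch of the SSH case (the host, manifest location and remote command are hypothetical), the COPY points at an SSH manifest whose remote command pipes its output through gzip:

```sql
-- The manifest at the location below contains an entry whose
-- "command" runs something like:  cat /data/export.txt | gzip
-- The SSH option marks the FROM clause as a manifest location,
-- and GZIP tells Redshift the streamed data is compressed.
COPY sales
FROM 's3://my-bucket/ssh_manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
SSH
GZIP;
```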
Matillion ETL for Amazon Redshift
Simplify data management and unlock Redshift’s potential with Matillion ETL for Amazon Redshift – an ETL/ELT tool built specifically for Amazon Redshift.
Matillion ETL for Amazon Redshift pushes the data transformation down to Redshift, meaning you can process millions of rows in seconds, with real-time in-job feedback and linear scalability.
Delivered as an AMI on the AWS Marketplace, Matillion ETL for Amazon Redshift can be up and running in a matter of minutes.
Right now you can get a FREE 14 DAY TRIAL of Matillion ETL for Amazon Redshift on the AWS Marketplace.
For more best-practice advice and information on optimising Amazon Redshift performance, download our free guide below.