S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file

Using the S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file

amazon s3 load matillion etl redshiftPaul Johnson, Solution Architect

Matillion uses the Extract-Load-Transform (ELT) approach to delivering quick results for a wide range of data processing purposes: everything from customer behavior analytics, financial analysis, and even reducing the cost of synthesizing DNA.

The S3 Load component presents an easy-to-use graphical interface, enabling you to connect to a file stored on an S3 Bucket and pull data from that file into Amazon Redshift. This is very popular with our customers to load data stored in files into Amazon Redshift and combine this data with data from additional external sources.

The connector is completely self-contained: no additional software installation is required. It’s within the scope of an ordinary Matillion license, so there is no additional cost for using the features.

In this example we will look at loading CSV files, containing flight data, stored on an S3 Bucket. We have one year’s worth of flights data per file.

S3 Load Component

In order to run the S3 Load Component, you first need to create a table in Amazon Redshift. This is the table you will load the CSV file into. You then need to configure the S3 Load component. While this is not difficult, since Matillion prompts you for all the information, it can take some time, especially when dealing with a data set that spans a lot of columns. To save time and reduce human error you can use the S3 Load Generator in Matillion ETL for Amazon Redshift.

S3 Load Generator

We recommend using the S3 Load Generator to quickly configure the necessary components (S3 Load Component and Create Table Component) to load the contents of the files into Amazon Redshift. Simply select the S3 Load Generator from the ‘Tools’ folder and drag it onto the canvas. The Load Generator will pop up.

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - generator

 

Select the ellipsis next to S3 URL Location and then pick the file you want loaded into Amazon Redshift.

From the list of S3 Buckets, you need to select the correct bucket and sub-folder. Then you can select the file want to load:

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - s3 bucket

 

Back in the Load Generator, change the compression options in the drop down and select the compression type for the file:

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - file compression

 

Matillon will automatically unzip Gzip and BZip2 files during the load.

Click the Get Sample button to prompt Matillion to connect into the file and sample the top 50 rows of data. Matillion will guess the file type and also the metadata for the file based on the sample shown in the Load Generator.

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - get sample

 

Here we can see Matillion has identified the file is a CSV file with a comma field delimiter and newline Record delimiter. Based on the sample of data, Matillion also identifies the top row is the header. Matillion will use this header, in addition to the data sample in order to identify the metadata for the table you are creating in Amazon Redshift.

If required, you can change any of these fields.

Click on the OK button and Matillion will generate the required components to load this data to Amazon Redshift:

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - generated components

 

Running the S3 Load

Before you can run the components, you need to connect them to the Start component. This will mean the Create/Replace Table component will turn green. The S3 Load will remain red as it is relying on the table in the first component to exist. However, you can still run the job because the first component will create the table you need.

You can run the Orchestration job, either manually or using the Scheduler, to query your data and bring it into Amazon Redshift.

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - run components

 

Advanced Options

Object Prefix

The S3 Load component supports Object Prefixes. Rather than creating one component per file to load into Amazon Redshift, you can change the first 2 digits of the file name as shown below:

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - object prefix

 

When this is run it will then loop through all of the 20xx files and load them all into the same table in Amazon Redshift.

Maximum Errors

Another option frequently used is the ‘Maximum Errors’ property. You can use this to set the maximum number of individual parsing errors that cause the whole load to fail. These rows will either have columns with the incorrect metadata, e.g. a string where a date is expected or numbers which are too long, or will not be in the format you’d expected. By changing the ‘Maximum Errors’ setting, values causing parsing errors will be substituted as null values.

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - maximum errors

 

Update Statistics

Finally, you have the option to refresh the optimizer statistics at the end of a successful COPY command. Amazon Redshift updates statistics automatically if the table is initially empty. However, when you ingest data into a nonempty table and the load significantly changes the size of the table, we recommend turning the ‘StatUpdate’ property on.

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - update statistics

 

Transforming the Data

Once you have brought the required data from S3 into Amazon Redshift, you can use it in a Transformation job, perhaps to combine with existing data:

 

S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - transformation

 

In this way, you can build out the rest of your downstream transformations and analysis, taking advantage of Amazon Redshift’s power and scalability.

Useful Resources

S3 Load Component in Matillion ETL for Amazon Redshift
S3 Load Generator in Matillion ETL for Amazon Redshift
Create Table Component in Matillion ETL for Amazon Redshift
Integration information

Info

Try the S3 Load Generator component in Matillion ETL for Amazon Redshift. Arrange a free demo, or start a free 14-day trial.