Paul Johnson, Solution Architect
Matillion uses the Extract-Load-Transform (ELT) approach to deliver quick results for a wide range of data processing purposes: everything from customer behavior analytics and financial analysis to reducing the cost of synthesizing DNA.
The S3 Load component presents an easy-to-use graphical interface, enabling you to connect to a file stored in an S3 bucket and pull its data into Amazon Redshift. Many of our customers use it to load file-based data into Amazon Redshift and combine it with data from other external sources.
The connector is completely self-contained: no additional software installation is required. It’s within the scope of an ordinary Matillion license, so there is no additional cost for using the features.
In this example we will look at loading CSV files, containing flight data, stored in an S3 bucket. We have one year's worth of flight data per file.
S3 Load Component
In order to run the S3 Load component, you first need to create a table in Amazon Redshift. This is the table you will load the CSV file into. You then need to configure the S3 Load component. While this is not difficult, since Matillion prompts you for all the information, it can take some time, especially when dealing with a data set that spans many columns. To save time and reduce human error, you can use the S3 Load Generator in Matillion ETL for Amazon Redshift.
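Under the hood, this two-step set-up amounts to a CREATE TABLE in Amazon Redshift followed by a COPY from S3. The sketch below builds those two statements in Python; the table name, column list, bucket path, and IAM role are illustrative placeholders, not values taken from Matillion.

```python
def create_table_sql(table, columns):
    """Build a CREATE TABLE statement from (name, type) pairs."""
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return f"CREATE TABLE {table} (\n  {cols}\n);"

def copy_sql(table, s3_path, iam_role):
    """Build the COPY command that actually loads the CSV file."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "CSV IGNOREHEADER 1;"
    )

# Hypothetical flight-data columns, one per CSV field.
flight_columns = [
    ("flight_date", "DATE"),
    ("carrier", "VARCHAR(10)"),
    ("origin", "VARCHAR(5)"),
    ("dest", "VARCHAR(5)"),
    ("dep_delay", "INTEGER"),
]

print(create_table_sql("flights", flight_columns))
print(copy_sql("flights", "s3://my-bucket/flights/2008.csv",
               "arn:aws:iam::123456789012:role/redshift-load"))
```

The S3 Load Generator saves you from writing either statement by hand; it derives the column list for you from a sample of the file.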
S3 Load Generator
We recommend using the S3 Load Generator to quickly configure the necessary components (S3 Load Component and Create Table Component) to load the contents of the files into Amazon Redshift. Simply select the S3 Load Generator from the ‘Tools’ folder and drag it onto the canvas. The Load Generator will pop up.
Select the ellipsis next to S3 URL Location and then pick the file you want loaded into Amazon Redshift.
From the list of S3 buckets, you need to select the correct bucket and sub-folder. Then you can select the file you want to load:
Back in the Load Generator, open the compression drop-down and select the compression type for the file:
Matillion will automatically unzip Gzip and BZip2 files during the load.
Click the Get Sample button to prompt Matillion to connect to the file and sample the top 50 rows of data. Matillion will infer the file type, and the metadata for the file, from the sample shown in the Load Generator.
Here we can see Matillion has identified that the file is a CSV with a comma field delimiter and a newline record delimiter. Based on the sample of data, Matillion also identifies that the top row is the header. Matillion uses this header, together with the data sample, to infer the metadata for the table you are creating in Amazon Redshift.
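Matillion's sampling logic is internal, but Python's standard-library `csv.Sniffer` performs a comparable job on a text sample: it guesses the field delimiter and whether the first row is a header. The sample rows below are made-up flight data for illustration.

```python
import csv
import io

# A small hypothetical sample, like the top rows the Load Generator shows.
sample = (
    "FlightDate,Carrier,Origin,Dest,DepDelay\n"
    "2008-01-03,WN,IAD,TPA,8\n"
    "2008-01-03,WN,IND,BWI,-4\n"
)

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)        # guesses the field delimiter
has_header = sniffer.has_header(sample)  # guesses whether row 1 is a header

print(dialect.delimiter)  # ','
print(has_header)         # True

# Read the sample with the detected dialect and pull out the header row.
reader = csv.reader(io.StringIO(sample), dialect)
header = next(reader)
print(header)
```

As in Matillion, the guess is only as good as the sample, which is why the Load Generator lets you override any of the detected fields.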
If required, you can change any of these fields.
Click on the OK button and Matillion will generate the required components to load this data to Amazon Redshift:
Running the S3 Load
Before you can run the components, you need to connect them to the Start component. The Create/Replace Table component will then turn green. The S3 Load component will remain red, because it relies on a table, created by the first component, that does not yet exist. However, you can still run the job, because the first component will create the table the load needs.
You can run the Orchestration job, either manually or using the Scheduler, to query your data and bring it into Amazon Redshift.
The S3 Load component supports object prefixes. Rather than creating one component per file to load into Amazon Redshift, you can shorten the file name in the S3 path to a common prefix, here the first two digits ('20'), as shown below:
When this is run, it will loop through all of the 20xx files and load them into the same table in Amazon Redshift.
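This works because Amazon Redshift's COPY command treats the FROM path as a key prefix: one command loads every object in the bucket whose key starts with that prefix. The sketch below mimics that selection in plain Python; the bucket keys are illustrative.

```python
def keys_matched_by_prefix(keys, prefix):
    """Mimic how a COPY prefix selects objects in an S3 bucket."""
    return [k for k in keys if k.startswith(prefix)]

# Hypothetical bucket contents: one flight-data file per year, plus a stray file.
bucket_keys = [
    "flights/2005.csv",
    "flights/2006.csv",
    "flights/2007.csv",
    "flights/2008.csv",
    "flights/readme.txt",
]

# The prefix 'flights/20' matches every 20xx file in one pass.
print(keys_matched_by_prefix(bucket_keys, "flights/20"))
```

Note the stray `readme.txt` is not matched, so only the yearly CSV files end up in the target table.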
Another frequently used option is the ‘Maximum Errors’ property, which sets the maximum number of individual parsing errors allowed before the whole load fails. Rows that fail to parse will either have columns that don’t match the metadata, e.g. a string where a date is expected or numbers that are too long, or will not be in the format you expected. With a non-zero ‘Maximum Errors’ setting, rows causing parsing errors are skipped and the load continues, up to that limit.
Finally, you have the option to refresh the optimizer statistics at the end of a successful COPY command. Amazon Redshift updates statistics automatically if the table is initially empty. However, when you ingest data into a nonempty table and the load significantly changes the size of the table, we recommend turning the ‘StatUpdate’ property on.
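These two properties map onto the MAXERROR and STATUPDATE parameters of the Amazon Redshift COPY command. The sketch below renders those clauses from the property values; the function name and defaults are illustrative, not Matillion's internals.

```python
def copy_options(max_errors=0, stat_update=False):
    """Render the extra COPY clauses the two properties control."""
    opts = []
    if max_errors:
        # Allow up to this many failed rows before the load aborts.
        opts.append(f"MAXERROR {max_errors}")
    # Refresh optimizer statistics after a successful COPY when ON.
    opts.append("STATUPDATE " + ("ON" if stat_update else "OFF"))
    return " ".join(opts)

print(copy_options(max_errors=10, stat_update=True))
# MAXERROR 10 STATUPDATE ON
```

Appended to the COPY command from earlier, these clauses tolerate up to ten bad rows and re-analyze the table once the load completes.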
Transforming the Data
Once you have brought the required data from S3 into Amazon Redshift, you can use it in a Transformation job, perhaps to combine with existing data:
In this way, you can build out the rest of your downstream transformations and analysis, taking advantage of Amazon Redshift’s power and scalability.