Blog| Matillion ETL for Amazon Redshift

    Using the S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file

    amazon s3 load matillion etl redshiftMatillion uses the Extract-Load-Transform (ELT) approach to delivering quick results for a wide range of data processing purposes: everything from customer behavior analytics, financial analysis, and even reducing the cost of synthesizing DNA.

    The S3 Load component presents an easy-to-use graphical interface, enabling you to connect to a file stored on an S3 Bucket and pull data from that file into Amazon Redshift. This is very popular with our customers to load data stored in files into Amazon Redshift and combine this data with data from additional external sources.

    The connector is completely self-contained: no additional software installation is required. It’s within the scope of an ordinary Matillion license, so there is no additional cost for using the features.

    In this example we will look at loading CSV files, containing flight data, stored on an S3 Bucket. We have one year’s worth of flights data per file.

    S3 Load Component

    In order to run the S3 Load Component, you first need to create a table in Amazon Redshift. This is the table you will load the CSV file into. You then need to configure the S3 Load component. While this is not difficult, since Matillion prompts you for all the information, it can take some time, especially when dealing with a data set that spans a lot of columns. To save time and reduce human error you can use the S3 Load Generator in Matillion ETL for Amazon Redshift.

    S3 Load Generator

    We recommend using the S3 Load Generator to quickly configure the necessary components (S3 Load Component and Create Table Component) to load the contents of the files into Amazon Redshift. Simply select the S3 Load Generator from the ‘Tools’ folder and drag it onto the canvas. The Load Generator will pop up.

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - generator

     

    Select the ellipsis next to S3 URL Location and then pick the file you want loaded into Amazon Redshift.

    From the list of S3 Buckets, you need to select the correct bucket and sub-folder. Then you can select the file want to load:

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - s3 bucket

     

    Back in the Load Generator, change the compression options in the drop down and select the compression type for the file:

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - file compression

     

    Matillon will automatically unzip Gzip and BZip2 files during the load.

    Click the Get Sample button to prompt Matillion to connect into the file and sample the top 50 rows of data. Matillion will guess the file type and also the metadata for the file based on the sample shown in the Load Generator.

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - get sample

     

    Here we can see Matillion has identified the file is a CSV file with a comma field delimiter and newline Record delimiter. Based on the sample of data, Matillion also identifies the top row is the header. Matillion will use this header, in addition to the data sample in order to identify the metadata for the table you are creating in Amazon Redshift.

    If required, you can change any of these fields.

    Click on the OK button and Matillion will generate the required components to load this data to Amazon Redshift:

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - generated components

     

    Running the S3 Load

    Before you can run the components, you need to connect them to the Start component. This will mean the Create/Replace Table component will turn green. The S3 Load will remain red as it is relying on the table in the first component to exist. However, you can still run the job because the first component will create the table you need.

    You can run the Orchestration job, either manually or using the Scheduler, to query your data and bring it into Amazon Redshift.

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - run components

     

    Advanced Options

    Object Prefix

    The S3 Load component supports Object Prefixes. Rather than creating one component per file to load into Amazon Redshift, you can change the first 2 digits of the file name as shown below:

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - object prefix

     

    When this is run it will then loop through all of the 20xx files and load them all into the same table in Amazon Redshift.

    Maximum Errors

    Another option frequently used is the ‘Maximum Errors’ property. You can use this to set the maximum number of individual parsing errors that cause the whole load to fail. These rows will either have columns with the incorrect metadata, e.g. a string where a date is expected or numbers which are too long, or will not be in the format you’d expected. By changing the ‘Maximum Errors’ setting, values causing parsing errors will be substituted as null values.

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - maximum errors

     

    Update Statistics

    Finally, you have the option to refresh the optimizer statistics at the end of a successful COPY command. Amazon Redshift updates statistics automatically if the table is initially empty. However, when you ingest data into a nonempty table and the load significantly changes the size of the table, we recommend turning the ‘StatUpdate’ property on.

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - update statistics

     

    Transforming the Data

    Once you have brought the required data from S3 into Amazon Redshift, you can use it in a Transformation job, perhaps to combine with existing data:

     

    S3 Load Generator Tool in Matillion ETL for Amazon Redshift to Load a CSV file - transformation

     

    In this way, you can build out the rest of your downstream transformations and analysis, taking advantage of Amazon Redshift’s power and scalability.

    Useful Resources

    S3 Load Component in Matillion ETL for Amazon Redshift
    S3 Load Generator in Matillion ETL for Amazon Redshift
    Create Table Component in Matillion ETL for Amazon Redshift
    Integration information

    Try the S3 Load Generator component in Matillion ETL for Amazon Redshift. Arrange a free demo, or start a free 14-day trial.