Loading JSON Files into Amazon Redshift using the Matillion S3 Load Component
Matillion uses the Extract-Load-Transform (ELT) approach to deliver quick results for a wide range of data processing purposes: everything from customer behaviour analytics and financial analysis to reducing the cost of synthesising DNA.
The S3 Load component presents an easy-to-use graphical interface, enabling you to pull data from a JSON file stored in an S3 bucket into a table in a Redshift database. This is a popular way for our customers to load file-based data into Redshift and combine it with data from additional external sources.
The connector is completely self-contained: no additional software installation is required. It is covered by an ordinary Matillion license, so there is no additional cost to use this feature.
Watch our tutorial video for a demonstration on how to set up and use the S3 Load Component in Matillion ETL for Amazon Redshift to load JSON files.
In this example we have a JSON file containing details of different types of donuts sold; a snippet of the file is shown below:
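A file along the following lines illustrates the structure (the exact values here are assumptions, reconstructed from the batters and topping columns referred to later):

```json
{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" }
    ]
  },
  "topping": [
    { "id": "5001", "type": "None" },
    { "id": "5002", "type": "Glazed" }
  ]
}
```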
The first step in configuring the S3 Load component is to specify the Redshift table into which the data from the S3 file will be loaded. This table should already exist in the Redshift database and can be selected from the dropdown list.
Matillion gives you the option to specify which columns in the table are loaded with the data; if this property is left blank, all columns are loaded.
S3 URL Location
This is the S3 bucket, including a folder if required, where the file to be loaded is stored. Select the folder from the Edit Properties pop-up:
S3 Object Prefix
After choosing the S3 URL Location, the next step is to give the S3 Object Prefix of the file or files to be loaded. If many files match the object prefix, the S3 Load component will loop through them and load the contents of them all into the Redshift table.
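For example, given a hypothetical bucket layout such as:

```
s3://my-bucket/sales/donuts-2021.json
s3://my-bucket/sales/donuts-2022.json
s3://my-bucket/sales/donuts-2023.json
```

an object prefix of donuts- would match all three files, and each would be loaded into the target table in turn.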
Data File Type
The S3 Load component supports many different data file types, including Avro, Delimited, Fixed Width, and JSON. Select the required File Type from the drop-down:
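Behind the scenes, the S3 Load component builds a Redshift COPY statement from these properties. For a JSON load with automatic layout detection, the generated command is roughly equivalent to the sketch below (the table, bucket, and IAM role names are placeholders):

```sql
-- Copy JSON objects from S3 into an existing table,
-- letting Redshift match JSON keys to column names ('auto').
COPY donuts
FROM 's3://my-bucket/sales/donuts-'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS JSON 'auto';
```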
Running the S3 Load component
All other properties can be left at their default values; however, we recommend reviewing them, especially Encoding, Replace Invalid Characters, and Maximum Errors.
You can then run the Orchestration job, either manually or using the Scheduler, to bring your data into Redshift.
Transforming the Data
Once the required data has been brought from the JSON file into Redshift, it can be used in a Transformation job, perhaps to join it to existing data. A Transformation job can also be used to split out the data in the batters and topping columns if required.
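If the batters and topping data arrive as raw JSON text in their columns, Redshift's JSON functions can pull individual values out within a Transformation job. The table and column names below are assumptions carried over from the donut example:

```sql
-- batters holds JSON text such as {"batter": [{"id": "1001", "type": "Regular"}, ...]}.
-- Extract the type of the first entry in the batter array.
SELECT name,
       json_extract_path_text(
         json_extract_array_element_text(
           json_extract_path_text(batters, 'batter'), 0),
         'type') AS first_batter_type
FROM donuts;
```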
In this way, you can build out the rest of your downstream transformations and analysis, taking advantage of Redshift’s power and scalability.
JSON Paths file
In the above example, the JSON formatting was simple, so we left Matillion to detect the JSON layout automatically. However, the S3 Load component also supports adding a link to a JSON Paths file that defines the format of the JSON file. This example looks at a file called colours.json, which contains some nested code data:
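The contents of colours.json are not shown here; a file with nested code data along these lines illustrates the idea (the field names are assumptions for this sketch):

```json
{
  "colour": "red",
  "code": {
    "rgba": [255, 0, 0, 1],
    "hex": "#FF0000"
  }
}
```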
A JSON paths file has been written to export both of those codes into separate columns in the Redshift table:
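For a colours file whose codes sit under a nested code object (the field names here are assumptions), a JSON paths file maps one path expression to each target column in order:

```json
{
  "jsonpaths": [
    "$.colour",
    "$.code.rgba",
    "$.code.hex"
  ]
}
```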
This JSON paths file can be referenced in the JSON Layout property in the S3 Load Component:
When the S3 Load is run and the table is queried in a Transformation job, the codes are separated out:
The option to set a JSON paths file gives the user more flexibility over which data from the file is loaded into Redshift and how the data is formatted in the resulting table.