Matillion is running a 3-part webinar series on Amazon Redshift Spectrum. Amazon Redshift Spectrum is revolutionising the way data is stored and queried, allowing for complex analysis and, in turn, better decision making. The second webinar focuses on using Amazon Redshift Spectrum from Matillion ETL.
If you have not seen Part 1, "Getting Started with Amazon Redshift Spectrum", you can access the full webinar on YouTube or read a summary. Once you are comfortable with the basics, watch our second webinar for more in-depth information and demonstrations. This blog provides a summary of the full webinar.
A Quick Recap: Amazon Redshift Spectrum Architecture
Spectrum is a newly added product in the Amazon Redshift ecosystem. It allows you to write SQL to directly query very large data sets in S3, without first having to load them into Redshift.
Within the architecture, the new addition is a separate layer of Spectrum worker nodes that scale independently. Spectrum is fully managed and priced on a per-query basis. Amazon Redshift Spectrum shares a Data Catalog with Amazon Athena and AWS Glue, so the three can be used together. This is useful if you are pursuing a data lake strategy in S3, as it makes the same data sets available to all three products, which can subsequently be used in different ways.
We will look at utilising a data lake structure with Spectrum in our third webinar, "Accessing your Data Lake assets from Amazon Redshift Spectrum", on October 18th 2017, which you can register for now.
Accessing Amazon Redshift Spectrum
At the time of this webinar, Spectrum is only available in the United States. You can check AWS product availability on the AWS Region Table for updates. If you are outside the United States but want to try Spectrum, you can launch an Amazon Redshift cluster in a US region and then access it by adding a new Matillion environment.
Create external schemas with existing tables
There are several steps to creating an external schema, which is our connection to a Redshift Spectrum schema. The name of the external schema is what you will see in Matillion.
Next, set your Data Catalog, which will be Athena or Glue. The Data Catalog can be new or already defined; if it is already defined, you will see the tables already associated with the schema.
Lastly, set your IAM Role that Amazon Redshift can use to connect to Spectrum.
This creates the schema, and any existing tables will be brought through as specified. Once created, the schema is also visible in Athena and Glue.
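Under the hood, the steps above correspond to a single DDL statement in Redshift. A minimal sketch follows; the schema name, catalog database, and IAM role ARN are illustrative placeholders, not values from the webinar:

```sql
-- Create an external schema backed by the shared Athena/Glue Data Catalog.
-- All names and the IAM role ARN below are placeholders.
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
-- Create the catalog database if it does not already exist:
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```

Any tables already registered against `spectrum_db` in the catalog appear under `spectrum_demo` immediately.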
Create External Tables
If you have data files in S3, you may want to create external tables using the Create External Table component to bring that data through to Amazon Redshift. Configure the external table over the S3 bucket using the component properties in Matillion ETL. Choose the schema and target table, then set the S3 location where the table data is stored.
When the component is run, a table definition is created over the data in S3, so it can be easily queried from Amazon Redshift. Keeping the storage external reduces the disk space used on Amazon Redshift. Use the 'Plan' tab to view the table location, i.e. to identify the S3 bucket the data is coming from.
Using this process, you can bring in files in formats such as CSV, Parquet, RCFile, or SequenceFile directly from S3 into Amazon Redshift. The table you create is immediately available for use in jobs, such as transformations.
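The SQL that the Create External Table component generates looks roughly like this; the column list, bucket, and prefix are illustrative assumptions:

```sql
-- Define an external table over CSV files already sitting in S3.
-- Schema, column, bucket, and prefix names are placeholders.
CREATE EXTERNAL TABLE spectrum_demo.raw_flights (
    flight_date  DATE,
    carrier      VARCHAR(10),
    origin       VARCHAR(5),
    dest         VARCHAR(5),
    dep_delay    INTEGER
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/flights/';
```

No data is moved by this statement; it only registers a definition over the files at the given location.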
Use Matillion ETL Loading components with Amazon Redshift Spectrum
You can also bring in data from other external sources. Matillion ETL has over 50 query components to bring external data sources into Amazon Redshift to accompany your S3 data. In the webinar demonstration we showed how the Google Sheets Query component can bring Google Sheet data into Amazon Redshift. Within this component, the data was also landed in S3 with an Amazon Redshift table definition.
Building a transformation
A transformation component can be used on internal Amazon Redshift tables, tables in S3, and external data sources. Use standard transformation components from within Matillion ETL to manipulate the data and build data warehouses. A typical transformation job includes multiple steps, such as filtering the data and joining data from different sources to create a new table.
- Filter – Data can be filtered to look at a specific subset. For example, in our raw_flights data we can filter out any data prior to the year 2000, reducing the row count from ~123 million to ~59 million. With Amazon Redshift Spectrum, the filtering work is done in Spectrum before the data is pushed to Redshift. This can be viewed in the 'Plan'.
- Join – A Join can be used to combine the previously filtered raw_flights data with other sources, such as the plane_info dataset. At this stage you can also add internal Amazon Redshift tables. In our demo we use our internal airports data.
Again, the 'Plan' view can be used to take a closer look at how the data is being queried. In our demonstration, the join is done in Amazon Redshift, and we can see it isn't pushed down to Athena.
Spectrum is fantastic at filtering and aggregating very large datasets, and best performance comes from taking that load off Redshift. This means you should filter and aggregate in Spectrum before you start joining data, which can then be handled in Amazon Redshift.
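The split described above is visible in a single query. In this sketch (table and column names are illustrative), the filter on the external table is pushed down to the Spectrum layer, while the join to the internal table runs on the Redshift cluster:

```sql
-- Filter is pushed down to Spectrum (appears as an "S3 Query Scan"
-- step in EXPLAIN); the join to the internal airports table runs on
-- the Redshift cluster. All names here are placeholders.
SELECT a.airport_name,
       COUNT(*) AS flights_since_2000
FROM   spectrum_demo.raw_flights f      -- external table over S3
JOIN   public.airports a                -- internal Redshift table
  ON   f.origin = a.iata_code
WHERE  f.flight_date >= '2000-01-01'    -- evaluated in Spectrum
GROUP  BY a.airport_name;
```

Running `EXPLAIN` on a query like this shows which steps were handed to Spectrum and which were kept on the cluster.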
Unload existing Redshift tables to Amazon Redshift Spectrum
There are various use cases for unloading existing Amazon Redshift tables into Spectrum. Some common use cases are to save disk space in Amazon Redshift, for use in Athena or Glue, or to query/transform in Matillion ETL.
Let's say you have performed a transformation, the output is very large, and you want to rewrite the table to Spectrum to save disk space on your Amazon Redshift cluster. With the Create External Table component you select the schema, name the table, and choose where the data will live (the S3 bucket). Run the job to write the selected tables out to the specified S3 bucket under the export name. This also creates a table definition that can be queried from Amazon Redshift.
To do this at scale, use the SQL Script component in conjunction with the Export Table component to create a list of all the tables in a Redshift schema, then iterate over that list to unload each one into Spectrum.
You can follow a similar process to create an internal table in Amazon Redshift if required. Both tables can be queried, but the Amazon Redshift Spectrum table will not use disk space or processing power on the Redshift cluster when you query it.
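For a single table, the equivalent SQL is an `UNLOAD` followed by an external table definition. A minimal sketch, with placeholder table names, paths, and role ARN:

```sql
-- 1. Unload the large internal table to S3 (role ARN and paths are placeholders).
UNLOAD ('SELECT * FROM public.big_results')
TO 's3://my-bucket/unloads/big_results_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
DELIMITER '|' GZIP ALLOWOVERWRITE;

-- 2. Define an external table over the unloaded files so Spectrum can query them.
--    (Spectrum reads the gzipped text files based on their .gz extension.)
CREATE EXTERNAL TABLE spectrum_demo.big_results (
    id     BIGINT,
    value  VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/unloads/';

-- 3. Once verified, reclaim the disk space on the cluster.
DROP TABLE public.big_results;
```

The external copy can then be queried from Redshift without consuming cluster disk or processing power.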
Matillion Expert Tip
Partitioning your data can increase your query efficiency and potentially reduce the cost of your queries. First, create a folder structure within S3 (e.g. year, month).
To set up the partitions in Matillion, you need to create both the table and the partitions that will read the data. Again, for time series data, create a table that defines the partition columns (e.g. year, month). The Add Partition component will then create a series of partitions for the folders defined. Again, you can check the data flow by viewing the 'Plan'.
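In SQL terms, the setup looks roughly like the following; table, bucket, and folder names are illustrative assumptions:

```sql
-- Partitioned external table over year/month folders in S3 (names are placeholders).
CREATE EXTERNAL TABLE spectrum_demo.flights_part (
    carrier    VARCHAR(10),
    dep_delay  INTEGER
)
PARTITIONED BY (year INTEGER, month INTEGER)
STORED AS PARQUET
LOCATION 's3://my-bucket/flights_part/';

-- Register each folder as a partition; the Add Partition component
-- generates a statement like this per folder:
ALTER TABLE spectrum_demo.flights_part
ADD PARTITION (year = 2016, month = 1)
LOCATION 's3://my-bucket/flights_part/year=2016/month=1/';

-- Queries filtering on the partition columns scan only the matching folders:
SELECT AVG(dep_delay) FROM spectrum_demo.flights_part WHERE year = 2016;
```

Because Spectrum is billed on data scanned, pruning to a handful of partitions directly reduces query cost.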
Register for Part 3 of our Amazon Redshift Spectrum series on “Accessing your Data lake assets from Amazon Redshift Spectrum” on October 18th 2017 at 11am ET.
Questions from the Amazon Redshift Community
Does this work with any Redshift node type?
Yes it does. Any Amazon Redshift node type will let you access Spectrum. However, it is only available within the US at this time.
What is the cost of Amazon Redshift Spectrum?
Spectrum charges per query, based on the amount of data scanned, so the cost depends on your use case. From our testing we have found it to be very cost efficient. Further details on costing can be found on the AWS Redshift Pricing page. To help reduce the cost of your queries, we recommend partitioning your data in S3 so that Spectrum only scans the data you need.
What is the best approach to filtering and aggregating in S3? Can joins be done in Spectrum if the tables are external?
From our testing, no. Data is streamed into Amazon Redshift before joining; aggregations and filters are pushed down. If you are using partitions in your Redshift Spectrum data, aggregations along partition lines can be quick and cheap, but joins are done in Amazon Redshift.
What is the best format to store files in S3 for Spectrum?
We have completed some benchmarks at Matillion to see what works best in the real world. From our research and testing we have found that Parquet roughly doubled performance compared to CSV and compressed CSV. At this time we would recommend Parquet, but we want to do a lot more testing around this.
Will Matillion support Glue so that I can use Matillion for transformations and have it generate the pyspark code to run it on the Glue cluster?
Not at this time. We are interested in this, but all transformations currently run as SQL on the Amazon Redshift cluster. We are waiting for support and performance enhancements in AWS Glue before looking into PySpark further.
Is it possible to bypass S3 using local storage on EC2 for small files to load and unload using SSH tunnel for Redshift in Matillion?
We don't support using the SSH tunnel for data loading in Matillion. Files can be loaded from S3, Amazon Redshift, or external locations. We would add that small tables, especially very wide ones, are inefficient to store in Amazon Redshift because it has a minimum block size per column.
How does Amazon Redshift Spectrum compare to Snowflake?
Amazon Redshift (including Spectrum) and Snowflake are comparable products, though the comparison is very much dependent on your use case. Both have capabilities to create external tables over S3 that can be queried. Matillion ETL has products for both: you can create jobs in Matillion ETL for Amazon Redshift and convert them to Matillion ETL for Snowflake to compare cost and performance.
Updating records in an external table in Redshift Spectrum – is it good practice to do so or is better to update to internal tables in Amazon Redshift?
Neither. You can't update an external table in Spectrum, but you can append by adding files in S3. Therefore, it may be best to rewrite the table.
You can update internal tables in Amazon Redshift, but we wouldn't recommend it: DML-style updates are slow. Bulk updates, on the other hand, are okay, but it's better to re-create the records for enhanced performance. Some small-scale updates are probably fine. However, when working with large quantities of data, we would read the input table and build a transformation that writes a new output table containing the updates.
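The rebuild-instead-of-update pattern described above can be sketched in SQL; table and column names here are hypothetical:

```sql
-- Rebuild-instead-of-update: read the input table, apply the changes
-- in bulk, and write a fresh output table. All names are placeholders.
CREATE TABLE public.customers_new AS
SELECT c.id,
       COALESCE(u.new_email, c.email) AS email,  -- apply updates where present
       c.created_at
FROM   public.customers c
LEFT JOIN public.email_updates u ON u.id = c.id;

-- Swap the rebuilt table into place.
ALTER TABLE public.customers RENAME TO customers_old;
ALTER TABLE public.customers_new RENAME TO customers;
DROP TABLE public.customers_old;
```

This replaces many slow row-level updates with one set-based rebuild, which analytic databases handle far more efficiently.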
Generally across analytic databases, Amazon Redshift, Snowflake and Google BigQuery included, it’s best practice to avoid updating if you can.
Do you need a Redshift cluster to use Redshift Spectrum?
Yes. If you just wanted Spectrum's capability without Amazon Redshift, you could do a lot of that in Athena. If you have workloads in Amazon Redshift, you need at least a minimal Redshift cluster.
When would you use Athena versus Spectrum?
A query pushed down by Spectrum is essentially executed with the same technology that powers Athena. Athena has really good data query capabilities, but is weaker at joins and calculations across datasets. What you will miss the most, and the thing we use a lot in Matillion for data transformations, is the ability to perform a complex query across multiple datasets and output the result to a new dataset. This is difficult in Athena but easier in Spectrum, especially when used in conjunction with Matillion.