Webinar | Getting Started with Amazon Redshift Spectrum
Matillion is running a three-part webinar series on Amazon Redshift Spectrum, which is revolutionising the way data is stored and queried, allowing for more complex analysis and better decision-making. The first of these sessions focuses on how to get started.
Amazon Redshift Spectrum, The Basics
What is Amazon Redshift Spectrum?
A feature of Amazon Redshift that allows you to query data in S3 directly from within Amazon Redshift. Spectrum uses thousands of nodes and scales independently of your cluster to run fast queries against ‘external’ tables. It also allows tables to be seamlessly joined across Amazon Redshift and S3. Spectrum is priced per query.
Why would I use Amazon Redshift Spectrum?
Traditionally, when your data warehouse was reaching capacity, you would go to the CTO to request more resources to scale up. You would then spend a year or two filling up spare CPU sockets and memory slots and plugging more disks into your larger warehouse. Eventually, you run out of CPU and memory before you run out of storage. The result is far more data than you are able to process, and everyone’s queries suffer. Then the scalability cycle repeats.
Amazon Redshift on AWS looked to address this problem by fixing the amount of disk, CPU and memory provided by an instance type and allowing you to scale by adding more nodes. Every node brings its own CPU, memory and storage so, as you scale, performance doesn’t suffer.
Resizing a Redshift cluster is non-trivial, and scaling storage up and then back down for short-term projects is particularly awkward. Amazon Redshift Spectrum looks to address this problem, amongst others.
Spectrum allows storage to keep growing on S3 while still being processed in Amazon Redshift. Because Spectrum brings its own compute and memory, the hard work Redshift would otherwise have to do is done at the Spectrum layer. You can scale to larger amounts of data than the cluster could handle on its own disks.
A common use case for Amazon Redshift Spectrum is legacy data in S3 that you want to query in an ad hoc fashion rather than keep online in Amazon Redshift.
How does it work?
Amazon Redshift datasets are partitioned across the nodes, and at query time all of the nodes participate, resulting in quicker run times. Spectrum scales transparently, so you are not necessarily aware of the nodes being used.
The Spectrum service can see all data on S3 and knows a lot about it via a data catalogue. Spectrum communicates this data back to Redshift as tables.
Getting Started with Amazon Redshift Spectrum
In Redshift, set up an external schema, which acts as a pointer to a data catalogue (Athena, AWS Glue, or a Hive metastore).
Next, attach an IAM role to the Redshift cluster so that the cluster can delegate access permissions to the Amazon Redshift Spectrum service.
After that, you are able to set up your tables and queries using a SQL tool, such as Matillion.
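As a rough sketch, the external schema step might look like the following. The schema name, catalogue database name, and IAM role ARN are all placeholders — substitute your own.

```sql
-- Create an external schema backed by the AWS Glue / Athena data catalogue.
-- 'spectrum_schema', 'spectrumdb', and the role ARN are placeholder values.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```

The IAM role named here is the one attached to the cluster in the previous step; it needs read access to the catalogue and to the S3 data the external tables will reference.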
Check out a demo in our webinar on how to get started.
- Set up external schema
- Set up external tables
- Add an external schema to existing Matillion environment from within Matillion
Additionally, notes on Getting Started with Amazon Redshift Spectrum are available on our support portal.
Questions from the Amazon Redshift Community
What regions is Amazon Redshift Spectrum currently available in?
Check the Amazon Regional Product Services table to find out what Amazon services are available in your region.
How can I trial Amazon Redshift Spectrum to see if it’s useful for my organization?
Restore a snapshot of your existing Redshift cluster into a region where Spectrum is available. You need to be able to attach an IAM role which grants the required access privileges. You will then have a cloned cluster with current data and access in a sandbox environment. From there you can start experimenting: unload some of your Redshift tables to S3, define external tables over them, and change typical queries to reference the data in Spectrum.
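A minimal experiment along those lines might look like this — the table, bucket, and role names are placeholders, and the column list would need to match your actual data:

```sql
-- 1. Unload an existing Redshift table to S3 as delimited text.
UNLOAD ('SELECT * FROM sales')
TO 's3://my-spectrum-trial/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
DELIMITER '|';

-- 2. Define an external table over the unloaded files.
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id   INT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-spectrum-trial/sales/';

-- 3. Point a typical query at the external table instead of the local one.
SELECT sale_date, SUM(amount)
FROM spectrum_schema.sales
GROUP BY sale_date;
```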
Can Amazon Redshift Spectrum read from compressed files?
It definitely can; there is a list of supported compression formats (gzip, Snappy, bzip2). Spectrum uses the file extension to recognise a supported compressed file, so make sure your files are named accordingly.
Can we connect to Amazon Redshift Spectrum external schema from other data sources, such as Tableau?
External tools should connect and execute queries as expected against the external schema. If the tool works with fixed, known tables, it should work straight off. However, if the tool searches the Redshift catalogue to introspect tables and views, note that Spectrum tables and views are stored in a different part of the catalogue, so the tool might not discover them straight away. Amazon is working to provide a unified list of tables to mitigate this.
Can I partition S3 files, for example, monthly? If I add new files in S3 containing data for September, can I access those instantly without having to change the external table definition?
There are two ways you can do this. The first does not require partitioning. When an external table points at an S3 location, it is really pointing at a prefix (folder), and all the files under it are read when you query the table. If you add another data file to that location, formatted the same as the others, the next SQL query you run will see the new data. Alternatively, a partitioned table is an S3 location where each partition is a sub-folder; in that case there is an extra command to run after you add an entirely new partition. Writing daily, weekly, or monthly files and querying them as one table is a common use case.
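The partitioned variant could be sketched like this, with placeholder table, column, and bucket names — the point being that each month is a sub-folder registered as a partition:

```sql
-- A monthly-partitioned external table (all names are placeholders).
CREATE EXTERNAL TABLE spectrum_schema.events (
    event_id INT,
    payload  VARCHAR(256)
)
PARTITIONED BY (event_month CHAR(7))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/events/';

-- After writing September's files into a new sub-folder,
-- register that partition once; subsequent queries then see it.
ALTER TABLE spectrum_schema.events
ADD PARTITION (event_month = '2017-09')
LOCATION 's3://my-bucket/events/2017-09/';
```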
Can I write to external tables? Say, for example, a way to dump my Redshift data to a formatted file?
Yeah, definitely. The easiest way is to get Amazon Redshift to do an unload of the tables to S3; Matillion has tools to help iterate over all the tables in a schema to make that easier. Unload to S3, then create a table definition over the top. Register for the next webinar for a demonstration.
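The unload half of that, as a sketch with placeholder names, is a single statement:

```sql
-- Dump a Redshift table to delimited text files on S3
-- (table, bucket, and role are placeholder values).
UNLOAD ('SELECT * FROM public.orders')
TO 's3://my-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
DELIMITER '|'
ALLOWOVERWRITE;
```

An external table definition pointing at `s3://my-bucket/orders/` then makes those files queryable from Spectrum without reloading them.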
Can I create a data lake in S3 and then create external tables in Amazon Redshift to create a star schema? Or should I load physical tables to create a star schema?
Current best practice is a mixture of the two. Store large fact tables in partitions on S3 and access them through an external table; the dimension tables used to compute values are then stored in Redshift. This seems to work well. It is recommended that the fact table is partitioned by date where most queries specify a date or date range: because only the partitions known to contain data in that range are scanned, you save both time and money. Whether you create the table metadata with other AWS tools and then view it in Amazon Redshift, or create the tables in Redshift itself, will depend on your use case and the associated trade-offs.
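That hybrid pattern might look like the query below, with hypothetical table names: a large fact table on S3 (partitioned by date) joined to a dimension held in Redshift, where the date filter limits which partitions are scanned.

```sql
-- fact_sales is external (on S3, partitioned by sale_date);
-- dim_store is a normal Redshift table. All names are placeholders.
SELECT d.region,
       SUM(f.amount) AS total_sales
FROM spectrum_schema.fact_sales AS f
JOIN public.dim_store AS d
  ON d.store_id = f.store_id
WHERE f.sale_date BETWEEN '2017-09-01' AND '2017-09-30'  -- prunes partitions
GROUP BY d.region;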
Is there any limit on queries hitting the same file at the same time? Will there be any concurrency issues if multiple users are querying the same file or table?
As far as we know, the limitations here are the limitations of S3, and S3 scales to massive levels of concurrency. We have not seen issues, but perhaps we haven’t used enough data.
Does Spectrum create one catalog table for multiple files on S3 with the same schema?
The table in the catalogue points at a location, but that location may contain many files. With partitioning, one table definition can cover multiple file locations that are all part of the same table, so the data can be arranged in a number of flexible ways.
I currently have a Matillion job that exports data from Redshift into S3 for some processing. Once the processing is complete I import the data back into Redshift. It sounds like with Spectrum I will be able to directly access the processed data from S3. Is that correct?
That is true. Assuming a Redshift UNLOAD statement is being used, it will create data in a text file format that is compatible with Spectrum. Therefore, you don’t have to bring that data back into Redshift; you can query it directly through Spectrum.
Watch the full length webinar.