Webinar | Accessing your Data Lake Assets from Amazon Redshift Spectrum

Matillion is running a three-part webinar series on Amazon Redshift Spectrum, which is changing the way data is stored and queried, enabling complex analysis and better decision making. This third webinar focuses on accessing your data lake assets from Amazon Redshift Spectrum. Catch up on Part 1, Getting Started with Amazon Redshift Spectrum, or Part 2, Using Amazon Redshift Spectrum from Matillion ETL, on our website. You can watch the full webinar, Accessing your Data Lake Assets from Amazon Redshift Spectrum, here.

State of Data Warehousing - Data Warehousing Challenges Today by Greg Khairallah

Traditionally, our main data sources were CRM and financial in nature. However, our data is continually growing as we add new sources, such as social media and blog data. Today's analytical environments demand new types of data and faster access.

Amazon Redshift, a fully managed data warehouse, was developed to meet these changing demands. Clients across industries worldwide use Amazon Redshift for critical workloads, and it has been recognised as a leading data warehouse by the Forrester Wave, a third-party analyst report. Amazon Redshift lets users quickly and simply launch fully managed environments that scale up or down with workload needs, so you no longer need to forecast expected use years in advance. Amazon Redshift is also secure: the fully managed service includes backups, caching, and automated recovery, and it offers encryption options and access control restrictions, so you can be confident that sensitive customer data is stored safely. Furthermore, Amazon Redshift is compatible with a host of other solutions, such as Matillion. Finally, Amazon Redshift is a cost-effective data warehouse solution: the total cost of ownership can be less than $1,000/TB/year.

[Image: Benefits of Amazon Redshift Spectrum and data lakes]

The Paradigm Shift Enabled by Amazon Redshift Spectrum

We discussed the benefits of Redshift and Spectrum in our previous webinars, Part 1, Getting Started with Amazon Redshift Spectrum, and Part 2, Using Amazon Redshift Spectrum from Matillion ETL. Another benefit of the Amazon Redshift Spectrum approach is the ability to analyse any of the data in your entire data lake, not just the data held locally in Redshift. This allows you to tie in other data sources, such as social media or blog analytics data.

What is a data lake?

A data lake enables you to store large amounts of unstructured data. Because the data is stored as-is, you don't have to transform or convert it for storage, and it doesn't need to conform to a particular schema or categorisation. AWS enables you to build a data lake catalogued by Amazon Athena or AWS Glue, external to Amazon Redshift, and then query that data for analysis via Spectrum.
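As a sketch of what that catalogue entry looks like, a table can be declared in Athena (or the Glue Data Catalog) over files already sitting in S3, without moving or converting them. The database, table, column, and bucket names here are hypothetical:

```sql
-- Hypothetical Athena DDL: defines a table over CSV files already in S3.
-- No data is loaded; the files stay exactly where they are in the lake.
CREATE EXTERNAL TABLE aviation_demo.flights (
    carrier    STRING,
    origin     STRING,
    dest       STRING,
    dep_delay  INT
)
PARTITIONED BY (flight_date DATE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-data-lake/flights/';
```

Partitioning by date is optional, but it lets later queries scan only the days they need.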

Accessing your Data Lake Assets from Amazon Redshift Spectrum

Amazon Redshift Spectrum uses its own layer of compute nodes to process queries. You can therefore combine Redshift's data catalogue, which knows about the internal tables, with an Amazon Glue/Athena catalogue of external tables defined out in S3.

[Image: External table schema]

This takes the processing pressure off Amazon Redshift and pushes it out to Spectrum. You don't need to provision or manage this compute: Spectrum will automatically scale out to the power necessary to serve the query, and you are charged per query based on the amount of data processed.

[Image: Amazon Redshift Spectrum architecture]

This makes accessing your data lake assets from Amazon Redshift Spectrum simple and cost efficient.
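To make that concrete: once an external schema is attached (all names here are hypothetical), an ordinary-looking query scans only the S3 data it needs, and you pay per bytes scanned rather than for provisioned capacity:

```sql
-- Hypothetical query against an external (S3-backed) table.
-- Spectrum's own compute layer scans the files; the Redshift cluster
-- receives only the filtered, aggregated result.
SELECT carrier, AVG(dep_delay) AS avg_delay
FROM   aviation_demo.flights          -- external table over S3
WHERE  flight_date = '2017-06-01'     -- predicate pushed down to Spectrum
GROUP BY carrier;
```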

Using Matillion with Amazon Redshift Spectrum to Access Data Lake Assets

Matillion offers an easy-to-use interface to Amazon Redshift. To get started, you can launch an On-Demand cluster, which only takes a few minutes. Next, use Spectrum to make the Athena data available to Redshift by creating an external schema with a SQL script. When the environment refreshes, you should be able to see the external objects in the environment. In our example you can see the three tables under "aviation demo" that were previously visible in Athena.

[Image: Tables in Athena]

[Image: Matillion component]

This is how Amazon Redshift Spectrum makes them available in Redshift. As far as the user is concerned, they are just ordinary tables in Amazon Redshift.
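The external schema step can be sketched as a single statement run from Matillion's SQL component. The schema name, catalogue database, IAM role ARN, and region below are assumptions, not values from the demo:

```sql
-- Hypothetical: attach the Athena/Glue catalogue database "aviation_demo"
-- to Redshift as an external schema. All tables in that catalogue database
-- then appear in Redshift under this schema automatically.
CREATE EXTERNAL SCHEMA aviation_demo
FROM DATA CATALOG
DATABASE 'aviation_demo'
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'
REGION 'us-east-1';
```

Because the schema maps the whole catalogue database, new tables defined later in Athena show up in Redshift without any further DDL.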

How to use data lake assets in a transformation job

Once the external tables are available in Amazon Redshift, you can use the data in a transformation job within Matillion. In the demonstration we ran a number of different transformation jobs to show you the different options. We looked in depth at one, which filters out one day's worth of flight information. This is a typical operation that takes advantage of the power of Spectrum.

[Image: Transformation job]

With this process you retrieve only the data you need for the analytic job you want to perform. You thus process only what you need from your data lake, and reduce the storage used on Redshift if you decide to bring that data in locally.

[Image: Filter component]

This process created three Amazon Redshift objects, adding one to the original two we saw. You can then join the three Redshift objects with a SQL operation to create a full dataset, ready for analysis.

[Image: Join component]
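The filter-then-join pattern that Matillion generates can be sketched in plain SQL. The table and column names are hypothetical, standing in for the aviation demo objects:

```sql
-- Hypothetical: keep one day of flights from the external table.
-- Only that day's files are scanned in S3; the result lands locally.
CREATE TABLE flights_20170601 AS
SELECT carrier, origin, dest, dep_delay
FROM   aviation_demo.flights          -- external table (S3, via Spectrum)
WHERE  flight_date = '2017-06-01';

-- Join the filtered local table to another local dimension table
-- to produce the full dataset for analysis.
SELECT f.origin, f.dest, f.dep_delay, c.carrier_name
FROM   flights_20170601 f             -- local Redshift table
JOIN   carriers c                     -- hypothetical local dimension table
  ON   f.carrier = c.carrier_code;
```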

Write data to your data lake

You can also update or create new data using Amazon Redshift Spectrum and Athena, closing the loop: the data lake becomes writable, not just queryable. You can write the new data produced by the transformation job in Amazon Redshift back to S3, where it automatically flows back into Athena.

[Image: Write data]

Look in the S3 console and you should see the new file that you pushed to S3 from Amazon Redshift. When you preview the table in Athena, you can see the new records that were added.
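One way to sketch that write-back step is Redshift's UNLOAD command; the bucket path and IAM role below are assumptions. If the target prefix sits under an Athena table's LOCATION, the new files are visible to Athena on its next query:

```sql
-- Hypothetical: push the transformed result back to the data lake.
UNLOAD ('SELECT * FROM flights_20170601')
TO 's3://example-data-lake/flights_curated/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'
FORMAT AS PARQUET;   -- columnar output keeps later Spectrum scans cheap
```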

Questions from the Amazon Redshift Community

Is Spectrum available in AWS GovCloud?
Not yet, but thanks for the request; AWS will prioritise it.
Does it mean that AWS will put more resources into improving Spectrum than into the traditional Redshift DB engine?
Spectrum is a feature of Redshift, so they are one and the same.
Do we need a separate cluster for Spectrum, other than the Redshift cluster?
No. Spectrum is a feature of Amazon Redshift, so no separate cluster is needed. Spectrum is not yet available in all regions, but where it is, it is available to your existing cluster.
You indicate that Spectrum makes it possible to query the entire data lake. Suppose I have a Redshift cluster storing only financial data, but I now want to query Twitter data in my lake. Would I use a Redshift Spectrum external table in this case?
Yes, that is right. This example is a core capability of Amazon Redshift Spectrum. The financial data in Redshift and the Twitter data in your data lake can be in different formats, based on your preferences and performance needs. If you want to increase query performance, the data format matters: Twitter data in Parquet will be faster to query than CSV. You can do a join between external and local tables; the Redshift database just sees the external table as another table.
Does Spectrum consume a large part of the existing Redshift cluster's resources?
The actual execution of a query against external tables (data on S3) is external to the Redshift cluster. In that scenario, Spectrum processes the query. At query execution time, Amazon computes the amount of resources needed and can dynamically scale the compute.
Does populating and organising S3 have to be done outside of Redshift?
S3 is an object store, not a database. Furthermore, the Spectrum layer is read-only and therefore doesn't physically load data anywhere; as the data changes on S3, the query outputs also change. Within the Apache ecosystem, best practice is to update or add partitions, and the same practice should be applied to keep the Spectrum layer up to date.
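That partition-maintenance practice can be sketched as catalogue DDL; the table name, partition key, and S3 path here are hypothetical:

```sql
-- Hypothetical: register a newly arrived day of files with the catalogue
-- so the Spectrum layer can see them on the next query.
ALTER TABLE aviation_demo.flights
ADD IF NOT EXISTS PARTITION (flight_date = '2017-06-02')
LOCATION 's3://example-data-lake/flights/flight_date=2017-06-02/';
```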
Does Redshift support streaming data?
Kinesis Firehose can land data in S3 or Redshift, depending on your use cases and needs. There is more information available on Amazon Kinesis Analytics.
If one person has a smaller Redshift cluster and another has a bigger one, how does Spectrum differentiate the compute?
The two are independent. You can query an exabyte through Spectrum, and the compute would be independent of the size of your Redshift cluster. If you wanted to materialise the result to a table in Redshift, then cluster size would matter. There are a number of factors that Spectrum considers for push-down, including predicates, partition pruning, and aggregations.

Begin your data journey

Did you enjoy our Amazon Redshift Spectrum Webinar? Subscribe to the Matillion YouTube Channel to watch previous webinars and get alerts about new videos.
Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.