Webinar | Accessing your Data Lake Assets from Amazon Redshift Spectrum
Matillion is running a 3-part webinar series on Amazon Redshift Spectrum. Amazon Redshift Spectrum is revolutionising the way data is stored and queried, allowing for complex analysis and enabling better decision making. The third webinar focuses on Accessing your Data Lake Assets from Amazon Redshift Spectrum.
You can watch the full webinar, Accessing your Data Lake Assets from Amazon Redshift Spectrum, here.
State of Data Warehousing – Data Warehousing Challenges Today by Greg Khairallah
Traditionally, our main data sources were CRM and financial in nature. However, our data is continually growing as we add new sources, such as social media and blog data. In today’s analytical environment we demand new types of data and faster access.
Amazon Redshift, a fully managed data warehouse, was developed to meet these changing demands. Global clients across various industries and sectors use Amazon Redshift worldwide for critical workloads. Notably, it has been recognised as a leading data warehouse by Forrester Wave, a third-party analyst. Amazon Redshift allows users to quickly and simply launch fully managed environments that scale up or down based on workload needs. You no longer need to forecast expected use years in advance. Most noteworthy, Amazon Redshift is secure: the fully managed service includes backups, caching, and automated recovery, and it also offers encryption options and access control restrictions, so you can be confident that sensitive customer data is stored safely and securely. Furthermore, Amazon Redshift is compatible with a host of other solutions, such as Matillion.
In addition to these features, Amazon Redshift is a cost-effective data warehouse solution. The total cost of ownership can be less than $1,000/TB/year.
Paradigm Shift Enabled by Amazon Redshift Spectrum
We have discussed the benefits of Redshift and Spectrum in our previous webinars, Part 1: Getting Started with Amazon Redshift Spectrum and Part 2: Using Amazon Redshift Spectrum from Matillion ETL. Another benefit of the Amazon Redshift Spectrum approach is the ability to analyse any of the data in your entire data lake, not just the data stored locally in Redshift. This allows you to tie in other data sources such as social media or blog analytics data.
What is a data lake?
A data lake enables you to store large amounts of unstructured data. The ability to store unstructured data means that you don’t have to transform or convert your data for storage. The data doesn’t need to conform to a particular schema or categorization. AWS enables you to build a data lake using Amazon Athena or Glue external to Amazon Redshift and then call that data for analysis via Spectrum.
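For example, a data lake asset sitting in S3 can be described to the Athena/Glue catalogue with a simple external table definition. This is a minimal sketch: the `aviation_demo` database echoes the demo discussed later, but the bucket path and column names are hypothetical.

```sql
-- Athena DDL (Hive-style): register raw CSV files already in S3
-- as an external table. No data is moved or transformed.
CREATE EXTERNAL TABLE IF NOT EXISTS aviation_demo.flights (
  flight_date date,
  carrier     string,
  origin      string,
  dest        string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/aviation-demo/flights/';  -- hypothetical bucket
```

The table is only metadata; the files in S3 remain the single copy of the data.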
Accessing your Data Lake Assets from Amazon Redshift Spectrum
Amazon Redshift Spectrum uses its own layer of nodes to process queries. You can therefore combine the data catalogue that knows about the internal Redshift tables with the Amazon Glue/Athena catalogue to create an external table defined out in S3.
This takes the processing pressure off Amazon Redshift and pushes it out to Spectrum. You don't need to provision or manage this layer: Spectrum automatically scales out to the compute power necessary to return the query results. You are then charged per query based on the amount of data processed.
This makes accessing your data lake assets from Amazon Redshift Spectrum simple and cost efficient.
Using Matillion with Amazon Redshift Spectrum to Access Data Lake Assets
Matillion offers an easy-to-use interface into Amazon Redshift. To get started, you can launch an On-Demand cluster, which only takes a few minutes. Next, you can use Spectrum to make the Athena data available to Redshift by creating an external schema with a SQL script. When the environment refreshes, you should be able to see the objects in the environment. In our example you can also see 3 tables under "aviation demo" which were previously visible in Athena.
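The external schema step described above can be sketched as a single statement. This assumes a hypothetical IAM role ARN with access to the Glue/Athena catalogue; the database name mirrors the demo.

```sql
-- Run in Redshift (e.g. via a Matillion SQL Script component):
-- map the Athena/Glue database 'aviation_demo' into Redshift
-- as an external schema queryable through Spectrum.
CREATE EXTERNAL SCHEMA aviation_demo
FROM DATA CATALOG
DATABASE 'aviation_demo'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';  -- hypothetical role
```

Once created, the Athena tables appear in Redshift under `aviation_demo.*` and can be queried like any local table.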
This is how Amazon Spectrum makes them available in Redshift. As far as the user is concerned they are just ordinary data in Amazon Redshift.
How to use data lake assets in a transformation job
Once the data has been loaded into Amazon Redshift you can use it in a transformation job within Matillion. In the demonstration we showed a number of different transformation jobs to illustrate the options available. We looked in depth at one: filtering out one day's worth of flight information. This is a typical operation that takes advantage of the power of Spectrum.
With this process you can filter down to only the data you need for the analytic job you want to perform. Thus you process only what you need from your data lake, and you reduce the storage on Redshift if you decide to bring that data in locally.
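The filtering step might look like the following sketch, assuming the hypothetical `aviation_demo.flights` external table; the date and target table name are illustrative only.

```sql
-- Materialise one day's worth of flight data locally in Redshift.
-- The WHERE predicate is pushed down to Spectrum, so only the
-- matching rows are scanned and charged for, not the whole lake.
CREATE TABLE flights_one_day AS
SELECT flight_date, carrier, origin, dest
FROM aviation_demo.flights          -- external (Spectrum) table
WHERE flight_date = '2017-01-01';   -- hypothetical date filter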
This process created 3 Amazon Redshift objects, adding one to the original 2 we saw. You could then join the 3 Redshift objects with a SQL operation to create a full dataset, ready for analysis.
Write data to your data lake
You can update or create new data using Amazon Redshift Spectrum and Athena, therefore closing the loop: making the data lake writable and not just queryable. The new data produced by the transformation job in Amazon Redshift can be written back to S3, where it automatically flows back into Athena.
Look in the S3 console and you should see the new file that you pushed to S3 from Amazon Redshift. When you preview the table in Athena, you can see the new records that were added.
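The write-back step can be sketched with Redshift's UNLOAD command. The bucket path, source table, and IAM role ARN here are all hypothetical placeholders.

```sql
-- Push transformed results from Redshift back out to S3,
-- under a prefix the Athena external table already points at,
-- so the new records become immediately queryable in Athena.
UNLOAD ('SELECT * FROM flights_one_day')
TO 's3://my-bucket/aviation-demo/transformed/part_'   -- hypothetical prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'  -- hypothetical role
DELIMITER ','
ALLOWOVERWRITE;
```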
Questions from the Amazon Redshift Community
Is Spectrum available in AWS GovCloud?
Not yet, but thanks for your request; AWS will prioritise it.
Does it mean that AWS will put more resources on improving spectrum than the traditional Redshift DB engine?
Spectrum is a feature of Redshift, so they are one and the same.
Do we need a separate cluster for Spectrum other than Redshift cluster?
No. Spectrum is a feature of Amazon Redshift, so no separate cluster is required. Note, however, that Spectrum is not yet available in all regions.
You indicate that Spectrum makes it possible to query the entire data lake. Suppose I have a Redshift cluster only storing financial data, but I now want to query Twitter data in my lake. Would I use Redshift external Spectrum table in this case?
Yes, that is right. This example is a core capability of Amazon Redshift Spectrum. The financial data in Redshift and the Twitter data in your data lake can be in different formats based on your preferences and performance needs. If you want to increase query performance, the data format will matter. Twitter data in Parquet will be faster than CSV. You can do a join between external and local tables. The Redshift database just sees the external table as another table.
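A join between a local Redshift table and a Spectrum external table might be sketched as follows. All schema, table, and column names here are hypothetical, chosen to match the financial/Twitter example in the question.

```sql
-- Join local financial data in Redshift with Twitter data
-- sitting in the data lake, exposed via a Spectrum external schema.
SELECT f.account_id,
       COUNT(*) AS tweet_mentions
FROM   financial.accounts        AS f   -- local Redshift table
JOIN   spectrum_twitter.tweets   AS t   -- external (S3) table
  ON   t.ticker = f.ticker
GROUP BY f.account_id;
```

To the query planner, the external table is just another table; only the scan of `spectrum_twitter.tweets` is executed by the Spectrum layer.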
Does Spectrum consume a large part of the existing Redshift resource?
With external tables (data on S3), the actual execution of the query happens outside the Redshift cluster: Spectrum processes the query. At query execution time, Amazon computes the amount of resources required and can dynamically scale the compute.
Does populating and organising S3 have to be done outside of Redshift?
S3 is an object store, not a database. Furthermore, the Spectrum layer is read-only and therefore doesn't physically load data anywhere. As the data changes on S3, the query outputs also change. Within the Apache ecosystem, best practice would be to update or add partitions; these best practices apply equally to keeping the Spectrum layer up to date.
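Adding a partition to a Spectrum external table can be sketched like this, assuming the hypothetical `aviation_demo.flights` table is partitioned by `flight_date`; the bucket path is illustrative.

```sql
-- Register a new day's folder of files in S3 as a partition of the
-- external table, so Spectrum queries can find (and prune) it.
ALTER TABLE aviation_demo.flights
ADD IF NOT EXISTS PARTITION (flight_date = '2017-01-02')
LOCATION 's3://my-bucket/aviation-demo/flights/flight_date=2017-01-02/';
```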
Does Redshift support streaming data?
Kinesis Firehose can land data in S3 or Redshift, depending on your use case and needs. More information is available on Amazon Kinesis Analytics.
If someone has a smaller Redshift cluster and another has a bigger – how does Spectrum differentiate the compute?
The two are independent. You can query an exabyte through Spectrum, and that is independent of the size of your Redshift cluster. If you wanted to materialise the results into a table in Redshift, then cluster size would matter. There are a number of factors that Spectrum considers for push-down, including predicates, partition pruning, and aggregations.