Where to store your data: Amazon Redshift vs. S3

If you're building data pipelines on AWS, you’ve probably asked yourself, “Where should this data live?” Amazon Redshift and S3 each have their benefits, so the short answer is that your data should probably live in both.

Key Takeaways:

  • Use Amazon S3 for cost-efficient, scalable storage.
  • Use Redshift when performance and fast analytics matter.
  • Use both when you need to scale storage and keep compute lean. Redshift Spectrum lets you query S3 without moving data.

Amazon S3 and Amazon Redshift serve different purposes, but each is increasingly complementary to the other. And thanks to tools like Redshift Spectrum and Matillion, you can now blur the lines between data lake and data warehouse to get the best of both worlds.

Matillion is a data integration platform that enables the orchestration and management of these hybrid pipelines seamlessly, whether working with data in S3, Redshift, or both.

In this article, we’ll break down when to use each, show how Redshift Spectrum fits in, and walk through an example using IoT data. You’ll also see how Matillion can orchestrate hybrid pipelines that give you flexibility without adding complexity.
 

S3 vs. Redshift: At a Glance

Before we dive in, here’s a quick side-by-side summary of Amazon Redshift vs. S3, to clearly display their core capabilities and differences. 

Feature | Amazon S3 | Amazon Redshift
What it does | Object storage service | Fully managed data warehouse
Best for | Storing raw, semi-structured or unstructured data | Working with large amounts of structured data using SQL
Cost | Cheap storage on a pay-per-use basis | More expensive storage, but high-performance queries
Performance | High latency, slower queries (via Spectrum) | Optimized for analytical performance
Integration | Easy with many AWS services | Tight integration with BI & analytics tools

Amazon Redshift vs. S3: Choosing The Right Storage Strategy

When deciding between S3, Redshift, or Spectrum, the key is to pick the best tool for the job at hand, rather than simply picking a ‘winner.’

With tools like Matillion, you can easily orchestrate hybrid pipelines that combine the best of each, allowing you to take advantage of both high-performance analytics and cost-effective storage. Let’s break down the strengths of each.

Amazon S3: Cost-Effective Storage at Scale

Amazon S3 is a great place to store raw data. It’s built to scale, supports any file format, and is pay-as-you-go. For log data, telemetry, sensor output, or data you don’t need to query often, it’s the obvious choice.

Compared with Amazon Redshift, S3 offers cheap and efficient data storage. However, those storage savings come with a performance trade-off: internal tables in Amazon Redshift work on data that has already been extracted and loaded into a table format, whereas data in S3 has to be read from files at query time.

Amazon Redshift: Performance for Analytics

Amazon Redshift is a high-performance data warehouse optimized for SQL-based analytics. It's great for structured data you want to slice, dice, and visualize fast.

You’ll typically want to load curated, transformed data into Redshift. This includes star schemas, aggregated metrics, and data you need to join frequently.

With Redshift, performance comes at a higher cost, but when you need fast dashboard loads or frequent joins, it pays off. While Redshift excels at performance, it’s not always the most cost-effective choice, and that’s where Redshift Spectrum comes in.

Redshift Spectrum: Query S3 Without Loading Data

Redshift Spectrum gives you the best of both worlds: you can run Redshift SQL queries on data that remains stored in S3. The S3 data is exposed to Redshift as an external table. That means:

  • No need to load data into Redshift tables
  • Lower storage costs
  • Seamless joins with other Redshift tables

This is especially useful for large historical datasets or infrequently accessed logs.
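
As a rough illustration of what sits behind an external table, the sketch below uses Python to assemble the two Redshift statements typically involved: one that maps a data catalog database to an external schema, and one that defines an external table over files in S3. The schema, table, column, bucket, and IAM role names are all hypothetical placeholders, and the Parquet file format is an assumption rather than something this article specifies.

```python
from typing import Sequence, Tuple

Column = Tuple[str, str]  # (column name, Redshift data type)


def external_schema_ddl(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Build the statement that maps a data catalog database to an external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_ddl(
    schema: str,
    table: str,
    columns: Sequence[Column],
    partitions: Sequence[Column],
    s3_location: str,
) -> str:
    """Build the statement that defines an external table over files in S3."""
    cols = ",\n".join(f"  {name} {dtype}" for name, dtype in columns)
    lines = [f"CREATE EXTERNAL TABLE {schema}.{table} (", cols, ")"]
    if partitions:
        parts = ", ".join(f"{name} {dtype}" for name, dtype in partitions)
        lines.append(f"PARTITIONED BY ({parts})")
    lines.append("STORED AS PARQUET")  # assumed file format
    lines.append(f"LOCATION '{s3_location}';")
    return "\n".join(lines)


if __name__ == "__main__":
    # Every name below is a made-up placeholder.
    print(external_schema_ddl(
        "spectrum", "iot_lake",
        "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
    ))
    print(external_table_ddl(
        "spectrum", "device_events",
        columns=[("device_id", "varchar(64)"), ("reading", "double precision")],
        partitions=[("event_date", "date")],
        s3_location="s3://example-bucket/events/",
    ))
```

The functions only produce SQL text. Running the statements against a cluster, and granting the role access to the bucket, is left to whichever client or orchestration tool you use.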

Tip: Aggregate and Filter in Spectrum, before loading into an internal table

Spectrum, which can be used with Matillion Data Productivity Cloud, is fantastic at filtering and aggregating very large datasets. The best performance comes from taking that load off Amazon Redshift: filter and aggregate in Spectrum first, then join the much smaller result with other tables in Amazon Redshift.
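
The query shape this tip describes can be sketched with a small runnable example. It uses Python's built-in sqlite3 module purely as a local stand-in, so it shows the shape of the SQL rather than Spectrum's behavior or performance. The tables device_events (playing the role of a large external table) and devices (a small internal table), and all of the data, are invented for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Stand-in for a large external table over files in S3.
        CREATE TABLE device_events (device_id TEXT, event_date TEXT, reading REAL);
        -- Stand-in for a small internal Redshift table.
        CREATE TABLE devices (device_id TEXT PRIMARY KEY, device_type TEXT);

        INSERT INTO device_events VALUES
            ('d1', '2024-05-01', 10.0),
            ('d1', '2024-05-01', 14.0),
            ('d2', '2024-05-01', 7.0),
            ('d2', '2024-04-30', 99.0),
            ('d3', '2024-05-01', 3.0);
        INSERT INTO devices VALUES
            ('d1', 'speaker'), ('d2', 'tv_stick'), ('d3', 'speaker');
        """
    )
    return conn


# Filter and aggregate the big table first, then join the small result.
AGGREGATE_THEN_JOIN = """
    WITH daily AS (
        SELECT device_id, COUNT(*) AS events
        FROM device_events
        WHERE event_date = '2024-05-01'
        GROUP BY device_id
    )
    SELECT d.device_type, SUM(daily.events) AS events
    FROM daily
    JOIN devices AS d ON d.device_id = daily.device_id
    GROUP BY d.device_type
    ORDER BY d.device_type
"""

# Same answer, but every matching raw event row takes part in the join.
JOIN_THEN_AGGREGATE = """
    SELECT d.device_type, COUNT(*) AS events
    FROM device_events AS e
    JOIN devices AS d ON d.device_id = e.device_id
    WHERE e.event_date = '2024-05-01'
    GROUP BY d.device_type
    ORDER BY d.device_type
"""


def events_per_device_type(conn: sqlite3.Connection, sql: str) -> list:
    """Run one of the queries above and return its rows."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(events_per_device_type(db, AGGREGATE_THEN_JOIN))
    print(events_per_device_type(db, JOIN_THEN_AGGREGATE))
```

Both queries return the same rows, `[('speaker', 3), ('tv_stick', 1)]`. The difference is that the first hands the join one row per device instead of one row per event, which is the reduction the tip is aiming for.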

Real-World Example: IoT Data Flow

To see how this works in practice, let’s look at an IoT data scenario, where data flows from connected devices into the cloud.

1. Data collection and load to S3

Data is collected by devices, such as Amazon Alexa, Echo or Fire TV Stick, and streamed into S3 via Kinesis Firehose.

  • Why are we sending data to S3?

By staging the data in S3 and accessing it via Spectrum, there is no data loading time since the data stays on S3.

2. Store data in S3

The data is streamed into an S3 bucket, which Spectrum can then read when we execute a job.

  • Why are we storing log data in S3?

S3 offers cheap and efficient data storage compared to Amazon Redshift. However, the storage benefits come with a performance trade-off, because the data has to be read into Amazon Redshift before it can be transformed.

3. Query and Combine via Matillion

Using Matillion Data Productivity Cloud with Amazon Redshift, you can create pipelines that:

  • Read data from S3 (via Spectrum)
  • Join with internal tables already in Redshift
  • Transform the results, perhaps adding derivations, enrichment, or aggregation, and save them into a new table

4. Create a New Table in Redshift

Matillion creates a table in Redshift with the results of the transformation and joins.

5. (Optional) Load your new table to S3

You can use Matillion’s Rewrite Table or Rewrite External Table components to push your output back to S3.

  • Tip: Partition your data!

Partitioning lets you place sensible breakpoints, based on the data, that split it into logical chunks. A query can then work on only the relevant partitions rather than the full dataset, and that work can be spread across multiple nodes, improving processing times and reducing cost.
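
To make the idea of breakpoints concrete, here is a minimal Python sketch that assigns records to date-based partitions using Hive-style key prefixes (`event_date=YYYY-MM-DD/`), and builds the Redshift statement that registers one such partition on an external table. Partitioning by day, the event_date column, the event_time field, and the bucket and table names are all assumptions made for the example; the right partition key depends on how your data is queried.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def partition_prefix(base_prefix: str, event_time: datetime) -> str:
    """Return the Hive-style prefix for a timezone-aware event time (UTC day)."""
    day = event_time.astimezone(timezone.utc).date().isoformat()
    return f"{base_prefix.rstrip('/')}/event_date={day}/"


def group_by_partition(
    base_prefix: str, records: Iterable[dict]
) -> Dict[str, List[dict]]:
    """Group records by the partition prefix their event_time falls into."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[partition_prefix(base_prefix, record["event_time"])].append(record)
    return dict(groups)


def add_partition_ddl(table: str, day: str, location: str) -> str:
    """Build the statement that registers one partition on an external table."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS\n"
        f"PARTITION (event_date='{day}')\n"
        f"LOCATION '{location}';"
    )


if __name__ == "__main__":
    # Hypothetical device records; only event_time matters for partitioning.
    records = [
        {"device_id": "d1", "event_time": datetime(2024, 5, 1, 23, 30, tzinfo=timezone.utc)},
        {"device_id": "d2", "event_time": datetime(2024, 5, 1, 8, 0, tzinfo=timezone.utc)},
        {"device_id": "d1", "event_time": datetime(2024, 5, 2, 0, 15, tzinfo=timezone.utc)},
    ]
    for prefix, group in group_by_partition("s3://example-bucket/events", records).items():
        print(prefix, len(group))
    print(add_partition_ddl(
        "spectrum.device_events", "2024-05-01",
        "s3://example-bucket/events/event_date=2024-05-01/",
    ))
```

With a layout like this, a query that filters on event_date only needs the files under the matching prefixes, which is where the savings in time and cost come from.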

Redshift vs. S3: Where Matillion Fits In

Hybrid data workflows are no longer the exception; they are the norm, so having a way to manage them effectively and efficiently is crucial.

That is where Matillion comes in, enabling the seamless management of hybrid data workflows.

So, Where Should Data Be Stored? Use Cases at a Glance

Choosing between S3, Redshift, or a hybrid approach isn’t always straightforward. It depends on the nature of your data and how you plan to use it. Here’s a quick guide to help you decide based on common scenarios:

Use Case Description | Best Storage Approach | Why It Works Well
Archiving logs, staging raw or semi-structured data | S3 | Minimizes costs with pay-as-you-go object storage. Ideal for data you rarely query
Dashboards, curated data marts, frequent joins | Redshift | High-performance SQL engine for fast, complex analytics workloads
Querying large datasets without loading into Redshift | S3 + Spectrum | Keeps storage costs low while enabling direct querying from Redshift
Real Time | S3 + Spectrum | New data, for example logfiles or IoT records, are immediately available to queries against Spectrum external tables
Near Real Time | S3 + Matillion + Redshift | Matillion quickly orchestrates ingestion and transformations, loading near real-time updates into Redshift for faster analytics without manual intervention
Cost-effective storage with transformation/orchestration | S3 + Matillion | Use Matillion to orchestrate transformations without moving data unnecessarily
Repeatable, high-performance analytics pipelines | Redshift + Matillion | Combines a fast warehouse with powerful ETL orchestration and transformation tools

Final Thoughts

There’s no one-size-fits-all answer, and that’s a good thing. With tools like Redshift Spectrum and Matillion Data Productivity Cloud, tailored solutions can be designed for any cost model or analytical need. 

Want help building the right architecture for your workloads? Book a demo or start a free trial to see how Matillion makes hybrid data easier.
