Matillion was a proud sponsor at AWS re:Invent 2017. If you were able to view Andy Jassy’s keynote, like us, you probably got whiplash keeping up with all the new features and services flashed up on the big screen!
We have summarised the key AWS announcements from re:Invent 2017 that are most relevant to your work with Matillion ETL and Big Data.
S3 Select and Glacier Select
Amazon S3 is the core AWS solution for “blob” storage. This means that the contents of an S3 data file have always been opaque to S3. Until now!
A blob store retrieves the entire file every time, even if you only want to access a small part of it. S3 Select and Glacier Select turn this on its head, and introduce SQL-based filter expressions allowing you to return only parts of the data file. The aim is to reduce the amount of work necessary for downstream processing, to make it both faster and cheaper.
By taking on the task of interpreting the object, S3 Select and Glacier Select require that it’s parseable in some way. Currently, the supported formats are CSV and JSON, and the files may also be gzipped.
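To make the idea of a SQL-based filter expression concrete, here is a minimal sketch of an S3 Select request over a gzipped CSV file. It assumes the boto3 `select_object_content` call as it later shipped in the AWS SDK for Python; the bucket, key, and column names are hypothetical, and the preview API may have differed in detail.

```python
def build_select_request(bucket, key, expression, gzipped=False):
    """Build the keyword arguments for an S3 SelectObjectContent call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            # Treat the first CSV row as column names so the SQL can use them.
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP" if gzipped else "NONE",
        },
        "OutputSerialization": {"CSV": {}},
    }


def run_select(request):
    """Send the request and return the matching rows as text.

    Needs boto3 and AWS credentials, so it is not called in this sketch.
    """
    import boto3  # imported lazily: building a request needs no AWS SDK

    response = boto3.client("s3").select_object_content(**request)
    chunks = []
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


request = build_select_request(
    bucket="example-bucket",         # hypothetical bucket name
    key="sales/2017/orders.csv.gz",  # hypothetical object key
    expression="SELECT s.order_id, s.total FROM S3Object s WHERE s.country = 'GB'",
    gzipped=True,
)
print(request["Expression"])
```

Only the rows and columns matched by the expression are returned, so the downstream process never has to download or parse the rest of the file.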
Glacier Select will offer a choice of three retrieval speed tiers, so you are not tied to Glacier's normally slow retrieval times: expedited (1-5 minutes), standard (3-5 hours), and bulk (5-12 hours).
S3 Select and Glacier Select are currently in preview, but will be integrated with Athena and Redshift in 2018.
Amazon MQ is a fully managed implementation of Apache’s open source ActiveMQ message broker.
It is a producer/consumer messaging technology which can be used with many industry-standard APIs and protocols such as JMS, NMS, AMQP, STOMP, MQTT, and WebSocket. It introduces additional messaging functionality to that provided by Amazon Simple Queue Service (SQS) and Amazon Simple Notification Service (SNS). However, Amazon still recommends using SQS and SNS when you are building new infrastructure.
The core entity within MQ is the Message Broker. When you create a broker you have the option of using multiple availability zones in an active/standby configuration. The charge is per instance, per hour, and for storing data. If accessed from outside AWS you will also pay for network data transfer.
Amazon MQ clients are constantly connected to the designated endpoints, so messages are transferred rapidly. Support for many widely-used messaging standards makes this a great product to assist migrating an existing infrastructure into the cloud.
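As an illustration of what one of these standard protocols looks like on the wire, the sketch below encodes a STOMP `SEND` frame by hand. The queue name and message body are hypothetical, header escaping is left out for brevity, and a real client would normally use a STOMP library and a TLS connection to the broker endpoint rather than building frames itself.

```python
def stomp_send_frame(destination, body, content_type="application/json"):
    """Encode a STOMP SEND frame: command, headers, blank line, body, NUL byte."""
    payload = body.encode("utf-8")
    head = [
        "SEND",
        "destination:" + destination,
        "content-type:" + content_type,
        "content-length:" + str(len(payload)),
    ]
    # A blank line separates the headers from the body; a NUL byte ends the frame.
    return ("\n".join(head) + "\n\n").encode("utf-8") + payload + b"\x00"


# Hypothetical queue name and message body, for illustration only.
frame = stomp_send_frame("/queue/orders", '{"order_id": 42}')
print(frame)
```

Because the frame format is defined by the protocol rather than by the broker, an application that already speaks STOMP to an on-premises ActiveMQ should need little more than a new endpoint to talk to Amazon MQ.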
New EC2 instance types
Amazon announced the introduction of several new instance types, plus some changes to the existing family.
M5 is a general purpose instance type, using multi-core 2.5GHz Intel Xeon Platinum 8175M processors.
H1 instances are powerful and storage-optimized. They have a large amount of local magnetic storage and are specifically designed for use in MapReduce clusters or to host a distributed file system.
The T2 model is useful in cases where your compute needs are likely to vary greatly over short timescales, but tend to average out over time.
While it’s quiet, CPU credits accumulate. When there is a peak in demand, the credits can be “spent” in the form of a burst. T2 Unlimited allows your instance to sustain its “burst mode” capacity over a longer amount of time, at a small extra cost. You can choose T2 Unlimited as a checkbox option when launching an ordinary T2 instance. For this reason, it’s supported immediately by Matillion.
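The credit mechanism is easier to follow with numbers. Below is a deliberately simplified hourly model for a single-vCPU instance. The defaults (6 credits earned per hour, a maximum balance of 144) are the figures published for a t2.micro; everything else, including the function name and the hourly granularity, is an assumption made for illustration, and real T2 accounting is finer-grained.

```python
def simulate_credits(utilisation_by_hour, earn_rate=6, max_balance=144,
                     unlimited=False):
    """Return (achieved utilisation per hour, final balance, surplus credits).

    One CPU credit is one vCPU at 100% for one minute, so an hour at
    utilisation u (0.0 to 1.0) costs u * 60 credits.
    """
    balance = 0.0
    surplus = 0.0  # credits used beyond what was available (unlimited mode only)
    achieved = []
    for wanted in utilisation_by_hour:
        available = balance + earn_rate
        demand = wanted * 60
        # Standard T2 is throttled to the credits available; Unlimited is not.
        spent = demand if unlimited else min(demand, available)
        if spent > available:
            surplus += spent - available
            balance = 0.0
        else:
            balance = min(available - spent, max_balance)
        achieved.append(spent / 60)
    return achieved, balance, surplus


# Two quiet hours followed by two hours wanting 50% CPU.
load = [0.0, 0.0, 0.5, 0.5]
print("standard :", simulate_credits(load))
print("unlimited:", simulate_credits(load, unlimited=True))
```

In standard mode the instance is throttled once its balance runs out. In unlimited mode the burst continues, and the shortfall shows up as surplus credits, which is where the small extra cost comes from.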
Bare metal instances are still in preview for the remainder of 2017, but will enable you to run applications directly on AWS-hosted hardware, rather than through a virtualization layer.
Running on bare metal gives access to some specific optimizations that can be concealed by virtualization, and may also be needed to meet regulatory requirements.
Amazon Neptune is in preview in 2017, but will soon enter general availability as a graph database service to complement the relational RDS.
There will be a bulk upload API enabling you to load data into Neptune from S3. Once the data is in place you’ll be able to connect using SPARQL and TinkerPop (running Gremlin).
Graph models have many use cases, especially where linking between objects is of the greatest importance. There are many similarities between this and its relational equivalent: Data Vault modelling.
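To show the kind of question a graph model answers naturally, here is a small in-memory sketch that finds everything linked to a starting node within a given number of hops. It is plain Python rather than Gremlin or SPARQL, the node names are made up, and it says nothing about how Neptune itself stores or traverses data.

```python
from collections import deque


def reachable_within(edges, start, max_hops):
    """Return {node: hops} for nodes reachable from start in at most max_hops."""
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, []).append(target)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours.get(node, []):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)

    del hops[start]  # the starting node is not part of the answer
    return hops


# Hypothetical "knows" relationships, stored as directed edges.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin")]
print(reachable_within(edges, "alice", 2))
```

In a relational model the same question usually means one self-join per hop, which is why link-heavy data is often a better fit for a graph database.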
Once Multi-Master Aurora is past MySQL-only preview and enters general availability (expected in 2018), you’ll be able to have multiple “write” instances in an AWS Aurora database.
At the moment you may only choose one “write” instance, and failover is provided by promotable read replicas.
The ability to enable multiple write instances will lead to better scalability and resilience for compatible database engines.
Amazon’s current Aurora offering for MySQL or Postgres is great for fairly predictable workloads, since you choose your capacity in advance and can add read replicas to increase query performance.
Aurora Serverless goes beyond this, completely breaking the link between compute and storage. Your data remains in a private storage area but the compute capacity gets adjusted automatically according to demand, between a specified minimum and maximum.
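The "between a specified minimum and maximum" part can be pictured as a simple clamp. This is only a toy illustration: the function name and the capacity units are hypothetical, and it is not a description of how AWS actually decides when or how far to scale.

```python
def target_capacity(demand_units, minimum, maximum):
    """Clamp the capacity implied by current demand to the configured range."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return max(minimum, min(demand_units, maximum))


# Hypothetical demand levels against a configured range of 2 to 16 units.
for demand in (1, 6, 40):
    print(demand, "->", target_capacity(demand, minimum=2, maximum=16))
```

The minimum keeps the database responsive when it is quiet, and the maximum puts a ceiling on cost during a spike.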
This service is expected to be fully available for both Postgres and MySQL in 2018.
Amazon Comprehend is a natural language processing service for text. It is available in US and EU regions, and goes beyond plain text processing by extracting meaning from the text.
There are four capabilities aimed at interactive use via the API or the AWS console:
- Language identification
- Entity identification (people, places, brands etc)
- Key phrase detection
- Sentiment analysis
One additional capability is designed for batch mode use:
- Topic extraction – for organising sets of documents (i.e. at least 1,000 documents at a time)
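The four interactive capabilities map onto four API calls. The sketch below assumes the boto3 Comprehend client methods `detect_dominant_language`, `detect_entities`, `detect_key_phrases` and `detect_sentiment`; the wrapper function and its return shape are our own invention, and the client is passed in so the function can be tried with a stub before pointing it at AWS.

```python
def analyse_text(client, text, language_code=None):
    """Run the four interactive Comprehend capabilities over one piece of text."""
    language = language_code
    if language is None:
        # Pick the highest-scoring language from the identification result.
        languages = client.detect_dominant_language(Text=text)["Languages"]
        language = max(languages, key=lambda item: item["Score"])["LanguageCode"]

    entities = client.detect_entities(Text=text, LanguageCode=language)
    phrases = client.detect_key_phrases(Text=text, LanguageCode=language)
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language)
    return {
        "language": language,
        "entities": entities["Entities"],
        "key_phrases": phrases["KeyPhrases"],
        "sentiment": sentiment["Sentiment"],
    }


# With boto3 installed and AWS credentials configured, usage would look like:
#   import boto3
#   result = analyse_text(boto3.client("comprehend"), "We loved re:Invent 2017!")
```

Topic extraction is not included here because it runs as a batch job over a set of documents rather than as a single request.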
AWS re:Invent Recap
With all these new and updated features, Amazon Web Services is a natural trend setter. Whether you are an Amazon Redshift or Snowflake user, you can benefit from AWS’s aggressive innovation. In addition to native Amazon services, you also gain access to a wide ecosystem of partner solutions, including Matillion ETL. If you have any questions about the new features or Matillion ETL, feel free to contact us and we will follow up.