Using the Twitter Query Component in Matillion ETL for Amazon Redshift
Matillion’s “Twitter Query” component enables you to extract data from your Twitter account and load it into Amazon Redshift.
The Twitter Query component uses the ETL principle to load data from diverse sources into one Amazon Redshift database. From there, you can take advantage of Amazon Redshift’s power and scalability to transform, join and aggregate the data for reporting or analytics.
Watch our tutorial video for a demonstration on how to set up and use the Twitter Query Component in Matillion ETL for Amazon Redshift.
Data extraction architecture
Similar to most of Matillion’s Load/Unload components, acquiring data from Twitter is a two stage process.
Matillion connects to Twitter’s API using OAuth credentials. Before using the Twitter Query component you’ll need to set up your OAuth credentials; this is a once-only set up task. Once authenticated, Matillion brings the data from Twitter across the network.
Matillion then instructs Amazon Redshift to bulk-load the extracted data. This is the fastest way to load the Twitter data into Amazon Redshift from end to end.
The Twitter Query is a data Orchestration component. To use the Twitter Query component you need to create an Orchestration job and edit it. Locate the Twitter Query in the component list and drag it onto the Orchestration job canvas to edit it.
It’s always best to work down the property list from top to bottom. Some key setting are as follows:
- Authentication – set this to the name of the OAuth credentials you authenticated in Step 1 above
- Data Source – choose a table or view from those available in the Twitter data model. There are approximately 20 main elements in the Twitter data model, including Tweets, Direct Messages, Followers and List
- Data Selection – choose the columns depending on which Data Source you’re using
- Data Source Filter – filters are normally used to restrict the data of interest
- Target Table – choose a name for the new Amazon Redshift table. Note this is a “staging” table, so you’ll need to move the data on after loading (see the next section)
- S3 Staging Area – choose one of your existing Amazon S3 buckets. It will be used temporarily during the bulk load (stage 2, as mentioned above)
After all the properties have been properly configured, the border of the component should turn green. This indicates that it’s ready to run!
Running the Extract and Load
During testing and development, however, you’ll probably want to run it interactively with a right-click on the canvas.
The actual runtime will vary according to the complexity of the query, how much data is returned, and whether Twitter apply rate limiting.
Once it’s working, you can add this Orchestration job into your daily schedule and the new Amazon Redshift table will be dropped and recreated every day, containing all the newly matching data.
Staging and Transformations
The Target Table property you chose is the name of a Amazon Redshift “staging” table. That means it will be recreated every time the Twitter Query component runs. If the table already exists, it will be silently dropped first.
It’s deliberately designed this way so you can do incremental loads and take advantage of Amazon Redshift’s fast bulk loader. However, it does mean that you need to copy the newly-loaded data into a more permanent location after every load.
The usual pattern is to call a new Transformation job immediately after the Load.
Data returned from an API may not conform to the required schema. Transformations can now be introduced to cleanse the data: fixing datatypes and removing duplicates.
Data Quality Framework document provides a detailed guide on data cleansing. A typical use case for this example would be to make sure that the User IDs in a “user” extract are unique.
In the above example, the Rank component distinguishes between multiple instances of the same ID and the Filter component then chooses only the first instance in every case. The resulting cleansed data from this process can be subsequently blended in with other data sources.
Want to try the Twitter Query component in Matillion ETL for Amazon Redshift? Arrange a free 1-hour training session now, or start a free 14-day trial.