Using the Twitter Query Component in Matillion ETL for Snowflake
Matillion’s “Twitter Query” component enables you to extract data from your Twitter account, and load it into Snowflake.
The Twitter Query component in Matillion ETL for Snowflake uses the ETL principle to load data from diverse sources into one Snowflake database. From there, you can take advantage of Snowflake’s power and scalability to transform, join and aggregate the data into shape for reporting or analytics.
Watch our tutorial video for a demonstration on how to set up and use the Twitter Query Component in Matillion ETL for Snowflake.
Data extraction architecture
Similar to most of Matillion’s Load/Unload components, acquiring data from Twitter is a two stage process.
Matillion connects to Twitter’s API using OAuth credentials. Before using the Twitter Query component you’ll first need to set up your OAuth credentials; this is a once-only setup task. Once authenticated, Matillion brings the data from Twitter across the network.
Matillion instructs Snowflake to bulk-load the extracted data. This is the fastest way to load the Twitter data into Snowflake from end to end.
The Twitter Query is a data Orchestration component. To use the Twitter Query component you need to create an Orchestration job and edit it. Locate the Twitter Query in the component list and drag it onto the Orchestration job canvas to edit it.
It’s always best to work down the property list from top to bottom. Some of the key setting are as follows:
- Authentication – set this to the name of the OAuth credentials you intend to use
- Data Source – choose a table or view from those available in the Twitter data model. There are about 20 of the main elements in the Twitter data model, including Tweets,
- Direct Messages, Followers and Lists
- Data Selection – choose the columns, depending on which Data Source you’re using
- Data Source Filter – you should normally enter some filters to restrict the data of interest
- Limit – it’s a good idea to set this to avoid rate limiting errors
- S3 Staging Area – choose one of your existing S3 buckets. It will be used temporarily during the bulk load (stage 2, as mentioned above)
- Target Table – choose a name for the new Snowflake table. Note this is a “staging” table, so you’ll need to move the data on after loading (see the next section)
After all the properties have been properly configured, the border of the component should turn green. This indicates that it’s ready to run!
Running the Extract and Load with the Twitter Query component in Matillion ETL for Snowflake
Matillion offers various ways to run Orchestration jobs, including its own built-in scheduler, which you can use to run jobs automatically. There is also an integration with SQS allowing you to synchronize Matillion jobs with an external scheduler.
During testing and development, however, you’ll probably just want to run it interactively with a right-click on the canvas.
The actual run-time will vary according to the complexity of the query, the amount of data returned, and whether Twitter apply rate limiting.
Once it’s working, you can add this Orchestration job into your daily schedule, and the new Snowflake table will be dropped and recreated every day, containing all the newly matching data.
Staging and Transformations
The Target Table property you chose is the name of a Snowflake “staging” table. Hence, it will be recreated every time the Twitter Query component runs. If the table already exists, it will be silently dropped first.
It’s deliberately designed this way, so you can do incremental loads and take advantage of Snowflake’s fast bulk loader. However, it does mean that you need to copy the newly-loaded data into a more permanent location after every load.
The usual pattern is to call a new Transformation job immediately after the Load.
Data returned from an API will very likely not conform to the required schema. To cleanse the data, introduce transformations, such as fixing datatypes and removing duplicates.
In the above example, the newly-loaded staging data has a meaningful Twitter “ID” column containing two pieces of information: a timestamp and an ordinal sequence number starting from 1.
To extract the date part, use a SUBSTRING in the Calculator expression:
To extract the ordinal number, take advantage of Snowflake’s sophisticated regular expression handling in a second Calculator expression:
Cleansing and enhancing the data in this way makes it ready for blending with data from other sources.
Want to try the Twitter Query component in Matillion ETL for Snowflake? Arrange a free 1-hour training session now, or start a free 14-day trial.