Using Machine Learning to gain more insights from your data
As organizations have modernized their business intelligence architecture by moving analytics workloads into the cloud, they have opened the door to other cloud services that can extract deeper insights from their data. Machine learning is a prominent example, with new cloud services being introduced regularly, and adopting complex processes such as machine learning into your data pipelines has never been simpler. For example, here’s how you can use machine learning to apply sentiment analysis to tweets fetched from Twitter.
Integrating machine learning with a cloud-based business intelligence architecture
Speed, simplicity and scale are common factors that drive organizations towards a modern cloud-based business intelligence architecture. Leveraging cloud services for complex operations such as machine learning can yield all of these cloud benefits.
The high level architecture of what’s been implemented in this article is represented in the following diagram:
For sentiment analysis, we are using Amazon Comprehend, a fantastic natural language processing (NLP) AWS service. The rest of the business intelligence architecture is already in AWS, so leveraging existing AWS services for machine learning was an easy choice (and simple to implement!). However, we could have just as easily integrated with other cloud-based machine learning services, such as IBM Watson, Google Natural Language, or many of the other machine learning services that exist today. Most of these types of services are easy to set up and integrate into your architecture, and can help you gain insights from your data faster.
Implement asynchronous integration
Cloud data warehouses (CDWs) are well suited for large scale data workloads, especially when working with large sets of data at a time. The ability of a CDW to handle these volumes of data makes it a great platform for performing those large data transformation workloads, and the performance can scale as your data sets grow.
When integrating with Amazon Comprehend for sentiment analysis, we can implement a synchronous or asynchronous integration. A synchronous integration works by sending individual tweets to Amazon Comprehend and waiting for a response that includes the derived sentiment of the tweet.
An asynchronous integration works by sending a set of tweets for Amazon Comprehend to process. When complete, the results are then made available in one or more flat files that can then be consumed. With scalability and speed in mind, an asynchronous integration with Amazon Comprehend makes the most sense in this case.
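To make the contrast concrete, here is a minimal Python sketch of the synchronous pattern, where every tweet costs one request and one wait. It assumes a client object that exposes the Amazon Comprehend `detect_sentiment` operation (for example, one created with `boto3.client("comprehend")`). The `FakeComprehendClient` class is a hypothetical stand-in so the sketch runs without AWS credentials; it is not part of the original job.

```python
from typing import Any, Dict, Iterable, List


def score_tweets_synchronously(client: Any, tweets: Iterable[str],
                               language_code: str = "en") -> List[Dict[str, Any]]:
    """Send tweets one at a time and wait for each response."""
    results = []
    for text in tweets:
        # One network round trip per tweet: simple, but slow at large volumes.
        response = client.detect_sentiment(Text=text, LanguageCode=language_code)
        results.append({
            "text": text,
            "sentiment": response["Sentiment"],
            "scores": response["SentimentScore"],
        })
    return results


class FakeComprehendClient:
    """Hypothetical stand-in that returns a fixed response and counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def detect_sentiment(self, Text: str, LanguageCode: str) -> Dict[str, Any]:
        self.calls += 1
        return {
            "Sentiment": "NEUTRAL",
            "SentimentScore": {"Positive": 0.1, "Negative": 0.1,
                               "Neutral": 0.7, "Mixed": 0.1},
        }
```

With three tweets the fake client is called three times. That per-document round trip is the cost the asynchronous, file-based approach avoids.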
Matillion job overview
All jobs referenced in this blog are attached as a single job export at the end of this blog.
Get tweets from Twitter for sentiment analysis
To get data from Twitter to load into our CDW, we will use the Twitter Query component in Matillion ETL. This component makes the process of fetching data from Twitter and loading the results into a CDW very simple to configure and implement.
Before you can start using the Twitter Query component, an OAuth profile must be configured to allow Matillion ETL to fetch data from Twitter. Once an OAuth profile has been set up for authentication, the rest of the configuration of the Twitter Query component is quite simple. For our purposes, we will use the Tweets data source in the component to fetch tweets that match a specified search term.
In our job, the actual search term value is a variable (jv_searchterm), and the actual value is metadata driven within the job. That metadata is defined in the GetTweets-InitializeData transformation job (see attached job export) that is executed as the first step in the main job.
Twitter API limits
The Twitter Query component works by using Twitter APIs to fetch data. Twitter’s Standard API endpoints have rate limits that need to be accounted for. Fortunately, the Twitter Query component is able to work around these rate limits by specifying a MaxRateLimitDelay Connection Option.
In the above screenshot, a Job Variable is used to define the MaxRateLimitDelay. As seen below, this variable, jv_sleeptime, has been set to 905 seconds. This covers the 15 minute (900 second) window used to enforce Twitter API rate limits, plus a few seconds of margin. This value should be adjusted based on the applicable rate limits for your Twitter account.
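The arithmetic behind the 905 second value can be written out as a small Python sketch. Splitting it into a 15 minute window plus a 5 second buffer is our reading of the number, not something the Twitter Query component documents. The constant and function names are hypothetical.

```python
RATE_LIMIT_WINDOW_MINUTES = 15  # Twitter Standard API rate-limit window
SAFETY_BUFFER_SECONDS = 5       # assumed margin so the window has fully reset


def max_rate_limit_delay(window_minutes: int = RATE_LIMIT_WINDOW_MINUTES,
                         buffer_seconds: int = SAFETY_BUFFER_SECONDS) -> int:
    """Return the delay, in seconds, to wait out one full rate-limit window."""
    return window_minutes * 60 + buffer_seconds
```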
Preparing and sending tweets for sentiment analysis
In this exercise, we are using Amazon Comprehend to perform sentiment analysis on the tweets we have captured. As mentioned earlier, to ensure scalability and speed, we are integrating with Amazon Comprehend asynchronously, specifically through the StartSentimentDetectionJob operation. To get data to Amazon Comprehend using this method, a file is put in an S3 bucket and then an Amazon Comprehend job is executed to process it. When Amazon Comprehend is done analyzing the data set, it will drop an output file into a specified S3 bucket.
When creating the input file for Amazon Comprehend, there are 2 input formats that can be defined. We are going to use the ONE_DOC_PER_LINE format, which means that each line in the input file will be an entire document and Amazon Comprehend will score the sentiment of that document as a whole. In our case, each line of the file will be a single tweet. Additional data (like the tweet id or search term) is not included in the input file, but we are using the file name to help pass through some metadata about the tweets being analyzed.
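As an illustration of the ONE_DOC_PER_LINE idea, the Python sketch below flattens each tweet onto a single line and encodes the search term and date in the file name. The attached job does this inside Matillion; the function names, the `__` separator, and the `.txt` extension here are assumptions for illustration only.

```python
import re
from datetime import date
from typing import Iterable, Tuple


def to_one_doc_per_line(tweets: Iterable[str]) -> str:
    """Render tweets as file content with exactly one tweet per line."""
    lines = []
    for tweet in tweets:
        # Tweets may contain newlines; collapse all whitespace so that one
        # tweet can never span two lines (and become two documents).
        flattened = " ".join(tweet.split())
        if flattened:  # skip tweets that are empty after flattening
            lines.append(flattened)
    return "\n".join(lines) + ("\n" if lines else "")


def build_input_filename(search_term: str, created: date) -> str:
    """Encode the search term and create date in the file name."""
    # Lossy slug: anything that is not a letter or digit becomes a hyphen.
    safe_term = re.sub(r"[^A-Za-z0-9]+", "-", search_term).strip("-").lower()
    return f"{safe_term}__{created.isoformat()}.txt"


def parse_input_filename(filename: str) -> Tuple[str, date]:
    """Recover the (slugged) search term and date from a file name."""
    stem = filename.rsplit(".", 1)[0]
    term, day = stem.rsplit("__", 1)
    return term, date.fromisoformat(day)
```

Because the file name comes back in the analysis results, parsing it is enough to reattach the search term and date without ever sending tweet ids to the service.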
Thinking about how we wanted to visualize the analyzed data and wanting to keep the analysis somewhat anonymous, we decided to prepare Amazon Comprehend input files where each input file represents the captured tweets for a particular search term by date. Segmenting the data in this way can be done very easily in a Matillion transformation job, using standard cloud data warehouse features. In our transformation job, we define a view that aggregates tweets by search term and create date. This view will be used to drive an iteration loop, where each iteration will result in an Amazon Comprehend input file being generated.
We then iterate over the view using a Table Iterator component. Each iteration calls an orchestration job, which creates the Amazon Comprehend input file for each execution.
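The view-plus-iterator pattern amounts to a group-by followed by a loop. Here is a rough Python equivalent, assuming each captured tweet is a dictionary with hypothetical `search_term`, `created_at`, and `text` keys. In the real job the grouping happens in the cloud data warehouse and the loop is the Table Iterator.

```python
from collections import defaultdict
from datetime import datetime
from typing import Any, Dict, Iterable, List, Tuple

GroupKey = Tuple[str, str]  # (search term, create date as YYYY-MM-DD)


def group_tweets(rows: Iterable[Dict[str, Any]]) -> Dict[GroupKey, List[str]]:
    """Group tweet texts by search term and create date, like the view does."""
    groups: Dict[GroupKey, List[str]] = defaultdict(list)
    for row in rows:
        key = (row["search_term"], row["created_at"].date().isoformat())
        groups[key].append(row["text"])
    return dict(groups)


def iterate_groups(groups: Dict[GroupKey, List[str]]) -> List[str]:
    """Visit each group once, as the iterator would; return a label per file."""
    labels = []
    for (term, day), texts in sorted(groups.items()):
        # Each iteration corresponds to one Amazon Comprehend input file.
        labels.append(f"{term} {day}: {len(texts)} tweet(s)")
    return labels
```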
Once all of the Amazon Comprehend input files have been created, we can then tell Amazon Comprehend to analyze all of the generated files. While Matillion ETL does not have a component that will directly trigger an Amazon Comprehend job to execute, every Matillion ETL instance does come with the AWS CLI pre-installed. The AWS CLI makes it very easy to interact with various AWS services from a Matillion orchestration job using the Bash Script component.
The AWS CLI Comprehend command executed in our example job is start-sentiment-detection-job, which is described in the AWS CLI documentation. Note also the AWS documentation that outlines the IAM permissions required to execute the Amazon Comprehend job. In our Matillion job, all of the parameters passed to the AWS CLI command are defined as Matillion Job Variables. Also note the use of Automatic Variables, which help distinguish the Matillion generated Amazon Comprehend jobs by giving the input S3 folder and the Amazon Comprehend job unique names that can be tied back to a single Matillion job execution.
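For readers who want to see the shape of the call, this Python sketch assembles the argument list for `aws comprehend start-sentiment-detection-job`. The option names are those of the AWS CLI command. The function names, and the idea of building the job name from a prefix and a run identifier, are illustrative assumptions and are not copied from the attached job.

```python
from typing import List


def unique_job_name(prefix: str, run_id: str) -> str:
    """Build a job name that can be traced back to one job execution."""
    return f"{prefix}-{run_id}"


def build_start_job_command(input_s3_uri: str, output_s3_uri: str,
                            role_arn: str, job_name: str,
                            language_code: str = "en") -> List[str]:
    """Assemble the AWS CLI call as an argument list (nothing is executed)."""
    return [
        "aws", "comprehend", "start-sentiment-detection-job",
        # AWS CLI shorthand syntax for the structure-valued options.
        "--input-data-config",
        f"S3Uri={input_s3_uri},InputFormat=ONE_DOC_PER_LINE",
        "--output-data-config", f"S3Uri={output_s3_uri}",
        # IAM role that Amazon Comprehend assumes to read and write S3.
        "--data-access-role-arn", role_arn,
        "--job-name", job_name,
        "--language-code", language_code,
    ]
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine with the AWS CLI installed. In the Matillion job the same command is written directly in a Bash Script component.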
Receiving and loading sentiment analysis output
Because the Amazon Comprehend job we have executed runs asynchronously, we need a separate Matillion job to load and transform the data received from Amazon Comprehend when the job completes. When integrating with an asynchronous process, you would typically choose either a “polling” mechanism that checks periodically whether results are ready to load, or an “event driven” mechanism that executes once results are available. Being cloud native, Matillion ETL has native integrations with some helpful cloud features that allow you to easily implement an “event driven” pattern. The output of an Amazon Comprehend job is a file that is placed into an S3 bucket. We can use the arrival of that file as the event that triggers the Matillion job to ingest and process the Amazon Comprehend output file. This article describes how you can implement this event driven pattern with Matillion in AWS.
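To show what "the file arriving is the event" looks like in practice, the Python sketch below pulls the bucket and object key out of an S3 event notification message. It assumes the standard S3 notification JSON layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`). How the message reaches Matillion is outside this sketch, and the function name is hypothetical.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus


def extract_s3_objects(event_body: str) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for newly created objects in an S3 event."""
    event = json.loads(event_body)
    objects = []
    for record in event.get("Records", []):
        # Only react to object creation, e.g. "ObjectCreated:Put".
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Each returned pair identifies one output file to hand to the loading job, so no polling loop is needed.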
The Matillion job that ingests and transforms the Amazon Comprehend output data works in a simple 3 step process.
- Step 1 – Prepare the file to be loaded into the cloud data warehouse.
- Step 2 – Load the data into the cloud data warehouse.
- Step 3 – Transform the data into an analytics ready state.
The Bash Script component that is the first step here does some basic file preparation steps and moves the Amazon Comprehend output file into a specified S3 location that acts as a loading area.
The data in the Amazon Comprehend output file is in JSON format. Each JSON element represents an analyzed tweet from one of the Amazon Comprehend input files. Details such as the filename of the input file and the sentiment analysis score are embedded within the JSON information. The JSON data is loaded in its raw state into the cloud data warehouse. The transformation job then flattens the JSON data, parses out the relevant information, and appends that data to a target table in the cloud data warehouse. That target table represents the analytics-ready data against which you can now point your favorite reporting tool!
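The flattening step can be sketched in Python as follows. The sketch assumes the output contains one JSON object per line with `File`, `Line`, `Sentiment`, and `SentimentScore` fields, which is our understanding of the asynchronous sentiment output; verify this against your own output file. In the attached job this logic lives in a transformation job in the data warehouse, not in Python, and the output column names here are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def flatten_sentiment_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Turn each JSON line into one flat row suitable for a target table."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        record = json.loads(raw)
        scores = record.get("SentimentScore", {})
        yield {
            "input_file": record.get("File"),   # carries search term and date
            "line_number": record.get("Line"),  # position of the tweet in the file
            "sentiment": record.get("Sentiment"),
            "positive": scores.get("Positive"),
            "negative": scores.get("Negative"),
            "neutral": scores.get("Neutral"),
            "mixed": scores.get("Mixed"),
        }
```

Each flat row maps directly onto one row of the target table, with the input file name available for recovering the search term and date.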
Twitter sentiment analysis dashboard
We created a simple interactive Tableau dashboard that demonstrates the types of insights we were able to generate by following this process! In this Tableau dashboard, we thought it would be interesting to see the trend of tweets mentioning some NFL teams (go Birds!) and some popular TV shows. The same approach could easily be used to track your own business related hashtags and summarize the trends in what people are saying about your business!