Sentiment Analysis in Redshift with OpenAI GPT-3.5 Turbo

In this article, we will explore how to perform Sentiment Analysis in Amazon Redshift with the help of OpenAI's GPT-3.5 Turbo model. We will cover different methods for performing sentiment analysis, including the use of Python scripts to enhance functionality.

Additionally, we will describe the capabilities and features of OpenAI's GPT-3.5 Turbo, and start with a foundational explanation of what Sentiment Analysis entails and its significance in data processing.

What is Sentiment Analysis?

Sentiment Analysis using large language models (LLMs) leverages advanced deep learning techniques to classify and interpret the sentiment expressed in textual data. These models can capture nuanced contextual dependencies in text, improving the accuracy of sentiment prediction over traditional machine learning methods. Such models can also be fine-tuned on domain-specific corpora to enhance their performance for targeted applications.

Data engineers and data architects can integrate LLMs with existing data pipelines through APIs or containerized solutions, enabling scalable and real-time analysis of user feedback, social media, and more.

Examples of Sentiment Analysis applications include:

  • Customer Feedback Analytics: Extracting sentiment from product reviews on e-commerce platforms to gauge customer satisfaction and identify areas for improvement.
  • Social Media Monitoring: Real-time analysis of tweets and posts to understand public opinion on current events or brand reputation.
  • Support Ticket Prioritization: Analyzing content of customer service tickets to detect urgency and sentiment, helping in prioritizing and routing tickets effectively.

What is OpenAI GPT-3.5 Turbo?

OpenAI's GPT-3.5 Turbo is an advanced generative AI model that offers substantial improvements in performance and efficiency over its predecessors. Engineered for complex natural language processing (NLP) tasks, GPT-3.5 Turbo utilizes a transformer architecture with optimized attention mechanisms, enabling it to handle extensive datasets and generate highly contextual responses. For data engineers and architects, this model can facilitate sophisticated data preprocessing, anomaly detection, and sentiment analysis.

GPT-3.5 Turbo boasts impressive natural language understanding and generation, making it a powerful tool for a wide range of applications such as content creation, customer support, and coding assistance. One of its primary advantages is its ability to generate human-like text rapidly and with a high degree of fluency, making it versatile for both creative and practical uses. Additionally, it's more cost-effective and efficient compared to its predecessors. However, its cons include occasional inaccuracies, potentially generating contextually irrelevant or nonsensical information, and possible biases in the output due to the data it was trained on. GPT-3.5 Turbo is best used when the need for nuanced, articulate language processing is paramount but should be paired with human oversight to ensure accuracy and appropriateness in its outputs.

How to perform Sentiment Analysis in Redshift with OpenAI GPT-3.5 Turbo using Python

Start by installing the prerequisite libraries:

python3 -m pip install psycopg2-binary openai

Then load your source data into Redshift. The example below involves product reviews, and the data has been loaded into a table named "stg_sample_reviews" with four columns: id (the primary key), stars, product and review.
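If you need to create that staging table yourself, DDL along the following lines would match the description above. The column types are illustrative assumptions, not taken from the source data; adjust them to suit your actual reviews.

```sql
-- Illustrative staging table for the product reviews
-- (column types are assumptions, not taken from the source data)
CREATE TABLE "stg_sample_reviews" (
    "id"      INT NOT NULL,
    "stars"   INT,
    "product" VARCHAR(256),
    "review"  VARCHAR(4096),
    PRIMARY KEY ("id")
);
```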

Here is the Python script. Note that it is good practice to handle credentials more securely than shown in this simple example: use environment variables or a secret management service instead of hardcoding them.

Also note that fetching large result sets with fetchall() loads every row into memory at once, which is inefficient and may cause memory issues. For large datasets, use the cursor to fetch rows incrementally with the fetchmany() method instead.
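As a sketch of that incremental pattern, the loop below shows the fetchmany() shape. FakeCursor is a self-contained stand-in for a psycopg2 cursor, used here only so the example runs on its own; in the real script you would use the cursor returned by conn.cursor().

```python
# Batching pattern: process rows in chunks with fetchmany() rather than
# loading everything into memory with fetchall().

class FakeCursor:
    """Minimal stand-in for a psycopg2 cursor (illustration only)."""
    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def fetchmany(self, size):
        batch = self._rows[self._pos:self._pos + size]
        self._pos += size
        return batch

def process_in_batches(cur, batch_size=100):
    """Iterate over the cursor in batches; returns the ids processed."""
    processed = []
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:        # empty list means no rows left
            break
        for row in batch:
            processed.append(row[0])   # call process_row(row) here instead
    return processed

cur = FakeCursor([(i, f"review {i}") for i in range(250)])
print(len(process_in_batches(cur, batch_size=100)))  # 250
```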

If you are working in a schema other than "public", use the -c search_path connection option to specify the schema search path; it appears as yourSchemaName in the code sample. Newly created objects will be added to this named schema.

To authenticate with OpenAI you will need to set up an environment variable named OPENAI_API_KEY containing your API key.
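A quick way to fail fast if the key is missing is to check the environment before connecting, rather than waiting for the first API call to fail. This small helper is illustrative and not part of the OpenAI SDK:

```python
import os

def check_api_key(env=os.environ):
    """Return True if OPENAI_API_KEY is set and non-empty (illustrative helper)."""
    return bool(env.get("OPENAI_API_KEY"))

if not check_api_key():
    print("Set OPENAI_API_KEY before running the script.")
```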

import psycopg2
from openai import OpenAI

modelname = "gpt-3.5-turbo"
systemPrompt = "Your job is to analyze online product reviews"

# Function that will be called for every input row
def process_row(row):
    print(f"Processing id: {row[0]}")
    userPrompt = f"""Provide a numeric rating that reflects the overall sentiment of the review.
The rating should be a single number between 1 and 5, where 1 represents the most negative sentiment and 5 represents the most positive sentiment.
Respond with the numeric rating only. Do not include any justification of the rating.
review: {row[1]}"""

    ccresp = oaisdk.chat.completions.create(
                   model=modelname,
                   temperature=1,
                   messages=[ {"role": "system", "content": systemPrompt},
                              {"role": "user",   "content": userPrompt} ])

    msg = ccresp.choices[0].message.content.strip()
    return msg

# Database connection parameters
db_params = {
    'dbname': ...,
    'user': ...,
    'password': ...,
    'host': ...,
    'port': '5439',
    'options' : '-c search_path=<<yourSchemaName>>,public'
}

# Create a connection to the Redshift database
try:
    conn = psycopg2.connect(**db_params)
except Exception as e:
    print(f"Unable to connect to the database: {e}")
    exit(1)

conn.autocommit = True

# Establish an OpenAI connection using environment variable OPENAI_API_KEY
oaisdk = OpenAI()

# Create a cursor object
cur = conn.cursor()

# Your query to fetch rows from the table
query = 'SELECT "id", "review" FROM "stg_sample_reviews"'

try:
    # Create the table to hold the results, if it does not already exist
    cur.execute(f'CREATE TABLE IF NOT EXISTS "stg_sample_reviews_{modelname}" ("id" INT NOT NULL, "ai_score" VARCHAR(1024) NOT NULL)')

    cur.execute(f'DELETE FROM "stg_sample_reviews_{modelname}"')

    # Execute the query
    cur.execute(query)

    # Fetch and process each row
    for row in cur.fetchall():
        ai_score = process_row(row)
        cur.execute(f'INSERT INTO "stg_sample_reviews_{modelname}" ("id", "ai_score") VALUES (%s, %s)', (row[0], ai_score))

except Exception as e:
    print(f"SQL error: {e}")
finally:
    cur.close()
    conn.close()

After running the above script, you should find a new table has been created, which contains the AI-generated review score for every input record. Join this table to the original on the common "id" column to compare the AI-generated sentiment scores against the original star review.
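A comparison query along these lines would do that join. With modelname set to gpt-3.5-turbo, the results table created by the script is named stg_sample_reviews_gpt-3.5-turbo:

```sql
SELECT r."id",
       r."stars"    AS original_stars,
       a."ai_score" AS ai_score
FROM "stg_sample_reviews" r
JOIN "stg_sample_reviews_gpt-3.5-turbo" a
  ON r."id" = a."id"
ORDER BY r."id";
```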

The LLM was asked to score between 1 and 5, so you may choose to classify the scores more broadly as follows:

  • 4 or 5 - Positive
  • 3 - Neutral
  • 1 or 2 - Negative
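That broader classification is easy to apply in Python when post-processing the results. Remember the script stores ai_score as a VARCHAR, so convert it to an integer first; this classify helper is illustrative:

```python
# Map the 1-5 LLM score onto the broader sentiment classes listed above.
# The script stores ai_score as a VARCHAR, so accept a string and convert it.

def classify(ai_score):
    score = int(ai_score)
    if score >= 4:
        return "Positive"
    if score == 3:
        return "Neutral"
    return "Negative"

print(classify("5"))  # Positive
print(classify("3"))  # Neutral
print(classify("1"))  # Negative
```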

Sentiment Analysis in Redshift using Matillion

In the Matillion Data Productivity Cloud, orchestration pipelines like the one shown in the screenshot below can:

  • Directly extract and load data, or call other pipelines to do so (as shown)
  • Invoke OpenAI GPT-3.5 Turbo, with a nominated prompt, against all rows from a nominated table

Performing Sentiment Analysis in Redshift using Matillion

Data pipelines such as this manage all the connectivity and plumbing between the Redshift source and target tables, and the LLM.

This allows you to focus on the overall design and architecture, and the data analysis. To compare the AI-generated sentiment scores against the original star review, use a transformation pipeline like the one in the next screenshot.

Checking the results of Sentiment Analysis in Redshift using Matillion

The data sample shows two of the records. In one case the LLM's decision matches the original sentiment identically, but in the other record the ratings differ slightly. This is an example of the subjective nature of sentiment analysis.

Summary

Matillion serves as a comprehensive data pipeline platform, empowering data teams to efficiently build and manage scalable AI- and analytics-ready pipelines. Offering a code-optional environment, it integrates seamlessly with hyperscalers, CDPs, LLMs, and more. Its UI features pre-built components, but also allows custom coding in SQL, Python, or DBT. First-class Git integration supports asynchronous collaboration, while AI-generated documentation and numerous no-code connectors streamline workflows. Hybrid SaaS deployment, data lineage tracking, pushdown ELT, and vector store connectivity are among its advanced features. Matillion's AI components, including a copilot for natural language pipeline building and no-code Generative AI prompts, enhance productivity and flexibility.

For more examples of Matillion's AI components in action, check out our library of AI Videos and Demos.

To try Matillion yourself, using your own data, sign up for a free trial.

If you are already a Matillion user or trial customer, you can download the sentiment analysis example shown in the screenshots earlier, and run it on your own platform.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell
