Sentiment Analysis in Redshift with OpenAI GPT-4

This article offers a comprehensive guide on performing Sentiment Analysis in Amazon Redshift with the assistance of OpenAI GPT-4. The content will walk readers through different strategies to execute Sentiment Analysis, emphasizing Python for practical implementation.

Additionally, the article will explore the capabilities of OpenAI GPT-4, beginning with an introduction to the core concepts of Sentiment Analysis.

What is Sentiment Analysis?

Sentiment Analysis using large language models (LLMs) leverages advanced natural language processing (NLP) techniques to automatically interpret and classify emotional tones within textual data.

LLMs, powered by transformers and extensive pre-training on diverse datasets, excel in capturing nuanced contextual relationships, making them particularly adept for sentiment analysis tasks. By utilizing deep learning frameworks, these models can discern positive, negative, and neutral sentiments with high precision, significantly outperforming traditional lexicon-based approaches. The scalability and adaptability of LLMs enable real-time sentiment evaluation across diverse domains and languages.

Key applications of sentiment analysis using LLMs include:

  • Product Review Analysis: Automatically categorizes customer feedback on e-commerce platforms to inform product development and inventory decisions.
  • Social Media Monitoring: Identifies and tracks public sentiment about brands, events, or social issues across platforms like Twitter and Facebook, aiding in reputation management and market research.
  • Customer Service Optimization: Analyzes interactions in chat logs and support tickets to assess customer satisfaction and improve service delivery through targeted response strategies.

What is OpenAI GPT-4?

OpenAI GPT-4 presents significant advancements in natural language processing, making it highly suitable for complex data engineering tasks. Its robust architecture can enhance ETL processes by automating data cleaning and transformation activities. Integrating GPT-4 with cloud databases enables sophisticated query generation and optimization, reducing the need for manual intervention. Additionally, its ability to comprehend and generate human-like text can streamline documentation, API generation, and code reviews.

GPT-4 offers enhanced language understanding, context retention, and generation capabilities compared to its predecessors. Among its notable strengths is its ability to produce coherent and contextually relevant text across a wide range of topics, making it invaluable for applications such as content creation, customer service automation, and complex query resolution. Its training on diverse datasets also allows it to provide nuanced responses tailored to specific needs.

There are drawbacks to consider, however. GPT-4 can occasionally produce inaccurate or biased outputs, reflecting biases present in its training data, and it demands substantial computational resources, which can drive up costs for high-frequency usage. It is best applied where high-quality, context-sensitive text generation is essential; tasks involving highly sensitive or critical information may require additional oversight or verification because of the risk of erroneous outputs.

How to perform Sentiment Analysis in Redshift with OpenAI GPT-4 using Python

Start by installing the prerequisite libraries:

python3 -m pip install psycopg2-binary openai

Then load your source data into Redshift. The example below involves product reviews, and the data has been loaded into a table named "stg_sample_reviews" with four columns: id (the primary key), stars, product and review.
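
If you are creating the staging table yourself, a minimal DDL sketch might look like the following. The column types and VARCHAR widths shown here are illustrative assumptions; adjust them to match your actual review data.

```sql
CREATE TABLE "stg_sample_reviews" (
    "id"      INT NOT NULL,   -- primary key
    "stars"   INT,            -- original star rating, 1 to 5
    "product" VARCHAR(256),   -- product name
    "review"  VARCHAR(4096)   -- free-text review body
);
```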

Here is the Python script. Note that it is good practice to handle credentials more securely than in this simple example: use environment variables or a secrets management service rather than hardcoding them.

Also note that fetching a large result set with fetchall() loads every row into memory at once, which is inefficient and can exhaust available memory. For large datasets, fetch rows incrementally with the cursor's fetchmany() method instead.
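
As a sketch of that incremental pattern (a stand-in cursor class is used here so the batching loop can be shown without a live Redshift connection; with psycopg2 you would use the real cursor returned by conn.cursor()):

```python
class FakeCursor:
    """Stand-in for a psycopg2 cursor holding pre-fetched rows."""
    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size):
        # Return up to `size` rows, mimicking the DB-API fetchmany() contract
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch

def process_in_batches(cur, batch_size=500):
    """Consume a cursor's result set batch by batch instead of all at once."""
    processed = []
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:  # an empty list signals the result set is exhausted
            break
        for row in rows:
            processed.append(row)  # in the real script, call process_row(row) here
    return processed

cur = FakeCursor([(i, f"review {i}") for i in range(1, 8)])
print(len(process_in_batches(cur, batch_size=3)))  # → 7
```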

If you are working in a schema other than "public", you will need the -c connection option to set the object search path, as shown in the options entry of the connection parameters below. Replace <<yourSchemaName>> with the name of your schema; newly created objects will be added to that schema.

To authenticate with OpenAI you will need to set up an environment variable named OPENAI_API_KEY containing your API key.
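
For example, in a bash shell (the key value below is a placeholder; substitute your real API key):

```shell
# Make the OpenAI API key available to the Python script.
# The OpenAI SDK reads OPENAI_API_KEY automatically when OpenAI() is constructed.
export OPENAI_API_KEY="sk-your-key-here"
```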

import psycopg2
from openai import OpenAI

modelname = "gpt-4"
systemPrompt = "Your job is to analyze online product reviews"

# Function that will be called for every input row
def process_row(row):
    print(f"Processing id: {row[0]}")
    userPrompt = f"""Provide a numeric rating that reflects the overall sentiment of the review.
The rating should be a single number between 1 and 5, where 1 represents the most negative sentiment and 5 represents the most positive sentiment.
Respond with the numeric rating only. Do not include any justification of the rating.
review: {row[1]}"""

    ccresp = oaisdk.chat.completions.create(
                   model=modelname,
                   temperature=1,
                   messages=[ {"role": "system", "content": systemPrompt},
                              {"role": "user",   "content": userPrompt} ])

    msg = ccresp.choices[0].message.content.strip()
    return msg

# Database connection parameters
db_params = {
    'dbname': ...,
    'user': ...,
    'password': ...,
    'host': ...,
    'port': '5439',
    'options' : '-c search_path=<<yourSchemaName>>,public'
}

# Create a connection to the Redshift database
try:
    conn = psycopg2.connect(**db_params)
except Exception as e:
    print(f"Unable to connect to the database: {e}")
    exit(1)

conn.autocommit = True

# Establish an OpenAI connection using environment variable OPENAI_API_KEY
oaisdk = OpenAI()

# Create a cursor object
cur = conn.cursor()

# Your query to fetch rows from the table
query = 'SELECT "id", "review" FROM "stg_sample_reviews"'

try:
    # Create the table to hold the results, if it does not already exist
    cur.execute(f'CREATE TABLE IF NOT EXISTS "stg_sample_reviews_{modelname}" ("id" INT NOT NULL, "ai_score" VARCHAR(1024) NOT NULL)')

    cur.execute(f'DELETE FROM "stg_sample_reviews_{modelname}"')

    # Execute the query
    cur.execute(query)

    # Fetch and process each row
    for row in cur.fetchall():
        ai_score = process_row(row)
        # Use a parameterized query so the text returned by the LLM is safely quoted
        cur.execute(f'INSERT INTO "stg_sample_reviews_{modelname}" ("id", "ai_score") VALUES (%s, %s)',
                    (row[0], ai_score))

except Exception as e:
    print(f"SQL error: {e}")
finally:
    cur.close()
    conn.close()

After running the above script, you should find a new table has been created, which contains the AI-generated review score for every input record. Join this table to the original on the common "id" column to compare the AI-generated sentiment scores against the original star review.
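
For example, assuming the default modelname of "gpt-4" (so the results table is named "stg_sample_reviews_gpt-4"), a comparison query might look like this:

```sql
SELECT r."id",
       r."stars"    AS original_stars,
       s."ai_score" AS ai_sentiment_score
FROM "stg_sample_reviews" r
JOIN "stg_sample_reviews_gpt-4" s ON r."id" = s."id"
ORDER BY r."id";
```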

The LLM was asked to score between 1 and 5, so you may choose to classify the scores more broadly as follows:

  • 4 or 5 - Positive
  • 3 - Neutral
  • 1 or 2 - Negative
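
In Python, that mapping could be sketched as follows (the function name classify_score is illustrative; the score arrives as text because the results column is a VARCHAR):

```python
def classify_score(ai_score):
    """Map a 1-to-5 rating returned by the LLM to a broad sentiment class."""
    score = int(ai_score)  # the model responds with text such as "4"
    if score >= 4:
        return "Positive"
    if score == 3:
        return "Neutral"
    return "Negative"

print(classify_score("5"))  # → Positive
```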

Sentiment Analysis in Redshift using Matillion

In the Matillion Data Productivity Cloud, orchestration pipelines like the one shown in the screenshot below can:

  • Directly extract and load data, or call other pipelines to do so (as shown)
  • Invoke OpenAI GPT-4, with a nominated prompt, against all rows from a nominated table

Performing Sentiment Analysis in Redshift using Matillion

Data pipelines such as this manage all the connectivity and plumbing between the Redshift source and target tables, and the LLM.

This allows you to focus on the overall design and architecture, and the data analysis. To compare the AI-generated sentiment scores against the original star review, use a transformation pipeline like the one in the next screenshot.

Checking the results of Sentiment Analysis in Redshift using Matillion

The data sample shows two of the records. In one case the LLM's decision matches the original sentiment identically, but in the other record the ratings differ slightly. This is an example of the subjective nature of sentiment analysis.

Summary

Matillion is a versatile data pipeline platform that boosts productivity and collaboration by enabling data teams to construct and manage pipelines for AI and analytics at scale. Its code-optional environment, featuring a user-friendly UI with pre-built components, allows seamless integration with hyperscalers, CDPs, LLMs, and more. Matillion’s first-class Git integration and hybrid SaaS deployment provide flexibility and scalability. Its powerful pushdown ELT approach leverages cloud data platform processing power, and its extensive no-code connectors, including Generative AI and vector store connectivity, democratize access to AI. Matillion's AI-generated documentation and natural language copilot streamline workflow, while data lineage and reverse ETL capabilities ensure robust data management.

For more examples of Matillion's AI components in action, check out our library of AI Videos and Demos.

To try Matillion yourself, using your own data, sign up for a free trial.

If you are already a Matillion user or trial customer, you can download the sentiment analysis example shown in the screenshots earlier, and run it on your own platform.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.