- Blog
- 08.23.2024
- Leveraging AI, Product, Data Fundamentals
Sentiment Analysis in Redshift with OpenAI GPT-4 Turbo

In this article, we explore advanced methods for implementing Sentiment Analysis in Amazon Redshift using OpenAI's GPT-4 Turbo. We'll demonstrate various techniques, including Python scripts, to efficiently perform sentiment analysis directly within your Redshift data warehouse environment.
We'll begin with a fundamental explanation of what Sentiment Analysis is and why it's important for data-driven decision-making in modern enterprises, followed by an overview of the capabilities of OpenAI's GPT-4 Turbo.
What is Sentiment Analysis?
Sentiment Analysis employs large language models (LLMs) to systematically extract, quantify, and categorize sentiments expressed in textual data. These models, trained on vast datasets, leverage their deep learning capabilities and natural language understanding to discern subtle nuances in tone, context, and semantics.
Data engineers and architects can integrate LLMs with cloud databases to dynamically scale and streamline sentiment analysis processes, enhancing real-time insights generation. The architecture involves pre-processing pipelines, model inference layers, and post-processing modules to ensure accurate sentiment tagging and analysis outcomes.
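The three layers described above can be sketched in a few lines of Python. This is purely illustrative: the function names are hypothetical, and the `infer` stub stands in for a real LLM call, using a trivial keyword check as a placeholder.

```python
def preprocess(text: str, max_chars: int = 4000) -> str:
    """Pre-processing layer: normalize whitespace and truncate long inputs."""
    return " ".join(text.split())[:max_chars]

def infer(text: str) -> str:
    """Model inference layer (stub). A real pipeline would call an LLM here."""
    return "5" if "great" in text.lower() else "3"

def postprocess(raw: str):
    """Post-processing layer: accept only a valid 1-5 rating, else None."""
    raw = raw.strip()
    return int(raw) if raw in {"1", "2", "3", "4", "5"} else None

def score(text: str):
    """Run one piece of text through all three layers."""
    return postprocess(infer(preprocess(text)))
```

The post-processing step matters in practice: LLMs occasionally reply with extra words, and validating the output before loading it into a table keeps the downstream analysis clean.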
Examples of Sentiment Analysis include:
- Social Media Monitoring: Analyzing user comments on a brand’s social media posts to determine public sentiment towards a new product launch.
- Customer Feedback Evaluation: Assessing customer reviews on e-commerce platforms to identify common sentiment trends and feedback on specific product features.
- Financial News Analysis: Parsing news articles and financial reports to gauge market sentiment and its potential impact on stock prices or economic trends.
What is OpenAI GPT-4 Turbo?
OpenAI GPT-4 Turbo is an enhanced variant of the GPT-4 architecture, providing data engineers and architects with a highly efficient generative model for advanced natural language processing tasks. Leveraging optimized parameter tuning and increased computational throughput, GPT-4 Turbo delivers faster inference times while maintaining high accuracy. This makes it ideal for real-time data analysis, automated report generation, and intelligent data querying. Its scalable nature allows seamless integration into cloud-based systems, facilitating robust data pipelines and enabling complex transformations and analytics on large datasets with reduced latency.
OpenAI GPT-4 Turbo is known for its impressive speed and cost-efficiency compared to its predecessor, GPT-4. One of the major advantages of GPT-4 Turbo is its ability to generate high-quality text quickly, making it ideal for real-time applications such as chatbots, customer service automation, and dynamic content generation. Additionally, its lower operational cost makes it an attractive option for businesses looking to leverage AI without substantial expenditure. However, like any advanced AI, it has limitations, such as occasional inaccuracies and potential biases in generated content, which necessitate human oversight. It is best utilized in scenarios where rapid, large-scale text generation is needed, but with built-in checks to ensure the output meets quality and ethical standards.
How to perform Sentiment Analysis in Redshift with OpenAI GPT-4 Turbo using Python
Start by installing the prerequisite libraries:
python3 -m pip install psycopg2-binary openai
Then load your source data into Redshift. The example below involves product reviews, and the data has been loaded into a table named "stg_sample_reviews" with four columns: id (the primary key), stars, product and review.
Here is the Python script. Note it is good practice to handle credentials more securely than shown in this simple example. You might choose to use environment variables or a secret management service instead of hardcoding them.
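As a minimal sketch of the environment-variable approach, the connection parameters could be assembled like this (the `REDSHIFT_*` variable names are hypothetical; adapt them to your environment or secrets manager):

```python
import os

# Read connection details from environment variables rather than hardcoding them.
# The variable names below are illustrative, not a convention of any library.
db_params = {
    'dbname':   os.environ.get('REDSHIFT_DB', 'dev'),
    'user':     os.environ.get('REDSHIFT_USER', ''),
    'password': os.environ.get('REDSHIFT_PASSWORD', ''),
    'host':     os.environ.get('REDSHIFT_HOST', 'localhost'),
    'port':     os.environ.get('REDSHIFT_PORT', '5439'),
}
```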
Also please note that handling large amounts of data using fetchall() can be inefficient, and may result in memory issues. For large datasets, you should use a cursor to fetch rows incrementally using the fetchmany() method instead.
If you are working in a schema other than "public", you will need the -c connection option to specify the object search path; replace yourSchemaName in the code sample with your schema name. Newly created objects will be added to this named schema.
To authenticate with OpenAI you will need to set up an environment variable named OPENAI_API_KEY containing your API key.
import psycopg2
from openai import OpenAI

modelname = "gpt-4-turbo"
systemPrompt = "Your job is to analyze online product reviews"

# Function that will be called for every input row
def process_row(row):
    print(f"Processing id: {row[0]}")
    userPrompt = f"""Provide a numeric rating that reflects the overall sentiment of the review.
The rating should be a single number between 1 and 5, where 1 represents the most negative sentiment and 5 represents the most positive sentiment.
Respond with the numeric rating only. Do not include any justification of the rating.
review: {row[1]}"""
    ccresp = oaisdk.chat.completions.create(
        model=modelname,
        temperature=0,  # deterministic output suits a scoring task
        messages=[{"role": "system", "content": systemPrompt},
                  {"role": "user", "content": userPrompt}])
    msg = ccresp.choices[0].message.content.strip()
    return msg

# Database connection parameters
db_params = {
    'dbname': ...,
    'user': ...,
    'password': ...,
    'host': ...,
    'port': '5439',
    'options': '-c search_path=<<yourSchemaName>>,public'
}

# Create a connection to the Redshift database
try:
    conn = psycopg2.connect(**db_params)
except Exception as e:
    print(f"Unable to connect to the database: {e}")
    exit(1)
conn.autocommit = True

# Establish an OpenAI connection using environment variable OPENAI_API_KEY
oaisdk = OpenAI()

# Create a cursor object
cur = conn.cursor()

# Your query to fetch rows from the table
query = 'SELECT "id", "review" FROM "stg_sample_reviews"'

try:
    # Create the table to hold the results, if it does not already exist
    cur.execute(f'CREATE TABLE IF NOT EXISTS "stg_sample_reviews_{modelname}" ("id" INT NOT NULL, "ai_score" VARCHAR(1024) NOT NULL)')
    cur.execute(f'DELETE FROM "stg_sample_reviews_{modelname}"')
    # Execute the query
    cur.execute(query)
    # Fetch and process each row
    for row in cur.fetchall():
        ai_score = process_row(row)
        # Parameterized INSERT: the score is passed safely as a bound string value
        cur.execute(f'INSERT INTO "stg_sample_reviews_{modelname}" ("id", "ai_score") VALUES (%s, %s)',
                    (row[0], ai_score))
except Exception as e:
    print(f"SQL error: {e}")
finally:
    cur.close()
    conn.close()
After running the above script, you should find a new table has been created, which contains the AI-generated review score for every input record. Join this table to the original on the common "id" column to compare the AI-generated sentiment scores against the original star review.
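As a sketch, that comparison join can be built in Python like this. The table and column names follow the script above; the function name itself is illustrative:

```python
def comparison_sql(modelname: str) -> str:
    """Build a query joining the original reviews to the AI scores on "id"."""
    return (
        f'SELECT r."id", r."stars", s."ai_score" '
        f'FROM "stg_sample_reviews" r '
        f'JOIN "stg_sample_reviews_{modelname}" s ON r."id" = s."id"'
    )
```

You could pass the result of `comparison_sql("gpt-4-turbo")` to `cur.execute()` using the same connection as the main script.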
The LLM was asked to score between 1 and 5, so you may choose to classify the scores more broadly as follows:
- 4 or 5 - Positive
- 3 - Neutral
- 1 or 2 - Negative
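The mapping above is a one-liner to apply in Python, for example as a helper like this (the function name is illustrative):

```python
def classify(ai_score: str) -> str:
    """Map the 1-5 rating returned by the LLM to a broader sentiment class."""
    rating = int(ai_score)
    if rating >= 4:
        return "Positive"
    if rating == 3:
        return "Neutral"
    return "Negative"
```

The same logic could equally be expressed as a SQL CASE expression when querying the results table directly in Redshift.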
Sentiment Analysis in Redshift using Matillion
In the Matillion Data Productivity Cloud, orchestration pipelines like the one shown in the screenshot below can:
- Directly extract and load data, or call other pipelines to do so (as shown)
- Invoke OpenAI GPT-4 Turbo, with a nominated prompt, against all rows from a nominated table
Performing Sentiment Analysis in Redshift using Matillion
Data pipelines such as this manage all the connectivity and plumbing between the Redshift source and target tables, and the LLM.
This allows you to focus on the overall design and architecture, and the data analysis. To compare the AI-generated sentiment scores against the original star review, use a transformation pipeline like the one in the next screenshot.
Checking the results of Sentiment Analysis in Redshift using Matillion
The data sample shows two of the records. In one case the LLM's decision matches the original sentiment identically, but in the other record the ratings differ slightly. This is an example of the subjective nature of sentiment analysis.
Summary
Matillion is a versatile data pipeline platform designed to accelerate AI and analytics workflows at scale. Key features include a code-optional interface with pre-built components, or the flexibility to use SQL, Python, and DBT. It integrates seamlessly with hyperscalers, CDPs, LLMs, and vector stores, and supports hybrid SaaS deployment. Advanced features like first-class Git integration promote asynchronous collaboration, while no-code connectors and custom REST API integrations simplify data ingestion. Matillion includes AI-generated documentation, data lineage, reverse ETL for AI insights, and no-code components for Generative AI prompting and RAG. The platform leverages cloud processing power for scalable, pushdown ELT, enhancing productivity and collaboration among data teams.
For more examples of Matillion's AI components in action, check out our library of AI Videos and Demos.
To try Matillion yourself, using your own data, sign up for a free trial.
If you are already a Matillion user or trial customer, you can download the sentiment analysis example shown in the screenshots earlier, and run it on your own platform.
Ian Funnell
Data Alchemist
Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell