Sentiment Analysis in Snowflake with OpenAI GPT-3.5 Turbo

This article will guide you through the process of performing Sentiment Analysis using Snowflake in conjunction with OpenAI GPT-3.5 Turbo. It will explore various techniques, including implementations with Python, to effectively analyze sentiments from text data within the Snowflake environment.

The article will also cover the capabilities of OpenAI GPT-3.5 Turbo and will kick off with a foundational explanation of Sentiment Analysis, providing you with the necessary background before diving into practical applications.

What is Sentiment Analysis?

Sentiment Analysis utilizes natural language processing (NLP) techniques, often leveraging large language models (LLMs) such as GPT-3.5 Turbo, to evaluate and quantify subjective information in text data. These models can comprehensively analyze text by considering context, tone, and nuances, making them highly effective for interpreting sentiments expressed in unstructured data sources.

Engineers and architects can integrate LLMs with cloud databases and data pipelines to automate the extraction and analysis of opinions, emotions, and attitudes, enabling more informed decision-making.

Here are three examples of Sentiment Analysis applications:

  • Social Media Monitoring: Automatically assess public opinion and brand perception by analyzing user-generated content on platforms like Twitter or Facebook.
  • Customer Reviews: Evaluate sentiment in product reviews to gain insights into customer satisfaction and common pain points.
  • Employee Feedback: Analyze sentiment in employee surveys and feedback forms to understand workplace morale and identify areas for improvement.

What is OpenAI GPT-3.5 Turbo?

OpenAI's GPT-3.5 Turbo is an advanced generative model, optimized for computational efficiency and scalability, which makes it a good fit for integration into data pipelines and analytics workflows. It offers nuanced natural language generation capabilities, facilitating complex data parsing, transformation, and summarization tasks. Its API supports batch processing and seamless scaling, making it suitable for enterprise-level data ecosystems. Additionally, GPT-3.5 Turbo's strong contextual understanding can dramatically improve automated report generation, anomaly detection, and data categorization efforts such as sentiment analysis.

GPT-3.5 Turbo improves on its predecessors by generating coherent, contextually accurate, and creative text with remarkable speed and efficiency. Its ability to understand and produce human-like responses makes it well suited to applications like customer service chatbots, content creation, and interactive storytelling. However, the model also has some drawbacks, such as occasional inaccuracies, the potential for generating biased content, and API usage costs that can add up at scale. GPT-3.5 Turbo is best used in scenarios requiring sophisticated natural language understanding and generation, but its output should be monitored carefully for accuracy and ethical compliance.

How to perform Sentiment Analysis in Snowflake with OpenAI GPT-3.5 Turbo using Python

Start by installing the prerequisite libraries:

python3 -m pip install snowflake-connector-python openai

Then load your source data into Snowflake. The example below involves product reviews, and the data has been loaded into a table named "stg_sample_reviews" with four columns: id (the primary key), stars, product and review.
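If you need some data to experiment with, the sketch below shows one way to create and populate the staging table using the Snowflake connector. The sample rows and the separation into SQL constants plus a loader function are illustrative choices, not part of the original example; you would pass in a cursor from your own Snowflake connection.

```python
# A minimal sketch of loading sample review data into Snowflake.
# The table and column names match those described above; the sample
# rows are invented for illustration.

CREATE_SQL = (
    'CREATE TABLE IF NOT EXISTS "stg_sample_reviews" ('
    '"id" NUMBER(6,0) NOT NULL, '
    '"stars" NUMBER(1,0), '
    '"product" VARCHAR(256), '
    '"review" VARCHAR(4096))'
)

INSERT_SQL = (
    'INSERT INTO "stg_sample_reviews" ("id", "stars", "product", "review") '
    'VALUES (%s, %s, %s, %s)'
)

SAMPLE_ROWS = [
    (1, 5, "Widget", "Absolutely love it, works perfectly."),
    (2, 2, "Widget", "Stopped working after a week."),
]

def load_sample_reviews(cur, rows=SAMPLE_ROWS):
    """Create the staging table and bulk-insert the sample rows
    using bind parameters rather than string interpolation."""
    cur.execute(CREATE_SQL)
    cur.executemany(INSERT_SQL, rows)
```

Using executemany with bind parameters keeps the values out of the SQL text, which avoids quoting problems with free-text review strings.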

Here is the Python script. Note it is good practice to handle credentials more securely than shown in this simple example. You might choose to use environment variables or a secret management service instead of hardcoding them.

Also please note that fetching large result sets with fetchall() can be inefficient and may cause memory issues, because every row is loaded into memory at once. For large datasets, fetch rows incrementally with the cursor's fetchmany() method instead.
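The incremental pattern can be sketched as follows. This is not part of the main script below; the process_row argument stands in for whatever per-row work you need, and the batch size of 500 is an arbitrary choice.

```python
# A sketch of incremental fetching with fetchmany(), to avoid holding
# the whole result set in memory at once.

BATCH_SIZE = 500  # illustrative value; tune for your row size

def process_in_batches(cur, process_row, batch_size=BATCH_SIZE):
    """Fetch and process rows batch by batch until the cursor is exhausted.
    fetchmany() returns an empty list once all rows have been consumed."""
    processed = 0
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            process_row(row)
            processed += 1
    return processed
```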

To authenticate with OpenAI you will need to set up an environment variable named OPENAI_API_KEY containing your API key.
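The same environment-variable approach works for the Snowflake credentials that the script below leaves as placeholders. Here is a minimal sketch; the SNOWFLAKE_* variable names are an assumption chosen for this example, not a convention required by the connector.

```python
import os

def snowflake_params():
    """Collect Snowflake credentials from environment variables,
    failing fast with a clear message if any are missing.
    The SNOWFLAKE_* names are illustrative."""
    required = ["SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD", "SNOWFLAKE_ACCOUNT"]
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "user": os.environ["SNOWFLAKE_USER"],
        "password": os.environ["SNOWFLAKE_PASSWORD"],
        "account": os.environ["SNOWFLAKE_ACCOUNT"],
    }
```

The returned dictionary can be unpacked straight into the connector, e.g. snowflake.connector.connect(**snowflake_params(), warehouse=..., database=...).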

import snowflake.connector
from openai import OpenAI

modelname = "gpt-3.5-turbo"
systemPrompt = "Your job is to analyze online product reviews"

# Function that will be called for every input row
def process_row(row):
    print(f"Processing id: {row[0]}")
    userPrompt = f"""Provide a numeric rating that reflects the overall sentiment of the review.
The rating should be a single number between 1 and 5, where 1 represents the most negative sentiment and 5 represents the most positive sentiment.
Respond with the numeric rating only. Do not include any justification of the rating.
review: {row[1]}"""

    ccresp = oaisdk.chat.completions.create(
                   model=modelname,
                   temperature=1,  # lower this (e.g. 0) for more deterministic scoring
                   messages=[ {"role": "system", "content": systemPrompt},
                              {"role": "user",   "content": userPrompt} ])

    msg = ccresp.choices[0].message.content.strip()
    return msg

# Establish a Snowflake connection
conn = snowflake.connector.connect(
    user=...,
    password=...,
    account=...,
    warehouse=...,
    database=...,
    schema=...,
    role=...
)

# Establish an OpenAI connection using environment variable OPENAI_API_KEY
oaisdk = OpenAI()

cur = None
try:
    # Create a cursor object using the connection
    cur = conn.cursor()

    # Create the destination table
    cur.execute(f'CREATE TABLE IF NOT EXISTS "stg_sample_reviews_{modelname}" ("id" NUMBER(6,0) NOT NULL, "ai_score" VARCHAR(1024) NOT NULL)')

    cur.execute(f'DELETE FROM "stg_sample_reviews_{modelname}"')

    # Select source rows from the table
    cur.execute('SELECT "id", "review" FROM "stg_sample_reviews"')

    # Fetch all rows from the executed query
    rows = cur.fetchall()

    # Loop through the fetched rows and call the process_row function
    for row in rows:
        ai_score = process_row(row)
        cur.execute(f'INSERT INTO "stg_sample_reviews_{modelname}" ("id", "ai_score") VALUES (%s, %s)', (row[0], ai_score))

finally:
    # Close the cursor and connection
    if cur:
        cur.close()
    if conn:
        conn.close()

After running the above script, you should find a new table has been created, which contains the AI-generated review score for every input record. Join this table to the original on the common "id" column to compare the AI-generated sentiment scores against the original star review.

The LLM was asked to score between 1 and 5, so you may choose to classify the scores more broadly as follows:

  • 4 or 5 - Positive
  • 3 - Neutral
  • 1 or 2 - Negative
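A small helper can apply this bucketing when you post-process the joined results in Python. The function below is an illustrative sketch of the mapping above; it accepts the score as either a string (as stored in the ai_score VARCHAR column) or an integer.

```python
def classify_score(score):
    """Map a 1-5 sentiment score to a broad category:
    4 or 5 -> Positive, 3 -> Neutral, 1 or 2 -> Negative."""
    score = int(score)
    if not 1 <= score <= 5:
        raise ValueError(f"Score out of range: {score}")
    if score >= 4:
        return "Positive"
    if score == 3:
        return "Neutral"
    return "Negative"
```

Applying classify_score to both the AI-generated score and the original stars column makes the comparison less sensitive to one-point differences in subjective ratings.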

Sentiment Analysis in Snowflake using Matillion

In the Matillion Data Productivity Cloud, orchestration pipelines like the one shown in the screenshot below can:

  • Directly extract and load data, or call other pipelines to do so (as shown)
  • Invoke OpenAI GPT-3.5 Turbo, with a nominated prompt, against all rows from a nominated table

Performing Sentiment Analysis in Snowflake using Matillion

Data pipelines such as this manage all the connectivity and plumbing between the Snowflake source and target tables, and the LLM.

This allows you to focus on the overall design and architecture, and the data analysis. To compare the AI-generated sentiment scores against the original star review, use a transformation pipeline like the one in the next screenshot.

Checking the results of Sentiment Analysis in Snowflake using Matillion

The data sample shows two of the records. In one case the LLM's decision matches the original sentiment identically, but in the other record the ratings differ slightly. This is an example of the subjective nature of sentiment analysis.

Summary

Matillion is a versatile data pipeline platform designed to supercharge data teams' productivity and collaboration. It excels in building and managing pipelines for AI and analytics with unmatched speed. Offering a code-optional environment, Matillion integrates seamlessly with hyperscalers, CDPs, LLMs, and more. The platform sports a robust UI with pre-built components, while also supporting SQL, Python, and DBT for customized scripting. It integrates smoothly with Git, supports asynchronous workflows, and features no-code and custom connectors, including REST APIs. Notably, Matillion incorporates AI-generated documentation, data lineage, pushdown ELT, and offers hybrid SaaS deployment. Advanced functionalities include Generative AI, RAG, vector store connectivity, and natural language pipeline building via Matillion Copilot. Its design is geared towards efficient pipeline management at any scale, democratizing access to AI-driven insights and enhancing data engineering workflows.

For more examples of Matillion's AI components in action, check out our library of AI Videos and Demos.

To try Matillion yourself, using your own data, sign up for a free trial.

If you are already a Matillion user or trial customer, you can download the sentiment analysis example shown in the screenshots earlier, and run it on your own platform.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell
