Sentiment Analysis in Snowflake with OpenAI GPT-4 Turbo

This article provides an in-depth exploration of implementing Sentiment Analysis in Snowflake with the assistance of OpenAI GPT-4 Turbo. It will demonstrate various techniques to approach sentiment analysis, including the use of Python for enhanced flexibility and control.

The article also gives an overview of OpenAI GPT-4 Turbo's capabilities, but first lays a foundation: what sentiment analysis is and where it is applied.

What is Sentiment Analysis?

Sentiment Analysis using large language models (LLMs) has become a vital technique in natural language processing, providing robust ways to parse and understand complex emotional undertones in text. Leveraging pre-trained language models, sentiment analysis can now decode not only explicit sentiment (positive, negative, neutral) but also nuanced expressions and contextual polarity with impressive accuracy.

By tapping into vast corpora and sophisticated algorithms, data engineers can deploy these models to classify text data, deriving insightful emotional context that facilitates informed decision-making across various domains.

Examples of Sentiment Analysis:

  • Customer Feedback Analysis: Evaluating customer sentiments from reviews or survey responses to gauge product satisfaction and identify areas needing improvement.
  • Social Media Monitoring: Analyzing tweets, posts, or comments to detect public sentiment toward a brand, campaign, or political event.
  • Financial Market Prediction: Parsing news articles, financial reports, and social media commentary to predict market movements based on collective investor sentiment.

What is OpenAI GPT-4 Turbo?

OpenAI GPT-4 Turbo represents an advancement in generative AI, providing increased computational efficiency and reduced latency, essential for real-time data processing. Ideal for data engineers and data architects, GPT-4 Turbo can streamline ETL operations, enhance data normalization processes, and enable sophisticated natural language querying of datasets. Its ability to understand and generate human-like text aids in automating documentation, anomaly detection, and customer insights extraction. Integration with existing cloud data infrastructures can be achieved via robust APIs, facilitating seamless deployment and scalability in big data environments.

GPT-4 Turbo is a refined, cost-effective variant of the GPT-4 model, offering faster performance and reduced operational costs without significantly sacrificing the quality of output. The key advantages of GPT-4 Turbo include improved response times, making it ideal for real-time applications like customer support chatbots and dynamic content generation. Additionally, its cost efficiency allows businesses to scale AI-driven solutions more feasibly. However, the downsides may include potential slight reductions in the depth of responses or handling highly complex queries compared to the full GPT-4 model. Best use cases for GPT-4 Turbo are applications where quick response and cost management are critical, such as interactive web services, automated marketing content, and large-scale data annotation projects.

How to perform Sentiment Analysis in Snowflake with OpenAI GPT-4 Turbo using Python

Start by installing the prerequisite libraries:

python3 -m pip install snowflake-connector-python openai

Then load your source data into Snowflake. The example below involves product reviews, and the data has been loaded into a table named "stg_sample_reviews" with four columns: id (the primary key), stars, product and review.

Here is the Python script. Note it is good practice to handle credentials more securely than shown in this simple example. You might choose to use environment variables or a secret management service instead of hardcoding them.

Also note that fetching large result sets with fetchall() loads every row into memory at once, which can be inefficient and may exhaust memory. For large datasets, fetch rows incrementally using the cursor's fetchmany() method instead.
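A minimal sketch of that incremental pattern, assuming the cursor has already executed a query. The fetch_in_batches helper and the batch_size value are illustrative, not part of the connector's API:

```python
def fetch_in_batches(cursor, batch_size=500):
    """Yield rows one at a time from an already-executed cursor,
    pulling them from the server in fixed-size batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            # fetchmany returns an empty sequence once the result set is exhausted
            break
        for row in batch:
            yield row
```

With this helper, the main loop becomes `for row in fetch_in_batches(cur): ...` and memory use stays bounded regardless of table size.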

To authenticate with OpenAI you will need to set up an environment variable named OPENAI_API_KEY containing your API key.
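For example, on Linux or macOS (the key value below is a placeholder for your own key):

```shell
# Make the key available to the OpenAI SDK in the current shell session
export OPENAI_API_KEY="sk-your-key-here"
```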

import snowflake.connector
from openai import OpenAI

modelname = "gpt-4-turbo"
systemPrompt = "Your job is to analyze online product reviews"

# Function that will be called for every input row
def process_row(row):
    print(f"Processing id: {row[0]}")
    userPrompt = f"""Provide a numeric rating that reflects the overall sentiment of the review.
The rating should be a single number between 1 and 5, where 1 represents the most negative sentiment and 5 represents the most positive sentiment.
Respond with the numeric rating only. Do not include any justification of the rating.
review: {row[1]}"""

    ccresp = oaisdk.chat.completions.create(
                   model=modelname,
                   temperature=0,  # deterministic output suits a scoring task
                   messages=[ {"role": "system", "content": systemPrompt},
                              {"role": "user",   "content": userPrompt} ])

    msg = ccresp.choices[0].message.content.strip()
    return msg

# Establish a Snowflake connection
conn = snowflake.connector.connect(
    user=...,
    password=...,
    account=...,
    warehouse=...,
    database=...,
    schema=...,
    role=...
)

# Establish an OpenAI connection using environment variable OPENAI_API_KEY
oaisdk = OpenAI()

cur = None
try:
    # Create a cursor object using the connection
    cur = conn.cursor()

    # Create the destination table
    cur.execute(f'CREATE TABLE IF NOT EXISTS "stg_sample_reviews_{modelname}" ("id" NUMBER(6,0) NOT NULL, "ai_score" VARCHAR(1024) NOT NULL)')

    cur.execute(f'DELETE FROM "stg_sample_reviews_{modelname}"')

    # Select source rows from the table
    cur.execute('SELECT "id", "review" FROM "stg_sample_reviews"')

    # Fetch all rows from the executed query
    rows = cur.fetchall()

    # Loop through the fetched rows and call the process_row function
    for row in rows:
        ai_score = process_row(row)
        # Use bind parameters so unexpected model output cannot break the INSERT
        cur.execute(f'INSERT INTO "stg_sample_reviews_{modelname}" ("id", "ai_score") VALUES (%s, %s)',
                    (row[0], ai_score))

finally:
    # Close the cursor and connection
    if cur:
        cur.close()
    if conn:
        conn.close()

After running the above script, you should find a new table has been created, which contains the AI-generated review score for every input record. Join this table to the original on the common "id" column to compare the AI-generated sentiment scores against the original star review.
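Once joined, the comparison itself is simple. A minimal sketch using hard-coded sample rows in place of a real query result (the column order id, stars, ai_score is assumed):

```python
# Sample joined rows in place of a real query result: (id, original stars, AI score)
joined = [(1, 5, "5"), (2, 2, "3"), (3, 4, "4")]

# Count rows where the AI score exactly matches the original star rating
matches = sum(1 for _, stars, ai in joined if int(ai) == stars)
agreement = matches / len(joined)
print(f"Exact agreement: {agreement:.0%}")  # -> Exact agreement: 67%
```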

The LLM was asked to score between 1 and 5, so you may choose to classify the scores more broadly as follows:

  • 4 or 5 - Positive
  • 3 - Neutral
  • 1 or 2 - Negative
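This mapping can be expressed as a small helper (the function name is illustrative):

```python
def classify_sentiment(score: int) -> str:
    """Map a 1-5 sentiment score to a broad category."""
    if score >= 4:
        return "Positive"
    if score == 3:
        return "Neutral"
    return "Negative"
```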

Sentiment Analysis in Snowflake using Matillion

In the Matillion Data Productivity Cloud, orchestration pipelines like the one shown in the screenshot below can:

  • Directly extract and load data, or call other pipelines to do so (as shown)
  • Invoke OpenAI GPT-4 Turbo, with a nominated prompt, against all rows from a nominated table

Performing Sentiment Analysis in Snowflake using Matillion

Data pipelines such as this manage all the connectivity and plumbing between the Snowflake source and target tables, and the LLM.

This allows you to focus on the overall design and architecture, and the data analysis. To compare the AI-generated sentiment scores against the original star review, use a transformation pipeline like the one in the next screenshot.

Checking the results of Sentiment Analysis in Snowflake using Matillion

The data sample shows two of the records. In one case the LLM's decision matches the original sentiment identically, but in the other record the ratings differ slightly. This is an example of the subjective nature of sentiment analysis.

Summary

Matillion is a versatile data pipeline platform that accelerates the creation and management of pipelines for AI and Analytics at scale. It champions productivity, collaboration, and speed through a hybrid SaaS deployment model. The platform offers a UI with pre-built components and supports coding in SQL, Python, and DBT. Its first-class Git integration enables asynchronous collaboration. Matillion’s no-code connectors simplify data integration, while custom connectors to REST APIs and parameterized variables offer flexibility. It supports pushdown ELT and includes AI, Generative AI, and vector store connectivity components. Matillion also features AI-generated documentation, data lineage, and a copilot for natural language pipeline creation, bridging the gap between data engineering and AI-intensive tasks.

For more examples of Matillion's AI components in action, check out our library of AI Videos and Demos.

To try Matillion yourself, using your own data, sign up for a free trial.

If you are already a Matillion user or trial customer, you can download the sentiment analysis example shown in the screenshots earlier, and run it on your own platform.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.