Sentiment Analysis in Databricks with OpenAI GPT-4

This article explores Sentiment Analysis in Databricks using OpenAI GPT-4. It presents a range of methods for conducting Sentiment Analysis, with a particular focus on Python scripts, and takes an in-depth look at how OpenAI GPT-4 enhances Sentiment Analysis capabilities.

To lay the groundwork, we will begin with an explanation of what Sentiment Analysis is and why it's essential for modern data-driven decision-making.

What is Sentiment Analysis?

Sentiment Analysis, also known as opinion mining, utilizes natural language processing (NLP), text analysis, and computational linguistics to identify and extract subjective information from text data. For data engineers and data architects, implementing sentiment analysis within cloud databases can enhance the ability to parse large volumes of unstructured data, providing actionable insights. This process typically involves several steps including data pre-processing, feature extraction, and the application of machine learning algorithms or neural networks to classify sentiments as positive, negative, or neutral. The end goal is to enable deeper understanding and automated responses based on the emotional tone conveyed in the data.

Examples of Sentiment Analysis applications:

  • Customer Feedback Systems: Analyzing customer reviews or social media comments to gauge overall sentiment towards products or services, enabling timely and data-driven improvements.
  • Market Analysis: Monitoring and interpreting sentiment trends in financial news articles and social feeds to predict market movements and inform trading strategies.
  • Employee Surveys: Dissecting internal survey data to understand workforce morale and identifying underlying issues that need managerial attention.
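Before turning to GPT-4, it can help to see the classic pipeline (pre-processing, feature extraction, classification) in miniature. The sketch below is a toy lexicon-based classifier, purely for illustration; the word lists are invented, and this is not the GPT-4 approach used later in this article.

```python
# Toy lexicon-based sentiment sketch: normalize, tokenize, count polarity
# words, then classify. The word lists are illustrative, not a real lexicon.
POSITIVE = {"great", "excellent", "love", "good", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "broken"}

def classify_sentiment(text: str) -> str:
    tokens = text.lower().split()  # pre-processing: lowercase and tokenize
    # feature extraction: net count of positive vs negative words
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this product, it works great"))  # positive
print(classify_sentiment("Terrible quality, arrived broken"))     # negative
```

Rule-based approaches like this are brittle (they miss negation, sarcasm, and context), which is exactly where a large language model such as GPT-4 adds value.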

What is OpenAI GPT-4?

OpenAI GPT-4 is a generative pre-trained transformer, leveraging an expansive neural network with billions of parameters, optimized for natural language understanding and generation. Data engineers can utilize its robust API for automated data preprocessing, anomaly detection, and ETL (Extract, Transform, Load) tasks. Data architects might find value in its capability to generate complex SQL queries, documentation, and schema designs, thereby enhancing data pipeline efficiency. Its contextual comprehension and scalability make it a valuable tool for sophisticated data engineering pipelines, facilitating smarter data-driven decision-making.

GPT-4 offers numerous advantages and a few drawbacks. Among its significant pros are its exceptional language understanding capabilities, the ability to generate coherent and contextually relevant text, and its versatility in applications ranging from content creation to customer support. Additionally, GPT-4 can enhance productivity and creativity, serving as a valuable tool for brainstorming, drafting, and translating text. However, the cons include potential biases in its outputs, high computational resource requirements, and the need for careful handling to avoid generating misleading or inappropriate content. It is best to use GPT-4 in scenarios where nuanced language generation is required, such as drafting content, creating conversational agents, or performing complex data analysis tasks, while ensuring robust oversight to mitigate any unintended consequences.

How to perform Sentiment Analysis in Databricks with OpenAI GPT-4 using Python

Start by installing the prerequisite libraries (note the Databricks SDK for Python is in beta, version 0.28.0, at the time of writing):

python3 -m pip install databricks-sdk openai

Then load your source data into Databricks. The example below involves product reviews, and the data has already been loaded into a managed table named "stg_sample_reviews" with four columns: id (the primary key), stars, product and review.
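If you want sample data to follow along with, statements like the following could create a matching staging table. The table and column names follow the article; the sample rows are illustrative. The strings can be passed to `wsclient.statement_execution.execute_statement()` exactly as in the main script.

```python
# Illustrative DDL and sample rows for the staging table used in this article.
create_stmt = """
CREATE OR REPLACE TABLE `stg_sample_reviews` (
  `id`      INT NOT NULL,
  `stars`   INT,
  `product` STRING,
  `review`  STRING
)
"""

insert_stmt = """
INSERT INTO `stg_sample_reviews` (`id`, `stars`, `product`, `review`) VALUES
  (1, 5, 'Headphones', 'Fantastic sound quality, very comfortable.'),
  (2, 2, 'Kettle',     'Stopped working after a week.')
"""
```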

Here is the Python script. Note it is good practice to handle your host and token credentials more securely than shown in this simple example. You might choose to use environment variables or a secret management service instead of hardcoding them.

Set the following values to those in your own Databricks workspace:

  • sqlWhId - The ID of your SQL Warehouse
  • vCatalog - Your Unity Catalog name
  • vSchema - Your Unity Catalog Schema name (a.k.a. Database name)

To authenticate with OpenAI you will need to set up an environment variable named OPENAI_API_KEY containing your API key.
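The Databricks host and token can be handled the same way. DATABRICKS_HOST and DATABRICKS_TOKEN are the environment variable names the Databricks SDK itself recognizes, so a sketch like this avoids the hardcoded literals shown in the script (the placeholder default host here is illustrative only):

```python
import os

# Read Databricks credentials from the environment rather than hardcoding them.
# DATABRICKS_HOST / DATABRICKS_TOKEN are the names the Databricks SDK looks
# for by default, so WorkspaceClient() with no arguments also works once set.
host = os.environ.get("DATABRICKS_HOST", "https://xxxxxxxxx.cloud.databricks.com")
token = os.environ.get("DATABRICKS_TOKEN", "")

# from databricks.sdk import WorkspaceClient
# wsclient = WorkspaceClient(host=host, token=token)
```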

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import StatementParameterListItem

from openai import OpenAI

sqlWhId = "xxxxxxxxxxxxxxxx"
vCatalog = "your-catalog-name"
vSchema = "your-schema"

modelname = "gpt-4"
tablename = "gpt_4"

systemPrompt = "Your job is to analyze online product reviews"

# Function that will be called for every input row
def process_row(review):
    userPrompt = f"""Provide a numeric rating that reflects the overall sentiment of the review.
The rating should be a single number between 1 and 5, where 1 represents the most negative sentiment and 5 represents the most positive sentiment.
Respond with the numeric rating only. Do not include any justification of the rating.
review: {review}"""

    ccresp = oaisdk.chat.completions.create(
                   model=modelname,
                   temperature=0,  # deterministic output suits a fixed scoring task
                   messages=[ {"role": "system", "content": systemPrompt},
                              {"role": "user",   "content": userPrompt} ])

    msg = ccresp.choices[0].message.content.strip()
    return msg

# Establish an OpenAI connection using environment variable OPENAI_API_KEY
oaisdk = OpenAI()

# Connect to workspace
wsclient = WorkspaceClient(
  host  = 'https://xxxxxxxxx.cloud.databricks.com',
  token = 'dapixxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
)

ddl = wsclient.statement_execution.execute_statement(warehouse_id=sqlWhId,
                                  catalog=vCatalog,
                                  schema=vSchema,
                                  statement=f"CREATE OR REPLACE TABLE `stg_sample_reviews_{tablename}` (`id` INT NOT NULL, `ai_score` VARCHAR(1024) NOT NULL)")

print(ddl.status)

# Fetch rows from the table
xsr = wsclient.statement_execution.execute_statement(warehouse_id=sqlWhId,
                                  catalog=vCatalog,
                                  schema=vSchema,
                                  statement="SELECT `id`, `review` FROM `stg_sample_reviews`")

print(xsr.status)

for r in xsr.result.data_array:
    ai_score = process_row(r[1])
    print(f"ID {r[0]}: Score: {ai_score}")
    dml = wsclient.statement_execution.execute_statement(warehouse_id=sqlWhId,
                                  catalog=vCatalog,
                                  schema=vSchema,
                                  statement=f"INSERT INTO `stg_sample_reviews_{tablename}` (`id`, `ai_score`) VALUES (:id, :ai_score)",
                                  parameters=[ StatementParameterListItem.from_dict({"name": "id", "type":"INT", "value": r[0]}),
                                               StatementParameterListItem.from_dict({"name": "ai_score", "type":"STRING", "value": ai_score}) ])

After running the above script, you should find a new table has been created, which contains the AI-generated review score for every input record. Join this table to the original on the common "id" column to compare the AI-generated sentiment scores against the original star review.
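That comparison join can be run with another execute_statement call. The SQL below assumes the table suffix gpt_4 from the script above; adjust the names if you changed tablename.

```python
# Compare original star ratings with the AI-generated sentiment scores.
# Table name assumes tablename = "gpt_4", as in the script above.
compare_stmt = """
SELECT s.`id`,
       s.`product`,
       s.`stars`    AS original_stars,
       a.`ai_score` AS ai_stars
FROM `stg_sample_reviews` s
JOIN `stg_sample_reviews_gpt_4` a
  ON s.`id` = a.`id`
ORDER BY s.`id`
"""

# result = wsclient.statement_execution.execute_statement(
#     warehouse_id=sqlWhId, catalog=vCatalog, schema=vSchema,
#     statement=compare_stmt)
```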

The LLM was asked to score between 1 and 5, so you may choose to classify the scores more broadly as follows:

  • 4 or 5 - Positive
  • 3 - Neutral
  • 1 or 2 - Negative
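A small helper can apply this broader classification. Note the model returns the score as text, so it is parsed defensively here; anything that is not a number from 1 to 5 falls back to None:

```python
def classify_score(ai_score):
    """Map a 1-5 rating (returned by the model as text) to a sentiment label."""
    try:
        score = int(ai_score.strip())
    except (ValueError, AttributeError):
        return None                 # model replied with something other than a number
    if score >= 4 and score <= 5:
        return "Positive"
    if score == 3:
        return "Neutral"
    if score >= 1:
        return "Negative"
    return None                     # outside the expected 1-5 range

print(classify_score("5"))   # Positive
print(classify_score("3"))   # Neutral
print(classify_score("2"))   # Negative
```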

Sentiment Analysis in Databricks using Matillion

In the Matillion Data Productivity Cloud, orchestration pipelines like the one shown in the screenshot below can:

  • Directly extract and load data, or call other pipelines to do so (as shown)
  • Invoke OpenAI GPT-4, with a nominated prompt, against all rows from a nominated table

Performing Sentiment Analysis in Databricks using Matillion

Data pipelines such as this manage all the connectivity and plumbing between the Databricks source and target tables, and the LLM.

This allows you to focus on the overall design and architecture, and the data analysis. To compare the AI-generated sentiment scores against the original star review, use a transformation pipeline like the one in the next screenshot.

Checking the results of Sentiment Analysis in Databricks using Matillion

The data sample shows two of the records. In one case the LLM's decision matches the original sentiment identically, but in the other record the ratings differ slightly. This is an example of the subjective nature of sentiment analysis.

Summary

Matillion is a versatile data pipeline platform designed for expedited AI and Analytics development at scale. It fosters productivity and collaboration with code-optional workflows, allowing seamless integration with hyperscalers, CDPs, LLMs, and more. The platform supports SQL, Python, and dbt coding, alongside a UI featuring pre-built components and extensive no-code connectors, including custom REST API connectors. With first-rate Git integration for asynchronous work, AI-generated documentation, and hybrid SaaS deployment, Matillion greatly enhances both agility and efficiency. Its pushdown ELT architecture, comprehensive data lineage, and advanced AI components, including vector store connectivity, equip data teams to handle complex data engineering tasks with unparalleled speed and flexibility.

For more examples of Matillion's AI components in action, check out our library of AI Videos and Demos.

To try Matillion yourself, using your own data, sign up for a free trial.

If you are already a Matillion user or trial customer, you can download the sentiment analysis example shown in the screenshots earlier, and run it on your own platform.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.