Sentiment Analysis in Databricks with OpenAI GPT-4 Turbo

This article explores the practical implementation of Sentiment Analysis using Databricks and OpenAI GPT-4 Turbo. It will demonstrate various techniques for performing sentiment analysis, primarily using Python, and outline how Databricks can streamline the process within a cloud-based environment.

It begins with a foundational overview of what sentiment analysis entails and why it matters for extracting insight from text, then describes OpenAI GPT-4 Turbo and its capabilities before moving on to the implementation.

What is Sentiment Analysis?

Sentiment analysis infers the emotional tone behind text data. Large language models (LLMs) such as GPT-4 Turbo are pre-trained on large amounts of text, which lets them classify sentiment as positive, negative, or neutral without task-specific training. They cope well with the nuances of human language, including sarcasm, idioms, and context-dependent meaning. Because they interpret a passage as a whole rather than matching individual keywords, their predictions are typically more context-aware than those of traditional rule-based or shallow machine learning methods.

Use cases include:

  • Customer Service Analytics: Monitoring support tickets and customer service interactions to gauge overall customer satisfaction and identify recurring pain points.
  • Financial Market Analysis: Analyzing news articles, earnings calls, and social media feeds to assess public sentiment about stocks, aiding trading strategies and market predictions.
  • Product Reviews: Aggregating and analyzing user reviews across e-commerce platforms to determine product popularity and inform product development and marketing strategies.

What is OpenAI GPT-4 Turbo?

OpenAI GPT-4 Turbo is a large language model in the GPT-4 family, offered through the OpenAI API with a larger context window and lower per-token pricing than the original GPT-4. It is well suited to natural language processing tasks where response time and operating cost matter. For data engineers and data architects, GPT-4 Turbo can help automate data annotation, generate technical documentation, and augment ETL processes with intelligent data transformation. Its handling of context also helps with generating complex queries and with building natural language interfaces to database systems.

These characteristics make GPT-4 Turbo a good fit for applications that need quick, near real-time text generation over long or context-rich inputs, such as customer support chatbots, automated content creation, drafting emails, generating code, and translation services. However, like any LLM, it may still produce inaccurate or biased outputs, and every call adds API cost and latency. Users should remain vigilant about the accuracy and ethical implications of the generated content.

How to perform Sentiment Analysis in Databricks with OpenAI GPT-4 Turbo using Python

Start by installing the prerequisite libraries (note the Databricks SDK for Python is in beta, version 0.28.0, at the time of writing):

python3 -m pip install databricks-sdk openai

Then load your source data into Databricks. The example below involves product reviews, and the data has already been loaded into a managed table named "stg_sample_reviews" with four columns: id (the primary key), stars, product and review.

Here is the Python script. Note it is good practice to handle your host and token credentials more securely than shown in this simple example. You might choose to use environment variables or a secret management service instead of hardcoding them.

Set the following values to those in your own Databricks workspace:

  • sqlWhId - The ID of your SQL Warehouse
  • vCatalog - Your Unity Catalog name
  • vSchema - Your Unity Catalog Schema name (a.k.a. Database name)

To authenticate with OpenAI you will need to set up an environment variable named OPENAI_API_KEY containing your API key.
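As one way to avoid hardcoding the Databricks host and token, the following minimal sketch reads them from environment variables. The helper name load_databricks_credentials is hypothetical, and the variable names DATABRICKS_HOST and DATABRICKS_TOKEN are an assumption that follows the naming used by the Databricks SDK; adjust them to whatever your environment or secret manager provides.

```python
import os


def load_databricks_credentials(env=os.environ):
    """Return the workspace host and token read from environment variables.

    Hypothetical helper: DATABRICKS_HOST and DATABRICKS_TOKEN are the
    variable names assumed for this sketch.
    """
    required = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of a later auth error
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"host": env["DATABRICKS_HOST"], "token": env["DATABRICKS_TOKEN"]}


# Usage (needs the databricks-sdk package, so it is left commented out here):
# from databricks.sdk import WorkspaceClient
# wsclient = WorkspaceClient(**load_databricks_credentials())
```

The returned dictionary has the same host and token keys that the script passes to WorkspaceClient, so it can replace the literal values there.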

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import StatementParameterListItem

from openai import OpenAI

sqlWhId = "xxxxxxxxxxxxxxxx"
vCatalog = "your-catalog-name"
vSchema = "your-schema"

modelname = "gpt-4-turbo"
tablename = "gpt_4_turbo"

systemPrompt = "Your job is to analyze online product reviews"

# Function that will be called for every input row
def process_row(resp):
    userPrompt = f"""Provide a numeric rating that reflects the overall sentiment of the review.
The rating should be a single number between 1 and 5, where 1 represents the most negative sentiment and 5 represents the most positive sentiment.
Respond with the numeric rating only. Do not include any justification of the rating.
review: {resp}"""

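    # temperature=0 keeps the scores as repeatable as possible between runs
    ccresp = oaisdk.chat.completions.create(
                   model=modelname,
                   temperature=0,
                   messages=[ {"role": "system", "content": systemPrompt},
                              {"role": "user",   "content": userPrompt} ])

    # The reply is free text, so strip whitespace before storing it
    msg = ccresp.choices[0].message.content.strip()
    return msg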

# Establish an OpenAI connection using environment variable OPENAI_API_KEY
oaisdk = OpenAI()

# Connect to workspace
wsclient = WorkspaceClient(
  host  = 'https://xxxxxxxxx.cloud.databricks.com',
  token = 'dapixxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
)

ddl = wsclient.statement_execution.execute_statement(warehouse_id=sqlWhId,
                                  catalog=vCatalog,
                                  schema=vSchema,
                                  statement=f"CREATE OR REPLACE TABLE `stg_sample_reviews_{tablename}` (`id` INT NOT NULL, `ai_score` VARCHAR(1024) NOT NULL)")

print(ddl.status)

# Fetch rows from the table. For large tables, note that result.data_array
# holds only the first chunk of the result set.
xsr = wsclient.statement_execution.execute_statement(warehouse_id=sqlWhId,
                                  catalog=vCatalog,
                                  schema=vSchema,
                                  statement="SELECT `id`, `review` FROM `stg_sample_reviews`")

print(xsr.status)

for r in xsr.result.data_array:
    ai_score = process_row(r[1])
    print(f"ID {r[0]}: Score: {ai_score}")
    # ai_score is bound as STRING to match the VARCHAR column, so an
    # unexpected non-numeric reply from the model does not fail the insert
    dml = wsclient.statement_execution.execute_statement(warehouse_id=sqlWhId,
                                  catalog=vCatalog,
                                  schema=vSchema,
                                  statement=f"INSERT INTO `stg_sample_reviews_{tablename}` (`id`, `ai_score`) VALUES (:id, :ai_score)",
                                  parameters=[ StatementParameterListItem.from_dict({"name": "id", "type": "INT", "value": r[0]}),
                                               StatementParameterListItem.from_dict({"name": "ai_score", "type": "STRING", "value": ai_score}) ])

After running the above script, you should find that a new table has been created, containing the AI-generated review score for every input record. Join this table to the original on the common "id" column to compare the AI-generated sentiment scores against the original star ratings.
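The join described above could be sketched as follows. The function name build_comparison_query and the column aliases are hypothetical; the table and column names are the ones used in the script.

```python
import re


def build_comparison_query(tablename="gpt_4_turbo"):
    """Build a SQL statement joining the source reviews to the AI scores.

    Hypothetical helper: the result table name follows the
    stg_sample_reviews_<tablename> pattern used by the script.
    """
    # Only allow simple identifiers, because the name is spliced into the SQL text
    if not re.fullmatch(r"[A-Za-z0-9_]+", tablename):
        raise ValueError(f"Unexpected table suffix: {tablename!r}")
    return (
        "SELECT r.`id`, r.`stars`, s.`ai_score` "
        "FROM `stg_sample_reviews` AS r "
        f"JOIN `stg_sample_reviews_{tablename}` AS s ON s.`id` = r.`id` "
        "ORDER BY r.`id`"
    )


print(build_comparison_query())
```

The resulting string can be passed as the statement argument of execute_statement, in the same way as the SELECT in the script.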

The LLM was asked to score between 1 and 5, so you may choose to classify the scores more broadly as follows:

  • 4 or 5 - Positive
  • 3 - Neutral
  • 1 or 2 - Negative
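A minimal sketch of that bucketing, assuming the stored ai_score is the model's raw text reply. The function names parse_score and classify are hypothetical, and the "Unknown" label for replies that are not a single digit from 1 to 5 is an addition for illustration, not part of the classification above.

```python
import re


def parse_score(raw):
    """Return the reply as an integer from 1 to 5, or None if it is anything else."""
    match = re.fullmatch(r"[1-5]", raw.strip())
    return int(match.group(0)) if match else None


def classify(score):
    """Map a 1-5 score to a broad sentiment label."""
    if score is None:
        return "Unknown"  # assumption: label for replies that could not be parsed
    if score >= 4:
        return "Positive"
    if score == 3:
        return "Neutral"
    return "Negative"


for reply in ["5", " 3\n", "1", "The rating is 4"]:
    print(repr(reply), "->", classify(parse_score(reply)))
```

The parsing is deliberately strict: a reply containing any extra wording is treated as unparsed rather than guessed at, which makes prompt problems easy to spot in the results.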

Sentiment Analysis in Databricks using Matillion

In the Matillion Data Productivity Cloud, orchestration pipelines like the one shown in the screenshot below can:

  • Directly extract and load data, or call other pipelines to do so (as shown)
  • Invoke OpenAI GPT-4 Turbo, with a nominated prompt, against all rows from a nominated table

Performing Sentiment Analysis in Databricks using Matillion

Data pipelines such as this manage all the connectivity and plumbing between the Databricks source and target tables, and the LLM.

This allows you to focus on the overall design and architecture, and on the data analysis. To compare the AI-generated sentiment scores against the original star ratings, use a transformation pipeline like the one in the next screenshot.

Checking the results of Sentiment Analysis in Databricks using Matillion

The data sample shows two of the records. In one case the LLM's score exactly matches the original star rating, but in the other the two ratings differ slightly. This illustrates the subjective nature of sentiment analysis.
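To put a number on how closely the two ratings agree across a whole table, a small summary like the sketch below may help. The function name agreement_summary is hypothetical, and the sample pairs are made-up illustrative values, not results from the article.

```python
def agreement_summary(pairs):
    """Summarize agreement between (stars, ai_score) integer pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("At least one (stars, ai_score) pair is required")
    exact = sum(1 for stars, ai in pairs if stars == ai)
    total_diff = sum(abs(stars - ai) for stars, ai in pairs)
    return {
        "n": len(pairs),
        "exact_match_rate": exact / len(pairs),
        "mean_abs_diff": total_diff / len(pairs),
    }


# Made-up illustrative pairs: one exact match, one rating that differs by one
sample = [(5, 5), (4, 3)]
print(agreement_summary(sample))
```

The exact match rate shows how often the LLM reproduces the star rating, while the mean absolute difference shows how far apart the two are when they disagree.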

Summary

Matillion is a data pipeline platform designed to accelerate building and managing data pipelines for AI and analytics at scale. It supports both no-code and code-optional configurations, and integrates with hyperscalers, CDPs, and LLMs. Pre-built connectors, plus the ability to create custom connectors to REST APIs, extend its reach, and users can work in SQL, Python, and dbt within its UI. With first-class Git integration, AI-generated documentation, data lineage, and hybrid SaaS deployment, Matillion helps data teams orchestrate complex pipeline tasks efficiently using pushdown ELT with AI components.

For more examples of Matillion's AI components in action, check out our library of AI Videos and Demos.

To try Matillion yourself, using your own data, sign up for a free trial.

If you are already a Matillion user or trial customer, you can download the sentiment analysis example shown in the screenshots earlier, and run it on your own platform.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell
