Sentiment Analysis in Databricks with Cohere Command using Amazon Bedrock

Sentiment Analysis is a powerful technique that enables organizations to gain valuable insights from unstructured data sources, such as product reviews, social media posts, and customer feedback.

This article demonstrates two approaches to Sentiment Analysis in Databricks using Cohere Command through Amazon Bedrock: a Python script, and a Matillion pipeline. It starts with the fundamentals of Sentiment Analysis and a short introduction to Cohere Command, a large language model (LLM) for natural language processing (NLP).

What is Sentiment Analysis?

Sentiment Analysis is a technique that extracts a numeric sentiment score from unstructured text, quantifying opinions expressed in social media, customer reviews, and other textual data sources. It uses Natural Language Processing (NLP) and machine learning to gauge the polarity of the sentiment: positive, negative, or neutral.

Large Language Models (LLMs), such as Cohere Command, are well suited to Sentiment Analysis. Pre-trained on vast corpora, these models discern contextual nuances and intricate human emotions within text. One approach converts raw text into embeddings and classifies them with a model fine-tuned on labeled sentiment datasets. Another approach, used in this article, prompts the model directly and asks it to return a sentiment score.

Reliable data preparation is vital. Sentiment Analysis can provide valuable insights for data-driven decision-making, but it requires careful preprocessing, model selection, and interpretation to account for context, sarcasm, and domain-specific language.

Data engineers must cleanse, normalize, and tokenize text data from databases, ensuring consistency and accuracy before feeding it into an LLM. They encapsulate this preprocessing in robust, scalable pipelines that integrate seamlessly with model interfaces, enabling real-time sentiment analysis and driving actionable insights.
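As a rough illustration of the cleansing and normalizing step, the sketch below tidies a review before it is placed in a prompt. The function name `clean_review` and the 2000-character cap are assumptions made for this example, not part of any SDK. Tokenization is left out because the LLM tokenizes the prompt itself.

```python
import html
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean_review(text, max_chars=2000):
    """Return a tidied copy of a review, or '' if the input is missing.

    max_chars is an assumed cap to keep prompts short, not a model limit.
    """
    if text is None:
        return ""
    text = _TAG_RE.sub(" ", text)               # drop leftover HTML tags
    text = html.unescape(text)                  # decode entities such as &amp;
    text = unicodedata.normalize("NFKC", text)  # unify look-alike characters
    text = _WS_RE.sub(" ", text).strip()        # collapse runs of whitespace
    return text[:max_chars]


print(clean_review("  Great&nbsp;phone!<br/>Battery   lasts\n2 days. "))
```

In the script later in this article, a helper like this could be applied to each review before it is passed to the model.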

Business examples of Sentiment Analysis:

  • Customer feedback analysis: Analyze product reviews, social media mentions, and support interactions to gauge customer sentiment and identify areas for improvement.
  • Brand monitoring: Track online conversations about a brand, product, or service to understand public perception and respond to negative sentiment.
  • Political campaign analysis: Analyze social media posts, news articles, and public comments to gauge sentiment towards candidates, policies, or issues.

What is Cohere Command?

Cohere Command is a Large Language Model (LLM) designed to generate text, enhance search functionality, and perform various natural language understanding tasks. Technically, it uses a transformer-based architecture similar to GPT-3, pre-trained on extensive datasets to understand context and nuance in human language. The model supports a range of applications, from simple text autocomplete to complex knowledge extraction.

Pros:

  • High accuracy in text generation and comprehension.
  • Versatile application across multiple industries.
  • Scalable across different levels of text complexity.

Cons:

  • Requires significant computational resources.
  • Performance can degrade with ambiguous inputs.
  • Potential biases inherited from training data.

Ideal Use Cases:

  • Generating high-quality, contextually appropriate text for content creation.
  • Enhancing customer support with intelligent chatbots.
  • Automating document summarization and analysis.

How to perform Sentiment Analysis in Databricks with Cohere Command using Python with the Amazon Bedrock SDK

Prerequisites for the boto3 Amazon Bedrock Python SDK

Start by installing the prerequisite libraries (note the Databricks SDK for Python is in beta, version 0.28.0, at the time of writing):

python3 -m pip install databricks-sdk boto3

Then load your source data into Databricks.

Python boto3 for Cohere Command and Databricks SDK

The example below involves product reviews, and assumes that the source data has been loaded into a table named "stg_sample_reviews" with four columns: id (the primary key), stars, product and review.

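The Python script is shown below. It reads the Databricks connection details from environment variables, and boto3 resolves AWS credentials through its default credential chain. It is good practice to handle credentials more securely than in this simple example, for instance with a secret management service rather than environment variables or hardcoded values.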

import json
import logging
import os

import boto3
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import StatementParameterListItem

logger = logging.getLogger("demo")

# Create the Bedrock runtime client once and reuse it for every review
bedrock = boto3.client(service_name="bedrock-runtime", region_name="us-east-1")
MODEL_ID = "cohere.command-text-v14"

# Use the Amazon Bedrock InvokeModel API to score one review
def analyze_sentiment(text):
    prompt = f"""Your job is to analyze online product reviews.
Provide a numeric rating that reflects the overall sentiment of the review.
The rating should be a single number between 1 and 5, where 1 represents the most negative sentiment and 5 represents the most positive sentiment.
Respond with only your numeric rating. Do not include any justification of the rating. Use only numbers in your response.
Review: {text}
"""
    # A low temperature keeps the rating repeatable; the reply is one digit, so cap the output length
    body = json.dumps({"prompt": prompt, "temperature": 0.0, "max_tokens": 5})
    response = bedrock.invoke_model(body=body, modelId=MODEL_ID, accept="application/json", contentType="application/json")
    response_body = json.loads(response["body"].read())
    return response_body["generations"][0]["text"].strip()

sqlWhId = os.environ["DBKS_SQL_WHID"]
vCatalog = os.environ["DBKS_CATALOG"]
vSchema = os.environ["DBKS_SCHEMA"]

# Connect to workspace
wsclient = WorkspaceClient(
  host  = os.environ["DBKS_HOSTURL"],
  token = os.environ["DBKS_TOKEN"]
)

# Create (or replace) the target table for the AI-generated scores
ddl = wsclient.statement_execution.execute_statement(warehouse_id=sqlWhId,
                                  catalog=vCatalog,
                                  schema=vSchema,
                                  statement="CREATE OR REPLACE TABLE `stg_sample_reviews_genai` (`id` INT NOT NULL, `ai_score` VARCHAR(1024) NOT NULL)")

# Fetch rows from the table
xsr = wsclient.statement_execution.execute_statement(warehouse_id=sqlWhId,
                                  catalog=vCatalog,
                                  schema=vSchema,
                                  statement="SELECT `id`, `review` FROM `stg_sample_reviews`")

# Score each review and store the result (data_array is None when no rows are returned)
for r in xsr.result.data_array or []:
    ai_score = analyze_sentiment(r[1])
    # ai_score is bound as STRING to match the VARCHAR column and to tolerate non-numeric replies
    dml = wsclient.statement_execution.execute_statement(warehouse_id=sqlWhId,
                                  catalog=vCatalog,
                                  schema=vSchema,
                                  statement="INSERT INTO `stg_sample_reviews_genai` (`id`, `ai_score`) VALUES (:id, :ai_score)",
                                  parameters=[ StatementParameterListItem.from_dict({"name": "id", "type": "INT", "value": r[0]}),
                                               StatementParameterListItem.from_dict({"name": "ai_score", "type": "STRING", "value": ai_score}) ])

After running the above script, you should find that a new table has been created, containing the AI-generated review score for every input record. Join this table to the original on the common id column to compare the AI-generated sentiment scores against the original star ratings.
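One possible way to make that comparison is sketched below. The query text and the `agreement` helper are illustrative assumptions: the table and column names follow the example in this article, and the helper assumes that both stars and ai_score hold whole numbers (rows that do not are skipped).

```python
# Hypothetical comparison query; run it with
# wsclient.statement_execution.execute_statement as in the script above.
COMPARE_SQL = """
SELECT r.`id`, r.`stars`, g.`ai_score`
FROM `stg_sample_reviews` r
JOIN `stg_sample_reviews_genai` g ON g.`id` = r.`id`
"""


def agreement(rows):
    """Summarise (id, stars, ai_score) rows into simple agreement counts."""
    compared = exact = within_one = 0
    for _id, stars, ai_score in rows:
        try:
            diff = abs(int(stars) - int(ai_score))
        except (TypeError, ValueError):
            continue  # skip rows where either value is not a whole number
        compared += 1
        exact += diff == 0
        within_one += diff <= 1
    return {"compared": compared, "exact": exact, "within_one": within_one}


# Sample rows shaped like result.data_array, where values arrive as strings
sample = [["1", "5", "5"], ["2", "2", "3"], ["3", "4", "n/a"]]
print(agreement(sample))
```

Counting both exact matches and near misses is a design choice: a difference of one star often reflects the subjectivity of the rating rather than a model error.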

The LLM was asked to score between 1 and 5, so you may choose to classify the scores more broadly as follows:

  • 4 or 5 - Positive
  • 3 - Neutral
  • 1 or 2 - Negative
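The mapping above can be sketched in a few lines of Python. The helper names `parse_score` and `sentiment_label` are made up for this example, and the "Unknown" label is an added assumption to cover replies that contain no usable rating, since an LLM does not always follow the instruction to respond with a single number.

```python
import re


def parse_score(raw):
    """Return the first standalone digit 1-5 found in the reply, or None."""
    match = re.search(r"\b[1-5]\b", raw or "")
    return int(match.group()) if match else None


def sentiment_label(score):
    """Map a 1-5 score to a broad sentiment class."""
    if score is None:
        return "Unknown"  # assumed fallback, not part of the original scheme
    if score >= 4:
        return "Positive"
    if score == 3:
        return "Neutral"
    return "Negative"


for reply in ["5", " 2\n", "Rating: 3", "great product"]:
    print(repr(reply), "->", sentiment_label(parse_score(reply)))
```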

Sentiment Analysis in Databricks using Matillion to run Cohere Command via Amazon Bedrock

In the Matillion Data Productivity Cloud, orchestration pipelines like the one shown in the screenshot below can:

  • Directly extract and load data, or call other pipelines to do so (as shown)
  • Invoke Cohere Command, with a nominated prompt, against all rows from a nominated table

[Screenshot: Sentiment Analysis in Databricks using Matillion]

Data pipelines such as this manage all the connectivity and plumbing between the Databricks source and target tables, and the LLM.

This allows you to focus on the overall design and architecture, and on the data analysis. To compare the AI-generated sentiment scores against the original star ratings, use a transformation pipeline like the one in the next screenshot.

[Screenshot: Checking the results of Sentiment Analysis in Databricks using Matillion]

The data sample shows two of the records. In one case the LLM's score matches the original star rating exactly, while in the other the two ratings differ slightly. This illustrates the subjective nature of sentiment analysis.

Summary

Matillion is a data pipeline platform that empowers data teams to build and manage pipelines faster for AI and analytics at scale. It offers a code-optional UI with pre-built components, or users can code in SQL, Python, or dbt. Matillion integrates with hyperscalers, CDPs, and LLMs, and has first-class Git integration for asynchronous collaboration.

It provides AI-generated documentation, no-code connectors, REST API connectivity, and parameterization with variables. Matillion's components work seamlessly together on one platform, enabling hybrid SaaS deployment, data lineage, pushdown ELT, vector store connectivity, and reverse ETL for AI insights. Its AI components facilitate generative AI prompting, while Matillion Copilot allows natural language pipeline building.

For more examples of Matillion's AI components in action, check out our library of AI Videos and Demos.

To try Matillion yourself, using your own data, sign up for a free trial.

If you are already a Matillion user or trial customer, you can download the sentiment analysis example shown in the screenshots earlier, and run it on your own platform.
