Large Language Model Prompt Engineering for Common Data Problems

In the ever-evolving landscape of data engineering, large language models (LLMs) have become an essential part of the toolkit. As data practitioners begin to harness their power, a crucial aspect comes to the forefront: prompt engineering.

Crafting effective prompts can significantly enhance the performance and relevance of various natural language processing tasks. Prompt engineering tends to be an iterative process, beginning with simple statements and then refining them depending on the output you receive from the LLM.

In this article, I will delve into the key aspects of prompt engineering, focusing on Sentiment Analysis, Classification, and Named Entity Recognition (NER).

Sentiment Analysis

Sentiment Analysis involves asking an LLM to determine the feeling or emotion expressed in a text, typically whether it is positive or negative.

Start with a very general prompt like this:

Perform sentiment analysis on the following text:
Text: {the text}

This will most likely produce a few sentences giving a score and explaining the reasoning. However, unstructured text output is hard to analyze and aggregate in downstream processes.

Instead, it is better to apply the fundamental rules of prompt engineering:

  • Be specific
  • Be precise
  • Ask one question at a time

With these fundamental rules in mind, you can write more exact requests, such as:

In one word, classify the sentiment of the following text as either positive or negative

In the following text, is the customer happy? Answer yes or no only

Analyze the sentiment of the following text and provide a sentiment score on a scale of 1 to 10, where 1 is extremely negative and 10 is extremely positive.
Reply with a numeric value only.
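
As a minimal sketch of how the numeric-score prompt might be used in a pipeline, the Python below builds the prompt and validates the reply. The function names `build_score_prompt` and `parse_score` are hypothetical, and the call to the LLM itself is deliberately left out because it depends on your provider.

```python
import re
from typing import Optional

# Prompt wording taken from the article; {text} is filled in per record.
SCORE_PROMPT = (
    "Analyze the sentiment of the following text and provide a sentiment score "
    "on a scale of 1 to 10, where 1 is extremely negative and 10 is extremely "
    "positive.\n"
    "Reply with a numeric value only.\n"
    "Text: {text}"
)


def build_score_prompt(text: str) -> str:
    """Insert the source text into the prompt template."""
    return SCORE_PROMPT.format(text=text)


def parse_score(reply: str) -> Optional[int]:
    """Return the score as an int from 1 to 10, or None if the reply is unusable.

    Assumption: a reply such as "7", "7." or "7.0" is acceptable; anything
    else (extra words, fractions, out-of-range values) is rejected.
    """
    match = re.fullmatch(r"(\d+)(?:\.0+)?", reply.strip().rstrip("."))
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None
```

Returning None for anything unexpected, rather than guessing, makes it easy to count and retry the replies that ignored the "numeric value only" instruction.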

In other words, you will obtain more useful sentiment analysis results by tailoring the prompt to your specific objectives.

Most LLMs already know what "sentiment" means. If you want to perform a more industry-specific classification task, you must provide more context.

Classification

Classification tasks involve categorizing input text into predefined classes or labels. Prompt engineering for classification tasks, therefore, means clearly defining the categories.

Begin the prompt with a clear and precise instruction, but this time, introduce a list that contains all the allowable categories:

Classify the following text snippet into one of the predefined topics.
Topics: ["News", "Sports", "Weather", "Entertainment"]
Text snippet: {the text}

The response from an LLM is generally not deterministic, and it will not always follow the format you expect. If the response argues that the text doesn't match any of the topics, you could try being more specific about the output format:

Choose the best classification from the text. Output format: "Topic: best_from_list"
Topics: ["News", "Sports", "Weather", "Entertainment"]
Text: {the text}

Or add a hint to use a nominated value if there is no good match:

Reply with the topic "Other" if the text does not match any topic well.
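
The classification instructions above can be combined and checked in code. The following is a sketch only: `build_classification_prompt` and `parse_topic` are hypothetical helper names, the LLM call is omitted, and treating any unrecognized reply as "Other" is an assumption about how you want to handle replies that stray from the list.

```python
from typing import Sequence

TOPICS = ("News", "Sports", "Weather", "Entertainment")
FALLBACK = "Other"


def build_classification_prompt(text: str, topics: Sequence[str] = TOPICS) -> str:
    """Assemble a prompt with the topic list, output format and fallback hint."""
    topic_list = ", ".join(f'"{topic}"' for topic in topics)
    return (
        "Classify the following text snippet into one of the predefined topics. "
        'Output format: "Topic: best_from_list"\n'
        f"Topics: [{topic_list}]\n"
        f'Reply with the topic "{FALLBACK}" if the text does not match any topic well.\n'
        f"Text snippet: {text}"
    )


def parse_topic(reply: str, topics: Sequence[str] = TOPICS) -> str:
    """Map a reply onto one of the allowed topics, or the fallback label."""
    label = reply.strip()
    # Drop the optional "Topic:" prefix requested in the output format.
    if label.lower().startswith("topic:"):
        label = label[len("topic:"):]
    # Remove surrounding whitespace, quotes and a trailing period.
    label = label.strip(" \t\r\n\"'.")
    for topic in topics:
        if label.lower() == topic.lower():
            return topic
    return FALLBACK
```

Validating the reply against the same list that went into the prompt keeps unexpected labels out of downstream aggregations.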

As you can see, to get the best results from the LLM, clarity and specificity are key. To help with this, consider expressing your intent as a shorter prompt, for example:

Classify the text as either "News", "Sports", "Weather" or "Entertainment".
Reply with one word only.
Text: {the text}

This can be a good way to keep the LLM focused on the specific categories relevant to your workflow.

Named Entity Recognition

Named Entity Recognition, or NER, means examining the text and identifying named "entities" within it. These might be, for example, people, organizations, locations, or dates. Because they are trained on large amounts of language, LLMs can usually do this without any extra help. So, an initial prompt could be as simple as:

Perform Named Entity Recognition on the following text
Text: {the text}

You will probably receive a relatively unstructured piece of text in reply, which includes every single entity that the LLM identified in the text.

To improve both accuracy and relevance, make the prompt more specific about what types of entity to find. For example:

Identify and list all the locations mentioned in the text

Or:

What product is being discussed in the text?

Rather than accepting free text, it is better to have the LLM report the different types of entity in a structured way. JSON is an excellent intermediate step between unstructured source material and structured analytics. You can tell the LLM to generate JSON by adding another instruction to the prompt, like this:

Identify and list all the organizations, locations and dates mentioned in the text.
Format the output as JSON with the following keys:
- Organization: an array of organizations
- Location: an array of locations
- Date: an array of dates
- Other: an array of other named entities
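
A reply to this prompt still needs to be parsed and checked before it can be trusted. The sketch below assumes the reply is a string that contains one JSON object, possibly wrapped in extra text or a Markdown fence; `parse_entities` is a hypothetical helper name, and normalizing missing or non-list values to lists is an assumption about what downstream steps expect.

```python
import json
from typing import Dict, List

# The keys requested in the prompt.
ENTITY_KEYS = ("Organization", "Location", "Date", "Other")


def parse_entities(reply: str) -> Dict[str, List[str]]:
    """Extract the JSON object from a reply and normalize it to lists of strings.

    Raises ValueError if no JSON object can be found or decoded.
    """
    # Keep only the outermost {...} in case the model added text around it.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])

    entities: Dict[str, List[str]] = {}
    for key in ENTITY_KEYS:
        value = data.get(key) or []  # missing key or null becomes an empty list
        if not isinstance(value, list):
            value = [value]          # a single value becomes a one-item list
        entities[key] = [str(item) for item in value]
    return entities
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` alone to handle both a missing object and malformed JSON.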

Now, the JSON output by the LLM has a far higher information density than the original unstructured text. It can be easily processed by semi-structured ETL components downstream.
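
To illustrate the downstream step, here is a small sketch that flattens the entity JSON into one row per entity, a shape that is easy to load into a table and aggregate. It assumes the reply has already been parsed into a dictionary with the keys requested in the prompt; `entity_rows` and the `doc_id` parameter are hypothetical names introduced for illustration.

```python
from typing import Dict, Iterator, List, Tuple


def entity_rows(
    doc_id: str, entities: Dict[str, List[str]]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (doc_id, entity_type, entity) rows from a parsed NER result."""
    for entity_type, values in entities.items():
        for value in values:
            yield (doc_id, entity_type, value)


# Example with made-up data: empty arrays simply produce no rows.
rows = list(
    entity_rows(
        "doc-1",
        {"Organization": ["Acme Corp"], "Location": ["Paris", "Lyon"], "Date": []},
    )
)
```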

How can Data Engineers work with LLMs at scale?

LLMs are the next advancement in the field of data engineering. They bring the ability to take unstructured text and process it to create data suitable for data science, analysis, and aggregation.

In this new context, data engineers still face a dual challenge: architecting and managing robust infrastructure for seamless data processing, while also developing innovative solutions to meet evolving business and analytical needs.

The Matillion Data Productivity Cloud handles this scalability problem by automatically dealing with all the backend plumbing on your behalf. This leaves you free to concentrate on what makes your LLM solution unique: the prompts.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.