Unearthing Hidden Gems: Mining Value from Unstructured Data with AI Prompt Components

A modern irony is that the most critical challenge facing businesses is no longer a lack of data, but rather making full use of the data they already have.

We have arrived at a point where a very significant proportion of the data available to any enterprise is in the form of unstructured text. It might have been captured directly - for example, from messages or documents - or perhaps sourced from other types of unstructured data such as video or call recordings.

On its own, all that unstructured data is just a problematic deluge. But it's actually a huge potential source of crucial information, elusively buried inside those vast plains of text. Large Language Models (LLMs) - particularly when driven through AI Prompt components - have become a key tool for data engineers to unlock the hidden value.

Augmenting Unstructured Data

Consider, for instance, a dataset of barista reviews from hypothetical customers of a coffee shop. The review comments contain hints at customer sentiment, how likely they are to remain loyal, and their unbiased opinions on operational performance.

Together with the identity of the reviewer, a timestamp, and the location of the store, this data represents a trove of potential insights into, for example:

  • Trends over time
  • Spatial variation
  • Operational differences between stores

But the unstructured nature of the comments makes it very challenging for traditional data engineering techniques to extract information that's suitable for analysis. Augmented methods are needed to transform the unstructured review texts into a neatly structured table that works in the relational, analytic world.

The necessary data engineering step is to take the review comments one by one, and pass them through an LLM with an effective prompting mechanism that extracts all the nuggets of potential from every record.

LLM Prompt Data Architecture

In data engineering terms, we need to extract structured columns for every single unstructured review text.

LLMs can comprehend, generate, and manipulate unstructured text at an unprecedented scale, revolutionizing how we interact with and derive insights from language-based data. In this case, for example, an LLM can easily assess:

  • Sentiment - broadly as positive or negative
  • Level of anger, and the reason for it - the level is an example of scoring, which returns a numeric value within a bounded scale and is well suited to statistical analysis
  • Likelihood of churn and whether the customer is likely to return - these are known as classification problems
  • What product they chose - an example of Named Entity Recognition (NER)
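As a rough sketch of what "structured columns" could mean here, the snippet below defines a typed record for the features listed above and validates a JSON reply before accepting it. The field names, the 0-10 anger scale, and the sample reply are all illustrative assumptions written by hand for this example, not the output of any particular LLM or the schema of any particular product.

```python
import json
from dataclasses import dataclass


@dataclass
class ReviewFeatures:
    """Structured columns for one review (hypothetical field names)."""
    sentiment: str          # "positive" or "negative"
    anger_level: int        # score on an assumed bounded scale of 0-10
    anger_reason: str       # short free-text reason
    likely_to_return: bool  # classification result
    product: str            # named entity: the product mentioned


def parse_features(llm_reply: str) -> ReviewFeatures:
    """Validate a JSON reply and convert it to a typed record."""
    data = json.loads(llm_reply)
    sentiment = str(data["sentiment"]).lower()
    if sentiment not in ("positive", "negative"):
        raise ValueError(f"unexpected sentiment: {sentiment!r}")
    anger = int(data["anger_level"])
    if not 0 <= anger <= 10:
        raise ValueError(f"anger_level out of range: {anger}")
    likely = data["likely_to_return"]
    if not isinstance(likely, bool):
        raise ValueError("likely_to_return must be true or false")
    return ReviewFeatures(
        sentiment=sentiment,
        anger_level=anger,
        anger_reason=str(data["anger_reason"]),
        likely_to_return=likely,
        product=str(data["product"]),
    )


# Hand-written sample reply, used only to exercise the parser.
sample_reply = (
    '{"sentiment": "Negative", "anger_level": 7, "anger_reason": "long wait", '
    '"likely_to_return": false, "product": "flat white"}'
)
features = parse_features(sample_reply)
print(features)
```

Rejecting out-of-range or mistyped values at this point keeps bad rows out of the structured table, which matters because an LLM's reply is not guaranteed to follow the requested format.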

An LLM can even write its own deliberately humorous summary of the customer’s comment, as you can see in the example below:

Augmenting unstructured data with AI

The creativity of the summary is governed by the "temperature" supplied to the LLM. Higher temperatures produce more varied, imaginative answers; lower temperatures produce more stable, repeatable ones.
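To make the effect of temperature concrete, here is a minimal sketch of the usual way temperature is applied when sampling text: the model's raw scores for each candidate token are divided by the temperature before being turned into probabilities. The three scores below are made-up numbers, and a given LLM provider may apply further adjustments on top of this.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after scaling by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate next tokens
cold = softmax_with_temperature(logits, 0.2)  # low temperature
hot = softmax_with_temperature(logits, 2.0)   # high temperature
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At the low temperature almost all of the probability sits on the top-scoring token, so the output is stable. At the high temperature the probabilities are much closer together, so less likely tokens are chosen more often and the text becomes more varied.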

The example contains several useful pieces of information, but it is only a sample of what can be done. The only real limit is your imagination!

As you can imagine, there's quite a lot of plumbing in the background to support this data architecture. Simply presenting the unstructured input correctly to the LLM via a prompt is nontrivial. At the same time, the underlying APIs are evolving rapidly. Wrapping, automating, and orchestrating this ETL work is the job of an AI Prompt component.
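One small part of that plumbing is embedding free text safely in a prompt. The sketch below shows one possible approach, under the assumption that the LLM is asked to reply in JSON: the review is JSON-encoded before being placed in the template, so quotes and line breaks in the comment cannot disturb the prompt layout. The template wording and the `build_prompt` helper are hypothetical, not taken from any specific product.

```python
import json

# Hypothetical prompt template; the wording and key names are illustrative.
PROMPT_TEMPLATE = """You are analysing a customer review of a coffee shop.
Return only a JSON object with these keys:
  sentiment (positive or negative), anger_level (integer 0-10),
  likely_to_return (true or false), product (string).

Review (JSON-encoded string):
{review}
"""


def build_prompt(review_text: str) -> str:
    """Embed one review in the template, escaping quotes and newlines."""
    return PROMPT_TEMPLATE.format(review=json.dumps(review_text))


sample_review = 'The "seasonal special" was cold.\nNever again!'
prompt = build_prompt(sample_review)
print(prompt)
```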

AI Prompt Components

The task of an AI Prompt component is to integrate three building blocks of your solution architecture:

  1. Your Large Language Model
  2. A prompt crafted to extract all the desired features from the unstructured text
  3. The database table containing all your unstructured text records

The AI Prompt component takes all the source records and feeds them one by one into the LLM via the prompt. It interacts with the LLM's programmatic interface - perhaps an SDK or just API calls - and collects all the outputs to create a brand-new, structured data table.
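The row-by-row flow described above can be sketched in a few lines. This is a simplified illustration rather than how any real AI Prompt component is implemented: `run_prompt_component`, the column names, and the sample rows are hypothetical, and `fake_llm` is a keyword-based stand-in for a real LLM call so that the example runs offline.

```python
import json
from typing import Callable, Iterable, List


def run_prompt_component(
    rows: Iterable[dict],
    call_llm: Callable[[str], str],
    id_column: str = "review_id",
    text_column: str = "comment",
) -> List[dict]:
    """Feed each source row through the LLM; return one structured row per input."""
    output = []
    for row in rows:
        prompt = f"Return the sentiment of this review as JSON:\n{row[text_column]}"
        reply = json.loads(call_llm(prompt))
        # Pass the identifier through so each output row can be traced to its source.
        output.append({id_column: row[id_column], **reply})
    return output


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: a crude keyword rule, for illustration only."""
    sentiment = "negative" if "cold" in prompt.lower() else "positive"
    return json.dumps({"sentiment": sentiment})


# Made-up sample rows.
source_rows = [
    {"review_id": 1, "comment": "Lovely flat white, friendly barista."},
    {"review_id": 2, "comment": "My latte was cold and the queue was long."},
]
structured = run_prompt_component(source_rows, fake_llm)
print(structured)
```

Passing the LLM call in as a function keeps the loop independent of any particular SDK or API, which is helpful when those interfaces change.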

Architecture of an AI Prompt Component

Converting unstructured data into structured data like this is beginning to open up the goldmine of information.

But for data engineers, there's a little more work to be done. The new columns added by the LLM are solely based on the review comment. Integrating these new columns with all the original fields, such as the review date and location, is the final stage of data enrichment needed to enable a fully comprehensive analysis.

This integration is typically done using a relational join, with a unique identifier for each record that the AI Prompt component has passed through to its output. If there's no naturally unique identifier in the source data, it's fine to add a surrogate key instead.
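As an illustration of that join, the following sketch uses an in-memory SQLite database. The table names, column names, and sample rows are invented for the example; the `review_id` column acts as a surrogate key that SQLite assigns automatically, and the feature rows are assumed to carry that same key through from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews (
        review_id   INTEGER PRIMARY KEY,  -- surrogate key assigned by SQLite
        store       TEXT,
        reviewed_at TEXT,
        comment     TEXT
    );
    CREATE TABLE review_features (review_id INTEGER, sentiment TEXT);
""")

# Original fields; review_id is omitted so SQLite generates 1, 2, ...
conn.executemany(
    "INSERT INTO reviews (store, reviewed_at, comment) VALUES (?, ?, ?)",
    [("Store A", "2024-05-01", "Lovely flat white."),
     ("Store B", "2024-05-02", "Cold latte.")],
)

# Columns derived by the LLM, keyed by the identifier passed through.
conn.executemany(
    "INSERT INTO review_features (review_id, sentiment) VALUES (?, ?)",
    [(1, "positive"), (2, "negative")],
)

# Join the derived columns back onto the original fields.
enriched = conn.execute("""
    SELECT r.review_id, r.store, r.reviewed_at, f.sentiment
    FROM reviews AS r
    JOIN review_features AS f ON f.review_id = r.review_id
    ORDER BY r.review_id
""").fetchall()
print(enriched)
```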

Your fresh-out-of-the-oven structured data table thus incorporates all original columns from your data source alongside new columns added through AI. This lays the groundwork for all subsequent analyses and dashboarding tasks.

Data Engineer to Prompt Engineer: Matillion’s AI Prompt Component implementation

Matillion’s Data Productivity Cloud comes armed with AI Prompt components that perform the tasks I have outlined. They empower a data engineer to easily perform LLM prompt engineering. There are versions for OpenAI (optionally hosted on Azure) and for Amazon Bedrock.

It can be fun to try these components yourself and experiment with the prompts and the temperature. Try out this quickstart set of Data Productivity Cloud pipelines to experiment with mining the barista reviews yourself.

Alongside this, the platform also offers a range of data transformation and integration capabilities, which enable you to integrate your data, structure it, and ensure it's ready for analysis. Whatever your skill level, Matillion empowers you to build and manage data pipelines quickly.

Data engineering tasks are simplified through a unified platform underpinning unlimited scale and rapid setup, all with transparent, consumption-based pricing.

Summary

Harnessing the power of unstructured data has never been so straightforward. With the help of Matillion, you can make sense of your large, unstructured data sets by converting them into valuable, meaningful business insights. In the realm of data-driven decision-making, that is true power!

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.