Data Engineer’s Guide to AI Prompt Engineering - 2

Building AI-Powered Data Pipelines: Summarising, Classifying and Filtering Large Quantities of Text 

Hi, it’s me again — Julian Wiffen, Director of Data Science for the Product team at Matillion. In the previous installment of Data Engineer’s Guide to AI Prompt Engineering, we delved into the fascinating realm of using Large Language Models (LLMs) within data pipelines to standardize and label Job Titles for lead processing.

In part two, we shift our focus to the capabilities of LLMs in handling large blocks of unstructured data. LLMs let a data engineer work with this kind of text without hand coding or complex natural language processing. To illustrate what is possible, we will ask an LLM to read a number of news articles and produce an executive summary of all the news on a chosen topic. The same technique could just as easily be applied to long technical documents, web scraping results, call transcripts, or log files.

There is a huge amount of information locked within unstructured data that is very hard to access via traditional business intelligence and transactional processing methods. While LLMs have no true understanding of the text they process, we can get them to emulate reading comprehension by asking them to generate summaries and labels.

We've collected several articles from non-subscription news websites such as the UK's BBC and Independent, as well as the Huffington Post. The copied-and-pasted text is somewhat ugly to look at, but the LLM can process it without problems.

There are 18 articles here with a total of 26 pages of text.

As discussed in the previous blog, within our Data Productivity Cloud we have created a prompt component able to submit API requests to LLMs from OpenAI, Microsoft Azure AI, and AWS Bedrock.

We configured the prompt component to ask the following questions about each article:

  • Summary - Please give a short, bullet pointed summary of the main points of this newspaper article. Be concise, use no more than 50 words
  • Headline - What was the headline for this article
  • Topic - Pick one of the following categories to describe the main topic of the article:  Travel, Environment, Economy, Education, Other. And answer with just that one word.
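
Outside the Data Productivity Cloud, the shape of these per-article requests can be sketched with plain Python. This is an illustrative sketch, not the component's actual implementation: the `build_request` helper and the single-call JSON framing are assumptions, though the three questions, model, and temperature match what we used.

```python
# Sketch of the per-article request the prompt component submits.
# The three questions are taken verbatim from the blog; bundling them
# into one JSON-answering call is a hypothetical framing for brevity.
QUESTIONS = {
    "summary": ("Please give a short, bullet pointed summary of the main "
                "points of this newspaper article. Be concise, use no more "
                "than 50 words"),
    "headline": "What was the headline for this article",
    "topic": ("Pick one of the following categories to describe the main "
              "topic of the article: Travel, Environment, Economy, "
              "Education, Other. And answer with just that one word."),
}

def build_request(article_text: str) -> dict:
    """Build one chat-completion request body for a single article."""
    instructions = "\n".join(f'- "{key}": {question}'
                             for key, question in QUESTIONS.items())
    return {
        "model": "gpt-3.5-turbo",
        "temperature": 0.8,  # some creativity, but well short of the 2.0 maximum
        "messages": [
            {"role": "system",
             "content": "Answer with a JSON object containing these keys:\n"
                        + instructions},
            {"role": "user", "content": article_text},
        ],
    }

request = build_request("The rain in Brechin has been coming down heavily...")
print(request["model"], request["temperature"], len(request["messages"]))
```

One request like this is sent per article; the component collects each response for downstream unpacking.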

We used OpenAI’s gpt-3.5-turbo for this example - we have found it produces reasonable summaries without the longer processing time and heavier costs that would be incurred by a more advanced model such as GPT-4. We set a temperature of 0.8, a little below the midpoint of the 0 to 2 scale - temperature controls how creative a model is allowed to be in its answers, with 2 the most creative and 0 the least. It is worth noting that even a temperature of 0 is not completely deterministic - the process of an LLM generating text inherently involves some random elements, so we would not expect an identical answer every time.

Looking at the job summary panel, OpenAI took just under two minutes to process the 18 articles. The prompt component receives responses in JSON format, which we can unpack using the Data Productivity Cloud’s existing capabilities and join back to the original table to pick up the source URL.
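
The unpack-and-join step can be sketched with pandas. The Data Productivity Cloud does this with its own components; the column names, sample payloads, and `article_id` key below are illustrative assumptions.

```python
import json
import pandas as pd

# Hypothetical source table: one row per article, keyed by article_id.
articles = pd.DataFrame({
    "article_id": [1, 2],
    "url": ["https://www.bbc.co.uk/news/live/uk-scotland-67146840",
            "https://example.com/article-2"],
})

# Hypothetical raw LLM responses: one JSON string per article.
responses = pd.DataFrame({
    "article_id": [1, 2],
    "response": [
        '{"headline": "Brechin residents prepare for the worst", '
        '"topic": "Environment", "summary": "Heavy rain raises river levels..."}',
        '{"headline": "Rail fares to rise", "topic": "Economy", '
        '"summary": "Train ticket prices set to increase..."}',
    ],
})

# Unpack the JSON payload into columns, then join back to the source
# table to pick up the URL.
unpacked = responses["response"].apply(json.loads).apply(pd.Series)
result = articles.merge(
    pd.concat([responses[["article_id"]], unpacked], axis=1),
    on="article_id",
)
print(result[["headline", "topic", "url"]])
```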

We can see some of the summaries here:

Looking at the first article, “Brechin residents prepare for the worst” (https://www.bbc.co.uk/news/live/uk-scotland-67146840), it has been given the topic label “Environment” and the summary: “The rain in Brechin has been coming down heavily and steadily, causing the river to rise. Residents are preparing for potential flooding after recent incidents. ScotRail has canceled services in areas covered by the red weather warning, with possible disruption in other parts. Fallen trees are causing blockages in Dundee.” A check of the link reveals this is a good precis of the article.

Leaving aside the summary text for now, we can have the Data Productivity Cloud give us a table of the titles and topics.

The LLM complied with our request to select topics from the given list of Travel, Environment, Economy, Education, Other. That allows us to filter for just those matching our chosen topic of interest - in this case we set the variable to ‘environment’, filtering the rows down to just those with a relevant subject. It’s worth remembering that these topic labels pick up semantic matches of the theme - the specific word does not need to appear in the article at all.
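
That filter step is a simple equality test on the topic column. A minimal sketch with pandas, using made-up rows - note the case-insensitive comparison, since our variable held ‘environment’ while the model returns the capitalised label from the prompt:

```python
import pandas as pd

# Hypothetical labelled output from the previous step.
labelled = pd.DataFrame({
    "headline": ["Brechin residents prepare for the worst",
                 "Rail fares to rise",
                 "Storm warnings extended"],
    "topic": ["Environment", "Economy", "Environment"],
})

topic_of_interest = "environment"  # pipeline variable, set per run or per user

# Compare case-insensitively so the variable's casing does not matter.
digest = labelled[labelled["topic"].str.lower() == topic_of_interest.lower()]
print(digest["headline"].tolist())
```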

Taking a look at the final table, we have distilled the eighteen long articles we started with into five short overviews that can be read at a glance, all specific to the topic we care about today, with links back to the source articles if a deeper view is needed. The topic of interest could easily vary from day to day, or we could configure bespoke lists for each end user or group of users to prepare digests relevant to their needs.

We have used news articles as an example here, but the text being handled could just as easily be meeting transcripts, support or sales call logs, or detailed technical or policy documentation. The use of an LLM in the data pipeline opens up a wide range of options to extract information from extremely unstructured data that can support many different use cases. Check out the next instalment on building AI-powered data pipelines, which delves into generating analytical insights from free-text data.

Want to see this in action?

Join thousands of your peers from across the globe to become Enterprise Ready, Stack Ready, and AI Ready. Register for free today to secure your spot in the data-driven future and stay tuned for announcements about our stellar lineup of leaders, influencers, and data experts.

Julian Wiffen

Director of Data Science

Julian Wiffen, Director of Data Science for the Product team at Matillion, leads a dynamic team focused on leveraging machine learning to improve products and workflows. Collaborating with the CTO's office, they explore the potential of the latest breakthroughs in generative AI to transform data engineering and how Matillion can play a pivotal role in supporting the needs of those working with these cutting-edge AI models.