Make your Large Language Model an expert on any subject using Retrieval Augmented Generation. Practical walkthrough

Large Language Models (LLMs) have extraordinary capabilities in generating human-like answers to questions. Yet, they face limitations due to the static nature of their training data.

This constraint means that an LLM might not be able to provide the latest or most detailed knowledge, especially about recent events or very specific subjects—such as the ins and outs of your company's unique product line—even if there's plenty of available public information.

To bridge this gap, Retrieval Augmented Generation (RAG) is a relatively simple and accessible solution. RAG enhances LLMs by dynamically integrating external knowledge sources directly into the response generation process.

In this article, I’ll demonstrate how a RAG deployment amplifies the utility of an existing LLM.  The example will involve adding the contents of a Git user's manual - a PDF document containing unstructured data - into a vector database to work in tandem with an LLM. As you'll see, this will dramatically improve the model's ability to handle complex and detailed questions - like those you might receive from technical support.

The Git manual is too big to simply send to the LLM in its entirety. Instead, it needs to be split into small "chunks" to make it usable. I'll start by demonstrating how to extract just the text of the PDF using Python.

Extracting PDF text for RAG with LangChain and NLTK

I'll be using a single, large PDF document as the reference material that will provide the LLM with its expertise. My file is named "GitReferenceMaterial.pdf", but you can, of course, substitute your own documentation. You could consider using the Pro Git book by Scott Chacon and Ben Straub.

To access the text, it first needs to be extracted from the PDF, which I will do using a Python library from LangChain.

from langchain_community.document_loaders import PyPDFLoader

pages = PyPDFLoader("GitReferenceMaterial.pdf").load_and_split()
print("Read {0} pages".format(len(pages)))
alltext = " ".join(p.page_content.replace("\n", " ") for p in pages)

Next, I will use the Natural Language Toolkit (NLTK) to split the full text into sentences. Not all of the text is going to be useful: the PDF contains formatting quirks and oddities, mainly from the table of contents and the index. So single-character "sentences" need to be removed, and anything over 2000 characters is very unlikely to be a genuine sentence either.

import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer data, needed on the first run

lengths = []
cleantext = ""

# Keep only sentences between 2 and 1999 characters long
for s in nltk.sent_tokenize(alltext):
    if 1 < len(s) < 2000:
        lengths.append(len(s))
        cleantext = cleantext + s + " "

print("Number of sentences: {0}".format(len(lengths)))
print("Clean text length {0}".format(len(cleantext)))

Now that we have extracted just the clean – but still unstructured – text, the dataset contains only the most useful information. But it still needs to be split into chunks in a way that the AI engine will be able to use.

Sentence Splitting vs Recursive Character Splitting

Sometimes, sentences refer to each other contextually, and I want to give my LLM the best chance of using the implied information. Let me give an example to clarify.

Consider the following paragraph of three sentences.

Data engineering involves the design and construction of systems for collecting, storing, and analyzing data. It plays a crucial role in optimizing data flow and storage systems. Additionally, it ensures data is accessible and maintained in a secure and efficient manner.

Now look at the second sentence on its own: "It plays a crucial role in …". There's no indication that "it" means "data engineering" in that sentence. That means the second sentence is far less insightful in the absence of the first.

For this reason I recommend splitting the text into chunks that are much longer than an average sentence, with some overlap.
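To make this concrete, here's a minimal, standalone sketch (using illustrative chunk sizes, not the ones I'll choose for the Git manual below) that splits the example paragraph into overlapping chunks. Because each chunk spans more than a single sentence, the fragment "It plays a crucial role..." lands in the same chunk as the sentence that defines "it".

from langchain_text_splitters import RecursiveCharacterTextSplitter

example = (
    "Data engineering involves the design and construction of systems for "
    "collecting, storing, and analyzing data. It plays a crucial role in "
    "optimizing data flow and storage systems. Additionally, it ensures data "
    "is accessible and maintained in a secure and efficient manner."
)

# Illustrative sizes: chunks roughly two sentences long, with generous overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=80)

for i, chunk in enumerate(splitter.split_text(example)):
    print("Chunk {0}: {1}".format(i, chunk))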

First, some simple statistics:

import statistics
m = statistics.mean(lengths)
sd = statistics.stdev(lengths)

print("Mean %d" % m)
print("SD   %d" % sd)

This is a heuristic guide rather than a rigid rule, but I have found - after some experimentation - that chunks sized between 1 and 2 standard deviations above the mean sentence length produce good results. To make round numbers for my GitReferenceMaterial.pdf example I chose a length of 380 characters and an overlap of 80.
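If you'd like to reproduce that reasoning on your own document rather than eyeball it, a quick calculation like the sketch below (reusing the m and sd values computed above) prints the range to round from.

# Heuristic only: a chunk size between one and two standard deviations
# above the mean sentence length is a reasonable starting point
print("Suggested chunk size between %d and %d characters" % (m + sd, m + 2 * sd))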

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=380,
    chunk_overlap=80,
    length_function=len,
    is_separator_regex=False,
)

docs = text_splitter.create_documents([cleantext])
chunks = [d.page_content for d in docs]

With this exercise complete, "chunks" is a list of all the fragments of text from the original Git reference manual that will provide the LLM's expertise. It's time to add them to a vector database.

Saving Vectors to FAISS using Embeddings

Vector storage begins with transforming chunks of text into long arrays of floating point numbers. These vectors lend themselves to rapid similarity comparison and retrieval using geometric distance calculations, which is vital for RAG to operate successfully.

Converting text from its natural form (a string) into a numeric vector is done with a technology known as "embedding." An embedding model captures the semantic essence of the text in a dense numeric vector. Vector databases, as the name suggests, specialize in storing and searching these dense vectors.

In this example, I'll use the FAISS vector database and an OpenAI embedding model, which will require some more LangChain libraries.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(chunks, embedding=embeddings)
vectorstore.save_local("GitDB.faiss")

Don't be surprised if that takes a few seconds to run. It converts thousands of chunks of text into vectors using OpenAI's embedding model, and then saves them to a FAISS database on disk. Saving a copy of the indexed knowledge base means you can re-use it over and over again without re-computing the embeddings.
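If you're curious what these vectors actually look like, a quick inspection like the sketch below (reusing the embeddings object from above; the example string is just an arbitrary sentence) shows the raw numbers behind a single piece of text.

# Embed one string and inspect the resulting vector
vec = embeddings.embed_query("Git refers to commits by SHA-1 hashes.")
print("Dimensions: {0}".format(len(vec)))  # e.g. 1536 with OpenAI's default embedding model
print(vec[:5])                             # the first few floating point components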

So much for the theory! But is it actually going to work?

How to test a Vector Database

RAG works by selectively infusing relevant snippets of text (the "chunks") into an LLM prompt. The important thing is that the vector database is able to quickly find the most relevant chunks.

To test that, query the vector database with a question that you might ask the LLM.

vectorstore = FAISS.load_local("GitDB.faiss", embeddings)

res = vectorstore.search("What is the minimum size of partial hash that Git can use to refer to a commit?", search_type="similarity")

for d in res:
    print(d.page_content)

This Python snippet loads the FAISS database from its disk cache, and searches using a string. It's vital that the same embedding model is used to vectorize the search string.

Try this yourself to gain confidence that the vector search is working. You should hopefully find that several text chunks are returned, and they are all relevant to the question you supplied. Depending on the degree of overlap, you may see the same text returned several times. This gives the LLM better opportunities to interpret the text in context, and produce an accurate response.

During the RAG query itself, you will have the opportunity to select the maximum number of chunks the vector database can supply to the LLM prompt. Five is often a good choice, but you can experiment.

If the value is too small, the LLM will not get the full benefit of the source material. Setting it to zero is equivalent to an ordinary prompt with no RAG at all. But setting it too high means passing a very lengthy prompt to the language model, padded with less relevant text chunks. This could actually make the response worse!
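With the LangChain retriever used later in this article, that limit is controlled by the k search parameter. A minimal sketch, assuming the vectorstore object loaded earlier:

# Cap the retriever at the 5 most similar chunks per query
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Or, for ad-hoc testing, request the top 5 chunks directly
res = vectorstore.similarity_search(
    "What is the minimum size of partial hash that Git can use to refer to a commit?",
    k=5,
)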

LangChain ChatPromptTemplate RAG Python example

Once the vector database is ready, it's possible to put all the elements together and run a RAG query.

The following Python code initializes the FAISS database from its disk cache again, then builds up a prompt using a combination of:

  • Background information and instructions
  • Information provided by the vector database
  • A natural language question

You will need OpenAI credentials to run this because it uses GPT-4. As always, it's important to be consistent with your choice of embedding model.

from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

vectorstore = FAISS.load_local("GitDB.faiss", embeddings)

vscontext = vectorstore.as_retriever()
print("Retriever initialized")

model = ChatOpenAI(model="gpt-4", temperature=1.1)

template = """Your job is to answer a question on git version control.
Use the following context to help answer the question.

{context}

{question}
"""

prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": vscontext, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(chain.invoke("What is the minimum size of partial hash that Git can use to refer to a commit?"))

You should hopefully find that the RAG-enhanced responses from an LLM are accurate, and refer indirectly back to the source material you added to the vector database. It's always useful to compare the results by asking the identical question to an identical LLM, but without any RAG assistance.
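A quick baseline for that comparison might look like the sketch below, reusing the model object from the previous listing but supplying no retrieved context.

# The same question, with no help from the vector database
baseline = model.invoke("What is the minimum size of partial hash that Git can use to refer to a commit?")
print(baseline.content)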

Here's the RAG-assisted answer I consistently received when testing. It does indeed match back to some very specific information buried deep within the original material.

Retriever initialized
Using OpenAI model gpt-4
The minimum size of partial hash that Git can use to refer to a commit is four characters long.

Now that you've turned your LLM into an expert in your own unique information domain, you are in a great position - for example - to start giving automated first responses to customer support questions that require private, specialized, or technically detailed information.

RAG with Matillion

The Matillion Data Productivity Cloud has various no-code and low-code capabilities that integrate LLMs into data pipelines. This allows you to transform and enrich data without code, making AI insights easy to generate and easy to maintain.

The Retrieval Augmented Generation options shown in the screenshot below automate and simplify the utilization of data from a vector store. This enhances the context and depth of the underlying LLM while providing a no-code option to fast-track the use of RAG in new and existing use cases.

Matillion Data Productivity Cloud Retrieval Augmented Generation component

The Matillion Data Productivity Cloud also provides vector connectivity to Pinecone for straightforward data loading and retrieval, coupled with data lineage to ensure AI processes are explainable. Additionally, reverse ETL functionality helps activate AI-generated insights - such as transcript summaries - by sending them back into business operations like your CRM system. Overall this helps create streamlined, intelligent, understandable, and highly maintainable workflows.

A full demonstration video is available in Matillion's library of AI Videos and Demos under the heading "Automating Customer Support Part 1".

To try out the Matillion Data Productivity Cloud using your own data, try it for free.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.