
Large Language Model (LLM) attempts to pass Data Productivity Cloud Certification exam

Working towards DPC certification? Matillion is proud of our in-depth documentation. We challenged ourselves to create content that could stand alone and allow an LLM to pass the DPC certification exam using just the content our users have access to. We wanted to see if our DPC documentation and the transcripts from the DPC Foundation Academy course were enough. Spoiler alert: they are.

Exam requirements

As can be seen on the DPC certification page, the exam consists of 40 multiple-choice questions (MCQs) with a passing grade of 70% (28 correct answers). These questions cover 8 different topics. To prepare for the exam, users are told to read the DPC documentation, complete the “Data Productivity Cloud Foundation“ Academy course, and self-assess with practice exams. Our LLM worked under the same conditions.

Our process

We built multiple pipelines representing the different steps of this project. We pulled all the necessary data, upserted these into a Pinecone Index, generated the exam from the bank of questions, had the LLM attempt to solve the exam, and finally, we scored the LLM’s results.

Step 1: Obtain docs, transcripts, and question bank

The first step was to pull all the DPC-specific documentation, DPC course transcripts, and the bank of questions used for the exam generation. 

1. DPC documentation

We wrote a Python script that loops through each markdown file in the documentation repository and outputs the results to a table in Snowflake. Because some of these files were quite large, we needed to chunk the text, arbitrarily using 500 lines at a time, as seen in the following code extract.

import base64  # decode the base64-encoded file contents returned by the GitHub API

# Split md files into fixed-size chunks
def chunk_text(file_content, lines_per_page=500):
    pages_text = []

    lines = file_content.splitlines()

    for i in range(0, len(lines), lines_per_page):
        page_text = "\n".join(lines[i:i + lines_per_page])
        pages_text.append(page_text)

    return pages_text

# Create a list to hold the chunked data
chunked_data = []

# ID counter
id_counter = 1

# Process each file and chunk its contents
# (`tree` and `get_contents` come from the GitHub API client used to
# read the documentation repository)
for file in tree:
    file_content = base64.b64decode(get_contents("matillion-docs", file.path).content).decode()
    chunks = chunk_text(file_content)
    for chunk in chunks:
        chunked_data.append([id_counter, file.path, chunk])
        id_counter += 1

2. DPC Academy course transcripts

We put all the transcript files (.srt) into a Snowflake stage. We then used a Python script to loop through all the files in the stage and cleanse them by removing timestamps and new lines. This generated a table in Snowflake where each row contained a file name and the associated cleansed content.
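The cleansing logic can be sketched as a small Python function. This is a minimal illustration of the approach described above, not the exact stage script: it drops the sequence numbers, timestamp lines, and blank separators of the standard SRT layout, leaving only the spoken text.

```python
import re

def cleanse_srt(srt_text):
    """Strip sequence numbers, timestamp lines, and blank lines from an
    SRT transcript, returning the spoken text as a single string."""
    cleaned_lines = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop blank separator lines
        if line.isdigit():
            continue  # drop subtitle sequence numbers
        if re.match(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->", line):
            continue  # drop timestamp lines (e.g. 00:00:01,000 --> 00:00:04,000)
        cleaned_lines.append(line)
    return " ".join(cleaned_lines)
```

For example, `cleanse_srt("1\n00:00:01,000 --> 00:00:04,000\nWelcome to the course.")` returns just `"Welcome to the course."`.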

3. Exam question bank

The exam is generated by selecting 40 random questions from an extensive bank of questions. We mimicked that process by first pulling in the full bank of questions from a JSON file living in an S3 bucket. This is done by using an S3 Load component to output the content of the file into a single column called data.

Step 2: Upload docs and transcripts to Pinecone

Once we had our documents, we upserted them into Pinecone using two Pinecone Vector Upsert components. Both sources were upserted to the same dpc_cert index and namespace.
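Conceptually, each upsert shapes the chunked rows from Step 1 into the id/vector/metadata records a Pinecone-style index expects. The sketch below is an assumption about that shaping step, with a pluggable `embed` callable standing in for the embedding model call; it is not the component's internal code.

```python
def build_upsert_records(rows, embed):
    """Shape (id, source_path, text) rows into the {id, values, metadata}
    records a Pinecone-style vector upsert expects. `embed` is any
    callable mapping text to a vector (e.g. an OpenAI embedding call)."""
    records = []
    for row_id, path, text in rows:
        records.append({
            "id": f"doc-{row_id}",           # unique vector id per chunk
            "values": embed(text),            # embedding vector for the chunk
            "metadata": {"source": path, "text": text},  # keep text for retrieval
        })
    return records
```

Keeping the original text in the metadata is what lets the retrieved matches be passed back to the LLM as context later.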

Step 3: Prepare the exam

In order to prepare the exam, we needed to flatten the data containing the questions, answers, and correct answers. We then needed to select a subset of the questions to make the exam.

1. Extract the nested data

Because the questions live in a JSON file, we needed to flatten the data into a table where each row is a question with its associated MCQ options and the correct answer. To do that, we used multiple Extract Nested Data components, one per topic, renaming the selected columns identically across topics so that they could later be united into one table.

We then used multiple Calculator components to attach the corresponding topic name to each question. Once all questions had been extracted correctly, we used a Unite component to combine them into a single table.
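The flattening steps above can be sketched in plain Python. The field names and the {topic: [question, ...]} layout are assumptions about the JSON shape for illustration; the loop plays the role of the Extract Nested Data components, the topic tag the Calculator step, and the single output list the Unite step.

```python
def flatten_question_bank(bank):
    """Flatten a nested {topic: [question, ...]} bank into uniform rows,
    tagging each question with its topic and uniting all topics into
    one list. Field names are illustrative assumptions."""
    rows = []
    for topic, questions in bank.items():
        for q in questions:
            rows.append({
                "topic": topic,                          # Calculator: attach topic name
                "question": q["question"],
                "options": q["options"],
                "correct_answer": q["correct_answer"],
            })
    return rows  # Unite: all topics in one table
```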

2. Select questions for exam

With a bank of questions created, we could choose the questions for the LLM to answer. In the first part, we selected two questions per topic by using a Rank component to order the questions and then a Filter component to keep only the first two per topic. We also used a Calculator component to reformat the answer options into something easier for the LLM to read.

In the second part, we generated a full 40-question exam following the same topic distribution published on the exam website, which specifies how many questions come from each category.

We then used Rank, Join, and Filter components to number the questions randomly within each topic, join them to the topic distribution, and keep only as many questions per topic as its topicCount required. Finally, we formatted the answers in the same way as before so the LLM could read them properly, producing a complete exam.
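The per-topic selection can be sketched as follows, under the assumption that each row carries a topic tag as in the flattening step. Shuffling within a topic stands in for the random Rank, and the slice for the Join-plus-Filter against the topicCount.

```python
import random

def build_exam(rows, topic_counts, seed=None):
    """Randomly pick topic_counts[topic] questions per topic, mirroring
    the Rank (random order), Join (topic counts), and Filter steps."""
    rng = random.Random(seed)
    exam = []
    for topic, count in topic_counts.items():
        pool = [r for r in rows if r["topic"] == topic]
        rng.shuffle(pool)          # random rank within the topic
        exam.extend(pool[:count])  # keep only as many as the topicCount
    return exam
```

A fixed `seed` makes a run reproducible, which is handy when comparing models on the same exam.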

Step 4: LLM takes the exam

We generated two exams: one with 18 questions (2 from each topic), and one with 40 questions (distributed across topics as per the pie chart). Using the OpenAI Prompt component, we had an LLM attempt to pass each exam. Our User Context prompt follows:

“You are a data engineer trying to pass the Matillion Data Productivity Cloud certification.

This is a multiple choice question (MCQ) exam. You will be provided with the exam question, the options to select from, and some documentation retrieved by a vector search which may be relevant to the question. You will answer the question by selecting the correct option from the semi-colon (;) separated list of numbered options. Only output the corresponding number of the correct answer, nothing else”

The Prompt component then took the question and the formatted_answers as inputs. Lastly, we configured the RAG system for the first two components so that the LLM could take the documents from Step 1 into consideration when solving the questions. The prompt here was simple: “Consider the following documents for context in your response if the data is relevant”. 
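Putting the pieces together, the per-question input to the model looks roughly like the sketch below. This is an assumed reconstruction of what the Prompt component assembles, not its internal template: the question, the formatted answer options, and, for the RAG runs, the retrieved documents behind the context instruction quoted above.

```python
def build_user_prompt(question, formatted_answers, retrieved_docs):
    """Assemble the per-question prompt: the MCQ, its numbered options,
    and any documents returned by the vector search."""
    prompt = f"Question: {question}\nOptions: {formatted_answers}"
    if retrieved_docs:
        context = "\n\n".join(retrieved_docs)
        prompt += (
            "\n\nConsider the following documents for context in your "
            "response if the data is relevant:\n" + context
        )
    return prompt
```

Without retrieved documents (the non-RAG runs), the model sees only the question and options.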

Step 5: Grade the LLM’s exam

Each OpenAI Prompt component in the previous step outputs the LLM’s answer to an exam question. We joined these answers to the original questions table to compare them to the correct answers. With a Calculator component, we assigned a 1 if the LLM's answer matched the correct answer or a 0 if it didn’t.

We then used an Aggregate component to sum the 1s, giving the total number of questions the LLM got correct. The official grade was then calculated and rounded with a Calculator component.
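The scoring logic amounts to a few lines of Python; this sketch reproduces the Calculator/Aggregate steps above, with the 70% pass mark from the exam requirements.

```python
def grade_exam(llm_answers, correct_answers):
    """Score the LLM's answers: 1 per match (Calculator), summed
    (Aggregate), and converted to a rounded percentage, with a pass
    at 70% or above."""
    scores = [1 if a == c else 0 for a, c in zip(llm_answers, correct_answers)]
    correct = sum(scores)
    grade = round(100 * correct / len(correct_answers), 2)
    return correct, grade, "PASS" if grade >= 70 else "FAILED"
```

For example, 28 correct answers out of 40 yields a grade of exactly 70.0, the minimum passing score.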

1. Results of the 18-question exam

| Model | Embedding Model | RAG | Grade | Result |
|---|---|---|---|---|
| GPT-4 | text-embedding-3-small | Yes | 77.78% (14/18) | PASS |
| GPT-4 | text-embedding-3-small | No | 33.33% (6/18) | FAILED |
| GPT-4 | text-embedding-ada-002 | Yes | 77.78% (14/18) | PASS |
| GPT-4o | text-embedding-3-small | Yes | 83.33% (15/18) | PASS |
| GPT-4o | text-embedding-3-small | No | 72.22% (13/18) | PASS |
| GPT-4o | text-embedding-ada-002 | Yes | 77.78% (14/18) | PASS |
| GPT-4o | text-embedding-3-large | Yes | 77.78% (14/18) | PASS |

2. Results of the 40-question exam

| Model | Embedding Model | RAG | Grade | Result |
|---|---|---|---|---|
| GPT-4 | text-embedding-3-small | Yes | 65% (26/40) | FAILED |
| GPT-4 | text-embedding-3-small | No | 40% (16/40) | FAILED |
| GPT-4 | text-embedding-ada-002 | Yes | 70% (28/40) | PASS |
| GPT-4o | text-embedding-3-small | Yes | 75% (30/40) | PASS |
| GPT-4o | text-embedding-3-small | No | 67.50% (27/40) | FAILED |
| GPT-4o | text-embedding-ada-002 | Yes | 75% (30/40) | PASS |
| GPT-4o | text-embedding-3-large | Yes | 75% (30/40) | PASS |
| anthropic.claude-v3-sonnet | text-embedding-3-small | Yes | 67.50% (27/40) | FAILED |
| anthropic.claude-v3.5-sonnet | text-embedding-3-small | Yes | 72.50% (29/40) | PASS |

Conclusion

In both exam scenarios, the LLMs passed the exam the majority of the time, and using a RAG system clearly increased their ability to achieve a passing grade!

Some fine-tuning, such as using a different chunking strategy for the supporting documents, can still improve performance. However, we are pleased to report that our documentation is powerful enough to support exam study for a passing grade!

Sources

Links

- DPC Documentation: Matillion Docs
- DPC Foundation Course: Matillion Data Productivity Cloud, Foundations | Matillion Academy
- Practice Exams: Certification - Data Productivity Cloud | Matillion Academy
Susana Cardoso

Data Engineer
