Setting up a help center using Matillion and RAG

In a previous article, 'A deep dive into Embedding and Retrieval-Augmented Generation (RAG),' we explored the concept of vectorization: converting textual data into numeric embeddings to enable semantic search. Building on that foundation, this article shows how to apply embeddings in a practical context: building a solution that generates sample answers and documentation links to help a support team resolve tickets efficiently.

In this article, we will utilize several cutting-edge technologies to construct our solution. Primarily, we will leverage the PostgreSQL database enhanced with the PGVector extension to store vector embeddings. We'll also use Snowflake to handle and transform our support question data. To generate embeddings for textual data, we will employ OpenAI’s text-embedding-ada-002 model. Additionally, crucial parts of the workflow involve using Python for web scraping and AWS S3 for data storage.

We begin with setting up the necessary infrastructure, including the preparation of our PostgreSQL database with PGVector. Next, we'll delve into data collection and pre-processing steps: scraping product documentation, storing it in an S3 bucket, and handling sample support questions in Snowflake. Subsequently, we will tackle the challenge of large document handling by chunking them into manageable sections with overlapping content to maintain context. We will then proceed to create embeddings for these chunks and upsert them into our PostgreSQL database. Finally, we will demonstrate how to utilize these embeddings to generate contextual responses using OpenAI, detailing the use of Retrieval-Augmented Generation (RAG) to optimize the quality of these responses with relevant document links. The article will conclude with examples showcasing the differences between responses generated with and without RAG, highlighting the effectiveness of our approach.

We'll go through each of these steps in detail. At a high level, the pipelines we will use are the following:

Preprocessing_and_upsert_embeddings (orchestration)

Extract_metadata_to_rows (transformation)

Create_mock_support_cases (transformation)

generate_ai_responses (orchestration)

Now let's dive into the details of each.

Preparing PostgreSQL with PGVector

The first step is to set up the required infrastructure. Let's prepare our PostgreSQL database to store vector embeddings using the PGVector extension. Our PostgreSQL instance is hosted on AWS RDS, where PGVector is available as a supported extension, so we just need to enable it:

CREATE EXTENSION IF NOT EXISTS vector;

Once enabled, verify the installation with this query:

SELECT * FROM pg_extension;

Next, we'll create a table to store our document information, including metadata and embeddings:

CREATE TABLE blog.documents_with_metadata (
    id SERIAL PRIMARY KEY,                   -- Unique ID for each document
    title TEXT,                              -- Title of the document
    url TEXT,                                -- Document URL
    product TEXT,                            -- Product the document relates to
    file_contents TEXT NOT NULL,             -- Full text from Snowflake (file_contents)
    embedding VECTOR(1536),                  -- Embedding generated from OpenAI model
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP  -- Timestamp for last update
);

The embedding column will store the embeddings (vector representations) of each document chunk, generated using OpenAI's model.
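
If you want to see what this step looks like outside of Matillion, here is a minimal Python sketch of embedding one chunk and upserting it into that table. It assumes the openai and psycopg2 packages and placeholder connection details; later in this article, the Postgres Vector Upsert component handles all of this for us.

```python
def to_pgvector_literal(vec):
    """Format a list of floats as a PGVector literal string, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"


def embed_text(text, api_key):
    """Call OpenAI's text-embedding-ada-002 model; returns a 1536-dim list."""
    import openai  # pip install openai
    client = openai.OpenAI(api_key=api_key)
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding


def upsert_document(dsn, doc, embedding):
    """Insert or update one chunk row in blog.documents_with_metadata."""
    import psycopg2  # pip install psycopg2-binary
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO blog.documents_with_metadata
                (id, title, url, product, file_contents, embedding)
            VALUES (%s, %s, %s, %s, %s, %s::vector)
            ON CONFLICT (id) DO UPDATE
                SET embedding = EXCLUDED.embedding,
                    updated_at = CURRENT_TIMESTAMP
            """,
            (doc["id"], doc["title"], doc["url"], doc["product"],
             doc["file_contents"], to_pgvector_literal(embedding)),
        )
```

The `ON CONFLICT (id) DO UPDATE` clause is what makes this an upsert: re-running the load refreshes the embedding and timestamp instead of failing on the primary key.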

Data Collection

For this example, I’ve scraped our documentation site using Python and stored the data in an S3 bucket. This step was done outside of Matillion, and for the sake of brevity, we won’t go into the details here. However, Matillion offers a wide range of connectors to pull data from many sources, including a custom API connector for unsupported sources.
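
For the curious, the scraping step amounts to fetching each page, stripping it to plain text, prepending the metadata header, and writing the result to S3. Here is a rough sketch, not our exact script; it assumes the requests, beautifulsoup4, and boto3 packages, and the bucket and helper names are illustrative:

```python
def page_to_doc_text(title, url, product, body_text):
    """Prepend the TITLE/URL/Product header used throughout this article."""
    return f"TITLE: {title}\nURL: {url}\nProduct: {product}\n\n{body_text}"


def scrape_page(url):
    """Fetch one documentation page and strip it down to plain text."""
    import requests                 # pip install requests
    from bs4 import BeautifulSoup   # pip install beautifulsoup4
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else url
    return title, soup.get_text("\n", strip=True)


def upload_doc(bucket, key, text):
    """Store the stripped document in S3 for the S3 Load step later on."""
    import boto3                    # pip install boto3
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=text.encode("utf-8"))
```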

These documents are a stripped-down version of our documentation without any screenshots or diagrams. We’ve also added the URL, product, and title at the top. Here is an example:

TITLE: 1.68 release notes
URL: https://docs.matillion.com/metl/docs/metl-168-release-notes/
Product: Matillion ETL

1.68 RELEASE NOTES
Below, you can find release notes for Matillion ETL version 1.68.x

TECH NOTES

MATILLION ETL FOR REDSHIFT
If you're experiencing out-of-memory (OOM) exceptions or high CPU usage since updating to version 1.68.7 or above, read the Tech note - Redshift RingBuffer exceeding expected limits for a workaround.

MATILLION ETL VERSION 1.68.25
2024 - July 16th

  • Improvements:
    • Updated the base CentOS image from CentOS Linux 7 to CentOS Stream 9 for published virtual machine (VM) images.
  • Driver Updates

etc….

The second source of data we’ll use for this system is some sample support questions. To create these in Snowflake, we will use our Create_mock_support_cases (transformation) pipeline.

This pipeline uses a Fixed Input component in Matillion to generate some sample support cases in a format we would typically see coming out of a support platform and saves them into a table in Snowflake. In a production system, this data would come directly from a support system (Zendesk, Salesforce, etc.), and you can use one of our out-of-the-box connectors or our API connectors to accomplish this.

See the sample questions below in table format being written to Snowflake:

In Matillion, we use a calculator component to combine metadata fields from the support system into a single row. This is important because Matillion has multiple products, and we need the model to distinguish which set of documentation to reference when generating an answer.

Here’s what the combined data looks like:

This is important because when we pass these into the model, they will need to be a single block of text, and we still want to preserve the metadata fields associated with each case.
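
Conceptually, the calculator component is doing something like the following (the field names here are illustrative, not our actual schema):

```python
def combine_case(case):
    """Fold a support case's metadata fields into a single block of text,
    mirroring what the calculator component does in the pipeline.
    The field names are hypothetical examples."""
    return (
        f"CASE ID: {case['case_id']}\n"
        f"PRODUCT: {case['product']}\n"
        f"SUBJECT: {case['subject']}\n\n"
        f"{case['question']}"
    )
```

Keeping the product in this block is what later lets the model match a question against the right set of documentation.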

Data Processing

Preprocessing_and_upsert_embeddings (orchestration)

The next step is loading our documentation into Snowflake, which we do using the S3 Load component. We define a file format that treats each file as a single text column, then use the S3 Load component to load the data into Snowflake using that format.

Now, here’s where things get a little tricky. Some of our documents are too large for OpenAI’s text-embedding-ada-002 model, which has a limit of 8191 tokens. If you’re not sure how tokens work, here’s a quick breakdown: a token can be anything from a single character to an entire word. The word “because” can be one token, and the word “I’m” could be two because “I’” and “m” are considered separate. 

You don’t need to understand exactly how many tokens each word is, but it’s good to have an understanding of how the context windows (limits) are calculated.

In practice, this means the model can usually process between 12,000 and 32,000 characters in one go, depending on the complexity of the text. So, if we’re dealing with bigger documents, we need to break them down into smaller, bite-sized chunks.
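
A real tokenizer (such as OpenAI's tiktoken library) gives exact counts, but a simple characters-per-token heuristic is enough to decide whether a document needs chunking. A small sketch:

```python
# text-embedding-ada-002 accepts at most 8191 tokens per request
ADA_002_TOKEN_LIMIT = 8191


def rough_token_count(text):
    """Crude estimate: English prose averages roughly 4 characters per token."""
    return len(text) // 4


def fits_in_one_call(text):
    """True if the text is (probably) small enough to embed in one request."""
    return rough_token_count(text) <= ADA_002_TOKEN_LIMIT
```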

Chunking Documents

To handle the larger documents, we chunk them into sections of around 12,000 characters so that they fit within the model’s token limit. To keep the flow between chunks smooth, we also include a 10–20% overlap (roughly 2,400 characters) to make sure none of the important context is lost when the model moves from one chunk to the next.
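
In Python terms, the chunk-and-overlap logic looks something like this (a sketch of the idea; in our pipeline the actual chunking is done with a SQL script in Snowflake):

```python
def chunk_text(text, chunk_size=12000, overlap=2400):
    """Split text into chunks of at most chunk_size characters; each chunk
    starts with the last `overlap` characters of the previous chunk so no
    context is lost at the boundaries."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, max(len(text) - overlap, 1), step)]
```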

Plus, we include some key metadata, like the document name, product, and URL, at the top of each chunk so the model always knows exactly what it’s working with, even when the document is split into sections.

Here’s an example of what one of these chunks looks like:

TITLE: 1.68 release notes
URL: https://docs.matillion.com/metl/docs/metl-168-release-notes/
Product: Matillion ETL

1.68 RELEASE NOTES
Below, you can find release notes for Matillion ETL version 1.68.x

TECH NOTES

MATILLION ETL FOR REDSHIFT
If you've been experiencing out-of-memory (OOM) exceptions or high CPU usage since updating to version 1.68.7 or above, read the Tech note—Redshift RingBuffer exceeding expected limits for a workaround.

MATILLION ETL VERSION 1.68.25
2024 - July 16th

  • Improvements:
    • Updated the base CentOS image from CentOS Linux 7 to CentOS Stream 9 for published virtual machine (VM) images.
  • Driver Updates

etc….

The metadata here is important for us at Matillion because we currently support two products that are very similar in their capabilities. Depending on your use case, you may run into something similar: if you're building a RAG system that spans different product lines, regions, software versions, or industry sectors, keeping track of this metadata is crucial.

That’s why we include the product at the top of each chunk. This ensures that when the model processes each section, it knows exactly which product and version of the documentation it’s working with and can provide us with the links to the documentation so our team can verify that this is coming from the right place.

To do this, we first break out the text file metadata using Matillion's low-code transformation pipeline:

Extract_metadata_to_rows (transformation)

This pipeline reads the matillion_docs table with the table input component:

After that, it splits the metadata fields at the top of our documents into individual columns and rows using our calculator component.
 

Then, in the last component, it saves this to a new table called “matillion_docs_with_metadata”.

After we have our metadata fields created, we could continue in a low-code fashion, but I wanted to show how you can mix SQL scripts with low-code components to accommodate different workflows based on your preference and background. This SQL script chunks the data by size, creates an overlap, and appends the metadata to the top of each chunk so the model doesn’t lose context when a document is too long to be processed in one call.

SQL SCRIPT:

-- Set variables
SET chunk_size = 12000;
SET overlap = 2400;

-- Create the target table
CREATE OR REPLACE TABLE RILEY_PHILLIPS.MATILLION_DOCS."matillion_docs_chunked_with_id" AS
WITH source_data AS (
   SELECT
       "title",
       "url",
       "product",
       "file_contents",
       CEIL((LENGTH("file_contents") - $overlap) / ($chunk_size - $overlap)) as num_chunks
   FROM RILEY_PHILLIPS.MATILLION_DOCS."matillion_docs_with_metadata"
),
numbers AS (
   SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) AS n
   FROM TABLE(GENERATOR(ROWCOUNT => 1000))
),
chunked_data AS (
   SELECT
       s."title",
       s."url",
       s."product",
       s."file_contents",
       n.n as chunk_id
   FROM source_data s
   JOIN numbers n ON n.n <= s.num_chunks
)
SELECT
   ROW_NUMBER() OVER (ORDER BY "title", chunk_id) AS id,
   CONCAT(
       'TITLE: ', "title", '\nURL: ', "url", '\nProduct: ', "product", '\n\n',
       -- first $overlap characters of this chunk (repeats the tail of the previous chunk)
       SUBSTR("file_contents",
              1 + (chunk_id - 1) * ($chunk_size - $overlap),
              LEAST($overlap, LENGTH("file_contents") - (chunk_id - 1) * ($chunk_size - $overlap))
       ),
       '\n\n',
       -- remainder of the chunk, bringing the total up to $chunk_size characters
       SUBSTR("file_contents",
              1 + (chunk_id - 1) * ($chunk_size - $overlap) + $overlap,
              LEAST($chunk_size - $overlap, LENGTH("file_contents") - (chunk_id - 1) * ($chunk_size - $overlap) - $overlap)
       )
   ) as "file_contents",
   "title",
   "url",
   "product",
   chunk_id
FROM chunked_data;

Now that we have our documents chunked and saved to our table

RILEY_PHILLIPS.MATILLION_DOCS."matillion_docs_chunked_with_id"

we can begin to create embeddings from our docs and upsert them into PGVector. To do this, we will use the Postgres Vector Upsert component:

We’ve obfuscated the hostname with a variable. Here’s an example hostname for an AWS RDS instance:

name.id.eu-west-1.rds.amazonaws.com

Once we run the pipeline, we can check Postgres to make sure that the embeddings came through properly. In our case, we’re using DBeaver, and it looks like they did!

Calling OpenAI to generate responses

Once your documents are chunked and stored in Snowflake, and your PostgreSQL database is ready with PGVector, you can configure your Matillion pipeline to start asking OpenAI questions. Here is how we set ours up:

Under the configure tab, we’ve specified some user context, which can be adjusted to fit your specific use case. In our case, we’re providing some context to the model to help it generate responses for our support team:

Provide a detailed answer for our support team to easily reference as a potential response to this question. Also provide the documentation links so they can send those as well.
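
Under the hood, the retrieval step boils down to a nearest-neighbour query against PGVector plus some prompt assembly. Here is a rough sketch in Python (it assumes the psycopg2 package; in Matillion, the component handles this for you):

```python
SYSTEM_CONTEXT = (
    "Provide a detailed answer for our support team to easily reference as a "
    "potential response to this question. Also provide the documentation links "
    "so they can send those as well."
)


def build_prompt(question, chunks):
    """Assemble the retrieved documentation chunks and the support question
    into one prompt for the chat model."""
    context = "\n\n---\n\n".join(chunks)
    return f"{SYSTEM_CONTEXT}\n\nDocumentation:\n{context}\n\nQuestion: {question}"


def retrieve_chunks(dsn, question_embedding, top_k=5):
    """Fetch the top_k chunks nearest to the question embedding, using
    pgvector's cosine-distance operator (<=>)."""
    import psycopg2  # pip install psycopg2-binary
    vec = "[" + ",".join(str(x) for x in question_embedding) + "]"
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT file_contents
            FROM blog.documents_with_metadata
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, top_k),
        )
        return [row[0] for row in cur.fetchall()]
```

The question is embedded with the same text-embedding-ada-002 model used for the documents, so distances in the vector space are meaningful.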

Responses

Now that we’ve built our system, we can run the pipeline, which will pass our questions through, generate responses, and save them to a Snowflake table. Let’s look at our examples!

Question: Can I configure Matillion to automatically assign roles based on Active Directory groups during LDAP authentication?

Without RAG:

ChatGPT said:

"Yes, you can configure Matillion to automatically assign roles based on Active Directory groups during LDAP authentication. To achieve this, you need to set up LDAP (Lightweight Directory Access Protocol) authentication within Matillion, and then map the LDAP groups from your Active Directory to Matillion roles.

Here’s a general outline of the steps:

  1. Enable LDAP Authentication: In the Matillion Admin panel, enable LDAP authentication. This will allow Matillion to authenticate users against your Active Directory.
  2. Configure LDAP Connection: Set up the connection to your Active Directory server, providing necessary details like the LDAP server address, port, and search base. You’ll also need to define how Matillion should search for users within AD (e.g., using CN or sAMAccountName).
  3. Map Active Directory Groups to Matillion Roles: You can map specific LDAP groups from AD to predefined roles within Matillion (like Administrator, Developer, or Read-only). This mapping will ensure that when a user logs in, they are automatically assigned the appropriate role based on their group membership in AD.
  4. Test Authentication: After the configuration, test the LDAP connection to make sure that users from Active Directory are authenticated and correctly assigned roles based on their group membership.

Would you like detailed steps for any of these processes?"

You said:
How do I get to the Matillion Admin panel?

ChatGPT said:

"To access the Matillion Admin Panel, follow these steps:

  1. Login to Matillion: Open your web browser and navigate to the Matillion URL for your instance (e.g., http://<your-matillion-instance>:8080).
  2. Click on the Settings Icon: After logging in, locate the settings icon (gear symbol) in the top-right corner of the Matillion interface.
  3. Select Admin: In the dropdown menu, click on 'Admin.' This will take you to the Admin Panel.
  4. Navigate to LDAP Settings (if you need to configure LDAP): Once you're in the Admin Panel, find the 'Authentication' tab where you'll see the LDAP configuration options.”

In the screenshot below, you can see that the ‘admin panel’ ChatGPT is referring to is not correct, and it’s hallucinating a gear icon that doesn’t exist in that product. While the response is close, a user of Matillion would be confused by these steps. Now, let’s look at the responses to this question using our RAG system.

With RAG:

Question:
Can I configure Matillion to automatically assign roles based on Active Directory groups during LDAP authentication?

(Note: This question is tricky because it comes from a user of the Matillion ETL product. The same capabilities exist in the DPC (Data Productivity Cloud) with different setup instructions, so let’s see how our RAG system handles it.)

Detailed Response:

Yes, you can configure Matillion to automatically assign roles based on Active Directory groups during LDAP authentication. Here’s how you can set this up:

LDAP Integration with Matillion

Matillion ETL supports LDAP integration through the External User Configuration option. This allows users to authenticate via an external LDAP directory server, such as OpenLDAP or Microsoft Active Directory. You can map LDAP roles to Matillion roles, such as Admin or Project Admin, by creating corresponding groups in Active Directory.

Steps to Configure LDAP Integration
  1. Access the LDAP Settings:
    • Click Admin → User Configuration in the top-right of the Matillion ETL interface.
    • Select External from the Security Configuration drop-down menu.
  2. Fill the LDAP Configuration Parameters:
    • Connection Name: The username to bind to the directory.
    • Connection Password: The password for the bind user.
    • Connection URL: The URL for the LDAP server (e.g., ldap://<LDAP-server>:389 for non-SSL or ldaps://<LDAP-server>:636 for SSL).
    • User Base: The root of the user search (e.g., CN=Users,DC=example,DC=com).
    • User Search: Typically sAMAccountName={0} for Active Directory.
    • Role Base: The root of the role search.
    • Role Search: How to find roles for a user, usually member={0}.
    • METL Access: Map to your user group role for accessing Matillion (e.g., Emerald).
    • METL Server Admin, METL Global Project Admin, API: Map these fields according to the roles you've defined in your LDAP directory.
  3. Options & Conditions:
    • Enable options like Allow Nested Roles, Role Subtree, and User Subtree as necessary.
  4. Test & Apply Configuration:
    • Click Test to ensure the configuration is correct.
    • Click OK to save the settings and restart Tomcat.
  5. Log In:
    Once configured, users can log into Matillion ETL using their Active Directory credentials.

For a comprehensive guide, please refer to the Matillion documentation here:

Important Considerations:

  • Azure Active Directory: If you are using Azure AD, configure Azure AD Domain Services. For more details, refer to the section on LDAP authentication with Azure Active Directory.
  • Okta Integration: Specific instructions for integrating Matillion ETL with Okta LDAP are detailed in Okta LDAP Configuration.
  • Backup: Before making changes, back up your configuration files or snapshot the instance to ensure you can roll back if necessary.

This response is pulled almost verbatim from our documentation, and it correctly identified the right documentation for the user based on the metadata we passed through in our chunks, eliminating the need for the user to search for answers. The generated response also allowed our support team to quickly assist the client, speeding up response times and improving service level agreements (SLAs).

Conclusion

By following the steps in this post, you’ve learned how to set up a RAG system using Matillion, PGVector, OpenAI, and Snowflake. You’ve seen how to vectorize documents, chunk them for processing, and store embeddings in PostgreSQL. Now it’s time to put it all together and run your own RAG system!

Riley Phillips

Professional Services
