Think Outside the Container: Hands on with Snowpark Container Services and Matillion

Snowpark Container Services (SPCS) allows Snowflake users to run Docker containers directly within the Snowflake ecosystem, bringing speed, efficiency, and complete data sovereignty. This new service supports both small and large generative AI language models that can be tailored to your own specific tasks.

The Matillion Data Productivity Cloud complements Snowflake by offering code-optional interfaces to the Data Cloud. This combination of technologies optimizes data engineering and AI pipeline management on an enterprise scale.

Together, these platforms empower organizations to accelerate data-driven decision-making processes while maintaining robust data governance.

Technical Details of Snowpark Container Services

Import your customized Docker images into your own private image repository, and Snowpark Container Services can run them for you: either as a long-running service - like an interface to a Large Language Model - or as a batch job accessed via a user-defined SQL function.
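
To make this concrete, here is a minimal sketch of that setup in Python, using the Snowflake Connector to issue the relevant SQL. All object names (the repository, service, compute pool, function, and endpoint path) are hypothetical, and the service specification is a bare-bones illustration rather than a production configuration:

```python
import snowflake.connector

# Connect with a role that can create SPCS objects
# (all connection parameters here are placeholders).
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    role="container_admin",
    database="my_db",
    schema="my_schema",
)
cur = conn.cursor()

# A private image repository to hold your customized Docker images.
cur.execute("CREATE IMAGE REPOSITORY IF NOT EXISTS my_repo")

# After pushing an image to the repository URL with `docker push`, register
# it as a long-running service. This assumes a compute pool named
# my_compute_pool already exists (see the next sketch for creating one).
cur.execute("""
CREATE SERVICE my_llm_service
  IN COMPUTE POOL my_compute_pool
  FROM SPECIFICATION $$
spec:
  containers:
  - name: llm
    image: /my_db/my_schema/my_repo/my_llm_image:latest
  endpoints:
  - name: api
    port: 8080
$$
""")

# Alternatively, expose the container through a service function, so that
# plain SQL can call the model. The /summarize path is hypothetical.
cur.execute("""
CREATE FUNCTION summarize(doc VARCHAR)
  RETURNS VARCHAR
  SERVICE = my_llm_service
  ENDPOINT = api
  AS '/summarize'
""")
```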

With SPCS, you can deploy a diverse range of language models, from large ones like Llama 3 to smaller models available through, for example, the Hugging Face marketplace. All of them run entirely within the secure bounds of your own Snowflake environment.

This setup prioritizes data security and also provides customizable hardware configurations. Users can select GPUs for processing-intensive large models or opt for CPUs for less demanding ones, making the system adaptable to different needs and budgets.
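
The hardware choice comes down to the compute pool a service runs in. As a hedged sketch, the two statements below create a GPU-backed pool for a large model and a cheaper CPU-backed pool for a smaller one; the pool names are hypothetical, and the instance families shown are examples from Snowflake's published list:

```python
import snowflake.connector

# Connection parameters are placeholders, as in the previous sketch.
conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

# A GPU-backed compute pool for processing-intensive large models.
cur.execute("""
CREATE COMPUTE POOL gpu_pool
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = GPU_NV_S
""")

# A CPU-backed pool for smaller, less demanding models.
cur.execute("""
CREATE COMPUTE POOL cpu_pool
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = CPU_X64_S
""")
```

Because services are billed through the compute pool's credit consumption, picking the smallest instance family that fits the model is the main budget lever.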

The pricing structure is transparent and usage-based, providing cost-effective options for varying enterprise demands.

Industry Use Cases for Snowpark Container Services

SPCS is particularly beneficial for sectors that require stringent data privacy and security measures. Industries governed by regulations such as HIPAA in healthcare, or Sarbanes-Oxley and PCI DSS in financial services, find assurance in the complete data sovereignty that SPCS offers.

The SPCS architecture ensures that sensitive data does not leave the secure perimeter of the Snowflake environment, which is crucial for government, telecommunications, and energy utilities handling confidential information.

Additionally, the adaptability of Snowpark Container Services allows for efficient handling of tasks like Personally Identifiable Information (PII) detection, which smaller language models can execute exceptionally well, especially after fine-tuning.

This flexibility, coupled with stringent data governance capabilities, positions SPCS as an indispensable tool for modern data-driven industries.

Data Summarization with an LLM in SPCS

I'll bring to life a use case for running a large language model inside SPCS. Imagine you are managing a set of documentation that's full of technical jargon and difficult to follow. You need a way to generate short summaries, but the documents contain confidential information, and policy does not permit sending them externally for processing.

A good way to tackle this is with a large language model - such as Meta Llama 3 70B - running in Snowpark Container Services. With a Matillion Snowpark Container Prompt, you can send the document(s) for summarization while keeping them inside Snowflake at all times.

Afterward, you can join the summarized outputs back to the originals to keep the information together. Once again, all this happens inside Snowflake.
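
For readers who prefer to see the moving parts, the equivalent raw SQL might look like the query below, reusing the hypothetical summarize service function from the earlier sketch and assuming a documents table with doc_id and doc_text columns. This is an approximation of what such a pipeline does, not the SQL the Matillion component actually generates:

```python
import snowflake.connector

# Placeholder connection, as in the earlier sketches.
conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

# Summarize each document with the SPCS-backed service function, keeping
# the original text alongside the summary. The query runs entirely inside
# Snowflake, so the documents never leave the account.
cur.execute("""
CREATE TABLE document_summaries AS
SELECT d.doc_id,
       d.doc_text,
       summarize(d.doc_text) AS summary
FROM documents AS d
""")
```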

Here's how it looks in the Matillion Data Productivity Cloud. The original text is previewed, and the short summary can be seen at the bottom.

Summarizing legal documents with a Matillion Snowpark Container Prompt component

Because Llama 3 is a large, general-purpose language model, the entire operation works using its base functionality, without the need for any additional configuration or customization.

PII detection using a Matillion Snowpark Container Prompt

Now onto another example. This task involves detecting personally identifiable information (PII) within text data. Specifically, it centers on scanning medical notes and automating the detection of - for example - names and phone numbers using machine learning.

For this kind of task, a small to medium-sized language model such as Mixtral 8x7B, housed within an SPCS container, is sufficient. The same applies to any medium-sized model that has been fine-tuned for PII detection. It is faster than a large language model, and cheaper to run.

Data from an input table containing the medical notes is read, analyzed by the language model, and the examination results are stored in a new table. Data never leaves Snowflake at any point.
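
A hedged sketch of that flow, assuming a detect_pii service function exposed from the Mixtral container and a medical_notes table with note_id and note_text columns (all names are hypothetical):

```python
import snowflake.connector

# Placeholder connection, as before.
conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

# Run the PII check over every note and store the model's findings in a
# new table. The notes are read, analyzed, and written back without ever
# leaving Snowflake.
cur.execute("""
CREATE TABLE pii_results AS
SELECT n.note_id,
       detect_pii(n.note_text) AS pii_report
FROM medical_notes AS n
""")
```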

PII detection with a Matillion Snowpark Container Prompt component

In PII detection, two questions are usually posed to the language model (a sketch of such a prompt follows the list):

  1. Determine the presence of any PII in the medical notes (Yes/No).
  2. If PII is detected, extract and list the identifiable details explicitly, making them easier to remove.
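
As an illustration, the two questions can be folded into a single prompt template. The wording below is a hypothetical example, not Matillion's built-in prompt:

```python
# A hypothetical prompt template combining the two PII questions.
# The {note} placeholder is filled in with each medical note.
PII_PROMPT = """You are a PII detection assistant.
Read the medical note below and answer two questions:
1. Does the note contain any personally identifiable information? Answer Yes or No.
2. If Yes, list each piece of PII found (names, phone numbers, and so on), one per line.

Medical note:
{note}
"""

def build_pii_prompt(note: str) -> str:
    """Render the prompt for a single medical note."""
    return PII_PROMPT.format(note=note)

# Example usage with an invented note:
print(build_pii_prompt("Patient John Smith called from 555-0123 about test results."))
```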

After joining the original notes back to the corresponding output entries from the Mixtral service, here's how a sample of test data appears in a Matillion data pipeline.

PII detection in a Matillion data pipeline

Just like the PII detection, the join and the data sampling operate entirely inside Snowflake.

From there, you can take a range of actions, such as segregating or removing PII from records. This process is fundamental to maintaining privacy and complying with data protection regulations.


Summary

Matillion and Snowpark Container Services provide a secure and flexible platform for data engineers to incorporate AI into data workflows, ensuring data sovereignty within Snowflake’s environment.

This solution supports the use of cutting-edge open-source AI models and offers a choice between CPUs and GPUs for different tasks. With an emphasis on cost-effectiveness, it enables precise budget control by managing Snowflake credit consumption.

Discover more about our joint capabilities at the Snowflake Summit, and get ready to enhance your own data pipelines with AI integration while keeping sensitive data secure!

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.