Evaluating Agentic AI

During my internship at Matillion, I have been working on a project focused on evaluating Maia, Matillion’s agentic AI. My goal was to understand Maia’s behavior in a wide range of scenarios and design a systematic evaluation framework that could guide ongoing improvements. In this post, I’ll share the motivation behind this work, the approach I took, and the lessons I learned along the way.

What is Maia?

Maia is Matillion’s Agentic AI framework, designed to assist users with tasks inside the Matillion Data Productivity Cloud. Its primary functions include:

  • Helping users build new data pipelines.
  • Fixing existing pipelines that may contain logic or configuration errors.
  • Providing explanations for how pipelines work and why specific configurations are chosen.

Maia is conversational: users interact with it in natural language. This means a user can simply type a request such as:

  • “Can you filter this dataset to include only orders from last quarter?”
  • “Explain how this pipeline is joining these two tables.”
  • “Find and fix the step where we’re losing rows in this transformation.”

This conversational interface makes complex data engineering tasks more approachable, especially for those who may not be deeply familiar with Matillion’s technical features.

Why do we need to evaluate Maia?

Evaluating an agentic AI like Maia is fundamentally different from evaluating traditional AI models because of the nature and complexity of its work. Maia operates within a dynamic environment, interacting with multiple components and tools in the Matillion Data Productivity Cloud. Its job is not just to provide a single correct answer, but to plan, execute, and adapt a sequence of actions that lead to a useful outcome for the user.

In this setting, the definition of “correctness” is highly context-dependent. The “right” way to build or fix a pipeline can vary based on the specific user request, the structure and quality of the data, and the desired end result. Two different approaches may both satisfy the request, but differ significantly in their efficiency, cost, and maintainability. Because of this, simply checking whether Maia’s final output matches a fixed standard is not enough.

  • A correct output could be the result of a long, inefficient, or overly expensive decision-making process.
  • An output that doesn’t perfectly match a rigid reference could still be highly valuable if it gives the user actionable insight or solves the majority of the problem.

The real challenge is that Maia’s success must be measured on more than just the end product. It must be evaluated on:

  1. Efficiency – Did Maia minimize unnecessary steps and avoid redundant tool usage?
  2. Logic – Did its reasoning follow a coherent, well-structured plan?
  3. Cost-effectiveness – Did it achieve the result without consuming excessive computational resources?

Traditional AI metrics don’t capture these qualities. They measure outcomes, not process. With Maia, how it arrives at the answer is just as important as the answer itself.

This is why we need tools that allow us to look inside the “black box” of Maia’s decision-making process. By tracing each step, we can see not only the sequence of actions but also the reasoning behind them. This is exactly what a tool like Langfuse enables, providing a window into Maia’s planning, tool usage, and intermediate progress so we can measure and improve the quality of the journey, not just the final destination.

Langfuse

Langfuse is a developer tool for observing and analyzing how an AI agent processes tasks in real time. It captures a trace: a complete, timestamped record of every action the agent takes, from the moment it receives a request to the moment it returns an answer.

For Maia, Langfuse traces provide:

  • Step-by-step reasoning flow – how Maia decomposed the problem.
  • Tool invocation history – which tools were selected, in what sequence, and with what parameters.
  • Latency and cost metrics – useful for identifying performance bottlenecks.
  • Intermediate states – snapshots of partial progress that help diagnose where errors or inefficiencies occur.

By inspecting these traces, I can not only see what Maia decided to do, but also why it might have made those decisions.

(Screenshots omitted: the base prompt, the modified instruction prompt, and how Maia was prompted.)

Datasets

Langfuse also enables the creation of datasets, which are curated collections of test cases designed to evaluate specific aspects of an AI agent’s behavior. This makes it possible to test different capabilities of Maia in isolation; for example, one dataset might focus on orchestration scenarios, while another targets transformation tasks.

One dataset I developed is called “Routing”. The purpose of this dataset is to assess whether Maia selects the most appropriate tool early in a task, which is often a strong indicator that it is on the correct execution path.

Structure of each Routing dataset item:

  • Input message – The user’s natural language request.
  • Expected tool – The tool Maia should ideally invoke first.

Running the Routing dataset allows me to directly compare Maia’s initial tool selection against the expected choice across a variety of inputs. This targeted approach helps pinpoint specific areas where routing decisions could be improved.

Analyzing Runs with Custom Metrics

Each execution of a dataset item produces a trace in Langfuse. Alongside the trace, Langfuse provides built-in metrics such as latency (execution time) and cost (resource usage).

While these metrics are valuable, they don’t capture the full complexity of evaluating an agent like Maia. To gain deeper insight, I extracted and calculated custom metrics from the traces, including:

  • Tools Used – A complete list of tools Maia invoked while completing the task.
  • Components Added – All components Maia inserted into the pipeline during execution.
  • Times Thought – How many times Maia paused to “think” before taking the next action.

These custom metrics give us a richer understanding of Maia’s decision-making process. By comparing runs of the same test over time, we can track whether:

  • The number of tools used remains consistent or becomes more efficient.
  • The same components are consistently added when solving a similar task.
  • The amount of “thinking” before acting changes, which may indicate improvements in confidence and decisiveness.

This level of insight allows for fine-grained performance tracking, ensuring that improvements are measurable and targeted.
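The three custom metrics above can be derived from a trace’s step list with a small extraction function. The step dictionaries and their fields here are assumptions about what a trace export might contain, not the real Langfuse format.

```python
def custom_metrics(steps: list[dict]) -> dict:
    """Derive Tools Used, Components Added, and Times Thought from trace steps."""
    return {
        "tools_used": [s["tool"] for s in steps if s["type"] == "tool_call"],
        "components_added": [
            s["component"] for s in steps if s.get("component") is not None
        ],
        "times_thought": sum(1 for s in steps if s["type"] == "reasoning"),
    }

# A toy trace: Maia thinks, adds a Filter component, thinks again, then runs.
steps = [
    {"type": "reasoning", "tool": None, "component": None},
    {"type": "tool_call", "tool": "add_component", "component": "Filter"},
    {"type": "reasoning", "tool": None, "component": None},
    {"type": "tool_call", "tool": "run_pipeline", "component": None},
]
print(custom_metrics(steps))
# {'tools_used': ['add_component', 'run_pipeline'],
#  'components_added': ['Filter'], 'times_thought': 2}
```

Because the function operates on plain step records, the same extraction can be re-run over historical traces, which is what makes the run-over-run comparisons described above possible.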

Looking Ahead

With this evaluation framework in place, we now have a solid foundation for regression testing, ensuring that as Maia evolves, improvements in one area do not cause unintended regressions elsewhere.

This approach allows us to:

  • Test consistently – New model versions or prompt changes can be run against the same datasets for fair comparison.
  • Track trends over time – Custom metrics allow us to monitor whether efficiency, correctness, and cost-effectiveness are improving.
  • Identify edge cases – Scenario-specific datasets reveal where Maia still struggles, guiding targeted improvements.
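A regression check of this kind can be as simple as comparing the custom metrics of a new run against a stored baseline. The metric names and tolerance policy below are illustrative choices, not part of Maia or Langfuse.

```python
def regressed(baseline: dict, candidate: dict, tolerance: int = 0) -> list[str]:
    # Flag a candidate run that works noticeably harder than the baseline.
    problems = []
    if len(candidate["tools_used"]) > len(baseline["tools_used"]) + tolerance:
        problems.append("more tool calls than baseline")
    if candidate["times_thought"] > baseline["times_thought"] + tolerance:
        problems.append("more thinking steps than baseline")
    return problems

baseline = {"tools_used": ["a", "b"], "times_thought": 2}
candidate = {"tools_used": ["a", "b", "c"], "times_thought": 2}
print(regressed(baseline, candidate))  # ['more tool calls than baseline']
```

Running a check like this after every model or prompt change turns the custom metrics into an automated guardrail rather than a one-off analysis.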

As we continue to refine Maia, this setup will allow us to move quickly and confidently with clear, objective insights into how each change affects the agent's behavior.

Experience Reflection

Working at Matillion has been an incredibly rewarding experience, both technically and professionally. On the technical side, I’ve gained hands-on experience with agentic AI evaluation, prompt engineering, dataset creation, and performance analytics. I’ve also deepened my understanding of how multi-tool AI systems operate in dynamic environments.

On the professional side, I’ve learned how to work within an agile, collaborative, and feedback-driven environment. The team has been exceptionally supportive, offering guidance when needed while also giving me the autonomy to take ownership of my project. This balance has allowed me to develop problem-solving skills that extend beyond technical ability.

Jessica Luu

Software Engineer Intern

Jessica is a rising junior at MIT majoring in Computer Science with a concentration in Artificial Intelligence. She conducts research at the Computer Science and Artificial Intelligence Laboratory (CSAIL) and serves on the First-Gen Low Income (FLI) student board. Outside of academics, Jessica enjoys exploring the city and getting lost in a good book.
