We’re Sasha and Amanda, software engineering interns at Matillion, and this summer we had the opportunity to work on something both challenging and impactful: a data pipeline quality reviewer, part of Matillion’s data pipeline quality stack.
The mission of this toolkit is to help Matillion users enforce consistent, maintainable, and readable data pipelines by catching issues early and preventing costly fixes later. Our contribution, the data pipeline quality reviewer, sits at the top of Matillion’s data pipeline quality test pyramid and focuses on static, pre-execution analysis of pipelines.
Why We Built a Data Pipeline Quality Reviewer
Matillion’s customers rely on the Data Productivity Cloud to create, manage, and deploy pipelines, sometimes across large, complex organizations. Without consistent standards in place, data engineering teams risk introducing bad practices into production. That can lead to increased technical debt, bloated or unmanageable pipelines, or missed opportunities for automation.
The reviewer is designed to perform quality checks on data pipelines before they are run. That means issues get flagged early in development, right in the Data Productivity Cloud designer interface, so teams can fix them quickly.
To illustrate these concepts, here are some common data pipeline quality checks we’ve implemented: enforcing naming conventions (e.g., snake_case), detecting unused or disconnected components, and limiting pipeline size for maintainability.
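As a rough illustration of what a naming-convention check involves, here is a minimal Python sketch. The function name, the regular expression, and the component names are our own assumptions for this example, not Matillion’s actual implementation.

```python
import re

# Matches snake_case: lowercase words separated by underscores, e.g. "load_orders".
# This pattern is an illustrative assumption, not the reviewer's real rule.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def check_naming(component_names):
    """Return the component names that violate the snake_case convention."""
    return [name for name in component_names if not SNAKE_CASE.match(name)]
```

A check like this is cheap to run on every save, which is what makes pre-execution feedback in the designer practical.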
By catching these issues early, the tool supports automated data pipeline solutions with built-in data quality checks, reducing operational risk and making pipelines easier to scale and maintain.
From Proof of Concept to Production
We started with a high-level pitch: build a static analysis tool for Matillion’s proprietary Data Pipeline Language (DPL) that could enforce rules programmatically. Our first step was a proof of concept, implementing just a few simple rules like checking minimum and maximum component counts in a pipeline.
Once that was working, we layered in more complex checks by allowing configurable rules through a YAML-based rules file. This rules configuration file allows individual users, or entire organizations, to define their own standards. For example, one team might require snake_case for all component names, while another might prefer PascalCase. The data pipeline quality reviewer allows for consistent enforcement of these preferences.
Engineering the Solution
Our project had both frontend and backend components. On the frontend, we constructed a basic UI for triggering linting runs and viewing results. On the backend, we implemented the review logic itself, including reading DPL files, applying configured rules, and returning violations with suggested fixes.
To bridge the two, we built a dedicated API endpoint for running reviews and updating rules. This makes the tool consumable both by human users in the Data Productivity Cloud UI and possibly by other services in the future, like automated CI/CD pipelines.
We also learned a lot about professional testing practices. We wrote unit tests, integration tests, and API tests, making sure our reviewer worked as expected and didn’t break existing systems. Participating in code reviews with senior engineers gave us valuable insight into coding standards, maintainability, and how to focus on customer and company needs while engineering.
This whole process was exciting because we designed and built the tool from scratch within Matillion’s existing infrastructure, making real design decisions that directly affect how users experience the feature. Developing the backend architecture and incorporating a basic UI prototype made it rewarding to see how our tool fits into Matillion’s Data Productivity Cloud ecosystem.
Rule Execution and Configuration
At the core of the reviewer is the rule engine. It reads DPL – a structured, YAML-based format that defines the components, transitions, and parameters of a pipeline – and applies a series of static data pipeline quality checks.
We implemented these checks in a modular way so that each rule is a separate function that can identify violations in a given DPL file, communicate those violations with contextual details (including what is wrong and where the violations occurred), and suggest possible fixes the user could apply.
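To make the modular rule shape concrete, here is a hedged Python sketch of what one such rule function might look like. The `Violation` fields, the DPL dictionary structure, and the default limit are illustrative assumptions, not the real schema.

```python
from dataclasses import dataclass

@dataclass
class Violation:
    rule_id: str
    message: str      # what is wrong
    location: str     # where the violation occurred
    suggestion: str   # a possible fix the user could apply

def too_many_components(dpl: dict, max_components: int = 15) -> list[Violation]:
    """Example rule: flag pipelines that exceed a component count limit.

    Assumes the parsed DPL exposes a "components" mapping; the actual
    DPL structure is Matillion-internal.
    """
    components = dpl.get("components", {})
    if len(components) <= max_components:
        return []
    return [Violation(
        rule_id="too-many-components",
        message=f"Pipeline has {len(components)} components (max {max_components})",
        location="pipeline",
        suggestion="Split the pipeline into smaller, reusable pipelines",
    )]
```

Keeping each rule as an isolated function that returns structured violations makes rules easy to test individually and easy to enable or disable from configuration.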
The rules defined in a user’s configuration YAML file are used directly while running pipeline reviews. This format is human-readable, easy to version control, and flexible enough to store both global defaults and local overrides.
For example:
```yaml
- id: too-many-components
  ruleType: standard
  category: structure
  description: checks that the number of components in a pipeline is at or below a defined maximum
  severity: medium
  enforcement: warn
  enabled: true
  config:
    maxComponents: 15
```
In this way, users or teams can customize their rule standards according to varying priorities. This allows one team to enforce strict naming patterns while another might focus more on pipeline size or metadata completeness.
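To show how a configuration like the one above might drive the engine, here is a minimal Python sketch. The dispatch table, rule signatures, and returned messages are hypothetical simplifications of the design described in this post.

```python
# Maps rule ids to check functions; each takes the parsed DPL and the rule's
# "config" section and returns a list of violation messages. Illustrative only.
RULES = {
    "too-many-components": lambda dpl, cfg: (
        [] if len(dpl.get("components", {})) <= cfg["maxComponents"]
        else [f"too many components: {len(dpl['components'])}"]
    ),
}

def run_review(dpl, rule_configs):
    """Apply every enabled, known rule from the configuration to a pipeline."""
    violations = []
    for rule in rule_configs:
        if rule.get("enabled") and rule["id"] in RULES:
            violations += RULES[rule["id"]](dpl, rule.get("config", {}))
    return violations
```

Because the configuration is plain data, swapping one team’s standards for another’s is just a matter of pointing the reviewer at a different rules file.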
How It Fits Into the Bigger Picture
The data pipeline quality reviewer is the “first pass” quality filter. It runs quickly, catches objective problems, and gives developers immediate feedback. This means fewer surprises during later, more expensive stages of development.
In the future, the reviewer will integrate with Maia, Matillion’s AI-powered assistant for building pipelines. This will allow users to automatically fix certain violations and set pipeline quality rules for Maia to follow during pipeline creation. Additionally, we have explored supporting AI-aided generation of custom rules via natural language input from users.
This opens the door to fully automated data pipeline solutions with built-in data quality checks that scale across teams and even entire organizations.
Lessons Learned
Working on this project taught us far more than just technical skills. From an engineering point of view, we learned how to navigate a large, complex codebase and design flexible systems that support both standardization and customization. We also gained a high-level understanding of how engineering and product teams collaborate to prioritize features that impact customers most. We were able to combine these insights to narrow our proof of concept down to its most impactful features.
We also experienced firsthand the difference between academic coding projects and production software development. In class, testing is often an afterthought; at Matillion, testing is a core part of development from the very start.
Next Steps
Right now, we’re productionizing our code – evaluating performance, refining rule definitions, and ensuring the API works seamlessly with existing Data Productivity Cloud infrastructure.
Our hope is that this tool will help data engineers, analytics teams, and platform owners adopt a culture of proactive quality, catching issues before they cause problems in production.
Final Reflections
Beyond the code, our time at Matillion has been shaped by its culture. We were welcomed warmly into the Manchester office, supported by engineers, product managers, and designers who were always ready to answer questions and share knowledge. We were trusted to make decisions, encouraged to experiment, and given visibility into the full lifecycle of our project.
It’s rare for interns to have this much ownership, and we’re grateful for the opportunity. For us, this project was a chance to build something that will help the data teams Matillion serves create more reliable, maintainable pipelines. It was extremely rewarding to work on a customer-facing tool, and we will undoubtedly continue to practice the skills and values we have learned at Matillion in all our future endeavors.
Amanda Hulver & Sasha Krigel
Software Engineering Interns
Amanda is a rising 3rd-year at MIT studying Computer Science with a minor in Economics. On campus, she is an officer for MIT IEEE/ACM, leads tours for MIT admissions, is part of the Autonomous team for MIT Motorsports, and conducts AI research at MIT’s Laboratory for Information and Decision Systems. In her free time, she enjoys playing board games, going to concerts, and exploring new places.
-
Originally from Austin, Texas, Sasha is a rising junior at MIT studying Computer Science with a concentration in business analytics. During the semester, she conducts research at the MIT Media Lab and serves on the student exec board of her dorm. Outside of school and work, you can find her experimenting with dessert recipes, swimming, and going to concerts.