Bridging the Gap: Applying Conventional DevOps Techniques to Data Engineering

DevOps is a paradigm shift that combines agile development methodologies and IT operations. By fostering collaboration between these teams, DevOps improves efficiency, speed, and reliability in the software development lifecycle.

In the data world, the marriage of DevOps techniques with data engineering practices similarly paves the way for a more agile, collaborative, and automated approach to managing and delivering data solutions.

To bring this to life, let me take you back to a conversation I had early in my career as a data engineer, well before DevOps even existed.

DevOps: the DBAs of the 21st Century?

The Database Administrator (DBA) was the original guardian of the database. Nothing happened without their approval. I was at a financial services client with reasonably high security and was trying to get a new SQL script deployed into Production.

  • DBA: "I can't deploy this script until you tell me which schema to use."
  • Me: "I don't know the layout in Production. Can you tell me what schemas are available so I can work out which is the right one?"
  • DBA: "No. For security reasons, I can't tell you that."

That was the end of the conversation. A classic - and unproductive - collision between development and IT operations.

After some head-scratching, I eventually worked out how to move forward. I made the script prompt for the schema name at runtime, then used parameter substitution to apply that value everywhere it was needed. The DBA could keep the schema names secret while I got the application live.

Nowadays, we would recognize that as a standard DevOps approach: separating the application logic from the infrastructure provisioning. For example, the Data Productivity Cloud does exactly this when deploying a hybrid SaaS agent via CloudFormation.
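The principle can be sketched in a few lines. In this illustration (the table and schema names are hypothetical), the SQL script contains no environment-specific details at all; the schema name is supplied at deploy time by whoever knows the production layout:

```python
from string import Template

# A deployable SQL script with no environment-specific values baked in.
# ${schema} is a placeholder filled in at deploy time by the operator
# (or a provisioning tool) -- the developer never needs to know it.
SQL_TEMPLATE = Template("""
CREATE TABLE ${schema}.daily_balances (
    account_id INTEGER,
    balance    NUMERIC(18, 2)
);
""")

def render(schema: str) -> str:
    """Substitute the deploy-time schema name into the script."""
    return SQL_TEMPLATE.substitute(schema=schema)
```

The application logic (the CREATE TABLE statement) stays under the developer's control and in version control; the infrastructure detail (the schema name) stays with operations.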

In the rest of this article, I'll look in more detail at how key aspects of DevOps apply to the world of data.

Collaborate and break down silos

Effective collaboration and communication facilitate sharing knowledge and insights in the data domain, exactly as they do in DevOps.

The Data Productivity Cloud embodies the essence of DevOps collaboration without any unnecessary complication or jargon. The Git framework isn't just a feature; it's the backbone, strategically woven into every action. Versioning isn't an afterthought; it's intentional, ensuring full and precise control over every deployment.

Git is at the heart of the Data Productivity Cloud

Matillion's PipelineOS integrates seamlessly with various Git providers, allowing you to automate data flow across the application lifecycle according to your own infrastructure needs. Whether using Matillion-hosted Git or bring-your-own, versioning pipelines is a straightforward, industry-standard affair.

For example, users can keep their familiar, existing processes in GitHub, using commits, pulls, and merges, with GitHub Actions for reviews and approvals. This moves beyond being just a data tool: Matillion delivers a true, UI-driven Git experience.

A true, UI-driven Git experience

By fostering cross-functional collaboration between data engineers, data scientists, and operations teams in this way, organizations can break down silos and take advantage of their data ecosystem more productively.

Automate deployment for seamless code evolution

In the realm of data engineering, Continuous Integration (CI) and Continuous Delivery/Deployment (CD) are integral DevOps practices. CI ensures code changes are merged and tested frequently, promoting early bug detection and reducing integration hiccups. Extending CI with Continuous Delivery keeps the codebase production-ready on demand; Continuous Deployment goes a step further and automates the release itself.

Again, this is where the Matillion Data Productivity Cloud steps in. Going beyond mere point solutions, Matillion focuses on enhancing the overall productivity of the data team by automating manual tasks and simplifying complex workflows. It empowers the data team to work at an unprecedented speed and scale. Examples include:

  • Data engineers can iterate over variables and rows effortlessly, allowing the seamless construction of sophisticated business logic with unmatched speed and precision.
  • Data Productivity Cloud users can automate pipeline execution and promotion across environments using public REST APIs.
  • Execution engines are entirely separated from pipeline logic.
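To illustrate the REST-driven promotion pattern in the list above (the endpoint path and payload fields here are hypothetical examples, not Matillion's actual API; consult the product's API reference for the real shapes), a CI/CD job would typically construct a request like this:

```python
import json
from urllib.request import Request

# Hypothetical base URL and endpoint, for illustration only.
API_BASE = "https://api.example.com/v1"

def build_run_request(pipeline: str, environment: str, token: str) -> Request:
    """Construct (but do not send) a request that triggers a pipeline
    run in a named environment, as a CI/CD job might after a merge."""
    payload = json.dumps({"pipeline": pipeline, "environment": environment})
    return Request(
        url=f"{API_BASE}/pipelines/executions",
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Because the request is just data, the same job can promote the same pipeline through dev, test, and production by changing only the environment parameter.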

Matillion is the linchpin for an everyone-ready, stack-ready, and future-ready data environment, tying together intricate pipeline steps seamlessly and providing unmatched orchestration flexibility. The Matillion Data Productivity Cloud ensures a strategic approach that aligns data infrastructure with evolving business needs.

Infrastructure as Code (IaC)

IaC is the DevOps principle of treating infrastructure provisioning and management as code. This allows for consistent and repeatable infrastructure deployments, reducing manual errors and enhancing scalability.
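As a minimal sketch of the principle (the resource names and properties here are illustrative, not a real production template): instead of clicking through a console, the environment is declared in code and rendered deterministically, so every deployment starts from the same definition.

```python
import json

def agent_stack(env: str, instance_count: int) -> dict:
    """Declare a (hypothetical) agent environment as data. The same
    function produces dev and prod stacks that differ only in the
    parameters passed in -- no manual drift between environments."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Data agent stack for {env}",
        "Resources": {
            "AgentService": {
                "Type": "AWS::ECS::Service",
                "Properties": {"DesiredCount": instance_count},
            }
        },
    }

# Rendered templates are plain text: diffable, reviewable, versioned in Git.
dev_template = json.dumps(agent_stack("dev", 1), indent=2)
prod_template = json.dumps(agent_stack("prod", 3), indent=2)
```

Because the templates are generated rather than hand-edited, a change reviewed once in version control applies identically everywhere, which is exactly the consistency and repeatability IaC promises.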

Many features built into the Data Productivity Cloud enable teams to automate infrastructure tasks, ensuring that the development and production environments remain synchronized.

Firstly, Matillion's containerized runner and fully SaaS designer interface liberate you from setup worries, allowing for dynamic launches and hassle-free scalability.

The zero-install, automatically upgraded managed service means saying goodbye to manual maintenance headaches. Container-based infrastructure is a streamlined, efficient approach that's as straightforward as possible.

The metamodel separates Environments, Schedules, Agents, Custom Connectors, Observability, Secrets, and Credentials into separately managed entities. Combined with a range of secure, public REST APIs, this enables data engineers to easily manage infrastructure and seamlessly automate pipeline promotion and execution across environments.

In summary

In the ever-evolving landscape of technology, integrating DevOps methodologies into data engineering practices has become a crucial aspect of ensuring efficient and agile development.

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.