Automated Schema Inference: Keep Your CSV Data Loads Running Smoothly

Schema drift occurs when the structure or format of input data files changes unexpectedly. This can manifest in several ways, such as the addition of new columns, alterations in data types, or instances of missing data.

When these changes occur, data processing systems may struggle to read the data correctly, potentially leading to delays as engineers work to manually correct these issues before processing can continue.

This is how schema drift can disrupt data workflows and reduce overall efficiency.

What is Automated Schema Inference?

Automated schema inference is a process that identifies and defines the structure of source data, including data types and relationships, by inspecting the data itself.

When source data files change without warning (schema drift), automated schema inference helps by adjusting to the structure of each new input file.

CSV files are widely used in data engineering because they are a simple, practical data exchange format. Schema drift in CSV files can take several forms:

  • Columns being added, removed or renamed
  • Changes to data types and formats
  • Alterations to column ordering
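
As a rough illustration of the column-level cases above, the following minimal Python sketch compares the header a pipeline expects with the header of a newly arrived file. The function name `detect_drift` and the example column names are hypothetical, chosen only for this illustration. It does not cover data type or format changes.

```python
def detect_drift(expected, observed):
    """Compare an expected column list with the header of a new file."""
    expected_set, observed_set = set(expected), set(observed)

    # Columns that appear only in the new file, or only in the expected list.
    added = [c for c in observed if c not in expected_set]
    removed = [c for c in expected if c not in observed_set]

    # Columns present in both lists, kept in each list's own order.
    common_expected = [c for c in expected if c in observed_set]
    common_observed = [c for c in observed if c in expected_set]
    reordered = common_expected != common_observed

    return {"added": added, "removed": removed, "reordered": reordered}


report = detect_drift(
    expected=["id", "name", "amount"],
    observed=["name", "id", "amount", "currency"],
)
print(report)
```

A renamed column cannot be distinguished from the header alone, so in this sketch it shows up as one removed column plus one added column.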

To handle schema drift, an automated solution inspects each new file and dynamically creates a database table to store the data. The column names and data types, together known as the schema, are inferred from a combination of the column headers and the actual data that follows.
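
The idea can be sketched in a few lines of Python. This is a simplified illustration, not how any particular database or Matillion performs inference: the function names, the three candidate types (INTEGER, FLOAT, VARCHAR) and the sample size are all assumptions made for the example.

```python
import csv
import io
import itertools


def infer_type(values):
    """Return the narrowest of INTEGER, FLOAT or VARCHAR that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "VARCHAR"
    for sql_type, cast in (("INTEGER", int), ("FLOAT", float)):
        try:
            for v in non_empty:
                cast(v)
            return sql_type
        except ValueError:
            continue  # at least one value did not fit; try the next, wider type
    return "VARCHAR"


def infer_schema(csv_text, delimiter=",", sample_rows=100):
    """Infer (column name, type) pairs from the header and a sample of data rows."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    rows = list(itertools.islice(reader, sample_rows))
    schema = []
    for i, name in enumerate(header):
        column_values = [row[i] for row in rows if i < len(row)]
        schema.append((name.strip(), infer_type(column_values)))
    return schema


def create_table_ddl(table, schema):
    """Build a CREATE TABLE statement from the inferred schema."""
    cols = ", ".join(f'"{name}" {sql_type}' for name, sql_type in schema)
    return f'CREATE TABLE "{table}" ({cols})'


sample = "id,name,amount\n1,Ann,9.50\n2,Bob,12\n"
print(create_table_ddl("orders", infer_schema(sample)))
```

Sampling only the first rows keeps inference cheap, at the risk of choosing too narrow a type if later rows differ. A production approach would need to handle that case, along with dates, booleans and quoted nulls.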

This makes data ingestion smoother, faster and more reliable, reducing downtime and keeping data workflows moving efficiently even when changes occur in the input files.

Matillion and Schema Inference

With Matillion, CSV schema inference is pushed down to the database: the database itself, rather than Matillion, infers the schema. Because of this pushdown architecture, the cost of dynamically inferring a schema in each pipeline is almost zero, making it a cost-effective way of handling schema drift.

In the low code interface, there are three steps:

  • Supply the unique features of your CSV file, such as delimiters
  • Load the data into a table created by the automated schema inference
  • Continue with the data transformation and integration
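
Outside the low code interface, the same three-step shape can be sketched in Python. This is only an analogy and not Matillion's implementation: it uses an in-memory SQLite database as a stand-in for a cloud data platform, performs a very basic inference in Python rather than pushing it down, and the function names, table name and sample data are hypothetical.

```python
import csv
import io
import sqlite3


def infer_sqlite_type(values):
    """Pick REAL when every non-empty value parses as a number, else TEXT."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "TEXT"
    return "REAL"


def load_csv(conn, table, csv_text, delimiter=","):
    # Step 1: the caller supplies the file's unique features (here, the delimiter).
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines

    # Step 2: create a table from the inferred schema, then load the data.
    # Assumes every data row has as many fields as the header.
    columns = ", ".join(
        f'"{name}" {infer_sqlite_type([row[i] for row in rows])}'
        for i, name in enumerate(header)
    )
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)


def total_by_region(conn, table):
    # Step 3: continue with transformation, here a simple aggregation in SQL.
    query = f'SELECT region, SUM(amount) FROM "{table}" GROUP BY region ORDER BY region'
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_csv(conn, "sales", "region;amount\nEMEA;10.5\nAPAC;4\nEMEA;2\n", delimiter=";")
print(total_by_region(conn, "sales"))
```

Because the table is created from whatever the file contains, a new column in a later file would simply produce a wider table instead of a failed load.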

Matillion believes that schema inference is a necessary but routine task. It's exactly the type of activity that should be automated. The real value of your data is realised through transformation and integration, rather than just copying it from one place to another.

Matillion: A Unified Data Pipeline Platform

Matillion is a data pipeline platform designed to enable data teams to build and manage pipelines faster for AI and Analytics. It's a unified SaaS platform for end-to-end data pipelines. Matillion offers a wide range of features, including universal connectivity, a code-optional interface and an AI copilot - all underpinned by pushdown compute.

Matillion leverages the power of cloud data platforms to manage pipelines at scale and bring AI capabilities to data engineering. This environment fosters productivity and collaboration, enabling teams to build pipelines more quickly regardless of skill level. 

Data engineers can download the pipelines discussed in this article from the Matillion Exchange.

 

Ian Funnell

Data Alchemist

Ian Funnell, Data Alchemist at Matillion, curates The Data Geek weekly newsletter and manages the Matillion Exchange.
Follow Ian on LinkedIn: https://www.linkedin.com/in/ianfunnell
