
An Introduction to Data Ingestion

 

Enterprises are dealing with an unprecedented influx of data, and forecasts indicate that growth will continue at an even faster pace in years to come. Working effectively and strategically with data to gain actionable business intelligence is more important than ever before in helping businesses gain a competitive edge. That’s where data ingestion comes in. 

 

In today’s digital landscape, that means analyzing data from an ever-increasing number of sources, from databases and SaaS platforms to mobile and IoT devices. But before businesses can analyze their data, they need to ingest it, bringing it all together in a centralized location.

 

What is data ingestion?


In data ingestion, enterprises transport data from various sources to a target destination, often a storage medium. Ingestion is similar to data integration, which combines data from internal systems, but it also extends to external data sources.

 

A data ingestion layer can be architected in several different ways, with design often dictated by how quickly an organization needs analytical access to the data.

 

Batch data ingestion

 

The most commonly used model, batch data ingestion, collects data in large jobs, or batches, for transfer at periodic intervals. Data teams can set the task to run based on logical ordering or simple scheduling.

 

Companies typically use batch ingestion for large datasets that don’t require near-real-time analysis. For example, a business that wants to delve into the correlation between SaaS subscription renewals and customer support tickets could ingest the related data on a daily basis—it doesn’t need to access and analyze data the instant a support ticket resolves.
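
As a minimal sketch of the batch model, the Python below loads one accumulated batch into a destination table in a single transaction. The table name, column names, and sample rows are hypothetical, and an in-memory SQLite database stands in for the real destination. In practice, a scheduler would call a function like this once per interval.

```python
import sqlite3

def ingest_batch(conn, records):
    """Load one accumulated batch into the target table in a single transaction."""
    with conn:  # commits on success, rolls back if any insert fails
        conn.executemany(
            "INSERT INTO support_tickets (ticket_id, customer_id, resolved_on) "
            "VALUES (?, ?, ?)",
            records,
        )
    return len(records)

# An in-memory database stands in for the real destination (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE support_tickets ("
    "ticket_id INTEGER PRIMARY KEY, customer_id TEXT, resolved_on TEXT)"
)

# Hypothetical records collected since the previous run.
daily_batch = [
    (1, "cust-a", "2024-01-01"),
    (2, "cust-b", "2024-01-01"),
    (3, "cust-a", "2024-01-01"),
]

loaded = ingest_batch(conn, daily_batch)
row_count = conn.execute("SELECT COUNT(*) FROM support_tickets").fetchone()[0]
```

Wrapping the whole batch in one transaction means a failed run leaves the destination unchanged, so the job can simply be retried.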

 

Streaming data ingestion

 

Streaming data ingestion collects data in real time for immediate loading into a target location. 

This is a more costly ingestion technique, requiring systems to continually monitor sources, but one that’s necessary when instant information and insight are at a premium.

 

For example, online advertising scenarios that demand a split-second decision—which ad to serve—require streaming ingestion for data access and analysis.   
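
By contrast with batching, a streaming pipeline hands each record to the destination as soon as it arrives. The sketch below illustrates the idea under simplifying assumptions: `event_stream` is a hypothetical generator standing in for a source that would normally block on a socket or message queue, and a plain list plays the role of the destination.

```python
from typing import Callable, Iterable, Iterator

def event_stream() -> Iterator[dict]:
    """Hypothetical source; a real one would wait on a socket or message queue."""
    yield {"user": "u1", "page": "/home"}
    yield {"user": "u2", "page": "/pricing"}
    yield {"user": "u1", "page": "/checkout"}

def ingest_stream(events: Iterable[dict], sink: Callable[[dict], None]) -> int:
    """Forward every event to the sink the moment it is received."""
    count = 0
    for event in events:
        sink(event)  # load immediately, with no buffering
        count += 1
    return count

destination = []  # stands in for the real target location
processed = ingest_stream(event_stream(), destination.append)
```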

 

Micro-batch data ingestion

 

Micro-batch data ingestion takes in small batches of data at very short intervals, typically less than a minute. The technique makes data available in near-real-time, much like a streaming approach. In fact, the terms micro-batching and streaming are often used interchangeably in data architecture and software platform descriptions.
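
One way to picture micro-batching is as a buffer that flushes when it reaches a size limit or when a short time window has elapsed, whichever comes first. The following is a minimal sketch of that idea. The function name, parameter names, and default values are illustrative assumptions and do not come from any particular platform.

```python
import time

def micro_batches(events, max_size=100, max_wait_seconds=5.0, clock=time.monotonic):
    """Group an event stream into small batches.

    A batch is emitted when it holds max_size events or when
    max_wait_seconds have passed since the batch was started.
    """
    batch = []
    started = clock()
    for event in events:
        batch.append(event)
        if len(batch) >= max_size or clock() - started >= max_wait_seconds:
            yield batch
            batch = []
            started = clock()
    if batch:  # flush whatever is left when the source ends
        yield batch

# Seven events with a size limit of three; the time limit is not reached here.
batches = list(micro_batches(range(7), max_size=3, max_wait_seconds=60.0))
```

Note that this sketch only checks the clock when an event arrives, so a quiet source would delay the flush. A production implementation would typically use a timer to flush on schedule.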

 

The data ingestion process

 

To ingest data, a simple pipeline extracts data from where it was created or stored and loads it into a selected location or set of locations. When the paradigm includes steps to transform the data—such as aggregation, cleansing, or deduplication—it is considered an Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) procedure. The two core components comprising a data ingestion pipeline are:

 

Sources

The process can extend well beyond a company’s enterprise data center. In addition to internal systems and databases, sources for ingestion can include IoT applications, third-party platforms, and information gathered from the internet.

 

Destinations


Data lakes, data warehouses, and document stores are often target locations for the data ingestion process. An ingestion pipeline may also simply send data to an app or messaging system.   
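
The extract, transform, and load steps described above can be sketched as three small functions. This is an illustration under stated assumptions rather than a reference design: the field names and sample rows are hypothetical, the source is a list of dictionaries, and a list named `warehouse` stands in for the destination.

```python
def extract(source_rows):
    """Read raw rows from the source."""
    return list(source_rows)

def transform(rows):
    """Cleanse email addresses and drop duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()  # cleansing
        if email in seen:                     # deduplication
            continue
        seen.add(email)
        cleaned.append({"email": email, "plan": row["plan"]})
    return cleaned

def load(rows, destination):
    """Append the prepared rows to the destination."""
    destination.extend(rows)
    return len(rows)

# Hypothetical source data containing one duplicate with messy formatting.
source = [
    {"email": " Ana@Example.com ", "plan": "pro"},
    {"email": "ana@example.com", "plan": "pro"},
    {"email": "bo@example.com", "plan": "free"},
]

warehouse = []
loaded_count = load(transform(extract(source)), warehouse)
```

Calling `load(extract(source), warehouse)` without the transform step would be plain ingestion. Running the transform after loading instead of before would correspond to the ELT ordering.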

 

Examples and challenges 

 

Common business use cases for data ingestion include:

 

  • Moving data from siloed in-house systems to a reporting or analytics platform for enterprise-wide access
  • Taking in a continual stream of data from various sources as part of a marketing campaign
  • Collecting data from different suppliers to develop an in-house product line
  • Taking in a large quantity of daily records for an internal Salesforce platform 
  • Allowing customers to ingest and aggregate data via an application programming interface (API)
  • Capturing data from a Twitter feed for further analysis

 

Setting up a pipeline in-house comes with a complex set of challenges. In the past, organizations could write scripts or manually create maps for their processes. But with the size and diversity of data today, older methods often aren’t adequate for businesses moving at a rapid pace.

 

For one, the data that companies need to ingest is often managed by third parties, which can make it difficult to work with, particularly if it’s not fully documented. If a marketing team needs to load data from an external system into a marketing application, for example, considerations include:

  • Quality: Is the data of a sufficient quality? By what metrics?
  • Format: Can the ingestion pipeline handle all the various data formats? 
  • Reliability: Is the data stream reliable?
  • Access: How will the pipeline access the source data? How much in-house IT work will that require?
  • Updates: How often does the source data update?
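
The quality question in particular lends itself to an automated check at the point of ingestion. The sketch below shows one possible approach: each incoming record is validated against a few simple rules, and failures are set aside with a reason instead of being loaded. The required fields and the email rule are hypothetical examples of quality metrics.

```python
REQUIRED_FIELDS = ("id", "email", "updated_at")  # assumed schema for illustration

def validate(record):
    """Return a list of quality problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    email = record.get("email") or ""
    if email and "@" not in email:
        problems.append("malformed email")
    return problems

def split_by_quality(records):
    """Separate records that pass validation from those that do not."""
    accepted, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected

# Hypothetical rows from a third-party system.
incoming = [
    {"id": 1, "email": "ana@example.com", "updated_at": "2024-01-01"},
    {"id": 2, "email": "not-an-email", "updated_at": "2024-01-01"},
    {"id": 3, "email": "", "updated_at": None},
]

accepted, rejected = split_by_quality(incoming)
```

Keeping the rejected records together with their reasons gives the team a concrete measure of source quality over time.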

 

In addition, managing a pipeline will demand significant time and resources if manual supervision and process administration are involved. Human intervention in the process also greatly increases the risk of error and, ultimately, data integrations that fail.

 

And as always, data governance and security are chief concerns. These are particularly vital when determining how to expose data to users. When designing an ingestion pipeline, organizations have to consider:

  • Whether the data will be exposed both internally and externally
  • Who will have data access and what kind of access they will have
  • Whether the data is sensitive and what level of security it requires
  • What regulations apply to the data and how to comply with them   
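
As one illustration of the access question, a pipeline can present different views of the same record depending on who is asking. The sketch below is a simplified assumption of how that might look: the role names and the set of sensitive fields are hypothetical, and the unsalted, truncated hash is only a placeholder for whatever masking or tokenization the applicable regulations require.

```python
import hashlib

SENSITIVE_FIELDS = {"email", "ssn"}  # assumed classification for illustration

def mask(value):
    """Replace a value with a short, non-reversible token (placeholder technique)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def view_for(record, role):
    """Return the record as the given role is allowed to see it."""
    if role == "admin":
        return dict(record)  # full access
    return {
        key: mask(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }

record = {"customer_id": "c-42", "email": "ana@example.com", "plan": "pro"}
admin_view = view_for(record, "admin")
analyst_view = view_for(record, "analyst")
```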

 

Matillion and Data Ingestion

 

For fast, automated, and security-rich data ingestion, businesses are increasingly turning to cloud-based solutions. With platforms designed to easily extract and load data from multiple data sources into cloud data environments, companies avoid the costs, complexity, and risk associated with ingestion pipelines designed and implemented by in-house IT teams.

 

Matillion offers cloud-native applications to help enterprises rapidly ingest data for analytics and business innovation:

  • Matillion Data Loader software helps companies continuously extract and load data into their chosen cloud data environment
  • Matillion ETL software is designed for companies that also require powerful capabilities for data transformation

 

Learn how Duo Security created a single, easily replicable model for transforming financial data and accelerated reporting from days to just minutes with Matillion ETL software.

 

Get started with Matillion Data Loader for free

 


 

Request a Matillion ETL Demo
