Matters of extraction: Introducing Matillion's new Azure Document Intelligence component

Staying ahead in data engineering means continually improving your toolset. Matillion's latest addition, the Azure Document Intelligence component, is a significant step forward for automated document processing in your pipelines.

The new component helps you streamline workflows that rely heavily on document data, so that tedious manual data extraction from forms and documents can become a thing of the past.

Azure Document Intelligence: A brief insight

Azure Document Intelligence, part of the Azure AI suite, employs machine learning models to automate the extraction of text, handwriting, and layout elements from various documents. It lets users convert unstructured documents into machine-readable output such as Markdown or plain text. This capability is valuable for automated document processing in data pipelines, because it supports data-driven strategies and improves search within documents.

The key strengths of Azure Document Intelligence lie in its customizable document models, which can be tailored to specific business needs, and in its ability to improve data quality, reduce errors, and shorten time to insight, all while maintaining cost efficiency. For organizations already embedded in the Azure ecosystem in particular, the integration promises a seamless, scalable, and efficient solution.

Utilizing Matillion’s Azure Document Intelligence component

Using the Matillion Azure Document Intelligence component begins with a few essential setup steps on the Azure side. As a one-time setup step, users first add Azure cloud credentials to their Data Productivity Cloud project. This involves registering an application and granting it specific Azure roles, including Storage Blob Data Contributor, Cognitive Services User, and Cognitive Services Contributor.
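As a rough illustration of the role requirement above, the following Python sketch compares a list of role names against the three roles named in this post. The function name and the idea of passing roles in as plain strings are assumptions made for illustration. The sketch does not call Azure, and how you look up the roles actually assigned to your application (portal, CLI, or SDK) is left to you.

```python
# Roles named in this post as required for the registered application.
REQUIRED_ROLES = frozenset({
    "Storage Blob Data Contributor",
    "Cognitive Services User",
    "Cognitive Services Contributor",
})


def missing_roles(assigned_roles):
    """Return the required roles not present in assigned_roles, sorted by name."""
    return sorted(REQUIRED_ROLES - set(assigned_roles))


if __name__ == "__main__":
    # Hypothetical example: the application has only one of the three roles.
    print(missing_roles(["Cognitive Services User"]))
```

An empty result means all three roles are present in the list you supplied.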

Once this is done, deploying the component involves specifying the Azure Blob storage location, setting the Document Intelligence Service Endpoint, and choosing your desired output format: either Markdown or Text.
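The three settings described above can be pictured as a small record with some basic sanity checks. This is only a sketch: the class name, field names, and validation rules (including the assumption that the endpoint is an https URL) are hypothetical and do not represent Matillion's actual configuration schema. The values in the usage example are placeholders.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Output formats named in this post.
OUTPUT_FORMATS = ("Markdown", "Text")


@dataclass(frozen=True)
class DocumentIntelligenceSettings:
    """Hypothetical container for the three settings described in this post."""

    blob_location: str     # Azure Blob storage location of the source documents
    service_endpoint: str  # Document Intelligence Service Endpoint
    output_format: str     # "Markdown" or "Text"

    def problems(self):
        """Return a list of readable problems; an empty list means no issue was found."""
        found = []
        if not self.blob_location.strip():
            found.append("blob_location is empty")
        endpoint = urlparse(self.service_endpoint)
        # Assumption: the service endpoint is always an https URL.
        if endpoint.scheme != "https" or not endpoint.netloc:
            found.append("service_endpoint should be an https URL")
        if self.output_format not in OUTPUT_FORMATS:
            found.append("output_format must be one of: " + ", ".join(OUTPUT_FORMATS))
        return found


if __name__ == "__main__":
    # Placeholder values, not real resources.
    settings = DocumentIntelligenceSettings(
        blob_location="my-container/invoices",
        service_endpoint="https://example.cognitiveservices.azure.com/",
        output_format="Markdown",
    )
    print(settings.problems())
```

Checking the values up front like this is a cheap way to catch a mistyped endpoint or an unsupported output format before a pipeline run.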

For those using Databricks as the destination and Azure Storage as the staging platform, the "Enable soft delete for blobs" option on the storage account must be deactivated for pipelines to run successfully.
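If you want to confirm the soft delete setting programmatically rather than in the Azure portal, a check along the following lines may help. It is a sketch under stated assumptions: the helper name is hypothetical, and it assumes the client behaves like `BlobServiceClient` from the `azure-storage-blob` package, whose `get_service_properties()` returns a mapping with a `delete_retention_policy` entry that exposes an `enabled` attribute. Verify this against the SDK version you have installed.

```python
def blob_soft_delete_enabled(service_client):
    """Return True if blob soft delete appears to be switched on.

    service_client is expected to behave like azure.storage.blob.BlobServiceClient:
    get_service_properties() returns a mapping whose "delete_retention_policy"
    entry has an `enabled` attribute.
    """
    properties = service_client.get_service_properties()
    policy = properties.get("delete_retention_policy")
    return bool(policy is not None and getattr(policy, "enabled", False))


# Hypothetical usage against a real account (requires `pip install azure-storage-blob`):
#
#   from azure.storage.blob import BlobServiceClient
#   client = BlobServiceClient.from_connection_string("<your-connection-string>")
#   if blob_soft_delete_enabled(client):
#       print("Soft delete is on; turn it off before running the pipeline.")
```

The helper only reads the setting. Turning soft delete off is a change to the storage account itself and is best made deliberately, for example through the portal.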

Matillion and The Data Productivity Cloud: An overview

Matillion is a data pipeline platform designed to help data teams build and manage pipelines faster and at scale for AI and analytics. Bridging productivity and innovation, Matillion provides a UI packed with pre-built components while retaining the flexibility to code in SQL, Python, or dbt. With support for asynchronous workflows through first-class Git integration and AI-generated documentation, Matillion democratizes access to AI, accommodating users of varying coding skills through its code-optional environment.

Integrating with hyperscalers, CDPs, LLMs, and more, Matillion's capabilities extend across hybrid SaaS deployments, vector store connectivity, and reverse ETL for AI-derived insights. Its pushdown ELT strategy harnesses the processing power of cloud data platforms to handle the scale and complexity of pipeline orchestration.

Matillion's Azure Document Intelligence component is a testament to the platform's commitment to innovation. It simplifies the workflow for businesses that rely heavily on document data, reducing manual labor costs, improving data quality, and supporting compliance, all while offering AI integration and collaboration tools for data teams.

Not using Matillion yet? Check out a 2-week free trial today!

Lucy Parker

Placement Student - Product Analyst
