- Blog
- 10.15.2024
- Product, Data Fundamentals
Analyzing unstructured text in PDFs

The proliferation of unstructured data has historically been a problem for data engineers and architects, but it's one we're working toward leaving in the past. Generative AI has emerged as a beacon of hope and innovation, allowing organizations to read blocks of text that were previously opaque to automation.
Imagine the wealth of insights buried within financial reports and strategic documents! Due to their complex, unstructured nature, these documents have remained largely untapped. Generative AI technologies are rewriting this narrative, allowing us to unlock their potential and turn them into actionable insights.
This article will walk you through using the Matillion Data Productivity Cloud to analyze unstructured text in PDFs. Starting with financial reports that are publicly available from Uber* and Lyft**, I will extract the text from the PDFs and load that text into a Snowflake table using a cloud-based text extraction component. Finally, I will use Generative AI to identify key financial metrics from the reports.
Below is the sample ETL pipeline we will be unpacking.
This pipeline consists of:
- Convert_pdf_w_textract, an orchestration pipeline running under an iterator. It creates a separate task for each company and file name and converts the PDF file using AWS Textract.
- Transform_financial_metrics, a transformation pipeline that takes the results of the OpenAI component and flattens the output into a structured table.
* Uber financials press release https://s23.q4cdn.com/407969754/files/doc_earnings/2024/q1/earnings-result/Uber-Q1-24-Earnings-Press-Release.pdf
**Lyft financial press release https://s27.q4cdn.com/263799617/files/doc_financials/2024/q1/Lyft-1Q24-Earnings-Press-Release.pdf
Let’s dive into the details.
Convert_pdf_w_textract
First, I defined the following variables so that I can make this pipeline dynamic and process multiple financial reports at the same time:
- pv_company_name
- pv_file_name
From there, I used a fixed iterator to set the variables' values:
These values will be passed into the pipeline by setting the ‘Set Scalar Variables’ property. Note that the same variables must also be defined inside the “convert_pdf_w_textract” pipeline.
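As a plain-Python sketch (not Matillion code), the fixed iterator is equivalent to looping over the two (company, file) pairs and invoking the child pipeline for each; `run_convert_pdf_w_textract` below is a hypothetical stand-in for the orchestration pipeline:

```python
# Illustrative sketch of the fixed iterator: invoke the child pipeline
# once per (company, file) pair.

def run_convert_pdf_w_textract(pv_company_name: str, pv_file_name: str) -> None:
    # Placeholder for the convert_pdf_w_textract orchestration pipeline.
    print(f"Converting {pv_file_name} for {pv_company_name}")

# The values the fixed iterator supplies in this example.
pdf_reports = [
    ("Uber", "Uber-Q1-24-Earnings-Press-Release.pdf"),
    ("Lyft", "Lyft-1Q24-Earnings-Press-Release.pdf"),
]

for pv_company_name, pv_file_name in pdf_reports:
    # Equivalent of the 'Set Scalar Variables' property passing values in.
    run_convert_pdf_w_textract(pv_company_name, pv_file_name)
```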
The Textract component properties require the S3 bucket and filename of the PDF file and the table name of the resulting target table. In this example, the values of the predefined variables pv_file_name and pv_company_name are passed into the pipeline at run time.
Next, we will configure the OpenAI component to extract the financial metrics.
I used the GPT-3.5-turbo model with a Temperature setting of 1. Temperature, as defined by OpenAI, is a value between 0 and 2 that controls how creative the model's output should be. I also set Top P to 1 and N to 1. I tested different models and different parameter values, and this combination yielded the best results for these sample PDF documents.
Source:
Next, in the source properties, point to the table using the predefined variable. The column returned by Textract is called ‘data’.
Configure:
In the configure properties, set the User Context. This is the context that I set:
You are a financial analyst reviewing the financial highlights of press releases of various companies. You need to extract certain key information from the financial quarterly reports. The financial press release has been prepared into a text array. All information in the press release is contained within this text array.
Set the Output Format to JSON.
Lastly, define your output JSON elements and give them additional context. This is what you are asking the Large Language Model (LLM) to provide using the text in your input. For this example, I requested the company name and date of the financial report and then some common financial metrics you would find in a financial report.
Transform_financial_metrics
After the OpenAI component runs, it will create a table with the resulting JSON. The transformation pipeline then unpacks the JSON and creates a table with the model output.
This transformation pipeline uses a Unite component to union the resulting tables, each containing the raw data from the LLM corresponding to each company's PDF Financial Report. The Extract Nested Data component flattens the JSON output into separate columns. The Rename component is used to reorder the columns so that the company name and report date are at the top and the financial metrics order makes sense in the table.
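The same three steps can be sketched in pandas as a stand-in for the transformation pipeline. This is only an illustration of the logic, not Matillion code; the Uber row echoes the JSON shown later in the article, and the Lyft row uses explicit placeholder values:

```python
# Pandas stand-in for the Unite + Extract Nested Data + Rename steps.
import json

import pandas as pd

# Each per-company table holds one JSON document in a 'data' column.
uber_raw = pd.DataFrame({"data": [json.dumps({
    "company_name": "Uber Technologies, Inc.",
    "financial_report_date": "May 8, 2024",
    "total_revenue_amount": "10.1 billion",
})]})
lyft_raw = pd.DataFrame({"data": [json.dumps({
    "company_name": "Lyft, Inc.",
    "financial_report_date": "<placeholder>",
    "total_revenue_amount": "<placeholder>",
})]})

# Unite: union the per-company result tables.
combined = pd.concat([uber_raw, lyft_raw], ignore_index=True)

# Extract Nested Data: flatten the JSON into one column per element.
flat = pd.json_normalize(combined["data"].map(json.loads).tolist())

# Rename (reorder): company name and report date first, metrics after.
leading = ["company_name", "financial_report_date"]
flat = flat[leading + [c for c in flat.columns if c not in leading]]
```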
Sample data
First, I used the Uber and Lyft financial results as my sample PDF files. Using the Textract component, each PDF was converted to raw unstructured text in a variant column. This is a sample of the Uber PDF.
And these are the results of the Textract component:
Next, I used the OpenAI component, asking the model to identify company and date information along with key financial metrics from the financial results that I provided.
Here's the result:
{
"company_name": "Uber Technologies, Inc.",
"financial_report_date": "May 8, 2024",
"gross_earnings_amount": "37.7 billion",
"net_cash_amount": "1.4 billion",
"quarter_end_date": "March 31, 2024",
"total_adjusted_ebitda": "1.4 billion",
"total_net_loss_amount": "654 million",
"total_operating_income": "172 million",
"total_revenue_amount": "10.1 billion"
}
Lastly, I flattened the JSON into a structured table.
Conclusion
This example demonstrated how to quickly extract key financial metrics from financial results contained in PDF documents. The example pipeline used the Amazon Textract Input component to read the PDFs and convert them to text, then used the OpenAI Prompt component to send the resulting text to an OpenAI LLM, which evaluated the text and extracted key financial metrics.
This shows how easily Matillion’s Data Productivity Cloud can process PDF files and summarize or surface key details using GenAI components, helping teams speed up their analysis of data that resides in unstructured text formats.
Angie Hastings
Senior Sales Engineer