The ability to generate a concise summary from pages of text is one of the built-in capabilities of large language models like GPT. This post explains how to summarize text using Azure OpenAI.
The data for this post is sourced from Form 10-Q PDF reports. Form 10-Q is a comprehensive report of financial performance that all public companies must submit quarterly to the Securities and Exchange Commission (SEC). A typical Form 10-Q looks like this. It’s usually over 50 pages long and is publicly available from the investor section of the company’s website. The sections that will be summarized are:
- Item 2 – Management Analysis
- Item 3 – Risk Disclosures
- Item 4 – Controls and Procedures
The summarized output looks like the highlighted text.
This is how it works:
- Form 10-Q PDFs are uploaded to a storage container. Each PDF needs to be chunked into sections so that the text length stays within the token limits of the GPT model. Since Form 10-Q reports follow a repeatable template, a custom Form Recognizer model has been trained to extract the data points of interest, namely the item2, item3 and item4 sections. An Event Grid-triggered Logic App takes the PDF file as input and calls the custom Form Recognizer model.
- The custom Form Recognizer model converts the PDF file into JSON that contains the data points of interest. The Logic App then writes this JSON output to a storage container.
- A Synapse Spark notebook using the SynapseML library calls the Azure OpenAI GPT API to convert the text prompt into a summary completion of 100 words. This notebook is called by a Synapse pipeline built using the ELT Framework.
- The notebook writes the summarized data to a storage container. A Synapse Serverless SQL view surfaces this data to a Power BI report.
- The summarized data is also used to populate a Cognitive Search Index for future use, possibly with an Azure OpenAI skillset.
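To make the prompt-building step more concrete, here is a minimal sketch in plain Python of how the extracted sections could be turned into summarization prompts that stay within a token budget. It is not the code from the notebook. The field names (`item2`, `item3`, `item4`), the prompt wording, the 3000-token budget and the rule of thumb of about 4 characters per token are all assumptions made for illustration. The sketch does not call Azure OpenAI; it only prepares the prompt strings that a notebook could then send to the completion API.

```python
from typing import Dict

# Assumption: roughly 4 characters per token for English text. This is a
# common rule of thumb, not the model's real tokenizer.
CHARS_PER_TOKEN = 4

# Hypothetical field names for the three extracted sections.
SECTION_FIELDS = ("item2", "item3", "item4")


def truncate_to_token_budget(text: str, max_tokens: int) -> str:
    """Trim text so its estimated token count stays within max_tokens."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    # Cut at the last space before the limit so a word is not split in half.
    cut = text.rfind(" ", 0, max_chars)
    return text[: cut if cut > 0 else max_chars]


def build_summary_prompts(
    extracted: Dict[str, str],
    max_prompt_tokens: int = 3000,  # assumed budget, adjust to the model used
    summary_words: int = 100,
) -> Dict[str, str]:
    """Build one summarization prompt per non-empty extracted section."""
    prompts = {}
    for field in SECTION_FIELDS:
        text = extracted.get(field, "").strip()
        if not text:
            continue  # skip sections that were not extracted
        body = truncate_to_token_budget(text, max_prompt_tokens)
        prompts[field] = (
            f"Summarize the following text in about {summary_words} words:"
            f"\n\n{body}\n\nSummary:"
        )
    return prompts


# Example with made-up section text standing in for the extracted JSON.
sample = {
    "item2": "Revenue increased compared to the prior quarter.",
    "item3": "",
    "item4": "Disclosure controls were evaluated and found effective.",
}
for name, prompt in build_summary_prompts(sample).items():
    print(name, len(prompt))
```

A character-based estimate keeps the sketch free of dependencies. Where the exact limit matters, counting tokens with the model's own tokenizer would be more reliable.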
Here are the repos used for this solution:
- Resource Provisioning – iac-aiml-platform and iac-synapse-dataplatform
- Logic App to analyze documents with the custom Form Recognizer model – logicapp-formrecognizer
- Synapse Pipeline – pipeline/Ingest_AIML_CustomInferenceFile.json from synapse-dataplatform
- Synapse Notebook – notebook/L1Transform-SEC-Form10Q.json from synapse-dataplatform
- Framework – ELT Framework