Summarization using Azure OpenAI

The ability to generate a concise summary from pages of text is one of the built-in capabilities of large language models such as GPT. This post explains how to summarize text using Azure OpenAI.

The data for this post is sourced from Form 10-Q PDF reports. Form 10-Q is a comprehensive report of financial performance that public companies must submit quarterly to the Securities and Exchange Commission (SEC). A typical Form 10-Q looks like this. It is usually over 50 pages long and is publicly available from the investor section of the company’s website. The sections that will be summarized are:

  • Item 2 – Management's Discussion and Analysis
  • Item 3 – Quantitative and Qualitative Disclosures About Market Risk
  • Item 4 – Controls and Procedures

The summarized output looks like the highlighted text:

Form 10-Q Summary


This is how it works:

Architecture - Form 10-Q Summarization using Azure OpenAI
  1. Form 10-Q PDFs are uploaded to a storage container. Each PDF needs to be split into sections so that the text length stays within the token limit of the GPT model. Since Form 10-Q reports follow a repeatable template, a custom Form Recognizer model has been trained to extract the data points of interest, namely the Item 2, Item 3 and Item 4 sections. An Event Grid-triggered Logic App takes the PDF file as input and calls the custom Form Recognizer model.
  2. The custom Form Recognizer model converts the PDF file into JSON that contains the data points of interest. The Logic App then writes this JSON output to a storage container.
  3. A Synapse Spark notebook uses the SynapseML library to call the Azure OpenAI GPT API, which converts each text prompt into a summary completion of 100 words. This notebook is called by a Synapse pipeline built using an ELT framework.
  4. The notebook writes the summarized data to a storage container. A Synapse Serverless SQL view surfaces this data to a Power BI report.
  5. The summarized data is also used to populate a Cognitive Search index for future use, possibly with an Azure OpenAI skillset.
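
To make steps 1 to 3 more concrete, here is a minimal Python sketch of the text preparation that would happen before the GPT call: keeping each piece of text under a token budget and wrapping it in a summarization prompt. It is not the code from the actual notebook. The helper names (`estimate_tokens`, `chunk_text`, `build_prompt`), the 4-characters-per-token rule of thumb, the 2000-token budget, the prompt wording, and the shape of the `extracted` dictionary are all assumptions made for illustration.

```python
import re
from typing import List

# Assumption: roughly 4 characters per token for English text.
# A real tokenizer would give exact counts; this is only a rule of thumb.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def chunk_text(text: str, max_tokens: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens (estimated).

    A single sentence longer than the budget is kept as its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompt(section_name: str, section_text: str, word_limit: int = 100) -> str:
    """Wrap a section's text in a summarization instruction (hypothetical wording)."""
    return (
        f"Summarize the following {section_name} section of a Form 10-Q "
        f"in at most {word_limit} words.\n\n{section_text}\n\nSummary:"
    )


# Hypothetical shape of the extracted JSON: one text field per section.
# The values are placeholders, not real filing text.
extracted = {
    "item2": "Placeholder text for the management analysis section.",
    "item3": "Placeholder text for the risk disclosures section.",
    "item4": "Placeholder text for the controls and procedures section.",
}

# One prompt per chunk; 2000 tokens is an assumed budget, not a documented limit.
prompts = [
    build_prompt(name, chunk)
    for name, text in extracted.items()
    for chunk in chunk_text(text, max_tokens=2000)
]
```

In a Spark notebook, a list like `prompts` would typically become a DataFrame column that is passed to the completion call. When a section yields more than one chunk, the per-chunk summaries would need to be combined, for example by summarizing them once more.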

GitHub Repos

Here are the repos used for this solution