Summarization using Azure OpenAI

The ability to generate a concise summary from pages of text is one of the built-in capabilities of Large Language Models like GPT. This post explains how to summarize text using Azure OpenAI.

The data for this post is sourced from Form 10-Q PDF reports. Form 10-Q is a comprehensive report of financial performance that must be submitted quarterly by all public companies to the Securities and Exchange Commission (SEC). A typical Form 10-Q is usually over 50 pages and is publicly available from the investor section of the company's website. The sections that will be summarized are:

  • Item 2 – Management's Discussion and Analysis
  • Item 3 – Quantitative and Qualitative Disclosures About Market Risk
  • Item 4 – Controls and Procedures

The summarized output looks like the highlighted text in the screenshot below.

Form 10-Q Summary


This is how it works:

Architecture - Form 10-Q Summarization using Azure OpenAI
  1. Form 10-Q PDFs are uploaded to a storage container. Each PDF needs to be chunked into several sections to ensure that the text length stays within the token limits of the GPT model. Since Form 10-Q reports follow a repeatable template, a custom Form Recognizer model has been trained to extract the data points of interest, namely the Item 2, Item 3 and Item 4 sections. An Event Grid-triggered Logic App takes the PDF file as input and calls the custom Form Recognizer model.
  2. The custom Form Recognizer model converts the PDF file into a JSON document containing the data points of interest. The Logic App then writes this JSON output to a storage container.
  3. A Synapse Spark notebook using the SynapseML library calls the Azure OpenAI GPT API to convert the text prompt into a summary completion of 100 words. This notebook is called by a Synapse pipeline built using an ELT framework.
  4. The notebook writes the summarized data to a storage container. A Synapse Serverless SQL view surfaces this data to a Power BI report.
  5. The summarized data is also used to populate a Cognitive Search index for future use, possibly with an Azure OpenAI skillset.
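The chunking in step 1 can be sketched as a simple character-budget split. This is a hypothetical helper, not the pipeline's actual code: the real solution relies on the Form Recognizer sections, and the ~4-characters-per-token figure is a rough rule of thumb, not the model's real tokenizer.

```python
def chunk_text(text, max_tokens=2000, chars_per_token=4):
    """Split text into paragraph-aligned chunks that stay within a
    rough token budget (estimated at ~4 characters per token)."""
    max_chars = max_tokens * chars_per_token
    chunks, current, length = [], [], 0
    for paragraph in text.split("\n\n"):
        # Flush the current chunk if adding this paragraph would exceed the budget.
        if length + len(paragraph) > max_chars and current:
            chunks.append("\n\n".join(current))
            current, length = [], 0
        current.append(paragraph)
        length += len(paragraph) + 2  # account for the paragraph separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be sent to the GPT API independently and the partial summaries combined.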

GitHub Repos

Here are the repos used for this solution:

PySpark Common Transforms

Quite often I come across transformations that are applicable to several scenarios, so I created this reusable Python class that leverages PySpark capabilities to apply common transformations to a DataFrame or a subset of its columns. The code is in GitHub – bennyaustin/pyspark-utils. There is also an extensive function reference and usage document to go with it. Feel free to use, extend, request features and contribute.
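The idea behind the class can be sketched like this, using plain Python dicts in place of a Spark DataFrame. The names here are hypothetical and the real repo works on PySpark DataFrames; only the chaining pattern is what's being illustrated.

```python
class CommonTransforms:
    """Minimal sketch of a reusable-transforms class: each method
    mutates the wrapped rows and returns self so calls can be chained,
    much like PySpark's DataFrame.transform."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts standing in for a DataFrame

    def trim(self, columns):
        # Strip leading/trailing whitespace from the given string columns.
        for row in self.rows:
            for col in columns:
                if isinstance(row.get(col), str):
                    row[col] = row[col].strip()
        return self

    def drop_null(self, column):
        # Keep only rows where the column has a value.
        self.rows = [r for r in self.rows if r.get(column) is not None]
        return self
```

Usage: `CommonTransforms(rows).trim(["name"]).drop_null("name").rows` applies both transforms in one pass over the chain.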

Continue reading

Upsert to Azure Synapse Analytics using PySpark

At the moment, the SQL MERGE operation is not available in Azure Synapse Analytics. However, it is possible to implement this feature using the Azure Synapse Analytics connector in Databricks with some PySpark code.
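The semantics being emulated are the classic MERGE ones: staged rows update matching target rows and are inserted when no match exists. A minimal in-memory sketch of those semantics (hypothetical helper; the actual post stages data via the connector and PySpark):

```python
def upsert(target, staging, key):
    """MERGE-style upsert: rows in staging overwrite target rows with the
    same business key (WHEN MATCHED THEN UPDATE) and are appended when
    no match exists (WHEN NOT MATCHED THEN INSERT)."""
    merged = {row[key]: row for row in target}
    for row in staging:
        merged[row[key]] = row  # update if matched, insert otherwise
    return list(merged.values())
```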

Continue reading

Time Zone Conversions in PySpark

PySpark has built-in functions to shift time between time zones. You just need to follow a simple rule: first convert the timestamp from the origin time zone to UTC, which serves as the point of reference; then convert the timestamp from UTC to the required time zone. This way there is no need to maintain lookup tables, and it's a generic method to convert time between time zones, even for those that require a daylight saving offset.
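In PySpark the two steps map to `to_utc_timestamp` and `from_utc_timestamp`; the same two-step rule can be shown in plain Python with the standard-library `zoneinfo` module:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def shift_timezone(ts: datetime, origin: str, target: str) -> datetime:
    # Step 1: interpret the naive timestamp in its origin zone and convert to UTC.
    utc = ts.replace(tzinfo=ZoneInfo(origin)).astimezone(ZoneInfo("UTC"))
    # Step 2: convert from UTC to the target zone; DST offsets are applied
    # automatically from the tz database, so no lookup tables are needed.
    return utc.astimezone(ZoneInfo(target))
```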

Continue reading

HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

Parquet is my preferred format for storing files in the data lake. Parquet's columnar storage and compression make it very efficient for in-memory processing tasks like Spark/Databricks notebooks while saving cost on storage. Parquet also supports almost all common encoding schemes. Perhaps the coolest thing about Parquet is that, unlike CSV, there is no such thing as a column/row separator, so there is no need to escape those characters when they appear in the data.

Azure SQL Data Warehouse supports the Parquet data format for external (PolyBase) tables. External tables reference the underlying storage blobs and give the option to query the data lake using SQL. In fact, this is recommended in Microsoft's reference architecture. With some Parquet files, this error gets thrown when the external table is queried:

HdfsBridge::recordReaderFillBuffer – Unexpected error encountered filling record reader buffer: IllegalArgumentException: Must be 12 bytes

In the absence of a clear exception message, it took a while to figure this out. This error usually happens on a Timestamp column, specifically when the data is in yyyy/MM/dd hh:mm:ss format. For some reason, SQL Data Warehouse expects the Timestamp data to be in yyyy-MM-dd hh:mm:ss format. Changing the date separator from / to - resolved the issue, although it must be mentioned that the underlying file is a perfectly valid Parquet file.
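The fix amounts to normalizing the timestamp strings before the Parquet file is written. A minimal sketch of that preprocessing step (a hypothetical helper, assuming the timestamps arrive as slash-separated strings):

```python
from datetime import datetime

def normalize_timestamp(ts: str) -> str:
    """Parse a 'yyyy/MM/dd hh:mm:ss' string and re-emit it as
    'yyyy-MM-dd hh:mm:ss', the form SQL Data Warehouse expects."""
    return datetime.strptime(ts, "%Y/%m/%d %H:%M:%S").strftime("%Y-%m-%d %H:%M:%S")
```

Going through `strptime` rather than a bare string replace also validates that the value really is a timestamp.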

Kaggle: TalkingData

A brief retrospective of my submission for the Kaggle data science competition that predicts the gender and age group of a smartphone user based on their usage patterns.

Continue reading

Kaggle: Grupo Bimbo

A brief retrospective of my submission for the Kaggle data science competition that forecasts inventory demand for Grupo Bimbo.

Continue reading

Common Type 2 SCD Anti-patterns

A Slowly Changing Dimension (SCD) is great for tracking historical changes to dimension attributes. SCDs have evolved over the years, and besides the conventional type 1 (update), type 2 (add row) and type 3 (add column), there are now extensions up to type 7, including type 0. Almost every DW/BI project has at least a few type 2 dimensions, where a change to an attribute causes the current dimension record to be end-dated and creates a new record with the new value.
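The end-date-and-insert mechanics of a type 2 change can be sketched in a few lines. This is an illustrative in-memory version (hypothetical field names; real implementations typically also carry surrogate keys and a current-record flag):

```python
from datetime import date

def apply_scd2(dimension, business_key, new_attributes, change_date):
    """Type 2 change: end-date the current record for the business key,
    then append a new current record carrying the changed attributes."""
    for row in dimension:
        if row["key"] == business_key and row["end_date"] is None:
            row["end_date"] = change_date  # close out the current record
    dimension.append({
        "key": business_key,
        **new_attributes,
        "start_date": change_date,
        "end_date": None,  # None marks the new current record
    })
    return dimension
```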

Continue Reading

Forecasting Exchange Rates Using R Time Series

A time series is the historical representation of data points collected at periodic intervals of time. Statistical tools like R use forecasting models to analyse historical time series data and predict future values with reasonable accuracy. In this post I will be using R time series to forecast the exchange rate of the Australian dollar, using the daily closing rate of the dollar collected over a period of two years.
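The post itself works in R, but the core idea of extrapolating a level from history can be illustrated with one of the simplest forecasting models, simple exponential smoothing, shown here in Python (an illustrative stand-in, not the model used in the post):

```python
def ses_forecast(series, alpha=0.3):
    """Simple exponential smoothing: each new level is a weighted blend
    of the latest observation and the previous level; the one-step-ahead
    forecast is the final smoothed level."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level
```

The smoothing factor `alpha` controls how much weight recent observations get relative to older history.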

Continue reading

Energy Rating Analysis of Air conditioners using R Decision Trees

A decision tree is a data mining model that graphically represents the parameters that are most likely to influence the outcome, and the extent of their influence. The output is similar to a tree/flowchart with nodes, branches and leaves: the nodes represent the parameters, the branches represent the classification question/decision, and the leaves represent the outcome (Screen Capture 1). Internally, the decision tree algorithm performs a recursive classification on the input dataset and assigns each record to the segment of the tree where it fits closest.
There are several packages in R that generate decision trees. For this post, I am using the ctree() function available in the party package. The data I am using as input is the energy rating of household air conditioners.
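The recursive-classification idea can be sketched as choosing, at each node, the split that makes the branches purest. This is a CART-style Gini split in Python for illustration only; `ctree()` in the party package actually uses conditional inference tests rather than Gini impurity:

```python
def gini(labels):
    """Gini impurity of a list of class labels: 0 means pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Find the threshold on one numeric feature that minimises the
    weighted Gini impurity of the two branches; a tree grows by applying
    this recursively at each node."""
    best = (None, float("inf"))
    n = len(labels)
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best
```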

Continue Reading