Documents can contain table data. For example, earning reports, purchase order forms, technical and operational manuals, etc., contain critical data in tables. You may need to extract this table data into Excel for various scenarios.
Extract each table into a specific worksheet in Excel.
Extract the data from all the similar tables and aggregate that data into a single table.
Here, we present two ways to generate Excel from a document's table data:
Azure Function (HTTP Trigger based): This function takes a document and generates an Excel file with the table data in the document.
Apache Spark in Azure Synapse Analytics (in case you need to process large volumes of documents).
The Azure function extracts table data from the document using Form Recognizer's "General Document" model and generates an Excel file with all the extracted tables. The following is the expected behavior:
Each table on a page gets extracted and stored to a sheet in the Excel document. The sheet name corresponds to the page number in the document.
Sometimes, there are key-value pairs on the page that need to be captured in the table. If you need that feature, leverage the add_key_value_pairs flag in the function.
Form Recognizer extracts column and row spans, and we take advantage of this to present the data as it is represented in the actual table.
Following are two sample extractions.
Top excel is with key value pairs added to the table. Bottom one is without the key value pairs.
The Excel shown above is the extraction of table data from an earnings report. The earnings report file had multiple pages with tables, and the fourth page had two tables.
Azure Function and Synapse Spark Notebook is available here in this GIT Repository