Documents can contain table data. For example, earning reports, purchase order forms, technical and operational manuals, etc., contain critical data in tables. You may need to extract this table data into Excel for various scenarios.
- Extract each table into a specific worksheet in Excel.
- Extract the data from all the similar tables and aggregate that data into a single table.
Here, we present two ways to generate Excel from a document's table data:
- Azure Function (HTTP Trigger based): This function takes a document and generates an Excel file with the table data in the document.
- Apache Spark in Azure Synapse Analytics (in case you need to process large volumes of documents).
The Azure function extracts table data from the document using Form Recognizer's "General Document" model and generates an Excel file with all the extracted tables. The following is the expected behavior:
- Each table on a page gets extracted and stored to a sheet in the Excel document. The sheet name corresponds to the page number in the document.
- Sometimes, there are key-value pairs on the page that need to be captured in the table. If you need that feature, leverage the add_key_value_pairs flag in the function.
- Form Recognizer extracts column and row spans, and we take advantage of this to present the data as it is represented in the actual table.
Following are two sample extractions.
Top excel is with key value pairs added to the table. Bottom one is without the key value pairs.
The Excel shown above is the extraction of table data from an earnings report. The earnings report file had multiple pages with tables, and the fourth page had two tables.
Solution
Azure Function and Synapse Spark Notebook is available here in this GIT Repository
- Deployment Steps
- Sample Data: The repository has two sample documents to work with:
- PurchaseOrderForm.pdf . This document has key-value pairs for each table that makes sense to capture as part of the table.
- PurchaseOrderForm-WithKeyValuePairs.xlsx --> This is the output with addkeyvaluepairs set to True.
- PurchaseOrderForm-WithKeyValuePairs.xlsx --> This is the output with addkeyvaluepairs set to False.
- earningreport.pdf . This document does not have key-value pairs for each table that makes sense to capture.
- earningreport.xlsx -> This is the output with addkeyvaluepairs set to False.
- PurchaseOrderForm.pdf . This document has key-value pairs for each table that makes sense to capture as part of the table.
- Note on the Excel output:
- If there is a page in the main document with no tables, no sheet will be created for that page.
- The code has been updated to remove the extracted text from check boxes (":selected:", ":unselected:") in the table.
- If a cell does not have any alphanumeric text, it will be skipped. Please update the code to reflect different behavior.
How to leverage this Solution
- Use this solution to generate an Excel file as mentioned above.
- Integrate this with Power Automate so that end-users can use this seamlessly from O365 (email, SharePoint, or Teams).
- Customize this to generate an aggregated table.
Contributors: Ben Ufuk Tezcan, Vinod Kurpad, Matt Nelson, Nicolas Uthurriague , Sreedhar Mallangi