Understanding Document Hierarchical Structure with Azure Form Recognizer and XML Format Converter

chulahlou · Jun 12 2023

Azure Form Recognizer is a cloud-based Azure Applied AI Service that provides machine-learning models to extract key-value pairs, text, and tables from documents. It is designed to enhance data-driven strategies and enrich document search capabilities without requiring heavy manual intervention or extensive data science expertise. In addition, Azure Form Recognizer's Layout and General Document models can extract a document's hierarchical structure. This lets users organize, manage, further process, and retrieve insights from large volumes of documents more easily.

As detailed in the product documentation, Azure Form Recognizer returns structured JSON output from document processing. Because Form Recognizer is an API, integrating its response into a downstream application may require transforming that response into another format, such as XML. This post walks through a shared sample that converts the JSON response to XML for integration scenarios that require XML input. The sample lives in the GitHub repository chulahlou/form-recognizer-xml-format.

Using the XML Converter

Starting with the script fr_xml.py, we will use the Form Recognizer Python SDK to analyze a sample document and transform the resulting response to XML.

1. Place the sample document into the docs/ directory.
2. Update the Python script fr_xml.py with the following information:
   a. Your Azure Form Recognizer resource endpoint and key
   b. File name – name of the document you are trying to process
3. Run the Python script fr_xml.py.

   For this run, we are using the following model and API version from the Azure Form Recognizer service:

   "api_version": "2022-08-31"

   "model_id": "prebuilt-document"

4. Find the results in JSON and XML formats in the docs/ directory.
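To make the analysis step above concrete, here is a minimal sketch of calling the service with the azure-ai-formrecognizer Python SDK and saving the raw result as JSON. It is not the code from fr_xml.py: the function names, the output naming scheme, and the sample file name in the usage note are assumptions made for illustration.

```python
import json
from pathlib import Path


def output_paths(doc_path):
    """Return (json_path, xml_path) placed next to the input document.

    Hypothetical naming scheme; the sample script may name its outputs differently.
    """
    doc = Path(doc_path)
    return doc.with_suffix(".json"), doc.with_suffix(".xml")


def analyze_to_json(doc_path, endpoint, key):
    """Analyze one document with the prebuilt-document model and save the result."""
    # Imported lazily so output_paths() is usable without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(
        endpoint=endpoint,
        credential=AzureKeyCredential(key),
        api_version="2022-08-31",  # API version named in this post
    )
    with open(doc_path, "rb") as f:
        # "prebuilt-document" is the General Document model used in this post.
        poller = client.begin_analyze_document("prebuilt-document", document=f)
        result = poller.result()

    json_path, _ = output_paths(doc_path)
    # default=str is a safety net for any value json cannot serialize directly.
    json_path.write_text(
        json.dumps(result.to_dict(), indent=2, default=str), encoding="utf-8"
    )
    return json_path
```

A call might look like analyze_to_json("docs/sample.pdf", "https://<your-resource>.cognitiveservices.azure.com/", "<your-key>"), where sample.pdf stands in for whatever document you placed in docs/.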

Results

First, let's look at the output of General Document analysis on the sample PDF file in Form Recognizer Studio. The document's hierarchical structure is correctly identified and shown in the result's Text section:

Shown below is a side-by-side comparison of the JSON and XML outputs for the processed sample file. Azure Form Recognizer correctly identifies the title, section headings, paragraphs, and page number. With the format conversion script, the document's hierarchical structure is represented by nested, indented XML elements (2 levels are shown in this example).

Example: "title" role converted to "level-1 header" in XML

Example: "sectionHeading" role converted to "level-2 header" in XML
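The role-to-level mapping in the two examples above can be sketched with the standard library alone. This is an illustrative approach, not the repository's converter: the element names (document, section, header, paragraph), the level attribute, and the function name are all assumptions, and the input is a simplified list of paragraphs with "content" and an optional "role", mirroring the paragraphs array in the JSON response.

```python
import xml.etree.ElementTree as ET

# Role-to-depth mapping taken from the post's two examples.
HEADER_LEVELS = {"title": 1, "sectionHeading": 2}


def paragraphs_to_xml(paragraphs):
    """Nest each paragraph under the most recent header above it.

    Element and attribute names are hypothetical, chosen for illustration.
    """
    root = ET.Element("document")
    # Stack of (level, element) for the currently open containers.
    stack = [(0, root)]
    for para in paragraphs:
        role = para.get("role")
        text = para.get("content", "")
        level = HEADER_LEVELS.get(role)
        if level is not None:
            # Close sections at the same or a deeper level before opening a new one.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section", level=str(level))
            ET.SubElement(section, "header").text = text
            stack.append((level, section))
        else:
            # Body text or any other role goes into the innermost open section.
            leaf = ET.SubElement(stack[-1][1], "paragraph")
            if role:
                leaf.set("role", role)
            leaf.text = text
    ET.indent(root, space="  ")  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode")
```

Tracking the open sections on a stack keeps the nesting correct when a new section heading follows another one at the same level, and it would extend to deeper levels by adding entries to HEADER_LEVELS.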

As shown in this post, converting the JSON response to XML with a small post-processing step lets you use the content, structure, and field extraction capabilities of Form Recognizer in a variety of applications.
