Form Recognizer Solution Accelerator

stevedem (Microsoft)
Apr 04, 2022

The Form Recognizer Cognitive Service helps you quickly extract text and structure from documents. Custom document models can be trained to understand the specific structure and fields within your documents and extract valuable data. You can quickly train a custom document model that performs very well for your unique forms. Incorporating that model into an automated solution will be the focus of this article.

 

There are two common problems when incorporating scanned paper forms into a production-grade digital solution. The first problem is that documents are usually scanned as a stack of individual pages, which generates a multi-page file. This is efficient for the person scanning the documents, but it complicates downstream processing: individual forms are hard to reference, and tools designed for single-page input cannot be used directly. The second problem is that not everyone is comfortable working with REST APIs or developing a fully automated solution. Azure provides the tools necessary to overcome both of these problems.

 

This solution implements two capabilities that are commonly required when working with a trained custom document model:

 

  1. Splitting multi-page PDF documents into individual, single-page PDF documents
  2. Analyzing the results of documents sent to the Form Recognizer REST API endpoint of a trained custom document model

 

These capabilities are linked together in a cohesive solution using a pair of Logic Apps and an Azure Function.

 

It is important to note that the multi-page PDF documents referred to in this article contain multiple pages of the same form. In other words, the form layout and structure on Page 1 is identical to the form layout and structure on Page 2 (and so on). Each form is only one page in length.

 

In our fictitious demonstration scenario, we will be working with Volunteer Forms which contain custom fields for first name and last name as well as selection marks for each volunteer activity. Below is a screenshot of what the forms look like.

 

 

You can download all sample data here: https://aka.ms/form-reco-sample

 

There are two folders in the GitHub repository linked above:

 

  1. train – contains five individual PDF files (which you can use to train a custom document model in Form Recognizer)
  2. test – contains a single multi-page PDF file with four pages of filled out forms

 

Splitting PDFs

 

When you work with paper-based documents, scanning multiple pages into a single file is often the most efficient way to digitize them. The Form Recognizer service assumes a single document per file, so when multiple documents are scanned into a single file, you need to either split the file or analyze it by page ranges. This solution uses an Azure Function with open-source Python code to read the content of a multi-page PDF file and split it into individual, single-page PDF files. Wrapping this Azure Function in a Logic App lets you trigger the split whenever a new file is uploaded to a storage account and write each single-page PDF file back to the storage account.

 

 

Analyzing Results

 

The Form Recognizer service makes training a custom document model very easy. Here is a reference for how to train a custom model in the Form Recognizer studio. Once the model has completed training, it is immediately published and available to consume as a REST API endpoint. You can extract results from documents by sending the document data to the REST API endpoint which hosts your custom document model. Incorporating this machine learning model into an efficient and scalable solution can sometimes be challenging. This solution uses a Logic App to send single-page PDF document data to the published REST API endpoint of a trained custom document model and stores the resulting JSON data into a JSON file in a storage account.

 

 

Bringing it all Together

 

In our fictitious Volunteer Form example, let's say that we filled out a few sample, single-page PDF documents to train a Form Recognizer custom document model. We want to extract the key fields: first name, last name, and each selection mark for volunteer activity. Once the model was trained, blank forms were distributed to employees to fill out. Once all forms were filled out and gathered, they were placed into a digital scanner as a single stack of documents. The resulting multi-page PDF file has one page for each employee's form. To quickly extract the data from these forms, we can upload the file to the "multi" container of our storage account and, in a few minutes, have the extracted data from each form.

 

As the architecture below shows, this solution runs fully automatically once a multi-page PDF file is uploaded to a storage account container.

 

 

There are four containers in a storage account:

 

  1. train - intended for storing the training data for Form Recognizer
  2. multi - intended for storing multi-page PDF documents
  3. single - intended for storing single-page PDF documents
  4. results - which will contain the JSON results from the Form Recognizer API

 

There are two Logic Apps:

 

  1. Split PDFs - which splits multi-page PDFs to single-page PDFs
  2. Analyze Results - which sends single-page PDF data to the Form Recognizer API and stores the results of the trained custom document model to a JSON file in the “results” container.

 

The first Logic App is triggered when a multi-page PDF file is created in the “multi” container in a storage account. From there, the Logic App sends the PDF document to an Azure Function, which splits the document into individual pages. The Azure Function returns the PDF data to the Logic App, which then writes each single-page PDF document to the “single” container in the same storage account.
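The hand-off between the Azure Function and the Logic App can be sketched in Python. This is a minimal sketch, not the accelerator's actual function code: the field names individualPDFs, fileName and fileContent are taken from the Logic App expressions used later in this article, while the helper names, the "_page_N" naming scheme and the use of the third-party pypdf package are assumptions made for illustration.

```python
import base64
import io
import json


def split_pdf_pages(pdf_bytes):
    """Return one single-page PDF (as bytes) for each page of the input PDF."""
    # Assumption: the third-party pypdf package is installed. The accelerator's
    # own function may use a different open-source library.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(io.BytesIO(pdf_bytes))
    pages = []
    for page in reader.pages:
        writer = PdfWriter()
        writer.add_page(page)
        buffer = io.BytesIO()
        writer.write(buffer)
        pages.append(buffer.getvalue())
    return pages


def build_split_response(file_name, pages):
    """Build the JSON body the Logic App iterates over in its For each step."""
    # Hypothetical naming scheme: forms.pdf -> forms_page_1.pdf, forms_page_2.pdf, ...
    stem = file_name[:-4] if file_name.lower().endswith(".pdf") else file_name
    items = []
    for number, page_bytes in enumerate(pages, start=1):
        items.append({
            "fileName": f"{stem}_page_{number}.pdf",
            # Base64 text survives the JSON round trip; the Logic App decodes
            # it again with base64ToBinary before writing the blob.
            "fileContent": base64.b64encode(page_bytes).decode("ascii"),
        })
    return json.dumps({"individualPDFs": items})
```

Encoding each page as Base64 keeps the response valid JSON, which is why the Create blob step later decodes the content before writing it to the “single” container.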

 

From here, another Logic App triggers whenever a single-page PDF file is created in the “single” container. This Logic App sends the PDF data to the REST API endpoint of your trained custom document model. The results of the API call are returned and stored as JSON. The data in the JSON document contains your custom form fields and selection marks.
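For readers who prefer code to the designer, the analyze call can be sketched in Python. This is a rough sketch under stated assumptions: the URL pattern, API version, header name, request body and 10-second delay mirror the Logic App steps in this article, while the function names and the injected get_result callable are hypothetical. The check for a "failed" status is an addition; the Logic App as described only waits for "succeeded".

```python
import json
import time

API_VERSION = "2022-01-30-preview"  # version used in this article's HTTP step


def build_analyze_request(endpoint, key, model_id, sas_url):
    """Return (url, headers, body) for the POST that starts an analysis."""
    url = (
        f"{endpoint.rstrip('/')}/formrecognizer/documentModels/"
        f"{model_id}:analyze?api-version={API_VERSION}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    body = json.dumps({"urlSource": sas_url})
    return url, headers, body


def poll_until_succeeded(get_result, delay_seconds=10, max_attempts=30,
                         sleep=time.sleep):
    """Call get_result() until the returned status is 'succeeded'.

    get_result is a hypothetical callable that performs the GET against the
    Operation-Location URL and returns the parsed JSON response.
    """
    for _ in range(max_attempts):
        result = get_result()
        status = result.get("status")
        if status == "succeeded":
            return result
        if status == "failed":
            # Addition in this sketch: stop early instead of waiting it out.
            raise RuntimeError("Form Recognizer reported a failed analysis")
        sleep(delay_seconds)
    raise TimeoutError("analysis did not finish within the allowed attempts")
```

Passing get_result and sleep in as parameters keeps the sketch independent of any particular HTTP client and makes the loop easy to exercise without calling the service.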

 

Implementing this Solution

 

To implement this solution, go to the following GitHub repository and/or click the "Deploy to Azure" button.

 

stevedem/FormRecognizerAccelerator (github.com)

 

 

After clicking Deploy to Azure, you will be asked to fill out the following fields:

 

 

Once the deployment succeeds, you should see six resources in your resource group.

 

 

Please note that this will only deploy the core infrastructure. There are a few steps to configure each service once deployed.

 

Each configuration is divided into these sections:

 

  • Storage Account - create containers & upload data
  • Form Recognizer - train custom document model
  • Function App - deploy open-source Python code to split PDFs
  • Split PDFs Logic App - split multi-page PDF documents to single-page PDF documents
  • Analyze Results Logic App - send single-page PDF document data to REST API endpoint of trained custom document model

 

Storage Account - create containers & upload data

 

Navigate to the storage account that was created during resource deployment.

 

  

  1. Click Storage browser (preview)
  2. Click + Add container
  3. Enter "train" as the container name
  4. Repeat for the following containers: multi, single, results

 

Next, we will upload the training data. Download all five PDF files under the "train" folder of this GitHub repository: https://aka.ms/form-reco-sample

 

Then, navigate to the storage account that was created during resource deployment to upload these five PDF files.

 

 

  1. Click Storage browser (preview)
  2. Select the "train" container
  3. Click Upload
  4. Select all five training documents from the provided sample data (download sample data here: http://aka.ms/form-reco-sample)

 

Before continuing to the next section, please follow the steps to configure CORS (Cross-Origin Resource Sharing) on your storage account. CORS must be enabled on the storage account for it to be accessible from the Form Recognizer Studio. A detailed walkthrough can be found here: Configure CORS | Microsoft Docs

 

Form Recognizer - train custom document model

 

Navigate to the Form Recognizer Studio: FormRecognizerStudio (azure.com)

 

  • Scroll down and click Create new Custom model. 

 

 

  • Scroll down and click + Create a project, enter project name and click Continue.

 

 

  • Configure service resources by selecting the resource group and Form Recognizer service that were created during resource deployment. Then click Continue.

 

 

  • Connect the training data source by selecting the resource group, storage account and train container that were created during resource deployment. Then click Continue.

 

  • Finally, review all configurations and finish creating the project.

Once your project is open and you can see all five training documents, label and train a custom document model by following these steps.

 

 

  1. Create two field labels, one for the first name and one for the last name on the Volunteer Form.
  2. Create six selection mark labels, one for each volunteer activity.
  3. Label these fields in all five training documents, then click Train.

  • Enter a Model ID and make a note of its value (you will need it later). Set the Build Mode to Neural and click Train.

 

Function App - deploy open-source Python code to split PDFs

 

The next step is to deploy the open-source Python code to split PDFs as an Azure Function.

 

  1. Navigate to the Azure Portal and open up an Azure Cloud Shell session by clicking the icon at the top of the screen.
  2. Enter the following commands to clone the GitHub repository and publish the Azure Function, replacing <function-app-name> with the name of the Function App created during resource deployment.

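# Work in a new, clean directory so only the project files are published
mkdir volunteer
cd volunteer
# Clone the open-source PDF splitting function
git clone https://github.com/stevedem/splitpdfs.git
cd splitpdfs
# Publish the function code to your Function App
func azure functionapp publish <function-app-name>
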
After the command to publish the function has successfully completed, here is an example of the output.

 

 

Split PDFs Logic App - split multi-page PDF documents to single-page PDF documents

 

Next up is to build out the Logic App, which will call the Function App we just published.

 

  1. Add the Azure Blob Storage trigger When a blob is added or modified (properties only) (V2)

  2. Create a connection to the storage account created during resource deployment. You will need to retrieve the storage account access key. If you need help finding your access key, please refer to this documentation: Manage account access keys - Azure Storage | Microsoft Docs

  3. Select the connection you previously created and specify the following settings:
       Container: /multi
       Number of blobs to return: 1
       How often do you want to check for items: 1 Minute

  4. Add a new step: Get blob content (V2) with the following settings:
       Storage account name: Use connection settings
       Blob: List of Files Path (Dynamic Expression)
       Infer content type: Yes

  5. Add a new step: Azure Function with the following settings:
       Request Body:
         {
           "fileContent": "@body('Get_blob_content_(V2)')",
           "fileName": "@triggerBody()?['DisplayName']"
         }

  6. Add a new step: Control: For each with the following settings:
       Select an output from previous steps: @json(body('SplitPDFs'))['individualPDFs']

  7. Add an action (inside the For each loop): Create blob (V2) with the following settings:
       Storage account name: Use connection settings
       Folder path: /single
       Blob name: @{items('For_each')?['fileName']}
       Blob content: @base64ToBinary(items('For_each')?['fileContent'])

  8. Click Save. Your Logic App should now look like this:

 

 

 

 

Analyze Results Logic App - send single-page PDF document data to REST API endpoint of trained custom document model

 

The next step is to open the Logic App designer in the Azure portal for the second Logic App you deployed, which has "analyzeresults" in its name.

 

  1. Add the Azure Blob Storage trigger When a blob is added or modified (properties only) (V2) with the following settings:
       Storage account name: Use connection settings
       Container: /single
       Number of blobs to return: 1
       How often do you want to check for items: 1 Minute

  2. Add a new step: Create SAS URI by path (V2) with the following settings:
       Storage account name: Use connection settings
       Blob path: List of Files Path (Dynamic Expression)

  3. Add a new step: HTTP

     Please note that you will have to replace a few variables tailored to your resource deployment. Here are the variables you must replace:

       • FORM_RECO_URI: this is the Endpoint of your Form Recognizer service in Azure. You can find it on the Keys and Endpoint page of the Form Recognizer service that was created during resource deployment.
       • FORM_RECO_KEY: this is the Key used to access your Cognitive Service API. You can find it on the same Keys and Endpoint page of your Form Recognizer service.
       • MODEL_ID: this is the Model ID that you entered while training a custom document model in Form Recognizer Studio.

     Use the following settings:
       Method: POST
       URI: <FORM_RECO_URI>/formrecognizer/documentModels/<MODEL_ID>:analyze?api-version=2022-01-30-preview
       Headers: Ocp-Apim-Subscription-Key: <FORM_RECO_KEY>
       Body: {"urlSource": "@{body('Create_SAS_URI_by_path_(V2)')?['WebUrl']}"}

  4. Add a new step: Initialize variable with the following settings:
       Name: analyze-status
       Type: String
       Value: @{null}

  5. Add a new step: Until with the following settings:
       Condition: analyze-status is equal to succeeded

  6. Add an action (inside the Until loop): HTTP with the following settings:
       Method: GET
       URI: @{outputs('HTTP')['headers']['Operation-Location']}
       Headers: Ocp-Apim-Subscription-Key: <FORM_RECO_KEY>

  7. Add an action (inside the Until loop): Set variable
       Name: analyze-status
       Value: @{body('HTTP_2')['status']}

  8. Add an action (inside the Until loop): Delay
       Count: 10
       Unit: Second

  9. Add a new step: Create blob (V2)
       Storage account name: Use connection settings
       Folder path: /results
       Blob name: @{replace(triggerBody()?['DisplayName'], '.pdf', '.json')}
       Blob content: @body('HTTP_2')

  10. Click Save. Your Logic App should now look like this:

 

 

Verifying the Functionality

 

Navigate to the storage account created during resource deployment.

 

  1. Click storage browser
  2. Click the “multi” blob container
  3. Click the “Upload” button to select a multi-page PDF to upload

 

 

Wait a few minutes and check the status of the first Logic App, responsible for splitting PDFs. Then, navigate to the storage account created during resource deployment.

 

  1. Click the storage browser
  2. Click the “single” blob container
  3. Verify that each page of the multi-page PDF was saved as an individual file

 

 

Wait a few minutes and check the status of the second Logic App, responsible for analyzing results from single-page PDFs. Then, navigate to the storage account created during resource deployment.

 

  1. Click the storage browser
  2. Click the “results” blob container
  3. Verify that each single-page PDF file has a corresponding JSON file with the custom document model output 

 

After downloading and inspecting the volunteer form and corresponding JSON output, we can quickly see that our data was efficiently and accurately extracted using our Form Recognizer custom document model.
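If you would rather inspect the JSON output programmatically, a short Python sketch can pull out the labeled fields. It assumes the result follows the documents/fields layout of the 2022-01-30-preview analyze response (valueString for text fields, valueSelectionMark for selection marks); the field names FirstName, LastName, Cooking and Cleaning are hypothetical labels, so substitute the ones you created in Form Recognizer Studio.

```python
def extract_fields(result):
    """Flatten the first analyzed document's fields into a plain dict."""
    documents = result.get("analyzeResult", {}).get("documents", [])
    if not documents:
        return {}
    extracted = {}
    for name, field in documents[0].get("fields", {}).items():
        if field.get("type") == "selectionMark":
            # Selection marks become booleans: True when the box is ticked.
            extracted[name] = field.get("valueSelectionMark") == "selected"
        else:
            # Text fields: prefer the typed value, fall back to the raw content.
            extracted[name] = field.get("valueString", field.get("content"))
    return extracted


# Hypothetical, heavily trimmed result for a single volunteer form.
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "documents": [{
            "fields": {
                "FirstName": {"type": "string", "valueString": "Ada"},
                "LastName": {"type": "string", "content": "Lovelace"},
                "Cooking": {"type": "selectionMark", "valueSelectionMark": "selected"},
                "Cleaning": {"type": "selectionMark", "valueSelectionMark": "unselected"},
            }
        }]
    },
}

print(extract_fields(sample))
```

Each JSON file in the “results” container corresponds to one single-page form, so a helper like this could be applied to every file to collect one record per volunteer.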

 

 

Conclusion

 

In just a few steps we created an AI model and an orchestration pipeline. Now you can leverage this solution to extract valuable data from your multi-page PDF forms. You can check out the GitHub repository here, where this solution accelerator is hosted. In future iterations, we hope to further automate the deployment and configuration tasks.

 

stevedem/FormRecognizerAccelerator (github.com)

Updated Jan 25, 2024
Version 3.0
  • Hello Rameshnimmala 

     

    My guidance would be to ensure you are working with a new, clean directory so that only the necessary project files are included in the build process. Additionally, I would ensure you are using the correct Azure Functions Core Tools version.

    Check out this documentation link if you need help changing or upgrading your Azure Functions Core Tools version.

     

    Best,

     

    Steve

  • Rameshnimmala
    Copper Contributor

    Dear Steve,

    Thanks for the detailed information.

    I am trying to build this solution on my trial account, but unfortunately I am unable to publish the Azure Function from the Python code and I get the error below in Bash. Please advise me on how to resolve this error and proceed further. Thanks in advance.

     

     

  • alexus98
    Copper Contributor

    stevedem thanks for such a detailed response!

     

    At the end I would like to have a joint, single output file in which records from all the PDF files would be stored. Is it possible to implement such a solution in the Logic App? 

     

     

  • Hi alexus98 , thanks for the comment. 

    To answer your first question, I would recommend using a composite model. A composite model represents a collection of custom models and assigns them to a single model that encompasses your form types. During scoring, the composite model performs a classification step to determine which custom model the current form represents. You can then use the model ID of the composite model in this solution and let the Form Recognizer service determine which custom model to use. Check out more information here: https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/compose-custom-models

     

    To your second question, when you send document data to the REST API endpoint, an HTTP response is returned, which we capture in this solution as JSON. If you need to convert that JSON to a CSV file, you could attempt to do so with additional steps in the Analyze Results Logic App. There may be some complexity in parsing the JSON for the specific fields you want to retain in the CSV file. A quick search for “Logic App JSON to CSV” suggests it is possible, but it would require testing for your specific scenario. Check out more information here: https://stackoverflow.com/questions/48923628/logic-app-create-csv-table-from-json-output

     

    Thanks!

  • alexus98
    Copper Contributor

    Hi stevedem, I have a question regarding the tutorial. 

    1) In my solution I use 3 different custom forms (they vary slightly from each other), hence I trained 3 different models. I want to choose which model should the app use when I upload the data.

    2) I would like to have a single (preferably a csv) output file with data from all the split PDFs that were fed to the Form Recognizer.

     

    Is it possible to implement these two points into the solution you presented?