The Form Recognizer Cognitive Service helps you quickly extract text and structure from documents. Custom document models can be trained to understand the specific structure and fields within your documents and extract valuable data. You can quickly train a custom document model that performs very well for your unique forms. Incorporating that model into an automated solution will be the focus of this article.
There are two common problems when incorporating scanned paper forms into a production-grade digital solution. The first is that documents are usually scanned as a stack of individual pages, which generates a single multi-page file. This is efficient for the person scanning the documents, but it causes issues downstream, such as when you need to reference an individual form or use a solution designed to work on a single page. The second is that not everyone is comfortable working with REST APIs or developing a fully automated solution. Azure provides the tools necessary to overcome both of these problems.
This solution implements two capabilities that are commonly required when working with a trained custom document model:

- Splitting a multi-page PDF file into individual, single-page PDF files
- Analyzing each single-page PDF file with the trained custom document model and storing the extracted results as JSON
These capabilities are linked together in a cohesive solution using a pair of Logic Apps and an Azure Function.
It is important to note that the multi-page PDF document referenced throughout this article contains multiple pages of the same form. In other words, the form layout and structure on page 1 is identical to the form layout and structure on page 2 (and so on), and each form is exactly one page long.
In our fictitious demonstration scenario, we will be working with Volunteer Forms, which contain custom fields for first name and last name as well as selection marks for each volunteer activity. Below is a screenshot of what the forms look like.
You can download all sample data here: https://aka.ms/form-reco-sample
There are two folders in the GitHub repository linked above: a "train" folder with the sample documents used to train the custom model, and a folder with the multi-page PDF file used to test the solution.
Often, when you work with paper-based documents, scanning multiple pages into a single document is the most reasonable way to efficiently digitize these valuable documents. The Form Recognizer service assumes a single document per file and when you have multiple documents scanned into a single file, you will need to split the documents or analyze by page ranges. This solution uses an Azure Function with open-source Python code to read the content of a multi-page PDF file and split it into individual, single-page PDF files. Wrapping this Azure Function in a Logic App provides you with the ability to easily trigger the Logic App whenever a new file is uploaded to a storage account and write each of the single-page PDF files to a storage account.
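To make the splitting step concrete, here is a minimal sketch of the page-splitting logic, assuming the pypdf library; the actual code in the splitpdfs repository may differ:

```python
# Minimal sketch of splitting a multi-page PDF into single-page PDFs.
# Assumes the pypdf library; the actual splitpdfs implementation may differ.
from pypdf import PdfReader, PdfWriter

def split_pdf(input_path: str) -> list[str]:
    """Write one single-page PDF per page of the input and return the new paths."""
    reader = PdfReader(input_path)
    output_paths = []
    for index, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)  # copy exactly one page into a fresh document
        output_path = input_path.replace(".pdf", f"-{index + 1}.pdf")
        with open(output_path, "wb") as handle:
            writer.write(handle)
        output_paths.append(output_path)
    return output_paths
```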
The Form Recognizer service makes training a custom document model very easy. Here is a reference for how to train a custom model in the Form Recognizer Studio. Once the model has completed training, it is immediately published and available to consume as a REST API endpoint. You can extract results from documents by sending the document data to the REST API endpoint that hosts your custom document model. Incorporating this machine learning model into an efficient and scalable solution can sometimes be challenging. This solution uses a Logic App to send single-page PDF document data to the published REST API endpoint of a trained custom document model and stores the resulting JSON data in a JSON file in a storage account.
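As a rough sketch of what this Logic App does on our behalf, here is how you might call the analyze endpoint directly from Python. The <FORM_RECO_URI>, <MODEL_ID>, and <FORM_RECO_KEY> placeholders match the variables used in the Logic App configuration later in this article, and the sample file name is hypothetical:

```python
# Hedged sketch of calling the custom document model's analyze endpoint,
# mirroring the HTTP actions configured in the Logic App described below.
import time

import requests

ANALYZE_URL = "<FORM_RECO_URI>/formrecognizer/documentModels/<MODEL_ID>:analyze"
KEY_HEADER = {"Ocp-Apim-Subscription-Key": "<FORM_RECO_KEY>"}

# Submit a single-page PDF (hypothetical file name) for analysis.
with open("volunteer-form-page-1.pdf", "rb") as handle:
    response = requests.post(
        ANALYZE_URL,
        params={"api-version": "2022-01-30-preview"},
        headers={**KEY_HEADER, "Content-Type": "application/pdf"},
        data=handle,
    )
response.raise_for_status()

# The analyze call is asynchronous: poll the Operation-Location URL
# until the operation reports "succeeded" (or "failed").
operation_url = response.headers["Operation-Location"]
while True:
    result = requests.get(operation_url, headers=KEY_HEADER).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(10)
print(result["status"])
```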
In our fictitious Volunteer Form example, let's say that we filled out a few sample single-page PDF documents to train a Form Recognizer custom document model. We want to extract the key fields: first name, last name, and each volunteer activity selection mark. Once the model was trained, blank forms were distributed to employees to fill out. Once all forms were filled out and gathered, they were placed into a digital scanner as a single stack of documents. The resulting multi-page PDF file has an individual page for each employee's form. To quickly extract the data from these forms, we can upload the file to the "multi" container of our storage account and, in a few minutes, have the extracted data from each form.
As shown in the architecture below, the solution runs fully automatically once a multi-page PDF file is uploaded to a storage account container.
There are four containers in a storage account:

- multi – where you upload multi-page PDF files
- single – where the individual, single-page PDF files are written
- results – where the JSON output from the custom document model is stored
- a fourth container that holds the training documents for the custom model
There are two Logic Apps:
The first Logic App is triggered when a multi-page PDF file is created in the “multi” container in a storage account. From there, the Logic App will send the PDF document to an Azure Function, which individually splits out each page in the document. The Azure Function returns the PDF data back to the Logic App, which then writes each single-page PDF document back to another container in the same storage account (single).
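For reference, here is a hedged sketch of the HTTP contract between this Logic App and the Azure Function, inferred from the request body and expressions used in the designer steps later in this article ("fileContent", "fileName", "individualPDFs"); the actual splitpdfs implementation may differ:

```python
# Hedged sketch of the Azure Function's request/response contract.
# Assumes "fileContent" arrives base64-encoded and that pages are returned
# base64-encoded in "individualPDFs"; the real splitpdfs code may differ.
import base64
import json
from io import BytesIO

import azure.functions as func
from pypdf import PdfReader, PdfWriter

def main(req: func.HttpRequest) -> func.HttpResponse:
    payload = req.get_json()
    pdf_bytes = base64.b64decode(payload["fileContent"])  # assumed base64 PDF
    file_name = payload["fileName"]

    individual_pdfs = []
    for index, page in enumerate(PdfReader(BytesIO(pdf_bytes)).pages):
        writer = PdfWriter()
        writer.add_page(page)
        buffer = BytesIO()
        writer.write(buffer)
        individual_pdfs.append({
            "fileName": file_name.replace(".pdf", f"-{index + 1}.pdf"),
            "fileContent": base64.b64encode(buffer.getvalue()).decode(),
        })

    # The Logic App's For-each iterates over this "individualPDFs" array.
    return func.HttpResponse(
        json.dumps({"individualPDFs": individual_pdfs}),
        mimetype="application/json",
    )
```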
From here, another Logic App triggers whenever a single-page PDF file is created in the “single” container. This Logic App sends the PDF data to the REST API endpoint of your trained custom document model. The results of the API call are returned and stored as JSON. The data in the JSON document contains your custom form fields and selection marks.
To implement this solution, go to the following GitHub repository and/or click the "Deploy to Azure" button.
https://github.com/stevedem/FormRecognizerAccelerator
After clicking Deploy to Azure, you will be asked to fill out the following fields:
Once the deployment succeeds, you should see six resources in your resource group.
Please note that this will only deploy the core infrastructure. There are a few steps to configure each service once deployed.
The configuration is divided into these sections:

- Storage account
- Form Recognizer Studio
- Azure Function
- Logic Apps
Navigate to the storage account that was created during resource deployment. Next, we will upload the training data. Download all five PDF files from the "train" folder of this GitHub repository: https://aka.ms/form-reco-sample. Then upload these five PDF files to the storage account.
Before continuing to the next section, please follow the steps to configure CORS (Cross-Origin Resource Sharing) on your storage account. CORS must be enabled on the storage account for it to be accessible from the Form Recognizer Studio. A detailed walkthrough can be found here: Configure CORS | Microsoft Docs
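If you prefer to script this step, here is a minimal sketch using the azure-storage-blob SDK; the allowed origin is an assumption based on the Form Recognizer Studio's public URL, and the connection string placeholder is your own:

```python
# Minimal sketch of enabling CORS on the storage account from Python.
# Assumes the azure-storage-blob v12 SDK; the portal walkthrough linked
# above is the documented path.
from azure.storage.blob import BlobServiceClient, CorsRule

client = BlobServiceClient.from_connection_string("<STORAGE_CONNECTION_STRING>")
client.set_service_properties(cors=[
    CorsRule(
        allowed_origins=["https://formrecognizer.appliedai.azure.com"],
        allowed_methods=["GET", "OPTIONS", "POST", "PUT"],
        allowed_headers=["*"],
        exposed_headers=["*"],
        max_age_in_seconds=200,
    )
])
```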
Navigate to the Form Recognizer Studio: https://formrecognizer.appliedai.azure.com/studio
Once your project is open and you can see all five training documents, label and train a custom document model by following these steps.
The next step is to deploy the open-source Python code that splits PDFs as an Azure Function.
mkdir volunteer
cd volunteer
git clone https://github.com/stevedem/splitpdfs.git
cd splitpdfs
func azure functionapp publish <function-app-name>
After the publish command completes successfully, you will see output similar to the example below.
Next up is to build out the Logic App, which will call the Function App we just published.
The blob trigger fires whenever a new file lands in the "multi" container:

| Setting | Value |
| --- | --- |
| Container | /multi |
| Number of blobs to return | 1 |
| How often do you want to check for items? | 1 Minute |

The "Get blob content (V2)" action reads the file:

| Setting | Value |
| --- | --- |
| Storage account name | Use connection settings |
| Blob | List of Files Path (dynamic expression) |
| Infer content type | Yes |

The action that calls the SplitPDFs function sends the blob content and file name in its request body:

{
    "fileContent": "@body('Get_blob_content_(V2)')",
    "fileName": "@triggerBody()?['DisplayName']"
}

A "For each" loop iterates over the single-page PDFs returned by the function:

| Setting | Value |
| --- | --- |
| Select an output from previous steps | @json(body('SplitPDFs'))['individualPDFs'] |

Inside the loop, a "Create blob" action writes each single-page PDF to the "single" container:

| Setting | Value |
| --- | --- |
| Storage account name | Use connection settings |
| Folder path | /single |
| Blob name | @{items('For_each')?['fileName']} |
| Blob content | @base64ToBinary(items('For_each')?['fileContent']) |
The next step is to open the Logic App designer in the Azure portal for the second Logic App you deployed, with "analyzeresults" in the name.
The blob trigger fires whenever a new file lands in the "single" container:

| Setting | Value |
| --- | --- |
| Storage account name | Use connection settings |
| Container | /single |
| Number of blobs to return | 1 |
| How often do you want to check for items? | 1 Minute |

The "Get blob content (V2)" action reads the file:

| Setting | Value |
| --- | --- |
| Storage account name | Use connection settings |
| Blob path | List of Files Path (dynamic expression) |
Please note that you will have to replace a few variables tailored to your resource deployment. Here are the variables you must replace:

- <FORM_RECO_URI> – the endpoint URI of your Form Recognizer resource
- <MODEL_ID> – the ID of the custom document model you trained in the Form Recognizer Studio
- <FORM_RECO_KEY> – an API key for your Form Recognizer resource
An "HTTP" action submits the document to the analyze endpoint of your custom document model:

| Setting | Value |
| --- | --- |
| Method | POST |
| URI | <FORM_RECO_URI>/formrecognizer/documentModels/<MODEL_ID>:analyze?api-version=2022-01-30-preview |
| Headers | Ocp-Apim-Subscription-Key: <FORM_RECO_KEY> |
| Body | The single-page PDF content returned by "Get blob content (V2)" |
An "Initialize variable" action tracks the status of the analyze operation:

| Setting | Value |
| --- | --- |
| Name | analyze-status |
| Type | String |
| Value | @{null} |

An "Until" loop then runs until analyze-status is equal to succeeded.
Inside the loop, a second HTTP action ("HTTP 2") polls the URL returned in the Operation-Location header of the first call:

| Setting | Value |
| --- | --- |
| Method | GET |
| URI | @{outputs('HTTP')['headers']['Operation-Location']} |
| Headers | Ocp-Apim-Subscription-Key: <FORM_RECO_KEY> |

A "Set variable" action stores the latest status:

| Setting | Value |
| --- | --- |
| Name | analyze-status |
| Value | @{body('HTTP_2')['status']} |

A "Delay" action waits between polls:

| Setting | Value |
| --- | --- |
| Count | 10 |
| Unit | Second |
Finally, a "Create blob" action writes the analysis results to the "results" container:

| Setting | Value |
| --- | --- |
| Storage account name | Use connection settings |
| Folder path | /results |
| Blob name | @{replace(triggerBody()?['DisplayName'], '.pdf', '.json')} |
| Blob content | @body('HTTP_2') |
Navigate to the storage account created during resource deployment and upload the multi-page PDF file to the "multi" container.
Wait a few minutes and check the status of the first Logic App, responsible for splitting PDFs. Then navigate to the storage account created during resource deployment and confirm that the single-page PDF files have been written to the "single" container.
Wait a few minutes and check the status of the second Logic App, responsible for analyzing results from single-page PDFs. Then navigate to the storage account created during resource deployment and confirm that the JSON results have been written to the "results" container.
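To spot-check a result, you can download one of the JSON files and read the extracted fields. A minimal sketch follows; the file name and field labels are hypothetical, so use the labels you assigned in the Form Recognizer Studio:

```python
# Minimal sketch for inspecting a downloaded result file.
# The file name and field labels are hypothetical.
import json

with open("volunteer-form-1.json") as handle:
    result = json.load(handle)

# The analyze response nests extracted fields under analyzeResult.documents.
fields = result["analyzeResult"]["documents"][0]["fields"]
for name, field in fields.items():
    # Each field carries the recognized content and a confidence score.
    print(f"{name} = {field.get('content')} (confidence {field.get('confidence')})")
```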
After downloading and inspecting the volunteer form and corresponding JSON output, we can quickly see that our data was efficiently and accurately extracted using our Form Recognizer custom document model.
In just a few steps we created an AI model and an orchestration pipeline. Now you can leverage this solution to extract valuable data from your multi-page PDF forms. You can check out the GitHub repository where this solution accelerator is hosted: https://github.com/stevedem/FormRecognizerAccelerator. In future iterations, we hope to further automate the deployment and configuration tasks.