Most organizations are now aware of how valuable the forms (PDFs, images, videos…) they keep in their closets are. They are looking for best practices and the most cost-effective ways and tools to digitize those assets. By extracting the data from those forms and combining it with existing operational systems and data warehouses, they can build powerful AI and ML models that turn this data into insights and deliver value to their customers and business users.
With the Form Recognizer Cognitive Service, we help organizations harness their data, automate processes (invoice payments, tax processing…), save money and time, and get better accuracy.
Figure 1: Typical form
In my first blog post about automated form processing, I described how you can extract key-value pairs from your forms in real time using the Azure Form Recognizer cognitive service. We successfully implemented that solution for many customers.
Often, after a successful PoC or MVP, our customers realize that not only do they need this real-time solution, but they also have a huge backlog of forms they would like to ingest into their relational or NoSQL databases or data lake in batch fashion. They have different types of forms and they don’t want to build a model for each type. They are also looking for an easy and quick way to ingest new types of forms.
In this post, we’ll describe how to dynamically train a Form Recognizer model to extract the key-value pairs of different types of forms, at scale, using Azure services. We’ll also share a GitHub repository where you can download the code and implement the solution we describe in this post.
The backlog of forms may be in your on-premises environment or on an (S)FTP server. We assume that you were able to upload them into Azure Data Lake Storage Gen2 using Azure Data Factory, Storage Explorer, or AzCopy. Therefore, the solution we’ll describe here focuses on data ingestion from the data lake to the (No)SQL database.
Our product team published a great tutorial on how to Train a Form Recognizer model and extract form data by using the REST API with Python. The solution described in that tutorial demonstrates the approach for one model and one type of form and is ideal for real-time form processing.
The value-add of this post is to show how to automatically train a model with new and different types of forms using a meta-data driven approach, in batch mode.
Below is the high-level architecture.
Figure 2: High Level Architecture
Azure services required to implement this solution
To implement this solution, you will need to create the following services:
Form Recognizer resource:
A Form Recognizer resource to set up and configure the Form Recognizer cognitive service and get the API key and endpoint URI.
Azure SQL single database:
We will create a meta-data table in Azure SQL Database. This table will contain the non-sensitive data required by the Form Recognizer REST API. The idea is that whenever there is a new type of form, we just insert a new record in this table and trigger the training and scoring pipeline.
The required attributes of this table are:
- form_description: This field is not required for training the model or for inference. It just provides a description of the type of forms we are training the model for (for example, client A forms, Hotel B forms, ...).
- training_container_name: This is the storage account container name where we store the training dataset. It can be the same as scoring_container_name
- training_blob_root_folder: The folder in the storage account where we’ll store the files for the training of the model.
- scoring_container_name: This is the storage account container name where we store the files we want to extract the key value pairs from. It can be the same as the training_container_name
- scoring_input_blob_folder: The folder in the storage account where we’ll store the files to extract key-value pairs from.
- model_id: The identifier of the model we want to retrain. For the first run, the value must be set to -1 to create a new custom model to train. The training notebook will return the newly created model id to the data factory and, using a stored procedure activity, we’ll update the meta-data table in the Azure SQL database.
Whenever you add a new form type, you need to reset the model id to -1 and retrain the model.
- file_type: The supported types are application/pdf, image/jpeg, image/png, image/tiff.
- form_batch_group_id: Over time, you might have multiple form types that you train against different models. The form_batch_group_id lets you specify all the form types that have been trained using a specific model.
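To make the role of this table concrete, here is a minimal Python sketch of one meta-data record and the "train a new model or reuse the existing one" decision driven by model_id. Only the field names come from the list above. The class name `FormTypeConfig`, the helper functions, the column types, and the sample values are assumptions made for illustration, not the schema used in the repository.

```python
from dataclasses import dataclass, replace

# Per the description above, a model_id of -1 means "create and train a new custom model".
NEW_MODEL_SENTINEL = "-1"


@dataclass(frozen=True)
class FormTypeConfig:
    """One row of the meta-data table (hypothetical in-memory representation)."""
    form_description: str
    training_container_name: str
    training_blob_root_folder: str
    scoring_container_name: str
    scoring_input_blob_folder: str
    model_id: str              # assumed to be stored as text
    file_type: str
    form_batch_group_id: int   # assumed to be an integer


def needs_new_model(config: FormTypeConfig) -> bool:
    """Return True when the row asks for a brand-new custom model."""
    return str(config.model_id).strip() == NEW_MODEL_SENTINEL


def with_trained_model(config: FormTypeConfig, new_model_id: str) -> FormTypeConfig:
    """Return a copy of the row carrying the model id produced by training."""
    return replace(config, model_id=new_model_id)


# Sample values are made up for illustration.
example_row = FormTypeConfig(
    form_description="Client A forms",
    training_container_name="forms",
    training_blob_root_folder="client-a/train",
    scoring_container_name="forms",
    scoring_input_blob_folder="client-a/score",
    model_id="-1",
    file_type="application/pdf",
    form_batch_group_id=1,
)
```

In the pipeline, the equivalent of `with_trained_model` is the stored procedure activity that writes the new model id back to the table, so later runs skip the "create new model" branch until the value is reset to -1.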
Azure Key Vault:
For security reasons, we don’t want to store certain sensitive information in the parametrization table in the Azure SQL database. We store those parameters in Azure Key Vault secrets.
Below are the parameters we store in the key vault:
- CognitiveServiceEndpoint: The endpoint of the form recognizer cognitive service. This value will be stored in Azure Key Vault for security reasons.
- CognitiveServiceSubscriptionKey: The access key of the cognitive service. This value will be stored in Azure Key Vault for security reasons. The screenshot below shows how to get the key and endpoint of the cognitive service.
Figure 3: Cognitive Service Keys and Endpoint
- StorageAccountName: The storage account where the training dataset and the forms we want to extract the key-value pairs from are stored. The two storage accounts can be different. The training dataset must be in the same container for all form types, but it can be in different folders.
- StorageAccountSasKey: The shared access signature of the storage account.
The screenshot below shows the key vault after you create all the secrets.
Figure 4: Key Vault Secrets
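As an illustration of how a notebook could read these four secrets, here is a minimal Python sketch. The function name `load_form_recognizer_secrets` and the idea of passing in a getter are assumptions made so the example runs anywhere; the secret names are the ones listed above.

```python
from typing import Callable, Dict, Optional

# Secret names as listed above.
SECRET_NAMES = (
    "CognitiveServiceEndpoint",
    "CognitiveServiceSubscriptionKey",
    "StorageAccountName",
    "StorageAccountSasKey",
)


def load_form_recognizer_secrets(
    get_secret: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Fetch every required secret through get_secret and fail fast if one is empty."""
    secrets = {name: get_secret(name) for name in SECRET_NAMES}
    missing = [name for name, value in secrets.items() if not value]
    if missing:
        raise ValueError("Missing secrets: " + ", ".join(missing))
    return secrets


# In a Databricks notebook with a Key Vault-backed secret scope, the getter could be:
#   get_secret = lambda name: dbutils.secrets.get(scope="formrec-kv", key=name)
# where "formrec-kv" is a hypothetical scope name.
```

Passing the getter as a parameter keeps the lookup logic testable outside Databricks, for example with a plain dictionary's `get` method.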
Azure Data Factory:
To orchestrate the training and scoring of the model. Using a Lookup activity, we’ll retrieve the parameters from the Azure SQL Database and orchestrate the training and scoring of the model using Databricks notebooks. All the sensitive parameters stored in Key Vault will be retrieved in the notebooks.
Azure Data Lake Gen 2:
To store the training dataset and the forms we want to extract the key-value pairs from. The training and the scoring datasets can be in different containers but, as mentioned above, the training dataset must be in the same container for all form types.
Azure Databricks:
To implement the Python script to train and score the model. Note that we could have used Azure Functions instead.
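To give an idea of what the training notebook has to assemble, here is a rough Python sketch that builds a Form Recognizer "train custom model" request from the meta-data and Key Vault values. It assumes the v2.0 REST API path and request body, a storage account on the public `blob.core.windows.net` endpoint, and a container-level SAS token. The function names are hypothetical and this is not the code from the repository.

```python
import json
from urllib.parse import urlparse

# Assumption: v2.0 of the Form Recognizer REST API.
TRAIN_PATH = "/formrecognizer/v2.0/custom/models"


def build_container_sas_url(account_name: str, container_name: str, sas_token: str) -> str:
    """Build the container URL with SAS that the service reads the training files from."""
    token = sas_token.lstrip("?")  # accept tokens with or without a leading '?'
    return f"https://{account_name}.blob.core.windows.net/{container_name}?{token}"


def build_train_request(endpoint: str, subscription_key: str,
                        source_sas_url: str, folder_prefix: str):
    """Return (url, headers, body) for a training call on unlabeled forms."""
    url = endpoint.rstrip("/") + TRAIN_PATH
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": subscription_key,
    }
    body = {
        "source": source_sas_url,
        # Restrict training to the folder of the current form type.
        "sourceFilter": {"prefix": folder_prefix, "includeSubFolders": False},
        "useLabelFile": False,
    }
    return url, headers, json.dumps(body)


def model_id_from_location(location_header: str) -> str:
    """Extract the model id from the last path segment of a model URL."""
    return urlparse(location_header).path.rstrip("/").split("/")[-1]
```

The request itself could then be sent with, for example, `requests.post(url, headers=headers, data=body)`. Under the same v2.0 assumption, the model URL comes back in the `Location` response header, and the id extracted from it is what the notebook hands back to the data factory.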
Azure Key Vault:
To store the sensitive parameters required by the Form Recognizer REST API.
The code to implement this solution is available in the following GitHub repository.
Additional Resources
Get started with deploying Form Recognizer:
- Custom Model – extract text, tables and key value pairs
- QuickStart: Train a Form Recognizer model and extract form data by using the REST API
- QuickStart: Train a Form Recognizer model with labels using the sample labeling tool
- Form Recognizer Sample Labeling Tool
- Try it out: https://fott.azurewebsites.net/
- Open Source project: https://github.com/microsoft/OCR-Form-Tools
- Prebuilt receipts - extract data from USA sales receipts
- Layout - extract text and table structure (row and column numbers) from your documents
- See What’s New