Most organizations now recognize the value of the forms (PDFs, images, videos…) they keep in their closets. They are looking for best practices and cost-effective tools to digitize those assets. By extracting the data from those forms and combining it with data from existing operational systems and data warehouses, they can build powerful AI and ML models that yield insights and deliver value to their customers and business users.
With the Form Recognizer Cognitive Service, we help organizations harness their data, automate processes (invoice payments, tax processing…), save money and time, and improve accuracy.
Figure 1: Typical form
In my first blog post about automated form processing, I described how to extract key-value pairs from your forms in real time using the Azure Form Recognizer cognitive service. We successfully implemented that solution for many customers.
Often, after a successful PoC or MVP, our customers realize that they need not only this real-time solution but also a way to ingest a huge backlog of forms into their relational or NoSQL databases or their data lake, in batch fashion. They have different types of forms and they don’t want to build a model for each type by hand. They are also looking for a quick and easy way to ingest new types of forms.
In this blog, we’ll describe how to dynamically train a Form Recognizer model to extract the key-value pairs of different types of forms, at scale, using Azure services. We’ll also share a GitHub repository where you can download the code and implement the solution we describe in this post.
The backlog of forms may be in your on-premises environment or on an (s)FTP server. We assume that you were able to upload the forms into Azure Data Lake Store Gen 2 using Azure Data Factory, Storage Explorer or AzCopy. The solution we describe here therefore focuses on ingesting the data from the data lake into the (No)SQL database.
Our product team published a great tutorial on how to Train a Form Recognizer model and extract form data by using the REST API with Python. The solution described in that tutorial demonstrates the approach for one model and one type of form and is ideal for real-time form processing.
The value-add of this post is to show how to automatically train a model on new and different types of forms, in batch mode, using a metadata-driven approach.
Below is the high-level architecture.
Figure 2: High Level Architecture
To implement this solution, you will need to create the services below:
A Form Recognizer resource, to set up and configure the Form Recognizer cognitive service and get the API key and endpoint URI.
We will create a metadata table in Azure SQL Database. This table will contain the non-sensitive data required by the Form Recognizer REST API. The idea is that whenever there is a new type of form, we just insert a new record in this table and trigger the training and scoring pipeline.
The required attributes of this table are:
Whenever you have a new form type, you need to reset the model id to -1 and retrain the model.
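The metadata-driven idea above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the repository: the field names (`form_type`, `training_folder`, `scoring_folder`, `model_id`) and the helper `plan_pipeline` are hypothetical stand-ins for whatever columns your table actually has. The only thing taken from the post is the convention that a model id of -1 means "train first".

```python
from dataclasses import dataclass
from typing import List, Tuple

# Convention from the post: a model id of -1 means no trained model exists yet.
UNTRAINED_MODEL_ID = "-1"


@dataclass
class FormTypeConfig:
    """One row of the metadata table (column names are hypothetical)."""
    form_type: str
    training_folder: str
    scoring_folder: str
    model_id: str = UNTRAINED_MODEL_ID


def needs_training(config: FormTypeConfig) -> bool:
    """Return True when the row still carries the -1 sentinel."""
    return config.model_id.strip() == UNTRAINED_MODEL_ID


def plan_pipeline(configs: List[FormTypeConfig]) -> List[Tuple[str, List[str]]]:
    """Decide, per form type, which pipeline steps have to run."""
    plan = []
    for config in configs:
        steps = ["train", "score"] if needs_training(config) else ["score"]
        plan.append((config.form_type, steps))
    return plan
```

With this shape, onboarding a new form type amounts to adding one row with the default model id; the next pipeline run picks it up and trains before scoring.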
For security reasons, we don’t want to store certain sensitive information in the parametrization table in the Azure SQL database. We store those parameters in Azure Key Vault secrets.
Below are the parameters we store in the key vault:
Figure 3: Cognitive Service Keys and Endpoint
The screenshot below shows the Key Vault after you create all the secrets.
Figure 4: Key Vault Secrets
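As a rough sketch of how a notebook might read these secrets, the helper below takes any "get a secret by name" function and fails early if one is missing. The secret names and the scope name are hypothetical; use the names you created in your own Key Vault. In a Databricks notebook with a Key Vault-backed secret scope, the getter would typically wrap `dbutils.secrets.get`.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical secret names; replace them with the ones in your Key Vault.
SECRET_NAMES = (
    "form-recognizer-endpoint",
    "form-recognizer-key",
    "storage-sas-url",
)


def load_secrets(
    get_secret: Callable[[str], Optional[str]],
    names: Iterable[str] = SECRET_NAMES,
) -> Dict[str, str]:
    """Fetch every required secret and report all missing ones at once."""
    secrets: Dict[str, str] = {}
    missing = []
    for name in names:
        value = get_secret(name)
        if value:
            secrets[name] = value
        else:
            missing.append(name)
    if missing:
        raise KeyError("Missing Key Vault secrets: " + ", ".join(missing))
    return secrets


# Inside a Databricks notebook (the scope name "formrec-kv" is an assumption):
# secrets = load_secrets(lambda name: dbutils.secrets.get(scope="formrec-kv", key=name))
```

Passing the getter in as a parameter keeps the helper testable outside Databricks, for example with a plain dictionary's `get` method.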
Azure Data Factory: to orchestrate the training and scoring of the model. Using a Lookup activity, we’ll retrieve the parameters from the Azure SQL Database and run the training and scoring of the model in Databricks notebooks. All the sensitive parameters stored in Key Vault will be retrieved in the notebooks.
Azure Data Lake Store Gen 2: to store the training dataset and the forms we want to extract the key-value pairs from. The training and the scoring datasets can be in different containers but, as mentioned above, the training dataset must be in the same container for all form types.
Azure Databricks: to implement the Python script that trains and scores the model. Note that we could have used Azure Functions instead.
Azure Key Vault: to store the sensitive parameters required by the Form Recognizer REST API.
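To make the train-and-score step more concrete, here is a minimal sketch of the pieces such a Python script needs, assuming the v2.0 Form Recognizer REST API (adjust the path for the version you use). It is not the code from the repository, and the function names are illustrative. The first function only builds the training request, so you can inspect it before sending it with `urllib.request.urlopen`; the last one flattens the key-value pairs returned for a model trained without labels.

```python
import json
import urllib.request


def build_train_request(endpoint, api_key, container_sas_url, folder_prefix):
    """Build (without sending) a 'train custom model' request for one form type."""
    url = endpoint.rstrip("/") + "/formrecognizer/v2.0/custom/models"
    body = {
        "source": container_sas_url,  # SAS URL of the training container
        "sourceFilter": {"prefix": folder_prefix, "includeSubFolders": False},
        "useLabelFile": False,  # unsupervised training, no label files
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": api_key,
        },
        method="POST",
    )


def model_id_from_location(location_header):
    """The training response carries a Location header that ends with the model id."""
    return location_header.rstrip("/").rsplit("/", 1)[-1]


def extract_key_value_pairs(analyze_response):
    """Flatten the key-value pairs of an analyze result into a dictionary."""
    pairs = {}
    pages = analyze_response.get("analyzeResult", {}).get("pageResults", [])
    for page in pages:
        for kv in page.get("keyValuePairs", []):
            key = ((kv.get("key") or {}).get("text") or "").strip()
            value = ((kv.get("value") or {}).get("text") or "").strip()
            if key:
                pairs[key] = value
    return pairs
```

In a batch run, the folder prefix would come from the metadata table and the endpoint, key and SAS URL from Key Vault; the model id parsed from the Location header is what gets written back to the table in place of -1.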
The code to implement this solution is available in the following GitHub repository.
Additional Resources
Get started with deploying Form Recognizer –