Extracting information from unstructured documents (e.g., contracts) with Azure Form Recognizer

Microsoft

Mar 28, 2022

Extracting information from unstructured documents such as contracts is usually a manual process: someone reads through large volumes of documents to find specific information and then enters it by hand to digitize it. This work consumes a significant share of a lawyer’s billable hours and is prone to human error. With Azure Form Recognizer you can automate this process. The service uses deep learning models and lets you train a custom contract model that extracts the information you need from just a few sample documents.

Introduction to the new Azure Form Recognizer Custom Neural (document) model

Organizations today deal with vast quantities of unstructured documents including contracts, financial or medical reports and publications. Processing these unstructured documents with AI to extract the right fields by relying on semantics improves decision making and time to value.

The custom neural (custom document) model is a new deep learning model that extracts fields from structured and unstructured documents. It shares the same labeling approach as the existing custom form or template models, and you can start training with just 5 labeled documents. Because the labeling format is shared, you can either take an existing template or custom form project and train a neural model on it, or start from scratch and label new documents. To handle variations, add a few samples of each variation to the training dataset; custom document models generalize well across variations.

When to use this new capability

Custom neural models (or simply neural models) are deep learning models that combine layout and language features to accurately extract labeled fields from documents. The base custom neural model is trained on a variety of document types, which makes it suitable for extracting fields from structured, semi-structured, and unstructured documents. Use the custom neural model to train on unstructured documents such as contracts, scopes of work, and letters, or to train a single model that covers many format variations of the same document type, such as paystubs or bank statements.

Getting started is simple

Let's take contracts as an example and dive into the following steps:

Step 1 - Azure Blob Storage container

You need a standard performance Azure Blob Storage account. Within the storage account, you will create containers to store and organize your training documents. If you do not know how to create an Azure storage account with a container, follow these quickstarts:

Create a storage account. When creating your storage account, make sure to select Standard performance in the Instance details → Performance field.
Create a container. When creating your container, set the Public access level field to Container (anonymous read access for containers and blobs) in the New Container window.

Configure CORS

CORS (Cross-Origin Resource Sharing) must be configured on your Azure storage account so that Form Recognizer Studio can access it. To configure CORS in the Azure portal, you will need access to the CORS blade of your storage account.

Select the CORS blade for the storage account.
Start by creating a new CORS entry in the Blob service.
Set the Allowed origins to https://formrecognizer.appliedai.azure.com.
Select all 8 available options for Allowed methods.
Approve all Allowed headers and Exposed headers by entering an * in each field.
Set the Max Age to 120 seconds (2 minutes) or any acceptable value.
Click the save button at the top of the page to save the changes.

CORS should now be configured to use the storage account from Form Recognizer Studio.
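
The same CORS rule can also be applied from code instead of the portal. The following is a minimal sketch, assuming the azure-storage-blob package and a storage account connection string. The helper names studio_cors_settings and apply_cors are illustrative and not part of any SDK, and the method list is assumed to match the 8 options shown in the portal.

```python
STUDIO_ORIGIN = "https://formrecognizer.appliedai.azure.com"

# Assumed to be the 8 options offered in the portal's Allowed methods list.
ALLOWED_METHODS = ["DELETE", "GET", "HEAD", "MERGE", "OPTIONS", "PATCH", "POST", "PUT"]


def studio_cors_settings(max_age_seconds=120):
    """Return the CORS values from the steps above as a plain dictionary."""
    return {
        "allowed_origins": [STUDIO_ORIGIN],
        "allowed_methods": list(ALLOWED_METHODS),
        "allowed_headers": ["*"],
        "exposed_headers": ["*"],
        "max_age_in_seconds": max_age_seconds,
    }


def apply_cors(connection_string, max_age_seconds=120):
    """Apply the rule to the Blob service of a storage account (makes a network call)."""
    # Imported lazily so studio_cors_settings works without the SDK installed.
    from azure.storage.blob import BlobServiceClient, CorsRule

    settings = studio_cors_settings(max_age_seconds)
    rule = CorsRule(
        settings["allowed_origins"],
        settings["allowed_methods"],
        allowed_headers=settings["allowed_headers"],
        exposed_headers=settings["exposed_headers"],
        max_age_in_seconds=settings["max_age_in_seconds"],
    )
    service = BlobServiceClient.from_connection_string(connection_string)
    # Note: this replaces any CORS rules already set on the Blob service.
    service.set_service_properties(cors=[rule])
```

Calling apply_cors with your connection string should have the same effect as saving the entry in the CORS blade.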

Step 2 - Sample contract document set

Go to the Azure portal and navigate as follows: Your storage account → Data storage → Containers

Select a container from the list.
Select Upload from the menu at the top of the page to upload your training documents to the container.

The Upload blob window will appear.
Select your file(s) to upload.

You need at least 5 contract documents to start training the model.
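
Uploading can also be scripted. Below is a minimal sketch, assuming the azure-storage-blob package, a storage connection string, and a local folder of contracts. The helper names find_training_documents and upload_training_documents are hypothetical, and the PDF-only filter is an assumption made for the example.

```python
from pathlib import Path

MIN_TRAINING_DOCUMENTS = 5  # the minimum mentioned in this walkthrough


def find_training_documents(folder):
    """Return the PDF files in `folder`, sorted by name."""
    documents = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if len(documents) < MIN_TRAINING_DOCUMENTS:
        raise ValueError(
            "Found {} document(s); at least {} are needed.".format(
                len(documents), MIN_TRAINING_DOCUMENTS
            )
        )
    return documents


def upload_training_documents(connection_string, container_name, folder):
    """Upload every training document to the container (makes network calls)."""
    # Imported lazily so the local check above works without the SDK installed.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_connection_string(connection_string, container_name)
    for path in find_training_documents(folder):
        with open(path, "rb") as data:
            # overwrite=True lets the script be re-run with the same file names.
            container.upload_blob(name=path.name, data=data, overwrite=True)
```

The local check runs before any upload, so a folder with too few documents fails early instead of producing a dataset that cannot be trained on.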

Step 3 - Create a custom contract model

To create a custom contract model, start by configuring your project:

Log in to Azure Form Recognizer Studio at https://formrecognizer.appliedai.azure.com
From the Studio home, select the Custom model card to open the Custom models page.
Use the "Create a project" command to start the new project configuration wizard.
Enter project details, select the Azure subscription and resource, and the Azure Blob storage container that contains your contract data (created in the previous step).
Review and submit your settings to create the project.
Label and tag the data points you would like to extract from the documents.
Select the Train command, then:
- Enter a model name.
- Select the custom neural (document) model.
- Train the model.
Once the model is ready, use the Test command to validate it with a test document that was not part of your training dataset.
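
Training can also be started without the Studio. The sketch below only assembles the request for the REST build operation and does not send it. The API version 2022-01-30-preview, the documentModels:build path, and the body field names are assumptions based on the preview REST API of the time and should be checked against the current REST reference. build_model_request is a hypothetical helper.

```python
import json

API_VERSION = "2022-01-30-preview"  # assumption: preview version matching this walkthrough


def build_model_request(endpoint, key, model_id, container_url):
    """Return (url, headers, body) for a request that builds a custom neural model."""
    url = "{}/formrecognizer/documentModels:build?api-version={}".format(
        endpoint.rstrip("/"), API_VERSION
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "modelId": model_id,
        # "neural" selects the custom neural model; "template" is assumed to
        # select the custom template model instead.
        "buildMode": "neural",
        "azureBlobSource": {"containerUrl": container_url},
    })
    return url, headers, body
```

Sent as a POST, a request like this is expected to return a 202 response whose Operation-Location header points to the training operation to poll.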

Step 4 - Start coding to analyze documents using the model

After testing your model with sample documents in Form Recognizer Studio, you can analyze documents directly from your application using the Form Recognizer REST API or SDK.
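
For the REST route, analysis is a submit-and-poll flow. The sketch below uses only the Python standard library. The API version 2022-01-30-preview and the helper names analyze_url, submit_document, and wait_for_result are assumptions for illustration, so check the REST reference for the version you target.

```python
import json
import time
import urllib.request

API_VERSION = "2022-01-30-preview"  # assumption: preview version matching this walkthrough


def analyze_url(endpoint, model_id):
    """Build the analyze URL for a given model ID."""
    return "{}/formrecognizer/documentModels/{}:analyze?api-version={}".format(
        endpoint.rstrip("/"), model_id, API_VERSION
    )


def submit_document(endpoint, key, model_id, pdf_bytes):
    """POST a PDF for analysis and return the URL of the operation to poll."""
    request = urllib.request.Request(
        analyze_url(endpoint, model_id),
        data=pdf_bytes,
        headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def wait_for_result(operation_url, key, poll_seconds=2.0):
    """Poll the operation until it reports succeeded or failed, then return the JSON."""
    request = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    while True:
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        if payload.get("status") in ("succeeded", "failed"):
            return payload
        time.sleep(poll_seconds)
```

The SDK code that follows wraps the same flow in begin_analyze_document and poller.result().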

# Install the SDK first. In a notebook cell: !pip install azure-ai-formrecognizer==3.2.0b3
import os

import pandas as pd
from azure.ai.formrecognizer import DocumentAnalysisClient, DocumentModelAdministrationClient
from azure.core.credentials import AzureKeyCredential

# Read the resource endpoint and API key from environment variables.
endpoint = os.environ["AZURE_FORM_RECOGNIZER_ENDPOINT"]
key = os.environ["AZURE_FORM_RECOGNIZER_KEY"]
credential = AzureKeyCredential(key)

# List the models in the resource to confirm the ID of the trained model.
document_admin_client = DocumentModelAdministrationClient(endpoint=endpoint, credential=credential)
for model in document_admin_client.list_models():
    print("{} | {}".format(model.model_id, model.description))

document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=credential)

# The sample contract is expected in the current working directory.
path_to_sample_document = os.path.abspath(os.path.join(os.getcwd(), "sample_contract.pdf"))

# "Contracts" is the model ID used in this example; use the model name you entered in Step 3.
with open(path_to_sample_document, "rb") as f:
    poller = document_analysis_client.begin_analyze_document("Contracts", document=f)
    result = poller.result()

for idx, document in enumerate(result.documents):
    print("--------Analyzing document #{}--------".format(idx + 1))
    print("Document has type {}".format(document.doc_type))
    print("Document has confidence {}".format(document.confidence))
    print("Document was analyzed by model with ID {}".format(result.model_id))
    for name, field in document.fields.items():
        if field.value_type == "list":
            # Table-like field: each item maps column names to cell fields.
            rows = []
            for row in field.value:
                # Use "column" rather than "key" so the API key above is not overwritten.
                rows.append({column: cell.content for column, cell in row.value.items()})
            # In a notebook, display(df) renders the table instead of printing it.
            df = pd.DataFrame(rows)
            print(df)
        else:
            # Use the typed value when present; otherwise fall back to the raw text content.
            field_value = field.value if field.value else field.content
            print("'{}' with value '{}' and with confidence {}".format(name, field_value, field.confidence))
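
Each extracted field carries a confidence score, so one possible follow-up is to route low-confidence fields to manual review. Below is a minimal sketch that uses plain tuples in place of SDK objects. The split_by_confidence helper, the 0.8 threshold, and the sample values are illustrative assumptions, not recommended settings.

```python
def split_by_confidence(fields, threshold=0.8):
    """Split {name: (value, confidence)} into accepted and needs-review dictionaries."""
    accepted, needs_review = {}, {}
    for name, (value, confidence) in fields.items():
        # A missing confidence is treated as low so the field gets reviewed.
        if confidence is not None and confidence >= threshold:
            accepted[name] = value
        else:
            needs_review[name] = value
    return accepted, needs_review


# Made-up values for illustration only.
sample = {
    "PartyName": ("Contoso Ltd", 0.97),
    "EffectiveDate": ("2022-03-01", 0.55),
}
accepted, needs_review = split_by_confidence(sample)
print("Accepted:", accepted)
print("Needs review:", needs_review)
```

With the SDK result from the code above, the input could be built as {name: (field.content, field.confidence) for name, field in document.fields.items()}.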

Now you are ready to analyze all your documents with Form Recognizer.

Additional resources