Extracting information from unstructured documents (e.g., contracts) with Azure Form Recognizer
Extracting information from unstructured documents such as contracts is usually a manual process: someone reads through large volumes of documents to find specific information, then re-keys that information to digitize it. The process consumes a significant share of a lawyer’s billable hours and is prone to human error. With Azure Form Recognizer you can automate it. Azure Form Recognizer uses deep learning models and lets you train a custom contract model that extracts the information you need from just a few sample documents.
Introduction to the new Azure Form Recognizer Custom Neural (document) model
Organizations today deal with vast quantities of unstructured documents including contracts, financial or medical reports and publications. Processing these unstructured documents with AI to extract the right fields by relying on semantics improves decision making and time to value.
The neural (custom document) model is a new deep-learning model that extracts fields from structured and unstructured documents. It shares the same labeling approach as the existing custom form or template models, and you can start training with just 5 labeled documents. Because the labeling format is common, you can take an existing template or custom form project and train a neural (custom document) model on it, or start from scratch and label new documents. To handle variations, add a few samples of each variation to the training dataset; custom document models generalize well across variations.
When to use this new capability
Custom neural models (or simply neural models) are deep-learning models that combine layout and language features to accurately extract labeled fields from documents. The base custom neural model is trained on a variety of document types, which makes it suitable for extracting fields from structured, semi-structured and unstructured documents. Use the custom neural model to train on unstructured documents such as contracts, scopes of work and letters, or to train a single model that covers several formats of the same document type, such as paystubs or bank statements.
Getting started is simple
Let's take contracts as an example and dive into the following steps:
Step 1 - Azure Blob Storage container
You need a standard-performance Azure Blob Storage account. Within that storage account, you create containers to store and organize your training documents. If you do not know how to create an Azure storage account with a container, follow these quick starts:
Configure CORS
CORS (Cross-Origin Resource Sharing) must be configured on your Azure storage account so that Form Recognizer Studio can access it. To configure CORS in the Azure portal, you need access to the CORS blade of your storage account.
CORS should now be configured to use the storage account from Form Recognizer Studio.
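If you prefer to script this step instead of using the portal, the following is a minimal sketch using the azure-storage-blob package. The Studio origin URL, the list of allowed methods and the 120-second max age are assumptions taken from the usual Studio setup guidance and should be checked against the current documentation; the helper names build_cors_settings and apply_cors are hypothetical.

```python
# Assumed origin of Form Recognizer Studio; verify it before use.
STUDIO_ORIGIN = "https://formrecognizer.appliedai.azure.com"


def build_cors_settings(origin=STUDIO_ORIGIN, max_age_seconds=120):
    """Return the CORS rule as plain data so it can be reviewed before it is applied."""
    return {
        "allowed_origins": [origin],
        "allowed_methods": ["DELETE", "GET", "HEAD", "MERGE", "OPTIONS", "PATCH", "POST", "PUT"],
        "allowed_headers": ["*"],
        "exposed_headers": ["*"],
        "max_age_in_seconds": max_age_seconds,  # assumed value, any acceptable value works
    }


def apply_cors(connection_string, settings=None):
    """Apply the rule to the Blob service (requires `pip install azure-storage-blob`)."""
    # Imported lazily so the settings can be inspected without the SDK installed.
    from azure.storage.blob import BlobServiceClient, CorsRule

    settings = settings or build_cors_settings()
    rule = CorsRule(
        settings["allowed_origins"],
        settings["allowed_methods"],
        allowed_headers=settings["allowed_headers"],
        exposed_headers=settings["exposed_headers"],
        max_age_in_seconds=settings["max_age_in_seconds"],
    )
    service = BlobServiceClient.from_connection_string(connection_string)
    # Only the CORS rules are passed; other service properties are left unset here.
    service.set_service_properties(cors=[rule])
```

Calling apply_cors with your storage account connection string is intended to have the same effect as filling in the CORS blade by hand.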
Step 2 - Sample contract document set
To get started, you need at least 5 contract documents to train the model.
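Before uploading, it can help to confirm that the sample set is large enough. The sketch below counts files with a supported extension in a local folder; the extension list reflects my understanding of the formats Form Recognizer accepts (PDF and common image types), and the folder name contracts is hypothetical.

```python
import os

# Assumed list of accepted formats (PDF and common image types).
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}
MIN_TRAINING_DOCUMENTS = 5  # the minimum mentioned in this article


def supported_documents(file_names):
    """Return the file names whose extension looks like a supported document format."""
    return sorted(
        name for name in file_names
        if os.path.splitext(name)[1].lower() in SUPPORTED_EXTENSIONS
    )


def has_enough_documents(file_names, minimum=MIN_TRAINING_DOCUMENTS):
    """Return True when at least `minimum` supported documents are present."""
    return len(supported_documents(file_names)) >= minimum


if __name__ == "__main__":
    folder = "contracts"  # hypothetical local folder holding the sample contracts
    if os.path.isdir(folder):
        names = os.listdir(folder)
        print(len(supported_documents(names)), "supported documents found")
        print("Ready to train:", has_enough_documents(names))
```

Label files that sit next to the documents are ignored because their extension is not in the supported set.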
Step 3 - Create a custom contracts model
To create a custom contracts model, start by configuring your project:
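Training can also be started outside the Studio. As a rough sketch only, the request below targets the REST build operation with the neural build mode. It assumes the 2022-08-31 API version (this article otherwise uses a preview SDK, so adjust the version to match your resource), a container SAS URL that holds the labeled documents, and the model ID "Contracts" used later in this article. The helper names are hypothetical.

```python
import json
import urllib.request

API_VERSION = "2022-08-31"  # assumption: adjust to the API version your resource supports


def build_model_request(endpoint, key, model_id, container_sas_url, build_mode="neural"):
    """Prepare (but do not send) the request that starts a custom model build."""
    url = "{}/formrecognizer/documentModels:build?api-version={}".format(
        endpoint.rstrip("/"), API_VERSION
    )
    body = {
        "modelId": model_id,
        "buildMode": build_mode,  # "neural" or "template"
        "azureBlobSource": {"containerUrl": container_sas_url},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_build(request):
    """Send the request and return the URL to poll for the build status."""
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]
```

Keeping the request construction separate from the network call makes it possible to inspect the payload before anything is sent.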
Step 4 - Start coding to analyze documents using the model
After testing your model with sample documents in Form Recognizer Studio, you can analyze documents directly from your application using the Form Recognizer REST API or SDK.
# Install the preview SDK used in this walkthrough (notebook cell)
!pip install azure-ai-formrecognizer==3.2.0b3
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient, DocumentModelAdministrationClient
import os
import pandas as pd
# Read the resource endpoint and key from environment variables
endpoint = os.environ["AZURE_FORM_RECOGNIZER_ENDPOINT"]
key = os.environ["AZURE_FORM_RECOGNIZER_KEY"]
# List the models in the resource to confirm that the custom model exists
document_admin_client = DocumentModelAdministrationClient(endpoint=endpoint, credential=AzureKeyCredential(key))
models = document_admin_client.list_models()
for model in models:
    print("{} | {}".format(model.model_id, model.description))
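# Create the client used to analyze documents
document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key))
# Build the path to the sample contract in the current working directory
path_to_sample_documents = os.path.abspath(
    os.path.join(
        os.getcwd(),
        "sample_contract.pdf",
    )
)
# Submit the document to the custom model ("Contracts" is the model ID) and wait for the result
with open(path_to_sample_documents, "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        "Contracts", document=f)
result = poller.result()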
# Print each extracted field; table-like (list) fields are shown as a DataFrame
for idx, document in enumerate(result.documents):
    print("--------Analyzing document #{}--------".format(idx + 1))
    print("Document has type {}".format(document.doc_type))
    print("Document has confidence {}".format(document.confidence))
    print("Document was analyzed by model with ID {}".format(result.model_id))
    for name, field in document.fields.items():
        # Fall back to the raw text when no typed value is available
        field_value = field.value if field.value else field.content
        if field.value_type == "list":
            df_list = []
            for row in field.value:
                a_row = {}
                # Use names that do not shadow the API key variable `key`
                for column, cell in row.value.items():
                    a_row[column] = cell.content
                df_list.append(a_row)
            df = pd.DataFrame(df_list)
            display(df)  # display() is available in notebook environments
        else:
            print("'{}' with value '{}' and with confidence {}".format(name, field_value, field.confidence))
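Since every field comes back with a confidence score, one possible follow-up is to route low-confidence values to a human reviewer. The sketch below illustrates that idea and is not part of the SDK; the 0.8 threshold and the field names in the example data are hypothetical.

```python
def split_by_confidence(fields, threshold=0.8):
    """Split extracted fields into accepted values and values needing review.

    `fields` maps a field name to a (value, confidence) pair.
    """
    accepted, needs_review = {}, {}
    for name, (value, confidence) in fields.items():
        # A missing confidence score is treated as low confidence.
        if confidence is not None and confidence >= threshold:
            accepted[name] = value
        else:
            needs_review[name] = value
    return accepted, needs_review


# Hypothetical field names and scores, standing in for document.fields from the loop above.
example_fields = {
    "Title": ("Master Services Agreement", 0.97),
    "EffectiveDate": ("2022-01-01", 0.55),
    "Parties": ("Contoso Ltd. and Fabrikam Inc.", None),
}
accepted, needs_review = split_by_confidence(example_fields)
print("Accepted:", sorted(accepted))
print("Needs review:", sorted(needs_review))
```

With real results, the input mapping could be built as {name: (field.value if field.value else field.content, field.confidence) for name, field in document.fields.items()}.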
Now you are ready to analyze all your documents with Form Recognizer.
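The walkthrough above uses the Python SDK. For applications that call the REST API mentioned in Step 4 instead, the flow is roughly: post the document to the model's analyze operation, then poll the returned operation URL until it finishes. The sketch below assumes the 2022-08-31 API version and uses only the standard library; the helper names are hypothetical.

```python
import json
import time
import urllib.request

API_VERSION = "2022-08-31"  # assumption: adjust to the API version your resource supports


def build_analyze_request(endpoint, key, model_id, document_bytes):
    """Prepare (but do not send) the request that submits a document for analysis."""
    url = "{}/formrecognizer/documentModels/{}:analyze?api-version={}".format(
        endpoint.rstrip("/"), model_id, API_VERSION
    )
    return urllib.request.Request(
        url,
        data=document_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )


def analyze(endpoint, key, model_id, document_bytes, poll_seconds=2.0):
    """Submit the document, then poll until the operation succeeds or fails."""
    request = build_analyze_request(endpoint, key, model_id, document_bytes)
    with urllib.request.urlopen(request) as response:
        operation_url = response.headers["Operation-Location"]
    poll = urllib.request.Request(operation_url, headers={"Ocp-Apim-Subscription-Key": key})
    while True:
        with urllib.request.urlopen(poll) as response:
            payload = json.load(response)
        if payload["status"] in ("succeeded", "failed"):
            return payload
        time.sleep(poll_seconds)
```

A call such as analyze(endpoint, key, "Contracts", open("sample_contract.pdf", "rb").read()) would return the raw JSON payload, which the caller then inspects for the extracted fields.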
Additional resources