In the world of document processing and automation, one of the most frequent use cases is categorizing and organizing documents into predefined classes. For instance, an organization may have a process that ingests documents that then need to be classified into separate categories such as “invoices”, “contracts”, and “reports”. Azure AI Document Intelligence custom classification models can address these needs and offer a powerful way to bring order to document management.
Document Intelligence is a cloud-based Azure AI service that uses machine learning models to automate document processing in applications and workflows. New users and those unfamiliar with Document Intelligence's capabilities may be interested in starting their journey using Document Intelligence Studio—an online tool to visually explore, understand, train, and implement features from the Document Intelligence service without having to write a single line of code. However, more advanced use cases and integrations may necessitate interacting with the Document Intelligence service programmatically. This can be achieved using the Document Intelligence REST API or SDKs available for .NET, Java, JavaScript, and Python. In this article we'll focus specifically on building a custom classification model using Python, one of the more popular languages amongst data science and machine learning developers.
Those wanting a head start on creating a custom classification model programmatically may look to the existing sample_classify_document.py code sample from the azure-sdk-for-python repository. However, for this sample script to work, the classifier training data set must already include an ocr.json file for each document. Optical Character Recognition (OCR) is a critical step in converting scanned documents into editable and searchable data. While Azure AI Document Intelligence Studio automatically generates the OCR files behind the scenes when a custom classification model is built through the visual interface, those using the Python SDK may find themselves at a crossroads because the SDK lacks this built-in functionality.
The Document Intelligence Python SDK provides a powerful set of tools for extracting information from forms and documents. However, one key limitation is its lack of a method to easily generate ocr.json files from layout analysis results, a feature that is completely integrated and handled automatically in Document Intelligence Studio.
As described in the documentation, the required ocr.json files can be created by analyzing each training document with Document Intelligence's prebuilt layout model and saving the results in the API response format. There is a sample Python script, sample_analyze_layout.py, but because the SDK's layout results object is structured differently from the API's layout results object, there isn't a clear way to generate the required ocr.json files using the Python SDK alone. This blog post walks through the custom solution we developed to code this process manually, addressing a common problem discussed in the Microsoft community. The solution consists of four scripts.
analyze_layout.py iterates through the files in the specified directory (TRAINING_DOCUMENTS) and analyzes each document using Azure AI Document Intelligence. It saves the results in a .ocr.json file alongside the original document. This format mirrors the OCR output of Document Intelligence Studio, maintaining consistency and compatibility.

# Use begin_analyze_document to start the analysis process, and use a callback in order to receive the raw response
with open(document_file_path, "rb") as f:
    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        analyze_request=f,
        content_type="application/octet-stream",
        cls=lambda raw_response, _, headers: create_ocr_json(ocr_json_file_path, raw_response),
    )

# ... other code ...

# Callback function to save the API raw response as .ocr.json file
def create_ocr_json(ocr_json_file_path, raw_response):
    # Write as UTF-8 explicitly so the output does not depend on the platform default encoding
    with open(ocr_json_file_path, "w", encoding="utf-8") as f:
        f.write(raw_response.http_response.body().decode("utf-8"))
    print(f"\tOutput saved to {ocr_json_file_path}")
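The directory walk around that call is not shown in the excerpt, so here is a minimal sketch of how it might look. The helper names, the extension list, and the "document name plus .ocr.json" naming are assumptions for illustration, not the script's actual code. The Azure call is passed in as a plain callable so that the file handling can be read and run on its own.

```python
from pathlib import Path
from typing import Callable, List

# Assumed set of extensions to treat as training documents (hypothetical; adjust as needed).
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".tiff", ".bmp"}


def ocr_json_path_for(document_path: Path) -> Path:
    """Return the sibling path '<document name>.ocr.json' (assumed naming)."""
    return document_path.with_name(document_path.name + ".ocr.json")


def find_documents(training_dir: Path) -> List[Path]:
    """Recursively list documents that do not yet have an .ocr.json file."""
    documents = []
    for path in sorted(training_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue  # skips folders, .ocr.json files, and unrelated files
        if ocr_json_path_for(path).exists():
            continue  # already analyzed; avoid a second service call
        documents.append(path)
    return documents


def analyze_directory(training_dir: Path, analyze: Callable[[Path, Path], None]) -> int:
    """Call analyze(document_path, ocr_json_path) for each pending document."""
    pending = find_documents(training_dir)
    for document_path in pending:
        analyze(document_path, ocr_json_path_for(document_path))
    return len(pending)
```

In a real run, `analyze` would wrap the `begin_analyze_document` call shown above. Skipping documents that already have an .ocr.json file keeps reruns cheap if the script is interrupted partway through.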
upload_documents.py uploads all the training documents, along with the .ocr.json files and a .jsonl file that the classifier build uses to reference each of the documents. The .jsonl file allows us to process multiple documents in a batch, improving the efficiency of the training process.
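As a rough illustration of the .jsonl file, the sketch below builds one from a list of blob names. It assumes each line is a JSON object of the form {"file": "<blob path>"}, which is our understanding of the file-list format and should be checked against the current documentation. The function name and the filtering rule are hypothetical.

```python
import json
from typing import Iterable


def build_file_list(blob_names: Iterable[str]) -> str:
    """Return JSONL text with one {"file": "<blob name>"} object per document."""
    lines = []
    for name in sorted(blob_names):
        if name.endswith(".ocr.json") or name.endswith(".jsonl"):
            continue  # list only the source documents, not OCR results or other lists
        lines.append(json.dumps({"file": name}))
    # One JSON object per line, each terminated by a newline
    return "".join(line + "\n" for line in lines)
```

Writing one such file per document class keeps the mapping between a class and its training documents explicit when the classifier is built.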
The build_classifier.py script initiates the process of building a custom document classifier using the document types and labeled data from the .jsonl files. It uses the DocumentIntelligenceAdministrationClient and BlobServiceClient classes to interface with the Document Intelligence and Azure Blob Storage services and to retrieve and process the training data uploaded in the previous step. Once finished, it prints the results, including the classifier ID, API version, description, and document classes used for training.
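To make the relationship between document types and .jsonl files concrete, here is a small sketch that assembles a build request as a plain dictionary. It does not use the SDK classes the script relies on. The field names (classifierId, docTypes, azureBlobFileListSource, containerUrl, fileList) follow our reading of the REST request body and should be treated as assumptions to verify, and the function name is hypothetical.

```python
from typing import Dict


def build_classifier_request(
    classifier_id: str,
    container_url: str,
    file_lists: Dict[str, str],
    description: str = "",
) -> dict:
    """Map each document type to the .jsonl file list that describes its training documents."""
    doc_types = {
        doc_type: {
            "azureBlobFileListSource": {
                "containerUrl": container_url,  # container holding documents and .ocr.json files
                "fileList": file_list,  # .jsonl file naming this class's documents
            }
        }
        for doc_type, file_list in file_lists.items()
    }
    body = {"classifierId": classifier_id, "docTypes": doc_types}
    if description:
        body["description"] = description
    return body
```

For example, `build_classifier_request("my-classifier", container_url, {"invoice": "invoice.jsonl", "contract": "contract.jsonl"})` would describe a two-class classifier, where the container URL and file names are placeholders.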
classify_document.py uses the DocumentIntelligenceClient class to classify a document with a trained document classifier. It analyzes one document at a time and returns the document type along with the confidence score.
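Reading the document type and confidence out of a classification result might look like the sketch below. It works on the JSON form of the result and assumes a documents list whose entries carry docType and confidence fields. The SDK result object exposes the same information through attributes, so treat this as an illustration of the shape, not the script's code. The function name and the threshold parameter are hypothetical.

```python
from typing import List, Tuple


def summarize_classification(
    analyze_result: dict, min_confidence: float = 0.0
) -> List[Tuple[str, float]]:
    """Return (document type, confidence) pairs at or above min_confidence."""
    results = []
    for document in analyze_result.get("documents", []):
        doc_type = document.get("docType", "unknown")
        confidence = float(document.get("confidence", 0.0))
        if confidence >= min_confidence:
            results.append((doc_type, confidence))
    return results
```

A confidence threshold like this is one way to route low-confidence documents to manual review instead of accepting every predicted class.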