Overview of all steps
Customized OCR solutions offer the ability to define unique categories within a document or image. Through working with various clients on custom OCR solutions, we often hear the question: "How well does this solution perform on my data?" We have developed an approach that allows for both benchmarking Microsoft’s Forms Recognizer against custom data using Forms Recognizer Studio and training a custom model with ground truth annotations in one process.
Please consult this short summary of all steps to get started. We will provide deep dives on how to execute the benchmark end-to-end throughout the blog post.
- Annotating a ground truth using Forms Recognizer Studio
- Training a custom neural Forms Recognizer model to recognize custom fields/entities (optional)
- Extracting annotation project from Azure Storage Explorer
- Measuring performance of OCR and field recognition
- Putting your knowledge into practice and performing the benchmark calculations
Annotating a ground truth using Forms Recognizer Studio
Before training a custom Form Recognizer model, it is important to have a labeled or annotated data set, also known as the ground truth. To provide an example of the annotation process, we have created a sample image of a scanned hand-written postal address. The ground-truth name is "John Doe" and the address is "000 Fifth Ave, NY 10065, USA", as shown in the figure below:
Step 1: Define the fields in scope. This depends on your specific application. In our toy image example, we have defined 3 fields in advance: “Name”, “Address”, and “Missed” (containing non-OCRed / non-recognized contents, if any).
Step 2: Open Form Recognizer Studio, select “Custom models”, and create a new project (filling in all required fields such as Azure subscription and Storage Account as instructed in the documentation). After the project is created, we can upload our toy image by “Drag and drop”. Note that when you click the image, the built-in Form Recognizer model will be triggered to OCR the image automatically in the background (this usually takes 1 or 2 seconds per image).
Step 3: Click the “+” button in the upper right corner to create pre-defined fields. For example, we have created 3 fields in our scenario, including a “Missed” field to capture the missed / non-OCRed contents.
Step 4: Start annotating the image by assigning the relevant OCRed contents to the associated fields. For example, we assign “John Doe” to the “Name” field by hovering over and selecting the relevant characters:
Meanwhile, a JSON file is created automatically in the Blob Container to reflect the annotation progress on the fly. This file captures our annotation result as label-value pairs. You can locate and edit this file using Azure Storage Explorer.
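For reference, a labels file has roughly the following shape (the exact schema depends on your Form Recognizer version, and the file name, text, and coordinates below are purely illustrative):

```json
{
  "document": "address-sample.png",
  "labels": [
    {
      "label": "Name",
      "value": [
        { "page": 1, "text": "John", "boundingBoxes": [[0.15, 0.20, 0.35, 0.20, 0.35, 0.28, 0.15, 0.28]] },
        { "page": 1, "text": "Doe", "boundingBoxes": [[0.38, 0.20, 0.50, 0.20, 0.50, 0.28, 0.38, 0.28]] }
      ]
    }
  ]
}
```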
Step 5 (Optional): Correct wrongly OCRed fields and missed contents, if there are any.
- Correct wrongly OCRed fields. In our example, the OCRed zip code is “…, NY 6005, …”, whereas the ground-truth is “…, NY 10065, …”, as highlighted by the blue bars below:
In this case, we need to manually correct the zip code in the “Address” field:
- In Azure Storage Explorer, go to the Blob Container where the uploaded images are stored.
- Click the annotation file with “...labels.json” extension.
- Press the “Edit” button, navigate to the wrongly OCRed field / label “Address”, manually replace “6005” with “10065”, and click the “Save” button to trigger the update.
The below screenshot illustrates the correction process for the wrongly OCRed scenario:
- Correct missed contents. In our example, the missed content is the initial character “1” in the zip code, as highlighted by the blue circle below:
In this case, we need to manually add it to the “Missed” field:
- Click the “Regions” button in the upper left corner, draw a bounding box around the character “1”, and assign it to the “Missed” field as below:
- Go to the Blob Container.
- Click the annotation file with “...labels.json” extension.
- Press the “Edit” button, and you will find a newly created label named “Missed” with an empty “text” field. Next, you need to manually insert the character “1” as its value and click the “Save” button to trigger the update.
The below screenshot illustrates the correction process for the missed content scenario:
Great! We have corrected the wrongly OCRed fields and missed contents to ensure the annotation quality (you can refresh Form Recognizer Studio to see the corrections in the dashboard).
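If you prefer to script these corrections instead of editing the JSON by hand in Azure Storage Explorer, a minimal sketch with the azure-storage-blob package could look like the following (the connection string, container name, blob name, and the assumed labels.json keys are illustrative):

```python
import json
from azure.storage.blob import BlobClient

# Illustrative connection string, container, and blob name; replace with your own values.
blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="<labeling-container>",
    blob_name="address-sample.png.labels.json",
)

labels = json.loads(blob.download_blob().readall())

# Fix the wrongly OCRed zip code and fill in the manually drawn "Missed" region.
for label in labels["labels"]:
    for value in label["value"]:
        if label["label"] == "Address":
            value["text"] = value["text"].replace("6005", "10065")
        if label["label"] == "Missed" and not value["text"]:
            value["text"] = "1"

# Upload the corrected file so Form Recognizer Studio picks up the change.
blob.upload_blob(json.dumps(labels, indent=2), overwrite=True)
```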
Step 6: Move to the next image and repeat the annotation process (i.e., Step 4 or Step 4 + Step 5).
Training a custom neural Forms Recognizer model to recognize custom fields/entities
To train your custom neural model for custom entities like in the mail example above, utilize the annotations from the previous step. These annotations will also be used for benchmarking later. To begin training, simply click the "Train" button located in the top right corner of Forms Recognizer Studio. For detailed instructions on combining separate models into one, refer to the documentation provided. This documentation will also explain the process of testing your newly trained Forms Recognizer instance. Please note that if you only need OCR or generic entities, you can also use the General Document API.
It is also important to remember that when testing your trained Forms Recognizer instance, you should use documents that were not part of the training process. For example, if you have annotated 100 images, use 80-90 for training and the remaining images for testing. The annotations made on the test images can be used to measure the performance of the OCR and field/entity recognition in the next step.
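A simple way to create such a hold-out split from a local copy of the labeled files is sketched below (the folder name, file suffix, and 80/20 ratio are illustrative assumptions):

```python
import random
from pathlib import Path

# Collect document names that have a ground-truth annotation (<file>.labels.json).
label_files = sorted(Path("labelling").glob("*.labels.json"))
docs = [p.name.replace(".labels.json", "") for p in label_files]

random.seed(42)  # make the split reproducible
train = set(random.sample(docs, k=int(0.8 * len(docs))))
test = [d for d in docs if d not in train]

print(f"{len(train)} documents for training, {len(test)} held out for testing")
```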
Extracting annotation project from Azure Storage Explorer
To perform an OCR benchmark, you can directly download the outputs from Azure Storage Explorer. You can access the benchmarking code in the following repository. Simply paste the downloaded data directory from Storage Explorer into the root of the project that you downloaded from GitHub. When using Storage Explorer, your subscription-level document tree may appear as follows:
Measuring performance of OCR and field recognition
Throughout this section, we will distinguish between measuring the performance of a custom Forms Recognizer on two levels:
- OCR level: how well does Forms Recognizer digitize my document and correctly translate an image / document into a machine-readable format?
- Field level: how well does Forms Recognizer associate pre-defined (and labelled) fields/categories post-training?
The metric that is used to answer those questions is word similarity using the Levenshtein distance. Briefly, the Levenshtein distance is a way to measure how different two words or phrases are from each other. We will use the fuzzywuzzy implementation of this, which gives us scores on the following scale:
0 = there is no similarity between string A and B; 100 = A and B are the same word.
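As a quick illustration with the zip-code example from the annotation section (fuzzywuzzy is assumed to be installed via pip):

```python
from fuzzywuzzy import fuzz

# Identical strings score 100.
print(fuzz.ratio("000 Fifth Avenue", "000 Fifth Avenue"))

# Partially matching strings score somewhere between 0 and 100.
print(fuzz.ratio("NY 10065", "NY 6005"))
```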
In order to compute this metric, we will introduce a code base that requires the following inputs:
- Pre-labelled files: This is what will be used as the ground truth. The files can be identified by the filename.labels.json extension.
- Forms Recognizer output: This is what Forms Recognizer has OCRed. The files can be identified by the filename.ocr.json extension for the pre-trained model, or by png.json (or another file-extension.json) for outputs of a custom-trained model.
- Images in the case of .JPEG or .PNG files (optional): These are needed to perform a dynamic bounding-box conversion between the labels.json and ocr.json files.
OCR Level
You have the option to manually reorganize the files and separate the images, labels, and OCR files, or you can download the entire folder and use the provided file distribution script within the repository as a template. Please note that this template might require customization.
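As an illustration of what such a distribution step can look like (the source folder name and the suffix-to-folder mapping below are assumptions; check the repository README for the exact layout):

```python
import shutil
from pathlib import Path

src = Path("downloaded_labeling_project")  # folder downloaded via Azure Storage Explorer

# Assumed mapping of file suffixes to the repository's target folders.
targets = {
    ".labels.json": Path("labelling/GT_check"),  # ground-truth annotations
    ".ocr.json": Path("labelling/FR_output"),    # Forms Recognizer output
    ".png": Path("labelling/images"),            # images (optional)
    ".jpeg": Path("labelling/images"),
}

for folder in set(targets.values()):
    folder.mkdir(parents=True, exist_ok=True)

for f in src.rglob("*"):
    if not f.is_file():
        continue
    for suffix, folder in targets.items():
        if f.name.lower().endswith(suffix):
            shutil.copy(f, folder / f.name)
            break
```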
Field / Entity Level
To obtain output for field/entity recognition, you must first train a custom Forms Recognizer model. After training, you can find the trained model under the "Test" tab. The image provided illustrates the location of the trained model in the Studio and the location of the download button for the custom model output. You have the option to download the files from the user interface, or to navigate to the code tile next to "result" and retrieve the results through API calls using an IDE.
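If you go the API route, a minimal sketch with the azure-ai-formrecognizer Python SDK could look like the following (endpoint, key, model ID, and file names are placeholders):

```python
import json
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Analyze a held-out test image with your custom-trained model.
with open("test-image.png", "rb") as f:
    poller = client.begin_analyze_document(model_id="<custom-model-id>", document=f)
result = poller.result()

# Save the raw result so it can be fed into the benchmarking scripts.
with open("test-image.png.json", "w") as out:
    json.dump(result.to_dict(), out)
```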
The provided README.md file explains which output to expect and which customizations are possible.
Putting your knowledge into practice and performing the benchmark calculations
Once you have completed the annotation of your files and, if necessary, trained a custom model for field/entity recognition, you can proceed to evaluate the performance of your OCR solution. Two shell files are provided for executing the necessary scripts, or you can use the provided Python notebooks (.ipynb) for an interactive approach to calculating the benchmarks. You can use Azure Machine Learning Studio or your preferred local IDE to perform the calculations, as long as you have a Python environment with the necessary dependencies. The produced outputs place the predicted/extracted value next to the annotated ground truth and may look as follows:
| Filename | Entity | True Value | Extracted Value | Confidence Score | Fuzzy Score |
| --- | --- | --- | --- | --- | --- |
| File_1 | Address | 000 Fifth Avenue | 000 Fifth Avenue | 0.99 | 100 |
| File_1 | Address | 156 Denison Street | 15 Deni Street | 0.87 | 88 |
As such, you can check each annotated word / entity for its word similarity. You also have the option to use this transactional side-by-side comparison as an input for a dashboard such as Power BI.
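For example, a comparison like the table above can be exported for Power BI in a few lines (the row values are taken from the table above; pandas and fuzzywuzzy are assumed to be installed):

```python
import pandas as pd
from fuzzywuzzy import fuzz

# Illustrative ground-truth vs. extracted pairs; in practice these come from the
# labels.json files and the model output prepared in the previous steps.
rows = [
    {"Filename": "File_1", "Entity": "Address",
     "True Value": "000 Fifth Avenue", "Extracted Value": "000 Fifth Avenue",
     "Confidence Score": 0.99},
    {"Filename": "File_1", "Entity": "Address",
     "True Value": "156 Denison Street", "Extracted Value": "15 Deni Street",
     "Confidence Score": 0.87},
]

df = pd.DataFrame(rows)
df["Fuzzy Score"] = [
    fuzz.ratio(t, e) for t, e in zip(df["True Value"], df["Extracted Value"])
]
df.to_csv("benchmark_results.csv", index=False)  # ready to load into Power BI
```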
Step-by-step summary:
- Download the zip file from the GitHub repository
- Place your data at the root level
- Distribute the files into “./labelling/FR_output/”, “./labelling/GT_check/”, and optionally into “./labelling/images/” if your inputs are image files
- Execute the shell scripts to compute the benchmark, or use the provided notebooks for interactive benchmarking / debugging.
Technical requirements and summary
You have learned how to perform a benchmark on your custom data with Forms Recognizer as well as how to train a custom model leveraging Forms Recognizer Studio. All you need is:
- Azure Subscription
- Python environment or an Azure Machine Learning workspace
- Forms Recognizer Instance
- Azure Storage Explorer
Now that you know how to measure the performance of Forms Recognizer, you can build your OCR solutions with more confidence.