Customized OCR solutions offer the ability to define unique categories within a document or image. Through working with various clients on custom OCR solutions, we often hear the question: "How well does this solution perform on my data?" We have developed an approach that allows for both benchmarking Microsoft’s Forms Recognizer against custom data using Forms Recognizer Studio and training a custom model with ground truth annotations in one process.
Please consult this short summary of all steps to get started. Throughout the blog post, we will provide deep dives on how to execute the benchmark end-to-end.
Before training a custom Form Recognizer model, it is important to have a labeled or annotated data set, also known as the ground truth. To provide an example of the annotation process, we have created a sample image of a scanned hand-written postal address. The ground-truth name is "John Doe" and the address is "000 Fifth Ave, NY 10065, USA", as shown in the figure below:
Step 1: Define the fields in scope. This depends on your specific applications. In our toy image example, we have defined 3 fields in advance: “Name”, “Address”, and “Missed” (containing non-OCRed / non-recognized contents if any).
Step 2: Open Form Recognizer Studio, select “Custom models”, and create a new project (filling in all required fields such as Azure subscription and Storage Account as instructed in the documentation). After the project is created, we can upload our toy image via drag and drop. Note that when you click the image, the built-in Form Recognizer model is automatically triggered in the background to OCR the image (this usually takes 1 or 2 seconds per image).
Step 3: Click the “+” button in the upper right corner to create pre-defined fields. For example, we have created 3 fields in our scenario, including a “Missed” field to capture the missed / non-OCRed contents.
Step 4: Start annotating the image by assigning the relevant OCRed contents to the associated fields. For example, we assign “John Doe” to the “Name” field by hovering over and selecting the relevant characters:
Meanwhile, a JSON file is created automatically in the Blob Container to reflect the annotation progress on the fly. This file captures our annotation result as label-value pairs. You can locate and edit this file by using Azure Storage Explorer.
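For reference, here is a minimal sketch of how such a labels file can be inspected locally after downloading it with Azure Storage Explorer. The file name and schema details are illustrative and may differ between Form Recognizer versions:

```python
import json

# Illustrative only: the generated "<image>.labels.json" file typically contains a
# "labels" list in which each entry holds a field name ("label") and the OCRed
# words ("value") that were assigned to that field during annotation.
with open("toy_image.jpg.labels.json", encoding="utf-8") as f:
    annotation = json.load(f)

for label in annotation.get("labels", []):
    # Join the individual words back into one ground-truth string per field.
    ground_truth = " ".join(item.get("text", "") for item in label.get("value", []))
    print(f'{label.get("label")}: {ground_truth}')
```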
Step 5 (Optional): Correct wrongly OCRed fields and missed contents, if there are any.
In this case, we need to manually correct the zip code in the “Address” field:
The below screenshot illustrates the correction process for the wrongly OCRed scenario:
In this case, we need to manually add it to the “Missed” field:
The below screenshot illustrates the correction process for the missed content scenario:
Great! We have corrected wrongly OCRed fields and missed contents to ensure annotation quality (you can refresh Form Recognizer Studio to see the corrected changes in the dashboard).
Step 6: Move to the next image and repeat the annotation process (i.e., Step 4 or Step 4 + Step 5).
To train your custom neural model for custom entities like in the mail example above, utilize the annotations from the previous step. These annotations will also be used for benchmarking later. To begin training, simply click the "Train" button located in the top right corner of Form Recognizer Studio. For detailed instructions on combining separate models into one, refer to the documentation provided. This documentation also explains how to test your newly trained Form Recognizer instance. Please note that if you only need OCR or generic entities, you can also use the General Document API.
It is also important to remember that when testing your trained Forms Recognizer instance, you should use documents that were not part of the training process. For example, if you have annotated 100 images, use 80-90 for training and the remaining images for testing. The annotations made on the test images can be used to measure the performance of the OCR and field/entity recognition in the next step.
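One simple way to create such a split is sketched below, assuming the annotated images have been downloaded to a local folder (the path and extension are illustrative):

```python
import random
from pathlib import Path

# Hypothetical local folder containing the annotated images downloaded from the
# Blob Container; adjust the path and extension to your data.
annotated_images = sorted(Path("data/annotated").glob("*.jpg"))

random.seed(42)                                  # make the split reproducible
random.shuffle(annotated_images)

split_index = int(len(annotated_images) * 0.8)   # e.g. 80% train / 20% test
train_files = annotated_images[:split_index]
test_files = annotated_images[split_index:]

print(f"{len(train_files)} training images, {len(test_files)} test images")
```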
To perform an OCR benchmark, you can directly download the outputs from Azure Storage Explorer. You can access the benchmarking code in the following repository. Simply paste the downloaded data directory from Storage Explorer into the root of the project that you downloaded from GitHub. When using Storage Explorer, your subscription-level document tree may appear as follows:
Throughout this section, we will distinguish between measuring the performance of a custom Form Recognizer model on two levels: the raw OCR output and the recognized fields/entities.
The metric used to answer these questions is word similarity based on the Levenshtein distance. Briefly, the Levenshtein distance is a way to measure how different two words or phrases are from each other. We will use the fuzzywuzzy implementation of this, which gives us scores on the following scale:
0 = there is no similarity between string A and B; 100 = A and B are the same word.
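To make this concrete, here is a minimal example of how the score behaves on an exact and a partial match (install fuzzywuzzy, optionally with python-Levenshtein for speed):

```python
# pip install fuzzywuzzy python-Levenshtein
from fuzzywuzzy import fuzz

print(fuzz.ratio("000 Fifth Avenue", "000 Fifth Avenue"))    # 100 - identical strings
print(fuzz.ratio("156 Denison Street", "15 Deni Street"))    # 88  - partial match
```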
In order to compute this metric, we will introduce a code base that requires the following inputs:
You have the option to manually reorganize the files and separate the images, labels, and OCR files, or you can download the entire folder and use the provided file distribution script within the repository as a template. Please note that this template might require customization.
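If you prefer to script the reorganization yourself, a minimal sketch is shown below. The folder names and file suffixes are assumptions; adjust them to the files in your container and to the structure expected by the benchmarking repository:

```python
import shutil
from pathlib import Path

# Hypothetical folder names and file suffixes - adjust them to your container
# and to the structure expected by the benchmarking repository.
source = Path("downloaded_container")
targets = {
    ".labels.json": Path("data/labels"),   # annotation files created by the Studio
    ".ocr.json": Path("data/ocr"),         # OCR/layout output created by the Studio
    ".jpg": Path("data/images"),           # the scanned images themselves
}

for suffix, target_dir in targets.items():
    target_dir.mkdir(parents=True, exist_ok=True)
    for file in source.glob(f"*{suffix}"):
        shutil.copy(file, target_dir / file.name)
```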
In order to obtain the field/entity recognition results, you must first train a custom Form Recognizer model. After training, you can find the trained model under the "Test" tab. The image provided illustrates the location of the trained model in the Studio and the location of the download button for the custom model output. You have the option to download the files from the user interface, or to navigate to the code tile next to "result" and retrieve the results through API calls using an IDE.
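For the API route, a minimal sketch using the azure-ai-formrecognizer Python SDK is shown below. The endpoint, key, model ID, and file name are placeholders for your own resource and trained model:

```python
# pip install azure-ai-formrecognizer
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Placeholders: use the endpoint and key of your own resource and the ID of the
# custom model you trained in Form Recognizer Studio.
client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("test_image.jpg", "rb") as document:
    poller = client.begin_analyze_document("<your-custom-model-id>", document)
result = poller.result()

# Each analyzed document exposes the predicted fields with a confidence score.
for analyzed_document in result.documents:
    for name, field in analyzed_document.fields.items():
        print(name, field.content, field.confidence)
```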
The provided README.md file explains which outputs to expect and possible customizations.
Once you have completed the annotation of your files and, if necessary, trained a custom model for field/entity recognition, you can proceed to evaluate the performance of your OCR solution. Two shell files are provided for executing the necessary scripts, or you can use the provided Python notebooks (.ipynb) for an interactive approach to calculating the benchmarks. You can use Azure Machine Learning Studio or your preferred local IDE to perform the calculations, as long as you have a Python environment with the necessary dependencies. The produced outputs place the predicted/extracted value next to the annotated ground truth and may look as follows:
| Filename | Entity | True Value | Extracted Value | Confidence Score | Fuzzy Score |
|----------|--------|------------|-----------------|------------------|-------------|
| File_1 | Address | 000 Fifth Avenue | 000 Fifth Avenue | 0.99 | 100 |
| File_1 | Address | 156 Denison Street | 15 Deni Street | 0.87 | 88 |
This lets you check each annotated word/entity for its word similarity. You also have the option to use this transactional side-by-side comparison as an input for a dashboard such as Power BI.
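As an illustration, here is a minimal sketch of how such a side-by-side report could be assembled and exported as a dashboard data source. The input dictionaries are illustrative; in practice, the repository scripts build them from the downloaded labels files (ground truth) and the custom model results (predictions):

```python
import pandas as pd
from fuzzywuzzy import fuzz

# Illustrative inputs keyed by (filename, entity); replace with values parsed
# from your labels and result files.
ground_truth = {("File_1", "Address"): "156 Denison Street"}
predictions = {("File_1", "Address"): ("15 Deni Street", 0.87)}

rows = []
for (filename, entity), true_value in ground_truth.items():
    extracted_value, confidence = predictions.get((filename, entity), ("", 0.0))
    rows.append({
        "Filename": filename,
        "Entity": entity,
        "True Value": true_value,
        "Extracted Value": extracted_value,
        "Confidence Score": confidence,
        "Fuzzy Score": fuzz.ratio(true_value, extracted_value),
    })

report = pd.DataFrame(rows)
report.to_csv("benchmark_report.csv", index=False)   # e.g. as a Power BI data source
```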
Step-by-step summary:
You have learned how to perform a benchmark on your custom data with Forms Recognizer as well as how to train a custom model leveraging Forms Recognizer Studio. All you need is:
Now that you know how to measure the performance of Forms Recognizer, you can build your OCR solutions with more confidence.