Generating OCR Insight in Videos – the Story of a Successful Microsoft Collaboration
Published Oct 03 2022 09:32 AM

What is special about OCR for videos 

Detecting and recognizing text within images are important tasks that fortunately can be addressed with available models. Specifically, the Read OCR model, developed by the Microsoft Cognitive Services team, offers impressive precision and can detect text across a wide range of sizes. However, this model is built for images, not videos. Therefore, you need to combine the OCR capability with video analysis techniques to effectively solve the challenge of OCR in the video domain. 

Videos, specifically in the Media industry but also in other domains, contain a lot of information within the text on screen. Examples range from the speaker’s name to street views that include shops, advertisements, and even web pages that are presented as part of the video. In all cases the text is embedded in the image, and not part of the audio (and therefore cannot be transcribed). A lot of valuable information is hidden in that text and unlocking this information is an important part of the video analysis provided by Azure Video Indexer. 

OCR in videos is a special problem not only because the text adds valuable context to the scenes and content, but also because video brings its own challenges and opportunities. On the one hand, video can introduce mistakes into OCR detections: for example, if someone passes in front of a sign and hides part of the text, the extraction will be incomplete. On the other hand, videos contain more information than individual images because they consist of multiple frames, and we can analyze these frames together to generate better OCR results. 


Why not use the image-based OCR model as is 

Videos contain frames showing transitions over time - with scenes, people, locations etc. It’s important to accurately place insights from the video in their right place along the video timeline. However, it’s also important not to overload the user with the same information repeatedly. For example, the analysis from each frame should not be output as is; instead, it needs to be aggregated over time for it to be useful for the user. Moreover, videos are dynamic and contain complexity that introduces new types of mistakes. Therefore, we want smart aggregation over a timeline that considers the potential errors and corrects them.  

There are two key challenges when creating an OCR insight for videos. The first is due to the nature of the model, and the other is due to the nature of videos: 

  1. Consider the following scenario, as in example 1: the Microsoft sign appears behind the main speaker – Satya. However, in a few frames it is obstructed by Satya. The viewer of the video knows that the full “Microsoft” sign is always there, and this is the expected result of the OCR insight of the video as well. However, if we take the output of the OCR model as is from each frame, we end up providing the user with multiple results for the same sign: in one frame only the word “Microso” is visible, in another only “Micro”, etc. This is of course wrong and confusing. 
  2. Read OCR is an AI model and therefore may not be 100% accurate all the time. If we apply the model to a single image, we cannot know whether the result is correct or an error. However, in the video domain we have the advantage of knowing the OCR results on nearby frames and learning from them. Therefore, if the model introduces a mistake in one frame while it extracts correctly in another, we would like our video analysis solution to correct this mistake without showing it to the user. As in example 2, in one frame the model mistakenly detected the text as “Microsolt”, while in other frames it found the correct text, “Microsoft”. Our solution should present only the “Microsoft” result to the user, with the time of both frames aggregated. 

Example 1: The Microsoft sign appears behind Satya. While fully visible in frame 5550, in frame 5580 it is partially hidden and only the "Microso" part of the sign can be detected by the OCR.

Example 2: The two frames seem very similar; however, the OCR detects "Microsolt" in frame 77760 and "Microsoft" in frame 77790.


How we can leverage information from nearby frames to improve results 

Now that we are fully motivated to provide a correct analysis of OCR in video, we can think about this smart aggregation problem as a clustering algorithm with a time constraint (we only aggregate results from nearby frames). The first step of any clustering algorithm is to choose a good distance metric, one that measures the distance between two predictions in different frames. We want to answer the question: “how likely is it that these two predictions come from the same real text in the video?”. Our knowledge includes the location of the text in the frame and the recognized characters. We would like to leverage both, so we create a spatial-textual score: the product of a textual distance and a spatial distance, which can be, for example, the Levenshtein distance and IoU (Intersection over Union). We can make this metric smarter by considering the nature of our specific OCR model; for example, if the model tends to add characters at the beginning or end of a word, we can be more tolerant of such mistakes. The assumption underlying this metric is that if two detections are indeed of the same original text, both their spatial and textual distances should be very small, but not necessarily zero, since there can be a mistake in the detection or an occlusion of the text, and camera movement can change the location of the text within the frame. 
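The spatial-textual score described above can be sketched as follows. This is an illustrative reconstruction, not the production code: the function names, the box format, and the choice to multiply a normalized Levenshtein similarity by IoU are assumptions based on the description.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def iou(box_a, box_b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def spatial_textual_score(text_a, box_a, text_b, box_b) -> float:
    """Higher score = more likely both detections come from the same text.
    Textual similarity (1 - normalized edit distance) times spatial IoU."""
    max_len = max(len(text_a), len(text_b)) or 1
    textual = 1.0 - levenshtein(text_a, text_b) / max_len
    return textual * iou(box_a, box_b)
```

With this formulation, "Microsoft" and "Microsolt" detected at the same location score high (one substitution out of nine characters), while unrelated text at a distant location scores zero.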

The next task is to perform the aggregation itself. We want to measure the distance between detections from adjacent frames, and because the performance of the algorithm matters (this solution runs in production), we choose a greedy heuristic: we process one frame after the other and merge every detection that falls below our distance threshold into an existing cluster, building the clusters along the way. The threshold is determined by testing over our labeled data.  
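A minimal sketch of this greedy, frame-by-frame aggregation is shown below. The data layout, the "nearby frames" window of one sampled frame, and the use of a similarity score (higher is closer, so merging happens above a threshold) are all illustrative assumptions; the real thresholds are tuned on labeled data.

```python
THRESHOLD = 0.5  # illustrative value; in practice tuned on labeled data

def greedy_cluster(frames, score_fn, threshold=THRESHOLD):
    """frames: list of lists of detections, one inner list per sampled frame.
    Each detection is a dict with 'text' and 'box'. Returns a list of
    clusters, each a list of (frame_index, detection) pairs."""
    clusters = []
    for f_idx, detections in enumerate(frames):
        for det in detections:
            best, best_score = None, threshold
            for cluster in clusters:
                last_frame, last_det = cluster[-1]
                if f_idx - last_frame > 1:   # time constraint: nearby frames only
                    continue
                s = score_fn(det['text'], det['box'],
                             last_det['text'], last_det['box'])
                if s > best_score:
                    best, best_score = cluster, s
            if best is not None:
                best.append((f_idx, det))    # greedy merge into best open cluster
            else:
                clusters.append([(f_idx, det)])  # start a new cluster
    return clusters
```

Because each detection is compared only against clusters still "open" in recent frames, the cost stays roughly linear in the number of detections rather than quadratic over the whole video.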


How do we correct mistakes with clustering? 

It is important to apply clustering for several reasons. First, from a usability perspective, we want to provide the user with a single cluster of predictions representing the same text over multiple frames (instead of each frame separately, which would overload the user with redundant information). More importantly, clustering gives us a powerful tool to correct the mistakes of the OCR, whether driven by occlusions or by the OCR model itself. By finding the best representative of each cluster and presenting it to the user as the only text for all the frames that participate in it, we can correct mistakes in the other cluster members. We therefore need to choose the representative smartly, again by leveraging the Cognitive Services Read OCR model. In our OCR model, we see that the confidence of a prediction is a particularly good indicator of its quality, together with the length of the prediction, as Read does not tend to add characters when they are not present; in other words, insertion errors are rare. We therefore look for the best prediction in the cluster in terms of longest text and highest confidence.  
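One simple way to encode this "longest text, highest confidence" preference is a lexicographic ranking, sketched below. The field names and the exact tie-breaking rule are assumptions; the post only states that length and confidence are the two signals.

```python
def choose_representative(cluster):
    """cluster: list of detections, each a dict with 'text' and 'confidence'.
    Prefer the longest text (insertion errors are rare, so longer is more
    complete), breaking ties by OCR confidence."""
    return max(cluster, key=lambda d: (len(d['text']), d['confidence']))
```

For the clusters in examples 1 and 2, this picks "Microsoft" over the occluded "Microso" (shorter) and over the misread "Microsolt" (same length, lower confidence).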


Looking at the full solution 

Now that we have a smart clustering algorithm with a spatial-textual distance metric and a method to choose the cluster representative, we can view the full solution (Figure 1). It is important to note that we exclude predictions with very low confidence from the solution, and we also use a naïve clustering for numbers-only predictions. A naïve clustering clusters only by exact match, which is more suitable for numbers because, unlike in words, every digit in a number can be legitimate and important to the overall meaning. 
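The naïve exact-match path for numbers-only predictions can be sketched in a few lines; the input layout here is an assumption for illustration.

```python
from collections import defaultdict

def exact_match_cluster(detections):
    """detections: list of (frame_index, text) for numbers-only predictions.
    Groups frames only by identical text, since changing a single digit
    changes the meaning (e.g. a price, a year, a phone number)."""
    clusters = defaultdict(list)
    for frame, text in detections:
        clusters[text].append(frame)
    return dict(clusters)
```

Here "2022" and "2023" stay in separate clusters, whereas the fuzzy metric above might have merged them as a one-character difference.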

It is also important to note that, from a product architecture point of view, we need to consider time and memory. We therefore optimize the number of frames we sample and send to the OCR (frames per second), and we write the algorithm to work on batches of frames instead of the whole video at once.  


Figure 1: The full solution


Optimizing the algorithm and data labeling 

As in any Data Science project, we need labeled data. This is not an easy task, and in fact this project has a twist: we asked our labelers to label only the OCR model predictions over the sampled frames, and not all the text in the video, since we are not aiming to optimize the OCR model itself. Because our dataset includes videos with a lot of text (street views, protests, stores, etc.), this choice decreases the labelers' workload significantly. Our questions for the labelers include: Is this a correct prediction? If not, what is correct? And finally, what is more correct? For the “Microso” prediction, the answer is that it is correct (as the rest of the sign is not visible in this frame), but it is more correct to be “Microsoft”. 

When we optimize the solution (tune the thresholds) and measure its performance, we consider three measurements: precision, recall, and the number of clusters. The number of clusters is crucial for the usability aspect of the algorithm, since it is the only metric that captures the difference between keeping “Microso” on its own, which is correct as a single-frame prediction, and clustering it with “Microsoft” (thereby reducing the number of clusters) and making “Microsoft” the representative of its cluster (without harming precision). 
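A rough sketch of these three measurements over cluster representatives might look as follows. This is an assumed simplification: it treats labels as a set of ground-truth texts and ignores the timeline and the "more correct" labeling nuance described above.

```python
def evaluate(representatives, labels):
    """representatives: the predicted text of each cluster;
    labels: the set of ground-truth texts appearing in the video.
    Returns (precision, recall, number_of_clusters)."""
    if not representatives or not labels:
        return 0.0, 0.0, len(representatives)
    correct = sum(1 for r in representatives if r in labels)
    precision = correct / len(representatives)
    found = sum(1 for l in labels if l in representatives)
    recall = found / len(labels)
    return precision, recall, len(representatives)
```

Merging "Microso" into the "Microsoft" cluster lowers the cluster count without lowering precision, which is exactly the trade-off the third metric is meant to expose.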

We created a dataset of such videos and labels and evaluated the performance of our algorithm. Compared to a naïve clustering over the whole video, we were able to reduce the average number of clusters by more than half, increase precision by five points, and only slightly reduce recall.  


OCR insight in Azure Video Indexer 

Azure Video Indexer is a Microsoft tool for creating insights from videos, and this OCR algorithm is now part of it, enabling the user to easily capture the text that appears on screen, together with the full transcription of the video. In Figures 2 and 3 we can see the Video Indexer UI (User Interface) showing the indexing results for the two examples shown before. We no longer see the “Microsolt” mistake, only the “Microsoft” prediction, for the whole time the sign is presented. The same holds for the “Microsoft” sign from example 1: even though it is partially hidden in some frames, it is presented as one “Microsoft” cluster covering all of them.  

This solution represents a combination of better performance and better user experience, altogether making it another great insight by Video Indexer. 

Figure 2: Azure Video Indexer UI with the correct OCR insight for example 1

Figure 3: Azure Video Indexer UI with the correct OCR insight for example 2


Join us and share your feedback 

For those of you who are new to our technology, we encourage you to get started today with these helpful resources: 
