Text Recognition for Video in Microsoft Video Indexer

Community Manager

In Video Indexer, we can recognize text displayed in videos. A common misconception is that AI for video simply means extracting frames from a video and running a computer vision algorithm on each one. Video processing is much more than applying an image processing algorithm to individual frames. For example, at 30 frames per second, a minute-long video contains 1,800 frames, which produces a lot of data but not many meaningful words. A separate blog post covers how AI for video is different from AI for images: https://azure.microsoft.com/en-us/blog/how-is-ai-for-video-different-from-ai-for-images/

 

Humans have cognitive abilities that let them fill in hidden parts of text and resolve local defects caused by poor video quality. OCR applied directly has no such ability, so on its own it is not sufficient for automatic text extraction from videos. In Video Indexer, we developed and implemented a dedicated approach to tackle this challenge.
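
To make the frame-count argument concrete, here is a minimal Python sketch. It is not Video Indexer's actual algorithm, which this post does not describe. It assumes some OCR step has already produced one text string per frame, and it uses two hypothetical helpers, `smooth` and `consolidate`, to show how looking across neighboring frames can turn 1,800 noisy per-frame results into a handful of text spans. The sample captions and the window size of 5 are made up for illustration.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class TextSpan:
    text: str
    start_frame: int
    end_frame: int  # inclusive


def frame_count(fps: int, seconds: int) -> int:
    """Number of frames in a clip of the given length."""
    return fps * seconds


def smooth(per_frame_text: list[str], window: int = 5) -> list[str]:
    """Majority vote over a centered window of neighboring frames.

    A single misread frame is outvoted by the frames around it.
    """
    half = window // 2
    smoothed = []
    for i in range(len(per_frame_text)):
        lo = max(0, i - half)
        hi = min(len(per_frame_text), i + half + 1)
        most_common_text, _ = Counter(per_frame_text[lo:hi]).most_common(1)[0]
        smoothed.append(most_common_text)
    return smoothed


def consolidate(per_frame_text: list[str]) -> list[TextSpan]:
    """Merge runs of consecutive frames that carry the same text."""
    spans: list[TextSpan] = []
    for i, text in enumerate(per_frame_text):
        if not text:
            continue  # no text detected in this frame
        if spans and spans[-1].text == text and spans[-1].end_frame == i - 1:
            spans[-1].end_frame = i
        else:
            spans.append(TextSpan(text, i, i))
    return spans


# Hypothetical per-frame OCR output for a one-minute clip at 30 fps:
# two captions, plus one frame where the OCR misread a character.
frames = ["BREAKING NEWS"] * 900 + ["WEATHER"] * 900
frames[450] = "BREAK1NG NEWS"

raw_spans = consolidate(frames)
clean_spans = consolidate(smooth(frames))

print(frame_count(30, 60), "frames")
print(len(raw_spans), "spans without smoothing")
print(len(clean_spans), "spans with smoothing")
```

Without smoothing, the one misread frame splits the first caption into three spans. With the majority vote, the clip reduces to two spans, one per caption. A real system would also need to handle partial occlusion, moving text, and near-duplicate strings, which this sketch ignores.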

 

Read more about it in the Azure blog: https://azure.microsoft.com/en-us/blog/text-recognition-for-video-in-microsoft-video-indexer/
