Speech recognition sub extension; OCR sub extension on Edge; Dictate sub


Genuinely appreciate all your replies! I think a perfect browser requires 3 major sources to generate text with a timeline:

① voice in the videos and audio playing on the page (automatic speech recognition sub);

② text inside the videos or pictures shown on the page (automatic OCR sub);

③ voice from the user (dictate sub).

 

What are these 3 functions used for?

Speech recognition can automatically generate instant subs; the OCR sub will turn picture subs embedded in the video into CC text subs. The dictate sub is used when the voice in the video is not clear or has a strong accent, so we want to dictate the sub ourselves, or when we just want to add some additional sub tracks like titles!

Once they are turned into text CC subs, we can apply translation and TTS to make bilingual subs and even dub any video into someone's native language!

 

Why not just use the sub function of the website?

① Not all websites have a CC sub function, for example Facebook, Twitter, and many other lesser-known websites outside America.

② YouTube has speech recognition subs, but it doesn't offer speech recognition for many rare languages, and even for some major languages like Mandarin and Arabic. YouTube definitely has no OCR subs, yet OCR subs play a very important role: when movies don't have CC subs, picture subs dominate! And in China, news programs and old movies all have picture subs, because accents and poor microphones make picture subs much better than speech recognition subs!

③ Even when we can get a CC sub, the font and format cannot be customized, and we have no TTS dubbing application.

 

How does the automatic speech recognition sub work?

First, detect all the videos and audio playing on the page (like IDM does) and cache them.

Then, use Edge speech recognition to transform the voice in the audio/video into text with a timeline for each single word.
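
Here is a minimal sketch of how this step could look with the Azure Speech SDK for Python, assuming the page audio has already been cached to a WAV file (the key, region, and file name are placeholders, not real values):

```python
# pip install azure-cognitiveservices-speech
import json
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.request_word_level_timestamps()            # ask for per-word offsets
speech_config.output_format = speechsdk.OutputFormat.Detailed

audio_config = speechsdk.audio.AudioConfig(filename="cached_page_audio.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    detailed = json.loads(result.json)
    # The detailed result typically carries per-word offsets/durations
    # in 100-nanosecond ticks under NBest[0]["Words"].
    for word in detailed["NBest"][0]["Words"]:
        start_s = word["Offset"] / 10_000_000
        end_s = start_s + word["Duration"] / 10_000_000
        print(f'{word["Word"]}: {start_s:.2f}s -> {end_s:.2f}s')
```

recognize_once() only covers a single short utterance; a real extension would use continuous recognition for a whole video, but the word-level timeline idea is the same.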

Third, divide the word chains into sentences, clauses, and phrases with natural language analysis, and match a timeline to each sentence, clause, and phrase. (This step is vitally important because it affects later steps like the translation layer; punctuation has a big effect on translation quality.)
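
As a toy illustration of this step (just my assumption, not a real NLP layer), the word timeline from the previous step can be grouped into sentence-level cues on end punctuation, with each cue inheriting the start time of its first word and the end time of its last word:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

@dataclass
class Cue:
    text: str
    start: float
    end: float

def words_to_cues(words: list[Word]) -> list[Cue]:
    """Group a word timeline into sentence-level cues, splitting on end punctuation."""
    cues, current = [], []
    for w in words:
        current.append(w)
        if w.text and w.text[-1] in ".!?。！？":
            cues.append(Cue(" ".join(x.text for x in current),
                            current[0].start, current[-1].end))
            current = []
    if current:  # trailing words without final punctuation
        cues.append(Cue(" ".join(x.text for x in current),
                        current[0].start, current[-1].end))
    return cues

# Two short sentences become two timed cues.
print(words_to_cues([Word("Hello", 0.0, 0.4), Word("world.", 0.5, 0.9),
                     Word("Bye!", 1.2, 1.5)]))
```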

Fourth, apply the translation and phonetic layers for: ① the word being read; ② the sentence being read; ③ the IPA or Latin transcript of the word and sentence being read.
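
A minimal sketch of the translation layer, assuming the Microsoft Translator Text API v3 (key, region, and target language are placeholders; the toScript parameter requests a Latin transliteration where the target language supports one):

```python
# pip install requests
import requests

MS_TRANSLATOR_KEY = "YOUR_KEY"        # placeholder
MS_TRANSLATOR_REGION = "YOUR_REGION"  # placeholder

def translate_cue(text: str, to_lang: str = "zh-Hans") -> dict:
    """Translate one subtitle cue and, where supported, also return a Latin transliteration."""
    resp = requests.post(
        "https://api.cognitive.microsofttranslator.com/translate",
        params={"api-version": "3.0", "to": to_lang, "toScript": "Latn"},
        headers={
            "Ocp-Apim-Subscription-Key": MS_TRANSLATOR_KEY,
            "Ocp-Apim-Subscription-Region": MS_TRANSLATOR_REGION,
            "Content-Type": "application/json",
        },
        json=[{"text": text}],
    )
    resp.raise_for_status()
    # The result contains "text" and, for supported languages, a "transliteration" entry.
    return resp.json()[0]["translations"][0]
```

Note that the Translator service gives a Latin transliteration but not IPA, so the IPA part of the phonetic layer would need a separate dictionary or service.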

Fifth, apply the format layer for the text, which includes: ① font of the text; ② size of the text; ③ color and boldness of the text; ④ position of the text box; ⑤ fill color of the text box; ⑥ animation and sound effects for each word or the whole sentence as they fly in, get emphasized, and wipe out.

We should offer 2 modes: full screen sub vs. playing screen (in-player) sub.
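
Here is a small sketch of what a per-cue format layer might store, with a flag for the two display modes (all names and defaults are my own assumptions, not an existing API):

```python
from dataclasses import dataclass, field

@dataclass
class SubStyle:
    font: str = "Segoe UI"
    size_px: int = 28
    color: str = "#FFFFFF"
    bold: bool = False
    box_fill: str = "rgba(0, 0, 0, 0.5)"        # fill color of the text box
    position: tuple[float, float] = (0.5, 0.9)  # relative x, y of the text box
    fly_in: str = "fade"                        # animation when the cue appears
    emphasis: str = "none"                      # per-word emphasis effect
    wipe_out: str = "fade"                      # animation when the cue disappears
    fullscreen_mode: bool = False               # False = playing screen (in-player) sub

@dataclass
class StyledCue:
    text: str
    start: float
    end: float
    style: SubStyle = field(default_factory=SubStyle)
```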

Sixth, use Azure TTS as the dubbing layer if the viewer likes dubbing. That really helps older people who want to watch foreign news and just listen to the dubbing. The volume of the dubbing track and the original track could be set by the user.
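
A minimal dubbing sketch with the Azure Speech SDK for Python: each translated cue is rendered to a WAV file that the player layer could then mix with the original track at the user's chosen volumes (voice name and file names are placeholders):

```python
# pip install azure-cognitiveservices-speech
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"  # placeholder voice

def synthesize_cue(text: str, out_path: str) -> None:
    """Render one translated cue to a WAV file for the dubbing track."""
    audio_config = speechsdk.audio.AudioOutputConfig(filename=out_path)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                              audio_config=audio_config)
    result = synthesizer.speak_text_async(text).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"TTS failed: {result.reason}")

# synthesize_cue("Good morning, here is today's news.", "cue_0001.wav")
```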

Seventh, make sure the sub stays fixed in the same position no matter how the mouse hovers, clicks, or drags. Why? That makes both the full screen sub and the playing screen sub look just like an embedded sub, yet we can still edit it at any time! Furthermore, if the sub is fixed, we can also apply dictionary extensions to it. With common CC subs, clicking the sub changes its position, so we can't select a word; and if we can't select a word, we can't use a dictionary extension on the words in the CC sub.

 

How does the OCR sub work?

First, cache the video and detect the text embedded in it.

Second, cut the video into individual frames and match the OCR text to each frame to generate the timeline (a rough sketch follows below).

Steps 3 to 7 are exactly the same as for the speech recognition sub.
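
Here is a rough sketch of those two OCR steps, assuming OpenCV for frame sampling and a hypothetical ocr_frame() placeholder standing in for the Azure Read/OCR call; consecutive samples that return the same text are merged into one timed cue:

```python
# pip install opencv-python
import cv2

def ocr_frame(frame) -> str:
    """Placeholder: send this frame to Azure's Read/OCR service and return the detected text."""
    raise NotImplementedError

def ocr_cues(video_path: str, step_s: float = 0.5) -> list[tuple[str, float, float]]:
    """Sample the cached video every step_s seconds, OCR each sample,
    and merge runs of identical text into (text, start, end) cues."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    frame_step = max(1, int(fps * step_s))
    cues, last_text, start_t, idx = [], None, 0.0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            t = idx / fps
            text = ocr_frame(frame)
            if text != last_text:
                if last_text:
                    cues.append((last_text, start_t, t))
                last_text, start_t = text, t
        idx += 1
    if last_text:
        cues.append((last_text, start_t, idx / fps))
    cap.release()
    return cues
```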

 

How does the dictation sub work?

There are 2 kinds of dictation subs:

① Speech dictation sub: used when we dictate the sub for a video that is in a foreign language, has a strong accent, or is not clear enough for its original voice to be recognized.

Beyond the steps described above, the speech dictation sub requires 2 passes of speech recognition:

First, recognize the timeline of each sentence and clause in the original video.

Then, recognize the dictator's speech and make the dictated sub match that timeline. How do we match the timeline of the dictated sub to the original? I wrote an article about it in Chinese (a very simplified sketch of the idea also appears after ② below); translating it feels daunting, but you can check it out if you are interested:

Dictate Sub feature: the key to revolutionizing subtitle software - 知乎 (zhihu.com)

② Dictate shortcut sub: we can just dictate someone's title or a news headline in a video. We can set a voice command to turn on the dictate shortcut sub track and set a template for it. When that track is turned on, you just dictate the content of the track and the whole template is applied.
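
To illustrate the matching idea mentioned under ① (this is only a naive simplification on my part, not the method from the linked article), the simplest possible alignment maps the i-th dictated sentence onto the timeline of the i-th sentence recognized in the original video:

```python
from dataclasses import dataclass

@dataclass
class Cue:
    text: str
    start: float
    end: float

def align_dictation(original: list[Cue], dictated: list[str]) -> list[Cue]:
    """Naively give each dictated sentence the timeline of the matching original sentence."""
    return [Cue(text, src.start, src.end) for src, text in zip(original, dictated)]

# The dictated Mandarin sentences inherit the original English timeline.
original = [Cue("Breaking news tonight.", 0.0, 2.1),
            Cue("Heavy rain is expected tomorrow.", 2.4, 5.0)]
print(align_dictation(original, ["今晚的突发新闻。", "预计明天有大雨。"]))
```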

You might say that a browser shouldn't do what a video editor does, and that we shouldn't do extra work. But sometimes we need more innovation and have to change our minds, because the browser is the backbone of our journey on the Internet. All the pre- and post-processing steps happen around the browser, so why not expand its functions into the other links in this process: input, auto subs, TTS reading, translation assistance, video clipping, video export, publishing, email? They are all part of surfing the Internet. Why not put all these steps in one browser?

 

Like sound tracks, all the text should have its own tracks too! I call them sub tracks: OCR track one, OCR track two, speech track one, etc. Once we have sub tracks, different languages and different people in the same video can be assigned to different tracks, which will make speech recognition more targeted! And users could make a template (font, size, animation, shortcut voice command, and so on) for each track, which lets us make subs more efficiently! (A rough data-model sketch is below.)
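
Here is a small sketch of what such sub tracks and per-track templates might look like as data (all names and fields are my own assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class TrackTemplate:
    font: str = "Segoe UI"
    size_px: int = 28
    animation: str = "fade"
    shortcut_phrase: str = ""   # voice command that switches dictation to this track

@dataclass
class SubTrack:
    name: str                   # e.g. "OCR track one", "Speech track one"
    kind: str                   # "ocr", "speech", or "dictation"
    language: str               # recognition language for this track, e.g. "zh-CN"
    speaker: str = ""           # optional: which person in the video this track follows
    template: TrackTemplate = field(default_factory=TrackTemplate)
    cues: list = field(default_factory=list)

tracks = [
    SubTrack("Speech track one", "speech", "en-US", speaker="anchor"),
    SubTrack("OCR track one", "ocr", "zh-CN"),
    SubTrack("Title track", "dictation", "en-US",
             template=TrackTemplate(size_px=40, shortcut_phrase="title track on")),
]
```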

 

Which MS technologies should we import to realize these functions?

① Azure speech recognition;

② Azure OCR for all Unicode letters in major fonts;

③ Microsoft Translator;

④ Azure TTS (for dubbing).

 

 

15 Replies
Really like the idea. I find OCR extremely useful and often feel the need for it in Edge when visiting foreign-language pages.
Please also share your idea using the feedback button in Edge or (Alt + Shift + I).

@HotCakeX Another reason to turn speech and pictures into CC subs is that it benefits the Bing search engine. If all the speech and pictures on the page are turned into text, the search engine can search that text easily. It's far more useful than just searching the text already on the page!!

@GatesLover Here is something I missed! In full screen mode, the sub can work in 2 other situations:

① When you work on something else, the sub will still be there. I can just listen to the video and do something else.

② When the screen is locked! I save electricity when the screen is all dark and only the sub is shown on the screen.

@GatesLover 


@GatesLover wrote:

@HotCakeX Another reason to turn speech and pictures into CC subs is that it benefits the Bing search engine. If all the speech and pictures on the page are turned into text, the search engine can search that text easily. It's far more useful than just searching the text already on the page!!


Bing can already do that.

Bing does OCR on images.

 

not only OCR, it goes one step further and also detects objects etc.

@GatesLover 


@GatesLover wrote:

@GatesLover Here is something I missed! In full screen mode, the sub can work in 2 other situations:

① When you work on something else, the sub will still be there. I can just listen to the video and do something else.

② When the screen is locked! I save electricity when the screen is all dark and only the sub is shown on the screen.


I think they need to make the subtitles appear in picture-in-picture mode.

Right now PiP only shows the video; it would be useful to have subtitles on it too.

But Bing can't search the text in subs, neither subs embedded in the video nor subs generated purely from the speech. Youku and TikTok can search the text in speech using speech recognition technology.

@HotCakeX I invented a notion of the editable fixed sub! On one hand it's a CC sub that can be edited anytime and whose format can be customized; on the other hand, it behaves like an embedded sub or PiP that is fixed in the same position and cannot be dragged away. The editable fixed sub includes 2 types:

① Inside the picture, always following the size and shape of the picture: when the picture gets larger, the sub gets larger with it; if the picture shrinks, the sub shrinks with it.

② Full screen sub. The full screen sub never follows the picture and always stays in the same position. When you work in other windows, the sub is still there. If you find the sub bothers you, you can drag it away, change its position or size, or just turn it off. When the screen is locked, the sub will still be displayed there in dark mode.


@GatesLover wrote:

But Bing can't search the text in subs, neither subs embedded in the video nor subs generated purely from the speech. Youku and TikTok can search the text in speech using speech recognition technology.


that's something else.

you said "benefit the Bing search Engine. If all the speech and pictures on the page turn into text. The search engine can search the text easily."

and i said Bing already supports that.

 

@GatesLover 


@GatesLover wrote:

@HotCakeX I invented a notion of the editable fixed sub! On one hand it's a CC sub that can be edited anytime and whose format can be customized; on the other hand, it behaves like an embedded sub or PiP that is fixed in the same position and cannot be dragged away. The editable fixed sub includes 2 types:

① Inside the picture, always following the size and shape of the picture: when the picture gets larger, the sub gets larger with it; if the picture shrinks, the sub shrinks with it.

② Full screen sub. The full screen sub never follows the picture and always stays in the same position. When you work in other windows, the sub is still there. If you find the sub bothers you, you can drag it away, change its position or size, or just turn it off. When the screen is locked, the sub will still be displayed there in dark mode.


Better to send these suggestions using the feedback button in Edge.

That's the main way Edge developers receive user feedback.

@HotCakeX When I use the feedback in Edge, I can't see where my feedback goes, and I don't know whether it was seen or not. No one gives me a reply like you do. Maybe you should bring all the Edge feedback into this forum, or unify all the feedback channels into one forum, so that everyone can see and discuss it.

And tips on how to write feedback to the developers would also help a lot in improving feedback quality and efficiency.

@GatesLover 


@GatesLover wrote:

@HotCakeX When I use the feedback in Edge, I can't see where my feedback goes, and I don't know whether it was seen or not. No one gives me a reply like you do. Maybe you should bring all the Edge feedback into this forum, or unify all the feedback channels into one forum, so that everyone can see and discuss it.


When you use the feedback button in Edge or (Alt + Shift + I), it is sent to Microsoft developers. You will receive a confirmation email (if you entered your email address in the feedback window) showing that your feedback was received.

If they have any further questions for you, they will ask via the same email.

If they implement the feature or start working on it, they will reply to you at the same email address.

@GatesLover 


@GatesLover wrote:

And tips on how to write feedback to the developers would also help a lot in improving feedback quality and efficiency.


Yeah no doubt