Transcription capability isn't very good

Hi - The new features of Stream are cool, but so far we've found that the transcription/closed-captioning services in Stream are not good (specifically, accuracy is poor) compared with some other tools we use for this task.

Is there any work being done to improve in this area?

Agreed. We are finding that videos containing background music don't transcribe.
We also use Skype Broadcast, and find its transcription capability is fairly good. I thought that Stream and Broadcast were using the same technology under the covers; however, given the results we see, I guess that isn't the case.

@Adrian Hyde, I am sorry to hear that you are unsatisfied with the transcription quality :( 


You are correct, the Skype Broadcast solution is utilizing the same core technology -- would you be able to share examples of the discrepancy in output?  


We are aware of the issues that we have with background noise/music, and unfortunately there is nothing we can do in the short-term to fix this.

Hey @Adarsh Solanki - For Skype Broadcast versus Stream, I can only provide anecdotal evidence that one works better than the other. We have not run the same media through both and compared.


The one area where we do see a significant difference, however, is transcription in Stream versus a third party we have typically used in the past (3PlayMedia) for this function. We have run several videos through both and find 3PlayMedia much more accurate.


I'm open to suggestions on how we could improve this. Should I take this up with the Azure Media folks? Or are there some settings within Stream we can tweak to see if performance can be improved?

@Adrian Hyde:

Fortunately, I am the Azure Media Services contact for Speech-to-text :) 


I would hesitate to make any assumptions about quality without testing on identical content, as there are many subtle variables that can lead to low-quality transcription. Stream should have transcript quality that is on par with other Microsoft services utilizing speech-to-text. Note that Stream shows the unedited automatic transcript. We are currently building a feature that will allow a user to edit the automatic transcript to fix any errors.


Re: 3PlayMedia, this service utilizes human editing in addition to automatic transcript generation. A fairer comparison would be to take the output of our automatic transcript generation and send it to a human transcript editor to correct the transcript prior to publishing.
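
For anyone who wants to run the like-for-like test on identical content, here is a minimal sketch of a word error rate (WER) comparison. It assumes you have a human-verified reference transcript and the raw automatic output from each service as plain text. The function name and the sample strings are illustrative only and are not part of Stream or Azure Media Services.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder text: substitute your own reference and service outputs.
    reference = "the quick brown fox jumps"
    service_a = "the quick brown dog jumps"
    service_b = "the quick fox jumps"
    print("Service A WER:", word_error_rate(reference, service_a))
    print("Service B WER:", word_error_rate(reference, service_b))
```

A lower value means fewer word-level errors against the reference. This sketch only lowercases the text; punctuation and number formatting are not normalized, so clean both transcripts the same way before comparing.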

Hi, are there any plans to improve the auto-generated transcription accuracy? We have tried it a couple of times and the output isn't anywhere close to what is actually being said (even without any background music or noise). We would have to spend a lot of time editing the transcription for it to be of any benefit, so we just don't use it.

Hi Trevor -

We are absolutely planning to improve the auto-generated captions as much as we can!  Sorry to hear that you are having to go through the trouble of editing your captions to provide value.  We hope to decrease our error rate considerably in the coming year as we upgrade the underlying speech infrastructure.