Speaker identification in transcripts?


Once a stream is completed and ready for playback, how may we:

1) Download the transcript easily (without a select-all-copy hack or multiple clicks in Settings) and

2) Potentially determine who spoke when?


If (2) is not available based on the audio, I could provide a quick prototype on GitHub if it would be useful to others, at least for playback of meetings with <=3 people actively speaking.



Here is what the experience could look like:

1. On the Stream page for the recording, there is an obvious single-click button in the top-right corner that says "Download transcript" [no extra clicks or hacks required]

2. The transcript is downloaded as a simple text file [not "vtt"; e.g. "txt" so that Microsoft quickly opens it with something like Notepad]

3. The downloaded text file has lines like this:

[time] <person>: spoken text


So for example:

[12:20] Bob: good morning, everyone welcome to the meeting.
[12:21] Alice: today we will speak about a new tool on Teams


The above names could easily be determined by who is in the meeting, which is metadata that Teams app already has during a meeting.
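
To make the target format concrete, here is a minimal sketch of converting a VTT transcript into the lines shown above. It assumes the file marks speakers with standard WebVTT voice tags (`<v Name>...</v>`); I have not confirmed that Stream's export includes them, so any cue without a tag falls back to "Unknown". The function name `vtt_to_lines` is just an illustration, not part of any Stream or Teams API.

```python
import re

# Cue timing line, e.g. "00:12:20.000 --> 00:12:24.500" (hours are optional in WebVTT).
TIMING = re.compile(r"^(?:(\d+):)?(\d{2}):(\d{2})\.\d{3}\s+-->")
# Optional WebVTT voice tag wrapping the cue text, e.g. "<v Bob>good morning</v>".
VOICE = re.compile(r"^<v(?:\.[^\s>]+)*\s+([^>]+)>(.*?)(?:</v>)?$")


def vtt_to_lines(vtt_text):
    """Return lines of the form '[mm:ss] Speaker: text' from WebVTT content."""
    out = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n")):
        rows = [r.strip() for r in block.split("\n") if r.strip()]
        idx = next((i for i, r in enumerate(rows) if TIMING.match(r)), None)
        if idx is None:
            continue  # header, NOTE or STYLE block: nothing to emit
        m = TIMING.match(rows[idx])
        # Hours are folded into minutes, so 01:02:03 is written as 62:03.
        minutes = int(m.group(1) or 0) * 60 + int(m.group(2))
        text = " ".join(rows[idx + 1:])
        v = VOICE.match(text)
        # Assumption: speakers are only known when a voice tag is present.
        speaker, text = (v.group(1).strip(), v.group(2)) if v else ("Unknown", text)
        text = re.sub(r"<[^>]+>", "", text).strip()  # drop any remaining inline tags
        if text:
            out.append(f"[{minutes:02d}:{m.group(3)}] {speaker}: {text}")
    return out
```

Writing the result out with `"\n".join(vtt_to_lines(open("meeting.vtt", encoding="utf-8").read()))` would give the plain .txt file described in step 2 (the file name here is only a placeholder).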

Let's make it happen! I'm happy to help. 

We don't need a breakthrough algorithm or much machine learning to do this with enough accuracy to be useful. It's mostly a matter of putting metadata together.


Here are some potential use-cases:

1. Determine how long each person spoke -- this could help derive which topics in the meeting may have been most important

2. Determine who asked the most questions and who answered the most questions -- this could help with follow-up discussions, e.g. if someone answered most questions about a given topic, they could be emailed for follow-up questions.
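
Both use-cases could be roughly approximated from the text file alone. The sketch below assumes the `[mm:ss] Name: text` format proposed earlier, estimates speaking time as the gap until the next line starts (so the last line counts as zero seconds), and treats every "?" as one question. These are simplifying assumptions for illustration, and `speaker_stats` is a hypothetical helper name.

```python
import re
from collections import Counter

# One transcript line in the proposed format, e.g. "[12:20] Bob: good morning".
LINE = re.compile(r"^\[(\d+):(\d{2})\]\s+([^:]+):\s*(.*)$")


def speaker_stats(lines):
    """Return (seconds_per_speaker, questions_per_speaker) as two Counters."""
    parsed = []
    for line in lines:
        m = LINE.match(line)
        if m:  # lines that do not fit the format are skipped
            start = int(m.group(1)) * 60 + int(m.group(2))
            parsed.append((start, m.group(3).strip(), m.group(4)))

    seconds, questions = Counter(), Counter()
    for i, (start, speaker, text) in enumerate(parsed):
        # Approximation: a speaker holds the floor until the next line begins.
        next_start = parsed[i + 1][0] if i + 1 < len(parsed) else start
        seconds[speaker] += max(0, next_start - start)
        # Approximation: each question mark counts as one question asked.
        questions[speaker] += text.count("?")
    return seconds, questions
```

`seconds.most_common()` and `questions.most_common()` then rank the speakers. Counting answers is harder; one crude option would be to credit whoever speaks right after a line containing a question.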

2 Replies

@Quetzalcoatl this is the kind of thing I am after. At present the standard download file has far too much information in it and needs to be trimmed. When using the web utility tool (https://web.microsoftstream.com/VTTCleaner/CleanVTT.html) you end up with data all clumped together, without time stamps or separation of speakers. The technology is there, but the formatting of the output needs improving.


I've resorted to using the web tool, pasting the result into Word, and then running a find+replace on '?', which I can then use to insert spaces. I keep a saved copy of the 'raw' transcript vtt file to reference for timings of speech as needed...

Hi, did you get any success with this?