With the current work-from-home situation due to Covid-19, most people in my meetings are joining via their own mic. And even if they were sharing a mic, it should be possible to infer from the characteristics and pitch of the voices which sentences belong to which person.
I propose adding a voice ID attached to each of the voices that are transcribed, so that every chunk in the .vtt file is tagged with its speaker.
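For reference, the WebVTT spec already defines a voice span (`<v>`) for annotating a cue with its speaker, so the export would not even need a new format. Tagged cues could look something like this (speaker names here are just placeholders):

```
WEBVTT

00:00:01.000 --> 00:00:04.000
<v Person A>I think the onboarding flow is confusing.</v>

00:00:04.500 --> 00:00:07.000
<v Person B>Agreed, the second step needs work.</v>
```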
In my use case, I was trying to use the VTT file to summarize a feedback session, including who said what, but this was not efficient. A voice ID could enable exporting a more readable version ("Person A said this, Person B said this, ..." instead of the raw VTT format), which I could then edit into a summary of the discussion, but for that I need to know who said which chunk. As far as I can tell, the current .vtt export does not support this.
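To illustrate, if each cue carried a WebVTT voice span (`<v Speaker>`), turning the transcript into the readable "Person A: ..." form I describe would be a few lines of scripting. This is just a sketch under that assumption; the function name and sample transcript are hypothetical:

```python
import re

def vtt_to_speaker_lines(vtt_text):
    """Convert WebVTT cues containing <v Speaker> voice spans into
    'Speaker: text' lines. Hypothetical helper; assumes the export
    tags each cue with a voice span."""
    # Match voice spans like <v Person A>Hello.</v>
    # (the closing </v> is optional at the end of a cue per the spec)
    return [
        f"{speaker.strip()}: {text.strip()}"
        for speaker, text in re.findall(r"<v\s+([^>]+)>([^<]+)", vtt_text)
    ]

# Sample input with placeholder speakers, for illustration only
sample = """WEBVTT

00:00:01.000 --> 00:00:04.000
<v Person A>I think the onboarding flow is confusing.</v>

00:00:04.500 --> 00:00:07.000
<v Person B>Agreed, the second step needs work.</v>
"""

for line in vtt_to_speaker_lines(sample):
    print(line)
```

With voice IDs in place, this prints one "Person A: ..." line per cue, which is exactly the editable summary format I'm after.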
I also think a voice ID would benefit deaf viewers in cases where participants did not turn on their cameras and you can't actually see whose lips are moving: you could add "Person A" (alongside the transcript) to the captions more easily.