SOLVED

Editing transcripts: Removing extra lines of data from export


Been testing the transcription in Stream. For a multinational organization it is not really ready for prime time (too many accents for it to do well in English, and no direct support for languages other than Spanish). However, I do see potential in being able to export the auto-generated transcripts of presentations by senior leadership for purely text-related usage. That said, even if one were to tidy up the existing transcript, there are many rows of extra data between each piece of text. Not just the timecode, but lines like this:

 

NOTE Confidence: 0.936690330505371

9eed9142-c299-42ed-96f1-fed2c6617e0c
00:00:21.476 --> 00:00:24.633

unchanged to make homes
cleaner and healthier.

NOTE Confidence: 0.909458994865417

e2af81c5-7559-4a57-8bf7-5f7b2c586c4e
00:00:27.370 --> 00:00:31.400
Delicate wool garments have
always been tricky to care for

 

The captions are not even on one editable row, and there are three lines (plus blank rows not shown here) to be removed between each piece of text. Over an hour-long meeting or town hall presentation, this is a LOT of editing.

 

Has anyone come across a way to export and automate removal of the extra material in order to create a clean text document - a pure transcript and not a caption file?

13 Replies

@dhthompson I've been searching for an easy way to do this as well!


@dhthompson, I found a workaround. Download the transcript from Stream. Select all, then copy and paste into Excel. Do a find and replace on "NOTE*" and replace with nothing (blank). Then do the same for "*-*". That should get rid of everything but the text. Then, to get rid of the blank rows, press Ctrl+G to open the "Go to" popup. Click "Special". Select "Blanks". In the Home menu of Excel, go to the "Cells" section. Click the "Delete" drop-down and select "Delete Sheet Rows". Then I copied the text to Word and read through it. Still not great, but a lot better than with all the data between the transcript text. Hope that helps.
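
For anyone who would rather script this than go through Excel, here is a minimal sketch in Python of the same cleanup. It is not an official tool. It assumes the export looks like the sample in the original post: an optional `WEBVTT` header, single-line `NOTE Confidence: ...` rows, a GUID-style cue ID, and a `start --> end` timecode. The function name `clean_vtt` is just an illustrative choice. Because it matches the ID and timecode rows by pattern rather than by "contains a hyphen", hyphenated caption text should survive.

```python
import re

# Patterns assumed from the sample export shown in the original post.
# Timecode row, e.g. "00:00:21.476 --> 00:00:24.633" (hours are optional in WebVTT).
TIMECODE = re.compile(
    r"^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)
# GUID-style cue identifier, e.g. "9eed9142-c299-42ed-96f1-fed2c6617e0c".
CUE_ID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def clean_vtt(text: str) -> str:
    """Return only the caption text from a VTT export, joined into one string."""
    kept = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank row
        if line == "WEBVTT" or line.startswith("WEBVTT "):
            continue  # file header
        if line == "NOTE" or line.startswith("NOTE "):
            continue  # single-line comment such as "NOTE Confidence: 0.93"
        if CUE_ID.match(line) or TIMECODE.match(line):
            continue  # cue ID or timecode row
        kept.append(line)
    # Captions are split across rows, so join them with spaces.
    return " ".join(kept)
```

To use it, read the downloaded file and write the result out, for example `Path("clean.txt").write_text(clean_vtt(Path("transcript.vtt").read_text(encoding="utf-8")), encoding="utf-8")` with `from pathlib import Path`; both file names are placeholders. The result will still need a read-through, same as with the Excel approach.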

Best Response confirmed by Marc Mroz (Microsoft)
Solution

@Agentjh @dhthompson @mdlau - I just created a short web utility to clean up the Stream transcript VTT files, for when you just want the text from the file without the metadata, timecodes, and blank lines.

 

I linked the utility from the bottom of this help doc page: https://aka.ms/StreamVTTCleaner

 

Give it a try and see if this is useful for you.

 

The web utility I created is just a quick workaround; ideally this would be built into Stream itself. Please add your comments and votes to this idea in our ideas forum: https://techcommunity.microsoft.com/t5/microsoft-stream-ideas/allow-export-of-transcript/idi-p/20546...


@Marc Mroz 

Hi Marc.

I am new to Stream and I have a problem with the captions I have uploaded to my videos via a .vtt file.

 

When I change some of my captions because the sentences were divided wrongly, e.g. by deleting a line I no longer need, I cannot delete the corresponding timeline entry so that it no longer shows. Please see the attached image; I hope it shows my problem.

Is there a way to delete these lines? When I try to click "Remove", nothing happens.

 

Thanks in advance :)


Thank you so much! You should make it easier to find on Google; this is a super relevant tool and so useful. Thanks! @Marc Mroz


Hi @MarcMroz - any tips on getting this to work? I get the option to browse for a file to upload... then nothing!


@Marc Mroz WOW! I just tried your Clean-Up utility and it's amazing! Thank you so much for creating it. I tested it with a meeting that was just over an hour long, and the transcript as downloaded from Stream was 130 pages. Your utility removed all the unnecessary metadata and blank lines in about 2 seconds!

Thank you for sharing it!


@Marc Mroz Thanks. I hadn't been in here for ages, but our org is finally making the move to Stream and this came back to front of mind. Will check it out.


@Marc Mroz  

Your utility for cleaning up transcripts appears to have been removed. It is giving a "404 - Page not found" error. Can you please have someone fix this issue?

Thank you.


@Marc Mroz I used this excellent utility last week for some research work I am doing. It cleaned up my transcription files brilliantly. I've gone to use it today for my final transcription file and am getting a 404, like others have reported. I'd love to see this utility back as soon as possible.


Thank you so much! That utility really made my job easier today. @Marc Mroz 


@Marc Mroz Hi, this is a really fantastic tool you have developed, and I'm hoping to use it to clean up some transcripts I need to analyse for a project I am working on. For some reason, though, the tool only lets me upload the transcript and doesn't produce any output. I wondered whether you might be able to help?