SOLVED

Editing transcripts: Removing extra lines of data from export

Copper Contributor

I've been testing transcription in Stream. For a multinational organization it is not really ready for prime time (too many accents for it to do well in English, and no direct support for languages other than Spanish). However, I do see potential in being able to export the auto-generated transcripts of senior leadership presentations for purely text-related usage. That said, even if you tidy up the exported transcript, there are many rows of extra data between each piece of text. Not just the timecode, but this:

 

NOTE Confidence: 0.936690330505371

9eed9142-c299-42ed-96f1-fed2c6617e0c
00:00:21.476 --> 00:00:24.633

unchanged to make homes
cleaner and healthier.

NOTE Confidence: 0.909458994865417

e2af81c5-7559-4a57-8bf7-5f7b2c586c4e
00:00:27.370 --> 00:00:31.400
Delicate wool garments have
always been tricky to care for

 

The captions are not even on one editable row, and there are three lines (plus blank rows not shown here) to remove between each piece of text. Over an hour-long meeting or town hall presentation, this is a LOT of editing.

 

Has anyone come across a way to export and automate removal of the extra material in order to create a clean text document - a pure transcript and not a caption file?

26 Replies

@dhthompson I've been searching for an easy way to do this as well!

@dhthompson, I found a workaround. Download the transcript from Stream. Select all, then copy and paste into Excel. Do a find and replace on "NOTE*" and replace with nothing (blank). Then do the same for "*-*". That should get rid of everything but the text (be aware that "*-*" also clears any caption row that contains a hyphen). Then, to get rid of the blank rows, press Ctrl+G to open the "Go To" popup. Click "Special". Select "Blanks". In the Home menu of Excel, go to the "Cells" section. Click the "Delete" drop-down and select "Delete Sheet Rows". Then I copied the text to Word and read through it. Still not great, but a lot better than with all the data between the transcript text. Hope that helps.
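
For anyone who would rather script this than go through Excel, here is a minimal Python sketch of the same filtering idea. It assumes the export looks like the sample in the original post (a WEBVTT header, NOTE lines, a GUID cue ID, then a timecode line). The function name clean_vtt_lines and both patterns are my own guesses from that sample, not anything documented by Stream. It matches timecode and GUID lines specifically rather than anything with a hyphen, so hyphenated words in the captions are kept.

```python
import re

# Assumed line shapes, taken from the sample export quoted in this thread.
GUID = re.compile(r"^[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")


def clean_vtt_lines(text):
    """Return only the caption text lines from a Stream VTT export."""
    kept = []
    for raw in text.splitlines():
        # Drop surrounding whitespace and a byte-order mark, if the file has one.
        line = raw.strip().lstrip("\ufeff")
        if not line:
            continue  # blank row
        if line == "WEBVTT" or line == "NOTE" or line.startswith("NOTE "):
            continue  # file header or NOTE metadata
        if GUID.match(line) or TIMECODE.match(line):
            continue  # cue ID or timecode
        kept.append(line)
    return kept
```

Calling "\n".join(clean_vtt_lines(text)) gives one caption line per row, which is roughly what the Excel steps leave you with.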

best response confirmed by VI_Migration (Silver Contributor)
Solution

@Agentjh @dhthompson @mdlau  - I just created a short web utility to clean up the Stream transcript VTT files for when you just want to get the text from the file without the metadata, time codes, and blank lines.

 

I linked the utility from the bottom of this help doc page: https://aka.ms/StreamVTTCleaner

 

Give it a try and see if this is useful for you.

 

The web utility I created is just a quick workaround; ideally this would be built directly into Stream itself. Please add your comments and votes to this idea in our ideas forum: https://techcommunity.microsoft.com/t5/microsoft-stream-ideas/allow-export-of-transcript/idi-p/20546...

@Marc Mroz 

Hi Marc.

I am new to Stream and I have a problem with the captions I have uploaded to my videos via a .vtt file.

 

When I change some of my captions because the sentences were divided wrongly, e.g. deleting a line I no longer need, I cannot delete the timeline so that it no longer shows. Please see the attached image; I hope it shows my problem.

Is there a way to delete these lines? When I try to click "Remove", nothing happens.

 

Thanks in advance :)

Thank you so much! You should make it more visible on Google. This is a super relevant tool and so useful. Thanks! @Marc Mroz 

Hi @MarcMroz - any tips on helping this to work? I get the option to browse for a file to upload....then nothing!

@Marc Mroz WOW! I just tried your clean-up utility and it's amazing! Thank you so much for creating it. I tested it with a meeting that was just over an hour, and the transcript as downloaded from Stream was 130 pages. Your utility removed all the unnecessary metadata and blanks in about 2 seconds!

Thank you for sharing it!

@Marc Mroz Thanks. I hadn't been in here for ages but our org is finally making the move to Stream and this came up front of mind again. Will check it out.

@Marc Mroz  

Your utility for cleaning up transcripts appears to have been removed. It is giving a 404 - Page not found error. Can you please have someone fix this issue?

Thank you.

@Marc Mroz I used this excellent utility last week for some research work I am doing. It cleaned up my transcription files brilliantly. I've gone to use it today for my final transcription file and am getting a 404, like others have reported. I'd love to see this utility back as soon as possible.

Thank you so much! That utility really made my job easier today. @Marc Mroz 

@Marc Mroz Hi, this is a really fantastic tool you have developed, and I'm hoping to use it to clean up some transcripts I need to analyse on a project I am working on. For some reason though, the tool will only allow me to upload the transcript and won't produce any output - I wondered whether you might be able to help?

@dhthompson 

 

I followed the instructions from @Agentjh and put together the following macro. All I do now is open the downloaded VTT file in Notepad, select all, paste into Excel, then run this macro. It works for me and I hope it helps others. This is the first macro I have ever created, so please be gentle if it's not particularly elegant!

 

Sub TranscriptCleaner()
'
' TranscriptCleaner Macro
' A macro to clean up VTT files downloaded from Microsoft Stream.
' Assumes the file contents were pasted into column A of the active sheet.
'
    With ActiveSheet.Columns("A")
        ' xlWhole makes each pattern match the whole cell, so caption text
        ' that merely contains a hyphen or "00" is left alone.

        ' Timecode rows, e.g. 00:00:21.476 --> 00:00:24.633
        .Replace What:="*-->*", Replacement:="", LookAt:=xlWhole, _
            SearchOrder:=xlByRows, MatchCase:=False
        ' Cue ID rows (GUIDs in 8-4-4-4-12 form); ? matches any single character
        .Replace What:="????????-????-????-????-????????????", Replacement:="", _
            LookAt:=xlWhole, SearchOrder:=xlByRows, MatchCase:=False
        ' File header row
        .Replace What:="WEBVTT", Replacement:="", LookAt:=xlWhole, _
            SearchOrder:=xlByRows, MatchCase:=False
        ' NOTE rows, e.g. NOTE Confidence: 0.936690330505371
        .Replace What:="NOTE *", Replacement:="", LookAt:=xlWhole, _
            SearchOrder:=xlByRows, MatchCase:=False

        ' Delete the emptied rows. SpecialCells raises an error when there
        ' are no blank cells, so skip the step in that case.
        On Error Resume Next
        .SpecialCells(xlCellTypeBlanks).EntireRow.Delete
        On Error GoTo 0
    End With

End Sub

@Marc Mroz 

 

This is great (as are some of the other web based solutions that have since appeared) but when you are working in a secure environment it just isn't acceptable to send your data out to some unknown service somewhere. My clients are happy with me uploading their data to a trusted organisation like Microsoft, but not some random website somewhere that I've found through Google.

Hi @Marc Mroz 

Thank you so much for creating the VTT cleaner web utility, I was just wondering if it was GDPR compliant?

Many thanks

Just started using Stream for converting video meetings to transcripts. You certainly saved my bacon (and my time!). Thanks so much.

@KForster - Sorry for my super late reply. You can take a look at the JavaScript code on the page; everything is done locally in the browser in JavaScript. It has NO connection back to Microsoft or any server. Thus, none of your text from the VTT leaves the browser. 

 

It just reads the VTT file locally, does find and replace on a few strings and then sticks the cleaned output back to the screen and to the clipboard. 

 

So it should be safe for you to use because it doesn't save anything at all. 

This is the best solution I've found to get a clean extract from a .vtt file. Not perfect because it leaves you with a single massive block of text but at least all the time stamps are gone.
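
On the "single massive block of text" point: one way around it is to keep the cue boundaries, so each caption ends up on its own line instead of everything running together. Below is a minimal Python sketch of that idea. It assumes the layout shown in the original post (each cue's text follows a timecode line, and NOTE comments are single lines); cues_as_lines is just a name I made up.

```python
import re

# Assumed line shapes, taken from the sample export quoted in this thread.
TIMECODE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> ")
GUID = re.compile(r"^[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def cues_as_lines(text):
    """Return one string per cue, with the cue's wrapped lines joined by spaces."""
    cues = []
    current = None  # text lines of the cue being read; None before the first cue
    for raw in text.splitlines():
        line = raw.strip()
        if TIMECODE.match(line):
            # A timecode starts a new cue, so finish the previous one first.
            if current:
                cues.append(" ".join(current))
            current = []
        elif current is None or not line or GUID.match(line):
            continue  # header area, blank row, or cue ID
        elif line == "NOTE" or line.startswith("NOTE "):
            continue  # NOTE metadata
        else:
            current.append(line)
    if current:
        cues.append(" ".join(current))
    return cues
```

Joining the result with "\n" gives one caption per line, which is easier to read through and split into paragraphs by hand than one unbroken block.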