Forum Discussion
Editing transcripts: Removing extra lines of data from export
Been testing the transcription in Stream. For a multinational organization it is not really ready for prime time (too many accents to do well in English, and no direct support or languages other than Spanish). However, I do find potential in being able to export the auto-generated transcripts by senior leadership for purely text-related usage. That said, even if one is to tidy up the existing transcript there are many many rows of extra data between each text. Not just timecode but like this:
NOTE Confidence: 0.936690330505371
9eed9142-c299-42ed-96f1-fed2c6617e0c
00:00:21.476 --> 00:00:24.633
unchanged to make homes
cleaner and healthier.
NOTE Confidence: 0.909458994865417
e2af81c5-7559-4a57-8bf7-5f7b2c586c4e
00:00:27.370 --> 00:00:31.400
Delicate wool garments have
always been tricky to care for
The captions are not even on one editable row, and there are three lines (and blank rows not shown here) to be removed between each text. Over an hour meeting, or townhall presentation, this is a LOT of editing.
Has anyone come across a way to export and automate removal of the extra material in order to create a clean text document - a pure transcript and not a caption file?
Agentjh dhthompson mdlau - I just created a short web utility to clean up the Stream transcript VTT files for when you just want to get the text from the file without the metadata, time codes, and blank lines.
I linked the utility from the bottom of this help doc page: https://aka.ms/StreamVTTCleaner
Give it a try and see if this is useful for you.
The web utility I created is just a quick workaround, ideally this would be built into Stream itself directly. You should add your comments and votes to this idea in our ideas forum: https://techcommunity.microsoft.com/t5/microsoft-stream-ideas/allow-export-of-transcript/idi-p/205468
26 Replies
- DavidJacksonTKCCopper Contributor
I followed the instructions from Agentjh and put together the following macro. All I do now is open the downloaded vtt file in notepad, select all, paste into excel, then run this Macro. It works for me and I hope helps other. This is the first macro I have ever created so please be gently if it's not particularly elegant!
Sub TranscriptCleaner()
'
' TranscriptCleaner Macro
' A macro to clean up vtt printed files from Microsft Stream.
''
Cells.Select
Range("E2").Activate
Selection.Replace What:="*-*", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False, FormulaVersion:=xlReplaceFormula2
Selection.Replace What:="WEBVTT", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False, FormulaVersion:=xlReplaceFormula2
Selection.Replace What:="NOTE*", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False, FormulaVersion:=xlReplaceFormula2
Selection.Replace What:="00*", Replacement:="", LookAt:=xlPart, _
SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
ReplaceFormat:=False, FormulaVersion:=xlReplaceFormula2
Selection.SpecialCells(xlCellTypeBlanks).Select
Selection.EntireRow.Delete
End Sub - AgentjhCopper Contributor
dhthompson, I found a workaround. Download the script at Stream. Select all and copy and paste into Excel. Do a find and replace on "NOTE*" and replace with nothing (blank). Then do the same for "*-*". That should get rid of everything but the text. Then to get rid of the blank rows, do ctrl G to open the "Go to" popup. Click "Special". Select "Blanks". In the Home menu of Excel, go to the "Cells" section. Click the "Delete" drop down and select "delete sheet rows". Then I copied the text to Word and read through it. Still not great but a lot better than with all the data between the transcript text. Hope that helps.
- JaneramaCopper Contributor
Agentjh This is amazing! For anyone else who found it pasted in word from Excel into lots of pages where each line was a handful of words long...I found one more step was necessary because I ended up with 55 pages because of a zillion hard returns. I then needed to delete the hard returns in Word.
1. Navigate to Find and Replace
2. Click on the gear icon to go to advanced Find and Replace
3. In the Replace section click on Special, and select Paragraph Mark
4. Enter the paragraph mark symbol in in the Find What field "^p".
5. Replace with: enter a blank (hit the space bar).
This reduced the text to 14 pages, and now some fun editing ensues.
- Marc Mroz
Microsoft
Agentjh dhthompson mdlau - I just created a short web utility to clean up the Stream transcript VTT files for when you just want to get the text from the file without the metadata, time codes, and blank lines.
I linked the utility from the bottom of this help doc page: https://aka.ms/StreamVTTCleaner
Give it a try and see if this is useful for you.
The web utility I created is just a quick workaround, ideally this would be built into Stream itself directly. You should add your comments and votes to this idea in our ideas forum: https://techcommunity.microsoft.com/t5/microsoft-stream-ideas/allow-export-of-transcript/idi-p/205468
- MightyMedhaCopper Contributor
Marc Mroz Thanks!! It's a wonderful tool, works nicely but one problem, My transcript has numbering also with timeline. how to remove that numbers also maintain line break.
example:
1Hello! 2So, we will here be looking after 3how we need to log in into the supplier account. 4That is, how as a supplier we can log in into our account. 5So, first we need to open a browser. 6Then, after opening the browser,
Wants Output like:
Hello!
So, we will here be looking after
how we need to log in into the supplier account.
- mdlauCopper Contributor
dhthompson I've been searching for an easy way to do this as well!