Forum Discussion

dhthompson's avatar
dhthompson
Copper Contributor
Jul 28, 2019
Solved

Editing transcripts: Removing extra lines of data from export

Been testing the transcription in Stream. For a multinational organization it is not really ready for prime time (too many accents to do well in English, and no direct support or languages other than Spanish). However, I do find potential in being able to export the auto-generated transcripts by senior leadership for purely text-related usage. That said, even if one is to tidy up the existing transcript there are many many rows of extra data between each text. Not just timecode but like this:

 

NOTE Confidence: 0.936690330505371

9eed9142-c299-42ed-96f1-fed2c6617e0c
00:00:21.476 --> 00:00:24.633

unchanged to make homes
cleaner and healthier.

NOTE Confidence: 0.909458994865417

e2af81c5-7559-4a57-8bf7-5f7b2c586c4e
00:00:27.370 --> 00:00:31.400
Delicate wool garments have
always been tricky to care for

 

The captions are not even on one editable row, and there are three lines (and blank rows not shown here) to be removed between each text. Over an hour meeting, or townhall presentation, this is a LOT of editing.

 

Has anyone come across a way to export and automate removal of the extra material in order to create a clean text document - a pure transcript and not a caption file?

26 Replies

  • DavidJacksonTKC's avatar
    DavidJacksonTKC
    Copper Contributor

    dhthompson 

     

    I followed the instructions from Agentjh and put together the following macro. All I do now is open the downloaded vtt file in notepad, select all, paste into excel, then run this Macro. It works for me and I hope helps other. This is the first macro I have ever created so please be gently if it's not particularly elegant!

     

    Sub TranscriptCleaner()
    '
    ' TranscriptCleaner Macro
    ' A macro to clean up vtt printed files from Microsft Stream.
    '

    '
    Cells.Select
    Range("E2").Activate

    Selection.Replace What:="*-*", Replacement:="", LookAt:=xlPart, _
    SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
    ReplaceFormat:=False, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="WEBVTT", Replacement:="", LookAt:=xlPart, _
    SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
    ReplaceFormat:=False, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="NOTE*", Replacement:="", LookAt:=xlPart, _
    SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
    ReplaceFormat:=False, FormulaVersion:=xlReplaceFormula2
    Selection.Replace What:="00*", Replacement:="", LookAt:=xlPart, _
    SearchOrder:=xlByRows, MatchCase:=False, SearchFormat:=False, _
    ReplaceFormat:=False, FormulaVersion:=xlReplaceFormula2
    Selection.SpecialCells(xlCellTypeBlanks).Select
    Selection.EntireRow.Delete

    End Sub

  • Agentjh's avatar
    Agentjh
    Copper Contributor

    dhthompson, I found a workaround. Download the script at Stream. Select all and copy and paste into Excel. Do a find and replace on "NOTE*" and replace with nothing (blank). Then do the same for "*-*". That should get rid of everything but the text. Then to get rid of the blank rows, do ctrl G to open the "Go to" popup. Click "Special". Select "Blanks". In the Home menu of Excel, go to the "Cells" section. Click the "Delete" drop down and select "delete sheet rows". Then I copied the text to Word and read through it. Still not great but a lot better than with all the data between the transcript text. Hope that helps.

    • Janerama's avatar
      Janerama
      Copper Contributor

      Agentjh This is amazing! For anyone else who found it pasted in word from Excel into lots of pages where each line was a handful of words long...I found one more step was necessary because I ended up with 55 pages because of a zillion hard returns. I then needed to delete the hard returns in Word. 

       

      1. Navigate to Find and Replace

      2. Click on the gear icon to go to advanced Find and Replace

      3. In the Replace section click on Special, and select Paragraph Mark

      4. Enter the paragraph mark symbol in in the Find What field "^p".

      5. Replace with: enter a blank (hit the space bar). 

       

      This reduced the text to 14 pages, and now some fun editing ensues. 

    • Marc Mroz's avatar
      Marc Mroz
      Icon for Microsoft rankMicrosoft

      Agentjh dhthompson mdlau  - I just created a short web utility to clean up the Stream transcript VTT files for when you just want to get the text from the file without the metadata, time codes, and blank lines.

       

      I linked the utility from the bottom of this help doc page: https://aka.ms/StreamVTTCleaner

       

      Give it a try and see if this is useful for you.

       

      The web utility I created is just a quick workaround, ideally this would be built into Stream itself directly. You should add your comments and votes to this idea in our ideas forum: https://techcommunity.microsoft.com/t5/microsoft-stream-ideas/allow-export-of-transcript/idi-p/205468

      • MightyMedha's avatar
        MightyMedha
        Copper Contributor

        Marc Mroz Thanks!!  It's a wonderful tool, works nicely but one problem, My transcript has numbering also with timeline. how to remove that numbers also maintain line break.

        example:

        1Hello! 2So, we will here be looking after 3how we need to log in into the supplier account. 4That is, how as a supplier we can log in into our account. 5So, first we need to open a browser. 6Then, after opening the browser,

        Wants Output like:

        Hello!

        So, we will here be looking after

        how we need to log in into the supplier account.

         

Resources