Forum Discussion

Zdenek_Moravec's avatar
Zdenek_Moravec
Brass Contributor
Oct 27, 2024
Solved

How to extract multiple occurrences of a text enclosed between two same characters

Dear colleagues,

I have a cell with plenty of text. Some part of the text is enclosed between tags >> and <<. The task is to extract all these strings to one cell, separated by new line.

NotesImportant text
My standard text.>>This is important 1<< Another standard text >>Important 2<<
Next standard text
This is important 1
Important 2

Similar issue to the https://techcommunity.microsoft.com/t5/excel/formula-for-extracting-multiple-occurrences-of-values-following/m-p/4178824 solved by HansVogelaar , but now we have both opening and closing tag the same, so the TEXTSPLIT function removes all markups ...

Thank You very much for any hint.

Zdenek Moravec

  • HansVogelaar's avatar
    HansVogelaar
    Oct 28, 2024

    Zdenek_Moravec 

    I hope this works for you (with the text in A5):

     

    =LET(w, TEXTSPLIT(A5, "!"), n, COUNTA(w), s, SEQUENCE(, n), f, FILTER(w, ISEVEN(s)), TEXTJOIN(CHAR(10), TRUE, f))

17 Replies

  • Zdenek_Moravec 

    Just to add some thoughts on modern Excel.  I am coming to be of the opinion that spreadsheet formulas are best presented as Lambda functions rather than having the formula details exposed.  In that case HansVogelaar's formula might be

    ImportantText1λ
    = LAMBDA(string,
        LET(
            newLine, CHAR(10),
            extracts, IFERROR(TRIM(TEXTAFTER(TEXTSPLIT(string, "<<"), ">>")),""),
            TEXTJOIN(newLine, TRUE, extracts, "")
        )
    );

    whilst mine might be

    ImportantText2λ
    = LAMBDA(string,
        LET(
            newLine, CHAR(10),
            regex, "[\w|\s]*(?=\<)",
            TEXTJOIN(newLine,,REGEXEXTRACT(string, regex, 1))
        )
    );

    The worksheet formulas would be

    = MAP(notes, ImportantTextλ)

    Any change to the formula made by the developer need not even be visible to the user (unless they are the same person of course) so the sheet may be updated when one chooses using the AFE (Advanced Formula Environment).

      • PeterBartholomew1's avatar
        PeterBartholomew1
        Silver Contributor

        SergeiBaklan 

        I have mixed feelings regarding regular expressions.  The plusses are that it is an industry standard; it is extremely powerful; and its sheer ingenuity is amazing.  That said, its conciseness makes it incredibly difficult to read.  I would gladly trade 10 times the key presses for 4 times the readability.

         

        In that vein, I did look at some solutions on Stack Overflow where contributors had described elements of the formula, so that a sequence returning "5 1/4" would be designated as <number>, when used more than once.  With that in mind, I have explored the idea of using LET variables and even LAMBDA to provide meaning to expressions that would otherwise place heavy demands on the reader to mentally parse the formula.  A hybrid solution that would horrify a regex purist!

    • Zdenek_Moravec's avatar
      Zdenek_Moravec
      Brass Contributor

      PeterBartholomew1 , nice approach, I like it. I hope, Microsoft team will also improve debug possibilities for LAMBDA formula and the Name Manager dialog window. Currently, it is a kind of black box.

      • SergeiBaklan's avatar
        SergeiBaklan
        Diamond Contributor

        Zdenek_Moravec 

        AFE from Excel Labs gives both - debugger and ability to work with names. Not IDE, but at least something.

    • SergeiBaklan's avatar
      SergeiBaklan
      Diamond Contributor

      PeterBartholomew1 

      As variant

       

      =TEXTJOIN(
        CHAR(10),,
        FILTERXML(
          "<t>" &
          SUBSTITUTE( SUBSTITUTE(A5,">>","<s>"),"<<","</s>") &
          "</t>",
          "//s" )
      )

       

  • Zdenek_Moravec 

     

    Try below to extract the first occurrence:

     

    =MID(A2, FIND(">>", A2) + 2, FIND("<<", A2) - FIND(">>", A2) - 2)

     

    Follow to handle extract multiple occurrences:

     

    Function ExtractTextBetweenTags(cell As Range) As String
    Dim text As String
    Dim result As String
    Dim startPos As Long
    Dim endPos As Long

    text = cell.Value
    result = ""

    Do While InStr(text, ">>") > 0
    startPos = InStr(text, ">>") + 2
    endPos = InStr(text, "<<")
    If endPos > startPos Then
    result = result & Mid(text, startPos, endPos - startPos) & vbNewLine
    text = Mid(text, endPos + 2)
    Else
    Exit Do
    End If
    Loop

    ExtractTextBetweenTags = result
    End Function

    • Zdenek_Moravec's avatar
      Zdenek_Moravec
      Brass Contributor
      HansVogelaar, Perfect, thank You. Regarding my post title and the last sentence in the original post - how to apply TEXTSPLIT, if the beginning and ending character is the same:
      My standard text. !This is important 1! Another standard text !Important 2! Next standard text
      should result in
      This is important 1
      Important 2
      I guess completely different technique?
      • HansVogelaar's avatar
        HansVogelaar
        MVP

        Zdenek_Moravec 

        I hope this works for you (with the text in A5):

         

        =LET(w, TEXTSPLIT(A5, "!"), n, COUNTA(w), s, SEQUENCE(, n), f, FILTER(w, ISEVEN(s)), TEXTJOIN(CHAR(10), TRUE, f))

    • Zdenek_Moravec's avatar
      Zdenek_Moravec
      Brass Contributor
      Regular expressions is a topic for next sleepless nights, once the functions come to M365 Basic 😉 I will notice Your solution and learn on it ...

Resources