Forum Discussion
How to extract multiple occurrences of a text enclosed between two same characters
Dear colleagues,
I have a cell with plenty of text. Some part of the text is enclosed between tags >> and <<. The task is to extract all these strings to one cell, separated by new line.
Notes | Important text |
My standard text.>>This is important 1<< Another standard text >>Important 2<< Next standard text | This is important 1 Important 2 |
Similar issue to the https://techcommunity.microsoft.com/t5/excel/formula-for-extracting-multiple-occurrences-of-values-following/m-p/4178824 solved by HansVogelaar , but now we have both opening and closing tag the same, so the TEXTSPLIT function removes all markups ...
Thank You very much for any hint.
Zdenek Moravec
I hope this works for you (with the text in A5):
=LET(w, TEXTSPLIT(A5, "!"), n, COUNTA(w), s, SEQUENCE(, n), f, FILTER(w, ISEVEN(s)), TEXTJOIN(CHAR(10), TRUE, f))
17 Replies
- PeterBartholomew1Silver Contributor
Just to add some thoughts on modern Excel. I am coming to be of the opinion that spreadsheet formulas are best presented as Lambda functions rather than having the formula details exposed. In that case HansVogelaar's formula might be
ImportantText1λ = LAMBDA(string, LET( newLine, CHAR(10), extracts, IFERROR(TRIM(TEXTAFTER(TEXTSPLIT(string, "<<"), ">>")),""), TEXTJOIN(newLine, TRUE, extracts, "") ) );
whilst mine might be
ImportantText2λ = LAMBDA(string, LET( newLine, CHAR(10), regex, "[\w|\s]*(?=\<)", TEXTJOIN(newLine,,REGEXEXTRACT(string, regex, 1)) ) );
The worksheet formulas would be
= MAP(notes, ImportantTextλ)
Any change to the formula made by the developer need not even be visible to the user (unless they are the same person of course) so the sheet may be updated when one chooses using the AFE (Advanced Formula Environment).
- SergeiBaklanDiamond Contributor
And if with RegEx only (e.g. without MAP and TEXTJOIN)
=REGEXREPLACE( REGEXREPLACE( A5:A7, "(?x) ( [.]*>> | <<[\n]?.*>> | <<.*[\n]?>> | <<[\n]?.*$ ) | (^.*>>) ", "\n" ), "^\n|\n$", "" )
- PeterBartholomew1Silver Contributor
I have mixed feelings regarding regular expressions. The plusses are that it is an industry standard; it is extremely powerful; and its sheer ingenuity is amazing. That said, its conciseness makes it incredibly difficult to read. I would gladly trade 10 times the key presses for 4 times the readability.
In that vein, I did look at some solutions on Stack Overflow where contributors had described elements of the formula, so that a sequence returning "5 1/4" would be designated as <number>, when used more than once. With that in mind, I have explored the idea of using LET variables and even LAMBDA to provide meaning to expressions that would otherwise place heavy demands on the reader to mentally parse the formula. A hybrid solution that would horrify a regex purist!
- Zdenek_MoravecBrass Contributor
PeterBartholomew1 , nice approach, I like it. I hope, Microsoft team will also improve debug possibilities for LAMBDA formula and the Name Manager dialog window. Currently, it is a kind of black box.
- SergeiBaklanDiamond Contributor
AFE from Excel Labs gives both - debugger and ability to work with names. Not IDE, but at least something.
- SergeiBaklanDiamond Contributor
As variant
=TEXTJOIN( CHAR(10),, FILTERXML( "<t>" & SUBSTITUTE( SUBSTITUTE(A5,">>","<s>"),"<<","</s>") & "</t>", "//s" ) )
Try below to extract the first occurrence:
=MID(A2, FIND(">>", A2) + 2, FIND("<<", A2) - FIND(">>", A2) - 2)
Follow to handle extract multiple occurrences:
Function ExtractTextBetweenTags(cell As Range) As String
Dim text As String
Dim result As String
Dim startPos As Long
Dim endPos As Long
text = cell.Value
result = ""
Do While InStr(text, ">>") > 0
startPos = InStr(text, ">>") + 2
endPos = InStr(text, "<<")
If endPos > startPos Then
result = result & Mid(text, startPos, endPos - startPos) & vbNewLine
text = Mid(text, endPos + 2)
Else
Exit Do
End If
Loop
ExtractTextBetweenTags = result
End FunctionIn B5:
=TEXTJOIN(CHAR(10), TRUE, IFERROR(TRIM(TEXTAFTER(TEXTSPLIT(A5, "<<"), ">>")), ""))
(The REGEXTRACT function mentioned by PeterBartholomew1 is currently available to Micorosft 365 Insiders only)
- Zdenek_MoravecBrass ContributorHansVogelaar, Perfect, thank You. Regarding my post title and the last sentence in the original post - how to apply TEXTSPLIT, if the beginning and ending character is the same:
My standard text. !This is important 1! Another standard text !Important 2! Next standard text
should result in
This is important 1
Important 2
I guess completely different technique?I hope this works for you (with the text in A5):
=LET(w, TEXTSPLIT(A5, "!"), n, COUNTA(w), s, SEQUENCE(, n), f, FILTER(w, ISEVEN(s)), TEXTJOIN(CHAR(10), TRUE, f))
- PeterBartholomew1Silver Contributor
A start might be
= TEXTJOIN( CHAR(10),, REGEXEXTRACT(targetString, ">[\w|\s]*(?=<)", 1) )
- Zdenek_MoravecBrass ContributorRegular expressions is a topic for next sleepless nights, once the functions come to M365 Basic 😉 I will notice Your solution and learn on it ...