Forum Discussion
How to extract multiple occurrences of a text enclosed between two same characters
- Oct 28, 2024
I hope this works for you (with the text in A5):
=LET(w, TEXTSPLIT(A5, "!"), n, COUNTA(w), s, SEQUENCE(, n), f, FILTER(w, ISEVEN(s)), TEXTJOIN(CHAR(10), TRUE, f))
And if with RegEx only (e.g. without MAP and TEXTJOIN)
=REGEXREPLACE(
REGEXREPLACE(
A5:A7,
"(?x)
( [.]*>> | <<[\n]?.*>> | <<.*[\n]?>> | <<[\n]?.*$ ) |
(^.*>>)
",
"\n"
),
"^\n|\n$",
""
)
I have mixed feelings regarding regular expressions. The plusses are that it is an industry standard; it is extremely powerful; and its sheer ingenuity is amazing. That said, its conciseness makes it incredibly difficult to read. I would gladly trade 10 times the key presses for 4 times the readability.
In that vein, I did look at some solutions on Stack Overflow where contributors had described elements of the formula, so that a sequence returning "5 1/4" would be designated as <number>, when used more than once. With that in mind, I have explored the idea of using LET variables and even LAMBDA to provide meaning to expressions that would otherwise place heavy demands on the reader to mentally parse the formula. A hybrid solution that would horrify a regex purist!
- SergeiBaklanOct 30, 2024Diamond Contributor
In general we may use comments like
=REGEXREPLACE( REGEXREPLACE( A5:A7, "(?x) #allow spaces/comments ( [.]*>> | #match any text with open sep, or <<[\n]?.*>> | #match any text between close and next open sep, or <<.*[\n]?>> | #the same if newline could be before close sep, or " & " <<[\n]?.*$ )| #match any text between close sep and end of line, or (^.*>>) #match any text at the start of the line ", "\n" ), "^\n|\n$", "" )
The problem is Excel still doesn't allow in formulae text constants with more than 255 characters
Thus we need to concat parts of the pattern. Perhaps similar will be with LET/LAMBDA where bunch of names to be concat.
There are great tools on the market like
Screen Shots: Create, Edit, Test, Debug, Convert, Save and Use Regex, but that's more for people who works with regex on regular basis.
- PeterBartholomew1Oct 30, 2024Silver Contributor
Thanks Sergei, I learnt from that! I was vaguely aware of \n but hadn't encountered
(?x). The embedded comments certainly help. I still find regex looks somewhat alien in an Excel function, as does XML. The sort of thing I have played with to hide the detail is:
target = "<<length1>>45<<length2>>25" = ExtractDimensionsλ(target) /* Name: ExtractDimensionsλ Description: A function to extract integer dimensions from a tagged string */ ExtractDimensionsλ = LAMBDA(string, LET( // Positive look-ahead PosLkAhdλ, LAMBDA(x, CONCAT("(?=",x,")")), integer, "\d+", // One or more digits nextTag, "(\<|$)", // Opening tag or end of line REGEXEXTRACT(string, integer & PosLkAhdλ(nextTag), 1) ) )
but, before playing games, learning the existing regex functionality to greater depth would be an idea.
- SergeiBaklanOct 31, 2024Diamond Contributor
RegEx options are, for example, here Options for regular expression - .NET | Microsoft Learn . They are off by default. Mainly I use (?xi).
Regex embedding into lambdas looks nice. The only it could be bit hard on debugging phase. In string it's easier to play with the pattern this and that ways.