Forum Discussion
Molekula
Aug 23, 2022Copper Contributor
Extract text between two characters
Hello everybody,
Recently I encountered a quirky problem while trying to retrieve some data from a cell.
Hope you may help me.
So I have an export csv table as following (5 rows for example):
Export data I receive | Result I expect |
FABF - www.google.com - worldwide - REP | www.google.com |
CRIM - RAW - www.yahoo.com - global - worlwide | www.yahoo.com |
FABF - DOM - RIM - drive.can.com - global | drive.can.com or www.drive.can.com |
limbo.bloomberg.com - FAB - RAW | limbo.bloomberg.com or www.limbo.bloomberg.com |
www.google.com |
I want to be able to retrieve the URL of the website:
- no matter of its position in a cell
- preferably with www. at the beginning if it does not have one
My logic to solve this is the following but I don't understand how to write it (probably there is some way to do it with LET function):
1. Using FIND function find the position of a specific character '.' (the dot)
FABF - www.google.com - worldwide - REP | 11 |
2. Using FIND function find the position of a specific character ' ' (space)
5 |
3. IF the 'space' position number is lower than 'the dot' position number counting counting to the next one until it will be bigger than 'the dot' position number.
4. ONCE it is bigger the 'space' position number is bigger than 'the dot' position number, extract the phrase between spaces. (photo included)
IMPORTANT
When the website is on the first place (like in the line 5) there are no spaces.
Big thank you to everybody in advance!
Here is a small custom VBA function:
Function ExtractURL(ByVal s As String) As String Dim p1 As Long Dim p2 As Long Dim p3 As Long s = Replace(s, Chr(160), " ") p1 = InStr(1, s, ".") p2 = InStrRev(s, " ", p1 - 1) p3 = InStr(p1 + 1, s, " ") If p3 > 0 Then s = Mid(s, p2 + 1, p3 - p2 - 1) Else s = Mid(s, p2 + 1) End If If Left(s, 4) <> "www." Then s = "www." & s End If ExtractURL = s End Function
Use like this: in a cell formula:
=ExtractURL(A2)
11 Replies
Sort By
- Detlef_LewinSilver Contributor
=LET( a,FILTERXML("<y><z>"&SUBSTITUTE(A2," ","</z><z>")&"</z></y>","//z"), b,SEARCH(".",a), LOOKUP(9^99,b,a))
- MolekulaCopper Contributor
Detlef_Lewin good evening,
I am using your code on the daily basis and it works perfectly.
But when I try to understand its logic it becomes complicated, can you please explain it to me?
I don't understand why you divide the text by:FILTERXML("<y><z>"&SUBSTITUTE(A2," ","</z><z>")&"</z></y>","//z")
And what 9^99 stands for?
LOOKUP(9^99,b,a))
Thank you in advance
- mtarlerSilver Contributor
Molekula I can try to explain his code and my variation on it:
=LET( a,FILTERXML("<y><z>"&SUBSTITUTE(A2," ","</z><z>")&"</z></y>","//z"), b,SEARCH(".",a), LOOKUP(9^99,b,a))
FILTERXML will perform a filter on XML text. XML text is a markup language using <> and </> to tag sections of the text. So in this case the function will "filter" out all the sections tagged with <z>...</z> based on that last argument "//z". So the trick here is to convert the text into XML text where what you want is between the <z> and </z> tags. So step 1 is to 'initialize' the XML text with the <y><z> tags and then inside that function is a SUBSTITUTE(A2, " ", "</z><z>") which replaces every instance of a space (" ") with the end tag </z> and then starts a new z container with <z>. Finally it ends the text with </z> </y> to properly close the containers. So FILTERXML will now create an array output from the original text broken up based on the location of the spaces.
The next line (line 3) then searches for the presence of a "." inside each array element and returns a number if it is there (corresponding to its location) or an error if not
The LOOKUP in line 4 finds the numeric value from the array b (i.e. the element that found a ".") and returns the corresponding element that came out from the FILTERXML function.
In my variation:
=LET( a,FILTERXML("<y><z>"&SUBSTITUTE(A27," ","</z><z>")&"</z></y>","//z"), CONCAT(IF(IFERROR(LEN(WEBSERVICE(a)),FALSE),a,"")))
after creating that array of parts I then check if WEBSERVICE() of that item returns any data and if it finds anything it returns the text from that part back again or returns blank if not. All the elements are CONCAT together so all the blanks + the web address + blanks
- MolekulaCopper Contributor
Good evening Detlef_Lewin ,
As for now everything is working perfectly, so thank you once again.
But I am looking a bit in the future, and doubt if there is any possibility to solve the following issue that I may face:1. Apart from the clean data (table in the post above) I receive, there is a high chance of dirty data like that:
port data I receive Result I expect FABF - Google - worldwide - REP Google CRIM - RAW - Yahoo - global - worldwide
Yahoo FABF - DOM - RIM - drive - global drive.can.com limbo - FAB - RAW - Bloomberg limbo.bloomberg.com FABF - google - RIM - drive-global- google.drive.global
FABF - google - RIM - drive-global-NL google.drive.global.nl
CRIM - RAW - Yahoo - global - worldwide (only) Yahoo.worldwide (only)
CRIM - RAW - Yahoo - global - worldwide-(only)-certainperiod Yahoo.worldwide (only) certainperiod
Where there is no clear reference to the website but at least a little source from where it may came from. And thus it will help me to avoid potential blanc links....
2. Is there any way to integrate that in your formula?My solution:
I have added IFERROR=0 to your function to filter the dirty data and added multiple IF columns based on each criteria, but the file looks messy and I am no satisfied with this automation.
Thank you in advance
- MolekulaCopper Contributor
Thank you very much Detlef_Lewin you function works like a charm and if I could mention your answer as a best answer I would do it immediately.
Thank you.
Here is a small custom VBA function:
Function ExtractURL(ByVal s As String) As String Dim p1 As Long Dim p2 As Long Dim p3 As Long s = Replace(s, Chr(160), " ") p1 = InStr(1, s, ".") p2 = InStrRev(s, " ", p1 - 1) p3 = InStr(p1 + 1, s, " ") If p3 > 0 Then s = Mid(s, p2 + 1, p3 - p2 - 1) Else s = Mid(s, p2 + 1) End If If Left(s, 4) <> "www." Then s = "www." & s End If ExtractURL = s End Function
Use like this: in a cell formula:
=ExtractURL(A2)
- MolekulaCopper Contributor
Good evening HansVogelaar,
As for now everything is working perfectly, so thank you once again.
But I am looking a bit in the future, and doubt if there is any possibility to solve the following issue that I may face:1. Apart from the clean data (table in the post above) I receive, there is a high chance of dirty data like that:
port data I receive Result I expect FABF - Google - worldwide - REP Google CRIM - RAW - Yahoo - global - worldwide
Yahoo FABF - DOM - RIM - drive - global drive.can.com limbo - FAB - RAW - Bloomberg limbo.bloomberg.com FABF - google - RIM - drive-global- google.drive.global
FABF - google - RIM - drive-global-NL google.drive.global.nl
CRIM - RAW - Yahoo - global - worldwide (only) Yahoo.worldwide (only)
CRIM - RAW - Yahoo - global - worldwide-(only)-certainperiod Yahoo.worldwide (only) certainperiod
Where there is no clear reference to the website but at least a little source from where it may came from. And thus it will help me to avoid potential blanc links....
2. Is there any way to integrate that in your formula?Thank you in advance
I don't see how we could do this, sorry.
- MolekulaCopper Contributor
Thank you very much HansVogelaar!
Your answer corresponds exactly to what I was looking for even though it is a VBA code. (I am still learning to write in VB)
Works like a charm,