Forum Discussion

Molekula's avatar
Molekula
Copper Contributor
Aug 23, 2022
Solved

Extract text between two characters

Hello everybody,

Recently I encountered a quirky problem while trying to retrieve some data from a cell.

Hope you may help me.

So I have an export csv table as following (5 rows for example):

Export data I receiveResult I expect
FABF - www.google.com - worldwide - REPwww.google.com

CRIM - RAW - www.yahoo.com - global - worlwide

www.yahoo.com
FABF - DOM - RIM - drive.can.com - globaldrive.can.com or www.drive.can.com
limbo.bloomberg.com - FAB - RAWlimbo.bloomberg.com or www.limbo.bloomberg.com

www.google.com

www.google.com

 

I want to be able to retrieve the URL of the website:

- no matter of its position in a cell

- preferably with www. at the beginning if it does not have one

 

My logic to solve this is the following but I don't understand how to write it (probably there is some way to do it with LET function):

1. Using FIND function find the position of a specific character '.' (the dot)

FABF - www.google.com - worldwide - REP11

2. Using FIND function find the position of a specific character ' ' (space)

 5

3. IF the 'space' position number is lower than 'the dot' position number counting counting to the next one until it will be bigger than 'the dot' position number.

4. ONCE it is bigger the 'space' position number is bigger than 'the dot' position number, extract the phrase between spaces. (photo included)

IMPORTANT

When the website is on the first place (like in the line 5) there are no spaces.

 

Big thank you to everybody in advance!

  • Molekula 

    Here is a small custom VBA function:

    Function ExtractURL(ByVal s As String) As String
        Dim p1 As Long
        Dim p2 As Long
        Dim p3 As Long
        s = Replace(s, Chr(160), " ")
        p1 = InStr(1, s, ".")
        p2 = InStrRev(s, " ", p1 - 1)
        p3 = InStr(p1 + 1, s, " ")
        If p3 > 0 Then
            s = Mid(s, p2 + 1, p3 - p2 - 1)
        Else
            s = Mid(s, p2 + 1)
        End If
        If Left(s, 4) <> "www." Then
            s = "www." & s
        End If
        ExtractURL = s
    End Function

    Use like this: in a cell formula:

    =ExtractURL(A2)

11 Replies

  • Detlef_Lewin's avatar
    Detlef_Lewin
    Silver Contributor

    Molekula 

    =LET(
    a,FILTERXML("<y><z>"&SUBSTITUTE(A2," ","</z><z>")&"</z></y>","//z"),
    b,SEARCH(".",a),
    LOOKUP(9^99,b,a))
    • Molekula's avatar
      Molekula
      Copper Contributor

      Detlef_Lewin good evening,

      I am using your code on the daily basis and it works perfectly.
      But when I try to understand its logic it becomes complicated, can you please explain it to me?
      I don't understand why you divide the text by:

      FILTERXML("<y><z>"&SUBSTITUTE(A2," ","</z><z>")&"</z></y>","//z")

       And what 9^99 stands for?

      LOOKUP(9^99,b,a))

       

      Thank you in advance

      • mtarler's avatar
        mtarler
        Silver Contributor

        Molekula I can try to explain his code and my variation on it:

         

         

        =LET(
        a,FILTERXML("<y><z>"&SUBSTITUTE(A2," ","</z><z>")&"</z></y>","//z"),
        b,SEARCH(".",a),
        LOOKUP(9^99,b,a))

         

        FILTERXML will perform a filter on XML text.  XML text is a markup language using <> and </> to tag sections of the text.  So in this case the function will "filter" out all the sections tagged with <z>...</z> based on that last argument "//z".  So the trick here is to convert the text into XML text where what you want is between the <z> and </z> tags.  So step 1 is to 'initialize' the XML text with the <y><z> tags and then inside that function is a SUBSTITUTE(A2, " ", "</z><z>") which replaces every instance of a space (" ") with the end tag </z> and then starts a new z container with <z>.  Finally it ends the text with </z> </y> to properly close the containers.  So FILTERXML will now create an array output from the original text broken up based on the location of the spaces.

        The next line (line 3) then searches for the presence of a "." inside each array element and returns a number if it is there (corresponding to its location) or an error if not

        The LOOKUP in line 4 finds the numeric value from the array b (i.e. the element that found a ".") and returns the corresponding element that came out from the FILTERXML function.

        In my variation:

         

        =LET(
        a,FILTERXML("<y><z>"&SUBSTITUTE(A27," ","</z><z>")&"</z></y>","//z"),
        CONCAT(IF(IFERROR(LEN(WEBSERVICE(a)),FALSE),a,"")))

         

         after creating that array of parts I then check if WEBSERVICE() of that item returns any data and if it finds anything it returns the text from that part back again or returns blank if not.  All the elements are CONCAT together so all the blanks + the web address + blanks

         

    • Molekula's avatar
      Molekula
      Copper Contributor

      Good evening Detlef_Lewin ,
      As for now everything is working perfectly, so thank you once again.
      But I am looking a bit in the future, and doubt if there is any possibility to solve the following issue that I may face:

      1. Apart from the clean data (table in the post above) I receive, there is a high chance of dirty data like that:

      port data I receiveResult I expect
      FABF - Google - worldwide - REPGoogle

      CRIM - RAW - Yahoo - global - worldwide

      Yahoo
      FABF - DOM - RIM - drive - globaldrive.can.com 
      limbo - FAB - RAW - Bloomberglimbo.bloomberg.com 
      FABF - google - RIM - drive-global-

      google.drive.global

      FABF - google - RIM - drive-global-NL

      google.drive.global.nl

      CRIM - RAW - Yahoo - global - worldwide (only)

      Yahoo.worldwide (only)

      CRIM - RAW - Yahoo - global - worldwide-(only)-certainperiod

      Yahoo.worldwide (only) certainperiod

       

      Where there is no clear reference to the website but at least a little source from where it may came from. And thus it will help me to avoid potential blanc links....

      2. Is there any way to integrate that in your formula?

       

      My solution:

      I have added IFERROR=0 to your function to filter the dirty data and added multiple IF columns based on each criteria, but the file looks messy and I am no satisfied with this automation.

      Thank you in advance

    • Molekula's avatar
      Molekula
      Copper Contributor

      Thank you very much Detlef_Lewin  you function works like a charm and if I could mention your answer as a best answer I would do it immediately.
      Thank you.

  • Molekula 

    Here is a small custom VBA function:

    Function ExtractURL(ByVal s As String) As String
        Dim p1 As Long
        Dim p2 As Long
        Dim p3 As Long
        s = Replace(s, Chr(160), " ")
        p1 = InStr(1, s, ".")
        p2 = InStrRev(s, " ", p1 - 1)
        p3 = InStr(p1 + 1, s, " ")
        If p3 > 0 Then
            s = Mid(s, p2 + 1, p3 - p2 - 1)
        Else
            s = Mid(s, p2 + 1)
        End If
        If Left(s, 4) <> "www." Then
            s = "www." & s
        End If
        ExtractURL = s
    End Function

    Use like this: in a cell formula:

    =ExtractURL(A2)

    • Molekula's avatar
      Molekula
      Copper Contributor

      Good evening HansVogelaar,
      As for now everything is working perfectly, so thank you once again.
      But I am looking a bit in the future, and doubt if there is any possibility to solve the following issue that I may face:

      1. Apart from the clean data (table in the post above) I receive, there is a high chance of dirty data like that:

      port data I receiveResult I expect
      FABF - Google - worldwide - REPGoogle

      CRIM - RAW - Yahoo - global - worldwide

      Yahoo
      FABF - DOM - RIM - drive - globaldrive.can.com 
      limbo - FAB - RAW - Bloomberglimbo.bloomberg.com 
      FABF - google - RIM - drive-global-

      google.drive.global

      FABF - google - RIM - drive-global-NL

      google.drive.global.nl

      CRIM - RAW - Yahoo - global - worldwide (only)

      Yahoo.worldwide (only)

      CRIM - RAW - Yahoo - global - worldwide-(only)-certainperiod

      Yahoo.worldwide (only) certainperiod

       

      Where there is no clear reference to the website but at least a little source from where it may came from. And thus it will help me to avoid potential blanc links....

      2. Is there any way to integrate that in your formula?

       

       

      Thank you in advance

    • Molekula's avatar
      Molekula
      Copper Contributor

      Thank you very much HansVogelaar!
      Your answer corresponds exactly to what I was looking for even though it is a VBA code. (I am still learning to write in VB)
      Works like a charm,

Resources