SOLVED

Finding the most frequent series of words in an excel file

Copper Contributor

How can I find the most frequent series of words in an excel file?  I know how to find the most frequent single word.  I need to find the most frequent series of words.  

 

Example of a very short list (my file has almost 27,000 of these that I need to search through): 

 

Sienna Pointe's Fire On High OA OAJ CGC
Shine's Bright Like A Diamond OA OAJ NF
Sienna Pointe's Fire On High OA OAJ CGC
GCH CH Menny Azonnali Medal CD BN RN AX MXJ MXF T2B DCAT DN TKN
Semper Fi Pfc Jozee JH CGC
CH Solaris Palmetto Somebody To Love
Sincerus JE's Everglow What A Wonderful World BN RA OA OAJ OF BCAT ACT1
Patapsco Verity Sea Lynx CD BN RI CGC TKN
Remi XII CGC
CH Midnight Run's Beyond The Wall
CH Pursuit's Trek To The Triple Crown CGC
CH Rhapsody Hanalei Blues, Soul, & Rock'N'Roll
MACH4 Szizlin Fenwick's Shooting Star MXC PAD MJB2 OF T2B CGC TKN
MACH4 Szizlin Fenwick's Shooting Star MXC PAD MJB2 OF T2B CGC TKN
CH Mountain Laurel Summit To The Sea CD RE CGC TKA
CH Solaris Repeat Business RN JH CA RATN CGC TKN

Atomic Tawny SIN
Atomic Tawny SWN
CH Kurtzi N Nike's Pride Of The Mtns He's A Keeper JH TKN
Noel's Beautiful Crimson Rose CD
GCH CH Firelight's Swipe Right CA
Atomic Tawny SCN
Atomic Tawny SEN

I would like it to pick out the occurrence of Atomic Tawny as the most frequent series of words.  These are the names of dogs and performance titles they have earned, if anyone is wondering.

 

Is there a way to do this?

 

 

12 Replies

@Laura Wright 

 

It might be difficult, depending on what your definition of a 'word' is.

 

In that dataset, for example, 'Atomic' and 'Tawny' only appear 4 times each, whereas 'CGC' (10 times), 'CH' (9 times) and 'TKN' (6 times) all appear more times. You'd need to define your logic for what constitutes a 'word' if you wish examples such as those to not be considered.

Thanks for you reply! Sorry for the late response - I thought I'd get an email when someone responded. I've just hit 'subscribe'.

Hmmm, you're absolutely right. I could easily create a list of those that I don't want to be considered. If I do that, how do I get it to look for the most common series of words, excluding what I tell it to exclude?

@Laura Wright

 

Assuming:

 

1) Your data is in A1:A23

2) Your list of exclusions is in H1:H8

 

=LET(
    ζ,A1:A23,
    ξ,H1:H8,
    λ,2,
    κ,TOCOL(REDUCE("",ζ,LAMBDA(α,β,VSTACK(α,TEXTSPLIT(β," ")))),3),
    γ,UNIQUE(FILTER(κ,ISNA(XMATCH(κ,ξ)))),
    TAKE(SORTBY(γ,BYROW(γ,LAMBDA(μ,SUM(N(κ=μ)))),-1),λ)
)

 

λ in the above (currently set to 2) determines the number of returns to be made.

 

Based on your posted data, if H1:H8 contains "CGC", "CH", "TKN", "OA", "OAJ", "CD", "OF" and "The", the above will return "Atomic" and "Tawny".

 

Regards

Thank you. My next task will be googling how to type those Greek letters.

Just to make sure we're on the same page, if the 5 hypothetical dogs below appeared in the list just once each, it would still return Atomic and Tawny, because these dogs below are 5 distinct/different dogs who happen to have the same 4 words in their name. Order of the words matters, and I'm looking for the most frequent appearance of a series of words in the same order.

Noel's Beautiful Crimson Rose
Crimson Noel's Rose Beautiful
Noel's Rose Crimson Beautiful
Beautiful Noel's Crimson Rose
Rose Beautiful Noel's Crimson

Laura

best response confirmed by Laura Wright (Copper Contributor)
Solution

@Laura Wright 

Regarding the previously-posted formula:

 

  • You don't have to use Greek letters; you can replace those with words in your alphabet of choice.
  • It needs a space between the quotes that are inside the TEXTSPLIT function.
  • But even when fixed, it's making a column of the individual words, and counting those. As you note, that's not the desired outcome.
  • And further, your performance titles seem to include "OF", so the word "of" will also be removed from the dog names. That would make (hypothetical) names "Embrace of Fur" and "Embrace Fur" the same. Most Excel functions are insensitive to the case of their arguments, and while that's usually a good thing, here it could be a problem.


You'll want to double-check and triple-check to see if you can obtain a version of the data that has better separators (e.g., a tab character, or two spaces instead of one) between the dog names and the performance titles. That data would be much easier to parse. If not…


There are three different tasks here: The first is to extract a cleaned-up version of the dog names; the second is to count the occurrences of each name; the third is to identify the maximum, and those dog names that have that number. The second and third tasks will be trivial once the first is done.


Let me know if I'm incorrect, but there seems to be some additional codes (also performance titles?) in front of the dog names (e.g., GCH, CH, MACH4). That also complicates possible solutions.


And there is the additional assumption that the dog names are spelled the same in each occurrence, so "J & D", "J&D", and "J and D" are three different dogs.


TASK 1


I think the most accurate (and certainly most flexible) solution will involve procedural code (in this case a user-defined function, written in VBA), not just formulas. So, save a copy of the worksheet in a macro-enabled workbook, then:

  1. open the VB Editor (Alt+F11);
  2. from the Insert menu insert a (standard) code module;
  3. ensure that there is an Option Explicit statement at the top of the module; and below it
  4. paste the following code:
Public Function ExtractDogName(ByVal OriginalValue As String _
        , ExclusionList As String _
        ) As Variant
'   This procedure extracts and returns a dog's name from the first
'       argument, by removing "exclusion words/codes" from the front
'       and rear of the original value.
'   Note: Exclusion "words" are compared in a case-sensitive manner.

    Dim strReturnValue  As String
    
    Dim strWords()  As String
    Dim strWord     As String
    Dim in4WordIndex    As Long
    '
    Dim strExclusions() As String
    Dim strExclusion    As String
    Dim in4Exclusion    As Long
    '
    Dim blnNonExclusionWordFound    As Boolean
    
    '----   Enable an error handler.
    On Error GoTo ExtractDN_ErrHndlr
    
    '----   Split the input into words (retaining the order).
    strWords = Split(OriginalValue, " ")
    '----   Split the exclusion list into words or codes.
    strExclusions = Split(ExclusionList, " ")
    '----   Working from right to left in the array of words, replace exclusion
    '       words/codes with empty strings until a part of the dog name is
    '       presumably found.
    blnNonExclusionWordFound = False
    For in4WordIndex = UBound(strWords) To 0 Step -1
        strWord = strWords(in4WordIndex)
        For in4Exclusion = 0 To UBound(strExclusions)   '(This array can _
                be swept in either order.)
            strExclusion = strExclusions(in4Exclusion)
            '
            '   [To do a case-insensitive comparison, replace vbBinaryCompare
            '   in the next line with vbTextCompare.]
            If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
                strWord = ""
                Exit For
            End If
        Next in4Exclusion
        '
        If Len(strWord) > 0 Then
            blnNonExclusionWordFound = True
        Else
            strWords(in4WordIndex) = strWord
        End If
        '
        If blnNonExclusionWordFound = True Then Exit For
    Next in4WordIndex
    '----   Working from left to right in the array of words, replace exclusion
    '       words/codes with empty strings until a part of the dog name is
    '       presumably found.
    blnNonExclusionWordFound = False
    For in4WordIndex = 0 To UBound(strWords)
        strWord = strWords(in4WordIndex)
        For in4Exclusion = 0 To UBound(strExclusions)   '(This array can _
                be swept in either order.)
            strExclusion = strExclusions(in4Exclusion)
            '
            '   [To do a case-insensitive comparison, replace vbBinaryCompare
            '   in the next line with vbTextCompare.]
            If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
                strWord = ""
                Exit For
            End If
        Next in4Exclusion
        '
        If Len(strWord) > 0 Then
            blnNonExclusionWordFound = True
        Else
            strWords(in4WordIndex) = strWord
        End If
        '
        If blnNonExclusionWordFound = True Then Exit For
    Next in4WordIndex
    '----   Reconnect the component words.
    strReturnValue = Join(strWords, " ")
    strReturnValue = Trim$(strReturnValue)
    ''----   Collapse double spaces to a single space; repeat as needed.
    'Do Until InStr(1, strReturnValue, "  ") = 0
    '    strReturnValue = Replace(strReturnValue, "  ", " ")
    'Loop
    '----   Assign the return value.
    ExtractDogName = strReturnValue
ExtractDN_Exit:
    Exit Function

ExtractDN_ErrHndlr:
    Dim in4ErrorCode    As Long
    Dim strErrorDescr   As String
    '  --   Capture info.
    in4ErrorCode = Err.Number
    strErrorDescr = Err.Description
    '  --   Output info.
    Debug.Print "Error " & CStr(in4ErrorCode) & ": " & strErrorDescr _
            & vbCrLf & "  on " & OriginalValue
    '  --   Prepare to return an error indication.
    ExtractDogName = CVErr(xlErrValue)
    '  --
    Resume ExtractDN_Exit
End Function


Let's say that your words/codes to exclude are in A1:A34 of worksheet Exclusions, and your original dog names + performance titles stuff is in another worksheet, A2:A27001. Then the formula for column B (specifically, B2) to use that function is:

=ExtractDogName( A2, TEXTJOIN(" ",TRUE,Exclusions!$A$1:$A$34) )


TASK 2


To get a cumulative (from the top down) occurrence count, use this formula in column C (copying from C2 down as needed):

=COUNTIF(B$2:B2, B2)


TASK 3

 

To identify the largest occurrence count, and which dog has (or which dogs have) that occurrence count, use these formulas in D2 and E2, respectively:

=MAX(C2:C27001)
=FILTER(B2:B27001, C2:C27001=D2, "-none-")

 

Feel free to respond with questions. 

@SnowMan55 

 

Re "It needs a space between the quotes that are inside the TEXTSPLIT function", many thanks - have now corrected.

 

Plus, I obviously misunderstood the requirement, as your post has made me realise.

 

Regards

Wow, thanks! I've never used VBA in my life, but there's no time like the present to learn! I know someone who can probably help. Yes, in front of some dogs names there are also titles (those in front are specifically championship titles). Previously, I used Excel to remove them for me for another purpose (there are much fewer prefix titles than suffix titles), so I do have a list of dog names without those.
No worries, I appreciate the time you took to help.

@SnowMan55 It worked like a charm!!!  Thank you very much for your help! Very much appreciated!  

@SnowMan55 I apologize for reviving this thread. I am trying to use this code to find the most common phrases in a set of cells in a table. I continue to get a compiler error saying invalid outside procedure.

Do you know what might be causing this?

 

I may be misunderstanding the application of the code you posted, if so I apologize.

On Error GoTo ExtractDN_ErrHndlr

Please comment or remove this line,run again to find which line occurs error.
Yes, I was certainly wondering!
There is a huge amount of self-inflicted damage in this list to undo before one can even consider the type of analysis you require. Why were the names of dogs and their performance titles ever concatenated within a cell?

It makes it difficult to even get to the starting point of generating a list of distinct dogs and another for .performance titles.
1 best response

Accepted Solutions
best response confirmed by Laura Wright (Copper Contributor)
Solution

@Laura Wright 

Regarding the previously-posted formula:

 

  • You don't have to use Greek letters; you can replace those with words in your alphabet of choice.
  • It needs a space between the quotes that are inside the TEXTSPLIT function.
  • But even when fixed, it's making a column of the individual words, and counting those. As you note, that's not the desired outcome.
  • And further, your performance titles seem to include "OF", so the word "of" will also be removed from the dog names. That would make (hypothetical) names "Embrace of Fur" and "Embrace Fur" the same. Most Excel functions are insensitive to the case of their arguments, and while that's usually a good thing, here it could be a problem.


You'll want to double-check and triple-check to see if you can obtain a version of the data that has better separators (e.g., a tab character, or two spaces instead of one) between the dog names and the performance titles. That data would be much easier to parse. If not…


There are three different tasks here: The first is to extract a cleaned-up version of the dog names; the second is to count the occurrences of each name; the third is to identify the maximum, and those dog names that have that number. The second and third tasks will be trivial once the first is done.


Let me know if I'm incorrect, but there seems to be some additional codes (also performance titles?) in front of the dog names (e.g., GCH, CH, MACH4). That also complicates possible solutions.


And there is the additional assumption that the dog names are spelled the same in each occurrence, so "J & D", "J&D", and "J and D" are three different dogs.


TASK 1


I think the most accurate (and certainly most flexible) solution will involve procedural code (in this case a user-defined function, written in VBA), not just formulas. So, save a copy of the worksheet in a macro-enabled workbook, then:

  1. open the VB Editor (Alt+F11);
  2. from the Insert menu insert a (standard) code module;
  3. ensure that there is an Option Explicit statement at the top of the module; and below it
  4. paste the following code:
Public Function ExtractDogName(ByVal OriginalValue As String _
        , ExclusionList As String _
        ) As Variant
'   This procedure extracts and returns a dog's name from the first
'       argument, by removing "exclusion words/codes" from the front
'       and rear of the original value.
'   Note: Exclusion "words" are compared in a case-sensitive manner.

    Dim strReturnValue  As String
    
    Dim strWords()  As String
    Dim strWord     As String
    Dim in4WordIndex    As Long
    '
    Dim strExclusions() As String
    Dim strExclusion    As String
    Dim in4Exclusion    As Long
    '
    Dim blnNonExclusionWordFound    As Boolean
    
    '----   Enable an error handler.
    On Error GoTo ExtractDN_ErrHndlr
    
    '----   Split the input into words (retaining the order).
    strWords = Split(OriginalValue, " ")
    '----   Split the exclusion list into words or codes.
    strExclusions = Split(ExclusionList, " ")
    '----   Working from right to left in the array of words, replace exclusion
    '       words/codes with empty strings until a part of the dog name is
    '       presumably found.
    blnNonExclusionWordFound = False
    For in4WordIndex = UBound(strWords) To 0 Step -1
        strWord = strWords(in4WordIndex)
        For in4Exclusion = 0 To UBound(strExclusions)   '(This array can _
                be swept in either order.)
            strExclusion = strExclusions(in4Exclusion)
            '
            '   [To do a case-insensitive comparison, replace vbBinaryCompare
            '   in the next line with vbTextCompare.]
            If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
                strWord = ""
                Exit For
            End If
        Next in4Exclusion
        '
        If Len(strWord) > 0 Then
            blnNonExclusionWordFound = True
        Else
            strWords(in4WordIndex) = strWord
        End If
        '
        If blnNonExclusionWordFound = True Then Exit For
    Next in4WordIndex
    '----   Working from left to right in the array of words, replace exclusion
    '       words/codes with empty strings until a part of the dog name is
    '       presumably found.
    blnNonExclusionWordFound = False
    For in4WordIndex = 0 To UBound(strWords)
        strWord = strWords(in4WordIndex)
        For in4Exclusion = 0 To UBound(strExclusions)   '(This array can _
                be swept in either order.)
            strExclusion = strExclusions(in4Exclusion)
            '
            '   [To do a case-insensitive comparison, replace vbBinaryCompare
            '   in the next line with vbTextCompare.]
            If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
                strWord = ""
                Exit For
            End If
        Next in4Exclusion
        '
        If Len(strWord) > 0 Then
            blnNonExclusionWordFound = True
        Else
            strWords(in4WordIndex) = strWord
        End If
        '
        If blnNonExclusionWordFound = True Then Exit For
    Next in4WordIndex
    '----   Reconnect the component words.
    strReturnValue = Join(strWords, " ")
    strReturnValue = Trim$(strReturnValue)
    ''----   Collapse double spaces to a single space; repeat as needed.
    'Do Until InStr(1, strReturnValue, "  ") = 0
    '    strReturnValue = Replace(strReturnValue, "  ", " ")
    'Loop
    '----   Assign the return value.
    ExtractDogName = strReturnValue
ExtractDN_Exit:
    Exit Function

ExtractDN_ErrHndlr:
    Dim in4ErrorCode    As Long
    Dim strErrorDescr   As String
    '  --   Capture info.
    in4ErrorCode = Err.Number
    strErrorDescr = Err.Description
    '  --   Output info.
    Debug.Print "Error " & CStr(in4ErrorCode) & ": " & strErrorDescr _
            & vbCrLf & "  on " & OriginalValue
    '  --   Prepare to return an error indication.
    ExtractDogName = CVErr(xlErrValue)
    '  --
    Resume ExtractDN_Exit
End Function


Let's say that your words/codes to exclude are in A1:A34 of worksheet Exclusions, and your original dog names + performance titles stuff is in another worksheet, A2:A27001. Then the formula for column B (specifically, B2) to use that function is:

=ExtractDogName( A2, TEXTJOIN(" ",TRUE,Exclusions!$A$1:$A$34) )


TASK 2


To get a cumulative (from the top down) occurrence count, use this formula in column C (copying from C2 down as needed):

=COUNTIF(B$2:B2, B2)


TASK 3

 

To identify the largest occurrence count, and which dog has (or which dogs have) that occurrence count, use these formulas in D2 and E2, respectively:

=MAX(C2:C27001)
=FILTER(B2:B27001, C2:C27001=D2, "-none-")

 

Feel free to respond with questions. 

View solution in original post