Forum Discussion
Finding the most frequent series of words in an excel file
- Mar 09, 2023
Regarding the previously-posted formula:
- You don't have to use Greek letters; you can replace those with words in your alphabet of choice.
- It needs a space between the quotes that are inside the TEXTSPLIT function.
- But even when fixed, it's making a column of the individual words, and counting those. As you note, that's not the desired outcome.
- And further, your performance titles seem to include "OF", so the word "of" will also be removed from the dog names. That would make (hypothetical) names "Embrace of Fur" and "Embrace Fur" the same. Most Excel functions are insensitive to the case of their arguments, and while that's usually a good thing, here it could be a problem.
You'll want to double-check and triple-check to see if you can obtain a version of the data that has better separators (e.g., a tab character, or two spaces instead of one) between the dog names and the performance titles. That data would be much easier to parse. If not…
There are three different tasks here: The first is to extract a cleaned-up version of the dog names; the second is to count the occurrences of each name; the third is to identify the maximum, and those dog names that have that number. The second and third tasks will be trivial once the first is done.
Let me know if I'm incorrect, but there seems to be some additional codes (also performance titles?) in front of the dog names (e.g., GCH, CH, MACH4). That also complicates possible solutions.
And there is the additional assumption that the dog names are spelled the same in each occurrence, so "J & D", "J&D", and "J and D" are three different dogs.
TASK 1
I think the most accurate (and certainly most flexible) solution will involve procedural code (in this case a user-defined function, written in VBA), not just formulas. So, save a copy of the worksheet in a macro-enabled workbook, then:- open the VB Editor (Alt+F11);
- from the Insert menu insert a (standard) code module;
- ensure that there is an Option Explicit statement at the top of the module; and below it
- paste the following code:
Public Function ExtractDogName(ByVal OriginalValue As String _ , ExclusionList As String _ ) As Variant ' This procedure extracts and returns a dog's name from the first ' argument, by removing "exclusion words/codes" from the front ' and rear of the original value. ' Note: Exclusion "words" are compared in a case-sensitive manner. Dim strReturnValue As String Dim strWords() As String Dim strWord As String Dim in4WordIndex As Long ' Dim strExclusions() As String Dim strExclusion As String Dim in4Exclusion As Long ' Dim blnNonExclusionWordFound As Boolean '---- Enable an error handler. On Error GoTo ExtractDN_ErrHndlr '---- Split the input into words (retaining the order). strWords = Split(OriginalValue, " ") '---- Split the exclusion list into words or codes. strExclusions = Split(ExclusionList, " ") '---- Working from right to left in the array of words, replace exclusion ' words/codes with empty strings until a part of the dog name is ' presumably found. blnNonExclusionWordFound = False For in4WordIndex = UBound(strWords) To 0 Step -1 strWord = strWords(in4WordIndex) For in4Exclusion = 0 To UBound(strExclusions) '(This array can _ be swept in either order.) strExclusion = strExclusions(in4Exclusion) ' ' [To do a case-insensitive comparison, replace vbBinaryCompare ' in the next line with vbTextCompare.] If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then strWord = "" Exit For End If Next in4Exclusion ' If Len(strWord) > 0 Then blnNonExclusionWordFound = True Else strWords(in4WordIndex) = strWord End If ' If blnNonExclusionWordFound = True Then Exit For Next in4WordIndex '---- Working from left to right in the array of words, replace exclusion ' words/codes with empty strings until a part of the dog name is ' presumably found. blnNonExclusionWordFound = False For in4WordIndex = 0 To UBound(strWords) strWord = strWords(in4WordIndex) For in4Exclusion = 0 To UBound(strExclusions) '(This array can _ be swept in either order.) strExclusion = strExclusions(in4Exclusion) ' ' [To do a case-insensitive comparison, replace vbBinaryCompare ' in the next line with vbTextCompare.] If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then strWord = "" Exit For End If Next in4Exclusion ' If Len(strWord) > 0 Then blnNonExclusionWordFound = True Else strWords(in4WordIndex) = strWord End If ' If blnNonExclusionWordFound = True Then Exit For Next in4WordIndex '---- Reconnect the component words. strReturnValue = Join(strWords, " ") strReturnValue = Trim$(strReturnValue) ''---- Collapse double spaces to a single space; repeat as needed. 'Do Until InStr(1, strReturnValue, " ") = 0 ' strReturnValue = Replace(strReturnValue, " ", " ") 'Loop '---- Assign the return value. ExtractDogName = strReturnValue ExtractDN_Exit: Exit Function ExtractDN_ErrHndlr: Dim in4ErrorCode As Long Dim strErrorDescr As String ' -- Capture info. in4ErrorCode = Err.Number strErrorDescr = Err.Description ' -- Output info. Debug.Print "Error " & CStr(in4ErrorCode) & ": " & strErrorDescr _ & vbCrLf & " on " & OriginalValue ' -- Prepare to return an error indication. ExtractDogName = CVErr(xlErrValue) ' -- Resume ExtractDN_Exit End Function
Let's say that your words/codes to exclude are in A1:A34 of worksheet Exclusions, and your original dog names + performance titles stuff is in another worksheet, A2:A27001. Then the formula for column B (specifically, B2) to use that function is:=ExtractDogName( A2, TEXTJOIN(" ",TRUE,Exclusions!$A$1:$A$34) )
TASK 2
To get a cumulative (from the top down) occurrence count, use this formula in column C (copying from C2 down as needed):=COUNTIF(B$2:B2, B2)
TASK 3To identify the largest occurrence count, and which dog has (or which dogs have) that occurrence count, use these formulas in D2 and E2, respectively:
=MAX(C2:C27001) =FILTER(B2:B27001, C2:C27001=D2, "-none-")
Feel free to respond with questions.
Just to make sure we're on the same page, if the 5 hypothetical dogs below appeared in the list just once each, it would still return Atomic and Tawny, because these dogs below are 5 distinct/different dogs who happen to have the same 4 words in their name. Order of the words matters, and I'm looking for the most frequent appearance of a series of words in the same order.
Noel's Beautiful Crimson Rose
Crimson Noel's Rose Beautiful
Noel's Rose Crimson Beautiful
Beautiful Noel's Crimson Rose
Rose Beautiful Noel's Crimson
Laura
Regarding the previously-posted formula:
- You don't have to use Greek letters; you can replace those with words in your alphabet of choice.
- It needs a space between the quotes that are inside the TEXTSPLIT function.
- But even when fixed, it's making a column of the individual words, and counting those. As you note, that's not the desired outcome.
- And further, your performance titles seem to include "OF", so the word "of" will also be removed from the dog names. That would make (hypothetical) names "Embrace of Fur" and "Embrace Fur" the same. Most Excel functions are insensitive to the case of their arguments, and while that's usually a good thing, here it could be a problem.
You'll want to double-check and triple-check to see if you can obtain a version of the data that has better separators (e.g., a tab character, or two spaces instead of one) between the dog names and the performance titles. That data would be much easier to parse. If not…
There are three different tasks here: The first is to extract a cleaned-up version of the dog names; the second is to count the occurrences of each name; the third is to identify the maximum, and those dog names that have that number. The second and third tasks will be trivial once the first is done.
Let me know if I'm incorrect, but there seems to be some additional codes (also performance titles?) in front of the dog names (e.g., GCH, CH, MACH4). That also complicates possible solutions.
And there is the additional assumption that the dog names are spelled the same in each occurrence, so "J & D", "J&D", and "J and D" are three different dogs.
TASK 1
I think the most accurate (and certainly most flexible) solution will involve procedural code (in this case a user-defined function, written in VBA), not just formulas. So, save a copy of the worksheet in a macro-enabled workbook, then:
- open the VB Editor (Alt+F11);
- from the Insert menu insert a (standard) code module;
- ensure that there is an Option Explicit statement at the top of the module; and below it
- paste the following code:
Public Function ExtractDogName(ByVal OriginalValue As String _
, ExclusionList As String _
) As Variant
' This procedure extracts and returns a dog's name from the first
' argument, by removing "exclusion words/codes" from the front
' and rear of the original value.
' Note: Exclusion "words" are compared in a case-sensitive manner.
Dim strReturnValue As String
Dim strWords() As String
Dim strWord As String
Dim in4WordIndex As Long
'
Dim strExclusions() As String
Dim strExclusion As String
Dim in4Exclusion As Long
'
Dim blnNonExclusionWordFound As Boolean
'---- Enable an error handler.
On Error GoTo ExtractDN_ErrHndlr
'---- Split the input into words (retaining the order).
strWords = Split(OriginalValue, " ")
'---- Split the exclusion list into words or codes.
strExclusions = Split(ExclusionList, " ")
'---- Working from right to left in the array of words, replace exclusion
' words/codes with empty strings until a part of the dog name is
' presumably found.
blnNonExclusionWordFound = False
For in4WordIndex = UBound(strWords) To 0 Step -1
strWord = strWords(in4WordIndex)
For in4Exclusion = 0 To UBound(strExclusions) '(This array can _
be swept in either order.)
strExclusion = strExclusions(in4Exclusion)
'
' [To do a case-insensitive comparison, replace vbBinaryCompare
' in the next line with vbTextCompare.]
If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
strWord = ""
Exit For
End If
Next in4Exclusion
'
If Len(strWord) > 0 Then
blnNonExclusionWordFound = True
Else
strWords(in4WordIndex) = strWord
End If
'
If blnNonExclusionWordFound = True Then Exit For
Next in4WordIndex
'---- Working from left to right in the array of words, replace exclusion
' words/codes with empty strings until a part of the dog name is
' presumably found.
blnNonExclusionWordFound = False
For in4WordIndex = 0 To UBound(strWords)
strWord = strWords(in4WordIndex)
For in4Exclusion = 0 To UBound(strExclusions) '(This array can _
be swept in either order.)
strExclusion = strExclusions(in4Exclusion)
'
' [To do a case-insensitive comparison, replace vbBinaryCompare
' in the next line with vbTextCompare.]
If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
strWord = ""
Exit For
End If
Next in4Exclusion
'
If Len(strWord) > 0 Then
blnNonExclusionWordFound = True
Else
strWords(in4WordIndex) = strWord
End If
'
If blnNonExclusionWordFound = True Then Exit For
Next in4WordIndex
'---- Reconnect the component words.
strReturnValue = Join(strWords, " ")
strReturnValue = Trim$(strReturnValue)
''---- Collapse double spaces to a single space; repeat as needed.
'Do Until InStr(1, strReturnValue, " ") = 0
' strReturnValue = Replace(strReturnValue, " ", " ")
'Loop
'---- Assign the return value.
ExtractDogName = strReturnValue
ExtractDN_Exit:
Exit Function
ExtractDN_ErrHndlr:
Dim in4ErrorCode As Long
Dim strErrorDescr As String
' -- Capture info.
in4ErrorCode = Err.Number
strErrorDescr = Err.Description
' -- Output info.
Debug.Print "Error " & CStr(in4ErrorCode) & ": " & strErrorDescr _
& vbCrLf & " on " & OriginalValue
' -- Prepare to return an error indication.
ExtractDogName = CVErr(xlErrValue)
' --
Resume ExtractDN_Exit
End Function
Let's say that your words/codes to exclude are in A1:A34 of worksheet Exclusions, and your original dog names + performance titles stuff is in another worksheet, A2:A27001. Then the formula for column B (specifically, B2) to use that function is:
=ExtractDogName( A2, TEXTJOIN(" ",TRUE,Exclusions!$A$1:$A$34) )
TASK 2
To get a cumulative (from the top down) occurrence count, use this formula in column C (copying from C2 down as needed):
=COUNTIF(B$2:B2, B2)
TASK 3
To identify the largest occurrence count, and which dog has (or which dogs have) that occurrence count, use these formulas in D2 and E2, respectively:
=MAX(C2:C27001)
=FILTER(B2:B27001, C2:C27001=D2, "-none-")
Feel free to respond with questions.
- CDenneyApr 19, 2024Copper Contributor
SnowMan55 I apologize for reviving this thread. I am trying to use this code to find the most common phrases in a set of cells in a table. I continue to get a compiler error saying invalid outside procedure.
Do you know what might be causing this?
I may be misunderstanding the application of the code you posted, if so I apologize.
- peiyezhuApr 20, 2024Bronze ContributorOn Error GoTo ExtractDN_ErrHndlr
Please comment or remove this line,run again to find which line occurs error.
- Laura WrightMar 14, 2023Copper Contributor
SnowMan55 It worked like a charm!!! Thank you very much for your help! Very much appreciated!
- Laura WrightMar 10, 2023Copper ContributorWow, thanks! I've never used VBA in my life, but there's no time like the present to learn! I know someone who can probably help. Yes, in front of some dogs names there are also titles (those in front are specifically championship titles). Previously, I used Excel to remove them for me for another purpose (there are much fewer prefix titles than suffix titles), so I do have a list of dog names without those.
- JosWoolleyMar 09, 2023Iron Contributor
Re "It needs a space between the quotes that are inside the TEXTSPLIT function", many thanks - have now corrected.
Plus, I obviously misunderstood the requirement, as your post has made me realise.
Regards
- Laura WrightMar 10, 2023Copper ContributorNo worries, I appreciate the time you took to help.