Forum Discussion
Finding the most frequent series of words in an excel file
- Mar 09, 2023
Regarding the previously-posted formula:
- You don't have to use Greek letters; you can replace those with words in your alphabet of choice.
- It needs a space between the quotes that are inside the TEXTSPLIT function.
- But even when fixed, it's making a column of the individual words, and counting those. As you note, that's not the desired outcome.
- And further, your performance titles seem to include "OF", so the word "of" will also be removed from the dog names. That would make (hypothetical) names "Embrace of Fur" and "Embrace Fur" the same. Most Excel functions are insensitive to the case of their arguments, and while that's usually a good thing, here it could be a problem.
You'll want to double-check and triple-check to see if you can obtain a version of the data that has better separators (e.g., a tab character, or two spaces instead of one) between the dog names and the performance titles. That data would be much easier to parse. If not…
There are three different tasks here: The first is to extract a cleaned-up version of the dog names; the second is to count the occurrences of each name; the third is to identify the maximum, and those dog names that have that number. The second and third tasks will be trivial once the first is done.
Let me know if I'm incorrect, but there seems to be some additional codes (also performance titles?) in front of the dog names (e.g., GCH, CH, MACH4). That also complicates possible solutions.
And there is the additional assumption that the dog names are spelled the same in each occurrence, so "J & D", "J&D", and "J and D" are three different dogs.
TASK 1
I think the most accurate (and certainly most flexible) solution will involve procedural code (in this case a user-defined function, written in VBA), not just formulas. So, save a copy of the worksheet in a macro-enabled workbook, then:- open the VB Editor (Alt+F11);
- from the Insert menu insert a (standard) code module;
- ensure that there is an Option Explicit statement at the top of the module; and below it
- paste the following code:
Public Function ExtractDogName(ByVal OriginalValue As String _ , ExclusionList As String _ ) As Variant ' This procedure extracts and returns a dog's name from the first ' argument, by removing "exclusion words/codes" from the front ' and rear of the original value. ' Note: Exclusion "words" are compared in a case-sensitive manner. Dim strReturnValue As String Dim strWords() As String Dim strWord As String Dim in4WordIndex As Long ' Dim strExclusions() As String Dim strExclusion As String Dim in4Exclusion As Long ' Dim blnNonExclusionWordFound As Boolean '---- Enable an error handler. On Error GoTo ExtractDN_ErrHndlr '---- Split the input into words (retaining the order). strWords = Split(OriginalValue, " ") '---- Split the exclusion list into words or codes. strExclusions = Split(ExclusionList, " ") '---- Working from right to left in the array of words, replace exclusion ' words/codes with empty strings until a part of the dog name is ' presumably found. blnNonExclusionWordFound = False For in4WordIndex = UBound(strWords) To 0 Step -1 strWord = strWords(in4WordIndex) For in4Exclusion = 0 To UBound(strExclusions) '(This array can _ be swept in either order.) strExclusion = strExclusions(in4Exclusion) ' ' [To do a case-insensitive comparison, replace vbBinaryCompare ' in the next line with vbTextCompare.] If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then strWord = "" Exit For End If Next in4Exclusion ' If Len(strWord) > 0 Then blnNonExclusionWordFound = True Else strWords(in4WordIndex) = strWord End If ' If blnNonExclusionWordFound = True Then Exit For Next in4WordIndex '---- Working from left to right in the array of words, replace exclusion ' words/codes with empty strings until a part of the dog name is ' presumably found. blnNonExclusionWordFound = False For in4WordIndex = 0 To UBound(strWords) strWord = strWords(in4WordIndex) For in4Exclusion = 0 To UBound(strExclusions) '(This array can _ be swept in either order.) strExclusion = strExclusions(in4Exclusion) ' ' [To do a case-insensitive comparison, replace vbBinaryCompare ' in the next line with vbTextCompare.] If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then strWord = "" Exit For End If Next in4Exclusion ' If Len(strWord) > 0 Then blnNonExclusionWordFound = True Else strWords(in4WordIndex) = strWord End If ' If blnNonExclusionWordFound = True Then Exit For Next in4WordIndex '---- Reconnect the component words. strReturnValue = Join(strWords, " ") strReturnValue = Trim$(strReturnValue) ''---- Collapse double spaces to a single space; repeat as needed. 'Do Until InStr(1, strReturnValue, " ") = 0 ' strReturnValue = Replace(strReturnValue, " ", " ") 'Loop '---- Assign the return value. ExtractDogName = strReturnValue ExtractDN_Exit: Exit Function ExtractDN_ErrHndlr: Dim in4ErrorCode As Long Dim strErrorDescr As String ' -- Capture info. in4ErrorCode = Err.Number strErrorDescr = Err.Description ' -- Output info. Debug.Print "Error " & CStr(in4ErrorCode) & ": " & strErrorDescr _ & vbCrLf & " on " & OriginalValue ' -- Prepare to return an error indication. ExtractDogName = CVErr(xlErrValue) ' -- Resume ExtractDN_Exit End Function
Let's say that your words/codes to exclude are in A1:A34 of worksheet Exclusions, and your original dog names + performance titles stuff is in another worksheet, A2:A27001. Then the formula for column B (specifically, B2) to use that function is:=ExtractDogName( A2, TEXTJOIN(" ",TRUE,Exclusions!$A$1:$A$34) )
TASK 2
To get a cumulative (from the top down) occurrence count, use this formula in column C (copying from C2 down as needed):=COUNTIF(B$2:B2, B2)
TASK 3To identify the largest occurrence count, and which dog has (or which dogs have) that occurrence count, use these formulas in D2 and E2, respectively:
=MAX(C2:C27001) =FILTER(B2:B27001, C2:C27001=D2, "-none-")
Feel free to respond with questions.
Thanks for you reply! Sorry for the late response - I thought I'd get an email when someone responded. I've just hit 'subscribe'.
Hmmm, you're absolutely right. I could easily create a list of those that I don't want to be considered. If I do that, how do I get it to look for the most common series of words, excluding what I tell it to exclude?
Assuming:
1) Your data is in A1:A23
2) Your list of exclusions is in H1:H8
=LET(
ζ,A1:A23,
ξ,H1:H8,
λ,2,
κ,TOCOL(REDUCE("",ζ,LAMBDA(α,β,VSTACK(α,TEXTSPLIT(β," ")))),3),
γ,UNIQUE(FILTER(κ,ISNA(XMATCH(κ,ξ)))),
TAKE(SORTBY(γ,BYROW(γ,LAMBDA(μ,SUM(N(κ=μ)))),-1),λ)
)
λ in the above (currently set to 2) determines the number of returns to be made.
Based on your posted data, if H1:H8 contains "CGC", "CH", "TKN", "OA", "OAJ", "CD", "OF" and "The", the above will return "Atomic" and "Tawny".
Regards
- Laura WrightMar 08, 2023Copper ContributorThank you. My next task will be googling how to type those Greek letters.
Just to make sure we're on the same page, if the 5 hypothetical dogs below appeared in the list just once each, it would still return Atomic and Tawny, because these dogs below are 5 distinct/different dogs who happen to have the same 4 words in their name. Order of the words matters, and I'm looking for the most frequent appearance of a series of words in the same order.
Noel's Beautiful Crimson Rose
Crimson Noel's Rose Beautiful
Noel's Rose Crimson Beautiful
Beautiful Noel's Crimson Rose
Rose Beautiful Noel's Crimson
Laura- SnowMan55Mar 09, 2023Bronze Contributor
Regarding the previously-posted formula:
- You don't have to use Greek letters; you can replace those with words in your alphabet of choice.
- It needs a space between the quotes that are inside the TEXTSPLIT function.
- But even when fixed, it's making a column of the individual words, and counting those. As you note, that's not the desired outcome.
- And further, your performance titles seem to include "OF", so the word "of" will also be removed from the dog names. That would make (hypothetical) names "Embrace of Fur" and "Embrace Fur" the same. Most Excel functions are insensitive to the case of their arguments, and while that's usually a good thing, here it could be a problem.
You'll want to double-check and triple-check to see if you can obtain a version of the data that has better separators (e.g., a tab character, or two spaces instead of one) between the dog names and the performance titles. That data would be much easier to parse. If not…
There are three different tasks here: The first is to extract a cleaned-up version of the dog names; the second is to count the occurrences of each name; the third is to identify the maximum, and those dog names that have that number. The second and third tasks will be trivial once the first is done.
Let me know if I'm incorrect, but there seems to be some additional codes (also performance titles?) in front of the dog names (e.g., GCH, CH, MACH4). That also complicates possible solutions.
And there is the additional assumption that the dog names are spelled the same in each occurrence, so "J & D", "J&D", and "J and D" are three different dogs.
TASK 1
I think the most accurate (and certainly most flexible) solution will involve procedural code (in this case a user-defined function, written in VBA), not just formulas. So, save a copy of the worksheet in a macro-enabled workbook, then:- open the VB Editor (Alt+F11);
- from the Insert menu insert a (standard) code module;
- ensure that there is an Option Explicit statement at the top of the module; and below it
- paste the following code:
Public Function ExtractDogName(ByVal OriginalValue As String _ , ExclusionList As String _ ) As Variant ' This procedure extracts and returns a dog's name from the first ' argument, by removing "exclusion words/codes" from the front ' and rear of the original value. ' Note: Exclusion "words" are compared in a case-sensitive manner. Dim strReturnValue As String Dim strWords() As String Dim strWord As String Dim in4WordIndex As Long ' Dim strExclusions() As String Dim strExclusion As String Dim in4Exclusion As Long ' Dim blnNonExclusionWordFound As Boolean '---- Enable an error handler. On Error GoTo ExtractDN_ErrHndlr '---- Split the input into words (retaining the order). strWords = Split(OriginalValue, " ") '---- Split the exclusion list into words or codes. strExclusions = Split(ExclusionList, " ") '---- Working from right to left in the array of words, replace exclusion ' words/codes with empty strings until a part of the dog name is ' presumably found. blnNonExclusionWordFound = False For in4WordIndex = UBound(strWords) To 0 Step -1 strWord = strWords(in4WordIndex) For in4Exclusion = 0 To UBound(strExclusions) '(This array can _ be swept in either order.) strExclusion = strExclusions(in4Exclusion) ' ' [To do a case-insensitive comparison, replace vbBinaryCompare ' in the next line with vbTextCompare.] If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then strWord = "" Exit For End If Next in4Exclusion ' If Len(strWord) > 0 Then blnNonExclusionWordFound = True Else strWords(in4WordIndex) = strWord End If ' If blnNonExclusionWordFound = True Then Exit For Next in4WordIndex '---- Working from left to right in the array of words, replace exclusion ' words/codes with empty strings until a part of the dog name is ' presumably found. blnNonExclusionWordFound = False For in4WordIndex = 0 To UBound(strWords) strWord = strWords(in4WordIndex) For in4Exclusion = 0 To UBound(strExclusions) '(This array can _ be swept in either order.) strExclusion = strExclusions(in4Exclusion) ' ' [To do a case-insensitive comparison, replace vbBinaryCompare ' in the next line with vbTextCompare.] If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then strWord = "" Exit For End If Next in4Exclusion ' If Len(strWord) > 0 Then blnNonExclusionWordFound = True Else strWords(in4WordIndex) = strWord End If ' If blnNonExclusionWordFound = True Then Exit For Next in4WordIndex '---- Reconnect the component words. strReturnValue = Join(strWords, " ") strReturnValue = Trim$(strReturnValue) ''---- Collapse double spaces to a single space; repeat as needed. 'Do Until InStr(1, strReturnValue, " ") = 0 ' strReturnValue = Replace(strReturnValue, " ", " ") 'Loop '---- Assign the return value. ExtractDogName = strReturnValue ExtractDN_Exit: Exit Function ExtractDN_ErrHndlr: Dim in4ErrorCode As Long Dim strErrorDescr As String ' -- Capture info. in4ErrorCode = Err.Number strErrorDescr = Err.Description ' -- Output info. Debug.Print "Error " & CStr(in4ErrorCode) & ": " & strErrorDescr _ & vbCrLf & " on " & OriginalValue ' -- Prepare to return an error indication. ExtractDogName = CVErr(xlErrValue) ' -- Resume ExtractDN_Exit End Function
Let's say that your words/codes to exclude are in A1:A34 of worksheet Exclusions, and your original dog names + performance titles stuff is in another worksheet, A2:A27001. Then the formula for column B (specifically, B2) to use that function is:=ExtractDogName( A2, TEXTJOIN(" ",TRUE,Exclusions!$A$1:$A$34) )
TASK 2
To get a cumulative (from the top down) occurrence count, use this formula in column C (copying from C2 down as needed):=COUNTIF(B$2:B2, B2)
TASK 3To identify the largest occurrence count, and which dog has (or which dogs have) that occurrence count, use these formulas in D2 and E2, respectively:
=MAX(C2:C27001) =FILTER(B2:B27001, C2:C27001=D2, "-none-")
Feel free to respond with questions.
- CDenneyApr 19, 2024Copper Contributor
SnowMan55 I apologize for reviving this thread. I am trying to use this code to find the most common phrases in a set of cells in a table. I continue to get a compiler error saying invalid outside procedure.
Do you know what might be causing this?
I may be misunderstanding the application of the code you posted, if so I apologize.