Mar 03 2023 06:54 AM
How can I find the most frequent series of words in an excel file? I know how to find the most frequent single word. I need to find the most frequent series of words.
Example of a very short list (my file has almost 27,000 of these that I need to search through):
Sienna Pointe's Fire On High OA OAJ CGC
Shine's Bright Like A Diamond OA OAJ NF
Sienna Pointe's Fire On High OA OAJ CGC
GCH CH Menny Azonnali Medal CD BN RN AX MXJ MXF T2B DCAT DN TKN
Semper Fi Pfc Jozee JH CGC
CH Solaris Palmetto Somebody To Love
Sincerus JE's Everglow What A Wonderful World BN RA OA OAJ OF BCAT ACT1
Patapsco Verity Sea Lynx CD BN RI CGC TKN
Remi XII CGC
CH Midnight Run's Beyond The Wall
CH Pursuit's Trek To The Triple Crown CGC
CH Rhapsody Hanalei Blues, Soul, & Rock'N'Roll
MACH4 Szizlin Fenwick's Shooting Star MXC PAD MJB2 OF T2B CGC TKN
MACH4 Szizlin Fenwick's Shooting Star MXC PAD MJB2 OF T2B CGC TKN
CH Mountain Laurel Summit To The Sea CD RE CGC TKA
CH Solaris Repeat Business RN JH CA RATN CGC TKN
Atomic Tawny SIN
Atomic Tawny SWN
CH Kurtzi N Nike's Pride Of The Mtns He's A Keeper JH TKN
Noel's Beautiful Crimson Rose CD
GCH CH Firelight's Swipe Right CA
Atomic Tawny SCN
Atomic Tawny SEN
I would like it to pick out the occurrence of Atomic Tawny as the most frequent series of words. These are the names of dogs and performance titles they have earned, if anyone is wondering.
Is there a way to do this?
Mar 03 2023 07:08 AM - edited Mar 03 2023 07:09 AM
It might be difficult, depending on what your definition of a 'word' is.
In that dataset, for example, 'Atomic' and 'Tawny' only appear 4 times each, whereas 'CGC' (10 times), 'CH' (9 times) and 'TKN' (6 times) all appear more times. You'd need to define your logic for what constitutes a 'word' if you wish examples such as those to not be considered.
Mar 07 2023 05:11 PM - edited Mar 07 2023 05:13 PM
Thanks for you reply! Sorry for the late response - I thought I'd get an email when someone responded. I've just hit 'subscribe'.
Hmmm, you're absolutely right. I could easily create a list of those that I don't want to be considered. If I do that, how do I get it to look for the most common series of words, excluding what I tell it to exclude?
Mar 08 2023 12:41 AM - edited Mar 09 2023 12:01 AM
Assuming:
1) Your data is in A1:A23
2) Your list of exclusions is in H1:H8
=LET(
ζ,A1:A23,
ξ,H1:H8,
λ,2,
κ,TOCOL(REDUCE("",ζ,LAMBDA(α,β,VSTACK(α,TEXTSPLIT(β," ")))),3),
γ,UNIQUE(FILTER(κ,ISNA(XMATCH(κ,ξ)))),
TAKE(SORTBY(γ,BYROW(γ,LAMBDA(μ,SUM(N(κ=μ)))),-1),λ)
)
λ in the above (currently set to 2) determines the number of returns to be made.
Based on your posted data, if H1:H8 contains "CGC", "CH", "TKN", "OA", "OAJ", "CD", "OF" and "The", the above will return "Atomic" and "Tawny".
Regards
Mar 08 2023 08:44 AM
Mar 08 2023 11:28 PM
SolutionRegarding the previously-posted formula:
You'll want to double-check and triple-check to see if you can obtain a version of the data that has better separators (e.g., a tab character, or two spaces instead of one) between the dog names and the performance titles. That data would be much easier to parse. If not…
There are three different tasks here: The first is to extract a cleaned-up version of the dog names; the second is to count the occurrences of each name; the third is to identify the maximum, and those dog names that have that number. The second and third tasks will be trivial once the first is done.
Let me know if I'm incorrect, but there seems to be some additional codes (also performance titles?) in front of the dog names (e.g., GCH, CH, MACH4). That also complicates possible solutions.
And there is the additional assumption that the dog names are spelled the same in each occurrence, so "J & D", "J&D", and "J and D" are three different dogs.
TASK 1
I think the most accurate (and certainly most flexible) solution will involve procedural code (in this case a user-defined function, written in VBA), not just formulas. So, save a copy of the worksheet in a macro-enabled workbook, then:
Public Function ExtractDogName(ByVal OriginalValue As String _
, ExclusionList As String _
) As Variant
' This procedure extracts and returns a dog's name from the first
' argument, by removing "exclusion words/codes" from the front
' and rear of the original value.
' Note: Exclusion "words" are compared in a case-sensitive manner.
Dim strReturnValue As String
Dim strWords() As String
Dim strWord As String
Dim in4WordIndex As Long
'
Dim strExclusions() As String
Dim strExclusion As String
Dim in4Exclusion As Long
'
Dim blnNonExclusionWordFound As Boolean
'---- Enable an error handler.
On Error GoTo ExtractDN_ErrHndlr
'---- Split the input into words (retaining the order).
strWords = Split(OriginalValue, " ")
'---- Split the exclusion list into words or codes.
strExclusions = Split(ExclusionList, " ")
'---- Working from right to left in the array of words, replace exclusion
' words/codes with empty strings until a part of the dog name is
' presumably found.
blnNonExclusionWordFound = False
For in4WordIndex = UBound(strWords) To 0 Step -1
strWord = strWords(in4WordIndex)
For in4Exclusion = 0 To UBound(strExclusions) '(This array can _
be swept in either order.)
strExclusion = strExclusions(in4Exclusion)
'
' [To do a case-insensitive comparison, replace vbBinaryCompare
' in the next line with vbTextCompare.]
If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
strWord = ""
Exit For
End If
Next in4Exclusion
'
If Len(strWord) > 0 Then
blnNonExclusionWordFound = True
Else
strWords(in4WordIndex) = strWord
End If
'
If blnNonExclusionWordFound = True Then Exit For
Next in4WordIndex
'---- Working from left to right in the array of words, replace exclusion
' words/codes with empty strings until a part of the dog name is
' presumably found.
blnNonExclusionWordFound = False
For in4WordIndex = 0 To UBound(strWords)
strWord = strWords(in4WordIndex)
For in4Exclusion = 0 To UBound(strExclusions) '(This array can _
be swept in either order.)
strExclusion = strExclusions(in4Exclusion)
'
' [To do a case-insensitive comparison, replace vbBinaryCompare
' in the next line with vbTextCompare.]
If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
strWord = ""
Exit For
End If
Next in4Exclusion
'
If Len(strWord) > 0 Then
blnNonExclusionWordFound = True
Else
strWords(in4WordIndex) = strWord
End If
'
If blnNonExclusionWordFound = True Then Exit For
Next in4WordIndex
'---- Reconnect the component words.
strReturnValue = Join(strWords, " ")
strReturnValue = Trim$(strReturnValue)
''---- Collapse double spaces to a single space; repeat as needed.
'Do Until InStr(1, strReturnValue, " ") = 0
' strReturnValue = Replace(strReturnValue, " ", " ")
'Loop
'---- Assign the return value.
ExtractDogName = strReturnValue
ExtractDN_Exit:
Exit Function
ExtractDN_ErrHndlr:
Dim in4ErrorCode As Long
Dim strErrorDescr As String
' -- Capture info.
in4ErrorCode = Err.Number
strErrorDescr = Err.Description
' -- Output info.
Debug.Print "Error " & CStr(in4ErrorCode) & ": " & strErrorDescr _
& vbCrLf & " on " & OriginalValue
' -- Prepare to return an error indication.
ExtractDogName = CVErr(xlErrValue)
' --
Resume ExtractDN_Exit
End Function
Let's say that your words/codes to exclude are in A1:A34 of worksheet Exclusions, and your original dog names + performance titles stuff is in another worksheet, A2:A27001. Then the formula for column B (specifically, B2) to use that function is:
=ExtractDogName( A2, TEXTJOIN(" ",TRUE,Exclusions!$A$1:$A$34) )
TASK 2
To get a cumulative (from the top down) occurrence count, use this formula in column C (copying from C2 down as needed):
=COUNTIF(B$2:B2, B2)
TASK 3
To identify the largest occurrence count, and which dog has (or which dogs have) that occurrence count, use these formulas in D2 and E2, respectively:
=MAX(C2:C27001)
=FILTER(B2:B27001, C2:C27001=D2, "-none-")
Feel free to respond with questions.
Mar 09 2023 12:04 AM
Re "It needs a space between the quotes that are inside the TEXTSPLIT function", many thanks - have now corrected.
Plus, I obviously misunderstood the requirement, as your post has made me realise.
Regards
Mar 09 2023 06:22 PM
Mar 09 2023 06:22 PM
Mar 13 2023 06:28 PM
@SnowMan55 It worked like a charm!!! Thank you very much for your help! Very much appreciated!
Apr 19 2024 03:30 PM
@SnowMan55 I apologize for reviving this thread. I am trying to use this code to find the most common phrases in a set of cells in a table. I continue to get a compiler error saying invalid outside procedure.
Do you know what might be causing this?
I may be misunderstanding the application of the code you posted, if so I apologize.
Apr 19 2024 05:04 PM
Apr 19 2024 11:50 PM
Mar 08 2023 11:28 PM
SolutionRegarding the previously-posted formula:
You'll want to double-check and triple-check to see if you can obtain a version of the data that has better separators (e.g., a tab character, or two spaces instead of one) between the dog names and the performance titles. That data would be much easier to parse. If not…
There are three different tasks here: The first is to extract a cleaned-up version of the dog names; the second is to count the occurrences of each name; the third is to identify the maximum, and those dog names that have that number. The second and third tasks will be trivial once the first is done.
Let me know if I'm incorrect, but there seems to be some additional codes (also performance titles?) in front of the dog names (e.g., GCH, CH, MACH4). That also complicates possible solutions.
And there is the additional assumption that the dog names are spelled the same in each occurrence, so "J & D", "J&D", and "J and D" are three different dogs.
TASK 1
I think the most accurate (and certainly most flexible) solution will involve procedural code (in this case a user-defined function, written in VBA), not just formulas. So, save a copy of the worksheet in a macro-enabled workbook, then:
Public Function ExtractDogName(ByVal OriginalValue As String _
, ExclusionList As String _
) As Variant
' This procedure extracts and returns a dog's name from the first
' argument, by removing "exclusion words/codes" from the front
' and rear of the original value.
' Note: Exclusion "words" are compared in a case-sensitive manner.
Dim strReturnValue As String
Dim strWords() As String
Dim strWord As String
Dim in4WordIndex As Long
'
Dim strExclusions() As String
Dim strExclusion As String
Dim in4Exclusion As Long
'
Dim blnNonExclusionWordFound As Boolean
'---- Enable an error handler.
On Error GoTo ExtractDN_ErrHndlr
'---- Split the input into words (retaining the order).
strWords = Split(OriginalValue, " ")
'---- Split the exclusion list into words or codes.
strExclusions = Split(ExclusionList, " ")
'---- Working from right to left in the array of words, replace exclusion
' words/codes with empty strings until a part of the dog name is
' presumably found.
blnNonExclusionWordFound = False
For in4WordIndex = UBound(strWords) To 0 Step -1
strWord = strWords(in4WordIndex)
For in4Exclusion = 0 To UBound(strExclusions) '(This array can _
be swept in either order.)
strExclusion = strExclusions(in4Exclusion)
'
' [To do a case-insensitive comparison, replace vbBinaryCompare
' in the next line with vbTextCompare.]
If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
strWord = ""
Exit For
End If
Next in4Exclusion
'
If Len(strWord) > 0 Then
blnNonExclusionWordFound = True
Else
strWords(in4WordIndex) = strWord
End If
'
If blnNonExclusionWordFound = True Then Exit For
Next in4WordIndex
'---- Working from left to right in the array of words, replace exclusion
' words/codes with empty strings until a part of the dog name is
' presumably found.
blnNonExclusionWordFound = False
For in4WordIndex = 0 To UBound(strWords)
strWord = strWords(in4WordIndex)
For in4Exclusion = 0 To UBound(strExclusions) '(This array can _
be swept in either order.)
strExclusion = strExclusions(in4Exclusion)
'
' [To do a case-insensitive comparison, replace vbBinaryCompare
' in the next line with vbTextCompare.]
If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then
strWord = ""
Exit For
End If
Next in4Exclusion
'
If Len(strWord) > 0 Then
blnNonExclusionWordFound = True
Else
strWords(in4WordIndex) = strWord
End If
'
If blnNonExclusionWordFound = True Then Exit For
Next in4WordIndex
'---- Reconnect the component words.
strReturnValue = Join(strWords, " ")
strReturnValue = Trim$(strReturnValue)
''---- Collapse double spaces to a single space; repeat as needed.
'Do Until InStr(1, strReturnValue, " ") = 0
' strReturnValue = Replace(strReturnValue, " ", " ")
'Loop
'---- Assign the return value.
ExtractDogName = strReturnValue
ExtractDN_Exit:
Exit Function
ExtractDN_ErrHndlr:
Dim in4ErrorCode As Long
Dim strErrorDescr As String
' -- Capture info.
in4ErrorCode = Err.Number
strErrorDescr = Err.Description
' -- Output info.
Debug.Print "Error " & CStr(in4ErrorCode) & ": " & strErrorDescr _
& vbCrLf & " on " & OriginalValue
' -- Prepare to return an error indication.
ExtractDogName = CVErr(xlErrValue)
' --
Resume ExtractDN_Exit
End Function
Let's say that your words/codes to exclude are in A1:A34 of worksheet Exclusions, and your original dog names + performance titles stuff is in another worksheet, A2:A27001. Then the formula for column B (specifically, B2) to use that function is:
=ExtractDogName( A2, TEXTJOIN(" ",TRUE,Exclusions!$A$1:$A$34) )
TASK 2
To get a cumulative (from the top down) occurrence count, use this formula in column C (copying from C2 down as needed):
=COUNTIF(B$2:B2, B2)
TASK 3
To identify the largest occurrence count, and which dog has (or which dogs have) that occurrence count, use these formulas in D2 and E2, respectively:
=MAX(C2:C27001)
=FILTER(B2:B27001, C2:C27001=D2, "-none-")
Feel free to respond with questions.