Forum Discussion
Finding the most frequent series of words in an excel file
How can I find the most frequent series of words in an excel file? I know how to find the most frequent single word. I need to find the most frequent series of words.
Example of a very short list (my file has almost 27,000 of these that I need to search through):
Sienna Pointe's Fire On High OA OAJ CGC
Shine's Bright Like A Diamond OA OAJ NF
Sienna Pointe's Fire On High OA OAJ CGC
GCH CH Menny Azonnali Medal CD BN RN AX MXJ MXF T2B DCAT DN TKN
Semper Fi Pfc Jozee JH CGC
CH Solaris Palmetto Somebody To Love
Sincerus JE's Everglow What A Wonderful World BN RA OA OAJ OF BCAT ACT1
Patapsco Verity Sea Lynx CD BN RI CGC TKN
Remi XII CGC
CH Midnight Run's Beyond The Wall
CH Pursuit's Trek To The Triple Crown CGC
CH Rhapsody Hanalei Blues, Soul, & Rock'N'Roll
MACH4 Szizlin Fenwick's Shooting Star MXC PAD MJB2 OF T2B CGC TKN
MACH4 Szizlin Fenwick's Shooting Star MXC PAD MJB2 OF T2B CGC TKN
CH Mountain Laurel Summit To The Sea CD RE CGC TKA
CH Solaris Repeat Business RN JH CA RATN CGC TKN
Atomic Tawny SIN
Atomic Tawny SWN
CH Kurtzi N Nike's Pride Of The Mtns He's A Keeper JH TKN
Noel's Beautiful Crimson Rose CD
GCH CH Firelight's Swipe Right CA
Atomic Tawny SCN
Atomic Tawny SEN
I would like it to pick out the occurrence of Atomic Tawny as the most frequent series of words. These are the names of dogs and performance titles they have earned, if anyone is wondering.
Is there a way to do this?
Regarding the previously-posted formula:
- You don't have to use Greek letters; you can replace those with words in your alphabet of choice.
- It needs a space between the quotes that are inside the TEXTSPLIT function.
- But even when fixed, it's making a column of the individual words, and counting those. As you note, that's not the desired outcome.
- And further, your performance titles seem to include "OF", so the word "of" will also be removed from the dog names. That would make (hypothetical) names "Embrace of Fur" and "Embrace Fur" the same. Most Excel functions are insensitive to the case of their arguments, and while that's usually a good thing, here it could be a problem.
You'll want to double-check and triple-check to see if you can obtain a version of the data that has better separators (e.g., a tab character, or two spaces instead of one) between the dog names and the performance titles. That data would be much easier to parse. If not…
There are three different tasks here: The first is to extract a cleaned-up version of the dog names; the second is to count the occurrences of each name; the third is to identify the maximum, and those dog names that have that number. The second and third tasks will be trivial once the first is done.
Let me know if I'm incorrect, but there seems to be some additional codes (also performance titles?) in front of the dog names (e.g., GCH, CH, MACH4). That also complicates possible solutions.
And there is the additional assumption that the dog names are spelled the same in each occurrence, so "J & D", "J&D", and "J and D" are three different dogs.
TASK 1
I think the most accurate (and certainly most flexible) solution will involve procedural code (in this case a user-defined function, written in VBA), not just formulas. So, save a copy of the worksheet in a macro-enabled workbook, then:- open the VB Editor (Alt+F11);
- from the Insert menu insert a (standard) code module;
- ensure that there is an Option Explicit statement at the top of the module; and below it
- paste the following code:
Public Function ExtractDogName(ByVal OriginalValue As String _ , ExclusionList As String _ ) As Variant ' This procedure extracts and returns a dog's name from the first ' argument, by removing "exclusion words/codes" from the front ' and rear of the original value. ' Note: Exclusion "words" are compared in a case-sensitive manner. Dim strReturnValue As String Dim strWords() As String Dim strWord As String Dim in4WordIndex As Long ' Dim strExclusions() As String Dim strExclusion As String Dim in4Exclusion As Long ' Dim blnNonExclusionWordFound As Boolean '---- Enable an error handler. On Error GoTo ExtractDN_ErrHndlr '---- Split the input into words (retaining the order). strWords = Split(OriginalValue, " ") '---- Split the exclusion list into words or codes. strExclusions = Split(ExclusionList, " ") '---- Working from right to left in the array of words, replace exclusion ' words/codes with empty strings until a part of the dog name is ' presumably found. blnNonExclusionWordFound = False For in4WordIndex = UBound(strWords) To 0 Step -1 strWord = strWords(in4WordIndex) For in4Exclusion = 0 To UBound(strExclusions) '(This array can _ be swept in either order.) strExclusion = strExclusions(in4Exclusion) ' ' [To do a case-insensitive comparison, replace vbBinaryCompare ' in the next line with vbTextCompare.] If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then strWord = "" Exit For End If Next in4Exclusion ' If Len(strWord) > 0 Then blnNonExclusionWordFound = True Else strWords(in4WordIndex) = strWord End If ' If blnNonExclusionWordFound = True Then Exit For Next in4WordIndex '---- Working from left to right in the array of words, replace exclusion ' words/codes with empty strings until a part of the dog name is ' presumably found. blnNonExclusionWordFound = False For in4WordIndex = 0 To UBound(strWords) strWord = strWords(in4WordIndex) For in4Exclusion = 0 To UBound(strExclusions) '(This array can _ be swept in either order.) strExclusion = strExclusions(in4Exclusion) ' ' [To do a case-insensitive comparison, replace vbBinaryCompare ' in the next line with vbTextCompare.] If StrComp(strWord, strExclusion, vbBinaryCompare) = 0 Then strWord = "" Exit For End If Next in4Exclusion ' If Len(strWord) > 0 Then blnNonExclusionWordFound = True Else strWords(in4WordIndex) = strWord End If ' If blnNonExclusionWordFound = True Then Exit For Next in4WordIndex '---- Reconnect the component words. strReturnValue = Join(strWords, " ") strReturnValue = Trim$(strReturnValue) ''---- Collapse double spaces to a single space; repeat as needed. 'Do Until InStr(1, strReturnValue, " ") = 0 ' strReturnValue = Replace(strReturnValue, " ", " ") 'Loop '---- Assign the return value. ExtractDogName = strReturnValue ExtractDN_Exit: Exit Function ExtractDN_ErrHndlr: Dim in4ErrorCode As Long Dim strErrorDescr As String ' -- Capture info. in4ErrorCode = Err.Number strErrorDescr = Err.Description ' -- Output info. Debug.Print "Error " & CStr(in4ErrorCode) & ": " & strErrorDescr _ & vbCrLf & " on " & OriginalValue ' -- Prepare to return an error indication. ExtractDogName = CVErr(xlErrValue) ' -- Resume ExtractDN_Exit End Function
Let's say that your words/codes to exclude are in A1:A34 of worksheet Exclusions, and your original dog names + performance titles stuff is in another worksheet, A2:A27001. Then the formula for column B (specifically, B2) to use that function is:=ExtractDogName( A2, TEXTJOIN(" ",TRUE,Exclusions!$A$1:$A$34) )
TASK 2
To get a cumulative (from the top down) occurrence count, use this formula in column C (copying from C2 down as needed):=COUNTIF(B$2:B2, B2)
TASK 3To identify the largest occurrence count, and which dog has (or which dogs have) that occurrence count, use these formulas in D2 and E2, respectively:
=MAX(C2:C27001) =FILTER(B2:B27001, C2:C27001=D2, "-none-")
Feel free to respond with questions.
- JosWoolleyIron Contributor
It might be difficult, depending on what your definition of a 'word' is.
In that dataset, for example, 'Atomic' and 'Tawny' only appear 4 times each, whereas 'CGC' (10 times), 'CH' (9 times) and 'TKN' (6 times) all appear more times. You'd need to define your logic for what constitutes a 'word' if you wish examples such as those to not be considered.
- Laura WrightCopper Contributor
Thanks for you reply! Sorry for the late response - I thought I'd get an email when someone responded. I've just hit 'subscribe'.
Hmmm, you're absolutely right. I could easily create a list of those that I don't want to be considered. If I do that, how do I get it to look for the most common series of words, excluding what I tell it to exclude?- JosWoolleyIron Contributor
Assuming:
1) Your data is in A1:A23
2) Your list of exclusions is in H1:H8
=LET( ζ,A1:A23, ξ,H1:H8, λ,2, κ,TOCOL(REDUCE("",ζ,LAMBDA(α,β,VSTACK(α,TEXTSPLIT(β," ")))),3), γ,UNIQUE(FILTER(κ,ISNA(XMATCH(κ,ξ)))), TAKE(SORTBY(γ,BYROW(γ,LAMBDA(μ,SUM(N(κ=μ)))),-1),λ) )
λ in the above (currently set to 2) determines the number of returns to be made.
Based on your posted data, if H1:H8 contains "CGC", "CH", "TKN", "OA", "OAJ", "CD", "OF" and "The", the above will return "Atomic" and "Tawny".
Regards
- PeterBartholomew1Silver ContributorYes, I was certainly wondering!
There is a huge amount of self-inflicted damage in this list to undo before one can even consider the type of analysis you require. Why were the names of dogs and their performance titles ever concatenated within a cell?
It makes it difficult to even get to the starting point of generating a list of distinct dogs and another for .performance titles.