Forum Discussion
Amber McCurtis
Jun 09, 2017Copper Contributor
Help counting duplicate values/words
Hello,
I have a worksheet with about 80 responses that are in a sentence format that I am trying to figure out how to count and sort keywords that are used multiple times. These are verbatim response where people have said the same or similar things and I'm trying to see what are the top 20 words/terms used. Example:
1.) I wish human resources was more honest
2.) Honesty and integrity are important to me
3.) The supervisor needs to be more upfront and honest.
So if I want to count duplicate words how would I do that? In this case, it would be the word honest but, I need to find multiple duplicate words throughout the worksheet which is 80 rows of text. Can anyone help with this?
Thank you!
Amber,
Back to this. I don't know if you are familiar with Power Query / Get & Transform in Excel, but with it the task could be solved quite easy. As the basis we may take published by Chris Webb solution Finding Shakespeare’s Favourite Words With Data Explorer (Power Query) which is very good explained and could be applied in your case practically without changes.
It is in the attached file. Simply copy/paste your text strings into the table Answers, in Ribbon->Data->Refresh All and the list of multiply words is in the next sheet.
Code of the query whic performs the task is below, it's practicaly self-explained, more details in the above blog. If you know Power Query at least a bit you may easy modify it if necessary, for example add more words to exclude from the calculation (in general could be taken from another table in Excel, but takes time to implement).
/* That is literally Chris Webb's solution "Finding Shakespeare’s Favourite Words With Data Explorer (Power Query)" published on March 15, 2013 at https://blog.crossjoin.co.uk/2013/03/15/finding-shakespeares-favourite-words-with-data-explorer/ */ let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content], OurList = Table.ToList(Source), RemoveLineBreaks = Lines.ToText(OurList, " "), RemovePunctuation = Text.Remove(RemoveLineBreaks, { "," ,"." ,"?" ,";" ,":" ,";" ,"'", "@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*" } ), Lowercase = Text.Lower(RemovePunctuation), TurnIntoTable = Table.FromValue(Lowercase), SplitIntoWords = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()}), ListOfWords = SplitIntoWords{0}[Value], TableFromList = Table.FromList(ListOfWords, Splitter.SplitByNothing(), null, null, ExtraValues.Error), RenameColumnToWord = Table.RenameColumns(TableFromList,{{"Column1", "Word"}}), RemoveBlanks = Table.SelectRows(RenameColumnToWord, each ( [Word] <> "" and [Word] <> "1" and [Word] <> "2" and [Word] <> "3" and [Word] <> "4" and [Word] <> "5" and [Word] <> "6" and [Word] <> "7" and [Word] <> "8" and [Word] <> "9" and [Word] <> "0" and [Word] <> "i" and [Word] <> "me" and [Word] <> "and" and [Word] <> "are" and [Word] <> "be" and [Word] <> "the" and [Word] <> "to" and [Word] <> "was" ) ), FindWordCounts = Table.Group(RemoveBlanks, {"Word"}, {{"Count", each Table.RowCount(_), type number}}), SortedRows = Table.Sort(FindWordCounts,{{"Count", Order.Descending}}), RemoveSingleWords = Table.SelectRows(SortedRows, each [Count] > 1) in RemoveSingleWords
- Amber McCurtisCopper ContributorThank you! This is really over my head. Maybe I can put the text in Microsoft word and accomplish the same thing?
Amber, you may use Microsoft Word, whatever, if you are only able to save your text as .txt file. Since you don't need formatting for frequency analysis that shall not be an issue. With Word that's simple, you always may do Save As into txt.
From Power Query solution point of view that's easy, just couple more lines in the script. I'll update in a while.
Hi Amber,
Perhaps better to do with Power Query, but here is the manual solution which could work for one-time job, for repeated one better to find more elegant solution.
1) Select your rows with text and replace dots, commas, etc on nothing
2) From Ribbon->Data->Text to Columns split your text on separate words (one in column) using space as separator
3) If your first column is A, copy/paste next ones sequentially at the bottom of A (better to do with Table, but let keep simple cells)
4) If words starts in A1 enter in B1 formula
=COUNTIF(A:A, A1)
and copy it down to the end of your range with words
5) Filter here (or on previous steps) non-calculated words ("to", "and", etc.) and empty cells
6) Sort columns A and B in descending order by column B
In general you shall receive the result. Above is just and idea, I'm in shortage of time to play with this right now, and especially with Step 3.
- Amber McCurtisCopper Contributor
Thank you, Sergei! I'm not familiar with Power Query and not really experienced in Excel. I've tried following your instruction with no success. I've attached an example in my original post. I'm trying to extract the top 20 words/terms used from the worksheet. I've been trying to look on YouTube because I'm visual but, still no luck.
Hi Amber,
When things bit more complex, but since that's practically ready-to-use solution you may use it as black box so far.
First, you have to have Power Query with your Excel. If you are on version Excel 2016 it's built-in (and called Get & Transform now). Nothing to do here.
If you are on earlier version
you need to download and install Microsoft Power Query for Excel add-in. Installation is straight forward and usually no problems are here.
If you are on even earlier version of Excel - no luck with this solution.
I'll check your file when understand how to download it - TechComm changed the interface and so far i see no way to do that.