Forum Discussion
pinkyswear
Aug 31, 2023Copper Contributor
Text Comparison
Hi! I have two text columns and I need to compare them based on their similarity
The first column contains main keywords, the second column contains two lines with the terms and similar cases. Please note that terms may be in different order in the cells.
I need to calculate Jaccard Similarity.
- 0 indicates no similarity (no common words).
- 1 indicates complete similarity (both sets of words are identical).
- Values between 0 and 1 represent partial similarity.
I tried this formula to extract the needed text between terns and similar_cases, but it gives me 1 value in each case: =COUNTA(UNIQUE(FILTER(C2; ISNUMBER(MATCH(C2; TEXTAFTER(TEXTBEFORE(C2; "similar_cases"); "terms:"); 0)))))
Please help!
Example of the table:
table_cloth, auto_store | terms: table_cloth similar_cases: window_cleaner, floor_cloth |
auto_store | terms: auto_store similar_cases: window_cleaner, floor_cloth |
remote_control, table_cloth | terms: table_cloth, remote_control similar_cases: window_cleaner, floor_cloth |
auto_store | terms: none |
remote_control | terms: table_cloth similar_cases: window_cleaner, floor_cloth |
- Patrick2788Silver Contributor
I went with a solution using MAP:
=LET( SplitTerms, LAMBDA(ArrA, ArrB, LET( TermsA, TEXTSPLIT(ArrA, , ", "), TermsB, TEXTSPLIT(ArrB, , ", "), AllTerms, VSTACK(TermsA, TermsB), matches, COUNT(XMATCH(TermsA, TermsB)), denom, COUNTA(UNIQUE(AllTerms)), matches / denom ) ), MAP(ListA, ListB, SplitTerms) )
- pinkyswearCopper Contributor
Patrick2788 and PeterBartholomew1
Hi! Thank you very much for your reply!
Unfortunately, it's giving me the following error: "The first argument of LET must be a valid name."
Please let me know if I should do some particular changes in the formula.
I am comparing cells C2 and G2.
A more particular example of the table would be provided further:
Thank you in advance! Appreciate your help!
flower_sh Category: flower_sh
Keywords: tree care service, tree trimming, tree pruning, tree thinning, tree care, canopy balancingarchitec_serv Category: architec_serv
Keywords: interior designer, residential design, commercial design, interior design services, project management serviceflower_sh ,home_st Category: home_st
Keywords: custom made fire pits, state of the art equipment, laser cutting technologies, premium quality cuts, stainless steelhome_st Category: flower_sh
Keywords: lawn care, affordable prices, quality service, mowing, weed removal, grass blowing, snow-blowing- Patrick2788Silver Contributor
This might be a good time to enable Advanced Formula Environment (AFE), if you don't already have access. AFE is much better with line breaks than the formula bar which may have mangled the formula.
- PeterBartholomew1Silver Contributor
I failed to understand the layout of your data (the role of 'terms' and 'similar cases') so tackled a slightly different problem.
I am using 365 so I used FILTER to determine the union and intersection counts using the Lambda function
= LET( combined, TEXTJOIN(",", , sets), split, TRIM(TEXTSPLIT(combined, ",")), union, UNIQUE(split, TRUE), unique, UNIQUE(split, TRUE, 1), error, AND(ISERROR(unique)), 1 - IF(error, 1, COUNTA(unique) / COUNTA(union)) )
At least I have now learnt the term 'Jaccard similarity'!
- NikolinoDEGold Contributor
To calculate the Jaccard Similarity between two sets of words in Excel, you can use a combination of formulas to extract and compare the words. The Jaccard Similarity is calculated as the intersection of the sets divided by the union of the sets.
Here is a step-by-step guide on how to achieve this:
- Extract the Terms and Similar Cases: You will first need to extract the terms and similar cases from the cells. You can use Excel's text manipulation functions to do this.
In cell D2 (assuming your data starts in cell A2), you can use the following formula to extract the terms:
=TRIM(MID(A2, FIND("terms:", A2) + 7, FIND("similar_cases:", A2) - FIND("terms:", A2) - 7))
In cell E2, you can extract the similar cases using a similar formula:
=TRIM(MID(A2, FIND("similar_cases:", A2) + 14, LEN(A2) - FIND("similar_cases:", A2) - 13))
Drag these formulas down to apply them to all rows in your dataset.
- Calculate Jaccard Similarity: In cell F2, you can calculate the Jaccard Similarity between the terms and similar cases using the following formula:
=IF(AND(D2<>"", E2<>""), COUNT(1)/COUNTUNIQUE(FILTERXML("<a><b>"&SUBSTITUTE(D2, ", ", "</b><b>")&"</b></a>", "//b | //b[not(.=preceding::b)]")), 0)
This formula checks if both terms and similar cases are non-empty. If they are, it calculates the Jaccard Similarity. If either of them is empty, it returns 0.
Drag this formula down to apply it to all rows in your dataset.
The above formulas will calculate the Jaccard Similarity for each pair of terms and similar cases in your dataset. If the terms and similar cases are identical, the similarity will be 1. If they have no common words, the similarity will be 0, and values between 0 and 1 represent partial similarity.
The text, steps and functions were created with the help of AI.
My answers are voluntary and without guarantee!
Hope this will help you.