Text Comparison

Question

Hi! I have two text columns and I need to compare them based on their similarity

The first column contains main keywords, the second column contains two lines with the terms and similar cases. Please note that terms may be in different order in the cells.

I need to calculate Jaccard Similarity.

0 indicates no similarity (no common words).
1 indicates complete similarity (both sets of words are identical).
Values between 0 and 1 represent partial similarity.

I tried this formula to extract the needed text between terns and similar_cases, but it gives me 1 value in each case: =COUNTA(UNIQUE(FILTER(C2; ISNUMBER(MATCH(C2; TEXTAFTER(TEXTBEFORE(C2; "similar_cases"); "terms:"); 0)))))

Please help!

Example of the table:

table_cloth, auto_store	terms: table_cloth similar_cases: window_cleaner, floor_cloth
auto_store	terms: auto_store similar_cases: window_cleaner, floor_cloth
remote_control, table_cloth	terms: table_cloth, remote_control similar_cases: window_cleaner, floor_cloth
auto_store	terms: none
remote_control	terms: table_cloth similar_cases: window_cleaner, floor_cloth

peterbartholomew1 · Answer

pinkyswear&nbsp;I failed to understand the layout of your data (the role of 'terms' and 'similar cases') so tackled a slightly different problem.I am using 365 so I used FILTER to determine the union and intersection counts using the Lambda function= LET(
    combined, TEXTJOIN(",", , sets),
    split,    TRIM(TEXTSPLIT(combined, ",")),
    union,    UNIQUE(split, TRUE),
    unique,   UNIQUE(split, TRUE, 1),
    error,    AND(ISERROR(unique)),
    1 - IF(error, 1, COUNTA(unique) / COUNTA(union))
  )At least I have now learnt the term 'Jaccard similarity'!

patrick2788 · Answer

pinkyswear&nbsp;I went with a solution using MAP:&nbsp;=LET(
    SplitTerms, LAMBDA(ArrA, ArrB,
        LET(
            TermsA, TEXTSPLIT(ArrA, , ", "),
            TermsB, TEXTSPLIT(ArrB, , ", "),
            AllTerms, VSTACK(TermsA, TermsB),
            matches, COUNT(XMATCH(TermsA, TermsB)),
            denom, COUNTA(UNIQUE(AllTerms)),
            matches / denom
        )
    ),
    MAP(ListA, ListB, SplitTerms)
)

nikolinode · Answer

pinkyswear&nbsp;To calculate the Jaccard Similarity between two sets of words in Excel, you can use a combination of formulas to extract and compare the words. The Jaccard Similarity is calculated as the intersection of the sets divided by the union of the sets.Here is a step-by-step guide on how to achieve this:Extract the Terms and Similar Cases: You will first need to extract the terms and similar cases from the cells. You can use Excel's text manipulation functions to do this.In cell D2 (assuming your data starts in cell A2), you can use the following formula to extract the terms:=TRIM(MID(A2, FIND("terms:", A2) + 7, FIND("similar_cases:", A2) - FIND("terms:", A2) - 7))&nbsp;In cell E2, you can extract the similar cases using a similar formula:=TRIM(MID(A2, FIND("similar_cases:", A2) + 14, LEN(A2) - FIND("similar_cases:", A2) - 13))Drag these formulas down to apply them to all rows in your dataset.&nbsp;Calculate Jaccard Similarity: In cell F2, you can calculate the Jaccard Similarity between the terms and similar cases using the following formula:=IF(AND(D2&lt;&gt;"", E2&lt;&gt;""), COUNT(1)/COUNTUNIQUE(FILTERXML("&lt;a&gt;&lt;b&gt;"&amp;SUBSTITUTE(D2, ", ", "&lt;/b&gt;&lt;b&gt;")&amp;"&lt;/b&gt;&lt;/a&gt;", "//b | //b[not(.=preceding::b)]")), 0)This formula checks if both terms and similar cases are non-empty. If they are, it calculates the Jaccard Similarity. If either of them is empty, it returns 0.Drag this formula down to apply it to all rows in your dataset.The above formulas will calculate the Jaccard Similarity for each pair of terms and similar cases in your dataset. If the terms and similar cases are identical, the similarity will be 1. If they have no common words, the similarity will be 0, and values between 0 and 1 represent partial similarity.The text, steps and functions were created with the help of AI.&nbsp;My answers are voluntary and without guarantee!&nbsp;Hope this will help you.

pinkyswear · Answer

Patrick2788&nbsp;and&nbsp;PeterBartholomew1&nbsp;&nbsp;Hi! Thank you very much for your reply!&nbsp;Unfortunately, it's giving me the following error: "The first argument of LET must be a valid name."&nbsp;Please let me know if I should do some particular changes in the formula.I am comparing cells C2 and G2.&nbsp;A more particular example of the table would be provided further:&nbsp;Thank you in advance! Appreciate your help!&nbsp;flower_shCategory: flower_shKeywords: tree care service, tree trimming, tree pruning, tree thinning, tree care, canopy balancingarchitec_servCategory:&nbsp;architec_servKeywords: interior designer, residential design, commercial design, interior design services, project management serviceflower_sh ,home_stCategory: home_stKeywords: custom made fire pits, state of the art equipment, laser cutting technologies, premium quality cuts, stainless steelhome_st&nbsp;Category: flower_shKeywords: lawn care, affordable prices, quality service, mowing, weed removal, grass blowing, snow-blowing

patrick2788 · Answer

pinkyswear&nbsp;This might be a good time to enable Advanced Formula Environment (AFE), if you don't already have access. AFE is much better with line breaks than the formula bar which may have mangled the formula.&nbsp;Excel Labs, a Microsoft Garage project (office.com)

Forum Discussion

Text Comparison

5 Replies

Resources

flower_sh	Category: flower_sh Keywords: tree care service, tree trimming, tree pruning, tree thinning, tree care, canopy balancing
architec_serv	Category: architec_serv Keywords: interior designer, residential design, commercial design, interior design services, project management service
flower_sh ,home_st	Category: home_st Keywords: custom made fire pits, state of the art equipment, laser cutting technologies, premium quality cuts, stainless steel
home_st	Category: flower_sh Keywords: lawn care, affordable prices, quality service, mowing, weed removal, grass blowing, snow-blowing