Forum Discussion

pinkyswear's avatar
pinkyswear
Copper Contributor
Aug 31, 2023

Text Comparison

Hi! I have two text columns and I need to compare them based on their similarity

The first column contains main keywords, the second column contains two lines with the terms and similar cases. Please note that terms may be in different order in the cells.

I need to calculate Jaccard Similarity.

  • 0 indicates no similarity (no common words).
  • 1 indicates complete similarity (both sets of words are identical).
  • Values between 0 and 1 represent partial similarity.

I tried this formula to extract the needed text between terns and similar_cases, but it gives me 1 value in each case: =COUNTA(UNIQUE(FILTER(C2; ISNUMBER(MATCH(C2; TEXTAFTER(TEXTBEFORE(C2; "similar_cases"); "terms:"); 0)))))

 

Please help!

 

Example of the table: 

 

table_cloth, auto_store

terms: table_cloth

similar_cases: window_cleaner, floor_cloth

auto_store

terms: auto_store

similar_cases: window_cleaner, floor_cloth

remote_control, table_cloth

terms: table_cloth, remote_control

similar_cases: window_cleaner, floor_cloth

auto_store

terms: none

remote_control

terms: table_cloth

similar_cases: window_cleaner, floor_cloth

  • Patrick2788's avatar
    Patrick2788
    Silver Contributor

    pinkyswear 

    I went with a solution using MAP:

     

    =LET(
        SplitTerms, LAMBDA(ArrA, ArrB,
            LET(
                TermsA, TEXTSPLIT(ArrA, , ", "),
                TermsB, TEXTSPLIT(ArrB, , ", "),
                AllTerms, VSTACK(TermsA, TermsB),
                matches, COUNT(XMATCH(TermsA, TermsB)),
                denom, COUNTA(UNIQUE(AllTerms)),
                matches / denom
            )
        ),
        MAP(ListA, ListB, SplitTerms)
    )
    • pinkyswear's avatar
      pinkyswear
      Copper Contributor

      Patrick2788 and PeterBartholomew1 

       

      Hi! Thank you very much for your reply!

       

      Unfortunately, it's giving me the following error: "The first argument of LET must be a valid name."

       

      Please let me know if I should do some particular changes in the formula.

      I am comparing cells C2 and G2.

       

      A more particular example of the table would be provided further:

       

      Thank you in advance! Appreciate your help!

       

      flower_shCategory: flower_sh

      Keywords: tree care service, tree trimming, tree pruning, tree thinning, tree care, canopy balancing
      architec_servCategory: architec_serv

      Keywords: interior designer, residential design, commercial design, interior design services, project management service
      flower_sh ,home_stCategory: home_st

      Keywords: custom made fire pits, state of the art equipment, laser cutting technologies, premium quality cuts, stainless steel
      home_st Category: flower_sh

      Keywords: lawn care, affordable prices, quality service, mowing, weed removal, grass blowing, snow-blowing
  • pinkyswear 

    I failed to understand the layout of your data (the role of 'terms' and 'similar cases') so tackled a slightly different problem.

    I am using 365 so I used FILTER to determine the union and intersection counts using the Lambda function

    = LET(
        combined, TEXTJOIN(",", , sets),
        split,    TRIM(TEXTSPLIT(combined, ",")),
        union,    UNIQUE(split, TRUE),
        unique,   UNIQUE(split, TRUE, 1),
        error,    AND(ISERROR(unique)),
        1 - IF(error, 1, COUNTA(unique) / COUNTA(union))
      )

    At least I have now learnt the term 'Jaccard similarity'!

  • NikolinoDE's avatar
    NikolinoDE
    Gold Contributor

    pinkyswear 

    To calculate the Jaccard Similarity between two sets of words in Excel, you can use a combination of formulas to extract and compare the words. The Jaccard Similarity is calculated as the intersection of the sets divided by the union of the sets.

    Here is a step-by-step guide on how to achieve this:

    1. Extract the Terms and Similar Cases: You will first need to extract the terms and similar cases from the cells. You can use Excel's text manipulation functions to do this.

    In cell D2 (assuming your data starts in cell A2), you can use the following formula to extract the terms:

    =TRIM(MID(A2, FIND("terms:", A2) + 7, FIND("similar_cases:", A2) - FIND("terms:", A2) - 7))

     

    In cell E2, you can extract the similar cases using a similar formula:

    =TRIM(MID(A2, FIND("similar_cases:", A2) + 14, LEN(A2) - FIND("similar_cases:", A2) - 13))

    Drag these formulas down to apply them to all rows in your dataset.

     

    1. Calculate Jaccard Similarity: In cell F2, you can calculate the Jaccard Similarity between the terms and similar cases using the following formula:

    =IF(AND(D2<>"", E2<>""), COUNT(1)/COUNTUNIQUE(FILTERXML("<a><b>"&SUBSTITUTE(D2, ", ", "</b><b>")&"</b></a>", "//b | //b[not(.=preceding::b)]")), 0)

    This formula checks if both terms and similar cases are non-empty. If they are, it calculates the Jaccard Similarity. If either of them is empty, it returns 0.

    Drag this formula down to apply it to all rows in your dataset.

    The above formulas will calculate the Jaccard Similarity for each pair of terms and similar cases in your dataset. If the terms and similar cases are identical, the similarity will be 1. If they have no common words, the similarity will be 0, and values between 0 and 1 represent partial similarity.

    The text, steps and functions were created with the help of AI.

     

    My answers are voluntary and without guarantee!

     

    Hope this will help you.

Resources