Forum Discussion

amit_bhola's avatar
amit_bhola
Iron Contributor
Dec 28, 2024

Lists Comparison (Stacked & Keeping order)

I've been trying to develop an Excel function which would compare Lists, each of single column arrays and generate a stacked (side by side) comparison aligning the matched (same) elements and keeping the original order of elements of each list intact in the resulting comparison.

A Two-Lists comparison function has been developed. However, it would be great if it can be scaled up efficiently to 3 or more lists comparison. There are ideas on adopting the 2-list comparison function to greater number of lists, but the present approach is slow! and it seems that the 2-list comparison itself is not done the best way perhaps.

Sharing in attached excel file, the worked-out function till now, and looking for better ideas!

  • m_tarler's avatar
    m_tarler
    Dec 31, 2024

    So I worked your sheet a little further and you can tell me if the performance is still holding up.  Basically once you find the list of 'imaginary numbers' I then do a complete list of all possibilities and then filter out the possibilities that are increasing and then pull the list with the most in the list

    see attached

  • m_tarler's avatar
    m_tarler
    Steel Contributor

    I also think you have a problem with multiple solutions and/or undefined optimization.  You could easily come up with solutions that are likely far from optimal:  ABCDEF could match with BCDEFA in 2 ways:

    A_B_C_D_E_F
    x _B_C_D_E_F_A

    or

    x _x_x_ x_x_A_B_C_D_E_F
    B_C_D_E_F_A

    The former being much "better" match but I don't believe you have any 'rule' why the latter isn't a valid solution.  That said, I didn't break down your existing solution to see if it checks for that but it seams to be 'single path' iterative based so I don't think so.  we also know excel works with arrays more efficiently.  I have the following start that I feel could be useful but maybe you already tried and couldn't get further.  So on your sample sheet I took the 1st 'DEMO' and put the 2nd col into a row and found the matix of matches (shown using complex #s for row, column.  The last 2 columns (V & W) use those matches and sort the 'real'/rows based on the 'imaginary'/cols and then the reverse.  Any 'path' that increases both the row and column is a potential solution, so w/r to the 2 final columns any monotonically increasing set would be a potential solution of matching values (except these :

    in this case I see 4 (non-trivial, i.e. sequences that do not exclude possible matches in between) match solutions:

    2+3i, 6+5i, 8+6i, 9+9i
    2+3i, 6+5i, 7+10i
    2+3i, 5+8i, 6+9i, 7+10i
    2+3i, 5+8i, 9+9i

    So although this step makes it much easier for me to find those solutions I'm not seeing a way to functionally do it.  Like it would be nice if there was a trick using MMULT or something to give potential solutions and then use the one that is the longest.  In THIS case you can easily go row by row or column by column to find the 2 'best' solutions but that would NOT be the case if the 1st column ended with JUL and the 2nd column ended with MAY as those would have to be 'skipped' in each corresponsing case to avoid the 2 match solution.

     

     

    • amit_bhola's avatar
      amit_bhola
      Iron Contributor

      Thanks m_tarler for the idea about holding pairs of candidate matched-indices as complex nos.

      I studied (see attachment), and i expect that filtering-out the monotonically increasing nos. would still need recursion which i already implemented and due to it, as Mr. Peter pointed out, the order of complexity is expected to remain exponential. Thanks for your suggestions. When time permits, i may explore the approach further nonetheless.

      • m_tarler's avatar
        m_tarler
        Steel Contributor

        So I worked your sheet a little further and you can tell me if the performance is still holding up.  Basically once you find the list of 'imaginary numbers' I then do a complete list of all possibilities and then filter out the possibilities that are increasing and then pull the list with the most in the list

        see attached

  • amit_bhola's avatar
    amit_bhola
    Iron Contributor

    In attached Excel file, 3-Lists comparison function implemented - but the calc is so slow that Excel becomes unresponsive. Have set the calculation to Manual.

    Looking for better ideas for this function which is faster and practical.

    Thanks in advance.

     

    • peiyezhu's avatar
      peiyezhu
      Bronze Contributor

       

      I guess you want to align same values in one row.

      For example,All Feb need in row 3 in below picture instead of in line 2 and  line 7(two Febs)

       

      One possible way is transform to one dimension and group by values and value index as key.

      If by sql with sqlite,may faster than Excel.

      If by sql,

      //select * from ListsComparision limit 20;

      cli_one_dim~ListsComparision~0;

      create temp table aa as 

      select 数量||row_number() over (partition by 属性,数量) key,属性,数量 from ListsComparisionunion ;

      //select * from aa;

      cli_create_two_dim~aa~属性~数量;

      select * from aa_two_dim;

       

  • There are approaches that would expand to 3 or more columns provided the 'months' can be interpreted as dates with a natural sort order (otherwise I suspect the results are ambiguous).  For example

    corresponds to your initial problem but starting with Excel dates rather than the abbreviated names of months.  The intermediate step is a combined list of tagged dates.

    = LET(
        taggedA,  EXPAND(listA,,2,"List A"),
        taggedB,  EXPAND(listB,,2,"List B"),
        combined, SORT(VSTACK(taggedA, taggedB)),
        group,    TAKE(combined,,-1),
        month,    TAKE(combined,,1),
        DROP(PIVOTBY(month, group, month, MAX,,0,,0),,1)
      )

    This is fast, but it required additional structure within your data.

    • amit_bhola's avatar
      amit_bhola
      Iron Contributor

      PeterBartholomew1, thanks but it won't be practically possible to add date structures manually.

      In fact, the implementation is required to be done on text arrays which could run upto about 500 elements long. Jan,Feb etc. i added as example texts only.

      In attachment, I have included a more relatable example of persons being members of different teams, and teams being compared without disturbing the original order which may represent seniority of the members. Other use case could be the code comparison for which there are online tools available (example links included in attached Excel), but I need the similar implementation in Excel.

      Going by the first principles method by use of indices of matched elements and working around them, I could do it, but this becomes too slow in Dynamic Arrays. (I'm wondering if implemented by VBA would it be faster?)

      PIVOTBY is a nice trick which I take note as of help elsewhere but for present problem, making that helper structure in-between defeats the purpose of automating the comparison.

      • PeterBartholomew1's avatar
        PeterBartholomew1
        Silver Contributor

        The methods I proposed are completely dependent on there being an overall ranking for the elements appearing in the lists.  Without that, the problem moves from linear complexity to exponential (NP-hard) complexity.  I don't think an Excel solution to the problem should even be attempted. 

        As a trial I patched up a Levenshtein distance calculation to return the 'longest common subsequence'.  To do this I encoded your months as letters A-L, but that was simply to allow the lists to appear as a word in each case.  That showed that there are two alternative presentations that reduce the overall length of the combined list to a minimum.

        Levenshteinλ
        =IF(levenshtein?, 
            value + MIN(diag, left, above), 
            value + MIN(left, above)
        )

        Note: my solution involvers nested SCAN formulas and passing information using thunks.  It is not for the faint-hearted and I would recommend avoiding it if at all possible.

         

    • peiyezhu's avatar
      peiyezhu
      Bronze Contributor

      Same thoughts.

      I added index for same texts(same as date).

      Union two lists and then unpivot😊

       

       

  • amit_bhola's avatar
    amit_bhola
    Iron Contributor

    Thanks but the original order of array elements should remain same, that is the requirement.

    So yes, the same elements must align in same row, but within each array, the sequence of elements must remain same in output.

    Such type of comparison is useful, e.g. in comparing text documents, like code comparisons (which line got deleted/added/remained same). Other use cases could be comparison of multiple products features and specifications grouped by categories from top to bottom.

    I could make the solution but it is slow.

    • peiyezhu's avatar
      peiyezhu
      Bronze Contributor

      Re:like code comparisons (which line got deleted/added/remained same). 

       

      vimdiff

      or 

      diff

       

Resources