Forum Discussion
Lists Comparison (Stacked & Keeping order)
I've been trying to develop an Excel function which would compare Lists, each of single column arrays and generate a stacked (side by side) comparison aligning the matched (same) elements and keeping the original order of elements of each list intact in the resulting comparison.
A Two-Lists comparison function has been developed. However, it would be great if it can be scaled up efficiently to 3 or more lists comparison. There are ideas on adopting the 2-list comparison function to greater number of lists, but the present approach is slow! and it seems that the 2-list comparison itself is not done the best way perhaps.
Sharing in attached excel file, the worked-out function till now, and looking for better ideas!
Interesting project. I see you've found a solution but indicate there are still performance issues with larger amounts of data. One thing to keep in mind is that iterative functions like MAKEARRAY, MAP, SCAN and REDUCE will perform very poorly when using INDEX over an array object (as opposed to a physical range). This may be a contributing factor since a number of your custom functions use iterative indexing methods (not to mention some iterative stacking as well).
In the attached file I've used an alternative approach for generating the list of matching indices and row counts, then used SORTBY with EXPAND and TOCOL-IFS-SEQUENCE to pad each array with the appropriate number of blank rows. It seems to be generating the same results as your original formulas, plus it can handle 5 separate lists with 2000 rows each in < 1 second. I've put the new lambda definitions on a separate worksheet, for your convenience. Check it out and see if it's working as expected. Cheers!
17 Replies
- djclementsSilver Contributor
Interesting project. I see you've found a solution but indicate there are still performance issues with larger amounts of data. One thing to keep in mind is that iterative functions like MAKEARRAY, MAP, SCAN and REDUCE will perform very poorly when using INDEX over an array object (as opposed to a physical range). This may be a contributing factor since a number of your custom functions use iterative indexing methods (not to mention some iterative stacking as well).
In the attached file I've used an alternative approach for generating the list of matching indices and row counts, then used SORTBY with EXPAND and TOCOL-IFS-SEQUENCE to pad each array with the appropriate number of blank rows. It seems to be generating the same results as your original formulas, plus it can handle 5 separate lists with 2000 rows each in < 1 second. I've put the new lambda definitions on a separate worksheet, for your convenience. Check it out and see if it's working as expected. Cheers!
- amit_bholaIron Contributor
Thanks, it is tremendously fast! Indeed, it takes <1 second even for large lists.
Change of the basic approach was what I was looking for and needed.
I needed List comparisons as a sub-tool (part) of a bigger subsequent calc and now it is efficient and helpful.
I tallied the results as well as timed the calculation: -2-Lists, 1000-Elements each
(1) Iterative (Makearray/Map/Vstack/Index/Thunks) : 3 minutes
(2) Direct (Sortby/Expand/Tocol/Ifs/Sequence) : 50 milliseconds3-Lists, 1000-Elements each
(1) Iterative (Makearray/Map/Vstack/Index/Thunks) : 10 minutes
(2) Direct (Sortby/Expand/Tocol/Ifs/Sequence) : 130 millisecondsThe time taken figures mentioned above are not exactly repeatable on re-calc, but the order remains more or less the same. Refer attachment for comparison of different approaches.
Thanks!
- djclementsSilver Contributor
You're welcome! I'm glad it worked out for you.
BTW, I'm guessing the use of double-quotes ("""") as opposed to zero-length strings ("") in your attached file was intentional??? If not, simply adjust the [pad_with] argument of each of the 4 instances of EXPAND in your BySortbyExpandTocolAndIfsEtc.ListCompλ function definition from """" to "" in order to get rid of them.
Cheers! :)
- m_tarlerBronze Contributor
I also think you have a problem with multiple solutions and/or undefined optimization. You could easily come up with solutions that are likely far from optimal: ABCDEF could match with BCDEFA in 2 ways:
A_B_C_D_E_F
x _B_C_D_E_F_Aor
x _x_x_ x_x_A_B_C_D_E_F
B_C_D_E_F_AThe former being much "better" match but I don't believe you have any 'rule' why the latter isn't a valid solution. That said, I didn't break down your existing solution to see if it checks for that but it seams to be 'single path' iterative based so I don't think so. we also know excel works with arrays more efficiently. I have the following start that I feel could be useful but maybe you already tried and couldn't get further. So on your sample sheet I took the 1st 'DEMO' and put the 2nd col into a row and found the matix of matches (shown using complex #s for row, column. The last 2 columns (V & W) use those matches and sort the 'real'/rows based on the 'imaginary'/cols and then the reverse. Any 'path' that increases both the row and column is a potential solution, so w/r to the 2 final columns any monotonically increasing set would be a potential solution of matching values (except these :
in this case I see 4 (non-trivial, i.e. sequences that do not exclude possible matches in between) match solutions:
2+3i, 6+5i, 8+6i, 9+9i
2+3i, 6+5i, 7+10i
2+3i, 5+8i, 6+9i, 7+10i
2+3i, 5+8i, 9+9i
So although this step makes it much easier for me to find those solutions I'm not seeing a way to functionally do it. Like it would be nice if there was a trick using MMULT or something to give potential solutions and then use the one that is the longest. In THIS case you can easily go row by row or column by column to find the 2 'best' solutions but that would NOT be the case if the 1st column ended with JUL and the 2nd column ended with MAY as those would have to be 'skipped' in each corresponsing case to avoid the 2 match solution.- amit_bholaIron Contributor
Thanks m_tarler for the idea about holding pairs of candidate matched-indices as complex nos.
I studied (see attachment), and i expect that filtering-out the monotonically increasing nos. would still need recursion which i already implemented and due to it, as Mr. Peter pointed out, the order of complexity is expected to remain exponential. Thanks for your suggestions. When time permits, i may explore the approach further nonetheless.
- m_tarlerBronze Contributor
So I worked your sheet a little further and you can tell me if the performance is still holding up. Basically once you find the list of 'imaginary numbers' I then do a complete list of all possibilities and then filter out the possibilities that are increasing and then pull the list with the most in the list
see attached
- amit_bholaIron Contributor
Thanks but the original order of array elements should remain same, that is the requirement.
So yes, the same elements must align in same row, but within each array, the sequence of elements must remain same in output.
Such type of comparison is useful, e.g. in comparing text documents, like code comparisons (which line got deleted/added/remained same). Other use cases could be comparison of multiple products features and specifications grouped by categories from top to bottom.
I could make the solution but it is slow.
- peiyezhuBronze Contributor
Re:like code comparisons (which line got deleted/added/remained same).
vimdiff
or
diff
- PeterBartholomew1Silver Contributor
There are approaches that would expand to 3 or more columns provided the 'months' can be interpreted as dates with a natural sort order (otherwise I suspect the results are ambiguous). For example
corresponds to your initial problem but starting with Excel dates rather than the abbreviated names of months. The intermediate step is a combined list of tagged dates.
= LET( taggedA, EXPAND(listA,,2,"List A"), taggedB, EXPAND(listB,,2,"List B"), combined, SORT(VSTACK(taggedA, taggedB)), group, TAKE(combined,,-1), month, TAKE(combined,,1), DROP(PIVOTBY(month, group, month, MAX,,0,,0),,1) )
This is fast, but it required additional structure within your data.
- peiyezhuBronze Contributor
Same thoughts.
I added index for same texts(same as date).
Union two lists and then unpivot😊
- amit_bholaIron Contributor
PeterBartholomew1, thanks but it won't be practically possible to add date structures manually.
In fact, the implementation is required to be done on text arrays which could run upto about 500 elements long. Jan,Feb etc. i added as example texts only.
In attachment, I have included a more relatable example of persons being members of different teams, and teams being compared without disturbing the original order which may represent seniority of the members. Other use case could be the code comparison for which there are online tools available (example links included in attached Excel), but I need the similar implementation in Excel.
Going by the first principles method by use of indices of matched elements and working around them, I could do it, but this becomes too slow in Dynamic Arrays. (I'm wondering if implemented by VBA would it be faster?)
PIVOTBY is a nice trick which I take note as of help elsewhere but for present problem, making that helper structure in-between defeats the purpose of automating the comparison.
- peiyezhuBronze Contributor
Re:won't be practically possible to add date structures manually.
VBA may an option to automate it.
But I would prefer sql which looks more consice and running faster.
- amit_bholaIron Contributor
In attached Excel file, 3-Lists comparison function implemented - but the calc is so slow that Excel becomes unresponsive. Have set the calculation to Manual.
Looking for better ideas for this function which is faster and practical.
Thanks in advance.
- peiyezhuBronze Contributor
I guess you want to align same values in one row.
For example,All Feb need in row 3 in below picture instead of in line 2 and line 7(two Febs)
One possible way is transform to one dimension and group by values and value index as key.
If by sql with sqlite,may faster than Excel.
If by sql,
//select * from ListsComparision limit 20;
cli_one_dim~ListsComparision~0;
create temp table aa as
select 数量||row_number() over (partition by 属性,数量) key,属性,数量 from ListsComparisionunion ;
//select * from aa;
cli_create_two_dim~aa~属性~数量;
select * from aa_two_dim;