Forum Discussion
Extracting table values if values in two columns have identical values in other rows
Hi guys,
Can someone please help me with a formula for the following task?
I have over 40,000 rows of data in 'Table 1', I've extracted a sample.
I need to analyse only cells which contain identical values in both 'Product' and 'Decrement' in 'Table 1'. I have manually created a dataset in 'Table 2' which highlits the output I need.
Thanks in advance!!
16 Replies
- PeterBartholomew1Silver Contributor
This is not a serous recommendation but more a matter of experimentation. The object was to see whether I could use SCAN to pick out the required records. The first problem is that the criterion is based upon two fields. As we have seen, concatenation is a good option, despite being somewhat inelegant. Instead, to get SCAN to work over an array of arrays, I thunked the value pairs, and fed them through SCAN as thunks. Similarly, there a problems with the accumulated values since I need to know both whether the previous record matches the current one and whether it matched its predecessor, in order to determine whether it should be retained or filtered out.
=LET( sorted, SORT(Table1, {2, 3}), attrϑ, BYROW(CHOOSECOLS(sorted, 2, 3), THUNK), initϑ, THUNK(HSTACK("", "", FALSE, FALSE)), resultϑ, SCAN(initϑ, attrϑ, IsGroupedλ), boolean, MAP(resultϑ, LAMBDA(rϑ, LET(r, rϑ(), INDEX(r, 1, 4)))), final, INDEX(INDEX(resultϑ, ROWS(data), 1)(), , 3), FILTER(sorted, VSTACK(DROP(boolean, 1), final)) )The bulk of the problem-specific comparisons are made in the Lambda function 'IsGroupedλ' which is defined to be
= LAMBDA(accϑ, aϑ₁, LET( acc, accϑ(), a₁, aϑ₁(), a₁₀, INDEX(acc, 1, 1), a₂₀, INDEX(acc, 1, 2), a₁₁, INDEX(a₁, 1, 1), a₂₁, INDEX(a₁, 1, 2), b₀, INDEX(acc, 1, 3), b₁, INDEX(acc, 1, 4), c, AND(a₁₁ = a₁₀, a₂₁ = a₂₀), b, OR(c, b₀), THUNK(HSTACK(a₁, c, b)) ) )The time taken for the resulting calculation was 1000ms which is slow, but not ridiculously so. The formula returns an array of thunks, from which a single value is selected to be the filter criterion.
For the current problem, direct array methods are both more transparent and considerably faster. Despite that, the challenges for me were:
1. Could I get around the array of array limitations of SCAN?
2. Is the performance or the resulting formula acceptable?
My conclusion is that the resulting methods are both possible and usable.
Some tidying is required and the problem-specific features of the Lambdas need to be more contained if they are to be used as 'black box' functions.
- SergeiBaklanDiamond Contributor
With SCAN you mainly answer on first question
Could I get around the array of array limitations of SCAN?
and perfectly demonstrated it.
Performance is 4 times better in sample from your previous post. That's a huge difference, but only if we speak about seconds, not milliseconds.
- djclementsSilver Contributor
LachlanB98 With 40,000+ rows of data, efficiency is definitely a concern when using a formula-based method. Might I suggest that you first concatenate the Product and Decrement columns into a single criteria range, then use the UNIQUE function with the optional [exactly_once] argument set to TRUE. From there, the table can be filtered using the ISNA / XMATCH method to exclude all records that appeared only once. For example:
=LET( arr, C3:C41 & "|" & D3:D41, unq, UNIQUE(arr,, TRUE), FILTER(B3:E41, ISNA(XMATCH(arr, unq))) )However, to make this method perform as efficiently as possible when dealing with large datasets, move the arr and unq variables to their own spilled ranges in the worksheet and avoid the LET function altogether. For example:
Cell G3: =C3:C40002 & "|" & D3:D40002 Cell H3: =UNIQUE(G3#,, TRUE) Cell J3: =FILTER(B3:E40002, ISNA(XMATCH(G3#, H3#)))Note: when testing this method with 40,000 rows of data, the calculation time was approx. 7 seconds with LET and only a fraction of a second without. I'm not sure if this is just a bug with the current version of MS365, but I don't remember there ever being an efficiency problem in the past using XMATCH with arrays vs ranges.
See attached...
- SergeiBaklanDiamond Contributor
Great point.
However, if use XMATCH with binary search it works fast
=LET( pd, product & "| " & decrement, u, SORT( UNIQUE(pd,,1) ), f, FILTER( table, ISNA( XMATCH(pd,u,,2) ) ), VSTACK( {"ID NUMBER","PRODUCT","DECREMENT","IN USE?"}, f ) )- PeterBartholomew1Silver Contributor
Well spotted! And I think the formula wins the time trial. When I was timing the original formula with helper ranges the calculation was taking about 230ms. Combining them into
= LET( prodDecr, product & "|" & decrement, exactlyOnce, UNIQUE(prodDecr,, TRUE), multiple?, ISNA(XMATCH(prodDecr, exactlyOnce)), FILTER(Table1, multiple?) )and the time required leapt to 8000ms! Forcing the binary search and the time reduced to 150ms. I am so used to the idea that XLOOKUP and XMATCH perform sorts as part of the solution process to gain efficiency that this came as a surprise.
Once as I had come to terms with the fact that COUNTIFS does not perform well on datasets of this size (10,000ms), I tried a different strategy of sorting the dataset and comparing each record with prior and subsequent records to determine the filter criterion.
=LET( sortedTbl, SORT(Table1, {2,3}), CurrentPD, CHOOSECOLS(sortedTbl,2,3), priorPD, DROP(VSTACK({0,0}, CurrentPD),-1), nextPD, DROP(VSTACK(CurrentPD,{0,0}),1), matchesnext, BYROW(CurrentPD = nextPD, AND), matchesPrior, BYROW(CurrentPD = priorPD, AND), matchesEither, MAP(matchesPrior, matchesnext, OR), FILTER(sortedTbl, matchesEither) )That was back down to 250ms so, no prizes, but a pretty credible performance!
- peiyezhuBronze Contributor
sql:
select b.* from (select f02,f03 from Sheet2 group by f02,f03 having(count(1))>1) as a join Sheet2 as b on a.f02=b.f02 and a.f03=b.f03 ;
- SergeiBaklanDiamond Contributor
As variant
=LET( table, $B$3:$E$41, product, $C$3:$C$41, decrement, $D$3:$D$41, VSTACK( {"ID NUMBER","PRODUCT","DECREMENT","IN USE?"}, SORT( FILTER(table, COUNTIFS(product, product, decrement, decrement) > 1 ), {2,3} ) ) )- PeterBartholomew1Silver Contributor
You went further than I did!
= LET( multiplicity, COUNTIFS(product, product, decrement, decrement), FILTER(Table1, multiplicity > 1) )A thought to throw into the discussion though. I have reservations concerning lines such as
table, $B$3:$E$41,As part of a LET statement, I suspect a copy of the table is loaded into memory at least temporarily as 'table'. If a defined name is used instead, that does nothing until the name appears in a formula where the result is calculated without making a copy of the range?
- SergeiBaklanDiamond Contributor
I'm not sure there is any difference except maintenance. My understanding is that defined within LET name is called only with function evaluation, and only here we load the value into the memory. Not with definition.
- Riny_van_EekelenPlatinum Contributor
LachlanB98 I would prefer to tackle such a task with Power Query as demonstrated in the attached workbook. No complicated worksheet formulas needed.
Note that I identified two additional items that should be included from your example, applying consistent logic.