Hi guys, Can someone please help me with a formula for the following task? I have over 40,000 rows of data in 'Table 1', I've extracted a sample. I need to analyse only cells which contain identical values in both 'Product' and 'Decrement' in 'Table 1'. I have manually created a dataset in 'Table 2' which highlits the output I need. Thanks in advance!!

djclements Great point. However, if use XMATCH with binary search it works fast =LET( pd, product & "| " & decrement, u, SORT( UNIQUE(pd,,1) ), f, FILTER( table, ISNA( XMATCH(pd,u,,2) ) ), VSTACK( {"ID NUMBER","PRODUCT","DECREMENT","IN USE?"}, f ) )

SergeiBaklan Good tip. The [search_mode] never even crossed my mind when I was running my earlier tests (heavy sigh). This is the perfect situation to implement Binary search for improved performance. Thanks for that.

LachlanB98 I would prefer to tackle such a task with Power Query as demonstrated in the attached workbook. No complicated worksheet formulas needed. Note that I identified two additional items that should be included from your example, applying consistent logic.

LachlanB98 As variant =LET( table, $B$3:$E$41, product, $C$3:$C$41, decrement, $D$3:$D$41, VSTACK( {"ID NUMBER","PRODUCT","DECREMENT","IN USE?"}, SORT( FILTER(table, COUNTIFS(product, product, decrement, decrement) > 1 ), {2,3} ) ) )

SergeiBaklan You went further than I did!= LET( multiplicity, COUNTIFS(product, product, decrement, decrement), FILTER(Table1, multiplicity > 1) )A thought to throw into the discussion though. I have reservations concerning lines such as table, $B$3:$E$41,As part of a LET statement, I suspect a copy of the table is loaded into memory at least temporarily as 'table'. If a defined name is used instead, that does nothing until the name appears in a formula where the result is calculated without making a copy of the range?

Extracting table values if values in two columns have identical values in other rows

16 Replies

PeterBartholomew1
Silver Contributor
Jun 05, 2024
LachlanB98
This is not a serous recommendation but more a matter of experimentation. The object was to see whether I could use SCAN to pick out the required records. The first problem is that the criterion is based upon two fields. As we have seen, concatenation is a good option, despite being somewhat inelegant. Instead, to get SCAN to work over an array of arrays, I thunked the value pairs, and fed them through SCAN as thunks. Similarly, there a problems with the accumulated values since I need to know both whether the previous record matches the current one and whether it matched its predecessor, in order to determine whether it should be retained or filtered out.
=LET( sorted, SORT(Table1, {2, 3}), attrϑ, BYROW(CHOOSECOLS(sorted, 2, 3), THUNK), initϑ, THUNK(HSTACK("", "", FALSE, FALSE)), resultϑ, SCAN(initϑ, attrϑ, IsGroupedλ), boolean, MAP(resultϑ, LAMBDA(rϑ, LET(r, rϑ(), INDEX(r, 1, 4)))), final, INDEX(INDEX(resultϑ, ROWS(data), 1)(), , 3), FILTER(sorted, VSTACK(DROP(boolean, 1), final)) )
The bulk of the problem-specific comparisons are made in the Lambda function 'IsGroupedλ' which is defined to be
= LAMBDA(accϑ, aϑ₁, LET( acc, accϑ(), a₁, aϑ₁(), a₁₀, INDEX(acc, 1, 1), a₂₀, INDEX(acc, 1, 2), a₁₁, INDEX(a₁, 1, 1), a₂₁, INDEX(a₁, 1, 2), b₀, INDEX(acc, 1, 3), b₁, INDEX(acc, 1, 4), c, AND(a₁₁ = a₁₀, a₂₁ = a₂₀), b, OR(c, b₀), THUNK(HSTACK(a₁, c, b)) ) )
The time taken for the resulting calculation was 1000ms which is slow, but not ridiculously so. The formula returns an array of thunks, from which a single value is selected to be the filter criterion.

For the current problem, direct array methods are both more transparent and considerably faster. Despite that, the challenges for me were:
1. Could I get around the array of array limitations of SCAN?
2. Is the performance or the resulting formula acceptable?
My conclusion is that the resulting methods are both possible and usable.
Some tidying is required and the problem-specific features of the Lambdas need to be more contained if they are to be used as 'black box' functions.

sample_book_large (1).xlsx6.2 MB
- SergeiBaklan
  Diamond Contributor
  Jun 05, 2024
  PeterBartholomew1
  
  With SCAN you mainly answer on first question
  
  Could I get around the array of array limitations of SCAN?
  
  and perfectly demonstrated it.
  
  Performance is 4 times better in sample from your previous post. That's a huge difference, but only if we speak about seconds, not milliseconds.
djclements
Silver Contributor
Jun 03, 2024
LachlanB98 With 40,000+ rows of data, efficiency is definitely a concern when using a formula-based method. Might I suggest that you first concatenate the Product and Decrement columns into a single criteria range, then use the UNIQUE function with the optional [exactly_once] argument set to TRUE. From there, the table can be filtered using the ISNA / XMATCH method to exclude all records that appeared only once. For example:

=LET( arr, C3:C41 & "|" & D3:D41, unq, UNIQUE(arr,, TRUE), FILTER(B3:E41, ISNA(XMATCH(arr, unq))) )

However, to make this method perform as efficiently as possible when dealing with large datasets, move the arr and unq variables to their own spilled ranges in the worksheet and avoid the LET function altogether. For example:

Cell G3: =C3:C40002 & "|" & D3:D40002 Cell H3: =UNIQUE(G3#,, TRUE) Cell J3: =FILTER(B3:E40002, ISNA(XMATCH(G3#, H3#)))

Note: when testing this method with 40,000 rows of data, the calculation time was approx. 7 seconds with LET and only a fraction of a second without. I'm not sure if this is just a bug with the current version of MS365, but I don't remember there ever being an efficiency problem in the past using XMATCH with arrays vs ranges.

See attached...
sample book.xlsx13 KB
sample_book_large.xlsx2.2 MB
- SergeiBaklan
  Diamond Contributor
  Jun 04, 2024
  djclements
  
  Great point.
  
  However, if use XMATCH with binary search it works fast
  
  =LET( pd, product & "| " & decrement, u, SORT( UNIQUE(pd,,1) ), f, FILTER( table, ISNA( XMATCH(pd,u,,2) ) ), VSTACK( {"ID NUMBER","PRODUCT","DECREMENT","IN USE?"}, f ) )
  sample_book_large xmatch.xlsx2.9 MB
  - PeterBartholomew1
    Silver Contributor
    Jun 04, 2024
    SergeiBaklan
    Well spotted! And I think the formula wins the time trial. When I was timing the original formula with helper ranges the calculation was taking about 230ms. Combining them into
    = LET( prodDecr, product & "|" & decrement, exactlyOnce, UNIQUE(prodDecr,, TRUE), multiple?, ISNA(XMATCH(prodDecr, exactlyOnce)), FILTER(Table1, multiple?) )
    and the time required leapt to 8000ms! Forcing the binary search and the time reduced to 150ms. I am so used to the idea that XLOOKUP and XMATCH perform sorts as part of the solution process to gain efficiency that this came as a surprise.
    
    Once as I had come to terms with the fact that COUNTIFS does not perform well on datasets of this size (10,000ms), I tried a different strategy of sorting the dataset and comparing each record with prior and subsequent records to determine the filter criterion.
    =LET( sortedTbl, SORT(Table1, {2,3}), CurrentPD, CHOOSECOLS(sortedTbl,2,3), priorPD, DROP(VSTACK({0,0}, CurrentPD),-1), nextPD, DROP(VSTACK(CurrentPD,{0,0}),1), matchesnext, BYROW(CurrentPD = nextPD, AND), matchesPrior, BYROW(CurrentPD = priorPD, AND), matchesEither, MAP(matchesPrior, matchesnext, OR), FILTER(sortedTbl, matchesEither) )
    That was back down to 250ms so, no prizes, but a pretty credible performance!
peiyezhu
Bronze Contributor
Jun 03, 2024
LachlanB98
sql:
select b.* from (select f02,f03 from Sheet2 group by f02,f03 having(count(1))>1) as a join Sheet2 as b on a.f02=b.f02 and a.f03=b.f03 ;
SergeiBaklan
Diamond Contributor
Jun 03, 2024
LachlanB98

As variant

=LET( table, $B$3:$E$41, product, $C$3:$C$41, decrement, $D$3:$D$41, VSTACK( {"ID NUMBER","PRODUCT","DECREMENT","IN USE?"}, SORT( FILTER(table, COUNTIFS(product, product, decrement, decrement) > 1 ), {2,3} ) ) )
sample book.xlsx14 KB
- PeterBartholomew1
  Silver Contributor
  Jun 03, 2024
  SergeiBaklan
  You went further than I did!
  = LET( multiplicity, COUNTIFS(product, product, decrement, decrement), FILTER(Table1, multiplicity > 1) )
  A thought to throw into the discussion though. I have reservations concerning lines such as
  table, $B$3:$E$41,
  As part of a LET statement, I suspect a copy of the table is loaded into memory at least temporarily as 'table'. If a defined name is used instead, that does nothing until the name appears in a formula where the result is calculated without making a copy of the range?
  - SergeiBaklan
    Diamond Contributor
    Jun 03, 2024
    PeterBartholomew1
    
    I'm not sure there is any difference except maintenance. My understanding is that defined within LET name is called only with function evaluation, and only here we load the value into the memory. Not with definition.
Riny_van_Eekelen
Platinum Contributor
Jun 02, 2024
LachlanB98 I would prefer to tackle such a task with Power Query as demonstrated in the attached workbook. No complicated worksheet formulas needed.
Note that I identified two additional items that should be included from your example, applying consistent logic.
sample book (1).xlsx20 KB

Forum Discussion

Extracting table values if values in two columns have identical values in other rows