Forum Discussion

LachlanB98's avatar
LachlanB98
Copper Contributor
Jun 03, 2024

Extracting table values if values in two columns have identical values in other rows

 

Hi guys,

 

Can someone please help me with a formula for the following task?

 

I have over 40,000 rows of data in 'Table 1', I've extracted a sample.

 

I need to analyse only cells which contain identical values in both 'Product' and 'Decrement' in 'Table 1'. I have manually created a dataset in 'Table 2' which highlits the output I need.

 

Thanks in advance!! 

 

 

 

 

16 Replies

  • LachlanB98 

    This is not a serous recommendation but more a matter of experimentation.  The object was to see whether I could use SCAN to pick out the required records.  The first problem is that the criterion is based upon two fields.  As we have seen, concatenation is a good option, despite being somewhat inelegant.  Instead, to get SCAN to work over an array of arrays, I thunked the value pairs, and fed them through SCAN as thunks.  Similarly, there a problems with the accumulated values since I need to know both whether the previous record matches the current one and whether it matched its predecessor, in order to determine whether it should be retained or filtered out.

    =LET(
        sorted,  SORT(Table1, {2, 3}),
        attrϑ,   BYROW(CHOOSECOLS(sorted, 2, 3), THUNK),
        initϑ,   THUNK(HSTACK("", "", FALSE, FALSE)),
        resultϑ, SCAN(initϑ, attrϑ, IsGroupedλ),
        boolean, MAP(resultϑ, LAMBDA(rϑ, LET(r, rϑ(), INDEX(r, 1, 4)))),
        final,   INDEX(INDEX(resultϑ, ROWS(data), 1)(), , 3),
        FILTER(sorted, VSTACK(DROP(boolean, 1), final))
    )

    The bulk of the problem-specific comparisons are made in the Lambda function 'IsGroupedλ' which is defined to be

    = LAMBDA(accϑ, aϑ₁,
        LET(
          acc, accϑ(),
          a₁,  aϑ₁(),
          a₁₀, INDEX(acc, 1, 1),
          a₂₀, INDEX(acc, 1, 2),
          a₁₁, INDEX(a₁, 1, 1),
          a₂₁, INDEX(a₁, 1, 2),
          b₀,  INDEX(acc, 1, 3),
          b₁,  INDEX(acc, 1, 4),
          c,   AND(a₁₁ = a₁₀, a₂₁ = a₂₀),
          b,   OR(c, b₀),
          THUNK(HSTACK(a₁, c, b))
        )
      )

    The time taken for the resulting calculation was 1000ms which is slow, but not ridiculously so.  The formula returns an array of thunks, from which a single value is selected to be the filter criterion.

     

    For the current problem, direct array methods are both more transparent and considerably faster.  Despite that, the challenges for me were:

    1. Could I get around the array of array limitations of SCAN?

    2. Is the performance or the resulting formula acceptable?

    My conclusion is that the resulting methods are both possible and usable.  

    Some tidying is required and the problem-specific features of the Lambdas need to be more contained if they are to be used as 'black box' functions.

     

    • SergeiBaklan's avatar
      SergeiBaklan
      Diamond Contributor

      PeterBartholomew1 

      With SCAN you mainly answer on first question

      Could I get around the array of array limitations of SCAN?

      and perfectly demonstrated it.

      Performance is 4 times better in sample from your previous post. That's a huge difference, but only if we speak about seconds, not milliseconds.

  • djclements's avatar
    djclements
    Silver Contributor

    LachlanB98 With 40,000+ rows of data, efficiency is definitely a concern when using a formula-based method. Might I suggest that you first concatenate the Product and Decrement columns into a single criteria range, then use the UNIQUE function with the optional [exactly_once] argument set to TRUE. From there, the table can be filtered using the ISNA / XMATCH method to exclude all records that appeared only once. For example:

     

    =LET(
        arr, C3:C41 & "|" & D3:D41,
        unq, UNIQUE(arr,, TRUE),
        FILTER(B3:E41, ISNA(XMATCH(arr, unq)))
    )

     

    However, to make this method perform as efficiently as possible when dealing with large datasets, move the arr and unq variables to their own spilled ranges in the worksheet and avoid the LET function altogether. For example:

     

    Cell G3:
    =C3:C40002 & "|" & D3:D40002
    
    Cell H3:
    =UNIQUE(G3#,, TRUE)
    
    Cell J3:
    =FILTER(B3:E40002, ISNA(XMATCH(G3#, H3#)))

     

    Note: when testing this method with 40,000 rows of data, the calculation time was approx. 7 seconds with LET and only a fraction of a second without. I'm not sure if this is just a bug with the current version of MS365, but I don't remember there ever being an efficiency problem in the past using XMATCH with arrays vs ranges.

     

    See attached...

    • SergeiBaklan's avatar
      SergeiBaklan
      Diamond Contributor

      djclements 

      Great point.

      However, if use XMATCH with binary search it works fast

      =LET(
        pd,  product & "| " & decrement,
        u,   SORT( UNIQUE(pd,,1) ),
        f,   FILTER( table, ISNA( XMATCH(pd,u,,2) ) ),
        VSTACK( {"ID NUMBER","PRODUCT","DECREMENT","IN USE?"}, f )
      )
      • PeterBartholomew1's avatar
        PeterBartholomew1
        Silver Contributor

        SergeiBaklan 

        Well spotted!  And I think the formula wins the time trial.  When I was timing the original formula with helper ranges the calculation was taking about 230ms.  Combining them into

        = LET(
            prodDecr,    product & "|" & decrement,
            exactlyOnce, UNIQUE(prodDecr,, TRUE),
            multiple?,   ISNA(XMATCH(prodDecr, exactlyOnce)),
            FILTER(Table1, multiple?)
          )

        and the time required leapt to 8000ms!  Forcing the binary search and the time reduced to 150ms.  I am so used to the idea that XLOOKUP and XMATCH perform sorts as part of the solution process to gain efficiency that this came as a surprise.

         

        Once as I had come to terms with the fact that COUNTIFS does not perform well on datasets of this size (10,000ms), I tried a different strategy of sorting the dataset and comparing each record with prior and subsequent records to determine the filter criterion.

        =LET(
            sortedTbl,     SORT(Table1, {2,3}),
            CurrentPD,     CHOOSECOLS(sortedTbl,2,3),
            priorPD,       DROP(VSTACK({0,0}, CurrentPD),-1),
            nextPD,        DROP(VSTACK(CurrentPD,{0,0}),1),
            matchesnext,   BYROW(CurrentPD = nextPD, AND),
            matchesPrior,  BYROW(CurrentPD = priorPD, AND),
            matchesEither, MAP(matchesPrior, matchesnext, OR),
            FILTER(sortedTbl, matchesEither)
          )

        That was back down to 250ms so, no prizes, but a pretty credible performance!

         

         

         

  • peiyezhu's avatar
    peiyezhu
    Bronze Contributor

    LachlanB98 

    sql:

    select b.* from (select f02,f03 from Sheet2 group by f02,f03 having(count(1))>1) as a join Sheet2 as b on a.f02=b.f02 and a.f03=b.f03 ;

     

  • SergeiBaklan's avatar
    SergeiBaklan
    Diamond Contributor

    LachlanB98 

    As variant

    =LET(
        table, $B$3:$E$41,
       product, $C$3:$C$41,
       decrement, $D$3:$D$41,
       VSTACK(
          {"ID NUMBER","PRODUCT","DECREMENT","IN USE?"},
           SORT( FILTER(table, COUNTIFS(product, product, decrement, decrement) > 1 ),
                {2,3}
           )
      )
    )
    • PeterBartholomew1's avatar
      PeterBartholomew1
      Silver Contributor

      SergeiBaklan 

      You went further than I did!

      = LET(
          multiplicity, COUNTIFS(product, product, decrement, decrement),
          FILTER(Table1, multiplicity > 1)
        )

      A thought to throw into the discussion though.  I have reservations concerning lines such as 

      table, $B$3:$E$41,

      As part of a LET statement, I suspect a copy of the table is loaded into memory at least temporarily as 'table'.  If a defined name is used instead, that does nothing until the name appears in a formula where the result is calculated without making a copy of the range?

       

      • SergeiBaklan's avatar
        SergeiBaklan
        Diamond Contributor

        PeterBartholomew1 

        I'm not sure there is any difference except maintenance. My understanding is that defined within LET name is called only with function evaluation, and only here we load the value into the memory. Not with definition.

  • Riny_van_Eekelen's avatar
    Riny_van_Eekelen
    Platinum Contributor

    LachlanB98 I would prefer to tackle such a task with Power Query as demonstrated in the attached workbook. No complicated worksheet formulas needed. 

    Note that I identified two additional items that should be included from your example, applying consistent logic.

Resources