Forum Discussion

MarkolaKrai's avatar
MarkolaKrai
Copper Contributor
Mar 07, 2023

Set theory operations with dynamic arrays

I wonder if any of you have figured this out already (and i would like to have feedback about what i have done so far), i have been playing with columns of data and creating easy set theory operations. The addition of VSTACK and TOCOL basically nailed the sum operation as i can just select an entire array and merge it easily (and then apply a UNIQUE if i want to), the subtraction and intersection are not that straightforward though...

My approach was fairly simple and also known already (as shown here https://stackoverflow.com/questions/31186547/how-can-we-perform-common-set-operations-union-intersection-minus-in-ms-exce) but i wanted a solution that was neither VBA or ADD-IN based, and of course, that used dynamic arrays. For the subtraction operation i started with:

 

=FILTER(list1 , COUNTIF( list2 , list1) = 0)


As the second argument of COUNTIF is the list1 itself the size must be the same so i can apply FILTER to it and get only the elements on list1 that did not appear in list2, a similar solution without COUNTIF but with MATCH would be:

=FILTER(list1 , NOT(ISNUMBER(MATCH(list1 , list2 , 0))))


The approach with MATCH is actually better because it can be substituted with XMATCH and used wildcards and so on, but the main reason is that MATCH can accepct a lookup_vector that is a VSTACK or a TOCOL (an array instead of a range) which COUNTIF cannot handle (https://www.mrexcel.com/board/threads/hstack-vstack-and-countif.1220516/), the thing is now i can have a more general solution:

=
LAMBDA( list1, list2, FILTER( list1, NOT(ISNUMBER(MATCH(list1, list2, 0)))))

(listA , VSTACK(listB, listC))


And with this anonymous lambda i can pass the parameters as ranges or arrays and get the difference between combined sets in a general formula.

Now for the intersection, i have used a similar solution to the difference since i need to get values that are present in both lists at the same time so basically:

 

COUNTIF Solution:
=FILTER(list1 , COUNTIF( list2 , list1) > 0)

MATCH Solution:
=FILTER(list1 , ISNUMBER(MATCH(list1 , list2 , 0)))

 

The problem is that i was not able to make a formula that could do intersection with more lists in a general way, sure i could just add more conditions to the filter using boolean logic and repeating the arguments for each list with a * sign, but this is far from ideal compared to the difference solution.

I started to work with some BYROW and BYCOL combinations but as you guys know nested arrays are not supported, i managed to get away with it in Google Sheets using:

=FILTER(list1, 
  BYROW( 
   ArrayFormula( 
    BYCOL(matrix_lists, 
     LAMBDA(main, COUNTIF(main, list1)))), LAMBDA(matrix, PRODUCT(matrix) > 0)))

 

The idea is to create a COUNTIF (or MATCH with a few tweaks) result for each column and then do the product of each row of results and FILTER out the ones that had a zero because that means the value does not appear in all lists at the same time. Still i was not capable of implementing this idea in Excel as i dont know how to overcome nested arrays not being supported, does any of you have some ideas to share?

7 Replies

  • Greg_Hullender's avatar
    Greg_Hullender
    Copper Contributor

    Actually, I came up with a reasonably elegant way to do n-ary union and intersection where the input is an array of thunks, each thunk representing a set. This is union:

    =LET(_l, LAMBDA(x,LAMBDA(x)),
     sets, VSTACK(_l(B4:B11), _l(C4:C9), _l(D4:D12)),
     UNIQUE(DROP(REDUCE(0,sets, LAMBDA(s,x,VSTACK(s,x()))),1)))

    Lines 1 and 2 just set up the simulated input. This really does nothing but stack all the sets together and run unique on the whole. Intersection is much more interesting:

    =LET(_l, LAMBDA(x, LAMBDA(x)),
     sets, VSTACK(_l(B4:B11), _l(C4:C9), _l(D4:D12)),
     union, UNIQUE(DROP(REDUCE(0, sets, LAMBDA(s,x, VSTACK(s, x()))), 1)),
     not_x, UNIQUE(DROP(REDUCE(0, sets, LAMBDA(s,x, VSTACK(s, UNIQUE(VSTACK(x(), union), , 1)))), 1)),
     UNIQUE(VSTACK(union,not_x), , 1))

    This takes advantage of De Morgan's Law. In this case, the set complement is just the set subtracted from the union. (We don't need to double the set being subtracted since it's entirely contained in the union.) So we compute the union as before, then we compute the union of the complements of all the sets. Finally we take the complement of that (subtract it from the union) to get our answer.

  • Greg_Hullender's avatar
    Greg_Hullender
    Copper Contributor

    I'm two years late to the party, but in case anyone is still watching, I think there's a much easier way to do this, and it does support n-ary union. I assume that sets are already unique, but I don't expect them to be sorted.

    Union is the simplest:

    =LET(A, B4:B11, B, D4:D8, UNIQUE(VSTACK(A, B)))

    Since VSTACK takes multiple arguments, so does this union operation. 

    Symmetric Difference is next-easiest.

    =LET(A, B4:B11, B, D4:D8, UNIQUE(VSTACK(A, B), ,1))

    The "really unique" option turns UNIQUE from a natural union to a natural symmetric difference. Note: The multi-way symmetric difference isn't (in my view) terribly useful, and this function does not produce that. This operation seems to be called the "unique union" by at least some people. Anyway, it's very handy.

    Set subtraction is a bit more clever. To get A-B

    =LET(A, B4:B11, B, D4:D8, UNIQUE(VSTACK(A, B, B), ,1))

    This works because doubling B guarantees that none of B will be in the result, but elements of A not in B are unaffected. Or, having a set twice in the unique union obliterates all of its elements, so what's left is the set subtraction. To subtract more sets, just chain them on the end, two at a time. E.g. A, B, B, C, C.

    Intersection is the harder one, but it's not that hard. 

    =LET(A, B4:B11, B, D4:D8, UNIQUE(VSTACK(UNIQUE(VSTACK(A, B)), UNIQUE(VSTACK(A, B), ,1)), ,1))

    It simply does a symmetric difference between the union of two sets and the symmetric difference of the two. I keep thinking there's a way to simplify this, but I'm not seeing it. Unfortunately, this doesn't give the n-ary intersection: It just gives elements that appear in at least two sets. Here's an n-ary intersection that uses GROUPBY. You basically dump all the sets together and let groupby count the unique elements. 

    =LET(n, 3, A, B4:B11, B, C4:C8, C, D4:D11, stack,VSTACK(A, B, C),
     set_cnt, GROUPBY(stack, SEQUENCE(ROWS(stack), 1, 0), COUNT, 0, 0),
     set, CHOOSECOLS(set_cnt, 1), counts, CHOOSECOLS(set_cnt,2),
     x_cnt, FILTER(set_cnt, counts=n), CHOOSECOLS(x_cnt, 1))

    This is so ugly I keep thinking there must be a better way, but I'm not seeing it. It's also annoying to have to pass in 3, but there's no way to tell how many sets there were after you've merged them together.

    • MarkolaKrai's avatar
      MarkolaKrai
      Copper Contributor
      Hey PeterBartholomew1 thank you for the reply! Huge fan btw i love seeing your beautifully crafted formulas as these that you presented, they are useful of course but i was actually pursuing a general formula that could handle intersection of 3 or more sets without the need of adding more boolean logic nor needing to repeat the formulas one inside the other for every new set
  • MarkolaKrai 

    Dressing it up a bit!

     

     

    Union:                = SetOperateλ(setA, setB, Unionλ);
    Intersection:         = SetOperateλ(setA, setB, Intersectλ);
    Difference1:          = SetOperateλ(setA, setB, Diffλ);
    Difference2:          = SetOperateλ(setB, setA, Diffλ);
    Symmetric difference: = SetOperateλ(setA, setB, SymDiffλ);

     

     

    where

     

     

    SetOperateλ
    = LET(
        items,     VSTACK(A, B),
        distinct,  SORT(UNIQUE(items)),
        isA,       COUNTIFS(A,distinct),
        isB,       COUNTIFS(B,distinct),
        criterion, MAP(isA, isB, Fnλ),
        FILTER(distinct, criterion)
      )

     

     

    and

     

     

    Unionλ      = OR(x, y);
    Intersectλ  = AND(x, y);
    Diffλ       = AND(x, NOT(y));
    SymDiffλ    = XOR(x, y);

     

     

    • MarkolaKrai's avatar
      MarkolaKrai
      Copper Contributor
      Patrick2788 Thank you for the reply, this formula is cleaner than mine for the difference of sets i did not remember to use REDUCE because for some reason my brain always thinks the initial value and the accumulator must handle numbers only. I will be using this solution from now on if i have more than 2 sets.

Resources