Help with MMULT for a range containing 100,000+ values

Copper Contributor

 I have been trying to solve a problem for some time now but unable to find an acceptable solution and hence requesting help from you. The attached has a snapshot of what I am trying to achieve. The problem I am facing is when the list goes beyond 100,000 rows, Excel runs out of resource.

 

The list contains duplicates as well as non-duplicates. I believe, if we restrict MMULT only for the duplicates, the "Out of Resource" error won't show up. I want the formula to do the following:

1. If unique (non-duplicate), simply consider the given value

2. If duplicate, then do MMULT (sum) for only the duplicate at hand and move to the next in sequence.

 

In the attached, "a", "b" and "d" are duplicated while "c" is unique. Want the formula to consider only "a" for MMULT and then separately for "b" and so on. Doing it together for "a", "b", "c" and "d" gives the "Out of Resource" error. The formula identifies "c" as unique and copies the corresponding value.

 

Please let me know if further clarification is required.

 

I wanted to upload the file but unfortunately couldn't find a link to do so.

22 Replies

@dipankar_de You can share your file via Onedrive, Google Docs, Dropbox or similar.

@dipankar_de Thanks! Not sure what I'm looking at but why not simply use this in C20:

=SUMIF(B2#,UNIQUE(B2#),C2#)

 

The actual list has 100,000+ rows and there are 6 columns that need to be individually summed when a duplicate is encountered.

I have tried SUMIFS but that's taking around 14 minutes to process the whole list. That's the reason I am trying to solve this with MMULT and in doing so since the list is long, MMULT is ending up consuming the whole of the memory since it's generating a 100,000+ x 100,000+ matrix.

If somehow, the matrix can be made limited to the number of duplicates for each element (in my example a, b, c, d), possibly the result will be achieved in less than a minute.
Another thing is that I am trying to get the solution through dynamic spill array formulas... one formula and that's it for processing all 100,000+ rows

@dipankar_de Well, if your data set is too big for MMULT and you are short on resources, perhaps you should transform the raw data in PowerQuery (PQ) first and create a table summing each of the six columns, grouped by the unique reference (a,b,c,d, etc. as per your example). Did that for the first few columns

Riny_van_Eekelen_0-1654154675221.png

A data set of 100 thousand rows isn't particularly big for PQ and shouldn't cause much of a problem. But perhaps I'm missing the point completely and underestimating your problem.

That's a good suggestion and appreciate that. Can you please share your workings so that I can have a quick look? I haven't tried PowerQuery as yet and believe I should start exploring it and this perhaps is a good opportunity.

@dipankar_de See attached file. It has a very basic query in it. Can't tell how this will perform on your machine with real data, of course.

@dipankar_de 

As a comment practical limit for MMULT on max dimension of returned vector is about 7500 (bit less).  

Hello Sergei, I have sent a note separately to you. Good that you saw my request for help here itself. I appreciate the comment you made and practically that's the reason I am looking for a formula which will perform MMULT selectively and individually for those that have duplicates - if done that way, MMULT will be limited to maximum dimension of 15 x 15. I have tried doing it that way but somehow it's not working and giving #N/A error. Please refer to the formula in I20 in the file.

The error is coming from the part on the right of the 'equal-to' sign in the formula below.
FILTER(INDEX(ArB, ,2), INDEX(ArB, ,1)=INDEX(UnqB, SeqB))
Not sure how to make it work since here it's running in a virtual loop with SeqB increasing yet the specific value referenced by INDEX(UnqB, SeqB) is not getting identified by the FILTER Function rather FILTER is seeing it as a an array instead of a specific value in the loop.

Please see if there is a way to accomplish this.

@dipankar_de 

SUMIFS is very fast in execution and tends to beat array methods.  However, provided your data is strictly sorted, it is possible to search for the first and last entries of each contiguous range and sum that.

= LET(
      distinct, SORT(UNIQUE(letter)),
         total, MAP(distinct, LAMBDA(chr,
            LET(
               firstCell, XLOOKUP(chr,letter,number,,,1),
               finalCell, XLOOKUP(chr,letter,number,,,-1),
               SUM(firstCell:finalCell)
            )
         )
      ),
      HSTACK(distinct, total)
   )

I have beaten SUMIFS with formulas of this kind in the past in situations in which each summation is only working off a small proportion of the values.

 

image.png

Hello Peter, Many thanks for the help you extended. Unfortunately, I don't have the version that supports LAMBDA, HSTACK and others that are still in Beta. Hence, am limited to functions that are available in the Released O365.

@dipankar_de 

Sorry, the HSTACK was little more than cosmetic tidying up.  This is the same function stripped right back to a relative reference.

= SUM(XLOOKUP(@distinct,letter,number,,, 1):XLOOKUP(@distinct,letter,number,,,-1))

 

Hello Peter, Many thanks for helping me out with this - exactly the solution I was looking for. I replaced the HSTACK with CHOOSE({1,2}, distinct, total) and it worked the way I wanted. However, I now need to implement this on the actual data I have and check the turnaround time. I will keep you posted if I encounter any issue. Thanks again for your help
Hello Peter, The version of excel that I have for implementing the actual solution is Version 2202 (Build 14931.20132) Semi -Annual Enterprise Channel (Preview) and it appears that the LAMBDA functions are not yet there in the said version.

Is there a way to implement the solution without MAP / LAMBDA
Hello Peter, The version of excel that I have for implementing the actual solution is Version 2202 (Build 14931.20132) Semi -Annual Enterprise Channel (Preview) and it appears that the LAMBDA functions are not yet there in the said version.

Is there a way to implement the solution without MAP / LAMBDA.

@dipankar_de 

So you still have something to look forward to!  I couldn't wait, so put my semi-annual update aside and licenced a personal version (insider beta) in order to get to grips with the tumultuous changes.

If you are prepared to accept relative referencing and copy-fill-down then the formula

= SUM(
      XLOOKUP(@distinct,letter,number,,, 1)
    : XLOOKUP(@distinct,letter,number,,,-1)
  )

will do the same job (green header).

image.png

@Peter Bartholomew 

It all depends on concrete data. I assume we shall sumif 100000 range with > 7500 criteria elements. On my relatively slow PC

=SUMIFS( <100000 rows sum range>, <100000 rows names range>, <10000 rows unique names range> )

took about 15 sec.

I don't think that any formula with dynamic arrays, in general any other formula, will be working faster. 

 

To compare, refresh of Power Query aggregation for the same dataset takes about 4 sec and 2-3 sec more if aggregate 8 columns at once.

What is this for? This is a tech community area. Hence, please refrain from posting unsolicited and out of context commercial propositions / business promotion materials here.