Forum Discussion
spazceroni
Jan 04, 2025 · Copper Contributor
Regrouping Data in single row
Hi everyone, I've been struggling with an issue and can't find the answer using an AI Excel chatbot... Sorry if this thread has already been addressed, but I find it quite hard to formulate. Fi...
PeterBartholomew1
Jan 08, 2025 · Silver Contributor
There are some really ingenious solutions here! (Harun24HR was the first I noticed)
Mine was to start off with GROUPBY (as per Patrick2788) to obtain the comma-separated list, and then to use my usual variant of MAP to split the text into an array of arrays.
= LET(
    grouped,  GROUPBY(HSTACK(Appointment, Name, City), Costume, ARRAYTOTEXT, , 0),
    costumes, MAPλ(TAKE(grouped, , -1), LAMBDA(list, TEXTSPLIT(list, ", "))),
    HSTACK(TAKE(grouped, , 3), costumes)
)
MAPλ is a complicated function internally, but it returns an array of arrays when used in place of MAP and is reasonably simple to use.
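For anyone who does not have MAPλ defined, a minimal stand-in can be written directly with REDUCE/VSTACK (illustrative only; it suffers the quadratic copying discussed later in this thread, whereas the actual helper uses recursive bisection):

=LAMBDA(values, fn,
    REDUCE(
        fn(INDEX(values, 1)),
        DROP(values, 1),
        LAMBDA(acc, v, VSTACK(acc, fn(v)))
    )
)

Saved in the Name Manager as MAPλ, it applies fn to each element of a column (assuming at least two rows) and stacks the resulting rows; rows of unequal length are padded with #N/A, which IFNA can blank out.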
PeterBartholomew1
Jan 08, 2025 · Silver Contributor
Harun24HR When I looked at PIVOTBY, my first thought was to use the costume as the column headers. It works but leaves one with the problem of filtering the rows to drop null strings. Your numbering idea seems to work better.
My normal strategy is to express intermediate results as thunks (functions that reference arrays held in memory) and use one of my usual helper functions that return arrays of arrays.
This time I have 'thunked' the output from GROUPBY (the array from GROUPBY appears as a column, so TOROW is needed) and then used EVALTHUNKARRλ to convert the thunk array to the required array of arrays.
= LET(
    TOROWTHUNKλ, LAMBDA(x, THUNK(TOROW(x))),
    grouped,  GROUPBY(HSTACK(Appointment, Name, City), Costume, TOROWTHUNKλ, , 0),
    costumeϑ, TAKE(grouped, , -1),
    HSTACK(TAKE(grouped, , 3), EVALTHUNKARRλ(costumeϑ))
)
It is more direct but further down amongst the weeds. EVALTHUNKARRλ is a key step in my helper functions such as MAPλ and SCANVλ.
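Two helpers referenced here are not reproduced in the post. THUNK is the standard wrapper quoted further down this thread, and a naive stand-in for EVALTHUNKARRλ (a sketch that works but keeps the quadratic REDUCE/VSTACK cost the real bisecting helper avoids) might look like this:

THUNK
=LAMBDA(x, LAMBDA(x))

EVALTHUNKARRλ (naive sketch, assuming at least two thunks)
=LAMBDA(ϑs,
    REDUCE(
        (INDEX(ϑs, 1))(),
        DROP(ϑs, 1),
        LAMBDA(acc, ϑ, VSTACK(acc, ϑ()))
    )
)

Each ϑ() call evaluates one stored row of costumes, and VSTACK pads any short rows with #N/A.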
- djclements · Jan 08, 2025 · Bronze Contributor
Just a friendly comment regarding efficiency... iterative methods using expanding range references will typically only work well on smaller datasets. For something a little more robust, try using SCAN with pre- and post-sorting logic. For example:
=LET(
    row_fields, HSTACK(Appointment, Name, City),
    PIVOTBY(row_fields, "Costume " & INSTANCENUMλ(row_fields), Costume, SINGLE, 0, 0,, 0)
)
Where INSTANCENUMλ is defined as:
=LAMBDA(array,
    IF(
        ROWS(array) = 1,
        1,
        LET(
            arr, IF(COLUMNS(array) = 1, array, BYROW(array, LAMBDA(r, TEXTJOIN("|", 0, r)))),
            srt, SORT(HSTACK(arr, SEQUENCE(ROWS(arr)))),
            key, TAKE(srt,, 1),
            SORTBY(
                SCAN(0, key = VSTACK("", DROP(key, -1)), LAMBDA(a,v, 1 + a * v)),
                DROP(srt,, 1)
            )
        )
    )
)
On its own, INSTANCENUMλ can handle up to 100k rows of data with relative ease (< 1 second).
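A quick toy input (not from the thread) shows what the function returns:

=INSTANCENUMλ({"A";"A";"B";"A"})

The result is {1;2;1;3}: the first "A", the second "A", the first "B", then the third "A". Internally, the keys are pre-sorted to A, A, A, B (carrying their row numbers 1, 2, 4, 3), SCAN yields the running counts 1, 2, 3, 1, and SORTBY on the saved row numbers restores the original order.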
- PeterBartholomew1 · Jan 08, 2025 · Silver Contributor
Hi David,
Thank you, I will take a further look. A little more in the way of explanation may help me though. I recognise
LAMBDA(x, LAMBDA(x))
as a function that generates thunks but, as yet, it is not clear to me what the pipe separator is doing or, come to that, what the sorted and unsorted lists represent.
As for my bisection method, it was specifically written to address large-array problems in which REDUCE/VSTACK or MAKEARRAY/INDEX calculations scale as n² (each step copies the entire accumulated array, so the total work grows as 1 + 2 + … + n). The recursive bisections reduce this to order n·log₂(n). Maybe your approach will turn out to be faster, but they should be comparable.
- djclements · Jan 09, 2025 · Bronze Contributor
Hi Peter,
Yes, your generalized lambda functions are very robust for any given situation. With regards to efficiency, I was referring to Harun's MAP solution with expanding range references to generate running counts for each unique record.
I'm glad you inquired about the INSTANCENUMλ function, as it prompted me to conduct further testing. The section you're referring to:
REDUCE(TAKE(array,, 1), BYCOL(DROP(array,, 1), LAMBDA(x, LAMBDA(x))), LAMBDA(a,λ, a & "|" & λ()))
...can be replaced with:
BYROW(array, LAMBDA(r, TEXTJOIN("|", 0, r)))
This is done to merge multiple fields into a single column, representing the KeyID for each record. I don't know why I had it in my head that concatenation by column was faster than BYROW-TEXTJOIN. As it turns out, BYROW is more efficient. I have updated the definition for INSTANCENUMλ accordingly.
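As a small illustration with toy data (not from the thread), the BYROW-TEXTJOIN line produces one delimited key per record:

=LET(
    array, HSTACK({"Ana";"Ana";"Ben"}, {"Paris";"Lyon";"Paris"}),
    BYROW(array, LAMBDA(r, TEXTJOIN("|", 0, r)))
)

This returns {"Ana|Paris";"Ana|Lyon";"Ben|Paris"}; the pipe serves only as a field delimiter, so that different field combinations cannot collide when concatenated.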
As for the sorting logic, pre-sorting is done by KeyID, so SCAN can perform a basic running count; then, post-sorting is done by RowID to return the results back to their original order.
I did some speed tests on a randomized dataset with 100K rows of data, generated from a list of 20 different names, 15 different costumes, 10 different cities and a date range spanning 5 years, and the average times for each method were as follows:
- GROUPBY-THUNK-EVALTHUNKARRλ: 15.9 seconds
- PIVOTBY-INSTANCENUMλ: 12.9 seconds
- GROUPBY-TEXTJOIN-TEXTBEFORE/AFTER: 6.9 seconds
The randomized dataset used above generated a total of 85K unique records out of 100K. When I reduced the amount of data used to randomize the dataset (10 names, 10 costumes, 5 cities and a 2-year date range), it brought the total number of unique records down to 35K. With this second dataset, the average times for each method were:
- GROUPBY-THUNK-EVALTHUNKARRλ: 8.3 seconds
- PIVOTBY-INSTANCENUMλ: 5.8 seconds
- GROUPBY-TEXTJOIN-TEXTBEFORE/AFTER: 2.7 seconds
As for the PIVOTBY-MAP method with expanding range references, I only tested it with 5K and 10K rows of data, which averaged 6.2 and 23.7 seconds (first dataset), as well as 5.2 and 19.8 seconds (second dataset), respectively.
- Harun24HR · Jan 08, 2025 · Bronze Contributor
Thank you, sir, for noticing my answer. Can you please explain the problem of filtering rows to drop null strings, so that I am aware of it when using the formula in future? Your formula is quite advanced and hard to hold in my head; I will study it until I understand it.
- PeterBartholomew1 · Jan 08, 2025 · Silver Contributor
My first attempt at using PIVOTBY wasn't as sophisticated as yours. Instead of assigning an index to each costume item and using the indices as column headers, I simply used the costume names as column headers. This gave me the correct items but spread over a larger number of cells, one for each header. That left me the task of filtering out the spaces.
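That first attempt was presumably something along these lines (a reconstruction from the description above, mirroring the argument pattern of the earlier PIVOTBY call, not a formula actually posted):

=PIVOTBY(HSTACK(Appointment, Name, City), Costume, Costume, ARRAYTOTEXT, 0, 0,, 0)

With the costume names themselves as column headers, each row shows its items under the matching headers and null strings under all the others, hence the filtering problem.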
In the solution I have just posted, GROUPBY returns the array of costume items as a thunk.
A thunk can pass a formula as a function for postponed calculation. Normally I try to reduce the formula to a single LET variable so that the function represents a simple lookup from memory. The key fact is that a thunk is just a bit of executable text, so you can have arrays of thunks even when the thunks would evaluate to give range references or large 2D array structures.
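A minimal self-contained demonstration (toy example, not from the thread):

=LET(
    ϑ, LAMBDA(SEQUENCE(3)),
    ϑ()
)

The inner LAMBDA takes no arguments, so nothing is calculated until ϑ() is invoked, at which point it returns {1;2;3}. Because ϑ itself is a single value, an array of such thunks can be stored where Excel would reject a direct array of arrays.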
Having got an array of thunks, the challenge is to get Excel to stack the array of arrays without throwing a hissy fit. REDUCE/HSTACK would do it, but it is bad news for large arrays because it builds ever-larger arrays until it reaches the end. My solutions take a row/column of thunks, WRAP it to give 2 columns of ½ the length, and then evaluate and stack the pairs before turning the result back into a ½-length thunk array. After a few steps the array is reduced to a single thunk for evaluation.
Ten recursive steps using REDUCE are required for 1,000-term arrays, since ⌈log₂ 1000⌉ = 10.
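For the flavour of the idea, here is a recursive divide-and-conquer sketch of the same principle (a simplification, not the WRAP/REDUCE implementation described above; it must be saved in the Name Manager as STACKλ so that it can call itself):

STACKλ
=LAMBDA(ϑs,
    LET(
        n, ROWS(ϑs),
        IF(n = 1,
            (INDEX(ϑs, 1))(),
            LET(
                k, QUOTIENT(n, 2),
                VSTACK(STACKλ(TAKE(ϑs, k)), STACKλ(DROP(ϑs, k)))
            )
        )
    )
)

Each level of recursion copies every element once and there are roughly log₂(n) levels, which is where the n·log₂(n) scaling comes from.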