Forum Discussion
spazceroni
Jan 04, 2025 · Copper Contributor
Regrouping Data in single row
Hi everyone, I've been struggling with an issue and can't find the answer using an AI Excel chatbot... Sorry if this thread has already been addressed, but I find it quite hard to formulate. Fi...
PeterBartholomew1
Jan 08, 2025 · Silver Contributor
There are some really ingenious solutions here! (Harun24HR was the first I noticed)
Mine was to start off with GROUPBY (as per Patrick2788) to obtain the comma-separated list, and then to use my usual variant of MAP to split the text into an array of arrays.
= LET(
    grouped,  GROUPBY(HSTACK(Appointment, Name, City), Costume, ARRAYTOTEXT, , 0),
    costumes, MAPλ(TAKE(grouped, , -1), LAMBDA(list, TEXTSPLIT(list, ", "))),
    HSTACK(TAKE(grouped, , 3), costumes)
)
MAPλ is a complicated function internally, but it returns an array of arrays when used in place of MAP and is reasonably simple to use.
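For anyone who does not have MAPλ defined, a minimal stand-in can be written directly with REDUCE/VSTACK (illustrative only; it suffers the quadratic copying discussed later in this thread, whereas the actual helper uses recursive bisection):

=LAMBDA(values, fn,
    REDUCE(
        fn(INDEX(values, 1)),
        DROP(values, 1),
        LAMBDA(acc, v, VSTACK(acc, fn(v)))
    )
)

Saved in the Name Manager as MAPλ, it applies fn to each element of a column (assuming at least two rows) and stacks the resulting rows; rows of unequal length are padded with #N/A, which IFNA can blank out.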
PeterBartholomew1
Jan 08, 2025 · Silver Contributor
Harun24HR When I looked at PIVOTBY, my first thought was to use the costume as the column headers. It works but leaves one with the problem of filtering the rows to drop null strings. Your numbering idea seems to work better.
My normal strategy is to express intermediate results as thunks (functions that reference arrays held in memory) and use one of my usual helper functions that return arrays of arrays.
This time I have 'thunked' the output from GROUPBY (the array from GROUPBY appears as a column, so TOROW is needed) and then used EVALTHUNKARRλ to convert the thunk array to the required array of arrays.
= LET(
    TOROWTHUNKλ, LAMBDA(x, THUNK(TOROW(x))),
    grouped,  GROUPBY(HSTACK(Appointment, Name, City), Costume, TOROWTHUNKλ, , 0),
    costumeϑ, TAKE(grouped, , -1),
    HSTACK(TAKE(grouped, , 3), EVALTHUNKARRλ(costumeϑ))
)
It is more direct but further down amongst the weeds. EVALTHUNKARRλ is a key step in my helper functions such as MAPλ and SCANVλ.
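Two helpers referenced here are not reproduced in the post. THUNK is the standard wrapper quoted further down this thread, and a naive stand-in for EVALTHUNKARRλ (a sketch that works but keeps the quadratic REDUCE/VSTACK cost the real bisecting helper avoids) might look like this:

THUNK
=LAMBDA(x, LAMBDA(x))

EVALTHUNKARRλ (naive sketch, assuming at least two thunks)
=LAMBDA(ϑs,
    REDUCE(
        (INDEX(ϑs, 1))(),
        DROP(ϑs, 1),
        LAMBDA(acc, ϑ, VSTACK(acc, ϑ()))
    )
)

Each ϑ() call evaluates one stored row of costumes, and VSTACK pads any short rows with #N/A.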
- djclements · Jan 08, 2025 · Bronze Contributor
Just a friendly comment regarding efficiency... iterative methods using expanding range references will typically only work well on smaller datasets. For something a little more robust, try using SCAN with pre- and post-sorting logic. For example:
=LET(
    row_fields, HSTACK(Appointment, Name, City),
    PIVOTBY(row_fields, "Costume " & INSTANCENUMλ(row_fields), Costume, SINGLE, 0, 0,, 0)
)
Where INSTANCENUMλ is defined as:
=LAMBDA(array,
    IF(
        ROWS(array) = 1,
        1,
        LET(
            arr, IF(COLUMNS(array) = 1, array, BYROW(array, LAMBDA(r, TEXTJOIN("|", 0, r)))),
            srt, SORT(HSTACK(arr, SEQUENCE(ROWS(arr)))),
            key, TAKE(srt,, 1),
            SORTBY(
                SCAN(0, key = VSTACK("", DROP(key, -1)), LAMBDA(a,v, 1 + a * v)),
                DROP(srt,, 1)
            )
        )
    )
)
On its own, INSTANCENUMλ can handle up to 100k rows of data with relative ease (< 1 second).
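A quick toy input (not from the thread) shows what the function returns:

=INSTANCENUMλ({"A";"A";"B";"A"})

The result is {1;2;1;3}: the first "A", the second "A", the first "B", then the third "A". Internally, the keys are pre-sorted to A, A, A, B (carrying their row numbers 1, 2, 4, 3), SCAN yields the running counts 1, 2, 3, 1, and SORTBY on the saved row numbers restores the original order.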
- PeterBartholomew1 · Jan 08, 2025 · Silver Contributor
Hi David,
Thank you, I will take a further look. A little more in the way of explanation may help me though. I recognise
LAMBDA(x, LAMBDA(x))
as a function that generates thunks but, as yet, it is not clear to me what the pipe separator is doing or, come to that, what the sorted and unsorted lists represent.
As for my bisection method, it was specifically written to address large-array problems in which REDUCE/VSTACK or MAKEARRAY/INDEX calculations scale as n² (each step copies the entire accumulated array, so the total work grows as 1 + 2 + … + n). The recursive bisections reduce this to order n·log₂(n). Maybe your approach will turn out to be faster, but they should be comparable.
- djclements · Jan 09, 2025 · Bronze Contributor
Hi Peter,
Yes, your generalized lambda functions are very robust for any given situation. With regards to efficiency, I was referring to Harun's MAP solution with expanding range references to generate running counts for each unique record.
I'm glad you inquired about the INSTANCENUMλ function, as it prompted me to conduct further testing. The section you're referring to:
REDUCE(TAKE(array,, 1), BYCOL(DROP(array,, 1), LAMBDA(x, LAMBDA(x))), LAMBDA(a,λ, a & "|" & λ()))
...can be replaced with:
BYROW(array, LAMBDA(r, TEXTJOIN("|", 0, r)))
This is done to merge multiple fields into a single column, representing the KeyID for each record. I don't know why I had it in my head that concatenation by column was faster than BYROW-TEXTJOIN. As it turns out, BYROW is more efficient. I have updated the definition for INSTANCENUMλ accordingly.
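As a small illustration with toy data (not from the thread), the BYROW-TEXTJOIN line produces one delimited key per record:

=LET(
    array, HSTACK({"Ana";"Ana";"Ben"}, {"Paris";"Lyon";"Paris"}),
    BYROW(array, LAMBDA(r, TEXTJOIN("|", 0, r)))
)

This returns {"Ana|Paris";"Ana|Lyon";"Ben|Paris"}; the pipe serves only as a field delimiter, so that different field combinations cannot collide when concatenated.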
As for the sorting logic, pre-sorting is done by KeyID, so SCAN can perform a basic running count; then, post-sorting is done by RowID to return the results back to their original order.
I did some speed tests on a randomized dataset with 100K rows of data, generated from a list of 20 different names, 15 different costumes, 10 different cities and a date range spanning 5 years, and the average times for each method were as follows:
- GROUPBY-THUNK-EVALTHUNKARRλ: 15.9 seconds
- PIVOTBY-INSTANCENUMλ: 12.9 seconds
- GROUPBY-TEXTJOIN-TEXTBEFORE/AFTER: 6.9 seconds
The randomized dataset used above generated a total of 85K unique records out of 100K. When I reduced the amount of data used to randomize the dataset (10 names, 10 costumes, 5 cities and a 2-year date range), it brought the total number of unique records down to 35K. With this second dataset, the average times for each method were:
- GROUPBY-THUNK-EVALTHUNKARRλ: 8.3 seconds
- PIVOTBY-INSTANCENUMλ: 5.8 seconds
- GROUPBY-TEXTJOIN-TEXTBEFORE/AFTER: 2.7 seconds
As for the PIVOTBY-MAP method with expanding range references, I only tested it with 5K and 10K rows of data, which averaged 6.2 and 23.7 seconds (first dataset), as well as 5.2 and 19.8 seconds (second dataset), respectively.
- Harun24HR · Jan 08, 2025 · Bronze Contributor
Thank you, sir, for noticing my answer. Can you please explain the problem of filtering rows to drop null strings, so that I am aware of it when using the formula in future? Your formula is quite advanced and hard to hold in my head; I will study it until I understand it.
- PeterBartholomew1 · Jan 08, 2025 · Silver Contributor
My first attempt at using PIVOTBY wasn't as sophisticated as yours. Instead of assigning an index to each costume item and using the indices as column headers, I simply used the costume names as column headers. This gave me the correct items but spread over a larger number of cells, one for each header. That left me the task of filtering out the spaces.
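That first attempt was presumably something along these lines (a reconstruction from the description above, mirroring the argument pattern of the earlier PIVOTBY call, not a formula actually posted):

=PIVOTBY(HSTACK(Appointment, Name, City), Costume, Costume, ARRAYTOTEXT, 0, 0,, 0)

With the costume names themselves as column headers, each row shows its items under the matching headers and null strings under all the others, hence the filtering problem.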
In the solution I have just posted, GROUPBY returns the array of costume items as a thunk.
A thunk can pass a formula as a function for postponed calculation. Normally I try to reduce the formula to a single LET variable so that the function represents a simple lookup from memory. The key fact is that a thunk is just a bit of executable text, so you can have arrays of thunks even when the thunks would evaluate to give range references or large 2D array structures.
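A minimal self-contained demonstration (toy example, not from the thread):

=LET(
    ϑ, LAMBDA(SEQUENCE(3)),
    ϑ()
)

The inner LAMBDA takes no arguments, so nothing is calculated until ϑ() is invoked, at which point it returns {1;2;3}. Because ϑ itself is a single value, an array of such thunks can be stored where Excel would reject a direct array of arrays.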
Having got an array of thunks, the challenge is to get Excel to stack the array of arrays without throwing a hissy fit. REDUCE/HSTACK would do it, but it is bad news for large arrays because it builds ever-larger arrays until it reaches the end. My solutions take a row/column of thunks, WRAP it to give 2 columns of ½ the length, and then evaluate and stack the pairs before turning the result back into a ½-length thunk array. After a few steps the array is reduced to a single thunk for evaluation.
Ten recursive steps using REDUCE are required for 1,000-term arrays, since ⌈log₂ 1000⌉ = 10.
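For the flavour of the idea, here is a recursive divide-and-conquer sketch of the same principle (a simplification, not the WRAP/REDUCE implementation described above; it must be saved in the Name Manager as STACKλ so that it can call itself):

STACKλ
=LAMBDA(ϑs,
    LET(
        n, ROWS(ϑs),
        IF(n = 1,
            (INDEX(ϑs, 1))(),
            LET(
                k, QUOTIENT(n, 2),
                VSTACK(STACKλ(TAKE(ϑs, k)), STACKλ(DROP(ϑs, k)))
            )
        )
    )
)

Each level of recursion copies every element once and there are roughly log₂(n) levels, which is where the n·log₂(n) scaling comes from.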