You're tasked with re-shaping and cleaning up the fictional Faber College alumni list to be used for a mail merge. Going from this:To this:To spice things up, the First and Last names must be in separate columns. Additionally, bellman records must not be included in your list (They've been contacted already). The rules are simple: The solution must be a formula. VBA, PowerQuery, helper columns, fill handle - not allowed. I've included my solution. I believe the best solutions for this exercise will involve some of the newest functions currently available on the Beta channel. I'm looking forward to seeing some creativity. Have at it!

I tried it with 'Alumni' and Excel refused to calculate it - returned a 0. There was no memory or CPU spike, it appears to be a limitation of MAKEARRAY. I tried running it in Excel web app for grins, wouldn't calculate it.

Patrick2788 For HSTACK it depends on both dimensions - number of stacked columns and number of rows in each. My short test gives

Patrick2788 The attached shows alternative points within the calculation at which one may choose to apply the Bellman filter. I also tests for blank rows (as it would for repeat headers) and removes them with FILTER rather than using the setting within TOCOL.[Note: the formula is still vulnerable to error if all 3 email addresses are missing from any given row]

Patrick2788 So I do not have Beta functions so here is an option without them. I don't know how well your solution performs on the full data set but mine crashes. As soon as you get into the couple thousands it start to take a lot of time and at some point before 10K it just crashes. I added a LAMBDA function (AlumniX) purely to easily select how many rows to test it on. It also tries adjusts that length to make it full records (as long as it isn't tripped up by one of those gap rows). Here is the formula: =LET(alum,FILTER(AlumniX(200),INDEX(AlumniX(200),,1)<>""), n,ROWS(alum), list,MAKEARRAY(n,4, LAMBDA(r,c,CHOOSE(c, LET(name,INDEX(alum,MROUND(r+1,3)-2,MOD(r-1,3)+1),LEFT(name,SEARCH(" ",name))), LET(name,INDEX(alum,MROUND(r+1,3)-2,MOD(r-1,3)+1),MID(name,SEARCH(" ",name),LEN(name))), INDEX(alum,MROUND(r+1,3)-1,MOD(r-1,3)+1), INDEX(alum,MROUND(r+1,3),MOD(r-1,3)+1)))), FILTER(list, INDEX(list,,3)<>"Bellman"))

Formula challenge: re-shape and cleanup

23 Replies

Patrick2788
Silver Contributor
Jun 02, 2022
Patrick2788
Doing some more testing with this data set. I expanded the alumni list to just shy of 1 million rows. My solution and PeterBartholomew1 's formula produced #NUM! errors. I found HSTACK, when run by itself and stacking FirstWord and FinalWord, was failing with precisely 524,289 rows in the alumni named range. I shortened the range and HSTACK worked again.

The main formulas still produced #NUM! errors because WRAPROWS, CHOOSECOLS, CHOOSEROWS, TOCOL, etc. - had reached limits and were producing #NUM! errors, too. Granted, these functions are still in BETA but it's fascinating to get an idea of what they can handle. Maybe these limits will change once released.
- SergeiBaklan
  Diamond Contributor
  Jun 02, 2022
  Patrick2788
  
  For HSTACK it depends on both dimensions - number of stacked columns and number of rows in each. My short test gives
  - PeterBartholomew1
    Silver Contributor
    Jun 02, 2022
    SergeiBaklan
    There is a pattern here! (grin)
Patrick2788
Silver Contributor
May 31, 2022
Patrick2788
I've done some more testing with MAKEARRAY. I've been feeding it large ranges trying to get it to quit but no luck.
Granted, there's no re-shaping being done but Excel is able to calculate this:
=MAKEARRAY(1000000,26,LAMBDA(r,c,INDEX(A1:Z1000000,r,c)))
I'm interested in what part of mtarler 's formula is causing Excel to refuse to calculate. This is presuming it's mostly the presence of MAKEARRAY. Going off the calculation times posted by PeterBartholomew1 in the other discussion, it has to be MAKEARRAY.
- PeterBartholomew1
  Silver Contributor
  Jun 02, 2022
  Patrick2788
  Firstly, my congratulations on a very efficient and compact solution.
  
  On the MAKEARRAY discussion, I do not think the function is itself a major problem, provided each calculation is a scalar evaluation. Where I think I hit problems is where the each calculation is actually a full array calc but uses INDEX to restrict the return to a single value. Once each value comes at the price of a full array calc, the time builds as O(n²).
  
  My solution was both more complicated and more expensive than yours (x3) but it did allow me to explore some techniques that I thought might be usable for other problems.
  = LET( Fnλ, HSTACK(FirstWordλ,FinalWordλ,Identityλ,Identityλ), baseTbl, WRAPROWS(TOCOL(alumni,1,1),3), extendedTbl, CHOOSECOLS(baseTbl,{1,1,2,3}), u, SEQUENCE(ROWS(extendedTbl),,1,0), Fnsλ, CHOOSEROWS(Fnλ,u), processedTbl, MAP(Fnsλ,extendedTbl,Applyλ), NoBellλ(processedTbl) )
  The key part of my experiment was to build an array of Lambda functions of the same dimension as the table to which it was to be applied. Then the functions could be applied one by one using MAP. My take is that the approach involves some pretty heavy computational steps but it remains O(n), so should not lock up as the problem size increases.
  
  Faber College Alumni List (PB).xlsx14.3 MB
  - Patrick2788
    Silver Contributor
    Jun 02, 2022
    PeterBartholomew1
    Thanks for sharing, Peter. I like your approach. It's very clean and easy to follow. MAP has tremendous potential when used with multiple arrays. Later when I'm at my other computer I may extended the rows down to 1 million and have at it with the timer. CHOOSECOLS/CHOOSEROWS/TOCOL/HSTACK appear to be very efficient calculation-wise in handling large arrays.
    
    When I designed this exercise, I was curious how others would go about removing the 'bellman' records. From the other discussion, I recalled SergeiBaklan mention your use of FILTER possibly slowing down your calculation so I went in determined not to use FILTER. Ultimately, I used it thinking it would run smoothly if stored inside the NOBELL LAMBDA and called from my main formula. FILTER seems to be best approach to exclude records at the moment.

Riny_van_Eekelen

Platinum Contributor

May 31, 2022

Patrick2788 Although you explicitly mentioned NOT to use PQ, I did it anyway. I'm a lazy person and favor easy solutions over complicated ones. Your formula is impressively short and performed lightning fast on my laptop (also a 32GB machine).

Please find the PQ script that I used below. It worked as fast :))

let
    Source = Excel.CurrentWorkbook(){[Name="alumni"]}[Content],
    RemoveBlankRows = Table.SelectRows(Source, each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
    SplitTbl = Table.Split( RemoveBlankRows, 3 ),
    Convert = Table.FromList(SplitTbl, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    Transpose = Table.AddColumn(Convert, "Custom", each Table.Transpose ([Column1])),
    Remove = Table.SelectColumns(Transpose,{"Custom"}),
    Expand = Table.ExpandTableColumn(Remove, "Custom", {"Column1", "Column2", "Column3"}),
    Split = Table.SplitColumn(Expand, "Column1", Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.Csv, true), {"Column1.1", "Column1.2"}),
    Filter = Table.SelectRows(Split, each not Text.Contains([Column2], "Bellman"))
in
    Filter

With some knowledge of M (and I'm far from an expert) this isn't all that complicated and easy to follow. Thanks for the challenge and sorry for violating the rules of it.

Patrick2788
Silver Contributor
May 31, 2022
Thanks for sharing! I've been meaning to dive into PQ more and this is something I can study. Thank you!

mtarler
Silver Contributor
May 30, 2022
Patrick2788 So I do not have Beta functions so here is an option without them. I don't know how well your solution performs on the full data set but mine crashes. As soon as you get into the couple thousands it start to take a lot of time and at some point before 10K it just crashes. I added a LAMBDA function (AlumniX) purely to easily select how many rows to test it on. It also tries adjusts that length to make it full records (as long as it isn't tripped up by one of those gap rows). Here is the formula:

=LET(alum,FILTER(AlumniX(200),INDEX(AlumniX(200),,1)<>""), n,ROWS(alum), list,MAKEARRAY(n,4, LAMBDA(r,c,CHOOSE(c, LET(name,INDEX(alum,MROUND(r+1,3)-2,MOD(r-1,3)+1),LEFT(name,SEARCH(" ",name))), LET(name,INDEX(alum,MROUND(r+1,3)-2,MOD(r-1,3)+1),MID(name,SEARCH(" ",name),LEN(name))), INDEX(alum,MROUND(r+1,3)-1,MOD(r-1,3)+1), INDEX(alum,MROUND(r+1,3),MOD(r-1,3)+1)))), FILTER(list, INDEX(list,,3)<>"Bellman"))

Faber College Alumni List.xlsx6.6 MB
- Patrick2788
  Silver Contributor
  May 31, 2022
  I did not have any trouble with your workbook. I'm running this on a new computer with 32 GB RAM, i7 processor. I will try it on my old computer.
  - mtarler
    Silver Contributor
    May 31, 2022
    Did you try replacing AlumniX(200) with AlumniX(10000) or just Alumni? It works great at a couple hundred of lines but got slow at a couple thousand and I'm pretty sure it 'broke' around 10000 and forget about 100000+. That said, my computer has been slogging lately and is well over due to be replaced.

Forum Discussion

Formula challenge: re-shape and cleanup

23 Replies

Resources