Using TEXTSPLIT without copy-fill-down

Copper Contributor

Hello All,

 

I am trying to split "; " separated range of text into columns. The intent is to use one single spill-formula (instead of copy-fill-down) to address each text in the range. 

 

I have the formula to do this for a small set of values using CONCAT and TEXTSPLIT functions. However, when I try doing it with a set of values that run beyond 32,767 characters, CONCAT function runs into a #VALUE error because of Cell value length restriction in Excel.

 

This is because using the CONCAT function I was trying to create a long-long string taking the col_delimited values in a cell in each row and padding a row_delimiter at the end of the cell value in each row for the entire range. The objective was to submit the long-long string, so created using the CONCAT function, to the TEXTSPLIT function to split it into columns and rows based on the col_delimiter and row_delimeter characters. 

 

The solution to this, I believe, is to split values in each row (cell) individually. However, I want to get this done using one single array formula for the entire range, which unfortunately, I am unable to accomplish. I have tried using TEXTSPLIT within MAP-LAMBDA to process each row individually, but it only splits the first element of the col_delimited string in each row - the remaining of col_delimited string in each row is ignored.

 

Request you all to help me with a single-formula spill-array solution to the problem I have. An example case is included in the file (TEXTSPLIT.xlsx) with the range to split (Table2). The single-formula solution that works for a small set of values is given in Cell E3. The single formula solution that failed to work as desired (TEXTSPLIT within MAP-LAMBDA) is available in Cell P3.

 

Please also let me know why this didn't work even when processing each row individually using MAP-LAMBDA.

 

Thanks in advance.

 

Warm regards,

Dipankar De

 

19 Replies

@dipankar_de 

At first I thought that this could be solved quite simply with TEXTSPLIT() within BYROW(). But that always resulted in a CALC# error.

I then solved this rather oldscool with MID() SEARCH() and SEQUENCE().

There are probably other variants.

 

=LET(
in,SUBSTITUTE("; "&Table2[Values]&"; "," ",""),
maxi,MAX(LEN(in)-LEN(SUBSTITUTE(in,";","")))-1,
start,SEARCH(CHAR(124),SUBSTITUTE(in,";",CHAR(124),SEQUENCE(1,maxi,1))),
end,SEARCH(CHAR(124),SUBSTITUTE(in,";",CHAR(124),SEQUENCE(1,maxi,2))),
HSTACK(Table2['#],IFERROR(MID(in,start+1,end-start-1),"")))

 

 

I hope that's OK for you.

@dscheikey  alternative using REDUCE and VSTACK:

=LET(colOne, Table2['#], colRest, Table2[Values], splittedData, 
    IFERROR(DROP(
       REDUCE("",colRest, 
          LAMBDA(p,q, VSTACK(p, TEXTSPLIT(q, "; ", , FALSE, 1, "")))
          )
       ,1),""),
    HSTACK(colOne, splittedData))

@dipankar_de 

Here's my take:

 

=LET(matrix,MAKEARRAY(ROWS(Table2),9,LAMBDA(r,c,CHOOSECOLS(EXPAND(TEXTSPLIT(INDEX(Table2[Values],r),";"),1,9,""),c))),HSTACK(Table2['#],matrix))

 

@dipankar_de 

I think the abject failure to process arrays or arrays is the most ridiculous error of judgement made by Microsoft in what is otherwise a suburb sequence of innovations, turning Excel from amateurs' corner into something capable of creating reusable, quality solutions.  One of the principal benefits of the grid is that it allows one to calculate and display arrays of arrays.  To pretend otherwise is to store up huge compatibility problems for when sense eventually prevails. 

 

To offer an absurd solution, I developed a custom Lambda function "Repackλ" to turn the comma-deliminated values into fixed width strings

 

Repackλ 
= LAMBDA(t,
    LET(
        splitText, TEXTSPLIT(t, ","),
        length,    LEN(splitText),
        CONCAT(splitText & REPT(" ", 10 - length))
    )
);

 

giving

image.png

What appears to be a monumental waste of effort pays off because MID will generate and array of arrays if used to wrap the Lambda function.

 

= MID( MAP(Text, Repackλ), {1,11,21,31,41}, 10)

 

giving

PeterBartholomew_0-1668286679200.png

Having to go back to MID to achieve something that 'Excel has never done?' is so annoying that I propose to continue expressing my scorn at every opportunity (and there are many opportunities, which makes the need for change so urgent).  I hope you all join in to make Microsoft listen even if just to save themselves further embarrassment!

 

 

 

@dscheikey@mtarler; @Patrick2788; @Peter Bartholomew

 

Thanks to each of you for providing me with the exact help I was looking for on this forum.

 

@dscheikey: Your formula might be an old school one but it did solve my problem. I created a similar formula but that was a copy-fill-down type - one formula per row which was restricting me to use the P3# spill-array addressing in subsequent processing. Many thanks for your help - I really liked the  start and end formula snippets. The way I did it in my copy-fill-down formula was:

=LET(a, $B5, b, $C5, data, "|" & SUBSTITUTE(b, "; ", "|") & "|",

         WrdCnt, LEN(data) - LEN(SUBSTITUTE(data, "|", "")),
         start, FIND(CHAR(2), SUBSTITUTE(data, "|", CHAR(2), SEQUENCE(, WrdCnt))),

         end, SEQUENCE(1, COLUMNS(start) - 1),
         O, MID(data, INDEX(start, end) + 1, INDEX(start, end + 1) - INDEX(start, end) - 1), HSTACK(a, O)),

 

Seems, the way I had written the "end" was restricting the row-spill when extended to the complete range of rows. Thanks again.

 

@mtarler: You provided the new school formula that I was looking for but was encountering errors due to the "array of arrays" issue. The formula with REDUCE, VSTACK, TEXTSPLIT did the breakthrough. However, there is still an issue which I am not able to resolve - the output generated by the formula is changing the case of the strings to lowercase. Please look through the "Solution - @mtarler" worksheet in the attached. Do you see a reason why the case is getting changed and what is changing it?

 

https://1drv.ms/x/s!Aqyz05eCPz3-rWwakzdAFKiYsjtH?e=ydgpxa

 

@Patrick2788: Thanks for sharing the link to the "Array of Arrays" limitation literature. I haven't had tried your formula on the actual problem (string length more than 32,767 characters) yet. However, looking at it, I see you have hardcoded the number of columns in the MAKEARRAY function parameters. Anyway, this is a small thing to rectify by taking the max of the col_delimiter count. Will let you know once I have looked through your formula.

 

@Peter Bartholomew: You and Sergei are simply the Best. Though I can follow the solutions you provide, it would take quite some time for me to reach that level. I completely agree with you that going back to MID is really annoying after having seen the power of new age Excel functions.

@dipankar_de 

I have 'interfered' with Matt's solution to add further refinement or complexity (however you take it) to create a worksheet function

= SplitListλ(Response,"; ",185)

which uses the Lambda funtions

SplitListλ
= LAMBDA(list,s,n,
      DROP(
          REDUCE("", list, SplitTextλ(s,n)),
      1,0)
  );


SplitTextλ 
= LAMBDA(s,n, LAMBDA(p, q, 
      VSTACK(p, 
          EXPAND(TEXTSPLIT(q, s), , n, "")
      )
  ));

 

@Peter Bartholomew 

 

I was looking through as you were refining the formula initially created by Matt. Two questions I have are:

1. In Matt's formula what was creating the lowercase?

2. What necessitates to include EXPAND with TEXTSPLIT? If we don't include EXPAND and instead provide all the parameters that TEXTSPLIT has, what would be the difference?

One more question that I earlier missed:
3. What's the benefit using SplitTextλ - how would it impact if the code in SplitTextλ is included in SplitListtλ?
Another issue I ran into:
4. If value in one of the response (Table3[Respose]) cells is made blank (delete the cell content), the formula generates a #VALUE! error - the error is not going away even after the value for the ignore_empty parameter of the TEXTSPILT function has been set to preserve (FALSE)
So that is interesting. Apparently if you set TEXTSPLIT to be case insensitive it must apply a LOWER() function to the output. I would NOT expect the output to be affected and only the comparator but if I change the TEXTSPLIT(q, "; ", , FALSE, 1, "") to TEXTSPLIT(q, "; ", , FALSE, 0, "") it solved that issue.
1. TEXTSPLIT has a case-sensitivity parameter that appears to have unexpected side effects.
2. EXPAND offers an alternative to trapping #N/A at the VSTACK stage
3. It is just a style thing, though I admit I haven't applied it sufficiently to be able to recommend it. The idea is that all LAMBDA functions should be named, especially those accepting parameters from Lambda helper functions. The result is that the calling formula is more concise and dummy variables are not exposed.
4. I didn't consider the possibility that the target cell might be null. The TEXTSPLIT parameter will only address the situation of multiple consecutive separators within a populated cell. The #VALUE! error is a valid result which could be trapped if required.

@dipankar_de 

I'd simplify to

=LET(
   data, HSTACK( Table2['#],  TEXTSPLIT( TEXTJOIN("|",,Table2[Values]), ";","|",,,"") ),
   header, HSTACK( {"#","Key"}, "Value" & SEQUENCE(, COLUMNS(data) - 2 ) ),
   VSTACK( header, data )
 )
Hello Sergei,

This doesn't work when the length of the string, containing col_delimited text in the range, is more than 32,767 characters due to Excel limitations. This is the reason I requested for help on this forum for a single spill array formula that will process each col_delimited string in each row individually but will output the results for the entire range in one go.

Both CONCAT and TEXTJOIN have the 32,767-character restriction and hence the problem.

the example spreadsheet can be found at https://1drv.ms/x/s!Aqyz05eCPz3-rWwakzdAFKiYsjtH?e=ydgpxa in which you can look through any of the worksheets than the TEXTSPLIT worksheet for the near real-world problem.

Warm regards,
Dipankar De

@dipankar_de Here's an updated variant, using a TEXTBEFORE/TEXTAFTER based LAMBDA function:

 

=HSTACK(Table2['#], TEXTTOCOLS(Table2[Values], "; "))

 

- OR -

 

=TEXTTOCOLS(Table2['#]&"; "&Table2[Values], "; ")

 

Where TEXTTOCOLS is defined in Name Manager as:

 

=LAMBDA(array,delimiter,
    LET(
        delims, (LEN(array)-LEN(SUBSTITUTE(array, delimiter, )))/LEN(delimiter),
        cols, MAX(delims)+1,
        arr, delimiter&array&REPT(delimiter, cols-delims),
        TEXTBEFORE(TEXTAFTER(arr, delimiter, SEQUENCE(, cols)), delimiter)
    )
)

 

This method is similar to dscheikey's solution in that it avoids the character limitations imposed on TEXTJOIN and CONCAT and is more efficient than the iterative alternatives, REDUCE/VSTACK and MAKEARRAY/INDEX, when working with larger datasets.

@djclements 

The same idea but with slightly different strengths and weaknesses.

TEXTBETWEEN
= LAMBDA(string, start, [end], [n],
    LET(
        end, IF(ISOMITTED(end), start, end),
        k, SEQUENCE(1, n),
        TEXTAFTER(TEXTBEFORE(string, end, k,,1), start,-1,,1)
    )
  )

The start and end are separators that can be used to extract substrings.  I have left it to the programmer to determine the number of terms to return. 

Basically, though, I agree that your suggestion is a good way around the pitiful 'array of arrays' problem.

 

 

@Peter Bartholomew Good stuff. Looks like you got the syntax figured out for the AFTER/BEFORE arrangement (as opposed to BEFORE/AFTER). I did notice one little glitch, though... when the length of the delimiter is greater than the length of the value in the first field (before the first delimiter), it will return #VALUE!. Interestingly enough, a similar glitch occurred for me when I tried using the "Match to end" option (with a shorter version of TEXTTOCOLS), where it omitted the value in the last field if its length was less than the delimiter.

 

Another version of TEXTTOCOLS that does appear to work:

 

=LAMBDA(array,delimiter,
    LET(
        arr, delimiter&array&delimiter,
        cols, SEQUENCE(, MAX(LEN(arr)-LEN(SUBSTITUTE(arr, delimiter, )))/LEN(delimiter)-1),
        IFERROR(TEXTBEFORE(TEXTAFTER(arr, delimiter, cols), delimiter), "")
    )
)

 

...but it relies on the IFERROR function to handle any #VALUE! or #N/A errors.

 

This all came about after responding to another post, Counting the number of occurrences of an entity in a table, where I used virtually the same technique to generate a unique list of items from a range of delimited strings with varying lengths. Sure, TEXTSPLIT/TEXTJOIN could've been used there, but I don't like the character limitations of that method.

 

I'm with you 100% that nested arrays should be allowed. Iterative alternatives, while useful, are typically not very efficient with larger datasets. In some situations, the right combination of dynamic array helper functions can be used to efficiently simulate an array of arrays, but it usually requires multiple steps involving various manipulations/transformations, which can be a tough sell for the average user. One such example can be found here: Multi-Row Data Consolidation Into Single Row, Multi-Column 

 

Cheers! :)