Selecting Data with Minimum Standard Deviation

mtarler
Silver Contributor
Oct 19, 2022
alvaro037 Here is an alternate solution. I reduce the number of trials because we know std will be lowest for numbers close together so I sort the numbers and then increment through the list:

=LET(in,B2:K2, n,5, sortin, SORT(TRANSPOSE(in)), start, SEQUENCE(ROWS(sortin)-n), stdlist, SCAN(0,start,LAMBDA(p,i,STDEV.P(INDEX(sortin,start+i-1)))), findLow,MATCH(MIN(stdlist),stdlist,0), INDEX(sortin,start+findLow-1))

see attached
EDIT:
a) please note my SD calculation used STD.P per Joe's suggestion above while Hans' uses STD.S and my solution found a set starting with 8.6-8.9 and Hans' found 8.7-9.0 but the SD is the same in each case.
b) here is an alternate version of the formula that stacks/includes the AVG and SD:

=LET(in,B2:K2, n,5, sortin, SORT(TRANSPOSE(in)), start, SEQUENCE(ROWS(sortin)-n), stdlist, SCAN(0,start,LAMBDA(p,i,STDEV.P(INDEX(sortin,start+i-1)))), findLow,MATCH(MIN(stdlist),stdlist,0), ans,INDEX(sortin,start+findLow-1), out,VSTACK(ans,"Avg",AVERAGE(ans),"SD",STDEV.P(ans)), out)

I prefer to have them separate as in the attached file but that is up to you
MinSD.xlsm21 KB
- JoeUser2004
  Bronze Contributor
  Oct 19, 2022
  mtarler wrote:
  
  =LET(in,B2:K2, n,5, sortin, SORT(TRANSPOSE(in)), start, SEQUENCE(ROWS(sortin)-n), stdlist, SCAN(0,start,LAMBDA(p,i,STDEV.P(INDEX(sortin,start+i-1)))), findLow,MATCH(MIN(stdlist),stdlist,0), ans,INDEX(sortin,start+findLow-1), out,VSTACK(ans,"Avg",AVERAGE(ans),"SD",STDEV.P(ans)),out)
  
  Forgive, but I don't speak the new Excel 365 language.
  
  Please describe in one or two sentences what the code above does.
  
  For example, does it find the contiguous n sorted data with the smallest std dev?
  
  The operative word is "contiguous". Right?
  - mtarler
    Silver Contributor
    Oct 19, 2022
    JoeUser2004 sure. and ugh I found another error in going through it because there were 10 items and looking for 5 I was using the n=5 and the 10-5 (i.e. 5) together. I tweaked the code to account for possible other combinations:
    =LET(in,B3:K3, n,5, sortin, SORT(TRANSPOSE(in)), start, SEQUENCE(ROWS(sortin)-n+1), stdlist, SCAN(0,start,LAMBDA(p,i,STDEV.P(INDEX(sortin,SEQUENCE(n,,0)+i)))), findLow,MATCH(MIN(stdlist),stdlist,0), ans,INDEX(sortin,SEQUENCE(n,,0)+findLow), out,VSTACK(ans,AVERAGE(ans),STDEV.P(ans)), TRANSPOSE(out))
    line 1: define the input range
    line 2: define how many items to select
    line 3: rearrange the data from in to be sorted in order
    line 4: define a sequence 1,2,3,... up to the length of the input minus N so if you have 100 inputs and selecting 5 then you will want to start at 1,2,3,..., 93,94,95,96 and you stop at 96 because after that you won't have enough left to get 5 items
    line 5: create a list of standard deviations for each group of n items starting with index 1 and going to the end of the list
    line 6: find the location in the list where the MIN value is located
    line 7: create a list of the N values used to create that MIN standard deviation value
    line 8: stack the list from 7 and add the average and the standard deviation
    
    alvaro037 please see this formula or the attached file for the corrected formula
    MinSD.xlsm21 KB
- JoeUser2004
  Bronze Contributor
  Oct 19, 2022
  mtarler wrote: ``we know std will be lowest for numbers close together``
  
  My version of Excel does not support those features.
  
  Which subset of 5 is chosen with the following set of data:
  
  1.1
  1.3
  1.5
  1.7
  1.9
  100.1
  100.2
  100.3
  100.4
  100.5
  
  The correct subset is 100.1 through 100.5 (sd = 0.141421356237312), not 1.1 through 1.9 (sd = 0.282842712474619).
  
  And just for fun, which subset is chosen if the first 5 numbers are 1.1, 1.2, 1.3, 1.4 and 1.5?
  
  In that case, the sd is the same.
  - mtarler
    Silver Contributor
    Oct 19, 2022
    my comment was purely to explain why I sort the numbers first and can then increment through that sorted list instead of checking every combination. i.e. only check N-5 (i.e. 5) instead of 252 cases.
    As for the answers, it pointed out an error in my formula (i.e. it was offset by 1 and didn't check the first entry correctly) so I edited the above post to have the correction.
    That said, my formula finds the first set with the minimum SD while Hans' find a correct set but not necessarily consistent which.

Forum Discussion

Selecting Data with Minimum Standard Deviation

Resources