Forum Discussion

alvaro037's avatar
alvaro037
Copper Contributor
Oct 19, 2022

Selecting Data with Minimum Standard Deviation

Hi everyone, 

I'm working on a project but the data collected is more than needed, so I need to choose data with the minimum deviation or least error. 

Example : 

Thickness of material A with 10x measurement is 

A  = 8.7  8.9  9.0  9.5  8.4  8.1  8.6  8.7  8.9  9.1

From this I need to select 5 of them with minimum deviation, its AVG and show the data selected. Is there any formula for the selection or any approaches should I take?

 

Thank you for the attention

  • alvaro037 

    In the attached version, I have implemented my VBA algorithm (with StDev_P instead of StDev_S) as a custom function. It is used in an array formula confirmed with Ctrl+Shift+Enter to return the selected values.

  • JKPieterse's avatar
    JKPieterse
    Silver Contributor
    Can you attach a sample file with some of your data (anonymized)?
    • JoeUser2004's avatar
      JoeUser2004
      Bronze Contributor

      HansVogelaar 

       

      Use StDev_P and STDEV.P (STDEVP), not StDev_S and STDEV.S (STDEV).

       

      StDev_P returns the actual std dev of a set of numbers.

       

      Use StDev_S when the set of numbers represents a sampling from a larger population, and we want to estimate the std dev of the population, not the subset per se.

       

      That does not apply here (``choose data with the minimum deviation``)

  • JoeUser2004's avatar
    JoeUser2004
    Bronze Contributor

    alvaro037 

     

    We might note that HansVogelaar's algorithm always finds (one of) the optimal solutions.  And it is quick because COMBIN(10,5) is only 252.

     

    In contrast, the equivalent Solver model does not find the best solution if we use the GRG Nonlinear method.  The Evolutionary (random) method does stumble upon (one of) the optimal solution.  But it takes a long time (although we can change that using Options).

     

    OTOH, the benefits of using Solver is:  (a) it does not require VBA coding; and (b) it adapts easily to different subset sizes.

     

    (We could easily modify HansVogelaar's algorithm to adapt to different "population" sizes.)

    • mtarler's avatar
      mtarler
      Silver Contributor

      alvaro037  Here is an alternate solution.  I reduce the number of trials because we know std will be lowest for numbers close together so I sort the numbers and then increment through the list:

       

       

       

      =LET(in,B2:K2,
      n,5, 
      sortin, SORT(TRANSPOSE(in)), 
      start, SEQUENCE(ROWS(sortin)-n),
      stdlist, SCAN(0,start,LAMBDA(p,i,STDEV.P(INDEX(sortin,start+i-1)))),
      findLow,MATCH(MIN(stdlist),stdlist,0),
      INDEX(sortin,start+findLow-1))

       

       

       

      see attached

      EDIT:

      a) please note my SD calculation used STD.P per Joe's suggestion above while Hans' uses STD.S and my solution found a set starting with 8.6-8.9 and Hans' found 8.7-9.0 but the SD is the same in each case.

      b) here is an alternate version of the formula that stacks/includes the AVG and SD:

       

      =LET(in,B2:K2,
      n,5,
      sortin, SORT(TRANSPOSE(in)),
      start, SEQUENCE(ROWS(sortin)-n),
      stdlist, SCAN(0,start,LAMBDA(p,i,STDEV.P(INDEX(sortin,start+i-1)))),
      findLow,MATCH(MIN(stdlist),stdlist,0),
      ans,INDEX(sortin,start+findLow-1),
      out,VSTACK(ans,"Avg",AVERAGE(ans),"SD",STDEV.P(ans)),
      out)

       

      I prefer to have them separate as in the attached file but that is up to you

      • JoeUser2004's avatar
        JoeUser2004
        Bronze Contributor

        mtarler  wrote:  ``we know std will be lowest for numbers close together``

         

        My version of Excel does not support those features.

         

        Which subset of 5 is chosen with the following set of data:

         

        1.1
        1.3
        1.5
        1.7
        1.9
        100.1
        100.2
        100.3
        100.4
        100.5

         

        The correct subset is 100.1 through 100.5 (sd = 0.141421356237312), not 1.1 through 1.9 (sd = 0.282842712474619).

         

        And just for fun, which subset is chosen if the first 5 numbers are 1.1, 1.2, 1.3, 1.4 and 1.5?

         

        In that case, the sd is the same.

    • mtarler's avatar
      mtarler
      Silver Contributor
      I tried to paste formula but I think there is problem because of me using english terms for the functions and don't know the correct translation.
      2 changes are
      a) change the range for 'in' to be row 3
      b) using the 2nd formula above that has both make the last line: TRANSPOSE(out)
    • HansVogelaar's avatar
      HansVogelaar
      MVP

      alvaro037 

      In the attached version, I have implemented my VBA algorithm (with StDev_P instead of StDev_S) as a custom function. It is used in an array formula confirmed with Ctrl+Shift+Enter to return the selected values.

Resources