1. Exploratory Data Analysis
1.4. EDA Case Studies
1.4.2. Case Studies
1.4.2.1. Normal Random Numbers

1.4.2.1.3. Quantitative Output and Interpretation

Summary Statistics

As a first step in the analysis, a table of summary statistics is computed from the data. The following table, generated by Dataplot, shows a typical set of statistics.
  
                                 SUMMARY
  
                      NUMBER OF OBSERVATIONS =      500
  
  
 ***********************************************************************
 *        LOCATION MEASURES         *       DISPERSION MEASURES        *
 ***********************************************************************
 *  MIDRANGE     =   0.3945000E+00  *  RANGE        =   0.6083000E+01  *
 *  MEAN         =  -0.2935997E-02  *  STAND. DEV.  =   0.1021041E+01  *
 *  MIDMEAN      =   0.1623600E-01  *  AV. AB. DEV. =   0.8174360E+00  *
 *  MEDIAN       =  -0.9300000E-01  *  MINIMUM      =  -0.2647000E+01  *
 *               =                  *  LOWER QUART. =  -0.7204999E+00  *
 *               =                  *  LOWER HINGE  =  -0.7210000E+00  *
 *               =                  *  UPPER HINGE  =   0.6455001E+00  *
 *               =                  *  UPPER QUART. =   0.6447501E+00  *
 *               =                  *  MAXIMUM      =   0.3436000E+01  *
 ***********************************************************************
 *       RANDOMNESS MEASURES        *     DISTRIBUTIONAL MEASURES      *
 ***********************************************************************
 *  AUTOCO COEF  =   0.4505888E-01  *  ST. 3RD MOM. =   0.3072273E+00  *
 *               =   0.0000000E+00  *  ST. 4TH MOM. =   0.2990314E+01  *
 *               =   0.0000000E+00  *  ST. WILK-SHA =   0.7515639E+01  *
 *               =                  *  UNIFORM PPCC =   0.9756625E+00  *
 *               =                  *  NORMAL  PPCC =   0.9961721E+00  *
 *               =                  *  TUK -.5 PPCC =   0.8366451E+00  *
 *               =                  *  CAUCHY  PPCC =   0.4922674E+00  *
 ***********************************************************************
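Several entries in the table can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `summary` is hypothetical, and treating AV. AB. DEV. as the average absolute deviation about the mean is an assumption, not something the Dataplot output states.

```python
import statistics


def summary(values):
    """Return a few of the location and dispersion measures shown in the table."""
    n = len(values)
    lo, hi = min(values), max(values)
    mean = statistics.fmean(values)
    return {
        "midrange": (lo + hi) / 2,
        "mean": mean,
        "median": statistics.median(values),
        "range": hi - lo,
        "stand_dev": statistics.stdev(values),  # n - 1 denominator
        # Assumption: average absolute deviation is taken about the mean.
        "av_ab_dev": sum(abs(v - mean) for v in values) / n,
    }


# Consistency check using only MINIMUM and MAXIMUM from the printed table.
minimum, maximum = -2.647, 3.436
midrange = (minimum + maximum) / 2
data_range = maximum - minimum
print(midrange, data_range)
```

The midrange and range computed from the printed minimum and maximum agree with the MIDRANGE and RANGE entries of the table.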
  
  
Location

One way to quantify a change in location over time is to fit a straight line to the data set, using the index variable X = 1, 2, ..., N as the predictor, where N is the number of observations. If there is no significant drift in the location, the estimated slope should not differ significantly from zero. For this data set, Dataplot generated the following output:
  
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N       =      500
NUMBER OF VARIABLES =        1
NO REPLICATION CASE
 
 
        PARAMETER ESTIMATES           (APPROX. ST. DEV.)    T VALUE
 1  A0                  0.699127E-02   (0.9155E-01)        0.7636E-01
 2  A1       X         -0.396298E-04   (0.3167E-03)        -0.1251
 
RESIDUAL    STANDARD DEVIATION =         1.02205
RESIDUAL    DEGREES OF FREEDOM =         498
  
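The slope parameter, A1, has a t value of -0.13, which is not statistically significant (its absolute value is far below the 5% critical value of about 1.96). There is therefore no evidence of drift in location, and the slope can be treated as zero.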
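The drift check can be sketched as an ordinary least squares fit of the data against the index 1, 2, ..., N. The function name `drift_fit` is hypothetical and the code is a plain-Python illustration, not Dataplot's implementation. The last lines recompute the t value and the slope's standard deviation from the numbers printed in the fit output.

```python
import math


def drift_fit(y):
    """Fit y = a0 + a1*x with x = 1..N; return (a0, a1, se_a1, t_a1)."""
    n = len(y)
    x = range(1, n + 1)
    x_bar = (n + 1) / 2
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    rss = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(rss / (n - 2))  # residual standard deviation
    se_a1 = resid_sd / math.sqrt(sxx)    # approximate st. dev. of the slope
    return a0, a1, se_a1, a1 / se_a1


# Check against the printed output: t value = estimate / standard deviation.
t_printed = -0.396298e-4 / 0.3167e-3
# For x = 1..N the sum of squares about the mean is N(N^2 - 1)/12.
se_printed = 1.02205 / math.sqrt(500 * (500 ** 2 - 1) / 12)
print(round(t_printed, 4), se_printed)
```

Both recomputed values agree with the T VALUE and APPROX. ST. DEV. columns for A1 to the precision shown.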
Variation

One simple way to detect a change in variation is to apply Bartlett's test after dividing the data set into several equal-sized intervals. The choice of the number of intervals is somewhat arbitrary, although values of 4 or 8 are reasonable. Dataplot generated the following output for the Bartlett test.
               BARTLETT TEST
           (STANDARD DEFINITION)
 NULL HYPOTHESIS UNDER TEST--ALL SIGMA(I) ARE EQUAL
  
 TEST:
    DEGREES OF FREEDOM          =    3.000000
  
    TEST STATISTIC VALUE        =    2.373660
    CUTOFF: 95% PERCENT POINT   =    7.814727
    CUTOFF: 99% PERCENT POINT   =    11.34487
  
    CHI-SQUARE CDF VALUE        =    0.501443
  
   NULL          NULL HYPOTHESIS        NULL HYPOTHESIS
   HYPOTHESIS    ACCEPTANCE INTERVAL    CONCLUSION
 ALL SIGMA EQUAL    (0.000,0.950)         ACCEPT
  
In this case, the test statistic of 2.37 is well below the 95% critical value of 7.81, so the Bartlett test indicates that the standard deviations in the 4 intervals are not significantly different.
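As an illustration of the procedure, the sketch below splits a series into equal-sized intervals and computes the standard form of Bartlett's statistic. The helper names `split_equal` and `bartlett_statistic` are hypothetical, and the code is not taken from Dataplot. The critical value is copied from the output above rather than computed.

```python
import math
import statistics


def split_equal(values, k):
    """Split values into k consecutive intervals of equal size (remainder dropped)."""
    size = len(values) // k
    return [values[i * size:(i + 1) * size] for i in range(k)]


def bartlett_statistic(groups):
    """Bartlett's test statistic for equality of variances across groups."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [statistics.variance(g) for g in groups]
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)) / (3 * (k - 1))
    return numerator / correction


# Decision rule with the values printed by Dataplot (3 degrees of freedom).
CUTOFF_95 = 7.814727
printed_statistic = 2.373660
variances_differ = printed_statistic > CUTOFF_95
print(variances_differ)
```

Under the null hypothesis the statistic is approximately chi-square distributed with k - 1 degrees of freedom, which is why 4 intervals give the 3 degrees of freedom shown in the output.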
Randomness

There are many ways in which data can be non-random. However, most common forms of non-randomness can be detected with a few simple tests. The lag plot in the 4-plot above is one simple graphical technique.

Another check is an autocorrelation plot, which shows the autocorrelations for various lags. Confidence bands can be plotted at the 95% and 99% confidence levels; points outside these bands indicate statistically significant autocorrelations (the lag 0 autocorrelation is always 1). Dataplot generated the following autocorrelation plot.

The lag 1 autocorrelation, which is generally the one of most interest, is 0.045. The critical values at the 5% significance level are -0.087 and 0.087. Thus, since 0.045 is in the interval, the lag 1 autocorrelation is not statistically significant, so there is no evidence of non-randomness.
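A minimal sketch of the lag 1 check follows. The function names are hypothetical, and the band half-width 1.96/sqrt(N) is an assumption about how the critical values were obtained; for N = 500 it gives about 0.0877, close to the quoted 0.087.

```python
import math


def lag_autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    numer = sum((y[i] - mean) * (y[i + lag] - mean) for i in range(n - lag))
    return numer / denom


def white_noise_bound(n, z=1.96):
    """Approximate 95% band half-width for the autocorrelation of random data."""
    return z / math.sqrt(n)


bound = white_noise_bound(500)
lag1_printed = 0.045059  # AUTOCO COEF from the summary table
is_significant = abs(lag1_printed) > bound
print(round(bound, 4), is_significant)
```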

A common test for randomness is the runs test.

                    RUNS UP
         STATISTIC = NUMBER OF RUNS UP
             OF LENGTH EXACTLY I
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1        98.0    104.2083     10.2792       -0.60
 2        43.0     45.7167      5.2996       -0.51
 3        13.0     13.1292      3.2297       -0.04
 4         6.0      2.8563      1.6351        1.92
 5         1.0      0.5037      0.7045        0.70
 6         0.0      0.0749      0.2733       -0.27
 7         0.0      0.0097      0.0982       -0.10
 8         0.0      0.0011      0.0331       -0.03
 9         0.0      0.0001      0.0106       -0.01
10         0.0      0.0000      0.0032        0.00
         STATISTIC = NUMBER OF RUNS UP
             OF LENGTH I OR MORE
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       161.0    166.5000      6.6546       -0.83
 2        63.0     62.2917      4.4454        0.16
 3        20.0     16.5750      3.4338        1.00
 4         7.0      3.4458      1.7786        2.00
 5         1.0      0.5895      0.7609        0.54
 6         0.0      0.0858      0.2924       -0.29
 7         0.0      0.0109      0.1042       -0.10
 8         0.0      0.0012      0.0349       -0.03
 9         0.0      0.0001      0.0111       -0.01
10         0.0      0.0000      0.0034        0.00
                   RUNS DOWN
         STATISTIC = NUMBER OF RUNS DOWN
             OF LENGTH EXACTLY I
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1        91.0    104.2083     10.2792       -1.28
 2        55.0     45.7167      5.2996        1.75
 3        14.0     13.1292      3.2297        0.27
 4         1.0      2.8563      1.6351       -1.14
 5         0.0      0.5037      0.7045       -0.71
 6         0.0      0.0749      0.2733       -0.27
 7         0.0      0.0097      0.0982       -0.10
 8         0.0      0.0011      0.0331       -0.03
 9         0.0      0.0001      0.0106       -0.01
10         0.0      0.0000      0.0032        0.00
         STATISTIC = NUMBER OF RUNS DOWN
             OF LENGTH I OR MORE
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       161.0    166.5000      6.6546       -0.83
 2        70.0     62.2917      4.4454        1.73
 3        15.0     16.5750      3.4338       -0.46
 4         1.0      3.4458      1.7786       -1.38
 5         0.0      0.5895      0.7609       -0.77
 6         0.0      0.0858      0.2924       -0.29
 7         0.0      0.0109      0.1042       -0.10
 8         0.0      0.0012      0.0349       -0.03
 9         0.0      0.0001      0.0111       -0.01
10         0.0      0.0000      0.0034        0.00
         RUNS TOTAL = RUNS UP + RUNS DOWN
       STATISTIC = NUMBER OF RUNS TOTAL
            OF LENGTH EXACTLY I
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       189.0    208.4167     14.5370       -1.34
 2        98.0     91.4333      7.4947        0.88
 3        27.0     26.2583      4.5674        0.16
 4         7.0      5.7127      2.3123        0.56
 5         1.0      1.0074      0.9963       -0.01
 6         0.0      0.1498      0.3866       -0.39
 7         0.0      0.0193      0.1389       -0.14
 8         0.0      0.0022      0.0468       -0.05
 9         0.0      0.0002      0.0150       -0.01
10         0.0      0.0000      0.0045        0.00
       STATISTIC = NUMBER OF RUNS TOTAL
             OF LENGTH I OR MORE
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       322.0    333.0000      9.4110       -1.17
 2       133.0    124.5833      6.2868        1.34
 3        35.0     33.1500      4.8561        0.38
 4         8.0      6.8917      2.5154        0.44
 5         1.0      1.1790      1.0761       -0.17
 6         0.0      0.1716      0.4136       -0.41
 7         0.0      0.0217      0.1474       -0.15
 8         0.0      0.0024      0.0494       -0.05
 9         0.0      0.0002      0.0157       -0.02
10         0.0      0.0000      0.0047        0.00
        LENGTH OF THE LONGEST RUN UP         =     5
        LENGTH OF THE LONGEST RUN DOWN       =     4
        LENGTH OF THE LONGEST RUN UP OR DOWN =     5
  
        NUMBER OF POSITIVE DIFFERENCES =   252
        NUMBER OF NEGATIVE DIFFERENCES =   247
        NUMBER OF ZERO     DIFFERENCES =     0
  
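Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant at the 5% level. Only one of the tabulated values falls outside this range (Z = 2.00 for runs up of length 4 or more, with a related value of 1.92 for runs up of length exactly 4), and with this many Z values examined an isolated marginal exceedance is expected by chance. The runs test therefore does not indicate any significant non-randomness.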
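The counting behind the runs tables can be sketched as follows. The function names are hypothetical, and skipping tied neighbors is an assumption (the output reports zero such differences for this data set). The mean (2N - 1)/3 and variance (16N - 29)/90 used for the total number of runs reproduce the EXP(STAT) and SD(STAT) entries 333.0000 and 9.4110 for runs of length 1 or more; the expected values for the individual run lengths are not derived here.

```python
import math
from collections import Counter


def run_lengths(y):
    """Return (up, down) Counters mapping run length to number of runs."""
    up, down = Counter(), Counter()
    current_sign, length = 0, 0
    for prev, cur in zip(y, y[1:]):
        sign = (cur > prev) - (cur < prev)
        if sign == 0:
            continue  # assumption: zero differences are skipped
        if sign == current_sign:
            length += 1
        else:
            if length:
                (up if current_sign > 0 else down)[length] += 1
            current_sign, length = sign, 1
    if length:
        (up if current_sign > 0 else down)[length] += 1
    return up, down


def total_runs_z(n_runs, n_obs):
    """Z score for the total number of runs up and down in n_obs observations."""
    expected = (2 * n_obs - 1) / 3
    sd = math.sqrt((16 * n_obs - 29) / 90)
    return (n_runs - expected) / sd


# 322 total runs among 500 observations, as in the last table above.
z_total = total_runs_z(322, 500)
print(round(z_total, 2))
```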
Distributional Analysis

Probability plots are a graphical technique for assessing whether a particular distribution provides an adequate fit to a data set.

A quantitative enhancement to the probability plot is the correlation coefficient of the points on the probability plot. For this data set the correlation coefficient is 0.996. Since this is greater than the tabulated critical value of 0.987, the normality assumption is not rejected.
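One way to compute such a correlation coefficient is sketched below. It assumes Filliben's approximation to the uniform order statistic medians for the plotting positions, which may differ in detail from what Dataplot uses; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def normal_ppcc(values):
    """Correlation between the sorted data and normal order statistic medians."""
    n = len(values)
    ordered = sorted(values)
    # Assumption: Filliben's approximation for the uniform order statistic medians.
    medians = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    medians[-1] = 0.5 ** (1 / n)
    medians[0] = 1 - medians[-1]
    quantiles = [NormalDist().inv_cdf(m) for m in medians]
    return pearson(ordered, quantiles)


# Decision with the values quoted in the text.
CRITICAL_VALUE = 0.987
normality_rejected = 0.996 < CRITICAL_VALUE
print(normality_rejected)
```

A coefficient near 1 means the points on the probability plot lie close to a straight line.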

Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for assessing distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests can be used to test for normality. Dataplot generates the following output for the Anderson-Darling normality test.

               ANDERSON-DARLING 1-SAMPLE TEST
               THAT THE DATA CAME FROM A NORMAL DISTRIBUTION
  
 1. STATISTICS:
       NUMBER OF OBSERVATIONS                =      500
       MEAN                                  =  -0.2935997E-02
       STANDARD DEVIATION                    =    1.021041
  
       ANDERSON-DARLING TEST STATISTIC VALUE =    1.061249
       ADJUSTED TEST STATISTIC VALUE         =    1.069633
  
 2. CRITICAL VALUES:
       90         % POINT    =   0.6560000
       95         % POINT    =   0.7870000
       97.5       % POINT    =   0.9180000
       99         % POINT    =    1.092000
  
 3. CONCLUSION (AT THE 5% LEVEL):
       THE DATA DO NOT COME FROM A NORMAL DISTRIBUTION.
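The adjusted test statistic of 1.07 exceeds the 95% critical value of 0.787 but not the 99% critical value of 1.092, so the Anderson-Darling test rejects the normality assumption at the 5% level but does not reject it at the 1% level.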
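The following sketch shows how an Anderson-Darling statistic for normality can be computed and compared with the critical values above. The function names are hypothetical. The adjustment factor 1 + 4/N - 25/N^2 is an assumption: it is not stated in the output, but it reproduces the printed adjusted value from the printed unadjusted one.

```python
import math
from statistics import NormalDist, fmean, stdev


def anderson_darling_normal(values):
    """Anderson-Darling statistic with mean and standard deviation estimated."""
    n = len(values)
    mean, sd = fmean(values), stdev(values)
    std_normal = NormalDist()
    cdf = [std_normal.cdf((v - mean) / sd) for v in sorted(values)]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


def adjusted(a2, n):
    """Small-sample adjustment (assumed form, see the note above)."""
    return a2 * (1 + 4 / n - 25 / n ** 2)


# Critical values copied from the Dataplot output.
CRITICAL = {0.10: 0.656, 0.05: 0.787, 0.025: 0.918, 0.01: 1.092}
adj_printed = adjusted(1.061249, 500)
rejected = {alpha: adj_printed > cv for alpha, cv in CRITICAL.items()}
print(round(adj_printed, 6), rejected)
```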
Outlier Analysis

A test for outliers is the Grubbs test. Dataplot generated the following output for Grubbs' test.
               GRUBBS TEST FOR OUTLIERS
               (ASSUMPTION: NORMALITY)
  
 1. STATISTICS:
       NUMBER OF OBSERVATIONS      =      500
       MINIMUM                     =   -2.647000
       MEAN                        =  -0.2935997E-02
       MAXIMUM                     =    3.436000
       STANDARD DEVIATION          =    1.021041
  
       GRUBBS TEST STATISTIC       =    3.368068
  
 2. PERCENT POINTS OF THE REFERENCE DISTRIBUTION
    FOR GRUBBS TEST STATISTIC
       0          % POINT    =   0.0000000E+00
       50         % POINT    =    3.274338
       75         % POINT    =    3.461431
       90         % POINT    =    3.695134
       95         % POINT    =    3.863087
       99         % POINT    =    4.228033
  
 3. CONCLUSION (AT THE 5% LEVEL):
       THERE ARE NO OUTLIERS.
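For this data set, the Grubbs test statistic of 3.37 is below even the 75% point of the reference distribution (3.46), so Grubbs' test does not detect any outliers at the 25%, 10%, 5%, or 1% significance level.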
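The Grubbs statistic is the largest absolute deviation from the sample mean, divided by the sample standard deviation. A short sketch, with a hypothetical function name, followed by a check against the values printed in the output:

```python
from statistics import fmean, stdev


def grubbs_statistic(values):
    """Largest absolute deviation from the mean in units of the sample SD."""
    mean, sd = fmean(values), stdev(values)
    return max(abs(v - mean) for v in values) / sd


# Recompute the statistic from the printed minimum, mean, maximum, and SD.
minimum, maximum = -2.647, 3.436
mean, sd = -0.2935997e-2, 1.021041
g = max(maximum - mean, mean - minimum) / sd
print(round(g, 6))
```

Here the maximum lies farther from the mean than the minimum does, so the maximum is the point being tested.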
Model

Since the underlying assumptions were validated both graphically and analytically, we conclude that a reasonable model for the data is:
    Yi = C + Ei
where the constant C is estimated by the sample mean, giving
    Yi = -0.00294 + Ei
We can express the uncertainty for C as the 95% confidence interval (-0.09266,0.086779).
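The interval can be reproduced approximately from the summary statistics as the mean plus or minus a t quantile times the standard deviation of the mean. In the sketch below the quantile 1.9647 (two-sided 95%, 499 degrees of freedom) is an assumed value that does not appear in the source.

```python
import math

n = 500
mean = -0.2935997e-2
sd = 1.021041
T_975_DF499 = 1.9647  # assumed t quantile, not printed in the source

sd_of_mean = sd / math.sqrt(n)
half_width = T_975_DF499 * sd_of_mean
lower, upper = mean - half_width, mean + half_width
print(round(sd_of_mean, 6), round(lower, 5), round(upper, 5))
```

The result agrees with the quoted interval to about four decimal places, and the interval contains zero, which is consistent with data generated with a true mean of zero.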
Univariate Report

It is sometimes useful and convenient to summarize the above results in a report. The report for the 500 normal random numbers follows.
 Analysis for 500 normal random numbers
  
 1: Sample Size                           = 500
  
 2: Location
    Mean                                  = -0.00294
    Standard Deviation of Mean            = 0.045663
    95% Confidence Interval for Mean      = (-0.09266,0.086779)
    Drift with respect to location?       = NO
  
 3: Variation
    Standard Deviation                    = 1.021042
    95% Confidence Interval for SD        = (0.961437,1.088585)
    Drift with respect to variation?
    (based on Bartletts test on quarters
    of the data)                          = NO
  
 4: Distribution
    Normal PPCC                           = 0.996173
    Data are Normal?
      (as measured by Normal PPCC)        = YES
  
 5: Randomness
    Autocorrelation                       = 0.045059
    Data are Random?
      (as measured by autocorrelation)    = YES
  
 6: Statistical Control
    (i.e., no drift in location or scale,
    data are random, distribution is 
    fixed, here we are testing only for
    fixed normal)
    Data Set is in Statistical Control?   = YES
  
 7: Outliers?
    (as determined by Grubbs' test)       = NO