4. Process Modeling
4.4. Data Analysis for Process Modeling
4.4.4. How can I tell if a model fits my data?

4.4.4.6.

How can I test whether any significant terms are missing or misspecified in the functional part of the model?

Statistical Tests Can Augment Ambiguous Residual Plots
Although the residual plots discussed on pages 4.4.4.1 and 4.4.4.3 will typically indicate whether any important variables are missing or misspecified in the functional part of the model, a statistical test of the hypothesis that the model is complete may be helpful if they leave any doubt. Although it may seem tempting to use this type of statistical test in place of residual plots, since it apparently assesses the fit of the model objectively, no single test can provide the rich feedback to the user that graphical analysis of the residuals can. Furthermore, while model completeness is one of the most important aspects of model adequacy, this type of test does not address other important aspects of model quality. In statistical jargon, this type of test for model completeness is usually called a "lack-of-fit" test.
General Strategy
The most common strategy used to test for model completeness is to compare the amount of random variation in the residuals from the data used to fit the model with an estimate of the random variation in the process obtained from data that are independent of the model. If these two estimates of the random variation are similar, no significant terms are likely to be missing from the model. If the model-dependent estimate is larger than the model-independent estimate of the random variation, then significant terms probably are missing or misspecified in the functional part of the model.
Testing Model Completeness Requires Replicate Measurements
The need for a model-independent estimate of the random variation means that replicate measurements made under identical experimental conditions are required to carry out a lack-of-fit test. If no replicate measurements are available, then there will not be any baseline estimate of the random process variation to compare with the results from the model. This is the main reason that the use of replication is emphasized in experiment design.
Data Used to Fit Model Can Be Partitioned to Compute Lack-of-Fit Statistic
It might seem that two sets of data would be needed to carry out the lack-of-fit test using the strategy described above: one set to fit the model and compute the residual standard deviation, and another to compute the model-independent estimate of the random variation. That is usually not necessary. In most regression applications, the same data used to fit the model can also be used to carry out the lack-of-fit test, as long as the necessary replicate measurements are available. This is because the sample mean and sample standard deviation are statistically independent of one another if the random errors are drawn from a normal distribution. In these cases, the lack-of-fit statistic is computed by partitioning the residual standard deviation into two independent estimates of the random variation in the process. One estimate depends on the model and the sample means of the replicated sets of data ($\hat{\sigma}_m$), while the other estimate is a pooled standard deviation based on the variation observed in each set of replicated measurements ($\hat{\sigma}_r$). The squares of these two estimates of the random variation are often respectively called the "mean square for lack-of-fit" and the "mean square for pure error" in statistics texts. The notation $\hat{\sigma}_m$ and $\hat{\sigma}_r$ is used here instead to emphasize the fact that, if the model fits the data, these quantities should both be good estimates of $\sigma$.
Estimating $\sigma$ Using Replicate Measurements
The model-independent estimate of $\sigma$ is computed using the formula

$$ \hat{\sigma}_r = \sqrt{\frac{1}{n - n_u} \sum_{i=1}^{n_u} \sum_{j=1}^{n_i} \left[ y_{ij} - \bar{y}_i \right]^2} $$

where $n$ is the sample size of the data set used to fit the model, $n_u$ is the number of unique combinations of predictor variable levels, $n_i$ is the number of replicated observations at the $i$th combination of predictor variable levels, the $y_{ij}$ are the regression responses indexed by their predictor variable levels and number of replicate measurements, and $\bar{y}_i$ is the mean of the responses at the $i$th combination of predictor variable levels. Notice that the formula for $\hat{\sigma}_r$ depends only on the data, and not on the functional part of the model. This shows that $\hat{\sigma}_r$ will be a good estimate of $\sigma$ regardless of whether the model is a complete description of the process or not.
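The pooled replicate calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pure_error_sd` and the six data points are made up for the example and do not come from the handbook.

```python
from collections import defaultdict
from math import sqrt


def pure_error_sd(x, y):
    """Pooled replicate standard deviation (the model-independent estimate).

    x holds the predictor level of each observation; observations that share
    the same x value are treated as replicates of one another.
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)

    n = len(y)          # total number of observations
    n_u = len(groups)   # number of unique predictor combinations
    if n == n_u:
        raise ValueError("no replicate measurements, so the estimate is undefined")

    # Sum of squared deviations of each response from its own group mean.
    ss_pure_error = 0.0
    for ys in groups.values():
        y_bar = sum(ys) / len(ys)
        ss_pure_error += sum((v - y_bar) ** 2 for v in ys)

    return sqrt(ss_pure_error / (n - n_u))


# Made-up data: three predictor levels, two replicates at each level.
x = [1, 1, 2, 2, 3, 3]
y = [1.0, 3.0, 4.0, 6.0, 5.0, 9.0]
sigma_r = pure_error_sd(x, y)
print(sigma_r)
```

Note that no model appears anywhere in the function: only the grouping of the responses by predictor level is used.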
Estimating $\sigma$ Using the Model
Unlike the formula for $\hat{\sigma}_r$, the formula for $\hat{\sigma}_m$

$$ \hat{\sigma}_m = \sqrt{\frac{1}{n_u - p} \sum_{i=1}^{n_u} n_i \left[ \bar{y}_i - f(\vec{x}_i;\hat{\vec{\beta}}) \right]^2} $$

(where $p$ is the number of unknown parameters in the model) does depend on the functional part of the model. If the model is correct, then the value of the function will be a good estimate of the mean value of the response for every combination of predictor variable values. When the function provides good estimates of $\bar{y}_i$, then $\hat{\sigma}_m$ should be close in value to $\hat{\sigma}_r$ and should also be a good estimate of $\sigma$. If, on the other hand, the function is missing any important terms (within the range of the data) or if any terms are misspecified, then the function will provide a poor estimate of $\bar{y}_i$ for some combinations of the predictors and $\hat{\sigma}_m$ will tend to be greater than $\hat{\sigma}_r$.
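As a rough sketch of the model-dependent estimate, the Python below fits a straight line by ordinary least squares and then compares the fitted values with the replicate means. The straight-line model, the helper names `fit_line` and `lack_of_fit_sd`, and the data are all assumptions chosen for illustration; any fitted function and its parameter count could be passed in instead.

```python
from collections import defaultdict
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1*x (closed form)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def lack_of_fit_sd(x, y, predict, p):
    """Model-dependent estimate of sigma from replicate means and fitted values.

    predict maps a predictor level to the fitted value; p is the number of
    unknown parameters in the fitted model.
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)

    n_u = len(groups)
    if n_u <= p:
        raise ValueError("need more unique predictor levels than parameters")

    # Each squared gap between a group mean and the fit is weighted by n_i.
    ss_lack_of_fit = sum(
        len(ys) * (sum(ys) / len(ys) - predict(xi)) ** 2
        for xi, ys in groups.items()
    )
    return sqrt(ss_lack_of_fit / (n_u - p))


# Made-up data: three predictor levels, two replicates at each level.
x = [1, 1, 2, 2, 3, 3]
y = [1.0, 3.0, 4.0, 6.0, 5.0, 9.0]
b0, b1 = fit_line(x, y)
sigma_m = lack_of_fit_sd(x, y, lambda xi: b0 + b1 * xi, p=2)
print(b0, b1, sigma_m)
```

For this made-up data the replicate means lie close to the fitted line, so the model-dependent estimate comes out small relative to the scatter among the replicates.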
Carrying Out the Test for Lack-of-Fit
Combining the ideas presented in the previous two paragraphs, and following the general strategy outlined above, the completeness of the functional part of the model can be assessed by comparing the values of $\hat{\sigma}_m$ and $\hat{\sigma}_r$. If $\hat{\sigma}_m > \hat{\sigma}_r$, then one or more important terms may be missing or misspecified in the functional part of the model. Because of the random error in the data, however, we know that $\hat{\sigma}_m$ will sometimes be larger than $\hat{\sigma}_r$ even though the model is complete. To make sure that the hypothesis that the model is complete is not rejected by chance, it is necessary to understand how much greater $\hat{\sigma}_m$ might typically be than $\hat{\sigma}_r$ when the model does fit the data. Then the hypothesis can be rejected only when $\hat{\sigma}_m$ is significantly greater than $\hat{\sigma}_r$.
When the model does fit the data, it turns out that the ratio

$$ L = \frac{\hat{\sigma}_m^2}{\hat{\sigma}_r^2} $$

follows an F distribution. Knowing the probability distribution that describes the behavior of the statistic, $L$, we can control the probability of rejecting the hypothesis that the model is complete in cases when the model actually is complete. The hypothesis that the model is complete is rejected only when $L$ is greater than an upper-tail cut-off value from the F distribution, chosen for a user-specified probability of wrongly rejecting the hypothesis. This gives us a precise, objective, probabilistic definition of when $\hat{\sigma}_m$ is significantly greater than $\hat{\sigma}_r$. The user-specified probability used to obtain the cut-off value from the F distribution is called the "significance level" of the test. The significance level for most statistical tests is denoted by $\alpha$. The most commonly used value for the significance level is $\alpha = 0.05$, which means that the hypothesis of a complete model will only be rejected in 5% of tests for which the model really is complete. Cut-off values can be computed using most statistical software or from tables of the F distribution.

In addition to the significance level, the cut-off value depends on the degrees of freedom associated with each of the two estimates of $\sigma$, which index the F distribution. $\hat{\sigma}_m$, which appears in the numerator of $L$, has $n_u - p$ degrees of freedom. $\hat{\sigma}_r$, which appears in the denominator of $L$, has $n - n_u$ degrees of freedom.
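The test itself might be carried out along the following lines. This sketch assumes SciPy is available and uses its `scipy.stats.f` distribution for the cut-off value; the function name `lack_of_fit_test` and the input numbers are hypothetical, chosen only to illustrate the calculation.

```python
from scipy.stats import f


def lack_of_fit_test(sigma_m, sigma_r, n, n_u, p, alpha=0.05):
    """Compare the two estimates of sigma with an upper-tail F test.

    Returns the statistic, the cut-off value, the p-value, and whether the
    hypothesis that the model is complete is rejected at level alpha.
    """
    df_num = n_u - p   # degrees of freedom of the model-dependent estimate
    df_den = n - n_u   # degrees of freedom of the replicate-based estimate
    if df_num <= 0 or df_den <= 0:
        raise ValueError("both degrees of freedom must be positive")

    statistic = sigma_m ** 2 / sigma_r ** 2
    cutoff = float(f.ppf(1.0 - alpha, df_num, df_den))   # upper-tail cut-off
    p_value = float(f.sf(statistic, df_num, df_den))     # upper-tail probability
    return statistic, cutoff, p_value, bool(statistic > cutoff)


# Hypothetical inputs: 6 observations at 3 unique predictor levels,
# a 2-parameter model, and the two estimates of sigma shown below.
L, cutoff, p_value, reject = lack_of_fit_test(
    sigma_m=(1.0 / 3.0) ** 0.5, sigma_r=2.0, n=6, n_u=3, p=2
)
print(L, cutoff, p_value, reject)
```

With these hypothetical inputs the statistic is far below the cut-off, so the hypothesis that the model is complete would not be rejected.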
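Alternative Formula for $\hat{\sigma}_m$
Although the formula given above more clearly shows the nature of $\hat{\sigma}_m$, the numerically equivalent formula below is easier to use in computations:

$$ \hat{\sigma}_m = \sqrt{\frac{(n - p)\,\hat{\sigma}^2 - (n - n_u)\,\hat{\sigma}_r^2}{n_u - p}} $$

where $\hat{\sigma}$ is the residual standard deviation from the fit of the model.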
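A minimal sketch of this shortcut in Python follows. It rests on the partition of the residual sum of squares into a lack-of-fit part and a pure-error part; the function name `sigma_m_from_residual_sd` and the numerical inputs are hypothetical and are used only to show the arithmetic.

```python
from math import sqrt


def sigma_m_from_residual_sd(sigma_hat, sigma_r, n, n_u, p):
    """Model-dependent estimate of sigma obtained by subtraction.

    sigma_hat is the residual standard deviation of the fit (n - p degrees
    of freedom) and sigma_r is the pooled replicate standard deviation
    (n - n_u degrees of freedom).
    """
    ss_residual = (n - p) * sigma_hat ** 2
    ss_pure_error = (n - n_u) * sigma_r ** 2
    # Guard against a tiny negative difference caused by rounding.
    ss_lack_of_fit = max(ss_residual - ss_pure_error, 0.0)
    return sqrt(ss_lack_of_fit / (n_u - p))


# Hypothetical inputs: 6 observations, 3 unique predictor levels, 2 parameters.
sigma_hat = sqrt(37.0 / 12.0)   # residual standard deviation of the fit
sigma_r = 2.0                   # pooled replicate standard deviation
sigma_m = sigma_m_from_residual_sd(sigma_hat, sigma_r, n=6, n_u=3, p=2)
print(sigma_m)
```

The convenience of this route is that most regression software already reports the residual standard deviation, so only the replicate-based estimate has to be computed separately.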
