| Statistical Tests Can Augment Ambiguous Residual Plots | Although the residual plots discussed on pages 4.4.4.1 and 4.4.4.3 will typically indicate whether any important variables are missing or misspecified in the functional part of the model, a statistical test of the hypothesis that the model is complete may be helpful if they leave any doubt. Although it may seem tempting to use this type of statistical test in place of residual plots, since it apparently assesses the fit of the model objectively, no single test can provide the rich feedback to the user that graphical analysis of the residuals can. Furthermore, while model completeness is one of the most important aspects of model adequacy, this type of test does not address other important aspects of model quality. In statistical jargon, this type of test for model completeness is usually called a "lack-of-fit" test. | ||
| General Strategy | The most common strategy used to test for model completeness is to compare the amount of random variation in the residuals from the data used to fit the model with an estimate of the random variation in the process using data that is independent of the model. If these two estimates of the random variation are similar, that indicates that no significant terms are likely to be missing from the model. If the model-dependent estimate is larger than the model-independent estimate of the random variation, then significant terms probably are missing or misspecified in the functional part of the model. | ||
| Testing Model Completeness Requires Replicate Measurements | The need for a model-independent estimate of the random variation means that replicate measurements made under identical experimental conditions are required to carry out a lack-of-fit test. If no replicate measurements are available, there is no baseline estimate of the random process variation to compare with the results from the model. This is the main reason that the use of replication is emphasized in experiment design. | ||
| Data Used to Fit Model Can Be Partitioned to Compute Lack-of-Fit Statistic |
Although it might seem like two sets of data would be needed to carry out the lack-of-fit test using the strategy
described above, one set of data to fit the model and compute the residual standard deviation
and the other to compute the model-independent estimate of the random variation, that is usually not necessary. In most
regression applications, the same data used to fit the model can also be used to carry out the lack-of-fit test, as long as
the necessary replicate measurements are available. This is because the sample mean and sample standard deviation are
statistically independent of one another if the random errors are drawn from a normal distribution. In these cases, the
lack-of-fit statistic is computed by partitioning the residual standard deviation into two independent estimates of the
random variation in the process. One estimate depends on the model and the sample means of the replicated
sets of data ( |
||
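The partition described above can be written out as a sum-of-squares identity. The notation here is an assumption for illustration and is the one commonly used for lack-of-fit analysis: $y_{ij}$ is the $j$-th replicate response at the $i$-th unique combination of predictor levels, $\bar{y}_i$ is the mean of the $n_i$ replicates at that combination, $\hat{y}_i$ is the fitted value there, $n$ is the total number of observations, $n_u$ is the number of unique combinations, and $p$ is the number of estimated parameters.

```latex
\sum_{i=1}^{n_u} \sum_{j=1}^{n_i} \left( y_{ij} - \hat{y}_i \right)^2
  \;=\;
  \underbrace{\sum_{i=1}^{n_u} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_i \right)^2}_{\text{replicate (model-independent) part, } n - n_u \text{ d.f.}}
  \;+\;
  \underbrace{\sum_{i=1}^{n_u} n_i \left( \bar{y}_i - \hat{y}_i \right)^2}_{\text{model-dependent part, } n_u - p \text{ d.f.}}
```

The cross term vanishes because the deviations $y_{ij} - \bar{y}_i$ sum to zero within each replicate group, so the left-hand side (with $n - p$ degrees of freedom) splits exactly into the two pieces.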
|
Estimating |
The model-independent estimate of ![]() where |
||
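A minimal sketch of a model-independent estimate, assuming the usual pooled replicate standard deviation $\sqrt{\sum_i \sum_j (y_{ij} - \bar{y}_i)^2 / (n - n_u)}$, with $n$ the total number of observations and $n_u$ the number of unique predictor settings. The function name `replicate_sd` and the dict-of-lists input layout are hypothetical choices made for this illustration.

```python
import math


def replicate_sd(groups):
    """Pooled standard deviation of replicate measurements.

    groups: dict mapping each unique predictor setting to the list of
    responses measured at that setting (hypothetical input layout).
    """
    ss_pure = 0.0  # squared deviations of responses from their own group mean
    n = 0          # total number of observations
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        ss_pure += sum((y - mean) ** 2 for y in ys)
        n += len(ys)
    df = n - len(groups)  # n - n_u degrees of freedom
    if df <= 0:
        raise ValueError("no replicate measurements: cannot estimate sigma")
    return math.sqrt(ss_pure / df)
```

Nothing about the fitted model enters this calculation; only the responses and their grouping by predictor setting are used.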
|
Estimating |
Unlike the formula for ![]() (where |
||
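A minimal sketch of a model-dependent estimate, assuming the standard lack-of-fit form $\sqrt{\sum_i n_i (\bar{y}_i - \hat{y}_i)^2 / (n_u - p)}$, where $\bar{y}_i$ is the mean of the $n_i$ replicates at the $i$-th unique predictor setting, $\hat{y}_i$ is the fitted value there, $n_u$ is the number of unique settings, and $p$ is the number of estimated parameters. The names `model_sd` and `predict` are hypothetical.

```python
import math


def model_sd(groups, predict, p):
    """Model-dependent estimate of sigma from replicate-group means.

    groups:  dict mapping predictor setting x -> list of responses at x
    predict: callable returning the fitted model's value at x
    p:       number of parameters estimated when fitting the model
    """
    ss_lof = 0.0  # weighted squared gaps between group means and fitted values
    for x, ys in groups.items():
        mean = sum(ys) / len(ys)
        ss_lof += len(ys) * (mean - predict(x)) ** 2
    df = len(groups) - p  # n_u - p degrees of freedom
    if df <= 0:
        raise ValueError("need more unique predictor settings than parameters")
    return math.sqrt(ss_lof / df)
```

Because the fitted values appear explicitly, a misspecified functional form inflates this quantity, while pure measurement noise within replicate groups is averaged out by the group means.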
| Carrying Out the Test for Lack-of-Fit |
Combining the ideas presented in the previous two paragraphs, following the general strategy outlined
above, the completeness of the functional part of the model
can be assessed by comparing the values of |
||
|
When the model does fit the data, it turns out that the ratio
follows an F distribution. Knowing the probability distribution that describes the behavior of the statistic, |
|||
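The test can be sketched in code as follows. This is an illustration under stated assumptions, not a reference implementation: the ratio is taken to be the squared model-dependent estimate divided by the squared replicate-based estimate, compared against an F distribution with $n_u - p$ and $n - n_u$ degrees of freedom. The function names, the argument names `sigma_m` and `sigma_r`, and the default significance level of 0.05 are all hypothetical. The upper-tail probability is computed with a standard continued-fraction evaluation of the regularized incomplete beta function so that the block needs only the standard library.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_upper_tail(f, d1, d2):
    """P(F > f) for an F distribution with d1 and d2 degrees of freedom."""
    if f <= 0.0:
        return 1.0
    return _reg_inc_beta(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))


def lack_of_fit_test(sigma_m, sigma_r, n, n_u, p, alpha=0.05):
    """Return (ratio, p_value, reject) for the lack-of-fit F test.

    alpha is the user-chosen probability of wrongly rejecting an
    adequate model; 0.05 is only an illustrative default.
    """
    ratio = (sigma_m / sigma_r) ** 2
    p_value = f_upper_tail(ratio, n_u - p, n - n_u)
    return ratio, p_value, p_value < alpha
```

Rejecting only when the upper-tail probability falls below `alpha` is equivalent to comparing the ratio with the corresponding upper-tail cut-off value of the F distribution. Where SciPy is available, `scipy.stats.f.sf(ratio, n_u - p, n - n_u)` returns the same tail probability.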
|
Alternative Formula for |
Although the formula given above more clearly shows the nature of |
||
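One computational shortcut that follows from the sum-of-squares decomposition is $\sqrt{\left[(n - p)\,\hat{\sigma}^2 - (n - n_u)\,\hat{\sigma}_r^2\right] / (n_u - p)}$, where $\hat{\sigma}$ is the residual standard deviation from the fit and $\hat{\sigma}_r$ is the replicate-based estimate. The notation is assumed here, and this may or may not be the exact form intended above. The sketch below checks numerically that the shortcut agrees with the direct calculation for a straight-line fit; the function names and the choice of a straight-line model are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def sigma_m_two_ways(xs, ys):
    """Return the model-dependent estimate computed directly and via the shortcut."""
    b0, b1 = fit_line(xs, ys)
    p = 2  # intercept and slope
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    n_u = len(groups)

    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_pure = 0.0
    ss_lof = 0.0
    for x, g in groups.items():
        mean = sum(g) / len(g)
        ss_pure += sum((y - mean) ** 2 for y in g)
        ss_lof += len(g) * (mean - (b0 + b1 * x)) ** 2

    sigma2 = sse / (n - p)            # squared residual standard deviation
    sigma_r2 = ss_pure / (n - n_u)    # squared replicate-based estimate
    direct = math.sqrt(ss_lof / (n_u - p))
    shortcut = math.sqrt(((n - p) * sigma2 - (n - n_u) * sigma_r2) / (n_u - p))
    return direct, shortcut
```

The appeal of the shortcut is that the residual standard deviation is already reported by most regression software, so only the replicate-based estimate needs to be computed separately. When the model fits very well the difference in the numerator can be close to zero, so rounding error deserves some care.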