|
4.
Process Modeling
4.4. Data Analysis for Process Modeling 4.4.4. How can I tell if a model fits my data?
|
|||
| Unnecessary Terms in the Model Affect Inferences | Models that are generally correct in form but that include extra, unnecessary terms are said to "over-fit" the data. The term over-fitting is used because the extra terms make the model more flexible than it should be, allowing it to fit some of the random variation in the data as if it were deterministic structure. Because the parameters for any unnecessary terms usually have estimated values near zero, it may seem that leaving them in the model will not hurt anything. Indeed, one or two extra terms usually have little negative impact. However, if enough extra terms are left in the model, the consequences can be serious. Among other things, including unnecessary terms causes the uncertainties associated with the model to be underestimated, which can lead to incorrect scientific or engineering conclusions being drawn from the analysis of the data. | ||
| Empirical and Local Models Most Prone to Over-fitting the Data | Over-fitting is especially likely to occur when developing purely empirical models for processes for which there is no external understanding of how much of the total variation in the data might be systematic and how much is random. It also happens more frequently when using regression methods that fit the data locally instead of using an explicitly specified function to describe the structure in the data. Explicit functions are usually relatively simple and have few terms. It is usually difficult to know how to specify an explicit function that fits the noise in the data, since noise will not typically display much structure. This is why over-fitting is not usually a problem with models based on explicit functions. Local models, on the other hand, can easily be made to fit very complex patterns, so they will find apparent structure in process noise if they are applied without care. | ||
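The tendency of local fits to absorb noise can be illustrated with a small simulation. The sketch below is not taken from the handbook: the data are simulated, the 3-point weighted moving average is only a crude stand-in for a local regression method, and all names and constants are assumptions chosen for illustration. It compares the residual standard deviation from a straight-line fit with the naive residual standard deviation from the local smoother when the true noise standard deviation is 1.

```python
import math
import random

random.seed(0)  # fixed seed so the illustration is repeatable

TRUE_SD = 1.0   # assumed noise level for the simulated process
n = 200
x = [i / 10 for i in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, TRUE_SD) for xi in x]

# Explicit model: ordinary least-squares straight line.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
line_resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
line_sd = math.sqrt(sum(r * r for r in line_resid) / (n - 2))

# Local model (crude stand-in): weighted average of each interior point
# and its two neighbours, with half of the weight on the point itself.
local_resid = [
    y[i] - (0.25 * y[i - 1] + 0.5 * y[i] + 0.25 * y[i + 1])
    for i in range(1, n - 1)
]
# Naive estimate: no adjustment for the flexibility of the smoother.
local_sd = math.sqrt(sum(r * r for r in local_resid) / len(local_resid))

print(f"true noise SD:              {TRUE_SD:.3f}")
print(f"straight-line residual SD:  {line_sd:.3f}")
print(f"local-fit residual SD:      {local_sd:.3f}")
```

Under these assumptions the straight-line value should land close to 1, while the naive local-fit value has a theoretical size of about sqrt(0.375), roughly 0.61, because the smoother has absorbed part of the noise. That gap is the underestimated uncertainty described above.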
| Statistical Tests for Over-fitting | Just as statistical tests can be used to check for significant missing or misspecified terms in the functional part of a model, they can also be used to determine if any unnecessary terms have been included. In fact, checking for over-fitting of the data is one area in which statistical tests are more effective than residual plots. One test that can be used to check whether the model is over-fitting the data is the lack-of-fit test, previously discussed in its more typical role as a check for missing or misspecified terms in the model. There are also statistical tests that can be used to individually test the importance of each parameter in the model for many modeling methods. | ||
| Using the Lack-of-Fit Test to Check for Over-fitting | The lack-of-fit test is one of the more generally applicable tests that can be used to check for over-fitting. The general strategy for testing for over-fitting with the lack-of-fit statistic is essentially the same as the strategy for checking for missing terms in the model. The difference is that, if the model over-fits the data, the model-based estimate of the random variation will be smaller than the model-independent estimate, rather than larger. Because the basic strategy is the same as for other uses of the lack-of-fit statistic, this test can only be used if the data set includes replicate measurements, as explained elsewhere. | ||
|
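As noted earlier, when the model actually does fit the data, the ratio of the model-based estimate of the random variation, $\hat{\sigma}_m^2$, to the model-independent estimate, $\hat{\sigma}_r^2$,
$$ L = \frac{\hat{\sigma}_m^2}{\hat{\sigma}_r^2} , $$
follows an F distribution. Therefore, to ensure that a model that actually does fit the data is rarely rejected by chance, the hypothesis that the model fits the data is rejected only when the ratio falls below a lower cutoff value taken from that F distribution. |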
|||
| Testing for Missing Terms and Overfitting |
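To use the lack-of-fit test to simultaneously test for missing or misspecified terms in the model and for over-fitting, the
two "one-sided" tests described in the preceding paragraph and on the previous page
should be used together: an upper cutoff value for missing or misspecified terms and a lower cutoff value for over-fitting, each with a significance level of $\alpha/2$,
so that the overall significance level of the combined test is $\alpha$.
|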
||
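The ratio used by the lack-of-fit test can be computed directly from replicated data. The sketch below is an illustration under stated assumptions, not the handbook's own worked example: the data values are hypothetical, the straight-line model is assumed, and the two variance estimates use the standard lack-of-fit and pure-error mean squares. Looking up the F cutoff values is left to a table or a statistics package.

```python
from collections import defaultdict

# Hypothetical replicated data: two measurements at each of five x values.
data = [(1, 2.1), (1, 1.7), (2, 4.3), (2, 3.8), (3, 6.2),
        (3, 5.6), (4, 8.4), (4, 7.7), (5, 10.1), (5, 9.8)]
n = len(data)
p = 2  # parameters in the assumed straight-line model

# Least-squares straight line fitted to all n observations.
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n
sxx = sum((x - x_bar) ** 2 for x, _ in data)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Group the replicates by predictor value.
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)
n_u = len(groups)  # number of unique predictor values

# Model-independent estimate: spread of replicates about their own means.
ss_pure = sum(
    (y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys
)
var_r = ss_pure / (n - n_u)

# Model-based estimate: spread of replicate means about the fitted line.
ss_lof = sum(
    len(ys) * (sum(ys) / len(ys) - (intercept + slope * x)) ** 2
    for x, ys in groups.items()
)
var_m = ss_lof / (n_u - p)

ratio = var_m / var_r
print(f"model-based variance estimate:       {var_m:.4f}")
print(f"model-independent variance estimate: {var_r:.4f}")
print(f"ratio: {ratio:.3f} on ({n_u - p}, {n - n_u}) degrees of freedom")
```

A ratio far below 1 points toward over-fitting and a ratio far above 1 points toward missing or misspecified terms. How far is far enough is decided by the lower and upper cutoff values of the F distribution with the degrees of freedom printed by the sketch.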
| Tests of Individual Parameters | In addition to the lack-of-fit test, most output from regression software also includes individual statistical tests that compare the hypothesis that each parameter is equal to zero with the alternative that it is not zero. These tests are convenient because they are automatically included in most computer output, do not require replicate measurements, and give specific information about each parameter in the model. However, if the different predictor variables included in the model have values that are correlated, these tests can also be quite difficult to interpret. | ||
| Test Statistics Based on Student's t Distribution | The test statistics for testing whether or not each parameter is zero are typically based on Student's t distribution. Each parameter estimate in the model is tested by measuring how many standard deviations it is from its hypothesized value of zero. If the parameter's estimated value is close enough to the hypothesized value that any deviation can be attributed to random error, the hypothesis that the parameter's true value is zero is not rejected. If, on the other hand, the parameter's estimated value is so far away from the hypothesized value that the deviation cannot be plausibly explained by random error, the hypothesis that the true value of the parameter is zero is rejected. | ||
|
Because the hypothesized value of each parameter is zero, the test statistic for each of these tests is simply the
estimated parameter value divided by its estimated standard deviation,
$$ T = \frac{\hat{\beta}}{\hat{\sigma}_{\hat{\beta}}} , $$
which provides a measure of the distance between the estimated and hypothesized values of the parameter in standard deviations. Based on the assumptions that the random errors are normally distributed and that the true value of the parameter is zero (as we have hypothesized), the test statistic has a Student's t distribution with $n - p$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of parameters in the model. |
|||
| Parameter Tests for the Pressure / Temperature Example |
To illustrate the use of the individual tests of the significance of each parameter in a model, the Dataplot output for the
Pressure/Temperature example is shown below. In this case a straight-line model
was fit to the data, so the output includes tests of the significance of the intercept and slope. The estimates of the
intercept and the slope are 7.75 and 3.93, respectively. Their estimated standard deviations are listed in the next column,
followed by the test statistics used to determine whether or not each parameter is zero. At the bottom of the output, the estimate
of the residual standard deviation and its degrees of freedom are also listed. |
||
| Dataplot Output: Pressure / Temperature Example |
LEAST SQUARES POLYNOMIAL FIT
SAMPLE SIZE N = 40
DEGREE = 1
NO REPLICATION CASE
         PARAMETER ESTIMATES     (APPROX. ST. DEV.)     T VALUE
 1  A0   7.74899                 (2.354)                3.292
 2  A1   3.93014                 (0.5070E-01)           77.51
RESIDUAL STANDARD DEVIATION = 4.299098
RESIDUAL DEGREES OF FREEDOM = 38
|
||
|
Looking up the cut-off value from the tables of the t distribution using a
significance level of |
|||
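The t values in the Dataplot output can be reproduced from the estimates and their standard deviations and then compared with a two-sided cutoff. The sketch below takes its numbers from that output. The significance level of 0.05 is an assumption made here for illustration, and the cutoff is obtained by numerically inverting the Student's t distribution with the Python standard library rather than by reading a table. The helper names are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + t * t / df))

def t_cdf(t, df, steps=2000):
    """CDF for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_cutoff(prob, df):
    """Value t with CDF(t) = prob, for prob > 0.5, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

ALPHA = 0.05      # assumed significance level, chosen for illustration
RESIDUAL_DF = 38  # from the output: 40 observations, 2 parameters

# (estimate, approximate standard deviation) from the Dataplot output.
params = {"A0": (7.74899, 2.354), "A1": (3.93014, 0.5070e-01)}

# Two-sided test: half of ALPHA is placed in each tail.
cutoff = t_cutoff(1 - ALPHA / 2, RESIDUAL_DF)
print(f"cutoff = {cutoff:.3f}")
for name, (estimate, st_dev) in params.items():
    t_value = estimate / st_dev
    verdict = "reject" if abs(t_value) > cutoff else "do not reject"
    print(f"{name}: t = {t_value:.2f} -> {verdict} the hypothesis that it is zero")
```

With the assumed significance level the cutoff is about 2.024, so both test statistics exceed it in absolute value and both hypotheses are rejected. The recomputed slope statistic can differ from the printed 77.51 in the last digit because the printed standard deviation is rounded.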