4. Process Modeling
4.4. Data Analysis for Process Modeling
4.4.4. How can I tell if a model fits my data?

4.4.4.7. How can I test whether all of the terms in the functional part of the model are necessary?

Unnecessary Terms in the Model Affect Inferences
Models that are generally correct in form, but that include extra, unnecessary terms are said to "over-fit" the data. The term over-fitting is used to describe this problem because the extra terms in the model make it more flexible than it should be, allowing it to fit some of the random variation in the data as if it were deterministic structure. Because the parameters for any unnecessary terms in the model usually have estimated values near zero, it may seem like leaving them in the model will not hurt anything. It is true, actually, that having one or two extra terms in the model does not usually have much negative impact. However, if enough extra terms are left in the model, the consequences can be serious. Among other things, including unnecessary terms in the model causes uncertainties associated with the model to be underestimated, potentially causing incorrect scientific or engineering conclusions to be drawn from the analysis of the data.
Empirical and Local Models Most Prone to Over-fitting the Data
Over-fitting is especially likely to occur when developing purely empirical models for processes for which there is no external understanding of how much of the total variation in the data might be systematic and how much is random. It also happens more frequently when using regression methods that fit the data locally instead of using an explicitly specified function to describe the structure in the data. Explicit functions are usually relatively simple and have few terms. It is usually difficult to know how to specify an explicit function that fits the noise in the data, since noise will not typically display much structure. This is why over-fitting is not usually a problem with explicitly specified models. Local models, on the other hand, can easily be made to fit very complex patterns, allowing them to find apparent structure in process noise if care is not taken.
Statistical Tests for Over-fitting
Just as statistical tests can be used to check for significant missing or misspecified terms in the functional part of a model, they can also be used to determine if any unnecessary terms have been included. In fact, checking for over-fitting of the data is one area in which statistical tests are more effective than residual plots. One test that can be used to check whether the model is over-fitting the data is the lack-of-fit test, previously discussed in its more typical role as a check for missing or misspecified terms in the model. There are also statistical tests that can be used to individually test the importance of each parameter in the model for many modeling methods.
Using the Lack-of-Fit Test to Check for Over-fitting
The lack-of-fit test is one of the more generally applicable tests that can be used to check for over-fitting. The general strategy for testing for over-fitting with the lack-of-fit statistic is essentially the same as the strategy for checking for missing terms in the model, except that, if the model over-fits the data, the model-based estimate of the random variation will be smaller than the model-independent estimate. Because the basic strategy is the same as for other uses of the lack-of-fit statistic, this test can only be used if the data set includes replicate measurements, as explained elsewhere.
As noted earlier, when the model actually does fit the data, the ratio

$$L = \frac{\hat{\sigma}_m^2}{\hat{\sigma}_r^2}$$

of the model-based estimate of the random variation, $\hat{\sigma}_m^2$, to the model-independent (replicate-based) estimate, $\hat{\sigma}_r^2$, follows an F distribution. Therefore, to ensure that a model that actually does fit the data is rarely rejected by chance, the hypothesis that the model fits the data is rejected only when $L$ is less than a lower-tail cut-off value from the F distribution. The cut-off value is determined from a user-specified probability of wrongly rejecting a model that does not over-fit the data. This user-specified probability is called the "significance level" of the test. The significance level for most statistical tests is denoted by $\alpha$. The most commonly used value for the significance level is $\alpha = 0.05$, which means that the hypothesis that the model does not over-fit the data will be rejected in only 5% of tests when the model actually does not over-fit the data. Cut-off values can be computed using most statistical software or from tables of the F distribution. In addition to needing the significance level to obtain the cut-off value, the F distribution is indexed by the degrees of freedom associated with each of the two estimates of $\sigma$. $\hat{\sigma}_m$, which appears in the numerator of $L$, has $n_u - p$ degrees of freedom. $\hat{\sigma}_r$, which appears in the denominator of $L$, has $n - n_u$ degrees of freedom. Here $n$ is the number of observations, $n_u$ is the number of unique combinations of predictor variable levels, and $p$ is the number of parameters in the model.
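The lower-tail cut-off described above can be sketched in a few lines of Python. This is a minimal illustration, not the handbook's own procedure. It uses only the standard library and evaluates the F distribution through the regularized incomplete beta function (a standard continued-fraction method). The function names, the degrees of freedom (5 and 10), and the observed ratio of 0.15 are all hypothetical values chosen for illustration.

```python
import math

def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_bt = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
              + a * math.log(x) + b * math.log(1.0 - x))
    bt = math.exp(log_bt)
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b

def f_cdf(f, df1, df2):
    """P(F <= f) for an F distribution with (df1, df2) degrees of freedom."""
    if f <= 0.0:
        return 0.0
    return reg_inc_beta(df1 / 2, df2 / 2, df1 * f / (df1 * f + df2))

def f_lower_cutoff(alpha, df1, df2):
    """Lower-tail cut-off c with P(F <= c) = alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, df1, df2) < alpha:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f_cdf(mid, df1, df2) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical inputs: numerator df, denominator df, and an observed ratio.
df_model, df_repl = 5, 10
ratio = 0.15
cutoff = f_lower_cutoff(0.05, df_model, df_repl)
overfit_flagged = ratio < cutoff
print(f"lower cut-off = {cutoff:.4f}, over-fitting flagged: {overfit_flagged}")
```

With these hypothetical inputs the ratio falls below the lower cut-off, so the hypothesis that the model does not over-fit would be rejected at the 0.05 level. A statistics library would normally supply the F quantile directly; the hand-written version is used here only to keep the sketch self-contained.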
Testing for Missing Terms and Over-fitting
To use the lack-of-fit test to test simultaneously for missing or misspecified terms in the model and for over-fitting, the two "one-sided" tests described in the preceding paragraph and on the previous page should be used together, with the upper and lower cut-off values each computed at a significance level of $\alpha/2$. This guarantees that the hypothesis that the model has no missing or misspecified terms and does not over-fit the data will be rejected by chance only with probability $\alpha$.
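Written out, the combined two-sided rule described above takes the following form. The notation is an assumption made for this sketch: $L$ is the lack-of-fit ratio, $\nu_1$ and $\nu_2$ are its numerator and denominator degrees of freedom, and $F_{q;\,\nu_1,\nu_2}$ denotes the $q$ quantile of the corresponding F distribution.

```latex
\text{Reject the hypothesis that the model fits if}\quad
L < F_{\alpha/2;\,\nu_1,\nu_2}
\quad\text{or}\quad
L > F_{1-\alpha/2;\,\nu_1,\nu_2}.

% When the model is correct, the two rejection regions are disjoint, so
\Pr(\text{reject}) = \frac{\alpha}{2} + \frac{\alpha}{2} = \alpha .
```

A ratio below the lower cut-off points toward over-fitting, while a ratio above the upper cut-off points toward missing or misspecified terms.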
Tests of Individual Parameters
In addition to the lack-of-fit test, most output from regression software also includes individual statistical tests which compare the hypothesis that each parameter is equal to zero with the alternative that it is not zero. These tests are convenient because they are automatically included in most computer output, do not require replicate measurements, and give specific information about each parameter in the model. However, if the different predictor variables included in the model have values that are correlated, these tests can also be quite difficult to interpret.
Test Statistics Based on Student's t Distribution
The test statistics for testing whether or not each parameter is zero are typically based on Student's t distribution. Each parameter estimate in the model is assessed by measuring how many standard deviations it is from its hypothesized value of zero. If the parameter's estimated value is close enough to the hypothesized value that any deviation can be attributed to random error, the hypothesis that the parameter's true value is zero is not rejected. If, on the other hand, the parameter's estimated value is so far away from the hypothesized value that the deviation cannot be plausibly explained by random error, the hypothesis that the true value of the parameter is zero is rejected.
Because the hypothesized value of each parameter is zero, the test statistic for each of these tests is simply the estimated parameter value divided by its estimated standard deviation,

$$T = \frac{\hat{\beta}_i}{\hat{\sigma}_{\hat{\beta}_i}}$$

which provides a measure of the distance between the estimated and hypothesized values of the parameter in standard deviations. Based on the assumptions that the random errors are normally distributed and that the true value of the parameter is zero (as we have hypothesized), the test statistic has a Student's t distribution with $n - p$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of parameters in the model. Therefore, cut-off values from the t distribution can be used to determine how extreme the test statistics must be in order for each parameter estimate to be too far away from its hypothesized value for the deviation to be attributed to random error. Because these tests are generally used to test simultaneously whether a parameter value is greater than or less than zero, each test should use upper and lower cut-off values that each correspond to a significance level of $\alpha/2$. This guarantees that the hypothesis that a parameter equals zero will be rejected by chance only with probability $\alpha$. Because of the symmetry of the t distribution, only one cut-off value, the upper or the lower one, needs to be determined, and the other will be its negative. Equivalently, many people simply compare the absolute value of the test statistic to the upper cut-off value.
Parameter Tests for the Pressure / Temperature Example
To illustrate the use of the individual tests of the significance of each parameter in a model, the Dataplot output for the Pressure/Temperature example is shown below. In this case a straight-line model was fit to the data, so the output includes tests of the significance of the intercept and slope. The estimates of the intercept and the slope are 7.75 and 3.93, respectively. Their estimated standard deviations are listed in the next column, followed by the test statistics used to determine whether or not each parameter is zero. At the bottom of the output, the estimate of the residual standard deviation, $\hat{\sigma}$, and its degrees of freedom are also listed.
Dataplot Output: Pressure / Temperature Example
LEAST SQUARES POLYNOMIAL FIT
SAMPLE SIZE N       =       40
DEGREE              =        1
NO REPLICATION CASE


      PARAMETER ESTIMATES           (APPROX. ST. DEV.)    T VALUE
 1  A0                   7.74899       ( 2.354    )        3.292
 2  A1                   3.93014       (0.5070E-01)        77.51

RESIDUAL    STANDARD DEVIATION =         4.299098
RESIDUAL    DEGREES OF FREEDOM =          38
Looking up the cut-off value from the tables of the t distribution using a significance level of $\alpha = 0.05$ and 38 degrees of freedom yields a cut-off value of 2.024 (the cut-off is obtained from the column labeled "0.025" since this is a two-sided test and 0.05/2 = 0.025). Since both of the test statistics are larger in absolute value than the cut-off value of 2.024, the appropriate conclusion is that both the slope and intercept are significantly different from zero at the 95% confidence level.
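The same check can be reproduced numerically. The sketch below is one possible way to do it, assuming only the Python standard library: it recomputes each test statistic from the estimates and standard deviations printed in the Dataplot output and finds the two-sided cut-off by integrating the t density with Simpson's rule and bisecting. The function names are hypothetical, and a statistics package would normally provide the t quantile directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, n=2000):
    """P(T <= x), using composite Simpson's rule on [0, |x|] and symmetry."""
    if x < 0:
        return 1.0 - t_cdf(-x, df, n)
    h = x / n  # n must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_upper_cutoff(alpha, df):
    """Upper cut-off for a two-sided test: P(T <= c) = 1 - alpha/2."""
    target = 1.0 - alpha / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Values copied from the Dataplot output: (estimate, approx. st. dev.).
params = {"A0": (7.74899, 2.354), "A1": (3.93014, 0.5070e-01)}
resid_df = 38
alpha = 0.05

t_cut = t_upper_cutoff(alpha, resid_df)
t_stats = {name: est / sd for name, (est, sd) in params.items()}
significant = {name: abs(t) > t_cut for name, t in t_stats.items()}

print(f"cut-off = {t_cut:.3f}")
for name, t in t_stats.items():
    print(f"{name}: T = {t:.2f}, significant = {significant[name]}")
```

Because the standard deviations are taken from the rounded printout, the recomputed statistics can differ from the listed T values in the last digit.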