|
4. Process Modeling
4.4. Data Analysis for Process Modeling
4.4.5. If my current model does not fit the data well, how can I improve it?
|
|||
| Two Basic Approaches: Transformation and Weighting |
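There are two basic approaches to getting improved parameter estimates for data in which the standard deviation of the
response is not constant across all combinations of predictor variable values: transforming the data (for example,
taking the ln of the response) and weighting the observations (weighted least squares).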
|
||
| Using Transformations |
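The basic steps for using transformations to handle data with unequal subpopulation standard deviations are:
(1) transform the response variable to equalize the variation across the levels of the predictor variables;
(2) transform the predictor variables, if necessary, to attain or restore a simple functional form for the regression function;
(3) fit and validate the model in the transformed variables; and
(4) transform the predicted values back into the original units using the inverse of the transformation applied to the response variable.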
|
||
| Typical Transformations for Stabilization of Variation |
Appropriate transformations to stabilize the variability may be suggested by scientific knowledge or selected using the
data. Three transformations that are often effective for equalizing the standard deviations across the values of the
predictor variables are the natural logarithm, $\ln(y)$; the square root, $\sqrt{y}$; and the inverse, $1/y$.
|
||
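One way to see why these particular transformations are often effective is a first-order (delta-method) approximation. The lines below are a sketch under stated assumptions, not part of the original procedure: $\mu$ denotes the mean response, $\sigma(\mu)$ its standard deviation (assumed to vary smoothly with $\mu$), and $g$ the transformation applied to the response $y$.

```latex
\begin{gather*}
\operatorname{SD}[g(y)] \;\approx\; |g'(\mu)|\,\sigma(\mu) \\[4pt]
g(y)=\sqrt{y}:\quad |g'(\mu)| = \frac{1}{2\sqrt{\mu}}
  \;\Rightarrow\; \text{roughly constant SD when } \sigma(\mu)\propto\sqrt{\mu} \\[4pt]
g(y)=\ln(y):\quad |g'(\mu)| = \frac{1}{\mu}
  \;\Rightarrow\; \text{roughly constant SD when } \sigma(\mu)\propto\mu \\[4pt]
g(y)=\frac{1}{y}:\quad |g'(\mu)| = \frac{1}{\mu^{2}}
  \;\Rightarrow\; \text{roughly constant SD when } \sigma(\mu)\propto\mu^{2}
\end{gather*}
```

The approximation only suggests candidates. Plots of the transformed data, and later the residuals, remain the basis for choosing among them.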
| Modified Pressure / Temperature Example | To illustrate how to use transformations to stabilize the variation in the data, we will return to the modified version of the Pressure/Temperature example. The residuals from a straight line fit to that data clearly showed that the standard deviation of the measurements was not constant across the range of temperatures. | ||
| Residuals from Modified Pressure Data |
|
||
| Stabilizing the Variation | The first step in the process is to compare different transformations of the response variable, pressure, to see which one, if any, stabilizes the variation across the range of temperatures. The straight-line relationship will not hold for all of the transformations, but at this stage of the process that is not a concern. The functional relationship can usually be corrected after stabilizing the variation. The key for this step is to find a transformation that makes the uncertainty in the data approximately the same at the lowest and highest temperatures (and in between). The plot below shows the modified Pressure/Temperature data in its original units, and with the response variable transformed using each of the three typical transformations. | ||
| Transformations of the Pressure |
|
||
| Inverse Pressure Has Constant Variation | After comparing the effects of the different transformations, it looks like using the inverse of the pressure will make the standard deviation approximately constant across all temperatures. However, it is somewhat difficult to tell how the standard deviations really compare on a plot of this size and scale. To better see the variation, a full-sized plot of temperature versus the inverse of the pressure is shown below. In that plot it is easier to compare the variation across temperatures. For example, comparing the variation in the inverse pressure values at a temperature of about 25 with the variation at temperatures near 45 and 70, this plot shows about the same level of variation at all three temperatures. It will still be critical to look at residual plots after fitting the model to the transformed variables, however, to really see whether or not the transformation we've chosen is effective. The residual scale is really the only scale that can reveal that level of detail. | ||
| Enlarged View of Temperature Versus 1/Pressure |
|
||
| Transforming Temperature to Linearity | Having found a transformation that appears to stabilize the standard deviations of the measurements, the next step in the process is to find a transformation of the temperature that will restore the straight-line relationship, or some other simple relationship, between the temperature and pressure. The same three basic transformations that can often be used to stabilize the variation are also usually able to transform the predictor to restore the original relationship between the variables. Plots of the temperature and the three transformations of the temperature versus the inverse of the pressure are shown below. | ||
| Transformations of the Temperature |
|
||
| Comparing the plots of the various transformations of the temperature versus the inverse of the pressure, it appears that the straight-line relationship between the variables is restored when the inverse of the temperature is used. This makes intuitive sense because, if the temperature and pressure are related by a straight line, then the same transformation applied to both variables should change them both similarly, retaining their original relationship. Now, after fitting a straight line to the transformed data, the residuals plotted versus both the transformed and original values of temperature indicate that the straight-line model fits the data and that the random variation no longer increases with increasing temperature. Additional diagnostic plots of the residuals confirm that the model fits the data well. | |||
| Residuals From the Fit to the Transformed Data |
|
||
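The fit to the transformed variables can be sketched in a few lines of Python. This is an illustration under assumptions, not the handbook's own computation: the data are copied from the table of approximate replicates later in this section, the model is a straight line in 1/temperature for 1/pressure, and `fit_line` and `predict_pressure` are names introduced here.

```python
# (temperature, pressure) pairs from the modified Pressure/Temperature data.
DATA = [
    (21.602, 91.423), (21.448, 91.695), (23.323, 98.883), (22.971, 97.324),
    (25.854, 107.620), (25.609, 108.112), (25.838, 109.279), (29.242, 119.933),
    (31.489, 135.555), (34.101, 139.684), (33.901, 139.041), (37.481, 150.165),
    (35.451, 150.210), (39.506, 164.155), (40.285, 168.234), (43.004, 180.802),
    (41.449, 172.646), (42.989, 169.884), (41.976, 171.617), (44.692, 180.564),
    (48.599, 191.243), (47.901, 199.386), (49.127, 202.913), (49.542, 196.225),
    (51.144, 207.458), (50.995, 205.375), (50.917, 218.322), (54.749, 225.607),
    (53.226, 223.994), (54.467, 229.040), (55.350, 227.416), (54.673, 223.958),
    (54.936, 224.790), (57.549, 230.715), (56.982, 216.433), (58.775, 224.124),
    (61.204, 256.821), (68.297, 276.594), (68.476, 267.296), (68.774, 280.352),
]


def fit_line(x, y):
    """Ordinary (equally weighted) least squares for y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


# Transform both variables with the inverse transformation, then fit.
inv_temp = [1.0 / t for t, _ in DATA]
inv_pres = [1.0 / p for _, p in DATA]
b0, b1 = fit_line(inv_temp, inv_pres)

# Residuals on the transformed (1/pressure) scale, for residual plots.
residuals = [ip - (b0 + b1 * it) for it, ip in zip(inv_temp, inv_pres)]


def predict_pressure(temperature):
    """Predict 1/pressure, then invert to return to pressure units."""
    return 1.0 / (b0 + b1 / temperature)


print(f"1/pressure = {b0:.6f} + {b1:.6f} * (1/temperature)")
print(f"predicted pressure at temperature 40: {predict_pressure(40.0):.1f}")
```

Plotting `residuals` against both `inv_temp` and the original temperatures is the check described above. Predictions are converted back to pressure units by inverting the fitted value.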
|
|||
| Using Weighted Least Squares |
As discussed in the overview of different methods for building process models, the
goal when using weighted least squares regression is to ensure that each data point has an appropriate level of influence
on the final parameter estimates. Using the weighted least squares fitting criterion, the parameter
estimates are obtained by minimizing
$$Q = \sum_{i=1}^{n} w_i \left[y_i - f(\vec{x}_i;\hat{\vec{\beta}})\right]^2 .$$
Optimal results, which minimize the uncertainty in the parameter estimates, are obtained when the weights, $w_i$, used to estimate the values of the unknown parameters are inversely proportional to the variances at each combination of predictor variable values:
$$w_i \propto \frac{1}{\sigma_i^2} .$$
Unfortunately, however, these optimal weights, which are based on the true standard deviations of each data point, are never known. Estimated weights have to be used instead. When estimated weights are used, the optimality properties associated with known weights no longer strictly apply. However, if the weights can be estimated with high enough precision, their use can significantly improve the parameter estimates compared to the results that would be obtained if all of the data points were equally weighted. |
||
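For a straight-line model the weighted least squares criterion has a closed-form minimizer, which makes the role of the weights easy to see. The sketch below uses made-up data and hypothetical standard deviations purely for illustration. `wls_line` and `weighted_sse` are names introduced here, not functions from any particular package.

```python
def wls_line(x, y, w):
    """Return (b0, b1) minimizing sum(w_i * (y_i - b0 - b1 * x_i) ** 2)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def weighted_sse(x, y, w, b0, b1):
    """Value of the weighted least squares criterion at (b0, b1)."""
    return sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y))


# Made-up data: the last point is noisy, so it is given a small weight.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 11.0]
sd = [0.1, 0.1, 0.2, 2.0]        # hypothetical standard deviations
w = [1.0 / s ** 2 for s in sd]   # weights proportional to 1 / variance

b0_w, b1_w = wls_line(x, y, w)
b0_eq, b1_eq = wls_line(x, y, [1.0] * len(x))  # equal weights = ordinary LS
print(f"weighted fit: {b0_w:.3f} + {b1_w:.3f} x")
print(f"equal-weight: {b0_eq:.3f} + {b1_eq:.3f} x")
```

With equal weights the noisy last point pulls the slope upward. With weights proportional to the inverse variances its influence is reduced, which is the behavior the criterion is designed to produce.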
| Direct Estimation of Weights |
If there are replicates in the data, the most obvious way to estimate the weights is to set the weight for each data
point equal to the inverse of the sample variance obtained from the set of replicate measurements the data point belongs to.
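Mathematically this would be
$$w_{ij} = \frac{1}{\hat{\sigma}_i^2} = \left[\frac{\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2}{n_i-1}\right]^{-1}$$
where $w_{ij}$ are the weights, indexed by their predictor variable levels and replicate measurements; $i$ indexes the unique combinations of predictor variable values; $j$ indexes the replicates within each combination of predictor variable values; $\hat{\sigma}_i$ is the sample standard deviation of the response variable at the $i$th combination of predictor variable values; $n_i$ is the number of replicate observations at the $i$th combination; $y_{ij}$ are the individual data points; and $\bar{y}_i$ is the mean of the responses at the $i$th combination.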
|
||
| Unfortunately, although this method is attractive, it rarely works well. This is because when the weights are estimated this way they are usually extremely variable. As a result, the estimated weights do not correctly control how much each data point should influence the parameter estimates. This method can work, but it requires a very large number of replicates at each combination of predictor variables. In fact, if this method is used with too few replicate measurements, the parameter estimates can actually be more variable than they would have been if the unequal variation were ignored. | |||
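A small sketch of direct estimation shows how unstable the weights can be with only a few replicates. It assumes, for illustration only, that the first few approximate-replicate groups from the table later in this section are treated as replicate sets. `direct_weights` is a name introduced here.

```python
from statistics import variance


def direct_weights(groups):
    """Map each group label to 1 / (sample variance of its responses).

    Groups with fewer than two measurements have no sample variance,
    so they are assigned None. A group whose responses are all identical
    would have zero variance and is not handled in this sketch.
    """
    weights = {}
    for label, responses in groups.items():
        if len(responses) >= 2:
            weights[label] = 1.0 / variance(responses)
        else:
            weights[label] = None
    return weights


# Pressure values grouped by rounded temperature (illustrative subset).
groups = {
    21: [91.423, 91.695],
    24: [98.883, 97.324],
    27: [107.620, 108.112, 109.279],
    45: [180.564],
}

weights = direct_weights(groups)
for label, wt in weights.items():
    print(label, "NA" if wt is None else f"{wt:.3f}")
```

Neighboring groups end up with weights that differ by more than an order of magnitude, even though nothing suggests the underlying variation changes that abruptly. This is the variability the paragraph above warns about.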
| A Better Strategy for Estimating the Weights |
A better strategy for estimating the weights is to find a function that relates the standard deviation of the response at
each combination of predictor variable values to the predictor variables themselves. This means that if
$$\sigma_i^2 \approx g(\vec{x}_i;\vec{\gamma})$$
(denoting the unknown parameters in the function $g$ by $\vec{\gamma}$), then the weights can be set to
$$w_{ij} = \frac{1}{g(\vec{x}_i;\hat{\vec{\gamma}})} .$$
This approach to estimating the weights usually provides more precise estimates than direct estimation because fewer quantities have to be estimated and there is more data to estimate each one. |
||
| Estimating Weights Without Replicates |
If there are only very few or no replicate measurements for each combination of predictor variable values, then approximate
replicate groups can be formed so that weights can be estimated. There are several possible approaches to forming the
replicate groups.
|
||
| Fitting and Validating the Model for Estimation of the Weights |
When fitting the model for the estimation of the weights,
$$\hat{\sigma}_i^2 = g(\vec{x}_i;\vec{\gamma}) + \varepsilon ,$$
it is important to note that the usual regression assumptions do not hold. In particular, the variation of the random errors is not constant across the different sets of replicates and their distribution is not normal. However, this can often be accounted for by using transformations (the ln transformation typically works well), as described above. |
||
| Using Weighted Residuals to Validate WLS Models |
Once the weights have been estimated and the weighted least squares model has been fit to the original data, the
validation of the model follows as usual, with one exception. In a weighted analysis the distribution of the residuals
can vary substantially with the different values of the predictor variables. This necessitates the use of weighted
residuals [Graybill and Iyer (1994)] when carrying
out a graphical residual analysis so that the plots can be interpreted as usual. The weighted residuals are given by the
formula
$$e_{ij} = \sqrt{w_{ij}}\left(y_{ij} - f(\vec{x}_i;\hat{\vec{\beta}})\right) .$$
It is important to note that most statistical software packages do not compute and return weighted residuals when a weighted fit is done, so the residuals will usually have to be weighted manually in an additional step. It is almost a sure sign that you have forgotten to compute the weighted residuals if, after computing a weighted least squares fit using carefully estimated weights, the residual plots still show the same funnel-shaped pattern as they did for the initial equally-weighted fit. |
||
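The extra weighting step is short. The sketch below uses made-up fitted values and hypothetical standard deviations chosen so that the effect is easy to verify by hand. `weighted_residuals` is a name introduced here.

```python
import math


def weighted_residuals(y, fitted, w):
    """Multiply each raw residual by the square root of its weight."""
    return [math.sqrt(wi) * (yi - fi) for yi, fi, wi in zip(y, fitted, w)]


# Made-up example: the raw residuals fan out as the response grows.
fitted = [10.0, 20.0, 30.0, 40.0]
y = [10.5, 19.0, 32.0, 36.0]
sd = [0.5, 1.0, 2.0, 4.0]        # hypothetical standard deviations
w = [1.0 / s ** 2 for s in sd]   # weights = 1 / variance

raw = [yi - fi for yi, fi in zip(y, fitted)]
weighted = weighted_residuals(y, fitted, w)
print("raw:     ", raw)
print("weighted:", weighted)
```

The raw residuals grow from 0.5 to 4 in magnitude, while the weighted residuals all have magnitude 1, so a plot of the weighted residuals would no longer show a funnel shape.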
| Power Function Model for the Weights |
When there is only one predictor variable, one specific model that often works well for modeling the variances is the power
function
$$\sigma_i^2 = \gamma_1 x_i^{\gamma_2} .$$
To illustrate the steps needed to carry out a weighted least squares fit, this model will be used with the modified version of the Pressure/Temperature data, plotted below. |
||
| Modified Pressure/Temperature Data |
|
||
| Defining Sets of Approximate Replicate Measurements |
From the data, plotted above, it is clear that there are not many true replicates in this data set. As a result, sets of
approximate replicate measurements need to be defined in order to use the power function model to estimate the weights.
In this case, this was done by rounding a multiple of the temperature to the nearest degree and then converting the
rounded data back to the original scale:
$$3 \times \text{round}\left(\frac{\text{Temperature}}{3}\right)$$
This is an easy way to identify sets of measurements that have temperatures that are relatively close together. If this process had produced too few sets of replicates, a smaller factor than three could have been used to spread the data out further before rounding. If fewer replicate sets were needed, then a larger factor could have been used. The appropriate value to use is a matter of judgment. An ideal value is one that doesn't combine values that are too different and which yields sets of replicates that aren't too different in size. A table showing the original data, the rounded temperatures that define the approximate replicates, and the replicate standard deviations is listed below. |
||
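The rounding rule is a one-liner. The sketch below is an illustration in Python: `replicate_label` is a name introduced here, and the temperatures are a handful of values taken from the table that follows.

```python
def replicate_label(temperature, factor=3):
    """Divide by the factor, round to the nearest integer, and scale back.

    Python's round() sends exact halves to the nearest even integer;
    none of the temperatures used here falls exactly on a half.
    """
    return factor * round(temperature / factor)


temps = [21.602, 21.448, 23.323, 22.971, 49.542, 68.297]
print([replicate_label(t) for t in temps])            # factor of three
print([replicate_label(t, factor=1) for t in temps])  # smaller factor
```

With a factor of three the first four temperatures fall into two groups (21 and 24). With a factor of one they spread over three groups (22, 21, and 23), which illustrates how a smaller factor produces more, smaller sets.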
| Data with Approximate Replicates |
Temperature  Rounded Temperature  Pressure  Standard Deviation
--------------------------------------------------------------
21.602 21 91.423 0.192333
21.448 21 91.695 0.192333
23.323 24 98.883 1.102380
22.971 24 97.324 1.102380
25.854 27 107.620 0.852080
25.609 27 108.112 0.852080
25.838 27 109.279 0.852080
29.242 30 119.933 11.046422
31.489 30 135.555 11.046422
34.101 33 139.684 0.454670
33.901 33 139.041 0.454670
37.481 36 150.165 0.031820
35.451 36 150.210 0.031820
39.506 39 164.155 2.884289
40.285 39 168.234 2.884289
43.004 42 180.802 4.845772
41.449 42 172.646 4.845772
42.989 42 169.884 4.845772
41.976 42 171.617 4.845772
44.692 45 180.564 NA
48.599 48 191.243 5.985219
47.901 48 199.386 5.985219
49.127 48 202.913 5.985219
49.542 51 196.225 9.074554
51.144 51 207.458 9.074554
50.995 51 205.375 9.074554
50.917 51 218.322 9.074554
54.749 54 225.607 2.040637
53.226 54 223.994 2.040637
54.467 54 229.040 2.040637
55.350 54 227.416 2.040637
54.673 54 223.958 2.040637
54.936 54 224.790 2.040637
57.549 57 230.715 10.098899
56.982 57 216.433 10.098899
58.775 60 224.124 23.120270
61.204 60 256.821 23.120270
68.297 69 276.594 6.721043
68.476 69 267.296 6.721043
68.774 69 280.352 6.721043
|
||
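The standard deviation column can be reproduced by grouping the pressures on the rounded temperature and taking the sample standard deviation within each group. The following is a minimal sketch in Python, not the software used for the original table. `replicate_sds` is a name introduced here.

```python
from collections import defaultdict
from statistics import stdev

# (temperature, pressure) pairs from the table above.
DATA = [
    (21.602, 91.423), (21.448, 91.695), (23.323, 98.883), (22.971, 97.324),
    (25.854, 107.620), (25.609, 108.112), (25.838, 109.279), (29.242, 119.933),
    (31.489, 135.555), (34.101, 139.684), (33.901, 139.041), (37.481, 150.165),
    (35.451, 150.210), (39.506, 164.155), (40.285, 168.234), (43.004, 180.802),
    (41.449, 172.646), (42.989, 169.884), (41.976, 171.617), (44.692, 180.564),
    (48.599, 191.243), (47.901, 199.386), (49.127, 202.913), (49.542, 196.225),
    (51.144, 207.458), (50.995, 205.375), (50.917, 218.322), (54.749, 225.607),
    (53.226, 223.994), (54.467, 229.040), (55.350, 227.416), (54.673, 223.958),
    (54.936, 224.790), (57.549, 230.715), (56.982, 216.433), (58.775, 224.124),
    (61.204, 256.821), (68.297, 276.594), (68.476, 267.296), (68.774, 280.352),
]


def replicate_sds(data, factor=3):
    """Group pressures by rounded temperature; return {label: sample SD}.

    A group with a single measurement has no sample SD and maps to None,
    matching the NA entry in the table.
    """
    groups = defaultdict(list)
    for temperature, pressure in data:
        groups[factor * round(temperature / factor)].append(pressure)
    return {
        label: (stdev(values) if len(values) > 1 else None)
        for label, values in sorted(groups.items())
    }


sds = replicate_sds(DATA)
for label, sd in sds.items():
    print(label, "NA" if sd is None else f"{sd:.6f}")
```

The single measurement at a rounded temperature of 45 has no replicate standard deviation, so only 14 of the 15 groups can contribute to the weight estimation.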
| Transformation of the Weight Data |
With the replicate groups defined, a plot of the ln of the replicate variances versus the ln of the temperature shows that
the transformed data for estimating the weights do appear to follow the power function model. This is because the
ln-ln transformation linearizes the power function, as well as stabilizing the variation of the random errors and
making their distribution approximately normal:
$$\ln(\hat{\sigma}_i^2) = \ln(\gamma_1 x_i^{\gamma_2}) = \ln(\gamma_1) + \gamma_2 \ln(x_i)$$ |
||
| Transformed Data for Weight Estimation with Fitted Model |
|
||
| Specification of Weight Function |
The Splus output from the fit of the weight estimation model is shown below. Based on the output, and the associated
residual plots, the model of the weights seems reasonable, and
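$$w_{ij} = \frac{1}{\text{Temperature}_i^{6.0230}} \approx \frac{1}{\text{Temperature}_i^{6}}$$
should be an appropriate weight function for the modified Pressure/Temperature data. The weight function is based only on the slope from the fit to the transformed weight data because the weights only need to be proportional to the replicate variances. As a result, we can ignore the estimate of the multiplicative constant in the power function (the intercept of the ln-ln fit), since it only rescales all of the weights by the same factor. |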
||
| Output from Weight Estimation Fit |
Residual Standard Error = 3.0245
Multiple R-Square = 0.3642
N = 14,
F-statistic = 6.8744 on 1 and 12 df, p-value = 0.0223
                     coef  std.err   t.stat  p.value
Intercept        -20.5896   8.4994  -2.4225   0.0322
ln(Temperature)    6.0230   2.2972   2.6219   0.0223
|
||
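The coefficients in the output can be checked with a straight-line fit of ln(variance) on ln(temperature). The sketch below assumes the rounded temperatures and replicate standard deviations from the table above are the inputs. That assumption reproduces the reported slope and intercept to the digits shown, but the variable names are introduced here for illustration.

```python
import math

# Rounded temperature -> replicate standard deviation (group 45 has none).
REPLICATE_SD = {
    21: 0.192333, 24: 1.102380, 27: 0.852080, 30: 11.046422,
    33: 0.454670, 36: 0.031820, 39: 2.884289, 42: 4.845772,
    48: 5.985219, 51: 9.074554, 54: 2.040637, 57: 10.098899,
    60: 23.120270, 69: 6.721043,
}

# ln-ln transformation: ln(variance) = ln(gamma1) + gamma2 * ln(temperature)
x = [math.log(t) for t in REPLICATE_SD]
y = [math.log(sd ** 2) for sd in REPLICATE_SD.values()]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
slope = sxy / sxx                  # estimate of gamma2
intercept = ybar - slope * xbar    # estimate of ln(gamma1)


def weight(temperature):
    """Weight proportional to 1 / estimated variance; the constant is dropped."""
    return temperature ** (-slope)


print(f"N = {n}, intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"weight at 21 relative to weight at 69: {weight(21) / weight(69):.0f}")
```

Because only the slope enters the weight function, the lowest temperatures receive weights on the order of a thousand times larger than the highest ones.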
| Fit of the WLS Model to the Pressure / Temperature Data | With the weight function estimated, the fit of the weighted least squares model produces the residual plot below. This plot, which shows the weighted residuals from the fit versus temperature, indicates that use of the estimated weight function has stabilized the increasing variation in pressure observed with increasing temperature. The plot of the data with the estimated regression function and additional residual plots using the weighted residuals confirm that the weighted least squares model fits the data. | ||
| Weighted Residuals from WLS Fit of Pressure / Temperature Data |
|
||
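Putting the pieces together, the weighted fit and its weighted residuals can be sketched as follows. This is an illustration under assumptions rather than the original computation: the weights use the fitted exponent 6.0230 applied to the unrounded temperatures, and the straight-line model is fit with the closed-form weighted least squares solution.

```python
import math

# (temperature, pressure) pairs from the table of approximate replicates.
DATA = [
    (21.602, 91.423), (21.448, 91.695), (23.323, 98.883), (22.971, 97.324),
    (25.854, 107.620), (25.609, 108.112), (25.838, 109.279), (29.242, 119.933),
    (31.489, 135.555), (34.101, 139.684), (33.901, 139.041), (37.481, 150.165),
    (35.451, 150.210), (39.506, 164.155), (40.285, 168.234), (43.004, 180.802),
    (41.449, 172.646), (42.989, 169.884), (41.976, 171.617), (44.692, 180.564),
    (48.599, 191.243), (47.901, 199.386), (49.127, 202.913), (49.542, 196.225),
    (51.144, 207.458), (50.995, 205.375), (50.917, 218.322), (54.749, 225.607),
    (53.226, 223.994), (54.467, 229.040), (55.350, 227.416), (54.673, 223.958),
    (54.936, 224.790), (57.549, 230.715), (56.982, 216.433), (58.775, 224.124),
    (61.204, 256.821), (68.297, 276.594), (68.476, 267.296), (68.774, 280.352),
]
EXPONENT = 6.0230  # slope from the weight-estimation output

temps = [t for t, _ in DATA]
pressures = [p for _, p in DATA]
weights = [t ** -EXPONENT for t in temps]  # proportional to 1 / variance

# Closed-form weighted least squares for pressure = b0 + b1 * temperature.
sw = sum(weights)
tbar = sum(w * t for w, t in zip(weights, temps)) / sw
pbar = sum(w * p for w, p in zip(weights, pressures)) / sw
stt = sum(w * (t - tbar) ** 2 for w, t in zip(weights, temps))
stp = sum(w * (t - tbar) * (p - pbar)
          for w, t, p in zip(weights, temps, pressures))
b1 = stp / stt
b0 = pbar - b1 * tbar

# Raw residuals, then weighted residuals for the diagnostic plots.
raw = [p - (b0 + b1 * t) for t, p in zip(temps, pressures)]
weighted = [math.sqrt(w) * r for w, r in zip(weights, raw)]

print(f"pressure = {b0:.3f} + {b1:.3f} * temperature")
```

Plotting `weighted` (not `raw`) against temperature is what allows the residual plot to be read in the usual way.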
| Comparison of Transformed and Weighted Results | Having modeled the data using both transformed variables and weighted least squares to account for the non-constant standard deviations observed in pressure, it is interesting to compare the two resulting models. Logically, at least one of these two models cannot be correct. With the random error inherent in the data, however, there is no way to tell which of the two models actually describes the relationship between pressure and temperature better. The fact that the two models lie right on top of one another over almost the entire range of the data tells us that. Even at the highest temperatures, where the models diverge slightly, both models match the small amount of data that is available reasonably well. The only way to differentiate between these models is to use additional scientific knowledge or collect a lot more data. The good news, though, is that either model should work equally well for predictions or calibrations based on this data, or for basic understanding of the relationship between temperature and pressure. | ||
|
|||