L1592_frame_C38 Page 338 Tuesday, December 18, 2001 3:21 PM
[Figure 38.1: two panels of solids concentration C(z, t) versus depth z, from 0 to tank depth Z. Left panel: the initial profile at t = 0 is uniform at C₀, so the mass of solids is M = C₀ZA. Right panel: the profile at settling time t, with mass M = A ∫₀ᶻ C(z, t) dz.]
FIGURE 38.1 Solids concentration as a function of depth at time t. The initial condition (t = 0) is shown on the left. The condition at time t is shown on the right.
The fraction of solids removed in a settling tank at any depth z, for a detention time t, is estimated as:

    R(z, t) = (AZC₀ − A ∫₀ᶻ C(z, t) dz) / (AZC₀) = 1 − (1/(ZC₀)) ∫₀ᶻ C(z, t) dz
This integral can be evaluated graphically (Camp, 1946), or an approximating polynomial can be fitted to the concentration curve so that the fraction of solids removed (R) can be calculated algebraically.
Suppose, for example, that:
C(z, t) = 167 − 2.74t + 11.9z − 0.08zt + 0.014t²
is a satisfactory empirical model and we want to use this model to predict the removal that will be achieved
with a 60-min detention time, for a depth of 8 ft and an initial concentration of 500 mg/L. The solids concentration profile as a function of depth at t = 60 min is:

    C(z, 60) = 167 − 2.74(60) + 11.9z − 0.08z(60) + 0.014(60)² = 53.0 + 7.1z
This is integrated over depth (Z = 8 ft) to give the fraction of solids that are expected to be removed:

    R(z = 8, t = 60) = 1 − (1/(8(500))) ∫₀⁸ (53.0 + 7.1z) dz
                     = 1 − (1/(8(500))) (53(8) + 3.55(8)²) = 0.84
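The worked calculation above can be checked numerically. This is a sketch (not from the original text) in Python:

```python
# Sketch (not from the text): check the removal-fraction calculation.
Z = 8.0     # tank depth, ft
C0 = 500.0  # initial solids concentration, mg/L

def C(z, t):
    """Empirical concentration model quoted in the text."""
    return 167 - 2.74 * t + 11.9 * z - 0.08 * z * t + 0.014 * t ** 2

# At t = 60 min the profile reduces to 53.0 + 7.1*z; its integral over
# 0..Z is 53*Z + 3.55*Z**2, so the removal fraction is:
integral = 53.0 * Z + 3.55 * Z ** 2
R = 1 - integral / (Z * C0)
print(round(R, 2))  # 0.84
```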
The model building problem is to determine the form of the polynomial function and to estimate the
coefficients of the terms in the function.
Method: Linear Regression
Suppose the correct model for the process is η = f(β, x) and the observations are yᵢ = f(β, xᵢ) + eᵢ, where the eᵢ are random errors. There may be several parameters (β) and several independent variables (x).
According to the least squares criterion, the best estimates of the β ’s minimize the sum of the squared
residuals:
minimize S(β) = Σ (yᵢ − ηᵢ)²

where the summation is over all observations.

© 2002 By CRC Press LLC
The minimum sum of squares is called the residual sum of squares, RSS. The residual mean square
(RMS) is the residual sum of squares divided by its degrees of freedom. RMS = RSS /(n − p), where n =
number of observations and p = number of parameters estimated.
Case Study: Solution
A column settling test was done on a suspension with initial concentration of 560 mg/L. Samples were
taken at depths of 2, 4, and 6 ft (measured from the water surface) at times 20, 40, 60, and 120 min;
the data are in Table 38.1. The simplest possible model is:
C(z, t) = β₀ + β₁t

The most complicated model that might be needed is a full quadratic function of time and depth:

C(z, t) = β₀ + β₁t + β₂t² + β₃z + β₄z² + β₅zt
We can start the model building process with either of these and add or drop terms as needed.
Fitting the simplest possible model involving time and depth gives:
ŷ = 132.3 + 7.12z − 0.97t

which has R² = 0.844 and residual mean square = 355.82. R², the coefficient of determination, is the percentage of the total variation in the data that is accounted for by fitting the model (Chapter 39).
Figure 38.2a shows the diagnostic residual plots for the model. The residuals plotted against the predicted values are not random. This suggests an inadequacy in the model, but it does not tell us how
TABLE 38.1
Data from a Laboratory Settling Column Test

            Suspended Solids Concentration at Time t (min)
Depth (ft)      20      40      60      120
2              135      90      75       48
4              170     110      90       53
6              180     126      96       60
FIGURE 38.2 (a) Residuals plotted against the predicted suspended solids concentrations are not random. (b) Residuals
plotted against settling time suggest that a quadratic term is needed in the model.
TABLE 38.2
Analysis of Variance for the Six-Parameter Settling Linear Model

                        df      SS          MS = SS/df
Regression (Reg SS)      5      20255.5     4051.1
Residuals (RSS)          6        308.8       51.5
Total (Total SS)        11      20564.2
[Figure 38.3: residuals (roughly −10 to +10) plotted against predicted SS values (roughly 40 to 200); the scatter shows no pattern.]
FIGURE 38.3 Plot of residuals against the predicted values of the regression model ŷ = 185.97 + 7.125z − 3.057t + 0.014t².
the model might be improved. The pattern of the residuals plotted against time (Figure 38.2b) suggests that adding a t² term may be helpful. This was done to obtain:

    ŷ = 186.0 + 7.12z − 3.06t + 0.0143t²
which has R² = 0.97 and residual mean square = 81.5. A diagnostic plot of the residuals (Figure 38.3)
reveals no inadequacies. Similar plots of residuals against the independent variables also support the
model. This model is adequate to describe the data.
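The forward-selection steps above can be reproduced from the Table 38.1 data with ordinary least squares. The following Python sketch (not from the original text) fits the simplest model and the model with the added t² term; the coefficients and R² values should come out close to those quoted:

```python
import numpy as np

# Data from Table 38.1: depth z (ft), time t (min), SS concentration (mg/L)
z = np.array([2, 4, 6] * 4, dtype=float)
t = np.repeat([20, 40, 60, 120], 3).astype(float)
y = np.array([135, 170, 180, 90, 110, 126, 75, 90, 96, 48, 53, 60], dtype=float)

def fit(X, y):
    """Ordinary least squares; returns coefficients and R^2."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ b) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return b, 1.0 - rss / tss

ones = np.ones_like(z)
bH, r2H = fit(np.column_stack([ones, z, t]), y)          # simplest model
bD, r2D = fit(np.column_stack([ones, z, t, t ** 2]), y)  # with the t^2 term

print(np.round(bH, 2), round(r2H, 3))  # about [132.3, 7.12, -0.97], 0.844
print(np.round(bD, 4), round(r2D, 3))  # about [186, 7.12, -3.06, 0.0143], 0.968
```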
The most complicated model, which has six parameters, is:

    ŷ = 152 + 20.9z − 2.74t − 1.13z² + 0.014t² − 0.080zt
The model contains quadratic terms for time and depth and the interaction of depth and time (zt). The
analysis of variance for this model is given in Table 38.2. This information is produced by computer
programs that do linear regression. For now we do not need to know how to calculate this, but we should
understand how it is interpreted.
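The ANOVA entries in Table 38.2 come from fitting the six-parameter model to the Table 38.1 data. This Python sketch (not from the original text) reproduces the decomposition; the point is that Total SS splits into RegSS plus RSS, not the particular library calls:

```python
import numpy as np

# Data from Table 38.1
z = np.array([2, 4, 6] * 4, dtype=float)
t = np.repeat([20, 40, 60, 120], 3).astype(float)
y = np.array([135, 170, 180, 90, 110, 126, 75, 90, 96, 48, 53, 60], dtype=float)

# Full six-parameter Model A: b0 + b1*z + b2*t + b3*z^2 + b4*t^2 + b5*z*t
X = np.column_stack([np.ones_like(z), z, t, z ** 2, t ** 2, z * t])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

n, p = X.shape
rss = np.sum((y - X @ b) ** 2)          # residual sum of squares
total_ss = np.sum((y - y.mean()) ** 2)  # total sum of squares
reg_ss = total_ss - rss                 # regression sum of squares
rms = rss / (n - p)                     # residual mean square, df = n - p = 6

print(round(reg_ss, 1), round(rss, 1), round(rms, 1))  # about 20255.5, 308.8, 51.5
```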
Across the top, SS is sum of squares and df = degrees of freedom associated with a sum of squares
quantity. MS is mean square, where MS = SS/df. The sum of squares due to regression is the regression
sum of squares (RegSS): RegSS = 20,255.5. The sum of squares due to residuals is the residual sum of
squares (RSS); RSS = 308.8. The total sum of squares, or Total SS, is:
Total SS = RegSS + RSS
Also:

    Total SS = Σ (yᵢ − ȳ)²
The residual sum of squares (RSS) is the minimum sum of squares that results from estimating the parameters
by least squares. It is the variation that is not explained by fitting the model. If the model is correct, the RSS
is the variation in the data due to random measurement error. For this model, RSS = 308.8. The residual
mean square is the RSS divided by the degrees of freedom of the residual sum of squares. For RSS, the
degrees of freedom is df = n − p, where n is the number of observations and p is the number of parameters
in the fitted model. Thus, RMS = RSS /(n − p). The residual sum of squares (RSS = 308.8) and the
TABLE 38.3
Summary of All Possible Regressions for the Settling Test Model

                          Coefficient of the Term
Model      b0     b1 z     b2 t      b3 z²    b4 t²      b5 tz     R²      RegSS    Decrease in RegSS
A          152    20.9     −2.74     −1.13    0.014      −0.08     0.985   20256    —
(t ratio)         (2.3)    (8.3)     (1.0)    (7.0)      (2.4)
[SE]              [9.1]    [0.33]    [1.1]    [0.002]    [0.03]
B          167    11.9     −2.74     —        0.014      −0.08     0.982   20202      54
C          171    16.1     −3.06     −1.13    0.014      —         0.971   19966     289
D          186     7.1     −3.06     —        0.014      —         0.968   19912     343
E           98    20.9     −0.65     −1.13    —          −0.08     0.864   17705    2550
F          113    11.9     −0.65     —        —          −0.08     0.858   17651    2605
G          117    16.1     −0.97     −1.13    —          —         0.849   17416    2840
H          132     7.1     −0.97     —        —          —         0.844   17362    2894

Note: ( ) indicates t ratios of the estimated parameters. [ ] indicates standard errors of the estimated parameters.
residual mean square (RMS = 308.8/6 = 51.5) are the key statistics in comparing this model with simpler
models.
The regression sum of squares (RegSS) shows how much of the total variation (i.e., how much of the Total SS) has been explained by the fitted equation. For this model, RegSS = 20,255.5.
The coefficient of determination, commonly denoted as R², is the regression sum of squares expressed as a fraction of the total sum of squares. For the complete six-parameter model (Model A in Table 38.3), R² = 20256/20564 = 0.985, so it can be said that this model accounts for 98.5% of the total variation in the data.
It is natural to be fascinated by high R² values, and this tempts us to think that the goal of model building is to make R² as high as possible. Obviously, this can be done by putting more high-order terms into a model, but it should be equally obvious that this does not necessarily improve the predictions that will be made using the model. Increasing R² is the wrong goal. Instead of worrying about R² values, we should seek the simplest adequate model.
Selecting the “Best” Regression Model
The “best” model is the one that adequately describes the data with the fewest parameters. Table 38.3 summarizes parameter estimates, the coefficient of determination R², and the regression sum of squares for all eight possible linear models. The total sum of squares, of course, is the same in all eight cases because it depends on the data and not on the form of the model. Standard errors [SE] and t ratios (in parentheses) are given for the complete model, Model A.
One approach is to examine the t ratio for each parameter. Roughly speaking, if a parameter’s t ratio
is less than 2.5, the true value of the parameter could be zero and that term could be dropped from the
equation.
Another approach is to examine the confidence intervals of the estimated parameters. If this interval includes zero, the variable associated with the parameter can be dropped from the model. For example, in Model A, the coefficient of z² is b3 = −1.13 with standard error = 1.1 and 95% confidence interval [−3.88 to +1.62]. This confidence interval includes zero, indicating that the true value of b3 could be zero, and therefore the z² term can be tentatively dropped from the model. Fitting the simplified model (without z²) gives Model B in Table 38.3.
The standard error [SE] is the number in brackets. The half-width of the 95% confidence interval is a multiple of the standard error of the estimated value. The multiplier is a t statistic that depends on the selected level of confidence and the degrees of freedom. This multiplier is not the same value as the t ratio given in Table 38.3. Roughly speaking, if the degrees of freedom are large (n − p ≥ 20), the half-width of the confidence interval is about 2SE for a 95% confidence interval. If the degrees of freedom are small (n − p < 10), the multiplier will be in the range of 2.3SE to 3.0SE.
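The multiplier described here is the 97.5th percentile of Student's t distribution with n − p degrees of freedom. A sketch (not from the text; scipy is assumed, though any t table gives the same numbers):

```python
# The 95% CI half-width multiplier is the 97.5th percentile of
# Student's t with n - p degrees of freedom.
from scipy.stats import t as t_dist

for df in (6, 10, 20, 60):
    print(df, round(t_dist.ppf(0.975, df), 2))
# 6 -> 2.45, 10 -> 2.23, 20 -> 2.09, 60 -> 2.0
```

At df = 6 the multiplier is about 2.45, which is why the small-sample range quoted above runs from roughly 2.3SE to 3.0SE rather than 2SE.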
After modifying a model by adding, or in this case dropping, a term, an additional test should be
made to compare the regression sum of squares of the two models. Details of this test are given in
texts on regression analysis (Draper and Smith, 1998) and in Chapter 40. Here, the test is illustrated
by example.
The regression sum of squares for the complete model (Model A) is 20,256. Dropping the z² term to get Model B reduced the regression sum of squares by only 54. We need to consider that a reduction of 54 in the regression sum of squares may not be a statistically significant difference.
The reduction in the regression sum of squares due to dropping z² can be thought of as a variance associated with the z² term. If this variance is small compared to the variance of the pure experimental error, then the z² term contributes no real information and it should be dropped from the model. In contrast, if the variance associated with the z² term is large relative to the pure error variance, the term should remain in the model.
There were no repeated measurements in this experiment, so an independent estimate of the pure error variance cannot be computed. The best that can be done under the circumstances is to use the residual mean square of the complete model as an estimate of the pure error variance. The residual mean square for the complete model (Model A) is 51.5. This is compared with the difference in the regression sums of squares of the two models; the difference between Models A and B is 54. The ratio of the variance due to z² to the pure error variance is F = 54/51.5 = 1.05. This value is compared against the upper 5% point of the F distribution with 1 and 6 degrees of freedom: 1 for the numerator (the one parameter that was dropped from the model) and 6 for the denominator (the residual mean square). From Table C in the appendix, F(1, 6) = 5.99. Because 1.05 < 5.99, we conclude that removing the z² term does not result in a significant reduction in the regression sum of squares. Therefore, the z² term is not needed in the model.
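The extra-sum-of-squares F comparisons used in this section can be sketched as follows, with the RegSS differences and residual mean square taken from Tables 38.2 and 38.3 (scipy is assumed for the F percentiles):

```python
# Extra-sum-of-squares F tests; numbers from Tables 38.2 and 38.3.
from scipy.stats import f as f_dist

rms_full = 51.5  # residual mean square of the full Model A, 6 df

# A vs B: dropping z^2 (1 term) reduces RegSS by 54
F_ab = 54 / rms_full
print(round(F_ab, 2), round(f_dist.ppf(0.95, 1, 6), 2))  # 1.05 5.99

# A vs E: dropping t^2 (1 term) reduces RegSS by 2551
F_ae = 2551 / rms_full
print(round(F_ae, 1))  # 49.5

# A vs D: dropping z^2 and zt (2 terms) reduces RegSS by 343
F_ad = (343 / 2) / rms_full
print(round(F_ad, 2), round(f_dist.ppf(0.95, 2, 6), 2))  # 3.33 5.14
```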
The test used above is valid for comparing Model A with any model that has one less parameter. To compare Models A and E, notice that omitting t² decreases the regression sum of squares by 20256 − 17705 = 2551. The F statistic is 2551/51.5 = 49.5. Because 49.5 > 5.99 (the upper 95% point of the F distribution with 1 and 6 degrees of freedom), this change is significant and t² needs to be included in the model.
The test is modified slightly to compare Models A and D because Model D has two fewer terms than Model A. The decrease of 343 in the regression sum of squares results from dropping two terms (z² and zt). The F statistic is now computed using 343/2 in the numerator and 51.5 in the denominator: F = (343/2)/51.5 = 3.33. The upper 95% point of the appropriate reference distribution is F = 5.14, which has 2 degrees of freedom for the numerator and 6 degrees of freedom for the denominator. Because F for the model is less than the reference F (3.33 < 5.14), the z² and zt terms are not needed.
Model D is as good as Model A. Model D is the simplest adequate model:
Model D:  ŷ = 186 + 7.12z − 3.06t + 0.0143t²
This is the same model that was obtained by starting with the simplest possible model and adding terms
to make up for inadequacies.
Comments
The model building process uses regression to estimate the parameters, followed by diagnosis to decide whether the model should be modified by adding or dropping terms. The goal is not to maximize R², because this puts unneeded high-order terms into the polynomial model. The best model should have the fewest possible parameters because this will minimize the prediction error of the model.
One approach to finding the simplest adequate model is to start with a simple tentative model and use diagnostic checks, such as residual plots, for guidance. The alternative approach is to start by overfitting the data with a highly parameterized model and then find appropriate simplifications. Each time a term is added to or deleted from the model, a check is made on whether the difference in the regression sums of squares of the two models is large enough to justify modification of the model.
References
Berthouex, P. M. and D. K. Stevens (1982). “Computer Analysis of Settling Data,” J. Envr. Engr. Div., ASCE,
108, 1065–1069.
Camp, T. R. (1946). “Sedimentation and Design of Settling Tanks,” Trans. Am. Soc. Civil Engr., 3, 895–936.
Draper, N. R. and H. Smith (1998). Applied Regression Analysis, 3rd ed., New York, John Wiley.
Exercises
38.1 Settling Test. Find a polynomial model that describes the following data. The initial suspended
solids concentration was 560 mg/L. There are duplicate measurements at each time and depth.
            Susp. Solids Conc. at Time t (min)
Depth (ft)      20           40           60          120
2              135, 140      90, 100      75, 66      48, 40
4              170, 165     110, 117      90, 88      53, 46
6              180, 187     126, 121      96, 90      60, 63
38.2 Solid Waste Fuel Value. Exercise 3.5 includes a table that relates solid waste composition to
the fuel value. The fuel value was calculated from the Dulong model, which uses elemental
composition instead of the percentages of paper, food, metal, and plastic. Develop a model
to relate the percentages of paper, food, metals, glass, and plastic to the Dulong estimates of
fuel value. One proposed model is E(Btu/lb) = 23 Food + 82.8 Paper + 160 Plastic. Compare
your model to this.
38.3 Final Clarifier. An activated sludge final clarifier was operated at various conditions to evaluate the effects of overflow rate (OFR), feed rate, hydraulic detention time, and feed slurry concentration on effluent total suspended solids (TSS) and underflow solids concentration. The temperature was always in the range of 18.5 to 21°C. Runs 11–12, 13–14, and 15–16 are duplicates, so the pure experimental error can be estimated. (a) Construct a polynomial model to predict effluent TSS. (b) Construct a polynomial model to predict underflow solids concentration. (c) Are underflow solids and effluent TSS related?
Run    OFR      Feed Rate   Detention   Feed Slurry   Underflow   Effluent TSS
       (m/d)    (m/d)       Time (h)    (kg/m³)       (kg/m³)     (mg/L)
1      11.1     30.0        2.4         6.32          11.36        3.5
2      11.1     30.0        1.2         6.05          10.04        4.4
3      11.1     23.3        2.4         7.05          13.44        3.9
4      11.1     23.3        1.2         6.72          13.06        4.8
5      16.7     30.0        2.4         5.58          12.88        3.8
6      16.7     30.0        1.2         5.59          13.11        5.2
7      16.7     23.3        2.4         6.20          19.04        4.0
8      16.7     23.3        1.2         6.35          21.39        4.5
9      13.3     33.3        1.8         5.67           9.63        5.4
10     13.3     20.0        1.8         7.43          20.55        3.0
11     13.3     26.7        3.0         6.06          12.20        3.7
12     13.3     26.7        3.0         6.14          12.56        3.6
13     13.3     26.7        0.6         6.36          11.94        6.9
14     13.3     26.7        0.6         5.40          10.57        6.9
15     13.3     26.7        1.8         6.18          11.80        5.0
16     13.3     26.7        1.8         6.26          12.12        4.0

Source: Adapted from Deitz J. D. and T. M. Keinath, J. WPCF, 56, 344–350.
(Original values have been rounded.)
38.4 Final Clarification. The influence of three factors on clarification of activated sludge effluent
was investigated in 36 runs. Three runs failed because of overloading. The factors were solids
retention time = SRT, hydraulic retention time = HRT, and overflow rate (OFR). Interpret the
data.
Run   SRT    HRT    OFR     Eff TSS        Run   SRT    HRT    OFR     Eff TSS
      (d)    (h)    (m/d)   (mg/L)               (d)    (h)    (m/d)   (mg/L)
1     8      12     32.6    48             19    5      12     16.4    18
2     8      12     32.6    60             20    5      12      8.2    15
3     8      12     24.5    55             21    5       8     40.8    47
4     8      12     16.4    36             22    5       8     24.5    41
5     8      12      8.2    45             23    5       8      8.2    57
6     8       8     57.0    64             24    5       4     40.8    39
7     8       8     40.8    55             25    5       4     24.5    41
8     8       8     24.5    30             26    5       4      8.2    43
9     8       8      8.2    16             27    2      12     16.4    19
10    8       8      8.2    45             28    2      12     16.4    36
11    8       4     57.0    37             29    2      12      8.2    23
12    8       4     40.8    21             30    2       8     40.8    26
13    8       4     40.8    14             31    2       8     40.8    15
14    8       4     24.5     4             32    2       8     24.5    17
15    8       4      8.2    11             33    2       8      8.2    14
16    5      12     32.6    20             34    2       4     40.8    39
17    5      12     24.5    28             35    2       4     24.5    43
18    5      12     24.5    12             36    2       4      8.2    48

Source: Cashion B. S. and T. M. Keinath, J. WPCF, 55, 1331–1338.
39
The Coefficient of Determination, R²

KEY WORDS: coefficient of determination, coefficient of multiple correlation, confidence interval, F ratio, happenstance data, lack of fit, linear regression, nested model, null model, prediction interval, pure error, R², repeats, replication, regression, regression sum of squares, residual sum of squares, spurious correlation.
Regression analysis is so easy to do that one of the best-known statistics is the coefficient of determination, R². Anderson-Sprecher (1994) calls it “…a measure many statisticians love to hate.”
Every scientist knows that R² is the coefficient of determination and that R² is the proportion of the total variability in the dependent variable that is explained by the regression equation. This is so seductively simple that we often assume that a high R² signifies a useful regression equation and that a low R² signifies the opposite. We may even assume further that a high R² indicates that the observed relation between independent and dependent variables is true and can be used to predict new conditions.
Life is not this simple. Some examples will help us understand what R² really reveals about how well the model fits the data and what important information can be overlooked if too much reliance is placed on the interpretation of R².
What Does “Explained” Mean?

Caution is recommended in interpreting the phrase “R² explains the variation in the dependent variable.” R² is the proportion of variation in a variable Y that can be accounted for by fitting Y to a particular model instead of viewing the variable in isolation. R² does not explain anything in the sense that “Aha! Now we know why the response indicated by y behaves the way we have observed in this set of data.”
If the data are from a well-designed controlled experiment, with proper replication and randomization, it is reasonable to infer that a significant association of the variation in y with variation in the level of x is a causal effect of x. If the data are observational, what Box (1966) calls happenstance data, there is a high risk of a causal interpretation being wrong. With observational data there can be many reasons for associations among variables, only one of which is causality.
A value of R² is not just a rescaled measure of variation. It is a comparison between two models. One of the models is usually referred to as the model. The other model, the null model, is usually never mentioned. The null model (y = β₀) provides the reference for comparison. This model describes a horizontal line at the level of the mean of the y values, which is the simplest possible model that could be fitted to any set of data.

• The model (yᵢ = β₀ + β₁xᵢ + β₂xᵢ² + … + eᵢ) has residual sum of squares Σ(yᵢ − ŷᵢ)² = RSS_model.
• The null model (yᵢ = β₀ + eᵢ) has residual sum of squares Σ(yᵢ − ȳ)² = RSS_null model.

The comparison of the residual sums of squares (RSS) defines:

    R² = 1 − RSS_model / RSS_null model
This shows that R² is a model comparison and that a large R² measures only how much the model improves on the null model. It does not indicate how good the model is in any absolute sense. Consequently, the common belief that a large R² demonstrates model adequacy is sometimes wrong.
The definition of R² also shows that comparisons are made only between nested models. The concept of proportionate reduction in variation is untrustworthy unless one model is a special case of the other. This means that R² cannot be used to compare models with an intercept to models that have no intercept: y = β₀ is not a reduction of the model y = β₁x. It is a reduction of y = β₀ + β₁x and of y = β₀ + β₁x + β₂x².
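The definition above can be written as a small function, with the null model (predict the mean of y) made explicit. A sketch, not from the text:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS_model / RSS_null, where the null model predicts mean(y)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss_model = np.sum((y - y_hat) ** 2)
    rss_null = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss_model / rss_null

print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))          # perfect fit: 1.0
print(r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # the null model itself: 0.0
```

The two extreme cases make the "model comparison" reading concrete: a perfect fit improves on the null model completely (R² = 1), while a model that predicts the mean improves on it not at all (R² = 0).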
A High R² Does Not Assure a Valid Relation

Figure 39.1 shows a regression with R² = 0.746, which is statistically significant at almost the 1% level (a 1% chance of concluding significance when there is no true relation). This might be impressive until one knows the source of the data: X is the first six digits of pi, and Y is the first six Fibonacci numbers. There is no true relation between x and y. The linear regression equation has no predictive value (the seventh digit of pi does not predict the seventh Fibonacci number).
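The pi-and-Fibonacci regression is easy to reproduce. A NumPy sketch (not from the text):

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9], dtype=float)  # first six digits of pi
y = np.array([1, 1, 2, 3, 5, 8], dtype=float)  # first six Fibonacci numbers

slope, intercept = np.polyfit(x, y, 1)  # highest degree first
y_hat = intercept + slope * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(intercept, 2), round(slope, 2), round(r2, 3))  # 0.31 0.79 0.746
```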
Anscombe (1973) published a famous and fascinating example of how R² and other statistics that are routinely computed in regression analysis can fail to reveal the important features of the data. Table 39.1
[Figure 39.1: scatter plot of Y against X with fitted line Y = 0.31 + 0.79X, R² = 0.746.]
FIGURE 39.1 An example of nonsense in regression. X is the first six digits of pi and Y is the first six Fibonacci numbers. R² is high although there is no actual relation between x and y.
TABLE 39.1
Anscombe’s Four Data Sets

       A                B                C                D
  x       y        x       y        x       y        x       y
10.0    8.04     10.0    9.14     10.0    7.46      8.0    6.58
 8.0    6.95      8.0    8.14      8.0    6.77      8.0    5.76
13.0    7.58     13.0    8.74     13.0   12.74      8.0    7.71
 9.0    8.81      9.0    8.77      9.0    7.11      8.0    8.84
11.0    8.33     11.0    9.26     11.0    7.81      8.0    8.47
14.0    9.96     14.0    8.10     14.0    8.84      8.0    7.04
 6.0    7.24      6.0    6.13      6.0    6.08      8.0    5.25
 4.0    4.26      4.0    3.10      4.0    5.39     19.0   12.50
12.0   10.84     12.0    9.13     12.0    8.15      8.0    5.56
 7.0    4.82      7.0    7.26      7.0    6.42      8.0    7.91
 5.0    5.68      5.0    4.74      5.0    5.73      8.0    6.89

Note: Each data set has n = 11, mean of x = 9.0, mean of y = 7.5, equation of the regression line y = 3.0 + 0.5x, standard error of estimate of the slope = 0.118 (t statistic = 4.24), regression sum of squares (corrected for mean) = 110.0, residual sum of squares = 13.75, correlation coefficient r = 0.82, and R² = 0.67.
Source: Anscombe, F. J. (1973). Am. Stat., 27, 17–21.
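The note's claim that all four data sets share the same fitted line and R² can be verified numerically. A NumPy sketch (not from the text):

```python
import numpy as np

x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
sets = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8] * 7 + [19] + [8] * 3,
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {}
for name, (xs, ys) in sets.items():
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    slope, intercept = np.polyfit(xs, ys, 1)
    y_hat = intercept + slope * xs
    r2 = 1 - np.sum((ys - y_hat) ** 2) / np.sum((ys - ys.mean()) ** 2)
    results[name] = (round(float(intercept), 1), round(float(slope), 2), round(float(r2), 2))
    print(name, results[name])  # each set gives about (3.0, 0.5, 0.67)
```

Identical summary statistics from four very different scatter patterns is the point of the example: the fitted line and R² alone cannot distinguish them, which is why residual plots matter.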