
Chapter 38. Empirical Model Building by Linear Regression






FIGURE 38.1 Solids concentration as a function of depth at time t. The initial condition (t = 0), a uniform profile C0 over depth Z with mass of solids M = C0 Z A, is shown on the left. The condition at settling time t, a profile C(z, t) with mass of solids M = A ∫₀^Z C(z, t) dz, is shown on the right.



The fraction of solids removed in a settling tank at any depth z, for a detention time t, is estimated as:

R(z, t) = [A Z C0 − A ∫₀^z C(z, t) dz] / (A Z C0) = 1 − (1/(Z C0)) ∫₀^z C(z, t) dz



This integral could be calculated graphically (Camp, 1946), or an approximating polynomial can be derived for the concentration curve so that the fraction of solids removed (R) can be calculated algebraically. Suppose, for example, that:

C(z, t) = 167 − 2.74t + 11.9z − 0.08zt + 0.014t²

is a satisfactory empirical model and we want to use this model to predict the removal that will be achieved with a 60-min detention time, a depth of 8 ft, and an initial concentration of 500 mg/L. The solids concentration profile as a function of depth at t = 60 min is:

C(z, 60) = 167 − 2.74(60) + 11.9z − 0.08z(60) + 0.014(60)² = 53.0 + 7.1z



This is integrated over depth (Z = 8 ft) to give the fraction of solids that are expected to be removed:

R(z = 8, t = 60) = 1 − [1/(8(500))] ∫_{z=0}^{z=8} (53.0 + 7.1z) dz
                 = 1 − [1/(8(500))] (53(8) + 3.55(8)²) = 0.84
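The arithmetic is easy to check numerically. A minimal sketch in Python (using scipy's quad; the numbers are those of the worked example):

```python
# Sketch: numerical check of the removal fraction in the worked example.
from scipy.integrate import quad

C0, Z = 500.0, 8.0                      # initial concentration (mg/L), depth (ft)
profile = lambda z: 53.0 + 7.1 * z      # C(z, 60): concentration profile at t = 60 min

integral, _ = quad(profile, 0.0, Z)     # integral of C(z, 60) dz from 0 to Z
R = 1.0 - integral / (Z * C0)           # fraction of solids removed
print(round(R, 2))                      # -> 0.84
```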

The model building problem is to determine the form of the polynomial function and to estimate the

coefficients of the terms in the function.



Method: Linear Regression

Suppose the correct model for the process is η = f(β, x) and the observations are yi = f(β, xi) + ei, where the ei are random errors. There may be several parameters (β) and several independent variables (x). According to the least squares criterion, the best estimates of the β's minimize the sum of the squared residuals:

minimize S(β) = Σ (yi − ηi)²

where the summation is over all observations.






The minimum sum of squares is called the residual sum of squares, RSS. The residual mean square

(RMS) is the residual sum of squares divided by its degrees of freedom. RMS = RSS /(n − p), where n =

number of observations and p = number of parameters estimated.
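These definitions translate directly into code. A minimal sketch in Python (using numpy; the straight-line data are invented for illustration):

```python
# Sketch: least squares estimation, with the residual sum of squares (RSS)
# and residual mean square (RMS = RSS/(n - p)).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # illustrative data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

X = np.column_stack([np.ones_like(x), x])        # design matrix for y = b0 + b1*x
b, *_ = np.linalg.lstsq(X, y, rcond=None)        # least squares estimates of b0, b1

RSS = np.sum((y - X @ b) ** 2)                   # residual sum of squares
n, p = len(y), X.shape[1]
RMS = RSS / (n - p)                              # residual mean square
print(b, RSS, RMS)
```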



Case Study: Solution

A column settling test was done on a suspension with initial concentration of 560 mg/L. Samples were

taken at depths of 2, 4, and 6 ft (measured from the water surface) at times 20, 40, 60, and 120 min;

the data are in Table 38.1. The simplest possible model is:

C(z, t) = β0 + β1t

The most complicated model that might be needed is a full quadratic function of time and depth:

C(z, t) = β0 + β1t + β2t² + β3z + β4z² + β5zt

We can start the model building process with either of these and add or drop terms as needed.

Fitting the simplest possible model involving time and depth gives:

ŷ = 132.3 + 7.12z − 0.97t

which has R² = 0.844 and residual mean square = 355.82. R², the coefficient of determination, is the percentage of the total variation in the data that is accounted for by fitting the model (Chapter 39). Figure 38.2a shows the diagnostic residual plots for the model. The residuals plotted against the predicted values are not random. This suggests an inadequacy in the model, but it does not tell us how the model might be improved.

TABLE 38.1
Data from a Laboratory Settling Column Test

Depth   Suspended Solids Concentration (mg/L) at Time t (min)
(ft)       20       40       60      120
2         135       90       75       48
4         170      110       90       53
6         180      126       96       60



FIGURE 38.2 (a) Residuals plotted against the predicted suspended solids concentrations are not random. (b) Residuals

plotted against settling time suggest that a quadratic term is needed in the model.







TABLE 38.2
Analysis of Variance for the Six-Parameter Settling Linear Model

Due to                   df       SS        MS = SS/df
Regression (RegSS)        5     20255.5      4051.1
Residuals (RSS)           6       308.8        51.5
Total (Total SS)         11     20564.2

FIGURE 38.3 Plot of residuals against the predicted values of the regression model ŷ = 185.97 + 7.125z − 3.057t + 0.014t².



The pattern of the residuals plotted against time (Figure 38.2b) suggests that adding a t² term may be helpful. This was done to obtain:

ŷ = 186.0 + 7.12z − 3.06t + 0.0143t²

which has R² = 0.97 and residual mean square = 81.5. A diagnostic plot of the residuals (Figure 38.3) reveals no inadequacies. Similar plots of residuals against the independent variables also support the model. This model is adequate to describe the data.
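The fit itself is a one-line least squares problem once the design matrix is set up. A sketch in Python (numpy) using the Table 38.1 data; the estimates should come out close to the coefficients quoted above:

```python
# Sketch: fit y = b0 + b1*z + b2*t + b3*t^2 to the Table 38.1 settling data.
import numpy as np

z = np.array([2, 4, 6] * 4, dtype=float)               # depths (ft)
t = np.repeat([20, 40, 60, 120], 3).astype(float)      # settling times (min)
y = np.array([135, 170, 180,  90, 110, 126,
               75,  90,  96,  48,  53,  60], dtype=float)  # SS conc. (mg/L)

X = np.column_stack([np.ones_like(z), z, t, t**2])     # design matrix
b, *_ = np.linalg.lstsq(X, y, rcond=None)

RSS = np.sum((y - X @ b) ** 2)
TotalSS = np.sum((y - y.mean()) ** 2)
print(b)                  # approximately [186.0, 7.12, -3.06, 0.0143]
print(1 - RSS / TotalSS)  # R^2, approximately 0.97
```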

The most complicated model, which has six parameters, is:

ŷ = 152 + 20.9z − 2.74t − 1.13z² + 0.0143t² − 0.080zt



The model contains quadratic terms for time and depth and the interaction of depth and time (zt). The

analysis of variance for this model is given in Table 38.2. This information is produced by computer

programs that do linear regression. For now we do not need to know how to calculate this, but we should

understand how it is interpreted.

Across the top, SS is sum of squares and df = degrees of freedom associated with a sum of squares

quantity. MS is mean square, where MS = SS/df. The sum of squares due to regression is the regression

sum of squares (RegSS): RegSS = 20,255.5. The sum of squares due to residuals is the residual sum of

squares (RSS); RSS = 308.8. The total sum of squares, or Total SS, is:

Total SS = RegSS + RSS

Also:

Total SS = Σ (yi − ȳ)²

The residual sum of squares (RSS) is the minimum sum of squares that results from estimating the parameters by least squares. It is the variation that is not explained by fitting the model. If the model is correct, the RSS is the variation in the data due to random measurement error. For this model, RSS = 308.8. The residual mean square is the RSS divided by the degrees of freedom of the residual sum of squares. For RSS, the degrees of freedom is df = n − p, where n is the number of observations and p is the number of parameters in the fitted model. Thus, RMS = RSS/(n − p). The residual sum of squares (RSS = 308.8) and the residual mean square (RMS = 308.8/6 = 51.5) are the key statistics in comparing this model with simpler models.
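The ANOVA quantities in Table 38.2 follow from the same kind of fit. A sketch in Python (numpy) for the six-parameter model:

```python
# Sketch: ANOVA decomposition (RegSS, RSS, RMS) for the six-parameter model.
import numpy as np

z = np.array([2, 4, 6] * 4, dtype=float)
t = np.repeat([20, 40, 60, 120], 3).astype(float)
y = np.array([135, 170, 180,  90, 110, 126,
               75,  90,  96,  48,  53,  60], dtype=float)

X = np.column_stack([np.ones_like(z), z, t, z**2, t**2, z * t])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

RSS = np.sum((y - X @ b) ** 2)            # residual sum of squares (~308.8)
TotalSS = np.sum((y - y.mean()) ** 2)     # total sum of squares (~20564.2)
RegSS = TotalSS - RSS                     # regression sum of squares (~20255.5)
n, p = len(y), X.shape[1]
print(RegSS, RSS, RSS / (n - p))          # RMS = RSS/(n - p), ~51.5
```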






TABLE 38.3
Summary of All Possible Regressions for the Settling Test Model

                            Coefficient of the Term
Model      b0      b1 z     b2 t      b3 z²     b4 t²      b5 tz      R²      RegSS    Decrease in RegSS
A          152     20.9     −2.74     −1.13     0.014      −0.08     0.985    20256
 (t ratio)         (2.3)    (8.3)     (1.0)     (7.0)      (2.4)
 [SE]              [9.1]    [0.33]    [1.1]     [0.002]    [0.03]
B          167     11.9     −2.74               0.014      −0.08     0.982    20202          54
C          171     16.1     −3.06     −1.13     0.014                0.971    19966         289
D          186      7.1     −3.06               0.014                0.968    19912         343
E           98     20.9     −0.65     −1.13                −0.08     0.864    17705        2550
F          113     11.9     −0.65                          −0.08     0.858    17651        2605
G          117     16.1     −0.97     −1.13                          0.849    17416        2840
H          132      7.1     −0.97                                    0.844    17362        2894

Note: ( ) indicates t ratios of the estimated parameters. [ ] indicates standard errors of the estimated parameters.




The regression sum of squares (RegSS) shows how much of the total variation (i.e., how much of the

Total SS) has been explained by the fitted equation. For this model, RegSS = 20,255.5.

The coefficient of determination, commonly denoted as R², is the regression sum of squares expressed as a fraction of the total sum of squares. For the complete six-parameter model (Model A in Table 38.3), R² = 20256/20564 = 0.985, so it can be said that this model accounts for 98.5% of the total variation in the data.

It is natural to be fascinated by high R² values and this tempts us to think that the goal of model building is to make R² as high as possible. Obviously, this can be done by putting more high-order terms into a model, but it should be equally obvious that this does not necessarily improve the predictions that will be made using the model. Increasing R² is the wrong goal. Instead of worrying about R² values, we should seek the simplest adequate model.



Selecting the “Best” Regression Model

The "best" model is the one that adequately describes the data with the fewest parameters. Table 38.3 summarizes the parameter estimates, the coefficient of determination R², and the regression sum of squares for all eight possible linear models. The total sum of squares, of course, is the same in all eight cases because it depends on the data and not on the form of the model. Standard errors [SE] and t ratios (in parentheses) are given for the complete model, Model A.

One approach is to examine the t ratio for each parameter. Roughly speaking, if a parameter’s t ratio

is less than 2.5, the true value of the parameter could be zero and that term could be dropped from the

equation.

Another approach is to examine the confidence intervals of the estimated parameters. If this interval includes zero, the variable associated with the parameter can be dropped from the model. For example, in Model A, the coefficient of z² is b3 = −1.13 with standard error = 1.1 and 95% confidence interval [−3.88 to +1.62]. This confidence interval includes zero, indicating that the true value of b3 could well be zero, and therefore the z² term can be tentatively dropped from the model. Fitting the simplified model (without z²) gives Model B in Table 38.3.

The standard error [SE] is the number in brackets. The half-width of the 95% confidence interval is a multiple of the standard error of the estimated value. The multiplier is a t statistic that depends on the selected level of confidence and the degrees of freedom. This multiplier is not the same value as the t ratio given in Table 38.3. Roughly speaking, if the degrees of freedom are large (n − p ≥ 20), the half-width of the confidence interval is about 2SE for a 95% confidence interval. If the degrees of freedom are small (n − p < 10), the multiplier will be in the range of 2.3SE to 3.0SE.
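Standard errors, t ratios, and confidence intervals all come from the covariance matrix of the estimates. A sketch in Python (numpy and scipy; same data and model as above):

```python
# Sketch: standard errors, t ratios, and 95% confidence interval half-widths
# for the six-parameter settling model.
import numpy as np
from scipy import stats

z = np.array([2, 4, 6] * 4, dtype=float)
t = np.repeat([20, 40, 60, 120], 3).astype(float)
y = np.array([135, 170, 180,  90, 110, 126,
               75,  90,  96,  48,  53,  60], dtype=float)

X = np.column_stack([np.ones_like(z), z, t, z**2, t**2, z * t])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

n, p = X.shape
RMS = np.sum((y - X @ b) ** 2) / (n - p)       # estimate of the error variance
cov = RMS * np.linalg.inv(X.T @ X)             # covariance matrix of the estimates
SE = np.sqrt(np.diag(cov))                     # standard errors [SE]
t_ratio = b / SE                               # t ratios (cf. Table 38.3)
half_width = stats.t.ppf(0.975, n - p) * SE    # 95% CI half-width (multiplier * SE)
print(np.column_stack([b, SE, t_ratio, half_width]))
```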







After modifying a model by adding, or in this case dropping, a term, an additional test should be

made to compare the regression sum of squares of the two models. Details of this test are given in

texts on regression analysis (Draper and Smith, 1998) and in Chapter 40. Here, the test is illustrated

by example.

The regression sum of squares for the complete model (Model A) is 20,256. Dropping the z² term to get Model B reduced the regression sum of squares by only 54. We need to consider that a reduction of 54 in the regression sum of squares may not be a statistically significant difference.

2

The reduction in the regression sum of squares due to dropping z can be thought of as a variance

2

associated with the z term. If this variance is small compared to the variance of the pure experimental

2

error, then the term z contributes no real information and it should be dropped from the model. In

2

contrast, if the variance associated with the z term is large relative to the pure error variance, the term

should remain in the model.

There were no repeated measurements in this experiment, so an independent estimate of the variance due to pure error cannot be computed. The best that can be done under the circumstances is to use the residual mean square of the complete model as an estimate of the pure error variance. The residual mean square for the complete model (Model A) is 51.5. This is compared with the difference in the regression sums of squares of the two models; the difference between Models A and B is 54. The ratio of the variance due to z² and the pure error variance is F = 54/51.5 = 1.05. This value is compared against the upper 5% point of the F distribution (1, 6 degrees of freedom). The degrees of freedom are 1 for the numerator (1 degree of freedom for the one parameter that was dropped from the model) and 6 for the denominator (the degrees of freedom of the residual mean square). From Table C in the appendix, F1,6 = 5.99. Because 1.05 < 5.99, we conclude that removing the z² term does not result in a significant reduction in the regression sum of squares. Therefore, the z² term is not needed in the model.
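The extra-sum-of-squares comparison is mechanical enough to script. A sketch in Python (scipy); the general form divides the drop in RegSS by the number of dropped terms, q:

```python
# Sketch: F tests for dropping terms from the complete model (Model A).
from scipy import stats

RMS_full = 51.5                  # residual mean square of Model A
df_resid = 6                     # residual degrees of freedom of Model A

def f_test(drop_in_regss, q):
    """F statistic for dropping q terms, with the upper 5% critical value."""
    F = (drop_in_regss / q) / RMS_full
    return F, stats.f.ppf(0.95, q, df_resid)

print(f_test(54, 1))             # A vs B (drop z^2):     F = 1.05 < 5.99
print(f_test(2551, 1))           # A vs E (drop t^2):     F = 49.5 > 5.99
print(f_test(343, 2))            # A vs D (drop z^2, zt): F = 3.33 < 5.14
```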

The test used above is valid for comparing Model A with any of the models that have one less parameter than Model A. To compare Models A and E, notice that omitting t² decreases the regression sum of squares by 20256 − 17705 = 2551. The F statistic is 2551/51.5 = 49.5. Because 49.5 > 5.99 (the upper 95% point of the F distribution with 1 and 6 degrees of freedom), this change is significant and t² needs to be included in the model.

The test is modified slightly to compare Models A and D because Model D has two fewer terms than Model A. The decrease of 343 in the regression sum of squares results from dropping two terms (z² and zt). The F statistic is now computed using 343/2 in the numerator and 51.5 in the denominator: F = (343/2)/51.5 = 3.33. The upper 95% point of the appropriate reference distribution is F = 5.14, which has 2 degrees of freedom for the numerator and 6 degrees of freedom for the denominator. Because F for the model is less than the reference F (F = 3.33 < 5.14), the terms z² and zt are not needed. Model D is as good as Model A. Model D is the simplest adequate model:

Model D:  ŷ = 186 + 7.12z − 3.06t + 0.0143t²



This is the same model that was obtained by starting with the simplest possible model and adding terms

to make up for inadequacies.



Comments

The model building process uses regression to estimate the parameters, followed by diagnosis to decide whether the model should be modified by adding or dropping terms. The goal is not to maximize R², because this puts unneeded high-order terms into the polynomial model. The best model should have the fewest possible parameters because this will minimize the prediction error of the model.

One approach to finding the simplest adequate model is to start with a simple tentative model and use diagnostic checks, such as residual plots, for guidance. The alternate approach is to start by overfitting the data with a highly parameterized model and then find appropriate simplifications. Each time a term is added to or deleted from the model, a check is made on whether the difference in the regression sums of squares of the two models is large enough to justify modification of the model.



References

Berthouex, P. M. and D. K. Stevens (1982). "Computer Analysis of Settling Data," J. Envr. Engr. Div., ASCE, 108, 1065–1069.

Camp, T. R. (1946). “Sedimentation and Design of Settling Tanks,” Trans. Am. Soc. Civil Engr., 3, 895–936.

Draper, N. R. and H. Smith (1998). Applied Regression Analysis, 3rd ed., New York, John Wiley.



Exercises

38.1 Settling Test. Find a polynomial model that describes the following data. The initial suspended

solids concentration was 560 mg/L. There are duplicate measurements at each time and depth.



Depth   Susp. Solids Conc. (mg/L) at Time t (min)
(ft)        20          40          60          120
2        135, 140     90, 100     75, 66      48, 40
4        170, 165    110, 117     90, 88      53, 46
6        180, 187    126, 121     96, 90      60, 63



38.2 Solid Waste Fuel Value. Exercise 3.5 includes a table that relates solid waste composition to

the fuel value. The fuel value was calculated from the Dulong model, which uses elemental

composition instead of the percentages of paper, food, metal, and plastic. Develop a model

to relate the percentages of paper, food, metals, glass, and plastic to the Dulong estimates of

fuel value. One proposed model is E(Btu/lb) = 23 Food + 82.8 Paper + 160 Plastic. Compare

your model to this.

38.3 Final Clarifier. An activated sludge final clarifier was operated at various levels of overflow rate (OFR) to evaluate the effects of OFR, feed rate, hydraulic detention time, and feed slurry concentration on effluent total suspended solids (TSS) and underflow solids concentration. The temperature was always in the range of 18.5 to 21°C. Runs 11–12, 13–14, and 15–16 are duplicates, so the pure experimental error can be estimated. (a) Construct a polynomial model to predict effluent TSS. (b) Construct a polynomial model to predict underflow solids concentration. (c) Are underflow solids and effluent TSS related?



Run    OFR     Feed Rate   Detention    Feed Slurry   Underflow   Effluent TSS
       (m/d)   (m/d)       Time (h)     (kg/m³)       (kg/m³)     (mg/L)
1      11.1    30.0        2.4          6.32          11.36       3.5
2      11.1    30.0        1.2          6.05          10.04       4.4
3      11.1    23.3        2.4          7.05          13.44       3.9
4      11.1    23.3        1.2          6.72          13.06       4.8
5      16.7    30.0        2.4          5.58          12.88       3.8
6      16.7    30.0        1.2          5.59          13.11       5.2
7      16.7    23.3        2.4          6.20          19.04       4.0
8      16.7    23.3        1.2          6.35          21.39       4.5
9      13.3    33.3        1.8          5.67           9.63       5.4
10     13.3    20.0        1.8          7.43          20.55       3.0
11     13.3    26.7        3.0          6.06          12.20       3.7
12     13.3    26.7        3.0          6.14          12.56       3.6
13     13.3    26.7        0.6          6.36          11.94       6.9
14     13.3    26.7        0.6          5.40          10.57       6.9
15     13.3    26.7        1.8          6.18          11.80       5.0
16     13.3    26.7        1.8          6.26          12.12       4.0

Source: Adapted from Deitz, J. D. and T. M. Keinath, J. WPCF, 56, 344–350. (Original values have been rounded.)



38.4 Final Clarification. The influence of three factors on clarification of activated sludge effluent was investigated in 36 runs. Three runs failed because of overloading. The factors were solids retention time (SRT), hydraulic retention time (HRT), and overflow rate (OFR). Interpret the data.



Run   SRT   HRT   OFR     Eff TSS      Run   SRT   HRT   OFR     Eff TSS
      (d)   (h)   (m/d)   (mg/L)             (d)   (h)   (m/d)   (mg/L)
1     8     12    32.6    48           19    5     12    16.4    18
2     8     12    32.6    60           20    5     12     8.2    15
3     8     12    24.5    55           21    5      8    40.8    47
4     8     12    16.4    36           22    5      8    24.5    41
5     8     12     8.2    45           23    5      8     8.2    57
6     8      8    57.0    64           24    5      4    40.8    39
7     8      8    40.8    55           25    5      4    24.5    41
8     8      8    24.5    30           26    5      4     8.2    43
9     8      8     8.2    16           27    2     12    16.4    19
10    8      8     8.2    45           28    2     12    16.4    36
11    8      4    57.0    37           29    2     12     8.2    23
12    8      4    40.8    21           30    2      8    40.8    26
13    8      4    40.8    14           31    2      8    40.8    15
14    8      4    24.5     4           32    2      8    24.5    17
15    8      4     8.2    11           33    2      8     8.2    14
16    5     12    32.6    20           34    2      4    40.8    39
17    5     12    24.5    28           35    2      4    24.5    43
18    5     12    24.5    12           36    2      4     8.2    48

Source: Cashion, B. S. and T. M. Keinath, J. WPCF, 55, 1331–1338.









Chapter 39. The Coefficient of Determination, R²

KEY WORDS: coefficient of determination, coefficient of multiple correlation, confidence interval, F ratio, happenstance data, lack of fit, linear regression, nested model, null model, prediction interval, pure error, R², repeats, replication, regression, regression sum of squares, residual sum of squares, spurious correlation.



Regression analysis is so easy to do that one of the best-known statistics is the coefficient of determination, R². Anderson-Sprecher (1994) calls it "…a measure many statisticians love to hate." Every scientist knows that R² is the coefficient of determination and that R² is the proportion of the total variability in the dependent variable that is explained by the regression equation. This is so seductively simple that we often assume that a high R² signifies a useful regression equation and that a low R² signifies the opposite. We may even assume further that a high R² indicates that the observed relation between independent and dependent variables is true and can be used to predict new conditions. Life is not this simple. Some examples will help us understand what R² really reveals about how well the model fits the data and what important information can be overlooked if too much reliance is placed on the interpretation of R².



What Does “Explained” Mean?

Caution is recommended in interpreting the phrase "R² explains the variation in the dependent variable." R² is the proportion of variation in a variable Y that can be accounted for by fitting Y to a particular model instead of viewing the variable in isolation. R² does not explain anything in the sense that "Aha! Now we know why the response indicated by y behaves the way we have observed in this set of data." If the data are from a well-designed controlled experiment, with proper replication and randomization, it is reasonable to infer that a significant association of the variation in y with variation in the level of x is a causal effect of x. If the data had been observational, what Box (1966) calls happenstance data, there is a high risk of a causal interpretation being wrong. With observational data there can be many reasons for associations among variables, only one of which is causality.

A value of R² is not just a rescaled measure of variation. It is a comparison between two models. One of the models is usually referred to as the model. The other model, the null model, is usually never mentioned. The null model (y = β0) provides the reference for comparison. This model describes a horizontal line at the level of the mean of the y values, which is the simplest possible model that could be fitted to any set of data.

• The model (y = β0 + β1x + β2x² + … + ei) has residual sum of squares Σ(yi − ŷi)² = RSS_model.
• The null model (y = β0 + ei) has residual sum of squares Σ(yi − ȳ)² = RSS_null.

The comparison of the residual sums of squares (RSS) defines:

R² = 1 − RSS_model / RSS_null
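In code, the comparison is direct. A minimal sketch in Python (numpy; the observations and predictions are invented for illustration):

```python
# Sketch: R^2 as a comparison of a fitted model against the null model.
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS_model/RSS_null, where the null model is y = mean(y)."""
    rss_model = np.sum((y - y_hat) ** 2)        # residual SS of the fitted model
    rss_null = np.sum((y - np.mean(y)) ** 2)    # residual SS of the null model
    return 1.0 - rss_model / rss_null

y = np.array([2.0, 4.1, 6.2, 8.1])              # illustrative observations
y_hat = np.array([2.1, 4.0, 6.1, 8.2])          # predictions from some fitted model
print(r_squared(y, y_hat))
```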









This shows that R² is a model comparison and that a large R² measures only how much the model improves on the null model. It does not indicate how good the model is in any absolute sense. Consequently, the common belief that a large R² demonstrates model adequacy is sometimes wrong.

The definition of R² also shows that comparisons are made only between nested models. The concept of proportionate reduction in variation is untrustworthy unless one model is a special case of the other. This means that R² cannot be used to compare models with an intercept against models that have no intercept: y = β0 is not a reduction of the model y = β1x. It is a reduction of y = β0 + β1x and of y = β0 + β1x + β2x².



A High R² Does Not Assure a Valid Relation

Figure 39.1 shows a regression with R² = 0.746, which is statistically significant at almost the 1% level of confidence (a 1% chance of concluding significance when there is no true relation). This might be impressive until one knows the source of the data. X is the first six digits of pi, and Y is the first six Fibonacci numbers. There is no true relation between x and y. The linear regression equation has no predictive value (the seventh digit of pi does not predict the seventh Fibonacci number).
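The nonsense fit is easy to reproduce. A sketch in Python (numpy):

```python
# Sketch: the pi/Fibonacci "nonsense" regression of Figure 39.1.
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9], dtype=float)    # first six digits of pi
y = np.array([1, 1, 2, 3, 5, 8], dtype=float)    # first six Fibonacci numbers

slope, intercept = np.polyfit(x, y, 1)           # least squares straight line
y_hat = intercept + slope * x
R2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
print(intercept, slope, R2)                      # ~0.31, ~0.79, ~0.746
```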

2

Anscombe (1973) published a famous and fascinating example of how R² and other statistics that are routinely computed in regression analysis can fail to reveal the important features of the data. Table 39.1 gives his four data sets.

FIGURE 39.1 An example of nonsense in regression. X is the first six digits of pi and Y is the first six Fibonacci numbers (fitted line Y = 0.31 + 0.79X, R² = 0.746). R² is high although there is no actual relation between x and y.



TABLE 39.1
Anscombe's Four Data Sets

      A              B              C              D
  x      y       x      y       x      y       x      y
10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
 8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
 9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
 6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
 4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
 7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
 5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

Note: Each data set has n = 11, mean of x = 9.0, mean of y = 7.5, equation of the regression line y = 3.0 + 0.5x, standard error of estimate of the slope = 0.118 (t statistic = 4.24), regression sum of squares (corrected for mean) = 110.0, residual sum of squares = 13.75, correlation coefficient r = 0.82, and R² = 0.67.

Source: Anscombe, F. J. (1973). Am. Stat., 27, 17–21.
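Anscombe's point can be confirmed numerically: all four data sets give essentially the same fitted line and R². A sketch in Python (numpy); only plotting the sets reveals how different they are:

```python
# Sketch: near-identical regression summaries for Anscombe's four data sets.
import numpy as np

x_abc = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
sets = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8.0] * 7 + [19.0] + [8.0] * 3,
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
for name, (x, y) in sets.items():
    x, y = np.asarray(x), np.asarray(y)
    b1, b0 = np.polyfit(x, y, 1)                 # slope, intercept
    y_hat = b0 + b1 * x
    R2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
    print(name, round(b0, 2), round(b1, 2), round(R2, 2))  # ~3.0, ~0.5, ~0.67
```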



