
Chapter 40. Regression Analysis with Categorical Variables






FIGURE 40.1 The relation of pH and weak acidity data of Cosby Creek after three storms. (Axes: pH, 5.5 to 7.0, versus weak acidity, 0 to 700 µg/L.)



Begin by considering data from a single category. The quantitative predictor variable x1 can be used to predict the dependent variable y1 using the linear model:

y1i = β0 + β1 x1i + ei

where β0 and β1 are parameters to be estimated by least squares.

If there are data from two categories (e.g., data produced at two different laboratories), one approach

would be to model the two sets of data separately as:

y1i = α0 + α1 x1i + ei

and

y2i = β0 + β1 x2i + ei

and then to compare the estimated intercepts (α0 and β0) and the estimated slopes (α1 and β1) using confidence intervals or t-tests.

A second, and often better, method is to simultaneously fit a single augmented model to all the data.

To construct this model, define a categorical variable Z as follows:



Z = 0 if the data are in the first category
Z = 1 if the data are in the second category



The augmented model is:

yi = α0 + α1 xi + Z(β0 + β1 xi) + ei

With some rearrangement:

yi = α0 + β0 Z + α1 xi + β1 Zxi + ei

In this last form the regression is done as though there are three independent variables: x, Z, and Zx. The vectors Z and Zx have to be created from the categorical variable defined above. The four parameters α0, β0, α1, and β1 are estimated by linear regression.

A model for each category can be obtained by substituting the defined values. For the first category,

Z = 0 and:

yi = α0 + α1 xi + ei




Slopes different, intercepts different:  yi = (α0 + β0) + (α1 + β1)xi + ei
Slopes different, intercepts equal:      yi = α0 + (α1 + β1)xi + ei
Slopes equal, intercepts different:      yi = (α0 + β0) + α1 xi + ei
Slopes equal, intercepts equal:          yi = α0 + α1 xi + ei

FIGURE 40.2 Four possible models to fit a straight line to data in two categories.



FIGURE 40.3 Model with two categories having different intercepts but equal slopes. Category 1 (intercept α0): y = α0 + α1x + e; category 2 (intercept α0 + β0): y = (α0 + β0) + α1x + e; the slope is α1 for both lines. (The complete model is y = (α0 + β0) + (α1 + β1)x + e.)



For the second category, Z = 1 and:

yi = (α0 + β0) + (α1 + β1)xi + ei

The regression might estimate either β0 or β1 as zero, or both as zero. If β0 = 0, the two lines have the

same intercept. If β1 = 0, the two lines have the same slope. If both β1 and β0 equal zero, a single straight

line fits all the data. Figure 40.2 shows the four possible outcomes. Figure 40.3 shows the particular

case where the slopes are equal and the intercepts are different.

If simplification seems indicated, a simplified version is fitted to the data. We show later how the full

model and simplified model are compared to check whether the simplification is justified.
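
To make the mechanics concrete, here is a minimal sketch of fitting the augmented two-category model by ordinary least squares. It assumes Python with numpy and statsmodels (an arbitrary tooling choice; any least-squares routine works), and the data arrays are illustrative placeholders, not values from the text.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data for two categories measured over the same x range.
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 3.2, 5.1, 7.0, 9.1])
Z = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = first category, 1 = second

# Design matrix columns: constant, Z, x, Zx.
X = sm.add_constant(np.column_stack([Z, x, Z * x]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # estimates of alpha0, beta0, alpha1, beta1
print(fit.tvalues)  # t-ratios; |t| < 2 suggests a term may be dropped
```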

To deal with three categories, two categorical variables are defined:



Category 1: Z1 = 1 and Z2 = 0
Category 2: Z1 = 0 and Z2 = 1



This implies Z1 = 0 and Z2 = 0 for category 3.

The model is:

yi = (α0 + α1 xi) + Z1(β0 + β1 xi) + Z2(γ0 + γ1 xi) + ei

The parameters with subscript 0 estimate the intercepts and those with subscript 1 estimate the slopes.

This can be rearranged to give:

yi = α0 + β0 Z1 + γ0 Z2 + α1 xi + β1 Z1 xi + γ1 Z2 xi + ei

The six parameters are estimated by fitting the original independent variable xi plus the four created

variables Z1, Z2, Z1xi, and Z2xi.
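
A short sketch of how the created variables might be formed from a vector of category labels (numpy assumed; the labels and x values are illustrative):

```python
import numpy as np

category = np.array([1, 1, 2, 2, 3, 3])       # illustrative category labels
x = np.array([10., 20., 10., 20., 10., 20.])  # illustrative predictor

Z1 = (category == 1).astype(float)
Z2 = (category == 2).astype(float)            # category 3: Z1 = Z2 = 0
X = np.column_stack([np.ones_like(x), Z1, Z2, x, Z1 * x, Z2 * x])
# The six columns of X carry the coefficients
# alpha0, beta0, gamma0, alpha1, beta1, gamma1.
```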

Any of the parameters might be estimated as zero by the regression analysis. A couple of examples

explain how the simpler models can be identified. In the simplest possible case, the regression would




estimate β0 = 0, γ0 = 0, β1 = 0, and γ1 = 0, and the same slope (α1) and intercept (α0) would apply to all three categories. The fitted simplified model is yi = α0 + α1 xi + ei.

If the intercepts are different for the three categories but the slopes are the same, the regression would

estimate β1 = 0 and γ1 = 0 and the model becomes:

yi = ( α0 + β0 Z 1 + γ 0 Z 2 ) + α1 xi + ei

For category 1 (Z1 = 1, Z2 = 0): yi = (α0 + β0) + α1 xi + ei
For category 2 (Z1 = 0, Z2 = 1): yi = (α0 + γ0) + α1 xi + ei
For category 3 (Z1 = 0, Z2 = 0): yi = α0 + α1 xi + ei



Case Study: Solution

The model under consideration allows a different slope and intercept for each storm. Two dummy variables

are needed:

Z1 = 1 for storm 1 and zero otherwise

Z2 = 1 for storm 2 and zero otherwise

The model is:

pH = α0 + α1WA + Z1(β0 + β1WA) + Z2(γ0 + γ1WA)

where the α’s, β’s, and γ’s are estimated by regression. The model can be rewritten as:

pH = α0 + β0Z1 + γ0Z2 + α1WA + β1Z1WA + γ1Z2WA

The dummy variables are incorporated into the model by creating the new variables Z1WA and Z2WA.

Table 40.1 shows how this is done.

Fitting the full six-parameter model gives:

Model A:   pH = 5.77 − 0.00008WA + 0.998Z1 + 1.65Z2 − 0.005Z1WA − 0.008Z2WA
(t-ratios)            (0.11)       (2.14)    (3.51)   (3.63)       (4.90)

which is also shown as Model A in Table 40.2 (top row). The numerical coefficients are the least squares

estimates of the parameters. The small numbers in parentheses beneath the coefficients are the t-ratios

for the parameter values. Terms with t < 2 are candidates for elimination from the model because they

are almost certainly not significant.
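
As an illustration, the Model A fit can be reproduced from the Table 40.1 data with any regression package. The sketch below assumes statsmodels; the coefficients and t-ratios should come out close to the values quoted above (small differences can arise from rounding).

```python
import numpy as np
import statsmodels.api as sm

# Data from Table 40.1: storms 1 (17 obs), 2 (14 obs), and 3 (13 obs).
WA = np.array([190,110,150,170,170,170,200,140,140,160,140,110,110,120,110,110,110,
               140,140,120,190,120,110,110,100,100,120,120,100, 80,100,
               580,640,500,530,670,670,640,640,560,590,640,590,600], float)
pH = np.array([5.96,6.08,5.93,5.99,6.01,5.97,5.88,6.06,6.06,6.03,6.02,6.17,6.31,6.27,6.42,6.28,6.43,
               6.33,6.43,6.37,6.09,6.32,6.37,6.73,6.89,6.87,6.30,6.52,6.39,6.87,6.85,
               5.82,5.94,5.73,5.91,5.87,5.80,5.80,5.78,5.78,5.73,5.63,5.79,6.02])
storm = np.array([1]*17 + [2]*14 + [3]*13)

Z1 = (storm == 1).astype(float)
Z2 = (storm == 2).astype(float)
# Columns: constant, WA, Z1, Z2, Z1*WA, Z2*WA (the Model A term order).
XA = sm.add_constant(np.column_stack([WA, Z1, Z2, Z1 * WA, Z2 * WA]))
A = sm.OLS(pH, XA).fit()
print(A.params)   # least-squares estimates of the six parameters
print(A.tvalues)  # compare with the t-ratios quoted for Model A
```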

The term WA appears insignificant. Dropping this term and refitting the simplified model gives Model

B, in which all coefficients are significant:

Model B:             pH = 5.82 + 0.95Z1 + 1.60Z2 − 0.005Z1WA − 0.008Z2WA
(t-ratios)                      (6.01)   (9.47)   (4.35)       (5.54)
[95% conf. interval]            [0.63 to 1.27] [1.26 to 1.94] [−0.007 to −0.002] [−0.01 to −0.005]



The regression sum of squares, listed in Table 40.2, is the same for Model A and for Model B (Reg SS =

4.278). Dropping the WA term caused no decrease in the regression sum of squares. Model B is equivalent

to Model A.

Is any further simplification possible? Notice that the 95% confidence intervals overlap for the terms

−0.005 Z1WA and –0.008 Z2WA. Therefore, the coefficients of these two terms might be the same. To

check this, fit Model C, which has the same slope but different intercepts for storms 1 and 2. This is




TABLE 40.1
Weak Acidity (WA), pH, and Categorical Variables for Three Storms

Storm   WA   Z1   Z2   Z1WA   Z2WA    pH    Z3   Z3WA
  1    190    1    0    190      0   5.96    1    190
  1    110    1    0    110      0   6.08    1    110
  1    150    1    0    150      0   5.93    1    150
  1    170    1    0    170      0   5.99    1    170
  1    170    1    0    170      0   6.01    1    170
  1    170    1    0    170      0   5.97    1    170
  1    200    1    0    200      0   5.88    1    200
  1    140    1    0    140      0   6.06    1    140
  1    140    1    0    140      0   6.06    1    140
  1    160    1    0    160      0   6.03    1    160
  1    140    1    0    140      0   6.02    1    140
  1    110    1    0    110      0   6.17    1    110
  1    110    1    0    110      0   6.31    1    110
  1    120    1    0    120      0   6.27    1    120
  1    110    1    0    110      0   6.42    1    110
  1    110    1    0    110      0   6.28    1    110
  1    110    1    0    110      0   6.43    1    110
  2    140    0    1      0    140   6.33    1    140
  2    140    0    1      0    140   6.43    1    140
  2    120    0    1      0    120   6.37    1    120
  2    190    0    1      0    190   6.09    1    190
  2    120    0    1      0    120   6.32    1    120
  2    110    0    1      0    110   6.37    1    110
  2    110    0    1      0    110   6.73    1    110
  2    100    0    1      0    100   6.89    1    100
  2    100    0    1      0    100   6.87    1    100
  2    120    0    1      0    120   6.30    1    120
  2    120    0    1      0    120   6.52    1    120
  2    100    0    1      0    100   6.39    1    100
  2     80    0    1      0     80   6.87    1     80
  2    100    0    1      0    100   6.85    1    100
  3    580    0    0      0      0   5.82    0      0
  3    640    0    0      0      0   5.94    0      0
  3    500    0    0      0      0   5.73    0      0
  3    530    0    0      0      0   5.91    0      0
  3    670    0    0      0      0   5.87    0      0
  3    670    0    0      0      0   5.80    0      0
  3    640    0    0      0      0   5.80    0      0
  3    640    0    0      0      0   5.78    0      0
  3    560    0    0      0      0   5.78    0      0
  3    590    0    0      0      0   5.73    0      0
  3    640    0    0      0      0   5.63    0      0
  3    590    0    0      0      0   5.79    0      0
  3    600    0    0      0      0   6.02    0      0

Note: The two right-hand columns are used to fit the simplified model.

Source: Meinert, D. L., S. A. Miller, R. J. Ruane, and H. Olem (1982). “A Review of Water Quality Data in Acid Sensitive Watersheds in the Tennessee Valley,” Rep. No. TVA.ONR/WR-82/10, TVA, Chattanooga, TN.



done by combining columns Z1WA and Z2WA to form the two columns on the right-hand side of

Table 40.1. Call this new variable Z3WA. Z3 = 1 for storms 1 and 2, and 0 for storm 3.

The fitted model is:

Model C:   pH = 5.82 + 1.11Z1 + 1.38Z2 − 0.0057Z3WA
(t-ratios)            (8.43)   (12.19)  (6.68)






TABLE 40.2
Alternate Models for pH at Cosby Creek

Model                                                                  Reg SS   Res SS    R²
A  pH = 5.77 − 0.00008WA + 0.998Z1 + 1.65Z2 − 0.005Z1WA − 0.008Z2WA     4.278    0.662   0.866
B  pH = 5.82 + 0.95Z1 + 1.60Z2 − 0.005Z1WA − 0.008Z2WA                  4.278    0.662   0.866
C  pH = 5.82 + 1.11Z1 + 1.38Z2 − 0.0057Z3WA                             4.229    0.712   0.856



This simplification of the model can be checked in a more formal way by comparing regression sums

of squares of the simplified model with the more complicated one. The regression sum of squares is a

measure of how well the model fits the data. Dropping an important term will cause the regression sum

of squares to decrease by a noteworthy amount, whereas dropping an unimportant term will change the

regression sum of squares very little. An example shows how we decide whether a change is “noteworthy”

(i.e., statistically significant).

If two models are equivalent, the difference of their regression sums of squares will be small, within

an allowance for variation due to random experimental error. The variance due to experimental error

can be estimated by the mean residual sum of squares of the full model (Model A).

The variance due to the deleted term is estimated by the difference between the regression sums of

squares of Model A and Model C, with an adjustment for their respective degrees of freedom. The ratio

of the variance due to the deleted term is compared with the variance due to experimental error by

computing the F statistic, as follows:

F = [(Reg SSA − Reg SSC) / (Reg dfA − Reg dfC)] / (Res SSA / Res dfA)

where
Reg SS = regression sum of squares
Reg df = degrees of freedom associated with the regression sum of squares
Res SS = residual sum of squares
Res df = degrees of freedom associated with the residual sum of squares



Model A has five degrees of freedom associated with the regression sum of squares (Reg df = 5), one

for each of the six parameters in the model minus one for computing the mean. Model C has three

degrees of freedom. Thus:

F = [(4.278 − 4.229)/(5 − 3)] / (0.66/38) = 0.0245/0.017 = 1.44

For a test of significance at the 95% confidence level, this value of F is compared with the upper 5%

point of the F distribution with the appropriate degrees of freedom (5 – 3 = 2 in the numerator and 38

in the denominator): F2,38,0.05 = 3.25. The computed value (F = 1.44) is smaller than the critical value

F2,38,0.05 = 3.25, which confirms that omitting WA from the model and forcing storms 1 and 2 to have

the same slope has not significantly worsened the fit of the model. In short, Model C describes the data

as well as Model A or Model B. Because it is simpler, it is preferred.
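
The F comparison is easy to script. This sketch plugs in the sums of squares from Table 40.2 and uses scipy for the critical value (an assumption about tooling; an F table works equally well):

```python
from scipy import stats

reg_ss_A, reg_df_A = 4.278, 5    # full model (Table 40.2)
reg_ss_C, reg_df_C = 4.229, 3    # simplified model
res_ss_A, res_df_A = 0.662, 38   # residual SS and df of the full model

F = ((reg_ss_A - reg_ss_C) / (reg_df_A - reg_df_C)) / (res_ss_A / res_df_A)
F_crit = stats.f.ppf(0.95, reg_df_A - reg_df_C, res_df_A)
print(F, F_crit)  # roughly 1.4 versus 3.25; F < F_crit, so Model C is adequate
```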

Models for the individual storms are derived by substituting the values of Z1, Z2, and Z3 into Model C:

Storm 1 (Z1 = 1, Z2 = 0, Z3 = 1): pH = 6.93 − 0.0057WA
Storm 2 (Z1 = 0, Z2 = 1, Z3 = 1): pH = 7.20 − 0.0057WA
Storm 3 (Z1 = 0, Z2 = 0, Z3 = 0): pH = 5.82



The model indicates a different intercept for each storm, a common slope for storms 1 and 2, and a slope

of zero for storm 3, as shown by Figure 40.4. In storm 3, the variation in pH was random about a mean




FIGURE 40.4 Stream acidification data fitted to Model C (Table 40.2). Storms 1 and 2 have the same slope: pH = 6.93 − 0.0057WA (storm 1) and pH = 7.20 − 0.0057WA (storm 2); storm 3 is flat at pH = 5.82. (Axes: pH, 5.5 to 7.0, versus weak acidity, 0 to 700 mg/L.)



of 5.82. For storms 1 and 2, increased WA was associated with a lowering of the pH. It is not difficult to

imagine conditions that would lead to two different storms having the same slope but different intercepts.

It is more difficult to understand how the same stream could respond so differently to storm 3, which had

a range of WA that was much higher than either storm 1 or 2, a lower pH, and no change of pH over the

observed range of WA. Perhaps high WA depresses the pH and also buffers the stream against extreme

changes in pH. But why was the WA so much different during storm 3? The data alone, and the statistical

analysis, do not answer this question. They do, however, serve the investigator by raising the question.



Comments

The variables considered in regression equations usually take numerical values over a continuous range,

but occasionally it is advantageous to introduce a factor that has two or more discrete levels, or categories.

For example, data may arise from three storms, or three operators. In such a case, we cannot set up a

continuous measurement scale for the variable storm or operator. We must create categorical variables

(dummy variables) that account for the possible different effects of separate storms or operators. The

levels assigned to the categorical variables are unrelated to any physical level that might exist in the

factors themselves.

Regression with categorical variables was used to model the disappearance of PCBs from soil (Berthouex

and Gan, 1991; Gan and Berthouex, 1994). Draper and Smith (1998) provide several examples on creating

efficient patterns for assigning categorical variables. Piegorsch and Bailer (1997) show examples for

nonlinear models.



References

Berthouex, P. M. and D. R. Gan (1991). “Fate of PCBs in Soil Treated with Contaminated Municipal Sludge,”

J. Envir. Engr. Div., ASCE, 116(1), 1–18.

Daniel, C. and F. S. Wood (1980). Fitting Equations to Data: Computer Analysis of Multifactor Data, 2nd

ed., New York, John Wiley.

Draper, N. R. and H. Smith, (1998). Applied Regression Analysis, 3rd ed., New York, John Wiley.

Gan, D. R. and P. M. Berthouex (1994). “Disappearance and Crop Uptake of PCBs from Sludge-Amended

Farmland,” Water Envir. Res., 66, 54–69.

Meinert, D. L., S. A. Miller, R. J. Ruane, and H. Olem (1982). “A Review of Water Quality Data in Acid

Sensitive Watersheds in the Tennessee Valley,” Rep. No. TVA.ONR/WR-82/10, TVA, Chattanooga, TN.

Piegorsch, W. W. and A. J. Bailer (1997). Statistics for Environmental Biology and Toxicology, London,

Chapman & Hall.




Exercises

40.1 PCB Degradation in Soil. PCB-contaminated sewage sludge was applied to test plots at three

different loading rates (kg/ha) at the beginning of a 5-yr experimental program. Test plots of

farmland where corn was grown were sampled to assess the rate of disappearance of PCB

from soil. Duplicate plots were used for each treatment. Soil PCB concentration (mg/kg) was

measured each year in the fall after the corn crop was picked and in the spring before planting.

The data are below. Estimate the rate coefficients of disappearance (k) using the model PCBt =

PCB0 exp(−kt). Are the rates the same for the three treatment conditions?



Time   Treatment 1   Treatment 2   Treatment 3   Treatment 1   Treatment 2   Treatment 3
  0       1.14          2.66          0.44          0.61          2.50          0.44
  5       0.63          2.69          0.25          0.81          2.96          0.31
 12       0.43          1.14          0.18          0.54          1.51          0.22
 17       0.35          1.00          0.15          0.51          0.48          0.12
 24       0.35          0.93          0.11          0.34          1.16          0.09
 29       0.32          0.73          0.08          0.30          0.96          0.10
 36       0.23          0.47          0.07          0.20          0.46          0.06
 41       0.20          0.57          0.03          0.16          0.36          0.04
 48       0.12          0.40          0.03          0.09          0.22          0.03
 53       0.11          0.32          0.02          0.08          0.31          0.03
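
One way to begin (a sketch, not the required solution): take logarithms so the first-order model becomes the straight line ln PCBt = ln PCB0 − kt, fit each plot or treatment by least squares, and then use dummy variables, as in this chapter, to test whether the slopes are common. The snippet below fits one of the Treatment 1 series from the table; extending it with Z columns follows the pattern shown earlier. It assumes statsmodels.

```python
import numpy as np
import statsmodels.api as sm

t = np.array([0, 5, 12, 17, 24, 29, 36, 41, 48, 53], float)
# One Treatment 1 plot from the table above (mg/kg PCB in soil).
pcb = np.array([1.14, 0.63, 0.43, 0.35, 0.35, 0.32, 0.23, 0.20, 0.12, 0.11])

fit = sm.OLS(np.log(pcb), sm.add_constant(t)).fit()
k = -fit.params[1]  # disappearance rate coefficient, per time unit of t
print(k)
```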



40.2 1,1,1-Trichloroethane Biodegradation. Estimates of biodegradation rate (kb) of 1,1,1-trichloroethane were made under three conditions of activated sludge treatment. The model is yi =

bxi + ei, where the slope b is the estimate of kb . Two dummy variables are needed to represent

the three treatment conditions, and these are arranged in the table below. Does the value of

kb depend on the activated sludge treatment condition?



x (×10⁻⁶)   Z1   Z2    Z1x    Z2x   y (×10⁻³)
  61.2       0    0      0      0      142.3
   9.8       0    0      0      0      140.8
   8.9       0    0      0      0       62.7
  44.9       0    0      0      0       32.5
   6.3       0    0      0      0       82.3
  20.3       0    0      0      0       58.6
   7.5       0    0      0      0       15.5
   1.2       0    0      0      0        2.5
 159.8       1    0   159.8     0     1527.3
  44.4       1    0    44.4     0      697.5
  57.4       1    0    57.4     0      429.9
  25.9       1    0    25.9     0      215.2
  37.9       1    0    37.9     0      331.6
  55.0       1    0    55.0     0      185.7
 151.7       1    0   151.7     0     1169.2
 116.2       1    0   116.2     0      842.8
 129.9       1    0   129.9     0      712.9
  19.4       0    1      0    19.4      49.3
   7.7       0    1      0     7.7      21.6
  36.7       0    1      0    36.7      53.3
  17.8       0    1      0    17.8      59.4
   8.5       0    1      0     8.5     112.3
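
A sketch of the fit (statsmodels assumed; the key point is that no constant column is added, because the stated model yi = bxi + ei has no intercept):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([61.2, 9.8, 8.9, 44.9, 6.3, 20.3, 7.5, 1.2,
              159.8, 44.4, 57.4, 25.9, 37.9, 55.0, 151.7, 116.2, 129.9,
              19.4, 7.7, 36.7, 17.8, 8.5])
y = np.array([142.3, 140.8, 62.7, 32.5, 82.3, 58.6, 15.5, 2.5,
              1527.3, 697.5, 429.9, 215.2, 331.6, 185.7, 1169.2, 842.8, 712.9,
              49.3, 21.6, 53.3, 59.4, 112.3])
Z1 = np.array([0]*8 + [1]*9 + [0]*5, float)
Z2 = np.array([0]*17 + [1]*5, float)

X = np.column_stack([x, Z1 * x, Z2 * x])  # no constant: the model has no intercept
fit = sm.OLS(y, X).fit()
print(fit.params)   # common slope b plus slope shifts for conditions 2 and 3
print(fit.tvalues)  # shifts with |t| < 2 suggest kb does not differ
```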






40.3 Diesel Fuel. Four diesel fuels were tested to estimate the partition coefficient Kdw of

eight organic compounds as a function of their solubility in water (S ). The compounds

are (1) naphthalene, (2) 1-methyl-naphthalene, (3) 2-methyl-naphthalene, (4) acenaphthene,

(5) fluorene, (6) phenanthrene, (7) anthracene, and (8) fluoranthene. The table is set up to do

linear regression with dummy variables to differentiate between diesel fuels. Does the

partitioning relation vary from one diesel fuel to another?

Compound   y = log(Kdw)   x = log(S)   Z1   Z2   Z3   Z1 log(S)   Z2 log(S)   Z3 log(S)

Diesel fuel #1
1              3.67          −3.05      0    0    0       0           0           0
2              4.47          −3.72      0    0    0       0           0           0
3              4.31          −3.62      0    0    0       0           0           0
4              4.35          −3.98      0    0    0       0           0           0
5              4.45          −4.03      0    0    0       0           0           0
6              4.6           −4.50      0    0    0       0           0           0
7              5.15          −4.49      0    0    0       0           0           0
8              5.32          −5.19      0    0    0       0           0           0

Diesel fuel #2
1              3.62          −3.05      1    0    0    −3.05          0           0
2              4.29          −3.72      1    0    0    −3.72          0           0
3              4.21          −3.62      1    0    0    −3.62          0           0
4              4.46          −3.98      1    0    0    −3.98          0           0
5              4.41          −4.03      1    0    0    −4.03          0           0
6              4.61          −4.50      1    0    0    −4.50          0           0
7              5.38          −4.49      1    0    0    −4.49          0           0
8              4.64          −5.19      1    0    0    −5.19          0           0

Diesel fuel #3
1              3.71          −3.05      0    1    0       0        −3.05          0
2              4.44          −3.72      0    1    0       0        −3.72          0
3              4.36          −3.62      0    1    0       0        −3.62          0
4              4.68          −3.98      0    1    0       0        −3.98          0
5              4.52          −4.03      0    1    0       0        −4.03          0
6              4.78          −4.50      0    1    0       0        −4.50          0
7              5.36          −4.49      0    1    0       0        −4.49          0
8              5.61          −5.19      0    1    0       0        −5.19          0

Diesel fuel #4
1              3.71          −3.05      0    0    1       0           0        −3.05
2              4.49          −3.72      0    0    1       0           0        −3.72
3              4.33          −3.62      0    0    1       0           0        −3.62
4              4.62          −3.98      0    0    1       0           0        −3.98
5              4.55          −4.03      0    0    1       0           0        −4.03
6              4.78          −4.50      0    0    1       0           0        −4.50
7              5.20          −4.49      0    0    1       0           0        −4.49
8              5.60          −5.19      0    0    1       0           0        −5.19

Source: Lee, L. S. et al. (1992). Envir. Sci. Tech., 26, 2104–2110.



40.4 Threshold Concentration. The data below can be described by a hockey-stick pattern. Below

some threshold value (τ) the response is a constant plateau value (η = γ0). Above the threshold, the response is linear: η = γ0 + β1(x − τ). These can be combined into a continuous segmented model using a dummy variable z such that z = 1 when x > τ and z = 0 when x ≤ τ. The dummy-variable formulation is η = γ0 + β1(x − τ)z, which gives η = γ0 for x ≤ τ and η = γ0 + β1(x − τ) = γ0 + β1x − β1τ for x > τ. Estimate the plateau value γ0, the post-threshold slope β1, and the unknown threshold dose τ.

© 2002 By CRC Press LLC



L1592_frame_C40 Page 364 Tuesday, December 18, 2001 3:24 PM



x     2.5    22    60    90   105   144   178   210   233   256   300   400
y    16.6  15.3  16.9  16.1  17.1  16.9  18.6  19.3  25.8  28.4  35.5  45.3
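
A minimal sketch of one way to do the estimation, assuming Python with numpy (not the only approach; a nonlinear least-squares routine would also work): for each candidate threshold τ on a grid, construct z, fit γ0 and β1 by ordinary least squares, and keep the τ that minimizes the residual sum of squares.

```python
import numpy as np

x = np.array([2.5, 22, 60, 90, 105, 144, 178, 210, 233, 256, 300, 400])
y = np.array([16.6, 15.3, 16.9, 16.1, 17.1, 16.9, 18.6, 19.3, 25.8, 28.4, 35.5, 45.3])

best = None
for tau in np.linspace(90, 260, 500):            # candidate thresholds
    z = (x > tau).astype(float)
    X = np.column_stack([np.ones_like(x), (x - tau) * z])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]  # [gamma0, beta1]
    rss = np.sum((y - X @ coef) ** 2)
    if best is None or rss < best[0]:
        best = (rss, tau, coef)

rss, tau, (gamma0, beta1) = best
print(f"plateau {gamma0:.2f}, slope {beta1:.3f}, threshold near {tau:.0f}")
```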



40.5 Coagulation. Modify the hockey-stick model of Exercise 40.4 so it describes the intersection

of two straight lines with nonzero slopes. Fit the model to the coagulation data (dissolved

organic carbon, DOC) given below to estimate the slopes of the straight-line segments and

the chemical dose (alum) at the intersection.

Alum Dose (mg/L)   DOC (mg/L)      Alum Dose (mg/L)   DOC (mg/L)
       0              6.7                 35              3.3
       5              6.4                 40              3.3
      10              6.0                 49              3.1
      15              5.2                 58              2.8
      20              4.7                 68              2.7
      25              4.1                 78              2.6
      30              3.9                 87              2.6

Source: White, M. W. et al. (1997). J. AWWA, 89(5).
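
A sketch of one way to modify the model (an assumed parameterization: two lines that meet at the break point, η = γ0 + β1x + β2(x − τ)z, so the slope is β1 below and β1 + β2 above the intersection at dose τ), fitted by the same grid-search idea as in Exercise 40.4:

```python
import numpy as np

dose = np.array([0, 5, 10, 15, 20, 25, 30, 35, 40, 49, 58, 68, 78, 87], float)
doc = np.array([6.7, 6.4, 6.0, 5.2, 4.7, 4.1, 3.9, 3.3, 3.3, 3.1, 2.8, 2.7, 2.6, 2.6])

best = None
for tau in np.linspace(10, 70, 600):             # candidate break points
    z = (dose > tau).astype(float)
    X = np.column_stack([np.ones_like(dose), dose, (dose - tau) * z])
    coef = np.linalg.lstsq(X, doc, rcond=None)[0]
    rss = np.sum((doc - X @ coef) ** 2)
    if best is None or rss < best[0]:
        best = (rss, tau, coef)

rss, tau, (g0, b1, b2) = best
print(f"slopes {b1:.3f} and {b1 + b2:.3f}, intersection near {tau:.0f} mg/L alum")
```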






Chapter 41. The Effect of Autocorrelation on Regression



KEY WORDS: autocorrelation, autocorrelation coefficient, drift, Durbin-Watson statistic, randomization, regression, time series, trend analysis, serial correlation, variance (inflation).



Many environmental data exist as sequences over time or space. The time sequence is obvious in some

data series, such as daily measurements on river quality. A characteristic of such data can be that neighboring

observations tend to be somewhat alike. This tendency is called autocorrelation. Autocorrelation can also

arise in laboratory experiments, perhaps because of the sequence in which experimental runs are done or

drift in instrument calibration. Randomization reduces the possibility of autocorrelated results. Data from

unplanned or unrandomized experiments should be analyzed with an eye open to detect autocorrelation.

Most statistical methods (estimation of confidence intervals, ordinary least squares regression, etc.) depend on the residual errors being independent, having constant variance, and being normally distributed. Independent means that the errors are not autocorrelated. The errors in statistical conclusions caused by violating the condition of independence can be more serious than those caused by not having normality.

Parameter estimates may or may not be seriously affected by autocorrelation, but unrecognized (or

ignored) autocorrelation will bias estimates of variances and any statistics calculated from variances.

Statements about probabilities, including confidence intervals, will be wrong.

This chapter explains why ignoring or overlooking autocorrelation can lead to serious errors and

describes the Durbin-Watson test for detecting autocorrelation in the residuals of a fitted model. Checking

for autocorrelation is relatively easy, although in small data sets it may go undetected even when present. Making suitable provisions to incorporate existing autocorrelation into the data analysis can be

difficult. Some useful references are given but the best approach may be to consult with a statistician.



Case Study: A Suspicious Laboratory Experiment

A laboratory experiment was done to demonstrate to students that increasing factor X by one unit should

cause factor Y to increase by one-half a unit. Preliminary experiments indicated that the standard deviation

of repeated measurements on Y was about 1 unit. To make measurement errors small relative to the

signal, the experiment was designed to produce 20 to 25 units of y. The procedure was to set x and,

after a short time, to collect a specimen on which y would be measured. The measurements on y were

not started until all 11 specimens had been collected. The data, plotted in Figure 41.1, are:

x =    0     1     2     3     4     5     6     7     8     9    10
y =  21.0  21.8  21.3  22.1  22.5  20.6  19.6  20.9  21.7  22.8  23.6

Linear regression gave ŷ = 21.04 + 0.12x, with R² = 0.12. This was an unpleasant surprise. The 95% confidence interval of the slope was −0.12 to 0.31, which does not include the theoretical slope of 0.5 that the experiment was designed to reveal. Also, this interval includes zero, so we cannot even be sure that x and y are related.






FIGURE 41.1 The original data from a suspicious laboratory experiment, with the fitted line y = 21.04 + 0.12x.



FIGURE 41.2 Data obtained from a repeated experiment with randomization to eliminate autocorrelation; fitted line y = 20.06 + 0.43x.



One might be tempted to blame the peculiar result entirely on the low value measured at x = 6, but

the experimenters did not leap to conclusions. Discussion of the experimental procedure revealed that

the tests were done starting with x = 0 first, then with x = 1, etc., up through x = 10. The measurements

of y were also done in order of increasing concentration. It was also discovered that the injection port

of the instrument used to measure y might not have been thoroughly cleaned between each run. The

students knew about randomization, but time was short and they could complete the experiment faster

by not randomizing. The penalty was autocorrelation and a wasted experiment.

They were asked to repeat the experiment, this time randomizing the order of the runs, the order of

analyzing the specimens, and taking more care to clean the injection port. This time the data were as shown in Figure 41.2. The regression equation is ŷ = 20.06 + 0.43x, with R² = 0.68. The confidence interval of

the slope is 0.21 to 0.65. This interval includes the expected slope of 0.5 and shows that x and y are related.

Can the dramatic difference in the outcome of the first and second experiments possibly be due to the

presence of autocorrelation in the experimental data? It is both possible and likely, in view of the lack

of randomization in the order of running the tests.
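
The suspicion can be quantified with the Durbin-Watson statistic, computed from the residuals taken in run order as DW = Σ(et − et−1)²/Σet². A sketch with numpy (DW near 2 indicates no lag-one autocorrelation; values well below 2 point to positive autocorrelation):

```python
import numpy as np

# Data from the first (unrandomized) experiment, in run order.
x = np.arange(11, dtype=float)
y = np.array([21.0, 21.8, 21.3, 22.1, 22.5, 20.6, 19.6, 20.9, 21.7, 22.8, 23.6])

b1, b0 = np.polyfit(x, y, 1)  # least-squares slope and intercept
e = y - (b0 + b1 * x)         # residuals in run order
DW = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(DW)
```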



The Consequences of Autocorrelation on Regression

An important part of doing regression is obtaining a valid statement about the precision of the estimates.

Unfortunately, autocorrelation acts to destroy our ability to make such statements. If the error terms are

positively autocorrelated, the usual confidence intervals and tests using t and F distributions are no longer

strictly applicable because the variance estimates are distorted (Neter et al., 1983).



