L1592_frame_C40 Page 356 Tuesday, December 18, 2001 3:24 PM
[Figure 40.1 plots pH (5.5 to 7.0) against weak acidity (0 to 700 µg/L).]
FIGURE 40.1 The relation of pH and weak acidity data of Cosby Creek after three storms.
Begin by considering data from a single category. The quantitative predictor variable x1 can predict the dependent variable y1 using the linear model:

y1i = β0 + β1 x1i + ei

where β0 and β1 are parameters to be estimated by least squares.
If there are data from two categories (e.g., data produced at two different laboratories), one approach would be to model the two sets of data separately as:

y1i = α0 + α1 x1i + ei

and

y2i = β0 + β1 x2i + ei

and then to compare the estimated intercepts (α0 and β0) and the estimated slopes (α1 and β1) using confidence intervals or t-tests.
A second, and often better, method is to simultaneously fit a single augmented model to all the data.
To construct this model, define a categorical variable Z as follows:

Z = 0 if the data are in the first category
Z = 1 if the data are in the second category
The augmented model is:
yi = α0 + α1 xi + Z(β0 + β1 xi) + ei
With some rearrangement:
yi = α0 + β0 Z + α1 xi + β1 Zxi + ei
In this last form the regression is done as though there are three independent variables, x, Z, and Zx.
The vectors of Z and Zx have to be created from the categorical variables defined above. The four
parameters α 0, β 0, α1, and β 1 are estimated by linear regression.
A model for each category can be obtained by substituting the defined values. For the first category,
Z = 0 and:
yi = α0 + α1 xi + ei
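The mechanics of this single augmented fit can be sketched numerically. The example below is illustrative only (synthetic data with invented parameter values, not from the text): it builds the three predictor columns x, Z, and Zx and recovers all four parameters in one least squares fit.

```python
import numpy as np

# Illustrative sketch (synthetic data, invented parameter values): fit the
# augmented two-category model  y = a0 + b0*Z + a1*x + b1*Z*x + e  by
# ordinary least squares, treating x, Z, and Zx as three predictors.
rng = np.random.default_rng(1)

a0, b0, a1, b1 = 2.0, 1.5, 0.8, -0.3      # "true" values used to simulate data
x = np.linspace(0.0, 10.0, 40)
Z = np.tile([0.0, 1.0], 20)               # Z = 0 first category, Z = 1 second
y = a0 + b0 * Z + a1 * x + b1 * Z * x + rng.normal(0.0, 0.05, size=x.size)

X = np.column_stack([np.ones_like(x), Z, x, Z * x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# coef holds the estimates of a0, b0, a1, b1.
# Category 1 (Z = 0): intercept a0,      slope a1
# Category 2 (Z = 1): intercept a0 + b0, slope a1 + b1
print(np.round(coef, 2))
```

Substituting Z = 0 or Z = 1 into the fitted coefficients reproduces the per-category lines discussed above.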
© 2002 By CRC Press LLC
[Figure 40.2 lays out the four possible models:
Intercepts different, slopes different: yi = (α0 + β0) + (α1 + β1)xi + ei
Intercepts equal, slopes different: yi = α0 + (α1 + β1)xi + ei
Intercepts different, slopes equal: yi = (α0 + β0) + α1 xi + ei
Intercepts equal, slopes equal: yi = α0 + α1 xi + ei]
FIGURE 40.2 Four possible models to fit a straight line to data in two categories.
[Figure 40.3 sketches two parallel lines fitted from the complete model y = (α0 + β0) + (α1 + β1)x + e with β1 = 0. Category 1: y = α0 + α1x + e, with intercept α0; Category 2: y = (α0 + β0) + α1x + e, with intercept α0 + β0. The slope is α1 for both lines.]
FIGURE 40.3 Model with two categories having different intercepts but equal slopes.
For the second category, Z = 1 and:

yi = (α0 + β0) + (α1 + β1)xi + ei
The regression might estimate either β0 or β1 as zero, or both as zero. If β0 = 0, the two lines have the
same intercept. If β1 = 0, the two lines have the same slope. If both β1 and β0 equal zero, a single straight
line fits all the data. Figure 40.2 shows the four possible outcomes. Figure 40.3 shows the particular
case where the slopes are equal and the intercepts are different.
If simplification seems indicated, a simplified version is fitted to the data. We show later how the full
model and simplified model are compared to check whether the simplification is justified.
To deal with three categories, two categorical variables are defined:

Category 1: Z1 = 1 and Z2 = 0
Category 2: Z1 = 0 and Z2 = 1
This implies Z1 = 0 and Z2 = 0 for category 3.
The model is:

yi = (α0 + α1 xi) + Z1(β0 + β1 xi) + Z2(γ0 + γ1 xi) + ei

The parameters with subscript 0 estimate the intercepts and those with subscript 1 estimate the slopes. This can be rearranged to give:

yi = α0 + β0 Z1 + γ0 Z2 + α1 xi + β1 Z1 xi + γ1 Z2 xi + ei

The six parameters are estimated by fitting the original predictor variable xi plus the four created variables Z1, Z2, Z1xi, and Z2xi.
Any of the parameters might be estimated as zero by the regression analysis. A couple of examples
explain how the simpler models can be identified. In the simplest possible case, the regression would
estimate β0 = 0, γ0 = 0, β1 = 0, and γ1 = 0, and the same slope (α1) and intercept (α0) would apply to all three categories. The fitted simplified model is yi = α0 + α1 xi + ei.
If the intercepts are different for the three categories but the slopes are the same, the regression would
estimate β1 = 0 and γ1 = 0 and the model becomes:
yi = (α0 + β0 Z1 + γ0 Z2) + α1 xi + ei
For category 1: yi = (α0 + β0) + α1 xi + ei
For category 2: yi = (α0 + γ0) + α1 xi + ei
For category 3: yi = α0 + α1 xi + ei
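The bookkeeping for three categories can be sketched in code. The example below is illustrative only (invented parameter values, not the storm data): it builds the dummy and interaction columns from a vector of category labels, then estimates all six parameters in one least squares fit.

```python
import numpy as np

# Illustrative sketch (invented values): build the dummy and interaction
# columns for three categories from a label vector, then estimate all six
# parameters of  y = a0 + b0*Z1 + g0*Z2 + a1*x + b1*Z1*x + g1*Z2*x + e.
rng = np.random.default_rng(7)

category = np.repeat([1, 2, 3], 15)              # 15 observations per category
x = np.tile(np.linspace(0.0, 10.0, 15), 3)       # same x grid in each category

Z1 = (category == 1).astype(float)               # Z1 = 1 for category 1
Z2 = (category == 2).astype(float)               # Z2 = 1 for category 2
# Category 3 is the baseline: Z1 = 0 and Z2 = 0.

true = np.array([4.0, 1.0, 2.0, 0.5, -0.2, 0.3])  # a0, b0, g0, a1, b1, g1
X = np.column_stack([np.ones_like(x), Z1, Z2, x, Z1 * x, Z2 * x])
y = X @ true + rng.normal(0.0, 0.05, size=x.size)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a0, b0, g0, a1, b1, g1 = coef

# Substituting the Z values recovers a line for each category:
#   category 1: intercept a0 + b0, slope a1 + b1
#   category 2: intercept a0 + g0, slope a1 + g1
#   category 3: intercept a0,      slope a1
print(np.round(coef, 2))
```

This is exactly the column construction that Table 40.1 carries out by hand for the case study.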
Case Study: Solution
The model under consideration allows a different slope and intercept for each storm. Two dummy variables
are needed:
Z1 = 1 for storm 1 and zero otherwise
Z2 = 1 for storm 2 and zero otherwise
The model is:

pH = α0 + α1WA + Z1(β0 + β1WA) + Z2(γ0 + γ1WA)

where the α's, β's, and γ's are estimated by regression. The model can be rewritten as:

pH = α0 + β0Z1 + γ0Z2 + α1WA + β1Z1WA + γ1Z2WA
The dummy variables are incorporated into the model by creating the new variables Z1WA and Z2WA.
Table 40.1 shows how this is done.
Fitting the full six-parameter model gives:

Model A: pH = 5.77 − 0.00008WA + 0.998Z1 + 1.65Z2 − 0.005Z1WA − 0.008Z2WA
(t-ratios)           (0.11)       (2.14)    (3.51)   (3.63)       (4.90)
which is also shown as Model A in Table 40.2 (top row). The numerical coefficients are the least squares
estimates of the parameters. The small numbers in parentheses beneath the coefficients are the t-ratios
for the parameter values. Terms with t < 2 are candidates for elimination from the model because they
are almost certainly not significant.
The term WA appears insignificant. Dropping this term and refitting the simplified model gives Model
B, in which all coefficients are significant:
Model B: pH = 5.82 + 0.95Z1 + 1.60Z2 − 0.005Z1WA − 0.008Z2WA
(t-ratios)           (6.01)    (9.47)    (4.35)       (5.54)
[95% conf. interval] [0.63 to 1.27] [1.26 to 1.94] [−0.007 to −0.002] [−0.01 to −0.005]
The regression sum of squares, listed in Table 40.2, is the same for Model A and for Model B (Reg SS =
4.278). Dropping the WA term caused no decrease in the regression sum of squares. Model B is equivalent
to Model A.
Is any further simplification possible? Notice that the 95% confidence intervals overlap for the terms
−0.005 Z1WA and –0.008 Z2WA. Therefore, the coefficients of these two terms might be the same. To
check this, fit Model C, which has the same slope but different intercepts for storms 1 and 2. This is
TABLE 40.1
Weak Acidity (WA), pH, and Categorical Variables for Three Storms

Storm   WA    Z1   Z2   Z1WA   Z2WA   pH     Z3   Z3WA
1       190   1    0    190    0      5.96   1    190
1       110   1    0    110    0      6.08   1    110
1       150   1    0    150    0      5.93   1    150
1       170   1    0    170    0      5.99   1    170
1       170   1    0    170    0      6.01   1    170
1       170   1    0    170    0      5.97   1    170
1       200   1    0    200    0      5.88   1    200
1       140   1    0    140    0      6.06   1    140
1       140   1    0    140    0      6.06   1    140
1       160   1    0    160    0      6.03   1    160
1       140   1    0    140    0      6.02   1    140
1       110   1    0    110    0      6.17   1    110
1       110   1    0    110    0      6.31   1    110
1       120   1    0    120    0      6.27   1    120
1       110   1    0    110    0      6.42   1    110
1       110   1    0    110    0      6.28   1    110
1       110   1    0    110    0      6.43   1    110
2       140   0    1    0      140    6.33   1    140
2       140   0    1    0      140    6.43   1    140
2       120   0    1    0      120    6.37   1    120
2       190   0    1    0      190    6.09   1    190
2       120   0    1    0      120    6.32   1    120
2       110   0    1    0      110    6.37   1    110
2       110   0    1    0      110    6.73   1    110
2       100   0    1    0      100    6.89   1    100
2       100   0    1    0      100    6.87   1    100
2       120   0    1    0      120    6.30   1    120
2       120   0    1    0      120    6.52   1    120
2       100   0    1    0      100    6.39   1    100
2       80    0    1    0      80     6.87   1    80
2       100   0    1    0      100    6.85   1    100
3       580   0    0    0      0      5.82   0    0
3       640   0    0    0      0      5.94   0    0
3       500   0    0    0      0      5.73   0    0
3       530   0    0    0      0      5.91   0    0
3       670   0    0    0      0      5.87   0    0
3       670   0    0    0      0      5.80   0    0
3       640   0    0    0      0      5.80   0    0
3       640   0    0    0      0      5.78   0    0
3       560   0    0    0      0      5.78   0    0
3       590   0    0    0      0      5.73   0    0
3       640   0    0    0      0      5.63   0    0
3       590   0    0    0      0      5.79   0    0
3       600   0    0    0      0      6.02   0    0
Note: The two right-hand columns are used to fit the simplified model.
Source: Meinert, D. L., S. A. Miller, R. J. Ruane, and H. Olem (1982). “A Review of Water Quality
Data in Acid Sensitive Watersheds in the Tennessee Valley,” Rep. No. TVA.ONR/WR-82/10, TVA,
Chattanooga, TN.
done by combining columns Z1WA and Z2WA to form the two columns on the right-hand side of
Table 40.1. Call this new variable Z3WA. Z3 = 1 for storms 1 and 2, and 0 for storm 3.
The fitted model is:

Model C: pH = 5.82 + 1.11Z1 + 1.38Z2 − 0.0057Z3WA
(t-ratios)           (8.43)    (12.19)   (6.68)
TABLE 40.2
Alternate Models for pH at Cosby Creek

Model                                                                 Reg SS   Res SS   R²
A  pH = 5.77 − 0.00008WA + 0.998Z1 + 1.65Z2 − 0.005Z1WA − 0.008Z2WA   4.278    0.662    0.866
B  pH = 5.82 + 0.95Z1 + 1.60Z2 − 0.005Z1WA − 0.008Z2WA                4.278    0.662    0.866
C  pH = 5.82 + 1.11Z1 + 1.38Z2 − 0.0057Z3WA                           4.229    0.712    0.856
This simplification of the model can be checked in a more formal way by comparing regression sums
of squares of the simplified model with the more complicated one. The regression sum of squares is a
measure of how well the model fits the data. Dropping an important term will cause the regression sum
of squares to decrease by a noteworthy amount, whereas dropping an unimportant term will change the
regression sum of squares very little. An example shows how we decide whether a change is “noteworthy”
(i.e., statistically significant).
If two models are equivalent, the difference of their regression sums of squares will be small, within
an allowance for variation due to random experimental error. The variance due to experimental error
can be estimated by the mean residual sum of squares of the full model (Model A).
The variance due to the deleted term is estimated by the difference between the regression sums of
squares of Model A and Model C, with an adjustment for their respective degrees of freedom. The ratio
of the variance due to the deleted term is compared with the variance due to experimental error by
computing the F statistic, as follows:
F = [(Reg SS A − Reg SS C)/(Reg df A − Reg df C)] / (Res SS A / Res df A)

where

Reg SS = regression sum of squares
Reg df = degrees of freedom associated with the regression sum of squares
Res SS = residual sum of squares
Res df = degrees of freedom associated with the residual sum of squares
Model A has five degrees of freedom associated with the regression sum of squares (Reg df = 5), one
for each of the six parameters in the model minus one for computing the mean. Model C has three
degrees of freedom. Thus:
F = [(4.278 − 4.229)/(5 − 3)] / (0.662/38) = 0.0245/0.017 = 1.44
For a test of significance at the 95% confidence level, this value of F is compared with the upper 5%
point of the F distribution with the appropriate degrees of freedom (5 – 3 = 2 in the numerator and 38
in the denominator): F2,38,0.05 = 3.25. The computed value (F = 1.44) is smaller than the critical value
F2,38,0.05 = 3.25, which confirms that omitting WA from the model and forcing storms 1 and 2 to have
the same slope has not significantly worsened the fit of the model. In short, Model C describes the data
as well as Model A or Model B. Because it is simpler, it is preferred.
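This F computation is easy to script. A minimal check using the sums of squares from Table 40.2 follows; note that carrying full precision through 0.662/38 gives F ≈ 1.41 rather than the rounded 1.44 in the text, and the conclusion is unchanged.

```python
# Sums of squares and degrees of freedom taken from Table 40.2 and the text.
reg_ss_a, reg_df_a = 4.278, 5    # full model (Model A)
reg_ss_c, reg_df_c = 4.229, 3    # simplified model (Model C)
res_ss_a, res_df_a = 0.662, 38   # residual SS of the full model

# F = [(Reg SS_A - Reg SS_C)/(Reg df_A - Reg df_C)] / (Res SS_A / Res df_A)
F = ((reg_ss_a - reg_ss_c) / (reg_df_a - reg_df_c)) / (res_ss_a / res_df_a)
print(round(F, 2))  # well below the tabled critical value F(2, 38, 0.05) = 3.25
```

Because F is far below 3.25, the simplification from Model A to Model C is accepted.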
Models for the individual storms are derived by substituting the values of Z1, Z2, and Z3 into Model C:
Storm 1 (Z1 = 1, Z2 = 0, Z3 = 1): pH = 6.93 − 0.0057WA
Storm 2 (Z1 = 0, Z2 = 1, Z3 = 1): pH = 7.20 − 0.0057WA
Storm 3 (Z1 = 0, Z2 = 0, Z3 = 0): pH = 5.82
The model indicates a different intercept for each storm, a common slope for storms 1 and 2, and a slope
of zero for storm 3, as shown by Figure 40.4. In storm 3, the variation in pH was random about a mean
[Figure 40.4 plots pH (5.5 to 7.0) against weak acidity (0 to 700 mg/L) with the three fitted lines pH = 6.93 − 0.0057WA, pH = 7.20 − 0.0057WA, and pH = 5.82.]
FIGURE 40.4 Stream acidification data fitted to Model C (Table 40.2). Storms 1 and 2 have the same slope.
of 5.82. For storms 1 and 2, increased WA was associated with a lowering of the pH. It is not difficult to
imagine conditions that would lead to two different storms having the same slope but different intercepts.
It is more difficult to understand how the same stream could respond so differently to storm 3, which had
a range of WA that was much higher than either storm 1 or 2, a lower pH, and no change of pH over the
observed range of WA. Perhaps high WA depresses the pH and also buffers the stream against extreme
changes in pH. But why was the WA so much different during storm 3? The data alone, and the statistical
analysis, do not answer this question. They do, however, serve the investigator by raising the question.
Comments
The variables considered in regression equations usually take numerical values over a continuous range,
but occasionally it is advantageous to introduce a factor that has two or more discrete levels, or categories.
For example, data may arise from three storms, or three operators. In such a case, we cannot set up a
continuous measurement scale for the variable storm or operator. We must create categorical variables
(dummy variables) that account for the possible different effects of separate storms or operators. The
levels assigned to the categorical variables are unrelated to any physical level that might exist in the
factors themselves.
Regression with categorical variables was used to model the disappearance of PCBs from soil (Berthouex
and Gan, 1991; Gan and Berthouex, 1994). Draper and Smith (1998) provide several examples on creating
efficient patterns for assigning categorical variables. Piegorsch and Bailer (1997) show examples for
nonlinear models.
References
Berthouex, P. M. and D. R. Gan (1991). “Fate of PCBs in Soil Treated with Contaminated Municipal Sludge,”
J. Envir. Engr. Div., ASCE, 116(1), 1–18.
Daniel, C. and F. S. Wood (1980). Fitting Equations to Data: Computer Analysis of Multifactor Data, 2nd
ed., New York, John Wiley.
Draper, N. R. and H. Smith, (1998). Applied Regression Analysis, 3rd ed., New York, John Wiley.
Gan, D. R. and P. M. Berthouex (1994). “Disappearance and Crop Uptake of PCBs from Sludge-Amended
Farmland,” Water Envir. Res., 66, 54–69.
Meinert, D. L., S. A. Miller, R. J. Ruane, and H. Olem (1982). “A Review of Water Quality Data in Acid
Sensitive Watersheds in the Tennessee Valley,” Rep. No. TVA.ONR/WR-82/10, TVA, Chattanooga, TN.
Piegorsch, W. W. and A. J. Bailer (1997). Statistics for Environmental Biology and Toxicology, London,
Chapman & Hall.
Exercises
40.1 PCB Degradation in Soil. PCB-contaminated sewage sludge was applied to test plots at three
different loading rates (kg/ha) at the beginning of a 5-yr experimental program. Test plots of
farmland where corn was grown were sampled to assess the rate of disappearance of PCB
from soil. Duplicate plots were used for each treatment. Soil PCB concentration (mg/kg) was
measured each year in the fall after the corn crop was picked and in the spring before planting.
The data are below. Estimate the rate coefficients of disappearance (k) using the model PCBt = PCB0 exp(−kt). Are the rates the same for the three treatment conditions?
Time   Treatment 1    Treatment 2    Treatment 3
0      1.14   0.61    2.66   2.50    0.44   0.44
5      0.63   0.81    2.69   2.96    0.25   0.31
12     0.43   0.54    1.14   1.51    0.18   0.22
17     0.35   0.51    1.00   0.48    0.15   0.12
24     0.35   0.34    0.93   1.16    0.11   0.09
29     0.32   0.30    0.73   0.96    0.08   0.10
36     0.23   0.20    0.47   0.46    0.07   0.06
41     0.20   0.16    0.57   0.36    0.03   0.04
48     0.12   0.09    0.40   0.22    0.03   0.03
53     0.11   0.08    0.32   0.31    0.02   0.03

(Each treatment has two columns, one for each duplicate plot.)
40.2 1,1,1-Trichloroethane Biodegradation. Estimates of the biodegradation rate (kb) of 1,1,1-trichloroethane were made under three conditions of activated sludge treatment. The model is yi = b xi + ei, where the slope b is the estimate of kb. Two dummy variables are needed to represent the three treatment conditions, and these are arranged in the table below. Does the value of kb depend on the activated sludge treatment condition?
x (×10⁻⁶)   Z1   Z2   Z1x     Z2x    y (×10⁻³)
61.2        0    0    0       0      142.3
9.8         0    0    0       0      140.8
8.9         0    0    0       0      62.7
44.9        0    0    0       0      32.5
6.3         0    0    0       0      82.3
20.3        0    0    0       0      58.6
7.5         0    0    0       0      15.5
1.2         0    0    0       0      2.5
159.8       1    0    159.8   0      1527.3
44.4        1    0    44.4    0      697.5
57.4        1    0    57.4    0      429.9
25.9        1    0    25.9    0      215.2
37.9        1    0    37.9    0      331.6
55.0        1    0    55.0    0      185.7
151.7       1    0    151.7   0      1169.2
116.2       1    0    116.2   0      842.8
129.9       1    0    129.9   0      712.9
19.4        0    1    0       19.4   49.3
7.7         0    1    0       7.7    21.6
36.7        0    1    0       36.7   53.3
17.8        0    1    0       17.8   59.4
8.5         0    1    0       8.5    112.3
40.3 Diesel Fuel. Four diesel fuels were tested to estimate the partition coefficient Kdw of
eight organic compounds as a function of their solubility in water (S ). The compounds
are (1) naphthalene, (2) 1-methyl-naphthalene, (3) 2-methyl-naphthalene, (4) acenaphthene,
(5) fluorene, (6) phenanthrene, (7) anthracene, and (8) fluoranthene. The table is set up to do
linear regression with dummy variables to differentiate between diesel fuels. Does the
partitioning relation vary from one diesel fuel to another?
Compound   y = log(Kdw)   x = log(S)   Z1   Z2   Z3   Z1log(S)   Z2log(S)   Z3log(S)

Diesel fuel #1
1          3.67           −3.05        0    0    0    0          0          0
2          4.47           −3.72        0    0    0    0          0          0
3          4.31           −3.62        0    0    0    0          0          0
4          4.35           −3.98        0    0    0    0          0          0
5          4.45           −4.03        0    0    0    0          0          0
6          4.6            −4.50        0    0    0    0          0          0
7          5.15           −4.49        0    0    0    0          0          0
8          5.32           −5.19        0    0    0    0          0          0

Diesel fuel #2
1          3.62           −3.05        1    0    0    −3.05      0          0
2          4.29           −3.72        1    0    0    −3.72      0          0
3          4.21           −3.62        1    0    0    −3.62      0          0
4          4.46           −3.98        1    0    0    −3.98      0          0
5          4.41           −4.03        1    0    0    −4.03      0          0
6          4.61           −4.50        1    0    0    −4.50      0          0
7          5.38           −4.49        1    0    0    −4.49      0          0
8          4.64           −5.19        1    0    0    −5.19      0          0

Diesel fuel #3
1          3.71           −3.05        0    1    0    0          −3.05      0
2          4.44           −3.72        0    1    0    0          −3.72      0
3          4.36           −3.62        0    1    0    0          −3.62      0
4          4.68           −3.98        0    1    0    0          −3.98      0
5          4.52           −4.03        0    1    0    0          −4.03      0
6          4.78           −4.50        0    1    0    0          −4.50      0
7          5.36           −4.49        0    1    0    0          −4.49      0
8          5.61           −5.19        0    1    0    0          −5.19      0

Diesel fuel #4
1          3.71           −3.05        0    0    1    0          0          −3.05
2          4.49           −3.72        0    0    1    0          0          −3.72
3          4.33           −3.62        0    0    1    0          0          −3.62
4          4.62           −3.98        0    0    1    0          0          −3.98
5          4.55           −4.03        0    0    1    0          0          −4.03
6          4.78           −4.50        0    0    1    0          0          −4.50
7          5.20           −4.49        0    0    1    0          0          −4.49
8          5.60           −5.19        0    0    1    0          0          −5.19
Source: Lee, L. S. et al. (1992). Envir. Sci. Tech., 26, 2104–2110.
40.4 Threshold Concentration. The data below can be described by a hockey-stick pattern. Below some threshold value (τ) the response is a constant plateau, η = γ0. Above the threshold, the response is linear: η = γ0 + β1(x − τ). These can be combined into a continuous segmented model using a dummy variable z such that z = 1 when x > τ and z = 0 when x ≤ τ. The dummy-variable formulation is η = γ0 + β1(x − τ)z, which gives η = γ0 for x ≤ τ and η = γ0 + β1(x − τ) = γ0 + β1x − β1τ for x > τ. Estimate the plateau value γ0, the post-threshold slope β1, and the unknown threshold dose τ.
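The dummy-variable formulation above can be sketched as a small function. The parameter values here are invented for illustration; they are not estimates from the data below.

```python
# A minimal sketch of the hockey-stick (segmented) model using a dummy
# variable. Parameter values g0, b1, tau are invented, not fitted.
def hockey_stick(x, g0, b1, tau):
    """Plateau g0 for x <= tau; line g0 + b1*(x - tau) above the threshold."""
    z = 1.0 if x > tau else 0.0          # dummy variable: z = 1 when x > tau
    return g0 + b1 * (x - tau) * z

g0, b1, tau = 16.5, 0.15, 150.0          # assumed plateau, slope, threshold
print(hockey_stick(100.0, g0, b1, tau))  # on the plateau: returns g0
print(hockey_stick(200.0, g0, b1, tau))  # on the rising segment
```

Because both branches equal γ0 at x = τ, the fitted curve is continuous at the threshold, which is the point of writing the model this way.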
x   2.5    22     60     90     105    144    178    210    233    256    300    400
y   16.6   15.3   16.9   16.1   17.1   16.9   18.6   19.3   25.8   28.4   35.5   45.3
40.5 Coagulation. Modify the hockey-stick model of Exercise 40.4 so it describes the intersection
of two straight lines with nonzero slopes. Fit the model to the coagulation data (dissolved
organic carbon, DOC) given below to estimate the slopes of the straight-line segments and
the chemical dose (alum) at the intersection.
Alum Dose (mg/L)   DOC (mg/L)      Alum Dose (mg/L)   DOC (mg/L)
0                  6.7             35                 3.3
5                  6.4             40                 3.3
10                 6.0             49                 3.1
15                 5.2             58                 2.8
20                 4.7             68                 2.7
25                 4.1             78                 2.6
30                 3.9             87                 2.6
Source: White, M. W. et al. (1997). J. AWWA, 89(5).
41
The Effect of Autocorrelation on Regression
KEY WORDS autocorrelation, autocorrelation coefficient, drift, Durbin-Watson statistic, randomization, regression, time series, trend analysis, serial correlation, variance (inflation).
Many environmental data exist as sequences over time or space. The time sequence is obvious in some
data series, such as daily measurements on river quality. A characteristic of such data can be that neighboring
observations tend to be somewhat alike. This tendency is called autocorrelation. Autocorrelation can also
arise in laboratory experiments, perhaps because of the sequence in which experimental runs are done or
drift in instrument calibration. Randomization reduces the possibility of autocorrelated results. Data from
unplanned or unrandomized experiments should be analyzed with an eye open to detect autocorrelation.
Most statistical methods (estimation of confidence intervals, ordinary least squares regression, and so on) depend on the residual errors being independent, having constant variance, and being normally distributed. Independent means that the errors are not autocorrelated. The errors in statistical conclusions caused
by violating the condition of independence can be more serious than those caused by not having normality.
Parameter estimates may or may not be seriously affected by autocorrelation, but unrecognized (or
ignored) autocorrelation will bias estimates of variances and any statistics calculated from variances.
Statements about probabilities, including confidence intervals, will be wrong.
This chapter explains why ignoring or overlooking autocorrelation can lead to serious errors and
describes the Durbin-Watson test for detecting autocorrelation in the residuals of a fitted model. Checking
for autocorrelation is relatively easy although it may go undetected even when present in small data
sets. Making suitable provisions to incorporate existing autocorrelation into the data analysis can be
difficult. Some useful references are given but the best approach may be to consult with a statistician.
Case Study: A Suspicious Laboratory Experiment
A laboratory experiment was done to demonstrate to students that increasing factor X by one unit should
cause factor Y to increase by one-half a unit. Preliminary experiments indicated that the standard deviation
of repeated measurements on Y was about 1 unit. To make measurement errors small relative to the
signal, the experiment was designed to produce 20 to 25 units of y. The procedure was to set x and,
after a short time, to collect a specimen on which y would be measured. The measurements on y were
not started until all 11 specimens had been collected. The data, plotted in Figure 41.1, are:
x = 0     1     2     3     4     5     6     7     8     9     10
y = 21.0  21.8  21.3  22.1  22.5  20.6  19.6  20.9  21.7  22.8  23.6
Linear regression gave ŷ = 21.04 + 0.12x, with R² = 0.12. This was an unpleasant surprise. The 95% confidence interval of the slope was −0.12 to 0.31, which does not include the theoretical slope of 0.5 that the experiment was designed to reveal. Also, this interval includes zero, so we cannot even be sure that x and y are related.
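These results are easy to reproduce, and the residuals can be screened with the Durbin-Watson statistic that this chapter goes on to describe. A sketch:

```python
import numpy as np

# Refit the suspicious data above and screen the residuals for lag-1
# autocorrelation with the Durbin-Watson statistic.
x = np.arange(11.0)
y = np.array([21.0, 21.8, 21.3, 22.1, 22.5, 20.6,
              19.6, 20.9, 21.7, 22.8, 23.6])

slope, intercept = np.polyfit(x, y, 1)   # close to y-hat = 21.04 + 0.12x
resid = y - (intercept + slope * x)

# Durbin-Watson: values near 2 are consistent with independent errors;
# values well below 2 point to positive lag-1 autocorrelation.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(round(slope, 2), round(intercept, 2), round(dw, 2))
```

The statistic comes out well below 2, consistent with the positive autocorrelation that the unrandomized run order introduced.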
[Figure 41.1 plots y (19 to 24) against x (0 to 12) with the fitted line y = 21.04 + 0.12x.]
FIGURE 41.1 The original data from a suspicious laboratory experiment.
[Figure 41.2 plots y (18 to 25) against x (0 to 12) with the fitted line y = 20.06 + 0.43x.]
FIGURE 41.2 Data obtained from a repeated experiment with randomization to eliminate autocorrelation.
One might be tempted to blame the peculiar result entirely on the low value measured at x = 6, but
the experimenters did not leap to conclusions. Discussion of the experimental procedure revealed that
the tests were done starting with x = 0 first, then with x = 1, etc., up through x = 10. The measurements
of y were also done in order of increasing concentration. It was also discovered that the injection port
of the instrument used to measure y might not have been thoroughly cleaned between each run. The
students knew about randomization, but time was short and they could complete the experiment faster
by not randomizing. The penalty was autocorrelation and a wasted experiment.
They were asked to repeat the experiment, this time randomizing the order of the runs and the order of analyzing the specimens, and taking more care to clean the injection port. This time the data were as shown in Figure 41.2. The regression equation is ŷ = 20.06 + 0.43x, with R² = 0.68. The confidence interval of the slope is 0.21 to 0.65. This interval includes the expected slope of 0.5 and shows that x and y are related.
Can the dramatic difference in the outcome of the first and second experiments possibly be due to the
presence of autocorrelation in the experimental data? It is both possible and likely, in view of the lack
of randomization in the order of running the tests.
The Consequences of Autocorrelation on Regression
An important part of doing regression is obtaining a valid statement about the precision of the estimates.
Unfortunately, autocorrelation acts to destroy our ability to make such statements. If the error terms are
positively autocorrelated, the usual confidence intervals and tests using t and F distributions are no longer
strictly applicable because the variance estimates are distorted (Neter et al., 1983).
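This distortion is easy to demonstrate by simulation. The sketch below uses illustrative values and assumes an AR(1) error structure; it compares the standard error reported by the usual independence-based formula with the actual scatter of slope estimates over many repeated simulated experiments.

```python
import numpy as np

# Simulation sketch (illustrative values): straight-line data with positively
# autocorrelated AR(1) errors, fitted by ordinary least squares. Compare the
# independence-based standard error of the slope with the true variability.
rng = np.random.default_rng(3)
n, phi, n_sim = 50, 0.8, 500
x = np.arange(n, dtype=float)
Sxx = np.sum((x - x.mean()) ** 2)

slopes, reported_se = [], []
for _ in range(n_sim):
    e = np.zeros(n)
    for t in range(1, n):                # AR(1): e[t] = phi*e[t-1] + a[t]
        e[t] = phi * e[t - 1] + rng.normal(0.0, 1.0)
    y = 1.0 + 0.5 * x + e
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (n - 2)         # usual estimate of error variance
    slopes.append(coef[1])
    reported_se.append(np.sqrt(s2 / Sxx))

actual_sd = np.std(slopes)               # real variability of the slope
mean_reported = np.mean(reported_se)     # what the usual formula claims
print(round(actual_sd, 4), round(mean_reported, 4))
# With phi > 0 the reported standard error is far too small, so confidence
# intervals computed from it are too narrow.
```

The gap between the two numbers is the variance distortion described above: the nominal t and F probabilities no longer hold.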