L1592_Frame_C41 Page 366 Tuesday, December 18, 2001 3:24 PM
FIGURE 41.1 The original data from a suspicious laboratory experiment. (Plot of y versus x, 0 ≤ x ≤ 10; fitted line ŷ = 21.04 + 0.12x.)
FIGURE 41.2 Data obtained from a repeated experiment with randomization to eliminate autocorrelation. (Plot of y versus x, 0 ≤ x ≤ 10; fitted line ŷ = 20.06 + 0.43x.)
One might be tempted to blame the peculiar result entirely on the low value measured at x = 6, but
the experimenters did not leap to conclusions. Discussion of the experimental procedure revealed that
the tests were done starting with x = 0 first, then with x = 1, etc., up through x = 10. The measurements
of y were also done in order of increasing concentration. It was also discovered that the injection port
of the instrument used to measure y might not have been thoroughly cleaned between each run. The
students knew about randomization, but time was short and they could complete the experiment faster
by not randomizing. The penalty was autocorrelation and a wasted experiment.
They were asked to repeat the experiment, this time randomizing the order of the runs, the order of
analyzing the specimens, and taking more care to clean the injection port. This time the data were as shown
in Figure 41.2. The regression equation is ŷ = 20.06 + 0.43x, with R² = 0.68. The confidence interval of
the slope is 0.21 to 0.65. This interval includes the expected slope of 0.5 and shows that x and y are related.
Can the dramatic difference in the outcome of the first and second experiments possibly be due to the
presence of autocorrelation in the experimental data? It is both possible and likely, in view of the lack
of randomization in the order of running the tests.
The Consequences of Autocorrelation on Regression
An important part of doing regression is obtaining a valid statement about the precision of the estimates.
Unfortunately, autocorrelation acts to destroy our ability to make such statements. If the error terms are
positively autocorrelated, the usual confidence intervals and tests using t and F distributions are no longer
strictly applicable because the variance estimates are distorted (Neter et al., 1983).
© 2002 By CRC Press LLC
Why Autocorrelation Distorts the Variance Estimates
Suppose that the system generating the data has the true underlying relation η = β0 + β1x, where x could
be any independent variable, including time, as in a time series of data. We observe n values: y1 = η + e1,
…, yi−2 = η + ei−2, yi−1 = η + ei−1, yi = η + ei, …, yn = η + en. The usual assumption is that the residuals
(ei) are independent, meaning that the value of ei is not related to ei−1, ei−2, etc. Let us examine what
happens when this is not true.
Suppose that the residuals (ei), instead of being random and independent, are correlated in a simple
way that is described by ei = ρei−1 + ai, in which the errors (ai) are independent and normally distributed
with constant variance σa². The strength of the autocorrelation is indicated by the autocorrelation
coefficient (ρ), which ranges from −1 to +1. If ρ = 0, the ei are independent. If ρ is positive, successive
values of ei are similar to each other and:

ei = ρei−1 + ai
ei−1 = ρei−2 + ai−1
ei−2 = ρei−3 + ai−2
and so on. By recursive substitution we can show that:

ei = ρ(ρei−2 + ai−1) + ai = ρ²ei−2 + ρai−1 + ai

and

ei = ρ³ei−3 + ρ²ai−2 + ρai−1 + ai
This shows that the process is “remembering” past conditions to some extent, and the strength of this
memory is reflected in the value of ρ.
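This "memory" is easy to see numerically. The sketch below (an illustration assumed here, not part of the original text) generates first-order autocorrelated errors ei = ρei−1 + ai from independent normal shocks and compares the lag-1 sample correlation for ρ = 0 and ρ = 0.8; the function and variable names are the author's inventions for this example.

```python
import random

def ar1_errors(rho, n, seed=1):
    """Generate e_i = rho * e_{i-1} + a_i with independent N(0,1) shocks a_i."""
    rng = random.Random(seed)
    e, errors = 0.0, []
    for _ in range(n):
        e = rho * e + rng.gauss(0.0, 1.0)
        errors.append(e)
    return errors

def lag1_corr(e):
    """Lag-1 sample autocorrelation coefficient of a series."""
    n = len(e)
    mean = sum(e) / n
    num = sum((e[i] - mean) * (e[i - 1] - mean) for i in range(1, n))
    den = sum((x - mean) ** 2 for x in e)
    return num / den

independent = ar1_errors(0.0, 5000)
correlated = ar1_errors(0.8, 5000)
print(round(lag1_corr(independent), 2))  # near 0
print(round(lag1_corr(correlated), 2))   # near 0.8
```

With ρ = 0 the estimated lag-1 correlation hovers near zero; with ρ = 0.8 it recovers a value near 0.8, reflecting the memory in the process.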
Reversing the order of the terms and continuing the recursive substitution gives:

ei = ai + ρai−1 + ρ²ai−2 + ρ³ai−3 + ⋅⋅⋅ + ρⁿai−n
The expected values of ai, ai−1,… are zero and so is the expected value of ei. The variance of ei and the
variance of ai, however, are not the same. The variance of ei is the sum of the variances of each term:
σe² = Var(ai) + ρ²Var(ai−1) + ρ⁴Var(ai−2) + ⋅⋅⋅ + ρ²ⁿVar(ai−n) + ⋅⋅⋅

By definition, the a's are independent, so σa² = Var(ai) = Var(ai−1) = … = Var(ai−n). Therefore, the variance
of ei is:

σe² = σa²(1 + ρ² + ρ⁴ + … + ρ²ⁿ + …)
For positive correlation (ρ > 0), the power series converges and:

1 + ρ² + ρ⁴ + … + ρ²ⁿ + … = 1/(1 − ρ²)

Thus:

σe² = σa²/(1 − ρ²)
This means that when we do not recognize and account for positive autocorrelation, the estimated
variance σe² will be larger than the true variance of the random independent errors (σa²) by the factor
1/(1 − ρ²). This inflation can be impressive. If ρ is large (i.e., ρ = 0.8), σe² = 2.8σa².
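The inflation factor can be checked by simulation. The sketch below (an assumed illustration, not from the original text) generates a long AR(1) error series with σa = 1 and ρ = 0.8 and compares the sample variance with the theoretical value 1/(1 − ρ²) ≈ 2.78.

```python
import random

# Generate AR(1) errors e_i = rho * e_{i-1} + a_i, with shocks a_i ~ N(0, 1),
# and compare Var(e) with the theoretical sigma_a^2 / (1 - rho^2).
rho, n = 0.8, 100_000
rng = random.Random(42)
e, errors = 0.0, []
for _ in range(n):
    e = rho * e + rng.gauss(0.0, 1.0)
    errors.append(e)

# The errors have mean zero in expectation, so the mean of squares
# is a serviceable variance estimate for this check.
sample_var = sum(x * x for x in errors) / n
theory = 1.0 / (1.0 - rho ** 2)   # about 2.78 for rho = 0.8
print(round(sample_var, 2), round(theory, 2))
```

The sample variance lands close to 2.78, far above the shock variance of 1, exactly the inflation the derivation predicts.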
An Example of Autocorrelated Errors
The laboratory data presented for the case study were created to illustrate the consequences of autocorrelation
on regression. The true model of the experiment is η = 20 + 0.5x. The data structure is shown
in Table 41.1. If there were no autocorrelation, the observed values would be as shown in Figure 41.2.
These are the third column in Table 41.1, which is computed as yi = 20 + 0.5xi + ai, where the ai are
independent values drawn randomly from a normal distribution with mean zero and variance of one (the
ai's actually selected have a variance of 1.00 and a mean of −0.28).
In the flawed experiment, hidden factors in the experiment were assumed to introduce autocorrelation.
The data were computed assuming that the experiment generated errors having first-order autocorrelation
with ρ = 0.8. The last three columns in Table 41.1 show how independent random errors are converted
to correlated errors. The function producing the flawed data is:
yi = η + ei = 20 + 0.5xi + 0.8ei−1 + ai
If the data were produced by the above model, but we were unaware of the autocorrelation and fit the
simpler model η = β0 + β1x, the estimates of β0 and β1 will reflect this misspecification of the model.
Perhaps more serious is the fact that t-tests and F-tests on the regression results will be wrong, so we
may be misled as to the significance or precision of estimated values. Fitting the data produced from
the autocorrelation model of the process gives yi = 21.0 + 0.12xi. The 95% confidence interval of the
slope is [−0.12 to 0.35] and the t-ratio for the slope is 1.1. Both of these results indicate the slope is not
significantly different from zero. Although the result is reported as statistically insignificant, it is wrong
because the true slope is 0.5.
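Both fits can be reproduced from the data of Table 41.1 with plain least squares. The sketch below is an assumed illustration (the `ols` helper is the author's, not from the text); it builds the independent-error and autocorrelated-error data sets from the same shocks ai and fits a straight line to each.

```python
# Random shocks a_i from Table 41.1
a = [1.0, 0.5, -0.7, 0.3, 0.0, -2.3, -1.9, 0.2, -0.3, 0.2, -0.1]
x = list(range(11))

def ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Independent errors: y = 20 + 0.5x + a
y_indep = [20 + 0.5 * xi + ai for xi, ai in zip(x, a)]

# Autocorrelated errors: e_i = 0.8 e_{i-1} + a_i, y = 20 + 0.5x + e
e, errors = 0.0, []
for ai in a:
    e = 0.8 * e + ai
    errors.append(e)
y_auto = [20 + 0.5 * xi + ei for xi, ei in zip(x, errors)]

print(ols(x, y_indep))  # roughly (20.06, 0.43), as in Figure 41.2
print(ols(x, y_auto))   # roughly (21.04, 0.12), as in Figure 41.1
```

The same shocks give a slope near the true 0.5 when the errors are independent, and a badly attenuated slope when they carry first-order autocorrelation.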
This is in contrast to what would have been obtained if the experiment had been conducted in a way
that prevented autocorrelation from entering. The data for this case are listed in the “no autocorrelation”
section of Table 41.1 and the results are shown in Table 41.2. The fitted model is yi = 20.06 + 0.43xi,
the confidence interval of the slope is [0.21 to 0.65] and the t-ratio for the slope is 4.4. The slope is
statistically significant and the true value of the slope (β = 0.5) falls within the confidence interval.
Table 41.2 summarizes the results of these two regression examples (ρ = 0 and ρ = 0.8). The Durbin-Watson statistic (explained in the next section) provided by the regression program indicates independence in the case where ρ = 0, and shows serial correlation in the other case.
TABLE 41.1
Data Created Using True Values of yi = 20 + 0.5xi + ai with ai = N(0,1)

                 No Autocorrelation              Autocorrelation, ρ = 0.8
  x      η      ai    yi = η + ai     0.8ei−1  +   ai   =   ei     yi = η + ei
  0    20.0    1.0       21.0           0.00      1.0       1.0       21.0
  1    20.5    0.5       21.0           0.80      0.5       1.3       21.8
  2    21.0   −0.7       20.3           1.04     −0.7       0.3       21.3
  3    21.5    0.3       21.8           0.27      0.3       0.6       22.1
  4    22.0    0.0       22.0           0.46      0.0       0.5       22.5
  5    22.5   −2.3       20.2           0.37     −2.3      −1.9       20.6
  6    23.0   −1.9       21.1          −1.55     −1.9      −3.4       19.6
  7    23.5    0.2       23.7          −2.76      0.2      −2.6       20.9
  8    24.0   −0.3       23.7          −2.05     −0.3      −2.3       21.7
  9    24.5    0.2       24.7          −1.88      0.2      −1.7       22.8
 10    25.0   −0.1       24.9          −1.34     −0.1      −1.4       23.6
TABLE 41.2
Summary of the Regression "Experiments"

Estimated Statistic              No Autocorrelation, ρ = 0    Autocorrelation, ρ = 0.8
Intercept                                20.1                         21.0
Slope                                    0.43                         0.12
Confidence interval of slope         [0.21, 0.65]                [−0.12, 0.35]
Standard error of slope                  0.10                         0.10
R²                                       0.68                         0.12
Mean square error                        1.06                         1.22
Durbin–Watson D                          1.38                         0.91
TABLE 41.3
Durbin-Watson Test Bounds for the 0.05 Level of Significance

           p = 2           p = 3           p = 4
  n      dL     dU       dL     dU       dL     dU
 15     1.08   1.36     0.95   1.54     0.82   1.75
 20     1.20   1.41     1.10   1.54     1.00   1.68
 25     1.29   1.45     1.21   1.55     1.12   1.66
 30     1.35   1.49     1.28   1.57     1.21   1.65
 50     1.50   1.59     1.46   1.63     1.42   1.67
Note: n = number of observations; p = number of parameters estimated
in the model.
Source: Durbin, J. and G. S. Watson (1951). Biometrika, 38, 159–178.
A Statistic to Indicate Possible Autocorrelation
Detecting autocorrelation in a small sample is difficult; sometimes it is not possible. In view of this, it
is better to design and conduct experiments to exclude autocorrelated errors. Randomization is our main
weapon against autocorrelation in designed experiments. Still, because there is a possibility of autocorrelation in the errors, most computer programs that do regression also compute the Durbin-Watson
statistic, which is based on an examination of the residual errors for autocorrelation. The Durbin-Watson
test assumes a first-order model of autocorrelation. Higher-order autocorrelation structure is possible,
but less likely than first-order, and verifying higher-order correlation would be more difficult. Even
detecting the first-order effect is difficult when the number of observations is small, and the Durbin-Watson statistic cannot always detect correlation when it exists.
The test examines whether the first-order autocorrelation parameter ρ is zero. In the case where ρ = 0,
the errors are independent. The test statistic is:
D = Σi=2..n (ei − ei−1)² / Σi=1..n ei²
where the ei are the residuals determined by fitting a model using least squares.
Durbin and Watson (1971) obtained approximate upper and lower bounds (dL and dU) on the statistic
D. If dL ≤ D ≤ dU, the test is inconclusive. However, if D > dU, conclude ρ = 0; and if D < dL, conclude
ρ > 0. A few Durbin-Watson test bounds for the 0.05 level of significance are given in Table 41.3. Note
that this test is for positive ρ. If ρ < 0, a test for negative correlation is required; the test statistic to be used
is 4 − D, where D is calculated as before.
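The statistic is simple to compute from the residuals of a least-squares fit. The sketch below is an assumed illustration (helper names are the author's): it refits the autocorrelated data of Table 41.1 and evaluates D on the residuals.

```python
# Rebuild the autocorrelated data of Table 41.1
a = [1.0, 0.5, -0.7, 0.3, 0.0, -2.3, -1.9, 0.2, -0.3, 0.2, -0.1]
x = list(range(11))
e, errors = 0.0, []
for ai in a:                      # e_i = 0.8 e_{i-1} + a_i
    e = 0.8 * e + ai
    errors.append(e)
y = [20 + 0.5 * xi + ei for xi, ei in zip(x, errors)]

# Ordinary least-squares fit of y = b0 + b1*x
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def durbin_watson(e):
    """D = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(ei ** 2 for ei in e)
    return num / den

D = durbin_watson(resid)
print(round(D, 2))  # about 0.9 — well below dL, flagging positive autocorrelation
```

For comparison, residuals that alternate in sign give D near 4 (negative correlation), while smooth, slowly drifting residuals drive D toward 0.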
TABLE 41.4
Results of Trend Analysis of Data in Figure 41.3

Result                          Time Series A             Time Series B
Generating model                yt = 10 + at              yt = 10 + 0.8et−1 + at
Fitted model                    ŷt = 9.98 + 0.005t        ŷt = 9.71 + 0.033t
Confidence interval of β1       [−0.012, 0.023]           [0.005, 0.061]
Standard error of β1            0.009                     0.014
Conclusion regarding β1         β1 = 0                    β1 > 0
Durbin-Watson statistic         2.17                      0.44
FIGURE 41.3 Time series of simulated environmental data (yt versus time in days, 0 to 50). Series A is random, normally distributed values with η = 10 and σ = 1; fitted line ŷ = 10.11. Series B was constructed using the random variates of Series A to construct serially correlated values with ρ = 0.8, to which a constant value of 10 was added; fitted line ŷ = 9.71 + 0.033t.
Autocorrelation and Trend Analysis
Sometimes we are tempted to take an existing record of environmental data (pH, temperature, etc.) and
analyze it for a trend by doing linear regression to estimate a slope. A slope statistically different from
zero is taken as evidence that some long-term change has been occurring. Resist the temptation, because
such data are almost always serially correlated. Serial correlation is autocorrelation between data that
constitute a time series. An example, similar to the regression example, helps make the point.
Figure 41.3 shows two time series of simulated environmental data. There are 50 values in each series.
The model used to construct Series A was yt = 10 + at, where at is a random, independent variable with
N(0,1). The model used to construct Series B was yt = 10 + 0.8et−1 + at. The ai are the same as in Series A,
but the ei variates are serially correlated with ρ = 0.8.
For both data sets, the true underlying trend is zero (the models contain no term for slope). If trend
is examined by fitting a model of the form η = β0 + β1t, where t is time, the results are in Table 41.4.
For Series A in Figure 41.3, the fitted model is ŷ = 9.98 + 0.005t, but the confidence interval for the
slope includes zero and we simplify the model to ŷ = 10.11, the average of the observed values.
For Series B in Figure 41.3, the fitted model is ŷ = 9.71 + 0.033t. The confidence interval of the
slope does not include zero and the nonexistent upward trend seems verified. This is caused by the serial
correlation. The serial correlation causes the time series to drift and over a short period of time this drift
looks like an upward trend. There is no reason to expect that this upward drift will continue. A series
generated with a different set of at’s could have had a downward trend. The Durbin-Watson statistic did
give the correct warning about serial correlation.
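The Series A / Series B comparison is easy to reproduce. The sketch below is an assumed illustration (the seed, sample size, and helper names are arbitrary choices, not from the text): the same shocks drive both series, only one is serially correlated, and a regression of each on time shows how the correlated series can mimic a trend while its residuals fail the Durbin-Watson check.

```python
import random

rng = random.Random(7)
n = 50
a = [rng.gauss(0.0, 1.0) for _ in range(n)]
t = list(range(1, n + 1))

series_a = [10 + ai for ai in a]          # y_t = 10 + a_t (no trend, no memory)
e, series_b = 0.0, []
for ai in a:                              # e_t = 0.8 e_{t-1} + a_t
    e = 0.8 * e + ai
    series_b.append(10 + e)               # y_t = 10 + e_t (no trend, but memory)

def fit_line(t, y):
    """Least-squares intercept and slope of y = b0 + b1*t."""
    n = len(t)
    tbar, ybar = sum(t) / n, sum(y) / n
    b1 = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y)) / sum(
        (ti - tbar) ** 2 for ti in t)
    return ybar - b1 * tbar, b1

def durbin_watson(e):
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    return num / sum(ei ** 2 for ei in e)

for name, y in (("A", series_a), ("B", series_b)):
    b0, b1 = fit_line(t, y)
    resid = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]
    print(name, round(b1, 3), round(durbin_watson(resid), 2))
```

Re-running with different seeds shows the point made above: the "trend" in the correlated series changes size and even sign from one realization to the next, while the Durbin-Watson statistic stays low.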
Comments
We have seen that autocorrelation can cause serious problems in regression. The Durbin-Watson statistic
might indicate when there is cause to worry about autocorrelation. It will not always detect autocorrelation, and it is especially likely to fail when the data set is small. Even when autocorrelation is revealed
as a problem, it is too late to eliminate it from the data and one faces the task of deciding how to model it.
The pitfalls inherent with autocorrelated errors provide a strong incentive to plan experiments to
include proper randomization whenever possible. If an experiment is intended to define a relationship
between x and y, the experiments should not be conducted by gradually increasing (or decreasing) the
x’s. Randomize over the settings of x to eliminate autocorrelation due to time effects in the experiments.
Chapter 51 discusses how to deal with serial correlation.
References
Box, G. E. P., W. G. Hunter, and J. S. Hunter (1978). Statistics for Experimenters: An Introduction to Design,
Data Analysis, and Model Building, New York, Wiley Interscience.
Durbin, J. and G. S. Watson (1951). “Testing for Serial Correlation in Least Squares Regression, II,” Biometrika, 38, 159–178.
Durbin, J. and G. S. Watson (1971). “Testing for Serial Correlation in Least Squares Regression, III,”
Biometrika, 58, 1–19.
Neter, J., W. Wasserman, and M. H. Kutner (1983). Applied Regression Models, Homewood, IL, Richard D.
Irwin Co.
Exercises
41.1 Blood Lead. The data below relate the lead level measured in the umbilical cord blood of
infants born in a Boston hospital in 1980 and 1981 to the total amount of leaded gasoline
sold in Massachusetts in the same months. Do you think autocorrelation might be a problem
in this data set? Do you think the blood levels are related directly to the gasoline sales in the
month of birth, or to gasoline sales in the previous several months? How would this influence
your model building strategy?
Month/Year    Leaded Gasoline Sold    Pb in Umbilical Cord Blood (µg/dL)
3/1980               141                          6.4
4/1980               166                          6.1
5/1980               161                          5.7
6/1980               170                          6.9
7/1980               148                          7.0
8/1980               136                          7.2
9/1980               169                          6.6
10/1980              109                          5.7
11/1980              117                          5.7
12/1980               87                          5.3
1/1981               105                          4.9
2/1981                73                          5.4
3/1981                82                          4.5
4/1981                75                          6.0
41.2 pH Trend. Below are n = 40 observations of pH in a mountain stream that may be affected
by acidic precipitation. The observations are weekly averages made 3 months apart, giving
a record that covers 10 years. Discuss the problems inherent in analyzing these data to assess
whether there is a trend toward lower pH due to acid rain.
6.8  6.6  6.8  7.1  6.9  6.6  6.5  6.8  6.7  7.0
6.8  6.7  6.8  6.7  6.7  6.9  6.9  6.9  6.8  6.9
6.7  6.7  6.8  6.6  6.9  6.4  6.7  6.4  6.9  7.0
6.8  7.0  6.7  6.9  6.9  7.0  6.7  6.8  7.0  6.9
41.3 Simulation. Simulate several time series of n = 50 using yt = 10 + 0.8et−1 + at for different
series of at, where at = N(0,1). Fit the series using linear regression and discuss your results.
41.4 Laboratory Experiment. Describe a laboratory experiment, perhaps one that you have done,
in which autocorrelation could be present. Explain how randomization would protect against
the conclusions being affected by the correlation.
1592_frame_C_42 Page 373 Tuesday, December 18, 2001 3:25 PM
42
The Iterative Approach to Experimentation
KEY WORDS: biokinetics, chemostat, dilution rate, experimental design, factorial designs, iterative design, model building, Monod model, parameter estimation, sequential design
The dilemma of model building is that what needs to be known in order to design good experiments is
exactly what the experiments are supposed to discover. We could be easily frustrated by this if we
imagined that success depended on one grand experiment. Life, science, and statistics do not work this
way. Knowledge is gained in small steps. We begin with a modest experiment that produces information
we can use to design the second experiment, which leads to a third, etc. Between each step there is need
for reflection, study, and creative thinking. Experimental design, then, is a philosophy as much as a
technique.
The iterative (or sequential) philosophy of experimental investigation diagrammed in Figure 42.1
applies to mechanistic model building and to empirical exploration of operating conditions (Chapter 43).
The iterative approach is illustrated for an experiment in which each observation requires a considerable
investment.
Case Study: Bacterial Growth
The material balance equations for substrate (S) and bacterial solids (X) in a completely mixed reactor
operated without recycle are:

Material balance on bacterial solids:   V dX/dt = 0 − QX + [θ1S/(θ2 + S)]XV

Material balance on substrate:   V dS/dt = QS0 − QS − (1/θ3)[θ1S/(θ2 + S)]XV
where Q = liquid flow rate, V = reactor volume, D = Q/V = dilution rate, S0 = influent substrate
concentration, X = bacterial solids concentration in the reactor and in the effluent, and S = substrate
concentration in the reactor and in the effluent. The parameters of the Monod model for bacterial growth
are the maximum growth rate (θ 1); the half-saturation constant (θ 2), and the yield coefficient (θ 3). This
assumes there are no bacterial solids in the influent.
After dividing by V, the equations are written more conveniently as:

dX/dt = θ1SX/(θ2 + S) − DX

and

dS/dt = DS0 − DS − (θ1X/θ3)[S/(θ2 + S)]
The steady-state solutions (dX/dt = 0 and dS/dt = 0) of the equations are:

S = θ2D/(θ1 − D)

and

X = θ3(S0 − S) = θ3[S0 − θ2D/(θ1 − D)]
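The steady-state solutions can be evaluated directly. The sketch below is an assumed illustration, not part of the original text: it uses the chapter's initial parameter guesses (θ1 = 0.70, θ2 = 200, θ3 = 0.50) with S0 = 3000 mg/L, and the dilution rates 0.35 and 0.60 are illustrative choices.

```python
def steady_state(D, theta1, theta2, theta3, S0):
    """Steady-state substrate S and solids X; returns (S0, 0) after washout."""
    if D >= theta1:
        return S0, 0.0            # growth cannot keep up at any S
    S = theta2 * D / (theta1 - D)
    if S >= S0:                   # predicted S exceeds the feed: washout
        return S0, 0.0
    X = theta3 * (S0 - S)
    return S, X

for D in (0.35, 0.60):
    S, X = steady_state(D, 0.70, 200.0, 0.50, 3000.0)
    print(D, round(S, 1), round(X, 1))
```

At D = 0.35 the model predicts S = 200 and X = 1400; pushing D toward θ1 drives S up sharply and X down, which is why operation near the critical dilution rate is so delicate.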
FIGURE 42.1 The iterative cycle of experimentation: design experiment → collect data → fit model to estimate parameters → plot residuals and confidence regions → does the model fit? If no, design a new experiment or collect more data; if yes, stop the experiments. (From Box, G. E. P. and W. G. Hunter (1965). Technometrics, 7, 23.)
If the dilution rate is sufficiently large, the organisms will be washed out of the reactor faster than they
can grow. If all the organisms are washed out, the effluent concentration will equal the influent concentration, S = S0 . The lowest dilution rate at which washout occurs is called the critical dilution rate (Dc )
which is derived by substituting S = S0 into the substrate model above:
Dc = θ1S0/(θ2 + S0)

When S0 >> θ2, which is often the case, Dc ≈ θ1.
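As a quick check (an assumed illustration, not from the text), evaluating this formula with the initial parameter guesses used later in the chapter (θ1 = 0.70, θ2 = 200, S0 = 3000) shows how close Dc sits to θ1 when S0 >> θ2:

```python
def critical_dilution(theta1, theta2, S0):
    """Lowest dilution rate at which washout occurs: Dc = theta1*S0/(theta2 + S0)."""
    return theta1 * S0 / (theta2 + S0)

Dc = critical_dilution(0.70, 200.0, 3000.0)
print(round(Dc, 3))  # 0.656 — close to theta1 = 0.70 because S0 >> theta2
```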
Experiments will be performed at several dilution rates (i.e., flow rates), while keeping the influent
substrate concentration constant (S0 = 3000 mg/L). When the reactor attains steady-state at the selected
dilution rate, X and S will be measured and the parameters θ1, θ 2, and θ 3 will be estimated. Because
several weeks may be needed to start a reactor and bring it to steady-state conditions, the experimenter
naturally wants to get as much information as possible from each run. Here is how the iterative approach
can be used to do this.
Assume that the experimenter has only two reactors and can test only two dilution rates simultaneously.
Because two responses (X and S) are measured, the two experimental runs provide four data points (X1
and S1 at D1; X2 and S 2 at D2), and this provides enough information to estimate the three parameters in
the model. The first two runs provide a basis for another two runs, etc., until the model parameters have
been estimated with sufficient precision.
Three iterations of the experimental cycle are shown in Table 42.1. An initial guess of parameter
values is used to start the first iteration. Thereafter, estimates based on experimental data are used. The
initial guesses of parameter values were θ 3 = 0.50, θ 1 = 0.70, and θ 2 = 200. This led to selecting flow
rate D1 = 0.66 for one run and D2 = 0.35 for the other.
The experimental design criterion for choosing efficient experimental settings of D is ignored for now
because our purpose is merely to show the efficiency of iterative experimentation. We will simply say
that it recommends doing two runs, one with the dilution rate set as near the critical value Dc as the
experimenter dares to operate, and the other at about half this value. At any stage in the experimental
cycle, the best current estimate of the critical flow rate is Dc = θ 1. The experimenter must be cautious
in using this advice because operating conditions become unstable as Dc is approached. If the actual
critical dilution rate is exceeded, the experiment fails entirely and the reactor has to be restarted, at a
considerable loss of time. On the other hand, staying too far on the safe side (keeping the dilution rate
too low) will yield poor estimates of the parameters, especially of θ 1. In this initial stage of the experiment
we should not be too bold.