L1592_Frame_C41 Page 366 Tuesday, December 18, 2001 3:24 PM
FIGURE 41.1 The original data from a suspicious laboratory experiment. (Plot of y versus x, 0 ≤ x ≤ 10; fitted line ŷ = 21.04 + 0.12x.)
FIGURE 41.2 Data obtained from a repeated experiment with randomization to eliminate autocorrelation. (Plot of y versus x, 0 ≤ x ≤ 10; fitted line ŷ = 20.06 + 0.43x.)
One might be tempted to blame the peculiar result entirely on the low value measured at x = 6, but
the experimenters did not leap to conclusions. Discussion of the experimental procedure revealed that
the tests were done starting with x = 0 first, then with x = 1, etc., up through x = 10. The measurements
of y were also done in order of increasing concentration. It was also discovered that the injection port
of the instrument used to measure y might not have been thoroughly cleaned between each run. The
students knew about randomization, but time was short and they could complete the experiment faster
by not randomizing. The penalty was autocorrelation and a wasted experiment.
They were asked to repeat the experiment, this time randomizing the order of the runs, the order of
analyzing the specimens, and taking more care to clean the injection port. This time the data were as shown
in Figure 41.2. The regression equation is ŷ = 20.06 + 0.43x, with R² = 0.68. The confidence interval of
the slope is 0.21 to 0.65. This interval includes the expected slope of 0.5 and shows that x and y are related.
Can the dramatic difference in the outcome of the first and second experiments possibly be due to the
presence of autocorrelation in the experimental data? It is both possible and likely, in view of the lack
of randomization in the order of running the tests.
The Consequences of Autocorrelation on Regression
An important part of doing regression is obtaining a valid statement about the precision of the estimates.
Unfortunately, autocorrelation acts to destroy our ability to make such statements. If the error terms are
positively autocorrelated, the usual confidence intervals and tests using t and F distributions are no longer
strictly applicable because the variance estimates are distorted (Neter et al., 1983).
© 2002 By CRC Press LLC
Why Autocorrelation Distorts the Variance Estimates
Suppose that the system generating the data has the true underlying relation η = β0 + β1x, where x could
be any independent variable, including time, as in a time series of data. We observe n values: y1 = η + e1,
…, yi−2 = η + ei−2, yi−1 = η + ei−1, yi = η + ei, …, yn = η + en. The usual assumption is that the residuals
(ei) are independent, meaning that the value of ei is not related to ei−1, ei−2, etc. Let us examine what
happens when this is not true.
Suppose that the residuals (ei), instead of being random and independent, are correlated in a simple
way that is described by ei = ρei−1 + ai, in which the errors (ai) are independent and normally distributed
with constant variance σa². The strength of the autocorrelation is indicated by the autocorrelation
coefficient (ρ), which ranges from −1 to +1. If ρ = 0, the ei are independent. If ρ is positive, successive
values of ei are similar to each other and:

ei = ρei−1 + ai
ei−1 = ρei−2 + ai−1
ei−2 = ρei−3 + ai−2
and so on. By recursive substitution we can show that:

ei = ρ(ρei−2 + ai−1) + ai = ρ²ei−2 + ρai−1 + ai

and

ei = ρ³ei−3 + ρ²ai−2 + ρai−1 + ai
This shows that the process is “remembering” past conditions to some extent, and the strength of this
memory is reflected in the value of ρ.
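This "memory" is easy to see numerically. The sketch below (an illustration assumed here, not part of the original text) generates first-order autocorrelated errors ei = ρei−1 + ai from independent normal shocks and compares the lag-1 sample correlation for ρ = 0 and ρ = 0.8; the function and variable names are the author's inventions for this example.

```python
import random

def ar1_errors(rho, n, seed=1):
    """Generate e_i = rho * e_{i-1} + a_i with independent N(0,1) shocks a_i."""
    rng = random.Random(seed)
    e, errors = 0.0, []
    for _ in range(n):
        e = rho * e + rng.gauss(0.0, 1.0)
        errors.append(e)
    return errors

def lag1_corr(e):
    """Lag-1 sample autocorrelation coefficient of a series."""
    n = len(e)
    mean = sum(e) / n
    num = sum((e[i] - mean) * (e[i - 1] - mean) for i in range(1, n))
    den = sum((x - mean) ** 2 for x in e)
    return num / den

independent = ar1_errors(0.0, 5000)
correlated = ar1_errors(0.8, 5000)
print(round(lag1_corr(independent), 2))  # near 0
print(round(lag1_corr(correlated), 2))   # near 0.8
```

With ρ = 0 the estimated lag-1 correlation hovers near zero; with ρ = 0.8 it recovers a value near 0.8, reflecting the memory in the process.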
Reversing the order of the terms and continuing the recursive substitution gives:

ei = ai + ρai−1 + ρ²ai−2 + ρ³ai−3 + ⋅⋅⋅ + ρⁿai−n
The expected values of ai, ai−1,… are zero and so is the expected value of ei. The variance of ei and the
variance of ai, however, are not the same. The variance of ei is the sum of the variances of each term:
σe² = Var(ai) + ρ²Var(ai−1) + ρ⁴Var(ai−2) + ⋅⋅⋅ + ρ²ⁿVar(ai−n) + ⋅⋅⋅

By definition, the a's are independent, so σa² = Var(ai) = Var(ai−1) = … = Var(ai−n). Therefore, the variance
of ei is:

σe² = σa²(1 + ρ² + ρ⁴ + … + ρ²ⁿ + …)
For positive correlation (ρ > 0), the power series converges and:

1 + ρ² + ρ⁴ + … + ρ²ⁿ + … = 1/(1 − ρ²)

Thus:

σe² = σa²/(1 − ρ²)
This means that when we do not recognize and account for positive autocorrelation, the estimated
variance σe² will be larger than the true variance of the random independent errors (σa²) by the factor
1/(1 − ρ²). This inflation can be impressive. If ρ is large (i.e., ρ = 0.8), σe² = 2.8σa².
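The inflation factor can be checked by simulation. The sketch below (an assumed illustration, not from the original text) generates a long AR(1) error series with σa = 1 and ρ = 0.8 and compares the sample variance with the theoretical value 1/(1 − ρ²) ≈ 2.78.

```python
import random

# Generate AR(1) errors e_i = rho * e_{i-1} + a_i, with shocks a_i ~ N(0, 1),
# and compare Var(e) with the theoretical sigma_a^2 / (1 - rho^2).
rho, n = 0.8, 100_000
rng = random.Random(42)
e, errors = 0.0, []
for _ in range(n):
    e = rho * e + rng.gauss(0.0, 1.0)
    errors.append(e)

# The errors have mean zero in expectation, so the mean of squares
# is a serviceable variance estimate for this check.
sample_var = sum(x * x for x in errors) / n
theory = 1.0 / (1.0 - rho ** 2)   # about 2.78 for rho = 0.8
print(round(sample_var, 2), round(theory, 2))
```

The sample variance lands close to 2.78, far above the shock variance of 1, exactly the inflation the derivation predicts.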
An Example of Autocorrelated Errors
The laboratory data presented for the case study were created to illustrate the consequences of autocorrelation
on regression. The true model of the experiment is η = 20 + 0.5x. The data structure is shown
in Table 41.1. If there were no autocorrelation, the observed values would be as shown in Figure 41.2.
These are the third column in Table 41.1, which is computed as yi = 20 + 0.5xi + ai, where the ai are
independent values drawn randomly from a normal distribution with mean zero and variance of one (the
ai's actually selected have a variance of 1.00 and a mean of −0.28).
In the flawed experiment, hidden factors in the experiment were assumed to introduce autocorrelation.
The data were computed assuming that the experiment generated errors having first-order autocorrelation
with ρ = 0.8. The last three columns in Table 41.1 show how independent random errors are converted
to correlated errors. The function producing the flawed data is:
yi = η + ei = 20 + 0.5xi + 0.8ei−1 + ai
If the data were produced by the above model, but we were unaware of the autocorrelation and fit the
simpler model η = β0 + β1x, the estimates of β0 and β1 will reflect this misspecification of the model.
Perhaps more serious is the fact that t-tests and F-tests on the regression results will be wrong, so we
may be misled as to the significance or precision of estimated values. Fitting the data produced from
the autocorrelation model of the process gives yi = 21.0 + 0.12xi. The 95% confidence interval of the
slope is [−0.12 to 0.35] and the t-ratio for the slope is 1.1. Both of these results indicate the slope is not
significantly different from zero. Although the result is reported as statistically insignificant, it is wrong
because the true slope is 0.5.
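Both fits can be reproduced from the data of Table 41.1 with plain least squares. The sketch below is an assumed illustration (the `ols` helper is the author's, not from the text); it builds the independent-error and autocorrelated-error data sets from the same shocks ai and fits a straight line to each.

```python
# Random shocks a_i from Table 41.1
a = [1.0, 0.5, -0.7, 0.3, 0.0, -2.3, -1.9, 0.2, -0.3, 0.2, -0.1]
x = list(range(11))

def ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Independent errors: y = 20 + 0.5x + a
y_indep = [20 + 0.5 * xi + ai for xi, ai in zip(x, a)]

# Autocorrelated errors: e_i = 0.8 e_{i-1} + a_i, y = 20 + 0.5x + e
e, errors = 0.0, []
for ai in a:
    e = 0.8 * e + ai
    errors.append(e)
y_auto = [20 + 0.5 * xi + ei for xi, ei in zip(x, errors)]

print(ols(x, y_indep))  # roughly (20.06, 0.43), as in Figure 41.2
print(ols(x, y_auto))   # roughly (21.04, 0.12), as in Figure 41.1
```

The same shocks give a slope near the true 0.5 when the errors are independent, and a badly attenuated slope when they carry first-order autocorrelation.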
This is in contrast to what would have been obtained if the experiment had been conducted in a way
that prevented autocorrelation from entering. The data for this case are listed in the “no autocorrelation”
section of Table 41.1 and the results are shown in Table 41.2. The fitted model is yi = 20.06 + 0.43xi,
the confidence interval of the slope is [0.21 to 0.65] and the t-ratio for the slope is 4.4. The slope is
statistically significant and the true value of the slope (β = 0.5) falls within the confidence interval.
Table 41.2 summarizes the results of these two regression examples (ρ = 0 and ρ = 0.8). The Durbin-Watson statistic (explained in the next section) provided by the regression program indicates independence in the case where ρ = 0, and shows serial correlation in the other case.
TABLE 41.1
Data Created Using True Values of yi = 20 + 0.5xi + ai with ai = N(0,1)

                 No Autocorrelation              Autocorrelation, ρ = 0.8
  x      η      ai    yi = η + ai     0.8ei−1  +   ai   =   ei     yi = η + ei
  0    20.0    1.0       21.0           0.00      1.0       1.0       21.0
  1    20.5    0.5       21.0           0.80      0.5       1.3       21.8
  2    21.0   −0.7       20.3           1.04     −0.7       0.3       21.3
  3    21.5    0.3       21.8           0.27      0.3       0.6       22.1
  4    22.0    0.0       22.0           0.46      0.0       0.5       22.5
  5    22.5   −2.3       20.2           0.37     −2.3      −1.9       20.6
  6    23.0   −1.9       21.1          −1.55     −1.9      −3.4       19.6
  7    23.5    0.2       23.7          −2.76      0.2      −2.6       20.9
  8    24.0   −0.3       23.7          −2.05     −0.3      −2.3       21.7
  9    24.5    0.2       24.7          −1.88      0.2      −1.7       22.8
 10    25.0   −0.1       24.9          −1.34     −0.1      −1.4       23.6
TABLE 41.2
Summary of the Regression "Experiments"

Estimated Statistic              No Autocorrelation, ρ = 0    Autocorrelation, ρ = 0.8
Intercept                                20.1                         21.0
Slope                                    0.43                         0.12
Confidence interval of slope         [0.21, 0.65]                [−0.12, 0.35]
Standard error of slope                  0.10                         0.10
R²                                       0.68                         0.12
Mean square error                        1.06                         1.22
Durbin–Watson D                          1.38                         0.91
TABLE 41.3
Durbin-Watson Test Bounds for the 0.05 Level of Significance

           p = 2           p = 3           p = 4
  n      dL     dU       dL     dU       dL     dU
 15     1.08   1.36     0.95   1.54     0.82   1.75
 20     1.20   1.41     1.10   1.54     1.00   1.68
 25     1.29   1.45     1.21   1.55     1.12   1.66
 30     1.35   1.49     1.28   1.57     1.21   1.65
 50     1.50   1.59     1.46   1.63     1.42   1.67
Note: n = number of observations; p = number of parameters estimated
in the model.
Source: Durbin, J. and G. S. Watson (1951). Biometrika, 38, 159–178.
A Statistic to Indicate Possible Autocorrelation
Detecting autocorrelation in a small sample is difficult; sometimes it is not possible. In view of this, it
is better to design and conduct experiments to exclude autocorrelated errors. Randomization is our main
weapon against autocorrelation in designed experiments. Still, because there is a possibility of autocorrelation in the errors, most computer programs that do regression also compute the Durbin-Watson
statistic, which is based on an examination of the residual errors for autocorrelation. The Durbin-Watson
test assumes a first-order model of autocorrelation. Higher-order autocorrelation structure is possible,
but less likely than first-order, and verifying higher-order correlation would be more difficult. Even
detecting the first-order effect is difficult when the number of observations is small, and the Durbin-Watson statistic cannot always detect correlation when it exists.
The test examines whether the first-order autocorrelation parameter ρ is zero. In the case where ρ = 0,
the errors are independent. The test statistic is:
D = Σi=2..n (ei − ei−1)² / Σi=1..n ei²
where the ei are the residuals determined by fitting a model using least squares.
Durbin and Watson (1971) obtained approximate upper and lower bounds (dL and dU) on the statistic
D. If dL ≤ D ≤ dU, the test is inconclusive. However, if D > dU, conclude ρ = 0; and if D < dL, conclude
ρ > 0. A few Durbin-Watson test bounds for the 0.05 level of significance are given in Table 41.3. Note
that this test is for positive ρ. If ρ < 0, a test for negative correlation is required; the test statistic to be used
is 4 − D, where D is calculated as before.
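The statistic is simple to compute from the residuals of a least-squares fit. The sketch below is an assumed illustration (helper names are the author's): it refits the autocorrelated data of Table 41.1 and evaluates D on the residuals.

```python
# Rebuild the autocorrelated data of Table 41.1
a = [1.0, 0.5, -0.7, 0.3, 0.0, -2.3, -1.9, 0.2, -0.3, 0.2, -0.1]
x = list(range(11))
e, errors = 0.0, []
for ai in a:                      # e_i = 0.8 e_{i-1} + a_i
    e = 0.8 * e + ai
    errors.append(e)
y = [20 + 0.5 * xi + ei for xi, ei in zip(x, errors)]

# Ordinary least-squares fit of y = b0 + b1*x
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def durbin_watson(e):
    """D = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(ei ** 2 for ei in e)
    return num / den

D = durbin_watson(resid)
print(round(D, 2))  # about 0.9 — well below dL, flagging positive autocorrelation
```

For comparison, residuals that alternate in sign give D near 4 (negative correlation), while smooth, slowly drifting residuals drive D toward 0.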
TABLE 41.4
Results of Trend Analysis of Data in Figure 41.3

Result                          Time Series A             Time Series B
Generating model                yt = 10 + at              yt = 10 + 0.8et−1 + at
Fitted model                    ŷt = 9.98 + 0.005t        ŷt = 9.71 + 0.033t
Confidence interval of β1       [−0.012, 0.023]           [0.005, 0.061]
Standard error of β1            0.009                     0.014
Conclusion regarding β1         β1 = 0                    β1 > 0
Durbin-Watson statistic         2.17                      0.44
FIGURE 41.3 Time series of simulated environmental data (yt versus time in days, 0 to 50). Series A is random, normally distributed values with η = 10 and σ = 1; fitted line ŷ = 10.11. Series B was constructed using the random variates of Series A to construct serially correlated values with ρ = 0.8, to which a constant value of 10 was added; fitted line ŷ = 9.71 + 0.033t.
Autocorrelation and Trend Analysis
Sometimes we are tempted to take an existing record of environmental data (pH, temperature, etc.) and
analyze it for a trend by doing linear regression to estimate a slope. A slope statistically different from
zero is taken as evidence that some long-term change has been occurring. Resist the temptation, because
such data are almost always serially correlated. Serial correlation is autocorrelation between data that
constitute a time series. An example, similar to the regression example, helps make the point.
Figure 41.3 shows two time series of simulated environmental data. There are 50 values in each series.
The model used to construct Series A was yt = 10 + at, where at is a random, independent variable with
N(0,1). The model used to construct Series B was yt = 10 + 0.8et−1 + at. The ai are the same as in Series A,
but the ei variates are serially correlated with ρ = 0.8.
For both data sets, the true underlying trend is zero (the models contain no term for slope). If trend
is examined by fitting a model of the form η = β0 + β1t, where t is time, the results are in Table 41.4.
For Series A in Figure 41.3, the fitted model is ŷ = 9.98 + 0.005t, but the confidence interval for the
slope includes zero and we simplify the model to ŷ = 10.11, the average of the observed values.
For Series B in Figure 41.3, the fitted model is ŷ = 9.71 + 0.033t. The confidence interval of the
slope does not include zero and the nonexistent upward trend seems verified. This is caused by the serial
correlation. The serial correlation causes the time series to drift and over a short period of time this drift
looks like an upward trend. There is no reason to expect that this upward drift will continue. A series
generated with a different set of at’s could have had a downward trend. The Durbin-Watson statistic did
give the correct warning about serial correlation.
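The Series A / Series B comparison is easy to reproduce. The sketch below is an assumed illustration (the seed, sample size, and helper names are arbitrary choices, not from the text): the same shocks drive both series, only one is serially correlated, and a regression of each on time shows how the correlated series can mimic a trend while its residuals fail the Durbin-Watson check.

```python
import random

rng = random.Random(7)
n = 50
a = [rng.gauss(0.0, 1.0) for _ in range(n)]
t = list(range(1, n + 1))

series_a = [10 + ai for ai in a]          # y_t = 10 + a_t (no trend, no memory)
e, series_b = 0.0, []
for ai in a:                              # e_t = 0.8 e_{t-1} + a_t
    e = 0.8 * e + ai
    series_b.append(10 + e)               # y_t = 10 + e_t (no trend, but memory)

def fit_line(t, y):
    """Least-squares intercept and slope of y = b0 + b1*t."""
    n = len(t)
    tbar, ybar = sum(t) / n, sum(y) / n
    b1 = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y)) / sum(
        (ti - tbar) ** 2 for ti in t)
    return ybar - b1 * tbar, b1

def durbin_watson(e):
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    return num / sum(ei ** 2 for ei in e)

for name, y in (("A", series_a), ("B", series_b)):
    b0, b1 = fit_line(t, y)
    resid = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]
    print(name, round(b1, 3), round(durbin_watson(resid), 2))
```

Re-running with different seeds shows the point made above: the "trend" in the correlated series changes size and even sign from one realization to the next, while the Durbin-Watson statistic stays low.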
Comments
We have seen that autocorrelation can cause serious problems in regression. The Durbin-Watson statistic
might indicate when there is cause to worry about autocorrelation. It will not always detect autocorrelation, and it is especially likely to fail when the data set is small. Even when autocorrelation is revealed
as a problem, it is too late to eliminate it from the data and one faces the task of deciding how to model it.
The pitfalls inherent with autocorrelated errors provide a strong incentive to plan experiments to
include proper randomization whenever possible. If an experiment is intended to define a relationship
between x and y, the experiments should not be conducted by gradually increasing (or decreasing) the
x’s. Randomize over the settings of x to eliminate autocorrelation due to time effects in the experiments.
Chapter 51 discusses how to deal with serial correlation.
References
Box, G. E. P., W. G. Hunter, and J. S. Hunter (1978). Statistics for Experimenters: An Introduction to Design,
Data Analysis, and Model Building, New York, Wiley Interscience.
Durbin, J. and G. S. Watson (1951). “Testing for Serial Correlation in Least Squares Regression, II,” Biometrika, 38, 159–178.
Durbin, J. and G. S. Watson (1971). “Testing for Serial Correlation in Least Squares Regression, III,”
Biometrika, 58, 1–19.
Neter, J., W. Wasserman, and M. H. Kutner (1983). Applied Regression Models, Homewood, IL, Richard D.
Irwin Co.
Exercises
41.1 Blood Lead. The data below relate the lead level measured in the umbilical cord blood of
infants born in a Boston hospital in 1980 and 1981 to the total amount of leaded gasoline
sold in Massachusetts in the same months. Do you think autocorrelation might be a problem
in this data set? Do you think the blood levels are related directly to the gasoline sales in the
month of birth, or to gasoline sales in the previous several months? How would this influence
your model building strategy?
Month/Year    Leaded Gasoline Sold    Pb in Umbilical Cord Blood (µg/dL)
3/1980               141                          6.4
4/1980               166                          6.1
5/1980               161                          5.7
6/1980               170                          6.9
7/1980               148                          7.0
8/1980               136                          7.2
9/1980               169                          6.6
10/1980              109                          5.7
11/1980              117                          5.7
12/1980               87                          5.3
1/1981               105                          4.9
2/1981                73                          5.4
3/1981                82                          4.5
4/1981                75                          6.0
41.2 pH Trend. Below are n = 40 observations of pH in a mountain stream that may be affected
by acidic precipitation. The observations are weekly averages made 3 months apart, giving
a record that covers 10 years. Discuss the problems inherent in analyzing these data to assess
whether there is a trend toward lower pH due to acid rain.
6.8  6.6  6.8  7.1  6.9  6.6  6.5  6.8  6.7  7.0
6.8  6.7  6.8  6.7  6.7  6.9  6.9  6.9  6.8  6.9
6.7  6.7  6.8  6.6  6.9  6.4  6.7  6.4  6.9  7.0
6.8  7.0  6.7  6.9  6.9  7.0  6.7  6.8  7.0  6.9
41.3 Simulation. Simulate several time series of n = 50 using yt = 10 + 0.8et−1 + at for different
series of at, where at = N(0,1). Fit the series using linear regression and discuss your results.
41.4 Laboratory Experiment. Describe a laboratory experiment, perhaps one that you have done,
in which autocorrelation could be present. Explain how randomization would protect against
the conclusions being affected by the correlation.
1592_frame_C_42 Page 373 Tuesday, December 18, 2001 3:25 PM
42
The Iterative Approach to Experimentation
KEY WORDS: biokinetics, chemostat, dilution rate, experimental design, factorial designs, iterative design, model building, Monod model, parameter estimation, sequential design
The dilemma of model building is that what needs to be known in order to design good experiments is
exactly what the experiments are supposed to discover. We could be easily frustrated by this if we
imagined that success depended on one grand experiment. Life, science, and statistics do not work this
way. Knowledge is gained in small steps. We begin with a modest experiment that produces information
we can use to design the second experiment, which leads to a third, etc. Between each step there is need
for reflection, study, and creative thinking. Experimental design, then, is a philosophy as much as a
technique.
The iterative (or sequential) philosophy of experimental investigation diagrammed in Figure 42.1
applies to mechanistic model building and to empirical exploration of operating conditions (Chapter 43).
The iterative approach is illustrated for an experiment in which each observation requires a considerable
investment.
Case Study: Bacterial Growth
The material balance equations for substrate (S) and bacterial solids (X) in a completely mixed reactor
operated without recycle are:

Material balance on bacterial solids:   V dX/dt = 0 − QX + [θ1S/(θ2 + S)]XV

Material balance on substrate:   V dS/dt = QS0 − QS − (1/θ3)[θ1S/(θ2 + S)]XV
where Q = liquid flow rate, V = reactor volume, D = Q/V = dilution rate, S0 = influent substrate
concentration, X = bacterial solids concentration in the reactor and in the effluent, and S = substrate
concentration in the reactor and in the effluent. The parameters of the Monod model for bacterial growth
are the maximum growth rate (θ 1); the half-saturation constant (θ 2), and the yield coefficient (θ 3). This
assumes there are no bacterial solids in the influent.
After dividing by V, the equations are written more conveniently as:

dX/dt = θ1SX/(θ2 + S) − DX

and

dS/dt = DS0 − DS − (θ1X/θ3)[S/(θ2 + S)]
The steady-state solutions (dX/dt = 0 and dS/dt = 0) of the equations are:

S = θ2D/(θ1 − D)

and

X = θ3(S0 − S) = θ3[S0 − θ2D/(θ1 − D)]
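The steady-state solutions can be evaluated directly. The sketch below is an assumed illustration, not part of the original text: it uses the chapter's initial parameter guesses (θ1 = 0.70, θ2 = 200, θ3 = 0.50) with S0 = 3000 mg/L, and the dilution rates 0.35 and 0.60 are illustrative choices.

```python
def steady_state(D, theta1, theta2, theta3, S0):
    """Steady-state substrate S and solids X; returns (S0, 0) after washout."""
    if D >= theta1:
        return S0, 0.0            # growth cannot keep up at any S
    S = theta2 * D / (theta1 - D)
    if S >= S0:                   # predicted S exceeds the feed: washout
        return S0, 0.0
    X = theta3 * (S0 - S)
    return S, X

for D in (0.35, 0.60):
    S, X = steady_state(D, 0.70, 200.0, 0.50, 3000.0)
    print(D, round(S, 1), round(X, 1))
```

At D = 0.35 the model predicts S = 200 and X = 1400; pushing D toward θ1 drives S up sharply and X down, which is why operation near the critical dilution rate is so delicate.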
FIGURE 42.1 The iterative cycle of experimentation: design experiment → collect data → fit model to estimate parameters → plot residuals and confidence regions → does the model fit? If no, design a new experiment or collect more data; if yes, stop the experiments. (From Box, G. E. P. and W. G. Hunter (1965). Technometrics, 7, 23.)
If the dilution rate is sufficiently large, the organisms will be washed out of the reactor faster than they
can grow. If all the organisms are washed out, the effluent concentration will equal the influent concentration, S = S0 . The lowest dilution rate at which washout occurs is called the critical dilution rate (Dc )
which is derived by substituting S = S0 into the substrate model above:
Dc = θ1S0/(θ2 + S0)

When S0 >> θ2, which is often the case, Dc ≈ θ1.
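As a quick check (an assumed illustration, not from the text), evaluating this formula with the initial parameter guesses used later in the chapter (θ1 = 0.70, θ2 = 200, S0 = 3000) shows how close Dc sits to θ1 when S0 >> θ2:

```python
def critical_dilution(theta1, theta2, S0):
    """Lowest dilution rate at which washout occurs: Dc = theta1*S0/(theta2 + S0)."""
    return theta1 * S0 / (theta2 + S0)

Dc = critical_dilution(0.70, 200.0, 3000.0)
print(round(Dc, 3))  # 0.656 — close to theta1 = 0.70 because S0 >> theta2
```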
Experiments will be performed at several dilution rates (i.e., flow rates), while keeping the influent
substrate concentration constant (S0 = 3000 mg/L). When the reactor attains steady-state at the selected
dilution rate, X and S will be measured and the parameters θ1, θ 2, and θ 3 will be estimated. Because
several weeks may be needed to start a reactor and bring it to steady-state conditions, the experimenter
naturally wants to get as much information as possible from each run. Here is how the iterative approach
can be used to do this.
Assume that the experimenter has only two reactors and can test only two dilution rates simultaneously.
Because two responses (X and S) are measured, the two experimental runs provide four data points (X1
and S1 at D1; X2 and S 2 at D2), and this provides enough information to estimate the three parameters in
the model. The first two runs provide a basis for another two runs, etc., until the model parameters have
been estimated with sufficient precision.
Three iterations of the experimental cycle are shown in Table 42.1. An initial guess of parameter
values is used to start the first iteration. Thereafter, estimates based on experimental data are used. The
initial guesses of parameter values were θ 3 = 0.50, θ 1 = 0.70, and θ 2 = 200. This led to selecting flow
rate D1 = 0.66 for one run and D2 = 0.35 for the other.
The experimental design criterion for choosing efficient experimental settings of D is ignored for now
because our purpose is merely to show the efficiency of iterative experimentation. We will simply say
that it recommends doing two runs, one with the dilution rate set as near the critical value Dc as the
experimenter dares to operate, and the other at about half this value. At any stage in the experimental
cycle, the best current estimate of the critical flow rate is Dc = θ 1. The experimenter must be cautious
in using this advice because operating conditions become unstable as Dc is approached. If the actual
critical dilution rate is exceeded, the experiment fails entirely and the reactor has to be restarted, at a
considerable loss of time. On the other hand, staying too far on the safe side (keeping the dilution rate
too low) will yield poor estimates of the parameters, especially of θ 1. In this initial stage of the experiment
we should not be too bold.