L1592_frame_C39 Page 346 Tuesday, December 18, 2001 3:22 PM
This shows that R² is a model comparison and that a large R² measures only how much the model improves
the null model. It does not indicate how good the model is in any absolute sense. Consequently, the
common belief that a large R² demonstrates model adequacy is sometimes wrong.
The definition of R² also shows that comparisons are made only between nested models. The concept
of proportionate reduction in variation is untrustworthy unless one model is a special case of the other.
This means that R² cannot be used to compare models with an intercept with models that have no
intercept: y = β0 is not a reduction of the model y = β1x. It is a reduction of y = β0 + β1x and of
y = β0 + β1x + β2x².
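This model comparison can be made concrete with a short computation: R² is exactly one minus the ratio of the fitted model's residual sum of squares to the residual sum of squares of the null model (the mean). A minimal sketch in plain Python; the x and y values are arbitrary illustrative data, not from the text.

```python
# R-squared as a comparison between a fitted line and the null model (the mean).
# The x, y values here are arbitrary illustrative data, not from the chapter.

def fit_line(x, y):
    """Least-squares slope and intercept for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
rss_model = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ybar = sum(y) / len(y)
rss_null = sum((yi - ybar) ** 2 for yi in y)  # residual SS of the mean-only model

r_squared = 1.0 - rss_model / rss_null
print(round(r_squared, 4))
```

A model nested inside a larger one can never have a smaller residual sum of squares, which is why this ratio only makes sense for nested comparisons.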
A High R² Does Not Assure a Valid Relation
Figure 39.1 shows a regression with R² = 0.746, which is statistically significant at almost the 1% level
of significance (a 1% chance of concluding significance when there is no true relation). This might be
impressive until one knows the source of the data: X is the first six digits of pi, and Y is the first six
Fibonacci numbers. There is no true relation between x and y. The linear regression equation has no
predictive value (the seventh digit of pi does not predict the seventh Fibonacci number).
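This nonsense regression is easy to reproduce. The sketch below (plain Python, no libraries) fits the line by least squares to x = first six digits of pi and y = first six Fibonacci numbers and recovers the statistics reported in Figure 39.1.

```python
# Reproducing the nonsense regression of Figure 39.1:
# x = first six digits of pi, y = first six Fibonacci numbers.
x = [3, 1, 4, 1, 5, 9]
y = [1, 1, 2, 3, 5, 8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

slope = sxy / sxx                   # about 0.79
intercept = ybar - slope * xbar     # about 0.31
r_squared = sxy ** 2 / (sxx * syy)  # about 0.746

print(f"Y = {intercept:.2f} + {slope:.2f}X, R2 = {r_squared:.3f}")
```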
Anscombe (1973) published a famous and fascinating example of how R² and other statistics that are
routinely computed in regression analysis can fail to reveal the important features of the data. Table 39.1
FIGURE 39.1 An example of nonsense in regression. X is the first six digits of pi and Y is the first six
Fibonacci numbers. The fitted line is Y = 0.31 + 0.79X with R² = 0.746, although there is no actual relation
between x and y.
TABLE 39.1
Anscombe's Four Data Sets

      A              B              C              D
  x     y        x     y        x     y        x      y
10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
 8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
 9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
 6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
 4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
 7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
 5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

Note: Each data set has n = 11, mean of x = 9.0, mean of y = 7.5, equation
of the regression line y = 3.0 + 0.5x, standard error of estimate of
the slope = 0.118 (t statistic = 4.24), regression sum of squares
(corrected for mean) = 110.0, residual sum of squares = 13.75,
correlation coefficient r = 0.82, and R² = 0.67.
Source: Anscombe, F. J. (1973). Am. Stat., 27, 17–21.
© 2002 By CRC Press LLC
FIGURE 39.2 Plot of Anscombe's four data sets (a)–(d), which all have R² = 0.67 and identical results from simple linear regression
analysis (data from Anscombe 1973).
gives Anscombe's four data sets. Each data set has n = 11, x̄ = 9.0, ȳ = 7.5, fitted regression line
ŷ = 3 + 0.5x, standard error of estimate of the slope = 0.118 (t statistic = 4.24), regression sum of
squares (corrected for mean) = 110.0, residual sum of squares = 13.75, correlation coefficient = 0.82,
and R² = 0.67. All four data sets appear to be described equally well by exactly the same linear model,
at least until the data are plotted (or until the residuals are examined). Figure 39.2 shows how vividly
they differ. The example is a persuasive argument for always plotting the data.
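Anscombe's point can be checked numerically. The sketch below fits a straight line to each of the four data sets of Table 39.1 and confirms that slope, intercept, and R² are essentially identical, even though plots reveal four very different patterns.

```python
# Fit y = b0 + b1*x to each of Anscombe's four data sets (Table 39.1)
# and confirm the summary statistics are essentially identical.
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x for sets A, B, C
x_d = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]        # x for set D
ys = {
    "A": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "B": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "C": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "D": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

def summary(x, y):
    """Return intercept, slope, and R2 of the least-squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1, sxy ** 2 / (sxx * syy)

stats = {name: summary(x_d if name == "D" else x_abc, y) for name, y in ys.items()}
for name, (b0, b1, r2) in stats.items():
    print(f"{name}: y = {b0:.2f} + {b1:.3f}x, R2 = {r2:.2f}")
```

All four fits print y = 3.00 + 0.500x with R² = 0.67; only a plot (or a residual analysis) distinguishes them.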
A Low R² Does Not Mean the Model is Useless
Hahn (1973) explains that the chances are one in ten of getting R² as high as 0.9756 in fitting a simple
linear regression equation to the relation between an independent variable x and a normally distributed
variable y based on only three observations, even if x and y are totally unrelated. On the other hand,
with 100 observations, a value of R² = 0.07 is sufficient to establish statistical significance at the 1% level.
Table 39.2 lists the values of R² required to establish statistical significance for a simple linear regression
equation. Table 39.2 applies only for the straight-line model y = β0 + β1x + e; for multi-variable regression
models, statistical significance must be determined by other means. This tabulation gives values at the
10, 5, and 1% significance levels. These correspond, respectively, to the situations where one is ready to
take one chance in 10, one chance in 20, and one chance in 100 of incorrectly concluding there is evidence
of a statistically significant linear regression when, in fact, x and y are unrelated.
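The entries in Table 39.2 can be reproduced from the F distribution: for the straight-line model, the regression is significant at level α when R² exceeds F(1, n−2; α) / (F(1, n−2; α) + n − 2). A sketch using scipy for the F quantile (an assumption; any F-quantile routine or printed F table works):

```python
# Critical R-squared for significance of a straight-line regression,
# derived from the F distribution with (1, n-2) degrees of freedom.
from scipy.stats import f

def critical_r2(n, alpha):
    """Smallest R2 that is statistically significant at level alpha."""
    f_crit = f.ppf(1.0 - alpha, 1, n - 2)
    return f_crit / (f_crit + n - 2)

for n in (3, 10, 100):
    print(n, [round(critical_r2(n, a), 2) for a in (0.10, 0.05, 0.01)])
```

For n = 3 at the 10% level this gives 0.9755, matching Hahn's 0.9756, and for n = 100 at the 1% level it gives about 0.07.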
A Significant R² Doesn't Mean the Model is Useful
Statistical significance and practical importance are not equivalent. A regression based on a modest and unimportant true relationship may be
established as statistically significant if a sufficiently large number of observations are available. On the
other hand, with a small sample it may be difficult to obtain statistical evidence of a strong relation.
It generally is good news if we find R² large and also statistically significant, but it does not assure a
useful equation, especially if the equation is to be used for prediction. One reason is that the coefficient
of determination is not expressed on the same scale as the dependent variable. A particular equation
TABLE 39.2
Values of R² Required to Establish Statistical
Significance of a Simple Linear Regression
Equation for Various Sample Sizes

Sample Size     Statistical Significance Level
     n           10%      5%      1%
     3           0.98    0.99    0.99
     4           0.81    0.90    0.98
     5           0.65    0.77    0.92
     6           0.53    0.66    0.84
     8           0.39    0.50    0.70
    10           0.30    0.40    0.59
    12           0.25    0.33    0.50
    15           0.19    0.26    0.41
    20           0.14    0.20    0.31
    25           0.11    0.16    0.26
    30           0.09    0.13    0.22
    40           0.07    0.10    0.16
    50           0.05    0.08    0.13
   100           0.03    0.04    0.07

Source: Hahn, G. J. (1973). Chemtech, October,
pp. 609–611.
may explain a large proportion of the variability in the dependent variable, and thus have a high R², yet
the unexplained variability may be too large for useful prediction. It is not possible to tell from the magnitude
of R² how accurate the predictions will be.
The Magnitude of R² Depends on the Range of Variation in X
The value of R² decreases with a decrease in the range of variation of the independent variable, other
things being equal, and assuming the correct model is being fitted to the data. Figure 39.3 (upper
left-hand panel) shows a set of 50 data points that has R² = 0.77. Suppose, however, that the range
of x that could be investigated is only from 14 to 16 (for example, because a process is carefully
constrained within narrow operating limits) and the available data are those shown in the upper
right-hand panel of Figure 39.3. The underlying relationship is the same, and the measurement error in
each observation is the same, but R² is now only 0.12. This dramatic reduction in R² occurs mainly
because the range of x is restricted and not because the number of observations is reduced. This is
shown by the two lower panels. Fifteen points (the same number as found in the range of x = 14 to
16), located at x = 10, 15, and 20, give R² = 0.88. Just 10 points, at x = 10 and 20, give an even
larger value, R² = 0.93.
These examples show that a large value of R² might reflect the fact that data were collected over
an unrealistically large range of the independent variable x. This can happen, especially when x is
time. Conversely, a small value might be due to a limited range of x, such as when x is carefully
controlled by a process operator. In this case, x is constrained to a narrow range because it is known
to be highly important, yet this importance will not be revealed by doing regression on typical data
from the process.
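The effect of a restricted x range is easy to demonstrate by simulation. The sketch below generates data from the same underlying line y = x + noise, once over a wide range of x and once over a narrow range, and compares the resulting R² values (the seed, ranges, and noise level are arbitrary choices for illustration).

```python
# Same true relationship and same measurement error, different ranges of x:
# R-squared collapses when x is confined to a narrow band.
import random

random.seed(1)

def r_squared(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

def simulate(x_values, sigma=2.0):
    """True model y = x + e, with e ~ N(0, sigma)."""
    return [xi + random.gauss(0.0, sigma) for xi in x_values]

x_wide = [10 + 10 * i / 49 for i in range(50)]    # x from 10 to 20
x_narrow = [14 + 2 * i / 49 for i in range(50)]   # x from 14 to 16

r2_wide = r_squared(x_wide, simulate(x_wide))
r2_narrow = r_squared(x_narrow, simulate(x_narrow))
print(round(r2_wide, 2), round(r2_narrow, 2))
```

With the same slope and the same error variance, the narrow-range R² is only a small fraction of the wide-range value, exactly the behavior shown in Figure 39.3.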
Linear calibration curves always have a very high R², usually 0.99 and above. One reason is that the
x variable covers a wide range (see Chapter 36).
FIGURE 39.3 The full data set of 50 observations (upper-left panel) has R² = 0.77. The other three panels show how R²
depends on the range of variation in the independent variable: restricting x to 14–16 gives R² = 0.12, fifteen points at
x = 10, 15, and 20 give R² = 0.88, and ten points at x = 10 and 20 give R² = 0.93.
FIGURE 39.4 Linear regression with repeated observations, fitted line ŷ = 15.45 + 0.97x. The regression sum of squares is
581.12. The residual sum of squares (RSS = 116.38) is divided into pure error sum of squares (SSPE = 112.34) and
lack-of-fit sum of squares (SSLOF = 4.04). R² = 0.833, which is 99% of the amount of variation that can be explained.
The Effect of Repeated Runs on R²
If regression is used to fit a model to n settings of x, it is possible for a model with n parameters to fit
the data exactly, giving R² = 1. This kind of overfitting is not recommended, but it is mathematically
possible. On the other hand, if repeat measurements are made at some or all of the n settings of the
independent variables, a perfect fit will not be possible. This assumes, of course, that the repeat
measurements are not identical.
The data in Figure 39.4 are given in Table 39.3. The fitted model is ŷ = 15.45 + 0.97x. The relevant
statistics are presented in Table 39.4. The fraction of the variation explained by the regression is
R² = 581.12/697.5 = 0.833. The residual sum of squares (RSS) is divided into the pure error sum of squares
(SSPE), which is calculated from the repeated measurements, and the lack-of-fit sum of squares (SSLOF).
That is:

RSS = SSPE + SSLOF
© 2002 By CRC Press LLC
L1592_frame_C39 Page 350 Tuesday, December 18, 2001 3:22 PM
TABLE 39.3
Linear Regression with Repeated
Observations

  x     y1      y2      y3
  5    17.5    22.4    19.2
 12    30.4    28.4    25.1
 14    30.1    25.8    31.1
 19    36.6    31.3    34.0
 24    38.9    43.2    32.7
TABLE 39.4
Analysis of Variance of the Regression with Repeat Observations Shown in Figure 39.4

Source               df    Sum of Sq.   Mean Sq.          F Ratio
Regression            1      581.12     581.12             64.91
Residual             13      116.38       8.952 = s²
  Lack of fit (LOF)   3        4.04       1.35  = s²L       0.12
  Pure error (PE)    10      112.34      11.23  = s²e
Total (Corrected)    14      697.50
Suppose now that there had been only five observations (that is, no repeated measurements) and
furthermore that the five values of y fell at the average of the repeated values in Figure 39.4. Now the
fitted model would be exactly the same, ŷ = 15.45 + 0.97x, but the R² value would be 0.993. This is
because the variance due to the repeats has been removed.
The maximum possible value for R² when there are repeat measurements is:

max R² = [Total SS (corrected) − Pure error SS] / Total SS (corrected)

The pure error SS does not change when terms are added or removed from the model in an effort to
improve the fit. For our example:

max R² = (697.5 − 112.3)/697.5 = 0.839

The actual R² = 581.12/697.5 = 0.833. Therefore, the regression has explained 100(0.833/0.839) = 99%
of the amount of variation that can be explained by the model.
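The sums of squares in Table 39.4 and the max R² calculation can be verified directly from the data of Table 39.3. A minimal sketch:

```python
# Decompose the residual sum of squares for the data of Table 39.3
# into pure error and lack of fit, and compute max R-squared.
reps = {
    5:  [17.5, 22.4, 19.2],
    12: [30.4, 28.4, 25.1],
    14: [30.1, 25.8, 31.1],
    19: [36.6, 31.3, 34.0],
    24: [38.9, 43.2, 32.7],
}

x = [xi for xi, v in reps.items() for _ in v]
y = [yi for v in reps.values() for yi in v]
n = len(y)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx             # slope, about 0.97
b0 = ybar - b1 * xbar      # intercept, about 15.45

ss_total = sum((yi - ybar) ** 2 for yi in y)       # 697.50
ss_reg = b1 * sxy                                  # 581.12
rss = ss_total - ss_reg                            # 116.38

# Pure error: variation of the repeats around their own group means.
ss_pe = sum((yi - sum(v) / len(v)) ** 2
            for v in reps.values() for yi in v)    # 112.34
ss_lof = rss - ss_pe                               # 4.04

r2 = ss_reg / ss_total                   # 0.833
max_r2 = (ss_total - ss_pe) / ss_total   # 0.839
print(round(ss_lof, 2), round(r2, 3), round(max_r2, 3))
```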
A Note on Lack-Of-Fit
If repeat measurements are available, a lack-of-fit (LOF) test can be done. The lack-of-fit mean square
s²L = SSLOF/dfLOF is compared with the pure error mean square s²e = SSPE/dfPE. If the model gives an
adequate fit, these two mean squares should be of the same magnitude. This is checked by comparing the
ratio s²L/s²e against the F statistic with the appropriate degrees of freedom. Using the values in Table 39.4
gives s²L/s²e = 1.35/11.23 = 0.12. The F statistic for a 95% confidence test with three degrees of freedom
to measure lack of fit and ten degrees of freedom to measure the pure error is F3,10 = 3.71. Because
s²L/s²e = 0.12 is less than F3,10 = 3.71, there is no evidence of lack-of-fit. For this lack-of-fit test to be
valid, true repeats are needed.
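The lack-of-fit comparison takes only a few lines; the sketch below uses the mean squares from Table 39.4 and gets the F quantile from scipy (an assumption — a printed F table gives the same 3.71).

```python
# Lack-of-fit F test using the sums of squares from Table 39.4.
from scipy.stats import f

ss_lof, df_lof = 4.04, 3    # lack-of-fit sum of squares and df
ss_pe, df_pe = 112.34, 10   # pure error sum of squares and df

ms_lof = ss_lof / df_lof    # s_L^2 = 1.35
ms_pe = ss_pe / df_pe       # s_e^2 = 11.23
ratio = ms_lof / ms_pe      # 0.12

f_crit = f.ppf(0.95, df_lof, df_pe)  # F(3,10) at the 95% level, 3.71
print(f"ratio = {ratio:.2f}, F(3,10) = {f_crit:.2f}")
print("evidence of lack of fit?", ratio > f_crit)
```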
A Note on Description vs. Prediction
Is the regression useful? We have seen that a high R² does not guarantee that a regression has meaning.
Likewise, a low R² may indicate a statistically significant relationship between two variables although
the regression is not explaining much of the variation. Even less does statistically significant mean that
the regression will predict future observations with much accuracy. "In order for the fitted equation to
be regarded as a satisfactory predictor, the observed F ratio (regression mean square/residual mean
square) should exceed not merely the selected percentage point of the F distribution, but several times
the selected percentage point. How many times depends essentially on how great a ratio (prediction
range/error of prediction) is specified" (Box and Wetz, 1973). Draper and Smith (1998) offer this
rule-of-thumb: unless the observed F for overall regression exceeds the chosen test percentage point by at
least a factor of four, and preferably more, the regression is unlikely to be of practical value for prediction
purposes. The regression in Figure 39.4 has an F ratio of 581.12/8.952 = 64.91 and would have some
practical predictive value.
Other Ways to Examine a Model
If R² does not tell all that is needed about how well a model fits the data and how good the model may
be for prediction, what else could be examined?
Graphics reveal information in data (Tufte 1983): always examine the data and the proposed model
graphically. How sad if this advice were forgotten in a rush to compute some statistic like R².
A more useful single measure of the prediction capability of a model (including a k-variate regression
model) is the standard error of the estimate. The standard error of the estimate is computed from the
variance of the predicted value (ŷ) and it indicates the precision with which the model estimates the
value of the dependent variable. This statistic is used to compute intervals that have the following
meanings (Hahn, 1973):
• The confidence interval for the dependent variable is an interval that one expects, with a
specified level of confidence, to contain the average value of the dependent variable at a set
of specified values for the independent variables.
• A prediction interval for the dependent variable is an interval that one expects, with a specified
probability, to contain a single future value of the dependent variable from the sampled
population at a set of specified values of the independent variables.
• A confidence interval around a parameter in a model (i.e., a regression coefficient) is an
interval that one expects, with a specified degree of confidence, to contain the true regression
coefficient.
Confidence intervals for parameter estimates and prediction intervals for the dependent variable are
discussed in Chapters 34 and 35. The exact method of obtaining these intervals is explained in Draper
and Smith (1998). They are computed by most statistics software packages.
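For the straight-line model these intervals have closed forms. The sketch below computes, for the regression of Figure 39.4, a 95% confidence interval for the mean response and a 95% prediction interval for a single future observation at a chosen x0 (the t quantile comes from scipy, an assumption; a t table works equally well).

```python
# Confidence interval for the mean response and prediction interval
# for a single future observation; straight-line model, Table 39.3 data.
import math
from scipy.stats import t

x = [5] * 3 + [12] * 3 + [14] * 3 + [19] * 3 + [24] * 3
y = [17.5, 22.4, 19.2, 30.4, 28.4, 25.1, 30.1, 25.8, 31.1,
     36.6, 31.3, 34.0, 38.9, 43.2, 32.7]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x0 = 15.0                  # point at which to predict (arbitrary choice)
y0 = b0 + b1 * x0
lever = 1.0 / n + (x0 - xbar) ** 2 / sxx
tval = t.ppf(0.975, n - 2)

ci = tval * math.sqrt(s2 * lever)           # half-width, mean response
pi = tval * math.sqrt(s2 * (1.0 + lever))   # half-width, single new observation
print(f"at x0 = {x0}: fit {y0:.1f}, CI +/- {ci:.2f}, PI +/- {pi:.2f}")
```

The prediction interval is always wider than the confidence interval because it must also cover the measurement error of the single new observation.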
Comments
Widely used methods have the potential to be frequently misused. Linear regression, the most widely
used statistical method, can be misused or misinterpreted if one relies too much on R² as a characterization
of how well a model fits.
R² is a measure of the proportion of variation in y that is accounted for by fitting y to a particular linear
model instead of describing the data by calculating the mean (a horizontal straight line). A high R² does not
prove that a model is correct or useful. A low R² may indicate a statistically significant relation between two
variables although the regression has no practical predictive value. Replication dramatically improves the
predictive error of a model, and it makes possible a formal lack-of-fit test, but it reduces the R² of the model.
Totally spurious correlations, often with high R² values, can arise when unrelated variables are
combined. Two examples of particular interest to environmental engineers are presented by Sherwood
(1974) and Rowe (1974). Both emphasize graphical analysis to stimulate and support any regression
analysis. Rowe discusses the particular dangers that arise when sets of variables are combined to create
new variables such as dimensionless numbers (Froude number, etc.). Benson (1965) points out the same
kinds of dangers in the context of hydraulics and hydrology.
References
Anderson-Sprecher, R. (1994). "Model Comparison and R²," Am. Stat., 48(2), 113–116.
Anscombe, F. J. (1973). “Graphs in Statistical Analysis,” Am. Stat., 27, 17–21.
Benson, M. A. (1965). “Spurious Correlation in Hydraulics and Hydrology,” J. Hydraulics Div., ASCE, 91,
HY4, 35–45.
Box, G. E. P. (1966). “The Use and Abuse of Regression,” Technometrics, 8, 625–629.
Box, G. E. P. and J. Wetz (1973). “Criteria for Judging Accuracy of Estimation by an Approximating Response
Function,” Madison, WI, University of Wisconsin Statistics Department, Tech. Rep. No. 9.
Draper, N. R. and H. Smith (1998). Applied Regression Analysis, 3rd ed., New York, John Wiley.
Hahn, G. J. (1973). “The Coefficient of Determination Exposed,” Chemtech, October, pp. 609–611.
Rowe, P. N. (1974). “Correlating Data,” Chemtech, January, pp. 9–14.
Sherwood, T. K. (1974). “The Treatment and Mistreatment of Data,” Chemtech, December, pp. 736–738.
Tufte, E. R. (1983). The Visual Display of Quantitative Information, Cheshire, CT, Graphics Press.
Exercises
39.1 COD Calibration. The ten pairs of readings below were obtained to calibrate a UV spectrophotometer to measure chemical oxygen demand (COD) in wastewater.
COD (mg/L)       60    90   100   130   195   250   300   375   500   600
UV Absorbance  0.30  0.35  0.45  0.48  0.95  1.30  1.60  1.80  2.30  2.55
(a) Fit a linear model to the data and obtain the R² value. (b) Discuss the meaning of R² in
the context of this calibration problem. (c) Exercise 36.3 contains a larger calibration data
set for the same instrument. (d) Fit the model to the larger sample and compare the values
of R². Will the calibration curve with the highest R² best predict the COD concentration?
Explain why or why not.
39.2 Stream pH. The data below are n = 200 monthly pH readings on a stream that cover a period of
almost 20 years. The data read from left to right. The fitted regression model is ŷ = 7.1435 −
0.0003776t; R² = 0.042. The confidence interval of the slope is [−0.00063, −0.000013]. Why
is R² so low? Is the regression statistically significant? Is stream pH decreasing? What is the
practical value of the model?
7.0 7.1 7.2 7.1 7.0 7.0 7.2 7.1 7.0 7.0 7.2 7.4 7.1 7.0 7.0 7.2 6.8 7.1 7.1 7.1
7.2 7.1 7.2 7.1 7.2 7.1 7.2 7.0 7.2 7.0 7.3 6.8 7.0 7.4 6.9 7.1 7.2 7.0 7.1 7.1
7.2 7.3 7.0 7.2 7.4 7.1 7.0 7.1 7.1 7.0 7.2 7.3 7.2 7.2 7.0 7.0 7.1 7.1 7.0 6.9
7.2 7.0 7.1 7.2 6.9 7.0 7.1 7.0 7.1 6.9 7.2 7.0 7.1 7.2 7.0 7.0 7.2 7.0 7.0 7.2
7.0 6.9 7.2 7.1 7.1 7.1 7.0 7.0 7.2 7.1 7.1 7.2 7.2 7.0 7.0 7.3 7.1 7.1 7.1 7.2
7.3 7.2 7.2 7.2 7.2 7.1 7.1 7.0 7.1 7.1 7.1 7.3 7.0 7.0 7.2 7.2 7.1 7.1 7.1 7.1
7.1 7.0 7.1 6.9 7.0 7.2 7.0 7.1 7.2 7.0 7.1 7.0 7.1 7.2 7.0 7.2 7.2 7.2 7.1 7.0
7.2 7.1 7.2 7.0 7.1 7.1 7.1 7.2 7.0 6.9 7.3 7.1 7.1 7.0 7.1 7.2 7.1 7.1 7.1 7.1
7.2 7.0 7.2 7.1 7.0 7.2 7.3 7.0 7.2 6.8 7.3 7.2 7.0 7.0 7.2 7.1 6.9 7.0 7.2 7.1
7.2 7.2 7.1 6.9 7.2 7.1 7.2 7.2 7.1 7.0 7.2 7.2 7.2 6.9 7.0 7.1 7.2 7.2 7.2 7.0
39.3 Replication. Fit a straight-line calibration model to y1 and then fit the straight line to the three
replicate measures of y. Suppose a colleague in another lab had the y1 data only and you had
all three replicates. Who will have the higher R² and who will have the best fitted calibration
curve? Compare the values of R² obtained. Estimate the pure error variance. How much of
the variation in y has been explained by the model?
  x     y1     y2     y3
  2    0.0    1.7    2.0
  5    4.0    2.0    4.5
  8    5.1    4.1    5.8
 12    8.1    8.9    8.4
 15    9.2    8.3    8.8
 18   11.3    9.5   10.9
 20   11.7   10.7   10.4
39.4 Range of Data. Fit a straight-line calibration model to the first 10 observations in the Exercise
36.3 data set, that is, for COD between 60 and 195 mg/L. Then fit the straight line to the full
data set (COD from 60 to 675 mg/L). Interpret the change in R² for the two cases.
40
Regression Analysis with Categorical Variables
KEY WORDS acid rain, pH, categorical variable, F test, indicator variable, least squares, linear model,
regression, dummy variable, qualitative variables, regression sum of squares, t-ratio, weak acidity.
Qualitative variables can be used as explanatory variables in regression models. A typical case would be
when several sets of data are similar except that each set was measured by a different chemist (or different
instrument or laboratory), or each set comes from a different location, or each set was measured on a
different day. The qualitative variables — chemist, location, or day — typically take on discrete values
(i.e., chemist Smith or chemist Jones). For convenience, they are usually represented numerically by a
combination of zeros and ones to signify an observation’s membership in a category; hence the name
categorical variables.
One task in the analysis of such data is to determine whether the same model structure and parameter
values hold for each data set. One way to do this would be to fit the proposed model to each individual
data set and then try to assess the similarities and differences in the goodness of fit. Another way would
be to fit the proposed model to all the data as though they were one data set instead of several, assuming
that each data set has the same pattern, and then to look for inadequacies in the fitted model.
Neither of these approaches is as attractive as using categorical variables to create a collective data
set that can be fitted to a single model while retaining the distinction between the individual data sets.
This technique allows the model structure and the model parameters to be evaluated using statistical
methods like those discussed in the previous chapter.
Case Study: Acidification of a Stream During Storms
Cosby Creek, in the southern Appalachian Mountains, was monitored during three storms to study how
pH and other measures of acidification were affected by the rainfall in that region. Samples were taken
every 30 min and 19 characteristics of the stream water chemistry were measured (Meinert et al., 1982).
Weak acidity (WA) and pH will be examined in this case study.
Figure 40.1 shows 17 observations for storm 1, 14 for storm 2, and 13 for storm 3, giving a total of
44 observations. If the data are analyzed without distinguishing between storms, one might consider
models of the form pH = β0 + β1WA + β2WA² or pH = θ3 + (θ1 − θ3)exp(−θ2WA). Each storm might be
described by pH = β0 + β1WA, but storm 3 does not have the same slope and intercept as storms 1 and
2, and storms 1 and 2 might be different as well. This can be checked by using categorical variables to
estimate a different slope and intercept for each storm.
Method: Regression with Categorical Variables
Suppose that a model needs to include an effect due to the category (storm event, farm plot, treatment,
truckload, operator, laboratory, etc.) from which the data came. This effect is included in the model in
the form of categorical variables (also called dummy or indicator variables). In general m − 1 categorical
variables are needed to specify m categories.
FIGURE 40.1 The relation of pH and weak acidity (µg/L) of Cosby Creek after three storms.
Begin by considering data from a single category. The quantitative predictor variable is x1, which can
predict the dependent variable y1 using the linear model:

y1i = β0 + β1x1i + ei

where β0 and β1 are parameters to be estimated by least squares.
If there are data from two categories (e.g., data produced at two different laboratories), one approach
would be to model the two sets of data separately as:

y1i = α0 + α1x1i + ei

and

y2i = β0 + β1x2i + ei

and then to compare the estimated intercepts (α0 and β0) and the estimated slopes (α1 and β1) using
confidence intervals or t-tests.
A second, and often better, method is to simultaneously fit a single augmented model to all the data.
To construct this model, define a categorical variable Z as follows:

Z = 0 if the data are in the first category
Z = 1 if the data are in the second category

The augmented model is:

yi = α0 + α1xi + Z(β0 + β1xi) + ei

With some rearrangement:

yi = α0 + β0Z + α1xi + β1Zxi + ei
In this last form the regression is done as though there are three independent variables: x, Z, and Zx.
The vectors of Z and Zx have to be created from the categorical variables defined above. The four
parameters α0, β0, α1, and β1 are estimated by linear regression.
A model for each category can be obtained by substituting the defined values. For the first category,
Z = 0 and:

yi = α0 + α1xi + ei
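The augmented model can be fitted by ordinary least squares on the design matrix [1, Z, x, Zx]. The sketch below uses numpy (an assumption) on made-up data from two categories, not the Cosby Creek data, and recovers a separate slope and intercept for each category.

```python
# Fitting y = a0 + b0*Z + a1*x + b1*Z*x by least squares,
# where Z = 0 for category 1 and Z = 1 for category 2.
# Synthetic illustration: category 1 follows y = 2 + 0.5x and
# category 2 follows y = 4 + 1.0x, plus small noise.
import numpy as np

rng = np.random.default_rng(0)
x1 = np.linspace(0, 10, 20)
x2 = np.linspace(0, 10, 20)
y1 = 2.0 + 0.5 * x1 + rng.normal(0, 0.1, x1.size)
y2 = 4.0 + 1.0 * x2 + rng.normal(0, 0.1, x2.size)

x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
Z = np.concatenate([np.zeros(x1.size), np.ones(x2.size)])  # categorical variable

# Design matrix with columns 1, Z, x, Zx
X = np.column_stack([np.ones(x.size), Z, x, Z * x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a0, b0, a1, b1 = coef

print(f"category 1: y = {a0:.2f} + {a1:.2f}x")
print(f"category 2: y = {a0 + b0:.2f} + {a1 + b1:.2f}x")
```

Setting Z = 0 recovers the first category's line (intercept a0, slope a1); setting Z = 1 gives the second category's line (intercept a0 + b0, slope a1 + b1), so b0 and b1 directly measure the differences between the categories.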