L1592_frame_C39 Page 346 Tuesday, December 18, 2001 3:22 PM
This shows that R² is a model comparison and that a large R² measures only how much the model improves
the null model. It does not indicate how good the model is in any absolute sense. Consequently, the
common belief that a large R² demonstrates model adequacy is sometimes wrong.
The definition of R² also shows that comparisons are made only between nested models. The concept
of proportionate reduction in variation is untrustworthy unless one model is a special case of the other.
This means that R² cannot be used to compare models with an intercept with models that have no
intercept: y = β0 is not a reduction of the model y = β1x. It is a reduction of y = β0 + β1x and of
y = β0 + β1x + β2x².
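This model comparison can be made concrete with a short computation: R² is exactly one minus the ratio of the fitted model's residual sum of squares to the residual sum of squares of the null model (the mean). A minimal sketch in plain Python; the x and y values are arbitrary illustrative data, not from the text.

```python
# R-squared as a comparison between a fitted line and the null model (the mean).
# The x, y values here are arbitrary illustrative data, not from the chapter.

def fit_line(x, y):
    """Least-squares slope and intercept for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
rss_model = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ybar = sum(y) / len(y)
rss_null = sum((yi - ybar) ** 2 for yi in y)  # residual SS of the mean-only model

r_squared = 1.0 - rss_model / rss_null
print(round(r_squared, 4))
```

A model nested inside a larger one can never have a smaller residual sum of squares, which is why this ratio only makes sense for nested comparisons.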
A High R² Does Not Assure a Valid Relation
Figure 39.1 shows a regression with R² = 0.746, which is statistically significant at almost the 1% level
of significance (a 1% chance of concluding significance when there is no true relation). This might be
impressive until one knows the source of the data: X is the first six digits of pi, and Y is the first six
Fibonacci numbers. There is no true relation between x and y. The linear regression equation has no
predictive value (the seventh digit of pi does not predict the seventh Fibonacci number).
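This nonsense regression is easy to reproduce. The sketch below (plain Python, no libraries) fits the line by least squares to x = first six digits of pi and y = first six Fibonacci numbers and recovers the statistics reported in Figure 39.1.

```python
# Reproducing the nonsense regression of Figure 39.1:
# x = first six digits of pi, y = first six Fibonacci numbers.
x = [3, 1, 4, 1, 5, 9]
y = [1, 1, 2, 3, 5, 8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

slope = sxy / sxx                   # about 0.79
intercept = ybar - slope * xbar     # about 0.31
r_squared = sxy ** 2 / (sxx * syy)  # about 0.746

print(f"Y = {intercept:.2f} + {slope:.2f}X, R2 = {r_squared:.3f}")
```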
Anscombe (1973) published a famous and fascinating example of how R² and other statistics that are
routinely computed in regression analysis can fail to reveal the important features of the data. Table 39.1
FIGURE 39.1 An example of nonsense in regression. X is the first six digits of pi and Y is the first six
Fibonacci numbers. The fitted line is Y = 0.31 + 0.79X with R² = 0.746, although there is no actual relation
between x and y.
TABLE 39.1
Anscombe's Four Data Sets

      A              B              C              D
  x     y        x     y        x     y        x      y
10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
 8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
 9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
 6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
 4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
 7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
 5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

Note: Each data set has n = 11, mean of x = 9.0, mean of y = 7.5, equation
of the regression line y = 3.0 + 0.5x, standard error of estimate of
the slope = 0.118 (t statistic = 4.24), regression sum of squares
(corrected for mean) = 110.0, residual sum of squares = 13.75,
correlation coefficient r = 0.82, and R² = 0.67.
Source: Anscombe, F. J. (1973). Am. Stat., 27, 17–21.
© 2002 By CRC Press LLC
FIGURE 39.2 Plot of Anscombe's four data sets (a)–(d), which all have R² = 0.67 and identical results from simple linear regression
analysis (data from Anscombe 1973).
gives Anscombe's four data sets. Each data set has n = 11, x̄ = 9.0, ȳ = 7.5, fitted regression line
ŷ = 3 + 0.5x, standard error of estimate of the slope = 0.118 (t statistic = 4.24), regression sum of
squares (corrected for mean) = 110.0, residual sum of squares = 13.75, correlation coefficient = 0.82,
and R² = 0.67. All four data sets appear to be described equally well by exactly the same linear model,
at least until the data are plotted (or until the residuals are examined). Figure 39.2 shows how vividly
they differ. The example is a persuasive argument for always plotting the data.
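Anscombe's point can be checked numerically. The sketch below fits a straight line to each of the four data sets of Table 39.1 and confirms that slope, intercept, and R² are essentially identical, even though plots reveal four very different patterns.

```python
# Fit y = b0 + b1*x to each of Anscombe's four data sets (Table 39.1)
# and confirm the summary statistics are essentially identical.
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x for sets A, B, C
x_d = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]        # x for set D
ys = {
    "A": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "B": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "C": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "D": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

def summary(x, y):
    """Return intercept, slope, and R2 of the least-squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1, sxy ** 2 / (sxx * syy)

stats = {name: summary(x_d if name == "D" else x_abc, y) for name, y in ys.items()}
for name, (b0, b1, r2) in stats.items():
    print(f"{name}: y = {b0:.2f} + {b1:.3f}x, R2 = {r2:.2f}")
```

All four fits print y = 3.00 + 0.500x with R² = 0.67; only a plot (or a residual analysis) distinguishes them.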
A Low R² Does Not Mean the Model is Useless
Hahn (1973) explains that the chances are one in ten of getting R² as high as 0.9756 in fitting a simple
linear regression equation to the relation between an independent variable x and a normally distributed
variable y based on only three observations, even if x and y are totally unrelated. On the other hand,
with 100 observations, a value of R² = 0.07 is sufficient to establish statistical significance at the 1% level.
Table 39.2 lists the values of R² required to establish statistical significance for a simple linear regression
equation. Table 39.2 applies only for the straight-line model y = β0 + β1x + e; for multi-variable regression
models, statistical significance must be determined by other means. This tabulation gives values at the
10, 5, and 1% significance levels. These correspond, respectively, to the situations where one is ready to
take one chance in 10, one chance in 20, and one chance in 100 of incorrectly concluding there is evidence
of a statistically significant linear regression when, in fact, x and y are unrelated.
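The entries in Table 39.2 can be reproduced from the F distribution: for the straight-line model, the regression is significant at level α when R² exceeds F(1, n−2; α) / (F(1, n−2; α) + n − 2). A sketch using scipy for the F quantile (an assumption; any F-quantile routine or printed F table works):

```python
# Critical R-squared for significance of a straight-line regression,
# derived from the F distribution with (1, n-2) degrees of freedom.
from scipy.stats import f

def critical_r2(n, alpha):
    """Smallest R2 that is statistically significant at level alpha."""
    f_crit = f.ppf(1.0 - alpha, 1, n - 2)
    return f_crit / (f_crit + n - 2)

for n in (3, 10, 100):
    print(n, [round(critical_r2(n, a), 2) for a in (0.10, 0.05, 0.01)])
```

For n = 3 at the 10% level this gives 0.9755, matching Hahn's 0.9756, and for n = 100 at the 1% level it gives about 0.07.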
A Significant R² Doesn't Mean the Model is Useful
Statistical significance and practical importance are not equivalent. A regression based on a modest and unimportant true relationship may be
established as statistically significant if a sufficiently large number of observations are available. On the
other hand, with a small sample it may be difficult to obtain statistical evidence of a strong relation.
It generally is good news if we find R² large and also statistically significant, but it does not assure a
useful equation, especially if the equation is to be used for prediction. One reason is that the coefficient
of determination is not expressed on the same scale as the dependent variable. A particular equation
TABLE 39.2
Values of R² Required to Establish Statistical
Significance of a Simple Linear Regression
Equation for Various Sample Sizes

Sample Size     Statistical Significance Level
     n           10%      5%      1%
     3           0.98    0.99    0.99
     4           0.81    0.90    0.98
     5           0.65    0.77    0.92
     6           0.53    0.66    0.84
     8           0.39    0.50    0.70
    10           0.30    0.40    0.59
    12           0.25    0.33    0.50
    15           0.19    0.26    0.41
    20           0.14    0.20    0.31
    25           0.11    0.16    0.26
    30           0.09    0.13    0.22
    40           0.07    0.10    0.16
    50           0.05    0.08    0.13
   100           0.03    0.04    0.07

Source: Hahn, G. J. (1973). Chemtech, October,
pp. 609–611.
may explain a large proportion of the variability in the dependent variable, and thus have a high R², yet
the unexplained variability may be too large for useful prediction. It is not possible to tell from the magnitude
of R² how accurate the predictions will be.
The Magnitude of R² Depends on the Range of Variation in X
The value of R² decreases with a decrease in the range of variation of the independent variable, other
things being equal, and assuming the correct model is being fitted to the data. Figure 39.3 (upper
left-hand panel) shows a set of 50 data points that has R² = 0.77. Suppose, however, that the range
of x that could be investigated is only from 14 to 16 (for example, because a process is carefully
constrained within narrow operating limits) and the available data are those shown in the upper
right-hand panel of Figure 39.3. The underlying relationship is the same, and the measurement error in
each observation is the same, but R² is now only 0.12. This dramatic reduction in R² occurs mainly
because the range of x is restricted and not because the number of observations is reduced. This is
shown by the two lower panels. Fifteen points (the same number as found in the range of x = 14 to
16), located at x = 10, 15, and 20, give R² = 0.88. Just 10 points, at x = 10 and 20, give an even
larger value, R² = 0.93.
These examples show that a large value of R² might reflect the fact that data were collected over
an unrealistically large range of the independent variable x. This can happen, especially when x is
time. Conversely, a small value might be due to a limited range of x, such as when x is carefully
controlled by a process operator. In this case, x is constrained to a narrow range because it is known
to be highly important, yet this importance will not be revealed by doing regression on typical data
from the process.
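The effect of a restricted x range is easy to demonstrate by simulation. The sketch below generates data from the same underlying line y = x + noise, once over a wide range of x and once over a narrow range, and compares the resulting R² values (the seed, ranges, and noise level are arbitrary choices for illustration).

```python
# Same true relationship and same measurement error, different ranges of x:
# R-squared collapses when x is confined to a narrow band.
import random

random.seed(1)

def r_squared(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

def simulate(x_values, sigma=2.0):
    """True model y = x + e, with e ~ N(0, sigma)."""
    return [xi + random.gauss(0.0, sigma) for xi in x_values]

x_wide = [10 + 10 * i / 49 for i in range(50)]    # x from 10 to 20
x_narrow = [14 + 2 * i / 49 for i in range(50)]   # x from 14 to 16

r2_wide = r_squared(x_wide, simulate(x_wide))
r2_narrow = r_squared(x_narrow, simulate(x_narrow))
print(round(r2_wide, 2), round(r2_narrow, 2))
```

With the same slope and the same error variance, the narrow-range R² is only a small fraction of the wide-range value, exactly the behavior shown in Figure 39.3.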
Linear calibration curves always have a very high R², usually 0.99 and above. One reason is that the
x variable covers a wide range (see Chapter 36).
FIGURE 39.3 The full data set of 50 observations (upper-left panel) has R² = 0.77. The other three panels show how R²
depends on the range of variation in the independent variable: restricting x to 14–16 gives R² = 0.12, fifteen points at
x = 10, 15, and 20 give R² = 0.88, and ten points at x = 10 and 20 give R² = 0.93.
FIGURE 39.4 Linear regression with repeated observations, fitted line ŷ = 15.45 + 0.97x. The regression sum of squares is
581.12. The residual sum of squares (RSS = 116.38) is divided into pure error sum of squares (SSPE = 112.34) and
lack-of-fit sum of squares (SSLOF = 4.04). R² = 0.833, which is 99% of the amount of variation that can be explained.
The Effect of Repeated Runs on R²
If regression is used to fit a model to n settings of x, it is possible for a model with n parameters to fit
the data exactly, giving R² = 1. This kind of overfitting is not recommended, but it is mathematically
possible. On the other hand, if repeat measurements are made at some or all of the n settings of the
independent variables, a perfect fit will not be possible. This assumes, of course, that the repeat
measurements are not identical.
The data in Figure 39.4 are given in Table 39.3. The fitted model is ŷ = 15.45 + 0.97x. The relevant
statistics are presented in Table 39.4. The fraction of the variation explained by the regression is
R² = 581.12/697.5 = 0.833. The residual sum of squares (RSS) is divided into the pure error sum of squares
(SSPE), which is calculated from the repeated measurements, and the lack-of-fit sum of squares (SSLOF).
That is:

RSS = SSPE + SSLOF
© 2002 By CRC Press LLC
L1592_frame_C39 Page 350 Tuesday, December 18, 2001 3:22 PM
TABLE 39.3
Linear Regression with Repeated
Observations

  x     y1      y2      y3
  5    17.5    22.4    19.2
 12    30.4    28.4    25.1
 14    30.1    25.8    31.1
 19    36.6    31.3    34.0
 24    38.9    43.2    32.7
TABLE 39.4
Analysis of Variance of the Regression with Repeat Observations Shown in Figure 39.4

Source               df    Sum of Sq.   Mean Sq.          F Ratio
Regression            1      581.12     581.12             64.91
Residual             13      116.38       8.952 = s²
  Lack of fit (LOF)   3        4.04       1.35  = s²L       0.12
  Pure error (PE)    10      112.34      11.23  = s²e
Total (Corrected)    14      697.50
Suppose now that there had been only five observations (that is, no repeated measurements) and
furthermore that the five values of y fell at the average of the repeated values in Figure 39.4. Now the
fitted model would be exactly the same, ŷ = 15.45 + 0.97x, but the R² value would be 0.993. This is
because the variance due to the repeats has been removed.
The maximum possible value for R² when there are repeat measurements is:

max R² = [Total SS (corrected) − Pure error SS] / Total SS (corrected)

The pure error SS does not change when terms are added or removed from the model in an effort to
improve the fit. For our example:

max R² = (697.5 − 112.3)/697.5 = 0.839

The actual R² = 581.12/697.5 = 0.833. Therefore, the regression has explained 100(0.833/0.839) = 99%
of the amount of variation that can be explained by the model.
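The sums of squares in Table 39.4 and the max R² calculation can be verified directly from the data of Table 39.3. A minimal sketch:

```python
# Decompose the residual sum of squares for the data of Table 39.3
# into pure error and lack of fit, and compute max R-squared.
reps = {
    5:  [17.5, 22.4, 19.2],
    12: [30.4, 28.4, 25.1],
    14: [30.1, 25.8, 31.1],
    19: [36.6, 31.3, 34.0],
    24: [38.9, 43.2, 32.7],
}

x = [xi for xi, v in reps.items() for _ in v]
y = [yi for v in reps.values() for yi in v]
n = len(y)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx             # slope, about 0.97
b0 = ybar - b1 * xbar      # intercept, about 15.45

ss_total = sum((yi - ybar) ** 2 for yi in y)       # 697.50
ss_reg = b1 * sxy                                  # 581.12
rss = ss_total - ss_reg                            # 116.38

# Pure error: variation of the repeats around their own group means.
ss_pe = sum((yi - sum(v) / len(v)) ** 2
            for v in reps.values() for yi in v)    # 112.34
ss_lof = rss - ss_pe                               # 4.04

r2 = ss_reg / ss_total                   # 0.833
max_r2 = (ss_total - ss_pe) / ss_total   # 0.839
print(round(ss_lof, 2), round(r2, 3), round(max_r2, 3))
```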
A Note on Lack-Of-Fit
If repeat measurements are available, a lack-of-fit (LOF) test can be done. The lack-of-fit mean square
s²L = SSLOF/dfLOF is compared with the pure error mean square s²e = SSPE/dfPE. If the model gives an
adequate fit, these two mean squares should be of the same magnitude. This is checked by comparing the
ratio s²L/s²e against the F statistic with the appropriate degrees of freedom. Using the values in Table 39.4
gives s²L/s²e = 1.35/11.23 = 0.12. The F statistic for a 95% confidence test with three degrees of freedom
to measure lack of fit and ten degrees of freedom to measure the pure error is F3,10 = 3.71. Because
s²L/s²e = 0.12 is less than F3,10 = 3.71, there is no evidence of lack-of-fit. For this lack-of-fit test to be
valid, true repeats are needed.
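The lack-of-fit comparison takes only a few lines; the sketch below uses the mean squares from Table 39.4 and gets the F quantile from scipy (an assumption — a printed F table gives the same 3.71).

```python
# Lack-of-fit F test using the sums of squares from Table 39.4.
from scipy.stats import f

ss_lof, df_lof = 4.04, 3    # lack-of-fit sum of squares and df
ss_pe, df_pe = 112.34, 10   # pure error sum of squares and df

ms_lof = ss_lof / df_lof    # s_L^2 = 1.35
ms_pe = ss_pe / df_pe       # s_e^2 = 11.23
ratio = ms_lof / ms_pe      # 0.12

f_crit = f.ppf(0.95, df_lof, df_pe)  # F(3,10) at the 95% level, 3.71
print(f"ratio = {ratio:.2f}, F(3,10) = {f_crit:.2f}")
print("evidence of lack of fit?", ratio > f_crit)
```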
A Note on Description vs. Prediction
Is the regression useful? We have seen that a high R² does not guarantee that a regression has meaning.
Likewise, a low R² may indicate a statistically significant relationship between two variables although
the regression is not explaining much of the variation. Even less does statistically significant mean that
the regression will predict future observations with much accuracy. "In order for the fitted equation to
be regarded as a satisfactory predictor, the observed F ratio (regression mean square/residual mean
square) should exceed not merely the selected percentage point of the F distribution, but several times
the selected percentage point. How many times depends essentially on how great a ratio (prediction
range/error of prediction) is specified" (Box and Wetz, 1973). Draper and Smith (1998) offer this
rule-of-thumb: unless the observed F for overall regression exceeds the chosen test percentage point by at
least a factor of four, and preferably more, the regression is unlikely to be of practical value for prediction
purposes. The regression in Figure 39.4 has an F ratio of 581.12/8.952 = 64.91 and would have some
practical predictive value.
Other Ways to Examine a Model
If R² does not tell all that is needed about how well a model fits the data and how good the model may
be for prediction, what else could be examined?
Graphics reveal information in data (Tufte 1983): always examine the data and the proposed model
graphically. How sad if this advice were forgotten in a rush to compute some statistic like R².
A more useful single measure of the prediction capability of a model (including a k-variate regression
model) is the standard error of the estimate. The standard error of the estimate is computed from the
variance of the predicted value (ŷ) and it indicates the precision with which the model estimates the
value of the dependent variable. This statistic is used to compute intervals that have the following
meanings (Hahn, 1973):
• The confidence interval for the dependent variable is an interval that one expects, with a
specified level of confidence, to contain the average value of the dependent variable at a set
of specified values for the independent variables.
• A prediction interval for the dependent variable is an interval that one expects, with a specified
probability, to contain a single future value of the dependent variable from the sampled
population at a set of specified values of the independent variables.
• A confidence interval around a parameter in a model (i.e., a regression coefficient) is an
interval that one expects, with a specified degree of confidence, to contain the true regression
coefficient.
Confidence intervals for parameter estimates and prediction intervals for the dependent variable are
discussed in Chapters 34 and 35. The exact method of obtaining these intervals is explained in Draper
and Smith (1998). They are computed by most statistics software packages.
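For the straight-line model these intervals have closed forms. The sketch below computes, for the regression of Figure 39.4, a 95% confidence interval for the mean response and a 95% prediction interval for a single future observation at a chosen x0 (the t quantile comes from scipy, an assumption; a t table works equally well).

```python
# Confidence interval for the mean response and prediction interval
# for a single future observation; straight-line model, Table 39.3 data.
import math
from scipy.stats import t

x = [5] * 3 + [12] * 3 + [14] * 3 + [19] * 3 + [24] * 3
y = [17.5, 22.4, 19.2, 30.4, 28.4, 25.1, 30.1, 25.8, 31.1,
     36.6, 31.3, 34.0, 38.9, 43.2, 32.7]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x0 = 15.0                  # point at which to predict (arbitrary choice)
y0 = b0 + b1 * x0
lever = 1.0 / n + (x0 - xbar) ** 2 / sxx
tval = t.ppf(0.975, n - 2)

ci = tval * math.sqrt(s2 * lever)           # half-width, mean response
pi = tval * math.sqrt(s2 * (1.0 + lever))   # half-width, single new observation
print(f"at x0 = {x0}: fit {y0:.1f}, CI +/- {ci:.2f}, PI +/- {pi:.2f}")
```

The prediction interval is always wider than the confidence interval because it must also cover the measurement error of the single new observation.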
Comments
Widely used methods have the potential to be frequently misused. Linear regression, the most widely
used statistical method, can be misused or misinterpreted if one relies too much on R² as a characterization
of how well a model fits.
R² is a measure of the proportion of variation in y that is accounted for by fitting y to a particular linear
model instead of describing the data by calculating the mean (a horizontal straight line). A high R² does not
prove that a model is correct or useful. A low R² may indicate a statistically significant relation between two
variables although the regression has no practical predictive value. Replication dramatically improves the
predictive error of a model, and it makes possible a formal lack-of-fit test, but it reduces the R² of the model.
Totally spurious correlations, often with high R² values, can arise when unrelated variables are
combined. Two examples of particular interest to environmental engineers are presented by Sherwood
(1974) and Rowe (1974). Both emphasize graphical analysis to stimulate and support any regression
analysis. Rowe discusses the particular dangers that arise when sets of variables are combined to create
new variables such as dimensionless numbers (Froude number, etc.). Benson (1965) points out the same
kinds of dangers in the context of hydraulics and hydrology.
References
Anderson-Sprecher, R. (1994). "Model Comparison and R²," Am. Stat., 48(2), 113–116.
Anscombe, F. J. (1973). “Graphs in Statistical Analysis,” Am. Stat., 27, 17–21.
Benson, M. A. (1965). “Spurious Correlation in Hydraulics and Hydrology,” J. Hydraulics Div., ASCE, 91,
HY4, 35–45.
Box, G. E. P. (1966). “The Use and Abuse of Regression,” Technometrics, 8, 625–629.
Box, G. E. P. and J. Wetz (1973). “Criteria for Judging Accuracy of Estimation by an Approximating Response
Function,” Madison, WI, University of Wisconsin Statistics Department, Tech. Rep. No. 9.
Draper, N. R. and H. Smith (1998). Applied Regression Analysis, 3rd ed., New York, John Wiley.
Hahn, G. J. (1973). “The Coefficient of Determination Exposed,” Chemtech, October, pp. 609–611.
Rowe, P. N. (1974). “Correlating Data,” Chemtech, January, pp. 9–14.
Sherwood, T. K. (1974). “The Treatment and Mistreatment of Data,” Chemtech, December, pp. 736–738.
Tufte, E. R. (1983). The Visual Display of Quantitative Information, Cheshire, CT, Graphics Press.
Exercises
39.1 COD Calibration. The ten pairs of readings below were obtained to calibrate a UV spectrophotometer to measure chemical oxygen demand (COD) in wastewater.
COD (mg/L)       60    90   100   130   195   250   300   375   500   600
UV Absorbance  0.30  0.35  0.45  0.48  0.95  1.30  1.60  1.80  2.30  2.55
(a) Fit a linear model to the data and obtain the R² value. (b) Discuss the meaning of R² in
the context of this calibration problem. (c) Exercise 36.3 contains a larger calibration data
set for the same instrument. (d) Fit the model to the larger sample and compare the values
of R². Will the calibration curve with the highest R² best predict the COD concentration?
Explain why or why not.
39.2 Stream pH. The data below are n = 200 monthly pH readings on a stream that cover a period of
almost 20 years. The data read from left to right. The fitted regression model is ŷ = 7.1435 −
0.0003776t; R² = 0.042. The confidence interval of the slope is [−0.00063, −0.000013]. Why
is R² so low? Is the regression statistically significant? Is stream pH decreasing? What is the
practical value of the model?
7.0 7.1 7.2 7.1 7.0 7.0 7.2 7.1 7.0 7.0 7.2 7.4 7.1 7.0 7.0 7.2 6.8 7.1 7.1 7.1
7.2 7.1 7.2 7.1 7.2 7.1 7.2 7.0 7.2 7.0 7.3 6.8 7.0 7.4 6.9 7.1 7.2 7.0 7.1 7.1
7.2 7.3 7.0 7.2 7.4 7.1 7.0 7.1 7.1 7.0 7.2 7.3 7.2 7.2 7.0 7.0 7.1 7.1 7.0 6.9
7.2 7.0 7.1 7.2 6.9 7.0 7.1 7.0 7.1 6.9 7.2 7.0 7.1 7.2 7.0 7.0 7.2 7.0 7.0 7.2
7.0 6.9 7.2 7.1 7.1 7.1 7.0 7.0 7.2 7.1 7.1 7.2 7.2 7.0 7.0 7.3 7.1 7.1 7.1 7.2
7.3 7.2 7.2 7.2 7.2 7.1 7.1 7.0 7.1 7.1 7.1 7.3 7.0 7.0 7.2 7.2 7.1 7.1 7.1 7.1
7.1 7.0 7.1 6.9 7.0 7.2 7.0 7.1 7.2 7.0 7.1 7.0 7.1 7.2 7.0 7.2 7.2 7.2 7.1 7.0
7.2 7.1 7.2 7.0 7.1 7.1 7.1 7.2 7.0 6.9 7.3 7.1 7.1 7.0 7.1 7.2 7.1 7.1 7.1 7.1
7.2 7.0 7.2 7.1 7.0 7.2 7.3 7.0 7.2 6.8 7.3 7.2 7.0 7.0 7.2 7.1 6.9 7.0 7.2 7.1
7.2 7.2 7.1 6.9 7.2 7.1 7.2 7.2 7.1 7.0 7.2 7.2 7.2 6.9 7.0 7.1 7.2 7.2 7.2 7.0
39.3 Replication. Fit a straight-line calibration model to y1 and then fit the straight line to the three
replicate measures of y. Suppose a colleague in another lab had the y1 data only and you had
all three replicates. Who will have the higher R² and who will have the best fitted calibration
curve? Compare the values of R² obtained. Estimate the pure error variance. How much of
the variation in y has been explained by the model?
  x     y1     y2     y3
  2    0.0    1.7    2.0
  5    4.0    2.0    4.5
  8    5.1    4.1    5.8
 12    8.1    8.9    8.4
 15    9.2    8.3    8.8
 18   11.3    9.5   10.9
 20   11.7   10.7   10.4
39.4 Range of Data. Fit a straight-line calibration model to the first 10 observations in the Exercise
36.3 data set, that is, for COD between 60 and 195 mg/L. Then fit the straight line to the full
data set (COD from 60 to 675 mg/L). Interpret the change in R² for the two cases.
40
Regression Analysis with Categorical Variables
KEY WORDS acid rain, pH, categorical variable, F test, indicator variable, least squares, linear model,
regression, dummy variable, qualitative variables, regression sum of squares, t-ratio, weak acidity.
Qualitative variables can be used as explanatory variables in regression models. A typical case would be
when several sets of data are similar except that each set was measured by a different chemist (or different
instrument or laboratory), or each set comes from a different location, or each set was measured on a
different day. The qualitative variables — chemist, location, or day — typically take on discrete values
(i.e., chemist Smith or chemist Jones). For convenience, they are usually represented numerically by a
combination of zeros and ones to signify an observation’s membership in a category; hence the name
categorical variables.
One task in the analysis of such data is to determine whether the same model structure and parameter
values hold for each data set. One way to do this would be to fit the proposed model to each individual
data set and then try to assess the similarities and differences in the goodness of fit. Another way would
be to fit the proposed model to all the data as though they were one data set instead of several, assuming
that each data set has the same pattern, and then to look for inadequacies in the fitted model.
Neither of these approaches is as attractive as using categorical variables to create a collective data
set that can be fitted to a single model while retaining the distinction between the individual data sets.
This technique allows the model structure and the model parameters to be evaluated using statistical
methods like those discussed in the previous chapter.
Case Study: Acidification of a Stream During Storms
Cosby Creek, in the southern Appalachian Mountains, was monitored during three storms to study how
pH and other measures of acidification were affected by the rainfall in that region. Samples were taken
every 30 min and 19 characteristics of the stream water chemistry were measured (Meinert et al., 1982).
Weak acidity (WA) and pH will be examined in this case study.
Figure 40.1 shows 17 observations for storm 1, 14 for storm 2, and 13 for storm 3, giving a total of
44 observations. If the data are analyzed without distinguishing between storms, one might consider
models of the form pH = β0 + β1WA + β2WA² or pH = θ3 + (θ1 − θ3)exp(−θ2WA). Each storm might be
described by pH = β0 + β1WA, but storm 3 does not have the same slope and intercept as storms 1 and
2, and storms 1 and 2 might be different as well. This can be checked by using categorical variables to
estimate a different slope and intercept for each storm.
Method: Regression with Categorical Variables
Suppose that a model needs to include an effect due to the category (storm event, farm plot, treatment,
truckload, operator, laboratory, etc.) from which the data came. This effect is included in the model in
the form of categorical variables (also called dummy or indicator variables). In general m − 1 categorical
variables are needed to specify m categories.
FIGURE 40.1 The relation of pH and weak acidity (µg/L) of Cosby Creek after three storms.
Begin by considering data from a single category. The quantitative predictor variable is x1, which can
predict the dependent variable y1 using the linear model:

y1i = β0 + β1x1i + ei

where β0 and β1 are parameters to be estimated by least squares.
If there are data from two categories (e.g., data produced at two different laboratories), one approach
would be to model the two sets of data separately as:

y1i = α0 + α1x1i + ei

and

y2i = β0 + β1x2i + ei

and then to compare the estimated intercepts (α0 and β0) and the estimated slopes (α1 and β1) using
confidence intervals or t-tests.
A second, and often better, method is to simultaneously fit a single augmented model to all the data.
To construct this model, define a categorical variable Z as follows:

Z = 0 if the data are in the first category
Z = 1 if the data are in the second category

The augmented model is:

yi = α0 + α1xi + Z(β0 + β1xi) + ei

With some rearrangement:

yi = α0 + β0Z + α1xi + β1Zxi + ei
In this last form the regression is done as though there are three independent variables: x, Z, and Zx.
The vectors of Z and Zx have to be created from the categorical variables defined above. The four
parameters α0, β0, α1, and β1 are estimated by linear regression.
A model for each category can be obtained by substituting the defined values. For the first category,
Z = 0 and:

yi = α0 + α1xi + ei
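The augmented model can be fitted by ordinary least squares on the design matrix [1, Z, x, Zx]. The sketch below uses numpy (an assumption) on made-up data from two categories, not the Cosby Creek data, and recovers a separate slope and intercept for each category.

```python
# Fitting y = a0 + b0*Z + a1*x + b1*Z*x by least squares,
# where Z = 0 for category 1 and Z = 1 for category 2.
# Synthetic illustration: category 1 follows y = 2 + 0.5x and
# category 2 follows y = 4 + 1.0x, plus small noise.
import numpy as np

rng = np.random.default_rng(0)
x1 = np.linspace(0, 10, 20)
x2 = np.linspace(0, 10, 20)
y1 = 2.0 + 0.5 * x1 + rng.normal(0, 0.1, x1.size)
y2 = 4.0 + 1.0 * x2 + rng.normal(0, 0.1, x2.size)

x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
Z = np.concatenate([np.zeros(x1.size), np.ones(x2.size)])  # categorical variable

# Design matrix with columns 1, Z, x, Zx
X = np.column_stack([np.ones(x.size), Z, x, Z * x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a0, b0, a1, b1 = coef

print(f"category 1: y = {a0:.2f} + {a1:.2f}x")
print(f"category 2: y = {a0 + b0:.2f} + {a1 + b1:.2f}x")
```

Setting Z = 0 recovers the first category's line (intercept a0, slope a1); setting Z = 1 gives the second category's line (intercept a0 + b0, slope a1 + b1), so b0 and b1 directly measure the differences between the categories.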