Substitution of the expressions for β̂₀ and β̂₁ into β̂₀ + β̂₁x*, followed by some
algebraic manipulation, leads to the representation of β̂₀ + β̂₁x* as a linear function
of the Yᵢ's:

$$\hat{\beta}_0 + \hat{\beta}_1 x^* \;=\; \sum_{i=1}^{n}\left[\frac{1}{n} + \frac{(x^* - \bar{x})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]Y_i \;=\; \sum_{i=1}^{n} d_i Y_i$$
The coefficients d₁, d₂, . . . , dₙ in this linear function involve the xᵢ's and x*, all of which
are fixed. Application of the rules of Section 5.5 to this linear function gives the following properties.
Let Ŷ = β̂₀ + β̂₁x*, where x* is some fixed value of x. Then

1. The mean value of Ŷ is

$$E(\hat{Y}) = E(\hat{\beta}_0 + \hat{\beta}_1 x^*) = \mu_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = \beta_0 + \beta_1 x^*$$

Thus β̂₀ + β̂₁x* is an unbiased estimator for β₀ + β₁x* (i.e., for μ_{Y·x*}).

2. The variance of Ŷ is

$$V(\hat{Y}) = \sigma_{\hat{Y}}^2 = \sigma^2\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum x_i^2 - \left(\sum x_i\right)^2/n}\right] = \sigma^2\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]$$

and the standard deviation σ_Ŷ is the square root of this expression. The estimated standard deviation of β̂₀ + β̂₁x*, denoted by s_Ŷ or s_{β̂₀+β̂₁x*}, results from replacing σ by its estimate s:

$$s_{\hat{Y}} = s_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

3. Ŷ has a normal distribution.
The variance of β̂₀ + β̂₁x* is smallest when x* = x̄ and increases as x* moves away
from x̄ in either direction. Thus the estimator of μ_{Y·x*} is more precise when x* is near
the center of the xᵢ's than when it is far from the x values at which observations have
been made. This implies that both the CI and PI are narrower for an x* near x̄ than
for an x* far from x̄. Most statistical computer packages will provide both β̂₀ + β̂₁x*
and s_{β̂₀+β̂₁x*} for any specified x* upon request.
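This behavior is easy to see numerically. The following short sketch is not from the text; the data and variable names are invented purely for illustration. It evaluates s_Ŷ at a few x* values and shows that the estimated standard deviation is smallest at x̄ and grows as x* moves away from it.

```python
import numpy as np

# Toy data, invented for illustration only
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
y = np.array([3.1, 4.8, 5.2, 7.9, 8.4, 11.0])

n = len(x)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx            # beta1-hat
b0 = y.mean() - b1 * xbar                                 # beta0-hat
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # estimate of sigma

for x_star in (xbar - 4, xbar - 2, xbar, xbar + 2, xbar + 4):
    s_yhat = s * np.sqrt(1 / n + (x_star - xbar) ** 2 / Sxx)
    print(f"x* = {x_star:5.2f}   y-hat = {b0 + b1 * x_star:6.3f}   s_yhat = {s_yhat:.4f}")
# s_yhat is smallest at x* = xbar and increases symmetrically on either side.
```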
Inferences Concerning μ_{Y·x*}
Just as inferential procedures for β₁ were based on the t variable obtained by standardizing β̂₁, a t variable obtained by standardizing β̂₀ + β̂₁x* leads to a CI and test
procedures here.
THEOREM

The variable

$$T = \frac{\hat{\beta}_0 + \hat{\beta}_1 x^* - (\beta_0 + \beta_1 x^*)}{S_{\hat{\beta}_0 + \hat{\beta}_1 x^*}} = \frac{\hat{Y} - (\beta_0 + \beta_1 x^*)}{S_{\hat{Y}}} \qquad (12.5)$$

has a t distribution with n − 2 df.
As for β₁ in the previous section, a probability statement involving this standardized variable can be manipulated to yield a confidence interval for μ_{Y·x*}.
A 100(1 − α)% CI for μ_{Y·x*}, the expected value of Y when x = x*, is

$$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\,n-2} \cdot s_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = \hat{y} \pm t_{\alpha/2,\,n-2} \cdot s_{\hat{Y}} \qquad (12.6)$$
This CI is centered at the point estimate for μ_{Y·x*} and extends out to each side by an
amount that depends on the confidence level and on the extent of variability in the
estimator on which the point estimate is based.
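As a minimal computational sketch of (12.6) (not from the text; the function and variable names are hypothetical), the interval can be assembled directly from the formulas of this section:

```python
import numpy as np
from scipy import stats

def mean_response_ci(x, y, x_star, conf=0.95):
    """CI (12.6) for the mean response beta0 + beta1*x* in simple linear regression."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * xbar
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))      # residual std dev
    y_hat = b0 + b1 * x_star                                     # point estimate
    s_yhat = s * np.sqrt(1 / n + (x_star - xbar) ** 2 / Sxx)     # estimated std dev of Y-hat
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)
    return y_hat, s_yhat, (y_hat - t_crit * s_yhat, y_hat + t_crit * s_yhat)
```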
Example 12.13
Corrosion of steel reinforcing bars is the most important durability problem for reinforced concrete structures. Carbonation of concrete results from a chemical reaction
that lowers the pH value by enough to initiate corrosion of the rebar. Representative
data on x = carbonation depth (mm) and y = strength (MPa) for a sample of core
specimens taken from a particular building follow (read from a plot in the article
“The Carbonation of Concrete Structures in the Tropical Environment of Singapore,”
Magazine of Concrete Res., 1996: 293–300).
x |  8.0  15.0  16.5  20.0  20.0  27.5  30.0  30.0  35.0
y | 22.8  27.2  23.7  17.1  21.5  18.6  16.1  23.4  13.4

x | 38.0  40.0  45.0  50.0  50.0  55.0  55.0  59.0  65.0
y | 19.5  12.4  13.2  11.4  10.3  14.1   9.7  12.0   6.8
Figure 12.17 MINITAB scatter plot with confidence intervals and prediction intervals for the
data of Example 12.13
A scatter plot of the data (see Figure 12.17) gives strong support to use of the simple linear regression model. Relevant quantities are as follows:
Σxᵢ = 659.0    Σyᵢ = 293.2    Σx²ᵢ = 28,967.50    Σxᵢyᵢ = 9293.95    Σy²ᵢ = 5335.76
x̄ = 36.6111    β̂₁ = −.297561    β̂₀ = 27.182936    Sxx = 4840.7778
SSE = 131.2402    r² = .766    s = 2.8640
Let’s now calculate a confidence interval, using a 95% confidence level, for the mean
strength for all core specimens having a carbonation depth of 45 mm; that is, a confidence interval for β₀ + β₁(45). The interval is centered at

ŷ = β̂₀ + β̂₁(45) = 27.18 − .2976(45) = 13.79
The estimated standard deviation of the statistic Ŷ is

$$s_{\hat{Y}} = 2.8640\sqrt{\frac{1}{18} + \frac{(45 - 36.6111)^2}{4840.7778}} = .7582$$
The 16 df t critical value for a 95% confidence level is 2.120, from which we determine the desired interval to be

13.79 ± (2.120)(.7582) = 13.79 ± 1.61 = (12.18, 15.40)
The narrowness of this interval suggests that we have reasonably precise information
about the mean value being estimated. Remember that if we recalculated this interval
for sample after sample, in the long run about 95% of the calculated intervals would
include β₀ + β₁(45). We can only hope that this mean value lies in the single interval
that we have calculated.
Figure 12.18 shows MINITAB output resulting from a request to fit the simple
linear regression model and calculate confidence intervals for the mean value of
strength at depths of 45 mm and 35 mm. The intervals are at the bottom of the output;
note that the second interval is narrower than the first, because 35 is much closer to x̄
than is 45. Figure 12.17 shows (1) curves corresponding to the confidence limits for
each different x value and (2) prediction limits, to be discussed shortly. Notice how the
curves get farther and farther apart as x moves away from x̄.
The regression equation is
strength = 27.2 − 0.298 depth

Predictor      Coef        Stdev       t-ratio      p
Constant       27.183      1.651       16.46        0.000
depth         −0.29756     0.04116     −7.23        0.000

s = 2.864      R-sq = 76.6%      R-sq(adj) = 75.1%

Analysis of Variance
SOURCE         DF      SS        MS        F        p
Regression      1     428.62    428.62    52.25    0.000
Error          16     131.24      8.20
Total          17     559.86

    Fit    Stdev.Fit        95.0% C.I.           95.0% P.I.
 13.793        0.758    (12.185, 15.401)    ( 7.510, 20.075)
 16.768        0.678    (15.330, 18.207)    (10.527, 23.009)

Figure 12.18   MINITAB regression output for the data of Example 12.13
■
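As a rough software cross-check of Figure 12.18 (a sketch only, assuming the statsmodels OLS/get_prediction interface; the text itself uses MINITAB), the same fit and intervals can be requested as follows:

```python
import numpy as np
import statsmodels.api as sm

# Carbonation depth (mm) and strength (MPa) from Example 12.13
depth = np.array([8.0, 15.0, 16.5, 20.0, 20.0, 27.5, 30.0, 30.0, 35.0,
                  38.0, 40.0, 45.0, 50.0, 50.0, 55.0, 55.0, 59.0, 65.0])
strength = np.array([22.8, 27.2, 23.7, 17.1, 21.5, 18.6, 16.1, 23.4, 13.4,
                     19.5, 12.4, 13.2, 11.4, 10.3, 14.1, 9.7, 12.0, 6.8])

res = sm.OLS(strength, sm.add_constant(depth)).fit()
print(res.params)       # intercept near 27.18, slope near -0.2976

# Mean-response CIs and prediction intervals at depths 45 and 35
new_x = sm.add_constant(np.array([45.0, 35.0]), has_constant="add")
print(res.get_prediction(new_x).summary_frame(alpha=0.05))
# the mean_ci_* columns correspond to the 95% C.I.s and obs_ci_* to the 95% P.I.s above
```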
In some situations, a CI is desired not just for a single x value but for two or more
x values. Suppose an investigator wishes a CI both for μ_{Y·v} and for μ_{Y·w}, where v and w
are two different values of the independent variable. It is tempting to compute the
interval (12.6) first for x = v and then for x = w. Suppose we use α = .05 in each computation to get two 95% intervals. Then if the variables involved in computing the two
intervals were independent of one another, the joint confidence coefficient would be
(.95)(.95) ≈ .90.
However, the intervals are not independent because the same β̂₀, β̂₁, and S are
used in each. We therefore cannot assert that the joint confidence level for the two
intervals is exactly 90%. It can be shown, though, that if the 100(1 − α)% CI (12.6)
is computed both for x = v and for x = w to obtain joint CIs for μ_{Y·v} and μ_{Y·w}, then
the joint confidence level on the resulting pair of intervals is at least 100(1 − 2α)%.
In particular, using α = .05 results in a joint confidence level of at least 90%,
whereas using α = .01 results in at least 98% confidence. For example, in Example
12.13 a 95% CI for μ_{Y·45} was (12.185, 15.401) and a 95% CI for μ_{Y·35} was (15.330,
18.207). The simultaneous or joint confidence level for the two statements
12.185 < μ_{Y·45} < 15.401 and 15.330 < μ_{Y·35} < 18.207 is at least 90%.
The validity of these joint or simultaneous CIs rests on a probability result
called the Bonferroni inequality, so the joint CIs are referred to as Bonferroni
intervals. The method is easily generalized to yield joint intervals for k different
μ_{Y·x}'s. Using the interval (12.6) separately first for x = x₁*, then for x = x₂*, . . . ,
and finally for x = xₖ* yields a set of k CIs for which the joint or simultaneous confidence level is guaranteed to be at least 100(1 − kα)%.
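A sketch of this Bonferroni adjustment (hypothetical names; it reuses the quantities from (12.6) rather than any particular package): to guarantee a joint level of at least 100(1 − α)% across k intervals, each individual interval is computed at level 100(1 − α/k)%.

```python
import numpy as np
from scipy import stats

def bonferroni_mean_cis(x, y, x_stars, joint_conf=0.90):
    """Joint CIs for the mean response at each x* in x_stars; the Bonferroni
    inequality guarantees simultaneous confidence of at least joint_conf."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, k = len(x), len(x_stars)
    alpha_each = (1 - joint_conf) / k            # split the error rate over the k intervals
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * xbar
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    t_crit = stats.t.ppf(1 - alpha_each / 2, df=n - 2)
    intervals = []
    for x_star in x_stars:
        y_hat = b0 + b1 * x_star
        se = s * np.sqrt(1 / n + (x_star - xbar) ** 2 / Sxx)
        intervals.append((y_hat - t_crit * se, y_hat + t_crit * se))
    return intervals
```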
Tests of hypotheses about β₀ + β₁x* are based on the test statistic T obtained
by replacing β₀ + β₁x* in the numerator of (12.5) by the null value μ₀. For example, H₀: β₀ + β₁(45) = 15 in Example 12.13 says that when carbonation depth
is 45, expected (i.e., true average) strength is 15. The test statistic value is then
t = [β̂₀ + β̂₁(45) − 15]/s_{β̂₀+β̂₁(45)}, and the test is upper-, lower-, or two-tailed according to the inequality in Hₐ.
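Using the quantities already computed in Example 12.13, a quick sketch of this test (illustrative only; the p-value line assumes a two-tailed Hₐ):

```python
from scipy import stats

y_hat, se, df = 13.79, 0.7582, 16        # from Example 12.13
null_value = 15.0
t_stat = (y_hat - null_value) / se       # roughly -1.60
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)
print(t_stat, p_two_tailed)              # p exceeds .05, so H0 is not rejected at level .05
```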
A Prediction Interval for a Future Value of Y
Analogous to the CI (12.6) for μ_{Y·x*}, one frequently wishes to obtain an interval of plausible values for the value of Y associated with some future observation when the independent variable has value x*. For instance, in the example in which vocabulary size y
is related to the age x of a child, for x = 6 years (12.6) would provide a CI for the true
average vocabulary size of all 6-year-old children. Alternatively, we might wish an
interval of plausible values for the vocabulary size of a particular 6-year-old child.
A CI refers to a parameter, or population characteristic, whose value is fixed
but unknown to us. In contrast, a future value of Y is not a parameter but instead a
random variable; for this reason we refer to an interval of plausible values for a future
Y as a prediction interval rather than a confidence interval. The error of estimation
is β₀ + β₁x* − (β̂₀ + β̂₁x*), a difference between a fixed (but unknown) quantity
and a random variable. The error of prediction is Y − (β̂₀ + β̂₁x*), a difference
between two random variables. There is thus more uncertainty in prediction than in
estimation, so a PI will be wider than a CI. Because the future value Y is independent of the observed Yᵢ's,
V[Y − (β̂₀ + β̂₁x*)] = variance of prediction error

$$= V(Y) + V(\hat{\beta}_0 + \hat{\beta}_1 x^*) = \sigma^2 + \sigma^2\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right] = \sigma^2\left[1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]$$
Furthermore, because E(Y) = β₀ + β₁x* and E(β̂₀ + β̂₁x*) = β₀ + β₁x*, the
expected value of the prediction error is E(Y − (β̂₀ + β̂₁x*)) = 0. It can then be
shown that the standardized variable

$$T = \frac{Y - (\hat{\beta}_0 + \hat{\beta}_1 x^*)}{S\sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}}}}$$

has a t distribution with n − 2 df. Substituting this T into the probability statement
P(−t_{α/2,n−2} < T < t_{α/2,n−2}) = 1 − α and manipulating to isolate Y between the two
inequalities yields the following interval.
A 100(1 − α)% PI for a future Y observation to be made when x = x* is

$$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\,n-2} \cdot s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} \qquad (12.7)$$
$$= \hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\,n-2} \cdot \sqrt{s^2 + s^2_{\hat{\beta}_0 + \hat{\beta}_1 x^*}}$$
$$= \hat{y} \pm t_{\alpha/2,\,n-2} \cdot \sqrt{s^2 + s^2_{\hat{Y}}}$$
The interpretation of the prediction level 100(1 − α)% is identical to that of previous
confidence levels: if (12.7) is used repeatedly, in the long run the resulting intervals
will actually contain the observed y values 100(1 − α)% of the time. Notice that the
1 underneath the initial square root symbol makes the PI (12.7) wider than the CI
(12.6), though the intervals are both centered at β̂₀ + β̂₁x*. Also, as n → ∞, the width
of the CI approaches 0, whereas the width of the PI does not (because even with perfect knowledge of β₀ and β₁, there will still be uncertainty in prediction).
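A companion sketch for (12.7), parallel to the CI function shown earlier (hypothetical names, not from the text):

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, x_star, conf=0.95):
    """PI (12.7) for a single future Y to be observed at x = x*."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * xbar
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    y_hat = b0 + b1 * x_star
    se_pred = s * np.sqrt(1 + 1 / n + (x_star - xbar) ** 2 / Sxx)   # extra "1 +" versus the CI
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)
    return y_hat - t_crit * se_pred, y_hat + t_crit * se_pred
```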
Example 12.14
Let’s return to the carbonation depth-strength data of Example 12.13 and calculate a
95% prediction interval for a strength value that would result from selecting a single
core specimen whose carbonation depth is 45 mm. Relevant quantities from that
example are
ŷ = 13.79        s_Ŷ = .7582        s = 2.8640
For a prediction level of 95% based on n − 2 = 16 df, the t critical value is 2.120,
exactly what we previously used for a 95% confidence level. The prediction interval
is then
13.79 ± (2.120)√((2.8640)² + (.7582)²) = 13.79 ± (2.120)(2.963)
= 13.79 ± 6.28 = (7.51, 20.07)
Plausible values for a single observation on strength when depth is 45 mm are (at the
95% prediction level) between 7.51 MPa and 20.07 MPa. The 95% confidence interval for mean strength when depth is 45 was (12.18, 15.40). The prediction interval is
much wider than this because of the extra (2.8640)² under the square root. Figure
12.18, the MINITAB output in Example 12.13, shows this interval as well as the confidence interval.
■
The Bonferroni technique can be employed as in the case of confidence intervals. If a 100(1 − α)% PI is calculated for each of k different values of x, the simultaneous or joint prediction level for all k intervals is at least 100(1 − kα)%.
EXERCISES
Section 12.4 (44–56)
44. Fitting the simple linear regression model to the n = 27
observations on x = modulus of elasticity and y = flexural
strength given in Exercise 15 of Section 12.2 resulted in
ŷ = 7.592, s_Ŷ = .179 when x = 40 and ŷ = 9.741, s_Ŷ = .253
for x = 60.
a. Explain why s_Ŷ is larger when x = 60 than when x = 40.
b. Calculate a confidence interval with a confidence level of
95% for the true average strength of all beams whose
modulus of elasticity is 40.
c. Calculate a prediction interval with a prediction level of
95% for the strength of a single beam whose modulus of
elasticity is 40.
d. If a 95% CI is calculated for true average strength when
modulus of elasticity is 60, what will be the simultaneous
confidence level for both this interval and the interval calculated in part (b)?
45. Reconsider the filtration rate–moisture content data introduced
in Example 12.6 (see also Example 12.7).
a. Compute a 90% CI for β₀ + β₁(125), true average moisture content when the filtration rate is 125.
b. Predict the value of moisture content for a single experimental run in which the filtration rate is 125, using a 90%
prediction level. How does this interval compare to the
interval of part (a)? Why is this the case?
c. How would the intervals of parts (a) and (b) compare to a
CI and PI when filtration rate is 115? Answer without
actually calculating these new intervals.
d. Interpret the hypotheses H₀: β₀ + β₁(125) = 80 and Hₐ:
β₀ + β₁(125) < 80, and then carry out a test at significance
level .01.
46. The article “The Incorporation of Uranium and Silver
by Hydrothermally Synthesized Galena” (Econ. Geology,
1964: 1003–1024) reports on the determination of silver
content of galena crystals grown in a closed hydrothermal
system over a range of temperature. With x = crystallization
temperature in °C and y = Ag₂S in mol%, the data follows:

x | 398  292  352  575  568  450  550  408  484  350  503  600  600
y | .15  .05  .23  .43  .23  .40  .44  .44  .45  .09  .59  .63  .60

from which Σxᵢ = 6130, Σx²ᵢ = 3,022,050, Σyᵢ = 4.73,
Σy²ᵢ = 2.1785, Σxᵢyᵢ = 2418.74, β̂₁ = .00143, β̂₀ = −.311,
and s = .131.
a. Estimate true average silver content when temperature
is 500°C using a 95% confidence interval.
b. How would the width of a 95% CI for true average silver
content when temperature is 400°C compare to the width
of the interval in part (a)? Answer without computing
this new interval.
c. Calculate a 95% CI for the true average change in silver
content associated with a 1°C increase in temperature.
d. Suppose it had previously been believed that when crystallization temperature was 400°C, true average silver content
would be .25. Carry out a test at significance level .05 to
decide whether the sample data contradicts this prior belief.
47. The simple linear regression model provides a very good fit
to the data on rainfall and runoff volume given in Exercise 16
of Section 12.2. The equation of the least squares line is
ŷ = −1.128 + .82697x, r² = .975, and s = 5.24.
a. Use the fact that s_Ŷ = 1.44 when rainfall volume is 40 m³
to predict runoff in a way that conveys information about
reliability and precision. Does the resulting interval suggest
that precise information about the value of runoff for this
future observation is available? Explain your reasoning.
b. Calculate a PI for runoff when rainfall is 50 using the
same prediction level as in part (a). What can be said
about the simultaneous prediction level for the two intervals you have calculated?
48. The catch basin in a storm sewer system is the interface
between surface runoff and the sewer. The catch basin insert
is a device for retrofitting catch basins to improve pollutant
removal properties. The article “An Evaluation of the Urban
Stormwater Pollutant Removal Efficiency of Catch Basin
Inserts” (Water Envir. Res., 2005: 500–510) reported on
tests of various inserts under controlled conditions for
which inflow is close to what can be expected in the field.
Consider the following data, read from a graph in the article, for one particular type of insert, on x = amount filtered
(1000s of liters) and y = % total suspended solids removed.

x |   23    45    68    91   114   136   159   182   205   228
y | 53.3  26.9  54.8  33.8  29.9   8.2  17.2  12.2   3.2  11.1

Summary quantities are
Σxᵢ = 1251, Σx²ᵢ = 199,365, Σyᵢ = 250.6, Σy²ᵢ = 9249.36,
Σxᵢyᵢ = 21,904.4
a. Does a scatter plot support the choice of the simple linear regression model? Explain.
b. Obtain the equation of the least squares line.
c. What proportion of observed variation in % removed can
be attributed to the model relationship?
d. Does the simple linear regression model specify a useful
relationship? Carry out an appropriate test of hypotheses
using a significance level of .05.
e. Is there strong evidence for concluding that there is at least
a 2% decrease in true average suspended solid removal
associated with a 10,000-liter increase in the amount filtered? Test appropriate hypotheses using α = .05.
f. Calculate and interpret a 95% CI for true average %
removed when amount filtered is 100,000 liters. How
does this interval compare in width to a CI when amount
filtered is 200,000 liters?
g. Calculate and interpret a 95% PI for % removed when
amount filtered is 100,000 liters. How does this interval
compare in width to the CI calculated in (f) and to a PI
when amount filtered is 200,000 liters?
49. You are told that a 95% CI for expected lead content when
traffic flow is 15, based on a sample of n = 10 observations,
is (462.1, 597.7). Calculate a CI with confidence level 99%
for expected lead content when traffic flow is 15.
50. Silicon-germanium alloys have been used in certain types of
solar cells. The paper “Silicon-Germanium Films Deposited
by Low-Frequency Plasma-Enhanced Chemical Vapor
Deposition” (J. of Material Res., 2006: 88–104) reported on
a study of various structural and electrical properties.
Consider the accompanying data on x = Ge concentration in
solid phase (ranging from 0 to 1) and y = Fermi level position (eV):

x |   0  .42  .23  .33  .62  .60  .45  .87  .90  .79    1    1    1
y | .62  .53  .61  .59  .50  .55  .59  .31  .43  .46  .23  .22  .19

A scatter plot shows a substantial linear relationship. Here is
MINITAB output from a least squares fit. [Note: There are
several inconsistencies between the data given in the paper,
the plot that appears there, and the summary information
about a regression analysis.]

The regression equation is
Fermi pos = 0.7217 − 0.4327 Ge conc

S = 0.0737573    R-Sq = 80.2%    R-Sq(adj) = 78.4%

Analysis of Variance
Source          DF    SS          MS          F        P
Regression       1    0.241728    0.241728    44.43    0.000
Error           11    0.059842    0.005440
Total           12    0.301569
a. Obtain an interval estimate of the expected change in
Fermi level position associated with an increase of .1 in
Ge concentration, and interpret your estimate.
b. Obtain an interval estimate for mean Fermi level position
when concentration is .50, and interpret your estimate.
c. Obtain an interval of plausible values for position resulting from a single observation to be made when concentration is .50, interpret your interval, and compare to the
interval of (b).
d. Obtain simultaneous CIs for expected position when
concentration is .3, .5, and .7; the joint confidence level
should be at least 97%.
51. Refer to Example 12.12, in which x = volume fraction of
oxides/inclusions and y = % elongation.
a. MINITAB gave s_{β̂₀+β̂₁(.40)} = .0311 and s_{β̂₀+β̂₁(1.20)} = .0352.
Why is the former estimated standard deviation smaller
than the latter one?
b. Use the MINITAB output from the example to calculate
a 95% CI for expected % elongation when volume fraction = .40.
c. Use the MINITAB output to calculate a 95% PI for a single value of % elongation to be observed when volume
fraction = 1.20.
52. Plasma etching is essential to the fine-line pattern transfer
in current semiconductor processes. The article “Ion
Beam-Assisted Etching of Aluminum with Chlorine”
(J. Electrochem. Soc., 1985: 2010– 2012) gives the accompanying data (read from a graph) on chlorine flow (x, in
SCCM) through a nozzle used in the etching mechanism and
etch rate (y, in 100 A/min).
x |  1.5   1.5   2.0   2.5   2.5   3.0   3.5   3.5   4.0
y | 23.0  24.5  25.0  30.0  33.5  40.0  40.5  47.0  49.0

The summary statistics are Σxᵢ = 24.0, Σyᵢ = 312.5, Σx²ᵢ =
70.50, Σxᵢyᵢ = 902.25, Σy²ᵢ = 11,626.75, β̂₀ = 6.448718,
β̂₁ = 10.602564.
a. Does the simple linear regression model specify a useful
relationship between chlorine flow and etch rate?
b. Estimate the true average change in etch rate associated
with a 1-SCCM increase in flow rate using a 95% confidence interval, and interpret the interval.
c. Calculate a 95% CI for μ_{Y·3.0}, the true average etch rate
when flow = 3.0. Has this average been precisely estimated?
d. Calculate a 95% PI for a single future observation on
etch rate to be made when flow = 3.0. Is the prediction
likely to be accurate?
e. Would the 95% CI and PI when flow = 2.5 be wider or
narrower than the corresponding intervals of parts (c) and
(d)? Answer without actually computing the intervals.
f. Would you recommend calculating a 95% PI for a flow of
6.0? Explain.
53. Consider the following four intervals based on the data of
Example 12.4 (Section 12.2):
a. A 95% CI for mean porosity when unit weight is 110
b. A 95% PI for porosity when unit weight is 110
c. A 95% CI for mean porosity when unit weight is 115
d. A 95% PI for porosity when unit weight is 115
Without computing any of these intervals, what can be said
about their widths relative to one another?
54. The decline of water supplies in certain areas of the United
States has created the need for increased understanding of
relationships between economic factors such as crop yield
and hydrologic and soil factors. The article “Variability of
Soil Water Properties and Crop Yield in a Sloped
Watershed” (Water Resources Bull., 1988: 281–288) gives
data on grain sorghum yield (y, in g/m-row) and distance
upslope (x, in m) on a sloping watershed. Selected observations are given in the accompanying table.
x |   0   10   20   30   45   50   70
y | 500  590  410  470  450  480  510

x |  80  100  120  140  160  170  190
y | 450  360  400  300  410  280  350
a. Construct a scatter plot. Does the simple linear regression model appear to be plausible?
b. Carry out a test of model utility.
c. Estimate true average yield when distance upslope is
75 by giving an interval of plausible values.
55. Verify that V(β̂₀ + β̂₁x) is indeed given by the expression in
the text. [Hint: V(Σdᵢ Yᵢ) = Σdᵢ² · V(Yᵢ).]
56. The article “Bone Density and Insertion Torque as
Predictors of Anterior Cruciate Ligament Graft Fixation
Strength” (The Amer. J. of Sports Med., 2004: 1421–1429)
gave the accompanying data on maximum insertion torque
(N·m) and yield load (N), the latter being one measure of
graft strength, for 15 different specimens.

Torque | 1.8  2.2  1.9  1.3  2.1  2.2  1.6
Load   | 491  477  598  361  605  671  466

Torque | 1.2  1.8  2.6  2.5  2.5  1.7  1.6  2.1
Load   | 384  422  554  577  642  348  446  431
a. Is it plausible that yield load is normally distributed?
b. Estimate true average yield load by calculating a confidence interval with a confidence level of 95%, and interpret the interval.
c. Here is output from MINITAB for the regression of yield
load on torque. Does the simple linear regression model
specify a useful relationship between the variables?
Predictor    Coef      SE Coef    T       P
Constant     152.44    91.17      1.67    0.118
Torque       178.23    45.97      3.88    0.002

S = 73.2141    R-Sq = 53.6%    R-Sq(adj) = 50.0%

Source            DF    SS        MS       F        P
Regression         1     80554    80554    15.03    0.002
Residual Error    13     69684     5360
Total             14    150238
d. The authors of the cited paper state, “Consequently, we
cannot but conclude that simple regression analysis-based methods are not clinically sufficient to predict
individual fixation strength.” Do you agree? [Hint:
Consider predicting yield load when torque is 2.0.]
12.5 Correlation
There are many situations in which the objective in studying the joint behavior of
two variables is to see whether they are related, rather than to use one to predict the
value of the other. In this section, we first develop the sample correlation coefficient
r as a measure of how strongly related two variables x and y are in a sample and then
relate r to the correlation coefficient ρ defined in Chapter 5.
The Sample Correlation Coefficient r
Given n pairs of observations (x1, y1), (x2, y2), . . . , (xn, yn), it is natural to speak of x and
y having a positive relationship if large x’s are paired with large y’s and small x’s with
small y’s. Similarly, if large x’s are paired with small y’s and small x’s with large y’s,
then a negative relationship between the variables is implied. Consider the quantity
$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
Then if the relationship is strongly positive, an xᵢ above the mean x̄ will tend to be
paired with a yᵢ above the mean ȳ, so that (xᵢ − x̄)(yᵢ − ȳ) > 0, and this product will
also be positive whenever both xᵢ and yᵢ are below their respective means. Thus a positive relationship implies that Sxy will be positive. An analogous argument shows that
when the relationship is negative, Sxy will be negative, since most of the products
(xᵢ − x̄)(yᵢ − ȳ) will be negative. This is illustrated in Figure 12.19.
Although Sxy seems a plausible measure of the strength of a relationship, we
do not yet have any idea of how positive or negative it can be. Unfortunately, Sxy has
a serious defect: By changing the units of measurement of either x or y, Sxy can be
made either arbitrarily large in magnitude or arbitrarily close to zero. For example,
if Sxy = 25 when x is measured in meters, then Sxy = 25,000 when x is measured in
millimeters and .025 when x is expressed in kilometers. A reasonable condition to
Figure 12.19   (a) Scatter plot with Sxy positive; (b) scatter plot with Sxy negative
[+ means (xᵢ − x̄)(yᵢ − ȳ) > 0, and − means (xᵢ − x̄)(yᵢ − ȳ) < 0]
impose on any measure of how strongly x and y are related is that the calculated
measure should not depend on the particular units used to measure them. This condition is achieved by modifying Sxy to obtain the sample correlation coefficient.
DEFINITION

The sample correlation coefficient for the n pairs (x₁, y₁), . . . , (xₙ, yₙ) is

$$r = \frac{S_{xy}}{\sqrt{\sum(x_i - \bar{x})^2}\,\sqrt{\sum(y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} \qquad (12.8)$$

Example 12.15
An accurate assessment of soil productivity is critical to rational land-use planning. Unfortunately, as the author of the article “Productivity Ratings Based on
Soil Series” (Prof. Geographer, 1980: 158–163) argues, an acceptable soil productivity index is not so easy to come by. One difficulty is that productivity is determined partly by which crop is planted, and the relationship between yield of two
different crops planted in the same soil may not be very strong. To illustrate, the
article presents the accompanying data on corn yield x and peanut yield y (mT/Ha)
for eight different types of soil.
x |  2.4   3.4   4.6   3.7   2.2   3.3   4.0   2.1
y | 1.33  2.12  1.80  1.65  2.00  1.76  2.11  1.63
With Σxᵢ = 25.7, Σyᵢ = 14.40, Σx²ᵢ = 88.31, Σxᵢyᵢ = 46.856, and Σy²ᵢ = 26.4324,

$$S_{xx} = 88.31 - \frac{(25.7)^2}{8} = 88.31 - 82.56 = 5.75$$
$$S_{yy} = 26.4324 - \frac{(14.40)^2}{8} = .5124$$
$$S_{xy} = 46.856 - \frac{(25.7)(14.40)}{8} = .5960$$

from which

$$r = \frac{.5960}{\sqrt{5.75}\,\sqrt{.5124}} = .347$$
■
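A quick numerical check of Example 12.15 (a sketch; np.corrcoef is used only because it implements the same definition (12.8)):

```python
import numpy as np

corn = np.array([2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1])             # corn yield x
peanut = np.array([1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63])   # peanut yield y

Sxy = np.sum((corn - corn.mean()) * (peanut - peanut.mean()))
r = Sxy / np.sqrt(np.sum((corn - corn.mean()) ** 2) * np.sum((peanut - peanut.mean()) ** 2))
print(round(r, 3))                                  # 0.347, matching the hand calculation
print(round(np.corrcoef(corn, peanut)[0, 1], 3))    # same value from NumPy
```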
Properties of r
The most important properties of r are as follows:
1. The value of r does not depend on which of the two variables under study is
labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.
3. −1 ≤ r ≤ 1
4. r = 1 if and only if (iff) all (xᵢ, yᵢ) pairs lie on a straight line with positive slope,
and r = −1 iff all (xᵢ, yᵢ) pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient
of determination that would result from fitting the simple linear regression
model; in symbols, (r)² = r².
Property 1 stands in marked contrast to what happens in regression analysis,
where virtually all quantities of interest (the estimated slope, estimated y-intercept,
s², etc.) depend on which of the two variables is treated as the dependent variable.
However, Property 5 shows that the proportion of variation in the dependent variable
explained by fitting the simple linear regression model does not depend on which
variable plays this role.
Property 2 is equivalent to saying that r is unchanged if each xᵢ is replaced by
cxᵢ and if each yᵢ is replaced by dyᵢ (a change in the scale of measurement), as well
as if each xᵢ is replaced by xᵢ − a and yᵢ by yᵢ − b (which changes the location of zero
on the measurement axis). This implies, for example, that r is the same whether temperature is measured in °F or °C.
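Property 2 is easy to verify numerically; the following sketch (the temperature and response values are invented purely for illustration) converts x from °C to °F and confirms that r is unchanged:

```python
import numpy as np

temp_c = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])   # toy x values in degrees C
response = np.array([2.1, 2.8, 3.5, 3.9, 4.6, 5.3])       # toy y values

temp_f = 9 / 5 * temp_c + 32             # change of scale and of the location of zero
r_c = np.corrcoef(temp_c, response)[0, 1]
r_f = np.corrcoef(temp_f, response)[0, 1]
print(r_c, r_f, np.isclose(r_c, r_f))    # the two correlations agree
```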
Property 3 tells us that the maximum value of r, corresponding to the largest
possible degree of positive relationship, is r = 1, whereas the most negative relationship is identified with r = −1. According to Property 4, the largest positive and
largest negative correlations are achieved only when all points lie along a straight line.
Any other configuration of points, even if the configuration suggests a deterministic
relationship between variables, will yield an r value less than 1 in absolute magnitude.
Thus r measures the degree of linear relationship among variables. A value of r near
0 is not evidence of the lack of a strong relationship, but only the absence of a linear
relation, so such a value of r must be interpreted with caution. Figure 12.20 illustrates several configurations of points associated with different values of r.
Figure 12.20   Data plots for different values of r: (a) r near +1; (b) r near −1;
(c) r near 0, no apparent relationship; (d) r near 0, nonlinear relationship
A frequently asked question is, “When can it be said that there is a strong correlation between the variables, and when is the correlation weak?” A reasonable rule
of thumb is to say that the correlation is weak if 0 ≤ |r| ≤ .5, strong if .8 ≤ |r| ≤ 1,
and moderate otherwise. It may surprise you that r = .5 is considered weak, but
r² = .25 implies that in a regression of y on x, only 25% of observed y variation
would be explained by the model. In Example 12.15, the correlation between corn
yield and peanut yield would be described as weak.
The Population Correlation Coefficient ρ
and Inferences About Correlation
The correlation coefficient r is a measure of how strongly related x and y are in the
observed sample. We can think of the pairs (xi, yi) as having been drawn from a
bivariate population of pairs, with (Xi, Yi) having joint probability distribution f(x, y).
In Chapter 5, we defined the correlation coefficient ρ(X, Y) by

$$\rho = \rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}$$
where

$$\operatorname{Cov}(X, Y) = \begin{cases} \displaystyle\sum_{x}\sum_{y}(x - \mu_X)(y - \mu_Y)\,p(x, y) & (X, Y)\ \text{discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(x - \mu_X)(y - \mu_Y)\,f(x, y)\,dx\,dy & (X, Y)\ \text{continuous} \end{cases}$$
If we think of f(x, y) as describing the distribution of pairs of values within the entire
population, ρ becomes a measure of how strongly related x and y are in that population. Properties of ρ analogous to those for r were given in Chapter 5.
The population correlation coefficient ρ is a parameter or population characteristic, just as μ_X, μ_Y, σ_X, and σ_Y are, so we can use the sample correlation coefficient to make various inferences about ρ. In particular, r is a point estimate for ρ, and
the corresponding estimator is

$$\hat{\rho} = R = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum(X_i - \bar{X})^2}\,\sqrt{\sum(Y_i - \bar{Y})^2}}$$
Example 12.16
In some locations, there is a strong association between concentrations of two different pollutants. The article “The Carbon Component of the Los Angeles Aerosol:
Source Apportionment and Contributions to the Visibility Budget” (J. Air Pollution
Control Fed., 1984: 643–650) reports the accompanying data on ozone concentration x (ppm) and secondary carbon concentration y (μg/m³).
x | .066  .088  .120  .050  .162  .186  .057  .100
y |  4.6  11.6   9.5   6.3  13.8  15.4   2.5  11.8

x | .112  .055  .154  .074  .111  .140  .071  .110
y |  8.0   7.0  20.6  16.6   9.2  17.9   2.8  13.0
The summary quantities are n = 16, Σxᵢ = 1.656, Σyᵢ = 170.6, Σx²ᵢ = .196912,
Σxᵢyᵢ = 20.0397, and Σy²ᵢ = 2253.56, from which

$$r = \frac{20.0397 - (1.656)(170.6)/16}{\sqrt{.196912 - (1.656)^2/16}\,\sqrt{2253.56 - (170.6)^2/16}} = \frac{2.3826}{(.1597)(20.8456)} = .716$$