Substitution of the expressions for β̂₀ and β̂₁ into β̂₀ + β̂₁x*, followed by some
algebraic manipulation, leads to the representation of β̂₀ + β̂₁x* as a linear function
of the Yᵢ's:

$$\hat{\beta}_0 + \hat{\beta}_1 x^* \;=\; \sum_{i=1}^{n}\left[\frac{1}{n} + \frac{(x^* - \bar{x})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]Y_i \;=\; \sum_{i=1}^{n} d_i Y_i$$
The coefficients d₁, d₂, . . . , dₙ in this linear function involve the xᵢ's and x*, all of which
are fixed. Application of the rules of Section 5.5 to this linear function gives the following properties.
Let Ŷ = β̂₀ + β̂₁x*, where x* is some fixed value of x. Then

1. The mean value of Ŷ is

$$E(\hat{Y}) = E(\hat{\beta}_0 + \hat{\beta}_1 x^*) = \mu_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = \beta_0 + \beta_1 x^*$$

Thus β̂₀ + β̂₁x* is an unbiased estimator for β₀ + β₁x* (i.e., for μ_{Y·x*}).

2. The variance of Ŷ is

$$V(\hat{Y}) = \sigma_{\hat{Y}}^2 = \sigma^2\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum x_i^2 - \left(\sum x_i\right)^2/n}\right] = \sigma^2\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]$$

and the standard deviation σ_Ŷ is the square root of this expression. The estimated standard deviation of β̂₀ + β̂₁x*, denoted by s_Ŷ or s_{β̂₀+β̂₁x*}, results from replacing σ by its estimate s:

$$s_{\hat{Y}} = s_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

3. Ŷ has a normal distribution.
The variance of β̂₀ + β̂₁x* is smallest when x* = x̄ and increases as x* moves away
from x̄ in either direction. Thus the estimator of μ_{Y·x*} is more precise when x* is near
the center of the xᵢ's than when it is far from the x values at which observations have
been made. This implies that both the CI and PI are narrower for an x* near x̄ than
for an x* far from x̄. Most statistical computer packages will provide both β̂₀ + β̂₁x*
and s_{β̂₀+β̂₁x*} for any specified x* upon request.
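This behavior is easy to see numerically. The following short sketch is not from the text; the data and variable names are invented purely for illustration. It evaluates s_Ŷ at a few x* values and shows that the estimated standard deviation is smallest at x̄ and grows as x* moves away from it.

```python
import numpy as np

# Toy data, invented for illustration only
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
y = np.array([3.1, 4.8, 5.2, 7.9, 8.4, 11.0])

n = len(x)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx            # beta1-hat
b0 = y.mean() - b1 * xbar                                 # beta0-hat
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # estimate of sigma

for x_star in (xbar - 4, xbar - 2, xbar, xbar + 2, xbar + 4):
    s_yhat = s * np.sqrt(1 / n + (x_star - xbar) ** 2 / Sxx)
    print(f"x* = {x_star:5.2f}   y-hat = {b0 + b1 * x_star:6.3f}   s_yhat = {s_yhat:.4f}")
# s_yhat is smallest at x* = xbar and increases symmetrically on either side.
```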
Inferences Concerning μ_{Y·x*}
Just as inferential procedures for β₁ were based on the t variable obtained by standardizing β̂₁, a t variable obtained by standardizing β̂₀ + β̂₁x* leads to a CI and test
procedures here.
THEOREM

The variable

$$T = \frac{\hat{\beta}_0 + \hat{\beta}_1 x^* - (\beta_0 + \beta_1 x^*)}{S_{\hat{\beta}_0 + \hat{\beta}_1 x^*}} = \frac{\hat{Y} - (\beta_0 + \beta_1 x^*)}{S_{\hat{Y}}} \qquad (12.5)$$

has a t distribution with n − 2 df.
As for β₁ in the previous section, a probability statement involving this standardized variable can be manipulated to yield a confidence interval for μ_{Y·x*}.
A 100(1 − α)% CI for μ_{Y·x*}, the expected value of Y when x = x*, is

$$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\,n-2} \cdot s_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = \hat{y} \pm t_{\alpha/2,\,n-2} \cdot s_{\hat{Y}} \qquad (12.6)$$
This CI is centered at the point estimate for μ_{Y·x*} and extends out to each side by an
amount that depends on the confidence level and on the extent of variability in the
estimator on which the point estimate is based.
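As a minimal computational sketch of (12.6) (not from the text; the function and variable names are hypothetical), the interval can be assembled directly from the formulas of this section:

```python
import numpy as np
from scipy import stats

def mean_response_ci(x, y, x_star, conf=0.95):
    """CI (12.6) for the mean response beta0 + beta1*x* in simple linear regression."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * xbar
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))      # residual std dev
    y_hat = b0 + b1 * x_star                                     # point estimate
    s_yhat = s * np.sqrt(1 / n + (x_star - xbar) ** 2 / Sxx)     # estimated std dev of Y-hat
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)
    return y_hat, s_yhat, (y_hat - t_crit * s_yhat, y_hat + t_crit * s_yhat)
```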
Example 12.13
Corrosion of steel reinforcing bars is the most important durability problem for reinforced concrete structures. Carbonation of concrete results from a chemical reaction
that lowers the pH value by enough to initiate corrosion of the rebar. Representative
data on x = carbonation depth (mm) and y = strength (MPa) for a sample of core
specimens taken from a particular building follow (read from a plot in the article
“The Carbonation of Concrete Structures in the Tropical Environment of Singapore,”
Magazine of Concrete Res., 1996: 293–300).
x |  8.0  15.0  16.5  20.0  20.0  27.5  30.0  30.0  35.0
y | 22.8  27.2  23.7  17.1  21.5  18.6  16.1  23.4  13.4

x | 38.0  40.0  45.0  50.0  50.0  55.0  55.0  59.0  65.0
y | 19.5  12.4  13.2  11.4  10.3  14.1   9.7  12.0   6.8
Figure 12.17 MINITAB scatter plot with confidence intervals and prediction intervals for the
data of Example 12.13
A scatter plot of the data (see Figure 12.17) gives strong support to use of the simple linear regression model. Relevant quantities are as follows:
Σxᵢ = 659.0    Σyᵢ = 293.2    Σx²ᵢ = 28,967.50    Σxᵢyᵢ = 9293.95    Σy²ᵢ = 5335.76
x̄ = 36.6111    β̂₁ = −.297561    β̂₀ = 27.182936    Sxx = 4840.7778
SSE = 131.2402    r² = .766    s = 2.8640
Let’s now calculate a confidence interval, using a 95% confidence level, for the mean
strength for all core specimens having a carbonation depth of 45 mm; that is, a confidence interval for β₀ + β₁(45). The interval is centered at

ŷ = β̂₀ + β̂₁(45) = 27.18 − .2976(45) = 13.79
The estimated standard deviation of the statistic Ŷ is

$$s_{\hat{Y}} = 2.8640\sqrt{\frac{1}{18} + \frac{(45 - 36.6111)^2}{4840.7778}} = .7582$$
The 16 df t critical value for a 95% confidence level is 2.120, from which we determine the desired interval to be

13.79 ± (2.120)(.7582) = 13.79 ± 1.61 = (12.18, 15.40)
The narrowness of this interval suggests that we have reasonably precise information
about the mean value being estimated. Remember that if we recalculated this interval
for sample after sample, in the long run about 95% of the calculated intervals would
include β₀ + β₁(45). We can only hope that this mean value lies in the single interval
that we have calculated.
Figure 12.18 shows MINITAB output resulting from a request to fit the simple
linear regression model and calculate confidence intervals for the mean value of
strength at depths of 45 mm and 35 mm. The intervals are at the bottom of the output;
note that the second interval is narrower than the first, because 35 is much closer to x̄
than is 45. Figure 12.17 shows (1) curves corresponding to the confidence limits for
each different x value and (2) prediction limits, to be discussed shortly. Notice how the
curves get farther and farther apart as x moves away from x̄.
The regression equation is
strength = 27.2 − 0.298 depth

Predictor      Coef        Stdev       t-ratio      p
Constant       27.183      1.651       16.46        0.000
depth         −0.29756     0.04116     −7.23        0.000

s = 2.864      R-sq = 76.6%      R-sq(adj) = 75.1%

Analysis of Variance
SOURCE         DF      SS        MS        F        p
Regression      1     428.62    428.62    52.25    0.000
Error          16     131.24      8.20
Total          17     559.86

    Fit    Stdev.Fit        95.0% C.I.           95.0% P.I.
 13.793        0.758    (12.185, 15.401)    ( 7.510, 20.075)
 16.768        0.678    (15.330, 18.207)    (10.527, 23.009)

Figure 12.18   MINITAB regression output for the data of Example 12.13
■
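As a rough software cross-check of Figure 12.18 (a sketch only, assuming the statsmodels OLS/get_prediction interface; the text itself uses MINITAB), the same fit and intervals can be requested as follows:

```python
import numpy as np
import statsmodels.api as sm

# Carbonation depth (mm) and strength (MPa) from Example 12.13
depth = np.array([8.0, 15.0, 16.5, 20.0, 20.0, 27.5, 30.0, 30.0, 35.0,
                  38.0, 40.0, 45.0, 50.0, 50.0, 55.0, 55.0, 59.0, 65.0])
strength = np.array([22.8, 27.2, 23.7, 17.1, 21.5, 18.6, 16.1, 23.4, 13.4,
                     19.5, 12.4, 13.2, 11.4, 10.3, 14.1, 9.7, 12.0, 6.8])

res = sm.OLS(strength, sm.add_constant(depth)).fit()
print(res.params)       # intercept near 27.18, slope near -0.2976

# Mean-response CIs and prediction intervals at depths 45 and 35
new_x = sm.add_constant(np.array([45.0, 35.0]), has_constant="add")
print(res.get_prediction(new_x).summary_frame(alpha=0.05))
# the mean_ci_* columns correspond to the 95% C.I.s and obs_ci_* to the 95% P.I.s above
```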
In some situations, a CI is desired not just for a single x value but for two or more
x values. Suppose an investigator wishes a CI both for μ_{Y·v} and for μ_{Y·w}, where v and w
are two different values of the independent variable. It is tempting to compute the
interval (12.6) first for x = v and then for x = w. Suppose we use α = .05 in each computation to get two 95% intervals. Then if the variables involved in computing the two
intervals were independent of one another, the joint confidence coefficient would be
(.95)(.95) ≈ .90.
However, the intervals are not independent because the same β̂₀, β̂₁, and S are
used in each. We therefore cannot assert that the joint confidence level for the two
intervals is exactly 90%. It can be shown, though, that if the 100(1 − α)% CI (12.6)
is computed both for x = v and for x = w to obtain joint CIs for μ_{Y·v} and μ_{Y·w}, then
the joint confidence level on the resulting pair of intervals is at least 100(1 − 2α)%.
In particular, using α = .05 results in a joint confidence level of at least 90%,
whereas using α = .01 results in at least 98% confidence. For example, in Example
12.13 a 95% CI for μ_{Y·45} was (12.185, 15.401) and a 95% CI for μ_{Y·35} was (15.330,
18.207). The simultaneous or joint confidence level for the two statements
12.185 < μ_{Y·45} < 15.401 and 15.330 < μ_{Y·35} < 18.207 is at least 90%.
The validity of these joint or simultaneous CIs rests on a probability result
called the Bonferroni inequality, so the joint CIs are referred to as Bonferroni
intervals. The method is easily generalized to yield joint intervals for k different
μ_{Y·x}'s. Using the interval (12.6) separately first for x = x₁*, then for x = x₂*, . . . ,
and finally for x = xₖ* yields a set of k CIs for which the joint or simultaneous confidence level is guaranteed to be at least 100(1 − kα)%.
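A sketch of this Bonferroni adjustment (hypothetical names; it reuses the quantities from (12.6) rather than any particular package): to guarantee a joint level of at least 100(1 − α)% across k intervals, each individual interval is computed at level 100(1 − α/k)%.

```python
import numpy as np
from scipy import stats

def bonferroni_mean_cis(x, y, x_stars, joint_conf=0.90):
    """Joint CIs for the mean response at each x* in x_stars; the Bonferroni
    inequality guarantees simultaneous confidence of at least joint_conf."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, k = len(x), len(x_stars)
    alpha_each = (1 - joint_conf) / k            # split the error rate over the k intervals
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * xbar
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    t_crit = stats.t.ppf(1 - alpha_each / 2, df=n - 2)
    intervals = []
    for x_star in x_stars:
        y_hat = b0 + b1 * x_star
        se = s * np.sqrt(1 / n + (x_star - xbar) ** 2 / Sxx)
        intervals.append((y_hat - t_crit * se, y_hat + t_crit * se))
    return intervals
```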
Tests of hypotheses about β₀ + β₁x* are based on the test statistic T obtained
by replacing β₀ + β₁x* in the numerator of (12.5) by the null value μ₀. For example, H₀: β₀ + β₁(45) = 15 in Example 12.13 says that when carbonation depth
is 45, expected (i.e., true average) strength is 15. The test statistic value is then
t = [β̂₀ + β̂₁(45) − 15]/s_{β̂₀+β̂₁(45)}, and the test is upper-, lower-, or two-tailed according to the inequality in Hₐ.
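Using the quantities already computed in Example 12.13, a quick sketch of this test (illustrative only; the p-value line assumes a two-tailed Hₐ):

```python
from scipy import stats

y_hat, se, df = 13.79, 0.7582, 16        # from Example 12.13
null_value = 15.0
t_stat = (y_hat - null_value) / se       # roughly -1.60
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)
print(t_stat, p_two_tailed)              # p exceeds .05, so H0 is not rejected at level .05
```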
A Prediction Interval for a Future Value of Y
Analogous to the CI (12.6) for μ_{Y·x*}, one frequently wishes to obtain an interval of plausible values for the value of Y associated with some future observation when the independent variable has value x*. For instance, in the example in which vocabulary size y
is related to the age x of a child, for x = 6 years (12.6) would provide a CI for the true
average vocabulary size of all 6-year-old children. Alternatively, we might wish an
interval of plausible values for the vocabulary size of a particular 6-year-old child.
A CI refers to a parameter, or population characteristic, whose value is fixed
but unknown to us. In contrast, a future value of Y is not a parameter but instead a
random variable; for this reason we refer to an interval of plausible values for a future
Y as a prediction interval rather than a confidence interval. The error of estimation
is β₀ + β₁x* − (β̂₀ + β̂₁x*), a difference between a fixed (but unknown) quantity
and a random variable. The error of prediction is Y − (β̂₀ + β̂₁x*), a difference
between two random variables. There is thus more uncertainty in prediction than in
estimation, so a PI will be wider than a CI. Because the future value Y is independent of the observed Yᵢ's,
V[Y − (β̂₀ + β̂₁x*)] = variance of prediction error

$$= V(Y) + V(\hat{\beta}_0 + \hat{\beta}_1 x^*) = \sigma^2 + \sigma^2\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right] = \sigma^2\left[1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]$$
Furthermore, because E(Y) = β₀ + β₁x* and E(β̂₀ + β̂₁x*) = β₀ + β₁x*, the
expected value of the prediction error is E(Y − (β̂₀ + β̂₁x*)) = 0. It can then be
shown that the standardized variable

$$T = \frac{Y - (\hat{\beta}_0 + \hat{\beta}_1 x^*)}{S\sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}}}}$$

has a t distribution with n − 2 df. Substituting this T into the probability statement
P(−t_{α/2,n−2} < T < t_{α/2,n−2}) = 1 − α and manipulating to isolate Y between the two
inequalities yields the following interval.
A 100(1 − α)% PI for a future Y observation to be made when x = x* is

$$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\,n-2} \cdot s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} \qquad (12.7)$$
$$= \hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\,n-2} \cdot \sqrt{s^2 + s^2_{\hat{\beta}_0 + \hat{\beta}_1 x^*}}$$
$$= \hat{y} \pm t_{\alpha/2,\,n-2} \cdot \sqrt{s^2 + s^2_{\hat{Y}}}$$
The interpretation of the prediction level 100(1 − α)% is identical to that of previous
confidence levels: if (12.7) is used repeatedly, in the long run the resulting intervals
will actually contain the observed y values 100(1 − α)% of the time. Notice that the
1 underneath the initial square root symbol makes the PI (12.7) wider than the CI
(12.6), though the intervals are both centered at β̂₀ + β̂₁x*. Also, as n → ∞, the width
of the CI approaches 0, whereas the width of the PI does not (because even with perfect knowledge of β₀ and β₁, there will still be uncertainty in prediction).
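A companion sketch for (12.7), parallel to the CI function shown earlier (hypothetical names, not from the text):

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, x_star, conf=0.95):
    """PI (12.7) for a single future Y to be observed at x = x*."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * xbar
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    y_hat = b0 + b1 * x_star
    se_pred = s * np.sqrt(1 + 1 / n + (x_star - xbar) ** 2 / Sxx)   # extra "1 +" versus the CI
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)
    return y_hat - t_crit * se_pred, y_hat + t_crit * se_pred
```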
Example 12.14
Let’s return to the carbonation depth-strength data of Example 12.13 and calculate a
95% prediction interval for a strength value that would result from selecting a single
core specimen whose carbonation depth is 45 mm. Relevant quantities from that
example are
ŷ = 13.79        s_Ŷ = .7582        s = 2.8640
For a prediction level of 95% based on n − 2 = 16 df, the t critical value is 2.120,
exactly what we previously used for a 95% confidence level. The prediction interval
is then
13.79 ± (2.120)√((2.8640)² + (.7582)²) = 13.79 ± (2.120)(2.963)
= 13.79 ± 6.28 = (7.51, 20.07)
Plausible values for a single observation on strength when depth is 45 mm are (at the
95% prediction level) between 7.51 MPa and 20.07 MPa. The 95% confidence interval for mean strength when depth is 45 was (12.18, 15.40). The prediction interval is
much wider than this because of the extra (2.8640)² under the square root. Figure
12.18, the MINITAB output in Example 12.13, shows this interval as well as the confidence interval.
■
The Bonferroni technique can be employed as in the case of confidence intervals. If a 100(1 − α)% PI is calculated for each of k different values of x, the simultaneous or joint prediction level for all k intervals is at least 100(1 − kα)%.
EXERCISES
Section 12.4 (44–56)
44. Fitting the simple linear regression model to the n = 27
observations on x = modulus of elasticity and y = flexural
strength given in Exercise 15 of Section 12.2 resulted in
ŷ = 7.592, s_Ŷ = .179 when x = 40 and ŷ = 9.741, s_Ŷ = .253
for x = 60.
a. Explain why s_Ŷ is larger when x = 60 than when x = 40.
b. Calculate a confidence interval with a confidence level of
95% for the true average strength of all beams whose
modulus of elasticity is 40.
c. Calculate a prediction interval with a prediction level of
95% for the strength of a single beam whose modulus of
elasticity is 40.
d. If a 95% CI is calculated for true average strength when
modulus of elasticity is 60, what will be the simultaneous
confidence level for both this interval and the interval calculated in part (b)?
45. Reconsider the filtration rate–moisture content data introduced
in Example 12.6 (see also Example 12.7).
a. Compute a 90% CI for β₀ + β₁(125), true average moisture content when the filtration rate is 125.
b. Predict the value of moisture content for a single experimental run in which the filtration rate is 125, using a 90%
prediction level. How does this interval compare to the
interval of part (a)? Why is this the case?
c. How would the intervals of parts (a) and (b) compare to a
CI and PI when filtration rate is 115? Answer without
actually calculating these new intervals.
d. Interpret the hypotheses H₀: β₀ + β₁(125) = 80 and Hₐ:
β₀ + β₁(125) < 80, and then carry out a test at significance
level .01.
46. The article “The Incorporation of Uranium and Silver
by Hydrothermally Synthesized Galena” (Econ. Geology,
1964: 1003–1024) reports on the determination of silver
content of galena crystals grown in a closed hydrothermal
system over a range of temperature. With x = crystallization
temperature in °C and y = Ag₂S in mol%, the data follows:

x | 398  292  352  575  568  450  550  408  484  350  503  600  600
y | .15  .05  .23  .43  .23  .40  .44  .44  .45  .09  .59  .63  .60

from which Σxᵢ = 6130, Σx²ᵢ = 3,022,050, Σyᵢ = 4.73,
Σy²ᵢ = 2.1785, Σxᵢyᵢ = 2418.74, β̂₁ = .00143, β̂₀ = −.311,
and s = .131.
a. Estimate true average silver content when temperature
is 500°C using a 95% confidence interval.
b. How would the width of a 95% CI for true average silver
content when temperature is 400°C compare to the width
of the interval in part (a)? Answer without computing
this new interval.
c. Calculate a 95% CI for the true average change in silver
content associated with a 1°C increase in temperature.
d. Suppose it had previously been believed that when crystallization temperature was 400°C, true average silver content
would be .25. Carry out a test at significance level .05 to
decide whether the sample data contradicts this prior belief.
47. The simple linear regression model provides a very good fit
to the data on rainfall and runoff volume given in Exercise 16
of Section 12.2. The equation of the least squares line is
ŷ = −1.128 + .82697x, r² = .975, and s = 5.24.
a. Use the fact that s_Ŷ = 1.44 when rainfall volume is 40 m³
to predict runoff in a way that conveys information about
reliability and precision. Does the resulting interval suggest
that precise information about the value of runoff for this
future observation is available? Explain your reasoning.
b. Calculate a PI for runoff when rainfall is 50 using the
same prediction level as in part (a). What can be said
about the simultaneous prediction level for the two intervals you have calculated?
48. The catch basin in a storm sewer system is the interface
between surface runoff and the sewer. The catch basin insert
is a device for retrofitting catch basins to improve pollutant
removal properties. The article “An Evaluation of the Urban
Stormwater Pollutant Removal Efficiency of Catch Basin
Inserts” (Water Envir. Res., 2005: 500–510) reported on
tests of various inserts under controlled conditions for
which inflow is close to what can be expected in the field.
Consider the following data, read from a graph in the article, for one particular type of insert, on x = amount filtered
(1000s of liters) and y = % total suspended solids removed.

x |   23    45    68    91   114   136   159   182   205   228
y | 53.3  26.9  54.8  33.8  29.9   8.2  17.2  12.2   3.2  11.1

Summary quantities are
Σxᵢ = 1251, Σx²ᵢ = 199,365, Σyᵢ = 250.6, Σy²ᵢ = 9249.36,
Σxᵢyᵢ = 21,904.4
a. Does a scatter plot support the choice of the simple linear regression model? Explain.
b. Obtain the equation of the least squares line.
c. What proportion of observed variation in % removed can
be attributed to the model relationship?
d. Does the simple linear regression model specify a useful
relationship? Carry out an appropriate test of hypotheses
using a significance level of .05.
e. Is there strong evidence for concluding that there is at least
a 2% decrease in true average suspended solid removal
associated with a 10,000-liter increase in the amount filtered? Test appropriate hypotheses using α = .05.
f. Calculate and interpret a 95% CI for true average %
removed when amount filtered is 100,000 liters. How
does this interval compare in width to a CI when amount
filtered is 200,000 liters?
g. Calculate and interpret a 95% PI for % removed when
amount filtered is 100,000 liters. How does this interval
compare in width to the CI calculated in (f) and to a PI
when amount filtered is 200,000 liters?
49. You are told that a 95% CI for expected lead content when
traffic flow is 15, based on a sample of n = 10 observations,
is (462.1, 597.7). Calculate a CI with confidence level 99%
for expected lead content when traffic flow is 15.
50. Silicon-germanium alloys have been used in certain types of
solar cells. The paper “Silicon-Germanium Films Deposited
by Low-Frequency Plasma-Enhanced Chemical Vapor
Deposition” (J. of Material Res., 2006: 88–104) reported on
a study of various structural and electrical properties.
Consider the accompanying data on x = Ge concentration in
solid phase (ranging from 0 to 1) and y = Fermi level position (eV):

x |   0  .42  .23  .33  .62  .60  .45  .87  .90  .79    1    1    1
y | .62  .53  .61  .59  .50  .55  .59  .31  .43  .46  .23  .22  .19

A scatter plot shows a substantial linear relationship. Here is
MINITAB output from a least squares fit. [Note: There are
several inconsistencies between the data given in the paper,
the plot that appears there, and the summary information
about a regression analysis.]

The regression equation is
Fermi pos = 0.7217 − 0.4327 Ge conc

S = 0.0737573    R-Sq = 80.2%    R-Sq(adj) = 78.4%

Analysis of Variance
Source          DF    SS          MS          F        P
Regression       1    0.241728    0.241728    44.43    0.000
Error           11    0.059842    0.005440
Total           12    0.301569
a. Obtain an interval estimate of the expected change in
Fermi level position associated with an increase of .1 in
Ge concentration, and interpret your estimate.
b. Obtain an interval estimate for mean Fermi level position
when concentration is .50, and interpret your estimate.
c. Obtain an interval of plausible values for position resulting from a single observation to be made when concentration is .50, interpret your interval, and compare to the
interval of (b).
d. Obtain simultaneous CIs for expected position when
concentration is .3, .5, and .7; the joint confidence level
should be at least 97%.
51. Refer to Example 12.12, in which x = volume fraction of
oxides/inclusions and y = % elongation.
a. MINITAB gave s_{β̂₀+β̂₁(.40)} = .0311 and s_{β̂₀+β̂₁(1.20)} = .0352.
Why is the former estimated standard deviation smaller
than the latter one?
b. Use the MINITAB output from the example to calculate
a 95% CI for expected % elongation when volume fraction = .40.
c. Use the MINITAB output to calculate a 95% PI for a single value of % elongation to be observed when volume
fraction = 1.20.
52. Plasma etching is essential to the fine-line pattern transfer
in current semiconductor processes. The article “Ion
Beam-Assisted Etching of Aluminum with Chlorine”
(J. Electrochem. Soc., 1985: 2010– 2012) gives the accompanying data (read from a graph) on chlorine flow (x, in
SCCM) through a nozzle used in the etching mechanism and
etch rate (y, in 100 A/min).
x |  1.5   1.5   2.0   2.5   2.5   3.0   3.5   3.5   4.0
y | 23.0  24.5  25.0  30.0  33.5  40.0  40.5  47.0  49.0

The summary statistics are Σxᵢ = 24.0, Σyᵢ = 312.5, Σx²ᵢ =
70.50, Σxᵢyᵢ = 902.25, Σy²ᵢ = 11,626.75, β̂₀ = 6.448718,
β̂₁ = 10.602564.
a. Does the simple linear regression model specify a useful
relationship between chlorine flow and etch rate?
b. Estimate the true average change in etch rate associated
with a 1-SCCM increase in flow rate using a 95% confidence interval, and interpret the interval.
c. Calculate a 95% CI for μ_{Y·3.0}, the true average etch rate
when flow = 3.0. Has this average been precisely estimated?
d. Calculate a 95% PI for a single future observation on
etch rate to be made when flow = 3.0. Is the prediction
likely to be accurate?
e. Would the 95% CI and PI when flow = 2.5 be wider or
narrower than the corresponding intervals of parts (c) and
(d)? Answer without actually computing the intervals.
f. Would you recommend calculating a 95% PI for a flow of
6.0? Explain.
53. Consider the following four intervals based on the data of
Example 12.4 (Section 12.2):
a. A 95% CI for mean porosity when unit weight is 110
b. A 95% PI for porosity when unit weight is 110
c. A 95% CI for mean porosity when unit weight is 115
d. A 95% PI for porosity when unit weight is 115
Without computing any of these intervals, what can be said
about their widths relative to one another?
54. The decline of water supplies in certain areas of the United
States has created the need for increased understanding of
relationships between economic factors such as crop yield
and hydrologic and soil factors. The article “Variability of
Soil Water Properties and Crop Yield in a Sloped
Watershed” (Water Resources Bull., 1988: 281–288) gives
data on grain sorghum yield (y, in g/m-row) and distance
upslope (x, in m) on a sloping watershed. Selected observations are given in the accompanying table.
x |   0   10   20   30   45   50   70
y | 500  590  410  470  450  480  510

x |  80  100  120  140  160  170  190
y | 450  360  400  300  410  280  350
a. Construct a scatter plot. Does the simple linear regression model appear to be plausible?
b. Carry out a test of model utility.
c. Estimate true average yield when distance upslope is
75 by giving an interval of plausible values.
55. Verify that V(β̂₀ + β̂₁x) is indeed given by the expression in
the text. [Hint: V(Σdᵢ Yᵢ) = Σdᵢ² · V(Yᵢ).]
56. The article “Bone Density and Insertion Torque as
Predictors of Anterior Cruciate Ligament Graft Fixation
Strength” (The Amer. J. of Sports Med., 2004: 1421–1429)
gave the accompanying data on maximum insertion torque
(N·m) and yield load (N), the latter being one measure of
graft strength, for 15 different specimens.

Torque | 1.8  2.2  1.9  1.3  2.1  2.2  1.6
Load   | 491  477  598  361  605  671  466

Torque | 1.2  1.8  2.6  2.5  2.5  1.7  1.6  2.1
Load   | 384  422  554  577  642  348  446  431
a. Is it plausible that yield load is normally distributed?
b. Estimate true average yield load by calculating a confidence interval with a confidence level of 95%, and interpret the interval.
c. Here is output from MINITAB for the regression of yield
load on torque. Does the simple linear regression model
specify a useful relationship between the variables?
Predictor    Coef      SE Coef    T       P
Constant     152.44    91.17      1.67    0.118
Torque       178.23    45.97      3.88    0.002

S = 73.2141    R-Sq = 53.6%    R-Sq(adj) = 50.0%

Source            DF    SS        MS       F        P
Regression         1     80554    80554    15.03    0.002
Residual Error    13     69684     5360
Total             14    150238
d. The authors of the cited paper state, “Consequently, we
cannot but conclude that simple regression analysis-based methods are not clinically sufficient to predict
individual fixation strength.” Do you agree? [Hint:
Consider predicting yield load when torque is 2.0.]
12.5 Correlation
There are many situations in which the objective in studying the joint behavior of
two variables is to see whether they are related, rather than to use one to predict the
value of the other. In this section, we first develop the sample correlation coefficient
r as a measure of how strongly related two variables x and y are in a sample and then
relate r to the correlation coefficient ρ defined in Chapter 5.
The Sample Correlation Coefficient r
Given n pairs of observations (x1, y1), (x2, y2), . . . , (xn, yn), it is natural to speak of x and
y having a positive relationship if large x’s are paired with large y’s and small x’s with
small y’s. Similarly, if large x’s are paired with small y’s and small x’s with large y’s,
then a negative relationship between the variables is implied. Consider the quantity
$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
Then if the relationship is strongly positive, an xᵢ above the mean x̄ will tend to be
paired with a yᵢ above the mean ȳ, so that (xᵢ − x̄)(yᵢ − ȳ) > 0, and this product will
also be positive whenever both xᵢ and yᵢ are below their respective means. Thus a positive relationship implies that Sxy will be positive. An analogous argument shows that
when the relationship is negative, Sxy will be negative, since most of the products
(xᵢ − x̄)(yᵢ − ȳ) will be negative. This is illustrated in Figure 12.19.
Although Sxy seems a plausible measure of the strength of a relationship, we
do not yet have any idea of how positive or negative it can be. Unfortunately, Sxy has
a serious defect: By changing the units of measurement of either x or y, Sxy can be
made either arbitrarily large in magnitude or arbitrarily close to zero. For example,
if Sxy = 25 when x is measured in meters, then Sxy = 25,000 when x is measured in
millimeters and .025 when x is expressed in kilometers. A reasonable condition to
Figure 12.19   (a) Scatter plot with Sxy positive; (b) scatter plot with Sxy negative
[+ means (xᵢ − x̄)(yᵢ − ȳ) > 0, and − means (xᵢ − x̄)(yᵢ − ȳ) < 0]
impose on any measure of how strongly x and y are related is that the calculated
measure should not depend on the particular units used to measure them. This condition is achieved by modifying Sxy to obtain the sample correlation coefficient.
DEFINITION

The sample correlation coefficient for the n pairs (x₁, y₁), . . . , (xₙ, yₙ) is

$$r = \frac{S_{xy}}{\sqrt{\sum(x_i - \bar{x})^2}\,\sqrt{\sum(y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}} \qquad (12.8)$$

Example 12.15
An accurate assessment of soil productivity is critical to rational land-use planning. Unfortunately, as the author of the article “Productivity Ratings Based on
Soil Series” (Prof. Geographer, 1980: 158–163) argues, an acceptable soil productivity index is not so easy to come by. One difficulty is that productivity is determined partly by which crop is planted, and the relationship between yield of two
different crops planted in the same soil may not be very strong. To illustrate, the
article presents the accompanying data on corn yield x and peanut yield y (mT/Ha)
for eight different types of soil.
x |  2.4   3.4   4.6   3.7   2.2   3.3   4.0   2.1
y | 1.33  2.12  1.80  1.65  2.00  1.76  2.11  1.63
With Σxᵢ = 25.7, Σyᵢ = 14.40, Σx²ᵢ = 88.31, Σxᵢyᵢ = 46.856, and Σy²ᵢ = 26.4324,

$$S_{xx} = 88.31 - \frac{(25.7)^2}{8} = 88.31 - 82.56 = 5.75$$
$$S_{yy} = 26.4324 - \frac{(14.40)^2}{8} = .5124$$
$$S_{xy} = 46.856 - \frac{(25.7)(14.40)}{8} = .5960$$

from which

$$r = \frac{.5960}{\sqrt{5.75}\,\sqrt{.5124}} = .347$$
■
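A quick numerical check of Example 12.15 (a sketch; np.corrcoef is used only because it implements the same definition (12.8)):

```python
import numpy as np

corn = np.array([2.4, 3.4, 4.6, 3.7, 2.2, 3.3, 4.0, 2.1])             # corn yield x
peanut = np.array([1.33, 2.12, 1.80, 1.65, 2.00, 1.76, 2.11, 1.63])   # peanut yield y

Sxy = np.sum((corn - corn.mean()) * (peanut - peanut.mean()))
r = Sxy / np.sqrt(np.sum((corn - corn.mean()) ** 2) * np.sum((peanut - peanut.mean()) ** 2))
print(round(r, 3))                                  # 0.347, matching the hand calculation
print(round(np.corrcoef(corn, peanut)[0, 1], 3))    # same value from NumPy
```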
Properties of r
The most important properties of r are as follows:
1. The value of r does not depend on which of the two variables under study is
labeled x and which is labeled y.
2. The value of r is independent of the units in which x and y are measured.
3. −1 ≤ r ≤ 1
4. r = 1 if and only if (iff) all (xᵢ, yᵢ) pairs lie on a straight line with positive slope,
and r = −1 iff all (xᵢ, yᵢ) pairs lie on a straight line with negative slope.
5. The square of the sample correlation coefficient gives the value of the coefficient
of determination that would result from fitting the simple linear regression
model; in symbols, (r)² = r².
Property 1 stands in marked contrast to what happens in regression analysis,
where virtually all quantities of interest (the estimated slope, estimated y-intercept,
s², etc.) depend on which of the two variables is treated as the dependent variable.
However, Property 5 shows that the proportion of variation in the dependent variable
explained by fitting the simple linear regression model does not depend on which
variable plays this role.
Property 2 is equivalent to saying that r is unchanged if each xᵢ is replaced by
cxᵢ and if each yᵢ is replaced by dyᵢ (a change in the scale of measurement), as well
as if each xᵢ is replaced by xᵢ − a and yᵢ by yᵢ − b (which changes the location of zero
on the measurement axis). This implies, for example, that r is the same whether temperature is measured in °F or °C.
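Property 2 is easy to verify numerically; the following sketch (the temperature and response values are invented purely for illustration) converts x from °C to °F and confirms that r is unchanged:

```python
import numpy as np

temp_c = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])   # toy x values in degrees C
response = np.array([2.1, 2.8, 3.5, 3.9, 4.6, 5.3])       # toy y values

temp_f = 9 / 5 * temp_c + 32             # change of scale and of the location of zero
r_c = np.corrcoef(temp_c, response)[0, 1]
r_f = np.corrcoef(temp_f, response)[0, 1]
print(r_c, r_f, np.isclose(r_c, r_f))    # the two correlations agree
```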
Property 3 tells us that the maximum value of r, corresponding to the largest
possible degree of positive relationship, is r = 1, whereas the most negative relationship is identified with r = −1. According to Property 4, the largest positive and
largest negative correlations are achieved only when all points lie along a straight line.
Any other configuration of points, even if the configuration suggests a deterministic
relationship between variables, will yield an r value less than 1 in absolute magnitude.
Thus r measures the degree of linear relationship among variables. A value of r near
0 is not evidence of the lack of a strong relationship, but only the absence of a linear
relation, so such a value of r must be interpreted with caution. Figure 12.20 illustrates several configurations of points associated with different values of r.
Figure 12.20   Data plots for different values of r: (a) r near +1; (b) r near −1;
(c) r near 0, no apparent relationship; (d) r near 0, nonlinear relationship
A frequently asked question is, “When can it be said that there is a strong correlation between the variables, and when is the correlation weak?” A reasonable rule
of thumb is to say that the correlation is weak if 0 ≤ |r| ≤ .5, strong if .8 ≤ |r| ≤ 1,
and moderate otherwise. It may surprise you that r = .5 is considered weak, but
r² = .25 implies that in a regression of y on x, only 25% of observed y variation
would be explained by the model. In Example 12.15, the correlation between corn
yield and peanut yield would be described as weak.
The Population Correlation Coefficient ρ
and Inferences About Correlation
The correlation coefficient r is a measure of how strongly related x and y are in the
observed sample. We can think of the pairs (xi, yi) as having been drawn from a
bivariate population of pairs, with (Xi, Yi) having joint probability distribution f(x, y).
In Chapter 5, we defined the correlation coefficient ρ(X, Y) by

$$\rho = \rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}$$
where

$$\operatorname{Cov}(X, Y) = \begin{cases} \displaystyle\sum_{x}\sum_{y}(x - \mu_X)(y - \mu_Y)\,p(x, y) & (X, Y)\ \text{discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}(x - \mu_X)(y - \mu_Y)\,f(x, y)\,dx\,dy & (X, Y)\ \text{continuous} \end{cases}$$
If we think of f(x, y) as describing the distribution of pairs of values within the entire
population, ρ becomes a measure of how strongly related x and y are in that population. Properties of ρ analogous to those for r were given in Chapter 5.
The population correlation coefficient ρ is a parameter or population characteristic, just as μ_X, μ_Y, σ_X, and σ_Y are, so we can use the sample correlation coefficient to make various inferences about ρ. In particular, r is a point estimate for ρ, and
the corresponding estimator is

$$\hat{\rho} = R = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum(X_i - \bar{X})^2}\,\sqrt{\sum(Y_i - \bar{Y})^2}}$$
Example 12.16
In some locations, there is a strong association between concentrations of two different pollutants. The article “The Carbon Component of the Los Angeles Aerosol:
Source Apportionment and Contributions to the Visibility Budget” (J. Air Pollution
Control Fed., 1984: 643–650) reports the accompanying data on ozone concentration x (ppm) and secondary carbon concentration y (μg/m³).
x | .066  .088  .120  .050  .162  .186  .057  .100
y |  4.6  11.6   9.5   6.3  13.8  15.4   2.5  11.8

x | .112  .055  .154  .074  .111  .140  .071  .110
y |  8.0   7.0  20.6  16.6   9.2  17.9   2.8  13.0
The summary quantities are n = 16, Σxᵢ = 1.656, Σyᵢ = 170.6, Σx²ᵢ = .196912,
Σxᵢyᵢ = 20.0397, and Σy²ᵢ = 2253.56, from which

$$r = \frac{20.0397 - (1.656)(170.6)/16}{\sqrt{.196912 - (1.656)^2/16}\,\sqrt{2253.56 - (170.6)^2/16}} = \frac{2.3826}{(.1597)(20.8456)} = .716$$