19. DIGRESSION ABOUT CORRELATION COEFFICIENTS
Problem 254. Given constant scalars a \ne 0 and c \ne 0, with b and d arbitrary, show that corr[x, y] = \pm corr[ax + b, cy + d], with the + sign valid if a and c have the same sign, and the − sign otherwise.
Answer. Start with cov[ax + b, cy + d] = ac cov[x, y] and go from there.
Besides the simple correlation coefficient \rho_{xy} between two scalar variables y and x, one can also define the squared multiple correlation coefficient \rho_{y(x)}^2 between one scalar variable y and a whole vector of variables x, and the partial correlation coefficient \rho_{12.x} between two scalar variables y_1 and y_2, with a vector of other variables
x “partialled out.” The multiple correlation coefficient measures the strength of
x “partialled out.” The multiple correlation coefficient measures the strength of
a linear association between y and all components of x together, and the partial
correlation coefficient measures the strength of that part of the linear association
between y 1 and y 2 which cannot be attributed to their joint association with x. One
can also define partial multiple correlation coefficients. If one wants to measure the
linear association between two vectors, then one number is no longer enough, but
one needs several numbers, the “canonical correlations.”
19.1. A Unified Definition of Correlation Coefficients

The multiple or partial correlation coefficients are usually defined as simple correlation coefficients involving the best linear predictor or its residual. But all these correlation coefficients share the property that they indicate a proportionate reduction in the MSE. See e.g. [Rao73, pp. 268–70]. Problem 255 makes this point for the simple correlation coefficient:
Problem 255. 4 points Show that the proportionate reduction in the MSE of the best predictor of y, if one goes from predictors of the form y^* = a to predictors of the form y^* = a + bx, is equal to the squared correlation coefficient between y and x. You are allowed to use the results of Problems 229 and 240. To set notation, call the minimum MSE in the first prediction (Problem 229) MSE[constant term; y], and the minimum MSE in the second prediction (Problem 240) MSE[constant term and x; y]. Show that

(19.1.2)  \frac{MSE[\text{constant term}; y] - MSE[\text{constant term and } x; y]}{MSE[\text{constant term}; y]} = \frac{(\operatorname{cov}[y,x])^2}{\operatorname{var}[y]\operatorname{var}[x]} = \rho_{yx}^2.
Answer. The minimum MSE with only a constant is var[y] and (18.2.32) says that MSE[constant
term and x; y] = var[y]−(cov[x, y])2 / var[x]. Therefore the difference in MSE’s is (cov[x, y])2 / var[x],
and if one divides by var[y] to get the relative difference, one gets exactly the squared correlation
coefficient.
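This answer can be spot-checked numerically. The moment values below are hypothetical (any choice giving a valid covariance matrix works); the check is a sketch of the identity, not part of the original text:

```python
import numpy as np

# Hypothetical second moments for (x, y); any choice with
# cov_xy^2 < var_x * var_y (a valid covariance matrix) works.
var_x, var_y, cov_xy = 2.0, 3.0, 1.5

mse_const = var_y                          # best constant predictor: MSE = var[y]
mse_const_x = var_y - cov_xy**2 / var_x    # best predictor a + bx, per (18.2.32)

reduction = (mse_const - mse_const_x) / mse_const
rho_sq = cov_xy**2 / (var_x * var_y)       # squared correlation coefficient

print(np.isclose(reduction, rho_sq))       # True
```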
Multiple Correlation Coefficients. Now assume x is a vector while y remains a scalar. Their joint mean vector and dispersion matrix are

(19.1.3)  \begin{bmatrix} x \\ y \end{bmatrix} \sim \left( \begin{bmatrix} \mu \\ \nu \end{bmatrix},\; \sigma^2 \begin{bmatrix} \Omega_{xx} & \omega_{xy} \\ \omega_{xy}^\top & \omega_{yy} \end{bmatrix} \right).
By theorem ??, the best linear predictor of y based on x has the formula

(19.1.4)  y^* = \nu + \omega_{xy}^\top \Omega_{xx}^- (x - \mu).
y^* has the following additional extremal value property: no linear combination b^\top x has a higher squared correlation with y than y^*. This maximal value of the squared correlation is called the squared multiple correlation coefficient

(19.1.5)  \rho_{y(x)}^2 = \frac{\omega_{xy}^\top \Omega_{xx}^- \omega_{xy}}{\omega_{yy}}.
The multiple correlation coefficient itself is the positive square root, i.e., it is always
nonnegative, while some other correlation coefficients may take on negative values.
The squared multiple correlation coefficient can also be defined in terms of the proportionate reduction in MSE. It is equal to the proportionate reduction in the MSE of
the best predictor of y if one goes from predictors of the form y ∗ = a to predictors
of the form y^* = a + b^\top x, i.e.,

(19.1.6)  \rho_{y(x)}^2 = \frac{MSE[\text{constant term}; y] - MSE[\text{constant term and } x; y]}{MSE[\text{constant term}; y]}.
There are therefore two natural definitions of the multiple correlation coefficient.
These two definitions correspond to the two formulas for R2 in (18.3.6).
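The extremal property behind these definitions can be illustrated numerically. The dispersion blocks below are hypothetical; the check confirms that no linear combination b^\top x has a squared correlation with y exceeding \rho_{y(x)}^2 from (19.1.5), and that the maximum is attained at b = \Omega_{xx}^- \omega_{xy}:

```python
import numpy as np

# Hypothetical dispersion of (x, y), with x a 2-vector.
Omega_xx = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
omega_xy = np.array([0.8, 0.3])
omega_yy = 1.5

# Squared multiple correlation coefficient, equation (19.1.5)
rho_sq = omega_xy @ np.linalg.solve(Omega_xx, omega_xy) / omega_yy

def corr_sq(b):
    """Squared correlation between b'x and y."""
    return (b @ omega_xy) ** 2 / ((b @ Omega_xx @ b) * omega_yy)

rng = np.random.default_rng(0)
bs = rng.standard_normal((1000, 2))
print(all(corr_sq(b) <= rho_sq + 1e-12 for b in bs))                     # True
print(np.isclose(corr_sq(np.linalg.solve(Omega_xx, omega_xy)), rho_sq))  # True
```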
Partial Correlation Coefficients. Now assume y = \begin{bmatrix} y_1 & y_2 \end{bmatrix}^\top is a vector with two elements and write

(19.1.7)  \begin{bmatrix} x \\ y_1 \\ y_2 \end{bmatrix} \sim \left( \begin{bmatrix} \mu \\ \nu_1 \\ \nu_2 \end{bmatrix},\; \sigma^2 \begin{bmatrix} \Omega_{xx} & \omega_{y1} & \omega_{y2} \\ \omega_{y1}^\top & \omega_{11} & \omega_{12} \\ \omega_{y2}^\top & \omega_{21} & \omega_{22} \end{bmatrix} \right).
Let y^* be the best linear predictor of y based on x. The partial correlation coefficient \rho_{12.x} is defined to be the simple correlation between the residuals, corr[(y_1 - y_1^*), (y_2 - y_2^*)]. This measures the correlation between y_1 and y_2 which is “local,” i.e., which does not follow from their association with x. Assume for instance that both y_1 and y_2 are highly correlated with x. Then they will also have a high correlation with each other. Subtracting y_i^* from y_i eliminates this dependency on x, therefore any remaining correlation is “local.” Compare [Krz88, p. 475].
The partial correlation coefficient can also be defined as the relative reduction in the MSE if one adds y_1 to x as a predictor of y_2:

(19.1.8)  \rho_{12.x}^2 = \frac{MSE[\text{constant term and } x; y_2] - MSE[\text{constant term, } x, \text{ and } y_1; y_2]}{MSE[\text{constant term and } x; y_2]}.
Problem 256. Using the definitions in terms of MSEs, show that the following relationship holds between the squares of multiple and partial correlation coefficients:

(19.1.9)  1 - \rho_{2(x,1)}^2 = (1 - \rho_{21.x}^2)(1 - \rho_{2(x)}^2)

Answer. In terms of the MSE, (19.1.9) reads

(19.1.10)  \frac{MSE[\text{constant term, } x, \text{ and } y_1; y_2]}{MSE[\text{constant term}; y_2]} = \frac{MSE[\text{constant term, } x, \text{ and } y_1; y_2]}{MSE[\text{constant term and } x; y_2]} \cdot \frac{MSE[\text{constant term and } x; y_2]}{MSE[\text{constant term}; y_2]}
From (19.1.9) follows the following weighted average formula:

(19.1.11)  \rho_{2(x,1)}^2 = \rho_{2(x)}^2 + (1 - \rho_{2(x)}^2)\,\rho_{21.x}^2
An alternative proof of (19.1.11) is given in [Gra76, pp. 116/17].
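Both (19.1.9) and the weighted average formula (19.1.11) can be checked numerically from the MSE definitions. The joint dispersion matrix of (x, y_1, y_2) below is hypothetical, with x a scalar for simplicity:

```python
import numpy as np

# Hypothetical dispersion of (x, y1, y2), in that order;
# any positive definite choice works.
S = np.array([[2.0, 0.8, 0.6],
              [0.8, 1.5, 0.7],
              [0.6, 0.7, 1.2]])

def mse(pred, target):
    """Minimum MSE of the best linear predictor of variable `target`
    from a constant term plus the variables listed in `pred`."""
    if not pred:
        return S[target, target]
    Sp = S[np.ix_(pred, pred)]
    sp = S[pred, target]
    return S[target, target] - sp @ np.linalg.solve(Sp, sp)

rho2_2x  = (mse([], 2) - mse([0], 2)) / mse([], 2)        # rho^2_{2(x)}
rho2_2x1 = (mse([], 2) - mse([0, 1], 2)) / mse([], 2)     # rho^2_{2(x,1)}
rho2_21x = (mse([0], 2) - mse([0, 1], 2)) / mse([0], 2)   # rho^2_{21.x}

print(np.isclose(1 - rho2_2x1, (1 - rho2_21x) * (1 - rho2_2x)))  # True: (19.1.9)
print(np.isclose(rho2_2x1, rho2_2x + (1 - rho2_2x) * rho2_21x))  # True: (19.1.11)
```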
Mixed cases: One can also form multiple correlation coefficients with some of the variables partialled out. The dot notation used here is due to Yule, [Yul07]. The notation, definition, and formula for the squared correlation coefficient are

(19.1.12)  \rho_{y(x).z}^2 = \frac{MSE[\text{constant term and } z; y] - MSE[\text{constant term, } z, \text{ and } x; y]}{MSE[\text{constant term and } z; y]}

(19.1.13)  = \frac{\omega_{xy.z}^\top \Omega_{xx.z}^- \omega_{xy.z}}{\omega_{yy.z}}
19.2. Correlation Coefficients and the Associated Least Squares Problem
One can also define the correlation coefficients as proportionate reductions in the objective functions of the associated GLS problems. However, one must reverse predictor and predictand, i.e., one must look at predictions of a vector x by linear functions of a scalar y.
Here it is done for multiple correlation coefficients: The value of the GLS objective function if one predicts x by the best linear predictor x^*, which is the minimum attainable when the scalar observation y is given and the vector x can be chosen freely, as long as it satisfies the constraint x = \mu + \Omega_{xx} q for some q, is

(19.2.1)  SSE[y; \text{best } x] = \min_{x \text{ s.t.} \dots} \begin{bmatrix} x - \mu \\ y - \nu \end{bmatrix}^\top \begin{bmatrix} \Omega_{xx} & \omega_{xy} \\ \omega_{xy}^\top & \omega_{yy} \end{bmatrix}^- \begin{bmatrix} x - \mu \\ y - \nu \end{bmatrix} = (y - \nu)\,\omega_{yy}^-\,(y - \nu)
On the other hand, the value of the GLS objective function when one predicts x by the best constant x = \mu is

(19.2.2)  SSE[y; x = \mu] = \begin{bmatrix} o \\ y - \nu \end{bmatrix}^\top \begin{bmatrix} \Omega_{xx}^- + \Omega_{xx}^- \omega_{xy} \omega_{yy.x}^- \omega_{xy}^\top \Omega_{xx}^- & -\Omega_{xx}^- \omega_{xy} \omega_{yy.x}^- \\ -\omega_{yy.x}^- \omega_{xy}^\top \Omega_{xx}^- & \omega_{yy.x}^- \end{bmatrix} \begin{bmatrix} o \\ y - \nu \end{bmatrix}

(19.2.3)  = (y - \nu)\,\omega_{yy.x}^-\,(y - \nu).
The proportionate reduction in the objective function is

(19.2.4)  \frac{SSE[y; x = \mu] - SSE[y; \text{best } x]}{SSE[y; x = \mu]} = \frac{(y - \nu)^2/\omega_{yy.x} - (y - \nu)^2/\omega_{yy}}{(y - \nu)^2/\omega_{yy.x}}

(19.2.5)  = \frac{\omega_{yy} - \omega_{yy.x}}{\omega_{yy}} = 1 - \frac{\omega_{yy.x}}{\omega_{yy}} = \rho_{y(x)}^2
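This chain rests on two facts that can be cross-checked numerically with hypothetical dispersion blocks: the lower-right block of the inverse of the joint dispersion matrix in (19.2.2) equals 1/\omega_{yy.x}, and (\omega_{yy} - \omega_{yy.x})/\omega_{yy} coincides with \rho_{y(x)}^2 from (19.1.5):

```python
import numpy as np

# Hypothetical dispersion blocks for (x, y), with x a 2-vector.
Omega_xx = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
omega_xy = np.array([0.8, 0.3])
omega_yy = 1.5

Omega = np.block([[Omega_xx, omega_xy[:, None]],
                  [omega_xy[None, :], np.array([[omega_yy]])]])

# omega_yy.x = omega_yy - omega_xy' Omega_xx^{-1} omega_xy
omega_yy_x = omega_yy - omega_xy @ np.linalg.solve(Omega_xx, omega_xy)

# Lower-right entry of the partitioned inverse, as used in (19.2.2)
print(np.isclose(np.linalg.inv(Omega)[-1, -1], 1 / omega_yy_x))  # True

# The ratio in (19.2.5) equals the squared multiple correlation (19.1.5)
rho_sq = omega_xy @ np.linalg.solve(Omega_xx, omega_xy) / omega_yy
print(np.isclose((omega_yy - omega_yy_x) / omega_yy, rho_sq))    # True
```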
19.3. Canonical Correlations
Now what happens with the correlation coefficients if both predictor and predictand are vectors? In this case one has more than one correlation coefficient. One first
finds those two linear combinations of the two vectors which have highest correlation,
then those which are uncorrelated with the first and have second highest correlation,
and so on. Here is the mathematical construction needed:
Let x and y be two column vectors consisting of p and q scalar random variables,
respectively, and let
(19.3.1)  V\begin{bmatrix} x \\ y \end{bmatrix} = \sigma^2 \begin{bmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{bmatrix},
where Ω xx and Ω yy are nonsingular, and let r be the rank of Ω xy . Then there exist
two separate transformations
(19.3.2)  u = Lx, \quad v = My
such that
(19.3.3)  V\begin{bmatrix} u \\ v \end{bmatrix} = \sigma^2 \begin{bmatrix} I_p & \Lambda \\ \Lambda^\top & I_q \end{bmatrix}
where Λ is a (usually rectangular) diagonal matrix with only r diagonal elements
positive, and the others zero, and where these diagonal elements are sorted in descending order.
Proof: One obtains the matrix \Lambda by a singular value decomposition of \Omega_{xx}^{-1/2} \Omega_{xy} \Omega_{yy}^{-1/2} = A, say. Let A = P^\top \Lambda Q be its singular value decomposition with fully orthogonal matrices, as in equation (A.9.8). Define L = P \Omega_{xx}^{-1/2} and M = Q \Omega_{yy}^{-1/2}. Therefore L \Omega_{xx} L^\top = I, M \Omega_{yy} M^\top = I, and L \Omega_{xy} M^\top = P \Omega_{xx}^{-1/2} \Omega_{xy} \Omega_{yy}^{-1/2} Q^\top = P A Q^\top = \Lambda.
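The construction in this proof can be sketched numerically. The joint dispersion matrix below is hypothetical (a sample covariance of random data, with \sigma^2 = 1), \Omega^{-1/2} is computed by eigendecomposition, and numpy's SVD returns A = U\,\mathrm{diag}(\lambda)\,V^\top, which corresponds to P = U^\top and Q = V^\top in the notation above:

```python
import numpy as np

rng = np.random.default_rng(42)
p, q = 3, 2

# Hypothetical joint dispersion of (x, y): sample covariance of
# correlated random data, positive definite with probability one.
Z = rng.standard_normal((200, p + q)) @ rng.standard_normal((p + q, p + q))
S = np.cov(Z.T)
Oxx, Oxy, Oyy = S[:p, :p], S[:p, p:], S[p:, p:]

def inv_sqrt(X):
    """Symmetric inverse square root via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return V @ np.diag(w ** -0.5) @ V.T

A = inv_sqrt(Oxx) @ Oxy @ inv_sqrt(Oyy)
U, lam, Vt = np.linalg.svd(A)       # A = U diag(lam) Vt, lam sorted descending
L = U.T @ inv_sqrt(Oxx)             # u = Lx
M = Vt @ inv_sqrt(Oyy)              # v = My

Lam = np.zeros((p, q))              # rectangular diagonal matrix Lambda
Lam[:q, :q] = np.diag(lam)

print(np.allclose(L @ Oxx @ L.T, np.eye(p)))   # True
print(np.allclose(M @ Oyy @ M.T, np.eye(q)))   # True
print(np.allclose(L @ Oxy @ M.T, Lam))         # True: canonical correlations
```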
The next problems show how one gets from this the maximization property of
the canonical correlation coefficients:
Problem 257. Show that for every p-vector l and q-vector m,

(19.3.4)  corr(l^\top x, m^\top y) \le \lambda_1

where \lambda_1 is the first (and therefore biggest) diagonal element of \Lambda. Equality in (19.3.4) holds if l = l_1, the first row in L, and m = m_1, the first row in M.

Answer: If l or m is the null vector, then there is nothing to prove. If neither of them is a null vector, then one can, without loss of generality, multiply them with appropriate scalars so that p = (L^{-1})^\top l and q = (M^{-1})^\top m satisfy p^\top p = 1 and
q^\top q = 1. Then

(19.3.5)  V\begin{bmatrix} l^\top x \\ m^\top y \end{bmatrix} = V\begin{bmatrix} p^\top L x \\ q^\top M y \end{bmatrix} = V\left( \begin{bmatrix} p^\top & o^\top \\ o^\top & q^\top \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \right) = \sigma^2 \begin{bmatrix} p^\top & o^\top \\ o^\top & q^\top \end{bmatrix} \begin{bmatrix} I_p & \Lambda \\ \Lambda^\top & I_q \end{bmatrix} \begin{bmatrix} p & o \\ o & q \end{bmatrix} = \sigma^2 \begin{bmatrix} 1 & p^\top \Lambda q \\ q^\top \Lambda^\top p & 1 \end{bmatrix}

Since the matrix at the righthand side has ones in the diagonal, it is the correlation matrix, i.e., p^\top \Lambda q = corr(l^\top x, m^\top y). Therefore (19.3.4) follows from Problem 258.
Problem 258. If \sum_i p_i^2 = \sum_i q_i^2 = 1, and \lambda_i \ge 0, show that |\sum_i p_i \lambda_i q_i| \le \max_i \lambda_i. Hint: first get an upper bound for |\sum_i p_i \lambda_i q_i| through a Cauchy–Schwarz-type argument.

Answer. \left(\sum_i p_i \lambda_i q_i\right)^2 \le \left(\sum_i p_i^2 \lambda_i\right)\left(\sum_i q_i^2 \lambda_i\right) \le (\max_i \lambda_i)^2.
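The bound in this problem can be spot-checked by Monte Carlo, with hypothetical nonnegative \lambda_i and random unit vectors p and q:

```python
import numpy as np

rng = np.random.default_rng(7)
lam = rng.uniform(0.0, 1.0, size=5)   # hypothetical nonnegative lambda_i

ok = True
for _ in range(1000):
    p = rng.standard_normal(5); p /= np.linalg.norm(p)   # sum p_i^2 = 1
    q = rng.standard_normal(5); q /= np.linalg.norm(q)   # sum q_i^2 = 1
    ok &= abs(np.sum(p * lam * q)) <= lam.max() + 1e-12

print(ok)  # True
```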
Problem 259. Show that for every p-vector l and q-vector m such that l^\top x is uncorrelated with l_1^\top x, and m^\top y is uncorrelated with m_1^\top y,

(19.3.6)  corr(l^\top x, m^\top y) \le \lambda_2

where \lambda_2 is the second diagonal element of \Lambda. Equality in (19.3.6) holds if l = l_2, the second row in L, and m = m_2, the second row in M.
Answer. If l or m is the null vector, then there is nothing to prove. If neither of them is a
null vector, then one can, without loss of generality, multiply them with appropriate scalars so that
p = (L^{-1})^\top l and q = (M^{-1})^\top m satisfy p^\top p = 1 and q^\top q = 1. Now write e_1 for the first unit vector, which has a 1 as first component and zeros everywhere else:

(19.3.7)  cov[l^\top x, l_1^\top x] = cov[p^\top L x, e_1^\top L x] = \sigma^2 p^\top e_1 = \sigma^2 p_1.

This covariance is zero iff p_1 = 0; in the same way, cov[m^\top y, m_1^\top y] = \sigma^2 q_1 is zero iff q_1 = 0. Furthermore one also needs the following, directly from the proof of Problem 257:

(19.3.8)  V\begin{bmatrix} l^\top x \\ m^\top y \end{bmatrix} = V\begin{bmatrix} p^\top L x \\ q^\top M y \end{bmatrix} = V\left( \begin{bmatrix} p^\top & o^\top \\ o^\top & q^\top \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \right) = \sigma^2 \begin{bmatrix} p^\top & o^\top \\ o^\top & q^\top \end{bmatrix} \begin{bmatrix} I_p & \Lambda \\ \Lambda^\top & I_q \end{bmatrix} \begin{bmatrix} p & o \\ o & q \end{bmatrix} = \sigma^2 \begin{bmatrix} p^\top p & p^\top \Lambda q \\ q^\top \Lambda^\top p & q^\top q \end{bmatrix}

Since the matrix at the righthand side has ones in the diagonal, it is the correlation matrix, i.e., p^\top \Lambda q = corr(l^\top x, m^\top y). Equation (19.3.6) follows from Problem 258 if one lets the subscript i start at 2 instead of 1.
Problem 260. (Not eligible for in-class exams) Extra credit question for good
mathematicians: Reformulate the above treatment of canonical correlations without
the assumption that Ω xx and Ω yy are nonsingular.
19.4. Some Remarks about the Sample Partial Correlation Coefficients
The definition of the partial sample correlation coefficients is analogous to that of the partial population correlation coefficients: Given two data vectors y and z, and the matrix X (which includes a constant term), let M = I - X(X^\top X)^{-1} X^\top be