Consistency means that the probability limit of the estimates is the true value. For $\hat{\beta}$ this can be written as $\operatorname{plim}_{n\to\infty}\hat{\beta}_n = \beta$. This means by definition that for every $\varepsilon > 0$, $\lim_{n\to\infty}\Pr[|\hat{\beta}_n - \beta| \le \varepsilon] = 1$.
The probability limit is one of several concepts of limits used in probability theory. We will need the following properties of the plim here:

(1) For nonrandom magnitudes, the probability limit is equal to the ordinary limit.

(2) It satisfies the Slutsky theorem: for a continuous function $g$,

(26.0.13) $\operatorname{plim} g(z) = g(\operatorname{plim}(z)).$

(3) If the MSE-matrix of an estimator converges towards the null matrix, then the estimator is consistent.

(4) Khinchine's theorem: the sample mean of an i.i.d. sequence is a consistent estimate of the population mean, even if the distribution does not have a population variance.
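The following small R simulation (a sketch with an invented example, not part of the argument) illustrates Khinchine's theorem: the $t$-distribution with 2 degrees of freedom has a finite mean but no finite variance, yet its sample mean still settles down near the population mean 0 as $n$ grows.

# Sketch: Khinchine's theorem.  The t(2) distribution has mean 0 but no
# finite variance; its sample mean is nevertheless consistent.
set.seed(1)
for (n in c(100, 10000, 1000000)) {
  cat("n =", n, "  sample mean =", round(mean(rt(n, df = 2)), 4), "\n")
}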
26.1. Consistency of the OLS estimator
For the proof of consistency of the OLS estimators $\hat{\beta}$ and of $s^2$ we need the following result:

(26.1.1) $\operatorname{plim} \frac{1}{n} X^\top\varepsilon = o.$

I.e., the true $\varepsilon$ is asymptotically orthogonal to all columns of $X$. This follows immediately from $\operatorname{MSE}[o; X^\top\varepsilon/n] = \operatorname{E}[X^\top\varepsilon\varepsilon^\top X/n^2] = \sigma^2 X^\top X/n^2$, which converges towards $O$.
In order to prove consistency of $\hat{\beta}$ and $s^2$, transform the formulas for $\hat{\beta}$ and $s^2$ in such a way that they are written as continuous functions of terms each of which converges for $n \to \infty$, and then apply Slutsky's theorem. Write $\hat{\beta}$ as

(26.1.2) $\hat{\beta} = \beta + (X^\top X)^{-1} X^\top\varepsilon = \beta + \Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\frac{X^\top\varepsilon}{n}$

(26.1.3) $\operatorname{plim}\hat{\beta} = \beta + \lim\Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\operatorname{plim}\frac{X^\top\varepsilon}{n}$

(26.1.4) $\phantom{\operatorname{plim}\hat{\beta}} = \beta + Q^{-1} o = \beta.$
Let's look at the geometry of this when there is only one explanatory variable. The specification is therefore $y = x\beta + \varepsilon$. The assumption is that $\varepsilon$ is asymptotically orthogonal to $x$. In small samples, it only happens by sheer accident with probability 0 that $\varepsilon$ is orthogonal to $x$; only $\hat{\varepsilon}$ is. But now let's assume the sample grows larger, i.e., the vectors $y$ and $x$ become very high-dimensional observation vectors, i.e., we are drawing here a two-dimensional subspace out of a very high-dimensional space. As more and more data are added, the observation vectors also become longer and longer. But if we divide each vector by $\sqrt{n}$, then the lengths of these normalized vectors stabilize. The squared length of the vector $\varepsilon/\sqrt{n}$ has plim $\sigma^2$. Furthermore, assumption (26.0.12) means in our case that $\operatorname{plim}_{n\to\infty}\frac{1}{n}x^\top x$ exists and is nonsingular; this is the squared length of $x/\sqrt{n}$. I.e., if we normalize the vectors by dividing them by $\sqrt{n}$, then they do not get longer but converge towards a finite length. And the result (26.1.1), $\operatorname{plim}\frac{1}{n}x^\top\varepsilon = 0$, now means that with this normalization, $\varepsilon/\sqrt{n}$ becomes more and more orthogonal to $x/\sqrt{n}$. I.e., if $n$ is large enough, asymptotically not only $\hat{\varepsilon}$ but also the true $\varepsilon$ is orthogonal to $x$, and this means that asymptotically $\hat{\beta}$ converges towards the true $\beta$.
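This geometry can be illustrated with a small R sketch (the design vector and the disturbance distribution are invented for illustration only): as $n$ grows, the normalized inner product $x^\top\varepsilon/n$ shrinks towards 0 and the OLS coefficient settles down near the true $\beta$.

# Sketch: consistency of OLS with one nonrandom regressor (invented design).
set.seed(2)
beta <- 2.5
for (n in c(100, 10000, 1000000)) {
  x   <- sqrt((1:n) / n)                  # x'x/n converges to 1/2
  eps <- rnorm(n)                         # true disturbances, sigma^2 = 1
  y   <- x * beta + eps
  betahat <- sum(x * y) / sum(x * x)      # OLS without intercept
  cat("n =", n,
      "  x'eps/n =", round(sum(x * eps) / n, 5),
      "  betahat =", round(betahat, 5), "\n")
}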
For the proof of consistency of $s^2$ we need, among others, that $\operatorname{plim}\frac{\varepsilon^\top\varepsilon}{n} = \sigma^2$, which is a consequence of Khinchine's theorem. Since $\hat{\varepsilon}^\top\hat{\varepsilon} = \varepsilon^\top M\varepsilon$ it follows

$\frac{\hat{\varepsilon}^\top\hat{\varepsilon}}{n-k} = \frac{\varepsilon^\top\bigl(I - X(X^\top X)^{-1}X^\top\bigr)\varepsilon}{n-k} = \frac{n}{n-k}\Bigl(\frac{\varepsilon^\top\varepsilon}{n} - \frac{\varepsilon^\top X}{n}\Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\frac{X^\top\varepsilon}{n}\Bigr) \to 1\cdot\bigl(\sigma^2 - o^\top Q^{-1} o\bigr) = \sigma^2.$
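Again purely as an illustration (the same invented setup as in the previous sketch, with $k = 1$), one can watch $s^2$ settle down near $\sigma^2 = 1$ as $n$ grows:

# Sketch: consistency of s^2 = e'e/(n-k) in the same invented setup (k = 1).
set.seed(3)
for (n in c(100, 10000, 1000000)) {
  x   <- sqrt((1:n) / n)
  y   <- 2.5 * x + rnorm(n)               # true sigma^2 = 1
  fit <- lm(y ~ x - 1)                    # one regressor, no intercept
  cat("n =", n, "  s^2 =", round(sum(resid(fit)^2) / (n - 1), 5), "\n")
}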
26.2. Asymptotic Normality of the Least Squares Estimator
To show asymptotic normality of an estimator, multiply the sampling error by $\sqrt{n}$, so that the variance is stabilized.

We have seen $\operatorname{plim}\frac{1}{n}X^\top\varepsilon = o$. Now look at $\frac{1}{\sqrt{n}}X^\top\varepsilon$. Its mean is $o$ and its covariance matrix is $\sigma^2\frac{X^\top X}{n}$. By a variant of the Central Limit Theorem, its distribution is asymptotically normal: $\frac{1}{\sqrt{n}}X^\top\varepsilon \to N(o, \sigma^2 Q)$. (Here the convergence is convergence in distribution.)

We can write $\sqrt{n}(\hat{\beta}_n - \beta) = \bigl(\frac{X^\top X}{n}\bigr)^{-1}\bigl(\frac{1}{\sqrt{n}}X^\top\varepsilon\bigr)$. Therefore its limiting covariance matrix is $Q^{-1}\sigma^2 Q Q^{-1} = \sigma^2 Q^{-1}$, and therefore $\sqrt{n}(\hat{\beta}_n - \beta) \to N(o, \sigma^2 Q^{-1})$ in distribution. One can also say: the asymptotic distribution of $\hat{\beta}$ is $N(\beta, \sigma^2(X^\top X)^{-1})$.
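As an illustration (an R sketch with an invented design and deliberately non-normal, centered exponential disturbances), the scaled sampling error $\sqrt{n}(\hat{\beta}_n - \beta)$ nevertheless looks normal over repeated samples:

# Sketch: asymptotic normality of sqrt(n)*(betahat - beta) despite skewed
# (centered exponential) disturbances.  Invented design, 5000 replications.
set.seed(4)
n <- 200; beta <- 1
x <- sqrt((1:n) / n)
z <- replicate(5000, {
  eps     <- rexp(n) - 1                  # mean 0, variance 1, skewed
  betahat <- sum(x * (x * beta + eps)) / sum(x * x)
  sqrt(n) * (betahat - beta)
})
qqnorm(z); qqline(z)                      # points fall close to a straight line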
From this result follows $\sqrt{n}(R\hat{\beta}_n - R\beta) \to N(o, \sigma^2 R Q^{-1} R^\top)$, and therefore

(26.2.1) $n(R\hat{\beta}_n - R\beta)^\top\bigl(R Q^{-1} R^\top\bigr)^{-1}(R\hat{\beta}_n - R\beta) \to \sigma^2\chi^2_i.$
Divide by $s^2$ and replace, in the limiting case, $Q$ by $X^\top X/n$ and $s^2$ by $\sigma^2$ to get

(26.2.2) $\frac{(R\hat{\beta}_n - R\beta)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat{\beta}_n - R\beta)}{s^2} \to \chi^2_i$

in distribution. All this is not a proof; the point is that in the denominator, the $\chi^2$ is divided by the increasingly bigger number $n - k$, while in the numerator it is divided by the constant $i$; therefore asymptotically the denominator can be considered to be 1. The central limit theorems only say that for $n \to \infty$ these statistics converge towards the $\chi^2_i$, which is asymptotically the same as $i$ times the $F_{i,n-k}$ distribution. It is easily possible that before one gets to the limit, the $F$-distribution is the better approximation.
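The following R sketch (invented design, a single restriction $i = 1$, non-normal disturbances) simulates the statistic (26.2.2) under the null hypothesis and reports the rejection frequencies obtained with the $\chi^2_1$ and the $F(1, n-k)$ critical values at the nominal 5% level:

# Sketch: the statistic (26.2.2) for one restriction (i = 1), simulated under
# the null with non-normal disturbances; rejection rates at the 5% level
# using the chi-square(1) and the F(1, n-k) critical values.
set.seed(5)
n <- 30; k <- 2
X <- cbind(1, sqrt((1:n) / n))            # invented design, includes a constant
stat <- replicate(10000, {
  y   <- drop(X %*% c(1, 0)) + (rexp(n) - 1)   # true beta2 = 0
  fit <- lm(y ~ X - 1)
  coef(fit)[2]^2 / vcov(fit)[2, 2]        # equals (26.2.2) with R = (0 1)
})
c(chisq = mean(stat > qchisq(0.95, df = 1)),
  F     = mean(stat > qf(0.95, 1, n - k)))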
Problem 327. Are the residuals $y - X\hat{\beta}$ asymptotically normally distributed?

Answer. Only if the disturbances are normal, otherwise of course not! We can show that $\sqrt{n}(\varepsilon - \hat{\varepsilon}) = \sqrt{n}\,X(\hat{\beta} - \beta) \sim N(o, \sigma^2 X Q^{-1} X^\top)$ asymptotically.
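A quick R sketch (invented numbers) of the answer's point: the residuals converge to the disturbances themselves, so with skewed disturbances they stay visibly skewed no matter how large $n$ is.

# Sketch: residuals inherit the shape of the disturbances.  With centered
# exponential disturbances the residuals stay skewed even for large n.
set.seed(6)
n <- 100000
x <- rnorm(n)
y <- 1 + 2 * x + (rexp(n) - 1)
hist(resid(lm(y ~ x)), breaks = 100, main = "Residuals, skewed disturbances")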
Now these results also go through if one has stochastic regressors. [Gre97, 6.7.7] shows that the above condition (26.0.12), with the lim replaced by plim, holds if the $x_i$ and $\varepsilon_i$ form an i.i.d. sequence of random variables.
Problem 328. 2 points In the regression model with random regressors $y = X\beta + \varepsilon$, you only know that $\operatorname{plim}\frac{1}{n}X^\top X = Q$ is a nonsingular matrix, and $\operatorname{plim}\frac{1}{n}X^\top\varepsilon = o$. Using these two conditions, show that the OLS estimate is consistent.

Answer. $\hat{\beta} = (X^\top X)^{-1}X^\top y = \beta + (X^\top X)^{-1}X^\top\varepsilon$ due to (18.0.7), and
$\operatorname{plim}(X^\top X)^{-1}X^\top\varepsilon = \operatorname{plim}\Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\frac{X^\top\varepsilon}{n} = Q^{-1} o = o.$
CHAPTER 27
Least Squares as the Normal Maximum Likelihood Estimate
Now assume $\varepsilon$ is multivariate normal. We will show that in this case the OLS estimator $\hat{\beta}$ is at the same time the Maximum Likelihood Estimator. For this we need to write down the density function of $y$. First look at one $y_t$, which is $y_t \sim N(x_t^\top\beta, \sigma^2)$, where $X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix}$, i.e., $x_t^\top$ is the $t$th row of $X$. ($x_t$ itself is written as a column vector, since we follow the "column vector convention.") The (marginal) density function for this one observation is

(27.0.3) $f_{y_t}(y_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_t - x_t^\top\beta)^2/2\sigma^2}.$

Since the $y_t$ are stochastically independent, their joint density function is the product, which can be written as

(27.0.4) $f_y(y) = (2\pi\sigma^2)^{-n/2}\exp\Bigl(-\frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta)\Bigr).$
To compute the maximum likelihood estimator, it is advantageous to start with the log likelihood function:

(27.0.5) $\log f_y(y; \beta, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta).$

Assume for a moment that $\sigma^2$ is known. Then the MLE of $\beta$ is clearly equal to the OLS $\hat{\beta}$. Since $\hat{\beta}$ does not depend on $\sigma^2$, it is also the maximum likelihood estimate when $\sigma^2$ is unknown. $\hat{\beta}$ is a linear function of $y$. Linear transformations of normal variables are normal. Normal distributions are characterized by their mean vector and covariance matrix. The distribution of the MLE of $\beta$ is therefore $\hat{\beta} \sim N(\beta, \sigma^2(X^\top X)^{-1})$.
If we replace $\beta$ in the log likelihood function (27.0.5) by $\hat{\beta}$, we get what is called the log likelihood function with $\beta$ "concentrated out":

(27.0.6) $\log f_y(y; \beta = \hat{\beta}, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\hat{\beta})^\top(y - X\hat{\beta}).$

One gets the maximum likelihood estimate of $\sigma^2$ by maximizing this "concentrated" log likelihood function. Taking the derivative with respect to $\sigma^2$ (consider $\sigma^2$ the name of a variable, not the square of another variable), one gets

(27.0.7) $\frac{\partial}{\partial\sigma^2}\log f_y(y; \hat{\beta}) = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}(y - X\hat{\beta})^\top(y - X\hat{\beta}).$

Setting this zero gives

(27.0.8) $\tilde{\sigma}^2 = \frac{(y - X\hat{\beta})^\top(y - X\hat{\beta})}{n} = \frac{\hat{\varepsilon}^\top\hat{\varepsilon}}{n}.$
This is a scalar multiple of the unbiased estimate $s^2 = \hat{\varepsilon}^\top\hat{\varepsilon}/(n - k)$ which we had earlier.
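As a numerical check (an R sketch with simulated data; the likelihood is coded directly from (27.0.5)), one can maximize the log likelihood with a generic optimizer and compare the result with the OLS fit, and compare $\tilde{\sigma}^2 = \hat{\varepsilon}^\top\hat{\varepsilon}/n$ with $s^2$:

# Sketch: maximize the log likelihood (27.0.5) numerically and compare with
# OLS; also compare sigma-tilde^2 = e'e/n with s^2 = e'e/(n-k).  Simulated data.
set.seed(7)
n <- 50; X <- cbind(1, rnorm(n)); beta <- c(1, 2)
y <- drop(X %*% beta) + rnorm(n, sd = 2)

negloglik <- function(th) {                # th = (beta1, beta2, log(sigma^2))
  s2 <- exp(th[3]); r <- y - drop(X %*% th[1:2])
  0.5 * (n * log(2 * pi) + n * log(s2) + sum(r^2) / s2)
}
mle <- optim(c(0, 0, 0), negloglik, control = list(maxit = 2000))$par
fit <- lm(y ~ X - 1)

cbind(ML = mle[1:2], OLS = coef(fit))      # the two beta estimates agree (up to optimization error)
c(sigma.tilde2 = exp(mle[3]),              # ML estimate of sigma^2 ...
  e.e.over.n   = sum(resid(fit)^2) / n,    # ... equals e'e/n
  s2           = sum(resid(fit)^2) / (n - 2))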
Let's look at the distribution of $s^2$ (from which that of its scalar multiples follows easily). It is a quadratic form in a normal variable. Such quadratic forms very often have $\chi^2$ distributions.

Now recall equation (7.4.9) characterizing all the quadratic forms of multivariate normal variables that are $\chi^2$'s. Here it is again: Assume $y$ is a multivariate normal vector random variable with mean vector $\mu$ and covariance matrix $\sigma^2\Psi$, and $\Omega$ is a symmetric nonnegative definite matrix. Then $(y - \mu)^\top\Omega(y - \mu) \sim \sigma^2\chi^2_k$ iff

(27.0.9) $\Psi\Omega\Psi\Omega\Psi = \Psi\Omega\Psi,$

and $k$ is the rank of $\Psi\Omega$.

This condition is satisfied in particular if $\Psi = I$ (the identity matrix) and $\Omega^2 = \Omega$, and this is exactly our situation:

(27.0.10) $s^2 = \frac{(y - X\hat{\beta})^\top(y - X\hat{\beta})}{n-k} = \frac{\varepsilon^\top\bigl(I - X(X^\top X)^{-1}X^\top\bigr)\varepsilon}{n-k} = \frac{\varepsilon^\top M\varepsilon}{n-k},$

where $M^2 = M$ and $\operatorname{rank} M = n - k$. (This last identity holds because for idempotent matrices rank = trace, and we computed the trace above.) Therefore $s^2 \sim \sigma^2\chi^2_{n-k}/(n - k)$, from which one obtains again unbiasedness, but also that $\operatorname{var}[s^2] = 2\sigma^4/(n - k)$, a result that one cannot get from the mean and variance of the disturbances alone.
Problem 329. 4 points Show that, if $y$ is normally distributed, $s^2$ and $\hat{\beta}$ are independent.

Answer. We showed in question 246 that $\hat{\beta}$ and $\hat{\varepsilon}$ are uncorrelated, therefore in the normal case independent; therefore $\hat{\beta}$ is also independent of any function of $\hat{\varepsilon}$, such as $s^2$.
Problem 330. Computer assignment: You run a regression with 3 explanatory variables, no constant term, the sample size is 20, the errors are normally distributed and you know that $\sigma^2 = 2$. Plot the density function of $s^2$. Hint: The command dchisq(x,df=25) returns the density of a Chi-square distribution with 25 degrees of freedom evaluated at x. But the number 25 was only taken as an example, this is not the number of degrees of freedom you need here.

• a. In the same plot, plot the density function of the Theil-Schweitzer estimate. Can one see from the comparison of these density functions why the Theil-Schweitzer estimator has a better MSE?
Answer. Start with the Theil-Schweitzer plot, because it is higher.
> x <- seq(from = 0, to = 6, by = 0.01)
> Density <- (19/2)*dchisq((19/2)*x, df=17)   # Theil-Schweitzer estimate: (2/19) times a chi-square(17)
> plot(x, Density, type="l", lty=2)
> lines(x, (17/2)*dchisq((17/2)*x, df=17))    # unbiased estimate s^2: (2/17) times a chi-square(17)
> title(main = "Unbiased versus Theil-Schweitzer Variance Estimate, 17 d.f.")
Now let us derive the maximum likelihood estimator in the case of a nonspherical but positive definite covariance matrix. I.e., the model is $y = X\beta + \varepsilon$, $\varepsilon \sim N(o, \sigma^2\Psi)$. The density function is

(27.0.11) $f_y(y) = (2\pi\sigma^2)^{-n/2}\,|\det\Psi|^{-1/2}\exp\Bigl(-\frac{1}{2\sigma^2}(y - X\beta)^\top\Psi^{-1}(y - X\beta)\Bigr).$
Problem 331. Derive (27.0.11) as follows: Take a matrix $P$ with the property that $P\varepsilon$ has covariance matrix $\sigma^2 I$. Write down the joint density function of $P\varepsilon$. Since $y$ is a linear transformation of $\varepsilon$, one can apply the rule for the density function of a transformed random variable.
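A numerical sketch of this derivation strategy (the matrices, coefficients, and evaluation point below are arbitrary test values): choose $P$ from the Cholesky factor of $\Psi$ so that $P\Psi P^\top = I$; then the density of $y$ equals $|\det P|$ times the spherical normal density evaluated at $P(y - X\beta)$, which reproduces (27.0.11) since $|\det P| = |\det\Psi|^{-1/2}$ and $\Psi^{-1} = P^\top P$.

# Sketch (Problem 331): numerical check of (27.0.11).  Psi, beta, sigma^2 and
# the point y are arbitrary test values.
set.seed(8)
n <- 4
A    <- matrix(rnorm(n * n), n)
Psi  <- crossprod(A) + diag(n)             # some positive definite Psi
X    <- cbind(1, rnorm(n)); beta <- c(1, -1); sigma2 <- 2
y    <- rnorm(n)                           # point at which to evaluate the density
r    <- y - drop(X %*% beta)

# formula (27.0.11) directly
f1 <- (2 * pi * sigma2)^(-n / 2) * det(Psi)^(-1 / 2) *
      exp(-crossprod(r, solve(Psi, r)) / (2 * sigma2))

# via the transformation: P Psi P' = I, so P times the disturbance is spherical
P  <- t(solve(chol(Psi)))                  # chol(Psi) = U with Psi = U'U
f2 <- abs(det(P)) * (2 * pi * sigma2)^(-n / 2) *
      exp(-crossprod(drop(P %*% r)) / (2 * sigma2))
c(f1, f2)                                  # the two values coincide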