34. ASYMPTOTIC PROPERTIES OF OLS
There are two examples where this is not the case. First look at the model $y_t = \alpha + \beta t + \varepsilon_t$. Here

$$X = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & n \end{pmatrix}. \qquad \text{Therefore} \qquad X^\top X = \begin{pmatrix} 1+1+\cdots+1 & 1+2+3+\cdots+n \\ 1+2+3+\cdots+n & 1+4+9+\cdots+n^2 \end{pmatrix} = \begin{pmatrix} n & n(n+1)/2 \\ n(n+1)/2 & n(n+1)(2n+1)/6 \end{pmatrix},$$

and

$$\frac{1}{n} X^\top X \to \begin{pmatrix} 1 & \infty \\ \infty & \infty \end{pmatrix}.$$

Here the assumption (34.0.5) does not hold, but one can still prove consistency and asymptotic normality; the estimators converge even faster than in the usual case.
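As a numerical sanity check (my own illustration, not part of the text), the closed-form entries of $X^\top X$ for the trend model can be verified in Python; the sample size n = 1000 is an arbitrary choice:

```python
import numpy as np

# Trend model y_t = alpha + beta*t + eps_t: the rows of X are (1, t), t = 1..n.
n = 1000
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t])
XtX = X.T @ X

# Closed-form entries: sum t = n(n+1)/2 and sum t^2 = n(n+1)(2n+1)/6.
expected = np.array([[n, n * (n + 1) / 2],
                     [n * (n + 1) / 2, n * (n + 1) * (2 * n + 1) / 6]])
assert np.allclose(XtX, expected)

# (1/n) X'X does not converge: all entries except the (1,1) entry grow with n.
print(XtX / n)
```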
The other example is the model $y_t = \alpha + \beta\lambda^t + \varepsilon_t$ with a known $\lambda$ satisfying $-1 < \lambda < 1$. Here

$$X^\top X = \begin{pmatrix} 1+1+\cdots+1 & \lambda+\lambda^2+\cdots+\lambda^n \\ \lambda+\lambda^2+\cdots+\lambda^n & \lambda^2+\lambda^4+\cdots+\lambda^{2n} \end{pmatrix} = \begin{pmatrix} n & (\lambda-\lambda^{n+1})/(1-\lambda) \\ (\lambda-\lambda^{n+1})/(1-\lambda) & (\lambda^2-\lambda^{2n+2})/(1-\lambda^2) \end{pmatrix}.$$

Therefore

$$\frac{1}{n} X^\top X \to \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},$$

which is singular. In this case, a consistent estimate of $\lambda$ does not exist: future observations depend on $\lambda$ so little that even with infinitely many observations there is not enough information to get the precise value of $\lambda$.
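The singular limit is easy to see numerically (a sketch of my own; the value $\lambda = 0.5$ is an arbitrary choice for illustration):

```python
import numpy as np

# Model y_t = alpha + beta*lambda**t: the columns of X are (1, lambda**t), t = 1..n.
lam = 0.5
for n in (10, 100, 1000):
    t = np.arange(1, n + 1)
    X = np.column_stack([np.ones(n), lam ** t])
    Q_n = X.T @ X / n
    print(n, Q_n)  # approaches [[1, 0], [0, 0]], a singular matrix
```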
We will show that under assumption (34.0.5), $\hat\beta$ and $s^2$ are consistent. However, this assumption is really too strong for consistency. A weaker set of assumptions is the Grenander conditions; see [Gre97, p. 275]. To write down the Grenander conditions, remember that presently $X$ depends on $n$ (in that we only look at the first $n$ elements of $y$ and the first $n$ rows of $X$); therefore the column vectors $x_j$ also depend on $n$ (although we are not indicating this here). Therefore $x_j^\top x_j$ depends on $n$ as well, and we will make this dependency explicit by writing $x_j^\top x_j = d_{nj}^2$. Then the first Grenander condition is $\lim_{n\to\infty} d_{nj}^2 = +\infty$ for all $j$. Second: for all $j$, $\lim_{n\to\infty} \max_{i=1,\dots,n} x_{ij}^2 / d_{nj}^2 = 0$ (there is a typo in Greene; he leaves the max out). Third: the sample correlation matrix of the columns of $X$ other than the constant term converges to a nonsingular matrix.
Consistency means that the estimates converge in probability towards the true value. For $\hat\beta$ this can be written as $\operatorname{plim}_{n\to\infty} \hat\beta_n = \beta$. This means by definition that for all $\varepsilon > 0$, $\lim_{n\to\infty} \Pr[|\hat\beta_n - \beta| \le \varepsilon] = 1$.

The probability limit is one of several concepts of limits used in probability theory. We will need the following properties of the plim here:
(1) For nonrandom magnitudes, the probability limit is equal to the ordinary limit.
(2) It satisfies the Slutsky theorem, that for a continuous function $g$,
(34.0.6)    $\operatorname{plim} g(z) = g(\operatorname{plim}(z)).$
(3) If the MSE-matrix of an estimator converges towards the null matrix, then the estimator is consistent.
(4) Khinchine's theorem: the sample mean of an i.i.d. distribution is a consistent estimate of the population mean, even if the distribution does not have a population variance.
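Property (4) can be illustrated by simulation (my own sketch, not from the text). The $t$-distribution with 2 degrees of freedom is chosen because it has a population mean (zero) but no finite variance; the seed and sample sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Student-t with 2 degrees of freedom: the mean exists (= 0), the variance is infinite.
# Khinchine's theorem still guarantees the sample mean is consistent for that mean.
means = {n: rng.standard_t(df=2, size=n).mean() for n in (10**3, 10**5, 10**7)}
for n, m in means.items():
    print(n, m)
```

Despite the infinite variance, the sample means shrink towards zero as $n$ grows.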
34.1. Consistency of the OLS estimator
For the proof of consistency of the OLS estimators $\hat\beta$ and $s^2$ we need the following result:

(34.1.1)    $\operatorname{plim} \frac{1}{n} X^\top \varepsilon = o.$

I.e., the true $\varepsilon$ is asymptotically orthogonal to all columns of $X$. This follows immediately from $\operatorname{MSE}[o; X^\top\varepsilon/n] = \mathcal{E}[X^\top\varepsilon\varepsilon^\top X/n^2] = \sigma^2 X^\top X/n^2$, which converges towards $O$: by (34.0.5), $X^\top X/n$ converges to $Q$, while the remaining factor $1/n$ goes to zero.
In order to prove consistency of $\hat\beta$ and $s^2$, transform the formulas for $\hat\beta$ and $s^2$ in such a way that they are written as continuous functions of terms each of which converges for $n \to \infty$, and then apply Slutsky's theorem. Write $\hat\beta$ as

(34.1.2)    $\hat\beta = \beta + (X^\top X)^{-1} X^\top \varepsilon = \beta + \Bigl(\frac{X^\top X}{n}\Bigr)^{-1} \frac{X^\top \varepsilon}{n}$

(34.1.3)    $\operatorname{plim} \hat\beta = \beta + \lim \Bigl(\frac{X^\top X}{n}\Bigr)^{-1} \operatorname{plim} \frac{X^\top \varepsilon}{n}$

(34.1.4)    $\phantom{\operatorname{plim} \hat\beta} = \beta + Q^{-1} o = \beta.$
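The argument above can be watched at work in a small simulation (my own sketch; the coefficient values, error scale, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([2.0, -1.0])  # hypothetical true coefficients

for n in (100, 10_000, 1_000_000):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    eps = rng.normal(scale=3.0, size=n)
    y = X @ beta + eps
    # OLS estimate: as n grows, beta_hat approaches the true beta.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(n, beta_hat)
```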
Let us look at the geometry of this when there is only one explanatory variable. The specification is therefore $y = x\beta + \varepsilon$. The assumption is that $\varepsilon$ is asymptotically orthogonal to $x$. In small samples, it only happens by sheer accident with probability 0 that $\varepsilon$ is orthogonal to $x$; only $\hat\varepsilon$ is. But now let us assume the sample grows larger, i.e., the vectors $y$ and $x$ become very high-dimensional observation vectors, i.e., we are drawing here a two-dimensional subspace out of a very high-dimensional space. As more and more data are added, the observation vectors also become longer and longer. But if we divide each vector by $\sqrt{n}$, then the lengths of these normalized vectors stabilize. The squared length of the vector $\varepsilon/\sqrt{n}$ has the plim $\sigma^2$. Furthermore, assumption (34.0.5) means in our case that $\operatorname{plim}_{n\to\infty} \frac{1}{n} x^\top x$ exists and is nonsingular. This is the squared length of $\frac{1}{\sqrt{n}} x$. I.e., if we normalize the vectors by dividing them by $\sqrt{n}$, then they do not get longer but converge towards a finite length. And the result (34.1.1) $\operatorname{plim} \frac{1}{n} x^\top \varepsilon = 0$ means now that with this normalization, $\varepsilon/\sqrt{n}$ becomes more and more orthogonal to $x/\sqrt{n}$. I.e., if $n$ is large enough, asymptotically not only $\hat\varepsilon$ but also the true $\varepsilon$ is orthogonal to $x$, and this means that asymptotically $\hat\beta$ converges towards the true $\beta$.
For the proof of consistency of $s^2$ we need, among others, that $\operatorname{plim} \frac{\varepsilon^\top\varepsilon}{n} = \sigma^2$, which is a consequence of Khinchine's theorem. Since $\hat\varepsilon^\top\hat\varepsilon = \varepsilon^\top M \varepsilon$ it follows

$$\frac{\hat\varepsilon^\top\hat\varepsilon}{n-k} = \frac{n}{n-k}\,\varepsilon^\top\Bigl(\frac{I}{n} - \frac{X}{n}\Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\frac{X^\top}{n}\Bigr)\varepsilon = \frac{n}{n-k}\Bigl(\frac{\varepsilon^\top\varepsilon}{n} - \frac{\varepsilon^\top X}{n}\Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\frac{X^\top\varepsilon}{n}\Bigr) \to 1\cdot\sigma^2 - o^\top Q^{-1} o = \sigma^2.$$
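A simulation makes the convergence of $s^2$ concrete (my own sketch; the true variance $\sigma^2 = 4$, the coefficients, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0  # hypothetical true error variance
beta = np.array([1.0, 0.5])

for n in (50, 5_000, 500_000):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    # s^2 = eps_hat' eps_hat / (n - k): approaches sigma^2 as n grows.
    s2 = resid @ resid / (n - X.shape[1])
    print(n, s2)
```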
34.2. Asymptotic Normality of the Least Squares Estimator
To show asymptotic normality of an estimator, multiply the sampling error by $\sqrt{n}$, so that the variance is stabilized.

We have seen $\operatorname{plim} \frac{1}{n} X^\top \varepsilon = o$. Now look at $\frac{1}{\sqrt{n}} X^\top \varepsilon_n$. Its mean is $o$ and its covariance matrix is $\sigma^2 \frac{X^\top X}{n}$. The shape of the distribution, due to a variant of the Central Limit Theorem, is asymptotically normal: $\frac{1}{\sqrt{n}} X^\top \varepsilon_n \to N(o, \sigma^2 Q)$. (Here the convergence is convergence in distribution.)

We can write $\sqrt{n}(\hat\beta_n - \beta) = \bigl(\frac{X^\top X}{n}\bigr)^{-1}\bigl(\frac{1}{\sqrt{n}} X^\top \varepsilon_n\bigr)$. Therefore its limiting covariance matrix is $Q^{-1} \sigma^2 Q Q^{-1} = \sigma^2 Q^{-1}$; therefore $\sqrt{n}(\hat\beta_n - \beta) \to N(o, \sigma^2 Q^{-1})$ in distribution. One can also say: the asymptotic distribution of $\hat\beta$ is $N(\beta, \sigma^2 (X^\top X)^{-1})$.
From this follows $\sqrt{n}(R\hat\beta_n - R\beta) \to N(o, \sigma^2 R Q^{-1} R^\top)$, and therefore

(34.2.1)    $n(R\hat\beta_n - R\beta)^\top \bigl(R Q^{-1} R^\top\bigr)^{-1} (R\hat\beta_n - R\beta) \to \sigma^2 \chi^2_i,$

where $i$ is the number of rows of $R$. Divide by $s^2$ and replace in the limiting case $Q$ by $X^\top X/n$ and $s^2$ by $\sigma^2$ to get

(34.2.2)    $\frac{(R\hat\beta_n - R\beta)^\top \bigl(R(X^\top X)^{-1} R^\top\bigr)^{-1} (R\hat\beta_n - R\beta)}{s^2} \to \chi^2_i$

in distribution. All this is not a proof; the point is that in the denominator the distribution is divided by the increasingly bigger number $n-k$, while in the numerator it is divided by the constant $i$; therefore asymptotically the denominator can be considered 1.

The central limit theorems only say that for $n \to \infty$ these converge towards the $\chi^2$, which is asymptotically equal to the $F$ distribution. It is easily possible that before one gets to the limit, the $F$-distribution is better.
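The limiting $\chi^2_i$ behavior of the statistic (34.2.2) can be sketched by Monte Carlo (my own illustration, not from the text; the sample size, number of replications, restriction matrix $R$, and the deliberately non-normal $t(5)$ errors are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 2000
beta = np.array([1.0, 2.0, -1.0])
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])  # i = 2 restrictions: both slopes
i = R.shape[0]

stats = []
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ beta + rng.standard_t(df=5, size=n)  # non-normal errors on purpose
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - 3)
    d = R @ (beta_hat - beta)
    # Statistic (34.2.2): asymptotically chi^2 with i degrees of freedom.
    stats.append(d @ np.linalg.inv(R @ XtX_inv @ R.T) @ d / s2)

stats = np.array(stats)
print(stats.mean(), "vs. chi^2 mean", i)
```

Even with non-normal disturbances, the Monte Carlo mean of the statistic is close to $i$, the mean of the $\chi^2_i$ distribution.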
Problem 393. Are the residuals $y - X\hat\beta$ asymptotically normally distributed?

Answer. Only if the disturbances are normal, otherwise of course not! We can show that $\sqrt{n}(\hat\varepsilon - \varepsilon) = -\sqrt{n}\, X(\hat\beta - \beta) \sim N(o, \sigma^2 X Q^{-1} X^\top)$ asymptotically.
Now these results also go through if one has stochastic regressors. [Gre97, 6.7.7] shows that the above condition (34.0.5), with the lim replaced by plim, holds if the $x_i$ and $\varepsilon_i$ are an i.i.d. sequence of random variables.
Problem 394. 2 points In the regression model with random regressors $y = X\beta + \varepsilon$, you only know that $\operatorname{plim} \frac{1}{n} X^\top X = Q$ is a nonsingular matrix, and $\operatorname{plim} \frac{1}{n} X^\top \varepsilon = o$. Using these two conditions, show that the OLS estimate is consistent.

Answer. $\hat\beta = (X^\top X)^{-1} X^\top y = \beta + (X^\top X)^{-1} X^\top \varepsilon$ due to (24.0.7), and

$$\operatorname{plim}(X^\top X)^{-1} X^\top \varepsilon = \operatorname{plim}\Bigl(\Bigl(\frac{X^\top X}{n}\Bigr)^{-1} \frac{X^\top \varepsilon}{n}\Bigr) = Q^{-1} o = o.$$
CHAPTER 35
Least Squares as the Normal Maximum Likelihood Estimate
Now assume $\varepsilon$ is multivariate normal. We will show that in this case the OLS estimator $\hat\beta$ is at the same time the Maximum Likelihood Estimator. For this we need to write down the density function of $y$. First look at one $y_t$, which is $y_t \sim N(x_t^\top \beta, \sigma^2)$, where

$$X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix},$$

i.e., $x_t^\top$ is the $t$th row of $X$. $x_t$ itself is written as a column vector, since we follow the "column vector convention." The (marginal) density function for this one observation is
(35.0.3)    $f_{y_t}(y_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_t - x_t^\top \beta)^2 / 2\sigma^2}.$

Since the $y_t$ are stochastically independent, their joint density function is the product, which can be written as

(35.0.4)    $f_y(y) = (2\pi\sigma^2)^{-n/2} \exp\Bigl(-\frac{1}{2\sigma^2}(y - X\beta)^\top (y - X\beta)\Bigr).$
To compute the maximum likelihood estimator, it is advantageous to start with the log likelihood function:

(35.0.5)    $\log f_y(y; \beta, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)^\top (y - X\beta).$
Assume for a moment that $\sigma^2$ is known. Then the MLE of $\beta$ is clearly equal to the OLS $\hat\beta$: maximizing (35.0.5) over $\beta$ amounts to minimizing the sum of squares $(y - X\beta)^\top (y - X\beta)$. Since $\hat\beta$ does not depend on $\sigma^2$, it is also the maximum likelihood estimate when $\sigma^2$ is unknown. $\hat\beta$ is a linear function of $y$. Linear transformations of normal variables are normal. Normal distributions are characterized by their mean vector and covariance matrix. The distribution of the MLE of $\beta$ is therefore $\hat\beta \sim N(\beta, \sigma^2 (X^\top X)^{-1}).$
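This equivalence can be checked numerically (my own sketch; the data-generating values and seed are arbitrary). The gradient of the log likelihood (35.0.5) with respect to $\beta$ is $X^\top(y - X\beta)/\sigma^2$, which vanishes exactly at the solution of the normal equations, i.e., at the OLS estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 200, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def loglik(b, s2=sigma2):
    # Log likelihood (35.0.5) for known sigma^2.
    r = y - X @ b
    return -0.5 * (n * np.log(2 * np.pi * s2) + r @ r / s2)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the log likelihood w.r.t. beta is X'(y - X b)/sigma^2;
# at the OLS solution it is numerically zero (the normal equations).
grad = X.T @ (y - X @ beta_hat) / sigma2
print(grad)

# Perturbing beta_hat in any direction lowers the likelihood.
for delta in (np.array([0.1, 0.0]), np.array([0.0, -0.1])):
    assert loglik(beta_hat + delta) < loglik(beta_hat)
```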