Consistency means that the probability limit of the estimates is the true value. For $\hat{\beta}$ this can be written as $\operatorname{plim}_{n\to\infty}\hat{\beta}_n = \beta$. This means by definition that for every $\varepsilon > 0$, $\lim_{n\to\infty}\Pr[|\hat{\beta}_n - \beta| \le \varepsilon] = 1$.
The probability limit is one of several concepts of limits used in probability theory. We will need the following properties of the plim here:

(1) For nonrandom magnitudes, the probability limit is equal to the ordinary limit.

(2) It satisfies the Slutsky theorem: for a continuous function $g$,

(26.0.13) $\operatorname{plim} g(z) = g(\operatorname{plim}(z)).$

(3) If the MSE-matrix of an estimator converges towards the null matrix, then the estimator is consistent.

(4) Khinchine's theorem: the sample mean of an i.i.d. sequence is a consistent estimate of the population mean, even if the distribution does not have a population variance.
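The following small R simulation (a sketch with an invented example, not part of the argument) illustrates Khinchine's theorem: the $t$-distribution with 2 degrees of freedom has a finite mean but no finite variance, yet its sample mean still settles down near the population mean 0 as $n$ grows.

# Sketch: Khinchine's theorem.  The t(2) distribution has mean 0 but no
# finite variance; its sample mean is nevertheless consistent.
set.seed(1)
for (n in c(100, 10000, 1000000)) {
  cat("n =", n, "  sample mean =", round(mean(rt(n, df = 2)), 4), "\n")
}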
26.1. Consistency of the OLS estimator
For the proof of consistency of the OLS estimators $\hat{\beta}$ and of $s^2$ we need the following result:

(26.1.1) $\operatorname{plim} \frac{1}{n} X^\top\varepsilon = o.$

I.e., the true $\varepsilon$ is asymptotically orthogonal to all columns of $X$. This follows immediately from $\operatorname{MSE}[o; X^\top\varepsilon/n] = \operatorname{E}[X^\top\varepsilon\varepsilon^\top X/n^2] = \sigma^2 X^\top X/n^2$, which converges towards $O$.
In order to prove consistency of $\hat{\beta}$ and $s^2$, transform the formulas for $\hat{\beta}$ and $s^2$ in such a way that they are written as continuous functions of terms each of which converges for $n \to \infty$, and then apply Slutsky's theorem. Write $\hat{\beta}$ as

(26.1.2) $\hat{\beta} = \beta + (X^\top X)^{-1} X^\top\varepsilon = \beta + \Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\frac{X^\top\varepsilon}{n}$

(26.1.3) $\operatorname{plim}\hat{\beta} = \beta + \lim\Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\operatorname{plim}\frac{X^\top\varepsilon}{n}$

(26.1.4) $\phantom{\operatorname{plim}\hat{\beta}} = \beta + Q^{-1} o = \beta.$
Let's look at the geometry of this when there is only one explanatory variable. The specification is therefore $y = x\beta + \varepsilon$. The assumption is that $\varepsilon$ is asymptotically orthogonal to $x$. In small samples, it only happens by sheer accident with probability 0 that $\varepsilon$ is orthogonal to $x$; only $\hat{\varepsilon}$ is. But now let's assume the sample grows larger, i.e., the vectors $y$ and $x$ become very high-dimensional observation vectors, i.e., we are drawing here a two-dimensional subspace out of a very high-dimensional space. As more and more data are added, the observation vectors also become longer and longer. But if we divide each vector by $\sqrt{n}$, then the lengths of these normalized vectors stabilize. The squared length of the vector $\varepsilon/\sqrt{n}$ has plim $\sigma^2$. Furthermore, assumption (26.0.12) means in our case that $\operatorname{plim}_{n\to\infty}\frac{1}{n}x^\top x$ exists and is nonsingular; this is the squared length of $x/\sqrt{n}$. I.e., if we normalize the vectors by dividing them by $\sqrt{n}$, then they do not get longer but converge towards a finite length. And the result (26.1.1), $\operatorname{plim}\frac{1}{n}x^\top\varepsilon = 0$, now means that with this normalization, $\varepsilon/\sqrt{n}$ becomes more and more orthogonal to $x/\sqrt{n}$. I.e., if $n$ is large enough, asymptotically not only $\hat{\varepsilon}$ but also the true $\varepsilon$ is orthogonal to $x$, and this means that asymptotically $\hat{\beta}$ converges towards the true $\beta$.
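This geometry can be illustrated with a small R sketch (the design vector and the disturbance distribution are invented for illustration only): as $n$ grows, the normalized inner product $x^\top\varepsilon/n$ shrinks towards 0 and the OLS coefficient settles down near the true $\beta$.

# Sketch: consistency of OLS with one nonrandom regressor (invented design).
set.seed(2)
beta <- 2.5
for (n in c(100, 10000, 1000000)) {
  x   <- sqrt((1:n) / n)                  # x'x/n converges to 1/2
  eps <- rnorm(n)                         # true disturbances, sigma^2 = 1
  y   <- x * beta + eps
  betahat <- sum(x * y) / sum(x * x)      # OLS without intercept
  cat("n =", n,
      "  x'eps/n =", round(sum(x * eps) / n, 5),
      "  betahat =", round(betahat, 5), "\n")
}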
For the proof of consistency of $s^2$ we need, among others, that $\operatorname{plim}\frac{\varepsilon^\top\varepsilon}{n} = \sigma^2$, which is a consequence of Khinchine's theorem. Since $\hat{\varepsilon}^\top\hat{\varepsilon} = \varepsilon^\top M\varepsilon$ it follows

$\frac{\hat{\varepsilon}^\top\hat{\varepsilon}}{n-k} = \frac{\varepsilon^\top\bigl(I - X(X^\top X)^{-1}X^\top\bigr)\varepsilon}{n-k} = \frac{n}{n-k}\Bigl(\frac{\varepsilon^\top\varepsilon}{n} - \frac{\varepsilon^\top X}{n}\Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\frac{X^\top\varepsilon}{n}\Bigr) \to 1\cdot\bigl(\sigma^2 - o^\top Q^{-1} o\bigr) = \sigma^2.$
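Again purely as an illustration (the same invented setup as in the previous sketch, with $k = 1$), one can watch $s^2$ settle down near $\sigma^2 = 1$ as $n$ grows:

# Sketch: consistency of s^2 = e'e/(n-k) in the same invented setup (k = 1).
set.seed(3)
for (n in c(100, 10000, 1000000)) {
  x   <- sqrt((1:n) / n)
  y   <- 2.5 * x + rnorm(n)               # true sigma^2 = 1
  fit <- lm(y ~ x - 1)                    # one regressor, no intercept
  cat("n =", n, "  s^2 =", round(sum(resid(fit)^2) / (n - 1), 5), "\n")
}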
26.2. Asymptotic Normality of the Least Squares Estimator
To show asymptotic normality of an estimator, multiply the sampling error by $\sqrt{n}$, so that the variance is stabilized.

We have seen $\operatorname{plim}\frac{1}{n}X^\top\varepsilon = o$. Now look at $\frac{1}{\sqrt{n}}X^\top\varepsilon$. Its mean is $o$ and its covariance matrix is $\sigma^2\frac{X^\top X}{n}$. By a variant of the Central Limit Theorem, its distribution is asymptotically normal: $\frac{1}{\sqrt{n}}X^\top\varepsilon \to N(o, \sigma^2 Q)$. (Here the convergence is convergence in distribution.)

We can write $\sqrt{n}(\hat{\beta}_n - \beta) = \bigl(\frac{X^\top X}{n}\bigr)^{-1}\bigl(\frac{1}{\sqrt{n}}X^\top\varepsilon\bigr)$. Therefore its limiting covariance matrix is $Q^{-1}\sigma^2 Q Q^{-1} = \sigma^2 Q^{-1}$, and therefore $\sqrt{n}(\hat{\beta}_n - \beta) \to N(o, \sigma^2 Q^{-1})$ in distribution. One can also say: the asymptotic distribution of $\hat{\beta}$ is $N(\beta, \sigma^2(X^\top X)^{-1})$.
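As an illustration (an R sketch with an invented design and deliberately non-normal, centered exponential disturbances), the scaled sampling error $\sqrt{n}(\hat{\beta}_n - \beta)$ nevertheless looks normal over repeated samples:

# Sketch: asymptotic normality of sqrt(n)*(betahat - beta) despite skewed
# (centered exponential) disturbances.  Invented design, 5000 replications.
set.seed(4)
n <- 200; beta <- 1
x <- sqrt((1:n) / n)
z <- replicate(5000, {
  eps     <- rexp(n) - 1                  # mean 0, variance 1, skewed
  betahat <- sum(x * (x * beta + eps)) / sum(x * x)
  sqrt(n) * (betahat - beta)
})
qqnorm(z); qqline(z)                      # points fall close to a straight line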
From this result follows $\sqrt{n}(R\hat{\beta}_n - R\beta) \to N(o, \sigma^2 R Q^{-1} R^\top)$, and therefore

(26.2.1) $n(R\hat{\beta}_n - R\beta)^\top\bigl(R Q^{-1} R^\top\bigr)^{-1}(R\hat{\beta}_n - R\beta) \to \sigma^2\chi^2_i.$
Divide by $s^2$ and replace, in the limiting case, $Q$ by $X^\top X/n$ and $s^2$ by $\sigma^2$ to get

(26.2.2) $\frac{(R\hat{\beta}_n - R\beta)^\top\bigl(R(X^\top X)^{-1}R^\top\bigr)^{-1}(R\hat{\beta}_n - R\beta)}{s^2} \to \chi^2_i$

in distribution. All this is not a proof; the point is that in the denominator, the $\chi^2$ is divided by the increasingly bigger number $n - k$, while in the numerator it is divided by the constant $i$; therefore asymptotically the denominator can be considered to be 1. The central limit theorems only say that for $n \to \infty$ these statistics converge towards the $\chi^2_i$, which is asymptotically the same as $i$ times the $F_{i,n-k}$ distribution. It is easily possible that before one gets to the limit, the $F$-distribution is the better approximation.
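The following R sketch (invented design, a single restriction $i = 1$, non-normal disturbances) simulates the statistic (26.2.2) under the null hypothesis and reports the rejection frequencies obtained with the $\chi^2_1$ and the $F(1, n-k)$ critical values at the nominal 5% level:

# Sketch: the statistic (26.2.2) for one restriction (i = 1), simulated under
# the null with non-normal disturbances; rejection rates at the 5% level
# using the chi-square(1) and the F(1, n-k) critical values.
set.seed(5)
n <- 30; k <- 2
X <- cbind(1, sqrt((1:n) / n))            # invented design, includes a constant
stat <- replicate(10000, {
  y   <- drop(X %*% c(1, 0)) + (rexp(n) - 1)   # true beta2 = 0
  fit <- lm(y ~ X - 1)
  coef(fit)[2]^2 / vcov(fit)[2, 2]        # equals (26.2.2) with R = (0 1)
})
c(chisq = mean(stat > qchisq(0.95, df = 1)),
  F     = mean(stat > qf(0.95, 1, n - k)))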
Problem 327. Are the residuals $y - X\hat{\beta}$ asymptotically normally distributed?

Answer. Only if the disturbances are normal, otherwise of course not! We can show that $\sqrt{n}(\varepsilon - \hat{\varepsilon}) = \sqrt{n}\,X(\hat{\beta} - \beta) \sim N(o, \sigma^2 X Q^{-1} X^\top)$ asymptotically.
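A quick R sketch (invented numbers) of the answer's point: the residuals converge to the disturbances themselves, so with skewed disturbances they stay visibly skewed no matter how large $n$ is.

# Sketch: residuals inherit the shape of the disturbances.  With centered
# exponential disturbances the residuals stay skewed even for large n.
set.seed(6)
n <- 100000
x <- rnorm(n)
y <- 1 + 2 * x + (rexp(n) - 1)
hist(resid(lm(y ~ x)), breaks = 100, main = "Residuals, skewed disturbances")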
Now these results also go through if one has stochastic regressors. [Gre97, 6.7.7] shows that the above condition (26.0.12), with the lim replaced by plim, holds if the $x_i$ and $\varepsilon_i$ form an i.i.d. sequence of random variables.
Problem 328. 2 points In the regression model with random regressors $y = X\beta + \varepsilon$, you only know that $\operatorname{plim}\frac{1}{n}X^\top X = Q$ is a nonsingular matrix, and $\operatorname{plim}\frac{1}{n}X^\top\varepsilon = o$. Using these two conditions, show that the OLS estimate is consistent.

Answer. $\hat{\beta} = (X^\top X)^{-1}X^\top y = \beta + (X^\top X)^{-1}X^\top\varepsilon$ due to (18.0.7), and
$\operatorname{plim}(X^\top X)^{-1}X^\top\varepsilon = \operatorname{plim}\Bigl(\frac{X^\top X}{n}\Bigr)^{-1}\frac{X^\top\varepsilon}{n} = Q^{-1} o = o.$
CHAPTER 27
Least Squares as the Normal Maximum Likelihood Estimate
Now assume $\varepsilon$ is multivariate normal. We will show that in this case the OLS estimator $\hat{\beta}$ is at the same time the Maximum Likelihood Estimator. For this we need to write down the density function of $y$. First look at one $y_t$, which is $y_t \sim N(x_t^\top\beta, \sigma^2)$, where $X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix}$, i.e., $x_t^\top$ is the $t$th row of $X$. ($x_t$ itself is written as a column vector, since we follow the "column vector convention.") The (marginal) density function for this one observation is

(27.0.3) $f_{y_t}(y_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_t - x_t^\top\beta)^2/2\sigma^2}.$

Since the $y_t$ are stochastically independent, their joint density function is the product, which can be written as

(27.0.4) $f_y(y) = (2\pi\sigma^2)^{-n/2}\exp\Bigl(-\frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta)\Bigr).$
To compute the maximum likelihood estimator, it is advantageous to start with the log likelihood function:

(27.0.5) $\log f_y(y; \beta, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta).$

Assume for a moment that $\sigma^2$ is known. Then the MLE of $\beta$ is clearly equal to the OLS $\hat{\beta}$. Since $\hat{\beta}$ does not depend on $\sigma^2$, it is also the maximum likelihood estimate when $\sigma^2$ is unknown. $\hat{\beta}$ is a linear function of $y$. Linear transformations of normal variables are normal. Normal distributions are characterized by their mean vector and covariance matrix. The distribution of the MLE of $\beta$ is therefore $\hat{\beta} \sim N(\beta, \sigma^2(X^\top X)^{-1})$.
If we replace $\beta$ in the log likelihood function (27.0.5) by $\hat{\beta}$, we get what is called the log likelihood function with $\beta$ "concentrated out":

(27.0.6) $\log f_y(y; \beta = \hat{\beta}, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\hat{\beta})^\top(y - X\hat{\beta}).$

One gets the maximum likelihood estimate of $\sigma^2$ by maximizing this "concentrated" log likelihood function. Taking the derivative with respect to $\sigma^2$ (consider $\sigma^2$ the name of a variable, not the square of another variable), one gets

(27.0.7) $\frac{\partial}{\partial\sigma^2}\log f_y(y; \hat{\beta}) = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}(y - X\hat{\beta})^\top(y - X\hat{\beta}).$

Setting this zero gives

(27.0.8) $\tilde{\sigma}^2 = \frac{(y - X\hat{\beta})^\top(y - X\hat{\beta})}{n} = \frac{\hat{\varepsilon}^\top\hat{\varepsilon}}{n}.$
This is a scalar multiple of the unbiased estimate $s^2 = \hat{\varepsilon}^\top\hat{\varepsilon}/(n - k)$ which we had earlier.
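As a numerical check (an R sketch with simulated data; the likelihood is coded directly from (27.0.5)), one can maximize the log likelihood with a generic optimizer and compare the result with the OLS fit, and compare $\tilde{\sigma}^2 = \hat{\varepsilon}^\top\hat{\varepsilon}/n$ with $s^2$:

# Sketch: maximize the log likelihood (27.0.5) numerically and compare with
# OLS; also compare sigma-tilde^2 = e'e/n with s^2 = e'e/(n-k).  Simulated data.
set.seed(7)
n <- 50; X <- cbind(1, rnorm(n)); beta <- c(1, 2)
y <- drop(X %*% beta) + rnorm(n, sd = 2)

negloglik <- function(th) {                # th = (beta1, beta2, log(sigma^2))
  s2 <- exp(th[3]); r <- y - drop(X %*% th[1:2])
  0.5 * (n * log(2 * pi) + n * log(s2) + sum(r^2) / s2)
}
mle <- optim(c(0, 0, 0), negloglik, control = list(maxit = 2000))$par
fit <- lm(y ~ X - 1)

cbind(ML = mle[1:2], OLS = coef(fit))      # the two beta estimates agree (up to optimization error)
c(sigma.tilde2 = exp(mle[3]),              # ML estimate of sigma^2 ...
  e.e.over.n   = sum(resid(fit)^2) / n,    # ... equals e'e/n
  s2           = sum(resid(fit)^2) / (n - 2))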
Let's look at the distribution of $s^2$ (from which that of its scalar multiples follows easily). It is a quadratic form in a normal variable. Such quadratic forms very often have $\chi^2$ distributions.

Now recall equation (7.4.9) characterizing all the quadratic forms of multivariate normal variables that are $\chi^2$'s. Here it is again: Assume $y$ is a multivariate normal vector random variable with mean vector $\mu$ and covariance matrix $\sigma^2\Psi$, and $\Omega$ is a symmetric nonnegative definite matrix. Then $(y - \mu)^\top\Omega(y - \mu) \sim \sigma^2\chi^2_k$ iff

(27.0.9) $\Psi\Omega\Psi\Omega\Psi = \Psi\Omega\Psi,$

and $k$ is the rank of $\Psi\Omega$.

This condition is satisfied in particular if $\Psi = I$ (the identity matrix) and $\Omega^2 = \Omega$, and this is exactly our situation:

(27.0.10) $s^2 = \frac{(y - X\hat{\beta})^\top(y - X\hat{\beta})}{n-k} = \frac{\varepsilon^\top\bigl(I - X(X^\top X)^{-1}X^\top\bigr)\varepsilon}{n-k} = \frac{\varepsilon^\top M\varepsilon}{n-k},$

where $M^2 = M$ and $\operatorname{rank} M = n - k$. (This last identity holds because for idempotent matrices rank = trace, and we computed the trace above.) Therefore $s^2 \sim \sigma^2\chi^2_{n-k}/(n - k)$, from which one obtains again unbiasedness, but also that $\operatorname{var}[s^2] = 2\sigma^4/(n - k)$, a result that one cannot get from the mean and variance of the disturbances alone.
Problem 329. 4 points Show that, if $y$ is normally distributed, $s^2$ and $\hat{\beta}$ are independent.

Answer. We showed in question 246 that $\hat{\beta}$ and $\hat{\varepsilon}$ are uncorrelated, therefore in the normal case independent; therefore $\hat{\beta}$ is also independent of any function of $\hat{\varepsilon}$, such as $s^2$.
Problem 330. Computer assignment: You run a regression with 3 explanatory variables, no constant term, the sample size is 20, the errors are normally distributed and you know that $\sigma^2 = 2$. Plot the density function of $s^2$. Hint: The command dchisq(x,df=25) returns the density of a Chi-square distribution with 25 degrees of freedom evaluated at x. But the number 25 was only taken as an example, this is not the number of degrees of freedom you need here.

• a. In the same plot, plot the density function of the Theil-Schweitzer estimate. Can one see from the comparison of these density functions why the Theil-Schweitzer estimator has a better MSE?
Answer. Start with the Theil-Schweitzer plot, because it is higher.
> x <- seq(from = 0, to = 6, by = 0.01)
> Density <- (19/2)*dchisq((19/2)*x, df=17)   # Theil-Schweitzer estimate: (2/19) times a chi-square(17)
> plot(x, Density, type="l", lty=2)
> lines(x, (17/2)*dchisq((17/2)*x, df=17))    # unbiased estimate s^2: (2/17) times a chi-square(17)
> title(main = "Unbiased versus Theil-Schweitzer Variance Estimate, 17 d.f.")
Now let us derive the maximum likelihood estimator in the case of a nonspherical but positive definite covariance matrix. I.e., the model is $y = X\beta + \varepsilon$, $\varepsilon \sim N(o, \sigma^2\Psi)$. The density function is

(27.0.11) $f_y(y) = (2\pi\sigma^2)^{-n/2}\,|\det\Psi|^{-1/2}\exp\Bigl(-\frac{1}{2\sigma^2}(y - X\beta)^\top\Psi^{-1}(y - X\beta)\Bigr).$
Problem 331. Derive (27.0.11) as follows: Take a matrix $P$ with the property that $P\varepsilon$ has covariance matrix $\sigma^2 I$. Write down the joint density function of $P\varepsilon$. Since $y$ is a linear transformation of $\varepsilon$, one can apply the rule for the density function of a transformed random variable.
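A numerical sketch of this derivation strategy (the matrices, coefficients, and evaluation point below are arbitrary test values): choose $P$ from the Cholesky factor of $\Psi$ so that $P\Psi P^\top = I$; then the density of $y$ equals $|\det P|$ times the spherical normal density evaluated at $P(y - X\beta)$, which reproduces (27.0.11) since $|\det P| = |\det\Psi|^{-1/2}$ and $\Psi^{-1} = P^\top P$.

# Sketch (Problem 331): numerical check of (27.0.11).  Psi, beta, sigma^2 and
# the point y are arbitrary test values.
set.seed(8)
n <- 4
A    <- matrix(rnorm(n * n), n)
Psi  <- crossprod(A) + diag(n)             # some positive definite Psi
X    <- cbind(1, rnorm(n)); beta <- c(1, -1); sigma2 <- 2
y    <- rnorm(n)                           # point at which to evaluate the density
r    <- y - drop(X %*% beta)

# formula (27.0.11) directly
f1 <- (2 * pi * sigma2)^(-n / 2) * det(Psi)^(-1 / 2) *
      exp(-crossprod(r, solve(Psi, r)) / (2 * sigma2))

# via the transformation: P Psi P' = I, so P times the disturbance is spherical
P  <- t(solve(chol(Psi)))                  # chol(Psi) = U with Psi = U'U
f2 <- abs(det(P)) * (2 * pi * sigma2)^(-n / 2) *
      exp(-crossprod(drop(P %*% r)) / (2 * sigma2))
c(f1, f2)                                  # the two values coincide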