15. DIGRESSION ABOUT CORRELATION COEFFICIENTS
Answer. The minimum MSE with only a constant is var[y], and (14.2.32) says that MSE[constant term and x; y] = var[y] − (cov[x, y])²/var[x]. Therefore the difference in MSEs is (cov[x, y])²/var[x], and if one divides by var[y] to get the relative difference, one gets exactly the squared correlation coefficient.
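This identity can be checked mechanically. The sketch below (plain Python; the simulated data and all variable names are illustrative, not part of the text) computes the relative MSE reduction and the squared correlation and confirms they agree:

```python
import random
import statistics as st

random.seed(0)
x = [random.gauss(0, 1) for _ in range(1000)]
y = [2 * xi + random.gauss(0, 1) for xi in x]

var_x, var_y = st.pvariance(x), st.pvariance(y)
mx, my = st.fmean(x), st.fmean(y)
cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

# Best constant predictor of y has MSE var[y]; adding x as a regressor
# lowers it to var[y] - cov[x,y]^2 / var[x], as in (14.2.32).
mse_const = var_y
mse_with_x = var_y - cov_xy**2 / var_x

rel_reduction = (mse_const - mse_with_x) / mse_const
rho2 = cov_xy**2 / (var_x * var_y)   # squared correlation coefficient
```

The two quantities coincide up to floating-point rounding, for any sample.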
Multiple Correlation Coefficients. Now assume x is a vector while y remains a scalar. Their joint mean vector and dispersion matrix are

(15.1.3)    [x; y] ∼ ( [μ; ν], σ² [Ωxx, ωxy; ω⊤xy, ωyy] )
By theorem ??, the best linear predictor of y based on x has the formula

(15.1.4)    y∗ = ν + ω⊤xy Ω⁻xx (x − μ)
y∗ has the following additional extremal value property: no linear combination b⊤x has a higher squared correlation with y than y∗. This maximal value of the squared correlation is called the squared multiple correlation coefficient
(15.1.5)    ρ²y(x) = (ω⊤xy Ω⁻xx ωxy) / ωyy
The multiple correlation coefficient itself is the positive square root, i.e., it is always
nonnegative, while some other correlation coefficients may take on negative values.
The squared multiple correlation coefficient can also be defined in terms of the proportionate reduction in MSE. It is equal to the proportionate reduction in the MSE of the best predictor of y if one goes from predictors of the form y∗ = a to predictors of the form y∗ = a + b⊤x, i.e.,
(15.1.6)    ρ²y(x) = (MSE[constant term; y] − MSE[constant term and x; y]) / MSE[constant term; y]
There are therefore two natural definitions of the multiple correlation coefficient.
These two definitions correspond to the two formulas for R² in (14.3.6).
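The equivalence of the two definitions can be verified numerically. The sketch below (Python with NumPy; simulated data and names are illustrative) computes ρ²y(x) once from the dispersion matrix as in (15.1.5) and once as the proportionate MSE reduction, i.e. the R² of an OLS fit with a constant term:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))                      # the predictor vector x
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

# Route 1: omega_xy' Omega_xx^{-1} omega_xy / omega_yy from sample moments
Omega_xx = np.cov(X, rowvar=False)
omega_xy = np.array([np.cov(X[:, j], y)[0, 1] for j in range(X.shape[1])])
rho2_moments = omega_xy @ np.linalg.solve(Omega_xx, omega_xy) / np.var(y, ddof=1)

# Route 2: proportionate reduction in MSE when going from a constant
# predictor to a linear predictor with a constant term (the R^2)
A = np.column_stack([np.ones(n), X])
resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
rho2_mse = 1 - np.var(resid) / np.var(y)
```

Both routes give the same number for any sample, not merely in expectation.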
Partial Correlation Coefficients. Now assume y = (y1, y2)⊤ is a vector with two elements and write

(15.1.7)    [x; y1; y2] ∼ ( [μ; ν1; ν2], σ² [Ωxx, ωy1, ωy2; ω⊤y1, ω11, ω12; ω⊤y2, ω21, ω22] )
Let y∗ be the best linear predictor of y based on x. The partial correlation coefficient ρ12.x is defined to be the simple correlation between the residuals, corr[(y1 − y1∗), (y2 − y2∗)]. This measures the correlation between y1 and y2 which is “local,” i.e., which does not follow from their association with x. Assume for instance that both y1 and y2 are highly correlated with x. Then they will also have a high correlation with each other. Subtracting yi∗ from yi eliminates this dependency on x; therefore any remaining correlation is “local.”
remaining correlation is “local.” Compare [Krz88, p. 475].
The squared partial correlation coefficient can be defined as the relative reduction in the MSE if one adds y1 to x as a predictor of y2:

(15.1.8)    ρ²12.x = (MSE[constant term and x; y2] − MSE[constant term, x, and y1; y2]) / MSE[constant term and x; y2].
Problem 218. Using the definitions in terms of MSE’s, show that the following
relationship holds between the squares of multiple and partial correlation coefficients:
(15.1.9)    1 − ρ²2(x,1) = (1 − ρ²21.x)(1 − ρ²2(x))
15.1. A UNIFIED DEFINITION OF CORRELATION COEFFICIENTS
Answer. In terms of the MSE, (15.1.9) reads

(15.1.10)    MSE[constant term, x, and y1; y2] / MSE[constant term; y2] = (MSE[constant term, x, and y1; y2] / MSE[constant term and x; y2]) · (MSE[constant term and x; y2] / MSE[constant term; y2]).
From (15.1.9) follows the following weighted average formula:

(15.1.11)    ρ²2(x,1) = ρ²2(x) + (1 − ρ²2(x)) ρ²21.x
An alternative proof of (15.1.11) is given in [Gra76, pp. 116/17].
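Using the MSE definitions directly, both (15.1.9) and (15.1.11) can be confirmed on simulated data. The sketch below (Python with NumPy; all names and data illustrative) computes each MSE as the mean squared residual of the corresponding least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.normal(size=(n, 2))
y1 = x @ np.array([0.5, 0.2]) + rng.normal(size=n)
y2 = x @ np.array([0.3, -0.4]) + 0.6 * y1 + rng.normal(size=n)

def mse(predictors, target):
    """Mean squared residual of an OLS fit with a constant term."""
    A = np.column_stack([np.ones(len(target))] + predictors)
    resid = target - A @ np.linalg.lstsq(A, target, rcond=None)[0]
    return np.mean(resid**2)

m0  = mse([], y2)        # constant term only
mx  = mse([x], y2)       # constant term and x
mx1 = mse([x, y1], y2)   # constant term, x, and y1

rho2_2_x1 = (m0 - mx1) / m0   # squared multiple correlation of y2 with x and y1
rho2_2_x  = (m0 - mx) / m0    # squared multiple correlation of y2 with x
rho2_21_x = (mx - mx1) / mx   # squared partial correlation of y2 and y1 given x
```

Since the three ratios share the same nested MSEs, (15.1.9) holds exactly in every sample, not just in the population.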
Mixed cases: One can also form multiple correlation coefficients with some of the variables partialled out. The dot notation used here is due to Yule, [Yul07]. The notation, definition, and formula for the squared correlation coefficient are
(15.1.12)    ρ²y(x).z = (MSE[constant term and z; y] − MSE[constant term, z, and x; y]) / MSE[constant term and z; y]

(15.1.13)    = (ω⊤xy.z Ω⁻xx.z ωxy.z) / ωyy.z
CHAPTER 16
Specific Datasets
16.1. Cobb Douglas Aggregate Production Function
Problem 219. 2 points The Cobb-Douglas production function postulates the following relationship between annual output qt and the inputs of labor ℓt and capital kt:

(16.1.1)    qt = μ ℓt^β kt^γ exp(εt).
qt, ℓt, and kt are observed, and μ, β, γ, and the εt are to be estimated. By the variable transformation xt = log qt, yt = log ℓt, zt = log kt, and α = log μ, one obtains the linear regression

(16.1.2)    xt = α + βyt + γzt + εt

Sometimes the following alternative variable transformation is made: ut = log(qt/ℓt), vt = log(kt/ℓt), and the regression

(16.1.3)    ut = α + γvt + εt
is estimated. How are the regressions (16.1.2) and (16.1.3) related to each other?
Answer. Write (16.1.3) as

(16.1.4)    xt − yt = α + γ(zt − yt) + εt

and collect terms to get

(16.1.5)    xt = α + (1 − γ)yt + γzt + εt

From this it follows that running the regression (16.1.3) is equivalent to running the regression (16.1.2) with the constraint β + γ = 1 imposed.
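This equivalence can also be seen numerically. In the sketch below (plain Python; the simulated data, with γ = 0.6 and β = 0.4, are purely illustrative), the slope of regression (16.1.3) implies β̂ = 1 − γ̂, and the same β̂ comes out of the symmetric substitution that eliminates γ instead:

```python
import random

random.seed(3)
n = 200
y = [random.gauss(0, 1) for _ in range(n)]          # log labor
z = [random.gauss(0, 1) for _ in range(n)]          # log capital
x = [0.1 + 0.4 * yi + 0.6 * zi + 0.05 * random.gauss(0, 1)
     for yi, zi in zip(y, z)]                       # log output

def slope(dep, reg):
    """OLS slope of dep on reg, intercept included."""
    md, mr = sum(dep) / len(dep), sum(reg) / len(reg)
    num = sum((d - md) * (r - mr) for d, r in zip(dep, reg))
    den = sum((r - mr) ** 2 for r in reg)
    return num / den

u = [xi - yi for xi, yi in zip(x, y)]   # log(q/l)
v = [zi - yi for zi, yi in zip(z, y)]   # log(k/l)
gamma_hat = slope(u, v)                 # regression (16.1.3)
beta_hat = 1 - gamma_hat                # implied by the constraint

# Eliminating gamma instead (regress x - z on y - z) recovers the same beta
beta_check = slope([xi - zi for xi, zi in zip(x, z)],
                   [yi - zi for yi, zi in zip(y, z)])
```

The agreement of beta_hat and beta_check is an algebraic identity of least squares, so it holds to rounding error in every sample.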
The assumption here is that output is the only random variable. The regression
model is based on the assumption that the dependent variables have more noise in
them than the independent variables. One can justify this by the argument that
any noise in the independent variables will be transferred to the dependent variable,
and also that variables which affect other variables have more steadiness in them
than variables which depend on others. This justification often has merit, but in this specific case there is much more measurement error in the labor and capital inputs than in the output. Therefore the assumption that only the output has an error
term is clearly wrong, and problem 221 below will look for possible alternatives.
Problem 220. Table 1 shows the data used by Cobb and Douglas in their original article [CD28] introducing the production function which would bear their name. The variable output is “Day’s index of the physical volume of production (1899 = 100)” described in [DP20]; capital is the capital stock in manufacturing in millions of 1880 dollars [CD28, p. 145]; labor is the “probable average number of wage earners employed in manufacturing” [CD28, p. 148]; and wage is an index of the real wage (1899–1908 = 100).
year   output   capital   labor   wage
1899      100      4449    4713     99
1900      101      4746    4968     98
1901      112      5061    5184    101
1902      122      5444    5554    102
1903      124      5806    5784    100
1904      122      6132    5468     99
1905      143      6626    5906    103
1906      152      7234    6251    101
1907      151      7832    6483     99
1908      126      8229    5714     94
1909      155      8820    6615    102
1910      159      9240    6807    104
1911      153      9624    6855     97
1912      177     10067    7167     99
1913      184     10520    7277    100
1914      169     10873    7026     99
1915      189     11840    7269     99
1916      225     13242    8601    104
1917      227     14915    9218    103
1918      223     16265    9446    107
1919      218     17234    9096    111
1920      231     18118    9110    114
1921      179     18542    6947    115
1922      240     19192    7602    119

Table 1. Cobb Douglas Original Data
• a. A text file with the data is available on the web at www.econ.utah.edu/ehrbar/data/cobbdoug.txt, and an SDML file (XML for statistical data which can be read by R, Matlab, and perhaps also SPSS) is available at www.econ.utah.edu/ehrbar/data/cobbdoug.sdml. Load these data into your favorite statistics package.
Answer. In R, you can simply issue the command cobbdoug <- read.table("http://www.econ.utah.edu/ehrbar/data/cobbdoug.txt", header=TRUE). If you run R on unix, you can also do the following: download cobbdoug.sdml from the web, then issue the command library(StatDataML) followed by readSDML("cobbdoug.sdml"). When I tried this last, the XML package necessary for StatDataML was not available on Windows, but chances are it will be when you read this.
In SAS, you must issue the commands

data cobbdoug;
  infile ’cobbdoug.txt’;
  input year output capital labor;
run;

But for this to work you must delete the first line in the file cobbdoug.txt which contains the variable names. (It is also possible to tell SAS to skip the first line by adding the firstobs=2 option to the infile statement.) And you may have to tell SAS the full pathname of the text file with the data. If you want a permanent instead of a temporary dataset, give it a two-part name, such as ecmet.cobbdoug.
Here are the instructions for SPSS: 1) Begin SPSS with a blank spreadsheet. 2) Open up a file with the following commands and run:

SET BLANKS=SYSMIS UNDEFINED=WARN.
DATA LIST FILE=’A:\Cbbunst.dat’ FIXED RECORDS=1 TABLE
  /1 year 1-4 output 5-9 capital 10-16 labor 17-22 wage 23-27 .
EXECUTE.

This file assumes the data file to be in the same directory, and again the first line in the data file with the variable names must be deleted. Once the data are entered into SPSS the procedures (regression, etc.) are best run from the point-and-click environment.
• b. The next step is to look at the data. On [CD28, p. 150], Cobb and Douglas
plot capital, labor, and output on a logarithmic scale against time, all 3 series
normalized such that they start in 1899 at the same level =100. Reproduce this graph
using a modern statistics package.
• c. Run both regressions (16.1.2) and (16.1.3) on Cobb and Douglas’s original dataset. Compute 95% confidence intervals for the coefficients of capital and labor in the unconstrained and the constrained models.
Answer. SAS does not allow you to transform the data on the fly; it insists that you first go through a data step creating the transformed data before you can run a regression on them. Therefore the next set of commands creates a temporary dataset cdtmp. The data step data cdtmp copies all the data from cobbdoug into cdtmp and then creates some transformed variables as well. Then one can run the regressions. Here are the commands; they are in the file cbbrgrss.sas on your data disk:
data cdtmp;
set cobbdoug;
logcap = log(capital);
loglab = log(labor);
logout = log(output);
logcl = logcap-loglab;
logol = logout-loglab;
run;
proc reg data = cdtmp;
model logout = logcap loglab;
run;
proc reg data = cdtmp;
model logol = logcl;
run;
Careful! In R, the command lm(log(output)-log(labor) ~ log(capital)-log(labor), data=cobbdoug) does not give the right results. It does not complain, but the result is wrong nevertheless: on the right-hand side of a model formula, - means dropping a term from the model, not arithmetic subtraction. The right way to write this command is lm(I(log(output)-log(labor)) ~ I(log(capital)-log(labor)), data=cobbdoug); wrapping the expressions in I() restores their arithmetic meaning.
• d. The regression results are graphically represented in Figure 1. The big
ellipse is a joint 95% confidence region for β and γ. This ellipse is a level line of the
SSE. The vertical and horizontal bands represent univariate 95% confidence regions
for β and γ separately. The diagonal line is the set of all β and γ with β + γ = 1,
representing the constraint of constant returns to scale. The small ellipse is that level
line of the SSE which is tangent to the constraint. The point of tangency represents
the constrained estimator. Reproduce this graph (or as much of this graph as you
can) using your statistics package.
Remark: In order to make the hand computations easier, Cobb and Douglas reduced the data for capital and labor to index numbers (1899=100), which were rounded to integers, before running the regressions, and Figure 1 was constructed using these rounded data. Since you are using the nonstandardized data, you may get slightly different results.
Answer. lines(ellipse.lm(cbbfit, which=c(2, 3)))
Problem 221. In this problem we will treat the Cobb-Douglas data as a dataset
with errors in all three variables. See chapter ?? and problem ?? about that.
• a. Run the three elementary regressions for the whole period, then choose at
least two subperiods and run it for those. Plot all regression coefficients as points
in a plane, using different colors for the different subperiods (you have to normalize
them in a special way that they all fit on the same plot).
Answer. Here are the results in R:
> outputlm<-lm(log(output)~log(capital)+log(labor),data=cobbdoug)
> capitallm<-lm(log(capital)~log(labor)+log(output),data=cobbdoug)
> laborlm<-lm(log(labor)~log(output)+log(capital),data=cobbdoug)
Figure 1. Coefficients of capital (vertical) and labor (horizontal), dependent variable output, unconstrained and constrained,
1899–1922
> coefficients(outputlm)
 (Intercept) log(capital)   log(labor)
  -0.1773097    0.2330535    0.8072782
> coefficients(capitallm)
 (Intercept)   log(labor)  log(output)
 -2.72052726  -0.08695944   1.67579357
> coefficients(laborlm)
 (Intercept)  log(output) log(capital)
  1.27424214   0.73812541  -0.01105754
#Here is the information for the confidence ellipse:
> summary(outputlm,correlation=T)
Call:
lm(formula = log(output) ~ log(capital) + log(labor), data = cobbdoug)
Residuals:
      Min        1Q    Median        3Q       Max
-0.075282 -0.035234 -0.006439  0.038782  0.142114
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.17731    0.43429  -0.408  0.68721
log(capital)  0.23305    0.06353   3.668  0.00143 **
log(labor)    0.80728    0.14508   5.565  1.6e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.05814 on 21 degrees of freedom
Multiple R-Squared: 0.9574,    Adjusted R-squared: 0.9534
F-statistic: 236.1 on 2 and 21 degrees of freedom,  p-value: 3.997e-15

Correlation of Coefficients:
             (Intercept) log(capital)
log(capital)  0.7243
log(labor)   -0.9451     -0.9096
#Quantile of the F-distribution:
> qf(p=0.95, df1=2, df2=21)
[1] 3.4668
• b. The elementary regressions will give you three fitted equations of the form

(16.1.6)    output = α̂1 + β̂12 labor + β̂13 capital + residual1
(16.1.7)    labor = α̂2 + β̂21 output + β̂23 capital + residual2
(16.1.8)    capital = α̂3 + β̂31 output + β̂32 labor + residual3.
In order to compare the slope parameters in these regressions, first rearrange them
in the form
(16.1.9)     −output + β̂12 labor + β̂13 capital + α̂1 + residual1 = 0
(16.1.10)    β̂21 output − labor + β̂23 capital + α̂2 + residual2 = 0
(16.1.11)    β̂31 output + β̂32 labor − capital + α̂3 + residual3 = 0
This gives the following table of coefficients:

output        labor         capital       intercept
−1            0.8072782     0.2330535     −0.1773097
0.73812541    −1            −0.01105754   1.27424214
1.67579357    −0.08695944   −1            −2.72052726
Now divide the second and third rows by the negative of their first coefficient, so that the coefficient of output becomes −1:

output   labor                   capital                  intercept
−1       0.8072782               0.2330535                −0.1773097
−1       1/0.73812541            0.01105754/0.73812541    −1.27424214/0.73812541
−1       0.08695944/1.67579357   1/1.67579357             2.72052726/1.67579357
After performing the divisions the following numbers are obtained:

output   labor        capital       intercept
−1       0.8072782    0.2330535     −0.1773097
−1       1.3547833    0.014980570   −1.726322
−1       0.05189149   0.59673221    1.6234262
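The divisions behind the last two rows are plain arithmetic and can be checked mechanically, for instance with the following Python sketch (coefficient values copied from the regressions above):

```python
# Coefficients of the labor and capital elementary regressions quoted above
b21, b23, a2 = 0.73812541, -0.01105754, 1.27424214   # labor on output, capital
b31, b32, a3 = 1.67579357, -0.08695944, -2.72052726  # capital on output, labor

# Rearranged rows: (coefficient of output, labor, capital, intercept)
row2 = [b21, -1.0, b23, a2]
row3 = [b31, b32, -1.0, a3]

# Divide each row by the negative of its first coefficient,
# so that the coefficient of output becomes -1
norm2 = [c / -b21 for c in row2]
norm3 = [c / -b31 for c in row3]
```

norm2 and norm3 reproduce the second and third rows of the table above.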
These results can also be re-written in the form given by Table 2.

                                            Intercept   Slope of output   Slope of output
                                                        wrt labor         wrt capital
Regression of output on labor and capital
Regression of labor on output and capital
Regression of capital on output and labor

Table 2. Comparison of coefficients in elementary regressions
Fill in the values for the whole period and also for several sample subperiods.
Make a scatter plot of the contents of this table, i.e., represent each regression result
as a point in a plane, using different colors for different sample periods.
[Figure 2: scatter of coefficient points labeled capital, output, and labor, together with the point “output no error, crs” and Cobb Douglas’s original result.]
Figure 2. Coefficients of capital (vertical) and labor (horizontal), dependent variable output, 1899–1922
[Figure 3: coefficient points and fitted lines labeled “capital, all errors,” “output no error, crs,” “output all errors,” and “labor, all errors.”]
Figure 3. Coefficient of capital (vertical) and labor (horizontal)
in the elementary regressions, dependent variable output, 1899–1922
Problem 222. Given a univariate problem with three variables all of which have
zero mean, and a linear constraint that the coefficients of all variables sum to 0. (This
is the model apparently appropriate to the Cobb-Douglas data, with the assumption
of constant returns to scale, after taking out the means.) Call the observed variables
x, y, and z, with underlying systematic variables x∗ , y ∗ , and z ∗ , and errors u, v,
and w.
• a. Write this model in the form (??).
Answer.

(16.1.12)    [−1, β, 1−β] [x∗; y∗; z∗] = 0,  or  x∗ = βy∗ + (1−β)z∗;  and  [x; y; z] = [x∗; y∗; z∗] + [u; v; w],  i.e.  x = x∗ + u,  y = y∗ + v,  z = z∗ + w.
• b. The moment matrix of the systematic variables can be written fully in terms of σ²y∗, σ²z∗, σy∗z∗, and the unknown parameter β. Write out the moment matrix and therefore the Frisch decomposition.
Answer. The moment matrix is the middle matrix in the following Frisch decomposition:

(16.1.13)    [σ²x, σxy, σxz; σxy, σ²y, σyz; σxz, σyz, σ²z] =

(16.1.14)    = [β²σ²y∗ + 2β(1−β)σy∗z∗ + (1−β)²σ²z∗,  βσ²y∗ + (1−β)σy∗z∗,  βσy∗z∗ + (1−β)σ²z∗;
                βσ²y∗ + (1−β)σy∗z∗,  σ²y∗,  σy∗z∗;
                βσy∗z∗ + (1−β)σ²z∗,  σy∗z∗,  σ²z∗]  +  [σ²u, 0, 0; 0, σ²v, 0; 0, 0, σ²w].
• c. Show that the unknown parameters are not yet identified. However, if one makes the additional assumption that one of the three error variances σ²u, σ²v, or σ²w is zero, then the equations are identified. Since the quantity of output presumably has less error than the other two variables, assume σ²u = 0. Under this assumption, show that

(16.1.15)    β = (σ²x − σxz) / (σxy − σxz)

and this can be estimated by replacing the variances and covariances by their sample counterparts. In a similar way, derive estimates of all other parameters of the model.
Answer. Solving (16.1.14), one gets from the yz element of the covariance matrix

(16.1.16)    σy∗z∗ = σyz

and from the xz element

(16.1.17)    σ²z∗ = (σxz − βσyz) / (1 − β)

Similarly, one gets from the xy element:

(16.1.18)    σ²y∗ = (σxy − (1 − β)σyz) / β

Now plug (16.1.16), (16.1.17), and (16.1.18) into the equation for the xx element:

(16.1.19)    σ²x = β(σxy − (1 − β)σyz) + 2β(1 − β)σyz + (1 − β)(σxz − βσyz) + σ²u

(16.1.20)    = βσxy + (1 − β)σxz + σ²u

Since we are assuming σ²u = 0, this last equation can be solved for β:

(16.1.21)    β = (σ²x − σxz) / (σxy − σxz)

If we replace the variances and covariances by the sample variances and covariances, this gives an estimate of β.
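Formula (16.1.21) can be checked at the population level: pick hypothetical parameter values, build the moments implied by the model under σ²u = 0, and verify that the formula returns the β one started with. (All numbers below are made up for the check.)

```python
# Hypothetical population parameters; sigma_u^2 = 0 is the identifying assumption
beta = 0.7
s_y2, s_z2, s_yz = 1.0, 1.5, 0.4   # var[y*], var[z*], cov[y*, z*]

# Moments implied by x = beta*y* + (1-beta)*z*, y = y* + v, z = z* + w;
# the independent errors v and w drop out of these covariances with x
var_x  = beta**2 * s_y2 + 2 * beta * (1 - beta) * s_yz + (1 - beta)**2 * s_z2
cov_xy = beta * s_y2 + (1 - beta) * s_yz
cov_xz = beta * s_yz + (1 - beta) * s_z2

beta_recovered = (var_x - cov_xz) / (cov_xy - cov_xz)   # formula (16.1.21)
```

beta_recovered equals the beta used to generate the moments, for any admissible choice of the hypothetical variances.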
• d. Evaluate these formulas numerically. In order to get the sample means and
the sample covariance matrix of the data, you may issue the SAS commands
proc corr cov nocorr data=cdtmp;
var logout loglab logcap;
run;
These commands are in the file cbbcovma.sas on the disk.
Answer. Mean vector and covariance matrix are

(16.1.22)    [LOGOUT; LOGLAB; LOGCAP] ∼ ( [5.07734; 4.96272; 5.35648], [0.0724870714, 0.0522115563, 0.1169330807; 0.0522115563, 0.0404318579, 0.0839798588; 0.1169330807, 0.0839798588, 0.2108441826] )
Therefore equation (16.1.15) gives

(16.1.23)    β̂ = (0.0724870714 − 0.1169330807) / (0.0522115563 − 0.1169330807) = 0.686726861149148

In Figure 3, the point (β̂, 1 − β̂) is exactly the intersection of the long dotted line with the constraint.
• e. The fact that all 3 points lie almost on the same line indicates that there may
be 2 linear relations: log labor is a certain coefficient times log output, and log capital
is a different coefficient times log output. I.e., y ∗ = δ1 + γ1 x∗ and z ∗ = δ2 + γ2 x∗ .
In other words, there is no substitution. What would be the two coefficients γ1 and
γ2 if this were the case?
Answer. Now the Frisch decomposition is

(16.1.24)    [σ²x, σxy, σxz; σxy, σ²y, σyz; σxz, σyz, σ²z] = σ²x∗ [1, γ1, γ2; γ1, γ1², γ1γ2; γ2, γ1γ2, γ2²] + [σ²u, 0, 0; 0, σ²v, 0; 0, 0, σ²w].
Solving this gives (obtain γ1 by dividing the 32-element by the 31-element, γ2 by dividing the 32-element by the 12-element, σ²x∗ by dividing the 21-element by γ1, etc.):

(16.1.25)
γ1 = σyz/σxz = 0.0839798588/0.1169330807 = 0.7181873452513939
γ2 = σyz/σxy = 0.0839798588/0.0522115563 = 1.608453467992104
σ²x∗ = σxy σxz/σyz = 0.0522115563 · 0.1169330807/0.0839798588 = 0.0726990758
σ²u = σ²x − σxy σxz/σyz = 0.0724870714 − 0.0726990758 = −0.000212
σ²v = σ²y − σxy σyz/σxz
σ²w = σ²z − σxz σzy/σxy

This model is just barely rejected by the data since it leads to a slightly negative variance for σ²u.
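The negative implied variance can be reproduced directly from the covariance matrix in (16.1.22); here is a quick check in Python (values copied from above):

```python
# Sample variances and covariances from (16.1.22);
# x = log output, y = log labor, z = log capital
s_x2 = 0.0724870714
s_xy, s_xz, s_yz = 0.0522115563, 0.1169330807, 0.0839798588

s_xstar2 = s_xy * s_xz / s_yz   # implied variance of the systematic x*
s_u2 = s_x2 - s_xstar2          # implied error variance of x: slightly negative
```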
• f. The assumption that there are two linear relations is represented as the
light-blue line in Figure 3. What is the equation of this line?
Answer. If y = γ1 x and z = γ2 x, then the equation x = β1 y + β2 z holds whenever β1γ1 + β2γ2 = 1. This is a straight line in the β1, β2-plane, going through the points (0, 1/γ2) = (0, 0.0522115563/0.0839798588) = (0, 0.6217152189353289) and (1/γ1, 0) = (0.1169330807/0.0839798588, 0) = (1.3923943475361023, 0). This line is in the figure, and it is just a tiny bit on the wrong side of the dotted line connecting the two estimates.
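The slopes and axis intercepts can be recomputed from the covariances in (16.1.22); the following Python sketch (values copied from above) reproduces γ1, γ2, and the points where the line β1γ1 + β2γ2 = 1 meets the axes:

```python
# Sample covariances from (16.1.22); x = log output, y = log labor, z = log capital
s_xy, s_xz, s_yz = 0.0522115563, 0.1169330807, 0.0839798588

gamma1 = s_yz / s_xz   # slope of log labor on log output, per (16.1.25)
gamma2 = s_yz / s_xy   # slope of log capital on log output

beta1_intercept = 1 / gamma1   # where the line meets the beta1 (labor) axis
beta2_intercept = 1 / gamma2   # where the line meets the beta2 (capital) axis
```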
16.2. Houthakker’s Data
For this example we will use Berndt’s textbook [Ber91], which discusses some
of the classic studies in the econometric literature.
One example described there is the estimation of a demand function for electricity [Hou51], which was the first multiple regression with several variables run on a computer. In this exercise you are asked to do all steps in exercises 1 and 3 in chapter 7 of Berndt, and to use the additional facilities of R to perform other steps of data analysis which Berndt did not ask for, such as, for instance, exploring the best subset of regressors using leaps and the best nonlinear transformation using avas, do some