7. CHEBYSHEV INEQUALITY, WEAK LAW OF LARGE NUMBERS, AND CENTRAL LIMIT THEOREM
to.”) One does not need to know the full distribution of y for that, only its expected
value and standard deviation. We will give here a proof only if y has a discrete
distribution, but the inequality is valid in general. Going over to the standardized
variable z = (y − µ)/σ we have to show Pr[|z|≥k] ≤ 1/k². Assuming z assumes the
values z_1, z_2, ... with probabilities p(z_1), p(z_2), ..., then

(7.1.2)   Pr[|z|≥k] = Σ_{i : |z_i|≥k} p(z_i).
Now multiply by k²:

(7.1.3)   k² Pr[|z|≥k] = Σ_{i : |z_i|≥k} k² p(z_i)

(7.1.4)   ≤ Σ_{i : |z_i|≥k} z_i² p(z_i)

(7.1.5)   ≤ Σ_{all i} z_i² p(z_i) = var[z] = 1.
The Chebyshev inequality is sharp for all k ≥ 1. Proof: the random variable
which takes the value −k with probability 1/(2k²) and the value +k with probability
1/(2k²), and 0 with probability 1 − 1/k², has expected value 0 and variance 1, and
the ≤-sign in (7.1.1) becomes an equal sign.
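As a numerical sanity check (this code is mine, not part of the text), one can verify that this three-point distribution has mean 0, variance 1, and tail probability exactly equal to the Chebyshev bound 1/k²:

```python
# Sanity check of the distribution that makes Chebyshev sharp:
# values -k, 0, +k with probabilities 1/(2k^2), 1 - 1/k^2, 1/(2k^2).
def check_sharpness(k):
    vals = [-k, 0.0, k]
    probs = [1 / (2 * k**2), 1 - 1 / k**2, 1 / (2 * k**2)]
    mean = sum(v * p for v, p in zip(vals, probs))
    var = sum((v - mean)**2 * p for v, p in zip(vals, probs))
    tail = sum(p for v, p in zip(vals, probs) if abs(v) >= k)
    return mean, var, tail

for k in (1.0, 2.0, 3.0):
    mean, var, tail = check_sharpness(k)
    assert abs(mean) < 1e-12 and abs(var - 1.0) < 1e-12
    assert abs(tail - 1 / k**2) < 1e-12   # the Chebyshev bound is attained
```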
Problem 115. [HT83, p. 316] Let y be the number of successes in n trials of a
Bernoulli experiment with success probability p. Show that

(7.1.6)   Pr[|y/n − p| < ε] ≥ 1 − 1/(4nε²).

Hint: first compute what Chebyshev will tell you about the lefthand side, and then
you will need still another inequality.
Answer. E[y/n] = p and var[y/n] = pq/n (where q = 1 − p). Chebyshev says therefore

(7.1.7)   Pr[|y/n − p| ≥ k√(pq/n)] ≤ 1/k².

Setting ε = k√(pq/n), therefore 1/k² = pq/(nε²), one can rewrite (7.1.7) as

(7.1.8)   Pr[|y/n − p| ≥ ε] ≤ pq/(nε²).

Now note that pq ≤ 1/4 whatever their values are.
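The inequality (7.1.6) can also be checked by exact binomial computation; the following sketch (with illustrative choices of n, p, and ε, which are mine) is not part of the text:

```python
from math import comb

# Exact left-hand side Pr[|y/n - p| < eps] for y ~ Binomial(n, p)
def lhs(n, p, eps):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if abs(k / n - p) < eps)

# The bound 1 - 1/(4 n eps^2) from (7.1.6) must hold in every case
for n, p, eps in [(50, 0.3, 0.1), (100, 0.5, 0.08), (200, 0.7, 0.05)]:
    assert lhs(n, p, eps) >= 1 - 1 / (4 * n * eps**2)
```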
Problem 116. 2 points For a standard normal variable, Pr[|z|≥1] is approximately 1/3; please look up the precise value in a table. What does the Chebyshev inequality say about this probability? Also, Pr[|z|≥2] is approximately 5%; again look up the precise value. What does Chebyshev say?
Answer. Pr[|z|≥1] = 0.3174, while the Chebyshev inequality says that Pr[|z|≥1] ≤ 1. Also, Pr[|z|≥2] = 0.0456, while Chebyshev says it is ≤ 0.25.
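Instead of a table, the exact normal tail probabilities can be computed from the error function; this snippet is an illustration of mine, not from the text:

```python
from math import erf, sqrt

def two_sided_tail(k):
    """Pr[|z| >= k] for a standard normal z, via the error function."""
    return 2 * (1 - 0.5 * (1 + erf(k / sqrt(2))))

assert abs(two_sided_tail(1) - 0.3173) < 5e-4   # table value ~0.3174
assert abs(two_sided_tail(2) - 0.0455) < 5e-4   # table value ~0.0456
# The Chebyshev bounds 1 and 0.25 hold but are far looser
assert two_sided_tail(1) <= 1 and two_sided_tail(2) <= 0.25
```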
7.2. The Probability Limit and the Law of Large Numbers
Let y_1, y_2, y_3, ... be a sequence of independent random variables all of which
have the same expected value µ and variance σ². Then ȳ_n = (1/n) Σ_{i=1}^n y_i has
expected value µ and variance σ²/n. I.e., its probability mass is clustered much
more closely around the value µ than that of the individual y_i. To make this
statement more precise we
need a concept of convergence of random variables. It is not possible to define it in
the “obvious” way that the sequence of random variables y n converges toward y if
every realization of them converges, since it is possible, although extremely unlikely,
that e.g. all throws of a coin show heads ad infinitum, or follow another sequence
for which the average number of heads does not converge towards 1/2. Therefore we
will use the following definition:
The sequence of random variables y 1 , y 2 , . . . converges in probability to another
random variable y if and only if for every δ > 0
(7.2.1)   lim_{n→∞} Pr[|y_n − y| ≥ δ] = 0.
One can also say that the probability limit of y_n is y, in formulas

(7.2.2)   plim_{n→∞} y_n = y.

In many applications, the limiting variable y is a degenerate random variable, i.e., it
is a constant.
The Weak Law of Large Numbers says that, if the expected value exists, then the
probability limit of the sample means of an ever increasing sample is the expected
value, i.e., plim_{n→∞} ȳ_n = µ.
Problem 117. 5 points Assuming that not only the expected value but also the
variance exists, derive the Weak Law of Large Numbers, which can be written as

(7.2.3)   lim_{n→∞} Pr[|ȳ_n − E[y]| ≥ δ] = 0 for all δ > 0,

from the Chebyshev inequality

(7.2.4)   Pr[|x − µ| ≥ kσ] ≤ 1/k²

where µ = E[x] and σ² = var[x].
Answer. From nonnegativity of probability and the Chebyshev inequality for x = ȳ_n follows
0 ≤ Pr[|ȳ_n − µ| ≥ kσ/√n] ≤ 1/k² for all k. Set k = δ√n/σ to get 0 ≤ Pr[|ȳ_n − µ| ≥ δ] ≤ σ²/(nδ²). For any fixed
δ > 0, the upper bound converges towards zero as n → ∞, and the lower bound is zero, therefore
the probability itself also converges towards zero.
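A small Monte Carlo illustration of this convergence (the uniform distribution, δ, and the repetition counts below are my arbitrary choices):

```python
import random

random.seed(0)

# Fraction of sample means that land at least delta away from mu = 0.5,
# for averages of n independent uniform(0, 1) draws
def freq_outside(n, delta=0.05, reps=2000):
    count = 0
    for _ in range(reps):
        ybar = sum(random.random() for _ in range(n)) / n
        if abs(ybar - 0.5) >= delta:
            count += 1
    return count / reps

# The deviation probability shrinks as n grows, as the Weak Law predicts
assert freq_outside(10) > freq_outside(1000)
```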
Problem 118. 4 points Let y_1, ..., y_n be a sample from some unknown probability
distribution, with sample mean ȳ = (1/n) Σ_{i=1}^n y_i and sample variance
s² = (1/n) Σ_{i=1}^n (y_i − ȳ)². Show that the data satisfy the following "sample
equivalent" of the Chebyshev inequality: if k is any fixed positive number, and m is
the number of observations y_j which satisfy |y_j − ȳ| ≥ ks, then m ≤ n/k². In symbols,

(7.2.5)   #{y_i : |y_i − ȳ| ≥ ks} ≤ n/k².

Hint: apply the usual Chebyshev inequality to the so-called empirical distribution of
the sample. The empirical distribution is a discrete probability distribution defined
by Pr[y = y_i] = k/n, when the number y_i appears k times in the sample. (If all y_i are
different, then all probabilities are 1/n.) The empirical distribution corresponds to
the experiment of randomly picking one observation out of the given sample.
Answer. The only thing to note is: the sample mean is the expected value in that empirical
distribution, the sample variance is the variance, and the relative number m/n is the probability:

(7.2.6)   #{y_i : y_i ∈ S} = n Pr[S]
• a. 3 points What happens to this result when the distribution from which the
y_i are taken does not have an expected value or a variance?
Answer. The result still holds but ȳ and s² do not converge as the number of observations
increases.
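The sample version of the inequality can be verified on any data set; the numbers below are an arbitrary illustration of mine:

```python
from math import sqrt

data = [2.0, 3.5, 3.6, 4.1, 9.9, 1.2, 3.3, 3.4, 3.8, 15.0]
n = len(data)
ybar = sum(data) / n
# sample variance with divisor n, as in Problem 118
s = sqrt(sum((y - ybar)**2 for y in data) / n)

for k in (1.5, 2.0, 3.0):
    m = sum(1 for y in data if abs(y - ybar) >= k * s)
    assert m <= n / k**2   # the sample Chebyshev inequality (7.2.5)
```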
7.3. Central Limit Theorem
Assume all y_i are independent and have the same distribution with mean µ,
variance σ², and also a moment generating function. Again, let ȳ_n be the sample
mean of the first n observations. The central limit theorem says that the probability
distribution of

(7.3.1)   (ȳ_n − µ)/(σ/√n)

converges to a N(0, 1). This is a different concept of convergence than the probability
limit; it is convergence in distribution.
Problem 119. 1 point Construct a sequence of random variables y 1 , y 2 . . . with
the following property: their cumulative distribution functions converge to the cumulative distribution function of a standard normal, but the random variables themselves
do not converge in probability. (This is easy!)
Answer. One example would be: all y i are independent standard normal variables.
Why do we have the funny expression (ȳ_n − µ)/(σ/√n)? Because this is the standardized
version of ȳ_n. We know from the law of large numbers that the distribution of
ȳ_n becomes more and more concentrated around µ. If we standardize the sample
averages ȳ_n, we compensate for this concentration. The central limit theorem tells
us therefore what happens to the shape of the cumulative distribution function of ȳ_n.
If we disregard the fact that it becomes more and more concentrated (by multiplying
it by a factor which is chosen such that the variance remains constant), then we see
that its geometric shape comes closer and closer to a normal distribution.
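To see this convergence in distribution numerically, one can standardize sample means of a deliberately skewed distribution; the exponential distribution and the sample sizes here are my illustrative choices, not from the text:

```python
import random
from math import sqrt, erf

random.seed(2)

# Standardized mean of n exponential(1) draws (mu = sigma = 1)
def standardized_mean(n):
    ybar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return (ybar - 1.0) / (1.0 / sqrt(n))

reps, n = 5000, 400
freq = sum(1 for _ in range(reps) if standardized_mean(n) <= 1.0) / reps
Phi_1 = 0.5 * (1 + erf(1 / sqrt(2)))   # standard normal cdf at 1, ~0.8413
assert abs(freq - Phi_1) < 0.03        # empirical cdf is close to the normal one
```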
Proof of the Central Limit Theorem: By Problem 120,

(7.3.2)   (ȳ_n − µ)/(σ/√n) = (1/√n) Σ_{i=1}^n (y_i − µ)/σ = (1/√n) Σ_{i=1}^n z_i,   where z_i = (y_i − µ)/σ.
Let m_3, m_4, etc., be the third, fourth, etc., moments of z_i; then the m.g.f. of z_i is

(7.3.3)   m_{z_i}(t) = 1 + t²/2! + m_3 t³/3! + m_4 t⁴/4! + ···

Therefore the m.g.f. of (1/√n) Σ_{i=1}^n z_i is (multiply and substitute t/√n for t):

(7.3.4)   (1 + t²/(2!n) + m_3 t³/(3! n^{3/2}) + m_4 t⁴/(4!n²) + ···)^n = (1 + w_n/n)^n
where

(7.3.5)   w_n = t²/2! + m_3 t³/(3!√n) + m_4 t⁴/(4!n) + ···.

Now use Euler's limit, this time in the form: if w_n → w for n → ∞, then
(1 + w_n/n)^n → e^w. Since our w_n → t²/2, the m.g.f. of the standardized ȳ_n
converges toward e^{t²/2}, which is that of a standard normal distribution.
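The Euler limit invoked in the last step can be illustrated numerically; here c mimics the m_3 term that vanishes like 1/√n (all numbers are my arbitrary choices):

```python
from math import exp, sqrt

t, c = 1.3, 0.7
w = t**2 / 2
errors = []
for n in (10**2, 10**4, 10**6):
    wn = w + c / sqrt(n)                 # w_n -> w as n grows
    errors.append(abs((1 + wn / n)**n - exp(w)))

# (1 + w_n/n)^n approaches e^w: the error shrinks monotonically here
assert errors[0] > errors[1] > errors[2]
```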
The Central Limit Theorem is an example of emergence: independently of the
distributions of the individual summands, the distribution of the sum has a very
specific shape, the Gaussian bell curve. The signals turn into white noise. Here
emergence is the emergence of homogeneity and indeterminacy. In capitalism, much
more specific outcomes emerge: whether one quits the job or not, whether one sells
the stock or not, whether one gets a divorce or not, the outcome for society is to
perpetuate the system. Few activities fail to have this outcome.
Problem 120. Show in detail that (ȳ_n − µ)/(σ/√n) = (1/√n) Σ_{i=1}^n (y_i − µ)/σ.
Answer. lhs = (√n/σ)(ȳ_n − µ) = (√n/σ)((1/n) Σ_{i=1}^n y_i − (1/n) Σ_{i=1}^n µ) = (√n/σ)(1/n) Σ_{i=1}^n (y_i − µ) = (1/√n) Σ_{i=1}^n (y_i − µ)/σ = rhs.
Problem 121. 3 points Explain clearly, in words, what the law of large numbers
means, what the Central Limit Theorem means, and what their difference is.
Problem 122. (For this problem, a table is needed.) [Lar82, exercise 5.6.1,
p. 301] If you roll a pair of dice 180 times, what is the approximate probability that
the sum seven appears 25 or more times? Hint: use the Central Limit Theorem (but
don’t worry about the continuity correction, which is beyond the scope of this class).
Answer. Let x_i be the random variable that equals one if the i-th roll is a seven, and zero
otherwise. Since 7 can be obtained in six ways (1+6, 2+5, 3+4, 4+3, 5+2, 6+1), the probability
to get a 7 (which is at the same time the expected value of x_i) is 6/36 = 1/6. Since x_i² = x_i,
var[x_i] = E[x_i] − (E[x_i])² = 1/6 − 1/36 = 5/36. Define x = Σ_{i=1}^{180} x_i. We need Pr[x≥25]. Since x
is the sum of many independent identically distributed random variables, the CLT says that x is
asymptotically normal. Which normal? That which has the same expected value and variance as
x. E[x] = 180 · (1/6) = 30 and var[x] = 180 · (5/36) = 25. Therefore define y ∼ N(30, 25). The
CLT says that Pr[x≥25] ≈ Pr[y≥25]. Now y≥25 ⟺ y − 30 ≥ −5 ⟺ −(y − 30) ≤ +5 ⟺
(30 − y)/5 ≤ 1. But z = (30 − y)/5 is a standard Normal, therefore Pr[(30 − y)/5 ≤ 1] = F_z(1), i.e.,
the cumulative distribution of the standard Normal evaluated at +1. One can look this up in a
table; the probability asked for is .8413. Larson uses the continuity correction: x is discrete, and
Pr[x≥25] = Pr[x>24]. Therefore Pr[y≥25] and Pr[y>24] are two alternative good approximations;
but the best is Pr[y≥24.5] = .8643. This is the continuity correction.
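The two approximations can be reproduced with the error function instead of a table; this check is mine, not part of the exercise:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

mu, sd = 180 / 6, sqrt(180 * 5 / 36)       # E[x] = 30, sd = 5
plain = 1 - Phi((25 - mu) / sd)            # Pr[y >= 25]   = Phi(1)
corrected = 1 - Phi((24.5 - mu) / sd)      # Pr[y >= 24.5] = Phi(1.1)
assert abs(plain - 0.8413) < 5e-4
assert abs(corrected - 0.8643) < 5e-4
```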
CHAPTER 8
Vector Random Variables
In this chapter we will look at two random variables x and y defined on the same
sample space U , i.e.,
(8.0.6)   x : U ∋ ω ↦ x(ω) ∈ R   and   y : U ∋ ω ↦ y(ω) ∈ R.
As we said before, x and y are called independent if all events of the form x ≤ x
are independent of any event of the form y ≤ y. But now let us assume they are
not independent. In this case, we do not have all the information about them if we
merely know the distribution of each.
The following example from [Lar82, example 5.1.7. on p. 233] illustrates the
issues involved. This example involves two random variables that have only two
possible outcomes each. Suppose you are told that a coin is to be flipped two times
and that the probability of a head is .5 for each flip. This information is not enough
to determine the probability of the second flip giving a head conditionally on the
first flip giving a head.
For instance, the above two probabilities can be achieved by the following experimental setup: a person has one fair coin and flips it twice in a row. Then the
two flips are independent.
But the probabilities of 1/2 for heads and 1/2 for tails can also be achieved as
follows: The person has two coins in his or her pocket. One has two heads, and one
has two tails. If at random one of these two coins is picked and flipped twice, then
the second flip has the same outcome as the first flip.
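The two setups can be simulated to show identical marginal probabilities but different joint behavior (the function names and repetition counts are my illustrative choices):

```python
import random

random.seed(1)

def fair_coin_twice():
    # one fair coin, two independent flips
    return random.random() < 0.5, random.random() < 0.5

def pick_trick_coin():
    # pick the two-headed or the two-tailed coin; both flips then agree
    heads = random.random() < 0.5
    return heads, heads

def estimate(experiment, reps=20000):
    first = second = both = 0
    for _ in range(reps):
        a, b = experiment()
        first += a
        second += b
        both += a and b
    return first / reps, second / reps, both / reps

p1, q1, both1 = estimate(fair_coin_twice)
p2, q2, both2 = estimate(pick_trick_coin)
assert abs(p1 - 0.5) < 0.02 and abs(p2 - 0.5) < 0.02   # same marginals
assert abs(both1 - 0.25) < 0.02                        # independent flips
assert abs(both2 - 0.5) < 0.02                         # perfectly dependent
```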
What do we need to get the full picture? We must consider the two variables not
separately but jointly, as a totality. In order to do this, we combine x and y into one
entity, a vector (x, y)⊤ ∈ R². Consequently we need to know the probability measure
induced by the mapping U ∋ ω ↦ (x(ω), y(ω))⊤ ∈ R².
It is not sufficient to look at random variables individually; one must look at
them as a totality.
Therefore let us first get an overview of all possible probability measures on the
plane R². In strict analogy with the one-dimensional case, these probability measures