6. SUFFICIENT STATISTICS AND THEIR DISTRIBUTIONS
factorized as follows:
pθ(ω) = g(t(ω), θ) · h(ω)
for all ω ∈ U .
If U ⊂ Rⁿ, we can write ω = (y1, . . . , yn). If Prθ is not discrete but generated by a
family of probability densities f(y1, . . . , yn; θ), then the condition reads
f(y1, . . . , yn; θ) = g(t(y1, . . . , yn), θ) · h(y1, . . . , yn).
Note what this means: the probability of an elementary event (or of an infinitesimal
interval) is written as the product of two parts: one depends on ω through t, while
the other depends on ω directly. Only that part of the probability that depends on
ω through t is allowed to also depend on θ.
Proof in the discrete case: First let us show the necessity of this factorization.
Assume that t is sufficient, i.e., that Prθ [ω|t=t] does not involve θ. Then one possible
factorization is
(6.1.1)    Prθ[ω] = Prθ[t=t(ω)] · Pr[ω|t=t(ω)]
(6.1.2)           = g(t(ω), θ) · h(ω).
Now let us prove that the factorization property implies sufficiency. Assume
therefore that the factorization holds. We have to show that for all ω ∈ U and t ∈ R, the conditional
probability Prθ[{ω}|{κ ∈ U : t(κ) = t}], which in shorthand notation will be written
as Prθ[ω|t=t], does not depend on θ.
(6.1.3)    Prθ[t=t] = Σ_{ω : t(ω)=t} Prθ[{ω}] = Σ_{ω : t(ω)=t} g(t(ω), θ) · h(ω)
(6.1.4)             = g(t, θ) · Σ_{ω : t(ω)=t} h(ω) = g(t, θ) · k(t), say.
Here it is important that k(t) does not depend on θ. Now
(6.1.5)    Prθ[ω|t=t] = Prθ[{ω} ∩ {t=t}] / Prθ[t=t].
If t(ω) ≠ t, this is zero, i.e., independent of θ. Now look at the case t(ω) = t, i.e.,
{ω} ∩ {t=t} = {ω}. Then
(6.1.6)    Prθ[ω|t=t] = g(t, θ)h(ω) / (g(t, θ)k(t)) = h(ω)/k(t), which is independent of θ.
Problem 108. 6 points Using the factorization theorem for sufficient statistics,
show that in an n-times repeated Bernoulli experiment (n is known), the number of
successes is a sufficient statistic for the success probability p.
• a. Here is a formulation of the factorization theorem: Given a family of discrete
probability measures Prθ depending on a parameter θ. The statistic t is sufficient for
parameter θ iff there exists a function of two variables g : R×Θ → R, (t, θ) → g(t; θ),
and a function of one variable h : U → R, ω → h(ω) so that for all ω ∈ U
Prθ[{ω}] = g(t(ω), θ) · h(ω).
Before you apply this, ask yourself: what is ω?
Answer. This is very simple: the probability of every elementary event depends on this element
only through the random variable t : U → N, which is the number of successes. Pr[{ω}] =
p^{t(ω)} (1 − p)^{n−t(ω)}. Therefore g(k; p) = p^k (1 − p)^{n−k} and h(ω) = 1 does the trick. One can also
say: the probability of one element ω is the probability of t(ω) successes divided by C(n, t(ω)). This
gives another easy-to-understand factorization.
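This factorization can also be checked numerically. The following sketch (not part of the original text; n and p are arbitrary) enumerates all outcomes of n Bernoulli trials and verifies that the conditional probability of each outcome given t successes is 1/C(n, t), the same for every p:

```python
from itertools import product
from math import comb, isclose

def check_sufficiency(n, p):
    """Verify Pr[ω | t = t(ω)] = 1/C(n, t(ω)) for every outcome ω,
    i.e. the conditional distribution given t does not involve p."""
    for omega in product([0, 1], repeat=n):
        t = sum(omega)                       # the statistic: number of successes
        pr_omega = p**t * (1 - p)**(n - t)   # Pr[{ω}] = g(t; p) · h(ω) with h ≡ 1
        pr_t = comb(n, t) * pr_omega         # Pr[t = t]: C(n, t) equally likely outcomes
        assert isclose(pr_omega / pr_t, 1 / comb(n, t))

for p in (0.1, 0.5, 0.9):    # the conditional distribution is the same for every p
    check_sufficiency(4, p)
```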
6.2. The Exponential Family of Probability Distributions
Assume the random variable x has values in U ⊂ R. A family of density functions
(or, in the discrete case, probability mass functions) fx (x; ξ) that depends on the
parameter ξ ∈ Ξ is called a one-parameter exponential family if and only if there
exist functions s, u : Ξ → R and r, t : U → R such that the density function can be
written as
(6.2.1)    fx(x; ξ) = r(x)s(ξ) exp(t(x)u(ξ))    if x ∈ U, and = 0 otherwise.
For this definition it is important that U ⊂ R does not depend on ξ. Notice how
symmetric this condition is between observations and parameters.
If we put the factors r(x)s(ξ) into the exponent we get
(6.2.2)    fx(x; ξ) = exp(t(x)u(ξ) + v(x) + w(ξ))    if x ∈ U, and = 0 otherwise,
where v(x) = ln r(x) and w(ξ) = ln s(ξ).
If we plug the random variable x into the function t we get the transformed
random variable y = t(x), and if we re-define the parameter θ = u(ξ), we get the
density in its canonical form
(6.2.3)    fy(y; θ) = exp(yθ − b(θ) + c(y))    if y ∈ U, and = 0 otherwise.
Note here the minus sign in front of b. We will see later that θ → b(θ) is an important
function; its derivatives yield the mean and variance functions used in the generalized
linear model.
Problem 109. 3 points Show that the Binomial distribution (3.7.1)
(6.2.4)    px(k) = Pr[x=k] = C(n,k) p^k (1 − p)^{n−k},    k = 0, 1, 2, . . . , n
is a member of the exponential family. Compute the canonical parameter θ and the
function b(θ).
Answer. Rewrite (6.2.4) as
(6.2.5)    px(k) = C(n,k) (p/(1 − p))^k (1 − p)^n = exp( k ln(p/(1 − p)) + n ln(1 − p) + ln C(n,k) )
therefore θ = ln(p/(1 − p)). To compute b(θ) you have to express n ln(1 − p) as a function of θ and
then reverse the sign. The following steps are involved: exp θ = p/(1 − p) = 1/(1 − p) − 1;
1 + exp θ = 1/(1 − p); ln(1 + exp θ) = − ln(1 − p); therefore b(θ) = n ln(1 + exp θ).
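A quick numerical check of this derivation (a sketch; the choices n = 10 and p = 0.3 are arbitrary): the canonical form exp(kθ − b(θ) + ln C(n,k)) should reproduce the binomial probabilities exactly.

```python
from math import comb, exp, log, isclose

n, p = 10, 0.3
theta = log(p / (1 - p))        # canonical parameter θ = ln(p/(1−p))
b = n * log(1 + exp(theta))     # b(θ) = n ln(1 + e^θ)

for k in range(n + 1):
    pmf = comb(n, k) * p**k * (1 - p)**(n - k)
    canonical = exp(k * theta - b + log(comb(n, k)))  # exp(kθ − b(θ) + c(k))
    assert isclose(pmf, canonical)
```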
Problem 110. 2 points Show that the Poisson distribution (5.3.5) with t = 1,
i.e.,
(6.2.6)    Pr[x=k] = (λ^k / k!) e^{−λ}    for k = 0, 1, . . .
is a member of the exponential family. Compute the canonical parameter θ and the
function b(θ).
Answer. The probability mass function can be written as
(6.2.7)    Pr[x=k] = (e^{k ln λ} / k!) e^{−λ} = exp(k ln λ − λ − ln k!)    for k = 0, 1, . . .
This is (6.2.3) for the Poisson distribution, where the values of the random variable are called k
instead of x, and θ = ln λ. Substituting λ = exp(θ) in (6.2.7) gives
(6.2.8)    Pr[x=k] = exp(kθ − exp(θ) − ln k!)    for k = 0, 1, . . .
from which one sees b(θ) = exp(θ).
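The same kind of check for the Poisson case (a sketch; λ = 2.5 is arbitrary):

```python
from math import exp, log, factorial, isclose

lam = 2.5
theta = log(lam)    # canonical parameter θ = ln λ
for k in range(20):
    pmf = lam**k * exp(-lam) / factorial(k)
    # exp(kθ − b(θ) + c(k)) with b(θ) = e^θ and c(k) = −ln k!
    canonical = exp(k * theta - exp(theta) - log(factorial(k)))
    assert isclose(pmf, canonical)
```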
The one-parameter exponential family can be generalized by the inclusion of a
scale parameter φ in the distribution. This gives the exponential dispersion family,
see [MN89, p. 28]: Each observation has the density function
(6.2.9)    fy(y; θ, φ) = exp( (yθ − b(θ)) / a(φ) + c(y, φ) ).
Problem 111. [MN89, p. 28] Show that the Normal distribution is a member
of the exponential dispersion family.
Answer.
(6.2.10)    fy(y) = (1/√(2πσ²)) e^{−(y−µ)²/(2σ²)} = exp( (yµ − µ²/2)/σ² − ½(y²/σ² + log(2πσ²)) ),
i.e., θ = µ, φ = σ², a(φ) = φ, b(θ) = θ²/2, c(y, φ) = −½(y²/σ² + log(2πσ²)).
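The identity in (6.2.10) can be verified pointwise (a sketch; the values of µ, σ² and the test points are arbitrary):

```python
from math import exp, log, pi, sqrt, isclose

mu, sigma2 = 1.5, 0.8
theta, phi = mu, sigma2    # θ = µ, φ = σ², a(φ) = φ, b(θ) = θ²/2

for y in (-2.0, 0.0, 0.7, 3.1):
    pdf = exp(-(y - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)
    c = -0.5 * (y**2 / sigma2 + log(2 * pi * sigma2))  # c(y, φ)
    dispersion = exp((y * theta - theta**2 / 2) / phi + c)
    assert isclose(pdf, dispersion)
```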
Problem 112. Show that the Gamma distribution is a member of the exponential
dispersion family.
Next observation: for the exponential and the exponential dispersion families,
the expected value is the derivative of the function b(θ):
(6.2.11)    E[y] = ∂b(θ)/∂θ.
This follows from the basic theory associated with maximum likelihood estimation,
see (13.4.12). E[y] is therefore a function of the “canonical parameter” θ, and in the
generalized linear model the assumption is made that this function has an inverse,
i.e., the canonical parameter can be written θ = g(µ) where g is called the “canonical
link function.”
Problem 113. 2 points In the case of the Binomial distribution (see Problem
109) compute b′(θ) and verify that it is the same as E[x].
Answer. b(θ) = n ln(1 + exp θ), therefore b′(θ) = n exp(θ)/(1 + exp θ). Now exp(θ) = p/(1 − p);
plugging this in gives b′(θ) = np, which is the same as E[x].
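The relation µ = b′(θ) and its inverse, the canonical (logit) link mentioned above, can be sketched numerically for the binomial family (n = 10 is an arbitrary choice):

```python
from math import exp, log, isclose

n = 10

def b_prime(theta):
    """Mean function µ = b'(θ) = n·e^θ/(1 + e^θ) for the binomial family."""
    return n * exp(theta) / (1 + exp(theta))

def g(mu):
    """Canonical (logit) link: recovers θ from the mean µ = np."""
    return log(mu / (n - mu))

for theta in (-1.0, 0.0, 0.5, 2.0):
    assert isclose(g(b_prime(theta)), theta)   # g is the inverse of b'
```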
Problem 114. 1 point In the case of the Poisson distribution (see Problem 110)
compute b′(θ) and verify that it is the same as E[x], and compute b″(θ) and verify that
it is the same as var[x]. You may use, without proof, that a Poisson distribution
with parameter λ has expected value λ and variance λ.
Answer. b(θ) = exp θ, therefore b′(θ) = b″(θ) = exp(θ) = λ.
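A finite-difference sketch of the same fact (λ = 3 and the step size h are arbitrary choices):

```python
from math import exp, log, isclose

lam, h = 3.0, 1e-4
theta = log(lam)        # canonical parameter θ = ln λ
b = lambda t: exp(t)    # b(θ) = e^θ for the Poisson family

b1 = (b(theta + h) - b(theta - h)) / (2 * h)                # numerical b′(θ)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2    # numerical b″(θ)
assert isclose(b1, lam, rel_tol=1e-6)   # b′(θ) = λ = E[x]
assert isclose(b2, lam, rel_tol=1e-6)   # b″(θ) = λ = var[x]
```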
From (13.4.20) follows furthermore that the variance is the second derivative of
b, multiplied by a(φ):
(6.2.12)    var[y] = (∂²b(θ)/∂θ²) · a(φ).
Since θ is a function of the mean, this means: the variance of each observation is
the product of two factors: the first depends on the mean only and is called
the “variance function,” and the other depends on φ. This is exactly the
specification of the generalized linear model, see Section 69.3.
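For the binomial family this product form can be checked directly: b″(θ) = n exp(θ)/(1 + exp θ)² equals np(1 − p), the binomial variance, with a(φ) = 1 (a sketch; n = 10, p = 0.3 are arbitrary):

```python
from math import exp, log, isclose

n, p = 10, 0.3
theta = log(p / (1 - p))                      # canonical parameter
b2 = n * exp(theta) / (1 + exp(theta))**2     # b″(θ) for the binomial family
assert isclose(b2, n * p * (1 - p))           # var[x] = np(1−p), here a(φ) = 1
```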
CHAPTER 7
Chebyshev Inequality, Weak Law of Large
Numbers, and Central Limit Theorem
7.1. Chebyshev Inequality
If the random variable y has finite expected value µ and standard deviation σ,
and k is some positive number, then the Chebyshev Inequality says
(7.1.1)    Pr[ |y − µ| ≥ kσ ] ≤ 1/k².
In words, the probability that a given random variable y differs from its expected
value by more than k standard deviations is less than 1/k². (Here “more than”
and “less than” are short forms for “more than or equal to” and “less than or equal
to.”) One does not need to know the full distribution of y for that, only its expected
value and standard deviation. We will give here a proof only if y has a discrete
distribution, but the inequality is valid in general. Going over to the standardized
variable z = (y − µ)/σ we have to show Pr[|z| ≥ k] ≤ 1/k². Assuming z assumes the values
z1, z2, . . . with probabilities p(z1), p(z2), . . . , then
(7.1.2)    Pr[|z| ≥ k] = Σ_{i : |zi| ≥ k} p(zi).
Now multiply by k²:
(7.1.3)    k² Pr[|z| ≥ k] = Σ_{i : |zi| ≥ k} k² p(zi)
(7.1.4)                  ≤ Σ_{i : |zi| ≥ k} zi² p(zi)
(7.1.5)                  ≤ Σ_{all i} zi² p(zi) = var[z] = 1.
The Chebyshev inequality is sharp for all k ≥ 1. Proof: the random variable
which takes the value −k with probability 1/(2k²) and the value +k with probability
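The text breaks off here; assuming the standard completion of this example (value +k also with probability 1/(2k²), value 0 with probability 1 − 1/k² — an assumption, since the continuation is not shown), the sharpness can be checked numerically for, say, k = 2:

```python
from math import isclose

# Three-point distribution assumed to complete the example:
# −k and +k each with probability 1/(2k²), 0 with probability 1 − 1/k².
k = 2.0
vals_probs = [(-k, 1 / (2 * k**2)), (k, 1 / (2 * k**2)), (0.0, 1 - 1 / k**2)]

mu = sum(v * p for v, p in vals_probs)
var = sum((v - mu)**2 * p for v, p in vals_probs)
assert isclose(mu, 0.0) and isclose(var, 1.0)   # mean 0, variance 1 by construction

tail = sum(p for v, p in vals_probs if abs(v - mu) >= k)  # Pr[|z − µ| ≥ kσ], σ = 1
assert tail <= 1 / k**2 and isclose(tail, 1 / k**2)       # bound attained: sharp
```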