6. SUFFICIENT STATISTICS AND THEIR DISTRIBUTIONS
factorized as follows:
pθ(ω) = g(t(ω), θ) · h(ω)
for all ω ∈ U .
If U ⊂ Rⁿ, we can write ω = (y1, . . . , yn). If Prθ is not discrete but generated by a
family of probability densities f(y1, . . . , yn; θ), then the condition reads
f(y1, . . . , yn; θ) = g(t(y1, . . . , yn), θ) · h(y1, . . . , yn).
Note what this means: the probability of an elementary event (or of an infinitesimal
interval) is written as the product of two parts: one depends on ω through t, while
the other depends on ω directly. Only that part of the probability that depends on
ω through t is allowed to also depend on θ.
Proof in the discrete case: First let us show the necessity of this factorization.
Assume that t is sufficient, i.e., that Prθ [ω|t=t] does not involve θ. Then one possible
factorization is
(6.1.1)    Prθ[ω] = Prθ[t=t(ω)] · Pr[ω|t=t(ω)]
(6.1.2)           = g(t(ω), θ) · h(ω).
Now let us prove that the factorization property implies sufficiency. Assume
therefore that the factorization holds. We have to show that for all ω ∈ U and t ∈ R, the conditional
probability Prθ[{ω}|{κ ∈ U : t(κ) = t}], which in shorthand notation will be written
as Prθ[ω|t=t], does not depend on θ.
(6.1.3)    Prθ[t=t] = Σ_{ω : t(ω)=t} Prθ[{ω}] = Σ_{ω : t(ω)=t} g(t(ω), θ) · h(ω)
(6.1.4)             = g(t, θ) · Σ_{ω : t(ω)=t} h(ω) = g(t, θ) · k(t), say.
Here it is important that k(t) does not depend on θ. Now
(6.1.5)    Prθ[ω|t=t] = Prθ[{ω} ∩ {t=t}] / Prθ[t=t].
If t(ω) ≠ t, this is zero, i.e., independent of θ. Now look at the case t(ω) = t, i.e.,
{ω} ∩ {t=t} = {ω}. Then
(6.1.6)    Prθ[ω|t=t] = g(t, θ)h(ω) / (g(t, θ)k(t)) = h(ω)/k(t), which is independent of θ.
Problem 108. 6 points Using the factorization theorem for sufficient statistics,
show that in an n-times repeated Bernoulli experiment (n is known), the number of
successes is a sufficient statistic for the success probability p.
• a. Here is a formulation of the factorization theorem: Given a family of discrete
probability measures Prθ depending on a parameter θ. The statistic t is sufficient for
parameter θ iff there exists a function of two variables g : R×Θ → R, (t, θ) → g(t; θ),
and a function of one variable h : U → R, ω → h(ω) so that for all ω ∈ U
Prθ[{ω}] = g(t(ω), θ) · h(ω).
Before you apply this, ask yourself: what is ω?
Answer. This is very simple: the probability of every elementary event depends on this element
only through the random variable t : U → N, which is the number of successes. Pr[{ω}] =
p^{t(ω)} (1 − p)^{n−t(ω)}. Therefore g(k; p) = p^k (1 − p)^{n−k} and h(ω) = 1 does the trick. One can also
say: the probability of one element ω is the probability of t(ω) successes divided by C(n, t(ω)). This
gives another easy-to-understand factorization.
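This factorization can also be checked numerically. The following sketch (not part of the original text; n and p are arbitrary) enumerates all outcomes of n Bernoulli trials and verifies that the conditional probability of each outcome given t successes is 1/C(n, t), the same for every p:

```python
from itertools import product
from math import comb, isclose

def check_sufficiency(n, p):
    """Verify Pr[ω | t = t(ω)] = 1/C(n, t(ω)) for every outcome ω,
    i.e. the conditional distribution given t does not involve p."""
    for omega in product([0, 1], repeat=n):
        t = sum(omega)                       # the statistic: number of successes
        pr_omega = p**t * (1 - p)**(n - t)   # Pr[{ω}] = g(t; p) · h(ω) with h ≡ 1
        pr_t = comb(n, t) * pr_omega         # Pr[t = t]: C(n, t) equally likely outcomes
        assert isclose(pr_omega / pr_t, 1 / comb(n, t))

for p in (0.1, 0.5, 0.9):    # the conditional distribution is the same for every p
    check_sufficiency(4, p)
```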
6.2. The Exponential Family of Probability Distributions
Assume the random variable x has values in U ⊂ R. A family of density functions
(or, in the discrete case, probability mass functions) fx (x; ξ) that depends on the
parameter ξ ∈ Ξ is called a one-parameter exponential family if and only if there
exist functions s, u : Ξ → R and r, t : U → R such that the density function can be
written as
(6.2.1)    fx(x; ξ) = r(x)s(ξ) exp(t(x)u(ξ))    if x ∈ U, and = 0 otherwise.
For this definition it is important that U ⊂ R does not depend on ξ. Notice how
symmetric this condition is between observations and parameters.
If we put the factors r(x)s(ξ) into the exponent we get
(6.2.2)    fx(x; ξ) = exp(t(x)u(ξ) + v(x) + w(ξ))    if x ∈ U, and = 0 otherwise,
where v(x) = ln r(x) and w(ξ) = ln s(ξ).
If we plug the random variable x into the function t we get the transformed
random variable y = t(x), and if we re-define the parameter θ = u(ξ), we get the
density in its canonical form
(6.2.3)    fy(y; θ) = exp(yθ − b(θ) + c(y))    if y ∈ U, and = 0 otherwise.
Note here the minus sign in front of b. We will see later that θ → b(θ) is an important
function; its derivatives yield the mean and variance functions used in the generalized
linear model.
Problem 109. 3 points Show that the Binomial distribution (3.7.1)
(6.2.4)    px(k) = Pr[x=k] = C(n,k) p^k (1 − p)^{n−k},    k = 0, 1, 2, . . . , n
is a member of the exponential family. Compute the canonical parameter θ and the
function b(θ).
Answer. Rewrite (6.2.4) as
(6.2.5)    px(k) = C(n,k) (p/(1 − p))^k (1 − p)^n = exp( k ln(p/(1 − p)) + n ln(1 − p) + ln C(n,k) )
therefore θ = ln(p/(1 − p)). To compute b(θ) you have to express n ln(1 − p) as a function of θ and
then reverse the sign. The following steps are involved: exp θ = p/(1 − p) = 1/(1 − p) − 1;
1 + exp θ = 1/(1 − p); ln(1 + exp θ) = − ln(1 − p); therefore b(θ) = n ln(1 + exp θ).
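A quick numerical check of this derivation (a sketch; the choices n = 10 and p = 0.3 are arbitrary): the canonical form exp(kθ − b(θ) + ln C(n,k)) should reproduce the binomial probabilities exactly.

```python
from math import comb, exp, log, isclose

n, p = 10, 0.3
theta = log(p / (1 - p))        # canonical parameter θ = ln(p/(1−p))
b = n * log(1 + exp(theta))     # b(θ) = n ln(1 + e^θ)

for k in range(n + 1):
    pmf = comb(n, k) * p**k * (1 - p)**(n - k)
    canonical = exp(k * theta - b + log(comb(n, k)))  # exp(kθ − b(θ) + c(k))
    assert isclose(pmf, canonical)
```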
Problem 110. 2 points Show that the Poisson distribution (5.3.5) with t = 1,
i.e.,
(6.2.6)    Pr[x=k] = (λ^k / k!) e^{−λ}    for k = 0, 1, . . .
is a member of the exponential family. Compute the canonical parameter θ and the
function b(θ).
Answer. The probability mass function can be written as
(6.2.7)    Pr[x=k] = (e^{k ln λ} / k!) e^{−λ} = exp(k ln λ − λ − ln k!)    for k = 0, 1, . . .
This is (6.2.3) for the Poisson distribution, where the values of the random variable are called k
instead of x, and θ = ln λ. Substituting λ = exp(θ) in (6.2.7) gives
(6.2.8)    Pr[x=k] = exp(kθ − exp(θ) − ln k!)    for k = 0, 1, . . .
from which one sees b(θ) = exp(θ).
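The same kind of check for the Poisson case (a sketch; λ = 2.5 is arbitrary):

```python
from math import exp, log, factorial, isclose

lam = 2.5
theta = log(lam)    # canonical parameter θ = ln λ
for k in range(20):
    pmf = lam**k * exp(-lam) / factorial(k)
    # exp(kθ − b(θ) + c(k)) with b(θ) = e^θ and c(k) = −ln k!
    canonical = exp(k * theta - exp(theta) - log(factorial(k)))
    assert isclose(pmf, canonical)
```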
The one-parameter exponential family can be generalized by the inclusion of a
scale parameter φ in the distribution. This gives the exponential dispersion family,
see [MN89, p. 28]: Each observation has the density function
(6.2.9)    fy(y; θ, φ) = exp( (yθ − b(θ)) / a(φ) + c(y, φ) ).
Problem 111. [MN89, p. 28] Show that the Normal distribution is a member
of the exponential dispersion family.
Answer.
(6.2.10)    fy(y) = (1/√(2πσ²)) e^{−(y−µ)²/(2σ²)} = exp( (yµ − µ²/2)/σ² − ½(y²/σ² + log(2πσ²)) ),
i.e., θ = µ, φ = σ², a(φ) = φ, b(θ) = θ²/2, c(y, φ) = −½(y²/σ² + log(2πσ²)).
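The identity in (6.2.10) can be verified pointwise (a sketch; the values of µ, σ² and the test points are arbitrary):

```python
from math import exp, log, pi, sqrt, isclose

mu, sigma2 = 1.5, 0.8
theta, phi = mu, sigma2    # θ = µ, φ = σ², a(φ) = φ, b(θ) = θ²/2

for y in (-2.0, 0.0, 0.7, 3.1):
    pdf = exp(-(y - mu)**2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)
    c = -0.5 * (y**2 / sigma2 + log(2 * pi * sigma2))  # c(y, φ)
    dispersion = exp((y * theta - theta**2 / 2) / phi + c)
    assert isclose(pdf, dispersion)
```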
Problem 112. Show that the Gamma distribution is a member of the exponential
dispersion family.
Next observation: for the exponential and the exponential dispersion families,
the expected value is the derivative of the function b(θ):
(6.2.11)    E[y] = ∂b(θ)/∂θ.
This follows from the basic theory associated with maximum likelihood estimation,
see (13.4.12). E[y] is therefore a function of the “canonical parameter” θ, and in the
generalized linear model the assumption is made that this function has an inverse,
i.e., the canonical parameter can be written θ = g(µ) where g is called the “canonical
link function.”
Problem 113. 2 points In the case of the Binomial distribution (see Problem
109) compute b′(θ) and verify that it is the same as E[x].
Answer. b(θ) = n ln(1 + exp θ), therefore b′(θ) = n exp(θ)/(1 + exp θ). Now exp(θ) = p/(1 − p);
plugging this in gives b′(θ) = np, which is the same as E[x].
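The relation µ = b′(θ) and its inverse, the canonical (logit) link mentioned above, can be sketched numerically for the binomial family (n = 10 is an arbitrary choice):

```python
from math import exp, log, isclose

n = 10

def b_prime(theta):
    """Mean function µ = b'(θ) = n·e^θ/(1 + e^θ) for the binomial family."""
    return n * exp(theta) / (1 + exp(theta))

def g(mu):
    """Canonical (logit) link: recovers θ from the mean µ = np."""
    return log(mu / (n - mu))

for theta in (-1.0, 0.0, 0.5, 2.0):
    assert isclose(g(b_prime(theta)), theta)   # g is the inverse of b'
```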
Problem 114. 1 point In the case of the Poisson distribution (see Problem 110)
compute b′(θ) and verify that it is the same as E[x], and compute b″(θ) and verify that
it is the same as var[x]. You may use, without proof, that a Poisson distribution
with parameter λ has expected value λ and variance λ.
Answer. b(θ) = exp θ, therefore b′(θ) = b″(θ) = exp(θ) = λ.
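A finite-difference sketch of the same fact (λ = 3 and the step size h are arbitrary choices):

```python
from math import exp, log, isclose

lam, h = 3.0, 1e-4
theta = log(lam)        # canonical parameter θ = ln λ
b = lambda t: exp(t)    # b(θ) = e^θ for the Poisson family

b1 = (b(theta + h) - b(theta - h)) / (2 * h)                # numerical b′(θ)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2    # numerical b″(θ)
assert isclose(b1, lam, rel_tol=1e-6)   # b′(θ) = λ = E[x]
assert isclose(b2, lam, rel_tol=1e-6)   # b″(θ) = λ = var[x]
```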
From (13.4.20) follows furthermore that the variance is the second derivative of
b, multiplied by a(φ):
(6.2.12)    var[y] = (∂²b(θ)/∂θ²) · a(φ).
Since θ is a function of the mean, this means: the variance of each observation is
the product of two factors: the first depends on the mean only and is called
the “variance function,” and the other depends on φ. This is exactly the
specification of the generalized linear model, see Section 69.3.
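For the binomial family this product form can be checked directly: b″(θ) = n exp(θ)/(1 + exp θ)² equals np(1 − p), the binomial variance, with a(φ) = 1 (a sketch; n = 10, p = 0.3 are arbitrary):

```python
from math import exp, log, isclose

n, p = 10, 0.3
theta = log(p / (1 - p))                      # canonical parameter
b2 = n * exp(theta) / (1 + exp(theta))**2     # b″(θ) for the binomial family
assert isclose(b2, n * p * (1 - p))           # var[x] = np(1−p), here a(φ) = 1
```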
CHAPTER 7
Chebyshev Inequality, Weak Law of Large
Numbers, and Central Limit Theorem
7.1. Chebyshev Inequality
If the random variable y has finite expected value µ and standard deviation σ,
and k is some positive number, then the Chebyshev Inequality says
(7.1.1)    Pr[ |y − µ| ≥ kσ ] ≤ 1/k².
In words, the probability that a given random variable y differs from its expected
value by more than k standard deviations is less than 1/k². (Here “more than”
and “less than” are short forms for “more than or equal to” and “less than or equal
to.”) One does not need to know the full distribution of y for that, only its expected
value and standard deviation. We will give here a proof only if y has a discrete
distribution, but the inequality is valid in general. Going over to the standardized
variable z = (y − µ)/σ we have to show Pr[|z| ≥ k] ≤ 1/k². Assuming z assumes the values
z1, z2, . . . with probabilities p(z1), p(z2), . . . , then
(7.1.2)    Pr[|z| ≥ k] = Σ_{i : |zi| ≥ k} p(zi).
Now multiply by k²:
(7.1.3)    k² Pr[|z| ≥ k] = Σ_{i : |zi| ≥ k} k² p(zi)
(7.1.4)                  ≤ Σ_{i : |zi| ≥ k} zi² p(zi)
(7.1.5)                  ≤ Σ_{all i} zi² p(zi) = var[z] = 1.
The Chebyshev inequality is sharp for all k ≥ 1. Proof: the random variable
which takes the value −k with probability 1/(2k²) and the value +k with probability
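The text breaks off here; assuming the standard completion of this example (value +k also with probability 1/(2k²), value 0 with probability 1 − 1/k² — an assumption, since the continuation is not shown), the sharpness can be checked numerically for, say, k = 2:

```python
from math import isclose

# Three-point distribution assumed to complete the example:
# −k and +k each with probability 1/(2k²), 0 with probability 1 − 1/k².
k = 2.0
vals_probs = [(-k, 1 / (2 * k**2)), (k, 1 / (2 * k**2)), (0.0, 1 - 1 / k**2)]

mu = sum(v * p for v, p in vals_probs)
var = sum((v - mu)**2 * p for v, p in vals_probs)
assert isclose(mu, 0.0) and isclose(var, 1.0)   # mean 0, variance 1 by construction

tail = sum(p for v, p in vals_probs if abs(v - mu) >= k)  # Pr[|z − µ| ≥ kσ], σ = 1
assert tail <= 1 / k**2 and isclose(tail, 1 / k**2)       # bound attained: sharp
```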