1. Statistical Methods for Handling Incomplete Data
Chapter 2: Likelihood-based approach
Jae-Kwang Kim
Department of Statistics, Iowa State University
2. Section 1: Introduction
Basic Setup (No missing data)
• y = (y1, y2, · · · , yn) is a realization of a random sample from an infinite population with density f (y).
• Assume that the true density f (y) belongs to a parametric family of densities P = {f (y; θ); θ ∈ Ω} indexed by θ ∈ Ω. That is, there exists θ0 ∈ Ω such that f (y; θ0) = f (y) for all y.
3. Likelihood
Definitions for likelihood theory
• The likelihood function of θ is defined as
L(θ) = f (y; θ),
where f (y; θ) is the joint pdf of y.
• ˆθ is called the maximum likelihood estimator (MLE) of θ0 if it satisfies
L(ˆθ) = max_{θ∈Ω} L(θ).
4. Kullback-Leibler information
Shannon-Kolmogorov information inequality
Let f0(y) = f (y; θ0) and f1(y) = f (y; θ1). The Kullback-Leibler information number, defined by
K(f0, f1) = Eθ0 [ ln {f0(Y )/f1(Y )} ],
satisfies
K(f0, f1) ≥ 0,
with equality if and only if Pθ0 {f0(Y ) = f1(Y )} = 1.
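As an aside (not in the original slides), here is a minimal Monte Carlo sketch of this inequality: it approximates K(f0, f1) for two normal densities and compares the estimate with the closed form. The choice f0 = N(0, 1), f1 = N(1, 4) and the sample size are illustrative assumptions.

```python
# Monte Carlo check of the Kullback-Leibler inequality K(f0, f1) >= 0.
# Illustrative setup: f0 = N(0, 1) (the "true" density), f1 = N(1, 2^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=100_000)     # draws from f0

log_ratio = norm.logpdf(y, 0.0, 1.0) - norm.logpdf(y, 1.0, 2.0)
kl_hat = log_ratio.mean()                            # estimates E_{f0} log{f0(Y)/f1(Y)}

# Closed form for two normals: log(s1/s0) + (s0^2 + (m0 - m1)^2)/(2 s1^2) - 1/2
kl_exact = np.log(2.0 / 1.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
print(kl_hat, kl_exact)                              # both positive and close to each other
```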
5. Proof
Jensen’s inequality
If g(t) is a convex function, then for any random variable X, g{E(X)} ≤ E{g(X)}.
Furthermore, if g(t) is strictly convex, then E{g(X)} = g{E(X)} if and only if
P(X = c) = 1 for some constant c.
Since φ(x) = − ln(x) is a strictly convex function of x, Jensen’s inequality gives
K(f0, f1) = Eθ0 [ − ln {f1(Y )/f0(Y )} ] ≥ − ln Eθ0 {f1(Y )/f0(Y )} ≥ 0,
where the last inequality holds because Eθ0 {f1(Y )/f0(Y )} = ∫ f1(y) dy ≤ 1. Equality holds if and only if Pθ0 {f0(Y ) = f1(Y )} = 1.
6. Identifiability
Definition
Let P = {Pθ; θ ∈ Ω} be a statistical model with parameter space Ω. Model P is
identifiable if the mapping θ → Pθ is one-to-one:
Pθ1 = Pθ2 ⇒ θ1 = θ2.
If the distributions are defined in terms of the probability density functions (pdfs), then
two pdfs should be considered distinct only if they differ on a set of non-zero measure
(for example two functions f1(x) = I(0 ≤ x < 1) and f2(x) = I(0 ≤ x ≤ 1) differ only at
a single point x = 1, a set of measure zero, and thus cannot be considered as distinct
pdfs).
7. Lemma 2.1 Properties of identifiable distribution
Lemma 2.1.
If P = {f (y; θ); θ ∈ Ω} is identifiable and E {|ln f (Y ; θ)|} < ∞ for all θ, then
M(θ) = Eθ0 [ ln {f (Y ; θ)/f (Y ; θ0)} ]
has a unique maximum at θ0.
8. Proof
1 By the Shannon-Kolmogorov information inequality, we obtain
M(θ) ≤ 0
for all θ ∈ Ω, with equality iff Pθ0 {f (Y ; θ) = f (Y ; θ0)} = 1.
2 Note that M(θ0) = 0 by the definition of M(θ). Thus, M(θ) attains its maximum at θ0.
3 To show uniqueness, note that if θ1 satisfies M(θ1) = 0, it should satisfy
Pθ0 {f (Y ; θ1) = f (Y ; θ0)} = 1,
which implies θ1 = θ0, by the identifiability assumption.
9. Remark
1 Lemma 2.1 simply says that, under identifiability, Q(θ) = Eθ0 {log f (Y ; θ)} takes
the (unique) maximum value at θ = θ0.
2 Define
Qn(θ) = n^{-1} Σ_{i=1}^n log f (yi ; θ);
then the MLE ˆθ is the maximizer of Qn(θ). Since Qn(θ) converges in probability to Q(θ), can we say that the maximizer of Qn(θ) converges to the maximizer of Q(θ)?
10. Theorem 2.1: (Weak) consistency of MLE
Theorem 2.1
Let
Qn(θ) = (1/n) Σ_{i=1}^n log f (yi ; θ)
and let ˆθ be any solution of
Qn(ˆθ) = max_{θ∈Ω} Qn(θ).
Assume the following two conditions:
1 Uniform weak convergence: sup_{θ∈Ω} |Qn(θ) − Q(θ)| →_p 0 for some non-stochastic function Q(θ).
2 Identification: Q(θ) is uniquely maximized at θ0.
Then ˆθ →_p θ0.
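A small simulation sketch of this consistency result, under the assumption of an Exponential(θ0) model whose MLE is available in closed form; the model and sample sizes are illustrative only.

```python
# Illustration of Theorem 2.1: the maximizer of Q_n(theta) approaches theta_0
# as n grows.  Model (illustrative): Exponential(theta_0) with density
# f(y; theta) = theta * exp(-theta * y), whose MLE is 1 / ybar.
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0
for n in [50, 500, 5000, 50000]:
    y = rng.exponential(scale=1.0 / theta0, size=n)
    theta_hat = 1.0 / y.mean()          # maximizer of Q_n(theta)
    print(n, theta_hat)                 # drifts toward theta0 = 2.0
```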
11. Remark
1 Convergence in probability:
ˆθ →_p θ0 ⇐⇒ P(|ˆθ − θ0| > ε) → 0 as n → ∞, for any ε > 0.
2 If P = {f (y; θ) ; θ ∈ Ω} is not identifiable, then Q (θ) may not have a unique
maximum and ˆθ may not converge (in probability) to a single point.
3 If the true distribution f (y) does not belong to the class P = {f (y; θ) ; θ ∈ Ω},
which point does ˆθ converge to?
12. Other properties of MLE
• Asymptotic normality (Theorem 2.2)
• Asymptotic optimality: MLE achieves Cramer-Rao lower bound
• Wilks’ theorem:
2{ℓn(ˆθ) − ℓn(θ0)} →_d χ²_p ,
where ℓn(θ) = log L(θ) is the log-likelihood and p = dim(θ).
13. 1 Introduction - Fisher information
Definition
1 Score function:
S(θ) = ∂ log L(θ)/∂θ
2 Fisher information = curvature of the log-likelihood:
I(θ) = − ∂² log L(θ)/∂θ∂θᵀ = − ∂S(θ)/∂θᵀ
3 Observed (Fisher) information: I(ˆθn), where ˆθn is the MLE.
4 Expected (Fisher) information: 𝓘(θ) = Eθ {I(θ)}
• The observed information is always positive. The observed information applies to a single dataset.
• The expected information is meaningful as a function of θ across the admissible values of θ. The expected information is an average quantity over all possible datasets.
• I(ˆθ) = 𝓘(ˆθ) for the exponential family.
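A minimal sketch of these definitions for an assumed i.i.d. Exponential(θ) sample, where the score, observed information, and expected information all have closed forms; the data-generating values are illustrative.

```python
# Score, observed information, and expected information for an i.i.d.
# Exponential(theta) sample (illustrative example; not from the slides).
#   log L(theta) = n log(theta) - theta * sum(y)
#   S(theta)     = n / theta - sum(y)
#   I(theta)     = n / theta**2
#   Expected information E{I(theta)} = n / theta**2, so here I(theta_hat)
#   equals the expected information at theta_hat, as noted for the
#   exponential family.
import numpy as np

rng = np.random.default_rng(2)
theta0, n = 2.0, 1000
y = rng.exponential(scale=1.0 / theta0, size=n)

theta_hat = n / y.sum()                 # MLE solves S(theta) = 0
score     = n / theta_hat - y.sum()     # approximately 0 at the MLE
obs_info  = n / theta_hat**2            # observed information I(theta_hat)
print(theta_hat, score, obs_info)
```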
14. Properties of score function
Theorem 2.3. Bartlett identities
Under the regularity conditions allowing for the exchange of the order of integration and
differentiation,
Eθ {S(θ)} = 0 and Vθ {S(θ)} = 𝓘(θ).
16. 1 Introduction - Fisher information
Remark
• Since ˆθ →_p θ0, we can apply a Taylor linearization to S(ˆθ) = 0 to get
ˆθ − θ0 ≅ {𝓘(θ0)}^{-1} S(θ0).
Here, we use the fact that I(θ) = −∂S(θ)/∂θᵀ converges in probability to 𝓘(θ).
• Thus, the (asymptotic) variance of the MLE is
V(ˆθ) ≐ {𝓘(θ0)}^{-1} V{S(θ0)} {𝓘(θ0)}^{-1} = {𝓘(θ0)}^{-1},
where the last equality follows from Theorem 2.3.
17. Section 2: Observed likelihood
Basic Setup
1 Let y = (y1, y2, · · · , yn) be a realization of a random sample from an infinite population with density f (y; θ0).
2 Let δi be an indicator function defined by
δi = 1 if yi is observed, and δi = 0 otherwise,
and assume that
Pr(δi = 1 | yi ) = π(yi , φ)
for some (unknown) parameter φ, where π(·) is a known function.
3 Thus, instead of observing (δi , yi ) directly, we observe (δi , yi,obs ), where
yi,obs = yi if δi = 1 and yi,obs = ∗ if δi = 0.
4 What is the (marginal) density of (yi,obs , δi )?
18. Motivation (Change of variable technique)
1 Suppose that z is a random variable with density f (z).
2 Instead of observing z directly, we observe only y = y (z), where the mapping
z → y (z) is known.
3 The density of y is
g(y) = ∫_{R(y)} f (z) dz,
where R(y) = {z : y(z) = y}.
19. Derivation
1 The joint density for z = (y, δ) is
f (y, δ) = f1(y) f2(δ | y),
where f1(y) is the density of y and f2(δ | y) is the conditional density of δ given y, given by f2(δ | y) = {π(y)}^δ {1 − π(y)}^{1−δ} with π(y) = Pr(δ = 1 | y).
2 Instead of observing z = (y, δ) directly, we observe only (yobs, δ), where yobs = yobs(y, δ) and the mapping (y, δ) → yobs is known.
3 The (marginal) density of (yobs, δ) is
g(yobs, δ) = ∫_{R(yobs, δ)} f (y, δ) dy,
where R(yobs, δ) = {y : yobs(y, δ) = yobs}.
20. Likelihood under missing data (=Observed likelihood)
The observed likelihood is the likelihood obtained from the marginal density of (yi,obs, δi ), i = 1, 2, · · · , n, and, under the IID setup, can be written as
Lobs(θ, φ) = ∏_{δi =1} [f1(yi ; θ) f2(δi | yi ; φ)] × ∏_{δi =0} ∫ f1(yi ; θ) f2(δi | yi ; φ) dyi
= ∏_{δi =1} [f1(yi ; θ) π(yi ; φ)] × ∏_{δi =0} ∫ f1(y; θ) {1 − π(y; φ)} dy,
where π(y; φ) = P(δ = 1 | y; φ).
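A hedged sketch of this observed log-likelihood for one assumed specification: a normal outcome model f1(y; θ) = N(µ, σ²) and a logistic response model π(y; φ); the integral in each δi = 0 factor is evaluated numerically. All model choices and numbers are illustrative, not part of the slides.

```python
# Sketch of the observed log-likelihood above for an illustrative model:
# f1(y; theta) = N(mu, sigma^2) and pi(y; phi) = expit(phi0 + phi1 * y).
# The delta = 0 terms need the integral of f1(y) * {1 - pi(y)} over y.
import numpy as np
from scipy.stats import norm
from scipy.special import expit
from scipy.integrate import quad

def obs_loglik(mu, sigma, phi0, phi1, y_obs, n_mis):
    """y_obs: responses with delta = 1; n_mis: number of delta = 0 units."""
    ll = np.sum(norm.logpdf(y_obs, mu, sigma) +
                np.log(expit(phi0 + phi1 * y_obs)))
    # P(delta = 0) = integral of f1(y) * {1 - pi(y)} dy, identical for every missing unit
    p_mis, _ = quad(lambda y: norm.pdf(y, mu, sigma) *
                              (1.0 - expit(phi0 + phi1 * y)), -np.inf, np.inf)
    return ll + n_mis * np.log(p_mis)

print(obs_loglik(0.0, 1.0, 0.5, -0.3, np.array([0.2, -1.1, 0.7]), n_mis=2))
```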
21. Example 2.2 (Censored regression model, or Tobit model)
zi = xi β + εi ,  εi ∼ N(0, σ²),
yi = zi if zi ≥ 0 and yi = 0 if zi < 0.
The observed log-likelihood is
ℓ(β, σ²) = −(1/2) Σ_{yi >0} { ln 2π + ln σ² + (yi − xi β)²/σ² } + Σ_{yi =0} ln { 1 − Φ(xi β/σ) },
where Φ(x) is the cdf of the standard normal distribution.
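A numerical sketch of Example 2.2: the observed log-likelihood above coded directly and maximized with a generic optimizer. The simulated design and the use of scipy's BFGS routine are illustrative assumptions.

```python
# Tobit observed log-likelihood of Example 2.2, maximized numerically.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.0
z = x @ beta_true + rng.normal(scale=sigma_true, size=n)
y = np.where(z >= 0, z, 0.0)                       # censoring at zero

def neg_loglik(par):
    beta, sigma = par[:2], np.exp(par[2])          # parameterize sigma on the log scale
    xb = x @ beta
    pos = y > 0
    ll_pos  = norm.logpdf(y[pos], loc=xb[pos], scale=sigma)   # uncensored part
    ll_zero = norm.logcdf(-xb[~pos] / sigma)                  # log{1 - Phi(x'beta / sigma)}
    return -(ll_pos.sum() + ll_zero.sum())

fit = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(fit.x[:2], np.exp(fit.x[2]))                 # estimates of beta and sigma
```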
22. Example 2.3
Let t1, t2, · · · , tn be an IID sample from a distribution with density fθ(t) = θ e^{−θt} I(t > 0). Instead of observing ti , we observe (yi , δi ), where
yi = ti if δi = 1 and yi = c if δi = 0,
and
δi = 1 if ti ≤ c and δi = 0 if ti > c,
where c is a known censoring time. The observed likelihood for θ can be derived as
Lobs(θ) = ∏_{i=1}^n {fθ(ti )}^{δi } {P(ti > c)}^{1−δi } = θ^{Σ_{i=1}^n δi } exp( −θ Σ_{i=1}^n yi ).
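A quick simulation check of Example 2.3: maximizing Lobs(θ) gives the closed form ˆθ = Σ δi / Σ yi , which should recover the true rate in a large sample. The rate and censoring time below are illustrative.

```python
# Right-censored exponential data (Example 2.3): the MLE is
# theta_hat = sum(delta_i) / sum(y_i).
import numpy as np

rng = np.random.default_rng(4)
theta0, c, n = 1.5, 1.0, 100_000
t = rng.exponential(scale=1.0 / theta0, size=n)
delta = (t <= c).astype(int)
y = np.where(delta == 1, t, c)

theta_hat = delta.sum() / y.sum()
print(theta_hat)                   # close to theta0 = 1.5
```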
23. Multivariate extension
Basic Setup
• Let y = (y1, . . . , yp) be a p-dimensional random vector with probability density
function f (y; θ).
• Let y1, · · · , yn be n independent realizations of y from f (y; θ). (IID sample)
• Let δij be the response indicator function of yij , with δij = 1 if yij is observed and δij = 0 otherwise.
• δi = (δi1, · · · , δip): p-dimensional random vector with density P(δ | y) assuming
P(δ|y) = P(δ|y; φ) for some φ.
• Let (yi,obs , yi,mis ) be the observed part and missing part of yi , respectively.
24. Observed Likelihood for multivariate Y case
• Under the IID setup, the observed likelihood is
Lobs(θ, φ) = ∏_{i=1}^n ∫ f (yi ; θ) P(δi | yi ; φ) dyi,mis ,
where it is understood that, if yi = yi,obs and yi,mis is empty, there is nothing to integrate out.
• In the special case of scalar y, the observed likelihood is
Lobs(θ, φ) = ∏_{δi =1} [f (yi ; θ) π(yi ; φ)] × ∏_{δi =0} ∫ f (y; θ) {1 − π(y; φ)} dy,
where π(y; φ) = P(δ = 1 | y; φ).
25. Missing At Random
Definition: Missing At Random (MAR)
P(δ | y) is the density of the conditional distribution of δ given y. Let yobs = yobs(y, δ), where
yi,obs = yi if δi = 1 and yi,obs = ∗ if δi = 0.
The response mechanism is MAR if P(δ | y1) = P(δ | y2) {or, equivalently, P(δ | y) = P(δ | yobs )} for all y1 and y2 satisfying yobs(y1, δ) = yobs(y2, δ).
• MAR: the response mechanism P(δ | y) depends on y only through yobs .
• Let y = (yobs , ymis ). By Bayes’ theorem,
P(ymis | yobs , δ) = { P(δ | ymis , yobs ) / P(δ | yobs ) } P(ymis | yobs ).
• MAR: P(ymis | yobs , δ) = P(ymis | yobs ). That is, ymis ⊥ δ | yobs .
• MAR: the conditional independence of δ and ymis given yobs .
26. Remark
• MCAR (Missing Completely at random): P(δ | y) does not depend on y.
• MAR (Missing at random): P(δ | y) = P(δ | yobs ).
• NMAR (Not Missing at random): P(δ | y) ≠ P(δ | yobs ).
• Thus, MCAR is a special case of MAR.
27. Likelihood factorization theorem
Theorem 2.4 (Rubin, 1976)
Let Pφ(δ | y) be the joint density of δ given y and fθ(y) the joint density of y. Under the conditions that
1 the parameters θ and φ are distinct, and
2 the MAR condition holds,
the observed likelihood can be written as
Lobs(θ, φ) = L1(θ) L2(φ),
and the MLE of θ can be obtained by maximizing L1(θ).
Thus, we do not have to specify a model for the response mechanism. The response mechanism is called ignorable if the above likelihood factorization holds.
28. Example 2.4
• Bivariate data (xi , yi ) with pdf f (x, y) = f1(y | x)f2(x).
• xi is always observed and yi is subject to missingness.
• Assume that the response status variable δi of yi satisfies
P (δi = 1 | xi , yi ) = Λ1 (φ0 + φ1xi + φ2yi )
for some function Λ1 (·) of known form.
• Let θ be the parameter of interest in the regression model f1 (y | x; θ). Let α be
the parameter in the marginal distribution of x, denoted by f2 (xi ; α). Define
Λ0(x) = 1 − Λ1(x).
• Three parameters
• θ: parameter of interest
• α and φ: nuisance parameter
29. Example 2.4 (Cont’d)
• Observed likelihood
Lobs (θ, α, φ) =
δi =1
f1 (yi | xi ; θ) f2 (xi ; α) Λ1 (φ0 + φ1xi + φ2yi )
×
δi =0
f1 (y | xi ; θ) f2 (xi ; α) Λ0 (φ0 + φ1xi + φ2y) dy
= L1 (θ, φ) × L2 (α)
where L2 (α) = n
i=1 f2 (xi ; α) .
• Thus, we can safely ignore the marginal distribution of x if x is completely
observed.
30. Example 2.4 (Cont’d)
• If φ2 = 0, then MAR holds and
L1(θ, φ) = L1a(θ) × L1b(φ),
where
L1a(θ) = ∏_{δi =1} f1(yi | xi ; θ)
and
L1b(φ) = ∏_{δi =1} Λ1(φ0 + φ1xi ) × ∏_{δi =0} Λ0(φ0 + φ1xi ).
• Thus, under MAR, the MLE of θ can be obtained by maximizing L1a (θ), which is
obtained by ignoring the missing part of the data.
31. Example 2.4 (Cont’d)
• If, instead of yi , the covariate xi is subject to missingness, then the observed likelihood becomes
Lobs(θ, φ, α) = ∏_{δi =1} f1(yi | xi ; θ) f2(xi ; α) Λ1(φ0 + φ1xi + φ2yi )
× ∏_{δi =0} ∫ f1(yi | x; θ) f2(x; α) Λ0(φ0 + φ1x + φ2yi ) dx
≠ L1(θ, φ) × L2(α).
• If φ1 = 0, then
Lobs(θ, α, φ) = L1(θ, α) × L2(φ)
and MAR holds. Although we are not interested in the marginal distribution of x, we still need to specify a model for the marginal distribution of x.
32. Section 3: Mean Score Approach
Motivation
• The observed likelihood is the marginal density of (yobs , δ).
• The observed likelihood is
Lobs(η) = ∫_{R(yobs ,δ)} f (y; θ) P(δ | y; φ) dµ(y) = ∫ f (y; θ) P(δ | y; φ) dµ(ymis ),
where ymis is the missing part of y and η = (θ, φ).
• Observed score equation:
Sobs(η) ≡ ∂/∂η log Lobs(η) = 0
• Computing the observed score function can be computationally challenging because the observed likelihood is in integral form.
33. 3 Mean Score Approach
Theorem 2.5: Mean Score Theorem (Fisher, 1922)
Under some regularity conditions, the observed score function equals the mean score function. That is,
Sobs(η) = ¯S(η),
where
¯S(η) = E{Scom(η) | yobs , δ},
Scom(η) = ∂/∂η log f (y, δ; η),
f (y, δ; η) = f (y; θ) P(δ | y; φ).
• The mean score function is computed by taking the conditional expectation of the complete-sample score function given the observed data.
• The mean score function is easier to compute than the observed score function.
34. 3 Mean Score Approach
Proof of Theorem 2.5
Since Lobs(η) = f (y, δ; η) / f (y, δ | yobs, δ; η), we have
∂/∂η ln Lobs(η) = ∂/∂η ln f (y, δ; η) − ∂/∂η ln f (y, δ | yobs, δ; η).
Taking the conditional expectation of the above equation over the conditional distribution of (y, δ) given (yobs, δ), we have
∂/∂η ln Lobs(η) = E{ ∂/∂η ln Lobs(η) | yobs, δ }
= E{Scom(η) | yobs, δ} − E{ ∂/∂η ln f (y, δ | yobs, δ; η) | yobs, δ }.
Here, the first equality holds because Lobs(η) is a function of (yobs, δ) only. The last term is equal to zero by Theorem 2.3, which states that the expected value of the score function is zero; the reference distribution in this case is the conditional distribution of (y, δ) given (yobs, δ).
35. Example 2.6
1 Suppose that the study variable y follows a normal distribution with mean x β and variance σ². The score equations for β and σ² under complete response are
S1(β, σ²) = Σ_{i=1}^n (yi − xi β) xi /σ² = 0
and
S2(β, σ²) = −n/(2σ²) + Σ_{i=1}^n (yi − xi β)²/(2σ⁴) = 0.
2 Assume that yi is observed only for the first r elements and that the MAR assumption holds. In this case, the mean score functions reduce to
¯S1(β, σ²) = Σ_{i=1}^r (yi − xi β) xi /σ²
and
¯S2(β, σ²) = −n/(2σ²) + Σ_{i=1}^r (yi − xi β)²/(2σ⁴) + (n − r)/(2σ²).
36. Example 2.6 (Cont’d)
3 The maximum likelihood estimators obtained by solving the mean score equations are
ˆβ = ( Σ_{i=1}^r xi xi′ )^{-1} Σ_{i=1}^r xi yi
and
ˆσ² = (1/r) Σ_{i=1}^r (yi − xi ˆβ)².
Thus, the resulting estimators can also be obtained by simply ignoring the missing part of the sample, which is consistent with the result in Example 2.4 (for φ2 = 0).
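A sketch confirming the point of Example 2.6 numerically: under MAR, solving the mean score equations is the same as computing the complete-case least-squares estimates. The simulated design is an illustrative assumption.

```python
# Example 2.6 in code: under MAR, the mean-score MLE equals the complete-case
# estimates shown above.  The simulated data are illustrative only.
import numpy as np

rng = np.random.default_rng(5)
n, r = 1000, 600                                   # first r responses observed
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([1.0, -2.0]), 0.7
y = x @ beta_true + rng.normal(scale=sigma_true, size=n)

x_r, y_r = x[:r], y[:r]                            # respondents only
beta_hat = np.linalg.solve(x_r.T @ x_r, x_r.T @ y_r)
sigma2_hat = np.mean((y_r - x_r @ beta_hat) ** 2)
print(beta_hat, sigma2_hat)                        # complete-case OLS = mean-score MLE
```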
37. Discussion of Example 2.6
• We are interested in estimating θ for the conditional density f (y | x; θ).
• Under MAR, the observed likelihood for θ is
Lobs(θ) = ∏_{i=1}^r f (yi | xi ; θ) × ∏_{i=r+1}^n ∫ f (y | xi ; θ) dy = ∏_{i=1}^r f (yi | xi ; θ).
• The same conclusion follows from the mean score theorem. Under MAR, the mean score function is
¯S(θ) = Σ_{i=1}^r S(θ; xi , yi ) + Σ_{i=r+1}^n E{S(θ; xi , Y ) | xi } = Σ_{i=1}^r S(θ; xi , yi ),
where S(θ; x, y) is the score function for θ and the second equality follows from Theorem 2.3.
38. Example 2.5
1 Suppose that the study variable y follows a Bernoulli distribution with probability of success pi , where
pi = pi (β) = exp(xi β) / {1 + exp(xi β)}
for some unknown parameter β, and xi is a vector of covariates in the logistic regression model for yi . We assume that 1 is in the column space of xi .
2 Under complete response, the score function for β is
S1(β) = Σ_{i=1}^n {yi − pi (β)} xi .
39. Example 2.5 (Cont’d)
3 Let δi be the response indicator function for yi with distribution Bernoulli(πi ), where
πi = exp(xi φ0 + yi φ1) / {1 + exp(xi φ0 + yi φ1)}.
We assume that xi is always observed, but yi is missing if δi = 0.
4 Under missing data, the mean score function for β is
¯S1(β, φ) = Σ_{δi =1} {yi − pi (β)} xi + Σ_{δi =0} Σ_{y=0}^1 wi (y; β, φ) {y − pi (β)} xi ,
where wi (y; β, φ) is the conditional probability of yi = y given xi and δi = 0:
wi (y; β, φ) = Pβ(yi = y | xi ) Pφ(δi = 0 | yi = y, xi ) / Σ_{z=0}^1 Pβ(yi = z | xi ) Pφ(δi = 0 | yi = z, xi ).
Thus, ¯S1(β, φ) is also a function of φ.
40. Example 2.5 (Cont’d)
5 If the response mechanism is MAR so that φ1 = 0, then
wi (y; β, φ) = Pβ(yi = y | xi ) / Σ_{z=0}^1 Pβ(yi = z | xi ) = Pβ(yi = y | xi )
and so
¯S1(β, φ) = Σ_{δi =1} {yi − pi (β)} xi = ¯S1(β).
6 If MAR does not hold, then (ˆβ, ˆφ) can be obtained by solving ¯S1(β, φ) = 0 and ¯S2(β, φ) = 0 jointly, where
¯S2(β, φ) = Σ_{δi =1} {δi − π(φ; xi , yi )} (xi , yi ) + Σ_{δi =0} Σ_{y=0}^1 wi (y; β, φ) {δi − π(φ; xi , y)} (xi , y).
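A sketch of how the mean score ¯S1(β, φ) above could be evaluated at fixed parameter values, with nonrespondents entering through the fractional weights wi (y; β, φ). The tiny data set and parameter values are illustrative; solving ¯S1 = ¯S2 = 0 jointly would still require an outer root-finding step.

```python
# Mean score S1_bar(beta, phi) of Example 2.5 at fixed parameter values.
import numpy as np
from scipy.special import expit

def mean_score_beta(beta, phi0, phi1, x, y, delta):
    """Mean score for beta; y is used only where delta == 1."""
    p = expit(x @ beta)                               # p_i(beta) = P(y_i = 1 | x_i)
    s = ((y[delta == 1] - p[delta == 1])[:, None] * x[delta == 1]).sum(axis=0)
    for i in np.where(delta == 0)[0]:
        # w_i(y) proportional to P(y_i = y | x_i) * P(delta_i = 0 | y_i = y, x_i)
        num = np.array([(1 - p[i]) * (1 - expit(x[i] @ phi0 + 0 * phi1)),
                        p[i]       * (1 - expit(x[i] @ phi0 + 1 * phi1))])
        w = num / num.sum()
        s += w[0] * (0 - p[i]) * x[i] + w[1] * (1 - p[i]) * x[i]
    return s

x = np.column_stack([np.ones(5), [0.2, -1.0, 0.5, 1.5, -0.3]])
y = np.array([1, 0, 1, 1, 0])                         # values with delta = 0 are never used
delta = np.array([1, 1, 0, 1, 0])
print(mean_score_beta(np.array([0.1, 0.5]), np.array([0.3, -0.2]), 0.4,
                      x, y, delta))
```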
41. Section 4: Observed information
Definition
1 Observed score function: Sobs(η) = ∂/∂η log Lobs(η)
2 Fisher information from the observed likelihood: Iobs(η) = −∂²/∂η∂ηᵀ log Lobs(η)
3 Expected (Fisher) information from the observed likelihood: 𝓘obs(η) = Eη {Iobs(η)}.
Theorem 2.6
Under regularity conditions,
E{Sobs(η)} = 0 and V{Sobs(η)} = 𝓘obs(η),
where 𝓘obs(η) = Eη {Iobs(η)} is the expected information from the observed likelihood.
42. 4 Observed information
• Under missing data, the MLE ˆη is the solution to ¯S(η) = 0.
• Under some regularity conditions, ˆη converges in probability to η0 and has asymptotic variance {𝓘obs(η0)}^{-1}, with
𝓘obs(η) = E{ −∂Sobs(η)/∂ηᵀ } = E{ Sobs(η)⊗2 } = E{ ¯S(η)⊗2 },  where B⊗2 = BBᵀ.
• For variance estimation of ˆη, one may use {𝓘obs(ˆη)}^{-1}.
• In general, Iobs(ˆη) is preferred to 𝓘obs(ˆη) for variance estimation of ˆη.
43. Return to Example 2.3
• Observed log-likelihood:
ln Lobs(θ) = Σ_{i=1}^n δi log(θ) − θ Σ_{i=1}^n yi
• MLE for θ:
ˆθ = Σ_{i=1}^n δi / Σ_{i=1}^n yi
• Observed (Fisher) information: Iobs(θ) = Σ_{i=1}^n δi /θ²
• Expected information: 𝓘obs(θ) = Σ_{i=1}^n (1 − e^{−θc})/θ² = n(1 − e^{−θc})/θ².
Which one do you prefer?
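A small numerical comparison of the two quantities for simulated censored exponential data (illustrative rate and censoring time); at the MLE the two are typically close but not identical.

```python
# Example 2.3 revisited: observed information sum(delta)/theta^2 versus
# expected information n(1 - exp(-theta c))/theta^2, both evaluated at the MLE.
import numpy as np

rng = np.random.default_rng(6)
theta0, c, n = 1.5, 1.0, 500
t = rng.exponential(scale=1.0 / theta0, size=n)
delta = (t <= c).astype(int)
y = np.where(delta == 1, t, c)

theta_hat = delta.sum() / y.sum()
I_obs = delta.sum() / theta_hat**2                       # observed information
I_exp = n * (1 - np.exp(-theta_hat * c)) / theta_hat**2  # expected information at the MLE
print(theta_hat, I_obs, I_exp)
```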
44. 4 Observed information
Motivation
• Lcom(η) = f (y, δ; η): complete-sample likelihood with no missing data
• Fisher information associated with Lcom(η):
Icom(η) = −∂Scom(η)/∂ηᵀ = −∂²/∂η∂ηᵀ log Lcom(η)
• Lobs(η): the observed likelihood
• Fisher information associated with Lobs(η):
Iobs(η) = −∂Sobs(η)/∂ηᵀ = −∂²/∂η∂ηᵀ log Lobs(η)
• How can we express Iobs(η) in terms of Icom(η) and Scom(η)?
45. Louis’ formula
Theorem 2.7 (Louis, 1982; Oakes, 1999)
Under regularity conditions allowing the exchange of the order of integration and
differentiation,
Iobs(η) = E{Icom(η) | yobs , δ} − [ E{Scom(η)⊗2 | yobs , δ} − ¯S(η)⊗2 ]
= E{Icom(η) | yobs , δ} − V{Scom(η) | yobs , δ},
where ¯S(η) = E{Scom(η) | yobs , δ}.
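A quick check of Louis' formula in the censored exponential model of Example 2.3, where every conditional moment is available in closed form (using the memoryless property); the numbers below are illustrative.

```python
# Louis' formula for the censored exponential model (Example 2.3), checked
# numerically.  Complete data: t_i ~ Exp(theta); observed: (y_i, delta_i).
#   E{I_com | obs} = n / theta^2
#   V{S_com | obs} = (# censored) / theta^2     (memoryless: Var(t | t > c) = 1/theta^2)
#   Louis:  I_obs  = n/theta^2 - (# censored)/theta^2 = sum(delta)/theta^2,
# agreeing with direct differentiation of the observed log-likelihood.
import numpy as np

rng = np.random.default_rng(7)
theta, c, n = 1.5, 1.0, 400
t = rng.exponential(scale=1.0 / theta, size=n)
delta = (t <= c).astype(int)

louis  = n / theta**2 - (n - delta.sum()) / theta**2
direct = delta.sum() / theta**2
print(louis, direct)        # identical
```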
46. Proof of Theorem 2.7
By Theorem 2.5, the observed information associated with Lobs(η) can be expressed as
Iobs(η) = −∂/∂η ¯S(η),
where ¯S(η) = E{Scom(η) | yobs, δ; η}. Thus, we have
∂/∂η ¯S(η) = ∂/∂η ∫ Scom(η; y) f (y, δ | yobs, δ; η) dµ(y)
= ∫ {∂/∂η Scom(η; y)} f (y, δ | yobs, δ; η) dµ(y) + ∫ Scom(η; y) {∂/∂η f (y, δ | yobs, δ; η)} dµ(y)
= E{∂Scom(η)/∂η | yobs, δ} + ∫ Scom(η; y) {∂/∂η log f (y, δ | yobs, δ; η)} f (y, δ | yobs, δ; η) dµ(y).
The first term is equal to −E{Icom(η) | yobs, δ} and, writing Smis(η) = ∂/∂η log f (y, δ | yobs, δ; η) so that Scom(η) = ¯S(η) + Smis(η), the second term is equal to
E{Scom(η) Smis(η) | yobs, δ} = E{ ( ¯S(η) + Smis(η) ) Smis(η) | yobs, δ } = E{ Smis(η) Smis(η) | yobs, δ },
because E{ ¯S(η) Smis(η) | yobs, δ } = 0.
47. Missing Information Principle
• Smis(η): the score function associated with the conditional density f (y, δ | yobs , δ).
• Expected missing information: 𝓘mis(η) = E{ −∂Smis(η)/∂ηᵀ }, satisfying
𝓘mis(η) = E{ Smis(η)⊗2 }.
• Missing information principle (Orchard and Woodbury, 1972):
𝓘mis(η) = 𝓘com(η) − 𝓘obs(η),
where 𝓘com(η) = E{ −∂Scom(η)/∂ηᵀ } is the expected information from the complete-sample likelihood.
• An alternative expression of the missing information principle is
V{Smis(η)} = V{Scom(η)} − V{ ¯S(η)}.
Note that V{Scom(η)} = 𝓘com(η) and V{Sobs(η)} = 𝓘obs(η).
48. Example 2.7
1 Consider the following bivariate normal distribution:
(y1i , y2i )′ ∼ N( (µ1, µ2)′ , ( σ11 σ12 ; σ12 σ22 ) ),
for i = 1, 2, · · · , n. Assume for simplicity that σ11, σ12 and σ22 are known constants, and let µ = (µ1, µ2)′ be the parameter of interest.
2 The complete-sample score function for µ is
Scom(µ) = Σ_{i=1}^n Scom^(i)(µ) = Σ_{i=1}^n ( σ11 σ12 ; σ12 σ22 )^{-1} ( y1i − µ1 ; y2i − µ2 ).
The information matrix of µ based on the complete sample is
Icom(µ) = n ( σ11 σ12 ; σ12 σ22 )^{-1}.
49. Example 2.7 (Cont’d)
3 Suppose that there are some missing values in y1i and y2i and the original sample is partitioned into four sets:
H = both y1 and y2 respond
K = only y1 is observed
L = only y2 is observed
M = both y1 and y2 are missing.
Let nH , nK , nL, nM represent the sizes of H, K, L, M, respectively.
4 Assume that the response mechanism does not depend on the value of (y1, y2) and so it is MAR. In this case, the observed score function of µ based on a single observation in set K is
E{ Scom^(i)(µ) | y1i , i ∈ K } = ( σ11 σ12 ; σ12 σ22 )^{-1} ( y1i − µ1 ; E(y2i | y1i ) − µ2 ) = ( σ11^{-1}(y1i − µ1) ; 0 ).
50. Example 2.7 (Cont’d)
5 Similarly, we have
E{ Scom^(i)(µ) | y2i , i ∈ L } = ( 0 ; σ22^{-1}(y2i − µ2) ).
6 Therefore, the observed information matrix of µ is
Iobs(µ) = nH ( σ11 σ12 ; σ12 σ22 )^{-1} + nK ( σ11^{-1} 0 ; 0 0 ) + nL ( 0 0 ; 0 σ22^{-1} ),
and the asymptotic variance of the MLE of µ can be obtained as the inverse of Iobs(µ).
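A sketch assembling Iobs(µ) of Example 2.7 from the four response patterns and inverting it; the covariance matrix and pattern counts are illustrative assumptions.

```python
# Example 2.7 in code: build I_obs(mu) from the response patterns and invert it
# to get the asymptotic variance of the MLE of mu.
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                 # known (sigma11, sigma12; sigma12, sigma22)
nH, nK, nL, nM = 300, 100, 80, 20              # pattern counts (illustrative)

I_obs = (nH * np.linalg.inv(Sigma)
         + nK * np.array([[1.0 / Sigma[0, 0], 0.0], [0.0, 0.0]])
         + nL * np.array([[0.0, 0.0], [0.0, 1.0 / Sigma[1, 1]]]))
# Units with both components missing (set M) contribute nothing.
avar_mu_hat = np.linalg.inv(I_obs)             # asymptotic variance of the MLE of mu
print(avar_mu_hat)
```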
51. REFERENCES
Fisher, R. A. (1922), ‘On the mathematical foundations of theoretical statistics’,
Philosophical Transactions of the Royal Society of London A 222, 309–368.
Louis, T. A. (1982), ‘Finding the observed information matrix when using the EM
algorithm’, Journal of the Royal Statistical Society: Series B 44, 226–233.
Oakes, D. (1999), ‘Direct calculation of the information matrix via the EM algorithm’,
Journal of the Royal Statistical Society: Series B 61, 479–482.
Orchard, T. and M.A. Woodbury (1972), A missing information principle: theory and
applications, in ‘Proceedings of the 6th Berkeley Symposium on Mathematical
Statistics and Probability’, Vol. 1, University of California Press, Berkeley, California,
pp. 697–715.
Rubin, D. B. (1976), ‘Inference and missing data’, Biometrika 63, 581–592.