1. Fractional imputation for handling missing data in survey sampling
Jae-Kwang Kim
Iowa State University
Department of Cancer Epidemiology & Genetics
National Institutes of Health, National Cancer Institute
June 27, 2017
2. 1 Introduction
2 Fractional Imputation
3 Fractional hot deck imputation
4 R package: FHDI
5 Numerical Illustration
6 Concluding Remarks
Kim (ISU) Fractional Imputation NIH/NCI 2 / 57
3. Introduction
Basic Setup
U = {1, 2, · · · , N}: Finite population
A ⊂ U: sample (selected by a probability sampling design).
Under complete response, suppose that
η̂_{n,g} = Σ_{i∈A} w_i g(y_i)
is an unbiased estimator of η_g = N^{−1} Σ_{i=1}^{N} g(y_i). Here, g(·) is a
known function.
For example, g(y) = I(y < 3) leads to η_g = P(Y < 3).
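As a concrete sketch of the estimator above (in Python purely for illustration; the population, sample size, and normal distribution are made-up assumptions), under simple random sampling the weights w_i = 1/(N π_i) reduce to 1/n:

```python
import random

def ht_estimate(sample, weights, g):
    # Design-weighted (Horvitz-Thompson type) estimator of
    # eta_g = N^{-1} * sum_{i in U} g(y_i); the weights w_i are
    # assumed to carry the 1/N scaling, i.e. w_i = 1/(N * pi_i).
    return sum(w * g(y) for y, w in zip(sample, weights))

random.seed(1)
N = 10_000
population = [random.gauss(3.0, 2.0) for _ in range(N)]

n = 500
sample = random.sample(population, n)   # SRS without replacement
weights = [1.0 / n] * n                 # w_i = 1/(N * pi_i) = 1/n under SRS

g = lambda y: 1.0 if y < 3 else 0.0     # g(y) = I(y < 3)
eta_hat = ht_estimate(sample, weights, g)
eta_true = sum(g(y) for y in population) / N   # population P(Y < 3)
```

With this g, the estimator is just the design-weighted sample proportion of units below 3.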
4. Introduction
Basic Setup (Cont’d)
A = A_R ∪ A_M, where y_i is observed for i ∈ A_R and missing for i ∈ A_M.
R_i = 1 if i ∈ A_R and R_i = 0 if i ∈ A_M.
y*_i: imputed value for y_i, i ∈ A_M.
Imputed estimator of η_g:
η̂_{I,g} = Σ_{i∈A_R} w_i g(y_i) + Σ_{i∈A_M} w_i g(y*_i)
Need E{g(y*_i) | R_i = 0} = E{g(y_i) | R_i = 0}.
5. Introduction
ML estimation under missing data setup
Often, we can find x (always observed) such that
missing at random (MAR) holds: f(y | x, R = 0) = f(y | x).
Imputed values are created from f(y | x).
Computing the conditional expectation can be a challenging problem:
1 We do not know the true parameter θ in f(y | x) = f(y | x; θ):
E{g(y_i) | x_i} = E{g(y_i) | x_i; θ}.
2 Even if we know θ, computing the conditional expectation can be
numerically difficult.
6. Introduction
Imputation
Imputation: Monte Carlo approximation of the conditional
expectation (given the observed data).
E{g(y_i) | x_i} ≅ (1/M) Σ_{j=1}^{M} g(y*_i^{(j)})
1 Bayesian approach: generate y*_i from
f(y_i | x_i, y_obs) = ∫ f(y_i | x_i, θ) p(θ | x_i, y_obs) dθ
2 Frequentist approach: generate y*_i from f(y_i | x_i; θ̂), where θ̂ is a
consistent estimator.
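A minimal sketch of the frequentist approach, assuming a normal linear regression working model y | x ~ N(β_0 + β_1 x, σ²) (the model, data, and sizes are illustrative assumptions, not part of the slides):

```python
import random, math, statistics

def fit_normal_reg(xs, ys):
    # Least-squares fit of y = b0 + b1*x + e, e ~ N(0, s^2); a simple
    # stand-in for a consistent estimator theta_hat = (b0, b1, s).
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    resid = [y - b0 - b1 * x for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r * r for r in resid) / (len(xs) - 2))
    return b0, b1, s

def impute(x, theta, M, rng):
    # Generate M imputed values from f(y | x; theta_hat).
    b0, b1, s = theta
    return [rng.gauss(b0 + b1 * x, s) for _ in range(M)]

rng = random.Random(7)
x_resp = [rng.gauss(0, 1) for _ in range(200)]             # respondents' x
y_resp = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in x_resp]
theta_hat = fit_normal_reg(x_resp, y_resp)                 # fitted on A_R
draws = impute(1.0, theta_hat, M=100, rng=rng)             # M draws for one unit
```

The Bayesian approach would additionally draw θ from its posterior before each imputation draw.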
7. Comparison
                     Bayesian                Frequentist
Model                Posterior distribution  Prediction model
                     f(latent, θ | data)     f(latent | data, θ)
Computation          Data augmentation       EM algorithm
Prediction           I-step                  E-step
Parameter update     P-step                  M-step
Parameter est'n      Posterior mode          ML estimation
Imputation           Multiple imputation     Fractional imputation
Variance estimation  Rubin's formula         Linearization or resampling
9. Fractional Imputation
Idea (parametric model approach)
Approximate E{g(y_i) | x_i} by
E{g(y_i) | x_i} ≅ Σ_{j=1}^{M_i} w*_{ij} g(y*_i^{(j)})
where w*_{ij} is the fractional weight assigned to the j-th imputed value
of y_i.
If y_i is a categorical variable, we can use
y*_i^{(j)} = the j-th possible value of y_i
w*_{ij} = P(y_i = y*_i^{(j)} | x_i; θ̂),
where θ̂ is the (pseudo) MLE of θ.
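For a categorical item this is simple enough to sketch directly: every possible value becomes an imputed value and its fitted probability becomes the fractional weight (the three-category probabilities below are hypothetical fitted values, not from the slides):

```python
def categorical_fi(prob_of_category):
    # For categorical y, take every possible value as an imputed value
    # y*_i^(j) and set w*_ij = P(y = j | x_i; theta_hat).
    support = sorted(prob_of_category)
    return [(j, prob_of_category[j]) for j in support]

# Hypothetical fitted probabilities P(y = j | x_i; theta_hat) for one unit:
pairs = categorical_fi({0: 0.2, 1: 0.5, 2: 0.3})
fi_mean = sum(v * w for v, w in pairs)   # fractionally imputed E(y_i | x_i)
```

No Monte Carlo error enters here: the weighted sum reproduces the conditional expectation exactly.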
10. Fractional imputation
Features
Split each record with a missing item into M(> 1) imputed values.
Assign fractional weights.
The final product is a single data file with size ≤ nM.
For variance estimation, the fractional weights are replicated.
11. Fractional imputation using parametric model
Assume y ∼ f(y | x; θ) for some θ.
In this case, under MAR, the following fractional imputation method
can be used.
1 Parameter estimation: estimate θ by solving
Σ_{i∈A} w_i R_i S(θ; x_i, y_i) = 0   (1)
where S(θ; x, y) = ∂ log f(y | x; θ)/∂θ.
2 Imputation: generate M imputed values of y_i, denoted by
y*_i^{(j)}, j = 1, ..., M, from f(y_i | x_i; θ̂), where θ̂ is obtained from (1).
For estimating μ_g = E{g(Y)}, the fractional imputation estimator of
μ_g is
μ̂_{FI,g} = Σ_{i∈A} w_i {R_i g(y_i) + (1 − R_i) Σ_{j=1}^{M} w*_{ij} g(y*_i^{(j)})},
where w*_{ij} = 1/M.
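The estimator above can be sketched as a small function (illustrative only; this version divides by the weight total, a Hájek-type normalization assumed here so that toy weights need not sum to one):

```python
def mu_fi(units, g):
    # units: list of dicts with design weight "w", response indicator "r",
    # observed value "y" (if r == 1), or a list "imp" of
    # (fractional weight, imputed value) pairs (if r == 0).
    total_w, total = 0.0, 0.0
    for u in units:
        total_w += u["w"]
        if u["r"] == 1:
            total += u["w"] * g(u["y"])           # respondent term
        else:
            total += u["w"] * sum(ws * g(ys) for ws, ys in u["imp"])
    return total / total_w                        # Hajek-type normalization

units = [
    {"w": 1.0, "r": 1, "y": 2.0},                        # respondent
    {"w": 1.0, "r": 0, "imp": [(0.5, 1.0), (0.5, 3.0)]}, # M = 2, w*_ij = 1/2
]
mu_hat = mu_fi(units, g=lambda y: y)
```

With g the identity and equal fractional weights, the nonrespondent contributes its imputed mean.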
12. Remark
If a smaller M is desired, the following two-phase sampling method can
be used.
1 First use a large M_1 (say M_1 = 10,000) to obtain y*_i^{(j)}, j = 1, ..., M_1.
2 From the first-phase sample of size M_1 generated in Step 1, select
a second-phase sample of size M_2 using an efficient sampling method,
such as stratification or systematic sampling. Under optimal
stratification, we have
Σ_{j=1}^{M_2} w*_{ij} g(y*_i^{(j)}) = E{g(Y) | x_i; θ̂} + O_p(max{M_1^{−1/2}, M_2^{−1}}).   (2)
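A minimal sketch of the second phase, using systematic selection from the sorted phase-1 draws (sorting so the subsample tracks the empirical distribution; the sizes and normal draws are illustrative assumptions):

```python
import random

def systematic_subsample(values, m2, rng):
    # Phase 2: systematic sample of size m2 from the m1 sorted
    # first-phase imputed values, each carrying weight 1/m2.
    m1 = len(values)
    vals = sorted(values)
    step = m1 / m2
    start = rng.uniform(0, step)                 # random start
    return [vals[min(int(start + k * step), m1 - 1)] for k in range(m2)]

rng = random.Random(3)
phase1 = [rng.gauss(0, 1) for _ in range(10_000)]  # M1 = 10,000 draws
phase2 = systematic_subsample(phase1, 4, rng)      # M2 = 4, weights 1/4 each
```

The four retained values sit near evenly spaced quantiles of the phase-1 draws, which is what drives the error rate in (2).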
13. Figure: Two-phase sampling for fractional imputation: fractional imputation
with size M_2 = 4 from the histogram of M_1 ≫ M_2 imputed values.
14. Calibration fractional imputation
In addition to efficient sampling, we can also consider calibration
weighting to construct fractional weights satisfying Σ_{j=1}^{M} w*_{ij} = 1
and
Σ_{i∈A} w_i (1 − R_i) Σ_{j=1}^{M} w*_{ij} g(y*_i^{(j)}) = Σ_{i∈A} w_i (1 − R_i) E{g(Y) | x_i; θ̂}
exactly for a prespecified g(·) function. Calibration fractional
weighting is discussed in Fuller and Kim (2005).
15. Variance estimation
For replication variance estimation, we first compute θ̂^{(k)} from (1)
with w_i replaced by w_i^{(k)}. Next, we need to construct the replicated
fractional weights w*_{ij}^{(k)} to satisfy Σ_{j=1}^{M} w*_{ij}^{(k)} = 1 and
Σ_{i∈A} w_i^{(k)} (1 − R_i) Σ_{j=1}^{M} w*_{ij}^{(k)} g(y*_i^{(j)}) ≐ Σ_{i∈A} w_i^{(k)} (1 − R_i) E{g(Y) | x_i; θ̂^{(k)}}.   (3)
The replicates of μ̂_{FI,g} can be computed by
μ̂_{FI,g}^{(k)} = Σ_{i∈A} w_i^{(k)} {R_i g(y_i) + (1 − R_i) Σ_{j=1}^{M} w*_{ij}^{(k)} g(y*_i^{(j)})}.
Note that the imputed values are not changed for each replicate.
Only the fractional weights are changed.
16. Variance estimation (Cont’d)
One way to achieve condition (3) is to use the importance weighting
given by
w*_{ij}^{(k)} ∝ f(y*_i^{(j)} | x_i; θ̂^{(k)}) / f(y*_i^{(j)} | x_i; θ̂)
with Σ_{j=1}^{M} w*_{ij}^{(k)} = 1.
If Y is a categorical variable, fractional imputation is much easier. For
example, if Y is binary, then we only need two imputed values
(y*_i^{(j)} = 0 or 1), and the fractional weight corresponding to y*_i^{(j)} is
w*_{ij} = P(Y = y*_i^{(j)} | x_i; θ̂), for j = 1, 2.
17. Example 1: Fractional imputation for categorical data
Example (n = 10)
ID Weight y1 y2
1 w1 y1,1 y1,2
2 w2 y2,1 M
3 w3 M y3,2
4 w4 y4,1 y4,2
5 w5 y5,1 y5,2
6 w6 y6,1 y6,2
7 w7 M y7,2
8 w8 M M
9 w9 y9,1 y9,2
10 w10 y10,1 y10,2
M: Missing
20. Example 1 (Cont’d)
E-step: Fractional weights are the conditional probabilities of the
imputed values given the observations.
w*_{ij} = P̂(y*_{i,mis}^{(j)} | y_{i,obs}) = π̂(y_{i,obs}, y*_{i,mis}^{(j)}) / Σ_{l=1}^{M_i} π̂(y_{i,obs}, y*_{i,mis}^{(l)})
where (y_{i,obs}, y_{i,mis}) is the (observed, missing) part of y_i = (y_{i,1}, y_{i,2}).
M-step: Update the joint probability using the fractional weights.
π̂_{ab} = (1/N̂) Σ_{i=1}^{n} Σ_{j=1}^{M_i} w_i w*_{ij} I(y*_{i,1}^{(j)} = a, y*_{i,2}^{(j)} = b)
with N̂ = Σ_{i=1}^{n} w_i.
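The E-step and M-step above can be sketched for two binary items (the tiny data set is invented; units with a missing item are split over all compatible cells with fractional weights proportional to the current π̂):

```python
def em_by_weighting(data, weights, n_iter=50):
    # data: list of (y1, y2) with None marking a missing item.
    # E-step: fractional weights = conditional cell probabilities given
    # the observed part; M-step: weighted update of the joint table.
    pi = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}   # uniform start
    for _ in range(n_iter):
        counts = {cell: 0.0 for cell in pi}
        for (y1, y2), w in zip(data, weights):
            cells = [(a, b) for a in ((y1,) if y1 is not None else (0, 1))
                            for b in ((y2,) if y2 is not None else (0, 1))]
            tot = sum(pi[c] for c in cells)
            for c in cells:
                counts[c] += w * pi[c] / tot              # w_i * w*_ij
        n_hat = sum(weights)
        pi = {c: counts[c] / n_hat for c in counts}       # M-step
    return pi

data = [(0, 0), (0, 1), (1, 1), (1, 1), (1, None)]        # one missing y2
weights = [1.0] * 5
pi_hat = em_by_weighting(data, weights)
```

Since both complete cases with y1 = 1 have y2 = 1, the EM iterations push the missing unit's mass into cell (1, 1).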
22. Fractional hot deck imputation 1
Hot deck imputation: Imputed values are observed values.
Suppose that we are interested in estimating θ_1 = E(Y) or even
θ_2 = Pr(Y < c), where y ∼ f(y | x), x is always observed, and y
is subject to missingness.
Assume MAR in the sense that Pr(R = 1 | x, y) does not depend on y.
Assume that there exists z ∈ {1, ..., G} such that
f(y | x, z) = f(y | z).   (4)
In this case, we can assume that
y | (z = g) ~ iid (μ_g, σ_g²),
which is sometimes called the cell mean model (Kim and Fuller, 2004).
23. Fractional hot deck imputation 2
Under (4), one can express
f(y | x) ≅ Σ_{g=1}^{G} p_g(x) f_g(y)   (5)
where p_g(x) = P(z = g | x) and f_g(y) = f(y | z = g). Model (5)
can be called a finite mixture model.
Under (5), we can implement two-step imputation:
1 Step 1 (Parameter estimation): Compute the conditional CDF
corresponding to (5) using
F̂(y | x_i) = Σ_{g=1}^{G} p̂_g(x_i) F̂_g(y),
where p̂_g(x) are the estimated cell probabilities and F̂_g(y) is the
empirical CDF within group g.
2 Step 2 (Imputation): From the conditional CDF, obtain M imputed
values.
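The two steps can be sketched for the special case z = x with x categorical (so p̂_g(x_i) is 0 or 1 and the conditional CDF is just the within-cell empirical CDF); the cells, values, and M are illustrative:

```python
import random

def hot_deck_impute(cells, values, M, rng):
    # Step 1: build the within-cell donor sets, i.e. the empirical
    # CDF F_g(y) of the respondents in each imputation cell.
    donors = {}
    for z, yi in zip(cells, values):
        if yi is not None:
            donors.setdefault(z, []).append(yi)
    # Step 2: for each missing value, draw M donors from its cell's
    # empirical CDF; imputed values are observed values.
    imputations = {}
    for i, (z, yi) in enumerate(zip(cells, values)):
        if yi is None:
            imputations[i] = [rng.choice(donors[z]) for _ in range(M)]
    return imputations

rng = random.Random(5)
cells  = [1, 1, 1, 2, 2, 2]                    # imputation cell labels z_i
values = [3.0, 5.0, None, 10.0, 12.0, None]    # None marks missing y_i
imp = hot_deck_impute(cells, values, M=4, rng=rng)
```

Every imputed value is an observed donor value from the same cell, which is the defining property of hot deck imputation.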
24. Remark
Variable z is used to define imputation cells (or imputation classes).
If x is categorical and used directly to define cells (i.e., z = x), then
p_g(x_i) = 1 if x_i = g and p_g(x_i) = 0 otherwise. In this case,
F̂(y | x_i) = F̂_g(y), for x_i = g.
Fractional hot deck imputation for this special case is discussed in
Kim and Fuller (2004) and Fuller and Kim (2005).
If x is continuous or high-dimensional, we may use a classification
(or tree) method to define imputation cells.
25. Multivariate Extension
Idea
In hot deck imputation, we can make a nonparametric approximation
of f(·) using a finite mixture model
f(y_{i,mis} | y_{i,obs}) = Σ_{g=1}^{G} p_g(y_{i,obs}) f_g(y_{i,mis}),   (6)
where p_g(y_{i,obs}) = P(z_i = g | y_{i,obs}), f_g(y_{i,mis}) = f(y_{i,mis} | z = g),
and z is the latent variable associated with the imputation cell.
To satisfy the above approximation, we need to find z such that
f(y_{i,mis} | z_i, y_{i,obs}) = f(y_{i,mis} | z_i).
26. Multivariate Extension
Imputation cell
Assume p-dimensional survey items: Y = (Y_1, ..., Y_p).
For each item k, create a transformation of Y_k into Z_k, a discrete
version of Y_k based on the sample quantiles among respondents.
If y_{i,k} is missing, then z_{i,k} is also missing.
Imputation cells are created based on the observed part of
z_i = (z_{i,1}, ..., z_{i,p}).
Expression (6) can be written as
f(y_{i,mis} | y_{i,obs}) = Σ_{z_mis} P(z_{i,mis} = z_mis | y_{i,obs}) f(y_{i,mis} | z_i)
≅ Σ_{z_mis} P(z_{i,mis} = z_mis | z_{i,obs}) f(y_{i,mis} | z_i)
where z_i = (z_{i,obs}, z_{i,mis}) is partitioned similarly to y_i = (y_{i,obs}, y_{i,mis}).
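The discretization of one item Y_k into Z_k can be sketched as follows (cutting at respondent sample quantiles; the data and three-category choice are illustrative assumptions):

```python
import statistics

def discretize(y, n_cat=3):
    # Z_k: discrete version of Y_k cut at the sample quantiles of the
    # observed (respondent) values; a missing y stays missing in z.
    observed = [v for v in y if v is not None]
    cuts = statistics.quantiles(observed, n=n_cat)   # n_cat - 1 cut points
    return [None if v is None else sum(v > q for q in cuts) + 1 for v in y]

y = [0.1, 0.4, None, 0.8, 2.5, 1.7, 0.05, None, 3.0]
z = discretize(y, n_cat=3)   # categories 1..3, missingness preserved
```

Crossing the Z_k over items then yields the imputation cells used in Phase I.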
27. Fractional hot deck imputation: Two-step approach
Step 1: Parameter estimation step
1 Compute P̂(z_mis | y_{i,obs}): may require an iterative EM algorithm.
2 Compute the cell CDF from the set of full respondents.
Combine the two estimates to obtain the conditional CDF.
Step 2: Imputation step
Select M donors from the conditional CDF.
29. FHDI: Introduction 1
Input:
Multivariate missing data
Output (Goal):
Create a single complete data set with imputed values.
Preserve the correlation structure.
Provide a consistent FHDI estimator based on the imputed data.
Provide a variance estimator for the FHDI estimator.
30. FHDI: Introduction 2
(Recall) Fractional Imputation
E(y_{i,mis} | y_{i,obs}) is approximated by
E(y_{i,mis} | y_{i,obs}) ≅ Σ_{j=1}^{M} w*_{i,j} y*_i^{(j)}.
Draw M(> 1) imputed values for each missing value.
Assign fractional weights to the imputed values.
The final product is a single data set with size ≤ n_R + n_M × M.
31. FHDI: Introduction 3
How can we generate y*_mis from f(y_mis | y_obs) in the general case?
Apply the two-phase sampling approach.
(Phase I) Imputation cells for hot deck imputation
Determine imputation cells based on z, where z is the discretized
version of y (use estimated quantiles to create z).
Estimate cell probabilities for z using the EM by weighting method
(Example 1).
(Phase II) Donor selection
Fractional imputation for missing y within each imputation cell.
Assign all possible donor values to the missing y_mis (FEFI, Fully
Efficient Fractional Imputation) with fractional weights proportional
to the estimated cell probabilities.
Approximate the FEFI imputation using systematic sampling (FHDI).
32. FHDI: Introduction 4
Analysis
Mean estimator:
ȳ = Σ_{i∈A} Σ_{j=1}^{M} w_i w*_{ij} y*_{ij} / Σ_{i∈A} w_i
Regression estimator:
β̂ = (X′WX)^{−1} X′W y*
Variance estimator:
V̂(θ̂_FHDI) = Σ_{k=1}^{L} c_k (θ̂_FHDI^{(k)} − θ̂_FHDI)²,
where c_k is a replication factor associated with θ̂_FHDI^{(k)}, and θ̂_FHDI^{(k)} is
the k-th replicate estimate obtained using the k-th replicated fractional
weights, denoted by w_i^{(k)} × w*_{ij}^{(k)}.
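The replication variance formula itself is simple to sketch; as a stand-in for the FHDI replicate estimates θ̂^(k), the example below uses a delete-1 jackknife on an ordinary mean (the data are made up, and c_k = (L − 1)/L is the standard delete-1 jackknife factor assumed here):

```python
def rep_variance(theta_full, theta_reps, c):
    # V_hat = sum_k c_k * (theta_hat^(k) - theta_hat)^2, with
    # replication factors c_k supplied by the replication scheme.
    return sum(ck * (tk - theta_full) ** 2 for ck, tk in zip(c, theta_reps))

# Delete-1 jackknife on a simple mean, as a stand-in for theta_hat^(k):
y = [2.0, 4.0, 6.0, 8.0]
L = len(y)
theta = sum(y) / L
reps = [(sum(y) - y[k]) / (L - 1) for k in range(L)]
v_hat = rep_variance(theta, reps, [(L - 1) / L] * L)   # equals s^2 / n here
```

For this toy data the jackknife reproduces the textbook variance of the mean, s²/n = 5/3.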
33. FHDI: Implementation 1
Three scenarios for multivariate missing data
1 All categorical data: SAS procedure SURVEYIMPUTE and R
package FHDI.
install.packages("FHDI").
Require R 3.4.0 or later.
Require Rtools34 or later.
More details: see
https://sites.google.com/view/jaekwangkim/software.
2 All continuous data: R package FHDI.
3 A mixed data of categorical and continuous items: Not Applicable
with the current version of FHDI.
34. FHDI: Implementation 2
We have n = 100 sample observations for the multivariate data vector
y_i = (y_{1i}, y_{2i}, y_{3i}, y_{4i}), i = 1, ..., n, generated from
Y_1 = 1 + e_1,
Y_2 = 2 + ρ e_1 + √(1 − ρ²) e_2,
Y_3 = Y_1 + e_3,
Y_4 = −1 + 0.5 Y_3 + e_4.
We set ρ = 0.5; e_1 and e_2 are generated from a standard normal
distribution; e_3 is generated from a standard exponential distribution; and
e_4 is generated from a normal distribution N(0, 3/2).
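The data-generating model above can be sketched directly (in Python for illustration, reading N(0, 3/2) as variance 3/2, an interpretation assumed here):

```python
import random, math

def generate(n, rho, rng):
    # Data-generating model from the slide: e1, e2 standard normal,
    # e3 standard exponential, e4 ~ N(0, 3/2) (variance 3/2).
    data = []
    for _ in range(n):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        e3 = rng.expovariate(1.0)
        e4 = rng.gauss(0, math.sqrt(1.5))
        y1 = 1 + e1
        y2 = 2 + rho * e1 + math.sqrt(1 - rho ** 2) * e2
        y3 = y1 + e3
        y4 = -1 + 0.5 * y3 + e4
        data.append((y1, y2, y3, y4))
    return data

rng = random.Random(11)
daty = generate(100, rho=0.5, rng=rng)   # n = 100, as in the example
```

The √(1 − ρ²) scaling gives Corr(Y_1, Y_2) = ρ while keeping both marginal variances equal to one.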
35. FHDI: Implementation 3
> library(FHDI)
> example(FHDI)
> summary(daty)
y1 y2 y3 y4
Min. :-1.6701 Min. :0.02766 Min. :-1.4818 Min. :-2.920292
1st Qu.: 0.4369 1st Qu.:1.03796 1st Qu.: 0.9339 1st Qu.:-0.781067
Median : 0.8550 Median :1.79693 Median : 1.7246 Median :-0.121467
Mean : 0.9821 Mean :1.93066 Mean : 1.7955 Mean :-0.006254
3rd Qu.: 1.6171 3rd Qu.:2.71396 3rd Qu.: 2.5172 3rd Qu.: 0.787863
Max. : 3.1312 Max. :5.07103 Max. : 5.3347 Max. : 4.351372
NA’s :42 NA’s :34 NA’s :18 NA’s :11
42. FHDI: Implementation 10
Table: Regression (y1 ∼ y2) coefficient estimates with standard errors.
Estimator Intercept (S.E.) Slope (S.E.)
Naive -0.074 (0.305) 0.588 (0.142)
FEFI 0.035 (0.251) 0.466 (0.094)
FHDI 0.023 (0.252) 0.472 (0.095)
True 0 0.5
The FEFI and FHDI estimators produce smaller standard errors compared
to the naive estimator.
The FHDI estimator approximates the FEFI estimator well.
44. Numerical illustration
A pseudo finite population constructed from a single month of data from
the Monthly Retail Trade Survey (MRTS) of the US Census Bureau
N = 7,260 retail business units in five strata
Three variables in the data
h: stratum
xhi : inventory values
yhi : sales
48. Stratified random sampling
Table: The sample allocation in stratified simple random sampling.
Strata 1 2 3 4 5
Strata size Nh 352 566 1963 2181 2198
Sample size nh 28 32 46 46 48
Sampling weight 12.57 17.69 42.67 47.41 45.79
49. Response mechanism: MAR
Variable xhi is always observed and only yhi is subject to missingness.
PMAR:
R_hi ∼ Bernoulli(π_hi), π_hi = 1/[1 + exp{4 − 0.3 log(x_hi)}].
The overall response rate is about 0.6.
50. Simulation Study
Table 1 Monte Carlo bias and variance of the point estimators.
Parameter Estimator Bias Variance Std Var
Complete sample 0.00 0.42 100
θ = E(Y ) MI 0.00 0.59 134
FI 0.00 0.58 133
Table 2 Monte Carlo relative bias of the variance estimator.
Parameter Imputation Relative bias (%)
V (ˆθ) MI 18.4
FI 2.7
51. Discussion
Rubin's formula is based on the following decomposition:
V(θ̂_MI) = V(θ̂_n) + V(θ̂_MI − θ̂_n)
where θ̂_n is the complete-sample estimator of θ. Basically, the W_M term
estimates V(θ̂_n) and the (1 + M^{−1})B_M term estimates V(θ̂_MI − θ̂_n).
In the general case, we have
V(θ̂_MI) = V(θ̂_n) + V(θ̂_MI − θ̂_n) + 2 Cov(θ̂_MI − θ̂_n, θ̂_n)
and Rubin's variance estimator ignores the covariance term. Thus, a
sufficient condition for the validity of Rubin's variance estimator is
Cov(θ̂_MI − θ̂_n, θ̂_n) = 0.
Meng (1994) called this condition the congeniality of θ̂_n.
Congeniality holds when θ̂_n is the MLE of θ (a self-efficient estimator).
52. Discussion
For example, there are two estimators of θ = E(Y) when log(Y)
follows N(β_0 + β_1 x, σ²):
1 Maximum likelihood method:
θ̂_MLE = n^{−1} Σ_{i=1}^{n} exp{β̂_0 + β̂_1 x_i + 0.5 σ̂²}
2 Method of moments:
θ̂_MME = n^{−1} Σ_{i=1}^{n} y_i
The MME of θ = E(Y) does not satisfy congeniality, and Rubin's
variance estimator is biased (Yang and Kim, 2016).
Rubin's variance estimator is essentially unbiased for the MLE of θ (R.B.
= −1.9%), but the MLE is rarely used in practice.
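The two estimators can be sketched in the intercept-only case (β_1 = 0, an assumption of this sketch that drops the covariate x; the lognormal mean identity exp(μ + σ²/2) is standard):

```python
import math, random, statistics

def theta_mme(y):
    # Method of moments: the sample mean of y.
    return statistics.fmean(y)

def theta_mle(logy):
    # MLE of theta = E(Y) when log Y ~ N(mu, sigma^2); with no
    # covariate, theta = exp(mu + sigma^2 / 2), evaluated at the MLEs.
    mu = statistics.fmean(logy)
    s2 = statistics.fmean([(v - mu) ** 2 for v in logy])
    return math.exp(mu + 0.5 * s2)

rng = random.Random(2)
logy = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # log Y ~ N(0, 1)
y = [math.exp(v) for v in logy]
# Both estimators target theta = exp(0 + 1/2) = exp(0.5).
```

Both are consistent for θ; congeniality, not consistency, is what separates them when Rubin's variance formula is applied.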
54. Summary
Fractional imputation is developed as a frequentist imputation method.
Multiple imputation is motivated by a Bayesian framework. The
frequentist validity of multiple imputation requires congeniality.
Fractional imputation does not require the congeniality condition and
works well for method of moments estimators.
Fractional hot deck imputation is now implemented in SAS and R.
55. Future research
Fractional imputation using Gaussian finite mixture models.
Survey data integration: Extension of Kim and Rao (2012) and Kim,
Berg, and Park (2016).
Fractional imputation under model uncertainty.
56. Collaborators
Fractional Hot Deck Imputation
Jongho Im, Inho Cho, Wayne Fuller, Pushpal Mukhopadhyay
Survey Data Integration
Emily Berg, J.N.K. Rao, Seho Park
Review of Fractional Imputation, Congeniality
Shu Yang