When data are missing at random (MAR), complete-case analysis with the full-data estimating equation is in general not valid. To correct the bias, we can employ the inverse probability weighting (IPW) technique on the complete cases. This requires modeling the missingness probability given the observed data (call it the $\pi$ model). The resulting IPW estimator, however, ignores the information contained in cases with missing components, and is thus statistically inefficient. Efficiency can be improved by modifying the estimating equation along the lines of the semiparametric efficiency theory of Bickel et al. (1993). This modification usually requires modeling the distribution of the missing components given the observed ones (call it the $\mu$ model). When both the $\pi$ and the $\mu$ models are correct, the modified estimator is valid and more efficient than the IPW one. Moreover, the modified estimator is "doubly robust" in the sense that it remains valid when either the $\pi$ model or the $\mu$ model is correct.
The essential material in these slides is drawn from the book Semiparametric Theory and Missing Data (Tsiatis, 2006). The slides were originally presented in Spring 2013 as a final project in the class BIOS 773, Statistical Analysis with Missing Data, at UNC Chapel Hill.
Double Robustness: Theory and Applications with Missing Data

Lu Mao
Department of Biostatistics
The University of North Carolina at Chapel Hill
Email: lmao@unc.edu
April 17, 2013
Table of Contents

Part I: A Semiparametric Perspective
  A motivating example
  Semiparametric approaches to coarsened data
  Constructing the estimating equation

Part II: Applications in Missing Data Problems
  Data with two levels of missingness
  Monotone coarsened data
A Motivating Example

- Given an iid sample $Y_1,\dots,Y_n$ from an arbitrary distribution, consider the estimation of the population mean $\mu = E(Y)$ by $\bar Y$, which solves
$$\mathbb{P}_n(Y - \mu) = 0,$$
where $\mathbb{P}_n Z \equiv n^{-1}\sum_{i=1}^n Z_i$.
- Suppose some of the $Y_i$'s are missing. Let $R_i = 1$ if $Y_i$ is observed and $R_i = 0$ otherwise. Let $\pi(Y) = P(R = 1\mid Y)$. Now consider estimating $\mu$ by solving
$$\mathbb{P}_n\{R(Y - \mu)\} = 0,$$
resulting in
$$\hat\mu_{CC} = \frac{\sum_i R_i Y_i}{\sum_i R_i} \;\to_p\; \frac{E[\pi(Y)\,Y]}{E[\pi(Y)]} \neq \mu.$$
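As a quick numerical illustration (a hypothetical setup, not from the slides: $Y$ normal with mean 1 and a logistic $\pi(Y)$), the complete-case bias shows up immediately. All code sketches in these notes are minimal Python/NumPy illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Y = rng.normal(loc=1.0, scale=1.0, size=n)   # true mean mu = 1
pi_Y = 1 / (1 + np.exp(-(0.5 + 1.0 * Y)))    # P(R=1|Y) increases with Y (MNAR-style pi(Y))
R = rng.binomial(1, pi_Y)

mu_cc = Y[R == 1].mean()                     # complete-case estimate
print(mu_cc)   # noticeably above 1: converges to E[pi(Y)Y]/E[pi(Y)] != mu
```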
- Suppose that in addition to $Y_i$, an auxiliary variable $X_i$ is also collected, and $R \perp Y \mid X$. Assume $P(R = 1\mid Y, X) = \pi(X;\psi)$. To correct the bias, we apply the estimating equation ($\hat\psi_n$ is a consistent estimator for $\psi_0$)
$$\phi^{IPW}_n = \mathbb{P}_n\left\{\frac{R}{\pi(X;\hat\psi_n)}(Y - \mu)\right\},$$
resulting in
$$\hat\mu_{IPW} = \frac{\mathbb{P}_n[RY/\pi(X;\hat\psi_n)]}{\mathbb{P}_n[R/\pi(X;\hat\psi_n)]}
= \frac{\mathbb{P}_n[RY/\pi(X;\psi_0)]}{\mathbb{P}_n[R/\pi(X;\psi_0)]} + o_p(1)
\;\to_p\; \frac{E[RY/\pi(X;\psi_0)]}{E[R/\pi(X;\psi_0)]} = \mu. \tag{1}$$
- Assume a working model $\mu(X;\theta) = E(Y\mid X;\theta)$, and consider a new estimating equation as a modification of $\phi^{IPW}_n$:
$$\phi^{DR}_n = \mathbb{P}_n\left\{\frac{R}{\pi(X;\hat\psi_n)}(Y - \mu) - \frac{R - \pi(X;\hat\psi_n)}{\pi(X;\hat\psi_n)}\,\bigl(\mu(X;\hat\theta_n) - \mu\bigr)\right\}, \tag{2}$$
resulting in
$$\hat\mu_{DR} = \mathbb{P}_n\left\{\frac{R}{\pi(X;\hat\psi_n)}\,Y - \frac{R - \pi(X;\hat\psi_n)}{\pi(X;\hat\psi_n)}\,\mu(X;\hat\theta_n)\right\}. \tag{3}$$
Now let's study the consistency of $\hat\mu_{DR}$ under different assumptions.
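A minimal sketch of estimators (1) and (3), assuming a hypothetical design in which the true $\pi(X)$ and $\mu(X)$ are plugged in (estimated versions behave the same to first order):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=n)
Y = 1.0 + X + rng.normal(size=n)             # mu(X) = E[Y|X] = 1 + X, so mu = 1
pi_X = 1 / (1 + np.exp(-(0.2 + X)))          # P(R=1|Y,X) = pi(X): MAR given X
R = rng.binomial(1, pi_X)

# IPW estimator (1): ratio of inverse-probability-weighted sums over complete cases
mu_ipw = np.sum(R * Y / pi_X) / np.sum(R / pi_X)

# DR estimator (3): IPW term augmented with the regression prediction mu(X)
mu_X = 1.0 + X
mu_dr = np.mean(R * Y / pi_X - (R - pi_X) / pi_X * mu_X)

print(mu_ipw, mu_dr)                         # both close to 1
```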
- Scenario 1: $\pi(X;\psi)$ correct, $\mu(X;\theta)$ incorrect. So $\hat\psi_n \to_p \psi_0$, but $\hat\theta_n \to \theta^*$ with $\mu(X;\theta^*) \neq E(Y\mid X)$:
$$\begin{aligned}
\hat\mu_{DR} &= \mathbb{P}_n\left\{\frac{R}{\pi(X;\psi_0)}\,Y - \frac{R - \pi(X;\psi_0)}{\pi(X;\psi_0)}\,\mu(X;\theta^*)\right\} + o_p(1)\\
&\to_p E\left[\frac{R}{\pi(X;\psi_0)}\,Y\right] - E\left[\frac{R - \pi(X;\psi_0)}{\pi(X;\psi_0)}\,\mu(X;\theta^*)\right]\\
&= E(Y) - 0 = \mu, \tag{4}
\end{aligned}$$
since $E(R\mid Y, X) = \pi(X;\psi_0)$ makes the first expectation $E(Y)$ and annihilates the second.
- Scenario 2: $\mu(X;\theta)$ correct, $\pi(X;\psi)$ incorrect. So $\hat\theta_n \to_p \theta_0$, but $\hat\psi_n \to \psi^*$ with $\pi(X;\psi^*) \neq P(R = 1\mid Y, X)$:
$$\begin{aligned}
\hat\mu_{DR} &= \mathbb{P}_n\left\{\frac{R}{\pi(X;\psi^*)}\,Y - \frac{R - \pi(X;\psi^*)}{\pi(X;\psi^*)}\,\mu(X;\theta_0)\right\} + o_p(1)\\
&\to_p E\left[\frac{R}{\pi(X;\psi^*)}\,Y\right] - E\left[\frac{R - \pi(X;\psi^*)}{\pi(X;\psi^*)}\,\mu(X;\theta_0)\right]\\
&= E\left[\frac{R}{\pi(X;\psi^*)}\,E(Y\mid R, X)\right] - E\left[\frac{R - \pi(X;\psi^*)}{\pi(X;\psi^*)}\,\mu(X;\theta_0)\right]\\
&= E\left[\frac{R}{\pi(X;\psi^*)}\,\mu(X;\theta_0)\right] - E\left[\frac{R - \pi(X;\psi^*)}{\pi(X;\psi^*)}\,\mu(X;\theta_0)\right]\\
&= E[\mu(X;\theta_0)]\\
&= \mu, \tag{5}
\end{aligned}$$
using $E(Y\mid R, X) = E(Y\mid X) = \mu(X;\theta_0)$ under MAR.
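The two scenarios can be mimicked numerically by deliberately feeding the DR formula one wrong model at a time (hypothetical misspecifications; the IPW estimator is shown for contrast):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
X = rng.normal(size=n)
Y = 1.0 + X + rng.normal(size=n)              # mu = 1
pi0 = 1 / (1 + np.exp(-(0.2 + X)))            # true pi(X; psi0)
R = rng.binomial(1, pi0)

def mu_dr(pi, muX):
    """DR estimator (3) with supplied pi and mu models."""
    return np.mean(R * Y / pi - (R - pi) / pi * muX)

mu_true, pi_true = 1.0 + X, pi0
mu_bad = np.zeros(n)                          # wrong outcome model
pi_bad = np.full(n, 0.5)                      # wrong missingness model

print(mu_dr(pi_true, mu_bad))                 # Scenario 1: pi right, mu wrong -> ~1
print(mu_dr(pi_bad, mu_true))                 # Scenario 2: mu right, pi wrong -> ~1
print(np.sum(R * Y / pi_bad) / np.sum(R / pi_bad))   # IPW with wrong pi -> biased
```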
Result 1 (Double robustness)
$\hat\mu_{DR}$ is consistent if either the $\pi$ model or the $\mu$ model is correct, that is, under $\mathcal{M}_1 \cup \mathcal{M}_2$, where $\mathcal{M}_1 = \{p(r\mid y, x;\psi) : \psi \in \Psi_1\}$ and $\mathcal{M}_2 = \{p(y\mid x;\theta) : \theta \in \Theta_2\}$. In other words, $\hat\mu_{DR}$ is doubly robust.

- Now, let's consider a somewhat different question: efficiency under $\mathcal{M}_1 \cap \mathcal{M}_2$. For simplicity, we assume we know the true values $(\psi_0, \theta_0)$.
- Denote $\mathbb{G}_n g(Z) = n^{-1/2}\sum_{i=1}^n \{g(Z_i) - E g(Z)\}$. Algebraic manipulations yield
$$\sqrt{n}(\hat\mu_{IPW} - \mu) = \frac{1}{\mathbb{P}_n[R/\pi(X;\psi_0)]}\,\mathbb{G}_n\left\{\frac{R}{\pi(X;\psi_0)}(Y - \mu)\right\} \rightsquigarrow N(0, \sigma^2_{IPW}), \tag{6}$$
where
$$\sigma^2_{IPW} = E\left\{\left[\frac{R}{\pi(X;\psi_0)}(Y - \mu)\right]^2\right\} = E\left[\frac{(Y - \mu)^2}{\pi(X;\psi_0)}\right].$$
- Similarly,
$$\sqrt{n}(\hat\mu_{DR} - \mu) = \mathbb{G}_n\left\{\frac{R}{\pi(X;\psi_0)}(Y - \mu) - \frac{R - \pi(X;\psi_0)}{\pi(X;\psi_0)}\bigl(\mu(X;\theta_0) - \mu\bigr)\right\} \rightsquigarrow N(0, \sigma^2_{DR}), \tag{7}$$
where
$$\begin{aligned}
\sigma^2_{DR} ={}& E\left\{\left[\frac{R}{\pi(X;\psi_0)}\right]^2 (Y - \mu)^2\right\}
- 2E\left\{\frac{R}{\pi(X;\psi_0)}(Y - \mu)\cdot\frac{R - \pi(X;\psi_0)}{\pi(X;\psi_0)}\bigl(\mu(X;\theta_0) - \mu\bigr)\right\}\\
&+ E\left\{\left[\frac{R - \pi(X;\psi_0)}{\pi(X;\psi_0)}\right]^2 \bigl(\mu(X;\theta_0) - \mu\bigr)^2\right\}.
\end{aligned}$$
- Simplifying term by term (using $E(R\mid X) = \pi(X;\psi_0)$ and $R \perp Y \mid X$),
$$\begin{aligned}
\sigma^2_{DR} &= E\left[\frac{(Y - \mu)^2}{\pi(X;\psi_0)}\right]
- 2E\left[\frac{1 - \pi(X;\psi_0)}{\pi(X;\psi_0)}\bigl(\mu(X;\theta_0) - \mu\bigr)^2\right]
+ E\left[\frac{1 - \pi(X;\psi_0)}{\pi(X;\psi_0)}\bigl(\mu(X;\theta_0) - \mu\bigr)^2\right]\\
&= \sigma^2_{IPW} - E\left[\frac{1 - \pi(X;\psi_0)}{\pi(X;\psi_0)}\bigl(\mu(X;\theta_0) - \mu\bigr)^2\right].
\end{aligned}$$
- Let $\varphi^{IPW} = \frac{R}{\pi(X;\psi_0)}(Y - \mu)$, $A = -\frac{R - \pi(X;\psi_0)}{\pi(X;\psi_0)}\bigl(\mu(X;\theta_0) - \mu\bigr)$, and $\varphi^{DR} = \varphi^{IPW} + A$. Consider the Hilbert space $L_2(P)$. Since $\hat\mu_{IPW}$ and $\hat\mu_{DR}$ have influence functions $\varphi^{IPW}$ and $\varphi^{DR}$ respectively, their squared lengths ($\|\cdot\|^2 \equiv E(\cdot)^2$) are the asymptotic variances of $\hat\mu_{IPW}$ and $\hat\mu_{DR}$.
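A Monte Carlo check of the variance identity $\sigma^2_{DR} = \sigma^2_{IPW} - E[\{1-\pi(X;\psi_0)\}/\pi(X;\psi_0)\,(\mu(X;\theta_0)-\mu)^2]$, in the same hypothetical design as before:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
X = rng.normal(size=n)
Y = 1.0 + X + rng.normal(size=n)
pi = 1 / (1 + np.exp(-(0.2 + X)))
R = rng.binomial(1, pi)
mu, mu_X = 1.0, 1.0 + X

phi_ipw = R / pi * (Y - mu)                       # IPW influence function
phi_dr = phi_ipw - (R - pi) / pi * (mu_X - mu)    # DR influence function

gap = np.mean((1 - pi) / pi * (mu_X - mu) ** 2)   # the efficiency gain term
print(phi_ipw.var(), phi_dr.var(), phi_ipw.var() - gap)  # last two agree
```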
The following figure provides a geometric illustration.

[Figure: a geometric interpretation of the efficiency improvement by the DR estimator.]

Result 2 (Efficiency of DR)
$\hat\mu_{DR}$ is more efficient than $\hat\mu_{IPW}$ under $\mathcal{M}_1 \cap \mathcal{M}_2$.
Remark 1.1
The above example suggests that:
- For a full data problem, there is a natural extension, via the IPW (inverse probability weighting) method, to a corresponding missing data problem;
- By positing a working model $p(z_{mis}\mid z_{obs};\theta)$, the IPW estimating equation can be modified by adding a suitable augmentation term, resulting in an estimator that is still consistent even if the working model $p(z_{mis}\mid z_{obs};\theta)$ is not correct;
- If $p(z_{mis}\mid z_{obs};\theta)$ is correct, the new estimator is consistent even if the missingness mechanism is incorrectly modeled. In this sense, the new estimator is doubly robust;
- The doubly robust estimator has improved efficiency if both models are correct.
Semiparametric Approaches to Coarsened Data

- First we introduce the terminology of coarsening, which contains missing data as a special case:

Definition 1.2 (Coarsening)
Suppose the full data consist of iid observations of an $l$-dimensional random vector $Z$. Define a coarsening variable $C$ such that when $C = r$, we only observe $G_r(Z)$, where $G_r(\cdot)$ is a many-to-one function. Further denote $C = \infty$ if $Z$ is completely observed (no coarsening), that is, $G_\infty(Z) = Z$. Thus, the observed data consist of iid copies of $(C, G_C(Z))$.

Definition 1.3 (Coarsening at random)
The data are said to be coarsened at random (CAR) if $C \perp Z \mid G_C(Z)$.

Remark 1.4 (Assumption)
All problems considered are under the assumption of CAR.
Terminology
- $Z$: full data;
- $G_C(Z)$: observed data;
- $(C, G_C(Z))$: coarsened data.

- Semiparametric models arise naturally in coarsened data problems.
- Consider a full data regression model, $z = (y, x)'$:
$$p(z\mid\beta, F) = p(y\mid x;\beta)\,dF(x),$$
where $\beta$ is finite dimensional and the nuisance is infinite dimensional (e.g., an arbitrary cdf $F$ for $x$).
- Now suppose some components of $x$ are missing (at random); the likelihood then becomes $q(y, x_{obs}, r\mid \beta, F)$, in which the infinite dimensional nuisance cannot be ignored. Hence we have arrived at a semiparametric model.
- Let's review some basic theory of semiparametric inference.
Remark 1.10 (Z-estimation with estimated nuisance)
In the presence of a nuisance parameter $\eta$, the estimating equation generally involves $\eta$. A natural strategy is to insert a consistent estimator $\hat\eta_n$ and solve $\mathbb{P}_n\,\varphi(Z;\beta,\hat\eta_n) = 0$.

- Similarly, we have the following theorem about $\Lambda$:
Theorem 1.11 (Characterization of $\Lambda$)
The coarsened data tangent space for $\eta$ is characterized by
$$\Lambda = \{E[\alpha^F(Z)\mid C, G_C(Z)] : \alpha^F \in \Lambda^F\}. \tag{10}$$

- Remember that the important task is to characterize $\Lambda^\perp$, which will aid us in constructing coarsened data estimating equations for $\beta$.

Theorem 1.12 (Characterization of $\Lambda^\perp$)
The space $\Lambda^\perp$ consists of all elements $h(C, G_C(Z)) \in \mathcal{H}$ such that
$$E[h(C, G_C(Z))\mid Z] \in \Lambda^{F\perp}. \tag{11}$$
Proof.
By Theorem 1.11, the space $\Lambda^\perp$ consists of all elements $h(C, G_C(Z)) \in \mathcal{H}$ such that
$$E\{h(C, G_C(Z))\,E[\alpha^F(Z)\mid C, G_C(Z)]\} = 0, \quad \forall\,\alpha^F(Z) \in \Lambda^F.$$
This is equivalent to $E\{h(C, G_C(Z))\,\alpha^F(Z)\} = 0$, which is equivalent to $E\{\alpha^F(Z)\,E[h(C, G_C(Z))\mid Z]\} = 0$. $\square$

Remark 1.13 (A linear operator perspective)
Define the linear operator $\mathcal{K} : \mathcal{H} \to \mathcal{H}^F$ by $\mathcal{K}(\cdot) = E[\cdot\mid Z]$. Then
$$\Lambda^\perp = \mathcal{K}^{-1}(\Lambda^{F\perp}). \tag{12}$$
Definition 1.15 (Augmentation space)
We denote $\mathcal{A} = \mathcal{K}^{-1}(0)$, and call it the augmentation space.

Corollary 1.16
Assume $\pi(\infty, Z;\psi_0) = P(C = \infty\mid Z;\psi_0) \geq \epsilon > 0$ a.s. Then
$$\mathcal{K}^{-1}(\Lambda^{F\perp}) = \left\{\frac{I(C=\infty)}{\pi(\infty, Z;\psi_0)}\,\varphi^*(Z) + h : \varphi^* \in \Lambda^{F\perp},\ h \in \mathcal{A}\right\}.$$
- Suppose $\hat\psi_n$ is an efficient estimator of $\psi_0$. Take $h \equiv 0$, and we obtain the inverse probability weighted (IPW) estimating equation
$$\phi^{IPW}_n = \mathbb{P}_n\left\{\frac{I(C=\infty)}{\pi(\infty, Z;\psi_0)}\,\varphi^*(Z)\right\}.$$
In practice, the choice of $h \in \mathcal{A}$ will be based on efficiency considerations; the efficiency-improving choice leads to an estimating function with influence function of the form
$$\varphi^{DR}(C, G_C(Z)) = \frac{I(C=\infty)}{\pi(\infty, Z;\psi_0)}\,\varphi^*(Z) - \Pi\left[\frac{I(C=\infty)}{\pi(\infty, Z;\psi_0)}\,\varphi^*(Z)\,\middle|\,\mathcal{A}\right]. \tag{15}$$
- Typically, calculating the projection $\Pi[\cdot\mid\mathcal{A}]$ requires us to posit working parametric models $p(z\mid\theta)$. But the DR estimating equation will still be valid even if $p(z\mid\theta)$ does not contain the truth.
- We conclude this section with a theorem characterizing the augmentation space $\mathcal{A}$.

Theorem 1.20 (Characterization of $\mathcal{A}$)
The space $\mathcal{A}$ consists of all elements that can be written as
$$\sum_{r\neq\infty}\left\{\frac{I(C=\infty)\,\pi(r, G_r(Z))}{\pi(\infty, Z)} - I(C = r)\right\} h_r(G_r(Z)), \tag{16}$$
where $h_r(G_r(Z))$ is an arbitrary function of $G_r(Z)$.

Proof.
See Theorem 7.2 of Tsiatis (2006).
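The defining property of $\mathcal{A}$, namely $E[\cdot\mid Z] = 0$, can be checked numerically. Below is a minimal sketch (a hypothetical toy setup, not from the slides) with one coarsened level:

```python
import numpy as np

# Toy check of Theorem 1.20's defining property E[. | Z] = 0, with two
# coarsening levels: C = inf observes Z = (Z1, Z2); C = 2 observes G2(Z) = Z1.
rng = np.random.default_rng(4)
Z1, Z2 = 0.3, -1.2                              # one fixed full-data point
pi_inf = 1 / (1 + np.exp(-Z1))                  # pi(inf, Z): depends on Z1 only (CAR)
pi_2 = 1 - pi_inf                               # pi(2, G2(Z))
h2 = Z1 ** 2 + 1.0                              # arbitrary function h2(G2(Z))

full = rng.binomial(1, pi_inf, size=2_000_000)  # draws of I(C = inf) given Z
elem = full * pi_2 / pi_inf * h2 - (1 - full) * h2
print(elem.mean())                              # ~0: the element has E[. | Z] = 0
```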
Data with Two Levels of Missingness

- Suppose $Z = (Z_1, Z_2)$ and $Z_2$ is missing on some observations. Denote $R = 1$ if $Z_2$ is observed and $R = 0$ otherwise. Let $\pi(Z_1;\psi_0) = P(R = 1\mid Z;\psi_0)$. In this case the projection needed in (15) can be calculated explicitly:
$$\Pi\left[\frac{R}{\pi(Z_1;\psi_0)}\,\varphi^*(Z)\,\middle|\,\mathcal{A}\right] = \frac{R - \pi(Z_1;\psi_0)}{\pi(Z_1;\psi_0)}\,E[\varphi^*(Z)\mid Z_1],$$
exactly as in the motivating example.
- To compute $E[\varphi^*(Z)\mid Z_1]$, we need to posit a parametric model $p(z\mid\theta)$, or at least $p(Z_2\mid Z_1;\theta)$, and find a consistent estimator $\hat\theta_n$ for $\theta_0$. Then the projection can be computed via $E[\varphi^*(Z)\mid Z_1;\hat\theta_n]$. We should note that the parametric model needs to be consistent with the original semiparametric model. Similar to the motivating example, we can show that the resulting estimating equation is doubly robust to $p(r\mid z;\psi)$ and $p(z\mid\theta)$.
Example: Logistic Regression with a Missing Covariate

- Consider a logistic regression model whose linear predictor includes $\beta_2 X_2$, where $X_2$ is a real-valued continuous covariate that is missing on some subjects; $(Y, X_1^T)^T$ is always observed.
- The full data model is $p(y\mid x;\beta)\,\eta(x)$. Let $X = (1, X_1^T, X_2)^T$. The full data estimating equation is
$$\mathbb{P}_n\left\{X\left(Y - \frac{e^{\beta^T X}}{1 + e^{\beta^T X}}\right)\right\} = 0.$$
- To use IPW, we posit a logistic regression for the missingness mechanism:
$$P(R = 1\mid Y, X;\psi) = \frac{e^{\psi_0 + \psi_1 Y + \psi_2^T X_1}}{1 + e^{\psi_0 + \psi_1 Y + \psi_2^T X_1}}.$$
The MLE $\hat\psi_n$ can be computed by solving $\mathbb{P}_n S_\psi = 0$, where
$$S_\psi = \begin{pmatrix}1\\ Y\\ X_1\end{pmatrix}\left(R - \frac{e^{\psi_0 + \psi_1 Y + \psi_2^T X_1}}{1 + e^{\psi_0 + \psi_1 Y + \psi_2^T X_1}}\right);
$$
a Newton-Raphson sketch for this score equation is given below.
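A minimal Newton-Raphson sketch for solving $\mathbb{P}_n S_\psi = 0$; the function name and interface are hypothetical:

```python
import numpy as np

def fit_logistic(W, R, n_iter=25):
    """Solve the score equation Pn S_psi = 0 for P(R=1|W) = expit(W @ psi)
    by Newton-Raphson. W should include a leading column of ones."""
    psi = np.zeros(W.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(W @ psi)))
        score = W.T @ (R - p)                       # n * (Pn S_psi)
        hess = (W * (p * (1 - p))[:, None]).T @ W   # observed information
        psi += np.linalg.solve(hess, score)
    return psi

# hypothetical usage, with W = [1, Y, X1] as in the score on the left:
# psi_hat = fit_logistic(np.column_stack([np.ones(n), Y, X1]), R)
```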
- To construct the DR estimating equation, we need to compute the conditional expectation
$$E\left[\,X\left(Y - \frac{e^{\beta^T X}}{1 + e^{\beta^T X}}\right)\,\middle|\, Y, X_1\right].$$
- Therefore we need to posit a working model for $p(x\mid\theta)$, or at least for $p(z_2\mid y, z_1;\theta)$. If we do the latter, we should be aware that $p(z_2\mid y, z_1;\theta)$ must be compatible with the regression model $p(y\mid x;\beta)$. In fact, if the covariate distribution is multivariate normal, we can show that $x\mid y$ is multivariate normal. This motivates the following working model:
$$X_2\mid Y, X_1 \sim N(\theta_0 + \theta_1 Y + \theta_2^T X_1,\ \theta_3).$$
The MLE $\hat\theta_n$ is easily computed by least squares on the complete cases (see the sketch after this slide).
- Finally we need to compute
$$E\left[\,X\left(Y - \frac{e^{\beta^T X}}{1 + e^{\beta^T X}}\right)\,\middle|\, Y, X_1;\hat\theta_n\right],$$
which under the normal working model reduces to one-dimensional integrals over $X_2$.
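A sketch of the complete-case least squares fit for the working model (hypothetical helper):

```python
import numpy as np

def fit_x2_given_y_x1(Y, X1, X2, R):
    """Complete-case least squares for the working model
    X2 | Y, X1 ~ N(theta0 + theta1*Y + theta2'X1, theta3)."""
    obs = R == 1
    D = np.column_stack([np.ones(int(obs.sum())), Y[obs], X1[obs]])
    coef, *_ = np.linalg.lstsq(D, X2[obs], rcond=None)   # (theta0, theta1, theta2)
    resid = X2[obs] - D @ coef
    sigma2 = resid @ resid / (obs.sum() - D.shape[1])    # theta3
    return coef, sigma2
```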
Monotone Coarsened Data

Definition 2.3 (Monotone coarsening)
If we can order the levels of coarsening in such a way that $G_r(Z)$ is a coarsened version of $G_{r+1}(Z)$, $r = 1, 2, \dots$, that is,
$$G_r(Z) = f_r(G_{r+1}(Z)),$$
where $f_r$ is a many-to-one function, then the coarsening is said to be monotone.

Example 2.4 (Monotone missingness in longitudinal data)
When a subject is followed over time, we observe $(Y_1,\dots,Y_k)$, where $Y_j$ is the measurement at the $j$th time point. Incomplete data arise if a subject is lost to follow-up at a certain point. In this case, if a measurement is missing at the $r$th time point, then all measurements after that will be missing.
    C = r       G_r(Z)
    1           $Y_1$
    2           $Y_1, Y_2$
    ...         ...
    k - 1       $Y_1, \dots, Y_{k-1}$
    $\infty$    $Y_1, \dots, Y_k$

For monotone coarsened data, it is natural and convenient to model missingness via the discrete hazard function
$$\lambda_r(G_r) = \begin{cases} P(C = r\mid C \geq r, Z), & r \neq \infty,\\ 1, & r = \infty.\end{cases}$$
Define
$$K_r(G_r) = P(C > r\mid Z) = \prod_{j=1}^{r}\{1 - \lambda_j(G_j)\}.$$
Then the $\pi$ function can be expressed as
$$\pi(r, G_r(Z)) = K_{r-1}(G_{r-1}(Z))\,\lambda_r(G_r), \qquad K_0 \equiv 1.$$
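A small sketch converting discrete hazards into $K_r$ and the level probabilities $\pi(r, G_r)$ (hypothetical helper; the hazard values are made up):

```python
import numpy as np

def coarsening_probs(lam):
    """Given hazards lam[r-1] = lambda_r = P(C = r | C >= r, Z) for
    r = 1..k-1, return K_r = P(C > r | Z), pi(r, G_r) = K_{r-1} * lambda_r,
    and pi(inf, Z) = K_{k-1}."""
    lam = np.asarray(lam)
    K = np.concatenate([[1.0], np.cumprod(1 - lam)])  # K_0 = 1, K_r = prod_{j<=r}(1-lam_j)
    pi = K[:-1] * lam                                  # P(C = r | Z), r = 1..k-1
    pi_complete = K[-1]                                # P(C = inf | Z)
    return K, pi, pi_complete

# example with three dropout levels and hazards 0.2, 0.1, 0.05:
K, pi, pc = coarsening_probs([0.2, 0.1, 0.05])
print(pi, pc, pi.sum() + pc)                           # probabilities sum to 1
```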
- As in the case with two levels of missingness, we first need to characterize the augmentation space $\mathcal{A}$ using Theorem 1.20, and then use the characterization to derive $\Pi(\varphi^{IPW}\mid\mathcal{A})$. We state the end result in the following theorem.

Theorem 2.5 ($\Pi(\varphi^{IPW}\mid\mathcal{A})$ in monotone coarsened data)
The projection of $\frac{I(C=\infty)}{\pi(\infty, Z)}\,\varphi^*(Z)$ onto $\mathcal{A}$ is
$$-\sum_{r\neq\infty}\frac{I(C = r) - \lambda_r(G_r)\,I(C \geq r)}{K_r(G_r)}\,E[\varphi^*(Z)\mid G_r(Z)],$$
so that the DR estimating function adds the corresponding augmentation sum to the IPW term.

- To compute $E[\varphi^*(Z)\mid G_r(Z)]$ we need to posit a parametric working model $p(z\mid\theta)$, or at least a series of conditional models $p(g_{r+1}\mid g_r;\theta_r)$.
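A sketch of one subject's DR estimating-function contribution assembled from Theorem 2.5 (hypothetical interface; the hazards and conditional expectations would come from the fitted models):

```python
import numpy as np

def dr_term(C, lam, cond_exp, phi_star, k):
    """One subject's DR contribution under monotone coarsening: the IPW term
    plus the augmentation from Theorem 2.5. C: dropout level in 1..k-1, or
    np.inf if complete; lam[r-1]: fitted hazard lambda_r at the subject's
    observed history; cond_exp[r-1]: E[phi*(Z)|G_r(Z)] from the working
    model; phi_star: phi*(Z), used only when C = inf. Arrays cover the
    levels the subject reached."""
    r_max = k - 1 if C == np.inf else int(C)
    K = np.cumprod(1.0 - np.asarray(lam[:r_max]))    # K_r = P(C > r | Z)
    val = phi_star / K[-1] if C == np.inf else 0.0   # I(C=inf)/pi(inf,Z) term
    for r in range(1, r_max + 1):                    # subject is at risk at each r
        val += (float(C == r) - lam[r - 1]) / K[r - 1] * cond_exp[r - 1]
    return val
```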
Remark 2.6 (Modeling the coarsening hazard)
Instead of modeling the coarsening probability directly, we model the discrete hazard
$$P(C = r\mid C \geq r, Z;\psi_r) = \lambda_r(G_r;\psi_r).$$
With monotone missing longitudinal data, for example, we may apply the logistic model
$$\lambda_r(G_r;\psi_r) = \frac{e^{\psi_{0r} + \psi_{1r}Y_1 + \cdots + \psi_{rr}Y_r}}{1 + e^{\psi_{0r} + \psi_{1r}Y_1 + \cdots + \psi_{rr}Y_r}}.$$
The likelihood for $C$ now has the following form:
$$\prod_r \prod_{i : C_i \geq r} \lambda_r(G_r(Z_i);\psi_r)^{I(C_i = r)}\,\{1 - \lambda_r(G_r(Z_i);\psi_r)\}^{I(C_i \geq r) - I(C_i = r)}.$$
Note that the likelihood factorizes over the $\psi_r$, so maximization can be done separately for each level (see the sketch below).
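A sketch of the separate per-level maximizations (hypothetical helper; it reuses the fit_logistic sketch from the covariate example):

```python
import numpy as np

def fit_hazards(Y, C, k, fit_logistic):
    """Fit the level-r dropout models separately, each by ordinary logistic
    regression on the at-risk set {i : C_i >= r}. Y is n x k; C holds
    dropout levels (np.inf for completers)."""
    psi_hat = []
    for r in range(1, k):
        at_risk = C >= r                            # completers have C = inf
        W = np.column_stack([np.ones(int(at_risk.sum())), Y[at_risk, :r]])
        d = (C[at_risk] == r).astype(float)         # I(C_i = r)
        psi_hat.append(fit_logistic(W, d))          # solves Pn S_{psi_r} = 0
    return psi_hat
```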
- If we use logistic regression for monotone missing longitudinal data, the likelihood is given by
$$\prod_{r=1}^{k-1}\prod_{i : C_i \geq r}\frac{e^{(\psi_{0r} + \psi_{1r}Y_{1i} + \cdots + \psi_{rr}Y_{ri})\,I(C_i = r)}}{1 + e^{\psi_{0r} + \psi_{1r}Y_{1i} + \cdots + \psi_{rr}Y_{ri}}}.$$
Each $\psi_r$ can be estimated by a logistic regression on the data $\{i : C_i \geq r\}$, and the stacked score is
$$S_\psi = (S_{\psi_1}^T, \dots, S_{\psi_{k-1}}^T)^T.$$
- Now we look at the problem of double robustness. Let $\hat\psi_n \to \psi^*$ and $\hat\theta_n \to \theta^*$.

Theorem 2.7 (Double robustness of $\phi^{DR}_n$)
The DR estimating function has mean zero at the truth, provided either the hazard model is correct ($\psi^* = \psi_0$) or the working model for the full data is correct ($\theta^* = \theta_0$).
Example: A Longitudinal RCT with Dropout

- Tsiatis (2006) describes a randomized clinical trial on a new drug for HIV/AIDS. The primary outcome is the CD4 count, denoted by $Y$. We also denote by $X$ the indicator variable for treatment. Measurements of $Y$ are taken at baseline $t_1 = 0$ and at $l - 1$ subsequent time points, denoted $t_2,\dots,t_l$. We want to model the mean CD4 count as a function of treatment and time through a linear model
$$E[Y_{ji}\mid X_i] = D_j(X_i)\,\beta,$$
where $D(X)$ is the design matrix built from treatment and time, so that the full-data (GEE-type) estimating function is $D^T(X)\{Y - D(X)\beta\}$.
- Now suppose there is random dropout, and the mechanism is MAR.
- First we use a logistic regression for the dropout hazard,
$$\lambda_r(G_r;\psi_r) = \frac{e^{\psi_{0r} + \psi_{1r}Y_1 + \cdots + \psi_{rr}Y_r + \psi_{r+1,r}X}}{1 + e^{\psi_{0r} + \psi_{1r}Y_1 + \cdots + \psi_{rr}Y_r + \psi_{r+1,r}X}},$$
and obtain the MLE $\hat\psi_n$.
- Denote $\bar Y_r = (Y_1,\dots,Y_r)^T$ and $\tilde Y_r = (Y_{r+1},\dots,Y_l)^T$. From Theorem 2.5, we need to compute the conditional expectation
$$E[D^T(X)\{Y - D(X)\beta\}\mid \bar Y_r, X].$$
- If we posit the working model
$$Y\mid (X = k) \sim N(\mu_k, \Sigma),$$
and denote by $\Sigma_{\bar r\bar r}$ the variance of $\bar Y_r$ and by $\Sigma_{\tilde r\bar r}$ the covariance between $\tilde Y_r$ and $\bar Y_r$, then the required conditional expectation follows from the conditional multivariate normal formula:
$$E[\tilde Y_r\mid \bar Y_r, X = k;\hat\theta_n] = \tilde\mu_k + \Sigma_{\tilde r\bar r}\,\Sigma_{\bar r\bar r}^{-1}(\bar Y_r - \bar\mu_k). \tag{24}$$
- The asymptotic variance of $\hat\beta_n$ can be estimated using the sandwich-type estimator described in (23).
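A sketch of the conditional-normal computation in (24) (hypothetical helper):

```python
import numpy as np

def cond_mean_mvn(y_obs, mu, Sigma):
    """E[Y_tilde_r | Y_bar_r = y_obs] for Y ~ N(mu, Sigma), where the first
    r components are observed, via the conditional normal formula
    mu_tilde + Sigma_{tilde,bar} Sigma_{bar,bar}^{-1} (y_obs - mu_bar)."""
    r = len(y_obs)
    mu_bar, mu_tilde = mu[:r], mu[r:]
    S_bb, S_tb = Sigma[:r, :r], Sigma[r:, :r]
    return mu_tilde + S_tb @ np.linalg.solve(S_bb, y_obs - mu_bar)

# hypothetical usage with the fitted arm-specific mean mu_hat[k] and common Sigma_hat:
# y_fill = cond_mean_mvn(np.array([310.0, 295.0]), mu_hat[k], Sigma_hat)
```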
References

Bang H, Robins JM (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61, 962-973.
Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA (1993). Efficient and Adaptive Estimation for Semiparametric Models. Springer.
Kosorok MR (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer.
Lipsitz SR, Ibrahim JG, Zhao LP (1999). A weighted estimating equation for missing covariate data with properties similar to maximum likelihood. Journal of the American Statistical Association 94, 1147-1160.
Robins JM, Rotnitzky A (2001). Comment on the Bickel and Kwon article, "Inference for semiparametric models: Some questions and an answer." Statistica Sinica 11, 920-936.
Scharfstein DO, Rotnitzky A, Robins JM (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94, 1135-1146.
Tsiatis AA (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics.