Convergence of ABC methods
Christian P. Robert
Université Paris-Dauphine, University of Warwick, & IUF
joint work with D. Frazier, G. Martin & J. Rousseau
Intractable likelihoods
Cases when the likelihood function $f(y|\theta)$ is unavailable and when the completion step
\[ f(y|\theta) = \int f(y, z|\theta) \, dz \]
is impossible or too costly because of the dimension of $z$
⇒ MCMC cannot be implemented!
Illustration
Example (Ising & Potts models)
Potts model: if $y$ takes values on a grid $\mathcal{Y}$ of size $k^n$ and
\[ f(y|\theta) \propto \exp\Big( \theta \sum_{l \sim i} \mathbb{I}_{y_l = y_i} \Big) \]
where $l \sim i$ denotes a neighbourhood relation, $n$ moderately large prohibits the computation of the normalising constant $Z_\theta$
Special case of the intractable normalising constant, making the likelihood impossible to compute
The ABC method
Bayesian setting: target is $\pi(\theta) f(x|\theta)$
When the likelihood $f(x|\theta)$ is not available in closed form, a likelihood-free rejection technique:
ABC algorithm
For an observation $y \sim f(y|\theta)$, under the prior $\pi(\theta)$, keep jointly simulating
\[ \theta' \sim \pi(\theta) \,, \quad z \sim f(z|\theta') \,, \]
until the auxiliary variable $z$ is equal to the observed value, $z = y$.
[Tavaré et al., 1997]
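A minimal Python sketch of this exact-match rejection scheme (the Binomial model, uniform prior, and all code names are illustrative assumptions, not part of the slides; exact matching only makes sense for discrete data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): y ~ Binomial(m, theta)
# with a Uniform(0, 1) prior on theta, so z = y has positive probability.
m = 10
y_obs = 3

def abc_exact(y, n_samples=1_000):
    """Exact-match ABC: keep (theta', z) pairs until z equals the observed y."""
    kept = []
    while len(kept) < n_samples:
        theta = rng.uniform(0.0, 1.0)     # theta' ~ pi(theta)
        z = rng.binomial(m, theta)        # z ~ f(z | theta')
        if z == y:                        # accept only on exact equality
            kept.append(theta)
    return np.array(kept)

draws = abc_exact(y_obs)
print(draws.mean())   # Monte Carlo approximation of E[theta | y]
```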
Why does it work?!
The proof is trivial:
\[ f(\theta_i) \propto \sum_{z \in \mathcal{D}} \pi(\theta_i) f(z|\theta_i) \mathbb{I}_y(z) \propto \pi(\theta_i) f(y|\theta_i) = \pi(\theta_i|y) \,. \]
[Accept–Reject 101]
A as A...pproximative
When $y$ is a continuous random variable, the equality $z = y$ is replaced with a tolerance condition,
\[ \rho(y, z) \leq \epsilon \]
where $\rho$ is a distance
Output distributed from
\[ \pi(\theta) \, P_\theta\{\rho(y, z) < \epsilon\} \propto \pi(\theta \,|\, \rho(y, z) < \epsilon) \]
[Pritchard et al., 1999]
ABC algorithm
Algorithm 1 Likelihood-free rejection sampler
  for i = 1 to N do
    repeat
      generate θ' from the prior distribution π(·)
      generate z from the likelihood f(·|θ')
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ'
  end for
where η(y) defines a (not necessarily sufficient) statistic
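A direct Python transcription of this sampler (a sketch: the Gaussian toy model, the choice of η as the sample mean, and the absolute-difference ρ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy continuous model (an assumption): y_i ~ N(theta, 1), theta ~ N(0, 10),
# with summary eta(x) = sample mean and distance rho = absolute difference.
y_obs = rng.normal(2.0, 1.0, size=50)
eta = lambda x: x.mean()
rho = lambda a, b: abs(a - b)

def abc_rejection(y, N=500, eps=0.05):
    thetas = np.empty(N)
    for i in range(N):
        while True:
            theta = rng.normal(0.0, np.sqrt(10.0))   # generate theta' from pi(.)
            z = rng.normal(theta, 1.0, size=y.size)  # generate z from f(.|theta')
            if rho(eta(z), eta(y)) <= eps:           # until rho{eta(z), eta(y)} <= eps
                thetas[i] = theta
                break
    return thetas

print(abc_rejection(y_obs).mean())
```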
Output
The likelihood-free algorithm samples from the marginal in $z$ of:
\[ \pi_\epsilon(\theta, z|y) = \frac{\pi(\theta) f(z|\theta) \mathbb{I}_{A_{\epsilon,y}}(z)}{\int_{A_{\epsilon,y} \times \Theta} \pi(\theta) f(z|\theta) \, dz \, d\theta} \,, \]
where $A_{\epsilon,y} = \{z \in \mathcal{D} \,|\, \rho(\eta(z), \eta(y)) < \epsilon\}$.
The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:
\[ \pi_\epsilon(\theta|y) = \int \pi_\epsilon(\theta, z|y) \, dz \approx \pi(\theta|\eta(y)) \,. \]
MA example
Back to the MA(q) model
\[ x_t = \epsilon_t + \sum_{i=1}^q \vartheta_i \epsilon_{t-i} \]
Simple prior: uniform over the inverse [real and complex] roots in
\[ Q(u) = 1 - \sum_{i=1}^q \vartheta_i u^i \]
under the identifiability conditions, i.e. a uniform prior over the identifiability zone, e.g. the triangle for MA(2)
MA example (2)
ABC algorithm thus made of
1. picking a new value $(\vartheta_1, \vartheta_2)$ in the triangle
2. generating an iid sequence $(\epsilon_t)_{-q < t \leq T}$
3. producing a simulated series $(x'_t)_{1 \leq t \leq T}$
Distance: basic distance between the series
\[ \rho\big( (x_t)_{1 \leq t \leq T}, (x'_t)_{1 \leq t \leq T} \big) = \sum_{t=1}^T (x_t - x'_t)^2 \]
or distance between summary statistics like the $q$ autocorrelations
\[ \tau_j = \sum_{t=j+1}^T x_t x_{t-j} \]
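A possible end-to-end Python sketch of these three steps with the summary distance (the true value (0.6, 0.2), the series length, and the quantile-based tolerance are assumptions for illustration; the raw-series distance would simply replace the summary distance):

```python
import numpy as np

rng = np.random.default_rng(2)
T, q = 200, 2

def simulate_ma2(t1, t2):
    e = rng.normal(size=T + q)                   # iid (eps_t), -q < t <= T
    return e[2:] + t1 * e[1:-1] + t2 * e[:-2]    # x_t = eps_t + t1 eps_{t-1} + t2 eps_{t-2}

def sample_triangle():
    # step 1: uniform draw over the MA(2) identifiability triangle
    while True:
        t1, t2 = rng.uniform(-2, 2), rng.uniform(-1, 1)
        if t1 + t2 >= -1 and t1 - t2 <= 1:
            return t1, t2

def summaries(x):
    # first q autocovariances tau_j = sum_t x_t x_{t-j}
    return np.array([np.sum(x[j:] * x[:T - j]) for j in (1, 2)])

x_obs = simulate_ma2(0.6, 0.2)
s_obs = summaries(x_obs)

draws = np.array([sample_triangle() for _ in range(20_000)])
dists = np.array([np.sum((summaries(simulate_ma2(*d)) - s_obs) ** 2) for d in draws])

keep = dists <= np.quantile(dists, 0.01)     # tolerance = 1% quantile of distances
print(draws[keep].mean(axis=0))              # ABC posterior mean of (t1, t2)
```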
Comparison of distance impact
[Figure: ABC posterior samples of θ1 and θ2 under both distances]
Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
ABC as knn
[Biau et al., 2013, Annales de l'IHP]
Practice of ABC: determine tolerance as a quantile on observed distances, say 10% or 1% quantile,
\[ \epsilon = \epsilon_N = q_\alpha(d_1, \ldots, d_N) \]
• Interpretation of ε as nonparametric bandwidth only approximation of the actual practice
[Blum & François, 2010]
• ABC is a k-nearest neighbour (knn) method with $k_N = N \epsilon_N$
[Loftsgaarden & Quesenberry, 1965]
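In code, the quantile rule is exactly a k-nearest-neighbour selection (a sketch; thetas and dists stand for hypothetical arrays of N prior draws and their distances to η(y)):

```python
import numpy as np

def abc_knn(thetas, dists, alpha=0.01):
    """Quantile-based ABC: eps_N = q_alpha(d_1, ..., d_N) amounts to keeping
    the k_N = ceil(alpha * N) draws nearest to eta(y)."""
    k_N = max(1, int(np.ceil(alpha * len(dists))))
    nearest = np.argsort(dists)[:k_N]     # indices of the k_N nearest neighbours
    return thetas[nearest]

# e.g. posterior_sample = abc_knn(draws, dists, alpha=0.01)
```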
ABC consistency
Provided
\[ k_N / \log\log N \longrightarrow \infty \quad \text{and} \quad k_N / N \longrightarrow 0 \]
as $N \to \infty$, for almost all $s_0$ (with respect to the distribution of $S$), with probability 1,
\[ \frac{1}{k_N} \sum_{j=1}^{k_N} \varphi(\theta_j) \longrightarrow \mathbb{E}[\varphi(\theta_j) \,|\, S = s_0] \]
[Devroye, 1982]
Biau et al. (2013) also recall pointwise and integrated mean square error consistency results on the corresponding kernel estimate of the conditional posterior distribution, under the constraints
\[ k_N \to \infty \,, \quad k_N / N \to 0 \,, \quad h_N \to 0 \quad \text{and} \quad h_N^p \, k_N \to \infty \,. \]
Rates of convergence
Further assumptions (on target and kernel) allow for precise (integrated mean square) convergence rates (as a power of the sample size N), derived from classical k-nearest neighbour regression, like
• when $m = 1, 2, 3$, $k_N \approx N^{(p+4)/(p+8)}$ and rate $N^{-4/(p+8)}$
• when $m = 4$, $k_N \approx N^{(p+4)/(p+8)}$ and rate $N^{-4/(p+8)} \log N$
• when $m > 4$, $k_N \approx N^{(p+4)/(m+p+4)}$ and rate $N^{-4/(m+p+4)}$
[Biau et al., 2013]
Drag: only applies to sufficient summary statistics
[some] asymptotics of ABC
1. Approximate Bayesian computation
2. [some] asymptotics of ABC
   asymptotic setup
   consistency of ABC posteriors
   asymptotic shape of posterior distribution
   asymptotic behaviour of $E_{ABC}[\theta]$
asymptotic setup
• asymptotic: $y = y^{(n)} \sim P_\theta^n$ and $\epsilon = \epsilon_n$, $n \to +\infty$
• parametric: $\theta \in \mathbb{R}^k$, $k$ fixed
• concentration of summary statistics $\eta(z^n)$: there exists $b : \theta \mapsto b(\theta)$ such that $\|\eta(z^n) - b(\theta)\| = o_{P_\theta}(1)$, $\forall \theta$
Objects of interest:
• posterior concentration and asymptotic shape of $\pi_\epsilon(\cdot \,|\, \eta(y^{(n)}))$ (normality?)
• convergence of the posterior mean $\hat{\theta} = E_{ABC}[\theta \,|\, \eta(y^{(n)})]$
• asymptotic acceptance rate
[Frazier et al., 2016]
consistency of ABC posteriors
ABC algorithm Bayesian consistent at $\theta_0$ if for any $\delta > 0$,
\[ P\big( \|\theta - \theta_0\| > \delta \,\big|\, \|\eta(y) - \eta(z)\| \leq \epsilon \big) \to 0 \]
as $n \to +\infty$, $\epsilon \to 0$
Bayesian consistency implies that sets containing $\theta_0$ have posterior probability tending to one as $n \to +\infty$, with implication being the existence of a specific rate of concentration
• Concentration around the true value and Bayesian consistency impose less stringent conditions on the convergence speed of the tolerance $\epsilon_n$ to zero, when compared with asymptotic normality of the ABC posterior
• asymptotic normality of the ABC posterior mean does not require asymptotic normality of the ABC posterior
consistency of ABC posteriors
• Concentration of summary $\eta(z)$: there exists $b(\theta)$ such that $\|\eta(z) - b(\theta)\| = o_{P_\theta}(1)$
• Consistency:
\[ \Pi_{\epsilon_n}\big( \|\theta - \theta_0\| \leq \delta \,\big|\, \eta(y) \big) = 1 + o_p(1) \]
• Convergence rate: there exists $\delta_n = o(1)$ such that
\[ \Pi_{\epsilon_n}\big( \|\theta - \theta_0\| \leq \delta_n \,\big|\, \eta(y) \big) = 1 + o_p(1) \]
• Point estimator consistency: for $\hat{\theta} = E_{ABC}[\theta \,|\, \eta(y^{(n)})]$,
\[ E_{ABC}[\theta \,|\, \eta(y^{(n)})] - \theta_0 = o_p(1) \,, \quad v_n\big( E_{ABC}[\theta \,|\, \eta(y^{(n)})] - \theta_0 \big) \Rightarrow N(0, v) \]
Rate of convergence
$\Pi_\epsilon(\cdot \,|\, \|\eta(y) - \eta(z)\| \leq \epsilon)$ concentrates at rate $\lambda_n \to 0$ if
\[ \limsup_{\epsilon \to 0} \; \limsup_{n \to +\infty} \; \Pi_\epsilon\big( \|\theta - \theta_0\| > \lambda_n M \,\big|\, \|\eta(y) - \eta(z)\| \leq \epsilon \big) \to 0 \]
in $P_0$-probability when $M$ goes to infinity.
Posterior rate of concentration related to rate at which information accumulates about the true parameter vector
Related results
Existing studies on the large sample properties of ABC have mostly taken the asymptotic properties of point estimators derived from ABC as their primary focus
[Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2015]
Convergence when $\epsilon_n \gtrsim \sigma_n$
Under (main) assumptions
(A1) there exists $\sigma_n \to 0$ such that
\[ P_\theta\big( \sigma_n^{-1} \|\eta(z) - b(\theta)\| > u \big) \leq c(\theta) h(u) \,, \quad \lim_{u \to +\infty} h(u) = 0 \]
(A2)
\[ \Pi\big( \|b(\theta) - b(\theta_0)\| \leq u \big) \gtrsim u^D \,, \quad u \approx 0 \]
one obtains posterior consistency, with a posterior concentration rate $\lambda_n$ that depends on the deviation control of $d_2\{\eta(z), b(\theta)\}$ and a posterior concentration rate for $b(\theta)$ bounded from below by $O(\epsilon_n)$:
\[ \Pi_{\epsilon_n}\big( \|b(\theta) - b(\theta_0)\| \lesssim \epsilon_n + \sigma_n h^{-1}(\epsilon_n^D) \,\big|\, \eta(y) \big) = 1 + o_{p_0}(1) \]
If also $\|\theta - \theta_0\| \leq L \|b(\theta) - b(\theta_0)\|^\alpha$ locally and $\theta \mapsto b(\theta)$ is 1-1, then
\[ \Pi_{\epsilon_n}\big( \|\theta - \theta_0\| \lesssim \underbrace{\epsilon_n^\alpha + \sigma_n^\alpha (h^{-1}(\epsilon_n^D))^\alpha}_{\delta_n} \,\big|\, \eta(y) \big) = 1 + o_{p_0}(1) \]
Comments
(A1): if $P_\theta\big( \sigma_n^{-1} \|\eta(z) - b(\theta)\| > u \big) \leq c(\theta) h(u)$, two cases
1. Polynomial tail: $h(u) \lesssim u^{-\kappa}$, then $\delta_n = \epsilon_n + \sigma_n \epsilon_n^{-D/\kappa}$
2. Exponential tail: $h(u) \lesssim e^{-cu}$, then $\delta_n = \epsilon_n + \sigma_n \log(1/\epsilon_n)$
• E.g., $\eta(y) = n^{-1} \sum_i g(y_i)$ with moments on $g$ (case 1) or Laplace transform (case 2)
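For instance (a Monte Carlo sketch under an assumed Gaussian toy model, purely to illustrate case 2): with η(z) the sample mean, the scaled deviation $\sigma_n^{-1} |\eta(z) - b(\theta)|$ with $\sigma_n = n^{-1/2}$ has a stable, exponentially decaying tail:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy model: z_i ~ N(b_theta, 1), eta(z) = sample mean, sigma_n = n**-0.5.
b_theta = 1.0
for n in (100, 1_000, 10_000):
    z = rng.normal(b_theta, 1.0, size=(2_000, n))
    dev = np.sqrt(n) * np.abs(z.mean(axis=1) - b_theta)   # sigma_n^{-1} |eta(z) - b(theta)|
    print(n, np.mean(dev > 2.0))   # tail probability, roughly constant in n (about 0.046)
```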
Comments
(A2): $\Pi(\|b(\theta) - b(\theta_0)\| \leq u) \gtrsim u^D$: if $\Pi$ regular enough then $D = \dim(\theta)$
• no need to approximate the density $f(\eta(y)|\theta)$
• the same result holds when $\epsilon_n = o(\sigma_n)$ if (A1) is replaced with
\[ \inf_{|x| \leq M} P_\theta\big( \|\sigma_n^{-1}(\eta(z) - b(\theta)) - x\| \leq u \big) \gtrsim u^D \,, \quad u \approx 0 \]
proof
Simple enough proof: assume $\sigma_n \leq \delta \epsilon_n$ and
\[ \|\eta(y) - b(\theta_0)\| \lesssim \sigma_n \,, \quad \|\eta(y) - \eta(z)\| \leq \epsilon_n \]
Hence
\[ \|b(\theta) - b(\theta_0)\| > \delta_n \;\Rightarrow\; \|\eta(z) - b(\theta)\| > \delta_n - \epsilon_n - \sigma_n := t_n \]
Summary statistic and (in)consistency
Consider the moving average MA(2) model
\[ y_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} \,, \quad e_t \overset{i.i.d.}{\sim} N(0, 1) \]
and
\[ -2 \leq \theta_1 \leq 2 \,, \quad \theta_1 + \theta_2 \geq -1 \,, \quad \theta_1 - \theta_2 \leq 1 \,. \]
Summary statistics equal to sample autocovariances
\[ \eta_j(y) = T^{-1} \sum_{t=1+j}^T y_t y_{t-j} \,, \quad j = 0, 1 \]
with
\[ \eta_0(y) \overset{P}{\to} \mathbb{E}[y_t^2] = 1 + (\theta_{01})^2 + (\theta_{02})^2 \quad \text{and} \quad \eta_1(y) \overset{P}{\to} \mathbb{E}[y_t y_{t-1}] = \theta_{01}(1 + \theta_{02}) \]
For the ABC target $p_\epsilon(\theta \,|\, \eta(y))$ to be degenerate at $\theta_0$,
\[ 0 = b(\theta_0) - b(\theta) = \begin{pmatrix} 1 + (\theta_{01})^2 + (\theta_{02})^2 \\ \theta_{01}(1 + \theta_{02}) \end{pmatrix} - \begin{pmatrix} 1 + \theta_1^2 + \theta_2^2 \\ \theta_1(1 + \theta_2) \end{pmatrix} \]
must have the unique solution $\theta = \theta_0$
Take $\theta_{01} = .6$, $\theta_{02} = .2$: the equation has two solutions
\[ \theta_1 = .6, \; \theta_2 = .2 \quad \text{and} \quad \theta_1 \approx .5453, \; \theta_2 \approx .3204 \]
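This lack of identification can be checked numerically (a sketch; scipy's fsolve and the two starting points are implementation choices, not from the slides):

```python
import numpy as np
from scipy.optimize import fsolve

theta01, theta02 = 0.6, 0.2
b0 = np.array([1 + theta01**2 + theta02**2,      # E[y_t^2]
               theta01 * (1 + theta02)])         # E[y_t y_{t-1}]

def binding_gap(theta):
    # b(theta) - b(theta_0), component-wise
    t1, t2 = theta
    return [1 + t1**2 + t2**2 - b0[0], t1 * (1 + t2) - b0[1]]

for start in ([0.6, 0.2], [0.5, 0.35]):
    print(np.round(fsolve(binding_gap, start), 4))
# -> [0.6, 0.2] and approximately [0.5453, 0.3204]
```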
asymptotic shape of posterior distribution
Shape of
\[ \Pi_\epsilon\big( \cdot \,\big|\, \|\eta(y) - \eta(z)\| \leq \epsilon_n \big) \]
for several connections between $\epsilon_n$ and the rate at which $\eta(y^n)$ satisfies a CLT
Three different regimes:
1. $\sigma_n = o(\epsilon_n)$ −→ Uniform limit
2. $\sigma_n \asymp \epsilon_n$ −→ perturbed Gaussian limit
3. $\epsilon_n = o(\sigma_n)$ −→ Gaussian limit
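The regimes can be eyeballed in a toy Gaussian location model (an assumed setup in which η is the sample mean, so $\sigma_n = n^{-1/2}$): the accepted draws look uniform with width of order $\epsilon_n$ when $\epsilon_n \gg \sigma_n$, and Gaussian with width of order $\sigma_n$ when $\epsilon_n \ll \sigma_n$:

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed toy model: flat prior on theta, eta(z) | theta ~ N(theta, sigma_n^2),
# eta(y) ~ N(theta0, sigma_n^2), sigma_n = n**-0.5.
n, M, theta0 = 400, 400_000, 0.0
sigma_n = n ** -0.5
eta_y = rng.normal(theta0, sigma_n)
thetas = rng.uniform(-1.0, 1.0, size=M)
eta_z = rng.normal(thetas, sigma_n)

for eps in (10 * sigma_n, 0.1 * sigma_n):
    kept = thetas[np.abs(eta_z - eta_y) <= eps]
    # eps >> sigma_n: sd near eps/sqrt(3) (uniform); eps << sigma_n: sd near sigma_n
    print(f"eps = {eps:.4f}: accepted sd = {kept.std():.4f}")
```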
scaling matrices
Introduction of a sequence of $(k, k)$ p.d. matrices $\Sigma_n(\theta)$ such that for all $\theta$ near $\theta_0$
\[ c_1 \|D_n\|_* \leq \|\Sigma_n(\theta)\|_* \leq c_2 \|D_n\|_* \,, \quad D_n = \mathrm{diag}\big( d_n(1), \cdots, d_n(k) \big) \,, \]
with $0 < c_1, c_2 < +\infty$, $d_n(j) \to +\infty$ for all $j$
Possibly different convergence rates for the components of $\eta(z)$
Reordering components so that
\[ d_n(1) \leq \cdots \leq d_n(k) \]
with the assumption that
\[ \liminf_n \, d_n(j) \epsilon_n = \limsup_n \, d_n(j) \epsilon_n \]
new assumptions
(B1) Concentration of summary $\eta$: $\Sigma_n(\theta) \in \mathbb{R}^{k_1 \times k_1}$ is $o(1)$
\[ \Sigma_n(\theta)^{-1} \{\eta(z) - b(\theta)\} \Rightarrow N_{k_1}(0, \mathrm{Id}) \,, \quad \big( \Sigma_n(\theta) \Sigma_n(\theta_0)^{-1} \big)_n = C_o \]
(B2) $b(\theta)$ is $C^1$ and
\[ \|\theta - \theta_0\| \lesssim \|b(\theta) - b(\theta_0)\| \]
(B3) Dominated convergence and
\[ \lim_n \frac{P_\theta\big( \Sigma_n(\theta)^{-1} \{\eta(z) - b(\theta)\} \in u + B(0, u_n) \big)}{\prod_j u_n(j)} = \varphi(u) \]
main result
Set $\Sigma_n(\theta) = \sigma_n D(\theta)$ for $\theta \approx \theta_0$ and $Z^o = \Sigma_n(\theta_0)^{-1}(\eta(y) - b(\theta_0))$; then under (B1) and (B2)
• when $\epsilon_n \sigma_n^{-1} \to +\infty$
\[ \Pi_{\epsilon_n}\big[ \epsilon_n^{-1}(\theta - \theta_0) \in A \,\big|\, y \big] \Rightarrow U_{B_0}(A) \,, \quad B_0 = \{x \in \mathbb{R}^k \,;\, \|b'(\theta_0)^T x\| \leq 1\} \]
• when $\epsilon_n \sigma_n^{-1} \to c$
\[ \Pi_{\epsilon_n}\big[ \Sigma_n(\theta_0)^{-1}(\theta - \theta_0) - Z^o \in A \,\big|\, y \big] \Rightarrow Q_c(A) \,, \quad Q_c \neq N \]
• when $\epsilon_n \sigma_n^{-1} \to 0$ and (B3) holds, set
\[ V_n = [b'(\theta_0)]^T \Sigma_n(\theta_0) \, b'(\theta_0) \]
then
\[ \Pi_{\epsilon_n}\big[ V_n^{-1}(\theta - \theta_0) - \tilde{Z}^o \in A \,\big|\, y \big] \Rightarrow \Phi(A) \,. \]
intuition (?!)
Set $x(\theta) = \sigma_n^{-1}(\theta - \theta_0) - Z^o$ (case $k = 1$):
\[ \pi_n := \Pi_{\epsilon_n}\big[ \epsilon_n^{-1}(\theta - \theta_0) \in A \,\big|\, y \big] = \frac{\int_{|\theta - \theta_0| \leq u_n} \mathbb{I}_{x(\theta) \in A} \, P_\theta\big( |\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)| \leq \sigma_n^{-1} \epsilon_n \big) \, p(\theta) \, d\theta}{\int_{|\theta - \theta_0| \leq u_n} P_\theta\big( |\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)| \leq \sigma_n^{-1} \epsilon_n \big) \, p(\theta) \, d\theta} + o_p(1) \]
• If $\epsilon_n / \sigma_n \gg 1$:
\[ P_\theta\big( |\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)| \leq \sigma_n^{-1} \epsilon_n \big) = 1 + o(1) \,, \quad \text{iff} \quad |x| \leq \sigma_n^{-1} \epsilon_n + o(1) \]
• If $\epsilon_n / \sigma_n = o(1)$:
\[ P_\theta\big( |\sigma_n^{-1}(\eta(z) - b(\theta)) + x| \leq \sigma_n^{-1} \epsilon_n \big) = \varphi(x) \, \sigma_n^{-1} \epsilon_n \, (1 + o(1)) \]
more comments
• Surprising: $U(-\epsilon_n, \epsilon_n)$ limit when $\epsilon_n \gg \sigma_n$, but not that surprising since $\epsilon_n = o(1)$ means concentration around $\theta_0$ and $\sigma_n = o(\epsilon_n)$ implies that $b(\theta) - b(\theta_0) \approx \eta(z) - \eta(y)$
• again, no need to control the approximation of $f(\eta(y)|\theta)$ by a Gaussian density: merely a control of the distribution
• generalisation to the case where the eigenvalues of $\Sigma_n$ are $d_{n,1} \neq \cdots \neq d_{n,k}$
• behaviour of $E_{ABC}(\theta|y)$ consistent with Li & Fearnhead (2016)
even more comments
If (also) $p(\theta)$ is Hölder $\beta$
\[ E_{ABC}(\theta|y) - \theta_0 = \underbrace{\frac{\sigma_n Z^o}{b'(\theta_0)}}_{\text{score for } f(\eta(y)|\theta)} + \underbrace{\sum_{j=1}^{\beta/2} \epsilon_n^{2j} H_j(\theta_0, p, b)}_{\text{bias from threshold approx}} + \, o(\sigma_n) + O(\epsilon_n^{\beta+1}) \]
with
• if $\epsilon_n^2 = o(\sigma_n)$: Efficiency
\[ E_{ABC}(\theta|y) - \theta_0 = \frac{\sigma_n Z^o}{b'(\theta_0)} + o(\sigma_n) \]
• the $H_j(\theta_0, p, b)$'s are deterministic
⇒ we gain nothing by getting a first crude $\hat{\theta}(y) = E_{ABC}(\theta|y)$ for some $\eta(y)$ and then rerunning ABC with $\hat{\theta}(y)$
asymptotic behaviour of $E_{ABC}[\theta]$
When $p = \dim(\eta(y)) = d = \dim(\theta)$ and $\epsilon_n = o(n^{-3/10})$
\[ E_{ABC}\big[ d^T(\theta - \theta_0) \,\big|\, y^o \big] \Rightarrow N\big( 0, \{(\nabla b^o)^T \Sigma^{-1} \nabla b^o\}^{-1} \big) \]
[Li & Fearnhead (2016)]
In fact, if $\epsilon_n^{\beta+1} \sqrt{n} = o(1)$, with $\beta$ the Hölder-smoothness of $\pi$,
\[ E_{ABC}\big[ (\theta - \theta_0) \,\big|\, y^o \big] = \frac{(\nabla b^o)^{-1} Z^o}{\sqrt{n}} + \sum_{j=1}^k h_j(\theta_0) \epsilon_n^{2j} + o_p(1) \,, \quad 2k = \beta \]
Iterating for fixed $p$ mildly interesting: if
\[ \tilde{\eta}(y) = E_{ABC}[\theta \,|\, y^o] \]
then
\[ E_{ABC}[\theta \,|\, \tilde{\eta}(y)] = \theta_0 + \frac{(\nabla b^o)^{-1} Z^o}{\sqrt{n}} + \frac{\pi'(\theta_0)}{\pi(\theta_0)} \epsilon_n^2 + o(\cdot) \]
[Fearnhead & Prangle, 2012]
more asymptotic behaviour of $E_{ABC}[\theta]$
Li and Fearnhead (2016, 2017) consider that
\[ E_{ABC}\big[ d^T(\theta - \theta_0) \,\big|\, y^o \big] \]
is not optimal when $p > d$
• If $\sqrt{n} \, \epsilon_n^2 = o(1)$ and $\epsilon_n \sqrt{n} \neq o(1)$
\[ \sqrt{n} \, \big[ E_{ABC}(\theta) - \theta_0 \big] = P_{\nabla b^o} Z^o + o_p(1) \]
\[ Z^o = \sqrt{n} \, (\eta(y) - b^o) \,, \quad P_{\nabla b^o} Z^o = \big\{ (\nabla b^o)^T \nabla b^o \big\}^{-1} (\nabla b^o)^T Z^o \]
and
\[ V_{as}(P_{\nabla b^o} Z^o) \geq \big\{ (\nabla b^o)^T V_{as}(Z^o)^{-1} (\nabla b^o) \big\}^{-1} \]
• If $\epsilon_n \sqrt{n} = o(1)$
\[ \sqrt{n} \, \big[ E_{ABC}(\theta) - \theta_0 \big] = \big\{ (\nabla b^o)^T \Sigma^{-1} \nabla b^o \big\}^{-1} (\nabla b^o)^T \Sigma^{-1} Z^o + o_p(1) \]
impact of the dimension of η
dimension of $\eta(\cdot)$ does not impact the above result, but impacts the acceptance probability
• if $\epsilon_n = o(\sigma_n)$, $k_1 = \dim(\eta(y))$, $k = \dim(\theta)$ & $k_1 \geq k$
\[ \alpha_n := \Pr\big( \|y - z\| \leq \epsilon_n \big) \asymp \epsilon_n^{k_1} \sigma_n^{-k_1 + k} \]
• if $\epsilon_n \gtrsim \sigma_n$
\[ \alpha_n := \Pr\big( \|y - z\| \leq \epsilon_n \big) \asymp \epsilon_n^k \]
• If we choose $\alpha_n$
  • $\alpha_n = o(\sigma_n^k)$ leads to $\epsilon_n = \sigma_n (\alpha_n \sigma_n^{-k})^{1/k_1} = o(\sigma_n)$
  • $\alpha_n \gtrsim \sigma_n^k$ leads to $\epsilon_n \asymp \alpha_n^{1/k}$.
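A small Monte Carlo sketch of this dimension penalty (an assumed Gaussian toy model with $k = 1$ parameter and $k_1 = 2$ summaries, $\sigma_n \approx n^{-1/2}$; everything in the code is an illustrative choice): once ε falls below σ_n, the acceptance rate decays super-linearly in ε rather than like ε itself:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed setup: theta ~ N(0, 1), y_i ~ N(theta, 1), k = 1 parameter but
# k1 = 2 summaries (sample mean and sample variance), sigma_n ~ n**-0.5.
n, M, theta0 = 400, 40_000, 0.3
y = rng.normal(theta0, 1.0, size=n)
s_obs = np.array([y.mean(), y.var()])

thetas = rng.normal(0.0, 1.0, size=M)
z = rng.normal(thetas[:, None], 1.0, size=(M, n))
s = np.column_stack([z.mean(axis=1), z.var(axis=1)])
d = np.linalg.norm(s - s_obs, axis=1)         # distance between summary vectors

sigma_n = n ** -0.5                           # about 0.05
for eps in (10 * sigma_n, sigma_n, 0.3 * sigma_n):
    print(f"eps = {eps:.4f}: acceptance = {np.mean(d <= eps):.5f}")
```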
conclusion on ABC consistency
• asymptotic description of ABC: different regimes depending on $\epsilon_n$ & $\sigma_n$
• no point in choosing $\epsilon_n$ arbitrarily small: just $\epsilon_n = o(\sigma_n)$
• no asymptotic gain in iterative ABC
• results under weak conditions by not studying $g(\eta(z)|\theta)$