Convergence of ABC methods
Christian P. Robert
Université Paris-Dauphine, University of Warwick, & IUF
joint work with D. Frazier, G. Martin & J. Rousseau
Intractable likelihoods
Cases when the likelihood function $f(y|\theta)$ is unavailable and when the completion step
\[ f(y|\theta) = \int f(y, z|\theta) \, dz \]
is impossible or too costly because of the dimension of $z$
⇒ MCMC cannot be implemented!
Illustration
Example (Ising & Potts models)
Potts model: if $y$ takes values on a grid $\mathcal{Y}$ of size $k^n$ and
\[ f(y|\theta) \propto \exp\Big( \theta \sum_{l \sim i} \mathbb{I}_{y_l = y_i} \Big) \]
where $l \sim i$ denotes a neighbourhood relation, $n$ moderately large prohibits the computation of the normalising constant $Z_\theta$
Special case of the intractable normalising constant, making the likelihood impossible to compute
The ABC method
Bayesian setting: target is $\pi(\theta) f(x|\theta)$
When the likelihood $f(x|\theta)$ is not available in closed form, a likelihood-free rejection technique:
ABC algorithm
For an observation $y \sim f(y|\theta)$, under the prior $\pi(\theta)$, keep jointly simulating
\[ \theta' \sim \pi(\theta) \,, \quad z \sim f(z|\theta') \,, \]
until the auxiliary variable $z$ is equal to the observed value, $z = y$.
[Tavaré et al., 1997]
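A minimal Python sketch of this exact-match rejection scheme (the Binomial model, uniform prior, and all code names are illustrative assumptions, not part of the slides; exact matching only makes sense for discrete data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): y ~ Binomial(m, theta)
# with a Uniform(0, 1) prior on theta, so z = y has positive probability.
m = 10
y_obs = 3

def abc_exact(y, n_samples=1_000):
    """Exact-match ABC: keep (theta', z) pairs until z equals the observed y."""
    kept = []
    while len(kept) < n_samples:
        theta = rng.uniform(0.0, 1.0)     # theta' ~ pi(theta)
        z = rng.binomial(m, theta)        # z ~ f(z | theta')
        if z == y:                        # accept only on exact equality
            kept.append(theta)
    return np.array(kept)

draws = abc_exact(y_obs)
print(draws.mean())   # Monte Carlo approximation of E[theta | y]
```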
Why does it work?!
The proof is trivial:
\[ f(\theta_i) \propto \sum_{z \in \mathcal{D}} \pi(\theta_i) f(z|\theta_i) \mathbb{I}_y(z) \propto \pi(\theta_i) f(y|\theta_i) = \pi(\theta_i|y) \,. \]
[Accept–Reject 101]
A as A...pproximative
When $y$ is a continuous random variable, the equality $z = y$ is replaced with a tolerance condition,
\[ \rho(y, z) \leq \epsilon \]
where $\rho$ is a distance
Output distributed from
\[ \pi(\theta) \, P_\theta\{\rho(y, z) < \epsilon\} \propto \pi(\theta \,|\, \rho(y, z) < \epsilon) \]
[Pritchard et al., 1999]
ABC algorithm
Algorithm 1 Likelihood-free rejection sampler
  for i = 1 to N do
    repeat
      generate θ' from the prior distribution π(·)
      generate z from the likelihood f(·|θ')
    until ρ{η(z), η(y)} ≤ ε
    set θi = θ'
  end for
where η(y) defines a (not necessarily sufficient) statistic
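A direct Python transcription of this sampler (a sketch: the Gaussian toy model, the choice of η as the sample mean, and the absolute-difference ρ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy continuous model (an assumption): y_i ~ N(theta, 1), theta ~ N(0, 10),
# with summary eta(x) = sample mean and distance rho = absolute difference.
y_obs = rng.normal(2.0, 1.0, size=50)
eta = lambda x: x.mean()
rho = lambda a, b: abs(a - b)

def abc_rejection(y, N=500, eps=0.05):
    thetas = np.empty(N)
    for i in range(N):
        while True:
            theta = rng.normal(0.0, np.sqrt(10.0))   # generate theta' from pi(.)
            z = rng.normal(theta, 1.0, size=y.size)  # generate z from f(.|theta')
            if rho(eta(z), eta(y)) <= eps:           # until rho{eta(z), eta(y)} <= eps
                thetas[i] = theta
                break
    return thetas

print(abc_rejection(y_obs).mean())
```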
Output
The likelihood-free algorithm samples from the marginal in $z$ of:
\[ \pi_\epsilon(\theta, z|y) = \frac{\pi(\theta) f(z|\theta) \mathbb{I}_{A_{\epsilon,y}}(z)}{\int_{A_{\epsilon,y} \times \Theta} \pi(\theta) f(z|\theta) \, dz \, d\theta} \,, \]
where $A_{\epsilon,y} = \{z \in \mathcal{D} \,|\, \rho(\eta(z), \eta(y)) < \epsilon\}$.
The idea behind ABC is that the summary statistics coupled with a small tolerance should provide a good approximation of the posterior distribution:
\[ \pi_\epsilon(\theta|y) = \int \pi_\epsilon(\theta, z|y) \, dz \approx \pi(\theta|\eta(y)) \,. \]
MA example
Back to the MA(q) model
\[ x_t = \epsilon_t + \sum_{i=1}^q \vartheta_i \epsilon_{t-i} \]
Simple prior: uniform over the inverse [real and complex] roots in
\[ Q(u) = 1 - \sum_{i=1}^q \vartheta_i u^i \]
under the identifiability conditions, i.e. a uniform prior over the identifiability zone, e.g. the triangle for MA(2)
MA example (2)
ABC algorithm thus made of
1. picking a new value $(\vartheta_1, \vartheta_2)$ in the triangle
2. generating an iid sequence $(\epsilon_t)_{-q < t \leq T}$
3. producing a simulated series $(x'_t)_{1 \leq t \leq T}$
Distance: basic distance between the series
\[ \rho\big( (x_t)_{1 \leq t \leq T}, (x'_t)_{1 \leq t \leq T} \big) = \sum_{t=1}^T (x_t - x'_t)^2 \]
or distance between summary statistics like the $q$ autocorrelations
\[ \tau_j = \sum_{t=j+1}^T x_t x_{t-j} \]
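A possible end-to-end Python sketch of these three steps with the summary distance (the true value (0.6, 0.2), the series length, and the quantile-based tolerance are assumptions for illustration; the raw-series distance would simply replace the summary distance):

```python
import numpy as np

rng = np.random.default_rng(2)
T, q = 200, 2

def simulate_ma2(t1, t2):
    e = rng.normal(size=T + q)                   # iid (eps_t), -q < t <= T
    return e[2:] + t1 * e[1:-1] + t2 * e[:-2]    # x_t = eps_t + t1 eps_{t-1} + t2 eps_{t-2}

def sample_triangle():
    # step 1: uniform draw over the MA(2) identifiability triangle
    while True:
        t1, t2 = rng.uniform(-2, 2), rng.uniform(-1, 1)
        if t1 + t2 >= -1 and t1 - t2 <= 1:
            return t1, t2

def summaries(x):
    # first q autocovariances tau_j = sum_t x_t x_{t-j}
    return np.array([np.sum(x[j:] * x[:T - j]) for j in (1, 2)])

x_obs = simulate_ma2(0.6, 0.2)
s_obs = summaries(x_obs)

draws = np.array([sample_triangle() for _ in range(20_000)])
dists = np.array([np.sum((summaries(simulate_ma2(*d)) - s_obs) ** 2) for d in draws])

keep = dists <= np.quantile(dists, 0.01)     # tolerance = 1% quantile of distances
print(draws[keep].mean(axis=0))              # ABC posterior mean of (t1, t2)
```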
Comparison of distance impact
[Figure: ABC posterior samples of θ1 and θ2 under both distances]
Evaluation of the tolerance on the ABC sample against both distances (ε = 100%, 10%, 1%, 0.1%) for an MA(2) model
ABC as knn
[Biau et al., 2013, Annales de l'IHP]
Practice of ABC: determine tolerance as a quantile on observed distances, say 10% or 1% quantile,
\[ \epsilon = \epsilon_N = q_\alpha(d_1, \ldots, d_N) \]
• Interpretation of ε as nonparametric bandwidth only approximation of the actual practice
[Blum & François, 2010]
• ABC is a k-nearest neighbour (knn) method with $k_N = N \epsilon_N$
[Loftsgaarden & Quesenberry, 1965]
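In code, the quantile rule is exactly a k-nearest-neighbour selection (a sketch; thetas and dists stand for hypothetical arrays of N prior draws and their distances to η(y)):

```python
import numpy as np

def abc_knn(thetas, dists, alpha=0.01):
    """Quantile-based ABC: eps_N = q_alpha(d_1, ..., d_N) amounts to keeping
    the k_N = ceil(alpha * N) draws nearest to eta(y)."""
    k_N = max(1, int(np.ceil(alpha * len(dists))))
    nearest = np.argsort(dists)[:k_N]     # indices of the k_N nearest neighbours
    return thetas[nearest]

# e.g. posterior_sample = abc_knn(draws, dists, alpha=0.01)
```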
ABC consistency
Provided
\[ k_N / \log\log N \longrightarrow \infty \quad \text{and} \quad k_N / N \longrightarrow 0 \]
as $N \to \infty$, for almost all $s_0$ (with respect to the distribution of $S$), with probability 1,
\[ \frac{1}{k_N} \sum_{j=1}^{k_N} \varphi(\theta_j) \longrightarrow \mathbb{E}[\varphi(\theta_j) \,|\, S = s_0] \]
[Devroye, 1982]
Biau et al. (2013) also recall pointwise and integrated mean square error consistency results on the corresponding kernel estimate of the conditional posterior distribution, under the constraints
\[ k_N \to \infty \,, \quad k_N / N \to 0 \,, \quad h_N \to 0 \quad \text{and} \quad h_N^p \, k_N \to \infty \,. \]
Rates of convergence
Further assumptions (on target and kernel) allow for precise (integrated mean square) convergence rates (as a power of the sample size N), derived from classical k-nearest neighbour regression, like
• when $m = 1, 2, 3$, $k_N \approx N^{(p+4)/(p+8)}$ and rate $N^{-4/(p+8)}$
• when $m = 4$, $k_N \approx N^{(p+4)/(p+8)}$ and rate $N^{-4/(p+8)} \log N$
• when $m > 4$, $k_N \approx N^{(p+4)/(m+p+4)}$ and rate $N^{-4/(m+p+4)}$
[Biau et al., 2013]
Drag: only applies to sufficient summary statistics
[some] asymptotics of ABC
1. Approximate Bayesian computation
2. [some] asymptotics of ABC
   asymptotic setup
   consistency of ABC posteriors
   asymptotic shape of posterior distribution
   asymptotic behaviour of $E_{ABC}[\theta]$
asymptotic setup
• asymptotic: $y = y^{(n)} \sim P_\theta^n$ and $\epsilon = \epsilon_n$, $n \to +\infty$
• parametric: $\theta \in \mathbb{R}^k$, $k$ fixed
• concentration of summary statistics $\eta(z^n)$: there exists $b : \theta \mapsto b(\theta)$ such that $\|\eta(z^n) - b(\theta)\| = o_{P_\theta}(1)$, $\forall \theta$
Objects of interest:
• posterior concentration and asymptotic shape of $\pi_\epsilon(\cdot \,|\, \eta(y^{(n)}))$ (normality?)
• convergence of the posterior mean $\hat{\theta} = E_{ABC}[\theta \,|\, \eta(y^{(n)})]$
• asymptotic acceptance rate
[Frazier et al., 2016]
consistency of ABC posteriors
ABC algorithm Bayesian consistent at $\theta_0$ if for any $\delta > 0$,
\[ P\big( \|\theta - \theta_0\| > \delta \,\big|\, \|\eta(y) - \eta(z)\| \leq \epsilon \big) \to 0 \]
as $n \to +\infty$, $\epsilon \to 0$
Bayesian consistency implies that sets containing $\theta_0$ have posterior probability tending to one as $n \to +\infty$, with implication being the existence of a specific rate of concentration
• Concentration around the true value and Bayesian consistency impose less stringent conditions on the convergence speed of the tolerance $\epsilon_n$ to zero, when compared with asymptotic normality of the ABC posterior
• asymptotic normality of the ABC posterior mean does not require asymptotic normality of the ABC posterior
consistency of ABC posteriors
• Concentration of summary $\eta(z)$: there exists $b(\theta)$ such that $\|\eta(z) - b(\theta)\| = o_{P_\theta}(1)$
• Consistency:
\[ \Pi_{\epsilon_n}\big( \|\theta - \theta_0\| \leq \delta \,\big|\, \eta(y) \big) = 1 + o_p(1) \]
• Convergence rate: there exists $\delta_n = o(1)$ such that
\[ \Pi_{\epsilon_n}\big( \|\theta - \theta_0\| \leq \delta_n \,\big|\, \eta(y) \big) = 1 + o_p(1) \]
• Point estimator consistency: for $\hat{\theta} = E_{ABC}[\theta \,|\, \eta(y^{(n)})]$,
\[ E_{ABC}[\theta \,|\, \eta(y^{(n)})] - \theta_0 = o_p(1) \,, \quad v_n\big( E_{ABC}[\theta \,|\, \eta(y^{(n)})] - \theta_0 \big) \Rightarrow N(0, v) \]
Rate of convergence
$\Pi_\epsilon(\cdot \,|\, \|\eta(y) - \eta(z)\| \leq \epsilon)$ concentrates at rate $\lambda_n \to 0$ if
\[ \limsup_{\epsilon \to 0} \; \limsup_{n \to +\infty} \; \Pi_\epsilon\big( \|\theta - \theta_0\| > \lambda_n M \,\big|\, \|\eta(y) - \eta(z)\| \leq \epsilon \big) \to 0 \]
in $P_0$-probability when $M$ goes to infinity.
Posterior rate of concentration related to rate at which information accumulates about the true parameter vector
Related results
Existing studies on the large sample properties of ABC have mostly taken the asymptotic properties of point estimators derived from ABC as their primary focus
[Creel et al., 2015; Jasra, 2015; Li & Fearnhead, 2015]
Convergence when $\epsilon_n \gtrsim \sigma_n$
Under (main) assumptions
(A1) there exists $\sigma_n \to 0$ such that
\[ P_\theta\big( \sigma_n^{-1} \|\eta(z) - b(\theta)\| > u \big) \leq c(\theta) h(u) \,, \quad \lim_{u \to +\infty} h(u) = 0 \]
(A2)
\[ \Pi\big( \|b(\theta) - b(\theta_0)\| \leq u \big) \gtrsim u^D \,, \quad u \approx 0 \]
one obtains posterior consistency, with a posterior concentration rate $\lambda_n$ that depends on the deviation control of $d_2\{\eta(z), b(\theta)\}$ and a posterior concentration rate for $b(\theta)$ bounded from below by $O(\epsilon_n)$:
\[ \Pi_{\epsilon_n}\big( \|b(\theta) - b(\theta_0)\| \lesssim \epsilon_n + \sigma_n h^{-1}(\epsilon_n^D) \,\big|\, \eta(y) \big) = 1 + o_{p_0}(1) \]
If also $\|\theta - \theta_0\| \leq L \|b(\theta) - b(\theta_0)\|^\alpha$ locally and $\theta \mapsto b(\theta)$ is 1-1, then
\[ \Pi_{\epsilon_n}\big( \|\theta - \theta_0\| \lesssim \underbrace{\epsilon_n^\alpha + \sigma_n^\alpha (h^{-1}(\epsilon_n^D))^\alpha}_{\delta_n} \,\big|\, \eta(y) \big) = 1 + o_{p_0}(1) \]
Comments
(A1): if $P_\theta\big( \sigma_n^{-1} \|\eta(z) - b(\theta)\| > u \big) \leq c(\theta) h(u)$, two cases
1. Polynomial tail: $h(u) \lesssim u^{-\kappa}$, then $\delta_n = \epsilon_n + \sigma_n \epsilon_n^{-D/\kappa}$
2. Exponential tail: $h(u) \lesssim e^{-cu}$, then $\delta_n = \epsilon_n + \sigma_n \log(1/\epsilon_n)$
• E.g., $\eta(y) = n^{-1} \sum_i g(y_i)$ with moments on $g$ (case 1) or Laplace transform (case 2)
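For instance (a Monte Carlo sketch under an assumed Gaussian toy model, purely to illustrate case 2): with η(z) the sample mean, the scaled deviation $\sigma_n^{-1} |\eta(z) - b(\theta)|$ with $\sigma_n = n^{-1/2}$ has a stable, exponentially decaying tail:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy model: z_i ~ N(b_theta, 1), eta(z) = sample mean, sigma_n = n**-0.5.
b_theta = 1.0
for n in (100, 1_000, 10_000):
    z = rng.normal(b_theta, 1.0, size=(2_000, n))
    dev = np.sqrt(n) * np.abs(z.mean(axis=1) - b_theta)   # sigma_n^{-1} |eta(z) - b(theta)|
    print(n, np.mean(dev > 2.0))   # tail probability, roughly constant in n (about 0.046)
```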
Comments
(A2): $\Pi(\|b(\theta) - b(\theta_0)\| \leq u) \gtrsim u^D$: if $\Pi$ regular enough then $D = \dim(\theta)$
• no need to approximate the density $f(\eta(y)|\theta)$
• the same result holds when $\epsilon_n = o(\sigma_n)$ if (A1) is replaced with
\[ \inf_{|x| \leq M} P_\theta\big( \|\sigma_n^{-1}(\eta(z) - b(\theta)) - x\| \leq u \big) \gtrsim u^D \,, \quad u \approx 0 \]
proof
Simple enough proof: assume $\sigma_n \leq \delta \epsilon_n$ and
\[ \|\eta(y) - b(\theta_0)\| \lesssim \sigma_n \,, \quad \|\eta(y) - \eta(z)\| \leq \epsilon_n \]
Hence
\[ \|b(\theta) - b(\theta_0)\| > \delta_n \;\Rightarrow\; \|\eta(z) - b(\theta)\| > \delta_n - \epsilon_n - \sigma_n := t_n \]
Summary statistic and (in)consistency
Consider the moving average MA(2) model
\[ y_t = e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} \,, \quad e_t \overset{i.i.d.}{\sim} N(0, 1) \]
and
\[ -2 \leq \theta_1 \leq 2 \,, \quad \theta_1 + \theta_2 \geq -1 \,, \quad \theta_1 - \theta_2 \leq 1 \,. \]
Summary statistics equal to sample autocovariances
\[ \eta_j(y) = T^{-1} \sum_{t=1+j}^T y_t y_{t-j} \,, \quad j = 0, 1 \]
with
\[ \eta_0(y) \overset{P}{\to} \mathbb{E}[y_t^2] = 1 + (\theta_{01})^2 + (\theta_{02})^2 \quad \text{and} \quad \eta_1(y) \overset{P}{\to} \mathbb{E}[y_t y_{t-1}] = \theta_{01}(1 + \theta_{02}) \]
For the ABC target $p_\epsilon(\theta \,|\, \eta(y))$ to be degenerate at $\theta_0$,
\[ 0 = b(\theta_0) - b(\theta) = \begin{pmatrix} 1 + (\theta_{01})^2 + (\theta_{02})^2 \\ \theta_{01}(1 + \theta_{02}) \end{pmatrix} - \begin{pmatrix} 1 + \theta_1^2 + \theta_2^2 \\ \theta_1(1 + \theta_2) \end{pmatrix} \]
must have the unique solution $\theta = \theta_0$
Take $\theta_{01} = .6$, $\theta_{02} = .2$: the equation has two solutions
\[ \theta_1 = .6, \; \theta_2 = .2 \quad \text{and} \quad \theta_1 \approx .5453, \; \theta_2 \approx .3204 \]
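This lack of identification can be checked numerically (a sketch; scipy's fsolve and the two starting points are implementation choices, not from the slides):

```python
import numpy as np
from scipy.optimize import fsolve

theta01, theta02 = 0.6, 0.2
b0 = np.array([1 + theta01**2 + theta02**2,      # E[y_t^2]
               theta01 * (1 + theta02)])         # E[y_t y_{t-1}]

def binding_gap(theta):
    # b(theta) - b(theta_0), component-wise
    t1, t2 = theta
    return [1 + t1**2 + t2**2 - b0[0], t1 * (1 + t2) - b0[1]]

for start in ([0.6, 0.2], [0.5, 0.35]):
    print(np.round(fsolve(binding_gap, start), 4))
# -> [0.6, 0.2] and approximately [0.5453, 0.3204]
```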
asymptotic shape of posterior distribution
Shape of
\[ \Pi_\epsilon\big( \cdot \,\big|\, \|\eta(y) - \eta(z)\| \leq \epsilon_n \big) \]
for several connections between $\epsilon_n$ and the rate at which $\eta(y^n)$ satisfies a CLT
Three different regimes:
1. $\sigma_n = o(\epsilon_n)$ −→ Uniform limit
2. $\sigma_n \asymp \epsilon_n$ −→ perturbed Gaussian limit
3. $\epsilon_n = o(\sigma_n)$ −→ Gaussian limit
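The regimes can be eyeballed in a toy Gaussian location model (an assumed setup in which η is the sample mean, so $\sigma_n = n^{-1/2}$): the accepted draws look uniform with width of order $\epsilon_n$ when $\epsilon_n \gg \sigma_n$, and Gaussian with width of order $\sigma_n$ when $\epsilon_n \ll \sigma_n$:

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed toy model: flat prior on theta, eta(z) | theta ~ N(theta, sigma_n^2),
# eta(y) ~ N(theta0, sigma_n^2), sigma_n = n**-0.5.
n, M, theta0 = 400, 400_000, 0.0
sigma_n = n ** -0.5
eta_y = rng.normal(theta0, sigma_n)
thetas = rng.uniform(-1.0, 1.0, size=M)
eta_z = rng.normal(thetas, sigma_n)

for eps in (10 * sigma_n, 0.1 * sigma_n):
    kept = thetas[np.abs(eta_z - eta_y) <= eps]
    # eps >> sigma_n: sd near eps/sqrt(3) (uniform); eps << sigma_n: sd near sigma_n
    print(f"eps = {eps:.4f}: accepted sd = {kept.std():.4f}")
```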
scaling matrices
Introduction of a sequence of $(k, k)$ p.d. matrices $\Sigma_n(\theta)$ such that for all $\theta$ near $\theta_0$
\[ c_1 \|D_n\|_* \leq \|\Sigma_n(\theta)\|_* \leq c_2 \|D_n\|_* \,, \quad D_n = \mathrm{diag}\big( d_n(1), \cdots, d_n(k) \big) \,, \]
with $0 < c_1, c_2 < +\infty$, $d_n(j) \to +\infty$ for all $j$
Possibly different convergence rates for the components of $\eta(z)$
Reordering components so that
\[ d_n(1) \leq \cdots \leq d_n(k) \]
with the assumption that
\[ \liminf_n \, d_n(j) \epsilon_n = \limsup_n \, d_n(j) \epsilon_n \]
new assumptions
(B1) Concentration of summary $\eta$: $\Sigma_n(\theta) \in \mathbb{R}^{k_1 \times k_1}$ is $o(1)$
\[ \Sigma_n(\theta)^{-1} \{\eta(z) - b(\theta)\} \Rightarrow N_{k_1}(0, \mathrm{Id}) \,, \quad \big( \Sigma_n(\theta) \Sigma_n(\theta_0)^{-1} \big)_n = C_o \]
(B2) $b(\theta)$ is $C^1$ and
\[ \|\theta - \theta_0\| \lesssim \|b(\theta) - b(\theta_0)\| \]
(B3) Dominated convergence and
\[ \lim_n \frac{P_\theta\big( \Sigma_n(\theta)^{-1} \{\eta(z) - b(\theta)\} \in u + B(0, u_n) \big)}{\prod_j u_n(j)} = \varphi(u) \]
main result
Set $\Sigma_n(\theta) = \sigma_n D(\theta)$ for $\theta \approx \theta_0$ and $Z^o = \Sigma_n(\theta_0)^{-1}(\eta(y) - b(\theta_0))$; then under (B1) and (B2)
• when $\epsilon_n \sigma_n^{-1} \to +\infty$
\[ \Pi_{\epsilon_n}\big[ \epsilon_n^{-1}(\theta - \theta_0) \in A \,\big|\, y \big] \Rightarrow U_{B_0}(A) \,, \quad B_0 = \{x \in \mathbb{R}^k \,;\, \|b'(\theta_0)^T x\| \leq 1\} \]
• when $\epsilon_n \sigma_n^{-1} \to c$
\[ \Pi_{\epsilon_n}\big[ \Sigma_n(\theta_0)^{-1}(\theta - \theta_0) - Z^o \in A \,\big|\, y \big] \Rightarrow Q_c(A) \,, \quad Q_c \neq N \]
• when $\epsilon_n \sigma_n^{-1} \to 0$ and (B3) holds, set
\[ V_n = [b'(\theta_0)]^T \Sigma_n(\theta_0) \, b'(\theta_0) \]
then
\[ \Pi_{\epsilon_n}\big[ V_n^{-1}(\theta - \theta_0) - \tilde{Z}^o \in A \,\big|\, y \big] \Rightarrow \Phi(A) \,. \]
intuition (?!)
Set $x(\theta) = \sigma_n^{-1}(\theta - \theta_0) - Z^o$ (case $k = 1$):
\[ \pi_n := \Pi_{\epsilon_n}\big[ \epsilon_n^{-1}(\theta - \theta_0) \in A \,\big|\, y \big] = \frac{\int_{|\theta - \theta_0| \leq u_n} \mathbb{I}_{x(\theta) \in A} \, P_\theta\big( |\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)| \leq \sigma_n^{-1} \epsilon_n \big) \, p(\theta) \, d\theta}{\int_{|\theta - \theta_0| \leq u_n} P_\theta\big( |\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)| \leq \sigma_n^{-1} \epsilon_n \big) \, p(\theta) \, d\theta} + o_p(1) \]
• If $\epsilon_n / \sigma_n \gg 1$:
\[ P_\theta\big( |\sigma_n^{-1}(\eta(z) - b(\theta)) + x(\theta)| \leq \sigma_n^{-1} \epsilon_n \big) = 1 + o(1) \,, \quad \text{iff} \quad |x| \leq \sigma_n^{-1} \epsilon_n + o(1) \]
• If $\epsilon_n / \sigma_n = o(1)$:
\[ P_\theta\big( |\sigma_n^{-1}(\eta(z) - b(\theta)) + x| \leq \sigma_n^{-1} \epsilon_n \big) = \varphi(x) \, \sigma_n^{-1} \epsilon_n \, (1 + o(1)) \]
more comments
• Surprising: $U(-\epsilon_n, \epsilon_n)$ limit when $\epsilon_n \gg \sigma_n$, but not that surprising since $\epsilon_n = o(1)$ means concentration around $\theta_0$ and $\sigma_n = o(\epsilon_n)$ implies that $b(\theta) - b(\theta_0) \approx \eta(z) - \eta(y)$
• again, no need to control the approximation of $f(\eta(y)|\theta)$ by a Gaussian density: merely a control of the distribution
• generalisation to the case where the eigenvalues of $\Sigma_n$ are $d_{n,1} \neq \cdots \neq d_{n,k}$
• behaviour of $E_{ABC}(\theta|y)$ consistent with Li & Fearnhead (2016)
even more comments
If (also) $p(\theta)$ is Hölder $\beta$
\[ E_{ABC}(\theta|y) - \theta_0 = \underbrace{\frac{\sigma_n Z^o}{b'(\theta_0)}}_{\text{score for } f(\eta(y)|\theta)} + \underbrace{\sum_{j=1}^{\beta/2} \epsilon_n^{2j} H_j(\theta_0, p, b)}_{\text{bias from threshold approx}} + \, o(\sigma_n) + O(\epsilon_n^{\beta+1}) \]
with
• if $\epsilon_n^2 = o(\sigma_n)$: Efficiency
\[ E_{ABC}(\theta|y) - \theta_0 = \frac{\sigma_n Z^o}{b'(\theta_0)} + o(\sigma_n) \]
• the $H_j(\theta_0, p, b)$'s are deterministic
⇒ we gain nothing by getting a first crude $\hat{\theta}(y) = E_{ABC}(\theta|y)$ for some $\eta(y)$ and then rerunning ABC with $\hat{\theta}(y)$
asymptotic behaviour of $E_{ABC}[\theta]$
When $p = \dim(\eta(y)) = d = \dim(\theta)$ and $\epsilon_n = o(n^{-3/10})$
\[ E_{ABC}\big[ d^T(\theta - \theta_0) \,\big|\, y^o \big] \Rightarrow N\big( 0, \{(\nabla b^o)^T \Sigma^{-1} \nabla b^o\}^{-1} \big) \]
[Li & Fearnhead (2016)]
In fact, if $\epsilon_n^{\beta+1} \sqrt{n} = o(1)$, with $\beta$ the Hölder-smoothness of $\pi$,
\[ E_{ABC}\big[ (\theta - \theta_0) \,\big|\, y^o \big] = \frac{(\nabla b^o)^{-1} Z^o}{\sqrt{n}} + \sum_{j=1}^k h_j(\theta_0) \epsilon_n^{2j} + o_p(1) \,, \quad 2k = \beta \]
Iterating for fixed $p$ mildly interesting: if
\[ \tilde{\eta}(y) = E_{ABC}[\theta \,|\, y^o] \]
then
\[ E_{ABC}[\theta \,|\, \tilde{\eta}(y)] = \theta_0 + \frac{(\nabla b^o)^{-1} Z^o}{\sqrt{n}} + \frac{\pi'(\theta_0)}{\pi(\theta_0)} \epsilon_n^2 + o(\cdot) \]
[Fearnhead & Prangle, 2012]
more asymptotic behaviour of $E_{ABC}[\theta]$
Li and Fearnhead (2016, 2017) consider that
\[ E_{ABC}\big[ d^T(\theta - \theta_0) \,\big|\, y^o \big] \]
is not optimal when $p > d$
• If $\sqrt{n} \, \epsilon_n^2 = o(1)$ and $\epsilon_n \sqrt{n} \neq o(1)$
\[ \sqrt{n} \, \big[ E_{ABC}(\theta) - \theta_0 \big] = P_{\nabla b^o} Z^o + o_p(1) \]
\[ Z^o = \sqrt{n} \, (\eta(y) - b^o) \,, \quad P_{\nabla b^o} Z^o = \big\{ (\nabla b^o)^T \nabla b^o \big\}^{-1} (\nabla b^o)^T Z^o \]
and
\[ V_{as}(P_{\nabla b^o} Z^o) \geq \big\{ (\nabla b^o)^T V_{as}(Z^o)^{-1} (\nabla b^o) \big\}^{-1} \]
• If $\epsilon_n \sqrt{n} = o(1)$
\[ \sqrt{n} \, \big[ E_{ABC}(\theta) - \theta_0 \big] = \big\{ (\nabla b^o)^T \Sigma^{-1} \nabla b^o \big\}^{-1} (\nabla b^o)^T \Sigma^{-1} Z^o + o_p(1) \]
impact of the dimension of η
dimension of $\eta(\cdot)$ does not impact the above result, but impacts the acceptance probability
• if $\epsilon_n = o(\sigma_n)$, $k_1 = \dim(\eta(y))$, $k = \dim(\theta)$ & $k_1 \geq k$
\[ \alpha_n := \Pr\big( \|y - z\| \leq \epsilon_n \big) \asymp \epsilon_n^{k_1} \sigma_n^{-k_1 + k} \]
• if $\epsilon_n \gtrsim \sigma_n$
\[ \alpha_n := \Pr\big( \|y - z\| \leq \epsilon_n \big) \asymp \epsilon_n^k \]
• If we choose $\alpha_n$
  • $\alpha_n = o(\sigma_n^k)$ leads to $\epsilon_n = \sigma_n (\alpha_n \sigma_n^{-k})^{1/k_1} = o(\sigma_n)$
  • $\alpha_n \gtrsim \sigma_n^k$ leads to $\epsilon_n \asymp \alpha_n^{1/k}$.
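A small Monte Carlo sketch of this dimension penalty (an assumed Gaussian toy model with $k = 1$ parameter and $k_1 = 2$ summaries, $\sigma_n \approx n^{-1/2}$; everything in the code is an illustrative choice): once ε falls below σ_n, the acceptance rate decays super-linearly in ε rather than like ε itself:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed setup: theta ~ N(0, 1), y_i ~ N(theta, 1), k = 1 parameter but
# k1 = 2 summaries (sample mean and sample variance), sigma_n ~ n**-0.5.
n, M, theta0 = 400, 40_000, 0.3
y = rng.normal(theta0, 1.0, size=n)
s_obs = np.array([y.mean(), y.var()])

thetas = rng.normal(0.0, 1.0, size=M)
z = rng.normal(thetas[:, None], 1.0, size=(M, n))
s = np.column_stack([z.mean(axis=1), z.var(axis=1)])
d = np.linalg.norm(s - s_obs, axis=1)         # distance between summary vectors

sigma_n = n ** -0.5                           # about 0.05
for eps in (10 * sigma_n, sigma_n, 0.3 * sigma_n):
    print(f"eps = {eps:.4f}: acceptance = {np.mean(d <= eps):.5f}")
```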
conclusion on ABC consistency
• asymptotic description of ABC: different regimes depending on $\epsilon_n$ & $\sigma_n$
• no point in choosing $\epsilon_n$ arbitrarily small: just $\epsilon_n = o(\sigma_n)$
• no asymptotic gain in iterative ABC
• results under weak conditions by not studying $g(\eta(z)|\theta)$