Coordinate Sampler

Coordinate Sampler: A Non-Reversible Gibbs-like Sampler
Christian P. Robert
U Paris Dauphine PSL & University of Warwick
Joint work with Wu Changye, plus loans from Arnaud Doucet
Statistics and Computing 30, 721–730 (2020)

Outline
Background
Versions of PDMP
Coordinate Sampler
Numerical comparison
Conclusion

Background
Generic issue
Goal: sample from a target known up to a constant, defined over
$\mathbb{R}^d$,
$\pi(x) \propto \gamma(x)$
with energy $U(x) = -\log \pi(x)$, $U \in C^1$.

Background
Marketing arguments
Current default workhorse: reversible MCMC methods
Non-reversible MCMC algorithms based on piecewise deterministic
Markov processes perform well empirically
Quantitative convergence rates and variance now available
Roots in physics (Peters & de With, 2012; Krauth et al., 2009,
2015, 2016)
Mesquita and Hespanha (2010) show geometric ergodicity for
exponentially decaying tail targets
Monmarché (2016) gives sharp results for compact
state-spaces
Bierkens et al. (2016a,b) show ergodicity of targets on the real
line

Background
Motivation: piecewise deterministic Markov process
PDMP sampler is a (new?) continuous-time, non-reversible MCMC
method based on auxiliary variables
1. particle physics simulation
[Peters and de With, 2012]
2. empirically state-of-the-art performances
[Bouchard-Côté et al., 2018]
3. exact subsampling in big data settings
[Bierkens et al., 2016]
4. geometric ergodicity for a large class of distributions
[Deligiannidis et al., 2017; Bierkens et al., 2017]
5. Ability to deal with intractable potential $U(x) = \int U_\omega(x)\,\mu(d\omega)$
[Pakman et al., 2016]

Background
Older versions
Use of alternative methodology based on Birth-and-Death (point)
process
Idea: Create Markov chain in continuous time, i.e. a Markov jump
process
Time till next modification (jump) exponentially distributed with
intensity q(θ, θ′) depending on current and future states.
[Preston, 1976; Ripley, 1977; Geyer & Møller, 1994; Stephens, 1999]

Background
Older versions
Difference with MH-MCMC: Whenever jump occurs, corresponding
move always accepted. Acceptance probabilities replaced with
holding times.
Implausible configurations
L(θ)π(θ) ≪ 1
die quickly.
It is sufficient to have detailed balance
L(θ)π(θ)q(θ, θ′) = L(θ′)π(θ′)q(θ′, θ) for all θ, θ′
for π̃(θ) ∝ L(θ)π(θ) to be stationary.
[Cappé et al., 2000]


Background
Hamiltonian setup
All MCMC schemes presented here target an extended distribution
on Z = Rd × Rd
$\rho(z) = \pi(x) \times \psi(v) = \exp(-H(z))$, with H the Hamiltonian,
where z = (x, v) extended state and ψ(v) [by default] multivariate
standard Normal
Physics takes v as velocity or momentum variables allowing for a
deterministic dynamics on Rd
Obviously sampling from ρ provides samples from π

Background
Piecewise deterministic Markov process
Piecewise deterministic Markov process {zt ∈ Z}t∈[0,∞), with three
ingredients
1. Deterministic dynamics: between events, deterministic
evolution based on ODE
dzt/dt = Φ(zt)
2. Event occurrence rate: λ(t) = λ(zt)
3. Transition dynamics: At event time, τ, state prior to τ
denoted by zτ−, and new state generated by zτ ∼ Q(·|zτ−).
[Davis, 1984, 1993]

Background
Implementation
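Algorithm 1: Simulation of PDMP
Starting point z0, τ0 ← 0.
for k = 1, 2, 3, · · · do
  Sample inter-event time ηk from distribution
  $P(\eta_k > t) = \exp\left\{ -\int_0^t \lambda(z_{\tau_{k-1}+s})\,ds \right\}$.
  $\tau_k \leftarrow \tau_{k-1} + \eta_k$, $z_{\tau_{k-1}+s} \leftarrow \Psi_s(z_{\tau_{k-1}})$, for s ∈ (0, ηk),
  where Ψ ODE flow of Φ.
  $z_{\tau_k^-} \leftarrow \Psi_{\eta_k}(z_{\tau_{k-1}})$, $z_{\tau_k} \sim Q(\cdot \mid z_{\tau_k^-})$.
end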
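A minimal Python sketch of Algorithm 1, assuming the user supplies the three ingredients as functions. The names `simulate_pdmp`, `flow`, `sample_event_time` and `transition` are hypothetical, and the toy example (linear flow, unit event rate, velocity flip) is only an illustration, not a sampler for any particular target.

```python
import random

def simulate_pdmp(z0, flow, sample_event_time, transition, n_events, rng):
    """Return the event skeleton [(tau_k, z_{tau_k})] of a PDMP."""
    t, z = 0.0, z0
    skeleton = [(t, z)]
    for _ in range(n_events):
        eta = sample_event_time(z, rng)   # inter-event time eta_k
        z_minus = flow(z, eta)            # state just before the event
        z = transition(z_minus, rng)      # new state drawn from Q(. | z_minus)
        t += eta
        skeleton.append((t, z))
    return skeleton

# Toy ingredients (assumptions for illustration only).
def linear_flow(z, s):
    x, v = z
    return (x + v * s, v)                 # flow of Phi(z) = (v, 0)

def unit_rate_time(z, rng):
    return rng.expovariate(1.0)           # constant rate lambda = 1

def flip_velocity(z, rng):
    x, v = z
    return (x, -v)                        # deterministic transition

skeleton = simulate_pdmp((0.0, 1.0), linear_flow, unit_rate_time,
                         flip_velocity, 5, random.Random(0))
for t, (x, v) in skeleton:
    print(round(t, 3), round(x, 3), v)
```

Between two consecutive skeleton points the path is recovered by applying `flow` to the earlier point.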

Background
Simulation of PDMP: constraints
requires being able to compute exactly flow zt = Ψt(z0)
simplest algorithms based on Φ(z) = (v; 0d ) hence
Ψt(z0) = (x0 + v0t; v0)
except Hamiltonian
BPS with Hamiltonian dynamics for proxy Gaussian Hamiltonian
[Vanetti et al., 20
requires ability to simulate event times (inversion, thinning,
superposition)
requires simulations from Q


Versions of PDMP
Basic bouncy particle sampler
Simulation of continuous-time piecewise linear trajectory (xt)t with
each segment in trajectory specified by
initial position x
length τ
velocity v
length specified by inhomogeneous Poisson point process with
intensity function
$\lambda(x, v) = \max\{0, \langle \nabla U(x), v \rangle\}$
new velocity after bouncing given by Newtonian elastic collision
$R(x)v = v - 2\,\frac{\langle \nabla U(x), v \rangle}{\|\nabla U(x)\|^2}\,\nabla U(x)$
[(C.) Bouchard-Côté et al., 2018]
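The bounce rate and the elastic reflection can be written in a few lines. This is a sketch with hypothetical helper names (`dot`, `bps_rate`, `bounce`); the gradient is passed in as a plain list, and the numerical example assumes a standard Gaussian target, for which the gradient of U at x is x itself.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def bps_rate(grad_u, v):
    """Intensity max{0, <grad U(x), v>} of the bounce events."""
    return max(0.0, dot(grad_u, v))

def bounce(grad_u, v):
    """Reflect v against the hyperplane orthogonal to grad U(x)."""
    scale = 2.0 * dot(grad_u, v) / dot(grad_u, grad_u)
    return [vi - scale * gi for vi, gi in zip(v, grad_u)]

# Assumed example: U(x) = |x|^2 / 2, so grad U(x) = x.
x = [1.0, 0.0]
v = [1.0, 1.0]
print(bps_rate(x, v))   # particle is moving uphill, so the rate is positive
print(bounce(x, v))     # component along the gradient is reversed
```

The reflection keeps the norm of v unchanged and flips the sign of the inner product with the gradient, so the rate is zero immediately after a bounce.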

Versions of PDMP
Implementation hardships
Generally speaking, the main difficulties of implementing PDMP
come from
1. Computing the ODE flow Ψ: linear dynamic, quadratic
dynamic
2. Simulating the inter-event time ηk: many techniques of
superposition and thinning for Poisson processes
[Devroye, 1986]

Versions of PDMP
Poisson process on R+
Definition (Poisson process)
Poisson process with rate λ(·) on R+ is sequence
τ1, τ2, · · ·
of rv's whose intervals
τ1, τ2 − τ1, τ3 − τ2, · · ·
satisfy, given τi−1,
$P(\tau_i - \tau_{i-1} > T \mid \tau_{i-1}) = \exp\left\{ -\int_{\tau_{i-1}}^{\tau_{i-1}+T} \lambda(t)\,dt \right\}, \quad \tau_0 = 0$
[iid when λ is constant], that is, with a rarely available cdf

Versions of PDMP
Simulation by thinning
Theorem (Lewis and Shedler, 1979)
Let
$\lambda, \Lambda : \mathbb{R}_+ \to \mathbb{R}_+$
be continuous functions such that λ(·) ≤ Λ(·). Let
τ1, τ2, · · · ,
be the increasing sequence of a Poisson process with rate Λ(·). For
all i, if τi is removed from the sequence with probability
$1 - \lambda(\tau_i)/\Lambda(\tau_i)$
then the remaining τ̃1, τ̃2, · · · form a non-homogeneous Poisson
process with rate λ(·)
Simulation from upper bound [need be found]
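A short sketch of thinning, under the simplifying assumption that the upper bound Λ is a constant (the theorem allows a function). The name `thin_poisson` and the example rate 1 + sin(t), bounded by 2, are illustrative choices only.

```python
import math
import random

def thin_poisson(rate, upper, horizon, rng):
    """Events on (0, horizon] of a Poisson process with intensity rate(t) <= upper."""
    events, t = [], 0.0
    while True:
        t += rng.expovariate(upper)            # next event of the dominating process
        if t > horizon:
            return events
        if rng.random() < rate(t) / upper:     # keep with probability lambda / Lambda
            events.append(t)

rng = random.Random(42)
events = thin_poisson(lambda t: 1.0 + math.sin(t), 2.0, 1000.0, rng)
# The expected number of events is the integral of the rate over (0, 1000],
# i.e. 1000 + 1 - cos(1000).
print(len(events))
```

A tighter bound wastes fewer proposals: here about half of the dominating events are kept.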

Versions of PDMP
Simulation by superposition theorem
Theorem (Kingman, 1992)
Let Π1, Π2, · · · , be countable collection of independent Poisson
processes on R+ with resp. rates λn(·). If $\sum_{n=1}^{\infty} \lambda_n(t) < \infty$ for all
t's, then superposition process
$\Pi = \bigcup_{n=1}^{\infty} \Pi_n$
is Poisson process with rate
$\lambda(t) = \sum_{n=1}^{\infty} \lambda_n(t)$
Decomposition of $U = \sum_j U_j$ plus thinning
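In practice the superposition theorem is used in reverse: the first event of a process with rate Σ λn is obtained as the earliest of the first events of the component processes. The sketch below assumes finitely many constant rates for simplicity; `first_event_superposition` is a hypothetical name.

```python
import random

def first_event_superposition(rates, rng):
    """First event time of the superposed process and the index of its source."""
    times = [rng.expovariate(r) for r in rates]   # one clock per component
    j = min(range(len(rates)), key=times.__getitem__)
    return times[j], j

rng = random.Random(1)
rates = [0.5, 1.5, 2.0]
draws = [first_event_superposition(rates, rng) for _ in range(20000)]
mean_time = sum(t for t, _ in draws) / len(draws)
share_last = sum(1 for _, j in draws if j == 2) / len(draws)
# The superposed rate is 4, so the mean first event time should be close to 1/4,
# and component j wins with probability rates[j] / sum(rates).
print(mean_time, share_last)
```

With a decomposition of U into terms Uj, each component clock can itself be simulated by thinning with its own bound.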

Versions of PDMP
Simulation by superposition plus thinning
Almost all implementations of discrete-time schemes consist in
sampling a Bernoulli rv of parameter α(z)
For
Φ(z) = (x + v , v) and α(z) = 1 ∧ π(x + v )/π(x)
sampling inter-event time for strictly convex U can be obtained by
solving t = arg min U(x + vt) and additional randomization
thinning: if there exists ᾱ such that $\alpha(\Phi^k(z)) \ge \bar\alpha(x, k)$,
accept-reject
superposition and thinning: when α(z) = 1 ∧ ρ(Φ(z))/ρ(z)
and $\rho(\cdot) = \prod_i \rho_i(\cdot)$ then $\bar\alpha(z, k) = \prod_i \bar\alpha_i(z, k)$

Versions of PDMP
Extended generator
Definition
Let D(L) be the set of measurable functions f : Z → R such that there
exists a measurable function h : Z → R with t → h(zt) integrable Pz-a.s. for
each z ∈ Z and the process
$C_t^f = f(z_t) - f(z_0) - \int_0^t h(z_s)\,ds$
is a local martingale. Then $h \triangleq Lf$ and (L, D(L)) is the extended
generator of the process $\{z_t\}_{t \ge 0}$.

Versions of PDMP
Extended generator of PDMP
Theorem (Davis, 1993)
The generator, L, of above PDMP is, for f ∈ D(L)
$Lf(z) = \nabla f(z) \cdot \Phi(z) + \lambda(z) \int_{z'} \left[ f(z') - f(z) \right] Q(dz' \mid z)$
Furthermore, µ(dz) is an invariant distribution of above PDMP, if
$\int Lf(z)\,\mu(dz) = 0$, for all f ∈ D(L)

Versions of PDMP
PDMP-based sampler
PDMP-based sampler is an auxiliary variable technique
Given target π(x),
1. introduce auxiliary variable V ∈ V along with a density π(v|x),
2. choose appropriate Φ, λ and Q
for π(x)π(v|x) to be unique invariant distribution of Markov process

Versions of PDMP
Bouncy Particle Sampler (Bouchard-Côté et al., 2018)
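$V = \mathbb{R}^d$, and π(v|x) = ϕ(v) for N(0, Id )
1. Deterministic dynamics:
dxt/dt = vt, dvt/dt = 0
2. Event occurrence rate: $\lambda(x, v) = \langle v, \nabla U(x) \rangle_+ + \lambda^{\text{ref}}$
3. Transition dynamics:
$Q((dx', dv') \mid (x, v)) = \frac{\langle v, \nabla U(x) \rangle_+}{\lambda(x, v)}\,\delta_x(dx')\,\delta_{R_{\nabla U(x)} v}(dv') + \frac{\lambda^{\text{ref}}}{\lambda(x, v)}\,\delta_x(dx')\,\varphi(dv')$
where $R_{\nabla U(x)} v = v - 2\,\frac{\langle \nabla U(x), v \rangle}{\langle \nabla U(x), \nabla U(x) \rangle}\,\nabla U(x)$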

Versions of PDMP
Zig-Zag Sampler (Bierkens et al., 2016)
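$V = \{+1, -1\}^d$, and π(v|x) ∼ Uniform({+1, −1}^d)
1. Deterministic dynamics: dxt/dt = vt, dvt/dt = 0
2. Event occurrence rate:
$\lambda(x, v) = \sum_{i=1}^d \lambda_i(x, v) = \sum_{i=1}^d \left[ \{v_i \nabla_i U(x)\}_+ + \lambda_i^{\text{ref}} \right]$
3. Transition dynamics:
$Q((dx', dv') \mid (x, v)) = \sum_{i=1}^d \frac{\lambda_i(x, v)}{\lambda(x, v)}\,\delta_x(dx')\,\delta_{F_i v}(dv')$
where Fi operator that flips i-th component of v and keeps others
unchanged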
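A compact sketch of the Zig-Zag sampler, under two loud assumptions: the target is the standard Gaussian N(0, Id), for which each component clock can be inverted in closed form, and all refreshment rates are set to zero. The function name `zigzag_gaussian` is hypothetical. With a = vi xi, the i-th rate along the segment is (a + t)+, and solving ∫(a + s)+ ds = E for E ∼ Exp(1) gives τ = −a + sqrt(max(a, 0)² + 2E).

```python
import math
import random

def zigzag_gaussian(d, n_events, rng):
    """Zig-Zag sampler for N(0, I_d); returns the time average of x_0**2."""
    x = [0.0] * d
    v = [1.0] * d
    total_t = int_x2 = 0.0
    for _ in range(n_events):
        # One clock per component, first one to ring triggers the event.
        taus = []
        for i in range(d):
            a = v[i] * x[i]
            taus.append(-a + math.sqrt(max(a, 0.0) ** 2
                                       + 2.0 * rng.expovariate(1.0)))
        i_star = min(range(d), key=taus.__getitem__)
        tau = taus[i_star]
        # All components move; integrate x_0(t)^2 exactly along the segment.
        x0_start = x[0]
        x = [xi + vi * tau for xi, vi in zip(x, v)]
        int_x2 += (x[0] ** 3 - x0_start ** 3) / (3.0 * v[0])
        total_t += tau
        v[i_star] = -v[i_star]            # flip operator F_i
    return int_x2 / total_t

print(zigzag_gaussian(2, 100000, random.Random(3)))   # should be close to 1
```

Redrawing every clock after each event is valid because the process is Markov: the remaining clocks restart from the new state.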

Versions of PDMP
Continuous-time Hamiltonian Monte Carlo (Neal, 1999)
$V = \mathbb{R}^d$, and π(v|x) = ϕ(v) ∼ N(0, Id )
1. Deterministic dynamics: dxt/dt = vt, dvt/dt = −∇U(xt)
2. Event occurrence rate: λ(x, v) = λ0(x)
3. Transition dynamics: Q((dx′, dv′)|(x, v)) = δx(dx′)ϕ(dv′)
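For a Gaussian target the Hamiltonian flow is available exactly, which makes this sampler easy to illustrate. The sketch below assumes a one-dimensional N(0, 1) target, so that the flow is a rotation of (x, v), and a constant event rate λ0; the name `randomized_hmc_gaussian` is hypothetical.

```python
import math
import random

def randomized_hmc_gaussian(n_events, lam0, rng):
    """Continuous-time HMC for N(0, 1); returns the time average of x**2."""
    x, v = 0.0, rng.gauss(0.0, 1.0)
    total_t = int_x2 = 0.0
    for _ in range(n_events):
        tau = rng.expovariate(lam0)                 # constant event rate lam0
        c, s = math.cos(tau), math.sin(tau)
        s2 = math.sin(2.0 * tau)
        # Exact integral of (x cos t + v sin t)^2 over [0, tau].
        int_x2 += (x * x * (tau / 2.0 + s2 / 4.0)
                   + v * v * (tau / 2.0 - s2 / 4.0)
                   + x * v * s * s)
        total_t += tau
        x = x * c + v * s                           # exact Hamiltonian flow for x
        v = rng.gauss(0.0, 1.0)                     # event: refresh velocity
    return int_x2 / total_t

print(randomized_hmc_gaussian(50000, 1.0, random.Random(5)))   # close to 1
```

For a general potential the flow is not available in closed form, which is the implementation constraint mentioned earlier for PDMP samplers.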

Versions of PDMP
Continuous-time Riemannian manifold HMC (Girolami
& Calderhead, 2011)
$V = \mathbb{R}^d$, and π(v|x) = N(0, G(x)), with Hamiltonian
$H(x, v) = U(x) + \tfrac{1}{2} v^T G(x)^{-1} v + \tfrac{1}{2} \log |G(x)|$
1. Deterministic dynamics: dxt/dt = ∂H/∂v(xt, vt), dvt/dt = −∂H/∂x(xt, vt)
2. Event occurrence rate: λ(x, v) = λ0(x)
3. Transition dynamics: Q((dx′, dv′)|(x, v)) = δx(dx′)ϕ(dv′|x′)

Versions of PDMP
Randomized BPS
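Define
$a = \frac{\langle v, \nabla U(x) \rangle}{\langle \nabla U(x), \nabla U(x) \rangle}\,\nabla U(x), \quad b = v - a$
Regular BPS, move v′ = −a + b
Alternatives
1. Fearnhead et al. (2016):
$v' \sim Q_x(dv' \mid v) = \max\{0, \langle -v', \nabla U(x) \rangle\}\,dv'$
2. Wu and Robert (2017): v′ = −a + b′, where b′ Gaussian
variate over the space orthogonal to ∇U(x) in $\mathbb{R}^d$.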

Versions of PDMP
HMC-BPS (Vanetti et al., 2017)
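ρ(x) ∝ exp{−V (x)} is a Gaussian approximation of the target π(x).
$\hat H(x, v) = V(x) + \tfrac{1}{2} v^T v, \quad \tilde U(x) = U(x) - V(x)$
1. Deterministic dynamics: dxt/dt = vt, dvt/dt = −∇V (xt)
2. Event occurrence rate: $\lambda(x, v) = \langle v, \nabla \tilde U(x) \rangle_+ + \lambda^{\text{ref}}$
3. Transition dynamics:
$Q((dx', dv') \mid (x, v)) = \frac{\langle v, \nabla \tilde U(x) \rangle_+}{\lambda(x, v)}\,\delta_x(dx')\,\delta_{R_{\nabla \tilde U(x)} v}(dv') + \frac{\lambda^{\text{ref}}}{\lambda(x, v)}\,\delta_x(dx')\,\varphi(dv')$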

Versions of PDMP
Discretisation
1. Sherlock and Thiery (2017) consider a delayed rejection
approach with only point-wise evaluations of the target, making a
speed flip move once a proposal involving a flip in speed and a drift
in the variable of interest is rejected. They also add a random
perturbation for ergodicity, plus another perturbation based on a
Brownian argument. Requires calibration
2. Vanetti et al. (2017) unify many threads and relate PDMP,
HMC, and discrete versions, with convergence results. Main
idea improves upon existing deterministic methods by
accounting for target. Borrows from earlier slice sampler idea
of Murray et al. (AISTATS, 2010), exploiting exact
Hamiltonian dynamics for approximation to true target.
Except that bouncing avoids the slice step. Discrete BPS both
correct against target and not simulating event times.
Benefit: bypassing the generation of inter-event time of
inhomogeneous Poisson processes.


Coordinate Sampler
Coordinate sampler
A generalisation of the Zig-Zag sampler such that
1. velocity set used in coordinate sampler (CS) made of
orthonormal basis of $\mathbb{R}^d$, while for Zigzag sampler (ZS) it is
restricted to $\{-1, 1\}^d$
2. event rate function λ(·) in ZS much larger than for CS,
especially for high dimensional distributions: events occur
more frequently in ZS [with lower efficiency]
3. CS targets only one component at a time, while ZS modifies
all components at the same time

Coordinate Sampler
Coordinate Sampler
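generalisation of ZS where bounce uniformly random on
V = {±e1, · · · , ±ed }
1. Deterministic dynamics: dxt/dt = vt, dvt/dt = 0
2. Event occurrence rate: $\lambda(x, v) = \langle v, \nabla U(x) \rangle_+ + \lambda^{\text{ref}}$
3. Transition dynamics:
$Q((dx', dv') \mid (x, v)) = \sum_{v^* \in V} \frac{\lambda(x, -v^*)}{\lambda(x)}\,\delta_x(dx')\,\delta_{v^*}(dv')$
where $\lambda(x) = \sum_{v \in V} \lambda(x, v) = 2d\lambda^{\text{ref}} + \sum_{i=1}^d \left| \partial U(x)/\partial x_i \right|$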
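A self-contained sketch of the coordinate sampler, assuming a standard Gaussian target N(0, Id) so that event times can be simulated exactly. The function name `coordinate_sampler_gaussian` and the encoding of the velocity v = s·ei as a pair (i, s) are illustrative choices. The event time is the minimum of a bounce clock with rate (s xi + t)+, inverted in closed form, and a refreshment clock with rate λref (superposition).

```python
import math
import random

def coordinate_sampler_gaussian(d, n_events, lam_ref, rng):
    """CS for N(0, I_d); returns time averages of x_0 and x_0**2."""
    x = [0.0] * d
    i, s = 0, 1.0                              # velocity v = s * e_i
    total_t = int_x = int_x2 = 0.0
    for _ in range(n_events):
        # Event time for rate (s*x_i + t)_+ + lam_ref.
        a = s * x[i]
        tau_bounce = -a + math.sqrt(max(a, 0.0) ** 2
                                    + 2.0 * rng.expovariate(1.0))
        tau = min(tau_bounce, rng.expovariate(lam_ref))
        # Only coordinate i moves; integrate x_0 and x_0^2 exactly.
        x0_start = x[0]
        x[i] += s * tau
        if i == 0:
            int_x += (x[0] ** 2 - x0_start ** 2) / (2.0 * s)
            int_x2 += (x[0] ** 3 - x0_start ** 3) / (3.0 * s)
        else:
            int_x += x0_start * tau
            int_x2 += x0_start ** 2 * tau
        total_t += tau
        # New velocity v* = sign * e_j with probability lambda(x, -v*) / lambda(x).
        weights = []
        for j in range(d):
            for sign in (1.0, -1.0):
                weights.append(max(0.0, -sign * x[j]) + lam_ref)
        k = rng.choices(range(2 * d), weights=weights)[0]
        i, s = k // 2, (1.0 if k % 2 == 0 else -1.0)
    return int_x / total_t, int_x2 / total_t

mean, second = coordinate_sampler_gaussian(2, 100000, 1.0, random.Random(1))
print(mean, second)    # should be close to 0 and 1
```

The weights favour directions pointing downhill: for xj > 0 the velocity −ej receives weight xj + λref while +ej only receives λref.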

Coordinate Sampler
Validation of Coordinate Sampler
Extended generator
$Lf(x, v) = \langle \nabla_x f(x, v), v \rangle + \lambda(x, v) \sum_{v' \in V} \frac{\lambda(x, -v')}{\lambda(x)} \left[ f(x, v') - f(x, v) \right]$
Theorem
For any λref > 0, the PDMP induced by CS enjoys
π(x)ϕ(v) as unique invariant distribution, provided potential U is
$C^1$.

Coordinate Sampler
Geometric Ergodicity of Coordinate Sampler
Geometric ergodicity for distributions with tails faster than
exponential and slower than Gaussian
Assumptions: Assume $U : \mathbb{R}^d \to \mathbb{R}_+$ satisfy
A1 $\partial^2 U(x)/\partial x_i \partial x_j$ is locally Lipschitz continuous for all i, j
A2 $\int \|\nabla U(x)\|\,\pi(dx) < \infty$
A3 $\lim_{|x| \to \infty} e^{U(x)/2} / \|\nabla U(x)\| > 0$
A4 V ≥ c0 for some positive constant c0
Further conditions
C1 $\lim_{|x| \to \infty} \|\nabla U(x)\| = \infty$, $\lim_{|x| \to \infty} \|\Delta U(x)\| \le \alpha_1 < \infty$ and
$\lambda^{\text{ref}} > \sqrt{8\alpha_1}$
C2 $\lim_{|x| \to \infty} \|\nabla U(x)\| = 2\alpha_2 > 0$, $\lim_{|x| \to \infty} \|\Delta U(x)\| = 0$ and
$\lambda^{\text{ref}} < \alpha_2/(14d)$
[Deligiannidis et al., 2017]

Coordinate Sampler
Lyapunov function for the Markov process induced by coordinate
sampler is
$V(x, v) = e^{U(x)/2} \big/ \sqrt{\lambda^{\text{ref}} + \langle \nabla U(x), -v \rangle_+}$
Theorem
Suppose A1 - A4 hold and one of the conditions C1, C2 holds, then
CS is V -uniformly ergodic.

Coordinate Sampler
Coordinate Sampler vs Zig-Zag Sampler
1. Cardinality of V of CS is 2d, while for ZS it is $2^d$;
2. Along each piecewise segment, CS only changes one
component of x, while ZS modifies all components at the
same time (con);
3. $\lambda_{CS} = \{v_i \nabla_i U(x)\}_+ + \lambda^{\text{ref}}$, if v = vi ei ; while
$\lambda_{ZS} = \sum_{i=1}^d \left[ \{v_i \nabla_i U(x)\}_+ + \lambda^{\text{ref}} \right]$
if $v = \sum_{i=1}^d v_i e_i$. Generally speaking, λCS is much smaller
than λZS (pro);
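The two rates can be compared numerically for a given gradient. This sketch uses hypothetical helper names (`rate_cs`, `rate_zs`) and made-up gradient values, and assumes the same refreshment rate for every component.

```python
def positive_part(u):
    return max(0.0, u)

def rate_cs(grad_u, i, sign, lam_ref):
    """Event rate of CS when the velocity is sign * e_i."""
    return positive_part(sign * grad_u[i]) + lam_ref

def rate_zs(grad_u, v, lam_ref):
    """Event rate of ZS when the velocity is v in {-1, +1}^d."""
    return sum(positive_part(vi * gi) + lam_ref for vi, gi in zip(v, grad_u))

grad_u = [0.5, -1.0, 2.0]        # illustrative gradient values
v = [1.0, 1.0, 1.0]
lam_ref = 0.1
print([rate_cs(grad_u, i, v[i], lam_ref) for i in range(3)])
print(rate_zs(grad_u, v, lam_ref))
```

By construction the ZS rate is the sum of the d corresponding CS rates, so it grows with the dimension while each CS rate does not.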

Coordinate Sampler
Coordinate Sampler vs Zig-Zag Sampler
Suppose that
1. λ(x, v) of CS and λi (x, v) of Zig-Zag sampler have same scale.
2. Simulating first event time of Poisson process with rates
λ(x, v) of CS and λi (x, v) of ZS : same computation cost,
O(c)
An O(dc) computation cost results in
1. ZS makes each component evolve with scale O( /d),
2. CS makes each component evolve O( ) [gain O(d)]
on average


Example 1: Banana-shaped Distribution
target with density
$\pi(x) \propto \exp\left\{ -(x_1 - 1)^2 - \kappa (x_2 - x_1^2)^2 \right\}$
where large κ increases curvature and difficulty
[Figure: ratio of ESS per second against log2(κ), with curves for first component, second component, and log-likelihood]
x-axis corresponds to log2(κ), y-axis to ratio of ESS's per second
for CS versus ZS. Red line efficiency ratio for first component, blue for
second, and green for log-likelihood.
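Both samplers only need the potential U = −log π through its partial derivatives. A sketch for the banana-shaped target, with hypothetical function names and the gradient obtained by differentiating the density above:

```python
def banana_potential(x, kappa):
    """U(x) = (x1 - 1)^2 + kappa * (x2 - x1^2)^2, up to an additive constant."""
    x1, x2 = x
    return (x1 - 1.0) ** 2 + kappa * (x2 - x1 ** 2) ** 2

def banana_gradient(x, kappa):
    """Partial derivatives of U with respect to x1 and x2."""
    x1, x2 = x
    r = x2 - x1 ** 2
    return [2.0 * (x1 - 1.0) - 4.0 * kappa * x1 * r, 2.0 * kappa * r]

print(banana_gradient([1.0, 1.0], 4.0))   # the gradient vanishes at the mode (1, 1)
```

Since the partial derivatives are polynomial in x, event times along a coordinate direction could be bounded for thinning, although the bound itself is not derived here.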

Example 2: Multivariate Gaussian Distribution
$\pi(x) \propto \exp\left\{ -\tfrac{1}{2} x^T A^{-1} x \right\}$
1. MVN1: Aii = 1, for i = 1, · · · , d and Aij = 0.9 for i ≠ j.
2. MVN2: Aii = 1, for i = 1, · · · , d and $A_{ij} = 0.9^{|i-j|}$ for i ≠ j.
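The two covariance structures can be built as follows. This is a small sketch using nested lists, with hypothetical function names and the correlation 0.9 exposed as a parameter `rho`.

```python
def mvn1_cov(d, rho=0.9):
    """Equicorrelated covariance: 1 on the diagonal, rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(d)] for i in range(d)]

def mvn2_cov(d, rho=0.9):
    """Covariance with geometric decay: entry (i, j) is rho^|i - j|."""
    return [[rho ** abs(i - j) for j in range(d)] for i in range(d)]

for row in mvn2_cov(3):
    print([round(a, 2) for a in row])
```

In MVN1 every pair of components is strongly correlated, whereas in MVN2 the correlation fades with the distance between indices.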

[Figure: two panels of efficiency ratios (MinESS, MeanESS, MaxESS) against dimension d, from 20 to 100]
lhs plot shows results for MVN1 and rhs for MVN2, x-axis indexes
dimension d and y-axis efficiency ratios of CS over ZS in terms of
minimal (red), mean (blue), and maximal ESS (green) across
components

upper plot shows results for MVN1 and lower for MVN2. The
x-axis indexes dimension d of distribution, and y-axis efficiency
ratios of CS over ZS in terms of minimum, mean, median and
maximum of ESS across the components per number of calls to the
event rate function.

Example 3: Bayesian Logistic Posterior
simulated dataset of N observations $\{(r_n, t_n)\}_{n=1}^N$ where each $r_{n,i}$
drawn from standard Normal distribution and tn drawn from {−1, 1}
uniformly
$\pi(x) \propto \prod_{n=1}^N \frac{\exp(t_n x^T r_n)}{1 + \exp(x^T r_n)}$
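A sketch of the corresponding potential and gradient, taking the density exactly as written above, so that U(x) = Σn [log(1 + exp(xᵀrn)) − tn xᵀrn]. The function names, the seed, and the helper `simulate_data` are illustrative assumptions.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softplus(s):
    """Numerically stable log(1 + exp(s))."""
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))

def sigmoid(s):
    """Numerically stable exp(s) / (1 + exp(s))."""
    if s >= 0.0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def simulate_data(n_obs, d, rng):
    """Covariates r_n ~ N(0, I_d) and labels t_n uniform on {-1, 1}."""
    return [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.choice((-1, 1)))
            for _ in range(n_obs)]

def logistic_potential(x, data):
    return sum(softplus(dot(x, r)) - t * dot(x, r) for r, t in data)

def logistic_gradient(x, data):
    grad = [0.0] * len(x)
    for r, t in data:
        w = sigmoid(dot(x, r)) - t        # derivative with respect to x^T r_n
        for i, ri in enumerate(r):
            grad[i] += w * ri
    return grad

data = simulate_data(40, 10, random.Random(0))   # N = 40, d = 10
print(logistic_gradient([0.0] * 10, data)[:3])
```

Each partial derivative is a sum over observations, which is the structure exploited by superposition and subsampling.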

Example 3: Bayesian Logistic Posterior
[Figure: ESS per second (0 to 150) for MinESS, MeanESS and MaxESS, by type of sampler (CS, ZS)]
Comparison of CS versus ZS: y-axis stands for ESS per second,
d = 10, N = 40

Conclusion
Forward
1. Comparing coordinate sampler and zigzag sampler theoretically
2. Optimising reparametrization
3. Riemannian manifold technique

Conclusion
Bierkens, J., Fearnhead, P., and Roberts, G. (2016). The zig-zag process and super-efficient sampling
for Bayesian analysis of big data. arXiv preprint arXiv:1607.03188.
Bierkens, J., Roberts, G., and Zitt, P.-A. (2017). Ergodicity of the zigzag process. arXiv preprint
arXiv:1712.09875.
Bouchard-Côté, A., Vollmer, S. J., and Doucet, A. (2018). The bouncy particle sampler: a
non-reversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical
Association, (to appear).
Davis, M. H. (1984). Piecewise-deterministic Markov processes: A general class of non-diffusion
stochastic models. Journal of the Royal Statistical Society. Series B (Methodological), pages
353–388.
Davis, M. H. (1993). Markov Models & Optimization, volume 49. CRC Press.
Deligiannidis, G., Bouchard-Côté, A., and Doucet, A. (2017). Exponential ergodicity of the bouncy
particle sampler. arXiv preprint arXiv:1705.04579.
Fearnhead, P., Bierkens, J., Pollock, M., and Roberts, G. O. (2016). Piecewise deterministic Markov
processes for continuous-time Monte Carlo. arXiv preprint arXiv:1611.07873.
Kingman, J. F. C. (1992). Poisson processes, volume 3. Clarendon Press.
Lewis, P. A. and Shedler, G. S. (1979). Simulation of nonhomogeneous Poisson processes by thinning.
Naval Research Logistics (NRL), 26(3):403–413.
Peters, E. A. and de With, G. (2012). Rejection-free Monte Carlo sampling for general potentials.
Physical Review E, 85(2):026703.
Sherlock, C. and Thiery, A. H. (2017). A discrete bouncy particle sampler. arXiv preprint
arXiv:1707.05200.
Vanetti, P., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. (2017). Piecewise deterministic
Markov chain Monte Carlo. arXiv preprint arXiv:1707.05296.
Wu, C. and Robert, C. P. (2017). Generalized bouncy particle sampler. arXiv preprint arXiv:1706.04781.
