Bayesian Model Selection Guide

Bayesian model choice
(and some alternatives)
Christian P. Robert
Université Paris-Dauphine, IUF, & CREST
http://www.ceremade.dauphine.fr/~xian
November 20, 2010

Outline
Anyone not shocked by the Bayesian theory of inference has not understood it
Senn, BA, 2008
1 Introduction
2 Tests and model choice
3 Incoherent inferences

Vocabulary and concepts
Bayesian inference is a coherent mathematical theory
but I don’t trust it in scientific applications.
Gelman, BA, 2008
1 Introduction
Models
The Bayesian framework
Improper prior distributions
Noninformative prior distributions

Parametric model
Bayesians promote the idea that a multiplicity of parameters can be handled via
hierarchical, typically exchangeable, models, but it seems implausible that this
could really work automatically [instead of] giving reasonable answers using
minimal assumptions.
Gelman, BA, 2008
Observations x1, . . . , xn generated from a probability distribution
fi(xi|θi, x1, . . . , xi−1) = fi(xi|θi, x1:i−1)
x = (x1, . . . , xn) ∼ f(x|θ), θ = (θ1, . . . , θn)
Associated likelihood
ℓ(θ|x) = f(x|θ)
[inverted density & starting point]

Bayesian approach
The impact of treating x as a fixed constant
is to increase statistical power as an artefact
Templeton, Molec. Ecol., 2009
New perspective
Uncertainty on the parameters θ of a model modeled through a
probability distribution π on Θ, called prior distribution

Inference based on the distribution of θ conditional on x, π(θ|x),
called posterior distribution
\[ \pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta}. \]
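As a minimal numerical sketch of Bayes's formula, the posterior can be approximated on a grid by multiplying likelihood and prior and dividing by a Riemann approximation of the integral. The N(θ, 1) likelihood, the N(0, 2²) prior, the grid and the function names are illustrative assumptions, not taken from the slides.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def grid_posterior(x, grid, prior_sd=2.0):
    """Posterior density pi(theta | x) on an evenly spaced grid (assumed setup)."""
    step = grid[1] - grid[0]
    # numerator f(x | theta) * pi(theta) at each grid point
    unnorm = [normal_pdf(x, t, 1.0) * normal_pdf(t, 0.0, prior_sd) for t in grid]
    evidence = sum(unnorm) * step  # Riemann approximation of the denominator
    return [u / evidence for u in unnorm]

grid = [-10 + 0.01 * i for i in range(2001)]
post = grid_posterior(1.5, grid)
post_mean = sum(t * p for t, p in zip(grid, post)) * 0.01
```

In this conjugate case the exact posterior mean is x τ²/(1 + τ²) with τ² = 4, which gives a direct check on the grid approximation.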

[Nonphilosophical] justifications
Ignoring the sampling error of x undermines
the statistical validity of all inferences made by the method
Semantic drift from unknown to random

Actualization of the information on θ by extracting the information on
θ contained in the observation x

Allows incorporation of imperfect information in the decision process

Unique mathematical way to condition upon the observations
(conditional perspective)

Unique way to give meaning to statements like P(θ > 0)

Posterior distribution
Bayesian methods are presented as an automatic inference engine,
and this raises suspicion in anyone with applied experience
Gelman, BA, 2008
π(θ|x) central to Bayesian inference
Operates conditional upon the observations
Incorporates the requirement of the Likelihood Principle
Avoids averaging over the unobserved values of x
Coherent updating of the information available on θ
Provides a complete inferential machinery

Improper distributions
If we take P(dσ) ∝ dσ as a statement that σ may have any value between 0 and
∞ (...), we must use ∞ instead of 1 to denote certainty.
Jeffreys, ToP, 1939
Necessary extension from a prior distribution to a prior σ-finite measure π
such that
\[ \int_\Theta \pi(\theta)\,\mathrm{d}\theta = +\infty \]
Improper prior distribution
[Weird? Inappropriate?? report!! ]

Justifications
If the parameter may have any value from −∞ to +∞,
its prior probability should be taken as uniformly distributed
Automated prior determination often leads to improper priors
1 Similar performances of estimators derived from these generalized
distributions
2 Improper priors as limits of proper distributions in many
[mathematical] senses

Further justifications
There is no good objective principle for choosing a noninformative prior (even if
that concept were mathematically defined, which it is not)
Gelman, BA, 2008
4 Robust answer against possible misspecifications of the prior
5 Frequentist justifications, such as:
(i) minimaxity
(ii) admissibility
(iii) invariance (Haar measure)
6 Improper priors [much] preferred to vague proper priors like N(0, 10^6)

Validation
The mistake is to think of them as representing ignorance
Lindley, JASA, 1990
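Extension of the posterior distribution π(θ|x) associated with an improper
prior π as given by Bayes’s formula
\[ \pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int_\Theta f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta}, \]
when
\[ \int_\Theta f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta < \infty \]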
Delete emotionally loaded names

Noninformative priors
...cannot be expected to represent exactly total ignorance about the problem, but
should rather be taken as reference priors, upon which everyone could fall back
when the prior information is missing.
Kass and Wasserman, JASA, 1996
What if all we know is that we know “nothing” ?!

In the absence of prior information, prior distributions solely derived from
the sample distribution f(x|θ)

Difficulty with uniform priors, lacking invariance properties. Rather use
Jeffreys’ prior.
[Jeffreys, 1939; Robert, Chopin & Rousseau, 2009]

Tests and model choice
The Jeffreys-subjective synthesis betrays a much more dangerous confusion than
the Neyman-Pearson-Fisher synthesis as regards hypothesis tests
Senn, BA, 2008
1 Introduction
2 Tests and model choice
Bayesian tests
Opposition to classical tests
Model choice
Pseudo-Bayes factors
Compatible priors
Variable selection

Construction of Bayes tests
What is almost never used, however, is the Jeffreys significance test.
Senn, BA, 2008
Definition (Test)
Given a hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a statistical
model, a test is a statistical procedure that takes its values in {0, 1}.
Example (Normal mean)
For x ∼ N (θ, 1), decide whether or not θ ≤ 0.

Decision-theoretic perspective
Loss functions [are] not relevant to statistical inference
Gelman, BA, 2008
Theorem (Optimal Bayes decision)
Under the 0 − 1 loss function
\[
L(\theta, d) =
\begin{cases}
0 & \text{if } d = \mathbb{I}_{\Theta_0}(\theta) \\
a_0 & \text{if } d = 1 \text{ and } \theta \notin \Theta_0 \\
a_1 & \text{if } d = 0 \text{ and } \theta \in \Theta_0
\end{cases}
\]
the Bayes procedure is
\[
\delta^\pi(x) =
\begin{cases}
1 & \text{if } \Pr^\pi(\theta \in \Theta_0 \mid x) \ge a_0/(a_0 + a_1) \\
0 & \text{otherwise}
\end{cases}
\]
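A small sketch of this decision rule, applied to the normal-mean example H0 : θ ≤ 0 with x ∼ N(θ, 1). The conjugate N(0, prior_var) prior and the function names are assumptions made for illustration only.

```python
from statistics import NormalDist

def bayes_test(post_prob_h0, a0, a1):
    """Return 1 (accept H0) when P(theta in Theta0 | x) >= a0 / (a0 + a1), else 0."""
    return 1 if post_prob_h0 >= a0 / (a0 + a1) else 0

def post_prob_nonpositive(x, prior_var):
    """P(theta <= 0 | x) for x ~ N(theta, 1) and theta ~ N(0, prior_var) (assumed prior)."""
    shrink = prior_var / (1.0 + prior_var)
    # conjugate posterior: N(shrink * x, shrink)
    posterior = NormalDist(mu=shrink * x, sigma=shrink ** 0.5)
    return posterior.cdf(0.0)

p = post_prob_nonpositive(x=1.0, prior_var=10.0)
decision = bayes_test(p, a0=1.0, a1=1.0)
```

With equal penalties a0 = a1 the rule reduces to accepting H0 when its posterior probability is at least 1/2.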

A function of posterior probabilities
The method posits two or more alternative hypotheses and tests their relative fits
to some observed statistics — Templeton, Mol. Ecol., 2009
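Definition (Bayes factors)
For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0
\[
B_{01} = \frac{\pi(\Theta_0 \mid x)}{\pi(\Theta_0^c \mid x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)}
= \frac{\int_{\Theta_0} f(x \mid \theta)\,\pi_0(\theta)\,\mathrm{d}\theta}{\int_{\Theta_0^c} f(x \mid \theta)\,\pi_1(\theta)\,\mathrm{d}\theta}
\]
[Good, 1958 & Jeffreys, 1961]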
pseudo-Bayes factors

Self-contained concept
Having a high relative probability does not mean that a hypothesis is true or
supported by the data — Templeton, Mol. Ecol., 2009
Non-decision-theoretic:
eliminates choice of π(Θ0)
Bayesian/marginal equivalent to the likelihood ratio
Jeffreys’ scale of evidence:
if log10(B^π_10) between 0 and 0.5, evidence against H0 weak,
if log10(B^π_10) between 0.5 and 1, evidence substantial,
if log10(B^π_10) between 1 and 2, evidence strong and
if log10(B^π_10) above 2, evidence decisive
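The scale translates directly into a lookup on log10(B10). The sketch below is one possible encoding; the treatment of the boundary values and the label returned when log10(B10) is negative are assumptions, since the scale as stated only covers evidence against H0.

```python
import math

def jeffreys_scale(b10):
    """Map a Bayes factor B10 > 0 to Jeffreys' verbal scale of evidence against H0."""
    log_b = math.log10(b10)
    if log_b < 0:
        return "supports H0"  # hypothetical label, not part of the scale
    if log_b <= 0.5:
        return "weak"
    if log_b <= 1:
        return "substantial"
    if log_b <= 2:
        return "strong"
    return "decisive"

labels = [jeffreys_scale(b) for b in (0.5, 2.0, 5.0, 50.0, 500.0)]
```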

A major modification
Considering whether a location parameter α is 0. The prior is uniform and we
should have to take f(α) = 0 and B10 would always be infinite
When the null hypothesis is supported by a set of measure 0, π(Θ0) = 0
and thus π(Θ0|x) = 0.
[End of the story?!]

Changing the prior to fit the hypotheses
Given that some logical overlap is common when dealing with complex models,
this means that much of the literature is invalid
Templeton, Trends in Ecology and Evolution, 2010
Requirement
Define prior distributions under both assumptions,
π0(θ) ∝ π(θ) I_Θ0(θ), π1(θ) ∝ π(θ) I_Θ1(θ),
[under the standard dominating measures on Θ0 and Θ1], leading to
π(θ) = ρ0 π0(θ) + ρ1 π1(θ),
where ρ0 and ρ1 are the prior weights of Θ0 and Θ1.

Point null hypotheses
I have no patience for statistical methods that assign positive probability to point
hypotheses of the θ = 0 type that can never actually be true
Gelman, BA, 2008
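Take ρ0 = Pr^π(θ = θ0) and g1 prior density under Ha. Then
\[
\pi(\Theta_0 \mid x) = \frac{f(x \mid \theta_0)\,\rho_0}{\int f(x \mid \theta)\,\pi(\theta)\,\mathrm{d}\theta}
= \frac{f(x \mid \theta_0)\,\rho_0}{f(x \mid \theta_0)\,\rho_0 + (1 - \rho_0)\,m_1(x)}
\]
and Bayes factor
\[
B^\pi_{01}(x) = \frac{f(x \mid \theta_0)\,\rho_0}{m_1(x)\,(1 - \rho_0)} \bigg/ \frac{\rho_0}{1 - \rho_0}
= \frac{f(x \mid \theta_0)}{m_1(x)}
\]
where m1(x) = ∫ f(x|θ) g1(θ) dθ is the marginal density of x under Ha.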

Point null hypotheses (cont’d)
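Test of H0 : θ = 0 when x ∼ N(θ, 1): we take π1 as N(0, τ²). Writing σ² for
the sampling variance (σ² = 1 here),
\[
\frac{m_1(x)}{f(x \mid 0)} = \sqrt{\frac{\sigma^2}{\sigma^2 + \tau^2}}\,
\exp\left\{ \frac{\tau^2 x^2}{2\sigma^2(\sigma^2 + \tau^2)} \right\}
\]
and the posterior probability of H0 (for ρ0 = 1/2) is
τ² \ x    0       0.68    1.28    1.96
1         0.586   0.557   0.484   0.351
10        0.768   0.729   0.612   0.366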
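The table entries can be recomputed from the ratio m1(x)/f(x|0). The sketch below assumes equal prior weights (ρ0 = 1/2) and reads the two table rows as prior variances τ² = 1 and τ² = 10, which is the setting that reproduces the tabulated values; the function name is hypothetical.

```python
import math

def posterior_prob_null(x, tau2, rho0=0.5, sigma2=1.0):
    """P(theta = 0 | x) for x ~ N(theta, sigma2) with theta ~ N(0, tau2) under Ha."""
    # m1(x) / f(x | 0), i.e. the Bayes factor B10
    b10 = math.sqrt(sigma2 / (sigma2 + tau2)) * math.exp(
        tau2 * x ** 2 / (2.0 * sigma2 * (sigma2 + tau2))
    )
    return 1.0 / (1.0 + (1.0 - rho0) / rho0 * b10)

table = {
    tau2: [posterior_prob_null(x, tau2) for x in (0.0, 0.68, 1.28, 1.96)]
    for tau2 in (1.0, 10.0)
}
```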

Comparison with classical tests
The 95 percent frequentist intervals will live up to their advertised coverage
claims — Wasserman, BA, 2008
Standard/classical answer
Definition (p-value)
The p-value p(x) associated with a test is the smallest significance level for
which H0 is rejected

Problems with p-values
The use of P implies that a hypothesis that may be true may be rejected because
it had not predicted observable results that have not occurred
Evaluation of the wrong quantity, namely the probability to exceed
the observed quantity (wrong conditioning)
Evaluation only under the null hypothesis
Huge numerical difference with the Bayesian range of answers

Bayesian lower bounds
If the Bayes estimator has good frequency behavior
then we might as well use the frequentist method.
If it has bad frequency behavior then we shouldn’t use it.
Wasserman, BA, 2008
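Least favourable Bayesian answer is
\[
B(x, G_A) = \inf_{g \in G_A} \frac{f(x \mid \theta_0)}{\int_\Theta f(x \mid \theta)\,g(\theta)\,\mathrm{d}\theta},
\]
i.e., if there exists an MLE for θ, \hat\theta(x),
\[
B(x, G_A) = \frac{f(x \mid \theta_0)}{f(x \mid \hat\theta(x))}
\]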

Illustration
Example (Normal case)
When x ∼ N(θ, 1) and H0 : θ0 = 0, the lower bounds are
\[
B(x, G_A) = e^{-x^2/2} \quad\text{and}\quad P(x, G_A) = \left(1 + e^{x^2/2}\right)^{-1},
\]
i.e.
p-value   0.10    0.05    0.01    0.001
P         0.205   0.128   0.035   0.004
B         0.256   0.146   0.036   0.004
[Quite different!]
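The comparison between p-values and the lower bounds can be recomputed as follows. The sketch assumes a two-sided p-value, so that |x| is the (1 − p/2) standard normal quantile, and equal prior weights on H0 and Ha; both are assumptions about how the table was built.

```python
import math
from statistics import NormalDist

def lower_bounds(p_value):
    """Lower bounds (P, B) on the posterior probability and Bayes factor of H0: theta = 0."""
    x = NormalDist().inv_cdf(1.0 - p_value / 2.0)  # two-sided p-value -> |x|
    bayes_factor = math.exp(-x ** 2 / 2.0)
    posterior = 1.0 / (1.0 + math.exp(x ** 2 / 2.0))
    return posterior, bayes_factor

rows = {p: lower_bounds(p) for p in (0.10, 0.05, 0.01, 0.001)}
```

Even the least favourable Bayesian answers stay well above the corresponding p-values, which is the point of the illustration.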

Model choice and model comparison
There is no null hypothesis, which complicates the computation of sampling error
Templeton, Mol. Ecol., 2009
Choice among models:
Several models available for the same observation(s)
Mi : x ∼ fi(x|θi), i ∈ I
where I can be finite or infinite

Bayesian resolution
The posterior probabilities are constructed by using a numerator that is a function
of the observation for a particular model, then divided by a denominator that
ensures that the "probabilities" sum to one.
Templeton, Mol. Ecol., 2009
Probabilise the entire model/parameter space
allocate probabilities p_i to all models M_i
define priors π_i(θ_i) for each parameter space Θ_i
compute
\[
\pi(M_i \mid x) = \frac{p_i \int_{\Theta_i} f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,\mathrm{d}\theta_i}
{\sum_j p_j \int_{\Theta_j} f_j(x \mid \theta_j)\,\pi_j(\theta_j)\,\mathrm{d}\theta_j}
\]
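Once the marginal likelihoods m_i(x) = ∫ f_i(x|θ_i) π_i(θ_i) dθ_i are available, the posterior model probabilities are a simple normalisation. The sketch below works on the log scale for numerical stability; the log-evidence values and the function name are made up for illustration.

```python
import math

def posterior_model_probs(log_evidences, prior_probs):
    """pi(M_i | x) from log marginal likelihoods log m_i(x) and prior weights p_i."""
    log_terms = [math.log(p) + le for p, le in zip(prior_probs, log_evidences)]
    shift = max(log_terms)  # log-sum-exp shift avoids underflow
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# hypothetical log-evidences for three models under equal prior weights
probs = posterior_model_probs([-1050.2, -1048.7, -1055.0], [1 / 3, 1 / 3, 1 / 3])
```

With equal prior weights the ratio of two posterior probabilities is exactly the Bayes factor between the corresponding models.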

Bayesian resolution(2)
The numerators are not co-measurable across hypotheses, and the denominators
are sums of non-co-measurable entities. This means that it is mathematically
impossible for them to be probabilities — Templeton, Mol. Ecol., 2009
take largest π(Mi|x) to determine “best” model,
or use averaged predictive
\[
\sum_j \pi(M_j \mid x) \int_{\Theta_j} f_j(x' \mid \theta_j)\,\pi_j(\theta_j \mid x)\,\mathrm{d}\theta_j
\]

Natural Occam’s razor
Pluralitas non est ponenda sine necessitate
Variation is random until the contrary
is shown; and new parameters in laws,
when they are suggested, must be
tested one at a time, unless there is
specific reason to the contrary.
The Bayesian approach naturally weights models with different parameter
dimensions differently (BIC being an approximate log-Bayes factor).
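The BIC connection can be sketched with the usual Schwarz criterion, BIC = −2 log L(θ̂) + k log n, under which log B12 is approximated by −(BIC1 − BIC2)/2. The numerical inputs and function names below are hypothetical.

```python
import math

def bic(max_log_lik, n_params, n_obs):
    """Schwarz criterion: -2 log L(theta_hat) + k log n."""
    return -2.0 * max_log_lik + n_params * math.log(n_obs)

def approx_log_bayes_factor(bic_1, bic_2):
    """Approximate log B12 by -(BIC_1 - BIC_2) / 2."""
    return -(bic_1 - bic_2) / 2.0

# model 2 fits slightly better but uses one more parameter (made-up numbers)
log_b12 = approx_log_bayes_factor(bic(-120.0, 2, 100), bic(-119.5, 3, 100))
```

Here the extra parameter costs log(100)/2 on the log-Bayes-factor scale, more than the 0.5 gain in log-likelihood, so the smaller model is favoured.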

A fundamental difficulty
1) ABC can and does produce results that are mathematically impossible;
2) the “posterior probabilities” of ABC cannot possibly be true probability
measures;
and 3) ABC is statistically incoherent.
Improper priors are NOT allowed here
If
\[
\int_{\Theta_1} \pi_1(\mathrm{d}\theta_1) = \infty \quad\text{or}\quad \int_{\Theta_2} \pi_2(\mathrm{d}\theta_2) = \infty
\]
then either π1 or π2 cannot be coherently normalised, but the
normalisation matters in the Bayes factor

Normal illustration
Take x ∼ N(θ, 1) and H0 : θ = 0
Impact of the constant
x            0.0      1.0      1.65     1.96      2.58
π(θ) = 1     0.285    0.195    0.089    0.055     0.014
π(θ) = 10    0.0384   0.0236   0.0101   0.00581   0.00143
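The dependence on the arbitrary constant can be reproduced with a few lines. The sketch assumes the tabulated quantity is the posterior probability of H0 with prior weight ρ0 = 1/2 and an improper prior π(θ) = c under the alternative, so that m1(x) = c; this setting matches most table entries, although the x = 1.65 entry for π(θ) = 1 comes out near 0.093 under it.

```python
import math

def post_prob_null_flat(x, c, rho0=0.5):
    """P(theta = 0 | x) for x ~ N(theta, 1) when pi_1(theta) = c under the alternative."""
    f0 = math.exp(-x ** 2 / 2.0) / math.sqrt(2.0 * math.pi)  # f(x | 0)
    m1 = c  # integral of f(x | theta) * c over theta equals c
    return rho0 * f0 / (rho0 * f0 + (1.0 - rho0) * m1)

rows = {c: [post_prob_null_flat(x, c) for x in (0.0, 1.0, 1.96, 2.58)] for c in (1.0, 10.0)}
```

Multiplying the improper prior by 10 changes every answer by roughly a factor of 10, although both priors carry the same "information".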

Vague proper priors are NOT the solution
Taking a proper prior with a “very large” variance (e.g., BUGS) will
most often result in an undefined or ill-defined limit
Example (Lindley’s paradox)
If testing H0 : θ = 0 when observing x ∼ N(θ, 1), under a normal N(0, α)
prior π1(θ),
\[
B_{01}(x) \longrightarrow +\infty \quad (\alpha \to \infty),
\]
i.e. B10(x) → 0: the marginal m1(x) vanishes as the prior spreads out, so
H0 ends up favoured whatever the value of x.
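The limiting behaviour can be checked numerically. With B01 = f(x|0)/m1(x) and m1 the N(0, 1 + α) density, the closed form is B01 = (1 + α)^{1/2} exp{−α x²/(2(1 + α))}, which grows without bound in α, so that B10 = 1/B01 goes to 0. The value x = 1.96 is an arbitrary illustrative choice.

```python
import math

def b01_normal(x, alpha):
    """B01 for H0: theta = 0, x ~ N(theta, 1), theta ~ N(0, alpha) under the alternative."""
    return math.sqrt(1.0 + alpha) * math.exp(-alpha * x ** 2 / (2.0 * (1.0 + alpha)))

# an observation "significant at 5%" ends up supporting H0 as alpha grows
values = [b01_normal(1.96, alpha) for alpha in (1.0, 1e2, 1e4, 1e6)]
```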

Learning from the sample
It is possible for data to discriminate among a set of hypotheses without saying
anything about a proposition that is common to all the alternatives considered.
Sober, Evidence and Evolution, 2008
Definition (Learning sample)
Given an improper prior π, (x1, . . . , xn) is a learning sample if
π(·|x1, . . . , xn) is proper, and a minimal learning sample if none of its
subsamples is a learning sample

There is just enough information in a minimal learning sample to make
inference about θ under the prior π

Idea
Use a first part x[i] of the data x to make the prior proper:

Motivation
Provides a working principle for improper priors
Gather enough information from data to achieve properness
and use this properness to run the test on remaining data
Does not use the data x twice, as in Aitkin’s (1991, 2010)

Fractional Bayes factor
To test a theory, you need to test it against alternatives.
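Idea
use directly the likelihood to separate training sample from testing sample
\[
B^F_{12} = B_{12}(x) \times \frac{\int L_2^b(\theta_2)\,\pi_2(\theta_2)\,\mathrm{d}\theta_2}{\int L_1^b(\theta_1)\,\pi_1(\theta_1)\,\mathrm{d}\theta_1}
\]
[O’Hagan, 1995]
Proportion b of the sample used to gain properness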

Fractional Bayes factor (cont’d)
\[
B^F_{12} = \frac{1}{\sqrt{b}}\, e^{\,n(b-1)\bar{x}_n^2/2}
\]
corresponds to exact Bayes factor for the prior N(0, (1 − b)/(nb))
If b constant, prior variance goes to 0
If b = 1/n, prior variance stabilises around 1
If b = n^{−α}, α < 1, prior variance goes to 0 too.
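The correspondence with a proper normal prior can be verified numerically. The sketch assumes the setting behind the closed form is an i.i.d. N(θ, 1) sample of size n, with model 1 fixing θ = 0 and model 2 using a flat prior on θ; the slides do not restate this setting here, so it is an assumption, as are the function names and the numerical inputs.

```python
import math

def fractional_bf(xbar, n, b):
    """Closed-form fractional Bayes factor b^{-1/2} exp{n (b - 1) xbar^2 / 2}."""
    return math.exp(n * (b - 1.0) * xbar ** 2 / 2.0) / math.sqrt(b)

def exact_bf(xbar, n, tau2):
    """B12 for theta = 0 against theta ~ N(0, tau2), with x_1..x_n i.i.d. N(theta, 1)."""
    ratio = n * tau2 / (1.0 + n * tau2)
    return math.sqrt(1.0 + n * tau2) * math.exp(-n * xbar ** 2 / 2.0 * ratio)

n, b, xbar = 50, 0.1, 0.3
implied_var = (1.0 - b) / (n * b)  # prior variance implied by the fraction b
frac = fractional_bf(xbar, n, b)
exact = exact_bf(xbar, n, implied_var)
```

Both functions return the same value, which makes the implicit prior behind a given fraction b explicit: here b = 0.1 and n = 50 amount to a N(0, 0.18) prior.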

Compatibility principle
Further complicating dimensionality of test statistics is the fact that the models
are often not nested, and one model may contain parameters that do not have
analogues in the other models and vice versa

Difficulty of simultaneously finding priors on a collection of models
Easier to start from a single prior on a “big” [encompassing] model and to
derive others from a coherence principle
[Dawid & Lauritzen, 2000]
Raw regression output

An illustration for linear regression
In the case M1 and M2 are two nested Gaussian linear regression models
with Zellner’s g-priors and the same variance σ2 ∼ π(σ2):
M1 : y|β1, σ² ∼ N(X1β1, σ²) with
\[
\beta_1 \mid \sigma^2 \sim N\left(s_1,\; \sigma^2 n_1 (X_1^{\mathsf T} X_1)^{-1}\right)
\]
where X1 is a (n × k1) matrix of rank k1 ≤ n
M2 : y|β2, σ² ∼ N(X2β2, σ²) with
\[
\beta_2 \mid \sigma^2 \sim N\left(s_2,\; \sigma^2 n_2 (X_2^{\mathsf T} X_2)^{-1}\right),
\]
where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1)
[© Marin & Robert, Bayesian Core]

Compatible g-priors
I don’t see any role for squared error loss, minimax, or the rest of what is
sometimes called statistical decision theory
Gelman, BA, 2008
Since σ2 is a nuisance parameter, minimize the Kullback-Leibler
divergence between both marginal distributions conditional on σ2:
m1(y|σ2; s1, n1) and m2(y|σ2; s2, n2), with solution
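\[
\beta_2 \mid X_2, \sigma^2 \sim N\left(s_2^*,\; \sigma^2 n_2^* (X_2^{\mathsf T} X_2)^{-1}\right)
\]
with
\[
s_2^* = (X_2^{\mathsf T} X_2)^{-1} X_2^{\mathsf T} X_1 s_1, \qquad n_2^* = n_1
\]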

Symmetrised compatible priors
If those prior probabilities are obscure, the same will be true of the posterior
probabilities
Sober, Evidence and Evolution, 2008
Postulate: Previous principle requires embedded models (or an
encompassing model) and proper priors, while being hard to implement
outside exponential families
We determine prior measures on two models M1 and M2, π1 and π2,
directly by a compatibility principle.

Generalised expected posterior priors
[Perez & Berger, 2000]
EPP Principle
Starting from reference priors π1^N and π2^N, substitute by prior distributions
π1 and π2 that solve the system of integral equations
\[
\pi_1(\theta_1) = \int_{\mathcal X} \pi_1^N(\theta_1 \mid x)\, m_2(x)\,\mathrm{d}x
\]
and
\[
\pi_2(\theta_2) = \int_{\mathcal X} \pi_2^N(\theta_2 \mid x)\, m_1(x)\,\mathrm{d}x,
\]
where x is an imaginary minimal training sample and m1, m2 are the
marginals associated with π1 and π2 respectively.

Motivations
Eliminates the “imaginary observation” device and proper-isation
through part of the data by integration under the “truth”

Assumes that both models are equally valid and equipped with ideal
unknown priors
πi, i = 1, 2,
that yield “true” marginals balancing each model wrt the other
For a given π1, π2 is an expected posterior prior
Using both equations introduces symmetry into the game

Bayesian coherence
Logical overlap is the norm for the complex models analyzed with ABC, so many
ABC posterior model probabilities published to date are wrong.
Templeton, PNAS, 2009
Theorem (True Bayes factor)
If π1 and π2 are the EPPs and if their marginals are finite, then the
corresponding Bayes factor
B1,2(x)
is either a (true) Bayes factor or a limit of (true) Bayes factors.

Obviously only interesting when both π1 and π2 are improper.

Variable selection
Regression setup where y regressed on a set {x1, . . . , xp} of p potential
explanatory regressors (plus intercept)

Corresponding 2^p submodels Mγ, where γ ∈ Γ = {0, 1}^p indicates
inclusion/exclusion of variables by a binary representation,
e.g. γ = 101001011 means that x1, x3, x6, x8 and x9 are included.
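The binary encoding is easy to manipulate in code. The sketch below assumes γ is read left to right with 1-based positions, so that the j-th digit indicates whether xj is included; this indexing convention and the function names are assumptions.

```python
def included_variables(gamma):
    """Return the 1-based indices j with gamma_j = 1, reading the string left to right."""
    return [j for j, bit in enumerate(gamma, start=1) if bit == "1"]

def all_submodels(p):
    """Enumerate the 2**p inclusion vectors as binary strings of length p."""
    return [format(k, "0{}b".format(p)) for k in range(2 ** p)]

example = included_variables("101001011")
models = all_submodels(3)
```

Exhaustive enumeration is only feasible for small p, since the number of submodels doubles with each additional regressor.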

Notations
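For model Mγ,
qγ variables included
t1(γ) = {t_{1,1}(γ), . . . , t_{1,qγ}(γ)} indices of those variables and t0(γ)
indices of the variables not included
For β ∈ R^{p+1},
\[
\beta_{t_1(\gamma)} = \left[\beta_0, \beta_{t_{1,1}(\gamma)}, \ldots, \beta_{t_{1,q_\gamma}(\gamma)}\right],
\qquad
X_{t_1(\gamma)} = \left[\mathbf{1}_n \mid x_{t_{1,1}(\gamma)} \mid \ldots \mid x_{t_{1,q_\gamma}(\gamma)}\right].
\]
Submodel Mγ is thus
\[
y \mid \beta, \gamma, \sigma^2 \sim N\left(X_{t_1(\gamma)}\beta_{t_1(\gamma)},\; \sigma^2 I_n\right)
\]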

Global and compatible priors
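Use Zellner’s g-prior, i.e. a normal prior for β conditional on σ²,
\[
\beta \mid \sigma^2 \sim N\left(\tilde\beta,\; c\sigma^2 (X^{\mathsf T} X)^{-1}\right)
\]
and a Jeffreys prior for σ²,
\[
\pi(\sigma^2) \propto \sigma^{-2}
\]
Resulting compatible prior
\[
\beta_{t_1(\gamma)} \sim N\left( \left(X_{t_1(\gamma)}^{\mathsf T} X_{t_1(\gamma)}\right)^{-1} X_{t_1(\gamma)}^{\mathsf T} X \tilde\beta,\;
c\sigma^2 \left(X_{t_1(\gamma)}^{\mathsf T} X_{t_1(\gamma)}\right)^{-1} \right)
\]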

Posterior model probability
Can be obtained in closed form:
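\[
\pi(\gamma \mid y) \propto (c+1)^{-(q_\gamma+1)/2}
\left[ y^{\mathsf T} y - \frac{c\, y^{\mathsf T} P_1 y}{c+1}
+ \frac{\tilde\beta^{\mathsf T} X^{\mathsf T} P_1 X \tilde\beta}{c+1}
- \frac{2\, y^{\mathsf T} P_1 X \tilde\beta}{c+1} \right]^{-n/2}.
\]
Conditionally on γ, posterior distributions of β and σ²:
\[
\beta_{t_1(\gamma)} \mid \sigma^2, y, \gamma \sim N\left[ \frac{c}{c+1}\left(U_1 y + U_1 X \tilde\beta / c\right),\;
\frac{\sigma^2 c}{c+1} \left(X_{t_1(\gamma)}^{\mathsf T} X_{t_1(\gamma)}\right)^{-1} \right],
\]
\[
\sigma^2 \mid y, \gamma \sim IG\left[ \frac{n}{2},\;
\frac{y^{\mathsf T} y}{2} - \frac{c\, y^{\mathsf T} P_1 y}{2(c+1)}
+ \frac{\tilde\beta^{\mathsf T} X^{\mathsf T} P_1 X \tilde\beta}{2(c+1)}
- \frac{y^{\mathsf T} P_1 X \tilde\beta}{c+1} \right].
\]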
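As a worked special case (an illustration rather than part of the original slides), setting β̃ = 0_{p+1} removes the two terms involving β̃ from the closed-form posterior model probability. The expression below assumes that P1 denotes the orthogonal projector onto the column span of X_{t1(γ)}, which the slides use without defining.

```latex
\[
\pi(\gamma \mid y) \;\propto\; (c+1)^{-(q_\gamma+1)/2}
\left[\, y^{\mathsf T} y - \frac{c}{c+1}\, y^{\mathsf T} P_1 y \,\right]^{-n/2},
\qquad
P_1 = X_{t_1(\gamma)}\bigl(X_{t_1(\gamma)}^{\mathsf T} X_{t_1(\gamma)}\bigr)^{-1} X_{t_1(\gamma)}^{\mathsf T}.
\]
```

The first factor penalises each additional regressor by (c + 1)^{-1/2}, while the bracket decreases as the fitted part yᵀP1y grows, so the two terms trade fit against dimension.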

Noninformative case
Use the same compatible informative g-prior distribution with β̃ = 0_{p+1}
and a hierarchical diffuse prior distribution on c,
\[
\pi(c) \propto c^{-1}\, \mathbb{I}_{\mathbb{N}^*}(c) \quad\text{or}\quad \pi(c) \propto c^{-1}\, \mathbb{I}_{c>0}
\]
The choice of this hierarchical diffuse prior distribution on c is due to the
model posterior sensitivity to large values of c:
Taking β̃ = 0_{p+1} and c large does not work

Processionary caterpillar
Influence of some forest settlement characteristics on the development of
caterpillar colonies
Response y log-transform of the average number of nests of caterpillars
per tree on an area of 500 square meters (n = 33 areas)
[© Marin & Robert, Bayesian Core]

Processionary caterpillar (cont’d)
Potential explanatory variables
x1 altitude (in meters), x2 slope (in degrees),
x3 number of pines in the square,
x4 height (in meters) of the tree at the center of the square,
x5 diameter of the tree at the center of the square,
x6 index of the settlement density,
x7 orientation of the square (from 1 if southbound to 2 otherwise),
x8 height (in meters) of the dominant tree,
x9 number of vegetation strata,
x10 mix settlement index (from 1 if not mixed to 2 if mixed).

Bayesian regression output
              Estimate   BF        log10(BF)
(Intercept)    9.2714    26.334     1.4205 (***)
X1            -0.0037     7.0839    0.8502 (**)
X2            -0.0454     3.6850    0.5664 (**)
X3             0.0573     0.4356   -0.3609
X4            -1.0905     2.8314    0.4520 (*)
X5             0.1953     2.5157    0.4007 (*)
X6            -0.3008     0.3621   -0.4412
X7            -0.2002     0.3627   -0.4404
X8             0.1526     0.4589   -0.3383
X9            -1.0835     0.9069   -0.0424
X10           -0.3651     0.4132   -0.3838
evidence against H0: (****) decisive, (***) strong, (**) substantial,
(*) poor

Bayesian variable selection
t1(γ)            π(γ|y, X)
0,1,2,4,5        0.0929
0,1,2,4,5,9      0.0325
0,1,2,4,5,10     0.0295
0,1,2,4,5,7      0.0231
0,1,2,4,5,8      0.0228
0,1,2,4,5,6      0.0228
0,1,2,3,4,5      0.0224
0,1,2,3,4,5,9    0.0167
0,1,2,4,5,6,9    0.0167
0,1,2,4,5,8,9    0.0137
Noninformative G-prior model choice

Fringe alternatives
1 Introduction
2 Tests and model choice
3 Incoherent inferences
Templeton’s debate
Bayes/likelihood fusion

A revealing confusion
In statistics, coherent measures of fit of nested and overlapping composite
hypotheses are technically those measures that are consistent with the constraints
of formal logic. For example, the probability of the nested special case must be
less than or equal to the probability of the general model within which the special
case is nested. Any statistic that assigns greater probability to the special case is
said to be incoherent.

ABC algorithm
Instead of evaluating hypotheses in terms of how probable they say the data are,
we evaluate them by estimating how accurately they’ll predict new data when
fitted to old
Sober, Evidence and Evolution, 2008
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
  repeat
    generate θ′ from the prior distribution π(·)
    generate z from the likelihood f(·|θ′)
  until ρ{η(z), η(y)} ≤ ε
  set θi = θ′
end for
where η(y) defines a (not necessarily sufficient) statistic
[Pritchard et al., 1999]
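A runnable sketch of the likelihood-free rejection sampler on a toy problem. The N(θ, 1) model, the N(0, 3²) prior, the sample mean as summary statistic η, the absolute difference as distance ρ, the tolerance and the data are all illustrative assumptions, not choices made in the slides.

```python
import random
import statistics

def abc_rejection(y, n_accept, eps, prior_sd=3.0, seed=0):
    """Likelihood-free rejection sampler for the mean of a N(theta, 1) sample."""
    rng = random.Random(seed)
    eta_y = statistics.fmean(y)  # summary statistic eta(y)
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.gauss(0.0, prior_sd)  # theta' drawn from the prior
        z = [rng.gauss(theta, 1.0) for _ in y]  # pseudo-data z drawn from f(. | theta')
        if abs(statistics.fmean(z) - eta_y) <= eps:  # rho{eta(z), eta(y)} <= eps
            accepted.append(theta)
    return accepted

y_obs = [0.8, 1.3, 0.4, 1.9, 1.1, 0.7, 1.5, 0.9]  # made-up observations
sample = abc_rejection(y_obs, n_accept=200, eps=0.2)
abc_mean = statistics.fmean(sample)
```

The accepted θ values are draws from the ABC approximation to the posterior; shrinking eps improves the approximation at the cost of a lower acceptance rate.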

ABC output
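The likelihood-free algorithm samples from the marginal in z of:
\[
\pi_\epsilon(\theta, z \mid y) = \frac{\pi(\theta)\, f(z \mid \theta)\, \mathbb{I}_{A_{\epsilon,y}}(z)}
{\int_{A_{\epsilon,y} \times \Theta} \pi(\theta)\, f(z \mid \theta)\,\mathrm{d}z\,\mathrm{d}\theta},
\]
where A_{ε,y} = {z ∈ D | ρ(η(z), η(y)) < ε}.
The idea behind ABC is that the summary statistics coupled with a small
tolerance should provide a good approximation of the posterior
distribution:
\[
\pi_\epsilon(\theta \mid y) = \int \pi_\epsilon(\theta, z \mid y)\,\mathrm{d}z \approx \pi(\theta \mid y).
\]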

The "Great ABC controversy"
Ongoing controversy in phylogeographic genetics about the validity of
using ABC for testing
Against: Templeton, 2008, 2009,
2010a, 2010b, 2010c argues that
nested hypotheses cannot have
higher probabilities than nesting
hypotheses (!)

The probability of the nested special
case must be less than or equal to
the probability of the general model
within which the special case is
nested. Any statistic that assigns
greater probability to the special case
is incoherent. An example of
incoherence is shown for the ABC
method.

Incoherent methods, such as ABC,
Bayes factor, or any simulation
approach that treats all hypotheses
as mutually exclusive, should never
be used with logically overlapping
hypotheses.

The central equation of ABC
\[
P(H_i \mid H, S^*) = \frac{G_i(\|S_i - S^*\|)\,\Pi_i}{\sum_{j=1}^{n} G_j(\|S_j - S^*\|)\,\Pi_j}
\]
is inherently incoherent. This
fundamental equation is
mathematically incorrect in every
instance of overlap.

Replies: Fagundes et al., 2008,
Beaumont et al., 2010, Berger et al.,
2010, Csilléry et al., 2010 point out
that the criticisms are addressed at
[Bayesian] model-based inference and
have nothing to do with ABC...

ABC is a statistically valid approach,
alongside other computational
statistical techniques that have been
successfully used to infer parameters
and compare models in population
genetics.
Beaumont et al., Molec. Ecology,
2010

The confusion seems to arise from
misunderstanding the difference
between scientific hypotheses and
their mathematical representation.
Bayes’ theorem shows that the
simpler model can indeed have a
much higher posterior probability.
Berger et al., PNAS, 2010

Aitkin’s alternative
Without a specific alternative, the best we can do is to
make posterior probability statements about µ and transfer
these to the posterior distribution of the likelihood ratio.
Aitkin, Statistical Inference, 2010

Proposal to examine the posterior distribution of the likelihood function:
compare models via the “posterior distribution” of the likelihood ratio
\[
L_1(\theta_1 \mid x) \big/ L_2(\theta_2 \mid x),
\]
with θ1 ∼ π1(θ1|x) and θ2 ∼ π2(θ2|x).

Using the data “twice”
A persistent criticism of the posterior likelihood approach has been based
on the claim that these approaches are ‘using the data twice’, or are
‘violating temporal coherence’ — Aitkin, Statistical Inference, 2010
Complete separation between both models due to simulation under
product of the posterior distributions, i.e. replaces standard Bayesian
inference under joint posterior of (θ1, θ2),
p1m1(x)π1(θ1|x)π2(θ2) + p2m2(x)π2(θ2|x)π1(θ1)
by product of both posteriors

Illustration
Comparison of a Poisson model against a negative binomial with m = 5
successes, when x = 3:

Pros ...
This quite small change to standard Bayesian analysis allows a very
general approach to a wide range of apparently different inference
problems; a particular advantage of the approach is that it can use the
same noninformative priors — Aitkin, Statistical Inference, 2010
the approach is general and makes it possible to resolve the difficulties with the
Bayesian processing of point null hypotheses;
the approach allows for the use of generic noninformative and
improper priors;
the approach handles more naturally the “vexed question of model
fit”;
the approach is “simple”.

... & cons
The p-value is equal to the posterior probability that the likelihood ratio,
for null hypothesis to alternative, is greater than 1 (...) The posterior
probability is p that the posterior probability of H0 is greater than 0.5.
the approach is not Bayesian (product of the posteriors)
the approach uses indeterminate entities (“posterior probability that
the posterior probability is larger than 0.5”...)
the approach tries to get as close as possible to the p-value
