Reliable ABC model choice via random forests 
Pierre Pudlo, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gauthier and Christian P. Robert
Universite de Montpellier 2, I3M, Montpellier, France; Institut de Biologie Computationnelle (IBC), Montpellier, France; CBGP, INRA, Montpellier, France; Universite Paris Dauphine, CEREMADE, Paris, France; and University of Warwick, Coventry, UK
Author contributions: PP, JMM, AE and CPR designed and performed research; PP, JMM, AE, JMC and MG analysed data; and PP, JMM, AE and CPR wrote the paper.
Submitted to Proceedings of the National Academy of Sciences of the United States of America 
Approximate Bayesian computation (ABC) methods provide an elaborate approach to Bayesian inference on complex models, including model choice. Both theoretical arguments and simulation experiments indicate, however, that model posterior probabilities are poorly evaluated by ABC. We propose a novel approach based on a machine learning tool named random forests to conduct selection among the highly complex models covered by ABC algorithms. We strongly shift the way Bayesian model selection is both understood and operated, since we replace the evidential use of model posterior probabilities by predicting the model that best fits the data with random forests and computing an associated posterior error rate. Compared with past implementations of ABC model choice, the ABC random forest approach offers several improvements: (i) it has a larger discriminative power among the competing models, (ii) it is robust to the number and choice of statistics summarizing the data, (iii) the computing effort is drastically reduced (with a minimum gain in computation efficiency around a factor of about fifty), and (iv) it includes an embedded and cost-free error evaluation conditional on the actual analyzed dataset. Random forests will undoubtedly extend the range of size of datasets and complexity of models that ABC can handle. We illustrate the power of the ABC random forest methodology by analyzing controlled experiments as well as real population genetics datasets.
Approximate Bayesian computation | model selection | summary statistics | k-nearest neighbors | likelihood-free methods | random forests | posterior predictive | error rate | Harlequin ladybird | Bayesian model choice
Abbreviations: ABC, approximate Bayesian computation; RF, random forest; LDA, linear discriminant analysis; MAP, maximum a posteriori; nn, nearest neighbors; CART, classification and regression tree; SNP, single nucleotide polymorphism
Since its introduction (1, 2, 3), the approximate Bayesian computation (ABC) method has found an ever increasing range of applications covering diverse types of complex models (see, e.g., 4, 5, 6, 7). The principle of ABC is to conduct Bayesian inference on a dataset through comparisons with numerous simulated datasets. However, it suffers from two major difficulties. First, to ensure reliability of the method, the number of simulations must be large; hence, it proves difficult to apply ABC to large datasets (e.g., in population genomics, where ten to a hundred thousand markers are commonly genotyped). Second, calibration has always been a critical step in ABC implementation (8, 9). More specifically, the major feature in this calibration process involves selecting a vector of summary statistics that quantifies the difference between the observed data and the simulated data. The construction of this vector is therefore paramount, and examples abound of poor performances of ABC algorithms related to specific choices of those statistics. In particular, in the setting of ABC model choice, the summaries play a crucial role in providing consistent or inconsistent inference (10, 11, 12).
We advocate here a drastic modification of the way ABC model selection is conducted: we propose to both step away from a mere mimicking of exact Bayesian solutions like posterior probabilities, and reconsider the very problem of constructing efficient summary statistics. First, given an arbitrary pool of available statistics, we now completely bypass the selection of a subset of those. This new perspective directly proceeds from machine learning methodology. Second, we also entirely bypass the ABC estimation of model posterior probabilities, as we deem the numerical ABC approximations of such probabilities fundamentally untrustworthy, even though the approximations can preserve the proper ordering of the compared models. Having abandoned approximations of posterior probabilities, we implement the crucial shift to using posterior error rates for model selection, towards assessing the reliability of the selection made by the classifier. The statistical technique of random forests (RF) (13) represents a trustworthy machine learning tool well adapted to complex settings as is typical for ABC treatments, and which allows an efficient computation of posterior error rates. We show here how RF improves upon existing classification methods in significantly reducing both the classification error and the computational expense.
Model choice 
Bayesian model choice (14, 15) compares the fit of M models to an observed dataset x0. It relies on a hierarchical modelling, setting first prior probabilities π(m) on the model indices m ∈ {1, ..., M} and then prior distributions π(θ|m) on the parameter θ of each model, characterized by a likelihood function f(x|m, θ). Inferences and decisions are based on the posterior probabilities of each model, π(m|x0).
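For reference, with the notation above the model posterior probability combines the prior model weight with the marginal likelihood (evidence) of the data under each model; a standard way of writing it, consistent with the hierarchical modelling just described, is

\pi(m \mid x_0) \;=\; \frac{\pi(m)\,\int f(x_0 \mid m,\theta)\,\pi(\theta \mid m)\,d\theta}{\sum_{m'=1}^{M} \pi(m')\,\int f(x_0 \mid m',\theta)\,\pi(\theta \mid m')\,d\theta}.

The integrals in the numerator and denominator are typically intractable for the complex models targeted by ABC, which is what motivates the simulation-based approximations discussed next.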
ABC algorithms for model choice. To approximate the posterior probabilities of competing models, ABC methods (16) compare the observed data with a massive collection of pseudo-data generated from the prior; the comparison proceeds via a normalized Euclidean distance on a vector of statistics S(x) computed for both observed and simulated data. Standard ABC estimates the posterior probabilities π(m|x0) at stage (B) of Algorithm 1 below as the frequencies of those models within the k nearest-to-x0 simulations, proximity being defined by the distance between s0 = S(x0) and the simulated S(x)'s.
Selecting a model means choosing the model with the highest frequency in the sample of size k produced by ABC, such frequencies being approximations to the posterior probabilities of the models. We stress that this solution amounts to a k-nearest neighbor (k-nn) estimate of those probabilities, for a set of simulations drawn at stage (A), whose records constitute the so-called reference table. In fact, this interpretation provides a useful path to convergence properties of ABC parameter estimators (17) and properties of summary statistics to compare hidden Markov random fields (18).
Algorithm 1 General ABC algorithm
(A) Generate Nref simulations (m, θ, S(x)) from the joint π(m) π(θ|m) f(x|m, θ).
(B) Learn from this set to infer about m or θ at s0 = S(x0).
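As a concrete illustration of Algorithm 1 and of the standard k-nn estimate described above, here is a minimal Python sketch. The simulator and prior-sampler interfaces (simulators, priors) and the uniform model prior are assumptions made for the example, not specifications taken from the paper.

```python
import numpy as np

def abc_reference_table(simulators, priors, n_ref, rng):
    """Stage (A): draw (m, theta, S(x)) from pi(m) pi(theta|m) f(x|m, theta).
    `priors[m](rng)` draws theta from pi(theta|m); `simulators[m](theta, rng)`
    returns the summary vector S(x) of a dataset simulated under model m."""
    M = len(simulators)
    models = rng.integers(0, M, size=n_ref)   # uniform prior pi(m), an assumption
    thetas = [priors[m](rng) for m in models]
    stats = np.array([simulators[m](t, rng) for m, t in zip(models, thetas)])
    return models, thetas, stats

def knn_model_frequencies(models, stats, s0, k=500):
    """Stage (B), standard ABC model choice: frequencies of the model indices
    among the k simulations whose summaries are closest to s0, using a
    normalized Euclidean distance."""
    scale = stats.std(axis=0)
    scale[scale == 0] = 1.0                   # guard against constant statistics
    dist = np.linalg.norm((stats - s0) / scale, axis=1)
    nearest = models[np.argsort(dist)[:k]]
    return np.bincount(nearest, minlength=int(models.max()) + 1) / k
```

Here rng is a numpy random Generator, e.g. np.random.default_rng(1).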
A major calibration issue with ABC is the selection of the summary statistics S(x). When considering the specific goal of model selection, the ABC approximation to the posterior probabilities will eventually produce a right ordering of the fit of competing models to the observed data, and thus will select the right model, for a specific class of statistics when the information carried by the data becomes important (12). The state-of-the-art evaluation of ABC model choice is thus that some statistics produce nonsensical decisions and that there exist sufficient conditions for statistics to produce consistent model prediction, albeit at the cost of an information loss due to summaries that may be substantial. The toy example comparing MA(1) and MA(2) models in SI and Fig. 1 clearly exhibits this potential loss.
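The text does not spell out which summaries enter the MA(1) versus MA(2) toy example, so the following sketch should be read as an illustration only: it simulates data from a moving-average model and computes the first sample autocorrelations, a natural (assumed) choice of low-dimensional summaries for this comparison.

```python
import numpy as np

def simulate_ma(theta, n, rng):
    """Simulate n observations from an MA(q) model x_t = e_t + sum_j theta[j-1] * e_{t-j}."""
    q = len(theta)
    e = rng.standard_normal(n + q)
    return np.array([e[t + q] + np.dot(theta, e[t + q - np.arange(1, q + 1)])
                     for t in range(n)])

def acf_summaries(x, lags=(1, 2)):
    """First sample autocorrelations, used here as candidate summary statistics."""
    x = x - x.mean()
    denom = float(np.dot(x, x))
    return np.array([np.dot(x[l:], x[:-l]) / denom for l in lags])
```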
It may seem tempting to collect the largest possible number of summary statistics to capture more information from the data. However, ABC algorithms, like k-nn and other local methods, suffer from the curse of dimensionality (see, e.g., Section 2.5 in 19) and yield poor results when the number of statistics is large. Selecting summary statistics is therefore paramount, as shown by the literature of recent years. (See (9) for a survey focused on ABC parameter estimation.) Excursions into machine learning are currently limited, being mostly a dimension reduction device that preserves the recourse to k-nn methods. See, e.g., the call to boosting in (20) for selecting statistics in problems pertaining to parameter estimation (21). For model choice, two projection techniques have been considered. First, (22) show that the Bayes factor itself is an acceptable summary (of dimension one) when comparing two models, but its practical evaluation via a pilot ABC simulation induces a poor approximation of model evidences (10, 11). Second, the recourse to a regression layer like linear discriminant analysis (LDA) (23) is discussed below and in SI (Classification method section). Given the fundamental difficulty in producing reliable tools for model choice based on summary statistics (11), we now propose to switch to a better adapted machine learning approach based on random forest (RF) classifiers.
ABC model choice via random forests. SI provides a review of classification methods.
The so-called Bayesian classifier, based on the maximum a posteriori (MAP) model, minimizes the 0-1 error (24). However, estimating the posterior probabilities has a major impact on the performances of the classifier, due to the substitution of a classification exercise by a more difficult regression problem (24). This difficulty drives us to a paradigm shift, namely to give up the attempt at both estimating posterior probabilities by ABC and selecting summary statistics. Instead, our version of stage (B) in Algorithm 1 relies on a classifier that can handle an arbitrary number of statistics and extract the maximal information from the reference table obtained at stage (A). For this purpose, we resort to random forest (RF) classifiers (13) and call the resulting algorithm ABC-RF.
Refraining here from a detailed entry to RF algorithms (see SI for such details), we recall that the technique stems from the bagging algorithms of (25), applying to both classification and regression. RF grows many overfitted decision trees trained with a randomized CART (classification and regression tree, see 26) algorithm on bootstrap sub-samples from the ABC reference table: it takes advantage of the weak dependency of these almost unbiased trees to reduce variance by aggregating the tree classifiers towards a majority-rule decision. The justification for choosing RF to conduct ABC model selection is that, both formally and experimentally, RF classification was shown to be mostly insensitive both to strong correlations between predictors and to the presence of noisy variables, even in relatively large numbers (19, Chapter 5), a characteristic that k-nn classifiers miss. For instance, consistency for a simplified RF procedure is such that the rate of convergence only depends on the intrinsic dimension of the problem (27). Consistency of the original algorithm was also proven for additive regression models (28), demonstrating that RF can apprehend large dimensions.
Such a robustness justifies adopting an RF strategy to learn from an ABC reference table towards Bayesian model selection. Within an arbitrary collection of summary statistics, some may exhibit strong correlations and others be uninformative about the model index, but this does not jeopardize the RF performances. For model selection, RF is thus in competition with the two local classifiers commonly implemented within ABC and mimicking exact Bayesian solutions. It is arguably superior to local logistic regression, as implemented in the DIYABC software (29); the latter includes a linear model layer within the k-nn selection (30), but suffers from the curse of dimensionality, which forces a selection among statistics, and is extremely costly; see, e.g., how (23) reduces the dimension using a linear discriminant projection before resorting to local logistic regression.
The outcome of RF is a model index, corresponding to the most frequently predicted model index within the aggregated decision trees. This is the model best suited to the observed data. It is worth stressing that there is no direct connection between the frequencies of the model allocations of the data among the tree classifiers and the posterior probabilities of the competing models. In practice, the decision frequencies of the trees happen to show a strong bias towards 0 or 1 and thus produce an unreliable quantitative indicator. We therefore propose to rely on an alternative posterior error estimation to measure the confidence in the model choice produced by RF.
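A minimal sketch of the ABC-RF step, using scikit-learn's RandomForestClassifier on the reference table produced by the earlier sketch; the number of trees and other settings are illustrative assumptions rather than the paper's tuning, and the vote share is reported only to emphasize that it is not a posterior probability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def abc_rf_choice(models, stats, s0, n_trees=500, seed=0):
    """Train a random forest on the ABC reference table (summaries -> model index)
    and return the majority-vote model for the observed summaries s0."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=seed)
    rf.fit(stats, models)
    best = int(rf.predict(s0.reshape(1, -1))[0])
    # With fully grown trees, predict_proba essentially reports the share of trees
    # voting for each model; as stressed in the text, this is NOT a posterior probability.
    share = float(rf.predict_proba(s0.reshape(1, -1))[0][list(rf.classes_).index(best)])
    return best, share, rf
```

The oob_score=True flag keeps the out-of-bag information needed for the error rates discussed in the next section.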
Posterior error rate as confidence report
Machine learning classifiers miss a distinct advantage of posterior probabilities, namely that the latter evaluate a confidence degree in the selected (MAP) model. An alternative to those probabilities is the prior error rate, which provides an indication of the global quality of a given classifier m̂ on the whole feature space. This rate is the expected value of the misclassification error over the hierarchical prior,

\sum_{m} \pi(m) \int \mathbf{1}\{\hat{m}(S(y)) \neq m\}\, f(y \mid \theta, m)\, \pi(\theta \mid m)\, dy\, d\theta,

and it can be evaluated from simulations (θ, m, S(y)) drawn from the prior, independently of the reference table (18), or with the out-of-bag error in RF (19, Chapter 15), a procedure that requires no further simulation (see SI). Machine learning relies on this prior error to calibrate classifiers (e.g., the number k of neighbors of k-nn and local logistic models, or the tuning parameters of RF). But this indicator remains of limited relevance, since the only point of importance in the dataset space is the observed dataset s0 = S(x0).
A first step addressing this issue is to obtain error rates conditional on the data as in (18). However, the statistical methodology available for this purpose suffers from the curse of dimensionality. We thus replace this conditional error with the average of the misclassification loss taken over the posterior predictive distribution, namely

\sum_{m} \pi(m \mid s_0) \int \mathbf{1}\{\hat{m}(S(y)) \neq m\}\, f(y \mid \theta, m)\, \pi(\theta \mid m, s_0)\, dy\, d\theta.   [1]

This solution answers the criticisms of the prior error evaluation, since it weights the misclassification error by the posterior distribution; [1] is called the posterior error rate.
[Fig. 1. Illustration of the discrepancy between true posterior probabilities and their ABC approximations. The aim is to choose between two nested time series models, namely moving averages of order 1 and 2 (denoted MA(1) and MA(2), respectively; see SI for more details). Each point of the plot gives the two posterior probabilities of MA(2) for a dataset simulated either from the first (blue) or the second model (orange). Even though the first two autocovariance statistics are informative for this model choice, the values on the x-axis, equal to the exact posterior probabilities of MA(2), differ substantially from their ABC counterparts on the y-axis. Axes: true posterior probability of MA(2) (x, on [0, 1]) versus ABC posterior probability of MA(2) (y).]

The practical derivation of the posterior error rate is easily conducted via a secondary ABC algorithm, described below (see Algorithm 2; an illustrative code transcription follows the algorithm). This algorithm relies on a natural proximity between s0 and S(y) stemming from the RF, namely the number of times both inputs fall into the same tip of an RF tree. The sample (m, θ, S(y)) of size k × Npp produced in step (c) constitutes an ABC approximation of the posterior predictive distribution. The posterior error rate [1] is then approximated in step (d) by averaging prediction errors over this sample.

Algorithm 2 Computation of the posterior error
(a) Use the trained RF to compute the proximity between each (m, θ, S(x)) of the reference table and s0 = S(x0)
(b) Select the k simulations with the highest proximity to s0
(c) For each (m, θ) in the latter set, compute Npp new simulations S(y) from f(y | θ, m)
(d) Return the frequency of erroneous RF predictions over these k × Npp simulations
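As an illustration (not the authors' actual implementation), the following Python sketch transcribes Algorithm 2. It assumes a trained forest exposing an apply() method that returns per-tree leaf indices (as in scikit-learn's RandomForestClassifier) and a user-supplied function simulate_summaries standing in for draws of S(y) from f(y | θ, m); all names, arguments and defaults are illustrative.

import numpy as np

def posterior_error_rate(rf, ref_stats, ref_models, ref_params,
                         s0, simulate_summaries, k=500, n_pp=20):
    # rf                 : fitted forest exposing apply() (e.g. sklearn RandomForestClassifier)
    # ref_stats          : (N, d) array of summary statistics S(x) of the reference table
    # ref_models         : (N,)   array of model indices m
    # ref_params         : length-N sequence of parameter vectors theta
    # s0                 : (d,)   observed summary statistics S(x0)
    # simulate_summaries : user function (m, theta, n) -> (n, d) array of new S(y)

    # (a) RF proximity: number of trees in which a simulation and s0 share a terminal leaf
    leaves_ref = rf.apply(ref_stats)            # (N, n_trees) leaf indices
    leaves_obs = rf.apply(s0.reshape(1, -1))    # (1, n_trees)
    proximity = (leaves_ref == leaves_obs).sum(axis=1)

    # (b) keep the k simulations with the highest proximity to s0
    nearest = np.argsort(proximity)[-k:]

    # (c)-(d) posterior predictive simulations and frequency of wrong RF predictions
    errors = 0
    for i in nearest:
        new_stats = simulate_summaries(ref_models[i], ref_params[i], n_pp)
        errors += np.sum(rf.predict(new_stats) != ref_models[i])
    return errors / (k * n_pp)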
Illustrations

To illustrate the power of the ABC-RF methodology, we now report several controlled experiments as well as two genuine population genetic examples.

Insights from controlled experiments. The SI details controlled experiments on a toy problem comparing MA(1) and MA(2) time-series models, and two controlled synthetic examples from population genetics based on SNP and microsatellite data. The toy example is particularly revealing of the discrepancy between the posterior probability of a model and the version conditioning on the summary statistics s0. Fig. 1 shows how far from the diagonal the realizations of the pairs (π(m | x0), π(m | s0)) lie, even though the autocorrelation statistic is quite informative (8). Note in particular the vertical accumulation of points near π(m = 2 | x0) = 1. Table S1 demonstrates the further gap in predictive power for the full Bayes solution, with a true error rate of 12% versus the best solution (RF) based on the summaries barely achieving a 17% error rate. For both controlled genetics experiments in the SI, the computation of the true posterior probabilities of the three models is impossible. The predictive performances of the competing classifiers can nonetheless be compared on a test sample. The results, summarized in Tables S2 and S3 of the SI, legitimate our support of RF as the optimal classifier, with gains of several percentage points.

Table 1: Harlequin ladybird data: estimated prior error rates for various classification methods and sizes of reference table.

Classification method                        Prior error rate (%), trained on
                                             Nref = 10,000  Nref = 20,000  Nref = 50,000
linear discriminant analysis (LDA)               39.91          39.30          39.04
standard ABC (k-nn) on DIYABC summaries          57.46          53.76          51.03
standard ABC (k-nn) on LDA axes                  39.18          38.46          37.91
local logistic regression on LDA axes            41.04          37.08          36.05
random forest (RF) on DIYABC summaries           40.18          38.94          37.63
RF on DIYABC summaries and LDA axes              36.86          35.62          34.44

Performances of the classifiers used in stage (B) of Algorithm 1. A set of 10,000 prior simulations was used to calibrate the number of neighbors k in both standard ABC and local logistic regression, and the number of sub-samples Nboot for the trees of RF. Prior error rates were estimated as average misclassification errors on an independent set of 10,000 prior simulations, kept constant over methods and sizes of the reference tables.
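The prior error rates reported in Table 1 are plain misclassification frequencies on an independent set of prior simulations; a minimal sketch of that computation, assuming any classifier with a scikit-learn-style predict method, is:

import numpy as np

def prior_error_rate(clf, test_stats, test_models):
    # average misclassification over an independent set of prior simulations
    return float(np.mean(clf.predict(test_stats) != test_models))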
Those experiments demonstrate in addition that the posterior error rate can vary greatly compared with the average prior rate, hence making a case for its significance (for details, see Section 3 in the SI). A last feature worth mentioning is that, while LDA alone does not perform uniformly well over all examples, the conjunction of LDA and RF always produces an improvement, with the first LDA axes appearing among the most active summaries of the trained forests (Figs. S6 and S8). This stresses both the appeal of LDA as extra summaries and the amalgamating effect of RF, namely its ability to incorporate highly relevant statistics within a wide set of possibly correlated or non-informative summaries.

Microsatellite dataset: retracing the invasion routes of the Harlequin ladybird. The original challenge was to conduct inference about the introduction pathway of the invasive Harlequin ladybird (Harmonia axyridis) for the
first recorded outbreak of this species in eastern North America. The dataset, first analyzed in (31) and (23) via ABC, includes samples from five natural and biocontrol populations genotyped at 18 microsatellite markers. The model selection requires the formalization and comparison of 10 complex competing scenarios corresponding to various possible routes of introduction (see analysis 1 in (31) and the SI for details). We now compare our results from the ABC-RF algorithm with other classification methods and with the original solutions of (31) and (23). RF and the other classifiers discriminating among the 10 scenarios were trained on either 10^4, 2 × 10^4 or 5 × 10^4 simulated datasets. We included all summary statistics computed by the DIYABC software for microsatellite markers (29), namely 130 statistics, complemented by the nine LDA axes as additional summary statistics (a sketch of this construction is given below). More details about this example can be found in the SI.
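The following Python sketch illustrates, under assumed array inputs, how the LDA axes can be appended to the DIYABC summaries before training the forest; it is a stand-in for, not a reproduction of, the pipeline used here, and n_trees as well as the variable names are illustrative.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier

def train_abc_rf(ref_stats, ref_models, n_trees=500):
    # ref_stats  : (N, 130) summary statistics of the reference table
    # ref_models : (N,)     scenario (model) indices
    # project the reference table on the M-1 LDA axes (nine axes for ten scenarios)
    lda = LinearDiscriminantAnalysis().fit(ref_stats, ref_models)
    lda_axes = lda.transform(ref_stats)

    # append the LDA coordinates to the original summaries and train the forest
    augmented = np.hstack([ref_stats, lda_axes])
    rf = RandomForestClassifier(n_estimators=n_trees).fit(augmented, ref_models)
    return lda, rf

# prediction for an observed vector s0 of 130 DIYABC summaries:
#   rf.predict(np.hstack([s0, lda.transform(s0.reshape(1, -1))[0]]).reshape(1, -1))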
In this example, discriminating among models based on the observed summary statistics is difficult. The overlapping groups of Fig. S10 in the SI reflect that difficulty, whose source is the relatively low amount of information carried by the 18 autosomal microsatellite loci considered here. Prior error rates of the learning methods on the whole reference table are given in Table 1. As expected in such high-dimensional settings (19, Section 2.5), the k-nn classifiers behind the standard ABC methods perform uniformly badly when trained on the 130 numerical summaries, even when well calibrated. On a much smaller set of covariates, namely the nine LDA axes, these local methods (standard ABC and the local logistic regression) behave much more nicely. The best classifier in terms of prior error rates is an RF trained on the 130 summaries and the nine LDA axes, whatever the size of the reference table. Additionally, Fig. S11 shows that RFs are clearly able to automatically determine the (most) relevant statistics for model comparison, including in particular some crude estimates of admixture rates defined in (32), some of them not selected by the experts in (31). We stress here that the level of information of the summary statistics displayed in Fig. S11 is relevant for model choice but not for parameter estimation issues. In other words, the set of best summaries found with ABC-RF should not be considered as an optimal set for further parameter estimation under a given model with standard ABC techniques (3).

The evolutionary scenario selected by our RF strategy fully agrees with the earlier conclusion of (31), based on approximations of posterior probabilities by local logistic regression solely on the LDA axes (i.e., the same scenario displays the highest ABC posterior probability and the largest number of selections among the decisions taken by the aggregated trees of the RF). Another noteworthy feature of this re-analysis is the posterior error rate of the best ABC-RF, approximated as 40% when running Algorithm 2 with k = 500 neighbors and Npp = 20 simulated datasets per neighbor. In agreement with this, the posterior probability of the chosen scenario in (31) is relatively low (about 60%). It is worth stressing here that posterior error rates and posterior probabilities are not commensurable, i.e., they cannot be measured on the same scale. For instance, a posterior probability of 60% is not the equivalent of a posterior error rate of 40%: the former is a transform of a vector of evidences, while the latter is an average performance over hypothetical datasets. These quantities are therefore not to be assessed on the same ground, one being a Bayesian construct of the probability of a model, the other a weighted evaluation of the chances of selecting the wrong model.

[Fig. 2. Human SNP data: projection of the reference table on the first four LDA axes (pairwise plots of LD1-LD4). Colors correspond to model indices (see SI for the description of the models). The location of the additional dataset is indicated by a large black star.]

Table 2: Human SNP data: estimated prior error rates for classification methods and three sizes of reference table.

Classification method                              Prior error rate (%), trained on
                                                   Nref = 10,000  Nref = 20,000  Nref = 50,000
linear discriminant analysis (LDA)                      9.91           9.97          10.03
standard ABC (k-nn) using DIYABC summaries             23.18          20.55          17.76
standard ABC (k-nn) using only LDA axes                 6.29           5.76           5.70
local logistic regression on LDA axes                   6.85           6.42           6.07
random forest (RF) using DIYABC initial summaries       8.84           7.32           6.34
RF using both DIYABC summaries and LDA axes             5.01           4.66           4.18

Same comments as in Table 1.
SNP dataset: inference about Human population history. Because ABC-RF performs well with a substantially lower number of simulations than standard ABC methods, it is expected to be of particular interest for the statistical processing of massive single nucleotide polymorphism (SNP) datasets, whose production is on the increase in the field of population genetics. We analyze here a dataset including 50,000 SNP markers genotyped in four Human populations (33). The four populations include Yoruba (Africa), Han (East Asia), British (Europe) and American individuals of African ancestry, respectively. Our intention is not to bring new insights into Human population history, which has been and is still being studied in greater detail using genetic data, but to illustrate the potential of ABC-RF in this context. We compared six scenarios (i.e., models) of evolution of the four Human populations which differ from each other by one ancient and one recent historical event: (i) a single out-of-Africa colonization event giving an ancestral out-of-Africa population which secondarily split into one European and one East Asian population lineage, versus two independent out-of-Africa colonization events, one giving the European lineage and the other one giving the East Asian lineage; and (ii) the possibility of a recent genetic admixture of Americans of African origin with their African ancestors and individuals of European or East Asian origin. The SNP dataset and the compared scenarios are further detailed in the SI. We used all the summary statistics provided by DIYABC for SNP markers (29), namely 130 statistics in this setting, complemented by the five LDA axes as additional statistics.

To discriminate among the six scenarios of Fig. S12 in the SI, RF and the other classifiers have been trained on three nested reference tables of different sizes. The estimated prior error rates are reported in Table 2. Unlike the previous example, the amount of information carried here by the 50,000 SNP markers is much higher, as it induces better separated simulations on the LDA axes (Fig. 2) and much lower prior error rates (Table 2). Even in this case, RF using both the initial summaries and the LDA axes provides the best results.

ABC-RF applied to the Human dataset selects Scenario 2 as the forecasted scenario, an answer which is not visually obvious from the LDA projections of Fig. 2. But, considering previous population genetics studies in the field, it is not surprising that this scenario, which includes a single out-of-Africa colonization event giving an ancestral out-of-Africa population with a secondary split into one European and one East Asian population lineage, and a recent genetic admixture of Americans of African origin with their African ancestors and European individuals, was selected among the six compared scenarios. This selection is associated with a high confidence level, as indicated by an estimated posterior error rate equal to zero. As in the previous example, we used Algorithm 2 with k = 500 neighbors and then simulated Npp = 20 replicates per neighbor to estimate the posterior error rate.
Computation time is a particularly important issue in the present example. Simulating the 10,000 SNP datasets used to train the classification methods requires seven hours on a computer with 32 processors (Intel Xeon(R) CPU, 2 GHz). In that context, we are delighted to observe that the RF classifier constructed on the summaries and the LDA axes with a 10,000-simulation reference table has a smaller prior error rate than all other classifiers, even when those are trained on a 50,000-simulation reference table. It is worth noting that standard ABC treatments for model choice are based in practice on reference tables of substantially larger sizes: 10^5 to 10^6 simulations per scenario (23, 34). For the above setting, in which six scenarios are compared, standard ABC treatments would thus require a minimum computation time of 17 days (using the same computational resources). According to the comparative tests that we carried out on various example datasets, we found that RF globally allowed a minimum computation speed gain of around a factor of 50 in comparison with standard ABC treatments (see also Section 4 of the SI for other considerations regarding computation speed gains).

Conclusion

The present paper is purposely focused on selecting a model, which is a
classification problem trained on ABC simulations. Indeed, there exists a fundamental and numerical discrepancy between genuine posterior probabilities and probabilities based on summary statistics (10, 11). When the statistics follow the consistency conditions of (12), the discrepancy remains, but the resulting algorithm asymptotically selects the proper model as the size of the data grows.

We defend here the paradigm shift of quantifying our confidence in the selected model by the computation of a posterior error rate, along with the abandonment of approximating posterior probabilities, since the latter cannot be assessed at a reasonable computational cost. The posterior error rate produces an estimated error as an average over the a posteriori most likely part of the parameter space, including the information contained in the data. It further remains within the Bayesian paradigm and is a convergent evaluation of the true error made by RF itself, whence it represents a natural substitute for the usually uncertain ABC approximation of posterior probabilities.

Compared with past ABC implementations, ABC-RF offers improvements on at least five levels: (i) on all experiments we studied, it has a lower prior error rate; (ii) it is robust to the size and choice of the set of summary statistics, as RF can handle many superfluous statistics with no impact on the performance rates (which mostly depend on the intrinsic dimension of the classification problem (27, 28), a characteristic confirmed by our results); (iii) the computing effort is considerably reduced, as RF requires a much smaller reference table than the alternatives (i.e., a few thousand versus hundreds of thousands to billions of simulations); (iv) the method is associated with an embedded and cost-free error evaluation which assesses the reliability of the ABC-RF analysis; and (v) RF can be easily and cheaply calibrated (with no further simulations) from the reference table via the reliable out-of-bag error. As a consequence, ABC-RF allows for a more robust handling of the degree of uncertainty in the choice between models, possibly in contrast with earlier and over-optimistic assessments. Due to a massive gain in computing and simulation efforts, ABC-RF will undoubtedly extend the range and complexity of the datasets (e.g., the number of markers in population genetics) and models handled by ABC. Once a given model has been chosen and
confidence in it evaluated by ABC-RF, it becomes possible to estimate parameter distributions under this (single) model using standard ABC techniques (e.g., 35) or alternative methods such as those proposed in (36).

ACKNOWLEDGMENTS. The use of random forests was suggested to JMM and CPR by Bin Yu during a visit at CREST, Paris, in 2013. We are grateful to our colleagues at CBGP for their feedback and support, to the Department of Statistics at Warwick for its hospitality, and to G. Biau for his help with the asymptotics of random forests. Parts of this research were conducted at BIRS, Banff, Canada, and the authors (PP and CPR) took advantage of this congenial research environment. The authors also acknowledge the independent research conducted on
classification tools for ABC by M. Gutmann, R. Dutta, S. Kaski, and J. Corander.

References
1. Tavare S, Balding D, Griffiths R, Donnelly P (1997) Inferring coalescence times from DNA sequence data. Genetics 145:505-518.
2. Pritchard J, Seielstad M, Perez-Lezaun A, Feldman M (1999) Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol 16:1791-1798.
3. Beaumont M, Zhang W, Balding D (2002) Approximate Bayesian computation in population genetics. Genetics 162:2025-2035.
4. Beaumont M (2008) in Simulations, Genetics and Human Prehistory, eds Matsumura S, Forster P, Renfrew C (McDonald Institute Monographs, McDonald Institute for Archaeological Research, Cambridge), pp 134-154.
5. Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf M (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface 6:187-202.
6. Beaumont M (2010) Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics 41:379-406.
7. Csillery K, Blum M, Gaggiotti O, Francois O (2010) Approximate Bayesian computation (ABC) in practice. Trends in Ecology and Evolution 25:410-418.
8. Marin J, Pudlo P, Robert C, Ryder R (2011) Approximate Bayesian computational methods. Statistics and Computing pp 1-14.
9. Blum M, Nunes M, Prangle D, Sisson S (2013) A comparative review of dimension reduction methods in Approximate Bayesian Computation. Stat Sci 28:189-208.
10. Didelot X, Everitt R, Johansen A, Lawson D (2011) Likelihood-free estimation of model evidence. Bayesian Analysis 6:48-76.
11. Robert C, Cornuet JM, Marin JM, Pillai N (2011) Lack of confidence in ABC model choice. Proceedings of the National Academy of Sciences 108(37):15112-15117.
12. Marin J, Pillai N, Robert C, Rousseau J (2014) Relevant statistics for Bayesian model choice. J Roy Stat Soc B (to appear).
13. Breiman L (2001) Random forests. Machine Learning 45:5-32.
14. Berger J (1985) Statistical Decision Theory and Bayesian Analysis (Springer-Verlag, New York), second edition.
15. Robert C (2001) The Bayesian Choice (Springer-Verlag, New York), second edition.
16. Grelaud A, Marin JM, Robert C, Rodolphe F, Tally F (2009) Likelihood-free methods for model choice in Gibbs random fields. Bayesian Analysis 3(2):427-442.
17. Biau G, Cerou F, Guyader A (2014) New insights into Approximate Bayesian Computation. Annales de l'IHP (Probability and Statistics).
18. Stoehr J, Pudlo P, Cucala L (2014) Adaptive ABC model choice and geometric summary statistics for hidden Gibbs random fields. Statistics and Computing pp 1-13.
19. Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning. Data Mining, Inference, and Prediction, Springer Series in Statistics (Springer-Verlag, New York), second edition.
20. Freund Y, Schapire RE, et al. (1996) Experiments with a new boosting algorithm. Vol. 96, pp 148-156.
21. Aeschbacher S, Beaumont MA, Futschik A (2012) A novel approach for choosing summary statistics in Approximate Bayesian Computation. Genetics 192:1027-1047.
22. Prangle D, Blum MGB, Popovic G, Sisson SA (2013) Diagnostic tools of approximate Bayesian computation using the coverage property. ArXiv e-prints.
23. Estoup A, et al. (2012) Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics. Molecular Ecology Resources 12:846-855.
24. Devroye L, Gyorfi L, Lugosi G (1996) A Probabilistic Theory of Pattern Recognition, Applications of Mathematics Vol. 31 (Springer-Verlag, New York), pp xvi+636.
25. Breiman L (1996) Bagging predictors. Mach Learn 24:123-140.
26. Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and Regression Trees (CRC Press).
27. Biau G (2012) Analysis of a random forest model. Journal of Machine Learning Research 13:1063-1095.
28. Scornet E, Biau G, Vert JP (2014) Consistency of random forests. Technical Report 1405.2881, arXiv.
29. Cornuet JM, et al. (2014) DIYABC v2.0: a software to make Approximate Bayesian Computation inferences about population history using Single Nucleotide Polymorphism, DNA sequence and microsatellite data. Bioinformatics (to appear).
30. Cleveland W (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74:829-836.
31. Lombaert E, Guillemaud T, Thomas C, et al. (2011) Inferring the origin of populations introduced from a genetically structured native range by Approximate Bayesian Computation: case study of the invasive ladybird Harmonia axyridis. Molecular Ecology 20:4654-4670.
32. Choisy M, Franck P, Cornuet JM (2004) Estimating admixture proportions with microsatellites: comparison of methods based on simulated data. Mol Ecol 13:955-968.
33. 1000 Genomes Project Consortium, Abecasis G, Auton A, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56-65.
34. Bertorelle G, Benazzo A, Mona S (2010) ABC as a flexible framework to estimate demography over space and time: some cons, many pros. Mol Ecol 19:2609-2625.
35. Beaumont M, Zhang W, Balding D (2002) Approximate Bayesian computation in population genetics. Genetics 162:2025-2035.
36. Excoffier L, Dupanloup I, Huerta-Sanchez E, Sousa V, Foll M (2013) Robust demographic inference from genomic and SNP data. PLoS Genet e1003905.
Reliable ABC model choice via random forests | Supporting Information

Pierre Pudlo, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, and Christian P. Robert

Universite de Montpellier 2, I3M, Montpellier, France; Institut de Biologie Computationnelle (IBC), Montpellier, France; CBGP, INRA, Montpellier, France; Universite Paris Dauphine, CEREMADE, Paris, France; and University of Warwick, Coventry, UK

Table of contents
1. Classification methods
2. A revealing toy example: MA(1) versus MA(2) models
3. Examples based on controlled simulated population genetic datasets
4. Supplementary information about the Harlequin ladybird example
5. Supplementary information about the Human population example
6. Computer software and codes
7. Summary statistics available in the DIYABC software

1. Classification methods
Classification methods aim at forecasting a variable Y that takes values in a finite set, e.g. {1, ..., M}, based on a predicting vector of covariates X = (X1, ..., Xd) of dimension d. They are fitted on a training database (xi, yi) of independent replicates of the pair (X, Y). We exploit such classifiers in ABC model choice by predicting a model index (Y) from the observation of summary statistics on the data (X). The classifiers are trained with numerous simulations from the hierarchical Bayes model that constitute the ABC reference table. For a more detailed entry on classification, we refer the reader to the entry (1) and to the more theoretical (2).
Standard classifiers. A first family of classifiers includes linear discriminant analysis (LDA) and naive Bayes. Those classifiers rely on a full likelihood function corresponding to the joint distribution of (X, Y), specified by the marginal probabilities of Y and the conditional density f(x|y) of X given Y = y. Classification follows by ordering the probabilities Pr(Y = y | X = x). For instance, linear discriminant analysis assumes that each conditional distribution of X is a multivariate Gaussian distribution with unknown mean and covariance matrix, the covariance matrix being assumed constant across classes. These parameters are fitted on a training database by maximum likelihood; see e.g. Chapter 4 of (1). This classification method is quite popular as it provides a linear projection of the covariates onto a space of dimension M - 1, called the LDA axes, which separates the classes as much as possible. Similarly, naive Bayes assumes that each density f(x|y), y = 1, ..., M, is a product of marginal densities. Despite this rather strong assumption of conditional independence of the components of X, naive Bayes often produces good classification results. Note that one can assume that the marginals are univariate Gaussians and fit those by maximum likelihood estimation, or else resort to a nonparametric kernel density estimator to recover these marginal densities when the training database is large enough.

Logistic and multinomial regressions use a conditional likelihood based on a modeling of Pr(Y = y | X = x), as special cases of a generalized linear model. Modulo a logit transform, logit(p) = log{p/(1 - p)}, this model assumes a linear dependency in the covariates; see e.g. Chapter 4 in (1). Logistic regression results rarely differ from LDA estimates since the decision boundaries are also linear. The sole difference stands with the procedure used to fit the classifiers.
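For illustration only, both generative classifiers can be fitted in a few lines with scikit-learn; the array names below (ref_stats, ref_models) stand for the summary statistics and model indices of a reference table and are assumptions of this sketch.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

def standard_classifiers(ref_stats, ref_models):
    # LDA: Gaussian class-conditional densities with a common covariance matrix;
    # the fit also yields the M-1 discriminant axes used later as extra summaries
    lda = LinearDiscriminantAnalysis().fit(ref_stats, ref_models)

    # naive Bayes with Gaussian marginals: conditional independence of the covariates
    nb = GaussianNB().fit(ref_stats, ref_models)
    return lda, nb

# classification of an observed summary vector s0:
#   lda.predict(s0.reshape(1, -1)), nb.predict(s0.reshape(1, -1))
# projection of s0 on the LDA axes: lda.transform(s0.reshape(1, -1))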
Local methods. k-nearest neighbor (k-nn) classifiers require no model fitting but mere computations on the training database. More precisely, k-nn builds upon a distance on the feature space containing X. In order to make a classification when X = x, k-nn derives the k training points that are the closest in distance to x and classifies this new datapoint x according to a majority vote among the classes of the k neighbors. The accuracy of k-nn heavily depends on the tuning of k, which should be calibrated, as explained below.

Local logistic (or multinomial) regression adds a linear regression layer to these procedures and dates back to (3). In order to make a decision at X = x, given the k nearest neighbors in the feature space, one weights them by a smoothing kernel (e.g., the Epanechnikov kernel) and a multinomial classifier is fitted on this weighted sub-sample of the training database. More details on this procedure can be found in (4). Likewise, the accuracy of the classifier depends on the calibration of k.
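A minimal sketch of this local multinomial regression, assuming Euclidean distances and scikit-learn's LogisticRegression with observation weights, could read as follows; it illustrates the idea rather than the exact procedure of (4).

import numpy as np
from sklearn.linear_model import LogisticRegression

def local_logistic_predict(ref_stats, ref_models, s0, k=1000):
    # k nearest neighbors of s0 in the reference table (Euclidean distance)
    dist = np.linalg.norm(ref_stats - s0, axis=1)
    nearest = np.argsort(dist)[:k]

    # Epanechnikov kernel weights, scaled by the distance to the k-th neighbor
    u = dist[nearest] / dist[nearest].max()
    weights = 0.75 * (1.0 - u ** 2)

    # weighted multinomial logistic regression fitted on this local sub-sample
    clf = LogisticRegression(max_iter=1000)
    clf.fit(ref_stats[nearest], ref_models[nearest], sample_weight=weights)
    return clf.predict_proba(s0.reshape(1, -1))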
Random forest construction. RF aggregates decision trees built with a slight modification of the CART algorithm (5). The latter procedure produces a binary tree that sets rules as labels of the internal nodes and predictions of Y as labels of the tips (terminal nodes). At a given internal node, the rule is of the form Xj <= t, which determines a left-hand branch rising from that vertex and a right-hand branch corresponding to Xj > t. To predict the value of Y when X = x from this tree means following a path from the root by applying these binary rules and returning the label of the tip at the end of the path.

The randomized CART algorithm used to create the trees in the forest recursively infers the internal and terminal labels of each tree from the root, on a training database (xi, yi), as follows. Given a tree built until a node v, daughter nodes v1 and v2 are determined by partitioning the data remaining at v in a way highly correlated with the outcome Y. Practically, this means minimizing an empirical divergence criterion (the sum of impurities of the resulting nodes v1 and v2) towards selecting the most discriminating covariate Xj, among a random subset of the covariates of size ntry, and the best threshold t. Assuming \hat p(v, y) denotes the relative frequency of y among the part of the learning database that falls at node v and N(v) the size of this part of the database, the Gini criterion we minimize is N(v1) Q(v1) + N(v2) Q(v2), where

Q(v_i) = \sum_{y=1}^{M} \hat p(v_i, y) \{1 - \hat p(v_i, y)\}.

(See Chapter 9 in (1) for criteria other than the Gini index above.) The recursive algorithm stops when all terminal nodes v are homogeneous, i.e., Q(v) = \sum_{y=1}^{M} \hat p(v, y) \{1 - \hat p(v, y)\} = 0, and the label of the tip v is the only value of y for which \hat p(v, y) = 1. This leads to Algorithm S1, whose decision boundaries are noisy but approximately unbiased.

The RF algorithm aggregates randomized CART trees trained on bootstrap sub-samples of size Nboot from the original training database (i.e., the reference table in our context). The prediction at a new covariate value X = x is the most frequent response predicted by the trees in the forest. Three tuning parameters have to be calibrated: the number B of trees in the forest, the number ntry of covariates that are sampled at each node by the randomized CART, and the size Nboot of the bootstrap sub-sample. Following (6), if d is the total number of predictors, the default number of covariates ntry is sqrt(d) and the default Nboot is the size of the original training database. The out-of-bag error is the average number of times an observation from the training database is misclassified by trees trained on bootstrap samples that do not include this observation, and it is instrumental in tuning the above parameters.
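In code, the node-splitting criterion above amounts to the following minimal sketch (class labels are assumed to be coded 0, ..., M-1; names are illustrative):

import numpy as np

def gini_impurity(labels, n_classes):
    # Q(v) = sum_y p(v, y) (1 - p(v, y)) for the labels falling at node v
    p = np.bincount(labels, minlength=n_classes) / len(labels)
    return np.sum(p * (1.0 - p))

def split_criterion(x_j, labels, t, n_classes):
    # N(v1) Q(v1) + N(v2) Q(v2) for the candidate rule X_j <= t
    left = labels[x_j <= t]
    right = labels[x_j > t]
    crit = 0.0
    if len(left):
        crit += len(left) * gini_impurity(left, n_classes)
    if len(right):
        crit += len(right) * gini_impurity(right, n_classes)
    return crit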
Algorithm S1 Randomized CART
start the tree with a single root
repeat
  pick a non-homogeneous tip v (i.e., Q(v) != 0)
  attach to v two daughter nodes v1 and v2
  draw a random subset of covariates of size ntry
  for all covariates Xj in the random subset do
    find the threshold tj in the rule Xj <= tj that minimizes N(v1) Q(v1) + N(v2) Q(v2)
  end for
  find the rule Xj <= tj that minimizes N(v1) Q(v1) + N(v2) Q(v2) in j and set this best rule to node v
until all tips v are homogeneous (Q(v) = 0)
set the labels of all tips

Algorithm S2 RF for classification
for b = 1 to B do
  draw a bootstrap sub-sample Z of size Nboot from the training data
  grow a tree Tb trained on Z with Algorithm S1
end for
output the ensemble of trees {Tb; b = 1, ..., B}
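A compact sketch of Algorithm S2, using scikit-learn decision trees with random covariate sub-sampling as stand-ins for the randomized CART of Algorithm S1, might read:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(ref_stats, ref_models, n_trees=500, n_boot=None, n_try='sqrt', seed=0):
    # Algorithm S2: aggregate randomized trees fitted on bootstrap sub-samples
    rng = np.random.default_rng(seed)
    n = len(ref_models)
    n_boot = n_boot or n
    trees = []
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n_boot)              # bootstrap sub-sample of size Nboot
        tree = DecisionTreeClassifier(max_features=n_try)   # ntry covariates drawn at each node
        tree.fit(ref_stats[boot], ref_models[boot])
        trees.append(tree)
    return trees

def forest_predict(trees, s0):
    # majority vote among the trees of the forest
    votes = np.array([tree.predict(s0.reshape(1, -1))[0] for tree in trees])
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]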
Notice that the frequencies of the predicted responses among the trees of Algorithm S2 do not reflect any posterior-related quantities and thus should not be returned to the user. Indeed, if it is fairly easy to reach the decision y at covariate value X = x, almost all trees will produce the same prediction y and the frequency of this class y will be much higher than Pr(Y = y | X = x).

The way we build an RF classifier given a collection of statistical models is to start from an ABC reference table including a set of simulation records made of model indices, parameter values and summary statistics for the associated simulated data. This table then serves as the training database for an RF that forecasts the model index based on the summary statistics. Once more, we stress that the frequency of each model among the tree predictions does not reflect any posterior probability. We therefore propose the computation of a posterior error rate (see main text) that renders a reliable and fully Bayesian error evaluation.

Calibration of the tuning parameters. Many machine learning algorithms involve tuning parameters that need to be determined carefully in order to obtain good results (in terms of what is called the prior error rate in the main text). Usually, the predictive performances (averaged over the prior in our context) of classifiers are evaluated on new data (validation procedures) or fake new data (cross-validation procedures); see e.g. Chapter 7 of (1). This is the standard way to compare the performances of various possible values of the tuning parameters and thus calibrate these parameters.

For instance, the value of k for both k-nn and local logistic regression, as well as Nboot for RF, need to be calibrated. But, while k-nn performances heavily depend on the value of k, the results of RF are rather stable over a large range of values of Nboot, as illustrated in Fig. S1. The plots in this figure display an empirical evaluation of the prior error rates of the classifiers against different values of their tuning parameter, with a validation sample made of a fresh set of 10^4 simulations from the hierarchical Bayesian model. Because of the moderate Monte Carlo noise within the empirical error, we first smooth out the curve before determining the calibration of the algorithms. Fig. S1 displays this derivation for the ABC analysis of the Harlequin ladybird data with machine learning tools. The last case is quite characteristic of the plateau structure of errors in RFs.

The validation procedure described above requires new simulations from the hierarchical Bayesian model, which we can always produce because of the very nature of ABC. But such simulations might be computationally intensive when analyzing large datasets or complex models. The cross-validation procedure is an alternative (not presented here), while RF offers a separate evaluation procedure: it takes advantage of the fact that bootstrap samples do not contain the whole reference table, the leftovers being available for testing. The resulting evaluation of the prior error rate is the out-of-bag estimator; see e.g. Chapter 15 of (1).
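With scikit-learn, the out-of-bag estimate of the prior error rate is obtained directly at fitting time; a minimal sketch (argument names are illustrative) is:

from sklearn.ensemble import RandomForestClassifier

def oob_prior_error(ref_stats, ref_models, n_trees=500):
    # each tree is tested on the simulations left out of its bootstrap sample,
    # so no additional validation set is required
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, bootstrap=True)
    rf.fit(ref_stats, ref_models)
    return 1.0 - rf.oob_score_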
Calibration of the other classifiers involves new prior simulations or a computationally heavy cross-validation approximation of the error. Moreover, calibrating the local logistic regression may prove computationally unfeasible since, for each dataset of the validation sample (the second reference table), the procedure involves searching for nearest neighbors in the (first) reference table and fitting a weighted logistic regression on those neighbors.

[Fig. S1. Calibration of k-nn, the local logistic regression, and RF. Plot of the empirical prior error rate (in black) of three classifiers, namely k-nn (top), the local logistic regression (middle) and RF (bottom), as a function of their tuning parameter (k for the first two methods, Nboot for RF), when analyzing the Harlequin ladybird data with a reference table of 10,000 simulations (top and middle) or 50,000 simulations (bottom). To remove the noise of these errors, estimated on a validation set composed of 10,000 independent simulations, the estimated errors are smoothed by a spline method that produces the red curve. The optimal values of the parameters are k = 300, k = 3,000 and Nboot = 40,000, respectively.]
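The smoothing-then-minimizing step described in the caption of Fig. S1 can be sketched as follows, here with a SciPy smoothing spline standing in for the spline method actually used:

import numpy as np
from scipy.interpolate import UnivariateSpline

def calibrate(param_values, error_rates, smoothing=None):
    # param_values: candidate values of k (or Nboot), in increasing order
    # error_rates : prior error rates estimated on the validation sample
    param_values = np.asarray(param_values, dtype=float)
    spline = UnivariateSpline(param_values, error_rates, s=smoothing)
    grid = np.linspace(param_values.min(), param_values.max(), 1000)
    return grid[np.argmin(spline(grid))]   # parameter value minimizing the smoothed curve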
2. A revealing toy example: MA(1) versus MA(2) models

Given a time series (x_t) of length T = 100, we compare fits by moving average models of order either 1 or 2, MA(1) and MA(2), namely

x_t = \epsilon_t + \vartheta_1 \epsilon_{t-1}   and   x_t = \epsilon_t + \vartheta_1 \epsilon_{t-1} + \vartheta_2 \epsilon_{t-2},   \epsilon_t \sim N(0, \sigma^2),

respectively. As previously suggested (7), a possible set of (insufficient) summary statistics is made of the first two (or higher-order) autocorrelations, a set that yields an ABC reference table of size Nref = 10^4 with two covariates, displayed in Fig. S2. For both models, the priors are uniform distributions on the stationarity domains (8):
– for MA(1), the single parameter \vartheta_1 is drawn uniformly from the segment (-1, 1);
– for MA(2), the pair (\vartheta_1, \vartheta_2) is drawn uniformly over the triangle defined by -2 < \vartheta_1 < 2, \vartheta_1 + \vartheta_2 > -1 and \vartheta_1 - \vartheta_2 < 1.

In this example, we can evaluate the discrepancy between the true posterior probabilities and those based on summaries. The true marginal likelihoods can be computed by numerical integrations of dimension 1 and 2, respectively, while the posterior probabilities based on the summary statistics are derived from the ABC reference table by a kernel density estimation. Fig. 1 of the main text shows how different the (estimated) posterior probabilities are when based on (i) the whole series of length T = 100 and (ii) only the summary statistics, even though the latter remain informative about the problem. This graph induces us to caution as to the degree of approximation provided by ABC about the true posterior probabilities and it brings numerical support to the severe warnings of (9).

[Fig. S2. Simulated ABC reference table under models MA(1) (blue) and MA(2) (orange), plotted in the plane of the lag-1 and lag-2 autocovariances.]
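For concreteness, a minimal Python sketch of the reference-table construction for this toy example is given below; it follows the parameterization and priors as reconstructed above, so the sign convention of the MA equations is an assumption of the sketch.

import numpy as np

rng = np.random.default_rng(1)

def simulate_ma2(theta1, theta2, T=100, sigma=1.0):
    # one MA(2) series x_t = e_t + theta1 e_{t-1} + theta2 e_{t-2}; theta2 = 0 gives MA(1)
    eps = rng.normal(0.0, sigma, size=T + 2)
    return eps[2:] + theta1 * eps[1:-1] + theta2 * eps[:-2]

def draw_prior(model):
    # uniform priors on the stationarity domains described above
    if model == 1:
        return rng.uniform(-1.0, 1.0), 0.0
    while True:  # uniform on the triangle -2 < t1 < 2, t1 + t2 > -1, t1 - t2 < 1
        t1, t2 = rng.uniform(-2.0, 2.0), rng.uniform(-1.0, 1.0)
        if t1 + t2 > -1.0 and t1 - t2 < 1.0:
            return t1, t2

def summaries(x):
    # lag-1 and lag-2 autocovariances, the (insufficient) summary statistics of the example
    return np.array([np.sum(x[1:] * x[:-1]), np.sum(x[2:] * x[:-2])])

# one line of the reference table: model index, parameters, summary statistics
m = rng.integers(1, 3)
theta = draw_prior(m)
row = (m, theta, summaries(simulate_ma2(*theta)))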
The discrepancy between the genuine π(m | x) and the ABC ersatz π(m | S(x)) cannot be explained by the curse of dimensionality: the number of summary statistics is either 2 or 7. As seen in Table S1, which draws a comparison between various classifiers, k-nn is one of the best classification methods. But all methods based on summaries are outperformed by the Bayes classifier that can be computed here via approximations of the genuine π(m | x): this ideal classifier achieves a prior error of 12.36%. Most of the difference between this error rate and the 17% misclassification rate achieved by ABC can be traced to differences between π(m | x) and π(m | S(x)) that are so large as to be on opposite sides of the threshold 0.5. Besides, as illustrated in Fig. S2, a linear separation between both models does not occur, and this is reflected by the high error rates of LDA and logistic regression in both cases. The standard ABC model choice (k-nn) does really well in this example, reaching one of the lowest error rates when optimized over the number k of neighbors. Interestingly, most methods presented in Table S1 display degraded performances when moving from two to seven summary statistics. By contrast, RFs achieve the absolute minimum in this comparison and manage to take advantage of a larger set of summaries.

If we now turn to the performances of the posterior version of the misclassification error (computed with Algorithm 2 displayed in the main text), Fig. S3 shows how the posterior error rates vary according to the position of the two-dimensional summary statistics, with larger errors at the boundaries between both models and overall for the MA(1) model. The error rates thus range from negligible to above 30%, depending on the summary statistic location.

Table S1. Prior error rates in the MA(1) vs. MA(2) example.
Classification method                                    Prior error rate (%)
                                                         2 statistics   7 statistics
linear discriminant analysis (LDA)                           27.43          26.57
logistic regression                                          28.34          27.40
naive Bayes (with Gaussian marginals)                        19.52          24.40
naive Bayes (with non-parametric marginal estimates)         18.25          21.92
k-nn with k = 100 neighbors                                  17.23          18.37
k-nn with k = 50 neighbors                                   16.97          17.35
random forest                                                17.04          16.15

The prior error rates displayed here were computed as averaged misclassification errors on a set of 10^4 simulations independent of the simulation set of 10^4 values that trained the classifiers. Summary statistics are either the first two or the first seven autocorrelations. A baseline error of 12.36% is obtained when comparing the genuine posterior probabilities on the whole data.

[Fig. S3. Posterior error rates in the MA(1) vs. MA(2) example. This graph displays posterior error rates for the ABC-RF Algorithm 2 of the main text with k = 100 neighbors and Npp = 10, based on 500 replicated time series simulated from the priors under either the MA(1) (blue dots) or the MA(2) (orange dots) model, plotted in the plane of the lag-1 and lag-2 autocorrelations. The dot diameter represents the posterior error rate (0.1, 0.5, 0.9). Large values of the posterior error concentrate around the boundary between both models; see also Fig. S2.]
3. Examples based on controlled simulated population genetic datasets

[Fig. S4. Three competing models of historical relationships between populations of a given species. These population models or scenarios are used for both controlled examples based on SNP and microsatellite data: (left) Model 1, where Population 3 split from Population 1; (center) Model 2, where Population 3 split from Population 2; (right) Model 3, where Population 3 is an admixture between Population 1 and Population 2, with rates r and 1 - r. Times tS and tA are indicated on the trees. Branches in blue represent populations of effective size N1, in green N2, in red N3 and in cyan N4.]

We now consider a basic population genetic setting ascertaining historical links between three populations of a given species. In both examples below, we try to decide whether a third (and recent) population emerged from a first population (Model 1), or from a second population that split from the first one some time ago (Model 2), or whether this third population is a mixture between individuals from both populations (Model 3); see Fig. S4. The only difference between the two examples stands with the kind of data they consider: the 1,000 genetic markers of the first example are autosomal single nucleotide polymorphism (SNP) loci, and the 20 markers of the second example are autosomal microsatellite loci. We assume that, in both cases, the data were collected on samples of 25 diploid individuals from each population. Simulated and observed genetic data are summarized with the help of a few statistics described in Section 7 of the SI. They are all computable with the DIYABC software (10), which we also used to produce the simulated datasets; see also Section 6.

For both examples, the seven demographic parameters of the Bayesian model are
– tS: time of the split between Populations 1 and 2,
– tA: time of the appearance of Population 3,
– N1, N2, N3: effective population sizes of Populations 1, 2 and 3, respectively, below time tS,
– N4: effective population size of the common ancestral population above tS, and
– r: the probability that a gene from Population 3 at time tA came from Population 1.

This last parameter r is the rate of the admixture event at time tA and as such is specific to Model 3.
  • 159. c to Model 3. Note that Model 3 is equivalent to Model 1 when r = 1 and to Model 2 when r = 0. But the prior we set on r avoids nested models. Indeed, the prior distribution is as follows: { the times tS and tA (on the scale of number of generations) are drawn from a uniform distribution over the segment [10; 3 104] conditionally on tA tS; { the four eective population sizes Ni, i = 1; : : : ; 4 are drawn independently from a uniform distribution on a range from 100 to 30; 000 diploid individuals, denoted U(100; 3 104); { the admixture rate r is drawn from a uniform distribution U(0:05; 0:95). In this example, the prior on model indices is uniform so that each of the three models has a prior probability of 1=3. SNP data.The data is made of 1; 000 autosomal SNPs for which we assume that the distances between these loci on the genome are large enough to neglect linkage disequilibrium and hence consider them as having independent ancestral genealo-gies. We use all summary statistics oered by the DIYABC software for SNP markers (10), namely 48 summary statistics in this three population setting (provided in Section 7). In to-tal, we simulated 70; 000 datasets, based on the above priors. These datasets are then split into three groups: { 50; 000 datasets constitute the reference table and reserved for training classi
These datasets are then split into three groups (a code sketch of this split is given after the list):
– 50,000 datasets constitute the reference table and are reserved for training the classifiers (we will also consider classifiers trained on subsamples of this set);
– 10,000 datasets constitute the validation set, used to calibrate the tuning parameters of the classifiers if needed; and
– 10,000 datasets constitute the test set, used to evaluate the prior error rates.
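As a rough illustration of this three-way split and of how a prior error rate is read off the test set, here is a minimal scikit-learn sketch; the arrays are random placeholders standing in for the DIYABC summary statistics and model indices, and RandomForestClassifier is only a stand-in for the RF implementation used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Placeholders for the 70,000 simulated datasets: 48 summaries each plus a model index in {1, 2, 3}.
X = rng.normal(size=(70_000, 48))
y = rng.integers(1, 4, size=70_000)

X_ref, y_ref = X[:50_000], y[:50_000]                # reference table (training set)
X_val, y_val = X[50_000:60_000], y[50_000:60_000]    # validation set (calibration)
X_test, y_test = X[60_000:], y[60_000:]              # test set (prior error rate)

rf = RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1)
rf.fit(X_ref, y_ref)
prior_error = 1.0 - rf.score(X_test, y_test)         # misclassification rate on the test set
print(f"estimated prior error rate: {prior_error:.1%}")
```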
The classification methods applied here are given in Table S2. For the naïve Bayes classifier and the LDA procedure, there is no parameter to calibrate. The number k of neighbors for the standard ABC techniques and for the local logistic regression is tuned as described in Section 1 of the present SI. This is also the case for the size Nboot of the bootstrap subsamples in the RF methods. The prior error rates are estimated and minimized using the validation set of 10,000 simulations, independent of the reference table. The optimal value of k for the standard ABC (k-nn) with the 48 summary statistics is small because of the dimension of the problem (k = 9, k = 15, and k = 55 when using 10,000, 20,000 and 50,000 simulations in the reference table, respectively). The optimal values of k for the local logistic regression are different, since this procedure fits a linear model on weighted neighbors. The calibration on a validation set of 10,000 simulations produced the following optimal values: k = 2,000, k = 3,000, and k = 6,000 when fitted on 10,000, 20,000, and 50,000 simulations, respectively. As reported in Section 1 above, calibrating the parameter k of the local logistic regression is very time consuming. In contrast, we stress that the out-of-bag error rates of RF, derived directly and cheaply from the learning set (i.e., the simulations of the reference table), are very close to the error rates estimated on a calibration sample. RF therefore does not require the simulation of a validation sample, which, on top of its predictive quality, constitutes a significant computational advantage of the approach.
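The closeness of out-of-bag and validation-set error rates mentioned above can be checked along the following lines; this is an illustrative scikit-learn sketch on placeholder data, not the exact procedure of the paper, and max_samples is only an approximate counterpart of Nboot.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_ref, y_ref = rng.normal(size=(50_000, 48)), rng.integers(1, 4, size=50_000)  # placeholder reference table
X_val, y_val = rng.normal(size=(10_000, 48)), rng.integers(1, 4, size=10_000)  # placeholder validation set

rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            max_samples=30_000,  # plays the role of Nboot, the bootstrap subsample size per tree
                            random_state=0, n_jobs=-1)
rf.fit(X_ref, y_ref)

oob_error = 1.0 - rf.oob_score_           # out-of-bag error, computed from the reference table alone
val_error = 1.0 - rf.score(X_val, y_val)  # error on the independent validation set
print(f"out-of-bag error: {oob_error:.1%}  vs  validation error: {val_error:.1%}")
```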
Finally, for the standard ABC (k-nn) based on the original summaries, we relied on a standard Euclidean distance after normalizing each variable by its median absolute deviation, while k-nn on the LDA axes requires no normalization procedure.

Fig. S5. Projections on the LDA axes (LD1 and LD2) of the simulations from the reference table. Colors correspond to model indices: black for Model 1, blue for Model 2 and orange for Model 3. The locations of the two simulated pseudo-observed datasets, which are analyzed as if they were truly observed data, are indicated by green and red stars.
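The median-absolute-deviation normalization used above for the standard ABC (k-nn) distance can be written in a few lines; the sketch below, again on placeholder arrays, uses scikit-learn's KNeighborsClassifier (majority vote among the k neighbors) and is only meant to illustrate the scaling step, not to reproduce the exact ABC (k-nn) procedure.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X_ref, y_ref = rng.normal(size=(50_000, 48)), rng.integers(1, 4, size=50_000)  # placeholder reference table
x_obs = rng.normal(size=(1, 48))                                               # placeholder observed summaries

# Normalize every summary statistic by its median absolute deviation, computed on the reference table.
mad = np.median(np.abs(X_ref - np.median(X_ref, axis=0)), axis=0)
X_scaled, x_scaled = X_ref / mad, x_obs / mad

knn = KNeighborsClassifier(n_neighbors=55, metric="euclidean")  # k = 55 was the optimum for Nref = 50,000
knn.fit(X_scaled, y_ref)
print("predicted model:", knn.predict(x_scaled)[0])
```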
Table S2 provides the estimated prior error rates for these classification techniques, based on a test sample of 10,000 simulations, independent of the reference tables and calibration sets. It shows that the best error rate is associated with an RF trained on both the original DIYABC statistics and the LDA axes. The gain over the standard ABC solution is clearly significant. Other interesting features exhibited in Table S2 are (i) the good performance of the genuine LDA method, owing to a good separation between summaries coming from the three models, as exhibited in Fig. S5, albeit with some overlap between model clusters, and (ii) the fact that the local logistic regression on the two LDA axes of (4) achieves the second-best solution.

Fig. S6 describes further investigations into the RF solution. This graph displays the contributions of the summary statistics to the decisions taken by the RF. The contribution of each summary is evaluated as the average decrease in node impurity over the nodes driven by the corresponding summary statistic; see, e.g., Chapter 15 of (1). The appeal of including the first two LDA axes is clear in Fig. S6, where they appear as LDA1 and LDA2: these statistics contribute more significantly than any other statistic to the decisions taken by the classifier. Note that the FMO statistics, which also contribute strongly to the RF decisions, are the equivalent of pairwise FST distances between populations when the genetic markers are SNPs.
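A sketch of how such variable importances can be obtained is given below: the two LDA projections are appended to the original summaries, an RF is fitted, and scikit-learn's feature_importances_ (a mean-decrease-in-impurity measure) is ranked. The arrays are random placeholders, and the exact importance measure of the paper's RF implementation may differ in its normalization.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X_ref, y_ref = rng.normal(size=(50_000, 48)), rng.integers(1, 4, size=50_000)  # placeholder reference table

# Fit the LDA on the reference table and append its two discriminant axes as extra summaries.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X_ref, y_ref)
X_aug = np.hstack([X_ref, lda.transform(X_ref)])    # 48 original summaries + LDA1 + LDA2

rf = RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1).fit(X_aug, y_ref)

names = [f"S{i + 1}" for i in range(48)] + ["LDA1", "LDA2"]
ranking = sorted(zip(rf.feature_importances_, names), reverse=True)
for imp, name in ranking[:20]:                      # twenty most important statistics, as in Fig. S6
    print(f"{name:>6s}  {imp:.4f}")
```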
In order to convey the meaning of the RF posterior error rate, we simulated two typical datasets, hereafter considered as pseudo-observed datasets, or pods. The first pod (green star in Fig. S5) corresponds to a favorable situation in which Model 3 should easily be discriminated from both Models 1 and 2. The parameter values used to simulate this pod indeed correspond to a recent, balanced admixture between strongly differentiated source populations (N1 = 20,000, N2 = 15,000, N3 = 10,000, N4 = 25,000, r = 0.5, tA = 500 and tS = 20,000). The second pod (red star in Fig. S5) corresponds to a less favorable setting in which it is more difficult to discriminate Model 3 from Models 1 and 2. The parameter values used to simulate this second pod correspond to an ancient, unbalanced admixture between the source populations (N1 = 20,000, N2 = 15,000, N3 = 10,000, N4 = 25,000, r = 0.1, tA = 10,000, and tS = 20,000). For both pods, ABC-RF (trained on both the 48 initial statistics and the two LDA axes) chooses Model 3. The RF was trained on a reference table of size 70,000 (i.e., containing all simulations) and calibrated with Nboot = 30,000, the optimal value obtained by validation when the training set is smaller. The posterior error rates are substantially different for the two pods: less than 0.1% for the first pod and about 12.1% for the second. To evaluate the Monte Carlo error of these estimates, we repeated Algorithm 2 several times, each run simulating 500 × 20 = 10,000 new datasets, and observed very stable posterior error rates. These posterior error rates can be compared with the 18% prior error rate of the method (see Table S2). The prior error rate is considerably higher than both posterior errors because the data space defined by our ABC setting includes a non-negligible proportion of worse cases than the pods represented by the green and red stars in Fig. S5. Obviously, for some less favorable cases, the posterior error rate can be higher than the prior error rate (results not shown here, but see the next example below based on microsatellite data).
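To give an idea of the flavor of this computation, here is a heavily simplified sketch of a local (posterior) error rate around the observed summaries; the actual Algorithm 2 is specified in Section 1 and may differ in its details, and simulate_dataset is a hypothetical wrapper around the simulator (e.g., DIYABC) that we introduce only for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def posterior_error_rate(rf, X_ref, params_ref, models_ref, x_obs,
                         simulate_dataset, k=500, n_per_neighbor=20):
    """Sketch of a posterior error rate conditional on the observed summaries x_obs."""
    # 1. Find the k simulations of the reference table closest to the observed dataset.
    nn = NearestNeighbors(n_neighbors=k).fit(X_ref)
    _, idx = nn.kneighbors(x_obs.reshape(1, -1))

    # 2. Re-simulate n_per_neighbor datasets from each retained (model, parameter) pair
    #    and check whether the trained classifier recovers the model that generated them.
    errors = []
    for i in idx[0]:
        for _ in range(n_per_neighbor):
            x_new = simulate_dataset(models_ref[i], params_ref[i])  # hypothetical simulator call
            errors.append(rf.predict(x_new.reshape(1, -1))[0] != models_ref[i])
    return np.mean(errors)  # e.g., k = 500 and n_per_neighbor = 20 give 10,000 new datasets
```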
Table S2. Estimated prior error rates for the controlled SNP example.

Classification method                                 Prior error rates (%), trained on
                                                      Nref = 10,000   Nref = 20,000   Nref = 50,000
naïve Bayes (with Gaussian marginals)                     34.86           34.72           34.41
linear discriminant analysis (LDA)                        23.35           23.28           23.12
standard ABC (k-nn) using DIYABC summaries                27.82           25.61           23.58
standard ABC (k-nn) using only LDA axes                   22.86           22.56           22.60
local logistic regression on LDA axes                     22.07           21.84           21.93
random forest (RF) using DIYABC initial summaries         22.13           20.71           19.55
RF using both DIYABC summaries and LDA axes               20.07           18.95           18.11

Prior error rates are estimated by evaluating the methods (trained on three sizes of reference tables) on 10,000 other simulations from the Bayesian model.

[Figure: two variable-importance plots; the top panel (48 DIYABC summaries) ranks FMO_23, FMO_13, AML_3_12, NMO_23, ... and the bottom panel (summaries plus LDA axes) ranks LD1, LD2, FMO_23, FMO_13, ...; horizontal axis: mean decrease in node impurity.]
Fig. S6. Contributions of the twenty most important statistics to the RF for the controlled SNP example. The contribution of a statistic is evaluated as the mean decrease in node impurity in the trees of the RF when using the 48 summary statistics (top) and when adding the two LDA axes to this set of statistics (bottom). The meaning of the variable acronyms is provided in Section 7 below.
Microsatellite data. This illustration reproduces the same settings as in the SNP data example above, but the genetic data (which are of much smaller dimension) carry a different and lower amount of information. Indeed, we consider here datasets composed of only 20 autosomal microsatellite loci. The microsatellite loci are assumed to follow a generalized stepwise mutation model with three parameters (11, 12): the mean mutation rate (μ̄), the mean parameter of the geometric distribution of the changes in number of repeats during mutation events (P̄), and the mean mutation rate for single-nucleotide instability (μ̄_SNI). The prior distributions for μ̄, P̄ and μ̄_SNI are the same as those given in Table S4 (i.e., the prior distributions used for the real Harmonia axyridis microsatellite dataset). Each locus has a possible range of 40 contiguous allelic states and is characterized by a locus-specific mutation rate μ drawn from a Gamma distribution with mean μ̄ and shape 2, a locus-specific parameter P drawn from a Gamma distribution with mean P̄ and shape 2, and a locus-specific rate μ_SNI drawn from a Gamma distribution with mean μ̄_SNI and shape 2.
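The locus-specific rates can be drawn as follows; a minimal NumPy sketch, where the mean values are illustrative placeholders (the actual priors on the means are those of Table S4, not reproduced here) and NumPy's Gamma is parameterized by shape and scale, so scale = mean/shape.

```python
import numpy as np

rng = np.random.default_rng(4)
n_loci, shape = 20, 2.0

# Illustrative mean rates standing in for draws from the Table S4 priors:
mu_bar, p_bar, mu_sni_bar = 5e-4, 0.22, 1e-8

# A Gamma(shape, scale) variable has mean shape * scale, hence scale = mean / shape:
mu_loci = rng.gamma(shape, mu_bar / shape, size=n_loci)          # locus-specific mutation rates
p_loci = rng.gamma(shape, p_bar / shape, size=n_loci)            # locus-specific geometric parameters
mu_sni_loci = rng.gamma(shape, mu_sni_bar / shape, size=n_loci)  # locus-specific SNI rates
# Note: the simulator would additionally restrict these draws to their admissible ranges.
```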
For microsatellite markers, DIYABC (13) produces 39 summary statistics, described in Section 7 below. Table S3 is the equivalent of Table S2 for this kind of genetic data. Owing to the lower and different information content of the data, the prior error rates are much higher in all cases, but the conclusion about the gain brought by the RF using all summaries plus the LDA statistics remains. As in the SNP case, in order to illustrate the performance of the posterior error rate, we simulated two typical pods: one highly favorable (the green star in Fig. S7) and a second one quite challenging (the red star in Fig. S7). They were generated using the same parameter values as the SNP pods. For both pods, we considered an ABC-RF treatment with a reference table of size 70,000, calibrated with Nboot = 10,000 as suggested by the validation procedures on smaller reference tables. Running Algorithm 2 twice with k = 500 nearest neighbors and Npp = 20 new simulations per neighbor, we obtained the following posterior error rates: about 2.5% for the first pod (the presumed favorable case) and about 42.5% for the second pod (the presumed challenging case). For the challenging pod we hence obtained a posterior error rate that is larger than the prior error rate.

Fig. S7. Projections on the LDA axes (LD1 and LD2) of the simulations from the reference table. Colors correspond to model indices: black for Model 1, blue for Model 2 and orange for Model 3. The locations of the two simulated pseudo-observed datasets are indicated by green and red stars.

Interestingly, Fig. S8 shows that the AML_3_12 summary statistic (see Section 7) contributes more to the RF decision than the second LDA axis. We recall that AML is an admixture-rate estimate computed by maximum likelihood on a simplified model that assumes the admixture occurred at time t = 0. The importance of the LDA axes in the random forest nevertheless remains very high in this setting.
Table S3. Estimated prior error rates for the controlled microsatellite example. Classification