Computational Bayesian Statistics




                     Computational Bayesian Statistics

                       Christian P. Robert, Université Paris Dauphine


          Frontiers of Statistical Decision Making and Bayesian Analysis,
                              In honour of Jim Berger,
                                 17th of March, 2010




                                                                            1 / 143
Computational Bayesian Statistics




Outline



      1    Computational basics

      2    Regression and variable selection

      3    Generalized linear models




                                               2 / 143
Computational Bayesian Statistics
  Computational basics




Computational basics



       1   Computational basics
             The Bayesian toolbox
                   Improper prior distribution
               Testing
               Monte Carlo integration
               Bayes factor approximations in perspective




                                                            3 / 143
Computational Bayesian Statistics
  Computational basics
     The Bayesian toolbox



The Bayesian toolbox


                         Bayes theorem = Inversion of probabilities

       If A and E are events such that P (E) ≠ 0, P (A|E) and P (E|A)
       are related by

              P(A \mid E) = \frac{P(E \mid A)\,P(A)}{P(E \mid A)\,P(A) + P(E \mid A^c)\,P(A^c)}
                          = \frac{P(E \mid A)\,P(A)}{P(E)}




                                                                          4 / 143
Computational Bayesian Statistics
  Computational basics
     The Bayesian toolbox



New perspective


              Uncertainty on the parameters θ of a model modeled through
              a probability distribution π on Θ, called prior distribution
              Inference based on the distribution of θ conditional on x,
              π(θ|x), called posterior distribution

                              \pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\,\mathrm{d}\theta}\,.




                                                                             5 / 143
Computational Bayesian Statistics
  Computational basics
     The Bayesian toolbox



Bayesian model


       A Bayesian statistical model is made of
          1   a likelihood
                                          f (x|θ),
              and of
          2   a prior distribution on the parameters,

                                           π(θ) .




                                                        6 / 143
Computational Bayesian Statistics
  Computational basics
     The Bayesian toolbox



Posterior distribution center of Bayesian inference

                                    π(θ|x) ∝ f (x|θ) π(θ)


              Operates conditional upon the observations
              Integrates simultaneously the prior information/knowledge and the
              information brought by x
              Avoids averaging over the unobserved values of x
              Coherent updating of the information available on θ,
              independent of the order in which i.i.d. observations are
              collected
              Provides a complete inferential scope and a unique motor of
              inference

                                                                             7 / 143
Computational Bayesian Statistics
  Computational basics
     The Bayesian toolbox



Improper prior distribution


       Extension from a prior distribution to a prior σ-finite measure π
       such that
                              \int_\Theta \pi(\theta)\,\mathrm{d}\theta = +\infty



       Formal extension: π cannot be interpreted as a probability any
       longer




                                                                          8 / 143
Computational Bayesian Statistics
  Computational basics
     The Bayesian toolbox



Validation


       Extension of the posterior distribution π(θ|x) associated with an
       improper prior π given by Bayes’s formula

              \pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int_\Theta f(x|\theta)\,\pi(\theta)\,\mathrm{d}\theta}\,,

       when

              \int_\Theta f(x|\theta)\,\pi(\theta)\,\mathrm{d}\theta < \infty




                                                                           9 / 143
Computational Bayesian Statistics
  Computational basics
     Testing



Testing hypotheses

       Deciding about validity of assumptions or restrictions on the
       parameter θ from the data, represented as

                                 H0 : θ ∈ Θ0   versus H1 : θ ∉ Θ0

       Binary outcome of the decision process: accept [coded by 1] or
       reject [coded by 0]
                                  D = {0, 1}
       Bayesian solution formally very close to a likelihood ratio test
       statistic, but its numerical values often strongly differ from the
       classical solutions


                                                                             10 / 143
Computational Bayesian Statistics
  Computational basics
     Testing



Bayes factor


       Bayesian testing procedure depends on P π (θ ∈ Θ0 |x) or
       alternatively on the Bayes factor

              B_{10}^{\pi} = \frac{P^{\pi}(\theta \in \Theta_1|x)\big/P^{\pi}(\theta \in \Theta_0|x)}{P^{\pi}(\theta \in \Theta_1)\big/P^{\pi}(\theta \in \Theta_0)}
                           = \int f(x|\theta_1)\,\pi_1(\theta_1)\,\mathrm{d}\theta_1 \Big/ \int f(x|\theta_0)\,\pi_0(\theta_0)\,\mathrm{d}\theta_0 = m_1(x)/m_0(x)
                                                        [Akin to likelihood ratio]




                                                                                       11 / 143
Computational Bayesian Statistics
  Computational basics
     Testing



Banning improper priors



       Impossibility of using improper priors for testing!
       Reason: When using the representation

                   π(θ) = P π (θ ∈ Θ1 ) × π1 (θ) + P π (θ ∈ Θ0 ) × π0 (θ)

       π1 and π0 must be normalised




                                                                            12 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



Fundamental integration issue



       Generic problem of evaluating an integral

                              I = \mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\,f(x)\,\mathrm{d}x

       where X is uni- or multidimensional, f is a closed form, partly
       closed form, or implicit density, and h is a function




                                                                         13 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



Monte Carlo Principle

       Use a sample (x1 , . . . , xm ) from the density f to approximate the
       integral I by the empirical average
                              \bar{h}_m = \frac{1}{m}\sum_{j=1}^{m} h(x_j)


       Convergence of the average

                              \bar{h}_m \longrightarrow \mathbb{E}_f[h(X)]

       by the Strong Law of Large Numbers
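
       A minimal R sketch of this principle, with illustrative choices of h and f
       (h(x) = x^2 and f the N (0, 1) density, so that Ef [h(X)] = 1):

       set.seed(42)
       m <- 10^4
       x <- rnorm(m)          # x_1, ..., x_m simulated from f
       h <- x^2               # h evaluated at the simulated values
       mean(h)                # empirical average, approximates Ef[h(X)] = 1
       sqrt(var(h)/m)         # Monte Carlo standard error of the approximation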



                                                                               14 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



e.g., Bayes factor approximation

       For the normal case

                                    x1 , . . . , xn ∼ N (µ + ξ, σ 2 )
                                     y1 , . . . , yn ∼ N (µ − ξ, σ 2 )
                                            and         H0 : ξ = 0

       under prior

                               π(µ, σ 2 ) = 1/σ 2      and       ξ ∼ N (0, 1)
              B_{01}^{\pi} = \frac{\big[(\bar{x}-\bar{y})^2 + S^2\big]^{-n+1/2}}{\int \big[(\bar{x}-\bar{y}-2\xi)^2 + S^2\big]^{-n+1/2}\, e^{-\xi^2/2}\,\mathrm{d}\xi \big/ \sqrt{2\pi}}


                                                                                     15 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



Example



       CMBdata
       Simulate ξ1 , . . . , ξ1000 ∼ N (0, 1) and approximate B01^π with

              \widehat{B}_{01}^{\pi} = \frac{\big[(\bar{x}-\bar{y})^2 + S^2\big]^{-n+1/2}}{\frac{1}{1000}\sum_{i=1}^{1000}\big[(\bar{x}-\bar{y}-2\xi_i)^2 + S^2\big]^{-n+1/2}} = 89.9

       when \bar{x} = 0.0888, \quad \bar{y} = 0.1078, \quad S^2 = 0.00875
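
       A short R sketch of this approximation; the sample size n below is only a
       placeholder, so the result will not reproduce the value 89.9 obtained with
       the actual CMBdata sample size:

       set.seed(1)
       xbar <- 0.0888; ybar <- 0.1078; S2 <- 0.00875
       n <- 25                      # hypothetical sample size (not the CMBdata value)
       xi <- rnorm(10^3)            # xi_1, ..., xi_1000 ~ N(0,1)
       num <- ((xbar - ybar)^2 + S2)^(-n + 1/2)
       den <- mean(((xbar - ybar - 2*xi)^2 + S2)^(-n + 1/2))
       num/den                      # Monte Carlo approximation of B01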




                                                                                          16 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



Precision evaluation
       Estimate the variance with
              v_m = \frac{1}{m}\,\frac{1}{m-1}\sum_{j=1}^{m}\big[h(x_j) - \bar{h}_m\big]^2\,,

       and for m large,

              \big(\bar{h}_m - \mathbb{E}_f[h(X)]\big)\big/\sqrt{v_m} \approx \mathcal{N}(0,1).

       Note
       Construction of a convergence test and of confidence bounds on
       the approximation of Ef [h(X)]

       [Figure: density of the Bayes factor approximations, x-axis labelled B10]




                                                                                                17 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration




       Example (Cauchy-normal)
       For estimating a normal mean, a robust prior is a Cauchy prior

                                       x ∼ N (θ, 1),     θ ∼ C(0, 1).

       Under squared error loss, the posterior mean is

              \delta^{\pi}(x) = \frac{\int_{-\infty}^{\infty} \frac{\theta}{1+\theta^2}\, e^{-(x-\theta)^2/2}\,\mathrm{d}\theta}{\int_{-\infty}^{\infty} \frac{1}{1+\theta^2}\, e^{-(x-\theta)^2/2}\,\mathrm{d}\theta}




                                                                            18 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration




       Example (Cauchy-normal (2))
       Form of δ π suggests simulating iid variables θ1 , · · · , θm ∼ N (x, 1)
       and calculating

              \hat{\delta}^{\pi}_m(x) = \frac{\sum_{i=1}^{m} \theta_i\big/(1+\theta_i^2)}{\sum_{i=1}^{m} 1\big/(1+\theta_i^2)}\,.

       The LLN implies

              \hat{\delta}^{\pi}_m(x) \longrightarrow \delta^{\pi}(x) \quad \text{as } m \longrightarrow \infty.

       [Figure: convergence of the estimator over 1,000 iterations]
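
       In R, the estimator takes one line; the observed value x below is an
       arbitrary illustrative choice:

       set.seed(3)
       x <- 10                       # illustrative observation
       m <- 10^3
       theta <- rnorm(m, mean = x)   # theta_1, ..., theta_m ~ N(x, 1)
       sum(theta/(1 + theta^2))/sum(1/(1 + theta^2))   # estimate of delta^pi(x)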




                                                                                                        19 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



Importance sampling


       Simulation from f (the true density) is not necessarily optimal
       Alternative to direct sampling from f is importance sampling,
       based on the alternative representation

              \mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\,\frac{f(x)}{g(x)}\,g(x)\,\mathrm{d}x = \mathbb{E}_g\!\left[h(X)\,\frac{f(X)}{g(X)}\right]

       which allows us to use other distributions than f




                                                                                   20 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



Importance sampling (cont’d)

       Importance sampling algorithm
       Evaluation of
                              \mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\,f(x)\,\mathrm{d}x
       by
          1   Generate a sample x1 , . . . , xm from a distribution g
          2   Use the approximation
                              \frac{1}{m}\sum_{j=1}^{m} \frac{f(x_j)}{g(x_j)}\, h(x_j)
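
       A small R sketch of the algorithm, with illustrative choices: target f the
       N (0, 1) density, instrumental g a Student t3 density (heavier tails), and
       h(x) = x^2, so that Ef [h(X)] = 1:

       set.seed(4)
       m <- 10^4
       x <- rt(m, df = 3)             # x_1, ..., x_m simulated from g
       w <- dnorm(x)/dt(x, df = 3)    # importance weights f(x_j)/g(x_j)
       mean(w * x^2)                  # importance sampling estimate of Ef[X^2] = 1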




                                                                           21 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



Justification

       Convergence of the estimator
              \frac{1}{m}\sum_{j=1}^{m} \frac{f(x_j)}{g(x_j)}\, h(x_j) \longrightarrow \mathbb{E}_f[h(X)]


          1   converges for any choice of the distribution g as long as
              supp(g) ⊃ supp(f )
          2   Instrumental distribution g chosen from distributions easy to
              simulate
          3   Same sample (generated from g) can be used repeatedly, not
              only for different functions h, but also for different densities f


                                                                                 22 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



Choice of importance function
       g can be any density but some choices better than others
          1   Finite variance only when

                  \mathbb{E}_f\!\left[h^2(X)\,\frac{f(X)}{g(X)}\right] = \int_{\mathcal{X}} h^2(x)\,\frac{f^2(x)}{g(x)}\,\mathrm{d}x < \infty\,.
          2   Instrumental distributions with tails lighter than those of f
              (that is, with sup f /g = ∞) not appropriate, because weights
              f (xj )/g(xj ) vary widely, giving too much importance to a few
              values xj .
          3   If sup f /g = M < ∞, the accept-reject algorithm can be used
              as well to simulate f directly.
          4   IS suffers from curse of dimensionality

                                                                                 23 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration




       Example (Cauchy target)
       Case of the Cauchy distribution C(0, 1) when the importance function is
       Gaussian N (0, 1).
       Density ratio

              \varrho(x) = \frac{p^{\star}(x)}{p_0(x)} = \sqrt{2\pi}\,\frac{\exp\{x^2/2\}}{\pi\,(1+x^2)}

       very badly behaved: e.g.,

              \int_{-\infty}^{\infty} \varrho(x)^2\, p_0(x)\,\mathrm{d}x = \infty

       Poor performance of the associated importance sampling estimator

       [Figure: erratic sample path of the estimator over 10,000 iterations]
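
       The pathology is easy to reproduce in R (the event estimated below is an
       illustrative choice):

       set.seed(5)
       m <- 10^4
       x <- rnorm(m)                        # simulations from the N(0,1) importance function
       w <- dcauchy(x)/dnorm(x)             # importance ratio, with infinite variance
       summary(w)                           # a few huge weights dominate the sample
       est <- cumsum(w*(x > 2))/cumsum(w)   # running self-normalised estimate of P(X > 2)
       plot(est, type = "l", xlab = "iterations")   # typically shows sudden jumps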


                                                                                                      24 / 143
Computational Bayesian Statistics
  Computational basics
     Monte Carlo integration



Practical alternative


              \sum_{j=1}^{m} h(x_j)\, f(x_j)\big/g(x_j) \bigg/ \sum_{j=1}^{m} f(x_j)\big/g(x_j)

       where f and g are known up to constants.
          1   Also converges to I by the Strong Law of Large Numbers.
          2   Biased, but the bias is quite small: may beat the unbiased
              estimator in squared error loss.




                                                                                   25 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Computational alternatives for evidence



       Bayesian model choice and hypothesis testing rely on a similar
       quantity, the evidence

              Z_k = \int_{\Theta_k} \pi_k(\theta_k)\,L_k(\theta_k)\,\mathrm{d}\theta_k\,, \qquad k = 1, 2

       aka the marginal likelihood.
                                                                           [Jeffreys, 1939]




                                                                                             26 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Importance sampling

       When approximating the Bayes factor

              B_{01} = \frac{\int_{\Theta_0} f_0(x|\theta_0)\,\pi_0(\theta_0)\,\mathrm{d}\theta_0}{\int_{\Theta_1} f_1(x|\theta_1)\,\pi_1(\theta_1)\,\mathrm{d}\theta_1}

       simultaneous use of importance functions ϖ0 and ϖ1 and

              B_{01} = \frac{n_0^{-1}\sum_{i=1}^{n_0} f_0(x|\theta_0^{i})\,\pi_0(\theta_0^{i})\big/\varpi_0(\theta_0^{i})}{n_1^{-1}\sum_{i=1}^{n_1} f_1(x|\theta_1^{i})\,\pi_1(\theta_1^{i})\big/\varpi_1(\theta_1^{i})}




                                                                                    27 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Diabetes in Pima Indian women

       Example (R benchmark)
       “A population of women who were at least 21 years old, of Pima Indian
       heritage and living near Phoenix (AZ), was tested for diabetes according
       to WHO criteria. The data were collected by the US National Institute of
       Diabetes and Digestive and Kidney Diseases.”
       200 Pima Indian women with observed variables
              plasma glucose concentration in oral glucose tolerance test
              diastolic blood pressure
              diabetes pedigree function
              presence/absence of diabetes


                                                                                  28 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Probit modelling on Pima Indian women


       Probability of diabetes function of above variables

                              P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) ,
       Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a
       g-prior modelling:

                              \beta \sim \mathcal{N}_3\big(0,\, n\,(X^TX)^{-1}\big)




                                                                         29 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Importance sampling for the Pima Indian dataset


       Use of the importance function inspired from the MLE estimate
       distribution
                              \beta \sim \mathcal{N}\big(\hat{\beta},\, \hat{\Sigma}\big)

       R Importance sampling code
       # rmvnorm is from the mvtnorm package; probitlpost (probit log-posterior) and
       # dmvlnorm (log multivariate normal density) are helper functions defined elsewhere
       model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
       model2=summary(glm(y~-1+X2,family=binomial(link="probit")))
       # importance samples centred at the MLEs, with inflated covariance matrices
       is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
       is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
       # ratio of the two importance sampling approximations of the marginal likelihoods
       bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
            sigma=2*model1$cov.unscaled))) / mean(exp(probitlpost(is2,y,X2)-
            dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))




                                                                                 30 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Diabetes in Pima Indian women
       Comparison of the variation of the Bayes factor approximations
       based on 100 replicas of 20,000 simulations from the prior and
       from the above MLE importance sampler

       [Boxplots of the approximations: Basic Monte Carlo vs Importance sampling]



                                                                          31 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Bridge sampling

       General identity:

              B_{12} = \frac{\int \tilde{\pi}_2(\theta|x)\,\alpha(\theta)\,\pi_1(\theta|x)\,\mathrm{d}\theta}{\int \tilde{\pi}_1(\theta|x)\,\alpha(\theta)\,\pi_2(\theta|x)\,\mathrm{d}\theta} \qquad \forall\,\alpha(\cdot)

                     \approx \frac{\frac{1}{n_1}\sum_{i=1}^{n_1} \tilde{\pi}_2(\theta_{1i}|x)\,\alpha(\theta_{1i})}{\frac{1}{n_2}\sum_{i=1}^{n_2} \tilde{\pi}_1(\theta_{2i}|x)\,\alpha(\theta_{2i})} \qquad \theta_{ji} \sim \pi_j(\theta|x)




                                                                                        32 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Optimal bridge sampling

       The optimal choice of auxiliary function α is

              \alpha^{\star} = \frac{n_1 + n_2}{n_1\,\pi_1(\theta|x) + n_2\,\pi_2(\theta|x)}

       leading to

              B_{12} \approx \frac{\frac{1}{n_1}\sum_{i=1}^{n_1} \dfrac{\tilde{\pi}_2(\theta_{1i}|x)}{n_1\,\pi_1(\theta_{1i}|x) + n_2\,\pi_2(\theta_{1i}|x)}}{\frac{1}{n_2}\sum_{i=1}^{n_2} \dfrac{\tilde{\pi}_1(\theta_{2i}|x)}{n_1\,\pi_1(\theta_{2i}|x) + n_2\,\pi_2(\theta_{2i}|x)}}

                                                                                         Back later!




                                                                                                       33 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Extension to varying dimensions

       When dim(Θ1 ) ≠ dim(Θ2 ), e.g. θ2 = (θ1 , ψ), introduction of a
       pseudo-posterior density, ω(ψ|θ1 , x), augmenting π1 (θ1 |x) into the
       joint distribution
                              π1 (θ1 |x) × ω(ψ|θ1 , x)
       on Θ2 so that

              B_{12} = \frac{\int \tilde{\pi}_1(\theta_1|x)\,\alpha(\theta_1,\psi)\,\pi_2(\theta_1,\psi|x)\,\mathrm{d}\theta_1\,\omega(\psi|\theta_1,x)\,\mathrm{d}\psi}{\int \tilde{\pi}_2(\theta_1,\psi|x)\,\alpha(\theta_1,\psi)\,\pi_1(\theta_1|x)\,\mathrm{d}\theta_1\,\omega(\psi|\theta_1,x)\,\mathrm{d}\psi}

                     = \mathbb{E}_{\pi_2}\!\left[\frac{\tilde{\pi}_1(\theta_1)\,\omega(\psi|\theta_1)}{\tilde{\pi}_2(\theta_1,\psi)}\right] = \frac{\mathbb{E}_{\varphi}\big[\tilde{\pi}_1(\theta_1)\,\omega(\psi|\theta_1)\big/\varphi(\theta_1,\psi)\big]}{\mathbb{E}_{\varphi}\big[\tilde{\pi}_2(\theta_1,\psi)\big/\varphi(\theta_1,\psi)\big]}

        for any conditional density ω(ψ|θ1 ) and any joint density ϕ.


                                                                                                           34 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Illustration for the Pima Indian dataset

       Use of the MLE induced conditional of β3 given (β1 , β2 ) as a
       pseudo-posterior and mixture of both MLE approximations on β3
       in bridge sampling estimate
       R bridge sampling code
       # hmprobit (posterior simulation), probitlpost (log-posterior) and meanw
       # (MLE-induced conditional mean of the extra component) are helper functions
       # defined elsewhere; ginv is MASS::ginv and dmvnorm is from mvtnorm
       cova=model2$cov.unscaled
       expecta=model2$coeff[,1]
       covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]

       probit1=hmprobit(Niter,y,X1)                    # posterior samples for the X1 model
       probit2=hmprobit(Niter,y,X2)                    # posterior samples for the X2 model
       pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))   # pseudo-posterior draws of beta3
       probit1p=cbind(probit1,pseudo)                  # augmented X1-model samples

       bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),
            sqrt(covw),log=T))/ (dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],
            cova[3,3])))/ mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+
            dnorm(pseudo,expecta[3],cova[3,3])))




                                                                                                  35 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Diabetes in Pima Indian women (cont’d)
       Comparison of the variation of the Bayes factor approximations
       based on 100 × 20,000 simulations from the prior (MC), the above
       bridge sampler and the above importance sampler




                                                                           36 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



The original harmonic mean estimator


       When θkt ∼ πk (θ|x),

              \frac{1}{T}\sum_{t=1}^{T} \frac{1}{L(\theta_{kt}|x)}

       is an unbiased estimator of 1/mk (x)
                                                                        [Newton & Raftery, 1994]

       Highly dangerous: Most often leads to an infinite variance!!!
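
       The behaviour is easy to reproduce in a toy conjugate model where m(x) is
       known in closed form (an illustrative choice of model: x ∼ N (θ, 1) and
       θ ∼ N (0, 4), whose prior is more diffuse than the likelihood, which makes
       the estimator's variance infinite here):

       set.seed(6)
       x <- 2; tau2 <- 4; T <- 10^5
       theta <- rnorm(T, mean = tau2/(tau2 + 1)*x, sd = sqrt(tau2/(tau2 + 1)))  # theta | x
       1/mean(1/dnorm(x, mean = theta))     # harmonic mean estimate of m(x)
       dnorm(x, sd = sqrt(tau2 + 1))        # exact marginal, for comparison
       # rerunning with other seeds exhibits the erratic behaviour of the estimator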




                                                                                                   37 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Approximating Zk from a posterior sample


       Use of the [harmonic mean] identity

              \mathbb{E}^{\pi_k}\!\left[\left.\frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\right|\,x\right] = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\frac{\pi_k(\theta_k)\,L_k(\theta_k)}{Z_k}\,\mathrm{d}\theta_k = \frac{1}{Z_k}

       which holds no matter what the proposal ϕ(·) is.
                           [Gelfand & Dey, 1994; Bartolucci et al., 2006]
       Direct exploitation of the Monte Carlo output




                                                                                           38 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Comparison with regular importance sampling


       Harmonic mean: Constraint opposed to usual importance sampling
       constraints: ϕ(θ) must have lighter (rather than fatter) tails
       than πk (θk )Lk (θk ) for the approximation
              \widehat{Z}_k = 1\bigg/\frac{1}{T}\sum_{t=1}^{T} \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}

       to have a finite variance.
       E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
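
       A sketch of the resulting estimator in the same toy normal model as in the
       previous sketch, with ϕ a normal density fitted to the posterior draws (so
       that its tails are no heavier than those of πk (θk )Lk (θk )):

       set.seed(7)
       x <- 2; tau2 <- 4; T <- 10^5
       theta <- rnorm(T, mean = tau2/(tau2 + 1)*x, sd = sqrt(tau2/(tau2 + 1)))
       phi <- dnorm(theta, mean = mean(theta), sd = sd(theta))    # phi evaluated at the draws
       1/mean(phi/(dnorm(theta, sd = sqrt(tau2))*dnorm(x, mean = theta)))  # estimate of Zk
       dnorm(x, sd = sqrt(tau2 + 1))                              # exact marginal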




                                                                                  39 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



HPD indicator as ϕ
       Use the convex hull of Monte Carlo simulations corresponding to
       the 10% HPD region (easily derived!) and ϕ as indicator:
              \varphi(\theta) = \frac{10}{T}\sum_{t \in \mathrm{HPD}} \mathbb{I}_{d(\theta,\theta^{(t)}) \le \epsilon}




                                                                         40 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Diabetes in Pima Indian women (cont’d)
       Comparison of the variation of the Bayes factor approximations
       based on 100 replicas of 20,000 simulations from the above
       harmonic mean sampler and importance samplers

       [Boxplots of the approximations: Harmonic mean vs Importance sampling]



                                                                                                     41 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Chib’s representation


       Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and
       θk ∼ πk (θk ),
              Z_k = m_k(x) = \frac{f_k(x|\theta_k)\,\pi_k(\theta_k)}{\pi_k(\theta_k|x)}

       Use of an approximation to the posterior

              Z_k = m_k(x) = \frac{f_k(x|\theta_k^{*})\,\pi_k(\theta_k^{*})}{\hat{\pi}_k(\theta_k^{*}|x)}\,.




                                                                            42 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Case of latent variables



       For missing variable z as in mixture models, natural Rao-Blackwell
       estimate
              \hat{\pi}_k(\theta_k^{*}|x) = \frac{1}{T}\sum_{t=1}^{T} \pi_k(\theta_k^{*}|x, z_k^{(t)})\,,

       where the z_k^{(t)}'s are latent variables sampled from π(z|x) (often by Gibbs)




                                                                                         43 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Case of the probit model

       For the completion by z, hidden normal variable,
              \hat{\pi}(\theta|x) = \frac{1}{T}\sum_{t} \pi(\theta|x, z^{(t)})

       is a simple average of normal densities
       R Chib's approximation code
       # gibbsprobit (Gibbs sampler for the probit completion) and probitlpost
       # (log-posterior) are helper functions defined elsewhere
       gibbs1=gibbsprobit(Niter,y,X1)
       gibbs2=gibbsprobit(Niter,y,X2)
       # posterior densities at the MLEs approximated by Rao-Blackwell averages of normals
       bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),
               sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
             mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),
               sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))




                                                                                 44 / 143
Computational Bayesian Statistics
  Computational basics
     Bayes factor approximations in perspective



Diabetes in Pima Indian women (cont’d)
       Comparison of the variation of the Bayes factor approximations
       based on 100 replicas of 20,000 simulations from the above
       Chib's and importance samplers




                                                                             45 / 143
Computational Bayesian Statistics
  Regression and variable selection




Regression and variable selection



       2   Regression and variable selection
             Regression
             Zellner’s informative G-prior
             Zellner’s noninformative G-prior
             Markov Chain Monte Carlo Methods
             Variable selection




                                                46 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Regression



Regressors and response

       Variable of primary interest, y, called the response or the outcome
       variable [assumed here to be continuous]

                            E.g., number of Pine processionary caterpillar colonies

       Covariates x = (x1 , . . . , xk ) called explanatory variables [may be
       discrete, continuous or both]

       Distribution of y given x typically studied in the context of a set of
       units or experimental subjects, i = 1, . . . , n, for instance patients in
       a hospital ward, on which both yi and xi1 , . . . , xik are measured.



                                                                                      47 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Regression



Regressors and response cont’d

       Dataset made of the conjunction of the vector of outcomes

                                      y = (y1 , . . . , yn )

       and of the n × (k + 1) matrix of explanatory variables

              X = \begin{pmatrix}
                    1 & x_{11} & x_{12} & \cdots & x_{1k} \\
                    1 & x_{21} & x_{22} & \cdots & x_{2k} \\
                    1 & x_{31} & x_{32} & \cdots & x_{3k} \\
                    \vdots & \vdots & \vdots & \ddots & \vdots \\
                    1 & x_{n1} & x_{n2} & \cdots & x_{nk}
                  \end{pmatrix}



                                                                        48 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Regression



Linear models


       Ordinary normal linear regression model such that

                                      y|β, σ 2 , X ∼ Nn (Xβ, σ 2 In )

       and thus

                              E[yi |β, X] = β0 + β1 xi1 + . . . + βk xik
                           V(yi |σ 2 , X) = σ 2




                                                                           49 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Regression



Pine processionary caterpillars
                                      Pine processionary caterpillar
                                      colony size influenced by
                                          x1 altitude
                                          x2 slope (in degrees)
                                          x3 number of pines in the area
                                          x4 height of the central tree
                                          x5 diameter of the central tree
                                          x6 index of the settlement density
                                          x7 orientation of the area (from 1
                                          [southbound] to 2)
                                          x8 height of the dominant tree
                                          x9 number of vegetation strata
                                          x10 mix settlement index (from 1 if not
                                          mixed to 2 if mixed)




                                           [Figure panels: x1 , x2 , x3 ]




                                                                                    50 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Regression



Ibex horn growth
                                       Growth of the Ibex horn size
                                       S against age A

              \log(S) = \alpha + \beta\,\log\!\left(\frac{A}{1+A}\right) + \epsilon

       [Figure: log(Size) plotted against log(A/(1 + A))]




                                                                                                         51 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Regression



Likelihood function & estimator
       The likelihood of the ordinary normal linear model is

              \ell\big(\beta, \sigma^2|y, X\big) = \big(2\pi\sigma^2\big)^{-n/2} \exp\!\left\{-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)\right\}

       The MLE of β is solution of the least squares minimisation problem

              \min_{\beta}\, (y - X\beta)^T(y - X\beta) = \min_{\beta}\, \sum_{i=1}^{n}\big(y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_k x_{ik}\big)^2\,,

       namely

              \hat{\beta} = (X^TX)^{-1}X^Ty
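
       A quick R check of this closed form against lm(), on simulated data (design
       and coefficients below are arbitrary):

       set.seed(8)
       n <- 30; X <- cbind(1, matrix(rnorm(2*n), n, 2))   # design with an intercept column
       y <- drop(X %*% c(1, -0.5, 2) + rnorm(n))
       betahat <- solve(t(X) %*% X, t(X) %*% y)           # (X^T X)^{-1} X^T y
       cbind(betahat, coef(lm(y ~ X[, -1])))              # both columns coincide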



                                                                                              52 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Regression



lm() for pine processionary caterpillars
       Residuals:                       Min        1Q     Median      3Q           Max
                                      -1.6989   -0.2731   -0.0003   0.3246        1.7305
       Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
       intercept   10.998412   3.060272   3.594 0.00161                      **
       XV1         -0.004431   0.001557 -2.846 0.00939                       **
       XV2         -0.053830   0.021900 -2.458 0.02232                       *
       XV3          0.067939   0.099472   0.683 0.50174
       XV4         -1.293636   0.563811 -2.294 0.03168                       *
       XV5          0.231637   0.104378   2.219 0.03709                      *
       XV6         -0.356800   1.566464 -0.228 0.82193
       XV7         -0.237469   1.006006 -0.236 0.81558
       XV8          0.181060   0.236724   0.765 0.45248
       XV9         -1.285316   0.864847 -1.486 0.15142
       XV10        -0.433106   0.734869 -0.589 0.56162
       ---
       Signif. codes:
       0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
                                                                                           53 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Zellner’s informative G-prior

       Constraint
       Allow the experimenter to introduce information about the
       location parameter of the regression while bypassing the most
       difficult aspects of the prior specification, namely the derivation of
       the prior correlation structure.


       Zellner’s prior corresponds to

              \beta|\sigma^2, X \sim \mathcal{N}_{k+1}\big(\tilde{\beta},\, c\sigma^2 (X^TX)^{-1}\big)
              \sigma^2 \sim \pi(\sigma^2|X) \propto \sigma^{-2}\,.

                                                                 [Special conjugate]

                                                                                       54 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Prior selection


       Experimental prior determination restricted to the choices of β̃ and
       of the constant c.

       Note
       c can be interpreted as a measure of the amount of information
       available in the prior relative to the sample. For instance, setting
       1/c = 0.5 gives the prior the same weight as 50% of the sample.

              There still is a lasting influence of the factor c




                                                                              55 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Posterior structure

       With this prior model, the posterior simplifies into

           \pi(\beta, \sigma^2|y, X) \propto f(y|\beta, \sigma^2, X)\,\pi(\beta, \sigma^2|X)
                                     \propto \big(\sigma^2\big)^{-(n/2+1)} \exp\!\left\{-\frac{1}{2\sigma^2}(y - X\hat{\beta})^T(y - X\hat{\beta}) - \frac{1}{2\sigma^2}(\beta - \hat{\beta})^T X^TX(\beta - \hat{\beta})\right\}
                                     \times \big(\sigma^2\big)^{-(k+1)/2} \exp\!\left\{-\frac{1}{2c\sigma^2}(\beta - \tilde{\beta})^T X^TX(\beta - \tilde{\beta})\right\},

       because X T X used in both prior and likelihood
                                                                           [G-prior trick]


                                                                                             56 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Posterior structure (cont’d)
       Therefore,

                                                 c                 2
         β|σ 2 , y, X ∼ Nk+1                           ˜     ˆ σ c (X T X)−1
                                                     (β/c + β),
                                               c+1               c+1
                                            n s 2         1
              σ 2 |y, X ∼ IG                 ,     +            ˜ ˆ           ˜ ˆ
                                                               (β − β)T X T X(β − β)
                                            2 2       2(c + 1)

       and
                                                 c    ˜
                                                      β
            β|y, X        ∼ Tk+1           n,            ˆ
                                                        +β ,
                                                c+1   c
                                             ˜ ˆ           ˜ ˆ
                                     c(s2 + (β − β)T X T X(β − β)/(c + 1)) T −1
                                                                          (X X)    .
                                                   n(c + 1)
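
       A short R sketch computing the posterior mean and the scale matrix of the
       Student marginal under this G-prior (simulated data; β̃ and c are user choices):

       set.seed(9)
       n <- 30; X <- cbind(1, matrix(rnorm(2*n), n, 2))
       y <- drop(X %*% c(1, -0.5, 2) + rnorm(n))
       cc <- 100; btilde <- rep(0, ncol(X))                  # prior inputs
       XtX <- crossprod(X)
       betahat <- solve(XtX, crossprod(X, y))
       s2 <- sum((y - X %*% betahat)^2)
       postmean <- cc/(cc + 1) * (btilde/cc + betahat)
       Q <- drop(t(btilde - betahat) %*% XtX %*% (btilde - betahat))
       postscale <- cc*(s2 + Q/(cc + 1))/(n*(cc + 1)) * solve(XtX)  # scale of the posterior t
       postmean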


                                                                                       57 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Pine processionary caterpillars

                                     βi    Eπ (βi |y, X)    Vπ (βi |y, X)
                                     β0         10.8895           6.4094
                                     β1          -0.0044            2e-06
                                     β2          -0.0533          0.0003
                                     β3            0.0673         0.0068
                                     β4          -1.2808          0.2175
                                     β5            0.2293         0.0075
                                     β6          -0.3532          1.6793
                                     β7          -0.2351          0.6926
                                     β8            0.1793         0.0383
                                     β9          -1.2726          0.5119
                                     β10         -0.4288          0.3696
                                                    c = 100

                                                                            58 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Pine processionary caterpillars (2)
                                     βi    Eπ (βi |y, X)     Vπ (βi |y, X)
                                     β0         10.9874            6.2604
                                     β1          -0.0044             2e-06
                                     β2          -0.0538           0.0003
                                     β3            0.0679          0.0066
                                     β4          -1.2923           0.2125
                                     β5            0.2314          0.0073
                                     β6          -0.3564           1.6403
                                     β7          -0.2372           0.6765
                                     β8            0.1809          0.0375
                                     β9          -1.2840           0.5100
                                     β10         -0.4327           0.3670
                                                  c = 1, 000

                                                                             59 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Credible regions

       Highest posterior density (HPD) regions on subvectors of the
       parameter β derived from the marginal posterior distribution of β.
       For a single parameter,

              \beta_i|y, X \sim \mathcal{T}_{1}\!\left(n,\; \frac{c}{c+1}\left(\frac{\tilde{\beta}_i}{c} + \hat{\beta}_i\right),\; \frac{c\big(s^2 + (\tilde{\beta}-\hat{\beta})^T X^TX(\tilde{\beta}-\hat{\beta})/(c+1)\big)}{n(c+1)}\,\omega_{(i,i)}\right),

       where ω(i,i) is the (i, i)-th element of the matrix (X T X)−1 .




                                                                                         60 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Pine processionary caterpillars
                                      βi       HPD interval
                                      β0     [5.7435, 16.2533]
                                      β1    [−0.0071, −0.0018]
                                      β2    [−0.0914, −0.0162]
                                      β3     [−0.1029, 0.2387]
                                      β4    [−2.2618, −0.3255]
                                      β5      [0.0524, 0.4109]
                                      β6     [−3.0466, 2.3330]
                                      β7     [−1.9649, 1.4900]
                                      β8     [−0.2254, 0.5875]
                                      β9     [−2.7704, 0.1997]
                                      β10    [−1.6950, 0.8288]

                                              c = 100
                                                                 61 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



T marginal
       Marginal distribution of y is multivariate t distribution

       Proof. Since β|σ 2 , X ∼ Nk+1 (β̃, cσ 2 (X T X)−1 ),

              X\beta|\sigma^2, X \sim \mathcal{N}\big(X\tilde{\beta},\, c\sigma^2 X(X^TX)^{-1}X^T\big)\,,

       which implies that

              y|\sigma^2, X \sim \mathcal{N}_n\big(X\tilde{\beta},\, \sigma^2(I_n + cX(X^TX)^{-1}X^T)\big).

       Integrating in σ 2 yields

              f(y|X) = (c+1)^{-(k+1)/2}\,\pi^{-n/2}\,\Gamma(n/2)\left[y^Ty - \frac{c}{c+1}\,y^TX(X^TX)^{-1}X^Ty - \frac{1}{c+1}\,\tilde{\beta}^TX^TX\tilde{\beta}\right]^{-n/2}.


                                                                                 62 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Point null hypothesis




       If a null hypothesis is H0 : Rβ = r, the model under H0 can be
       rewritten as
                                         H0
                        y|β 0 , σ 2 , X0 ∼ Nn X0 β 0 , σ 2 In
       where β 0 is (k + 1 − q) dimensional.




                                                                        63 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Point null marginal

       Under the prior

              \beta^0|X_0, \sigma^2 \sim \mathcal{N}_{k+1-q}\big(\tilde{\beta}_0,\, c_0\sigma^2 (X_0^TX_0)^{-1}\big)\,,

       the marginal distribution of y under H0 is

              f(y|X_0, H_0) = (c_0+1)^{-(k+1-q)/2}\,\pi^{-n/2}\,\Gamma(n/2)\left[y^Ty - \frac{c_0}{c_0+1}\,y^TX_0(X_0^TX_0)^{-1}X_0^Ty - \frac{1}{c_0+1}\,\tilde{\beta}_0^TX_0^TX_0\tilde{\beta}_0\right]^{-n/2}.




                                                                                    64 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s informative G-prior



Bayes factor

       Therefore the Bayes factor is in closed form:

              B_{10}^{\pi} = \frac{f(y|X, H_1)}{f(y|X_0, H_0)} = \frac{(c_0+1)^{(k+1-q)/2}}{(c+1)^{(k+1)/2}}\left[\frac{y^Ty - \frac{c_0}{c_0+1}\,y^TX_0(X_0^TX_0)^{-1}X_0^Ty - \frac{1}{c_0+1}\,\tilde{\beta}_0^TX_0^TX_0\tilde{\beta}_0}{y^Ty - \frac{c}{c+1}\,y^TX(X^TX)^{-1}X^Ty - \frac{1}{c+1}\,\tilde{\beta}^TX^TX\tilde{\beta}}\right]^{n/2}


              Means using the same σ 2 on both models
              Problem: Still depends on the choice of (c0 , c)
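
       A sketch of this closed-form Bayes factor in R, for a test of β3 = 0 on
       simulated data with β̃ = β̃0 = 0 (so the last term of each bracket vanishes);
       c and c0 are user choices:

       set.seed(10)
       n <- 30; X <- cbind(1, matrix(rnorm(2*n), n, 2)); X0 <- X[, 1:2]
       y <- drop(X %*% c(1, -0.5, 0) + rnorm(n))             # data simulated under H0
       cc <- c0 <- 100; k <- ncol(X) - 1; q <- 1
       quad <- function(Z) drop(t(y) %*% Z %*% solve(crossprod(Z), crossprod(Z, y)))
       num <- sum(y^2) - c0/(c0 + 1)*quad(X0)                # bracket under H0
       den <- sum(y^2) - cc/(cc + 1)*quad(X)                 # bracket under H1
       (c0 + 1)^((k + 1 - q)/2)/(cc + 1)^((k + 1)/2) * (num/den)^(n/2)   # B10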




                                                                                                   65 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s noninformative G-prior



Zellner’s noninformative G-prior


       Difference with informative G-prior setup is that we now consider c
       as unknown (relief!)


       Solution
       Use the same G-prior distribution with β̃ = 0k+1 , conditional on c,
       and introduce a diffuse prior on c,

              \pi(c) = c^{-1}\, \mathbb{I}_{\mathbb{N}^{*}}(c)\,.




                                                                             66 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s noninformative G-prior



Posterior distribution
       Corresponding marginal posterior on the parameters of interest

              \pi(\beta, \sigma^2|y, X) = \int \pi(\beta, \sigma^2|y, X, c)\,\pi(c|y, X)\,\mathrm{d}c
                                        \propto \sum_{c=1}^{\infty} \pi(\beta, \sigma^2|y, X, c)\,f(y|X, c)\,\pi(c)
                                        \propto \sum_{c=1}^{\infty} \pi(\beta, \sigma^2|y, X, c)\,f(y|X, c)\,c^{-1}

       and

              f(y|X, c) \propto (c+1)^{-(k+1)/2}\left[y^Ty - \frac{c}{c+1}\,y^TX(X^TX)^{-1}X^Ty\right]^{-n/2}.


                                                                                                     67 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s noninformative G-prior



Posterior means
       The Bayes estimates of β and σ 2 are given by

          E^π[β|y, X]  =  E^π[ E^π(β|y, X, c) | y, X ]  =  E^π[ c/(c+1) β̂ | y, X ]

                       =  [ Σ_{c=1}^∞ c/(c+1) f(y|X, c) c^{-1}  /  Σ_{c=1}^∞ f(y|X, c) c^{-1} ]  β̂

       and

          E^π[σ²|y, X]  =  [ Σ_{c=1}^∞ { s² + β̂^T X^T X β̂/(c+1) }/(n−2) · f(y|X, c) c^{-1} ]
                              /  Σ_{c=1}^∞ f(y|X, c) c^{-1} .


                                                                                            68 / 143
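These estimates only involve one-dimensional sums over c, so they can be approximated by truncating the series. The R sketch below assumes y and the design matrix X (intercept included) are given; the truncation level cmax is an arbitrary illustrative choice, not part of the slides.

# Illustrative sketch: Bayes estimates of beta and sigma^2 under the
# noninformative G-prior, truncating the sums over c at cmax
gprior_estimates <- function(y, X, cmax = 1e5) {
  n <- length(y); k1 <- ncol(X)
  betahat <- solve(crossprod(X), crossprod(X, y))       # least-squares estimate
  fit     <- drop(crossprod(crossprod(X, y), betahat))  # y' X (X'X)^{-1} X' y
  s2      <- sum(y^2) - fit                             # residual sum of squares
  cs      <- 1:cmax
  logf    <- -(k1/2) * log(cs + 1) -                    # log f(y|X,c) up to a constant
             (n/2) * log(sum(y^2) - cs/(cs + 1) * fit)
  w <- exp(logf - max(logf)) / cs                       # weights f(y|X,c) c^{-1}
  w <- w / sum(w)
  list(beta   = drop(sum(w * cs/(cs + 1)) * betahat),
       sigma2 = sum(w * (s2 + fit/(cs + 1))) / (n - 2))
}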
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s noninformative G-prior



Marginal distribution


       Important point: the marginal distribution of the dataset is
       available in closed form
          f(y|X) ∝ Σ_{c=1}^∞ c^{-1} (c+1)^{-(k+1)/2} [ y^T y − c/(c+1) y^T X (X^T X)^{-1} X^T y ]^{-n/2}


       The Student-t shape of each term means the normalising constant can be computed too.




                                                                                             69 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s noninformative G-prior



Point null hypothesis




       For null hypothesis H0 : Rβ = r, the model under H0 can be
       rewritten as
                                         H0
                        y|β 0 , σ 2 , X0 ∼ Nn X0 β 0 , σ 2 In
       where β 0 is (k + 1 − q) dimensional.




                                                                    70 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s noninformative G-prior



Point null marginal


       Under the prior

                 β0 | X0, σ², c ∼ N_{k+1−q}( 0_{k+1−q}, cσ² (X0^T X0)^{-1} )

       and π(c) = 1/c, the marginal distribution of y under H0 is

       f(y|X0, H0) ∝ Σ_{c=1}^∞ c^{-1} (c+1)^{-(k+1−q)/2} [ y^T y − c/(c+1) y^T X0 (X0^T X0)^{-1} X0^T y ]^{-n/2} .

       Bayes factor B10^π = f(y|X)/f(y|X0, H0) can be computed




                                                                                             71 / 143
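Both marginals are truncated sums of the same form, and their common normalising constant cancels in the ratio, so the Bayes factor can be approximated as in the R sketch below; X and X0 stand for the full and restricted design matrices and cmax is an illustrative truncation level.

# Illustrative sketch: log B10 for a point null under the noninformative G-prior
log_B10_noninf <- function(y, X, X0, cmax = 1e5) {
  n <- length(y); cs <- 1:cmax
  logmarg <- function(Z) {                   # log of the truncated sum for design Z
    fitZ <- drop(crossprod(crossprod(Z, y),
                           solve(crossprod(Z), crossprod(Z, y))))
    lt <- -log(cs) - (ncol(Z)/2) * log(cs + 1) -
          (n/2) * log(sum(y^2) - cs/(cs + 1) * fitZ)
    max(lt) + log(sum(exp(lt - max(lt))))    # log-sum-exp for numerical stability
  }
  logmarg(X) - logmarg(X0)
}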
Computational Bayesian Statistics
  Regression and variable selection
     Zellner’s noninformative G-prior



Processionary pine caterpillars
       For H0 : β8 = β9 = 0, log10(B10^π) = −0.7884
                              Estimate   Post. Var.   log10(BF)

       (Intercept)             9.2714    9.1164        1.4205   (***)
       X1                     -0.0037     2e-06        0.8502   (**)
       X2                     -0.0454    0.0004        0.5664   (**)
       X3                      0.0573    0.0086       -0.3609
       X4                     -1.0905    0.2901        0.4520   (*)
       X5                      0.1953    0.0099        0.4007   (*)
       X6                     -0.3008    2.1372       -0.4412
       X7                     -0.2002    0.8815       -0.4404
       X8                      0.1526    0.0490       -0.3383
       X9                     -1.0835    0.6643       -0.0424
       X10                    -0.3651    0.4716       -0.3838

       evidence against H0:
       (****) decisive, (***) strong, (**) substantial, (*) poor
                                                                        72 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Markov Chain Monte Carlo Methods



       Complexity of most models encountered in Bayesian modelling

       Standard simulation methods not good enough a solution

       New technique at the core of Bayesian computing, based on
       Markov chains




                                                                     73 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Markov chains


       Markov chain


     A process (θ(t) )t∈N is a homogeneous
     Markov chain if the distribution of θ(t)
     given the past (θ(0) , . . . , θ(t−1) )
        1    only depends on θ(t−1)
        2    is the same for all t ∈ N∗ .




                                                74 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Algorithms based on Markov chains

       Idea: simulate from a posterior density π(·|x) [or any density] by
       producing a Markov chain

                                        (θ(t) )t∈N

       whose stationary distribution is

                                         π(·|x)


       Translation
       For t large enough, θ(t) is approximately distributed from π(θ|x),
       no matter what the starting value θ(0) is [Ergodicity].


                                                                            75 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Convergence



       If an algorithm that generates such a chain can be constructed, the
       ergodic theorem guarantees that, in almost all settings, the average
                                        (1/T) Σ_{t=1}^T g(θ^(t))

       converges to Eπ [g(θ)|x], for (almost) any starting value




                                                                              76 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



More convergence


       If the produced Markov chains are irreducible [can reach any region
       in a finite number of steps], then they are both positive recurrent
       with stationary distribution π(·|x) and ergodic [asymptotically
       independent from the starting value θ(0) ]

              While, for t large enough, θ(t) is approximately distributed
              from π(θ|x) and can thus be used like the output from a more
              standard simulation algorithm, one must take care of the
              correlations between the θ(t) ’s




                                                                             77 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Demarginalising


       Takes advantage of hierarchical structures: if

                                    π(θ|x) = ∫ π1 (θ|x, λ) π2 (λ|x) dλ ,

       simulating from π(θ|x) comes from simulating from the joint
       distribution
                               π1 (θ|x, λ) π2 (λ|x)




                                                                          78 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Two-stage Gibbs sampler

       Usually π2 (λ|x) not available/simulable

       More often, both conditional posterior distributions,

                                      π1 (θ|x, λ)   and   π2 (λ|x, θ)

       can be simulated.

       Idea: Create a Markov chain based on those conditionals

                                                                        Example of latent variables!




                                                                                                       79 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Two-stage Gibbs sampler (cont’d)



       Initialization: Start with an
       arbitrary value λ(0)
       Iteration t: Given λ(t−1) , generate
          1    θ(t) according to π1 (θ|x, λ(t−1) )
          2    λ(t) according to π2 (λ|x, θ(t) )

                                                     J.W. Gibbs (1839-1903)



              π(θ, λ|x) is a stationary distribution for this transition


                                                                              80 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Implementation



          1   Derive efficient decomposition of the joint distribution into
              simulable conditionals (mixing behavior, acf(), blocking, &tc.)

          2   Find when to stop the algorithm (mode chasing, missing
              mass, shortcuts, &tc.)




                                                                                81 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Simple Example: iid N (µ, σ 2 ) Observations
        When y1 , . . . , yn are i.i.d. N (µ, σ²) with both µ and σ unknown, the
        joint posterior on (µ, σ²) is conjugate but lies outside the standard families

       But...
                             µ | y, σ²  ∼  N( (1/n) Σ_{i=1}^n yi , σ²/n )

                             σ² | y, µ  ∼  IG( n/2 − 1 , (1/2) Σ_{i=1}^n (yi − µ)² )
       assuming constant (improper) priors on both µ and σ 2

              Hence we may use the Gibbs sampler for simulating from the
              posterior of (µ, σ 2 )



                                                                               82 / 143
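A possible R implementation of this two-stage Gibbs sampler, simply alternating the two conditionals above; the starting values and number of iterations are arbitrary illustrative choices.

# Illustrative sketch: Gibbs sampler for (mu, sigma2) with flat priors
gibbs_normal <- function(y, niter = 10^4) {
  n <- length(y)
  mu <- mean(y); sigma2 <- var(y)              # arbitrary starting values
  out <- matrix(NA_real_, niter, 2, dimnames = list(NULL, c("mu", "sigma2")))
  for (t in 1:niter) {
    mu     <- rnorm(1, mean(y), sqrt(sigma2/n))              # mu | y, sigma2
    sigma2 <- 1/rgamma(1, shape = n/2 - 1,                   # sigma2 | y, mu
                       rate = sum((y - mu)^2)/2)
    out[t, ] <- c(mu, sigma2)
  }
  out
}
# e.g. chain <- gibbs_normal(rnorm(50, 2, 3)); colMeans(chain[-(1:1000), ])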
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Gibbs output analysis
       Example (Cauchy posterior)
                            π(µ|D) ∝ e^{−µ²/20} / [ (1 + (x1 − µ)²)(1 + (x2 − µ)²) ]

        is marginal of

                            π(µ, ω|D) ∝ e^{−µ²/20} × Π_{i=1}^2 e^{−ωi [1 + (xi − µ)²]} .

        Corresponding conditionals

          (ω1 , ω2 ) | µ  ∼  Exp(1 + (x1 − µ)²) ⊗ Exp(1 + (x2 − µ)²)

                  µ | ω   ∼  N( Σ_i ωi xi / (Σ_i ωi + 1/20) , 1/(2 Σ_i ωi + 1/10) )

                                                                                                        83 / 143
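A short R sketch of the corresponding data-augmentation Gibbs sampler; the two observations x1, x2 below are purely illustrative values.

# Illustrative sketch: Gibbs sampler alternating omega | mu and mu | omega
x <- c(-3, 5)                                   # illustrative observations
niter <- 10^4
mu <- 0; mus <- numeric(niter)
for (t in 1:niter) {
  omega <- rexp(2, rate = 1 + (x - mu)^2)       # (omega1, omega2) | mu
  mu <- rnorm(1,
              mean = sum(omega * x)/(sum(omega) + 1/20),
              sd   = 1/sqrt(2 * sum(omega) + 1/10))
  mus[t] <- mu
}
# hist(mus[-(1:1000)], breaks = 50, freq = FALSE)   # compare with pi(mu|D)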
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Gibbs output analysis (cont’d)

                         [Figure: trace of the Gibbs draws of µ over iterations 9900–10000 (top) and histogram/density estimate of the simulated µ values (bottom)]
                                                                                             84 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



Generalisation



       Consider several groups of parameters, θ, λ1 , . . . , λp , such that

                       π(θ|x) = ∫ · · · ∫ π(θ, λ1 , . . . , λp |x) dλ1 · · · dλp

       or simply divide θ in
                                              (θ1 , . . . , θp )




                                                                                       85 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Markov Chain Monte Carlo Methods



The general Gibbs sampler

       For a joint distribution π(θ) with full conditionals π1 , . . . , πp ,
           Given (θ1^(t), . . . , θp^(t)), simulate
              1. θ1^(t+1) ∼ π1 (θ1 | θ2^(t), . . . , θp^(t)),
              2. θ2^(t+1) ∼ π2 (θ2 | θ1^(t+1), θ3^(t), . . . , θp^(t)),
                  ...
              p. θp^(t+1) ∼ πp (θp | θ1^(t+1), . . . , θ_{p−1}^(t+1)).


                                   Then θ^(t) → θ ∼ π



                                                                                86 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Variable selection

       Back to regression: one dependent random variable y and a set
       {x1 , . . . , xk } of k explanatory variables.

       Question: Are all xi ’s involved in the regression?

       Assumption: every subset {i1 , . . . , iq } of q (0 ≤ q ≤ k)
       explanatory variables, {1n , xi1 , . . . , xiq }, is a proper set of
       explanatory variables for the regression of y [intercept included in
       every corresponding model]

       Computational issue
                                      2k models in competition...


                                                                              87 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Model notations

          1

                                      X = [ 1n  x1  · · ·  xk ]
               is the matrix containing 1n and all the k potential predictor
               variables
          2    Each model Mγ associated with binary indicator vector
               γ ∈ Γ = {0, 1}k where γi = 1 means that the variable xi is
               included in the model Mγ
           3    qγ = 1n^T γ, the number of variables included in the model Mγ
          4    t1 (γ) and t0 (γ) indices of variables included in the model and
               indices of variables not included in the model


                                                                                  88 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Model indicators



       For β ∈ Rk+1 and X, we define βγ as the subvector

                                      βγ = β0 , (βi )i∈t1 (γ)

       and Xγ as the submatrix of X where only the column 1n and the
       columns in t1 (γ) have been left.




                                                                       89 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Models in competition


       The model Mγ is thus defined as

                                 y | γ, βγ , σ², X ∼ Nn ( Xγ βγ , σ² In )

       where βγ ∈ R^{qγ+1} and σ² ∈ R*_+ are the unknown parameters.



       Warning
       σ 2 is common to all models and thus uses the same prior for all
       models




                                                                            90 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Informative G-prior



       Many (2k ) models in competition: we cannot expect a practitioner
       to specify a prior on every Mγ in a completely subjective and
       autonomous manner.

       Shortcut: We derive all priors from a single global prior associated
       with the so-called full model that corresponds to γ = (1, . . . , 1).




                                                                               91 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Prior definitions


         (i)    For the full model, Zellner’s G-prior:
                β | σ², X ∼ N_{k+1}( β̃, cσ² (X^T X)^{-1} )       and σ² ∼ π(σ²|X) = σ^{-2}

         (ii)   For each model Mγ , the prior distribution of βγ conditional
                on σ 2 is fixed as
                                βγ | γ, σ² ∼ N_{qγ+1}( β̃γ , cσ² (Xγ^T Xγ)^{-1} ) ,

                 where β̃γ = (Xγ^T Xγ)^{-1} Xγ^T X β̃ and the same prior is used on σ².




                                                                                    92 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Prior completion



       The joint prior for model Mγ is the improper prior

            π(βγ , σ²|γ) ∝ (σ²)^{-(qγ+1)/2 − 1} exp{ − (βγ − β̃γ)^T (Xγ^T Xγ) (βγ − β̃γ) / (2cσ²) } .




                                                                                         93 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Prior completion (2)

       Infinitely many ways of defining a prior on the model index γ:
       choice of uniform prior π(γ|X) = 2−k .

       Posterior distribution of γ central to variable selection since
       proportional to the marginal density of y on Mγ (or evidence of
       Mγ )

       π(γ|y, X)          ∝ f (y|γ, X)π(γ|X) ∝ f (y|γ, X)

                           = ∫ [ ∫ f (y|γ, β, σ², X) π(β|γ, σ², X) dβ ] π(σ²|X) dσ² .




                                                                                               94 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection




        Since

        f(y|γ, σ², X)  =  ∫ f(y|γ, β, σ²) π(β|γ, σ²) dβ

                       =  (c+1)^{-(qγ+1)/2} (2π)^{-n/2} (σ²)^{-n/2} exp{ − y^T y/(2σ²)
                             + [ c y^T Xγ (Xγ^T Xγ)^{-1} Xγ^T y − β̃γ^T Xγ^T Xγ β̃γ ] / (2σ²(c+1)) } ,

        this posterior density satisfies

        π(γ|y, X) ∝ (c+1)^{-(qγ+1)/2} [ y^T y − c/(c+1) y^T Xγ (Xγ^T Xγ)^{-1} Xγ^T y
                                          − 1/(c+1) β̃γ^T Xγ^T Xγ β̃γ ]^{-n/2} .


                                                                                                       95 / 143
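This un-normalised posterior is easy to evaluate model by model. The R sketch below implements the expression displayed above; gamma is a 0/1 vector of length k, X contains the intercept column first, betatilde is the full-model prior mean (set to 0 in the caterpillar example), and all names are illustrative.

# Illustrative sketch: un-normalised log pi(gamma | y, X), informative G-prior
log_post_gamma <- function(gamma, y, X, betatilde, c) {
  n <- length(y)
  keep <- rep(TRUE, ncol(X)); keep[-1] <- (gamma == 1)
  Xg  <- X[, keep, drop = FALSE]                 # intercept + selected columns
  qg  <- sum(gamma)
  XtX <- crossprod(Xg)
  Xty <- crossprod(Xg, y)
  btg <- solve(XtX, crossprod(Xg, X %*% betatilde))   # projected prior mean
  fit <- drop(crossprod(Xty, solve(XtX, Xty)))        # y' Xg (Xg'Xg)^{-1} Xg' y
  -(qg + 1)/2 * log(c + 1) -
    (n/2) * log(sum(y^2) - c/(c + 1) * fit -
                1/(c + 1) * drop(t(btg) %*% XtX %*% btg))
}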
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Pine processionary caterpillars
                                      t1 (γ)         π(γ|y, X)
                                      0,1,2,4,5         0.2316
                                      0,1,2,4,5,9       0.0374
                                      0,1,9             0.0344
                                      0,1,2,4,5,10      0.0328
                                      0,1,4,5           0.0306
                                      0,1,2,9           0.0250
                                      0,1,2,4,5,7       0.0241
                                      0,1,2,4,5,8       0.0238
                                      0,1,2,4,5,6       0.0237
                                      0,1,2,3,4,5       0.0232
                                      0,1,6,9           0.0146
                                      0,1,2,3,9         0.0145
                                      0,9               0.0143
                                      0,1,2,6,9         0.0135
                                      0,1,4,5,9         0.0128
                                      0,1,3,9           0.0117
                                      0,1,2,8           0.0115
                                                                 96 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Pine processionary caterpillars (cont’d)

       Interpretation
       Model Mγ with the highest posterior probability is
       t1 (γ) = (1, 2, 4, 5), which corresponds to the variables
           - altitude,
           - slope,
           - height of the tree sampled in the center of the area, and
           - diameter of the tree sampled in the center of the area.


       Corresponds to the five variables identified in the R regression
       output


                                                                         97 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Noninformative extension


       For Zellner noninformative prior with π(c) = 1/c, we have
            π(γ|y, X) ∝ Σ_{c=1}^∞ c^{-1} (c+1)^{-(qγ+1)/2} [ y^T y − c/(c+1) y^T Xγ (Xγ^T Xγ)^{-1} Xγ^T y ]^{-n/2} .




                                                                                       98 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Pine processionary caterpillars
                                      t1 (γ)           π(γ|y, X)
                                      0,1,2,4,5           0.0929
                                      0,1,2,4,5,9         0.0325
                                      0,1,2,4,5,10        0.0295
                                      0,1,2,4,5,7         0.0231
                                      0,1,2,4,5,8         0.0228
                                      0,1,2,4,5,6         0.0228
                                      0,1,2,3,4,5         0.0224
                                      0,1,2,3,4,5,9       0.0167
                                      0,1,2,4,5,6,9       0.0167
                                      0,1,2,4,5,8,9       0.0137
                                      0,1,4,5             0.0110
                                      0,1,2,4,5,9,10      0.0100
                                      0,1,2,3,9           0.0097
                                      0,1,2,9             0.0093
                                      0,1,2,4,5,7,9       0.0092
                                      0,1,2,6,9           0.0092

                                                                   99 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Stochastic search for the most likely model

       When k gets large, impossible to compute the posterior
       probabilities of the 2k models.

       Need of a tailored algorithm that samples from π(γ|y, X) and
       selects the most likely models.

        Can be done by Gibbs sampling, given the availability of the full
       conditional posterior probabilities of the γi ’s.
       If γ−i = (γ1 , . . . , γi−1 , γi+1 , . . . , γk ) (1 ≤ i ≤ k)

                                      π(γi |y, γ−i , X) ∝ π(γ|y, X)

       (to be evaluated in both γi = 0 and γi = 1)

                                                                            100 / 143
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Gibbs sampling for variable selection

                    Initialization: Draw γ^(0) from the uniform
                    distribution on Γ

                    Iteration t: Given (γ1^(t−1), . . . , γk^(t−1)), generate
                       1. γ1^(t) according to π(γ1 | y, γ2^(t−1), . . . , γk^(t−1), X)
                       2. γ2^(t) according to π(γ2 | y, γ1^(t), γ3^(t−1), . . . , γk^(t−1), X)
                          ...
                       k. γk^(t) according to π(γk | y, γ1^(t), . . . , γ_{k−1}^(t), X)


                                                                                                 101 / 143
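A possible R rendering of this sampler, reusing the log_post_gamma sketch given earlier for the informative G-prior; the number of iterations is illustrative.

# Illustrative sketch: Gibbs sampler on Gamma = {0,1}^k for variable selection
gibbs_varselect <- function(y, X, betatilde, c, niter = 10^4) {
  k <- ncol(X) - 1
  gamma <- rbinom(k, 1, 0.5)                     # gamma^(0) uniform on Gamma
  store <- matrix(NA, niter, k)
  for (t in 1:niter) {
    for (i in 1:k) {
      g0 <- g1 <- gamma; g0[i] <- 0; g1[i] <- 1
      lp0 <- log_post_gamma(g0, y, X, betatilde, c)
      lp1 <- log_post_gamma(g1, y, X, betatilde, c)
      gamma[i] <- rbinom(1, 1, 1/(1 + exp(lp0 - lp1)))   # P(gamma_i = 1 | rest)
    }
    store[t, ] <- gamma
  }
  store
}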
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



MCMC interpretation
       After T ≫ 1 MCMC iterations, output used to approximate the
       posterior probabilities π(γ|y, X) by empirical averages
                        π(γ|y, X)  ≈  1/(T − T0 + 1)  Σ_{t=T0}^T  I_{γ^(t) = γ} ,

       where the first T0 values are discarded as burn-in.

       The probability of including the i-th variable is approximated in the same way,

                        P^π(γi = 1|y, X)  ≈  1/(T − T0 + 1)  Σ_{t=T0}^T  I_{γi^(t) = 1} .



                                                                                      102 / 143
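Given a matrix store of simulated γ^(t)'s (one row per iteration, as in the sketch above), these empirical averages are one-liners in R; the burn-in length T0 is an illustrative choice.

# Illustrative sketch: empirical inclusion probabilities and most visited models
T0 <- 10^3
incl <- colMeans(store[-(1:T0), , drop = FALSE])          # P(gamma_i = 1 | y, X)
top  <- sort(table(apply(store[-(1:T0), , drop = FALSE], 1,
                         paste, collapse = "")), decreasing = TRUE)
round(incl, 4); head(top)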
Computational Bayesian Statistics
  Regression and variable selection
     Variable selection



Pine processionary caterpillars

                            γi     P^π(γi = 1|y, X), informative     P^π(γi = 1|y, X), noninformative
                           γ1                    0.8624              0.8844
                           γ2                    0.7060              0.7716
                           γ3                    0.1482              0.2978
                           γ4                    0.6671              0.7261
                           γ5                    0.6515              0.7006
                           γ6                    0.1678              0.3115
                           γ7                    0.1371              0.2880
                           γ8                    0.1555              0.2876
                           γ9                    0.4039              0.5168
                           γ10                   0.1151              0.2609
        Probabilities of inclusion with both informative (β̃ = 0_11, c = 100) and noninformative Zellner's priors


                                                                              103 / 143
Computational Bayesian Statistics
  Generalized linear models




Generalized linear models



       3   Generalized linear models
             Generalisation of linear models
             Metropolis–Hastings algorithms
             The Probit Model
             The logit model




                                               104 / 143
Computational Bayesian Statistics
  Generalized linear models
     Generalisation of linear models



Generalisation of the linear dependence



       Broader class of models to cover various dependence structures.

       Class of generalised linear models (GLM) where

                                       y|x, β ∼ f (y|xT β) .

        i.e., the dependence of y on x is partly linear




                                                                         105 / 143
Computational Bayesian Statistics
  Generalized linear models
     Generalisation of linear models



Specifications of GLM’s

       Definition (GLM)
       A GLM is a conditional model specified by two functions:
          1   the density f of y given x parameterised by its expectation
              parameter µ = µ(x) [and possibly its dispersion parameter
              ϕ = ϕ(x)]
          2   the link g between the mean µ and the explanatory variables,
              written customarily as g(µ) = xT β or, equivalently,
              E[y|x, β] = g −1 (xT β).


       For identifiability reasons, g needs to be bijective.


                                                                             106 / 143
Computational Bayesian Statistics
  Generalized linear models
     Generalisation of linear models



Likelihood



       Obvious representation of the likelihood
                                        ℓ(β, ϕ|y, X) = Π_{i=1}^n f( yi | xi^T β, ϕ )

       with parameters β ∈ Rk and ϕ > 0.




                                                                               107 / 143
Computational Bayesian Statistics
  Generalized linear models
     Generalisation of linear models



Examples
       Case of binary and binomial data, when

                                        yi |xi ∼ B(ni , p(xi ))

       with known ni
              Logit [or logistic regression] model
              Link is logit transform on probability of success

                                        g(pi ) = log(pi /(1 − pi )) ,
              with likelihood
                     Π_{i=1}^n ( ni choose yi ) [ exp(xi^T β)/(1 + exp(xi^T β)) ]^{yi} [ 1/(1 + exp(xi^T β)) ]^{ni − yi}

                         ∝ exp{ Σ_{i=1}^n yi xi^T β }  /  Π_{i=1}^n (1 + exp(xi^T β))^{ni}

                                                                                     108 / 143
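Up to the binomial constants, the log of this likelihood is straightforward to code; in the R sketch below y holds the success counts, m the trial counts ni, and X the design matrix (all illustrative names).

# Illustrative sketch: logit (logistic regression) log-likelihood
loglik_logit <- function(beta, y, m, X) {
  eta <- drop(X %*% beta)
  sum(y * eta - m * log1p(exp(eta)))     # binomial constants omitted
}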
Computational Bayesian Statistics
  Generalized linear models
     Generalisation of linear models



Canonical link
       Special link function g that appears in the natural exponential
       family representation of the density

                       g ⋆ (µ) = θ       if f (y|µ) ∝ exp{T (y) · θ − Ψ(θ)}


       Example
       Logit link is canonical for the binomial model, since

             f(yi |pi ) = ( ni choose yi ) exp{ yi log( pi /(1 − pi ) ) + ni log(1 − pi ) } ,

       and thus
                                           θi = log pi /(1 − pi )

                                                                                            109 / 143
Computational Bayesian Statistics
  Generalized linear models
     Generalisation of linear models



Examples (2)

       Customary to use the canonical link, but only customary ...
              Probit model
              Probit link function given by

                                                g(µi ) = Φ−1 (µi )

              where Φ standard normal cdf
              Likelihood
                              ℓ(β|y, X) ∝ Π_{i=1}^n Φ(xi^T β)^{yi} (1 − Φ(xi^T β))^{ni − yi} .

          Full processing




                                                                                      110 / 143
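The probit counterpart of the previous sketch, with the same illustrative data layout, only changes the inverse link.

# Illustrative sketch: probit log-likelihood
loglik_probit <- function(beta, y, m, X) {
  p <- pnorm(drop(X %*% beta))
  sum(y * log(p) + (m - y) * log1p(-p))
}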
Computational Bayesian Statistics
  Generalized linear models
     Generalisation of linear models



Log-linear models
       Standard approach to describe associations between several
        categorical variables, i.e., variables with finite support
       Sufficient statistic: contingency table, made of the cross-classified
       counts for the different categorical variables.
       Example (Titanic survivors)
                                               Child            Adult
                      Survivor         Class   Male    Female   Male    Female
                                        1st      0        0      118       4
                                        2nd      0        0      154      13
                           No           3rd     35       17      387      89
                                        Crew     0        0      670       3
                                        1st      5        1      57       140
                                        2nd     11       13      14       80
                          Yes           3rd     13       14      75       76
                                        Crew     0        0      192      20

                                                                                 111 / 143
Computational Bayesian Statistics
  Generalized linear models
     Generalisation of linear models



Poisson regression model


          1   Each count yi is Poisson with mean µi = µ(xi )
          2   Link function connecting R+ with R, e.g. logarithm
              g(µi ) = log(µi ).

       Corresponding likelihood
                        ℓ(β|y, X) = Π_{i=1}^n (1/yi !) exp{ yi xi^T β − exp(xi^T β) } .




                                                                                  112 / 143
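And the corresponding log-likelihood of the Poisson log-linear model, again with illustrative names.

# Illustrative sketch: Poisson regression log-likelihood
loglik_poisson <- function(beta, y, X) {
  eta <- drop(X %*% beta)
  sum(y * eta - exp(eta) - lfactorial(y))
}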
Computational Bayesian Statistics
  Generalized linear models
     Metropolis–Hastings algorithms



Metropolis–Hastings algorithms


                                                          Convergence assessment



       Posterior inference in GLMs harder than for linear models

         © Working with a GLM requires specific numerical or simulation
       tools [E.g., GLIM in classical analyses]

       Opportunity to introduce universal MCMC method:
       Metropolis–Hastings algorithm




                                                                                   113 / 143
Computational Bayesian Statistics
  Generalized linear models
     Metropolis–Hastings algorithms



Generic MCMC sampler


               Metropolis–Hastings algorithms are generic/off-the-shelf
              MCMC algorithms
              Only require likelihood up to a constant [difference with Gibbs
              sampler]
              can be tuned with a wide range of possibilities [difference with
              Gibbs sampler & blocking]
              natural extensions of standard simulation algorithms: based
              on the choice of a proposal distribution [difference in Markov
              proposal q(x, y) and acceptance]




                                                                              114 / 143
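As a preview of the algorithm, here is a generic random-walk Metropolis–Hastings sketch in R: it only requires a function logpost returning the log target up to a constant (for instance one of the GLM log-likelihoods above under a flat prior); the proposal scale tau and the number of iterations are illustrative and would need tuning in practice.

# Illustrative sketch: random-walk Metropolis-Hastings with Gaussian proposal
rw_mh <- function(logpost, beta0, niter = 10^4, tau = 0.1) {
  p <- length(beta0)
  beta <- beta0; lp <- logpost(beta)
  out <- matrix(NA_real_, niter, p); acc <- 0
  for (t in 1:niter) {
    prop <- beta + tau * rnorm(p)               # symmetric proposal
    lpp  <- logpost(prop)
    if (log(runif(1)) < lpp - lp) {             # accept with prob min(1, ratio)
      beta <- prop; lp <- lpp; acc <- acc + 1
    }
    out[t, ] <- beta
  }
  list(chain = out, accept = acc/niter)
}
# e.g. fit <- rw_mh(function(b) loglik_probit(b, y, m, X), rep(0, ncol(X)))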
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
 
week 1 cookery 8 fourth - quarter .pptx
week 1 cookery 8  fourth  -  quarter .pptxweek 1 cookery 8  fourth  -  quarter .pptx
week 1 cookery 8 fourth - quarter .pptx
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdf
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 

Course on Bayesian computational methods

  • 1. Computational Bayesian Statistics Computational Bayesian Statistics Christian P. Robert, Universit´ Paris Dauphine e Frontiers of Statistical Decision Making and Bayesian Analysis, In honour of Jim Berger, 17th of March, 2010 1 / 143
  • 2. Computational Bayesian Statistics Outline 1 Computational basics 2 Regression and variable selection 3 Generalized linear models 2 / 143
  • 3. Computational Bayesian Statistics Computational basics Computational basics 1 Computational basics The Bayesian toolbox Improper prior distribution Testing Monte Carlo integration Bayes factor approximations in perspective 3 / 143
  • 4. Computational Bayesian Statistics Computational basics The Bayesian toolbox The Bayesian toolbox Bayes theorem = Inversion of probabilities If A and E are events such that P (E) = 0, P (A|E) and P (E|A) are related by P (E|A)P (A) P (A|E) = P (E|A)P (A) + P (E|Ac )P (Ac ) P (E|A)P (A) = P (E) 4 / 143
  • 5. Computational Bayesian Statistics Computational basics The Bayesian toolbox New perspective Uncertainty on the parameters θ of a model modeled through a probability distribution π on Θ, called prior distribution Inference based on the distribution of θ conditional on x, π(θ|x), called posterior distribution f (x|θ)π(θ) π(θ|x) = . f (x|θ)π(θ) dθ 5 / 143
  • 6. Computational Bayesian Statistics Computational basics The Bayesian toolbox Bayesian model A Bayesian statistical model is made of 1 a likelihood f (x|θ), and of 2 a prior distribution on the parameters, π(θ) . 6 / 143
  • 7. Computational Bayesian Statistics Computational basics The Bayesian toolbox Posterior distribution center of Bayesian inference π(θ|x) ∝ f (x|θ) π(θ) Operates conditional upon the observations Integrate simultaneously prior information/knowledge and information brought by x Avoids averaging over the unobserved values of x Coherent updating of the information available on θ, independent of the order in which i.i.d. observations are collected Provides a complete inferential scope and an unique motor of inference 7 / 143
  • 8. Computational Bayesian Statistics Computational basics The Bayesian toolbox Improper prior distribution Extension from a prior distribution to a prior σ-finite measure π such that π(θ) dθ = +∞ Θ Formal extension: π cannot be interpreted as a probability any longer 8 / 143
  • 9. Computational Bayesian Statistics Computational basics The Bayesian toolbox Validation Extension of the posterior distribution π(θ|x) associated with an improper prior π given by Bayes’s formula f (x|θ)π(θ) π(θ|x) = , Θ f (x|θ)π(θ) dθ when f (x|θ)π(θ) dθ < ∞ Θ 9 / 143
  • 10. Computational Bayesian Statistics Computational basics Testing Testing hypotheses Deciding about validity of assumptions or restrictions on the parameter θ from the data, represented as H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ0 Binary outcome of the decision process: accept [coded by 1] or reject [coded by 0] D = {0, 1} Bayesian solution formally very close from a likelihood ratio test statistic, but numerical values often strongly differ from classical solutions 10 / 143
  • 11. Computational Bayesian Statistics Computational basics Testing Bayes factor Bayesian testing procedure depends on P π (θ ∈ Θ0 |x) or alternatively on the Bayes factor π {P π (θ ∈ Θ1 |x)/P π (θ ∈ Θ0 |x)} B10 = {P π (θ ∈ Θ1 )/P π (θ ∈ Θ0 )} = f (x|θ1 )π1 (θ1 )dθ1 f (x|θ0 )π0 (θ0 )dθ0 = m1 (x)/m0 (x) [Akin to likelihood ratio] 11 / 143
  • 12. Computational Bayesian Statistics Computational basics Testing Banning improper priors Impossibility of using improper priors for testing! Reason: When using the representation π(θ) = P π (θ ∈ Θ1 ) × π1 (θ) + P π (θ ∈ Θ0 ) × π0 (θ) π1 and π0 must be normalised 12 / 143
  • 13. Computational Bayesian Statistics Computational basics Monte Carlo integration Fundamental integration issue Generic problem of evaluating an integral I = Ef [h(X)] = h(x) f (x) dx X where X is uni- or multidimensional, f is a closed form, partly closed form, or implicit density, and h is a function 13 / 143
  • 14. Computational Bayesian Statistics Computational basics Monte Carlo integration Monte Carlo Principle Use a sample (x1 , . . . , xm ) from the density f to approximate the integral I by the empirical average m 1 hm = h(xj ) m j=1 Convergence of the average hm −→ Ef [h(X)] by the Strong Law of Large Numbers 14 / 143
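As a toy illustration of this principle (not taken from the slides), the expectation E[exp(−X²)] for X ∼ N(0, 1), whose exact value is 1/√3, can be approximated in a couple of lines of R:
      m <- 1e5
      x <- rnorm(m)                 # sample from f = N(0, 1)
      mean(exp(-x^2))               # empirical average, close to 1/sqrt(3) = 0.5774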
  • 15. Computational Bayesian Statistics Computational basics Monte Carlo integration e.g., Bayes factor approximation For the normal case x1, . . . , xn ∼ N(µ + ξ, σ²), y1, . . . , yn ∼ N(µ − ξ, σ²), and H0 : ξ = 0, under prior π(µ, σ²) = 1/σ² and ξ ∼ N(0, 1),
        B01^π = [(x̄ − ȳ)² + S²]^(−n+1/2) / ∫ [(2ξ − x̄ − ȳ)² + S²]^(−n+1/2) e^(−ξ²/2)/√(2π) dξ
    15 / 143
  • 16. Computational Bayesian Statistics Computational basics Monte Carlo integration Example CMBdata Simulate ξ1, . . . , ξ1000 ∼ N(0, 1) and approximate B01^π with
        B̂01^π = [(x̄ − ȳ)² + S²]^(−n+1/2) / { (1/1000) Σ_{i=1}^{1000} [(2ξi − x̄ − ȳ)² + S²]^(−n+1/2) } = 89.9
    when x̄ = 0.0888, ȳ = 0.1078, S² = 0.00875
    16 / 143
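A hedged R sketch of this approximation: the summary statistics are those quoted on the slide, but the sample size n is not repeated there, so n <- 100 below is only a placeholder to make the sketch run.
      xbar <- 0.0888; ybar <- 0.1078; S2 <- 0.00875   # values from the slide
      n <- 100                                        # placeholder, not from the slides
      xi <- rnorm(1000)                               # xi_1, ..., xi_1000 ~ N(0, 1)
      num <- ((xbar - ybar)^2 + S2)^(-n + 1/2)
      den <- mean(((2 * xi - xbar - ybar)^2 + S2)^(-n + 1/2))
      num / den                                       # Monte Carlo approximation of B01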
  • 17. Computational Bayesian Statistics Computational basics Monte Carlo integration Precision evaluation Estimate the variance with
        vm = (1/m) · 1/(m − 1) · Σ_{j=1}^{m} [h(xj) − h̄m]² ,
    and for m large, (h̄m − Ef[h(X)]) / √vm ≈ N(0, 1). Note Construction of a convergence test and of confidence bounds on the approximation of Ef[h(X)] [figure: distribution of the 1/B10 approximations]
    17 / 143
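The precision assessment can be illustrated on the same toy expectation as above; this is again only a sketch:
      m <- 1e5
      x <- rnorm(m); h <- exp(-x^2)
      hbar <- mean(h)
      vm <- var(h) / m                          # estimate of the variance of hbar
      hbar + c(-1, 1) * 1.96 * sqrt(vm)         # approximate 95% bounds around 1/sqrt(3)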
  • 18. Computational Bayesian Statistics Computational basics Monte Carlo integration Example (Cauchy-normal) For estimating a normal mean, a robust prior is a Cauchy prior: x ∼ N(θ, 1), θ ∼ C(0, 1). Under squared error loss, posterior mean
        δ^π(x) = ∫_{−∞}^{∞} [θ/(1 + θ²)] e^(−(x−θ)²/2) dθ / ∫_{−∞}^{∞} [1/(1 + θ²)] e^(−(x−θ)²/2) dθ
    18 / 143
  • 19. Computational Bayesian Statistics Computational basics Monte Carlo integration Example (Cauchy-normal (2)) Form of δ^π suggests simulating iid variables θ1, · · · , θm ∼ N(x, 1) and calculating
        δ̂m^π(x) = Σ_{i=1}^{m} θi/(1 + θi²) / Σ_{i=1}^{m} 1/(1 + θi²) .
    LLN implies δ̂m^π(x) −→ δ^π(x) as m −→ ∞. [figure: convergence of δ̂m^π(x) against iterations]
    19 / 143
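A short R sketch of this estimator, for an arbitrary observation x (the value behind the slide's plot is not stated, so x <- 10 here is illustrative only):
      x <- 10                                   # illustrative observation, not from the slides
      m <- 1e4
      theta <- rnorm(m, mean = x, sd = 1)       # theta_i ~ N(x, 1)
      sum(theta / (1 + theta^2)) / sum(1 / (1 + theta^2))   # estimate of the posterior mean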
  • 20. Computational Bayesian Statistics Computational basics Monte Carlo integration Importance sampling Simulation from f (the true density) is not necessarily optimal Alternative to direct sampling from f is importance sampling, based on the alternative representation f (x) f (x) Ef [h(x)] = h(x) g(x) dx = Eg h(x) X g(x) g(x) which allows us to use other distributions than f 20 / 143
  • 21. Computational Bayesian Statistics Computational basics Monte Carlo integration Importance sampling (cont’d) Importance sampling algorithm Evaluation of Ef [h(x)] = h(x) f (x) dx X by 1 Generate a sample x1 , . . . , xm from a distribution g 2 Use the approximation m 1 f (xj ) h(xj ) m g(xj ) j=1 21 / 143
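A minimal R sketch of the algorithm on the same toy expectation as before, with a heavier-tailed Student t3 instrumental distribution (an arbitrary choice made purely for illustration):
      m <- 1e5
      x <- rt(m, df = 3)                        # sample from g = t_3
      w <- dnorm(x) / dt(x, df = 3)             # importance weights f(x_j)/g(x_j)
      mean(exp(-x^2) * w)                       # importance sampling estimate of E_f[h(X)]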
  • 22. Computational Bayesian Statistics Computational basics Monte Carlo integration Justification Convergence of the estimator m 1 f (xj ) h(xj ) −→ Ef [h(x)] m g(xj ) j=1 1 converges for any choice of the distribution g as long as supp(g) ⊃ supp(f ) 2 Instrumental distribution g chosen from distributions easy to simulate 3 Same sample (generated from g) can be used repeatedly, not only for different functions h, but also for different densities f 22 / 143
  • 23. Computational Bayesian Statistics Computational basics Monte Carlo integration Choice of importance function g can be any density but some choices better than others 1 Finite variance only when f (x) f 2 (x) Ef h2 (x) = h2 (x) dx < ∞ . g(x) X g(x) 2 Instrumental distributions with tails lighter than those of f (that is, with sup f /g = ∞) not appropriate, because weights f (xj )/g(xj ) vary widely, giving too much importance to a few values xj . 3 If sup f /g = M < ∞, the accept-reject algorithm can be used as well to simulate f directly. 4 IS suffers from curse of dimensionality 23 / 143
  • 24. Computational Bayesian Statistics Computational basics Monte Carlo integration Example (Cauchy target) Case of Cauchy distribution C(0, 1) when importance function is Gaussian N(0, 1). Density ratio
        ̺(x) = p⋆(x)/p0(x) = √(2π) exp(x²/2) / [π (1 + x²)]
    very badly behaved: e.g., ∫_{−∞}^{∞} ̺(x)² p0(x) dx = ∞. [figure: evolution of the estimator over 10,000 iterations] Poor performances of the associated importance sampling estimator
    24 / 143
  • 25. Computational Bayesian Statistics Computational basics Monte Carlo integration Practical alternative
        { Σ_{j=1}^{m} h(xj) f(xj)/g(xj) } / { Σ_{j=1}^{m} f(xj)/g(xj) }
    where f and g are known up to constants. 1 Also converges to I by the Strong Law of Large Numbers. 2 Biased, but the bias is quite small: may beat the unbiased estimator in squared error loss.
    25 / 143
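Self-normalised version of the previous sketch (here f and g happen to be normalised, but only their ratio up to a constant is needed):
      m <- 1e5
      x <- rt(m, df = 3)
      w <- dnorm(x) / dt(x, df = 3)
      sum(exp(-x^2) * w) / sum(w)               # ratio estimator of E_f[h(X)]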
  • 26. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Computational alternatives for evidence Bayesian model choice and hypothesis testing relies on a similar quantity, the evidence Zk = πk (θk )Lk (θk ) dθk , k = 1, 2 Θk aka the marginal likelihood. [Jeffreys, 1939] 26 / 143
  • 27. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Importance sampling When approximating the Bayes factor f0 (x|θ0 )π0 (θ0 )dθ0 Θ0 B01 = f1 (x|θ1 )π1 (θ1 )dθ1 Θ1 simultaneous use of importance functions ̟0 and ̟1 and n0 n−1 0 i i i i=1 f0 (x|θ0 )π0 (θ0 )/̟0 (θ0 ) B01 = n1 n−1 1 i i i i=1 f1 (x|θ1 )π1 (θ1 )/̟1 (θ1 ) 27 / 143
  • 28. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Diabetes in Pima Indian women Example (R benchmark) “A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix (AZ), was tested for diabetes according to WHO criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.” 200 Pima Indian women with observed variables plasma glucose concentration in oral glucose tolerance test diastolic blood pressure diabetes pedigree function presence/absence of diabetes 28 / 143
  • 29. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Probit modelling on Pima Indian women Probability of diabetes function of above variables P(y = 1|x) = Φ(x1 β1 + x2 β2 + x3 β3 ) , Test of H0 : β3 = 0 for 200 observations of Pima.tr based on a g-prior modelling: β ∼ N3 (0, n XT X)−1 29 / 143
  • 30. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Importance sampling for the Pima Indian dataset Use of the importance function inspired from the MLE estimate distribution β ∼ N(β̂, Σ̂)
    R importance sampling code
      model1=summary(glm(y~-1+X1,family=binomial(link="probit")))
      is1=rmvnorm(Niter,mean=model1$coeff[,1],sigma=2*model1$cov.unscaled)
      is2=rmvnorm(Niter,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)
      bfis=mean(exp(probitlpost(is1,y,X1)-dmvlnorm(is1,mean=model1$coeff[,1],
        sigma=2*model1$cov.unscaled))) /
        mean(exp(probitlpost(is2,y,X2)-
        dmvlnorm(is2,mean=model2$coeff[,1],sigma=2*model2$cov.unscaled)))
    30 / 143
  • 31. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Diabetes in Pima Indian women Comparison of the variation of the Bayes factor approximations based on 100 replicas for 20, 000 simulations from the prior and the above MLE importance sampler 5 4 3 2 Basic Monte Carlo Importance sampling 31 / 143
  • 32. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Bridge sampling General identity: for any function α(·),
        B12 = ∫ π̃1(θ|x) α(θ) π2(θ|x) dθ / ∫ π̃2(θ|x) α(θ) π1(θ|x) dθ
            ≈ { (1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x) α(θ2i) } / { (1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x) α(θ1i) } ,   θji ∼ πj(θ|x)
    32 / 143
  • 33. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Optimal bridge sampling The optimal choice of auxiliary function α is
        α⋆ = (n1 + n2) / [n1 π1(θ|x) + n2 π2(θ|x)]
    leading to
        B12 ≈ { (1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x) / [n1 π1(θ2i|x) + n2 π2(θ2i|x)] } / { (1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x) / [n1 π1(θ1i|x) + n2 π2(θ1i|x)] }
    Back later!
    33 / 143
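Since α⋆ involves the normalised densities (hence B12 itself), it is usually computed by fixed-point iterations. Below is a toy R sketch on a case where the answer is known (π̃1 = exp(−x²/2), π̃2 = exp(−(x−1)²/8), so the ratio of normalising constants is 0.5); this only illustrates the scheme and is not the slides' Pima Indian code.
      p1t <- function(x) exp(-x^2 / 2)          # unnormalised pi_1, Z1 = sqrt(2*pi)
      p2t <- function(x) exp(-(x - 1)^2 / 8)    # unnormalised pi_2, Z2 = sqrt(8*pi)
      n1 <- n2 <- 1e4
      x1 <- rnorm(n1, 0, 1)                     # draws from pi_1
      x2 <- rnorm(n2, 1, 2)                     # draws from pi_2
      B <- 1                                    # starting value for B12
      for (t in 1:10) {
        num <- mean(p1t(x2) / (n1 * p1t(x2) + n2 * B * p2t(x2)))
        den <- mean(p2t(x1) / (n1 * p1t(x1) + n2 * B * p2t(x1)))
        B <- num / den
      }
      B                                         # should be close to 0.5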
  • 34. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Extension to varying dimensions When dim(Θ1) ≠ dim(Θ2), e.g. θ2 = (θ1, ψ), introduction of a pseudo-posterior density ω(ψ|θ1, x), augmenting π1(θ1|x) into the joint distribution π1(θ1|x) × ω(ψ|θ1, x) on Θ2, so that
        B12 = ∫∫ π̃1(θ1|x) ω(ψ|θ1, x) α(θ1, ψ) π2(θ1, ψ|x) dθ1 dψ / ∫∫ π̃2(θ1, ψ|x) α(θ1, ψ) π1(θ1|x) ω(ψ|θ1, x) dθ1 dψ
            = E_{π2}[ π̃1(θ1) ω(ψ|θ1) / π̃2(θ1, ψ) ] = E_ϕ[ π̃1(θ1) ω(ψ|θ1) / ϕ(θ1, ψ) ] / E_ϕ[ π̃2(θ1, ψ) / ϕ(θ1, ψ) ]
    for any conditional density ω(ψ|θ1) and any joint density ϕ.
    34 / 143
  • 35. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Illustration for the Pima Indian dataset Use of the MLE induced conditional of β3 given (β1, β2) as a pseudo-posterior and mixture of both MLE approximations on β3 in bridge sampling estimate
    R bridge sampling code
      cova=model2$cov.unscaled
      expecta=model2$coeff[,1]
      covw=cova[3,3]-t(cova[1:2,3])%*%ginv(cova[1:2,1:2])%*%cova[1:2,3]
      probit1=hmprobit(Niter,y,X1)
      probit2=hmprobit(Niter,y,X2)
      pseudo=rnorm(Niter,meanw(probit1),sqrt(covw))
      probit1p=cbind(probit1,pseudo)
      bfbs=mean(exp(probitlpost(probit2[,1:2],y,X1)+dnorm(probit2[,3],meanw(probit2[,1:2]),
        sqrt(covw),log=T))/ (dmvnorm(probit2,expecta,cova)+dnorm(probit2[,3],expecta[3],
        cova[3,3])))/ mean(exp(probitlpost(probit1p,y,X2))/(dmvnorm(probit1p,expecta,cova)+
        dnorm(pseudo,expecta[3],cova[3,3])))
    35 / 143
  • 36. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Diabetes in Pima Indian women (cont’d) Comparison of the variation of the Bayes factor approximations based on 100 × 20, 000 simulations from the prior (MC), the above bridge sampler and the above importance sampler 36 / 143
  • 37. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective The original harmonic mean estimator When θki ∼ πk(θ|x),
        (1/T) Σ_{t=1}^{T} 1 / L(θkt|x)
    is an unbiased estimator of 1/mk(x) [Newton & Raftery, 1994] Highly dangerous: Most often leads to an infinite variance!!!
    37 / 143
  • 38. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Approximating Zk from a posterior sample Use of the [harmonic mean] identity
        E^{πk}[ ϕ(θk) / {πk(θk) Lk(θk)} | x ] = ∫ [ ϕ(θk) / {πk(θk) Lk(θk)} ] [ πk(θk) Lk(θk) / Zk ] dθk = 1/Zk
    which holds no matter what the proposal ϕ(·) is. [Gelfand & Dey, 1994; Bartolucci et al., 2006] Direct exploitation of the Monte Carlo output
    38 / 143
  • 39. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation
        Ẑk = 1 / { (1/T) Σ_{t=1}^{T} ϕ(θk^(t)) / [πk(θk^(t)) Lk(θk^(t))] }
    to have a finite variance. E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
    39 / 143
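A toy R check of this identity on a conjugate model where the evidence is known in closed form (x | θ ∼ N(θ, 1), θ ∼ N(0, 1), so m(x) = N(x; 0, 2)); the light-tailed ϕ below is a normal with a smaller variance than the posterior, which is an illustrative choice only:
      x <- 1
      T <- 1e5
      theta <- rnorm(T, mean = x/2, sd = sqrt(1/2))     # exact posterior draws
      num <- dnorm(theta, mean = x/2, sd = 1/2)         # phi, lighter tails than the posterior
      den <- dnorm(theta, 0, 1) * dnorm(x, theta, 1)    # prior times likelihood
      1 / mean(num / den)                               # estimate of m(x)
      dnorm(x, 0, sqrt(2))                              # exact value, about 0.2197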
  • 40. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective HPD indicator as ϕ Use the convex hull of Monte Carlo simulations corresponding to the 10% HPD region (easily derived!) and ϕ as indicator:
        ϕ(θ) = (10/T) Σ_{t∈HPD} I_{d(θ, θ^(t)) ≤ ǫ}
    40 / 143
  • 41. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Diabetes in Pima Indian women (cont’d) Comparison of the variation of the Bayes factor approximations based on 100 replicas for 20, 000 simulations for a simulation from the above harmonic mean sampler and importance samplers [figure: harmonic mean versus importance sampling, approximations ranging between 3.102 and 3.116]
    41 / 143
  • 42. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Chib’s representation Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),
        Zk = mk(x) = fk(x|θk) πk(θk) / πk(θk|x)
    Use of an approximation to the posterior
        Ẑk = m̂k(x) = fk(x|θk*) πk(θk*) / π̂k(θk*|x) .
    42 / 143
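The identity can be checked on the same toy conjugate model as above, where the posterior density is available exactly (a sketch, with θ* taken at the posterior mean):
      x <- 1
      theta_star <- x / 2                                # posterior mean as theta*
      lik  <- dnorm(x, theta_star, 1)                    # f(x | theta*)
      pri  <- dnorm(theta_star, 0, 1)                    # pi(theta*)
      post <- dnorm(theta_star, x / 2, sqrt(1/2))        # pi(theta* | x), exact here
      lik * pri / post                                   # equals dnorm(x, 0, sqrt(2))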
  • 43. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Case of latent variables For missing variable z as in mixture models, natural Rao-Blackwell estimate
        π̂k(θk*|x) = (1/T) Σ_{t=1}^{T} πk(θk*|x, zk^(t)) ,
    where the zk^(t)’s are latent variables sampled from π(z|x) (often by Gibbs)
    43 / 143
  • 44. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Case of the probit model For the completion by z, hidden normal variable,
        π̂(θ|x) = (1/T) Σ_t π(θ|x, z^(t))
    is a simple average of normal densities
    R Chib’s approximation code
      gibbs1=gibbsprobit(Niter,y,X1)
      gibbs2=gibbsprobit(Niter,y,X2)
      bfchi=mean(exp(dmvlnorm(t(t(gibbs2$mu)-model2$coeff[,1]),mean=rep(0,3),
        sigma=gibbs2$Sigma2)-probitlpost(model2$coeff[,1],y,X2)))/
        mean(exp(dmvlnorm(t(t(gibbs1$mu)-model1$coeff[,1]),mean=rep(0,2),
        sigma=gibbs1$Sigma2)-probitlpost(model1$coeff[,1],y,X1)))
    44 / 143
  • 45. Computational Bayesian Statistics Computational basics Bayes factor approximations in perspective Diabetes in Pima Indian women (cont’d) Comparison of the variation of the Bayes factor approximations based on 100 replicas for 20, 000 simulations for a simulation from the above Chib’s and importance samplers 45 / 143
  • 46. Computational Bayesian Statistics Regression and variable selection Regression and variable selection 2 Regression and variable selection Regression Zellner’s informative G-prior Zellner’s noninformative G-prior Markov Chain Monte Carlo Methods Variable selection 46 / 143
  • 47. Computational Bayesian Statistics Regression and variable selection Regression Regressors and response Variable of primary interest, y, called the response or the outcome variable [assumed here to be continuous] E.g., number of Pine processionary caterpillar colonies Covariates x = (x1 , . . . , xk ) called explanatory variables [may be discrete, continuous or both] Distribution of y given x typically studied in the context of a set of units or experimental subjects, i = 1, . . . , n, for instance patients in an hospital ward, on which both yi and xi1 , . . . , xik are measured. 47 / 143
  • 48. Computational Bayesian Statistics Regression and variable selection Regression Regressors and response cont’d Dataset made of the conjunction of the vector of outcomes y = (y1, . . . , yn) and of the n × (k + 1) matrix of explanatory variables
        X = [ 1  x11  x12  . . .  x1k
              1  x21  x22  . . .  x2k
              1  x31  x32  . . .  x3k
              .   .    .   . . .   .
              1  xn1  xn2  . . .  xnk ]
    48 / 143
  • 49. Computational Bayesian Statistics Regression and variable selection Regression Linear models Ordinary normal linear regression model such that y|β, σ 2 , X ∼ Nn (Xβ, σ 2 In ) and thus E[yi |β, X] = β0 + β1 xi1 + . . . + βk xik V(yi |σ 2 , X) = σ 2 49 / 143
  • 50. Computational Bayesian Statistics Regression and variable selection Regression Pine processionary caterpillars Pine processionary caterpillar colony size influenced by x1 altitude x2 slope (in degrees) x3 number of pines in the area x4 height of the central tree x5 diameter of the central tree x6 index of the settlement density x7 orientation of the area (from 1 [southbound] to 2) x8 height of the dominant tree x9 number of vegetation strata x10 mix settlement index (from 1 if not mixed to 2 if mixed) x1 x2 x3 50 / 143
  • 51. Computational Bayesian Statistics Regression and variable selection Regression Ibex horn growth Growth of the Ibex horn size S against age A
        log(S) = α + β log( A/(1 + A) ) + ǫ
    [figure: log(Size) against log(Age/(1 + Age))]
    51 / 143
  • 52. Computational Bayesian Statistics Regression and variable selection Regression Likelihood function & estimator The likelihood of the ordinary normal linear model is
        ℓ(β, σ²|y, X) = (2πσ²)^(−n/2) exp{ −(1/(2σ²)) (y − Xβ)^T (y − Xβ) }
    The MLE of β is solution of the least squares minimisation problem
        min_β (y − Xβ)^T (y − Xβ) = min_β Σ_{i=1}^{n} (yi − β0 − β1 xi1 − . . . − βk xik)² ,
    namely β̂ = (X^T X)^(−1) X^T y
    52 / 143
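A short R check of this formula on simulated data (purely illustrative):
      set.seed(1)
      n <- 100
      X <- cbind(1, matrix(rnorm(n * 3), n, 3))          # intercept plus 3 covariates
      y <- X %*% c(2, 1, -1, 0.5) + rnorm(n)
      betahat <- solve(crossprod(X), crossprod(X, y))    # (X'X)^{-1} X'y
      cbind(betahat, coef(lm(y ~ X - 1)))                # agrees with lm()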
  • 53. Computational Bayesian Statistics Regression and variable selection Regression lm() for pine processionary caterpillars
    Residuals:
         Min       1Q   Median       3Q      Max
     -1.6989  -0.2731  -0.0003   0.3246   1.7305
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    intercept   10.998412   3.060272   3.594  0.00161 **
    XV1         -0.004431   0.001557  -2.846  0.00939 **
    XV2         -0.053830   0.021900  -2.458  0.02232 *
    XV3          0.067939   0.099472   0.683  0.50174
    XV4         -1.293636   0.563811  -2.294  0.03168 *
    XV5          0.231637   0.104378   2.219  0.03709 *
    XV6         -0.356800   1.566464  -0.228  0.82193
    XV7         -0.237469   1.006006  -0.236  0.81558
    XV8          0.181060   0.236724   0.765  0.45248
    XV9         -1.285316   0.864847  -1.486  0.15142
    XV10        -0.433106   0.734869  -0.589  0.56162
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    53 / 143
  • 54. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Zellner’s informative G-prior Constraint Allow the experimenter to introduce information about the location parameter of the regression while bypassing the most difficult aspects of the prior specification, namely the derivation of the prior correlation structure. Zellner’s prior corresponds to ˜ β|σ 2 , X ∼ Nk+1 (β, cσ 2 (X T X)−1 ) σ 2 ∼ π(σ 2 |X) ∝ σ −2 . [Special conjugate] 54 / 143
  • 55. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Prior selection ˜ Experimental prior determination restricted to the choices of β and of the constant c. Note c can be interpreted as a measure of the amount of information available in the prior relative to the sample. For instance, setting 1/c = 0.5 gives the prior the same weight as 50% of the sample. There still is a lasting influence of the factor c 55 / 143
  • 56. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Posterior structure With this prior model, the posterior simplifies into π(β, σ 2 |y, X) ∝ f (y|β, σ 2 , X)π(β, σ 2 |X) 1 ˆ ˆ ∝ σ2 exp − 2 (y − X β)T (y − X β) −(n/2+1) 2σ 1 ˆ ˆ − 2 (β − β)T X T X(β − β) σ 2 −k/2 2σ 1 ˜ ˜ × exp − (β − β)X T X(β − β) , 2cσ 2 because X T X used in both prior and likelihood [G-prior trick] 56 / 143
  • 57. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Posterior structure (cont’d) Therefore,
        β|σ², y, X ∼ N_{k+1}( c/(c+1) (β̃/c + β̂), σ²c/(c+1) (X^T X)^(−1) )
        σ²|y, X ∼ IG( n/2, s²/2 + (β̃ − β̂)^T X^T X (β̃ − β̂)/(2(c+1)) )
    and
        β|y, X ∼ T_{k+1}( n, c/(c+1) (β̃/c + β̂), c(s² + (β̃ − β̂)^T X^T X (β̃ − β̂)/(c+1)) / (n(c+1)) × (X^T X)^(−1) )
    57 / 143
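A small R sketch of the resulting posterior mean of β under the G-prior, on simulated data and with prior guess β̃ = 0 (an illustration, not the caterpillar analysis):
      set.seed(2)
      n <- 50; X <- cbind(1, rnorm(n)); y <- X %*% c(1, 2) + rnorm(n)
      betahat <- solve(crossprod(X), crossprod(X, y))
      btilde <- c(0, 0); c0 <- 100
      c0/(c0 + 1) * (btilde/c0 + betahat)      # E[beta | y, X]: betahat shrunk towards btilde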
  • 58. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Pine processionary caterpillars βi Eπ (βi |y, X) Vπ (βi |y, X) β0 10.8895 6.4094 β1 -0.0044 2e-06 β2 -0.0533 0.0003 β3 0.0673 0.0068 β4 -1.2808 0.2175 β5 0.2293 0.0075 β6 -0.3532 1.6793 β7 -0.2351 0.6926 β8 0.1793 0.0383 β9 -1.2726 0.5119 β10 -0.4288 0.3696 c = 100 58 / 143
  • 59. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Pine processionary caterpillars (2) βi Eπ (βi |y, X) Vπ (βi |y, X) β0 10.9874 6.2604 β1 -0.0044 2e-06 β2 -0.0538 0.0003 β3 0.0679 0.0066 β4 -1.2923 0.2125 β5 0.2314 0.0073 β6 -0.3564 1.6403 β7 -0.2372 0.6765 β8 0.1809 0.0375 β9 -1.2840 0.5100 β10 -0.4327 0.3670 c = 1, 000 59 / 143
  • 60. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Credible regions Highest posterior density (HPD) regions on subvectors of the parameter β derived from the marginal posterior distribution of β. For a single parameter, c ˜ βi βi |y, X ∼ T1 n, ˆ + βi , c+1 c ˜ ˆ ˜ ˆ c(s2 + (β − β)T X T X(β − β)/(c + 1)) ω(i,i) , n(c + 1) where ω(i,i) is the (i, i)-th element of the matrix (X T X)−1 . 60 / 143
  • 61. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Pine processionary caterpillars βi HPD interval β0 [5.7435, 16.2533] β1 [−0.0071, −0.0018] β2 [−0.0914, −0.0162] β3 [−0.1029, 0.2387] β4 [−2.2618, −0.3255] β5 [0.0524, 0.4109] β6 [−3.0466, 2.3330] β7 [−1.9649, 1.4900] β8 [−0.2254, 0.5875] β9 [−2.7704, 0.1997] β10 [−1.6950, 0.8288] c = 100 61 / 143
  • 62. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior T marginal Marginal distribution of y is multivariate t distribution ˜ Proof. Since β|σ 2 , X ∼ Nk+1 (β, cσ 2 (X T X)−1 ), ˜ Xβ|σ 2 , X ∼ N (X β, cσ 2 X(X T X)−1 X T ) , which implies that ˜ y|σ 2 , X ∼ Nn (X β, σ 2 (In + cX(X T X)−1 X T )). Integrating in σ 2 yields f (y|X) = (c + 1)−(k+1)/2 π −n/2 Γ(n/2) −n/2 c 1 ˜T T ˜ × yT y − y T X(X T X)−1 X T y − β X Xβ . c+1 c+1 62 / 143
  • 63. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Point null hypothesis If a null hypothesis is H0 : Rβ = r, the model under H0 can be rewritten as H0 y|β 0 , σ 2 , X0 ∼ Nn X0 β 0 , σ 2 In where β 0 is (k + 1 − q) dimensional. 63 / 143
  • 64. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Point null marginal Under the prior ˜ β 0 |X0 , σ 2 ∼ Nk+1−q β 0 , c0 σ 2 (X0 X0 )−1 , T the marginal distribution of y under H0 is f (y|X0 , H0 ) = (c0 + 1)−(k+1−q)/2 π −n/2 Γ(n/2) c0 × yT y − y T X0 (X0 X0 )−1 X0 y T T c0 + 1 −n/2 1 ˜T T ˜ − β X X0 β0 . c0 + 1 0 0 64 / 143
  • 65. Computational Bayesian Statistics Regression and variable selection Zellner’s informative G-prior Bayes factor Therefore the Bayes factor is in closed form: π f (y|X, H1 ) (c0 + 1)(k+1−q)/2 B10 = = f (y|X0 , H0 ) (c + 1)(k+1)/2 1 ˜T T n/2 yT y − c0 T T −1 T − ˜ c0 +1 y X0 (X0 X0 ) X0 y c0 +1 β0 X0 X0 β0 c 1 ˜T T ˜ yT y − c+1 y T X(X T X)−1 X T y − c+1 β X X β Means using the same σ 2 on both models Problem: Still depends on the choice of (c0 , c) 65 / 143
  • 66. Computational Bayesian Statistics Regression and variable selection Zellner’s noninformative G-prior Zellner’s noninformative G-prior Difference with informative G-prior setup is that we now consider c as unknown (relief!) Solution ˜ Use the same G-prior distribution with β = 0k+1 , conditional on c, and introduce a diffuse prior on c, π(c) = c−1 IN∗ (c) . 66 / 143
  • 67. Computational Bayesian Statistics Regression and variable selection Zellner’s noninformative G-prior Posterior distribution Corresponding marginal posterior on the parameters of interest π(β, σ 2 |y, X) = π(β, σ 2 |y, X, c)π(c|y, X) dc ∞ ∝ π(β, σ 2 |y, X, c)f (y|X, c)π(c) c=1 ∞ ∝ π(β, σ 2 |y, X, c)f (y|X, c) c−1 . c=1 and −n/2 T c T f (y|X, c) ∝ (c+1) −(k+1)/2 y y− y X(X T X)−1 X T y . c+1 67 / 143
  • 68. Computational Bayesian Statistics Regression and variable selection Zellner’s noninformative G-prior Posterior means The Bayes estimates of β and σ 2 are given by Eπ [β|y, X] ˆ = Eπ [Eπ (β|y, X, c)|y, X] = Eπ [c/(c + 1)β)|y, X]  ∞   c/(c + 1)f (y|X, c)c −1   c=1  =   ∞ β  ˆ   f (y|X, c)c−1 c=1 and ∞ ˆ ˆ s2 + β T X T X β/(c + 1) f (y|X, c)c−1 c=1 n−2 Eπ [σ 2 |y, X] = ∞ . f (y|X, c)c −1 c=1 68 / 143
  • 69. Computational Bayesian Statistics Regression and variable selection Zellner’s noninformative G-prior Marginal distribution Important point: the marginal distribution of the dataset is available in closed form ∞ −n/2 c f (y|X) ∝ c−1 (c + 1)−(k+1)/2 y T y − y T X(X T X)−1 X T y i=1 c+1 T -shape means normalising constant can be computed too. 69 / 143
  • 70. Computational Bayesian Statistics Regression and variable selection Zellner’s noninformative G-prior Point null hypothesis For null hypothesis H0 : Rβ = r, the model under H0 can be rewritten as H0 y|β 0 , σ 2 , X0 ∼ Nn X0 β 0 , σ 2 In where β 0 is (k + 1 − q) dimensional. 70 / 143
  • 71. Computational Bayesian Statistics Regression and variable selection Zellner’s noninformative G-prior Point null marginal Under the prior β 0 |X0 , σ 2 , c ∼ Nk+1−q 0k+1−q , cσ 2 (X0 X0 )−1 T and π(c) = 1/c, the marginal distribution of y under H0 is ∞ −n/2 c f (y|X0 , H0 ) ∝ (c+1)−(k+1−q)/2 y T y − y T X0 (X0 X0 )−1 X0 y T T . c=1 c+1 π Bayes factor B10 = f (y|X)/f (y|X0 , H0 ) can be computed 71 / 143
  • 72. Computational Bayesian Statistics Regression and variable selection Zellner’s noninformative G-prior Processionary pine caterpillars π For H0 : β8 = β9 = 0, log10 (B10 ) = −0.7884 Estimate Post. Var. log10(BF) (Intercept) 9.2714 9.1164 1.4205 (***) X1 -0.0037 2e-06 0.8502 (**) X2 -0.0454 0.0004 0.5664 (**) X3 0.0573 0.0086 -0.3609 X4 -1.0905 0.2901 0.4520 (*) X5 0.1953 0.0099 0.4007 (*) X6 -0.3008 2.1372 -0.4412 X7 -0.2002 0.8815 -0.4404 X8 0.1526 0.0490 -0.3383 X9 -1.0835 0.6643 -0.0424 X10 -0.3651 0.4716 -0.3838 evidence against H0: (****) decisive, (***) strong, (**) subtantial, (*) poor 72 / 143
  • 73. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Markov Chain Monte Carlo Methods Complexity of most models encountered in Bayesian modelling Standard simulation methods not good enough a solution New technique at the core of Bayesian computing, based on Markov chains 73 / 143
  • 74. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Markov chains Markov chain A process (θ(t) )t∈N is an homogeneous Markov chain if the distribution of θ(t) given the past (θ(0) , . . . , θ(t−1) ) 1 only depends on θ(t−1) 2 is the same for all t ∈ N∗ . 74 / 143
  • 75. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Algorithms based on Markov chains Idea: simulate from a posterior density π(·|x) [or any density] by producing a Markov chain (θ(t) )t∈N whose stationary distribution is π(·|x) Translation For t large enough, θ(t) is approximately distributed from π(θ|x), no matter what the starting value θ(0) is [Ergodicity]. 75 / 143
  • 76. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Convergence If an algorithm that generates such a chain can be constructed, the ergodic theorem guarantees that, in almost all settings, the average T 1 g(θ(t) ) T t=1 converges to Eπ [g(θ)|x], for (almost) any starting value 76 / 143
  • 77. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods More convergence If the produced Markov chains are irreducible [can reach any region in a finite number of steps], then they are both positive recurrent with stationary distribution π(·|x) and ergodic [asymptotically independent from the starting value θ(0) ] While, for t large enough, θ(t) is approximately distributed from π(θ|x) and can thus be used like the output from a more standard simulation algorithm, one must take care of the correlations between the θ(t) ’s 77 / 143
  • 78. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Demarginalising Takes advantage of hierarchical structures: if π(θ|x) = π1 (θ|x, λ)π2 (λ|x) dλ , simulating from π(θ|x) comes from simulating from the joint distribution π1 (θ|x, λ) π2 (λ|x) 78 / 143
  • 79. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Two-stage Gibbs sampler Usually π2 (λ|x) not available/simulable More often, both conditional posterior distributions, π1 (θ|x, λ) and π2 (λ|x, θ) can be simulated. Idea: Create a Markov chain based on those conditionals Example of latent variables! 79 / 143
  • 80. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Two-stage Gibbs sampler (cont’d) Initialization: Start with an arbitrary value λ(0) Iteration t: Given λ(t−1) , generate 1 θ(t) according to π1 (θ|x, λ(t−1) ) 2 λ(t) according to π2 (λ|x, θ(t) ) J.W. Gibbs (1839-1903) π(θ, λ|x) is a stationary distribution for this transition 80 / 143
  • 81. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Implementation 1 Derive efficient decomposition of the joint distribution into simulable conditionals (mixing behavior, acf(), blocking, &tc.) 2 Find when to stop the algorithm (mode chasing, missing mass, shortcuts, &tc.) 81 / 143
  • 82. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Simple Example: iid N(µ, σ²) Observations When y1, . . . , yn ∼iid N(µ, σ²) with both µ and σ unknown, the posterior in (µ, σ²) is conjugate outside a standard family But...
        µ|y, σ² ∼ N( (1/n) Σ_{i=1}^{n} yi , σ²/n )
        σ²|y, µ ∼ IG( n/2 − 1, (1/2) Σ_{i=1}^{n} (yi − µ)² )
    assuming constant (improper) priors on both µ and σ² Hence we may use the Gibbs sampler for simulating from the posterior of (µ, σ²)
    82 / 143
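A minimal R implementation of this two-stage Gibbs sampler on simulated data (a sketch; data, chain length and burn-in are arbitrary):
      set.seed(3)
      y <- rnorm(30, mean = 2, sd = 1.5)
      n <- length(y); T <- 5000
      mu <- sig2 <- numeric(T); sig2[1] <- var(y)
      for (t in 2:T) {
        mu[t]   <- rnorm(1, mean(y), sqrt(sig2[t - 1] / n))
        sig2[t] <- 1 / rgamma(1, shape = n/2 - 1, rate = sum((y - mu[t])^2) / 2)
      }
      c(mean(mu[-(1:500)]), mean(sig2[-(1:500)]))    # posterior means after burn-in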
  • 83. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Gibbs output analysis Example (Cauchy posterior)
        π(µ|D) ∝ e^(−µ²/20) / [ (1 + (x1 − µ)²)(1 + (x2 − µ)²) ]
    is marginal of
        π(µ, ω|D) ∝ e^(−µ²/20) × Π_{i=1}^{2} e^(−ωi [1 + (xi − µ)²]) .
    Corresponding conditionals
        (ω1, ω2)|µ ∼ Exp(1 + (x1 − µ)²) ⊗ Exp(1 + (x2 − µ)²)
        µ|ω ∼ N( Σi ωi xi / (Σi ωi + 1/20) , 1/(2 Σi ωi + 1/10) )
    83 / 143
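A short R sketch of this Gibbs sampler; the observations x1, x2 are not given on the slide, so the values below are illustrative only:
      x <- c(-3, 4)                              # illustrative data, not from the slides
      T <- 10000; mu <- numeric(T)
      for (t in 2:T) {
        omega <- rexp(2, rate = 1 + (x - mu[t - 1])^2)
        mu[t] <- rnorm(1, sum(omega * x) / (sum(omega) + 1/20),
                       sqrt(1 / (2 * sum(omega) + 1/10)))
      }
      mean(mu[-(1:1000)])                        # posterior mean of mu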
  • 84. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Gibbs output analysis (cont’d) [figures: trace of µ over the last iterations and histogram of the simulated µ values]
    84 / 143
  • 85. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods Generalisation Consider several groups of parameters, θ, λ1 , . . . , λp , such that π(θ|x) = ... π(θ, λ1 , . . . , λp |x) dλ1 · · · dλp or simply divide θ in (θ1 , . . . , θp ) 85 / 143
  • 86. Computational Bayesian Statistics Regression and variable selection Markov Chain Monte Carlo Methods The general Gibbs sampler For a joint distribution π(θ) with full conditionals π1 , . . . , πp , (t) (t) Given (θ1 , . . . , θp ), simulate (t+1) (t) (t) 1. θ1 ∼ π1 (θ1 |θ2 , . . . , θp ), (t+1) (t+1) (t) (t) 2. θ2 ∼ π2 (θ2 |θ1 , θ3 , . . . , θp ), . . . (t+1) (t+1) (t+1) p. θp ∼ πp (θp |θ1 , . . . , θp−1 ). Then θ(t) → θ ∼ π 86 / 143
  • 87. Computational Bayesian Statistics Regression and variable selection Variable selection Variable selection Back to regression: one dependent random variable y and a set {x1 , . . . , xk } of k explanatory variables. Question: Are all xi ’s involved in the regression? Assumption: every subset {i1 , . . . , iq } of q (0 ≤ q ≤ k) explanatory variables, {1n , xi1 , . . . , xiq }, is a proper set of explanatory variables for the regression of y [intercept included in every corresponding model] Computational issue 2k models in competition... 87 / 143
  • 88. Computational Bayesian Statistics Regression and variable selection Variable selection Model notations 1 X = 1n x1 · · · xk is the matrix containing 1n and all the k potential predictor variables 2 Each model Mγ associated with binary indicator vector γ ∈ Γ = {0, 1}k where γi = 1 means that the variable xi is included in the model Mγ 3 qγ = 1T γ number of variables included in the model Mγ n 4 t1 (γ) and t0 (γ) indices of variables included in the model and indices of variables not included in the model 88 / 143
  • 89. Computational Bayesian Statistics Regression and variable selection Variable selection Model indicators For β ∈ Rk+1 and X, we define βγ as the subvector βγ = β0 , (βi )i∈t1 (γ) and Xγ as the submatrix of X where only the column 1n and the columns in t1 (γ) have been left. 89 / 143
  • 90. Computational Bayesian Statistics Regression and variable selection Variable selection Models in competition The model Mγ is thus defined as y|γ, βγ , σ 2 , X ∼ Nn Xγ βγ , σ 2 In where βγ ∈ Rqγ +1 and σ 2 ∈ R∗ are the unknown parameters. + Warning σ 2 is common to all models and thus uses the same prior for all models 90 / 143
  • 91. Computational Bayesian Statistics Regression and variable selection Variable selection Informative G-prior Many (2k ) models in competition: we cannot expect a practitioner to specify a prior on every Mγ in a completely subjective and autonomous manner. Shortcut: We derive all priors from a single global prior associated with the so-called full model that corresponds to γ = (1, . . . , 1). 91 / 143
  • 92. Computational Bayesian Statistics Regression and variable selection Variable selection Prior definitions (i) For the full model, Zellner’s G-prior: ˜ β|σ 2 , X ∼ Nk+1 (β, cσ 2 (X T X)−1 ) and σ 2 ∼ π(σ 2 |X) = σ −2 (ii) For each model Mγ , the prior distribution of βγ conditional on σ 2 is fixed as ˜ −1 βγ |γ, σ 2 ∼ Nqγ +1 βγ , cσ 2 Xγ Xγ T , ˜ T −1 T ˜ where βγ = Xγ Xγ Xγ β and same prior on σ 2 . 92 / 143
  • 93. Computational Bayesian Statistics Regression and variable selection Variable selection Prior completion The joint prior for model Mγ is the improper prior −(qγ +1)/2−1 1 ˜ T π(βγ , σ 2 |γ) ∝ σ2 exp − βγ − βγ 2(cσ 2 ) T ˜ (Xγ Xγ ) βγ − βγ . 93 / 143
  • 94. Computational Bayesian Statistics Regression and variable selection Variable selection Prior competition (2) Infinitely many ways of defining a prior on the model index γ: choice of uniform prior π(γ|X) = 2−k . Posterior distribution of γ central to variable selection since proportional to the marginal density of y on Mγ (or evidence of Mγ ) π(γ|y, X) ∝ f (y|γ, X)π(γ|X) ∝ f (y|γ, X) = f (y|γ, β, σ 2 , X)π(β|γ, σ 2 , X) dβ π(σ 2 |X) dσ 2 . 94 / 143
  • 95. Computational Bayesian Statistics Regression and variable selection Variable selection
        f(y|γ, σ², X) = ∫ f(y|γ, β, σ²) π(β|γ, σ²) dβ
                      = (c + 1)^(−(qγ+1)/2) (2π)^(−n/2) (σ²)^(−n/2) exp{ −y^T y/(2σ²) + [ c y^T Xγ (Xγ^T Xγ)^(−1) Xγ^T y − β̃γ^T Xγ^T Xγ β̃γ ] / (2σ²(c + 1)) } ,
    this posterior density satisfies
        π(γ|y, X) ∝ (c + 1)^(−(qγ+1)/2) [ y^T y − c/(c+1) y^T Xγ (Xγ^T Xγ)^(−1) Xγ^T y − β̃γ^T Xγ^T Xγ β̃γ/(c + 1) ]^(−n/2) .
    95 / 143
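With β̃γ = 0, this formula is easy to evaluate for every submodel when k is small. A hedged R sketch on simulated data with k = 4 covariates, enumerating all 16 models (an illustration, not the caterpillar data):
      set.seed(4)
      n <- 60; k <- 4
      X <- matrix(rnorm(n * k), n, k)
      y <- 1 + 2 * X[, 1] - X[, 3] + rnorm(n)
      c0 <- 100
      lmarg <- function(gam) {                       # log pi(gamma | y, X), up to a constant
        Xg <- cbind(1, X[, gam == 1, drop = FALSE])
        P  <- Xg %*% solve(crossprod(Xg), t(Xg))
        -(sum(gam) + 1)/2 * log(c0 + 1) -
          (n/2) * log(sum(y^2) - c0/(c0 + 1) * sum(y * (P %*% y)))
      }
      gams <- as.matrix(expand.grid(rep(list(0:1), k)))
      lp <- apply(gams, 1, lmarg)
      post <- exp(lp - max(lp)); post <- post / sum(post)
      cbind(gams, prob = round(post, 3))             # model (1,0,1,0) should dominate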
  • 96. Computational Bayesian Statistics Regression and variable selection Variable selection Pine processionary caterpillars t1 (γ) π(γ|y, X) 0,1,2,4,5 0.2316 0,1,2,4,5,9 0.0374 0,1,9 0.0344 0,1,2,4,5,10 0.0328 0,1,4,5 0.0306 0,1,2,9 0.0250 0,1,2,4,5,7 0.0241 0,1,2,4,5,8 0.0238 0,1,2,4,5,6 0.0237 0,1,2,3,4,5 0.0232 0,1,6,9 0.0146 0,1,2,3,9 0.0145 0,9 0.0143 0,1,2,6,9 0.0135 0,1,4,5,9 0.0128 0,1,3,9 0.0117 0,1,2,8 0.0115 96 / 143
  • 97. Computational Bayesian Statistics Regression and variable selection Variable selection Pine processionary caterpillars (cont’d) Interpretation Model Mγ with the highest posterior probability is t1 (γ) = (1, 2, 4, 5), which corresponds to the variables - altitude, - slope, - height of the tree sampled in the center of the area, and - diameter of the tree sampled in the center of the area. Corresponds to the five variables identified in the R regression output 97 / 143
  • 98. Computational Bayesian Statistics Regression and variable selection Variable selection Noninformative extension For Zellner noninformative prior with π(c) = 1/c, we have ∞ π(γ|y, X) ∝ c−1 (c + 1)−(qγ +1)/2 y T y− c=1 −n/2 c T T −1 T y Xγ Xγ Xγ Xγ y . c+1 98 / 143
  • 99. Computational Bayesian Statistics Regression and variable selection Variable selection Pine processionary caterpillars t1 (γ) π(γ|y, X) 0,1,2,4,5 0.0929 0,1,2,4,5,9 0.0325 0,1,2,4,5,10 0.0295 0,1,2,4,5,7 0.0231 0,1,2,4,5,8 0.0228 0,1,2,4,5,6 0.0228 0,1,2,3,4,5 0.0224 0,1,2,3,4,5,9 0.0167 0,1,2,4,5,6,9 0.0167 0,1,2,4,5,8,9 0.0137 0,1,4,5 0.0110 0,1,2,4,5,9,10 0.0100 0,1,2,3,9 0.0097 0,1,2,9 0.0093 0,1,2,4,5,7,9 0.0092 0,1,2,6,9 0.0092 99 / 143
  • 100. Computational Bayesian Statistics Regression and variable selection Variable selection Stochastic search for the most likely model When k gets large, impossible to compute the posterior probabilities of the 2k models. Need of a tailored algorithm that samples from π(γ|y, X) and selects the most likely models. Can be done by Gibbs sampling , given the availability of the full conditional posterior probabilities of the γi ’s. If γ−i = (γ1 , . . . , γi−1 , γi+1 , . . . , γk ) (1 ≤ i ≤ k) π(γi |y, γ−i , X) ∝ π(γ|y, X) (to be evaluated in both γi = 0 and γi = 1) 100 / 143
  • 101. Computational Bayesian Statistics Regression and variable selection Variable selection Gibbs sampling for variable selection Initialization: Draw γ 0 from the uniform distribution on Γ (t−1) (t−1) Iteration t: Given (γ1 , . . . , γk ), generate (t) (t−1) (t−1) 1. γ1 according to π(γ1 |y, γ2 , . . . , γk , X) (t) 2. γ2 according to (t) (t−1) (t−1) π(γ2 |y, γ1 , γ3 , . . . , γk , X) . . . (t) (t) (t) p. γk according to π(γk |y, γ1 , . . . , γk−1 , X) 101 / 143
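A matching R sketch of this Gibbs scan over γ, assuming lmarg(), X, y and k from the enumeration sketch above are still in scope (for larger k, enumeration is impossible and this scan is the practical option):
      T <- 2000
      gam <- rep(1, k); out <- matrix(NA, T, k)
      for (t in 1:T) {
        for (i in 1:k) {
          g1 <- g0 <- gam; g1[i] <- 1; g0[i] <- 0
          p1 <- 1 / (1 + exp(lmarg(g0) - lmarg(g1)))   # P(gamma_i = 1 | y, gamma_-i)
          gam[i] <- rbinom(1, 1, p1)
        }
        out[t, ] <- gam
      }
      colMeans(out[-(1:500), ])                        # approximate inclusion probabilities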
  • 102. Computational Bayesian Statistics Regression and variable selection Variable selection MCMC interpretation After T ≫ 1 MCMC iterations, output used to approximate the posterior probabilities π(γ|y, X) by empirical averages
        π̂(γ|y, X) = 1/(T − T0 + 1) Σ_{t=T0}^{T} I_{γ(t) = γ} ,
    where the T0 first values are eliminated as burnin. And approximation of the probability to include the i-th variable,
        P̂^π(γi = 1|y, X) = 1/(T − T0 + 1) Σ_{t=T0}^{T} I_{γi(t) = 1} .
    102 / 143
  • 103. Computational Bayesian Statistics Regression and variable selection Variable selection Pine processionary caterpillars γi P π (γi = 1|y, X) P π (γi = 1|y, X) γ1 0.8624 0.8844 γ2 0.7060 0.7716 γ3 0.1482 0.2978 γ4 0.6671 0.7261 γ5 0.6515 0.7006 γ6 0.1678 0.3115 γ7 0.1371 0.2880 γ8 0.1555 0.2876 γ9 0.4039 0.5168 γ10 0.1151 0.2609 ˜ Probabilities of inclusion with both informative (β = 011 , c = 100) and noninformative Zellner’s priors 103 / 143
  • 104. Computational Bayesian Statistics Generalized linear models Generalized linear models 3 Generalized linear models Generalisation of linear models Metropolis–Hastings algorithms The Probit Model The logit model 104 / 143
  • 105. Computational Bayesian Statistics Generalized linear models Generalisation of linear models Generalisation of the linear dependence Broader class of models to cover various dependence structures. Class of generalised linear models (GLM) where y|x, β ∼ f (y|xT β) . i.e., dependence of y on x partly linear 105 / 143
  • 106. Computational Bayesian Statistics Generalized linear models Generalisation of linear models Specifications of GLM’s Definition (GLM) A GLM is a conditional model specified by two functions: 1 the density f of y given x parameterised by its expectation parameter µ = µ(x) [and possibly its dispersion parameter ϕ = ϕ(x)] 2 the link g between the mean µ and the explanatory variables, written customarily as g(µ) = xT β or, equivalently, E[y|x, β] = g −1 (xT β). For identifiability reasons, g needs to be bijective. 106 / 143
  • 107. Computational Bayesian Statistics Generalized linear models Generalisation of linear models Likelihood Obvious representation of the likelihood n ℓ(β, ϕ|y, X) = f yi |xiT β, ϕ i=1 with parameters β ∈ Rk and ϕ > 0. 107 / 143
  • 108. Computational Bayesian Statistics Generalized linear models Generalisation of linear models Examples Case of binary and binomial data, when yi|xi ∼ B(ni, p(xi)) with known ni Logit [or logistic regression] model Link is logit transform on probability of success g(pi) = log(pi/(1 − pi)), with likelihood
        Π_{i=1}^{n} (ni choose yi) [ exp(xi^T β) / (1 + exp(xi^T β)) ]^{yi} [ 1 / (1 + exp(xi^T β)) ]^{ni − yi}
        ∝ exp{ Σ_{i=1}^{n} yi xi^T β } / Π_{i=1}^{n} [ 1 + exp(xi^T β) ]^{ni}
    108 / 143
  • 109. Computational Bayesian Statistics Generalized linear models Generalisation of linear models Canonical link Special link function g that appears in the natural exponential family representation of the density g ⋆ (µ) = θ if f (y|µ) ∝ exp{T (y) · θ − Ψ(θ)} Example Logit link is canonical for the binomial model, since ni pi f (yi |pi ) = exp yi log + ni log(1 − pi ) , yi 1 − pi and thus θi = log pi /(1 − pi ) 109 / 143
  • 110. Computational Bayesian Statistics Generalized linear models Generalisation of linear models Examples (2) Customary to use the canonical link, but only customary ... Probit model Probit link function given by g(µi ) = Φ−1 (µi ) where Φ standard normal cdf Likelihood n ℓ(β|y, X) ∝ Φ(xiT β)yi (1 − Φ(xiT β))ni −yi . i=1 Full processing 110 / 143
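For the binary case (ni = 1), this likelihood is easy to code directly; the sketch below is a stand-alone illustration and not the probitlpost() function used earlier in the slides' code:
      llik_probit <- function(beta, y, X) {            # probit log-likelihood, binary data
        p <- pnorm(X %*% beta)
        sum(y * log(p) + (1 - y) * log(1 - p))
      }
      # quick check against glm() on simulated data
      set.seed(5)
      X <- cbind(1, rnorm(200)); beta0 <- c(-0.3, 1)
      y <- rbinom(200, 1, pnorm(X %*% beta0))
      fit <- glm(y ~ X - 1, family = binomial(link = "probit"))
      llik_probit(coef(fit), y, X); logLik(fit)        # the two values agree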
  • 111. Computational Bayesian Statistics Generalized linear models Generalisation of linear models Log-linear models Standard approach to describe associations between several categorical variables, i.e, variables with finite support Sufficient statistic: contingency table, made of the cross-classified counts for the different categorical variables. Example (Titanic survivors)
                            Child             Adult
    Survivor  Class    Male  Female      Male  Female
    No        1st         0       0       118       4
              2nd         0       0       154      13
              3rd        35      17       387      89
              Crew        0       0       670       3
    Yes       1st         5       1        57     140
              2nd        11      13        14      80
              3rd        13      14        75      76
              Crew        0       0       192      20
    111 / 143
  • 112. Computational Bayesian Statistics Generalized linear models Generalisation of linear models Poisson regression model 1 Each count yi is Poisson with mean µi = µ(xi ) 2 Link function connecting R+ with R, e.g. logarithm g(µi ) = log(µi ). Corresponding likelihood n 1 ℓ(β|y, X) = exp yi xiT β − exp(xiT β) . i=1 yi ! 112 / 143
  • 113. Computational Bayesian Statistics Generalized linear models Metropolis–Hastings algorithms Metropolis–Hastings algorithms Convergence assessment Posterior inference in GLMs harder than for linear models c Working with a GLM requires specific numerical or simulation tools [E.g., GLIM in classical analyses] Opportunity to introduce universal MCMC method: Metropolis–Hastings algorithm 113 / 143
  • 114. Computational Bayesian Statistics Generalized linear models Metropolis–Hastings algorithms Generic MCMC sampler Metropolis–Hastings algorithms are generic/down-the-shelf MCMC algorithms Only require likelihood up to a constant [difference with Gibbs sampler] can be tuned with a wide range of possibilities [difference with Gibbs sampler & blocking] natural extensions of standard simulation algorithms: based on the choice of a proposal distribution [difference in Markov proposal q(x, y) and acceptance] 114 / 143
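As a preview of the algorithm itself, here is a generic random-walk Metropolis–Hastings sketch in R, on a target known up to a constant (a standard normal, so the output can be checked):
      target_log <- function(th) -th^2 / 2          # log target, up to a constant
      T <- 10000; th <- numeric(T); th[1] <- 5      # deliberately remote starting value
      for (t in 2:T) {
        prop <- th[t - 1] + rnorm(1, sd = 1)        # symmetric (random walk) proposal
        logr <- target_log(prop) - target_log(th[t - 1])
        th[t] <- if (log(runif(1)) < logr) prop else th[t - 1]
      }
      c(mean(th[-(1:1000)]), sd(th[-(1:1000)]))     # roughly 0 and 1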