Bayesian Neural Network
2017/01/27
¤ Bayesian neural network
¤ NIPS workshop on Bayesian Deep Learning
http://bayesiandeeplearning.org
¤ Bayesian neural network
¤ Stan, PyMC3, Edward
¤ Data D, weights w, model M
¤ Prior over the weights: p(w|M)
¤ Likelihood: p(D|w, M)
¤ Posterior over the weights w: p(w|D, M)
¤ Marginal likelihood (evidence): p(D|M)
¤ The data set D can be
¤ unsupervised: D = {x_n} = X, with p(D|w) = ∏_{n=1}^N p(x_n|w) = p(X|w)
¤ supervised: D = {x_n, y_n} = (X, Y), with p(D|w) = ∏_{n=1}^N p(y_n|x_n, w) = p(Y|X, w)
Bayes' rule gives the posterior:
p(w|D) = p(D|w) p(w) / p(D) = p(D|w) p(w) / ∫ p(D|w) p(w) dw
MAP estimation
¤ Instead of the full posterior, we can use a point estimate of the weights
¤ Maximum likelihood (MLE) maximizes log p(D|w)
¤ MAP additionally uses the prior and maximizes log p(w|D)
¤ The log prior acts as a regularizer on the MLE objective

w_MLE = arg max_w log p(D|w) = arg max_w Σ_n log p(x_n|w)
w_MAP = arg max_w log p(w|D) = arg max_w [ log p(D|w) + log p(w) ]
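A minimal sketch of the MLE/MAP relationship on a toy 1D regression model (the data, the noise level, and the Gaussian prior below are illustrative assumptions; with a zero-mean Gaussian prior the log-prior term is an L2 weight-decay penalty):

```python
# MAP = MLE objective + log prior; for a Gaussian prior this is an L2 penalty.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

beta = 1.0 / 0.5**2   # observation precision
alpha = 1.0           # prior precision, p(w) = N(0, alpha^-1)

def neg_log_likelihood(w):
    return 0.5 * beta * np.sum((y - w * x) ** 2)

def neg_log_posterior(w):
    return neg_log_likelihood(w) + 0.5 * alpha * w ** 2  # extra term = -log prior

ws = np.linspace(0.0, 4.0, 2001)
w_mle = ws[np.argmin([neg_log_likelihood(w) for w in ws])]
w_map = ws[np.argmin([neg_log_posterior(w) for w in ws])]
print(w_mle, w_map)  # w_map is pulled slightly towards 0 by the prior
```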
Prediction
¤ For a new input x*, predictions average over the posterior on the weights w
¤ Density estimation: p(x*|D) = ∫ p(x*|w) p(w|D) dw = E_{p(w|D)}[ p(x*|w) ]
¤ Regression/classification: p(y*|x*, D) = ∫ p(y*|x*, w) p(w|D) dw = E_{p(w|D)}[ p(y*|x*, w) ]
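A minimal sketch of how this predictive distribution is evaluated in practice, by a Monte Carlo average over posterior weight samples (the toy likelihood, the noise level, and the stand-in posterior samples below are illustrative assumptions):

```python
#   p(y*|x*, D) ≈ (1/S) Σ_s p(y*|x*, w_s),   w_s ~ p(w|D)
import numpy as np
from scipy.stats import norm

def predictive_density(y_star, x_star, w_samples, noise_std=0.5):
    # assumed toy likelihood: p(y*|x*, w) = N(y* | w * x*, noise_std^2)
    densities = [norm.pdf(y_star, loc=w * x_star, scale=noise_std) for w in w_samples]
    return np.mean(densities)

# stand-in for samples from p(w|D), e.g. obtained by MCMC
w_samples = np.random.normal(loc=2.0, scale=0.1, size=1000)
print(predictive_density(y_star=2.1, x_star=1.0, w_samples=w_samples))
```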
Why keep the full posterior?
¤ Predictions come with calibrated uncertainty:
p(x*|D) = ∫ p(x*|w) p(w|D) dw = E_{p(w|D)}[ p(x*|w) ]
¤ Models can be compared through the model posterior:
p(M|D) = p(D|M) p(M) / p(D)
¤ Both require the posterior over the weights:
p(w|D) = p(D|w) p(w) / p(D) = p(D|w) p(w) / ∫ p(D|w) p(w) dw
The difficulty
¤ For a neural network, the posterior over the weights w is intractable
¤ The normalizing integral ∫ p(D|w) p(w) dw has no closed form
p(w|D) = p(D|w) p(w) / p(D) = p(D|w) p(w) / ∫ p(D|w) p(w) dw
Approximating the posterior
¤ Two main families of methods:
1. MCMC
2. Variational inference

1. MCMC
¤ Draw samples from p(w|D) with a Markov chain (see the sketch below)
¤ Asymptotically exact, but expensive for large networks and data sets

2. Variational inference
¤ Approximate p(w|D) with a simpler parametric distribution q(w)
¤ Turns posterior inference into optimization
¤ Scales better, but the approximation is only as good as the chosen family for q
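Below is a minimal sketch of the MCMC idea on a toy one-weight model (random-walk Metropolis; a real Bayesian NN would use HMC/NUTS, and the model, data, and step size here are made-up assumptions for illustration):

```python
# Random-walk Metropolis sampling of p(w|D) ∝ p(D|w) p(w) for a 1D toy model.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(scale=0.3, size=30)

def log_post(w, alpha=1.0, beta=1.0 / 0.3**2):
    # Gaussian likelihood + Gaussian prior, up to an additive constant
    return -0.5 * beta * np.sum((y - w * x) ** 2) - 0.5 * alpha * w ** 2

samples, w = [], 0.0
for _ in range(5000):
    w_prop = w + rng.normal(scale=0.1)                     # propose a local move
    if np.log(rng.uniform()) < log_post(w_prop) - log_post(w):
        w = w_prop                                         # accept
    samples.append(w)

print(np.mean(samples[1000:]), np.std(samples[1000:]))    # posterior mean and spread
```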
Weight Uncertainty in Neural Networks
Figure 1. Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.
is related to recent methods in deep, generative modelling (Kingma and Welling, 2014; Rezende et al., 2014; Gregor et al., 2014), where variational inference has been applied to stochastic hidden units of an autoencoder. Whilst the number of stochastic hidden units might be in the order of thousands, the number of weights in a neural network is easily two orders of magnitude larger.

For classification, the network outputs the parameters of a categorical distribution (passed through the exponential function and then normalised); for regression Y is R and P(y|x, w) is a Gaussian distribution, which corresponds to a squared loss. Inputs x are mapped onto the parameters of a distribution on Y by several successive layers of linear transformation (given by w) interleaved with element-wise non-linear transforms.

The weights can be learnt by maximum likelihood estimation (MLE): given a set of training examples D = (x_i, y_i)_i, the MLE weights w_MLE are given by:

w_MLE = arg max_w log P(D|w) = arg max_w Σ_i log P(y_i|x_i, w)

This is typically achieved by gradient descent (backpropagation).
NN vs. Bayesian NN
¤ An ordinary NN uses a single weight setting; a Bayesian NN keeps a posterior over the weights and predicts by marginalizing over it:
p(ŷ | x̂, D) = ∫ p(ŷ | x̂, w) p(w | D) dw
[Blundell+ 2015]
¤ “The Importance of Knowing What We Don't Know (by Yarin Gal)”
WHY SHOULD WE CARE?
Calibrated model and prediction uncertainty: getting
systems that know when they don’t know.
Automatic model complexity control and structure learning
(Bayesian Occam’s Razor)
Figure from Yarin Gal’s thesis “Uncertainty in Deep Learning” (2016)
Zoubin Ghahramani
http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
¤ stochastic neural network
¤ VAE
https://jmhl.org/research/
http://evelinag.com/blog/2014/09-15-introducing-ariadne/#.WI8X7LaLTEY
Bayesian linear regression (PRML, Chapter 3)
¤ Before neural networks, recall the fully tractable case: a linear model with a Gaussian prior (precision α) and Gaussian observation noise (precision β)
3.1.5 Multiple outputs

So far, we have considered the case of a single target variable t. In some applications, we may wish to predict K > 1 target variables, which we denote collectively by the target vector t. This could be done by introducing a different set of basis functions for each component of t, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that

y(x, w) = Wᵀ φ(x)   (3.31)

where y is a K-dimensional column vector, W is an M × K matrix of parameters, and φ(x) is an M-dimensional column vector with elements φ_j(x), with φ_0(x) = 1 as before. Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form

p(t|x, W, β) = N(t | Wᵀ φ(x), β⁻¹ I).   (3.32)

If we have a set of observations t_1, ..., t_N, we can combine these into a matrix T of size N × K such that the nth row is given by t_nᵀ. Similarly, we can combine the input vectors x_1, ..., x_N into a matrix X. The log likelihood function is then given by

ln p(T|X, W, β) = Σ_{n=1}^N ln N(t_n | Wᵀ φ(x_n), β⁻¹ I)
               = (NK/2) ln(β/2π) − (β/2) Σ_{n=1}^N ||t_n − Wᵀ φ(x_n)||².   (3.33)
Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian. We can evaluate this distribution by the usual procedure of completing the square in the exponential, and then finding the normalization coefficient using the standard result for a normalized Gaussian. However, we have already done the necessary work in deriving the general result (2.116), which allows us to write down the posterior distribution directly in the form

p(w|t) = N(w | m_N, S_N)   (3.49)

where

m_N = S_N (S_0⁻¹ m_0 + β Φᵀ t)   (3.50)
S_N⁻¹ = S_0⁻¹ + β Φᵀ Φ.   (3.51)

Note that because the posterior distribution is Gaussian, its mode coincides with its mean. Thus the maximum posterior weight vector is simply given by w_MAP = m_N. If we consider an infinitely broad prior S_0 = α⁻¹ I with α → 0, the mean m_N of the posterior distribution reduces to the maximum likelihood value w_ML given by (3.15). Similarly, if N = 0, then the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point, such that the new posterior distribution is again given by (3.49).

For the remainder of this chapter, we shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we consider a zero-mean isotropic Gaussian governed by a single precision parameter α so that

p(w|α) = N(w | 0, α⁻¹ I)   (3.52)

and the corresponding posterior distribution over w is then given by (3.49) with

m_N = β S_N Φᵀ t   (3.53)
S_N⁻¹ = α I + β Φᵀ Φ.   (3.54)

The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of w, takes the form

ln p(w|t) = −(β/2) Σ_{n=1}^N {t_n − wᵀ φ(x_n)}² − (α/2) wᵀ w + const.   (3.55)

Maximization of this posterior distribution with respect to w is therefore equivalent to minimizing the sum-of-squares error function with a quadratic regularization term.
PRML
3.3.2 Predictive distribution

In practice, we are not usually interested in the value of w itself but rather in making predictions of t for new values of x. This requires that we evaluate the predictive distribution defined by

p(t|t, α, β) = ∫ p(t|w, β) p(w|t, α, β) dw   (3.57)

in which t is the vector of target values from the training set, and we have omitted the corresponding input vectors from the right-hand side of the conditioning statements to simplify the notation. The conditional distribution p(t|x, w, β) of the target variable is given by (3.8), and the posterior weight distribution is given by (3.49). We see that (3.57) involves the convolution of two Gaussian distributions, and so making use of the result (2.115) from Section 8.1.4, we see that the predictive distribution takes the form

p(t|x, t, α, β) = N(t | m_Nᵀ φ(x), σ_N²(x))   (3.58)

where the variance σ_N²(x) of the predictive distribution is given by

σ_N²(x) = 1/β + φ(x)ᵀ S_N φ(x).   (3.59)

The first term in (3.59) represents the noise on the data whereas the second term reflects the uncertainty associated with the parameters w. Because the noise process and the distribution of w are independent Gaussians, their variances are additive. Note that, as additional data points are observed, the posterior distribution becomes narrower. As a consequence it can be shown (Qazaz et al., 1997) that σ²_{N+1}(x) ≤ σ²_N(x). In the limit N → ∞, the second term in (3.59) goes to zero, and the variance of the predictive distribution arises solely from the additive noise governed by the parameter β.

As an illustration of the predictive distribution for Bayesian linear regression models, let us return to the synthetic sinusoidal data set of Section 1.1, shown in Figure 3.8.
Figure 3.8 Examples of the predictive distribution (3.58) for a model consisting of 9 Gaussian basis functions
of the form (3.4) using the synthetic sinusoidal data set of Section 1.1. See the text for a detailed discussion.
PRML
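A minimal numerical sketch of these equations, i.e. the posterior (3.53)-(3.54) and the predictive distribution (3.58)-(3.59); the data, basis-function widths, and the values of α and β below are made-up assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 25
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

centres = np.linspace(0, 1, 9)
def phi(x):
    # 9 Gaussian basis functions, as in Figure 3.8
    return np.exp(-0.5 * ((np.atleast_1d(x)[:, None] - centres) / 0.1) ** 2)

alpha, beta = 2.0, 25.0
Phi = phi(x)
S_N_inv = alpha * np.eye(len(centres)) + beta * Phi.T @ Phi          # (3.54)
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t                                         # (3.53)

x_star = np.array([0.3])
phi_star = phi(x_star)
mean = phi_star @ m_N                                                # (3.58)
var = 1.0 / beta + np.einsum('ij,jk,ik->i', phi_star, S_N, phi_star) # (3.59)
print(mean, np.sqrt(var))   # predictive mean and standard deviation at x* = 0.3
```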
Bayesian neural networks via the Laplace approximation
¤ For a neural network, the posterior p(w|D) is non-Gaussian and intractable
¤ Approximate it with a Gaussian centred at a mode (PRML §5.7):
1. Assume a Gaussian prior p(w|α) and Gaussian likelihood p(t|x, w, β)
2. Find a (local) maximum of the posterior, w_MAP
3. Evaluate the matrix of second derivatives (Hessian) of the negative log posterior at w_MAP
4. Use the Gaussian approximation q(w|D) = N(w | w_MAP, A⁻¹)
chapters and so we can exploit the results obtained there. We can then make use of the evidence framework to provide point estimates for the hyperparameters and to compare alternative models (for example, networks having different numbers of hidden units). To start with, we shall discuss the regression case and then later consider the modifications needed for solving classification tasks.

5.7.1 Posterior parameter distribution

Consider the problem of predicting a single continuous target variable t from a vector x of inputs (the extension to multiple targets is straightforward). We shall suppose that the conditional distribution p(t|x) is Gaussian, with an x-dependent mean given by the output of a neural network model y(x, w), and with precision (inverse variance) β

p(t|x, w, β) = N(t | y(x, w), β⁻¹).   (5.161)

Similarly, we shall choose a prior distribution over the weights w that is Gaussian of the form

p(w|α) = N(w | 0, α⁻¹ I).   (5.162)

For an i.i.d. data set of N observations x_1, ..., x_N, with a corresponding set of target values D = {t_1, ..., t_N}, the likelihood function is given by

p(D|w, β) = ∏_{n=1}^N N(t_n | y(x_n, w), β⁻¹)   (5.163)

and so the resulting posterior distribution is then

p(w|D, α, β) ∝ p(w|α) p(D|w, β)   (5.164)
which, as a consequence of the nonlinear dependence of y(x, w) on w, will be non-Gaussian.

We can find a Gaussian approximation to the posterior distribution by using the Laplace approximation. To do this, we must first find a (local) maximum of the log posterior, which takes the
form

ln p(w|D) = −(α/2) wᵀ w − (β/2) Σ_{n=1}^N {y(x_n, w) − t_n}² + const   (5.165)

which corresponds to a regularized sum-of-squares error function. Assuming for the moment that α and β are fixed, we can find a maximum of the posterior, which we denote w_MAP, by standard nonlinear optimization algorithms such as conjugate gradients, using error backpropagation to evaluate the required derivatives.

Having found a mode w_MAP, we can then build a local Gaussian approximation by evaluating the matrix of second derivatives of the negative log posterior distribution. From (5.165), this is given by

A = −∇∇ ln p(w|D, α, β) = α I + β H   (5.166)

where H is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of w. Algorithms for computing and approximating the Hessian were discussed in Section 5.4. The corresponding Gaussian approximation to the posterior is then given from (4.134) by

q(w|D) = N(w | w_MAP, A⁻¹).   (5.167)
Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior approximation: p(t|x, D) ≈ ∫ p(t|x, w) q(w|D) dw.
PRML
https://www.r-bloggers.com/easy-laplace-approximation-of-bayesian-models-in-r/
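A minimal sketch of the Laplace recipe on a one-parameter toy model (the data, the tanh "network", and the finite-difference Hessian below are illustrative assumptions, not part of the original slides):

```python
# Find w_MAP, then approximate p(w|D) by N(w_MAP, A^-1) with A = curvature of
# the negative log posterior at w_MAP (PRML §5.7, here in one dimension).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=40)
t = np.tanh(0.8 * x) + rng.normal(scale=0.1, size=40)
alpha, beta = 1.0, 1.0 / 0.1**2

def neg_log_post(w):
    return 0.5 * beta * np.sum((t - np.tanh(w * x)) ** 2) + 0.5 * alpha * w ** 2

# crude 1D optimisation and finite-difference Hessian, just to illustrate the idea
ws = np.linspace(-3, 3, 6001)
w_map = ws[np.argmin([neg_log_post(w) for w in ws])]
eps = 1e-3
A = (neg_log_post(w_map + eps) - 2 * neg_log_post(w_map) + neg_log_post(w_map - eps)) / eps**2

print("w_MAP =", w_map, " posterior std ≈", 1.0 / np.sqrt(A))  # q(w|D) = N(w_MAP, A^-1)
```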
Variational inference
¤ Introduce an approximating distribution q(w|φ) with variational parameters φ
¤ Minimize KL[ q(w|φ) || p(w|D) ] with respect to φ
¤ The KL cannot be computed directly, but

KL[ q(w|φ) || p(w|D) ] = −∫ q(w|φ) log [ p(D|w) p(w) / q(w|φ) ] dw + log p(D)
                       = −ℒ(D; φ) + log p(D)

¤ log p(D) does not depend on φ, so minimizing the KL is equivalent to maximizing the evidence lower bound (ELBO) ℒ(D; φ)
Working with the ELBO ℒ(D; φ)
1. Analytic evaluation
¤ Requires that the expectations under q (and hence the form of p(w|D)) be tractable, which is rarely the case for neural networks
2. Monte Carlo estimation of the ELBO
¤ Approximate the intractable expectation with MC samples from q (see the sketch after this list):
∫ q(w|φ) log [ p(D|w) p(w) / q(w|φ) ] dw = E_q[ log ( p(D|w) p(w) / q(w|φ) ) ]
3. EM-style coordinate updates
¤ Alternately maximize the ELBO with respect to groups of parameters, as in variational EM
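A minimal sketch of the Monte Carlo ELBO estimate on a toy one-weight model (the model, data, and Gaussian variational family below are illustrative assumptions):

```python
#   ELBO ≈ (1/S) Σ_s [ log p(D|w_s) + log p(w_s) − log q(w_s|phi) ],   w_s ~ q(w|phi)
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(scale=0.3, size=30)

def mc_elbo(mu, sigma, n_samples=200):
    w = rng.normal(mu, sigma, size=n_samples)                 # w_s ~ q(w|phi)
    log_lik = np.array([norm.logpdf(y, loc=wi * x, scale=0.3).sum() for wi in w])
    log_prior = norm.logpdf(w, loc=0.0, scale=1.0)
    log_q = norm.logpdf(w, loc=mu, scale=sigma)
    return np.mean(log_lik + log_prior - log_q)

print(mc_elbo(1.5, 0.1), mc_elbo(0.0, 1.0))  # a better q gives a higher ELBO
```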
¤ A brief history of Bayesian neural networks (following Yarin Gal)
¤ Denker, Schwartz, Wittner, Solla, Howard, Jackel, Hopfield (1987)
¤ Denker and LeCun (1991)
¤ MacKay (1992)
¤ Hinton and van Camp (1993)
¤
¤ Neal (1995)
¤ Barber and Bishop (1998)
¤ Graves (2011)
¤ Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015)
¤ Hernández-Lobato and Adams (2015)
¤ Practical Variational Inference for Neural Networks [Graves 2011]
¤ Fully factorized (diagonal) Gaussian approximation q(w|φ) with φ = {μ, σ}
¤ The ELBO splits into an expected log-likelihood term and a KL term to the prior:

ℒ(D; φ) = ∫ q(w|φ) log [ p(D|w) p(w) / q(w|φ) ] dw
        = E_{q(w|φ)}[ log p(D|w) ] − KL[ q(w|φ) || p(w) ]

¤ For the expected log-likelihood term, the gradients with respect to μ and σ² can be estimated with Monte Carlo samples w_s ~ q, using derivatives of log p(D|w) with respect to the weights (the KL term is analytic for Gaussians):

∂/∂μ  E_q[ log p(D|w) ] ≈ (1/S) Σ_s ∂ log p(D|w_s)/∂w
∂/∂σ² E_q[ log p(D|w) ] ≈ (1/2S) Σ_s ∂² log p(D|w_s)/∂w²
¤ Weight Uncertainty in Neural Networks [Blundell+ 2015]
¤ "Bayes by Backprop": learns a diagonal Gaussian q(w|φ) over the weights with unbiased, reparameterized gradient estimates
¤ http://www.slideshare.net/masa_s/weight-uncertainty-in-neural-networks
Bayes by Backprop

∂ℒ(D; φ)/∂φ = ∂/∂φ E_{q(w|φ)}[ f(w, φ) ],   where f(w, φ) = log [ p(D|w) p(w) / q(w|φ) ]
            = E_{q(ε)}[ ∂f(w, φ)/∂w · ∂w/∂φ + ∂f(w, φ)/∂φ ]

Reparameterization trick:
w = μ + diag(σ) ⊙ ε,   ε ~ N(0, I)
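A minimal sketch of the reparameterized gradient update for a single Gaussian weight (assumptions: a toy 1D model y = w·x, a standard normal prior, and hand-written gradients in place of automatic differentiation):

```python
# w = mu + sigma * eps with eps ~ N(0,1), so dw/dmu = 1 and dw/dlog_sigma = eps * sigma,
# which lets the gradient of the expected NLL flow into (mu, log_sigma).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(scale=0.3, size=30)
beta = 1.0 / 0.3**2

mu, log_sigma, lr = 0.0, 0.0, 1e-3
for step in range(2000):
    eps = rng.normal()
    sigma = np.exp(log_sigma)
    w = mu + sigma * eps                                   # reparameterized sample
    dnll_dw = -beta * np.sum((y - w * x) * x)              # gradient of the NLL w.r.t. w
    # KL[N(mu, sigma^2) || N(0, 1)] gradients: d/dmu = mu, d/dlog_sigma = sigma^2 - 1
    mu -= lr * (dnll_dw * 1.0 + mu)
    log_sigma -= lr * (dnll_dw * eps * sigma + sigma**2 - 1.0)

print(mu, np.exp(log_sigma))   # approximate posterior mean and std of the weight
```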
Dropout
¤ Dropout as a Bayesian Approximation [Gal+ 2015]
¤ Variational distribution factorizes over layers: q(ω) = ∏_i q(W_i), with W_i = M_i · diag(z_i), z_{i,j} ~ Bernoulli(p_i)
¤ The variational objective −E_{q(ω)}[log p(D|ω)] + KL[q(ω)||p(ω)] matches the standard dropout training objective (L2-regularized loss)
¤ Columns of W_i are randomly zeroed out, exactly as units are dropped in dropout
¤ The same argument covers drop-connect and multiplicative Gaussian noise
The results are summarised below: we obtain uncertainty estimates for dropout NNs. Consider a model with L layers and a loss function E(·,·) such as the softmax loss or the Euclidean (squared) loss. We denote by W_i the NN's weight matrices, and by b_i the bias vectors, for each layer i = 1, ..., L. We denote by y_i the observed output corresponding to input x_i, and the input and output sets as X, Y. During NN optimisation a regularisation term is often added; with L2 regularisation weighted by a weight decay λ, this results in a minimisation objective

L_dropout := (1/N) Σ_i E(y_i, ŷ_i) + λ Σ_{i=1}^L ( ||W_i||²_2 + ||b_i||²_2 ).   (1)

With dropout, binary variables are sampled for every input point and for every network unit in each layer (apart from the last one); each binary variable takes value 1 with probability p_i, otherwise the corresponding unit is dropped (i.e. its value is set to zero). The corresponding binary variables take the same values in the backward pass when propagating the derivatives to the parameters.
p(y|x, ω) = N( y; ŷ(x, ω), τ⁻¹ I_D )

ŷ(x, ω = {W_1, ..., W_L}) = √(1/K_L) W_L σ( ... √(1/K_1) W_2 σ( W_1 x + m_1 ) ... )
The posterior distribution p(ω|X, Y) in eq. (2) is in-
tractable. We use q(ω), a distribution over matrices whose
columns are randomly set to zero, to approximate the in-
tractable posterior. We define q(ω) as:
W_i = M_i · diag( [z_{i,j}]_{j=1}^{K_i} )
z_{i,j} ~ Bernoulli(p_i)   for i = 1, ..., L,  j = 1, ..., K_{i−1}
given some probabilities pi and matrices Mi as variational
parameters. The binary variable zi,j = 0 corresponds then
to unit j in layer i − 1 being dropped out as an input to
layer i. The variational distribution q(ω) is highly multi-
modal, inducing strong joint correlations over the rows of
the matrices Wi (which correspond to the frequencies in
the sparse spectrum GP approximation).
We minimise the KL divergence between the approximate
posterior q(ω) above and the posterior of the full deep GP,
p(ω|X, Y). This KL is our minimisation objective
− ∫ q(ω) log p(Y|X, ω) dω + KL( q(ω) || p(ω) ).   (3)
MC dropout
¤ Train with standard dropout; nothing about the training procedure changes
¤ At test time, keep dropout switched on and run T stochastic forward passes
¤ Average the outputs for the predictive mean, and use their spread as the model uncertainty (MC dropout, sketched below)
¤ http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
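A minimal sketch of MC dropout prediction (the tiny random-weight network, keep probability, and number of passes below are assumptions for illustration; in practice one reuses a dropout-trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 50)), np.zeros(50)
W2, b2 = rng.normal(size=(50, 1)), np.zeros(1)
p_keep = 0.9

def stochastic_forward(x):
    h = np.maximum(0.0, x @ W1 + b1)                          # ReLU hidden layer
    mask = rng.binomial(1, p_keep, size=h.shape) / p_keep     # dropout stays ON at test time
    return (h * mask) @ W2 + b2

x_star = np.array([[0.5]])
T = 100
preds = np.stack([stochastic_forward(x_star) for _ in range(T)])
mean = preds.mean(axis=0)   # predictive mean, cf. eq. (6) in Gal+ 2015
var = preds.var(axis=0)     # plus tau^-1 for the observation noise, per the paper
print(mean, var)
```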
where ω = {W_i}_{i=1}^L is our set of random variables for a model with L layers.

We will perform moment-matching and estimate the first two moments of the predictive distribution empirically. More specifically, we sample T sets of vectors of realisations from the Bernoulli distribution, {z_1^t, ..., z_L^t}_{t=1}^T with z_i^t = [z_{i,j}^t]_{j=1}^{K_i}, giving {W_1^t, ..., W_L^t}_{t=1}^T. We estimate

E_{q(y*|x*)}[ y* ] ≈ (1/T) Σ_{t=1}^T ŷ*(x*, W_1^t, ..., W_L^t)   (6)
following proposition C in the appendix. We refer to this
Monte Carlo estimate as MC dropout. In practice this
is equivalent to performing T stochastic forward passes
through the network and averaging the results.
This result has been presented in the literature before as
model averaging. We have given a new derivation for this
result which allows us to derive mathematically grounded
uncertainty estimates as well. Srivastava et al. (2014, sec-
tion 7.5) have reasoned empirically that MC dropout can
be approximated by averaging the weights of the network
(multiplying each Wi by pi at test time, referred to as stan-
dard dropout).
We estimate the second raw moment in the same way:

E_{q(y*|x*)}[ (y*)ᵀ (y*) ] ≈ τ⁻¹ I_D + (1/T) Σ_{t=1}^T ŷ*(x*, W_1^t, ..., W_L^t)ᵀ ŷ*(x*, W_1^t, ..., W_L^t)

following proposition D in the appendix. To obtain the model's predictive variance we have:

Var_{q(y*|x*)}[ y* ] ≈ τ⁻¹ I_D + (1/T) Σ_{t=1}^T ŷ*(x*, W_1^t, ..., W_L^t)ᵀ ŷ*(x*, W_1^t, ..., W_L^t) − E_{q(y*|x*)}[y*]ᵀ E_{q(y*|x*)}[y*]

which equals the sample variance of the T stochastic forward passes plus the inverse model precision. The predictive log-likelihood log p(y*|x*, X, Y) can be approximated with a log-sum-exp over the T stochastic forward passes through the network. In the appendix (section 4.1) this derivation is extended to classification, with E(·) defined as the softmax loss and τ set to 1.

Our predictive distribution is expected to be highly multi-modal, since the approximating variational distribution over each weight matrix column is bi-modal, inducing a multi-modal distribution over each layer's weights. Note that the dropout NN model itself is not changed: to estimate the predictive mean and predictive uncertainty we simply collect the results of stochastic forward passes through the model. As a result, this information can be used with existing NN models trained with dropout. Furthermore, the forward passes can be done concurrently, resulting in constant running time identical to that of standard dropout.

5. Experiments

We next perform an extensive assessment of the uncertainty estimates obtained from dropout NNs and convnets on the tasks of regression and classification. We compare the uncertainty obtained from different architectures and non-linearities, look at interpolation and extrapolation, and show model uncertainty on classification tasks using MNIST as an example. We then show that by using dropout's uncertainty we can obtain a considerable improvement in predictive log-likelihood and RMSE compared to existing state-of-the-art methods.
¤ MC dropout
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
(a) Standard dropout with weight averaging (b) Gaussian process with SE covariance function
(c) MC dropout with ReLU non-linearities (d) MC dropout with TanH non-linearities
Figure 2. Predictive mean and uncertainties on the Mauna Loa CO2 concentrations dataset, for various models. In red is the
observed function (left of the dashed blue line); in blue is the predictive mean plus/minus two standard deviations (8 for fig. 2d).
Different shades of blue represent half a standard deviation. Marked with a dashed red line is a point far away from the data: standard
dropout confidently predicts an insensible value for the point; the other models predict insensible values as well but with the additional
information that the models are uncertain about their predictions.
model’s uncertainty in a Bayesian pipeline. We give a
quantitative assessment of the model’s performance in the
setting of reinforcement learning on a task similar to that
used in deep reinforcement learning (Mnih et al., 2015).
Using the results from the previous section, we begin by
qualitatively evaluating the dropout NN uncertainty on two
comparison. Fig. 2c shows the results of the same network
as in fig. 2a, but with MC dropout used to evaluate the pre-
dictive mean and uncertainty for the training and test sets.
Lastly, fig. 2d shows the same using the TanH network with
5 layers (plotted with 8 times the standard deviation for vi-
sualisation purposes). The shades of blue represent model
uncertainty: each colour gradient represents half a standard deviation.
Dropout
¤ Variational dropout and the local reparameterization trick [Kingma+ 2015]
¤ Sampling the weights directly gives gradient estimates whose variance does not shrink with the minibatch size
¤ Sampling the pre-activations instead (the local reparameterization trick) removes the covariance term and reduces the gradient variance
¤ Dropout itself can be interpreted as a particular variational posterior (multiplicative Gaussian noise)
¤ http://www.slideshare.net/masa_s/dl-hacks-variational-dropout-and-the-local-reparameterization-trick
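A minimal sketch contrasting the usual (global) reparameterization with the local reparameterization trick for one fully connected layer (the layer sizes, posterior means, and variances below are made-up illustrative values):

```python
# Instead of sampling W ~ N(M, S) and computing x @ W, sample the pre-activations
# directly: b_ij ~ N( sum_k x_ik m_kj, sum_k x_ik^2 s_kj ).
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 3, 5
x = rng.normal(size=(batch, d_in))
M = rng.normal(size=(d_in, d_out))          # posterior means of the weights
S = np.full((d_in, d_out), 0.1)             # posterior variances of the weights

# global reparameterization: one weight sample shared by the whole minibatch
W = M + np.sqrt(S) * rng.normal(size=M.shape)
act_global = x @ W

# local reparameterization: independent pre-activation noise per data point
mean = x @ M
var = (x ** 2) @ S
act_local = mean + np.sqrt(var) * rng.normal(size=mean.shape)

print(act_global.shape, act_local.shape)    # same distribution, lower-variance gradients
```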
2.2 Variance of the SGVB estimator
The theory of stochastic approximation tells us that stochastic gradient ascent using (3) will asymp-
totically converge to a local optimum for an appropriately declining step size and sufficient weight
updates [18], but in practice the performance of stochastic gradient ascent crucially depends on
the variance of the gradients. If this variance is too large, stochastic gradient descent will fail
to make much progress in any reasonable amount of time. Our objective function consists of an
expected log likelihood term that we approximate using Monte Carlo, and a KL divergence term
DKL(qφ(w)||p(w)) that we assume can be calculated analytically and otherwise be approximated
with Monte Carlo with similar reparameterization.
Assume that we draw minibatches of datapoints with replacement; see appendix F for a similar
analysis for minibatches without replacement. Using L_i as shorthand for log p(y_i|x_i, w = f(ϵ_i, φ)), the contribution to the likelihood for the i-th datapoint in the minibatch, the Monte Carlo estimator (3) may be rewritten as

L^SGVB_D(φ) = (N/M) Σ_{i=1}^M L_i,

whose variance is given by

Var[ L^SGVB_D(φ) ] = (N²/M²) ( Σ_{i=1}^M Var[L_i] + 2 Σ_{i=1}^M Σ_{j=i+1}^M Cov[L_i, L_j] )   (4)
                   = N² ( (1/M) Var[L_i] + ((M−1)/M) Cov[L_i, L_j] ),   (5)

where the variances and covariances are w.r.t. both the data distribution and the ϵ distribution, i.e. Var[L_i] = Var_{ϵ,x_i,y_i}[ log p(y_i|x_i, w = f(ϵ, φ)) ], with x_i, y_i drawn from the empirical distribution defined by the training set. As can be seen from (5), the total contribution to the variance by Var[L_i] is inversely proportional to the minibatch size M. However, the total contribution by the covariances does not decrease with M. In practice, this means that the variance of L^SGVB_D(φ) can be dominated by the covariances for even moderately large M.
2.3 Local Reparameterization Trick
We therefore propose an alternative estimator for which we have Cov [Li, Lj] = 0, so that the vari-
ance of our stochastic gradients scales as 1/M. We then make this new estimator computationally
efficient by not sampling ϵ directly, but only sampling the intermediate variables f(ϵ) through which ϵ influences L^SGVB_D(φ).
Automatic Differentiation Variational Inference
¤ The variational methods above require model-specific derivations of gradients and KL terms
¤ We would like variational inference that works automatically for any differentiable model
→ Automatic differentiation variational inference (ADVI)
Automatic Differentiation Variational Inference (ADVI)
¤ Automatic Differentiation Variational Inference [Kucukelbir+ 2016]
¤ Implemented in Stan, PyMC3, and Edward
¤ ADVI proceeds in three steps:
1. Transform the constrained latent variables θ of p(x, θ) to unconstrained ζ, giving p(x, ζ), posit a Gaussian q(ζ), and minimize KL[ q(ζ) || p(ζ|x) ]
2. Estimate the ELBO and its gradients with MC samples from q (via elliptical standardization)
3. Optimize the parameters of q by stochastic gradient ascent
Automatic transformation
¤ The latent variables θ may have constrained support (e.g. θ > 0)
¤ The Gaussian approximation lives on the whole real line
¤ So transform θ into unconstrained variables ζ = T(θ) and work in that space
Figure 1: Transforming the latent variable to real coordinate space. The purple line is the posterior, the green line is the approximation. (a) The latent variable space is R_{>0}. (a→b) T transforms the latent variable space to R. (b) The variational approximation is a Gaussian in real coordinate space.
Automatic transformation (example)
¤ For a positive latent variable, use ζ = T(θ) = log(θ)
¤ The transformed density picks up a Jacobian factor for T⁻¹
Identify the transformed variables as ζ = T(θ). The transformed joint density p(x, ζ) has the representation

p(x, ζ) = p(x, T⁻¹(ζ)) |det J_{T⁻¹}(ζ)|,

where p(x, θ = T⁻¹(ζ)) is the joint density in the original latent variable space, and J_{T⁻¹}(ζ) is the Jacobian of the inverse of T. Transformations of continuous probability densities require a Jacobian; it accounts for how the transformation warps unit volumes and ensures that the transformed density integrates to one (Olive, 2014). (See Appendix A.)

Consider again our running Weibull-Poisson example from Section 2.1. The latent variable θ lives in R_{>0}. The logarithm ζ = T(θ) = log(θ) transforms R_{>0} to the real line R. Its Jacobian adjustment is the derivative of the inverse of the logarithm, |det J_{T⁻¹}(ζ)| = exp(ζ). The transformed density is

p(x, ζ) = Poisson(x | exp(ζ)) × Weibull(exp(ζ); 1.5, 1) × exp(ζ).

Figure 1 depicts this transformation.

As we describe in the introduction, we implement our algorithm in Stan (Stan Development Team, 2015). Stan maintains a library of transformations and their corresponding Jacobians. With Stan, we can automatically transform the joint density of any differentiable probability model to one with real-valued latent variables. (See Figure 2.)

Figure 2: Specifying a simple nonconjugate probability model in Stan.

2.4 Variational Approximations in Real Coordinate Space

After the transformation, the latent variables ζ have support in the real coordinate space R^K. We have a choice of variational approximations in this space. Here, we consider Gaussian distributions; these implicitly induce non-Gaussian variational distributions in the original latent variable space.
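A minimal sketch of evaluating this transformed joint density for the Weibull-Poisson example (assumption: scipy's weibull_min with shape 1.5 and scale 1 stands in for the Weibull(1.5, 1) density):

```python
# log p(x, zeta) = log Poisson(x | exp(zeta)) + log Weibull(exp(zeta); 1.5, 1) + zeta
import numpy as np
from scipy.stats import poisson, weibull_min

def log_joint_transformed(zeta, x):
    theta = np.exp(zeta)                       # T^-1(zeta)
    return (poisson.logpmf(x, theta)
            + weibull_min.logpdf(theta, 1.5, scale=1.0)
            + zeta)                            # log |det J_{T^-1}(zeta)| = zeta

# the transformed density is defined on the whole real line, so a Gaussian q(zeta) fits
print(log_joint_transformed(np.array([-1.0, 0.0, 1.0]), x=2))
```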
Variational approximations in real coordinate space
¤ Mean-field Gaussian: q(ζ; φ) = N(ζ; μ, diag(σ²))
¤ Full-rank Gaussian: q(ζ; φ) = N(ζ; μ, Σ), with Σ = LLᵀ parameterized by a Cholesky factor L
¤ The full-rank version captures posterior correlations but is more expensive
Mean-field Gaussian. One option is to posit a factorized (mean-field) Gaussian variational approximation

q(ζ; φ) = N(ζ; μ, diag(σ²)) = ∏_{k=1}^K N(ζ_k; μ_k, σ²_k),

where the vector φ = (μ_1, ..., μ_K, σ²_1, ..., σ²_K) concatenates the mean and variance of each Gaussian factor. Since the variance parameters must always be positive, the variational parameters live in the set Φ = {R^K, R^K_{>0}}. Re-parameterizing the mean-field Gaussian removes this constraint: consider the logarithm of the standard deviations, ω = log(σ), applied element-wise. The support of ω is now the real coordinate space and σ is always positive. The mean-field Gaussian becomes q(ζ; φ) = N(ζ; μ, diag(exp(ω)²)), where the vector φ = (μ_1, ..., μ_K, ω_1, ..., ω_K) concatenates the mean and logarithm of the standard deviation of each factor. Now the variational parameters are unconstrained in R^{2K}.

Full-rank Gaussian. Another option is to posit a full-rank Gaussian variational approximation

q(ζ; φ) = N(ζ; μ, Σ),

where the vector φ = (μ, Σ) concatenates the mean vector μ and covariance matrix Σ. To ensure that Σ always remains positive semidefinite, we re-parameterize the covariance matrix using a Cholesky factorization, Σ = LLᵀ. We use the non-unique definition of the Cholesky factorization where the diagonal elements of L need not be positively constrained (Pinheiro and Bates, 1996). Therefore L lives in the unconstrained space of lower-triangular matrices with K(K+1)/2 real-valued entries. The full-rank Gaussian becomes q(ζ; φ) = N(ζ; μ, LLᵀ), where the variational parameters φ = (μ, L) are unconstrained in R^{K+K(K+1)/2}.

The full-rank Gaussian generalizes the mean-field Gaussian approximation. The off-diagonal terms of the covariance matrix Σ capture posterior correlations across latent random variables. This leads to a more accurate posterior approximation than the mean-field Gaussian; however, it comes at a computational cost. Various low-rank approximations to the covariance matrix reduce this cost, yet limit its accuracy.

⁴ Stan provides various transformations for upper and lower bounds, simplex and ordered vectors, and structured matrices such as covariance matrices and Cholesky factors.
Optimizing the ELBO
¤ Write the ELBO in the transformed (real coordinate) space
¤ The expectation is still intractable, so estimate it with MC samples
¤ Use the reparameterization trick (elliptical standardization, ζ = μ + ...) to push the gradient inside the expectation
¤ The ELBO gradient can then be computed with automatic differentiation
2.5 The Variational Problem in Real Coordinate Space

Here is the story so far. We began with a differentiable probability model p(x, θ). We transformed the latent variables into ζ, which live in the real coordinate space. We defined variational approximations in the transformed space. Now, we consider the variational optimization problem.

Write the variational objective function, the ELBO, in real coordinate space as

ℒ(φ) = E_{q(ζ;φ)}[ log p(x, T⁻¹(ζ)) + log |det J_{T⁻¹}(ζ)| ] + H[ q(ζ; φ) ].   (5)

The inverse of the transformation T⁻¹ appears in the joint model, along with the determinant of the Jacobian adjustment. The ELBO is a function of the variational parameters φ and the entropy H, both of which depend on the variational approximation. (Derivation in Appendix B.)

Now, we can freely optimize the ELBO in the real coordinate space without worrying about the support-matching constraint. The optimization problem from Equation (3) becomes

φ* = argmax_φ ℒ(φ),   (6)

where the parameter vector φ lives in some appropriately dimensioned real coordinate space. This is an unconstrained optimization problem that we can solve using gradient ascent. Traditionally, this would require manual computation of gradients. Instead, we develop a stochastic gradient ascent algorithm that uses automatic differentiation to compute gradients and MC integration to approximate expectations.

We cannot directly use automatic differentiation on the ELBO. This is because the ELBO involves an unknown expectation. However, we can automatically differentiate the functions inside the expectation. (The model p and transformation T are both easy to represent as computer functions (Baydin et al., 2015).) To apply automatic differentiation, we want to push the gradient operation inside the expectation. To this end, we employ one final transformation: elliptical standardization (Härdle and Simar, 2012).
Elliptical standardization. Consider a transformation S_φ that absorbs the variational parameters φ; this converts the Gaussian variational approximation into a standard Gaussian. In the mean-field case, the standardization is η = S_φ(ζ) = diag(exp(ω))⁻¹ (ζ − μ). In the full-rank Gaussian, the standardization is η = S_φ(ζ) = L⁻¹(ζ − μ).

In both cases, the standardization encapsulates the variational parameters; in return it gives a fixed variational density

q(η) = N(η; 0, I) = ∏_{k=1}^K N(η_k; 0, 1),

as shown in Figure 3.

The standardization transforms the variational problem from Equation (5) into

φ* = argmax_φ E_{N(η;0,I)}[ log p(x, T⁻¹(S⁻¹_φ(η))) + log |det J_{T⁻¹}(S⁻¹_φ(η))| ] + H[ q(ζ; φ) ].

The expectation is now in terms of a standard Gaussian density. The Jacobian of elliptical standardization evaluates to one, because the Gaussian distribution is a member of the location-scale family: standardizing a Gaussian gives another Gaussian distribution. (See Appendix A.)

We do not need to transform the entropy term as it does not depend on the model or the transformation; we have a simple analytic form for the entropy of a Gaussian and its gradient. We implement these once and reuse for all models.
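A minimal sketch of the resulting mean-field Monte Carlo ELBO estimate, reusing the Weibull-Poisson example (assumptions: a single latent variable; a real ADVI implementation would differentiate this estimate with automatic differentiation rather than just evaluate it):

```python
import numpy as np
from scipy.stats import poisson, weibull_min

def log_joint_transformed(zeta, x):
    theta = np.exp(zeta)                          # T^-1(zeta), with log-Jacobian = zeta
    return poisson.logpmf(x, theta) + weibull_min.logpdf(theta, 1.5, scale=1.0) + zeta

def elbo_estimate(mu, omega, x, n_samples=500, rng=np.random.default_rng(0)):
    eta = rng.normal(size=n_samples)              # eta ~ N(0, I)
    zeta = mu + np.exp(omega) * eta               # zeta = S_phi^-1(eta)
    entropy = 0.5 * np.log(2 * np.pi * np.e) + omega   # analytic Gaussian entropy
    return np.mean(log_joint_transformed(zeta, x)) + entropy

print(elbo_estimate(mu=0.5, omega=-1.0, x=2))     # compare different (mu, omega) settings
```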
¤ Black-box variational inference [Ranganath+ 2014]
¤ BBVI uses the score-function (likelihood ratio) gradient estimator, so it only needs gradients of log q and not of the model
¤ ADVI instead uses the reparameterization trick, which typically gives lower-variance gradients
3.2 Variance of the Stochastic Gradients
ADVI uses Monte Carlo integration to approximate gradients of the ELBO, and then uses these gradients
in a stochastic optimization algorithm (Section 2). The speed of ADVI hinges on the variance of the
gradient estimates. When a stochastic optimization algorithm suffers from high-variance gradients, it
must repeatedly recover from poor parameter estimates.
ADVI is not the only way to compute Monte Carlo approximations of the gradient of the ELBO. Black
box variational inference (BBVI) takes a different approach (Ranganath et al., 2014). The BBVI gradient
estimator uses the gradient of the variational approximation and avoids using the gradient of the model.
For example, the following BBVI estimator

∇_μ^BBVI = E_{q(ζ;φ)}[ ∇_μ log q(ζ; φ) ( log p(x, T⁻¹(ζ)) + log |det J_{T⁻¹}(ζ)| − log q(ζ; φ) ) ]

and the ADVI gradient estimator in Equation (7) both lead to unbiased estimates of the exact gradient.
While BBVI is more general—it does not require the gradient of the model and thus applies to more
settings—its gradients can suffer from high variance.
Figure 8 empirically compares the variance of both estimators for two models. Figure 8a shows the vari-
ance of both gradient estimators for a simple univariate model, where the posterior is a Gamma(10,10).
We estimate the variance using ten thousand re-calculations of the gradient ∇φ , across an increasing
number of MC samples M. The ADVI gradient has lower variance; in practice, a single sample suffices.
(See the experiments in Section 4.)
Figure 8b shows the same calculation for a 100-dimensional nonlinear regression model with likelihood N(y | tanh(xᵀβ), I) and a Gaussian prior on the regression coefficients β. Because this is a
multivariate example, we also show the BBVI gradient with a variance reduction scheme using control
variates described in Ranganath et al. (2014). In both cases, the ADVI gradients are statistically more
efficient.
[Figure 8: variance of the gradient estimators as a function of the number of MC samples; (a) univariate model, (b) multivariate nonlinear regression model; curves for ADVI, BBVI, and BBVI with a control variate.]
Figure 8: Comparison of gradient estimator variances. The ADVI gradient estimator exhibits lower
variance than the BBVI estimator. Moreover, it does not require control variate variance reduction, which
is not available in univariate situations.
¤ Stan
¤ Probabilistic programming language with its own modeling language
¤ MCMC with NUTS [Hoffman+ 2014], a variant of HMC
¤ Interfaces for Python, R, and other languages
¤ Supports ADVI
¤ PyMC3
¤ Probabilistic programming in Python, originally centred on MCMC
¤ Built on Theano, so it can run on the GPU
¤ Supports ADVI (a usage sketch follows below)
¤ Edward
¤ Probabilistic programming library with model criticism built into its workflow
¤ Written in Python on top of TensorFlow; integrates with Keras
¤ Reported to be up to 35x faster than Stan and PyMC3 [Tran+ 2016]
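A minimal, hedged sketch of fitting a small model with ADVI in PyMC3 (assumptions: the PyMC3 3.x API, i.e. pm.fit(method='advi'); the toy regression model below is illustrative and not one from the slides):

```python
import numpy as np
import pymc3 as pm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(scale=0.3, size=100)

with pm.Model():
    w = pm.Normal('w', mu=0.0, sigma=1.0)
    sigma = pm.HalfNormal('sigma', sigma=1.0)    # constrained variable, handled by ADVI's transform
    pm.Normal('y_obs', mu=w * x, sigma=sigma, observed=y)
    approx = pm.fit(n=20000, method='advi')      # mean-field ADVI
    trace = approx.sample(1000)                  # draws from the fitted q

print(trace['w'].mean(), trace['w'].std())
```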
¤ Tars
¤ https://github.com/masa-su/Tars
¤ Starred by Edward's author (Tran) and a PyMC3 developer (Wiecki)
¤ Plays a similar role to Edward and PyMC3
(DL輪読)Matching Networks for One Shot LearningMasahiro Suzuki
 
深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習Masahiro Suzuki
 
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...Masahiro Suzuki
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi DivergenceMasahiro Suzuki
 
(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman FiltersMasahiro Suzuki
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural NetworksMasahiro Suzuki
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel LearningMasahiro Suzuki
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...Masahiro Suzuki
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task LearningMasahiro Suzuki
 
(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target PropagationMasahiro Suzuki
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization TrickMasahiro Suzuki
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?Masahiro Suzuki
 

More from Masahiro Suzuki (18)

深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)
 
確率的推論と行動選択
確率的推論と行動選択確率的推論と行動選択
確率的推論と行動選択
 
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究について
 
GAN(と強化学習との関係)
GAN(と強化学習との関係)GAN(と強化学習との関係)
GAN(と強化学習との関係)
 
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning
 
深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習
 
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence
 
(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
 
(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?
 

Recently uploaded

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Recently uploaded (20)

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

(DL hacks輪読)Bayesian Neural Network

  • 2. ¤ bayesian neural network ¤ NIPS http://bayesiandeeplearning.org ¤ bayesian neural network ¤ Stan PyMC3 Edward ¤ ¤ ¤
  • 4.
  • 5. ¤ The ingredients: data D, model M, and weights w ¤ p(w|M): prior over the weights ¤ p(D|w, M): likelihood ¤ p(w|D, M): posterior over the weights w ¤ p(D|M): marginal likelihood (evidence)
  • 6. ¤ Likelihood of a dataset D ¤ Unsupervised case: D = {x} = X, p(D|w) = ∏_{n=1}^{N} p(x_n|w) = p(X|w) ¤ Supervised case: D = {x, y} = (X, Y), p(D|w) = ∏_{n=1}^{N} p(y_n|x_n, w) = p(Y|X, w) ¤ Posterior over the weights: p(w|D) = p(D|w)p(w) / p(D) = p(D|w)p(w) / ∫ p(D|w)p(w) dw
  • 7. MLE and MAP ¤ Point estimates of the weights: w_MLE = argmax_w log p(D|w) = argmax_w Σ_{n} log p(x_n|w) ¤ w_MAP = argmax_w log p(w|D) = argmax_w [ log p(D|w) + log p(w) ]
  • 8. ¤ Prediction for a new input x* (and output y*) averages over the posterior on the weights ¤ p(x*|D) = ∫ p(x*|w) p(w|D) dw = E_{p(w|D)}[p(x*|w)] ¤ p(y*|x*, D) = ∫ p(y*|x*, w) p(w|D) dw = E_{p(w|D)}[p(y*|x*, w)]
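A minimal sketch of how this posterior-averaged prediction is approximated in practice: given samples w_s from (an approximation to) p(w|D), the predictive mean is the average of the per-sample predictions. The forward function, the toy network, and the fake posterior samples below are hypothetical placeholders, not part of the original slides.

```python
import numpy as np

def predictive_mean_and_var(x_star, weight_samples, forward):
    """Monte Carlo approximation of the predictive mean and spread at x*.

    x_star         : new input
    weight_samples : list of weight sets w_s drawn from (an approximation to) p(w|D)
    forward        : function (x, w) -> network output y(x, w)
    """
    preds = np.stack([forward(x_star, w) for w in weight_samples])
    return preds.mean(axis=0), preds.var(axis=0)

# Hypothetical usage with a toy one-hidden-layer network and fake posterior samples
forward = lambda x, w: np.tanh(x @ w["W"]) @ w["v"]
samples = [{"W": np.random.randn(3, 8), "v": np.random.randn(8)} for _ in range(100)]
mean, var = predictive_mean_and_var(np.ones(3), samples, forward)
```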
  • 9. ¤ All of the key quantities are posterior expectations or applications of Bayes' rule ¤ Predictive distribution: p(x*|D) = ∫ p(x*|w) p(w|D) dw = E_{p(w|D)}[p(x*|w)] ¤ Model posterior: p(M|D) = p(D|M)p(M) / p(D) ¤ Weight posterior: p(w|D) = p(D|w)p(w) / p(D) = p(D|w)p(w) / ∫ p(D|w)p(w) dw
  • 10. ¤ The bottleneck is the posterior over the weights w ¤ p(w|D) = p(D|w)p(w) / p(D) = p(D|w)p(w) / ∫ p(D|w)p(w) dw ¤ The normalizing integral over all weights is intractable for neural networks
  • 11. ¤ Two approaches to the intractable posterior: 1. MCMC 2. Variational approximation ¤ 1. MCMC: draw samples from p(w|D) ¤ 2. Variational inference: approximate the posterior with a tractable distribution q(w)
  • 12.
  • 13. ¤ A Bayesian neural network places a distribution over each weight instead of a single value ¤ Figure 1 of [Blundell+ 2015]: left, each weight has a fixed value, as provided by classical backpropagation; right, each weight is assigned a distribution, as provided by Bayes by Backprop ¤ Prediction again averages over the weights: p(y*|x*, D) = ∫ p(y*|x*, w) p(w|D) dw ¤ NN vs. Bayesian NN [Blundell+ 2015]
  • 14. ¤ “The Importance of Knowing What We Don't Know” (by Yarin Gal) ¤ Why should we care? Calibrated model and prediction uncertainty: getting systems that know when they don't know. Automatic model complexity control and structure learning (Bayesian Occam's Razor) ¤ Figure from Yarin Gal's thesis “Uncertainty in Deep Learning” (2016) (Zoubin Ghahramani) ¤ http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
  • 15. ¤ stochastic neural network ¤ VAE ¤ = ¤ ¤ https://jmhl.org/research/
  • 17.
  • 18. Bayesian linear regression (PRML §3.3) ¤ Linear-Gaussian model y(x, w) = w^T φ(x) with likelihood p(t|x, w, β) = N(t | w^T φ(x), β^{-1}), noise precision β ¤ Zero-mean isotropic Gaussian prior with precision α: p(w|α) = N(w | 0, α^{-1} I) ¤ With this conjugate prior the posterior is also Gaussian, p(w|t) = N(w | m_N, S_N), with m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ ¤ Log posterior: ln p(w|t) = −(β/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}² − (α/2) w^T w + const, so maximizing it is regularized least squares and w_MAP = m_N ¤ (The slide reproduces PRML equations 3.31–3.33 and 3.49–3.55)
  • 19. Predictive distribution (PRML §3.3.2) ¤ p(t|t, α, β) = ∫ p(t|w, β) p(w|t, α, β) dw ¤ The convolution of two Gaussians gives p(t|x, t, α, β) = N(t | m_N^T φ(x), σ_N²(x)) with σ_N²(x) = 1/β + φ(x)^T S_N φ(x) ¤ The first term is the observation noise; the second reflects the uncertainty in w. As more data are observed the posterior narrows, σ²_{N+1}(x) ≤ σ²_N(x), and in the limit N → ∞ only the noise term remains ¤ (The slide reproduces PRML equations 3.57–3.59 and Figure 3.8: examples of the predictive distribution for a model with 9 Gaussian basis functions on the synthetic sinusoidal data set)
  • 21. Laplace approximation for a Bayesian neural network (PRML §5.7) ¤ Likelihood p(t|x, w, β) = N(t | y(x, w), β^{-1}) and prior p(w|α) = N(w | 0, α^{-1} I), so p(w|D, α, β) ∝ p(w|α) p(D|w, β) is non-Gaussian because y(x, w) depends nonlinearly on w ¤ Procedure: 1. write the log posterior ln p(w|D) = −(α/2) w^T w − (β/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + const 2. find a mode w_MAP by nonlinear optimization, using backpropagation for the derivatives 3. take the second-order expansion around the mode, A = −∇∇ ln p(w|D, α, β) = α I + β H, where H is the Hessian of the sum-of-squares error 4. use the resulting Gaussian approximation q(w|D) = N(w | w_MAP, A^{-1}) for prediction ¤ (The slide reproduces PRML equations 5.161–5.167)
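A minimal Laplace-approximation sketch for a toy one-parameter model with a tanh regression function, following the four steps above: find the mode by gradient ascent and use a finite-difference second derivative for the curvature A. The data, learning rate, and precisions α, β are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
t = np.tanh(2.0 * x) + 0.1 * rng.normal(size=20)   # toy targets, true w ≈ 2

alpha, beta = 1.0, 100.0      # prior precision α, noise precision β

def log_post(w):              # ln p(w|D) up to an additive constant
    return -0.5 * alpha * w**2 - 0.5 * beta * np.sum((np.tanh(w * x) - t) ** 2)

# Steps 1-2: find the mode w_MAP by gradient ascent (finite-difference gradient)
w, h = 0.0, 1e-4
for _ in range(5000):
    grad = (log_post(w + h) - log_post(w - h)) / (2 * h)
    w += 1e-3 * grad
w_map = w

# Step 3: curvature at the mode, A = -d^2/dw^2 ln p(w|D)
A = -(log_post(w_map + h) - 2 * log_post(w_map) + log_post(w_map - h)) / h**2

# Step 4: Gaussian approximation q(w|D) = N(w_MAP, A^{-1})
print(w_map, 1.0 / A)
```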
  • 23. Variational inference ¤ Introduce an approximate posterior q(w|θ) with variational parameters θ ¤ Find the θ that minimizes KL[q(w|θ) ‖ p(w|D)] ¤ Since p(w|D) is itself unknown, rewrite the KL: KL[q(w|θ) ‖ p(w|D)] = −∫ q(w|θ) log [ p(D|w)p(w) / q(w|θ) ] dw + log p(D) = −ℒ(D; θ) + log p(D) ¤ Minimizing the KL is therefore equivalent to maximizing the ELBO ℒ(D; θ)
  • 24. Maximizing the ELBO ¤ Ways to handle the ELBO ℒ(D; θ): ¤ 1. Monte Carlo estimates that involve the posterior p(w|D) ¤ 2. Approximate the ELBO itself by Monte Carlo: ∫ q(w|θ) log [ p(D|w)p(w) / q(w|θ) ] dw = E_{q(w|θ)}[ log p(D|w)p(w) − log q(w|θ) ], estimated with samples from q ¤ 3. EM-style coordinate updates of the ELBO
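A minimal sketch of option 2, the Monte Carlo ELBO estimate, for a toy conjugate model (one weight with a standard normal prior and a unit-variance Gaussian likelihood). The data and the Gaussian q(w|θ) are assumptions for illustration only.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

# Toy model: prior w ~ N(0, 1), likelihood x_n ~ N(w, 1); approximation q(w|θ) = N(mu_q, sigma_q^2)
data = np.array([0.5, 1.2, 0.8, 1.0])

def elbo_mc(mu_q, sigma_q, n_samples=1000):
    w = np.random.normal(mu_q, sigma_q, size=n_samples)            # w_s ~ q(w|θ)
    log_lik = np.array([log_gauss(data, ws, 1.0).sum() for ws in w])
    log_prior = log_gauss(w, 0.0, 1.0)
    log_q = log_gauss(w, mu_q, sigma_q)
    return np.mean(log_lik + log_prior - log_q)                     # E_q[log p(D|w)p(w) - log q(w|θ)]

print(elbo_mc(0.9, 0.3))
```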
  • 25. ¤ A brief history (following Yarin Gal's thesis) ¤ Denker, Schwartz, Wittner, Solla, Howard, Jackel, Hopfield (1987) ¤ Denker and LeCun (1991) ¤ MacKay (1992) ¤ Hinton and van Camp (1993) ¤ ¤ Neal (1995) ¤ Barber and Bishop (1998) ¤ Graves (2011) ¤ Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015) ¤ Hernández-Lobato and Adams (2015)
  • 26. ¤ Practical Variational Inference for Neural Networks [Graves 2011] ¤ Diagonal Gaussian approximation q(w|θ) with θ = {μ, σ} ¤ Minimize the variational free energy F(D; θ) = −ℒ(D; θ) = −E_{q(w|θ)}[log p(D|w)] + KL[q(w|θ) ‖ p(w)], where ℒ(D; θ) = ∫ q(w|θ) log [ p(D|w)p(w) / q(w|θ) ] dw ¤ The gradients of the expected log-likelihood are estimated from S samples w_s ~ q(w|θ): ∂/∂μ ≈ (1/S) Σ_{s=1}^{S} ∂ log p(D|w_s)/∂w and ∂/∂σ² ≈ (1/(2S)) Σ_{s=1}^{S} ∂² log p(D|w_s)/∂w²
  • 27. ¤ Weight Uncertainty in Neural Networks [Blundell+ 2015] ¤ Bayes by Backprop: optimize the variational parameters θ of q(w|θ) by backpropagation ¤ http://www.slideshare.net/masa_s/weight-uncertainty-in-neural-networks ¤ ∂/∂θ ℒ(D; θ) = ∂/∂θ E_{q(w|θ)}[ f(w, θ) ] with f(w, θ) = log [ p(D|w)p(w) / q(w|θ) ] ¤ Reparameterization trick: with w = μ + diag(σ) ⊙ ε, ε ~ N(0, I), the gradient becomes ∂/∂θ ℒ(D; θ) = E_{q(ε)}[ (∂f(w, θ)/∂w)(∂w/∂θ) + ∂f(w, θ)/∂θ ]
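A minimal Bayes-by-Backprop-style sketch for a single-weight linear model, using the reparameterization above. Analytic partial derivatives stand in for a framework's autodiff, σ is optimized directly with a floor instead of Blundell's softplus parameterization, and the data are synthetic, so this illustrates the gradient estimator rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(scale=1.0, size=50)   # toy data, true weight 0.7

mu, sigma = 0.0, 1.0    # variational parameters θ = {μ, σ} of q(w|θ) = N(μ, σ²)
lr, S = 0.01, 10        # learning rate, samples per step

def df_dw(w):
    # ∂f/∂w with f(w, θ) = log p(D|w) + log p(w) − log q(w|θ)
    return np.sum((y - w * x) * x) - w + (w - mu) / sigma**2

for _ in range(2000):
    g_mu, g_sigma = 0.0, 0.0
    for _ in range(S):
        eps = rng.normal()
        w = mu + sigma * eps                       # reparameterization: w = μ + σ·ε
        dfdw = df_dw(w)
        df_dmu = -(w - mu) / sigma**2              # ∂f/∂μ (from the −log q term)
        df_dsigma = 1.0 / sigma - (w - mu)**2 / sigma**3
        g_mu += dfdw * 1.0 + df_dmu                # chain rule with ∂w/∂μ = 1
        g_sigma += dfdw * eps + df_dsigma          # chain rule with ∂w/∂σ = ε
    mu += lr * g_mu / S                            # gradient ascent on the ELBO
    sigma = max(sigma + lr * g_sigma / S, 1e-3)

print(mu, sigma)   # μ should move toward the posterior mean of w
```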
  • 28. Dropout as approximate Bayesian inference ¤ Dropout as a Bayesian Approximation [Gal+ 2015] ¤ Variational distribution factorized over layers, q(ω) = ∏_i q(W_i), with W_i = M_i · diag([z_{i,j}]_{j=1}^{K_i}) and z_{i,j} ~ Bernoulli(p_i): columns of the weight matrices are randomly set to zero ¤ Minimizing −∫ q(ω) log p(Y|X, ω) dω + KL(q(ω) ‖ p(ω)) recovers the usual dropout training objective (expected loss plus L2 weight decay) ¤ The same construction covers drop-connect and multiplicative Gaussian noise
  • 29. MC dropout ¤ Estimate the predictive mean by T stochastic forward passes, sampling fresh Bernoulli masks (and hence weights {W_1^t, …, W_L^t}) each time: E[y*] ≈ (1/T) Σ_{t=1}^{T} ŷ(x*, W_1^t, …, W_L^t) ¤ Estimating the second raw moment the same way gives the predictive variance Var[y*] ≈ τ^{-1} I_D + (1/T) Σ_t ŷ_t^T ŷ_t − E[y*]^T E[y*] ¤ This MC dropout differs from standard dropout at test time, which averages the weights (multiplying each W_i by p_i); MC dropout keeps dropout on and averages the outputs ¤ It can be applied to existing networks trained with dropout, and the stochastic forward passes can be run in parallel
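A minimal MC-dropout sketch for a generic two-layer network with Bernoulli masks on the hidden units. The architecture, keep probability, and precision τ are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, b1, W2, b2, keep_prob=0.9, T=100, tau=10.0):
    """Predictive mean and variance from T stochastic forward passes."""
    preds = []
    for _ in range(T):
        h = np.maximum(x @ W1 + b1, 0.0)                  # ReLU hidden layer
        mask = rng.binomial(1, keep_prob, size=h.shape)   # fresh dropout mask each pass
        h = h * mask / keep_prob
        preds.append(h @ W2 + b2)
    preds = np.stack(preds)
    mean = preds.mean(axis=0)
    var = 1.0 / tau + preds.var(axis=0)                   # τ⁻¹ plus the Monte Carlo variance
    return mean, var

# Hypothetical weights for a 3-16-1 network trained with dropout
W1, b1 = rng.normal(size=(3, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
mean, var = mc_dropout_predict(np.ones((1, 3)), W1, b1, W2, b2)
```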
  • 30. ¤ Qualitative comparison of the uncertainty estimates ¤ (The slide shows Figure 2 of [Gal+ 2015]: predictive mean and uncertainties on the Mauna Loa CO2 concentrations dataset for (a) standard dropout with weight averaging, (b) a Gaussian process with SE covariance, (c) MC dropout with ReLU non-linearities, and (d) MC dropout with TanH non-linearities. Far from the training data, standard dropout confidently predicts an insensible value; the other models also predict insensible values but report that they are uncertain about those predictions.)
  • 31. Variational dropout ¤ Variational dropout and the local reparameterization trick [Kingma+ 2015] ¤ With minibatches of size M, the variance of the SGVB gradient estimator contains a covariance term Cov[L_i, L_j] between datapoints that does not shrink with M (eqs. 4–5 of the paper), because every datapoint in the minibatch shares the same sampled weights ¤ Local reparameterization trick: sample the pre-activations independently for each datapoint instead of sampling a shared weight matrix; then Cov[L_i, L_j] = 0 and the gradient variance scales as 1/M ¤ This also gives a variational-inference view of (Gaussian) dropout ¤ http://www.slideshare.net/masa_s/dl-hacks-variational-dropout-and-the-local-reparameterization-trick
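A minimal sketch of the local reparameterization trick for one fully connected layer with a factorized Gaussian posterior over the weights: instead of sampling a weight matrix that the whole minibatch shares, each datapoint gets its own pre-activation sample from the Gaussian it implies. Shapes and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_local_reparam(X, W_mu, W_logvar):
    """Sample B = X @ W with independent W_ij ~ N(W_mu_ij, exp(W_logvar_ij)),
    by sampling the pre-activations directly (local reparameterization)."""
    gamma = X @ W_mu                          # mean of each pre-activation
    delta = (X ** 2) @ np.exp(W_logvar)       # variance of each pre-activation
    eps = rng.normal(size=gamma.shape)        # independent noise per datapoint
    return gamma + np.sqrt(delta) * eps

X = rng.normal(size=(32, 10))                 # minibatch of 32 inputs
W_mu = 0.1 * rng.normal(size=(10, 5))
W_logvar = np.full((10, 5), -4.0)
B = layer_local_reparam(X, W_mu, W_logvar)    # (32, 5) sampled activations
```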
  • 33. ¤ ¤ ¤ ¤ → Automatic differentiation variational inference ADVI
  • 34. Automatic Differentiation Variational Inference (ADVI) ¤ Automatic Differentiation Variational Inference [Kucukelbir+ 2016] ¤ Available in Stan, PyMC3, and Edward ¤ ADVI recipe: 1. transform the latent variables θ of p(x, θ) to unconstrained real coordinates ζ, giving p(x, ζ), and minimize KL[q(ζ) ‖ p(ζ|x)] for a Gaussian q 2. estimate the gradients of the objective by Monte Carlo samples from q 3. update the parameters of q by stochastic optimization
  • 35. Automatic transformation ¤ The latent variables θ often have constrained support (e.g. θ > 0) ¤ A one-to-one transformation T maps them to unconstrained real coordinates ζ = T(θ) ¤ (The slide shows Figure 1 of [Kucukelbir+ 2016]: (a) the latent variable space ℝ_{>0}, with the posterior in purple and the approximation in green; (a→b) T transforms the latent variable space to ℝ; (b) the variational approximation is a Gaussian in real coordinate space)
  • 36. Automatic transformation (cont.) ¤ The transformed joint density carries a Jacobian correction: p(x, ζ) = p(x, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)| ¤ Example: for θ > 0, the logarithm ζ = T(θ) = log(θ) maps ℝ_{>0} to ℝ; its Jacobian adjustment is |det J_{T^{-1}}(ζ)| = exp(ζ), and the Weibull–Poisson example becomes p(x, ζ) = Poisson(x | exp(ζ)) × Weibull(exp(ζ); 1.5, 1) × exp(ζ) ¤ Stan maintains a library of such transformations and their Jacobians (bounds, simplexes, ordered vectors, covariance matrices, Cholesky factors), so the joint density of any differentiable model can be transformed automatically
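A minimal numeric illustration of the Jacobian correction for the log transform, written with SciPy densities; the Weibull shape and scale and the observed count follow the running example, while the particular values of x and ζ are arbitrary.

```python
import numpy as np
from scipy import stats

def log_joint_constrained(x, theta):
    # log p(x, θ) on the original space, θ > 0
    return stats.poisson.logpmf(x, theta) + stats.weibull_min.logpdf(theta, c=1.5, scale=1.0)

def log_joint_real(x, zeta):
    # log p(x, ζ) on the real line: add the log-Jacobian of θ = exp(ζ), which is ζ itself
    theta = np.exp(zeta)
    return log_joint_constrained(x, theta) + zeta

x, zeta = 3, 0.2
print(log_joint_real(x, zeta))                    # density in ζ, Jacobian included
print(log_joint_constrained(x, np.exp(zeta)))     # density in θ, no Jacobian
```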
  • 37. Variational approximations in real coordinate space ¤ Mean-field Gaussian: q(ζ; φ) = ∏_{k=1}^{K} N(ζ_k; μ_k, σ_k²); re-parameterizing with ω = log(σ) (element-wise) makes every variational parameter unconstrained, φ = (μ_1, …, μ_K, ω_1, …, ω_K) ∈ ℝ^{2K} ¤ Full-rank Gaussian: q(ζ; φ) = N(ζ; μ, Σ) with the Cholesky factorization Σ = L L^T, so φ = (μ, L) is unconstrained in ℝ^{K + K(K+1)/2} ¤ The full-rank Gaussian generalizes the mean-field one: the off-diagonal terms of Σ capture posterior correlations across latent variables, giving a more accurate approximation at a higher computational cost ¤ (The slide also reproduces Figure 2 of the paper: specifying a simple nonconjugate probability model in Stan)
  • 38. The variational problem in real coordinate space ¤ ELBO: ℒ(φ) = E_{q(ζ; φ)}[ log p(x, T^{-1}(ζ)) + log |det J_{T^{-1}}(ζ)| ] + ℍ[q(ζ; φ)] ¤ The optimization φ* = argmax_φ ℒ(φ) is now unconstrained, but automatic differentiation cannot be applied directly because the ELBO contains an unknown expectation ¤ Elliptical standardization absorbs the variational parameters into a transformation S_φ: mean-field η = S_φ(ζ) = diag(exp(ω))^{-1}(ζ − μ); full-rank η = S_φ(ζ) = L^{-1}(ζ − μ) ¤ The objective becomes φ* = argmax_φ E_{N(η; 0, I)}[ log p(x, T^{-1}(S_φ^{-1}(η))) + log |det J_{T^{-1}}(S_φ^{-1}(η))| ] + ℍ[q(ζ; φ)], an expectation over a fixed standard Gaussian ¤ The standardization's Jacobian is one (Gaussians form a location–scale family), and the Gaussian entropy and its gradient are analytic, implemented once and reused for all models
  • 39. Gradient variance: ADVI vs. BBVI ¤ Black-box variational inference (BBVI) [Ranganath+ 2014] uses the score-function (likelihood-ratio) gradient, ∇_μ^{BBVI} = E_{q(ζ; φ)}[ ∇_μ log q(ζ; φ) ( log p(x, T^{-1}(ζ)) + log |det J_{T^{-1}}(ζ)| − log q(ζ; φ) ) ], which avoids the gradient of the model and therefore applies to more settings ¤ ADVI instead uses the reparameterization-trick gradient, which is statistically more efficient ¤ (The slide reproduces Figure 8 of [Kucukelbir+ 2016]: the ADVI gradient estimator exhibits lower variance than BBVI, even with control-variate variance reduction, on a univariate Gamma(10, 10) model and a 100-dimensional nonlinear regression model; in practice a single MC sample suffices for ADVI)
  • 40.
  • 42. Software ¤ Stan ¤ MCMC with NUTS [Hoffman+ 2014], an adaptive variant of HMC ¤ Interfaces for Python and R ¤ Supports ADVI ¤ PyMC3 ¤ Python library for MCMC ¤ Built on Theano, with GPU support ¤ Supports ADVI ¤ Edward ¤ Modeling, inference, and criticism ¤ Python, built on TensorFlow/Keras ¤ Reported to be up to 35x faster than Stan and PyMC3 [Tran+ 2016]
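A minimal sketch of fitting a small Bayesian neural network with ADVI in PyMC3; the architecture, priors, and synthetic data are placeholders, and argument names may differ slightly across PyMC3 versions (e.g. `sd` vs `sigma`).

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)   # toy regression targets

n_hidden = 8
with pm.Model() as bnn:
    # Gaussian priors over the weights of a one-hidden-layer network
    W1 = pm.Normal("W1", 0.0, 1.0, shape=(3, n_hidden))
    b1 = pm.Normal("b1", 0.0, 1.0, shape=n_hidden)
    W2 = pm.Normal("W2", 0.0, 1.0, shape=n_hidden)
    b2 = pm.Normal("b2", 0.0, 1.0)

    h = tt.tanh(tt.dot(X, W1) + b1)
    mu = tt.dot(h, W2) + b2
    sigma_y = pm.HalfNormal("sigma_y", 1.0)
    pm.Normal("y_obs", mu, sigma_y, observed=y)

    # Mean-field ADVI; the returned approximation can be sampled like a trace
    approx = pm.fit(n=20000, method="advi")
    trace = approx.sample(1000)
```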
  • 43. ¤ Tars ¤ https://github.com/masa-su/Tars ¤ Edward Tran PyMC3 Wiecki star ¤ ¤ ¤ Edward PyMC3 ¤ ¤ ¤ Q Tars ¤ A