13. ¤ $w$ $p(w)$
¤ $w$
¤
Weight Uncertainty in Neural Networks
Figure 1. Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.
is related to recent methods in deep, generative modelling
(Kingma and Welling, 2014; Rezende et al., 2014; Gregor
et al., 2014), where variational inference has been applied
to stochastic hidden units of an autoencoder. Whilst the
number of stochastic hidden units might be in the order of
the parameters of the categorical dis
through the exponential function then
regression Y is R and P(y|x, w) is a G
– this corresponds to a squared loss.
Inputs x are mapped onto the param
tion on Y by several successive layers
tion (given by w) interleaved with elem
transforms.
The weights can be learnt by maximum
tion (MLE): given a set of training exam
the MLE weights $w^{\mathrm{MLE}}$ are given by:
$$w^{\mathrm{MLE}} = \arg\max_{w} \log P(\mathcal{D} \mid w) = \arg\max_{w} \sum_{i} \log P(y_i \mid x_i, w)$$
This is typically achieved by gradient
NN Bayesian NN
$$p(y^{*} \mid \mathcal{D}, x^{*}, \theta) = \int p(y^{*} \mid x^{*}, w)\, p(w \mid \mathcal{D}, \theta)\, dw$$
[Blundell+ 2015]
14. ¤
¤
¤ “The Importance of Knowing What We Don't Know (by Yarin Gal)”
¤
¤
¤
¤
¤
¤
WHY SHOULD WE CARE?
Calibrated model and prediction uncertainty: getting
systems that know when they don’t know.
Automatic model complexity control and structure learning
(Bayesian Occam’s Razor)
Figure from Yarin Gal’s thesis “Uncertainty in Deep Learning” (2016)
Zoubin Ghahramani
http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
18. ¤
¤ P
Q
¤
¤ α
¤
¤
3.1.5 Multiple outputs
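So far, we have considered the case of a single target variable $t$. In some applications, we may wish to predict $K > 1$ target variables, which we denote collectively by the target vector $\mathbf{t}$. This could be done by introducing a different set of basis functions for each component of $\mathbf{t}$, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector so that
$$\mathbf{y}(\mathbf{x}, \mathbf{w}) = \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}) \tag{3.31}$$
where $\mathbf{y}$ is a $K$-dimensional column vector, $\mathbf{W}$ is an $M \times K$ matrix of parameters, and $\boldsymbol{\phi}(\mathbf{x})$ is an $M$-dimensional column vector with elements $\phi_j(\mathbf{x})$, with $\phi_0(\mathbf{x}) = 1$ as before. Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\mathbf{I}). \tag{3.32}$$
If we have a set of observations $\mathbf{t}_1, \ldots, \mathbf{t}_N$, we can combine these into a matrix $\mathbf{T}$ of size $N \times K$ such that the $n$th row is given by $\mathbf{t}_n^{\mathrm{T}}$. Similarly, we can combine the input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ into a matrix $\mathbf{X}$. The log likelihood function is then given by
$$\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{t}_n \mid \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\mathbf{I}) = \frac{NK}{2}\ln\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\|^{2}. \tag{3.33}$$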
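$$\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}\right) \tag{3.50}$$
$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}. \tag{3.51}$$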
Note that because the posterior distribution is Gaussian, its mode coincides with its mean. Thus the maximum posterior weight vector is simply given by $\mathbf{w}_{\mathrm{MAP}} = \mathbf{m}_N$. If we consider an infinitely broad prior $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$ with $\alpha \to 0$, the mean $\mathbf{m}_N$ of the posterior distribution reduces to the maximum likelihood value $\mathbf{w}_{\mathrm{ML}}$ given by (3.15). Similarly, if $N = 0$, then the posterior distribution reverts to the prior. Furthermore, if data points arrive sequentially, then the posterior distribution at any stage acts as the prior distribution for the subsequent data point, such that the new posterior distribution is again given by (3.49).
For the remainder of this chapter, we shall consider a particular form of Gaussian prior in order to simplify the treatment. Specifically, we consider a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$ so that
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) \tag{3.52}$$
and the corresponding posterior distribution over $\mathbf{w}$ is then given by (3.49) with
$$\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t} \tag{3.53}$$
$$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}. \tag{3.54}$$
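As a rough illustration of (3.53) and (3.54), the sketch below computes the posterior mean and covariance for a straight-line model. The basis $\boldsymbol{\phi}(x) = (1, x)^{\mathrm{T}}$, the function name `posterior`, and the toy data are assumptions made for this example and do not come from the text.

```python
def posterior(xs, ts, alpha, beta):
    """Return (m_N, S_N) for the assumed basis phi(x) = (1, x)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    st = sum(ts)
    sxt = sum(x * t for x, t in zip(xs, ts))
    # S_N^{-1} = alpha * I + beta * Phi^T Phi, written out for the 2x2 case (3.54)
    a = alpha + beta * n
    b = beta * sx
    d = alpha + beta * sxx
    det = a * d - b * b
    s_n = [[d / det, -b / det], [-b / det, a / det]]
    # m_N = beta * S_N Phi^T t, with Phi^T t = (sum t_n, sum x_n t_n) (3.53)
    m_n = [beta * (s_n[0][0] * st + s_n[0][1] * sxt),
           beta * (s_n[1][0] * st + s_n[1][1] * sxt)]
    return m_n, s_n


# Noise-free points on the line t = 1 + 2x (hypothetical data).
m_n, s_n = posterior([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], alpha=1e-6, beta=25.0)
print(m_n)
```

With a very small $\alpha$ the prior is nearly flat, so the posterior mean sits close to the least-squares line; calling `posterior([], [], alpha, beta)` returns the prior, matching the $N = 0$ remark above.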
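The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior and, as a function of $\mathbf{w}$, takes the form
$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\left\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\}^{2} - \frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + \text{const}. \tag{3.55}$$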
Maximization of this posterior distribution with respect to w is therefore equiva-
Next we compute the posterior distribution, which is proportional to the product of the likelihood function and the prior. Due to the choice of a conjugate Gaussian prior distribution, the posterior will also be Gaussian. We can evaluate this distribution by the usual procedure of completing the square in the exponential, and then finding the normalization coefficient using the standard result for a normalized Gaussian. However, we have already done the necessary work in deriving the general result (2.116), which allows us to write down the posterior distribution directly in the form
$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \tag{3.49}$$
where
$$\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}\right) \tag{3.50}$$
$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}. \tag{3.51}$$
PRML
19. ¤
¤
equal to the mean, although this will no longer hold if $q \neq 2$.
3.3.2 Predictive distribution
In practice, we are not usually interested in the value of $\mathbf{w}$ itself but rather in making predictions of $t$ for new values of $\mathbf{x}$. This requires that we evaluate the predictive distribution defined by
$$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w} \tag{3.57}$$
in which $\mathbf{t}$ is the vector of target values from the training set, and we have omitted the corresponding input vectors from the right-hand side of the conditioning statements to simplify the notation. The conditional distribution $p(t \mid \mathbf{x}, \mathbf{w}, \beta)$ of the target variable is given by (3.8), and the posterior weight distribution is given by (3.49). We see that (3.57) involves the convolution of two Gaussian distributions, and so making use of the result (2.115) from Section 8.1.4, we see that the predictive distribution takes the form
$$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \sigma_N^{2}(\mathbf{x})) \tag{3.58}$$
where the variance $\sigma_N^{2}(\mathbf{x})$ of the predictive distribution is given by
$$\sigma_N^{2}(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}). \tag{3.59}$$
The first term in (3.59) represents the noise on the data whereas the second term reflects the uncertainty associated with the parameters $\mathbf{w}$. Because the noise process and the distribution of $\mathbf{w}$ are independent Gaussians, their variances are additive. Note that, as additional data points are observed, the posterior distribution becomes narrower. As a consequence it can be shown (Qazaz et al., 1997) that $\sigma_{N+1}^{2}(\mathbf{x}) \leqslant \sigma_N^{2}(\mathbf{x})$. In the limit $N \to \infty$, the second term in (3.59) goes to zero, and the variance
of the predictive distribution arises solely from the additive noise governed by the
parameter β.
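The behaviour of (3.59) as data accumulate can be checked numerically. The sketch below is only an illustration: it assumes a one-parameter model with $\boldsymbol{\phi}(x) = (x)$, hypothetical inputs, and helper names chosen for this example.

```python
def predictive_variance(phi, s_n, beta):
    """sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x), as in (3.59)."""
    size = len(phi)
    quad = sum(phi[i] * s_n[i][j] * phi[j]
               for i in range(size) for j in range(size))
    return 1.0 / beta + quad


def s_n_single_basis(xs, alpha, beta):
    """S_N for the assumed one-parameter model phi(x) = (x,), from (3.54)."""
    return [[1.0 / (alpha + beta * sum(x * x for x in xs))]]


alpha, beta = 2.0, 25.0
inputs = [0.1 * n for n in range(1, 201)]   # hypothetical training inputs
variances = []
for count in (0, 1, 10, 200):
    s_n = s_n_single_basis(inputs[:count], alpha, beta)
    variances.append(predictive_variance([1.0], s_n, beta))
print(variances)
```

The printed values never increase with the number of points and approach the noise floor $1/\beta$, which is the limit described above.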
As an illustration of the predictive distribution for Bayesian linear regression
models, let us return to the synthetic sinusoidal data set of Section 1.1. In Figure 3.8,
Figure 3.8 Examples of the predictive distribution (3.58) for a model consisting of 9 Gaussian basis functions of the form (3.4) using the synthetic sinusoidal data set of Section 1.1 (four panels, each plotting t against x). See the text for a detailed discussion.
PRML
21. ¤
¤
1.
2. MAP $\mathbf{w}_{\mathrm{MAP}}$
3. 2 R
4.
chapters and so we can exploit the results obtained there. We can then make use of
the evidence framework to provide point estimates for the hyperparameters and to
compare alternative models (for example, networks having different numbers of hid-
den units). To start with, we shall discuss the regression case and then later consider
the modifications needed for solving classification tasks.
5.7.1 Posterior parameter distribution
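Consider the problem of predicting a single continuous target variable $t$ from a vector $\mathbf{x}$ of inputs (the extension to multiple targets is straightforward). We shall suppose that the conditional distribution $p(t \mid \mathbf{x})$ is Gaussian, with an $\mathbf{x}$-dependent mean given by the output of a neural network model $y(\mathbf{x}, \mathbf{w})$, and with precision (inverse variance) $\beta$
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}). \tag{5.161}$$
Similarly, we shall choose a prior distribution over the weights $\mathbf{w}$ that is Gaussian of the form
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}). \tag{5.162}$$
For an i.i.d. data set of $N$ observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$, with a corresponding set of target values $\mathcal{D} = \{t_1, \ldots, t_N\}$, the likelihood function is given by
$$p(\mathcal{D} \mid \mathbf{w}, \beta) = \prod_{n=1}^{N}\mathcal{N}(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}) \tag{5.163}$$
and so the resulting posterior distribution is then
$$p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta) \propto p(\mathbf{w} \mid \alpha)\, p(\mathcal{D} \mid \mathbf{w}, \beta). \tag{5.164}$$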
which, as a consequence of the nonlinear dependence of $y(\mathbf{x}, \mathbf{w})$ on $\mathbf{w}$, will be non-Gaussian.
We can find a Gaussian approximation to the posterior distribution by using the Laplace approximation. To do this, we must first find a (local) maximum of the
form
$$\ln p(\mathbf{w} \mid \mathcal{D}) = -\frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} - \frac{\beta}{2}\sum_{n=1}^{N}\left\{y(\mathbf{x}_n, \mathbf{w}) - t_n\right\}^{2} + \text{const} \tag{5.165}$$
which corresponds to a regularized sum-of-squares error function. Assuming for the moment that $\alpha$ and $\beta$ are fixed, we can find a maximum of the posterior, which we denote $\mathbf{w}_{\mathrm{MAP}}$, by standard nonlinear optimization algorithms such as conjugate gradients, using error backpropagation to evaluate the required derivatives.
Having found a mode $\mathbf{w}_{\mathrm{MAP}}$, we can then build a local Gaussian approximation by evaluating the matrix of second derivatives of the negative log posterior distribution. From (5.165), this is given by
$$\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w} \mid \mathcal{D}, \alpha, \beta) = \alpha\mathbf{I} + \beta\mathbf{H} \tag{5.166}$$
where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of $\mathbf{w}$. Algorithms for computing and approximating the Hessian were discussed in Section 5.4. The corresponding Gaussian approximation to the posterior is then given from (4.134) by
$$q(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}). \tag{5.167}$$
Similarly, the predictive distribution is obtained by marginalizing with respect
PRML
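23. ¤ $q(w \mid \theta)$
¤ $\mathrm{KL}[q(w \mid \theta) \,\|\, p(w \mid \mathcal{D})]$ $\theta$
¤ KL
¤ $\theta$ ELBO $\theta$
→
$$\mathrm{KL}[q(w \mid \theta) \,\|\, p(w \mid \mathcal{D})] = -\int q(w \mid \theta) \log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)}\, dw + \log p(\mathcal{D}) = -\mathcal{L}(\mathcal{D}; \theta) + \log p(\mathcal{D})$$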
ELBO
24. ELBO
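¤ ELBO $\mathcal{L}(\mathcal{D}; \theta)$
1.
¤ $p(w \mid \mathcal{D})$
MC
¤
2. ELBO
¤ MC
¤ MC
$$\int q(w \mid \theta) \log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)}\, dw = \mathbb{E}_{q(w \mid \theta)}\left[\log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)}\right]$$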
3.
¤ EM
¤ ELBO
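A Monte Carlo estimate of the ELBO, in the expectation form shown on this slide, can be sketched as follows. The toy conjugate model (prior $p(w) = \mathcal{N}(0, 1)$, likelihood $p(t \mid w) = \mathcal{N}(w, 1)$), the Gaussian $q(w \mid \theta)$ with $\theta = (\mu, \sigma)$, and all names are assumptions chosen so that the exact posterior is known.

```python
import math
import random


def log_normal(x, mean, var):
    """Log density of N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def elbo_mc(data, mu, sigma, num_samples=1000, seed=0):
    """Average log[p(D|w) p(w) / q(w|theta)] over samples w ~ q(w|theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        w = rng.gauss(mu, sigma)
        log_lik = sum(log_normal(t, w, 1.0) for t in data)
        log_prior = log_normal(w, 0.0, 1.0)
        log_q = log_normal(w, mu, sigma ** 2)
        total += log_lik + log_prior - log_q
    return total / num_samples


data = [0.5, 1.5, 1.0]   # hypothetical observations
# For this toy model the exact posterior is N(0.75, 0.5^2).
print(elbo_mc(data, 0.75, 0.5), elbo_mc(data, 2.75, 0.5))
```

When $q$ equals the exact posterior the integrand is the constant $\log p(\mathcal{D})$, so the estimate has no sampling noise; any other $q$ gives a lower value, the gap being the KL divergence to the posterior.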
25. ¤ Gal
¤ Denker, Schwartz, Wittner, Solla, Howard, Jackel, Hopfield (1987)
¤ Denker and LeCun (1991)
¤ MacKay (1992)
¤ Hinton and van Camp (1993)
¤
¤ Neal (1995)
¤ Barber and Bishop (1998)
¤ Graves (2011)
¤ Blundell, Cornebise, Kavukcuoglu, and Wierstra (2015)
¤ Hernández-Lobato and Adams (2015)
26. ¤ Practical Variational Inference for Neural Networks [Graves 2011]
¤
¤
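¤ $\mathcal{L}^{E}$ $q(w \mid \theta)$ $\theta = \{\mu, \sigma\}$
¤
$$\mathcal{L}(\mathcal{D}; \theta) = \int q(w \mid \theta) \log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)}\, dw = \mathbb{E}_{q(w \mid \theta)}[\log p(\mathcal{D} \mid w)] - \mathrm{KL}[q(w \mid \theta) \,\|\, p(w)]$$
With $\mathcal{L}^{E} = -\mathbb{E}_{q(w \mid \theta)}[\log p(\mathcal{D} \mid w)]$ and samples $w_k \sim q(w \mid \theta)$, $k = 1, \ldots, S$:
$$\frac{\partial \mathcal{L}^{E}}{\partial \mu} \approx -\frac{1}{S}\sum_{k=1}^{S}\frac{\partial \log p(\mathcal{D} \mid w_k)}{\partial w}$$
$$\frac{\partial \mathcal{L}^{E}}{\partial \sigma^{2}} \approx \frac{1}{2S}\sum_{k=1}^{S}\left(\frac{\partial \log p(\mathcal{D} \mid w_k)}{\partial w}\right)^{2}$$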
27. ¤ Weight Uncertainty in Neural Networks [Blundell+ 2015]
¤
¤ $q(w \mid \theta)$
¤
¤ http://www.slideshare.net/masa_s/weight-uncertainty-in-neural-networks
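Bayes by backprop
$$\frac{\partial}{\partial \theta}\mathcal{L}(\mathcal{D}; \theta) = \frac{\partial}{\partial \theta}\mathbb{E}_{q(w \mid \theta)}[f(w, \theta)], \qquad f(w, \theta) = \log \frac{p(\mathcal{D} \mid w)\, p(w)}{q(w \mid \theta)}$$
$$= \mathbb{E}_{q(\epsilon)}\left[\frac{\partial f(w, \theta)}{\partial w}\frac{\partial w}{\partial \theta} + \frac{\partial f(w, \theta)}{\partial \theta}\right]$$
Reparameterization trick
$$w = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$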
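The gradient estimator and reparameterization on this slide can be tried on a one-weight toy problem. This is a minimal sketch under stated assumptions: the model ($p(w) = \mathcal{N}(0, 1)$, $p(t \mid w) = \mathcal{N}(w, 1)$), the direct parameterization by $\sigma$ rather than a transformed parameter, the step size, and the function name are all choices made for this example and are not taken from the paper.

```python
import random


def bayes_by_backprop(data, steps=2000, samples=10, lr=0.01, seed=0):
    """Fit q(w | theta) = N(mu, sigma^2) by stochastic gradient ascent on the ELBO."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    for _ in range(steps):
        grad_mu = 0.0
        grad_sigma = 0.0
        for _ in range(samples):
            eps = rng.gauss(0.0, 1.0)
            w = mu + sigma * eps                        # reparameterization
            # f(w, theta) = log p(D|w) + log p(w) - log q(w|theta)
            df_dw = sum(t - w for t in data) - w + (w - mu) / sigma ** 2
            df_dmu = -(w - mu) / sigma ** 2             # explicit term, via log q
            df_dsigma = 1.0 / sigma - (w - mu) ** 2 / sigma ** 3
            grad_mu += df_dw * 1.0 + df_dmu             # dw/dmu = 1
            grad_sigma += df_dw * eps + df_dsigma       # dw/dsigma = eps
        mu += lr * grad_mu / samples
        sigma += lr * grad_sigma / samples
    return mu, sigma


mu, sigma = bayes_by_backprop([0.5, 1.5, 1.0])   # hypothetical observations
print(mu, sigma)
```

For this conjugate toy model the exact posterior is $\mathcal{N}(0.75, 0.5^{2})$, so the fitted values should end up near $\mu = 0.75$ and $\sigma = 0.5$, up to the noise of the stochastic updates.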
28. Dropout
¤ Dropout as a Bayesian Approximation [Gal+ 2015]
¤ $q(w \mid \theta) = \prod_i q(W_i \mid \theta_i)$
¤ $\mathcal{L}^{E} = -\mathbb{E}_{q(w \mid \theta)}[\log p(\mathcal{D} \mid w)]$
¤ $\theta_i$ 0
¤ 0
¤ dropout
¤ drop-connect multiplicative Gaussian noise
sults are summarised here
n uncertainty estimates for
el with L layers and a loss
max loss or the Euclidean
Wi the NN’s weight ma-
1, and by bi the bias vec-
ayer i = 1, ..., L. We de-
corresponding to input xi
the input and output sets
on a regularisation term is
egularisation weighted by
n a minimisation objective
λ
L
i=1
||Wi||2
2 + ||bi||2
2 .
(1)
variables for every input
in each layer (apart from
le takes value 1 with prob-
ropped (i.e. its value is set
orresponding binary vari-
me values in the backward
to the parameters.
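$$p(\mathbf{y} \mid \mathbf{x}, \omega) = \mathcal{N}\big(\mathbf{y};\, \hat{\mathbf{y}}(\mathbf{x}, \omega),\, \tau^{-1}\mathbf{I}_D\big)$$
$$\hat{\mathbf{y}}\big(\mathbf{x}, \omega = \{\mathbf{W}_1, \ldots, \mathbf{W}_L\}\big) = \sqrt{\frac{1}{K_L}}\,\mathbf{W}_L\,\sigma\Big(\ldots \sqrt{\frac{1}{K_1}}\,\mathbf{W}_2\,\sigma(\mathbf{W}_1\mathbf{x} + \mathbf{m}_1) \ldots\Big)$$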
The posterior distribution $p(\omega \mid \mathbf{X}, \mathbf{Y})$ in eq. (2) is intractable. We use $q(\omega)$, a distribution over matrices whose columns are randomly set to zero, to approximate the intractable posterior. We define $q(\omega)$ as:
$$\mathbf{W}_i = \mathbf{M}_i \cdot \mathrm{diag}\big([z_{i,j}]_{j=1}^{K_i}\big)$$
$$z_{i,j} \sim \mathrm{Bernoulli}(p_i) \quad \text{for } i = 1, \ldots, L,\ j = 1, \ldots, K_{i-1}$$
given some probabilities $p_i$ and matrices $\mathbf{M}_i$ as variational parameters. The binary variable $z_{i,j} = 0$ corresponds then to unit $j$ in layer $i - 1$ being dropped out as an input to layer $i$. The variational distribution $q(\omega)$ is highly multi-modal, inducing strong joint correlations over the rows of the matrices $\mathbf{W}_i$ (which correspond to the frequencies in the sparse spectrum GP approximation).
We minimise the KL divergence between the approximate posterior $q(\omega)$ above and the posterior of the full deep GP, $p(\omega \mid \mathbf{X}, \mathbf{Y})$. This KL is our minimisation objective
$$-\int q(\omega) \log p(\mathbf{Y} \mid \mathbf{X}, \omega)\, d\omega + \mathrm{KL}(q(\omega) \,\|\, p(\omega)). \tag{3}$$
29. Dropout
¤ dropout
¤ dropout
¤
¤ MC dropout
¤ http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
where $\omega = \{\mathbf{W}_i\}_{i=1}^{L}$ is our set of random variables for a model with $L$ layers.
We will perform moment-matching and estimate the first two moments of the predictive distribution empirically. More specifically, we sample $T$ sets of vectors of realisations from the Bernoulli distribution $\{\mathbf{z}_1^{t}, \ldots, \mathbf{z}_L^{t}\}_{t=1}^{T}$ with $\mathbf{z}_i^{t} = [z_{i,j}^{t}]_{j=1}^{K_i}$, giving $\{\mathbf{W}_1^{t}, \ldots, \mathbf{W}_L^{t}\}_{t=1}^{T}$. We estimate
$$\mathbb{E}_{q(\mathbf{y}^{*} \mid \mathbf{x}^{*})}(\mathbf{y}^{*}) \approx \frac{1}{T}\sum_{t=1}^{T}\hat{\mathbf{y}}^{*}(\mathbf{x}^{*}, \mathbf{W}_1^{t}, \ldots, \mathbf{W}_L^{t}) \tag{6}$$
following proposition C in the appendix. We refer to this Monte Carlo estimate as MC dropout. In practice this is equivalent to performing $T$ stochastic forward passes through the network and averaging the results.
This result has been presented in the literature before as model averaging. We have given a new derivation for this result which allows us to derive mathematically grounded uncertainty estimates as well. Srivastava et al. (2014, section 7.5) have reasoned empirically that MC dropout can be approximated by averaging the weights of the network (multiplying each $\mathbf{W}_i$ by $p_i$ at test time, referred to as standard dropout).
We estimate the second raw moment in the same way:
log p(y∗
|x∗
, X, Y) ≈
with a log-sum-exp o
passes through the ne
Our predictive distr
highly multi-modal,
give a glimpse into i
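$$\mathbb{E}_{q(\mathbf{y}^{*} \mid \mathbf{x}^{*})}\big((\mathbf{y}^{*})^{\mathrm{T}}(\mathbf{y}^{*})\big) \approx \tau^{-1}\mathbf{I}_D + \frac{1}{T}\sum_{t=1}^{T}\hat{\mathbf{y}}^{*}(\mathbf{x}^{*}, \mathbf{W}_1^{t}, \ldots, \mathbf{W}_L^{t})^{\mathrm{T}}\,\hat{\mathbf{y}}^{*}(\mathbf{x}^{*}, \mathbf{W}_1^{t}, \ldots, \mathbf{W}_L^{t})$$
following proposition D in the appendix. To obtain the model's predictive variance we have:
$$\mathrm{Var}_{q(\mathbf{y}^{*} \mid \mathbf{x}^{*})}(\mathbf{y}^{*}) \approx \tau^{-1}\mathbf{I}_D + \frac{1}{T}\sum_{t=1}^{T}\hat{\mathbf{y}}^{*}(\mathbf{x}^{*}, \mathbf{W}_1^{t}, \ldots, \mathbf{W}_L^{t})^{\mathrm{T}}\,\hat{\mathbf{y}}^{*}(\mathbf{x}^{*}, \mathbf{W}_1^{t}, \ldots, \mathbf{W}_L^{t}) - \mathbb{E}_{q(\mathbf{y}^{*} \mid \mathbf{x}^{*})}(\mathbf{y}^{*})^{\mathrm{T}}\,\mathbb{E}_{q(\mathbf{y}^{*} \mid \mathbf{x}^{*})}(\mathbf{y}^{*})$$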
In the appendix (section 4.1) we extend this derivation to
classification. E(·) is defined as softmax loss and τ is set to 1.
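The two moment estimates can be sketched for a scalar-output network as below. This is an illustration only: the one-hidden-layer ReLU network, its weights, the keep probability, and the value of $\tau$ are hypothetical, and the per-layer scaling factors of the paper are left out for brevity.

```python
import random


def stochastic_forward(x, m1, b1, m2, keep_prob, rng):
    """One forward pass with a fresh Bernoulli mask on the hidden units."""
    hidden = [max(0.0, w * x + b) for w, b in zip(m1, b1)]      # ReLU layer
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in hidden]
    return sum(w * h * z for w, h, z in zip(m2, hidden, mask))


def mc_dropout_predict(x, m1, b1, m2, keep_prob, tau, passes=1000, seed=0):
    """Predictive mean and variance from `passes` stochastic forward passes."""
    rng = random.Random(seed)
    outputs = [stochastic_forward(x, m1, b1, m2, keep_prob, rng)
               for _ in range(passes)]
    mean = sum(outputs) / passes
    # Second raw moment: 1/tau plus the average squared output.
    second_moment = 1.0 / tau + sum(y * y for y in outputs) / passes
    variance = second_moment - mean ** 2
    return mean, variance


m1, b1, m2 = [1.0, -0.5, 2.0], [0.0, 1.0, -1.0], [0.5, 1.0, -0.3]   # hypothetical weights
print(mc_dropout_predict(1.0, m1, b1, m2, keep_prob=0.5, tau=10.0))
```

Setting `keep_prob=1.0` switches the masks off, in which case every pass returns the same output and the variance reduces to the noise term $\tau^{-1}$.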
proximating variational distributio
matrix column is bi-modal, and
tribution over each layer’s weight
3.2 in the appendix).
Note that the dropout NN mod
To estimate the predictive mean a
we simply collect the results of s
through the model. As a result,
used with existing NN models tra
thermore, the forward passes can
sulting in constant running time id
dropout.
5. Experiments
We next perform an extensive ass
of the uncertainty estimates obta
and convnets on the tasks of regr
We compare the uncertainty obtai
architectures and non-linearities,
olation, and show that model unc
classification tasks using MNIST
as an example. We then show th
tainty we can obtain a considerabl
tive log-likelihood and RMSE co
of-the-art methods. We finish wi
30. ¤ MC dropout
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
(a) Standard dropout with weight averaging (b) Gaussian process with SE covariance function
(c) MC dropout with ReLU non-linearities (d) MC dropout with TanH non-linearities
Figure 2. Predictive mean and uncertainties on the Mauna Loa CO2 concentrations dataset, for various models. In red is the
observed function (left of the dashed blue line); in blue is the predictive mean plus/minus two standard deviations (8 for fig. 2d).
Different shades of blue represent half a standard deviation. Marked with a dashed red line is a point far away from the data: standard
dropout confidently predicts an insensible value for the point; the other models predict insensible values as well but with the additional
information that the models are uncertain about their predictions.
model’s uncertainty in a Bayesian pipeline. We give a
quantitative assessment of the model’s performance in the
setting of reinforcement learning on a task similar to that
used in deep reinforcement learning (Mnih et al., 2015).
Using the results from the previous section, we begin by
qualitatively evaluating the dropout NN uncertainty on two
comparison. Fig. 2c shows the results of the same network
as in fig. 2a, but with MC dropout used to evaluate the pre-
dictive mean and uncertainty for the training and test sets.
Lastly, fig. 2d shows the same using the TanH network with
5 layers (plotted with 8 times the standard deviation for vi-
sualisation purposes). The shades of blue represent model
uncertainty: each colour gradient represents half a standard
31. Dropout
¤ Variational dropout and the local reparameterization trick
[Kingma+ 2015]
¤
¤ 0
¤ 0 local reparameterization
trick
¤ Dropout
¤
¤ http://www.slideshare.net/masa_s/dl-hacks-variational-dropout-and-the-local-reparameterization-trick
2.2 Variance of the SGVB estimator
The theory of stochastic approximation tells us that stochastic gradient ascent using (3) will asymptotically converge to a local optimum for an appropriately declining step size and sufficient weight updates [18], but in practice the performance of stochastic gradient ascent crucially depends on the variance of the gradients. If this variance is too large, stochastic gradient descent will fail to make much progress in any reasonable amount of time. Our objective function consists of an expected log likelihood term that we approximate using Monte Carlo, and a KL divergence term $D_{KL}(q_\phi(\mathbf{w}) \,\|\, p(\mathbf{w}))$ that we assume can be calculated analytically and otherwise be approximated with Monte Carlo with similar reparameterization.
Assume that we draw minibatches of datapoints with replacement; see appendix F for a similar analysis for minibatches without replacement. Using $L_i$ as shorthand for $\log p(\mathbf{y}^{i} \mid \mathbf{x}^{i}, \mathbf{w} = f(\epsilon^{i}, \phi))$, the contribution to the likelihood for the $i$-th datapoint in the minibatch, the Monte Carlo estimator (3) may be rewritten as $L_{\mathcal{D}}^{\mathrm{SGVB}}(\phi) = \frac{N}{M}\sum_{i=1}^{M} L_i$, whose variance is given by
$$\mathrm{Var}\left[L_{\mathcal{D}}^{\mathrm{SGVB}}(\phi)\right] = \frac{N^{2}}{M^{2}}\left(\sum_{i=1}^{M}\mathrm{Var}[L_i] + 2\sum_{i=1}^{M}\sum_{j=i+1}^{M}\mathrm{Cov}[L_i, L_j]\right) \tag{4}$$
$$= N^{2}\left(\frac{1}{M}\mathrm{Var}[L_i] + \frac{M-1}{M}\mathrm{Cov}[L_i, L_j]\right), \tag{5}$$
where the variances and covariances are w.r.t. both the data distribution and $\epsilon$ distribution, i.e. $\mathrm{Var}[L_i] = \mathrm{Var}_{\epsilon, \mathbf{x}^{i}, \mathbf{y}^{i}}\left[\log p(\mathbf{y}^{i} \mid \mathbf{x}^{i}, \mathbf{w} = f(\epsilon, \phi))\right]$, with $\mathbf{x}^{i}, \mathbf{y}^{i}$ drawn from the empirical distribution defined by the training set. As can be seen from (5), the total contribution to the variance by $\mathrm{Var}[L_i]$ is inversely proportional to the minibatch size $M$. However, the total contribution by the covariances does not decrease with $M$. In practice, this means that the variance of $L_{\mathcal{D}}^{\mathrm{SGVB}}(\phi)$ can be dominated by the covariances for even moderately large $M$.
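Equation (5) can be checked with a small simulation. The stand-in below is an assumption made for illustration: each $L_i$ is modelled as $x_i + \epsilon$ with $x_i \sim \mathcal{N}(0, 1)$, where $\epsilon \sim \mathcal{N}(0, 1)$ is either shared by the whole minibatch (so $\mathrm{Cov}[L_i, L_j] = 1$) or drawn afresh per datapoint (so $\mathrm{Cov}[L_i, L_j] = 0$); in both cases $\mathrm{Var}[L_i] = 2$.

```python
import random


def sgvb_variance(var_l, cov_l, n_data, m_batch):
    """Right-hand side of (5)."""
    return n_data ** 2 * (var_l / m_batch + (m_batch - 1) / m_batch * cov_l)


def simulate(n_data, m_batch, shared_noise, trials=20000, seed=0):
    """Empirical variance of (N/M) * sum_i L_i for the stand-in L_i = x_i + eps."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)
        total = 0.0
        for _ in range(m_batch):
            eps = shared if shared_noise else rng.gauss(0.0, 1.0)
            total += rng.gauss(0.0, 1.0) + eps
        values.append(n_data / m_batch * total)
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / (trials - 1)


n_data, m_batch = 100, 10
print(sgvb_variance(2.0, 1.0, n_data, m_batch), simulate(n_data, m_batch, True))
print(sgvb_variance(2.0, 0.0, n_data, m_batch), simulate(n_data, m_batch, False))
```

With shared noise the covariance term dominates and barely shrinks as $M$ grows, while independent noise leaves only the $1/M$ term.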
2.3 Local Reparameterization Trick
We therefore propose an alternative estimator for which we have $\mathrm{Cov}[L_i, L_j] = 0$, so that the variance of our stochastic gradients scales as $1/M$. We then make this new estimator computationally efficient by not sampling $\epsilon$ directly, but only sampling the intermediate variables $f(\epsilon)$ through which
SGVB
34. Automatic Differentiation
Variational Inference ADVI
¤ Automatic Differentiation Variational Inference [Kucukelbir+ 2016]
¤ Stan PyMC3 Edward
¤ ADVI
1. $\theta$ $p(x, \theta)$ $p(x, \zeta)$
$\mathrm{KL}[q(\zeta) \,\|\, p(\zeta, x)]$
2. $q$ MC
3. $q$
35. Automatic transformation
¤ $\theta$
¤ $p(\theta)$ support
¤
¤ $\theta$ $\zeta$
¤ $\zeta$
¤ $\theta \to \zeta$ $T$ $\zeta$
Figure 1: Transforming the latent variable to real coordinate space. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is $\mathbb{R}_{>0}$. (a→b) $T$ transforms the latent variable space to $\mathbb{R}$. (b) The variational approximation is a Gaussian in real coordinate space.
... identify the transformed variables as $\zeta = T(\theta)$. The transformed joint density $p(x, \zeta)$ is a function of $\zeta$; it has the representation
$$p(x, \zeta) = p\big(x, T^{-1}(\zeta)\big)\,\big|\det J_{T^{-1}}(\zeta)\big|,$$
where $p(x, \theta = T^{-1}(\zeta))$ is the joint density in the original latent variable space, and $J_{T^{-1}}(\zeta)$ is the Jacobian of the inverse of $T$. Transformations of continuous probability densities require a Jacobian; it accounts for how the transformation warps unit volumes and ensures that the transformed density integrates to one (Olive, 2014). (See Appendix A.)
36. Automatic transformation
¤
¤ $T = \log(\theta)$
¤
¤ $\zeta$
The transformed joint density $p(x, \zeta)$ is a function of $\zeta$; it has the representation
$$p(x, \zeta) = p\big(x, T^{-1}(\zeta)\big)\,\big|\det J_{T^{-1}}(\zeta)\big|,$$
where $p(x, \theta = T^{-1}(\zeta))$ is the joint density in the original latent variable space, and $J_{T^{-1}}(\zeta)$ is the Jacobian of the inverse of $T$. Transformations of continuous probability densities require a Jacobian; it accounts for how the transformation warps unit volumes and ensures that the transformed density integrates to one (Olive, 2014). (See Appendix A.)
Consider again our running Weibull-Poisson example from Section 2.1. The latent variable $\theta$ lives in $\mathbb{R}_{>0}$. The logarithm $\zeta = T(\theta) = \log(\theta)$ transforms $\mathbb{R}_{>0}$ to the real line $\mathbb{R}$. Its Jacobian adjustment is the derivative of the inverse of the logarithm $|\det J_{T^{-1}}(\zeta)| = \exp(\zeta)$. The transformed density is
$$p(x, \zeta) = \mathrm{Poisson}(x \mid \exp(\zeta)) \times \mathrm{Weibull}(\exp(\zeta);\, 1.5, 1) \times \exp(\zeta).$$
Figure 1 depicts this transformation.
As we describe in the introduction, we implement our algorithm in Stan (Stan Development Team, 2015). Stan maintains a library of transformations and their corresponding Jacobians.⁴ With Stan, we can automatically transform the joint density of any differentiable probability model to one with real-valued latent variables. (See Figure 2.)
2.4 Variational Approximations in Real Coordinate Space
After the transformation, the latent variables $\zeta$ have support in the real coordinate space $\mathbb{R}^{K}$. We have a choice of variational approximations in this space. Here, we consider Gaussian distributions; these implicitly induce non-Gaussian variational distributions in the original latent variable space.
Figure 1: Transforming the latent variable to real coordinate space. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is $\mathbb{R}_{>0}$. (a→b) $T$ transforms the latent variable space to $\mathbb{R}$. (b) The variational approximation is a Gaussian in real coordinate space. [Panels: (a) Latent variable space, density against θ; (b) Real coordinate space, density against ζ. Legend: Prior, Posterior, Approximation.]
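The role of the Jacobian term in the Weibull-Poisson example can be checked numerically: integrating the joint density over $\theta$ or over $\zeta = \log(\theta)$ should give the same marginal for $x$. The sketch below is an illustration only; the function names, the midpoint-rule integrator, the integration limits, and the choice $x = 2$ are assumptions for this example and are not part of ADVI or Stan.

```python
import math


def log_joint_theta(x, theta):
    """log[Poisson(x | theta) * Weibull(theta; shape 1.5, scale 1)] for theta > 0."""
    log_poisson = x * math.log(theta) - theta - math.lgamma(x + 1)
    log_weibull = math.log(1.5) + 0.5 * math.log(theta) - theta ** 1.5
    return log_poisson + log_weibull


def log_joint_zeta(x, zeta):
    """The same density in zeta = log(theta); the added zeta is log|det J| = log exp(zeta)."""
    return log_joint_theta(x, math.exp(zeta)) + zeta


def integrate(f, lower, upper, steps):
    """Midpoint rule on [lower, upper]."""
    h = (upper - lower) / steps
    return h * sum(f(lower + (i + 0.5) * h) for i in range(steps))


x = 2
mass_theta = integrate(lambda th: math.exp(log_joint_theta(x, th)), 0.0, 40.0, 200000)
mass_zeta = integrate(lambda z: math.exp(log_joint_zeta(x, z)), -20.0, 4.0, 20000)
print(mass_theta, mass_zeta)
```

Dropping the `+ zeta` term in `log_joint_zeta` makes the two numbers disagree, which is the volume-warping effect the Jacobian corrects.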
37. ¤
¤
¤
¤ L
¤
¤
2.4 Variational Approximations in Real Coordinate Space
After the transformation, the latent variables $\zeta$ have support in the real coordinate space $\mathbb{R}^{K}$. We have a choice of variational approximations in this space. Here, we consider Gaussian distributions; these implicitly induce non-Gaussian variational distributions in the original latent variable space.
Mean-field Gaussian. One option is to posit a factorized (mean-field) Gaussian variational approximation
$$q(\zeta; \phi) = \mathcal{N}\big(\zeta;\, \mu, \mathrm{diag}(\sigma^{2})\big) = \prod_{k=1}^{K}\mathcal{N}\big(\zeta_k;\, \mu_k, \sigma_k^{2}\big),$$
where the vector $\phi = (\mu_1, \cdots, \mu_K, \sigma_1^{2}, \cdots, \sigma_K^{2})$ concatenates the mean and variance of each Gaussian factor. Since the variance parameters must always be positive, the variational parameters live in the set $\Phi = \{\mathbb{R}^{K}, \mathbb{R}^{K}_{>0}\}$. Re-parameterizing the mean-field Gaussian removes this constraint. Consider the
⁴ Stan provides various transformations for upper and lower bounds, simplex and ordered vectors, and structured matrices such as
covariance matrices and Cholesky factors.
  x[n] ~ poisson(theta);
}
Figure 2: Specifying a simple nonconjugate probability model in Stan.
logarithm of the standard deviations, ω = log(σ), applied element-wise. The support of ω is now
the real coordinate space and σ is always positive. The mean-field Gaussian becomes
$$q(\zeta;\phi) = \mathcal{N}\big(\zeta;\, \mu, \operatorname{diag}(\exp(\omega)^2)\big),$$
where the vector φ = (µ_1,···,µ_K, ω_1,···,ω_K) concatenates the mean and
logarithm of the standard deviation of each factor. Now, the variational parameters are unconstrained
in ℝ^{2K}.
Full-rank Gaussian. Another option is to posit a full-rank Gaussian variational approximation
$$q(\zeta;\phi) = \mathcal{N}(\zeta;\, \mu, \Sigma),$$
where the vector φ = (µ, Σ) concatenates the mean vector µ and covariance matrix Σ. To ensure that
Σ always remains positive semidefinite, we re-parameterize the covariance matrix using a Cholesky
factorization, Σ = LL⊤. We use the non-unique definition of the Cholesky factorization where the
diagonal elements of L need not be positively constrained (Pinheiro and Bates, 1996). Therefore L
lives in the unconstrained space of lower-triangular matrices with K(K + 1)/2 real-valued entries. The
full-rank Gaussian becomes
$$q(\zeta;\phi) = \mathcal{N}\big(\zeta;\, \mu, LL^\top\big),$$
where the variational parameters φ = (µ, L) are
unconstrained in ℝ^{K+K(K+1)/2}.
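The two parameterizations above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Stan implementation: K = 3, the random parameter values, and all variable names are hypothetical. It shows that exp(ω) and LL⊤ yield valid scales and covariances for any real-valued parameters, and it counts the free parameters in each family.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # hypothetical number of latent variables

# Mean-field: phi = (mu, omega) is 2K unconstrained reals; sigma = exp(omega) > 0.
mu = rng.normal(size=K)
omega = rng.normal(size=K)
sigma = np.exp(omega)

# Full-rank: phi = (mu, L) with L lower-triangular and an unconstrained diagonal.
L = np.tril(rng.normal(size=(K, K)))
Sigma = L @ L.T  # positive semidefinite by construction

n_meanfield = mu.size + omega.size           # 2K
n_fullrank = mu.size + K * (K + 1) // 2      # K + K(K+1)/2

# Draw from each approximation by shifting and scaling standard Gaussian noise.
eta = rng.normal(size=(200_000, K))
zeta_meanfield = mu + sigma * eta
zeta_fullrank = mu + eta @ L.T

print(n_meanfield, n_fullrank)
print(np.round(np.cov(zeta_fullrank, rowvar=False) - Sigma, 2))  # close to zero
```

For K = 3 the mean-field family has 6 parameters and the full-rank family has 9; the gap grows quadratically in K, which is the computational cost the full-rank approximation pays for modelling correlations.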
The full-rank Gaussian generalizes the mean-field Gaussian approximation. The off-diagonal terms in the
covariance matrix Σ capture posterior correlations across latent random variables.⁵ This leads to a more
accurate posterior approximation than the mean-field Gaussian; however, it comes at a computational
cost. Various low-rank approximations to the covariance matrix reduce this cost, yet limit its a
38. ¤ ELBO
¤ s
¤
¤ Reparameterization trick p Z
¤ ELBO
¤ s
¤
2.5 The Variational Problem in Real Coordinate Space
Here is the story so far. We began with a differentiable probability model p(x,θ). We transformed the
latent variables into ζ, which live in the real coordinate space. We defined variational approximations
in the transformed space. Now, we consider the variational optimization problem.
Write the variational objective function, the ELBO, in real coordinate space as
$$\mathcal{L}(\phi) = \mathbb{E}_{q(\zeta;\phi)}\Big[\log p\big(x, T^{-1}(\zeta)\big) + \log\big|\det J_{T^{-1}}(\zeta)\big|\Big] + \mathbb{H}\big[q(\zeta;\phi)\big]. \tag{5}$$
The inverse of the transformation T⁻¹ appears in the joint model, along with the determinant of the
Jacobian adjustment. The ELBO is a function of the variational parameters φ and the entropy ℍ, both of
which depend on the variational approximation. (Derivation in Appendix B.)
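As a rough illustration of Equation (5), the following Python sketch estimates the ELBO for the Weibull-Poisson example with a one-dimensional Gaussian approximation. The observations, the values of µ and ω, the sample size, and the function names are all assumptions chosen for the example. The expectation is approximated by Monte Carlo and the Gaussian entropy is added in closed form.

```python
import math
import numpy as np

def log_joint_zeta(x, zeta, k=1.5):
    """log p(x, T^{-1}(zeta)) + log|det J_{T^-1}(zeta)| with Weibull(k, 1) prior."""
    x = np.asarray(x, dtype=float)
    zeta = np.asarray(zeta, dtype=float)
    log_fact = sum(math.lgamma(v + 1) for v in x)
    log_pois = x.sum() * zeta - x.size * np.exp(zeta) - log_fact
    log_weib = math.log(k) + (k - 1) * zeta - np.exp(k * zeta)
    return log_pois + log_weib + zeta  # the trailing zeta is the log-Jacobian

def elbo_estimate(x, mu, omega, n_samples, rng):
    """Monte Carlo estimate of Equation (5) for q = N(mu, exp(omega)^2)."""
    zeta = mu + math.exp(omega) * rng.normal(size=n_samples)
    entropy = 0.5 * math.log(2 * math.pi * math.e) + omega  # analytic Gaussian entropy
    return float(np.mean(log_joint_zeta(x, zeta))) + entropy

x_obs = [2, 1, 0, 3]  # hypothetical observations
rng = np.random.default_rng(0)
elbo = elbo_estimate(x_obs, mu=0.0, omega=-1.0, n_samples=100_000, rng=rng)
print(elbo)
```

Because the ELBO is a lower bound on the log evidence, the estimate should sit below log p(x) for any choice of µ and ω, up to Monte Carlo error.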
Now, we can freely optimize the ELBO in the real coordinate space without worrying about the support
matching constraint. The optimization problem from Equation (3) becomes
$$\phi^* = \arg\max_{\phi}\, \mathcal{L}(\phi) \tag{6}$$
where the parameter vector φ lives in some appropriately dimensioned real coordinate space. This is an
unconstrained optimization problem that we can solve using gradient ascent. Traditionally, this would
require manual computation of gradients. Instead, we develop a stochastic gradient ascent algorithm
that uses automatic differentiation to compute gradients and MC integration to approximate expectations.
We cannot directly use automatic differentiation on the ELBO. This is because the ELBO involves an
unknown expectation. However, we can automatically differentiate the functions inside the expectation.
(The model p and transformation T are both easy to represent as computer functions (Baydin et al.,
2015).) To apply automatic differentiation, we want to push the gradient operation inside the
expectation. To this end, we employ one final transformation: elliptical standardization⁶ (Härdle and Simar,
2012).
Elliptical standardization. Consider a transformation S_φ that absorbs the variational parameters φ;
this converts the Gaussian variational approximation into a standard Gaussian. In the mean-field case,
the standardization is η = S_φ(ζ) = diag(exp(ω))⁻¹(ζ − µ). In the full-rank Gaussian, the
standardization is η = S_φ(ζ) = L⁻¹(ζ − µ).
In both cases, the standardization encapsulates the variational parameters; in return it gives a fixed
variational density
$$q(\eta) = \mathcal{N}(\eta;\, 0, I) = \prod_{k=1}^{K} \mathcal{N}(\eta_k;\, 0, 1),$$
as shown in Figure 3.
The standardization transforms the variational problem from Equation (5) into
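$$\phi^* = \arg\max_{\phi}\; \mathbb{E}_{\mathcal{N}(\eta;0,I)}\Big[\log p\big(x, T^{-1}(S_\phi^{-1}(\eta))\big) + \log\big|\det J_{T^{-1}}\big(S_\phi^{-1}(\eta)\big)\big|\Big] + \mathbb{H}\big[q(\zeta;\phi)\big].$$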
The expectation is now in terms of a standard Gaussian density. The Jacobian of elliptical
standardization evaluates to one, because the Gaussian distribution is a member of the location-scale family:
standardizing a Gaussian gives another Gaussian distribution. (See Appendix A.)
We do not need to transform the entropy term as it does not depend on the model or the transformation;
we have a simple analytic form for the entropy of a Gaussian and its gradient. We implement these once
and reuse for all models.
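The effect of standardization can be sketched for the one-dimensional Weibull-Poisson example. The Python code below is an illustration under several assumptions: the derivative of the log joint is written out by hand in place of automatic differentiation, the step size is fixed rather than adaptive, and the data, initial values, and function names are hypothetical. Since η is drawn from a fixed N(0, 1), the gradient moves inside the expectation and is averaged over samples.

```python
import numpy as np

def grad_log_joint_zeta(x, zeta, k=1.5):
    """d/dzeta of [log p(x, exp(zeta)) + zeta] for the Weibull(k, 1)-Poisson model."""
    x = np.asarray(x, dtype=float)
    return x.sum() + k - x.size * np.exp(zeta) - k * np.exp(k * zeta)

def advi_gradients(x, mu, omega, n_samples, rng):
    """Monte Carlo ELBO gradients with respect to (mu, omega) via standardization."""
    eta = rng.normal(size=n_samples)       # eta ~ N(0, 1), free of phi
    zeta = mu + np.exp(omega) * eta        # zeta = S_phi^{-1}(eta)
    g = grad_log_joint_zeta(x, zeta)       # gradient taken inside the expectation
    grad_mu = g.mean()
    grad_omega = (g * eta * np.exp(omega)).mean() + 1.0  # +1 is the entropy gradient
    return grad_mu, grad_omega

def fit(x, n_steps=3000, step=0.01, n_samples=5, seed=0):
    """Plain stochastic gradient ascent with a fixed step size (an assumption)."""
    rng = np.random.default_rng(seed)
    mu, omega = 0.0, -1.0  # hypothetical initial values
    for _ in range(n_steps):
        g_mu, g_omega = advi_gradients(x, mu, omega, n_samples, rng)
        mu += step * g_mu
        omega += step * g_omega
    return mu, omega

mu_hat, omega_hat = fit([2, 1, 0, 3])  # hypothetical observations
print(mu_hat, np.exp(omega_hat))
```

The entropy contributes only the constant 1 to the ω-gradient, which is why it can be implemented once and reused across models.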
39. ¤ Black-box variational inference [Ranganath+ 2014]
¤
¤ ADVI likelihood ratio trick
¤ reparameterization trick
3.2 Variance of the Stochastic Gradients
ADVI uses Monte Carlo integration to approximate gradients of the ELBO, and then uses these gradients
in a stochastic optimization algorithm (Section 2). The speed of ADVI hinges on the variance of the
gradient estimates. When a stochastic optimization algorithm suffers from high-variance gradients, it
must repeatedly recover from poor parameter estimates.
ADVI is not the only way to compute Monte Carlo approximations of the gradient of the ELBO. Black
box variational inference (BBVI) takes a different approach (Ranganath et al., 2014). The BBVI gradient
estimator uses the gradient of the variational approximation and avoids using the gradient of the model.
For example, the following BBVI estimator
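$$\nabla_{\mu}^{\text{BBVI}} \mathcal{L} = \mathbb{E}_{q(\zeta;\phi)}\Big[\nabla_{\mu} \log q(\zeta;\phi)\,\Big\{\log p\big(x, T^{-1}(\zeta)\big) + \log\big|\det J_{T^{-1}}(\zeta)\big| - \log q(\zeta;\phi)\Big\}\Big]$$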
and the ADVI gradient estimator in Equation (7) both lead to unbiased estimates of the exact gradient.
While BBVI is more general—it does not require the gradient of the model and thus applies to more
settings—its gradients can suffer from high variance.
Figure 8 empirically compares the variance of both estimators for two models. Figure 8a shows the
variance of both gradient estimators for a simple univariate model, where the posterior is a Gamma(10,10).
We estimate the variance using ten thousand re-calculations of the gradient ∇_φ ℒ, across an increasing
number of MC samples M. The ADVI gradient has lower variance; in practice, a single sample suffices.
(See the experiments in Section 4.)
Figure 8b shows the same calculation for a 100-dimensional nonlinear regression model with
likelihood 𝒩(y | tanh(x⊤β), I) and a Gaussian prior on the regression coefficients β. Because this is a
multivariate example, we also show the BBVI gradient with a variance reduction scheme using control
variates described in Ranganath et al. (2014). In both cases, the ADVI gradients are statistically more
efficient.
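A small experiment in the same spirit can be sketched in Python. It does not reproduce Figure 8: the model is the Weibull-Poisson example rather than the two models above, and the data, variational parameters, and function names are assumptions. Each draw of ζ yields one single-sample BBVI estimate and one single-sample ADVI estimate of the µ-gradient, so the two estimators can be compared on identical noise.

```python
import math
import numpy as np

X = np.array([2.0, 1.0, 0.0, 3.0])  # hypothetical observations
K_SHAPE = 1.5                       # Weibull(1.5, 1) prior

def log_joint_zeta(zeta):
    """log p(x, exp(zeta)) plus the log-Jacobian zeta."""
    log_fact = sum(math.lgamma(v + 1) for v in X)
    return (X.sum() * zeta - X.size * np.exp(zeta) - log_fact
            + math.log(K_SHAPE) + (K_SHAPE - 1) * zeta
            - np.exp(K_SHAPE * zeta) + zeta)

def grad_log_joint_zeta(zeta):
    """Derivative of log_joint_zeta with respect to zeta, written out by hand."""
    return X.sum() + K_SHAPE - X.size * np.exp(zeta) - K_SHAPE * np.exp(K_SHAPE * zeta)

def single_sample_gradients(mu, omega, n_draws, rng):
    """One BBVI and one ADVI estimate of dL/dmu per draw of zeta."""
    sigma = math.exp(omega)
    eta = rng.normal(size=n_draws)
    zeta = mu + sigma * eta
    log_q = -0.5 * eta**2 - omega - 0.5 * math.log(2 * math.pi)
    score_mu = eta / sigma                       # d/dmu log q(zeta; phi)
    bbvi = score_mu * (log_joint_zeta(zeta) - log_q)
    advi = grad_log_joint_zeta(zeta)
    return bbvi, advi

rng = np.random.default_rng(0)
bbvi, advi = single_sample_gradients(mu=0.0, omega=-1.0, n_draws=200_000, rng=rng)
print("means:    ", bbvi.mean(), advi.mean())
print("variances:", bbvi.var(), advi.var())
```

Both sample means estimate the same gradient, while the BBVI variance should come out considerably larger in this setting, in line with the comparison described above.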
[Figure 8 panels: (a) Univariate Model; (b) Multivariate Nonlinear Regression Model. Both panels plot gradient variance against the number of MC samples (10^0 to 10^3) on logarithmic axes. Legend: ADVI, BBVI, BBVI with control variate.]
Figure 8: Comparison of gradient estimator variances. The ADVI gradient estimator exhibits lower
variance than the BBVI estimator. Moreover, it does not require control variate variance reduction, which
is not available in univariate situations.