1. Uncertainty in Deep Learning
Introduction to PhD dissertation by Yarin Gal
Yujiro Katagiri
ALBERT Inc.
2018/07/19
Yujiro Katagiri Uncertainty in Deep Learning
2. Why do we need uncertainties?
Uncertainty information is important for the safe application of autonomous systems:
  medical diagnosis
  autonomous driving
  high frequency trading
  ...
Various applications can also make use of uncertainty information:
  active learning
  reinforcement learning
  ...
3. What uncertainties do we need?
There are two types of uncertainty:
Epistemic uncertainty
  from the Greek episteme, meaning "knowledge".
  captures our ignorance about the model.
  can be reduced as the amount of observed data increases.
Aleatoric uncertainty
  from the Latin aleator, meaning "dice player".
  captures noise inherent in the environment.
  cannot be reduced even if more data are available.
Combining epistemic and aleatoric uncertainty gives us the predictive uncertainty.
4. How do we estimate uncertainties?
Most deep learning models used in practice do not offer uncertainty information.
Bayesian neural networks provide such information, but are often impractical.
The author presents a very simple technique for implementing Bayesian deep learning.
5. Principle of Bayesian modelling
Given training inputs $X$ and outputs $Y$, we would like to find the parameters $\omega$ of a function $y = f^{\omega}(x)$ that are likely to have generated our outputs.
Following Bayes' theorem:
$$p(\omega \mid X, Y) = \frac{p(Y \mid X, \omega)\, p(\omega)}{p(Y \mid X)} = \frac{p(Y \mid X, \omega)\, p(\omega)}{\int p(Y \mid X, \omega)\, p(\omega)\, d\omega}$$
If we specify a prior $p(\omega)$ and a likelihood $p(y \mid x, \omega)$, we can in principle infer the posterior $p(\omega \mid X, Y)$.
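As a small illustration of Bayes' theorem, the sketch below evaluates the posterior on a grid for a one-parameter model. Everything specific here is an assumption made for the example and not taken from the dissertation: the model $y = \omega x$, the prior $\mathcal{N}(0, 1)$, a Gaussian likelihood with precision `tau`, the toy data, and the helper name `grid_posterior`. The normalizing integral is replaced by a Riemann sum, which only works because $\omega$ is one-dimensional.

```python
import math

def grid_posterior(xs, ys, tau, grid):
    """Posterior density p(w | X, Y) on a uniform grid for y = w * x,
    with prior N(0, 1) and Gaussian noise of precision tau (toy assumptions)."""
    log_post = []
    for w in grid:
        log_prior = -0.5 * w * w
        log_lik = sum(-0.5 * tau * (y - w * x) ** 2 for x, y in zip(xs, ys))
        log_post.append(log_prior + log_lik)
    # Normalise: the integral over w is approximated by a Riemann sum.
    step = grid[1] - grid[0]
    peak = max(log_post)                      # subtract the peak for numerical stability
    unnorm = [math.exp(v - peak) for v in log_post]
    norm = sum(unnorm) * step
    return [u / norm for u in unnorm]

grid = [-5.0 + 0.01 * i for i in range(1001)]
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]
post = grid_posterior(xs, ys, tau=4.0, grid=grid)
w_map = grid[post.index(max(post))]           # grid point with the highest posterior density
```

For this conjugate toy model the posterior is Gaussian, so the grid maximum can be compared against the closed-form posterior mean $\tau \sum_i x_i y_i / (1 + \tau \sum_i x_i^2)$.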
6. Prior and likelihood
For the prior, we often assume a standard Gaussian distribution:
$$p(\omega) = \mathcal{N}(\omega; 0, I)$$
For the likelihood, we usually specify a Gaussian likelihood for regression and a softmax likelihood for classification:
$$p(y \mid x, \omega) = \mathcal{N}(y; f^{\omega}(x), \tau^{-1} I)$$
$$p(y = d \mid x, \omega) = \frac{\exp(f_d^{\omega}(x))}{\sum_{d'} \exp(f_{d'}^{\omega}(x))}$$
where $\tau$ is the model precision and $d$ is the observed class.
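The two likelihoods can be written as log-likelihood functions. This is a minimal sketch in plain Python; the function names are illustrative, and the network output $f^{\omega}(x)$ is simply passed in as a list of numbers.

```python
import math

def gaussian_log_likelihood(y, f, tau):
    """log N(y; f, tau^{-1} I) for output vectors y and f of equal length."""
    d = len(y)
    sq_err = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    return -0.5 * tau * sq_err + 0.5 * d * math.log(tau) - 0.5 * d * math.log(2.0 * math.pi)

def softmax_log_likelihood(d, logits):
    """log p(y = d | x, w) for a vector of logits, using a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return logits[d] - log_norm
```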
7. Intractability of marginalization
Calculating $p(Y \mid X)$ requires an integration:
$$p(Y \mid X) = \int p(Y \mid X, \omega)\, p(\omega)\, d\omega$$
This integration, called "marginalization" (because it marginalizes the likelihood over $\omega$), is usually intractable: for models such as neural networks it has no closed form.
Therefore, we would like to approximate the posterior without performing the marginalization directly.
Various approximations exist; one of them is the "variational inference" technique we will see next.
8. Variational inference
We define an approximating distribution $q_\theta(\omega)$, parametrized by $\theta$ (the variational parameters).
To approximate the posterior, we minimize the Kullback-Leibler (KL) divergence between the two w.r.t. $\theta$:
$$\mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega \mid X, Y))$$
This is equivalent to maximizing the expected log likelihood minus the KL divergence between $q_\theta(\omega)$ and the prior $p(\omega)$:
$$\int q_\theta(\omega) \log p(Y \mid X, \omega)\, d\omega - \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
Maximizing the first term encourages $q_\theta(\omega)$ to explain the data well, while minimizing the KL term encourages $q_\theta(\omega)$ to stay as close as possible to the prior.
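The equivalence between the two objectives follows from expanding the KL divergence with Bayes' theorem. The short derivation below is written out as a reading aid and is not quoted from the dissertation:

```latex
\begin{aligned}
\mathrm{KL}(q_\theta(\omega)\,\|\,p(\omega \mid X, Y))
 &= \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid X, Y)}\, d\omega \\
 &= \int q_\theta(\omega) \log \frac{q_\theta(\omega)\, p(Y \mid X)}{p(Y \mid X, \omega)\, p(\omega)}\, d\omega \\
 &= -\int q_\theta(\omega) \log p(Y \mid X, \omega)\, d\omega
    + \mathrm{KL}(q_\theta(\omega)\,\|\,p(\omega)) + \log p(Y \mid X)
\end{aligned}
```

Since $\log p(Y \mid X)$ does not depend on $\theta$, minimizing the KL divergence to the posterior is the same as maximizing the expected log likelihood minus the KL divergence to the prior.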
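9. Objective and difficulties
The maximization objective can be rewritten as the following minimization objective:
$$\begin{aligned}
\mathcal{L}_{\text{VI}}(\theta) &= -\int q_\theta(\omega) \log p(Y \mid X, \omega)\, d\omega + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega)) \\
&= -\sum_{i=1}^{N} \int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))
\end{aligned}$$
The second line assumes that the observations are conditionally independent given $\omega$, so the log likelihood splits into a sum over the $N$ data points.
The first term raises two difficulties.
1 The summation over the entire dataset is computationally expensive.
2 The expected log likelihood $\int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega$ is usually intractable.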
10. Solution to the first problem
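1st problem: The summation over the entire dataset is computationally expensive.
Solution: Data sub-sampling (also referred to as mini-batch optimization).
$$\hat{\mathcal{L}}_{\text{VI}}(\theta) = -\frac{N}{M} \sum_{i \in S} \int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
with a random index set $S$ of size $M$.
It forms an unbiased stochastic estimator of $\mathcal{L}_{\text{VI}}(\theta)$, meaning that $\mathbb{E}_S[\hat{\mathcal{L}}_{\text{VI}}(\theta)] = \mathcal{L}_{\text{VI}}(\theta)$.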
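The unbiasedness claim can be checked numerically on a tiny example. The sketch below assumes that $S$ is drawn uniformly from all index sets of size $M$, and uses made-up numbers in place of the $N$ per-datum terms; it averages the scaled mini-batch sum over every possible $S$ and compares the result with the full sum.

```python
from itertools import combinations

def minibatch_estimate(values, subset):
    """(N / M) * sum of the terms indexed by S: the scaled mini-batch sum."""
    n, m = len(values), len(subset)
    return n / m * sum(values[i] for i in subset)

def expected_estimate(values, m):
    """Average of the estimator over every index set S of size m (uniform sampling)."""
    subsets = list(combinations(range(len(values)), m))
    return sum(minibatch_estimate(values, s) for s in subsets) / len(subsets)

per_point_terms = [0.3, -1.2, 2.5, 0.7, -0.4, 1.1]   # stand-ins for the N per-datum terms
full_sum = sum(per_point_terms)
```

Each index appears in the same fraction $M/N$ of the subsets, which is exactly what the factor $N/M$ compensates for.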
11. Solution to the second problem
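2nd problem: The expected log likelihood $\int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega$ is usually intractable.
Solution: Monte Carlo (MC) estimation with the re-parameterization trick.
If $\omega$ is re-parameterized with a differentiable transformation $\omega = g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$ (a parameter-free distribution),
$$\begin{aligned}
\hat{\mathcal{L}}_{\text{VI}}(\theta) &= -\frac{N}{M} \sum_{i \in S} \int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega)) \\
&= -\frac{N}{M} \sum_{i \in S} \int p(\epsilon) \log p(y_i \mid f^{g(\theta, \epsilon)}(x_i))\, d\epsilon + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))
\end{aligned}$$
$\hat{\mathcal{L}}_{\text{VI}}(\theta)$ can be estimated with a new MC estimator
$$\hat{\mathcal{L}}_{\text{MC}}(\theta) = -\frac{N}{M} \sum_{i \in S} \log p(y_i \mid f^{g(\theta, \epsilon)}(x_i)) + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
where $\epsilon \sim p(\epsilon)$, meaning that $\mathbb{E}_\epsilon[\hat{\mathcal{L}}_{\text{MC}}(\theta)] = \hat{\mathcal{L}}_{\text{VI}}(\theta)$.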
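To make the re-parameterization trick concrete, here is a minimal sketch under assumptions chosen for the example: a scalar Gaussian $q_\theta(\omega) = \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$, the transformation $g(\theta, \epsilon) = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, and the test function $h(\omega) = \omega^2$ standing in for the log likelihood. Because $\mathbb{E}[\omega^2] = \mu^2 + \sigma^2$, the exact gradients are $2\mu$ and $2\sigma$, which the MC estimates should approach.

```python
import random

def pathwise_gradients(mu, sigma, n_samples, rng):
    """MC estimates of d/dmu and d/dsigma of E[w^2], with w = mu + sigma * eps."""
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)        # eps ~ p(eps) = N(0, 1), independent of theta
        w = mu + sigma * eps             # w = g(theta, eps)
        dh_dw = 2.0 * w                  # derivative of h(w) = w^2
        g_mu += dh_dw * 1.0              # chain rule: dw/dmu = 1
        g_sigma += dh_dw * eps           # chain rule: dw/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

if __name__ == "__main__":
    print(pathwise_gradients(1.5, 0.5, 200000, random.Random(0)))
```

The randomness now enters only through `eps`, so the derivative can be taken inside the expectation and estimated sample by sample.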
12. Optimization
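Therefore we follow Algorithm 1 to optimize $\theta$.
Algorithm 1
Given dataset $\{X, Y\}$, define a learning rate schedule $\eta$ and initialize the parameters $\theta$ randomly.
repeat
  Sample $M$ random variables $\hat{\epsilon}_i \sim p(\epsilon)$.
  Sample a random index set $S$ of size $M$.
  Calculate $\frac{\partial}{\partial\theta} \hat{\mathcal{L}}_{\text{MC}}(\theta)$:
  $$-\frac{N}{M} \sum_{i \in S} \frac{\partial}{\partial\theta} \log p(y_i \mid f^{g(\theta, \hat{\epsilon}_i)}(x_i)) + \frac{\partial}{\partial\theta} \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
  Update $\theta$ (a descent step, since $\hat{\mathcal{L}}_{\text{MC}}$ is minimized):
  $$\theta \leftarrow \theta - \eta \frac{\partial}{\partial\theta} \hat{\mathcal{L}}_{\text{MC}}(\theta)$$
until $\theta$ has converged.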
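The loop above can be sketched end to end on a toy problem. All modelling choices below are assumptions made for illustration and do not come from the dissertation: a scalar model $f^{\omega}(x) = \omega x$, a Gaussian $q_\theta(\omega) = \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \rho)$ and $\sigma = e^{\rho}$, a prior $\mathcal{N}(0, 1)$ (for which the KL term has a closed form), a known precision `tau`, a constant learning rate, and a fixed number of steps in place of a convergence test.

```python
import math
import random

def make_data(n=100, slope=2.0, noise_sd=0.5, seed=1):
    """Toy regression data y = slope * x + Gaussian noise (hypothetical example)."""
    rng = random.Random(seed)
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def exact_posterior(xs, ys, tau):
    """Closed-form Gaussian posterior over w, used only as a reference."""
    precision = 1.0 + tau * sum(x * x for x in xs)
    mean = tau * sum(x * y for x, y in zip(xs, ys)) / precision
    return mean, 1.0 / math.sqrt(precision)

def train(xs, ys, tau, batch_size=20, steps=4000, lr=5e-4, seed=0):
    """Stochastic gradient descent on the MC objective; returns (mu, sigma)."""
    rng = random.Random(seed)
    n = len(xs)
    mu, rho = 0.0, -1.0
    scale = n / batch_size                       # the N / M factor
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)  # random index set S
        sigma = math.exp(rho)
        # Gradient of KL(N(mu, sigma^2) || N(0, 1)) w.r.t. mu and rho.
        g_mu = mu
        g_rho = sigma * sigma - 1.0
        for i in batch:
            eps = rng.gauss(0.0, 1.0)            # one eps per data point
            w = mu + sigma * eps                 # w = g(theta, eps)
            dlogp_dw = tau * (ys[i] - w * xs[i]) * xs[i]
            g_mu -= scale * dlogp_dw             # dw/dmu = 1
            g_rho -= scale * dlogp_dw * sigma * eps   # dw/drho = sigma * eps
        mu -= lr * g_mu                          # descent step
        rho -= lr * g_rho
    return mu, math.exp(rho)

if __name__ == "__main__":
    xs, ys = make_data()
    print(train(xs, ys, tau=4.0), exact_posterior(xs, ys, tau=4.0))
```

For this conjugate toy model the exact posterior is Gaussian, so the learned `mu` and `sigma` can be compared directly with `exact_posterior`; a neural network offers no such reference.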
13. Objective of neural networks
Let $\omega = \{W_l, b_l\}_{l=1}^{L}$ be the set of neural network parameters, where $L$ is the number of layers.
The objective of neural networks can be written as follows:
$$\mathcal{L}_{\text{NN}}(\omega) := \frac{1}{M} \sum_{i \in S} E^{\omega}(x_i, y_i) + \sum_{l=1}^{L} \lambda_l \left( \|W_l\|^2 + \|b_l\|^2 \right)$$
with a random index set $S$ of size $M$, $E^{\omega}$ the loss, and the second term representing L2 regularization.
14. Loss as likelihood
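The loss can be rewritten in terms of the negative log likelihood.
Euclidean loss can be rewritten with the Gaussian likelihood:
$$E^{\omega}(x, y) = \frac{1}{2} \|y - f^{\omega}(x)\|^2 = -\frac{1}{\tau} \log p(y \mid f^{\omega}(x)) + \text{const.}$$
where $\tau$ is the model precision.
Cross-entropy loss can be rewritten with the softmax likelihood:
$$E^{\omega}(x, y) = -\log \frac{\exp(f_d^{\omega}(x))}{\sum_{d'} \exp(f_{d'}^{\omega}(x))} = -\log p(y \mid f^{\omega}(x))$$
where $d$ is the observed class.
Thus, $\mathcal{L}_{\text{NN}}$ can be rewritten (up to an additive constant) as follows:
$$\mathcal{L}_{\text{NN}}(\omega) = -\frac{1}{M\tau} \sum_{i \in S} \log p(y_i \mid f^{\omega}(x_i)) + \sum_{l=1}^{L} \lambda_l \left( \|W_l\|^2 + \|b_l\|^2 \right)$$
where $\tau = 1$ in classification.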
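The Euclidean-loss identity can be checked numerically for a scalar output. In this sketch (function names are illustrative), the difference between the Euclidean loss and $-\frac{1}{\tau} \log p(y \mid f)$ should be the same constant, $\frac{1}{2\tau} \log \frac{\tau}{2\pi}$, for every choice of $y$ and $f$.

```python
import math

def euclidean_loss(y, f):
    """0.5 * (y - f)^2 for scalar y and f."""
    return 0.5 * (y - f) ** 2

def gaussian_log_lik(y, f, tau):
    """log N(y; f, 1 / tau) for scalar y and f."""
    return 0.5 * math.log(tau / (2.0 * math.pi)) - 0.5 * tau * (y - f) ** 2

def loss_minus_scaled_nll(y, f, tau):
    """E(x, y) minus -(1 / tau) * log p(y | f); should not depend on y or f."""
    return euclidean_loss(y, f) + gaussian_log_lik(y, f, tau) / tau
```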
15. Similarity to variational inference
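If $\omega$ is re-parameterized with a differentiable transformation $\omega = g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$ (a parameter-free distribution), $\mathcal{L}_{\text{NN}}$ can be rewritten as follows:
$$-\frac{1}{M\tau} \sum_{i \in S} \log p(y_i \mid f^{g(\theta, \epsilon)}(x_i)) + \sum_{l=1}^{L} \lambda_l \left( \|W_l\|^2 + \|b_l\|^2 \right)$$
This objective is remarkably similar to the MC objective for variational inference:
$$\hat{\mathcal{L}}_{\text{MC}}(\theta) = -\frac{N}{M} \sum_{i \in S} \log p(y_i \mid f^{g(\theta, \epsilon)}(x_i)) + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
The two differ only in the scaling of the likelihood term and in the regularizer (an L2 penalty versus a KL divergence).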
16. Dropout as re-parameterization
As an example of re-parameterization, assume
$$W_l = \mathrm{diag}(\epsilon_l)\, M_l \quad \text{with} \quad \epsilon_l \sim p(\epsilon_l)$$
where $l \in \{1, ..., L\}$ is the layer index, $M_l$ a deterministic weight matrix, and $\epsilon_l$ a binary vector which follows $p(\epsilon_l)$, a product of Bernoulli distributions in which each element is 1 with probability $1 - p_l$.
In this case, the input feature $h_l$ to each layer is transformed as follows:
$$h_{l+1} = \sigma(h_l W_l + b_l) = \sigma(h_l\, \mathrm{diag}(\epsilon_l)\, M_l + b_l)$$
where $\sigma$ is the non-linearity.
Multiplying the input feature vector $h_l$ by the binary diagonal matrix $\mathrm{diag}(\epsilon_l)$ drops some features from $h_l$, which is the widely known dropout.
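The equivalence between masking rows of the weight matrix and masking input features can be verified on a small example. The sketch below uses plain Python lists with made-up values; the helper names are illustrative.

```python
def matvec_row(h, mat):
    """Row vector h (length K) times matrix mat (K rows, D columns)."""
    return [sum(h[k] * mat[k][d] for k in range(len(h))) for d in range(len(mat[0]))]

def mask_weights(eps, mat):
    """diag(eps) @ mat: row k of the weight matrix is scaled by eps[k]."""
    return [[eps[k] * v for v in row] for k, row in enumerate(mat)]

def mask_features(eps, h):
    """h @ diag(eps): feature k of the input is scaled by eps[k]."""
    return [e * v for e, v in zip(eps, h)]

h = [0.5, -1.0, 2.0]                        # input features h_l
M = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # deterministic weights M_l
eps = [1, 0, 1]                             # one realization of the binary mask

via_weights = matvec_row(h, mask_weights(eps, M))     # h (diag(eps) M)
via_features = matvec_row(mask_features(eps, h), M)   # (h diag(eps)) M
```

Both orders of multiplication give the same pre-activation, which is why randomness placed on the weights can be implemented as ordinary dropout on the inputs of each layer.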
17. Dropout as a variational distribution
Moreover, if $W_l$ is re-parameterized as $W_l = \mathrm{diag}(\epsilon_l)\, M_l$, each row $w_{l,k}$ of each weight matrix $W_l$ follows the distribution $q(w_{l,k})$:
$$q(w_{l,k}) = \int q(w_{l,k} \mid \epsilon_{l,k})\, p(\epsilon_{l,k})\, d\epsilon_{l,k}$$
$$q(w_{l,k} \mid \epsilon_{l,k}) = \delta(w_{l,k} - \epsilon_{l,k} m_{l,k})$$
$$\epsilon_{l,k} \sim p(\epsilon_{l,k})$$
where $p(\epsilon_{l,k})$ is a Bernoulli distribution with probability $1 - p_l$, $\delta(\cdot)$ is a point mass at zero (so that $w_{l,k} = \epsilon_{l,k} m_{l,k}$ with probability one), $\epsilon_{l,k}$ is a binary scalar, and $m_{l,k}$ is the $k$th row vector of the weight matrix $M_l$.
Therefore, dropout can be seen as setting $q(\omega) = \prod_{l,k} q(w_{l,k})$ as the variational distribution.
The author refers to this distribution as the Bernoulli variational distribution or the dropout variational distribution.
18. Dropout objective
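Writing the re-parameterization with dropout as $\omega = g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$, the minimization objective becomes:
$$\hat{\mathcal{L}}_{\text{dropout}}(\theta) = -\frac{1}{M\tau} \sum_{i \in S} \log p(y_i \mid f^{g(\theta, \hat{\epsilon}_i)}(x_i)) + \sum_{l=1}^{L} \lambda_l \left( \|M_l\|^2 + \|b_l\|^2 \right)$$
with $\hat{\epsilon}_i$ realizations of the random variable $\epsilon$.
The derivative of this objective w.r.t. $\theta = \{M_l, b_l\}_{l=1}^{L}$ is:
$$\frac{\partial}{\partial\theta} \hat{\mathcal{L}}_{\text{dropout}}(\theta) = -\frac{1}{M\tau} \sum_{i \in S} \frac{\partial}{\partial\theta} \log p(y_i \mid f^{g(\theta, \hat{\epsilon}_i)}(x_i)) + \frac{\partial}{\partial\theta} \sum_{l=1}^{L} \lambda_l \left( \|M_l\|^2 + \|b_l\|^2 \right)$$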
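A minimal sketch of evaluating the dropout objective on one mini-batch, under assumptions made for the example: a hypothetical single-layer model $f(x) = (x \odot \epsilon) \cdot m + b$ with dropout on its inputs, scalar targets, and Euclidean loss, so that $-\frac{1}{\tau} \log p(y \mid f)$ equals $\frac{1}{2}(y - f)^2$ up to a constant that is dropped. The function and argument names are illustrative.

```python
import random

def dropout_objective(xs, ys, m, b, lam, p_drop, rng):
    """Mini-batch dropout objective for f(x) = (x * eps) . m + b with Euclidean loss."""
    total = 0.0
    for x, y in zip(xs, ys):
        # A fresh binary mask per data point: each element is kept with probability 1 - p_drop.
        eps = [0.0 if rng.random() < p_drop else 1.0 for _ in x]
        pred = sum(xk * ek * mk for xk, ek, mk in zip(x, eps, m)) + b
        total += 0.5 * (y - pred) ** 2        # -(1 / tau) log p(y | f) up to a constant
    reg = lam * (sum(mk * mk for mk in m) + b * b)   # L2 penalty on M and b
    return total / len(xs) + reg

batch_x = [[1.0, 2.0], [3.0, 4.0]]
batch_y = [1.0, 2.0]
value = dropout_objective(batch_x, batch_y, m=[0.5, -0.5], b=0.1,
                          lam=0.01, p_drop=0.5, rng=random.Random(0))
```

With `p_drop=0.0` the mask is all ones and the function reduces to the usual deterministic regularized loss, which gives a simple sanity check.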
19. Equality to variational inference
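It can be shown that if we set the prior $p(\omega)$ to
$$p(\omega) = \prod_l p(W_l) = \prod_l \mathcal{MN}(W_l; 0, I / l_l^2, I) \quad \text{with} \quad l_l^2 = \frac{2 N \tau \lambda_l}{1 - p_l}$$
where $\mathcal{MN}$ represents a matrix normal distribution, $l_l$ the prior length scale, $N$ the number of training samples, $\tau$ the model precision, $\lambda_l$ the L2 regularization weight, and $p_l$ the dropout rate, and set the approximating variational distribution $q_\theta(\omega)$ to the dropout variational distribution, we have
$$\frac{\partial}{\partial\theta} N\tau \sum_{l=1}^{L} \lambda_l \left( \|M_l\|^2 + \|b_l\|^2 \right) \approx \frac{\partial}{\partial\theta} \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
In that case, the dropout objective $\hat{\mathcal{L}}_{\text{dropout}}(\theta)$ is related to the variational inference objective $\hat{\mathcal{L}}_{\text{MC}}(\theta)$ by
$$\frac{\partial}{\partial\theta} \hat{\mathcal{L}}_{\text{dropout}}(\theta) = \frac{1}{N\tau} \frac{\partial}{\partial\theta} \hat{\mathcal{L}}_{\text{MC}}(\theta)$$
Since $\frac{1}{N\tau}$ is a positive constant, the two objectives have the same stationary points.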
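The relation $l_l^2 = 2N\tau\lambda_l / (1 - p_l)$ ties the prior length scale to quantities that are set when training with dropout. A small sketch of the conversion in both directions follows; the function names and the example numbers are illustrative only.

```python
def prior_length_scale_sq(n_train, tau, weight_decay, p_drop):
    """l^2 = 2 * N * tau * lambda / (1 - p)."""
    return 2.0 * n_train * tau * weight_decay / (1.0 - p_drop)

def model_precision(n_train, length_scale_sq, weight_decay, p_drop):
    """The same relation solved for tau: tau = (1 - p) * l^2 / (2 * N * lambda)."""
    return (1.0 - p_drop) * length_scale_sq / (2.0 * n_train * weight_decay)

l_sq = prior_length_scale_sq(n_train=1000, tau=4.0, weight_decay=1e-4, p_drop=0.5)
tau_back = model_precision(n_train=1000, length_scale_sq=l_sq, weight_decay=1e-4, p_drop=0.5)
```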
20. Dropout as a Bayesian approximation
We have come to the following conclusion:
Dropout as a Bayesian approximation
A neural network of arbitrary depth and with arbitrary non-linearities, trained with dropout applied before every weight layer and with L2 regularization, is an approximation to a Bayesian neural network.
21. Estimation of predictive mean
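Given our approximate posterior $q_\theta^*(\omega)$, we can infer the distribution of an output $y^*$ for a new input point $x^*$:
$$\begin{aligned}
p(y^* \mid x^*, X, Y) &:= \int p(y^* \mid x^*, \omega)\, p(\omega \mid X, Y)\, d\omega \\
&\approx \int p(y^* \mid x^*, \omega)\, q_\theta^*(\omega)\, d\omega =: q_\theta^*(y^* \mid x^*)
\end{aligned}$$
The predictive mean is estimated by performing $T$ stochastic forward passes through the network, each with weights $\hat{\omega}_t \sim q_\theta^*(\omega)$, and averaging the results.
regression
$$\mathbb{E}_{q_\theta^*(y^* \mid x^*)}[y^*] \approx \frac{1}{T} \sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)$$
classification
$$q_\theta^*(y^* = d \mid x^*) \approx \frac{1}{T} \sum_{t=1}^{T} \frac{\exp(f_d^{\hat{\omega}_t}(x^*))}{\sum_{d'} \exp(f_{d'}^{\hat{\omega}_t}(x^*))}$$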
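The classification estimate can be sketched as follows, assuming a hypothetical toy model with a single linear layer and dropout on its inputs; `weights` holds one weight vector per class, and all names are illustrative. The point is only the structure: keep dropout switched on at test time, run $T$ passes, and average the softmax outputs.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_forward(x, weights, bias, p_drop, rng):
    """One forward pass with a freshly sampled dropout mask on the inputs."""
    eps = [0.0 if rng.random() < p_drop else 1.0 for _ in x]
    return [sum(xk * ek * wk for xk, ek, wk in zip(x, eps, column)) + b
            for column, b in zip(weights, bias)]

def mc_dropout_probs(x, weights, bias, p_drop, n_passes, rng):
    """Average the softmax outputs of n_passes stochastic forward passes."""
    sums = [0.0] * len(bias)
    for _ in range(n_passes):
        probs = softmax(stochastic_forward(x, weights, bias, p_drop, rng))
        sums = [s + p for s, p in zip(sums, probs)]
    return [s / n_passes for s in sums]

weights = [[0.5, -0.5], [0.1, 0.2]]   # one weight vector per class
bias = [0.0, 0.0]
probs = mc_dropout_probs([1.0, 2.0], weights, bias, p_drop=0.5,
                         n_passes=100, rng=random.Random(0))
```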
22. Estimation of uncertainty in regression
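The predictive variance is estimated in regression as
$$\mathrm{Var}_{q_\theta^*(y^* \mid x^*)}[y^*] \approx \tau^{-1} I + \frac{1}{T} \sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)^{\top} f^{\hat{\omega}_t}(x^*) - \left( \frac{1}{T} \sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*) \right)^{\top} \left( \frac{1}{T} \sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*) \right)$$
where $y^*$ and $f^{\hat{\omega}_t}(x^*)$ are row vectors. The model precision $\tau = \frac{(1 - p_l)\, l_l^2}{2 N \lambda_l}$ is determined by the hyperparameters ($\lambda_l$, $l_l$, and $p_l$), which are found by grid search to minimize validation error.
The first term $\tau^{-1} I$ captures the aleatoric uncertainty and the remaining terms capture the epistemic uncertainty.
Note that the factor of 2 is removed (i.e., $\tau = \frac{(1 - p_l)\, l_l^2}{N \lambda_l}$) when we use mean-squared-error loss instead of Euclidean loss.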
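For a scalar output the variance estimate reduces to a few lines. The sketch below assumes the $T$ stochastic forward-pass outputs have already been collected in a list; the function name is illustrative.

```python
def predictive_mean_and_variance(samples, tau):
    """samples: T scalar outputs from stochastic forward passes at the same input x*."""
    t = len(samples)
    mean = sum(samples) / t
    second_moment = sum(s * s for s in samples) / t
    epistemic = second_moment - mean * mean   # spread of the sampled outputs
    aleatoric = 1.0 / tau                     # constant observation noise
    return mean, aleatoric + epistemic

mean, variance = predictive_mean_and_variance([1.0, 2.0, 3.0], tau=4.0)
```

With only three samples this is purely illustrative; in practice $T$ would be much larger so that the sample variance is a stable estimate.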
23. Estimation of uncertainty in classification
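The mutual information is estimated in classification as
$$\begin{aligned}
I[y^*, \omega \mid x^*, X, Y] &= H[y^* \mid x^*, X, Y] - \mathbb{E}_{p(\omega \mid X, Y)}\left[ H[y^* \mid x^*, \omega] \right] \\
&\approx -\sum_{d} \left( \frac{1}{T} \sum_{t} p_d^{\hat{\omega}_t} \right) \log \left( \frac{1}{T} \sum_{t} p_d^{\hat{\omega}_t} \right) - \frac{1}{T} \sum_{t} \left( -\sum_{d} p_d^{\hat{\omega}_t} \log p_d^{\hat{\omega}_t} \right)
\end{aligned}$$
where
$$p_d^{\hat{\omega}_t} = \frac{\exp(f_d^{\hat{\omega}_t}(x^*))}{\sum_{d'} \exp(f_{d'}^{\hat{\omega}_t}(x^*))}.$$
This captures the epistemic uncertainty.
In the context of active learning, this quantity is called BALD (Bayesian active learning by disagreement).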
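The estimator is the entropy of the averaged prediction minus the average entropy of the individual predictions. A minimal sketch, assuming the class probabilities from $T$ stochastic forward passes are already available as lists (function names are illustrative, and $0 \log 0$ is treated as 0):

```python
import math

def entropy(probs):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mutual_information(prob_samples):
    """prob_samples: T lists of class probabilities, one per stochastic forward pass."""
    t = len(prob_samples)
    n_classes = len(prob_samples[0])
    mean_probs = [sum(ps[d] for ps in prob_samples) / t for d in range(n_classes)]
    predictive_entropy = entropy(mean_probs)                      # entropy of the average
    expected_entropy = sum(entropy(ps) for ps in prob_samples) / t  # average of the entropies
    return predictive_entropy - expected_entropy
```

When every pass returns the same probabilities the two terms cancel and the result is zero; when the passes confidently disagree (for example `[1, 0]` versus `[0, 1]`) the result reaches $\log 2$ for two classes.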
24. Heteroscedastic aleatoric uncertainty
For regression, we have assumed homoscedastic aleatoric uncertainty, which means that observation noise is constant for every input point $x$. This can be seen in the definition of our likelihood $p(y \mid x, \omega) = \mathcal{N}(y; f^{\omega}(x), \tau^{-1} I)$, where the observation noise $\tau^{-1}$ is constant.
However, we can also assume heteroscedastic aleatoric uncertainty, which means that observation noise can vary with input point $x$. This simply involves making $\tau$ into a function of the data: $p(y \mid x, \omega) = \mathcal{N}(y; f^{\omega}(x), g^{\omega}(x)^{-1})$.
25. Estimation of heteroscedastic aleatoric uncertainty
With this likelihood, the loss function becomes
$$\begin{aligned}
E^{\omega}(x, y) &:= -\log \mathcal{N}(y; f^{\omega}(x), g^{\omega}(x)^{-1}) \\
&= \frac{1}{2} (y - f^{\omega}(x))\, g^{\omega}(x)\, (y - f^{\omega}(x))^{\top} - \frac{1}{2} \log \det g^{\omega}(x) + \text{const.}
\end{aligned}$$
In practice, it is convenient to assume a diagonal precision matrix for $g^{\omega}(x)$, in which case the log determinant reduces to a sum of the logs of the diagonal elements of $g^{\omega}(x)$.
We split the top layers of the network into two parts, one estimating the predictive mean $f^{\omega}(x)$ and the other the model precision $g^{\omega}(x)$.
We only need to estimate the diagonal elements of $g^{\omega}(x)$.
For numerical stability, we regress the log precisions and exponentiate them instead of regressing the precisions directly.
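With a diagonal precision matrix parameterized by log precisions, the loss becomes a sum over output dimensions. A minimal sketch for one data point, where `mean` and `log_precision` stand for the two network heads (names are illustrative) and the additive constant is dropped:

```python
import math

def heteroscedastic_loss(y, mean, log_precision):
    """-log N(y; mean, diag(exp(log_precision))^{-1}), up to an additive constant."""
    loss = 0.0
    for yk, fk, sk in zip(y, mean, log_precision):
        # exp(sk) is the precision of dimension k; -0.5 * sk is its share of -0.5 * log det g.
        loss += 0.5 * math.exp(sk) * (yk - fk) ** 2 - 0.5 * sk
    return loss
```

Setting every log precision to zero recovers the Euclidean loss, while a larger predicted precision increases the penalty on the squared error and is rewarded through the log-determinant term.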
26. Further readings
Aleatoric uncertainty in classification
  Kendall A & Gal Y. 2017. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? https://arxiv.org/abs/1703.04977
  Kendall A, Gal Y & Cipolla R. 2018. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. https://arxiv.org/abs/1705.07115
Application to convolutional neural networks
  Gal Y & Ghahramani Z. 2015. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. https://arxiv.org/abs/1506.02158
Application to recurrent neural networks
  Gal Y & Ghahramani Z. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. https://arxiv.org/abs/1512.05287