Uncertainty in Deep Learning
Introduction to PhD dissertation by Yarin Gal
Yujiro Katagiri
ALBERT Inc.
2018/07/19

Why do we need uncertainties?
Uncertainty information is important for the safe application of autonomous systems.
- medical diagnosis
- autonomous driving
- high frequency trading
- ...
There are various applications which can make use of uncertainty information.
- active learning
- reinforcement learning
- ...

What uncertainties do we need?
There are two types of uncertainty:
Epistemic uncertainty
- from the Greek episteme, meaning "knowledge".
- captures our ignorance about the model.
- is reduced as the amount of observed data increases.
Aleatoric uncertainty
- from the Latin aleator, meaning "dice player".
- captures noise inherent in the environment.
- cannot be reduced even if more data were available.
Combining epistemic and aleatoric uncertainty gives us predictive uncertainty.

How do we estimate uncertainties?
Most deep learning models used in practice do not offer uncertainty information.
Bayesian neural networks give us such information, but are often not practical.
The author presents a very simple technique to implement Bayesian deep learning.

Principle of Bayesian modelling
Given training inputs $X$ and outputs $Y$, we would like to find the parameters $\omega$ of a function $y = f^{\omega}(x)$ that are likely to have generated our outputs.
Following Bayes' theorem:
$$p(\omega \mid X, Y) = \frac{p(Y \mid X, \omega)\, p(\omega)}{p(Y \mid X)} = \frac{p(Y \mid X, \omega)\, p(\omega)}{\int p(Y \mid X, \omega)\, p(\omega)\, d\omega}$$
If we specify a prior $p(\omega)$ and a likelihood $p(y \mid x, \omega)$, we can in principle infer the posterior $p(\omega \mid X, Y)$.
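To make the roles of prior, likelihood and evidence concrete, here is a minimal sketch that evaluates Bayes' theorem on a grid for a single scalar parameter. The toy data, the linear model $f^{\omega}(x) = \omega x$, the precision `tau` and the grid range are all assumptions chosen for illustration; they are not taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 2x + Gaussian noise with precision tau.
tau = 4.0
X = rng.uniform(-1.0, 1.0, size=20)
Y = 2.0 * X + rng.normal(0.0, tau ** -0.5, size=20)

# Grid over the single parameter omega.
omega = np.linspace(-5.0, 5.0, 2001)
d_omega = omega[1] - omega[0]

# log prior N(omega; 0, 1) and log likelihood sum_i log N(y_i; omega * x_i, 1/tau)
log_prior = -0.5 * omega ** 2 - 0.5 * np.log(2 * np.pi)
resid = Y[None, :] - omega[:, None] * X[None, :]
log_lik = np.sum(-0.5 * tau * resid ** 2 + 0.5 * np.log(tau / (2 * np.pi)), axis=1)

# Numerator of Bayes' theorem, rescaled by its maximum for numerical stability.
log_joint = log_lik + log_prior
unnorm = np.exp(log_joint - log_joint.max())

# Denominator: the evidence p(Y|X), here a one-dimensional numerical integral.
evidence_scaled = np.sum(unnorm) * d_omega
posterior = unnorm / evidence_scaled
grid_mean = np.sum(omega * posterior) * d_omega

# Closed-form posterior mean of this conjugate model, used only as a check.
post_prec = 1.0 + tau * np.sum(X ** 2)
exact_mean = tau * np.sum(X * Y) / post_prec
```

The grid works here only because there is a single parameter. With the many weights of a neural network, the integral in the denominator has no comparable shortcut.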

Prior and likelihood
For the prior, we often assume a Gaussian distribution:
$$p(\omega) = \mathcal{N}(\omega; 0, I)$$
For the likelihood, we usually specify a Gaussian likelihood for regression and a softmax likelihood for classification:
$$p(y \mid x, \omega) = \mathcal{N}(y; f^{\omega}(x), \tau^{-1} I)$$
$$p(y = d \mid x, \omega) = \frac{\exp(f_d^{\omega}(x))}{\sum_{d'} \exp(f_{d'}^{\omega}(x))}$$
where $\tau$ is the model precision and $d$ is the observed class.
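The two likelihoods can be written as short functions of the network output. This is only a sketch: the function names are hypothetical, `f` stands for the network output $f^{\omega}(x)$, and the softmax version subtracts the maximum logit for numerical stability, a detail the slides do not mention.

```python
import numpy as np

def gaussian_log_likelihood(y, f, tau):
    """log N(y; f, tau^{-1} I) for a D-dimensional target y and output f."""
    y, f = np.asarray(y, float), np.asarray(f, float)
    d = y.shape[-1]
    return 0.5 * d * np.log(tau / (2 * np.pi)) - 0.5 * tau * np.sum((y - f) ** 2, axis=-1)

def softmax_log_likelihood(d, f):
    """log p(y = d | x, omega) for logits f (log-sum-exp with max subtraction)."""
    f = np.asarray(f, float)
    shifted = f - f.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
    return log_probs[..., d]
```

For example, `softmax_log_likelihood(0, [0.0, 0.0])` is the log of one half, since two equal logits give each class probability 0.5.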

Intractability of marginalization
An integration is required for the calculation of $p(Y \mid X)$:
$$p(Y \mid X) = \int p(Y \mid X, \omega)\, p(\omega)\, d\omega$$
This integration, called "marginalization" (because it marginalizes the likelihood over $\omega$), is usually intractable.
Therefore, we would like to approximate the posterior without performing the marginalization directly.
Various approximations exist, and one of them is the "variational inference" technique we will see next.

Variational inference
We define an approximating distribution $q_\theta(\omega)$, parametrized by $\theta$ (the variational parameters).
To approximate the posterior, we minimize the Kullback-Leibler (KL) divergence between them w.r.t. $\theta$:
$$\mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega \mid X, Y))$$
which is equivalent to maximizing the expected log likelihood minus the KL divergence between $q_\theta(\omega)$ and the prior $p(\omega)$:
$$\int q_\theta(\omega) \log p(Y \mid X, \omega)\, d\omega - \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
Maximizing the first term encourages $q_\theta(\omega)$ to explain the data well, while minimizing the second term encourages $q_\theta(\omega)$ to be as close as possible to the prior.
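The equivalence stated above follows from a short calculation, sketched here using only Bayes' theorem and the definition of the KL divergence:

```latex
\begin{aligned}
\mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega \mid X, Y))
  &= \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid X, Y)}\, d\omega \\
  &= \int q_\theta(\omega) \log \frac{q_\theta(\omega)\, p(Y \mid X)}{p(Y \mid X, \omega)\, p(\omega)}\, d\omega \\
  &= \log p(Y \mid X)
     - \int q_\theta(\omega) \log p(Y \mid X, \omega)\, d\omega
     + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))
\end{aligned}
```

The term $\log p(Y \mid X)$ does not depend on $\theta$, so minimizing the left-hand side w.r.t. $\theta$ is the same as maximizing the expected log likelihood minus the KL divergence to the prior. The intractable evidence never has to be computed.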

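Objective and difficulties
The maximization objective can be rewritten as the following minimization objective.
$$\begin{aligned}
\mathcal{L}_{\mathrm{VI}}(\theta) &= -\int q_\theta(\omega) \log p(Y \mid X, \omega)\, d\omega + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega)) \\
&= -\sum_{i=1}^{N} \int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))
\end{aligned}$$
The first term raises two difficulties.
1. The summation over the entire dataset is computationally expensive.
2. The expected log likelihood $\int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega$ is usually intractable.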

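Solution to the first problem
1st problem: The summation over the entire dataset is computationally expensive.
Solution: Data sub-sampling (also referred to as mini-batch optimization).
$$\hat{\mathcal{L}}_{\mathrm{VI}}(\theta) = -\frac{N}{M} \sum_{i \in S} \int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
with a random index set $S$ of size $M$.
This is an unbiased stochastic estimator of $\mathcal{L}_{\mathrm{VI}}(\theta)$, meaning that $\mathbb{E}_S[\hat{\mathcal{L}}_{\mathrm{VI}}(\theta)] = \mathcal{L}_{\mathrm{VI}}(\theta)$.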
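The unbiasedness claim can be checked numerically. In this sketch the array `per_point` is a hypothetical stand-in for the $N$ per-datapoint terms of the sum (random numbers, not real log likelihoods), and the sizes `N` and `M` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 20
per_point = rng.normal(size=N)   # stand-in for the N per-datapoint terms

full_sum = per_point.sum()       # what the full-data objective would use

def minibatch_estimate(rng):
    """Rescaled sum over a random index set S of size M."""
    S = rng.choice(N, size=M, replace=False)
    return (N / M) * per_point[S].sum()

# A single estimate is noisy, but the average over many index sets
# approaches the full sum, which is what unbiasedness means.
estimates = np.array([minibatch_estimate(rng) for _ in range(20000)])
print(full_sum, estimates.mean(), estimates.std())
```

The factor $N/M$ is what makes the estimator unbiased. Without it, the data term would be underweighted relative to the KL term.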

Solution to the second problem
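2nd problem: The expected log likelihood $\int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega$ is usually intractable.
Solution: Monte Carlo (MC) estimation with the re-parameterization trick.
If $\omega$ is re-parameterized with a differentiable transformation $\omega = g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$ (a parameter-free distribution),
$$\begin{aligned}
\hat{\mathcal{L}}_{\mathrm{VI}}(\theta) &= -\frac{N}{M} \sum_{i \in S} \int q_\theta(\omega) \log p(y_i \mid f^{\omega}(x_i))\, d\omega + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega)) \\
&= -\frac{N}{M} \sum_{i \in S} \int p(\epsilon) \log p(y_i \mid f^{g(\theta, \epsilon)}(x_i))\, d\epsilon + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))
\end{aligned}$$
$\hat{\mathcal{L}}_{\mathrm{VI}}(\theta)$ can be estimated with a new MC estimator
$$\hat{\mathcal{L}}_{\mathrm{MC}}(\theta) = -\frac{N}{M} \sum_{i \in S} \log p(y_i \mid f^{g(\theta, \epsilon)}(x_i)) + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
where $\epsilon \sim p(\epsilon)$, meaning that $\mathbb{E}_{\epsilon}[\hat{\mathcal{L}}_{\mathrm{MC}}(\theta)] = \hat{\mathcal{L}}_{\mathrm{VI}}(\theta)$.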
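As an illustration of the re-parameterization trick, the sketch below assumes a Gaussian approximating distribution, so that $g(\theta, \epsilon) = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. The scalar model $f^{\omega}(x) = \omega x$ and all numeric values are assumptions for the example. For this simple case the expected log likelihood has a closed form, which lets us compare it with the MC estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(theta, eps):
    """Differentiable transformation omega = g(theta, eps) for a Gaussian q."""
    mu, sigma = theta
    return mu + sigma * eps

def log_lik(y, x, omega, tau):
    """log N(y; omega * x, 1/tau) for the scalar model f(x) = omega * x."""
    return 0.5 * np.log(tau / (2 * np.pi)) - 0.5 * tau * (y - omega * x) ** 2

theta = (0.5, 0.3)            # hypothetical variational parameters (mu, sigma)
x, y, tau = 1.5, 1.0, 2.0     # one hypothetical data point and precision

# MC estimate: sample eps from the parameter-free p(eps), push through g.
eps = rng.standard_normal(200_000)
mc_estimate = log_lik(y, x, g(theta, eps), tau).mean()

# Closed form of the same expectation, available only in this toy setting.
mu, sigma = theta
exact = 0.5 * np.log(tau / (2 * np.pi)) - 0.5 * tau * ((y - mu * x) ** 2 + (sigma * x) ** 2)
print(mc_estimate, exact)
```

Because the randomness sits in `eps` rather than in `theta`, the sampled log likelihood is a differentiable function of `theta`, which is what the optimization step relies on.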

Optimization
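Therefore we follow Algorithm 1 to optimize $\theta$.
Algorithm 1
Given dataset $\{X, Y\}$, define a learning rate schedule $\eta$ and initialize the parameters $\theta$ randomly.
repeat
    Sample $M$ random variables $\hat{\epsilon}_i \sim p(\epsilon)$.
    Sample a random index set $S$ of size $M$.
    Calculate the stochastic gradient:
    $$\frac{\partial}{\partial \theta} \hat{\mathcal{L}}_{\mathrm{MC}}(\theta) = -\frac{N}{M} \sum_{i \in S} \frac{\partial}{\partial \theta} \log p(y_i \mid f^{g(\theta, \hat{\epsilon}_i)}(x_i)) + \frac{\partial}{\partial \theta} \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
    Update $\theta$ (a descent step, since $\hat{\mathcal{L}}_{\mathrm{MC}}$ is minimized):
    $$\theta \leftarrow \theta - \eta \frac{\partial}{\partial \theta} \hat{\mathcal{L}}_{\mathrm{MC}}(\theta)$$
until $\theta$ has converged.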
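The loop can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: a single scalar weight with $f^{\omega}(x) = \omega x$, a Gaussian $q_\theta(\omega) = \mathcal{N}(\mu, \sigma^2)$ with $\sigma = \exp(\rho)$, a standard normal prior, hand-derived gradients, and a constant learning rate. The sketch takes a descent step (it subtracts the gradient) because the objective is a minimization objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: y = omega * x + noise with precision tau.
N, M, tau = 100, 20, 4.0
X = rng.uniform(-1.0, 1.0, size=N)
Y = 2.0 * X + rng.normal(0.0, tau ** -0.5, size=N)

# theta = (mu, rho); q_theta(omega) = N(mu, sigma^2) with sigma = exp(rho).
mu, rho = 0.0, 0.0
eta = 1e-3  # constant learning rate, the simplest possible schedule

for step in range(5000):
    eps = rng.standard_normal(M)                # eps_i ~ p(eps)
    S = rng.choice(N, size=M, replace=False)    # random index set S
    sigma = np.exp(rho)
    omega = mu + sigma * eps                    # omega_i = g(theta, eps_i)

    # d/d omega of log N(y_i; omega_i * x_i, 1/tau)
    dlog_domega = tau * (Y[S] - omega * X[S]) * X[S]

    # Chain rule through g, plus the gradient of KL(N(mu, sigma^2) || N(0, 1)),
    # which is mu w.r.t. mu and sigma^2 - 1 w.r.t. rho.
    grad_mu = -(N / M) * np.sum(dlog_domega) + mu
    grad_rho = -(N / M) * np.sum(dlog_domega * sigma * eps) + (sigma ** 2 - 1.0)

    mu -= eta * grad_mu
    rho -= eta * grad_rho

# Exact posterior of this conjugate model, for comparison only.
post_prec = 1.0 + tau * np.sum(X ** 2)
exact_mean = tau * np.sum(X * Y) / post_prec
exact_std = post_prec ** -0.5
print(mu, exact_mean, np.exp(rho), exact_std)
```

In this conjugate toy case the exact posterior is itself Gaussian, so the fitted `mu` and `exp(rho)` should land near the exact mean and standard deviation, up to the noise of stochastic optimization.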

Objective of neural networks
Let $\omega = \{W_l, b_l\}_{l=1}^{L}$ be the set of neural network parameters, where $L$ is the number of layers.
The objective of neural networks can be written as follows:
$$\mathcal{L}_{\mathrm{NN}}(\omega) := \frac{1}{M} \sum_{i \in S} E^{\omega}(x_i, y_i) + \sum_{l=1}^{L} \lambda_l (\|W_l\|^2 + \|b_l\|^2)$$
with a random index set $S$ of size $M$, $E^{\omega}$ the loss, and the second term representing the L2 regularization.

Loss as likelihood
The loss can be rewritten as a negative log likelihood.
Euclidean loss can be rewritten with a Gaussian likelihood:
$$E^{\omega}(x, y) = \frac{1}{2} \|y - f^{\omega}(x)\|^2 = -\frac{1}{\tau} \log p(y \mid f^{\omega}(x)) + \text{const.}$$
where $\tau$ is the model precision.
Cross-entropy loss can be rewritten with a softmax likelihood:
$$E^{\omega}(x, y) = -\log \frac{\exp(f_d^{\omega}(x))}{\sum_{d'} \exp(f_{d'}^{\omega}(x))} = -\log p(y \mid f^{\omega}(x))$$
where $d$ is the observed class.
Thus, $\mathcal{L}_{\mathrm{NN}}$ can be rewritten as follows:
$$\mathcal{L}_{\mathrm{NN}}(\omega) = -\frac{1}{M\tau} \sum_{i \in S} \log p(y_i \mid f^{\omega}(x_i)) + \sum_{l=1}^{L} \lambda_l (\|W_l\|^2 + \|b_l\|^2)$$
where $\tau = 1$ in classification.
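The Euclidean-loss identity can be checked numerically: if the loss equals the scaled negative Gaussian log likelihood plus a constant, that constant must not depend on the network output. The vectors and the value of `tau` below are arbitrary assumptions.

```python
import numpy as np

def euclidean_loss(y, f):
    return 0.5 * np.sum((y - f) ** 2)

def gaussian_log_lik(y, f, tau):
    """log N(y; f, tau^{-1} I)"""
    d = y.size
    return 0.5 * d * np.log(tau / (2 * np.pi)) - 0.5 * tau * np.sum((y - f) ** 2)

tau = 3.0
y = np.array([1.0, -2.0])
f_a = np.array([0.5, 0.0])    # two arbitrary network outputs
f_b = np.array([2.0, -1.0])

# const = E + (1/tau) log p should be identical for any output f.
const_a = euclidean_loss(y, f_a) + gaussian_log_lik(y, f_a, tau) / tau
const_b = euclidean_loss(y, f_b) + gaussian_log_lik(y, f_b, tau) / tau
print(const_a, const_b)
```

The constant depends only on `tau` and the output dimension, so it has no effect on the gradient w.r.t. the network parameters.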

Similarity to variational inference
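If $\omega$ is re-parameterized with a differentiable transformation $g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$ (a parameter-free distribution), $\mathcal{L}_{\mathrm{NN}}$ can be rewritten as follows:
$$-\frac{1}{M\tau} \sum_{i \in S} \log p(y_i \mid f^{g(\theta, \epsilon)}(x_i)) + \sum_{l=1}^{L} \lambda_l (\|W_l\|^2 + \|b_l\|^2)$$
This objective is remarkably similar to the MC objective for variational inference:
$$\hat{\mathcal{L}}_{\mathrm{MC}}(\theta) = -\frac{N}{M} \sum_{i \in S} \log p(y_i \mid f^{g(\theta, \epsilon)}(x_i)) + \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$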

Dropout as re-parameterization
As an example of re-parameterization, assume
$$W_l = \mathrm{diag}(\epsilon_l)\, M_l \quad \text{with} \quad \epsilon_l \sim p(\epsilon_l)$$
where $l \in \{1, ..., L\}$ is the layer index, $M_l$ a deterministic weight matrix, and $\epsilon_l$ a binary vector which follows $p(\epsilon_l)$, a product of Bernoulli distributions with probabilities $1 - p_l$.
In this case, the input feature $h_l$ to each layer is transformed as follows:
$$h_{l+1} = \sigma(h_l W_l + b_l) = \sigma(h_l\, \mathrm{diag}(\epsilon_l)\, M_l + b_l)$$
where $\sigma$ is a non-linearity.
Multiplying the input feature vector $h_l$ by a binary diagonal matrix $\mathrm{diag}(\epsilon_l)$ means dropping some features from $h_l$, which is the widely known dropout.
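The identity between noise on the weights and dropout on the features is easy to verify numerically. In this sketch the shapes, the tanh non-linearity and the dropout rate are arbitrary assumptions. A single mask is shared across the small batch for brevity, and no rescaling by $1/(1 - p_l)$ is applied, unlike many practical dropout implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                      # dropout rate p_l
h = rng.normal(size=(4, 6))       # batch of 4 inputs with 6 features each
M_l = rng.normal(size=(6, 3))     # deterministic weight matrix M_l
b_l = rng.normal(size=3)

# eps_l: binary vector, each entry is 1 with probability 1 - p_l.
eps = rng.binomial(1, 1.0 - p_drop, size=6)

# View 1: random weights W_l = diag(eps_l) M_l, whole rows of M_l zeroed out.
W_l = np.diag(eps) @ M_l
out_weights = np.tanh(h @ W_l + b_l)

# View 2: ordinary dropout, i.e. zeroing input features before the layer.
out_features = np.tanh((h * eps) @ M_l + b_l)

print(np.allclose(out_weights, out_features))
```

Both views compute the same output, which is why sampling dropout masks can be read as sampling weight matrices.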

Dropout as a variational distribution
Moreover, if $W_l$ is re-parameterized as $W_l = \mathrm{diag}(\epsilon_l)\, M_l$, each row $w_{l,k}$ of each weight matrix $W_l$ follows the distribution $q(w_{l,k})$:
$$q(w_{l,k}) = \int q(w_{l,k} \mid \epsilon_{l,k})\, p(\epsilon_{l,k})\, d\epsilon_{l,k}$$
$$q(w_{l,k} \mid \epsilon_{l,k}) = \delta(w_{l,k} - \epsilon_{l,k} m_{l,k})$$
$$\epsilon_{l,k} \sim p(\epsilon_{l,k})$$
where $p(\epsilon_{l,k})$ is a Bernoulli distribution with probability $1 - p_l$, $\delta(x)$ is 1 if $x = 0$ and 0 otherwise, $\epsilon_{l,k}$ is a binary scalar, and $m_{l,k}$ is the $k$th row vector of the weight matrix $M_l$.
Therefore, dropout can be seen as setting $q(\omega) = \prod_{l,k} q(w_{l,k})$ as a variational distribution.
The author refers to this distribution as the Bernoulli variational distribution or the dropout variational distribution.

Dropout objective
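Writing the re-parameterization with dropout as $\omega = g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$, the minimization objective becomes:
$$\hat{\mathcal{L}}_{\mathrm{dropout}}(\theta) = -\frac{1}{M\tau} \sum_{i \in S} \log p(y_i \mid f^{g(\theta, \hat{\epsilon}_i)}(x_i)) + \sum_{l=1}^{L} \lambda_l (\|M_l\|^2 + \|b_l\|^2)$$
with $\hat{\epsilon}_i$ realizations of the random variable $\epsilon$.
The derivative of this objective w.r.t. $\theta = \{M_l, b_l\}_{l=1}^{L}$ is:
$$\frac{\partial}{\partial \theta} \hat{\mathcal{L}}_{\mathrm{dropout}}(\theta) = -\frac{1}{M\tau} \sum_{i \in S} \frac{\partial}{\partial \theta} \log p(y_i \mid f^{g(\theta, \hat{\epsilon}_i)}(x_i)) + \frac{\partial}{\partial \theta} \sum_{l=1}^{L} \lambda_l (\|M_l\|^2 + \|b_l\|^2)$$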

Equality to variational inference
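It can be shown that if we set the prior $p(\omega)$ to
$$p(\omega) = \prod_{l} p(W_l) = \prod_{l} \mathcal{MN}(W_l; 0, I / l_l^2, I) \quad \text{with} \quad l_l^2 = \frac{2 N \tau \lambda_l}{1 - p_l}$$
where $\mathcal{MN}$ represents a matrix normal distribution, $l_l^2$ a prior length scale, $N$ the number of training samples, $\tau$ the model precision, $\lambda_l$ the rate of L2 regularization, and $p_l$ the rate of dropout, and set the approximating variational distribution $q_\theta(\omega)$ to the dropout variational distribution, we have
$$\frac{\partial}{\partial \theta} N\tau \sum_{l=1}^{L} \lambda_l (\|M_l\|^2 + \|b_l\|^2) \approx \frac{\partial}{\partial \theta} \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega))$$
In that case, the objective for dropout $\hat{\mathcal{L}}_{\mathrm{dropout}}(\theta)$ has the following relation with that for variational inference $\hat{\mathcal{L}}_{\mathrm{MC}}(\theta)$:
$$\frac{\partial}{\partial \theta} \hat{\mathcal{L}}_{\mathrm{dropout}}(\theta) = \frac{1}{N\tau} \frac{\partial}{\partial \theta} \hat{\mathcal{L}}_{\mathrm{MC}}(\theta)$$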

Dropout as a Bayesian approximation
We have come to the following conclusion:
Dropout as a Bayesian approximation
A neural network of arbitrary depth and with arbitrary non-linearities, with dropout applied before every weight layer and L2 regularization, is an approximation to a Bayesian neural network.

Estimation of predictive mean
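Given our approximate posterior $q^*_\theta(\omega)$, we can infer the distribution of an output $y^*$ for a new input point $x^*$:
$$\begin{aligned}
p(y^* \mid x^*, X, Y) &:= \int p(y^* \mid x^*, \omega)\, p(\omega \mid X, Y)\, d\omega \\
&\approx \int p(y^* \mid x^*, \omega)\, q^*_\theta(\omega)\, d\omega := q^*_\theta(y^* \mid x^*)
\end{aligned}$$
The predictive mean is estimated by performing $T$ stochastic forward passes through the network, each with weights $\hat{\omega}_t$ sampled from $q^*_\theta(\omega)$, and averaging the results.
Regression:
$$\mathbb{E}_{q^*_\theta(y^* \mid x^*)}[y^*] \approx \frac{1}{T} \sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)$$
Classification:
$$\mathbb{E}_{q^*_\theta(y^* \mid x^*)}[y^* = d] \approx \frac{1}{T} \sum_{t=1}^{T} \frac{\exp(f_d^{\hat{\omega}_t}(x^*))}{\sum_{d'} \exp(f_{d'}^{\hat{\omega}_t}(x^*))}$$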
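A minimal sketch of the stochastic forward passes for regression follows. The two-layer network, its random weights, the ReLU non-linearity and the helper names are all hypothetical. The weights here are untrained and only serve to show the mechanics of keeping dropout active at prediction time.

```python
import numpy as np

def stochastic_forward(x, params, p_drop, rng):
    """One forward pass f^{omega_t}(x) with a fresh dropout mask before each layer."""
    h = x
    for i, (M_l, b_l) in enumerate(params):
        eps = rng.binomial(1, 1.0 - p_drop, size=h.shape[-1])
        h = (h * eps) @ M_l + b_l
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)   # ReLU on hidden layers only
    return h

def mc_predictive_mean(x, params, p_drop, T, rng):
    """Average of T stochastic forward passes, plus the raw samples."""
    samples = np.stack([stochastic_forward(x, params, p_drop, rng) for _ in range(T)])
    return samples.mean(axis=0), samples

rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 16)), np.zeros(16)),   # hypothetical untrained weights
          (rng.normal(size=(16, 2)), np.zeros(2))]
x_star = np.array([0.2, -1.0, 0.5])

mean, samples = mc_predictive_mean(x_star, params, p_drop=0.1, T=100, rng=rng)
print(mean)
```

Returning the raw `samples` alongside the mean is a convenience, since the same passes can be reused for the uncertainty estimates.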

Estimation of uncertainty in regression
The predictive variance is estimated in regression as
$$\mathrm{Var}_{q^*_\theta(y^* \mid x^*)}[y^*] \approx \tau^{-1} I + \frac{1}{T} \sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)^{T} f^{\hat{\omega}_t}(x^*) - \left( \frac{1}{T} \sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*) \right)^{T} \left( \frac{1}{T} \sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*) \right)$$
where $y^*$ and $f^{\hat{\omega}_t}(x^*)$ are row vectors and
$$\tau = \frac{(1 - p_l)\, l_l^2}{2 N \lambda_l}$$
is found with grid-search over the hyperparameters ($\lambda_l$, $l_l$, and $p_l$) to minimize validation error.
The first term $\tau^{-1} I$ captures the aleatoric uncertainty and the following terms capture the epistemic uncertainty.
Note that the factor of 2 is removed (i.e., $\tau = \frac{(1 - p_l)\, l_l^2}{N \lambda_l}$) when we use mean-squared-error loss instead of Euclidean loss.
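The variance estimate can be computed directly from the outputs of the $T$ stochastic forward passes. This is a sketch with hypothetical function names; `samples` is assumed to be an array of shape `(T, D)` whose rows are the outputs $f^{\hat{\omega}_t}(x^*)$, and `model_precision` uses the Euclidean-loss version of the formula for $\tau$.

```python
import numpy as np

def model_precision(p_drop, length_scale, N, weight_decay):
    """tau = (1 - p) l^2 / (2 N lambda), the Euclidean-loss version."""
    return (1.0 - p_drop) * length_scale ** 2 / (2.0 * N * weight_decay)

def predictive_moments(samples, tau):
    """Predictive mean (D,) and covariance (D, D) from T stochastic outputs."""
    samples = np.asarray(samples, float)
    T, D = samples.shape
    mean = samples.mean(axis=0)
    second_moment = samples.T @ samples / T            # (1/T) sum_t f_t^T f_t
    epistemic = second_moment - np.outer(mean, mean)   # spread across passes
    aleatoric = np.eye(D) / tau                        # constant noise tau^{-1} I
    return mean, aleatoric + epistemic

# Two passes with outputs 1 and 3 and tau = 2:
# variance = 1/2 + (1 + 9)/2 - 2^2 = 1.5
mean, var = predictive_moments([[1.0], [3.0]], tau=2.0)
```

Keeping the two terms separate inside the function makes it easy to report the epistemic and aleatoric parts individually if needed.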

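Estimation of uncertainty in classification
The mutual information is estimated in classification as
$$\begin{aligned}
I[y^*, \omega \mid x^*, X, Y] &= H[y^* \mid x^*, X, Y] - \mathbb{E}_{p(\omega \mid X, Y)}\big[ H[y^* \mid x^*, \omega] \big] \\
&\approx -\sum_{d} \left( \frac{1}{T} \sum_{t} p_d^{\hat{\omega}_t} \right) \log \left( \frac{1}{T} \sum_{t} p_d^{\hat{\omega}_t} \right) - \frac{1}{T} \sum_{t} \left( -\sum_{d} p_d^{\hat{\omega}_t} \log p_d^{\hat{\omega}_t} \right)
\end{aligned}$$
where
$$p_d^{\hat{\omega}_t} = \frac{\exp(f_d^{\hat{\omega}_t}(x^*))}{\sum_{d'} \exp(f_{d'}^{\hat{\omega}_t}(x^*))}.$$
This captures the epistemic uncertainty.
In the context of active learning, this quantity is called BALD (Bayesian active learning by disagreement).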
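The estimator is the entropy of the averaged prediction minus the average entropy of the individual predictions. Below is a small sketch with hypothetical names; `probs` is assumed to have shape `(T, D)`, with row `t` holding the softmax outputs of the `t`-th stochastic forward pass, and a small constant `tiny` is added inside the logarithm to avoid `log(0)`.

```python
import numpy as np

def entropy(p, axis=-1, tiny=1e-12):
    return -np.sum(p * np.log(p + tiny), axis=axis)

def mutual_information(probs):
    """Entropy of the mean prediction minus the mean entropy over T passes."""
    probs = np.asarray(probs, float)
    predictive_entropy = entropy(probs.mean(axis=0))
    expected_entropy = entropy(probs).mean()
    return predictive_entropy - expected_entropy

# Passes that agree on an uncertain prediction: no epistemic uncertainty.
agree = np.array([[0.5, 0.5], [0.5, 0.5]])
# Passes that are individually confident but contradict each other.
disagree = np.array([[1.0, 0.0], [0.0, 1.0]])

print(mutual_information(agree), mutual_information(disagree))
```

Both examples have the same averaged prediction (0.5, 0.5), yet only the second has high mutual information. This is the sense in which the quantity measures disagreement between the sampled models rather than noise in the labels.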

Heteroscedastic aleatoric uncertainty
For regression, we have assumed homoscedastic aleatoric uncertainty, which means that the observation noise is constant for every input point $x$. This can be seen in the definition of our likelihood $p(y \mid x, \omega) = \mathcal{N}(y; f^{\omega}(x), \tau^{-1} I)$, where the observation noise $\tau^{-1}$ is constant.
However, we can also assume heteroscedastic aleatoric uncertainty, which means that the observation noise can vary with the input point $x$. This simply involves making $\tau$ into a function of the data: $p(y \mid x, \omega) = \mathcal{N}(y; f^{\omega}(x), g^{\omega}(x)^{-1})$.

Estimation of heteroscedastic aleatoric uncertainty
With this likelihood, the loss function becomes
$$\begin{aligned}
E^{\omega}(x, y) &:= -\log \mathcal{N}(y; f^{\omega}(x), g^{\omega}(x)^{-1}) \\
&= \frac{1}{2} (y - f^{\omega}(x))\, g^{\omega}(x)\, (y - f^{\omega}(x))^{T} - \frac{1}{2} \log \det g^{\omega}(x) + \text{const.}
\end{aligned}$$
In practice, it is convenient to assume a diagonal precision matrix for $g^{\omega}(x)$, in which case the log determinant reduces to a sum of the logs over each element of $g^{\omega}(x)$.
We split the top layers of a network into two parts, one estimating the predictive mean $f^{\omega}(x)$ and the other the model precision $g^{\omega}(x)$.
We only need to estimate the diagonal elements of $g^{\omega}(x)$.
For numerical stability, we regress the log precisions and exponentiate them instead of directly regressing precisions.
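With a diagonal precision matrix and a log-precision output, the loss reduces to a few lines. This is a sketch under those two assumptions; the function name is hypothetical, and `mean` and `log_precision` stand for the outputs of the two network heads.

```python
import numpy as np

def heteroscedastic_loss(y, mean, log_precision):
    """Negative Gaussian log likelihood, up to a constant, with diagonal precision.

    `log_precision` is the raw output of the precision head, so the precision
    exp(log_precision) is positive by construction.
    """
    y, mean, s = (np.asarray(a, float) for a in (y, mean, log_precision))
    precision = np.exp(s)
    # 0.5 * (y - f) g (y - f)^T  -  0.5 * log det g, with log det g = sum(s)
    return 0.5 * np.sum(precision * (y - mean) ** 2 - s, axis=-1)

# With zero log precision (precision 1) this is the plain Euclidean loss.
print(heteroscedastic_loss([1.0], [0.0], [0.0]))
```

The two terms pull in opposite directions: a low predicted precision discounts large residuals, while the log term penalizes predicting low precision everywhere.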

Further readings
Aleatoric uncertainty in classification
Kendall A & Gal Y. 2017. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? https://arxiv.org/abs/1703.04977
Kendall A, Gal Y, & Cipolla R. 2018. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. https://arxiv.org/abs/1705.07115
Application to convolutional neural networks
Gal Y & Ghahramani Z. 2015. Bayesian Convolutional Neural
Networks with Bernoulli Approximate Variational Inference.
Application to recurrent neural networks
Gal Y & Ghahramani Z. 2015. A Theoretically Grounded
Application of Dropout in Recurrent Neural Networks.
