Uncertainty in Deep Learning
Introduction to PhD dissertation by Yarin Gal
Yujiro Katagiri
ALBERT Inc.
2018/07/19
Why do we need uncertainties?
Uncertainty information is important for the safe application of autonomous systems:
medical diagnosis
autonomous driving
high-frequency trading
...
There are also various applications which can make use of uncertainty information:
active learning
reinforcement learning
...
What uncertainties do we need?
There are two types of uncertainties:
Epistemic uncertainty
from the Greek episteme, meaning "knowledge".
captures our ignorance about the model.
is reduced as the amount of observed data increases.
Aleatoric uncertainty
from the Latin aleator, meaning "dice player".
captures noise inherent in the environment.
cannot be reduced even if more data were available.
Combining epistemic and aleatoric uncertainties
gives us predictive uncertainty.
How do we estimate uncertainties?
Most deep learning models used in practice
do not offer uncertainty information.
Bayesian neural networks give us such
information, but are often not practical.
The author presents a very simple technique
for implementing Bayesian deep learning.
Principle of Bayesian modelling
Given training inputs X and outputs Y, we would like to find
the parameters ω of a function y = fω(x) that are likely to
have generated our outputs.
Following Bayes’ theorem:
$$p(\omega \mid X, Y) = \frac{p(Y \mid X, \omega)\, p(\omega)}{p(Y \mid X)} = \frac{p(Y \mid X, \omega)\, p(\omega)}{\int p(Y \mid X, \omega)\, p(\omega)\, d\omega}$$
If we specify a prior p(ω) and a likelihood p(y|x, ω), we can in principle infer the posterior p(ω|X, Y).
Prior and likelihood
For the prior, we often assume a Gaussian distribution.
$$p(\omega) = \mathcal{N}(\omega;\, 0,\, I)$$
For the likelihood, we usually specify a Gaussian likelihood for regression and a softmax likelihood for classification.
$$p(y \mid x, \omega) = \mathcal{N}\big(y;\, f^{\omega}(x),\, \tau^{-1} I\big)$$
$$p(y = d \mid x, \omega) = \frac{\exp\big(f_d^{\omega}(x)\big)}{\sum_{d'} \exp\big(f_{d'}^{\omega}(x)\big)}$$
where τ is the model precision and d is the observed class.
Intractability of marginalization
An integration is required for the calculation of p(Y|X):
$$p(Y \mid X) = \int p(Y \mid X, \omega)\, p(\omega)\, d\omega$$
This integration, called "marginalization" (because it marginalizes the likelihood over ω), is usually intractable.
Therefore, we would like to approximate the posterior
without performing the marginalization directly.
Various approximations exist, and one of them is the "variational inference" technique we will see next.
Variational inference
We define an approximating distribution qθ(ω),
parametrized by θ (variational parameters).
To approximate the posterior, we minimize the
Kullback-Leibler (KL) divergence between them w.r.t. θ.
$$\mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega \mid X, Y)\big)$$
which is equivalent to maximizing the expected log likelihood
minus KL between qθ(ω) and the prior p(ω).
$$\int \log p(Y \mid X, \omega)\, q_\theta(\omega)\, d\omega \;-\; \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big)$$
Maximizing the first term encourages qθ(ω) to explain the
data well, while minimizing the second term encourages qθ(ω)
to be as close as possible to the prior.
Objective and difficulties
The maximization objective can be rewritten as the following
minimization objective.
$$\begin{aligned} \mathcal{L}_{\mathrm{VI}}(\theta) &= -\int \log p(Y \mid X, \omega)\, q_\theta(\omega)\, d\omega + \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big) \\ &= -\sum_{i=1}^{N} \int \log p\big(y_i \mid f^{\omega}(x_i)\big)\, q_\theta(\omega)\, d\omega + \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big) \end{aligned}$$
The first term raises two difficulties:
1. The summation over the entire dataset is computationally expensive.
2. The expected log likelihood $\int \log p\big(y_i \mid f^{\omega}(x_i)\big)\, q_\theta(\omega)\, d\omega$ is usually intractable.
Solution to the first problem
1st problem: The summation over the entire dataset is
computationally expensive.
Solution: Data sub-sampling (also referred to as mini-batch
optimization).
$$\hat{\mathcal{L}}_{\mathrm{VI}}(\theta) = -\frac{N}{M} \sum_{i \in S} \int \log p\big(y_i \mid f^{\omega}(x_i)\big)\, q_\theta(\omega)\, d\omega + \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big)$$
with a random index set S of size M.
It forms an unbiased stochastic estimator of $\mathcal{L}_{\mathrm{VI}}(\theta)$, meaning that $\mathbb{E}_S\big[\hat{\mathcal{L}}_{\mathrm{VI}}(\theta)\big] = \mathcal{L}_{\mathrm{VI}}(\theta)$.
Solution to the second problem
2nd problem: The expected log likelihood $\int \log p\big(y_i \mid f^{\omega}(x_i)\big)\, q_\theta(\omega)\, d\omega$ is usually intractable.
Solution: MC estimation with the re-parameterization trick.
If ω is re-parameterized with a differentiable transformation $\omega = g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$ (a parameter-free distribution),
$$\begin{aligned} \hat{\mathcal{L}}_{\mathrm{VI}}(\theta) &= -\frac{N}{M} \sum_{i \in S} \int \log p\big(y_i \mid f^{\omega}(x_i)\big)\, q_\theta(\omega)\, d\omega + \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big) \\ &= -\frac{N}{M} \sum_{i \in S} \int \log p\big(y_i \mid f^{g(\theta, \epsilon)}(x_i)\big)\, p(\epsilon)\, d\epsilon + \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big) \end{aligned}$$
$\hat{\mathcal{L}}_{\mathrm{VI}}(\theta)$ can be estimated with a new MC estimator
$$\hat{\mathcal{L}}_{\mathrm{MC}}(\theta) = -\frac{N}{M} \sum_{i \in S} \log p\big(y_i \mid f^{g(\theta, \epsilon)}(x_i)\big) + \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big)$$
where $\epsilon \sim p(\epsilon)$, meaning that $\mathbb{E}_{\epsilon}\big[\hat{\mathcal{L}}_{\mathrm{MC}}(\theta)\big] = \hat{\mathcal{L}}_{\mathrm{VI}}(\theta)$.
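As an illustration (not part of the slides), the following minimal numpy sketch applies the re-parameterization trick to a mean-field Gaussian $q_\theta(w)$ over the single weight of a toy 1-D linear model; the data, the variational parameters mu and log_sigma, and the precision tau are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data y = 2x + noise with Gaussian noise of precision tau (assumed).
tau = 10.0
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(scale=tau ** -0.5, size=x.shape)

# Variational parameters theta = (mu, log_sigma) of q_theta(w) = N(w; mu, sigma^2).
mu, log_sigma = 0.5, np.log(0.3)

def log_lik(w):
    """Sum of log p(y_i | f^w(x_i)) under the Gaussian likelihood N(y; w*x, tau^{-1})."""
    return np.sum(-0.5 * tau * (y - w * x) ** 2 + 0.5 * np.log(tau / (2.0 * np.pi)))

# Re-parameterization: w = g(theta, eps) = mu + sigma * eps with eps ~ p(eps) = N(0, 1),
# so the expectation over q_theta(w) becomes an expectation over the parameter-free
# p(eps), and a single sample already gives an unbiased MC estimate of it.
eps = rng.standard_normal()
w_sample = mu + np.exp(log_sigma) * eps

mc_estimate = log_lik(w_sample)  # MC estimate of the expected log likelihood term
print(mc_estimate)               # L_MC would add the analytic KL(q || p) term to -this
```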
Optimization
Therefore we follow Algorithm 1 for the optimization of θ.
Algorithm 1
Given the dataset {X, Y}, define a learning rate schedule η and initialize the parameters θ randomly.
repeat
Sample M random variables $\hat{\epsilon}_i \sim p(\epsilon)$.
Sample a random index set S of size M.
Calculate $\frac{\partial}{\partial\theta}\hat{\mathcal{L}}_{\mathrm{MC}}(\theta)$:
$$-\frac{N}{M}\sum_{i \in S} \frac{\partial}{\partial\theta} \log p\big(y_i \mid f^{g(\theta, \hat{\epsilon}_i)}(x_i)\big) + \frac{\partial}{\partial\theta}\mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big)$$
Update θ (gradient descent on the minimization objective):
$$\theta \leftarrow \theta - \eta\, \frac{\partial}{\partial\theta}\hat{\mathcal{L}}_{\mathrm{MC}}(\theta)$$
until θ has converged.
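Below is a hedged numpy sketch of Algorithm 1 for the same kind of toy model: a mean-field Gaussian $q_\theta(w) = \mathcal{N}(\mu, \sigma^2)$ over a single weight, with the KL to a $\mathcal{N}(0, 1)$ prior and its gradients written out analytically. The data, hyper-parameters, and closed-form gradients are assumptions made for the example, not material from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data with Gaussian noise of precision tau (illustrative values).
N, M, tau, eta = 200, 20, 10.0, 1e-4
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=tau ** -0.5, size=N)

# Variational parameters theta = (mu, log_sigma) of q_theta(w) = N(w; mu, sigma^2).
mu, log_sigma = 0.0, 0.0

for step in range(5000):
    # Sample a random index set S of size M and eps ~ p(eps) = N(0, 1).
    S = rng.choice(N, size=M, replace=False)
    eps = rng.standard_normal()
    sigma = np.exp(log_sigma)
    w = mu + sigma * eps                               # w = g(theta, eps)

    # d/dw of the mini-batch log likelihood sum_{i in S} log p(y_i | w x_i).
    dloglik_dw = np.sum(tau * (y[S] - w * x[S]) * x[S])

    # Gradients of L_MC(theta) = -(N/M) sum_S log p(y_i | w x_i) + KL(q || N(0, 1)),
    # using KL = 0.5 (mu^2 + sigma^2 - 1) - log(sigma) for the Gaussian case.
    grad_mu = -(N / M) * dloglik_dw + mu
    grad_log_sigma = -(N / M) * dloglik_dw * eps * sigma + (sigma ** 2 - 1.0)

    # Gradient-descent update on the minimization objective.
    mu -= eta * grad_mu
    log_sigma -= eta * grad_log_sigma

print("approximate posterior over w: mean", mu, "std", np.exp(log_sigma))
```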
Objective of neural networks
Let $\omega = \{W_l, b_l\}_{l=1}^{L}$ be a set of neural network parameters, where L is the number of layers.
The objective of neural networks can be written as follows:
$$\mathcal{L}_{\mathrm{NN}}(\omega) := \frac{1}{M}\sum_{i \in S} E^{\omega}(x_i, y_i) + \sum_{l=1}^{L} \lambda_l \big(\|W_l\|^2 + \|b_l\|^2\big)$$
with a random index set S of size M, $E^{\omega}$ the loss, and the second term representing the L2 regularization.
Loss as likelihood
The loss can be rewritten with the negative log likelihood:
Euclidean loss can be rewritten with a Gaussian likelihood:
$$E^{\omega}(x, y) = \frac{1}{2}\big\|y - f^{\omega}(x)\big\|^2 = -\frac{1}{\tau}\log p\big(y \mid f^{\omega}(x)\big) + \text{const.}$$
where τ is the model precision.
Cross-entropy loss can be rewritten with a softmax likelihood:
$$E^{\omega}(x, y) = -\log \frac{\exp\big(f_d^{\omega}(x)\big)}{\sum_{d'}\exp\big(f_{d'}^{\omega}(x)\big)} = -\log p\big(y \mid f^{\omega}(x)\big)$$
where d is the observed class.
Thus, $\mathcal{L}_{\mathrm{NN}}$ can be rewritten as follows:
$$\mathcal{L}_{\mathrm{NN}}(\omega) = -\frac{1}{M\tau}\sum_{i \in S}\log p\big(y_i \mid f^{\omega}(x_i)\big) + \sum_{l=1}^{L}\lambda_l\big(\|W_l\|^2 + \|b_l\|^2\big)$$
where τ = 1 in classification.
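As a quick sanity check of the first identity (an illustrative sketch, not from the slides; the output dimension and precision are assumed), the following verifies numerically that $-\frac{1}{\tau}\log \mathcal{N}(y; f, \tau^{-1} I)$ and the Euclidean loss $\frac{1}{2}\|y - f\|^2$ differ only by a constant that does not depend on y or f:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D, tau = 3, 4.0  # output dimension and model precision (assumed)

def euclidean_loss(y, f):
    return 0.5 * np.sum((y - f) ** 2)

def scaled_neg_log_lik(y, f):
    # -(1/tau) * log N(y; f, tau^{-1} I)
    return -multivariate_normal.logpdf(y, mean=f, cov=np.eye(D) / tau) / tau

# The gap between the two should be the same additive constant for every (y, f).
gaps = [scaled_neg_log_lik(y, f) - euclidean_loss(y, f)
        for y, f in (rng.normal(size=(2, D)) for _ in range(3))]
print(gaps)  # three (nearly) identical numbers
```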
Similarity to variational inference
If ω is re-parameterized with a differentiable transformation $\omega = g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$ (a parameter-free distribution), $\mathcal{L}_{\mathrm{NN}}$ can be rewritten as follows:
$$-\frac{1}{M\tau}\sum_{i \in S}\log p\big(y_i \mid f^{g(\theta, \epsilon)}(x_i)\big) + \sum_{l=1}^{L}\lambda_l\big(\|W_l\|^2 + \|b_l\|^2\big)$$
This objective is remarkably similar to the MC objective for variational inference:
$$\hat{\mathcal{L}}_{\mathrm{MC}}(\theta) = -\frac{N}{M}\sum_{i \in S}\log p\big(y_i \mid f^{g(\theta, \epsilon)}(x_i)\big) + \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big)$$
Dropout as re-parameterization
As an example of re-parameterization, assume
$$W_l = \mathrm{diag}(\epsilon_l)\, M_l \quad \text{with} \quad \epsilon_l \sim p(\epsilon_l)$$
where l = 1, ..., L is the layer index, $M_l$ a deterministic weight matrix, and $\epsilon_l$ a binary vector which follows $p(\epsilon_l)$, a product of Bernoulli distributions with probabilities $1 - p_l$.
In this case, the input feature $h_l$ to each layer is transformed as follows:
$$h_{l+1} = \sigma(h_l W_l + b_l) = \sigma\big(h_l\, \mathrm{diag}(\epsilon_l)\, M_l + b_l\big)$$
where σ is a non-linearity.
Multiplying the input feature vector $h_l$ by the binary diagonal matrix $\mathrm{diag}(\epsilon_l)$ means dropping some features from $h_l$, which is the widely known dropout.
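A minimal numpy sketch of this transformation (illustrative shapes, a ReLU non-linearity, and a dropout rate $p_l = 0.5$ are assumed); it also checks that multiplying by $\mathrm{diag}(\epsilon_l)$ is the same as zeroing features of $h_l$, i.e. standard dropout:

```python
import numpy as np

rng = np.random.default_rng(0)
k_in, k_out, p_l = 5, 4, 0.5          # layer widths and dropout rate (assumed)

h = rng.normal(size=(1, k_in))        # input features h_l (a row vector)
M = rng.normal(size=(k_in, k_out))    # deterministic weight matrix M_l
b = np.zeros(k_out)                   # bias b_l

# eps_l: each unit is kept with probability 1 - p_l (product of Bernoullis).
eps = rng.binomial(1, 1.0 - p_l, size=k_in)

relu = lambda a: np.maximum(a, 0.0)   # the non-linearity sigma

# Random-weight view: W_l = diag(eps_l) M_l.
h_next_weight_view = relu(h @ (np.diag(eps) @ M) + b)

# Standard dropout view: drop features of h_l, then apply the deterministic layer.
h_next_dropout_view = relu((h * eps) @ M + b)

print(np.allclose(h_next_weight_view, h_next_dropout_view))  # True
```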
Dropout as a variational distribution
Moreover, if $W_l$ is re-parameterized as $W_l = \mathrm{diag}(\epsilon_l)\, M_l$, each row $w_{l,k}$ of each weight matrix $W_l$ follows the distribution $q(w_{l,k})$:
$$q(w_{l,k}) = \int q(w_{l,k} \mid \epsilon_{l,k})\, p(\epsilon_{l,k})\, d\epsilon_{l,k}$$
$$q(w_{l,k} \mid \epsilon_{l,k}) = \delta(w_{l,k} - \epsilon_{l,k}\, m_{l,k}), \qquad \epsilon_{l,k} \sim p(\epsilon_{l,k})$$
where $p(\epsilon_{l,k})$ is a Bernoulli distribution with probability $1 - p_l$, δ is the Dirac delta (a point mass at 0), $\epsilon_{l,k}$ is a binary scalar, and $m_{l,k}$ is the kth row of the weight matrix $M_l$.
Therefore, dropout can be seen as setting $q(\omega) = \prod_{l,k} q(w_{l,k})$ as the variational distribution.
The author refers to this distribution as the Bernoulli variational distribution or dropout variational distribution.
Dropout objective
Writing the re-parameterization with dropout as $\omega = g(\theta, \epsilon)$ with $\epsilon \sim p(\epsilon)$, the minimization objective becomes:
$$\hat{\mathcal{L}}_{\mathrm{dropout}}(\theta) = -\frac{1}{M\tau}\sum_{i \in S}\log p\big(y_i \mid f^{g(\theta, \hat{\epsilon}_i)}(x_i)\big) + \sum_{l=1}^{L}\lambda_l\big(\|M_l\|^2 + \|b_l\|^2\big)$$
with $\hat{\epsilon}_i$ realizations of the random variable ε.
The derivative of this objective w.r.t. $\theta = \{M_l, b_l\}_{l=1}^{L}$ is:
$$\frac{\partial}{\partial\theta}\hat{\mathcal{L}}_{\mathrm{dropout}}(\theta) = -\frac{1}{M\tau}\sum_{i \in S}\frac{\partial}{\partial\theta}\log p\big(y_i \mid f^{g(\theta, \hat{\epsilon}_i)}(x_i)\big) + \frac{\partial}{\partial\theta}\sum_{l=1}^{L}\lambda_l\big(\|M_l\|^2 + \|b_l\|^2\big)$$
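In practice this is just ordinary dropout training with weight decay. Below is a hedged PyTorch sketch (the architecture, data, and hyper-parameters are illustrative assumptions, not taken from the slides): dropout before every weight layer gives the stochastic forward pass $f^{g(\theta, \hat{\epsilon})}$, the mean Euclidean loss gives the data term, and the optimizer's weight_decay supplies the $\sum_l \lambda_l(\|M_l\|^2 + \|b_l\|^2)$ term.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p_drop, weight_decay = 0.5, 1e-4   # dropout and L2 rates (assumed)

# Dropout applied before every weight layer, as the objective above assumes.
model = nn.Sequential(
    nn.Dropout(p_drop), nn.Linear(10, 50), nn.ReLU(),
    nn.Dropout(p_drop), nn.Linear(50, 50), nn.ReLU(),
    nn.Dropout(p_drop), nn.Linear(50, 1),
)

# weight_decay implements the L2 term over theta = {M_l, b_l}.
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=weight_decay)

x, y = torch.randn(32, 10), torch.randn(32, 1)   # toy mini-batch of size M = 32

for _ in range(100):
    opt.zero_grad()
    # (1/M) sum_i E(x_i, y_i) with the Euclidean loss, i.e. the Gaussian
    # negative log likelihood term of the dropout objective up to scaling.
    loss = 0.5 * ((model(x) - y) ** 2).sum(dim=1).mean()
    loss.backward()
    opt.step()
```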
Equality to variational inference
It can be shown that if we set the prior p(ω) to
$$p(\omega) = \prod_{l} p(W_l) = \prod_{l} \mathcal{MN}\big(W_l;\, 0,\, I/l_l^2,\, I\big) \quad \text{with} \quad l_l^2 = \frac{2N\tau\lambda_l}{1 - p_l}$$
where $\mathcal{MN}$ represents a matrix normal distribution, $l_l^2$ a prior length scale, N the number of training samples, τ the model precision, $\lambda_l$ the rate of L2 regularization, and $p_l$ the rate of dropout, and set the approximating variational distribution $q_\theta(\omega)$ to the dropout variational distribution, we have
$$\frac{\partial}{\partial\theta}\, N\tau \sum_{l=1}^{L}\lambda_l\big(\|M_l\|^2 + \|b_l\|^2\big) \approx \frac{\partial}{\partial\theta}\,\mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big)$$
In that case, the objective for dropout $\hat{\mathcal{L}}_{\mathrm{dropout}}(\theta)$ has the following relation to that for variational inference $\hat{\mathcal{L}}_{\mathrm{MC}}(\theta)$:
$$\frac{\partial}{\partial\theta}\hat{\mathcal{L}}_{\mathrm{dropout}}(\theta) = \frac{1}{N\tau}\,\frac{\partial}{\partial\theta}\hat{\mathcal{L}}_{\mathrm{MC}}(\theta)$$
Dropout as a Bayesian approximation
We have come to the following conclusion:
Dropout as a Bayesian approximation
A neural network of arbitrary depth and non-linearities, with dropout applied before every weight layer and with L2 regularization, is an approximation to a Bayesian neural network.
Estimation of predictive mean
Given our approximate posterior $q_\theta^*(\omega)$, we can infer the distribution of an output y* for a new input point x*:
$$p(y^* \mid x^*, X, Y) := \int p(y^* \mid x^*, \omega)\, p(\omega \mid X, Y)\, d\omega \approx \int p(y^* \mid x^*, \omega)\, q_\theta^*(\omega)\, d\omega =: q_\theta^*(y^* \mid x^*)$$
The predictive mean is estimated by performing T stochastic forward passes through the network and averaging the results.
Regression:
$$\mathbb{E}_{q_\theta^*(y^* \mid x^*)}[y^*] \approx \frac{1}{T}\sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)$$
Classification:
$$\mathbb{E}_{q_\theta^*(y^* \mid x^*)}[y^* = d] \approx \frac{1}{T}\sum_{t=1}^{T} \frac{\exp\big(f_d^{\hat{\omega}_t}(x^*)\big)}{\sum_{d'}\exp\big(f_{d'}^{\hat{\omega}_t}(x^*)\big)}$$
Estimation of uncertainty in regression
The predictive variance is estimated in regression as
$$\mathrm{Var}_{q_\theta^*(y^* \mid x^*)}[y^*] \approx \tau^{-1} I + \frac{1}{T}\sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)^{T} f^{\hat{\omega}_t}(x^*) - \Big(\frac{1}{T}\sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)\Big)^{T}\Big(\frac{1}{T}\sum_{t=1}^{T} f^{\hat{\omega}_t}(x^*)\Big)$$
where y* and $f^{\hat{\omega}_t}(x^*)$ are row vectors and $\tau = \frac{(1 - p_l)\, l_l^2}{2N\lambda_l}$ is found with a grid search over the hyper-parameters ($\lambda_l$, $l_l$, and $p_l$) to minimize the validation error.
The first term $\tau^{-1}$ captures the aleatoric uncertainty and the following terms capture the epistemic uncertainty.
Note that the factor of 2 is removed (i.e., $\tau = \frac{(1 - p_l)\, l_l^2}{N\lambda_l}$) when we use the mean-squared-error loss instead of the Euclidean loss.
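Continuing the previous sketch, the predictive variance adds $\tau^{-1}$ to the sample second moment minus the squared sample mean of the T passes; the network and the value of τ (in practice found by grid search) are assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Dropout(0.5), nn.Linear(10, 50), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(50, 1),
)
model.train()                                 # keep dropout active

x_star = torch.randn(1, 10)
T, tau = 100, 10.0                            # tau assumed found by grid search

with torch.no_grad():
    samples = torch.stack([model(x_star) for _ in range(T)]).squeeze(1)  # (T, 1)

mean = samples.mean(dim=0)
# tau^{-1} is the aleatoric part; the MC sample variance is the epistemic part.
var = 1.0 / tau + (samples ** 2).mean(dim=0) - mean ** 2
print(mean, var)
```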
Estimation of uncertainty in classification
The mutual information is estimated in classification as
$$\mathbb{I}[y^*, \omega \mid x^*, X, Y] = \mathbb{H}[y^* \mid x^*, X, Y] - \mathbb{E}_{p(\omega \mid X, Y)}\big[\mathbb{H}[y^* \mid x^*, \omega]\big]$$
$$\approx -\sum_{d}\Big(\frac{1}{T}\sum_{t} p_d^{\hat{\omega}_t}\Big)\log\Big(\frac{1}{T}\sum_{t} p_d^{\hat{\omega}_t}\Big) - \frac{1}{T}\sum_{t}\Big(-\sum_{d} p_d^{\hat{\omega}_t}\log p_d^{\hat{\omega}_t}\Big)$$
where
$$p_d^{\hat{\omega}_t} = \frac{\exp\big(f_d^{\hat{\omega}_t}(x^*)\big)}{\sum_{d'}\exp\big(f_{d'}^{\hat{\omega}_t}(x^*)\big)}.$$
This captures the epistemic uncertainty.
In the context of active learning, this quantity is called BALD (Bayesian active learning by disagreement).
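A numpy sketch of the BALD estimate from T softmax samples; the random logits below stand in for the outputs of the T stochastic forward passes and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 100, 4                                  # MC samples and number of classes

# Stand-in for the T stochastic softmax outputs p^{w_t}_d at a point x*.
logits = rng.normal(size=(T, D))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def entropy(p, axis=-1):
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

p_mean = probs.mean(axis=0)                       # (1/T) sum_t p^{w_t}_d
predictive_entropy = entropy(p_mean)              # H[y* | x*, X, Y]
expected_entropy = entropy(probs, axis=1).mean()  # E_w H[y* | x*, w]

bald = predictive_entropy - expected_entropy      # mutual information I[y*, w | x*, X, Y]
print(bald)
```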
Heteroscedastic aleatoric uncertainty
For regression, we have assumed homoscedastic aleatoric uncertainty, which means that the observation noise is constant for every input point x. This can be seen in the definition of our likelihood $p(y \mid x, \omega) = \mathcal{N}\big(y;\, f^{\omega}(x),\, \tau^{-1} I\big)$, where the observation noise $\tau^{-1}$ is constant.
However, we can also assume heteroscedastic aleatoric uncertainty, which means that the observation noise can vary with the input point x. This simply involves making τ a function of the data: $p(y \mid x, \omega) = \mathcal{N}\big(y;\, f^{\omega}(x),\, g^{\omega}(x)^{-1}\big)$.
Estimation of heteroscedastic aleatoric uncertainty
With this likelihood, the loss function becomes
$$E^{\omega}(x, y) := -\log \mathcal{N}\big(y;\, f^{\omega}(x),\, g^{\omega}(x)^{-1}\big) = \frac{1}{2}\big(y - f^{\omega}(x)\big)\, g^{\omega}(x)\, \big(y - f^{\omega}(x)\big)^{T} - \frac{1}{2}\log\det g^{\omega}(x) + \text{const.}$$
In practice, it is convenient to assume a diagonal precision matrix for $g^{\omega}(x)$, in which case the log determinant reduces to a sum of the logs of the diagonal elements of $g^{\omega}(x)$.
We split the top layers of the network into two parts, one estimating the predictive mean $f^{\omega}(x)$ and the other the model precision $g^{\omega}(x)$. We only need to estimate the diagonal elements of $g^{\omega}(x)$.
For numerical stability, we regress the log precisions and exponentiate them instead of regressing the precisions directly.
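A hedged PyTorch sketch of this setup (the architecture and data are illustrative assumptions): the network's top layers split into a mean head and a diagonal log-precision head, and the loss is the heteroscedastic negative log likelihood above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class HeteroscedasticNet(nn.Module):
    """Illustrative network whose top layers split into a mean head f(x) and a
    diagonal log-precision head log g(x)."""
    def __init__(self, d_in=10, d_hidden=50, d_out=1):
        super().__init__()
        self.body = nn.Sequential(nn.Dropout(0.5), nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mean_head = nn.Linear(d_hidden, d_out)
        self.log_prec_head = nn.Linear(d_hidden, d_out)  # regress log precisions for stability

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.log_prec_head(h)

def heteroscedastic_nll(y, mean, log_prec):
    # 0.5 (y - f) g (y - f)^T - 0.5 log det g + const., with diagonal g, so the
    # log determinant is the sum of the per-element log precisions.
    return (0.5 * torch.exp(log_prec) * (y - mean) ** 2 - 0.5 * log_prec).sum(dim=1).mean()

net = HeteroscedasticNet()
x, y = torch.randn(32, 10), torch.randn(32, 1)
mean, log_prec = net(x)
loss = heteroscedastic_nll(y, mean, log_prec)
loss.backward()
print(loss.item())
```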
Further reading
Aleatoric uncertainty in classification
Kendall A & Gal Y. 2017. What Uncertainties Do We Need in
Bayesian Deep Learning for Computer Vision?
https://arxiv.org/abs/1703.04977
Kendall A, Gal Y, & Cipolla R. 2018. Multi-Task Learning
Using Uncertainty to Weigh Losses for Scene Geometry and
Semantics. https://arxiv.org/abs/1705.07115.
Application to convolutional neural networks
Gal Y & Ghahramani Z. 2015. Bayesian Convolutional Neural
Networks with Bernoulli Approximate Variational Inference.
https://arxiv.org/abs/1506.02158
Application to recurrent neural networks
Gal Y & Ghahramani Z. 2015. A Theoretically Grounded
Application of Dropout in Recurrent Neural Networks.
https://arxiv.org/abs/1512.05287