Uncertainty Estimation in Deep Learning

Uncertainty in Deep Learning - Christian S. Perone (2019)
Uncertainty Estimation
in Deep Learning
A brief introduction
Christian S. Perone
christian.perone@gmail.com
http://blog.christianperone.com

Agenda
Uncertainties
Knowing what you don’t know
The problem
Different Uncertainties
Importance of Uncertainty
Bayesian Inference
The frequentist way
Bayesian inference
MCMC Sampling
Deep Learning
Short intro
Bayesian Neural Networks
Variational Inference
Introduction
Posterior Approximation
Training a BNN
Dropout
Ensembles
Introduction
Deep Ensembles
Randomized Prior Functions
Final Remarks
Q&A

Who Am I
Christian S. Perone
BSc in Computer Science in Brazil (UPF),
MSc in Biomedical Eng. in Montreal
(Polytechnique/UdeM)
Machine Learning / Data Science
Working at Jungle
Blog at
blog.christianperone.com
Open-source projects
https://github.com/perone
Twitter @tarantulae

Section I
Uncertainties

It is correct, somebody might say, that
(...) Socrates did not know anything; and
it was indeed wisdom that they
recognized their own lack of knowledge,
(...).
—Karl R. Popper, The World of Parmenides

What does this have to do with statistical learning?

The problem
Let’s say you trained a model to classify an image as having a lesion or
not;
Different MRI contrasts (T2/T1). Source: http://www.msdiscovery.org. 2019.

Later, you run predictions on volumes with different acquisition parameters,
anatomy, etc;

The problem: you can still have a prediction with high probability,
even if your sample is out-of-distribution.
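
As a minimal sketch of this effect, assume a hypothetical one-dimensional logistic classifier whose weights `w` and `b` were already learned (the values below are made up for illustration). The predicted probability saturates for an input far away from the training data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights of an already-trained 1D logistic classifier.
w, b = 2.0, 0.0

x_in = 1.0    # input similar to the training data
x_ood = 50.0  # input far away from anything seen during training

p_in = sigmoid(w * x_in + b)
p_ood = sigmoid(w * x_ood + b)
print(f"in-distribution p = {p_in:.3f}, out-of-distribution p = {p_ood:.3f}")
```

The model is more "confident" on the out-of-distribution input than on the in-distribution one, which is why the predicted probability alone is a poor uncertainty measure.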

The problem
A simple regression problem.
Source: Yarin Gal. Uncertainty in Deep Learning. PhD Thesis. 2016.

The problem
A simple regression problem.
Source: Ian Osband et al. Using Randomized Prior Functions for Deep Reinforcement Learning. NIPS
2018. Image from: http://blog.christianperone.com

Two main types of uncertainty, often confused by practitioners, but very
different quantities:

Aleatoric Uncertainty
Noise inherent in the observations that the data cannot explain, also called data uncertainty or irreducible
uncertainty. Collecting more data might not reduce it;
Ex: it can be reduced by increasing measurement precision, not by collecting more samples.

Epistemic Uncertainty
Uncertainty in the model itself, also called model uncertainty, or reducible
uncertainty;
Ex: it can be explained away by increasing the training set size.
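
A small simulation can make the difference concrete. This is only a sketch under assumed values (a linear model with true slope 2 and noise standard deviation 0.5, both chosen for illustration): the spread of the fitted slope across datasets stands in for epistemic uncertainty, and the residual noise level stands in for aleatoric uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)
true_slope, noise_std = 2.0, 0.5  # noise_std plays the role of aleatoric noise

def fit_slope(n):
    """Simulate n noisy points and fit a slope by least squares (no intercept)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = true_slope * x + rng.normal(0.0, noise_std, size=n)
    slope = np.sum(x * y) / np.sum(x * x)
    residual_std = np.std(y - slope * x)
    return slope, residual_std

results = {}
for n in (20, 2000):
    fits = np.array([fit_slope(n) for _ in range(200)])
    results[n] = {
        "epistemic": fits[:, 0].std(),   # spread of the fitted slope across datasets
        "aleatoric": fits[:, 1].mean(),  # average residual noise level
    }
    print(n, results[n])
```

With more data the spread of the fitted slope shrinks, while the residual noise stays close to 0.5 regardless of the dataset size.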

Medical imaging (classification, segmentation);
Autonomous vehicles (how certain is it that this object is a tree?);
Active Learning (which sample should be labeled?);
Explore/exploit dilemma in reinforcement learning;
Out-of-distribution detection;
Model understanding/dataset understanding;
Nearly all applications!

Example in Reinforcement Learning
The explore/exploit dilemma:

Work by Maxime Wabartha et al.:
estimated by taking, for each approach, the pointwise average and standard deviation over 50 sampled
functions. We expect the empirical posterior predictive distribution to cover the ground truth function.
While we succeed to do so using a MSE loss and the proposed approach, we do not manage to obtain
diverse functions using solely anchoring neither using dropout; in our experiments, changing the
dropout rate did not improve the quality of the obtained uncertainty. Input bootstrapping does produce
functions that better span the width of outputs, but it also disregards by nature certain points of the
training set, where we expect the uncertainty to be low given our current knowledge. We also provide
in the appendix an example of the functions generated by our function approach when fixing X.
[Figure panels: Dropout 0.2, Input bootstrapping, Anchoring, Repulsive. Legend: ground truth, sample function, standard deviations, training set, reference function.]
Figure 1: Comparison of the empirical (over 20 sample functions) posterior predictive distribution
for dropout, input bootstrapping, anchoring and repulsive constraint.
3.2 Diverse functions in high-dimensional input space
We apply the method to function approximation in the case of a reinforcement learning problem
requiring exploration. More precisely, we showcase how our method can help sample diverse reward
functions in a model-based setting. We create a dataset of 43 13x13 frames with the associated reward.
We use as function approximator a small CNN outputing a reward for a given frame (see appendix).
To illustrate our method, we sample the repulsive points from possible frames, thus directly from the
manifold, in or out of the training distribution (see appendix). Figure 2 (rightmost figure) shows how
Source: Maxime Wabartha et al. Sampling diverse neural networks for exploration in reinforcement
learning. NIPS 2018.

Section II
Bayesian Inference

A simple frequentist regression
In a frequentist linear regression, we have a point estimate for the
parameters of our model.
For a maximum likelihood derivation, take a look at
http://blog.christianperone.com/2019/01/mle/.

First, we define our model:
$$f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots = \mathbf{x}^\top \boldsymbol{\theta} \quad \text{(vectorial notation, with } x_0 = 1\text{)}$$
Next, we define a loss such as the MSE (mean squared error):
$$L = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2$$
Finally, we optimize it:
$$\hat{\theta} = \arg\min_{\theta} L(f(x), y)$$
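
The three steps above (model, MSE loss, optimization) can be sketched with NumPy. The synthetic data, with intercept 1, slope 2 and noise standard deviation 0.5, is an assumption made only for this example:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0.0, 1.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=n)  # synthetic data (assumed)

# Design matrix with a column of ones so that theta_hat[0] is the intercept.
X = np.column_stack([np.ones(n), x])

# theta_hat = argmin_theta mean((X @ theta - y)^2), solved in closed form.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.mean((X @ theta_hat - y) ** 2)
print("theta_hat:", theta_hat, "MSE:", mse)
```

The result is a single point estimate `theta_hat`; nothing in it says how much the parameters could vary given the data.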

[Figure: Frequentist regression. Axes: x, y. Legend: sample data, regression line.]

The Bayesian way
Bayesian approaches represent the uncertainty using a distribution over
parameters. Instead of a point estimate, we have an entire posterior.
To formulate our Bayesian regression, we first select a likelihood;
After that, we select priors over the parameters;
Then we compute or approximate (by sampling) the posterior given our
model and data.

Prior, likelihood and posterior
[Figure: three panels (Prior, Data, Posterior), each showing credibility over the values 1, 2 and 3.]

Posterior
$$\underbrace{p(\theta \mid X)}_{\text{posterior}} \propto \underbrace{p(X \mid \theta)}_{\text{likelihood}} \; \underbrace{\pi(\theta)}_{\text{prior}}$$

[Figure: three examples of prior × likelihood ∝ posterior, each curve plotted over the interval from 0 to 1.]
Source: Statistical Rethinking/Winter 2019. Richard McElreath.
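
The product "prior × likelihood ∝ posterior" can be computed directly on a grid of parameter values. The sketch below assumes a flat prior and hypothetical binomial data (6 successes in 9 trials); neither comes from the slides:

```python
import numpy as np
from math import comb

# Grid of candidate parameter values (a proportion between 0 and 1).
theta = np.linspace(0.0, 1.0, 101)

prior = np.ones_like(theta)                    # flat prior (assumed)
k, n = 6, 9                                    # hypothetical data: 6 successes in 9 trials
likelihood = comb(n, k) * theta**k * (1 - theta) ** (n - k)

unnormalized = likelihood * prior              # posterior is proportional to likelihood x prior
posterior = unnormalized / unnormalized.sum()  # normalize over the grid

theta_map = theta[np.argmax(posterior)]
print("posterior mode:", theta_map)
```

Grid approximation only works for a handful of parameters, which is one reason sampling and variational methods are needed for larger models.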

Bayesian regression
Let’s reformulate our regression:
We will use a simple Gaussian distribution for our observations,
defined as:
$$Y \sim \mathcal{N}(\mu, \sigma^2)$$
We plug our linear model in as the mean µ:
$$Y \sim \mathcal{N}(\underbrace{\alpha + \beta x}_{\text{linear model}}, \sigma^2)$$
And define the priors:
$$\alpha \sim \mathcal{N}(0, 20), \quad \beta \sim \mathcal{N}(0, 20), \quad \sigma \sim \mathcal{U}(0, 5)$$

Bayesian Regression in Plate Notation
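You can represent the same model below with plate notation:
$$Y \sim \mathcal{N}(\alpha + \beta x, \sigma^2), \quad \alpha \sim \mathcal{N}(0, 20), \quad \beta \sim \mathcal{N}(0, 20), \quad \sigma \sim \mathcal{U}(0, 5)$$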

MCMC Sampling
Let’s see a demo of a Markov chain Monte Carlo (MCMC) sampler:
Source: MCMC Demos, by Chi Feng

MCMC Sampling
[Figure: posterior histograms (frequency) and trace plots (sample value over 4000 samples) for the parameters Intercept, x and sigma.]
Trace plot generated using PyMC3, you can also use ArviZ.
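
For readers without PyMC3 at hand, the same kind of samples can be produced with a hand-written random-walk Metropolis sampler. This is a rough sketch, not the sampler used for the plots: the synthetic data, the proposal step sizes, and the reading of N(0, 20) as a standard deviation of 20 are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): intercept 1, slope 2, noise std 0.5.
x = rng.uniform(0.0, 1.0, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=100)

def log_posterior(params):
    """Unnormalized log posterior of (alpha, beta, sigma)."""
    alpha, beta, sigma = params
    if not 0.0 < sigma < 5.0:            # sigma ~ U(0, 5)
        return -np.inf
    # alpha, beta ~ N(0, 20), reading 20 as the standard deviation (assumption).
    log_prior = -0.5 * (alpha / 20.0) ** 2 - 0.5 * (beta / 20.0) ** 2
    resid = y - (alpha + beta * x)
    log_lik = -len(y) * np.log(sigma) - 0.5 * np.sum(resid**2) / sigma**2
    return log_prior + log_lik

n_steps = 20000
step_size = np.array([0.05, 0.08, 0.02])  # per-parameter proposal std (hand-picked)
current = np.array([0.0, 0.0, 1.0])
current_lp = log_posterior(current)
samples = np.empty((n_steps, 3))
for i in range(n_steps):
    proposal = current + rng.normal(0.0, step_size)  # symmetric random-walk proposal
    proposal_lp = log_posterior(proposal)
    # Accept with probability min(1, p(proposal) / p(current)).
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        current, current_lp = proposal, proposal_lp
    samples[i] = current

posterior = samples[n_steps // 2:]  # discard the first half as burn-in
post_mean = posterior.mean(axis=0)
print("posterior mean (alpha, beta, sigma):", post_mean)
print("posterior std  (alpha, beta, sigma):", posterior.std(axis=0))
```

Plotting each column of `samples` against the iteration index gives a trace plot, and a histogram of each column of `posterior` gives the marginal posterior.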

Bayesian regression
[Figure: Posterior predictive regression lines. Axes: x, y. Legend: sample data, posterior predictive regression lines.]

Bayesian methods
Bayesian methods can give us a full posterior to reason about;
Explicit priors;
Uncertainty;
They’re on the side of algorithms, not models [1];
However,
Intractable posterior for many practical cases and large datasets;
$$p(\theta \mid X) = \frac{p(X \mid \theta)\,\pi(\theta)}{p(X)}$$
Tuning and using MCMC algorithms can be tricky.
[1] Zoubin Ghahramani, History of Bayesian Neural Networks, NIPS 2016

Section III
Deep Learning

Deep Learning
It’s not a secret that Deep Learning reached an important milestone in
Machine Learning:
Non-linear function approximators;
They can scale to large datasets (thanks to stochastic approximation);
They are state-of-the-art for NLP, computer vision, speech, etc;
Very expressive and flexible;
Representation learning;

One-slide Intro to Deep Learning
[Figure: multi-layer perceptron with an input layer (x_0, x_1, ..., x_D), L hidden layers (units y^(l)_0, ..., y^(l)_{m^(l)} for l = 1, ..., L) and an output layer (y^(L+1)_1, ..., y^(L+1)_C).]
A multi-layer perceptron (MLP) network overview. Source: David Stutz, 2018, BSD 3-Clause License.
Parametrized models with composition of functions;
Trained using backpropagation and SGD;
Usually learned by maximizing the log-likelihood;

A Bayesian Neural Network (BNN) is a Neural Network with
distributions over parameters [2].
Source: Weight Uncertainty in Neural Networks. Charles Blundell et al. 2015.
[2] Neal, Radford M. (2012). Bayesian learning for neural networks.

In modern Deep Neural Networks, however, we have some challenges:
A lot of data;
High-dimensionality in data;
Millions of parameters;
Highly non-convex surfaces;
This makes exact Bayesian inference very difficult for these models, therefore an
approximation is required (variational Bayes).

Section IV
Variational Inference

Variational Inference (VI) is often used as an alternative to MCMC;
Can be used to approximate the posterior of Bayesian models;
Faster than MCMC for complex models and larger datasets;
Shift from sampling to optimization;
Fewer guarantees than MCMC: it only finds a density close to the target;
For a modern in-depth review, please refer to: Variational Inference: A Review for
Statisticians. Blei, D. M. et al (2018).

We have a very complex posterior distribution p(w | D) that we
want to approximate (w are the parameters, and D is the data);

We do this approximation by using an "easier" distribution q(w | θ)
(also called the variational distribution, where θ are the variational
parameters);

Variational approximation (green). Source: Eric Jang, 2016. https://blog.evjang.com

Posterior approximation
If we want to approximate p(w | D) with q(w | θ), we need a
measure of "closeness";

We use Kullback-Leibler (KL) divergence:
Source: Flawnson Tong, https://towardsdatascience.com

$$\theta^* = \arg\min_{\theta} \mathrm{KL}\left[q(w \mid \theta) \,\|\, p(w \mid D)\right]$$
$$\theta^* = \arg\min_{\theta} \mathbb{E}_{q(w \mid \theta)}\Big[\underbrace{\log q(w \mid \theta)}_{\text{variational posterior}} - \underbrace{\log p(w)}_{\text{prior}} - \underbrace{\log p(D \mid w)}_{\text{log likelihood}}\Big]$$
(the term log p(D) does not depend on θ, so it is dropped from the minimization)
Why KL-divergence?
Because it allows us to derive a cost that is tractable to optimize.
Not without paying a price though.
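
The step from the KL divergence to the three-term cost is worth spelling out. Using Bayes' rule, p(w | D) = p(D | w) p(w) / p(D), the derivation reads:

$$
\begin{aligned}
\mathrm{KL}\left[q(w \mid \theta) \,\|\, p(w \mid D)\right]
&= \int q(w \mid \theta) \log \frac{q(w \mid \theta)}{p(w \mid D)} \, dw \\
&= \mathbb{E}_{q(w \mid \theta)}\left[\log q(w \mid \theta) - \log p(w \mid D)\right] \\
&= \mathbb{E}_{q(w \mid \theta)}\left[\log q(w \mid \theta) - \log p(w) - \log p(D \mid w)\right] + \log p(D)
\end{aligned}
$$

The evidence log p(D) is the intractable part, but it is constant in θ, so minimizing the remaining expectation gives the same θ∗. That expectation only needs the prior and the likelihood, both of which can be evaluated.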

Forward and Reverse KL
Forms of the KL-divergence. Source: Pattern Recognition and Machine Learning. Christopher M.
Bishop. 2006. (a) forward KL-divergence, (b) and (c) reverse KL-divergence.

Forward KL
Source: Colin Raffel, https://colinraffel.com

Forward KL (misspecification)

Reverse KL
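
The difference between the two directions can be reproduced numerically. The sketch below fits a single Gaussian q to a bimodal target p by brute-force search over a grid of means and standard deviations; the target (an equal mixture of N(-4, 1) and N(4, 1)) and the search grid are assumptions chosen for illustration.

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 6001)
dx = x[1] - x[0]

def log_normal(x, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

# Bimodal target p(x): equal mixture of N(-4, 1) and N(4, 1) (assumed example).
log_p = np.logaddexp(log_normal(x, -4.0, 1.0), log_normal(x, 4.0, 1.0)) + np.log(0.5)
p = np.exp(log_p)

best = {"forward": (np.inf, None, None), "reverse": (np.inf, None, None)}
for mean in np.linspace(-6.0, 6.0, 61):
    for std in np.linspace(0.5, 6.0, 56):
        log_q = log_normal(x, mean, std)
        q = np.exp(log_q)
        forward = np.sum(p * (log_p - log_q)) * dx  # KL[p || q], numerical integral
        reverse = np.sum(q * (log_q - log_p)) * dx  # KL[q || p], numerical integral
        if forward < best["forward"][0]:
            best["forward"] = (forward, mean, std)
        if reverse < best["reverse"][0]:
            best["reverse"] = (reverse, mean, std)

_, mean_fwd, std_fwd = best["forward"]
_, mean_rev, std_rev = best["reverse"]
print(f"forward KL fit: mean={mean_fwd:.2f}, std={std_fwd:.2f}")
print(f"reverse KL fit: mean={mean_rev:.2f}, std={std_rev:.2f}")
```

The forward KL fit spreads over both modes (mass-covering), while the reverse KL fit, which is the direction used in variational inference, locks onto a single mode with a much smaller variance (mode-seeking).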

Quality of the uncertainty estimation
Mean-field variational Bayes (MFVB) approximation. Source: Variational Bayes and beyond: Bayesian inference for big data.
Tamara Broderick. ICML 2018.
Can underestimate variance severely;
When compared to MCMC, means are usually fine, but the variance can be
far off;

Training a Bayesian Neural Network
The training loop for a Bayesian Neural Network (BNN) using
Variational Inference is shown below:
Sample the parameters of the network from q(w | θ). There are two
variational parameters for each weight in q: µ and σ;
Parametrize the network with the sampled parameters, often using
the reparametrization trick;
Forward pass with the data batch;
Calculate the combined loss: variational posterior, prior and log
likelihood;
Compute gradients by backpropagation and optimize with SGD;
Repeat;
Prediction: multiple forward passes.
This method is also called Bayes by Backprop.
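
The loop above can be sketched on the smallest possible "network": a linear model with a single weight, so that the gradients can be written by hand instead of relying on an autodiff framework. Everything specific here is an assumption for illustration: the synthetic data, the N(0, 1) prior, the known noise level, the softplus parametrization of σ and the learning rate. For this conjugate model the exact posterior is available, which gives a reference to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y = 1.5 * x + Gaussian noise with known std.
noise_std = 0.5
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.5 * x + rng.normal(0.0, noise_std, size=50)

def softplus(rho):
    return np.log1p(np.exp(rho))

# Variational parameters of q(w | theta) = N(mu, sigma^2), with sigma = softplus(rho).
mu, rho = 0.0, -3.0
lr, n_steps, n_mc = 0.005, 4000, 10

for _ in range(n_steps):
    sigma = softplus(rho)
    eps = rng.normal(size=n_mc)
    w = mu + sigma * eps                          # reparametrization trick
    # Per-sample loss: log q(w | theta) - log p(w) - log p(D | w), with prior p(w) = N(0, 1).
    # d(-log p(w))/dw = w and d(-log p(D | w))/dw = -sum_i x_i (y_i - w x_i) / noise_std^2.
    resid = y[None, :] - w[:, None] * x[None, :]  # forward pass, one row per sampled w
    grad_w = w - (resid * x[None, :]).sum(axis=1) / noise_std**2
    grad_mu = grad_w.mean()                       # dw/dmu = 1
    # log q contributes -1/sigma (from its -log sigma term); dw/dsigma = eps.
    grad_sigma = (grad_w * eps).mean() - 1.0 / sigma
    grad_rho = grad_sigma / (1.0 + np.exp(-rho))  # d softplus(rho)/d rho = sigmoid(rho)
    mu -= lr * grad_mu                            # plain SGD step
    rho -= lr * grad_rho

# Exact posterior of this conjugate model, used only as a reference.
precision = 1.0 + np.sum(x**2) / noise_std**2
exact_mean = np.sum(x * y) / noise_std**2 / precision
exact_std = precision**-0.5
print(f"VI:    mean={mu:.3f}, std={softplus(rho):.3f}")
print(f"exact: mean={exact_mean:.3f}, std={exact_std:.3f}")
```

In a real BNN the hand-written gradients would be replaced by backpropagation through the network, and prediction would average several forward passes, each with freshly sampled weights.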

HMC vs VI. Source: Bayesian Inference with Anchored Ensembles of Neural Networks, and Application
to Exploration in Reinforcement Learning. Tim Pearce. 2018.
For more information about the variational approach, please refer to: Weight
Uncertainty in Neural Networks. C. Blundell, et al. 2015.

Dropout as a Bayesian Approximation
Dropout. Source: Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Nitish
Srivastava, et al. 2014.

In 2015, the work Dropout as a Bayesian Approximation: Insights and
Applications, by Yarin Gal et al., found a relationship between
Dropout and Bayesian approximation;
It turns out that to do approximate variational inference with a Bernoulli
variational distribution in Bayesian NNs, you can just add dropout during
training and keep it active during prediction time as well;
Quite appealing due to its simplicity, and it also provides an
interesting interpretation of dropout;
This technique is called "MC Dropout" or "Monte Carlo Dropout".
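
A minimal NumPy sketch of the prediction side of MC Dropout follows. The network weights are random stand-ins for a trained model, and the layer sizes, dropout rate and number of passes T are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights of a small, already-trained MLP (random here for illustration).
W1, b1 = rng.normal(size=(1, 64)), rng.normal(size=64)
W2, b2 = rng.normal(size=(64, 1)) / 8.0, np.zeros(1)

def forward(x, drop_rate=0.5, use_dropout=True):
    """One forward pass; dropout stays active at prediction time for MC Dropout."""
    h = np.maximum(0.0, x @ W1 + b1)             # ReLU hidden layer
    if use_dropout:
        mask = rng.random(h.shape) >= drop_rate  # Bernoulli keep mask
        h = h * mask / (1.0 - drop_rate)         # inverted dropout scaling
    return h @ W2 + b2

x_test = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)

# T stochastic forward passes; their spread is the uncertainty estimate.
T = 200
preds = np.stack([forward(x_test) for _ in range(T)])  # shape (T, 5, 1)
pred_mean = preds.mean(axis=0)
pred_std = preds.std(axis=0)
print(np.hstack([x_test, pred_mean, pred_std]))
```

The only change compared with a standard network is that the dropout mask is resampled on every prediction pass instead of being disabled.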

MC Dropout on a Regression Setting
Some results from the MC Dropout on a regression setting:
MC Dropout. Source: Dropout as a Bayesian Approximation: Insights and Applications. Yarin Gal et al.
ICML 2015.

MC Dropout on a Classification Setting
Some results from the MC Dropout on a classification setting:
MC Dropout. Source: Dropout as a Bayesian Approximation: Insights and Applications. Yarin Gal et al.
ICML 2015.

Criticism of MC Dropout
Some results from the MC Dropout on a regression setting:
MC Dropout with varying number of data points. Gray regions are 1 std. dev. above and below the mean. Source:
Randomized Prior Functions for Deep Reinforcement Learning. Ian Osband et al. 2018.
It was shown that MC Dropout fails a simple sanity check in a linear
setting: its uncertainty does not concentrate as more data is observed.

Section V
Ensembles

Ensembles
Uses multiple hypotheses to learn a better one;
We can see dropout as an ensemble, but with shared weights;
The ensemble variance can be interpreted as uncertainty;
Simple intuition why it works.
[Figure: the input data is fed to Model #1, Model #2 and Model #3, and their predictions are combined.]

Deep Ensembles
In the work Simple and Scalable Predictive Uncertainty Estimation using
Deep Ensembles (Lakshminarayanan B., et al. NIPS 2017), the authors proposed a
very simple method to compute uncertainty with ensembles:
Setting
You have M models, with independent parameters θ1, θ2, ..., θM.
1) Initialize parameters θ1, θ2, ..., θM randomly;
2) Train each network m ∈ {1, ..., M} with weights θm individually;
3) Optionally add adversarial training;
4) Combine the predictions by averaging the prediction from each network:
$$p(y \mid x) = M^{-1} \sum_{m=1}^{M} p_{\theta_m}(y \mid x, \theta_m)$$
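
The combination step is just an average of the member predictions. A small sketch for a classification setting, using made-up logits from M = 3 hypothetical networks:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits from M = 3 independently trained networks (one input, 3 classes).
member_logits = np.array([
    [2.0, 0.5, 0.1],
    [1.5, 1.4, 0.2],
    [0.3, 2.2, 0.1],
])
member_probs = softmax(member_logits)       # p_theta_m(y | x) for each member m
ensemble_probs = member_probs.mean(axis=0)  # p(y | x) = (1 / M) * sum over members
print("ensemble prediction:", ensemble_probs)
```

When the members disagree, as in this example, the averaged distribution is flatter than the individual ones, which is what the entropy plots measure.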

Evaluating Entropy on Classification
Plot of the binary entropy function H(p). A measure of the uncertainty.
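
For reference, the binary entropy function can be written in a few lines. This sketch measures entropy in bits (log base 2) and clips the input away from 0 and 1 to avoid log(0); both are implementation choices, not taken from the slides.

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log2(p) - (1 - p) log2(1 - p), in bits; p is clipped to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0 - 1e-12)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

for p in (0.01, 0.5, 0.99):
    print(p, binary_entropy(p))
```

The entropy peaks at p = 0.5, where the prediction is least certain, and approaches zero as p approaches 0 or 1.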

[Figure: two histograms over entropy values, titled "Known classes" and "Unknown classes", each with legend entries 1 to 5.]
ImageNet trained only on dogs. Histogram of the predictive entropy on test examples from known classes
(dogs) and unknown classes (non-dogs) with varying ensemble size. Source: Lakshminarayanan B., et al.
NIPS 2017.

[Figure: two rows of four histograms over entropy values, with panels Ensemble, Ensemble + R, Ensemble + AT and MC dropout, each with legend entries 1, 5 and 10.]
Histogram of the predictive entropy on test examples from known classes from SVHN (top row) and
unknown classes from CIFAR-10 (bottom row). Source: Lakshminarayanan B., et al. NIPS 2017.

Randomized Priors
In Randomized Prior Functions for Deep Reinforcement Learning. Ian
Osband et al. 2018:
Very simple and elegant modification of the ensemble method for
uncertainty;
Developed in the Reinforcement Learning context;
Overcomes the issue of injecting a prior into ensemble-based
approaches to uncertainty;
For the case of a linear Gaussian model, it is equivalent to exact
Bayesian inference.

Bootstrap
[Figure: from the population, Sample #1, Sample #2 and Sample #3 are drawn; a statistic is computed on each (q1, q2, q3); together they form the bootstrap statistic distribution.]
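
The bootstrap procedure in the diagram can be sketched as follows, using the sample mean as the statistic. The data is synthetic and the number of resamples is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # hypothetical observed sample

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(data, size=data.size, replace=True)  # sample with replacement
    boot_means[b] = resample.mean()                            # statistic of this resample

low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {data.mean():.3f}, 95% bootstrap interval = [{low:.3f}, {high:.3f}]")
```

The spread of `boot_means` approximates the sampling variability of the statistic without any distributional assumption on the data.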

The key insight is to add a randomized (but fixed) prior and
bootstrapped data:
for k = 1, ..., K do:
    Initialize θ_k ∼ random;
    Form D_k with bootstrap;
    Sample prior function p_k ∼ P;
    Optimize L(f_θ + λ p_k; D_k);
return posterior ensemble {f_θk + p_k} for k = 1, ..., K
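
The algorithm can be sketched with models that are linear in a set of fixed features, so that the "Optimize" step has a closed form. This is an illustration under several assumptions that are not from the paper or the slides: RBF features, a squared loss, a small ridge term for numerical stability, a prior scale of 1, and synthetic data that only covers the interval [-1, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

centers = np.linspace(-6.0, 6.0, 25)

def features(x):
    """RBF features; both the trainable and the prior functions are linear in them."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.5) ** 2)

# Training data only covers [-1, 1].
x_train = rng.uniform(-1.0, 1.0, size=30)
y_train = np.sin(3.0 * x_train) + rng.normal(0.0, 0.1, size=30)

K, ridge = 20, 1e-3
x_test = np.array([0.0, 5.0])  # inside vs. far outside the training range
preds = np.empty((K, x_test.size))
for k in range(K):
    idx = rng.integers(0, x_train.size, size=x_train.size)  # form D_k with bootstrap
    xk, yk = x_train[idx], y_train[idx]
    prior_w = rng.normal(size=centers.size)                  # sample prior function p_k (kept fixed)
    Phi = features(xk)
    # Optimize the squared loss of f_theta + p_k on D_k; for a linear f_theta this is
    # a ridge regression of the residual y - p_k(x).
    target = yk - Phi @ prior_w
    theta = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(centers.size), Phi.T @ target)
    preds[k] = features(x_test) @ (theta + prior_w)          # member prediction f_theta_k + p_k

std_inside, std_outside = preds.std(axis=0)
print(f"ensemble std at x=0: {std_inside:.3f}, at x=5: {std_outside:.3f}")
```

Near the data the members agree because the trainable part cancels the prior, while far from the data each member falls back to its own prior function, so the ensemble spread grows.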

Qualitative Inspection
Some pathological cases:
Posterior predictive distributions for 1D regression with a (20, 20)-MLP and ReLUs. Source:

“(...) If an agent has only ever observed zero reward, then no amount of
bootstrapping or ensembling will cause it to simulate positive rewards.
(...)”
– Randomized Prior Functions for Deep Reinforcement Learning. Ian

[Figures: three plots titled "Predictive Uncertainty", "Posterior Samples" and "Prior Samples".]

Final Remarks
Many methods, no standardized evaluation, no ground truth for
model uncertainty;
Performance (CPU/GPU resources) penalty for basically all
methods;
No scalable solution for MCMC (yet);
Choice depends on application;
Always take into consideration the trade-off of guarantees;
Significant evolution of methods, frameworks and hardware.

Learning More - I
Statistical Rethinking (excellent book and course), by Richard
McElreath.
https://xcelab.net/rm/statistical-rethinking/
Variational Inference: A Review, by David M. Blei, et al.
https://arxiv.org/abs/1601.00670
Scalable Bayesian Inference, by David Dunson. NIPS 2018 Talk.
https://www.youtube.com/watch?v=0HXpnG_WnlI
Variational Bayes and Beyond, by Tamara Broderick. ICML 2018
Tutorial.
https://www.youtube.com/watch?v=Moo4-KR5qNg
History of Bayesian Neural Networks, by Zoubin Ghahramani.
NIPS 2016 Keynote talk.
https://www.youtube.com/watch?v=FD8l2vPU5FY

Learning More - II
Uncertainty in Deep Learning, Slides, by Roberto Silveira.
http://tiny.cc/c77n9y
A Beginner’s Guide to Variational Methods, by Eric Jang.
https://blog.evjang.com/2016/08/variational-bayes.html
Uncertainty in Deep Learning, Thesis, by Yarin Gal.
http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf
PyMC3, Framework, by PyMC3 developers.
https://docs.pymc.io/
Pyro, Framework, by Pyro developers.
http://pyro.ai/
Tensorflow Probability, Framework, by TensorFlow developers.
https://www.tensorflow.org/probability

Section VI
Q&A

Q&A
Hope you liked it! Questions?
