Weight	Uncertainty	in	Neural	Networks
04/11/2015
MASAHIRO	SUZUKI
Paper	Information
Title	:	Weight	Uncertainty	in	Neural	Networks	(ICML	2015)
Authors : Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra
◦ Google	DeepMind
They	proposed	Bayes	by	Backprop
Motivation
◦ I'd like to know how to treat model "uncertainty" in deep learning approaches.
◦ I like the Bayesian approach.
Overfitting and	Uncertainty
Overfitting
◦ Plain	feedforward neural	networks	are	prone	to	overfitting.
Uncertainty
◦ NNs are often incapable of correctly assessing the uncertainty in the training data.
→Overly	confident	decisions
[Figure from Gal et al. 2015: a sketch of softmax input f(x) (panel a) and softmax output σ(f(x)) (panel b) for an idealised binary classification problem. Training data lie between the dashed grey lines; the function point estimate is shown with a solid line and the function uncertainty with a shaded area. A point x* far from the training data is marked with a dashed red line: ignoring function uncertainty, x* is classified as class 1 with probability 1.]
◦ Without uncertainty: class 1 with probability 1
◦ With uncertainty: better reflects classification uncertainty
[Gal et al. 2015]
How	to	prevent	overfitting
Various	regularization	schemes	have	been	proposed.
◦ Early	stopping
◦ Weight	decay
◦ Dropout
This	paper	addresses	this	problem	by	using	variational Bayesian	learning	to	
introduce	uncertainty	in	the	weights	of	the	network.
↓
Bayes	by	Backprop
Contribution
They	proposed	Bayes	by	Backprop
◦ A	simple	approximate	learning	algorithm	similar	to	backpropagation.
◦ All weights are represented by probability distributions over possible values.
It	achieves	good	results	in	several	domains.
◦ Classification
◦ Regression
◦ Bandit	problem	
[Figure 1 from the paper: Left: each weight has a fixed value, as provided by classical backpropagation. Right: each weight is assigned a distribution, as provided by Bayes by Backprop.]
Classical backpropagation (left) vs. Bayes by Backprop (right)
Related	Works
Variational approximation	
◦ [Graves	2011]
→ This work shows how the gradients of such a variational approximation can be made unbiased, and how the method can be used with non-Gaussian priors.
Uncertainty	in	the	hidden	units	
◦ [Kingma and	Welling,	2014]	[Rezende et	al.,	2014]	[Gregor et	al.,	2014]
◦ Variational autoencoder
→ The number of weights in a neural network is easily two orders of magnitude larger than the number of stochastic hidden units, making the optimisation problem much larger in scale.
Contextual	bandit	problems	using	Thompson	sampling	
◦ [Thompson,	1933]	[Chapelle and	Li,	2011]	[Agrawal	and	Goyal,	2012]	[May	et	al.,	
2012]
→ Weights	with	greater	uncertainty	introduce	more	variability	into	the	decisions	
made	by	the	network,	leading	naturally	to	exploration.
Point	Estimates	of	NN
Neural network : $p(y|x, w)$
◦ Input : $x \in \mathbb{R}^p$
◦ Output : $y \in \mathcal{Y}$
◦ The set of parameters : $w$
◦ Cross-entropy (categorical distribution), squared loss (Gaussian distribution)
Learning from $D = \{(x_i, y_i)\}_i$
◦ MLE:
$$w^{\mathrm{MLE}} = \arg\max_w \log p(D|w) = \arg\max_w \sum_i \log p(y_i|x_i, w)$$
◦ MAP:
$$w^{\mathrm{MAP}} = \arg\max_w \log p(w|D) = \arg\max_w \big[\log p(D|w) + \log p(w)\big]$$
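As a brief aside (a standard result, not from the slides): with a zero-mean Gaussian prior on the weights, the MAP objective is exactly maximum likelihood plus an L2 penalty, i.e. weight decay; a Laplace prior gives an L1 penalty instead.
$$p(w) = \mathcal{N}(w \mid 0, \sigma^2 I) \;\;\Rightarrow\;\; w^{\mathrm{MAP}} = \arg\max_w \; \log p(D|w) - \frac{1}{2\sigma^2}\lVert w\rVert_2^2 + \text{const}$$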
Being	Bayesian
The predictive distribution : $p(\hat{y}|\hat{x})$
◦ an unknown label : $\hat{y}$
◦ a test data item : $\hat{x}$
$$p(\hat{y}|\hat{x}) = \int p(\hat{y}|\hat{x}, w)\, p(w|D)\, dw = \mathbb{E}_{p(w|D)}\big[p(\hat{y}|\hat{x}, w)\big]$$
Taking this expectation is equivalent to averaging the predictions of an ensemble of an uncountably infinite number of NNs
↓
Intractable
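A minimal sketch (not from the paper's code) of how this expectation is approximated in practice: average the network's predictions over a few weight samples drawn from an approximate posterior. Here `sample_weights` and `predict` are assumed placeholder functions.

```python
# Monte Carlo approximation of p(y|x) = E_{p(w|D)}[ p(y|x, w) ]:
# average predictions over a handful of sampled weight settings.
def mc_predictive(x, sample_weights, predict, num_samples=10):
    probs = [predict(x, sample_weights()) for _ in range(num_samples)]
    return sum(probs) / num_samples
```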
Variational Learning
The variational approximation to the Bayesian posterior distribution on the weights : $q(w|\theta)$
◦ parameters : $\theta$
The posterior distribution given the training data : $p(w|D)$
Find the parameters $\theta$ that minimise the KL divergence to the true posterior:
$$\theta^* = \arg\min_\theta \mathrm{KL}\big[q(w|\theta)\,\big\|\,p(w|D)\big] = \arg\min_\theta \mathcal{F}(D, \theta)$$
$$\text{where } \mathcal{F}(D, \theta) = \underbrace{\mathrm{KL}\big[q(w|\theta)\,\big\|\,p(w)\big]}_{\text{prior-dependent part (complexity cost)}} - \underbrace{\mathbb{E}_{q(w|\theta)}\big[\log p(D|w)\big]}_{\text{data-dependent part (likelihood cost)}}$$
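For completeness (a standard step the slide omits): substituting Bayes' rule $p(w|D) = p(D|w)\,p(w)/p(D)$ shows that the two objectives differ only by the constant $\log p(D)$, so minimising $\mathcal{F}$ is equivalent to minimising the KL divergence to the true posterior.
$$\mathrm{KL}\big[q(w|\theta)\,\big\|\,p(w|D)\big] = \mathbb{E}_{q(w|\theta)}\!\left[\log \frac{q(w|\theta)\,p(D)}{p(w)\,p(D|w)}\right] = \mathcal{F}(D, \theta) + \log p(D)$$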
Unbiased	Monte	Carlo	gradients
Proposition	1.	Let	𝜀 be	a	random	variable	having	a	probability	density	
given	by	𝑞(𝜀) and	let	𝑤	 = 	𝑡(𝜃, 𝜀)	where	𝑡(𝜃, 𝜀) is	a	deterministic	
function.	Suppose	further	that	the	marginal	probability	density	of	𝑤,	
𝑞(𝑤|𝜃),	is	such	that	𝑞(𝜀)𝑑𝜀	 = 	𝑞(𝑤|𝜃)𝑑𝑤.	Then	for	a	function	𝑓 with	
derivatives	in	𝑤:	
$$\frac{\partial}{\partial\theta}\,\mathbb{E}_{q(w|\theta)}\big[f(w,\theta)\big] = \mathbb{E}_{q(\varepsilon)}\!\left[\frac{\partial f(w,\theta)}{\partial w}\,\frac{\partial w}{\partial\theta} + \frac{\partial f(w,\theta)}{\partial\theta}\right]$$
A generalization of the Gaussian reparameterization trick
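A quick numerical illustration (not from the paper) of Proposition 1 for a Gaussian $q(w|\theta)$ with $w = \mu + \sigma\varepsilon$ and $f(w) = w^2$: averaging $\partial f/\partial w \cdot \partial w/\partial\mu$ over samples of $\varepsilon$ recovers the exact gradient $\partial_\mu \mathbb{E}[w^2] = 2\mu$.

```python
# Numerical check of the reparameterised gradient estimator (numpy assumed).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.3
eps = rng.standard_normal(100_000)   # eps ~ q(eps) = N(0, 1)
w = mu + sigma * eps                 # w = t(theta, eps)
grad_mu = np.mean(2.0 * w * 1.0)     # E[ df/dw * dw/dmu ], with dw/dmu = 1
print(grad_mu, 2.0 * mu)             # estimate is close to the exact value 3.0
```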
Bayes	by	backprop
We	approximate	the	expected	lower	bound	as
$$\mathcal{F}(D, \theta) \approx \sum_{i=1}^{n} \Big[\log q(w^{(i)}|\theta) - \log p(w^{(i)}) - \log p(D|w^{(i)})\Big]$$
where $w^{(i)} \sim q(w^{(i)}|\theta)$ (Monte Carlo samples)
◦ Every	term	of	this	approximate	cost	depends	upon	the	particular	weights	
drawn	from	the	variational posterior.
Gaussian	variational posterior
1. Sample $\varepsilon \sim \mathcal{N}(0, I)$
2. Let $w = \mu + \log(1 + \exp(\rho)) \circ \varepsilon$
3. Let $\theta = (\mu, \rho)$
4. Let $f(w, \theta) = \log q(w|\theta) - \log p(w)\,p(D|w)$
5. Calculate the gradient with respect to the mean:
$$\Delta_\mu = \frac{\partial f(w,\theta)}{\partial w} + \frac{\partial f(w,\theta)}{\partial \mu}$$
6. Calculate the gradient with respect to the standard deviation parameter $\rho$:
$$\Delta_\rho = \frac{\partial f(w,\theta)}{\partial w}\,\frac{\varepsilon}{1 + \exp(-\rho)} + \frac{\partial f(w,\theta)}{\partial \rho}$$
7. Update the variational parameters:
$$\mu \leftarrow \mu - \alpha\Delta_\mu, \qquad \rho \leftarrow \rho - \alpha\Delta_\rho$$
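A minimal runnable sketch of these seven steps for a toy one-dimensional regression with a single Bayesian weight and bias, assuming PyTorch (an illustration, not the authors' implementation). The toy data, the simple N(0,1) prior, and the use of Adam in place of the plain step-7 update are assumptions made for stability on this small example; autograd on $f(w,\theta)$ reproduces the manual gradient rules in steps 5 and 6 because $w$ is an explicit function of $\mu$ and $\rho$.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data: y = 2x + noise.
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2.0 * x + 0.1 * torch.randn_like(x)

# Variational parameters theta = (mu, rho) for one weight and one bias.
mu = torch.zeros(2, requires_grad=True)
rho = torch.full((2,), -3.0, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=0.05)   # Adam used instead of plain SGD

def log_gaussian(v, mean, sigma):
    # Sum of element-wise Gaussian log-densities.
    return (-0.5 * math.log(2 * math.pi) - torch.log(sigma)
            - 0.5 * ((v - mean) / sigma) ** 2).sum()

for step in range(1000):
    opt.zero_grad()
    eps = torch.randn_like(mu)                 # 1. sample eps ~ N(0, I)
    sigma = F.softplus(rho)                    # 2. sigma = log(1 + exp(rho))
    w = mu + sigma * eps                       #    w = mu + sigma o eps
    pred = w[0] * x + w[1]
    log_lik = log_gaussian(y, pred, torch.tensor(0.1))              # log p(D|w)
    log_prior = log_gaussian(w, torch.zeros(2), torch.tensor(1.0))  # simple N(0,1) prior
    log_q = log_gaussian(w, mu, sigma)                              # log q(w|theta)
    f = log_q - log_prior - log_lik            # 4. f(w, theta)
    f.backward()                               # 5-6. gradients w.r.t. mu and rho
    opt.step()                                 # 7. update the variational parameters

print(mu.detach(), F.softplus(rho).detach())   # learned mean roughly (2.0, 0.0)
```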
Scale mixture prior
They proposed using a scale mixture of two Gaussian densities as the prior
◦ Combined with a diagonal Gaussian posterior
◦ Two degrees of freedom per weight only increases the number of parameters to optimize by a factor of two.
$$p(w) = \prod_j \Big[\pi\, \mathcal{N}(w_j\,|\,0, \sigma_1^2) + (1 - \pi)\, \mathcal{N}(w_j\,|\,0, \sigma_2^2)\Big], \qquad \sigma_1 > \sigma_2,\; \sigma_2 \ll 1$$
Empirically they found optimizing the parameters of the prior $p(w)$ not to be useful, and to yield worse results.
◦ It can be easier to change the prior parameters than the posterior parameters.
◦ The prior parameters try to capture the empirical distribution of the weights at the beginning of learning.
→ pick a fixed-form prior and don't adjust its hyperparameters
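A small sketch (PyTorch assumed; the values of π, σ1, σ2 below are illustrative defaults, not the paper's tuned settings) of the log-density of this scale-mixture prior, computed stably with log-sum-exp:

```python
import math
import torch

def log_scale_mixture_prior(w, pi=0.5, sigma1=1.0, sigma2=0.0025):
    # log p(w) = sum_j log[ pi*N(w_j|0,sigma1^2) + (1-pi)*N(w_j|0,sigma2^2) ]
    n1 = torch.distributions.Normal(0.0, sigma1)
    n2 = torch.distributions.Normal(0.0, sigma2)
    logp = torch.stack([n1.log_prob(w) + math.log(pi),
                        n2.log_prob(w) + math.log(1.0 - pi)])
    return torch.logsumexp(logp, dim=0).sum()
```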
Minibatches and	KL	re-weighting
The minibatch cost for minibatch $i = 1, 2, \ldots, M$:
$$F_i^{\pi}(D_i, \theta) = \pi_i\, \mathrm{KL}\big[q(w|\theta)\,\big\|\,p(w)\big] - \mathbb{E}_{q(w|\theta)}\big[\log p(D_i|w)\big]$$
where $\pi \in [0,1]^M$ and $\sum_{i=1}^{M} \pi_i = 1$
◦ Then $\mathbb{E}_M\Big[\sum_{i=1}^{M} F_i^{\pi}(D_i, \theta)\Big] = \mathcal{F}(D, \theta)$
$\pi_i = \dfrac{2^{M-i}}{2^M - 1}$ works well
◦ the first few minibatches are heavily influenced by the complexity cost
◦ the later minibatches are largely influenced by the data
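A tiny sketch (plain Python) of the $\pi_i = 2^{M-i}/(2^M - 1)$ schedule: the weights sum to one, and the earliest minibatches get almost all of the complexity cost.

```python
def kl_weight(i, num_batches):
    # pi_i = 2^(M - i) / (2^M - 1), for minibatch i = 1, ..., M
    return 2 ** (num_batches - i) / (2 ** num_batches - 1)

weights = [kl_weight(i, 8) for i in range(1, 9)]
assert abs(sum(weights) - 1.0) < 1e-9   # the schedule sums to one
print(weights[0], weights[-1])          # ~0.50 for the first batch, ~0.004 for the last
```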
Contextual	Bandits
Contextual	Bandits
◦ Simple	reinforcement	learning	problems	
Example	:	Clinical	Decision	Making	
Repeatedly:
1. A patient comes to a doctor with
symptoms, medical history, test results
2. The doctor chooses a treatment
3. The patient responds to it
The doctor wants a policy for choosing
targeted treatments for individual patients.
Context	𝑥
Actions	𝑎 ∈ [0, . . , 𝐾]
Rewards	
𝑟~𝑝(𝑟|𝑥, 𝑎, 𝑤)
The agent's model $p(r|x, a, w)$ is trained online.
Thompson	Sampling
Thompson	sampling
◦ Popular means of picking an action that trades off between exploitation and exploration.
◦ Necessitates	a	Bayesian	treatment	of	the	model	parameters
1. Sample	a	new	set	of	parameters	for	the	model.
2. Pick	the	action	with	the	highest	expected	reward	according	to	the	
sampled	parameters.
3. Update	the	model.	Go	to	1.
Thompson	Sampling	for	NN
Thompson sampling is easily adapted to neural networks using the variational posterior.
1. Sample weights from the variational posterior: $w \sim q(w|\theta)$.
2. Receive the context $x$.
3. Pick the action $a$ that maximises the expected reward $\mathbb{E}_{p(r|x,a,w)}[r]$.
4. Receive reward $r$.
5. Update the variational parameters $\theta$. Go to 1.
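A minimal sketch of this loop in plain Python (not the authors' code); `env` and `model`, with `get_context`, `take_action`, `sample_weights`, `expected_reward`, and `update`, are assumed placeholder interfaces.

```python
def thompson_sampling_step(env, model, actions):
    w = model.sample_weights()                  # 1. w ~ q(w | theta)
    x = env.get_context()                       # 2. receive the context x
    a = max(actions,                            # 3. action with the highest
            key=lambda a: model.expected_reward(x, a, w))  # expected reward
    r = env.take_action(a)                      # 4. receive reward r
    model.update(x, a, r)                       # 5. update variational parameters
    return a, r
```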
Experiment
1. Classification	on	MNIST
2. Regression	curves
3. Bandits	on	Mushroom	Task
Classification	on	MNIST
Model	:	
◦ 28×28 input → (400, 800, or 1200 units) → (ReLU) → (400, 800, or 1200 units) → (ReLU) → (Softmax) → 10 classes
Table 1. Classification error rates on MNIST (* indicates the result used an ensemble of 5 networks).

Method                                         #Units/Layer  #Weights  Test Error
SGD, no regularisation (Simard et al., 2003)   800           1.3m      1.6%
SGD, dropout (Hinton et al., 2012)             -             -         ≈ 1.3%
SGD, dropconnect (Wan et al., 2013)            800           1.3m      1.2%*
SGD                                            400           500k      1.83%
SGD                                            800           1.3m      1.84%
SGD                                            1200          2.4m      1.88%
SGD, dropout                                   400           500k      1.51%
SGD, dropout                                   800           1.3m      1.33%
SGD, dropout                                   1200          2.4m      1.36%
Bayes by Backprop, Gaussian                    400           500k      1.82%
Bayes by Backprop, Gaussian                    800           1.3m      1.99%
Bayes by Backprop, Gaussian                    1200          2.4m      2.04%
Bayes by Backprop, Scale mixture               400           500k      1.36%
Bayes by Backprop, Scale mixture               800           1.3m      1.34%
Bayes by Backprop, Scale mixture               1200          2.4m      1.32%

[Figure 2 from the paper: test error (%) on MNIST as training progresses (epochs).]
[Figure 3 from the paper: histogram of the trained weights for Dropout, plain SGD, and samples from Bayes by Backprop.]
Classification	on	MNIST
Test error on MNIST as training progresses
◦ Bayes	by	Backprop and	dropout	 converge	at	similar	rates.
Classification	on	MNIST
Density	estimates	of	the	weights
◦ Bayes	by	Backprop :	sampled	from	the	variational posterior
◦ Dropout	:	used	at	test	time
◦ Bayes	by	Backprop uses	the	greatest	range	of	weights
Classification	on	MNIST
The level of redundancy in a Bayes by Backprop network.
◦ Replace the weights with a constant zero, starting from those with the lowest signal-to-noise ratio $|\mu_i|/\sigma_i$.
[Figure 4 from the paper: density and CDF of the signal-to-noise ratio over all weights in the network. The red line denotes the 75% cut-off.]
Excerpt from the paper: In Table 2, the authors examine the effect of replacing the variational posterior on some of the weights with a constant zero, so as to determine the level of redundancy in the network found by Bayes by Backprop. They took a Bayes by Backprop trained network with two layers of 1200 units and ordered the weights by their signal-to-noise ratio ($|\mu_i|/\sigma_i$), removing those with the lowest ratio. Even when 95% of the weights are removed the network still performs well, with a significant drop in performance once 98% of the weights have been removed. Figure 4 shows the distribution of the signal-to-noise ratio relative to this cut-off: the density has two modes, and the 75% cut-off in the CDF separates the two peaks, which coincide with a drop in performance in Table 2 from 1.24% to 1.29%, suggesting that the signal-to-noise heuristic is in fact related to test performance.
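A hedged sketch (PyTorch assumed) of the signal-to-noise pruning described in the excerpt above: rank weights by $|\mu_i|/\sigma_i$ and replace the lowest-ranked fraction with a constant zero.

```python
import torch

def prune_by_snr(mu, sigma, drop_fraction=0.75):
    # Signal-to-noise ratio per weight: |mu_i| / sigma_i.
    snr = mu.abs() / sigma
    k = int(drop_fraction * snr.numel())
    drop_idx = torch.argsort(snr.flatten())[:k]   # lowest signal-to-noise first
    pruned = mu.clone().flatten()
    pruned[drop_idx] = 0.0                        # replace with a constant zero
    return pruned.view_as(mu)
```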
Regression	curves
Training	data	:
$$y = x + 0.3\sin\big(2\pi(x + \varepsilon)\big) + 0.3\sin\big(4\pi(x + \varepsilon)\big) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 0.02)$$
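A small sketch (numpy assumed) that generates this training data; the x-range and the reading of 0.02 as the noise standard deviation are assumptions, since the formula only states $\varepsilon \sim \mathcal{N}(0, 0.02)$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 0.6, size=500)          # assumed training-data range
eps = rng.normal(0.0, 0.02, size=x.shape)    # 0.02 taken as the std deviation
y = (x + 0.3 * np.sin(2 * np.pi * (x + eps))
       + 0.3 * np.sin(4 * np.pi * (x + eps)) + eps)
```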
[Figure 5 from the paper: regression of noisy data with interquartile ranges. Black crosses are training samples, red lines are median predictions, and the blue/purple region is the interquartile range. Left: Bayes by Backprop neural network. Right: standard neural network.]
Bayes by Backprop (left) | Standard NN (right)
Bandits	on	Mushroom	Task
UCI	Mushrooms	dataset
◦ Each	mushroom	 has	a	set	of	features	and	is	labelled	as	edible	or	poisonous.	
◦ An	edible	mushroom	:	a	reward	of	5
◦ A poisonous mushroom : a reward of −35 or 5, each with probability 0.5
◦ Not	to	eat	a	mushroom	:	a	reward	of	0
Model	:	
◦ Input (context and action) → 100 units → (ReLU) → 100 units → (ReLU) → the expected reward
Comparison	approach	:	
◦ 𝜀-greedy	policy	:	with	probability	 𝜀 propose	a	uniformly	random	action,	
otherwise	pick	the	best	action	according	to	the	neural	network.
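A short sketch of this ε-greedy baseline in plain Python (illustrative; `expected_reward` is an assumed stand-in for the trained network's prediction).

```python
import random

def epsilon_greedy_action(x, actions, expected_reward, eps=0.05):
    # With probability eps, explore uniformly at random;
    # otherwise exploit the action with the highest predicted reward.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: expected_reward(x, a))
```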
Bandits	on	Mushroom	Task
Comparison	of	cumulative	regret	of	various	agents.
◦ Regret	:	the	difference	between	the	reward	achievable	by	an	oracle	and	the	
reward	received	by	an	agent.	
◦ Oracle	:	an	edible	mushroom→5,	 a	poisonous	 mushroom→0
◦ Lower	is	better.
[Figure 6 from the paper: comparison of cumulative regret of various agents on the mushroom bandit task (cumulative regret vs. step, for the 5% Greedy, 1% Greedy, Greedy, and Bayes by Backprop agents).]
◦ ε-greedy agents: over-explore
◦ Bayes by Backprop: quickly converges
Conclusion
They	introduced	a	new	algorithm	for	learning	neural	networks	with	
uncertainty	on	the	weights	called	Bayes	by	Backprop.	
The	algorithm	achieves	good	results	in	several	domains.	
◦ Classifying	MNIST	digits	:	Performance	from	Bayes	by	Backprop is	
comparable	to	that	of	dropout.	
◦ Non-linear	regression	problem	:	Bayes	by	Backprop allows	the	network	to	
make	more	reasonable	predictions	about	unseen	data.
◦ Contextual	bandits	:	Bayes	by	Backprop can	automatically	learn	how	to	
trade-off	exploration	and	exploitation.

More Related Content

What's hot

[DL輪読会]Temporal DifferenceVariationalAuto-Encoder
[DL輪読会]Temporal DifferenceVariationalAuto-Encoder[DL輪読会]Temporal DifferenceVariationalAuto-Encoder
[DL輪読会]Temporal DifferenceVariationalAuto-EncoderDeep Learning JP
 
[DL輪読会]GQNと関連研究,世界モデルとの関係について
[DL輪読会]GQNと関連研究,世界モデルとの関係について[DL輪読会]GQNと関連研究,世界モデルとの関係について
[DL輪読会]GQNと関連研究,世界モデルとの関係についてDeep Learning JP
 
[DL輪読会]近年のエネルギーベースモデルの進展
[DL輪読会]近年のエネルギーベースモデルの進展[DL輪読会]近年のエネルギーベースモデルの進展
[DL輪読会]近年のエネルギーベースモデルの進展Deep Learning JP
 
Fisher Vectorによる画像認識
Fisher Vectorによる画像認識Fisher Vectorによる画像認識
Fisher Vectorによる画像認識Takao Yamanaka
 
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法SSII
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes FamilyKota Matsui
 
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...Deep Learning JP
 
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜SSII
 
強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習Eiji Uchibe
 
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihoodDeep Learning JP
 
[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.
[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.
[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.Deep Learning JP
 
GAN(と強化学習との関係)
GAN(と強化学習との関係)GAN(と強化学習との関係)
GAN(と強化学習との関係)Masahiro Suzuki
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel LearningMasahiro Suzuki
 
深層学習の数理
深層学習の数理深層学習の数理
深層学習の数理Taiji Suzuki
 
Curriculum Learning (関東CV勉強会)
Curriculum Learning (関東CV勉強会)Curriculum Learning (関東CV勉強会)
Curriculum Learning (関東CV勉強会)Yoshitaka Ushiku
 
[DL輪読会]ICLR2020の分布外検知速報
[DL輪読会]ICLR2020の分布外検知速報[DL輪読会]ICLR2020の分布外検知速報
[DL輪読会]ICLR2020の分布外検知速報Deep Learning JP
 

What's hot (20)

[DL輪読会]Temporal DifferenceVariationalAuto-Encoder
[DL輪読会]Temporal DifferenceVariationalAuto-Encoder[DL輪読会]Temporal DifferenceVariationalAuto-Encoder
[DL輪読会]Temporal DifferenceVariationalAuto-Encoder
 
[DL輪読会]GQNと関連研究,世界モデルとの関係について
[DL輪読会]GQNと関連研究,世界モデルとの関係について[DL輪読会]GQNと関連研究,世界モデルとの関係について
[DL輪読会]GQNと関連研究,世界モデルとの関係について
 
主成分分析
主成分分析主成分分析
主成分分析
 
Iclr2016 vaeまとめ
Iclr2016 vaeまとめIclr2016 vaeまとめ
Iclr2016 vaeまとめ
 
[DL輪読会]近年のエネルギーベースモデルの進展
[DL輪読会]近年のエネルギーベースモデルの進展[DL輪読会]近年のエネルギーベースモデルの進展
[DL輪読会]近年のエネルギーベースモデルの進展
 
Fisher Vectorによる画像認識
Fisher Vectorによる画像認識Fisher Vectorによる画像認識
Fisher Vectorによる画像認識
 
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
SSII2021 [OS2-01] 転移学習の基礎:異なるタスクの知識を利用するための機械学習の方法
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes Family
 
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
[DL輪読会] Spectral Norm Regularization for Improving the Generalizability of De...
 
実装レベルで学ぶVQVAE
実装レベルで学ぶVQVAE実装レベルで学ぶVQVAE
実装レベルで学ぶVQVAE
 
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
SSII2022 [SS2] 少ないデータやラベルを効率的に活用する機械学習技術 〜 足りない情報をどのように補うか?〜
 
強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習
 
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
【DL輪読会】Unbiased Gradient Estimation for Marginal Log-likelihood
 
[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.
[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.
[DL輪読会]深層強化学習はなぜ難しいのか?Why Deep RL fails? A brief survey of recent works.
 
GAN(と強化学習との関係)
GAN(と強化学習との関係)GAN(と強化学習との関係)
GAN(と強化学習との関係)
 
(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning(DL hacks輪読) Deep Kernel Learning
(DL hacks輪読) Deep Kernel Learning
 
リプシッツ連続性に基づく勾配法・ニュートン型手法の計算量解析
リプシッツ連続性に基づく勾配法・ニュートン型手法の計算量解析リプシッツ連続性に基づく勾配法・ニュートン型手法の計算量解析
リプシッツ連続性に基づく勾配法・ニュートン型手法の計算量解析
 
深層学習の数理
深層学習の数理深層学習の数理
深層学習の数理
 
Curriculum Learning (関東CV勉強会)
Curriculum Learning (関東CV勉強会)Curriculum Learning (関東CV勉強会)
Curriculum Learning (関東CV勉強会)
 
[DL輪読会]ICLR2020の分布外検知速報
[DL輪読会]ICLR2020の分布外検知速報[DL輪読会]ICLR2020の分布外検知速報
[DL輪読会]ICLR2020の分布外検知速報
 

Viewers also liked

(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi DivergenceMasahiro Suzuki
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...Masahiro Suzuki
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot LearningMasahiro Suzuki
 
(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman FiltersMasahiro Suzuki
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural NetworkMasahiro Suzuki
 
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...Masahiro Suzuki
 
深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習Masahiro Suzuki
 
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural NetworksMasahiro Suzuki
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization TrickMasahiro Suzuki
 
(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target PropagationMasahiro Suzuki
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task LearningMasahiro Suzuki
 
僕たちはなぜ「学ぶ」のだろうか? ~高専出身教職勢による教育学入門~ Share ver.
僕たちはなぜ「学ぶ」のだろうか? ~高専出身教職勢による教育学入門~ Share ver.僕たちはなぜ「学ぶ」のだろうか? ~高専出身教職勢による教育学入門~ Share ver.
僕たちはなぜ「学ぶ」のだろうか? ~高専出身教職勢による教育学入門~ Share ver.ShotaSatuma
 
子どもにさせたい基本三体験の話 Share Ver
子どもにさせたい基本三体験の話 Share Ver子どもにさせたい基本三体験の話 Share Ver
子どもにさせたい基本三体験の話 Share VerShotaSatuma
 
進学準備&大学生活について
進学準備&大学生活について進学準備&大学生活について
進学準備&大学生活についてShotaSatuma
 
Kosen10sLT#05 22歳の発達課題
Kosen10sLT#05 22歳の発達課題Kosen10sLT#05 22歳の発達課題
Kosen10sLT#05 22歳の発達課題ShotaSatuma
 
僕たちはなぜ「学ぶ」だろうか?(Shot A Talk 1st ver.)
僕たちはなぜ「学ぶ」だろうか?(Shot A Talk 1st ver.)僕たちはなぜ「学ぶ」だろうか?(Shot A Talk 1st ver.)
僕たちはなぜ「学ぶ」だろうか?(Shot A Talk 1st ver.)ShotaSatuma
 
自己紹介 (kosen10sLT #03)
自己紹介 (kosen10sLT #03)自己紹介 (kosen10sLT #03)
自己紹介 (kosen10sLT #03)ShotaSatuma
 
自己紹介20160501 share ver
自己紹介20160501 share ver自己紹介20160501 share ver
自己紹介20160501 share verShotaSatuma
 
高専・大学での過ごし方(高専カンファ茨香祭)
高専・大学での過ごし方(高専カンファ茨香祭)高専・大学での過ごし方(高専カンファ茨香祭)
高専・大学での過ごし方(高専カンファ茨香祭)ShotaSatuma
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?Masahiro Suzuki
 

Viewers also liked (20)

(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence
 
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
(DL hacks輪読) Seven neurons memorizing sequences of alphabetical images via sp...
 
(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning(DL輪読)Matching Networks for One Shot Learning
(DL輪読)Matching Networks for One Shot Learning
 
(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters(DL hacks輪読) Deep Kalman Filters
(DL hacks輪読) Deep Kalman Filters
 
(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network(DL hacks輪読)Bayesian Neural Network
(DL hacks輪読)Bayesian Neural Network
 
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
 
深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習深層生成モデルを用いたマルチモーダル学習
深層生成モデルを用いたマルチモーダル学習
 
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
 
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
(DL hacks輪読) Variational Dropout and the Local Reparameterization Trick
 
(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation(DL hacks輪読) Difference Target Propagation
(DL hacks輪読) Difference Target Propagation
 
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
(研究会輪読) Facial Landmark Detection by Deep Multi-task Learning
 
僕たちはなぜ「学ぶ」のだろうか? ~高専出身教職勢による教育学入門~ Share ver.
僕たちはなぜ「学ぶ」のだろうか? ~高専出身教職勢による教育学入門~ Share ver.僕たちはなぜ「学ぶ」のだろうか? ~高専出身教職勢による教育学入門~ Share ver.
僕たちはなぜ「学ぶ」のだろうか? ~高専出身教職勢による教育学入門~ Share ver.
 
子どもにさせたい基本三体験の話 Share Ver
子どもにさせたい基本三体験の話 Share Ver子どもにさせたい基本三体験の話 Share Ver
子どもにさせたい基本三体験の話 Share Ver
 
進学準備&大学生活について
進学準備&大学生活について進学準備&大学生活について
進学準備&大学生活について
 
Kosen10sLT#05 22歳の発達課題
Kosen10sLT#05 22歳の発達課題Kosen10sLT#05 22歳の発達課題
Kosen10sLT#05 22歳の発達課題
 
僕たちはなぜ「学ぶ」だろうか?(Shot A Talk 1st ver.)
僕たちはなぜ「学ぶ」だろうか?(Shot A Talk 1st ver.)僕たちはなぜ「学ぶ」だろうか?(Shot A Talk 1st ver.)
僕たちはなぜ「学ぶ」だろうか?(Shot A Talk 1st ver.)
 
自己紹介 (kosen10sLT #03)
自己紹介 (kosen10sLT #03)自己紹介 (kosen10sLT #03)
自己紹介 (kosen10sLT #03)
 
自己紹介20160501 share ver
自己紹介20160501 share ver自己紹介20160501 share ver
自己紹介20160501 share ver
 
高専・大学での過ごし方(高専カンファ茨香祭)
高専・大学での過ごし方(高専カンファ茨香祭)高専・大学での過ごし方(高専カンファ茨香祭)
高専・大学での過ごし方(高専カンファ茨香祭)
 
(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?(DL Hacks輪読) How transferable are features in deep neural networks?
(DL Hacks輪読) How transferable are features in deep neural networks?
 

Similar to (研究会輪読) Weight Uncertainty in Neural Networks

Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsSalah Amean
 
Chapter 9. Classification Advanced Methods.ppt
Chapter 9. Classification Advanced Methods.pptChapter 9. Classification Advanced Methods.ppt
Chapter 9. Classification Advanced Methods.pptSubrata Kumer Paul
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...butest
 
Boundness of a neural network weights using the notion of a limit of a sequence
Boundness of a neural network weights using the notion of a limit of a sequenceBoundness of a neural network weights using the notion of a limit of a sequence
Boundness of a neural network weights using the notion of a limit of a sequenceIJDKP
 
Artificial Neural Networks for NIU
Artificial Neural Networks for NIUArtificial Neural Networks for NIU
Artificial Neural Networks for NIUProf. Neeta Awasthy
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural NetworksNatan Katz
 
Batch gradient method for training of
Batch gradient method for training ofBatch gradient method for training of
Batch gradient method for training ofijaia
 
Classifiers
ClassifiersClassifiers
ClassifiersAyurdata
 
Artificial neural networks
Artificial neural networks Artificial neural networks
Artificial neural networks ShwethaShreeS
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional VerificationSai Kiran Kadam
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorizationmidi
 
Mncs 16-09-4주-변승규-introduction to the machine learning
Mncs 16-09-4주-변승규-introduction to the machine learningMncs 16-09-4주-변승규-introduction to the machine learning
Mncs 16-09-4주-변승규-introduction to the machine learningSeung-gyu Byeon
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2arogozhnikov
 
ANNs have been widely used in various domains for: Pattern recognition Funct...
ANNs have been widely used in various domains for: Pattern recognition  Funct...ANNs have been widely used in various domains for: Pattern recognition  Funct...
ANNs have been widely used in various domains for: Pattern recognition Funct...vijaym148
 

Similar to (研究会輪読) Weight Uncertainty in Neural Networks (20)

Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
 
Chapter 9. Classification Advanced Methods.ppt
Chapter 9. Classification Advanced Methods.pptChapter 9. Classification Advanced Methods.ppt
Chapter 9. Classification Advanced Methods.ppt
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
 
09 classadvanced
09 classadvanced09 classadvanced
09 classadvanced
 
Boundness of a neural network weights using the notion of a limit of a sequence
Boundness of a neural network weights using the notion of a limit of a sequenceBoundness of a neural network weights using the notion of a limit of a sequence
Boundness of a neural network weights using the notion of a limit of a sequence
 
Artificial Neural Networks for NIU
Artificial Neural Networks for NIUArtificial Neural Networks for NIU
Artificial Neural Networks for NIU
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural Networks
 
Batch gradient method for training of
Batch gradient method for training ofBatch gradient method for training of
Batch gradient method for training of
 
Classifiers
ClassifiersClassifiers
Classifiers
 
Artificial neural networks
Artificial neural networks Artificial neural networks
Artificial neural networks
 
SVM - Functional Verification
SVM - Functional VerificationSVM - Functional Verification
SVM - Functional Verification
 
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorization
 
Neural networks
Neural networksNeural networks
Neural networks
 
lecture_16.pptx
lecture_16.pptxlecture_16.pptx
lecture_16.pptx
 
ai7.ppt
ai7.pptai7.ppt
ai7.ppt
 
Mncs 16-09-4주-변승규-introduction to the machine learning
Mncs 16-09-4주-변승규-introduction to the machine learningMncs 16-09-4주-변승규-introduction to the machine learning
Mncs 16-09-4주-변승규-introduction to the machine learning
 
ai7.ppt
ai7.pptai7.ppt
ai7.ppt
 
MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2MLHEP 2015: Introductory Lecture #2
MLHEP 2015: Introductory Lecture #2
 
ANNs have been widely used in various domains for: Pattern recognition Funct...
ANNs have been widely used in various domains for: Pattern recognition  Funct...ANNs have been widely used in various domains for: Pattern recognition  Funct...
ANNs have been widely used in various domains for: Pattern recognition Funct...
 

More from Masahiro Suzuki

深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)Masahiro Suzuki
 
確率的推論と行動選択
確率的推論と行動選択確率的推論と行動選択
確率的推論と行動選択Masahiro Suzuki
 
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについてMasahiro Suzuki
 
深層生成モデルと世界モデル
深層生成モデルと世界モデル深層生成モデルと世界モデル
深層生成モデルと世界モデルMasahiro Suzuki
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究についてMasahiro Suzuki
 
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習Masahiro Suzuki
 

More from Masahiro Suzuki (6)

深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)深層生成モデルと世界モデル(2020/11/20版)
深層生成モデルと世界モデル(2020/11/20版)
 
確率的推論と行動選択
確率的推論と行動選択確率的推論と行動選択
確率的推論と行動選択
 
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて深層生成モデルと世界モデル,深層生成モデルライブラリPixyzについて
深層生成モデルと世界モデル, 深層生成モデルライブラリPixyzについて
 
深層生成モデルと世界モデル
深層生成モデルと世界モデル深層生成モデルと世界モデル
深層生成モデルと世界モデル
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究について
 
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習深層生成モデルを用いたマルチモーダルデータの半教師あり学習
深層生成モデルを用いたマルチモーダルデータの半教師あり学習
 

Recently uploaded

Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 

Recently uploaded (20)

Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 

(研究会輪読) Weight Uncertainty in Neural Networks

  • 3. Overfitting and Uncertainty Overfitting ◦ Plain feedforward neural networks are prone to overfitting. Uncertainty ◦ NN often incapable of correctly assessing the uncertainty in the training data. →Overly confident decisions (a) Softmax input as a function of data x: f(x) (b) Softmax output as a function of data x: (f(x)) Figure 1: A sketch of softmax input and output for an idealised binary classification problem. Training data is given between the dashed grey lines. Function point estimate is shown with a solid line. Function uncertainty is shown with a shaded area. Marked with a dashed red line is a point x⇤ far from the training data. Ignoring function uncertainty, point x⇤ is classified as class 1 with probability 1. have made use of NNs for Q-value function approximation. These are functions that estimate the quality of different actions an agent can make. Epsilon greedy search is often used where the agent Training data Function point estimate Without uncertainty: class 1 with probability 1 With uncertainty: better reflects classification uncertainty [Gal et. al 2015] Function uncertainty
  • 4. How to prevent overfitting Various regularization schemes have been proposed. ◦ Early stopping ◦ Weight decay ◦ Dropout This paper addresses this problem by using variational Bayesian learning to introduce uncertainty in the weights of the network. ↓ Bayes by Backprop
  • 5. Contribution They proposed Bayes by Backprop ◦ A simple approximate learning algorithm similar to backpropagation. ◦ All weight are represented by probability distributions over possible values. It achieves good results in several domains. ◦ Classification ◦ Regression ◦ Bandit problem Weight Uncertainty in Neural Networks H1 H2 H3 1 X 1 Y 0.5 0.1 0.7 1.3 1.40.3 1.2 0.10.1 0.2 H1 H2 H3 1 X 1 Y Figure 1. Left: each weight has a fixed value, as provided by clas- sical backpropagation. Right: each weight is assigned a distribu- tion, as provided by Bayes by Backprop. is related to recent methods in deep, generative modelling (Kingma and Welling, 2014; Rezende et al., 2014; Gregor et al., 2014), where variational inference has been applied to stochastic hidden units of an autoencoder. Whilst the number of stochastic hidden units might be in the order of thousands, the number of weights in a neural network is easily two orders of magnitude larger, making the optimisa- tion problem much larger scale. Uncertainty in the hidden units allows the expression of uncertainty about a particular the parameters of the categorical distribution through the exponential function then re-norm regression Y is R and P(y|x, w) is a Gaussian – this corresponds to a squared loss. Inputs x are mapped onto the parameters of tion on Y by several successive layers of linear tion (given by w) interleaved with element-wise transforms. The weights can be learnt by maximum likelih tion (MLE): given a set of training examples D the MLE weights wMLE are given by: wMLE = arg max w log P(D|w) = arg max w X i log P(yi|xi, w This is typically achieved by gradient descent propagation), where we assume that log P(D|w entiable in w. Regularisation can be introduced by placing a the weights w and finding the maximum a MAP Classical backpropagation Bayes by Backprop
  • 6. Related Works Variational approximation ◦ [Graves 2011] → the gradients of this can be made unbiased and this method can be used with non-Gaussian priors. Uncertainty in the hidden units ◦ [Kingma and Welling, 2014] [Rezende et al., 2014] [Gregor et al., 2014] ◦ Variational autoencoder → the number of weights in a neural network is easily two orders of magnitude larger Contextual bandit problems using Thompson sampling ◦ [Thompson, 1933] [Chapelle and Li, 2011] [Agrawal and Goyal, 2012] [May et al., 2012] → Weights with greater uncertainty introduce more variability into the decisions made by the network, leading naturally to exploration.
  • 7. Point Estimates of NN Neural network : 𝑝(𝑦|𝑥, 𝑤) ◦ Input : 𝑥 ∈ ℝ+ ◦ Output : 𝑦 ∈ 𝒴 ◦ The set of parameters : 𝑤 ◦ Cross-entropy (categorical distiribution), squared loss (Gaussian distribution) Learning 𝐷 = (𝑥/, 𝑦/) ◦ MLE: 𝑤012 = arg max 8 log 𝑝 𝐷 𝑤 = arg max 8 ; log 𝑝(𝑦/ |𝑥/, 𝑤) / ◦ MAP: 𝑤0<= = arg max 8 log 𝑝 𝑤 𝐷 = arg max 8 log 𝑝(𝐷|𝑤) + log 𝑝(𝑤)
  • 8. Being Bayesian The predictive distribution : 𝑝(𝑦?|𝑥?) ◦ an unknown label : 𝑦? ◦ a test data item : 𝑥? 𝑝 𝑦? 𝑥? = @ 𝑝 𝑦? 𝑥?, 𝑤 𝑝 𝑤 𝐷 𝑑𝑤 = 𝔼+(8|C)[𝑝 𝑦? 𝑥?, 𝑤 ] Taking expectation = an ensemble of an uncountably infinite number of NN ↓ Intractable
  • 9. Variational Learning The Basyan posterior distribution on the weight : 𝑞(𝑤|𝜃) ◦ parameters : 𝜃 The posterior distribution given the training data : 𝑝 𝑤 𝐷 Find the parameters 𝜃 that minimizes the KL divergence : 𝜃∗ = arg min L ℱ 𝐷, 𝜃 where ℱ 𝐷, 𝜃 = 𝐾𝐿 𝑞 𝑤 𝜃 𝑝 𝑤 𝐷 = 𝐾𝐿 𝑞 𝑤 𝜃 𝑝(𝑤) − 𝔼 Q 𝑤 𝜃 [log 𝑝(𝐷|𝑤)] data-dependent part (likelihood cost) prior-dependent part (complexity cost)
  • 10. Unbiased Monte Carlo gradients Proposition 1. Let 𝜀 be a random variable having a probability density given by 𝑞(𝜀) and let 𝑤 = 𝑡(𝜃, 𝜀) where 𝑡(𝜃, 𝜀) is a deterministic function. Suppose further that the marginal probability density of 𝑤, 𝑞(𝑤|𝜃), is such that 𝑞(𝜀)𝑑𝜀 = 𝑞(𝑤|𝜃)𝑑𝑤. Then for a function 𝑓 with derivatives in 𝑤: U UL 𝔼Q(8|L) 𝑓 𝑤, 𝜃 = 𝔼Q(V) UW(8,L) UL U8 UL + UW(8,L) UL A generalization of the Gaussian reparameterizationtrick
  • 11. Bayes by backprop We approximate the expected lower bound as ℱ 𝐷, 𝜃 ≈ ; log 𝑞 𝑤(/) 𝜃 − log 𝑝(𝑤(/)) − log 𝑝(𝐷|𝑤 / ) / where 𝑤(/)~𝑞 𝑤(/) 𝜃 (Monte Carlo) ◦ Every term of this approximate cost depends upon the particular weights drawn from the variational posterior.
  • 12. Gaussian variational posterior 1. Sample 𝜀 ∼ 𝑁(0, 𝐼) 2. Let 𝑤 = 𝜇 + log (1 + exp ( 𝜌)) ∘ 𝜀 3. Let 𝜃 = (𝜇, 𝜌) 4. Let 𝑓(𝑤, 𝜃) = log 𝑞 𝑤 𝜃 − log 𝑝 𝑤 𝑝(𝐷|𝑤) 5. Calculate the gradient with respect to the mean ∆e = 𝜕𝑓(𝑤, 𝜃) 𝜕𝑤 + 𝜕𝑓(𝑤, 𝜃) 𝜕𝜇 6. Calculate the gradient with respect to the standard deviation parameter 𝜌 ∆g = 𝜕𝑓(𝑤, 𝜃) 𝜕𝑤 𝜀 1 + exp(−𝜌) + 𝜕𝑓(𝑤, 𝜃) 𝜕𝜌 7. Update the variational parameters: 𝜇 ← 𝜇 − 𝛼∆e 𝜌 ← 𝜌 − 𝛼∆g
  • 13. Scalar mixture prior They proposed using scale mixture of two Gaussian densities as the prior ◦ Combined with a diagonal Gaussian posterior ◦ Two degrees of freedom per weight only increases the number of parameters to optimize by a factor of two. 𝑝(𝑤) = ∏ 𝜋𝑁(𝑤l |0, 𝜎n) + (1 − 𝜋)𝑁(𝑤l|0, 𝜎o) l where 𝜎n > 𝜎o, 𝜎o ≪ 1 Empirically they found optimizing the parameters of a prior 𝑝(𝑤) to not be useful, and yield worse results. ◦ It can be easier to change the prior parameters than the posterior parameters. ◦ The prior parameters try to capture the empirical distribution of the weights at the beginning of learning. → pick a fixed-form prior and don’t adjust its hyperparamater
  • 14. Minibatches and KL re-weighting The minibatch cost for minibatch i = 1,2,...,M: 𝐹/ u (𝐷/, 𝜃) = 𝜋/ 𝐾𝐿[𝑞(𝑤|𝜃) || 𝑃(𝑤)] − 𝔼Q 𝑤 𝜃 [log 𝑃 𝐷/ 𝑤 ] where π ∈ [0,1]0 and ∑ 𝜋/ = 10 /xn ◦ Then 𝔼0 ∑ 𝐹/ u 𝐷/ , 𝜃0 /xn = 𝐹(𝐷, 𝜃) 𝜋 = oyz{ oy|n works well ◦ the first few minibatches are heavily influenced by the complexity cost ◦ the later minibatches are largely influenced by the data
  • 15. Contextual Bandits Contextual Bandits ◦ Simple reinforcement learning problems Example : Clinical Decision Making Another Example: Clinical Decision Making Repeatedly: 1. A patient comes to a doctor with symptoms, medical history, test results 2. The doctor chooses a treatment 3. The patient responds to it The doctor wants a policy for choosing targeted treatments for individual patients. Context 𝑥 Actions 𝑎 ∈ [0, . . , 𝐾] Rewards 𝑟~𝑝(𝑟|𝑥, 𝑎, 𝑤) Online training the agent’s model 𝑝 𝑟 𝑥, 𝑎, 𝑤
  • 16. Thompson Sampling Thompson sampling ◦ Popular means of picking an action that trades-off between exploitation and exploration. ◦ Necessitates a Bayesian treatment of the model parameters 1. Sample a new set of parameters for the model. 2. Pick the action with the highest expected reward according to the sampled parameters. 3. Update the model. Go to 1.
• 17. Thompson Sampling for NN Thompson sampling is easily adapted to neural networks using the variational posterior. 1. Sample weights from the variational posterior: w ~ q(w|θ). 2. Receive the context x. 3. Pick the action a with the highest expected reward 𝔼_P(r|x,a,w)[r]. 4. Receive reward r. 5. Update the variational parameters θ. Go to 1.
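A minimal sketch of one interaction of this loop. All callables (`sample_weights`, `expected_reward`, `observe_reward`, `update_posterior`) are placeholders for the user's Bayesian network and environment, not APIs from the paper.

```python
def thompson_sampling_step(theta, context, actions,
                           sample_weights, expected_reward,
                           observe_reward, update_posterior):
    """One interaction of Thompson sampling with a Bayesian neural network."""
    w = sample_weights(theta)                                        # w ~ q(w|θ)
    a = max(actions, key=lambda a: expected_reward(context, a, w))   # greedy under the sampled weights
    r = observe_reward(context, a)                                   # environment feedback
    theta = update_posterior(theta, context, a, r)                   # e.g. a Bayes by Backprop step
    return theta, a, r
```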
• 19. Classification on MNIST Model: ◦ 28×28 → (ReLU) → (400, 800, or 1200 units) → (ReLU) → (400, 800, or 1200 units) → (Softmax) → 10
Table 1. Classification error rates on MNIST (★ indicates the result used an ensemble of 5 networks):
Method | #Units/Layer | #Weights | Test Error
SGD, no regularisation (Simard et al., 2003) | 800 | 1.3m | 1.6%
SGD, dropout (Hinton et al., 2012) | — | — | ≈1.3%
SGD, dropconnect (Wan et al., 2013) | 800 | 1.3m | 1.2%★
SGD | 400 | 500k | 1.83%
SGD | 800 | 1.3m | 1.84%
SGD | 1200 | 2.4m | 1.88%
SGD, dropout | 400 | 500k | 1.51%
SGD, dropout | 800 | 1.3m | 1.33%
SGD, dropout | 1200 | 2.4m | 1.36%
Bayes by Backprop, Gaussian | 400 | 500k | 1.82%
Bayes by Backprop, Gaussian | 800 | 1.3m | 1.99%
Bayes by Backprop, Gaussian | 1200 | 2.4m | 2.04%
Bayes by Backprop, Scale mixture | 400 | 500k | 1.36%
Bayes by Backprop, Scale mixture | 800 | 1.3m | 1.34%
Bayes by Backprop, Scale mixture | 1200 | 2.4m | 1.32%
[Figure 2. Test error on MNIST as training progresses. Figure 3. Histogram of the trained weights for dropout, plain SGD, and samples from Bayes by Backprop.]
• 21. Classification on MNIST Density estimates of the weights ◦ Bayes by Backprop: weights sampled from the variational posterior ◦ Dropout: the weights used at test time ◦ Bayes by Backprop uses the greatest range of weights
• 22. Classification on MNIST The level of redundancy in a Bayes by Backprop network. ◦ Replace the variational posterior on some of the weights with a constant zero, starting from the weights with the lowest signal-to-noise ratio (|μᵢ|/σᵢ). ◦ Even when 95% of the weights are removed, the network still performs well; performance drops significantly once 98% of the weights have been removed. ◦ The density of signal-to-noise ratios has two modes, and the 75% cut-off separates these two peaks; this cut-off coincides with a drop in test error from 1.24% to 1.29%, suggesting the signal-to-noise heuristic is related to test performance. ◦ (These pruning experiments use a network from the end of training rather than the network with the lowest validation cost, hence the disparity with Table 1.) [Figure 4. Density and CDF of the signal-to-noise ratio over all weights in the network. The red line denotes the 75% cut-off.]
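A minimal sketch of the signal-to-noise pruning heuristic. As a simplification it zeroes the posterior means and uses them as point-estimate weights, whereas the paper replaces the variational posterior on the pruned weights with a constant zero.

```python
import numpy as np

def prune_by_signal_to_noise(mu, rho, drop_fraction):
    """Zero out the weights with the lowest signal-to-noise ratio |mu_i| / sigma_i.
    drop_fraction = 0.75 removes the lowest 75% of weights, as in Figure 4."""
    sigma = np.log1p(np.exp(rho))               # sigma = softplus(rho)
    snr = np.abs(mu) / sigma
    cutoff = np.quantile(snr, drop_fraction)
    return np.where(snr > cutoff, mu, 0.0)      # pruned point-estimate weights
```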
• 23. Regression curves Training data: y = x + 0.3 sin(2π(x + ε)) + 0.3 sin(4π(x + ε)) + ε, where ε ~ N(0, 0.02) ◦ In the region where there are no training data, the standard network reduces its variance to zero and commits to one particular extrapolation, even though many extrapolations are possible; Bayes by Backprop's confidence intervals diverge there, reflecting the range of possible extrapolations, so it prefers to be uncertain where there are no data. [Figure 5. Regression of noisy data with interquartile ranges. Black crosses are training samples. Red lines are median predictions. Blue/purple region is the interquartile range. Left: Bayes by Backprop neural network. Right: standard neural network.]
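A small sketch of generating this training data. The input range is an assumption read off the figure, and N(0, 0.02) is used directly as the noise scale (standard deviation) here; the source does not state whether 0.02 denotes the variance or the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 0.5, size=500)          # training inputs (range is an assumption)
eps = rng.normal(0.0, 0.02, size=x.shape)    # noise ε
y = x + 0.3 * np.sin(2 * np.pi * (x + eps)) + 0.3 * np.sin(4 * np.pi * (x + eps)) + eps
```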
• 24. Bandits on Mushroom Task UCI Mushrooms dataset ◦ Each mushroom has a set of features and is labelled as edible or poisonous. ◦ Eating an edible mushroom: a reward of 5 ◦ Eating a poisonous mushroom: a reward of −35 or 5 (each with probability 0.5) ◦ Not eating a mushroom: a reward of 0 Model: ◦ Input (context and action) → 100 units (ReLU) → 100 units (ReLU) → the expected reward Comparison approach: ◦ ε-greedy policy: with probability ε propose a uniformly random action, otherwise pick the best action according to the neural network.
• 25. Bandits on Mushroom Task Comparison of the cumulative regret of various agents. ◦ Regret: the difference between the reward achievable by an oracle and the reward received by an agent. ◦ Oracle: an edible mushroom → 5, a poisonous mushroom → 0 (it never eats poisonous mushrooms). ◦ Lower is better. ◦ The 5% ε-greedy agent appears to over-explore, and the purely greedy agent does poorly, initially electing to eat nothing; Bayes by Backprop eats mushrooms from the beginning and quickly converges to almost perfect regret. [Figure 6. Comparison of cumulative regret of various agents on the mushroom bandit: 5% greedy, 1% greedy, greedy, and Bayes by Backprop.]
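A minimal sketch of regret accounting under the slide's reward scheme. As a simplification it charges the agent the expected (rather than realised) reward of its chosen action; the function names are mine.

```python
import numpy as np

def agent_expected_reward(edible, eat):
    """Expected reward of the agent's choice: eat edible -> 5; eat poisonous -> -15
    in expectation (0.5 * 5 + 0.5 * -35); don't eat -> 0."""
    if not eat:
        return 0.0
    return 5.0 if edible else 0.5 * 5.0 + 0.5 * (-35.0)

def cumulative_regret(edibles, actions):
    """Oracle reward (5 for an edible mushroom, 0 for a poisonous one it refuses to eat)
    minus the agent's expected reward, accumulated over the interaction sequence."""
    oracle = np.array([5.0 if e else 0.0 for e in edibles])
    agent = np.array([agent_expected_reward(e, a) for e, a in zip(edibles, actions)])
    return np.cumsum(oracle - agent)

# Example: an agent that eats everything, over three mushrooms
print(cumulative_regret([True, False, True], [True, True, True]))   # -> [ 0. 15. 15.]
```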
• 26. Conclusion They introduced a new algorithm for learning neural networks with uncertainty on the weights, called Bayes by Backprop. The algorithm achieves good results in several domains. ◦ Classifying MNIST digits: performance from Bayes by Backprop is comparable to that of dropout. ◦ Non-linear regression: Bayes by Backprop allows the network to make more reasonable predictions about unseen data. ◦ Contextual bandits: Bayes by Backprop can automatically learn how to trade off exploration and exploitation.