A Model of Inductive Bias Learning
Outlines
• What is Inductive bias?
• A model of Inductive Bias Learning
• Meta-Learning Based on PAC-Bayes Theory
2
Outlines
• What is Inductive bias?
• A model of Inductive Bias Learning
• Meta-Learning Based on PAC-Bayes Theory
3
What is Inductive Bias?
• The set of assumptions that the learner uses to predict outputs
• ex:
4
[Figure: two examples of inductive bias — translation invariance in computer vision, captured by convolutional neural networks, and the distributional hypothesis in natural language processing, where word2vec maps "A dog runs." and "A cat runs." to nearby representations of "run"]
What is Inductive Bias?
• Convolutional Neural Networks:
• Learning Inductive Bias (Receptive Fields) from dataset
5
[Figure: CNN architecture — input layer, convolutional layers, and pooling layers, with the receptive fields of the convolutional layers highlighted]
What is Inductive Bias?
• Multitask / Multiclass / Multilabel Learning
6
[Figure: a shared feature extractor (the inductive bias) maps the inputs of Tasks 1–3 to task-specific features, which are fed to separate classifiers that predict the labels of each task]
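As a concrete sketch of this picture — a shared feature extractor with one classifier head per task — here is a minimal PyTorch-style model; the layer sizes, task count, and data are illustrative only, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class MultitaskModel(nn.Module):
    """Shared feature extractor (inductive bias) + one classifier head per task."""
    def __init__(self, input_dim=784, feature_dim=128, num_classes=10, num_tasks=3):
        super().__init__()
        # Shared trunk: learned jointly from all tasks, i.e. the learned inductive bias.
        self.feature_extractor = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim), nn.ReLU(),
        )
        # Task-specific heads: one classifier per task.
        self.heads = nn.ModuleList(
            [nn.Linear(feature_dim, num_classes) for _ in range(num_tasks)]
        )

    def forward(self, x, task_id):
        features = self.feature_extractor(x)   # features of the given task
        return self.heads[task_id](features)   # logits for the labels of that task

# Usage: average the per-task losses so the shared trunk is shaped by every task.
model = MultitaskModel()
criterion = nn.CrossEntropyLoss()
xs = [torch.randn(32, 784) for _ in range(3)]          # dummy inputs per task
ys = [torch.randint(0, 10, (32,)) for _ in range(3)]   # dummy labels per task
loss = sum(criterion(model(x, t), y) for t, (x, y) in enumerate(zip(xs, ys))) / 3
loss.backward()
```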
What is Inductive Bias?
• Meta-Learning / Learning-to-Learn / Transfer Learning
7
[Figure: a feature extractor (the inductive bias) is first trained on Tasks 1 and 2 (inputs → features → classifiers → labels); the same feature extractor is then reused on the target task, where only the target-task classifier is trained]
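Continuing the sketch above, transferring the learned inductive bias to a target task amounts to reusing the trained feature extractor and fitting only a fresh classifier head (again purely illustrative):

```python
# Transfer the learned inductive bias to a target task:
# freeze the shared feature extractor and train only a new classifier head.
target_head = nn.Linear(128, 10)                      # new head for the target task
for p in model.feature_extractor.parameters():
    p.requires_grad = False                           # keep the inductive bias fixed

optimizer = torch.optim.Adam(target_head.parameters(), lr=1e-3)
x_target = torch.randn(32, 784)                       # dummy target-task batch
y_target = torch.randint(0, 10, (32,))

features = model.feature_extractor(x_target)          # reuse pre-trained features
loss = criterion(target_head(features), y_target)
loss.backward()
optimizer.step()
```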
Outlines
• What is Inductive bias?
• A model of Inductive Bias Learning
• Meta-Learning Based on PAC-Bayes Theory
8
A model of Inductive Bias Learning
9
Machine Learning Theory
10
[Figure: a dataset P (a probability distribution) is sampled to give the training set z = {(x_1, y_1), ..., (x_m, y_m)}; the training algorithm selects a hypothesis h from the hypothesis space H = {h_1, h_2, ...}, and h maps the inputs x_1, ..., x_m to the predictions h(x_1), ..., h(x_m)]

Empirical loss:
$$\hat{er}_z(h) = \frac{1}{m}\sum_{i=1}^{m} l(h(x_i), y_i)$$

Expected loss:
$$er_P(h) = \int_{X \times Y} l(h(x), y)\, dP(x, y)$$
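A tiny numerical illustration of the two quantities on an invented 1-D regression problem with squared loss: the empirical loss averages over the m training samples, while the expected loss is approximated here by a large Monte Carlo sample from P.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_P(size):
    """Toy data distribution P: y = 2x + Gaussian noise."""
    x = rng.uniform(-1, 1, size)
    y = 2 * x + rng.normal(0, 0.1, size)
    return x, y

def loss(pred, y):
    return (pred - y) ** 2                  # squared loss l(h(x), y)

h = lambda x: 1.8 * x                       # some fixed hypothesis h

# Empirical loss: average over the m training samples z ~ P^m.
m = 50
x_train, y_train = sample_P(m)
emp_loss = np.mean(loss(h(x_train), y_train))

# Expected loss: integral over P, approximated by a large fresh Monte Carlo sample.
x_big, y_big = sample_P(1_000_000)
exp_loss = np.mean(loss(h(x_big), y_big))

print(f"empirical loss ~ {emp_loss:.4f}, expected loss ~ {exp_loss:.4f}")
```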
Machine Learning Theory
11
Theorem 1. Suppose z = {(x_1, y_1), ..., (x_m, y_m)} is sampled from distribution P, and let d = VCDim(H). Then with probability at least 1 - δ, all h ∈ H will satisfy:
$$er_P(h) \le \hat{er}_z(h) + \left[\frac{32}{m}\left(d \log\frac{2em}{d} + \log\frac{4}{\delta}\right)\right]^{\frac{1}{2}}$$

[Figure: the gap between er_P(h) and \hat{er}_z(h) grows with the VC dimension d; once d ≥ m the hypothesis space can shatter the training set, and the bound predicts over-fitting]
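As a quick sanity check on how this bound behaves, the complexity term can be evaluated directly; a minimal sketch with illustrative values of d, m, and δ:

```python
import numpy as np

def vc_bound_gap(d, m, delta=0.05):
    """Right-hand-side complexity term of Theorem 1."""
    return np.sqrt(32.0 / m * (d * np.log(2 * np.e * m / d) + np.log(4 / delta)))

# With few samples relative to the VC dimension the guarantee is weak,
# and it tightens as m grows (for a fixed hypothesis space).
print(vc_bound_gap(d=100, m=1_000))      # loose
print(vc_bound_gap(d=100, m=1_000_000))  # much tighter
```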
Environment of Related Tasks
12
[Figure: an environment Q (a probability distribution over tasks) is sampled n times to give the task distributions P_1, P_2, ..., P_n; each P_i is then sampled m times to give a training set z_i = {(x_i1, y_i1), ..., (x_im, y_im)} — n training tasks with m training samples per task]
Environment of Related Tasks
• Ex: ImageNet Classification
13
[Figure: the environment of ImageNet classification contains binary tasks such as "goldfish or not", "cock or not", "daisy or not", ...; sampling the environment gives a few of these tasks, and each sampled task comes with its own training set of labeled images]
Hypothesis Space Family
14
[Figure: a hypothesis space family ℍ contains hypothesis spaces H_1 = {h_11, h_12, ...}, H_2 = {h_21, h_22, ...}, ...; the training algorithm of inductive bias learning selects one hypothesis space H = {h_1, h_2, ...} from the family]
Hypothesis Space Family
15
[Figure: each hypothesis is a composition of a feature extractor f and a classifier g; the family is ℍ = {H_1, H_2, ...} with H_1 = {g_1 ∘ f_1, g_2 ∘ f_1, ...}, H_2 = {g_1 ∘ f_2, g_2 ∘ f_2, ...}; training the feature extractor selects f, and selecting f selects the hypothesis space H = {g_1 ∘ f, g_2 ∘ f, ...}]
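A toy sketch of the "selecting f selects H" idea (every function here is invented for illustration): fixing one feature extractor f induces the hypothesis space whose members are the compositions g ∘ f.

```python
# Toy feature extractors f and classifiers g (purely illustrative).
feature_extractors = {
    "f1": lambda x: abs(x),        # one candidate inductive bias
    "f2": lambda x: x ** 2,        # another candidate inductive bias
}
classifiers = {
    "g1": lambda z: int(z > 0.5),
    "g2": lambda z: int(z > 1.0),
}

def hypothesis_space(f):
    """Fixing the feature extractor f selects the space H = {g o f : g}."""
    return {name: (lambda x, g=g, f=f: g(f(x))) for name, g in classifiers.items()}

# The hypothesis space family is indexed by the choice of f.
family = {fname: hypothesis_space(f) for fname, f in feature_extractors.items()}

H = family["f1"]                      # inductive bias learning would pick f from data
print(H["g1"](-0.7), H["g2"](-0.7))   # predictions of g1 o f1 and g2 o f1
```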
Inductive Bias Learning
16
[Figure: the environment Q supplies datasets P_1, ..., P_n; the training algorithm selects a hypothesis space H = {h_1, h_2, ...} from the family ℍ, and each training set (x_i1, y_i1), ..., (x_im, y_im) is fitted by its own hypothesis h_i ∈ H, giving predictions h_i(x_i1), ..., h_i(x_im)]

Empirical loss of learning from datasets P_1, ..., P_n:
$$\hat{er}_z(H) = \frac{1}{n}\sum_{i=1}^{n} \inf_{h \in H} \hat{er}_{z_i}(h)$$

Expected loss of learning new tasks drawn from environment Q:
$$er_Q(H) = \int_{P} \inf_{h \in H} er_P(h)\, dQ(P)$$
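A small sketch of this multi-task empirical loss for a finite hypothesis space under 0-1 loss (the tasks and hypotheses are toy examples): for each task take the best hypothesis in H, then average those per-task minima.

```python
import numpy as np

rng = np.random.default_rng(1)

# A finite hypothesis space H: threshold classifiers on 1-D inputs.
H = [lambda x, t=t: (x > t).astype(int) for t in np.linspace(-1, 1, 21)]

# n toy tasks, each with its own threshold and m labeled samples.
n, m = 5, 40
tasks = []
for _ in range(n):
    true_t = rng.uniform(-0.5, 0.5)
    x = rng.uniform(-1, 1, m)
    y = (x > true_t).astype(int)
    tasks.append((x, y))

def empirical_loss(h, x, y):
    return np.mean(h(x) != y)                 # 0-1 loss on one training set z_i

# Multi-task empirical loss of the hypothesis space H:
# average over tasks of the best (infimum) per-task empirical loss within H.
er_z_H = np.mean([min(empirical_loss(h, x, y) for h in H) for x, y in tasks])
print(f"empirical loss of H over {n} tasks: {er_z_H:.3f}")
```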
Inductive Bias Learning
17
Theorem 2. Suppose z is an (n, m)-sample generated by sampling n times from distribution Q to give P_1, ..., P_n, and then sampling m times from each P_i to generate z_i = {(x_i1, y_i1), ..., (x_im, y_im)}. Let ℍ = {H} be any hypothesis space family. If (n, m) satisfies
$$n \ge \frac{256}{\epsilon^2}\,\log\frac{8\,C(\frac{\epsilon}{32}, \mathbb{H}^*)}{\delta} \quad \text{and} \quad m \ge \frac{256}{n\epsilon^2}\,\log\frac{8\,C(\frac{\epsilon}{32}, \mathbb{H}^n_l)}{\delta},$$
then with probability at least 1 - δ, all H ∈ ℍ will satisfy
$$er_Q(H) \le \hat{er}_z(H) + \epsilon$$
C: capacity (model complexity)
Meta-Learning / Learning-to-Learn / Transfer Learning
18
Pre-train:
[Figure: as in inductive bias learning, the environment Q supplies datasets P_1, ..., P_n, and the training algorithm selects a hypothesis space H = {h_1, h_2, ...} from the family ℍ by fitting the n training sets]

Train the target task:
[Figure: a new dataset P supplies a training set (x_1, y_1), ..., (x_m, y_m); a single hypothesis h is selected from the pre-trained H and gives the predictions h(x_1), ..., h(x_m)]

Expected loss:
$$er_P(h) = \int_{X \times Y} l(h(x), y)\, dP(x, y)$$

Empirical loss:
$$\hat{er}_z(h) = \frac{1}{m}\sum_{i=1}^{m} l(h(x_i), y_i)$$
Meta-Learning / Learning-to-Learn / Transfer Learning
19
Theorem 3. Let z = {(x_1, y_1), ..., (x_m, y_m)} be a training set sampled from P, and let H be a hypothesis space. For all 0 < ε, δ < 1, if m satisfies
$$m \ge \max\left\{\frac{64}{\epsilon^2}\,\log\frac{4\,C(\frac{\epsilon}{16}, H_l)}{\delta},\ \frac{16}{\epsilon^2}\right\},$$
then with probability at least 1 - δ, all h ∈ H will satisfy
$$er_P(h) \le \hat{er}_z(h) + \epsilon$$
C: capacity (model complexity)

[Figure: the hypothesis space family ℍ has high model complexity; inductive bias learning selects from it a single hypothesis space H = {h_1, h_2, ...} with low model complexity, so the capacity C(ε/16, H_l) — and hence the m required to learn the target task — stays small]
Multitask / Multiclass / Multilabel Learning
20
[Figure: the environment Q supplies datasets P_1, ..., P_n; the training algorithm selects a hypothesis space H = {h_1, h_2, ...} from the family ℍ, and each training set (x_i1, y_i1), ..., (x_im, y_im) is fitted by its own hypothesis h_i ∈ H, giving predictions h_i(x_i1), ..., h_i(x_im)]

Empirical loss:
$$\hat{er}_z(\mathbf{h}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{m}\sum_{j=1}^{m} l(h_i(x_{ij}), y_{ij})$$

Expected loss:
$$er_{\mathbf{P}}(\mathbf{h}) = \frac{1}{n}\sum_{i=1}^{n} \int_{X \times Y} l(h_i(x), y)\, dP_i(x, y)$$
Multitask / Multiclass / Multilabel Learning
21
Theorem 4. Let P = (P_1, ..., P_n) be n probability distributions, and let z be an (n, m)-sample generated by sampling m times from each P_i. Let ℍ = {H} be any hypothesis space family. If the m of each task satisfies
$$m \ge \max\left\{\frac{64}{n\epsilon^2}\,\log\frac{4\,C(\frac{\epsilon}{16}, \mathbb{H}^n_l)}{\delta},\ \frac{16}{\epsilon^2}\right\},$$
then with probability at least 1 - δ, any h = (h_1, ..., h_n) ∈ ℍ^n will satisfy
$$er_{\mathbf{P}}(\mathbf{h}) \le \hat{er}_z(\mathbf{h}) + \epsilon$$

Lemma 5. For any ℍ,
$$\log C(\epsilon, \mathbb{H}^1_l) \le \log C(\epsilon, \mathbb{H}^n_l) \le n \log C(\epsilon, \mathbb{H}^1_l)$$

• Worst case: log C(ε, ℍ^n_l) = n log C(ε, ℍ^1_l) — the required m stays the same as n increases.
• Best case: log C(ε, ℍ^n_l) = log C(ε, ℍ^1_l) — the required m decreases as O(1/n).
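A quick numerical reading of the two extremes of Lemma 5, using the capacity-dependent term of Theorem 4 (the log-capacity value is made up purely to show the scaling): with no shared structure the per-task sample requirement is roughly flat in n, while with fully shared structure it shrinks roughly as 1/n.

```python
import numpy as np

def m_required(n, log_capacity_n, eps=0.1, delta=0.05):
    """Capacity term of Theorem 4: (64 / (n*eps^2)) * (log C(eps/16, H^n) + log(4/delta))."""
    return 64.0 / (n * eps**2) * (log_capacity_n + np.log(4 / delta))

log_C1 = 50.0                                   # assumed log-capacity of a single task

for n in [1, 2, 4, 8, 16]:
    worst = m_required(n, n * log_C1)           # log C(eps, H^n) = n * log C(eps, H^1)
    best = m_required(n, log_C1)                # log C(eps, H^n) = log C(eps, H^1)
    print(f"n={n:2d}  worst-case m ~ {worst:8.0f}   best-case m ~ {best:8.0f}")
```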
Outlines
• What is Inductive bias?
• A model of Inductive Bias Learning
• Meta-Learning Based on PAC-Bayes Theory
22
Meta-Learning Based on PAC-Bayes Theory
23
ICML 2018
Generalization in Deep Learning
24
• Neural networks (deep learning models) have an extremely high VC dimension.
• Over-fitting is caused by high VC dimension (high model complexity).
• VC bound:
$$er_P(h) \le \hat{er}_z(h) + \left[\frac{32}{m}\left(d \log\frac{2em}{d} + \log\frac{4}{\delta}\right)\right]^{\frac{1}{2}}$$
• Example: a 2-layer NN with 780 inputs and 600 hidden units has d ≈ 26M, while MNIST has only n = 50,000 training samples, so d >> n.

[Figure: as in Theorem 1, the gap between er_P(h) and \hat{er}_z(h) grows with d; once d ≥ m the hypothesis space can shatter the training set and the bound predicts over-fitting]
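Plugging the slide's own m = 50,000 into the bound shows how hopeless the guarantee is at this scale; a short continuation of the vc_bound_gap sketch from the Theorem 1 slide (the intermediate values of d are purely for illustration):

```python
# Reusing vc_bound_gap from the Theorem 1 sketch, with MNIST's m = 50,000:
m = 50_000
for d in [100, 1_000, 10_000, 50_000]:
    print(f"d={d:6d}  bound gap ~ {vc_bound_gap(d, m):.2f}")
# The gap exceeds 1 (vacuous for a loss in [0, 1]) long before d reaches m;
# with d ~ 26M >> 50,000, the classical VC bound therefore says nothing here.
```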
Over-Parameterization
• Rethinking Optimization and Generalization
• [Chiyuan Zhang et al. https://arxiv.org/abs/1611.03530]
25
[Figure: a deep neural network trained on three versions of CIFAR — random pixels, the original dataset, and random labels; in every case the training error reaches \hat{er}_z(h) ≈ 0 (the network shatters the training set), while the test error er_P(h) is ≈ 0.14 on the original data (no over-fitting) and ≈ 0.9 on the randomized versions]
Over-Parameterization
• Generalization curve of an over-parameterized model
• [Mikhail Belkin et al. https://arxiv.org/abs/1812.11118]
26
[Figure: error versus d (the VC dimension) — in the classical picture \hat{er}_z(h) falls and er_P(h) rises (over-fitting) as d approaches m, yet an over-parameterized model with extremely high VC dimension lies far beyond d = m and still generalizes]
$$er_P(h) \le \hat{er}_z(h) + \left[\frac{32}{m}\left(d \log\frac{2em}{d} + \log\frac{4}{\delta}\right)\right]^{\frac{1}{2}}$$
Over-Parameterization
• PAC-Bayesian Bound [Gintare Karolina Dziugaite et al. https://arxiv.org/abs/1703.11008]
• Dataset: MNIST binary classification (class 0: digits 0~4, class 1: digits 5~9)

Model complexity of a 2-layer NN:
                       width: 600    width: 1200
  Training error         0.028         0.027
  Testing error          0.034         0.035
  VC Bound               26m           56m
  PAC-Bayesian Bound     0.161         0.179

Existence of noise in the dataset:
                       True label    Random label
  Training error         0.028         0.112
  Testing error          0.034         0.503
  VC Bound               26m           26m
  PAC-Bayesian Bound     0.161         1.352

27
• D: distribution of the dataset
• S: training samples
• P: model distribution before training
• Q: model distribution after training

$$er(Q, D) \le \hat{er}(Q, S) + \sqrt{\frac{KL(Q\|P) + \log\frac{m}{\delta}}{2(m-1)}}$$
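A small sketch of how the complexity term of this bound behaves; the KL(Q||P) values below are made up purely for illustration (in practice the KL between Gaussian weight distributions is computed in closed form):

```python
import numpy as np

def pac_bayes_gap(kl, m, delta=0.05):
    """Complexity term: sqrt((KL(Q||P) + log(m/delta)) / (2*(m-1)))."""
    return np.sqrt((kl + np.log(m / delta)) / (2 * (m - 1)))

m = 50_000                         # MNIST-sized training set
for kl in [1e2, 1e4, 1e6]:         # assumed KL(Q||P) values, for illustration only
    print(f"KL={kl:8.0f}  bound gap ~ {pac_bayes_gap(kl, m):.3f}")
# Small KL (posterior stays close to the prior) gives a non-vacuous bound;
# large KL makes the guarantee loose.
```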
PAC-Bayesian Bound
• Deterministic model / deterministic neural network: fixed parameters w, b define a single hypothesis h, e.g. y = tanh(wx + b); given an input and a label, we measure its error er(h).
• Stochastic model / stochastic neural network: a distribution over parameters — w ~ N(μ_w, σ_w), b ~ N(μ_b, σ_b) — defines a distribution Q over hypotheses; each prediction samples w, b, and the Gibbs error averages the error over that draw:
$$er(Q) = \mathbb{E}_{h \sim Q}[er(h)]$$
28
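A minimal sketch of a stochastic model of this form and a Monte Carlo estimate of its Gibbs error, using the tanh unit from the slide; all distribution parameters and data are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior Q over hypotheses: Gaussian distributions over the parameters w and b.
mu_w, sigma_w = 1.5, 0.3
mu_b, sigma_b = 0.0, 0.1

def sample_hypothesis():
    """Draw one deterministic hypothesis h(x) = tanh(w*x + b) from Q."""
    w = rng.normal(mu_w, sigma_w)
    b = rng.normal(mu_b, sigma_b)
    return lambda x: np.tanh(w * x + b)

def error(h, x, y):
    return np.mean((h(x) - y) ** 2)          # error of a single sampled hypothesis

# Toy data (invented): targets generated by tanh(1.4 x).
x = rng.uniform(-2, 2, 200)
y = np.tanh(1.4 * x)

# Gibbs error: expectation of er(h) over h ~ Q, estimated by Monte Carlo.
gibbs_error = np.mean([error(sample_hypothesis(), x, y) for _ in range(1000)])
print(f"Gibbs error er(Q) ~ {gibbs_error:.4f}")
```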
PAC-Bayesian Bound
• Training Stochastic Neural Networks
29
[Figure: the prior P is the distribution of hypotheses before training — parameters w, b sampled from N(μ_w, σ_w), N(μ_b, σ_b); the posterior Q(S, P) is the distribution after training on the data S, obtained by letting the training algorithm update μ_w, σ_w, μ_b, σ_b]
PAC-Bayesian Bound
30
$$er(Q, D) \le \hat{er}(Q, S) + \sqrt{\frac{KL(Q\|P) + \log\frac{m}{\delta}}{2(m-1)}}$$

[Figure: three fits of the training data vs. the testing data, with the prior P and posterior Q sketched for each regime]

• Over-fitting: high KL(Q||P); low \hat{er}(Q, S) but high er(Q, D); high |er(Q, D) - \hat{er}(Q, S)|.
• Under-fitting: low KL(Q||P); high \hat{er}(Q, S) and high er(Q, D); low |er(Q, D) - \hat{er}(Q, S)|.
• Appropriate fitting: moderate KL(Q||P); moderate \hat{er}(Q, S) and moderate er(Q, D); moderate |er(Q, D) - \hat{er}(Q, S)|.
Hyper-Prior and Hyper-Posterior
31
[Figure: the hyper-prior 𝒫 is a distribution over priors P_1, P_2, ...; the inductive-bias training algorithm uses the empirical multi-task error to update the hyper-posterior 𝒬 over priors]
Hyper-Prior and Hyper-Posterior
• Stochastic Neural Networks
32
[Figure: for stochastic neural networks the hyper-prior 𝒫 is a distribution over prior parameters — sampling it gives a prior P with parameters N(μ_pw, σ_pw), N(μ_pb, σ_pb), which in turn defines the network's weight distributions N(μ_w, σ_w), N(μ_b, σ_b); inductive-bias training on the observed tasks updates this hyper-level distribution into the hyper-posterior 𝒬, and for a new task from the environment a prior P sampled from 𝒬 is updated into that task's posterior by ordinary training]
Inductive Bias Learning
33
Multi-task error — empirical loss of learning from the datasets D_1, ..., D_n:
$$\hat{er}(\mathcal{Q}, S_1, ..., S_n) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{P \sim \mathcal{Q}}\; \hat{er}(Q(S_i, P), S_i)$$

Transfer error — expected loss of learning new tasks drawn from the environment τ:
$$er(\mathcal{Q}, \tau) = \mathbb{E}_{D \sim \tau}\; \mathbb{E}_{P \sim \mathcal{Q}}\; \mathbb{E}_{S \sim D}\; er(Q(S, P), D)$$

[Figure: the environment τ supplies datasets D_1, ..., D_n with training sets S_1, ..., S_n; the hyper-prior 𝒫 and hyper-posterior 𝒬 are distributions over the priors P_1, ..., P_n, and each prior is turned into a posterior Q(S_i, P_i) on its task; the training algorithm updates 𝒬]
Inductive Bias Learning
34
Theorem 2. For any hyper-posterior distribution 𝒬, the following inequality holds with probability at least 1 - δ:
$$er(\mathcal{Q}, \tau) \le \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{P \sim \mathcal{Q}}\; \hat{er}_i(Q(S_i, P), S_i) + \frac{1}{n}\sum_{i=1}^{n} \sqrt{\frac{KL(\mathcal{Q}\|\mathcal{P}) + \mathbb{E}_{P \sim \mathcal{Q}}\, KL(Q(S_i, P)\|P) + \log\frac{2nm_i}{\delta}}{2(m_i - 1)}} + \sqrt{\frac{KL(\mathcal{Q}\|\mathcal{P}) + \log\frac{2n}{\delta}}{2(n - 1)}}$$
The middle sum collects the task-complexity terms of the observed tasks, and the final square root is the environment-complexity term.
Meta Learning Algorithm
• Loss function for pre-training:
35
$$J(\theta) = \frac{1}{n}\sum_{i=1}^{n} J_i(\theta) + \Upsilon(\theta)$$
The J_i(θ) are the task-complexity terms of the observed tasks and Υ(θ) is the environment-complexity term:
$$J_i(\theta) = \mathbb{E}_{\tilde{\theta} \sim Q_\theta}\; \hat{er}_i\big(Q_i(S_i, P_{\tilde{\theta}}), S_i\big) + \sqrt{\frac{KL(Q_\theta\|\mathcal{P}) + \mathbb{E}_{\tilde{\theta} \sim Q_\theta}\, KL\big(Q(S_i, P_{\tilde{\theta}})\|P_{\tilde{\theta}}\big) + \log\frac{2nm_i}{\delta}}{2(m_i - 1)}}$$
$$\tilde{\theta} = \theta + \epsilon_P, \quad \text{where } \epsilon_P \sim N(0, \sigma_Q^2 I)$$
$$\Upsilon(\theta) = \sqrt{\frac{KL(Q_\theta\|\mathcal{P}) + \log\frac{2n}{\delta}}{2(n - 1)}}$$
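A rough sketch of the structure of this objective — average task terms plus the environment term. The error and KL values below are placeholders; in the actual algorithm they are computed from the Gaussian weight distributions of the hyper-posterior and the per-task posteriors, and the whole expression is minimized over θ.

```python
import numpy as np

def task_term(task_err, kl_hyper, kl_task, n, m_i, delta=0.05):
    """J_i(theta): task empirical error + its PAC-Bayes complexity term."""
    penalty = np.sqrt((kl_hyper + kl_task + np.log(2 * n * m_i / delta)) / (2 * (m_i - 1)))
    return task_err + penalty

def env_term(kl_hyper, n, delta=0.05):
    """Upsilon(theta): environment-complexity term."""
    return np.sqrt((kl_hyper + np.log(2 * n / delta)) / (2 * (n - 1)))

# Placeholder values standing in for quantities computed from the stochastic networks.
n, m_i = 10, 600
kl_hyper = 5.0                                    # KL(Q_theta || P), hyper level
task_errs = np.full(n, 0.08)                      # empirical Gibbs errors per task
kl_tasks = np.full(n, 40.0)                       # KL(Q(S_i, P) || P) per task

J = np.mean([task_term(e, kl_hyper, k, n, m_i) for e, k in zip(task_errs, kl_tasks)]) \
    + env_term(kl_hyper, n)
print(f"meta-objective J(theta) ~ {J:.3f}")
```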
Meta Learning Algorithm
36
[Figure: meta-training — the hyper-posterior Q_θ provides a shared prior θ; each observed task S_1, ..., S_n from the environment τ trains its own posterior φ_1, ..., φ_n starting from that prior]
Meta Learning Algorithm
37
[Figure: meta-testing — for a new task S from the environment τ, the prior θ given by the hyper-posterior Q_θ is adapted into a new posterior φ']
Meta Learning Algorithm Example
38
Experiments
• Model: CNN with two convolutional layers
• Task Environments: Permuted Labels & Permuted Pixels
39
[Figure: the "Permuted Labels" environment — Task 1 and Task 2 use the same images, but the labels 0, 1, 2 are permuted between the tasks]
Results
• Test error on new tasks
40
[Figure: test error on new tasks — models trained from scratch vs. models trained by the proposed meta-learning algorithm]
Further Reading
• Understanding deep learning requires rethinking generalization
https://arxiv.org/abs/1611.03530
• Reconciling modern machine learning practice and the bias-variance trade-off https://arxiv.org/abs/1812.11118
• Computing Nonvacuous Generalization Bounds for Deep (Stochastic)
Neural Networks with Many More Parameters than Training Data
https://arxiv.org/abs/1703.11008
• A Model of Inductive Bias Learning https://arxiv.org/abs/1106.0245
• Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory
https://arxiv.org/abs/1711.01244
41
About the Speaker
Mark Chang
• Facebook: https://www.facebook.com/ckmarkoh.chang
• IG: https://www.instagram.com/markchang95/
• YouTube: https://www.youtube.com/channel/UCckNPGDL21aznRhl3EijRQw
42
