2. Outline
• What is Inductive Bias?
• A Model of Inductive Bias Learning
• Meta-Learning Based on PAC-Bayes Theory
4. What is Inductive Bias?
• The set of assumptions that the learner uses to predict outputs.
• Examples:
  – Translation invariance in computer vision → convolutional neural networks.
  – The distributional hypothesis in natural language processing → word2vec.
[Figure: a convolutional neural network recognizing the same input at different image positions; word2vec mapping the sentences "A dog runs." and "A cat runs." to nearby representations of "run".]
6. What is Inductive Bias?
• Multitask / Multiclass / Multilabel Learning
[Figure: the inputs of Tasks 1, 2, and 3 pass through a shared feature extractor (the inductive bias) to produce per-task features; per-task classifiers then map those features to the labels of each task.]
7. What is Inductive Bias?
• Meta-Learning / Learning-to-Learn / Transfer Learning
[Figure: the feature extractor (inductive bias) is trained on Tasks 1 and 2 (input → features → classifier → labels); the same feature extractor is then reused on the target task, where only the target classifier is trained.]
8. Outline
• What is Inductive Bias?
• A Model of Inductive Bias Learning
• Meta-Learning Based on PAC-Bayes Theory
10. Machine Learning Theory
• A training set is sampled from dataset P (a probability distribution):
  z = {(x_1, y_1), ..., (x_m, y_m)}
• The training algorithm selects a hypothesis h from the hypothesis space H = {h_1, h_2, ...}; h produces predictions h(x_1), ..., h(x_m).
• Empirical loss:
  \hat{er}_z(h) = \frac{1}{m} \sum_{i=1}^{m} l(h(x_i), y_i)
• Expected loss:
  er_P(h) = \int_{X \times Y} l(h(x), y) \, dP(x, y)
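The two losses above can be made concrete with a small numeric sketch (the data distribution, hypothesis, and 0-1 loss here are toy assumptions, not from the slides): the empirical loss averages over a finite sample, while the expected loss is approximated by Monte Carlo.

```python
import numpy as np

def empirical_loss(h, z, loss):
    """Empirical loss: average of loss(h(x_i), y_i) over the training set z."""
    return float(np.mean([loss(h(x), y) for x, y in z]))

def expected_loss_mc(h, sample_P, loss, n_draws=100_000, seed=0):
    """Monte Carlo estimate of the expected loss er_P(h),
    using a sampler `sample_P` that draws (x, y) pairs from P."""
    rng = np.random.default_rng(seed)
    return float(np.mean([loss(h(x), y)
                          for x, y in (sample_P(rng) for _ in range(n_draws))]))

# Toy setup: P is x ~ Uniform(-1, 1) with y = sign(x); h thresholds at 0.1.
def sample_P(rng):
    x = rng.uniform(-1, 1)
    return x, np.sign(x)

h = lambda x: np.sign(x - 0.1)          # a slightly biased hypothesis
zero_one = lambda yhat, y: float(yhat != y)

rng = np.random.default_rng(1)
z = [sample_P(rng) for _ in range(50)]  # m = 50 training samples
print(empirical_loss(h, z, zero_one))            # \hat{er}_z(h)
print(expected_loss_mc(h, sample_P, zero_one))   # ≈ er_P(h) ≈ 0.05
```

Here the true expected loss is P(0 < x < 0.1) = 0.05; the empirical loss on only 50 samples fluctuates around it.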
11. Machine Learning Theory
Theorem 1. Suppose z = {(x_1, y_1), ..., (x_m, y_m)} is sampled from distribution P.
Let d = VCDim(H). Then with probability at least 1 - \delta, all h \in H will satisfy:

  er_P(h) \le \hat{er}_z(h) + \left[ \frac{32}{m} \left( d \log\frac{2em}{d} + \log\frac{4}{\delta} \right) \right]^{1/2}

(d: VC dimension. When d \ge m, H can shatter the training set: \hat{er}_z(h) can be driven to zero while er_P(h) stays high, i.e. over-fitting.)
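Plugging numbers into the bound shows when it is informative; a minimal sketch (the d and m values are illustrative assumptions):

```python
import math

def vc_bound_term(d, m, delta=0.05):
    """Complexity term of the VC bound:
    [ 32/m * ( d*log(2*e*m/d) + log(4/delta) ) ]^(1/2).
    Only meaningful when m is large relative to d (otherwise the log can go negative)."""
    return math.sqrt(32.0 / m * (d * math.log(2 * math.e * m / d)
                                 + math.log(4 / delta)))

# Small VC dimension, plenty of data: the generalization-gap bound is small.
print(vc_bound_term(d=10, m=100_000))    # ~0.19
# d = 1,000 with m = 50,000 (MNIST-sized): already > 1, vacuous for 0-1 loss.
print(vc_bound_term(d=1_000, m=50_000))  # ~1.9
```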
12. Environment of Related Tasks
• Environment Q (a probability distribution over tasks).
• Sample n training tasks from Q: datasets P_1, P_2, ..., P_n.
• Sample m training samples per task from each P_i:
  z_i = {(x_{i1}, y_{i1}), ..., (x_{im}, y_{im})}, for i = 1, ..., n.
13. Environment of Related Tasks
• Ex: ImageNet classification. The environment is the set of binary classification tasks ("goldfish or not", "cock or not", "daisy or not", ...); each sampled task comes with its own training set of (image, label) pairs.
[Figure: example training sets of labeled images for each sampled task.]
14. Hypothesis Space Family
• Hypothesis space family \mathbb{H} = {H_1, H_2, ...}, where each hypothesis space H_i = {h_{i1}, h_{i2}, ...}.
• The training algorithm of inductive bias learning selects a hypothesis space H = {h_1, h_2, ...} from \mathbb{H}.
15. Hypothesis Space Family
• Each hypothesis is a classifier g composed with a feature extractor f: input → features → labels.
• Fixing a feature extractor f_i fixes a hypothesis space:
  H_1 = {g_1 ∘ f_1, g_2 ∘ f_1, ...}, H_2 = {g_1 ∘ f_2, g_2 ∘ f_2, ...}, ...
• Training the feature extractor ⇒ selecting f ⇒ selecting H = {g_1 ∘ f, g_2 ∘ f, ...}.
16. Inductive Bias Learning
• The training algorithm selects H = {h_1, h_2, ...} from the family \mathbb{H}, using training sets z_1, ..., z_n drawn from datasets P_1, ..., P_n sampled from environment Q.
• Empirical loss of learning from datasets P_1, ..., P_n:

  \hat{er}_z(H) = \frac{1}{n} \sum_{i=1}^{n} \inf_{h \in H} \hat{er}_{z_i}(h)

• Expected loss of learning new tasks drawn from environment Q:

  er_Q(H) = \int_P \inf_{h \in H} er_P(h) \, dQ(P)
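The empirical loss of a hypothesis space takes, for each task, the best hypothesis in H and then averages across tasks; a minimal sketch with a finite H of threshold classifiers (all data and hypotheses here are toy assumptions):

```python
import numpy as np

def family_empirical_loss(H, tasks, loss):
    """Empirical loss of a hypothesis space H on n tasks:
    er_hat_z(H) = (1/n) * sum_i  min_{h in H} er_hat_{z_i}(h).
    H is a finite list of hypotheses; tasks is a list of training sets z_i."""
    per_task = [min(np.mean([loss(h(x), y) for x, y in z]) for h in H)
                for z in tasks]
    return float(np.mean(per_task))

# Toy example: thresholds as hypotheses, two related 1-D tasks.
H = [lambda x, t=t: float(x > t) for t in (-0.5, 0.0, 0.5)]
zero_one = lambda yhat, y: float(yhat != y)
task1 = [(-1.0, 0.0), (-0.2, 0.0), (0.3, 1.0), (0.9, 1.0)]  # threshold near 0
task2 = [(-0.8, 0.0), (0.1, 1.0), (0.6, 1.0), (0.7, 1.0)]   # threshold near 0
print(family_empirical_loss(H, [task1, task2], zero_one))   # 0.0
```

Because both tasks share the threshold-at-0 structure, a single hypothesis space containing that threshold achieves zero loss on both.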
17. Inductive Bias Learning
Theorem 2. Suppose z is an (n, m)-sample generated by sampling n times from distribution Q to give P_1, ..., P_n, and then sampling m times from each P_i to generate z_i = {(x_{i1}, y_{i1}), ..., (x_{im}, y_{im})}.
Let \mathbb{H} = {H} be any hypothesis space family. If (n, m) satisfies

  n \ge \max\left\{ \frac{256}{\epsilon^2} \log \frac{8 C(\frac{\epsilon}{32}, \mathbb{H}^*)}{\delta}, \frac{64}{\epsilon^2} \right\}, and
  m \ge \max\left\{ \frac{256}{n \epsilon^2} \log \frac{8 C(\frac{\epsilon}{32}, H_l^n)}{\delta}, \frac{64}{n \epsilon^2} \right\},

then with probability at least 1 - \delta, all H \in \mathbb{H} will satisfy

  er_Q(H) \le \hat{er}_z(H) + \epsilon

(C: capacity, a measure of model complexity.)
18. Meta-Learning / Learning-to-Learn / Transfer Learning
• Pre-train: the training algorithm selects H = {h_1, h_2, ...} from the family \mathbb{H}, using training sets drawn from datasets P_1, ..., P_n sampled from environment Q.
• Train target task: with H fixed, learn h \in H from a training set (x_1, y_1), ..., (x_m, y_m) drawn from the target dataset P.
• Empirical loss: \hat{er}_z(h) = \frac{1}{m} \sum_{i=1}^{m} l(h(x_i), y_i)
• Expected loss: er_P(h) = \int_{X \times Y} l(h(x), y) \, dP(x, y)
19. Meta-Learning / Learning-to-Learn / Transfer Learning
Theorem 3. Let z = {(x_1, y_1), ..., (x_m, y_m)} be a training set sampled from P, and let H be a hypothesis space. For all 0 < \epsilon, \delta < 1, if m satisfies

  m \ge \max\left\{ \frac{64}{\epsilon^2} \log \frac{4 C(\frac{\epsilon}{16}, H_l)}{\delta}, \frac{16}{\epsilon^2} \right\},

then with probability at least 1 - \delta, all h \in H will satisfy

  er_P(h) \le \hat{er}_z(h) + \epsilon

(C: capacity, a measure of model complexity.)
• Inductive bias learning selects, from a family \mathbb{H} with high model complexity, a hypothesis space H = {h_1, h_2, ...} with low model complexity, so the target task needs fewer samples m.
20. Multitask / Multiclass / Multilabel Learning
• The training algorithm selects H from the family \mathbb{H}; a tuple of hypotheses h = (h_1, ..., h_n), one per task, is learned jointly on training sets drawn from datasets P_1, ..., P_n sampled from environment Q.
• Empirical loss:

  \hat{er}_z(h) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} l(h_i(x_{ij}), y_{ij})

• Expected loss:

  er_P(h) = \frac{1}{n} \sum_{i=1}^{n} \int_{X \times Y} l(h_i(x), y) \, dP_i(x, y)
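The multitask empirical loss averages the per-task losses of a tuple of hypotheses; a minimal toy sketch (hypotheses and data are illustrative assumptions):

```python
import numpy as np

def multitask_empirical_loss(hs, tasks, loss):
    """Multitask empirical loss: average over n tasks of the per-task
    empirical loss of that task's hypothesis h_i."""
    assert len(hs) == len(tasks)
    return float(np.mean([
        np.mean([loss(h(x), y) for x, y in z]) for h, z in zip(hs, tasks)
    ]))

# Toy example: two tasks, each with its own threshold classifier.
zero_one = lambda yhat, y: float(yhat != y)
h1 = lambda x: float(x > 0.0)
h2 = lambda x: float(x > 0.5)
task1 = [(-0.3, 0.0), (0.2, 1.0)]
task2 = [(0.1, 0.0), (0.8, 1.0)]
print(multitask_empirical_loss([h1, h2], [task1, task2], zero_one))  # 0.0
```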
21. Multitask / Multiclass / Multilabel Learning
Theorem 4. Let P = (P_1, ..., P_n) be n probability distributions, and let z be an (n, m)-sample generated by sampling m times from each P_i.
Let \mathbb{H} = {H} be any hypothesis space family. If the m of each task satisfies

  m \ge \max\left\{ \frac{64}{n \epsilon^2} \log \frac{4 C(\frac{\epsilon}{16}, H_l^n)}{\delta}, \frac{16}{\epsilon^2} \right\},

then with probability at least 1 - \delta, any h = (h_1, ..., h_n) \in H^n will satisfy

  er_P(h) \le \hat{er}_z(h) + \epsilon

Lemma 5. For any H,

  \log C(\epsilon, H_l^1) \le \log C(\epsilon, H_l^n) \le n \log C(\epsilon, H_l^1)

• Worst case: \log C(\epsilon, H_l^n) = n \log C(\epsilon, H_l^1), so m stays the same as n increases (no benefit from related tasks).
• Best case: \log C(\epsilon, H_l^n) = \log C(\epsilon, H_l^1), so m decreases as O(1/n).
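The two extremes of Lemma 5 can be seen by plugging capacities into the bound on m; a minimal sketch (the per-task log-capacity `logC1` and the \epsilon, \delta values are illustrative assumptions):

```python
import math

def required_m(n, log_capacity_n, eps=0.1, delta=0.05):
    """Per-task sample-size bound of the form
    m >= max{ 64/(n*eps^2) * log(4*C/delta), 16/eps^2 },
    where log C of the n-task hypothesis space is given directly."""
    return max(64.0 / (n * eps**2) * (math.log(4.0 / delta) + log_capacity_n),
               16.0 / eps**2)

logC1 = 50.0  # hypothetical per-task log-capacity
for n in (1, 10, 100):
    worst = required_m(n, n * logC1)  # logC(H^n) = n * logC(H^1)
    best = required_m(n, logC1)       # logC(H^n) = logC(H^1)
    print(n, round(worst), round(best))
# In the worst case m stays roughly constant; in the best case m shrinks ~ 1/n.
```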
22. Outline
• What is Inductive Bias?
• A Model of Inductive Bias Learning
• Meta-Learning Based on PAC-Bayes Theory
24. Generalization in Deep Learning
• Neural networks (deep learning models) have extremely high VC dimension.
• In the classical view, over-fitting is caused by high VC dimension (high model complexity).
• VC bound:

  er_P(h) \le \hat{er}_z(h) + \left[ \frac{32}{m} \left( d \log\frac{2em}{d} + \log\frac{4}{\delta} \right) \right]^{1/2}

• Example: a 2-layer NN with input 780 and hidden width 600 has d ≈ 26M, while MNIST has m = 50,000 training samples. Since d ≫ m, the network can shatter the training set and the bound predicts over-fitting.
26. Over-Parameterization
• Generalization curve of over-parameterized models [Mikhail Belkin et al., https://arxiv.org/abs/1812.11118]
[Figure: test error er_P(h) and training error \hat{er}_z(h) vs. d (VC dimension). The classical curve over-fits near d = m, yet over-parameterized models with extremely high VC dimension still generalize, which the VC bound cannot explain.]

  er_P(h) \le \hat{er}_z(h) + \left[ \frac{32}{m} \left( d \log\frac{2em}{d} + \log\frac{4}{\delta} \right) \right]^{1/2}
27. Over-Parameterization
• PAC-Bayesian bound [Gintare Karolina Dziugaite et al., https://arxiv.org/abs/1703.11008]
• Dataset: MNIST, binary classification (class 0: digits 0-4; class 1: digits 5-9).

Model complexity of a 2-layer NN:
                       width 600   width 1200
  Training error       0.028       0.027
  Testing error        0.034       0.035
  VC bound             26M         56M
  PAC-Bayesian bound   0.161       0.179

Existence of noise in the dataset:
                       True label  Random label
  Training error       0.028       0.112
  Testing error        0.034       0.503
  VC bound             26M         26M
  PAC-Bayesian bound   0.161       1.352

• D: distribution of dataset; S: training samples; P: model distribution before training; Q: model distribution after training.

  er(Q, D) \le \hat{er}(Q, S) + \sqrt{ \frac{ KL(Q \| P) + \log\frac{m}{\delta} }{ 2(m - 1) } }
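The PAC-Bayesian bound can be evaluated in closed form when P and Q are diagonal Gaussians over the weights; a minimal sketch (the dimension and parameter values are hypothetical):

```python
import math
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q||P) between diagonal Gaussians N(mu_q, diag(sigma_q^2))
    and N(mu_p, diag(sigma_p^2)), summed over dimensions."""
    return float(np.sum(np.log(sigma_p / sigma_q)
                        + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
                        - 0.5))

def pac_bayes_bound(train_err, kl, m, delta=0.05):
    """er(Q, D) <= er_hat(Q, S) + sqrt((KL(Q||P) + log(m/delta)) / (2(m-1)))."""
    return train_err + math.sqrt((kl + math.log(m / delta)) / (2 * (m - 1)))

# Hypothetical posterior that moved only slightly away from the prior:
d = 1000                                  # number of weights
mu_p, sigma_p = np.zeros(d), np.ones(d)
mu_q, sigma_q = 0.05 * np.ones(d), 0.9 * np.ones(d)
kl = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
print(pac_bayes_bound(train_err=0.03, kl=kl, m=50_000))  # nonvacuous, < 0.1
```

A small KL keeps the complexity term small, which is why the bound can stay nonvacuous even for heavily parameterized networks.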
28. PAC-Bayesian Bound
• Deterministic neural network: fixed parameters w, b; the hypothesis h computes y = tanh(wx + b), and its error er(h) is measured against the label.
• Stochastic neural network: parameters are sampled from distributions, w ~ N(\mu_w, \sigma_w), b ~ N(\mu_b, \sigma_b); each sample of (w, b) gives a hypothesis h, so the network defines a distribution Q over hypotheses.
• Gibbs error: er(Q) = E_{h \sim Q}[er(h)]
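The Gibbs error of a stochastic unit can be estimated by sampling hypotheses; a minimal sketch for the tanh unit above, using mean squared error as the per-hypothesis error (a toy choice, not from the slides):

```python
import numpy as np

def gibbs_error(mu_w, sigma_w, mu_b, sigma_b, data, n_samples=10_000, seed=0):
    """Monte Carlo estimate of er(Q) = E_{h~Q}[er(h)] for the 1-D stochastic
    unit y = tanh(w*x + b), with w ~ N(mu_w, sigma_w^2), b ~ N(mu_b, sigma_b^2).
    Error of one sampled h: mean squared error on `data`."""
    rng = np.random.default_rng(seed)
    x = np.array([p[0] for p in data])
    y = np.array([p[1] for p in data])
    errs = []
    for _ in range(n_samples):
        w = rng.normal(mu_w, sigma_w)
        b = rng.normal(mu_b, sigma_b)
        errs.append(np.mean((np.tanh(w * x + b) - y) ** 2))
    return float(np.mean(errs))

# Data generated by the mean hypothesis (w=1, b=0):
data = [(x, np.tanh(x)) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
print(gibbs_error(1.0, 0.05, 0.0, 0.05, data))  # tight Q: small Gibbs error
print(gibbs_error(1.0, 1.0, 0.0, 1.0, data))    # wide Q: larger Gibbs error
```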
29. PAC-Bayesian Bound
• Training stochastic neural networks:
  – Prior P: distribution of hypotheses before training; sample w, b from N(\mu_w, \sigma_w), N(\mu_b, \sigma_b).
  – Posterior Q(S, P): distribution of hypotheses after training on training data S.
  – Training algorithm: update \mu_w, \sigma_w, \mu_b, \sigma_b.
30. PAC-Bayesian Bound

  er(Q, D) \le \hat{er}(Q, S) + \sqrt{ \frac{ KL(Q \| P) + \log\frac{m}{\delta} }{ 2(m - 1) } }

• Over-fitting: high KL(Q||P); low \hat{er}(Q, S) but high er(Q, D); high |er(Q, D) - \hat{er}(Q, S)|.
• Under-fitting: low KL(Q||P); high \hat{er}(Q, S) and high er(Q, D); low |er(Q, D) - \hat{er}(Q, S)|.
• Appropriate fitting: moderate KL(Q||P); moderate \hat{er}(Q, S) and moderate er(Q, D); moderate |er(Q, D) - \hat{er}(Q, S)|.
[Figure: fits to training and testing data in the three regimes, with the positions of P and Q shown for each.]
32. Hyper-Prior and Hyper-Posterior
• Stochastic neural networks.
• Hyper-prior \mathcal{P}: a distribution over priors. Sample a prior P = (N(\mu_{pw}, \sigma_{pw}), N(\mu_{pb}, \sigma_{pb})) from the hyper-prior; the network weights w, b are then sampled from N(\mu_w, \sigma_w), N(\mu_b, \sigma_b).
• Inductive-bias training: update the hyper-prior using tasks from the environment.
• New task from the environment: sample a prior P from the trained hyper-prior, then train on the new task by updating the prior into a posterior Q.
33. Inductive Bias Learning
• Environment \tau; datasets D_1, ..., D_n; training sets S_1, ..., S_n; hyper-prior \mathcal{P}; hyper-posterior \mathcal{Q}; priors P_1, ..., P_n; posteriors Q(S_1, P_1), ..., Q(S_n, P_n). Training algorithm: update \mathcal{Q}.
• Multi-task error, the empirical loss of learning from datasets D_1, ..., D_n:

  \hat{er}(\mathcal{Q}, S_1, ..., S_n) = \frac{1}{n} \sum_{i=1}^{n} E_{P \sim \mathcal{Q}} \, \hat{er}(Q(S_i, P), S_i)

• Transfer error, the expected loss of learning new tasks drawn from environment \tau:

  er(\mathcal{Q}, \tau) = E_{D \sim \tau} \, E_{P \sim \mathcal{Q}} \, E_{S \sim D} \, er(Q(S, P), D)
34. Inductive Bias Learning
Theorem 2. For any hyper-posterior distribution \mathcal{Q}, the following inequality holds with probability at least 1 - \delta:

  er(\mathcal{Q}, \tau) \le \frac{1}{n} \sum_{i=1}^{n} E_{P \sim \mathcal{Q}} \, \hat{er}_i(Q(S_i, P), S_i)
    + \frac{1}{n} \sum_{i=1}^{n} \sqrt{ \frac{ KL(\mathcal{Q} \| \mathcal{P}) + E_{P \sim \mathcal{Q}} KL(Q(S_i, P) \| P) + \log\frac{2 n m_i}{\delta} }{ 2(m_i - 1) } }
    + \sqrt{ \frac{ KL(\mathcal{Q} \| \mathcal{P}) + \log\frac{2n}{\delta} }{ 2(n - 1) } }

• The middle sum: task-complexity terms of the observed tasks.
• The last term: environment-complexity term.
35. Meta Learning Algorithm
• Loss function for pre-training:

  J(\theta) = \frac{1}{n} \sum_{i=1}^{n} J_i(\theta) + \Upsilon(\theta)

• Task-complexity terms of the observed tasks:

  J_i(\theta) = E_{\tilde\theta \sim Q_\theta} \, \hat{er}_i(Q_i(S_i, P_{\tilde\theta}), S_i)
    + \sqrt{ \frac{ KL(Q_\theta \| \mathcal{P}) + E_{\tilde\theta \sim Q_\theta} KL(Q(S_i, P_{\tilde\theta}) \| P_{\tilde\theta}) + \log\frac{2 n m_i}{\delta} }{ 2(m_i - 1) } },

  where \tilde\theta = \theta + \epsilon_P and \epsilon_P \sim N(0, \sigma_Q^2 I).

• Environment-complexity term:

  \Upsilon(\theta) = \sqrt{ \frac{ KL(Q_\theta \| \mathcal{P}) + \log\frac{2n}{\delta} }{ 2(n - 1) } }
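Assembling the pre-training objective from per-task and environment terms can be sketched numerically (all error and KL values below are hypothetical; in practice they come from the stochastic networks):

```python
import math

def meta_objective(task_errs, task_kls, hyper_kl, n, ms, delta=0.05):
    """Pre-training objective J = (1/n) * sum_i J_i + Upsilon, where each
    J_i = err_i + sqrt((hyper_kl + kl_i + log(2*n*m_i/delta)) / (2*(m_i - 1)))
    and Upsilon = sqrt((hyper_kl + log(2*n/delta)) / (2*(n - 1)))."""
    J_tasks = [
        err + math.sqrt((hyper_kl + kl + math.log(2 * n * m / delta))
                        / (2 * (m - 1)))
        for err, kl, m in zip(task_errs, task_kls, ms)
    ]
    upsilon = math.sqrt((hyper_kl + math.log(2 * n / delta)) / (2 * (n - 1)))
    return sum(J_tasks) / n + upsilon

# Hypothetical values for 3 observed tasks with 1000 samples each:
print(meta_objective(task_errs=[0.05, 0.08, 0.06],
                     task_kls=[12.0, 15.0, 9.0],
                     hyper_kl=4.0, n=3, ms=[1000, 1000, 1000]))
```

Minimizing this objective trades off empirical multi-task error against the task-complexity and environment-complexity penalties.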
40. Results
• Test error on new tasks.
[Figure: test error of a model trained from scratch vs. a model trained by the proposed meta-learning algorithm.]
41. Further Reading
• Understanding deep learning requires rethinking generalization
https://arxiv.org/abs/1611.03530
• Reconciling modern machine learning practice and the bias-variance trade-off https://arxiv.org/abs/1812.11118
• Computing Nonvacuous Generalization Bounds for Deep (Stochastic)
Neural Networks with Many More Parameters than Training Data
https://arxiv.org/abs/1703.11008
• A Model of Inductive Bias Learning https://arxiv.org/abs/1106.0245
• Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory
https://arxiv.org/abs/1711.01244
42. About the Speaker
Mark Chang
• Facebook: https://www.facebook.com/ckmarkoh.chang
• IG: https://www.instagram.com/markchang95/
• YouTube: https://www.youtube.com/channel/UCckNPGDL21aznRhl3EijRQw