5. VC Bound
• Over-fitting is caused by high VC Dimension
• For a given dataset (n is constant), search for the best VC Dimension
[Figure: error vs. VC Dimension d (model complexity). Training error decreases as d grows while testing error rises again past the best VC Dimension; at d = n the model can shatter the training set. n = number of training instances.]
• With probability 1-δ, the VC Bound holds:

  $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$

  where $\hat{\epsilon}(h)$ is the training error and $\epsilon(h)$ is the testing error.
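The bound term can be evaluated numerically. A minimal sketch (not from the slides; the function name vc_bound_term and the chosen n, d, δ are assumptions):

```python
# Evaluate the VC-bound gap sqrt( (8/n) * log( 4*(2n)^d / delta ) ) for a
# fixed training-set size n and increasing VC Dimension d.
import math

def vc_bound_term(n: int, d: int, delta: float = 0.05) -> float:
    # log(4*(2n)^d / delta) expanded to avoid overflow for large d
    return math.sqrt(8.0 / n * (math.log(4.0 / delta) + d * math.log(2 * n)))

n = 50_000                      # assumed training-set size
for d in (10, 1_000, 100_000):  # assumed VC Dimensions
    print(d, round(vc_bound_term(n, d), 3))
```

For fixed n the gap grows with d, which is the over-fitting side of the trade-off sketched in the figure.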
6. VC Dimension
• VC Dimension of linear model:
• O(W)
• W = number of parameters
• VC Dimension of fully-connected
neural networks:
• O(LW log W)
• L = number of layers
• W = number of parameters
• VC Dimension is independent of the data distribution; it depends only on the model
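As a rough illustration of the scaling above, a sketch that counts the parameters W of an assumed fully-connected architecture and evaluates the O(W) and O(LW log W) expressions with the constants dropped (the layer sizes are placeholders):

```python
# Count parameters of a fully-connected network and compare the two
# VC-Dimension scalings O(W) and O(L * W * log W) (constants dropped).
import math

def count_parameters(layer_sizes):
    # W = sum over consecutive layers of (fan_in + 1) * fan_out, biases included
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))

layers = [784, 256, 256, 10]   # assumed MLP architecture
W = count_parameters(layers)
L = len(layers) - 1            # number of weight layers
print(f"W = {W}, O(W) ~ {W}, O(L*W*log W) ~ {L * W * math.log(W):.0f}")
```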
11. Generalization in Deep Learning
[Figure: deep neural networks (Inception) are trained on three versions of the same data: the original dataset (CIFAR) with its true feature:label pairs, the same features with random labels, and random noise features. The networks shatter all three.]
• Training error $\hat{\epsilon}(h) \approx 0$ in all three cases.
12. Generalization in Deep Learning
[Figure: the same three training sets: original dataset (CIFAR), random labels, and random noise features, fit by deep neural networks (Inception).]
• Training error $\hat{\epsilon}(h) \approx 0$ in all three cases, but testing error differs: $\epsilon(h) \approx 0.14$ on the original dataset, $\epsilon(h) \approx 0.9$ with random labels, and $\epsilon(h) \approx 0.9$ with random noise features (a construction sketch follows below).
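A minimal sketch (synthetic stand-in arrays, not the slides' actual CIFAR setup) of how the three training-set variants in the figure can be constructed; shapes and the random seed are assumptions:

```python
# Build the three variants compared above: original labels, random labels,
# and random-noise features. A large enough network can reach training
# error ~ 0 on all three, but only the first generalizes.
import numpy as np

rng = np.random.default_rng(0)
n, d, num_classes = 1000, 3072, 10             # placeholder sizes
X = rng.random((n, d)).astype(np.float32)      # stand-in for CIFAR images
y = rng.integers(0, num_classes, size=n)       # stand-in for the true labels

original        = (X, y)
random_labels   = (X, rng.integers(0, num_classes, size=n))          # labels independent of X
random_features = (rng.normal(size=(n, d)).astype(np.float32), y)    # pure-noise inputs
```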
13. Generalization in Deep Learning
• Testing error depends on the data distribution: $\epsilon(h) \approx 0.14$ on the original dataset but $\epsilon(h) \approx 0.9$ on random labels and random noise features.
• However, the VC Bound does not depend on the data distribution.
19. PAC-Bayesian Bound for Deep Learning
• With probability 1-δ, the following inequality (PAC-Bayesian Bound) is satisfied:

  $KL\big(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)\big) \le \frac{KL(Q\|P) + \log\frac{n}{\delta}}{n-1}$

  where n is the number of training instances, P is the distribution of models before training (prior), Q is the distribution of models after training (posterior), $KL(Q\|P)$ is the KL divergence between Q and P, and the left-hand side is the KL divergence between training error $\hat{\epsilon}(Q)$ and testing error $\epsilon(Q)$.
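The bound can be turned into a numeric limit on the testing error by finding the largest ε(Q) consistent with the inequality. A minimal sketch (synthetic values for ε̂(Q), KL(Q‖P), n, δ; the helper names are assumptions):

```python
# Compute B = (KL(Q||P) + log(n/delta)) / (n - 1), then find the largest
# epsilon with kl(eps_hat || epsilon) <= B by bisection.
import math

def kl_bernoulli(p: float, q: float) -> float:
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def invert_kl(eps_hat: float, bound: float) -> float:
    # largest eps in [eps_hat, 1) such that kl(eps_hat || eps) <= bound
    lo, hi = eps_hat, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if kl_bernoulli(eps_hat, mid) <= bound else (lo, mid)
    return lo

n, delta = 50_000, 0.05
kl_qp = 5_000.0                                # assumed KL(Q||P) in nats
B = (kl_qp + math.log(n / delta)) / (n - 1)
print(invert_kl(eps_hat=0.01, bound=B))        # upper bound on epsilon(Q)
```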
20. PAC-Bayesian Bound for Deep Learning

  $\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{KL(Q\|P) + \log\frac{n}{\delta} + 2}{2n-1}}$

[Figure: the same training data and testing data fit in three regimes, showing the prior P and the posterior Q in each case.]
• Low $KL(Q\|P)$: under-fitting; the bound term is low and $KL(\hat{\epsilon}(Q)\,\|\,\epsilon(Q))$ is low.
• Moderate $KL(Q\|P)$: appropriate fitting; the bound term is moderate and $KL(\hat{\epsilon}(Q)\,\|\,\epsilon(Q))$ is moderate.
• High $KL(Q\|P)$: over-fitting; the bound term is high and $KL(\hat{\epsilon}(Q)\,\|\,\epsilon(Q))$ is high.
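A quick numeric illustration (assumed KL values) of how the gap term in the bound above moves across the three regimes:

```python
# Evaluate sqrt((KL(Q||P) + log(n/delta) + 2) / (2n - 1)) for low, moderate,
# and high KL(Q||P), matching under-, appropriate, and over-fitting.
import math

n, delta = 50_000, 0.05
for label, kl_qp in [("low", 100.0), ("moderate", 2_000.0), ("high", 40_000.0)]:
    gap = math.sqrt((kl_qp + math.log(n / delta) + 2) / (2 * n - 1))
    print(f"{label:9s} KL(Q||P) = {kl_qp:8.0f}   bound gap ~ {gap:.3f}")
```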
28. Conditional Entropy
[Information Diagram: H(X,Y) is the union of H(X) and H(Y); H(Y|X) is the part of H(Y) outside H(X).]

  $\begin{aligned}
  H(Y|X) &= -\sum_{x\in X} p(x) \sum_{y\in Y} p(y|x)\log p(y|x) \\
         &= -\sum_{x\in X} p(x) \sum_{y\in Y} p(y|x)\big(\log p(x,y) - \log p(x)\big) \\
         &= -\sum_{x\in X,\,y\in Y} p(x,y)\log p(x,y) + \sum_{x\in X} p(x)\log p(x) \\
         &= H(X,Y) - H(X)
  \end{aligned}$
29. Conditional Entropy
• X & Y are independent:

  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.4
  X=2      0.1   0.1

  $H(X,Y) = 1.722$, $H(X) = 0.722$, $H(Y) = 1$
  $H(Y|X) = H(X,Y) - H(X) = 1 = H(Y)$

• Y is a stochastic function of X:

  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.1
  X=2      0.1   0.4

  $H(X,Y) = 1.722$, $H(X) = 1$, $H(Y) = 1$
  $H(Y|X) = H(X,Y) - H(X) = 0.722$

• Y is a deterministic function of X:

  P(X,Y)   Y=1   Y=2
  X=1      0.5   0
  X=2      0     0.5

  $H(X,Y) = 1$, $H(X) = 1$, $H(Y) = 1$
  $H(Y|X) = H(X,Y) - H(X) = 0$

[Information Diagrams: in the independent case H(Y|X) = H(Y); in the deterministic case H(Y|X) = 0.]
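A short check of the numbers above (entropies in bits), computed directly from each joint table; the helper name `entropy` is an assumption:

```python
# H(X), H(Y), H(X,Y) and H(Y|X) = H(X,Y) - H(X) for the three joint tables.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                      # 0 * log 0 = 0 by convention
    return float(-(p * np.log2(p)).sum())

tables = {
    "independent":   np.array([[0.4, 0.4], [0.1, 0.1]]),
    "stochastic":    np.array([[0.4, 0.1], [0.1, 0.4]]),
    "deterministic": np.array([[0.5, 0.0], [0.0, 0.5]]),
}
for name, pxy in tables.items():
    hx, hy, hxy = entropy(pxy.sum(1)), entropy(pxy.sum(0)), entropy(pxy)
    print(f"{name:13s} H(X)={hx:.3f} H(Y)={hy:.3f} H(X,Y)={hxy:.3f} H(Y|X)={hxy - hx:.3f}")
```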
30. Mutual Information
• The mutual dependence between two variables X, Y:

  $I(X;Y) = \sum_{x\in X,\,y\in Y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$

• X & Y are independent:

  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.4
  X=2      0.1   0.1

  $I(X;Y) = 0$

• Y is a stochastic function of X:

  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.1
  X=2      0.1   0.4

  $I(X;Y) = 0.278$

• Y is a deterministic function of X:

  P(X,Y)   Y=1   Y=2
  X=1      0.5   0
  X=2      0     0.5

  $I(X;Y) = 1$
31. Mutual Information
[Information Diagram: I(X;Y) is the overlap of H(X) and H(Y) inside H(X,Y).]

  $\begin{aligned}
  I(X;Y) &= \sum_{x\in X,\,y\in Y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \\
         &= \sum_{x\in X,\,y\in Y} p(x,y)\big(-\log p(x) - \log p(y) + \log p(x,y)\big) \\
         &= -\sum_{x\in X} p(x)\log p(x) - \sum_{y\in Y} p(y)\log p(y) + \sum_{x\in X,\,y\in Y} p(x,y)\log p(x,y) \\
         &= H(X) + H(Y) - H(X,Y)
  \end{aligned}$
32. Mutual Information
• X & Y are independent:

  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.4
  X=2      0.1   0.1

  $H(X,Y) = 1.722$, $H(X) = 0.722$, $H(Y) = 1$
  $I(X;Y) = H(X) + H(Y) - H(X,Y) = 0$

• Y is a stochastic function of X:

  P(X,Y)   Y=1   Y=2
  X=1      0.4   0.1
  X=2      0.1   0.4

  $H(X,Y) = 1.722$, $H(X) = 1$, $H(Y) = 1$
  $I(X;Y) = H(X) + H(Y) - H(X,Y) = 0.278$

• Y is a deterministic function of X:

  P(X,Y)   Y=1   Y=2
  X=1      0.5   0
  X=2      0     0.5

  $H(X,Y) = 1$, $H(X) = 1$, $H(Y) = 1$
  $I(X;Y) = H(X) + H(Y) - H(X,Y) = 1$

[Information Diagrams: I(X;Y) is the empty overlap in the independent case and the full overlap H(X) = H(Y) in the deterministic case.]
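The same three tables can be checked for I(X;Y) = H(X) + H(Y) - H(X,Y) (in bits); a small self-contained sketch:

```python
# Mutual information of the three joint tables: 0 (independent),
# 0.278 (stochastic function), 1 (deterministic function).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

for name, pxy in [("independent",   np.array([[0.4, 0.4], [0.1, 0.1]])),
                  ("stochastic",    np.array([[0.4, 0.1], [0.1, 0.4]])),
                  ("deterministic", np.array([[0.5, 0.0], [0.0, 0.5]]))]:
    mi = entropy(pxy.sum(1)) + entropy(pxy.sum(0)) - entropy(pxy)
    print(f"{name:13s} I(X;Y) = {mi:.3f}")
```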
37. Cause of Over-fitting
• Training loss (Cross-Entropy):

  $H_{p,q}(y|x,w) = \mathbb{E}_{x,y\sim p}\big[-\log q(y|x,w)\big]$

  p : probability density function of the data
  q : probability density function predicted by the model
  x : input feature of the training data
  y : label of the training data
  w : weights of the model
  θ : latent parameters of the data distribution
38. Cause of Over-fitting

  $H_{p,q}(y|x,w) = H_p(y|x,w) + \mathbb{E}_{x,w}\,KL\big(p(y|x,w)\,\|\,q(y|x,w)\big)$

  where $H_p(y|x,w)$ is the uncertainty of y given w and x.
[Information Diagram: $H_p(x)$, $H_p(y)$, $H_p(w)$, with $H_p(y|x,w)$ as the part of $H_p(y)$ outside $H_p(x)$ and $H_p(w)$.]
39. Cause of Over-fitting
• Lower $H_p(y|x,w)$ -> lower uncertainty of y given w and x -> lower training error
• Example: given an input x and a fixed w,

  lower $H_p(y|x,w)$:   $p(y{=}1|x,w)=0.9$, $p(y{=}2|x,w)=0.1$, $p(y{=}3|x,w)=0.0$, $p(y{=}4|x,w)=0.0$
  higher $H_p(y|x,w)$:  $p(y{=}1|x,w)=0.3$, $p(y{=}2|x,w)=0.3$, $p(y{=}3|x,w)=0.2$, $p(y{=}4|x,w)=0.2$

  $H_{p,q}(y|x,w) = H_p(y|x,w) + \mathbb{E}_{x,w}\,KL\big(p(y|x,w)\,\|\,q(y|x,w)\big)$
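The two example distributions can be compared directly; a tiny check (entropies in bits):

```python
# Entropy of the two predictive distributions above: the peaked one has the
# lower H_p(y|x,w), i.e. lower uncertainty of y given w and x.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy([0.9, 0.1, 0.0, 0.0]))   # ~ 0.469 bits (lower uncertainty)
print(entropy([0.3, 0.3, 0.2, 0.2]))   # ~ 1.971 bits (higher uncertainty)
```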
40. Cause of Over-fitting

  $H_{p,q}(y|x,w) = H_p(y|x,w) + \mathbb{E}_{x,w}\,KL\big(p(y|x,w)\,\|\,q(y|x,w)\big)$
  $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$

• θ : latent parameters of the (training & testing) data distribution
[Information Diagram: $H_p(y|x)$, $H_p(y_{test}|x_{test})$, and $H_p(\theta)$.]
41. Cause of Over-fitting
• θ : latent parameters of the (training & testing) data distribution
[Figure: Information Diagram of $H_p(y|x)$, $H_p(y_{test}|x_{test})$, and $H_p(\theta)$, with example (x, y) samples. Labeled regions: $I_p(y;\theta|x)$ is the useful information in the training data; $H_p(y|x,\theta)$ is the noisy information and outliers in the training data; the corresponding regions on the testing side are the noise and outliers in the testing data and the normal samples not in the training data.]
42. Cause of Over-fitting

  $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$

[Information Diagram: $H_p(x)$, $H_p(y)$, $H_p(\theta)$, $H_p(w)$. $H_p(y|x,w)$ is the uncertainty of y given w and x; $H_p(y|x,\theta)$ is the noisy information and outliers in the training data; $I(y;\theta|x,w)$ is the useful information not learned by the weights; $I(y;w|x,\theta)$ is the noisy information and outliers learned by the weights.]
47. Cause of Over-fitting

  $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$

[Information Diagram: $H_p(x)$, $H_p(y)$, $H_p(\theta)$, $H_p(w)$, highlighting $I(y;w|x,\theta)$: the noisy information and outliers learned by the weights.]
48. Cause of Over-fitting
• Cause of over-fitting: the weights memorize the noisy information in the training data.

  $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$

• High VC Dimension, but clean data -> little noise to memorize
• High VC Dimension, and noisy data -> much noise to memorize
[Figure: training data and testing data in the clean-data and noisy-data cases, with the fitted weights w.]
49. Information in the Weights as a Regularizer
• $I(y;w|x,\theta)$ is unknown and cannot be computed
• $I(D;w)$, the information in the weights, is an upper bound of $I(y;w|x,\theta)$:

  $I(y;w|x,\theta) \le I(y,x;w|\theta) = I(D;w|\theta) \le I(D;w)$

  where D is the training dataset, $H_p(x,y) = H_p(D)$.
[Information Diagram: $H_p(x)$, $H_p(y)$, $H_p(\theta)$, $H_p(w)$; the region $I(y;w|x,\theta)$ is contained in $I(y,x;w|\theta) = I(D;w|\theta)$, which is contained in $I(D;w)$.]
50. Information in the Weights as a Regularizer
• The actual data distribution p is unknown
• Estimate $I_p(D;w)$ by $I_q(D;w)$: $I_p(D;w) \approx I_q(D;w)$
• New loss function, with $I_q(D;w)$ as a regularizer:

  $\mathcal{L}(q(w|D)) = H_{p,q}(y|x,w) + I_q(D;w)$
51. Connection with Flat Minimum
• A flat minimum has low information in the weights:

  $I_q(w;D) \le \frac{1}{2}K\Big[\log\|\hat{w}\|_2^2 + \log\|\mathcal{H}\|_* - K\log\Big(\frac{K^2}{2}\Big)\Big]$

  $\|\mathcal{H}\|_*$ : nuclear norm of the Hessian at the local minimum
• Flat minimum -> low nuclear norm of the Hessian -> low information $I_q(w;D)$
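A toy comparison (assumed 2-D quadratic minima, not the slides' networks) of the nuclear-norm term: the Hessian at a flat minimum has small singular values, so $\log\|\mathcal{H}\|_*$ and hence the information bound are small:

```python
# Nuclear norm ||H||_* (sum of singular values) of an assumed flat vs. sharp
# Hessian at a local minimum.
import numpy as np

H_flat  = np.diag([0.10, 0.05])   # assumed flat minimum: small curvature
H_sharp = np.diag([50.0, 20.0])   # assumed sharp minimum: large curvature

for name, H in [("flat", H_flat), ("sharp", H_sharp)]:
    nuclear = np.linalg.svd(H, compute_uv=False).sum()
    print(f"{name:5s} ||H||_* = {nuclear:6.2f}   log||H||_* = {np.log(nuclear):6.2f}")
```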
52. Connection with PAC-Bayesian Bound
• Given a prior distribution p(w), we have:

  $I_q(w;D) = \mathbb{E}_D\,KL\big(q(w|D)\,\|\,q(w)\big) \le \mathbb{E}_D\,KL\big(q(w|D)\,\|\,q(w)\big) + KL\big(q(w)\,\|\,p(w)\big) = \mathbb{E}_D\,KL\big(q(w|D)\,\|\,p(w)\big)$

  q(w|D) : distribution of weights after training on dataset D (posterior)
  p(w) : distribution of weights before training (prior)
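In practice the upper bound $\mathbb{E}_D\,KL(q(w|D)\,\|\,p(w))$ is computable once q and p are given explicit forms. A minimal sketch under an assumed diagonal-Gaussian posterior and an isotropic Gaussian prior (the shapes and numbers are placeholders):

```python
# Closed-form KL( N(mu, diag(sigma2)) || N(0, prior_var * I) ), the quantity
# averaged over datasets D in the bound above.
import numpy as np

def kl_diag_gaussian(mu, sigma2, prior_var):
    mu, sigma2 = np.asarray(mu, float), np.asarray(sigma2, float)
    return 0.5 * np.sum(sigma2 / prior_var + mu**2 / prior_var
                        - 1.0 - np.log(sigma2 / prior_var))

rng = np.random.default_rng(0)
mu = rng.normal(scale=0.1, size=1000)   # assumed posterior means after training
sigma2 = np.full(1000, 0.01)            # assumed posterior variances
print(kl_diag_gaussian(mu, sigma2, prior_var=1.0))   # KL in nats
```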
53. Connection with PAC-Bayesian Bound
• Loss function with the regularizer, using $I_p(D;w) \approx I_q(D;w)$:

  $\mathcal{L}(q(w|D)) = H_{p,q}(y|x,w) + I_q(D;w) \le H_{p,q}(y|x,w) + \mathbb{E}_D\,KL\big(q(w|D)\,\|\,p(w)\big)$

• PAC-Bayesian Bound:

  $\mathbb{E}_D\big[L_{test}(q(w|D))\big] \le H_{p,q}(y|x,w) + \frac{L_{max}\,\mathbb{E}_D\big[KL(q(w|D)\,\|\,p(w))\big]}{n\,(1-\frac{1}{2})}$

  $L_{test}$ : test error of the network with weights drawn from q(w|D)
  $L_{max}$ : maximum per-sample loss