
Information in the Weights


  1. 1. Information in the Weights Mark Chang 2020/06/19
  2. 2. Outline • Traditional Machine Learning vs. Deep Learning • Basic Concepts in Information Theory • Information in the Weights
  3. 3. Traditional Machine Learning vs. Deep Learning • VC Bound • Generalization in Deep Learning • PAC-Bayesian Bound for Deep Learning
  4. 4. VC Bound • What causes over-fitting? • Too many parameters -> over-fitting? • Too many parameters -> high VC dimension -> over-fitting? • …? (Figure: three hypotheses h fitted to training and testing data: under-fitting with too few parameters, appropriate fitting with adequate parameters, over-fitting with too many parameters.)
  5. 5. VC Bound • Over-fitting is caused by high VC dimension • For a given dataset (n is constant), search for the best VC dimension: with probability 1 - δ, ε(h) ≤ ε̂(h) + √((8/n) log(4(2n)^d / δ)), where n = number of training instances and d = VC dimension (model complexity). (Figure: training and testing error vs. d; testing error is minimized at the best VC dimension and over-fitting sets in as d approaches n, where the data can be shattered. A numeric sketch of this bound appears after the transcript.)
  6. 6. VC Dimension • VC dimension of a linear model: O(W), where W = number of parameters • VC dimension of fully-connected neural networks: O(LW log W), where L = number of layers and W = number of parameters • The VC dimension is independent of the data distribution and depends only on the model
  7. 7. Generalization in Deep Learning • Consider a toy example: a neural network with input 780 and hidden 600 has d ≈ 26M; the dataset has n = 50,000; d >> n, but the testing error is < 0.1
  8. 8. Generalization in Deep Learning • However, when you are solving your own problem: a neural network (input 780, hidden 600) on your dataset (n = 50,000, 10 classes) gets testing error = 0.6, i.e. over-fitting, so you reduce the VC dimension (input 780, hidden 200), and so on
  9. 9. Generalization in Deep Learning (Figure: error vs. d for the VC bound ε(h) ≤ ε̂(h) + √((8/n) log(4(2n)^d / δ)); beyond d = n lies the over-fitting regime, and over-parameterized models sit far to the right with extremely high VC dimension.)
  10. 10. Generalization in Deep Learning (ICLR 2017)
  11. 11. Generalization in Deep Learning (Figure: a deep neural network (Inception) reaches training error ε̂(h) ≈ 0 on the original dataset (CIFAR), on the same features with random labels, and on random-noise features, i.e. it shatters the data.)
  12. 12. Generalization in Deep Learning (Figure: the same three settings with their testing errors: original dataset (CIFAR) ε(h) ≈ 0.14, random labels ε(h) ≈ 0.9, random-noise features ε(h) ≈ 0.9, while the training error is ε̂(h) ≈ 0 in all three cases.)
  13. 13. Generalization in Deep Learning • Testing error depends on the data distribution (original dataset: ε(h) ≈ 0.14; random labels: ε(h) ≈ 0.9) • However, the VC bound does not depend on the data distribution
  14. 14. Generalization in Deep Learning
  15. 15. Generalization in Deep Learning • high sharpness -> high testing error
  16. 16. PAC-Bayesian Bound for Deep Learning UAI 2017
  17. 17. PAC-Bayesian Bound for Deep Learning • Deterministic model: a single hypothesis h with error ε(h) on the data • Stochastic model (Gibbs classifier): a distribution Q over hypotheses; sample h ~ Q and measure the Gibbs error ε(Q) = E_{h~Q}[ε(h)] (a sampling sketch appears after the transcript)
  18. 18. PAC-Bayesian Bound for Deep Learning • Consider the sharpness of local minima: for a single hypothesis h vs. a distribution of hypotheses Q, a flat minimum has low ε̂(h) and low ε̂(Q), while a sharp minimum has low ε̂(h) but high ε̂(Q)
  19. 19. PAC-Bayesian Bound for Deep Learning • With probability 1 - δ, the following inequality (PAC-Bayesian bound) is satisfied: KL(ε̂(Q) || ε(Q)) ≤ (KL(Q||P) + log(n/δ)) / (n - 1), where n is the number of training instances, P is the distribution of models before training (prior), Q is the distribution of models after training (posterior), KL(Q||P) is the KL divergence between P and Q, and KL(ε̂(Q)||ε(Q)) is the KL divergence between training and testing error
  20. 20. PAC-Bayesian Bound for Deep Learning • Relaxed form: ε(Q) ≤ ε̂(Q) + √((KL(Q||P) + log(n/δ) + 2) / (2n - 1)) (a numeric sketch appears after the transcript). (Figure: training vs. testing data and the prior P vs. posterior Q in three regimes: under-fitting has high ε̂(Q), low KL(Q||P) and low KL(ε̂(Q)||ε(Q)); appropriate fitting has moderate ε̂(Q), moderate KL(Q||P) and moderate KL(ε̂(Q)||ε(Q)); over-fitting has low ε̂(Q), high KL(Q||P) and high KL(ε̂(Q)||ε(Q)).)
  21. 21. PAC-Bayesian Bound for Deep Learning • The PAC-Bayesian bound is data-dependent: KL(ε̂(Q) || ε(Q)) ≤ (KL(Q||P) + log(n/δ)) / (n - 1) • High VC dimension, but clean data -> low KL(Q||P) • High VC dimension, and noisy data -> high KL(Q||P)
  22. 22. PAC-Bayesian Bound for Deep Learning • Data: MNIST (binary classification, class 0: digits 0-4, class 1: digits 5-9) • Model: 2- or 3-layer NN • Varying the width of the hidden layer (2-layer NN): width 600: training 0.028, testing 0.034, VC bound 26M, PAC-Bayesian bound 0.161; width 1200: training 0.027, testing 0.035, VC bound 56M, PAC-Bayesian bound 0.179 • Varying the number of hidden layers: 2 layers: training 0.028, testing 0.034, VC bound 26M, PAC-Bayesian bound 0.161; 3 layers: training 0.028, testing 0.033, VC bound 66M, PAC-Bayesian bound 0.186; 4 layers: training 0.027, testing 0.032, VC bound 121M, PAC-Bayesian bound 0.201 • Original MNIST vs. random labels: original: training 0.028, testing 0.034, VC bound 26M, PAC-Bayesian bound 0.161; random: training 0.112, testing 0.503, VC bound 26M, PAC-Bayesian bound 1.352
  23. 23. PAC-Bayesian Bound for Deep Learning • The PAC-Bayesian bound is data-dependent: KL(ε̂(Q) || ε(Q)) ≤ (KL(Q||P) + log(n/δ)) / (n - 1) • Clean data (original MNIST labels) -> small KL(Q||P) -> small ε(Q) • Noisy data (random labels) -> large KL(Q||P) -> large ε(Q)
  24. 24. Basic Concepts in Information Theory • Entropy • Joint Entropy • Conditional Entropy • Mutual Information • Cross Entropy
  25. 25. Entropy • The uncertainty of a random variable X: H(X) = -Σ_{x∈X} p(x) log p(x) • Examples (log base 2): P(X=1) = P(X=2) = 0.5 gives H(X) = -2 × 0.5 log(0.5) = 1; P(X=1) = 0.9, P(X=2) = 0.1 gives H(X) = -0.9 log(0.9) - 0.1 log(0.1) ≈ 0.469; P(X=x) = 0.25 for x ∈ {1, 2, 3, 4} gives H(X) = -4 × 0.25 log(0.25) = 2 (a numpy sketch reproducing these numbers and those on the following slides appears after the transcript)
  26. 26. Joint Entropy • The uncertainty of a joint distribution involving two random variables X, Y: H(X, Y) = -Σ_{x∈X, y∈Y} p(x, y) log p(x, y) • Example: P(X, Y) = 0.25 for all four pairs (X, Y) ∈ {1, 2}² gives H(X, Y) = -4 × 0.25 log(0.25) = 2
  27. 27. Conditional Entropy • The uncertainty of a random variable Y given the value of another random variable X: H(Y|X) = -Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x) • Three joint tables P(X, Y) over X, Y ∈ {1, 2}: (0.4, 0.4; 0.1, 0.1), where X and Y are independent: H(X, Y) = 1.722, H(Y|X) = 1; (0.4, 0.1; 0.1, 0.4), where Y is a stochastic function of X: H(X, Y) = 1.722, H(Y|X) = 0.722; (0.5, 0; 0, 0.5), where Y is a deterministic function of X: H(X, Y) = 1, H(Y|X) = 0
  28. 28. Conditional Entropy • H(Y|X) = -Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x) = -Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x)(log p(x, y) - log p(x)) = -Σ_{x∈X, y∈Y} p(x, y) log p(x, y) + Σ_{x∈X} p(x) log p(x) = H(X, Y) - H(X) (Information diagram: H(Y|X) is the part of H(Y) lying outside H(X) within H(X, Y).)
  29. 29. Conditional Entropy • X and Y independent (table 0.4, 0.4; 0.1, 0.1): H(X, Y) = 1.722, H(X) = 0.722, H(Y) = 1, so H(Y|X) = H(X, Y) - H(X) = 1 = H(Y) • Y a stochastic function of X (table 0.4, 0.1; 0.1, 0.4): H(X, Y) = 1.722, H(X) = 1, H(Y) = 1, so H(Y|X) = H(X, Y) - H(X) = 0.722 • Y a deterministic function of X (table 0.5, 0; 0, 0.5): H(X, Y) = 1, H(X) = 1, H(Y) = 1, so H(Y|X) = H(X, Y) - H(X) = 0 (Information diagrams: for independent X, Y, H(Y|X) = H(Y); for a deterministic function, H(Y|X) = 0.)
  30. 30. Mutual Information • The mutual dependence between two variables X, Y: I(X; Y) = Σ_{x∈X, y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) ) • X and Y independent (table 0.4, 0.4; 0.1, 0.1): I(X; Y) = 0 • Y a stochastic function of X (table 0.4, 0.1; 0.1, 0.4): I(X; Y) = 0.278 • Y a deterministic function of X (table 0.5, 0; 0, 0.5): I(X; Y) = 1
  31. 31. Mutual Information • I(X; Y) = Σ_{x∈X, y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) ) = Σ_{x∈X, y∈Y} p(x, y)(-log p(x) - log p(y) + log p(x, y)) = -Σ_{x∈X} p(x) log p(x) - Σ_{y∈Y} p(y) log p(y) + Σ_{x∈X, y∈Y} p(x, y) log p(x, y) = H(X) + H(Y) - H(X, Y) (Information diagram: I(X; Y) is the overlap of H(X) and H(Y) inside H(X, Y).)
  32. 32. Mutual Information • X and Y independent: H(X, Y) = 1.722, H(X) = 0.722, H(Y) = 1, so I(X; Y) = H(X) + H(Y) - H(X, Y) = 0 • Y a stochastic function of X: H(X, Y) = 1.722, H(X) = 1, H(Y) = 1, so I(X; Y) = H(X) + H(Y) - H(X, Y) = 0.278 • Y a deterministic function of X: H(X, Y) = 1, H(X) = 1, H(Y) = 1, so I(X; Y) = H(X) + H(Y) - H(X, Y) = 1 (Information diagrams: no overlap when independent, partial overlap for a stochastic function, full overlap for a deterministic function.)
  33. 33. Entropy, Joint Entropy, Conditional Entropy & Mutual Information (Information diagram for three variables X, Y, Z: the regions are H(X|Y, Z), H(Y|X, Z), H(Z|X, Y), I(X; Y|Z), I(X; Z|Y), I(Y; Z|X) and I(X; Y; Z), inside H(X), H(Y), H(Z).)
  34. 34. Cross Entropy • H_{p,q}(X) = -Σ_{x∈X} p(x) log q(x) = -Σ_{x∈X} p(x) log p(x) + Σ_{x∈X} p(x) log( p(x) / q(x) ) = H_p(X) + KL( p(X) || q(X) ) (a short numpy check of this decomposition appears after the transcript)
  35. 35. Information in the Weights • Cause of Over-fitting • Information in the Weights as a Regularizer • Bounding the Information in the Weights • Connection with Flat Minimum • Connection with PAC-Bayesian Bound • Experiments
  36. 36. Information in the Weights
  37. 37. Cause of Over-fitting • Training loss (cross-entropy): H_{p,q}(y|x, w) = E_{x,w}[ -Σ_y p(y|x, w) log q(y|x, w) ] • p: probability density function of the data • q: probability density function predicted by the model • x: input features of the training data • y: labels of the training data • w: weights of the model • θ: latent parameters of the data distribution
  38. 38. Cause of Over-fitting • H_{p,q}(y|x, w) = H_p(y|x, w) + E_{x,w}[ KL( p(y|x, w) || q(y|x, w) ) ] (Information diagram: H_p(x), H_p(y), H_p(w), with H_p(y|x, w) = the uncertainty of y given w and x.)
  39. 39. Cause of Over-fitting • Lower H_p(y|x, w) -> lower uncertainty of y given w and x -> lower training error • Example: given an input x and a fixed w, the prediction {p(y=1|x, w) = 0.9, p(y=2|x, w) = 0.1, p(y=3|x, w) = 0.0, p(y=4|x, w) = 0.0} has lower H_p(y|x, w), while {p(y=1|x, w) = 0.3, p(y=2|x, w) = 0.3, p(y=3|x, w) = 0.2, p(y=4|x, w) = 0.2} has higher H_p(y|x, w) • Recall H_{p,q}(y|x, w) = H_p(y|x, w) + E_{x,w}[ KL( p(y|x, w) || q(y|x, w) ) ]
  40. 40. Cause of Over-fitting • H_{p,q}(y|x, w) = H_p(y|x, w) + E_{x,w}[ KL( p(y|x, w) || q(y|x, w) ) ] • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) • θ: latent parameters of the (training & testing) data distribution (Information diagram: H_p(θ), H_p(y|x), H_p(y_test|x_test).)
  41. 41. Cause of Over-fitting • θ: latent parameters of the (training & testing) data distribution (Figure: scatter plots of training and testing data over x and y. I_p(y; θ|x) is the useful information in the training data; H_p(y|x, θ) is the noisy information and outliers in the training data; the testing data contain their own noise and outliers, plus normal samples not in the training data.)
  42. 42. Cause of Over-fitting • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagram: H_p(y|x, w) is the uncertainty of y given w and x; H_p(y|x, θ) is the noisy information and outliers in the training data; I(y; θ|x, w) is the useful information not learned by the weights; I(y; w|x, θ) is the noisy information and outliers learned by the weights.)
  43. 43. Cause of Over-fitting • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagram highlighting H_p(y|x, θ): the noisy information and outliers in the training data.)
  44. 44. Cause of Over-fitting • Lower H_p(y|x, θ) -> less noise and fewer outliers in the training data • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Figure: two scatter plots of training data, one with lower H_p(y|x, θ) and one with higher H_p(y|x, θ).)
  45. 45. Cause of Over-fitting • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagram highlighting I(y; θ|x, w): the useful information not learned by the weights.)
  46. 46. Cause of Over-fitting • Lower I(y; θ|x, w) -> more useful information learned by the weights -> lower H_p(y_test|x_test, w) -> lower testing error • I(y; θ|x, w2) < I(y; θ|x, w1) implies H_p(y_test|x_test, w2) < H_p(y_test|x_test, w1) • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagrams comparing two sets of weights w1 and w2 against H_p(θ), H_p(y|x), H_p(y_test|x_test).)
  47. 47. Cause of Over-fitting • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagram highlighting I(y; w|x, θ): the noisy information and outliers learned by the weights.)
  48. 48. Cause of Over-fitting • Cause of over-fitting: the weights memorize the noisy information in the training data • High VC dimension, but clean data -> little noise to memorize • High VC dimension, and noisy data -> much noise to memorize • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Figure: training and testing data with the fitted weights w in both cases.)
  49. 49. Information in the Weights as a Regularizer • I(y; w|x, θ) is unknown and cannot be computed • I(D; w), the information in the weights, is an upper bound on I(y; w|x, θ): I(y; w|x, θ) ≤ I(y, x; w|θ) = I(D; w|θ) ≤ I(D; w), where D = (x, y) is the training dataset, so H_p(x, y) = H_p(D) (Information diagram: I(y; w|x, θ) sits inside I(y, x; w|θ) = I(D; w|θ), which sits inside I(D; w).)
  50. 50. Information in the Weights as a Regularizer • The actual data distribution p is unknown • Estimate I(D; w) by I_p(D; w) ≈ I_q(D; w) • New loss function, with I_q(D; w) as a regularizer: L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w) (a code sketch of this objective appears after the transcript)
  51. 51. Connection with Flat Minimum • A flat minimum has low information in the weights: I_q(w; D) ≤ (1/2) K [ log ||ŵ||₂² + log ||H||* - K log(K²β/2) ], where ||H||* is the nuclear norm of the Hessian at the local minimum • Flat minimum -> low nuclear norm of the Hessian -> low information
  52. 52. Connection with PAC-Bayesian Bound • Given a prior distribution p(w) (the distribution of weights before training) and the posterior q(w|D) (the distribution of weights after training on dataset D), we have: I_q(w; D) = E_D[ KL( q(w|D) || q(w) ) ] ≤ E_D[ KL( q(w|D) || q(w) ) ] + KL( q(w) || p(w) ) = E_D[ KL( q(w|D) || p(w) ) ]
  53. 53. Connection with PAC-Bayesian Bound • Loss function with the regularizer (using I_p(D; w) ≈ I_q(D; w)): L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w) ≤ H_{p,q}(y|x, w) + E_D[ KL( q(w|D) || p(w) ) ] • PAC-Bayesian bound: E_D[ L_test(q(w|D)) ] ≤ H_{p,q}(y|x, w) + L_max E_D[ KL( q(w|D) || p(w) ) ] / (n(1 - 1/(2β))) • L_test: test error of the network with weights drawn from q(w|D) • L_max: maximum per-sample loss
  54. 54. Experiments • Random labels (Figure: information complexity vs. dataset size, for the loss L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w).)
  55. 55. Experiments • Real labels (Figure: information complexity vs. dataset size, for the loss L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w).)
  56. 56. Experiments • Information in the weights vs. percentage of corrupted labels
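
Code sketch for slide 5: a minimal numeric evaluation of the VC bound ε(h) ≤ ε̂(h) + √((8/n) log(4(2n)^d / δ)) as reconstructed above. The training error, the confidence level δ = 0.05 and the small comparison model (d = 100) are illustrative assumptions; n = 50,000 and d ≈ 26M follow the toy example on slide 7.

```python
import math

def vc_bound(train_error, n, d, delta=0.05):
    """Upper bound on the test error from the VC bound
    eps(h) <= eps_hat(h) + sqrt((8/n) * log(4 * (2n)^d / delta)).
    log((2n)^d) is expanded as d * log(2n) so a huge d does not overflow."""
    penalty = math.sqrt(8.0 / n * (math.log(4.0 / delta) + d * math.log(2 * n)))
    return train_error + penalty

# Slide 7's toy example: n = 50,000 training instances, d ~ 26M.
print(vc_bound(train_error=0.028, n=50_000, d=26_000_000))  # >> 1: the bound is vacuous
# A hypothetical low-capacity model for comparison.
print(vc_bound(train_error=0.028, n=50_000, d=100))         # ~0.46: non-vacuous
```

This is why the deck turns to data-dependent bounds: for over-parameterized networks the capacity term alone makes the VC bound meaningless.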
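
Code sketch for slides 17 and 18: estimating the Gibbs error ε(Q) = E_{h~Q}[ε(h)] by sampling hypotheses, on a hypothetical 1-D threshold classifier. The data, the Gaussian posterior Q over the threshold, and the two chosen minima are assumptions made up for illustration; they only mimic the slides' point that a sharp minimum keeps ε̂(h) low while ε̂(Q) grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D data: class 0 on (-2, 0), class 1 on (1, 3), margin (0, 1).
x = np.concatenate([rng.uniform(-2, 0, 50), rng.uniform(1, 3, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

def error(threshold):
    """Training error of the threshold classifier h_t(x) = 1[x > t]."""
    return np.mean((x > threshold).astype(float) != y)

def gibbs_error(mu, sigma, samples=10_000):
    """eps_hat(Q) = E_{h~Q}[eps_hat(h)] with Q = N(mu, sigma^2) over the threshold."""
    thresholds = rng.normal(mu, sigma, samples)
    return float(np.mean([error(t) for t in thresholds]))

flat_t, sharp_t = 0.5, 0.99   # middle of the margin vs. right next to a training point
print(error(flat_t), error(sharp_t))                        # both single hypotheses: 0.0
print(gibbs_error(flat_t, 0.3), gibbs_error(sharp_t, 0.3))  # flat stays near 0; sharp is clearly higher
```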
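
Code sketch for slides 19 and 20: evaluating the relaxed PAC-Bayesian bound ε(Q) ≤ ε̂(Q) + √((KL(Q||P) + log(n/δ) + 2) / (2n - 1)). The KL values and δ below are made-up illustrations of the clean-data vs. noisy-data contrast on slide 21; they are not the numbers behind slide 22's table.

```python
import math

def pac_bayes_bound(gibbs_train_error, kl_q_p, n, delta=0.05):
    """Relaxed PAC-Bayesian bound:
    eps(Q) <= eps_hat(Q) + sqrt((KL(Q||P) + log(n/delta) + 2) / (2n - 1))."""
    return gibbs_train_error + math.sqrt(
        (kl_q_p + math.log(n / delta) + 2) / (2 * n - 1))

n = 50_000
# Clean data: the posterior can stay close to the prior (hypothetical small KL).
print(pac_bayes_bound(0.03, kl_q_p=500.0, n=n))       # ~0.10: a meaningful guarantee
# Noisy/random labels: the posterior must move far from the prior (hypothetical large KL).
print(pac_bayes_bound(0.11, kl_q_p=80_000.0, n=n))    # ~1.0: the bound becomes vacuous
```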
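
Code sketch for slides 25 to 32: a small numpy script that reproduces the worked entropy, conditional-entropy and mutual-information numbers (in bits, log base 2). The three joint tables are the ones used on the slides.

```python
import numpy as np

def entropy(p):
    """H = -sum p * log2(p), skipping zero-probability entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def joint_quantities(pxy):
    """Entropies and mutual information of a joint table pxy[i, j] = P(X=i+1, Y=j+1)."""
    h_x, h_y, h_xy = entropy(pxy.sum(axis=1)), entropy(pxy.sum(axis=0)), entropy(pxy)
    return {"H(X)": h_x, "H(Y)": h_y, "H(X,Y)": h_xy,
            "H(Y|X)": h_xy - h_x, "I(X;Y)": h_x + h_y - h_xy}

# Slide 25: 1, ~0.469 and 2 bits.
print(entropy([0.5, 0.5]), entropy([0.9, 0.1]), entropy([0.25] * 4))

# Slides 26-32: the three joint tables.
tables = {
    "independent":   np.array([[0.4, 0.4], [0.1, 0.1]]),  # X and Y independent
    "stochastic":    np.array([[0.4, 0.1], [0.1, 0.4]]),  # Y a stochastic function of X
    "deterministic": np.array([[0.5, 0.0], [0.0, 0.5]]),  # Y a deterministic function of X
}
for name, pxy in tables.items():
    print(name, joint_quantities(pxy))
# independent:   H(X,Y)=1.722, H(Y|X)=1.000, I(X;Y)=0.000
# stochastic:    H(X,Y)=1.722, H(Y|X)=0.722, I(X;Y)=0.278
# deterministic: H(X,Y)=1.000, H(Y|X)=0.000, I(X;Y)=1.000
```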
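
Code sketch for slide 34: a quick check that H_{p,q}(X) = H_p(X) + KL(p||q); the particular p and q are arbitrary illustrative distributions.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # the true distribution
q = np.array([0.4, 0.4, 0.2])   # the model's (mismatched) distribution

cross_entropy = -np.sum(p * np.log2(q))
h_p           = -np.sum(p * np.log2(p))
kl_pq         =  np.sum(p * np.log2(p / q))

print(cross_entropy, h_p + kl_pq)   # identical: H_{p,q}(X) = H_p(X) + KL(p || q)
```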
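
Code sketch for slides 50 to 53: evaluating the regularized objective L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w), with the intractable I_q(D; w) replaced by its tractable upper bound KL(q(w|D) || p(w)) from slide 52. The model (a tiny logistic regression with a diagonal-Gaussian posterior over its weights), the unit-Gaussian prior, the dataset and all numbers are assumptions for illustration, not the architecture or training setup behind the slides' experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny dataset: 2 features, binary labels.
X = rng.normal(size=(128, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def cross_entropy(mu, log_sigma, n_samples=64):
    """Monte-Carlo estimate of H_{p,q}(y|x, w) = E_{w~q(w|D)}[-log q(y|x, w)],
    with a diagonal-Gaussian posterior q(w|D) = N(mu, diag(exp(log_sigma))^2)
    over the weights of a logistic-regression model."""
    losses = []
    for _ in range(n_samples):
        w = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)  # sample w ~ q(w|D)
        prob = 1.0 / (1.0 + np.exp(-(X @ w)))
        losses.append(-np.mean(y * np.log(prob + 1e-12) + (1 - y) * np.log(1 - prob + 1e-12)))
    return float(np.mean(losses))

def kl_to_prior(mu, log_sigma, prior_sigma=1.0):
    """KL( N(mu, sigma^2) || N(0, prior_sigma^2) ): the tractable stand-in for I_q(D; w)."""
    sigma2 = np.exp(2.0 * log_sigma)
    return float(0.5 * np.sum(sigma2 / prior_sigma**2 + mu**2 / prior_sigma**2
                              - 1.0 - 2.0 * log_sigma + 2.0 * np.log(prior_sigma)))

def objective(mu, log_sigma):
    """Surrogate objective: cross-entropy plus KL(q(w|D) || p(w)),
    the upper bound on H_{p,q}(y|x, w) + I_q(D; w) used on slide 53."""
    return cross_entropy(mu, log_sigma) + kl_to_prior(mu, log_sigma)

mu = np.array([1.0, 0.5])
print(objective(mu, log_sigma=np.array([-2.0, -2.0])))  # confident posterior: lower cross-entropy term, larger KL term
print(objective(mu, log_sigma=np.array([0.0, 0.0])))    # broad posterior: smaller KL term, noisier predictions
```

Minimizing this surrogate over mu and log_sigma (for instance by gradient descent with the reparameterization trick) is one standard way to penalize the information the weights carry about the dataset.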
