
- 1. Information in the Weights Mark Chang 2020/06/19
- 2. Outline • Traditional Machine Learning vs. Deep Learning • Basic Concepts in Information Theory • Information in the Weights
- 3. Traditional Machine Learning vs. Deep Learning • VC Bound • Generalization in Deep Learning • PAC-Bayesian Bound for Deep Learning
- 4. VC Bound • What causes over-fitting? • Too many parameters -> over-fitting? • Too many parameters -> high VC dimension -> over-fitting? • (figure: three fits of a hypothesis h on training vs. testing data: under-fitting with too few parameters, appropriate fitting with adequate parameters, over-fitting with too many parameters)
- 5. VC Bound • Over-fitting is caused by high VC dimension • For a given dataset (n constant), search for the best VC dimension d: $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$, where n is the number of training instances, d the VC dimension (model complexity), $\hat{\epsilon}(h)$ the training error, and $\epsilon(h)$ the testing error • (figure: error vs. d, with over-fitting beyond the best VC dimension and shattering at d = n)
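The VC bound on this slide can be evaluated numerically; a minimal sketch (the function name and δ = 0.05 are my choices, and the log term is computed directly so that d in the millions does not overflow):

```python
import math

def vc_bound(train_err, n, d, delta=0.05):
    """VC generalization bound from the slide:
    eps(h) <= eps_hat(h) + sqrt(8/n * log(4 * (2n)^d / delta)),
    with the exponent expanded as log(4/delta) + d*log(2n)."""
    log_term = math.log(4.0 / delta) + d * math.log(2 * n)
    return train_err + math.sqrt(8.0 / n * log_term)

# The bound grows with VC dimension d. For n = 50,000 training
# instances and d = 26e6 (the toy network of the next slide) the
# bound is vacuous (far above 1).
loose = vc_bound(0.03, 50_000, 26_000_000)
tight = vc_bound(0.03, 50_000, 100)
```

This makes the deck's point concrete: with d >> n the bound says nothing, even though the observed testing error is below 0.1.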
- 6. VC Dimension • VC dimension of a linear model: O(W), where W = number of parameters • VC dimension of fully-connected neural networks: O(LW log W), where L = number of layers and W = number of parameters • The VC dimension is independent of the data distribution; it depends only on the model
- 7. Generalization in Deep Learning • Consider a toy example: a neural network (input: 780, hidden: 600) has d ≈ 26M, while the dataset has n = 50,000 • d >> n, but testing error < 0.1
- 8. Generalization in Deep Learning • However, when you are solving your own problem: a neural network (input: 780, hidden: 600) on your dataset (n = 50,000, 10 classes) gets testing error = 0.6: over-fitting! • Reducing the VC dimension (hidden: 200) still gives testing error = 0.6: still over-fitting
- 9. Generalization in Deep Learning • (figure: error vs. VC dimension d for the bound $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$, with over-fitting beyond d = n, yet over-parameterized models with extremely high VC dimension still generalize)
- 10. Generalization in Deep Learning (ICLR 2017)
- 11. Generalization in Deep Learning • Deep neural networks (Inception) can shatter the training set: training error $\hat{\epsilon}(h) \approx 0$ on the original dataset (CIFAR), on the same features with random labels, and on random-noise features
- 12. Generalization in Deep Learning • Training error $\hat{\epsilon}(h) \approx 0$ in all three cases, but testing error differs: original dataset (CIFAR) $\epsilon(h) \approx 0.14$, random labels $\epsilon(h) \approx 0.9$, random-noise features $\epsilon(h) \approx 0.9$
- 13. Generalization in Deep Learning • Testing error depends on the data distribution ($\epsilon(h) \approx 0.14$ on the original dataset vs. $\epsilon(h) \approx 0.9$ with random labels) • However, the VC bound does not depend on the data distribution
- 14. Generalization in Deep Learning
- 15. Generalization in Deep Learning • High sharpness -> high testing error
- 16. PAC-Bayesian Bound for Deep Learning (UAI 2017)
- 17. PAC-Bayesian Bound for Deep Learning • Deterministic model: a single hypothesis h with error $\epsilon(h)$ on the data • Stochastic model (Gibbs classifier): a distribution Q over hypotheses; sample h ~ Q, and the Gibbs error is $\epsilon(Q) = \mathbb{E}_{h\sim Q}[\epsilon(h)]$
- 18. PAC-Bayesian Bound for Deep Learning • Consider the sharpness of local minima: a single hypothesis h vs. a distribution of hypotheses Q • Flat minimum: low $\hat{\epsilon}(h)$ and low $\hat{\epsilon}(Q)$ • Sharp minimum: low $\hat{\epsilon}(h)$ but high $\hat{\epsilon}(Q)$
- 19. PAC-Bayesian Bound for Deep Learning • With probability 1-δ, the following inequality (PAC-Bayesian bound) holds: $\mathrm{KL}(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{\mathrm{KL}(Q\|P) + \log\frac{n}{\delta}}{n-1}$ • n: number of training instances • P: distribution of models before training (prior) • Q: distribution of models after training (posterior) • The left-hand side is the KL divergence between training error and testing error
- 20. PAC-Bayesian Bound for Deep Learning • Relaxed form: $\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{\mathrm{KL}(Q\|P) + \log\frac{n}{\delta} + 2}{2n-1}}$ • Under-fitting: high $\hat{\epsilon}(Q)$, low $\mathrm{KL}(Q\|P)$, low $\mathrm{KL}(\hat{\epsilon}(Q)\|\epsilon(Q))$ • Appropriate fitting: moderate $\hat{\epsilon}(Q)$, moderate $\mathrm{KL}(Q\|P)$, moderate $\mathrm{KL}(\hat{\epsilon}(Q)\|\epsilon(Q))$ • Over-fitting: low $\hat{\epsilon}(Q)$, high $\mathrm{KL}(Q\|P)$, high $\mathrm{KL}(\hat{\epsilon}(Q)\|\epsilon(Q))$ • (figure: training vs. testing data with prior P and posterior Q in each regime)
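The relaxed bound on this slide can also be evaluated numerically; a minimal sketch, assuming the form ε(Q) ≤ ε̂(Q) + sqrt((KL(Q||P) + log(n/δ) + 2)/(2n-1)) as written on the slide, with δ = 0.05 and a function name of my choosing (the KL values below are illustrative, not from the deck):

```python
import math

def pac_bayes_bound(train_err, kl_qp, n, delta=0.05):
    """Relaxed PAC-Bayesian bound:
    eps(Q) <= eps_hat(Q) + sqrt((KL(Q||P) + log(n/delta) + 2) / (2n - 1))."""
    return train_err + math.sqrt((kl_qp + math.log(n / delta) + 2) / (2 * n - 1))

# Unlike the VC bound, the penalty depends on the data through
# KL(Q||P): clean data -> posterior stays near the prior (small KL),
# noisy data -> posterior moves far from the prior (large KL).
clean = pac_bayes_bound(0.028, 100.0, 50_000)
noisy = pac_bayes_bound(0.112, 50_000.0, 50_000)
```

For the same n, a small KL(Q||P) keeps the bound non-vacuous, which is exactly the data dependence the next slides emphasize.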
- 21. PAC-Bayesian Bound for Deep Learning • The PAC-Bayesian bound is data-dependent: $\mathrm{KL}(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{\mathrm{KL}(Q\|P) + \log\frac{n}{\delta}}{n-1}$ • High VC dimension but clean data -> low $\mathrm{KL}(Q\|P)$ • High VC dimension and noisy data -> high $\mathrm{KL}(Q\|P)$
- 22. PAC-Bayesian Bound for Deep Learning • Data: MNIST (binary classification; class 0: digits 0-4, class 1: digits 5-9) • Model: 2- or 3-layer NN • Results (Training error / Testing error / VC bound / PAC-Bayesian bound):
  • Varying the width of the hidden layer (2-layer NN): width 600: 0.028 / 0.034 / 26m / 0.161; width 1200: 0.027 / 0.035 / 56m / 0.179
  • Varying the number of hidden layers: 2 layers: 0.028 / 0.034 / 26m / 0.161; 3 layers: 0.028 / 0.033 / 66m / 0.186; 4 layers: 0.027 / 0.032 / 121m / 0.201
  • Original MNIST vs. random labels: original: 0.028 / 0.034 / 26m / 0.161; random: 0.112 / 0.503 / 26m / 1.352
- 23. PAC-Bayesian Bound for Deep Learning • The PAC-Bayesian bound is data-dependent: $\mathrm{KL}(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{\mathrm{KL}(Q\|P) + \log\frac{n}{\delta}}{n-1}$ • Clean data (original MNIST labels) -> small $\mathrm{KL}(Q\|P)$ -> small $\epsilon(Q)$ • Noisy data (random labels) -> large $\mathrm{KL}(Q\|P)$ -> large $\epsilon(Q)$
- 24. Basic Concepts in Information Theory • Entropy • Joint Entropy • Conditional Entropy • Mutual Information • Cross Entropy
- 25. Entropy • The uncertainty of a random variable X: $H(X) = -\sum_{x\in\mathcal{X}} p(x)\log p(x)$ (log base 2) • P(X=1) = P(X=2) = 0.5: $H(X) = -2\times 0.5\log(0.5) = 1$ • P(X=1) = 0.9, P(X=2) = 0.1: $H(X) = -0.9\log(0.9) - 0.1\log(0.1) = 0.469$ • Uniform over four values, each with probability 0.25: $H(X) = -4\times 0.25\log(0.25) = 2$
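The three examples on this slide can be checked with a few lines of Python (the helper name is mine; log base 2, so entropy is in bits):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum_x p(x) * log2 p(x).
    Zero-probability outcomes contribute nothing and are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The slide's three distributions:
h_fair   = entropy([0.5, 0.5])                # 1 bit
h_biased = entropy([0.9, 0.1])                # ~0.469 bits
h_four   = entropy([0.25, 0.25, 0.25, 0.25])  # 2 bits
```

The biased coin is less uncertain than the fair one, and the uniform four-way choice is the most uncertain, matching the slide.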
- 26. Joint Entropy • The uncertainty of a joint distribution of two random variables X, Y: $H(X,Y) = -\sum_{x\in\mathcal{X},\,y\in\mathcal{Y}} p(x,y)\log p(x,y)$ • Example: P(X,Y) = 0.25 for all four (x, y) pairs: $H(X,Y) = -4\times 0.25\log(0.25) = 2$
- 27. Conditional Entropy • The uncertainty of a random variable Y given the value of another random variable X: $H(Y|X) = -\sum_{x\in\mathcal{X}} p(x)\sum_{y\in\mathcal{Y}} p(y|x)\log p(y|x)$ • X and Y independent (P(X,Y) rows: 0.4, 0.4 / 0.1, 0.1): H(X,Y) = 1.722, H(Y|X) = 1 • Y a stochastic function of X (rows: 0.4, 0.1 / 0.1, 0.4): H(X,Y) = 1.722, H(Y|X) = 0.722 • Y a deterministic function of X (rows: 0.5, 0 / 0, 0.5): H(X,Y) = 1, H(Y|X) = 0
- 28. Conditional Entropy • Information diagram: H(Y|X) is the part of H(Y) outside H(X), within H(X,Y) • Derivation: $H(Y|X) = -\sum_{x} p(x)\sum_{y} p(y|x)\log p(y|x) = -\sum_{x} p(x)\sum_{y} p(y|x)\big(\log p(x,y) - \log p(x)\big) = -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x} p(x)\log p(x) = H(X,Y) - H(X)$
- 29. Conditional Entropy • X and Y independent: H(X,Y) = 1.722, H(X) = 0.722, H(Y) = 1, so H(Y|X) = H(X,Y) - H(X) = 1 = H(Y) • Y a stochastic function of X: H(X,Y) = 1.722, H(X) = 1, H(Y) = 1, so H(Y|X) = H(X,Y) - H(X) = 0.722 • Y a deterministic function of X: H(X,Y) = 1, H(X) = 1, H(Y) = 1, so H(Y|X) = H(X,Y) - H(X) = 0
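The identity H(Y|X) = H(X,Y) - H(X) makes these numbers easy to verify; a minimal sketch (helper names are mine; joint tables are indexed as joint[x][y]):

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(Y|X) = H(X,Y) - H(X) for a joint probability table joint[x][y]."""
    h_xy = entropy([p for row in joint for p in row])   # joint entropy
    h_x = entropy([sum(row) for row in joint])          # marginal of X
    return h_xy - h_x

# The slide's three joint tables:
independent   = [[0.4, 0.4], [0.1, 0.1]]  # H(Y|X) = H(Y) = 1
stochastic    = [[0.4, 0.1], [0.1, 0.4]]  # H(Y|X) ~= 0.722
deterministic = [[0.5, 0.0], [0.0, 0.5]]  # H(Y|X) = 0
```

Knowing X removes no uncertainty in the independent case, some in the stochastic case, and all of it in the deterministic case.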
- 30. Mutual Information • The mutual dependence between two variables X, Y: $I(X;Y) = \sum_{x\in\mathcal{X},\,y\in\mathcal{Y}} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$ • X and Y independent (P(X,Y) rows: 0.4, 0.4 / 0.1, 0.1): I(X;Y) = 0 • Y a stochastic function of X (rows: 0.4, 0.1 / 0.1, 0.4): I(X;Y) = 0.278 • Y a deterministic function of X (rows: 0.5, 0 / 0, 0.5): I(X;Y) = 1
- 31. Mutual Information • Information diagram: I(X;Y) is the overlap of H(X) and H(Y), within H(X,Y) • Derivation: $I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = \sum_{x,y} p(x,y)\big(\log p(x,y) - \log p(x) - \log p(y)\big) = -\sum_{x} p(x)\log p(x) - \sum_{y} p(y)\log p(y) + \sum_{x,y} p(x,y)\log p(x,y) = H(X) + H(Y) - H(X,Y)$
- 32. Mutual Information • X and Y independent: H(X,Y) = 1.722, H(X) = 0.722, H(Y) = 1, so I(X;Y) = H(X) + H(Y) - H(X,Y) = 0 • Y a stochastic function of X: H(X,Y) = 1.722, H(X) = 1, H(Y) = 1, so I(X;Y) = 0.278 • Y a deterministic function of X: H(X,Y) = 1, H(X) = 1, H(Y) = 1, so I(X;Y) = 1
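The same three tables can be checked against I(X;Y) = H(X) + H(Y) - H(X,Y); a minimal sketch (helper names are mine):

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table joint[x][y]."""
    h_x = entropy([sum(row) for row in joint])           # marginal of X
    h_y = entropy([sum(col) for col in zip(*joint)])     # marginal of Y
    h_xy = entropy([p for row in joint for p in row])    # joint entropy
    return h_x + h_y - h_xy

independent   = [[0.4, 0.4], [0.1, 0.1]]  # I(X;Y) = 0
stochastic    = [[0.4, 0.1], [0.1, 0.4]]  # I(X;Y) ~= 0.278
deterministic = [[0.5, 0.0], [0.0, 0.5]]  # I(X;Y) = 1
```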
- 33. Entropy, Joint Entropy, Conditional Entropy & Mutual Information • (information diagram for three variables X, Y, Z, with regions H(X|Y,Z), H(Y|X,Z), H(Z|X,Y), I(X;Y|Z), I(X;Z|Y), I(Y;Z|X), and I(X;Y;Z))
- 34. Cross Entropy • $H_{p,q}(X) = -\sum_{x\in\mathcal{X}} p(x)\log q(x) = -\sum_{x\in\mathcal{X}} p(x)\log p(x) + \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)} = H_p(X) + \mathrm{KL}\big(p(X)\,\|\,q(X)\big)$
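The decomposition of cross entropy into entropy plus KL divergence can be verified numerically; a minimal sketch (helper names are mine; p and q below are arbitrary example distributions, not from the slides):

```python
import math

def entropy(p):
    """H_p(X) = -sum_x p(x) * log2 p(x)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H_{p,q}(X) = -sum_x p(x) * log2 q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x))."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Slide identity: H_{p,q}(X) = H_p(X) + KL(p(X) || q(X))
p = [0.9, 0.1]
q = [0.5, 0.5]
```

Since KL is non-negative, cross entropy is always at least the entropy of p, with equality exactly when q = p.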
- 35. Information in the Weights • Cause of Over-fitting • Information in the Weights as a Regularizer • Bounding the Information in the Weights • Connection with Flat Minimum • Connection with PAC-Bayesian Bound • Experiments
- 36. Information in the Weights
- 37. Cause of Over-fitting • Training loss (cross-entropy): $H_{p,q}(y|x,w) = -\mathbb{E}_{x,w}\big[\sum_{y} p(y|x,w)\log q(y|x,w)\big]$ • p: probability density function of the data • q: probability density function predicted by the model • x: input features of the training data • y: labels of the training data • w: weights of the model • θ: latent parameters of the data distribution
- 38. Cause of Over-fitting • $H_{p,q}(y|x,w) = H_p(y|x,w) + \mathbb{E}_{x,w}\,\mathrm{KL}\big(p(y|x,w)\,\|\,q(y|x,w)\big)$ • $H_p(y|x,w)$: the uncertainty of y given w and x (information diagram over $H_p(x)$, $H_p(y)$, $H_p(w)$)
- 39. Cause of Over-fitting • Lower $H_p(y|x,w)$ -> lower uncertainty of y given w and x -> lower training error • $H_{p,q}(y|x,w) = H_p(y|x,w) + \mathbb{E}_{x,w}\,\mathrm{KL}\big(p(y|x,w)\,\|\,q(y|x,w)\big)$ • Example, given an input x and a fixed w: p(y|x,w) = (0.9, 0.1, 0.0, 0.0) over four classes has lower $H_p(y|x,w)$; p(y|x,w) = (0.3, 0.3, 0.2, 0.2) has higher $H_p(y|x,w)$
- 40. Cause of Over-fitting • $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$ • θ: latent parameters of the (training & testing) data distribution
- 41. Cause of Over-fitting • θ: latent parameters of the (training & testing) data distribution • $H_p(y|x,\theta)$: noisy information and outliers in the training data (and noise and outliers in the testing data) • $I_p(y;\theta|x)$: useful information in the training data • (figure: fitted curves over training points, showing outliers and normal samples not in the training data)
- 42. Cause of Over-fitting • $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$ • $H_p(y|x,w)$: the uncertainty of y given w and x • $H_p(y|x,\theta)$: noisy information and outliers in the training data • $I(y;\theta|x,w)$: useful information not learned by the weights • $I(y;w|x,\theta)$: noisy information and outliers learned by the weights
- 43. Cause of Over-fitting • $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$ • (information diagram: $H_p(y|x,\theta)$ is the noisy information and outliers in the training data)
- 44. Cause of Over-fitting • Lower $H_p(y|x,\theta)$ -> less noise and fewer outliers in the training data • $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$ • (figure: a dataset with lower $H_p(y|x,\theta)$ vs. one with higher $H_p(y|x,\theta)$)
- 45. Cause of Over-fitting • $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$ • (information diagram: $I(y;\theta|x,w)$ is the useful information not learned by the weights)
- 46. Cause of Over-fitting • Lower $I(y;\theta|x,w)$ -> more useful information learned by the weights -> lower $H_p(y_{test}|x_{test},w)$ -> lower testing error • $I(y;\theta|x,w_2) < I(y;\theta|x,w_1) \Rightarrow H_p(y_{test}|x_{test},w_2) < H_p(y_{test}|x_{test},w_1)$ • $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$
- 47. Cause of Over-fitting • $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$ • (information diagram: $I(y;w|x,\theta)$ is the noisy information and outliers learned by the weights)
- 48. Cause of Over-fitting • Cause of over-fitting: the weights w memorize the noisy information in the training data • High VC dimension but clean data -> little noise to memorize • High VC dimension and noisy data -> much noise to memorize • $H_p(y|x,w) = H_p(y|x,\theta) + I(y;\theta|x,w) - I(y;w|x,\theta)$
- 49. Information in the Weights as a Regularizer • $I(y;w|x,\theta)$ is unknown and cannot be computed • $I(D;w)$, the information in the weights, is an upper bound on $I(y;w|x,\theta)$, where D = (x, y) is the dataset and $H_p(x,y) = H_p(D)$: $I(y;w|x,\theta) \le I(x,y;w|\theta) = I(D;w|\theta) \le I(D;w)$
- 50. Information in the Weights as a Regularizer • The actual data distribution p is unknown • Estimate $I_p(D;w)$ by $I_q(D;w)$: $I_p(D;w) \approx I_q(D;w)$ • New loss function with $I_q(D;w)$ as a regularizer: $\mathcal{L}(q(w|D)) = H_{p,q}(y|x,w) + I_q(D;w)$
- 51. Connection with Flat Minimum • A flat minimum has low information in the weights: $I_q(w;D) \le \frac{1}{2}K\big[\log\|\hat{w}\|_2^2 + \log\|H\|_{*} - K\log(\tfrac{K^2}{2})\big]$ • $\|H\|_{*}$: nuclear norm of the Hessian at the local minimum • Flat minimum -> low nuclear norm of the Hessian -> low information
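The nuclear norm in this bound is the sum of the absolute eigenvalues of the Hessian; for a symmetric 2x2 Hessian it has a closed form, which gives a tiny illustration of the flatness link (the function name and the example curvature values are mine):

```python
import math

def nuclear_norm_2x2(a, b, c):
    """Nuclear norm ||H||_* (sum of absolute eigenvalues) of a
    symmetric 2x2 Hessian [[a, b], [b, c]], using the closed-form
    eigenvalues (a + c)/2 +/- sqrt(((a - c)/2)**2 + b**2)."""
    mean = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return abs(mean + spread) + abs(mean - spread)

# A flatter minimum (small curvature) has a smaller nuclear norm,
# hence, by the slide's bound, less information in the weights.
flat  = nuclear_norm_2x2(0.1, 0.0, 0.2)   # ~= 0.3
sharp = nuclear_norm_2x2(5.0, 1.0, 8.0)
```

For a positive semidefinite Hessian the nuclear norm equals the trace, so the sharp example evaluates to 5 + 8 = 13.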
- 52. Connection with PAC-Bayesian Bound • Given a prior distribution p(w): $I_q(w;D) = \mathbb{E}_D\,\mathrm{KL}\big(q(w|D)\,\|\,q(w)\big) \le \mathbb{E}_D\,\mathrm{KL}\big(q(w|D)\,\|\,p(w)\big)$ • p(w): distribution of weights before training (prior) • q(w|D): distribution of weights after training on dataset D (posterior)
- 53. Connection with PAC-Bayesian Bound • Loss function with the regularizer (using $I_p(D;w) \approx I_q(D;w)$): $\mathcal{L}(q(w|D)) = H_{p,q}(y|x,w) + I_q(D;w) \approx H_{p,q}(y|x,w) + \mathbb{E}_D\,\mathrm{KL}\big(q(w|D)\,\|\,p(w)\big)$ • PAC-Bayesian bound: $\mathbb{E}_D\big[L_{test}(q(w|D))\big] \le H_{p,q}(y|x,w) + L_{max}\sqrt{\frac{\mathbb{E}_D\,\mathrm{KL}(q(w|D)\,\|\,p(w))}{2n}}$ • $L_{test}$: test error of the network with weights drawn from q(w|D) • $L_{max}$: maximum per-sample loss
- 54. Experiments • Random labels • (figure: information complexity $\mathcal{L}(q(w|D)) = H_{p,q}(y|x,w) + I_q(D;w)$ vs. dataset size)
- 55. Experiments • Real labels • (figure: information complexity $\mathcal{L}(q(w|D)) = H_{p,q}(y|x,w) + I_q(D;w)$ vs. dataset size)
- 56. Experiments • Information in the weights vs. percentage of corrupted labels
