
Information in the Weights


  1. 1. Information in the Weights Mark Chang 2020/06/19
  2. 2. Outline • Traditional Machine Learning vs. Deep Learning • Basic Concepts in Information Theory • Information in the Weights
  3. 3. Traditional Machine Learning vs. Deep Learning • VC Bound • Generalization in Deep Learning • PAC-Bayesian Bound for Deep Learning
  4. 4. VC Bound • What causes over-fitting? • Too many parameters -> over-fitting? • Too many parameters -> high VC dimension -> over-fitting? • …? (Figure: three hypotheses h fitted to training and testing data: under-fitting with too few parameters, appropriate fitting with adequate parameters, over-fitting with too many parameters.)
  5. 5. VC Bound • Over-fitting is caused by high VC dimension • For a given dataset (n is constant), search for the best VC dimension: with probability 1 - δ, ε(h) ≤ ε̂(h) + √((8/n) log(4(2n)^d / δ)), where n = number of training instances and d = VC dimension (model complexity). (Figure: training and testing error vs. d; testing error is minimized at the best VC dimension and over-fitting sets in as d approaches n, where the data can be shattered. A numeric sketch of this bound appears after the transcript.)
  6. 6. VC Dimension • VC dimension of a linear model: O(W), where W = number of parameters • VC dimension of fully-connected neural networks: O(LW log W), where L = number of layers and W = number of parameters • The VC dimension is independent of the data distribution and depends only on the model
  7. 7. Generalization in Deep Learning • Consider a toy example: a neural network with input 780 and hidden 600 has d ≈ 26M; the dataset has n = 50,000; d >> n, but the testing error is < 0.1
  8. 8. Generalization in Deep Learning • However, when you are solving your own problem: a neural network (input 780, hidden 600) on your dataset (n = 50,000, 10 classes) gets testing error = 0.6, i.e. over-fitting, so you reduce the VC dimension (input 780, hidden 200), and so on
  9. 9. Generalization in Deep Learning (Figure: error vs. d for the VC bound ε(h) ≤ ε̂(h) + √((8/n) log(4(2n)^d / δ)); beyond d = n lies the over-fitting regime, and over-parameterized models sit far to the right with extremely high VC dimension.)
  10. 10. Generalization in Deep Learning (ICLR 2017)
  11. 11. Generalization in Deep Learning (Figure: a deep neural network (Inception) reaches training error ε̂(h) ≈ 0 on the original dataset (CIFAR), on the same features with random labels, and on random-noise features, i.e. it shatters the data.)
  12. 12. Generalization in Deep Learning (Figure: the same three settings with their testing errors: original dataset (CIFAR) ε(h) ≈ 0.14, random labels ε(h) ≈ 0.9, random-noise features ε(h) ≈ 0.9, while the training error is ε̂(h) ≈ 0 in all three cases.)
  13. 13. Generalization in Deep Learning • Testing error depends on the data distribution (original dataset: ε(h) ≈ 0.14; random labels: ε(h) ≈ 0.9) • However, the VC bound does not depend on the data distribution
  14. 14. Generalization in Deep Learning
  15. 15. Generalization in Deep Learning • high sharpness -> high testing error
  16. 16. PAC-Bayesian Bound for Deep Learning UAI 2017
  17. 17. PAC-Bayesian Bound for Deep Learning • Deterministic model: a single hypothesis h with error ε(h) on the data • Stochastic model (Gibbs classifier): a distribution Q over hypotheses; sample h ~ Q and measure the Gibbs error ε(Q) = E_{h~Q}[ε(h)] (a sampling sketch appears after the transcript)
  18. 18. PAC-Bayesian Bound for Deep Learning • Consider the sharpness of local minima: for a single hypothesis h vs. a distribution of hypotheses Q, a flat minimum has low ε̂(h) and low ε̂(Q), while a sharp minimum has low ε̂(h) but high ε̂(Q)
  19. 19. PAC-Bayesian Bound for Deep Learning • With probability 1 - δ, the following inequality (PAC-Bayesian bound) is satisfied: KL(ε̂(Q) || ε(Q)) ≤ (KL(Q||P) + log(n/δ)) / (n - 1), where n is the number of training instances, P is the distribution of models before training (prior), Q is the distribution of models after training (posterior), KL(Q||P) is the KL divergence between P and Q, and KL(ε̂(Q)||ε(Q)) is the KL divergence between training and testing error
  20. 20. PAC-Bayesian Bound for Deep Learning • Relaxed form: ε(Q) ≤ ε̂(Q) + √((KL(Q||P) + log(n/δ) + 2) / (2n - 1)) (a numeric sketch appears after the transcript). (Figure: training vs. testing data and the prior P vs. posterior Q in three regimes: under-fitting has high ε̂(Q), low KL(Q||P) and low KL(ε̂(Q)||ε(Q)); appropriate fitting has moderate ε̂(Q), moderate KL(Q||P) and moderate KL(ε̂(Q)||ε(Q)); over-fitting has low ε̂(Q), high KL(Q||P) and high KL(ε̂(Q)||ε(Q)).)
  21. 21. PAC-Bayesian Bound for Deep Learning • The PAC-Bayesian bound is data-dependent: KL(ε̂(Q) || ε(Q)) ≤ (KL(Q||P) + log(n/δ)) / (n - 1) • High VC dimension, but clean data -> low KL(Q||P) • High VC dimension, and noisy data -> high KL(Q||P)
  22. 22. PAC-Bayesian Bound for Deep Learning • Data: MNIST (binary classification, class 0: digits 0-4, class 1: digits 5-9) • Model: 2- or 3-layer NN • Varying the width of the hidden layer (2-layer NN): width 600: training 0.028, testing 0.034, VC bound 26M, PAC-Bayesian bound 0.161; width 1200: training 0.027, testing 0.035, VC bound 56M, PAC-Bayesian bound 0.179 • Varying the number of hidden layers: 2 layers: training 0.028, testing 0.034, VC bound 26M, PAC-Bayesian bound 0.161; 3 layers: training 0.028, testing 0.033, VC bound 66M, PAC-Bayesian bound 0.186; 4 layers: training 0.027, testing 0.032, VC bound 121M, PAC-Bayesian bound 0.201 • Original MNIST vs. random labels: original: training 0.028, testing 0.034, VC bound 26M, PAC-Bayesian bound 0.161; random: training 0.112, testing 0.503, VC bound 26M, PAC-Bayesian bound 1.352
  23. 23. PAC-Bayesian Bound for Deep Learning • The PAC-Bayesian bound is data-dependent: KL(ε̂(Q) || ε(Q)) ≤ (KL(Q||P) + log(n/δ)) / (n - 1) • Clean data (original MNIST labels) -> small KL(Q||P) -> small ε(Q) • Noisy data (random labels) -> large KL(Q||P) -> large ε(Q)
  24. 24. Basic Concepts in Information Theory • Entropy • Joint Entropy • Conditional Entropy • Mutual Information • Cross Entropy
  25. 25. Entropy • The uncertainty of a random variable X: H(X) = -Σ_{x∈X} p(x) log p(x) • Examples (log base 2): P(X=1) = P(X=2) = 0.5 gives H(X) = -2 × 0.5 log(0.5) = 1; P(X=1) = 0.9, P(X=2) = 0.1 gives H(X) = -0.9 log(0.9) - 0.1 log(0.1) ≈ 0.469; P(X=x) = 0.25 for x ∈ {1, 2, 3, 4} gives H(X) = -4 × 0.25 log(0.25) = 2 (a numpy sketch reproducing these numbers and those on the following slides appears after the transcript)
  26. 26. Joint Entropy • The uncertainty of a joint distribution involving two random variables X, Y: H(X, Y) = -Σ_{x∈X, y∈Y} p(x, y) log p(x, y) • Example: P(X, Y) = 0.25 for all four pairs (X, Y) ∈ {1, 2}² gives H(X, Y) = -4 × 0.25 log(0.25) = 2
  27. 27. Conditional Entropy • The uncertainty of a random variable Y given the value of another random variable X: H(Y|X) = -Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x) • Three joint tables P(X, Y) over X, Y ∈ {1, 2}: (0.4, 0.4; 0.1, 0.1), where X and Y are independent: H(X, Y) = 1.722, H(Y|X) = 1; (0.4, 0.1; 0.1, 0.4), where Y is a stochastic function of X: H(X, Y) = 1.722, H(Y|X) = 0.722; (0.5, 0; 0, 0.5), where Y is a deterministic function of X: H(X, Y) = 1, H(Y|X) = 0
  28. 28. Conditional Entropy • H(Y|X) = -Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x) = -Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x)(log p(x, y) - log p(x)) = -Σ_{x∈X, y∈Y} p(x, y) log p(x, y) + Σ_{x∈X} p(x) log p(x) = H(X, Y) - H(X) (Information diagram: H(Y|X) is the part of H(Y) lying outside H(X) within H(X, Y).)
  29. 29. Conditional Entropy • X and Y independent (table 0.4, 0.4; 0.1, 0.1): H(X, Y) = 1.722, H(X) = 0.722, H(Y) = 1, so H(Y|X) = H(X, Y) - H(X) = 1 = H(Y) • Y a stochastic function of X (table 0.4, 0.1; 0.1, 0.4): H(X, Y) = 1.722, H(X) = 1, H(Y) = 1, so H(Y|X) = H(X, Y) - H(X) = 0.722 • Y a deterministic function of X (table 0.5, 0; 0, 0.5): H(X, Y) = 1, H(X) = 1, H(Y) = 1, so H(Y|X) = H(X, Y) - H(X) = 0 (Information diagrams: for independent X, Y, H(Y|X) = H(Y); for a deterministic function, H(Y|X) = 0.)
  30. 30. Mutual Information • The mutual dependence between two variables X, Y: I(X; Y) = Σ_{x∈X, y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) ) • X and Y independent (table 0.4, 0.4; 0.1, 0.1): I(X; Y) = 0 • Y a stochastic function of X (table 0.4, 0.1; 0.1, 0.4): I(X; Y) = 0.278 • Y a deterministic function of X (table 0.5, 0; 0, 0.5): I(X; Y) = 1
  31. 31. Mutual Information • I(X; Y) = Σ_{x∈X, y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) ) = Σ_{x∈X, y∈Y} p(x, y)(-log p(x) - log p(y) + log p(x, y)) = -Σ_{x∈X} p(x) log p(x) - Σ_{y∈Y} p(y) log p(y) + Σ_{x∈X, y∈Y} p(x, y) log p(x, y) = H(X) + H(Y) - H(X, Y) (Information diagram: I(X; Y) is the overlap of H(X) and H(Y) inside H(X, Y).)
  32. 32. Mutual Information • X and Y independent: H(X, Y) = 1.722, H(X) = 0.722, H(Y) = 1, so I(X; Y) = H(X) + H(Y) - H(X, Y) = 0 • Y a stochastic function of X: H(X, Y) = 1.722, H(X) = 1, H(Y) = 1, so I(X; Y) = H(X) + H(Y) - H(X, Y) = 0.278 • Y a deterministic function of X: H(X, Y) = 1, H(X) = 1, H(Y) = 1, so I(X; Y) = H(X) + H(Y) - H(X, Y) = 1 (Information diagrams: no overlap when independent, partial overlap for a stochastic function, full overlap for a deterministic function.)
  33. 33. Entropy, Joint Entropy, Conditional Entropy & Mutual Information (Information diagram for three variables X, Y, Z: the regions are H(X|Y, Z), H(Y|X, Z), H(Z|X, Y), I(X; Y|Z), I(X; Z|Y), I(Y; Z|X) and I(X; Y; Z), inside H(X), H(Y), H(Z).)
  34. 34. Cross Entropy • H_{p,q}(X) = -Σ_{x∈X} p(x) log q(x) = -Σ_{x∈X} p(x) log p(x) + Σ_{x∈X} p(x) log( p(x) / q(x) ) = H_p(X) + KL( p(X) || q(X) ) (a short numpy check of this decomposition appears after the transcript)
  35. 35. Information in the Weights • Cause of Over-fitting • Information in the Weights as a Regularizer • Bounding the Information in the Weights • Connection with Flat Minimum • Connection with PAC-Bayesian Bound • Experiments
  36. 36. Information in the Weights
  37. 37. Cause of Over-fitting • Training loss (cross-entropy): H_{p,q}(y|x, w) = E_{x,w}[ -Σ_y p(y|x, w) log q(y|x, w) ] • p: probability density function of the data • q: probability density function predicted by the model • x: input features of the training data • y: labels of the training data • w: weights of the model • θ: latent parameters of the data distribution
  38. 38. Cause of Over-fitting • H_{p,q}(y|x, w) = H_p(y|x, w) + E_{x,w}[ KL( p(y|x, w) || q(y|x, w) ) ] (Information diagram: H_p(x), H_p(y), H_p(w), with H_p(y|x, w) = the uncertainty of y given w and x.)
  39. 39. Cause of Over-fitting • Lower H_p(y|x, w) -> lower uncertainty of y given w and x -> lower training error • Example: given an input x and a fixed w, the prediction {p(y=1|x, w) = 0.9, p(y=2|x, w) = 0.1, p(y=3|x, w) = 0.0, p(y=4|x, w) = 0.0} has lower H_p(y|x, w), while {p(y=1|x, w) = 0.3, p(y=2|x, w) = 0.3, p(y=3|x, w) = 0.2, p(y=4|x, w) = 0.2} has higher H_p(y|x, w) • Recall H_{p,q}(y|x, w) = H_p(y|x, w) + E_{x,w}[ KL( p(y|x, w) || q(y|x, w) ) ]
  40. 40. Cause of Over-fitting • H_{p,q}(y|x, w) = H_p(y|x, w) + E_{x,w}[ KL( p(y|x, w) || q(y|x, w) ) ] • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) • θ: latent parameters of the (training & testing) data distribution (Information diagram: H_p(θ), H_p(y|x), H_p(y_test|x_test).)
  41. 41. Cause of Over-fitting • θ: latent parameters of the (training & testing) data distribution (Figure: scatter plots of training and testing data over x and y. I_p(y; θ|x) is the useful information in the training data; H_p(y|x, θ) is the noisy information and outliers in the training data; the testing data contain their own noise and outliers, plus normal samples not in the training data.)
  42. 42. Cause of Over-fitting • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagram: H_p(y|x, w) is the uncertainty of y given w and x; H_p(y|x, θ) is the noisy information and outliers in the training data; I(y; θ|x, w) is the useful information not learned by the weights; I(y; w|x, θ) is the noisy information and outliers learned by the weights.)
  43. 43. Cause of Over-fitting • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagram highlighting H_p(y|x, θ): the noisy information and outliers in the training data.)
  44. 44. Cause of Over-fitting • Lower H_p(y|x, θ) -> less noise and fewer outliers in the training data • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Figure: two scatter plots of training data, one with lower H_p(y|x, θ) and one with higher H_p(y|x, θ).)
  45. 45. Cause of Over-fitting • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagram highlighting I(y; θ|x, w): the useful information not learned by the weights.)
  46. 46. Cause of Over-fitting • Lower I(y; θ|x, w) -> more useful information learned by the weights -> lower H_p(y_test|x_test, w) -> lower testing error • I(y; θ|x, w2) < I(y; θ|x, w1) implies H_p(y_test|x_test, w2) < H_p(y_test|x_test, w1) • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagrams comparing two sets of weights w1 and w2 against H_p(θ), H_p(y|x), H_p(y_test|x_test).)
  47. 47. Cause of Over-fitting • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Information diagram highlighting I(y; w|x, θ): the noisy information and outliers learned by the weights.)
  48. 48. Cause of Over-fitting • Cause of over-fitting: the weights memorize the noisy information in the training data • High VC dimension, but clean data -> little noise to memorize • High VC dimension, and noisy data -> much noise to memorize • H_p(y|x, w) = H_p(y|x, θ) + I(y; θ|x, w) - I(y; w|x, θ) (Figure: training and testing data with the fitted weights w in both cases.)
  49. 49. Information in the Weights as a Regularizer • I(y; w|x, θ) is unknown and cannot be computed • I(D; w), the information in the weights, is an upper bound on I(y; w|x, θ): I(y; w|x, θ) ≤ I(y, x; w|θ) = I(D; w|θ) ≤ I(D; w), where D = (x, y) is the training dataset, so H_p(x, y) = H_p(D) (Information diagram: I(y; w|x, θ) sits inside I(y, x; w|θ) = I(D; w|θ), which sits inside I(D; w).)
  50. 50. Information in the Weights as a Regularizer • The actual data distribution p is unknown • Estimate I(D; w) by I_p(D; w) ≈ I_q(D; w) • New loss function, with I_q(D; w) as a regularizer: L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w) (a code sketch of this objective appears after the transcript)
  51. 51. Connection with Flat Minimum • A flat minimum has low information in the weights: I_q(w; D) ≤ (1/2) K [ log ||ŵ||₂² + log ||H||* - K log(K²β/2) ], where ||H||* is the nuclear norm of the Hessian at the local minimum • Flat minimum -> low nuclear norm of the Hessian -> low information
  52. 52. Connection with PAC-Bayesian Bound • Given a prior distribution p(w) (the distribution of weights before training) and the posterior q(w|D) (the distribution of weights after training on dataset D), we have: I_q(w; D) = E_D[ KL( q(w|D) || q(w) ) ] ≤ E_D[ KL( q(w|D) || q(w) ) ] + KL( q(w) || p(w) ) = E_D[ KL( q(w|D) || p(w) ) ]
  53. 53. Connection with PAC-Bayesian Bound • Loss function with the regularizer (using I_p(D; w) ≈ I_q(D; w)): L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w) ≤ H_{p,q}(y|x, w) + E_D[ KL( q(w|D) || p(w) ) ] • PAC-Bayesian bound: E_D[ L_test(q(w|D)) ] ≤ H_{p,q}(y|x, w) + L_max E_D[ KL( q(w|D) || p(w) ) ] / (n(1 - 1/(2β))) • L_test: test error of the network with weights drawn from q(w|D) • L_max: maximum per-sample loss
  54. 54. Experiments • Random labels (Figure: information complexity vs. dataset size, for the loss L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w).)
  55. 55. Experiments • Real labels (Figure: information complexity vs. dataset size, for the loss L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w).)
  56. 56. Experiments • Information in the weights vs. percentage of corrupted labels
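
Code sketch for slide 5: a minimal numeric evaluation of the VC bound ε(h) ≤ ε̂(h) + √((8/n) log(4(2n)^d / δ)) as reconstructed above. The training error, the confidence level δ = 0.05 and the small comparison model (d = 100) are illustrative assumptions; n = 50,000 and d ≈ 26M follow the toy example on slide 7.

```python
import math

def vc_bound(train_error, n, d, delta=0.05):
    """Upper bound on the test error from the VC bound
    eps(h) <= eps_hat(h) + sqrt((8/n) * log(4 * (2n)^d / delta)).
    log((2n)^d) is expanded as d * log(2n) so a huge d does not overflow."""
    penalty = math.sqrt(8.0 / n * (math.log(4.0 / delta) + d * math.log(2 * n)))
    return train_error + penalty

# Slide 7's toy example: n = 50,000 training instances, d ~ 26M.
print(vc_bound(train_error=0.028, n=50_000, d=26_000_000))  # >> 1: the bound is vacuous
# A hypothetical low-capacity model for comparison.
print(vc_bound(train_error=0.028, n=50_000, d=100))         # ~0.46: non-vacuous
```

This is why the deck turns to data-dependent bounds: for over-parameterized networks the capacity term alone makes the VC bound meaningless.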
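
Code sketch for slides 17 and 18: estimating the Gibbs error ε(Q) = E_{h~Q}[ε(h)] by sampling hypotheses, on a hypothetical 1-D threshold classifier. The data, the Gaussian posterior Q over the threshold, and the two chosen minima are assumptions made up for illustration; they only mimic the slides' point that a sharp minimum keeps ε̂(h) low while ε̂(Q) grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D data: class 0 on (-2, 0), class 1 on (1, 3), margin (0, 1).
x = np.concatenate([rng.uniform(-2, 0, 50), rng.uniform(1, 3, 50)])
y = np.concatenate([np.zeros(50), np.ones(50)])

def error(threshold):
    """Training error of the threshold classifier h_t(x) = 1[x > t]."""
    return np.mean((x > threshold).astype(float) != y)

def gibbs_error(mu, sigma, samples=10_000):
    """eps_hat(Q) = E_{h~Q}[eps_hat(h)] with Q = N(mu, sigma^2) over the threshold."""
    thresholds = rng.normal(mu, sigma, samples)
    return float(np.mean([error(t) for t in thresholds]))

flat_t, sharp_t = 0.5, 0.99   # middle of the margin vs. right next to a training point
print(error(flat_t), error(sharp_t))                        # both single hypotheses: 0.0
print(gibbs_error(flat_t, 0.3), gibbs_error(sharp_t, 0.3))  # flat stays near 0; sharp is clearly higher
```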
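
Code sketch for slides 19 and 20: evaluating the relaxed PAC-Bayesian bound ε(Q) ≤ ε̂(Q) + √((KL(Q||P) + log(n/δ) + 2) / (2n - 1)). The KL values and δ below are made-up illustrations of the clean-data vs. noisy-data contrast on slide 21; they are not the numbers behind slide 22's table.

```python
import math

def pac_bayes_bound(gibbs_train_error, kl_q_p, n, delta=0.05):
    """Relaxed PAC-Bayesian bound:
    eps(Q) <= eps_hat(Q) + sqrt((KL(Q||P) + log(n/delta) + 2) / (2n - 1))."""
    return gibbs_train_error + math.sqrt(
        (kl_q_p + math.log(n / delta) + 2) / (2 * n - 1))

n = 50_000
# Clean data: the posterior can stay close to the prior (hypothetical small KL).
print(pac_bayes_bound(0.03, kl_q_p=500.0, n=n))       # ~0.10: a meaningful guarantee
# Noisy/random labels: the posterior must move far from the prior (hypothetical large KL).
print(pac_bayes_bound(0.11, kl_q_p=80_000.0, n=n))    # ~1.0: the bound becomes vacuous
```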
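
Code sketch for slides 25 to 32: a small numpy script that reproduces the worked entropy, conditional-entropy and mutual-information numbers (in bits, log base 2). The three joint tables are the ones used on the slides.

```python
import numpy as np

def entropy(p):
    """H = -sum p * log2(p), skipping zero-probability entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def joint_quantities(pxy):
    """Entropies and mutual information of a joint table pxy[i, j] = P(X=i+1, Y=j+1)."""
    h_x, h_y, h_xy = entropy(pxy.sum(axis=1)), entropy(pxy.sum(axis=0)), entropy(pxy)
    return {"H(X)": h_x, "H(Y)": h_y, "H(X,Y)": h_xy,
            "H(Y|X)": h_xy - h_x, "I(X;Y)": h_x + h_y - h_xy}

# Slide 25: 1, ~0.469 and 2 bits.
print(entropy([0.5, 0.5]), entropy([0.9, 0.1]), entropy([0.25] * 4))

# Slides 26-32: the three joint tables.
tables = {
    "independent":   np.array([[0.4, 0.4], [0.1, 0.1]]),  # X and Y independent
    "stochastic":    np.array([[0.4, 0.1], [0.1, 0.4]]),  # Y a stochastic function of X
    "deterministic": np.array([[0.5, 0.0], [0.0, 0.5]]),  # Y a deterministic function of X
}
for name, pxy in tables.items():
    print(name, joint_quantities(pxy))
# independent:   H(X,Y)=1.722, H(Y|X)=1.000, I(X;Y)=0.000
# stochastic:    H(X,Y)=1.722, H(Y|X)=0.722, I(X;Y)=0.278
# deterministic: H(X,Y)=1.000, H(Y|X)=0.000, I(X;Y)=1.000
```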
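
Code sketch for slide 34: a quick check that H_{p,q}(X) = H_p(X) + KL(p||q); the particular p and q are arbitrary illustrative distributions.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # the true distribution
q = np.array([0.4, 0.4, 0.2])   # the model's (mismatched) distribution

cross_entropy = -np.sum(p * np.log2(q))
h_p           = -np.sum(p * np.log2(p))
kl_pq         =  np.sum(p * np.log2(p / q))

print(cross_entropy, h_p + kl_pq)   # identical: H_{p,q}(X) = H_p(X) + KL(p || q)
```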
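
Code sketch for slides 50 to 53: evaluating the regularized objective L(q(w|D)) = H_{p,q}(y|x, w) + I_q(D; w), with the intractable I_q(D; w) replaced by its tractable upper bound KL(q(w|D) || p(w)) from slide 52. The model (a tiny logistic regression with a diagonal-Gaussian posterior over its weights), the unit-Gaussian prior, the dataset and all numbers are assumptions for illustration, not the architecture or training setup behind the slides' experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny dataset: 2 features, binary labels.
X = rng.normal(size=(128, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def cross_entropy(mu, log_sigma, n_samples=64):
    """Monte-Carlo estimate of H_{p,q}(y|x, w) = E_{w~q(w|D)}[-log q(y|x, w)],
    with a diagonal-Gaussian posterior q(w|D) = N(mu, diag(exp(log_sigma))^2)
    over the weights of a logistic-regression model."""
    losses = []
    for _ in range(n_samples):
        w = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)  # sample w ~ q(w|D)
        prob = 1.0 / (1.0 + np.exp(-(X @ w)))
        losses.append(-np.mean(y * np.log(prob + 1e-12) + (1 - y) * np.log(1 - prob + 1e-12)))
    return float(np.mean(losses))

def kl_to_prior(mu, log_sigma, prior_sigma=1.0):
    """KL( N(mu, sigma^2) || N(0, prior_sigma^2) ): the tractable stand-in for I_q(D; w)."""
    sigma2 = np.exp(2.0 * log_sigma)
    return float(0.5 * np.sum(sigma2 / prior_sigma**2 + mu**2 / prior_sigma**2
                              - 1.0 - 2.0 * log_sigma + 2.0 * np.log(prior_sigma)))

def objective(mu, log_sigma):
    """Surrogate objective: cross-entropy plus KL(q(w|D) || p(w)),
    the upper bound on H_{p,q}(y|x, w) + I_q(D; w) used on slide 53."""
    return cross_entropy(mu, log_sigma) + kl_to_prior(mu, log_sigma)

mu = np.array([1.0, 0.5])
print(objective(mu, log_sigma=np.array([-2.0, -2.0])))  # confident posterior: lower cross-entropy term, larger KL term
print(objective(mu, log_sigma=np.array([0.0, 0.0])))    # broad posterior: smaller KL term, noisier predictions
```

Minimizing this surrogate over mu and log_sigma (for instance by gradient descent with the reparameterization trick) is one standard way to penalize the information the weights carry about the dataset.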
