PAC-Bayesian Bound for Deep Learning

  1. PAC-Bayesian Bound for Deep Learning. Mark Chang, 2019/11/26
  2. Outline • VC Dimension & VC Bound • Generalization in Deep Learning • PAC-Bayesian Bound • PAC-Bayesian Bound for Deep Learning • Tips for Training Deep Neural Networks
  3. VC Dimension & VC Bound • (Diagram) Training and testing data are sampled from the data distribution; the training algorithm updates the hypothesis h to minimize the training error ε̂(h), while ε(h) denotes the testing error.
  4. VC Dimension & VC Bound • Learning is feasible when a small ε̂(h) implies a small ε(h). • With probability 1-δ, the following inequality (VC bound) is satisfied: $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$, where n is the number of training instances and d is the VC dimension (model complexity). (A numerical sketch of this bound appears after the slide list.)
  5. VC Dimension & VC Bound • The VC dimension of a 2D linear model is 3. (Figure: n = 2, shatter; n = 3, shatter; n = 4, not shatter.)
  6. VC Dimension & VC Bound • For a fixed n, the bound $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$ grows with d. (Figure: error vs. d (VC dimension), with over-fitting as d approaches n.)
  7. Generalization in Deep Learning • Consider an MNIST toy example: a 2-layer NN (input: 780, hidden: 600) has d = 26M, while the MNIST dataset has n = 50,000, so d >> n.
  8. Generalization in Deep Learning • (Figure: features and binary labels for three datasets.) The same 2-layer NN reaches training error ε̂(h) ≈ 0 on original MNIST, on MNIST with random labels, and on random noise features: it shatters the training data.
  9. Generalization in Deep Learning • Although ε̂(h) ≈ 0 in all three cases, the testing errors differ sharply: ε(h) ≈ 0.02 on original MNIST, but ε(h) ≈ 0.5 on random labels and on random noise features.
  10. PAC-Bayesian Bound • Introduced by McAllester (COLT 1999).
  11. PAC-Bayesian Bound • P (prior): the distribution over hypotheses before training. • Q (posterior): the distribution over hypotheses after the training algorithm has seen the training data.
  12. PAC-Bayesian Bound • (Diagram) Training and testing data are sampled from the data distribution; the training algorithm minimizes the training error to update the distribution Q over hypotheses, and hypotheses h are sampled from Q. The training error is $\hat{\epsilon}(Q) = \mathbb{E}_{h \sim Q}[\hat{\epsilon}(h)]$ and the testing error is $\epsilon(Q) = \mathbb{E}_{h \sim Q}[\epsilon(h)]$.
  13. PAC-Bayesian Bound • With probability 1-δ, the following inequality (PAC-Bayesian bound) is satisfied: $\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{\mathrm{KL}(Q\|P) + \log\frac{n}{\delta} + 2}{2n - 1}}$, where n is the number of training instances and KL(Q||P) is the KL divergence between the posterior Q and the prior P. (A numerical sketch of this bound appears after the slide list.)
  14. PAC-Bayesian Bound • (Figure: three fits of the same training and testing data.) Under-fitting: low KL(Q||P), low |ε(Q) - ε̂(Q)|. Appropriate fitting: moderate KL(Q||P), moderate |ε(Q) - ε̂(Q)|. Over-fitting: high KL(Q||P), high |ε(Q) - ε̂(Q)|.
  15. PAC-Bayesian Bound for Deep Learning • Based on Dziugaite & Roy (UAI 2017).
  16. PAC-Bayesian Bound for Deep Learning • Deterministic neural network: fixed w, b compute y = wx + b. • Stochastic neural network (SNN): w, b are sampled from a distribution before computing y = wx + b. (See the stochastic-layer sketch after the slide list.)
  17. PAC-Bayesian Bound for Deep Learning • With probability 1-δ, the following inequality (PAC-Bayesian bound) is satisfied: $\mathrm{KL}(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{\mathrm{KL}(Q\|P) + \log\frac{n}{\delta}}{n - 1}$, where P is the SNN before training (prior), Q is the SNN after training (posterior), and the left-hand side is the KL divergence between Bernoulli distributions with parameters ε̂(Q) and ε(Q). (Inverting this into a bound on ε(Q) is sketched after the slide list.)
  18. PAC-Bayesian Bound for Deep Learning • Under $\mathrm{KL}(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{\mathrm{KL}(Q\|P) + \log\frac{n}{\delta}}{n - 1}$: original MNIST is an easy problem -> small KL(Q||P) -> small ε(Q); random labels and random noise features are hard problems -> large KL(Q||P) -> large ε(Q).
  19. PAC-Bayesian Bound for Deep Learning • (Figure: experimental results on original MNIST vs. random labels, for networks with one, two, or three hidden layers of 600-1200 neurons.)
  20. Tips for Training Deep Neural Networks • Traditional machine learning, governed by $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\frac{4(2n)^d}{\delta}}$, reduces d: fewer parameters, weight decay, early stopping. • Modern deep learning, governed by $\mathrm{KL}(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{\mathrm{KL}(Q\|P) + \log\frac{n}{\delta}}{n - 1}$, reduces KL(Q||P): fewer parameters, weight decay (?), early stopping (?), and starting from a good prior P (transfer learning).
  21. About the Speaker • Mark Chang • Email: ckmarkoh at gmail dot com • Facebook: https://www.facebook.com/ckmarkoh.chang
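
The VC bound on slide 4 is easy to evaluate numerically. Below is a minimal sketch (the function name `vc_bound` and the example numbers, taken loosely from slide 7, are my own illustration); it mainly shows how vacuous the bound becomes once d >> n.

```python
import math

def vc_bound(train_err, n, d, delta=0.05):
    # VC bound from slide 4: eps(h) <= eps_hat(h) + sqrt((8/n) * log(4*(2n)^d / delta)).
    # Expanding the log keeps the computation stable for large d.
    complexity = math.sqrt((8.0 / n) * (math.log(4.0 / delta) + d * math.log(2.0 * n)))
    return train_err + complexity

# Illustrative numbers only: with d >> n the bound is far above 1, i.e. vacuous.
print(vc_bound(train_err=0.02, n=50_000, d=26_000_000))
```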
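Slide 13's PAC-Bayesian bound can be evaluated the same way. This is a minimal sketch under my own naming (`pac_bayes_bound`) with illustrative KL values; it shows that the bound is controlled by KL(Q||P) relative to n rather than by a parameter count.

```python
import math

def pac_bayes_bound(train_err_q, kl_qp, n, delta=0.05):
    # PAC-Bayesian bound from slide 13:
    # eps(Q) <= eps_hat(Q) + sqrt((KL(Q||P) + log(n/delta) + 2) / (2n - 1)).
    return train_err_q + math.sqrt((kl_qp + math.log(n / delta) + 2.0) / (2.0 * n - 1.0))

# Illustrative only: the bound tightens as KL(Q||P) shrinks relative to n.
for kl in (10.0, 1_000.0, 100_000.0):
    print(f"KL={kl:>9.0f}  bound={pac_bayes_bound(0.02, kl, n=50_000):.3f}")
```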
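The stochastic neural network on slide 16 replaces fixed weights with a distribution that is sampled at every forward pass. The sketch below assumes a diagonal Gaussian posterior over a single linear layer; `snn_forward` and the toy shapes are my own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def snn_forward(x, w_mean, w_logstd, b_mean, b_logstd):
    # Stochastic linear layer (slide 16): sample w, b from a diagonal Gaussian
    # posterior Q, then compute y = w x + b with the sampled hypothesis.
    w = w_mean + np.exp(w_logstd) * rng.standard_normal(w_mean.shape)
    b = b_mean + np.exp(b_logstd) * rng.standard_normal(b_mean.shape)
    return w @ x + b

# Toy 3 -> 2 layer: each call samples a different hypothesis h ~ Q.
x = np.ones(3)
w_mean, w_logstd = rng.standard_normal((2, 3)), np.full((2, 3), -3.0)
b_mean, b_logstd = np.zeros(2), np.full(2, -3.0)
print(snn_forward(x, w_mean, w_logstd, b_mean, b_logstd))
print(snn_forward(x, w_mean, w_logstd, b_mean, b_logstd))  # a different sample
```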
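Slide 17 bounds the binary KL divergence between ε̂(Q) and ε(Q) rather than their difference, so turning it into a bound on ε(Q) requires inverting that KL. The bisection below is one common way to do this; the helper names (`binary_kl`, `kl_inverse`, `pac_bayes_kl_bound`) and the example numbers are mine.

```python
import math

def binary_kl(q, p):
    # KL divergence between Bernoulli(q) and Bernoulli(p).
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def kl_inverse(train_err_q, rhs, tol=1e-9):
    # Largest eps(Q) >= eps_hat(Q) with KL(eps_hat(Q) || eps(Q)) <= rhs, by bisection.
    lo, hi = train_err_q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(train_err_q, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_kl_bound(train_err_q, kl_qp, n, delta=0.05):
    # Slide 17: KL(eps_hat(Q) || eps(Q)) <= (KL(Q||P) + log(n/delta)) / (n - 1),
    # inverted into an upper bound on eps(Q).
    rhs = (kl_qp + math.log(n / delta)) / (n - 1.0)
    return kl_inverse(train_err_q, rhs)

# Illustrative numbers only.
print(pac_bayes_kl_bound(train_err_q=0.02, kl_qp=1_000.0, n=50_000))
```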
