# PAC Bayesian for Deep Learning

### PAC Bayesian for Deep Learning

1. PAC-Bayesian Bound for Deep Learning. Mark Chang, 2019/11/26
2. Outline
   • Learning Theory
   • Generalization in Deep Learning
   • PAC-Bayesian Bound
   • PAC-Bayesian Bound for Deep Learning
   • How to Overcome Over-fitting?
3. Learning Theory
   • Training data and testing data are both sampled from the data.
   • Training algorithm: update the hypothesis $h$ to minimize the training error $\hat{\epsilon}(h)$.
   • The testing error of the hypothesis $h$ is $\epsilon(h)$.
   • Learning is feasible when $\hat{\epsilon}(h)$ is small -> $\epsilon(h)$ is small.
4. Learning Theory
   • Over-fitting: $\hat{\epsilon}(h)$ is small, but $\epsilon(h)$ is large
   • What causes over-fitting?
   • [Figure: a hypothesis $h$ fitted to training data and testing data in three cases: under-fitting (too few parameters), appropriate fitting (adequate parameters), over-fitting (too many parameters)]
5. Learning Theory
   • What causes over-fitting?
   • Too many parameters -> over-fitting?
   • Too many parameters -> high VC dimension -> over-fitting?
   • … ?
6. Learning Theory
   • With probability $1-\delta$, the following inequality (VC bound) is satisfied: $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\left(\frac{4(2n)^d}{\delta}\right)}$
   • $n$: number of training instances
   • $d$: VC dimension (model complexity)
7. VC Dimension
   • Model complexity
   • Categorizes the hypothesis set into a finite number of groups
   • Example: for a 1D linear model, infinitely many hypotheses fall into finitely many groups, and VC dim = $\log_2(4) = 2$
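8. VC Dimension
   • Growth function: $\tau_H(n) = \max_{(x_1,\ldots,x_n)} \tau(x_1,\ldots,x_n)$, where $\tau(x_1,\ldots,x_n) = \left|\left\{\big(h(x_1),\ldots,h(x_n)\big) : h \in H\right\}\right|$, $(x_1,\ldots,x_n)$ are the data samples, and $H$ is the hypothesis set
   • VC dimension: $d(H) = \max\{n : \tau_H(n) = 2^n\}$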
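9. Growth Function
   • Considering a 2D linear model and n = 2: a hypothesis $h$ can label $(x_1, x_2)$ as $(h(x_1), h(x_2)) \in \{(1,1), (0,1), (1,0), (0,0)\}$
   • Hence $\tau(x_1, x_2) = \left|\left\{\big(h(x_1), h(x_2)\big) : h \in H\right\}\right| = 4$
   • In general, $\tau(x_1,\ldots,x_n) = \left|\left\{\big(h(x_1),\ldots,h(x_n)\big) : h \in H\right\}\right|$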
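10. Growth Function
   • Considering a 2D linear model and n = 3: $\tau_H(3) = \max_{(x_1,\ldots,x_3)} \tau(x_1,\ldots,x_3) = 8$
   • First configuration of three points (not collinear): $\tau(x_{11}, x_{12}, x_{13}) = 8$
   • Second configuration of three points (collinear): $\tau(x_{21}, x_{22}, x_{23}) = 6$
   • In general, $\tau_H(n) = \max_{(x_1,\ldots,x_n)} \tau(x_1,\ldots,x_n)$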
11. VC Dimension
   • VC dimension of the 2D linear model (VC dimension = 3)
   • $\tau_H(3) = 8 = 2^3$ (shattered)
   • $\tau_H(4) = 14 < 2^4$ (not shattered)
   • $d(H) = \max\{n : \tau_H(n) = 2^n\} = 3$
12. VC Dimension
   • The VC dimension is independent of the data distribution
   • $\tau_H(n) = \max_{(x_1,\ldots,x_n)} \tau(x_1,\ldots,x_n)$ maximizes over all possible sets of n data samples: $\tau(x_{11}, x_{12}, x_{13}) = 8$ and $\tau(x_{21}, x_{22}, x_{23}) = 6$, so $\tau_H(3) = 8$
   • $d(H) = \max\{n : \tau_H(n) = 2^n\}$ maximizes over all possible numbers of data samples: $\tau_H(3) = 8 = 2^3$ and $\tau_H(4) = 14 < 2^4$, so $d(H) = 3$
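The dichotomy counts quoted on these slides (8 and 6 for three points, 14 for four points) can be checked by brute force. The sketch below is an illustration added here, not something from the slides: `count_dichotomies`, the grid of directions, and the example point sets are all assumptions. It sweeps a grid of directions for the 2D linear classifier $h(x) = 1[w \cdot x > b]$ and collects every distinct labeling, so in general it gives a lower bound on $\tau(x_1,\ldots,x_n)$ that becomes exact when the grid is fine enough.

```python
import numpy as np

def count_dichotomies(points, n_angles=720):
    """Count distinct labelings of `points` by 2D linear classifiers
    h(x) = 1[w.x > b], sweeping w over a grid of directions (assumed grid)."""
    pts = np.asarray(points, dtype=float)
    patterns = set()
    for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(theta), np.sin(theta)])
        proj = pts @ w
        order = np.sort(proj)
        # Candidate thresholds: below all points, between neighbours, above all.
        cuts = np.concatenate(
            ([order[0] - 1.0], (order[:-1] + order[1:]) / 2.0, [order[-1] + 1.0])
        )
        for b in cuts:
            patterns.add(tuple((proj > b).astype(int)))
    return len(patterns)

# Hypothetical point sets chosen for illustration.
triangle = [(0, 0), (1, 0), (0, 1)]          # three points, not collinear
collinear = [(0, 0), (1, 0), (2, 0)]         # three points on a line
square = [(0, 0), (1, 0), (0, 1), (1, 1)]    # four points

print(count_dichotomies(triangle))
print(count_dichotomies(collinear))
print(count_dichotomies(square))
```

Since three points can be shattered but no arrangement of four points can, the largest n with $\tau_H(n) = 2^n$ is 3, which is the VC dimension stated above.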
13. VC Dimension
   • VC dimension of a linear model: O(W), where W = number of parameters
   • VC dimension of fully-connected neural networks: O(LW log W), where L = number of layers and W = number of parameters
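14. VC Bound
   • For a given dataset (n is constant), search for the best VC dimension
   • $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\left(\frac{4(2n)^d}{\delta}\right)}$
   • [Figure: error versus d (VC dimension), with the best VC dimension marked and over-fitting increasing toward d = n (shatter)]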
15. VC Bound
   • For a given dataset (n is constant), search for the best VC dimension
   • A model with high VC dimension over-fits: it fits the training data but not the testing data
   • Reducing the VC dimension gives a model with moderate VC dimension
16. Generalization in Deep Learning
   • Considering a toy example: a neural network (input: 780, hidden: 600) with d ≈ 26M, and a dataset with n = 50,000
   • d >> n, but testing error < 0.1
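To see why this toy example is awkward for the VC bound, the bound can be evaluated with these numbers. The sketch below is only an illustration: the function name `vc_bound`, the choice $\delta = 0.05$, and a training error of 0 are assumptions, not values from the slides. The logarithm is expanded as $\log 4 + d\log(2n) - \log\delta$ so that $(2n)^d$ never has to be formed.

```python
import math

def vc_bound(train_error, n, d, delta=0.05):
    """Right-hand side of the VC bound:
    train_error + sqrt((8 / n) * log(4 * (2n)^d / delta))."""
    # Expand the log to avoid overflow for large d.
    log_term = math.log(4.0) + d * math.log(2.0 * n) - math.log(delta)
    return train_error + math.sqrt(8.0 / n * log_term)

# n and d from the toy example; train_error and delta are assumed.
bound = vc_bound(train_error=0.0, n=50_000, d=26_000_000)
print(f"VC bound on the testing error: {bound:.1f}")
```

An error rate can never exceed 1, so a bound this far above 1 says nothing about the observed testing error below 0.1.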
17. Generalization in Deep Learning
   • However, when you are solving your own problem …
   • Your dataset (n = 50,000, 10 classes) with a neural network (input: 780, hidden: 600): testing error = 0.6, over-fitting!
   • Reduce the VC dimension: neural network (input: 780, hidden: 200): testing error = 0.6, over-fitting!
   • Reduce the VC dimension …
18. Generalization in Deep Learning
   • Reconciling modern machine learning and the bias-variance trade-off (arXiv 2018)
   • Understanding deep learning requires rethinking generalization (ICLR 2017)
   • On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (ICLR 2017)
19. Generalization in Deep Learning
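20. Reconciling modern machine learning and the bias-variance trade-off
   • $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\left(\frac{4(2n)^d}{\delta}\right)}$
   • [Figure: error versus d (VC dimension), with over-fitting around d = n and, beyond it, over-parameterization: a model with extremely high VC dimension]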
21. Reconciling modern machine learning and the bias-variance trade-off
   • [Figures: neural networks (two layers); random forest]
22. Generalization in Deep Learning (ICLR 2017)
23. Understanding deep learning requires rethinking generalization
   • Deep neural networks (Inception) shatter the training data in all three settings:
   • Original dataset (CIFAR), labels 1 0 1 0 2: $\hat{\epsilon}(h) \approx 0$
   • Random labels, labels 0 1 1 2 0: $\hat{\epsilon}(h) \approx 0$
   • Random noise features, labels 1 0 1 0 2: $\hat{\epsilon}(h) \approx 0$
24. Understanding deep learning requires rethinking generalization
   • Deep neural networks (Inception):
   • Original dataset (CIFAR), labels 1 0 1 0 2: $\hat{\epsilon}(h) \approx 0$, $\epsilon(h) \approx 0.14$
   • Random labels, labels 0 1 1 2 0: $\hat{\epsilon}(h) \approx 0$, $\epsilon(h) \approx 0.9$
   • Random noise features, labels 1 0 1 0 2: $\hat{\epsilon}(h) \approx 0$, $\epsilon(h) \approx 0.9$
25. Understanding deep learning requires rethinking generalization
   • Testing error depends on the data distribution
   • However, the VC bound does not depend on the data distribution
   • Original dataset (labels 1 0 1 0 2): $\epsilon(h) \approx 0.14$
   • Random labels (labels 0 1 1 2 0): $\epsilon(h) \approx 0.9$
   • Random noise features (labels 1 0 1 0 2)
26. Understanding deep learning requires rethinking generalization
   • Regularization of deep neural networks gives only a small improvement in testing error:
   • No regularization: $\epsilon(h) \approx 0.1424$
   • Weight decay: $\epsilon(h) \approx 0.1397$
   • Data augmentation: $\epsilon(h) \approx 0.1069$
   • Weight decay + data augmentation: $\epsilon(h) \approx 0.1095$
27. Generalization in Deep Learning
28. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
   • high sharpness -> high testing error
29. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
   • high sharpness -> high testing error
   • large batch size -> high sharpness
30. PAC-Bayesian Bound
   • Original PAC Bound: PAC-Bayesian Model Averaging (COLT 1999)
   • PAC-Bayesian Bound for Deep Learning (Stochastic): Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data (UAI 2017)
   • PAC-Bayesian Bound for Deep Learning (Deterministic): A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks (ICLR 2018)
31. PAC-Bayesian Bound (COLT 1999)
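32. PAC-Bayesian Bound
   • Deterministic model: a single hypothesis $h$ is evaluated on the data, giving the error $\epsilon(h)$
   • Stochastic model (Gibbs classifier): a hypothesis $h$ is sampled from a distribution of hypotheses $Q$, giving the Gibbs error $\epsilon(Q) = \mathbb{E}_{h \sim Q}[\epsilon(h)]$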
33. PAC-Bayesian Bound
   • Considering the sharpness of local minima, for a single hypothesis $h$ versus a distribution of hypotheses $Q$:
   • Flat minimum: low $\hat{\epsilon}(h)$, low $\hat{\epsilon}(Q)$
   • Sharp minimum: low $\hat{\epsilon}(h)$, high $\hat{\epsilon}(Q)$
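The contrast between a flat and a sharp minimum can be made concrete with a one-dimensional toy loss. Everything in this sketch is assumed for illustration: the two clipped quadratic losses, the Gaussian $Q$ centred at the minimum with standard deviation 0.1, and the helper name `gibbs_loss`. It estimates the Gibbs loss by Monte Carlo, averaging the loss over hypotheses sampled from $Q$.

```python
import numpy as np

def gibbs_loss(loss_fn, w, sigma, m=100_000, seed=0):
    """Monte Carlo estimate of E_{h ~ Q}[loss(h)] for Q = N(w, sigma^2)."""
    rng = np.random.default_rng(seed)
    samples = w + sigma * rng.standard_normal(m)
    return float(np.mean(loss_fn(samples)))

# Two assumed toy losses, both with minimum 0 at w = 0 and clipped to [0, 1].
flat = lambda w: np.minimum(1.0, 0.1 * w**2)      # low curvature
sharp = lambda w: np.minimum(1.0, 100.0 * w**2)   # high curvature

for name, loss_fn in [("flat", flat), ("sharp", sharp)]:
    single = float(loss_fn(np.array(0.0)))          # loss of the single hypothesis
    gibbs = gibbs_loss(loss_fn, w=0.0, sigma=0.1)   # loss averaged over Q
    print(f"{name:5s} minimum: single = {single:.3f}, Gibbs = {gibbs:.3f}")
```

Both minima have the same loss for the single hypothesis at $w = 0$, but the averaged loss is much larger at the sharp minimum, which is the pattern listed on the slide.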
34. PAC-Bayesian Bound
   • Training a Gibbs classifier
   • P (prior): distribution of hypotheses before training
   • Q (posterior): distribution of hypotheses after training
   • [Diagram: the training algorithm uses the training data to go from P to Q]
35. PAC-Bayesian Bound
   • Training: the training data is sampled from the data, and a hypothesis $h$ is sampled from the distribution of hypotheses $Q$
   • Training algorithm: update the distribution of hypotheses $Q$ to minimize the training error $\hat{\epsilon}(Q) = \mathbb{E}_{h \sim Q}[\hat{\epsilon}(h)]$
36. PAC-Bayesian Bound
   • Testing: the testing data is sampled from the data, and a hypothesis $h$ is sampled from the trained distribution of hypotheses $Q$
   • Testing error: $\epsilon(Q) = \mathbb{E}_{h \sim Q}[\epsilon(h)]$
37. PAC-Bayesian Bound
   • Learning is feasible when $\hat{\epsilon}(Q)$ is small -> $\epsilon(Q)$ is small
   • With probability $1-\delta$, the following inequality (PAC-Bayesian bound) is satisfied: $\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{KL(Q\|P) + \log\left(\frac{n}{\delta}\right) + 2}{2n - 1}}$
   • $n$: number of training instances
   • $KL(Q\|P)$: KL divergence between P and Q
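As a rough feel for how the bound reacts to $KL(Q\|P)$, it can be evaluated for a few values. This is a minimal sketch: the function name `pac_bayes_bound`, the choice $\delta = 0.05$, the training error of 0.03, and the KL values are all hypothetical and not taken from the slides.

```python
import math

def pac_bayes_bound(train_error_q, kl_qp, n, delta=0.05):
    """Right-hand side of the PAC-Bayesian bound:
    train_error_q + sqrt((KL(Q||P) + log(n / delta) + 2) / (2n - 1))."""
    slack = math.sqrt((kl_qp + math.log(n / delta) + 2.0) / (2.0 * n - 1.0))
    return train_error_q + slack

# Hypothetical KL values, to show how the bound grows with KL(Q||P).
for kl in (10.0, 1_000.0, 100_000.0):
    value = pac_bayes_bound(train_error_q=0.03, kl_qp=kl, n=50_000)
    print(f"KL(Q||P) = {kl:>9.0f}  ->  bound = {value:.3f}")
```

Unlike the VC bound, nothing here depends on the number of parameters directly. The complexity term is the distance that training moves $Q$ away from $P$.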
38. PAC-Bayesian Bound
   • $\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{KL(Q\|P) + \log\left(\frac{n}{\delta}\right) + 2}{2n - 1}}$
   • [Figure: three panels showing P and Q against training data and testing data]
   • Under-fitting: low $KL(Q\|P)$, low $|\epsilon(Q) - \hat{\epsilon}(Q)|$
   • Appropriate fitting: moderate $KL(Q\|P)$, moderate $|\epsilon(Q) - \hat{\epsilon}(Q)|$
   • Over-fitting: high $KL(Q\|P)$, high $|\epsilon(Q) - \hat{\epsilon}(Q)|$
39. PAC-Bayesian Bound
   • High VC dimension, but clean data -> low $KL(Q\|P)$
   • High VC dimension, but noisy data -> high $KL(Q\|P)$
   • The PAC-Bayesian bound is data-dependent: $\epsilon(Q) \le \hat{\epsilon}(Q) + \sqrt{\frac{KL(Q\|P) + \log\left(\frac{n}{\delta}\right) + 2}{2n - 1}}$
40. PAC-Bayesian Bound for Deep Learning (Stochastic) (UAI 2017)
41. PAC-Bayesian Bound for Deep Learning (Stochastic)
   • Deterministic neural networks: $y = wx + b$ with fixed $w$, $b$
   • Stochastic neural networks (SNN): $y = wx + b$, with $w$, $b$ sampled from the distribution $N(W, S)$
   • N: normal distribution
   • W: mean of weights
   • S: covariance of weights
42. PAC-Bayesian Bound for Deep Learning (Stochastic)
   • With probability $1-\delta$, the following inequality is satisfied: $KL(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{KL(Q\|P) + \log\left(\frac{n}{\delta}\right)}{n - 1}$
   • Q: SNN after training
   • P: SNN before training
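The left-hand side of this inequality is a KL divergence between two error rates, read as Bernoulli parameters, so it does not give $\epsilon(Q)$ directly. One common way to turn it into an explicit upper bound is to invert the binary KL numerically. The sketch below does this by bisection. The function names and every input value in the example ($\hat{\epsilon}(Q) = 0.028$, $KL(Q\|P) = 5000$, $n = 55{,}000$, $\delta = 0.05$) are hypothetical and chosen only for illustration.

```python
import math

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12  # clip to keep the logarithms finite
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def kl_inverse_upper(q, c, tol=1e-9):
    """Largest p >= q with kl_bernoulli(q, p) <= c, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(q, mid) > c:
            hi = mid
        else:
            lo = mid
    return hi  # return the upper end so the result stays a valid bound

def snn_error_bound(train_error_q, kl_qp, n, delta=0.05):
    """Upper bound on the testing error implied by the inequality above."""
    c = (kl_qp + math.log(n / delta)) / (n - 1.0)
    return kl_inverse_upper(train_error_q, c)

# Hypothetical inputs, for illustration only.
print(snn_error_bound(train_error_q=0.028, kl_qp=5_000.0, n=55_000))
```

Bisection works here because, for fixed $q$, the binary KL is increasing in $p$ on $[q, 1]$.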
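43. Estimating the PAC-Bayesian Bound
   • $KL(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{KL(Q\|P) + \log\left(\frac{n}{\delta}\right)}{n - 1}$
   • $Q = N(w, s)$, $P = N(w_0, \lambda I_d)$
   • N: normal distribution
   • w: mean of weights (d-dimensional vector)
   • s: covariance of weights (d x d diagonal matrix)
   • $KL(Q\|P) = \frac{1}{2}\left(\frac{1}{\lambda}\|s\|_1 - d + \frac{1}{\lambda}\|w - w_0\|_2^2 + d\log(\lambda) - 1_d \cdot \log(s)\right)$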
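The closed-form KL divergence between a diagonal Gaussian $Q = N(w, s)$ and an isotropic Gaussian $P = N(w_0, \lambda I_d)$ is a few lines of NumPy. This is a sketch under the assumption that `s` holds the diagonal variances as a vector of positive numbers. The function name and the example inputs are made up for illustration.

```python
import numpy as np

def kl_diag_gaussian(w, s, w0, lam):
    """KL(N(w, diag(s)) || N(w0, lam * I)) for variance vector s > 0."""
    w, s, w0 = (np.asarray(a, dtype=float) for a in (w, s, w0))
    d = w.size
    return 0.5 * (
        s.sum() / lam                      # (1 / lambda) * ||s||_1
        - d
        + np.sum((w - w0) ** 2) / lam      # (1 / lambda) * ||w - w0||_2^2
        + d * np.log(lam)
        - np.sum(np.log(s))                # 1_d . log(s)
    )

# Hypothetical 3-parameter example.
w0 = np.zeros(3)
print(kl_diag_gaussian(w=w0, s=np.full(3, 0.5), w0=w0, lam=0.5))   # Q equals P
print(kl_diag_gaussian(w=[1.0, -1.0, 0.5], s=[0.1, 0.1, 0.1], w0=w0, lam=0.5))
```

The divergence is zero only when $Q$ and $P$ coincide, and it grows as the trained mean moves away from $w_0$, which is why a prior centred near the solution tightens the bound.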
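44. Estimating the PAC-Bayesian Bound
   • $KL(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{KL(Q\|P) + \log\left(\frac{n}{\delta}\right)}{n - 1}$
   • $\hat{\epsilon}(Q) = \mathbb{E}_{h \sim Q}[\hat{\epsilon}(h)]$ is intractable
   • Sampling m hypotheses from Q: $\hat{\epsilon}(\hat{Q}) = \frac{1}{m}\sum_{i=1}^{m} \hat{\epsilon}(h_i)$, $h_i \sim Q$
   • With probability $1-\delta'$, the following inequality is satisfied: $KL(\hat{\epsilon}(\hat{Q})\,\|\,\hat{\epsilon}(Q)) \le \frac{\log\left(\frac{2}{\delta'}\right)}{m}$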
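The sampling step can be sketched on a toy problem. Everything below is assumed for illustration and is not the experiment from the slides: the synthetic 2D dataset, the linear classifier, the Gaussian $Q$ around `w_mean`, and the values of `m` and `delta_prime`. The code draws m hypotheses from $Q$, averages their training errors to get $\hat{\epsilon}(\hat{Q})$, and evaluates the right-hand side $\log(2/\delta')/m$.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy training set: 2D points labelled by the sign of x1 + x2.
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Assumed posterior Q = N(w_mean, sigma^2 I) over linear classifiers.
w_mean = np.array([1.0, 1.0])
sigma = 0.5

def training_error(w):
    """Training error of the single hypothesis h(x) = 1[w.x > 0]."""
    return float(np.mean((X @ w > 0).astype(int) != y))

def estimate_gibbs_training_error(m):
    """Average training error over m hypotheses sampled from Q."""
    ws = w_mean + sigma * rng.standard_normal((m, 2))
    return float(np.mean([training_error(w) for w in ws]))

m, delta_prime = 1_000, 0.01
estimate = estimate_gibbs_training_error(m)
kl_slack = math.log(2.0 / delta_prime) / m

print(f"estimated training error of Q: {estimate:.4f}")
print(f"KL slack log(2/delta')/m:      {kl_slack:.6f}")
```

The slack shrinks as 1/m, so drawing more hypotheses makes the sampled estimate a tighter stand-in for $\hat{\epsilon}(Q)$.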
45. Experiments
   • Data: MNIST (binary classification, class 0: digits 0~4, class 1: digits 5~9)
   • Model: 2-layer or 3-layer NN
   • Original MNIST: labels 0 0 0 1 1
   • Random label: labels 1 0 1 0 1
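46. Experiments
   • Varying the width of the hidden layer (2-layer NN):

   | Width | Training | Testing | VC Bound | PAC-Bayesian Bound |
   |-------|----------|---------|----------|--------------------|
   | 600   | 0.028    | 0.034   | 26m      | 0.161              |
   | 1200  | 0.027    | 0.035   | 56m      | 0.179              |

   • Varying the number of hidden layers:

   | Layers | Training | Testing | VC Bound | PAC-Bayesian Bound |
   |--------|----------|---------|----------|--------------------|
   | 2      | 0.028    | 0.034   | 26m      | 0.161              |
   | 3      | 0.028    | 0.033   | 66m      | 0.186              |
   | 4      | 0.027    | 0.032   | 121m     | 0.201              |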
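47. Experiments

   | Data                            | Training | Testing | VC Bound | PAC-Bayesian Bound |
   |---------------------------------|----------|---------|----------|--------------------|
   | Original MNIST (labels 0 0 0 1 1) | 0.028    | 0.034   | 26m      | 0.161              |
   | Random label (labels 1 0 1 0 1)   | 0.112    | 0.503   | 26m      | 1.352              |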
48. PAC-Bayesian Bound for Deep Learning
   • The PAC-Bayesian bound is data-dependent: $KL(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{KL(Q\|P) + \log\left(\frac{n}{\delta}\right)}{n - 1}$
   • Original MNIST (labels 0 0 0 1 1), clean data: -> small $KL(Q\|P)$ -> small $\epsilon(Q)$
   • Random label (labels 1 0 1 0 1), noisy data: -> large $KL(Q\|P)$ -> large $\epsilon(Q)$
49. PAC-Bayesian Bound for Deep Learning (Deterministic)
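50. PAC-Bayesian Bound for Deep Learning (Deterministic)
   • Constraint on sharpness: given a margin $\gamma > 0$, Q must satisfy $\mathbb{P}_{h' \sim Q}\left\{\sup_{x \in X} \|h'(x) - h(x)\|_\infty \le \frac{\gamma}{4}\right\} \ge \frac{1}{2}$, where $h'$ is a hypothesis sampled from Q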
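51. PAC-Bayesian Bound for Deep Learning (Deterministic)
   • Constraint on sharpness: given a margin $\gamma > 0$, suppose Q satisfies $\mathbb{P}_{h' \sim Q}\left\{\sup_{x \in X} \|h'(x) - h(x)\|_\infty \le \frac{\gamma}{4}\right\} \ge \frac{1}{2}$
   • Then, with probability $1-\delta$, the following inequality is satisfied: $L_0(h) \le \hat{L}_\gamma(h) + 4\sqrt{\frac{KL(Q\|P) + \log\left(\frac{6n}{\delta}\right)}{n - 1}}$
   • Margin loss $L_\gamma(h)$: $L_\gamma(h) = \mathbb{P}_{(x,y) \sim D}\left\{h(x)[y] \le \gamma + \max_{j \ne y} h(x)[j]\right\}$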
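The empirical version of the margin loss, $\hat{L}_\gamma(h)$, replaces the probability over $D$ with a fraction over the training set. The sketch below assumes the network outputs are available as an array of per-class scores. The function name `empirical_margin_loss` and the example scores are hypothetical.

```python
import numpy as np

def empirical_margin_loss(logits, labels, gamma):
    """Fraction of examples with h(x)[y] <= gamma + max_{j != y} h(x)[j]."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    true_score = logits[idx, labels]
    # Mask out the true class to find the best competing score.
    masked = logits.copy()
    masked[idx, labels] = -np.inf
    runner_up = masked.max(axis=1)
    return float(np.mean(true_score <= gamma + runner_up))

# Hypothetical scores for three examples whose true class is 0.
logits = [[2.0, 1.0, 0.0],   # margin  1.0
          [0.5, 1.0, 0.0],   # margin -0.5 (misclassified)
          [1.0, 0.9, 0.0]]   # margin  0.1
labels = [0, 0, 0]

print(empirical_margin_loss(logits, labels, gamma=0.0))
print(empirical_margin_loss(logits, labels, gamma=0.5))
```

With $\gamma = 0$ the margin loss reduces to the ordinary classification error. A larger $\gamma$ also counts correct predictions whose margin is too small, which is the price of the looser sharpness condition.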
52. How to Overcome Over-fitting?
   • Traditional machine learning: reduce the number of parameters; weight decay; early stop; data augmentation
   • Modern deep learning: reduce the number of parameters; weight decay?; early stop?; data augmentation?; improving data quality; starting from a good P (transfer learning)
   • VC bound: $\epsilon(h) \le \hat{\epsilon}(h) + \sqrt{\frac{8}{n}\log\left(\frac{4(2n)^d}{\delta}\right)}$
   • PAC-Bayesian bound: $KL(\hat{\epsilon}(Q)\,\|\,\epsilon(Q)) \le \frac{KL(Q\|P) + \log\left(\frac{n}{\delta}\right)}{n - 1}$