
A Model of Inductive Bias Learning

Slides from https://www.meetup.com/Taiwan-R/events/273954352/


  1. A Model of Inductive Bias Learning
  2. Outline • What is Inductive Bias? • A Model of Inductive Bias Learning • Meta-Learning Based on PAC-Bayes Theory
  3. Outline • What is Inductive Bias? • A Model of Inductive Bias Learning • Meta-Learning Based on PAC-Bayes Theory
  4. What is Inductive Bias? • The set of assumptions that a learner uses to predict outputs for inputs it has not seen. • Examples: translation invariance in computer vision, built into convolutional neural networks, and the distributional hypothesis in natural language processing, exploited by word2vec ("A dog runs." and "A cat runs." place dog and cat in similar contexts around "run").
  5. What is Inductive Bias? • Convolutional neural networks: learning the inductive bias (receptive fields) from the dataset. [Figure: input layer → convolutional layer → pooling layer → convolutional layer → pooling layer, with receptive fields highlighted]
  6. What is Inductive Bias? • Multitask / multiclass / multilabel learning: a shared feature extractor (the inductive bias) maps the inputs of tasks 1–3 to features, and task-specific classifiers map those features to each task's labels.
  7. What is Inductive Bias? • Meta-learning / learning-to-learn / transfer learning: the feature extractor (the inductive bias) is trained on tasks 1 and 2, then reused for the target task, where only the target-task classifier is trained (see the sketch below).
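The slides only show this architecture as a diagram; the following is a minimal sketch of the idea, not code from the talk. The module sizes, the two-task setup, and all names (`feature_extractor`, `heads`, `multitask_loss`) are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not from the slides): a shared feature
# extractor with task-specific heads, then transfer to a new task by freezing
# the extractor and training only a new head.
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(          # shared inductive bias
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
)
heads = nn.ModuleList([nn.Linear(64, 10) for _ in range(2)])  # one classifier per training task

def multitask_loss(batches):
    """batches: list of (x, y) pairs, one per training task."""
    loss_fn = nn.CrossEntropyLoss()
    losses = [loss_fn(heads[i](feature_extractor(x)), y) for i, (x, y) in enumerate(batches)]
    return sum(losses) / len(losses)

# Transfer to a target task: keep the trained extractor, train only a new head.
target_head = nn.Linear(64, 10)
for p in feature_extractor.parameters():
    p.requires_grad = False
```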
  8. Outline • What is Inductive Bias? • A Model of Inductive Bias Learning • Meta-Learning Based on PAC-Bayes Theory
  9. A Model of Inductive Bias Learning
  10. Machine Learning Theory • A training set $z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ is sampled from a dataset distribution $P$. • The training algorithm selects a hypothesis $h$ from the hypothesis space $H = \{h_1, h_2, \ldots\}$; the hypothesis produces predictions $h(x_1), \ldots, h(x_m)$. • Empirical loss: $\hat{er}_z(h) = \frac{1}{m}\sum_{i=1}^{m} l(h(x_i), y_i)$ • Expected loss: $er_P(h) = \int_{X \times Y} l(h(x), y)\, dP(x, y)$ (a small numeric sketch follows below).
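A minimal sketch of these two quantities, assuming a toy one-dimensional distribution $P$ and a fixed threshold hypothesis; the distribution, hypothesis, and sample sizes are all assumptions for illustration.

```python
# Minimal sketch (illustrative, not from the slides): empirical loss on a small
# training set versus a Monte Carlo estimate of the expected loss under P.
import numpy as np

rng = np.random.default_rng(0)

def loss(pred, y):                      # 0/1 loss l(h(x), y)
    return float(pred != y)

def h(x):                               # a fixed hypothesis: threshold classifier
    return int(x > 0.5)

def sample_P(size):                     # assumed data distribution P(x, y)
    x = rng.uniform(0, 1, size)
    y = (x + rng.normal(0, 0.1, size) > 0.5).astype(int)
    return x, y

x_train, y_train = sample_P(20)         # training set z with m = 20 samples
emp_loss = np.mean([loss(h(xi), yi) for xi, yi in zip(x_train, y_train)])

x_big, y_big = sample_P(100_000)        # large sample approximates er_P(h)
exp_loss = np.mean([loss(h(xi), yi) for xi, yi in zip(x_big, y_big)])
print(emp_loss, exp_loss)
```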
  11. Machine Learning Theory • Theorem 1. Suppose $z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ is sampled from distribution $P$ and let $d = \mathrm{VCdim}(H)$. Then with probability at least $1 - \delta$, all $h \in H$ satisfy $er_P(h) \le \hat{er}_z(h) + \left[\frac{32}{m}\left(d \log \frac{2em}{d} + \log \frac{4}{\delta}\right)\right]^{1/2}$ • When the VC dimension $d \ge m$, $H$ can shatter the training set and the over-fitting gap $er_P(h) - \hat{er}_z(h)$ can be large (a numeric sketch of the bound follows below).
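A small sketch evaluating the confidence term of the bound above; the choices of $d$, $m$, and $\delta$ are assumptions picked to show how the term blows up as $d$ approaches $m$.

```python
# Minimal sketch: evaluating the confidence term of the VC bound in Theorem 1.
import numpy as np

def vc_bound_term(d, m, delta=0.05):
    """[ 32/m * (d*log(2*e*m/d) + log(4/delta)) ]^(1/2)"""
    return np.sqrt(32.0 / m * (d * np.log(2 * np.e * m / d) + np.log(4 / delta)))

for d in (10, 1_000, 50_000):
    # The term grows far past 1 (a vacuous bound) as d approaches m.
    print(d, vc_bound_term(d, m=50_000))
```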
  12. Environment of Related Tasks • An environment $Q$ is a probability distribution over task distributions. Sample $n$ training tasks $P_1, P_2, \ldots, P_n$ from $Q$, then sample $m$ training examples from each task: training set $z_i = \{(x_{i1}, y_{i1}), \ldots, (x_{im}, y_{im})\}$ for $i = 1, \ldots, n$.
  13. Environment of Related Tasks • Example: ImageNet classification as an environment of binary tasks such as "goldfish or not", "cock or not", "daisy or not", each with its own training set of labeled images.
  14. Hypothesis Space Family • A hypothesis space family $\mathbb{H} = \{H_1, H_2, \ldots\}$ is a set of hypothesis spaces, where each $H_i = \{h_{i1}, h_{i2}, \ldots\}$ is itself a set of hypotheses. • The training algorithm of inductive bias learning selects a hypothesis space $H = \{h_1, h_2, \ldots\}$ from $\mathbb{H}$.
  15. Hypothesis Space Family • With feature extractors $f_1, f_2, \ldots$ and classifiers $g_1, g_2, \ldots$, each hypothesis space fixes one feature extractor: $H_1 = \{g_1 \circ f_1, g_2 \circ f_1, \ldots\}$, $H_2 = \{g_1 \circ f_2, g_2 \circ f_2, \ldots\}$. • Training the feature extractor $\Rightarrow$ selecting $f$ $\Rightarrow$ selecting the hypothesis space $H = \{g_1 \circ f, g_2 \circ f, \ldots\}$.
  16. Inductive Bias Learning • Sample datasets $P_1, \ldots, P_n$ from the environment $Q$ and a training set from each; the training algorithm selects a hypothesis space $H = \{h_1, h_2, \ldots\}$ from the family $\mathbb{H}$. • Empirical loss of learning from datasets $P_1, \ldots, P_n$: $\hat{er}_z(H) = \frac{1}{n}\sum_{i=1}^{n} \inf_{h \in H} \hat{er}_{z_i}(h)$ • Expected loss of learning new tasks drawn from environment $Q$: $er_Q(H) = \int_P \inf_{h \in H} er_P(h)\, dQ(P)$ (see the sketch below).
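A minimal sketch of the multi-task empirical loss $\hat{er}_z(H)$: for each task, take the best empirical loss achievable inside $H$, then average over tasks. The tiny finite hypothesis space of threshold classifiers and the random toy tasks are assumptions for demonstration only.

```python
# Minimal sketch (illustrative): er_hat_z(H) = (1/n) * sum_i min_{h in H} er_hat_{z_i}(h)
import numpy as np

rng = np.random.default_rng(0)

# A small finite hypothesis space H: threshold classifiers h_t(x) = 1[x > t].
H = [lambda x, t=t: (x > t).astype(int) for t in np.linspace(0, 1, 11)]

def task_empirical_loss(h, z):
    x, y = z
    return np.mean(h(x) != y)

def multi_task_empirical_loss(H, tasks):
    return np.mean([min(task_empirical_loss(h, z) for h in H) for z in tasks])

# n = 3 toy tasks with m = 50 samples each, each with its own true threshold.
tasks = []
for true_t in (0.3, 0.5, 0.7):
    x = rng.uniform(0, 1, 50)
    tasks.append((x, (x > true_t).astype(int)))

print(multi_task_empirical_loss(H, tasks))
```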
  17. Inductive Bias Learning • Theorem 2. Suppose $z$ is an $(n, m)$-sample generated by sampling $n$ times from distribution $Q$ to give $P_1, \ldots, P_n$ and then sampling $m$ times from each $P_i$ to generate $z_i = \{(x_{i1}, y_{i1}), \ldots, (x_{im}, y_{im})\}$. Let $\mathbb{H} = \{H\}$ be any hypothesis space family. If $(n, m)$ satisfies $n \ge \frac{256}{\epsilon^2} \log \frac{8\,C(\frac{\epsilon}{32}, \mathbb{H}^*)}{\delta}$ and $m \ge \frac{256}{n\epsilon^2} \log \frac{8\,C(\frac{\epsilon}{32}, \mathbb{H}^n_l)}{\delta}$, then with probability at least $1 - \delta$, all $H \in \mathbb{H}$ will satisfy $er_Q(H) \le \hat{er}_z(H) + \epsilon$ • $C$: capacity (model complexity).
  18. Meta-Learning / Learning-to-Learn / Transfer Learning • Pre-train: using the training sets of tasks $P_1, \ldots, P_n$ sampled from the environment $Q$, the training algorithm selects a hypothesis space $H = \{h_1, h_2, \ldots\}$ from the family $\mathbb{H}$. • Train the target task: given a training set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ sampled from the target dataset $P$, select $h \in H$. • Expected loss: $er_P(h) = \int_{X \times Y} l(h(x), y)\, dP(x, y)$; empirical loss: $\hat{er}_z(h) = \frac{1}{m}\sum_{i=1}^{m} l(h(x_i), y_i)$.
  19. Meta-Learning / Learning-to-Learn / Transfer Learning • Theorem 3. Let $z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a training set sampled from $P$ and let $H$ be a hypothesis space. For all $0 < \epsilon, \delta < 1$, if $m$ satisfies $m \ge \max\left\{\frac{64}{\epsilon^2} \log \frac{4\,C(\frac{\epsilon}{16}, H_l)}{\delta}, \frac{16}{\epsilon^2}\right\}$, then with probability at least $1 - \delta$, all $h \in H$ will satisfy $er_P(h) \le \hat{er}_z(h) + \epsilon$ • $C$: capacity (model complexity). Inductive bias learning selects a hypothesis space $H$ of low model complexity from the family $\mathbb{H}$ of high model complexity, so by Theorem 3 learning a new task within the selected $H$ needs fewer samples $m$ than learning within the full family.
  20. Multitask / Multiclass / Multilabel Learning • The training algorithm selects $H = \{h_1, h_2, \ldots\}$ from the family $\mathbb{H}$ and one hypothesis per task, using the training sets of datasets $P_1, \ldots, P_n$ sampled from environment $Q$. • Empirical loss: $\hat{er}_z(\mathbf{h}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{m}\sum_{j=1}^{m} l(h_i(x_{ij}), y_{ij})$ • Expected loss: $er_{\mathbf{P}}(\mathbf{h}) = \frac{1}{n}\sum_{i=1}^{n} \int_{X \times Y} l(h_i(x), y)\, dP_i(x, y)$
  21. Multitask / Multiclass / Multilabel Learning • Theorem 4. Let $\mathbf{P} = (P_1, \ldots, P_n)$ be $n$ probability distributions and let $z$ be an $(n, m)$-sample generated by sampling $m$ times from each $P_i$. Let $\mathbb{H} = \{H\}$ be any hypothesis space family. If the number of samples $m$ for each task satisfies $m \ge \max\left\{\frac{64}{n\epsilon^2} \log \frac{4\,C(\frac{\epsilon}{16}, H^n_l)}{\delta}, \frac{16}{\epsilon^2}\right\}$, then with probability at least $1 - \delta$, any $\mathbf{h} = (h_1, \ldots, h_n) \in H^n$ satisfies $er_{\mathbf{P}}(\mathbf{h}) \le \hat{er}_z(\mathbf{h}) + \epsilon$ • Lemma 5. For any $H$, $\log C(\epsilon, H^1_l) \le \log C(\epsilon, H^n_l) \le n \log C(\epsilon, H^1_l)$ • Worst case: $\log C(\epsilon, H^n_l) = n \log C(\epsilon, H^1_l)$, so $m$ remains the same as $n$ increases. • Best case: $\log C(\epsilon, H^n_l) = \log C(\epsilon, H^1_l)$, so $m$ decreases as $O(1/n)$.
  22. Outline • What is Inductive Bias? • A Model of Inductive Bias Learning • Meta-Learning Based on PAC-Bayes Theory
  23. Meta-Learning Based on PAC-Bayes Theory • Based on "Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory" (ICML 2018).
  24. Generalization in Deep Learning • Neural networks (deep learning models) have extremely high VC dimension. • In classical theory, over-fitting is caused by high VC dimension (high model complexity): when $d \ge m$ the hypothesis space can shatter the training set and the gap $er_P(h) - \hat{er}_z(h)$ can be large. • VC bound: $er_P(h) \le \hat{er}_z(h) + \left[\frac{32}{m}\left(d \log \frac{2em}{d} + \log \frac{4}{\delta}\right)\right]^{1/2}$ • Example: a 2-layer NN with input 780 and 600 hidden units has $d \approx 26\mathrm{M}$, while MNIST has $n = 50{,}000$ training samples, so $d \gg n$.
  25. Over-Parameterization • Rethinking optimization and generalization [Chiyuan Zhang et al., https://arxiv.org/abs/1611.03530]: a deep neural network fits the original dataset (CIFAR), the same images with random labels, and even random pixels to near-zero training error ($\hat{er}_z(h) \approx 0$), i.e., it can shatter the data. Test error: $er_P(h) \approx 0.14$ on the original dataset, $\approx 0.9$ with random labels or random pixels. Yet on the original dataset there is no over-fitting (see the sketch below for the randomized datasets).
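A minimal sketch of how the "random label" and "random pixel" dataset variants used in that experiment can be built; this is not the paper's code, and the array shapes, class count, and random stand-in data are assumptions.

```python
# Minimal sketch (illustrative): random-label and random-pixel variants of a dataset.
import numpy as np

rng = np.random.default_rng(0)

def randomize_labels(y, num_classes=10):
    """Replace every label with a uniformly random class."""
    return rng.integers(0, num_classes, size=len(y))

def randomize_pixels(x):
    """Replace every image with i.i.d. noise of the same shape."""
    return rng.uniform(0, 1, size=x.shape)

x = rng.uniform(0, 1, size=(50_000, 32, 32, 3))   # stand-in for CIFAR images
y = rng.integers(0, 10, size=50_000)              # stand-in for CIFAR labels

x_rand_pix = randomize_pixels(x)
y_rand_lab = randomize_labels(y)
# An over-parameterized network can reach ~0 training error on all three
# variants, but only the original data yields low test error.
```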
  26. Over-Parameterization • Generalization curve of over-parameterized models [Mikhail Belkin et al., https://arxiv.org/abs/1812.11118]: the VC bound $er_P(h) \le \hat{er}_z(h) + \left[\frac{32}{m}\left(d \log \frac{2em}{d} + \log \frac{4}{\delta}\right)\right]^{1/2}$ predicts over-fitting (a growing gap between $er_P(h)$ and $\hat{er}_z(h)$) as $d$ approaches $m$, yet over-parameterized models generalize well. [Figure: error versus $d$ (VC dimension); over-parameterized models sit at extremely high VC dimension, beyond the $d = m$ over-fitting point]
  27. 27. Over-Parameterization • PAC-Bayesian Bound [Gintare Karolina Dziugaite et al. https://arxiv.org/abs/1703.11008] Training error Testingerror VC Bound Pac-Bayesian Bound 0.028 0.034 26m 0.161 0.027 0.035 56m 0.179 Model Complexityof 2- layer NN width :600 width : 1200 0.028 0.034 26m 0.161 0.112 0.503 26m 1.352 Random labelTrue label Dataset : MNIST binary classification,class0: 0~4, class1 : 5~9 Existence of noise in dataset • D: distribution of dataset • S : training samples • P : model distribution before training • Q: model distribution after training 27 er(Q, D)  ˆer(Q, S) + s KL(QkP) + log(m ) 2(m 1)
  28. PAC-Bayesian Bound • Deterministic neural network: fixed weights $w, b$; output $y = \tanh(wx + b)$; a single hypothesis $h$ maps inputs to labels, with error $er(h)$. • Stochastic neural network: the weights are sampled, $w \sim N(\mu_w, \sigma_w)$ and $b \sim N(\mu_b, \sigma_b)$; the hypothesis $h$ is sampled from a distribution $Q$ over hypotheses, and the Gibbs error is $er(Q) = \mathbb{E}_{h \sim Q}[er(h)]$.
  29. PAC-Bayesian Bound • Training a stochastic neural network: the prior $P$ is the distribution of hypotheses before training ($w \sim N(\mu_w, \sigma_w)$, $b \sim N(\mu_b, \sigma_b)$); the training algorithm uses the training data $S$ to update $\mu_w, \sigma_w, \mu_b, \sigma_b$, giving the posterior $Q(S, P)$, the distribution of hypotheses after training (see the sketch below).
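A minimal sketch of a stochastic unit $y = \tanh(wx + b)$ with Gaussian weights trained via the reparameterization trick, and a Monte Carlo estimate of the Gibbs error; the toy regression target, learning rate, loss, and softplus parameterization of $\sigma$ are assumptions, not the paper's training procedure.

```python
# Minimal sketch (illustrative): updating (mu_w, sigma_w, mu_b, sigma_b) of a
# stochastic affine unit, then estimating the Gibbs error er(Q) by sampling h ~ Q.
import torch
import torch.nn.functional as F

mu_w = torch.zeros(1, requires_grad=True)
rho_w = torch.zeros(1, requires_grad=True)   # sigma_w = softplus(rho_w) keeps sigma > 0
mu_b = torch.zeros(1, requires_grad=True)
rho_b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu_w, rho_w, mu_b, rho_b], lr=0.05)

def forward(x):
    # Sample one hypothesis h ~ Q by sampling w and b (reparameterization).
    w = mu_w + F.softplus(rho_w) * torch.randn(1)
    b = mu_b + F.softplus(rho_b) * torch.randn(1)
    return torch.tanh(w * x + b)

x = torch.linspace(-1, 1, 64)
y = torch.tanh(2.0 * x + 0.5)                 # toy regression target (assumed)

for _ in range(200):                          # update the posterior parameters
    opt.zero_grad()
    loss = torch.mean((forward(x) - y) ** 2)  # empirical error of one sampled h
    loss.backward()
    opt.step()

# Gibbs error er(Q): average the error over many sampled hypotheses.
gibbs_err = torch.mean(torch.stack([torch.mean((forward(x) - y) ** 2) for _ in range(100)]))
```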
  30. PAC-Bayesian Bound • $er(Q, D) \le \hat{er}(Q, S) + \sqrt{\frac{KL(Q\|P) + \log\frac{m}{\delta}}{2(m-1)}}$ • Over-fitting ($Q$ far from $P$): high $KL(Q\|P)$; low $\hat{er}(Q, S)$ but high $er(Q, D)$; high $|er(Q, D) - \hat{er}(Q, S)|$. • Under-fitting ($Q$ close to $P$): low $KL(Q\|P)$; high $\hat{er}(Q, S)$ and high $er(Q, D)$; low $|er(Q, D) - \hat{er}(Q, S)|$. • Appropriate fitting: moderate $KL(Q\|P)$; moderate $\hat{er}(Q, S)$ and moderate $er(Q, D)$; moderate $|er(Q, D) - \hat{er}(Q, S)|$.
  31. Hyper-Prior and Hyper-Posterior • The hyper-prior $\mathcal{P}$ is a distribution over priors $P_1, P_2, \ldots$; the inductive-bias training algorithm uses the empirical multi-task error to update the hyper-posterior $\mathcal{Q}$.
  32. Hyper-Prior and Hyper-Posterior • Stochastic neural networks: the prior's parameters $N(\mu_{pw}, \sigma_{pw}), N(\mu_{pb}, \sigma_{pb})$ are sampled from the hyper-prior $\mathcal{P}$, and the weights $w \sim N(\mu_w, \sigma_w)$, $b \sim N(\mu_b, \sigma_b)$ are sampled given the prior. • Inductive-bias training updates the hyper-prior $\mathcal{P}$ into the hyper-posterior $\mathcal{Q}$; for a new task from the environment, a prior $P$ is sampled from $\mathcal{Q}$ and ordinary training updates the prior into the task posterior (see the sketch below).
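A minimal sketch of the two-level sampling implied by this slide: a distribution over prior parameters, a Gaussian prior over weights given those parameters, and weights sampled given the prior. The dimensionality, the choice to randomize only the prior mean, and all variances are assumptions for demonstration.

```python
# Minimal sketch (illustrative): hyper-posterior -> prior -> network weights.
import numpy as np

rng = np.random.default_rng(0)
d = 10                                     # number of weights (assumed)

# Hyper-posterior Q: a Gaussian over the prior's mean parameters.
hyper_mu, hyper_sigma = np.zeros(d), 0.1 * np.ones(d)

def sample_prior():
    """Sample a prior P ~ Q: here, the prior mean mu_p (sigma_p kept fixed)."""
    mu_p = hyper_mu + hyper_sigma * rng.standard_normal(d)
    sigma_p = np.ones(d)
    return mu_p, sigma_p

def sample_weights(mu_p, sigma_p):
    """Sample network weights w ~ N(mu_p, sigma_p^2) given the prior."""
    return mu_p + sigma_p * rng.standard_normal(d)

mu_p, sigma_p = sample_prior()             # prior for a new task
w = sample_weights(mu_p, sigma_p)          # one hypothesis for that task
```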
  33. Inductive Bias Learning • Environment $\tau$; datasets $D_1, \ldots, D_n$ with training sets $S_1, \ldots, S_n$; hyper-prior $\mathcal{P}$; the training algorithm updates the hyper-posterior $\mathcal{Q}$; priors $P_1, \ldots, P_n$ yield the task posteriors $Q(S_1, P_1), \ldots, Q(S_n, P_n)$. • Multi-task error (empirical loss of learning from $D_1, \ldots, D_n$): $\hat{er}(\mathcal{Q}, S_1, \ldots, S_n) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{P \sim \mathcal{Q}}\, \hat{er}(Q(S_i, P), S_i)$ • Transfer error (expected loss of learning new tasks drawn from environment $\tau$): $er(\mathcal{Q}, \tau) = \mathbb{E}_{D \sim \tau}\, \mathbb{E}_{P \sim \mathcal{Q}}\, \mathbb{E}_{S \sim D}\, er(Q(S, P), D)$
  34. Inductive Bias Learning • Theorem 2. For any hyper-posterior distribution $\mathcal{Q}$, the following inequality holds with probability at least $1 - \delta$: $er(\mathcal{Q}, \tau) \le \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{P \sim \mathcal{Q}}\, \hat{er}_i(Q(S_i, P), S_i) + \frac{1}{n}\sum_{i=1}^{n} \sqrt{\frac{KL(\mathcal{Q}\|\mathcal{P}) + \mathbb{E}_{P \sim \mathcal{Q}} KL(Q(S_i, P)\|P) + \log\frac{2 n m_i}{\delta}}{2(m_i - 1)}} + \sqrt{\frac{KL(\mathcal{Q}\|\mathcal{P}) + \log\frac{2n}{\delta}}{2(n - 1)}}$ • The middle sum contains the task-complexity terms of the observed tasks; the final term is the environment-complexity term.
  35. Meta Learning Algorithm • Loss function for pre-training: $J(\theta) = \frac{1}{n}\sum_{i=1}^{n} J_i(\theta) + \Upsilon(\theta)$ • Task-complexity terms of the observed tasks: $J_i(\theta) = \mathbb{E}_{\tilde\theta \sim Q_\theta}\, \hat{er}_i\!\left(Q_i(S_i, P_{\tilde\theta}), S_i\right) + \sqrt{\frac{KL(Q_\theta\|\mathcal{P}) + \mathbb{E}_{\tilde\theta \sim Q_\theta} KL\!\left(Q(S_i, P_{\tilde\theta})\|P_{\tilde\theta}\right) + \log\frac{2 n m_i}{\delta}}{2(m_i - 1)}}$, with $\tilde\theta = \theta + \epsilon_P$, where $\epsilon_P \sim N(0, \sigma_{\mathcal{Q}}^2 I)$ • Environment-complexity term: $\Upsilon(\theta) = \sqrt{\frac{KL(Q_\theta\|\mathcal{P}) + \log\frac{2n}{\delta}}{2(n - 1)}}$ (see the sketch below).
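A heavily simplified sketch of the structure of this objective, with the per-task empirical errors and KL divergences passed in as plain numbers (the expectations over $\tilde\theta$ are dropped); all input values and $\delta$ are assumptions, so this only shows how the terms combine, not the paper's implementation.

```python
# Minimal sketch (illustrative): J(theta) = (1/n) * sum_i J_i(theta) + Upsilon(theta)
import numpy as np

def complexity(kl_hyper, kl_task, m_i, n, delta=0.1):
    return np.sqrt((kl_hyper + kl_task + np.log(2 * n * m_i / delta)) / (2 * (m_i - 1)))

def environment_term(kl_hyper, n, delta=0.1):
    return np.sqrt((kl_hyper + np.log(2 * n / delta)) / (2 * (n - 1)))

def meta_objective(task_errors, task_kls, kl_hyper, m_i, delta=0.1):
    n = len(task_errors)
    J_i = [err + complexity(kl_hyper, kl, m_i, n, delta)
           for err, kl in zip(task_errors, task_kls)]
    return np.mean(J_i) + environment_term(kl_hyper, n, delta)

# n = 3 observed tasks with m_i = 100 samples each (assumed numbers).
print(meta_objective(task_errors=[0.10, 0.12, 0.08],
                     task_kls=[5.0, 4.0, 6.0],
                     kl_hyper=2.0, m_i=100))
```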
  36. Meta Learning Algorithm • Meta-training: the hyper-posterior $Q_\theta$ provides the prior $\theta$ for each observed task $S_1, \ldots, S_n$ from the environment $\tau$; each task learns its own posterior $\phi_1, \ldots, \phi_n$.
  37. Meta Learning Algorithm • Meta-testing: for a new task $S$ from the environment $\tau$, take the prior $\theta$ from the hyper-posterior $Q_\theta$ and train the posterior $\phi'$.
  38. Meta Learning Algorithm Example
  39. Experiments • Model: CNN with 2 convolutional layers. • Task environments: permuted labels and permuted pixels. In the permuted-labels environment, each task applies a different permutation to the class labels (e.g., task 1 and task 2 assign the digits 0, 1, 2 to different classes); in the permuted-pixels environment, each task applies its own fixed permutation to the input pixels (see the sketch below).
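A minimal sketch of how tasks in these two environments can be generated from a base dataset; the array shapes, class count, and random stand-in data are assumptions, and this is not the experiment code from the talk.

```python
# Minimal sketch (illustrative): permuted-labels and permuted-pixels task variants.
import numpy as np

rng = np.random.default_rng(0)

def permuted_labels_task(x, y, num_classes=10):
    """Same inputs, but the class labels are remapped by a random permutation."""
    perm = rng.permutation(num_classes)
    return x, perm[y]

def permuted_pixels_task(x, y):
    """Same labels, but every image's pixels are shuffled by one fixed permutation."""
    flat = x.reshape(len(x), -1)
    perm = rng.permutation(flat.shape[1])
    return flat[:, perm].reshape(x.shape), y

x = rng.uniform(0, 1, size=(1000, 28, 28))     # stand-in for MNIST-like images
y = rng.integers(0, 10, size=1000)             # stand-in for labels

x1, y1 = permuted_labels_task(x, y)            # one task from the permuted-labels environment
x2, y2 = permuted_pixels_task(x, y)            # one task from the permuted-pixels environment
```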
  40. Results • Test error on new tasks, comparing a model trained from scratch with a model trained by the proposed meta-learning algorithm. [Figure: test-error curves for both settings]
  41. Further Reading • Understanding Deep Learning Requires Rethinking Generalization, https://arxiv.org/abs/1611.03530 • Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-off, https://arxiv.org/abs/1812.11118 • Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, https://arxiv.org/abs/1703.11008 • A Model of Inductive Bias Learning, https://arxiv.org/abs/1106.0245 • Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory, https://arxiv.org/abs/1711.01244
  42. About the Speaker: Mark Chang • Facebook: https://www.facebook.com/ckmarkoh.chang • IG: https://www.instagram.com/markchang95/ • YouTube: https://www.youtube.com/channel/UCckNPGDL21aznRhl3EijRQw
