Taiji Suzuki
The University of Tokyo / AIP-RIKEN
NeurIPS2020
Generalization bound of globally optimal
non-convex neural network training:
Transportation map estimation by
infinite dimensional Langevin dynamics
Summary
Neural network optimization
• We formulate NN training as an infinite dimensional
gradient Langevin dynamics in RKHS.
➢“Lift” of noisy gradient descent trajectory.
• Global optimality is ensured.
➢Geometric ergodicity + time discretization error
• Generalization error bound + Excess risk bound.
➢(i) Ο(1/√𝑛) generalization error. (ii) Fast learning rate for the excess risk.
• Finite/infinite width can be treated in a unifying manner.
• Good generalization error guarantee
→ Different from NTK and mean field analysis.
Difficulty of NN optimization
Optimization of neural networks is “difficult” because of
Nonconvexity + High-dimensionality
• Neural tangent kernel:
➢ Takes the infinite-width asymptotics as 𝑛 → ∞.
➢ The benefit of NNs over kernel methods is lost.
• Mean field analysis:
➢ Takes the infinite-width asymptotics to guarantee convergence.
➢ Its generalization error is not well understood.
• (Usual) gradient Langevin dynamics:
➢ Suffers from the curse of dimensionality.
Our formulation:
Infinite dimensional gradient Langevin dynamics.
Infinite dim neural network
• 2-layer NN: direct expression (training loss)
• 2-layer NN: transportation map expression (infinite width; integral representation)
Also includes DNN, ResNet, etc.
Setting 𝑎𝑚 = 0 for 𝑚 > 𝑀 recovers a finite width network.
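A hedged reconstruction of the standard forms (activation σ, width M; notation mine, not necessarily the slide's exact display):

f_M(x) = \frac{1}{M}\sum_{m=1}^{M} a_m\,\sigma(w_m^\top x) \quad\text{(direct, finite width)}, \qquad \hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big) \quad\text{(training loss)},

f(x) = \int a\,\sigma(w^\top x)\, d\rho(a,w) \quad\text{(integral representation, infinite width)}.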
Mean field model
Expectation w.r.t. prob. density 𝜌 of (𝑎, 𝑤):
Optimization of 𝑓 ⇔ Optimization of 𝜌
Continuity equation (infinite width): the velocity field 𝑣𝑡 (the gradient) governs the movement of each particle, and the distribution 𝜌𝑡 evolves accordingly.
Convergence is guaranteed for 𝜌𝑡 with a density.
[Nitanda & Suzuki, 2017] [Chizat & Bach, 2018] [Mei, Montanari & Nguyen, 2018]
Each neuron corresponds to one particle.
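For reference, the continuity equation in its standard mean-field form (a textbook statement, not necessarily the slide's exact display):

\partial_t \rho_t = -\nabla\cdot(\rho_t\, v_t), \qquad v_t = -\nabla\,\frac{\delta \hat{L}(\rho_t)}{\delta \rho},

where \hat{L}(\rho) is the training loss viewed as a functional of the particle distribution: each particle (neuron) follows the velocity field 𝑣𝑡, and 𝜌𝑡 is transported along it.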
“Lift” of neural network training
Transportation map formulation (finite width): if 𝜌0 has a finite discrete support, we recover a finite width network.
Finite and infinite width can be treated in a unifying manner
(unlike existing frameworks such as NTK and mean field analysis).
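A sketch of the lift, following this setup (notation mine): the optimization variable is a transportation map W = (a(·), w(·)) in a Hilbert space, applied to a fixed base measure 𝜌0:

f_W(x) = \int a(\tau)\,\sigma\big(w(\tau)^\top x\big)\, d\rho_0(\tau).

If \rho_0 = \frac{1}{M}\sum_{m=1}^{M}\delta_{\tau_m} is discrete, this collapses to the finite width network \frac{1}{M}\sum_{m} a(\tau_m)\,\sigma(w(\tau_m)^\top x); a continuous \rho_0 gives the infinite width model, so one object covers both regimes.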
Infinite-dim non-convex optimization
Ex.: ℋ = 𝐿2(𝜌); ℋ𝐾: RKHS (e.g., a Sobolev space).
The objective is nonconvex, and we seek its global optimal solution.
We utilize gradient Langevin dynamics
in a Hilbert space to optimize the objective.
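Schematically, the problem has the regularized empirical-risk shape (λ > 0 is a hypothetical regularization weight; the paper's exact objective may differ):

\min_{W \in \mathcal{H}_K}\; \hat{L}(W) + \lambda\,\|W\|_{\mathcal{H}_K}^2, \qquad \hat{L}(W) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, f_W(x_i)\big).

The data-fit term \hat{L} is nonconvex in W, while the RKHS penalty plays the role of a Gaussian prior in the Langevin dynamics below.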
Infinite-dim. Langevin dynamics
ℋ𝐾: RKHS with kernel 𝐾.
Cylindrical Brownian motion:
Time discretization
Analogous to a Gaussian process estimator
(the Gaussian measure associated with the RKHS).
Stationary distribution ∝ likelihood × prior.
(More precisely, we consider a semi-implicit Euler scheme.)
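A minimal numerical sketch of the discretized dynamics, assuming a truncation to the first K eigenmodes of the prior covariance and a toy nonconvex gradient standing in for the NN training loss; every name and constant below is illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Truncate the infinite-dimensional state to its first K eigenmodes,
# with the polynomial eigenvalue decay mu_k ~ k^{-2} assumed on the next slide.
K = 100
mu = np.arange(1, K + 1, dtype=float) ** -2.0  # prior covariance spectrum

beta = 100.0  # inverse temperature
eta = 1e-3    # step size
lam = 1.0     # weight of the RKHS (Gaussian-prior) penalty

def grad_loss(x):
    # Gradient of the smooth nonconvex toy potential sum_k x_k*sin(x_k) + cos(x_k).
    return x * np.cos(x)

x = np.zeros(K)
for _ in range(10_000):
    xi = rng.standard_normal(K)  # coordinate form of the cylindrical noise
    # Semi-implicit Euler step: the stiff linear prior term lam * x_k / mu_k
    # is treated implicitly, which keeps the update stable as mu_k -> 0.
    x = (x - eta * grad_loss(x) + np.sqrt(2.0 * eta / beta) * xi) / (1.0 + eta * lam / mu)
```

The update approximately targets the Gibbs distribution ∝ exp(−β(loss + λ‖·‖²)) on the truncated space; an explicit Euler step would instead require η smaller than roughly the smallest 𝜇𝑘 to remain stable, which is the point of the semi-implicit scheme.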
Infinite dimensional setting
Hilbert space
RKHS structure
Assumption (eigenvalue decay)
(not essential; it can be relaxed to 𝜇𝑘 ∼ 𝑘−𝑝 for 𝑝 > 1)
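Written out, one standard form of such an assumption (a reconstruction; the exact constants are the paper's) concerns the spectrum of the covariance operator of the kernel K:

T_K = \sum_{k=1}^{\infty} \mu_k\, e_k \otimes e_k \quad\text{on } \mathcal{H}, \qquad \mu_k \lesssim k^{-2},

i.e., the RKHS ℋ𝐾 embeds into ℋ with polynomially decaying eigenvalues; the decay rate governs the effective dimension of the problem.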
Risk bounds of NN training
Optimization method: the infinite dimensional GLD with the time discretization introduced above.
Gen. error and excess risk:
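For reference, the standard definitions of these two quantities (not necessarily the slide's exact display):

\text{Gen. error} = \Big|\,\mathbb{E}_{(X,Y)}\big[\ell(Y, f_W(X))\big] - \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\ell\big(y_i, f_W(x_i)\big)\Big|, \qquad \text{Excess risk} = \mathbb{E}\big[\ell(Y, f_W(X))\big] - \inf_{f}\,\mathbb{E}\big[\ell(Y, f(X))\big].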
Error bound
Assumption
• The loss function ℓ is “sufficiently smooth.”
• The loss and its gradients are bounded.
Thm (Generalization error bound)
With probability 1 − 𝛿, the generalization error of the iterate is bounded by an Ο(1/√𝑛) term plus the optimization error.
➢ The Ο(1/√𝑛) term comes from a PAC-Bayesian stability bound [Rivasplata, Kuzborskij, Szepesvári & Shawe-Taylor, 2019].
➢ Opt. error: geometric ergodicity + time discretization, with Λ𝜂∗ the spectral gap [Muzellec, Sato, Massias & Suzuki, arXiv:2003.00306 (2020)].
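Putting the annotations together, the bound plausibly has the schematic shape (constants, smoothness exponents, and log factors suppressed; structure only, not the exact statement):

\text{gen. error at iterate } k \;\lesssim\; \underbrace{\tfrac{1}{\sqrt{n}}}_{\text{PAC-Bayesian stability}} \;+\; \underbrace{e^{-\Lambda_\eta^*\,\eta k}}_{\text{geometric ergodicity}} \;+\; \underbrace{\mathrm{err}(\eta)}_{\text{time discretization}},

so running the dynamics longer kills the ergodicity term geometrically, while the discretization term vanishes as the step size η → 0.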
Fast rate: general result
Thm (Excess risk bound: fast rate)
Let … and … . Then the excess risk bound can be faster than Ο(1/√𝑛).
Example: classification & regression
Classification
Strong low noise condition:
For sufficiently large 𝑛 and any 𝛽 ≤ 𝑛, the excess classification error enjoys a fast rate.
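One standard formulation of the strong low noise (hard-margin) condition for binary labels Y ∈ {−1, +1} (a reconstruction, not the slide's exact statement):

\exists\,\delta \in (0,1]: \quad \big|\mathbb{E}[Y \mid X = x]\big| \ge \delta \quad \text{for } P_X\text{-almost every } x,

i.e., the Bayes classifier is never close to a coin flip; this is what permits excess classification error rates faster than Ο(1/√𝑛).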
Regression
Model:
Summary
Neural network optimization
• We formulate NN training as an infinite dimensional
gradient Langevin dynamics in RKHS.
➢“Lift” of noisy gradient descent trajectory.
• Global optimality is ensured.
➢Geometric ergodicity + time discretization error
• Generalization error bound + Excess risk bound.
➢(i) Ο(1/√𝑛) generalization error. (ii) Fast learning rate for the excess risk.
• Finite/infinite width can be treated in a unifying manner.
• Good generalization error guarantee
→ Different from NTK and mean field analysis.