Benefit of deep learning with non-convex noisy gradient descent
Taiji Suzuki¹,² and Shunta Akiyama¹
¹The University of Tokyo, ²AIP-RIKEN
ICLR 2021 (spotlight)
Background
Benefit of neural networks with an optimization guarantee.
[Figure: kernel method (linear model) — features fixed, only the output layer trainable — vs. deep model — all layers trainable]
• Statistical efficiency: Curse of dimensionality
• Optimization efficiency: Non-convexity
We show
• separation between deep and shallow in terms of excess risk,
• global optimality of noisy gradient descent.
Problem setting (teacher-student model)
Teacher and student model:
Student parameters: trainable.
Teacher parameters: fixed.
Observation model: y_i = f∘(x_i) + ξ_i (f∘: true function, ξ_i: i.i.d. noise).
From the observed data (x_i, y_i)_{i=1}^n, we estimate f∘.
Excess risk (mean squared error): E_X[(f̂(X) − f∘(X))²].
→ Convergence rate?
→ Deep vs. shallow?
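The concrete teacher and student network formulas are given on the slide and in the paper; purely as a sketch of the surrounding regression setup (the input distribution, noise level, and the stand-in f_true below are illustrative assumptions, not the paper's model), the observation model and a Monte-Carlo estimate of the excess risk look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise_std = 5, 1000, 0.1            # input dimension, sample size, noise level (illustrative)

def f_true(x):
    # Stand-in for the true function f°; in the paper this is the teacher network.
    return np.tanh(x @ np.ones(d) / np.sqrt(d))

# Observation model: y_i = f°(x_i) + xi_i with i.i.d. noise.
X = rng.standard_normal((n, d))
y = f_true(X) + noise_std * rng.standard_normal(n)

def excess_risk(f_hat, n_mc=100_000):
    """Monte-Carlo estimate of the excess risk E[(f_hat(X) - f°(X))^2]."""
    X_mc = rng.standard_normal((n_mc, d))
    return np.mean((f_hat(X_mc) - f_true(X_mc)) ** 2)

print(excess_risk(lambda x: np.zeros(len(x))))    # e.g., the trivial zero predictor
```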
Condition on parameters
• μ_m ∝ m^{−2}
• a_m ∝ μ_m^{α1} for α1 > 1/2
• b_m ∝ μ_m^{α2} for α2 > γ/2
• Activation function σ is sufficiently smooth.
f∘: true function.
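A tiny numerical illustration of these decay conditions (the proportionality constants, the values of α1, α2, γ, and the truncation level M are arbitrary choices satisfying the stated inequalities, not values from the paper):

```python
import numpy as np

M = 50                                    # truncation level (illustrative)
alpha1, alpha2, gamma = 1.0, 0.6, 1.0     # chosen so that alpha1 > 1/2 and alpha2 > gamma/2

m = np.arange(1, M + 1)
mu = m ** (-2.0)                          # mu_m ∝ m^{-2}
a = mu ** alpha1                          # a_m ∝ mu_m^{alpha1}
b = mu ** alpha2                          # b_m ∝ mu_m^{alpha2}

print(mu[:5])                             # rapidly decaying coefficient sequences
print(a[:5], b[:5])
```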
Related work
• Generalization error comparison via Rademacher complexity analysis
  [Allen-Zhu & Li (2019; 2020); Li et al. (2020); Bai & Lee (2020); Chen et al. (2020)]
  → Separation between deep and shallow is shown.
  → They do not give a tight comparison of the excess risk.
  → Every derived rate is O(1/√n): a difference in the rate of convergence is not shown.
• Approximation ability
  [E, Ma & Wu (2018); Ghorbani et al. (2020); Yehudai & Shamir (2019)]
  → Estimation error with an optimization guarantee is not compared.
  → Some of them require d → ∞. What happens for fixed d?
• Sparse regularization and Frank-Wolfe algorithm
  [Barron (1993); Chizat & Bach (2020); Chizat (2019); Gunasekar et al. (2018); Woodworth et al. (2020); Klusowski & Barron (2016)]
  → Models with sparsity-inducing regularization.
  → A Frank-Wolfe-type method is analyzed. What happens for GD?
Question: Convergence of excess risk + optimization guarantee by GD?
Approach
Loss function (squared loss): ℓ(y, f(x)) = (y − f(x))².
Regularized empirical risk minimization (the parameter is an element of ℋ1):
→ an infinite-dimensional non-convex optimization problem.
Our approach:
• Finite-dimensional truncation (truncation level M).
• Langevin dynamics for the truncated parameter (noisy gradient descent: a gradient step plus Gaussian noise); a minimal sketch follows below.
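A minimal sketch of the resulting noisy gradient descent (gradient Langevin dynamics) on a truncated, regularized empirical risk. The feature map, data, and hyperparameters below are placeholders (not the paper's student network or tuning), and only the plain explicit update is shown; the paper's analysis uses the semi-implicit variant described on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 200, 5, 50                        # samples, input dim, truncation level (illustrative)
eta, beta, lam, steps = 1e-2, 1e3, 1e-2, 2000   # step size, inverse temperature, ridge weight, iterations

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)                  # placeholder labels; in the paper y comes from the teacher

B = rng.standard_normal((d, M))             # placeholder feature map (not the paper's student model)
Phi = np.tanh(X @ B)                        # n x M design matrix

def grad_obj(W):
    """Gradient of the regularized empirical squared loss (1/n)||Phi W - y||^2 + lam * ||W||^2."""
    return (2.0 / n) * Phi.T @ (Phi @ W - y) + 2.0 * lam * W

W = np.zeros(M)
for _ in range(steps):
    xi = rng.standard_normal(M)             # fresh Gaussian noise at every step
    # Noisy gradient descent: gradient step plus Gaussian noise scaled by sqrt(2*eta/beta).
    W = W - eta * grad_obj(W) + np.sqrt(2.0 * eta / beta) * xi
```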
Infinite-dim. Langevin dynamics
Continuous-time dynamics driven by a cylindrical Brownian motion.
Time discretization (Euler-Maruyama scheme): a gradient step plus Gaussian noise.
In our theory, we use a slightly modified scheme (a semi-implicit Euler scheme) to handle the unbounded part of the dynamics.
[Muzellec, Sato, Massias, Suzuki (2020); Suzuki (NeurIPS2020)]
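For reference, a generic form of the two discretizations in this setting, written with L̂_λ for the regularized empirical risk, β for the inverse temperature, η for the step size, and A for the unbounded linear operator associated with the regularizer; this is a standard sketch of such schemes, and the exact splitting used in the cited papers may differ:

```latex
\begin{align*}
  dW_t    &= -\nabla \hat L_\lambda(W_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t
             && \text{(continuous dynamics; $B_t$: cylindrical Brownian motion)}\\
  W_{k+1} &= W_k - \eta\,\nabla \hat L_\lambda(W_k) + \sqrt{2\eta\beta^{-1}}\,\xi_k
             && \text{(Euler--Maruyama; $\xi_k \sim \mathcal N(0,I)$: Gaussian noise)}\\
  W_{k+1} &= (I + \eta A)^{-1}\bigl(W_k - \eta\,\nabla \hat L(W_k) + \sqrt{2\eta\beta^{-1}}\,\xi_k\bigr)
             && \text{(semi-implicit Euler; the unbounded part $A$ is treated implicitly)}
\end{align*}
```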
Optimization error bound
The distribution of W_t weakly converges to an invariant measure π∞ of the continuous dynamics (geometric ergodicity).
The invariant measure is analogous to a Bayes posterior: a likelihood-like term times a prior-like term.
Thm (informal): the optimization error is bounded by a geometric-ergodicity term plus the errors due to time discretization and finite-dimensional truncation.
[Muzellec, Sato, Massias, Suzuki (2020); Suzuki (NeurIPS2020)]
• Convergence to a near-global optimum is guaranteed even though the objective is non-convex.
• The rate of convergence is independent of the dimensionality.
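The "analogous to a Bayes posterior" remark refers to the usual Gibbs form of the stationary distribution of such Langevin dynamics. Schematically, with L̂_n the empirical loss, β the inverse temperature, and ν the Gaussian measure induced by the regularization playing the role of a prior (a generic sketch, not the paper's exact statement):

```latex
\[
  \pi_\infty(\mathrm{d}W) \;\propto\;
  \underbrace{\exp\!\bigl(-\beta\,\hat L_n(W)\bigr)}_{\text{likelihood-like term}}\,
  \underbrace{\nu(\mathrm{d}W)}_{\text{prior-like term}}
\]
```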
Excess risk bound for deep learning
Condition (recap):
• μ_m ∝ m^{−2}
• a_m ∝ μ_m^{α1} for α1 > 1/2
• b_m ∝ μ_m^{α2} for α2 > γ/2
• Activation function σ is sufficiently smooth.
Neural network training by gradient Langevin dynamics (noisy gradient descent).
Thm (Bound of excess risk for deep learning): Let λ = β^{−1} = Θ(1/n); then, for M = Ω(n^{1/2(α1−3α2+1)}), the excess risk converges at a rate that is independent of the dimension (free from the curse of dimensionality).
Lower bound for linear estimators
We compare the rate of convergence of DL with that of linear estimators.
Linear estimator: any estimator whose prediction is linear in the observed labels, e.g., kernel ridge regression. Examples include:
• Kernel ridge estimator
• Sieve estimator
• Nadaraya-Watson estimator
• k-NN estimator
Minimax risk of linear estimators: a lower bound on the worst-case error of any linear estimator.
Thm (Lower bound of minimax risk for linear estimators): under the stated conditions, for any κ′ > 0, the minimax risk of linear estimators is lower bounded by a rate that depends on the dimension (curse of dimensionality!).
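For reference, the textbook form of a linear estimator (linear in the labels y_i, with weight functions that may depend arbitrarily on the inputs), together with kernel ridge regression as one instance (λ here is the ridge parameter of that example, not the λ of the deep-learning theorem):

```latex
\[
  \hat f(x) \;=\; \sum_{i=1}^{n} \varphi_i(x;\,x_1,\dots,x_n)\,y_i,
\]
\[
  \text{e.g., kernel ridge regression: } \hat f(x) = k(x)^\top (K + \lambda I)^{-1} y,
  \qquad K_{ij} = k(x_i,x_j),\; k(x)_i = k(x,x_i).
\]
```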
Comparison between deep and shallow
1. Excess risk of deep learning: the rate does not depend on d.
2. Minimax excess risk of linear estimators: this rate depends on d → large error in the high-dimensional setting (curse of dimensionality).
Ex.: α1 = γ + 3α2 (with a suitable choice of α2): the deep rate beats the linear (kernel) rate, and the gap grows for large γ and large d — the linear estimator's error can be n-times larger!
Separation of estimation performance is shown for a realistic optimization procedure.
Summary
• Analyzed the excess risk of deep and shallow methods in a teacher-student model.
• Gradient Langevin dynamics (noisy gradient descent) was applied to train the neural network.
• The Langevin dynamics achieves a near-globally-optimal solution.
• We derived excess risk bounds for deep learning and for linear estimators:
  DL can avoid the curse of dimensionality.
  Any linear estimator suffers from the curse of dimensionality.
Separation between deep and shallow was shown in terms of excess risk with an optimization guarantee.
[Table: excess risk rates — deep vs. linear (kernel)]