Presentation slides for our ICLR 2021 paper "Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods."
Abstract:
Establishing a theoretical analysis that explains why deep learning can outperform shallow learning such as kernel methods is one of the biggest issues in the deep learning literature. Towards answering this question, we evaluate the excess risk of a deep learning estimator trained by noisy gradient descent with ridge regularization on a mildly overparameterized neural network, and discuss its superiority to a class of linear estimators that includes the neural tangent kernel approach, the random feature model, other kernel methods, the k-NN estimator, and so on. We consider a teacher-student regression model, and eventually show that {\it any} linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate, especially in a high-dimensional setting. The obtained excess risk bounds are so-called fast learning rates, which are faster than the O(1/\sqrt{n}) rate obtained by the usual Rademacher complexity analysis. This discrepancy is induced by the non-convex geometry of the model, and the noisy gradient descent used for neural network training provably reaches a near-global optimal solution even though the loss landscape is highly non-convex. Although the noisy gradient descent does not employ any explicit or implicit sparsity-inducing regularization, it shows a preferable generalization performance that dominates linear estimators.
1. Benefit of Deep Learning with Non-Convex Noisy Gradient Descent
Taiji Suzuki¹,² and Shunta Akiyama¹
¹The University of Tokyo, ²AIP-RIKEN
ICLR 2021 (spotlight)
2. Background
Benefit of neural networks with an optimization guarantee.
Figure: kernel method (linear model), where the features are fixed and only the output layer is trainable, vs. deep model, where both the features and the output layer are trainable.
• Statistical efficiency: Curse of dimensionality
• Optimization efficiency: Non-convexity
We show
• separation between deep and shallow in terms of excess risk,
• global optimality of noisy gradient descent.
3. Problem setting (teacher-student model)
Teacher and student model, with both trainable and fixed parameters.
Observation model: y_i = f∘(x_i) + ξ_i, where f∘ is the true (teacher) function.
From the observed data (x_i, y_i), i = 1, ..., n, we estimate f∘.
Excess risk (mean squared error) of the estimator relative to f∘.
Convergence rate?
Deep vs shallow?
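To fix notation, the following is a minimal sketch of the standard teacher-student regression setup and the excess risk referred to on this slide; the symbols P_X and ξ_i are standard regression notation assumed here, not quoted from the paper.

```latex
% Teacher-student regression: a minimal sketch (standard notation, assumed).
% Observations: n i.i.d. pairs, inputs drawn from P_X, outputs generated by the
% teacher (true function) f^{\circ} plus noise.
\[
  y_i = f^{\circ}(x_i) + \xi_i, \qquad x_i \sim P_X, \qquad \xi_i \ \text{i.i.d. noise},
  \qquad i = 1, \dots, n .
\]
% Excess risk of an estimator \hat{f} under the squared loss: the population mean
% squared error relative to the best possible predictor f^{\circ}.
\[
  \mathrm{ExcessRisk}(\hat{f})
  \;=\; \mathbb{E}_{X \sim P_X}\!\big[ (\hat{f}(X) - f^{\circ}(X))^2 \big]
  \;=\; \| \hat{f} - f^{\circ} \|_{L_2(P_X)}^2 .
\]
```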
4. Condition on parameters
Condition (on the true function):
• μ_m ∝ m^{−2}
• a_m ∝ μ_m^{α1} for α1 > 1/2
• b_m ∝ μ_m^{α2} for α2 > γ/2
• The activation function σ is sufficiently smooth.
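As a small reading aid (not stated on the slide, but following directly from the bullets above by substitution), the conditions imply polynomial decay of the teacher's coefficients in m:

```latex
% Direct substitution of \mu_m \propto m^{-2} into the two conditions above:
\[
  a_m \;\propto\; \mu_m^{\alpha_1} \;\propto\; m^{-2\alpha_1} \quad (\alpha_1 > 1/2),
  \qquad
  b_m \;\propto\; \mu_m^{\alpha_2} \;\propto\; m^{-2\alpha_2} \quad (\alpha_2 > \gamma/2),
\]
% so larger \alpha_1, \alpha_2 mean faster decay of the teacher's coefficients in m.
```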
5. Related work
• Generalization error comparison via Rademacher complexity analysis.
  [Allen-Zhu & Li (2019; 2020); Li et al. (2020); Bai & Lee (2020); Chen et al. (2020)]
  Separation between deep and shallow.
  Every derived rate is O(1/√n): the difference in the rate of convergence is not shown.
• Approximation ability
  [E, Ma & Wu (2018); Ghorbani et al. (2020); Yehudai & Shamir (2019)]
  Estimation error with an optimization guarantee is not compared.
  Some of them require d → ∞. What happens for fixed d?
• Sparse regularization and Frank-Wolfe algorithm
  [Barron (1993); Chizat & Bach (2020); Chizat (2019); Gunasekar et al. (2018); Woodworth et al. (2020); Klusowski & Barron (2016)]
  Models with sparsity-inducing regularization.
  A Frank-Wolfe type method is analyzed. What happens for GD?
  They do not give a tight comparison for excess risk.
Question: Convergence of excess risk + optimization guarantee by GD?
6. Approach
Loss function (squared loss): (y − f(x))².
Regularized empirical risk minimization, where the parameter is an element of ℋ1:
an infinite-dimensional non-convex optimization problem.
Strategy:
• Finite-dimensional truncation (to dimension M).
• Langevin dynamics (noisy gradient descent with Gaussian noise) for the truncated parameter.
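As a concrete reading of the bullets above, here is a sketch of what a ridge-regularized empirical risk with squared loss looks like in this setting; the precise norm on the parameter and the exact parameterization of f_W are assumptions (the paper's own definitions apply), so treat this only as the generic shape of the objective.

```latex
% Generic shape of the regularized empirical risk (a sketch; the exact norm on W
% and the parameterization of f_W follow the paper, not this block):
\[
  \widehat{L}_\lambda(W)
  \;=\; \frac{1}{n} \sum_{i=1}^{n} \big( y_i - f_W(x_i) \big)^2
  \;+\; \lambda \, \| W \|^2 ,
  \qquad W \in \mathcal{H}^1 \ \text{(infinite-dimensional)},
\]
% which is non-convex in W because f_W depends non-linearly on the hidden-layer weights.
% The approach: truncate W to M coordinates, then run noisy gradient descent
% (gradient Langevin dynamics) on the truncated objective.
```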
7. Infinite-dim. Langevin dynamics
The continuous-time dynamics are driven by a cylindrical Brownian motion.
Time discretization (Euler-Maruyama scheme): a gradient step plus Gaussian noise.
In our theory, we used a slightly modified scheme (semi-implicit Euler scheme) to handle the unbounded part of the dynamics.
[Muzellec, Sato, Massias, Suzuki (2020); Suzuki (NeurIPS2020)]
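To make "noisy gradient descent" concrete, below is a minimal NumPy sketch of an Euler-Maruyama-discretized gradient Langevin dynamics on an M-dimensional truncation of a two-layer network with ridge regularization. The toy data, network shape, step size eta, inverse temperature beta, and all names are illustrative assumptions, not the paper's actual configuration, and the plain explicit scheme is shown rather than the semi-implicit variant used in the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy teacher-student data: y_i = f_true(x_i) + noise (illustrative only).
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.tanh(X @ rng.standard_normal(d)) + 0.1 * rng.standard_normal(n)

# Student: two-layer network with M hidden units (the finite-dim truncation).
M = 50
W = rng.standard_normal((M, d)) / np.sqrt(d)   # hidden-layer weights (trainable)
a = rng.standard_normal(M) / np.sqrt(M)        # output weights (trainable)

def predict(W, a, X):
    return np.tanh(X @ W.T) @ a

def grads(W, a, X, y, lam):
    """Gradients of the ridge-regularized empirical squared loss."""
    H = np.tanh(X @ W.T)                       # (n, M) hidden activations
    r = H @ a - y                              # residuals
    grad_a = 2 * (H.T @ r) / n + 2 * lam * a
    # d/dW of tanh(X W^T) is (1 - H**2) elementwise; chain rule through a and r.
    grad_W = 2 * ((r[:, None] * (1 - H**2) * a[None, :] / n).T @ X) + 2 * lam * W
    return grad_W, grad_a

# Gradient Langevin dynamics (explicit Euler-Maruyama): gradient step + Gaussian noise.
eta, beta, lam = 0.05, 1e4, 1e-3               # step size, inverse temperature, ridge weight
for step in range(2000):
    gW, ga = grads(W, a, X, y, lam)
    W += -eta * gW + np.sqrt(2 * eta / beta) * rng.standard_normal(W.shape)
    a += -eta * ga + np.sqrt(2 * eta / beta) * rng.standard_normal(a.shape)

print("train MSE:", np.mean((predict(W, a, X) - y) ** 2))
```

Setting the noise term to zero recovers plain gradient descent; the injected Gaussian noise is what lets the iterates escape bad local regions and approach the invariant (Gibbs) measure discussed on the next slide.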
8. Optimization error bound
Thm (informal): The distribution of W_t weakly converges to an invariant measure π∞ of the continuous dynamics (geometric ergodicity); the optimization error is controlled up to time-discretization and finite-dimensional truncation errors.
[Muzellec, Sato, Massias, Suzuki (2020); Suzuki (NeurIPS2020)]
The invariant measure is analogous to a Bayes posterior (likelihood × prior).
• Convergence to a near-global optimum is guaranteed even though the objective is non-convex.
• The rate of convergence is independent of the dimensionality.
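For intuition on the "analogous to a Bayes posterior" remark, here is the usual Gibbs-measure form that the invariant measure of Langevin dynamics takes; writing it with the ridge-regularized objective sketched on the Approach slide is an assumption about notation, not a quote of the paper's exact statement.

```latex
% Gibbs form of the invariant measure of Langevin dynamics with inverse temperature \beta
% (a sketch; the exact reference measure and objective follow the paper):
\[
  \pi_\infty(\mathrm{d}W)
  \;\propto\; \exp\!\big( -\beta \, \widehat{L}_\lambda(W) \big)\, \mathrm{d}W
  \;\propto\;
  \underbrace{\exp\!\Big( -\tfrac{\beta}{n} \sum_{i=1}^n (y_i - f_W(x_i))^2 \Big)}_{\text{likelihood-like term}}
  \;\cdot\;
  \underbrace{\exp\!\big( -\beta \lambda \|W\|^2 \big)}_{\text{Gaussian-prior-like term}} ,
\]
% so sampling from \pi_\infty resembles drawing from a Bayes posterior, and concentration
% of \pi_\infty around low-loss regions is what yields near-global optimality.
```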
9. Excess risk bound for deep learning
Condition (on the true function):
• μ_m ∝ m^{−2}
• a_m ∝ μ_m^{α1} for α1 > 1/2
• b_m ∝ μ_m^{α2} for α2 > γ/2
• The activation function σ is sufficiently smooth.
Neural network training by gradient Langevin dynamics (noisy gradient descent).
Thm (Bound of excess risk for deep learning)
Let λ = β^{−1} = Θ(1/n). Then, for M = Ω(n^{1/2(α1−3α2+1)}), the resulting excess risk bound is independent of the dimension (free from the curse of dimensionality).
10. Lower bound for linear estimators
Linear estimator: any estimator that is linear in the observed outputs (y_1, ..., y_n). Examples:
• Kernel ridge estimator
• Sieve estimator
• Nadaraya-Watson estimator
• k-NN estimator
e.g., kernel ridge regression is linear in y.
Minimax risk of linear estimators: a lower bound on the worst-case error of any linear estimator.
Thm (Lower bound of minimax risk for linear estimators)
Under the conditions above and for any κ′ > 0, the minimax risk of linear estimators is dependent on the dimension (curse of dimensionality!!!).
We compare the rate of convergence of DL with that of linear estimators.
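For reference, the class of "linear estimators" in this literature is usually defined as below, and kernel ridge regression is the canonical example; this block states that standard form (assumed here to match the slide's definition) rather than quoting the paper.

```latex
% Standard definition of a linear estimator: the prediction is linear in the
% observed outputs y_1, ..., y_n (the weights may depend arbitrarily on the inputs):
\[
  \hat{f}(x) \;=\; \sum_{i=1}^{n} y_i \, \varphi_i(x; x_1, \dots, x_n).
\]
% Canonical example (kernel ridge regression with kernel k and Gram matrix K):
\[
  \hat{f}(x) \;=\; k(x)^{\top} (K + \lambda I)^{-1} y,
  \qquad k(x) = \big(k(x, x_1), \dots, k(x, x_n)\big)^{\top},
\]
% which is of the form above with \varphi_i(x; x_{1:n}) = \big[(K + \lambda I)^{-1} k(x)\big]_i.
% Nadaraya-Watson and k-NN regression fit the same template with different weights \varphi_i.
```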
11. Comparison between deep and shallow
1. Excess risk of deep learning: independent of d.
2. Minimax excess risk of linear estimators: this rate depends on d, so the error is large in a high-dimensional setting (curse of dimensionality).
Ex.: α1 = γ + 3α2, α2 = 4α2
Deep vs. linear (kernel): the gap grows for large γ and large d; the linear-estimator error is n-times larger!!
Separation of estimation performance is shown for a realistic optimization.
12. Summary
• Analyzed the excess risk of deep and shallow methods in a teacher-student model.
• Gradient Langevin dynamics (noisy gradient descent) was applied to train the neural network.
• The Langevin dynamics achieves a near-global optimal solution.
• We derived excess risk bounds for deep learning and linear estimators:
  Deep learning can avoid the curse of dimensionality.
  Any linear estimator suffers from the curse of dimensionality.
Separation between deep and shallow (kernel) was shown in terms of excess risk with an optimization guarantee.