1) The document presents a new compression-based bound for analyzing the generalization error of large deep neural networks, even when the networks are not explicitly compressed.
2) It shows that if a trained network's weights and covariance matrices exhibit low-rank properties, then the network has a small intrinsic dimensionality and can be efficiently compressed.
3) This allows deriving a tighter generalization bound than existing approaches, providing insight into why overparameterized networks generalize well despite having more parameters than training examples.
ICLR 2020: Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network
1. Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network
Taiji Suzuki¹, Hiroshi Abe², Tomoaki Nishimura³
¹ University of Tokyo / AIP-RIKEN / Japan Digital Design
² iPride
³ NTT Data Corporation
https://openreview.net/forum?id=ByeGzlrKwH
3. Generalization error of DL
• Training data: $D_n = \{(x_i, y_i)\}_{i=1}^n$ (i.i.d.).
• Loss function $\ell$: 1-Lipschitz continuous w.r.t. $f$.
• Empirical risk (training error): $\widehat{L}(f) = \frac{1}{n}\sum_{i=1}^n \ell(y_i, f(x_i))$.
• Population risk (generalization error): $L(f) = \mathbb{E}_{(X,Y)}[\ell(Y, f(X))]$.
For an estimator $\hat{f}$ (DNN), we want to bound the generalization gap $L(\hat{f}) - \widehat{L}(\hat{f})$.
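As a concrete (hypothetical) illustration of estimating this gap in practice, here is a minimal PyTorch sketch; `model`, `train_loader`, `test_loader`, and `loss_fn` are placeholder names, not objects from the paper:

import torch

def empirical_risk(model, loader, loss_fn):
    # average loss over a data loader; on training data this is L_hat(f),
    # on held-out data it approximates the population risk L(f)
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(model(x), y).item() * y.shape[0]  # loss_fn assumed to return the batch mean
            count += y.shape[0]
    return total / count

# estimated generalization gap (placeholders assumed to exist):
# gap = empirical_risk(model, test_loader, loss_fn) - empirical_risk(model, train_loader, loss_fn)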
4. Naïve bound (VC-bound)
$L(\hat{f}) - \widehat{L}(\hat{f}) \lesssim \sqrt{\mathrm{VCdim}(\mathcal{F})/n}$ (up to log factors), where the VC-dimension of a depth-$L$ network satisfies $\mathrm{VCdim}(\mathcal{F}) = \tilde{O}\big(L \sum_{\ell=1}^{L} m_\ell m_{\ell+1}\big)$ [Harvey et al., 2017].
☹ The number of parameters $\sum_{\ell=1}^{L} m_\ell m_{\ell+1}$ appears in the bound.
☹ It does not explain the generalization ability of overparameterized networks.
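A quick back-of-the-envelope check (illustrative widths, not the exact VGG configuration) of why the parameter count makes this bound vacuous:

import math

# illustrative fully-connected widths m_1, ..., m_{L+1} (not the actual VGG-19 configuration)
widths = [27, 64, 128, 256, 512, 512, 10]
n = 50_000  # CIFAR-10 training set size

num_params = sum(widths[i] * widths[i + 1] for i in range(len(widths) - 1))
bound_scale = math.sqrt(num_params / n)  # a VC-type bound scales like sqrt(#params / n)
print(num_params, bound_scale)           # ~441k params, scale ~3 > 1: the bound is vacuous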
5. Compression based bound
Typical compression based bound [Arora et al., 2018; Zhou et al., 2019; Baykal et al., 2019; Suzuki et al., 2018]: the original network $f$ (widths $m_\ell$) is compressed to a smaller network $f^\#$ (widths $m_\ell^\# \ll m_\ell$); compressible ⇔ simple. Schematically,
\[
L(f^\#) \;\le\; \widehat{L}(f) \;+\; \underbrace{\text{(compression error)}}_{\text{bias}} \;+\; \underbrace{\tilde{O}\Big(\sqrt{\textstyle\sum_{\ell} m_\ell^\# m_{\ell+1}^\# / n}\Big)}_{\text{variance}}.
\]
6. Compression based bound (cont.)
The variance term is governed by the size of the compressed network, while the bias term grows as the compression becomes more aggressive: a bias-variance trade-off.
☹ This type of bound gives the generalization error of the compressed network $f^\#$, not of the original network $f$.
Q: What happens for the "non-compressed" network $f$?
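A minimal numerical sketch of this trade-off, using rank truncation of a synthetic spectrum as the compression scheme; the rank r stands in for the compressed width m_ℓ^#, and the variance term is only schematic:

import numpy as np

n = 50_000                          # number of training examples
m = 512                             # original layer width
s = np.arange(1.0, m + 1) ** -1.0   # polynomially decaying singular values (illustrative)

for r in [8, 32, 128, 512]:
    bias = np.sqrt((s[r:] ** 2).sum())  # compression error ||W - W^#||_F after rank-r truncation
    variance = np.sqrt(r * m / n)       # complexity of the rank-r compressed class (schematic)
    print(f"rank {r:3d}: bias {bias:.3f} + variance {variance:.3f} = {bias + variance:.3f}")

Increasing the rank r shrinks the bias but inflates the variance, so the best bound is attained at an intermediate compression level.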
7. Our new compression based bound
Assumption: the trained network $f$ can be compressed to a smaller one $f^\#$
($f \in \mathcal{F}$, $f^\# \in \mathcal{F}^\#$; $\mathcal{F}$ is a set of trained nets, $\mathcal{F}^\#$ is a set of compressed nets).
The compression scheme can be data dependent. (This assumption also restricts the training procedure.)
Our new compression based bound (main result), schematically:
\[
L(f) \;\le\; \widehat{L}(f) \;+\; \underbrace{\text{(compression error)}}_{\text{bias}} \;+\; \underbrace{\text{(complexity of the compressed class } \mathcal{F}^\#\text{)}}_{\text{variance}},
\]
i.e., unlike the existing bound, the generalization error of the original network $f$ itself is bounded.
8. Our new compression based bound (cont.)
Compared with the existing bound, the variance term can be smaller: it is determined by the size of the compressed network (widths $m_\ell^\#$, rank $r$) rather than by the original widths $m_\ell$. ⇒ Improved bias-variance trade-off.
9. More precise description
Theorem (compression based bound for the original net). Suppose the trained network $f \in \mathcal{F}$ can be compressed to a smaller one $f^\# \in \mathcal{F}^\#$, where the compression scheme can be data dependent (this assumption also restricts the training procedure). Then, with probability at least $1 - e^{-t}$, schematically,
\[
L(f) \;\le\; \widehat{L}(f) \;+\; \underbrace{\text{(compression error)}}_{\text{bias}} \;+\; \underbrace{\tilde{O}\big(\mathfrak{R}(\mathcal{F}^\#)\big)}_{\text{variance: main part, } O(1/\sqrt{n})} \;+\; \underbrace{\tilde{O}\big(r_* + t/n\big)}_{\text{fast part, } O(1/n)},
\]
where:
• $\mathfrak{R}(\mathcal{F}^\#)$ is the local Rademacher complexity of the compressed class $\mathcal{F}^\#$;
• $r_*$ is the fixed point of the local Rademacher complexity;
• the bias and variance terms play the same roles as in the previous slides.
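For reference, a standard definition of the local Rademacher complexity and its fixed point (a generic textbook form; the paper's exact definition may differ in constants and localization):
\[
\mathfrak{R}_n(\mathcal{F}^\#; r) \;=\; \mathbb{E}_{\sigma}\!\left[\sup_{f \in \mathcal{F}^\#,\, \mathbb{E}[f^2] \le r} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\right], \qquad \sigma_1,\dots,\sigma_n \ \text{i.i.d. uniform on } \{\pm 1\},
\]
and the fixed point $r_*$ is the smallest $r > 0$ such that $\mathfrak{R}_n(\mathcal{F}^\#; r) \le r$ (up to universal constants).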
11. Singular values of weight matrix
[Figure: eigenvalues of the covariance matrix (left) and singular values of the weight matrix (right) for the 7th layer of VGG-19 trained on CIFAR-10; both spectra show rapid decay. See also Martin & Mahoney, arXiv:1901.08276.]
Both the covariance matrix and the weight matrix show rapid decay of eigenvalues ⇒ small degree of freedom.
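A sketch of how such spectra can be inspected with torchvision's pretrained VGG-19 (ImageNet weights as a stand-in for the CIFAR-10-trained network in the figure; the layer index is an illustrative choice):

import torch
import torchvision

# pretrained ImageNet weights as a stand-in for the CIFAR-10-trained VGG-19 of the slide
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1")
W = vgg.features[14].weight.detach()    # an intermediate conv layer (illustrative choice)
s = torch.linalg.svdvals(W.flatten(1))  # singular values of the (out_ch x in_ch*k*k) matrix
print((s / s[0])[:10])                  # rapid decay of the normalized singular values

# the covariance spectrum would be obtained analogously from a batch A (n x d)
# of that layer's input activations:
# C = torch.cov(A.T); mu = torch.linalg.eigvalsh(C)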
12. Near low rank weight and covariance
• Near low rank weight matrix: the singular values of each weight matrix $W^{(\ell)}$ decay rapidly (e.g., polynomially in the index $j$).
• Both the weight matrices and the covariance matrices of the layer-wise activations are near low rank.
Theorem (informal). Under near low rankness of the weight and covariance matrices (plus other boundedness conditions), the network can be compressed efficiently, and the resulting generalization bound is governed by the degree of freedom of each layer.
This is much smaller than the VC-bound, which scales with the full parameter count $\sum_{\ell} m_\ell m_{\ell+1}$.
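One standard way to turn a decaying spectrum into an effective dimension is the ridge-type degree of freedom below; this is a generic definition used for illustration, and the paper's exact quantity may differ:

import numpy as np

def degree_of_freedom(eigvals, lam):
    """Effective dimension N(lam) = sum_j mu_j / (mu_j + lam) of a spectrum at scale lam."""
    mu = np.asarray(eigvals, dtype=float)
    return float((mu / (mu + lam)).sum())

mu = np.arange(1.0, 513.0) ** -2.0          # polynomially decaying eigenvalues (illustrative)
for lam in [1e-1, 1e-2, 1e-3]:
    print(lam, degree_of_freedom(mu, lam))  # stays far below the ambient dimension 512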
13. Comparison with existing work
[Figure: comparison of intrinsic dimensionality between our degree of freedom and that in Arora et al. (2018), computed on a VGG-19 network trained on CIFAR-10; our degree of freedom is smaller.]
[S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. ICML 2018.]
14. Summary
Why can an overparameterized network generalize?
• If the network can be compressed to a smaller one, then it generalizes well.
• A general framework to obtain a compression based bound for the non-compressed net is derived.
• Our bound gives a better bias-variance trade-off.
• If the covariance and weight matrices are near low rank, then the network can be compressed efficiently ⇒ better generalization.
For more details, please see our paper:
https://openreview.net/forum?id=ByeGzlrKwH