Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network

Taiji Suzuki¹, Hiroshi Abe², Tomoaki Nishimura³
¹ University of Tokyo / AIP-RIKEN / Japan Digital Design
² iPride
³ NTT Data Corporation
https://openreview.net/forum?id=ByeGzlrKwH
Generalization of overparameterized networks

Modern deep networks have # of parameters (billions) ≫ sample size (millions) [Neyshabur et al., ICLR 2019].
Why do they generalize?
⇒ Their intrinsic dimensionality is small. This motivates compression based bounds.
Generalization error of DL

• Training data: D_n = {(x_i, y_i)}_{i=1}^n, drawn i.i.d.
• Loss function ψ: 1-Lipschitz continuous w.r.t. 𝑓.
• Empirical risk (training error): L̂(𝑓) = (1/n) Σ_{i=1}^n ψ(y_i, 𝑓(x_i)).
• Population risk (generalization error): L(𝑓) = E[ψ(Y, 𝑓(X))].

For an estimator 𝑓 (a trained DNN), we want to bound the generalization gap L(𝑓) − L̂(𝑓).
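To connect this to the bounds on the following slides: the standard route, not spelled out on the slide, is a uniform deviation bound over the model class ℱ (via symmetrization plus McDiarmid's inequality), schematically:

```latex
\[
L(\hat{f}) - \hat{L}(\hat{f})
\;\le\; \sup_{f \in \mathcal{F}} \bigl( L(f) - \hat{L}(f) \bigr)
\;\le\; 2\,\mathrm{Rad}_{n}(\psi \circ \mathcal{F}) + O\!\Bigl(\sqrt{\tfrac{t}{n}}\Bigr)
\quad \text{with probability } 1 - e^{-t},
\]
```

where Rad_n denotes the (global) Rademacher complexity of the loss-composed class, which is in turn controlled by quantities such as the VC-dimension (next slide) or the size of a compressed class (later slides).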
Naïve bound (VC-bound)

Gen. gap ≲ √(VCdim/n), where for a depth-L network with layer widths m_1, …, m_{L+1},
VCdim = Õ(L · Σ_{ℓ=1}^{L} m_ℓ m_{ℓ+1}) [Harvey et al., 2017].
☹ The total number of parameters Σ_{ℓ=1}^{L} m_ℓ m_{ℓ+1} appears in the bound.
☹ It does not explain the generalization ability of overparameterized nets.
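As a rough illustrative calculation (assuming a VGG-19-scale network with about 2×10⁷ parameters trained on the n = 5×10⁴ CIFAR-10 examples; these numbers are stand-ins, not from the slide):

```latex
\[
\sqrt{\frac{\mathrm{VCdim}}{n}}
\;\gtrsim\;
\sqrt{\frac{\sum_{\ell=1}^{L} m_\ell m_{\ell+1}}{n}}
\;\approx\;
\sqrt{\frac{2 \times 10^{7}}{5 \times 10^{4}}}
\;=\; \sqrt{400} \;=\; 20 \;\gg\; 1,
\]
```

so the naïve bound is vacuous for such overparameterized models.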
Compression based bound

Typical compression based bound [Arora et al., 2018; Zhou et al., 2019; Baykal et al., 2019; Suzuki et al., 2018]:
the original network 𝑓 (widths m_ℓ) is compressed to a smaller network 𝑓# (widths m_ℓ# ≪ m_ℓ); compressible ⇔ simple. Schematically,

L(𝑓#) ≤ L̂(𝑓) + (bias: distance between original net 𝑓 and compressed net 𝑓#) + (variance: Õ(√(size of compressed network / n))).

This is a bias-variance trade-off: compressing harder shrinks the variance term but inflates the bias term.
☹ This type of bound gives the generalization error of the compressed network 𝑓#, not of 𝒇.
Q: What happens for the “non-compressed” network 𝒇? (A compression sketch follows below.)
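A minimal, illustrative sketch of one compression scheme such bounds cover: low-rank truncation of a layer's weight matrix via SVD. The paper's actual compression scheme and error measure may differ, and the synthetic decay rate below is an assumption.

```python
# Low-rank compression of a synthetic "near low rank" weight matrix.
import numpy as np

rng = np.random.default_rng(0)
m = 512
# Random orthogonal factors with polynomially decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
s = np.arange(1, m + 1, dtype=float) ** -1.5   # assumed decay rate
W = (U * s) @ V.T                              # W = U diag(s) V^T

def compress(W, k):
    """Keep the top-k singular directions: the compressed layer W_k."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

for k in (8, 32, 128):
    Wk = compress(W, k)
    err = np.linalg.norm(W - Wk, 2)   # spectral-norm compression error (the "bias")
    n_params = k * (2 * m + 1)        # parameters of the rank-k factorization
    print(f"rank {k:3d}: {n_params:7d} params vs {m*m} original, error {err:.4f}")
```

With fast spectral decay, the parameter count (the "variance" side) drops by orders of magnitude while the compression error (the "bias" side) stays tiny.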
Our new compression based bound

Assumption: the trained network 𝑓 ∈ ℱ can be compressed to a smaller one 𝑓# ∈ ℱ# within accuracy 𝑟
(ℱ is a set of trained nets, ℱ# is a set of compressed nets, with widths m_ℓ# ≪ m_ℓ). The compression scheme can be data dependent.
(This assumption restricts the training procedure too.)

Our new compression based bound (main result) controls the generalization error of the original, non-compressed network 𝑓. Its variance term is given by the complexity of the compressed class ℱ# localized at the compression radius 𝑟, whereas the existing bound's variance term scales with the full size of the compressed network. Hence the variance term can be smaller: improved.
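Schematically, and only as a paraphrase of the two displayed bounds (the originals are images in the deck; exact constants and norms are in the paper):

```latex
\[
\text{existing:}\quad
L(f^{\#}) \le \hat{L}(f)
+ \underbrace{r}_{\text{bias}}
+ \underbrace{\tilde{O}\!\Bigl(\sqrt{\tfrac{\sum_{\ell} m^{\#}_{\ell} m^{\#}_{\ell+1}}{n}}\Bigr)}_{\text{variance}},
\qquad
\text{ours:}\quad
L(\hat{f}) \le \hat{L}(\hat{f})
+ \underbrace{O(r)}_{\text{bias}}
+ \underbrace{\tilde{O}\bigl(\mathrm{Rad}(\mathcal{F}^{\#};\, r)\bigr)}_{\text{variance, localized at } r}.
\]
```

The localized complexity Rad(ℱ#; r) can be much smaller than the global √(size/n) term when the compression radius r is small.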
More precise description

Theorem (compression based bound for the original net)
Suppose the trained network 𝑓 ∈ ℱ can be compressed to a smaller one 𝑓# ∈ ℱ#
(ℱ is a set of trained nets, ℱ# is a set of compressed nets); the compression scheme can be data dependent
(this assumption restricts the training procedure too). Then, with probability at least 1 − e^{−t},
the generalization gap of 𝑓 is bounded by the sum of
• a main part of order O(1/√𝒏): the "variance" term, governed by the local Rademacher complexity of ℱ#;
• a fast part of order O(1/n): the "bias" term, involving the compression accuracy and the confidence parameter t;
where r_* denotes the fixed point of the local Rademacher complexity.
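The formulas on this slide are images; as a hedged reconstruction, the two quantities named above are standard (e.g., Bartlett, Bousquet & Mendelson, 2005) and are presumably instantiated along these lines:

```latex
% Empirical local Rademacher complexity of the compressed class F^# at
% scale delta, with i.i.d. signs epsilon_i in {+1, -1}:
\[
\widehat{\mathrm{Rad}}(\mathcal{F}^{\#}; \delta)
= \mathbb{E}_{\epsilon}\!\left[ \sup_{f \in \mathcal{F}^{\#}:\, \|f\|_{n} \le \delta}
\frac{1}{n} \sum_{i=1}^{n} \epsilon_{i} f(x_{i}) \right],
\qquad
\|f\|_{n}^{2} = \frac{1}{n} \sum_{i=1}^{n} f(x_{i})^{2}.
\]
% Its fixed point, in one common convention: the smallest scale at which
% the complexity drops below the squared radius.
\[
r_{*} = \inf\bigl\{ \delta > 0 : \widehat{\mathrm{Rad}}(\mathcal{F}^{\#}; \delta) \le \delta^{2} \bigr\}.
\]
```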
Compression bounds for non-compressed networks with low rank properties
Singular values of weight matrix

[Figure: eigenvalues of the covariance matrix and singular values of the weight matrix for the 7th layer of VGG-19 trained on CIFAR-10; both show rapid decay. See also Martin & Mahoney, arXiv:1901.08276.]

Both the covariance matrix and the weight matrix show rapid decay of their eigenvalues/singular values.
⇒ Small degree of freedom.
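The same spectra are easy to inspect yourself. A sketch, using torchvision's ImageNet-pretrained VGG-19 as a stand-in for the CIFAR-10-trained model in the slide, and random inputs instead of real data:

```python
# Inspect spectral decay of a VGG-19 conv layer (weights + activation covariance).
import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

# Pick the 7th conv layer and flatten its 4D kernel into a matrix.
convs = [m for m in vgg.features if isinstance(m, torch.nn.Conv2d)]
W = convs[6].weight.detach()                 # (out_ch, in_ch, k, k)
W2d = W.reshape(W.shape[0], -1)              # (out_ch, in_ch * k * k)
sv = torch.linalg.svdvals(W2d)               # descending singular values
print("weight singular values (normalized):", (sv / sv[0])[:10])

# Eigenvalues of the channel covariance of the activations feeding that layer.
with torch.no_grad():
    x = torch.randn(32, 3, 32, 32)           # stand-in CIFAR-sized inputs
    idx = list(vgg.features).index(convs[6])
    h = torch.nn.Sequential(*list(vgg.features)[:idx])(x)
    feats = h.permute(0, 2, 3, 1).reshape(-1, h.shape[1])   # (pixels, channels)
    cov = torch.cov(feats.T)                 # channel-by-channel covariance
    eig = torch.linalg.eigvalsh(cov).flip(0) # descending eigenvalues
    print("covariance eigenvalues (normalized):", (eig / eig[0])[:10])
```

On a trained network, both printed sequences drop quickly, matching the figure's "rapid decay".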
Near low rank weight and covariance

• Near low rank weight matrix: the singular values of each weight matrix decay rapidly (e.g., polynomially in the index).
• Both the weight and the covariance matrices are near low rank.

Theorem
• Under near low rankness of the weight and covariance matrices, plus other boundedness conditions, the generalization gap is bounded in terms of a degree of freedom determined by the decay rates of the spectra.
• This is much smaller than the VC-bound, which scales with the full parameter count Σ_ℓ m_ℓ m_{ℓ+1}.
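A worked illustration of why fast decay means a small degree of freedom (with an assumed polynomial decay rate α; this is not the paper's exact condition): the best rank-k approximation error in operator norm equals the (k+1)-th singular value, so

```latex
\[
\sigma_{j}(W_{\ell}) \le V_{0}\, j^{-\alpha}
\;\Longrightarrow\;
\bigl\| W_{\ell} - W_{\ell}^{(k)} \bigr\|_{\mathrm{op}}
= \sigma_{k+1}(W_{\ell}) \le V_{0} (k+1)^{-\alpha},
\]
\[
\text{and compression error } r \text{ needs only rank }
k \approx (V_{0}/r)^{1/\alpha} \ll m_{\ell}
\text{ when } \alpha \text{ is large.}
\]
```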
Comparison with existing work

[Figure: comparison of intrinsic dimensionality between our degree of freedom and that in Arora et al. (2018), computed on a VGG-19 network trained on CIFAR-10.]

[S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. ICML 2018.]
Summary

Why can an overparameterized network generalize?
• If the network can be compressed to a smaller one, then it generalizes well.
✓ A general framework for obtaining compression based bounds for the non-compressed net is derived.
✓ Our bound gives a better bias-variance trade-off.
✓ If the covariance and weight matrices are near low rank, then the network can be compressed efficiently.
⇒ Better generalization.

For more details, please see our paper:
https://openreview.net/forum?id=ByeGzlrKwH