This is the presentation for the paper "Fractional Step Discriminant Pruning: A Filter Pruning Framework for Deep Convolutional Neural Networks", delivered by N. Gkalelis and V. Mezaris at the 7th IEEE Int. Workshop on Mobile Multimedia Computing (MMC2020) that was held as part of the IEEE Int. Conf. on Multimedia and Expo (ICME), in July 2020.
1. retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
Fractional step discriminant pruning: a filter pruning
framework for deep convolutional neural networks
N. Gkalelis, V. Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE Int. Conf. on Multimedia &
Expo Workshops, 7th MMC,
London, United Kingdom, July 2020
Outline
• Problem statement
• Related work
• Filter importance measure
• Fractional step pruning strategy
• Experiments
• Conclusions
Problem statement
• Deep convolutional neural networks (DCNNs) are witnessing significant
commercial deployment due to their breakthrough classification performance in
many machine learning tasks
• Multimedia understanding
• Self-driving cars
• Edge computing
Image credits: V2Gov, [1]
[1] Chen, J., Ran, X.: Deep Learning With Edge Computing: A Review, Proc. of the IEEE, vol. 107, no. 8, (Aug. 2019)
Problem statement
• The deployment of DCNNs in resource-limited or real-time applications is still
challenging due to their high inference time and storage requirements
• DCNNs are highly overparametrized, and reducing their capacity may even
benefit their performance [2]
How can the size of DCNNs be reduced while retaining their generalization
performance?
[2] Arora, S., Ge, R., Neyshabur, B., Zhang, Y.: Stronger generalization bounds for deep nets via a compression approach, ICML, 2018
Related work
• DCNN compression and acceleration methods can be categorized into: a) pruning,
b) low-rank factorization, c) compact conv filters, d) knowledge distillation [3, 4]
• Filter pruning is receiving increasing attention because it: a) achieves high
compression rates with small performance degradation, b) is complementary to
the methods of the other 3 categories
• It consists of: a) a filter importance estimation criterion, usually
smaller-norm-less-important, b) a pruning strategy, usually an iterative one:
training, pruning, retraining, …
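As a point of reference, the smaller-norm-less-important criterion can be sketched as follows. This is a minimal illustration, not code from the presentation: the function name, the array layout (filters along the first axis) and the toy weights are assumptions.

```python
import numpy as np

def smallest_norm_filters(weights, pruning_rate):
    """Return the indices of the filters with the smallest l2-norm in one layer.

    weights: array of shape (c, ...) holding the c filters of a layer (assumed layout).
    pruning_rate: fraction of the c filters to select, in [0, 1].
    """
    flat = weights.reshape(weights.shape[0], -1)   # one row per filter
    norms = np.linalg.norm(flat, axis=1)           # l2-norm of each filter
    num_selected = int(pruning_rate * weights.shape[0])
    return np.sort(np.argsort(norms)[:num_selected])

# Hypothetical layer with 4 filters of 2 weights each (norms 3, 0.1, 2, 0.5).
toy = np.array([[3.0, 0.0], [0.1, 0.0], [0.0, 2.0], [0.3, 0.4]])
selected = smallest_norm_filters(toy, 0.5)
print(selected)
```

In an iterative strategy, a selection of this kind would be repeated after each training round.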
[3] K. Ota, M.S. Dao, V. Mezaris, F.G.B. De Natale: Deep Learning for Mobile Multimedia: A Survey, ACM Trans. Multimedia Computing
Communications & Applications (TOMM), vol. 13, no. 3s, June 2017
[4] Y. Cheng, D. Wang, P. Zhou and T. Zhang: Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and
Challenges, IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126-136, Jan. 2018
Related work
• In [5], it is shown that pruning filters with a small l2-norm may have a negative
impact on the network's performance
• FPGM is proposed, utilizing a Geometric Median (GM) based measure
• FPGM selects a fraction of the filters (usually 10%) using the l2-norm, and the rest
using the GM-based measure
• An iterative strategy is used (training, pruning, retraining, …) where all filters
corresponding to the target pruning rate are pruned at each iteration
[5] Y. He, P. Liu, Z. Wang, Z. Hu and Y. Yang: Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration, CVPR, 2019
Related work
• In [6], it is shown that the iterative pruning strategy, where all selected filters are
set to zero from the first iteration, may lead to unrecoverable information loss
• Asymptotic pruning strategy: an iterative strategy in which the number of selected
filters at each iteration converges asymptotically to the target pruning rate
• The l2-norm measure is used to select the filters at each iteration
[6] Y. He, X. Dong, G. Kang, Y. Fu, C. Yan and Y. Yang: Asymptotic Soft Filter Pruning for Deep Convolutional Neural Networks, IEEE Trans. on
Cybernetics, pp. 1-11, Aug. 2019
Overview of proposed method
• Motivated by limitations in recent works [5, 6] and related research findings in
shallow learning [7, 8, 9], we extend [6] by:
• Replacing the l2-norm-based criterion with: a) a Class-Separability (CS) based
criterion that exploits the labelling information in annotated training datasets [7, 8, 9],
b) a GM-based criterion [5]
• Applying a fractional step pruning strategy: not only the number of selected filters
but also their weights vary asymptotically towards their target value
[7] N. Gkalelis, V. Mezaris, I. Kompatsiaris and T. Stathaki: Mixture Subclass Discriminant Analysis Link to Restricted Gaussian Model and Other
Generalizations, IEEE Trans. Neural Networks and Learning Systems, vol. 24, no. 1, pp. 8-21, Jan. 2013
[8] R. Lotlikar and R. Kothari: Fractional-step dimensionality reduction, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp.
623-627, June 2000
[9] K. Fukunaga, Introduction to statistical pattern recognition (2nd ed.), Academic Press Professional, Inc., San Diego, CA, USA, 1990
Importance measure 1: CS-based
• Suppose an annotated training dataset of n observations and m classes
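• Let $X_k^{(i,j)}$ be the feature map of the k-th observation at the j-th filter of the i-th layer
• The feature maps are vectorized and stacked to form the data matrix $\mathbf{X}^{(i,j)}$ for filter (i,j):
$$\mathbf{X}^{(i,j)} = [\mathbf{x}_1^{(i,j)}, \dots, \mathbf{x}_n^{(i,j)}], \quad \mathbf{x}_k^{(i,j)} = \mathrm{vec}(X_k^{(i,j)})$$
• A filter discriminant score is then computed using
$$\eta^{(i,j)} = \mathrm{tr}(\mathbf{S}^{(i,j)})$$
$$\mathbf{S}^{(i,j)} = \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} (\boldsymbol{\mu}_p^{(i,j)} - \boldsymbol{\mu}_q^{(i,j)}) (\boldsymbol{\mu}_p^{(i,j)} - \boldsymbol{\mu}_q^{(i,j)})^T$$
• $\mathbf{S}^{(i,j)}$ is the between-class scatter matrix for filter (i,j) (it can be computed
efficiently; see paper for details)
• $\boldsymbol{\mu}_p^{(i,j)}$ is the mean vector of class p (class labels are used to
compute the means)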
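The CS score can be illustrated with a short NumPy sketch. It follows the pairwise form of the scatter matrix given on this slide and uses the identity tr(aaᵀ) = ||a||², so the scatter matrix itself is never built. This is not the efficient implementation referred to in the paper; the function name, the input layout and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def cs_score(features, labels):
    """tr(S) for one filter: sum over class pairs of squared distances between class means.

    features: (n, d) array; row k is the vectorized feature map of observation k.
    labels:   (n,) array of integer class labels.
    """
    classes = np.unique(labels)
    # One mean vector per class, computed with the help of the class labels.
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    score = 0.0
    for p in range(len(classes) - 1):
        diffs = means[p + 1:] - means[p]      # mu_q - mu_p for all q > p
        score += float(np.sum(diffs ** 2))    # tr(a a^T) = ||a||^2
    return score

# Synthetic two-class example (hypothetical data).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
noise = rng.normal(size=(100, 16))
separated = noise + 3.0 * labels[:, None]     # class means far apart
unseparated = noise.copy()                    # no class-dependent shift
print(cs_score(separated, labels), cs_score(unseparated, labels))
```

The first score should come out much larger than the second, matching the intuition that a filter whose features separate the classes gets a high CS value.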
Importance measure 1: CS-based
• tr(S(i,j)) quantifies the distance among class distributions using the features
produced by the corresponding filter [7,8,9]
• A large value indicates that the filter extracts discriminant features for
separating the classes
• In contrast, filters that extract noise or features irrelevant to the
classification task attain very small CS values and can be discarded safely
[Figure: feature distributions X(i,j) and weight vectors v(i,j) of two filters, with class means μ1, μ2 and ||v(i,2)|| > ||v(i,1)||. For filter (i,1), tr(S(i,1)) is large. For filter (i,2), tr(S(i,2)) is very small; despite a possibly large l2-norm, filter (i,2) can be safely discarded.]
Importance measure 2: GM-based
• For large pruning rates, the CS-based criterion may eliminate filters that extract
features with small but still important discriminant information
• The GM-based measure identifies the most replaceable filters in a layer [5]
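$$\eta^{(i,j)} = \sum_{o=1}^{c_i} \| \mathbf{v}^{(i,j)} - \mathbf{v}^{(i,o)} \|$$
where $\mathbf{v}^{(i,j)}$ is the weight vector of filter (i,j) and $c_i$ is the number of filters in layer i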
• Combined selection strategy: select a fraction of filters using the CS-based
measure and another fraction using the GM-based one
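The GM-based measure can be sketched in NumPy as the sum of Euclidean distances from each filter to all filters of the same layer. Following FPGM [5], the filters with the smallest sum are taken here to be the most replaceable ones. The function name, the array layout and the toy layer are assumptions made for illustration.

```python
import numpy as np

def gm_scores(weights):
    """Sum of Euclidean distances from each filter to every filter in the layer.

    weights: array of shape (c, ...) holding the c filters of one layer (assumed layout).
    """
    v = weights.reshape(weights.shape[0], -1)   # one weight vector per filter
    diffs = v[:, None, :] - v[None, :, :]       # all pairwise differences
    dists = np.linalg.norm(diffs, axis=2)       # (c, c) distance matrix
    return dists.sum(axis=1)

# Hypothetical layer with 3 filters lying on a line at 0, 1 and 10.
layer = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
scores = gm_scores(layer)
print(scores)                   # summed distances per filter
print(int(np.argmin(scores)))   # index of the most replaceable filter
```

The pairwise difference tensor needs memory quadratic in the number of filters, which is acceptable for a sketch but may call for a chunked computation on wide layers.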
Fractional step pruning strategy
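• Let ε and θ be the total number of epochs and the target pruning rate
• The pruning rate θ_ι and scaling factor ζ_ι at epoch ι are computed as:
$$\theta_\iota = \alpha e^{-\beta \iota} + \gamma$$
$$\zeta_\iota = 1 - \frac{\theta_\iota}{\theta}$$
• The parameters α, β, γ are estimated using 3 known points, similarly to [6]
• The individual pruning rates for the CS-based and GM-based criteria are, respectively,
$$\theta_\iota^{CS} = \min(\theta_\iota, \theta_f)$$
$$\theta_\iota^{GM} = \theta_\iota - \theta_\iota^{CS}$$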
• 𝜃𝑓 is the final pruning rate associated with the CS measure (e.g. 10%)
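The schedule can be illustrated with a small Python sketch. The slides do not list the three points used to fit α, β, γ, so the points below, (0, 0), (25, 0.3) and (200, 0.4), are purely hypothetical, as are the function names. With the first point at the origin, γ = -α, and β is found by bisection.

```python
import math

def fit_schedule(total_epochs, target_rate, mid_epoch, mid_rate):
    """Fit theta(i) = alpha*exp(-beta*i) + gamma through the assumed points
    (0, 0), (mid_epoch, mid_rate) and (total_epochs, target_rate)."""
    ratio = mid_rate / target_rate
    if not (mid_epoch / total_epochs < ratio < 1.0):
        raise ValueError("middle point must lie above the straight line and below the target")
    # With gamma = -alpha, theta(mid)/theta(end) depends on beta only and grows with beta.
    g = lambda b: math.expm1(-b * mid_epoch) / math.expm1(-b * total_epochs) - ratio
    lo, hi = 1e-9, 50.0 / mid_epoch
    for _ in range(200):                    # plain bisection on beta
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    alpha = target_rate / math.expm1(-beta * total_epochs)
    return alpha, beta, -alpha

def rates_at(epoch, params, target_rate, cs_final_rate):
    """Return (theta_i, zeta_i, CS share, GM share) at the given epoch."""
    alpha, beta, gamma = params
    theta_i = alpha * math.exp(-beta * epoch) + gamma
    zeta_i = 1.0 - theta_i / target_rate
    theta_cs = min(theta_i, cs_final_rate)  # CS share, capped at theta_f
    theta_gm = theta_i - theta_cs           # remainder goes to the GM criterion
    return theta_i, zeta_i, theta_cs, theta_gm

params = fit_schedule(total_epochs=200, target_rate=0.4, mid_epoch=25, mid_rate=0.3)
for epoch in (0, 25, 200):
    print(epoch, rates_at(epoch, params, target_rate=0.4, cs_final_rate=0.1))
```

At epoch 0 the scaling factor is 1 and nothing is selected; at the last epoch the pruning rate reaches θ and the scaling factor reaches 0. Presumably the weights of the selected filters are multiplied by ζ_ι at each epoch, so that they shrink gradually instead of being zeroed at once; that reading is an interpretation of the slide, not a statement from it.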
Experiments
• CIFAR10 [10]: 10 classes, 32 x 32 color images, 50000 training and 10000 testing
observations
• ImageNet32 [11]: ILSVRC-2012 with images resized to 32 x 32; 1000 classes,
1281167 training and 50000 testing observations
• GSC (ver. 0.01) [12]: 12 classes, speech utterances, 51094 training, 6798
validation and 6835 testing observations
• Comparison with MIL [13], PFEC [14], CP [15], SFP [16], FPGM [5], ASFP [6]
[10] Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep. (2009)
[11] P. Chrabaszcz, I. Loshchilov, and F. Hutter: A downsampled variant of ImageNet as an alternative to the CIFAR datasets, CoRR, vol.
abs/1707.08819, 2017
[12] P. Warden, Speech commands: A dataset for limited-vocabulary speech recognition, CoRR, vol. abs/1804.03209, 2018
[13] X. Dong et al., More is less: A more complicated network with less inference complexity, CVPR, Honolulu, HI, USA, July 2017
[14] H. Li et al.: Pruning filters for efficient convnets, ICLR Toulon, France, Apr. 2017
[15] Y. He, X. Zhang, and J. Sun: Channel pruning for accelerating very deep neural networks, ICCV, Venice, Italy, Oct. 2017
[16] Y. He et al., “Soft filter pruning for accelerating deep convolutional neural networks,” IJCAI, Stockholm, Sweden, July 2018
Experiments
• Experimental setup for CIFAR10 and ImageNet32: same as in FPGM [5], ASFP [6]
• Images are normalized to zero mean and unit variance, data augmentation is
applied (cropping, mirroring, flipping, etc.)
• ResNet, CE loss, Minibatch SGD, Nesterov momentum 0.9, batch size 128,
weight decay 0.0005, ε = 200
• Initial learning rate is 0.01, divided by 5 at epochs 60, 120, 160 for CIFAR10, and
by 10 every 10 epochs for ImageNet32
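The two learning rate schedules can be written as small step functions. This is a sketch under stated assumptions: epochs are counted from 0, each drop takes effect at the start of the listed epoch, and the function names are hypothetical.

```python
def cifar10_lr(epoch, base_lr=0.01, milestones=(60, 120, 160), factor=5.0):
    """Learning rate divided by `factor` at each milestone epoch (CIFAR10 setup)."""
    drops = sum(epoch >= m for m in milestones)   # milestones already passed
    return base_lr / factor ** drops

def imagenet32_lr(epoch, base_lr=0.01, step=10, factor=10.0):
    """Learning rate divided by `factor` every `step` epochs (ImageNet32 setup)."""
    return base_lr / factor ** (epoch // step)

for epoch in (0, 59, 60, 120, 160):
    print(epoch, cifar10_lr(epoch))
```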
Experiments
• Experimental setup for GSC as in [17]
• Log mel-spectrograms (LMSs) are used to represent the speech commands,
deriving a 32 x 32 LMS for each recording: 16 kHz sampling rate, STFT with Hamming
window of size 1024, hop length 512, 32 mel filterbanks, etc.
• Augmentation: pitch shifting, mixing with background noise, etc.
• ResNet, CE loss, Minibatch SGD, Nesterov momentum 0.9, batch size 96, weight
decay 0.0005, ε = 70, initial learning rate is 0.01 and divided by 10 at epoch 50
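The 32 x 32 LMS size is consistent with a simple frame count, sketched below. It assumes recordings of one second (16000 samples at 16 kHz) and a centered STFT in which the signal is padded by half a window on each side; neither assumption is stated on the slide, and the function name is hypothetical.

```python
def stft_frames(num_samples, hop_length, n_fft=1024, center=True):
    """Number of STFT frames for a signal of `num_samples` samples."""
    if center:
        # Signal padded by n_fft // 2 on both sides before framing.
        return 1 + num_samples // hop_length
    return 1 + (num_samples - n_fft) // hop_length

frames = stft_frames(16000, hop_length=512)   # assumed 1 s clip at 16 kHz
print(frames, "frames x 32 mel bands")
```

Without centering, the same settings would give 30 frames, so the 32 time steps suggest centered framing or an equivalent padding step.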
[17] J. Salamon and J. P. Bello: Deep convolutional neural networks and data augmentation for environmental sound classification, IEEE Signal
Process. Lett., vol. 24, no. 3, pp. 279–283, Mar. 2017.
Experiments
• Correct classification rates (CCRs) in ImageNet32 and GSC with ResNet56 and pruning rates θ = 20%, 50%
• Evaluation of SFP, FPGM, FSDP (based on performance results in CIFAR10)
θ = 20%       No pruning   SFP [16]   FPGM [5]   FSDP (θf = 10%)
ImageNet32    40.79%       29.92%     37.23%     38.3%
GSC           97.47%       94.57%     95.64%     96.22%
θ = 50%       FPGM [5]     FSDP (θf = 10%)
ImageNet32    32.32%       33.23%
GSC           92.89%       94.66%
• FSDP outperforms both SFP and FPGM
• In the challenging ImageNet32 dataset the performance drop of SFP is quite
high; this is attributed to the l2-norm based criterion, where a fraction of the
selected filters still carry significant discriminant information
Experiments
• Visualization of FSDP (𝜃𝑓 = 20%) while training a ResNet20 in CIFAR10, with θ = 20%
• Illustration of CS measure scores for each filter at epochs 10, 40, 200 (figures from left to right)
• Filters closer to the input seem to attain high discriminant scores (especially in the initial epochs)
• Surviving filters of the 2nd conv layer in residual blocks (e.g., 11, 13, 15, 17) accumulate a quite high
discriminant power as the training proceeds
• After a certain number of epochs, the surviving filters in the last conv layer attain a high
discriminant power
Summary and next steps
• A new filter pruning approach was presented, exploiting a class-separability-based measure for
estimating the importance of the filters and a fractional step strategy to prune them
asymptotically
• The proposed approach was evaluated successfully on three popular datasets (CIFAR-10,
ImageNet32, GSC) for image and speech classification tasks
• As future work, we plan to investigate the use of variable pruning rates utilizing the
discriminant scores at layer level, similarly to the globally-comparing criteria in [14, 18]
[18] P. Molchanov et al.: Pruning convolutional neural networks for resource efficient inference, ICLR, Toulon, France, Apr. 2017
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/fractional_step_discriminant_pruning_dcnn
This work was supported by the EU's Horizon 2020 research and innovation
programme under grant agreement H2020-780656 ReTV