
Deep learning on mobile

A practical talk by Anirudh Koul on how to run deep neural networks on memory- and energy-constrained devices like smartphones. Highlights frameworks and best practices.



  1. 1. Deep Learning on mobile phones - A Practitioner's guide Anirudh Koul
  2. 2. Deep Learning on mobile phones - A Practitioner's guide Anirudh Koul
  3. 3. Anirudh Koul, @anirudhkoul, http://koul.ai Head of AI & Research, Aira [lastname]@aira.io Founder, Seeing AI Previously at Microsoft
  4. 4. Why Deep Learning On Mobile? Latency Privacy
  5. 5. Response Time Limits – Powers of 10 0.1 second : Reacting instantly 1.0 seconds : User’s flow of thought 10 seconds : Keeping the user’s attention [Miller 1968; Card et al. 1991; Jakob Nielsen 1993]:
  6. 6. Mobile Deep Learning Recipe Mobile Inference Engine + Pretrained Model = DL App (Efficient) (Efficient)
  7. 7. Building a DL App in _ time
  8. 8. Building a DL App in 1 hour
  9. 9. Use Cloud APIs for General Recognition Needs • Microsoft Cognitive Services • Clarifai • Google Cloud Vision • IBM Watson Services • Amazon Rekognition
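For reference, a minimal sketch of the cloud-API route using one of the listed services (Amazon Rekognition via boto3). This is an illustrative example, not part of the original deck; the image path and label count are placeholders, and AWS credentials/region are assumed to be configured separately.

```python
# Sketch: image tagging with a cloud API (Amazon Rekognition via boto3).
# Assumes AWS credentials and region are already configured; 'photo.jpg' is a placeholder.
import boto3

client = boto3.client("rekognition")

with open("photo.jpg", "rb") as f:
    image_bytes = f.read()

response = client.detect_labels(
    Image={"Bytes": image_bytes},  # send raw bytes instead of an S3 reference
    MaxLabels=10,
    MinConfidence=50.0,
)

for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
```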
  10. 10. How to Choose a Computer Vision Based API? Benchmark & Compare them COCO-Text v2.0 for Text reading in the wild • ~2k random images • Candidate text has at least 2 characters together • Direct word match COCO-Val 2017 for Image Tagging in the wild • ~4k random images • Tag similarity match instead of word match
  11. 11. Pricing
  12. 12. Recognize Text Benchmarks Text API Accuracy Amazon Rekognition 45.4% Google Cloud Vision 33.4% Microsoft Cognitive Services 55.4% Evaluation criteria: • Photos have candidate words with length >= 2 • Direct word match with ground truth
  13. 13. Image Tagging Benchmarks Evaluation criteria: • Concept similarity match instead of word match • E.g. ‘military-officer’ tag matched with ground truth tag ‘person’ API Accuracy Amazon Rekognition 65% Google Cloud Vision 47.6% Microsoft Cognitive Services 50.0%
  14. 14. Image Tagging Benchmarks Evaluation criteria: • Concept similarity match instead of word match • E.g. ‘military-officer’ tag matched with ground truth tag ‘person’ API Accuracy Avg #Tags Amazon Rekognition 65% 14 Google Cloud Vision 47.6% 14 Microsoft Cognitive Services 50.0% 8
  15. 15. Image Tagging Benchmarks Hard to do precision-recall since COCO ground truth tags are not exhaustive. A lower number of tags for a given accuracy indicates a higher F-measure. API Accuracy Avg #Tags Amazon Rekognition 65% 14 Google Cloud Vision 47.6% 14 Microsoft Cognitive Services 50.0% 8
  16. 16. Tips for reducing network latency • For text recognition • A compression setting of up to 90% has little effect on accuracy, but gives drastic savings in size • Resizing is dangerous; text recognition needs a minimum size for recognition • For image recognition • Resize so that min(height, width) = 224, at 50% compression with bilinear interpolation
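The resizing and compression advice above can be implemented in a few lines of Pillow. A minimal sketch, assuming the settings on the slide (min side 224, bilinear interpolation, ~50% JPEG quality); file names are placeholders.

```python
# Sketch: shrink an image before sending it to an image-recognition API.
# Resize so min(height, width) == 224 (bilinear), then save at ~50% JPEG quality.
from PIL import Image

def prepare_for_upload(src_path, dst_path, min_side=224, quality=50):
    img = Image.open(src_path).convert("RGB")
    w, h = img.size
    scale = min_side / min(w, h)
    if scale < 1.0:  # only shrink, never upscale
        img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    img.save(dst_path, format="JPEG", quality=quality)

prepare_for_upload("photo.jpg", "photo_small.jpg")
```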
  17. 17. Building a DL App in 1 day
  18. 18. http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/ Energy to train Convolutional Neural Network Energy to use Convolutional Neural Network
  19. 19. Base Pretrained Model ImageNet – 1000 Object Categorizer VGG16 Inception-v3 Resnet-50 MobileNet SqueezeNet
  20. 20. Running pre-trained models on mobile Core ML TensorFlow Lite Caffe2
  21. 21. Apple’s Ecosystem Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  22. 22. Apple’s Ecosystem Metal - low-level, low-overhead hardware-accelerated 3D graphic and compute shader application programming interface (API) - Available since iOS 8 Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  23. 23. Apple’s Ecosystem Fast low-level primitives: • BNNS – Basic Neural Network Subroutines • Ideal case: fully connected NN • MPS – Metal Performance Shaders • Ideal case: convolutions Inconvenient for large networks: • Inception-v3 inference required a ~1.5K-line hard-coded model definition • Libraries like Forge by Matthijs Hollemans provide abstraction Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  24. 24. Apple’s Ecosystem Convert a Caffe/TensorFlow model to a CoreML model in 3 lines: import coremltools coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel') coreml_model.save('my_model.mlmodel') Add model to iOS project and call for prediction. Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM Automatically minimizes memory footprint and power consumption Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  25. 25. Apple’s Ecosystem • Model quantization support down to 1 bit • Batch API for improved performance • Conversion support for MXNet, ONNX • ONNX opens models from PyTorch, Cognitive Toolkit, Caffe2, Chainer • Create ML for quick training • tf-coreml for direct conversion from TensorFlow Metal BNNS +MPS CoreML CoreML2 2014 2016 2017 2018
  26. 26. CoreML Benchmark - Pick a DNN for your mobile architecture. Model / Top-1 Accuracy / Size (MB) / Execution time (ms) on iPhone 5S, 6, 6S/SE, 7, 8/X: VGG 16: 71 / 553 / 7408, 4556, 235, 181, 146 • Inception v3: 78 / 95 / 727, 637, 114, 90, 78 • Resnet 50: 75 / 103 / 538, 557, 77, 74, 71 • MobileNet: 71 / 17 / 129, 109, 44, 35, 33 • SqueezeNet: 57 / 5 / 75, 78, 36, 30, 29. (Phones span 2013-2017; huge improvement in GPU hardware in 2015.)
  27. 27. Putting out more frames than an art gallery
  28. 28. TensorFlow Ecosystem TensorFlow TensorFlow Mobile TensorFlow Lite 2015 2016 2018
  29. 29. TensorFlow Ecosystem The full, bulky deal TensorFlow TensorFlow Mobile TensorFlow Lite 2015 2016 2018
  30. 30. TensorFlow Ecosystem TensorFlow TensorFlow Mobile TensorFlow Lite 2015 2016 2018 Easy pipeline to bring Tensorflow models to mobile Excellent documentation Optimizations to bring model to mobile
  31. 31. TensorFlow Ecosystem • Smaller • Faster • Minimal dependencies • Easier to package & deploy • Allows running custom operators 1-line conversion from Keras to TensorFlow Lite • tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite TensorFlow TensorFlow Mobile TensorFlow Lite 2015 2016 2018
  32. 32. TensorFlow Lite is small • ~75 KB for core interpreter • ~400 KB for core interpreter + supported operations • Compared to 1.5 MB for TensorFlow Mobile
  33. 33. TensorFlow Lite is fast • Takes advantage of on-device hardware acceleration • Uses FlatBuffers • Reduces code footprint, memory usage • Reduces CPU cycles on serialization and deserialization • Improves startup time • Pre-fused activations • Combining batch normalization layer with previous Convolution • Interpreter uses static memory and static execution plan • Decreases load time
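For context on what running a converted model looks like from the Python side, here is a sketch using the TensorFlow Lite interpreter. This is not from the deck; it assumes a converted foo.tflite file and the tf.lite.Interpreter API (TF 1.13+/2.x naming).

```python
# Sketch: run inference with a converted TensorFlow Lite model (Python API).
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="foo.tflite")
interpreter.allocate_tensors()  # static memory plan is prepared here

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape and dtype.
dummy = np.random.random_sample(input_details[0]["shape"]).astype(
    input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
predictions = interpreter.get_tensor(output_details[0]["index"])
print(predictions.shape)
```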
  34. 34. TensorFlow Lite Architecture
  35. 35. TensorFlow Lite Benchmarks - http://alpha.lab.numericcal.com/
  36. 36. TensorFlow Lite Benchmarks - http://ai-benchmark.com/ • Crowdsourced benchmarking with the AI Benchmark Android app • By Andrey Ignatov from ETH • 9 tests • E.g. semantic segmentation, image super-resolution, face recognition
  37. 37. Caffe2 From Facebook Under 1 MB of binary size Built for speed: For ARM CPU: uses NEON kernels, NNPACK For iPhone GPU: uses Metal Performance Shaders and Metal For Android GPU: uses Qualcomm Snapdragon NPE (4-5x speedup) ONNX format support to import models from CNTK/PyTorch
  38. 38. Caffe2
  39. 39. Recommendation for development 1. Train a model using Keras 2. For iOS: • Convert to CoreML using coremltools 3. For Android: • Convert to Tensorflow Lite using tflite_convert Keras .mlmodel file .tflite file coremltools tflite_convert
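A minimal sketch of this recommended pipeline in Python, assuming a saved Keras HDF5 model ('model.h5' is a placeholder) and the converter APIs of the coremltools 3.x / TF 1.x era (in TF 2.x the TFLite entry point is from_keras_model instead).

```python
# Sketch of the recommended pipeline: one Keras model, two mobile formats.
import coremltools
import tensorflow as tf

# iOS: Keras -> Core ML
coreml_model = coremltools.converters.keras.convert("model.h5")
coreml_model.save("model.mlmodel")

# Android: Keras -> TensorFlow Lite
converter = tf.lite.TFLiteConverter.from_keras_model_file("model.h5")
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```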
  40. 40. Common Questions “My app has become too big to download. What do I do?” • iOS doesn’t allow apps over 150 MB to be downloaded over cellular • Solution: Download the model on demand and compile it on device • 0 MB change to app size on first install
  41. 41. Common Questions “Do I need to ship a new app update with every model improvement?” • Making app updates is a decent amount of overhead, plus ~2 days of wait time • Solution: Check for model updates, download and compile on device • Easier solution – use a framework for model management, e.g. • Google ML Kit • Fritz • Numericcal
  42. 42. Common Questions “Why does my app not recognize objects at the top/bottom of the screen?” • Solution: Check the cropping used; by default, it’s a center crop
  43. 43. Building a DL App in 1 week
  44. 44. Learning to play an accordion: 3 months
  45. 45. Learning to play an accordion: 3 months. Already knows piano? Fine-tune skills: 1 week
  46. 46. I got a dataset. Now what? Step 1: Find a pre-trained model Step 2: Fine tune the pre-trained model Step 3: Run using existing frameworks “Don’t Be A Hero” - Andrej Karpathy
  47. 47. How to find pretrained models for my task? Search “Model Zoo” https://modelzoo.co - 300+ models
  48. 48. AlexNet, 2012 (simplified) [Krizhevsky, Sutskever, Hinton ’12]. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 2011. n-dimensional feature representation
  49. 49. Deciding how to fine tune. Size of new dataset / Similarity to original dataset → What to do: Large + High → Fine tune. Small + High → Don’t fine tune, it will overfit; train a linear classifier on CNN features. Small + Low → Train a classifier from activations in lower layers (higher layers are specific to the original dataset). Large + Low → Train the CNN from scratch. http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  50. 50. Deciding when to fine tune. Size of new dataset / Similarity to original dataset → What to do: Large + High → Fine tune. Small + High → Don’t fine tune, it will overfit; train a linear classifier on CNN features. Small + Low → Train a classifier from activations in lower layers (higher layers are specific to the original dataset). Large + Low → Train the CNN from scratch. http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  51. 51. Deciding when to fine tune. Size of new dataset / Similarity to original dataset → What to do: Large + High → Fine tune. Small + High → Don’t fine tune, it will overfit; train a linear classifier on CNN features. Small + Low → Train a classifier from activations in lower layers (higher layers are specific to the original dataset). Large + Low → Train the CNN from scratch. http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
  52. 52. Deciding when to fine tune. Size of new dataset / Similarity to original dataset → What to do: Large + High → Fine tune. Small + High → Don’t fine tune, it will overfit; train a linear classifier on CNN features. Small + Low → Train a classifier from activations in lower layers (higher layers are specific to the original dataset). Large + Low → Train the CNN from scratch. http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
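A minimal Keras sketch of the "don't be a hero" recipe for the small-dataset / high-similarity case: load a pretrained MobileNet, freeze its convolutional base, and train only a new classifier head. NUM_CLASSES and the data pipeline are placeholders, not from the deck.

```python
# Sketch: transfer learning with a pretrained MobileNet in Keras.
import tensorflow as tf

NUM_CLASSES = 5  # placeholder

base = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_data, epochs=5)  # supply your own data pipeline
```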
  53. 53. Could you train your own classifier ... without coding? • Microsoft CustomVision.ai • Unique: under-a-minute training, custom object detection • Google AutoML • Unique: full CNN training, crowdsourced workers • IBM Watson Visual Recognition • Baidu EZDL • Unique: custom sound recognition
  54. 54. Custom Vision Service (customvision.ai) – Drag and drop training Tip: Upload 30 photos per class to make a prototype model; upload 200 photos per class for a more robust production model. The more distinct the shape/type of the object, the fewer images required.
  55. 55. Custom Vision Service (customvision.ai) – Drag and drop training Tip: Use the Fatkun browser extension to download images from a search engine, or use the Bing Image Search API to programmatically download photos with proper rights (see the sketch below).
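A hedged sketch of the programmatic route via the Bing Image Search REST API (v7, as it existed at the time of the talk). The subscription key, query, and license filter are placeholders; check the current endpoint and terms before relying on this.

```python
# Sketch: download training images via the Bing Image Search API (v7).
# BING_KEY and the query are placeholders.
import requests

BING_KEY = "YOUR_SUBSCRIPTION_KEY"
ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

resp = requests.get(
    ENDPOINT,
    headers={"Ocp-Apim-Subscription-Key": BING_KEY},
    params={"q": "golden retriever", "count": 50, "license": "Public"},
)
resp.raise_for_status()

for i, item in enumerate(resp.json().get("value", [])):
    img = requests.get(item["contentUrl"], timeout=10)
    if img.ok:
        with open(f"image_{i}.jpg", "wb") as f:
            f.write(img.content)
```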
  56. 56. CoreML exporter from customvision.ai – Drag and drop training A 5-minute shortcut to training, fine-tuning and getting a model ready in CoreML format Drag and drop interface
  57. 57. Building a Crowdsourced Data Collector in 1 month
  58. 58. Barcode recognition from Seeing AI Live Guide user in finding a barcode with audio cues With Server Decode barcode to identify product Tech MPSCNN running on mobile GPU + barcode library Metrics 40 FPS (~25 ms) on iPhone 7 Aim : Help blind users identify products using barcode Issue : Blind users don’t know where the barcode is
  59. 59. Currency recognition from Seeing AI Aim : Identify currency Live Identify denomination of paper currency instantly With Server - Tech Task specific CNN running on mobile GPU Metrics 40 FPS (~25 ms) on iPhone 7
  60. 60. Training Data Collection App Request volunteers to take photos of objects in non-obvious settings Sends photos to cloud, trains model nightly Newsletter shows the best photos from volunteers Let them compete for fame
  61. 61. Daily challenge - Collected by volunteers
  62. 62. Daily challenge - Collected by volunteers
  63. 63. Building a production DL App in 3 months
  64. 64. What you want: $200,000 (https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016). What you can afford: $2,000.
  65. 65. Revolution of Depth. AlexNet, 8 layers (ILSVRC 2012). [Layer-by-layer architecture diagram: 11x11 conv, 96, /4, pool/2; 5x5 conv, 256, pool/2; 3x3 conv, 384; 3x3 conv, 384; 3x3 conv, 256, pool/2; fc, 4096; fc, 4096; fc, 1000] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015
  66. 66. Revolution of Depth. AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014). [Layer-by-layer architecture diagrams] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015
  67. 67. Revolution of Depth. AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015) - ultra deep. [Layer-by-layer architecture diagrams] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015
  68. 68. Revolution of Depth. ResNet, 152 layers. [First portion of the layer-by-layer architecture diagram] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015
  69. 69. Revolution of Depth vs Classification Accuracy. ImageNet classification top-5 error (%): ILSVRC'10 (shallow): 28.2 • ILSVRC'11 (shallow): 25.8 • ILSVRC'12 AlexNet (8 layers): 16.4 • ILSVRC'13: 11.7 • ILSVRC'14 VGG (19 layers): 7.3 • ILSVRC'14 GoogleNet (22 layers): 6.7 • ILSVRC'15 ResNet (152 layers): 3.6 • ILSVRC'16 Ensemble: 2.9 (ensemble of ResNet, Inception-ResNet, Inception and Wide Residual Network). Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015
  70. 70. Accuracy vs Operations Per Image Inference. Marker size is proportional to the number of parameters (e.g. VGG-16: 552 MB, AlexNet: 240 MB); what we want is high accuracy at a low operation count. Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016
  71. 71. Your Budget - Smartphone Floating Point Operations Per Second (2015) http://pages.experts-exchange.com/processing-power-compared/
  72. 72. iPhone X is more powerful than a Macbook Pro https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/
  73. 73. Strategies to get maximum efficiency from your CNN Before training • Pick an efficient architecture for your task • Designing efficient layers After training • Pruning • Quantization • Network binarization
  74. 74. CoreML Benchmark - Pick a DNN for your mobile architecture. Model / Top-1 Accuracy / Size (MB) / Million Multi-Adds / Execution time (ms) on iPhone 5S, 6, 6S/SE, 7, 8/X: VGG 16: 71 / 553 / 15300 / 7408, 4556, 235, 181, 146 • Inception v3: 78 / 95 / 5000 / 727, 637, 114, 90, 78 • Resnet 50: 75 / 103 / 3900 / 538, 557, 77, 74, 71 • MobileNet: 71 / 17 / 569 / 129, 109, 44, 35, 33 • SqueezeNet: 57 / 5 / 800 / 75, 78, 36, 30, 29. (Phones span 2013-2017; huge improvement in GPU hardware in 2015.)
  75. 75. MobileNet family Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise conv Tune with two parameters – Width Multiplier and resolution multiplier Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
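In Keras, these two knobs map onto the alpha argument (width multiplier) and the input resolution. A sketch, noting that pretrained ImageNet weights exist only for a limited set of alpha/resolution combinations:

```python
# Sketch: trading accuracy for speed with MobileNet's two multipliers in Keras.
import tensorflow as tf

# Full-size baseline: alpha=1.0, 224x224 input.
full = tf.keras.applications.MobileNet(
    weights="imagenet", alpha=1.0, input_shape=(224, 224, 3))

# Slimmer and lower resolution: substantially fewer parameters and multiply-adds.
slim = tf.keras.applications.MobileNet(
    weights="imagenet", alpha=0.5, input_shape=(128, 128, 3))

print(full.count_params(), slim.count_params())
```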
  76. 76. Efficient Classification Architectures https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html MobileNetV2 is the current favourite
  77. 77. Efficient Detection Architectures Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
  78. 78. Efficient Detection Architectures Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
  79. 79. Efficient Segmentation Architectures ICNet - Image cascade network
  80. 80. Tricks while designing your own network • Dilated convolutions • Great for segmentation / when the target object occupies a large area of the image • Replace NxN convolutions with Nx1 followed by 1xN • Depthwise separable convolutions (e.g. MobileNet) • Inverted residual block (e.g. MobileNetV2) • Replacing large filters with multiple small filters • 5x5 is slower than 3x3 followed by 3x3
  81. 81. Design consideration for custom architectures – Small Filters Three layers of 3x3 convolutions >> one layer of 7x7 convolution Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions Replace NxN convolutions with a stack of 1xN and Nx1 Fewer parameters, less compute, more non-linearity: Better, Faster, Stronger Andrej Karpathy, CS-231n Notes, Lecture 11
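To make these layer-design tricks concrete, here is a hedged Keras sketch contrasting a 5x5 convolution with a stack of 3x3s, an Nx1/1xN factorization, and a MobileNet-style depthwise separable block. The channel counts are arbitrary placeholders, not values from the deck.

```python
# Sketch: replacing large filters with stacks of small / factorized filters (Keras).
from tensorflow.keras import layers, Sequential

def conv5x5(filters):
    # Baseline: a single 5x5 convolution.
    return Sequential([layers.Conv2D(filters, 5, padding="same", activation="relu")])

def stacked3x3(filters):
    # Same receptive field as 5x5, fewer parameters, more non-linearity.
    return Sequential([
        layers.Conv2D(filters, 3, padding="same", activation="relu"),
        layers.Conv2D(filters, 3, padding="same", activation="relu"),
    ])

def factorized3x3(filters):
    # 3x3 split into 3x1 followed by 1x3.
    return Sequential([
        layers.Conv2D(filters, (3, 1), padding="same", activation="relu"),
        layers.Conv2D(filters, (1, 3), padding="same", activation="relu"),
    ])

def depthwise_separable(filters):
    # MobileNet-style: 3x3 depthwise conv followed by 1x1 pointwise conv.
    return Sequential([
        layers.DepthwiseConv2D(3, padding="same", activation="relu"),
        layers.Conv2D(filters, 1, padding="same", activation="relu"),
    ])
```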
  82. 82. Selective training to keep networks shallow Idea: Augment data limited to how your network will be used Example: If making a selfie app, there is no benefit in rotating training images beyond ±45 degrees; your phone will rotate the image anyway (approach followed by Word Lens / Google Translate) Example: Add blur if analyzing mobile phone frames
  83. 83. Pruning Aim : Remove all connections with absolute weights below a threshold Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
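A minimal numpy/Keras sketch of this idea: zero out weights whose absolute value falls below a percentile threshold. The sparsity level is a placeholder; real pruning pipelines (as in the cited paper) retrain after each pruning step using a mask to keep pruned weights at zero.

```python
# Sketch: magnitude-based weight pruning - zero out small-magnitude weights.
import numpy as np

def prune_weights(weights, sparsity=0.9):
    """Zero out the smallest `sparsity` fraction of weights by absolute value."""
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask  # keep the mask to freeze pruned weights during retraining

# Example with a Keras layer (model and layer index are placeholders):
# w, b = model.layers[-1].get_weights()
# w_pruned, _ = prune_weights(w, sparsity=0.9)
# model.layers[-1].set_weights([w_pruned, b])
```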
  84. 84. Observation: Most parameters are in the fully connected layers. AlexNet (240 MB): 96% of all parameters; VGG-16 (552 MB): 90% of all parameters
  85. 85. Pruning gets the quickest model compression without accuracy loss (AlexNet: 240 MB, VGG-16: 552 MB). The first layer, which directly interacts with the image, is sensitive and cannot be pruned much without hurting accuracy
  86. 86. Weight Sharing Idea : Cluster weights with similar values together, and store in a dictionary. Codebook Huffman coding HashedNets Cons: Need a special inference engine, doesn’t work for most applications
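A sketch of the codebook idea using scikit-learn's k-means: cluster a layer's weights into a small number of shared values and store only a per-weight index into that codebook. The cluster count is a placeholder; this illustrates the concept rather than any particular library's implementation.

```python
# Sketch: weight sharing via k-means clustering of a layer's weights.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(weights, n_clusters=16):
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(flat)
    codebook = km.cluster_centers_.flatten()   # shared values (the "dictionary")
    indices = km.labels_.astype(np.uint8)      # per-weight index into the codebook
    return codebook, indices

def reconstruct(codebook, indices, shape):
    # What a special inference engine would do on the fly.
    return codebook[indices].reshape(shape)
```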
  87. 87. Filter Pruning - ThiNet Idea : Discard whole filter if not important to predictions Advantage: • No change in architecture, other than thinning of filters per layer • Can be further compressed with other methods Just like feature selection, select filter to discard. Possible greedy methods: • Absolute weight sum of entire filter closest to 0 • Average percentage of ‘Zeros’ as outputs • ThiNet – Collect statistics on the output of the next layer
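A sketch of the simplest greedy criterion listed above (absolute weight sum per filter) for ranking which filters of a Keras Conv2D layer to discard; actually removing the filters and rebuilding the thinner layer is left out. The layer name is hypothetical.

```python
# Sketch: rank Conv2D filters by L1 norm of their weights (smallest = prune first).
import numpy as np

def rank_filters_by_l1(conv_weights):
    """conv_weights shape: (kh, kw, in_channels, out_channels), as in Keras."""
    l1_per_filter = np.abs(conv_weights).sum(axis=(0, 1, 2))
    return np.argsort(l1_per_filter)  # filter indices, weakest first

# kernel, bias = model.get_layer("conv_block_1").get_weights()  # hypothetical layer name
# weakest = rank_filters_by_l1(kernel)[:8]  # candidate filters to discard
```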
  88. 88. SqueezeNet - AlexNet-level accuracy in 0.5 MB SqueezeNet base 4.8 MB SqueezeNet compressed 0.5 MB 80.3% top-5 Accuracy on ImageNet 0.72 GFLOPS/image Fire Block Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
  89. 89. Quantization Reduce precision from 32 bits to 16 bits or fewer Use stochastic rounding for best results In practice: • Ristretto + Caffe • Automatic network quantization • Finds balance between compression rate and accuracy • Apple Metal Performance Shaders automatically quantize to 16 bits • TensorFlow has 8-bit quantization support • gemmlowp – low-precision matrix multiplication library
  90. 90. Quantizing CNNs in Practice Reducing CoreML models to half size # Load a model, lower its precision, and then save the smaller model. model_spec = coremltools.utils.load_spec('model.mlmodel') model_fp16_spec = coremltools.utils.convert_neural_network_spec_weights_to_fp16(model_spec) coremltools.utils.save_spec(model_fp16_spec, 'modelFP16.mlmodel')
  91. 91. Quantizing CNNs in Practice Reducing CoreML models to even smaller size Choose bits and quantization mode Bits from [1, 2, 4, 8] Quantization mode from ["linear", "linear_lut", "kmeans_lut", "custom_lut"] • lut = look-up table from coremltools.models.neural_network.quantization_utils import * quantized_model = quantize_weights(model, 8, 'linear') quantized_model.save('quantizedModel.mlmodel') compare_models(model, quantized_model, './sample_data/')
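The TensorFlow side mentioned on the quantization slide looks similar. A sketch of post-training weight quantization with the TFLite converter, using the TF 1.x-era Keras-file entry point; the model path is a placeholder.

```python
# Sketch: post-training weight quantization when converting to TensorFlow Lite.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model_file("keras_model.h5")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables weight quantization
tflite_quantized = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_quantized)
```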
  92. 92. Binary Weighted Networks Idea: Reduce the weights to -1, +1 Speedup: Convolution operation can be approximated by only summation and subtraction Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
  93. 93. Binary Weighted Networks Idea: Reduce the weights to -1, +1 Speedup: Convolution operation can be approximated by only summation and subtraction Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
  94. 94. Binary Weighted Networks Idea: Reduce the weights to -1, +1 Speedup: Convolution operation can be approximated by only summation and subtraction Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
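A small numpy sketch of the binary-weight approximation these networks use: each real-valued filter W is replaced by alpha * sign(W), where alpha is the mean absolute value of W. This is only an illustration of the idea, not the paper's full training procedure.

```python
# Sketch: binary weight approximation (as in binary-weight networks / XNOR-Net).
import numpy as np

def binarize(weights):
    alpha = np.mean(np.abs(weights))   # scaling factor
    binary = np.sign(weights)          # weights collapsed to -1 / +1
    binary[binary == 0] = 1            # treat exact zeros as +1
    return alpha, binary

# Reconstruction used at inference time: W ≈ alpha * binary
# alpha, b = binarize(conv_kernel)
# approx = alpha * b
```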
  95. 95. XNOR-Net Idea: Reduce both weights + inputs to -1, +1 Speedup: Convolution operation can be approximated by XNOR and bitcount operations Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
  96. 96. XNOR-Net Idea: Reduce both weights + inputs to -1, +1 Speedup: Convolution operation can be approximated by XNOR and bitcount operations Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
  97. 97. XNOR-Net Idea: Reduce both weights + inputs to -1, +1 Speedup: Convolution operation can be approximated by XNOR and bitcount operations Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
  98. 98. XNOR-Net on Mobile
  99. 99. Challenges Off-the-shelf CNNs are not robust on video Solutions: • Collective confidence over several frames • CortexNet
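A sketch of the "collective confidence over several frames" idea: keep a short rolling window of per-frame class probabilities and report the class with the highest average. The window size and the prediction call are placeholders.

```python
# Sketch: smoothing per-frame predictions over a rolling window for video.
from collections import deque
import numpy as np

class RollingPrediction:
    def __init__(self, window=10):
        self.window = deque(maxlen=window)

    def update(self, class_probs):
        """class_probs: 1-D array of per-class probabilities for the current frame."""
        self.window.append(np.asarray(class_probs))
        mean_probs = np.mean(self.window, axis=0)
        return int(np.argmax(mean_probs)), float(np.max(mean_probs))

# smoother = RollingPrediction(window=10)
# label, confidence = smoother.update(model.predict(frame[None])[0])  # hypothetical model
```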
  100. 100. Building a DL App and getting $10 million in funding (or a PhD)
  101. 101. Competitions to follow Winners = High accuracy + Low energy consumption * LPIRC - Low-Power Image Recognition Challenge * EDLDC - Embedded deep learning design contest * System Design Contest at Design Automation Conference (DAC)
  102. 102. AutoML – Let AI design an efficient AI architecture MnasNet: Platform-Aware Neural Architecture Search for Mobile • An automated neural architecture search approach for designing mobile models using reinforcement learning • Incorporates latency information into the reward objective function • Measures real-world inference latency by executing the model on a particular platform [Search loop diagram: Controller samples models from the search space → Trainer reports accuracy → mobile phones report measured latency → multi-objective reward fed back to the Controller]
  103. 103. AutoML – Let AI design an efficient AI architecture For the same accuracy: • 1.5x faster than MobileNetV2 • ResNet-50 accuracy with 19x fewer parameters • SSD300 mAP with 35x fewer FLOPs
  104. 104. Mr. Data Scientist PhD
  105. 105. One Last Question
  106. 106. How to access the slides in 1 second Link posted here -> @anirudhkoul
