An Analysis of Convolution for Inference
24 June 2016
Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™
Proprietary and confidential. Do not distribute.
Direct Convolution
• Compute with in-place slicing + GEMM
• Data layout considerations: C, H, W, N
• Minimize slicing logic
• Maximize contiguous access
• Leverage filter overlap
Small N direct convolution: Without Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad
Fig from V. Dumoulin,
https://github.com/vdumoulin/conv_arithmetic
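The two fprop formulas above can be exercised on a single row of pixels. The following is a minimal sketch, not the kernel described in the talk: the function names are hypothetical, and rounding the division for Q upward is an assumption (it agrees with the usual floor((W - S + 2*pad) / stride) + 1).

```python
def output_size(W, S, pad, stride):
    """Q = (W - S + 1 + 2*pad) / stride, rounded up (the rounding is an assumption)."""
    return -(-(W - S + 1 + 2 * pad) // stride)


def fprop_1d(x, f, pad=0, stride=1):
    """Direct 1D convolution (cross-correlation form) over one row."""
    W, S = len(x), len(f)
    Q = max(output_size(W, S, pad, stride), 0)
    out = [0] * Q
    for q in range(Q):
        for s in range(S):
            w = s + q * stride - pad   # wi = sk + qj * stride - pad
            if 0 <= w < W:             # taps that fall in the padding contribute zero
                out[q] += x[w] * f[s]
    return out


print(fprop_1d([1, 2, 3, 4, 5], [1, 0, -1]))                   # no padding, stride 1
print(fprop_1d([1, 2, 3, 4, 5], [1, 0, -1], pad=1, stride=2))  # padded, stride 2
```

The bounds check on w is the "slicing logic" that the direct convolution slide aims to minimize: it is the only branch in the inner loop.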
Small N direct convolution: With Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad
Small N direct convolution: Bprop for deconv
bprop
pad’ = S - pad - 1
wi = (qj - pad’ + sk) / stride
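One way to read the bprop formulas is as a gather: each position of the input gradient walks the filter in mirrored order and keeps only the taps where the division by stride is exact. The sketch below follows that reading and compares it with a scatter built from the fprop index formula. The function names, the mirrored filter access, and the loop roles (the loop position plays the part of qj in the slide's formula) are assumptions, not details taken from the slides.

```python
def bprop_gather_1d(dy, f, W, pad=0, stride=1):
    """Input gradient for one row, gathered per input position."""
    S, Q = len(f), len(dy)
    pad_b = S - pad - 1                 # pad' = S - pad - 1
    dx = [0] * W
    for w in range(W):
        for s in range(S):
            num = w - pad_b + s         # numerator of (qj - pad' + sk) / stride
            if num % stride:            # not divisible: no output used this tap
                continue
            q = num // stride
            if 0 <= q < Q:
                dx[w] += dy[q] * f[S - 1 - s]   # filter read in mirrored order
    return dx


def bprop_scatter_1d(dy, f, W, pad=0, stride=1):
    """Reference: replay the fprop index formula and scatter each contribution."""
    dx = [0] * W
    for q in range(len(dy)):
        for s in range(len(f)):
            w = s + q * stride - pad
            if 0 <= w < W:
                dx[w] += dy[q] * f[s]
    return dx


dy, f = [1, 2, 3, 4], [1, 2, 3]
print(bprop_gather_1d(dy, f, W=7, pad=1, stride=2))
print(bprop_scatter_1d(dy, f, W=7, pad=1, stride=2))
```

Substituting the mirrored tap S - 1 - s into the numerator gives w + pad - s', which is the fprop relation solved for q, so the two functions agree.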
Small N direct convolution: Dilated Filters
Dilated
S’ = (S-1) * rate + 1
Q = (W-S’+1 + 2*pad) / stride
wi = sk * rate + qj * stride - pad
Fig from F. Yu, V. Koltun
http://arxiv.org/abs/1511.07122v3
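The dilated formulas differ from plain fprop only in the effective filter size S' and the factor of rate on the tap index. A small sketch on one row, assuming the division for Q is rounded up and using a hypothetical function name:

```python
def dilated_fprop_1d(x, f, rate=1, pad=0, stride=1):
    """Direct 1D convolution with a dilated filter."""
    W, S = len(x), len(f)
    S_eff = (S - 1) * rate + 1                          # S' = (S-1) * rate + 1
    Q = max(-(-(W - S_eff + 1 + 2 * pad) // stride), 0)  # rounded up (assumption)
    out = [0] * Q
    for q in range(Q):
        for s in range(S):
            w = s * rate + q * stride - pad             # wi = sk * rate + qj * stride - pad
            if 0 <= w < W:
                out[q] += x[w] * f[s]
    return out


x = [1, 2, 3, 4, 5, 6, 7]
print(dilated_fprop_1d(x, [1, 1, 1], rate=2))
# Same result as an undilated filter with zeros inserted between taps:
print(dilated_fprop_1d(x, [1, 0, 1, 0, 1], rate=1))
```

The second call shows why dilation is cheap in a direct kernel: the zero taps are never multiplied, only the index arithmetic changes.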
Convolution with Algorithmic Speedups
• FFT and Winograd have same basic computational flow
• FFT tiles typically need to be much bigger
• Winograd history: Toom and Cook, then Lavin
Winograd: input transform
Input Feature Map
4x4 stride 2
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
Winograd: batched GEMM
• Point-wise Multiplication
• Posed as batched GEMM operation
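The point-wise product has to be summed over input channels, which is what turns it into a matrix multiply: each position of the transformed tile gets its own small GEMM between filters and tiles. The sketch below illustrates that equivalence on nested lists. The names U, V, and M and the K x C, C x T shapes are assumptions chosen for illustration, not the layout used in the talk.

```python
def matmul(a, b):
    """(rows x inner) @ (inner x cols) on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def batched_gemm(U, V):
    """U[p] is K x C (transformed filters), V[p] is C x T (transformed tiles).
    One independent GEMM per transformed tile position p."""
    return [matmul(U[p], V[p]) for p in range(len(U))]


def pointwise_reference(U, V):
    """Same result as an explicit multiply-accumulate over channels."""
    P, K, C, T = len(U), len(U[0]), len(V[0]), len(V[0][0])
    M = [[[0] * T for _ in range(K)] for _ in range(P)]
    for p in range(P):
        for k in range(K):
            for t in range(T):
                for c in range(C):
                    M[p][k][t] += U[p][k][c] * V[p][c][t]
    return M


U = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]   # P=2 positions, K=2 filters, C=2 channels
V = [[[5], [6]], [[7], [8]]]               # P=2 positions, C=2 channels, T=1 tile
print(batched_gemm(U, V))
print(pointwise_reference(U, V))
```

For a 4x4 transformed tile there would be 16 such positions, each an independent GEMM, which is why the step maps well onto a batched GEMM routine.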
Winograd: output transform
Output Feature Map
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain 2x2 output tile
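The four steps (input transform, filter transform, point-wise multiply, output transform) can be checked end to end on one 4x4 input tile and one 3x3 filter. This is a sketch for a single channel, so the batched GEMM collapses to an element-wise product. The matrices BT, G, and AT are not shown on the slides; they are the F(2x2,3x3) matrices as published by Lavin and Gray and are included here as an assumption. Exact fractions are used so the comparison with direct convolution is not blurred by rounding.

```python
from fractions import Fraction as Fr

# F(2x2,3x3) transform matrices (assumed, after Lavin and Gray)
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[Fr(1), 0, 0], [Fr(1, 2), Fr(1, 2), Fr(1, 2)],
     [Fr(1, 2), Fr(-1, 2), Fr(1, 2)], [0, 0, Fr(1)]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def nested(T, x):
    """2D transform as a nested product of 1D transforms: T x T^T."""
    return matmul(matmul(T, x), transpose(T))


def winograd_2x2_3x3(d, g):
    V = nested(BT, d)                      # input transform (4x4)
    U = nested(G, g)                       # filter transform (4x4)
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]  # 16 multiplies
    return nested(AT, M)                   # output transform (2x2)


def direct_2x2_3x3(d, g):
    """Reference: 2x2 outputs, 9 multiplies each (36 in total)."""
    return [[sum(d[y + r][x + s] * g[r][s] for r in range(3) for s in range(3))
             for x in range(2)] for y in range(2)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(winograd_2x2_3x3(d, g) == direct_2x2_3x3(d, g))
```

Neighboring output tiles read input tiles that start two pixels apart, which matches the "4x4 stride 2" label on the input feature map.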
Transforms for Increased Accuracy
Input transforms for 4x4

Integer roots
 4     0    -5     0     1     0
 0    -4    -4     1     1     0
 0     4    -4    -1     1     0
 0    -2    -1     2     1     0
 0     2    -1    -2     1     0
 0     4     0    -5     0     1

Fractional roots
 0.87  0    -2.64  0     1     0
 0    -1.4  -2.25  0.62  1     0
 0     1.4  -2.25 -0.62  1     0
 0    -0.58 -0.39  1.5   1     0
 0     0.58 -0.39 -1.5   1     0
 0     0.87  0    -2.64  0     1
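As a sanity check on the integer-roots matrix, the sketch below uses it as the 1D input transform of F(4,3) and compares the result with direct convolution in exact arithmetic. The companion filter and output matrices G and AT do not appear on the slide; they are taken from Lavin and Gray's published F(4x4,3x3) matrices and are an assumption here. The fractional-roots variant is not reproduced, because its matching filter and output transforms are not given.

```python
from fractions import Fraction as Fr

# Input transform from the slide (integer roots)
BT = [[4, 0, -5, 0, 1, 0],
      [0, -4, -4, 1, 1, 0],
      [0, 4, -4, -1, 1, 0],
      [0, -2, -1, 2, 1, 0],
      [0, 2, -1, -2, 1, 0],
      [0, 4, 0, -5, 0, 1]]
# Companion matrices (assumed, after Lavin and Gray)
G = [[Fr(1, 4), 0, 0],
     [Fr(-1, 6), Fr(-1, 6), Fr(-1, 6)],
     [Fr(-1, 6), Fr(1, 6), Fr(-1, 6)],
     [Fr(1, 24), Fr(1, 12), Fr(1, 6)],
     [Fr(1, 24), Fr(-1, 12), Fr(1, 6)],
     [0, 0, 1]]
AT = [[1, 1, 1, 1, 1, 0],
      [0, 1, -1, 2, -2, 0],
      [0, 1, 1, 4, 4, 0],
      [0, 1, -1, 8, -8, 1]]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def winograd_4_3(d, g):
    """1D F(4,3): 6 inputs, 3 taps, 4 outputs, 6 multiplies."""
    v = matvec(BT, d)                              # input transform
    u = matvec(G, g)                               # filter transform
    return matvec(AT, [a * b for a, b in zip(u, v)])


def direct_4_3(d, g):
    """Reference: 4 outputs, 3 multiplies each."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(4)]


d, g = [1, 2, 3, 4, 5, 6], [1, -1, 2]
print(winograd_4_3(d, g), direct_4_3(d, g))
```

With exact fractions the two agree. The spread of coefficient magnitudes in BT (from 1 up to 5) is what costs accuracy once the values are quantized to a few bits.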
Precision
[Figure: percentage error from convolution (y-axis, 0 to 25) versus bit width (x-axis, 3 to 16) for Direct, 2x2 Winograd, 4x4 Winograd (fractional roots), and 4x4 Winograd (integer roots).]
Percentage error by bit width:

Bits  Direct   2x2 Winograd  4x4 frac  4x4 int
3     56.461   112.174       351.196   314.62
4     23.533   46.222        274.28    432.959
5     10.879   21.394        142.649   459.723
6     5.245    10.34         68.062    446.271
7     2.585    5.074         33.73     250.057
8     1.286    2.516         16.667    123.585
9     0.639    1.253         8.246     62.001
10    0.319    0.626         4.154     31.006
11    0.159    0.312         2.064     15.439
12    0.08     0.156         1.029     7.669
13    0.04     0.078         0.515     3.857
14    0.02     0.039         0.259     1.923
15    0.01     0.019         0.129     0.966
16    0.005    0.01          0.064     0.483
Multiplier Transistor Efficiency
Algo    Bits  Speedup  Transistors  Performance / transistor
Direct  8     1.0      3000         1
2x2     9     2.25     3750         1.8
4x4     12    4.0      6000         2.0
Transistor Counts from Wikipedia:
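The speedup and performance-per-transistor columns can be reproduced with a few lines of arithmetic. The sketch assumes that "speedup" means the reduction in multiplications for a 3x3 filter and that performance per transistor is the speedup divided by the multiplier cost relative to the 8-bit direct baseline. Both readings are inferred from the numbers in the table, and the function names are hypothetical.

```python
def winograd_speedup(m, r=3):
    """Multiplication reduction of F(m x m, r x r) versus direct convolution."""
    direct = (m * r) ** 2       # m*m outputs, r*r multiplies each
    tiled = (m + r - 1) ** 2    # one multiply per element of the transformed tile
    return direct / tiled


def perf_per_transistor(speedup, transistors, baseline=3000):
    """Speedup divided by multiplier size relative to the direct baseline."""
    return speedup / (transistors / baseline)


for name, m, transistors in [("2x2", 2, 3750), ("4x4", 4, 6000)]:
    s = winograd_speedup(m)
    print(name, s, perf_per_transistor(s, transistors))
```

Under these assumptions the wider multipliers needed by the larger tile absorb much of its arithmetic advantage: 4x4 saves 4.0x the multiplies but ends up at 2.0 per transistor, close to the 1.8 of 2x2.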
Logarithmic quantization
D. Miyashita, EH. Lee, B. Murmann
Convolutional Neural Networks using Logarithmic Data Representation
http://arxiv.org/abs/1603.01025v2
Performance: VGG fp32 on GTX 1080
[Figure: VGG totals. Effective TFLOPS (y-axis, 0 to 25) versus batch size (x-axis: 64, 32, 16, 8, 4, 2, 1) for Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), and cuDNN FFT.]
Peak Performance: VGG fp32 on GTX 1080
[Figure: VGG layer 4.2. Effective TFLOPS (y-axis, 0 to 25) versus batch size (x-axis: 64, 32, 16, 8, 4, 2, 1) for Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), and cuDNN FFT.]

 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

An Analysis of Convolution for Inference

  • 1. An Analysis of Convolution for Inference. 24 June 2016. Scott Gray, Nervana Systems. MAKING MACHINES SMARTER.™
  • 2. Direct Convolution
      • Compute with in-place slicing + GEMM
      • Data layout considerations: C, H, W, N
      • Minimize slicing logic
      • Maximize contiguous access
      • Leverage filter overlap
  • 3. Small N direct convolution: Without Superblocking (fprop)
      Q = (W - S + 1 + 2 * pad) / stride
      w_i = s_k + q_j * stride - pad
      Fig. from V. Dumoulin, https://github.com/vdumoulin/conv_arithmetic
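The two fprop formulas on this slide can be sketched in one dimension as follows. This is only an illustration: the function names are hypothetical, and the slide does not say how the division in Q is rounded, so the sketch assumes a ceiling (which agrees with the familiar floor((W - S + 2*pad) / stride) + 1 form). Taps whose index falls outside the input are treated as zero padding.

```python
def output_size(W, S, pad, stride):
    """Q = (W - S + 1 + 2*pad) / stride, rounded up (rounding is an assumption)."""
    n = W - S + 1 + 2 * pad
    return -(-n // stride)  # integer ceiling division


def input_index(s, q, pad, stride):
    """w = s + q*stride - pad: input position read by filter tap s for output q."""
    return s + q * stride - pad


def direct_conv1d(x, f, pad, stride):
    """1D direct convolution (correlation form) driven by the index formula."""
    W, S = len(x), len(f)
    out = []
    for q in range(output_size(W, S, pad, stride)):
        acc = 0
        for s in range(S):
            w = input_index(s, q, pad, stride)
            if 0 <= w < W:  # indices in the padding region contribute zero
                acc += x[w] * f[s]
        out.append(acc)
    return out


result = direct_conv1d([1, 2, 3, 4, 5], [1, 0, -1], pad=1, stride=2)
```

With W = 5, S = 3, pad = 1 and stride = 2 the sketch yields three outputs, each reading the input positions given by `input_index`.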
  • 4. Small N direct convolution: With Superblocking (fprop)
      Q = (W - S + 1 + 2 * pad) / stride
      w_i = s_k + q_j * stride - pad
  • 5. Small N direct convolution: Bprop for deconv (bprop)
      pad' = S - pad - 1
      w_i = (q_j - pad' + s_k) / stride
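One way to read the bprop formula is as the fprop index rule solved for the other index: bprop loops over positions in the fprop input, uses the flipped filter, and keeps only taps where the division by stride is exact. The sketch below checks that reading by enumerating the (input position, output position, filter tap) links produced by both rules. The flipped-filter interpretation, the ceiling in Q, and all function names are assumptions made for illustration, not something stated on the slide.

```python
def out_size(W, S, pad, stride):
    """Output length, assuming the division in Q rounds up."""
    return -(-(W - S + 1 + 2 * pad) // stride)


def fprop_links(W, S, pad, stride):
    """All (w, q, s) triples linked by the fprop rule w = s + q*stride - pad."""
    links = set()
    for q in range(out_size(W, S, pad, stride)):
        for s in range(S):
            w = s + q * stride - pad
            if 0 <= w < W:
                links.add((w, q, s))
    return links


def bprop_links(W, S, pad, stride):
    """The same triples recovered from the bprop rule with pad' = S - pad - 1."""
    Q = out_size(W, S, pad, stride)
    pad_b = S - pad - 1
    links = set()
    for w in range(W):           # position bprop produces (the slide's q_j)
        for s_flip in range(S):  # tap of the flipped filter (the slide's s_k)
            num = w - pad_b + s_flip
            if num % stride == 0 and 0 <= num // stride < Q:
                # num // stride is the position bprop reads (the slide's w_i)
                links.add((w, num // stride, S - 1 - s_flip))
    return links


same = fprop_links(5, 3, 1, 2) == bprop_links(5, 3, 1, 2)
```

Under these assumptions both enumerations give the same set of links, which is what makes it possible to run bprop (and deconvolution) with the same slicing logic as fprop.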
  • 6. Small N direct convolution: Dilated Filters
      S' = (S - 1) * rate + 1
      Q = (W - S' + 1 + 2 * pad) / stride
      w_i = s_k * rate + q_j * stride - pad
      Fig. from F. Yu, V. Koltun, http://arxiv.org/abs/1511.07122v3
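A small sketch of the dilated-filter arithmetic, again in one dimension. The helper names are hypothetical and the division in Q is assumed to round up. With rate = 1 the formulas reduce to the ordinary fprop case.

```python
def effective_filter_size(S, rate):
    """S' = (S - 1) * rate + 1: span covered by a filter with gaps between taps."""
    return (S - 1) * rate + 1


def dilated_output_size(W, S, pad, stride, rate):
    """Q = (W - S' + 1 + 2*pad) / stride, rounded up (rounding is an assumption)."""
    n = W - effective_filter_size(S, rate) + 1 + 2 * pad
    return -(-n // stride)


def dilated_taps(q, S, pad, stride, rate):
    """Input positions w = s*rate + q*stride - pad read for output position q."""
    return [s * rate + q * stride - pad for s in range(S)]


Q = dilated_output_size(W=7, S=3, pad=0, stride=1, rate=2)
first = dilated_taps(0, S=3, pad=0, stride=1, rate=2)
last = dilated_taps(Q - 1, S=3, pad=0, stride=1, rate=2)
```

For a 3-tap filter with rate 2 on a 7-element input, the effective filter spans 5 elements, so there are 3 output positions and the last one reads up to the final input element.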
  • 7. Convolution with Algorithmic Speedups
      • FFT and Winograd have the same basic computational flow
      • FFT tiles typically need to be much bigger
      • Winograd history: Toom and Cook, then Lavin
  • 8. Winograd: input transform
      Figure: input feature map, 4x4 tiles with stride 2
      • Input transform
      • 2D Winograd is a nested product of 1D transforms
      • Transforms can be simplified to remove zeros
  • 9. Winograd: filter transform
      • Filter transform
      • Same as input but with different coefficients
      • Transform each feature map independently
  • 10. Winograd: batched GEMM
      • Point-wise multiplication
      • Posed as a batched GEMM operation
  • 11. Winograd: output transform
      Figure: output feature map
      • Output transform
      • Same form as the input and filter transforms
      • Transform back to pixel space to obtain a 2x2 output tile
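The four Winograd steps on the preceding slides (input transform, filter transform, point-wise multiplication, output transform) can be sketched for a single 4x4 input tile producing a 2x2 output tile. The matrices below are the standard F(2x2, 3x3) matrices from Lavin's formulation; the slides do not list them, so their use here is an assumption, as are the function names. Plain Python lists are used to keep the sketch dependency-free. In a full implementation the channel reduction at each of the 16 tile positions, taken over all filters and all tiles, is one small matrix product, which is what allows the step to be posed as a batched GEMM.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


# F(2x2, 3x3) transform matrices (Lavin); assumed, not taken from the slides.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def nest(T, x):
    """2D transform as a nested product of 1D transforms: T x T^T."""
    return matmul(matmul(T, x), transpose(T))


def winograd_tile(inputs, filters):
    """inputs: C tiles of 4x4; filters: K x C kernels of 3x3; returns K tiles of 2x2."""
    V = [nest(BT, d) for d in inputs]                       # input transform
    U = [[nest(G, g) for g in per_k] for per_k in filters]  # filter transform
    out = []
    for per_k in U:
        # Point-wise multiply, reduced over channels at each of the 16 positions.
        M = [[sum(per_k[c][i][j] * V[c][i][j] for c in range(len(V)))
              for j in range(4)] for i in range(4)]
        out.append(nest(AT, M))                             # output transform
    return out


def direct_tile(inputs, filters):
    """Reference: direct 3x3 convolution (correlation form) on the same tile."""
    return [[[sum(inputs[c][y + r][x + s] * per_k[c][r][s]
                  for c in range(len(inputs)) for r in range(3) for s in range(3))
              for x in range(2)] for y in range(2)] for per_k in filters]


inputs = [[[r * 4 + c for c in range(4)] for r in range(4)],
          [[(r * c) % 3 for c in range(4)] for r in range(4)]]
filters = [[[[1, 0, -1], [2, 0, -2], [1, 0, -1]], [[0, 1, 0], [1, -4, 1], [0, 1, 0]]],
           [[[1, 1, 1], [1, 1, 1], [1, 1, 1]], [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]
wino = winograd_tile(inputs, filters)
ref = direct_tile(inputs, filters)
```

Comparing `wino` with `ref` shows the two paths agree on this tile, while the Winograd path uses 16 multiplications per channel and filter in the point-wise step instead of 36.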
  • 12. Transforms for Increased Accuracy
      Input transforms for 4x4

      Integer roots:
          4     0    -5     0     1     0
          0    -4    -4     1     1     0
          0     4    -4    -1     1     0
          0    -2    -1     2     1     0
          0     2    -1    -2     1     0
          0     4     0    -5     0     1

      Fractional roots:
          0.87   0     -2.64   0      1   0
          0     -1.4   -2.25   0.62   1   0
          0      1.4   -2.25  -0.62   1   0
          0     -0.58  -0.39   1.5    1   0
          0      0.58  -0.39  -1.5    1   0
          0      0.87   0     -2.64   0   1
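One way to see where the labels "integer roots" and "fractional roots" may come from is to read the first row of each matrix as polynomial coefficients in ascending powers of x and evaluate it at candidate points. This reading is an assumption made for illustration, not something the slide states; the candidate fractional points 0.62 and 1.5 are simply taken from other entries of the same matrix, and the slide's values are rounded to two decimals, so the fractional residuals are small rather than exactly zero.

```python
def row_poly(row, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... for one row of a transform matrix."""
    return sum(c * x ** k for k, c in enumerate(row))


INTEGER_ROW = [4, 0, -5, 0, 1, 0]          # first row, integer-roots matrix
FRACTIONAL_ROW = [0.87, 0, -2.64, 0, 1, 0]  # first row, fractional-roots matrix

integer_residuals = [row_poly(INTEGER_ROW, x) for x in (1, -1, 2, -2)]
fractional_residuals = [row_poly(FRACTIONAL_ROW, x) for x in (0.62, -0.62, 1.5, -1.5)]
```

The integer row vanishes exactly at 1, -1, 2 and -2, while the fractional row is close to zero near 0.62 and 1.5 in magnitude, points that are closer together than 1 and 2.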
  • 13. Precision: percentage error from convolution
      Figure: percentage error (axis 0 to 25) vs. bit width (3 to 16) for Direct, 2x2 Winograd, 4x4 Winograd (fractional roots), and 4x4 Winograd (integer roots).

      Bits    Direct    2x2 Winograd    4x4 frac    4x4 int
         3    56.461         112.174     351.196    314.62
         4    23.533          46.222     274.28     432.959
         5    10.879          21.394     142.649    459.723
         6     5.245          10.34       68.062    446.271
         7     2.585           5.074      33.73     250.057
         8     1.286           2.516      16.667    123.585
         9     0.639           1.253       8.246     62.001
        10     0.319           0.626       4.154     31.006
        11     0.159           0.312       2.064     15.439
        12     0.08            0.156       1.029      7.669
        13     0.04            0.078       0.515      3.857
        14     0.02            0.039       0.259      1.923
        15     0.01            0.019       0.129      0.966
        16     0.005           0.01        0.064      0.483
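A convenient way to read the precision figures is to ask how many bits each algorithm needs before its error drops to that of direct convolution at 8 bits. The sketch below does this lookup on the values from this slide; the helper name and the choice of 8-bit direct convolution as the reference are assumptions made for illustration.

```python
COLUMNS = ["Direct", "2x2 Winograd", "4x4 frac", "4x4 int"]

# (bits, Direct, 2x2 Winograd, 4x4 frac, 4x4 int): percentage error per bit width
ROWS = [
    (3, 56.461, 112.174, 351.196, 314.62),
    (4, 23.533, 46.222, 274.28, 432.959),
    (5, 10.879, 21.394, 142.649, 459.723),
    (6, 5.245, 10.34, 68.062, 446.271),
    (7, 2.585, 5.074, 33.73, 250.057),
    (8, 1.286, 2.516, 16.667, 123.585),
    (9, 0.639, 1.253, 8.246, 62.001),
    (10, 0.319, 0.626, 4.154, 31.006),
    (11, 0.159, 0.312, 2.064, 15.439),
    (12, 0.08, 0.156, 1.029, 7.669),
    (13, 0.04, 0.078, 0.515, 3.857),
    (14, 0.02, 0.039, 0.259, 1.923),
    (15, 0.01, 0.019, 0.129, 0.966),
    (16, 0.005, 0.01, 0.064, 0.483),
]


def min_bits(column, target_error):
    """Smallest listed bit width at which `column` has error <= target_error."""
    i = COLUMNS.index(column) + 1
    return min(row[0] for row in ROWS if row[i] <= target_error)


reference = dict((row[0], row[1]) for row in ROWS)[8]  # direct convolution, 8 bits
needed = {name: min_bits(name, reference) for name in COLUMNS}
```

By this measure 2x2 Winograd needs 9 bits, 4x4 Winograd with fractional roots needs 12, and 4x4 with integer roots needs 15. The values 9 and 12 coincide with the bit widths listed on the Multiplier Transistor Efficiency slide.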
  • 14. Multiplier Transistor Efficiency

      Algo      Bits    Speedup    Transistors    Performance / transistor
      Direct       8       1.0            3000                         1
      2x2          9       2.25           3750                         1.8
      4x4         12       4.0            6000                         2.0

      Transistor counts from Wikipedia:
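The last column of this table follows from the others: it is speedup divided by transistor count, normalized to direct convolution. The speedup column itself matches the usual multiplication-count ratio for F(m x m, r x r) with 3x3 filters. The sketch below reproduces both; the function names are hypothetical, and the multiplication-count formula is the standard one from the Winograd convolution literature rather than something stated on the slide.

```python
def mult_speedup(m, r):
    """Multiplications for direct (m*m*r*r) over Winograd ((m + r - 1)^2) per tile."""
    return (m * m * r * r) / ((m + r - 1) ** 2)


# algo: (bits, speedup, transistors), copied from the slide
TABLE = {
    "Direct": (8, 1.0, 3000),
    "2x2": (9, 2.25, 3750),
    "4x4": (12, 4.0, 6000),
}


def perf_per_transistor(algo, base="Direct"):
    """Speedup per transistor, relative to the baseline algorithm."""
    _, speedup, transistors = TABLE[algo]
    _, base_speedup, base_transistors = TABLE[base]
    return (speedup / transistors) / (base_speedup / base_transistors)


ratios = {algo: perf_per_transistor(algo) for algo in TABLE}
```

For example, the 2x2 multiplier costs 1.25 times the transistors of the 8-bit one but does 2.25 times the work, giving 2.25 / 1.25 = 1.8.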
  • 15. Logarithmic quantization
      D. Miyashita, E. H. Lee, B. Murmann, "Convolutional Neural Networks using Logarithmic Data Representation", http://arxiv.org/abs/1603.01025v2
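The slide only cites the paper, so the following is a rough illustration of the general idea of a logarithmic representation and not the paper's exact scheme: each value is replaced by a signed power of two, so that only a small integer exponent has to be stored and multiplications can become shifts. The function name, the exponent range, and the choice to clip small values to the smallest exponent rather than to zero are all assumptions.

```python
import math


def log_quantize(x, min_exp=-8, max_exp=0):
    """Round x to the nearest power of two in the log domain, clipping the exponent."""
    if x == 0:
        return 0.0
    exponent = round(math.log2(abs(x)))
    exponent = max(min_exp, min(max_exp, exponent))
    return math.copysign(2.0 ** exponent, x)


weights = [0.3, -0.7, 0.0, 0.06]
quantized = [log_quantize(w) for w in weights]
```

With the default range, nine exponent values plus sign and zero are enough to encode every quantized weight.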
  • 16. Performance: VGG fp32 on GTX1080
      Figure: effective TFLOPS (axis 0 to 25) vs. batch size (64, 32, 16, 8, 4, 2, 1), VGG totals. Series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT.
  • 17. Peak Performance: VGG fp32 on GTX1080
      Figure: effective TFLOPS (axis 0 to 25) vs. batch size (64, 32, 16, 8, 4, 2, 1), VGG layer 4.2. Series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT.