An Analysis of Convolution for Inference

Scott Gray's presentation from the "On-device Intelligence" workshop at ICML 2016, covering various ways of computing convolution.

An Analysis of Convolution for Inference
24 June 2016
Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™

Proprietary and confidential. Do not distribute. Nervana
Direct Convolution
• Compute with in-place slicing + gemm
• Data layout considerations: C, H, W, N
• Minimize slicing logic
• Maximize contiguous access
• Leverage filter overlap

Small N direct convolution: Without Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad
Fig from V. Dumoulin,
https://github.com/vdumoulin/conv_arithmetic
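
As a rough illustration of the two fprop formulas above, the sketch below applies them to a 1D signal. The function names are hypothetical, and rounding the division up is an assumption (the slide does not say how a non-integer Q is handled); with that choice the formula agrees with the common floor((W - S + 2*pad) / stride) + 1.

```python
def output_size(W, S, pad=0, stride=1):
    """Q = (W - S + 1 + 2*pad) / stride, rounded up (rounding is an assumption)."""
    return -(-(W - S + 1 + 2 * pad) // stride)


def input_index(qj, sk, pad=0, stride=1):
    """wi = sk + qj * stride - pad: input position read for output qj, filter tap sk."""
    return sk + qj * stride - pad


def fprop_1d(x, f, pad=0, stride=1):
    """Direct 1D convolution (correlation form) built only from the two formulas."""
    W, S = len(x), len(f)
    y = []
    for qj in range(output_size(W, S, pad, stride)):
        acc = 0
        for sk in range(S):
            wi = input_index(qj, sk, pad, stride)
            if 0 <= wi < W:  # positions outside the input are zero padding
                acc += x[wi] * f[sk]
        y.append(acc)
    return y
```

Calling fprop_1d([1, 2, 3, 4, 5], [1, 0, -1], pad=1, stride=2) walks three output positions, the first of which reads one padded element.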

Small N direct convolution: With Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad

Small N direct convolution: Bprop for deconv
bprop
pad’ = S - pad - 1
wi = (qj - pad’ + sk) / stride
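
One way to read the bprop formulas: bprop reuses the fprop loop structure, with qj now indexing the position being computed (in the input gradient) and wi the position read from the output gradient. That reading, the mirrored filter, and the function names below are assumptions, not something the slide states. The sketch checks this gather form against a plain scatter that follows the fprop indexing.

```python
def bprop_gather(dy, f, W, pad=0, stride=1):
    """Input gradient via pad' = S - pad - 1 and wi = (qj - pad' + sk) / stride."""
    S = len(f)
    pad_p = S - pad - 1
    dx = [0] * W
    for qj in range(W):              # position being computed
        for sk in range(S):          # tap of the mirrored filter
            num = qj - pad_p + sk
            if num % stride:         # not divisible: no fprop output touched this tap
                continue
            wi = num // stride       # position read from dy
            if 0 <= wi < len(dy):
                dx[qj] += dy[wi] * f[S - 1 - sk]
    return dx


def bprop_scatter(dy, f, W, pad=0, stride=1):
    """Reference: push each dy[q] back through the fprop index wi = sk + q*stride - pad."""
    dx = [0] * W
    for q, grad in enumerate(dy):
        for sk, weight in enumerate(f):
            wi = sk + q * stride - pad
            if 0 <= wi < W:
                dx[wi] += grad * weight
    return dx
```

With stride greater than 1, the divisibility test is what skips the "holes" between contributing positions.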

Small N direct convolution: Dilated Filters
Dilated
S’ = (S-1) * rate + 1
Q = (W-S’+1 + 2*pad) / stride
wi = sk * rate + qj * stride - pad
Fig from F. Yu, V. Koltun
http://arxiv.org/abs/1511.07122v3
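
The dilated formulas can be exercised the same way. This is a minimal sketch with hypothetical function names; as before, rounding the division up is an assumption.

```python
def dilated_filter_size(S, rate):
    """S' = (S - 1) * rate + 1: span of the filter once gaps are inserted."""
    return (S - 1) * rate + 1


def dilated_output_size(W, S, rate, pad=0, stride=1):
    """Q = (W - S' + 1 + 2*pad) / stride, rounded up (rounding is an assumption)."""
    S_eff = dilated_filter_size(S, rate)
    return -(-(W - S_eff + 1 + 2 * pad) // stride)


def dilated_fprop_1d(x, f, rate, pad=0, stride=1):
    """Direct 1D dilated convolution using wi = sk * rate + qj * stride - pad."""
    W, S = len(x), len(f)
    y = []
    for qj in range(dilated_output_size(W, S, rate, pad, stride)):
        acc = 0
        for sk in range(S):
            wi = sk * rate + qj * stride - pad
            if 0 <= wi < W:  # positions outside the input are zero padding
                acc += x[wi] * f[sk]
        y.append(acc)
    return y
```

With rate = 1 both helpers reduce to the ordinary fprop formulas, so one code path can serve dilated and non-dilated filters.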

Convolution with Algorithmic Speedups
• FFT and Winograd have same basic computational flow
• FFT tiles typically need to be much bigger
• Winograd history: Toom and Cook, then Lavin

Winograd: input transform
Input Feature Map, 4x4 stride 2
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
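
The nesting can be sketched for one 4x4 tile. The matrix below is the F(2x2,3x3) input transform from Lavin's formulation; that this slide depicts that case is an assumption based on the 4x4 tile with stride 2, and the function names are hypothetical. The 1D transform is written out by hand so that the zero coefficients are never multiplied.

```python
# B^T for F(2x2,3x3) (assumed to be the case shown on the slide).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]


def matmul(a, b):
    """Plain matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def transform_1d(d):
    """B^T applied to a 4-vector with the zero terms removed."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def input_transform_nested(tile):
    """Apply the 1D transform along rows, then along columns."""
    rows = [transform_1d(r) for r in tile]
    cols = [transform_1d(c) for c in zip(*rows)]
    return transpose(cols)


def input_transform_matrix(tile):
    """Reference form: B^T d B."""
    return matmul(matmul(BT, tile), transpose(BT))
```

Both functions return the same 4x4 result; the nested form needs only additions and subtractions.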

Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
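
A matching sketch for the filter side, again assuming the F(2x2,3x3) case with the coefficients from Lavin's formulation (1D form of G written out by hand). The layout filters[k][c] and the function names are hypothetical.

```python
def filter_transform_1d(g):
    """G applied to a 3-vector: 3 filter taps become 4 transformed values."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]


def filter_transform(g):
    """G g G^T for one 3x3 kernel, as a nested pair of 1D transforms."""
    rows = [filter_transform_1d(r) for r in g]           # 3 rows of 4 values
    cols = [filter_transform_1d(c) for c in zip(*rows)]  # 4 columns of 4 values
    return [list(r) for r in zip(*cols)]                 # back to row-major 4x4


def transform_filters(filters):
    """filters[k][c] is a 3x3 kernel; every (k, c) pair is transformed on its own."""
    return [[filter_transform(g) for g in per_output] for per_output in filters]
```

Because each kernel is handled independently, for inference the transformed filters can be computed once and reused across inputs.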

Winograd: batched GEMM
• Point-wise Multiplication
• Posed as batched GEMM operation
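
To make the batching concrete: summing the point-wise products over input channels is, at each of the T x T tile positions, an ordinary (K x C) by (C x P) matrix product. The sketch below checks that equivalence on small random integer data. The layouts U[k][c], V[c][p] and the sizes K, C, P are hypothetical choices for illustration.

```python
import random


def matmul(a, b):
    """Plain matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def pointwise(U, V, K, C, P, T):
    """M[k][p] = sum over c of the element-wise product U[k][c] * V[c][p]."""
    M = [[[[0] * T for _ in range(T)] for _ in range(P)] for _ in range(K)]
    for k in range(K):
        for p in range(P):
            for c in range(C):
                for i in range(T):
                    for j in range(T):
                        M[k][p][i][j] += U[k][c][i][j] * V[c][p][i][j]
    return M


def batched_gemm(U, V, K, C, P, T):
    """Same result as T*T independent GEMMs, one per tile position (i, j)."""
    M = [[[[0] * T for _ in range(T)] for _ in range(P)] for _ in range(K)]
    for i in range(T):
        for j in range(T):
            A = [[U[k][c][i][j] for c in range(C)] for k in range(K)]  # K x C
            B = [[V[c][p][i][j] for p in range(P)] for c in range(C)]  # C x P
            for k, row in enumerate(matmul(A, B)):
                for p, value in enumerate(row):
                    M[k][p][i][j] = value
    return M


def random_tiles(outer, inner, T, rng):
    """outer x inner grid of T x T integer tiles."""
    return [[[[rng.randint(-3, 3) for _ in range(T)] for _ in range(T)]
             for _ in range(inner)] for _ in range(outer)]


rng = random.Random(0)
K, C, P, T = 2, 3, 5, 4
U = random_tiles(K, C, T, rng)  # transformed filters
V = random_tiles(C, P, T, rng)  # transformed input tiles
```

For a 4x4 tile this gives 16 independent GEMMs.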

Winograd: output transform
Output Feature Map
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain 2x2 output tile
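
Putting the stages together for a single channel and a single tile gives a small end-to-end check. The three matrices are the F(2x2,3x3) transforms from Lavin's formulation (assumed to be the case the slides illustrate), and the function names are hypothetical. The result is compared with a direct 3x3 convolution over the same 4x4 tile.

```python
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]  # input
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]     # filter
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                               # output


def matmul(a, b):
    """Plain matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def winograd_tile(d, g):
    """Y = A^T [ (G g G^T) * (B^T d B) ] A, with * the element-wise product."""
    V = matmul(matmul(BT, d), transpose(BT))   # 4x4 transformed input
    U = matmul(matmul(G, g), transpose(G))     # 4x4 transformed filter
    M = [[u * v for u, v in zip(ur, vr)] for ur, vr in zip(U, V)]
    return matmul(matmul(AT, M), transpose(AT))  # 2x2 output tile


def direct_tile(d, g):
    """Reference: slide the 3x3 filter over the 4x4 tile (correlation form)."""
    return [[sum(d[y + i][x + j] * g[i][j] for i in range(3) for j in range(3))
             for x in range(2)] for y in range(2)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The Winograd path uses 16 multiplications in the element-wise stage, where the direct path uses 36 for the same 2x2 output.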

Transforms for Increased Accuracy
Input transforms for 4x4

Integer roots
 4   0  -5   0   1   0
 0  -4  -4   1   1   0
 0   4  -4  -1   1   0
 0  -2  -1   2   1   0
 0   2  -1  -2   1   0
 0   4   0  -5   0   1

Fractional roots
 0.87   0     -2.64   0      1   0
 0     -1.4   -2.25   0.62   1   0
 0      1.4   -2.25  -0.62   1   0
 0     -0.58  -0.39   1.5    1   0
 0      0.58  -0.39  -1.5    1   0
 0      0.87   0     -2.64   0   1
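
One rough way to compare the two matrices is to ask how much each can amplify a bounded input, using the largest absolute row sum as a proxy. This metric is an assumption chosen for illustration, not something the slides state; a smaller value suggests fewer extra bits are consumed by the transform.

```python
INTEGER_ROOTS = [
    [4, 0, -5, 0, 1, 0],
    [0, -4, -4, 1, 1, 0],
    [0, 4, -4, -1, 1, 0],
    [0, -2, -1, 2, 1, 0],
    [0, 2, -1, -2, 1, 0],
    [0, 4, 0, -5, 0, 1],
]

FRACTIONAL_ROOTS = [
    [0.87, 0, -2.64, 0, 1, 0],
    [0, -1.4, -2.25, 0.62, 1, 0],
    [0, 1.4, -2.25, -0.62, 1, 0],
    [0, -0.58, -0.39, 1.5, 1, 0],
    [0, 0.58, -0.39, -1.5, 1, 0],
    [0, 0.87, 0, -2.64, 0, 1],
]


def max_abs_row_sum(matrix):
    """Worst-case growth of one transformed value when every input has magnitude <= 1."""
    return max(sum(abs(v) for v in row) for row in matrix)


growth_int = max_abs_row_sum(INTEGER_ROOTS)      # 1D pass, integer roots
growth_frac = max_abs_row_sum(FRACTIONAL_ROOTS)  # 1D pass, fractional roots
```

By this measure the integer-root matrix can grow a value by a factor of 10 per 1D pass and the fractional-root matrix by about 5.27; the 2D transform applies the pass twice, so the gap widens.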

Precision
[Chart: percentage error from convolution (axis 0 to 25) versus bit width (3 to 16) for Direct, 2x2 Winograd, 4x4 Winograd (Fractional Roots), and 4x4 Winograd (Integer Roots)]

Percentage error by bit width:

Bits   Direct   2x2 Winograd   4x4 frac   4x4 int
3      56.461   112.174        351.196    314.62
4      23.533   46.222         274.28     432.959
5      10.879   21.394         142.649    459.723
6      5.245    10.34          68.062     446.271
7      2.585    5.074          33.73      250.057
8      1.286    2.516          16.667     123.585
9      0.639    1.253          8.246      62.001
10     0.319    0.626          4.154      31.006
11     0.159    0.312          2.064      15.439
12     0.08     0.156          1.029      7.669
13     0.04     0.078          0.515      3.857
14     0.02     0.039          0.259      1.923
15     0.01     0.019          0.129      0.966
16     0.005    0.01           0.064      0.483
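
The precision figures can also be read as "how many bits does each algorithm need to match 8-bit direct convolution?". The sketch below answers that from the slide's own numbers; the variable and function names are hypothetical.

```python
# Percentage error by bit width, copied from the precision slide.
# Columns: Direct, 2x2 Winograd, 4x4 frac, 4x4 int.
ERROR = {
    3: (56.461, 112.174, 351.196, 314.62),
    4: (23.533, 46.222, 274.28, 432.959),
    5: (10.879, 21.394, 142.649, 459.723),
    6: (5.245, 10.34, 68.062, 446.271),
    7: (2.585, 5.074, 33.73, 250.057),
    8: (1.286, 2.516, 16.667, 123.585),
    9: (0.639, 1.253, 8.246, 62.001),
    10: (0.319, 0.626, 4.154, 31.006),
    11: (0.159, 0.312, 2.064, 15.439),
    12: (0.08, 0.156, 1.029, 7.669),
    13: (0.04, 0.078, 0.515, 3.857),
    14: (0.02, 0.039, 0.259, 1.923),
    15: (0.01, 0.019, 0.129, 0.966),
    16: (0.005, 0.01, 0.064, 0.483),
}
ALGOS = ("Direct", "2x2 Winograd", "4x4 frac", "4x4 int")


def bits_to_match(target, column):
    """Smallest bit width whose error in the given column is at or below target."""
    for bits in sorted(ERROR):
        if ERROR[bits][column] <= target:
            return bits
    return None  # not reached within the measured range


target = ERROR[8][0]  # error of direct convolution at 8 bits
needed = {name: bits_to_match(target, i) for i, name in enumerate(ALGOS)}
```

This gives 8, 9, 12, and 15 bits for Direct, 2x2 Winograd, 4x4 with fractional roots, and 4x4 with integer roots respectively; the first three match the bit widths in the multiplier table on the next slide.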

Multiplier Transistor Efficiency

Algo     Bits   Speedup   Transistors   Performance / transistor
Direct   8      1.0       3000          1
2x2      9      2.25      3750          1.8
4x4      12     4.0       6000          2.0

Transistor Counts from Wikipedia:
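
The speedup and efficiency columns can be reproduced with a few lines. This sketch assumes the speedup is the ratio of multiplications per output tile for a 3x3 filter and that efficiency is normalised to the 3000-transistor direct multiplier; the function names are hypothetical, and the transistor counts are taken from the table as given.

```python
def winograd_speedup(m, r):
    """Multiplications for an m x m output tile with an r x r filter: direct / Winograd."""
    direct = (m * r) ** 2          # m*m outputs, r*r multiplies each
    winograd = (m + r - 1) ** 2    # one multiply per transformed tile element
    return direct / winograd


def perf_per_transistor(speedup, transistors, baseline=3000):
    """Speedup divided by multiplier cost relative to the 8-bit direct baseline."""
    return speedup * baseline / transistors


ROWS = [
    ("Direct", 1.0, 3000),
    ("2x2", winograd_speedup(2, 3), 3750),
    ("4x4", winograd_speedup(4, 3), 6000),
]
EFFICIENCY = {name: perf_per_transistor(s, t) for name, s, t in ROWS}
```

Under these assumptions the wider multipliers needed by Winograd cost less than the multiplications they save.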

Logarithmic quantization
D. Miyashita, EH. Lee, B. Murmann
Convolutional Neural Networks using Logarithmic Data Representation
http://arxiv.org/abs/1603.01025v2

Performance: VGG fp32 on GTX1080
[Chart, VGG - Totals: effective TFLOPS (axis 0 to 25) versus batch size (64, 32, 16, 8, 4, 2, 1). Series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT]

Peak Performance: VGG fp32 on GTX1080
[Chart, VGG - Layer 4.2: effective TFLOPS (axis 0 to 25) versus batch size (64, 32, 16, 8, 4, 2, 1). Series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT]
