High-Performance GPU Programming for Deep Learning
7 April 2016
Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™
Proprietary and confidential. Do not distribute. nervana
High-Performance GPU kernels for deep learning
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd
GEMM: Basics
C = AB
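As a point of reference for the kernels discussed in this deck, here is a minimal pure-Python sketch of C = AB. The loop order (k outermost) is chosen to show the product as a sum of outer products. The function name `gemm` and the list-of-lists layout are illustrative assumptions, not taken from the slides.

```python
def gemm(A, B):
    """Reference C = AB for row-major lists of lists (M x K times K x N)."""
    M, K, N = len(A), len(B), len(B[0])
    assert all(len(row) == K for row in A), "inner dimensions must match"
    C = [[0.0] * N for _ in range(M)]
    # Each step along k adds one rank-1 (outer product) update:
    # column k of A times row k of B.
    for k in range(K):
        for i in range(M):
            a_ik = A[i][k]
            for j in range(N):
                C[i][j] += a_ik * B[k][j]
    return C
```

Usage: `gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])` multiplies two 2 x 2 matrices.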
GEMM: Memory Load
[Figure: threads mapped to memory loads for a single tile and for batched GEMM; panels: outer product contiguous, outer product strided]
GEMM: Tile sizes
[Figure: threads mapped to shared memory loads for three tile shapes: GEMM tile 32 x 32, GEMM tile 32 x 64, and batched GEMM tiles 32 x 32]
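The tiling idea can be sketched on the CPU in plain Python. Each output tile plays the role of one thread block, and the staged slices stand in for shared memory loads. The 32 x 32 default matches the tile size named on the slide; the `tile_k` depth of 8 and all names are assumptions made for illustration, and this is not the actual Maxwell kernel.

```python
def tiled_gemm(A, B, tile_m=32, tile_n=32, tile_k=8):
    """C = AB computed one (tile_m x tile_n) output tile at a time."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):            # one "thread block" per output tile
        for n0 in range(0, N, tile_n):
            m1, n1 = min(m0 + tile_m, M), min(n0 + tile_n, N)
            for k0 in range(0, K, tile_k):    # march along the reduction dimension
                k1 = min(k0 + tile_k, K)
                # Stand-in for staging a slab of A and B in shared memory.
                a_tile = [A[i][k0:k1] for i in range(m0, m1)]
                b_tile = [B[k][n0:n1] for k in range(k0, k1)]
                # Accumulate outer products of the staged slabs into the tile.
                for i, a_row in enumerate(a_tile):
                    c_row = C[m0 + i]
                    for kk, a in enumerate(a_row):
                        for j, b in enumerate(b_tile[kk]):
                            c_row[n0 + j] += a * b
    return C
```

A smaller tile covers a small minibatch dimension with less wasted work, which is the motivation the deck gives for 32 x 32 tiles.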
hGEMM Results - NN
[Figure: Nx3072x3072 NN op. GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); series: Nervana 32x32, cuBLAS 128x64]
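For readers reproducing a GFLOPS figure like the one plotted here, a GEMM of shape m x n x k is conventionally counted as 2·m·n·k floating-point operations (one multiply and one add per term). The helper below is a sketch of that arithmetic. It assumes the Nx3072x3072 label means a batch dimension N against a 3072 x 3072 matrix, and any timing passed in is the caller's own measurement, not data from the slides.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS for an m x n x k GEMM that took `seconds` to run.

    Counts 2*m*n*k operations: one multiply and one add per inner-product term.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e9
```

Usage: `gemm_gflops(32, 3072, 3072, measured_seconds)` for the batch-size-32 point, where `measured_seconds` is a hypothetical timing.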
hGEMM Results - TN
[Figure: Nx3072x3072 TN op. GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); series: Nervana 32x32, cuBLAS 128x64]
Direct convolution is still relevant
• Striding
• Odd-size filters
• Placeholder until a faster algorithm can be implemented
• Often faster for a single image or for the first layer, where the channel count C is small
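A direct convolution handles striding and odd filter sizes with no special cases, which a short reference sketch makes concrete. The dimension names (C, K, R, S, P, Q) follow the notation used in this deck; the function name, the single-image layout, and the zero padding are assumptions for illustration.

```python
def direct_conv2d(image, filters, stride=1, pad=0):
    """Direct convolution (cross-correlation) for one image.

    image:   [C][H][W] input feature maps
    filters: [K][C][R][S] filter bank
    returns: [K][P][Q] output feature maps
    """
    C, H, W = len(image), len(image[0]), len(image[0][0])
    K, R, S = len(filters), len(filters[0][0]), len(filters[0][0][0])
    P = (H + 2 * pad - R) // stride + 1
    Q = (W + 2 * pad - S) // stride + 1
    out = [[[0.0] * Q for _ in range(P)] for _ in range(K)]
    for k in range(K):
        for p in range(P):
            for q in range(Q):
                acc = 0.0
                for c in range(C):
                    for r in range(R):
                        y = p * stride + r - pad
                        if not 0 <= y < H:
                            continue          # row falls in the zero padding
                        for s in range(S):
                            x = q * stride + s - pad
                            if 0 <= x < W:    # skip columns in the zero padding
                                acc += image[c][y][x] * filters[k][c][r][s]
                out[k][p][q] = acc
    return out
```

Any stride and any R x S filter shape goes through the same loop nest; only P and Q change.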
Direct convolution: implementation details
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division
• N vs C contiguous
• Single P,Q vs tiled P,Q
• Bprop as upside down fprop
• Update-specific optimizations
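The "bprop as upside down fprop" item can be illustrated for the stride-1 case: the gradient with respect to the input is itself a forward convolution of the output gradient, using filters rotated by 180 degrees with the C and K axes swapped. The sketch below is an assumption-laden illustration (stride 1 only, square filters, hypothetical function names), not the kernel described in the slides.

```python
def conv2d(x, w, pad=0):
    """Stride-1 fprop. x: [C][H][W], w: [K][C][R][S] -> [K][P][Q]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P, Q = H + 2 * pad - R + 1, W + 2 * pad - S + 1
    out = [[[0.0] * Q for _ in range(P)] for _ in range(K)]
    for k in range(K):
        for p in range(P):
            for q in range(Q):
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            y, xx = p + r - pad, q + s - pad
                            if 0 <= y < H and 0 <= xx < W:
                                out[k][p][q] += x[c][y][xx] * w[k][c][r][s]
    return out


def flip_filters(w):
    """Swap the K and C axes and rotate every R x S filter by 180 degrees."""
    K, C = len(w), len(w[0])
    return [[[row[::-1] for row in w[k][c][::-1]] for k in range(K)]
            for c in range(C)]


def bprop_as_fprop(grad_out, w, pad=0):
    """Input gradient of conv2d(x, w, pad), computed by the fprop routine."""
    R, S = len(w[0][0]), len(w[0][0][0])
    assert R == S, "sketch assumes square filters so one pad value suffices"
    # "Full" padding R - 1, reduced by whatever padding fprop used.
    return conv2d(grad_out, flip_filters(w), pad=R - 1 - pad)
```

Strided layers need extra handling of the output gradient that this sketch leaves out.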
Winograd: input transform
[Figure: input feature map, 4x4 tiles with stride 2]
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
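The nesting of 1D transforms can be sketched in a few lines. The slides do not list coefficients, so the code assumes the commonly published F(2x2, 3x3) input matrix B^T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], written out with its zero terms already removed. Function names are illustrative.

```python
def input_transform_1d(d):
    """B^T d for a length-4 vector, with the zero terms of B^T dropped."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def input_transform_2d(tile):
    """V = B^T d B for a 4x4 input tile, as two nested 1D passes."""
    # Pass 1: transform every column, which computes B^T d.
    cols = [input_transform_1d([tile[r][c] for r in range(4)]) for c in range(4)]
    rows = [[cols[c][r] for c in range(4)] for r in range(4)]
    # Pass 2: transform every row of the result, which applies B on the right.
    return [input_transform_1d(row) for row in rows]
```

Each 1D pass costs four additions and no multiplications, which is what removing the zeros buys.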
Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
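The filter transform has the same nested structure with different coefficients. This sketch assumes the commonly published F(2x2, 3x3) filter matrix G = [[1, 0, 0], [1/2, 1/2, 1/2], [1/2, -1/2, 1/2], [0, 0, 1]], which the slides do not spell out. It would be applied to each 3x3 filter independently; names are illustrative.

```python
def filter_transform_1d(g):
    """G g for a length-3 vector, giving a length-4 result."""
    g0, g1, g2 = g
    return [g0, 0.5 * (g0 + g1 + g2), 0.5 * (g0 - g1 + g2), g2]


def filter_transform_2d(g):
    """U = G g G^T: expand one 3x3 filter to a 4x4 transformed filter."""
    # Pass 1: transform every column of g, which computes G g (4 x 3).
    cols = [filter_transform_1d([g[r][c] for r in range(3)]) for c in range(3)]
    rows = [[cols[c][r] for c in range(3)] for r in range(4)]
    # Pass 2: transform every row, which applies G^T on the right (4 x 4).
    return [filter_transform_1d(row) for row in rows]
```

Because the filters do not depend on the input data, this step can be done once per weight update rather than once per image.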
Winograd: batched GEMM
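The batched GEMM stage of Winograd can be sketched as follows. After the transforms, each of the 16 positions in the 4x4 transformed tile is reduced over input channels independently, so each position is its own (K x C) by (C x T) matrix multiply. The array layouts, the tile count T, and the function name are assumptions for illustration.

```python
def winograd_batched_gemm(U, V):
    """Multiply transformed filters by transformed input tiles.

    U: [K][C][4][4] transformed filters
    V: [C][T][4][4] transformed input tiles (T tiles)
    returns M: [K][T][4][4]
    """
    K, C, T = len(U), len(V), len(V[0])
    M = [[[[0.0] * 4 for _ in range(4)] for _ in range(T)] for _ in range(K)]
    for i in range(4):
        for j in range(4):
            # Gather position (i, j) into plain matrices: A is K x C, B is C x T.
            A = [[U[k][c][i][j] for c in range(C)] for k in range(K)]
            B = [[V[c][t][i][j] for t in range(T)] for c in range(C)]
            # One of the 16 independent GEMMs in the batch.
            for k in range(K):
                for t in range(T):
                    M[k][t][i][j] = sum(A[k][c] * B[c][t] for c in range(C))
    return M
```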
Winograd: output transform
[Figure: output feature map]
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain 2x2 output tile
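Putting the three transforms together for a single channel gives a check against direct convolution. The sketch assumes the commonly published F(2x2, 3x3) matrices, including the output matrix A^T = [[1, 1, 1, 0], [0, 1, -1, -1]]; the slides do not list coefficients, and all function names are illustrative.

```python
def transform_2d(x, f):
    """Apply the 1D transform f down the columns of x, then along its rows."""
    cols = [f([row[c] for row in x]) for c in range(len(x[0]))]
    rows = [[col[r] for col in cols] for r in range(len(cols[0]))]
    return [f(row) for row in rows]


def bt_1d(d):   # input transform, B^T d
    return [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]


def g_1d(g):    # filter transform, G g
    return [g[0], 0.5 * (g[0] + g[1] + g[2]), 0.5 * (g[0] - g[1] + g[2]), g[2]]


def at_1d(m):   # output transform, A^T m
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


def winograd_tile(d, g):
    """2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    V = transform_2d(d, bt_1d)                       # 4x4 transformed input
    U = transform_2d(g, g_1d)                        # 4x4 transformed filter
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]  # 16 multiplies
    return transform_2d(M, at_1d)                    # back to pixel space, 2x2


def direct_tile(d, g):
    """Same 2x2 output computed directly: 36 multiplies."""
    return [[sum(d[p + r][q + s] * g[r][s] for r in range(3) for s in range(3))
             for q in range(2)] for p in range(2)]
```

The Winograd path uses 16 multiplies per 2x2 output tile where the direct path uses 36, a 2.25x reduction in multiplications before counting the cost of the transforms.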
Performance: VGG
[Figure: VGG fp32, totals by operation. Algorithmic speedup (0 to 2) vs. batch size (64, 32, 16, 8, 4, 2, 1); series: Winograd fp32 fprop, bprop, and update; cuDNN fp32 fprop, bprop, and update]
Performance: Alexnet convolutional layers
[Figure: Alexnet totals. Algorithmic speedup (0 to 2) vs. batch size (128, 64, 32, 16, 8, 4); series: Nervana fp16, Nervana fp32, CuBLAS fp16, CuBLAS fp32]
Compounding
Compounding inside of GEMM and conv for free:
• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
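Compounding means applying these elementwise steps to each accumulator before it is written out, so no extra pass over memory is needed. A minimal sketch of a fused epilogue covering alpha/beta, bias, and relu follows. The signature, the choice of one bias value per output row, and the order of operations are assumptions for illustration, not the neon kernel interface.

```python
def gemm_compound(A, B, C=None, alpha=1.0, beta=0.0, bias=None, relu=False):
    """out = act(alpha * (A @ B) + beta * C + bias), fused into one pass.

    bias is assumed to hold one value per output row.
    """
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = sum(A[i][k] * B[k][j] for k in range(K))
            # Epilogue: everything below works on the accumulator in place,
            # before the single store to the output.
            val = alpha * acc
            if C is not None:
                val += beta * C[i][j]
            if bias is not None:
                val += bias[i]
            if relu:
                val = max(val, 0.0)
            out[i][j] = val
    return out
```

Usage: `gemm_compound(A, B, bias=b, relu=True)` computes a fully connected layer's activation in a single pass over the output.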
Summary
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor

 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 

High-Performance GPU Programming for Deep Learning

  • 1. High-Performance GPU Programming for Deep Learning. 7 April 2016. Scott Gray, Nervana Systems. MAKING MACHINES SMARTER.™
  • 2. High-Performance GPU kernels for deep learning
      • Fast matrix multiply for small minibatches
      • Direct convolution leveraging GEMM advances
      • Even faster convolution with Winograd
  • 3. GEMM: Basics. C = AB
  • 4. GEMM: Memory Load. [Figure: outer product contiguous vs. outer product strided; labels: threads, memory load, single tile, batched GEMM]
  • 5. GEMM: Tile sizes. [Figure: GEMM tile 32 x 32, GEMM tile 32 x 64, batched GEMM tiles 32 x 32; labels: threads, shared memory load]
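The GEMM slides describe C = AB as being built from outer products, computed one output tile at a time (32 x 32 or 32 x 64). The following is a minimal CPU sketch of that idea in plain Python, not the Nervana kernel: the function names `gemm_tiled` and `gemm_reference`, the list-of-lists layout, and the `tile_m` / `tile_n` parameters are all assumptions made for illustration. Each output tile accumulates one rank-1 outer product per step along the inner dimension, which is roughly what a thread block would do with data staged in shared memory.

```python
def gemm_tiled(A, B, tile_m=32, tile_n=32):
    """Compute C = A @ B one output tile at a time (illustrative, CPU only)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    assert len(B) == K, "inner dimensions must match"
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):              # tile rows
        for n0 in range(0, N, tile_n):          # tile columns
            m1, n1 = min(m0 + tile_m, M), min(n0 + tile_n, N)
            for k in range(K):
                # One outer product: a column slice of A times a row slice of B.
                a_col = [A[m][k] for m in range(m0, m1)]
                b_row = B[k][n0:n1]
                for i, a in enumerate(a_col):
                    row = C[m0 + i]
                    for j, b in enumerate(b_row):
                        row[n0 + j] += a * b
    return C


def gemm_reference(A, B):
    """Plain triple-loop matrix multiply used only to check the tiled version."""
    return [[sum(A[m][k] * B[k][n] for k in range(len(B)))
             for n in range(len(B[0]))] for m in range(len(A))]
```

With a small minibatch dimension, a smaller tile such as 32 x 32 wastes less work on padding than a large tile, which is one plausible reading of why the slides compare a 32 x 32 tile against a 128 x 64 one.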
  • 6. hGEMM Results - NN. [Chart: Nx3072x3072 NN op; GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); series: Nervana 32x32, cuBLAS 128x64]
  • 7. hGEMM Results - TN. [Chart: Nx3072x3072 TN op; GFLOPS (0 to 6000) vs. batch size N (32, 64, 96, 128); series: Nervana 32x32, cuBLAS 128x64]
  • 8. Direct convolution is still relevant
      • Striding
      • Odd-size filters
      • Placeholder until faster algo can be implemented
      • Often faster for single image or first small C layer
  • 9. Direct convolution: implementation details
      • Batched GEMM for efficient transpose and higher occupancy
      • Compound outer product block remapping
      • Square wave pattern for P,Q block mapping
      • Slicing: shared memory lookup + integer division
      • N vs C contiguous
      • Single P,Q vs tiled P,Q
      • Bprop as upside down fprop
      • Update specific optimizations
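The slide does not expand on "Bprop as upside down fprop". One common reading, assumed here, is the standard identity that the gradient with respect to the input equals the forward operation applied to the zero-padded output gradient with the filter flipped. The sketch below shows this in 1D with a single channel; `fprop`, `bprop`, and `bprop_reference` are hypothetical names, and the real kernels also have to handle striding and swap the roles of the C and K dimensions.

```python
def fprop(x, w):
    """Valid 1D cross-correlation: y[i] = sum_r x[i + r] * w[r]."""
    R = len(w)
    return [sum(x[i + r] * w[r] for r in range(R))
            for i in range(len(x) - R + 1)]


def bprop(dy, w):
    """Gradient w.r.t. x, obtained by reusing fprop on padded dy with the flipped filter."""
    pad = [0.0] * (len(w) - 1)
    return fprop(pad + list(dy) + pad, w[::-1])


def bprop_reference(dy, w, n):
    """Direct chain rule: dx[j] = sum_i dy[i] * w[j - i] over valid filter taps."""
    return [sum(dy[i] * w[j - i] for i in range(len(dy)) if 0 <= j - i < len(w))
            for j in range(n)]
```

The point of the identity is code reuse: the same inner loop serves both passes, with only the indexing of the filter and the padding changed.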
  • 10. Winograd: input transform. [Figure: Input Feature Map, 4x4 stride 2]
      • Input transform
      • 2D Winograd is a nested product of 1D transforms
      • Transforms can be simplified to remove zeros
  • 11. Winograd: filter transform
      • Filter transform
      • Same as input but with different coefficients
      • Transform each feature map independently
  • 12. Winograd: batched GEMM
  • 13. Winograd: output transform. [Figure: Output Feature Map]
      • Output transform
      • Same as input and filter
      • Transform back to pixel space to obtain 2x2 output tile
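The four Winograd slides (input transform, filter transform, batched GEMM, output transform) can be sketched for a single 4x4 input tile producing a 2x2 output tile with 3x3 filters. This is a plain-Python illustration under assumptions, not the presented kernels: the matrices `BT`, `G`, and `AT` are the standard F(2x2, 3x3) coefficients from the Winograd convolution literature, and the helper names (`nested`, `winograd_tile`, `direct_tile`) are hypothetical. The reduction over input channels at each of the 16 transformed positions is the step that becomes a batched GEMM once many tiles and filters are processed together.

```python
# Standard F(2x2, 3x3) transform matrices (assumed; the slides do not list them).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def nested(T, X):
    """2D transform as a nested product of the 1D transform T: T X T^T."""
    return matmul(matmul(T, X), transpose(T))


def winograd_tile(d, g):
    """d: C input tiles (4x4). g: K x C filters (3x3). Returns K output tiles (2x2)."""
    V = [nested(BT, d_c) for d_c in d]                      # input transform
    U = [[nested(G, g_kc) for g_kc in g_k] for g_k in g]    # filter transform
    out = []
    for U_k in U:
        # Batched GEMM step: an independent reduction over C at each position.
        M = [[sum(U_k[c][i][j] * V[c][i][j] for c in range(len(d)))
              for j in range(4)] for i in range(4)]
        out.append(nested(AT, M))                           # output transform
    return out


def direct_tile(d, g):
    """Direct 3x3 correlation over the same tile, used only as a check."""
    return [[[sum(d[c][y + r][x + s] * g_k[c][r][s]
                  for c in range(len(d)) for r in range(3) for s in range(3))
              for x in range(2)] for y in range(2)] for g_k in g]
```

Per output tile and channel pair, the transformed domain needs 16 multiplies where direct convolution needs 36, which is the source of the algorithmic speedup; the "4x4 stride 2" label on the input slide corresponds to overlapping 4x4 input tiles that each yield a 2x2 output tile.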
  • 14. Performance: VGG. [Chart: VGG fp32, totals by operation; algorithmic speedup (0 to 2) vs. batch size (64, 32, 16, 8, 4, 2, 1); series: Winograd fp32 fprop, Winograd fp32 bprop, Winograd fp32 update, cuDNN fp32 fprop, cuDNN fp32 bprop, cuDNN fp32 update]
  • 15. Performance: Alexnet convolutional layers. [Chart: Alexnet totals; algorithmic speedup (0 to 2) vs. batch size (128, 64, 32, 16, 8, 4); series: Nervana fp16, Nervana fp32, cuBLAS fp16, cuBLAS fp32]
  • 16. Compounding. Compounding inside of GEMM and conv for free:
      • alpha / beta
      • bias
      • relu, prelu, tanh, …
      • bprop relu, …
      • bprop bias
      • batchnorm mean
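"Compounding" here means folding cheap elementwise work into the GEMM or convolution itself, so the result is never written out and read back for a separate pass. A minimal sketch of the forward-pass items (alpha / beta scaling, bias, activation) follows. The function name `gemm_compound`, its parameters, and the choice of a per-column bias are assumptions made for illustration and do not reflect the actual kernel interface.

```python
import math

# Hypothetical activation table; the slide also lists prelu and others.
ACTIVATIONS = {
    "identity": lambda x: x,
    "relu": lambda x: max(x, 0.0),
    "tanh": math.tanh,
}


def gemm_compound(A, B, C=None, alpha=1.0, beta=0.0, bias=None, activation="identity"):
    """out = act(alpha * (A @ B) + beta * C + bias), applied as each element is produced."""
    act = ACTIVATIONS[activation]
    K = len(B)
    out = []
    for m, a_row in enumerate(A):
        out_row = []
        for n in range(len(B[0])):
            acc = sum(a_row[k] * B[k][n] for k in range(K))  # the GEMM part
            val = alpha * acc                                # epilogue starts here
            if C is not None:
                val += beta * C[m][n]
            if bias is not None:
                val += bias[n]                               # assumed: one bias per column
            out_row.append(act(val))
        out.append(out_row)
    return out
```

On a GPU the epilogue runs while the accumulator is still in registers, so the extra arithmetic is hidden behind the memory write that has to happen anyway; that is the sense in which the slide calls it free.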
  • 17. Summary
      • Nervana has the fastest tools for deep learning
      • neon with state-of-the-art Maxwell kernels
      • Nervana Cloud with multi-GPU training
      • Watch for Nervana Engine, our deep learning processor