High-Performance GPU Programming for Deep Learning


This session covers many of the techniques we use at Nervana in GPU programming to achieve state-of-the-art performance for deep learning networks. The main focus is the customization of dense linear algebra kernels: Winograd 3x3 convolution, direct convolution, and small-tile GEMM (matrix multiply). In particular, we'll look at how we achieve high utilization at very small minibatches, which is important for multi-GPU scaling and inference. In addition, we'll talk about where and how you can effectively leverage lower and mixed precision to further increase performance without loss in accuracy.


High-Performance GPU
Programming for Deep Learning
7 April 2016
Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™

Proprietary and confidential. Do not distribute. nervana
High-Performance GPU kernels for deep learning
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd

GEMM: Basics
C = AB
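
As a rough illustration of C = AB computed one output tile at a time, here is a minimal sketch in plain Python. It is written for readability, not speed, and is not the kernel from the talk; the function name `gemm_tiled` and the `tile` parameter are hypothetical. Within each tile, every step of the reduction dimension adds one small outer product (a slice of a column of A times a slice of a row of B) into the tile of C.

```python
def gemm_tiled(A, B, tile=2):
    """Compute C = A @ B for list-of-lists matrices, one output tile at a time."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # tile row of C
        for j0 in range(0, n, tile):          # tile column of C
            for p in range(k):                # reduction dimension
                # Accumulate one outer product into the current tile.
                for i in range(i0, min(i0 + tile, m)):
                    a = A[i][p]
                    for j in range(j0, min(j0 + tile, n)):
                        C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    print(gemm_tiled([[1, 2], [3, 4], [5, 6]], [[7, 8, 9], [10, 11, 12]]))
```

The `min(...)` bounds handle matrices whose sizes are not multiples of the tile size.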

GEMM: Memory Load
[Figure: thread-to-memory-load mapping for "outer product contiguous" and "outer product strided" layouts, shown for a single tile and for batched GEMM.]

GEMM: Tile sizes
[Figure: thread and shared memory load layouts for a 32 x 32 GEMM tile, a 32 x 64 GEMM tile, and batched 32 x 32 GEMM tiles.]

hGEMM Results - NN
[Figure: GFLOPS (y-axis, 0 to 6000) vs. batch size N (32, 64, 96, 128) for the Nx3072x3072 NN op, comparing Nervana 32x32 and cuBLAS 128x64 tiles.]

hGEMM Results - TN
[Figure: GFLOPS (y-axis, 0 to 6000) vs. batch size N (32, 64, 96, 128) for the Nx3072x3072 TN op, comparing Nervana 32x32 and cuBLAS 128x64 tiles.]
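
One factor that can make small tiles attractive at small batch sizes is padding waste: a tile larger than the batch dimension computes outputs that are thrown away. The sketch below is only back-of-the-envelope arithmetic under the assumption that the batch dimension N maps onto the first tile dimension. It does not model occupancy, memory bandwidth, or anything else that affects measured GFLOPS, and `tile_utilization` is a hypothetical helper name.

```python
import math


def tile_utilization(rows, cols, tile_rows, tile_cols):
    """Fraction of the work in a padded tile grid that lands on real outputs."""
    tiles = math.ceil(rows / tile_rows) * math.ceil(cols / tile_cols)
    return (rows * cols) / (tiles * tile_rows * tile_cols)


if __name__ == "__main__":
    # Output matrix of size N x 3072, as in the benchmarked op.
    for n in (32, 64, 96, 128):
        small = tile_utilization(n, 3072, 32, 32)
        large = tile_utilization(n, 3072, 128, 64)
        print(f"N={n:3d}  32x32 tile: {small:.2f}  128x64 tile: {large:.2f}")
```

Under this assumption a 32x32 tile is fully used at every listed batch size, while a 128-row tile is only fully used once N reaches 128.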

Direct convolution is still relevant
• Striding
• Odd-size filters
• Placeholder until faster algo can be implemented
• Often faster for single image or first small C layer
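
To make the striding and filter-size points concrete, here is a minimal reference sketch of direct convolution for a single image in plain Python. It is not the GPU kernel from the talk. The dimension names C, P, and Q follow the slides; K (output feature maps) and R, S (filter height and width) are assumed naming, and `direct_conv2d` is a hypothetical function. Padding is omitted for brevity.

```python
def direct_conv2d(x, f, stride=1):
    """Direct (cross-correlation) convolution without padding.

    x: input indexed [c][h][w]; f: filters indexed [k][c][r][s].
    Returns the output indexed [k][p][q].
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(f), len(f[0][0]), len(f[0][0][0])
    P = (H - R) // stride + 1     # output height
    Q = (W - S) // stride + 1     # output width
    out = [[[0.0] * Q for _ in range(P)] for _ in range(K)]
    for k in range(K):
        for p in range(P):
            for q in range(Q):
                acc = 0.0
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            acc += x[c][p * stride + r][q * stride + s] * f[k][c][r][s]
                out[k][p][q] = acc
    return out


if __name__ == "__main__":
    image = [[[float(h * 5 + w) for w in range(5)] for h in range(5)]]  # C = 1, 5x5
    ones = [[[[1.0] * 3 for _ in range(3)]]]                             # K = 1, 3x3
    print(direct_conv2d(image, ones, stride=2))
```

Nothing in the loop nest depends on the stride being 1 or the filter being 3x3, which is why the direct form remains a general fallback.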

Direct convolution: implementation details
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division
• N vs C contiguous
• Single P,Q vs tiled P,Q
• Bprop as upside down fprop
• Update specific optimizations
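
One common reading of "bprop as upside down fprop" is that the gradient with respect to the input can be computed by running the forward routine on the zero-padded output gradient with the filter reversed. The 1D sketch below checks that identity against an explicit scatter-style gradient. It assumes stride 1 and a single channel, and the function names are hypothetical; in the multi-channel case the roles of input and output feature maps are also swapped, which this sketch does not show.

```python
def fprop(x, w):
    """1D valid cross-correlation: y[p] = sum_s x[p + s] * w[s]."""
    S = len(w)
    return [sum(x[p + s] * w[s] for s in range(S)) for p in range(len(x) - S + 1)]


def bprop_as_fprop(dy, w):
    """Input gradient: fprop over zero-padded dy with the filter reversed."""
    pad = [0.0] * (len(w) - 1)
    return fprop(pad + list(dy) + pad, list(w)[::-1])


def bprop_reference(dy, w, n):
    """Input gradient by scattering each output gradient to the inputs it read."""
    dx = [0.0] * n
    for p, g in enumerate(dy):
        for s, ws in enumerate(w):
            dx[p + s] += g * ws
    return dx


if __name__ == "__main__":
    w = [1.0, 2.0, 3.0]
    dy = [1.0, -1.0, 2.0]          # gradient for an input of length 5
    print(bprop_as_fprop(dy, w))
    print(bprop_reference(dy, w, 5))
```

Both functions return the same vector, so one forward kernel can in principle serve both passes.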

Winograd: input transform
[Figure: input feature map read as 4x4 tiles with stride 2.]
• Input transform
• 2D Winograd is a nested product of 1D transforms
• Transforms can be simplified to remove zeros
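
The input transform can be sketched as follows. The slides do not print the transform matrix, so this assumes the standard F(2x2, 3x3) matrix B^T published by Lavin and Gray; the kernels in the talk may use a different but equivalent arrangement. The helper names are hypothetical. `transform_1d` is B^T with the zero multiplications removed, and `input_transform` nests it over columns and then rows to get B^T d B.

```python
# Standard F(2x2, 3x3) input-transform matrix (assumed, not shown on the slide).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]


def transform_1d(d):
    """B^T applied to a length-4 vector, using only additions and subtractions."""
    return [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def input_transform(tile):
    """V = B^T d B for a 4x4 tile, as nested 1D transforms."""
    cols = [transform_1d(c) for c in transpose(tile)]    # B^T d, stored transposed
    return [transform_1d(r) for r in transpose(cols)]    # (B^T d) B


def extract_tiles(fmap):
    """Overlapping 4x4 tiles taken every 2 pixels in each direction."""
    H, W = len(fmap), len(fmap[0])
    return [[[row[x:x + 4] for row in fmap[y:y + 4]]
             for x in range(0, W - 3, 2)]
            for y in range(0, H - 3, 2)]


if __name__ == "__main__":
    tile = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8], [9, 7, 9, 3]]
    assert input_transform(tile) == matmul(matmul(BT, tile), transpose(BT))
    print(input_transform(tile))
```

The assertion in the usage section confirms that the simplified nested form matches the full matrix product.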

Winograd: filter transform
• Filter transform
• Same as input but with different coefficients
• Transform each feature map independently
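
A matching sketch for the filter side, again assuming the standard F(2x2, 3x3) matrix G = [[1, 0, 0], [1/2, 1/2, 1/2], [1/2, -1/2, 1/2], [0, 0, 1]] from Lavin and Gray rather than anything printed on the slide. Each 3x3 filter slice becomes a 4x4 tile U = G g G^T, and every (output map, input map) pair is transformed on its own. Function names and the `[k][c][r][s]` layout are assumptions.

```python
def filter_transform_1d(g):
    """G applied to a length-3 vector, with the zero multiplications removed."""
    return [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]


def filter_transform(g):
    """U = G g G^T for one 3x3 filter slice, as nested 1D transforms."""
    cols = [filter_transform_1d(c) for c in zip(*g)]       # G g, stored transposed
    return [filter_transform_1d(r) for r in zip(*cols)]    # (G g) G^T


def transform_all_filters(filters):
    """Transform every [k][c] slice independently; filters is indexed [k][c][r][s]."""
    return [[filter_transform(g) for g in per_k] for per_k in filters]


if __name__ == "__main__":
    ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    for row in filter_transform(ones):
        print(row)
```

Because the filters do not change between minibatches during inference, this transform can be computed once and reused.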

Winograd: batched GEMM
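
The slide for this step is a diagram only, so the following is a sketch of the usual formulation rather than a description of the talk's kernel. After both transforms, each of the 16 positions of the 4x4 transform-domain tile has its own K x C filter matrix and its own C x T matrix of input tiles, and the reduction over input feature maps becomes 16 independent matrix multiplies. The names `U`, `V`, `K`, `T`, and `winograd_batched_gemm` are assumptions.

```python
def winograd_batched_gemm(U, V):
    """M[i] = U[i] @ V[i] for each transform-domain position i.

    U[i] is K x C (transformed filters); V[i] is C x T (transformed input tiles).
    """
    M = []
    for Ui, Vi in zip(U, V):
        M.append([[sum(u * v for u, v in zip(row, col)) for col in zip(*Vi)]
                  for row in Ui])
    return M


if __name__ == "__main__":
    U = [[[1, 2]] for _ in range(16)]                 # K = 1, C = 2
    V = [[[3, 4, 5], [6, 7, 8]] for _ in range(16)]   # C = 2, T = 3
    print(winograd_batched_gemm(U, V)[0])
```

Each of the 16 products is small, which is where a fast small-tile batched GEMM matters.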

Winograd: output transform
[Figure: output feature map.]
• Output transform
• Same as input and filter
• Transform back to pixel space to obtain 2x2 output tile
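
Putting the three transforms together for one tile and one channel gives a small end-to-end check against direct convolution. The matrices are the standard F(2x2, 3x3) ones from Lavin and Gray, assumed here because the slides do not list them, and the function names are hypothetical. With a single channel, the batched GEMM step reduces to an elementwise product.

```python
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def winograd_tile(d, g):
    """2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    V = matmul(matmul(BT, d), transpose(BT))       # input transform
    U = matmul(matmul(G, g), transpose(G))         # filter transform
    M = [[u * v for u, v in zip(ur, vr)] for ur, vr in zip(U, V)]  # single channel
    return matmul(matmul(AT, M), transpose(AT))    # output transform to pixel space


def direct_tile(d, g):
    """Reference: direct 3x3 cross-correlation over the same 4x4 tile."""
    return [[sum(d[p + r][q + s] * g[r][s] for r in range(3) for s in range(3))
             for q in range(2)] for p in range(2)]


if __name__ == "__main__":
    d = [[3, 1, 4, 1], [5, 9, 2, 6], [5, 3, 5, 8], [9, 7, 9, 3]]
    g = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(winograd_tile(d, g))
    print(direct_tile(d, g))
```

The Winograd path uses 16 multiplies in the elementwise step for the 2x2 tile, compared with 36 for the direct form.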

Performance: VGG
[Figure: VGG fp32, totals by operation. Algorithmic speedup (y-axis, 0 to 2) vs. batch size (64, 32, 16, 8, 4, 2, 1) for Winograd fp32 and cuDNN fp32, each split into fprop, bprop, and update.]

Performance: Alexnet convolutional layers
[Figure: Alexnet totals. Algorithmic speedup (y-axis, 0 to 2) vs. batch size (128, 64, 32, 16, 8, 4) for Nervana fp16, Nervana fp32, cuBLAS fp16, and cuBLAS fp32.]

Compounding
Compounding inside of GEMM and conv for free:
• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
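
The idea of compounding can be sketched as applying the extra element-wise work at the moment each output value leaves the accumulator, instead of in a separate pass over memory. The function below is a plain Python illustration, not the talk's kernel. `gemm_compound`, its parameters, and the choice to index the bias by output column are all assumptions.

```python
def gemm_compound(A, B, C, alpha=1.0, beta=0.0, bias=None, relu=False):
    """Return act(alpha * A @ B + beta * C + bias), fused into the output write."""
    out = []
    for i, row in enumerate(A):
        out_row = []
        for j, col in enumerate(zip(*B)):
            acc = sum(a * b for a, b in zip(row, col))   # GEMM accumulator
            val = alpha * acc + beta * C[i][j]           # alpha / beta scaling
            if bias is not None:
                val += bias[j]                           # bias add (per column, assumed)
            if relu and val < 0.0:
                val = 0.0                                # activation
            out_row.append(val)
        out.append(out_row)
    return out


if __name__ == "__main__":
    A = [[1.0, 2.0]]
    B = [[1.0, -1.0], [1.0, -1.0]]
    C = [[10.0, 10.0]]
    print(gemm_compound(A, B, C, alpha=2.0, beta=0.5, bias=[1.0, -2.0], relu=True))
```

On a GPU the accumulator is already in registers at this point, so the extra arithmetic adds little compared with a second kernel that re-reads the output.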

Summary
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor

