3. DEEP LEARNING — A NEW COMPUTING MODEL
“Software that writes software”
[Diagram: training data → LEARNING ALGORITHM → caption "little girl is eating piece of cake", at a compute scale of "millions of trillions of FLOPS"]
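A minimal sketch of what "software that writes software" means in practice: a learning algorithm turns example input/output pairs into a program's parameters instead of a human hand-coding the rule. The toy regression below is purely illustrative; the captioning model on the slide is a deep network trained at vastly larger scale.

```python
# Minimal sketch of "software that writes software": instead of hand-coding
# the rule y = 2x + 1, a learning algorithm recovers it from examples.
# Illustrative only -- not the slide's image-captioning network.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, size=100)  # noisy examples of the target rule

w, b = 0.0, 0.0                                     # the "program" the algorithm writes
lr = 0.5
for _ in range(200):                                # gradient descent on mean squared error
    pred = w * x + b
    w -= lr * 2 * np.mean((pred - y) * x)
    b -= lr * 2 * np.mean(pred - y)

print(f"learned rule: y = {w:.2f}x + {b:.2f}")      # ~ y = 2.00x + 1.00
```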
4. AI IS EVERYWHERE
"Find where I parked my car"
"Find the bag I just saw in this magazine"
"What movie should I watch next?"
5. TOUCHING OUR LIVES
Bringing grandmother closer to family by bridging the language barrier
Predicting a sick baby's vitals, such as heart rate and blood pressure, and likelihood of survival
Enabling the blind to "see" their surroundings and read emotions on faces
6. FUELING ALL INDUSTRIES
Increasing public safety with smart video surveillance at airports and malls
Providing intelligent services in hotels, banks, and stores
Separating weeds from crops as it harvests, reducing chemical usage by 90%
7. DEEP LEARNING DEMANDS A NEW CLASS OF HPC

TRAINING: scalable performance
Billions of teraFLOPs per training run
Years of compute time on a Xeon CPU; GPUs turn years into days

INFERENCING: throughput + efficiency, scaling with data and users
Billions of FLOPs per inference
Seconds for a response on a Xeon CPU; GPUs deliver an instant response
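The training column is simple arithmetic. As a back-of-envelope relation (notation mine, not from the deck): a run that needs $W$ total floating-point operations on a machine sustaining $R$ operations per second takes

$$T = \frac{W}{R}, \qquad \text{so} \qquad \frac{T_{\text{CPU}}}{T_{\text{GPU}}} = \frac{R_{\text{GPU}}}{R_{\text{CPU}}}.$$

"Years to days" is therefore a claim about sustained throughput: an advantage of roughly 100× turns about three years into about ten days (illustrative numbers, not from the slide).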
8. BAIDU DEEP SPEECH 2
12K neurons (2.5× Deep Speech 1)
100M parameters (4× Deep Speech 1)
15 exaFLOPs per training run (10× Deep Speech 1)
Super-human accuracy
2 months on a CPU server | 2 days on DGX-1
Word error rate: DS2: 5% | Human: 6% | DS1: 8%
"Deep Speech 2: End-to-End Speech Recognition in English and Mandarin", 12/2015 | Dataset: LibriSpeech test-clean
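Those figures are roughly self-consistent. A quick check, assuming "15 exaflops" means 15×10¹⁸ total operations and the durations are wall-clock time:

```python
# Back-of-envelope check of the Deep Speech 2 slide (assumption: "15 exaflops"
# is 15e18 total floating-point operations; "2 days" / "2 months" are wall-clock).
total_flops = 15e18

dgx1_seconds = 2 * 24 * 3600        # "2 days on DGX-1"
cpu_seconds = 2 * 30 * 24 * 3600    # "2 months on a CPU server"

print(f"DGX-1 sustained: {total_flops / dgx1_seconds / 1e12:.0f} TFLOP/s")  # ~87
print(f"CPU sustained:   {total_flops / cpu_seconds / 1e12:.1f} TFLOP/s")   # ~2.9
```

About 87 TFLOP/s sustained is roughly half of DGX-1's ~170 TFLOPS FP16 peak (8× Tesla P100), a plausible utilization for a real training run.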
9. MODERN AI NEEDS A NEW INFERENCE SOLUTION
User experience: from seconds to instant
"Where is the nearest Szechuan restaurant?"
[Chart: user wait time for text after speech is complete; CPU: 6 sec and 2.2 sec; Pascal GPU: 0.1 sec]
Deep Speech 2 inference performance on a 16-user server | CPU: ~170 ms of estimated compute per 100 ms of speech sample | Pascal GPU: ~51 ms of compute per 100 ms of speech sample
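The footnote's per-100-ms compute figures are what drive those bars. A small sketch of the arithmetic (my assumptions: the engine processes audio while the user is still speaking, and the utterance length is an illustrative choice, not from the slide):

```python
# Wait time left over after speech ends, given compute cost per 100 ms of audio.
# Assumes inference overlaps the incoming stream (an assumption, not from the deck).
def wait_after_speech(compute_ms_per_100ms, utterance_s):
    rtf = compute_ms_per_100ms / 100.0          # real-time factor
    return max(0.0, (rtf - 1.0) * utterance_s)  # backlog left when the user stops

utterance = 3.0  # seconds, illustrative
print(f"CPU (170 ms per 100 ms): ~{wait_after_speech(170, utterance):.1f} s backlog")
print(f"GPU  (51 ms per 100 ms): ~{wait_after_speech(51, utterance):.1f} s backlog")
```

A real-time factor above 1 means the backlog grows for as long as the user talks, which is why CPU wait stretches into seconds, while the GPU (RTF ≈ 0.5) keeps pace and leaves only the final chunk's latency.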
12. DGX-1 — A LEAGUE OF ITS OWN
[Chart: relative training performance (1×–16×) on ResNet, Inception v3, AlexNet, VGG, and MSR across GeForce® GTX TITAN X, GeForce® GTX 1080, Tesla® P100, DIGITS™ DevBox (4× GeForce® GTX TITAN X), Quadro® VCA (8× Quadro® M6000), and DGX-1™ (8× Tesla® P100)]
Caffe on DeepMark. GeForce TITAN X and GTX 1080 system: Intel Core i7-5930K @ 3.5 GHz, 64 GB System Memory | Tesla P100 (SXM2) system: Dual CPU server, Intel E5-2698 v4 @ 2.2 GHz, 256 GB System Memory
13. DGX STACK
Fully integrated deep learning platform
Instant productivity — plug-and-play, supports every AI framework
Performance optimized across the entire stack
Always up to date via the cloud
Mixed framework environments — containerized
Direct access to NVIDIA experts
14. DGX — THE ESSENTIAL TOOL OF DEEP LEARNING SCIENTISTS
The platform of AI pioneers
Reduces training time from weeks to days
A 250-node HPC supercomputer-in-a-box
15. INTRODUCING NVIDIA TensorRT
High-performance inference engine
User experience: instant response, 45× faster with Pascal + TensorRT
[Chart: inference execution time: Tesla P40 and Tesla P4: 6 ms and 11 ms; 1× CPU (14 cores): 260 ms]
Faster, more responsive AI-powered services such as voice recognition and speech translation
Efficient inference on images, video, and other data in hyperscale production data centers
Based on VGG-19 from IntelCaffe GitHub: https://github.com/intel/caffe/tree/master/models/mkl2017_vgg_19 | CPU: IntelCaffe, batch size = 4, Intel E5-2690 v4, using Intel MKL 2017 | GPU: Caffe, batch size = 4, using TensorRT internal version
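For context, latency numbers like these come from timing repeated fixed-batch inferences and reporting percentiles. A methodology sketch (the numpy matmul is a stand-in for a real engine's execute call; this is not the TensorRT API):

```python
# Latency-measurement sketch: warm up, then time repeated fixed-batch runs.
import time
import numpy as np

def fake_engine(batch):                    # stand-in for an optimized inference engine
    return batch @ np.ones((4096, 1000), dtype=np.float32)

batch = np.random.rand(4, 4096).astype(np.float32)   # batch size = 4, as in the footnote

for _ in range(10):                        # warm-up iterations, excluded from timing
    fake_engine(batch)

latencies = []
for _ in range(100):
    t0 = time.perf_counter()
    fake_engine(batch)
    latencies.append((time.perf_counter() - t0) * 1e3)

print(f"p50: {np.percentile(latencies, 50):.2f} ms  p99: {np.percentile(latencies, 99):.2f} ms")
```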
16. NVIDIA DEEPSTREAM SDK
Delivering video analytics at scale
Pipeline: hardware decode → preprocess → inference (TensorRT) → "Boy playing soccer"
Simple, high-performance API for analyzing video
Decodes H.264, HEVC, MPEG-2, MPEG-4, VP9
CUDA-optimized resize and scale
TensorRT for inference
[Chart: concurrent video streams analyzed, 1× Tesla P4 server + DeepStream SDK vs 13× E5-2650 v4 servers]
720p30 decode | IntelCaffe using dual-socket E5-2650 v4 CPU servers, Intel MKL 2017 | Based on GoogLeNet optimized by Intel: https://github.com/intel/caffe/tree/master/models/mkl2017_googlenet_v2
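The pipeline shape the slide describes (decode, then CUDA preprocessing, then TensorRT inference) can be sketched as below. Every function body is a stand-in invented for illustration; the real SDK runs hardware video decode and GPU inference, not these numpy placeholders.

```python
# Illustrative decode -> preprocess -> infer pipeline over concurrent streams.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def decode(stream_id):                  # stand-in for hardware H.264/HEVC decode
    return np.random.rand(8, 360, 640, 3).astype(np.float32)  # a short clip

def preprocess(frames):                 # stand-in for CUDA-optimized resize/scale
    return frames[:, ::2, ::2, :]

def infer(frames):                      # stand-in for a TensorRT engine
    return "boy playing soccer"         # per-stream description, as on the slide

def analyze(stream_id):
    return infer(preprocess(decode(stream_id)))

# Concurrent streams analyzed per server is the metric the chart compares.
with ThreadPoolExecutor(max_workers=8) as pool:
    labels = list(pool.map(analyze, range(16)))
print(f"{len(labels)} streams analyzed; e.g. {labels[0]!r}")
```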
17. PIONEERS ADOPTING HPC FOR DEEP LEARNING
"Investments in computer systems — and I think the bleeding edge of AI, and deep learning specifically, is shifting to HPC — can cut down the time to run an experiment from a week to a day and sometimes even faster."
— Dr. Andrew Ng, Chief Scientist, Baidu
18. END-TO-END DATA CENTER PRODUCT FAMILY
STRONG-SCALE HPC: data centers running HPC and DL apps scaling to multiple GPUs (Tesla P100 with NVLink)
MIXED-APPS HPC: HPC data centers running a mix of CPU and GPU workloads (Tesla P100 with PCI-E)
HYPERSCALE HPC: hyperscale deployment for deep learning training and inference (training: Tesla P100; inference: Tesla P40 & P4)
19. NVIDIA EXPERTISE AT EVERY STEP
Solution Architects: 1:1 support, network training setup, network optimization
Deep Learning Institute: certified expert instructors, worldwide workshops, onsite training, online courses
GTC Conferences: epicenter of industry leaders, global reach
Global Network of Partners: NVIDIA Partner Network, OEMs, startups
20. NVIDIA DEEP LEARNING PARTNERS
DL Frameworks | Graph Analytics | Data Management | Enterprise DL Services | Core Analytics Tech | Enterprises
21. MOST PERVASIVE HPC PLATFORM EVER BUILT
ACCESS ANYWHERE | BUY ANYWHERE | LEARN EVERYWHERE
240+ resellers worldwide
1,000 universities teaching CUDA
78 countries
300K CUDA developers