Nervana and the Future of Computing
1. Proprietary and confidential. Do not distribute.
26 April 2016
Arjun Bansal
Co-founder & VP Algorithms, Nervana
MAKING MACHINES SMARTER.™
AI on demand using Deep Learning
[Diagram: the Nervana Platform provides deep learning (DL) on demand for image classification, object localization, video indexing, text analysis, and machine translation.]
Image classification and video activity detection
Deep learning model:
• Trained on a public dataset1 of 13K videos in 100 categories
• Training was approximately 3 times faster than a competitive framework
• Can be extended to perform scene and object detection, action similarity labeling, video retrieval, and anomaly detection

Potential applications:
• Activity detection and monitoring for security
• Automatic editing of captured moments from video cameras
• Facial recognition and image-based retrieval
• Sense-and-avoid systems for autonomous driving
• Baggage screening at airports and other public venues

1: UCF101 dataset: http://crcv.ucf.edu/data/UCF101.php
Demo: https://www.youtube.com/watch?v=ydnpgUOpdBw
Question answering
Stories
Mary journeyed to Texas.
John went to Maryland.
Mary went to Iowa.
John travelled to Florida.
Question: Where is John located?
Answer: Florida
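The story/question/answer example above is a bAbI-style location-tracking task. A trained system learns this behavior from data; as a toy illustration only, the "last mentioned location wins" rule can be hard-coded (the function and verb list below are invented for this sketch, not Nervana's model):

```python
# Toy rule-based stand-in for the location QA task shown above.
import re

MOVE_VERBS = {"journeyed", "went", "travelled", "traveled", "moved"}

def answer(story, question):
    """Return the most recently mentioned location of the entity asked about."""
    locations = {}
    for sentence in story:
        words = sentence.rstrip(".").split()
        # Pattern: "<Name> <verb> to <Place>"
        if len(words) == 4 and words[1] in MOVE_VERBS and words[2] == "to":
            locations[words[0]] = words[3]
    m = re.match(r"Where is (\w+) located\?", question)
    return locations.get(m.group(1)) if m else None

story = ["Mary journeyed to Texas.", "John went to Maryland.",
         "Mary went to Iowa.", "John travelled to Florida."]
print(answer(story, "Where is John located?"))  # Florida
print(answer(story, "Where is Mary located?"))  # Iowa
```

A learned model generalizes past this rigid pattern; the point here is only to make the task's input/output contract concrete.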
Reinforcement learning
Pong, Breakout demos:
https://youtu.be/KkIf0Ok5GCE
https://youtu.be/0ZlgrQS3krg
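The Atari demos above use deep Q-networks (DQN). The underlying idea is Q-learning, which can be sketched in tabular form; the corridor environment below is invented for illustration (a DQN replaces the table with a neural network):

```python
# Tabular Q-learning on a toy 1-D corridor: states 0..4, reward at state 4.
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)                 # move left / move right
alpha, gamma, eps = 0.5, 0.9, 0.1  # learning rate, discount, exploration
random.seed(0)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def greedy(s):
    """Best action in state s, breaking ties randomly."""
    if Q[(s, -1)] == Q[(s, +1)]:
        return random.choice(ACTIONS)
    return -1 if Q[(s, -1)] > Q[(s, +1)] else +1

for episode in range(200):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS) if random.random() < eps else greedy(s)
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0
        # Bellman update toward r + gamma * max_a' Q(s', a')
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

policy = [greedy(s) for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1]: always move right, toward the reward
```

The learned Q-values decay geometrically with distance from the goal (roughly gamma to the power of the remaining steps), which is what makes the greedy policy point at the reward.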
Application areas
Healthcare Agriculture Finance
Online Services Automotive Energy
Nervana is building the future of computing
The Economist, March 12, 2016
[Diagram: Nervana at the intersection of cloud computing, custom ASICs, and deep learning / AI.]
nervana cloud
[Diagram: data of many kinds (images, text, tabular, speech, time series, video) flows through the cloud pipeline: import → train → build → deploy.]
nervana neon
• Fastest library
• Model support

Models: Convnet, RNN, LSTM, MLP, DQN, NTM
Domains: images, video, speech, text, time series
Running locally:
% python rnn.py # or neon rnn.yaml
Running in nervana cloud:
% ncloud submit --py rnn.py # or --yaml rnn.yaml
% ncloud show <model_id>
% ncloud list
% ncloud deploy <model_id>
% ncloud predict <model_id> <data> # or use the REST API
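For context on what a script like rnn.py would define: a recurrent model over sequence data. The following is a framework-agnostic numpy sketch of a vanilla RNN forward pass, not neon's actual API (all names and sizes here are illustrative):

```python
# Minimal vanilla-RNN forward pass: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, seq_len = 8, 16, 5

W_x = rng.normal(0.0, 0.1, (n_hidden, n_in))      # input-to-hidden weights
W_h = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # hidden-to-hidden weights
b = np.zeros(n_hidden)

def rnn_forward(xs):
    """Run the recurrence over a sequence, returning all hidden states."""
    h = np.zeros(n_hidden)
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(seq_len, n_in))
hs = rnn_forward(xs)
print(hs.shape)  # (5, 16): one hidden state per time step
```

In a real framework the same recurrence is expressed as a layer object, and training (backpropagation through time) is handled by the library.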
Backends
• CPU
• GPU
• Multiple GPUs
• Parameter server
• (Xeon Phi)
• nervana TPU
nervana neon
• Fastest library
• Model support
• Cloud integration
• Multiple backends
• Optimized at assembler level
nervana tensor processing unit (TPU)
• Unprecedented compute density: 1 nervana engine ≈ 10 GPUs ≈ 200 CPUs
[Diagram, "Memory near computation": the CPU block shows control, ALU, and a separate instruction-and-data memory; the Nervana block places data memory and control next to computation.]
nervana tensor processing unit (TPU)
• 10-100x gain
• Architecture optimized for:
  • Unprecedented compute density
  • Scalable distributed architecture
  • Memory near computation
  • Learning and inference
  • Exploiting limited precision
  • Power efficiency
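The "exploit limited precision" point can be made concrete with a small experiment: quantizing operands to an 8-bit fixed-point grid barely perturbs a large dot product, because independent rounding errors average out. This numpy sketch is an illustration of the general idea, not Nervana's actual number format:

```python
# Compare a float dot product against the same product with 8-bit operands.
import numpy as np

def quantize(x, bits=8):
    """Round to a signed fixed-point grid covering [-1, 1]."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits
    return np.clip(np.round(x * levels), -levels, levels) / levels

rng = np.random.default_rng(1)
a = rng.uniform(-1, 1, 1000)
b = rng.uniform(-1, 1, 1000)

exact = a @ b
approx = quantize(a) @ quantize(b)
# Error relative to the natural scale of the dot product.
err = abs(exact - approx) / (np.linalg.norm(a) * np.linalg.norm(b))
print(err < 1e-2)  # True: well under 1% of the operand scale
```

Hardware that commits to few-bit arithmetic can pack far more multiply-accumulate units into the same silicon area and power budget, which is the density and efficiency argument the bullets above make.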
Special purpose computation
1940s: Turing Bombe
Motivation: automating calculations, code breaking
General purpose computation
2000s: SoC
Motivation: reduce power and cost; fungible computing. Enabled inexpensive mobile devices.
Dennard scaling has ended
What business and technology constraints do we have now?
Many-core tiled architectures
[Excerpt from "Tile Processor Architecture Overview for the TILEPro Series": the iMesh interconnect provides high-bandwidth, extremely low-latency communication among tiles. The Tile Processor integrates external memory and I/O interfaces on chip and is a complete programmable multicore processor; external memory and I/O are connected to the tiles via the iMesh interconnect. Each tile is a full-featured computing system that can independently run an entire operating system, such as Linux; each implements a 32-bit integer processor engine using a three-way VLIW architecture with its own program counter, cache, and DMA subsystem, and can execute up to three operations per cycle. Figure 2-1 of that document shows the 64-core TILEPro64 with details of an individual tile's structure (switch, cache, and processor engines connected to the UDN, STN, MDN, IDN, TDN, and CDN networks).]
2010s: multi-core, GPGPU
Motivation: increased performance without clock-rate increases or smaller devices. Requires changes in programming paradigm.
Examples: Tilera TILEPro64, NVIDIA GM204, Intel Xeon Phi (Knights Landing)
FPGA architectures
Altera Arria 10
Motivation: fine-grained parallelism, reconfigurable, lots of I/O, scalable. Slow clock speed; lacks the compute density for machine learning.
Neuromorphic architectures
IBM TrueNorth
[Excerpt from the IBM TrueNorth paper: spikes are encoded into packets carrying the target core and axon address and routed across the on-chip mesh, tagged with their row (for spikes traveling east-west) or column (north-south) before being merged onto a shared link; TrueNorth's power density is 20 mW per cm², whereas that of a typical CPU is far higher.]
Neural network parallelism
[Diagram, data parallelism: data chunks 1…n feed processors 1…n, each holding the full deep network, with a parameter server coordinating the parameters; contrasted with model parallelism, where the network itself is split across processors.]
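The data-parallel scheme can be simulated serially: each "worker" computes a gradient on its own chunk, and the parameter-server step averages them. This toy linear-regression sketch is illustrative only; the worker loop would run in parallel on real hardware:

```python
# Data-parallel SGD simulated serially: full model copy per worker,
# gradients averaged by a "parameter server" each step.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                        # shared model parameters
X = rng.normal(size=(40, 3))           # toy regression data
y = X @ np.array([1.0, -2.0, 0.5])     # noiseless targets from a known model

def local_gradient(w, Xc, yc):
    """Mean-squared-error gradient on one worker's data chunk."""
    return 2 * Xc.T @ (Xc @ w - yc) / len(yc)

n_workers, lr = 4, 0.1
chunks = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

for step in range(200):
    grads = [local_gradient(w, Xc, yc) for Xc, yc in chunks]  # parallel in practice
    w -= lr * np.mean(grads, axis=0)                          # parameter-server update

print(np.allclose(w, [1.0, -2.0, 0.5], atol=1e-3))  # True: recovers the true weights
```

With equal-sized chunks, the average of per-chunk gradients equals the full-batch gradient, so the scheme converges exactly like serial gradient descent while the expensive gradient computation is distributed.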
Existing computing topologies are lacking

[Diagram, built up over several slides: a conventional deep-learning server pairs two CPUs with an SSD, InfiniBand and 10GbE links, and four GPUs behind PCIe switches; scaling out replicates these boxes, with inter-GPU communication constrained by PCIe and the network links.]
nervana compute topology

[Diagram: nervana engines ("n") interconnected directly with one another, with CPUs, SSDs, InfiniBand and 10GbE links attached via PCIe switches.]
Distributed linear algebra and convolution
SUMMA distributed matrix multiply, C = A*B (Jim Demmel, CS267 lecture notes):

SUMMA: n x n matmul on a P^(1/2) x P^(1/2) processor grid
• C[i,j] is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor P_ij
• A[i,k] is an n/P^(1/2) x b submatrix of A
• B[k,j] is a b x n/P^(1/2) submatrix of B
• C[i,j] = C[i,j] + Σ_k A[i,k]*B[k,j] (summation over submatrices)
• Need not be a square processor grid

From "Matrix multiplication on multidimensional torus networks" by Edgar Solomonik and James Demmel (Division of Computer Science, University of California at Berkeley): blocked matrix multiplication algorithms such as Cannon's algorithm and SUMMA have a 2-dimensional communication structure. The paper introduces a generalized "Split-Dimensional" version of Cannon's algorithm (SD-Cannon) with a higher-dimensional and bidirectional communication structure, useful for higher-dimensional torus interconnects that can achieve more injection bandwidth than single-link bandwidth.
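The SUMMA block structure can be simulated serially in numpy, with the broadcast rounds replaced by direct block reads (the grid size and helper names below are illustrative, and block size b is taken equal to n/p for simplicity):

```python
# SUMMA simulated on a p x p "processor grid": in round k, the owner of
# A[i,k] broadcasts along row i, the owner of B[k,j] broadcasts along
# column j, and every processor (i, j) accumulates C[i,j] += A[i,k] @ B[k,j].
import numpy as np

def summa(A, B, p=2):
    """Block outer-product matmul on a simulated p x p processor grid."""
    n = A.shape[0]
    b = n // p
    C = np.zeros((n, n))
    blk = lambda M, i, j: M[i*b:(i+1)*b, j*b:(j+1)*b]
    for k in range(p):                     # one broadcast round per block index
        for i in range(p):
            for j in range(p):
                # processor (i, j) has received A[i,k] and B[k,j]
                C[i*b:(i+1)*b, j*b:(j+1)*b] += blk(A, i, k) @ blk(B, k, j)
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
print(np.allclose(summa(A, B), A @ B))  # True
```

Each processor only ever holds O((n/p)^2) data per round, which is why the algorithm maps well onto mesh and torus interconnects like the Nervana topology above.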
Summary
• Computers are tools for solving the problems of their time
• Was: coding, calculation, graphics, web
• Today: learning and inference on data
• Deep learning as a computational paradigm
• Custom architectures can do vastly better