A Platform for Accelerating Machine Learning Applications

TAIPEI | SEP. 21-22, 2016
Robert Sheen
HPE APJeC Principal Solution Architect
Sep 21, 2016

2
WHAT CONFUSION! ARTIFICIAL INTELLIGENCE …
MACHINE LEARNING … NEURAL NETWORKS … DEEP LEARNING

3
A QUICK INTRODUCTION TO (DEEP) NEURAL NETWORKS
The (artificial) neuron.
Artificial Neural Networks (ANNs) are inspired by biological systems similar to our brain.
[Figure: a single artificial neuron. Inputs x_0 .. x_4 are multiplied by weights, combined with the bias (bias = threshold), and passed through f(z) to give the output y_1.]
$z_j^l = \sum_k w_{jk}^l x_k - b_j^l$
$a_j^l = f(z_j^l)$
NNs are made up of neurons, which are a mathematical approximation to biological neurons.
Activation functions:
Logistic (sigma): $a(z) = \frac{1}{1 + e^{-z}}$
Hyperbolic tangent: $a(z) = \tanh(z)$, ranging from -1 to +1
ReLU: $a(z) = \max(0, z)$
Softplus: $a(z) = \ln(1 + e^{z})$
where $z = \sum_j w_j x_j - b$
In a typical neuron the inputs ($x_k$) are multiplied by weights ($w_{jk}^l$) and then summed ($\sum_k w_{jk}^l x_k$).
A non-linear activation function, $f$, is applied to the summed and "thresholded" output ($z_j^l$).
This activation, $f(z_j^l)$, is the output of the neuron.
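The neuron described above can be sketched in a few lines of Python. This is only an illustrative sketch: the input values, weights, and bias are made-up numbers, and the function names are hypothetical rather than taken from any library. It follows the slide's convention of subtracting the bias as a threshold.

```python
import math

def logistic(z):
    """Logistic (sigma) activation: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

def softplus(z):
    """Softplus activation: ln(1 + e^z)."""
    return math.log(1.0 + math.exp(z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs, thresholded by the bias, then activated."""
    z = sum(w * x for w, x in zip(weights, inputs)) - bias
    return activation(z)

# Hypothetical example values, chosen only for illustration.
x = [1.0, 0.5, -1.0]
w = [0.2, -0.4, 0.1]
b = 0.1

for name, f in [("logistic", logistic), ("tanh", math.tanh),
                ("relu", relu), ("softplus", softplus)]:
    print(name, neuron(x, w, b, f))
```

With these values the pre-activation z is negative, so ReLU outputs zero while the other three activations return small non-zero values.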

4
To solve useful problems we have to connect multiple neurons together. The output from a neuron in one layer becomes
the input to neurons in the next layer.
Notice that the arrows go in one direction only. We will only be discussing “feed-forward” networks. There are others.
What is deep learning? It is essentially artificial neural networks consisting of many (>1) layers and a large number
of neurons (units). This is very computationally intensive and uses mathematical techniques typical of high
performance computing (matrix-matrix multiplies, vector operations, FFTs, convolutions) and requires HPC
hardware.
Training deep networks requires
high performance computing
hardware and techniques.
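As a rough sketch of how the output of one layer becomes the input of the next, the following Python chains two small layers together. The 3-2-1 network shape, the weights, and the choice of tanh are assumptions made for the example; in practice each layer is computed as a matrix operation on HPC hardware rather than with Python loops.

```python
import math

def layer_forward(weights, biases, inputs, activation):
    """Compute one layer. Each row of `weights` holds one neuron's incoming weights."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) - b)
        for row, b in zip(weights, biases)
    ]

def feed_forward(layers, inputs):
    """Pass the inputs through every layer in order (one direction only)."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(weights, biases, activations, math.tanh)
    return activations

# Hypothetical 3-input, 2-hidden-neuron, 1-output network.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                               # output layer
]

output = feed_forward(layers, [1.0, 0.0, -1.0])
print(output)
```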

5
What do neural networks do?
They classify
- E.g., given an image, is it a bird? Is it a cat? Is it Stephen Fleischman?
- Given an audio signal, what are the words? What do they mean?
- This requires a training data set with inputs and their classes.
- This is supervised learning and what we will focus on.
They cluster
- Find groups of similar things.
- Does not require classified training sets.
- This is unsupervised learning.
- It is often used together with supervised learning.
[Figure: MNIST handwriting recognition data set for digits. Classify each image as 0 .. 9.]

6
Testing Performance
The ImageNet dataset and benchmark
• The most important networks that have solved the ImageNet challenge over the years are benchmarked.
• Some of them are:
– AlexNet (the original!)
– VGG_A
– OverFeat
– Inception V1 (and now Inception V3!) (from Google)
• The ImageNet dataset is a database of around 1.2 million annotated images.
• The challenge is to train the neural network using a subset of the database and then attempt to classify all the images in the dataset.
• The industry-standard metric is training throughput: the number of images per second that we can train.
• Training time is the forward plus back propagation time of the network.
• Every year various teams compete to classify the ImageNet dataset in the "ImageNet Large Scale Visual Recognition Challenge" (ILSVRC). The network that has the greatest accuracy wins.
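The images-per-second metric follows directly from the forward and back propagation times. The sketch below shows the arithmetic; the batch size and timings are invented for illustration and are not measurements from any system, and the function names are hypothetical.

```python
def images_per_second(batch_size, forward_seconds, backward_seconds):
    """Training throughput: images in one batch divided by forward + backward time."""
    return batch_size / (forward_seconds + backward_seconds)

def epoch_hours(dataset_size, throughput):
    """Hours needed for one pass over the data set at the given throughput."""
    return dataset_size / throughput / 3600.0

# Hypothetical timings for one batch of 128 images.
rate = images_per_second(128, forward_seconds=0.125, backward_seconds=0.375)
print(rate, "images/s")
print(epoch_hours(1_200_000, rate), "hours per pass over 1.2 million images")
```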

7
Testing Performance
The ImageNet dataset and benchmark
• The most important networks that have solved the ImageNet challenge over the years are benchmarked.
• The classification accuracy has been improving year on year, so much so that it is now better than human performance!
[Figure: ImageNet classification results by year. Lower is better.]

8
How does a Neural Network work?
A quick introduction to (Deep) Neural Networks
Computers have to be explicitly programmed
– Analyze the problem to be solved.
– Write the code in a programming language.
– Deductive reasoning
– Instruction and PC
Neural networks learn from examples
– No requirement of an explicit description of the problem.
– The neural computer adapts itself during a training period, based on examples of similar problems.
– Able to generalize or to handle incomplete data.
– Inductive reasoning
– Works well with "natural" data (such as speech and images).

9
Why is Deep Learning High Performance Computing?
DNNs are compute intensive, and training for a typical DNN application runs for weeks even on modern hardware.
The work maps to HPC primitives such as BLAS SGEMM, max/min searches, matrix inversions and FFTs.
These primitives are easily mapped to accelerators, so these applications become a natural target for HPC platforms.
Analysis shows that about 80% of the time is spent in convolutions, which are basically SGEMM computations.
Recent developments in learning models have enhanced parallelism through both data and model parallelism.
Recent advances in NVIDIA libraries support multiple GPUs (1 to 8) in a single node.
Training is known to scale well with scale-out configurations too.
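To make the "convolutions are basically SGEMM" point concrete, here is a minimal pure-Python sketch of the common im2col approach: every image patch is unrolled into a row, the filters become columns, and a single matrix multiply yields all outputs. This is one well-known way to lower convolution to a matrix product, not necessarily what any particular library does internally. The image, the filters, and all function names are illustrative assumptions, and the naive `matmul` only stands in for a real BLAS SGEMM call.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2-D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(k) for j in range(k)]
        for r in range(h - k + 1)
        for c in range(w - k + 1)
    ]

def matmul(a, b):
    """Naive matrix product; a stand-in for a BLAS SGEMM call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def conv2d_direct(image, kernel):
    """Reference result: sliding-window cross-correlation with nested loops."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [
        [sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
         for c in range(w - k + 1)]
        for r in range(h - k + 1)
    ]

# Hypothetical 4 x 4 image and two 2 x 2 filters.
image = [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 2], [1, 0, 1, 3]]
filters = [[[1, 0], [0, -1]], [[0, 1], [1, 0]]]

patches = im2col(image, 2)                      # 9 patches x 4 values
filter_matrix = [[f[i][j] for f in filters]     # 4 values x 2 filters
                 for i in range(2) for j in range(2)]
result = matmul(patches, filter_matrix)         # 9 outputs x 2 filters

print(result)
```

Each column of `result`, read row by row, matches the direct sliding-window result for the corresponding filter, so all filters are applied with one matrix multiply.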

10
Challenges in training deep neural networks
– Slow convergence with millions of weights / parameters.
– Activations saturate or explode.
  – Depends on the function, but the result is that weights going into that neuron stop training.
– Vanishing gradient problem.
  – Result of how we optimize the weights.
– Overfitting (or overtraining)
  – With so many parameters, you can easily fit the training data but then be completely unable to generalize.
– Achieving scalability in training is crucial, but hard to do on more than one GPU.
For each of these challenges there are methods to ameliorate them. Which ones apply depends on the problem and on the choices that you make in the activation function, the cost function, the number of layers, the number of neurons, the types of layers, etc.
These are the hyper-parameters of the neural network model, and choosing them is currently 1) an art as much as a science, 2) an active area of research, and 3) a major factor in sizing the hardware for deep learning.

11
Getting training to scale
– Model parallelism
  – Split the model (neural network) across GPUs and servers.
  – Parallelizes well on a single GPU.
  – Up to 8 GPUs currently, though there are some claims of better efficiency (Baidu).
  – Multiple servers are a problem.
– Data parallelism
  – Gather-scatter (SXM2)
    – Split the training set across processing units and gather the updates. Requires peer-to-peer communication.
  – Parameter servers (master-slave)
    – Traditional manager/worker parallelism. Use the CPU to gather and dispatch the data; it is not being used for much anyway. Need to store the entire model on the GPU, but no peer-to-peer communication is required.
– Hyper-parameters
  – Figuring out the number of layers, number of neurons and training momentum can be done in parallel.
– Consensus
  – Can have multiple neural networks training on the same data with different models and have them vote or otherwise combine their weights.
  – Potentially more suitable for clusters of servers.
– Inference: run it in parallel if you replicate the model.
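The data-parallel scheme above (split the training set, compute updates independently, gather them) can be sketched with a one-parameter model. Everything here is a simplified assumption for illustration: the "workers" are plain loop iterations rather than GPUs, the model is a single weight fitted to y = 3x, and the function names are hypothetical.

```python
def gradient(w, batch):
    """Gradient of the summed squared error for the model y ~ w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)

def split(data, num_workers):
    """Scatter: give each worker an interleaved shard of the training set."""
    return [data[i::num_workers] for i in range(num_workers)]

def data_parallel_step(w, data, num_workers, learning_rate):
    shards = split(data, num_workers)
    partials = [gradient(w, shard) for shard in shards]  # independent per worker
    total = sum(partials)                                # gather the updates
    return w - learning_rate * total / len(data)

# Hypothetical training set generated from y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4, learning_rate=0.01)
print(w)
```

Because the gradient is a sum over training examples, the gathered partial gradients equal the gradient over the full set, so the result does not depend on the number of workers in this sketch.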

12
What is CogX?
• Domain-specific embedded language with associated optimizing compiler and runtime
• Array programming language embedded in a state machine execution model
• Targets advanced analytics workloads on massively parallel distributed systems
• Design Goals
– Optimal deployment on parallel hardware
– Fast design iterations
– Enforce scalability
– Broad COTS hardware support
– Compatible with shared infrastructure
– High productivity for analysts and algorithm engineers

13
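Compute graph
Opportunities for optimization
[Figure: compute graph for a ColorMovie input. movie_t is scaled by 0.001f and background_t by 0.999f; their sum is nextBackground_t, which becomes background_{t+1}. The difference between background_t and movie_t passes through abs and reduceSum to give suspicious_t.]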
14
Compute graph
[Figure: the same compute graph, with each operator (*0.001f, *0.999f, +, -, abs, reduceSum) outlined as its own device kernel.]
Initially: 6 separate device kernels.

15
Compute graph
[Figure: the same compute graph after fusion, with the operators grouped into two device kernels.]
After a "single-output" kernel fuser pass: 2 device kernels remain.

16
Compute graph
[Figure: the same compute graph after further fusion, with all operators grouped into one device kernel.]
After a "multi-output" kernel fuser pass: only a single device kernel remains.

17
CogX compiler: translating CogX to OpenCL with kernel fusion
[Figure: compiler pipeline. User CogX model (Scala) -> parsing and OpenCL code generation -> kernel circuit (kernels, field bufs) -> optimizations, including kernel fusion -> optimized kernel circuit (merged kernels).]
CogX code snippet
val A = ScalarField(10,10)
val B = ScalarField(10,10)
val C = A * B
val D = ScalarField(10,10)
val E = C + D
[Figure: before fusion, an OpenCL multiply kernel (inputs A and B, output C) feeds an OpenCL add kernel (inputs C and D, output E). After fusion, a single fused OpenCL multiply/add kernel takes A, B and D and produces E.]
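The effect of kernel fusion on the snippet above can be mimicked in plain Python. This is only an analogy under stated assumptions: the "kernels" here are ordinary functions over flat lists (hypothetical names, not CogX or OpenCL APIs), and the point is that the fused version makes one pass and never materializes the intermediate buffer C.

```python
def multiply_kernel(a, b):
    """Stand-in for the separate multiply kernel: C = A * B."""
    return [x * y for x, y in zip(a, b)]

def add_kernel(c, d):
    """Stand-in for the separate add kernel: E = C + D."""
    return [x + y for x, y in zip(c, d)]

def unfused(a, b, d):
    c = multiply_kernel(a, b)  # first pass writes the intermediate buffer C
    return add_kernel(c, d)    # second pass reads C back

def fused_multiply_add(a, b, d):
    """Fused kernel: a single pass computing E[i] = A[i] * B[i] + D[i]."""
    return [x * y + z for x, y, z in zip(a, b, d)]

# Hypothetical field contents, flattened to lists.
A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
D = [10, 20, 30, 40]

print(unfused(A, B, D))
print(fused_multiply_add(A, B, D))
```

Both versions return the same values. On a GPU the fused form saves a kernel launch and a round trip through memory for C.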

18
CogX core functions and operators
• Basic operators: +, -, *, /, %
• Logical operators: >, >=, <, <=, ===, !===
• Pointwise functions: cos, cosh, acos, sin, sinh, asin, tan, tanh, atan2, sq, sqrt, log, signum, pow, reciprocal, exp, abs, floor
• Comparison functions: max, min
• Shape manipulation: flip, shift, shiftCyclic, transpose, subfield, expand, select, stack, matrixRow, reshape, subfields, trim, vectorElement, vectorElements, transposeMatrices, transposeVectors, replicate, slice
• FFT/DCT: fft, fftInverse, fftRI, fftInverseRI, fftRows, fftInverseRows, fftColumns, fftInverseColumns, dct, dctInverse, dctTransposed, dctInverseTransposed
• Complex numbers: phase, magnitude, conjugate, realPart, imaginaryPart
• Convolution-like: crossCorrelate, crossCorrelateSeparable, convolve, convolveSeparable, projectFrame, backProjectFrame, crossCorrelateFilterAdjoint, convolveFilterAdjoint
• Gradient/divergence: backwardDivergence, backwardGradient, centralGradient, forwardGradient
• Linear algebra: dot, crossDot, reverseCrossDot
• Debugging: probe
• Type coercion: toScalarField, toVectorField, toMatrixField, toComplexField, toComplexVectorField, toColorField, toGenericComplexField
• Type construction: complex, polarComplex, vectorField, complexVectorField, matrixField, colorField
• Reductions: reduceSum, blockReduceSum, reduceMin, blockReduceMin, reduceMax, blockReduceMax, fieldReduceMax, fieldReduceMin, fieldReduceSum, fieldReduceMedian
• Normalizations: normalizeL1, normalizeL2
• Resampling: supersample, downsample, upsample
• Special operators: winnerTakeAll, random, solve, transform, warp, <==

19
CogX toolkit functions
• Computer Vision
– Annotation tools
– Color space transformations
– Polynomial dense optic flow
– Segmentation
• Solvers
– Boundary-gated nonlinear diffusion
– FISTA solver (with sub-variants)
– Golden section solver
– Incremental k-means implementation
– LSQR solver (with sub-variants)
– Poisson solver (with sub-variants)
• Filtering
– Contourlets
– 4 frequency-domain filters
– Mathematical morphology operators
– 27 space-domain filters (from a simple box filter up to local polynomial expansion and steerable Gabor filters)
– Steerable pyramid filter
– Wavelets
– Variants of whitening transforms
– Contrast normalization
– Domain transfer filter
– Gaussian pyramid
– Monogenic phase congruency
• Dynamical Systems
– Kalman filter
– Linear system modeling support
– CPU matrix pseudo-inverse
• Statistics
– Normal and uniform distributions
– Histograms
– Moment calculations
– Pseudo-random number generator sensors

20
HPE Cognitive Computing Toolkit
[Figure: software stack, from top to bottom.]
• Application
• CogX libraries/toolkit: Neural network toolkit, Sandbox toolkit, I/O toolkit
• CogX core: CogX debugger, CogX compiler and standard library, Scala CogX runtime, C++ CogX runtime, Cluster package
• External libraries: HDF5 loader, JOCL, HDF5, OpenCL, Apache Mesos
Applications are written by users
– Introductory and training examples for single-GPU and distributed computation
– Performance benchmarks covering the core and neural network package
– Several larger-scale demo applications integrating multiple CogX functions
http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=S6772&searchItems=&sessionTopic=&sessionEvent=&sessionYear=&sessionFormat=&submit=&select=

21
SOME MACHINE LEARNING APPLICATIONS

22
Deep Learning Use Cases
The better-known, well-publicized implementations:
Games
Chat bots (Cortana, Suri, Jarvis, etc.)
Intelligent Assistants (Siri, Alexa, etc.)
Self-driving cars
... But what about "enterprise-class" use cases?

23
AI in the Enterprise
Deep Learning and Neural Networks for the mainstream?
Finance: AI-assisted trading, beyond current algorithmic trading. Rise of "AI Hedge Funds".
Medicine: Healthcare institutions use AI-assisted diagnosis and recommendations to reduce human error.
E-Commerce: Agents and chatbots provide product recommendations and "interact" with potential shoppers.
Security: Beyond facial recognition, understand the "context" of danger and flag security threats.

24
AI in the Enterprise
Deep Learning and Neural Networks for the mainstream?
Social networking
– Yann LeCun was hired by Facebook, Geoff Hinton by Google and Andrew Ng by Baidu.
– Sentiment analysis.
– Facial recognition.
– Understanding text.
Geospatial
– Image recognition.
– High spatial resolution remote-sensing (HSR-RS) image scene classification (BoVWs).
Oil and Gas
– Channel sands identification.
– Other seismic analysis.

25
Self-driving cars
Deep neural networks are being used to understand the scene in self-driving cars!

The 4 Stage IoT Solutions Architecture:
The "Things": primarily analog data sources (devices, machines, people, tools, cars, animals, clothes, toys, environment, buildings, etc.)
Stage 1: Sensors/Actuators (wired, wireless)
Stage 2: Internet Gateways, Data Acquisition Systems (data aggregation, A/D, measurement, control)
Stage 3: Edge IT (analytics, pre-processing)
Stage 4: Data Center / Cloud (analytics, management, archive)
[Figure: the four stages laid out from The Edge to the data center, with Data Flow, Control Flow and Visualization marked, and SW Stacks (Analytics, Management, Control) shown three times.]

27
Deliver automated intelligence in real-time
Unprecedented performance and scale with HPE Apollo 6500 high density GPU solution
Enable workplace productivity | Empower a data-driven organization | Transform to a hybrid infrastructure | Protect your digital enterprise
Automated Intelligence delivered by HPE Apollo 6500 and Deep Learning software solutions
Use Cases
– Video, image, text, audio, time series pattern recognition
– Large, highly complex, unstructured simulation and modeling
– Real-time, near real-time analytics
Customer benefits
– Faster model training time, better fusion of data*
HPE Apollo 6500 is an ideal HPC and Deep Learning platform providing unprecedented performance with 8 GPUs, high bandwidth fabric and a configurable GPU topology to match deep learning workloads
– Up to 8 high powered GPUs per tray (node), 2P Intel E5-2600 v4 support
– Choice of high-speed, low latency fabrics with 2x IO expansion
– Workload optimized using flexible configuration capabilities
* Benchmarking results provided at or shortly after announcement

28
HPE Apollo platforms and solutions are optimized for HPC, IoT and Big Data
Platforms
– Apollo 8000: Supercomputing
– Apollo 6000: Rack Scale HPC
– Apollo 4000: Server Solutions Purpose Built for Big Data
– Apollo 2000: Enterprise Bridge to Scale-Out Compute
– Moonshot*: Optimized for Next Gen Workloads
Workloads
– HPC Workloads: Oil and gas, Life Sciences, Financial Services, Manufacturing CAD/CAE, Academia
– Big Data Workloads: Object Storage, Data Analytics
– Next Gen Workloads: Video encoding, Mobile workplace, IoT
Solutions/ISVs
– Mellanox, NVIDIA, Seagate
– Scality, Cleversafe, Ceph, Hortonworks, Hadoop, Cloudera
– Schlumberger, Paradigm, Halliburton, Gaussian, BIOVIA, Redline, Synopsys, ANSYS, Custom Apps
HPE Software (i.e. Vertica, HPE Haven), HPE Enterprise Services

29
HP APOLLO 6000 POWER SHELF
Pooled Power Efficiency
• External pooled power shelf
• Fits up to 6 power supplies
• 2400W or 2650W power supplies
• Up to 15.9kW non-redundant
• Single- or three-phase AC input
• Up to twelve 12V DC cables
[Figure: front and back views of the power shelf.]
Dimensions: 1.5U (H) x 44.81 cm (W) x 78.44 cm (D)
Dimensions: 1.5U (H) x 17.64 in (W) x 30.88 in (D)

30
HPE Apollo 6500 solution innovation
System Design Innovation to maximize GPU capacity and performance with lower TCO
New technologies, products
– HPE Apollo 6500
  – Dense GPU server optimized for Deep Learning and HPC workloads
  – Density optimization
  – High performance fabrics
– Cluster Management Enhancements (massive scaling, open APIs, tight integration, multiple user interfaces)
– Deep Learning, HPC Software platform Enablement (HPE CCTK, Caffe, CUDA, Google TensorFlow, HPE IDOL)
Unique solution differentiators
– GPU density
– Configurable GPU topologies
– More network bandwidth
– Power and cooling optimization
– Manageability
– Better productivity

31
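Apollo 2000 + NVIDIA GPU promotion
Limited-time, limited-quantity bundle offers
Option 1: top choice for enterprise virtualization
HPE Apollo 2000/XL190r 1 node + NVIDIA Tesla M60 *1
Apollo r2200 12LFF or r2600 24SFF
XL190r Gen9 specification: E5-2640v4*2/ 16GB*2/ 1TB*1/ 800W/ 3yr Fndn Care 24*7 service, NVIDIA Tesla M60 Dual GPU*1
From NT$360,000 (excluding tax)
Option 2: top choice for high-performance computing
HPE Apollo 2000/XL190r 1 node + NVIDIA Tesla K80 *1
Apollo r2200 12LFF or r2600 24SFF
XL190r Gen9 specification: E5-2640v4*2/ 16GB*2/ 1TB*1/ 800W/ 3yr Fndn Care 24*7 service, NVIDIA Tesla K80 Dual GPU*1
From NT$360,000 (excluding tax)
The strongest combination
The most density-optimized HPE server plus NVIDIA GPUs gives you the most powerful combination.
A single 2U chassis can scale up to 2 HPE Apollo system servers and 4 NVIDIA high-performance computing accelerator cards.
※ Promotion ends 2016/12/31. If you are interested in the product, please call (02)2652-4040. This number is for use within Taiwan only.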

THANK YOU
