Deep learning: Cutting through the Myths and Hype

Deep Learning: Cutting through
the Myths and the Hype
Siby Jose Plathottam
Postdoctoral Appointee, Energy Systems
splathottam@anl.gov

Outline
• AI vs Machine Learning vs Deep learning
• Why all the hype?
• Deep Learning basics
• Data and compute
• Addressing interpretability
• Research tracks
[Figure: Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL)]

AI vs Machine Learning vs Deep learning
What is Artificial Intelligence?
“... making a machine behave in ways that would be called intelligent if a human
were so behaving.” McCarthy, Minsky et al. (Dartmouth Conference, 1956)
What is Machine Learning?
“…seeking to provide knowledge to computers through data, observations and
interacting with the world. That acquired knowledge allows computers to correctly
generalize to new settings.” Yoshua Bengio
What is Deep Learning?
“A sub-field of ML which uses the artificial neuron as the basic computing model.”
(my own definition)

Why all the hype for Deep Learning?
Examples of intelligent behavior | Examples | Deep learning solutions | Breakthrough year
Visual perception | Image recognition | AlexNet, ResNet, NASNet | 2012
Visual perception | Object detection | YOLO, R-CNN, SSD | 2015
Natural language processing | Speech recognition/synthesis | Google Assistant | 2011/2016
Natural language processing | Language translation | Neural Machine Translation | 2015
Game playing | Board games | AlphaGo, AlphaZero | 2015/2016
Game playing | Strategy computer games | OpenAI Five, AlphaStar | 2018
Medical diagnostics | Retinal diagnosis | U-Net (DeepMind) | 2018
Medical diagnostics | Cancer detection | LYmph Node Assistant | 2018
Scientific Discovery | Protein folding | AlphaFold | 2018
Creativity | Image synthesis | StyleGAN, BigGAN | 2015/2018

The breakthroughs behind the hype
Image recognition: ILSVRC top-5 error rate
[Figure: ILSVRC top-5 error rate by year, with the first use of deep learning marked]
AlexNet (2012): 15.4%
ResNet (2015): 3.5%
SENet (2017): 2.25%
Human error: 5.1%
Figure reference: image-net.org
Speech recognition: SWBD-1 word error rate
[Figure: SWBD-1 word error rate (WER), y-axis 0 to 50, x-axis 1995 to 2019. Data labels: 48, 38.8, 19.3, 19.8, 19.8, 16.1, 9.9, 5.1. Annotations: WER levels out in mid 2000s; first use of deep learning; MSFT human parity: system combination + LSTM]
Figure reference: ‘The deep learning revolution in automatic speech recognition’ Ananth, Sankar, ODSC India 2018

The breakthroughs behind the hype (cont…)
[Figure: Comparing Elo ratings for Go, Chess, and Shogi (2017). AlphaZero vs. previous best (AlphaGo Zero, StockFish, Elmo); y-axis 0 to 6000]
Data reference: A general reinforcement learning algorithm that masters chess, shogi and Go through self-play, Silver et al., Science, 2018
Autonomous Vehicles
(which use Deep learning for perception and planning)
[Figure: Real world miles (L3/L4)]
Figure reference: Waymo’s fleet reaches 4 million self-driven miles, Waymo Team, Medium, 2017
[Figure: Real world miles (L1/L2)]
Figure reference: Tesla Autopilot Miles, Lex Fridman, MIT HCAI, 2019

The mathematics behind the hype:
Universal approximation theorem*
What we know: A neural network can approximate any continuous function on a bounded
input range to any desired precision using a finite number of hidden units.
What we don’t know: The optimal way to compute the neural network
parameters (weights & biases) for the function we want to approximate.
What we mostly use now: Backpropagation with gradient descent
*Reference: Approximation by superpositions of a sigmoidal function, G. Cybenko, MCSS, 1989
Forward pass to compute network output
Generate an error term by comparing network output with a teaching signal:
$error = \frac{1}{N} \sum_{i=1}^{N} (t_i - y_i)^2$
where $t_i$ is the teaching signal and $y_i$ is the network output
Use error to compute and apply gradients
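The forward pass, error term, and gradient update listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the method used in the talk: the target function sin(x), the tanh hidden layer, the layer size, and the learning rate are all assumptions chosen to keep the example small.

```python
import numpy as np

def train_tiny_network(steps=3000, lr=0.05, hidden=8, seed=0):
    """Fit t = sin(x) with one tanh hidden layer using backpropagation.

    All sizes and hyperparameters here are illustrative assumptions.
    Returns the mean squared error recorded at every step.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(-np.pi, np.pi, 64).reshape(-1, 1)  # network inputs
    t = np.sin(x)                                      # teaching signal

    W1 = rng.normal(0.0, 0.5, size=(1, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, size=(hidden, 1))
    b2 = np.zeros(1)

    n = x.shape[0]
    losses = []
    for _ in range(steps):
        # Forward pass: compute the network output y.
        h = np.tanh(x @ W1 + b1)
        y = h @ W2 + b2

        # Error term: mean squared difference between output and teaching signal.
        err = y - t
        losses.append(float(np.mean(err ** 2)))

        # Backward pass: apply the chain rule layer by layer.
        dy = 2.0 * err / n
        dW2 = h.T @ dy
        db2 = dy.sum(axis=0)
        dz = (dy @ W2.T) * (1.0 - h ** 2)  # derivative of tanh is 1 - tanh^2
        dW1 = x.T @ dz
        db1 = dz.sum(axis=0)

        # Gradient descent update.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return losses
```

Calling `train_tiny_network()` returns the loss history; plotting it shows the error shrinking as the gradients are applied.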

Deep Learning basics: Training vs Inference
Pipeline: Raw Data → Preprocessing → Create Model → Training → Inference
Training: Input Data + Target Data (e.g. “Homer”)
Inference: Input Data → Prediction (e.g. Class ‘Homer’: 90%, Class ‘Marge’: 10%)
Annotations on the pipeline:
• These take most time
• These take least time. Real-time inference is possible for many applications

Deep Learning basics: Building Deep Neural Networks
Directed acyclic graphs of desired complexity from a few core units.

Unit → Layers → Groups of layers (Cells) → Models
Unit: e.g. feedforward neuron with weights $w_1, w_2, \dots, w_n$ and activation $f$
(with Sigmoid activation) $y = \frac{1}{1 + e^{-\left(\sum_{i=1}^{n} w_i X_i + b\right)}}$
(with ReLU activation) $y = \max\left(0, \sum_{i=1}^{n} w_i X_i + b\right)$
Layers: Densely connected, Convolutional
Cells and models: MLP/CNN, AE, Residual Network, VAE, GAN
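The unit and layer building blocks on this slide can be written directly in NumPy. The sketch below is illustrative only; the function names and the example weights are assumptions, not taken from the talk.

```python
import numpy as np

def sigmoid_neuron(X, w, b):
    """Single unit with sigmoid activation: 1 / (1 + exp(-(sum_i w_i X_i + b)))."""
    z = np.dot(w, X) + b
    return 1.0 / (1.0 + np.exp(-z))

def relu_neuron(X, w, b):
    """Single unit with ReLU activation: max(0, sum_i w_i X_i + b)."""
    return np.maximum(0.0, np.dot(w, X) + b)

def dense_layer(X, W, b, activation):
    """Densely connected layer: each row of W holds the weights of one unit."""
    return activation(W @ X + b)

def relu(z):
    return np.maximum(0.0, z)

# Example: a layer of 3 units applied to 2 input features.
X = np.array([1.0, 2.0])
W = np.array([[0.5, -0.25],
              [1.0, 1.0],
              [-1.0, 0.0]])
b = np.zeros(3)
layer_output = dense_layer(X, W, b, relu)
```

Stacking several calls to `dense_layer`, each feeding the next, gives a multilayer perceptron.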

Deep Learning basics: All three learning problems
Supervised
• Classification
• Regression
Unsupervised
• Deep generative models
• Segmentation
Reinforcement
• DeepRL
• World Models

What motivated me to pursue Deep Learning?
Successful applications.
Approximating optimal control trajectories with a single layer neural network.
[Figure: neural network with a single hidden layer consisting of 3 tanh(x) units. Inputs (after data normalization): Fdr_f, ωr_f, Fdr_0, ωr_0, Fdr, ωr, t. Output labels: y1, y2, y3, Iqs_opt, Ids_opt. Weights: w111 to w117, w211 to w232. Biases: b11 to b13, b21, b22]
Analytical form:
[Equations: closed-form expressions for i_ds(t) and i_qs_opt(t) in terms of x(t)]
However, to find x, we need to find the roots of the following equation:
[Equation: derivative of the total loss E with respect to x, set equal to 0]

Deep Learning Challenges
Data
• We may not have enough of it!
• Available data may be biased.
Interpretability
• Disparity in what we believe/want the neural network
to observe versus what it actually observes.
• Unintended emergent behavior
Computing requirements
• Millions of parameters need to be updated at each
iteration.

Data requirements: Why does it need so much?
Parameters in production models: 1 M to 50 M
Three possibilities when training an ML model
• Underfitting: This is easily solvable in Deep Learning due to the universal approximation theorem (UAT).
• Optimal fitting: This is what we need.
• Overfitting: Deep Neural Networks are prone to overfitting!
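The three fitting regimes can be reproduced with a toy experiment. The sketch below swaps in polynomial regression for a neural network, purely because it is small and deterministic; the target function, noise level, and polynomial degrees are assumptions made for illustration.

```python
import numpy as np

def fit_errors(degrees=(1, 3, 9), n_train=12, noise=0.1, seed=0):
    """Fit polynomials of increasing capacity to a few noisy samples.

    Returns {degree: (train_mse, test_mse)}. Polynomial degree stands in
    for model capacity (number of parameters).
    """
    rng = np.random.default_rng(seed)

    def target(x):
        return np.sin(np.pi * x)

    x_train = np.linspace(-1.0, 1.0, n_train)
    y_train = target(x_train) + rng.normal(0.0, noise, n_train)
    x_test = np.linspace(-0.95, 0.95, 50)   # unseen points, noise-free
    y_test = target(x_test)

    results = {}
    for d in degrees:
        coeffs = np.polyfit(x_train, y_train, d)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        results[d] = (float(train_mse), float(test_mse))
    return results

for degree, (train_mse, test_mse) in fit_errors().items():
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Training error can only fall as capacity grows, so a low training error alone says nothing about overfitting; the gap between training and test error is the quantity to watch.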

Data requirements: Biased data is the real villain
Untrained deep learning models can’t be pre-conditioned towards
the task they have to learn.
If the data is biased towards a particular outcome, the predictions
made by the model will be biased towards that outcome.

Data requirements: Solutions
• Transfer learning: Networks trained for a particular task (e.g.
image recognition) can be retrained for a similar task (e.g. object
detection) with less data.
• Data augmentation: Apply transformations to the original dataset to
generate additional samples.
• Regularization: Constrain the neural network (e.g. with dropout or weight
penalties) so that it has to work harder to fit the training data.
Figure reference: Deep Learning basics, Lex Fridman, MIT HCAI, 2019
Transfer learning example
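As a minimal sketch of the data augmentation idea listed above, the function below triples a batch of images by adding a horizontally flipped copy and a noisy copy. The choice of transformations, the noise level, and the assumption that images are float arrays in [0, 1] are illustrative, not taken from the talk.

```python
import numpy as np

def augment(images, rng, noise_std=0.05):
    """Return the original batch plus flipped and noisy copies.

    images: array of shape (N, H, W) with values in [0, 1] (assumed format).
    Output shape is (3 * N, H, W).
    """
    flipped = images[:, :, ::-1]  # mirror each image left to right
    noisy = np.clip(images + rng.normal(0.0, noise_std, images.shape), 0.0, 1.0)
    return np.concatenate([images, flipped, noisy], axis=0)

rng = np.random.default_rng(0)
batch = rng.random((4, 8, 8))       # 4 synthetic 8x8 "images"
augmented = augment(batch, rng)     # 12 samples from 4 originals
```

The labels must be repeated in the same order; flips and small noise are only valid when they do not change the class of the sample.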

Computing: Is Deep Learning computationally feasible?
Neural network operations are parallelizable.
Tens of thousands of matrix operations per clock cycle through vector processing.
For inputs $x_1, x_2, x_3$ and outputs $y_1, y_2$:
$y_1 = w_{11} x_1 + w_{12} x_2 + w_{13} x_3$
$y_2 = w_{21} x_1 + w_{22} x_2 + w_{23} x_3$
In matrix form: $Y = W X^{T}$
This runs on SSE/AVX instructions on CPU or on streaming multiprocessors on GPU (e.g. NVIDIA cuDNN).
Neural network models are parallelizable.
Hundreds of models on distributed GPU or CPU nodes.
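The equivalence between the per-output sums and the single matrix product can be checked directly. In this sketch the explicit loops mirror the scalar equations, while the `@` operator hands the same arithmetic to NumPy's vectorized backend; the example weights are arbitrary.

```python
import numpy as np

def layer_loop(W, x):
    """Compute y_j = sum_i W[j, i] * x[i] one multiply-add at a time."""
    y = np.zeros(W.shape[0])
    for j in range(W.shape[0]):
        for i in range(W.shape[1]):
            y[j] += W[j, i] * x[i]
    return y

def layer_vectorized(W, x):
    """Same computation as a single matrix-vector product."""
    return W @ x

W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # 2 outputs, 3 inputs
x = np.array([1.0, 0.0, -1.0])

assert np.allclose(layer_loop(W, x), layer_vectorized(W, x))
```

Because every output element is independent of the others, the work can be spread across vector lanes or GPU threads without changing the result.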

Computing: Is Deep Learning computationally feasible?
Neural network models are parallelizable
Distribute on hundreds of GPU or CPU nodes: data-parallel or model-parallel
Data-parallel training
• Data pipelines: Each pipeline randomly samples from the training dataset.
• Worker nodes: Each worker node calculates gradients independently, but all workers share the model parameters.
• Parameter node: Uses the gradients from the worker nodes to update the model parameters.
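The data-parallel scheme can be simulated on a single machine. In the sketch below each "worker" is just a loop iteration computing the gradient on its own shard, and the "parameter node" averages those gradients; the linear model, the fixed (non-random) sharding, and the function names are simplifying assumptions.

```python
import numpy as np

def mse_gradient(w, X, t):
    """Gradient of mean((X @ w - t)**2) with respect to w."""
    return 2.0 * X.T @ (X @ w - t) / len(t)

def data_parallel_step(w, X, t, n_workers=4, lr=0.1):
    """One update: workers compute shard gradients, parameter node averages them."""
    X_shards = np.array_split(X, n_workers)
    t_shards = np.array_split(t, n_workers)
    # Worker nodes: independent gradients, all using the same shared parameters w.
    grads = [mse_gradient(w, Xs, ts) for Xs, ts in zip(X_shards, t_shards)]
    # Parameter node: combine the gradients and update the model parameters.
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
t = X @ np.array([1.0, -2.0, 0.5])
w0 = np.zeros(3)

w_parallel = data_parallel_step(w0, X, t)
w_single = w0 - 0.1 * mse_gradient(w0, X, t)
```

With equally sized shards the averaged gradient equals the full-batch gradient, so `w_parallel` and `w_single` agree; real systems differ mainly in how the gradients are communicated.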

Interpretability: Is Deep Learning a black box?
Myth: Deep learning uses magic, so we can’t understand what is going on.
Let us compare 3 models:
Linear regression model: $y = \sum_{i=1}^{n} w_i X_i + b$
Perceptron model (single layer ANN): $y = \frac{1}{1 + e^{-\left(\sum_{j=1}^{m} w_j^{o} f_j\left(\sum_{i=1}^{n} X_i w_i^{H} + b_i^{H}\right) + b^{o}\right)}}$
PID controller model: $y = K_P e + K_I \int e \, dt + K_D \frac{de}{dt}$
$X$ = Model inputs
$w, b$ = Model parameters
$n$ = Number of input features
$m$ = Number of hidden neurons
$e$ = Error input
$K_P, K_I, K_D$ = Controller gains
Question: If we visually inspect the model parameters, in which model would it be easiest to quantify
the effect of each parameter on the output?
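One way to make the question concrete is to measure how much the output changes when one input is nudged. In the sketch below the linear model's sensitivity is the weight itself at every operating point, while the perceptron's sensitivity depends on where it is evaluated, so it cannot be read off the weights. The tanh hidden activation and all parameter values are assumptions for illustration.

```python
import numpy as np

def linear_model(X, w, b):
    """y = sum_i w_i X_i + b"""
    return float(np.dot(w, X) + b)

def perceptron_model(X, W_h, b_h, w_o, b_o):
    """Single hidden layer (tanh assumed) followed by a sigmoid output."""
    hidden = np.tanh(W_h @ X + b_h)
    z = np.dot(w_o, hidden) + b_o
    return float(1.0 / (1.0 + np.exp(-z)))

def input_sensitivity(model, X, i, eps=1e-6):
    """Central finite-difference estimate of d(output) / d(X[i])."""
    X_plus, X_minus = X.copy(), X.copy()
    X_plus[i] += eps
    X_minus[i] -= eps
    return (model(X_plus) - model(X_minus)) / (2.0 * eps)

w, b = np.array([0.7, -1.2]), 0.3
W_h = np.array([[1.0, -1.0], [0.5, 2.0]])
b_h, w_o, b_o = np.zeros(2), np.array([1.0, -1.0]), 0.0

def linear(X):
    return linear_model(X, w, b)

def perceptron(X):
    return perceptron_model(X, W_h, b_h, w_o, b_o)

X_a, X_b = np.array([0.0, 0.0]), np.array([2.0, 1.0])
lin_a = input_sensitivity(linear, X_a, 0)
lin_b = input_sensitivity(linear, X_b, 0)
per_a = input_sensitivity(perceptron, X_a, 0)
per_b = input_sensitivity(perceptron, X_b, 0)
```

`lin_a` and `lin_b` both equal `w[0]`, whereas `per_a` and `per_b` differ, which is why interpreting a neural network needs tools beyond inspecting its parameters.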

Interpretability: Adversarial inputs
Adversarial inputs: Adding adversarial noise with specific statistical
properties to clean data.
Figure reference: www.pluribus-one.it/research/sec-ml/wild-patterns
• However: The attacker needs access to the network architecture and trained
weights.
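The role of the trained weights in such an attack can be seen on a very small model. The sketch below applies a fast-gradient-sign-style perturbation to a logistic classifier; this specific attack and the toy weights are assumptions chosen for illustration and are not taken from the talk.

```python
import numpy as np

def predict(x, w, b):
    """Probability of class 1 from a logistic classifier."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def sign_gradient_perturb(x, label, w, b, eps):
    """Shift every input feature by +/- eps in the direction that increases the loss.

    For cross-entropy loss with a logistic model, d(loss)/dx = (p - label) * w,
    so computing the perturbation requires the trained weights w.
    """
    p = predict(x, w, b)
    grad_x = (p - label) * w
    return x + eps * np.sign(grad_x)

w = np.array([2.0, -3.0, 1.0])   # hypothetical trained weights
b = 0.0
x_clean = np.array([0.2, -0.1, 0.1])

x_adv = sign_gradient_perturb(x_clean, label=1.0, w=w, b=b, eps=0.2)
p_clean = predict(x_clean, w, b)
p_adv = predict(x_adv, w, b)
```

Here a change of at most 0.2 per feature moves the prediction from above 0.5 to below 0.5, and the direction of the change comes entirely from the sign of the weights.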

Interpretability: Solutions
Intermediate outputs: Ask networks to provide intermediate output.
Figure reference: Clinically applicable deep learning for diagnosis and referral in retinal
disease, De Fauw et al., Nature Medicine, 2018
Heat maps: Visualize activations from individual layers or neurons.
Figure reference: Approximating CNNs with Bag-of-local-Features models works surprisingly well
on ImageNet, Brendel, ICLR 2019

Deep Learning R&D Tracks
Applications
• New domains (e.g. cancer research)
• Benchmarks (e.g. ImageNet, CINIC-10)
Infrastructure
• Training performance (e.g. Horovod)
• Optimize inference (e.g. XLA, TensorRT)
Architecture
• Modify architectures (e.g. Transformers)
• Loss (e.g. Wasserstein distance)
• Hybrid models (e.g. DeepRL)
Additional example shown: ANL DeepHyper

Deep Learning R&D Tracks (cont.)
Fundamental
Understanding
• Interpretable models (e.g. Bag of models)
• Explore failure modes (e.g. adversarial attacks)
Architectures
• Beyond gradient descent (e.g. one shot learning)
• Novel architectures (e.g. Neural Ordinary Differential Equations, Neural Turing Machines)
Power System Application: Load Modelling
Can you model a consumer load at the individual smart meter level?
[Figure: Distribution plots of consumer load for particular hours over a 30-day period; axes: Load (normalized) vs. Frequency]
[Figure: Deep generative model. Latent inputs z1, z2, ..., zm: sampling from a known distribution (e.g. unit Gaussian), or an interpretable latent space feature (e.g. time of day, cloudiness index). Outputs xt-n, ..., xt-4, xt-3, xt-2, xt-1, xt, where xt-n is the forecast for the (t-n)th time block]
Synthetic load curves at consumer level from deep generative model

Power System Applications (cont.)
Accelerating Simulations
Concept: Train neural networks to approximate the dynamic behavior of
dynamic components around a certain operating point.
Potential advantages:
• Allows us to accelerate parts of the simulation on a dedicated GPU.
• Reduces the number of ODEs.
Intelligent agents
Concept: Deep reinforcement learning agents for supervisory control.
Allow agents to learn through interactions with a power system simulator.
Currently being researched by GEIRI NA.

Concluding remarks
Deep Learning is a powerful machine learning tool for developing Artificial
Narrow Intelligence programs.
Developing Deep Learning solutions is a non-trivial engineering problem.
The performance of a Deep Learning model is intimately tied to the
quality and quantity of training data.
Neural network parameters cannot be interpreted by direct observation
and require specialized software tools.
