Presentation on 'Deep Learning: Evolution of ML from Statistical to Brain-like Computing'
Speaker: Dr. Vijay Srinivas Agneeswaran, Director, Big Data Labs, Impetus
The main objective of the presentation is to give an overview of our cutting-edge work on realizing distributed deep learning networks over GraphLab. The objectives can be summarized as follows:
- First-hand experience of, and insights into, the implementation of distributed deep learning networks.
- A thorough view of GraphLab (including walkthroughs of code) and the extensions required to implement these networks.
- Details of how the extensions were realized/implemented in the GraphLab source – these have been submitted to the community for evaluation.
- An arrhythmia detection use case as an application of the large-scale distributed deep learning network.
Reference : http://neuralnetworksanddeeplearning.com/chap1.html
Consider the problem of identifying individual digits from an input image.
Each image is 28 by 28 pixels. The network is then designed as follows:
Input layer (image) -> 28*28 = 784 neurons; each neuron corresponds to one pixel.
Output layer -> one neuron per digit to be identified, i.e. 10 neurons (0 to 9).
The intermediate hidden layer can be experimented with using varying numbers of neurons; let us fix it at 10 neurons.
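The 784-10-10 network above can be sketched as a simple forward pass in NumPy. This is an illustrative sketch only, not code from the presentation: weights are random here, whereas in practice they would be learned (e.g. by gradient descent, as in the referenced chapter), and the sigmoid activation is an assumption taken from that same reference.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes: 784 input pixels, 10 hidden neurons, 10 output digits.
W1 = rng.standard_normal((10, 784))   # input -> hidden weights
b1 = rng.standard_normal(10)
W2 = rng.standard_normal((10, 10))    # hidden -> output weights
b2 = rng.standard_normal(10)

def forward(image):
    """image: flattened 28x28 array of pixel intensities in [0, 1]."""
    hidden = sigmoid(W1 @ image + b1)
    output = sigmoid(W2 @ hidden + b2)
    return output

x = rng.random(784)            # a stand-in for a real MNIST image
scores = forward(x)            # 10 activations, one per digit
digit = int(np.argmax(scores)) # predicted digit, 0-9
```

The predicted digit is simply the output neuron with the highest activation.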
How about recognizing a human face in a given set of random images?
Attack this problem in the same fashion as before: input -> image pixels; output -> is it a face or not? (a single node).
A face can be recognized by answering sub-questions such as "Is there an eye in the top left?", "Is there a nose in the middle?", etc.
Each such question corresponds to a hidden layer.
ANN for face recognition?
Why SVMs, or any kernel-based approach, cannot be used here:
they make the implicit assumption that the target function is locally smooth around each training example.
Problem decomposition into sub-problems
Break the problem down into sub-problems, each solvable by a sub-network. A more complex problem requires more sub-networks, and therefore more hidden layers – hence the need for deep neural networks.
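The "more sub-problems -> more hidden layers" idea can be sketched by generalizing the earlier forward pass to an arbitrary stack of layers. The layer sizes and sigmoid activation below are illustrative assumptions, not values from the presentation; each hidden layer can be thought of as answering one level of sub-questions ("is there an eye?", "is there a face?").

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# input -> two hidden layers -> single face / not-face output node
sizes = [784, 64, 32, 1]
layers = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    # Propagate the input through every layer in turn; each hidden
    # layer computes features built from the previous layer's features.
    for W, b in layers:
        x = sigmoid(W @ x + b)
    return x

p = forward(rng.random(784))  # p[0]: activation of the face/not-face node
```

Adding a sub-network for a new sub-problem amounts to inserting another entry into `sizes` – the depth of the network grows with the depth of the decomposition.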
http://deeplearning4j.org/convolutionalnets.html
Introduced in 1980 by Fukushima (the Neocognitron).
Refined by LeCun in 1989, mainly to apply CNNs to handle variability in 2D image data.
Restricted Boltzmann Machines (RBMs): a type of Boltzmann machine in which connections between nodes in the same layer are absent.
In a CNN, nodes are not connected to every node of the next layer (local receptive fields); the symmetry of full connectivity is absent.
Convolutional networks learn images piece by piece (local patches) rather than as a whole (which is what an RBM does).
Designed to use minimal amounts of preprocessing.
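The local-receptive-field idea can be made concrete with a minimal sketch: a single small filter is slid over the image, so each output value depends only on a small patch rather than on the whole image. The 3x3 vertical-edge filter below is an illustrative choice, not a filter from the presentation; real CNNs learn their filter values during training.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2D convolution (strictly cross-correlation, as in most DL code)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each output value sees only one kh x kw patch of the image.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(2).random((28, 28))
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])  # a simple vertical-edge detector
feature_map = convolve2d(image, edge_filter)  # shape (26, 26): 28 - 3 + 1
```

Because the same filter is reused at every position, the number of parameters is tiny compared with a fully connected layer over 784 inputs, which is one reason CNNs need so little preprocessing.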