Introduction to Neural Networks and Deep Learning
Backpropagation and Automatic Differentiation
Andres Mendez-Vazquez
November 4, 2019
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
A Remarkable Revenant
This algorithm has been used by many communities
It was discovered and rediscovered until, around 1985, it reached the AI community [1]
Basically
It is the basis of modern neural networks
One Big Problem, a Lot of Local Minima
A Lot of Them!!!
[Figure: an error surface with many local minima]
This is due to the fact that
Yes, we have a convex function
$$\frac{1}{2}\left(z_i - t_i\right)^2$$
With an intermediate non-linear activation function
$$z_i = f\left(\sum_{j=1}^{d} w_{ij} y_j\right)$$
Making the surface to be searched for the optimum
A non-linear function map from $\mathbb{R}^d$ to $\mathbb{R}^m$
Recall The Learning Problem
Neural Networks
You can see the network as a computational graph...
Transmitting information from node to node...
Therefore, the network
It is a particular implementation of a composite function from input
space to output space.
Extended Network
The computation of the error by the network [2]
[Figure: the extended network that computes the error]
Thus
The network can calculate the total error
$$E = \sum_{i=1}^{N} E_i$$
Therefore, the network can be updated using the gradient
$$\nabla E = \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_l}\right)$$
$$\Delta w_i = -\gamma \frac{\partial E}{\partial w_i} \quad \text{for } i = 1, \ldots, l$$
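A one-line version of this update rule in code; the gradient values used here are placeholders, since how to obtain them is the subject of the rest of these slides.

```python
import numpy as np

# Sketch of the update rule: Delta w_i = -gamma * dE/dw_i for every weight.
def gradient_step(w, grad_E, gamma=0.1):
    # w and grad_E are arrays with one entry per weight w_1, ..., w_l
    return w - gamma * grad_E

w = np.zeros(5)
grad_E = np.array([0.2, -0.1, 0.0, 0.5, -0.3])  # assumed gradient values
w = gradient_step(w, grad_E)
```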
Now, if we forget everything about learning
Given that the network is a complex composition of functions
$$E = f_1 \circ f_2 \circ \cdots \circ f_K$$
Now, each node has a left and right side
Furthermore
Separation of integration and activation function
Then, we can use this notation to build the forward/backward steps
Actually the basis for automatic differentiation
First, we have
The sequence of derivatives
Then, we can do the forward step getting the function compositions
[Figure: Function Composition]
Now, Backpropagation
Here is the interesting part: you can collect such information
[Figure: Backpropagation, with the constant 1 fed into the output side]
Now, what else?
The aggregation of functions toward the activation functions!!!
We add an extra caveat to the graph representation
A weight into the graph
[Figure: Feed-Forward step with the weight on the edge]
We have the backward process
[Figure: Backpropagation, with the constant 1 fed in]
Function Addition
We have the forward step
[Figure: Function Composition]
Then
At the Backward Step, we have
[Figure: Backward step, with the constant 1 fed in]
Backpropagation Algorithm
Consider a network with a single input and a network function F
The derivative F′(x) is computed in two phases.
1 Feed-forward:
The input x is fed into the network.
The primitive functions at the nodes and their derivatives are evaluated at each node.
The derivatives are stored at the left side of the node.
2 Backpropagation:
The constant 1 is fed into the output unit and the network is run backwards.
Incoming information to a node is added and the result is multiplied by the value stored in the left part of the unit.
The result is transmitted to the left of the unit.
The result collected at the input unit is the derivative of the network function with respect to x.
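As a concrete illustration of the two phases (not the slides' own code), here is a minimal sketch for a chain of primitive functions; the particular nodes f1, f2, f3 are assumptions made only for the example.

```python
import math

# Sketch of the two-phase procedure on a chain F = f3(f2(f1(x))).
# Each node stores its local derivative (its "left side") during feed-forward.
nodes = [
    (lambda v: v * v, lambda v: 2.0 * v),   # f1(v) = v^2,   f1'(v) = 2v
    (math.sin,        math.cos),            # f2(v) = sin v, f2'(v) = cos v
    (math.exp,        math.exp),            # f3(v) = e^v,   f3'(v) = e^v
]

def backprop_chain(x):
    stored, v = [], x
    for f, df in nodes:          # 1. feed-forward: evaluate f and store f'
        stored.append(df(v))
        v = f(v)
    adjoint = 1.0                # 2. backpropagation: feed in the constant 1
    for d in reversed(stored):   #    and multiply by each stored derivative
        adjoint *= d
    return v, adjoint            # F(x) and F'(x)

value, derivative = backprop_chain(1.5)
```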
Proof of Correctness about the Derivatives
Proposition
The Backpropagation algorithm correctly computes the derivative of the network function F with respect to the input x.
Proof
By induction, assume that the algorithm works for networks with n or fewer nodes
Consider
The following network with n + 1 nodes
[Figure: a network with n + 1 nodes]
Thus
We have that
$$F(x) = \varphi\left(w_1 F_1(x) + w_2 F_2(x) + \cdots + w_m F_m(x)\right)$$
We have that the derivative is
$$F'(x) = \varphi'(s)\left[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\right]$$
with $s = w_1 F_1(x) + w_2 F_2(x) + \cdots + w_m F_m(x)$
Now, we use induction
Consider the subgraph of the main graph which contains all the nodes leading to F1(x)
By induction, we can calculate the derivative F1′(x) by introducing a 1 into its last unit and running backpropagation
The same happens for all the other subgraphs Fj(x)
Now, if instead of multiplying by 1 we introduce φ′(s) and multiply by wj, we get
$$w_j F_j'(x)\,\varphi'(s)$$
This can be accomplished by
Introducing a 1 into the output unit, multiplying by the stored value φ′(s), and distributing the result to the m units through the edge-weight nodes.
Basically, we get the derivative
We get then
$$\varphi'(s)\left[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\right]$$
Basically, the network is run backward
$$F'(x) = \varphi'(s)\left[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\right]$$
So the algorithm works for n + 1 nodes
QED
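A small numeric check of this identity is sketched below; the choices φ = tanh and the two subnetworks F1(x) = x², F2(x) = sin x are assumptions made only for the illustration.

```python
import math

# Sketch: compare F'(x) = phi'(s) * (w1*F1'(x) + w2*F2'(x)) with a
# finite difference, for assumed subnetworks F1 and F2.
w1, w2, x = 0.7, -0.3, 1.5
phi, dphi = math.tanh, lambda s: 1.0 - math.tanh(s) ** 2
F1, dF1 = lambda t: t * t, lambda t: 2.0 * t
F2, dF2 = math.sin, math.cos

s = w1 * F1(x) + w2 * F2(x)
analytic = dphi(s) * (w1 * dF1(x) + w2 * dF2(x))

h = 1e-6
numeric = (phi(w1 * F1(x + h) + w2 * F2(x + h)) - phi(s)) / h
# analytic and numeric agree to several significant digits
```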
Why not use matrices to process all the individual parts?
Imagine the following, a simple idea
$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}$$
We know the fields from input to hidden are created as
$$g(X) = XW = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}\begin{pmatrix} w_1 & w_2 & \cdots & w_d \end{pmatrix}$$
Where
We have the constructs $g_{ij}\left(x_i^T\right) = x_i^T w_j$
$$g(X) = \begin{pmatrix} g_{11}(x_1^T) & g_{12}(x_1^T) & \cdots & g_{1d}(x_1^T) \\ g_{21}(x_2^T) & g_{22}(x_2^T) & \cdots & g_{2d}(x_2^T) \\ \vdots & \vdots & \ddots & \vdots \\ g_{N1}(x_N^T) & g_{N2}(x_N^T) & \cdots & g_{Nd}(x_N^T) \end{pmatrix}$$
Then
We have that $f_{ij}(x) = \frac{1}{1 + \exp\{-x\}}$
$$f(g(X)) = \begin{pmatrix} f\left(g_{11}(x_1^T)\right) & f\left(g_{12}(x_1^T)\right) & \cdots & f\left(g_{1d}(x_1^T)\right) \\ f\left(g_{21}(x_2^T)\right) & f\left(g_{22}(x_2^T)\right) & \cdots & f\left(g_{2d}(x_2^T)\right) \\ \vdots & \vdots & \ddots & \vdots \\ f\left(g_{N1}(x_N^T)\right) & f\left(g_{N2}(x_N^T)\right) & \cdots & f\left(g_{Nd}(x_N^T)\right) \end{pmatrix}$$
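This batched forward pass is exactly a matrix product followed by an elementwise nonlinearity; the sketch below uses toy sizes (N, n_in, d are assumptions) just to illustrate the shapes.

```python
import numpy as np

# Sketch of the batched forward pass: g(X) = X W, then the logistic f.
rng = np.random.default_rng(0)
N, n_in, d = 4, 3, 2               # assumed toy sizes
X = rng.normal(size=(N, n_in))     # rows are x_i^T
W = rng.normal(size=(n_in, d))     # columns are w_1, ..., w_d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

G = X @ W        # g_ij(x_i^T) = x_i^T w_j for all i, j at once
A = sigmoid(G)   # f(g(X)), applied elementwise
```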
Finally, we can do the following modification in the forward pass
Then the matrix can be extended so that each entry stores the local derivative alongside the value
$$g'(X)\,\big|\,g(X) = \begin{pmatrix} \frac{dg_{11}(x_1^T)}{dw_1}\big|x_1^T w_1 & \frac{dg_{12}(x_1^T)}{dw_2}\big|x_1^T w_2 & \cdots & \frac{dg_{1d}(x_1^T)}{dw_d}\big|x_1^T w_d \\ \frac{dg_{21}(x_2^T)}{dw_1}\big|x_2^T w_1 & \frac{dg_{22}(x_2^T)}{dw_2}\big|x_2^T w_2 & \cdots & \frac{dg_{2d}(x_2^T)}{dw_d}\big|x_2^T w_d \\ \vdots & \vdots & \ddots & \vdots \\ \frac{dg_{N1}(x_N^T)}{dw_1}\big|x_N^T w_1 & \frac{dg_{N2}(x_N^T)}{dw_2}\big|x_N^T w_2 & \cdots & \frac{dg_{Nd}(x_N^T)}{dw_d}\big|x_N^T w_d \end{pmatrix}$$
Finally, we have
The next function, $f'(g(X))\,\big|\,f(g(X)) =$
$$\begin{pmatrix} \frac{df_{11}(x)}{dx}\left(g_{11}(x_1^T)\right)\big|f\left(g_{11}(x_1^T)\right) & \cdots & \frac{df_{1d}(x)}{dx}\left(g_{1d}(x_1^T)\right)\big|f\left(g_{1d}(x_1^T)\right) \\ \frac{df_{21}(x)}{dx}\left(g_{21}(x_2^T)\right)\big|f\left(g_{21}(x_2^T)\right) & \cdots & \frac{df_{2d}(x)}{dx}\left(g_{2d}(x_2^T)\right)\big|f\left(g_{2d}(x_2^T)\right) \\ \vdots & \ddots & \vdots \\ \frac{df_{N1}(x)}{dx}\left(g_{N1}(x_N^T)\right)\big|f\left(g_{N1}(x_N^T)\right) & \cdots & \frac{df_{Nd}(x)}{dx}\left(g_{Nd}(x_N^T)\right)\big|f\left(g_{Nd}(x_N^T)\right) \end{pmatrix}$$
Using the Hadamard Product
We have for the backpropagation
$$f'(g(X)) \circ g'(X)$$
In particular, for a position ij
$$\frac{dg_{ij}(x_i^T)}{dw_j} \times \frac{df_{ij}(x)}{dx}\left(g_{ij}(x_i^T)\right) = \frac{df_{ij}(x)}{dx}\left(g_{ij}(x_i^T)\right) \times \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{di} \end{pmatrix}$$
Then using a vertical sum
We get the change that is imposed onto each possible vector wj
$$\text{sum}\left(f'(g(X)) \circ g'(X),\ \text{axis}=0\right) = \left(\sum_{i=1}^{N} \frac{dg_{ij}(x_i^T)}{dw_j} \times \frac{df_{ij}(x)}{dx}\left(g_{ij}(x_i^T)\right)\right)_{j=1}^{h}$$
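A sketch of this elementwise (Hadamard) step and the vertical sum in NumPy; the toy X and W reuse the assumed sizes from the previous sketch, and the local derivative of the logistic is A(1 − A).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch: combine stored local derivatives with a Hadamard-style product
# and sum over the batch axis (axis=0), as described above.
rng = np.random.default_rng(0)
N, n_in, d = 4, 3, 2
X = rng.normal(size=(N, n_in))
W = rng.normal(size=(n_in, d))

G = X @ W
A = sigmoid(G)
dA_dG = A * (1.0 - A)       # df_ij/dx evaluated at g_ij(x_i^T)

# dg_ij/dw_j is the input row x_i, so each example contributes
# x_i * dA_dG[i, j] to the update of w_j; summing over i gives
# one direction per weight vector w_j.
grad_W = X.T @ dA_dG        # equivalent to the "vertical sum" above
```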
Now, a Historical Perspective
The idea of a graph structure was proposed by Raul Rojas
“Neural Networks - A Systematic Introduction” by Raul Rojas, in 1996...
TensorFlow was initially released on November 9, 2015
It originated in the “Google Brain” project (circa 2011)
So TensorFlow started around 2012-2013 with internal development and DNNResearch's code (Hinton's company)
However, the graph idea was introduced in 2002 in Torch, the basis of PyTorch (circa 2016)
One of the creators, Samy Bengio, is the brother of Yoshua Bengio [3]
Backpropagation, a little brother of Automatic Differentiation (AD)
We have a crude way to obtain derivatives [4, 5, 6, 7]
$$D_{+h}f(x) \approx \frac{f(x+h) - f(x)}{h} \quad \text{or} \quad D_{\pm h}f(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$
Huge Problems
If h is small, then cancellation error reduces the number of significant figures in $D_{+h}f(x)$.
If h is not small, then truncation errors (terms such as $h^2 f''(x)$) become significant.
Even if h is optimally chosen, the values of $D_{+h}f(x)$ and $D_{\pm h}f(x)$ will be accurate to only about 1/2 or 2/3 of the significant digits of f.
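The two quotients can be written down directly; this small sketch only restates the definitions above and is not part of the original slides.

```python
# Minimal sketch of the two difference quotients defined above.
def forward_difference(f, x, h=1e-8):
    # D_{+h} f(x): truncation error of order h
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h=1e-6):
    # D_{+-h} f(x): truncation error of order h^2
    return (f(x + h) - f(x - h)) / (2.0 * h)
```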
Avoiding Truncation Errors
We have that
Algorithmic differentiation does not incur truncation errors.
For example
$$f(x) = \sum_{i=1}^{n} x_i^2 \quad \text{at } x_i = i \text{ for } i = 1, \ldots, n$$
Then for $e_1 \in \mathbb{R}^n$
$$\frac{f(x + he_1) - f(x)}{h} = \frac{\partial f(x)}{\partial x_1} + h = 2x_1 + h = 2 + h$$
Floating Points
Given that these quantities need a floating-point representation with 64-bit machine accuracy
$$\text{Roundoff error of } f(x + he_1) \approx \frac{n^3 \epsilon}{3}, \quad \text{with } \epsilon = 2^{-54} \approx 10^{-16}$$
For $h = \sqrt{\epsilon}$, as is often recommended
The difference quotient has a rounding error of size
$$\frac{1}{3} n^3 \sqrt{\epsilon} \approx \frac{1}{3} n^3 \times 10^{-8}$$
Now, Imagine n = 1000
Then the rounding error is
$$\frac{1}{3} \times 1000^3 \times \sqrt{\epsilon} \approx \frac{1}{3} \times 10^{9} \times 10^{-8} = \frac{1}{3} \times 10 \approx 3.333\ldots$$
Ouch
We cannot even get the sign correctly!!!
$$\frac{f(x + he_1) - f(x)}{h}$$
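The sketch below reproduces this kind of breakdown in double precision; the exact size of the error depends on the summation order and on the chosen h, so the figure above should be read as an order-of-magnitude estimate.

```python
import numpy as np

# Sketch: forward difference of f(x) = sum(x_i^2) at x_i = i, with n = 1000.
n = 1000
x = np.arange(1, n + 1, dtype=np.float64)

def f(v):
    return np.sum(v * v)

h = 1e-8                          # roughly sqrt(machine epsilon)
e1 = np.zeros(n); e1[0] = 1.0

approx = (f(x + h * e1) - f(x)) / h
exact = 2.0 * x[0]                # df/dx_1 = 2 x_1 = 2
error = abs(approx - exact)       # typically comparable to, or larger than, 2:
                                  # f(x) ~ 3.3e8, so cancellation destroys the
                                  # low-order digits of the difference
```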
In contrast, Automatic Differentiation
It yields
$2x_i$ in both its forward and reverse modes
You could assume that the derivatives are generated symbolically
That is true in some sense, but the expression $2x_i$ is never generated the way it would be in Symbolic Differentiation
Instead, in AD
The numerical value of $x_i$ is multiplied by 2 and then returned as the gradient value.
Example using Forward Differentiation
We will see the forward procedure later on
$$f(x) = \sum_{i=1}^{n} x_i^2 \quad \text{with } x_i = i \text{ for } i = 1, \ldots, n$$
AD initializes (do not worry, we will see this in more detail)
$$v_{i-n} = i \ \text{ for } i = 1, \ldots, n, \qquad \dot{v}_{i-n} = 0, \ \text{but } \dot{v}_{1-n} = 1$$
Then, we have that
Apply the compositions

φ Functions | Derivatives
$v_1 = 1^2$ | $\dot{v}_1 = \frac{\partial v_1}{\partial v_{1-n}}\dot{v}_{1-n} = 2 \times (1) \times 1 = 2$
... | ...
$v_n = n^2$ | $\dot{v}_n = 0$

Therefore, we have at the end
$$\frac{\partial f}{\partial x}(x) = (2, 0, \ldots, 0)$$
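A minimal sketch of this seeded forward pass (no AD library, just the recurrence above; the variable names are assumptions):

```python
# Sketch: forward-mode AD of f(x) = sum(x_i^2), seeded with direction e_1.
n = 1000
v = [float(i) for i in range(1, n + 1)]   # v_{i-n} = i
vdot = [0.0] * n
vdot[0] = 1.0                             # seed: derivative w.r.t. x_1

sq = [vi * vi for vi in v]                           # elemental squares
sqdot = [2.0 * vi * di for vi, di in zip(v, vdot)]   # d(v^2) = 2 v dv

y = sum(sq)
ydot = sum(sqdot)    # exactly 2.0: no truncation or cancellation error
```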
Quite different from
Using a numerical difference, we have
$$\frac{f(x + e_1 h) - f(x)}{h} - 2 < 0$$
Then for $n = 10^j$ and $h = 10^{-k}$
$$10^{k}\left[(h+1)^2 - 1\right] < 2$$
Finally, we have
$$k > -\log_{10} 3$$
Therefore
It is possible to get into underflow
by getting a $k > -\log_{10} 3$
Therefore, we have that
Automatic Differentiation allows us to obtain the correct answer!!!
For example
You have the following equation
$$f(x) = \prod_{i=1}^{n} x_i$$
Then, the gradient
$$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right) = \left(\prod_{j \neq i} x_j\right)_{i=1,\ldots,n}$$
$$= \left(x_2 \times x_3 \times \cdots \times x_{n-1} \times x_n,\ \ldots,\ x_1 \times \cdots \times x_{i-1} \times x_{i+1} \times \cdots \times x_n,\ \ldots,\ x_1 \times x_2 \times \cdots \times x_{n-2} \times x_{n-1}\right)$$
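As a sketch of how shared subexpressions can be reused instead of expanding n separate symbolic products, prefix and suffix products give the whole gradient in linear time; this is an illustration only, not the slides' code.

```python
# Sketch: gradient of f(x) = prod(x_i) via reused prefix/suffix products,
# instead of n separate expressions with n - 1 factors each.
def product_gradient(x):
    n = len(x)
    prefix = [1.0] * (n + 1)     # prefix[i] = x_0 * ... * x_{i-1}
    suffix = [1.0] * (n + 1)     # suffix[i] = x_i * ... * x_{n-1}
    for i in range(n):
        prefix[i + 1] = prefix[i] * x[i]
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] * x[i]
    # df/dx_i is the product of all the other factors
    return [prefix[i] * suffix[i + 1] for i in range(n)]

grad = product_gradient([1.0, 2.0, 3.0, 4.0])   # [24.0, 12.0, 8.0, 6.0]
```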
Actually
Symbolic Differentiation will consume a lot of memory
Instead, AD will reuse the common expressions to improve performance and memory use.
However, both Symbolic and Automatic Differentiation
Make use of the chain rule to achieve their results
However, the chain rule in AD
Is applied not to the symbolic expressions but to the actual numerical values.
The User Insight
Difference quotients may sometimes be useful too
$$\frac{f(x + he_1) - f(x)}{h}$$
Computer Algebra packages
They have really neat ways to simplify expressions.
In contrast, current AD packages assume
That the given program calculates the underlying function efficiently
There
AD can automate the gradient generation
The best results will be obtained when AD takes advantage of the user's insight into the structure underlying the program
RNN Example
When you look at the Elman recurrent neural network [8]
$$h_t = \sigma_h\left(W_{sd} x_t + U_{sh} h_{t-1} + b_h\right)$$
$$y_t = \sigma_y\left(V_{os} h_t\right)$$
$$L = \frac{1}{2}\left(y_t - z_t\right)^2$$
Here, if you do blind AD, sooner or later you have
$$\frac{\partial h_t}{\partial h_{t-1}} \times \frac{\partial h_{t-1}}{\partial h_{t-2}} \times \frac{\partial h_{t-2}}{\partial h_{t-3}} \times \cdots \times \frac{\partial h_{k+1}}{\partial h_k}$$
This is known as Backpropagation Through Time (BPTT)
This is a problem, given
The Vanishing Gradient or Exploding Gradient
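A quick sketch of why that chained product is problematic: repeatedly multiplying by a recurrent Jacobian shrinks or blows up the gradient geometrically, depending on its spectral radius. The matrix below is an arbitrary stand-in, not derived from the equations above.

```python
import numpy as np

# Sketch: the product dh_t/dh_{t-1} * ... * dh_{k+1}/dh_k over many steps.
rng = np.random.default_rng(1)
J = 0.5 * rng.normal(size=(8, 8)) / np.sqrt(8)   # stand-in recurrent Jacobian

grad = np.eye(8)
norms = []
for _ in range(50):
    grad = grad @ J
    norms.append(np.linalg.norm(grad))
# norms decays (or explodes) geometrically: the vanishing/exploding gradient
```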
Here, you can modify the architecture
Using an intermediate layer and the Hadamard product ◦, we have
$$L = \frac{1}{2}\left(y_t - z_t\right)^2$$
$$y_t = \sigma_y\left(W_{od} x_t + U_{oh} h_{t-1} + b_o\right)$$
$$s_t = \sigma_s\left(V_{ho} y_t + D_{hd} x_t + b_h\right)$$
$$h_t = (1 - y_t) \circ h_{t-1} + y_t \circ s_t$$
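A minimal sketch of one step of this gated recurrence; the sizes, random weights, and choice of nonlinearities are assumptions for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of one step of the gated recurrence above, with assumed sizes.
d_in, d_h = 3, 5
rng = np.random.default_rng(0)
W_od, U_oh, b_o = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
V_ho, D_hd, b_h = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h)

def step(x_t, h_prev):
    y_t = sigmoid(W_od @ x_t + U_oh @ h_prev + b_o)   # gate / output
    s_t = np.tanh(V_ho @ y_t + D_hd @ x_t + b_h)      # candidate state
    h_t = (1.0 - y_t) * h_prev + y_t * s_t            # Hadamard mixing
    return y_t, h_t

y1, h1 = step(rng.normal(size=d_in), np.zeros(d_h))
```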
Therefore
You have multiple paths of derivatives
[Figure: the unrolled computational graph with multiple derivative paths]
One of them
It can be seen
That one of the paths takes you back to BPTT
The Other One
The other path gets you into a more Markovian property
This allows us to get a backpropagation that does not require BPTT
How? For example, the derivative of L with respect to $D_{hd}$
$$\frac{\partial L}{\partial D_{hd}} = \frac{\partial L}{\partial y_t} \times \frac{\partial y_t}{\partial net_y} \times \frac{\partial net_y}{\partial h_{t-1}} \times \frac{\partial h_{t-1}}{\partial s_{t-2}} \times \frac{\partial s_{t-2}}{\partial net_s} \times \frac{\partial net_s}{\partial D_{hd}}$$
Therefore
You do not have
Backpropagation Through Time... you can avoid it altogether!!!
Because Backpropagation Through Time
Makes the process of obtaining the gradients unstable...
Thus
A great simplifying step
Here the phrase rings true:
“AD taking advantage of the user's insight”
A Simple Example
Here, we have the following ideas
Some of the floating-point values generated during the evaluation will be stored in variables of the program,
Others will be held temporarily until they are overwritten or discarded.
Thus, we will introduce the concept
Evaluation Trace, which is basically a record of a particular run of a given program.
This Evaluation Trace stores
Input variables,
The sequence of floating-point values generated by the CPU,
The operations that are used to produce them
Example
A simple example
$$y = f(x_1, x_2) = \left[\sin\left(\frac{x_1}{x_2}\right) + \frac{x_1}{x_2} - \exp(x_2)\right] \times \left[\frac{x_1}{x_2} - \exp(x_2)\right]$$
We wish to calculate y = f(x_1, x_2)
with x_1 = 1.5, x_2 = 0.5
Evaluation Trace/Forward Procedure
We have the table for the evaluation of the function
$v_{-1} = x_1 = 1.5$
$v_0 = x_2 = 0.5$
$v_1 = v_{-1}/v_0 = 1.5/0.5 = 3.0$
$v_2 = \sin(v_1) = \sin(3.0) = 0.1411$
$v_3 = \exp(v_0) = \exp(0.5) = 1.6487$
$v_4 = v_1 - v_3 = 3.0 - 1.6487 = 1.3513$
$v_5 = v_2 + v_4 = 0.1411 + 1.3513 = 1.4924$
$v_6 = v_5 \times v_4 = 1.4924 \times 1.3513 = 2.0167$
$y = v_6 = 2.0167$
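The same trace written as straight-line code (variable names mirror the v_i above; this is an illustration, not a library API):

```python
import math

# Sketch: the evaluation trace of y = f(x1, x2) at x1 = 1.5, x2 = 0.5.
x1, x2 = 1.5, 0.5
v_m1 = x1                # v_{-1}
v_0  = x2
v_1  = v_m1 / v_0        # 3.0
v_2  = math.sin(v_1)     # 0.1411...
v_3  = math.exp(v_0)     # 1.6487...
v_4  = v_1 - v_3         # 1.3513...
v_5  = v_2 + v_4         # 1.4924...
v_6  = v_5 * v_4         # 2.0167...
y = v_6
```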
A Cautionary Note
Normally
Programmers will try to rearrange this execution trace to improve performance through parallelism.
Thus
Subexpressions will be algorithmically exploited by the AD to improve performance.
It is usually more convenient to use
The so-called “computational graph”
Computational Graph
A Simpler Version
Please take a look at the section “A Framework for Evaluating Functions” in Chapter 2
of the book [7]:
Andreas Griewank and Andrea Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105, SIAM, 2008.
A Little Bit of Notation
In general, we assume quantities $v_i$ such that
$$\underbrace{v_{1-n}, \ldots, v_0}_{x} \quad v_1, \ldots, v_{l-m} \quad \underbrace{v_{l-m+1}, \ldots, v_l}_{y}$$
Then, we have
1 $v_{1-n}, \ldots, v_0$ are the initial input variables
2 $v_{l-m+1}, \ldots, v_l$ are the output variables
3 $v_1, \ldots, v_{l-m}$ are the intermediate variables
Additionally
Each value $v_i$ with $i > 0$ is obtained by applying an elemental function $\varphi_i$
$$v_i = \varphi_i\left(v_j\right)_{j \prec i}$$
where $j \prec i$ means that $v_i$ depends directly on $v_j$
Then, for the application of the chain rule
It is useful to associate with each elemental function $\varphi_i$ the state transformation
$$v_i = \Phi_i\left(v_{i-1}\right) \quad \text{with } \Phi_i : \mathbb{R}^{n+l} \to \mathbb{R}^{n+l}$$
where
$$v_i = \left(v_{1-n}, \ldots, v_i, 0, \ldots, 0\right)^T$$
In other words
$\Phi_i$ sets $v_i$ to $\varphi_i(v_j)_{j \prec i}$ and keeps all other components $v_j$, $j \neq i$, unchanged.
Basically the Computational Graph
A Simpler Version
Example of the Forward Mode
Suppose we want to differentiate y = f(x_1, x_2) with respect to x_1
We consider x_1 as an independent variable and y as a dependent variable.
We can work out the numerical value of the derivative of y = f(x_1, x_2)
By getting the numerical derivative of each of its components
Something like
$$\dot{v}_i = \frac{\partial v_i}{\partial x_1}$$
Therefore, we get
We have the procedure

$v_{-1} = x_1 = 1.5$ | $\dot{v}_{-1} = 1.0$
$v_0 = x_2 = 0.5$ | $\dot{v}_0 = 0.0$
$v_1 = v_{-1}/v_0 = 1.5/0.5 = 3.0$ | $\dot{v}_1 = \frac{\partial v_1}{\partial v_{-1}}\dot{v}_{-1} + \frac{\partial v_1}{\partial v_0}\dot{v}_0 = 2.0$
$v_2 = \sin(v_1) = \sin(3.0) = 0.1411$ | $\dot{v}_2 = \cos(v_1)\dot{v}_1 = -1.98$
$v_3 = \exp(v_0) = \exp(0.5) = 1.6487$ | $\dot{v}_3 = v_3 \times \dot{v}_0 = 0.0$
$v_4 = v_1 - v_3 = 3.0 - 1.6487 = 1.3513$ | $\dot{v}_4 = \dot{v}_1 - \dot{v}_3 = 2.0$
$v_5 = v_2 + v_4 = 0.1411 + 1.3513 = 1.4924$ | $\dot{v}_5 = \dot{v}_2 + \dot{v}_4 = 0.02$
$v_6 = v_5 \times v_4 = 1.4924 \times 1.3513 = 2.0167$ | $\dot{v}_6 = \dot{v}_5 \times v_4 + v_5 \times \dot{v}_4 = 3.0118$
$y = v_6 = 2.0167$ | $\dot{y} = \dot{v}_6 = 3.0118$
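The same propagation can be carried out with a tiny dual-number class; this sketch is an illustration of the recurrence, not any particular AD library.

```python
import math

class Dual:
    """Sketch: a value together with its tangent d(value)/dx1."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __sub__(self, o):
        return Dual(self.val - o.val, self.dot - o.dot)
    def __mul__(self, o):
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    def __truediv__(self, o):
        return Dual(self.val / o.val,
                    (self.dot * o.val - self.val * o.dot) / (o.val ** 2))

def dsin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

def dexp(d):
    return Dual(math.exp(d.val), math.exp(d.val) * d.dot)

x1, x2 = Dual(1.5, 1.0), Dual(0.5, 0.0)   # seed dx1/dx1 = 1
v1 = x1 / x2
v4 = v1 - dexp(x2)
y = (dsin(v1) + v4) * v4
# y.val ~ 2.0167 and y.dot ~ 3.0118, matching the table above
```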
The first column of this process
It can be seen as an automatic procedure
$$v_{i-n} = x_i \qquad i = 1, \ldots, n$$
$$v_i = \varphi_i\left(v_j\right)_{j \prec i} \qquad i = 1, \ldots, l$$
$$y_{m-i} = v_{l-i} \qquad i = m-1, \ldots, 0$$
In a similar way
We can obtain $\frac{\partial f(x_1, x_2)}{\partial x_2}$
However, it can be more efficient to redefine the $\dot{v}_i$ as vectors, propagating several tangent directions at once!!!
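A sketch of that vector-valued variant: each tangent becomes a length-2 array carrying the derivatives with respect to x_1 and x_2 simultaneously (an illustration only).

```python
import numpy as np

# Sketch: propagate tangents w.r.t. x1 AND x2 in a single forward sweep by
# making every "dot" a length-2 vector, seeded with the identity directions.
x1, x2 = 1.5, 0.5
d_x1, d_x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

v1 = x1 / x2;      d_v1 = (d_x1 * x2 - x1 * d_x2) / x2**2
v2 = np.sin(v1);   d_v2 = np.cos(v1) * d_v1
v3 = np.exp(x2);   d_v3 = v3 * d_x2
v4 = v1 - v3;      d_v4 = d_v1 - d_v3
v5 = v2 + v4;      d_v5 = d_v2 + d_v4
y  = v5 * v4;      d_y  = d_v5 * v4 + v5 * d_v4
# d_y[0] ~ 3.0118 (df/dx1) and d_y[1] gives df/dx2 from the same sweep
```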
Forward Propagation of Tangents
Remarks
As you can see, the second column of the evaluation procedure is obtained in a mechanical way
This increases the size of the computation
Basically, twice the size of the original simple evaluation.
We have the following
We have the chain rule
$$\dot{y}(t) = \frac{\partial F(x(t))}{\partial t} = F'(x(t))\,\dot{x}(t)$$
Where
$F'(x) \in \mathbb{R}^{m \times n}$ is the Jacobian matrix
Here, we might be tempted to calculate $\dot{y}(t)$
By evaluating the full Jacobian $F'(x)$ and then multiplying by $\dot{x}(t)$
However
Such an approach is quite uneconomical
Unless many tangents need to be calculated, as in a Newton step.
A simpler version: differentiate the first column of the table
vi−n = xi   i = 1, ..., n
vi = φi (vj)j≺i   i = 1, ..., l
ym−i = vl−i   i = m − 1, ..., 0
Here j ≺ i means that vi depends directly on vj (the graph propagation of the dependencies)
79 / 158
Images/cinvestav
Which can be seen as Forward Propagation of Tangents
Basically, we can think of the forward mode as a propagation of
tangents
80 / 158
Images/cinvestav
The Automatic Procedure
Therefore, we have the following automatic procedure
With j ≺ i meaning that vi depends directly on vj, and ui = (vj)j≺i ∈ Rni
[vi−n ≡ xi, ˙vi−n ≡ ˙xi]   i = 1, ..., n
[vi ≡ φi (vj)j≺i, ˙vi ≡ Σj≺i (∂φi (ui) /∂vj) ˙vj]   i = 1, ..., l
[ym−i ≡ vl−i, ˙ym−i ≡ ˙vl−i]   i = m − 1, ..., 0
81 / 158
Images/cinvestav
Therefore
Each element assignment vi = φi (ui)
You have the corresponding
˙vi = Σj≺i (∂φi (ui) /∂vj) × ˙vj = Σj≺i cij × ˙vj
Abbreviating ˙ui = (˙vj)j≺i
˙vi = ˙φi (ui, ˙ui) ≡ φ′i (ui) ˙ui
Where ˙φi : R2ni → R
It is called the tangent function associated with the elemental φi.
82 / 158
Images/cinvestav
Now
Question
What is the correct order of evaluation?
83 / 158
Images/cinvestav
Why the question?
Until now, we have always placed the tangent statement yielding ˙vi
after the underlying value vi
This order of calculation seems natural and certainly yields correct
results as long as there is no overwriting.
Then the order of the 2l statements in the middle part of the table does not matter
[vi−n ≡ xi, ˙vi−n ≡ ˙xi]   i = 1, ..., n
[vi ≡ φi (vj)j≺i, ˙vi ≡ Σj≺i (∂φi (ui) /∂vj) ˙vj]   i = 1, ..., l
[ym−i ≡ vl−i, ˙ym−i ≡ ˙vl−i]   i = m − 1, ..., 0
84 / 158
Images/cinvestav
Here, we have a big problem in Cache
Imagine that we have a single block of memory, so that vi and its arguments vj live in the same memory cell of the cache
[Figure: CPU registers, cache, and main memory]
85 / 158
Images/cinvestav
This is known as Cache Aliasing
Definition
Cache aliasing occurs when multiple mappings to a physical page of
memory have conflicting caching states, such as cached and
uncached.
the same physical address can be mapped to multiple virtual addresses.
On ARMv4 and ARMv5 processors, cache is organized as a
virtual-indexed, virtual-tagged (VIVT)
Cache lookups are faster because the translation look-aside buffer
(TLB) is not involved in matching cache lines for a virtual address.
However
This caching method does require more frequent cache flushing
because of cache aliasing.
86 / 158
Images/cinvestav
Then
The value of ˙vi = ˙φi (ui, ˙ui) will be incorrect
Once we update vi = φi (ui)
ADIFOR and Tapenade [9, 5]
They put the derivative statement ahead of the original assignment, so the tangent is updated before the original value is overwritten.
On the other hand
For most univariate functions v = φ (u) it is better to obtain the undifferentiated value first
Then to use it inside the tangent function ˙φ
87 / 158
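A minimal sketch of the ordering issue, under the assumption that a value and its tangent share storage with the original argument (the single-memory-cell scenario above); the function names are hypothetical:

import math

def tangent_then_value(cell, dcell):
    # Safe order for v = sin(u): the tangent is formed while the old value is alive
    dcell = math.cos(cell) * dcell   # ˙v = cos(u) × ˙u
    cell = math.sin(cell)            # only now overwrite u with v
    return cell, dcell

def value_then_tangent(cell, dcell):
    # Buggy order when the cell is overwritten first
    cell = math.sin(cell)            # u is gone ...
    dcell = math.cos(cell) * dcell   # ... so this computes cos(sin(u)), not cos(u)
    return cell, dcell

def value_first_exp(cell, dcell):
    # For v = exp(u) the tangent needs only v, so value-first is the cheap, correct order
    cell = math.exp(cell)
    dcell = cell * dcell             # ˙v = v × ˙u
    return cell, dcell

print(tangent_then_value(1.0, 1.0))  # (0.8415, 0.5403)  correct tangent cos(1)
print(value_then_tangent(1.0, 1.0))  # (0.8415, 0.6664)  wrong: cos(sin(1))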
Images/cinvestav
In this presentation
We will list ϕ and ˙ϕ
Side by side in a common bracket to indicate that they should be
evaluated simultaneously
Then
sharing results is immediate.
88 / 158
Images/cinvestav
Classic Tangent Operations
We have a series of improvements on the tangent equations
φ                 φ, ˙φ
v = c             v = c;  ˙v = 0
v = u ± w         v = u ± w;  ˙v = ˙u ± ˙w
v = u × w         ˙v = ˙u × w + u × ˙w;  v = u × w
v = 1/u           v = 1/u;  ˙v = −v × (v × ˙u)
v = u^c           v = u^c;  ˙v = c × (v/u) × ˙u
v = √u            v = √u;  ˙v = 0.5 × ˙u/v
v = exp (u)       v = exp (u);  ˙v = v × ˙u
v = log (u)       ˙v = ˙u/u;  v = log (u)
v = sin (u)       ˙v = cos (u) × ˙u;  v = sin (u)
89 / 158
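A minimal operator-overloading sketch of these rules (the Dual class and helper functions are my own names); applied to the earlier example it reproduces the same value and tangent:

import math

class Dual:
    """A (value, tangent) pair; each operation applies one row of the table above."""
    def __init__(self, v, dv=0.0):
        self.v, self.dv = v, dv
    def _wrap(self, o):
        return o if isinstance(o, Dual) else Dual(float(o))
    def __add__(self, o):
        o = self._wrap(o); return Dual(self.v + o.v, self.dv + o.dv)
    def __sub__(self, o):
        o = self._wrap(o); return Dual(self.v - o.v, self.dv - o.dv)
    def __mul__(self, o):
        o = self._wrap(o); return Dual(self.v * o.v, self.dv * o.v + self.v * o.dv)
    def __truediv__(self, o):
        o = self._wrap(o); w = self.v / o.v
        return Dual(w, (self.dv - w * o.dv) / o.v)

def sin(u):  return Dual(math.sin(u.v), math.cos(u.v) * u.dv)
def exp(u):  v = math.exp(u.v); return Dual(v, v * u.dv)
def log(u):  return Dual(math.log(u.v), u.dv / u.v)
def sqrt(u): v = math.sqrt(u.v); return Dual(v, 0.5 * u.dv / v)

x1, x2 = Dual(1.5, 1.0), Dual(0.5, 0.0)      # seed ˙x1 = 1, ˙x2 = 0
y = (sin(x1 / x2) + x1 / x2 - exp(x2)) * (x1 / x2 - exp(x2))
print(y.v, y.dv)                             # ≈ 2.0167, 3.0118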
Images/cinvestav
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
90 / 158
Images/cinvestav
Now Imagine the following network
Something simple for our sake
91 / 158
Images/cinvestav
Forward mode to get gradient of x1
v−11 = w11, ..., v0 = w52 (all the weights become the independent variables)
˙v−11 = 1, ˙v−10 = 0, ..., ˙v0 = 0
v1 = Σ3i=1 w1ixi ,            ˙v1 = x1
v2 = Σ3i=1 w2ixi ,            ˙v2 = 0
v3 = 1/(1 + exp (−v1)) ,      ˙v3 = v3 [1 − v3] × ˙v1
v4 = 1/(1 + exp (−v2)) ,      ˙v4 = 0
v5 = w31v3 + w32v4 ,          ˙v5 = w31 × ˙v3
v6 = w41v3 + w42v4 ,          ˙v6 = w41 × ˙v3
v7 = 1/(1 + exp (−v5)) ,      ˙v7 = v7 [1 − v7] × ˙v5
v8 = 1/(1 + exp (−v6)) ,      ˙v8 = v8 [1 − v8] × ˙v6
v9 = w51v7 + w52v8 ,          ˙v9 = w51 × ˙v7 + w52 × ˙v8
v10 = 1/(1 + exp (−v9)) ,     ˙v10 = v10 [1 − v10] × ˙v9
92 / 158
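A sketch of the same sweep in numpy. The wiring (two sigmoid units per hidden layer, one sigmoid output) and the numerical weights are my own assumptions, since the slide's figure is not reproduced here; the seed is ˙w11 = 1 with all other tangents 0:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -2.0, 0.5])                 # hypothetical inputs
w1, w2 = np.array([0.1, 0.2, -0.3]), np.array([0.4, -0.1, 0.2])
w3, w4 = np.array([0.5, -0.5]), np.array([0.3, 0.8])
w5 = np.array([-0.7, 0.6])

v1 = w1 @ x;                 dv1 = x[0]                    # only w11 is perturbed
v2 = w2 @ x;                 dv2 = 0.0
v3 = sigmoid(v1);            dv3 = v3 * (1 - v3) * dv1
v4 = sigmoid(v2);            dv4 = 0.0
v5 = w3[0]*v3 + w3[1]*v4;    dv5 = w3[0] * dv3
v6 = w4[0]*v3 + w4[1]*v4;    dv6 = w4[0] * dv3
v7 = sigmoid(v5);            dv7 = v7 * (1 - v7) * dv5
v8 = sigmoid(v6);            dv8 = v8 * (1 - v8) * dv6
v9 = w5[0]*v7 + w5[1]*v8;    dv9 = w5[0] * dv7 + w5[1] * dv8
v10 = sigmoid(v9);           dv10 = v10 * (1 - v10) * dv9   # = ∂y/∂w11
print(v10, dv10)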
Images/cinvestav
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
93 / 158
Images/cinvestav
Complexity of the Procedure
Time Complexity
TIME {F (x) , F′ (x) ˙x} ≤ ωtan TIME {F (x)} , where ωtan ∈ [2, 5/2]
Space Complexity
SPACE {F (x) , F′ (x) ˙x} ≤ 2 SPACE {F (x)}
94 / 158
Images/cinvestav
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
95 / 158
Images/cinvestav
Here, an essential observation
The cost of evaluating derivatives by propagating them forward
It increases linearly with the number of directions ˙x along which we want to differentiate.
It looks inevitable
But it is possible to avoid this complexity
By observing that the gradient of a single dependent variable can be obtained for a fixed multiple of the cost of evaluating the underlying scalar-valued function.
96 / 158
Images/cinvestav
We choose instead an output variable
We use the term “reverse mode” for this technique
Because the label “backward differentiation” is well established
[10, 11].
Therefore, for an output y = f (x1, x2)
We have for each variable vi
v̄i = ∂y/∂vi (Adjoint Variable)
97 / 158
Images/cinvestav
Actually
This is an abuse of notation
We really mean a new independent variable δi
v̄i = ∂y/∂δi (Adjoint Variable)
Which can be thought of as adding a small numerical value δi to vi
vi + δi → f (x1, x2) + v̄iδi
As a perturbation in variational calculus
98 / 158
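A quick numerical check of this perturbation reading of v̄i, reusing the earlier forward-mode example and picking the intermediate v1 = x1/x2 (the helper name is mine):

import numpy as np

def f_from_v1(v1, x2):
    # Re-evaluate the earlier example starting from a (possibly perturbed) v1
    v2 = np.sin(v1); v3 = np.exp(x2)
    v4 = v1 - v3;    v5 = v2 + v4
    return v5 * v4

x1, x2 = 1.5, 0.5
v1, delta = x1 / x2, 1e-6
# Adjoint of v1 read off as a perturbation sensitivity ...
bar_v1_num = (f_from_v1(v1 + delta, x2) - f_from_v1(v1, x2)) / delta
# ... and the same quantity by the chain rule: (cos(v1) + 1) v4 + v5
v3 = np.exp(x2); v4 = v1 - v3; v5 = np.sin(v1) + v4
print(bar_v1_num, (np.cos(v1) + 1.0) * v4 + v5)   # both ≈ 1.506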
Images/cinvestav
Actually, you propagate the Normal vectors
Actually, ȳ and the v̄i are normals, or cotangents
99 / 158
Images/cinvestav
Then, we have
The following sought mapping
x̄T = ȳT F′ (x)
Observation
Here, ȳ is a fixed vector that plays a dual role to the domain direction ˙x.
In the Forward Procedure, you compute
˙y = F′ (x) ˙x ≡ ˙F (x, ˙x)
100 / 158
Images/cinvestav
Instead
In the Reverse Procedure, you compute
x̄T = ȳT F′ (x) ≡ F̄ (x, ȳ)
Where F and F̄ are evaluated together
Thus, we have a dual process
101 / 158
Images/cinvestav
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
102 / 158
Images/cinvestav
Dual Process
Here, we have that the hyperplane ȳT y = c in the range of F has inverse image {x | ȳT F (x) = c}
103 / 158
Images/cinvestav
The implicit function theorem
Theorem
Let F : Rn+m → Rm be a continuously differentiable function, and let x0 = (x0,1, x0,2, ..., x0,m+n) be a point with F (x0) = c. If ∂F (x0) /∂xm+n ≠ 0, then there exists a neighborhood of x0 such that, whenever (x1, ..., xn+m−1) is close enough to (x0,1, ..., x0,m+n−1), there is a unique z with F (x1, ..., xn+m−1, z) = c. Furthermore, z = g (x1, ..., xn+m−1), a continuous function of (x1, ..., xn+m−1).
104 / 158
Images/cinvestav
Therefore
The set {x | ȳT F (x) = c}
It is a smooth hypersurface with the normal
x̄T = ȳT F′ (x)
at x, provided that x̄ does not vanish.
105 / 158
Images/cinvestav
The Process
Here, we have that the hyperplane ȳT y = c in the range of F has inverse image {x | ȳT F (x) = c}
[Figures: the Forward Procedure and the Reverse Procedure]
106 / 158
Images/cinvestav
Therefore
When m = 1, then F = f is scalar-valued
We obtain ȳ = 1 ∈ R and the familiar gradient ∇f (x) = ȳT F′ (x).
Something Notable
We will look only at the main procedure of the Incremental Adjoint Recursion
Please take a look at the section on Derivation by Matrix-Product Reversal
In the book [7]
Andreas Griewank and Andrea Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105 (SIAM, 2008).
107 / 158
Images/cinvestav
The derivation of the reversal mode
For this, we will use
[vi−n ≡ xi, ˙vi−n ≡ ˙xi]   i = 1, ..., n
[vi ≡ φi (vj)j≺i, ˙vi ≡ Σj≺i (∂φi (ui) /∂vj) ˙vj]   i = 1, ..., l
[ym−i ≡ vl−i, ˙ym−i ≡ ˙vl−i]   i = m − 1, ..., 0
And the identity
ȳT ˙y = x̄T ˙x
108 / 158
Images/cinvestav
Now, using the state transformation Φ
We map from x to y = F (x) as the composition
y = Qm Φl ◦ Φl−1 ◦ · · · ◦ Φ2 ◦ Φ1 (PTn x)
Where Pn ≡ [I, 0, ..., 0] ∈ Rn×(n+l) and Qm ≡ [0, 0, ..., I] ∈ Rm×(n+l)
They are the matrices that project an arbitrary (n + l)-vector
Onto its first n and its last m components, respectively.
109 / 158
Images/cinvestav
Where
The cij’s represent the partial derivatives
cij ≡ cij (ui) ≡ ∂φi/∂vj for 1 − n ≤ i, j ≤ l
110 / 158
Images/cinvestav
Labeling the elemental partials as cij
We get the state Jacobian
Ai ≡ Φ′i ∈ R(n+l)×(n+l), the identity matrix with its (n + i)-th row replaced by
(ci,1−n, ci,2−n, . . . , ci,i−1, 0, . . . , 0)
so that the cij occur in the (n + i)-th row of Ai.
111 / 158
Images/cinvestav
Remarks
The square matrices Ai are lower triangular
They may also be written as rank-one perturbations of the identity,
Ai = I + en+i [∇φi (ui) − en+i]T
Where ej denotes the jth Cartesian basis vector in Rn+l
Differentiating the composition of functions
˙y = Qm Al Al−1 · · · A2 A1 PTn ˙x
112 / 158
Images/cinvestav
Embeddings
The multiplication by PTn ∈ R(n+l)×n
It embeds ˙x into Rn+l
Meaning
It corresponds to the first part of the tangent recursion
The subsequent multiplications by the Ai
They generate one component ˙vi at a time, according to the middle part
113 / 158
Images/cinvestav
Finally
Qm extracts the last m components as ˙y corresponding to the third
part of the table
[vi−n ≡ xi, ˙vi−n ≡ ˙xi]   i = 1, ..., n
[vi ≡ φi (vj)j≺i, ˙vi ≡ Σj≺i (∂φi (ui) /∂vj) ˙vj]   i = 1, ..., l
[ym−i ≡ vl−i, ˙ym−i ≡ ˙vl−i]   i = m − 1, ..., 0
114 / 158
Images/cinvestav
Now
By comparison with
˙y (t) = ∂F (x (t))/∂t = F′ (x (t)) ˙x (t)
We have in fact a product representation of the full Jacobian
F′ (x) = Qm Al Al−1 · · · A2 A1 PTn ∈ Rm×n
115 / 158
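A tiny numpy check of this product representation for an assumed example F (x1, x2) = sin (x1) × x2, so n = 2, l = 2, m = 1 and the state vector is (v−1, v0, v1, v2):

import numpy as np

x1, x2 = 1.5, 0.5
v_m1, v_0 = x1, x2
v_1 = np.sin(v_m1)                 # elemental 1: v1 = sin(v_-1)
v_2 = v_1 * v_0                    # elemental 2: v2 = v1 * v0

I4 = np.eye(4)
A1 = I4.copy(); A1[2, :] = [np.cos(v_m1), 0.0, 0.0, 0.0]   # row n+1 holds c_{1,-1}
A2 = I4.copy(); A2[3, :] = [0.0, v_1, v_0, 0.0]            # row n+2 holds c_{2,0}, c_{2,1}
P = np.hstack([np.eye(2), np.zeros((2, 2))])               # P_n in R^{n x (n+l)}
Q = np.hstack([np.zeros((1, 3)), np.eye(1)])               # Q_m in R^{m x (n+l)}

J = Q @ A2 @ A1 @ P.T                                      # F'(x) = Q_m A_l ... A_1 P_n^T
print(J)                                                   # [[x2*cos(x1), sin(x1)]]
print(np.allclose(J, [[x2 * np.cos(x1), np.sin(x1)]]))     # True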
Images/cinvestav
Then
By transposing the product we obtain the adjoint relation
x̄ = Pn AT1 AT2 · · · ATl−1 ATl ȳ
Given that
ATi = I + [∇φi (ui) − en+i] eTn+i
116 / 158
Images/cinvestav
Therefore
The transformation of any vector (v̄j)1−n≤j≤l
By multiplication with ATi represents an incremental operation.
117 / 158
Images/cinvestav
In detail, one obtains for i = l, ..., 1 the operations
For all j with j ⊀ i
v̄j is left unchanged
For all j with j ≺ i
v̄j is augmented by v̄i cij, with
cij ≡ cij (ui) ≡ ∂φi/∂vj for 1 − n ≤ i, j ≤ l
Subsequently
v̄i is set to zero.
118 / 158
Images/cinvestav
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
119 / 158
Images/cinvestav
Some Remarks
Using the C-style abbreviation
a += b for a ≡ a + b
We may rewrite the matrix-vector product as the adjoint evaluation procedure in the following table
120 / 158
Images/cinvestav
Incremental Adjoint Recursion
We have the following procedure (with ui = (vj)j≺i ∈ Rni)
v̄i ≡ 0   i = 1 − n, ..., l
vi−n ≡ xi   i = 1, ..., n
vi ≡ φi (vj)j≺i   i = 1, ..., l
ym−i ≡ vl−i   i = m − 1, ..., 0
v̄l−i ≡ ȳm−i   i = 0, ..., m − 1
v̄j += v̄i ∂φi (ui) /∂vj for j ≺ i   i = l, ..., 1
x̄i ≡ v̄i−n   i = n, ..., 1
121 / 158
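A minimal tape-based sketch of this recursion in Python (class and function names are mine): the forward pass records the elemental partials cij, and backward() runs the incremental adjoint loop in reverse order.

import math

class Tape:
    def __init__(self):
        self.nodes = []                 # node i -> list of (j, c_ij) for j ≺ i
    def add(self, parents):
        self.nodes.append(parents)
        return len(self.nodes) - 1

class Var:
    def __init__(self, tape, value, parents=()):
        self.tape, self.value = tape, value
        self.idx = tape.add(list(parents))
    def __add__(self, other):
        return Var(self.tape, self.value + other.value,
                   [(self.idx, 1.0), (other.idx, 1.0)])
    def __mul__(self, other):
        return Var(self.tape, self.value * other.value,
                   [(self.idx, other.value), (other.idx, self.value)])

def sin(x):
    return Var(x.tape, math.sin(x.value), [(x.idx, math.cos(x.value))])

def backward(tape, y):
    adj = [0.0] * len(tape.nodes)       # v̄_i ≡ 0
    adj[y.idx] = 1.0                    # seed ȳ = 1
    for i in range(len(tape.nodes) - 1, -1, -1):   # i = l, ..., 1
        for j, cij in tape.nodes[i]:
            adj[j] += adj[i] * cij      # v̄_j += v̄_i c_ij
    return adj

t = Tape()
x1, x2 = Var(t, 1.5), Var(t, 0.5)
grads = backward(t, sin(x1) * x2)
print(grads[x1.idx], grads[x2.idx])     # ≈ x2*cos(x1), sin(x1)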
Images/cinvestav
Explanation
It is assumed as a precondition that the adjoint quantities
v̄i for 1 ≤ i ≤ l have been initialized to zero
As indicated by the range specification i = l, ..., 1
We think of the incremental assignments as being executed in reverse order, i.e., for i = l, l − 1, l − 2, ..., 1.
Only then is it guaranteed that
Each v̄i will reach its full value before it occurs on the right-hand side.
122 / 158
Images/cinvestav
Furthermore
We can combine the incremental operations
Affected by the adjoint of φi into
ūi += v̄i · ∇φi (ui) where ui ≡ (vj)j≺i ∈ Rni
Something Remarkable
We can do something different
One can directly compute the value of the adjoint quantity v̄j by collecting all contributions to it as a sum ranging over all successors i ≻ j.
This non-incremental form
Requires global information that is not easy to come by.
123 / 158
Images/cinvestav
Complexity
Something Notable
TIME {F (x) , ȳT F′ (x)} ≤ ωgrad TIME {F (x)}
Where ωgrad ∈ [3, 4] (the cheap gradient principle)
124 / 158
Images/cinvestav
Remember
Time Complexity
TIME {F (x) , F′ (x) ˙x} ≤ ωtan TIME {F (x)}
Where ωtan ∈ [2, 5/2]
125 / 158
Images/cinvestav
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
126 / 158
Images/cinvestav
Example: a single-layer perceptron
We have
y = σ (Σ3i=1 wixi)
127 / 158
Images/cinvestav
First Phase
Forward Step
v−2 = w1
v−1 = w2
v0 = w3
v1 = x1v−2
v2 = x2v−1
v3 = x3v0
v4 = v1 + v2 + v3
v5 = σ (v4)
y1 = v5
128 / 158
Images/cinvestav
Second Phase
Incremental Return
Forward Step
v−2 = w1
v−1 = w2
v0 = w3
v1 = x1v−2
v2 = x2v−1
v3 = x3v0
v4 = v1 + v2 + v3
v5 = σ (v4)
y1 = v5
Incremental Return
v̄5 = ȳ1 = 1
v̄4 = (∂v5/∂v4) ȳ1 = σ′ (v4)
v̄3 += (∂v4/∂v3) v̄4 = 1 × σ′ (v4)
v̄0 = (∂v3/∂v0) v̄3 = x3 × σ′ (v4)
v̄2 += (∂v4/∂v2) v̄4 = 1 × σ′ (v4)
v̄−1 = (∂v2/∂v−1) v̄2 = x2 × σ′ (v4)
v̄1 += (∂v4/∂v1) v̄4 = 1 × σ′ (v4)
v̄−2 = (∂v1/∂v−2) v̄1 = x1 × σ′ (v4)
w̄3 = x3 × σ′ (v4)
w̄2 = x2 × σ′ (v4)
w̄1 = x1 × σ′ (v4)
129 / 158
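A quick numerical sketch of this return sweep (the input and weight values are made up), checking w̄1 against a central finite difference:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, -1.0, 0.5])        # hypothetical inputs
w = np.array([0.3, 0.1, -0.4])        # hypothetical weights

# Forward step
v4 = w @ x
y = sigmoid(v4)

# Incremental return, as in the table above
bar_v5 = 1.0
bar_v4 = y * (1.0 - y) * bar_v5       # σ'(v4)
bar_w = x * bar_v4                    # w̄_i = x_i σ'(v4)

# Check w̄1 with a central finite difference on w1
eps = 1e-6
num = (sigmoid((w + [eps, 0, 0]) @ x) - sigmoid((w - [eps, 0, 0]) @ x)) / (2 * eps)
print(bar_w[0], num)                  # should agree to ~1e-9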
Images/cinvestav
How does it compare with the Forward Mode?
We noticed that you do the following for each gradient variable
Forward Step; Gradient of the Forward Step
v−2 = w1; ˙v−2 = ˙w1 = 0
v−1 = w2; ˙v−1 = ˙w2 = 0
v0 = w3; ˙v0 = ˙w3 = 1
v1 = x1v−2; ˙v1 = x1 ˙v−2 = 0
v2 = x2v−1; ˙v2 = x2 ˙v−1 = 0
v3 = x3v0; ˙v3 = x3 ˙v0 = x3
v4 = v1 + v2 + v3; ˙v4 = ˙v1 + ˙v2 + ˙v3 = x3
v5 = σ (v4); ˙v5 = σ′ (v4) ˙v4 = x3 × σ′ (v4)
y1 = v5; ˙y1 = ˙v5
130 / 158
Images/cinvestav
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
131 / 158
Images/cinvestav
Let us look at the following example
We have the following system of equations
y1 = σ (w1x)
y2 = σ (w2x)
y3 = σ (w3x)
132 / 158
Images/cinvestav
With the following graph
Notice the difference with a neural network
133 / 158
Images/cinvestav
The Forward mode looks like
We have that
v0 = x; ˙v0 = ˙x = 1
v1 = w1v0; ˙v1 = w1 ˙v0 = w1
v2 = w2v0; ˙v2 = w2 ˙v0 = w2
v3 = w3v0; ˙v3 = w3 ˙v0 = w3
v4 = σ (v1); ˙v4 = σ′ (v1) × ˙v1 = w1 × σ′ (v1)
v5 = σ (v2); ˙v5 = σ′ (v2) × ˙v2 = w2 × σ′ (v2)
v6 = σ (v3); ˙v6 = σ′ (v3) × ˙v3 = w3 × σ′ (v3)
y1 = v4; ˙y1 = ˙v4
y2 = v5; ˙y2 = ˙v5
y3 = v6; ˙y3 = ˙v6
134 / 158
Images/cinvestav
Now you can see it
Forward and Reverse Mode
They depend on the input and output size!!!
A More Formal Definition
For a function f : Rn → Rm, suppose we wish to compute all the
elements of the m × n Jacobian matrix
Ignoring the overhead of building the expression graph
Under this situation, Reverse Mode requires m sweeps (Forward Mode requires n) and performs better when n > m.
135 / 158
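A toy sketch of this sweep-count rule of thumb (it ignores the constants ωtan and ωgrad and any graph-storage overhead):

def sweeps(n_inputs, n_outputs):
    """Full m x n Jacobian: n forward sweeps or m reverse sweeps."""
    return {"forward": n_inputs, "reverse": n_outputs}

def preferred_mode(n_inputs, n_outputs):
    s = sweeps(n_inputs, n_outputs)
    return min(s, key=s.get)

print(preferred_mode(10**6, 1))   # many inputs, one output (a loss): reverse
print(preferred_mode(2, 100))     # few inputs, many outputs: forward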
Images/cinvestav
Consequences for Deep Learning
With a relatively small overhead
The performance of reverse-mode AD is superior when n ≫ m, that is, when we have many inputs and few outputs.
As we saw in the previous examples
If n ≤ m, forward mode performs better
136 / 158
Images/cinvestav
Special Cases
Nevertheless, when we have a comparable number of outputs and inputs
Forward mode can be more efficient,
Since there is less overhead associated with storing the expression graph in memory in forward mode.
For Example
If you have f : Rn → R, when n = 1 forward mode is more efficient,
but the result flips as n increases.
137 / 158
Images/cinvestav
Outline
1 Backpropagation
Introduction
Derivatives of Network Functions
Function Composition, Weights and Addition
The Backpropagation Algorithm Works
Moving everything to Tensors
2 Automatic Differentiation
Introduction
Advantages of Automatic Differentiation
Avoiding Truncation Errors
Differences with Symbolic Differentiation
Difference Quotients May be Useful
RNN Example
A Simple Example
The Forward and Reverse Mode
Forward propagation of Tangents
Forward Mode of a ML Perceptron
Complexity of the Forward Procedure
The Reverse Mode
Dual Process in Reverse Process
Incremental Adjoint Recursion
Example
What Method to Use Forward or Reverse Mode?
3 Basic Implementation of Automatic Differentiation
Source Transformation and Overloading
Building the Computational Graph
Memory Structures
Way More...
4 Conclusions
The Problem of Backpropagation
138 / 158
Images/cinvestav
Be Aware
Be Careful
A computationally naive implementation of AD can result in slow
code and excess use of memory.
Additionally
There exists no standard set of problems spanning the diversity of AD
applications.
139 / 158
Images/cinvestav
Source Transformation in Fortran and C
First
We start with the source code of a computer program that
implements our target function.
Second
A preprocessor then applies differentiation rules to the code
generating new source code which calculates derivatives.
Remember our basic tables
140 / 158
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 

Recently uploaded (20)

Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 

05 backpropagation automatic_differentiation

  • 12. Extended Network: the computation of the error by the network [2]. (Figure: the extended network that computes the error.) 8 / 158
  • 13. Thus, the network can calculate the total error $E = \sum_{i=1}^{N} E_i$. Therefore, the network can be updated using the gradient $\nabla E = \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_l}\right)$ and the update $\Delta w_i = -\gamma \frac{\partial E}{\partial w_i}$ for $i = 1, \ldots, l$. 9 / 158
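As a minimal sketch of this update rule (assuming only that some routine returns the gradient $\nabla E(w)$; the names below are illustrative, not from the lecture):

```python
import numpy as np

# Minimal sketch of the update Δw_i = -γ ∂E/∂w_i (illustrative names; assumes
# grad_E(w) returns the gradient ∇E(w) computed, e.g., by backpropagation).
def gradient_step(w, grad_E, gamma=0.1):
    return w - gamma * grad_E(w)   # w_i <- w_i - γ ∂E/∂w_i for i = 1, ..., l

# Toy usage with E(w) = 0.5 * ||w||^2, so that ∇E(w) = w
w = np.array([1.0, -2.0, 0.5])
for _ in range(3):
    w = gradient_step(w, lambda v: v)
print(w)   # shrinks toward the minimum at 0
```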
  • 16. Now, if we forget everything about learning: given that the network is a complex composition of functions, $E = f_1 \circ f_2 \circ \cdots \circ f_K$, each node has a left and a right side. 11 / 158
  • 18. Furthermore, we separate the integration (summation) and the activation function inside each unit. We can then use this notation to build the forward/backward steps; it is actually the basis for automatic differentiation. 12 / 158
  • 21. First, we have the sequence of derivatives; then we can do the forward step, getting the function compositions. (Figure: function composition during the forward step.) 14 / 158
  • 23. Now, backpropagation: here is the interesting part, you can collect such information by running the graph backwards. (Figure: backpropagation of the constant 1 through the composition.) Now, what else? The aggregation of functions toward the activation functions!!! 15 / 158
  • 25. We add an extra caveat to the graph representation: a weight on an edge of the graph. In the feed-forward step the incoming value is multiplied by the weight, and in the backward process the backpropagated value is multiplied by the same weight. (Figures: feed-forward and backpropagation through a weighted edge.) 16 / 158
  • 27. Function addition: we have the forward step. (Figure: the forward step for the addition of functions.) 17 / 158
  • 28. Then, at the backward step, the incoming values are added before being propagated further. (Figure: the backward step for the addition node, fed with the constant 1.) 18 / 158
  • 30. Backpropagation Algorithm. Consider a network with a single input and a network function F. The derivative F'(x) is computed in two phases:
  1 Feed-forward: the input x is fed into the network. The primitive functions at the nodes and their derivatives are evaluated at each node; the derivatives are stored at the left side of the node.
  2 Backpropagation: the constant 1 is fed into the output unit and the network is run backwards. Incoming information to a node is added, and the result is multiplied by the value stored in the left part of the unit; the result is transmitted to the left of the unit. The result collected at the input unit is the derivative of the network function with respect to x. 20 / 158
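A minimal sketch of the two phases, assuming the simplest case of a chain of single-input units (the Unit class and helper names are illustrative, not from any particular library): each unit stores its value on the right side and its derivative on the left side during the feed-forward phase, and the backward phase multiplies the stored derivatives together.

```python
import math

class Unit:
    """One node of the chain: the right side stores the primitive's value,
    the left side stores its derivative, as in the two-phase algorithm above."""
    def __init__(self, primitive, derivative):
        self.primitive = primitive      # f
        self.derivative = derivative    # f'
        self.left = 0.0                 # stored f'(s)
        self.right = 0.0                # stored f(s)

def forward(units, x):
    """Feed-forward: evaluate each primitive and store value and derivative."""
    value = x
    for u in units:
        u.left = u.derivative(value)
        u.right = u.primitive(value)
        value = u.right
    return value

def backward(units):
    """Backpropagation: feed the constant 1 into the output and run backwards,
    multiplying by the value stored on the left side of each unit."""
    grad = 1.0
    for u in reversed(units):
        grad *= u.left
    return grad                          # dF/dx collected at the input

# Illustrative chain F(x) = sin(x^2): x -> x^2 -> sin
chain = [Unit(lambda s: s * s, lambda s: 2 * s),
         Unit(math.sin, math.cos)]
x = 1.5
F_x = forward(chain, x)
dF_dx = backward(chain)
print(F_x, dF_dx)   # sin(2.25) and cos(2.25) * 2 * 1.5
```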
  • 41. Proof of correctness of the derivatives. Proposition: the backpropagation algorithm computes the derivative of the network function F with respect to the input x correctly. Proof: by induction, assume that the algorithm works for any network with n or fewer nodes. 21 / 158
  • 43. Consider the following network with n + 1 nodes. (Figure: an output unit $\phi$ fed by m subnetworks $F_1, \ldots, F_m$ through the weights $w_1, \ldots, w_m$.) 22 / 158
  • 44. Thus, we have that $F(x) = \phi\left(w_1 F_1(x) + w_2 F_2(x) + \cdots + w_m F_m(x)\right)$, and the derivative is $F'(x) = \phi'(s)\left[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\right]$ with $s = w_1 F_1(x) + w_2 F_2(x) + \cdots + w_m F_m(x)$. 23 / 158
  • 46. Now, we use induction. Take the subgraph of the main graph which contains all the nodes leading to $F_1(x)$: by the induction hypothesis, we can calculate the derivative $F_1'(x)$ by introducing a 1 into its last unit and doing backpropagation, and the same happens for all the other subnetworks. Now, if instead of multiplying by 1 we introduce $\phi'(s)$ and multiply by $w_j$, we get $w_j F_j'(x)\,\phi'(s)$. This can be accomplished by introducing a 1 into the output unit, multiplying by the stored value $\phi'(s)$, and distributing the result to the m units through the edge weight nodes. 24 / 158
  • 49. Basically, we get the derivative: collecting the backpropagated terms yields $\phi'(s)\left[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\right]$. In other words, when the network is run backwards we obtain $F'(x) = \phi'(s)\left[w_1 F_1'(x) + w_2 F_2'(x) + \cdots + w_m F_m'(x)\right]$, so the algorithm works for n + 1 nodes. QED 25 / 158
  • 53. Why not use matrices to process all the individual parts? Imagine the following simple idea: stack the samples as the rows of $X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}$. We know the induced local fields from the input to the hidden layer are then created as $g(X) = XW = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \begin{pmatrix} w_1 & w_2 & \cdots & w_d \end{pmatrix}$. 27 / 158
  • 55. Where we have these constructs: $g_{ij}\left(x_i^T\right) = x_i^T w_j$, so that $g(X) = \begin{pmatrix} g_{11}(x_1^T) & g_{12}(x_1^T) & \cdots & g_{1d}(x_1^T) \\ g_{21}(x_2^T) & g_{22}(x_2^T) & \cdots & g_{2d}(x_2^T) \\ \vdots & \vdots & \ddots & \vdots \\ g_{N1}(x_N^T) & g_{N2}(x_N^T) & \cdots & g_{Nd}(x_N^T) \end{pmatrix}$. 28 / 158
  • 56. Then, with the logistic activation $f_{ij}(x) = \frac{1}{1 + \exp\{-x\}}$, we have $f(g(X)) = \begin{pmatrix} f\left(g_{11}(x_1^T)\right) & f\left(g_{12}(x_1^T)\right) & \cdots & f\left(g_{1d}(x_1^T)\right) \\ f\left(g_{21}(x_2^T)\right) & f\left(g_{22}(x_2^T)\right) & \cdots & f\left(g_{2d}(x_2^T)\right) \\ \vdots & \vdots & \ddots & \vdots \\ f\left(g_{N1}(x_N^T)\right) & f\left(g_{N2}(x_N^T)\right) & \cdots & f\left(g_{Nd}(x_N^T)\right) \end{pmatrix}$. 29 / 158
  • 57. Finally, we can do the following modification during the forward pass: the matrix can be extended so that each entry stores the derivative alongside the value, $g'(X)\,|\,g(X)$, whose entry $(i, j)$ is $\frac{d g_{ij}(x_i^T)}{d w_j}\;\Big|\;x_i^T w_j$. 30 / 158
  • 58. Finally, we have the next function stored in the same way, $f'(g(X))\,|\,f(g(X))$, whose entry $(i, j)$ is $\frac{d f_{ij}(x)}{d x}\left(g_{ij}(x_i^T)\right)\;\Big|\;f\left(g_{ij}(x_i^T)\right)$. 31 / 158
  • 59. Using the Hadamard product, we have for the backpropagation $f'(g(X)) \circ g'(X)$. In particular, for a position $(i, j)$, $\frac{d g_{ij}(x_i^T)}{d w_j} \times \frac{d f_{ij}(x)}{d x}\left(g_{ij}(x_i^T)\right) = \frac{d f_{ij}(x)}{d x}\left(g_{ij}(x_i^T)\right) \times \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{di} \end{pmatrix}$. 32 / 158
  • 61. Then, using a vertical sum, we get the change that is imposed on each possible weight vector $w_j$: $\mathrm{sum}\left(f'(g(X)) \circ g'(X),\ \mathrm{axis} = 0\right) = \left(\sum_{i=1}^{N} \frac{d g_{ij}(x_i^T)}{d w_j} \times \frac{d f_{ij}(x)}{d x}\left(g_{ij}(x_i^T)\right)\right)_{j=1}^{h}$. 33 / 158
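A minimal NumPy sketch of these tensor formulas for one hidden layer (the sizes, the placeholder upstream signal delta, and the variable names are illustrative assumptions): the forward pass keeps $g(X)$, $f(g(X))$ and the stored derivatives, and the backward step combines them with a Hadamard product and a sum over the sample axis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, h = 5, 3, 4                 # samples, input dimension, hidden units
X = rng.normal(size=(N, d))       # rows are x_i^T
W = rng.normal(size=(d, h))       # columns are the weight vectors w_j

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: store values and derivatives
G = X @ W                         # g_ij(x_i^T) = x_i^T w_j
F = sigmoid(G)                    # f(g(X))
dF = F * (1.0 - F)                # f'(g(X)), elementwise logistic derivative
# dg_ij/dw_j = x_i, so the derivative of g with respect to w_j is carried by X

# Backward step: Hadamard product with an upstream signal, then sum over samples
delta = np.ones_like(F)           # placeholder upstream signal (e.g., dE/df)
grad_W = X.T @ (delta * dF)       # column j = sum_i x_i * f'(g_ij(x_i^T)) * delta_ij
print(grad_W.shape)               # (d, h): one gradient column per weight vector w_j
```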
  • 62. Now, a historical perspective. The idea of a graph structure was proposed by Raul Rojas in “Neural Networks – A Systematic Introduction” (1996). TensorFlow was initially released on November 9, 2015; it originated in the “Google Brain” project (circa 2011), so TensorFlow started around 2012–2013 with internal development and DNNResearch's code (Hinton's company). However, the graph idea had already been introduced in 2002 in Torch, the basis of PyTorch (circa 2016); one of Torch's creators, Samy Bengio, is the brother of Yoshua Bengio [3]. 34 / 158
  • 67. Backpropagation, a little brother of Automatic Differentiation (AD). We have a crude way to obtain derivatives [4, 5, 6, 7]: $D_{+h} f(x) \approx \frac{f(x + h) - f(x)}{h}$ or $D_{\pm h} f(x) \approx \frac{f(x + h) - f(x - h)}{2h}$. Huge problems: if $h$ is small, then cancellation error reduces the number of significant figures in $D_{+h} f(x)$; if $h$ is not small, then truncation errors (terms such as $h^2 f'''(x)$) become significant. Even if $h$ is optimally chosen, the values of $D_{+h} f(x)$ and $D_{\pm h} f(x)$ will be accurate to only about $\frac{1}{2}$ or $\frac{2}{3}$ of the significant digits of $f$. 36 / 158
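A short numerical illustration of this trade-off (an assumed example with $f(x) = \sin x$, not from the slides): shrinking $h$ first reduces the truncation error and then lets cancellation take over.

```python
import math

f, df = math.sin, math.cos        # test function and its exact derivative
x = 1.0
for h in (1e-1, 1e-4, 1e-8, 1e-12):
    d_forward = (f(x + h) - f(x)) / h            # D_{+h} f(x)
    d_central = (f(x + h) - f(x - h)) / (2 * h)  # D_{±h} f(x)
    print(f"h={h:.0e}  forward error={abs(d_forward - df(x)):.2e}  "
          f"central error={abs(d_central - df(x)):.2e}")
```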
  • 73. Avoiding truncation errors: algorithmic differentiation does not incur truncation errors. For example, take $f(x) = \sum_{i=1}^{n} x_i^2$ at $x_i = i$ for $i = 1, \ldots, n$. Then, for the unit vector $e_1 \in \mathbb{R}^n$, $\frac{f(x + h e_1) - f(x)}{h} = \frac{\partial f(x)}{\partial x_1} + h = 2 x_1 + h = 2 + h$. 39 / 158
  • 76. Floating points. Given that the quantity $f(x + h e_1) \approx \frac{n^3}{3}$ needs a floating point representation with 64-bit machine accuracy, its roundoff error is about $\epsilon \frac{n^3}{3}$ with $\epsilon = 2^{-54} \approx 10^{-16}$. For $h = \sqrt{\epsilon}$, as is often recommended, the difference quotient has a rounding error of size $\frac{1}{3} n^3 \sqrt{\epsilon} \approx \frac{1}{3} n^3 \times 10^{-8}$. 40 / 158
  • 78. Now, imagine $n = 1000$. Then the rounding error is $\frac{1}{3} \times 1000^3 \times \sqrt{\epsilon} \approx \frac{1}{3} \times 10^{9} \times 10^{-8} = \frac{10}{3} \approx 3.333\ldots$, larger than the true derivative value 2. Ouch: we cannot even get the sign of $\frac{f(x + h e_1) - f(x)}{h}$ correctly!!! 41 / 158
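A sketch of exactly this situation (assuming 64-bit floats and $h = 10^{-8}$; the code is illustrative, not from the slides): the increment $2h$ is swamped when added to $f(x) \approx 3.3 \times 10^{8}$, so the quotient comes out far from the true value 2.

```python
import numpy as np

n = 1000
x = np.arange(1, n + 1, dtype=np.float64)   # x_i = i
f = lambda v: np.sum(v * v)                 # f(x) = sum_i x_i^2
h = 1e-8
e1 = np.zeros(n)
e1[0] = 1.0
quotient = (f(x + h * e1) - f(x)) / h
print(quotient)   # far from 2 (typically 0.0 or about 6, depending on rounding)
```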
  • 80. In contrast, automatic differentiation yields $2 x_i$ in both its forward and reverse modes. You could assume that the derivatives are generated symbolically; this is true in some sense, but the symbolic expression $2 x_i$ is never actually built, as it would be in symbolic differentiation. Instead, the numerical value of $x_i$ is multiplied by 2 and then returned as the gradient value. 42 / 158
  • 83. Example using forward differentiation (we will see the forward procedure later on): $f(x) = \sum_{i=1}^{n} x_i^2$ with $x_i = i$ for $i = 1, \ldots, n$. AD initializes (do not worry, we will see this in more detail) $v_{i-n} = i$ for $i = 1, \ldots, n$ and $\dot v_{i-n} = 0$, except $\dot v_{1-n} = 1$. 43 / 158
  • 85. Then, applying the elemental operations $\varphi$, function by function and derivative by derivative: $v_1 = 1^2$ with $\dot v_1 = \frac{\partial v_1}{\partial v_{1-n}} \dot v_{1-n} = 2 \times 1 \times 1 = 2$, ..., $v_n = n^2$ with $\dot v_n = 0$. Therefore, the intermediate tangents are $(\dot v_1, \ldots, \dot v_n) = (2, 0, \ldots, 0)$ and at the end we obtain $\frac{\partial f}{\partial x_1}(x) = 2$ exactly. 44 / 158
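A minimal forward-mode sketch using dual numbers (an illustrative implementation, not the lecture's code): every value carries the pair $(v, \dot v)$ and each elemental operation propagates the tangent.

```python
class Dual:
    """A value together with its tangent (v, v_dot)."""
    def __init__(self, value, dot=0.0):
        self.value, self.dot = value, dot
    def __add__(self, other):
        return Dual(self.value + other.value, self.dot + other.dot)
    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.dot * other.value + self.value * other.dot)

def f(xs):                           # f(x) = sum_i x_i^2
    total = Dual(0.0)
    for x in xs:
        total = total + x * x
    return total

n = 1000
# Seed only the tangent of x_1 (the direction e_1)
xs = [Dual(float(i), 1.0 if i == 1 else 0.0) for i in range(1, n + 1)]
print(f(xs).dot)   # exactly 2.0 = df/dx_1, no truncation or cancellation error
```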
  • 87. Quite different from this: using a numerical difference we find $\frac{f(x + e_1 h) - f(x)}{h} - 2 < 0$ once rounding takes over. Then, for $n = 10^j$ and $h = 10^{-k}$, the computed quotient satisfies $10^k\left[(h + 1)^2 - 1\right] < 2$, and finally we have $k > -\log_{10} 3\epsilon$. 45 / 158
  • 90. Therefore, it is possible to get into underflow by taking $k > -\log_{10} 3\epsilon$. Therefore, automatic differentiation allows us to obtain the correct answer!!! 46 / 158
  • 93. For example, you have the following equation: $f(x) = \prod_{i=1}^{n} x_i$. Then the gradient is $\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right) = \left(\prod_{j \neq i} x_j\right)_{i = 1, \ldots, n} = \left(x_2 x_3 \cdots x_n,\; \ldots,\; x_1 \cdots x_{i-1} x_{i+1} \cdots x_n,\; \ldots,\; x_1 x_2 \cdots x_{n-1}\right)$. 48 / 158
  • 95. Actually, symbolic differentiation will consume a lot of memory here; instead, AD will reuse the common subexpressions to improve performance and memory. Both symbolic and automatic differentiation make use of the chain rule to achieve their results; however, in AD the chain rule is applied not to the symbolic expressions but to the actual numerical values. 49 / 158
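As an illustration of this reuse (a sketch with assumed helper names): the gradient of $f(x) = \prod_i x_i$ can be assembled from shared prefix and suffix products instead of the $n$ expanded products written above.

```python
import numpy as np

def prod_gradient(x):
    """Gradient of f(x) = prod_i x_i via reused prefix/suffix products."""
    n = len(x)
    prefix = np.ones(n)              # prefix[i] = x_0 * ... * x_{i-1}
    suffix = np.ones(n)              # suffix[i] = x_{i+1} * ... * x_{n-1}
    for i in range(1, n):
        prefix[i] = prefix[i - 1] * x[i - 1]
        suffix[n - 1 - i] = suffix[n - i] * x[n - i]
    return prefix * suffix           # component i equals prod_{j != i} x_j

x = np.array([2.0, 3.0, 5.0, 7.0])
print(prod_gradient(x))              # [105.  70.  42.  30.]
```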
  • 99. The user insight: difference quotients $\frac{f(x + h e_1) - f(x)}{h}$ may sometimes be useful too, and computer algebra packages have really neat ways to simplify expressions. In contrast, current AD packages assume that the given program already calculates the underlying function efficiently. 51 / 158
  • 102. There, AD can automate the gradient generation; the best results will be obtained when AD takes advantage of the user's insight into the structure underlying the program. 52 / 158
  • 104. RNN example. When you look at the Elman recurrent neural network [8]: $h_t = \sigma_h\left(W_{sd} x_t + U_{sh} h_{t-1} + b_h\right)$, $y_t = \sigma_y\left(V_{os} h_t\right)$, $L = \frac{1}{2}\left(y_t - z_t\right)^2$. Here, if you do blind AD, sooner or later you have the product $\frac{\partial h_t}{\partial h_{t-1}} \times \frac{\partial h_{t-1}}{\partial h_{t-2}} \times \frac{\partial h_{t-2}}{\partial h_{t-3}} \times \cdots \times \frac{\partial h_{k+1}}{\partial h_k}$. This is known as Backpropagation Through Time (BPTT), which is a problem given the vanishing or exploding gradient. 54 / 158
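A tiny numerical sketch of why this product is problematic (assumed sizes and scales, only for illustration): repeatedly multiplying a recurrent Jacobian drives the product toward zero or infinity depending on its spectral radius.

```python
import numpy as np

rng = np.random.default_rng(0)
J = 0.2 * rng.normal(size=(8, 8))    # stand-in for dh_t/dh_{t-1}
prod = np.eye(8)
for t in range(1, 51):
    prod = prod @ J                  # the BPTT product of Jacobians
    if t % 10 == 0:
        print(t, np.linalg.norm(prod))   # shrinks toward 0 (vanishing gradient);
                                         # a larger scale for J makes it explode
```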
  • 107. Here, you can modify the architecture: using an intermediate layer and the Hadamard product $\circ$, we have $L = \frac{1}{2}\left(y_t - z_t\right)^2$, $y_t = \sigma_y\left(W_{od} x_t + U_{oh} h_{t-1} + b_o\right)$, $s_t = \sigma_s\left(V_{ho} y_t + D_{hd} x_t + b_h\right)$, $h_t = (1 - y_t) \circ h_{t-1} + y_t \circ s_t$. 55 / 158
  • 108. Therefore, you have multiple paths of derivatives through the unrolled graph. (Figure: the derivative paths in the modified architecture.) 56 / 158
  • 109. One of them: it can be seen that one of the paths can still take you to BPTT. 57 / 158
  • 110. The other one gets you into a more Markovian property, which allows you to get a backpropagation that does not require BPTT. How? For example, the derivative of $L$ with respect to $D_{hd}$ is $\frac{\partial L}{\partial D_{hd}} = \frac{\partial L}{\partial y_t} \times \frac{\partial y_t}{\partial net_y} \times \frac{\partial net_y}{\partial h_{t-1}} \times \frac{\partial h_{t-1}}{\partial s_{t-2}} \times \frac{\partial s_{t-2}}{\partial net_s} \times \frac{\partial net_s}{\partial D_{hd}}$. 58 / 158
  • 113. Therefore, you do not have backpropagation through time: you can avoid it altogether!!! This matters because backpropagation through time makes the process of obtaining the gradients unstable. 59 / 158
  • 115. Thus, a great simplifying step: here the phrase “AD taking advantage of the user's insight” rings true. 60 / 158
  • 117. A simple example. Here we have the following ideas: some of the floating point values generated during the evaluation will be stored in variables of the program, while others are held only temporarily until overwritten or discarded. Thus, we introduce the concept of the evaluation trace, which is basically a record of a particular run of a given program. This evaluation trace stores the input variables, the sequence of floating point values generated by the CPU, and the operations used to produce them. 62 / 158
  • 123. Images/cinvestav Example A simple example
  y = f (x1, x2) = [sin (x1/x2) + x1/x2 − exp (x2)] × [x1/x2 − exp (x2)]
  We wish to calculate y = f (x1, x2) with x1 = 1.5, x2 = 0.5 63 / 158
  • 125. Images/cinvestav Evaluation Trace/Forward Procedure We have the table for the evaluation of the function
  v−1 = x1 = 1.5
  v0 = x2 = 0.5
  v1 = v−1/v0 = 1.5/0.5 = 3.0
  v2 = sin (v1) = sin (3.0) = 0.1411
  v3 = exp (v0) = exp (0.5) = 1.6487
  v4 = v1 − v3 = 3.0 − 1.6487 = 1.3513
  v5 = v2 + v4 = 0.1411 + 1.3513 = 1.4924
  v6 = v5 × v4 = 1.4924 × 1.3513 = 2.0167
  y = v6 = 2.0167 64 / 158
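To make the trace concrete, here is a minimal straight-line sketch of it in plain Python (the function name f_trace and the variable names v_m1, v_0, ... are ours, not from the slides); it simply replays the table above.

```python
import math

# Evaluation trace of y = f(x1, x2) at x1 = 1.5, x2 = 0.5 (a sketch, not library code)
def f_trace(x1, x2):
    v_m1 = x1                 # v_{-1}
    v_0  = x2                 # v_0
    v_1  = v_m1 / v_0         # 3.0
    v_2  = math.sin(v_1)      # 0.1411
    v_3  = math.exp(v_0)      # 1.6487
    v_4  = v_1 - v_3          # 1.3513
    v_5  = v_2 + v_4          # 1.4924
    v_6  = v_5 * v_4          # 2.0167
    return v_6

print(f_trace(1.5, 0.5))      # approximately 2.0167
```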
  • 126. Images/cinvestav A Cautionary Note Normally Programmers will try to rearrange this execution trace to improve performance through parallelism. Thus Subexpressions will be algorithmically exploited by the AD to improve performance. It is usually more convenient to use The so called “computational graph” 65 / 158
  • 130. Images/cinvestav Please take a look at the section “A Framework for Evaluating Functions” in Chapter 2 of the book [7] Andreas Griewank and Andrea Walther, Evaluating derivatives: principles and techniques of algorithmic differentiation vol. 105, (Siam, 2008). 67 / 158
  • 131. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 67 / 158
  • 132. Images/cinvestav A Little Bit of Notation In general, we assume quantities vi partitioned as (v1−n, ..., v0) ↔ x, the intermediates (v1, ..., vl−m), and (vl−m+1, ..., vl) ↔ y Then, we have 1 v1−n, ..., v0 are the initial input variables 2 vl−m+1, ..., vl the output variables 3 v1, ..., vl−m the intermediate variables 68 / 158
  • 134. Images/cinvestav Additionally Where each value vi with i > 0 is obtained by applying an elemental function φi vi = φi (vj)j≺i where j ≺ i means that vi depends directly on vj 69 / 158
  • 135. Images/cinvestav Then, for the application of the chain rule It is useful to associate with each elemental function φi the state transformation v(i) = Φi (v(i−1)) with Φi : Rn+l → Rn+l, where v(i) = (v1−n, ..., vi, 0, ..., 0)T In other words Φi sets vi to φi (vj)j≺i and keeps all other components vj for j ≠ i unchanged. 70 / 158
  • 138. Images/cinvestav Basically the Computational Graph A Simpler Version 71 / 158
  • 139. Images/cinvestav Example of the Forward Mode Suppose we want to differentiate y = f (x1, x2) with respect to x1 We consider x1 as an independent variable and y as a dependent variable. We can work out the numerical value of the derivative of y = f (x1, x2) By getting the numerical derivative of each of its components Something like ˙vi = ∂vi/∂x1 72 / 158
  • 142. Images/cinvestav Therefore, we get We have the Procedure
  v−1 = x1 = 1.5, ˙v−1 = 1.0
  v0 = x2 = 0.5, ˙v0 = 0.0
  v1 = v−1/v0 = 1.5/0.5 = 3.0, ˙v1 = (∂v1/∂v−1) ˙v−1 + (∂v1/∂v0) ˙v0 = 2.0
  v2 = sin (v1) = sin (3.0) = 0.1411, ˙v2 = cos (v1) × ˙v1 = −1.98
  v3 = exp (v0) = exp (0.5) = 1.6487, ˙v3 = v3 × ˙v0 = 0.0
  v4 = v1 − v3 = 3.0 − 1.6487 = 1.3513, ˙v4 = ˙v1 − ˙v3 = 2.0
  v5 = v2 + v4 = 0.1411 + 1.3513 = 1.4924, ˙v5 = ˙v2 + ˙v4 = 0.02
  v6 = v5 × v4 = 1.4924 × 1.3513 = 2.0167, ˙v6 = ˙v5 × v4 + v5 × ˙v4 = 3.0118
  y = v6 = 2.0167, ˙y = 3.0118 73 / 158
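Both columns of the table can be written out by hand as a single forward sweep; the following sketch (the dv_* names are ours) reproduces the values and the tangents for the seed ˙x1 = 1, ˙x2 = 0.

```python
import math

# Forward sweep: each value v_k is paired with its tangent dv_k = d v_k / d x1
x1, x2 = 1.5, 0.5
v_m1, dv_m1 = x1, 1.0                     # seed: differentiate w.r.t. x1
v_0,  dv_0  = x2, 0.0
v_1 = v_m1 / v_0;    dv_1 = dv_m1 / v_0 - v_m1 * dv_0 / v_0 ** 2   # 2.0
v_2 = math.sin(v_1); dv_2 = math.cos(v_1) * dv_1                   # -1.98
v_3 = math.exp(v_0); dv_3 = v_3 * dv_0                             # 0.0
v_4 = v_1 - v_3;     dv_4 = dv_1 - dv_3                            # 2.0
v_5 = v_2 + v_4;     dv_5 = dv_2 + dv_4                            # 0.02
v_6 = v_5 * v_4;     dv_6 = dv_5 * v_4 + v_5 * dv_4                # 3.0118
print(v_6, dv_6)     # y = 2.0167, dy/dx1 = 3.0118
```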
  • 143. Images/cinvestav The first Column of this process It can be seen as an automatic procedure
  vi−n = xi, i = 1...n
  vi = φi (vj)j≺i, i = 1...l
  ym−i = vl−i, i = m − 1...0 74 / 158
  • 144. Images/cinvestav In a similar way We can obtain ∂f (x1, x2)/∂x2 However, it can be more efficient to redefine the ˙vi as vectors, so that all directional derivatives are propagated in a single pass!!! 75 / 158
  • 145. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 76 / 158
  • 146. Images/cinvestav Forward propagation of Tangents Remarks As you can see, the second column of the evaluation procedure is obtained in a mechanical way This increases the size of the computation Basically, twice the size of the original simple evaluation. 77 / 158
  • 148. Images/cinvestav We have the following We have the chain rule ˙y (t) = ∂F (x (t))/∂t = F′ (x (t)) ˙x (t) Where F′ (x) ∈ Rm×n is the Jacobian Matrix Here, we will be tempted to calculate ˙y (t) By evaluating the full Jacobian F′ (x) then multiplying by ˙x (t) 78 / 158
  • 151. Images/cinvestav However Such an approach is quite uneconomical Unless many tangents need to be calculated, as in a Newton step. A simpler version: differentiate the first column of the table
  vi−n = xi, i = 1, ..., n
  vi = φi (vj)j≺i, i = 1, ..., l
  ym−i = vl−i, i = m − 1, ..., 0
  where j ≺ i means vi depends directly on vj (the graph propagation of the dependencies) 79 / 158
  • 153. Images/cinvestav Which can be seen as Forward Propagation of Tangents Basically, we can think of the forward mode as a propagation of tangents 80 / 158
  • 154. Images/cinvestav The Automatic Procedure Therefore, we have the following automatic procedure, where j ≺ i means vi depends directly on vj and ui = (vj)j≺i ∈ Rni
  vi−n ≡ xi, ˙vi−n ≡ ˙xi, i = 1...n
  vi ≡ φi (vj)j≺i, ˙vi ≡ Σj≺i (∂φi (ui)/∂vj) ˙vj, i = 1...l
  ym−i ≡ vl−i, ˙ym−i ≡ ˙vl−i, i = m − 1...0 81 / 158
  • 155. Images/cinvestav Therefore Each elemental assignment vi = φi (ui) You have the corresponding ˙vi = Σj≺i (∂φi (ui)/∂vj) × ˙vj = Σj≺i cij × ˙vj Abbreviating ˙ui = (˙vj)j≺i ˙vi = ˙φi (ui, ˙ui) = φ′i (ui) ˙ui Where ˙φi : R2ni → R It is called the tangent function associated with the elemental φi. 82 / 158
  • 158. Images/cinvestav Now Question What is the correct order of evaluation? 83 / 158
  • 159. Images/cinvestav Why the question? Until now, we have always placed the tangent statement yielding ˙vi after the underlying value vi This order of calculation seems natural and certainly yields correct results as long as there is no overwriting. Then the order of the 2l statements in the middle part of the table does not matter
  vi−n ≡ xi, ˙vi−n ≡ ˙xi, i = 1...n
  vi ≡ φi (vj)j≺i, ˙vi ≡ Σj≺i (∂φi (ui)/∂vj) ˙vj, i = 1...l
  ym−i ≡ vl−i, ˙ym−i ≡ ˙vl−i, i = m − 1...0 84 / 158
  • 161. Images/cinvestav Here, we have a big problem in Cache Imagine that we have a single block of memory, so that vi and one of its arguments vj live in the same memory cell in the cache (figure: CPU registers, cache and main memory) 85 / 158
  • 162. Images/cinvestav This is known as Cache Aliasing Definition Cache aliasing occurs when multiple mappings to a physical page of memory have conflicting caching states, such as cached and uncached: the same physical address can be mapped to multiple virtual addresses. On ARMv4 and ARMv5 processors, the cache is organized as virtually-indexed, virtually-tagged (VIVT) Cache lookups are faster because the translation look-aside buffer (TLB) is not involved in matching cache lines for a virtual address. However This caching method does require more frequent cache flushing because of cache aliasing. 86 / 158
  • 165. Images/cinvestav Then The value of ˙vi = ˙φi (ui, ˙ui) will be incorrect Once we overwrite vi = φi (ui) ADIFOR and Tapenade [9, 5] They put the derivative statement ahead of the original assignment, so the tangent is updated before the original value is overwritten. On the other hand For most univariate functions v = φ (u), it is better to obtain the undifferentiated value first Then to use it in the tangent function ˙φ 87 / 158
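A contrived sketch of the overwriting issue (the variable names are ours): for an assignment that overwrites one of its own arguments, forming the tangent after the value update silently uses the new value and gives the wrong derivative.

```python
# Tangent of the overwriting assignment v = v * v at v = 3 (true derivative: 2*v = 6)
v, dv = 3.0, 1.0
dv = 2.0 * v * dv          # tangent first (or simultaneously): uses the old v -> 6.0
v = v * v                  # now overwrite the value -> 9.0

v2, dv2 = 3.0, 1.0
v2 = v2 * v2               # value first: the old v2 is lost
dv2 = 2.0 * v2 * dv2       # uses the overwritten v2 -> 18.0, which is wrong

print(dv, dv2)             # 6.0 18.0
```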
  • 168. Images/cinvestav In this presentation We will list ϕ and ˙ϕ Side by side in a common bracket to indicate that they should be evaluated simultaneously Then sharing results is immediate. 88 / 158
  • 170. Images/cinvestav Classic Tangent Operations We have a series of improvements on the tangent equations, listing φ and the pair [φ, ˙φ] side by side
  v = c : v = c, ˙v = 0
  v = u ± w : ˙v = ˙u ± ˙w, v = u ± w
  v = u × w : ˙v = ˙u × w + u × ˙w, v = u × w
  v = 1/u : v = 1/u, ˙v = −v × (v × ˙u)
  v = u^c : v = u^c, ˙v = c × v × (˙u/u)
  v = √u : v = √u, ˙v = 0.5 × ˙u/v
  v = exp (u) : v = exp (u), ˙v = v × ˙u
  v = log (u) : ˙v = ˙u/u, v = log (u)
  v = sin (u) : ˙v = cos (u) × ˙u, v = sin (u) 89 / 158
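These rules are exactly what a forward-mode implementation has to provide for each elemental. A minimal dual-number sketch (the class name Dual and the helper functions are ours, not a library API) covering a few of the rows and reproducing ∂y/∂x1 for the running example:

```python
import math

class Dual:
    """Value paired with its tangent; each operator applies one tangent rule."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o): return Dual(self.val + o.val, self.dot + o.dot)
    def __sub__(self, o): return Dual(self.val - o.val, self.dot - o.dot)
    def __mul__(self, o): return Dual(self.val * o.val,
                                      self.dot * o.val + self.val * o.dot)
    def __truediv__(self, o):
        inv = 1.0 / o.val                       # reuses the v = 1/u rule
        return Dual(self.val * inv, (self.dot - self.val * inv * o.dot) * inv)

def sin(u): return Dual(math.sin(u.val), math.cos(u.val) * u.dot)
def exp(u): v = math.exp(u.val); return Dual(v, v * u.dot)

# dy/dx1 for the running example: seed x1 with tangent 1
x1, x2 = Dual(1.5, 1.0), Dual(0.5, 0.0)
c = x1 / x2
y = (sin(c) + c - exp(x2)) * (c - exp(x2))
print(y.val, y.dot)   # 2.0167..., 3.0118...
```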
  • 171. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 90 / 158
  • 172. Images/cinvestav Now Imagine the following network Something simple for our sake 91 / 158
  • 173. Images/cinvestav Forward mode to get the derivative with respect to w11 (the weight on x1)
  v−11 = w11, ..., v−6 = w23, v−5 = w31, ..., v−2 = w42, v−1 = w51, v0 = w52
  ˙v−11 = 1, ˙v−10 = 0, ..., ˙v0 = 0
  v1 = w11x1 + w12x2 + w13x3, ˙v1 = x1
  v2 = w21x1 + w22x2 + w23x3, ˙v2 = 0
  v3 = 1/(1 + exp (−v1)), ˙v3 = v3 [1 − v3] × x1
  v4 = 1/(1 + exp (−v2)), ˙v4 = 0
  v5 = w31v3 + w32v4, ˙v5 = w31 × ˙v3
  v6 = w41v3 + w42v4, ˙v6 = w41 × ˙v3
  v7 = 1/(1 + exp (−v5)), ˙v7 = v7 [1 − v7] × ˙v5
  v8 = 1/(1 + exp (−v6)), ˙v8 = v8 [1 − v8] × ˙v6
  v9 = w51v7 + w52v8, ˙v9 = w51 × ˙v7 + w52 × ˙v8
  v10 = 1/(1 + exp (−v9)), ˙v10 = v10 [1 − v10] × ˙v9 92 / 158
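Written as code, the same sweep looks as follows; a sketch under the assumption that the figure shows a 3–2–2–1 network, with our own weight layout and function names.

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_mode_dw11(x, W1, W2, w5):
    # W1: 2x3 first-layer weights, W2: 2x2 second-layer weights, w5: 2 output weights.
    # Seed: the tangent of w11 is 1, every other weight tangent is 0.
    v1 = sum(W1[0][i] * x[i] for i in range(3)); dv1 = x[0]
    v2 = sum(W1[1][i] * x[i] for i in range(3)); dv2 = 0.0   # w11 never reaches v2
    v3 = sigma(v1); dv3 = v3 * (1 - v3) * dv1
    v4 = sigma(v2); dv4 = 0.0
    v5 = W2[0][0] * v3 + W2[0][1] * v4; dv5 = W2[0][0] * dv3
    v6 = W2[1][0] * v3 + W2[1][1] * v4; dv6 = W2[1][0] * dv3
    v7 = sigma(v5); dv7 = v7 * (1 - v7) * dv5
    v8 = sigma(v6); dv8 = v8 * (1 - v8) * dv6
    v9 = w5[0] * v7 + w5[1] * v8; dv9 = w5[0] * dv7 + w5[1] * dv8
    v10 = sigma(v9); dv10 = v10 * (1 - v10) * dv9
    return v10, dv10   # network output and its derivative w.r.t. w11
```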
  • 174. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 93 / 158
  • 175. Images/cinvestav Complexity of the Procedure Time Complexity TIME {F (x), F′ (x) ˙x} ≤ wtan TIME {F (x)} Where wtan ∈ [2, 5/2] Space Complexity SPACE {F (x), F′ (x) ˙x} ≤ 2 SPACE {F (x)} 94 / 158
  • 177. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 95 / 158
  • 178. Images/cinvestav Here, an essential observation The cost of evaluating derivatives by propagating them forward increases linearly with the number of directions ˙x along which we want to differentiate. It looks inevitable But it is possible to avoid this complexity by Observing that the gradient of a single dependent variable can be obtained for a fixed multiple of the cost of evaluating the underlying scalar-valued function. 96 / 158
  • 180. Images/cinvestav We choose instead an output variable We use the term “reverse mode” for this technique Because the label “backward differentiation” is already well established elsewhere [10, 11]. Therefore, for an output y = f (x1, x2) We have for each variable vi the adjoint v̄i = ∂y/∂vi (Adjoint Variable) 97 / 158
  • 182. Images/cinvestav Actually This is an abuse of notation We mean differentiation with respect to a new independent variable δi v̄i = ∂y/∂δi (Adjoint Variable) Which can be thought of as adding a small numerical value δi to vi vi + δi → f (x1, x2) + v̄iδi As a perturbation in variational calculus 98 / 158
  • 184. Images/cinvestav Actually, you propagate the Normal vectors That is, ȳ and the v̄i are normals or cotangents 99 / 158
  • 185. Images/cinvestav Then, we have The following sought mapping x̄T = ȳT F′ (x) Observation Here, ȳ is a fixed vector that plays a dual role to the domain direction ˙x. In the Forward Procedure, you compute ˙y = F′ (x) ˙x ≡ ˙F (x, ˙x) 100 / 158
  • 188. Images/cinvestav Instead In the Reverse Procedure, you compute x̄T = ȳT F′ (x) ≡ F̄ (x, ȳ) Where F and F̄ are evaluated together Thus, we have a dual process 101 / 158
  • 190. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 102 / 158
  • 191. Images/cinvestav Dual Process Here, we have that the hyperplane ȳT y = c in the range of F has inverse image {x | ȳT F (x) = c} 103 / 158
  • 192. Images/cinvestav The implicit function theorem Theorem Let F : Rn+m → R be a continuously differentiable function and let x⁰ = (x⁰1, x⁰2, ..., x⁰m+n) be a point with F (x⁰) = c. If ∂F (x⁰)/∂xm+n ≠ 0, then there exists a neighborhood of x⁰ so that whenever (x1, ..., xn+m−1) is close enough to (x⁰1, ..., x⁰m+n−1), there is a unique z so that F (x1, ..., xn+m−1, z) = c. Furthermore, z = g (x1, ..., xn+m−1), a continuous function of (x1, ..., xn+m−1). 104 / 158
  • 193. Images/cinvestav Therefore The set {x | ȳT F (x) = c} It is a smooth hypersurface with the normal x̄T = ȳT F′ (x) at x, provided that x̄ does not vanish. 105 / 158
  • 194. Images/cinvestav The Process Here, we have that the hyperplane ȳT y = c in the range of F has inverse image {x | ȳT F (x) = c} (figure: Forward Procedure vs. Reverse Procedure) 106 / 158
  • 195. Images/cinvestav Therefore When m = 1, then F = f is scalar-valued We obtain, with ȳ = 1 ∈ R, the familiar gradient ∇f (x) = ȳT F′ (x). Something Notable We will look only at the main procedure of the Incremental Adjoint Recursion Please take a look at the section “Derivation by Matrix-Product Reversal” in the book [7] Andreas Griewank and Andrea Walther, Evaluating derivatives: principles and techniques of algorithmic differentiation vol. 105, (Siam, 2008). 107 / 158
  • 198. Images/cinvestav The derivation of the reverse mode For this, we will use
  vi−n ≡ xi, ˙vi−n ≡ ˙xi, i = 1...n
  vi ≡ φi (vj)j≺i, ˙vi ≡ Σj≺i (∂φi (ui)/∂vj) ˙vj, i = 1...l
  ym−i ≡ vl−i, ˙ym−i ≡ ˙vl−i, i = m − 1...0
  And the identity ȳT ˙y = x̄T ˙x 108 / 158
  • 200. Images/cinvestav Now, using the state transformation Φ We map from x to y = F (x) as the composition y = QmΦl ◦ Φl−1 ◦ · · · ◦ Φ2 ◦ Φ1 PT n x Where Pn ≡ [I, 0, ..., 0] ∈ Rn×(n+l) and Qm ≡ [0, 0, ..., I] ∈ Rm×(n+l) They are matrices that project an arbitrary (n + l)-vector Onto its first n and last m components. 109 / 158
  • 202. Images/cinvestav Where The cij’s represent the elemental partial derivatives cij ≡ cij (ui) ≡ ∂φi/∂vj for j ≺ i, with 1 − n ≤ j < i ≤ l 110 / 158
  • 203. Images/cinvestav Labeling the elemental partials as cij We get the state Jacobian Ai ≡ Φ′i ∈ R(n+l)×(n+l), which equals the identity matrix except for its (n + i)th row, which is (ci,1−n, ci,2−n, ..., ci,i−1, 0, ..., 0); that is, the cij occur in the (n + i)th row of Ai. 111 / 158
  • 204. Images/cinvestav Remarks The square matrices Ai are lower triangular They may also be written as rank-one perturbations of the identity, Ai = I + en+i [∇φi (ui) − en+i]T Where ej denotes the jth Cartesian basis vector in Rn+l Then, differentiating the composition of functions gives ˙y = Qm Al Al−1 · · · A2 A1 PTn ˙x 112 / 158
  • 206. Images/cinvestav Embeddings The multiplication by PTn ∈ R(n+l)×n It embeds ˙x into Rn+l Corresponding to the first part of the tangent recursion The subsequent multiplications by the Ai They generate one component ˙vi at a time, according to the middle part 113 / 158
  • 209. Images/cinvestav Finally Qm extracts the last m components as ˙y, corresponding to the third part of the table
  vi−n ≡ xi, ˙vi−n ≡ ˙xi, i = 1...n
  vi ≡ φi (vj)j≺i, ˙vi ≡ Σj≺i (∂φi (ui)/∂vj) ˙vj, i = 1...l
  ym−i ≡ vl−i, ˙ym−i ≡ ˙vl−i, i = m − 1...0 114 / 158
  • 210. Images/cinvestav Now By comparison with ˙y (t) = ∂F (x (t))/∂t = F′ (x (t)) ˙x (t) We have in fact a product representation of the full Jacobian F′ (x) = Qm Al Al−1 · · · A2 A1 PTn ∈ Rm×n 115 / 158
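A small numeric check of this product representation for the running example (n = 2, l = 6, m = 1), written with numpy; the helper name, the dictionary layout, and the finite-difference comparison are ours and serve only as a sanity check.

```python
import numpy as np

def state_jacobians(x1, x2, n=2, l=6):
    v = {-1: x1, 0: x2}
    v[1] = v[-1] / v[0];  v[2] = np.sin(v[1]); v[3] = np.exp(v[0])
    v[4] = v[1] - v[3];   v[5] = v[2] + v[4];  v[6] = v[5] * v[4]
    partials = {                                  # c_ij = d(phi_i)/d(v_j)
        1: {-1: 1 / v[0], 0: -v[-1] / v[0] ** 2},
        2: {1: np.cos(v[1])},
        3: {0: np.exp(v[0])},
        4: {1: 1.0, 3: -1.0},
        5: {2: 1.0, 4: 1.0},
        6: {5: v[4], 4: v[5]},
    }
    A = []
    for i in range(1, l + 1):
        Ai = np.eye(n + l)
        Ai[n + i - 1, :] = 0.0                    # row that assigns v_i
        for j, cij in partials[i].items():
            Ai[n + i - 1, n + j - 1] = cij        # column of argument v_j
        A.append(Ai)
    return A

n, l, m = 2, 6, 1
A = state_jacobians(1.5, 0.5)
P = np.hstack([np.eye(n), np.zeros((n, l))])          # P_n
Q = np.hstack([np.zeros((m, n + l - m)), np.eye(m)])  # Q_m
J = Q
for Ai in reversed(A):                                 # Q_m A_l ... A_1
    J = J @ Ai
J = J @ P.T                                            # full 1 x 2 Jacobian

f = lambda a, b: (np.sin(a / b) + a / b - np.exp(b)) * (a / b - np.exp(b))
eps = 1e-6
fd = [(f(1.5 + eps, 0.5) - f(1.5 - eps, 0.5)) / (2 * eps),
      (f(1.5, 0.5 + eps) - f(1.5, 0.5 - eps)) / (2 * eps)]
print(J.ravel(), fd)   # both approximately [3.0118, -13.7243]
```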
  • 212. Images/cinvestav Then By transposing the product we obtain the adjoint relation x̄ = Pn AT1 AT2 · · · ATl−1 ATl ȳ Given that ATi = I + [∇φi (ui) − en+i] eTn+i 116 / 158
  • 214. Images/cinvestav Therefore The transformation of any adjoint vector (v̄j)1−n≤j≤l By multiplication with ATi represents an incremental operation. 117 / 158
  • 215. Images/cinvestav In detail, one obtains for i = l, ..., 1 the operations For all j with j ⊀ i, v̄j is left unchanged For all j with j ≺ i, v̄j is incremented by v̄i cij, where cij ≡ cij (ui) ≡ ∂φi/∂vj Subsequently v̄i is set to zero. 118 / 158
  • 218. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 119 / 158
  • 219. Images/cinvestav Some Remarks Using the C-style abbreviation a += b for a ≡ a + b We may rewrite the matrix-vector product as the adjoint evaluation procedure in the following table 120 / 158
  • 220. Images/cinvestav Incremental Adjoint Recursion We have the following procedure (ui = (vj)j≺i ∈ Rni)
  v̄i ≡ 0, i = 1 − n...l
  vi−n ≡ xi, i = 1...n
  vi ≡ φi (vj)j≺i, i = 1...l
  ym−i ≡ vl−i, i = m − 1...0
  v̄l−i ≡ ȳm−i, i = 0...m − 1
  v̄j += v̄i ∂φi (ui)/∂vj for j ≺ i, i = l...1
  x̄i ≡ v̄i−n, i = n...1 121 / 158
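A minimal tape-based sketch of this recursion for the running example (the list named tape and the dictionaries are our own bookkeeping, not a library API): a single reverse sweep delivers both input adjoints.

```python
import math

x1, x2 = 1.5, 0.5

# Forward sweep: record values and elemental partials c_ij on a "tape"
v = {-1: x1, 0: x2}
v[1] = v[-1] / v[0]
v[2] = math.sin(v[1])
v[3] = math.exp(v[0])
v[4] = v[1] - v[3]
v[5] = v[2] + v[4]
v[6] = v[5] * v[4]
tape = [
    (1, {-1: 1 / v[0], 0: -v[-1] / v[0] ** 2}),
    (2, {1: math.cos(v[1])}),
    (3, {0: math.exp(v[0])}),
    (4, {1: 1.0, 3: -1.0}),
    (5, {2: 1.0, 4: 1.0}),
    (6, {5: v[4], 4: v[5]}),
]

# Reverse sweep: adjoints initialized to zero, seed the output, run i = l..1
vbar = {k: 0.0 for k in v}
vbar[6] = 1.0
for i, partials in reversed(tape):
    for j, cij in partials.items():
        vbar[j] += vbar[i] * cij     # v_j_bar += v_i_bar * d(phi_i)/d(v_j)

print(v[6], vbar[-1], vbar[0])       # y = 2.0167, dy/dx1 = 3.0118, dy/dx2 = -13.72
```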
  • 221. Images/cinvestav Explanation It is assumed as a precondition that the adjoint quantities v̄i for 1 ≤ i ≤ l have been initialized to zero As indicated by the range specification i = l, ..., 1 we think of the incremental assignments as being executed in reverse order, i.e., for i = l, l − 1, l − 2, ..., 1. Only then is it guaranteed that Each v̄i will reach its full value before it occurs on the right-hand side. 122 / 158
  • 224. Images/cinvestav Furthermore We can combine the incremental operations Effected by the adjoint of φi into ūi += v̄i · ∇φi (ui), where ui ≡ (vj)j≺i ∈ Rni Something Remarkable We can do something different: one can directly compute the value of the adjoint quantity v̄j by collecting all contributions to it as a sum ranging over all successors i ≻ j. This non-incremental form Requires global information that is not easy to come by. 123 / 158
  • 227. Images/cinvestav Complexity Something Notable TIME {F (x), ȳT F′ (x)} ≤ wgrad TIME {F (x)} Where wgrad ∈ [3, 4] (The cheap gradient principle) 124 / 158
  • 228. Images/cinvestav Remember Time Complexity TIME {F (x), F′ (x) ˙x} ≤ wtan TIME {F (x)} Where wtan ∈ [2, 5/2] 125 / 158
  • 229. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 126 / 158
  • 230. Images/cinvestav Example a single layer perceptron We have y = σ (w1x1 + w2x2 + w3x3) 127 / 158
  • 231. Images/cinvestav First Phase Forward Step
  v−2 = w1, v−1 = w2, v0 = w3
  v1 = x1v−2, v2 = x2v−1, v3 = x3v0
  v4 = v1 + v2 + v3
  v5 = σ (v4)
  y1 = v5 128 / 158
  • 232. Images/cinvestav Second Phase Incremental Return Forward Step
  v−2 = w1, v−1 = w2, v0 = w3
  v1 = x1v−2, v2 = x2v−1, v3 = x3v0
  v4 = v1 + v2 + v3
  v5 = σ (v4)
  y1 = v5
  Incremental Return
  v̄5 = ȳ1 = 1
  v̄4 = (∂v5/∂v4) ȳ1 = σ′ (v4)
  v̄3 += (∂v4/∂v3) v̄4 = 1 × σ′ (v4), v̄0 = (∂v3/∂v0) v̄3 = x3 × σ′ (v4)
  v̄2 += (∂v4/∂v2) v̄4 = 1 × σ′ (v4), v̄−1 = (∂v2/∂v−1) v̄2 = x2 × σ′ (v4)
  v̄1 += (∂v4/∂v1) v̄4 = 1 × σ′ (v4), v̄−2 = (∂v1/∂v−2) v̄1 = x1 × σ′ (v4)
  w̄3 = x3 × σ′ (v4), w̄2 = x2 × σ′ (v4), w̄1 = x1 × σ′ (v4) 129 / 158
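The same two phases written as plain code; a sketch with our own function names, where one reverse sweep yields all three weight adjoints at once.

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_value_and_grad(w, x):
    # Forward step
    v1, v2, v3 = x[0] * w[0], x[1] * w[1], x[2] * w[2]
    v4 = v1 + v2 + v3
    v5 = sigma(v4)
    # Incremental return: seed the output adjoint, then walk the trace backwards
    v5_bar = 1.0
    v4_bar = v5 * (1.0 - v5) * v5_bar          # sigma'(v4) for the logistic
    w_bar = [x[0] * v4_bar, x[1] * v4_bar, x[2] * v4_bar]
    return v5, w_bar

y, grad = perceptron_value_and_grad([0.1, -0.2, 0.3], [1.0, 2.0, 3.0])
print(y, grad)
```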
  • 233. Images/cinvestav How does it compare with the Forward Mode? We notice that you have to do the following for each gradient variable (here, the seed is ˙w3 = 1)
  Forward Step; Gradient of Forward Step
  v−2 = w1; ˙v−2 = ˙w1 = 0
  v−1 = w2; ˙v−1 = ˙w2 = 0
  v0 = w3; ˙v0 = ˙w3 = 1
  v1 = x1v−2; ˙v1 = x1 ˙v−2 = 0
  v2 = x2v−1; ˙v2 = x2 ˙v−1 = 0
  v3 = x3v0; ˙v3 = x3 ˙v0 = x3
  v4 = v1 + v2 + v3; ˙v4 = ˙v1 + ˙v2 + ˙v3 = x3
  v5 = σ (v4); ˙v5 = σ′ (v4) ˙v4 = x3 × σ′ (v4)
  y1 = v5; ˙y1 = ˙v5 130 / 158
  • 234. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 131 / 158
  • 235. Images/cinvestav Let us look at the following example We have the following system of equations y1 = σ (w1x), y2 = σ (w2x), y3 = σ (w3x) 132 / 158
  • 236. Images/cinvestav With the following graph Notice the difference with a neural network 133 / 158
  • 237. Images/cinvestav The Forward mode looks like We have that
  v0 = x; ˙v0 = ˙x = 1
  v1 = w1v0, ˙v1 = w1 ˙v0 = w1
  v2 = w2v0, ˙v2 = w2 ˙v0 = w2
  v3 = w3v0, ˙v3 = w3 ˙v0 = w3
  v4 = σ (v1), ˙v4 = σ′ (v1) × ˙v1 = w1 × σ′ (v1)
  v5 = σ (v2), ˙v5 = σ′ (v2) × ˙v2 = w2 × σ′ (v2)
  v6 = σ (v3), ˙v6 = σ′ (v3) × ˙v3 = w3 × σ′ (v3)
  y1 = v4; ˙y1 = ˙v4
  y2 = v5; ˙y2 = ˙v5
  y3 = v6; ˙y3 = ˙v6 134 / 158
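As code, a single forward sweep with the seed ˙x = 1 already carries all three output tangents, since there is only one input; the weight values below are made up for illustration.

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

x, dx = 2.0, 1.0            # seed the single input
w = [0.5, -1.0, 0.25]       # hypothetical weights w1, w2, w3
y, dy = [], []
for wi in w:
    vi, dvi = wi * x, wi * dx
    yi = sigma(vi)
    y.append(yi)
    dy.append(yi * (1.0 - yi) * dvi)   # dy_k/dx
print(dy)   # all three derivatives from one sweep (n = 1 < m = 3)
```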
  • 238. Images/cinvestav Now you can see it Forward and Reverse Mode They depend on the input and output size!!! A More Formal Definition For a function f : Rn → Rm, suppose we wish to compute all the elements of the m × n Jacobian matrix, ignoring the overhead of building the expression graph Under this situation Reverse mode requires m sweeps (forward mode requires n) and performs better when n > m. 135 / 158
  • 241. Images/cinvestav Consequences for Deep Learning With a relatively small overhead The performance of reverse-mode AD is superior when n ≫ m, that is, when we have many inputs and few outputs. As we saw in the previous examples If n ≤ m, forward mode performs better 136 / 158
  • 243. Images/cinvestav Special Cases Nevertheless, when we have a comparable number of outputs and inputs Forward mode can be more efficient, since there is less overhead associated with storing the expression graph in memory in forward mode. For Example If you have f : Rn → R, when n = 1 forward mode is more efficient, but the result flips as n increases. 137 / 158
  • 245. Images/cinvestav Outline 1 Backpropagation Introduction Derivatives of Network Functions Function Composition, Weights and Addition The Backpropagation Algorithm Works Moving everything to Tensors 2 Automatic Differentiation Introduction Advantages of Automatic Differentiation Avoiding Truncation Errors Differences with Symbolic Differentiation Difference Quotients May be Useful RNN Example A Simple Example The Forward and Reverse Mode Forward propagation of Tangents Forward Mode of a ML Perceptron Complexity of the Forward Procedure The Reverse Mode Dual Process in Reverse Process Incremental Adjoint Recursion Example What Method to Use Forward or Reverse Mode? 3 Basic Implementation of Automatic Differentiation Source Transformation and Overloading Building the Computational Graph Memory Structures Way More... 4 Conclusions The Problem of Backpropagation 138 / 158
  • 246. Images/cinvestav Be Aware Be Careful A computationally naive implementation of AD can result in slow code and excess use of memory. Additionally There exists no standard set of problems spanning the diversity of AD applications. 139 / 158
  • 248. Images/cinvestav Source Transformation in Fortran and C First We start with the source code of a computer program that implements our target function. Second A preprocessor then applies differentiation rules to the code generating new source code which calculates derivatives. Remember our basic tables 140 / 158
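For intuition, this is a hand-written sketch of what such a preprocessor might emit for the running example: the original routine and a generated forward-mode routine placed side by side (the function names and the emitted style are ours, not those of ADIFOR or Tapenade).

```python
import math

def f(x1, x2):                       # original source
    a = x1 / x2
    b = math.sin(a)
    return (b + a - math.exp(x2)) * (a - math.exp(x2))

def f_d(x1, x2, dx1, dx2):           # "generated" forward-mode source
    a = x1 / x2
    da = dx1 / x2 - x1 * dx2 / x2 ** 2
    b = math.sin(a)
    db = math.cos(a) * da
    e = math.exp(x2)
    de = e * dx2
    u = b + a - e
    du = db + da - de
    v = a - e
    dv = da - de
    return u * v, du * v + u * dv    # value and its directional derivative

print(f(1.5, 0.5), f_d(1.5, 0.5, 1.0, 0.0))   # y and (y, dy/dx1)
```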