Introduction to Neural Networks and Deep Learning
Recurrent Neural Networks
Andres Mendez-Vazquez
November 13, 2019
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem: The ReLU Function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU)?
4 Deeper Architectures with RNNs
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
In 1987, Robinson and Fallside [2]
At the Cambridge University Engineering Department
They proposed a new type of neural network based on Linear Control Theory.
Building on the work of Jacobs (1974) on dynamic nets [1], they used the state-space recursion
$$s_{t+1} = A s_t + B x_t$$
$$y_t = C s_t$$
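To make the recursion concrete, here is a minimal NumPy sketch that simulates this linear state-space model; the matrices $A$, $B$, $C$, the dimensions and the input sequence are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Hypothetical dimensions: state of size 3, input of size 2, output of size 1
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.1],
              [0.0, 0.0, 0.7]])   # state-transition matrix
B = np.random.randn(3, 2) * 0.1   # input-to-state map
C = np.random.randn(1, 3)         # state-to-output map

def simulate(xs, s0=None):
    """Run s_{t+1} = A s_t + B x_t, y_t = C s_t over a sequence of inputs."""
    s = np.zeros(A.shape[0]) if s0 is None else s0
    ys = []
    for x in xs:
        ys.append(C @ s)          # output depends only on the current state
        s = A @ s + B @ x         # linear state update driven by the input
    return np.array(ys)

xs = np.random.randn(10, 2)       # a toy input sequence of length 10
print(simulate(xs).shape)         # (10, 1)
```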
Example of this unit
[Figure: example of this recurrent unit]
Furthermore
Jordan proposed a simple recurrent network
$$h_t = \sigma_h\left(W_{sd} x_t + U_{ss} h_{t-1} + b_h\right)$$
$$y_t = \sigma_s\left(V_{os} h_t + b_o\right)$$
Where
1 $x_t$ is an input of dimension $d$.
2 $h_t$ is the hidden state, of dimension $s$.
3 $y_t$ is the output vector, of dimension $o$.
4 $W$, $U$, $V$ are parameter matrices and $b_h$, $b_o$ are bias vectors.
5 $\sigma_h$ and $\sigma_s$ are activation functions.
Graphically
[Figure: diagram of the recurrent network]
What were they used for?
Robinson and Fallside
As with Hidden Markov Models, they were proposed for Speech Coding.
They proposed the following architecture
[Figure: the proposed speech-coding architecture]
Based on the State-Space Model
Basically, a linear system
It is based on a state-determined system model.
Definition
A mathematical description of the system in terms of a minimum set of variables $x_i(t)$, $i = 1, ..., n$, together with knowledge of those variables at an initial time $t_0$ and the system inputs for time $t \geq t_0$, is sufficient to predict the future system state and outputs for all time $t > t_0$.
Therefore
We have a system as a block
[Figure: block diagram of a system described by state variables]
This can be expressed as a set of state equations
$$\dot{s}_1 = f_1(x, s, t)$$
$$\dot{s}_2 = f_2(x, s, t)$$
$$\vdots$$
$$\dot{s}_n = f_n(x, s, t)$$
Using Vector Notation
Assume that we have a linear, time-invariant system
Time-invariance means that a shifted input $x(t + \delta)$ produces the correspondingly shifted output $y(t + \delta)$, for example
$$\alpha x(t + \delta) + \beta = y(t + \delta)$$
Therefore, using this idea,
$$\dot{s}_i = a_{i1} x_1(t) + ... + a_{id} x_d(t) + b_{i1} s_1(t) + ... + b_{in} s_n(t)$$
Or, in matrix form,
$$\dot{s}(t) = A x(t) + B s(t)$$
Then, the discretized version
We pair the state update with an output equation:
$$\dot{s}(t) = A x(t) + B s(t)$$
$$y(t) = C s(t)$$
Or, as our discrete-step equations,
$$s(t + 1) = A x(t) + B s(t)$$
$$y(t) = C s(t)$$
The Elman Network
In Elman's equations,
$$h_t = \sigma_h\left(W_{sd} x_t + U_{ss} h_{t-1} + b_h\right)$$
$$y_t = \sigma_y\left(V_{os} h_t + b_y\right)$$
We notice something different from the linear recurrent system:
The use of activation functions introduces non-linearity.
Explanation
We have the following (see the sketch below):
1 The input $x_t$ is encoded by $W_{sd}$:
$$W_{sd} x_t$$
2 A state is generated from the encoded input plus the previous state $h_{t-1}$:
$$h_t = \sigma_h\left(W_{sd} x_t + U_{ss} h_{t-1} + b_h\right)$$
3 The output is generated using the new state $h_t$:
$$y_t = \sigma_y\left(V_{os} h_t + b_y\right)$$
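A minimal NumPy sketch of this forward pass, following the three steps above. The dimensions, the tanh/sigmoid choices and the random parameters are illustrative assumptions, not values from the slides.

```python
import numpy as np

d, s, o = 4, 8, 3                      # assumed input, state and output dimensions

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (s, d))         # W_{sd}: encodes the input
U = rng.normal(0, 0.1, (s, s))         # U_{ss}: carries the previous state
V = rng.normal(0, 0.1, (o, s))         # V_{os}: maps the state to the output
b_h, b_y = np.zeros(s), np.zeros(o)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(xs, h0=None):
    """Elman forward pass: returns the hidden states and outputs for a sequence."""
    h = np.zeros(s) if h0 is None else h0
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b_h)   # step 2: new state from input + previous state
        y = sigmoid(V @ h + b_y)           # step 3: output from the new state
        hs.append(h); ys.append(y)
    return np.array(hs), np.array(ys)

xs = rng.normal(size=(10, d))              # toy sequence of length 10
hs, ys = forward(xs)
print(hs.shape, ys.shape)                  # (10, 8) (10, 3)
```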
We need to introduce the concept of a cost function
Which, as always,
needs to comply with two properties.
First, the cost function $L$ must be able to be written as an average
$$L = \frac{1}{N} \sum_{x \in X} C_x$$
over the individual cost functions $C_x$.
This allows us to apply different optimization techniques such as
Mini-batch gradient descent
Stochastic Gradient Descent
etc.
Furthermore
Non-dependency
Second, the cost function $L$ must not depend on any activation values of the neural network besides the output values.
If we cannot assure this,
backpropagation becomes too unstable or too complex to solve. For example,
$$L = \frac{1}{N} \sum_{t=0}^{N} \left[y_t + h_t - z_t\right]^2$$
gives two entry points into the network.
A List of Cost Functions
The Average Quadratic Cost
$$L = \frac{1}{N} \sum_{t=0}^{N} \left[y_t - z_t\right]^2$$
Where $y_t$ is the output of the network and $z_t$ is the ground truth of the output.
Here, we are interpolating functions
[Figure: real function vs. interpolated function]
Cross-Entropy Cost
First, the Loss Function
$$L = -\sum_{i=1}^{C} z_i \log\left(y_i\right)$$
Where $y_i$ is the output and $z_i$ is the ground truth for the class estimation.
Why $z_i \log\left(y_i\right)$?
We can imagine a sequence of class probabilities $y_1, y_2, ..., y_m$ and the likelihood of the data under the model
$$P\left[\text{data} \mid \text{model}\right] = y_1^{k_1} y_2^{k_2} \cdots y_m^{k_m}$$
Then
Taking the logarithm and multiplying by $-1$,
$$-\log P\left[\text{data} \mid \text{model}\right] = -\sum_{i=1}^{C} k_i \log y_i$$
Then, dividing by the total number of samples,
$$-\frac{1}{N} \log P\left[\text{data} \mid \text{model}\right] = -\sum_{i=1}^{C} \frac{k_i}{N} \log y_i = -\sum_{i=1}^{C} z_i \log y_i$$
Then
In information theory, the Kraft–McMillan theorem
establishes that any directly decodable coding scheme for coding a message to identify one value $x_i \in \{x_1, x_2, ..., x_n\}$ can be seen as representing an implicit probability distribution over $\{x_1, x_2, ..., x_n\}$,
$$q\left(x_i\right) = \left(\frac{1}{2}\right)^{l_i}$$
where $l_i$ is the length of the code for $x_i$.
Now
We have that
Cross entropy can be interpreted as the expected message-length per datum when a wrong distribution $q$ is assumed while the data actually follows a distribution $p$.
The expected message-length under the true distribution $p$ is
$$E_p\left[l\right] = -E_p\left[\frac{\ln q\left(x\right)}{\ln 2}\right] = -E_p\left[\log_2 q\left(x\right)\right] = -\sum_{x_i} p\left(x_i\right) \log_2 q\left(x_i\right) = H\left(p, q\right)$$
Special Case
A special case is the binary class problem, $C = 2$
Based on the fact that $z_1 + z_2 = 1$ and $y_1 + y_2 = 1$,
$$L = -\sum_{i=1}^{2} z_i \log\left(y_i\right) = -z_1 \log\left(y_1\right) - \left(1 - z_1\right) \log\left(1 - y_1\right)$$
A problem with this
It is possible to have $y_1 = 0$, which makes the logarithm undefined.
Dealing with this problem
We can use an activation function (a sigmoid) in front of it.
[Figure: sigmoid cross-entropy loss]
Another Interpretation
The loss can be expressed as
$$L = \begin{cases} -\log\left(f\left(y_1\right)\right) & \text{if } z_1 = 1 \\ -\log\left(1 - f\left(y_1\right)\right) & \text{if } z_1 = 0 \end{cases}$$
Where $z_1 = 1$ means that the class $C_1$ is the positive class for this sample.
The Gradient of the Binary Cross-Entropy
Taking the derivative with respect to $y_1$,
$$\frac{\partial L}{\partial y_1} = z_1 \left(f\left(y_1\right) - 1\right) + \left(1 - z_1\right) f\left(y_1\right)$$
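A small NumPy sketch of the sigmoid cross-entropy loss and the gradient above; the clipping constant is an assumption added for numerical safety, not something stated on the slide.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def binary_cross_entropy(y1, z1, eps=1e-12):
    """Loss on the pre-activation y1 with a sigmoid f in front (z1 in {0, 1})."""
    f = np.clip(sigmoid(y1), eps, 1 - eps)       # avoid log(0)
    return -z1 * np.log(f) - (1 - z1) * np.log(1 - f)

def grad_wrt_y1(y1, z1):
    """dL/dy1 = z1 (f(y1) - 1) + (1 - z1) f(y1), as on the slide."""
    f = sigmoid(y1)
    return z1 * (f - 1) + (1 - z1) * f

# Quick finite-difference check of the gradient formula
y1, z1, h = 0.3, 1.0, 1e-6
numeric = (binary_cross_entropy(y1 + h, z1) - binary_cross_entropy(y1 - h, z1)) / (2 * h)
print(grad_wrt_y1(y1, z1), numeric)              # both close to f(y1) - 1
```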
In the Case of the Multiclass Problem
We use two things: a softmax,
$$f\left(y_i\right) = \frac{\exp\left\{y_i\right\}}{\sum_{j=1}^{C} \exp\left\{y_j\right\}}$$
As in the multiclass case for the Linear Models,
the labels are one-hot, so only the positive class $C_p$ keeps its term in the loss.
Therefore
There is only one element of the target vector $z$ that is not zero, namely $z_p = 1$.
We can then simplify
The cost function becomes
$$L = -\sum_{i=1}^{C} z_i \log\left(f\left(y_i\right)\right) = -\log\left(\frac{\exp\left\{y_p\right\}}{\sum_{j=1}^{C} \exp\left\{y_j\right\}}\right)$$
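A NumPy sketch of this simplified loss. The max-subtraction trick for numerical stability is an added assumption (standard practice), not something stated on the slide.

```python
import numpy as np

def softmax_cross_entropy(y, p):
    """Loss -log softmax(y)[p] for logits y and positive class index p."""
    y = y - np.max(y)                      # stability shift; the softmax is unchanged
    log_probs = y - np.log(np.sum(np.exp(y)))
    return -log_probs[p]

logits = np.array([2.0, 0.5, -1.0])        # toy logits for C = 3 classes
print(softmax_cross_entropy(logits, p=0))  # small loss: class 0 already has the largest logit
```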
Other Cost Functions
Exponential Cost with hyper-parameter $\tau$
$$L = \tau \exp\left(\frac{1}{\tau} \sum_{i=1}^{N} \left(y_i - z_i\right)^2\right)$$
Hellinger Distance
$$L = \frac{1}{\sqrt{2}} \sum_{i=1}^{N} \left(\sqrt{y_i} - \sqrt{z_i}\right)^2$$
Here the values need to be in the interval $[0, 1]$.
Other Cost Functions
Given the Kullback–Leibler Divergence,
$$D_{KL}\left(P \,\|\, Q\right) = \sum_{i} P\left(i\right) \ln\frac{P\left(i\right)}{Q\left(i\right)}$$
the final cost function is
$$L = \sum_{j} \hat{y}_j \log\frac{\hat{y}_j}{y_j^{pred}}$$
We have the following
Architecture with Quadratic Error
$$h_t = \sigma_h\left(W_{sd} x_t + U_{ss} h_{t-1} + b_h\right)$$
$$y_t = \sigma_y\left(V_{os} h_t + b_y\right)$$
$$L = \frac{1}{2} \sum_{t=0}^{N} \left[y_t - z_t\right]^2$$
Something Notable
How do we train something with a recurrence forcing a dependence over time?
Now, given the dependency over time
We can use the classic unfolding of the network [3, 4], assuming that $W$, $U$, $V$, $b_h$ and $b_o$ do not change under the unfolding.
Unfolding?
Assume that there are no bias terms; only $W$, $U$ and $V$ remain.
Then
Given an observation sequence $x = \{x_1, x_2, ..., x_T\}$, where $x_i \in \mathbb{R}^d$, and the corresponding labels $z = \{z_1, z_2, ..., z_T\}$,
we remove the bias to simplify our derivations:
$$h_t = \sigma_h\left(W_{sd} x_t + U_{ss} h_{t-1}\right)$$
$$y_t = \sigma_y\left(V_{os} h_t\right)$$
$$L = \frac{1}{2} \sum_{t=0}^{T} \left[z_t - y_t\right]^2$$
Unfolding
We can then see the unfolding of the recurrence.
[Figure: the RNN unfolded through time]
This allows
us to simplify the backpropagation process:
$$\frac{\partial L}{\partial V_{os}} = \sum_{t=0}^{T} \frac{\partial L}{\partial y_t} \times \frac{\partial y_t}{\partial V_{os}} = \sum_{t=0}^{T} \frac{\partial L}{\partial y_t} \times \frac{\partial y_t}{\partial net_o} \times \frac{\partial net_o}{\partial V_{os}} = -\sum_{t=0}^{T} \left[z_t - y_t\right] \times \frac{\partial y_t}{\partial net_o} \times \frac{\partial net_o}{\partial V_{os}}$$
Where $net_o = V_{os} h_t$.
Now, we have
We have that
$$\frac{\partial y_t}{\partial net_o} = \begin{pmatrix} \frac{\partial y_{t1}}{\partial net_{o1}} & \frac{\partial y_{t2}}{\partial net_{o1}} & \cdots & \frac{\partial y_{to}}{\partial net_{o1}} \\ \frac{\partial y_{t1}}{\partial net_{o2}} & \frac{\partial y_{t2}}{\partial net_{o2}} & \cdots & \frac{\partial y_{to}}{\partial net_{o2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{t1}}{\partial net_{oo}} & \frac{\partial y_{t2}}{\partial net_{oo}} & \cdots & \frac{\partial y_{to}}{\partial net_{oo}} \end{pmatrix}$$
Simplify!!!
Now, we have that, if $i = j$,
$$\frac{\partial y_{ti}}{\partial net_{oi}} = \sigma_o'\left(net_{oi}\right)$$
and, for the rest, $i \neq j$,
$$\frac{\partial y_{ti}}{\partial net_{oj}} = 0$$
Finally
We have that
$$\frac{\partial y_t}{\partial net_o} = \begin{pmatrix} \sigma_o'\left(net_{o1}\right) & 0 & \cdots & 0 \\ 0 & \sigma_o'\left(net_{o2}\right) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_o'\left(net_{oo}\right) \end{pmatrix} = A$$
Now, $\frac{\partial net_o}{\partial V_{os}}$
First, each component $i$ is
$$net_{oi} = \sum_{j=1}^{s} V_{ij} h_j$$
What happens when we differentiate with respect to the matrix?
$$\frac{\partial net_o}{\partial V_{os}} = \begin{pmatrix} \frac{\partial net_o}{\partial V_{11}} & \frac{\partial net_o}{\partial V_{12}} & \cdots & \frac{\partial net_o}{\partial V_{1s}} \\ \frac{\partial net_o}{\partial V_{21}} & \frac{\partial net_o}{\partial V_{22}} & \cdots & \frac{\partial net_o}{\partial V_{2s}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial net_o}{\partial V_{o1}} & \frac{\partial net_o}{\partial V_{o2}} & \cdots & \frac{\partial net_o}{\partial V_{os}} \end{pmatrix}$$
Actually
A tensor with three dimensions...
But something quite nice
Each of the components of $net_o$
has the previous structure:
$$net_{oi} = \sum_{j=1}^{s} V_{ij} h_j$$
Then, if $V_{kj}$ does not intervene in it ($k \neq i$),
$$\frac{\partial net_{oi}}{\partial V_{kj}} = 0$$
Additionally, if it does intervene ($k = i$),
$$\frac{\partial net_{oi}}{\partial V_{ij}} = h_j$$
Therefore
It is possible to collapse the tensor into a 2D matrix
Given that the other information is redundant, we can rewrite the tensor as
$$F_{ijk} = \frac{\partial net_{oi}}{\partial V_{jk}}$$
Then, we have that
$$F_{iji} = G_{ij} \Leftarrow \text{Better Storage!!!}$$
Therefore, given that a matrix is also a tensor
We have two tensors, the $o \times o$ matrix $A$ above and $F_{o \times s \times o}$ [5].
We will use the contracted product of two tensors, which is a generalization of the tensor-vector and tensor-matrix multiplications.
Definition
Given two tensors $A_{o \times o}$ and $B_{o \times s \times o}$,
$$\langle A, B \rangle\left(k, j\right) = \sum_{i=1}^{o} A_{i,k} G_{i,j} = A_{k,k} G_{k,j} = \sigma_o'\left(net_{ok}\right) h_j$$
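In practice this contraction is just an outer product per time step. A hedged NumPy sketch (variable names, shapes and the toy values are assumptions consistent with the derivation, not code from the slides):

```python
import numpy as np

o, s = 3, 8
rng = np.random.default_rng(1)

h = rng.normal(size=s)                  # hidden state h_t
net_o = rng.normal(size=o)              # pre-activations net_o = V h_t
z, y = rng.normal(size=o), rng.normal(size=o)   # toy target and output

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
sigma_prime = sigmoid(net_o) * (1 - sigmoid(net_o))   # diagonal of A

# delta_o combines dL/dy_t (quadratic loss) with the diagonal Jacobian A
delta_o = -(z - y) * sigma_prime        # shape (o,)

# The contracted product collapses to an outer product: dL/dV[i, j] = delta_o[i] * h[j]
dV = np.outer(delta_o, h)               # shape (o, s), same as V_{os}
print(dV.shape)
```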
Now, the term $\frac{\partial L}{\partial U_{ss}}$
Assuming our change in time step $t \rightarrow t + 1$ and given
$$h_t = \sigma_h\left(W_{sd} x_t + U_{ss} h_{t-1}\right)$$
we therefore have
$$\frac{\partial L\left(t + 1\right)}{\partial U_{ss}} = \frac{\partial L\left(t + 1\right)}{\partial y_{t+1}} \times \frac{\partial y_{t+1}}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial U_{ss}}$$
Therefore
We can think of this as a Markovian backpropagation.
What if we go further
From $t - 1 \rightarrow t + 1$:
$$\frac{\partial L\left(t + 1\right)}{\partial U_{ss}} = \frac{\partial L\left(t + 1\right)}{\partial y_{t+1}} \times \frac{\partial y_{t+1}}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial h_t} \times \frac{\partial h_t}{\partial U_{ss}}$$
Now, the trick: if we consider all the possible derivatives from $t$ down to $0$, we have
$$\frac{\partial L\left(t + 1\right)}{\partial U_{ss}} = \sum_{k=1}^{t} \frac{\partial L\left(t + 1\right)}{\partial y_{t+1}} \times \frac{\partial y_{t+1}}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial h_k} \times \frac{\partial h_k}{\partial U_{ss}}$$
However
How do we calculate $\frac{\partial h_{t+1}}{\partial h_k}$?
We have a proposal
Given the composition of functions,
$$\frac{\partial h_{t+1}}{\partial h_k} = \frac{\partial h_{k+1}}{\partial h_k} \times \frac{\partial h_{k+2}}{\partial h_{k+1}} \times \cdots \times \frac{\partial h_{t+1}}{\partial h_t}$$
Here, we know that
$$\frac{\partial h_{i+1}}{\partial h_i} = \frac{\partial h_{i+1}}{\partial net_h} \times \frac{\partial net_h}{\partial h_i}$$
We have that
Given $h_{i+1} = \sigma_h\left(W_{sd} x_i + U_{ss} h_i\right)$ and $net_h = W_{sd} x_i + U_{ss} h_i$,
$$\frac{\partial h_{i+1}}{\partial net_h} = \begin{pmatrix} \sigma_h'\left(net_{h1}\right) & 0 & \cdots & 0 \\ 0 & \sigma_h'\left(net_{h2}\right) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_h'\left(net_{hs}\right) \end{pmatrix} = D_{i+1}$$
Finally, we have that
$$\frac{\partial net_h}{\partial h_i} = U_{ss}$$
Then
We can aggregate over all time steps:
$$\frac{\partial L}{\partial U_{ss}} = \sum_{t} \sum_{k=1}^{t} \frac{\partial L\left(t + 1\right)}{\partial y_{t+1}} \times \frac{\partial y_{t+1}}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial h_k} \times \frac{\partial h_k}{\partial U_{ss}}$$
Now, we need to differentiate $L$ with respect to $W_{sd}$:
$$\frac{\partial L\left(t + 1\right)}{\partial W_{sd}} = \frac{\partial L\left(t + 1\right)}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial W_{sd}}$$
Now
Because $h_{t+1}$ depends on both $x_{t+1}$ and $h_t$, we also need to back-propagate through $h_t$:
$$\frac{\partial L\left(t + 1\right)}{\partial W_{sd}} = \frac{\partial L\left(t + 1\right)}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial W_{sd}} + \frac{\partial L\left(t + 1\right)}{\partial h_t} \times \frac{\partial h_t}{\partial W_{sd}} = \frac{\partial L\left(t + 1\right)}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial W_{sd}} + \frac{\partial L\left(t + 1\right)}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial h_t} \times \frac{\partial h_t}{\partial W_{sd}}$$
Then, summing over all the contributions from $t + 1$ down to $0$,
$$\frac{\partial L\left(t + 1\right)}{\partial W_{sd}} = \sum_{k=1}^{t+1} \frac{\partial L\left(t + 1\right)}{\partial y_{t+1}} \times \frac{\partial y_{t+1}}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial h_k} \times \frac{\partial h_k}{\partial W_{sd}}$$
Finally, summing over all time steps,
$$\frac{\partial L}{\partial W_{sd}} = \sum_{t} \sum_{k=1}^{t+1} \frac{\partial L\left(t + 1\right)}{\partial y_{t+1}} \times \frac{\partial y_{t+1}}{\partial h_{t+1}} \times \frac{\partial h_{t+1}}{\partial h_k} \times \frac{\partial h_k}{\partial W_{sd}}$$
A code sketch of the whole procedure is given below.
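The derivation above translates into a short gradient-accumulation loop. A hedged NumPy sketch for the bias-free vanilla RNN with quadratic loss; the tanh nonlinearity, the linear output layer, the shapes and the random data are assumptions made for illustration.

```python
import numpy as np

def bptt_grads(xs, zs, W, U, V):
    """Full BPTT for h_t = tanh(W x_t + U h_{t-1}), y_t = V h_t, L = 0.5 * sum ||y_t - z_t||^2.
    A linear output layer is assumed here for brevity (the slides use an activation)."""
    T, s = len(xs), U.shape[0]
    hs = np.zeros((T + 1, s))                  # hs[0] is the initial state h_0 = 0
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)

    for t in range(T):                         # forward pass, storing the states
        hs[t + 1] = np.tanh(W @ xs[t] + U @ hs[t])

    dh_next = np.zeros(s)                      # gradient flowing in from the future
    for t in reversed(range(T)):               # backward pass through time
        dy = V @ hs[t + 1] - zs[t]             # dL/dy_t
        dV += np.outer(dy, hs[t + 1])
        dh = V.T @ dy + dh_next                # current + future contributions to dL/dh_t
        dnet = dh * (1 - hs[t + 1] ** 2)       # through tanh: the D_t term
        dW += np.outer(dnet, xs[t])
        dU += np.outer(dnet, hs[t])
        dh_next = U.T @ dnet                   # propagate to h_{t-1} (the D_t U factor)
    return dW, dU, dV

rng = np.random.default_rng(0)
d, s, o, T = 3, 5, 2, 6
W, U, V = rng.normal(0, 0.1, (s, d)), rng.normal(0, 0.1, (s, s)), rng.normal(0, 0.1, (o, s))
xs, zs = rng.normal(size=(T, d)), rng.normal(size=(T, o))
print([g.shape for g in bptt_grads(xs, zs, W, U, V)])
```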
Vanishing Gradients
We have a problem with the product
$$\frac{\partial h_{k+1}}{\partial h_k} \times \frac{\partial h_{k+2}}{\partial h_{k+1}} \times \cdots \times \frac{\partial h_{t+1}}{\partial h_t}$$
You end up with a vanishing gradient when using the sigmoid $\sigma\left(x\right) = \frac{1}{1 + \exp\left\{-x\right\}}$.
This is problematic!!!
Given
Given the commutativity of the (scalar) product,
you can group together the derivatives of the sigmoids:
$$f\left(x\right) = \frac{ds\left(x\right)}{dx} = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2}$$
Therefore, differentiating again,
$$\frac{df\left(x\right)}{dx} = -\frac{e^{-x}}{\left(1 + e^{-x}\right)^2} + \frac{2\left(e^{-x}\right)^2}{\left(1 + e^{-x}\right)^3}$$
After setting $\frac{df\left(x\right)}{dx} = 0$,
we find that the maximum is at $x = 0$.
Therefore
The maximum of the derivative of the sigmoid is
$$f\left(0\right) = 0.25$$
Therefore, given a deep network, we could end up with
$$\lim_{k \rightarrow \infty} \left[\frac{ds\left(x\right)}{dx}\right]^k \leq \lim_{k \rightarrow \infty} \left(0.25\right)^k = 0$$
A Vanishing Derivative or Vanishing Gradient
This makes it quite difficult to train a deeper network using this activation function, in Deep Learning and even in Shallow Learning.
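A quick numeric illustration of this shrinkage; the depth values and random pre-activations are arbitrary choices for the demonstration.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sigmoid_prime = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # maximum 0.25 at x = 0

rng = np.random.default_rng(0)
for depth in (5, 20, 50):
    # product of sigmoid derivatives at random pre-activations along a path
    prod = np.prod(sigmoid_prime(rng.normal(size=depth)))
    print(depth, prod)        # shrinks towards 0 as the depth grows
```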
For the case of the vanishing gradient, we have that
rearranging terms in $\frac{\partial h_{k+1}}{\partial h_k} \times \frac{\partial h_{k+2}}{\partial h_{k+1}} \times \cdots \times \frac{\partial h_{t+1}}{\partial h_t}$, we have
$$\left[\prod_{k=0}^{T} \frac{\partial h_{k+1}}{\partial net_h}\right] \left[U_{ss}\right]^{T+1}$$
Then, given the sigmoid,
$$\prod_{k=0}^{T} \frac{\partial h_{k+1}}{\partial net_h} = \begin{pmatrix} \prod_{k=0}^{T} \sigma_h'\left(net_{h1}^k\right) & 0 & \cdots & 0 \\ 0 & \prod_{k=0}^{T} \sigma_h'\left(net_{h2}^k\right) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \prod_{k=0}^{T} \sigma_h'\left(net_{hs}^k\right) \end{pmatrix}$$
It is clear
that you have the phenomenon of the vanishing gradient.
Do we have a way of fixing this?
Yes
The use of new activation functions.
Thus
The need to introduce a new function:
$$f\left(x\right) = x^{+} = \max\left(0, x\right)$$
It is called ReLU or Rectifier,
with a smooth approximation (the Softplus function)
$$f\left(x\right) = \frac{\ln\left(1 + e^{kx}\right)}{k}$$
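A minimal sketch of both functions and their derivatives; note how the ReLU derivative is exactly 1 on the positive side, which is what avoids the 0.25 shrink factor of the sigmoid.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_prime(x):
    return (x > 0).astype(float)          # 1 for x > 0, 0 otherwise

def softplus(x, k=1.0):
    return np.log1p(np.exp(k * x)) / k    # smooth approximation of ReLU

x = np.linspace(-3, 3, 7)
print(relu(x))
print(relu_prime(x))                      # no 0.25 cap: products of 1's do not vanish
print(softplus(x))
```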
However
Here the gradient can explode,
thus the need to control the gradient...
Therefore, we will use the following analysis [6]:
"The Emergence of Spectral Universality in Deep Networks" by Jeffrey Pennington, Samuel S. Schoenholz and Surya Ganguli.
We have
The following dynamic
$$h_t = \sigma_h\left(s_t\right), \quad s_t = W_{sd} x_t + U_{ss} h_{t-1} + b_h$$
Then, we have the following Jacobian
$$J = \frac{\partial h_T}{\partial h_0} = \prod_{t=1}^{T} D_t U_{ss}$$
where, as we saw, $D_t$ is a diagonal matrix.
This Jacobian $J$ is a matrix of dimension $s \times s$; therefore, if it is well conditioned, you are not sending the projection to a lower dimensionality.
A Trick
An RNN can be seen as a deep neural network.
[Figure: the unfolded RNN viewed as a deep network]
Remember the structure of the layer
The following dynamic
$$h_t = \sigma_h\left(s_t\right), \quad s_t = W_{sd} x_t + U_{ss} h_{t-1} + b_h$$
Therefore, we have that
$$s_{it} = \sum_{j} W_{ij} x_j^t + \sum_{k} U_{ik} h_k^{t-1} + b_i$$
We assume the following about the temporal layer weights:
$$\left[U_{ss}, W_{sd}\right] \sim \mathcal{N}\left(0, \frac{\rho_w^2}{N}\right), \quad b_h \sim \mathcal{N}\left(0, \rho_b^2\right)$$
Here $N = s$, the state dimension.
Remember the structure of the layer
The following dynamic
ht = σh (st) , st = Wsdxt + Ussht−1 + bh
Therefore, we have that
sit =
j
Wijxt
j +
k
Uikht−1
k + bi
We assume the following about the temporal layer weights
[Uss, Wsd] ∼ N 0,
ρ2
w
N
, bh ∼ N 0, ρ2
b
Here N = s the state dimension.
66 / 142
Images/cinvestav
Remember the structure of the layer
The following dynamic
ht = σh (st) , st = Wsdxt + Ussht−1 + bh
Therefore, we have that
sit =
j
Wijxt
j +
k
Uikht−1
k + bi
We assume the following about the temporal layer weights
[Uss, Wsd] ∼ N 0,
ρ2
w
N
, bh ∼ N 0, ρ2
b
Here N = s the state dimension.
66 / 142
Images/cinvestav
Now, assume that
we consider the evolution of a single input $x_{it}$ through the network.
Since the weights and biases are independent with zero mean,
$$E\left[s_{it}\right] = 0$$
The second moment of the Gaussian random variable is
$$E\left[s_{it} s_{jt}\right] = q^t \delta_{ij}$$
Where the second moment
of a Gaussian distribution is
$$\int_{-\infty}^{\infty} s^2 \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(s - \mu\right)^2}{2\sigma^2}\right) ds$$
Here we have
Here $q^t$ is the variance of the pre-activations in the $t$-th layer due to an input $x_t$:
$$q^t = \frac{\rho_w^2}{\sqrt{2\pi}} \int \sigma_h^2\left(\sqrt{q^{t-1}}\, s\right) \exp\left(-\frac{1}{2} s^2\right) ds + \rho_b^2$$
This describes how the variance passes through the recursion of the RNN.
For any choice of $\rho_w^2$ and $\rho_b^2$ and a bounded activation $\sigma_h$, the previous equation converges to a specific fixed point.
This recursion has a fixed point
$$q^{*} = \frac{\rho_w^2}{\sqrt{2\pi}} \int \sigma_h^2\left(\sqrt{q^{*}}\, s\right) \exp\left(-\frac{1}{2} s^2\right) ds + \rho_b^2$$
A Fixed Point
Definition
In mathematics, a fixed point of a function is an element of the function's domain that is mapped to itself by the function.
[Figure: example of a fixed point, where the curve crosses the line $y = x$]
Therefore
We have that
If the input $x_0$ is chosen so that $q^1 = q^{*}$, the dynamics start at the fixed point and the distribution of $D_t$ is independent of $t$.
Not only that
Even if $q^1 \neq q^{*}$, a few layers are often sufficient to approximately converge to the fixed point.
So, when $t$ is large,
it is a good approximation to assume $q^t = q^{*}$.
Additionally
The independence of the weights and biases implies
that the covariance between the pre-activations of two different inputs $a$ and $b$ in the same layer is given by
$$E\left[s_{it;a} s_{jt;b}\right] = q_{ab}^t \delta_{ij}$$
Therefore,
$$q_{ab}^t = \rho_w^2 \int \sigma_h\left(u_1\right) \sigma_h\left(u_2\right) Dz_1 Dz_2 + \rho_b^2$$
where $Dz = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^2\right) dz$ and
$$u_1 = \sqrt{q_{aa}^{t-1}}\, z_1, \quad u_2 = \sqrt{q_{bb}^{t-1}}\left[c_{ab}^{t-1} z_1 + \sqrt{1 - \left(c_{ab}^{t-1}\right)^2}\, z_2\right], \quad c_{ab}^t = \frac{q_{ab}^t}{\sqrt{q_{aa}^t q_{bb}^t}}$$
Therefore
we can look at the variance of the Jacobian matrix elements:
$$\chi = \frac{1}{N} \mathrm{Tr}\left[\left(D_t U_{ss}\right)^T D_t U_{ss}\right] = \frac{\rho_w^2}{\sqrt{2\pi}} \int \left[\sigma_h'\left(\sqrt{q^{*}}\, s\right)\right]^2 \exp\left(-\frac{1}{2} s^2\right) ds$$
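A hedged sketch that estimates $\chi$ by Monte Carlo; tanh is used as the bounded activation purely for illustration, and the specific $\rho_w^2$, $\rho_b^2$ values and the simple fixed-point iteration are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def chi(rho_w2, rho_b2, phi=np.tanh, dphi=lambda u: 1 - np.tanh(u) ** 2,
        n_samples=200_000, n_iter=50):
    """Monte Carlo estimate of chi = rho_w^2 * E[phi'(sqrt(q*) s)^2], s ~ N(0, 1)."""
    s = rng.standard_normal(n_samples)
    q = 1.0
    for _ in range(n_iter):                   # iterate the variance map towards its fixed point q*
        q = rho_w2 * np.mean(phi(np.sqrt(q) * s) ** 2) + rho_b2
    return rho_w2 * np.mean(dphi(np.sqrt(q) * s) ** 2)

print(chi(0.64, 0.05))   # < 1: ordered region, gradients tend to vanish
print(chi(4.00, 0.05))   # > 1: chaotic region, gradients tend to explode
```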
Then
$\chi\left(\rho_w, \rho_b\right)$
separates the $\left(\rho_w, \rho_b\right)$ plane into two regions.
When $\chi > 1$
Forward signal propagation expands and folds space in a chaotic manner and gradients explode.
When $\chi < 1$
Forward signal propagation contracts in an ordered manner and gradients vanish exponentially.
These regions establish the stability of the network
[Figure: phase diagram in the $\left(\rho_w, \rho_b\right)$ plane — an ordered region with vanishing gradients and a chaotic region with exploding gradients]
Therefore
It is clear that
when we choose the same $\rho_b = \rho_w$, we obtain convergence of the network.
Having other values
requires a careful choice of the values.
Another Problem
Although the vanishing and exploding gradients
are a problem for RNNs,
if we use the full BPTT
we also confront limitations on the amount of memory and hardware available.
Thus, a popular strategy
is the Truncated BPTT [7, 8].
Therefore
They proposed using a truncation of the BPTT
to deal with the vanishing and exploding gradient problem.
What is Truncated BPTT?
In general, it should be regarded as a heuristic technique for simplifying the computation,
which is a good approximation to the true gradient.
The Algorithm
Truncated BPTT, with parameters $k_1$, $k_2$ (see the sketch below)
1 for $t = 1$ to $T$ do:
2   Run the RNN for one step, computing $h_t$ and $y_t$
3   if $t \bmod k_1 = 0$ then
4     Run BPTT from $t$ down to $t - k_2$
Something Notable
1 It was first used by Elman [9].
2 Also, Mikolov et al. [10] used TBPTT to train RNNs for word-level language modeling.
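A hedged PyTorch-style sketch of this loop with $k_1 = k_2$ (the common "detach every $k$ steps" variant); the model, the toy data and the hyper-parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

d, s, o, T, k = 4, 16, 2, 200, 20          # assumed sizes; k plays the role of k1 = k2
rnn = nn.RNN(input_size=d, hidden_size=s, batch_first=True)
readout = nn.Linear(s, o)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

xs = torch.randn(1, T, d)                  # toy sequence (batch of 1)
zs = torch.randn(1, T, o)

h = torch.zeros(1, 1, s)                   # initial hidden state
for start in range(0, T, k):               # process the sequence in chunks of k steps
    x_chunk, z_chunk = xs[:, start:start + k], zs[:, start:start + k]
    out, h = rnn(x_chunk, h)
    loss = ((readout(out) - z_chunk) ** 2).mean()
    opt.zero_grad()
    loss.backward()                        # BPTT only within the current chunk
    opt.step()
    h = h.detach()                         # truncate: no gradient flows to earlier chunks
print("done")
```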
Initialization of the Hidden State
This is a classic problem in RNNs:
how do we initialize the hidden state $h_0$?
There are three main methods:
1 Initialize $h_0$ to the zero vector.
2 Adaptive noisy initialization of $h_0$.
3 Find the steady state.
The Simplest One
We can simply initialize $h_0$
to the zero vector.
Quite simple and easy to apply.
However, do we have something better?
Adaptive Noisy Initialization
It was proposed by Zimmermann et al. [11].
They proposed to use the residual error once the back-propagation has been done for $h_0$.
This is done
by disturbing $h_0$ with a noise term $\Theta$ which follows the distribution of the residual error.
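A small sketch contrasting the two options; the Gaussian form and the noise scale are assumptions standing in for "the distribution of the residual error".

```python
import numpy as np

rng = np.random.default_rng(0)
s = 16                                       # assumed state dimension

def init_hidden_zero():
    return np.zeros(s)

def init_hidden_noisy(residual_std):
    """Disturb h_0 with noise matching the (estimated) spread of the residual error."""
    return rng.normal(0.0, residual_std, size=s)

h0_a = init_hidden_zero()
h0_b = init_hidden_noisy(residual_std=0.1)   # 0.1 is a placeholder estimate
print(h0_a[:4], h0_b[:4])
```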
Adaptive Noise
The network tries to stabilize the output.
[Figure: the initial state fed in as 0 + noise]
Example of these initializations
Source: https://r2rt.com/non-zero-initial-states-for-recurrent-neural-networks.html
[Figure: comparison of zero vs. non-zero initial-state strategies]
What about the Weight Parameters?
We could simply initialize them to zero.
Danger, Will Robinson!!!
A simple example with the following feed-forward architecture:
$$w = \sigma_1\left(W_{hi} x\right)$$
$$y = \sigma_2\left(W_{oh} w\right)$$
$$L = \frac{1}{2}\left[y - z\right]^2$$
Therefore
By back-propagation we have
$$\Delta W_{hi} = \left[\sigma_2\left(W_{oh}\sigma_1\left(W_{hi} x\right)\right) - z\right] \sigma_2'\left(W_{oh}\sigma_1\left(W_{hi} x\right)\right) W_{oh}\, \sigma_1'\left(W_{hi} x\right) x$$
Therefore, with all weights initialized to zero, the factor $W_{oh} = 0$ gives
$$\Delta W_{hi} = 0$$
Therefore
Not a good idea.
What else can we do?
We have heuristics such as the Gaussian initialization
$$w_{ij} \sim \mathcal{N}\left(0, \sigma^2\right)$$
Do you remember?
We have the following.
[Figure: the same $\left(\rho_w, \rho_b\right)$ phase diagram — ordered region (vanishing gradient) vs. chaotic region (exploding gradient)]
Furthermore
We have heuristics (see the sketch below):
For ReLU — we multiply the randomly generated values of $W$ by
$$\sqrt{\frac{2}{size^{l-1}}}$$
For tanh — the heuristic is called Xavier initialization,
$$\sqrt{\frac{1}{size^{l-1}}}$$
Another common one is
$$\sqrt{\frac{2}{size^{l-1} + size^{l}}}$$
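A NumPy sketch of these three heuristics; the layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    """He initialization, commonly used with ReLU layers."""
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

def xavier_init(fan_in, fan_out):
    """Xavier initialization, commonly used with tanh layers."""
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(1.0 / fan_in)

def glorot_init(fan_in, fan_out):
    """The other common variant, averaging fan-in and fan-out."""
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / (fan_in + fan_out))

W = he_init(256, 128)
print(W.std())            # close to sqrt(2 / 256) ~ 0.088
```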
History of LSTM
They were introduced by
Sepp Hochreiter and Jürgen Schmidhuber, who proposed the LSTM in 1997 [12]
as an attempt to deal with the vanishing and exploding gradients
by introducing Constant Error Carousel (CEC) units.
Properties
In 1999, Felix Gers, his advisor Jürgen Schmidhuber and Fred Cummins introduced the forget gate (also called the "keep gate") into the LSTM architecture.
It enables the LSTM to reset its own state.
Long Short Term Memory (LSTM)
We have the following architecture ($\odot$ denotes the component-wise product):
$$f_t = \sigma\left[W_f\left[h_{t-1}, x_t\right] + b_f\right] \text{ (Forget Gate)}$$
$$i_t = \sigma\left[W_i\left[h_{t-1}, x_t\right] + b_i\right] \text{ (Input/Update Gate)}$$
$$o_t = \sigma\left[W_o\left[h_{t-1}, x_t\right] + b_o\right] \text{ (Output Gate)}$$
$$\hat{c}_t = \tanh\left[W_c\left[h_{t-1}, x_t\right] + b_c\right] \text{ (Intermediate Cell Gate)}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t \text{ (Cell State Gate)}$$
$$h_t = o_t \odot \tanh\left(c_t\right) \text{ (Hidden State)}$$
Where $\sigma$ is a sigmoid function.
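A minimal NumPy sketch of one LSTM step, following these equations; the concatenation $[h_{t-1}, x_t]$, the shapes and the random parameters are the only implementation assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step on the concatenated vector [h_{t-1}, x_t]."""
    Wf, bf, Wi, bi, Wo, bo, Wc, bc = params
    v = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ v + bf)            # forget gate
    i_t = sigmoid(Wi @ v + bi)            # input/update gate
    o_t = sigmoid(Wo @ v + bo)            # output gate
    c_hat = np.tanh(Wc @ v + bc)          # intermediate (candidate) cell
    c_t = f_t * c_prev + i_t * c_hat      # cell state: forget old + add new
    h_t = o_t * np.tanh(c_t)              # hidden state: filtered cell state
    return h_t, c_t

d, s = 4, 8                               # assumed input and state sizes
rng = np.random.default_rng(0)
params = sum(((rng.normal(0, 0.1, (s, s + d)), np.zeros(s)) for _ in range(4)), ())
h, c = np.zeros(s), np.zeros(s)
h, c = lstm_step(rng.normal(size=d), h, c, params)
print(h.shape, c.shape)
```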
Graphically
[Figure: the LSTM cell with its gates]
Here is the interesting part
In the RNN,
$$h_t = \sigma_h\left(W_{sd} x_t + U_{ss} h_{t-1}\right)$$
But here,
$$c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t \text{ (Cell State Gate)}$$
$$h_t = o_t \odot \tanh\left(c_t\right)$$
You need the forget term, the input term and the intermediate cell
to update the state.
You can see
Something Notable
The cell keeps track of the dependencies between the elements in the input sequence and the state.
The input gate
is in charge of how much of the input flows into the cell gate:
$$i_t = \sigma\left[W_i\left[h_{t-1}, x_t\right] + b_i\right]$$
What is the meaning?
We have that
the sigmoid layer decides what values to update.
They impact the term $i_t \odot \hat{c}_t$,
making it possible to decide how to control the intermediate cell values.
Now
The forget gate
How much of the previous cell state value remains in the cell at
time t
ft = σ [Wf [ht−1, xt] + bf ]
Actually
It uses previous state and input
Then the sigmoid actually can be interpreted as
Sigmoid: values between 0 and 1 – “completely forget” vs. “completely keep”
100 / 142
Images/cinvestav
Furthermore
The output gate
It controls the extent to which the value in the cell is used to
compute the hidden state
It impacts the term ot ⊙ tanh (ct)
Based on the updated cell state
Thus a type of filter
Between the internal cell state and what is exposed as the hidden state
101 / 142
Images/cinvestav
Finally
We have the update of the cell as
ct = ft ⊙ ct−1 + it ⊙ ĉt
Basically
Apply forget operation to previous internal cell state.
Add new candidate values, scaled by how much we decided to update
We can see this as
For example, in a language model: drop old information about a subject’s gender and add the new information.
102 / 142
Images/cinvestav
Thus at the output layer and update state
We have
ot = σ [Wo [ht−1, xt] + bo] (Output Gate)
ht = ot ⊙ tanh (ct) (Hidden State)
Therefore, we have that
Sigmoid layer: decide what linear combination of state/input to
output
Additionally, we have that the tanh squashes the values between -1
and 1
The output gate is used to return a filtered version of the cell state!!!
103 / 142
Images/cinvestav
Something nice about LSTM
Quite nice
Backpropagation from ct to ct−1 requires only elementwise
multiplication!
104 / 142
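As a quick check of that claim, consider only the direct cell-to-cell path of ct = ft ⊙ ct−1 + it ⊙ ĉt, treating the gate activations as constants with respect to ct−1 (the gates also depend on ht−1, so this is the Constant Error Carousel path, not the full Jacobian):

```latex
\left.\frac{\partial c_t}{\partial c_{t-1}}\right|_{\text{cell path}}
   = \operatorname{diag}\left(f_t\right),
\qquad
\left.\frac{\partial c_t}{\partial c_{t-k}}\right|_{\text{cell path}}
   = \prod_{\tau = t-k+1}^{t} \operatorname{diag}\left(f_\tau\right)
```

Along this path the backpropagated error is only rescaled elementwise by the forget gates, which is what lets gradients survive over many time steps when ft stays close to 1.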
Images/cinvestav
LSTM Remarks
First
It maintains a separate cell state from what is outputted
Second
Use gates to control the flow of information
Forget gate tries to get rid of irrelevant information
Selectively update cell state
Output gate returns a filtered version of the cell state
Third
Backpropagation from ct to ct−1 requires only elementwise
multiplication!
105 / 142
Images/cinvestav
Achievements
LSTM achieved record results in natural language text compression
http://www.mattmahoney.net/dc/text.html#1218
Unsegmented connected handwriting recognition
Graves, A., Liwicki, M., Fernández, S., Bertolami, R.; Bunke, H.,
Schmidhuber, J. (May 2009). "A Novel Connectionist System for
Unconstrained Handwriting Recognition". IEEE Transactions on
Pattern Analysis and Machine Intelligence. 31 (5): 855–868
Finally
Won the ICDAR handwriting competition (2009)
106 / 142
Images/cinvestav
Right now
Something Notable
As of 2016, major technology companies including Google, Apple, and
Microsoft were using LSTM as fundamental components in new
products.
107 / 142
Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
108 / 142
Images/cinvestav
History
They were proposed as a simplification of the LSTM
In 2014, Kyunghyun Cho et al. put forward a simplified variant called
Gated recurrent unit (GRU)
Something Notable
The GRU is like a long short-term memory (LSTM) with forget
gate...
but has fewer parameters than LSTM, as it lacks an output gate
109 / 142
Images/cinvestav
Gated Recurrent Units
Architecture
zt = σ [Wz [ht−1, xt] + bz] (Update Gate)
rt = σ [Wr [ht−1, xt] + br] (Reset Gate)
ĥt = tanh [Wh [rt ⊙ ht−1, xt] + bh] (Candidate State)
ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt (Hidden State)
110 / 142
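As with the LSTM earlier, here is a minimal NumPy sketch of one GRU forward step; the name gru_step and the parameter layout (Wz, Wr, Wh acting on the concatenations shown in the equations above) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU forward step following the gate equations above."""
    zr_in = np.concatenate([h_prev, x_t])                  # concatenation [h_{t-1}, x_t]
    z_t = sigmoid(params["W_z"] @ zr_in + params["b_z"])   # update gate
    r_t = sigmoid(params["W_r"] @ zr_in + params["b_r"])   # reset gate
    h_in = np.concatenate([r_t * h_prev, x_t])             # elementwise reset of h_{t-1}
    h_hat = np.tanh(params["W_h"] @ h_in + params["b_h"])  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_hat              # interpolate old and new state
```

Compared with the LSTM step, there is no separate cell state and no output gate, which is where the parameter savings mentioned earlier come from.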
Images/cinvestav
Graphically, we have the architecture
GRU Architecture
111 / 142
Images/cinvestav
Main Observations
There is a gate used to combine the previous state ht−1 with the candidate state
The zt gate uses the information of the input and the
previous state to decide how much to update
ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt
The intermediate step ĥt
A bounded (tanh-squashed) version of the possible next state ht
112 / 142
Images/cinvestav
Next
We have a reset gate
rt = σ [Wr [ht−1, xt] + br]
Which is used to compute the candidate state
ĥt = tanh [Wh [rt ⊙ ht−1, xt] + bh]
113 / 142
Images/cinvestav
However
It has been shown that
As shown by Gail Weiss, Yoav Goldberg and Eran Yahav, the LSTM is
"strictly stronger" than the GRU
The LSTM can perform unbounded counting [13]
The GRU cannot
Since the LSTM can simulate a counter machine used in theoretical CS
Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le of Google
Brain
LSTM cells consistently outperform GRU cells in "the first large-scale
analysis of architecture variations for Neural Machine Translation.”
114 / 142
Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
115 / 142
Images/cinvestav
Given that we want to do sequence modeling
Stock Options
Predict next phrase
Question: If I am a man?
Prediction: you are homo sapiens
116 / 142
Images/cinvestav
What do we have in these sequences of data?
Sequences have different lengths
We need to handle variable-length sequences
117 / 142
Images/cinvestav
Furthermore
We need to track long-term dependencies
118 / 142
Images/cinvestav
Not only that
Maintain information about order
“We have a mother living in Yucatan, Mexico”
Share parameters across the sequence
Do you remember the state ht?
119 / 142
Images/cinvestav
However
There is a need to increase their power
Given the amounts of data we have right now
Then there is a tendency to start using Recurrent Neural Networks
As cells to be stacked into bigger systems [14, 15]
This is based on the following idea [16]
Hypothesis: a hierarchical model can be exponentially more efficient at
representing some functions than a shallow one.
120 / 142
Images/cinvestav
In the case of RNN’s
Certain Transitions are not Deep
They are only the result of a linear projection followed by an
element-wise nonlinearity.
They are
Hidden-to-hidden ht−1 → ht
Hidden-to-output ht → yt
Input-to-hidden xt → ht
Meaning
They are all shallow in the sense that there exists no intermediate,
nonlinear hidden layer.
121 / 142
Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
122 / 142
Images/cinvestav
Bengio et al. [17]
Gave the following Hypothesis
Sampling algorithms (Markov Chain and MCMC techniques) suffer
from a fundamental problem
Given unconnected or weakly connected regions of distributions
We have that
it is difficult for the Markov chain to jump from one mode of the
distribution to another, when these are separated by large low-density
regions
This means that we have a slow mixing of samples
When trying to represent the distribution
123 / 142
Images/cinvestav
Example
A Big Problem
[Figure: high-density regions (modes) separated by regions of low density]
124 / 142
Images/cinvestav
The Main Problem
We have that
Slow mixing means that many consecutive samples tend to be
correlated
They belong to the same mode of the mixture
Why?
Jumping between modes in the MCMC method is quite slow and rare
125 / 142
Images/cinvestav
Implications in Learning Algorithms
Given that some form of sampling is at the core of many learning
algorithms
For example, to estimate the log-likelihood gradient
Therefore, at the beginning of learning
Mixing is initially easy
However as the model improves
its corresponding distribution sharpens and mixing becomes slower
126 / 142
Images/cinvestav
Basically
We have slowdowns in learning with shallow models
[Figure: learning area vs. no-learning area]
127 / 142
Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
128 / 142
Images/cinvestav
Therefore
We need to build deeper structures to reach more capabilities
For example, the vector representation of documents
Here an extra representation layer can be used before the recurrent
architecture
For example, Mikolov et al. [18]
129 / 142
Images/cinvestav
Basically a shallow network before the main architecture
An Encoder Layer before a GRU
130 / 142
Images/cinvestav
The equations
They will look like
zt = σ [Wz [ht−1, xt] + bz] (Update Gate)
rt = σ [Wr [ht−1, xt] + br] (Reset Gate)
ĥt = tanh [Wh [rt ⊙ ht−1, xt] + bh]
ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt
With the GRU input x produced by the shallow encoder from the raw input w
y = σ (Whi w)
x = σ (Woh y)
131 / 142
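A minimal sketch of this deep input-to-hidden idea, assuming σ is a sigmoid and reusing the hypothetical gru_step from the GRU section: the raw input w is passed through the two shallow encoder layers and only the resulting x is fed to the recurrent cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode_input(w, W_hi, W_oh):
    """Shallow encoder before the GRU: w -> y -> x (slide notation)."""
    y = sigmoid(W_hi @ w)   # first representation layer
    x = sigmoid(W_oh @ y)   # second representation layer; this x feeds the GRU
    return x

# Hypothetical use inside the recurrence:
#   h_t = gru_step(encode_input(w_t, W_hi, W_oh), h_prev, params)
```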
Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
132 / 142
Images/cinvestav
Deep Transition Architectures
In a deep transition RNN (DT-RNN)
At each time step the next state is computed by the sequential
application of multiple transition layers.
For example in Nematus system [19]
They use GRU transition blocks with independently trainable
parameters
With a Caveat
The hidden state output of each block is used as the input state of the next one
133 / 142
Images/cinvestav
For example, at the encoder phase
For the ith source word in the forward direction, we have hi = hi,Ls
hi,1 = GRU1 (xi, hi−1,Ls)
hi,k = GRUk (0, hi,k−1) for 1 < k ≤ Ls
The word sequence is also processed in reverse to obtain a backward state; then
C ≡ [→hi,Ls , ←hi,Ls]
134 / 142
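A minimal sketch of the forward-direction recurrence above, under the assumption that each transition layer is exposed as a step function gru_k(input, state) (for instance the hypothetical gru_step from earlier, each with its own parameters), and that the zero input has the same dimension as the word embedding: layer 1 consumes the word embedding and the top-level state of the previous position, while the deeper layers consume a zero input and the state produced by the layer below.

```python
import numpy as np

def deep_transition_step(x_i, h_prev_top, gru_layers):
    """One forward deep transition step; returns h_i = h_{i,Ls}."""
    state = gru_layers[0](x_i, h_prev_top)   # GRU_1(x_i, h_{i-1,Ls})
    zero = np.zeros_like(x_i)                # assumed zero input for deeper layers
    for gru_k in gru_layers[1:]:
        state = gru_k(zero, state)           # GRU_k(0, h_{i,k-1})
    return state
```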
Images/cinvestav
Then
The decoder phase uses the outputs from the previous GRU and
something called attention (we will look at this later)
sj,1 = GRU1 (yj−1, sj−1,Lt)
sj,2 = GRU2 (ATT, sj,1)
sj,k = GRUk (0, sj,k−1) for 2 < k ≤ Lt
Then, the target word state sj ≡ sj,Lt
It is used by a feed-forward neural network to predict the current
target word
135 / 142
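The decoder recurrence can be sketched the same way; here the attention output ATT is assumed to be a precomputed context vector passed in from outside, and the per-layer step functions use the same hypothetical gru_k(input, state) interface as before.

```python
import numpy as np

def deep_transition_decoder_step(y_prev, s_prev_top, att_context, gru_layers):
    """One decoder step; returns the target word state s_j = s_{j,Lt}."""
    s = gru_layers[0](y_prev, s_prev_top)   # GRU_1(y_{j-1}, s_{j-1,Lt})
    s = gru_layers[1](att_context, s)       # GRU_2(ATT, s_{j,1})
    zero = np.zeros_like(y_prev)            # assumed zero input for deeper layers
    for gru_k in gru_layers[2:]:
        s = gru_k(zero, s)                  # GRU_k(0, s_{j,k-1})
    return s
```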
Images/cinvestav
Deep Transition Decoder
We have the following depiction of the architecture
136 / 142
Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
137 / 142
Images/cinvestav
There are many other examples
Basically
We are moving far from the classic methods such as
1 Autoregressive moving average (ARMA)
2 Autoregressive integrated moving average (ARIMA)
3 etc.
These RNN architectures are taking the prediction of time series
To another level!!!
138 / 142
Images/cinvestav
O. L. R. Jacobs, “Introduction to control theory,” 1974.
A. Robinson and F. Fallside, The utility driven dynamic error
propagation network.
University of Cambridge Department of Engineering, 1987.
P. J. Werbos et al., “Backpropagation through time: what it does and
how to do it,” Proceedings of the IEEE, vol. 78, no. 10,
pp. 1550–1560, 1990.
G. Chen, “A gentle tutorial of recurrent neural network with error
backpropagation,” arXiv preprint arXiv:1610.02583, 2016.
B. W. Bader and T. G. Kolda, “Algorithm 862: Matlab tensor classes
for fast algorithm prototyping,” ACM Trans. Math. Softw., vol. 32,
Dec. 2006.
J. Pennington, S. S. Schoenholz, and S. Ganguli, “The emergence of
spectral universality in deep networks,” arXiv preprint
arXiv:1802.09979, 2018.
139 / 142
Images/cinvestav
R. J. Williams and D. Zipser, “Gradient-based learning algorithms for
recurrent,” Backpropagation: Theory, architectures, and applications,
vol. 433, 1995.
R. J. Williams and J. Peng, “An efficient gradient-based algorithm for
on-line training of recurrent network trajectories,” Neural
computation, vol. 2, no. 4, pp. 490–501, 1990.
J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14,
no. 2, pp. 179–211, 1990.
T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur,
“Recurrent neural network based language model,” in Eleventh annual
conference of the international speech communication association,
2010.
H.-G. Zimmermann, C. Tietz, and R. Grothmann, “Forecasting with
recurrent neural networks: 12 tricks,” in Neural Networks: Tricks of
the Trade, pp. 687–707, Springer, 2012.
140 / 142
Images/cinvestav
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
G. Weiss, Y. Goldberg, and E. Yahav, “On the practical computational
power of finite precision rnns for language recognition,” CoRR,
vol. abs/1805.04908, 2018.
R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct
deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013.
A. V. M. Barone, J. Helcl, R. Sennrich, B. Haddow, and A. Birch,
“Deep architectures for neural machine translation,” arXiv preprint
arXiv:1707.07631, 2017.
Y. Bengio et al., “Learning deep architectures for AI,” Foundations and
Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, “Better mixing via
deep representations,” in International conference on machine
learning, pp. 552–560, 2013.
141 / 142
Images/cinvestav
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation
of word representations in vector space,” arXiv preprint
arXiv:1301.3781, 2013.
R. Sennrich, O. Firat, K. Cho, A. Birch, B. Haddow, J. Hitschler,
M. Junczys-Dowmunt, S. Läubli, A. V. M. Barone, J. Mokry, et al.,
“Nematus: a toolkit for neural machine translation,” arXiv preprint
arXiv:1703.04357, 2017.
142 / 142

More Related Content

What's hot

Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...Andres Mendez-Vazquez
 
Refresher probabilities-statistics
Refresher probabilities-statisticsRefresher probabilities-statistics
Refresher probabilities-statisticsSteve Nouri
 
Computational Complexity: Complexity Classes
Computational Complexity: Complexity ClassesComputational Complexity: Complexity Classes
Computational Complexity: Complexity ClassesAntonis Antonopoulos
 
Cheatsheet supervised-learning
Cheatsheet supervised-learningCheatsheet supervised-learning
Cheatsheet supervised-learningSteve Nouri
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi DivergenceMasahiro Suzuki
 
Introduction to complexity theory assignment
Introduction to complexity theory assignmentIntroduction to complexity theory assignment
Introduction to complexity theory assignmenttesfahunegn minwuyelet
 
Computational Complexity: Introduction-Turing Machines-Undecidability
Computational Complexity: Introduction-Turing Machines-UndecidabilityComputational Complexity: Introduction-Turing Machines-Undecidability
Computational Complexity: Introduction-Turing Machines-UndecidabilityAntonis Antonopoulos
 
Cheatsheet convolutional-neural-networks
Cheatsheet convolutional-neural-networksCheatsheet convolutional-neural-networks
Cheatsheet convolutional-neural-networksSteve Nouri
 
Cheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksCheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksSteve Nouri
 
Algorithmics on SLP-compressed strings
Algorithmics on SLP-compressed stringsAlgorithmics on SLP-compressed strings
Algorithmics on SLP-compressed stringsAntonis Antonopoulos
 
Computational Complexity: Oracles and the Polynomial Hierarchy
Computational Complexity: Oracles and the Polynomial HierarchyComputational Complexity: Oracles and the Polynomial Hierarchy
Computational Complexity: Oracles and the Polynomial HierarchyAntonis Antonopoulos
 
Refresher algebra-calculus
Refresher algebra-calculusRefresher algebra-calculus
Refresher algebra-calculusSteve Nouri
 
Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10Traian Rebedea
 

What's hot (20)

Chemistry Assignment Help
Chemistry Assignment Help Chemistry Assignment Help
Chemistry Assignment Help
 
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
Artificial Intelligence 06.3 Bayesian Networks - Belief Propagation - Junctio...
 
Refresher probabilities-statistics
Refresher probabilities-statisticsRefresher probabilities-statistics
Refresher probabilities-statistics
 
Computational Complexity: Complexity Classes
Computational Complexity: Complexity ClassesComputational Complexity: Complexity Classes
Computational Complexity: Complexity Classes
 
Cheatsheet supervised-learning
Cheatsheet supervised-learningCheatsheet supervised-learning
Cheatsheet supervised-learning
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence
 
Richard Everitt's slides
Richard Everitt's slidesRichard Everitt's slides
Richard Everitt's slides
 
Introduction to complexity theory assignment
Introduction to complexity theory assignmentIntroduction to complexity theory assignment
Introduction to complexity theory assignment
 
Computational Complexity: Introduction-Turing Machines-Undecidability
Computational Complexity: Introduction-Turing Machines-UndecidabilityComputational Complexity: Introduction-Turing Machines-Undecidability
Computational Complexity: Introduction-Turing Machines-Undecidability
 
Cheatsheet convolutional-neural-networks
Cheatsheet convolutional-neural-networksCheatsheet convolutional-neural-networks
Cheatsheet convolutional-neural-networks
 
Cheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networksCheatsheet recurrent-neural-networks
Cheatsheet recurrent-neural-networks
 
Algorithmics on SLP-compressed strings
Algorithmics on SLP-compressed stringsAlgorithmics on SLP-compressed strings
Algorithmics on SLP-compressed strings
 
Stochastic Assignment Help
Stochastic Assignment Help Stochastic Assignment Help
Stochastic Assignment Help
 
Signals Processing Homework Help
Signals Processing Homework HelpSignals Processing Homework Help
Signals Processing Homework Help
 
Computational Complexity: Oracles and the Polynomial Hierarchy
Computational Complexity: Oracles and the Polynomial HierarchyComputational Complexity: Oracles and the Polynomial Hierarchy
Computational Complexity: Oracles and the Polynomial Hierarchy
 
Matlab Assignment Help
Matlab Assignment HelpMatlab Assignment Help
Matlab Assignment Help
 
Refresher algebra-calculus
Refresher algebra-calculusRefresher algebra-calculus
Refresher algebra-calculus
 
Statistical Physics Assignment Help
Statistical Physics Assignment HelpStatistical Physics Assignment Help
Statistical Physics Assignment Help
 
Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10Algorithm Design and Complexity - Course 10
Algorithm Design and Complexity - Course 10
 
Chris Sherlock's slides
Chris Sherlock's slidesChris Sherlock's slides
Chris Sherlock's slides
 

Similar to 06 recurrent neural_networks

Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Larry Guo
 
RNN and sequence-to-sequence processing
RNN and sequence-to-sequence processingRNN and sequence-to-sequence processing
RNN and sequence-to-sequence processingDongang (Sean) Wang
 
14889574 dl ml RNN Deeplearning MMMm.ppt
14889574 dl ml RNN Deeplearning MMMm.ppt14889574 dl ml RNN Deeplearning MMMm.ppt
14889574 dl ml RNN Deeplearning MMMm.pptManiMaran230751
 
Deep-Learning-2017-Lecture6RNN.ppt
Deep-Learning-2017-Lecture6RNN.pptDeep-Learning-2017-Lecture6RNN.ppt
Deep-Learning-2017-Lecture6RNN.pptShankerRajendiran2
 
Deep-Learning-2017-Lecture6RNN.ppt
Deep-Learning-2017-Lecture6RNN.pptDeep-Learning-2017-Lecture6RNN.ppt
Deep-Learning-2017-Lecture6RNN.pptSatyaNarayana594629
 
Deep learning for detection hate speech.ppt
Deep learning for detection hate speech.pptDeep learning for detection hate speech.ppt
Deep learning for detection hate speech.pptusmanshoukat28
 
12337673 deep learning RNN RNN DL ML sa.ppt
12337673 deep learning RNN RNN DL ML sa.ppt12337673 deep learning RNN RNN DL ML sa.ppt
12337673 deep learning RNN RNN DL ML sa.pptManiMaran230751
 
Deep-Learning-2017-Lecture ML DL RNN.ppt
Deep-Learning-2017-Lecture  ML DL RNN.pptDeep-Learning-2017-Lecture  ML DL RNN.ppt
Deep-Learning-2017-Lecture ML DL RNN.pptManiMaran230751
 
Digital signal processing
Digital signal processingDigital signal processing
Digital signal processingchitra raju
 
Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03Rediet Moges
 
Basics of edge detection and forier transform
Basics of edge detection and forier transformBasics of edge detection and forier transform
Basics of edge detection and forier transformSimranjit Singh
 
Review session2
Review session2Review session2
Review session2NEEDY12345
 
Availability of a Redundant System with Two Parallel Active Components
Availability of a Redundant System with Two Parallel Active ComponentsAvailability of a Redundant System with Two Parallel Active Components
Availability of a Redundant System with Two Parallel Active Componentstheijes
 
Backpropagation in RNN and LSTM
Backpropagation in RNN and LSTMBackpropagation in RNN and LSTM
Backpropagation in RNN and LSTMHimanshuSingh1370
 
Parametrized Model Checking of Fault Tolerant Distributed Algorithms by Abstr...
Parametrized Model Checking of Fault Tolerant Distributed Algorithms by Abstr...Parametrized Model Checking of Fault Tolerant Distributed Algorithms by Abstr...
Parametrized Model Checking of Fault Tolerant Distributed Algorithms by Abstr...Iosif Itkin
 

Similar to 06 recurrent neural_networks (20)

Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10)
 
RNN and sequence-to-sequence processing
RNN and sequence-to-sequence processingRNN and sequence-to-sequence processing
RNN and sequence-to-sequence processing
 
14889574 dl ml RNN Deeplearning MMMm.ppt
14889574 dl ml RNN Deeplearning MMMm.ppt14889574 dl ml RNN Deeplearning MMMm.ppt
14889574 dl ml RNN Deeplearning MMMm.ppt
 
Dsp 2marks
Dsp 2marksDsp 2marks
Dsp 2marks
 
LSTM.pdf
LSTM.pdfLSTM.pdf
LSTM.pdf
 
10_rnn.pdf
10_rnn.pdf10_rnn.pdf
10_rnn.pdf
 
Fourier slide
Fourier slideFourier slide
Fourier slide
 
Deep-Learning-2017-Lecture6RNN.ppt
Deep-Learning-2017-Lecture6RNN.pptDeep-Learning-2017-Lecture6RNN.ppt
Deep-Learning-2017-Lecture6RNN.ppt
 
Deep-Learning-2017-Lecture6RNN.ppt
Deep-Learning-2017-Lecture6RNN.pptDeep-Learning-2017-Lecture6RNN.ppt
Deep-Learning-2017-Lecture6RNN.ppt
 
RNN.ppt
RNN.pptRNN.ppt
RNN.ppt
 
Deep learning for detection hate speech.ppt
Deep learning for detection hate speech.pptDeep learning for detection hate speech.ppt
Deep learning for detection hate speech.ppt
 
12337673 deep learning RNN RNN DL ML sa.ppt
12337673 deep learning RNN RNN DL ML sa.ppt12337673 deep learning RNN RNN DL ML sa.ppt
12337673 deep learning RNN RNN DL ML sa.ppt
 
Deep-Learning-2017-Lecture ML DL RNN.ppt
Deep-Learning-2017-Lecture  ML DL RNN.pptDeep-Learning-2017-Lecture  ML DL RNN.ppt
Deep-Learning-2017-Lecture ML DL RNN.ppt
 
Digital signal processing
Digital signal processingDigital signal processing
Digital signal processing
 
Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03
 
Basics of edge detection and forier transform
Basics of edge detection and forier transformBasics of edge detection and forier transform
Basics of edge detection and forier transform
 
Review session2
Review session2Review session2
Review session2
 
Availability of a Redundant System with Two Parallel Active Components
Availability of a Redundant System with Two Parallel Active ComponentsAvailability of a Redundant System with Two Parallel Active Components
Availability of a Redundant System with Two Parallel Active Components
 
Backpropagation in RNN and LSTM
Backpropagation in RNN and LSTMBackpropagation in RNN and LSTM
Backpropagation in RNN and LSTM
 
Parametrized Model Checking of Fault Tolerant Distributed Algorithms by Abstr...
Parametrized Model Checking of Fault Tolerant Distributed Algorithms by Abstr...Parametrized Model Checking of Fault Tolerant Distributed Algorithms by Abstr...
Parametrized Model Checking of Fault Tolerant Distributed Algorithms by Abstr...
 

More from Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusAndres Mendez-Vazquez
 
Introduction Machine Learning Syllabus
Introduction Machine Learning SyllabusIntroduction Machine Learning Syllabus
Introduction Machine Learning SyllabusAndres Mendez-Vazquez
 
Schedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine LearningSchedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine LearningAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 
Introduction Machine Learning Syllabus
Introduction Machine Learning SyllabusIntroduction Machine Learning Syllabus
Introduction Machine Learning Syllabus
 
Schedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine LearningSchedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine Learning
 

Recently uploaded

Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...Amil Baba Dawood bangali
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Industrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESIndustrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESNarmatha D
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptNarmatha D
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 

Recently uploaded (20)

Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
NO1 Certified Black Magic Specialist Expert Amil baba in Uae Dubai Abu Dhabi ...
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Industrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESIndustrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIES
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.ppt
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 

06 recurrent neural_networks

  • 1. Introduction to Neural Networks and Deep Learning Recurrent Neural Networks Andres Mendez-Vazquez November 13, 2019 1 / 142
  • 2. Images/cinvestav Outline 1 Introduction History State-Space Model Back to the RNN Equations Introducing the Cost Function Other Cost Functions 2 Training a Vanilla RNN Model The Final RNN Model Back-Propagation Through Time (BPTT) Vanishing and Exploding Gradients Fixing the Problem, ReLu function The Analysis of the Exploding and Vanishing Gradient Truncated BPTT Initialization Hidden State 3 Modern Recurrent Architectures Now, Long Short Term Memory (LSTM) What about Gated Recurrent Units (GRU) units? 4 Deeper Architectures with RNN’s Introduction Deep Architectures for Better Learning Deep Input-to-Hidden Function Deep Transition Architectures Conclusions 2 / 142
  • 3. Images/cinvestav Outline 1 Introduction History State-Space Model Back to the RNN Equations Introducing the Cost Function Other Cost Functions 2 Training a Vanilla RNN Model The Final RNN Model Back-Propagation Through Time (BPTT) Vanishing and Exploding Gradients Fixing the Problem, ReLu function The Analysis of the Exploding and Vanishing Gradient Truncated BPTT Initialization Hidden State 3 Modern Recurrent Architectures Now, Long Short Term Memory (LSTM) What about Gated Recurrent Units (GRU) units? 4 Deeper Architectures with RNN’s Introduction Deep Architectures for Better Learning Deep Input-to-Hidden Function Deep Transition Architectures Conclusions 3 / 142
  • 4. Images/cinvestav In 1987 Robinson and Fallside [2] At Cambridge University Engineering Department They proposed a new type of neural network based on Linear Control Theory They took the work of Jacobs, 1974 on dynamic nets [1] st+1 = Ast + Bxt yt = Cst 4 / 142
  • 5. Images/cinvestav In 1987 Robinson and Fallside [2] At Cambridge University Engineering Department They proposed a new type of neural network based on Linear Control Theory They took the work of Jacobs, 1974 on dynamic nets [1] st+1 = Ast + Bxt yt = Cst 4 / 142
  • 6. Images/cinvestav Example of this unit We have 5 / 142
  • 7. Images/cinvestav Furthermore Jordan Proposed a simple recurrent network ht = σh (Wsdxt + Ussht−1 + bh) yt = σs (Vosht + bo) Where 1 xt is an input of dimension d. 2 ht is a hidden state layer of dimension h. 3 yt is the output vector of dimension s. 4 W, U, V, bh and bo parameter matrices and vectors. 5 σh and σs are activation functions. 6 / 142
  • 14. Images/cinvestav What were they used for? Robinson and Fallside As with Hidden Markov Models, they were proposed for Speech Coding They proposed the following architecture 8 / 142
  • 17. Images/cinvestav Based on the State-Space Model Basically, a linear system Based in a state-determined system model Definition A mathematical description of the system in terms of a minimum set of variables xi(t), i = 1, ..., n, together with knowledge of those variables at an initial time t0 and the system inputs for time t ≥ t0, are sufficient to predict the future system state and outputs for all time t > t0. 10 / 142
  • 19. Images/cinvestav Therefore We have a system as a block System Described by State Variables This can be expressed as state equations ṡ1 = f1 (x, s, t) ṡ2 = f2 (x, s, t) · · · ṡn = fn (x, s, t) 11 / 142
  • 21. Images/cinvestav Using Vector Notation Assuming that we have a linear, time-invariant system Time-Invariant: a shift of the input x (t + δ) produces the same shift of the output y (t + δ), for example αx (t + δ) + β = y (t + δ) Therefore, using this idea ṡi = ai1x1 (t) + ... + aidxd (t) + bi1s1 (t) + ... + binsn (t) Or in Matrix form y (t) = Ax (t) + Bs (t) 12 / 142
  • 24. Images/cinvestav Then, the discretized version We introduce an update for the state part y (t) = Ax (t) + Bs (t) ṡ (t) = Cs (t) Or our discrete step equations y (t) = Ax (t) + Bs (t) s (t + 1) = Cs (t) 13 / 142
  • 27. Images/cinvestav The Elman Network In Elman’s Equations ht = σh (Wsdxt + Ussht−1 + bh) yt = σy (Vosht + by) We noticed something different from the linear recurrent system The use of activation functions to introduce the concept of non-linearity 15 / 142
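  As a concrete illustration of the Elman equations above, here is a minimal NumPy sketch of one forward step; tanh and an identity output stand in for the generic activations σh and σy, and the shapes (W: s×d, U: s×s, V: o×s) are assumptions chosen only for the example.
```python
import numpy as np

def elman_step(x_t, h_prev, W, U, V, b_h, b_o):
    """One step of an Elman-style RNN (a sketch).
    tanh plays the role of sigma_h; the output uses an identity sigma_y."""
    h_t = np.tanh(W @ x_t + U @ h_prev + b_h)   # new hidden state
    y_t = V @ h_t + b_o                         # output vector
    return h_t, y_t

# tiny usage example with random parameters
rng = np.random.default_rng(0)
d, s, o = 3, 4, 2
W, U, V = rng.standard_normal((s, d)), rng.standard_normal((s, s)), rng.standard_normal((o, s))
h, y = elman_step(rng.standard_normal(d), np.zeros(s), W, U, V, np.zeros(s), np.zeros(o))
```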
  • 29. Images/cinvestav Explanation We have the following 1 The input xt is coded by Wsd as Wsdxt 2 A state is generated by using the codified version of the input plus the previous state ht−1 ht = σh (Wsdxt + Ussht−1 + bh) 3 The output is generated using the new state ht yt = σy (Vosht + by) 16 / 142
  • 33. Images/cinvestav We need to introduce the concept of cost function Which as always It needs to comply with two properties The cost function L must be able to be written as an average L = (1/N) Σ_{x∈X} Cx over the individual cost functions Cx This allows us to apply different optimization techniques such as Minibatch Stochastic Gradient Descent, etc. 18 / 142
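  Because L is an average of per-example costs Cx, a minibatch estimate of the gradient is just the mean of per-example gradients. A minimal sketch; grad_fn and the flat parameter vector are assumptions used only for illustration.
```python
import numpy as np

def minibatch_sgd_step(params, grad_fn, batch, lr=1e-2):
    """One SGD step using a minibatch estimate of the average-cost gradient.
    grad_fn(params, x) is assumed to return the gradient of C_x w.r.t. params."""
    grads = [grad_fn(params, x) for x in batch]
    return params - lr * np.mean(grads, axis=0)
```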
  • 36. Images/cinvestav Furthermore Non dependency The cost function L must not be dependent on any activation values of a neural network besides the output values. If we cannot assure this, Backpropagation becomes too unstable or too complex to solve. For example L = (1/N) Σ_{t=0}^{N} [yt + ht − zt]² This gives two entry points to the network. 19 / 142
  • 38. Images/cinvestav A List of Cost Functions The Average Quadratic Cost L = (1/N) Σ_{t=0}^{N} [yt − zt]² Where yt is the output of the network and zt is the ground truth of the output. Here, we are interpolating functions 20 / 142
  • 40. Images/cinvestav Cross-Entropy cost First, the Loss Function L = − Σ_{i=1}^{C} zi log (yi) Where yi is the output and zi is the ground truth for the class estimation. Why zi log (yi)? We can imagine a sequence of class probabilities y1, y2, ..., ym and the likelihood of the data given the model P [data|model] = y1^{k1} y2^{k2} · · · ym^{km} 21 / 142
  • 42. Images/cinvestav Then Taking the logarithm and multiplying by -1 − log P [data|model] = − Σ_{i=1}^{C} ki log yi Then, dividing by the total number of samples − (1/N) log P [data|model] = − Σ_{i=1}^{C} (ki/N) log yi = − Σ_{i=1}^{C} zi log yi 22 / 142
  • 44. Images/cinvestav Then In information theory, The Kraft–McMillan theorem It establishes that any directly decodable coding scheme for coding a message to identify one value xi ∈ {x1, x2, ..., xn} It can be seen as representing an implicit probability distribution over {x1, x2, ..., xn} q (xi) = (1/2)^{li} Where li is the length of the code for xi 23 / 142
  • 46. Images/cinvestav Now We have that Cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q is assumed while the data actually follows a distribution p. The expected message-length under the true distribution p is Ep [l] = −Ep [ln q (x) / ln 2] = −Ep [log2 q (x)] = − Σ_{xi} p (xi) log2 q (xi) = H (p, q) 24 / 142
  • 48. Images/cinvestav Special Case A special case is the binary class problem, C = 2 Based on the fact that z1 + z2 = 1 and y1 + y2 = 1 L = − Σ_{i=1}^{2} zi log (yi) = −z1 log (y1) − (1 − z1) log (1 − y1) A problem with this It could be possible to have y1 = 0 25 / 142
  • 50. Images/cinvestav Dealing with this problem We can use an activation function in front of it (figure: Sigmoid Cross Entropy Loss) 26 / 142
  • 51. Images/cinvestav Another Interpretation The Loss can be expressed as L = − log (f (y1)) if z1 = 1, and L = − log (1 − f (y1)) if z1 = 0 Where z1 = 1 It means that the class C1 = Ci is positive for this sample. 27 / 142
  • 53. Images/cinvestav The Gradient of the Binary Cross Entropy We take the derivative with respect to y1 ∂L/∂y1 = z1 (f (y1) − 1) + (1 − z1) f (y1) 28 / 142
  • 54. Images/cinvestav In the case of the Multiclass Problem We use two things, a softmax f (yi) = exp {yi} / Σ_{j=1}^{C} exp {yj} As in the multiclass case for the Linear Models The labels are one-hot, so only the positive class Cp keeps its term in the loss. Therefore There is only one element of the Target vector z that is not zero, zi = zp. 29 / 142
  • 57. Images/cinvestav We can then simplify The cost function becomes L = − Σ_{i=1}^{C} zi log (f (yi)) = −log ( exp {yp} / Σ_{j=1}^{C} exp {yj} ) 30 / 142
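  A small numeric check of this simplified loss; the score vector y and the positive class p below are made-up values for illustration.
```python
import numpy as np

def softmax_cross_entropy(y, p):
    """-log( exp(y_p) / sum_j exp(y_j) ), computed in a numerically stable way."""
    y = y - y.max()                      # shift does not change the softmax
    return -(y[p] - np.log(np.exp(y).sum()))

print(softmax_cross_entropy(np.array([2.0, 1.0, 0.1]), p=0))  # ~0.417
```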
  • 59. Images/cinvestav Other Cost Functions Exponential Cost with hyper-parameter τ L = τ exp ( (1/τ) Σ_{i=1}^{N} (yi − zi)² ) Hellinger Distance L = (1/2) Σ_{i=1}^{N} (√yi − √zi)² Here the values need to be in the interval [0, 1]. 32 / 142
  • 61. Images/cinvestav Other Cost Functions Given the Kullback-Leibler Divergence DKL (P ‖ Q) = Σ_i P (i) ln ( P (i) / Q (i) ) The Final Cost function L = Σ_j ŷj log ( ŷj / yj^{pred} ) 33 / 142
  • 64. Images/cinvestav We have the following Architecture with Quadratic Error ht = σh (Wsdxt + Ussht−1 + bh) yt = σy (Vosht + by) L = (1/2) Σ_{t=0}^{N} [yt − zt]² Something Notable How do we train something with a recurrence forcing a dependence over time? 35 / 142
  • 67. Images/cinvestav Now, given the dependency over time We can use the classic unfolding of the network [3, 4] by assuming W, U, V, bh and bo do not change under the unfolding Unfolding? Assume that there are no bias correcting terms, only W, U and V . 37 / 142
  • 69. Images/cinvestav Then Given an observation sequence x = {x1, x2, ..., xT } where xi ∈ R, and the corresponding labels z = {z1, z2, ..., zT } We remove the bias to simplify our derivations ht = σh (Wsdxt + Ussht−1) yt = σy (Vosht) L = (1/2) Σ_{t=0}^{T} [zt − yt]² 38 / 142
  • 71. Images/cinvestav Unfolding We can then see the unfolding of the recurrence (figure: the recurrent network unrolled over the time steps) 39 / 142
  • 72. Images/cinvestav This allows To simplify the backpropagation process ∂L/∂Vos = Σ_{t=0}^{T} (∂L/∂yt) × (∂yt/∂Vos) = Σ_{t=0}^{T} (∂L/∂yt) × (∂yt/∂neto) × (∂neto/∂Vos) = − Σ_{t=0}^{T} [zt − yt] × (∂yt/∂neto) × (∂neto/∂Vos) Where neto = Vosht 40 / 142
  • 75. Images/cinvestav Now, we have We have that ∂yt/∂neto is the o × o matrix of partial derivatives [∂ytj/∂netoi] for i, j = 1, . . . , o 41 / 142
  • 76. Images/cinvestav Simplify!!! Now, we have that if i = j ∂yti/∂netoi = σ′ (netoi) And for the rest, with i ≠ j ∂yti/∂netoj = 0 42 / 142
  • 78. Images/cinvestav Finally We have that ∂yt/∂neto = diag (σ′o (neto1) , σ′o (neto2) , . . . , σ′o (netoo)) = A 43 / 142
  • 79. Images/cinvestav Now, ∂neto/∂Vos First we have a component i netoi = Σ_{j=1}^{s} Vijhj What happens when we derive with respect to the matrix? ∂neto/∂Vos collects every entry ∂netoi/∂Vjk Actually A Tensor with three dimensions... 44 / 142
  • 82. Images/cinvestav But something quite nice Each of the components of neto It has the previous structure netoi = Σ_{j=1}^{s} Vijhj Then if the entry Vkl does not intervene in it ∂netoi/∂Vjk = 0 Additionally if it intervenes ∂netoi/∂Vji = hj 45 / 142
  • 85. Images/cinvestav Therefore It is possible to collapse the tensor into a 2D Matrix Given that the other information is redundant, and we can rewrite the tensor as Fijk = ∂netoi/∂Vjk Then, we have that Fiji = Gij ⇐ Better Storage!!! 46 / 142
  • 87. Images/cinvestav Therefore, given that a matrix is a tensor also We have two tensors, neto×o and Fo×s×o [5] We will use the contracted product of two tensors which is a generalization of the tensor-vector and tensor-matrix multiplications Definition Given two tensors Ao×o and Bo×s×o ⟨A, B⟩ (k, j) = Σ_{i=1}^{o} Ai,kGi,j = Ak,kGk,j = σ′ (netok) hj 47 / 142
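  In code, the collapsed contraction is nothing more than an outer product between the diagonal of A (the derivatives σ′(neto)) and the hidden vector h. A small NumPy sketch with made-up sizes:
```python
import numpy as np

o, s = 4, 3
d_act = np.random.randn(o)   # stands in for the diagonal entries sigma'(net_o)
h = np.random.randn(s)

# contracted product <A, F>(k, j) = sigma'(net_o_k) * h_j, i.e. an outer product
G = np.einsum('k,j->kj', d_act, h)
assert np.allclose(G, np.outer(d_act, h))
```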
  • 89. Images/cinvestav Now, the term ∂L/∂Uss Assuming our change in time step t → t + 1 and given ht = σh (Wsdxt + Ussht−1) Therefore we have ∂L (t + 1) /∂Uss = (∂L (t + 1) /∂yt+1) × (∂yt+1/∂ht+1) × (∂ht+1/∂Uss) Therefore We can think of this as a Markovian Backpropagation 48 / 142
  • 92. Images/cinvestav What if we go further From t − 1 → t + 1 ∂L (t + 1) /∂Uss = (∂L (t + 1) /∂yt+1) × (∂yt+1/∂ht+1) × (∂ht+1/∂ht) × (∂ht/∂Uss) Now, the trick if we consider all the possible derivatives from t to 0 We have: ∂L (t + 1) /∂Uss = Σ_{k=1}^{t} (∂L (t + 1) /∂yt+1) × (∂yt+1/∂ht+1) × (∂ht+1/∂hk) × (∂hk/∂Uss) However How do we calculate ∂ht+1/∂hk ? 49 / 142
  • 95. Images/cinvestav We have a proposal Given the composition of functions ∂ht+1/∂hk = (∂hk+1/∂hk) × (∂hk+2/∂hk+1) × · · · × (∂ht+1/∂ht) Here, we know that ∂hi+1/∂hi = (∂hi+1/∂nets) × (∂nets/∂hi) 50 / 142
  • 97. Images/cinvestav We have that We have, given hi+1 = σh (Wsdxi + Usshi) and neth = Wsdxi + Usshi ∂hi+1/∂nets = diag (σ′h (neth1) , σ′h (neth2) , . . . , σ′h (neths)) = Di+1 Finally, we have that ∂nets/∂hi = Uss 51 / 142
  • 99. Images/cinvestav Then We can aggregate over all the time steps ∂L/∂Uss = Σ_t Σ_{k=1}^{t} (∂L (t + 1) /∂yt+1) × (∂yt+1/∂ht+1) × (∂ht+1/∂hk) × (∂hk/∂Uss) Now, we need to derive L with respect to Wsd ∂L (t + 1) /∂Wsd = (∂L (t + 1) /∂ht+1) × (∂ht+1/∂Wsd) 52 / 142
  • 101. Images/cinvestav Now Because ht+1 depends on both ht and xt+1, we need to back-propagate to ht ∂L (t + 1) /∂Wsd = (∂L (t + 1) /∂ht+1) × (∂ht+1/∂Wsd) + (∂L (t + 1) /∂ht) × (∂ht/∂Wsd) = (∂L (t + 1) /∂ht+1) × (∂ht+1/∂Wsd) + (∂L (t + 1) /∂ht+1) × (∂ht+1/∂ht) × (∂ht/∂Wsd) Then summing over all the contributions from t to 0 ∂L (t + 1) /∂Wsd = Σ_{k=1}^{t+1} (∂L (t + 1) /∂yt+1) × (∂yt+1/∂ht+1) × (∂ht+1/∂hk) × (∂hk/∂Wsd) Finally, summing over all the time ∂L/∂Wsd = Σ_t Σ_{k=1}^{t+1} (∂L (t + 1) /∂yt+1) × (∂yt+1/∂ht+1) × (∂ht+1/∂hk) × (∂hk/∂Wsd) 53 / 142
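  Putting the three gradients together, here is a compact BPTT sketch for the bias-free network above, assuming tanh for both σh and σy and the quadratic loss; the shapes (W: s×d, U: s×s, V: o×s) are assumptions chosen for the example.
```python
import numpy as np

def bptt(x, z, W, U, V):
    """Forward pass and full BPTT gradients for
    h_t = tanh(W x_t + U h_{t-1}),  y_t = tanh(V h_t),  L = 1/2 sum_t ||y_t - z_t||^2."""
    T, s = len(x), U.shape[0]
    h = np.zeros((T + 1, s))                 # h[0] is the initial hidden state
    y = np.zeros((T, V.shape[0]))
    for t in range(T):                       # forward pass
        h[t + 1] = np.tanh(W @ x[t] + U @ h[t])
        y[t] = np.tanh(V @ h[t + 1])
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    dh_next = np.zeros(s)                    # gradient arriving from later time steps
    for t in reversed(range(T)):             # backward pass through time
        dnet_o = (y[t] - z[t]) * (1.0 - y[t] ** 2)   # through the output tanh
        dV += np.outer(dnet_o, h[t + 1])
        dh = V.T @ dnet_o + dh_next                  # output path + recurrent path
        dnet_h = dh * (1.0 - h[t + 1] ** 2)          # through the hidden tanh
        dW += np.outer(dnet_h, x[t])
        dU += np.outer(dnet_h, h[t])
        dh_next = U.T @ dnet_h                       # send gradient back to h_t
    return dW, dU, dV
```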
  • 105. Images/cinvestav Vanishing Gradients We have a problem (∂hk+1/∂hk) × (∂hk+2/∂hk+1) × · · · × (∂ht+1/∂ht) You finish with a vanishing gradient using σ (x) = 1/ (1 + exp {−x}) This is problematic!!! 55 / 142
  • 107. Images/cinvestav Given Given the commutativity of the product You could put together the derivatives of the sigmoids f (x) = ds (x) /dx = e^{−x}/ (1 + e^{−x})² Therefore, deriving again df (x) /dx = − e^{−x}/ (1 + e^{−x})² + 2 (e^{−x})²/ (1 + e^{−x})³ After making df (x) /dx = 0 We have the maximum is at x = 0 56 / 142
  • 110. Images/cinvestav Therefore The maximum for the derivative of the sigmoid f (0) = 0.25 Therefore, Given a Deep Network We could finish with lim_{k→∞} (ds (x) /dx)^k ≤ lim_{k→∞} (0.25)^k → 0 A Vanishing Derivative or Vanishing Gradient Making it quite difficult to train a deeper network using this activation function for Deep Learning and even in Shallow Learning 57 / 142
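  A one-line numeric illustration of how fast a product of sigmoid derivatives (each at most 0.25) collapses:
```python
for k in (1, 5, 10, 20, 50):
    print(k, 0.25 ** k)   # 0.25, ~9.8e-04, ~9.5e-07, ~9.1e-13, ~7.9e-31
```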
  • 113. Images/cinvestav For the case of vanishing gradient, we have that Rearranging terms in (∂hk+1/∂hk) × (∂hk+2/∂hk+1) × · · · × (∂ht+1/∂ht) We have ( Π_{k=0}^{T} ∂hk+1/∂nets ) [Uss]^{T+1} Then, given the sigmoid Π_{k=0}^{T} ∂hk+1/∂nets = diag ( Π_{k=0}^{T} σ′h (net^k_{h1}) , Π_{k=0}^{T} σ′h (net^k_{h2}) , . . . , Π_{k=0}^{T} σ′h (net^k_{hs}) ) 58 / 142
  • 115. Images/cinvestav It is clear That you have the phenomenon of the vanishing gradient Do we have a way of fixing this? Yes The use of new activation functions. 59 / 142
  • 118. Images/cinvestav Thus The need to introduce a new function f (x) = x+ = max (0, x) It is called ReLu or Rectifier With a smooth approximation (Softplus function) f (x) = ln (1 + e^{kx}) /k 61 / 142
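  A minimal sketch of the two functions; the slope parameter k is the one above, and np.logaddexp is used only to keep the softplus numerically stable for large kx.
```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x, k=1.0):
    # ln(1 + e^{kx}) / k, computed as log(exp(0) + exp(kx)) / k
    return np.logaddexp(0.0, k * x) / k
```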
  • 121. Images/cinvestav However Here the gradient can explode Thus, the need to control the gradient... Therefore, we will use the following analysis [6] “The Emergence of Spectral Universality in Deep Networks” by Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli 63 / 142
  • 123. Images/cinvestav We have The following dynamic ht = σh (st) , st = Wsdxt + Ussht−1 + bh Then, we have the following Jacobian J = ∂hT /∂h0 = Π_{t=1}^{L} DtUSS Where as we saw it Dt is a diagonal matrix This Jacobian J is a matrix of dimension s × s therefore, if it is well conditioned you are not sending the projection to lower dimensionality. 64 / 142
  • 126. Images/cinvestav A Trick An RNN can be seen as a deep neural network (figure: the unrolled recurrence viewed as a deep feed-forward stack) 65 / 142
  • 127. Images/cinvestav Remember the structure of the layer The following dynamic ht = σh (st) , st = Wsdxt + Ussht−1 + bh Therefore, we have that sit = Σ_j Wij xtj + Σ_k Uik h(t−1)k + bi We assume the following about the temporal layer weights [Uss, Wsd] ∼ N (0, ρ²w/N) , bh ∼ N (0, ρ²b) Here N = s, the state dimension. 66 / 142
  • 130. Images/cinvestav Now, assume that Now, consider the evolution of a single input through the network xit Since the weights and biases are independent with zero mean E [sit] = 0 The second moment of the Gaussian random variable E [sitsjt] = qt δij 67 / 142
  • 132. Images/cinvestav Where the second moment Of a Gaussian Distribution is ∫_{−∞}^{∞} s² (1/ (√(2π)σ)) exp (− (s − µ)²/ (2σ²)) ds 68 / 142
  • 133. Images/cinvestav Here we have Here q is the variance of the pre-activations in the tth layer due to an input xt qt = (ρ²w/√(2π)) ∫ σ²h (√(qt−1) sit−1) exp (−s²it/2) dsit + ρ²b They describe the pass through the recursion of the RNN For any choice of ρ²w and ρ²b and a bounded φ the previous equation converges to a specific fixed point. This recursion has a fixed point q∗ = (ρ²w/√(2π)) ∫ σ²h (√(q∗) sit−1) exp (−s²it/2) dsit + ρ²b 69 / 142
  • 136. Images/cinvestav A Fixed Point Definition In mathematics, a fixed point of a function is an element of the function’s domain that is mapped to itself by the function. Example 70 / 142
  • 138. Images/cinvestav Therefore We have that If the input x0 is chosen so that q1 = q∗ the dynamics start at the fixed point and the distribution of Dt is independent of t. Not only that: even if q1 ≠ q∗, a few layers are often sufficient to approximately converge to the fixed point. So when t is large So it is a good approximation to assume qt = q∗. 71 / 142
  • 141. Images/cinvestav Additionally The independence of the weights and biases implies The covariance between different pre-activations in the same layer will be given by E [zit;a zjt;b] = qt_ab δij Therefore qt_ab = ρ²w ∫ σh (u1) σh (u2) Dz1Dz2 + ρ²b Where Dz = (1/√(2π)) exp (−z²/2) dz u1 = √(qt−1_aa) z1 u2 = √(qt−1_bb) ( ct−1_ab z1 + √(1 − (ct−1_ab)²) z2 ) ct_ab = qt_ab/√(qt_aa qt_bb) 72 / 142
  • 143. Images/cinvestav Therefore Therefore, we can look at the variance of the Jacobian Matrix elements χ = (1/N) Tr [(DtUSS)^T DtUSS] = ρ²w ∫ (σ′h (√(q∗) sit))² exp (−s²it/2) (dsit/√(2π)) 73 / 142
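  A small Monte Carlo sketch of this quantity, assuming tanh for σh (so σ′h(u) = 1 − tanh²(u)); ρw and q∗ are inputs one would obtain from the recursion above.
```python
import numpy as np

def chi(rho_w, q_star, n=200_000, seed=0):
    """Monte Carlo estimate of chi = rho_w^2 * E[(sigma_h'(sqrt(q*) z))^2], z ~ N(0, 1)."""
    z = np.random.default_rng(seed).standard_normal(n)
    dphi = 1.0 - np.tanh(np.sqrt(q_star) * z) ** 2
    return rho_w ** 2 * np.mean(dphi ** 2)

# chi > 1 -> exploding regime, chi < 1 -> vanishing regime
print(chi(rho_w=1.5, q_star=0.5))
```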
  • 144. Images/cinvestav Then χ (ρw, ρb) It separates the (ρw, ρb) plane into two regions. When χ > 1 Forward signal propagation expands and folds space in a chaotic manner and gradients explode When χ < 1 Forward signal propagation contracts in an ordered manner and gradients exponentially vanish 74 / 142
  • 147. Images/cinvestav These Regions establish the stability of the network We have the following (figure: phase diagram in the (ρw, ρb) plane showing an ORDERED / VANISHING GRADIENT region and a CHAOTIC / EXPLODING GRADIENT region) 75 / 142
  • 148. Images/cinvestav Therefore It is clear that When we choose ρb = ρw we have convergence of the network Having other values It requires a careful choice of the values 76 / 142
  • 151. Images/cinvestav Another Problem Although the Vanishing and Exploding Gradients are a problem for RNN’s If we use the full BPTT We confront limitations on the amount of Memory and Hardware available Thus a popular strategy It is the Truncated BPTT [7, 8] 78 / 142
  • 154. Images/cinvestav Therefore They proposed using a truncation on the BPTT To solve the problem with the Vanishing and Exploding Gradient What is Truncated BPTT? In general, this should be regarded as a heuristic technique for simplifying the computation. Which is a good approximation of the true gradient 79 / 142
  • 156. Images/cinvestav The Algorithm Truncated BPTT 1 for t = 1 to T do: 2 Run the RNN for one step, computing ht and yt 3 if k1 divides t then 4 Run BPTT from t down to t − k2 Something Notable 1 It was first used by Elman [9] 2 Also Mikolov et al. [10] used the TBPTT to train RNN on word-level language modeling. 80 / 142
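  A sketch of this schedule in plain Python; run_step and backprop are placeholders (assumptions) standing for the forward step and the k2-step BPTT window of whatever RNN implementation is used.
```python
def truncated_bptt(run_step, backprop, T, k1, k2):
    """Every k1 steps, backpropagate only k2 steps into the past (a sketch)."""
    for t in range(1, T + 1):
        run_step(t)                        # compute h_t and y_t
        if t % k1 == 0:                    # i.e. k1 divides t
            backprop(max(1, t - k2), t)    # BPTT over the window [t - k2, t]
```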
  • 160. Images/cinvestav Initialization of the Hidden State This is the classic problem in RNN How to initialize the hs hidden state? The main methods are 1 Initialize hs to the zero vector. 2 Adaptive noisy initialization of hs 3 Find the steady state 83 / 142
  • 162. Images/cinvestav The Simplest One We can simply initialize hs To a zero state Quite simple and easy to apply However do we have something better? 84 / 142
  • 164. Images/cinvestav Adaptive noisy initialization It is proposed by Zimmermann et al. [11] They proposed to use the residual error once the back-propagation was done for h0 This is done By disturbing h0 with a noise term Θ which follows the distribution of the residual error. 85 / 142
  • 166. Images/cinvestav Adaptive Noise The network tries to stabilize the output (figure: the initial state 0 perturbed by an additive Noise term) 86 / 142
  • 167. Images/cinvestav Example of these initializations Source https://r2rt.com/non-zero-initial-states-for-recurrent-neural-networks.html 87 / 142
  • 168. Images/cinvestav What about the Weight Parameters? We could simply initialize them to zero Danger Will Robinson!!! A simple example with the following feed-forward architecture w = σ1 (Whix) y = σ2 (Wohw) L = (1/2) [y − z]² 88 / 142
  • 170. Images/cinvestav Therefore We have by back-propagation ∆Who = [σ2 (Wohσ1 (Whix)) − z] σ′2 (Wohσ1 (Whix)) Woh σ′1 (Whix) x Therefore ∆Who = 0 89 / 142
  • 172. Images/cinvestav Therefore Not a good idea What else can we do? We have heuristics such as the Gaussian initialization wij ∼ N (0, σ²) 90 / 142
  • 174. Images/cinvestav Do you remember? We have the following (figure: the same ordered/vanishing vs. chaotic/exploding phase diagram as before) 91 / 142
  • 175. Images/cinvestav Furthermore We have heuristics For Relu — We multiply the randomly generated values of W by: √(2/size_{l−1}) For tanh — The heuristic is called Xavier initialization: √(1/size_{l−1}) Other common one: √(2/ (size_{l−1} + size_l)) 92 / 142
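  A sketch of these fan-in heuristics applied to a Gaussian initialization; the scheme names (He for ReLU, Xavier for tanh, Glorot for the fan-in/fan-out variant) are the usual labels and are used here only as assumptions for the example.
```python
import numpy as np

def init_weights(size_in, size_out, scheme="he", seed=0):
    """Gaussian weights rescaled by a fan-in heuristic (a sketch)."""
    w = np.random.default_rng(seed).standard_normal((size_out, size_in))
    if scheme == "he":        # common for ReLU
        return w * np.sqrt(2.0 / size_in)
    if scheme == "xavier":    # common for tanh
        return w * np.sqrt(1.0 / size_in)
    if scheme == "glorot":    # uses both fan-in and fan-out
        return w * np.sqrt(2.0 / (size_in + size_out))
    raise ValueError(f"unknown scheme: {scheme}")
```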
  • 179. Images/cinvestav History of LSTM They were introduced in 1997 LSTM was proposed by Sepp Hochreiter and Jürgen Schmidhuber [12] An attempt to deal with the vanishing and exploding gradient By introducing Constant Error Carousel (CEC) units Properties In 1999, Felix Gers, his advisor Jürgen Schmidhuber and Fred Cummins introduced the forget gate (also called “keep gate”) into the LSTM architecture. It enables the LSTM to reset its own state 94 / 142
  • 182. Images/cinvestav Long Short Term Memory (LSTM) We have the following Architecture (⊙ is the component-wise product) ft = σ [Wf [ht−1, xt] + bf ] (Forget Gate) it = σ [Wi [ht−1, xt] + bi] (Input/Update Gate) ot = σ [Wo [ht−1, xt] + bo] (Output Gate) ĉt = tanh [Wc [ht−1, xt] + bc] (Intermediate Cell Gate) ct = ft ⊙ ct−1 + it ⊙ ĉt (Cell State Gate) ht = ot ⊙ tanh (ct) (Hidden State) Where σ is a sigmoid function. 95 / 142
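  A minimal NumPy sketch of one LSTM step following the gate equations above; each weight matrix acts on the concatenation [ht−1, xt], and the shapes are assumptions chosen for the example.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step (a sketch of the equations above)."""
    v = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f = sigmoid(Wf @ v + bf)            # forget gate
    i = sigmoid(Wi @ v + bi)            # input/update gate
    o = sigmoid(Wo @ v + bo)            # output gate
    c_hat = np.tanh(Wc @ v + bc)        # intermediate cell candidate
    c = f * c_prev + i * c_hat          # cell state (elementwise products)
    h = o * np.tanh(c)                  # hidden state
    return h, c
```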
  • 184. Images/cinvestav Here the interesting part In the RNN ht = σh (Wsdxt + Ussht−1) But Here ct = ft ⊙ ct−1 + it ⊙ ĉt (Cell State Gate) ht = ot ⊙ tanh (ct) You need the forget term, the input term and the intermediate cell To update the state 97 / 142
  • 187. Images/cinvestav You can see Something Notable The cell keeps track of the dependencies between the elements in the input sequence and the state The input gate It is in charge of how much of the input flows into the cell gate it = σ [Wi [ht−1, xt] + bi] 98 / 142
  • 189. Images/cinvestav What is the meaning? We have that The sigmoid layer decides what values to update They impact the term it ⊙ ĉt Making it possible to decide how to control the intermediate cell values 99 / 142
  • 191. Images/cinvestav Now The forget gate How much of the previous cell value remains in the cell at time t ft = σ [Wf [ht−1, xt] + bf ] Actually It uses the previous state and the input Then the sigmoid can be interpreted as Sigmoid: values between 0 and 1 – “completely forget” vs. “completely keep” 100 / 142
  • 194. Images/cinvestav Furthermore The output gate It controls the extent to which the value in the cell is used to compute the actual state Which impacts the term ft ct−1 Based on the previous cell state Thus a type of control Between the previous cell state and the new cell state 101 / 142
  • 197. Images/cinvestav Finally We have the update of the cell as ct = ft ⊙ ct−1 + it ⊙ ĉt Basically Apply the forget operation to the previous internal cell state. Add the new candidate values, scaled by how much we decided to update We can see it as, e.g., in a language model: drop the old information about the subject’s gender and add the new one. 102 / 142
  • 200. Images/cinvestav Thus at the output layer and update state We have ot = σ [Wo [ht−1, xt] + bo] (Output Gate) ht = ot ⊙ tanh (ct) (Hidden State) Therefore, we have that Sigmoid layer: it decides, from a linear combination of state and input, what to output Additionally, the tanh squashes the values between -1 and 1 The output gate is used to filter a version of the cell state!!! 103 / 142
  • 203. Images/cinvestav Something nice about LSTM Quite nice Backpropagation from ct to ct−1 requires only elementwise multiplication! 104 / 142
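A short derivation of that claim, written here as a sketch (it ignores the indirect dependence of the gates on ht−1): since ct = ft ⊙ ct−1 + it ⊙ ĉt, the direct Jacobian of ct with respect to ct−1 is diagonal, so the error signal is rescaled elementwise by the forget gate rather than multiplied by a recurrent weight matrix.

\[
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\delta c_{t-1} = f_t \odot \delta c_t
\]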
  • 204. Images/cinvestav LSTM Remarks First It maintains a cell state separate from what is output Second Use gates to control the flow of information Forget gate tries to get rid of irrelevant information Selectively update the cell state Output gate returns a filtered version of the cell state Third Backpropagation from ct to ct−1 requires only elementwise multiplication! 105 / 142
  • 207. Images/cinvestav Achievements LSTM achieved record results in natural language text compression http://www.mattmahoney.net/dc/text.html#1218 Unsegmented connected handwriting recognition Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J. (May 2009). "A Novel Connectionist System for Unconstrained Handwriting Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868 Finally Won the ICDAR handwriting competition (2009) 106 / 142
  • 210. Images/cinvestav Right now Something Notable As of 2016, major technology companies including Google, Apple, and Microsoft were using LSTM as fundamental components in new products. 107 / 142
  • 211. Images/cinvestav Outline 3 Modern Recurrent Architectures — What about Gated Recurrent Units (GRU) units? 108 / 142
  • 212. Images/cinvestav History GRUs were proposed as a simplification of the LSTM In 2014, Kyunghyun Cho et al. put forward a simplified variant called the Gated Recurrent Unit (GRU) Something Notable The GRU is like a long short-term memory (LSTM) with a forget gate... but it has fewer parameters than the LSTM, as it lacks an output gate 109 / 142
  • 214. Images/cinvestav Gated Recurrent Units Architecture (⊙ denotes the component-wise product) zt = σ [Wz [ht−1, xt] + bz] (Update Gate) rt = σ [Wr [ht−1, xt] + br] (Reset Gate) ĥt = tanh [Wo [rt ⊙ ht−1, xt] + bh] ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt 110 / 142
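As with the LSTM above, a minimal NumPy sketch of one GRU forward step may help; it follows the equations on this slide, with the weight names and the concatenation [ht−1, xt] as illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, bz, Wr, br, Wo, bh):
    """One GRU forward step; * is the elementwise product."""
    z = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ z + bz)                                        # update gate
    r_t = sigmoid(Wr @ z + br)                                        # reset gate
    h_hat = np.tanh(Wo @ np.concatenate([r_t * h_prev, x_t]) + bh)    # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_hat                         # new hidden state
```

Note that, unlike the LSTM, there is no separate cell state and no output gate: the new state is a convex combination of the previous state and the candidate ĥt.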
  • 215. Images/cinvestav Graphically, we have the architecture (figure: GRU Architecture) 111 / 142
  • 216. Images/cinvestav Main Observations There is a gate used to combine with the previous state ht−1: The zt gate, which uses the information of the input and the previous state to decide how to update ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt The intermediate step ĥt A bounded version of the possible state ht 112 / 142
  • 218. Images/cinvestav Next We have a reset gate rt = σ [Wr [ht−1, xt] + br] Used to update ĥt = tanh [Wo [rt ⊙ ht−1, xt] + bh] 113 / 142
  • 219. Images/cinvestav However It has been shown that As shown by Gail Weiss, Yoav Goldberg, and Eran Yahav, the LSTM is "strictly stronger" than the GRU The LSTM can perform unbounded counting [13], while the GRU cannot: the LSTM can simulate a counter machine, a model used in theoretical computer science Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le of Google Brain found that LSTM cells consistently outperform GRU cells in "the first large-scale analysis of architecture variations for Neural Machine Translation.” 114 / 142
  • 222. Images/cinvestav Outline 4 Deeper Architectures with RNN’s — Introduction 115 / 142
  • 223. Images/cinvestav Given that we want to do sequence modeling (figure: examples) Stock options; Predict the next phrase – Question: “If I am a man?” Prediction: “you are homo sapiens” 116 / 142
  • 225. Images/cinvestav What do we have in these sequences of data? Sequences have different lengths We need to handle variable-length sequences 117 / 142
  • 226. Images/cinvestav Furthermore We need to track long-term dependencies 118 / 142
  • 227. Images/cinvestav Not only that Maintain information about order “We have a mother living in Yucatan, Mexico” Share parameters across the sequence Do you remember the state ht? 119 / 142
  • 229. Images/cinvestav However There is a need to increase their power Given the amounts of data we have right now Then there is a tendency to start using Recurrent Neural Networks As cells to be stacked into bigger systems [14, 15] This is based on the following idea [16] Hypothesis: a hierarchical model can be exponentially more efficient at representing some functions than a shallow one. 120 / 142
  • 232. Images/cinvestav In the case of RNN’s Certain Transitions are not Deep They are only the result of a linear projection followed by an element-wise nonlinearity. They are Hidden-to-hidden ht−1 → ht Hidden-to-output ht → yt Input-to-hidden xt → ht Meaning They are all shallow in the sense that there exists no intermediate, nonlinear hidden layer. 121 / 142
  • 235. Images/cinvestav Outline 4 Deeper Architectures with RNN’s — Deep Architectures for Better Learning 122 / 142
  • 236. Images/cinvestav Bengio et al. [17] Gave the following Hypothesis Sampling algorithms (Markov Chains and MCMC techniques) suffer from a fundamental problem Given unconnected or weakly connected regions of a distribution It is difficult for the Markov chain to jump from one mode of the distribution to another when these are separated by large low-density regions This means that we have a slow mixing of samples In order to represent distributions 123 / 142
  • 239. Images/cinvestav Example A Big Problem (figure: high-density regions separated by regions of low density) 124 / 142
  • 240. Images/cinvestav The Main Problem We have that Slow mixing means that many consecutive samples tend to be correlated They belong to the same mode of the mixture Why? Jumping between modes in the MCMC method is quite slow and rare 125 / 142
  • 242. Images/cinvestav Implications in Learning Algorithms Given that some form of sampling is at the core of many learning algorithms For example, to estimate the log-likelihood gradient Therefore, at the beginning of learning Mixing is initially easy However, as the model improves, its corresponding distribution sharpens and mixing becomes slower 126 / 142
  • 245. Images/cinvestav Basically We have slowdowns in learning with shallow models (figure: learning area vs. no-learning area) 127 / 142
  • 246. Images/cinvestav Outline 4 Deeper Architectures with RNN’s — Deep Input-to-Hidden Function 128 / 142
  • 247. Images/cinvestav Therefore We need to build deeper structures to reach more capabilities For example, the vector representation of documents Here an extra layer can be used to build the representation For example, Mikolov et al. [18] 129 / 142
  • 249. Images/cinvestav Basically a shallow network before the main architecture (figure: an Encoder Layer before a GRU) 130 / 142
  • 250. Images/cinvestav The equations They will look like zt = σ [Wz [ht−1, xt] + bz] (Update Gate) rt = σ [Wr [ht−1, xt] + br] (Reset Gate) ĥt = tanh [Wo [rt ⊙ ht−1, xt] + bh] ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt with the encoder layers x = σ (Woh y) , y = σ (Whi w) 131 / 142
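A hedged sketch of this deep input-to-hidden function: a small feed-forward encoder maps the raw input w (for instance a word vector) through y to x, and x is then fed to the GRU update (gru_step as sketched earlier). The names Whi and Woh follow the slide; the sizes and the wiring are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode_input(w, Whi, Woh):
    """Shallow encoder placed before the recurrent cell: w -> y -> x."""
    y = sigmoid(Whi @ w)      # first encoding layer
    x = sigmoid(Woh @ y)      # second encoding layer, used as x_t by the GRU
    return x

# Assumed usage with the earlier GRU sketch:
# h_t = gru_step(encode_input(w_t, Whi, Woh), h_prev, Wz, bz, Wr, br, Wo, bh)
```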
  • 251. Images/cinvestav Outline 4 Deeper Architectures with RNN’s — Deep Transition Architectures 132 / 142
  • 252. Images/cinvestav Deep Transition Architectures In a deep transition RNN (DT-RNN) At each time step the next state is computed by the sequential application of multiple transition layers. For example, in the Nematus system [19] They use GRU transition blocks with independent trainable parameters With a Caveat The hidden state output of each block is used as the input state of the next one 133 / 142
  • 255. Images/cinvestav For example, at the encoder phase For the ith source word in the forward direction, we have hi = hi,Ls hi,1 = GRU1 (xi, hi−1,Ls) hi,k = GRUk (0, hi,k−1) for 1 < k ≤ Ls The word sequence is also processed in reverse to obtain a backward state, then C ≡ [−→h i,Ls , ←−h i,Ls] 134 / 142
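The following is a hedged sketch of one forward-direction encoder step under the equations above: the first GRU block consumes the word representation xi and the top state from the previous position, and each subsequent block refines the state with a zero input. `gru_blocks` is assumed to be a list of Ls independently parameterized callables with signature gru_block(x, h) -> h (e.g., closures over gru_step above); equal input sizes across blocks are assumed for simplicity.

```python
import numpy as np

def deep_transition_encoder_step(x_i, h_prev_top, gru_blocks):
    """One step of a deep-transition encoder (forward direction)."""
    h = gru_blocks[0](x_i, h_prev_top)          # h_{i,1} = GRU_1(x_i, h_{i-1,L_s})
    for gru_k in gru_blocks[1:]:                # k = 2, ..., L_s
        h = gru_k(np.zeros_like(x_i), h)        # h_{i,k} = GRU_k(0, h_{i,k-1})
    return h                                    # h_i = h_{i,L_s}
```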
  • 257. Images/cinvestav Then The decoder phase uses the outputs from the previous GRU and something called attention (We will look at this later) sj,1 = GRU1 (yj−1, sj−1,Lt) sj,2 = GRU2 (ATT, sj,1) sj,k = GRUk (0, sj,k−1) for 2 < k ≤ Lt Then, the target word state sj ≡ sj,Lt It is used by a feed-forward neural network to predict the current target word 135 / 142
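And a matching hedged sketch of one decoder step; `att` is a placeholder for the attention-based context discussed later, and the block list, names, and shapes are again illustrative assumptions rather than the Nematus implementation.

```python
import numpy as np

def deep_transition_decoder_step(y_prev, s_prev_top, att, gru_blocks):
    """One step of a deep-transition decoder; returns s_j = s_{j,L_t}."""
    s = gru_blocks[0](y_prev, s_prev_top)       # s_{j,1} = GRU_1(y_{j-1}, s_{j-1,L_t})
    s = gru_blocks[1](att, s)                   # s_{j,2} = GRU_2(ATT, s_{j,1})
    for gru_k in gru_blocks[2:]:                # k = 3, ..., L_t
        s = gru_k(np.zeros_like(y_prev), s)     # s_{j,k} = GRU_k(0, s_{j,k-1})
    return s                                    # fed to a feed-forward net to predict the target word
```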
  • 259. Images/cinvestav Deep Transition Decoder We have the following depiction of the architecture 136 / 142
  • 260. Images/cinvestav Outline 4 Deeper Architectures with RNN’s — Conclusions 137 / 142
  • 261. Images/cinvestav There are many other examples Basically We are far from the classic methods such as 1 Autoregressive moving average (ARMA) 2 Autoregressive integrated moving average (ARIMA) 3 etc. These RNN architectures are taking the prediction of time series To another level!!! 138 / 142
  • 263. Images/cinvestav O. L. R. Jacobs, “Introduction to control theory,” 1974. A. Robinson and F. Fallside, The utility driven dynamic error propagation network. University of Cambridge Department of Engineering, 1987. P. J. Werbos et al., “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990. G. Chen, “A gentle tutorial of recurrent neural network with error backpropagation,” arXiv preprint arXiv:1610.02583, 2016. B. W. Bader and T. G. Kolda, “Algorithm 862: Matlab tensor classes for fast algorithm prototyping,” ACM Trans. Math. Softw., vol. 32, Dec. 2006. J. Pennington, S. S. Schoenholz, and S. Ganguli, “The emergence of spectral universality in deep networks,” arXiv preprint arXiv:1802.09979, 2018. 139 / 142
  • 264. Images/cinvestav R. J. Williams and D. Zipser, “Gradient-based learning algorithms for recurrent,” Backpropagation: Theory, architectures, and applications, vol. 433, 1995. R. J. Williams and J. Peng, “An efficient gradient-based algorithm for on-line training of recurrent network trajectories,” Neural computation, vol. 2, no. 4, pp. 490–501, 1990. J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990. T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Eleventh annual conference of the international speech communication association, 2010. H.-G. Zimmermann, C. Tietz, and R. Grothmann, “Forecasting with recurrent neural networks: 12 tricks,” in Neural Networks: Tricks of the Trade, pp. 687–707, Springer, 2012. 140 / 142
  • 265. Images/cinvestav S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. G. Weiss, Y. Goldberg, and E. Yahav, “On the practical computational power of finite precision rnns for language recognition,” CoRR, vol. abs/1805.04908, 2018. R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013. A. V. M. Barone, J. Helcl, R. Sennrich, B. Haddow, and A. Birch, “Deep architectures for neural machine translation,” arXiv preprint arXiv:1707.07631, 2017. Y. Bengio et al., “Learning deep architectures for ai,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, “Better mixing via deep representations,” in International conference on machine learning, pp. 552–560, 2013. 141 / 142
  • 266. Images/cinvestav T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013. R. Sennrich, O. Firat, K. Cho, A. Birch, B. Haddow, J. Hitschler, M. Junczys-Dowmunt, S. Läubli, A. V. M. Barone, J. Mokry, et al., “Nematus: a toolkit for neural machine translation,” arXiv preprint arXiv:1703.04357, 2017. 142 / 142