Introduction to Neural Networks and Deep Learning
Recurrent Neural Networks
Andres Mendez-Vazquez
November 13, 2019
Outline
1 Introduction
  History
  State-Space Model
  Back to the RNN Equations
  Introducing the Cost Function
  Other Cost Functions
2 Training a Vanilla RNN Model
  The Final RNN Model
  Back-Propagation Through Time (BPTT)
  Vanishing and Exploding Gradients
  Fixing the Problem, the ReLU Function
  The Analysis of the Exploding and Vanishing Gradient
  Truncated BPTT
  Initialization
  Hidden State
3 Modern Recurrent Architectures
  Now, Long Short Term Memory (LSTM)
  What about Gated Recurrent Units (GRU)?
4 Deeper Architectures with RNNs
  Introduction
  Deep Architectures for Better Learning
  Deep Input-to-Hidden Function
  Deep Transition Architectures
Conclusions
In 1987, Robinson and Fallside [2]
At the Cambridge University Engineering Department
They proposed a new type of neural network based on Linear Control Theory.
They built on the work of Jacobs (1974) on dynamic nets [1]:

s_{t+1} = A s_t + B x_t
y_t = C s_t
Furthermore
Jordan proposed a simple recurrent network:

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1} + b_h)
y_t = σ_s(V_{os} h_t + b_o)

Where
1 x_t is an input of dimension d.
2 h_t is a hidden state layer of dimension h.
3 y_t is the output vector of dimension s.
4 W, U, V, b_h and b_o are parameter matrices and vectors.
5 σ_h and σ_s are activation functions.
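A minimal NumPy sketch of one step of this recurrence may help make the notation concrete; the tanh choice for σ_h, the identity output activation and the toy dimensions below are illustrative assumptions, not part of the slides.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, V, b_h, b_o):
    """One step of the simple recurrent network above:
    h_t = sigma_h(W x_t + U h_{t-1} + b_h), y_t = sigma_s(V h_t + b_o).
    Here sigma_h is assumed to be tanh and sigma_s the identity."""
    h_t = np.tanh(W @ x_t + U @ h_prev + b_h)
    y_t = V @ h_t + b_o
    return h_t, y_t

# Toy dimensions (illustrative): d = 3 inputs, 4 hidden units, s = 2 outputs
rng = np.random.default_rng(0)
d, n_h, s = 3, 4, 2
W = rng.normal(size=(n_h, d))
U = rng.normal(size=(n_h, n_h))
V = rng.normal(size=(s, n_h))
b_h, b_o = np.zeros(n_h), np.zeros(s)

h_t = np.zeros(n_h)                       # initial hidden state
for x_t in rng.normal(size=(5, d)):       # a sequence of 5 input vectors
    h_t, y_t = rnn_step(x_t, h_t, W, U, V, b_h, b_o)
```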
What were they used for?
Robinson and Fallside
As with Hidden Markov Models, they were proposed for Speech Coding.
They proposed the following architecture.
Based on the State-Space Model
Basically, a linear system
Based on a state-determined system model.
Definition
A mathematical description of the system in terms of a minimum set of variables x_i(t), i = 1, ..., n, together with knowledge of those variables at an initial time t_0 and the system inputs for time t ≥ t_0, is sufficient to predict the future system state and outputs for all time t > t_0.
Therefore
We have the system as a block, described by its state variables.
This can be expressed as a set of state equations:

ṡ_1 = f_1(x, s, t)
ṡ_2 = f_2(x, s, t)
...
ṡ_n = f_n(x, s, t)
Using Vector Notation
Assuming that we have a linear, time-invariant system
Time-Invariant: x(t + δ) maps directly to y(t + δ), for example

α x(t + δ) + β = y(t + δ)

Therefore, using this idea:

ṡ_i = a_{i1} x_1(t) + ... + a_{id} x_d(t) + b_{i1} s_1(t) + ... + b_{in} s_n(t)

Or in matrix form:

y(t) = A x(t) + B s(t)
Then, the discretized version
We introduce an update for the state part:

y(t) = A x(t) + B s(t)
ṡ(t) = C s(t)

Or our discrete step equations:

y(t) = A x(t) + B s(t)
s(t + 1) = C s(t)
The Elman Network
In Elman's equations

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1} + b_h)
y_t = σ_y(V_{os} h_t + b_y)

We notice something different from the linear recurrent system:
The use of activation functions to introduce non-linearity.
Explanation
We have the following
1 The input x_t is encoded by W_{sd}:
  W_{sd} x_t
2 A state is generated by using the encoded input plus the previous state h_{t−1}:
  h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1} + b_h)
3 The output is generated using the new state h_t:
  y_t = σ_y(V_{os} h_t + b_y)
We need to introduce the concept of a cost function
As always
It needs to comply with two properties.
First, the cost function L must be able to be written as an average

L = (1/N) Σ_{x∈X} C_x

over the individual cost functions C_x.
This allows us to apply different optimization techniques such as
Mini-batch Gradient Descent
Stochastic Gradient Descent
etc.
Furthermore
Non-dependency
The cost function L must not depend on any activation values of the neural network besides the output values.
If we cannot assure this
Backpropagation becomes too unstable or too complex to solve. For example,

L = (1/N) Σ_{t=0}^{N} [y_t + h_t − z_t]^2

gives two entry points into the network.
A List of Cost Functions
The Average Quadratic Cost

L = (1/N) Σ_{t=0}^{N} [y_t − z_t]^2

where y_t is the output of the network and z_t is the ground truth of the output.
Here, we are interpolating functions.
[Figure: a real function vs. its interpolated version]
Cross-Entropy Cost
First, the Loss Function

L = − Σ_{i=1}^{C} z_i log(y_i)

where y_i is the output and z_i is the ground truth for the class estimation.
Why z_i log(y_i)?
We can imagine a sequence of class probabilities y_1, y_2, ..., y_m and the likelihood of the data under the model

P[data|model] = y_1^{k_1} y_2^{k_2} · · · y_m^{k_m}
Then
Taking the logarithm and multiplying by −1:

− log P[data|model] = − Σ_{i=1}^{C} k_i log y_i

Then, dividing by the total number of samples:

− (1/N) log P[data|model] = − Σ_{i=1}^{C} (k_i/N) log y_i = − Σ_{i=1}^{C} z_i log y_i
Then
In information theory, the Kraft–McMillan theorem
It establishes that any directly decodable coding scheme for coding a message to identify one value x_i ∈ {x_1, x_2, ..., x_n}
can be seen as representing an implicit probability distribution over {x_1, x_2, ..., x_n}:

q(x_i) = (1/2)^{l_i}

where l_i is the length of the code for x_i.
Now
We have that
Cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q is assumed while the data actually follows a distribution p.
The expected message-length under the true distribution p is

E_p[l] = −E_p[ln q(x) / ln 2] = −E_p[log_2 q(x)] = − Σ_{x_i} p(x_i) log_2 q(x_i) = H(p, q)
Special Case
A special case is the binary class problem, C = 2
Based on the fact that z_1 + z_2 = 1 and y_1 + y_2 = 1:

L = − Σ_{i=1}^{2} z_i log(y_i) = −z_1 log(y_1) − (1 − z_1) log(1 − y_1)

A problem with this
It could be possible to have y_1 = 0.
Another Interpretation
The loss can be expressed as

L = − log(f(y_1))        if z_1 = 1
L = − log(1 − f(y_1))    if z_1 = 0

Where z_1 = 1
It means that the class C_1 is the positive class for this sample.
The Gradient of the Binary Cross-Entropy
We take the derivative with respect to y_1:

∂L/∂y_1 = z_1 (f(y_1) − 1) + (1 − z_1) f(y_1)
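As a quick sanity check of this expression, the sketch below compares it with a numerical derivative, assuming f is the sigmoid applied to the score y_1 (an assumption consistent with the binary setting above).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def binary_ce(y1, z1):
    f = sigmoid(y1)
    return -z1 * np.log(f) - (1.0 - z1) * np.log(1.0 - f)

y1, z1, eps = 0.7, 1.0, 1e-6
analytic = z1 * (sigmoid(y1) - 1.0) + (1.0 - z1) * sigmoid(y1)
numeric = (binary_ce(y1 + eps, z1) - binary_ce(y1 - eps, z1)) / (2 * eps)
print(analytic, numeric)   # both approximately -0.332
```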
In the case of the Multiclass Problem
We use two things; first, a softmax:

f(y_i) = exp{y_i} / Σ_{j=1}^{C} exp{y_j}

As in the multiclass case for Linear Models
The labels are one-hot, so only the positive class C_p keeps its term in the loss.
Therefore
There is only one element of the target vector z that is not zero, z_i = z_p.
We can then simplify
The cost function becomes

L = − Σ_{i=1}^{C} z_i log(f(y_i)) = − log( exp{y_p} / Σ_{j=1}^{C} exp{y_j} )
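A short sketch of this softmax cross-entropy; the max-subtraction is a standard numerical-stability trick, not something stated on the slides.

```python
import numpy as np

def softmax_cross_entropy(y, p):
    """y: vector of class scores; p: index of the positive class.
    Returns L = -log(exp(y_p) / sum_j exp(y_j))."""
    y_shifted = y - np.max(y)                          # stability trick
    log_probs = y_shifted - np.log(np.sum(np.exp(y_shifted)))
    return -log_probs[p]

y = np.array([2.0, 1.0, 0.1])
print(softmax_cross_entropy(y, p=0))                   # ~0.417 for true class 0
```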
Other Cost Functions
Exponential Cost with hyper-parameter τ:

L = τ exp( (1/τ) Σ_{i=1}^{N} (y_i − z_i)^2 )

Hellinger Distance:

L = (1/2) Σ_{i=1}^{N} (√y_i − √z_i)^2

Here the values need to be in the interval [0, 1].
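A sketch of these two costs, assuming y and z are NumPy arrays and, for the Hellinger case, that their entries lie in [0, 1]; the 1/2 factor follows the formula above.

```python
import numpy as np

def exponential_cost(y, z, tau):
    """Exponential cost with hyper-parameter tau."""
    return tau * np.exp(np.sum((y - z) ** 2) / tau)

def hellinger_cost(y, z):
    """Hellinger-distance-style cost; entries of y and z must be in [0, 1]."""
    return 0.5 * np.sum((np.sqrt(y) - np.sqrt(z)) ** 2)

y = np.array([0.2, 0.5, 0.3])
z = np.array([0.1, 0.6, 0.3])
print(exponential_cost(y, z, tau=1.0), hellinger_cost(y, z))
```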
We have the following
Architecture with Quadratic Error

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1} + b_h)
y_t = σ_y(V_{os} h_t + b_y)
L = (1/2) Σ_{t=0}^{N} [y_t − z_t]^2

Something Notable
How do we train something with a recurrence forcing a dependence over time?
Now, given the dependency over time
We can use the classic unfolding of the network [3, 4], assuming that W, U, V, b_h and b_o do not change under the unfolding.
Unfolding?
Assume that there are no bias-correcting terms; only W, U and V.
Then
Given an observation sequence x = {x_1, x_2, ..., x_T }, where x_i ∈ R^d, and the corresponding labels z = {z_1, z_2, ..., z_T }
We remove the bias to simplify our derivations:

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1})
y_t = σ_y(V_{os} h_t)
L = (1/2) Σ_{t=0}^{T} [z_t − y_t]^2
Now, the term ∂net_o/∂V_{os}
First, we have a component i:

net_{o_i} = Σ_{j=1}^{s} V_{ij} h_j

What happens when we differentiate with respect to the matrix?

∂net_o/∂V_{os} =
[ ∂net_o/∂V_{11}   ∂net_o/∂V_{12}   · · ·   ∂net_o/∂V_{1s} ]
[ ∂net_o/∂V_{21}   ∂net_o/∂V_{22}   · · ·   ∂net_o/∂V_{2s} ]
[      ...              ...         · · ·        ...       ]
[ ∂net_o/∂V_{o1}   ∂net_o/∂V_{o2}   · · ·   ∂net_o/∂V_{os} ]

Actually
A tensor with three dimensions...
But something quite nice
Each of the components of net_o has the previous structure:

net_{o_i} = Σ_{j=1}^{s} V_{ij} h_j

Then, if V_{kl} does not intervene in it (k ≠ i):

∂net_{o_i}/∂V_{kl} = 0

Additionally, if it intervenes:

∂net_{o_i}/∂V_{ij} = h_j
Therefore
It is possible to collapse the tensor into a 2D matrix
Given that the other information is redundant, we can rewrite the tensor as

F_{ijk} = ∂net_{o_i}/∂V_{jk}

Then, we have that

F_{iji} = G_{ij} ⇐ Better storage!!!
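In practice this collapse is just the observation that, for net_o = V h, the only non-zero derivatives are ∂net_{o_i}/∂V_{ij} = h_j. A quick numerical confirmation (with arbitrary toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
o, s = 3, 4
V = rng.normal(size=(o, s))
h = rng.normal(size=s)

net_o = V @ h
eps = 1e-6
i, j = 1, 2
V_pert = V.copy()
V_pert[i, j] += eps
numeric = (V_pert @ h - net_o)[i] / eps   # d net_{o_i} / d V_{ij}
print(numeric, h[j])                       # both equal h_j (up to eps)
```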
Therefore, given that a matrix is also a tensor
We have two tensors, net^{o×o} and F^{o×s×o} [5]
We will use the contracted product of two tensors, which is a generalization of the tensor-vector and tensor-matrix multiplications.
Definition
Given two tensors A^{o×o} and B^{o×s×o}:

⟨A, B⟩(k, j) = Σ_{i=1}^{o} A_{i,k} G_{i,j} = A_{k,k} G_{k,j} = σ'(net_{o_k}) h_j
Now, the term ∂L/∂U_{ss}
Assuming our change in time step t → t + 1 and given

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1})

we have

∂L(t + 1)/∂U_{ss} = ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂U_{ss}

Therefore
We can think of this as a Markovian Backpropagation.
What if we go further
From t − 1 → t + 1:

∂L(t + 1)/∂U_{ss} = ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_t × ∂h_t/∂U_{ss}

Now, the trick: if we consider all the possible derivatives from t down to 0, we have

∂L(t + 1)/∂U_{ss} = Σ_{k=1}^{t} ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_k × ∂h_k/∂U_{ss}

However
How do we calculate ∂h_{t+1}/∂h_k?
We have a proposal
Given the composition of functions:

∂h_{t+1}/∂h_k = ∂h_{k+1}/∂h_k × ∂h_{k+2}/∂h_{k+1} × · · · × ∂h_{t+1}/∂h_t

Here, we know that

∂h_{i+1}/∂h_i = ∂h_{i+1}/∂net_s × ∂net_s/∂h_i
We have that
Given h_{i+1} = σ_h(W_{sd} x_i + U_{ss} h_i) and net_h = W_{sd} x_i + U_{ss} h_i:

∂h_{i+1}/∂net_s = diag( σ'_h(net_{h_1}), σ'_h(net_{h_2}), ..., σ'_h(net_{h_s}) ) = D_{i+1}

Finally, we have that

∂net_s/∂h_i = U_{ss}
Then
We can aggregate over all time steps:

∂L/∂U_{ss} = Σ_t Σ_{k=1}^{t} ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_k × ∂h_k/∂U_{ss}

Now, we need to differentiate L with respect to W_{sd}:

∂L(t + 1)/∂W_{sd} = ∂L(t + 1)/∂h_{t+1} × ∂h_{t+1}/∂W_{sd}
Now
Because h_{t+1} depends on W_{sd} both directly and through h_t, we need to back-propagate to h_t:

∂L(t + 1)/∂W_{sd} = ∂L(t + 1)/∂h_{t+1} × ∂h_{t+1}/∂W_{sd} + ∂L(t + 1)/∂h_t × ∂h_t/∂W_{sd}
                  = ∂L(t + 1)/∂h_{t+1} × ∂h_{t+1}/∂W_{sd} + ∂L(t + 1)/∂h_{t+1} × ∂h_{t+1}/∂h_t × ∂h_t/∂W_{sd}

Then, summing over all the contributions from t down to 0:

∂L(t + 1)/∂W_{sd} = Σ_{k=1}^{t+1} ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_k × ∂h_k/∂W_{sd}

Finally, summing over all time steps:

∂L/∂W_{sd} = Σ_t Σ_{k=1}^{t+1} ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_k × ∂h_k/∂W_{sd}
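A compact sketch of these summed gradients for ∂L/∂U_{ss}, checked against a numerical derivative; the tanh hidden activation, identity output and tiny dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, s, T = 2, 3, 5
W = rng.normal(size=(s, d)) * 0.5
U = rng.normal(size=(s, s)) * 0.5
V = rng.normal(size=(s, s)) * 0.5          # output dimension = s for simplicity
xs = rng.normal(size=(T, d))
zs = rng.normal(size=(T, s))

def forward(W, U, V):
    h = np.zeros(s)
    hs, ys = [h], []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        hs.append(h)
        ys.append(V @ h)
    loss = 0.5 * sum(np.sum((y - z) ** 2) for y, z in zip(ys, zs))
    return hs, ys, loss

def bptt_grad_U(W, U, V):
    hs, ys, _ = forward(W, U, V)
    dU = np.zeros_like(U)
    delta_h = np.zeros(s)                      # dL/dh_t carried backwards in time
    for t in reversed(range(T)):
        delta_h = delta_h + V.T @ (ys[t] - zs[t])     # gradient arriving from y_t
        delta_net = delta_h * (1.0 - hs[t + 1] ** 2)  # through tanh'
        dU += np.outer(delta_net, hs[t])              # contribution to dL/dU
        delta_h = U.T @ delta_net                     # pass back to h_{t-1}
    return dU

dU = bptt_grad_U(W, U, V)
eps, (i, j) = 1e-6, (0, 1)
Up, Um = U.copy(), U.copy()
Up[i, j] += eps
Um[i, j] -= eps
num = (forward(W, Up, V)[2] - forward(W, Um, V)[2]) / (2 * eps)
print(dU[i, j], num)     # the two numbers should agree closely
```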
Vanishing Gradients
We have a problem with the product

∂h_{k+1}/∂h_k × ∂h_{k+2}/∂h_{k+1} × · · · × ∂h_{t+1}/∂h_t

You finish with a vanishing gradient when using σ(x) = 1 / (1 + exp{−x}).
This is problematic!!!
Given
Given the commutativity of the product
You could group together the derivatives of the sigmoids:

f(x) = ds(x)/dx = e^{−x} / (1 + e^{−x})^2

Therefore, differentiating again:

df(x)/dx = − e^{−x} / (1 + e^{−x})^2 + 2(e^{−x})^2 / (1 + e^{−x})^3

After setting df(x)/dx = 0
We find that the maximum is at x = 0.
Therefore
The maximum of the derivative of the sigmoid is

f(0) = 0.25

Therefore, given a Deep Network, we could finish with

lim_{k→∞} (ds(x)/dx)^k = lim_{k→∞} (0.25)^k → 0

A Vanishing Derivative or Vanishing Gradient
Making it quite difficult to train a deeper network using this activation function in Deep Learning, and even in Shallow Learning.
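A two-line numerical illustration of this decay, using the 0.25 bound above:

```python
import numpy as np

ks = np.array([1, 5, 10, 20, 50])
print(0.25 ** ks)   # 2.5e-01, 9.8e-04, 9.5e-07, 9.1e-13, 7.9e-31: the factor collapses
```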
For the case of the vanishing gradient, we have that
Rearranging terms in ∂h_{k+1}/∂h_k × ∂h_{k+2}/∂h_{k+1} × · · · × ∂h_{t+1}/∂h_t
We have

Π_{k=0}^{T} ∂h_{k+1}/∂net_s [U_{ss}]^{T+1}

Then, given the sigmoid,

Π_{k=0}^{T} ∂h_{k+1}/∂net_s = diag( Π_{k=0}^{T} σ'_h(net^k_{h_1}), Π_{k=0}^{T} σ'_h(net^k_{h_2}), ..., Π_{k=0}^{T} σ'_h(net^k_{h_s}) )
It is clear
That you have the phenomenon of the vanishing gradient.
Do we have a way of fixing this?
Yes
The use of new activation functions.
Thus
The need to introduce a new function

f(x) = x^+ = max(0, x)

It is called ReLU or Rectifier.
With a smooth approximation (the Softplus function):

f(x) = ln(1 + e^{kx}) / k
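A direct sketch of both functions; the k parameter controls how closely the softplus hugs the ReLU, and k = 1 is a common default.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x, k=1.0):
    """Smooth approximation: ln(1 + e^{kx}) / k approaches relu(x) as k grows."""
    return np.log1p(np.exp(k * x)) / k

x = np.linspace(-3, 3, 7)
print(relu(x))
print(softplus(x, k=1.0))
print(softplus(x, k=10.0))   # already very close to relu(x)
```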
However
Here the gradient can explode
Thus, the need to control the gradient...
Therefore, we will use the following analysis [6]:
"The Emergence of Spectral Universality in Deep Networks" by Jeffrey Pennington, Samuel S. Schoenholz and Surya Ganguli.
We have
The following dynamics:

h_t = σ_h(s_t),  s_t = W_{sd} x_t + U_{ss} h_{t−1} + b_h

Then, we have the following Jacobian:

J = ∂h_T/∂h_0 = Π_{t=1}^{L} D_t U_{ss}

Where, as we saw, D_t is a diagonal matrix.
This Jacobian J is a matrix of dimension s × s; therefore, if it is well conditioned, you are not sending the projection to a lower dimensionality.
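A rough simulation of how the norm of this Jacobian product behaves for a tanh recurrence with Gaussian U_{ss} of standard deviation ρ_w/√s; the specific ρ_w values and sizes are illustrative assumptions.

```python
import numpy as np

def jacobian_norm(rho_w, s=64, T=50, seed=0):
    """||prod_t D_t U|| for a tanh recurrence driven only by the hidden state."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=rho_w / np.sqrt(s), size=(s, s))
    h = rng.normal(size=s)
    J = np.eye(s)
    for _ in range(T):
        h = np.tanh(U @ h)
        D = np.diag(1.0 - h ** 2)          # tanh'(pre-activation) = 1 - h^2
        J = D @ U @ J
    return np.linalg.norm(J, 2)

for rho_w in (0.5, 1.0, 2.0):
    print(rho_w, jacobian_norm(rho_w))      # small rho_w shrinks it, large rho_w can grow it
```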
Remember the structure of the layer
The following dynamics:

h_t = σ_h(s_t),  s_t = W_{sd} x_t + U_{ss} h_{t−1} + b_h

Therefore, we have that

s_{it} = Σ_j W_{ij} x^t_j + Σ_k U_{ik} h^{t−1}_k + b_i

We assume the following about the temporal layer weights:

[U_{ss}, W_{sd}] ∼ N(0, ρ_w^2 / N),  b_h ∼ N(0, ρ_b^2)

Here N = s, the state dimension.
Now, assume that
We consider the evolution of a single input x_{it} through the network.
Since the weights and biases are independent with zero mean:

E[s_{it}] = 0

The second moment of the Gaussian random variable:

E[s_{it} s_{jt}] = q^t δ_{ij}
Here we have
Here q^t is the variance of the pre-activations in the t-th layer due to an input x_t:

q^t = (ρ_w^2 / √(2π)) ∫ σ_h^2(√(q^{t−1}) s) exp(−s^2/2) ds + ρ_b^2

This describes the pass through the recursion of the RNN.
For any choice of ρ_w^2 and ρ_b^2 and a bounded activation σ_h, the previous equation converges to a specific fixed point.
This recursion has a fixed point:

q^* = (ρ_w^2 / √(2π)) ∫ σ_h^2(√(q^*) s) exp(−s^2/2) ds + ρ_b^2
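A Monte Carlo sketch of this recursion for σ_h = tanh; approximating the Gaussian integral by sampling is an implementation choice, not part of the slides.

```python
import numpy as np

def q_fixed_point(rho_w2, rho_b2, n_iter=200, n_samples=200_000, seed=0):
    """Iterate q_t = rho_w^2 * E_z[tanh(sqrt(q_{t-1}) z)^2] + rho_b^2, z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    q = 1.0                                   # arbitrary starting value
    for _ in range(n_iter):
        q = rho_w2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2) + rho_b2
    return q

print(q_fixed_point(rho_w2=1.5, rho_b2=0.05))   # converges to the fixed point q*
```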
A Fixed Point
Definition
In mathematics, a fixed point of a function is an element of the function's domain that is mapped to itself by the function.
Example
[Figure omitted]
Therefore
We have that
If the input x_0 is chosen so that q^1 = q^*, the dynamics start at the fixed point and the distribution of D_t is independent of t.
Not only that
Even if q^1 ≠ q^*, a few layers are often sufficient to approximately converge to the fixed point.
So, when t is large
It is a good approximation to assume q^t = q^*.
Additionally
The independence of the weights and biases implies
The covariance between different pre-activations in the same layer is given by

E[z_{it;a} z_{jt;b}] = q^t_{ab} δ_{ij}

Therefore

q^t_{ab} = ρ_w^2 ∫∫ σ_h(u_1) σ_h(u_2) Dz_1 Dz_2 + ρ_b^2

Where Dz = (1/√(2π)) exp(−z^2/2) dz and

u_1 = √(q^{t−1}_{aa}) z_1
u_2 = √(q^{t−1}_{bb}) ( c^{t−1}_{ab} z_1 + √(1 − (c^{t−1}_{ab})^2) z_2 )
c^t_{ab} = q^t_{ab} / √(q^t_{aa} q^t_{bb})
Then
χ(ρ_w, ρ_b)
It separates the (ρ_w, ρ_b) plane into two regions.
When χ > 1
Forward signal propagation expands and folds space in a chaotic manner, and gradients explode.
When χ < 1
Forward signal propagation contracts in an ordered manner, and gradients vanish exponentially.
These regions establish the stability of the network
[Figure: phase diagram in the (ρ_w, ρ_b) plane, with an ordered region (vanishing gradient) and a chaotic region (exploding gradient)]
Therefore
It is clear that
When we choose ρ_b = ρ_w we have convergence of the network.
Having other values
Requires a careful choice of the values.
Another Problem
The Vanishing and Exploding Gradients
They are a problem for RNNs.
If we use the full BPTT
We also confront limitations on the amount of memory and hardware available.
Thus, a popular strategy
It is the Truncated BPTT [7, 8].
Therefore
They proposed using a truncation of the BPTT
To deal with the problem of the Vanishing and Exploding Gradient.
What is Truncated BPTT?
In general, this should be regarded as a heuristic technique for simplifying the computation,
which is a good approximation of the true gradient.
The Algorithm
Truncated BPTT(k1, k2)
1 for t = 1 to T do:
2   Run the RNN for one step, computing h_t and y_t
3   if k1 divides t then
4     Run BPTT from t back to t − k2
Something Notable
1 It was first used by Elman [9].
2 Also, Mikolov et al. [10] used TBPTT to train RNNs on word-level language modeling.
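A schematic of the TBPTT(k1, k2) loop above; forward_step, backprop_through_last and update are hypothetical placeholders standing in for a concrete RNN implementation and optimizer.

```python
def truncated_bptt(sequence, k1, k2, state, forward_step, backprop_through_last, update):
    """Run the RNN over `sequence`; every k1 steps, backpropagate only k2 steps back.

    forward_step(x_t, state) -> (new_state, cache)    # hypothetical RNN step
    backprop_through_last(caches) -> gradients        # hypothetical truncated BPTT
    update(gradients)                                 # hypothetical optimizer step
    """
    caches = []
    for t, x_t in enumerate(sequence, start=1):
        state, cache = forward_step(x_t, state)
        caches.append(cache)
        if t % k1 == 0:                                # every k1 time steps...
            grads = backprop_through_last(caches[-k2:])  # ...unroll only k2 steps
            update(grads)
            caches = caches[-k2:]                      # keep only what the next window needs
    return state
```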
Initialization of the Hidden State
This is the classic problem in RNNs
How do we initialize the hidden state h_0?
There are three main methods
1 Initialize h_0 to the zero vector.
2 Adaptive noisy initialization of h_0.
3 Find the steady state.
The Simplest One
We can simply initialize h_0
To a zero state.
Quite simple and easy to apply.
However, do we have something better?
Adaptive noisy initialization
It was proposed by Zimmermann et al. [11]
They proposed to use the residual error once the back-propagation has been done for h_0.
This is done
By disturbing h_0 with a noise term Θ which follows the distribution of the residual error.
Example of these initializations
Source: https://r2rt.com/non-zero-initial-states-for-recurrent-neural-networks.html
What about the Weight Parameters?
We could simply initialize them to zero
Danger, Will Robinson!!!
A simple example with the following feed-forward architecture:

w = σ_1(W_{hi} x)
y = σ_2(W_{oh} w)
L = (1/2) [y − z]^2
Do you remember?
[Figure: the same phase diagram as before, with the ordered (vanishing gradient) and chaotic (exploding gradient) regions]
Furthermore
We have heuristics
For ReLU, we multiply the randomly generated values of W by

√(2 / size_{l−1})

For tanh, the heuristic is called Xavier initialization:

√(2 / size_{l−1})

Another common one:

√(2 / (size_{l−1} + size_l))
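A sketch of these scalings in NumPy; the square-root form is the usual way these multipliers are applied, and note that other references use 1/size_{l−1} for the tanh (Xavier) case.

```python
import numpy as np

def init_weights(size_in, size_out, rule="relu", rng=None):
    """Scaled Gaussian initialization following the heuristics above.
    The constants mirror the slide; other references use 1/size_in for tanh."""
    rng = rng or np.random.default_rng()
    if rule == "relu":
        scale = np.sqrt(2.0 / size_in)
    elif rule == "tanh":                          # "Xavier"-style scaling
        scale = np.sqrt(2.0 / size_in)
    else:                                         # combined fan-in / fan-out version
        scale = np.sqrt(2.0 / (size_in + size_out))
    return rng.normal(size=(size_out, size_in)) * scale

W = init_weights(256, 128, rule="relu")
print(W.std())    # roughly sqrt(2/256) ~ 0.088
```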
History of LSTM
They were introduced by
LSTM was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber [12]
An attempt to deal with the vanishing and exploding gradient
By introducing Constant Error Carousel (CEC) units.
Properties
In 1999, Felix Gers, his advisor Jürgen Schmidhuber and Fred Cummins introduced the forget gate (also called the "keep gate") into the LSTM architecture.
It enables the LSTM to reset its own state.
Long Short Term Memory (LSTM)
We have the following architecture (⊙ denotes the component-wise product):

f_t = σ(W_f [h_{t−1}, x_t] + b_f)     (Forget Gate)
i_t = σ(W_i [h_{t−1}, x_t] + b_i)     (Input/Update Gate)
o_t = σ(W_o [h_{t−1}, x_t] + b_o)     (Output Gate)
ĉ_t = tanh(W_c [h_{t−1}, x_t] + b_c)  (Intermediate Cell Gate)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t       (Cell State Gate)
h_t = o_t ⊙ tanh(c_t)                 (Hidden State)

Where σ is a sigmoid function.
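A minimal NumPy sketch of one LSTM step following these equations; the concatenation [h_{t−1}, x_t], the weight shapes and the toy dimensions are spelled out as illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    v = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ v + bf)               # forget gate
    i_t = sigmoid(Wi @ v + bi)               # input/update gate
    o_t = sigmoid(Wo @ v + bo)               # output gate
    c_hat = np.tanh(Wc @ v + bc)             # intermediate cell
    c_t = f_t * c_prev + i_t * c_hat         # cell state
    h_t = o_t * np.tanh(c_t)                 # hidden state
    return h_t, c_t

# Illustrative sizes: input dim 3, hidden dim 4
rng = np.random.default_rng(3)
d, n_h = 3, 4
shape = (n_h, n_h + d)
Wf, Wi, Wo, Wc = (rng.normal(size=shape) for _ in range(4))
bf = bi = bo = bc = np.zeros(n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, d)):
    h, c = lstm_step(x, h, c, Wf, Wi, Wo, Wc, bf, bi, bo, bc)
```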
Here the interesting part
In the RNN

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1})

But here

c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t    (Cell State Gate)
h_t = o_t ⊙ tanh(c_t)

You need the forget term, the input term and the intermediate cell
To update the state.
You can see
Something Notable
The cell keeps track of the dependencies between the elements in the input sequence and the state.
The input gate
It is in charge of how much of the input flows into the cell gate:

i_t = σ(W_i [h_{t−1}, x_t] + b_i)
What is the meaning?
We have that
The sigmoid layer decides what values to update.
They impact the term i_t ⊙ ĉ_t
Making it possible to decide how to control the intermediate cell values.
Now
The forget gate
How much of the previous cell value remains in the cell at time t:

f_t = σ(W_f [h_{t−1}, x_t] + b_f)

Actually
It uses the previous state and the input.
Then the sigmoid can be interpreted as
A value between 0 and 1: "completely forget" vs. "completely keep".
Furthermore
The output gate
It controls the extent to which the value in the cell is used to compute the actual state.
Which impacts the term f_t ⊙ c_{t−1}
Based on the previous cell state.
Thus a type of control
Between the previous cell state and the new cell state.
Finally
We have the update of the cell as

c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t

Basically
Apply the forget operation to the previous internal cell state.
Add the new candidate values, scaled by how much we decided to update.
We can see this as
Dropping old information and adding new information (e.g., about a subject's gender in a language model).
200. Images/cinvestav
Thus at the output layer and update state
We have
ot = σ [Wo [ht−1, xt] + bo] (Output Gate)
ht = ot ⊙ tanh (ct) (Hidden State)
Therefore, we have that
Sigmoid layer: decide what linear combination of state/input to
output
Additionally, the tanh squashes the values between -1
and 1
The output is a filtered version of the cell state!!!
103 / 142
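Putting the four gate equations together, the following is a minimal single-step sketch in NumPy. It only illustrates the gating logic of the slides; the weight names (Wi, Wf, Wo, Wc), the concatenated [ht−1, xt] input, and the toy sizes are assumptions for the example, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the slide equations.

    params holds Wi, Wf, Wo, Wc (each of shape (hidden, hidden + input))
    and the biases bi, bf, bo, bc. These names are illustrative only.
    """
    z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    i_t = sigmoid(params["Wi"] @ z + params["bi"])    # input gate
    f_t = sigmoid(params["Wf"] @ z + params["bf"])    # forget gate
    o_t = sigmoid(params["Wo"] @ z + params["bo"])    # output gate
    c_hat = np.tanh(params["Wc"] @ z + params["bc"])  # candidate cell value
    c_t = f_t * c_prev + i_t * c_hat                  # cell state update
    h_t = o_t * np.tanh(c_t)                          # hidden state
    return h_t, c_t

# Toy usage with random parameters (hidden size 4, input size 3)
rng = np.random.default_rng(0)
H, D = 4, 3
params = {k: rng.normal(scale=0.1, size=(H, H + D)) for k in ("Wi", "Wf", "Wo", "Wc")}
params.update({b: np.zeros(H) for b in ("bi", "bf", "bo", "bc")})
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, params)
```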
201. Images/cinvestav
Thus at the output layer and update state
We have
ot =σ [Wo [ht−1, xt] + bo] (Output Gate)
ht =ot tanh (ct) (Hidden State)
Therefore, we have that
Sigmoid layer: decide what linear combination of state/input to
output
Additionally, we have that the tanh squashes the values between -1
and 1
The output is used to filter a version of cell state!!!
103 / 142
202. Images/cinvestav
Thus at the output layer and update state
We have
ot =σ [Wo [ht−1, xt] + bo] (Output Gate)
ht =ot tanh (ct) (Hidden State)
Therefore, we have that
Sigmoid layer: decide what linear combination of state/input to
output
Additionally, we have that the tanh squashes the values between -1
and 1
The output is used to filter a version of cell state!!!
103 / 142
204. Images/cinvestav
LSTM Remarks
First
It maintains a separate cell state from what is outputted
Second
Use gates to control the flow of information
Forget gate tries to get rid of irrelevant information
Selectively update cell state
Output gate returns a filtered version of the cell state
Third
Backpropagation from ct to ct−1 requires only elementwise
multiplication!
105 / 142
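To make the third remark concrete, here is a hedged sketch of the cell-state Jacobian: if, as a common simplification, the gates are treated as constants with respect to ct−1, the factor along the cell path is just the forget gate, so gradients are scaled by products of forget activations rather than by repeated weight-matrix multiplications.

```latex
% Hedged sketch: f_t, i_t, \hat{c}_t are treated as constants w.r.t. c_{t-1}
\frac{\partial c_t}{\partial c_{t-1}}
   = \frac{\partial}{\partial c_{t-1}}\bigl( f_t \odot c_{t-1} + i_t \odot \hat{c}_t \bigr)
   \approx \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial c_T}{\partial c_t} \approx \operatorname{diag}\Bigl( \bigodot_{k=t+1}^{T} f_k \Bigr)
```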
207. Images/cinvestav
Achievements
LSTM achieved record results in natural language text compression
http://www.mattmahoney.net/dc/text.html#1218
Unsegmented connected handwriting recognition
Graves, A., Liwicki, M., Fernández, S., Bertolami, R.; Bunke, H.,
Schmidhuber, J. (May 2009). "A Novel Connectionist System for
Unconstrained Handwriting Recognition". IEEE Transactions on
Pattern Analysis and Machine Intelligence. 31 (5): 855–868
Finally
Won the ICDAR handwriting competition (2009)
106 / 142
210. Images/cinvestav
Right now
Something Notable
As of 2016, major technology companies including Google, Apple, and
Microsoft were using LSTM as fundamental components in new
products.
107 / 142
211. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
108 / 142
212. Images/cinvestav
History
They were proposed as a simplification of the LSTM
In 2014, Kyunghyun Cho et al. put forward a simplified variant called
Gated recurrent unit (GRU)
Something Notable
The GRU is like a long short-term memory (LSTM) with a forget
gate...
but it has fewer parameters than the LSTM, as it lacks an output gate
109 / 142
216. Images/cinvestav
Main Observations
There is a gate used to combine the previous state ht−1 with the candidate state
The zt gate, which uses the information of the input and the
previous state to decide how much to update
ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt
The intermediate step ĥt
A bounded version of the possible state ht
112 / 142
219. Images/cinvestav
However
It has been shown that
Gail Weiss, Yoav Goldberg, and Eran Yahav showed that the LSTM is
"strictly stronger" than the GRU
The LSTM can perform unbounded counting [13]
The GRU cannot
i.e., the LSTM can simulate the kind of counting machine used in theoretical CS
Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le of Google
Brain found that
LSTM cells consistently outperform GRU cells in "the first large-scale
analysis of architecture variations for Neural Machine Translation.”
114 / 142
222. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
115 / 142
223. Images/cinvestav
Given that we want to do sequence modeling
Stock options
Predicting the next phrase
Question: If I am a man?
Prediction: then you are a homo sapiens
116 / 142
225. Images/cinvestav
What do we have in these sequences of data?
Sequences have different lengths
We need to handle variable-length sequences
117 / 142
227. Images/cinvestav
Not only that
Maintain information about order
“We have a mother living in Yucatan, Mexico”
Share parameters across the sequence
Do you remember the state ht?
119 / 142
229. Images/cinvestav
However
There is a need to increase their power
Given the amounts of data we have right now
There is a tendency to start using Recurrent Neural Networks
As cells to be stacked into bigger systems [14, 15]
This is based on the following idea [16]
Hypothesis: a hierarchical model can be exponentially more efficient at
representing some functions than a shallow one.
120 / 142
232. Images/cinvestav
In the case of RNN’s
Certain Transitions are not Deep
They are only the result of a linear projection followed by an
element-wise nonlinearity
They are
Hidden-to-hidden ht−1 → ht
Hidden-to-output ht → yt
Input-to-hidden xt → ht
Meaning
They are all shallow in the sense that there exists no intermediate,
nonlinear hidden layer (see the sketch below)
121 / 142
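As a hedged illustration of the difference, the sketch below contrasts the shallow hidden-to-hidden transition with a version that inserts one intermediate nonlinear layer. All weight names (W, U, V1, V2) and sizes are made up for the example; it is not the construction of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 5, 3
W, U = rng.normal(size=(H, D)), rng.normal(size=(H, H))
V1, V2 = rng.normal(size=(H, H)), rng.normal(size=(H, H))

def shallow_transition(x_t, h_prev):
    # Single linear projection followed by an element-wise nonlinearity
    return np.tanh(W @ x_t + U @ h_prev)

def deep_transition(x_t, h_prev):
    # Intermediate nonlinear hidden layer between h_{t-1} and h_t
    z = np.tanh(W @ x_t + V1 @ h_prev)
    return np.tanh(V2 @ z)

x_t, h_prev = rng.normal(size=D), np.zeros(H)
print(shallow_transition(x_t, h_prev))
print(deep_transition(x_t, h_prev))
```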
235. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
122 / 142
236. Images/cinvestav
Bengio et al. [17]
Gave the following Hypothesis
Sampling algorithms (Markov chains and MCMC techniques) suffer
from a fundamental problem
Given unconnected or weakly connected regions of a distribution
We have that
It is difficult for the Markov chain to jump from one mode of the
distribution to another, when these are separated by large low-density
regions
This means that we get slow mixing of the samples
used to represent such distributions
123 / 142
240. Images/cinvestav
The Main Problem
We have that
Slow mixing means that many consecutive samples tend to be
correlated
They belong to the same mode of the mixture
Why?
Because jumps between modes in the MCMC method are slow and rare
125 / 142
242. Images/cinvestav
Implications in Learning Algorithms
Given that some form of sampling is at the core of many learning
algorithms
For example, to estimate the log-likelihood gradient
Therefore, at the beginning of learning
Mixing is initially easy
However as the model improves
its corresponding distribution sharpens and mixing becomes slower
126 / 142
246. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
128 / 142
247. Images/cinvestav
Therefore
We need to build deeper structures to reach more capabilities
For example, the vector representation of documents
Here an extra layer can be used to build such a
representation
For example, Mikolov et al. [18]
129 / 142
250. Images/cinvestav
The equations
They will look like
zt = σ [Wz [ht−1, xt] + bz] (Update Gate)
rt = σ [Wr [ht−1, xt] + br] (Reset Gate)
ĥt = tanh [Wo [rt ⊙ ht−1, xt] + bh] (Candidate State)
ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt (Hidden State)
x = σ (Woh y)
y = σ (Whi w)
131 / 142
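A minimal NumPy sketch of one GRU step, following the update/reset/candidate equations above. The weight names (Wz, Wr, Wo), the concatenated inputs, and the toy sizes are assumptions for illustration, not the exact notation of the papers cited here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: update gate z_t, reset gate r_t, candidate ĥ_t, interpolation."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(p["Wz"] @ z_in + p["bz"])                              # update gate
    r_t = sigmoid(p["Wr"] @ z_in + p["br"])                              # reset gate
    h_hat = np.tanh(p["Wo"] @ np.concatenate([r_t * h_prev, x_t]) + p["bh"])
    return (1.0 - z_t) * h_prev + z_t * h_hat                            # new hidden state

# Toy usage (hidden size 4, input size 3)
rng = np.random.default_rng(2)
H, D = 4, 3
p = {k: rng.normal(scale=0.1, size=(H, H + D)) for k in ("Wz", "Wr", "Wo")}
p.update({"bz": np.zeros(H), "br": np.zeros(H), "bh": np.zeros(H)})
h = gru_step(rng.normal(size=D), np.zeros(H), p)
```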
251. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
132 / 142
252. Images/cinvestav
Deep Transition Architectures
In a deep transition RNN (DT-RNN)
At each time step the next state is computed by the sequential
application of multiple transition layers.
For example, in the Nematus system [19]
They use GRU transition blocks with independent trainable
parameters
With a Caveat
The hidden state output of each block is used as the input state of the next one
133 / 142
255. Images/cinvestav
For example, at the encoder phase
For the ith source word in the forward direction, we have hi = hi,Ls
hi,1 = GRU1 (xi, hi−1,Ls)
hi,k = GRUk (0, hi,k−1) for 1 < k ≤ Ls
The word sequence is also processed in reverse to obtain a backward state, then
C ≡ [ →hi,Ls , ←hi,Ls ]
134 / 142
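A rough Python illustration of the forward-direction deep-transition encoder step described above, reusing the GRU step sketched earlier; parameter handling, names, and sizes are simplified assumptions, not the Nematus implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    # Same illustrative GRU step as in the earlier sketch
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(p["Wz"] @ z_in + p["bz"])
    r_t = sigmoid(p["Wr"] @ z_in + p["br"])
    h_hat = np.tanh(p["Wo"] @ np.concatenate([r_t * h_prev, x_t]) + p["bh"])
    return (1.0 - z_t) * h_prev + z_t * h_hat

def deep_transition_encoder_step(x_i, h_prev_top, layer_params):
    """h_{i,1} = GRU_1(x_i, h_{i-1,Ls}); h_{i,k} = GRU_k(0, h_{i,k-1}) for 1 < k <= Ls."""
    h = gru_step(x_i, h_prev_top, layer_params[0])
    for p_k in layer_params[1:]:
        h = gru_step(np.zeros_like(x_i), h, p_k)   # deeper layers get no external input
    return h                                        # h_i = h_{i,Ls}

# Toy usage: hidden size 4, input size 3, Ls = 3 transition layers
rng = np.random.default_rng(3)
H, D, Ls = 4, 3, 3
def make_params():
    p = {k: rng.normal(scale=0.1, size=(H, H + D)) for k in ("Wz", "Wr", "Wo")}
    p.update({"bz": np.zeros(H), "br": np.zeros(H), "bh": np.zeros(H)})
    return p
layers = [make_params() for _ in range(Ls)]
h_i = deep_transition_encoder_step(rng.normal(size=D), np.zeros(H), layers)
```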
257. Images/cinvestav
Then
The decoder phase uses the outputs from the previous GRU and
something called attention (we will look at this later)
sj,1 = GRU1 (yj−1, sj−1,Lt)
sj,2 = GRU2 (ATT, sj,1)
sj,k = GRUk (0, sj,k−1) for 2 < k ≤ Lt
Then, the target word state sj ≡ sj,Lt
It is used by a feed-forward neural network to predict the current
target word
135 / 142
260. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
137 / 142
261. Images/cinvestav
There are many other examples
Basically
We have moved far from the classic methods such as
1 Autoregressive moving average (ARMA)
2 Autoregressive integrated moving average (ARIMA)
3 etc.
These RNN architectures are taking the prediction of time series
To another level!!!
138 / 142
263. Images/cinvestav
O. L. R. Jacobs, “Introduction to control theory,” 1974.
A. Robinson and F. Fallside, The utility driven dynamic error
propagation network.
University of Cambridge Department of Engineering, 1987.
P. J. Werbos et al., “Backpropagation through time: what it does and
how to do it,” Proceedings of the IEEE, vol. 78, no. 10,
pp. 1550–1560, 1990.
G. Chen, “A gentle tutorial of recurrent neural network with error
backpropagation,” arXiv preprint arXiv:1610.02583, 2016.
B. W. Bader and T. G. Kolda, “Algorithm 862: Matlab tensor classes
for fast algorithm prototyping,” ACM Trans. Math. Softw., vol. 32,
Dec. 2006.
J. Pennington, S. S. Schoenholz, and S. Ganguli, “The emergence of
spectral universality in deep networks,” arXiv preprint
arXiv:1802.09979, 2018.
139 / 142
264. Images/cinvestav
R. J. Williams and D. Zipser, “Gradient-based learning algorithms for
recurrent,” Backpropagation: Theory, architectures, and applications,
vol. 433, 1995.
R. J. Williams and J. Peng, “An efficient gradient-based algorithm for
on-line training of recurrent network trajectories,” Neural
computation, vol. 2, no. 4, pp. 490–501, 1990.
J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14,
no. 2, pp. 179–211, 1990.
T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur,
“Recurrent neural network based language model,” in Eleventh annual
conference of the international speech communication association,
2010.
H.-G. Zimmermann, C. Tietz, and R. Grothmann, “Forecasting with
recurrent neural networks: 12 tricks,” in Neural Networks: Tricks of
the Trade, pp. 687–707, Springer, 2012.
140 / 142
265. Images/cinvestav
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
G. Weiss, Y. Goldberg, and E. Yahav, “On the practical computational
power of finite precision rnns for language recognition,” CoRR,
vol. abs/1805.04908, 2018.
R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct
deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013.
A. V. M. Barone, J. Helcl, R. Sennrich, B. Haddow, and A. Birch,
“Deep architectures for neural machine translation,” arXiv preprint
arXiv:1707.07631, 2017.
Y. Bengio et al., “Learning deep architectures for AI,” Foundations and
Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, “Better mixing via
deep representations,” in International conference on machine
learning, pp. 552–560, 2013.
141 / 142
266. Images/cinvestav
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation
of word representations in vector space,” arXiv preprint
arXiv:1301.3781, 2013.
R. Sennrich, O. Firat, K. Cho, A. Birch, B. Haddow, J. Hitschler,
M. Junczys-Dowmunt, S. Läubli, A. V. M. Barone, J. Mokry, et al.,
“Nematus: a toolkit for neural machine translation,” arXiv preprint
arXiv:1703.04357, 2017.
142 / 142