Introduction to Neural Networks and Deep Learning
Recurrent Neural Networks
Andres Mendez-Vazquez
November 13, 2019
Outline
1 Introduction
  History
  State-Space Model
  Back to the RNN Equations
  Introducing the Cost Function
  Other Cost Functions
2 Training a Vanilla RNN Model
  The Final RNN Model
  Back-Propagation Through Time (BPTT)
  Vanishing and Exploding Gradients
  Fixing the Problem, the ReLU Function
  The Analysis of the Exploding and Vanishing Gradient
  Truncated BPTT
  Initialization
  Hidden State
3 Modern Recurrent Architectures
  Now, Long Short Term Memory (LSTM)
  What about Gated Recurrent Units (GRU)?
4 Deeper Architectures with RNNs
  Introduction
  Deep Architectures for Better Learning
  Deep Input-to-Hidden Function
  Deep Transition Architectures
Conclusions
In 1987, Robinson and Fallside [2]
At the Cambridge University Engineering Department
They proposed a new type of neural network based on Linear Control Theory.
They built on the work of Jacobs (1974) on dynamic nets [1]:

s_{t+1} = A s_t + B x_t
y_t = C s_t
Furthermore
Jordan proposed a simple recurrent network:

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1} + b_h)
y_t = σ_s(V_{os} h_t + b_o)

Where
1 x_t is an input of dimension d.
2 h_t is a hidden state layer of dimension h.
3 y_t is the output vector of dimension s.
4 W, U, V, b_h and b_o are parameter matrices and vectors.
5 σ_h and σ_s are activation functions.
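A minimal NumPy sketch of one step of this recurrence may help make the notation concrete; the tanh choice for σ_h, the identity output activation and the toy dimensions below are illustrative assumptions, not part of the slides.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, V, b_h, b_o):
    """One step of the simple recurrent network above:
    h_t = sigma_h(W x_t + U h_{t-1} + b_h), y_t = sigma_s(V h_t + b_o).
    Here sigma_h is assumed to be tanh and sigma_s the identity."""
    h_t = np.tanh(W @ x_t + U @ h_prev + b_h)
    y_t = V @ h_t + b_o
    return h_t, y_t

# Toy dimensions (illustrative): d = 3 inputs, 4 hidden units, s = 2 outputs
rng = np.random.default_rng(0)
d, n_h, s = 3, 4, 2
W = rng.normal(size=(n_h, d))
U = rng.normal(size=(n_h, n_h))
V = rng.normal(size=(s, n_h))
b_h, b_o = np.zeros(n_h), np.zeros(s)

h_t = np.zeros(n_h)                       # initial hidden state
for x_t in rng.normal(size=(5, d)):       # a sequence of 5 input vectors
    h_t, y_t = rnn_step(x_t, h_t, W, U, V, b_h, b_o)
```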
What were they used for?
Robinson and Fallside
As with Hidden Markov Models, they were proposed for Speech Coding.
They proposed the following architecture.
Based on the State-Space Model
Basically, a linear system
Based on a state-determined system model.
Definition
A mathematical description of the system in terms of a minimum set of variables x_i(t), i = 1, ..., n, together with knowledge of those variables at an initial time t_0 and the system inputs for time t ≥ t_0, is sufficient to predict the future system state and outputs for all time t > t_0.
Therefore
We have the system as a block, described by its state variables.
This can be expressed as a set of state equations:

ṡ_1 = f_1(x, s, t)
ṡ_2 = f_2(x, s, t)
...
ṡ_n = f_n(x, s, t)
Using Vector Notation
Assuming that we have a linear, time-invariant system
Time-Invariant: x(t + δ) maps directly to y(t + δ), for example

α x(t + δ) + β = y(t + δ)

Therefore, using this idea:

ṡ_i = a_{i1} x_1(t) + ... + a_{id} x_d(t) + b_{i1} s_1(t) + ... + b_{in} s_n(t)

Or in matrix form:

y(t) = A x(t) + B s(t)
Then, the discretized version
We introduce an update for the state part:

y(t) = A x(t) + B s(t)
ṡ(t) = C s(t)

Or our discrete step equations:

y(t) = A x(t) + B s(t)
s(t + 1) = C s(t)
The Elman Network
In Elman's equations

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1} + b_h)
y_t = σ_y(V_{os} h_t + b_y)

We notice something different from the linear recurrent system:
The use of activation functions to introduce non-linearity.
Explanation
We have the following
1 The input x_t is encoded by W_{sd}:
  W_{sd} x_t
2 A state is generated by using the encoded input plus the previous state h_{t−1}:
  h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1} + b_h)
3 The output is generated using the new state h_t:
  y_t = σ_y(V_{os} h_t + b_y)
We need to introduce the concept of a cost function
As always
It needs to comply with two properties.
First, the cost function L must be able to be written as an average

L = (1/N) Σ_{x∈X} C_x

over the individual cost functions C_x.
This allows us to apply different optimization techniques such as
Mini-batch Gradient Descent
Stochastic Gradient Descent
etc.
Furthermore
Non-dependency
The cost function L must not depend on any activation values of the neural network besides the output values.
If we cannot assure this
Backpropagation becomes too unstable or too complex to solve. For example,

L = (1/N) Σ_{t=0}^{N} [y_t + h_t − z_t]^2

gives two entry points into the network.
A List of Cost Functions
The Average Quadratic Cost

L = (1/N) Σ_{t=0}^{N} [y_t − z_t]^2

where y_t is the output of the network and z_t is the ground truth of the output.
Here, we are interpolating functions.
[Figure: a real function vs. its interpolated version]
Cross-Entropy Cost
First, the Loss Function

L = − Σ_{i=1}^{C} z_i log(y_i)

where y_i is the output and z_i is the ground truth for the class estimation.
Why z_i log(y_i)?
We can imagine a sequence of class probabilities y_1, y_2, ..., y_m and the likelihood of the data under the model

P[data|model] = y_1^{k_1} y_2^{k_2} · · · y_m^{k_m}
Then
Taking the logarithm and multiplying by −1:

− log P[data|model] = − Σ_{i=1}^{C} k_i log y_i

Then, dividing by the total number of samples:

− (1/N) log P[data|model] = − Σ_{i=1}^{C} (k_i/N) log y_i = − Σ_{i=1}^{C} z_i log y_i
Then
In information theory, the Kraft–McMillan theorem
It establishes that any directly decodable coding scheme for coding a message to identify one value x_i ∈ {x_1, x_2, ..., x_n}
can be seen as representing an implicit probability distribution over {x_1, x_2, ..., x_n}:

q(x_i) = (1/2)^{l_i}

where l_i is the length of the code for x_i.
Now
We have that
Cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q is assumed while the data actually follows a distribution p.
The expected message-length under the true distribution p is

E_p[l] = −E_p[ln q(x) / ln 2] = −E_p[log_2 q(x)] = − Σ_{x_i} p(x_i) log_2 q(x_i) = H(p, q)
Special Case
A special case is the binary class problem, C = 2
Based on the fact that z_1 + z_2 = 1 and y_1 + y_2 = 1:

L = − Σ_{i=1}^{2} z_i log(y_i) = −z_1 log(y_1) − (1 − z_1) log(1 − y_1)

A problem with this
It could be possible to have y_1 = 0.
Another Interpretation
The loss can be expressed as

L = − log(f(y_1))        if z_1 = 1
L = − log(1 − f(y_1))    if z_1 = 0

Where z_1 = 1
It means that the class C_1 is the positive class for this sample.
The Gradient of the Binary Cross-Entropy
We take the derivative with respect to y_1:

∂L/∂y_1 = z_1 (f(y_1) − 1) + (1 − z_1) f(y_1)
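As a quick sanity check of this expression, the sketch below compares it with a numerical derivative, assuming f is the sigmoid applied to the score y_1 (an assumption consistent with the binary setting above).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def binary_ce(y1, z1):
    f = sigmoid(y1)
    return -z1 * np.log(f) - (1.0 - z1) * np.log(1.0 - f)

y1, z1, eps = 0.7, 1.0, 1e-6
analytic = z1 * (sigmoid(y1) - 1.0) + (1.0 - z1) * sigmoid(y1)
numeric = (binary_ce(y1 + eps, z1) - binary_ce(y1 - eps, z1)) / (2 * eps)
print(analytic, numeric)   # both approximately -0.332
```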
In the case of the Multiclass Problem
We use two things; first, a softmax:

f(y_i) = exp{y_i} / Σ_{j=1}^{C} exp{y_j}

As in the multiclass case for Linear Models
The labels are one-hot, so only the positive class C_p keeps its term in the loss.
Therefore
There is only one element of the target vector z that is not zero, z_i = z_p.
We can then simplify
The cost function becomes

L = − Σ_{i=1}^{C} z_i log(f(y_i)) = − log( exp{y_p} / Σ_{j=1}^{C} exp{y_j} )
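A short sketch of this softmax cross-entropy; the max-subtraction is a standard numerical-stability trick, not something stated on the slides.

```python
import numpy as np

def softmax_cross_entropy(y, p):
    """y: vector of class scores; p: index of the positive class.
    Returns L = -log(exp(y_p) / sum_j exp(y_j))."""
    y_shifted = y - np.max(y)                          # stability trick
    log_probs = y_shifted - np.log(np.sum(np.exp(y_shifted)))
    return -log_probs[p]

y = np.array([2.0, 1.0, 0.1])
print(softmax_cross_entropy(y, p=0))                   # ~0.417 for true class 0
```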
Other Cost Functions
Exponential Cost with hyper-parameter τ:

L = τ exp( (1/τ) Σ_{i=1}^{N} (y_i − z_i)^2 )

Hellinger Distance:

L = (1/2) Σ_{i=1}^{N} (√y_i − √z_i)^2

Here the values need to be in the interval [0, 1].
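A sketch of these two costs, assuming y and z are NumPy arrays and, for the Hellinger case, that their entries lie in [0, 1]; the 1/2 factor follows the formula above.

```python
import numpy as np

def exponential_cost(y, z, tau):
    """Exponential cost with hyper-parameter tau."""
    return tau * np.exp(np.sum((y - z) ** 2) / tau)

def hellinger_cost(y, z):
    """Hellinger-distance-style cost; entries of y and z must be in [0, 1]."""
    return 0.5 * np.sum((np.sqrt(y) - np.sqrt(z)) ** 2)

y = np.array([0.2, 0.5, 0.3])
z = np.array([0.1, 0.6, 0.3])
print(exponential_cost(y, z, tau=1.0), hellinger_cost(y, z))
```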
We have the following
Architecture with Quadratic Error

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1} + b_h)
y_t = σ_y(V_{os} h_t + b_y)
L = (1/2) Σ_{t=0}^{N} [y_t − z_t]^2

Something Notable
How do we train something with a recurrence forcing a dependence over time?
Now, given the dependency over time
We can use the classic unfolding of the network [3, 4], assuming that W, U, V, b_h and b_o do not change under the unfolding.
Unfolding?
Assume that there are no bias-correcting terms; only W, U and V.
Then
Given an observation sequence x = {x_1, x_2, ..., x_T }, where x_i ∈ R^d, and the corresponding labels z = {z_1, z_2, ..., z_T }
We remove the bias to simplify our derivations:

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1})
y_t = σ_y(V_{os} h_t)
L = (1/2) Σ_{t=0}^{T} [z_t − y_t]^2
Now, the term ∂net_o/∂V_{os}
First, we have a component i:

net_{o_i} = Σ_{j=1}^{s} V_{ij} h_j

What happens when we differentiate with respect to the matrix?

∂net_o/∂V_{os} =
[ ∂net_o/∂V_{11}   ∂net_o/∂V_{12}   · · ·   ∂net_o/∂V_{1s} ]
[ ∂net_o/∂V_{21}   ∂net_o/∂V_{22}   · · ·   ∂net_o/∂V_{2s} ]
[      ...              ...         · · ·        ...       ]
[ ∂net_o/∂V_{o1}   ∂net_o/∂V_{o2}   · · ·   ∂net_o/∂V_{os} ]

Actually
A tensor with three dimensions...
But something quite nice
Each of the components of net_o has the previous structure:

net_{o_i} = Σ_{j=1}^{s} V_{ij} h_j

Then, if V_{kl} does not intervene in it (k ≠ i):

∂net_{o_i}/∂V_{kl} = 0

Additionally, if it intervenes:

∂net_{o_i}/∂V_{ij} = h_j
Therefore
It is possible to collapse the tensor into a 2D matrix
Given that the other information is redundant, we can rewrite the tensor as

F_{ijk} = ∂net_{o_i}/∂V_{jk}

Then, we have that

F_{iji} = G_{ij} ⇐ Better storage!!!
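In practice this collapse is just the observation that, for net_o = V h, the only non-zero derivatives are ∂net_{o_i}/∂V_{ij} = h_j. A quick numerical confirmation (with arbitrary toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
o, s = 3, 4
V = rng.normal(size=(o, s))
h = rng.normal(size=s)

net_o = V @ h
eps = 1e-6
i, j = 1, 2
V_pert = V.copy()
V_pert[i, j] += eps
numeric = (V_pert @ h - net_o)[i] / eps   # d net_{o_i} / d V_{ij}
print(numeric, h[j])                       # both equal h_j (up to eps)
```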
Therefore, given that a matrix is also a tensor
We have two tensors, net^{o×o} and F^{o×s×o} [5]
We will use the contracted product of two tensors, which is a generalization of the tensor-vector and tensor-matrix multiplications.
Definition
Given two tensors A^{o×o} and B^{o×s×o}:

⟨A, B⟩(k, j) = Σ_{i=1}^{o} A_{i,k} G_{i,j} = A_{k,k} G_{k,j} = σ'(net_{o_k}) h_j
Now, the term ∂L/∂U_{ss}
Assuming our change in time step t → t + 1 and given

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1})

we have

∂L(t + 1)/∂U_{ss} = ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂U_{ss}

Therefore
We can think of this as a Markovian Backpropagation.
What if we go further
From t − 1 → t + 1:

∂L(t + 1)/∂U_{ss} = ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_t × ∂h_t/∂U_{ss}

Now, the trick: if we consider all the possible derivatives from t down to 0, we have

∂L(t + 1)/∂U_{ss} = Σ_{k=1}^{t} ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_k × ∂h_k/∂U_{ss}

However
How do we calculate ∂h_{t+1}/∂h_k?
We have a proposal
Given the composition of functions:

∂h_{t+1}/∂h_k = ∂h_{k+1}/∂h_k × ∂h_{k+2}/∂h_{k+1} × · · · × ∂h_{t+1}/∂h_t

Here, we know that

∂h_{i+1}/∂h_i = ∂h_{i+1}/∂net_s × ∂net_s/∂h_i
We have that
Given h_{i+1} = σ_h(W_{sd} x_i + U_{ss} h_i) and net_h = W_{sd} x_i + U_{ss} h_i:

∂h_{i+1}/∂net_s = diag( σ'_h(net_{h_1}), σ'_h(net_{h_2}), ..., σ'_h(net_{h_s}) ) = D_{i+1}

Finally, we have that

∂net_s/∂h_i = U_{ss}
Then
We can aggregate over all time steps:

∂L/∂U_{ss} = Σ_t Σ_{k=1}^{t} ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_k × ∂h_k/∂U_{ss}

Now, we need to differentiate L with respect to W_{sd}:

∂L(t + 1)/∂W_{sd} = ∂L(t + 1)/∂h_{t+1} × ∂h_{t+1}/∂W_{sd}
Now
Because h_{t+1} depends on W_{sd} both directly and through h_t, we need to back-propagate to h_t:

∂L(t + 1)/∂W_{sd} = ∂L(t + 1)/∂h_{t+1} × ∂h_{t+1}/∂W_{sd} + ∂L(t + 1)/∂h_t × ∂h_t/∂W_{sd}
                  = ∂L(t + 1)/∂h_{t+1} × ∂h_{t+1}/∂W_{sd} + ∂L(t + 1)/∂h_{t+1} × ∂h_{t+1}/∂h_t × ∂h_t/∂W_{sd}

Then, summing over all the contributions from t down to 0:

∂L(t + 1)/∂W_{sd} = Σ_{k=1}^{t+1} ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_k × ∂h_k/∂W_{sd}

Finally, summing over all time steps:

∂L/∂W_{sd} = Σ_t Σ_{k=1}^{t+1} ∂L(t + 1)/∂y_{t+1} × ∂y_{t+1}/∂h_{t+1} × ∂h_{t+1}/∂h_k × ∂h_k/∂W_{sd}
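A compact sketch of these summed gradients for ∂L/∂U_{ss}, checked against a numerical derivative; the tanh hidden activation, identity output and tiny dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, s, T = 2, 3, 5
W = rng.normal(size=(s, d)) * 0.5
U = rng.normal(size=(s, s)) * 0.5
V = rng.normal(size=(s, s)) * 0.5          # output dimension = s for simplicity
xs = rng.normal(size=(T, d))
zs = rng.normal(size=(T, s))

def forward(W, U, V):
    h = np.zeros(s)
    hs, ys = [h], []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        hs.append(h)
        ys.append(V @ h)
    loss = 0.5 * sum(np.sum((y - z) ** 2) for y, z in zip(ys, zs))
    return hs, ys, loss

def bptt_grad_U(W, U, V):
    hs, ys, _ = forward(W, U, V)
    dU = np.zeros_like(U)
    delta_h = np.zeros(s)                      # dL/dh_t carried backwards in time
    for t in reversed(range(T)):
        delta_h = delta_h + V.T @ (ys[t] - zs[t])     # gradient arriving from y_t
        delta_net = delta_h * (1.0 - hs[t + 1] ** 2)  # through tanh'
        dU += np.outer(delta_net, hs[t])              # contribution to dL/dU
        delta_h = U.T @ delta_net                     # pass back to h_{t-1}
    return dU

dU = bptt_grad_U(W, U, V)
eps, (i, j) = 1e-6, (0, 1)
Up, Um = U.copy(), U.copy()
Up[i, j] += eps
Um[i, j] -= eps
num = (forward(W, Up, V)[2] - forward(W, Um, V)[2]) / (2 * eps)
print(dU[i, j], num)     # the two numbers should agree closely
```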
Vanishing Gradients
We have a problem with the product

∂h_{k+1}/∂h_k × ∂h_{k+2}/∂h_{k+1} × · · · × ∂h_{t+1}/∂h_t

You finish with a vanishing gradient when using σ(x) = 1 / (1 + exp{−x}).
This is problematic!!!
Given
Given the commutativity of the product
You could group together the derivatives of the sigmoids:

f(x) = ds(x)/dx = e^{−x} / (1 + e^{−x})^2

Therefore, differentiating again:

df(x)/dx = − e^{−x} / (1 + e^{−x})^2 + 2(e^{−x})^2 / (1 + e^{−x})^3

After setting df(x)/dx = 0
We find that the maximum is at x = 0.
Therefore
The maximum of the derivative of the sigmoid is

f(0) = 0.25

Therefore, given a Deep Network, we could finish with

lim_{k→∞} (ds(x)/dx)^k = lim_{k→∞} (0.25)^k → 0

A Vanishing Derivative or Vanishing Gradient
Making it quite difficult to train a deeper network using this activation function in Deep Learning, and even in Shallow Learning.
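A two-line numerical illustration of this decay, using the 0.25 bound above:

```python
import numpy as np

ks = np.array([1, 5, 10, 20, 50])
print(0.25 ** ks)   # 2.5e-01, 9.8e-04, 9.5e-07, 9.1e-13, 7.9e-31: the factor collapses
```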
For the case of the vanishing gradient, we have that
Rearranging terms in ∂h_{k+1}/∂h_k × ∂h_{k+2}/∂h_{k+1} × · · · × ∂h_{t+1}/∂h_t
We have

Π_{k=0}^{T} ∂h_{k+1}/∂net_s [U_{ss}]^{T+1}

Then, given the sigmoid,

Π_{k=0}^{T} ∂h_{k+1}/∂net_s = diag( Π_{k=0}^{T} σ'_h(net^k_{h_1}), Π_{k=0}^{T} σ'_h(net^k_{h_2}), ..., Π_{k=0}^{T} σ'_h(net^k_{h_s}) )
It is clear
That you have the phenomenon of the vanishing gradient.
Do we have a way of fixing this?
Yes
The use of new activation functions.
Thus
The need to introduce a new function

f(x) = x^+ = max(0, x)

It is called ReLU or Rectifier.
With a smooth approximation (the Softplus function):

f(x) = ln(1 + e^{kx}) / k
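A direct sketch of both functions; the k parameter controls how closely the softplus hugs the ReLU, and k = 1 is a common default.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x, k=1.0):
    """Smooth approximation: ln(1 + e^{kx}) / k approaches relu(x) as k grows."""
    return np.log1p(np.exp(k * x)) / k

x = np.linspace(-3, 3, 7)
print(relu(x))
print(softplus(x, k=1.0))
print(softplus(x, k=10.0))   # already very close to relu(x)
```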
However
Here the gradient can explode
Thus, the need to control the gradient...
Therefore, we will use the following analysis [6]:
"The Emergence of Spectral Universality in Deep Networks" by Jeffrey Pennington, Samuel S. Schoenholz and Surya Ganguli.
We have
The following dynamics:

h_t = σ_h(s_t),  s_t = W_{sd} x_t + U_{ss} h_{t−1} + b_h

Then, we have the following Jacobian:

J = ∂h_T/∂h_0 = Π_{t=1}^{L} D_t U_{ss}

Where, as we saw, D_t is a diagonal matrix.
This Jacobian J is a matrix of dimension s × s; therefore, if it is well conditioned, you are not sending the projection to a lower dimensionality.
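A rough simulation of how the norm of this Jacobian product behaves for a tanh recurrence with Gaussian U_{ss} of standard deviation ρ_w/√s; the specific ρ_w values and sizes are illustrative assumptions.

```python
import numpy as np

def jacobian_norm(rho_w, s=64, T=50, seed=0):
    """||prod_t D_t U|| for a tanh recurrence driven only by the hidden state."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=rho_w / np.sqrt(s), size=(s, s))
    h = rng.normal(size=s)
    J = np.eye(s)
    for _ in range(T):
        h = np.tanh(U @ h)
        D = np.diag(1.0 - h ** 2)          # tanh'(pre-activation) = 1 - h^2
        J = D @ U @ J
    return np.linalg.norm(J, 2)

for rho_w in (0.5, 1.0, 2.0):
    print(rho_w, jacobian_norm(rho_w))      # small rho_w shrinks it, large rho_w can grow it
```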
Remember the structure of the layer
The following dynamics:

h_t = σ_h(s_t),  s_t = W_{sd} x_t + U_{ss} h_{t−1} + b_h

Therefore, we have that

s_{it} = Σ_j W_{ij} x^t_j + Σ_k U_{ik} h^{t−1}_k + b_i

We assume the following about the temporal layer weights:

[U_{ss}, W_{sd}] ∼ N(0, ρ_w^2 / N),  b_h ∼ N(0, ρ_b^2)

Here N = s, the state dimension.
Now, assume that
We consider the evolution of a single input x_{it} through the network.
Since the weights and biases are independent with zero mean:

E[s_{it}] = 0

The second moment of the Gaussian random variable:

E[s_{it} s_{jt}] = q^t δ_{ij}
Here we have
Here q^t is the variance of the pre-activations in the t-th layer due to an input x_t:

q^t = (ρ_w^2 / √(2π)) ∫ σ_h^2(√(q^{t−1}) s) exp(−s^2/2) ds + ρ_b^2

This describes the pass through the recursion of the RNN.
For any choice of ρ_w^2 and ρ_b^2 and a bounded activation σ_h, the previous equation converges to a specific fixed point.
This recursion has a fixed point:

q^* = (ρ_w^2 / √(2π)) ∫ σ_h^2(√(q^*) s) exp(−s^2/2) ds + ρ_b^2
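A Monte Carlo sketch of this recursion for σ_h = tanh; approximating the Gaussian integral by sampling is an implementation choice, not part of the slides.

```python
import numpy as np

def q_fixed_point(rho_w2, rho_b2, n_iter=200, n_samples=200_000, seed=0):
    """Iterate q_t = rho_w^2 * E_z[tanh(sqrt(q_{t-1}) z)^2] + rho_b^2, z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)
    q = 1.0                                   # arbitrary starting value
    for _ in range(n_iter):
        q = rho_w2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2) + rho_b2
    return q

print(q_fixed_point(rho_w2=1.5, rho_b2=0.05))   # converges to the fixed point q*
```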
A Fixed Point
Definition
In mathematics, a fixed point of a function is an element of the function's domain that is mapped to itself by the function.
Example
[Figure omitted]
Therefore
We have that
If the input x_0 is chosen so that q^1 = q^*, the dynamics start at the fixed point and the distribution of D_t is independent of t.
Not only that
Even if q^1 ≠ q^*, a few layers are often sufficient to approximately converge to the fixed point.
So, when t is large
It is a good approximation to assume q^t = q^*.
Additionally
The independence of the weights and biases implies
The covariance between different pre-activations in the same layer is given by

E[z_{it;a} z_{jt;b}] = q^t_{ab} δ_{ij}

Therefore

q^t_{ab} = ρ_w^2 ∫∫ σ_h(u_1) σ_h(u_2) Dz_1 Dz_2 + ρ_b^2

Where Dz = (1/√(2π)) exp(−z^2/2) dz and

u_1 = √(q^{t−1}_{aa}) z_1
u_2 = √(q^{t−1}_{bb}) ( c^{t−1}_{ab} z_1 + √(1 − (c^{t−1}_{ab})^2) z_2 )
c^t_{ab} = q^t_{ab} / √(q^t_{aa} q^t_{bb})
Then
χ(ρ_w, ρ_b)
It separates the (ρ_w, ρ_b) plane into two regions.
When χ > 1
Forward signal propagation expands and folds space in a chaotic manner, and gradients explode.
When χ < 1
Forward signal propagation contracts in an ordered manner, and gradients vanish exponentially.
These regions establish the stability of the network
[Figure: phase diagram in the (ρ_w, ρ_b) plane, with an ordered region (vanishing gradient) and a chaotic region (exploding gradient)]
Therefore
It is clear that
When we choose ρ_b = ρ_w we have convergence of the network.
Having other values
Requires a careful choice of the values.
Another Problem
The Vanishing and Exploding Gradients
They are a problem for RNNs.
If we use the full BPTT
We also confront limitations on the amount of memory and hardware available.
Thus, a popular strategy
It is the Truncated BPTT [7, 8].
Therefore
They proposed using a truncation of the BPTT
To deal with the problem of the Vanishing and Exploding Gradient.
What is Truncated BPTT?
In general, this should be regarded as a heuristic technique for simplifying the computation,
which is a good approximation of the true gradient.
The Algorithm
Truncated BPTT(k1, k2)
1 for t = 1 to T do:
2   Run the RNN for one step, computing h_t and y_t
3   if k1 divides t then
4     Run BPTT from t back to t − k2
Something Notable
1 It was first used by Elman [9].
2 Also, Mikolov et al. [10] used TBPTT to train RNNs on word-level language modeling.
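A schematic of the TBPTT(k1, k2) loop above; forward_step, backprop_through_last and update are hypothetical placeholders standing in for a concrete RNN implementation and optimizer.

```python
def truncated_bptt(sequence, k1, k2, state, forward_step, backprop_through_last, update):
    """Run the RNN over `sequence`; every k1 steps, backpropagate only k2 steps back.

    forward_step(x_t, state) -> (new_state, cache)    # hypothetical RNN step
    backprop_through_last(caches) -> gradients        # hypothetical truncated BPTT
    update(gradients)                                 # hypothetical optimizer step
    """
    caches = []
    for t, x_t in enumerate(sequence, start=1):
        state, cache = forward_step(x_t, state)
        caches.append(cache)
        if t % k1 == 0:                                # every k1 time steps...
            grads = backprop_through_last(caches[-k2:])  # ...unroll only k2 steps
            update(grads)
            caches = caches[-k2:]                      # keep only what the next window needs
    return state
```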
Initialization of the Hidden State
This is the classic problem in RNNs
How do we initialize the hidden state h_0?
There are three main methods
1 Initialize h_0 to the zero vector.
2 Adaptive noisy initialization of h_0.
3 Find the steady state.
The Simplest One
We can simply initialize h_0
To a zero state.
Quite simple and easy to apply.
However, do we have something better?
Adaptive noisy initialization
It was proposed by Zimmermann et al. [11]
They proposed to use the residual error once the back-propagation has been done for h_0.
This is done
By disturbing h_0 with a noise term Θ which follows the distribution of the residual error.
Example of these initializations
Source: https://r2rt.com/non-zero-initial-states-for-recurrent-neural-networks.html
What about the Weight Parameters?
We could simply initialize them to zero
Danger, Will Robinson!!!
A simple example with the following feed-forward architecture:

w = σ_1(W_{hi} x)
y = σ_2(W_{oh} w)
L = (1/2) [y − z]^2
Do you remember?
[Figure: the same phase diagram as before, with the ordered (vanishing gradient) and chaotic (exploding gradient) regions]
Furthermore
We have heuristics
For ReLU, we multiply the randomly generated values of W by

√(2 / size_{l−1})

For tanh, the heuristic is called Xavier initialization:

√(2 / size_{l−1})

Another common one:

√(2 / (size_{l−1} + size_l))
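A sketch of these scalings in NumPy; the square-root form is the usual way these multipliers are applied, and note that other references use 1/size_{l−1} for the tanh (Xavier) case.

```python
import numpy as np

def init_weights(size_in, size_out, rule="relu", rng=None):
    """Scaled Gaussian initialization following the heuristics above.
    The constants mirror the slide; other references use 1/size_in for tanh."""
    rng = rng or np.random.default_rng()
    if rule == "relu":
        scale = np.sqrt(2.0 / size_in)
    elif rule == "tanh":                          # "Xavier"-style scaling
        scale = np.sqrt(2.0 / size_in)
    else:                                         # combined fan-in / fan-out version
        scale = np.sqrt(2.0 / (size_in + size_out))
    return rng.normal(size=(size_out, size_in)) * scale

W = init_weights(256, 128, rule="relu")
print(W.std())    # roughly sqrt(2/256) ~ 0.088
```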
History of LSTM
They were introduced by
LSTM was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber [12]
An attempt to deal with the vanishing and exploding gradient
By introducing Constant Error Carousel (CEC) units.
Properties
In 1999, Felix Gers, his advisor Jürgen Schmidhuber and Fred Cummins introduced the forget gate (also called the "keep gate") into the LSTM architecture.
It enables the LSTM to reset its own state.
Long Short Term Memory (LSTM)
We have the following architecture (⊙ denotes the component-wise product):

f_t = σ(W_f [h_{t−1}, x_t] + b_f)     (Forget Gate)
i_t = σ(W_i [h_{t−1}, x_t] + b_i)     (Input/Update Gate)
o_t = σ(W_o [h_{t−1}, x_t] + b_o)     (Output Gate)
ĉ_t = tanh(W_c [h_{t−1}, x_t] + b_c)  (Intermediate Cell Gate)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t       (Cell State Gate)
h_t = o_t ⊙ tanh(c_t)                 (Hidden State)

Where σ is a sigmoid function.
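A minimal NumPy sketch of one LSTM step following these equations; the concatenation [h_{t−1}, x_t], the weight shapes and the toy dimensions are spelled out as illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    v = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ v + bf)               # forget gate
    i_t = sigmoid(Wi @ v + bi)               # input/update gate
    o_t = sigmoid(Wo @ v + bo)               # output gate
    c_hat = np.tanh(Wc @ v + bc)             # intermediate cell
    c_t = f_t * c_prev + i_t * c_hat         # cell state
    h_t = o_t * np.tanh(c_t)                 # hidden state
    return h_t, c_t

# Illustrative sizes: input dim 3, hidden dim 4
rng = np.random.default_rng(3)
d, n_h = 3, 4
shape = (n_h, n_h + d)
Wf, Wi, Wo, Wc = (rng.normal(size=shape) for _ in range(4))
bf = bi = bo = bc = np.zeros(n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, d)):
    h, c = lstm_step(x, h, c, Wf, Wi, Wo, Wc, bf, bi, bo, bc)
```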
Here the interesting part
In the RNN

h_t = σ_h(W_{sd} x_t + U_{ss} h_{t−1})

But here

c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t    (Cell State Gate)
h_t = o_t ⊙ tanh(c_t)

You need the forget term, the input term and the intermediate cell
To update the state.
You can see
Something Notable
The cell keeps track of the dependencies between the elements in the input sequence and the state.
The input gate
It is in charge of how much of the input flows into the cell gate:

i_t = σ(W_i [h_{t−1}, x_t] + b_i)
What is the meaning?
We have that
The sigmoid layer decides what values to update.
They impact the term i_t ⊙ ĉ_t
Making it possible to decide how to control the intermediate cell values.
Now
The forget gate
How much of the previous cell value remains in the cell at time t:

f_t = σ(W_f [h_{t−1}, x_t] + b_f)

Actually
It uses the previous state and the input.
Then the sigmoid can be interpreted as
A value between 0 and 1: "completely forget" vs. "completely keep".
Furthermore
The output gate
It controls the extent to which the value in the cell is used to compute the actual state.
Which impacts the term f_t ⊙ c_{t−1}
Based on the previous cell state.
Thus a type of control
Between the previous cell state and the new cell state.
Finally
We have the update of the cell as

c_t = f_t ⊙ c_{t−1} + i_t ⊙ ĉ_t

Basically
Apply the forget operation to the previous internal cell state.
Add the new candidate values, scaled by how much we decided to update.
We can see this as
Dropping old information and adding new information (e.g., about a subject's gender in a language model).
200. Images/cinvestav
Thus at the output layer and update state
We have
ot = σ [Wo [ht−1, xt] + bo] (Output Gate)
ht = ot ⊙ tanh (ct) (Hidden State)
Therefore, we have that
Sigmoid layer: decide what linear combination of state/input to
output
Additionally, the tanh squashes the values between -1
and 1
The output is a filtered version of the cell state!!!
103 / 142
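Putting the four gate equations together, the following is a minimal single-step sketch in NumPy. It only illustrates the gating logic of the slides; the weight names (Wi, Wf, Wo, Wc), the concatenated [ht−1, xt] input, and the toy sizes are assumptions for the example, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the slide equations.

    params holds Wi, Wf, Wo, Wc (each of shape (hidden, hidden + input))
    and the biases bi, bf, bo, bc. These names are illustrative only.
    """
    z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    i_t = sigmoid(params["Wi"] @ z + params["bi"])    # input gate
    f_t = sigmoid(params["Wf"] @ z + params["bf"])    # forget gate
    o_t = sigmoid(params["Wo"] @ z + params["bo"])    # output gate
    c_hat = np.tanh(params["Wc"] @ z + params["bc"])  # candidate cell value
    c_t = f_t * c_prev + i_t * c_hat                  # cell state update
    h_t = o_t * np.tanh(c_t)                          # hidden state
    return h_t, c_t

# Toy usage with random parameters (hidden size 4, input size 3)
rng = np.random.default_rng(0)
H, D = 4, 3
params = {k: rng.normal(scale=0.1, size=(H, H + D)) for k in ("Wi", "Wf", "Wo", "Wc")}
params.update({b: np.zeros(H) for b in ("bi", "bf", "bo", "bc")})
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, params)
```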
201. Images/cinvestav
Thus at the output layer and update state
We have
ot =σ [Wo [ht−1, xt] + bo] (Output Gate)
ht =ot tanh (ct) (Hidden State)
Therefore, we have that
Sigmoid layer: decide what linear combination of state/input to
output
Additionally, we have that the tanh squashes the values between -1
and 1
The output is used to filter a version of cell state!!!
103 / 142
202. Images/cinvestav
Thus at the output layer and update state
We have
ot =σ [Wo [ht−1, xt] + bo] (Output Gate)
ht =ot tanh (ct) (Hidden State)
Therefore, we have that
Sigmoid layer: decide what linear combination of state/input to
output
Additionally, we have that the tanh squashes the values between -1
and 1
The output is used to filter a version of cell state!!!
103 / 142
204. Images/cinvestav
LSTM Remarks
First
It maintains a separate cell state from what is outputted
Second
Use gates to control the flow of information
Forget gate tries to get rid of irrelevant information
Selectively update cell state
Output gate returns a filtered version of the cell state
Third
Backpropagation from ct to ct−1 requires only elementwise
multiplication!
105 / 142
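To make the third remark concrete, here is a hedged sketch of the cell-state Jacobian: if, as a common simplification, the gates are treated as constants with respect to ct−1, the factor along the cell path is just the forget gate, so gradients are scaled by products of forget activations rather than by repeated weight-matrix multiplications.

```latex
% Hedged sketch: f_t, i_t, \hat{c}_t are treated as constants w.r.t. c_{t-1}
\frac{\partial c_t}{\partial c_{t-1}}
   = \frac{\partial}{\partial c_{t-1}}\bigl( f_t \odot c_{t-1} + i_t \odot \hat{c}_t \bigr)
   \approx \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial c_T}{\partial c_t} \approx \operatorname{diag}\Bigl( \bigodot_{k=t+1}^{T} f_k \Bigr)
```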
207. Images/cinvestav
Achievements
LSTM achieved record results in natural language text compression
http://www.mattmahoney.net/dc/text.html#1218
Unsegmented connected handwriting recognition
Graves, A., Liwicki, M., Fernández, S., Bertolami, R.; Bunke, H.,
Schmidhuber, J. (May 2009). "A Novel Connectionist System for
Unconstrained Handwriting Recognition". IEEE Transactions on
Pattern Analysis and Machine Intelligence. 31 (5): 855–868
Finally
Won the ICDAR handwriting competition (2009)
106 / 142
210. Images/cinvestav
Right now
Something Notable
As of 2016, major technology companies including Google, Apple, and
Microsoft were using LSTM as fundamental components in new
products.
107 / 142
211. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
108 / 142
212. Images/cinvestav
History
They were proposed as a simplification of the LSTM
In 2014, Kyunghyun Cho et al. put forward a simplified variant called
Gated recurrent unit (GRU)
Something Notable
The GRU is like a long short-term memory (LSTM) with a forget
gate...
but it has fewer parameters than the LSTM, as it lacks an output gate
109 / 142
216. Images/cinvestav
Main Observations
There is a gate used to combine the previous state ht−1 with the candidate state
The zt gate, which uses the information of the input and the
previous state to decide how much to update
ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt
The intermediate step ĥt
A bounded version of the possible state ht
112 / 142
219. Images/cinvestav
However
It has been shown that
Gail Weiss, Yoav Goldberg, and Eran Yahav showed that the LSTM is
"strictly stronger" than the GRU
The LSTM can perform unbounded counting [13]
The GRU cannot
i.e., the LSTM can simulate the kind of counting machine used in theoretical CS
Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le of Google
Brain found that
LSTM cells consistently outperform GRU cells in "the first large-scale
analysis of architecture variations for Neural Machine Translation.”
114 / 142
222. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
115 / 142
223. Images/cinvestav
Given that we want to do sequence modeling
Stock options
Predicting the next phrase
Question: If I am a man?
Prediction: then you are a homo sapiens
116 / 142
225. Images/cinvestav
What do we have in these sequences of data?
Sequences have different lengths
We need to handle variable-length sequences
117 / 142
227. Images/cinvestav
Not only that
Maintain information about order
“We have a mother living in Yucatan, Mexico”
Share parameters across the sequence
Do you remember the state ht?
119 / 142
229. Images/cinvestav
However
There is a need to increase their power
Given the amounts of data we have right now
There is a tendency to start using Recurrent Neural Networks
As cells to be stacked into bigger systems [14, 15]
This is based on the following idea [16]
Hypothesis: a hierarchical model can be exponentially more efficient at
representing some functions than a shallow one.
120 / 142
232. Images/cinvestav
In the case of RNN’s
Certain Transitions are not Deep
They are only the result of a linear projection followed by an
element-wise nonlinearity
They are
Hidden-to-hidden ht−1 → ht
Hidden-to-output ht → yt
Input-to-hidden xt → ht
Meaning
They are all shallow in the sense that there exists no intermediate,
nonlinear hidden layer (see the sketch below)
121 / 142
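As a hedged illustration of the difference, the sketch below contrasts the shallow hidden-to-hidden transition with a version that inserts one intermediate nonlinear layer. All weight names (W, U, V1, V2) and sizes are made up for the example; it is not the construction of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(1)
H, D = 5, 3
W, U = rng.normal(size=(H, D)), rng.normal(size=(H, H))
V1, V2 = rng.normal(size=(H, H)), rng.normal(size=(H, H))

def shallow_transition(x_t, h_prev):
    # Single linear projection followed by an element-wise nonlinearity
    return np.tanh(W @ x_t + U @ h_prev)

def deep_transition(x_t, h_prev):
    # Intermediate nonlinear hidden layer between h_{t-1} and h_t
    z = np.tanh(W @ x_t + V1 @ h_prev)
    return np.tanh(V2 @ z)

x_t, h_prev = rng.normal(size=D), np.zeros(H)
print(shallow_transition(x_t, h_prev))
print(deep_transition(x_t, h_prev))
```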
235. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
122 / 142
236. Images/cinvestav
Bengio et al. [17]
Gave the following Hypothesis
Sampling algorithms (Markov chains and MCMC techniques) suffer
from a fundamental problem
Given unconnected or weakly connected regions of a distribution
We have that
It is difficult for the Markov chain to jump from one mode of the
distribution to another, when these are separated by large low-density
regions
This means that we get slow mixing of the samples
used to represent such distributions
123 / 142
240. Images/cinvestav
The Main Problem
We have that
Slow mixing means that many consecutive samples tend to be
correlated
They belong to the same mode of the mixture
Why?
Because jumps between modes in the MCMC method are slow and rare
125 / 142
242. Images/cinvestav
Implications in Learning Algorithms
Given that some form of sampling is at the core of many learning
algorithms
For example, to estimate the log-likelihood gradient
Therefore, at the beginning of learning
Mixing is initially easy
However as the model improves
its corresponding distribution sharpens and mixing becomes slower
126 / 142
246. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
128 / 142
247. Images/cinvestav
Therefore
We need to build deeper structures to reach more capabilities
For example, the vector representation of documents
Here an extra layer can be used to build such a
representation
For example, Mikolov et al. [18]
129 / 142
250. Images/cinvestav
The equations
They will look like
zt = σ [Wz [ht−1, xt] + bz] (Update Gate)
rt = σ [Wr [ht−1, xt] + br] (Reset Gate)
ĥt = tanh [Wo [rt ⊙ ht−1, xt] + bh] (Candidate State)
ht = (1 − zt) ⊙ ht−1 + zt ⊙ ĥt (Hidden State)
x = σ (Woh y)
y = σ (Whi w)
131 / 142
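A minimal NumPy sketch of one GRU step, following the update/reset/candidate equations above. The weight names (Wz, Wr, Wo), the concatenated inputs, and the toy sizes are assumptions for illustration, not the exact notation of the papers cited here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: update gate z_t, reset gate r_t, candidate ĥ_t, interpolation."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(p["Wz"] @ z_in + p["bz"])                              # update gate
    r_t = sigmoid(p["Wr"] @ z_in + p["br"])                              # reset gate
    h_hat = np.tanh(p["Wo"] @ np.concatenate([r_t * h_prev, x_t]) + p["bh"])
    return (1.0 - z_t) * h_prev + z_t * h_hat                            # new hidden state

# Toy usage (hidden size 4, input size 3)
rng = np.random.default_rng(2)
H, D = 4, 3
p = {k: rng.normal(scale=0.1, size=(H, H + D)) for k in ("Wz", "Wr", "Wo")}
p.update({"bz": np.zeros(H), "br": np.zeros(H), "bh": np.zeros(H)})
h = gru_step(rng.normal(size=D), np.zeros(H), p)
```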
251. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
132 / 142
252. Images/cinvestav
Deep Transition Architectures
In a deep transition RNN (DT-RNN)
At each time step the next state is computed by the sequential
application of multiple transition layers.
For example, in the Nematus system [19]
They use GRU transition blocks with independent trainable
parameters
With a Caveat
The hidden state output of each block is used as the input state of the next one
133 / 142
255. Images/cinvestav
For example, at the encoder phase
For the ith source word in the forward direction, we have hi = hi,Ls
hi,1 = GRU1 (xi, hi−1,Ls)
hi,k = GRUk (0, hi,k−1) for 1 < k ≤ Ls
The word sequence is also processed in reverse to obtain a backward state, then
C ≡ [ →hi,Ls , ←hi,Ls ]
134 / 142
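A rough Python illustration of the forward-direction deep-transition encoder step described above, reusing the GRU step sketched earlier; parameter handling, names, and sizes are simplified assumptions, not the Nematus implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    # Same illustrative GRU step as in the earlier sketch
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(p["Wz"] @ z_in + p["bz"])
    r_t = sigmoid(p["Wr"] @ z_in + p["br"])
    h_hat = np.tanh(p["Wo"] @ np.concatenate([r_t * h_prev, x_t]) + p["bh"])
    return (1.0 - z_t) * h_prev + z_t * h_hat

def deep_transition_encoder_step(x_i, h_prev_top, layer_params):
    """h_{i,1} = GRU_1(x_i, h_{i-1,Ls}); h_{i,k} = GRU_k(0, h_{i,k-1}) for 1 < k <= Ls."""
    h = gru_step(x_i, h_prev_top, layer_params[0])
    for p_k in layer_params[1:]:
        h = gru_step(np.zeros_like(x_i), h, p_k)   # deeper layers get no external input
    return h                                        # h_i = h_{i,Ls}

# Toy usage: hidden size 4, input size 3, Ls = 3 transition layers
rng = np.random.default_rng(3)
H, D, Ls = 4, 3, 3
def make_params():
    p = {k: rng.normal(scale=0.1, size=(H, H + D)) for k in ("Wz", "Wr", "Wo")}
    p.update({"bz": np.zeros(H), "br": np.zeros(H), "bh": np.zeros(H)})
    return p
layers = [make_params() for _ in range(Ls)]
h_i = deep_transition_encoder_step(rng.normal(size=D), np.zeros(H), layers)
```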
257. Images/cinvestav
Then
The decoder phase uses the outputs from the previous GRU and
something called attention (we will look at this later)
sj,1 = GRU1 (yj−1, sj−1,Lt)
sj,2 = GRU2 (ATT, sj,1)
sj,k = GRUk (0, sj,k−1) for 2 < k ≤ Lt
Then, the target word state sj ≡ sj,Lt
It is used by a feed-forward neural network to predict the current
target word
135 / 142
260. Images/cinvestav
Outline
1 Introduction
History
State-Space Model
Back to the RNN Equations
Introducing the Cost Function
Other Cost Functions
2 Training a Vanilla RNN Model
The Final RNN Model
Back-Propagation Through Time (BPTT)
Vanishing and Exploding Gradients
Fixing the Problem, ReLu function
The Analysis of the Exploding and Vanishing Gradient
Truncated BPTT
Initialization
Hidden State
3 Modern Recurrent Architectures
Now, Long Short Term Memory (LSTM)
What about Gated Recurrent Units (GRU) units?
4 Deeper Architectures with RNN’s
Introduction
Deep Architectures for Better Learning
Deep Input-to-Hidden Function
Deep Transition Architectures
Conclusions
137 / 142
261. Images/cinvestav
There are many other examples
Basically
We have moved far from the classic methods such as
1 Autoregressive moving average (ARMA)
2 Autoregressive integrated moving average (ARIMA)
3 etc.
These RNN architectures are taking the prediction of time series
To another level!!!
138 / 142
263. Images/cinvestav
O. L. R. Jacobs, “Introduction to control theory,” 1974.
A. Robinson and F. Fallside, The utility driven dynamic error
propagation network.
University of Cambridge Department of Engineering, 1987.
P. J. Werbos et al., “Backpropagation through time: what it does and
how to do it,” Proceedings of the IEEE, vol. 78, no. 10,
pp. 1550–1560, 1990.
G. Chen, “A gentle tutorial of recurrent neural network with error
backpropagation,” arXiv preprint arXiv:1610.02583, 2016.
B. W. Bader and T. G. Kolda, “Algorithm 862: Matlab tensor classes
for fast algorithm prototyping,” ACM Trans. Math. Softw., vol. 32,
Dec. 2006.
J. Pennington, S. S. Schoenholz, and S. Ganguli, “The emergence of
spectral universality in deep networks,” arXiv preprint
arXiv:1802.09979, 2018.
139 / 142
264. Images/cinvestav
R. J. Williams and D. Zipser, “Gradient-based learning algorithms for
recurrent,” Backpropagation: Theory, architectures, and applications,
vol. 433, 1995.
R. J. Williams and J. Peng, “An efficient gradient-based algorithm for
on-line training of recurrent network trajectories,” Neural
computation, vol. 2, no. 4, pp. 490–501, 1990.
J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14,
no. 2, pp. 179–211, 1990.
T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur,
“Recurrent neural network based language model,” in Eleventh annual
conference of the international speech communication association,
2010.
H.-G. Zimmermann, C. Tietz, and R. Grothmann, “Forecasting with
recurrent neural networks: 12 tricks,” in Neural Networks: Tricks of
the Trade, pp. 687–707, Springer, 2012.
140 / 142
265. Images/cinvestav
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
G. Weiss, Y. Goldberg, and E. Yahav, “On the practical computational
power of finite precision rnns for language recognition,” CoRR,
vol. abs/1805.04908, 2018.
R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, “How to construct
deep recurrent neural networks,” arXiv preprint arXiv:1312.6026, 2013.
A. V. M. Barone, J. Helcl, R. Sennrich, B. Haddow, and A. Birch,
“Deep architectures for neural machine translation,” arXiv preprint
arXiv:1707.07631, 2017.
Y. Bengio et al., “Learning deep architectures for AI,” Foundations and
Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, “Better mixing via
deep representations,” in International conference on machine
learning, pp. 552–560, 2013.
141 / 142
266. Images/cinvestav
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation
of word representations in vector space,” arXiv preprint
arXiv:1301.3781, 2013.
R. Sennrich, O. Firat, K. Cho, A. Birch, B. Haddow, J. Hitschler,
M. Junczys-Dowmunt, S. Läubli, A. V. M. Barone, J. Mokry, et al.,
“Nematus: a toolkit for neural machine translation,” arXiv preprint
arXiv:1703.04357, 2017.
142 / 142