Learn Recurrent Neural Networks (RNN), GRU and LSTM networks and their architecture. This was one of the lectures of a full course I taught at the University of Moratuwa, Sri Lanka, in the first half of 2024.
1. DA 5330 – Advanced Machine Learning Applications
Lecture 9 – Deep Sequence Models
Maninda Edirisooriya
manindaw@uom.lk
2. Sequence Modeling
• Some events occur as a sequence of events. E.g.:
• Price of gold with time
• Velocity vector of a football during a kick
• Glucose level in blood with time
• Base pairs of a DNA sequence
• Right and left turns of a steering wheel while driving a car
• Sequence of words in an essay
• Sequence of sound frequencies in a speech
• Sequence Modeling is the set of techniques used to model events that occur
as a sequence
• Using Deep Learning for sequence modeling is known as Deep Sequence Modeling
3. Modeling Independent Events vs. Sequences
Independent events depend
only on the input X
Sequence events depend on,
1. The input X given at the current
time step and
2. The previous event/events
Source: https://www.youtube.com/watch?v=ySEx_Bqxvvo&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=9
5. Recurrent Neural Networks (RNNs)
• RNNs are a special type of NN that can keep track of past events in
memory
• An RNN maintains a hidden state ht which represents the cumulative
history of the events
Source: https://www.youtube.com/watch?v=ySEx_Bqxvvo&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=9
6. Recurrent Neural Networks (RNNs)
• An RNN maintains 3 separate weight matrices for multiplying with,
• Input: Wxh
• Hidden State: Whh
• Output: Why
• The non-linear activation function tanh is applied to the sum of the linear
combinations of the input vector and the previous hidden state with their
respective weight matrices, giving the new hidden state
• Then the linear combination of the hidden state with the output weight
matrix gives the output vector
• Note that Wxh, Whh and Why are shared across all the time steps
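The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trainable implementation; the sizes (3 inputs, 4 hidden units, 2 outputs) and the small random weights are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the slides)
input_size, hidden_size, output_size = 3, 4, 2

# The three weight matrices, shared across all time steps
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1

def rnn_step(x_t, h_prev):
    """One RNN time step: new hidden state and output."""
    # tanh of the sum of both linear combinations gives the new hidden state
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    # a linear combination of the hidden state gives the output
    y_t = W_hy @ h_t
    return h_t, y_t

# Unroll the same cell (same weights) over a short input sequence
h = np.zeros(hidden_size)
xs = rng.standard_normal((5, input_size))
for x_t in xs:
    h, y = rnn_step(x_t, h)

print(h.shape, y.shape)  # (4,) (2,)
```

Note that the loop reuses the same `W_xh`, `W_hh` and `W_hy` at every step, which is exactly the weight sharing the last bullet describes.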
7. Training RNNs
• A loss is calculated for each time step and their total is taken as the
total loss
• Backpropagation is applied through the time steps of the RNN
• Compared to other NN types, RNNs are effectively deeper, which creates,
• the Exploding Gradient problem and
• the Vanishing Gradient problem
• Gradient Clipping is used as a solution for the Exploding Gradient problem
• ReLU activation, identity initialization and modified versions of the RNN
such as LSTM and GRU are used to address the Vanishing Gradient problem
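Gradient Clipping, mentioned above, can be sketched as follows: when the global norm of the gradients exceeds a threshold, rescale them all so the norm equals the threshold. The threshold value 5.0 is an arbitrary example.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm
    never exceeds max_norm (clipping by norm)."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# An artificially "exploding" gradient of norm 500 gets rescaled to norm 5
grads = [np.array([300.0, 400.0])]
clipped = clip_gradients(grads, max_norm=5.0)
print(clipped[0])  # [3. 4.]
```

Rescaling (rather than element-wise truncation) preserves the direction of the gradient while bounding its magnitude.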
8. Backpropagation Through Time (for RNNs)
Source: https://www.youtube.com/watch?v=ySEx_Bqxvvo&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=9
9. RNN Application – Language Modeling
• Given a sequence of words, predicting the probability of the next word is
known as Language Modeling. E.g.:
• “Capital of Sri Lanka is _____ ” is an example where “Colombo” should be the next
word in the sentence
• In this application, words have to be fed as input events to the RNN
• But an RNN can only take numerical values as inputs, not words
• Therefore, words have to be converted to numerical values first
Source: https://www.youtube.com/watch?v=ySEx_Bqxvvo&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=9
10. Converting Words Numerically
• First the given string (e.g.: “Capital of Sri Lanka is”) of words have to
be converted to a sequence of word tokens by splitting with spaces
• Which results the list, [“Capital”, “of”, “Sri”, “Lanka”, “is”]
• Then each word should be assigned a numerical value. There are
several ways to do it
• Having a vocabulary of words (i.e. like an English dictionary) and assigning
each of the unique word a unique number. E.g. [“Capital”:34, “of”:567,
“Sri”:734, “Lanka”:56, “is”:346] This is Label Encoding which is suitable only
for ordinal values. But words are not ordinal.
• Therefore, we can use one-hot encoding instead for word tokens.
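The tokenization and one-hot steps above can be sketched directly. The toy vocabulary here contains only the five words of the example sentence; a real vocabulary would have tens of thousands of entries, which is why these vectors become so large.

```python
import numpy as np

sentence = "Capital of Sri Lanka is"
tokens = sentence.split(" ")   # ["Capital", "of", "Sri", "Lanka", "is"]

# Toy vocabulary: each unique word gets an index (label encoding)
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}

def one_hot(word, vocab):
    """One-hot encoding: all zeros except a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

encoded = np.stack([one_hot(w, vocab) for w in tokens])
print(encoded.shape)  # (5, 5): 5 tokens, vocabulary of 5 words
```

Each row has exactly one non-zero entry; with a realistic vocabulary of, say, 50,000 words, every token would need a 50,000-dimensional vector, which motivates the embeddings on the next slide.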
12. Word Embeddings
• However, one-hot
encodings are extremely
sparse and large in size
• Word Embedding is a
dense and memory-
efficient alternative that
also captures natural
relationships between
words
Source: https://www.scaler.com/topics/tensorflow/tensorflow-word-embeddings/
13. Word Embeddings of Words
• E.g.: Assuming the size of embedding is 4,
• “Capital”: [34, 74, 85, 83]
• “of”: [63, 85, 97, 64]
• “Sri”: [36, 45, 15, 90]
• “Lanka”: [62, 37, 63, 56]
• “is”: [42, 73, 93, 69]
• As each word in the sentence corresponds to an event in both the
independent variable X and the dependent variable Y, training happens as,
• x0 = [0, 0, 0, 0]
• x1 = y0 = [34, 74, 85, 83]
• x2 = y1 = [63, 85, 97, 64]
• x3 = y2 = [36, 45, 15, 90]
• x4 = y3 = [62, 37, 63, 56]
• x5 = y4 = [42, 73, 93, 69]
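The pairing above can be built programmatically: the target y of each time step is the embedding of the next word, which also serves as the input x of the following step. The embedding values are the same arbitrary illustrative numbers as on the slide.

```python
import numpy as np

# Illustrative 4-dimensional embeddings from the slide (arbitrary numbers)
embeddings = {
    "Capital": [34, 74, 85, 83],
    "of":      [63, 85, 97, 64],
    "Sri":     [36, 45, 15, 90],
    "Lanka":   [62, 37, 63, 56],
    "is":      [42, 73, 93, 69],
}

words = ["Capital", "of", "Sri", "Lanka", "is"]
vectors = [np.array(embeddings[w]) for w in words]

# For next-word prediction: the target y_t at step t is the embedding of
# the next word, and it becomes the input x_{t+1} of the following step
xs = [np.zeros(4)] + vectors   # x0 is a start-of-sequence zero vector
ys = vectors                   # y0..y4 are the embedded words

for t, (x, y) in enumerate(zip(xs, ys)):
    print(f"x{t} = {x.tolist()}, y{t} = {y.tolist()}")
```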
14. Sampling From The Trained Language Model
• Let the output of one time step be the input of the next time step
• Keep that going till the unknown token is generated as ŷ
15. Limitations of RNNs
• Encoding Bottleneck: As historical information is only propagated via
the hidden state, the size of the hidden state is a bottleneck for
storing the historical context of an RNN
• Inefficient Learning due to no Parallelism: As each time step is
treated as a distinct layer during backpropagation, training
time increases with the number of time steps
• No Long-Term Memory: RNNs are only capable of keeping the recent
history in their hidden state; the long-term memory gets lost as
the number of time steps increases. This issue is handled in the LSTM
(Long Short-Term Memory) and GRU (Gated Recurrent Unit) types
16. From RNN to GRU and LSTM – RNN Summary
• First, let's look at the hidden state formula of an RNN
• Here the hidden state of the previous time step and the input
are concatenated and multiplied with a single weight matrix Wa
• Then a common bias ba is added
• Activation function g is generally the tanh function
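Written out, the hidden-state update described by these bullets (in the notation of the cited Deep Learning Specialization, where a denotes the hidden state) is:

```latex
a^{\langle t \rangle} = g\left(W_a\left[a^{\langle t-1 \rangle},\, x^{\langle t \rangle}\right] + b_a\right), \qquad g = \tanh
```

Here the square brackets denote concatenation of the previous hidden state and the current input into a single vector, so one matrix Wa replaces the separate Whh and Wxh of the earlier slides.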
17. From RNN to GRU and LSTM
• The GRU (Gated Recurrent Unit) has a hidden state known as the Cell State
c<t> instead of the well-known hidden state of an RNN
• As the Cell State is updated only when applicable (i.e. under special
conditions) it can maintain a longer-term memory compared to an RNN
• The LSTM (Long Short-Term Memory) has both a hidden state a<t> and a cell
state c<t>, where the cell state maintains the long-term memory
• As an LSTM has both of them, it needs more memory and processing
power than the GRU
• However, the LSTM has better long-term memory in general, where the GRU
may lack
18. From RNN to GRU and LSTM
• Though we explain examples (like Language Modeling) using RNNs,
due to their lack of long-term memory they are not used in practice
in such scenarios with word sequences
• Instead, in almost all practical implementations GRUs or LSTMs are
used instead of plain RNNs
• As GRUs and LSTMs can replace RNNs in most related architectures,
we explain with RNNs for simplicity in the upcoming
slides. E.g.:
• Bidirectional RNNs can be replaced with Bidirectional GRUs and Bidirectional
LSTMs
• Deep RNNs can be replaced with Deep GRUs and Deep LSTMs
• Attention models are common to RNNs, GRUs and LSTMs
19. From RNN to GRU and LSTM
Now let's look at the formulae of a GRU and an LSTM
GRU Formula LSTM Formula
Source: Deep Learning Specialization, Andrew NG
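The formulas referenced above (shown as an image in the original slide) are, in the notation of the cited Deep Learning Specialization, where σ is the sigmoid, ⊙ is element-wise multiplication, and Γu, Γr, Γf, Γo are the update, relevance, forget and output gates:

```latex
% GRU
\Gamma_r = \sigma\left(W_r\left[c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_r\right) \\
\Gamma_u = \sigma\left(W_u\left[c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_u\right) \\
\tilde{c}^{\langle t \rangle} = \tanh\left(W_c\left[\Gamma_r \odot c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_c\right) \\
c^{\langle t \rangle} = \Gamma_u \odot \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) \odot c^{\langle t-1 \rangle}, \qquad a^{\langle t \rangle} = c^{\langle t \rangle} \\[1em]
% LSTM
\tilde{c}^{\langle t \rangle} = \tanh\left(W_c\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_c\right) \\
\Gamma_u = \sigma\left(W_u\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_u\right), \quad
\Gamma_f = \sigma\left(W_f\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_f\right), \quad
\Gamma_o = \sigma\left(W_o\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_o\right) \\
c^{\langle t \rangle} = \Gamma_u \odot \tilde{c}^{\langle t \rangle} + \Gamma_f \odot c^{\langle t-1 \rangle}, \qquad
a^{\langle t \rangle} = \Gamma_o \odot \tanh\left(c^{\langle t \rangle}\right)
```

Note the structural difference: the GRU couples "keep" and "write" through a single gate Γu (weights (1 − Γu) and Γu), while the LSTM uses independent forget and update gates plus a separate output gate.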
20. Bidirectional RNN (BRNN)
• As we have learned, RNNs are used to model an event sequence in
one direction
• In other words, the RNN unit at time step t has the information from
previous time steps t-1, t-2, …, 0
• However, in some use cases like Natural Language Understanding
(NLU) we have to process the information of a sentence not only in one
direction but in both!
• For example, filling the missing word in “Colombo is the _______ of Sri Lanka”
needs reading the word sequence in both directions, as reading only up to
“Colombo is the” does not give enough information to fill the missing word
21. Bidirectional RNN (BRNN)
BRNNs consist of two sequences of RNN time-step units, ordered in opposite
directions, that are trained together
Source: https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66
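The two-direction idea can be sketched with the simple NumPy RNN cell from earlier: run one pass forward, one pass over the reversed sequence, and concatenate the two hidden states at each time step. All sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, T = 3, 4, 5

def make_rnn():
    """Random parameters for one direction (illustrative sizes)."""
    return (rng.standard_normal((hidden_size, input_size)) * 0.1,
            rng.standard_normal((hidden_size, hidden_size)) * 0.1)

def run_rnn(xs, params):
    """Run a plain RNN over a sequence, returning all hidden states."""
    W_xh, W_hh = params
    h = np.zeros(hidden_size)
    hs = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        hs.append(h)
    return hs

xs = rng.standard_normal((T, input_size))

# Forward pass reads x0..x4; backward pass reads x4..x0, then is re-reversed
fwd = run_rnn(xs, make_rnn())
bwd = run_rnn(xs[::-1], make_rnn())[::-1]

# Each time step's representation now sees both past and future context
h_bi = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(h_bi), h_bi[0].shape)  # 5 (8,)
```

Concatenation is why a BRNN's per-step output is twice the hidden size of a single direction.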
22. Deep RNNs
• RNNs by nature act as Deep NNs, as their sequential learning across
time steps resembles the layers of a Deep NN
• However, even with this expensive-to-learn architecture, the RNNs we
have seen so far have only a shallow stack of neuron layers from the
information input (x vectors) to the information output (y vectors)
• When we need to model more complex functions with RNNs we have
to stack several layers of RNNs, which are known as Deep RNNs
• Generally, we do not go too many layers deep due to the increased
memory and processing requirements of Deep RNNs
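Stacking can be sketched by feeding the full sequence of hidden states of one layer as the input sequence of the next. Two layers and the small illustrative sizes below are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, T, num_layers = 3, 4, 5, 2

# One (W_xh, W_hh) pair per layer; layer 0 reads the input x,
# deeper layers read the hidden states of the layer below
layers = []
for l in range(num_layers):
    in_size = input_size if l == 0 else hidden_size
    layers.append((rng.standard_normal((hidden_size, in_size)) * 0.1,
                   rng.standard_normal((hidden_size, hidden_size)) * 0.1))

xs = rng.standard_normal((T, input_size))
seq = xs
for W_xh, W_hh in layers:
    h = np.zeros(hidden_size)
    out = []
    for x in seq:
        h = np.tanh(W_xh @ x + W_hh @ h)
        out.append(h)
    seq = np.stack(out)   # this layer's hidden states feed the next layer

print(seq.shape)  # (5, 4): hidden states of the top layer
```

Each extra layer multiplies the work per sequence, which is the memory/processing cost the last bullet warns about.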
24. Attention Models
• Say your sequence model application is related to Natural Language
Processing (NLP), where words are used as the input vectors
• In natural language only some of the words in a sentence are
important for getting its meaning or filling a missing word
• The sequence models we have discussed up to now give the same weight
to all the time steps when making predictions, which is not the
real requirement
• Deep Learning models that are capable of giving focused attention
to only some words of the word sequence while making predictions are
known as Attention Models
25. Attention Models
• Instead of directly getting the y output from the
BRNN units, the outputs from all time steps are used as
information to find the most relevant words for a
different word sequence, denoted by St, which is
unidirectional
• This process of finding the most relevant words is
known as the Attention Mechanism
• The Softmax function is used to train the weights of the
attention mechanism so that only a single word is
given almost all the attention
• The downside of Attention Models is that their
processing complexity is quadratic in the
number of time steps
Source: https://machinelearningmastery.com/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks/
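The core of the mechanism described above is a softmax over per-step relevance scores, followed by a weighted sum of the encoder hidden states. This sketch fakes the scores with random numbers; in a real model they come from a small learned network over the previous decoder state and each encoder state, and all sizes here are arbitrary assumptions.

```python
import numpy as np

def softmax(e):
    """Softmax over a score vector (shifted for numerical stability)."""
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

rng = np.random.default_rng(0)
T, hidden_size = 4, 6

# Hidden states from (e.g.) a BRNN encoder, one per input time step
encoder_states = rng.standard_normal((T, hidden_size))

# Illustrative relevance scores for one decoder step (random stand-ins
# for the learned scoring network's outputs)
scores = rng.standard_normal(T)

alphas = softmax(scores)            # attention weights, non-negative, sum to 1
context = alphas @ encoder_states   # weighted sum of encoder states

print(alphas.round(2), context.shape)
```

Computing T weights for each of T output steps is what makes the overall cost quadratic in the number of time steps.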