In this presentation I go over the theory and practice of applying Deep Learning to NLP problems, more specifically, building models for sentiment analysis.
All the code used in the demo can be found here:
https://github.com/ekholabs/DLinK
https://github.com/ekholabs/automated_ml
The presentation is available on YouTube: https://www.youtube.com/watch?v=eZavheF5TBE
I start at 1:06:02.
2. MACHINE LEARNING ENGINEER
WILDER RODRIGUES
• Coursera Mentor
• City.AI Ambassador
• IBM Watson AI XPRIZE contestant
• Kaggler
• Guest attendee at the AI for Good Global Summit at the UN
• X-Men geek
• Family man and father of 5 (3 kids and 2 cats)
@wilderrodrigues
https://medium.com/@wilder.rodrigues/
3. WHAT'S IN IT FOR YOU?
AGENDA
• The Basics
• Vector Representation of Words
• The Shallow
• [Deep] Neural Networks for NLP
• The Deep
• Convolutional Networks for NLP
• The Recurrent
• Long Short-Term Memory for NLP
• Where do we go from here?
• Automation of AWS GPUs with Terraform
6. HOW DOES IT WORK?
WORD2VEC
• Cosine distance between words in the vector space (see the gensim sketch after this slide):
• X = vector("biggest") − vector("big") + vector("small")
• The nearest vector to X is vector("smallest")
• Algorithms:
• Skip-Gram
• It predicts the context words from the target word.
• CBOW
• It predicts the target word from the bag of
all context words.
(Figure: Cosine Distance vs. Euclidean Distance)
The CBOW architecture predicts the current word based on the context,
and the Skip-gram predicts surrounding words given the current word.
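A minimal sketch (not part of the talk's code) of the analogy and the CBOW/Skip-gram switch, using gensim's Word2Vec. The toy corpus and hyperparameters are illustrative only; a corpus this small will not reliably reproduce the analogy, and the parameter names vector_size/epochs are those of gensim 4.x (older releases use size/iter).

# Word2Vec analogy sketch with gensim (illustrative toy corpus).
from gensim.models import Word2Vec

sentences = [
    ["big", "bigger", "biggest", "house"],
    ["small", "smaller", "smallest", "house"],
    ["the", "clouds", "are", "in", "the", "sky"],
]

# sg=0 -> CBOW (predict the target word from its context),
# sg=1 -> Skip-gram (predict the context words from the target word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# vector("biggest") - vector("big") + vector("small") should land near "smallest".
print(model.wv.most_similar(positive=["biggest", "small"], negative=["big"], topn=1))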
13. HOW DO THEY WORK WITH TEXT?
CNNS
• Each row of the input matrix corresponds
to a word/token, i.e. each row is a
low-dimensional vector (embedding) that
represents that word/token.
• The width of the filters is usually the
same as the width of the input matrix,
i.e. the embedding dimension.
• The height may vary, but it's typically
between 2 and 5. So a 2x5 filter (height 2,
embedding dimension 5) covers 2 words
per sliding window, as in the Keras sketch below.
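A minimal Keras sketch (an assumption on my side, not necessarily the demo notebook's code) of a 1-D convolutional sentiment classifier. vocab_size, max_length and embedding_dim are illustrative; the point is that the embedding dimension plays the role of the filter width, so Conv1D only needs kernel_size, the number of words covered per sliding window.

# 1-D CNN for binary sentiment classification (illustrative hyperparameters).
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout

vocab_size = 5000     # illustrative vocabulary size
max_length = 400      # illustrative review length after padding
embedding_dim = 64    # "width" of the input matrix and of every filter

model = Sequential([
    Input(shape=(max_length,)),
    Embedding(vocab_size, embedding_dim),
    Conv1D(filters=256, kernel_size=3, activation="relu"),  # each filter spans 3 words
    GlobalMaxPooling1D(),
    Dense(256, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),  # binary sentiment: positive vs. negative
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()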
16. THE PROBLEM OF LONG-TERM DEPENDENCIES
RNNS
• Small vs. large gap between the relevant
information and the point where it is
needed for the prediction:
• "the clouds are in the sky." (small gap);
• "I grew up in France… I speak
fluent French." (large gap).
17. HOW DO THEY WORK?
LSTMS
• LSTMs' gates (see the numpy sketch after this list):
• Forget
• Decides which parts of the previous cell state are kept
and which are discarded.
• Input
• Decides which values to update and feeds a tanh
that outputs the candidate state.
• The new cell state is the previous state (scaled by the
forget gate) plus the candidate state (scaled by the input gate).
• Output
• Feeds a sigmoid function to decide which parts of
the state will be output.
• Feeds the state through a tanh function and multiplies
its output with the sigmoid result.
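A minimal numpy sketch (not the talk's code) of a single LSTM step following the gate description above; the parameter layout (one weight matrix per gate) and the tiny dimensions are illustrative.

# One LSTM time step: forget (f), input (i), candidate (g) and output (o) gates.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget: what to keep of the old state
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input: which values to update
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate state
    c_t = f * c_prev + i * g                                # update: gated old state + gated candidate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # output: which parts of the state to expose
    h_t = o * np.tanh(c_t)                                  # new hidden state
    return h_t, c_t

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.standard_normal((n_hid, n_in)) for k in "figo"}
U = {k: rng.standard_normal((n_hid, n_hid)) for k in "figo"}
b = {k: np.zeros(n_hid) for k in "figo"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, U, b)
print(h, c)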
22. WHERE DID I GET THIS STUFF FROM?
REFERENCES
• Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Google, 2013.
• A Neural Probabilistic Language Model. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. Université de Montréal, Montréal, Québec, Canada, 2003.
• Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. University of Toronto, Toronto, Ontario, Canada, 2014.
• https://medium.com/cityai/deep-learning-for-natural-language-processing-part-i-8369895ffb98
• https://medium.com/cityai/deep-learning-for-natural-language-processing-part-ii-8b2b99b3fa1e
• https://medium.com/cityai/deep-learning-for-natural-language-processing-part-iii-96cfc6acfcc3
• http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
• https://github.com/ekholabs/DLinK
• https://github.com/ekholabs/automated_ml