Softmax Approximations
Sebastian Ruder
Softmax
Softmax-based Approaches
Hierarchical Softmax
Differentiated Softmax
CNN-Softmax
Sampling-based Approaches
Margin-based Hinge Loss
Noise Contrastive Estimation
Negative Sampling
Bibliography
Softmax Approximations for Learning Word
Embeddings and Language Modeling
Sebastian Ruder
@seb_ruder
1st NLP Meet-up
03.08.16
Agenda
1 Softmax
2 Softmax-based Approaches
Hierarchical Softmax
Differentiated Softmax
CNN-Softmax
3 Sampling-based Approaches
Margin-based Hinge Loss
Noise Contrastive Estimation
Negative Sampling
Language modeling objective
Goal: Probabilistic model of language
Maximize probability of a word $w_t$ given its $n - 1$ previous
words, i.e. $p(w_t \mid w_{t-1}, \dots, w_{t-n+1})$
N-gram models:
$$p(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{\mathrm{count}(w_{t-n+1}, \dots, w_{t-1}, w_t)}{\mathrm{count}(w_{t-n+1}, \dots, w_{t-1})}$$
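The count ratio can be sketched in a few lines of Python. This is a minimal illustration on a made-up toy corpus; the function name `ngram_probability` is hypothetical, and no smoothing or back-off is applied.

```python
from collections import Counter

def ngram_probability(tokens, context, word):
    """Estimate p(word | context) as count(context + word) / count(context)."""
    n = len(context) + 1
    context = tuple(context)
    # Count every n-gram, and every (n-1)-gram context that is followed by a word.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    if contexts[context] == 0:
        return 0.0  # unseen context; a real model would smooth or back off
    return ngrams[context + (word,)] / contexts[context]

tokens = "the cat sat on the mat the cat ran".split()  # toy corpus
p_sat = ngram_probability(tokens, ["the", "cat"], "sat")
p_ran = ngram_probability(tokens, ["the", "cat"], "ran")
```

Here "the cat" occurs twice as a context, once followed by "sat" and once by "ran", so both estimates are 0.5.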
Softmax objective for language modeling
Figure: Predicting the next word with the softmax
Softmax objective for language modeling
Neural networks with softmax:
$$p(w \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{\exp(h^\top v_w)}{\sum_{w_i \in V} \exp(h^\top v_{w_i})}$$
where
$h$ is the "hidden" representation of the input, i.e. the previous words, of dimensionality $d$
$v_{w_i}$ is the "output" word embedding of word $i$, which is distinct from its (input) word embedding
$V$ is the vocabulary
The inner product $h^\top v_w$ computes the model's score ("unnormalized" probability) for word $w$ given the input
Output word embeddings are stored in a $d \times |V|$ matrix
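A minimal NumPy sketch of this computation, with random values standing in for a trained model. The sizes and the names `softmax_probs` and `V_out` are illustrative assumptions. Note that the sum runs over every word in the vocabulary, which is the cost the later approaches try to avoid.

```python
import numpy as np

def softmax_probs(h, V_out):
    """Return p(w | context) for every word, given hidden state h of shape (d,)
    and output embeddings V_out of shape (d, |V|), one column per word."""
    scores = h @ V_out               # inner product of h with each output embedding
    scores = scores - scores.max()   # shift for numerical stability; result unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # normalize over all |V| words

rng = np.random.default_rng(0)
d, vocab_size = 8, 1000              # illustrative sizes
h = rng.normal(size=d)
V_out = rng.normal(size=(d, vocab_size))
probs = softmax_probs(h, V_out)
```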
Neural language model
Figure: Neural language model [Bengio et al., 2003]
Softmax use cases
Maximum entropy models use the same form of probability distribution:
$$P_h(y \mid x) = \frac{\exp(h \cdot f(x, y))}{\sum_{y' \in Y} \exp(h \cdot f(x, y'))}$$
where
$h$ is a weight vector
$f(x, y)$ is a feature vector
Pervasive use in NNs:
Go-to multi-class classification objective
"Soft" selection, e.g. for attention, memory retrieval, etc.
The denominator is called the partition function:
$$Z = \sum_{w_i \in V} \exp(h^\top v_{w_i})$$
Softmax-based vs. sampling-based
Softmax-based approaches keep softmax layer intact,
make it more efficient.
Sampling-based approaches optimize a different loss
function that approximates the softmax.
Hierarchical Softmax
Softmax as a binary tree: in a balanced tree, evaluate only about $\log_2 |V|$
nodes on the path to a word instead of all $|V|$ nodes
Figure: Hierarchical softmax [Morin and Bengio, 2005]
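A minimal sketch of the idea, assuming a complete binary tree stored heap-style and one parameter vector per inner node. The layout and the names `hs_word_prob` and `node_vecs` are illustrative, not taken from the cited paper. A word's probability is the product of binary left/right decisions along its path.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_word_prob(word, h, node_vecs):
    """p(word | h) in a complete binary tree stored heap-style:
    inner nodes are 1..|V|-1, the leaf of word i is node |V| + i."""
    vocab_size = node_vecs.shape[0]       # row 0 is unused padding
    node, prob = vocab_size + word, 1.0
    while node > 1:
        parent = node // 2
        score = h @ node_vecs[parent]
        # Left child (even index): sigmoid(score); right child: 1 - sigmoid(score).
        prob *= sigmoid(score) if node % 2 == 0 else sigmoid(-score)
        node = parent
    return prob

rng = np.random.default_rng(0)
d, vocab_size = 8, 16                     # |V| is a power of two for simplicity
h = rng.normal(size=d)
node_vecs = rng.normal(size=(vocab_size, d))
probs = np.array([hs_word_prob(w, h, node_vecs) for w in range(vocab_size)])
```

Each word touches only 4 inner nodes here ($\log_2 16$), yet the probabilities over all 16 words still sum to one because the two branches at every node are complementary.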
Hierarchical Softmax
Structure is important; fastest (and most commonly used)
variant: Huffman tree (short paths for frequent words)
Figure: Hierarchical softmax [Mnih and Hinton, 2008]
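To see why a Huffman tree gives short paths to frequent words, the sketch below builds one from made-up word counts with Python's `heapq` and reports each word's path length. The function name and the counts are hypothetical.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {word: path length} for a Huffman tree built from word counts."""
    tie = count()  # tie-breaker so the heap never has to compare the dicts
    heap = [(f, next(tie), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees puts every word in them one level deeper.
        merged = {w: depth + 1 for w, depth in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"the": 50, "of": 30, "cat": 10, "sat": 6, "zygote": 4}  # made-up counts
lengths = huffman_code_lengths(freqs)
```

With these counts, "the" ends up one step from the root while "zygote" is four steps away, so the most frequent predictions need the fewest node evaluations.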
Differentiated Softmax
Idea: We have more knowledge (co-occurrences, etc.)
about frequent words, less about rare words
→ words that occur more often allow us to fit more
parameters; extremely rare words allow us to fit only a few
→ use different embedding sizes to represent each output word
Larger embeddings (more parameters) for frequent words,
smaller embeddings for rare words
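One way to realize this is a block-structured output layer in which each frequency band of the vocabulary reads only its own slice of the hidden state. The sketch below assumes that layout; the two-band partition, the sizes, and the name `d_softmax_probs` are made up for illustration.

```python
import numpy as np

def d_softmax_probs(h, blocks):
    """blocks: list of (dim, W) pairs; W has shape (dim, words_in_block).
    Each block scores its words using only its own slice of h."""
    logits, start = [], 0
    for dim, W in blocks:
        logits.append(h[start:start + dim] @ W)
        start += dim
    logits = np.concatenate(logits)
    logits -= logits.max()                 # numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

rng = np.random.default_rng(0)
# Hypothetical partition: 100 frequent words get 64 dims, 900 rare words get 16.
blocks = [(64, rng.normal(size=(64, 100))), (16, rng.normal(size=(16, 900)))]
h = rng.normal(size=80)                    # 64 + 16 hidden units
probs = d_softmax_probs(h, blocks)
n_params = sum(W.size for _, W in blocks)  # 64*100 + 16*900 = 20800
full_params = 80 * 1000                    # dense softmax with d = 80
```

In this toy setting the output layer holds 20,800 weights instead of 80,000, and the matrix products shrink accordingly.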
Differentiated Softmax
Figure: Differentiated softmax [Chen et al., 2015]
CNN-Softmax
Idea: Instead of learning all output word embeddings
separately, learn function to produce them
Figure: CNN-Softmax [Jozefowicz et al., 2016]
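A rough sketch of "learning a function to produce output embeddings": a character-level 1-D convolution followed by max-pooling maps any word string to a $d$-dimensional vector. The filter width, the tanh nonlinearity, and all names and sizes here are assumptions for illustration and not the configuration used in the cited paper.

```python
import numpy as np

def char_cnn_embedding(word, char_emb, filters):
    """Map a word to an output embedding: embed its characters, apply 1-D
    convolution filters of width k, then max-pool over positions."""
    n_filters, k, c_dim = filters.shape
    padded = word.ljust(k)                 # pad very short words with spaces
    chars = np.stack([char_emb[ord(c) % len(char_emb)] for c in padded])
    # One flattened window of k consecutive character vectors per position.
    windows = np.stack([chars[i:i + k].ravel() for i in range(len(chars) - k + 1)])
    feature_maps = np.tanh(windows @ filters.reshape(n_filters, k * c_dim).T)
    return feature_maps.max(axis=0)        # shape (n_filters,) = (d,)

rng = np.random.default_rng(0)
d, c_dim, k = 8, 4, 3                      # illustrative sizes
char_emb = rng.normal(size=(256, c_dim))   # one vector per byte value
filters = rng.normal(size=(d, k, c_dim))
vocab = ["cat", "cats", "zygote"]
V_out = np.stack([char_cnn_embedding(w, char_emb, filters) for w in vocab], axis=1)
unseen = char_cnn_embedding("unseenword", char_emb, filters)
```

The number of parameters depends only on the character table and the filters, not on the vocabulary size, and the same function yields an embedding for a word that was never in the vocabulary.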
Sampling-based approaches
Sampling-based approaches optimize a different loss
function that approximates the softmax.
Margin-based Hinge Loss
Idea: Why do multi-class classification at all? Only one
correct word, many incorrect ones. [Collobert et al., 2011]
Train model to produce higher scores for correct word
windows than for incorrect ones, i.e. minimize
$$\sum_{x \in X} \sum_{w \in V} \max\{0, 1 - f(x) + f(x^{(w)})\}$$
where
$x$ is a correct window
$x^{(w)}$ is a "corrupted" window (target word replaced by a random word $w$)
$f(x)$ is the score output by the model
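For a single correct window, the hinge term can be sketched as follows. The scores are made-up numbers standing in for model outputs, and `hinge_loss` is a hypothetical helper name.

```python
import numpy as np

def hinge_loss(score_correct, scores_corrupted, margin=1.0):
    """Sum over corrupted windows of max(0, margin - f(x) + f(x^(w)))."""
    corrupted = np.asarray(scores_corrupted)
    return np.maximum(0.0, margin - score_correct + corrupted).sum()

# f(x) = 2.0 for the correct window, three corrupted windows with other scores.
loss = hinge_loss(2.0, [0.5, 1.5, 3.0])
```

Only corrupted windows that score within the margin of the correct one contribute: the terms are 0, 0.5 and 2.0, giving a loss of 2.5. No normalization over the vocabulary is needed.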
Noise Contrastive Estimation
Idea: Train model to differentiate target word from noise
Figure: Noise Contrastive Estimation (NCE) [Mnih and Teh, 2012]
Noise Contrastive Estimation
Language modeling reduces to binary classification
Draw k noise samples from a noise distribution (e.g.
unigram) for every word; correct words given their context
are true (y = 1), noise samples are false (y = 0)
Minimize cross-entropy with logistic regression loss
Approximates softmax as number of noise samples k
increases
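The binary classification view can be sketched as below. It assumes the model's normalizer is treated as 1, so that the probability of a word being a true sample is $\exp(s) / (\exp(s) + k\,Q(w))$ for score $s$ and noise probability $Q(w)$. The noise distribution, the scores, and the name `nce_loss` are made up for illustration.

```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)    # stable log(1 / (1 + exp(-x)))

def nce_loss(score_true, scores_noise, q_true, q_noise, k):
    """Binary cross-entropy for one target word (y = 1) and k noise words (y = 0).
    score_* are unnormalized model scores; q_* are noise probabilities Q(w)."""
    # P(y=1 | w) = exp(s) / (exp(s) + k Q(w)) = sigmoid(s - log(k Q(w)))
    logit_true = score_true - np.log(k * q_true)
    logit_noise = np.asarray(scores_noise) - np.log(k * np.asarray(q_noise))
    return -(log_sigmoid(logit_true) + log_sigmoid(-logit_noise).sum())

rng = np.random.default_rng(0)
q = np.array([0.4, 0.3, 0.2, 0.1])         # made-up unigram noise distribution
scores = np.array([1.0, 0.5, -0.2, 2.0])   # made-up model scores for 4 words
k, target = 3, 3
noise_ids = rng.choice(len(q), size=k, p=q)  # draw k noise words
loss = nce_loss(scores[target], scores[noise_ids], q[target], q[noise_ids], k)
```

Only the target word and the k sampled words are scored, so the cost per training example no longer grows with the vocabulary size.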
Negative Sampling
Simplification of NCE [Mikolov et al., 2013]
No longer approximates softmax as goal is to learn
high-quality word embeddings (rather than language
modeling)
Makes NCE more efficient by making most expensive term
constant
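A sketch of the simplified objective, assuming the term $k\,Q(w)$ from NCE is set to 1 so that the probability of a true sample reduces to a plain sigmoid of the score. The scores are made-up numbers and the function name is hypothetical.

```python
import numpy as np

def negative_sampling_loss(score_true, scores_noise):
    """-log sigmoid(s_true) - sum_j log sigmoid(-s_noise_j)."""
    def log_sigmoid(x):
        return -np.logaddexp(0.0, -x)      # stable log(1 / (1 + exp(-x)))
    noise = np.asarray(scores_noise)
    return -(log_sigmoid(score_true) + log_sigmoid(-noise).sum())

# With all scores at 0 every sigmoid is 0.5, so the loss is 3 * log(2).
loss = negative_sampling_loss(0.0, [0.0, 0.0])
```

The noise probabilities no longer appear in the loss, which is what makes it cheaper than NCE and also why it no longer approximates the softmax.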
Thank you for your attention!
The content of most of these slides is also available as blog
posts at sebastianruder.com.
For more information: sebastian@aylien.com
Bibliography I
[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P.,
and Janvin, C. (2003).
A Neural Probabilistic Language Model.
The Journal of Machine Learning Research, 3:1137–1155.
[Chen et al., 2015] Chen, W., Grangier, D., and Auli, M.
(2015).
Strategies for Training Large Vocabulary Neural Language
Models.
[Collobert et al., 2011] Collobert, R., Weston, J., Bottou, L.,
Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011).
Natural Language Processing (almost) from Scratch.
Journal of Machine Learning Research, 12(Aug):2493–2537.
Bibliography II
[Jozefowicz et al., 2016] Jozefowicz, R., Vinyals, O., Schuster,
M., Shazeer, N., and Wu, Y. (2016).
Exploring the Limits of Language Modeling.
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and
Dean, J. (2013).
Distributed Representations of Words and Phrases and their
Compositionality.
NIPS, pages 1–9.
[Mnih and Hinton, 2008] Mnih, A. and Hinton, G. E. (2008).
A Scalable Hierarchical Distributed Language Model.
Advances in Neural Information Processing Systems, pages
1–8.
Bibliography III
[Mnih and Teh, 2012] Mnih, A. and Teh, Y. W. (2012).
A Fast and Simple Algorithm for Training Neural
Probabilistic Language Models.
Proceedings of the 29th International Conference on
Machine Learning (ICML’12), pages 1751–1758.
[Morin and Bengio, 2005] Morin, F. and Bengio, Y. (2005).
Hierarchical Probabilistic Neural Network Language Model.
Aistats, 5.
