7. w
ord2vec
“The fox jumped over the lazy dog”
Maximize the likelihood of seeing the words given the word over.
P(the|over)
P(fox|over)
P(jumped|over)
P(the|over)
P(lazy|over)
P(dog|over)
…instead of maximizing the likelihood of co-occurrence counts.
37. w
ord2vec
The SGNS Model
PM
I
ci·wj = PMI(Mij) - log k
…is extremely similar to matrix factorization!
Levy & Goldberg 2014
L = σ(c·w) + σ(-c·wneg)
38. w
ord2vec
The SGNS Model
PM
I
Levy & Goldberg 2014
‘traditional’ NLP
L = σ(c·w) + σ(-c·wneg)
ci·wj = PMI(Mij) - log k
…is extremely similar to matrix factorization!
39. w
ord2vec
The SGNS Model
L = σ(c·w) + Σσ(-c·w)
PM
I
ci·wj = log
Levy & Goldberg 2014
#(ci,wj)/n
k #(wj)/n #(ci)/n
‘traditional’ NLP
40. w
ord2vec
The SGNS Model
L = σ(c·w) + Σσ(-c·w)
PM
I
ci·wj = log
Levy & Goldberg 2014
popularity of c,w
k (popularity of c) (popularity of w)
‘traditional’ NLP
81. -1.9 0.85 -0.6 -0.3 -0.5
Lufthansa is a German airline and when
fox
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Lufthansa is a German airline and when
German
word2vec predicts locally:
one word predicts a nearby word
82. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
German
Document vector
predicts a word from
a global context
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
83. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
We’re missing
mixtures & sparsity!
German
84. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
We’re missing
mixtures & sparsity!
85. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
Now it’s a mixture.
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
86. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
#topics
Document weight
87. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
topic 1 = “religion”
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
#topics
Document weight
88. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
Milosevic
absentee
Indonesia
Lebanese
Isrealis
Karadzic
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
#topics
Document weight
89. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
topic 2 = “politics”
Milosevic
absentee
Indonesia
Lebanese
Isrealis
Karadzic
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
#topics
Document weight
90. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
91. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
92. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
93. 0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
Sparsity!
0.34 -0.1 0.17
41% 26% 34%
-1.4 -0.5 -1.4
-1.9-1.7 0.75
0.96-0.7 -1.9
-0.2-1.1 0.6
-0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5
-2.6 0.45 -1.3 -0.6 -0.8
Lufthansa is a German airline and when
#topics
#topics
fox
#hiddenunits
#topics
#hidden units#hidden units
#hidden units
Skip grams from
sentences
Word vector
Negative sampling loss
Topic matrix
Document proportion
Document weight
Document vector
Context vector
x
+
Lufthansa is a German airline and when
34% 32% 34%
t=0
41% 26% 34%
t=10
99% 1% 0%
t=∞
tim
e
96. @chrisemoody
Example Hacker News comments
Topics:
http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/
hacker_news/lda2vec/lda2vec.ipynb
Word vectors:
https://github.com/cemoody/
lda2vec/blob/master/examples/
hacker_news/lda2vec/
word_vectors.ipynb
97. @chrisemoody
lda2vec.com
human-interpretable doc topics, use LDA.
machine-useable word-level features, use word2vec.
if you like to experiment a lot, and have
topics over user / doc / region / etc. features, use lda2vec.
(and you have a GPU)
If you want…
100. Credit
Large swathes of this talk are from
previous presentations by:
• Tomas Mikolov
• David Blei
• Christopher Olah
• Radim Rehurek
• Omer Levy & Yoav Goldberg
• Richard Socher
• Xin Rong
• Tim Hopper
101. “PS! Thank you for such an awesome idea”
@chrisemoody
doc_id=1846
Can we model topics to sentences?
lda2lstm
102. Can we model topics to sentences?
lda2lstm
“PS! Thank you for such an awesome idea”doc_id=1846
@chrisemoody
Can we model topics to images?
lda2ae
TJ Torres
103. and now for something completely crazy
4
Fun
Stuff
105. deepwalk
Perozzi
etal2014
learn word vectors from
sentences
“The fox jumped over the lazy dog”
vOUT vOUT vOUT vOUT vOUTvOUT
‘words’ are graph vertices
‘sentences’ are random walks on the
graph
word2vec
120. Crazy
Approaches
Paragraph Vectors
(Just extend the context window)
Content dependency
(Change the window grammatically)
Social word2vec (deepwalk)
(Sentence is a walk on the graph)
Spotify
(Sentence is a playlist of song_ids)
Stitch Fix
(Sentence is a shipment of five items)
121.
122. CBOW
“The fox jumped over the lazy dog”
Guess the word
given the context
~20x faster.
(this is the alternative.)
vOUT
vIN vINvIN vINvIN vIN
SkipGram
“The fox jumped over the lazy dog”
vOUT vOUT
vIN
vOUT vOUT vOUTvOUT
Guess the context
given the word
Better at syntax.
(this is the one we went over)
133. All of the following ideas will change what
‘words’ and ‘context’ represent.
134. paragraph
vector
What about summarizing documents?
On the day he took office, President Obama reached out to America’s enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
135. On the day he took office, President Obama reached out to America’s enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph
vector
Normal skipgram extends C words before, and C words after.
IN
OUT OUT
136. On the day he took office, President Obama reached out to America’s enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph
vector
A document vector simply extends the context to the whole document.
IN
OUT OUT
OUT OUTdoc_1347