(Data Day 2016)
Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems need to identify a medical professional when she writes that she 'used to wear scrubs to work', and distill 'taking a trip' into a Fix for vacation clothing. Applied appropriately, word vectors are dramatically more meaningful and more flexible than current techniques and let computers peer into text in a fundamentally new way. I'll try to convince you that word vectors give us a simple and flexible platform for understanding text while speaking about word2vec and LDA, and introducing our hybrid algorithm, lda2vec.
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec
1. A word is worth a thousand vectors
(word2vec, lda, and introducing lda2vec)
Christopher Moody
@ Stitch Fix
Welcome; thanks for coming, thanks for having me, and thanks to the organizers.
NLP can be a messy affair because you have to teach a computer about the irregularities and ambiguities of the English language, and about the hierarchical, sparse nature of words in grammar.
3rd trimester, pregnant
'wears scrubs' → medicine
'taking a trip' → a Fix for vacation clothing
the promise of word vectors is to sweep away a lot of these issues
2. About
@chrisemoody
Caltech Physics
PhD in astrostats, supercomputing
sklearn t-SNE contributor
Data Labs at Stitch Fix
github.com/cemoody
Gaussian Processes t-SNE
chainer
deep learning
Tensor Decomposition
3. Credit
Large swathes of this talk are from
previous presentations by:
• Tomas Mikolov
• David Blei
• Christopher Olah
• Radim Rehurek
• Omer Levy & Yoav Goldberg
• Richard Socher
• Xin Rong
• Tim Hopper
5. 1. king - man + woman = queen
2. Huge splash in NLP world
3. Learns from raw text
4. Pretty simple algorithm
5. Comes pretrained
word2vec
1. Learns what words mean: it can solve analogies cleanly.
1. Not treating words as blocks, but instead modeling relationships
2. Distributed representations form the basis of more complicated deep learning systems
3. Shallow, not deep learning!
1. Power comes from this simplicity: super fast, lots of data
4. Get a lot of mileage out of this
1. Don't need to model the Wikipedia corpus before starting your own
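A minimal sketch of "comes pretrained" and the analogy arithmetic, assuming gensim and its downloader are available (the model name below is one of the pretrained vector sets gensim can fetch; it is a large download):

```python
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pretrained word vectors

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```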
6. word2vec
1. Set up an objective function
2. Randomly initialize vectors
3. Do gradient descent
7. word2vec
word2vec: learn word vector vin
from its surrounding context
vin
1. Let's talk about training first
2. In SVD and n-gram models we built co-occurrence and transition probability matrices
3. Here we will learn the embedded representation directly, with no intermediates, updating it with every example
8. word2vec
"The fox jumped over the lazy dog"
Maximize the likelihood of seeing the words given the word over.
P(the|over)
P(fox|over)
P(jumped|over)
P(the|over)
P(lazy|over)
P(dog|over)
…instead of maximizing the likelihood of co-occurrence counts.
1. Context = the words surrounding the training word
2. Naively assume the P(*|over) terms are conditionally independent given the training word
3. Still a pretty simple assumption!
Conditioning on just *over*; no other secret parameters or anything
10. word2vec
P(vfox|vover)
Should depend on the word vectors.
P(fox|over)
Trying to learn the word vectors, so let's start with those
(we'll randomly initialize them to begin with)
11. word2vec
Twist: we have two vectors for every word.
Should depend on whether it's the input or the output.
Also a context window around every input word.
"The fox jumped over the lazy dog"
P(vOUT|vIN)
12. word2vec
"The fox jumped over the lazy dog"
vIN
P(vOUT|vIN)
Twist: we have two vectors for every word.
Should depend on whether it's the input or the output.
Also a context window around every input word.
IN = training word
13. word2vec
"The fox jumped over the lazy dog"
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it's the input or the output.
Also a context window around every input word.
14.–25. word2vec
"The fox jumped over the lazy dog"
vOUT
P(vOUT|vIN)
vIN
(The same frame repeats as the context window slides across the sentence, pairing vIN with each surrounding vOUT in turn.)
Twist: we have two vectors for every word.
Should depend on whether it's the input or the output.
Also a context window around every input word.
…So that at a high level is what we want word2vec to do.
two for loops
That's it! A bit disingenuous to make this a giant network
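A minimal sketch of those two for loops: slide over each word and pair it with every other word in its context window (toy sentence, arbitrary window size):

```python
sentence = "the fox jumped over the lazy dog".split()
window = 2   # context words on each side

pairs = []
for i, word_in in enumerate(sentence):                 # loop 1: each input word
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):                            # loop 2: each word in its window
        if j != i:
            pairs.append((word_in, sentence[j]))       # an (input, output) training pair

print(pairs[:6])
```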
26. objective
Measure loss between
vIN and vOUT?
vin . vout
How should we define P(vOUT|vIN)?
Now we've defined the high-level update path for the algorithm.
Need to define this probability exactly in order to define our updates.
Boils down to the difference between in & out: we want to make them as similar as possible, and then the probability will go up.
Use cosine similarity.
27. word2vec
vin . vout ~ 1
objective
vin
vout
Dot product has these properties:
Similar vectors have a dot product near 1
30. word2vec
vin . vout ∈ [-1,1]
objective
But the inner product ranges from -1 to 1 (when normalized)
…and we'd like a probability
31. word2vec
But we'd like to measure a probability.
vin . vout ∈ [-1,1]
objective
But the inner product ranges from -1 to 1 (when normalized)
…and we'd like a probability
32. word2vec
But we'd like to measure a probability.
softmax(vin . vout ∈ [-1,1])
objective
∈ [0,1]
Transform again using softmax
33. word2vec
But we'd like to measure a probability.
softmax(vin . vout ∈ [-1,1])
Probability of choosing 1 of N discrete items.
Mapping from vector space to a multinomial over words.
objective
Similar to the logistic function for binary outcomes, but instead for 1 of N outcomes.
So now we're modeling the probability of a word showing up as the combination of the training word vector and the target word vector, and transforming it to a 1-of-N outcome
34. word2vec
But we'd like to measure a probability.
softmax ~ exp(vin . vout) ∈ [0,1]
objective
So here's the actual form of the equation: we normalize by the sum of all of the other possible pairs of word combinations
35. word2vec
But we'd like to measure a probability.
softmax = exp(vin . vout) / Σ_{k ∈ V} exp(vin . vk)
objective
Normalization term over all words
So here's the actual form of the equation: we normalize by the sum of all of the other possible pairs of word combinations
two effects
make vin and vout more similar
make vin and every other word less similar
36. word2vec
But we'd like to measure a probability.
P(vout|vin) = softmax = exp(vin . vout) / Σ_{k ∈ V} exp(vin . vk)
objective
This is the kernel of word2vec. We're just going to apply this operation every time we want to update the vectors.
For every word, we're going to have a context window, and then for every pair of words in that window and the input word, we'll measure this probability.
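A toy numpy sketch of this kernel, with a made-up vocabulary and randomly initialized vectors, just to show the shape of the computation:

```python
# P(v_out | v_in) as a softmax over dot products with every word's output vector.
import numpy as np

vocab = ["the", "fox", "jumped", "over", "lazy", "dog"]
rng = np.random.default_rng(0)
v_in = {w: rng.normal(size=8) for w in vocab}    # input vectors (randomly initialized)
v_out = {w: rng.normal(size=8) for w in vocab}   # output vectors (randomly initialized)

def p_out_given_in(out_word, in_word):
    scores = np.array([v_in[in_word] @ v_out[w] for w in vocab])
    probs = np.exp(scores - scores.max())        # numerically stable softmax
    probs /= probs.sum()
    return probs[vocab.index(out_word)]

print(p_out_given_in("fox", "over"))
```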
37. word2vec
Learn by gradient descent on the softmax probability.
For every example we see, update vin
vin := vin + P(vout|vin)
objective
vout := vout + P(vout|vin)
…I won't go through the derivation of the gradient, but this is the general idea
relatively simple, fast: fast enough to read billions of words in a day
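The update on the slide is shorthand; a self-contained toy sketch of one actual gradient-descent step on -log P(vout|vin), with a full softmax, might look like this (real word2vec implementations replace the sum over the vocabulary with negative sampling or a hierarchical softmax):

```python
# Toy SGD step on -log P(v_out | v_in); vocab and vectors are made up.
import numpy as np

vocab = ["the", "fox", "jumped", "over", "lazy", "dog"]
rng = np.random.default_rng(0)
v_in = {w: rng.normal(size=8) for w in vocab}    # input vectors
v_out = {w: rng.normal(size=8) for w in vocab}   # output vectors
lr = 0.05                                        # learning rate

def sgd_step(in_word, out_word):
    scores = np.array([v_in[in_word] @ v_out[w] for w in vocab])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # gradient of -log softmax w.r.t. v_in: sum_k p_k * v_out[k] - v_out[out_word]
    grad_in = sum(p * v_out[w] for p, w in zip(probs, vocab)) - v_out[out_word]
    for p, w in zip(probs, vocab):
        target = 1.0 if w == out_word else 0.0
        v_out[w] = v_out[w] - lr * (p - target) * v_in[in_word]
    v_in[in_word] = v_in[in_word] - lr * grad_in

sgd_step("over", "fox")   # one (input, output) pair from the context window
```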
41. Showing just 2 of the ~500 dimensions. Effectively we've PCA'd it
52. If we only had locality and not regularity, this wouldn't necessarily be true
55. So we live in a vector space where operations like addition and subtraction are meaningful.
So here are a few examples of this working.
Really get the idea of these vectors as being 'mixes' of other ideas & vectors
57. + 'Pregnant'
I love the stripes and the cut around my neckline was amazing
someone else might write 'grey and black'
subtlety and nuance in that language
We have lots of this interaction, of order a Wikipedia amount, far too much to manually annotate anything
67. Latent style vectors from text
Pairwise gamma correlation
from style ratings
Diversity from ratings Diversity from text
Lots of structure in both, but the diversity is much higher in the text
Maybe obvious: but the way people describe items is fundamentally richer than the style ratings
69. word2vec is local:
one word predicts a nearby word
"I love finding new designer brands for jeans"
as if the world were one very long text string: no end of documents, no end of sentences, etc.
and a window across words
70. "I love finding new designer brands for jeans"
But text is usually organized.
as if the world were one very long text string: no end of documents, no end of sentences, etc.
71. "I love finding new designer brands for jeans"
But text is usually organized.
as if the world were one very long text string: no end of documents, no end of sentences, etc.
72. "I love finding new designer brands for jeans"
In LDA, documents globally predict words.
doc 7681
these are client comments, which are short, and only predict dozens of words
but they could be legal documents or medical documents, 10k words; here the difference between global and local algorithms is much more important
74. typical word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2 ]
typical LDA document vector
[ 0%, 9%, 78%, 11% ]
All real values (word2vec) / All sum to 100% (LDA)
75. 5D word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2 ]
Dense
All real values
Dimensions relative
5D LDA document vector
[ 0%, 9%, 78%, 11% ]
Sparse
All sum to 100%
Dimensions are absolute
LDA is a *mixture*
w2v is a bunch of real numbers, more like an *address*
much easier to say to another human that something is 78% of a topic than that it is +2.2 of something and -1.25 of something else
76. 100D word2vec vector
[ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2 ]
Dense
All real values
Dimensions relative
100D LDA document vector
[ 0%, 0%, 0%, 0%, 0% … 0%, 9%, 78%, 11% ]
Sparse
All sum to 100%
Dimensions are absolute
dense vs. sparse
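A toy illustration of the two kinds of vectors (made-up numbers; numpy's Dirichlet sampler stands in for an LDA topic mixture):

```python
import numpy as np

rng = np.random.default_rng(0)
w2v_vector = rng.normal(size=5)                # dense: unbounded real values
lda_vector = rng.dirichlet(np.ones(5) * 0.2)   # mixture: mostly-zero proportions

print(w2v_vector.round(2))                     # a handful of real numbers, an "address"
print(lda_vector.round(2), lda_vector.sum())   # proportions that sum to 1.0 (100%)
```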
78. can we do both? lda2vec
series of experiments
grain of salt
very new; no good quantitative results, only qualitative (but promising!)
79. The goal:
Use all of this context to learn
interpretable topics.
word2vec: P(vOUT | vIN)
@chrisemoody
Use this at SF.
typical table
w2v will use w-w
80. word2vec
LDA: P(vOUT | vDOC)
The goal:
Use all of this context to learn
interpretable topics.
this document is
80% high fashion
this document is
60% style
@chrisemoody
LDA will use that doc ID column
you can use this to steer the business as a whole
81. word2vec
LDA
The goal:
Use all of this context to learn
interpretable topics.
this zip code is
80% hot climate
this zip code is
60% outdoors wear
@chrisemoody
But doesn't predict word-to-word relationships.
in Texas, maybe I want more lone stars & stirrup icons
in Austin, maybe I want more bats
82. word2vec
LDA
The goal:
Use all of this context to learn
interpretable topics.
this client is
80% sporty
this client is
60% casual wear
@chrisemoody
love to learn client topics
are there 'types' of clients? A question every business asks
so this is the promise of lda2vec
83. lda2vec
word2vec predicts locally:
one word predicts a nearby word
P(vOUT |vIN)
vIN vOUT
"PS! Thank you for such an awesome top"
But doesn't predict word-to-word relationships.
84. lda2vec
LDA predicts a word from a global context
doc_id=1846
P(vOUT |vDOC)
vOUT
vDOC
"PS! Thank you for such an awesome top"
But doesn't predict word-to-word relationships.
86. lda2vec
"PS! Thank you for such an awesome top" doc_id=1846
vIN vOUT
vDOC
can we predict a word both locally and globally?
P(vOUT |vIN+ vDOC)
doc vector captures long-distance dependencies
word vector captures short-distance
87. lda2vec
doc_id=1846
vIN vOUT
vDOC
*very similar to the Paragraph Vectors / doc2vec
can we predict a word both locally and globally?
"PS! Thank you for such an awesome top"
P(vOUT |vIN+ vDOC)
88. lda2vec
This works! But vDOC isn't as interpretable as the LDA topic vectors.
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …
89. lda2vec
This works! But vDOC isn't as interpretable as the LDA topic vectors.
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …
about as interpretable as a hash
90. lda2vec
This works! But vDOC isn't as interpretable as the LDA topic vectors.
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …
91. lda2vec
This works! But vDOC isn't as interpretable as the LDA topic vectors.
We're missing mixtures & sparsity.
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …
92. lda2vec
This works! But vDOC isn't as interpretable as the LDA topic vectors.
Let's make vDOC into a mixture…
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …
93. lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + … (up to k topics)
sum of other word vectors
intuition here is that 'hanoi = vietnam + capital' and 'lufthansa = germany + airlines'
so we think that document vectors should also be some word vector + some word vector
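A toy numpy sketch of that idea: the document vector is just a weighted sum of topic vectors, which then gets added to the input word vector (v_topic1/v_topic2 below are random stand-ins for learned topic vectors):

```python
import numpy as np

dim = 100
rng = np.random.default_rng(0)
v_topic1 = rng.normal(size=dim)    # e.g. a "religion"-like direction
v_topic2 = rng.normal(size=dim)    # e.g. a "politics"-like direction

a, b = 0.10, 0.89                  # how much of the doc is in each topic
v_doc = a * v_topic1 + b * v_topic2

# v_doc is then added to v_in when predicting context words: P(v_out | v_in + v_doc)
```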
94. lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + …
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
twenty newsgroups dataset, free, canonical
95. lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + …
topic 1 = 'religion'
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
96. lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + …
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 1 = 'religion'
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
97. lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + …
topic 1 = 'religion'
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
topic 2 = 'politics'
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
purple a, b coefficients tell you how much it is in that topic
98. lda2vec
Let's make vDOC into a mixture…
vDOC = 10% religion + 89% politics + …
topic 2 = 'politics'
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 1 = 'religion'
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
Doc is now 10% religion, 89% politics
mixture models are powerful for interpretability
99. lda2vec
Let's make vDOC sparse
[ -0.75, -1.25, … ]
vDOC = a vreligion + b vpolitics + …
Now the 1st time I did this…
Hard to interpret. What does -1.2 politics mean? math works, but not intuitive
100. lda2vec
Let's make vDOC sparse
vDOC = a vreligion + b vpolitics + …
How much of this doc is in religion, how much in politics
but this doesn't work when you have more than a few topics
101. lda2vec
Let's make vDOC sparse
vDOC = a vreligion + b vpolitics + …
How much of this doc is in religion, how much in cars
but this doesn't work when you have more than a few topics
102. lda2vec
Let's make vDOC sparse
{a, b, c…} ~ dirichlet(alpha)
vDOC = a vreligion + b vpolitics + …
trick we can steal from Bayesian methods
make it dirichlet
skipping technical details
make everything sum to 100%
penalize non-zero
force model to only make it non-zero w/ lots of evidence
103. lda2vec
Let's make vDOC sparse
{a, b, c…} ~ dirichlet(alpha)
vDOC = a vreligion + b vpolitics + …
sparsity-inducing effect.
similar to the lasso or L1 regularization, but Dirichlet
few dimensions, sum to 100%
I can say to the CEO, set of docs could have been in 100 topics, but we picked only the best topics
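A quick sketch of that sparsity-inducing effect: with a small alpha, draws from a Dirichlet concentrate on a few topics while still summing to 100% (toy numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 20
spread = rng.dirichlet(np.ones(n_topics) * 10.0)   # alpha >> 1: weight spread over topics
sparse = rng.dirichlet(np.ones(n_topics) * 0.1)    # alpha << 1: a few large weights

print(spread.round(2))    # many small, similar proportions
print(sparse.round(2))    # mostly ~0 with a handful of big proportions
print(sparse.sum())       # still sums to 1.0
```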
104. word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC)
The goal:
Use all of this context to learn
interpretable topics.
@chrisemoody
this document is
80% high fashion
this document is
60% style
go back to our problem lda2vec is going to use all the info here
105. word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP)
The goal:
Use all of this context to learn
interpretable topics.
@chrisemoody
add column = adding a term
add features in an ML model
106. word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP)
The goal:
Use all of this context to learn
interpretable topics.
this zip code is
80% hot climate
this zip code is
60% outdoors wear
@chrisemoody
in addition to doc topics, like 'rec SF'
107. word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP + vCLIENTS)
The goal:
Use all of this context to learn
interpretable topics.
this client is
80% sporty
this client is
60% casual wear
@chrisemoody
client topics: sporty, casual
this is where, if she says '3rd trimester', we identify a future mother
'scrubs' → medicine
108. word2vec
LDA
P(vOUT | vIN + vDOC + vZIP + vCLIENTS)
P(sold | vCLIENTS)
lda2vec
The goal:
Use all of this context to learn
interpretable topics.
@chrisemoody
Can also make the topics
supervised so that they predict
an outcome.
helps fine-tune topics so that they correlate with your favorite business metric
align topics w/ expectations
helps us guess when revenue goes up what the leading causes are
110. "PS! Thank you for such an awesome idea"
@chrisemoody
doc_id=1846
Can we model topics to sentences?
lda2lstm
SF is all about mixing cutting-edge algorithms, but we absolutely need interpretability. The human component to algorithms is not negotiable.
Could we demand the model make us a sentence that is 80% religion, 10% politics?
classify word level, LSTM on sentence, LDA on document level
111. "PS! Thank you for such an awesome idea"
@chrisemoody
doc_id=1846
Can we represent the internal LSTM
states as a dirichlet mixture?
Dirichlet-squeeze the internal states and manipulations; maybe that will help us understand the science of LSTM dynamics, because seriously, WTF is going on there
112. Can we model topics to sentences?
lda2lstm
"PS! Thank you for such an awesome idea" doc_id=1846
@chrisemoody
Can we model topics to images?
lda2ae
TJ Torres
Can we also extend this to image generation? TJ is working on a ridiculous VAE/GAN model… can we throw in a topic model? Can we say make me an image that is 80% sweater, and 10% zippers, and 10% elbow patches?
119. Crazy Approaches
Paragraph Vectors
(Just extend the context window)
Content dependency
(Change the window grammatically)
Social word2vec (deepwalk)
(Sentence is a walk on the graph)
Spotify
(Sentence is a playlist of song_ids)
Stitch Fix
(Sentence is a shipment of five items)
121. CBOW
"The fox jumped over the lazy dog"
Guess the word
given the context
~20x faster.
(this is the alternative.)
vOUT
vIN vIN vIN vIN
vIN vIN
SkipGram
"The fox jumped over the lazy dog"
vOUT vOUT
vIN
vOUT vOUT vOUT vOUT
Guess the context
given the word
Better at syntax.
(this is the one we went over)
CBOW sums word vectors, losing the order in the sentence
Both are good at semantic relationships
Child and kid are nearby
Or gender in man, woman
If you blur words over the scale of the context, ~5 words, you lose a lot of grammatical nuance
But skipgram preserves order
Preserves the relationship in pluralizing, for example
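A hedged sketch of training both architectures with gensim (toy corpus; the parameter names match recent gensim releases, where sg=1 selects skip-gram and sg=0 selects CBOW, and older versions use size instead of vector_size):

```python
from gensim.models import Word2Vec

sentences = [["the", "fox", "jumped", "over", "the", "lazy", "dog"]]

skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
```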
122. Shows that the many words similar to vacation actually come in lots of flavors
- wedding words (bachelorette, rehearsals)
- holiday/event words (birthdays, brunch, christmas, thanksgiving)
- seasonal words (spring, summer)
- trip words (getaway)
- destinations
127. What I didn't mention
A lot of text (only if you have a specialized vocabulary)
Cleaning the text
Memory & performance
Traditional databases arenāt well-suited
False positives
hundreds of millions of words, 1,000 books, 500,000 comments, or 4,000,000 tweets
high-memory and high-performance multicore machine.
Training can take several hours to several days but shouldn't need frequent retraining.
If you use pretrained vectors, then this isn't an issue.
Databases. Modern SQL systems aren't well-suited to performing the vector addition, subtraction and multiplication that searching in vector space requires. There are a few libraries that will help you quickly find the most similar items: annoy, ball trees, locality-sensitive hashing (LSH) or FLANN.
False-positives & exactness. Despite the impressive results that come with word vectorization, no NLP technique is perfect.
Take care that your system is robust to results that a computer deems relevant but an expert human wouldn't.
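A small sketch of approximate nearest-neighbor lookup with annoy, one of the libraries mentioned above (the vectors here are random stand-ins for real word vectors):

```python
import random
from annoy import AnnoyIndex

dim = 100
index = AnnoyIndex(dim, "angular")                    # angular distance ~ cosine
words = ["vacation", "getaway", "scrubs", "jeans"]
for i, w in enumerate(words):
    vec = [random.gauss(0, 1) for _ in range(dim)]    # stand-in for a word vector
    index.add_item(i, vec)
index.build(10)                                       # build 10 trees

print(index.get_nns_by_item(0, 3))                    # items nearest to "vacation"
```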
129. All of the following ideas will change what
"words" and "context" represent.
But we'll still use the same w2v algorithm
130. paragraph vector
What about summarizing documents?
On the day he took office, President Obama reached out to America's enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
131. On the day he took office, President Obama reached out to America's enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama's audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph vector
Normal skipgram extends C words before, and C words after.
IN
OUT OUT
Except we stay inside a sentence
132. On the day he took office, President Obama reached out to America's enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama's audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph vector
A document vector simply extends the context to the whole document.
IN
OUT OUT
OUT OUT doc_1347
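A hedged sketch of paragraph vectors via gensim's Doc2Vec, which learns a vector for each document tag alongside the word vectors (toy corpus; recent gensim exposes document vectors as model.dv, older versions as model.docvecs):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["i", "love", "finding", "new", "designer", "brands"], tags=["doc_0"]),
    TaggedDocument(words=["thank", "you", "for", "such", "an", "awesome", "top"], tags=["doc_1"]),
]
model = Doc2Vec(docs, vector_size=50, window=5, min_count=1, epochs=20)

print(model.dv["doc_1"][:5])   # the learned document vector
```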
134. translation
(using just a rotation matrix)
Mikolov 2013
English
Spanish
Matrix
Rotation
Blows my mind
Explain plot
Not a complicated NN here
Still have to learn the rotation matrix, but it generalizes very nicely.
Have analogies for every linalg op as a linguistic operator: + and - and matrix multiplies
Robust framework and new tools to do science on words
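A toy sketch of the idea: fit a linear map from English vectors to Spanish vectors using a small bilingual dictionary (random arrays stand in for real embeddings; a least-squares fit gives a general linear map, and constraining it to be orthogonal gives the rotation the slide mentions):

```python
import numpy as np

dim, n_pairs = 100, 500
rng = np.random.default_rng(0)
X_en = rng.normal(size=(n_pairs, dim))   # English vectors for dictionary words
X_es = rng.normal(size=(n_pairs, dim))   # vectors of their Spanish translations

# least-squares fit of X_en @ W ~= X_es
W, *_ = np.linalg.lstsq(X_en, X_es, rcond=None)

v_en = rng.normal(size=dim)              # a new English word's vector
v_es_guess = v_en @ W                    # its predicted Spanish-space vector
```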
139. context dependent
Levy & Goldberg 2014
Also show that SGNS is simply factorizing:
w * c = PMI(w, c) - log k
This is completely amazing!
Intuition: positive associations (canada, snow)
stronger in humans than negative associations
(what is the opposite of Canada?)
Also means we can do SVD-like techniques to get a convex w2v; it uses fast linear algebra libraries and a compressed word-count matrix, so also better storage… but it's not online
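A toy sketch of that factorization view: build a shifted PMI word-context matrix from co-occurrence counts and take an SVD to get word vectors (tiny made-up counts; a real corpus would be vastly larger):

```python
import numpy as np

counts = np.array([[10.0, 2.0, 0.0],     # co-occurrence counts:
                   [ 2.0, 8.0, 1.0],     # rows = words, columns = context words
                   [ 0.0, 1.0, 6.0]])
total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total

k = 5                                     # number of negative samples in SGNS
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c)) - np.log(k)
pmi[np.isneginf(pmi)] = 0.0               # zero out -inf entries from zero counts

U, S, Vt = np.linalg.svd(pmi)
word_vectors = U[:, :2] * np.sqrt(S[:2])  # low-rank word vectors
```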
140. deepwalk
Perozzi et al. 2014
learn word vectors from
sentences
"The fox jumped over the lazy dog"
vOUT vOUT vOUT vOUT vOUT vOUT
"words" are graph vertices
"sentences" are random walks on the graph
word2vec
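A toy sketch of the deepwalk recipe: vertices become "words", random walks become "sentences", and then ordinary word2vec is trained on them (tiny made-up graph; the gensim call mirrors the earlier sketch):

```python
import random
from gensim.models import Word2Vec

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def random_walk(start, length=10):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))   # hop to a random neighbor
    return walk

walks = [random_walk(v) for v in graph for _ in range(20)]   # "sentences"
model = Word2Vec(walks, vector_size=32, window=3, min_count=1, sg=1)
```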
150. A specific lda2vec model
Our text blob is a comment that comes from a region_id and a style_id
159. Can measure similarity between topic vectors m and n, and word vectors w
This gets you the 'top' words in a topic, so you can figure out what that topic is
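A toy sketch of reading out a topic: rank the vocabulary by cosine similarity to the topic vector and take the top words (random stand-in vectors below):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Trinitarian", "baptismal", "Milosevic", "absentee", "getaway"]
word_vecs = {w: rng.normal(size=50) for w in vocab}
topic_vec = rng.normal(size=50)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

top_words = sorted(vocab, key=lambda w: cosine(topic_vec, word_vecs[w]), reverse=True)
print(top_words[:3])   # the "top" words for this topic
```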
161. lda2vec
Let's make vDOC into a mixture…
vDOC = 10% religion + 89% politics + …
topic 2 = 'politics'
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 1 = 'religion'
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
This is now on the 20 newsgroups dataset…
Doc is now 10% religion, 89% politics
mixture models are powerful for interpretability