A word is worth a
thousand vectors
(word2vec, lda, and introducing lda2vec)
Christopher Moody
@ Stitch Fix
Welcome, 

thanks for coming, having me, organizer

NLP can be a messy affair because you have to teach a computer about the irregularities and ambiguities of the English language, all that hierarchical, sparse structure in the grammar

3rd trimester, pregnant

"wears scrubs" - medicine

taking a trip - a fix for vacation clothing

the promise of word vectors is to sweep away a lot of these issues
About
@chrisemoody
Caltech Physics
PhD. in astrostats supercomputing
sklearn t-SNE contributor
Data Labs at Stitch Fix
github.com/cemoody
Gaussian Processes t-SNE
chainer
deep learning
Tensor Decomposition
Credit
Large swathes of this talk are from
previous presentations by:
ā€¢ Tomas Mikolov
ā€¢ David Blei
ā€¢ Christopher Olah
ā€¢ Radim Rehurek
ā€¢ Omer Levy & Yoav Goldberg
ā€¢ Richard Socher
ā€¢ Xin Rong
ā€¢ Tim Hopper
1. word2vec
2. lda
3. lda2vec
1. king - man + woman = queen
2. Huge splash in NLP world
3. Learns from raw text
4. Pretty simple algorithm
5. Comes pretrained
word2vec
1. Learns what words mean - can solve analogies cleanly (see the sketch after these notes).

1. Not treating words as blocks, but instead modeling relationships

2. Distributed representations form the basis of more complicated deep learning systems

3. Shallow - not deep learning!

1. Power comes from this simplicity - super fast, lots of data

4. Get a lot of mileage out of this

1. Don't need to model the wikipedia corpus before starting your own
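
A minimal sketch of that analogy query, using gensim's word2vec loader; the vector file name here is a placeholder for whatever pretrained vectors you have on hand.

from gensim.models import KeyedVectors

# Load pretrained vectors (placeholder file name; any word2vec-format file works).
vectors = KeyedVectors.load_word2vec_format('pretrained_vectors.bin', binary=True)

# king - man + woman: add the 'woman' direction, subtract the 'man' direction.
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# With well-trained vectors, 'queen' should land at or near the top.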
word2vec
1. Set up an objective function
2. Randomly initialize vectors
3. Do gradient descent
word2vec
word2vec: learn word vector vin
from its surrounding context
vin
1. Let's talk about training first

2. In SVD and n-grams we built co-occurrence and transition probability matrices

3. Here we will learn the embedded representation directly, with no intermediates, and update it with every example
word2vec
"The fox jumped over the lazy dog"
Maximize the likelihood of seeing the words given the word over.
P(the|over)
P(fox|over)
P(jumped|over)
P(the|over)
P(lazy|over)
P(dog|over)
…instead of maximizing the likelihood of co-occurrence counts.
1. Context - the words surrounding the training word

2. Naively assume P(*|over) is independent conditional on the training word

3. Still a pretty simple assumption!

Conditioning on just *over*, no other secret parameters or anything
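
As a minimal sketch (not the original code), here is how those conditional pairs get enumerated from the sentence; the window of 3 is just chosen so that every other word in this short sentence counts as context for 'over'.

sentence = "the fox jumped over the lazy dog".split()
window = 3  # illustrative window size; wide enough to cover the whole sentence here

pairs = []
for i, w_in in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((w_in, sentence[j]))

# The pairs conditioned on 'over' mirror the slide:
# ('over', 'the'), ('over', 'fox'), ('over', 'jumped'), ('over', 'the'), ('over', 'lazy'), ('over', 'dog')
print([p for p in pairs if p[0] == 'over'])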
word2vec
P(fox|over)
What should this be?
word2vec
P(vfox|vover)
Should depend on the word vectors.
P(fox|over)
Trying to learn the word vectors, so let's start with those

(we'll randomly initialize them to begin with)
word2vec
Twist: we have two vectors for every word.
Should depend on whether it's the input or the output.
Also a context window around every input word.
"The fox jumped over the lazy dog"
P(vOUT|vIN)
word2vec
"The fox jumped over the lazy dog"
vIN
P(vOUT|vIN)
Twist: we have two vectors for every word.
Should depend on whether it's the input or the output.
Also a context window around every input word.
IN = training word
word2vec
"The fox jumped over the lazy dog"
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it's the input or the output.
Also a context window around every input word.
word2vec
"The fox jumped over the lazy dog"
vOUT
P(vOUT|vIN)
vIN
Twist: we have two vectors for every word.
Should depend on whether it's the input or the output.
Also a context window around every input word.
…So that at a high level is what we want word2vec to do.

two for loops

That's it! A bit disingenuous to make this a giant network
objective
Measure loss between
vIN and vOUT?
vin . vout
How should we define P(vOUT|vIN)?
Now we've defined the high-level update path for the algorithm.

Need to define this probability exactly in order to define our updates.

Boils down to the difference between in & out - we want to make them as similar as possible, and then the probability will go up.

Use cosine similarity.
word2vec
vin . vout ~ 1
objective
vin
vout
Dot product has these properties:

Similar vectors have similarity near 1
word2vec
objective
vin
vout
vin . vout ~ 0
Orthogonal vectors have similarity near 0
word2vec
objective
vin
vout
vin . vout ~ -1
Opposite vectors have similarity near -1
word2vec
vin . vout ∈ [-1,1]
objective
But the inner product ranges from -1 to 1 (when normalized)

…and we'd like a probability
word2vec
But we'd like to measure a probability.
vin . vout ∈ [-1,1]
objective
But the inner product ranges from -1 to 1 (when normalized)

…and we'd like a probability
word2vec
But we'd like to measure a probability.
softmax(vin . vout ∈ [-1,1])
objective
∈ [0,1]
Transform again using softmax
word2vec
But we'd like to measure a probability.
softmax(vin . vout ∈ [-1,1])
Probability of choosing 1 of N discrete items.
Mapping from vector space to a multinomial over words.
objective
Similar to the logistic function for binary outcomes, but instead for 1 of N outcomes.

So now we're modeling the probability of a word showing up as the combination of the training word vector and the target
word vector, and transforming it to a 1-of-N probability
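
A minimal numpy sketch of that softmax step (illustrative, not the word2vec source): dot products land on the real line, and softmax maps them to a distribution that sums to 1.

import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# Scores are dot products of v_in with every candidate v_out.
scores = np.array([0.9, 0.1, -0.5, 0.3])
print(softmax(scores))  # roughly [0.45, 0.20, 0.11, 0.24]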
word2vec
But we'd like to measure a probability.
softmax ~ exp(vin . vout)
objective
So here's the actual form of the equation - we normalize by the sum over all of the other possible pairs of word combinations
word2vec
But we'd like to measure a probability.
softmax = exp(vin . vout) / Σ_k∈V exp(vin . vk)
objective
Normalization term over all words k ∈ V
So here's the actual form of the equation - we normalize by the sum over all of the other possible pairs of word combinations

two effects

make vin and vout more similar

make vin and every other word less similar
word2vec
But we'd like to measure a probability.
P(vout|vin) = softmax = exp(vin . vout) / Σ_k∈V exp(vin . vk)
objective
This is the kernel of word2vec. We're just going to apply this operation every time we want to update the vectors.

For every word, we're going to have a context window, and then for every pair of words in that window and the input word,
we'll measure this probability.
word2vec
Learn by gradient descent on the softmax prob.
For every example we see, update vin
vin := vin + P(vout|vin)
objective
vout := vout + P(vout|vin)
…I won't go through the derivation of the gradient, but this is the general idea

relatively simple, fast - fast enough to read billions of words in a day
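
A toy numpy sketch of the whole loop, assuming a full softmax over a tiny vocabulary; real word2vec swaps in negative sampling or a hierarchical softmax so it can read billions of words.

import numpy as np

rng = np.random.default_rng(0)
sentence = "the fox jumped over the lazy dog".split()
vocab = sorted(set(sentence))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 16, 2, 0.05

v_in = rng.normal(scale=0.1, size=(V, D))   # one input vector per word
v_out = rng.normal(scale=0.1, size=(V, D))  # one output vector per word

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for epoch in range(50):
    for i, w in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j == i:
                continue
            win, wout = idx[w], idx[sentence[j]]
            p = softmax(v_out @ v_in[win])           # P(every word | w)
            grad = p.copy()
            grad[wout] -= 1.0                        # gradient of -log P(wout | w) w.r.t. scores
            g_in = v_out.T @ grad                    # gradient w.r.t. v_in[win]
            v_out -= lr * np.outer(grad, v_in[win])  # nudge all output vectors
            v_in[win] -= lr * g_in                   # nudge the input vector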
word2vec
explain table
word2vec
if not convinced by qualitative results…
Showing just 2 of the ~500 dimensions. Effectively we've PCA'd it
If we only had locality and not regularity, this wouldn't necessarily be true
So we live in a vector space where operations like addition and subtraction are meaningful.

So here's a few examples of this working.

Really get the idea of these vectors as being 'mixes' of other ideas & vectors
ITEM_3469 + 'Pregnant'
SF is a person service

Box
+ 'Pregnant'
I love the stripes and the cut around my neckline was amazing

someone else might write 'grey and black'

subtlety and nuance in that language

We have lots of this interaction - on the order of a wikipedia amount - far too much to manually annotate anything
= ITEM_701333
= ITEM_901004
= ITEM_800456
Stripes are safe for maternity

And also similar tones and flowy - still great for expecting mothers
what about LDA?
LDA
on Client Item
Descriptions
This shows the incredible amount of structure
LDA
on Item
Descriptions
(with Jay)
clunky jewelry

dangling delicate jewelry elsewhere
LDA
on Item
Descriptions
(with Jay)
topics on patterns, styles - this cluster is similarly described as high contrast tops with popping colors
LDA
on Item
Descriptions
(with Jay)
bright dresses for a warm summer
LDA
on Item
Descriptions
(with Jay)
maternity line clothes
LDA
on Item
Descriptions
(with Jay)
not just visual topics, but also topics about fit
Latent style vectors from text
Pairwise gamma correlation
from style ratings
Diversity from ratings
Diversity from text
Lots of structure in both - but the diversity is much higher in the text

Maybe obvious: but the way people describe items is fundamentally richer than the style ratings
lda vs word2vec
word2vec is local:
one word predicts a nearby word
"I love finding new designer brands for jeans"
as if the world were one very long text string. no end of documents, no end of sentence, etc.

and a window across words
"I love finding new designer brands for jeans"
But text is usually organized.
as if the world were one very long text string. no end of documents, no end of sentence, etc.
"I love finding new designer brands for jeans"
But text is usually organized.
as if the world were one very long text string. no end of documents, no end of sentence, etc.
"I love finding new designer brands for jeans"
In LDA, documents globally predict words.
doc 7681
these are client comments, which are short - only predict dozens of words

but could be legal documents or medical documents of 10k words - here the difference between global and local algorithms
is much more important
typical word2vec vector: [ -0.75, -1.25, -0.55, -0.12, +2.2]
typical LDA document vector: [ 0%, 9%, 78%, 11%]
typical word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2]
typical LDA document vector
[ 0%, 9%, 78%, 11%]
All real values / All sum to 100%
5D word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2]
5D LDA document vector
[ 0%, 9%, 78%, 11%]
word2vec: Dense, All real values, Dimensions relative
LDA document vector: Sparse, All sum to 100%, Dimensions are absolute
LDA is a *mixture*

w2v is a bunch of real numbers - more like an *address*

much easier to say to another human "78% of something" rather than "it is +2.2 of something and -1.25 of something else"
100D word2vec vector
[ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2]
100D LDA document vector
[ 0% 0% 0% 0% 0% … 0%, 9%, 78%, 11%]
word2vec: Dense, All real values, Dimensions relative
LDA document vector: Sparse, All sum to 100%, Dimensions are absolute
dense sparse
100D word2vec vector
[ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2]
100D LDA document vector
[ 0% 0% 0% 0% 0% … 0%, 9%, 78%, 11%]
LDA: Similar in fewer ways
(more interpretable)
word2vec: Similar in 100D ways
(very flexible)
+mixture
+sparse
can we do both? lda2vec
series of experiments

grain of salt

very new - no good quantitative results, only qualitative (but promising!)
The goal:
Use all of this context to learn
interpretable topics.
word2vec: P(vOUT | vIN)
@chrisemoody
Use this at SF.

typical table

w2v will use the word-to-word (w-w) columns
word2vec
LDA: P(vOUT | vDOC)
The goal:
Use all of this context to learn
interpretable topics.
this document is
80% high fashion
this document is
60% style
@chrisemoody
LDA will use that doc ID column

you can use this to steer the business as a whole
word2vec
LDA
The goal:
Use all of this context to learn
interpretable topics.
this zip code is
80% hot climate
this zip code is
60% outdoors wear
@chrisemoody
But doesn't predict word-to-word relationships.

in Texas, maybe I want more lonestars & stirrup icons

in Austin, maybe I want more bats
word2vec
LDA
The goal:
Use all of this context to learn
interpretable topics.
this client is
80% sporty
this client is
60% casual wear
@chrisemoody
love to learn client topics

are there 'types' of clients? a question every business asks

so this is the promise of lda2vec
lda2vec
word2vec predicts locally:
one word predicts a nearby word
P(vOUT |vIN)
vIN vOUT
"PS! Thank you for such an awesome top"
But doesn't predict word-to-word relationships.
lda2vec
LDA predicts a word from a global context
doc_id=1846
P(vOUT |vDOC)
vOUT
vDOC
"PS! Thank you for such an awesome top"
But doesn't predict word-to-word relationships.
lda2vec
doc_id=1846
vIN vOUT
vDOC
can we predict a word both locally and globally?
"PS! Thank you for such an awesome top"
lda2vec
"PS! Thank you for such an awesome top" doc_id=1846
vIN vOUT
vDOC
can we predict a word both locally and globally ?
P(vOUT |vIN+ vDOC)
doc vector captures long-distance dependencies

word vector captures short-distance
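
A minimal numpy sketch of that combined context (names and sizes are illustrative, not the lda2vec API): add the document vector to the word vector and push the sum through the same softmax as word2vec.

import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 64                             # toy vocabulary and embedding size
v_in = rng.normal(scale=0.1, size=(V, D))   # word input vectors
v_out = rng.normal(scale=0.1, size=(V, D))  # word output vectors
v_doc = rng.normal(scale=0.1, size=(D,))    # vector for doc_id=1846

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

context = v_in[42] + v_doc            # local word context + global document context
p_out = softmax(v_out @ context)      # P(vOUT | vIN + vDOC) over the whole vocabulary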
lda2vec
doc_id=1846
vIN vOUT
vDOC
*very similar to the Paragraph Vectors / doc2vec
can we predict a word both locally and globally?
"PS! Thank you for such an awesome top"
P(vOUT |vIN+ vDOC)
lda2vec
This works! 😀 But vDOC isn't as
interpretable as the LDA topic vectors. 😔
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …
lda2vec
This works! 😀 But vDOC isn't as
interpretable as the LDA topic vectors. 😔
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …

about as interpretable as a hash
lda2vec
This works! 😀 But vDOC isn't as
interpretable as the LDA topic vectors. 😔
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …
lda2vec
This works! 😀 But vDOC isn't as
interpretable as the LDA topic vectors. 😔
We're missing mixtures & sparsity.
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …
lda2vec
This works! 😀 But vDOC isn't as
interpretable as the LDA topic vectors. 😔
Let's make vDOC into a mixture…
Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, …
lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + … (up to k topics)
sum of other word vectors

intuition here is that 'hanoi = vietnam + capital' and 'lufthansa = germany + airlines'

so we think that document vectors should also be some word vector + some word vector
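
A minimal sketch of that mixture, with made-up sizes: the document vector becomes a weighted sum of a handful of topic vectors instead of a free point in space.

import numpy as np

rng = np.random.default_rng(0)
D, k = 64, 5                                  # embedding size, number of topics
topic_vectors = rng.normal(scale=0.1, size=(k, D))

# Mixture weights for one document: mostly one topic, a little of another.
weights = np.array([0.10, 0.00, 0.89, 0.01, 0.00])
v_doc = weights @ topic_vectors               # vDOC = a vtopic1 + b vtopic2 + …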
lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + …
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
20 Newsgroups dataset - free, canonical
lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + …
topic 1 = "religion"
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + …
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 1 = "religion"
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
lda2vec
Let's make vDOC into a mixture…
vDOC = a vtopic1 + b vtopic2 + …
topic 1 = "religion"
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
topic 2 = "politics"
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
purple a, b coefficients tell you how much of each topic it is
lda2vec
Let's make vDOC into a mixture…
vDOC = 10% religion + 89% politics + …
topic 2 = "politics"
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 1 = "religion"
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
Doc is now 10% religion, 89% politics

mixture models are powerful for interpretability
lda2vec
Let's make vDOC sparse
[ -0.75, -1.25, … ]
vDOC = a vreligion + b vpolitics + …
Now the 1st time I did this…

Hard to interpret. What does -1.2 politics mean? The math works, but it's not intuitive
lda2vec
Let's make vDOC sparse
vDOC = a vreligion + b vpolitics + …
How much of this doc is in religion, how much in politics

but doesn't work when you have more than a few
lda2vec
Let's make vDOC sparse
vDOC = a vreligion + b vpolitics + …
How much of this doc is in religion, how much in cars

but doesn't work when you have more than a few
lda2vec
Let's make vDOC sparse
{a, b, c…} ~ dirichlet(alpha)
vDOC = a vreligion + b vpolitics + …
a trick we can steal from Bayesians

make it dirichlet

skipping technical details

make everything sum to 100%

penalize non-zero

force the model to only make a weight non-zero with lots of evidence
lda2vec
Let's make vDOC sparse
{a, b, c…} ~ dirichlet(alpha)
vDOC = a vreligion + b vpolitics + …
sparsity-inducing effect.

similar to the lasso or L1 regularization, but dirichlet

few dimensions, sum to 100%

I can say to the CEO: this set of docs could have been in 100 topics, but we picked only the best topics
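
A minimal sketch of one way to get that behaviour (not the lda2vec source): squash free per-document weights into proportions that sum to 100%, then add a Dirichlet log-prior with alpha < 1 to the objective so mass only stays on topics with real evidence.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

raw = np.array([0.2, -1.0, 3.1, -0.5, 0.1])   # free parameters for one document
alpha = 0.1                                    # alpha < 1 pushes toward sparsity

proportions = softmax(raw)                     # {a, b, c, …}: non-negative, sums to 1
dirichlet_log_prior = np.sum((alpha - 1.0) * np.log(proportions + 1e-9))

# During training, add -dirichlet_log_prior (scaled) to the word-prediction loss,
# so spreading weight across many topics is penalized.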
word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC)
The goal:
Use all of this context to learn
interpretable topics.
@chrisemoody
this document is
80% high fashion
this document is
60% style
go back to our problem - lda2vec is going to use all the info here
word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP)
The goal:
Use all of this context to learn
interpretable topics.
@chrisemoody
adding a column = adding a term

like adding features in an ML model
word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP)
The goal:
Use all of this context to learn
interpretable topics.
this zip code is
80% hot climate
this zip code is
60% outdoors wear
@chrisemoody
in addition to doc topics, like 'rec SF'
word2vec
LDA
lda2vec: P(vOUT | vIN + vDOC + vZIP + vCLIENTS)
The goal:
Use all of this context to learn
interpretable topics.
this client is
80% sporty
this client is
60% casual wear
@chrisemoody
client topics - sporty, casual

this is where, if she says '3rd trimester', we identify a future mother

'scrubs' - medicine
word2vec
LDA
P(vOUT | vIN + vDOC + vZIP + vCLIENTS)
P(sold | vCLIENTS)
lda2vec
The goal:
Use all of this context to learn
interpretable topics.
@chrisemoody
Can also make the topics
supervised so that they predict
an outcome.
helps fine-tune topics so that they correlate with your favorite business metric

align topics w/ expectations

helps us guess, when revenue goes up, what the leading causes are
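
A minimal sketch of a supervised head (illustrative names, not the library API): a logistic layer on the client topic proportions predicts the outcome, and its loss is added to the word-prediction objective so gradients flow back into the topics.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

k = 5
client_proportions = np.array([0.05, 0.80, 0.05, 0.05, 0.05])  # e.g. 80% "sporty"
w = np.zeros(k)    # supervised weights, learned jointly with the topics
b = 0.0

p_sold = sigmoid(client_proportions @ w + b)   # P(sold | vCLIENTS)
# Add the log-loss of p_sold against the observed outcome to the total
# objective so the topics are nudged toward ones that predict the metric.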
github.com/cemoody/lda2vec
uses pyldavis
API Ref docs (no narrative docs)
GPU
Decent test coverage
@chrisemoody
"PS! Thank you for such an awesome idea"
@chrisemoody
doc_id=1846
Can we model topics to sentences?
lda2lstm
SF is all about mixing cutting edge algorithms, but we absolutely need interpretability. The human component to algos is not
negotiable

Could we demand the model make us a sentence that is 80% religion, 10% politics?

classify at the word level, LSTM at the sentence level, LDA at the document level
"PS! Thank you for such an awesome idea"
@chrisemoody
doc_id=1846
Can we represent the internal LSTM
states as a dirichlet mixture?
Dirichlet-squeeze the internal states and manipulations - maybe that will help us understand the science of LSTM dynamics,
because seriously, WTF is going on there
Can we model topics to sentences?
lda2lstm
"PS! Thank you for such an awesome idea" doc_id=1846
@chrisemoody
Can we model topics to images?
lda2ae
TJ Torres
Can we also extend this to image generation? TJ is working on a ridiculous VAE/GAN model… can we throw in a topic
model? Can we say: make me an image that is 80% sweater, 10% zippers, and 10% elbow patches?
?@chrisemoody
Multithreaded
Stitch Fix
Bonus slides
Crazy
Approaches
Paragraph Vectors
(Just extend the context window)
Content dependency
(Change the window grammatically)
Social word2vec (deepwalk)
(Sentence is a walk on the graph)
Spotify
(Sentence is a playlist of song_ids)
Stitch Fix
(Sentence is a shipment of five items)
See previous
CBOW
"The fox jumped over the lazy dog"
Guess the word
given the context
~20x faster.
(this is the alternative.)
vOUT
vIN vINvIN vIN
vIN vIN
SkipGram
"The fox jumped over the lazy dog"
vOUT vOUT
vIN
vOUT vOUT vOUTvOUT
Guess the context
given the word
Better at syntax.
(this is the one we went over)
CBOW sums word vectors, losing the order in the sentence

Both are good at semantic relationships

	Child and kid are nearby

	Or gender in man, woman

	If you blur words over the scale of the context - 5ish words - you lose a lot of grammatical nuance

But skipgram preserves order

	Preserves the relationship in pluralizing, for example
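
A minimal sketch of the two setups on a single window (illustrative only): skip-gram emits one (center, context) pair per neighbor, while CBOW averages the neighbors into a single input and loses their order.

import numpy as np

sentence = "the fox jumped over the lazy dog".split()
i, window = 3, 2                        # center word 'over', +/- 2 words of context
context = sentence[i - window:i] + sentence[i + 1:i + 1 + window]

# SkipGram: guess each context word from the center word.
skipgram_pairs = [(sentence[i], c) for c in context]
# [('over', 'fox'), ('over', 'jumped'), ('over', 'the'), ('over', 'lazy')]

# CBOW: guess the center word from the averaged context (order is lost).
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=4) for w in set(sentence)}
cbow_input = np.mean([toy_vectors[c] for c in context], axis=0)
cbow_target = sentence[i]               # predict 'over' from the averaged context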
Shows that the many words similar to vacation actually come in lots of flavors

- wedding words (bachelorette, rehearsals)

- holiday/event words (birthdays, brunch, christmas, thanksgiving)

- seasonal words (spring, summer)

- trip words (getaway)

- destinations
LDA
Results
context
History
I loved every choice in this fix!! Great job!
Great Stylist Perfect
LDA
Results
context
History
Body Fit
My measurements are 36-28-32. If that helps.
I like wearing some clothing that is fitted.
Very hard for me to find pants that fit right.
LDA
Results
context
History
Sizing
Really enjoyed the experience and the
pieces, sizing for tops was too big.
Looking forward to my next box!
Excited for next
LDA
Results
context
History
Almost Bought
It was a great fix. Loved the two items I
kept and the three I sent back were close!
Perfect
What I didn't mention
A lot of text (only if you have a specialized vocabulary)
Cleaning the text
Memory & performance
Traditional databases aren't well-suited
False positives
hundreds of millions of words, 1,000 books, 500,000 comments, or 4,000,000 tweets

high-memory and high-performance multicore machine. 

Training can take several hours to several days but shouldn't need frequent retraining. 

If you use pretrained vectors, then this isn't an issue.

Databases. Modern SQL systems aren't well-suited to performing the vector addition, subtraction, and multiplication that
searching in vector space requires. There are a few libraries that will help you quickly find the most similar items: annoy,
ball trees, locality-sensitive hashing (LSH), or FLANN.

False-positives & exactness. Despite the impressive results that come with word vectorization, no NLP technique is perfect.
Take care that your system is robust to results that a computer deems relevant but an expert human wouldn't.
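
A minimal sketch of the approximate nearest-neighbor search mentioned above, using annoy with random stand-in vectors; sizes and the analogy indices are illustrative.

import numpy as np
from annoy import AnnoyIndex

dim = 100
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10000, dim)).astype('float32')   # stand-in word vectors

index = AnnoyIndex(dim, 'angular')   # angular distance tracks cosine similarity
for i, v in enumerate(vectors):
    index.add_item(i, v)
index.build(10)                      # 10 trees; more trees = better recall, slower build

query = vectors[42] - vectors[7] + vectors[99]   # e.g. king - man + woman
print(index.get_nns_by_vector(query, 10))        # ids of the 10 most similar items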
and now for something completely crazy
All of the following ideas will change what
'words' and 'context' represent.
But we'll still use the same w2v algo
paragraph
vector
What about summarizing documents?
On the day he took office, President Obama reached out to America's enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
On the day he took office, President Obama reached out to America's enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama's audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph
vector
Normal skipgram extends C words before, and C words after.
IN
OUT OUT
Except we stay inside a sentence
On the day he took office, President Obama reached out to America's enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama's audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph
vector
A document vector simply extends the context to the whole document.
IN
OUT OUT
OUT OUT doc_1347
from gensim.models import Doc2Vec

fn = "item_document_vectors"
model = Doc2Vec.load(fn)
matches = model.most_similar('pregnant')
matches = list(filter(lambda x: 'SENT_' in x[0], matches))
# ['...I am currently 23 weeks pregnant...',
#  '...I'm now 10 weeks pregnant...',
#  '...not showing too much yet...',
#  '...15 weeks now. Baby bump...',
#  '...6 weeks post partum!...',
#  '...12 weeks postpartum and am nursing...',
#  '...I have my baby shower that...',
#  '...am still breastfeeding...',
#  '...I would love an outfit for a baby shower...']
sentence
search
translation
(using just a rotation
matrix)
Mikolov
2013
English
Spanish
Matrix
Rotation
Blows my mind

Explain plot

Not a complicated NN here

Still have to learn the rotation matrix - but it generalizes very nicely.

Have analogies for every linalg op as a linguistic operator: + and - and matrix multiplies

Robust framework and new tools to do science on words
context
dependent
Levy & Goldberg 2014
Australian scientist discovers star with telescope
context +/- 2 words
context
dependent
context
Australian scientist discovers star with telescope
Levy & Goldberg 2014
What if we
context
dependent
context
Australian scientist discovers star with telescope
context
Levy & Goldberg 2014
context
dependent
context
BoW DEPS
topically-similar vs 'functionally' similar
Levy & Goldberg 2014
context
dependent
context
Levy & Goldberg 2014
Also show that SGNS is simply factorizing:
w * c = PMI(w, c) - log k
This is completely amazing!
Intuition: positive associations (canada, snow)
stronger in humans than negative associations
(what is the opposite of Canada?)
Also means we can do SVD-like techniques to get a convex w2v; uses fast linear algebra libraries and a compressed word count matrix,
so also better storage… but not online
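
A minimal numpy sketch of that SVD route (the co-occurrence counts here are a tiny stand-in for a real corpus scan): build the shifted positive PMI matrix and factorize it.

import numpy as np

counts = np.array([[0, 8, 2],      # co-occurrence counts between 3 words
                   [8, 0, 5],
                   [2, 5, 0]], dtype=float)
k = 5                              # number of negative samples in SGNS

total = counts.sum()
p_w = counts.sum(axis=1) / total
p_c = counts.sum(axis=0) / total
p_wc = counts / total

with np.errstate(divide='ignore', invalid='ignore'):
    pmi = np.log(p_wc / np.outer(p_w, p_c))
sppmi = np.maximum(pmi - np.log(k), 0)      # shifted positive PMI

U, S, Vt = np.linalg.svd(sppmi)
word_vectors = U[:, :2] * np.sqrt(S[:2])    # low-rank word vectors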
deepwalk
Perozzi et al. 2014
learn word vectors from
sentences
"The fox jumped over the lazy dog"
vOUT vOUT vOUT vOUT vOUT vOUT
'words' are graph vertices
'sentences' are random walks on the
graph
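
A minimal sketch of the deepwalk idea on a toy graph (not the deepwalk code; gensim parameter names follow recent versions): random walks stand in for sentences and get fed straight to word2vec.

import random
from gensim.models import Word2Vec

graph = {   # adjacency list of a tiny toy graph
    'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b', 'd'], 'd': ['c'],
}

def random_walk(start, length=8):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

walks = [random_walk(node) for node in graph for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1)
print(model.wv.most_similar('a'))   # vertices that co-occur on walks end up nearby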
word2vec
Playlists at
Spotify
context
sequence
learning
'words' are songs
'sentences' are playlists
Playlists at
Spotify
context
Erik
Bernhardsson
Great performance on 'related artists'
Fixes at
Stitch Fix
sequence
learning
Let's try:
'words' are styles
'sentences' are fixes
Fixes at
Stitch Fix
context
Learn similarity between styles
because they co-occur
Learn 'coherent' styles
sequence
learning
Fixes at
Stitch Fix?
context
sequence
learning
Got lots of structure!
Fixes at
Stitch Fix?
context
sequence
learning
Fixes at
Stitch Fix?
context
sequence
learning
Nearby regions are
consistent 'closets'
A specific lda2vec model
Our text blob is a comment that comes from a region_id and a style_id
Can measure similarity between topic vectors m and n, and word vectors w

This gets you the 'top' words in a topic; you can figure out what that topic is
lda2vec
Let's make vDOC into a mixture…
vDOC = 10% religion + 89% politics + …
topic 2 = "politics"
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 1 = "religion"
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
This is now on the 20 Newsgroups dataset…

Doc is now 10% religion, 89% politics

mixture models are powerful for interpretability

More Related Content

What's hot

Real-time personalized recommendations using product embeddings
Real-time personalized recommendations using product embeddingsReal-time personalized recommendations using product embeddings
Real-time personalized recommendations using product embeddings
Jakub Macina
Ā 
stackconf 2021 | Weaviate Vector Search Engine ā€“ Introduction
stackconf 2021 | Weaviate Vector Search Engine ā€“ Introductionstackconf 2021 | Weaviate Vector Search Engine ā€“ Introduction
stackconf 2021 | Weaviate Vector Search Engine ā€“ Introduction
NETWAYS
Ā 

What's hot (20)

Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)
Ā 
Thai Text processing by Transfer Learning using Transformer (Bert)
Thai Text processing by Transfer Learning using Transformer (Bert)Thai Text processing by Transfer Learning using Transformer (Bert)
Thai Text processing by Transfer Learning using Transformer (Bert)
Ā 
Word2Vec
Word2VecWord2Vec
Word2Vec
Ā 
Context-based movie search using doc2vec, word2vec
Context-based movie search using doc2vec, word2vecContext-based movie search using doc2vec, word2vec
Context-based movie search using doc2vec, word2vec
Ā 
CH02 API Governance
CH02 API Governance CH02 API Governance
CH02 API Governance
Ā 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)
Ā 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
Ā 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
Ā 
Real-time personalized recommendations using product embeddings
Real-time personalized recommendations using product embeddingsReal-time personalized recommendations using product embeddings
Real-time personalized recommendations using product embeddings
Ā 
GPT : Generative Pre-Training Model
GPT : Generative Pre-Training ModelGPT : Generative Pre-Training Model
GPT : Generative Pre-Training Model
Ā 
Semantic web meetup ā€“ sparql tutorial
Semantic web meetup ā€“ sparql tutorialSemantic web meetup ā€“ sparql tutorial
Semantic web meetup ā€“ sparql tutorial
Ā 
Conversational AI with Transformer Models
Conversational AI with Transformer ModelsConversational AI with Transformer Models
Conversational AI with Transformer Models
Ā 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
Ā 
stackconf 2021 | Weaviate Vector Search Engine ā€“ Introduction
stackconf 2021 | Weaviate Vector Search Engine ā€“ Introductionstackconf 2021 | Weaviate Vector Search Engine ā€“ Introduction
stackconf 2021 | Weaviate Vector Search Engine ā€“ Introduction
Ā 
Comparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP ModelsComparative Analysis of Transformer Based Pre-Trained NLP Models
Comparative Analysis of Transformer Based Pre-Trained NLP Models
Ā 
ė”„ ėŸ¬ė‹ ģžģ—°ģ–“ ģ²˜ė¦¬ė„¼ ķ•™ģŠµģ„ ģœ„ķ•œ ķŒŒģ›Œķ¬ģøķŠø. (Deep Learning for Natural Language Processing)
ė”„ ėŸ¬ė‹ ģžģ—°ģ–“ ģ²˜ė¦¬ė„¼ ķ•™ģŠµģ„ ģœ„ķ•œ ķŒŒģ›Œķ¬ģøķŠø. (Deep Learning for Natural Language Processing)ė”„ ėŸ¬ė‹ ģžģ—°ģ–“ ģ²˜ė¦¬ė„¼ ķ•™ģŠµģ„ ģœ„ķ•œ ķŒŒģ›Œķ¬ģøķŠø. (Deep Learning for Natural Language Processing)
ė”„ ėŸ¬ė‹ ģžģ—°ģ–“ ģ²˜ė¦¬ė„¼ ķ•™ģŠµģ„ ģœ„ķ•œ ķŒŒģ›Œķ¬ģøķŠø. (Deep Learning for Natural Language Processing)
Ā 
Word2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad MahdaviWord2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad Mahdavi
Ā 
Switch transformers paper review
Switch transformers paper reviewSwitch transformers paper review
Switch transformers paper review
Ā 
The How and Why of Feature Engineering
The How and Why of Feature EngineeringThe How and Why of Feature Engineering
The How and Why of Feature Engineering
Ā 
How to come up with new research ideas
How to come up with new research ideasHow to come up with new research ideas
How to come up with new research ideas
Ā 

Similar to word2vec, LDA, and introducing a new hybrid algorithm: lda2vec

Interview presentation
Interview presentationInterview presentation
Interview presentation
Joseph Gubbins
Ā 
Paper dissected glove_ global vectors for word representation_ explained _ ...
Paper dissected   glove_ global vectors for word representation_ explained _ ...Paper dissected   glove_ global vectors for word representation_ explained _ ...
Paper dissected glove_ global vectors for word representation_ explained _ ...
Nikhil Jaiswal
Ā 
word2vec_summary_revised
word2vec_summary_revisedword2vec_summary_revised
word2vec_summary_revised
Bennett Bullock
Ā 

Similar to word2vec, LDA, and introducing a new hybrid algorithm: lda2vec (20)

Lda2vec text by the bay 2016 with notes
Lda2vec text by the bay 2016 with notesLda2vec text by the bay 2016 with notes
Lda2vec text by the bay 2016 with notes
Ā 
lda2vec Text by the Bay 2016
lda2vec Text by the Bay 2016lda2vec Text by the Bay 2016
lda2vec Text by the Bay 2016
Ā 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and Whither
Ā 
Interview presentation
Interview presentationInterview presentation
Interview presentation
Ā 
Recipe2Vec: Or how does my robot know whatā€™s tasty
Recipe2Vec: Or how does my robot know whatā€™s tastyRecipe2Vec: Or how does my robot know whatā€™s tasty
Recipe2Vec: Or how does my robot know whatā€™s tasty
Ā 
Skip gram and cbow
Skip gram and cbowSkip gram and cbow
Skip gram and cbow
Ā 
Word embeddings
Word embeddingsWord embeddings
Word embeddings
Ā 
Word2vec and Friends
Word2vec and FriendsWord2vec and Friends
Word2vec and Friends
Ā 
Text generation and_advanced_topics
Text generation and_advanced_topicsText generation and_advanced_topics
Text generation and_advanced_topics
Ā 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLP
Ā 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
Ā 
Word2vec ultimate beginner
Word2vec ultimate beginnerWord2vec ultimate beginner
Word2vec ultimate beginner
Ā 
Lecture1.pptx
Lecture1.pptxLecture1.pptx
Lecture1.pptx
Ā 
Word2Vec on Italian language
Word2Vec on Italian languageWord2Vec on Italian language
Word2Vec on Italian language
Ā 
Semantic similarity between two sentences in arabic
Semantic similarity between two sentences in arabicSemantic similarity between two sentences in arabic
Semantic similarity between two sentences in arabic
Ā 
Word_Embeddings.pptx
Word_Embeddings.pptxWord_Embeddings.pptx
Word_Embeddings.pptx
Ā 
Skip-gram Model Broken Down
Skip-gram Model Broken DownSkip-gram Model Broken Down
Skip-gram Model Broken Down
Ā 
Paper dissected glove_ global vectors for word representation_ explained _ ...
Paper dissected   glove_ global vectors for word representation_ explained _ ...Paper dissected   glove_ global vectors for word representation_ explained _ ...
Paper dissected glove_ global vectors for word representation_ explained _ ...
Ā 
word2vec_summary_revised
word2vec_summary_revisedword2vec_summary_revised
word2vec_summary_revised
Ā 
nlp_compromise
nlp_compromisenlp_compromise
nlp_compromise
Ā 

Recently uploaded

module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
levieagacer
Ā 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
Ā 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
Ā 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
Ā 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
Ā 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET
Ā 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
Ā 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
Ā 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
Ā 

Recently uploaded (20)

module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
Ā 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Ā 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
Ā 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Ā 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Ā 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
Ā 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Ā 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
Ā 
High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘
High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘
High Profile šŸ” 8250077686 šŸ“ž Call Girls Service in GTB NagaršŸ‘
Ā 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
Ā 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
Ā 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
Ā 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
Ā 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
Ā 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
Ā 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Ā 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Ā 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
Ā 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
Ā 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Ā 

word2vec, LDA, and introducing a new hybrid algorithm: lda2vec

  • 1. A word is worth a thousand vectors (word2vec, lda, and introducing lda2vec) Christopher Moody @ Stitch Fix Welcome, thanks for coming, having me, organizer NLP can be a messy aļ¬€air because you have to teach a computer about the irregularities and ambiguities of the English language in this sort of hierarchical sparse nature in all the grammar 3rd trimester, pregnant ā€œwears scrubsā€ ā€” medicine taking a trip ā€” a ļ¬x for vacation clothing power of word vectors promise is to sweep away a lot of issues
  • 2. About @chrisemoody Caltech Physics PhD. in astrostats supercomputing sklearn t-SNE contributor Data Labs at Stitch Fix github.com/cemoody Gaussian Processes t-SNE chainer deep learning Tensor Decomposition
  • 3. Credit Large swathes of this talk are from previous presentations by: ā€¢ Tomas Mikolov ā€¢ David Blei ā€¢ Christopher Olah ā€¢ Radim Rehurek ā€¢ Omer Levy & Yoav Goldberg ā€¢ Richard Socher ā€¢ Xin Rong ā€¢ Tim Hopper
  • 5. 1. king - man + woman = queen 2. Huge splash in NLP world 3. Learns from raw text 4. Pretty simple algorithm 5. Comes pretrained word2vec 1. Learns what words mean ā€” can solve analogies cleanly. 1. Not treating words as blocks, but instead modeling relationships 2. Distributed representations form the basis of more complicated deep learning systems 3. Shallow ā€” not deep learning! 1. Power comes from this simplicity ā€” super fast, lots of data 4. Get a lot of mileage out of this 1. Donā€™t need to model the wikipedia corpus before starting your own
  • 6. word2vec 1. Set up an objective function 2. Randomly initialize vectors 3. Do gradient descent
  • 7. w ord2vec word2vec: learn word vector vin from itā€™s surrounding context vin 1. Letā€™s talk about training ļ¬rst 2. In SVD and n-grams we built a co-occurence and transition probability matrices 3. Here we will learn the embedded representation directly, with no intermediates, update it w/ every example
  • 8. w ord2vec ā€œThe fox jumped over the lazy dogā€ Maximize the likelihood of seeing the words given the word over. P(the|over) P(fox|over) P(jumped|over) P(the|over) P(lazy|over) P(dog|over) ā€¦instead of maximizing the likelihood of co-occurrence counts. 1. Context ā€” the words surrounding the training word 2. Naively assume P(*|over) is independent conditional on the training word 3. Still a pretty simple assumption! Conditioning on just *over* no other secret parameters or anything
  • 10. w ord2vec P(vfox|vover) Should depend on the word vectors. P(fox|over) Trying to learn the word vectors, so letā€™s start with those (weā€™ll randomly initialize them to begin with)
  • 11. w ord2vec Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word. ā€œThe fox jumped over the lazy dogā€ P(vOUT|vIN)
  • 12. w ord2vec ā€œThe fox jumped over the lazy dogā€ vIN P(vOUT|vIN) Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word. IN = training word
  • 13. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word.
  • 14. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word.
  • 15. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word.
  • 16. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word.
  • 17. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word.
  • 18. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word.
  • 19. w ord2vec P(vOUT|vIN) ā€œThe fox jumped over the lazy dogā€ vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word.
  • 20. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word. ā€¦So that at a high level is what we want word2vec to do.
  • 21. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word. ā€¦So that at a high level is what we want word2vec to do.
  • 22. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word. ā€¦So that at a high level is what we want word2vec to do.
  • 23. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word. ā€¦So that at a high level is what we want word2vec to do.
  • 24. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word. ā€¦So that at a high level is what we want word2vec to do.
  • 25. w ord2vec ā€œThe fox jumped over the lazy dogā€ vOUT P(vOUT|vIN) vIN Twist: we have two vectors for every word. Should depend on whether itā€™s the input or the output. Also a context window around every input word. ā€¦So that at a high level is what we want word2vec to do. two for loops Thatā€™s it! A bit disengious to make this a giant network
  • 26. objective Measure loss between vIN and vOUT? vin . vout How should we deļ¬ne P(vOUT|vIN)? Now weā€™ve deļ¬ned the high-level update path for the algorithm. Need to deļ¬ne this prob exactly in order to deļ¬ne our updates. Boils down to diļ¬€ between in & out ā€” want to make as similar as possible, and then the probability will go up. Use cosine sim.
  • 27. w ord2vec vin . vout ~ 1 objective vin vout Dot product has these properties: Similar vectors have similarly near 1
  • 28. w ord2vec objective vin vout vin . vout ~ 0 Orthogonal vectors have similarity near 0
  • 29. w ord2vec objective vin vout vin . vout ~ -1 Orthogonal vectors have similarity near -1
  • 30. w ord2vec vin . vout āˆˆ [-1,1] objective But the inner product ranges from -1 to 1 (when normalized) ā€¦and weā€™d like a probability
  • 31. w ord2vec But weā€™d like to measure a probability. vin . vout āˆˆ [-1,1] objective But the inner product ranges from -1 to 1 (when normalized) ā€¦and weā€™d like a probability
  • 32. w ord2vec But weā€™d like to measure a probability. softmax(vin . vout āˆˆ [-1,1]) objective āˆˆ [0,1] Transform again using softmax
  • 33. w ord2vec But weā€™d like to measure a probability. softmax(vin . vout āˆˆ [-1,1]) Probability of choosing 1 of N discrete items. Mapping from vector space to a multinomial over words. objective Similar to logistic function for binary outcomes, but instead for 1 of N outcomes. So now weā€™re modeling the probability of a word showing up as the combination of the training word vector and the target word vector and transforming it to a 1 of N
  • 34. w ord2vec But weā€™d like to measure a probability. exp(vin . vout āˆˆ [0,1])softmax ~ objective So hereā€™s the actual form of the equation ā€” we normalize by the sum of all of the other possible pairs of word combinations
  • 35. w ord2vec But weā€™d like to measure a probability. exp(vin . vout āˆˆ [-1,1]) Ī£exp(vin . vk) softmax = objective Normalization term over all words k āˆˆ V So hereā€™s the actual form of the equation ā€” we normalize by the sum of all of the other possible pairs of word combinations two eļ¬€ects make vin and vout more similar make vin and every other word less similar
  • 36. w ord2vec But weā€™d like to measure a probability. exp(vin . vout āˆˆ [-1,1]) Ī£exp(vin . vk) softmax = = P(vout|vin) objective k āˆˆ V This is the kernel of the word2vec. Weā€™re just going to apply this operation every time we want to update the vectors. For every word, weā€™re going to have a context window, and then for every pair of words in that window and the input word, weā€™ll measure this probability.
  • 37. w ord2vec Learn by gradient descent on the softmax prob. For every example we see update vin vin := vin + P(vout|vin) objective vout := vout + P(vout|vin) ā€¦I wonā€™t go through the derivation of the gradient, but this is the general idea relatively simple, fast ā€” fast enough to read billions of words in a day
  • 39. word2vec if not convinced by qualitative resultsā€¦.
  • 40.
  • 41. Showing just 2 of the ~500 dimensions. Eļ¬€ectively weā€™ve PCAā€™d it
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52. If we only had locality and not regularity, this wouldnā€™t necessarily be true
  • 53.
  • 54.
  • 55. So we live in a vector space where operations like addition and subtraction are meaningful. So hereā€™s a few examples of this working. Really get the idea of these vectors as being ā€˜mixesā€™ of other ideas & vectors
  • 56. ITEM_3469 + ā€˜Pregnantā€™ SF is a person service Box
  • 57. + ā€˜Pregnantā€™ I love the stripes and the cut around my neckline was amazing someone else might write ā€˜grey and blackā€™ subtlety and nuance in that language We have lots of this interaction ā€” of order wikipedia amount ā€” far too much to manually annotate anything
  • 59. Stripes and are safe for maternity And also similar tones and ļ¬‚owy ā€” still great for expecting mothers
  • 61. LDA on Client Item Descriptions This shows the incredible amount of structure
  • 62. LDA on Item Descriptions (with Jay) clunky jewelry dangling delicate jewelry elsewhere
  • 63. LDA on Item Descriptions (with Jay) topics on patterns, styles ā€” this cluster is similarly described as high contrast tops with popping colors
  • 64. LDA on Item Descriptions (with Jay) bright dresses for a warm summer
  • 66. LDA on Item Descriptions (with Jay) not just visual topics, but also topics about ļ¬t
  • 67. Latent style vectors from text Pairwise gamma correlation from style ratings Diversity from ratings Diversity from text Lots of structure in both ā€” but the diversity much higher in the text Maybe obvious: but the way people describe items is fundamentally richer than the style ratings
  • 69. word2vec is local: one word predicts a nearby word ā€œI love ļ¬nding new designer brands for jeansā€ as if the world where one very long text string. no end of documents, no end of sentence, etc. and a window across words
  • 70. ā€œI love ļ¬nding new designer brands for jeansā€ But text is usually organized. as if the world where one very long text string. no end of documents, no end of sentence, etc.
  • 71. ā€œI love ļ¬nding new designer brands for jeansā€ But text is usually organized. as if the world where one very long text string. no end of documents, no end of sentence, etc.
  • 72. ā€œI love ļ¬nding new designer brands for jeansā€ In LDA, documents globally predict words. doc 7681 these are client comment which are short, only predict dozens of words but could be legal documents, or medical documents, 10k words ā€” here the diļ¬€erence between global and local algorithms is much more important
  • 73. [ -0.75, -1.25, -0.55, -0.12, +2.2] [ 0%, 9%, 78%, 11%] typical word2vec vector typical LDA document vector
  • 74. typical word2vec vector [ 0%, 9%, 78%, 11%] typical LDA document vector [ -0.75, -1.25, -0.55, -0.12, +2.2] All sum to 100%All real values
  • 75. 5D word2vec vector [ 0%, 9%, 78%, 11%] 5D LDA document vector [ -0.75, -1.25, -0.55, -0.12, +2.2] Sparse All sum to 100% Dimensions are absolute Dense All real values Dimensions relative LDA is a *mixture* w2v is a bunch of real numbers ā€” more like and *address* much easier to say to another human 78% of something rather than it is +2.2 of something and -1.25 of something else
  • 76. 100D word2vec vector [ 0%0%0%0%0% ā€¦ 0%, 9%, 78%, 11%] 100D LDA document vector [ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 ā€¦ -0.12, +2.2] Sparse All sum to 100% Dimensions are absolute Dense All real values Dimensions relative dense sparse
  • 77. 100D word2vec vector [ 0%0%0%0%0% ā€¦ 0%, 9%, 78%, 11%] 100D LDA document vector [ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 ā€¦ -0.12, +2.2] Similar in fewer ways (more interpretable) Similar in 100D ways (very ļ¬‚exible) +mixture +sparse
  • 78. can we do both? lda2vec series of exp grain of salt very new ā€” no good quantitative results only qualitative (but promising!)
  • 79. The goal: use all of this context to learn interpretable topics. P(vOUT | vIN): word2vec. @chrisemoody We use this at SF; on a typical table, w2v will use the word-to-word co-occurrences.
  • 80. word2vec, LDA. P(vOUT | vDOC). The goal: use all of this context to learn interpretable topics. This document is 80% high fashion, this document is 60% style. @chrisemoody LDA will use that doc ID column; you can use this to steer the business as a whole.
  • 81. word2vec, LDA. The goal: use all of this context to learn interpretable topics. This zip code is 80% hot climate, this zip code is 60% outdoors wear. @chrisemoody But it doesn't predict word-to-word relationships. In Texas, maybe I want more lone stars & stirrup icons; in Austin, maybe I want more bats.
  • 82. word2vec, LDA. The goal: use all of this context to learn interpretable topics. This client is 80% sporty, this client is 60% casual wear. @chrisemoody We'd love to learn client topics: are there 'types' of clients? A question every business asks, and this is the promise of lda2vec.
  • 83. lda2vec. word2vec predicts locally: one word predicts a nearby word. P(vOUT | vIN), vIN, vOUT. "PS! Thank you for such an awesome top"
  • 84. lda2vec. LDA predicts a word from a global context. doc_id=1846. P(vOUT | vDOC), vOUT, vDOC. "PS! Thank you for such an awesome top" But it doesn't predict word-to-word relationships.
  • 85. lda2vec. doc_id=1846, vIN, vOUT, vDOC. Can we predict a word both locally and globally? "PS! Thank you for such an awesome top"
  • 86. lda2vec. "PS! Thank you for such an awesome top", doc_id=1846, vIN, vOUT, vDOC. Can we predict a word both locally and globally? P(vOUT | vIN + vDOC). The doc vector captures long-distance dependencies; the word vector captures short-distance ones.
  • 87. lda2vec. doc_id=1846, vIN, vOUT, vDOC. *Very similar to Paragraph Vectors / doc2vec. Can we predict a word both locally and globally? "PS! Thank you for such an awesome top" P(vOUT | vIN + vDOC)
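A minimal sketch of that combined prediction, assuming random stand-in vectors and made-up ids (this is not the lda2vec library API): the context is simply the sum of the pivot word vector and the document vector, scored against an output word vector with a sigmoid, skip-gram-negative-sampling style.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_docs, dim = 5000, 200, 64

word_in  = rng.normal(scale=0.1, size=(n_words, dim))   # vIN vectors
word_out = rng.normal(scale=0.1, size=(n_words, dim))   # vOUT vectors
doc_vecs = rng.normal(scale=0.1, size=(n_docs, dim))    # vDOC vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(pivot_id, target_id, doc_id):
    """P(vOUT | vIN + vDOC): local word context plus global document context."""
    context = word_in[pivot_id] + doc_vecs[doc_id]
    return sigmoid(word_out[target_id] @ context)

print(score(pivot_id=42, target_id=7, doc_id=120))   # ids here are placeholders
```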
  • 88. lda2vec. This works! šŸ˜€ But vDOC isn't as interpretable as the LDA topic vectors. šŸ˜” Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, ...
  • 89. lda2vec. This works! šŸ˜€ But vDOC isn't as interpretable as the LDA topic vectors. šŸ˜” Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, ... A raw document vector is about as interpretable as a hash.
  • 91. lda2vec. This works! šŸ˜€ But vDOC isn't as interpretable as the LDA topic vectors. šŸ˜” We're missing mixtures & sparsity. Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, ...
  • 92. lda2vec. This works! šŸ˜€ But vDOC isn't as interpretable as the LDA topic vectors. šŸ˜” Let's make vDOC into a mixture... Too many documents. I really like that document X is 70% in topic 0, 30% in topic 1, ...
  • 93. lda2vec. Let's make vDOC into a mixture... vDOC = a vtopic1 + b vtopic2 + ... (up to k topics), a sum of other word vectors. The intuition here is that 'hanoi' = 'vietnam' + 'capital' and 'lufthansa' = 'germany' + 'airlines', so we think document vectors should also be a sum of word vectors.
  • 94. lda2vec. Let's make vDOC into a mixture... vDOC = a vtopic1 + b vtopic2 + ... Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication. This is the Twenty Newsgroups dataset: free and canonical.
  • 95. lda2vec. Let's make vDOC into a mixture... vDOC = a vtopic1 + b vtopic2 + ... topic 1 = "religion": Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication.
  • 96. lda2vec. Let's make vDOC into a mixture... vDOC = a vtopic1 + b vtopic2 + ... Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic. topic 1 = "religion": Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication.
  • 97. lda2vec. Let's make vDOC into a mixture... vDOC = a vtopic1 + b vtopic2 + ... topic 1 = "religion": Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication. topic 2 = "politics": Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic. The a, b coefficients tell you how much of each topic a document has.
  • 98. lda2vec. Let's make vDOC into a mixture... vDOC = 10% religion + 89% politics + ... topic 1 = "religion": Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication. topic 2 = "politics": Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic. The doc is now 10% religion and 89% politics; mixture models are powerful for interpretability.
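In code, the mixture is just a weighted sum over a small topic matrix; a sketch with made-up weights and random topic vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64
topic_vectors = rng.normal(size=(2, dim))   # row 0: "religion", row 1: "politics" (illustrative)
doc_weights = np.array([0.10, 0.89])        # 10% religion, 89% politics

# vDOC = a * v_topic1 + b * v_topic2 + ...
v_doc = doc_weights @ topic_vectors
```

The interpretability lives entirely in the weights; the topic vectors themselves stay in the same space as the word vectors.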
  • 99. lda2vec. Let's make vDOC sparse. [-0.75, -1.25, ...] vDOC = a vreligion + b vpolitics + ... Now, the first time I did this it was hard to interpret. What does -1.2 politics mean? The math works, but it's not intuitive.
  • 100. lda2vec. Let's make vDOC sparse. vDOC = a vreligion + b vpolitics + ... How much of this doc is in religion, how much in politics? But that doesn't work when you have more than a few topics.
  • 101. lda2vec. Let's make vDOC sparse. vDOC = a vreligion + b vpolitics + ... How much of this doc is in religion, how much in cars? But that doesn't work when you have more than a few topics.
  • 102. lda2vec. Let's make vDOC sparse. {a, b, c, ...} ~ dirichlet(alpha). vDOC = a vreligion + b vpolitics + ... This is a trick we can steal from Bayesian methods: make the weights Dirichlet. Skipping the technical details: make everything sum to 100%, penalize non-zero weights, and force the model to make a weight non-zero only with lots of evidence.
  • 103. lda2vec. Let's make vDOC sparse. {a, b, c, ...} ~ dirichlet(alpha). vDOC = a vreligion + b vpolitics + ... This has a sparsity-inducing effect, similar to the lasso or L1 regularization, but the Dirichlet gives a few dimensions that sum to 100%. I can say to the CEO: this set of docs could have been spread over 100 topics, but we picked only the best ones.
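Roughly what that looks like in code (a sketch of the idea, not the library's implementation): keep unconstrained weights per document, squash them onto the simplex with a softmax so they sum to 100%, and add a Dirichlet log-prior with alpha < 1 to the loss so mass concentrates on a few topics.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neg_dirichlet_log_prior(proportions, alpha=0.7):
    """-log p(proportions | Dirichlet(alpha)), up to a constant.
    With alpha < 1, this penalty rewards putting most mass on a few topics."""
    return -np.sum((alpha - 1.0) * np.log(proportions + 1e-12))

raw_weights = np.array([2.0, 1.5, -3.0, -3.0])            # free parameters for one document
proportions = softmax(raw_weights)                        # non-negative, sums to 100%
sparsity_penalty = neg_dirichlet_log_prior(proportions)   # added to the word-prediction loss
```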
  • 104. word2vec, LDA. P(vOUT | vIN + vDOC): lda2vec. The goal: use all of this context to learn interpretable topics. @chrisemoody This document is 80% high fashion, this document is 60% style. Going back to our problem: lda2vec is going to use all the info here.
  • 105. word2vec, LDA. P(vOUT | vIN + vDOC + vZIP): lda2vec. The goal: use all of this context to learn interpretable topics. @chrisemoody Adding a column is just adding a term, like adding features in an ML model.
  • 106. word2vec, LDA. P(vOUT | vIN + vDOC + vZIP): lda2vec. The goal: use all of this context to learn interpretable topics. This zip code is 80% hot climate, this zip code is 60% outdoors wear. @chrisemoody These are in addition to doc topics, like 'rec SF'.
  • 107. word2vec, LDA. P(vOUT | vIN + vDOC + vZIP + vCLIENTS): lda2vec. The goal: use all of this context to learn interpretable topics. This client is 80% sporty, this client is 60% casual wear. @chrisemoody Client topics: sporty, casual. This is where, if she says '3rd trimester', we identify a future mother, and 'scrubs' means medicine.
  • 108. word2vec, LDA. P(vOUT | vIN + vDOC + vZIP + vCLIENTS), P(sold | vCLIENTS): lda2vec. The goal: use all of this context to learn interpretable topics. @chrisemoody We can also make the topics supervised so that they predict an outcome. This helps fine-tune topics so they correlate with your favorite business metric, aligns topics with expectations, and helps us guess the leading causes when revenue goes up.
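A hedged sketch of piling on contexts plus a supervised head; the vector names (v_zip, v_client, w_sold) are placeholders for illustration, not the real feature names in the lda2vec code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
dim = 64
v_in, v_doc, v_zip, v_client = rng.normal(size=(4, dim))   # stand-in context vectors
v_out = rng.normal(size=dim)
w_sold = rng.normal(size=dim)

# every extra ID column is one more additive term in the context
context = v_in + v_doc + v_zip + v_client
p_word = sigmoid(v_out @ context)      # P(vOUT | vIN + vDOC + vZIP + vCLIENT)

# supervised head: nudges client topics to correlate with a business outcome
p_sold = sigmoid(w_sold @ v_client)    # P(sold | vCLIENT)
```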
  • 109. github.com/cemoody/lda2vec: uses pyLDAvis, API reference docs (no narrative docs), GPU, decent test coverage. @chrisemoody
  • 110. "PS! Thank you for such an awesome idea" @chrisemoody doc_id=1846. Can we model topics down to sentences? lda2lstm. SF is all about mixing cutting-edge algorithms, but we absolutely need interpretability; the human component to algorithms is not negotiable. Could we demand the model make us a sentence that is 80% religion, 10% politics? Classify at the word level, LSTM at the sentence level, LDA at the document level.
  • 111. "PS! Thank you for such an awesome idea" @chrisemoody doc_id=1846. Can we represent the internal LSTM states as a Dirichlet mixture? Dirichlet-squeezing the internal states and manipulations may help us understand the science of LSTM dynamics, because seriously, WTF is going on there?
  • 112. Can we model topics down to sentences? lda2lstm. "PS! Thank you for such an awesome idea" doc_id=1846 @chrisemoody Can we model topics to images? lda2ae, with TJ Torres. Can we also extend this to image generation? TJ is working on a ridiculous VAE/GAN model... can we throw in a topic model? Can we say: make me an image that is 80% sweater, 10% zippers, and 10% elbow patches?
  • 119. Crazy Approaches: Paragraph Vectors (just extend the context window), Content dependency (change the window grammatically), Social word2vec (deepwalk) (a sentence is a walk on the graph), Spotify (a sentence is a playlist of song_ids), Stitch Fix (a sentence is a shipment of five items).
  • 121. CBOW: "The fox jumped over the lazy dog". Guess the word given the context; ~20x faster (this is the alternative). vOUT, vIN, vIN, vIN, vIN, vIN, vIN. SkipGram: "The fox jumped over the lazy dog". vOUT, vOUT, vIN, vOUT, vOUT, vOUT, vOUT. Guess the context given the word; better at syntax (this is the one we went over). CBOW sums word vectors and loses the order in the sentence. Both are good at semantic relationships: child and kid are nearby, as is the gender axis in man and woman. If you blur words over the scale of the context, about 5 words, you lose a lot of grammatical nuance, but skipgram preserves order; it preserves the relationship in pluralizing, for example.
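If you want to try both variants yourself, gensim exposes the switch directly; a sketch on a toy corpus (parameter names are for gensim 4.x, where vector_size replaced the older size argument):

```python
from gensim.models import Word2Vec

sentences = [["the", "fox", "jumped", "over", "the", "lazy", "dog"]]  # toy corpus

# sg=1: skip-gram, guess the context given the word (the variant walked through above)
skipgram = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)

# sg=0: CBOW, guess the word given the summed context (roughly 20x faster, blurrier on syntax)
cbow = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)
```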
  • 122. Shows that the many words similar to 'vacation' actually come in lots of flavors: wedding words (bachelorette, rehearsals), holiday/event words (birthdays, brunch, christmas, thanksgiving), seasonal words (spring, summer), trip words (getaway), and destinations.
  • 123. LDA Results (context: history). I loved every choice in this fix!! Great job! Great Stylist. Perfect.
  • 124. LDA Results (context: history). Body Fit: My measurements are 36-28-32, if that helps. I like wearing some clothing that is fitted. Very hard for me to find pants that fit right.
  • 125. LDA Results (context: history). Sizing: Really enjoyed the experience and the pieces; sizing for tops was too big. Looking forward to my next box! Excited for next.
  • 126. LDA Results (context: history). Almost Bought: It was a great fix. Loved the two items I kept and the three I sent back were close! Perfect.
  • 127. What I didn't mention: you need a lot of text (train your own only if you have a specialized vocabulary), cleaning the text, memory & performance, traditional databases aren't well-suited, false positives. You want hundreds of millions of words: 1,000 books, 500,000 comments, or 4,000,000 tweets, and a high-memory, high-performance multicore machine. Training can take several hours to several days but shouldn't need frequent retraining; if you use pretrained vectors, this isn't an issue. Databases: modern SQL systems aren't well-suited to the vector addition, subtraction, and multiplication that searching in vector space requires. There are a few libraries that will help you quickly find the most similar items: annoy, ball trees, locality-sensitive hashing (LSH), or FLANN (see the sketch below). False positives & exactness: despite the impressive results that come with word vectorization, no NLP technique is perfect. Take care that your system is robust to results that a computer deems relevant but an expert human wouldn't.
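For the similarity-search problem, a library like annoy keeps vector arithmetic and nearest-neighbor lookups out of your SQL database; a sketch with random stand-in vectors:

```python
import numpy as np
from annoy import AnnoyIndex

dim = 100
index = AnnoyIndex(dim, 'angular')           # angular distance ~ cosine similarity

vectors = np.random.randn(10000, dim)        # stand-ins for your word/item vectors
for i, vec in enumerate(vectors):
    index.add_item(i, vec.tolist())
index.build(10)                              # more trees: better recall, bigger index

query = vectors[0] + vectors[1]              # e.g. item + 'pregnant'-style arithmetic
print(index.get_nns_by_vector(query.tolist(), 10))   # ids of the 10 nearest items
```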
  • 128. and now for something completely crazy
  • 129. All of the following ideas will change what 'words' and 'context' represent, but we'll still use the same w2v algorithm.
  • 130. paragraph vector. What about summarizing documents? On the day he took office, President Obama reached out to America's enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that
  • 131. On the day he took office, President Obama reached out to America's enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that The framework nuclear agreement he reached with Iran on Thursday did not provide the definitive answer to whether Mr. Obama's audacious gamble will pay off. The fist Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed. paragraph vector: normal skipgram extends C words before and C words after. IN, OUT, OUT. Except we stay inside a sentence.
  • 132. On the day he took office, President Obama reached out to America's enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that The framework nuclear agreement he reached with Iran on Thursday did not provide the definitive answer to whether Mr. Obama's audacious gamble will pay off. The fist Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed. paragraph vector: a document vector simply extends the context to the whole document. IN, OUT, OUT, OUT, OUT, doc_1347.
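This is essentially what gensim's Doc2Vec (Paragraph Vectors) implements: the document tag acts as one more token in every context window. A sketch on two toy documents; dv is the gensim 4.x attribute (older versions call it docvecs), and the tag simply mirrors the slide's doc_1347.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["obama", "reached", "out", "to", "enemies"], tags=["doc_1347"]),
    TaggedDocument(words=["framework", "nuclear", "agreement", "with", "iran"], tags=["doc_1348"]),
]

model = Doc2Vec(docs, vector_size=50, window=5, min_count=1, epochs=40)
print(model.dv["doc_1347"])   # the learned document vector
```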
  • 134. Translation (using just a rotation matrix), Mikolov 2013: English, Spanish, matrix rotation. Blows my mind. Explain the plot: there is not a complicated NN here. You still have to learn the rotation matrix, but it generalizes very nicely. We have analogies for every linear-algebra op as a linguistic operator: + and - and matrix multiplies. A robust framework and new tools to do science on words.
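The translation trick is a linear map fit on a small seed dictionary of word pairs; Mikolov et al. solve a least-squares problem of this shape, and constraining the map to a rotation gives the orthogonal Procrustes variant. A sketch with random stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 100, 500                         # seed dictionary of translation pairs

X_en = rng.normal(size=(n_pairs, dim))          # English word vectors (stand-ins)
X_es = rng.normal(size=(n_pairs, dim))          # Spanish word vectors (stand-ins)

# Least squares: min_W || X_en @ W - X_es ||^2
W, *_ = np.linalg.lstsq(X_en, X_es, rcond=None)

# Pure rotation (orthogonal Procrustes): W_rot = U @ Vt from the SVD of X_en.T @ X_es
U, _, Vt = np.linalg.svd(X_en.T @ X_es)
W_rot = U @ Vt

# To translate: map an English vector across, then find its nearest Spanish neighbor.
v_es_hat = X_en[0] @ W
```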
  • 136. context dependent context. "Australian scientist discovers star with telescope". Levy & Goldberg 2014. What if we change the context?
  • 137. context dependent context. "Australian scientist discovers star with telescope": context. Levy & Goldberg 2014.
  • 138. context dependent context. BoW vs DEPS: topically-similar vs 'functionally' similar. Levy & Goldberg 2014.
  • 139. context dependent context. Levy & Goldberg 2014 also show that SGNS is simply factorizing: w · c = PMI(w, c) - log k. This is completely amazing! Intuition: positive associations (canada, snow) are stronger in humans than negative associations (what is the opposite of Canada?). It also means we can use SVD-like techniques to get a convex w2v: it uses fast linalg libraries and a compressed word-count matrix, so storage is better too... but it's not online.
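That factorization view suggests a "convex word2vec": build the shifted positive PMI matrix from co-occurrence counts and take an SVD. A sketch assuming a small dense count matrix cooc (a real corpus needs sparse matrices and a truncated SVD routine):

```python
import numpy as np

def sppmi_word_vectors(cooc, k=5, dim=100):
    """Shifted positive PMI + SVD, in the spirit of Levy & Goldberg (2014).
    cooc[i, j] = count of word i appearing with context j."""
    total = cooc.sum()
    p_w = cooc.sum(axis=1, keepdims=True) / total
    p_c = cooc.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc / total) / (p_w * p_c))
    sppmi = np.maximum(pmi - np.log(k), 0.0)     # shift by log k, clip negatives
    sppmi = np.nan_to_num(sppmi)
    U, S, _ = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])         # rows are word vectors
```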
  • 140. deepwalk, Perozzi et al. 2014: learn word vectors from sentences. "The fox jumped over the lazy dog", vOUT, vOUT, vOUT, vOUT, vOUT, vOUT. 'Words' are graph vertices; 'sentences' are random walks on the graph. word2vec.
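The deepwalk recipe in a few lines (a sketch, not the original implementation): generate random walks over the graph, treat each walk as a sentence, and hand the corpus to word2vec unchanged.

```python
import random
from gensim.models import Word2Vec

def random_walk(graph, start, length=10):
    """graph: dict mapping node -> list of neighbour nodes."""
    walk = [start]
    for _ in range(length - 1):
        neighbours = graph[walk[-1]]
        if not neighbours:
            break
        walk.append(random.choice(neighbours))
    return [str(node) for node in walk]          # tokens must be strings for word2vec

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}            # toy graph
walks = [random_walk(graph, node) for node in graph for _ in range(20)]
model = Word2Vec(walks, sg=1, vector_size=32, window=5, min_count=1)
```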
  • 141. Playlists at Spotify: context, sequence learning. 'Words' are songs; 'sentences' are playlists.
  • 143. Fixes at Stitch Fix: sequence learning. Let's try: 'words' are styles; 'sentences' are fixes.
  • 144. Fixes at Stitch Fix: context. Learn similarity between styles because they co-occur; learn 'coherent' styles. sequence learning.
  • 147. Fixes at Stitch Fix? context, sequence learning. Nearby regions are consistent 'closets'.
  • 150. A specific lda2vec model: our text blob is a comment that comes from a region_id and a style_id.
  • 159. We can measure the similarity between topic vectors m and n and word vectors w. This gets you the 'top' words in a topic, so you can figure out what that topic is.
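A sketch of that lookup with hypothetical arrays: cosine similarity between one topic vector and every word vector, then the top few words label the topic.

```python
import numpy as np

def top_words(topic_vec, word_vecs, vocab, n=10):
    """Return the n vocabulary words whose vectors are closest (cosine) to topic_vec."""
    w = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    t = topic_vec / np.linalg.norm(topic_vec)
    sims = w @ t
    return [vocab[i] for i in np.argsort(-sims)[:n]]
```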
  • 161. lda2vec. Let's make vDOC into a mixture... vDOC = 10% religion + 89% politics + ... topic 1 = "religion": Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication. topic 2 = "politics": Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic. This is now on the 20 Newsgroups dataset... The doc is now 10% religion and 89% politics; mixture models are powerful for interpretability.