lda2vec
(word2vec, and lda)
Christopher Moody
@ Stitch Fix
About
@chrisemoody
Caltech Physics
PhD. in astrostats supercomputing
sklearn t-SNE contributor
Data Labs at Stitch Fix
github.com/cemoody
Gaussian Processes t-SNE
chainer
deep learning
Tensor Decomposition
1. word2vec
2. lda
3. lda2vec
1. king - man + woman = queen
2. Huge splash in NLP world
3. Learns from raw text
4. Pretty simple algorithm
5. Comes pretrained
word2vec
1. Set up an objective function
2. Randomly initialize vectors
3. Do gradient descent
word2vec
word2vec: learn word vector w from its surrounding context
“The fox jumped over the lazy dog”
Maximize the likelihood of seeing the surrounding words given the word “over”.
P(the|over)
P(fox|over)
P(jumped|over)
P(the|over)
P(lazy|over)
P(dog|over)
…instead of maximizing the likelihood of co-occurrence counts.
P(fox|over): what should this be?
P(vfox|vover): it should depend on the word vectors.
word2vec
“The fox jumped over the lazy dog”
P(w|c)
Extract pairs from the context window around every input word.
(The same slide repeats as the context window slides across each word in the sentence, pairing each pivot word w with its nearby context words c.)
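As a concrete illustration (not code from the talk), a minimal Python sketch of that pair extraction, assuming a symmetric window of two words on each side; the sentence and window size are illustrative:

# Minimal sketch of skip-gram (word, context) pair extraction; window size is illustrative.
sentence = "The fox jumped over the lazy dog".lower().split()
window = 2

pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((word, sentence[j]))  # (pivot word w, context word c)

print([(w, c) for (w, c) in pairs if w == "over"])
# [('over', 'fox'), ('over', 'jumped'), ('over', 'the'), ('over', 'lazy')]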
objective
How should we define P(w|c)? Measure a loss between w and c?
w · c
objective
w · c ~ 1 when the vectors are aligned: vcanada · vsnow ~ 1
w · c ~ 0 when they are unrelated: vcanada · vdesert ~ 0
w · c ~ -1 when they point in opposite directions
w · c ∈ [-1, 1]
objective
But we’d like to measure a probability: w · c ∈ [-1, 1], while σ(c·w) ∈ [0, 1].
Similar (w, c) pairs score near 1; dissimilar pairs score near 0.
Loss function: L = σ(c·w)
Logistic (binary) choice: is the (context, word) combination from our dataset?
The skip-gram negative-sampling model
objective
With L = σ(c·w) alone, the trivial solution is context = word for all vectors.
So draw random words from the vocabulary as negatives: L = σ(c·w) + σ(-c·wneg)
With multiple negatives, we discriminate positive from negative samples:
L = σ(c·w) + σ(-c·wneg) + … + σ(-c·wneg)
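A small numeric sketch of the slide’s objective (not from the talk); real implementations typically maximize the log of each sigmoid term averaged over many (word, context) pairs, and the vectors below are random placeholders:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(w, c, negatives):
    # Positive term rewards the observed (c, w) pair; each negative term rewards
    # pushing c away from a randomly drawn word: L = σ(c·w) + Σ σ(-c·wneg).
    positive = sigmoid(np.dot(c, w))
    negative = sum(sigmoid(-np.dot(c, w_neg)) for w_neg in negatives)
    return positive + negative

rng = np.random.default_rng(0)
dim = 50
w = rng.normal(size=dim)            # vector for the pivot word
c = rng.normal(size=dim)            # vector for the observed context word
w_negs = rng.normal(size=(5, dim))  # 5 randomly drawn negative words
print(sgns_objective(w, c, w_negs))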
The SGNS Model (Levy & Goldberg 2014)
L = σ(c·w) + σ(-c·wneg)
ci·wj = PMI(Mij) - log k
…is extremely similar to matrix factorization from ‘traditional’ NLP!
The SGNS Model (Levy & Goldberg 2014)
L = σ(c·w) + Σσ(-c·wneg)
ci·wj = log [ (#(ci,wj)/n) / ( k · (#(wj)/n) · (#(ci)/n) ) ]
      = log [ (popularity of c,w) / ( k · (popularity of c) · (popularity of w) ) ]
‘traditional’ NLP
PMI
99% of word2vec is counting. And you can count words in SQL:
count how many times you saw the pair (c, w), how many times you saw c, and how many times you saw w.
…and this takes ~5 minutes to compute on a single core.
Computing the SVD is a completely standard math library call.
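A toy sketch of that counting view (Levy & Goldberg 2014), not the SQL used in the talk: count pair and word occurrences, build the shifted PMI matrix, and take an SVD. The corpus, window, and k are illustrative.

from collections import Counter
import numpy as np

corpus = [doc.lower().split() for doc in
          ["the fox jumped over the lazy dog",
           "the dog slept under the lazy fox"]]

pair_counts, word_counts = Counter(), Counter()
for tokens in corpus:
    word_counts.update(tokens)
    for i, w in enumerate(tokens):
        for c in tokens[max(0, i - 2):i] + tokens[i + 1:i + 3]:
            pair_counts[(w, c)] += 1          # count how many times you saw (c, w)

vocab = sorted(word_counts)
idx = {w: i for i, w in enumerate(vocab)}
n = sum(pair_counts.values())
k = 5                                         # number of negative samples; shifts PMI by log k

M = np.zeros((len(vocab), len(vocab)))
for (w, c), cnt in pair_counts.items():
    pmi = np.log((cnt / n) / ((word_counts[w] / n) * (word_counts[c] / n)))
    M[idx[w], idx[c]] = max(pmi - np.log(k), 0.0)   # positive shifted PMI

U, S, Vt = np.linalg.svd(M)                   # standard math library call
word_vectors = U * np.sqrt(S)                 # one low-rank embedding per vocabulary word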
word2vec
ITEM_3469 + ‘Pregnant’
= ITEM_701333
= ITEM_901004
= ITEM_800456
what about LDA?
LDA on Client Item Descriptions
LDA on Item Descriptions (with Jay)
lda vs word2vec
LDA: Bayesian graphical model. word2vec: ML neural model.
word2vec is local:
one word predicts a nearby word
“I love finding new designer brands for jeans”
But text is usually organized.
In LDA, documents globally predict words. (doc 7681)
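For reference, a minimal sketch of fitting LDA on a handful of documents with gensim; the toy documents and topic count are illustrative, not the Stitch Fix data:

from gensim import corpora
from gensim.models import LdaModel

docs = [
    "i love finding new designer brands for jeans".split(),
    "the sizing for tops was too big but great style".split(),
    "pants that fit right are very hard for me to find".split(),
]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2, passes=10, random_state=0)
print(lda.print_topics())                 # word distributions per topic
print(lda.get_document_topics(bow[0]))    # this document's topic proportions (sum to 100%)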
typical word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2 ]
Dense. All real values. Dimensions are relative.

typical LDA document vector
[ 0%, 9%, 78%, 11% ]
Sparse. All sum to 100%. Dimensions are absolute.

100D word2vec vector
[ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2 ]
Similar in 100D ways (very flexible).

100D LDA document vector
[ 0% 0% 0% 0% 0% … 0%, 9%, 78%, 11% ]
Similar in fewer ways (more interpretable).

+mixture
+sparse
can we do both? lda2vec
[diagram: word2vec. Skip grams from sentences feed a word vector (#hidden units) trained with a negative sampling loss; in “Lufthansa is a German airline and when”, the word “German” predicts a nearby word.]
word2vec predicts locally:
one word predicts a nearby word
[diagram: lda2vec. Skip grams from sentences feed a word vector; a document vector is added to it to form the context vector used in the negative sampling loss. The document vector is the document proportion over #topics (e.g. 41% 26% 34%, derived from a free document weight vector) multiplied by a topic matrix of #topics × #hidden units.]
Document vector predicts a word from a global context
[same lda2vec architecture diagram]
We’re missing mixtures & sparsity!
[lda2vec diagram: the document vector is a document-proportion-weighted sum over the topic matrix]
Now it’s a mixture.
[lda2vec diagram]
Words closest to one topic vector: Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication
topic 1 = “religion”: Trinitarian, baptismal, Pentecostals, Bede, schismatics, excommunication
Words closest to another topic vector: Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic
topic 2 = “politics”: Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic
Sparsity!
[lda2vec diagram: the document proportion sparsifies over training]
time → t=0: 34% 32% 34%   t=10: 41% 26% 34%   t=∞: 99% 1% 0%
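A small sketch of why the proportions end up sparse: document weights go through a softmax to become proportions, and a Dirichlet-style prior with concentration below 1 scores sparse proportions higher. The exact loss term is in the lda2vec paper and repo; the alpha and weights here are illustrative.

import numpy as np

def proportions(doc_weights):
    e = np.exp(doc_weights - doc_weights.max())
    return e / e.sum()                          # sums to 100%

def dirichlet_log_prior(p, alpha=0.7):
    # With alpha < 1, this is larger for proportions near 0% / 100% (sparse).
    return np.sum((alpha - 1.0) * np.log(p))

p_early = proportions(np.array([0.1, 0.0, 0.1]))   # roughly 34% 31% 34%
p_late  = proportions(np.array([6.0, 1.0, 0.0]))   # roughly 99% 1% 0%
print(p_early.round(2), dirichlet_log_prior(p_early))
print(p_late.round(2),  dirichlet_log_prior(p_late))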
@chrisemoody
lda2vec.com
+ API docs
+ Examples
+ GPU
+ Tests
@chrisemoody
lda2vec.com
@chrisemoody
Example Hacker News comments
Topics: http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/lda2vec.ipynb
Word vectors: https://github.com/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/word_vectors.ipynb
@chrisemoody
lda2vec.com
If you want…
…human-interpretable doc topics, use LDA.
…machine-useable word-level features, use word2vec.
…topics over user / doc / region / etc. features, and you like to experiment a lot (and have a GPU), use lda2vec.
?@chrisemoody
Multithreaded
Stitch Fix
@chrisemoody
lda2vec.com
Credit
Large swathes of this talk are from
previous presentations by:
• Tomas Mikolov
• David Blei
• Christopher Olah
• Radim Rehurek
• Omer Levy & Yoav Goldberg
• Richard Socher
• Xin Rong
• Tim Hopper
lda2lstm: Can we model topics to sentences?
“PS! Thank you for such an awesome idea” (doc_id=1846)
lda2ae: Can we model topics to images? (TJ Torres)
@chrisemoody
and now for something completely crazy
4. Fun Stuff
translation (using just a rotation matrix)
Mikolov 2013
English → Spanish via a learned rotation matrix
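A sketch of that idea on toy data: the original work fit a linear map between the two embedding spaces from a bilingual word list; constraining the map to a rotation via orthogonal Procrustes is shown here only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))        # "English" vectors for a seed lexicon
true_R, _ = np.linalg.qr(rng.normal(size=(100, 100)))
Y = X @ true_R                          # pretend "Spanish" vectors are a rotated copy

U, _, Vt = np.linalg.svd(X.T @ Y)       # orthogonal Procrustes solution
W = U @ Vt                              # the learned rotation

print(np.allclose(W, true_R))           # recovers the rotation on this toy data
# translate a new word: spanish_vec ≈ english_vec @ W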
deepwalk
Perozzi et al. 2014
learn word vectors from sentences: “The fox jumped over the lazy dog” → vOUT for each word
‘words’ are graph vertices
‘sentences’ are random walks on the graph
word2vec
Playlists at Spotify (context, sequence learning)
‘words’ are song indices
‘sentences’ are playlists
Erik Bernhardsson: great performance on ‘related artists’
Fixes at Stitch Fix (context, sequence learning)
Let’s try:
‘words’ are items
‘sentences’ are fixes
Learn similarity between styles because they co-occur; learn ‘coherent’ styles.
Got lots of structure! Nearby regions are consistent ‘closets’.
?@chrisemoody
Multithreaded
Stitch Fix
context dependent
Levy & Goldberg 2014
“Australian scientist discovers star with telescope”
context = +/- 2 words (BoW) vs. a grammatical dependency context (DEPS)
BoW vs DEPS: topically-similar vs ‘functionally’ similar neighbors
?@chrisemoody
Multithreaded
Stitch Fix
Crazy Approaches
Paragraph Vectors
(Just extend the context window)
Context dependency
(Change the window grammatically)
Social word2vec (deepwalk)
(Sentence is a walk on the graph)
Spotify
(Sentence is a playlist of song_ids)
Stitch Fix
(Sentence is a shipment of five items)
CBOW
“The fox jumped over the lazy dog”
Guess the word given the context (many vIN predict one vOUT).
~20x faster. (this is the alternative.)
SkipGram
“The fox jumped over the lazy dog”
Guess the context given the word (one vIN predicts many vOUT).
Better at syntax. (this is the one we went over)
lda2vec
theory of lda2vec: softmax(vOUT · (vIN + vDOC))
Let’s say that vDOC adds topics: vDOC = vtopic0 + vtopic1
This works! 😀 But vDOC isn’t as interpretable as the topic vectors. 😔
Let’s make vDOC sparse: vDOC = a vtopic1 + b vtopic2 + …
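Putting the diagram and that formula together, a shape-level sketch; the sizes and random values are illustrative, and the real model trains these with the negative sampling loss rather than a full softmax:

import numpy as np

rng = np.random.default_rng(0)
n_topics, n_hidden, vocab_size = 3, 5, 1000

doc_weights  = rng.normal(size=n_topics)                 # free weights for one document
topic_matrix = rng.normal(size=(n_topics, n_hidden))     # one vector per topic
v_in         = rng.normal(size=n_hidden)                 # pivot word vector
v_out        = rng.normal(size=(vocab_size, n_hidden))   # output word vectors

doc_proportion = np.exp(doc_weights) / np.exp(doc_weights).sum()  # sums to 100%
v_doc   = doc_proportion @ topic_matrix                  # mixture of topic vectors
context = v_in + v_doc                                   # local word + global document

scores = v_out @ context                                 # softmax(vOUT · (vIN + vDOC))
probs  = np.exp(scores - scores.max())
probs /= probs.sum()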
lda2vec
pyLDAvis of lda2vec
lda2vec
LDA Results (context: History)
Great Stylist / Perfect: “I loved every choice in this fix!! Great job!”
Body Fit: “My measurements are 36-28-32. If that helps. I like wearing some clothing that is fitted. Very hard for me to find pants that fit right.”
Sizing / Excited for next: “Really enjoyed the experience and the pieces, sizing for tops was too big. Looking forward to my next box!”
Almost Bought / Perfect: “It was a great fix. Loved the two items I kept and the three I sent back were close!”
All of the following ideas will change what
‘words’ and ‘context’ represent.
paragraph vector
What about summarizing documents?
“On the day he took office, President Obama reached out to America’s enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that…”
“The framework nuclear agreement he reached with Iran on Thursday did not provide the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.”
Normal skipgram extends C words before, and C words after: an IN word predicts nearby OUT words.
A document vector simply extends the context to the whole document: a document token (e.g. doc_1347) acts as an extra IN that predicts OUT words anywhere in the document.
from gensim.models import Doc2Vec

fn = "item_document_vectors"
model = Doc2Vec.load(fn)
matches = model.most_similar('pregnant')
matches = list(filter(lambda x: 'SENT_' in x[0], matches))
#	['...I	am	currently	23	weeks	pregnant...',		
#		'...I'm	now	10	weeks	pregnant...',		
#		'...not	showing	too	much	yet...',		
#		'...15	weeks	now.	Baby	bump...',		
#		'...6	weeks	post	partum!...',		
#		'...12	weeks	postpartum	and	am	nursing...',		
#		'...I	have	my	baby	shower	that...',		
#		'...am	still	breastfeeding...',		
#		'...I	would	love	an	outfit	for	a	baby	shower...']
sentence search
More Related Content

Viewers also liked (13)
• Discussion on the Distributed Search Engine (Yusuke Fujisaka)
• P2p search engine
• Journal club: Meta-Prod2Vec (Yuya Kanemoto)
• What do we get from Twitter - and what not? (Katrin Weller)
• Fabrikatyr lda topic modelling practical application (Tim Carnus)
• Topic Modelling to identify behavioral trends in online communities (Conor Duke)
• Distributed representation of sentences and documents (Abdullah Khan Zehady)
• Drawing word2vec (Kai Sasaki)
• EMNLP2014読み会 "Efficient Non-parametric Estimation of Multiple Embeddings per ... (Yuya Unno)
• Emnlp読み会資料
• Word representations in vector space (Abdullah Khan Zehady)
• [FW Invest] Près de 2,3 milliards d’euros investis dans la Tech française en ... (FrenchWeb.fr)
• LDA Beginner's Tutorial (Wayne Lee)
Similar to lda2vec Text by the Bay 2016 (20)
• Yoav Goldberg: Word Embeddings What, How and Whither (MLReview)
• Word2vec and Friends
• From grep to BERT (QAware GmbH)
• A Taste of Python - Devdays Toronto 2009 (Jordan Baker)
• The TclQuadcode Compiler (Donal Fellows)
• Building WordSpaces via Random Indexing from simple to complex spaces (Pierpaolo Basile)
• "SSC" - Geometria e Semantica del Linguaggio (Alumni Mathematica)
• CS571: Distributional semantics (Jinho Choi)
• Recipe2Vec: Or how does my robot know what’s tasty (PyData)
• Ur Domain Haz Monoids DDDx NYC 2014 (Cyrille Martraire)
• Lesson 7: The Derivative
• AI&BigData Lab 2016. Анатолий Востряков: Перевод с "плохого" английского на "... (GeeksLab Odessa)
• Word embeddings (Shruti kar)
• Deep into Ruby Code Coverage
• Python Performance 101 (Ankur Gupta)
• Chapter 7 drill
• Simultaneous,Deep,Transfer,Across, Domains,and,Tasks (Alejandro Cartas)
• DataWeave 2.0 - MuleSoft CONNECT 2019 (Sabrina Marechal)
• The Error of Our Ways (Kevlin Henney)
• An introduction to Rust: the modern programming language to develop safe and ... (Claudio Capobianco)
Recently uploaded

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

lda2vec Text by the Bay 2016

  • 2. About @chrisemoody Caltech Physics PhD. in astrostats supercomputing sklearn t-SNE contributor Data Labs at Stitch Fix github.com/cemoody Gaussian Processes t-SNE chainer deep learning Tensor Decomposition
  • 4. 1. king - man + woman = queen 2. Huge splash in NLP world 3. Learns from raw text 4. Pretty simple algorithm 5. Comes pretrained word2vec
  • 5. 1. Set up an objective function 2. Randomly initialize vectors 3. Do gradient descent word2vec
  • 6. w ord2vec word2vec: learn word vector w from it’s surrounding context w
  • 7. w ord2vec “The fox jumped over the lazy dog” Maximize the likelihood of seeing the words given the word over. P(the|over) P(fox|over) P(jumped|over) P(the|over) P(lazy|over) P(dog|over) …instead of maximizing the likelihood of co-occurrence counts.
  • 9. w ord2vec P(vfox|vover) Should depend on the word vectors. P(fox|over)
  • 10. w ord2vec “The fox jumped over the lazy dog” P(w|c) Extract pairs from context window around every input word.
  • 11. w ord2vec “The fox jumped over the lazy dog” c P(w|c) Extract pairs from context window around every input word.
  • 12. w ord2vec “The fox jumped over the lazy dog” w P(w|c) c Extract pairs from context window around every input word.
  • 13. w ord2vec P(w|c) w c “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 14. w ord2vec “The fox jumped over the lazy dog” P(w|c) w c Extract pairs from context window around every input word.
  • 15. w ord2vec P(w|c) c w “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 16. w ord2vec P(w|c) c w “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 17. w ord2vec P(w|c) c w “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 18. w ord2vec P(w|c) w c “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 19. w ord2vec P(w|c) cw “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 20. w ord2vec P(w|c) cw “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 21. w ord2vec P(w|c) cw “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 22. w ord2vec P(w|c) c w “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 23. w ord2vec P(w|c) c w “The fox jumped over the lazy dog” Extract pairs from context window around every input word.
  • 24. objective Measure loss between w and c? How should we define P(w|c)?
  • 25. objective w . c How should we define P(w|c)? Measure loss between w and c?
  • 26. w ord2vec w . c ~ 1 objective w c vcanada . vsnow ~ 1
  • 27. w ord2vec w . c ~ 0 objective w c vcanada . vdesert ~0
  • 28. w ord2vec w . c ~ -1 objective w c
  • 29. w ord2vec w . c ∈ [-1,1] objective
  • 30. w ord2vec But we’d like to measure a probability. w . c ∈ [-1,1] objective
  • 31. w ord2vec But we’d like to measure a probability. objective ∈ [0,1]σ(c·w)
  • 32. w ord2vec But we’d like to measure a probability. objective ∈ [0,1]σ(c·w) w c w c SimilarDissimilar
  • 33. w ord2vec Loss function: objective L=σ(c·w) Logistic (binary) choice. Is the (context, word) combination from our dataset?
  • 34. w ord2vec The skip-gram negative-sampling model objective Trivial solution is that context = word for all vectors L=σ(c·w) w c
  • 35. w ord2vec The skip-gram negative-sampling model L = σ(c·w) + σ(-c·wneg) objective Draw random words in vocabulary.
  • 36. w ord2vec The skip-gram negative-sampling model objective Discriminate positive from negative samples Multiple Negative L = σ(c·w) + σ(-c·wneg) +…+ σ(-c·wneg)
  • 37. w ord2vec The SGNS Model PM I ci·wj = PMI(Mij) - log k …is extremely similar to matrix factorization! Levy & Goldberg 2014 L = σ(c·w) + σ(-c·wneg)
  • 38. w ord2vec The SGNS Model PM I Levy & Goldberg 2014 ‘traditional’ NLP L = σ(c·w) + σ(-c·wneg) ci·wj = PMI(Mij) - log k …is extremely similar to matrix factorization!
  • 39. w ord2vec The SGNS Model L = σ(c·w) + Σσ(-c·w) PM I ci·wj = log Levy & Goldberg 2014 #(ci,wj)/n k #(wj)/n #(ci)/n ‘traditional’ NLP
  • 40. w ord2vec The SGNS Model L = σ(c·w) + Σσ(-c·w) PM I ci·wj = log Levy & Goldberg 2014 popularity of c,w k (popularity of c) (popularity of w) ‘traditional’ NLP
  • 41. w ord2vec PM I 99% of word2vec is counting. And you can count words in SQL
  • 42. w ord2vec PM I Count how many times you saw c·w Count how many times you saw c Count how many times you saw w
  • 43. w ord2vec PM I …and this takes ~5 minutes to compute on a single core. Computing SVD is a completely standard math library.
  • 45.
  • 46.
  • 47.
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
  • 64.
  • 72. word2vec is local: one word predicts a nearby word “I love finding new designer brands for jeans”
  • 73. “I love finding new designer brands for jeans” But text is usually organized.
  • 74. “I love finding new designer brands for jeans” But text is usually organized.
  • 75. “I love finding new designer brands for jeans” In LDA, documents globally predict words. doc 7681
  • 76. typical word2vec vector [ 0%, 9%, 78%, 11%] typical LDA document vector [ -0.75, -1.25, -0.55, -0.12, +2.2] All sum to 100%All real values
  • 77. 5D word2vec vector [ 0%, 9%, 78%, 11%] 5D LDA document vector [ -0.75, -1.25, -0.55, -0.12, +2.2] Sparse All sum to 100% Dimensions are absolute Dense All real values Dimensions relative
  • 78. 100D word2vec vector [ 0%0%0%0%0% … 0%, 9%, 78%, 11%] 100D LDA document vector [ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2] Sparse All sum to 100% Dimensions are absolute Dense All real values Dimensions relative dense sparse
  • 79. 100D word2vec vector [ 0%0%0%0%0% … 0%, 9%, 78%, 11%] 100D LDA document vector [ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2] Similar in fewer ways (more interpretable) Similar in 100D ways (very flexible) +mixture +sparse
  • 80. can we do both? lda2vec
  • 81. -1.9 0.85 -0.6 -0.3 -0.5 Lufthansa is a German airline and when fox #hidden units Skip grams from sentences Word vector Negative sampling loss Lufthansa is a German airline and when German word2vec predicts locally: one word predicts a nearby word
  • 82. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when German Document vector predicts a word from a global context 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when
  • 83. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when We’re missing mixtures & sparsity! German
  • 84. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when We’re missing mixtures & sparsity!
  • 85. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when Now it’s a mixture. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Topic matrix Document proportion Document weight Document vector Context vector x +
  • 86. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when Trinitarian baptismal Pentecostals Bede schismatics excommunication 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 #topics Document weight
  • 87. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when topic 1 = “religion” Trinitarian baptismal Pentecostals Bede schismatics excommunication 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 #topics Document weight
  • 88. [same lda2vec architecture diagram] Top words of another topic: Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic.
  • 89. [same lda2vec architecture diagram] topic 2 = "politics": Milosevic, absentee, Indonesia, Lebanese, Israelis, Karadzic.
  • 90-92. [same lda2vec architecture diagram, repeated]
  • 93. [same lda2vec architecture diagram] Sparsity! Document proportions sharpen over training time: t=0: 34% 32% 34%; t=10: 41% 26% 34%; t=∞: 99% 1% 0%.
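To make the diagram above concrete, here is a minimal NumPy sketch of how the pieces combine. The array names and sizes are illustrative placeholders, not the lda2vec library's actual API.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    n_topics, n_hidden = 3, 256                          # e.g. 3 topics, 256 hidden units
    topic_matrix = np.random.randn(n_topics, n_hidden)   # one row per topic
    doc_weights = np.random.randn(n_topics)              # unnormalized weights, learned per document

    # Document proportions (e.g. 41% / 26% / 34%) come from squashing the weights;
    # the document vector is then a mixture of the topic vectors.
    doc_proportion = softmax(doc_weights)
    doc_vector = doc_proportion @ topic_matrix

    # The context vector that feeds the skip-gram negative-sampling loss
    # is the pivot word vector plus the document vector.
    word_vector = np.random.randn(n_hidden)               # e.g. the vector for "Lufthansa"
    context_vector = word_vector + doc_vector

Because the proportions sum to 100%, pushing them toward a corner (the t=∞ row above) is what makes a document readable as a handful of named topics.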
  • 95. + API docs + Examples + GPU + Tests @chrisemoody lda2vec.com
  • 96. @chrisemoody Example Hacker News comments Topics: http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/lda2vec.ipynb Word vectors: https://github.com/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/word_vectors.ipynb
  • 97. @chrisemoody lda2vec.com If you want… human-interpretable doc topics, use LDA. If you want machine-useable word-level features, use word2vec. If you like to experiment a lot, and want topics over user / doc / region / etc. features, use lda2vec (and you have a GPU).
  • 100. Credit Large swathes of this talk are from previous presentations by: • Tomas Mikolov • David Blei • Christopher Olah • Radim Rehurek • Omer Levy & Yoav Goldberg • Richard Socher • Xin Rong • Tim Hopper
  • 101. “PS! Thank you for such an awesome idea” @chrisemoody doc_id=1846 Can we model topics to sentences? lda2lstm
  • 102. Can we model topics to sentences? lda2lstm “PS! Thank you for such an awesome idea”doc_id=1846 @chrisemoody Can we model topics to images? lda2ae TJ Torres
  • 103. and now for something completely crazy 4 Fun Stuff
  • 104. translation (using just a rotation matrix) Mikolov 2013 [figure: English and Spanish word-vector spaces aligned by a matrix rotation]
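This can be sketched with ordinary least squares or, to match the "rotation matrix" phrasing, with orthogonal Procrustes. Everything below is placeholder data, not Mikolov's actual setup: X holds English word vectors and Y the vectors of their Spanish translations, row-aligned via a small bilingual dictionary.

    import numpy as np

    X = np.random.randn(5000, 300)   # English vectors (placeholder)
    Y = np.random.randn(5000, 300)   # aligned Spanish vectors (placeholder)

    # Unconstrained linear map: minimize ||X W - Y||^2.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # Rotation-only map (orthogonal Procrustes): W_rot = U V^T from the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W_rot = U @ Vt

    # Translate: map an English vector and look up its nearest Spanish neighbour.
    v_es = X[0] @ W_rot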
  • 105. deepwalk Perozzi et al. 2014 learn word vectors from sentences: "The fox jumped over the lazy dog" [skip-gram diagram with vOUT context slots] 'words' are graph vertices, 'sentences' are random walks on the graph, then run word2vec.
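A small sketch of that recipe, assuming gensim (4.x) and networkx are available; the graph, walk length, and hyperparameters are placeholders.

    import random
    import networkx as nx
    from gensim.models import Word2Vec

    G = nx.karate_club_graph()                     # stand-in graph

    def random_walk(G, start, length=10):
        walk = [start]
        for _ in range(length - 1):
            walk.append(random.choice(list(G.neighbors(walk[-1]))))
        return [str(v) for v in walk]              # word2vec expects string tokens

    # 'sentences' are random walks, 'words' are vertices
    walks = [random_walk(G, v) for v in G.nodes() for _ in range(10)]
    model = Word2Vec(walks, vector_size=64, window=5, sg=1, min_count=1)
    print(model.wv.most_similar("0"))              # vertices that co-occur on walks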
  • 106. Playlists at Spotify context sequence learning ‘words’ are song indices ‘sentences’ are playlists
  • 108. Fixes at Stitch Fix sequence learning Let’s try: ‘words’ are items ‘sentences’ are fixes
  • 109. Fixes at Stitch Fix context Learn similarity between styles because they co-occur Learn ‘coherent’ styles sequence learning
  • 112. Fixes at Stitch Fix? context sequence learning Nearby regions are consistent ‘closets’
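The playlist and fix examples follow the same recipe: treat each sequence of item ids as a sentence and run plain skip-gram over it. A hedged sketch assuming gensim 4.x; the item ids below are made up.

    from gensim.models import Word2Vec

    # Each 'sentence' is one fix (a shipment of items) or one playlist of song ids.
    sequences = [
        ["item_101", "item_55", "item_12", "item_88", "item_7"],
        ["item_55", "item_301", "item_12", "item_9", "item_42"],
    ]

    # Items that are shipped (or played) together end up close in vector space.
    model = Word2Vec(sequences, vector_size=64, window=5, sg=1, min_count=1)
    similar_items = model.wv.most_similar("item_12")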
  • 115. context dependent "Australian scientist discovers star with telescope" Levy & Goldberg 2014
  • 116. context dependent "Australian scientist discovers star with telescope" (context taken from the dependency parse) Levy & Goldberg 2014
  • 117. context dependent BoW vs DEPS: topically-similar vs 'functionally' similar Levy & Goldberg 2014
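A sketch of what dependency-based contexts can look like, assuming spaCy and its small English model are installed; this illustrates the idea rather than reproducing Levy & Goldberg's exact pipeline.

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Australian scientist discovers star with telescope")

    # Instead of a linear window, pair each word with its syntactic neighbours,
    # labelled by the dependency relation (roughly the DEPS contexts).
    pairs = []
    for token in doc:
        for child in token.children:
            pairs.append((token.text, f"{child.dep_}_{child.text}"))
            pairs.append((child.text, f"{child.dep_}I_{token.text}"))   # inverse relation

    # e.g. ('discovers', 'nsubj_scientist'), ('scientist', 'nsubjI_discovers'), ...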
  • 120. Crazy Approaches Paragraph Vectors (Just extend the context window) Context dependency (Change the window grammatically) Social word2vec (deepwalk) (Sentence is a walk on the graph) Spotify (Sentence is a playlist of song_ids) Stitch Fix (Sentence is a shipment of five items)
  • 122. CBOW: "The fox jumped over the lazy dog" guess the word (vOUT) given the context words (vIN). ~20x faster. (this is the alternative.) SkipGram: "The fox jumped over the lazy dog" guess the context (vOUT) given the word (vIN). Better at syntax. (this is the one we went over)
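In gensim, for instance, the two variants are one flag apart; a toy sketch with a placeholder corpus (gensim 4.x assumed):

    from gensim.models import Word2Vec

    sentences = [["the", "fox", "jumped", "over", "the", "lazy", "dog"]]

    # sg=1: skip-gram, guess the context given the word (the variant covered above).
    skipgram = Word2Vec(sentences, sg=1, vector_size=32, window=2, min_count=1)

    # sg=0: CBOW, guess the word given the averaged context; much faster to train.
    cbow = Word2Vec(sentences, sg=0, vector_size=32, window=2, min_count=1)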
  • 123. lda2vec vDOC = a vtopic1 + b vtopic2 +… Let’s make vDOC sparse
  • 124. lda2vec This works! 😀 But vDOC isn't as interpretable as the topic vectors. 😔 vDOC = topic0 + topic1 Let's say that vDOC adds
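One way to push vDOC toward a sparse mixture (a sketch of the general idea, not necessarily the library's exact loss) is to keep the proportions as a softmax over free weights and add a Dirichlet-style penalty with concentration below one:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    doc_weights = np.random.randn(3)      # free parameters, one per topic
    p = softmax(doc_weights)              # document proportions, sum to 100%

    # With alpha < 1 this term is smallest when most of the mass sits in a few
    # topics, so adding it to the training loss gradually sparsifies p.
    alpha = 0.1
    sparsity_loss = -np.sum((alpha - 1.0) * np.log(p + 1e-12))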
  • 129. LDA Results context History I loved every choice in this fix!! Great job! Great Stylist Perfect
  • 130. LDA Results context History Body Fit My measurements are 36-28-32. If that helps. I like wearing some clothing that is fitted. Very hard for me to find pants that fit right.
  • 131. LDA Results context History Sizing Really enjoyed the experience and the pieces, sizing for tops was too big. Looking forward to my next box! Excited for next
  • 132. LDA Results context History Almost Bought It was a great fix. Loved the two items I kept and the three I sent back were close! Perfect
  • 133. All of the following ideas will change what ‘words’ and ‘context’ represent.
  • 134. paragraph vector What about summarizing documents? On the day he took office, President Obama reached out to America’s enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that
  • 135. On the day he took office, President Obama reached out to America’s enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that The framework nuclear agreement he reached with Iran on Thursday did not provide the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed. paragraph vector Normal skipgram extends C words before, and C words after. IN OUT OUT
  • 136. On the day he took office, President Obama reached out to America’s enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that The framework nuclear agreement he reached with Iran on Thursday did not provide the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed. paragraph vector A document vector simply extends the context to the whole document. IN OUT OUT OUT OUT doc_1347
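gensim's Doc2Vec implements this paragraph-vector idea; a minimal sketch (gensim 4.x assumed), where the two tiny documents and the tag "doc_1347" are only illustrative:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each document gets a tag; the tag's vector is shared by every window in it.
    corpus = [
        TaggedDocument(words="on the day he took office".split(), tags=["doc_1346"]),
        TaggedDocument(words="the framework nuclear agreement he reached".split(), tags=["doc_1347"]),
    ]

    model = Doc2Vec(corpus, vector_size=64, window=5, min_count=1, epochs=20)
    doc_vec = model.dv["doc_1347"]        # the learned document vector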