1. WORD2VEC
FROM INTUITION TO PRACTICE USING GENSIM
Edgar Marca
matiskay@gmail.com
Python Peru Meetup
September 1st, 2016
Lima - Perú
2. About Edgar Marca
Software Engineer at Love Mondays.
One of the organizers of Data Science Lima Meetup.
Machine Learning and Data Science enthusiast.
I speak a little Portuguese.
4. Data Science Lima Meetup
Stats
5 meetups held, and the 6th is just around the corner.
410 Datanauts in the Meetup group.
329 people in the Facebook group.
Organizers
Manuel Solorzano.
Dennis Barreda.
Freddy Cahuas.
Edgar Marca.
5. Data Science Lima Meetup
Figure: Photo of the fifth Data Science Lima Meetup.
7. Data Never Sleeps
Figure: How much data is generated every minute? (Source: Data Never Sleeps 3.0, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/)
9. Introduction
Text is the core business of internet companies today.
Machine Learning and natural language processing techniques are applied to big datasets to improve search, ranking, and many other tasks (spam detection, ads recommendations, email categorization, machine translation, speech recognition, etc.).
15. One-hot Representation
Example:
Let V = {the, hotel, nice, motel}
$w_{\text{the}} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad w_{\text{hotel}} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad w_{\text{nice}} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad w_{\text{motel}} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$
We represent each word as a completely independent entity.
This word representation does not directly give us any notion of similarity.
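A minimal sketch of this representation in Python with numpy (the function name is illustrative):

import numpy as np

# Vocabulary V = {the, hotel, nice, motel}
vocabulary = ["the", "hotel", "nice", "motel"]

def one_hot(word, vocabulary):
    # Build the one-hot vector for `word` over `vocabulary`:
    # all zeros except a 1 at the word's index.
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1.0
    return vector

w_hotel = one_hot("hotel", vocabulary)  # array([0., 1., 0., 0.])
w_motel = one_hot("motel", vocabulary)  # array([0., 0., 0., 1.])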
16. One-hot Representation
For instance,
$\langle w_{\text{hotel}}, w_{\text{motel}} \rangle_{\mathbb{R}^4} = 0 \quad (1)$
$\langle w_{\text{hotel}}, w_{\text{cat}} \rangle_{\mathbb{R}^4} = 0 \quad (2)$
We can try to reduce the size of this space from $\mathbb{R}^4$ to something smaller and find a subspace that encodes the relationships between words.
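Continuing the numpy sketch from the previous slide, the inner product of two distinct one-hot vectors is always zero:

# Distinct one-hot vectors are orthogonal, so their dot product is 0:
print(np.dot(w_hotel, w_motel))  # 0.0
# The only nonzero inner product is a vector with itself:
print(np.dot(w_hotel, w_hotel))  # 1.0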
17. One-hot Representation
Problems
The dimension depends on the vocabulary size.
Leads to data sparsity, so we need more data.
Provides no useful information to the system.
Encodings are arbitrary.
18. Bag-of-words representation
Sum of one-hot codes.
Ignores the order of words.
Examples:
vocabulary = (monday, tuesday, is, a, today)
Monday Monday = [2, 0, 0, 0, 0]
today is monday = [1, 0, 1, 0, 1]
today is tuesday = [0, 1, 1, 0, 1]
is a monday today = [1, 0, 1, 1, 1]
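A minimal sketch of this representation (the function name is illustrative):

vocabulary = ["monday", "tuesday", "is", "a", "today"]

def bag_of_words(sentence, vocabulary):
    # Sum of one-hot codes: count occurrences of each vocabulary word,
    # ignoring the order of words.
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocabulary]

print(bag_of_words("Monday Monday", vocabulary))      # [2, 0, 0, 0, 0]
print(bag_of_words("today is monday", vocabulary))    # [1, 0, 1, 0, 1]
print(bag_of_words("is a monday today", vocabulary))  # [1, 0, 1, 1, 1]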
20. Language Modeling (Unigrams, Bigrams, etc)
A language model is a probabilistic model that assigns a probability to any sequence of $n$ words, $P(w_1, w_2, \ldots, w_n)$.
Unigrams
Assuming that the word occurrences are completely independent:
$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i) \quad (3)$
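A toy sketch of the unigram model, estimating each $P(w_i)$ from corpus counts (the corpus and names are illustrative):

from collections import Counter

corpus = "today is monday today is tuesday".split()
counts = Counter(corpus)  # today: 2, is: 2, monday: 1, tuesday: 1
total = len(corpus)       # 6

def unigram_probability(sentence):
    # P(w1..wn) = product of P(wi), assuming independent occurrences.
    p = 1.0
    for word in sentence.split():
        p *= counts[word] / total
    return p

print(unigram_probability("today is monday"))  # (2/6) * (2/6) * (1/6)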
21. Language Modeling (Unigrams, Bigrams, etc)
Bigrams
The probability of the sequence depends on the pairwise probability of each word in the sequence given the word that precedes it.
$P(w_1, w_2, \ldots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1}) \quad (4)$
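A matching sketch for the bigram model, reusing the toy corpus from the unigram example:

from collections import Counter

corpus = "today is monday today is tuesday".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_probability(sentence):
    # P(w1..wn) = product over i >= 2 of P(wi | wi-1),
    # with P(wi | wi-1) estimated as count(wi-1, wi) / count(wi-1).
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return p

print(bigram_probability("today is monday"))  # P(is|today) * P(monday|is) = 0.5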
22. Word Embeddings
A set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size (a "continuous space").
Vector space models (VSMs) represent (embed) words in a continuous vector space.
Semantically similar words are mapped to nearby points.
The basic idea is the Distributional Hypothesis: words that appear in the same contexts share semantic meaning.
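In gensim, a pre-trained embedding can be loaded and queried for nearby words; a minimal sketch, assuming a word2vec-format file such as the GoogleNews vectors (the file path is a placeholder):

from gensim.models import KeyedVectors

# The path is a placeholder; the GoogleNews vectors are one widely used
# pre-trained model. (In gensim releases older than 1.0 this loader lived
# on Word2Vec instead of KeyedVectors.)
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Semantically similar words are mapped to nearby points:
print(model.most_similar("hotel", topn=3))
print(model.similarity("hotel", "motel"))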
25. Word2Vec
Figure: Two original papers published in association with word2vec
by Mikolov et al. (2013)
Efficient Estimation of Word Representations in Vector
Space https://arxiv.org/abs/1301.3781.
Distributed Representations of Words and Phrases and
their Compositionality https://arxiv.org/abs/1310.4546.
33. Word2Vec
$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$
$v_{\text{paris}} - v_{\text{france}} + v_{\text{italy}} \approx v_{\text{rome}}$
Learns from raw text.
Made a huge splash in the NLP world.
Pre-trained models are available (useful if you don't have any specialized vocabulary).
Word2vec is a computationally efficient model for learning word embeddings.
Word2Vec is a successful example of "shallow" learning: a very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
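The analogies at the top of this slide can be reproduced with gensim's vector arithmetic; a sketch reusing the model loaded earlier:

# v_king - v_man + v_woman: the nearest vector should be v_queen.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# v_paris - v_france + v_italy: the nearest vector should be v_rome.
# (Whether the query words are capitalized depends on how the model was trained.)
print(model.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))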
39. What the Fuck Are Trump Supporters Thinking?
They gathered four million tweets belonging to more than
two thousand hard-core Trump supporters.
Distances between those word vectors encoded the semantic distance between their associated words (e.g., the vector representation of the word "morons" was near "idiots" but far away from "funny").
Link: https://medium.com/adventurous-social-science/
what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d
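Training a model like theirs on raw tweet text is a few lines in gensim; a sketch (the file name, tokenization, and hyperparameters are all illustrative):

from gensim.models import Word2Vec

# "tweets.txt" is a placeholder: one tweet per line, whitespace-tokenized
# here for simplicity.
tweets = [line.lower().split() for line in open("tweets.txt")]

# vector_size is the gensim 4.x name; gensim 3.x and earlier call it size.
tweet_model = Word2Vec(tweets, vector_size=100, window=5, min_count=5, workers=4)

print(tweet_model.wv.most_similar("morons"))  # expect words like "idiots" nearby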
44. Takeaways
If you don't have enough data, you can use pre-trained models.
Remember: garbage in, garbage out.
Every dataset will produce different results.
Use Word2vec as a feature extractor (see the sketch below).
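One simple way to use word2vec as a feature extractor is to average the word vectors of a document; a sketch, assuming the KeyedVectors model from the earlier slides (the function name is illustrative):

import numpy as np

def document_vector(model, tokens):
    # Average the vectors of in-vocabulary tokens into one fixed-size
    # feature vector for the whole document.
    vectors = [model[token] for token in tokens if token in model]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

features = document_vector(model, "the hotel was nice".split())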