1. WORD2VEC
FROM INTUITION TO PRACTICE USING GENSIM
Edgar Marca
matiskay@gmail.com
Python Peru Meetup
September 1st, 2016
Lima - Perú
2. About Edgar Marca
Software Engineer at Love Mondays.
One of the organizers of Data Science Lima Meetup.
Machine Learning and Data Science enthusiast.
I speak a little Portuguese.
4. Data Science Lima Meetup
Stats
5 meetups held, and the 6th is just around the corner.
410 Datanauts in the Meetup group.
329 people in the Facebook group.
Organizers
Manuel Solorzano.
Dennis Barreda.
Freddy Cahuas.
Edgar Marca.
5. Data Science Lima Meetup
Figure: Photo of the fifth Data Science Lima Meetup.
7. Data Never Sleeps
Figure: How much data is generated every minute? (Source: Data Never Sleeps 3.0, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/)
9. Introduction
Text is the core business of internet companies today.
Machine Learning and natural language processing techniques are applied to big datasets to improve search, ranking, and many other tasks (spam detection, ads recommendations, email categorization, machine translation, speech recognition, etc.).
15. One-hot Representation
Example:
Let V = {the, hotel, nice, motel}
$w_{\text{the}} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad w_{\text{hotel}} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad w_{\text{nice}} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad w_{\text{motel}} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$
We represent each word as a completely independent entity.
This word representation does not directly give us any notion of similarity.
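A minimal sketch of this representation in Python with numpy (the function name is illustrative):

import numpy as np

# Vocabulary V = {the, hotel, nice, motel}
vocabulary = ["the", "hotel", "nice", "motel"]

def one_hot(word, vocabulary):
    # Build the one-hot vector for `word` over `vocabulary`:
    # all zeros except a 1 at the word's index.
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1.0
    return vector

w_hotel = one_hot("hotel", vocabulary)  # array([0., 1., 0., 0.])
w_motel = one_hot("motel", vocabulary)  # array([0., 0., 0., 1.])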
16. One-hot Representation
For instance,
$\langle w_{\text{hotel}}, w_{\text{motel}} \rangle_{\mathbb{R}^4} = 0 \quad (1)$
$\langle w_{\text{hotel}}, w_{\text{cat}} \rangle_{\mathbb{R}^4} = 0 \quad (2)$
We can try to reduce the size of this space from $\mathbb{R}^4$ to something smaller and find a subspace that encodes the relationships between words.
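Continuing the numpy sketch from the previous slide, the inner product of two distinct one-hot vectors is always zero:

# Distinct one-hot vectors are orthogonal, so their dot product is 0:
print(np.dot(w_hotel, w_motel))  # 0.0
# The only nonzero inner product is a vector with itself:
print(np.dot(w_hotel, w_hotel))  # 1.0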
17. One-hot Representation
Problems
The dimension depends on the vocabulary size.
Leads to data sparsity, so we need more data.
Provides no useful information to the system.
Encodings are arbitrary.
18. Bag-of-words representation
Sum of one-hot codes.
Ignores the order of words.
Examples:
vocabulary = (monday, tuesday, is, a, today)
Monday Monday = [2, 0, 0, 0, 0]
today is monday = [1, 0, 1, 0, 1]
today is tuesday = [0, 1, 1, 0, 1]
is a monday today = [1, 0, 1, 1, 1]
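A minimal sketch of this representation (the function name is illustrative):

vocabulary = ["monday", "tuesday", "is", "a", "today"]

def bag_of_words(sentence, vocabulary):
    # Sum of one-hot codes: count occurrences of each vocabulary word,
    # ignoring the order of words.
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocabulary]

print(bag_of_words("Monday Monday", vocabulary))      # [2, 0, 0, 0, 0]
print(bag_of_words("today is monday", vocabulary))    # [1, 0, 1, 0, 1]
print(bag_of_words("is a monday today", vocabulary))  # [1, 0, 1, 1, 1]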
20. Language Modeling (Unigrams, Bigrams, etc)
A language model is a probabilistic model that assigns a probability to any sequence of $n$ words, $P(w_1, w_2, \ldots, w_n)$.
Unigrams
Assuming that the word occurrences are completely independent:
$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i) \quad (3)$
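A toy sketch of the unigram model, estimating each $P(w_i)$ from corpus counts (the corpus and names are illustrative):

from collections import Counter

corpus = "today is monday today is tuesday".split()
counts = Counter(corpus)  # today: 2, is: 2, monday: 1, tuesday: 1
total = len(corpus)       # 6

def unigram_probability(sentence):
    # P(w1..wn) = product of P(wi), assuming independent occurrences.
    p = 1.0
    for word in sentence.split():
        p *= counts[word] / total
    return p

print(unigram_probability("today is monday"))  # (2/6) * (2/6) * (1/6)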
21. Language Modeling (Unigrams, Bigrams, etc)
Bigrams
The probability of the sequence depends on the pairwise probability of each word in the sequence given the word that precedes it.
$P(w_1, w_2, \ldots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1}) \quad (4)$
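A matching sketch for the bigram model, reusing the toy corpus from the unigram example:

from collections import Counter

corpus = "today is monday today is tuesday".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_probability(sentence):
    # P(w1..wn) = product over i >= 2 of P(wi | wi-1),
    # with P(wi | wi-1) estimated as count(wi-1, wi) / count(wi-1).
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return p

print(bigram_probability("today is monday"))  # P(is|today) * P(monday|is) = 0.5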
22. Word Embeddings
A set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size (a "continuous space").
Vector space models (VSMs) represent (embed) words in a continuous vector space.
Semantically similar words are mapped to nearby points.
The basic idea is the Distributional Hypothesis: words that appear in the same contexts share semantic meaning.
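In gensim, a pre-trained embedding can be loaded and queried for nearby words; a minimal sketch, assuming a word2vec-format file such as the GoogleNews vectors (the file path is a placeholder):

from gensim.models import KeyedVectors

# The path is a placeholder; the GoogleNews vectors are one widely used
# pre-trained model. (In gensim releases older than 1.0 this loader lived
# on Word2Vec instead of KeyedVectors.)
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Semantically similar words are mapped to nearby points:
print(model.most_similar("hotel", topn=3))
print(model.similarity("hotel", "motel"))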
25. Word2Vec
Figure: Two original papers published in association with word2vec
by Mikolov et al. (2013)
Efficient Estimation of Word Representations in Vector
Space https://arxiv.org/abs/1301.3781.
Distributed Representations of Words and Phrases and
their Compositionality https://arxiv.org/abs/1310.4546.
33. Word2Vec
$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$
$v_{\text{paris}} - v_{\text{france}} + v_{\text{italy}} \approx v_{\text{rome}}$
Learns from raw text.
Made a huge splash in the NLP world.
Pre-trained models are available (useful if you don't have any specialized vocabulary).
Word2vec is a computationally efficient model for learning word embeddings.
Word2Vec is a successful example of "shallow" learning: a very simple feedforward neural network with a single hidden layer, backpropagation, and no non-linearities.
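The analogies at the top of this slide can be reproduced with gensim's vector arithmetic; a sketch reusing the model loaded earlier:

# v_king - v_man + v_woman: the nearest vector should be v_queen.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# v_paris - v_france + v_italy: the nearest vector should be v_rome.
# (Whether the query words are capitalized depends on how the model was trained.)
print(model.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))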
39. What the Fuck Are Trump Supporters Thinking?
They gathered four million tweets belonging to more than
two thousand hard-core Trump supporters.
Distances between those word vectors encoded the semantic distance between their associated words (e.g., the vector representation of the word "morons" was near "idiots" but far away from "funny").
Link: https://medium.com/adventurous-social-science/
what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d
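Training a model like theirs on raw tweet text is a few lines in gensim; a sketch (the file name, tokenization, and hyperparameters are all illustrative):

from gensim.models import Word2Vec

# "tweets.txt" is a placeholder: one tweet per line, whitespace-tokenized
# here for simplicity.
tweets = [line.lower().split() for line in open("tweets.txt")]

# vector_size is the gensim 4.x name; gensim 3.x and earlier call it size.
tweet_model = Word2Vec(tweets, vector_size=100, window=5, min_count=5, workers=4)

print(tweet_model.wv.most_similar("morons"))  # expect words like "idiots" nearby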
44. Takeaways
If you don't have enough data, you can use pre-trained models.
Remember: garbage in, garbage out.
Every dataset will produce different results.
Use Word2vec as a feature extractor (see the sketch below).
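One simple way to use word2vec as a feature extractor is to average the word vectors of a document; a sketch, assuming the KeyedVectors model from the earlier slides (the function name is illustrative):

import numpy as np

def document_vector(model, tokens):
    # Average the vectors of in-vocabulary tokens into one fixed-size
    # feature vector for the whole document.
    vectors = [model[token] for token in tokens if token in model]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

features = document_vector(model, "the hotel was nice".split())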