WORD2VEC
FROM INTUITION TO PRACTICE USING GENSIM
Edgar Marca
matiskay@gmail.com
Python Peru Meetup
September 1st, 2016
Lima - Perú
About Edgar Marca
Software Engineer at Love Mondays.
One of the organizers of Data Science Lima Meetup.
Machine Learning and Data Science enthusiast.
I speak a little Portuguese.
DATA SCIENCE LIMA MEETUP
Data Science Lima Meetup
Facts
5 meetups held, with the 6th just around the corner.
410 "Datanautas" in the Meetup group.
329 people in the Facebook group.
Organizers
Manuel Solorzano.
Dennis Barreda.
Freddy Cahuas.
Edgar Marca.
Data Science Lima Meetup
Figure: Photo from the fifth Data Science Lima Meetup.
DATA
Data Never Sleeps
Figure: How much data is generated every minute?
Source: Data Never Sleeps 3.0, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
NATURAL LANGUAGE PROCESSING
Introduction
Text is the core business of internet companies today.
Machine learning and natural language processing
techniques are applied to big datasets to improve search,
ranking and many other tasks (spam detection, ad
recommendations, email categorization, machine translation,
speech recognition, etc.).
Natural Language Processing
Problems with text
Messy.
Irregularities of the language.
Hierarchical structure.
Sparse nature.
REPRESENTATIONS FOR TEXTS
Contextual Representation
How to learn good representations?
One-hot Representation
One-hot encoding
Represent every word as a vector in $\mathbb{R}^{|V|}$ with all 0s and a single 1 at the
index of that word.
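The definition above can be sketched in a few lines of plain Python. The function name `one_hot` and the four-word vocabulary are illustrative assumptions, not part of the talk.

```python
def one_hot(word, vocabulary):
    """Return the one-hot vector for `word` as a list of 0s with a single 1."""
    index = vocabulary.index(word)  # raises ValueError for unknown words
    return [1 if i == index else 0 for i in range(len(vocabulary))]


vocabulary = ["the", "hotel", "nice", "motel"]  # illustrative vocabulary
print(one_hot("hotel", vocabulary))  # [0, 1, 0, 0]
```

The vector length always equals the vocabulary size, which is why this representation grows with every new word.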
One-hot Representation
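Example:
Let $V = \{\text{the}, \text{hotel}, \text{nice}, \text{motel}\}$. Then
$$
w_{\text{the}} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
w_{\text{hotel}} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
w_{\text{nice}} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
w_{\text{motel}} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
$$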
We represent each word as a completely independent entity.
This word representation does not directly give us any notion of
similarity.
One-hot Representation
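For instance,
$$\langle w_{\text{hotel}}, w_{\text{motel}} \rangle_{\mathbb{R}^4} = 0 \tag{1}$$
$$\langle w_{\text{hotel}}, w_{\text{nice}} \rangle_{\mathbb{R}^4} = 0 \tag{2}$$
so "hotel" is no closer to "motel" than to an unrelated word.
We can try to reduce the size of this space from $\mathbb{R}^4$ to something
smaller and find a subspace that encodes the relationships
between words.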
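A quick check of this orthogonality, as a sketch: the helper `dot` and the dictionary `vectors` are names introduced here for illustration. Every pair of distinct one-hot vectors has inner product 0, and every vector has inner product 1 with itself.

```python
from itertools import combinations

vocabulary = ["the", "hotel", "nice", "motel"]
# Row i of the identity matrix is the one-hot vector of word i.
vectors = {
    word: [int(i == j) for j in range(len(vocabulary))]
    for i, word in enumerate(vocabulary)
}


def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


for first, second in combinations(vocabulary, 2):
    print(first, second, dot(vectors[first], vectors[second]))  # always 0
```

Because all distinct pairs score the same, the inner product carries no information about which words are related.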
One-hot Representation
Problems
The dimension depends on the vocabulary size.
Leads to data sparsity, so we need more data.
Provides no useful information to the system.
Encodings are arbitrary.
Bag-of-words representation
Sum of one-hot codes.
Ignores the order of words.
Examples:
vocabulary = (monday, tuesday, is, a, today)
Monday Monday = [2, 0, 0, 0, 0]
today is a monday = [1, 0, 1, 1, 1]
today is a tuesday = [0, 1, 1, 1, 1]
is a monday today = [1, 0, 1, 1, 1]
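A minimal sketch of the bag-of-words counts, using only the standard library. The function name `bag_of_words` and the lowercasing and whitespace tokenization are assumptions made for this example.

```python
from collections import Counter

vocabulary = ["monday", "tuesday", "is", "a", "today"]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text` (case-insensitive)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


print(bag_of_words("Monday Monday", vocabulary))      # [2, 0, 0, 0, 0]
print(bag_of_words("today is a monday", vocabulary))  # [1, 0, 1, 1, 1]
print(bag_of_words("is a monday today", vocabulary))  # [1, 0, 1, 1, 1]
```

The last two sentences differ only in word order and get the same vector, which is exactly the information this representation throws away.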
Distributional hypothesis
You shall know a word by the company it keeps!
Firth (1957)
Language Modeling (Unigrams, Bigrams, etc)
A language model is a probabilistic model that assigns a
probability to any sequence of $n$ words, $P(w_1, w_2, \ldots, w_n)$.
Unigrams
Assuming that the word occurrences are completely independent:
$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i) \tag{3}$$
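As a rough sketch of equation (3), each $P(w_i)$ can be estimated as a relative frequency in a corpus. The toy corpus and the function name `unigram_probability` are invented for illustration.

```python
from collections import Counter
from math import prod

# Toy corpus, invented for illustration.
corpus = "the hotel is nice the motel is nice".split()
counts = Counter(corpus)
total = len(corpus)


def unigram_probability(sentence):
    """P(w1..wn) = product of P(wi), with P(w) estimated as count(w) / total."""
    return prod(counts[word] / total for word in sentence.split())


print(unigram_probability("the hotel"))  # 0.25 * 0.125 = 0.03125
print(unigram_probability("hotel the"))  # same value: word order is ignored
```

A word that never occurs in the corpus gets probability 0 under this estimate, which drives the probability of the whole sequence to 0.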
Language Modeling (Unigrams, Bigrams, etc)
Bigrams
The probability of the sequence depends on the pairwise probability
of each word in the sequence and the word that precedes it:
$$P(w_1, w_2, \ldots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1}) \tag{4}$$
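Equation (4) can be sketched the same way, assuming each conditional probability is estimated as count(previous, word) / count(previous). The toy corpus and the name `bigram_probability` are illustrative assumptions.

```python
from collections import Counter
from math import prod

# Toy corpus, invented for illustration.
corpus = "the hotel is nice the motel is nice".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))


def bigram_probability(sentence):
    """Product over i >= 2 of P(wi | wi-1), with P(b | a) = count(a, b) / count(a)."""
    words = sentence.split()
    return prod(
        # An unseen previous word contributes 0 instead of dividing by zero.
        bigram_counts[(prev, word)] / unigram_counts[prev] if unigram_counts[prev] else 0.0
        for prev, word in zip(words, words[1:])
    )


print(bigram_probability("the hotel is nice"))  # 0.5 * 1.0 * 1.0 = 0.5
print(bigram_probability("nice is hotel the"))  # 0.0: these pairs never occur
```

Unlike the unigram model, reordering the words now changes the result, because each factor depends on the preceding word.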
Word Embeddings
A set of language modeling and feature learning techniques in
NLP where words or phrases from the vocabulary are mapped
to vectors of real numbers in a space that is low-dimensional relative
to the vocabulary size (a "continuous space").
Vector space models (VSMs) represent (embed) words in a
continuous vector space.
Semantically similar words are mapped to nearby points.
The basic idea is the distributional hypothesis: words that appear
in the same contexts share semantic meaning.
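To make "nearby points" concrete, here is a small sketch using cosine similarity, a common choice for comparing embeddings. The three-dimensional vectors are hypothetical values picked by hand, not output of a trained model.

```python
from math import sqrt

# Hypothetical 3-dimensional embeddings, hand-picked for illustration;
# a trained model would learn these values from text.
embeddings = {
    "hotel": [0.9, 0.8, 0.1],
    "motel": [0.8, 0.9, 0.2],
    "cat": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity(embeddings["hotel"], embeddings["motel"]))  # close to 1
print(cosine_similarity(embeddings["hotel"], embeddings["cat"]))    # much smaller
```

In contrast to one-hot vectors, dense vectors like these can place related words closer together than unrelated ones.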
WORD2VEC
Distributional hypothesis
You shall know a word by the company it keeps!
Firth (1957)
Word2Vec
Figure: The two original papers published in association with word2vec
by Mikolov et al. (2013):
Efficient Estimation of Word Representations in Vector
Space, https://arxiv.org/abs/1301.3781.
Distributed Representations of Words and Phrases and
their Compositionality, https://arxiv.org/abs/1310.4546.
Continuous Bag of Words and Skip-gram
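The two architectures differ in what is predicted from what: CBOW predicts the center word from its surrounding context, while skip-gram predicts each context word from the center word. The sketch below only builds the training examples for both. The window size of 2 and the helper name `context_windows` are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for every position in `tokens`."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


tokens = "you shall know a word by the company it keeps".split()

# CBOW: predict the center word from its surrounding context words.
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: predict each context word from the center word.
skipgram_pairs = [
    (center, neighbor)
    for center, context in context_windows(tokens)
    for neighbor in context
]

print(cbow_examples[4])    # (['know', 'a', 'by', 'the'], 'word')
print(skipgram_pairs[:2])  # [('you', 'shall'), ('you', 'know')]
```

Skip-gram produces one training pair per context word, so the same text yields more (and smaller) examples than CBOW.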
Contextual Representation
A word is represented by the contexts in which it is used.
Word Vectors
Word2Vec
$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$
$v_{\text{paris}} - v_{\text{france}} + v_{\text{italy}} \approx v_{\text{rome}}$
Learns from raw text.
Made a huge splash in the NLP world.
Pretrained vectors are available (useful if you don't have a specialized
vocabulary).
Word2vec is a computationally efficient model for learning
word embeddings.
Word2vec is a successful example of "shallow" learning.
It is a very simple feedforward neural network with a single hidden
layer, trained with backpropagation, and no non-linearities.
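The analogy arithmetic can be illustrated with a toy example. The two-dimensional vectors below are hypothetical and chosen by hand so that the analogy works; real word2vec vectors are learned and only satisfy such relations approximately. The function name `analogy` is also an assumption.

```python
from math import sqrt

# Hypothetical 2-dimensional vectors (axes: "royalty", "maleness"),
# hand-picked so the analogy works; real word2vec vectors are learned.
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "prince": [0.8, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def analogy(a, b, c):
    """Return the word whose vector is closest to v_a - v_b + v_c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})  # skip the query words
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three query words from the candidates mirrors common practice, since one of them is often the nearest vector to the target.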
Gensim
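A minimal training sketch with Gensim, assuming Gensim 4.x is installed (there the embedding size parameter is `vector_size`; releases before 4.0 called it `size`). The wrapper `train_toy_model` and all parameter values are illustrative choices, not settings from the talk.

```python
def train_toy_model(sentences):
    """Train a tiny skip-gram model; far too little data for useful vectors."""
    # Imported here so the rest of a script still loads without gensim.
    from gensim.models import Word2Vec  # third-party: pip install gensim

    return Word2Vec(
        sentences=sentences,  # iterable of tokenized sentences
        vector_size=50,       # embedding dimension (`size` before gensim 4.0)
        window=2,             # context words considered on each side
        min_count=1,          # keep every word, even if it occurs once
        sg=1,                 # 1 = skip-gram, 0 = CBOW
        workers=1,
        seed=1,
    )
```

Calling `train_toy_model` on a list of token lists returns a model whose `model.wv["hotel"]` is a 50-dimensional vector and whose `model.wv.most_similar("hotel", topn=3)` lists the nearest neighbors. On a corpus of a few sentences those neighbors are essentially noise; meaningful results need far more text.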
APPLICATIONS
What the Fuck Are Trump Supporters Thinking?
They gathered four million tweets belonging to more than
two thousand hard-core Trump supporters.
Distances between the word vectors encoded the semantic
distance between their associated words (e.g. the vector
representation of the word "morons" was near "idiots" but far
away from "funny").
Link: https://medium.com/adventurous-social-science/what-the-fuck-are-trump-supporters-thinking-ecc16fb66a8d
Restaurant Recommendation.
http://www.slideshare.net/SudeepDasPhD/
recsys-2015-making-meaningful-restaurant-recommendations-at-opent
Song Recommendations
Link: https://social.shorthand.com/mawsonguy/3CfQA8mj2S/playlist-harvesting
TAKEAWAYS
Takeaways
If you don't have enough data you can use pre-trained
models.
Remember: Garbage in, garbage out.
Every dataset will produce different results.
Use Word2vec as a feature extractor.
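One simple way to use word vectors as features, sketched here as an assumption since the slides do not name a method, is to average the vectors of the words in a document. The tiny `word_vectors` table is hypothetical and stands in for a trained or pre-trained model.

```python
# Hypothetical pretrained vectors; in practice these come from a word2vec model.
word_vectors = {
    "great": [0.9, 0.1],
    "food": [0.5, 0.5],
    "slow": [0.1, 0.8],
}


def document_vector(tokens, word_vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None  # no known words, so no features
    dimension = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dimension)]


features = document_vector("great food tonight".split(), word_vectors)
print(features)  # approximately [0.7, 0.3]
```

The resulting fixed-length vector can be passed to any standard classifier or regressor. Averaging discards word order, so it shares that limitation with bag-of-words.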
Thank you
