SlideShare a Scribd company logo
1 of 26
Team Members
Ganesh Borle
Sonam Gupta
Satyam Verma
Vivek Vishnoi
From Word Embeddings To Document
Distances
Mentor
Narendra Babu Unnam
Content
●The question
●Existing Ways/Models
●Drawbacks
●Proposed Solutions
●Need
●Word Mover Distance
●Word Embedding to Document distances.
The Question
How to check whether the statements/documents
infers the same meaning or not?
“Obama speaks to the media in Illinois”
and
“The President greets the press in Chicago.”
Same content. Different words.
Existing Ways
Documents commonly are represented as :
● Bag of Words(BOW)
● term frequency-inverse document frequency (TF-IDF) of the
words
● Based on the Count of common words
#Dimensions = #Vocabulary (thousands)
● Stuck if no words in common.
“Obama” != “President”
Existing Ways
BOW
● A model in which a text (such as a sentence or a document) is
represented as the bag (multiset) of its words, disregarding grammar
and even word order but keeping multiplicity.
● Commonly used in methods of document classification where the
(frequency of) occurrence of each word is used as a feature for training
a classifier.
Existing Ways
Tf-idf Model
A statistical measure used to evaluate how important a word is
to a document in a collection or corpus. The importance increases
proportionally to the number of times a word appears in the
document but is offset by the frequency of the word in the corpus.
Tf(term Frequency) Score
idf(inverse-doc frequency) Score
Existing Ways
Tf-idf Model
The tf-idf weight of a term is the product of its tf weight and
its idf weight.
Existing Ways
BM25 Okapi
● A bag-of-words retrieval function that ranks a set of documents based
on the query terms appearing in each document, regardless of the
inter-relationship between the query terms within a document.
● A ranking function that extends TF-IDF for each word w in a
document D
Existing Ways
BM25 Okapi
Existing Ways
Drawbacks of previous models
● Representing documents via BOW or by TF-IDF representations
are often not suitable for document distances due to their
frequent near-orthogonality
● Another significant draw-back of these representations are that
they do not capture the distance between individual words.
Proposed Solutions
Low-dimensional latent features models
● Latent Semantic Indexing(LSI )
○ LSA assumes that words that are close in meaning will occur in
similar pieces of text.
○ uses singular value decomposition on the BOW representation to
arrive at a semantic feature space.
● Latent Dirichlet Allocation(LDA)
○ Generative model for text documents that learns representations
for documents as distributions over word topics.
Proposed Solutions
More Models...
● mSDA Marginalized Stacked Denoising Autoencoder
○ a representation learned from stacked denoting autoencoders (SDAs),
marginalized for fast training.
○ Generally, SDAs have been shown to have state-of-the-art performance
for document sentiment analysis tasks
Need for improved Solution
● While these low dimensional approaches produce a more coherent
document representation than BOW, they often do not improve the
empirical performance of BOW on distance-based tasks LDA.
Hence we need something better...
Improved Solution
Word Mover’s Distance (WMD)
● A novel distance function between text documents.
● The WMD distance measures the dissimilarity between two text
documents as the minimum amount of distance that the embedded
words of one document need to “travel” to reach the embedded words
of another document.
● Based on recent results in word embeddings that learn semantically
meaningful representations for words from local co-occurrences in
sentences.
● Based on the well known concept of Earth Mover’s Distance
Word Mover’s Distance (WMD)
● Built on top of Google’s word2vec and based on its embeddings
● Scaling to very large data because of word2vec
● semantic relationships are often preserved in vector operations on word
vectors, e.g., vec(Berlin) - vec(Germany) + vec(France) is close to
vec(Paris).
● It represent text documents as a weighted point cloud of embedded
words
Let's get familiar with the terms which are used in WMD….
Word Mover’s Distance (WMD)
● It is the collective name for a set of language modeling and feature
learning techniques in natural language processing (NLP) where words
or phrases from the vocabulary are mapped to vectors of real numbers.
● Stores each word in as a point in space, where it is represented by a
vector of fixed number of dimensions (generally 300).
● For example, “Hello” might be represented as : [0.4, -0.11, 0.55, 0.3 . . .
0.1, 0.02]
● Dimensions are basically projections along different axes, more of a
mathematical concept.
Word Embedding
● vector d as a point on the n−1 dimensional simplex of word distributions
● Two documents with different unique words will lie in different regions of
this simplex. However, these documents may still be semantically close
● if word i appears ci times in the document, we denote di = ci / ∑j=1 to ncj
● It is very parse as most of the words will not appear in document
Word Mover’s Distance (WMD)
nBOW representation
Word travel cost
● measure of word dissimilarity is naturally provided by their Euclidean
distance in the word2vec embedding space
● The cost associated with “traveling” from one word to another
c(i, j) = ||xi − xj||2
Word Mover’s Distance (WMD)
● Uses “travel cost” between two words to create a distance between two
documents
● each word i in d to be transformed into any word j in d’ in total or in parts
Tij ≥ 0 denotes how much of word i in d travels to word j in d’ (T ∈ Rn×n)
● To transform d entirely into d’ we ensure that the entire outgoing flow
from word i equals di ,i.e. ∑j Tij= di and same applies for the incoming
● Hence, the distance between the two documents as the minimum
(weighted) cumulative cost required to move all words from d to d’ , i.e.
Document distance
This optimization is a special case of
the earth mover’s distance metric, a
well studied transportation problem
for which specialized solvers have
been developed, hence we can use
those solvers to calculate the WMD.
● the minimum cumulative cost of moving d to d’ given the constraints is pro
vided by the solution to the following linear program
Word Mover’s Distance (WMD)
Transportation problem
Word Mover’s Distance (WMD)
The components of the WMD metric
between a query D0 and two
sentences D1, D2 (with equal BOW
distance). The arrows represent flow
between two words and are labeled
with their distance contribution.
The flow between two sentences D3
and D0 with different numbers of
words. This mis match causes the
WMD to move words to multiple
similar words.
Word Mover’s Distance (WMD)
● best average time complexity : O(p3 log p); p ⇾ # unique words in the
documents
● For datasets with many unique words (i.e., high-dimensional) and/or a
large number of documents, solving the WMD optimal transport problem
can become prohibitive.
Word Mover’s Distance (WMD)
Complexity
● Twitter
○ Collection of “Polarity - Tweet” tweets
● BBC Sport
○ Documents related to 5 categories like athletics, cricket, football, rugby,
tennis
● Classic
○ Documents related to 5 categories like casm, med, etc
● News20Groups
○ Documents divided into 20 categories like space, medical, etc
Word Mover’s Distance (WMD)
DataSets Used
Word Mover’s Distance (WMD)
% Error on BBC Sports data
Word Mover’s Distance (WMD)
% Error on Twitter data
Word Mover’s Distance (WMD)
Github link :
● https://github.com/buggynap/Word-Movers-Distance
Reference :
● Word Mover's Distance from Matthew J Kusner's paper "From Word Embeddings to Document
Distances"
SideShare Link :

More Related Content

What's hot

DBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, DublinDBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, Dublinm_ackermann
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Alia Hamwi
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingMariana Soffer
 
Word2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad MahdaviWord2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad Mahdaviirpycon
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processinggulshan kumar
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptxNameetDaga1
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)VenkateshMurugadas
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Cloudera, Inc.
 
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesApache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesShivji Kumar Jha
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Simplilearn
 
No sql distilled-distilled
No sql distilled-distilledNo sql distilled-distilled
No sql distilled-distilledrICh morrow
 
Natural language processing
Natural language processingNatural language processing
Natural language processingYogendra Tamang
 

What's hot (20)

DBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, DublinDBpedia Tutorial - Feb 2015, Dublin
DBpedia Tutorial - Feb 2015, Dublin
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Word2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad MahdaviWord2Vec: Vector presentation of words - Mohammad Mahdavi
Word2Vec: Vector presentation of words - Mohammad Mahdavi
 
Mt
MtMt
Mt
 
Word embedding
Word embedding Word embedding
Word embedding
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processing
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
Deep Learning for Machine Translation
Deep Learning for Machine TranslationDeep Learning for Machine Translation
Deep Learning for Machine Translation
 
NoSql
NoSqlNoSql
NoSql
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use casesApache Con 2021 : Apache Bookkeeper Key Value Store and use cases
Apache Con 2021 : Apache Bookkeeper Key Value Store and use cases
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
Document similarity
Document similarityDocument similarity
Document similarity
 
No sql distilled-distilled
No sql distilled-distilledNo sql distilled-distilled
No sql distilled-distilled
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 

Similar to Word Embedding to Document distances

Vectorization In NLP.pptx
Vectorization In NLP.pptxVectorization In NLP.pptx
Vectorization In NLP.pptxChode Amarnath
 
Word Space Models and Random Indexing
Word Space Models and Random IndexingWord Space Models and Random Indexing
Word Space Models and Random IndexingDileepa Jayakody
 
Word Space Models & Random indexing
Word Space Models & Random indexingWord Space Models & Random indexing
Word Space Models & Random indexingDileepa Jayakody
 
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Yuki Tomo
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
Text Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
Text Representation & Fixed-Size Ordinally-Forgetting Encoding ApproachText Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
Text Representation & Fixed-Size Ordinally-Forgetting Encoding ApproachAhmed Hani Ibrahim
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentationSoojung Hong
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATIONAN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATIONijnlc
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnRwanEnan
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Facultad de Informática UCM
 
Learning to summarize using coherence
Learning to summarize using coherenceLearning to summarize using coherence
Learning to summarize using coherenceContent Savvy
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Saurabh Kaushik
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMassimo Schenone
 
A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...eSAT Journals
 

Similar to Word Embedding to Document distances (20)

Vectorization In NLP.pptx
Vectorization In NLP.pptxVectorization In NLP.pptx
Vectorization In NLP.pptx
 
Word Space Models and Random Indexing
Word Space Models and Random IndexingWord Space Models and Random Indexing
Word Space Models and Random Indexing
 
Word Space Models & Random indexing
Word Space Models & Random indexingWord Space Models & Random indexing
Word Space Models & Random indexing
 
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
Deep Learning勉強会@小町研 "Learning Character-level Representations for Part-of-Sp...
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Text Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
Text Representation & Fixed-Size Ordinally-Forgetting Encoding ApproachText Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
Text Representation & Fixed-Size Ordinally-Forgetting Encoding Approach
 
wordembedding.pptx
wordembedding.pptxwordembedding.pptx
wordembedding.pptx
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATIONAN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
 
Learning to summarize using coherence
Learning to summarize using coherenceLearning to summarize using coherence
Learning to summarize using coherence
 
Networks and Natural Language Processing
Networks and Natural Language ProcessingNetworks and Natural Language Processing
Networks and Natural Language Processing
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
ijcai11
ijcai11ijcai11
ijcai11
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...A performance of svm with modified lesk approach for word sense disambiguatio...
A performance of svm with modified lesk approach for word sense disambiguatio...
 

Recently uploaded

MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 

Recently uploaded (20)

MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 

Word Embedding to Document distances

  • 1. Team Members Ganesh Borle Sonam Gupta Satyam Verma Vivek Vishnoi From Word Embeddings To Document Distances Mentor Narendra Babu Unnam
  • 2. Content ●The question ●Existing Ways/Models ●Drawbacks ●Proposed Solutions ●Need ●Word Mover Distance ●Word Embedding to Document distances.
  • 3. The Question How to check whether the statements/documents infers the same meaning or not? “Obama speaks to the media in Illinois” and “The President greets the press in Chicago.” Same content. Different words.
  • 4. Existing Ways Documents commonly are represented as : ● Bag of Words(BOW) ● term frequency-inverse document frequency (TF-IDF) of the words ● Based on the Count of common words #Dimensions = #Vocabulary (thousands) ● Stuck if no words in common. “Obama” != “President”
  • 5. Existing Ways BOW ● A model in which a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. ● Commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.
  • 6. Existing Ways Tf-idf Model A statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Tf(term Frequency) Score idf(inverse-doc frequency) Score
  • 7. Existing Ways Tf-idf Model The tf-idf weight of a term is the product of its tf weight and its idf weight.
  • 8. Existing Ways BM25 Okapi ● A bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document. ● A ranking function that extends TF-IDF for each word w in a document D
  • 10. Existing Ways Drawbacks of previous models ● Representing documents via BOW or by TF-IDF representations are often not suitable for document distances due to their frequent near-orthogonality ● Another significant draw-back of these representations are that they do not capture the distance between individual words.
  • 11. Proposed Solutions Low-dimensional latent features models ● Latent Semantic Indexing(LSI ) ○ LSA assumes that words that are close in meaning will occur in similar pieces of text. ○ uses singular value decomposition on the BOW representation to arrive at a semantic feature space. ● Latent Dirichlet Allocation(LDA) ○ Generative model for text documents that learns representations for documents as distributions over word topics.
  • 12. Proposed Solutions More Models... ● mSDA Marginalized Stacked Denoising Autoencoder ○ a representation learned from stacked denoting autoencoders (SDAs), marginalized for fast training. ○ Generally, SDAs have been shown to have state-of-the-art performance for document sentiment analysis tasks
  • 13. Need for improved Solution ● While these low dimensional approaches produce a more coherent document representation than BOW, they often do not improve the empirical performance of BOW on distance-based tasks LDA. Hence we need something better...
  • 14. Improved Solution Word Mover’s Distance (WMD) ● A novel distance function between text documents. ● The WMD distance measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to “travel” to reach the embedded words of another document. ● Based on recent results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences. ● Based on the well known concept of Earth Mover’s Distance
  • 15. Word Mover’s Distance (WMD) ● Built on top of Google’s word2vec and based on its embeddings ● Scaling to very large data because of word2vec ● semantic relationships are often preserved in vector operations on word vectors, e.g., vec(Berlin) - vec(Germany) + vec(France) is close to vec(Paris). ● It represent text documents as a weighted point cloud of embedded words Let's get familiar with the terms which are used in WMD….
  • 16. Word Mover’s Distance (WMD) ● It is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. ● Stores each word in as a point in space, where it is represented by a vector of fixed number of dimensions (generally 300). ● For example, “Hello” might be represented as : [0.4, -0.11, 0.55, 0.3 . . . 0.1, 0.02] ● Dimensions are basically projections along different axes, more of a mathematical concept. Word Embedding
  • 17. ● vector d as a point on the n−1 dimensional simplex of word distributions ● Two documents with different unique words will lie in different regions of this simplex. However, these documents may still be semantically close ● if word i appears ci times in the document, we denote di = ci / ∑j=1 to ncj ● It is very parse as most of the words will not appear in document Word Mover’s Distance (WMD) nBOW representation Word travel cost ● measure of word dissimilarity is naturally provided by their Euclidean distance in the word2vec embedding space ● The cost associated with “traveling” from one word to another c(i, j) = ||xi − xj||2
  • 18. Word Mover’s Distance (WMD) ● Uses “travel cost” between two words to create a distance between two documents ● each word i in d to be transformed into any word j in d’ in total or in parts Tij ≥ 0 denotes how much of word i in d travels to word j in d’ (T ∈ Rn×n) ● To transform d entirely into d’ we ensure that the entire outgoing flow from word i equals di ,i.e. ∑j Tij= di and same applies for the incoming ● Hence, the distance between the two documents as the minimum (weighted) cumulative cost required to move all words from d to d’ , i.e. Document distance
  • 19. This optimization is a special case of the earth mover’s distance metric, a well studied transportation problem for which specialized solvers have been developed, hence we can use those solvers to calculate the WMD. ● the minimum cumulative cost of moving d to d’ given the constraints is pro vided by the solution to the following linear program Word Mover’s Distance (WMD) Transportation problem
  • 20. Word Mover’s Distance (WMD) The components of the WMD metric between a query D0 and two sentences D1, D2 (with equal BOW distance). The arrows represent flow between two words and are labeled with their distance contribution. The flow between two sentences D3 and D0 with different numbers of words. This mis match causes the WMD to move words to multiple similar words.
  • 22. ● best average time complexity : O(p3 log p); p ⇾ # unique words in the documents ● For datasets with many unique words (i.e., high-dimensional) and/or a large number of documents, solving the WMD optimal transport problem can become prohibitive. Word Mover’s Distance (WMD) Complexity
  • 23. ● Twitter ○ Collection of “Polarity - Tweet” tweets ● BBC Sport ○ Documents related to 5 categories like athletics, cricket, football, rugby, tennis ● Classic ○ Documents related to 5 categories like casm, med, etc ● News20Groups ○ Documents divided into 20 categories like space, medical, etc Word Mover’s Distance (WMD) DataSets Used
  • 24. Word Mover’s Distance (WMD) % Error on BBC Sports data
  • 25. Word Mover’s Distance (WMD) % Error on Twitter data
  • 26. Word Mover’s Distance (WMD) Github link : ● https://github.com/buggynap/Word-Movers-Distance Reference : ● Word Mover's Distance from Matthew J Kusner's paper "From Word Embeddings to Document Distances" SideShare Link :