SIMILARITY OF
DOCUMENTS BASED ON
VECTOR SPACE MODEL
Introduction

This presentation gives an overview of the problem of
finding similar documents and of how a vector space model
can be used to solve it.

A vector space is a mathematical structure formed by a
collection of elements called vectors, which may be added
together and multiplied ("scaled") by numbers, called scalars
in this context.

A document is a bag of words, i.e. a collection of words or
terms. The problem arises naturally in web search and
classification, where the aim is to find documents that are
similar in context or content.
Introduction

A vector v can be expressed as a sum of elements such as

v = a1v1 + a2v2 + … + anvn

where the ak are called scalars or weights and the vk are the
components or elements.
Vectors

Now we explore how a set of documents can be represented as
vectors in a common vector space.

 V(d) denotes the vector derived from document d, with one
 component for each dictionary term.
[Diagram: document vectors V(d1) and V(d2) and a query vector V(Q) plotted in a
two-dimensional space with axes t1 and t2; θ is the angle between the query
vector and a document vector.]

The documents in a collection can be viewed as a set of vectors in vector space, in
which there is one axis for every term.
Vectors

In the previous slide, the diagram shows a simple
representation of two document vectors - d1, d2 and a
query vector Q.
The space contains the terms {t1, t2, t3, …, tN}, but for
simplicity only two are shown, since there is an axis for each
term.
The document d1 has components {t1,t3,…} and d2 has
components {t2,…}. So V(d1) is represented closer to axis t1
and V(d2) is closer to t2.

The angle θ represents the closeness of a document vector
to the query vector, and this closeness is measured by the cosine of θ.
Vectors

Weights
The weight of each component of a document vector can be the
Term Frequency alone or a combination of Term Frequency and
Inverse Document Frequency.

Term Frequency, denoted tf, is the number of occurrences of a
term t in a document d.
Document Frequency, denoted df, is the number of documents in
which a particular term t occurs.

The Inverse Document Frequency of a term t, denoted idf, is
log(N/df), where N is the total number of documents in the
space. It reduces the weight of terms that occur in many
documents; in other words, a term that occurs rarely across
the collection gets more weight.
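
As a minimal sketch of these definitions (the corpus is the three documents from the worked example later in the deck; the base-10 logarithm and the variable names are assumptions on my part), tf, df and idf can be computed like this:

```python
import math
from collections import Counter

# The three documents from the worked example later in this deck.
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}

# Term frequency: raw count of each term in each document.
tf = {name: Counter(text.split()) for name, text in docs.items()}

# Document frequency: in how many documents does each term occur?
vocab = {term for text in docs.values() for term in text.split()}
df = {term: sum(1 for name in docs if tf[name][term] > 0) for term in vocab}

# Inverse document frequency: idf(t) = log(N / df(t)), base 10 here.
N = len(docs)
idf = {term: math.log10(N / df[term]) for term in vocab}

print(tf["D2"]["silver"], df["silver"], round(idf["silver"], 4))  # 2 1 0.4771
```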
Vectors

tf-idf weight

The combination of tf and idf is the most popular weighting
used in document similarity exercises.

tf-idf(t,d) = tf(t,d) * idf(t)

So the weight is highest when t occurs many times within a
small number of documents, and lowest when the term occurs
only a few times in a document or occurs in many documents.

Later, in the example you will see how tf-idf weights are
used in the Similarity calculation.
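
As a one-function sketch (the function and parameter names are illustrative, not taken from the slides), the weight combines both counts:

```python
import math

def tf_idf(tf_td: int, df_t: int, n_docs: int) -> float:
    """tf-idf(t,d) = tf(t,d) * idf(t), with idf(t) = log10(N / df(t))."""
    return tf_td * math.log10(n_docs / df_t)

# "silver" occurs twice in one of three documents -> high weight:
print(round(tf_idf(2, 1, 3), 4))   # 0.9542
# "in" occurs once but appears in all three documents -> zero weight:
print(round(tf_idf(1, 3, 3), 4))   # 0.0
```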
Similarity

Cosine Similarity
The similarity between two documents can be found by
computing the Cosine Similarity between their vector
representations.

sim(d1,d2) = V(d1) • V(d2) / (|V(d1)| |V(d2)|)

The numerator is the dot product of the two vectors,
∑ i=1 to M (xi * yi), and the denominator is the product of
their Euclidean lengths, where
|V(d1)| = √( ∑ i=1 to M (xi)² )
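
The formula translates directly into code. The sketch below assumes sparse term-to-weight dictionaries; the function name and the zero-length guard are my additions:

```python
import math
from typing import Dict

def cosine_similarity(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """sim(d1,d2) = V(d1) . V(d2) / (|V(d1)| * |V(d2)|)."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())   # numerator
    len1 = math.sqrt(sum(w * w for w in v1.values()))            # Euclidean lengths
    len2 = math.sqrt(sum(w * w for w in v2.values()))
    if len1 == 0.0 or len2 == 0.0:
        return 0.0   # an empty vector is not similar to anything
    return dot / (len1 * len2)
```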
Similarity

For example, if vector d1 has component weights {w1, w2, w3}
and vector d2 has component weights {u1, u2},
then the dot product = w1*u1 + w2*u2.
Since d2 has no third component, the third term contributes nothing (w3 * 0 = 0).

Euclidean length of d1 = √( (w1)² + (w2)² + (w3)² )
Example
    This is a famous example given by Dr. David Grossman and Dr. Ophir
    Frieder of the Illinois Institute of Technology.
    There are 3 documents,
    D1 = “Shipment of gold damaged in a fire”
    D2 = “Delivery of silver arrived in a silver truck”
    D3 = “Shipment of gold arrived in a truck”
    Q = “gold silver truck”
    No. of docs, D = 3; Inverse document frequency, IDFi = log10(D/dfi)
             --------- tfi ---------                           ----- Weights = tfi * IDFi -----
Terms        Q     D1    D2    D3     dfi   D/dfi   IDFi       Q        D1       D2       D3
a            0     1     1     1      3     1       0.0000     0.0000   0.0000   0.0000   0.0000
arrived      0     0     1     1      2     1.5     0.1761     0.0000   0.0000   0.1761   0.1761
damaged      0     1     0     0      1     3       0.4771     0.0000   0.4771   0.0000   0.0000
delivery     0     0     1     0      1     3       0.4771     0.0000   0.0000   0.4771   0.0000
gold         1     1     0     1      2     1.5     0.1761     0.1761   0.1761   0.0000   0.1761
fire         0     1     0     0      1     3       0.4771     0.0000   0.4771   0.0000   0.0000
in           0     1     1     1      3     1       0.0000     0.0000   0.0000   0.0000   0.0000
of           0     1     1     1      3     1       0.0000     0.0000   0.0000   0.0000   0.0000
shipment     0     1     0     1      2     1.5     0.1761     0.0000   0.1761   0.0000   0.1761
silver       1     0     2     0      1     3       0.4771     0.4771   0.0000   0.9542   0.0000
truck        1     0     1     1      2     1.5     0.1761     0.1761   0.0000   0.1761   0.1761
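
The table can be reproduced with a short script. This is a sketch under the same assumptions as before (base-10 logarithm, raw counts as term frequencies, and the query weighted like a document but excluded from N and dfi):

```python
import math
from collections import Counter

texts = {
    "Q":  "gold silver truck",
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
doc_names = ["D1", "D2", "D3"]        # the query does not count towards N or dfi
N = len(doc_names)

tf = {name: Counter(text.split()) for name, text in texts.items()}
vocab = sorted({t for name in doc_names for t in texts[name].split()})
df = {t: sum(1 for name in doc_names if tf[name][t] > 0) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

# tf-idf weight of every term in every vector, the query included.
weights = {name: {t: tf[name][t] * idf[t] for t in vocab} for name in texts}

print(round(weights["D2"]["silver"], 4), round(weights["Q"]["gold"], 4))  # 0.9542 0.1761
```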
Example … continued
Similarity Analysis
We calculate the vector lengths,

|Di| = √( ∑j (wi,j)² )

which is the Euclidean length of the vector.

|D1| = √((0.4771)² + (0.1761)² + (0.4771)² + (0.1761)²) = √0.5173 = 0.7192
|D2| = √((0.1761)² + (0.4771)² + (0.9542)² + (0.1761)²) = √1.2001 = 1.0955
|D3| = √((0.1761)² + (0.1761)² + (0.1761)² + (0.1761)²) = √0.1240 = 0.3522

|Q| = √((0.1761)² + (0.4771)² + (0.1761)²) = √0.2896 = 0.5382

Next, we calculate the dot product of the query vector with each document
vector, Q • Di = ∑j (wQ,j * wi,j)

Q • D1 = 0.1761 * 0.1761 = 0.0310
Q • D2 = 0.4771*0.9542 + 0.1761*0.1761 = 0.4862
Q • D3 = 0.1761*0.1761 + 0.1761*0.1761 = 0.0620
Example … continued
Now, we calculate the cosine values,

cos θ(D1) = Q • D1 / (|Q| * |D1|) = 0.0310 / (0.5382 * 0.7192) = 0.0801
cos θ(D2) = Q • D2 / (|Q| * |D2|) = 0.4862 / (0.5382 * 1.0955) = 0.8246
cos θ(D3) = Q • D3 / (|Q| * |D3|) = 0.0620 / (0.5382 * 0.3522) = 0.3271

So, we see that document D2 is the most similar to the Query.
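
The whole calculation can be checked in a few lines. This sketch hard-codes the non-zero weights from the table and recomputes the three scores; the last digits differ slightly from the slide because the slide rounds intermediate values:

```python
import math

Q  = {"gold": 0.1761, "silver": 0.4771, "truck": 0.1761}
D1 = {"damaged": 0.4771, "fire": 0.4771, "gold": 0.1761, "shipment": 0.1761}
D2 = {"arrived": 0.1761, "delivery": 0.4771, "silver": 0.9542, "truck": 0.1761}
D3 = {"arrived": 0.1761, "gold": 0.1761, "shipment": 0.1761, "truck": 0.1761}

def cosine(q, d):
    """cos(theta) = Q . D / (|Q| * |D|) for sparse term->weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    return dot / (math.sqrt(sum(w * w for w in q.values())) *
                  math.sqrt(sum(w * w for w in d.values())))

for name, d in [("D1", D1), ("D2", D2), ("D3", D3)]:
    print(name, round(cosine(Q, d), 4))   # D1 0.0801, D2 0.8247, D3 0.3272
```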
Conclusion
Pros
• Allows documents that only partially match the query to be identified as well
• The cosine formula gives a score which can be used to order
   documents.

Cons
• Documents are treated as bags of words, so the positional
   information about the terms is lost.


Usage
  Apache Lucene, the text search API, uses this concept when scoring
documents that match a query.
Acknowledgements
•   Introduction to Information Retrieval by Christopher D. Manning,
    Prabhakar Raghavan, and Hinrich Schütze.
•   Term Vector Theory and Keyword Weights by Dr. E. Garcia.
•   Information Retrieval: Algorithms and Heuristics by Dr. David
    Grossman and Dr. Ophir Frieder of the Illinois Institute of Technology.
•   Wikipedia - http://en.wikipedia.org/wiki/Vector_space_model
