2. Introduction
This presentation gives an overview about the problem of
finding documents which are similar and how Vector space
can be used to solve it.
A vector space is a mathematical structure formed by a
collection of elements called vectors, which may be added
together and multiplied ("scaled") by numbers, called scalars
in this context.
A document is a bag of words or a collection of words or
terms. The problem can be easily experienced in the domain
of web search or classification, where the aim is to find out
documents which are similar in context or content.
3. Introduction
A vector v can be expressed as a sum of elements such as,
v = a1vi1+a2vi2+….+anvin
Where ak are called scalars or weights and vin as the
components or elements.
4. Vectors
Now we explore, how a set of documents can be
represented as vectors in a common vector space.
V(d) denotes the vector derived from document d, with one
component for each dictionary term.
t1
V(d2)
V(Q)
V(d1)
θ
t2
The documents in a collection can be viewed as a set of vectors in vector space, in
which there is one axis for every term.
5. Vectors
In the previous slide, the diagram shows a simple
representation of two document vectors - d1, d2 and a
query vector Q.
The space contains terms – {t1,t2,t3,…tN}, but for simplicity
only two terms are represented since there is a axis for each
term.
The document d1 has components {t1,t3,…} and d2 has
components {t2,…}. So V(d1) is represented closer to axis t1
and V(d2) is closer to t2.
The angle θ represents the closeness of a document vector
to the query vector. And its value is calculated by cosine of θ.
6. Vectors
Weights
The weight of the components of a document vector can be
represented by Term Frequency or combination of Term
Frequency and Inverse Document Frequency.
Term Frequency denoted by tf, is the number of occurrences
of a term t in the document D .
Document Frequency is the number of documents , where a
particular term t occurs.
Inverse Document Frequency of a term t, denoted by idf is
log(N/df), where N is the total number of documents in the
space. So, it reduces the weight when a term occurs many
times in a document, or in other words a word with rare
occurrences has more weight.
7. Vectors
tf-idf weight
The combination of tf and idf is the most popular weight
used in case of document similarity exercises.
tf-idf t,d = tf t,d * idf t
So, the weight is the highest when t occurs many times
within a small number of documents.
And, the weight is the lowest , when the term occurs fewer
times in a document or occurs in many documents.
Later, in the example you will see how tf-idf weights are
used in the Similarity calculation.
8. Similarity
Cosine Similarity
The similarity between two documents can be found by
computing the Cosine Similarity between their vector
representations.
V(d1).V(d2)
sim(d1,d2) = ____________
|V(d1)||V(d2)
The numerator is a dot product of two products, such as
∑ i=1 to M (xi * yi), and the denominator is the product of the
Euclidean length of the vectors, such as
|V(d1)| = √ ∑ i=1 to M (xi )2
9. Similarity
For example,
If the vector d1 has component weights {w1,w2,w3} and
vector d2 has component weights {u1,u2},
then the dot product = w1*u1 + w2*u2 .
Since there is no third component, hence w3*ф = 0.
Euclidean length of d1 = √ (w1)2 + (w2)2 + (w3)2
10. Example
This is a famous example given by Dr. David Grossman and Dr. Ophir
Frieder of the Illinois Institute of Technology.
There are 3 documents,
D1 = “Shipment of gold damaged in a fire”
D2 = “Delivery of silver arrived in a silver truck”
D3 = “Shipment of gold arrived in a truck”
Q = “gold silver truck”
No. of docs, D = 3 ; Inverse document frequency, IDF = log(D/dfi)
Terms tfi Weights = tfi * IDFi
Q D1 D2 D3 dfi D/dfi IDFi Q D1 D2 D3
a 0 1 1 1 3 1 0.0000 0.0000 0.0000 0.0000 0.0000
arrived 0 0 1 1 2 1.5 0.1761 0.0000 0.0000 0.1761 0.1761
damaged 0 1 0 0 1 3 0.4771 0.0000 0.4771 0.0000 0.0000
delivery 0 0 1 0 1 3 0.4771 0.0000 0.0000 0.4771 0.0000
gold 1 1 0 1 2 1.5 0.1761 0.1761 0.1761 0.0000 0.1761
fire 0 1 0 0 1 3 0.4771 0.0000 0.4771 0.0000 0.0000
in 0 1 1 1 3 1 0.0000 0.0000 0.0000 0.0000 0.0000
of 0 1 1 1 3 1 0.0000 0.0000 0.0000 0.0000 0.0000
shipment 0 1 0 1 2 1.5 0.1761 0.0000 0.1761 0.0000 0.1761
silver 1 0 2 0 1 3 0.4771 0.4771 0.0000 0.9542 0.0000
truck 1 0 1 1 2 1.5 0.1761 0.1761 0.0000 0.1761 0.1761
11. Example … continued
Similarity Analysis……
We calculate the vector lengths,
|D| = √ ∑i(wi,j)2
which is the Euclidean length of the vector
|D1| = √(0.4771)2 + (0.1761)2 + (0.4771)2 + (0.17761)2 = √0.5173 = 0.7192
|D2| = √(0.1761)2 + (0.4771)2 + (0.9542)2 + (0.1761)2 = √1.2001 = 1.0955
|D3| = √(0.1761)2 + √(0.1761)2 + √(0.1761)2 + √(0.1761)2 = √0.1240 = 0.3522
|Q| = √ (0.1761)2 + (0.4771)2 + √(0.1761)2 = √0.2896 = 0.5382
Next, we calculate the Dot products of the Query vector with each Document
vector, Q • Di = √ (wQ,j * wi,j )
Q • D1 = 0.1761 * 0.1761 = 0.0310
Q • D2 = 0.4771*0.9542 + 0.1761*0.1761 = 0.4862
Q • D3 = 0.1761*0.1761 + 0.1761*0.1761 = 0.0620
12. Example … continued
Now, we calculate the cosine value,
Cosine θ (d1) = Q • D1 /|Q|*|D1| = 0.0310/(0.5382 * 0.7192) = 0.0801
Cosine θ (d2) = Q • D2 /|Q|*|D2| = 0.4862/(0.5382 * 1.0955) = 0.8246
Cosine θ (d3) = Q • D3 /|Q|*|D3| = 0.0620/(0.5382 * 0.3522) = 0.3271
So, we see that document D2 is the most similar to the Query.
13. Conclusion
Pros
• Allows documents with partial match to be also identified
• The cosine formula gives a score which can be used to order
documents.
Disadvantages
• Documents are treated as bag of words and so the positional
information about the terms is lost.
Usage
Apache Lucene, the text search api uses this concept while searching
for documents matching a query.
14. Acknowledgements
• An Introduction to Information Retrieval by Christopher D. Manning,
Prabhakar Raghavan, Hinrich Schutze.
• Term Vector Theory and Keyword Weights by Dr. E. Garcia.
• Information Retrieval: Algorithms and Heuristics by Dr. David
Grossman and Dr. Ophir Frieder of the Illinois Institute of Technology
• Wikipedia - http://en.wikipedia.org/wiki/Vector_space_model