2. CLUSTERING
• Definition: Clustering – grouping objects on the
basis of similarities between them.
• Clustering algorithms
- Supervised clustering
- Unsupervised clustering
- Hierarchical and non-hierarchical
• Similarity measures – a measure of “distance”
between objects
3. VECTOR SPACE MODEL
• Representing text documents as vectors in
n-dimensional space
d = ((t1, w1), (t2, w2), …, (ti, wi), …, (tn, wn))
t1, t2, …, tn – features (terms) of the document
w1, w2, …, wn – weights measuring the term-document
associations
4. VECTOR SPACE MODEL
• Each dimension (axis) corresponds to a
document feature.
• Features: words or phrases (bag-of-words
model)
• TFIDF (term frequency–inverse document
frequency) table
• Cosine similarity measure
5. TERM WEIGHTING
• Term frequency–inverse document frequency (TFIDF): the weight
assigned to each term describing a document:
Wij = TF * IDF = tfij * log (N / dfi)
TF – Term Frequency
IDF – Inverse Document Frequency
Wij – Weight of the i-th term in j-th document
tfij – Frequency of the i-th term in document j
N – The total number of documents in the collection
dfi – The number of documents containing the i-th term
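As a sketch, the weighting formula above can be computed directly in Python (the document collection here is a toy example, not from the source):

```python
import math

def tfidf_weights(docs):
    """Compute Wij = tfij * log(N / dfi) for each term in each document."""
    N = len(docs)
    # dfi: number of documents containing the i-th term
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in doc:
            tf = doc.count(term)          # tfij: frequency of term in this document
            w[term] = tf * math.log(N / df[term])
        weights.append(w)
    return weights

docs = [["latent", "semantic", "indexing"],
        ["semantic", "clustering"],
        ["suffix", "arrays"]]
w = tfidf_weights(docs)
```

Note that a term occurring in every document gets IDF = log(1) = 0, so it carries no weight, which is exactly the intended damping of common terms.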
6. Similarity Measure and Query
matching
• A (t×d) – term-document matrix for d documents described by t terms
• q – query vector – pseudo document, described by the terms of the
query
• Cosine between a document vector aj and the query vector q:
cos θj = (ajᵀ q) / (‖aj‖ ‖q‖)
• After normalization of the vectors to unit length, qᵀA yields a vector
of similarities between the query and each document in the
collection.
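The qᵀA similarity computation can be sketched with numpy (the matrix and query below are illustrative toy values):

```python
import numpy as np

# Toy t x d term-document matrix: rows = terms, columns = documents
A = np.array([[1.0, 0.0, 2.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
# Query as a pseudo-document over the same terms
q = np.array([1.0, 1.0, 0.0])

# Normalize each document column and the query to unit length;
# q^T A then gives the cosine similarity with every document at once.
A_unit = A / np.linalg.norm(A, axis=0)
q_unit = q / np.linalg.norm(q)
sims = q_unit @ A_unit
```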
9. Classical Vector Space Model
Issues
• Synonyms?
• Single words only
• Assumption that the keywords describing
the documents are mutually independent
• Detailed example and analysis:
http://www.miislita.com/term-vector/term-vector-3.html
10. Latent Semantic Indexing (LSI)
and SVD
• LSI tries to overcome the problems of
lexical matching by using statistically
derived conceptual indices instead of
individual words
• LSI assumes that there is some
underlying latent structure in word usage
• Singular Value Decomposition is used to
estimate the structure in word usage
across documents
11. Singular Value Decomposition
(SVD)
• A (t×d) – term-document matrix
• SVD: every matrix can be represented as a product of three matrix
factors
A = UΣVᵀ
U (t×t) – orthogonal matrix whose columns are the left singular vectors of A
V (d×d) – orthogonal matrix whose columns are the right singular vectors of A
Σ (t×d) – diagonal matrix having the singular values of A ordered
decreasingly along its diagonal (σ1 ≥ σ2 ≥ … ≥ σmin(t,d))
• Interpretation: the vectors (columns of the U matrix)
summarize the most important abstract concepts present in
the document collection
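The decomposition can be illustrated with numpy on a toy term-document matrix (values are illustrative):

```python
import numpy as np

# Toy t x d term-document matrix (t = 4 terms, d = 3 documents)
A = np.array([[1.0, 0.0, 2.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0]])

# s holds the singular values, already ordered decreasingly
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sanity check: the three factors reconstruct A
A_rec = U @ np.diag(s) @ Vt
```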
13. K – rank approximation of A
Ak = UkΣkVkᵀ
Where:
- Uk is the t×k matrix whose columns are
the first k columns of U
- Vk is the d×k matrix whose columns are
the first k columns of V
- Σk is the k×k diagonal matrix whose
diagonal elements are the k largest
singular values of A
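A minimal numpy sketch of the rank-k approximation (the helper name and toy matrix are illustrative):

```python
import numpy as np

def rank_k_approx(A, k):
    """Ak = Uk @ Sigma_k @ Vk^T, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 3.0],
              [0.0, 2.0, 0.0]])
A2 = rank_k_approx(A, 2)
```

By the Eckart–Young theorem, Ak is the best rank-k approximation of A in the Frobenius norm, which is why LSI can safely discard the smaller singular values.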
15. SUFFIX ARRAYS AND LCP
ARRAYS
• Definition: A suffix array is an
alphabetically ordered array of all suffixes
of a string.
• Find substrings by binary search
• Definition: LCP array – array of integers
where the i-th element contains the length
of the longest common prefix of the i-th
and (i-1)-th elements of a suffix array.
16. Suffix Array and LCP array
example
• string: _to_be_or_not_to_be
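A naive sketch of building both arrays for this string (the quadratic construction is for illustration only; production implementations use O(n log n) or O(n) algorithms):

```python
def suffix_array(s):
    """Starting positions of all suffixes of s, in alphabetical order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[i] = length of the longest common prefix of the i-th and (i-1)-th suffixes."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = s[sa[i]:], s[sa[i - 1]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp

s = "_to_be_or_not_to_be"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
```

The largest LCP value here is 6: the suffixes "_to_be" and "_to_be_or_not_to_be" are adjacent in the sorted order and share the prefix "_to_be".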
17. Phrase discovery
• Definition — S is a complete substring of T when S occurs in k distinct positions
p1, p2, …, pk in T, the (pi−1)-th character of T differs from the (pj−1)-th
character for at least one pair (i, j), 1≤i<j≤k (left-completeness), and the (pi+|S|)-th
character differs from the (pj+|S|)-th character for at least one pair (i, j), 1≤i<j≤k
(right-completeness).
Example: x a b c d e a b c d f a b c d g
"bc" is not right-complete, as the same character "d" directly follows all of its occurrences.
"bc" is not left-complete either, as the same character "a" directly precedes
all of its appearances.
"bcd" is right-complete because its occurrences are followed by different
letters: "e", "f", and "g", respectively.
"abcd" is a complete substring (phrase): it is both left- and right-complete.
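The completeness checks can be sketched directly against this example (helper names are illustrative, and this is a naive quadratic check rather than the suffix-array formulation):

```python
def occurrences(text, s):
    """All start positions of s in text (overlapping occurrences included)."""
    return [i for i in range(len(text) - len(s) + 1) if text.startswith(s, i)]

def is_left_complete(text, s):
    """True if at least two occurrences are preceded by different characters."""
    preceding = {text[p - 1] if p > 0 else None for p in occurrences(text, s)}
    return len(preceding) > 1

def is_right_complete(text, s):
    """True if at least two occurrences are followed by different characters."""
    following = {text[p + len(s)] if p + len(s) < len(text) else None
                 for p in occurrences(text, s)}
    return len(following) > 1

text = "xabcdeabcdfabcdg"
```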
18. Phrase Discovery
• Dell Zhang and Yisheng Dong proposed an algorithm based on
suffix arrays and LCP arrays that identifies complete phrases in
O(n) time, n being the total length of all processed documents.
The algorithm is described in detail in their paper: Semantic,
Hierarchical, Online Clustering of Web Search
Results
19. Clustering Algorithm
1. Preprocessing
- text filtration
- stemming
- mark stop words
2. Feature extraction: find terms and phrases describing the
documents (dimensions of the vector space)
- convert the documents from char-based to word-based
representation
- concatenate all documents
- create inverted version of the documents
- discover right-complete phrases
- discover left complete phrases
21. Clustering algorithm
- Sort the left-complete phrases alphabetically
- Combine the left- and right-complete phrases into a set of complete
phrases
- For further processing, choose the terms and phrases whose
frequency exceeds the Term Frequency Threshold
3. Determine clusters and cluster labels
- use LSI to discover abstract concepts (candidate clusters)
- determine the labels - best matching words/phrases for each
abstract concept
4. Determine the cluster content
- for each label use VSM similarity to determine the cluster content
5. Calculate clusters score
6. Show the results
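A highly simplified sketch of steps 3–4 above, assuming the TFIDF term-document matrix has already been built; the function name, parameter names, and labeling heuristic here are illustrative, not the authors' actual implementation:

```python
import numpy as np

def cluster_snippets(A, terms, k=2, assign_threshold=0.2):
    """Toy sketch: SVD concepts -> one-word labels -> cosine assignment.
    A is a t x d TFIDF matrix; 'terms' names its rows. Threshold is illustrative."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_unit = A / np.linalg.norm(A, axis=0)
    clusters = []
    for c in range(k):
        # Label each abstract concept with its strongest term in U's c-th column
        label = terms[int(np.argmax(np.abs(U[:, c])))]
        # Treat the label's term vector as a query; score documents by cosine
        q = np.zeros(A.shape[0])
        q[terms.index(label)] = 1.0
        sims = q @ A_unit
        members = [d for d in range(A.shape[1]) if sims[d] > assign_threshold]
        clusters.append((label, members))
    return clusters

A = np.array([[2.0, 2.0, 0.0],
              [2.0, 1.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 2.0]])
terms = ["search", "query", "suffix", "array"]
clusters = cluster_snippets(A, terms, k=2)
```

Real systems would label concepts with complete phrases rather than single terms and score clusters before display, as steps 5–6 describe.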
22. Parameters
1. CandidateClusterThreshold – corresponds to q in an expression relating the
term-document matrix A and its k-rank approximation Ak
2. ClusterAssignmentThreshold – similarity threshold that must be exceeded in order to add a
snippet to a cluster
3. DuplicateClusterThreshold -
4. MinTDF – minimum number of occurrences of a term required for it to be included in the
term-document matrix for further processing
5. PhraseLenWithoutPenalty – max length of the phrases (in words) that are not penalized
6. StrongWordWeight – scaling factor for strong words (those that appear in the document titles)
7. DataFileName – file containing the input documents
8. StopWordsFileName – file containing the stop words
9. AllDocsMaxSize – max size of the TFIDF matrix – for SVD
23. Problems
• Non-deterministic algorithm
• Requires tuning of many parameters
• Depends on the input data, stop words, etc.
• Sensitive to the preprocessing details and the term-weighting
scheme (the TFIDF values depend on them)
• Scalability – the suffix arrays require contiguous memory,
which is infeasible for big document collections
• SVD is slow for high-dimensional matrices; a fast
SVD algorithm is needed
24. DEMO
• Input: a collection of Answers
question/reply documents
• Output: a set of labeled clusters with the
documents belonging to each cluster
28. Potential Applications
• Automatic creation of links
• Clustering the results of search queries
and extending the query search with
cluster labels
• The extended search has the effect of providing
additional data sources for free
• Phrase discovery – find “good” keywords
and phrases in the incoming
documents and use them as tags