LSI TEXT CLUSTERING
Concepts, Algorithm, Demo
George Simov
ECO/SDE
CLUSTERING
• Definition: Clustering – grouping objects on the
basis of similarities between them.
• Clustering algorithms
- Supervised clustering
- Unsupervised clustering
- Hierarchical and non-hierarchical
• Similarity measures – measures of “distance”
between objects
VECTOR SPACE MODEL
• Representing text documents as vectors in
n-dimensional space
d = ((t1, w1), (t2, w2), …, (ti, wi), …, (tn, wn))
t1, t2, …, tn – features (terms) of the document
w1, w2, …, wn – weights measuring the term-document
associations
VECTOR SPACE MODEL
• Each dimension (axis) corresponds to a
document feature.
• Features: words or phrases (bag of words
model)
• TFIDF (term frequency – inverse document
frequency) table
• Cosine similarity measure
TERM WEIGHTING
• Term frequency–inverse document frequency (TFIDF): weight
assigned to each term describing a document:
Wij = TF * IDF = tfij * log (N / dfi)
TF – Term Frequency
IDF – Inverse Document Frequency
Wij – weight of the i-th term in the j-th document
tfij – frequency of the i-th term in document j
N – the total number of documents in the collection
dfi – the number of documents containing the i-th term
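The weighting above can be sketched in a few lines of Python (the tokenized toy documents are made up for illustration):

```python
import math

def tfidf(docs):
    """docs: list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    df = {}                                   # df_i: number of documents containing term i
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        tf = {}                               # tf_ij: raw count of term i in document j
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        # W_ij = tf_ij * log(N / df_i)
        weighted.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return weighted

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "barks"]]
w = tfidf(docs)
# "cat" occurs twice in doc 1 and appears in 2 of 3 docs: weight 2 * log(3/2)
```

Note that a term appearing in every document gets weight 0 – exactly the "discriminating power" intuition behind IDF.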
Similarity Measure and Query
matching
• A – t×d term-document matrix for d documents described by t terms
• q – query vector – a pseudo-document described by the terms of the
query
• Similarity: the cosine between a document vector aj and the query vector q
• After normalizing the vectors to unit length, qTA yields a vector
of similarities between the query and each document in the
collection.
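A minimal sketch of the query matching step with NumPy (matrix values are made-up toy weights; terms are rows, documents columns):

```python
import numpy as np

# Toy 4-term x 3-document TF-IDF matrix A
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 0.0]])
q = np.array([1.0, 0.0, 1.0, 0.0])   # query as a pseudo-document over the same terms

# Normalize each document column and the query to unit length, then q^T A
A_unit = A / np.linalg.norm(A, axis=0)
q_unit = q / np.linalg.norm(q)
sims = q_unit @ A_unit               # cosine similarity to every document at once
best = int(np.argmax(sims))          # index of the best-matching document
```

Here document 0 matches the query exactly (same direction), so its cosine similarity is 1.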
EXAMPLE
Example - results
Classical Vector Space Model
Issues
• Synonyms?
• Single words only
• Assumes that the keywords describing
the documents are mutually independent
• Detailed example and analysis:
http://www.miislita.com/term-vector/term-vector-3.html
Latent Semantic Indexing (LSI)
and SVD
• LSI tries to overcome the problems of
lexical matching by using statistically
derived conceptual indices instead of
individual words
• LSI assumes that there is some
underlying latent structure in word usage
• Singular Value Decomposition is used to
estimate the structure in word usage
across documents
Singular Value Decomposition
(SVD)
• A – t×d term-document matrix
• SVD: every matrix can be represented as a product of three
factors
A = UΣVT
U – t×t orthogonal matrix whose columns are the left singular vectors of A
V – d×d orthogonal matrix whose columns are the right singular vectors of A
Σ – t×d diagonal matrix with the singular values of A ordered
decreasingly along its diagonal (σ1 ≥ σ2 ≥ … ≥ σmin(t,d))
• Interpretation: the columns of the U matrix
summarize the most important abstract concepts present in
the document collection
SVD
K-rank approximation of A
Ak = UkΣkVkT
where:
- Uk is the t×k matrix whose columns are the
first k columns of U
- Vk is the d×k matrix whose columns are
the first k columns of V
- Σk is the k×k diagonal matrix whose diagonal
elements are the k largest singular values
of A
K – rank approximation of A
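The k-rank approximation can be sketched with NumPy's SVD (the matrix values are arbitrary toy numbers):

```python
import numpy as np

# Toy 4x3 term-document matrix
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 2.0]])
# np.linalg.svd returns singular values in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# A_k = U_k Σ_k V_k^T: keep only the k strongest concepts
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem, A_k is the best rank-k approximation of A
# in the Frobenius norm; the error equals the largest dropped singular value here
err = np.linalg.norm(A - Ak)
```

Dropping the small singular values is what filters out "noise" in word usage and keeps only the dominant abstract concepts.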
SUFFIX ARRAYS AND LCP
ARRAYS
• Definition: A suffix array is an
alphabetically ordered array of all suffixes
of a string.
• Substrings can be found by binary search
• Definition: LCP array – an array of integers
whose i-th element contains the length
of the longest common prefix of the i-th
and (i-1)-th elements of the suffix array.
Suffix Array and LCP array
example
• string: _to_be_or_not_to_be
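Both arrays are easy to build naively for the example string (an O(n² log n) sketch, fine for short strings; production suffix-array construction is O(n log n) or better):

```python
def suffix_array(s):
    # Start positions of all suffixes, in alphabetical order of the suffixes
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    # lcp[i] = length of the longest common prefix of suffixes sa[i] and sa[i-1]
    def common(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n
    return [0] + [common(s[sa[i - 1]:], s[sa[i]:]) for i in range(1, len(sa))]

s = "_to_be_or_not_to_be"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
# the repeated substring "_to_be" shows up as an LCP value of 6
```

High LCP values mark repeated substrings, which is exactly what the phrase-discovery step exploits.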
Phrase discovery
• Definition — S is a complete substring of T when S occurs in k distinct positions
p1, p2, …, pk in T, the (pi-1)-th character of T differs from the (pj-1)-th
character for at least one pair (i, j), 1≤i<j≤k (left-completeness), and the (pi+|S|)-th
character differs from the (pj+|S|)-th character for at least one pair (i, j), 1≤i<j≤k
(right-completeness).
Example: x a b c d e a b c d f a b c d g
"bc" is not right-complete, as the same character "d" directly follows all of its occurrences.
"bc" is not left-complete either, as the same character "a" directly precedes
all of its occurrences.
"bcd" is right-complete because its occurrences are followed by different
letters: "e", "f", and "g", respectively.
"abcd" is a complete substring (phrase): it is both left- and right-complete.
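A brute-force check of the definition on the example string (a sketch only — the boundary handling, treating each string end as a unique character, is an assumption the definition above does not spell out; the real algorithm uses suffix/LCP arrays, not this quadratic scan):

```python
def occurrences(t, s):
    # All start positions of s in t, overlapping occurrences included
    return [i for i in range(len(t) - len(s) + 1) if t[i:i + len(s)] == s]

def is_complete(t, s):
    # s is complete when its occurrences have at least two distinct preceding
    # characters (left-complete) AND at least two distinct following characters
    # (right-complete); a string boundary counts as a distinct character.
    pos = occurrences(t, s)
    left = {t[p - 1] if p > 0 else ("^", p) for p in pos}
    right = {t[p + len(s)] if p + len(s) < len(t) else ("$", p) for p in pos}
    return len(left) > 1 and len(right) > 1

t = "xabcdeabcdfabcdg"
# "bc"   -> not complete (always 'a' before, 'd' after)
# "bcd"  -> right-complete but not left-complete, so not complete
# "abcd" -> complete
```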
Phrase Discovery
• Dell Zhang and Yisheng Dong proposed an algorithm based on
suffix arrays and LCP arrays that identifies complete phrases in
O(n) time, n being the total length of all processed documents.
The algorithm is described in detail in their paper: Semantic,
Hierarchical, Online Clustering of Web Search
Results
Clustering Algorithm
1. Preprocessing
- text filtration
- stemming
- mark stop words
2. Feature extraction: find terms and phrases describing the
documents (dimensions of the vector space)
- convert the documents from char-based to word-based
representation
- concatenate all documents
- create an inverted version of the documents
- discover right-complete phrases
- discover left-complete phrases
Clustering Algorithm
• Example of word-based representation of
a document
Clustering algorithm
- Sort the left-complete phrases alphabetically
- Combine left- and right-complete phrases into a set of complete
phrases
- For further processing, choose the terms and phrases whose
frequency exceeds the Term Frequency Threshold
3. Determine clusters and cluster labels
- use LSI to discover abstract concepts (candidate clusters)
- determine the labels – the best matching words/phrases for each
abstract concept
4. Determine the cluster content
- for each label, use VSM similarity to determine the cluster content
5. Calculate cluster scores
6. Show the results
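Steps 3–4 can be sketched on a toy matrix (a simplified illustration, not the full algorithm: labels here are single terms only, whereas the algorithm above also considers phrases; the matrix values and vocabulary are invented):

```python
import numpy as np

terms = ["cat", "mat", "dog", "bone"]
# Toy 4-term x 4-document TF-IDF matrix: docs 0-1 about cats, docs 2-3 about dogs
A = np.array([[3.0, 1.0, 0.0, 0.0],
              [1.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                  # keep the two strongest abstract concepts
# Label each concept with the term that loads on it most strongly (column of U)
labels = [terms[int(np.argmax(np.abs(U[:, j])))] for j in range(k)]

# Assign each document to the concept it loads on most strongly (rows of V^T)
assignments = [int(np.argmax(np.abs(Vt[:k, d]))) for d in range(A.shape[1])]
```

On this toy input the two concepts cleanly separate the cat documents from the dog documents, which is the behaviour the candidate-cluster step relies on.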
Parameters
1. CandidateClusterThreshold – corresponds to q in the following expression where A is
the term-document matrix and Ak is its k-rank approximation:
2. ClusterAssignmentThreshold – similarity threshold that must be exceeded in order to add a
snippet to a cluster
3. DuplicateClusterThreshold -
4. MinTDF – minimum number of occurrences required for a term to be included in the term-document
matrix for further processing
5. PhraseLenWithoutPenalty – max length of the phrases (in words) that are not penalized
6. StrongWordWeight – scaling factor for strong words (those that appear in the document titles)
7. DataFileName – file containing the input documents
8. StopWordsFileName – file containing the stop words
9. AllDocsMaxSize – max size of the TFIDF matrix – for SVD
Problems
• Non-deterministic algorithm
• Requires tuning many parameters
• Depends on the input data, stop words, etc.
• Sensitive to the preprocessing details and the term
weighting scheme (TFIDF depends on them)
• Scalability – the suffix arrays require contiguous memory,
so the approach does not scale to large document collections
• SVD is slow for high-dimensional matrices; a fast
SVD algorithm is needed
DEMO
• Input: a collection of Answers
question/reply documents
• Output: Set of labeled clusters with the
documents belonging to each cluster
Answers Questions 100
Answers – questions and
replies - 600 (page 1/2)
Answers – questions and
replies - 600 (page 2/2)
Potential Applications
• Automatic creation of links
• Clustering the results of search queries
and extending the search query with
cluster labels – has the effect of gaining
additional data sources for free
• Phrase discovery – find “good” keywords
and phrases in the incoming
documents and use them as tags