4. Mathematical treatment capable
of inferring meaning
Measures of word-word, word-passage,
& passage-passage relations that
correlate well with human
understanding of semantic similarity
Similarity estimates are NOT simple
contiguity frequencies, co-occurrence
counts, or usage correlations
A mathematical analysis capable of inferring
deeper relationships; hence “latent”
5. Akin to a well-read nun dispensing
sex-advice
Analysis of text alone
Its knowledge does NOT come from
perceived information about the physical
world, NOT from instinct, NOT from
feelings, NOT from emotions
Does NOT take into account word order,
phrases, syntactic relationships, or logic
It takes in large amounts of text and looks
for mutual interdependencies in the text
6. Words and Passages
LSA represents the meaning of a word as the
average of the meaning of all the passages in
which it appears…
…and the meaning of the passage as an
average of the meaning of the words it
contains
[Figure: word1, word2, word3]
7. What is LSA?
LSA is a mathematical technique for
extracting and inferring relations of
expected contextual usage of words in
documents
8. What LSA is not
Not a natural language processing
program
Not an artificial intelligence program
Does NOT use dictionaries or databases
Does NOT use syntactic parsers
Does NOT use morphologies
Takes only words and text paragraphs
as input
9. Example
Titles of N=9 technical memoranda
Five on human-computer interaction
Four on mathematical graph theory
Disjoint topics
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
10. Sample Word-by-Document Matrix
Word selection criterion: the word occurs in at least two of the
titles
How much was said about a topic
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
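A minimal sketch of how such a word-by-document count matrix could be built, assuming whitespace tokenization and the "at least two titles" selection rule above. The titles are hypothetical toy data, not the nine memoranda from the source, and `build_matrix` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical toy titles (NOT the nine memoranda from the source).
titles = [
    "human computer interface",
    "user interface system",
    "graph trees",
    "graph minors survey",
    "user survey computer system",
]

def build_matrix(docs, min_docs=2):
    """Return (vocab, matrix); matrix[i][j] counts word i in document j."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of titles that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Keep only words that occur in at least `min_docs` titles.
    vocab = sorted(w for w, n in doc_freq.items() if n >= min_docs)
    matrix = [[tokens.count(w) for tokens in tokenized] for w in vocab]
    return vocab, matrix

vocab, matrix = build_matrix(titles)
for word, row in zip(vocab, matrix):
    print(f"{word:10s} {row}")
```

Each row is a word, each column a title, and each cell a raw count.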
11. Semantic Similarity
using the Spearman rank correlation
coefficient
The correlation between human and user is
negative, -0.38
The correlation between human and minor is
also negative, -0.29
Expected: the words never appear in the same
passage, so there are no co-occurrences
Spearman ρ (human.user) = -0.38
Spearman ρ (human.minor) = -0.29
http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
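A sketch of the Spearman rank correlation between two word rows, computed as the Pearson correlation of tie-averaged ranks. The two rows are an assumed occurrence pattern (one word in two of nine titles, the other in three different titles); they are not copied from the source matrix, and the helper names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Assumed pattern: the two words never occur in the same title.
word_a = [1, 0, 0, 1, 0, 0, 0, 0, 0]
word_b = [0, 1, 1, 0, 1, 0, 0, 0, 0]
print(round(spearman(word_a, word_b), 2))  # negative: no shared titles
```

With raw counts, two words that never share a passage can only correlate negatively, however related their meanings are.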
13. The Term Space
Documents
Terms
Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
14. The Document Space
Documents
Terms
Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
15. The Semantic Space
one space for terms and documents
Represent terms AND documents in one
space
Makes it possible to calculate similarities
Between documents
Between terms
Between terms and documents
16. The Decomposition
[Figure: SVD of the term-by-document matrix M, with rows Term1, Term2, Term3]
$M_{t \times d} = T_{t \times r} \, S_{r \times r} \, D^{T}_{r \times d}$
Splits the term-by-document matrix into three matrices
Defines a new space, the SVD space, whose axes are found by SVD
Terms and documents can be grouped along these new axes
17. New Term Vector, New Document
Vector, & Singular Values
T contains in its rows the term vectors
scaled to a new basis
DT contains the new vectors of the
documents
S contains the singular values
σ1, σ2, …, σr
where σ1 ≥ σ2 ≥ … ≥ σr ≥ 0
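The three factors can be inspected with NumPy's `numpy.linalg.svd`. This is a sketch on a hypothetical 4-term by 3-document matrix, not the source's data. Note that NumPy returns min(t, d) singular values, which may include zeros when the rank is lower.

```python
import numpy as np

# Hypothetical 4-term x 3-document count matrix (not the source's data).
M = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# full_matrices=False gives the compact form: T is t x r, Dt is r x d.
T, s, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(s)  # singular values placed on the diagonal

print(T.shape, S.shape, Dt.shape)
print(s)                            # sorted in descending order
print(np.allclose(M, T @ S @ Dt))   # the product recovers M
```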
19. Reduce to k Dimensions
[Figure: the decomposition truncated to k dimensions, with rows Term1, Term2, Term3]
$M_{t \times d} \approx T_{t \times k} \, S_{k \times k} \, D^{T}_{k \times d}$
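A sketch of the reduction step: keep only the k largest singular values and the matching columns of T and rows of DT, then multiply the pieces back together. The matrix is hypothetical toy data and `truncate_svd` is an illustrative helper name.

```python
import numpy as np

def truncate_svd(M, k):
    """Return (T_k, S_k, Dt_k) for the k largest singular values of M."""
    T, s, Dt = np.linalg.svd(M, full_matrices=False)
    return T[:, :k], np.diag(s[:k]), Dt[:k, :]

# Hypothetical 4-term x 3-document count matrix (not the source's data).
M = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

Tk, Sk, Dtk = truncate_svd(M, k=2)
M_hat = Tk @ Sk @ Dtk  # rank-k reconstruction, same shape as M

print(Tk.shape, Sk.shape, Dtk.shape)
print(np.round(M_hat, 2))
```

The reconstruction has the same t x d shape as M, but its entries are no longer plain counts: cells that were 0 can become nonzero when related words occur in a document.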
20. Example
Term Vector Reduced to two Dimensions
[Figure: matrices T, S, and D]
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
21. Reconstruction of the original matrix
based on the reduced dimensions
[Figure: reconstructed matrix (NEW) compared with the original matrix (Original)]
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
22. Recomputed Semantic Similarity
using the Spearman rank correlation
coefficient
NEW
Spearman ρ (human.user) = +0.94
Spearman ρ (human.minor) = -0.83
Original
Spearman ρ (human.user) = -0.38
Spearman ρ (human.minor) = -0.29
The human-user correlation went up and the human-minor correlation went down
23. Correlation between a title and all
other titles – Raw Data
• Correlation between the human-computer interaction titles was low
• Average correlation, 0.2; half the Spearman correlations were 0
• Correlation between the four graph-theory papers (mx / my) was mixed
• Average Spearman correlation was 0.44
• Correlation between human-computer interaction titles and the
graph-theory papers averaged -0.3, even though the topics do not overlap semantically
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
24. Correlation in the reduced
dimension (k=2) space
• Average correlation between the human-computer interaction titles jumped from 0.2 to 0.92
• Correlation between the graph-theory papers (mx/my) was high: 1.0
• Correlation between human-computer interaction titles and the
graph-theory papers was strongly negative
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
26. How to treat a query
Build the term-by-document matrix
Perform SVD, reduce dimensions to 50-400
A query is a “pseudo-document”
Weighted average of the vectors of the words it
contains
Use a similarity metric (such as cosine)
between the query vector and the document
vectors
Rank the results
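The steps above can be sketched as follows. The fold-in formula q̂ = qᵀ T_k S_k⁻¹ is the form commonly used in latent semantic indexing and is an assumption here, since the slide does not state the exact weighting. The terms, the matrix, and the `fold_in` helper are hypothetical, and k=2 is used only because the toy matrix is tiny.

```python
import numpy as np

# Hypothetical vocabulary and term-by-document counts (not the source's data).
# Documents 0-1 are about interfaces, documents 2-3 about graphs.
terms = ["computer", "graph", "interface", "tree", "user"]
M = np.array([
    [1.0, 1.0, 0.0, 0.0],  # computer
    [0.0, 0.0, 1.0, 1.0],  # graph
    [1.0, 0.0, 0.0, 0.0],  # interface
    [0.0, 0.0, 1.0, 1.0],  # tree
    [0.0, 1.0, 0.0, 0.0],  # user
])

k = 2
T, s, Dt = np.linalg.svd(M, full_matrices=False)
Tk, sk, Dk = T[:, :k], s[:k], Dt[:k, :].T  # rows of Dk are document vectors

def fold_in(query_words):
    """Map a bag of query words into the k-dimensional space (assumed formula)."""
    q = np.array([float(query_words.count(t)) for t in terms])
    return (q @ Tk) / sk  # q^T T_k S_k^{-1}

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

q_hat = fold_in(["user"])
scores = np.array([cosine(q_hat, d) for d in Dk])
ranking = np.argsort(-scores)  # best-matching documents first

print(np.round(scores, 2))
print(ranking)
```

In this toy example document 0 does not contain the word "user", yet it scores as high as document 1, because both documents share "computer" and collapse onto the same axis in the reduced space.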
27. The Query Vector
Does better than literal matching between terms in the
query and the documents
Superior when query and document use different
words
Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
28. References
• Latent Semantic Indexing and
Information Retrieval, Johanna Geiß
• An Introduction to Latent Semantic
Analysis, Landauer, Foltz, Laham