How to Measure Document Similarity and Build Text Classifiers: A First Look at Term Frequency-Inverse Document Frequency (TF-IDF) Representations
Text data is potentially valuable for many data science projects, but working with text differs from working with structured data. One representation that has worked well for many text mining and machine learning applications is the term frequency-inverse document frequency (TF-IDF) vector. Despite the long-winded name, this method is easy to understand, performs well in many applications, and is implemented in commonly used data science tools. This presentation introduces TF-IDF and shows examples of using it for document classification and for measuring the similarity between documents.
This presentation does not assume any background in text mining or natural language processing. Examples will use Python.
3. Challenges
No obvious structure
Fully understanding language is hard
Large number of documents
Want to
Find documents based on similarity
Classify documents
10. Example: Corpus of Machine Learning Papers
Some terms appear frequently
“Feature”
“Algorithm”
“Training”
Some less frequently
“Reinforcement”
“Non-linear”
“Convolution”
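The frequent/infrequent split above can be checked on any corpus by tallying how many documents each term appears in. A minimal sketch with a hypothetical toy corpus (the documents below are illustrative stand-ins, not the actual machine learning papers):

```python
from collections import Counter

# Hypothetical stand-in for a corpus of machine learning abstracts.
corpus = [
    "the training algorithm selects each feature",
    "a feature of the training set",
    "reinforcement learning uses a convolution layer",
]

# Count how many documents each term appears in (document frequency).
# Using set() so a term counts once per document, however often it occurs.
doc_freq = Counter()
for doc in corpus:
    doc_freq.update(set(doc.split()))

print(doc_freq["feature"])      # appears in 2 documents
print(doc_freq["convolution"])  # appears in 1 document
```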
11. Intuition
Combinations of words are good indicators of a document’s topic
Self-driving cars: “automobile”, “driver”, “radar”, “image”, “sensor”
Text mining: “corpus”, “term vector”, “syntax”
Social networks: “graph”, “communities”, “users”, “influence”
Words that appear frequently across documents in a corpus are not good indicators of topic
Words that appear frequently only within documents about a single topic are good indicators of topic
15. Formalizing Intuition: TF-IDF
Notation
t - a term
d - a document
D - a set of documents (corpus)
N - number of documents in the corpus
TF - term frequency
tf(t,d) is the number of times term t occurs in document d
IDF - inverse document frequency
idf(t,D) = log(N / |{d in D : t in d}|)
TF-IDF
tfidf(t,d,D) = tf(t,d) * idf(t,D)
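The definitions above translate directly into Python. A minimal sketch, using a made-up three-document corpus and the natural log (the base is unspecified in the formula; natural log is a common choice):

```python
import math

def tf(t, d):
    """tf(t, d): number of times term t occurs in document d (a list of tokens)."""
    return d.count(t)

def idf(t, D):
    """idf(t, D) = log(N / |{d in D : t in d}|)."""
    N = len(D)
    df = sum(1 for d in D if t in d)  # number of documents containing t
    return math.log(N / df)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

# Hypothetical pre-tokenized corpus for illustration.
D = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
    "a bicycle rides on the road".split(),
]

# "the" appears in every document, so idf = log(3/3) = 0 and its weight vanishes ...
print(tfidf("the", D[0], D))            # 0.0
# ... while "car" appears in only one document, so idf = log(3/1).
print(round(tfidf("car", D[0], D), 3))  # 1.099
```

Note that a term occurring in every document gets weight zero regardless of how often it occurs in any single document.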
18. TF-IDF is
Large when:
a term occurs many times in a document (large TF), and
few documents in the corpus contain the term (large IDF)
Small when:
the term appears in many documents in the corpus
[Figure: TF-IDF plotted against frequency, with regions labeled “Stop Words”, “Common Words”, and “Rare Words”]
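With a TF-IDF vector for each document, the similarity between documents (one of the stated goals) can be measured with cosine similarity. A self-contained sketch in plain Python, using a hypothetical toy corpus; real projects would typically use a library implementation instead:

```python
import math

def tfidf_vector(d, D, vocab):
    """Build a TF-IDF vector for document d over a fixed vocabulary."""
    N = len(D)
    vec = []
    for t in vocab:
        tf = d.count(t)
        df = sum(1 for doc in D if t in doc)
        idf = math.log(N / df) if df else 0.0
        vec.append(tf * idf)
    return vec

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical pre-tokenized corpus for illustration.
D = [
    "the car is driven on the road".split(),
    "the truck is driven on the highway".split(),
    "a bicycle rides on the road".split(),
]
vocab = sorted({t for d in D for t in d})
vecs = [tfidf_vector(d, D, vocab) for d in D]

# The two vehicle documents share distinctive terms like "driven", so they
# score higher with each other than either does with the bicycle document.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Shared stop words like “the” and “on” contribute nothing to the score here, because their TF-IDF weight is zero; only the rarer, topical terms drive the similarity.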