KNN Classifier Explained
1. K Nearest Neighbour Classifier
● Tejas Bubane (I-05)
● Shriyansh Jain (H-43)
● Mitesh Butala (J-15)
● Gaurav Jagtap (H-42)
Project Guide: Asst. Prof. P.A. Bailke
VIT, Pune
2. TF-IDF Values
● Term Frequency (TF): Importance of the term within that document – raw
frequency
i.e. TF(d, t) = number of occurrences of term t in document d
● Inverse Document Frequency (IDF): Importance of the term across the corpus
IDF(t) = log(D / dt)
where, D = total number of documents
dt = number of documents in which the term t occurs
If a word occurs in many documents it is less useful, so its IDF value is low (and vice-versa)
● TF-IDF(d,t) = TF(d,t) × IDF(t)
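The formulas above can be sketched directly in Python; a minimal example (the toy corpus below is made up for illustration):

```python
import math

def tf(term, doc):
    # Raw frequency: number of occurrences of the term in the document
    return doc.count(term)

def idf(term, docs):
    # D = total number of documents, dt = documents containing the term
    D = len(docs)
    dt = sum(1 for d in docs if term in d)
    return math.log(D / dt)

def tf_idf(term, doc, docs):
    # TF-IDF(d, t) = TF(d, t) * IDF(t)
    return tf(term, doc) * idf(term, docs)

# Toy corpus: each document is a list of tokens
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
```

Note that a term occurring in every document (like "the" here) gets IDF = log(3/3) = 0, so its TF-IDF weight vanishes, matching the intuition on the slide.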
3. KNN - Introduction
● Learning by analogy – comparison with similar items from training set
● Training tuples described by n attributes – each document represents a
point in n dimensional space
● Closeness is defined in terms of a distance metric,
e.g. Euclidean distance, cosine similarity, Manhattan distance
● Cos θ = 1, i.e. angle = 0°: documents are similar
● Cos θ = 0, i.e. angle = 90°: documents are not similar
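Cosine similarity compares the angle between two document vectors rather than their magnitudes; a minimal sketch:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Parallel vectors -> cos = 1 (similar); orthogonal -> cos = 0 (not similar)
```

In practice the vector components would be the TF-IDF weights of each vocabulary term in the two documents.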
4. KNN Algorithm
● Compute the cosine similarity of the query document with each document
in the training set
● Find the k documents that are closest / nearest to the query document
● Class of query is the class of majority of the nearest neighbours
(classes of each document in the training set are known)
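The three steps above can be sketched as follows (the training vectors and class labels are invented for illustration):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_classify(query_vec, training, k=3):
    # training: list of (vector, class_label) pairs with known classes
    # 1. Score every training document against the query (higher = closer)
    scored = sorted(training,
                    key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    # 2. Keep the k nearest neighbours
    top_k = [label for _, label in scored[:k]]
    # 3. Class of the query is the majority class among those neighbours
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical 2-dimensional document vectors with made-up labels
training = [([1.0, 0.0], "sports"),
            ([0.9, 0.1], "sports"),
            ([0.0, 1.0], "politics")]
```

Real document vectors would have one dimension per vocabulary term, but the majority-vote logic is identical.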
5. Further Analysis of Classification
● Lazy Learner : Defers computation until a query is provided
e.g. KNN (calculates TF-IDF values and distances only after receiving the query)
● Eager Learner : Builds its model from the training data before any query is received
e.g. ANN (adjusts weights before receiving the query)
● Supervised Learning : Labelled training data
eg. Classification
● Unsupervised Learning : Find hidden structure in unlabelled data
e.g. Clustering
● KNN is a Supervised Learning algorithm and follows the Lazy Learning approach
6. Scaling KNN
● Vocabulary – Set of all words occurring in all documents
● Large data set – drastic increase in vocabulary size – difficult to handle
● Feature Selection – measure the relation between terms in the vocabulary and the classes
Remove words that are weakly related (below a threshold) to every class
This reduces the vocabulary to a manageable size
● e.g. Chi-square test