This document provides an overview of machine learning concepts including supervised and unsupervised learning algorithms. It describes naive Bayes classifiers which use probabilistic models to classify data based on features. It also describes k-means clustering which groups unlabeled data into k clusters by minimizing distances between data points and assigned cluster centroids. The document provides examples of applying these algorithms to tasks like document and image classification, customer segmentation, and grouping related news articles.
2. Definitions
Pattern recognition, artificial intelligence and a bit of data mining
Solves a given task without being explicitly programmed to do so; instead, it makes predictions from the provided data
Machine learning algorithms can be divided into 3 categories:
Supervised learning
Unsupervised learning
Reinforcement learning
Problem types
Classification
Regression
Clustering
etc.
3. Algorithms
Supervised Learning
Naive Bayesian Classifier
Linear/Polynomial/Logistic/Multinomial Regression
Neural Networks
etc.
Unsupervised Learning
K-means / K-medoids
Principal Component Analysis
Gaussian Distribution (Anomaly Detection)
etc.
4. Naive Bayes Classifier
Classify information based on probabilistic model score
Score for a category C_k with features f_1, f_2, f_3, ..., f_n:
$$p(C_k \mid f_1, f_2, \ldots, f_n) = \frac{P(C_k)\, p(f_1 \mid C_k)\, p(f_2 \mid C_k) \cdots p(f_n \mid C_k)}{p(f_1)\, p(f_2) \cdots p(f_n)}$$
For a text classifier, the features above are the individual words of the sentence (bag-of-words model)
Also known as the multinomial naive Bayes classifier
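The bag-of-words idea can be illustrated with a minimal sketch, assuming a simple lowercase whitespace tokenizer. The function name `bag_of_words` and the sample sentence are hypothetical and not part of the original slides.

```python
from collections import Counter

def bag_of_words(sentence):
    """Split a sentence on whitespace and count each lowercased word.

    Word order is discarded; only the number of occurrences is kept.
    """
    return Counter(sentence.lower().split())

# Hypothetical sentence, used only to show the shape of the features.
features = bag_of_words("Cut lemon and pour salt on the lemon")
print(features["lemon"])  # 2
print(features["tbsp"])   # 0, a Counter returns 0 for words it has not seen
```

Each distinct word then plays the role of one feature f_i in the score formula.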
6. Naive Bayes Classifier
Concrete Example
Ingredients
word occurrences
2 1
tbsp 1
salt 1
lemon 1
total 4
examples 2
Instructions
word occurrences
cut 1
lemon 1
pour 1
salt 1
total 4
examples 2
Global
word occurrences
2 1
tbsp 1
salt 2
lemon 2
cut 1
pour 1
total 8
examples 4
7. Naive Bayes Classifier
Concrete Example
Ingredients (prior probability 1/2)
word probability
2 1/4
tbsp 1/4
salt 1/4
lemon 1/4
Instructions (prior probability 1/2)
word probability
cut 1/4
lemon 1/4
pour 1/4
salt 1/4
Global
word probability
2 1/8
tbsp 1/8
salt 2/8
lemon 2/8
cut 1/8
pour 1/8
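As a rough sketch, the probability tables can be derived from the occurrence counts as follows. The counts are copied from the slides; the variable and function names are illustrative assumptions.

```python
from fractions import Fraction

# Word occurrences per category, copied from the count tables.
counts = {
    "Ingredients": {"2": 1, "tbsp": 1, "salt": 1, "lemon": 1},
    "Instructions": {"cut": 1, "lemon": 1, "pour": 1, "salt": 1},
}
# Number of training examples per category.
examples = {"Ingredients": 2, "Instructions": 2}

def word_probabilities(word_counts):
    """Divide each word count by the total number of words in the table."""
    total = sum(word_counts.values())
    return {word: Fraction(count, total) for word, count in word_counts.items()}

# Prior of a category = its share of the training examples.
priors = {c: Fraction(n, sum(examples.values())) for c, n in examples.items()}
# Per-category word probabilities p(word | category).
likelihoods = {c: word_probabilities(wc) for c, wc in counts.items()}

# The global table merges the counts of all categories.
global_counts = {}
for word_counts in counts.values():
    for word, count in word_counts.items():
        global_counts[word] = global_counts.get(word, 0) + count
evidence = word_probabilities(global_counts)

print(priors["Ingredients"])               # 1/2
print(likelihoods["Ingredients"]["salt"])  # 1/4
print(evidence["salt"])                    # 1/4 (2/8 in lowest terms)
```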
8. Naive Bayes Classifier
Concrete Example
Query '1 tbsp salt'
Ingredients (I)
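$$p(I \mid 1, \text{tbsp}, \text{salt}) = \frac{P(I)\, p(1 \mid I)\, p(\text{tbsp} \mid I)\, p(\text{salt} \mid I)}{p(1)\, p(\text{tbsp})\, p(\text{salt})} = \frac{0.5 \times 0.0001 \times 0.25 \times 0.25}{0.0001 \times 0.125 \times 0.25} = 1$$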
Instructions (D)
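$$p(D \mid 1, \text{tbsp}, \text{salt}) = \frac{P(D)\, p(1 \mid D)\, p(\text{tbsp} \mid D)\, p(\text{salt} \mid D)}{p(1)\, p(\text{tbsp})\, p(\text{salt})} = \frac{0.5 \times 0.0001 \times 0.0001 \times 0.25}{0.0001 \times 0.125 \times 0.25} = 0.0004$$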
Result: Ingredients (since it has the highest probability)
Note: 0.0001 is the probability assigned to an unknown element such as the word "1" (it cannot be zero, since a single zero factor would wipe out the whole product!)
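The scoring of the query above can be reproduced with a short sketch. The probabilities are the ones from the slides; the names `UNKNOWN`, `score` and the dictionary layout are assumptions made for illustration.

```python
UNKNOWN = 0.0001  # probability assigned to a word never seen in training

prior = {"Ingredients": 0.5, "Instructions": 0.5}
likelihood = {
    "Ingredients": {"2": 0.25, "tbsp": 0.25, "salt": 0.25, "lemon": 0.25},
    "Instructions": {"cut": 0.25, "lemon": 0.25, "pour": 0.25, "salt": 0.25},
}
evidence = {"2": 0.125, "tbsp": 0.125, "salt": 0.25, "lemon": 0.25,
            "cut": 0.125, "pour": 0.125}

def score(category, words):
    """Prior times per-word likelihoods, divided by the per-word global probabilities."""
    numerator = prior[category]
    denominator = 1.0
    for word in words:
        numerator *= likelihood[category].get(word, UNKNOWN)
        denominator *= evidence.get(word, UNKNOWN)
    return numerator / denominator

query = "1 tbsp salt".split()
scores = {category: score(category, query) for category in prior}
best = max(scores, key=scores.get)
print(best)  # Ingredients
```

The two scores come out at roughly 1 and 0.0004, matching the hand calculation, so the query is classified as Ingredients.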
10. K-Means
Unsupervised learning algorithm to identify clusters
Find clusters for unlabeled data
Algorithm
k-means
Choose K examples as initial centroids
While centroids move
1) Choose closest centroid Ki for each xi and store distance ci
2) Calculate new centroid Ki in each cluster
end
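The pseudocode above could be sketched in plain Python as follows. This is a minimal illustration, assuming Euclidean distance and points given as tuples of floats; the helper names and the sample points are hypothetical.

```python
import math
import random

def closest(point, centroids):
    """Return (index, distance) of the centroid nearest to point."""
    distances = [math.dist(point, c) for c in centroids]
    index = distances.index(min(distances))
    return index, distances[index]

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def k_means(points, k, rng=random):
    centroids = rng.sample(points, k)  # choose K examples as initial centroids
    while True:
        # 1) assign each x_i to its closest centroid and store the distance c_i
        assigned = [closest(p, centroids) for p in points]
        # 2) recompute each centroid as the mean of its cluster
        new_centroids = []
        for j in range(k):
            members = [p for p, (idx, _) in zip(points, assigned) if idx == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:  # centroids stopped moving
            return centroids, assigned
        centroids = new_centroids

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assigned = k_means(points, 2, random.Random(0))
labels = [idx for idx, _ in assigned]
print(labels)
```

The first three points end up sharing one label and the last three the other; which group gets label 0 depends on the random initial centroids.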
11. K-Means
K-means example steps to converge to final solution
Figure : Taken from https://en.wikipedia.org/wiki/File:K_Means_Example_Step_2.svg
12. K-Means
How to avoid sub-optimal results ?
Figure : Generated from http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
13. K-Means
How to avoid sub-optimal results ?
k-means
Repeat N times do
Randomly choose K examples as initial centroids
While centroids move
1) Choose closest centroid Ki for each xi and store distance ci
2) Calculate new centroid Ki in each cluster
end
Calculate result cost (average distance of examples to their centroids)
If result cost is lower, keep this result as the best so far
end (repeat)
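The restart strategy above might look like this in Python. This is a sketch under the same assumptions as before (Euclidean distance, tuples of floats); `run_k_means`, `best_of_n`, `n_runs` and `seed` are illustrative names, and the data is hypothetical.

```python
import math
import random

def run_k_means(points, k, rng):
    """One k-means run; returns (centroids, average distance to own centroid)."""
    centroids = rng.sample(points, k)  # random initial centroids
    while True:
        clusters = [[] for _ in range(k)]
        distances = []
        for p in points:
            d = [math.dist(p, c) for c in centroids]
            j = d.index(min(d))        # closest centroid
            clusters[j].append(p)
            distances.append(d[j])     # stored distance c_i
        moved = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old        # assumption: empty cluster keeps its centroid
            for members, old in zip(clusters, centroids)
        ]
        if moved == centroids:         # centroids stopped moving
            return centroids, sum(distances) / len(points)
        centroids = moved

def best_of_n(points, k, n_runs=10, seed=0):
    """Repeat k-means n_runs times and keep the result with the lowest cost."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_runs):
        result = run_k_means(points, k, rng)
        if best is None or result[1] < best[1]:
            best = result
    return best

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, cost = best_of_n(points, 2)
print(centroids, cost)
```

Keeping only the lowest-cost run reduces the chance of ending in a poor local optimum caused by an unlucky initialization, although it does not guarantee the global optimum.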
14. K-Means
Elbow Method - How to identify the number of clusters ?
Figure : K-means elbow method
15. K-Means
Elbow Method - How to identify the number of clusters ?
Figure : Solution for k=1
16. K-Means
Elbow Method - How to identify the number of clusters ?
Figure : Solution for k=2
17. K-Means
Elbow Method - How to identify the number of clusters ?
Figure : Solution for k=3
18. K-Means
Elbow Method - How to identify the number of clusters ?
Figure : Solution for k=4
19. K-Means
Elbow Method - How to identify the number of clusters ?
Figure : Solution for k=5
21. K-Means
Elbow Method - How to identify the number of clusters ?
Elbow method
Repeat for clusters K = 1,2,3,...n
Run K-Means
Compute average cost for K clusters: $\frac{1}{n}\sum_{i=1}^{n} c_i$ (simplifying: $\sum_{i=1}^{n} c_i$)
end (repeat)
Plot cost for each K and choose the one located at the "elbow"
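The elbow loop above can be sketched as below, using the simplified cost (sum of the distances c_i) and a single k-means run per K. The function names and the sample data are hypothetical, and the costs are printed as a table instead of being plotted.

```python
import math
import random

def k_means_cost(points, k, rng):
    """Run k-means once and return the sum of distances c_i to the final centroids."""
    centroids = rng.sample(points, k)
    while True:
        clusters = [[] for _ in range(k)]
        total = 0.0
        for p in points:
            d = [math.dist(p, c) for c in centroids]
            j = d.index(min(d))
            clusters[j].append(p)
            total += d[j]
        moved = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old  # assumption: empty cluster keeps its centroid
            for members, old in zip(clusters, centroids)
        ]
        if moved == centroids:
            return total
        centroids = moved

def elbow_costs(points, max_k, seed=0):
    """Cost for K = 1, 2, ..., max_k, keyed by K."""
    rng = random.Random(seed)
    return {k: k_means_cost(points, k, rng) for k in range(1, max_k + 1)}

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
costs = elbow_costs(points, len(points))
for k, cost in costs.items():
    print(k, round(cost, 3))
```

For this data the cost drops sharply from K=1 to K=2 and only slowly afterwards, so the elbow sits at K=2.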
22. K-Means
Elbow Method - How to identify the number of clusters ?
Figure : K-means elbow method
23. K-Means
Elbow Method - How to identify the number of clusters ?
Figure : K-means elbow method
It is not always possible to find an elbow (e.g. with evenly distributed examples)
Best practice: associate the number of clusters with a business meaning