Machine Learning Basics

Classification and Clustering
Humberto Marchezi
hcmarchezi@gmail.com
November 2015

Definitions
Machine learning combines pattern recognition, artificial intelligence and a bit of data mining
It solves a given task without being explicitly programmed to do so; instead, it makes predictions from the data provided
Machine learning algorithms can be divided into 3 categories:
Supervised learning
Unsupervised learning
Reinforcement learning
Problem types
Classification
Regression
Clustering
etc.

Algorithms
Supervised Learning
Naive Bayesian Classifier
Linear/Polynomial/Logistic/Multinomial Regression
Neural Networks
etc.
Unsupervised Learning
K-means / K-medoids
Principal Component Analysis
Gaussian Distribution (Anomaly Detection)
etc.

Naive Bayes Classifier
Classifies information based on a probabilistic model score
Score for a category C_k with features f_1, f_2, ..., f_n:
\[
p(C_k \mid f_1, f_2, \ldots, f_n) = \frac{P(C_k)\, p(f_1 \mid C_k)\, p(f_2 \mid C_k) \cdots p(f_n \mid C_k)}{p(f_1)\, p(f_2) \cdots p(f_n)}
\]
For a text classifier, the features are the words of the sentence (bag-of-words model)
This variant is also known as the multinomial naive Bayes classifier

Concrete Example
Ingredients
2 tbsp salt
lemon
Instructions
Cut lemon
Pour salt

Concrete Example
Ingredients
word      occurrences
2         1
tbsp      1
salt      1
lemon     1
total     4
examples  2
Instructions
word      occurrences
cut       1
lemon     1
pour      1
salt      1
total     4
examples  2
Global
word      occurrences
2         1
tbsp      1
salt      2
lemon     2
cut       1
pour      1
total     8
examples  4

Concrete Example
Ingredients (prior probability 1/2)
word      probability
2         1/4
tbsp      1/4
salt      1/4
lemon     1/4
Instructions (prior probability 1/2)
word      probability
cut       1/4
lemon     1/4
pour      1/4
salt      1/4
Global
word      probability
2         1/8
tbsp      1/8
salt      2/8
lemon     2/8
cut       1/8
pour      1/8
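
The tables above can be rebuilt from the four training lines. The following is a minimal sketch, not the author's code: the names `TRAINING`, `word_counts` and `to_probabilities` are hypothetical, and lowercasing plus whitespace splitting is an assumed tokenization.

```python
from collections import Counter
from fractions import Fraction

# Training examples from the slides: (category, text).
TRAINING = [
    ("Ingredients", "2 tbsp salt"),
    ("Ingredients", "lemon"),
    ("Instructions", "Cut lemon"),
    ("Instructions", "Pour salt"),
]


def word_counts(examples):
    """Count lowercase word occurrences per category and over all examples."""
    per_category = {}
    overall = Counter()
    for category, text in examples:
        words = text.lower().split()
        per_category.setdefault(category, Counter()).update(words)
        overall.update(words)
    return per_category, overall


def to_probabilities(counts):
    """Map each word to its exact share of the total occurrences."""
    total = sum(counts.values())
    return {word: Fraction(n, total) for word, n in counts.items()}


per_category, overall = word_counts(TRAINING)

# Prior of a category = its number of examples / total number of examples.
examples_per_category = Counter(category for category, _ in TRAINING)
priors = {c: Fraction(n, len(TRAINING)) for c, n in examples_per_category.items()}

likelihoods = {c: to_probabilities(counts) for c, counts in per_category.items()}
evidence = to_probabilities(overall)

print(priors["Ingredients"])               # 1/2
print(likelihoods["Ingredients"]["salt"])  # 1/4
print(evidence["salt"])                    # 1/4 (2/8 in lowest terms)
```

`Fraction` is used here only so that the printed values match the fractions on the slide; plain floats would work as well.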

Concrete Example
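Query: '1 tbsp salt'
Ingredients (I)
\[
p(I \mid 1, \text{tbsp}, \text{salt}) = \frac{P(I)\, p(1 \mid I)\, p(\text{tbsp} \mid I)\, p(\text{salt} \mid I)}{p(1)\, p(\text{tbsp})\, p(\text{salt})} = \frac{0.5 \times 0.0001 \times 0.25 \times 0.25}{0.0001 \times 0.125 \times 0.25} = 1
\]
Instructions (D)
\[
p(D \mid 1, \text{tbsp}, \text{salt}) = \frac{P(D)\, p(1 \mid D)\, p(\text{tbsp} \mid D)\, p(\text{salt} \mid D)}{p(1)\, p(\text{tbsp})\, p(\text{salt})} = \frac{0.5 \times 0.0001 \times 0.0001 \times 0.25}{0.0001 \times 0.125 \times 0.25} = 0.0004
\]
Result: Ingredients (since it has the highest score)
Note: 0.0001 is the probability assigned to an unknown word (it cannot be zero, because a zero would wipe out the whole product)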
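
The query calculation can be checked with a few lines of Python. This is a sketch under stated assumptions: the probability tables are copied by hand from the slides, and the names `score`, `classify` and `UNKNOWN` are hypothetical.

```python
from math import prod

UNKNOWN = 0.0001  # probability assigned to an unseen word, as in the slides

PRIORS = {"Ingredients": 0.5, "Instructions": 0.5}
LIKELIHOODS = {
    "Ingredients": {"2": 0.25, "tbsp": 0.25, "salt": 0.25, "lemon": 0.25},
    "Instructions": {"cut": 0.25, "lemon": 0.25, "pour": 0.25, "salt": 0.25},
}
EVIDENCE = {"2": 0.125, "tbsp": 0.125, "salt": 0.25,
            "lemon": 0.25, "cut": 0.125, "pour": 0.125}


def score(category, query):
    """Naive Bayes score of a category for a whitespace-separated query."""
    words = query.lower().split()
    numerator = PRIORS[category] * prod(
        LIKELIHOODS[category].get(word, UNKNOWN) for word in words)
    denominator = prod(EVIDENCE.get(word, UNKNOWN) for word in words)
    return numerator / denominator


def classify(query):
    """Return the category with the highest score."""
    return max(PRIORS, key=lambda category: score(category, query))


print(classify("1 tbsp salt"))  # Ingredients
```

Since the denominator is the same for every category, comparing the numerators alone would give the same winner.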

Examples
Classify email as spam or not spam
Document type classification
Document sections classification
Image Classification

K-Means
Unsupervised learning algorithm that finds clusters in unlabeled data
Algorithm
k-means
Choose K examples as initial centroids
While centroids move
1) For each example xi, choose the closest centroid Ki and store the distance ci
2) Recalculate each centroid Ki as the mean of the examples assigned to its cluster
end
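
The loop above could look as follows in plain Python. This is a minimal sketch, not a reference implementation: it assumes Euclidean distance, keeps the old centroid when a cluster ends up empty, and adds a `max_iter` safety limit that the slide does not mention. The sample `points` are made up for illustration.

```python
import math
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, rng=random, max_iter=100):
    """Basic k-means; returns (centroids, assignments, distances)."""
    centroids = rng.sample(points, k)  # choose K examples as initial centroids
    assignments, distances = [], []
    for _ in range(max_iter):  # assumed safety limit on the number of passes
        # Step 1: closest centroid for each example x_i, storing distance c_i.
        assignments, distances = [], []
        for x in points:
            j = min(range(k), key=lambda c: math.dist(x, centroids[c]))
            assignments.append(j)
            distances.append(math.dist(x, centroids[j]))
        # Step 2: new centroid = mean of the examples in each cluster.
        new_centroids = []
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, assignments, distances


# Made-up sample data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignments, distances = kmeans(points, 2, rng=random.Random(0))
print(assignments)
```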

K-Means
K-means example: steps converging to the final solution
Figure : Taken from https://en.wikipedia.org/wiki/File:K_Means_Example_Step_2.svg

K-Means
How to avoid sub-optimal results?
Figure : Generated from http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

K-Means
How to avoid sub-optimal results?
k-means
Repeat N times
Randomly choose K examples as initial centroids
While centroids move
1) For each example xi, choose the closest centroid Ki and store the distance ci
2) Recalculate each centroid Ki as the mean of the examples assigned to its cluster
end
Calculate the result cost (average distance of the examples to their centroids)
If the result cost is lower than the best so far, keep this result
end (repeat)
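
The restart strategy can be sketched as below. The helper names `run_kmeans` and `best_of`, the Euclidean distance, the `max_iter` limit and the sample points are all assumptions made for illustration.

```python
import math
import random


def run_kmeans(points, k, rng, max_iter=100):
    """One k-means run from random initial centroids; returns (centroids, cost)."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):  # assumed safety limit
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: math.dist(x, centroids[c]))
            clusters[j].append(x)
        # Mean of each cluster; an empty cluster keeps its old centroid.
        new = [tuple(sum(v) / len(m) for v in zip(*m)) if m else centroids[j]
               for j, m in enumerate(clusters)]
        if new == centroids:  # centroids stopped moving
            break
        centroids = new
    # Cost: average distance of the examples to their closest centroid.
    cost = sum(min(math.dist(x, c) for c in centroids) for x in points) / len(points)
    return centroids, cost


def best_of(points, k, n_runs, seed=0):
    """Repeat k-means n_runs times and keep the lowest-cost result."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_runs):
        result = run_kmeans(points, k, rng)
        if best is None or result[1] < best[1]:
            best = result
    return best


# Made-up sample data: three small groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (0, 10), (1, 10), (0, 11)]
best_centroids, best_cost = best_of(points, 3, n_runs=10)
print(best_cost)
```

A fixed `seed` makes the runs reproducible; in practice each restart only needs a different random initialization.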

K-Means
Elbow Method - How to identify the number of clusters?
Figure : K-means elbow method

K-Means
Figure : Solution for k=1

K-Means
Figure : Cluster costs

K-Means
Elbow method
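Repeat for K = 1, 2, 3, ... clusters, up to a chosen maximum
Run K-Means
Compute the average cost for K clusters, where ci is the distance of example i to its centroid and n is the number of examples:
\[
\frac{1}{n} \sum_{i=1}^{n} c_i \quad \left(\text{simplifying: } \sum_{i=1}^{n} c_i\right)
\]
end (repeat)
Plot the cost for each K and choose the one located at the "elbow"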
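
The slide picks the elbow by eye from a plot. As a rough programmatic stand-in, the sketch below computes the average cost from the stored distances and picks the K where the cost curve bends the most. This "largest bend" rule is an assumed heuristic, not the method from the slides, and the cost values are hypothetical numbers chosen for illustration.

```python
def average_cost(distances):
    """Average of the distances c_i of each example to its centroid."""
    return sum(distances) / len(distances)


def pick_elbow(costs):
    """Given {K: cost} for consecutive K, return the K with the sharpest bend.

    The bend at K is the drop in cost before K minus the drop after K.
    The first and last K have no neighbor on one side and are never returned.
    """
    ks = sorted(costs)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = (costs[prev_k] - costs[k]) - (costs[k] - costs[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical costs: a steep drop up to K = 3, then an almost flat curve.
costs = {1: 9.0, 2: 6.0, 3: 3.0, 4: 2.7, 5: 2.5}
print(pick_elbow(costs))  # 3
```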

K-Means
Figure : K-means elbow method
It is not always possible to find an elbow (for example, when the examples are evenly distributed)
Best practice: choose a number of clusters that has a business meaning

K-Means
Examples
Figure : Customer segmentation with k-means

K-Means
Examples
Figure : Identify related news and articles

K-Means
Examples
Figure : Image color reduction - http://opencv-python-tutroals.readthedocs.org/en/latest/_images/oc_color_quantization.jpg

References and Resources
1. Coursera Machine Learning
https://www.coursera.org/learn/machine-learning
2. Naive Bayes Classifier - Wikipedia
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
3. K-Means Clustering - Wikipedia
https://en.wikipedia.org/wiki/K-means_clustering
4. Visualizing K-Means Clustering
http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
5. Naive Bayes for Image Processing
http://www.cs.ubc.ca/~lowe/papers/12mccannCVPR.pdf
6. Document Clustering with K-Means
http://www.codeproject.com/Articles/439890/Text-Documents-Clustering-using-K-Means-Algorithm
