K-Means Clustering
- Nikita
Machine Learning
 Supervised Learning - Labeled output datasets are provided and are used to train the machine to produce the desired outputs.
 Supervised problems are categorized into "regression" and "classification" problems. In a regression problem, we try to predict results within a continuous output, meaning that we try to map input variables to some continuous function. In a classification problem, we instead try to predict results in a discrete output; in other words, we try to map input variables into discrete categories.
 Unsupervised Learning -
 Allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
 No labels are provided; instead, the data is clustered into different classes.
What is Clustering?
 Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait, often defined by a distance measure.
 Clustering is unsupervised classification: there are no predefined classes.
Partitional clustering
 Partitional algorithms determine all clusters at once. They include:
 K-means and derivatives
 Fuzzy c-means clustering
 QT clustering algorithm
What Is Good Clustering?
 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
 The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Common Distance Measures
 The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters. Common measures include the two below (a short example follows):
1. The Euclidean distance (also called 2-norm distance): $d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$
2. The Manhattan distance (also called taxicab norm or 1-norm): $d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$
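For concreteness (an illustrative addition, not from the original slides), both measures computed in NumPy:

```python
import numpy as np

x = np.array([5.0, 4.0])   # e.g., medicine D from the worked example below
y = np.array([1.0, 1.0])   # e.g., seed point c1

euclidean = np.sqrt(np.sum((x - y) ** 2))   # 2-norm: 5.0
manhattan = np.sum(np.abs(x - y))           # 1-norm: 7.0
```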
K-MEANS CLUSTERING
 The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
 K-means is the simplest partitioning method for cluster analysis and is widely used in data mining applications.
 Each cluster is represented by its centre, and the algorithm converges to stable cluster centroids.
 It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data.
K-means Algorithm
Given the cluster number K, the K-means algorithm is carried out in three steps after initialization (a minimal code sketch follows the list):
Initialisation: set seed points (randomly)
1) Assign each object to the cluster of the nearest seed point, measured with a specific distance metric
2) Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., mean point, of the cluster)
3) Go back to Step 1); stop when there are no new assignments (i.e., membership in each cluster no longer changes)
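A minimal NumPy sketch of these three steps (an illustrative addition, not code from the original deck):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means: X is an (n, d) array of objects, k the cluster count."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialisation: randomly pick k distinct objects as seed points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iters):
        # Step 1: assign each object to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when membership no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: recompute each centroid as the mean point of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```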
The K-Means Clustering Method
[Figure: a sequence of scatter plots on 0 to 10 axes illustrating K-means with K=2: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the cluster means again, repeating until assignments stabilize.]
An algorithm for partitioning (or clustering) N data points into K disjoint subsets $S_j$, each containing data points, so as to minimize the sum-of-squares criterion

$$J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert x_n - \mu_j \rVert^2$$

where $x_n$ is a vector representing the nth data point and $\mu_j$ is the geometric centroid of the data points in $S_j$.
Example - Problem
Suppose we have 4 types of medicines, each with two attributes (weight and pH index). Our goal is to group these objects into K = 2 groups of medicines.

Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4
Example
Step 1: Use initial seed points for partitioning
Set $c_1 = A = (1, 1)$ and $c_2 = B = (2, 1)$ as the initial seed points, and assign each object to the cluster with the nearest seed point. For example, for object D = (5, 4):

$d(D, c_1) = \sqrt{(5-1)^2 + (4-1)^2} = 5$
$d(D, c_2) = \sqrt{(5-2)^2 + (4-1)^2} \approx 4.24$

so D is assigned to cluster 2.
Example
Step 2: Compute new centroids of the current partition
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
With cluster 1 = {A} and cluster 2 = {B, C, D}, the new centroids are

$c_1 = (1, 1)$
$c_2 = \left( \frac{2+4+5}{3}, \frac{1+3+4}{3} \right) = \left( \frac{11}{3}, \frac{8}{3} \right)$
Example
Step 2: Renew membership based on the new centroids
Compute the distance of all objects to the new centroids and assign each object to the cluster of its nearest centroid.
Example
Step 3: Repeat the first two steps until convergence
Knowing the members of each
cluster, now we compute the new
centroid of each group based on
these new memberships.
With the renewed memberships cluster 1 = {A, B} and cluster 2 = {C, D}, the new centroids are

$c_1 = \left( \frac{1+2}{2}, \frac{1+1}{2} \right) = \left( 1\tfrac{1}{2}, 1 \right)$
$c_2 = \left( \frac{4+5}{2}, \frac{3+4}{2} \right) = \left( 4\tfrac{1}{2}, 3\tfrac{1}{2} \right)$
Example
Step 3: Repeat the first two steps until convergence
Compute the distance of all objects to the new centroids.
Stop: there are no new assignments, i.e., membership in each cluster no longer changes (a quick check in code follows).
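To tie the worked example together, a hedged check using the kmeans sketch from earlier (cluster numbering depends on the random seed, hence the "e.g." in the comments):

```python
import numpy as np

# The four medicines: (weight, pH-index).
X = np.array([[1.0, 1.0],   # A
              [2.0, 1.0],   # B
              [4.0, 3.0],   # C
              [5.0, 4.0]])  # D

centroids, labels = kmeans(X, k=2)
print(labels)     # e.g. [0 0 1 1]: clusters {A, B} and {C, D}
print(centroids)  # e.g. [[1.5, 1.0], [4.5, 3.5]]
```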
Relevant Issues
Efficient in computation - O(tKn), where n is the number of objects, K the number of clusters, and t the number of iterations. Normally, K, t << n.
Local optimum - sensitive to the initial seed points; the algorithm converges to a local optimum, which may be an unwanted solution.
Other problems
 Need to specify K, the number of clusters, in advance
 Unable to handle noisy data and outliers (addressed by the K-Medoids algorithm)
 Not suitable for discovering clusters with non-convex shapes
 Applicable only when the mean is defined; what about categorical data? (addressed by the K-Modes algorithm)
 How do we evaluate K-means performance?
K-means++ - An initialization procedure for K-means
 This approach acknowledges that there is probably a better choice of initial centroid locations than simple random assignment. Specifically, K-means tends to perform better when the centroids are seeded in a way that does not clump them together in space. (A code sketch follows the steps below.)
Step 1: Choose one of your data points at random as the initial centroid.
Step 2: Calculate D(x), the distance between each remaining data point x and the nearest already-chosen centroid.
Step 3: Choose your next centroid from the remaining data points with probability proportional to D(x)·D(x) (the probability of each point is based on its distance to the closest centroid).
Step 4: Repeat Steps 2 and 3 until all K centroids have been assigned.
Note: D(x) must be updated as more centroids are added; it should always be the distance between a data point and the nearest centroid.
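A hedged sketch of this seeding procedure (an illustrative addition; kmeans_pp_init is a hypothetical helper compatible with the kmeans sketch above):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: the first centroid is a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: D(x)^2 = squared distance to the nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid with probability proportional to D(x)^2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```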
How can we choose a "good" K for K-means clustering?
 Choose the number of clusters by visually inspecting your data points.
 Elbow Method:
 First, compute the sum of squared errors (SSE) for several values of k (for example 2, 4, 6, 8, etc.). The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid. Mathematically:

$$SSE = \sum_{j=1}^{K} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2$$

 If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because when the number of clusters increases, the clusters are smaller, so the distortion within them is also smaller. The idea of the elbow method is to choose the k at which the SSE decreases abruptly. (A code sketch follows.)
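As one way to compute and plot that curve (an illustrative sketch assuming scikit-learn and matplotlib are available; KMeans exposes the fitted SSE as its inertia_ attribute):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_values=range(2, 11)):
    # inertia_ is the SSE of the fitted clustering for each candidate k.
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in k_values]
    plt.plot(list(k_values), sse, marker="o")
    plt.xlabel("k")
    plt.ylabel("SSE")
    plt.show()
```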
This produces an "elbow effect" in the graph shown in the accompanying figure.
 In this case, k=6 is the value the elbow method selects. Take into account that the elbow method is a heuristic and, as such, may or may not work well in your particular case. Sometimes there is more than one elbow, or no elbow at all. In those situations you usually end up choosing the best k by evaluating how well K-means performs in the context of the particular clustering problem you are trying to solve.
How can we choose a "good" K for K-means clustering?
 Silhouette Coefficient -
 Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters, and thus provides a way to assess parameters like the number of clusters visually.
 The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined when 2 <= n_labels <= n_samples - 1. (A code sketch follows.)
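A hedged sketch of using the silhouette score to compare candidate values of k (assuming scikit-learn; silhouette_score returns the mean coefficient over all samples):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values=range(2, 11)):
    """Return the k whose K-means clustering has the highest mean silhouette."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```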
Feature Scaling – Pre-processing step to improve K-Means
 Feature scaling is a method used to standardize the range of independent variables, or features, of the data. In data processing it is also known as data normalization and is generally performed during the preprocessing step.
 Since the ranges of values in raw data vary widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by that particular feature. Therefore, the ranges of all features should be normalized so that each feature contributes approximately proportionately to the final distance. (A code sketch follows.)
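As a minimal sketch of the standardization variant (an illustrative addition; scikit-learn's StandardScaler does the same with more safeguards):

```python
import numpy as np

def standardize(X):
    """Z-score scaling: zero mean and unit variance per feature."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # guard: leave constant features unchanged
    return (X - X.mean(axis=0)) / std
```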
Curse of Dimensionality
 As the number of dimensions tends to infinity, pairwise distances concentrate: the maximum distance and the minimum distance between any two points of your dataset become nearly the same, so nearest-centroid assignments carry little information.
 Mitigation: reduce the number of features. (An empirical illustration follows.)
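An empirical demonstration of this concentration effect (an illustrative addition, using uniform random data):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((200, d))     # 200 uniform random points in d dimensions
    dists = pdist(X)             # all pairwise Euclidean distances
    # The min/max distance ratio creeps toward 1 as d grows.
    print(d, round(dists.min() / dists.max(), 3))
```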
Summary
 Its performance is determined by the initialization and an appropriate distance measure.
 There are several variants of K-means that overcome its weaknesses:
 K-Medoids: resistant to noise and/or outliers
 K-Modes: extension to categorical data clustering analysis
 CLARA: extension to deal with large data sets
Editor's Notes
  1. Clustering: the process of grouping a set of objects into classes of similar objects; the task is to create groups and assign each data point to a group.
  2. Partitioning clustering approach: a typical cluster-analysis approach that iteratively partitions the training data set to learn a partition of the given data space, producing several non-empty clusters (usually with the number of clusters given in advance). In principle, the optimal partition is achieved by minimising the sum of squared distances to the "representative object" of each cluster.
  3. Further distance measures: the maximum norm, D(x, y) = max |xi - yi| for 1 <= i <= p; the Mahalanobis distance, which corrects the data for different scales and correlations in the variables; the angle between two vectors (inner product space), which can be used as a distance measure when clustering high-dimensional data; and the Hamming distance (sometimes called edit distance), which measures the minimum number of substitutions required to change one member into another.
  4. K-means assumes points are near the centroid of their cluster. What if clusters are overlapping? It is hard to tell which cluster is right; maybe we should try to remain uncertain. What if a cluster has a non-circular shape? Gaussian mixture models address this: clusters are modeled as Gaussians (not only by a mean), and the EM algorithm assigns data to clusters with some probability, giving a probability model of x. The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models where the model depends on unobserved latent variables (probabilistic clustering: EM with Gaussian distributions).
  5. K-means is relatively efficient and fast: it computes its result in O(tkn), where n is the number of objects or points, k the number of clusters, and t the number of iterations. It has been used on acoustic data in speech understanding to convert waveforms into one of k categories (known as vector quantization), for image segmentation, for choosing color palettes on old-fashioned graphical display devices, and for image quantization. Its disadvantage is that it does not yield the same result on each run, since the resulting clusters depend on the initial random assignments: it is sensitive to the initial conditions, different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum.
  6. The simplest scaling method is rescaling the range of features to [0, 1] or [-1, 1]; selecting the target range depends on the nature of the data. This way we can handle various types of data, e.g., audio signals and pixel values for image data, and the data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean in the numerator) and unit variance. The reason is that normalization gives the same importance to all variables. The standard example is age (in years) and height (in cm): age may range over [18, 50] while height may range over [130, 180] (made-up numbers). With the classical Euclidean distance, height would have disproportionately more importance in the computation than age.
  7. Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and in the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.
  8. K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster. For categorical values, use the Hamming distance instead of the Euclidean distance, i.e., if two categorical values are the same the distance is 0, otherwise 1. K-medoids minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster; the center is always a member of the data set. K-medoids is also a partitioning technique that clusters a data set of n objects into k clusters with k known a priori (https://en.wikipedia.org/wiki/K-medoids). The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm. PAM uses a greedy search, which may not find the optimum solution but is faster than exhaustive search. It works as follows: initialize by randomly selecting (without replacement) k of the n data points as the medoids; associate each data point with the closest medoid; then, while the cost of the configuration decreases, for each medoid m and each non-medoid data point o, swap m and o and recompute the cost (the sum of distances of points to their medoid), undoing the swap if the total cost increased. Other algorithms than PAM have been suggested in the literature, including the Voronoi iteration method: select initial medoids; then, while the cost decreases, in each cluster make the point that minimizes the sum of within-cluster distances the medoid, and reassign each point to the cluster defined by the closest medoid. (A code sketch of PAM follows.)
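The PAM procedure in note 8 maps to a short greedy-swap sketch (an illustrative addition, not part of the original deck; the naive scan shown here evaluates O(k·n) candidate swaps per pass):

```python
import numpy as np

def pam(X, k, seed=0):
    """Minimal Partitioning Around Medoids via greedy swaps."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances

    def cost(meds):
        # Sum of distances from every point to its closest medoid.
        return dist[:, meds].min(axis=1).sum()

    # Initialize: randomly select (without replacement) k points as medoids.
    medoids = list(rng.choice(n, size=k, replace=False))
    best = cost(medoids)
    improved = True
    while improved:                       # repeat while the cost decreases
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]   # swap medoid i and o
                c = cost(candidate)
                if c < best:              # keep the swap only if it lowers the cost
                    medoids, best, improved = candidate, c, True
    return medoids, dist[:, medoids].argmin(axis=1)
```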