SlideShare a Scribd company logo
1 of 43
Download to read offline
Prof. Pier Luca Lanzi
ClusteringValidation
Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
Prof. Pier Luca Lanzi
Syllabus
• Chapter 17, Data Mining and Analysis: Fundamental Concepts
and Algorithms. Mohammed J. Zaki & Wagner Meira Jr
• Functions available in Scikit-learn
http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
• Functions available in R
https://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf
2
Prof. Pier Luca Lanzi
Cluster Validation and Assessment
Clustering Evaluation
assess the goodness or quality of the clustering
Clustering Stability
sensitivity of the clustering result to various algorithmic parameters
Clustering Tendency
suitability of applying clustering in the first place,
does the data have any inherent grouping structure?
3
Prof. Pier Luca Lanzi
Validity Measures
• External Validation Measures
§ Employ criteria that are not inherent to the dataset
§ E.g. prior or expert-specified knowledge about the clusters, for example,
class labels for each point.
• Internal Validation Measures
§ Employ criteria that are derived from the data itself
§ For instance, intracluster and intercluster distances to measure cluster
compactness (e.g., how similar are the points in the same cluster?) and
separation (e.g., how far apart are the points in different clusters?).
• Relative Validation Measures
§ Aim to directly compare different clusterings, usually those obtained via
different parameter settings for the same algorithm.
4
Prof. Pier Luca Lanzi
External Measures
(the correct or ground-truth clustering is known a priori)
5
Prof. Pier Luca Lanzi
Given a clustering partition C and
the ground truth partitioning T,
we redefine TP, TN, FP, FN
in the context of clustering
Prof. Pier Luca Lanzi
True Positives, True Negatives, False
Positives, and False Negatives
• True Positives
§xi and xj are a true positive pair if they belong to the same
partition in T, and they are also in the same cluster in C
§TP is defined as the number of true positive pairs
• False Negatives
§xi and xj are a false negative pair if they belong to the same
partition in T, but they do not belong to the same cluster in C.
§FN is defined as the number of true positive pairs
7
Prof. Pier Luca Lanzi
True Positives, True Negatives, False
Positives, and False Negatives
• False Positives
§xi and xj are a false positive pair if the do not belong to the
same partition in T, but belong to the same cluster in C
§FP is the number of false positive pairs
• True Negatives
§xi and xj are a false negative pair if they do not belong to the
same partition in T, nor to the same cluster in C
§TN is the number of true negative pairs
8
Prof. Pier Luca Lanzi
Given the number of pairs N
N=TP+FP+FN+TN
9
Prof. Pier Luca Lanzi
Jaccard Coefficient
• Measures the fraction of true positive point pairs, but after
ignoring the true negatives as,
• For a perfect clustering C, the coefficient is one, that is, there are
no false positives nor false negatives.
• Note that the Jaccard coefficient is asymmetric in that it ignores
the true negatives
10
Prof. Pier Luca Lanzi
Rand Statistic
• Measures the fraction of true positives and true negatives over all
pairs as
• The Rand statistic measures the fraction of point pairs where
both the clustering C and the ground truth T agree
• A perfect clustering has a value of 1 for the statistic.
• The adjusted rand index is the extension of the rand statistic
corrected for chance.
11
Prof. Pier Luca Lanzi
Fowlkes-Mallows Measure
• Define precision and recall analogously to what done for
classification,
• The Fowlkes–Mallows (FM) measure is defined as the geometric
mean of the pairwise precision and recall
• FM is also asymmetric in terms of the true positives and negatives
because it ignores the true negatives. Its highest value is also 1,
achieved when there are no false positives or negatives.
12
Prof. Pier Luca Lanzi
Mutual Information Based Scores
13
Prof. Pier Luca Lanzi
Mutual Information
• Mutual information tries to quantify the amount of shared
information between the clustering C and ground truth
partitioning T,
• Where
§pij is the probability that a point in cluster i also belongs to
partition j
§pci is the probability of cluster Ci
§ptj is the probability of cluster Tj
14
Prof. Pier Luca Lanzi
Normalized Mutual Information
• The normalized mutual information (NMI) is defined as
• Where,
• Values close to zero indicate two label assignments that are
largely independent, while values close to one indicate significant
agreement.
15
Prof. Pier Luca Lanzi
Homogeneity, Completeness and V-measure
16
Prof. Pier Luca Lanzi
Homogeneity, Completeness,
and V-measure
• Homogeneity
§Each cluster contains only members of a single class.
• Completeness
§All members of a given class are assigned to the same cluster
• V-measure
§Harmonic mean of homogeneity and completeness
• The three measures are bounded between 0 and 1
• The higher the value the better
17
Prof. Pier Luca Lanzi
Internal Validation Measures
(criteria that are derived from the data itself)
18
Prof. Pier Luca Lanzi
Internal Validation Measures
• Based on the notions of intracluster similarity or compactness
contrasted with the notions of intercluster separation
• They typically propose a trade-off to maximizing these two
competing measures
• They are computed from the distance (or proximity) matrix
• The internal measures are based on various functions over the
intracluster and intercluster weights.
19
Prof. Pier Luca Lanzi
Some Important Statistics
• Sum over all the intracluster weights over all the clusters
• Sum of all intercluster weights
• Number of distinct intracluster edges Nin and intercluster edges,
Nout
20
Prof. Pier Luca Lanzi
BetaCV
• BetaCV is computed as the ratio of the mean intracluster distance
to the mean intercluster distance
• The smaller the BetaCV ratio, the better the clustering, as it
indicates that intracluster distances are on average smaller than
intercluster distances
21
Prof. Pier Luca Lanzi
C-Index
• Let Wmin(Nin) be the sum of the smallest Nin distances in the
proximity matrix W, where Nin is the total number of intracluster
edges, or point pairs
• Let Wmax(Nin) be the sum of the largest Nin distances in W
• The C-index measures to what extent the clustering puts
together the Nin points that are the closest across the k clusters.
• It is defined as,
• The smaller the C-index, the better the clustering, as it indicates
more compact clusters with relatively smaller distances within
clusters rather than between clusters.
22
Prof. Pier Luca Lanzi
• Defined as the ratio between the minimum distance between point pairs from
different clusters and the maximum distance between point pairs from the
same cluster
• Where, the minimum intercluster distance is computed as,
• And the maximum intracluster distance is computed as,
• The larger the Dunn index the better the clustering because it means even the
closest distance between points in different clusters is much larger than the
farthest distance between points in the same cluster. However, the Dunn
index may be insensitive because the minimum intercluster and maximum
intracluster distances do not capture all the information about a clustering.
Dunn Index 23
Prof. Pier Luca Lanzi
Davies–Bouldin Index
• Let μi denote the cluster mean and σμi denote the dispersion or
spread of the points around the cluster mean,
where var(Ci) is the total variance of cluster Ci
• The Davies–Bouldin measure for a pair of clusters Ci and Cj is
defined as the ratio
• DBij measures how compact the clusters are compared to the
distance between the cluster means.
24
Prof. Pier Luca Lanzi
Davies–Bouldin Index
• The Davies–Bouldin index is then defined as
• For each cluster Ci, we pick the cluster Cj that yields the largest
DBij ratio.
• The smaller the DB value the better the clustering, as it means
that the clusters are well separated (i.e., the distance between
cluster means is large), and each cluster is well represented by its
mean (i.e., has a small spread).
25
Prof. Pier Luca Lanzi
Silhouette Coefficient
• Measure of both cohesion and separation of clusters, and is based
on the difference between the average distance to points in the
closest cluster and to points in the same cluster.
• For each point xi we calculate its silhouette coefficient si as
• Where μin(xi) is the mean distance from xi to points in its own
cluster yi
26
Prof. Pier Luca Lanzi
Silhouette Coefficient
• And the mean of the distances from xi to points in the closest cluster is
computed as,
• The si value of a point lies in the interval [−1,+1].
§ A value close to +1 indicates that xi is much closer to points in its own
cluster and is far from other clusters.
§ A value close to zero indicates that xi is close to the boundary between
two clusters.
§ A value close to −1 indicates that xi is much closer to another cluster than
its own cluster, and therefore, the point may be mis-clustered.
27
Prof. Pier Luca Lanzi
Silhouette Coefficient
• The silhouette coefficient is defined as the mean si value across all
the points
• A value close to +1 indicates a good clustering.
• Drawbacks
§The Silhouette Coefficient is generally higher for convex
clusters than other concepts of clusters, such as density based
clusters like those obtained through DBSCAN
28
Prof. Pier Luca Lanzi
Calinski-Harabaz Index¶
• Given k clusters, the Calinski-Harabaz score s is given by the ratio
of the between-cluster dispersion mean and the within-cluster
dispersion,
• That is,
• The score is higher when clusters are dense and well separated,
which relates to a standard concept of a cluster
• The index is generally higher for convex clusters than other
concepts of clusters, such as density based clusters like those
obtained through DBSCAN.
29
Prof. Pier Luca Lanzi
Relative Measures
(compare different clusterings obtained by varying different
parameters for the same algorithm, e.g., the number of clusters k)
30
Prof. Pier Luca Lanzi
Within/Between Clusters Sum of
Squares
• Within-cluster sum of squares
where μi is the centroid of cluster Ci (in case of Euclidean spaces)
• Between-cluster sum of squares
where μ is the centroid of the whole dataset
31
Prof. Pier Luca Lanzi
Knee/Elbow Analysis of Clustering 32
Prof. Pier Luca Lanzi
Calinski-Harabaz Index¶
• We can use the Calinski-Harabaz index to select k
• In a good clustering, we expect the within-cluster scatter to be
smaller relative to the between-cluster scatter, which should
result in a higher value of the index
• Thus, we can either select the k corresponding to the higher
index value or we can perform a knee analysis and look for a
significant increase followed by much smaller differences
• For instance, we can choose the value k > 3 that minimizes,
33
Prof. Pier Luca Lanzi
Silhouette Coefficient
• We can use the silhouette coefficient sj of each point xj and the
average SC value to estimate the number of clusters in the data
• For each cluster, plot the sj values in descending order
• Check the overall SC value for a particular value of k, as well as
SCi values for each cluster i
• Pick the value of k that yields the best clustering, with many
points having high sj values within each cluster, as well as high
values for SC and SCi (1 ≤ i ≤ k).
34
Prof. Pier Luca Lanzi
Silhouette coefficients for the Iris dataset computed using a k-means algorithm with k=2
Prof. Pier Luca Lanzi
Silhouette coefficients for the Iris dataset computed using a k-means algorithm with k=2
Prof. Pier Luca Lanzi
Silhouette coefficients for the Iris dataset computed using a k-means algorithm with k=2
Prof. Pier Luca Lanzi
Cluster Stability
Prof. Pier Luca Lanzi
the clusterings obtained from several
datasets sampled from the same
distribution should be similar or “stable.”
Prof. Pier Luca Lanzi
Algorithm to choose k as the number of clusters that exhibits the least deviation between the
clusterings. From Zaki’s textbook © Cambridge University Press 2014
Prof. Pier Luca Lanzi
Clustering Tendency
Prof. Pier Luca Lanzi
Clustering Tendency
• Aims to determine whether the dataset has any meaningful
groups to begin with
• Difficult task typically tackled by comparing the data distribution
with samples randomly generated from the same data space
• Existing approaches include,
§Spatial Histogram
§Distance Distribution
§Hopkins Statistic
§…
42
Prof. Pier Luca Lanzi
Run the Python notebook
for this lecture
43

More Related Content

What's hot

Fuzzy c-means clustering for image segmentation
Fuzzy c-means  clustering for image segmentationFuzzy c-means  clustering for image segmentation
Fuzzy c-means clustering for image segmentationDharmesh Patel
 
Artificial Neural Networks - ANN
Artificial Neural Networks - ANNArtificial Neural Networks - ANN
Artificial Neural Networks - ANNMohamed Talaat
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
 
Multilayer perceptron
Multilayer perceptronMultilayer perceptron
Multilayer perceptronomaraldabash
 
Back propagation
Back propagationBack propagation
Back propagationNagarajan
 
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Praxitelis Nikolaos Kouroupetroglou
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reductionmrizwan969
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clusteringArshad Farhad
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means ClusteringAnna Fensel
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
 
Inductive bias
Inductive biasInductive bias
Inductive biasswapnac12
 
Activation function
Activation functionActivation function
Activation functionAstha Jain
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learningHaris Jamil
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine LearningKnoldus Inc.
 

What's hot (20)

Fuzzy c-means clustering for image segmentation
Fuzzy c-means  clustering for image segmentationFuzzy c-means  clustering for image segmentation
Fuzzy c-means clustering for image segmentation
 
Artificial Neural Networks - ANN
Artificial Neural Networks - ANNArtificial Neural Networks - ANN
Artificial Neural Networks - ANN
 
Machine learning clustering
Machine learning clusteringMachine learning clustering
Machine learning clustering
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Multilayer perceptron
Multilayer perceptronMultilayer perceptron
Multilayer perceptron
 
Back propagation
Back propagationBack propagation
Back propagation
 
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
Presentation - Msc Thesis - Machine Learning Techniques for Short-Term Electr...
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in R
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Inductive bias
Inductive biasInductive bias
Inductive bias
 
Activation function
Activation functionActivation function
Activation function
 
Fuzzy c means manual work
Fuzzy c means manual workFuzzy c means manual work
Fuzzy c means manual work
 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
Clustering
ClusteringClustering
Clustering
 
Feature Engineering in Machine Learning
Feature Engineering in Machine LearningFeature Engineering in Machine Learning
Feature Engineering in Machine Learning
 

Similar to DMTM Lecture 15 Clustering evaluation

DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringPier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringPier Luca Lanzi
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringPier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesPier Luca Lanzi
 
Statistics-3 : Statistical Inference - Core
Statistics-3 : Statistical Inference - CoreStatistics-3 : Statistical Inference - Core
Statistics-3 : Statistical Inference - CoreGiridhar Chandrasekaran
 
Information retrieval 20 divergence from randomness
Information retrieval 20 divergence from randomnessInformation retrieval 20 divergence from randomness
Information retrieval 20 divergence from randomnessVaibhav Khanna
 
Measure of Variability Report.pptx
Measure of Variability Report.pptxMeasure of Variability Report.pptx
Measure of Variability Report.pptxCalvinAdorDionisio
 
A new development in the hierarchical clustering of repertory grid data
A new development in the hierarchical clustering of repertory grid dataA new development in the hierarchical clustering of repertory grid data
A new development in the hierarchical clustering of repertory grid dataMark Heckmann
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsAnand Thokal
 
QCA and Fuzzy Sets for Impact Assessment in International Development and Epi...
QCA and Fuzzy Sets for Impact Assessment in International Development and Epi...QCA and Fuzzy Sets for Impact Assessment in International Development and Epi...
QCA and Fuzzy Sets for Impact Assessment in International Development and Epi...Wendy Olsen
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringPier Luca Lanzi
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsPier Luca Lanzi
 
Discrete and continuous probability models
Discrete and continuous probability modelsDiscrete and continuous probability models
Discrete and continuous probability modelsAkshay Kumar Mishra
 
3centrality-1235089982174yuuhhh803-1.ppt
3centrality-1235089982174yuuhhh803-1.ppt3centrality-1235089982174yuuhhh803-1.ppt
3centrality-1235089982174yuuhhh803-1.pptTariqqandeel
 
Chap 3 - PrinciplesofInference-part1.pptx
Chap 3 - PrinciplesofInference-part1.pptxChap 3 - PrinciplesofInference-part1.pptx
Chap 3 - PrinciplesofInference-part1.pptxarifmachinelearning
 
Topic 4 Measures of Dispersion & Numericals.pptx
Topic 4  Measures of Dispersion & Numericals.pptxTopic 4  Measures of Dispersion & Numericals.pptx
Topic 4 Measures of Dispersion & Numericals.pptxCallplanetsDeveloper
 
Cluster randomization trial presentation
Cluster randomization trial presentationCluster randomization trial presentation
Cluster randomization trial presentationRanadip Chowdhury
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptxJK970901
 
Chapter 2 The Science of Psychological Measurement (Alivio, Ansula).pptx
Chapter 2 The Science of Psychological Measurement (Alivio, Ansula).pptxChapter 2 The Science of Psychological Measurement (Alivio, Ansula).pptx
Chapter 2 The Science of Psychological Measurement (Alivio, Ansula).pptxHazelLansula1
 

Similar to DMTM Lecture 15 Clustering evaluation (20)

DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
Statistics-3 : Statistical Inference - Core
Statistics-3 : Statistical Inference - CoreStatistics-3 : Statistical Inference - Core
Statistics-3 : Statistical Inference - Core
 
Information retrieval 20 divergence from randomness
Information retrieval 20 divergence from randomnessInformation retrieval 20 divergence from randomness
Information retrieval 20 divergence from randomness
 
Measure of Variability Report.pptx
Measure of Variability Report.pptxMeasure of Variability Report.pptx
Measure of Variability Report.pptx
 
A new development in the hierarchical clustering of repertory grid data
A new development in the hierarchical clustering of repertory grid dataA new development in the hierarchical clustering of repertory grid data
A new development in the hierarchical clustering of repertory grid data
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
QCA and Fuzzy Sets for Impact Assessment in International Development and Epi...
QCA and Fuzzy Sets for Impact Assessment in International Development and Epi...QCA and Fuzzy Sets for Impact Assessment in International Development and Epi...
QCA and Fuzzy Sets for Impact Assessment in International Development and Epi...
 
DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
Discrete and continuous probability models
Discrete and continuous probability modelsDiscrete and continuous probability models
Discrete and continuous probability models
 
3centrality-1235089982174yuuhhh803-1.ppt
3centrality-1235089982174yuuhhh803-1.ppt3centrality-1235089982174yuuhhh803-1.ppt
3centrality-1235089982174yuuhhh803-1.ppt
 
Chap 3 - PrinciplesofInference-part1.pptx
Chap 3 - PrinciplesofInference-part1.pptxChap 3 - PrinciplesofInference-part1.pptx
Chap 3 - PrinciplesofInference-part1.pptx
 
Topic 4 Measures of Dispersion & Numericals.pptx
Topic 4  Measures of Dispersion & Numericals.pptxTopic 4  Measures of Dispersion & Numericals.pptx
Topic 4 Measures of Dispersion & Numericals.pptx
 
Cluster randomization trial presentation
Cluster randomization trial presentationCluster randomization trial presentation
Cluster randomization trial presentation
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
 
unit 4 nearest neighbor.ppt
unit 4 nearest neighbor.pptunit 4 nearest neighbor.ppt
unit 4 nearest neighbor.ppt
 
Chapter 2 The Science of Psychological Measurement (Alivio, Ansula).pptx
Chapter 2 The Science of Psychological Measurement (Alivio, Ansula).pptxChapter 2 The Science of Psychological Measurement (Alivio, Ansula).pptx
Chapter 2 The Science of Psychological Measurement (Alivio, Ansula).pptx
 

More from Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i VideogiochiPier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiPier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomePier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaPier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Pier Luca Lanzi
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationPier Luca Lanzi
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationPier Luca Lanzi
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningPier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningPier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesPier Luca Lanzi
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringPier Luca Lanzi
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringPier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesPier Luca Lanzi
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesPier Luca Lanzi
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationPier Luca Lanzi
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationPier Luca Lanzi
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationPier Luca Lanzi
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 RegressionPier Luca Lanzi
 

More from Pier Luca Lanzi (20)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
 

Recently uploaded

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 

Recently uploaded (20)

Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 

DMTM Lecture 15 Clustering evaluation

  • 1. Prof. Pier Luca Lanzi ClusteringValidation Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
  • 2. Prof. Pier Luca Lanzi Syllabus • Chapter 17, Data Mining and Analysis: Fundamental Concepts and Algorithms. Mohammed J. Zaki & Wagner Meira Jr • Functions available in Scikit-learn http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation • Functions available in R https://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf 2
  • 3. Prof. Pier Luca Lanzi Cluster Validation and Assessment Clustering Evaluation assess the goodness or quality of the clustering Clustering Stability sensitivity of the clustering result to various algorithmic parameters Clustering Tendency suitability of applying clustering in the first place, does the data have any inherent grouping structure? 3
  • 4. Prof. Pier Luca Lanzi Validity Measures • External Validation Measures § Employ criteria that are not inherent to the dataset § E.g. prior or expert-specified knowledge about the clusters, for example, class labels for each point. • Internal Validation Measures § Employ criteria that are derived from the data itself § For instance, intracluster and intercluster distances to measure cluster compactness (e.g., how similar are the points in the same cluster?) and separation (e.g., how far apart are the points in different clusters?). • Relative Validation Measures § Aim to directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm. 4
  • 5. Prof. Pier Luca Lanzi External Measures (the correct or ground-truth clustering is known a priori) 5
  • 6. Prof. Pier Luca Lanzi Given a clustering partition C and the ground truth partitioning T, we redefine TP, TN, FP, FN in the context of clustering
  • 7. Prof. Pier Luca Lanzi True Positives, True Negatives, False Positives, and False Negatives • True Positives §xi and xj are a true positive pair if they belong to the same partition in T, and they are also in the same cluster in C §TP is defined as the number of true positive pairs • False Negatives §xi and xj are a false negative pair if they belong to the same partition in T, but they do not belong to the same cluster in C. §FN is defined as the number of true positive pairs 7
  • 8. Prof. Pier Luca Lanzi True Positives, True Negatives, False Positives, and False Negatives • False Positives §xi and xj are a false positive pair if the do not belong to the same partition in T, but belong to the same cluster in C §FP is the number of false positive pairs • True Negatives §xi and xj are a false negative pair if they do not belong to the same partition in T, nor to the same cluster in C §TN is the number of true negative pairs 8
  • 9. Prof. Pier Luca Lanzi Given the number of pairs N N=TP+FP+FN+TN 9
  • 10. Prof. Pier Luca Lanzi Jaccard Coefficient • Measures the fraction of true positive point pairs, but after ignoring the true negatives as, • For a perfect clustering C, the coefficient is one, that is, there are no false positives nor false negatives. • Note that the Jaccard coefficient is asymmetric in that it ignores the true negatives 10
  • 11. Prof. Pier Luca Lanzi Rand Statistic • Measures the fraction of true positives and true negatives over all pairs as • The Rand statistic measures the fraction of point pairs where both the clustering C and the ground truth T agree • A perfect clustering has a value of 1 for the statistic. • The adjusted rand index is the extension of the rand statistic corrected for chance. 11
  • 12. Prof. Pier Luca Lanzi Fowlkes-Mallows Measure • Define precision and recall analogously to what done for classification, • The Fowlkes–Mallows (FM) measure is defined as the geometric mean of the pairwise precision and recall • FM is also asymmetric in terms of the true positives and negatives because it ignores the true negatives. Its highest value is also 1, achieved when there are no false positives or negatives. 12
  • 13. Prof. Pier Luca Lanzi Mutual Information Based Scores 13
  • 14. Prof. Pier Luca Lanzi Mutual Information • Mutual information tries to quantify the amount of shared information between the clustering C and ground truth partitioning T, • Where §pij is the probability that a point in cluster i also belongs to partition j §pci is the probability of cluster Ci §ptj is the probability of cluster Tj 14
  • 15. Prof. Pier Luca Lanzi Normalized Mutual Information • The normalized mutual information (NMI) is defined as • Where, • Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. 15
  • 16. Prof. Pier Luca Lanzi Homogeneity, Completeness and V-measure 16
  • 17. Prof. Pier Luca Lanzi Homogeneity, Completeness, and V-measure • Homogeneity §Each cluster contains only members of a single class. • Completeness §All members of a given class are assigned to the same cluster • V-measure §Harmonic mean of homogeneity and completeness • The three measures are bounded between 0 and 1 • The higher the value the better 17
  • 18. Prof. Pier Luca Lanzi Internal Validation Measures (criteria that are derived from the data itself) 18
  • 19. Prof. Pier Luca Lanzi Internal Validation Measures • Based on the notions of intracluster similarity or compactness contrasted with the notions of intercluster separation • They typically propose a trade-off to maximizing these two competing measures • They are computed from the distance (or proximity) matrix • The internal measures are based on various functions over the intracluster and intercluster weights. 19
  • 20. Prof. Pier Luca Lanzi Some Important Statistics • Sum over all the intracluster weights over all the clusters • Sum of all intercluster weights • Number of distinct intracluster edges Nin and intercluster edges, Nout 20
  • 21. Prof. Pier Luca Lanzi BetaCV • BetaCV is computed as the ratio of the mean intracluster distance to the mean intercluster distance • The smaller the BetaCV ratio, the better the clustering, as it indicates that intracluster distances are on average smaller than intercluster distances 21
  • 22. Prof. Pier Luca Lanzi C-Index • Let Wmin(Nin) be the sum of the smallest Nin distances in the proximity matrix W, where Nin is the total number of intracluster edges, or point pairs • Let Wmax(Nin) be the sum of the largest Nin distances in W • The C-index measures to what extent the clustering puts together the Nin points that are the closest across the k clusters. • It is defined as, • The smaller the C-index, the better the clustering, as it indicates more compact clusters with relatively smaller distances within clusters rather than between clusters. 22
  • 23. Prof. Pier Luca Lanzi • Defined as the ratio between the minimum distance between point pairs from different clusters and the maximum distance between point pairs from the same cluster • Where, the minimum intercluster distance is computed as, • And the maximum intracluster distance is computed as, • The larger the Dunn index the better the clustering because it means even the closest distance between points in different clusters is much larger than the farthest distance between points in the same cluster. However, the Dunn index may be insensitive because the minimum intercluster and maximum intracluster distances do not capture all the information about a clustering. Dunn Index 23
  • 24. Prof. Pier Luca Lanzi Davies–Bouldin Index • Let μi denote the cluster mean and σμi denote the dispersion or spread of the points around the cluster mean, where var(Ci) is the total variance of cluster Ci • The Davies–Bouldin measure for a pair of clusters Ci and Cj is defined as the ratio • DBij measures how compact the clusters are compared to the distance between the cluster means. 24
  • 25. Prof. Pier Luca Lanzi Davies–Bouldin Index • The Davies–Bouldin index is then defined as • For each cluster Ci, we pick the cluster Cj that yields the largest DBij ratio. • The smaller the DB value the better the clustering, as it means that the clusters are well separated (i.e., the distance between cluster means is large), and each cluster is well represented by its mean (i.e., has a small spread). 25
  • 26. Prof. Pier Luca Lanzi Silhouette Coefficient • Measure of both cohesion and separation of clusters, and is based on the difference between the average distance to points in the closest cluster and to points in the same cluster. • For each point xi we calculate its silhouette coefficient si as • Where μin(xi) is the mean distance from xi to points in its own cluster yi 26
  • 27. Prof. Pier Luca Lanzi Silhouette Coefficient • And the mean of the distances from xi to points in the closest cluster is computed as, • The si value of a point lies in the interval [−1,+1]. § A value close to +1 indicates that xi is much closer to points in its own cluster and is far from other clusters. § A value close to zero indicates that xi is close to the boundary between two clusters. § A value close to −1 indicates that xi is much closer to another cluster than its own cluster, and therefore, the point may be mis-clustered. 27
  • 28. Prof. Pier Luca Lanzi Silhouette Coefficient • The silhouette coefficient is defined as the mean si value across all the points • A value close to +1 indicates a good clustering. • Drawbacks §The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN 28
  • 29. Prof. Pier Luca Lanzi Calinski-Harabaz Index¶ • Given k clusters, the Calinski-Harabaz score s is given by the ratio of the between-cluster dispersion mean and the within-cluster dispersion, • That is, • The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster • The index is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN. 29
  • 30. Prof. Pier Luca Lanzi Relative Measures (compare different clusterings obtained by varying different parameters for the same algorithm, e.g., the number of clusters k) 30
  • 31. Prof. Pier Luca Lanzi Within/Between Clusters Sum of Squares • Within-cluster sum of squares where μi is the centroid of cluster Ci (in case of Euclidean spaces) • Between-cluster sum of squares where μ is the centroid of the whole dataset 31
  • 32. Prof. Pier Luca Lanzi Knee/Elbow Analysis of Clustering 32
  • 33. Prof. Pier Luca Lanzi Calinski-Harabaz Index¶ • We can use the Calinski-Harabaz index to select k • In a good clustering, we expect the within-cluster scatter to be smaller relative to the between-cluster scatter, which should result in a higher value of the index • Thus, we can either select the k corresponding to the higher index value or we can perform a knee analysis and look for a significant increase followed by much smaller differences • For instance, we can choose the value k > 3 that minimizes, 33
  • 34. Prof. Pier Luca Lanzi Silhouette Coefficient • We can use the silhouette coefficient sj of each point xj and the average SC value to estimate the number of clusters in the data • For each cluster, plot the sj values in descending order • Check the overall SC value for a particular value of k, as well as SCi values for each cluster i • Pick the value of k that yields the best clustering, with many points having high sj values within each cluster, as well as high values for SC and SCi (1 ≤ i ≤ k). 34
  • 35. Prof. Pier Luca Lanzi Silhouette coefficients for the Iris dataset computed using a k-means algorithm with k=2
  • 36. Prof. Pier Luca Lanzi Silhouette coefficients for the Iris dataset computed using a k-means algorithm with k=2
  • 37. Prof. Pier Luca Lanzi Silhouette coefficients for the Iris dataset computed using a k-means algorithm with k=2
  • 38. Prof. Pier Luca Lanzi Cluster Stability
  • 39. Prof. Pier Luca Lanzi the clusterings obtained from several datasets sampled from the same distribution should be similar or “stable.”
  • 40. Prof. Pier Luca Lanzi Algorithm to choose k as the number of clusters that exhibits the least deviation between the clusterings. From Zaki’s textbook © Cambridge University Press 2014
  • 41. Prof. Pier Luca Lanzi Clustering Tendency
  • 42. Prof. Pier Luca Lanzi Clustering Tendency • Aims to determine whether the dataset has any meaningful groups to begin with • Difficult task typically tackled by comparing the data distribution with samples randomly generated from the same data space • Existing approaches include, §Spatial Histogram §Distance Distribution §Hopkins Statistic §… 42
  • 43. Prof. Pier Luca Lanzi Run the Python notebook for this lecture 43