Prof. Pier Luca Lanzi
Hierarchical Clustering
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
What is Hierarchical Clustering?
• Suppose we have five items, a, b, c, d, and e.
• Initially, we consider one cluster for each item
• Then, at each step we merge together the most similar clusters,
until we generate one cluster
[Figure: agglomerative merging of the five items. Step 0: a, b, c, d, e; Step 1: a,b; Step 2: d,e; Step 3: c,d,e; Step 4: a,b,c,d,e]
What is Hierarchical Clustering?
• Alternatively, we start from one cluster containing the five
elements
• Then, at each step we split one cluster to improve intracluster
similarity, until each element is contained in its own cluster
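[Figure: the same hierarchy built top-down by successive splits, from a,b,c,d,e at Step 0, through a,b and c,d,e, then d,e, down to the single items a, b, c, d, e at Step 4]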
What is Hierarchical Clustering?
• By far, it is the most common clustering technique
• Produces a hierarchy of nested clusters
• The hierarchy can be visualized as a dendrogram: a tree-like diagram
that records the sequence of merges or splits
[Figure: dendrogram over a, b, c, d, e with internal nodes a,b; d,e; c,d,e; and root a,b,c,d,e]
What Approaches?
• Agglomerative
§ Start with each point as its own cluster; at each step, merge the closest pair of clusters
until only one cluster (or k clusters) is left
• Divisive
§ Start with one cluster containing all points; at each step, split a cluster until each cluster
contains a single point (or there are k clusters)
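[Figure: the same dendrogram over a, b, c, d, e, built from the leaves to the root by the agglomerative approach and from the root to the leaves by the divisive approach]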
Strengths of Hierarchical Clustering
• No need to assume any particular number of clusters
• Any desired number of clusters can be obtained by ‘cutting’ the
dendrogram at the proper level
• The resulting clusters may correspond to meaningful taxonomies
• Examples in the biological sciences include the animal kingdom, phylogeny
reconstruction, etc.
• Traditional hierarchical algorithms use a similarity
or distance matrix to merge or split one cluster at a time
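A minimal sketch of what "cutting" the dendrogram means in practice: the recorded merges are replayed until only k clusters remain. The function name `cut_dendrogram` and the merge-list representation are assumptions made for illustration; the merge order is the one of the five-item example in these slides.

```python
def cut_dendrogram(items, merges, k):
    """Replay the first len(items) - k merges and return the k clusters.

    `merges` lists, in order, the pairs of clusters that were merged;
    each cluster is written as a frozenset of item labels.
    """
    if not 1 <= k <= len(items):
        raise ValueError("k must be between 1 and the number of items")
    clusters = [frozenset([item]) for item in items]
    for left, right in merges[: len(items) - k]:
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(left | right)
    # Sort members and clusters so the output is easy to read and compare.
    return sorted(sorted(c) for c in clusters)


items = ["a", "b", "c", "d", "e"]
f = frozenset
# Merge order of the five-item example: a,b then d,e then c,d,e then all.
merges = [
    (f("a"), f("b")),
    (f("d"), f("e")),
    (f("c"), f("de")),
    (f("ab"), f("cde")),
]

for k in range(1, 6):
    print(k, cut_dendrogram(items, merges, k))
```

One hierarchy therefore yields a partition for every k from 1 to N without running the clustering again.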
Agglomerative Clustering Algorithm
• The more popular hierarchical clustering technique
• Compute the proximity matrix
• Let each data point be a cluster
• Repeat
§ Merge the two closest clusters
§ Update the proximity matrix
• Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
• Different approaches to defining the distance between clusters
distinguish the different algorithms
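The loop above can be sketched in a few lines of Python. This is a deliberately naive version under stated assumptions: points are coordinate tuples, the function names are illustrative, and cluster distances are recomputed from scratch at every step instead of maintaining a proximity matrix. The `cluster_distance` argument is where the different algorithms plug in; single linkage is used as the default only as an example.

```python
from itertools import combinations
from math import dist


def single_link(c1, c2):
    """Smallest point-to-point distance between two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)


def agglomerative(points, k=1, cluster_distance=single_link):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # each point starts as its own cluster
    while len(clusters) > k:
        # Find the closest pair of clusters by checking every pair.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerative(points, k=3))
```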
Hierarchical Clustering:
Time and Space Requirements
• O(N^2) space, since it uses the proximity matrix
§ N is the number of points
• O(N^3) time in many cases
§ There are N steps and, at each step, the proximity matrix of size N^2
must be updated and searched
§ Complexity can be reduced to O(N^2 log N)
time for some approaches
Efficient Implementation
• Compute the distance between all pairs of points [O(N^2)]
• Insert the pairs and their distances into a priority queue, so that the minimum can be found in one
step [O(N^2)]
• When two clusters are merged, remove all entries in the priority queue
involving one of these two clusters [O(N log N)]
• Compute all the distances between the new cluster and the remaining
clusters, and insert them into the queue [O(N log N)]
• Since the last two steps are executed at most N times, the complexity of the
whole algorithm is O(N^2 log N)
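A sketch of the priority-queue idea using Python's `heapq`, under a few assumptions that are not in the slides: single linkage is used for the cluster distance, the function name `heap_single_link` is illustrative, and outdated queue entries are skipped when popped ("lazy deletion") rather than removed eagerly as the steps above describe.

```python
import heapq
from itertools import combinations
from math import dist


def heap_single_link(points, k=1):
    """Single-linkage clustering with a priority queue and lazy deletion."""
    clusters = {i: [p] for i, p in enumerate(points)}  # id -> member points
    # Distances between all pairs of (initially singleton) clusters.
    d = {(i, j): dist(points[i], points[j])
         for i, j in combinations(range(len(points)), 2)}
    heap = [(dij, i, j) for (i, j), dij in d.items()]
    heapq.heapify(heap)
    next_id = len(points)
    merges = []
    while len(clusters) > k:
        dij, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue  # stale entry: one of the two clusters no longer exists
        new = clusters.pop(i) + clusters.pop(j)
        for other in clusters:
            # Single-link update: the distance to the merged cluster is the
            # smaller of the distances to its two parts.
            dn = min(d[min(i, other), max(i, other)],
                     d[min(j, other), max(j, other)])
            d[(other, next_id)] = dn
            heapq.heappush(heap, (dn, other, next_id))
        clusters[next_id] = new
        merges.append((i, j, dij))
        next_id += 1
    return list(clusters.values()), merges


points = [(0.0,), (1.0,), (3.0,), (6.0,), (10.0,)]
clusters, merges = heap_single_link(points, k=2)
print(clusters)
print(merges)
```

Lazy deletion trades a larger queue for simpler code; each pop still costs O(log N), so the overall bound discussed above is not affected.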
Distance Between Clusters
Initial Configuration
• Start with clusters of individual points and the distance matrix
[Figure: points p1, p2, p3, p4, ..., p9, p10, p11, p12 as individual clusters, with the distance matrix indexed by p1, p2, p3, p4, p5, ...]
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1, C2, C3, C4, C5 over the points p1, ..., p12, with the distance matrix indexed by C1 to C5]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
[Figure: clusters C1, C2, C3, C4, C5 over the points p1, ..., p12, with the distance matrix indexed by C1 to C5]
After Merging
• The question is “How do we update the proximity matrix?”
[Figure: clusters C1, C2 U C5, C3, C4 over the points p1, ..., p12; in the distance matrix, the row and column of C2 U C5 are marked with "?"]
Similarity?
Single Linkage or MIN
Complete Linkage or MAX
Average or Group Average
Distance between Centroids
Typical Alternatives to Calculate the
Distance Between Clusters
• Single link (or MIN)
§ smallest distance between an element in one cluster and an
element in the other, i.e., $d(C_i, C_j) = \min_{p,q} d(t_{i,p}, t_{j,q})$, where $t_{i,p}$ is the p-th element of $C_i$
• Complete link (or MAX)
§ largest distance between an element in one cluster and
an element in the other, i.e., $d(C_i, C_j) = \max_{p,q} d(t_{i,p}, t_{j,q})$
• Average (or group average)
§ average distance between an element in one cluster and an
element in the other, i.e., $d(C_i, C_j) = \mathrm{avg}_{p,q}\, d(t_{i,p}, t_{j,q})$
• Centroid
§ distance between the centroids of two clusters, i.e.,
$d(C_i, C_j) = d(\mu_i, \mu_j)$, where $\mu_i$ and $\mu_j$ are the centroids
• …
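The four alternatives can be written as small Python functions. This is a sketch assuming Euclidean points stored as coordinate tuples; the function names are illustrative and the two clusters are made-up toy data.

```python
from math import dist
from statistics import mean


def single_link(ci, cj):
    """MIN: smallest distance between an element of ci and one of cj."""
    return min(dist(p, q) for p in ci for q in cj)


def complete_link(ci, cj):
    """MAX: largest distance between an element of ci and one of cj."""
    return max(dist(p, q) for p in ci for q in cj)


def average_link(ci, cj):
    """Group average: mean distance over all cross-cluster pairs."""
    return mean(dist(p, q) for p in ci for q in cj)


def centroid(c):
    """Coordinate-wise mean of the points of a cluster."""
    return tuple(mean(coords) for coords in zip(*c))


def centroid_link(ci, cj):
    """Distance between the centroids of the two clusters."""
    return dist(centroid(ci), centroid(cj))


# Toy clusters, made up for illustration.
ci = [(0.0, 0.0), (0.0, 2.0)]
cj = [(3.0, 0.0), (5.0, 0.0)]
for name, link in [("single", single_link), ("complete", complete_link),
                   ("average", average_link), ("centroid", centroid_link)]:
    print(name, round(link(ci, cj), 3))
```

For any pair of clusters the single-link distance is never larger than the group average, which is never larger than the complete-link distance.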
Example
• Suppose we have five items, a, b, c, d, and e.
• We want to perform hierarchical clustering on
the five instances following an agglomerative approach
• First: we compute the distance or similarity matrix
• $D_{ij}$ is the distance between instances i and j
$$
D = \begin{pmatrix}
0.0 & & & & \\
2.0 & 0.0 & & & \\
6.0 & 5.0 & 0.0 & & \\
10.0 & 9.0 & 4.0 & 0.0 & \\
9.0 & 8.0 & 5.0 & 3.0 & 0.0
\end{pmatrix}
$$
Example
• Group the two instances that are closest
• In this case, a and b are the closest items ($D_{2,1} = 2.0$)
• Compute the distance matrix again, and repeat.
• Suppose we apply single linkage (MIN): we need to compute the
distance between the new cluster {1,2} and the others
§ $d_{(12)3} = \min[d_{13}, d_{23}] = d_{23} = 5.0$
§ $d_{(12)4} = \min[d_{14}, d_{24}] = d_{24} = 9.0$
§ $d_{(12)5} = \min[d_{15}, d_{25}] = d_{25} = 8.0$
Example
• The new distance matrix is,
$$
D = \begin{pmatrix}
0.0 & & & \\
5.0 & 0.0 & & \\
9.0 & 4.0 & 0.0 & \\
8.0 & 5.0 & 3.0 & 0.0
\end{pmatrix}
$$
(rows and columns ordered as (12), 3, 4, 5)
• At the end, we obtain the
following dendrogram
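The remaining steps of the example can be checked with a short script. It is a sketch under stated assumptions: the function name is illustrative, the lower-triangular matrix is the example's distance matrix with items a to e standing for instances 1 to 5, and the MIN rule is applied after every merge.

```python
def single_link_merges(D, labels):
    """Run single-linkage clustering on a lower-triangular distance matrix.

    Returns the merges as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(label,) for label in labels]
    # dist[frozenset({A, B})] is the current distance between clusters A and B.
    dist = {
        frozenset({clusters[i], clusters[j]}): D[i][j]
        for i in range(len(labels))
        for j in range(i)
    }
    merges = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # closest pair of clusters
        a, b = sorted(pair)
        merges.append((a, b, dist[pair]))
        new = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # MIN rule: d(new, c) = min(d(a, c), d(b, c))
            dist[frozenset({new, c})] = min(
                dist.pop(frozenset({a, c})), dist.pop(frozenset({b, c}))
            )
        del dist[pair]
        clusters.append(new)
    return merges


# Distance matrix of the example (rows/columns: a, b, c, d, e).
D = [
    [0.0],
    [2.0, 0.0],
    [6.0, 5.0, 0.0],
    [10.0, 9.0, 4.0, 0.0],
    [9.0, 8.0, 5.0, 3.0, 0.0],
]
for a, b, d in single_link_merges(D, "abcde"):
    print("".join(a), "+", "".join(b), "at distance", d)
```

The distance of each merge is the height at which the corresponding node would be drawn in the dendrogram.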
Determining the Number of Clusters
Hierarchical clustering generates
a set of N possible partitions.
Which one should I choose?
From the previous lecture we know that, ideally,
a good clustering should partition points so that:
Data points in the same cluster should have
a small distance from one another
Data points in different clusters should be at
a large distance from one another.
Within/Between Clusters Sum of
Squares
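• Within-cluster sum of squares
$$WSS = \sum_{i} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$
where μi is the centroid of cluster Ci (in case of Euclidean spaces)
• Between-cluster sum of squares
$$BSS = \sum_{i} |C_i| \, \lVert \mu_i - \mu \rVert^2$$
where μ is the centroid of the whole dataset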
Within/Between Clusters Sum of
Squares (for distance function d)
• Within-cluster sum of squares
$$WSS = \sum_{i} \sum_{x \in C_i} d(x, \mu_i)^2$$
where μi is the centroid of cluster Ci (in case of Euclidean spaces)
• Between-cluster sum of squares
$$BSS = \sum_{i} |C_i| \, d(\mu_i, \mu)^2$$
where μ is the centroid of the whole dataset
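A small sketch of the two measures, assuming the usual Euclidean definitions: WSS adds up the squared distance of every point to the centroid of its cluster, and BSS adds up, for every cluster, its size times the squared distance between its centroid and the centroid of the whole dataset. Function names and the toy data are illustrative.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wss_bss(clusters):
    """Return (WSS, BSS) for a clustering given as a list of point lists."""
    all_points = [p for c in clusters for p in c]
    mu = centroid(all_points)
    wss = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    bss = sum(len(c) * sq_dist(centroid(c), mu) for c in clusters)
    return wss, bss


# Toy clustering, made up for illustration.
clusters = [[(1.0, 1.0), (1.0, 3.0)], [(5.0, 1.0), (5.0, 3.0)]]
wss, bss = wss_bss(clusters)
print("WSS =", wss, "BSS =", bss)
```

With these definitions WSS + BSS equals the total sum of squares of the dataset, so lowering WSS by adding clusters raises BSS by the same amount.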
Evaluation of Hierarchical Clustering
using Knee/Elbow Analysis
Plot the WSS and BSS for every clustering and look
for a knee in the plot that shows a significant
change in the evaluation metrics
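The knee is usually found by looking at the plot. As a rough automatic stand-in, one simple heuristic (an assumption here, not the method of these slides) picks the k where the WSS curve bends most sharply, measured by the second difference. The WSS values below are hypothetical, made up for illustration.

```python
def elbow(wss_by_k):
    """Return the k (1-based) where the WSS curve bends most sharply.

    The bend at k is measured by the second difference
    wss[k-1] - 2 * wss[k] + wss[k+1]; this is one simple heuristic,
    not a definitive rule.
    """
    if len(wss_by_k) < 3:
        raise ValueError("need WSS for at least three values of k")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(wss_by_k) - 1):
        bend = wss_by_k[i - 1] - 2 * wss_by_k[i] + wss_by_k[i + 1]
        if bend > best_bend:
            best_k, best_bend = i + 1, bend  # list index i holds k = i + 1
    return best_k


# Hypothetical WSS values for k = 1..6 (not measured on any dataset).
wss = [100.0, 60.0, 20.0, 15.0, 12.0, 10.0]
print("suggested k:", elbow(wss))
```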
Run the Python notebook
for hierarchical clustering
Example data generated using the make_blobs function of scikit-learn
Dendrogram computed using single linkage.
BSS and WSS for values of k from 1 to 19.
Clusters produced for values of k from 2 to 7.
Clusters produced for values of k from 5 to 10.
How can we represent clusters?
Euclidean vs Non-Euclidean Spaces
• Euclidean Spaces
§ We can identify a cluster using for instance its centroid
(e.g. computed as the average among all its data points)
§ Alternatively, we can use its convex hull
• Non-Euclidean Spaces
§ We can define a distance (Jaccard, cosine, edit)
§ We cannot compute a centroid, so we introduce the concept of
clustroid
• Clustroid
§ An existing data point that we take as a cluster representative
§ It can be the point that minimizes the sum of the distances to the other
points in the cluster
§ Or, the one minimizing the maximum distance to another point
§ Or, the one minimizing the sum of the squares of the distances to the other points in the
cluster
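The three clustroid criteria can be sketched as follows, using the Jaccard distance between sets as an example of a non-Euclidean distance. The function names, the `criterion` keywords and the toy cluster are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 minus the ratio between intersection size and union size."""
    return 1.0 - len(a & b) / len(a | b)


def clustroid(cluster, distance, criterion="sum"):
    """Pick the existing point that best represents the cluster.

    criterion: "sum"    -> minimize the sum of distances to the others
               "max"    -> minimize the maximum distance to another point
               "sum_sq" -> minimize the sum of squared distances
    """
    def score(candidate):
        ds = [distance(candidate, other)
              for other in cluster if other is not candidate]
        if criterion == "sum":
            return sum(ds)
        if criterion == "max":
            return max(ds)
        if criterion == "sum_sq":
            return sum(d * d for d in ds)
        raise ValueError("unknown criterion: " + criterion)

    return min(cluster, key=score)


# Toy cluster of sets, made up for illustration.
cluster = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({1, 2, 6}),
]
for criterion in ("sum", "max", "sum_sq"):
    print(criterion, sorted(clustroid(cluster, jaccard_distance, criterion)))
```

On this toy cluster the criteria do not all agree, which shows that the choice of criterion can change the representative.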
Examples using KNIME
Evaluation of the result from hierarchical clustering with
3 clusters and average linkage against existing labels
Comparison of hierarchical clustering with 3 clusters
and average linkage against k-Means with k=3
Computing cluster quality from one to 20 clusters
using the entropy scorer
Examples using R
Hierarchical Clustering in R
# init the seed to be able to repeat the experiment
set.seed(1234)
par(mar=c(0,0,0,0))
# randomly generates the data
x<-rnorm(12, mean=rep(1:3,each=4), sd=0.2)
y<-rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2)
plot(x,y,pch=19,cex=2,col="blue")
# distance matrix
d <- data.frame(x,y)
dm <- dist(d)
# generate the hierarchical clustering (hclust uses complete linkage by default)
cl <- hclust(dm)
plot(cl)
# other ways to plot dendrograms
# http://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html
Evaluation of Clustering in R
library(GMD)
###
### checking the quality of the previous clustering
###
# init two vectors that will contain the evaluation
# in terms of within and between sum of squares
plot_wss <- rep(0,12)
plot_bss <- rep(0,12)
# evaluate every clustering obtained by cutting the dendrogram into i clusters
for(i in 1:12)
{
  clusters <- cutree(cl,i)
  # css() comes from the GMD package
  res <- css(dm,clusters)
  plot_wss[i] <- res$totwss
  plot_bss[i] <- res$totbss
}
Evaluation of Clustering in R
# plot the results
x = 1:12
plot(x, y=plot_bss, main="Between Cluster Sum-of-square",
cex=2, pch=18, col="blue", xlab="Number of Clusters",
ylab="Evaluation")
lines(x, plot_bss, col="blue")
par(new=TRUE)
plot(x, y=plot_wss, cex=2, pch=19, col="red", ylab="", xlab="")
lines(x,plot_wss, col="red");
Knee/Elbow Analysis of Clustering
Hierarchical Clustering in R – Iris2D
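library(foreign)
# load the two-attribute Iris data (petal length and petal width)
iris <- read.arff("iris.2D.arff")
with(iris, plot(petallength,petalwidth, col="blue", pch=19, cex=2))
# Euclidean distance matrix on the two attributes
iris_dist <- dist(iris[,1:2])
# single linkage
cl_single <- hclust(iris_dist, method="single")
#cl_single <- hclust(dist(iris[,1:2],method="manhattan"), method="single")
plot(cl_single)
# average linkage
cl_average <- hclust(iris_dist, method="average")
plot(cl_average)
# cut the average-linkage dendrogram into two clusters
cutree(cl_average,2)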
Knee/Elbow Analysis of Clustering for
iris2D
Knee/Elbow Analysis of Clustering for
iris
Summary
Hierarchical Clustering:
Problems and Limitations
• Once a decision is made to combine two clusters,
it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one
or more of the following:
§ Sensitivity to noise and outliers
§ Difficulty handling different sized clusters
and convex shapes
§ Breaking large clusters
• Major weakness of agglomerative clustering methods
§ They do not scale well: time complexity of at least O(n^2),
where n is the total number of objects
§ They can never undo what was done previously
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 

Recently uploaded (20)

4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 

DMTM Lecture 12 Hierarchical clustering

  • 1. Prof. Pier Luca Lanzi Hierarchical Clustering Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
  • 10. What is Hierarchical Clustering?
      • Suppose we have five items: a, b, c, d, and e.
      • Initially, we consider one cluster for each item.
      • Then, at each step, we merge the two most similar clusters, until only one cluster remains.
      [Figure: Step 0: a, b, c, d, e; Step 1: {a,b}; Step 2: {d,e}; Step 3: {c,d,e}; Step 4: {a,b,c,d,e}]
  • 11. What is Hierarchical Clustering?
      • Alternatively, we start from one cluster containing all five elements.
      • Then, at each step, we split one cluster to improve intracluster similarity, until each element is in a cluster of its own.
      [Figure: the same hierarchy read in reverse, from {a,b,c,d,e} at Step 0 down to the single items a, b, c, d, e at Step 4]
  • 12. What is Hierarchical Clustering?
      • By far, it is the most common clustering technique.
      • It produces a hierarchy of nested clusters.
      • The hierarchy can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
      [Figure: dendrogram over a, b, c, d, e with clusters {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}]
  • 13. What Approaches?
      • Agglomerative
          § Start with each point as its own cluster; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
      • Divisive
          § Start with one cluster containing all points; at each step, split a cluster until each cluster contains a single point (or there are k clusters)
      [Figure: the same dendrogram, with arrows marking the agglomerative and divisive directions]
  • 14. Strengths of Hierarchical Clustering
      • No need to assume any particular number of clusters
      • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
      • The resulting hierarchies may correspond to meaningful taxonomies
      • Examples in the biological sciences include the animal kingdom and phylogeny reconstruction
      • Traditional hierarchical algorithms use a similarity or distance matrix to merge or split one cluster at a time
  • 15. Agglomerative Clustering Algorithm
      • The more popular hierarchical clustering technique
      • Compute the proximity matrix
      • Let each data point be a cluster
      • Repeat
          § Merge the two closest clusters
          § Update the proximity matrix
      • Until only a single cluster remains
      • The key operation is the computation of the proximity of two clusters
      • Different approaches to defining the distance between clusters distinguish the different algorithms
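The loop on slide 15 can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's notebook: the names `agglomerative` and `single_link`, the stopping parameter `k`, and the sample points are all assumptions made for the example. For brevity it recomputes cluster distances at every step instead of maintaining an explicit proximity matrix, so it is the naive cubic-or-worse variant.

```python
import math

def single_link(c1, c2):
    """Smallest pairwise Euclidean distance between two clusters (lists of points)."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, cluster_distance=single_link, k=1):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]          # let each data point be a cluster
    while len(clusters) > k:
        best = None                           # (distance, i, j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the two closest clusters
        del clusters[j]                           # distances are recomputed next round
    return clusters

# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, k=3)
```

Passing a different `cluster_distance` function changes the linkage without touching the loop, which is the sense in which the cluster distance distinguishes the algorithms.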
  • 16. Hierarchical Clustering: Time and Space Requirements
      • O(N^2) space, since it uses the proximity matrix
          § N is the number of points
      • O(N^3) time in many cases
          § There are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched
          § Complexity can be reduced to O(N^2 log N) time for some approaches
  • 17. Efficient Implementation
      • Compute the distance between all pairs of points [O(N^2)]
      • Insert the pairs and their distances into a priority queue, so that the minimum can be found in one step [O(N^2)]
      • When two clusters are merged, remove all entries in the priority queue involving one of these two clusters [O(N log N)]
      • Compute all the distances between the new cluster and the remaining clusters [O(N log N)]
      • Since the last two steps are executed at most N times, the complexity of the whole algorithm is O(N^2 log N)
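A priority-queue version along these lines can be sketched with Python's `heapq`. Two things here are assumptions rather than the slide's method: single linkage is used to compute the distance to the new cluster, and queue entries are not removed eagerly but skipped when popped ("lazy deletion"), which is a common substitute because `heapq` has no efficient arbitrary removal. The function name and the sample distances are illustrative.

```python
import heapq

def single_linkage_heap(matrix):
    """Agglomerative single-linkage clustering driven by a priority queue.

    matrix: symmetric list of lists of pairwise distances.
    Returns the merges as (member indices, height), in the order they happen.
    """
    n = len(matrix)
    members = {i: (i,) for i in range(n)}   # active cluster id -> point indices
    dist = {}                               # (smaller id, larger id) -> distance
    heap = []
    for i in range(n):
        for j in range(i + 1, n):
            dist[(i, j)] = matrix[i][j]
            heap.append((matrix[i][j], i, j))
    heapq.heapify(heap)                     # build the queue of all pairs

    merges = []
    next_id = n
    while len(members) > 1:
        height, a, b = heapq.heappop(heap)
        if a not in members or b not in members:
            continue                        # stale entry: a or b was merged earlier
        merged = members.pop(a) + members.pop(b)
        for c in members:                   # distances from the new cluster to the rest
            d = min(dist[tuple(sorted((a, c)))], dist[tuple(sorted((b, c)))])
            dist[(c, next_id)] = d          # c < next_id, since new ids only grow
            heapq.heappush(heap, (d, c, next_id))
        members[next_id] = merged
        merges.append((tuple(sorted(merged)), height))
        next_id += 1
    return merges

# Sample distances between five items (indices 0..4).
D = [
    [0.0, 2.0, 6.0, 10.0, 9.0],
    [2.0, 0.0, 5.0, 9.0, 8.0],
    [6.0, 5.0, 0.0, 4.0, 5.0],
    [10.0, 9.0, 4.0, 0.0, 3.0],
    [9.0, 8.0, 5.0, 3.0, 0.0],
]
merges = single_linkage_heap(D)
```

The distance between two clusters that are both still active never changes, so an entry is stale only when one of its endpoints has disappeared; that is what makes the skip-on-pop check sufficient.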
  • 18. Distance Between Clusters
  • 19. Initial Configuration
      • Start with clusters of individual points and the distance matrix
      [Figure: points p1, p2, ..., p12 and the distance matrix over p1, p2, p3, p4, p5, ...]
  • 20. Intermediate Situation
      • After some merging steps, we have some clusters
      [Figure: clusters C1 to C5 and their distance matrix]
  • 21. Intermediate Situation
      • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
      [Figure: clusters C1 to C5 and their distance matrix]
  • 22. After Merging
      • The question is “How do we update the proximity matrix?”
      [Figure: clusters C1, C3, C4, and C2 ∪ C5; the distance-matrix entries involving C2 ∪ C5 are marked “?”]
  • 23. Similarity?
  • 24. Single Linkage or MIN
  • 25. Complete Linkage or MAX
  • 26. Average or Group Average
  • 27. Distance between Centroids
  • 28. Typical Alternatives to Calculate the Distance Between Clusters
      • Single link (or MIN)
          § smallest distance between an element in one cluster and an element in the other, i.e., $d(C_i, C_j) = \min_{p,q} d(t_{i,p}, t_{j,q})$
      • Complete link (or MAX)
          § largest distance between an element in one cluster and an element in the other, i.e., $d(C_i, C_j) = \max_{p,q} d(t_{i,p}, t_{j,q})$
      • Average (or group average)
          § average distance between an element in one cluster and an element in the other, i.e., $d(C_i, C_j) = \operatorname{avg}_{p,q} d(t_{i,p}, t_{j,q})$
      • Centroid
          § distance between the centroids of the two clusters, i.e., $d(C_i, C_j) = d(\mu_i, \mu_j)$, where $\mu_i$ and $\mu_j$ are the centroids
      • …
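The four definitions on slide 28 translate almost directly into code. The sketch below assumes Euclidean distance between points given as coordinate tuples; the function names and the two sample clusters are illustrative, not taken from the lecture material.

```python
import math

def single_link(ci, cj):
    """MIN: smallest distance between an element of ci and an element of cj."""
    return min(math.dist(p, q) for p in ci for q in cj)

def complete_link(ci, cj):
    """MAX: largest distance between an element of ci and an element of cj."""
    return max(math.dist(p, q) for p in ci for q in cj)

def average_link(ci, cj):
    """Group average: mean over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

def centroid(c):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(c) for coords in zip(*c))

def centroid_link(ci, cj):
    """Distance between the centroids of the two clusters."""
    return math.dist(centroid(ci), centroid(cj))

# Two hypothetical clusters on a line, so the values are easy to check by hand.
ci = [(0.0, 0.0), (2.0, 0.0)]
cj = [(5.0, 0.0), (7.0, 0.0)]
```

For these clusters the cross-cluster distances are 5, 7, 3, and 5, so single link gives 3, complete link gives 7, and both the group average and the centroid distance give 5.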
  • 29. Example
      • Suppose we have five items: a, b, c, d, and e.
      • We want to perform hierarchical clustering on the five instances following an agglomerative approach
      • First, we compute the distance or similarity matrix
      • $D_{ij}$ is the distance between instances i and j
      $$D = \begin{pmatrix} 0.0 & & & & \\ 2.0 & 0.0 & & & \\ 6.0 & 5.0 & 0.0 & & \\ 10.0 & 9.0 & 4.0 & 0.0 & \\ 9.0 & 8.0 & 5.0 & 3.0 & 0.0 \end{pmatrix}$$
  • 30. Example
      • Group the two instances that are closest
      • In this case, a and b are the closest items ($D_{2,1} = 2$)
      • Compute the distance matrix again, and repeat
      • Suppose we apply single linkage (MIN); we need to compute the distance between the new cluster {1,2} and the others
          § $d_{(12)3} = \min[d_{13}, d_{23}] = d_{23} = 5.0$
          § $d_{(12)4} = \min[d_{14}, d_{24}] = d_{24} = 9.0$
          § $d_{(12)5} = \min[d_{15}, d_{25}] = d_{25} = 8.0$
  • 31. Example
      • The new distance matrix, with rows and columns ordered (12), 3, 4, 5, is
      $$D = \begin{pmatrix} 0.0 & & & \\ 5.0 & 0.0 & & \\ 9.0 & 4.0 & 0.0 & \\ 8.0 & 5.0 & 3.0 & 0.0 \end{pmatrix}$$
      • At the end, we obtain the following dendrogram
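The worked example can be replayed in Python to check the numbers. The sketch below stores the lower-triangular distances from slide 29 in a dictionary keyed by unordered pairs of cluster labels and applies the single-linkage update from slide 30 repeatedly; the helper name `merge_closest` and the label-concatenation scheme ("ab", "cde") are illustrative choices.

```python
labels = ["a", "b", "c", "d", "e"]
lower = [
    [],
    [2.0],
    [6.0, 5.0],
    [10.0, 9.0, 4.0],
    [9.0, 8.0, 5.0, 3.0],
]
# D[{x, y}] is the distance between clusters x and y.
D = {frozenset((labels[i], labels[j])): lower[i][j]
     for i in range(len(labels)) for j in range(i)}

def merge_closest(D):
    """One agglomerative step with single linkage.

    Returns (label of the merged cluster, merge height, updated matrix).
    """
    pair = min(D, key=D.get)                      # the two closest clusters
    x, y = sorted(pair)
    merged = x + y
    others = {c for key in D for c in key} - pair
    new_D = {k: v for k, v in D.items() if not (k & pair)}
    for c in others:                              # single linkage: keep the minimum
        new_D[frozenset((merged, c))] = min(D[frozenset((x, c))],
                                            D[frozenset((y, c))])
    return merged, D[pair], new_D

# First step only: a and b merge, giving the 4x4 matrix of slide 31.
_, _, D1 = merge_closest(D)

# Full run: record every merge until a single cluster is left.
history = []
current = D
while current:
    merged, height, current = merge_closest(current)
    history.append((merged, height))
```

Under single linkage the merges happen at heights 2.0 ({a,b}), 3.0 ({d,e}), 4.0 ({c,d,e}), and 5.0 (all five items), which matches the merge order shown in the introductory slides.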
  • 32. Determining the Number of Clusters
  • 33. Hierarchical clustering generates a set of N possible partitions. Which one should I choose?
  • 34. From the previous lecture, we know that ideally a good clustering should partition the points so that data points in the same cluster are at a small distance from one another, and data points in different clusters are at a large distance from one another.
  • 35. Within/Between Clusters Sum of Squares
      • Within-cluster sum of squares, where μi is the centroid of cluster Ci (in case of Euclidean spaces)
      • Between-cluster sum of squares, where μ is the centroid of the whole dataset
  • 36. Within/Between Clusters Sum of Squares (for distance function d)
      • Within-cluster sum of squares, where μi is the centroid of cluster Ci (in case of Euclidean spaces)
      • Between-cluster sum of squares, where μ is the centroid of the whole dataset
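The formulas on slides 35 and 36 are not preserved in this transcript. The sketch below assumes the usual Euclidean definitions, $WSS = \sum_i \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$ and $BSS = \sum_i |C_i| \, \lVert \mu_i - \mu \rVert^2$, which may differ in detail from the ones on the slides. The function names and the sample clusters are illustrative.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wss(clusters):
    """Within-cluster sum of squares: distances of points to their own centroid."""
    return sum(sq_dist(x, centroid(c)) for c in clusters for x in c)

def bss(clusters):
    """Between-cluster sum of squares, weighting each centroid by cluster size."""
    mu = centroid([x for c in clusters for x in c])   # centroid of the whole dataset
    return sum(len(c) * sq_dist(centroid(c), mu) for c in clusters)

# Two hypothetical clusters with centroids (0, 1) and (4, 1).
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
total = sum(sq_dist(x, (2.0, 1.0)) for c in clusters for x in c)
```

With these definitions WSS + BSS equals the total sum of squares around the dataset centroid (here 4 + 16 = 20), so as the number of clusters grows WSS falls and BSS rises by the same amount.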
  • 37. Evaluation of Hierarchical Clustering using Knee/Elbow Analysis: plot the WSS and BSS for every clustering and look for a knee in the plot that shows a significant change in the evaluation metrics
  • 38. Run the Python notebook for hierarchical clustering
  • 39. Example data generated using the make_blobs function of scikit-learn
  • 40. Dendrogram computed using single linkage
  • 41. BSS and WSS for values of k from 1 to 19
  • 42. Clusters produced for values of k from 2 to 7
  • 43. Clusters produced for values of k from 5 to 10
  • 44. How can we represent clusters?
  • 45. Euclidean vs Non-Euclidean Spaces
      • Euclidean Spaces
          § We can identify a cluster using, for instance, its centroid (e.g., computed as the average of all its data points)
          § Alternatively, we can use its convex hull
      • Non-Euclidean Spaces
          § We can define a distance (Jaccard, cosine, edit)
          § We cannot compute a centroid, so we introduce the concept of clustroid
      • Clustroid
          § An existing data point that we take as the cluster representative
          § It can be the point that minimizes the sum of the distances to the other points in the cluster
          § Or the one minimizing the maximum distance to another point
          § Or the one minimizing the sum of the squares of the distances to the other points in the cluster
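As an illustration of the clustroid idea in a non-Euclidean space, the sketch below picks the representative of a small cluster of strings under edit (Levenshtein) distance, using each of the three criteria listed on slide 45. The function names, the `criterion` parameter, and the example words are assumptions made for the example.

```python
def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution or match
        prev = curr
    return prev[-1]

def clustroid(cluster, dist, criterion="sum"):
    """Return the existing point that minimizes the chosen criterion.

    criterion: "sum" (sum of distances), "max" (maximum distance),
    or "sumsq" (sum of squared distances) to the other points.
    """
    def score(x):
        ds = [dist(x, y) for y in cluster]   # distance to itself is 0
        if criterion == "sum":
            return sum(ds)
        if criterion == "max":
            return max(ds)
        if criterion == "sumsq":
            return sum(d * d for d in ds)
        raise ValueError(f"unknown criterion: {criterion}")
    return min(cluster, key=score)

words = ["kitten", "sitten", "sitting", "mitten"]
representative = clustroid(words, edit_distance)
```

For this cluster all three criteria agree on "sitten", but in general they can select different points, so the criterion should be chosen with the application in mind.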
  • 46. Examples using KNIME
  • 47. Evaluation of the result from hierarchical clustering with 3 clusters and average linkage against existing labels
  • 48. Comparison of hierarchical clustering with 3 clusters and average linkage against k-Means with k=3
  • 49. Computing cluster quality from one to 20 clusters using the entropy scorer
  • 50. Examples using R
  • 51. Hierarchical Clustering in R
      # init the seed to be able to repeat the experiment
      set.seed(1234)
      par(mar=c(0,0,0,0))
      # randomly generate the data
      x <- rnorm(12, mean=rep(1:3,each=4), sd=0.2)
      y <- rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2)
      plot(x,y,pch=19,cex=2,col="blue")
      # distance matrix
      d <- data.frame(x,y)
      dm <- dist(d)
      # generate the hierarchical clustering and plot its dendrogram
      cl <- hclust(dm)
      plot(cl)
      # other ways to plot dendrograms
      # http://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html
  • 52. Evaluation of Clustering in R
      library(GMD)
      ###
      ### checking the quality of the previous cluster
      ###
      # init two vectors that will contain the evaluation
      # in terms of within and between sum of squares
      plot_wss = rep(0,12)
      plot_bss = rep(0,12)
      # evaluate every clustering
      for(i in 1:12) {
        clusters <- cutree(cl,i)
        eval <- css(dm,clusters);
        plot_wss[i] <- eval$totwss
        plot_bss[i] <- eval$totbss
      }
  • 53. Evaluation of Clustering in R
      # plot the results
      x = 1:12
      plot(x, y=plot_bss, main="Between Cluster Sum-of-square", cex=2, pch=18, col="blue", xlab="Number of Clusters", ylab="Evaluation")
      lines(x, plot_bss, col="blue")
      par(new=TRUE)
      plot(x, y=plot_wss, cex=2, pch=19, col="red", ylab="", xlab="")
      lines(x,plot_wss, col="red");
  • 54. Knee/Elbow Analysis of Clustering
  • 55. Hierarchical Clustering in R – Iris2D
      library(foreign)
      iris = read.arff("iris.2D.arff")
      with(iris, plot(petallength,petalwidth, col="blue", pch=19, cex=2))
      # distance matrix on the two attributes
      dm <- dist(iris[,1:2])
      # single linkage
      cl <- hclust(dm, method="single")
      #clustering <- hclust(dist(iris[,1:2],method="manhattan"), method="single")
      plot(cl)
      # average linkage
      cl_average <- hclust(dm, method="average")
      plot(cl_average)
      # cut the dendrogram into two clusters
      cutree(cl_average,2)
  • 56. Knee/Elbow Analysis of Clustering for iris2D
  • 57. Knee/Elbow Analysis of Clustering for iris
  • 58. Summary
  • 59. Hierarchical Clustering: Problems and Limitations
      • Once a decision is made to combine two clusters, it cannot be undone
      • No objective function is directly minimized
      • Different schemes have problems with one or more of the following:
          § Sensitivity to noise and outliers
          § Difficulty handling different sized clusters and convex shapes
          § Breaking large clusters
      • Major weaknesses of agglomerative clustering methods
          § They do not scale well: time complexity of at least O(n^2), where n is the total number of objects
          § They can never undo what was done previously