Outline
1 Hierarchical Clustering
Definition
Basic Ideas
2 Agglomerative Algorithms
Introduction
Problems with Agglomerative Algorithms
Two Categories of Agglomerative Algorithms
Matrix Based Algorithms
Graph Based Algorithms
3 Divisive Algorithms
Introduction
4 Algorithms for Large Data Sets
Introduction
Clustering Using REpresentatives (CURE)
Basic Ideas
At each step t
A new clustering is obtained based on the clustering produced at the previous step t − 1.
Two Main Types
1 Agglomerative Algorithms.
  1 Start with each item being a single cluster.
  2 Eventually all items belong to the same cluster.
2 Divisive Algorithms.
  1 Start with all items belonging to the same cluster.
  2 Eventually each item forms a cluster on its own.
Nested Clustering
Definition
1 A clustering ℜi containing k clusters is said to be nested in the clustering ℜi+1, which contains r < k clusters, if each cluster in ℜi is a subset of a cluster in ℜi+1.
2 In addition, at least one cluster of ℜi must be a proper subset of a cluster in ℜi+1.
This is written as
ℜi ⊏ ℜi+1 (1)
Example
We have
The following set {x1, x2, x3, x4, x5}.
With the following structures
ℜ1 = {{x1, x3} , {x4} , {x2, x5}}
ℜ2 = {{x1, x3, x4} , {x2, x5}}
Note that ℜ1 ⊏ ℜ2.
Again
Hierarchical Clustering produces a hierarchy of clusterings!
The Basic Agglomerative Algorithm
For this
We have a function g (Ci, Cj), defined on all pairs of clusters, that measures their similarity or dissimilarity.
t denotes the current level of the hierarchy.
Algorithm
Initialization:
    Choose ℜ0 = {Ci = {xi} , i = 1, ..., N}
    t = 0
Repeat:
    t = t + 1
    Find the pair of clusters (Ci, Cj) in ℜt−1 such that
        g(Ci, Cj) is the maximum (for a similarity function)
        or minimum (for a dissimilarity function) over all pairs
    Define Cq = Ci ∪ Cj and ℜt = (ℜt−1 − {Ci, Cj}) ∪ {Cq}
Until all vectors lie in a single cluster
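As a minimal sketch of the scheme above (not the book's reference implementation): the function names `agglomerative` and `single_link` are our own, and single-link dissimilarity is used only as one possible choice of g.

```python
import numpy as np

def agglomerative(X, g):
    """Naive agglomerative scheme: start with singleton clusters and
    repeatedly merge the pair minimizing the dissimilarity g."""
    clusters = [[i] for i in range(len(X))]    # R_0: each point its own cluster
    hierarchy = [[c[:] for c in clusters]]     # record every level R_t
    while len(clusters) > 1:
        # find the pair (Ci, Cj) with minimal g over all pairs
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = g(X[clusters[i]], X[clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]     # Cq = Ci ∪ Cj
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)                # R_t = (R_{t-1} − {Ci, Cj}) ∪ {Cq}
        hierarchy.append([c[:] for c in clusters])
    return hierarchy

def single_link(A, B):
    # one possible g: smallest point-to-point distance between the groups
    return min(np.linalg.norm(a - b) for a in A for b in B)
```

Running it on N points yields N nested clusterings, from N singletons down to one cluster.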
Enforcing Nesting
Note the following
“If two vectors come together into a single cluster at level t of the hierarchy, they will remain in the same cluster for all subsequent clusterings.”
Thus
ℜ0 ⊏ ℜ1 ⊏ ℜ2 ⊏ ... ⊏ ℜN−1 (3)
Hurrah!
The nesting property is enforced!
Problems with Agglomerative Algorithms
First - Related to the Nesting Property
There is no way to recover from a “poor” clustering that may have occurred at an earlier level of the hierarchy.
Second
At each level t, there are N − t clusters.
Thus, at level t + 1 the number of pairs of clusters compared is
    (N − t choose 2) = (N − t) (N − t − 1) / 2 (4)
The total number of pairs compared over all levels is
    Σ_{t=0}^{N−1} (N − t choose 2) (5)
Thus
We have that
    Σ_{t=0}^{N−1} (N − t choose 2) = Σ_{k=1}^{N} (k choose 2) = (N − 1) N (N + 1) / 6 (6)
Thus
The complexity of this scheme is O(N³).
However
You still depend on the nature of g.
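Identity (6) can be checked numerically; `total_pairs` and `closed` are hypothetical helper names used only for this check.

```python
from math import comb

def total_pairs(N):
    # pairs examined over all levels t = 0, ..., N-1, per equation (5)
    return sum(comb(N - t, 2) for t in range(N))

def closed(N):
    # closed form from equation (6)
    return (N - 1) * N * (N + 1) // 6
```

Both sides agree for every N, confirming the cubic growth of the pair count.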
Two Categories of Agglomerative Algorithms
There are two
1 Matrix theory based.
2 Graph theory based.
Matrix Theory Based
As the name says, these algorithms are based on the N × N dissimilarity matrix P0 = P (X).
At each merging step the matrix is reduced by one row and one column ⇒ Pt becomes an (N − t) × (N − t) matrix.
Matrix Based Algorithm
Matrix Updating Algorithmic Scheme (MUAS)
Initialization:
    Choose ℜ0 = {Ci = {xi} , i = 1, ..., N}
    P0 = P(X)
    t = 0
Repeat:
    t = t + 1
    Find the pair of clusters (Ci, Cj) in ℜt−1 such that
        d(Ci, Cj) = min over r, s = 1, ..., N, r ≠ s of d(Cr, Cs)
    Define Cq = Ci ∪ Cj and ℜt = (ℜt−1 − {Ci, Cj}) ∪ {Cq}
    Create Pt from Pt−1 by the update strategy
Until all vectors lie in a single cluster
Matrix Based Algorithm
Strategy
1 Delete the two rows and columns that correspond to the merged
clusters.
2 Add a new row and a new column containing the distances between the newly formed cluster and the old (unaffected at this level) clusters.
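A minimal sketch of this two-step strategy with NumPy; `update_matrix` and `new_dist` are our own names, and `new_dist` stands in for whatever rule gives d(Cq, Cs) from d(Ci, Cs) and d(Cj, Cs).

```python
import numpy as np

def update_matrix(P, i, j, new_dist):
    """Delete rows/columns i and j of the dissimilarity matrix (step 1),
    then append a row and column with the distances from the merged
    cluster Cq = Ci ∪ Cj to every surviving cluster (step 2)."""
    keep = [k for k in range(len(P)) if k not in (i, j)]
    # distances from the new cluster Cq to each surviving cluster Cs
    d_new = np.array([new_dist(P[i, k], P[j, k]) for k in keep])
    Q = P[np.ix_(keep, keep)]                           # step 1: delete
    Q = np.vstack([Q, d_new])                           # step 2: add row
    Q = np.hstack([Q, np.append(d_new, 0.0)[:, None]])  # ...and column
    return Q
```

Passing `min` as `new_dist` reproduces the single-link update; `max` gives complete link.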
Distance Used in These Schemes
It has been pointed out that there is a single general distance recurrence for these algorithms
    d (Cq, Cs) = ai d (Ci, Cs) + aj d (Cj, Cs) + b d (Ci, Cj) + c |d (Ci, Cs) − d (Cj, Cs)|
where different values of ai, aj, b and c correspond to different choices of the dissimilarity measure.
Using this distance, it is possible to generate several algorithms
1 The single link algorithm.
2 The complete link algorithm.
3 The weighted pair group method average.
4 The unweighted pair group method centroid.
5 Etc...
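The recurrence above is one line of code; `lance_williams` is our own name for it, and the coefficient choices shown in the comments are the standard single-link and complete-link settings.

```python
def lance_williams(d_is, d_js, d_ij, a_i, a_j, b, c):
    """General update d(Cq, Cs) for the merge Cq = Ci ∪ Cj; the
    coefficients (a_i, a_j, b, c) select the particular algorithm."""
    return a_i * d_is + a_j * d_js + b * d_ij + c * abs(d_is - d_js)

# single link:   a_i = a_j = 1/2, b = 0, c = -1/2  ->  min(d_is, d_js)
# complete link: a_i = a_j = 1/2, b = 0, c = +1/2  ->  max(d_is, d_js)
```

Plugging in the two coefficient sets recovers the min and max of the two old distances, exactly as the named algorithms require.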
For example
The single link algorithm
This is obtained if we set ai = 1/2, aj = 1/2, b = 0, c = −1/2.
Thus, we have
    d (Cq, Cs) = min {d (Ci, Cs) , d (Cj, Cs)} (7)
since (x + y)/2 − |x − y|/2 = min {x, y}.
Please look at the example in the Dropbox
It is an interesting example.
Agglomerative Algorithms Based on Graph Theory
Consider the following
1 Each node in the graph G corresponds to a vector.
2 Clusters are formed by connecting nodes.
3 A certain property, h (k), needs to be respected.
Common Properties: Node Connectivity
The node connectivity of a connected subgraph is the largest integer k such that all pairs of nodes are joined by at least k paths having no nodes in common.
Agglomerative Algorithms Based on Graph Theory
Common Properties: Edge Connectivity
The edge connectivity of a connected subgraph is the largest integer k
such that all pairs of nodes are joined by at least k paths having no edges
in common.
Common Properties: Node Degree
The degree of a connected subgraph is the largest integer k such that
each node has at least k incident edges.
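As a sketch of one of these properties: by Menger's theorem, node connectivity also equals the smallest number of node removals that disconnects the subgraph, which a brute-force check can compute for tiny graphs. The names `is_connected` and `node_connectivity` are our own; this is an illustration, not an efficient algorithm.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """Depth-first search connectivity test on the induced subgraph."""
    nodes = set(nodes)
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n] - seen)
    return seen == nodes

def node_connectivity(nodes, edges):
    """Smallest k such that removing some k nodes disconnects the
    subgraph (= largest k with k node-disjoint paths between all pairs).
    Brute force over removal sets, so only usable on tiny graphs."""
    nodes = list(nodes)
    for k in range(len(nodes) - 1):
        for removed in combinations(nodes, k):
            rest = [n for n in nodes if n not in removed]
            if not is_connected(rest, edges):
                return k
    return len(nodes) - 1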
Basically, We Use the Same Scheme, But...
The function
    g_h(k) (Cr , Cs) = min over x ∈ Cr , y ∈ Cs of {d (x, y) | Property} (8)
Property
The subgraph of G defined by Cr ∪ Cs
1 is connected and, in addition, either
  1 it has the property h (k), or
  2 it is complete.
Generalized Divisive Scheme
Algorithm (PROBLEM: what is wrong?)
Initialization:
    Choose ℜ0 = {X}
    P0 = P(X)
    t = 0
Repeat:
    t = t + 1
    For i = 1 to t
        Given the cluster Ct−1,i of the partition ℜt−1
        Generate all possible pairs of subclusters
    next i
    Find the pair (C1_t−1,j , C2_t−1,j) that maximizes g
    Create ℜt = (ℜt−1 − {Ct−1,j}) ∪ {C1_t−1,j , C2_t−1,j}
Until all vectors lie in a single cluster
Algorithms for Large Data Sets
There are several
1 The CURE Algorithm
2 The ROCK Algorithm
3 The Chameleon Algorithm
4 The BIRCH Algorithm
Clustering Using REpresentatives (CURE)
Basic Idea
Each cluster Ci has a set of representatives RCi = {x1^(i), x2^(i), ..., xK^(i)} with K > 1.
What is happening
By using multiple representatives for each cluster, the CURE algorithm tries to “capture” the shape of each one.
However
In order to avoid taking into account irregularities (for example, outliers) at the border of the cluster:
The initially chosen representatives are “pushed” toward the mean of the cluster.
Therefore
This action is known
As “shrinking”, in the sense that the volume of space “defined” by the representatives is shrunk toward the mean of the cluster.
Shrinking Process
Given a cluster C
Select the point x ∈ C with the maximum distance from the mean of C
and set RC = {x} (the set of representatives).
Then
1 For i = 2 to min {K, nC }
2 Determine y ∈ C − RC that lies farthest from the points in RC
3 RC = RC ∪ {y}
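The farthest-first selection above can be sketched as follows; `select_representatives` is our own name, and the optional shrinking step toward the mean (with an assumed shrink factor a, 0 < a < 1) is shown in the comment.

```python
import numpy as np

def select_representatives(C, K):
    """Farthest-first choice of up to K representatives for a cluster C
    (rows of a NumPy array), as in the CURE shrinking process."""
    mean = C.mean(axis=0)
    # start with the point of C farthest from the cluster mean
    idx = [int(np.argmax(np.linalg.norm(C - mean, axis=1)))]
    while len(idx) < min(K, len(C)):
        rest = [i for i in range(len(C)) if i not in idx]
        # pick the point lying farthest from the current representatives
        dmin = [min(np.linalg.norm(C[i] - C[j]) for j in idx) for i in rest]
        idx.append(rest[int(np.argmax(dmin))])
    # shrinking would then move each representative toward the mean:
    #   R_shrunk = C[idx] + a * (mean - C[idx]),  0 < a < 1
    return idx
```

The first representative is always the most extreme point, and each later one maximizes its distance to those already chosen.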
Resulting Set RC
Thus
The resulting set RC is the set of representatives of C.
Thus, the distance between two clusters is defined as
    d (Ci, Cj) = min over x ∈ RCi , y ∈ RCj of d (x, y) (10)
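Equation (10) is a one-liner; `cure_distance` is our own name for it.

```python
import numpy as np

def cure_distance(RA, RB):
    """d(Ci, Cj): minimum Euclidean distance over all pairs of
    representatives, one from each cluster, as in equation (10)."""
    return min(np.linalg.norm(a - b) for a in RA for b in RB)
```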
Clustering Using REpresentatives (CURE)
Basic Algorithm
Input: A set of points X = {x1, x2, ..., xN}
Output: C clusters
1 For every cluster Ci = {xi}, store Ci.mC = {xi} and Ci.RC = {xi}.
2 Ci.closest stores the cluster closest to Ci.
3 All the input points are inserted into a k-d tree T.
4 Insert each cluster into the heap Q (clusters are arranged in increasing order of the distance between Ci and Ci.closest).
5 While size(Q) > C:
6 Remove the top element Ci of Q and merge it with Cj = Ci.closest.
7 Compute the new representative points for the merged cluster Ck = Ci ∪ Cj.
8 Remove Ci and Cj from T and Q.
9 For all clusters Ch ∈ Q, update Ch.closest and relocate Ch.
10 Insert Ck into Q.
40 / 46
100. Images/cinvestav-
Clustering Using REpresentatives (CURE)
Basic Algorithm
Input : A set of points X = {x1, x2, ..., xN }
Output : C clusters
1 For every cluster Ci = {xi} store Ci.mC = {xi} and Ci.RC = {xi}
2 Ci.closest stores the cluster closest to Ci.
3 All the input points are inserted into a K − d tree T.
4 Insert each cluster into the heap Q. (Clusters are arranged in increasing order of distances
between Ci and Ci.closest).
5 While size(Q) > C
6 Remove the top element of Q, Ci and merge it with Cj == Ci.closest.
7 Then compute the new representative points for the merged cluster Ck = Ci ∪ Cj.
8 Also remove Ci and Cj from T and Q.
9 Also for all the clusters Ch ∈ Q, update Ch.closest and relocate Ch.
10 insert Ck into Q
40 / 46
101. Images/cinvestav-
Clustering Using REpresentatives (CURE)
Basic Algorithm
Input : A set of points X = {x1, x2, ..., xN }
Output : C clusters
1 For every cluster Ci = {xi} store Ci.mC = {xi} and Ci.RC = {xi}
2 Ci.closest stores the cluster closest to Ci.
3 All the input points are inserted into a K − d tree T.
4 Insert each cluster into the heap Q. (Clusters are arranged in increasing order of distances
between Ci and Ci.closest).
5 While size(Q) > C
6 Remove the top element of Q, Ci and merge it with Cj == Ci.closest.
7 Then compute the new representative points for the merged cluster Ck = Ci ∪ Cj.
8 Also remove Ci and Cj from T and Q.
9 Also for all the clusters Ch ∈ Q, update Ch.closest and relocate Ch.
10 insert Ck into Q
40 / 46
102. Images/cinvestav-
Clustering Using REpresentatives (CURE)
Basic Algorithm
Input : A set of points X = {x1, x2, ..., xN }
Output : C clusters
1 For every cluster Ci = {xi} store Ci.mC = {xi} and Ci.RC = {xi}
2 Ci.closest stores the cluster closest to Ci.
3 All the input points are inserted into a K − d tree T.
4 Insert each cluster into the heap Q. (Clusters are arranged in increasing order of distances
between Ci and Ci.closest).
5 While size(Q) > C
6 Remove the top element of Q, Ci and merge it with Cj == Ci.closest.
7 Then compute the new representative points for the merged cluster Ck = Ci ∪ Cj.
8 Also remove Ci and Cj from T and Q.
9 Also for all the clusters Ch ∈ Q, update Ch.closest and relocate Ch.
10 insert Ck into Q
40 / 46
103. Images/cinvestav-
Clustering Using REpresentatives (CURE)
Basic Algorithm
Input : A set of points X = {x1, x2, ..., xN }
Output : C clusters
1 For every cluster Ci = {xi} store Ci.mC = {xi} and Ci.RC = {xi}
2 Ci.closest stores the cluster closest to Ci.
3 All the input points are inserted into a K − d tree T.
4 Insert each cluster into the heap Q. (Clusters are arranged in increasing order of distances
between Ci and Ci.closest).
5 While size(Q) > C
6 Remove the top element of Q, Ci and merge it with Cj == Ci.closest.
7 Then compute the new representative points for the merged cluster Ck = Ci ∪ Cj.
8 Also remove Ci and Cj from T and Q.
9 Also for all the clusters Ch ∈ Q, update Ch.closest and relocate Ch.
10 insert Ck into Q
40 / 46
Possible Solution
CURE does the following
To reduce the computational complexity, CURE adopts the technique of random sampling.
Actually
That is, a sample set X′ is created from X by randomly choosing N′ out of the N points of X.
However, one has to ensure that the probability of missing a cluster of X, due to this sampling, is low.
This can be guaranteed if the number of sampled points N′ is sufficiently large.
42 / 46
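The sampling step itself is a one-liner with the standard library; a minimal sketch (the function name and the reproducibility seed are illustrative additions, not from the slides):

```python
import random

def draw_sample(X, n_prime, seed=0):
    """Create the sample set X' by randomly choosing N' of the N points of X.

    The fixed seed is only for reproducibility; the probability of
    missing a cluster of X shrinks as n_prime grows.
    """
    rng = random.Random(seed)
    return rng.sample(X, n_prime)
```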
Then
Having estimated N′
CURE forms p sample data sets by successive random sampling.
In other words
In other words, X′ is partitioned randomly into p subsets, each containing N′/p points.
For this a parameter q is selected
Then, the points in each partition are clustered until N′/(pq) clusters are formed, or until the distance between the closest pair of clusters to be merged in the next iteration step exceeds a user-defined threshold.
43 / 46
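The partitioning step can be sketched as follows, assuming a simple shuffle-and-stride split (the helper name and seed are hypothetical; the slides only say X′ is partitioned randomly into p subsets):

```python
import random

def partition_sample(X_prime, p, seed=0):
    """Randomly split the sample X' into p subsets of (near-)equal size.

    Each subset is then clustered independently until N'/(pq) clusters
    remain, or the closest-pair distance exceeds a threshold.
    """
    pts = list(X_prime)
    random.Random(seed).shuffle(pts)
    # Stride-slicing the shuffled list yields p balanced partitions.
    return [pts[i::p] for i in range(p)]
```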
Once this has been finished
A second clustering pass is done
On the, at most, p × N′/(pq) = N′/q clusters obtained from all the subsets.
The Goal
To apply the merging procedure described previously to the (at most) N′/q clusters, so that we end up with the required final number, m, of clusters.
Finally
Each point x in the data set X that is not used as a representative in any of the m clusters is assigned to one of them according to the following strategy.
44 / 46
Finally
First
A random sample of representative points from each of the m clusters is chosen.
Then
Based on these representatives, the point x is assigned to the cluster that contains the representative closest to it.
Experiments reported by Guha et al. show that CURE
It is sensitive to parameter selection:
Specifically, the number of representatives K must be large enough to capture the geometry of each cluster.
In addition, N′ must be higher than a certain percentage of N (≈ 2.5%).
45 / 46
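The assignment strategy is a nearest-representative rule; a minimal sketch, where `assign_points` and `reps_by_cluster` are hypothetical helper names (the slides only describe the strategy):

```python
def assign_points(points, reps_by_cluster):
    """Assign each point to the cluster owning its closest representative.

    reps_by_cluster maps a cluster label to that cluster's randomly
    sampled representative points.
    """
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # For each point, pick the cluster whose nearest representative
    # is closest to the point.
    return {x: min(reps_by_cluster,
                   key=lambda c: min(dist(x, r) for r in reps_by_cluster[c]))
            for x in points}
```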
Not only that
The value of the shrinking factor α also affects CURE
For small values of α, CURE behaves like an MST (minimum spanning tree) clustering.
For large values of α, CURE resembles an algorithm with a single representative per cluster.
Worst Case Complexity
O(N² log₂ N) (12)
46 / 46