3. A partitioning method: construct a partition of
a database D of n objects into a set of k clusters
such that
o each cluster contains at least one object
o each object belongs to exactly one cluster
4. Methods:
o k-means: Each cluster is
represented by the center of the
cluster (centroid).
o k-medoids: Each cluster is
represented by one of the objects in
the cluster (medoid).
5. Input to the algorithm: the number of clusters k,
and a database of n objects
Algorithm consists of four steps:
1. partition the objects into k nonempty
subsets/clusters
2. compute seed points as the centroids (the
mean of the objects in each cluster) for each
cluster in the current partition
3. assign each object to the cluster with the
nearest centroid
4. go back to Step 2; stop when there are no
more new assignments
6. Alternative algorithm also consists of four steps:
1. arbitrarily choose k objects as the initial
cluster centers (centroids)
2. (re)assign each object to the cluster with the
nearest centroid
3. update the centroids
4. go back to Step 2; stop when there are no
more new assignments
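A minimal k-means sketch covering the steps in slides 5 and 6, assuming Euclidean distance over tuples of floats; the function names, sample points, and fixed seed are illustrative, not from the slides:

import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(objects, k, seed=0):
    random.seed(seed)
    # Step 1: arbitrarily choose k objects as the initial centroids.
    centroids = random.sample(objects, k)
    assignment = None
    while True:
        # Step 2: (re)assign each object to the cluster with the nearest centroid.
        new_assignment = [min(range(k), key=lambda c: euclidean(obj, centroids[c]))
                          for obj in objects]
        # Step 4: stop when there are no more new assignments.
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        # Step 3: update each centroid to the mean of its cluster's objects.
        for c in range(k):
            members = [o for o, a in zip(objects, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
print(kmeans(points, k=2))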
8. Input to the algorithm: the number of clusters k,
and a database of n objects
Algorithm consists of four steps:
1. arbitrarily choose k objects as the initial
medoids (representative objects)
2. assign each remaining object to the cluster
with the nearest medoid
3. select a nonmedoid object and replace one of the
medoids with it if this improves the clustering
4. go back to Step 2; stop when there are no
more new assignments
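A minimal PAM-style k-medoids sketch for the steps above, again assuming Euclidean distance; all names and data are illustrative:

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def total_cost(objects, medoids):
    # Sum of distances from every object to its nearest medoid.
    return sum(min(euclidean(o, m) for m in medoids) for o in objects)

def kmedoids(objects, k):
    # Step 1: arbitrarily choose k objects as the initial medoids.
    medoids = list(objects[:k])
    while True:
        best, swap = total_cost(objects, medoids), None
        # Step 3: replace a medoid with a nonmedoid if it improves the clustering.
        for i in range(k):
            for o in objects:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(objects, candidate)
                if cost < best:
                    best, swap = cost, candidate
        if swap is None:
            # Step 4: stop; report the stable assignment to the nearest medoid.
            return medoids, [min(range(k), key=lambda c: euclidean(o, medoids[c]))
                             for o in objects]
        medoids = swap

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
print(kmedoids(points, k=2))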
9. A hierarchical method: construct a hierarchy of
clusterings, not just a single partition of objects.
The number of clusters k is not required as an
input.
Uses a distance matrix as the clustering criterion.
A termination condition can be used (e.g., a
desired number of clusters).
10. 1. agglomerative (bottom-up):
o place each object in its own cluster.
o merge in each step the two most similar clusters until
there is only one cluster left or the termination
condition is satisfied.
2. divisive (top-down):
o start with one big cluster containing all the objects.
o divide the most distinctive cluster into smaller
clusters and proceed until there are n clusters or the
termination condition is satisfied.
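A minimal sketch of the bottom-up variant, assuming single-link (closest pair) distance between clusters and a target number of clusters as the termination condition; names and data are illustrative:

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    # Cluster-to-cluster distance: the closest pair of members.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(objects, target_k=1):
    # Start with each object in its own cluster.
    clusters = [[o] for o in objects]
    # Merge the two most similar clusters until the termination condition holds.
    while len(clusters) > target_k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([(0.0,), (0.2,), (5.0,), (5.1,), (9.0,)], target_k=2))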
11. [Figure: dendrogram over objects a, b, c, d, e. Read left to right
(Step 0 through Step 4), the agglomerative direction merges {a, b},
{d, e}, {c, d, e}, and finally {a, b, c, d, e}; read right to left
(Step 4 back to Step 0), the divisive direction splits the same tree
top-down.]
12. BIRCH: Balanced Iterative Reducing and Clustering using
Hierarchies.
Incrementally constructs a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering.
Phase 1: scan the DB to build an initial in-memory CF tree
(a multilevel compression of the data that tries to
preserve the inherent clustering structure of the data).
Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF tree.
13. Clustering feature:
• summary of the statistics for a given subcluster: the 0th, 1st, and
2nd moments of the subcluster from the statistical point of view.
• registers crucial measurements for computing clusters and utilizes
storage efficiently.
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering.
• A nonleaf node in the tree has descendants or "children".
• The nonleaf nodes store the sums of the CFs of their children.
A CF tree has two parameters:
• Branching factor: specifies the maximum number of children.
• Threshold: the maximum diameter of subclusters stored at the leaf nodes.
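A minimal sketch of a single clustering feature, assuming CF = (N, LS, SS) with N the point count (0th moment), LS the linear sum (1st moment), and SS the sum of squared norms (2nd moment); class and variable names are illustrative:

import math

class CF:
    def __init__(self, dim):
        self.n = 0                 # N: number of points (0th moment)
        self.ls = [0.0] * dim      # LS: linear sum of the points (1st moment)
        self.ss = 0.0              # SS: sum of squared norms (2nd moment)

    def add(self, point):
        # Additivity: absorbing a point only updates the three statistics,
        # which is what lets nonleaf nodes store sums of their children's CFs.
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)

    def centroid(self):
        return [x / self.n for x in self.ls]

    def diameter(self):
        # Root-mean-square pairwise distance, computable from (N, LS, SS)
        # alone; this is the quantity the leaf-node threshold bounds.
        if self.n < 2:
            return 0.0
        num = 2 * self.n * self.ss - 2 * sum(x * x for x in self.ls)
        return math.sqrt(num / (self.n * (self.n - 1)))

cf = CF(dim=2)
for p in [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]:
    cf.add(p)
print(cf.centroid(), cf.diameter())   # [2.0, 2.0] 2.0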
15. ROCK: RObust Clustering using linKs
Major idea:
• Use links to measure similarity/proximity.
16. Links: the number of common neighbors
• C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e},
{a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}.
• link(T1, T2) = 4, since they have the 4 common neighbors
{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
• link(T1, T3) = 3, since they have the 3 common neighbors
{a, b, d}, {a, b, e}, {a, b, g}
Thus link is a better measure than the Jaccard coefficient: Jaccard
rates T1 closer to T3 (2 of 4 items shared, 0.5) than to T2 (1 of 5,
0.2), even though T1 and T2 belong to the same cluster C1, while link
correctly scores the T1-T2 pair higher.
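A minimal sketch reproducing the counts above, assuming (as the numbers imply) that two transactions are neighbors when their Jaccard similarity reaches a threshold theta = 0.5 and that a transaction does not count as its own neighbor; the function names are illustrative:

def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbors(t, transactions, theta):
    # All other transactions whose Jaccard similarity with t reaches theta.
    return {frozenset(u) for u in transactions
            if frozenset(u) != frozenset(t) and jaccard(set(t), set(u)) >= theta}

def link(t1, t2, transactions, theta=0.5):
    # link = number of common neighbors.
    return len(neighbors(t1, transactions, theta) &
               neighbors(t2, transactions, theta))

C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
C2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]
ALL = C1 + C2
print(link({'a','b','c'}, {'c','d','e'}, ALL))   # 4
print(link({'a','b','c'}, {'a','b','f'}, ALL))   # 3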
17. CHAMELEON: measures similarity based on a dynamic model
• Two clusters are merged only if the interconnectivity and
closeness (proximity) between the two clusters are high relative to
the internal interconnectivity of the clusters and the closeness of
items within the clusters
A two-phase algorithm:
1. Use a graph partitioning algorithm: cluster objects into a large
number of relatively small subclusters
2. Use an agglomerative hierarchical clustering algorithm: find the
genuine clusters by repeatedly combining these subclusters
18. [Figure: overall framework: Data Set → Construct Sparse Graph →
Partition the Graph → Merge Partitions → Final Clusters]
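A minimal sketch of the first stage of this framework, assuming the sparse graph is a k-nearest-neighbor graph built from Euclidean distances (the partitioning and merging phases are not shown); the names, the data, and the choice knn=2 are illustrative assumptions:

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_sparse_graph(points, knn):
    # Keep an edge only from each point to its knn nearest neighbors.
    edges = set()
    for i, p in enumerate(points):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: euclidean(p, points[j]))[:knn]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 5.2)]
print(knn_sparse_graph(pts, knn=2))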