3. A partitioning method: construct a partition of
a database D of n objects into a set of k clusters
such that
o each cluster contains at least one object
o each object belongs to exactly one cluster
4. Methods:
o k-means: Each cluster is
represented by the center of the
cluster (centroid).
o k-medoids: Each cluster is
represented by one of the objects in
the cluster (medoid).
5. Input to the algorithm: the number of clusters k,
and a database of n objects
Algorithm consists of four steps:
1. partition the objects into k nonempty
subsets/clusters
2. compute seed points as the centroids (the
mean of the objects in each cluster) for each
cluster in the current partition
3. assign each object to the cluster with the
nearest centroid
4. go back to Step 2; stop when there are no
more new assignments
6. Alternative algorithm also consists of four steps:
1. arbitrarily choose k objects as the initial
cluster centers (centroids)
2. (re)assign each object to the cluster with the
nearest centroid
3. update the centroids
4. go back to Step 2; stop when there are no
more new assignments
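A minimal k-means sketch covering the steps in slides 5 and 6, assuming Euclidean distance over tuples of floats; the function names, sample points, and fixed seed are illustrative, not from the slides:

import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(objects, k, seed=0):
    random.seed(seed)
    # Step 1: arbitrarily choose k objects as the initial centroids.
    centroids = random.sample(objects, k)
    assignment = None
    while True:
        # Step 2: (re)assign each object to the cluster with the nearest centroid.
        new_assignment = [min(range(k), key=lambda c: euclidean(obj, centroids[c]))
                          for obj in objects]
        # Step 4: stop when there are no more new assignments.
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        # Step 3: update each centroid to the mean of its cluster's objects.
        for c in range(k):
            members = [o for o, a in zip(objects, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
print(kmeans(points, k=2))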
8. Input to the algorithm: the number of clusters k,
and a database of n objects
Algorithm consists of four steps:
1. arbitrarily choose k objects as the initial
medoids (representative objects)
2. assign each remaining object to the cluster
with the nearest medoid
3. select a nonmedoid object and replace one of the
medoids with it if this improves the clustering
4. go back to Step 2; stop when there are no
more new assignments
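A minimal PAM-style k-medoids sketch for the steps above, again assuming Euclidean distance; all names and data are illustrative:

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def total_cost(objects, medoids):
    # Sum of distances from every object to its nearest medoid.
    return sum(min(euclidean(o, m) for m in medoids) for o in objects)

def kmedoids(objects, k):
    # Step 1: arbitrarily choose k objects as the initial medoids.
    medoids = list(objects[:k])
    while True:
        best, swap = total_cost(objects, medoids), None
        # Step 3: replace a medoid with a nonmedoid if it improves the clustering.
        for i in range(k):
            for o in objects:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(objects, candidate)
                if cost < best:
                    best, swap = cost, candidate
        if swap is None:
            # Step 4: stop; report the stable assignment to the nearest medoid.
            return medoids, [min(range(k), key=lambda c: euclidean(o, medoids[c]))
                             for o in objects]
        medoids = swap

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
print(kmedoids(points, k=2))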
9. A hierarchical method: construct a hierarchy of
clusterings, not just a single partition of objects.
The number of clusters k is not required as an
input.
Uses a distance matrix as the clustering criterion.
A termination condition can be used (e.g., a
desired number of clusters).
10. 1. agglomerative (bottom-up):
o place each object in its own cluster.
o merge in each step the two most similar clusters until
there is only one cluster left or the termination
condition is satisfied.
2. divisive (top-down):
o start with one big cluster containing all the objects.
o divide the most distinctive cluster into smaller
clusters and proceed until there are n clusters or the
termination condition is satisfied.
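A minimal sketch of the bottom-up variant, assuming single-link (closest pair) distance between clusters and a target number of clusters as the termination condition; names and data are illustrative:

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    # Cluster-to-cluster distance: the closest pair of members.
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(objects, target_k=1):
    # Start with each object in its own cluster.
    clusters = [[o] for o in objects]
    # Merge the two most similar clusters until the termination condition holds.
    while len(clusters) > target_k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([(0.0,), (0.2,), (5.0,), (5.1,), (9.0,)], target_k=2))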
11. [Figure: dendrogram over objects a, b, c, d, e. Read left to right
(Step 0 through Step 4), the agglomerative direction merges {a, b},
{d, e}, {c, d, e}, and finally {a, b, c, d, e}; read right to left
(Step 4 back to Step 0), the divisive direction splits the same tree
top-down.]
12. BIRCH: Balanced Iterative Reducing and Clustering using
Hierarchies.
Incrementally constructs a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering.
Phase 1: scan the DB to build an initial in-memory CF tree
(a multilevel compression of the data that tries to
preserve the inherent clustering structure of the data).
Phase 2: use an arbitrary clustering algorithm to
cluster the leaf nodes of the CF tree.
13. Clustering feature:
• summary of the statistics for a given subcluster: the 0th, 1st, and
2nd moments of the subcluster from the statistical point of view.
• registers crucial measurements for computing clusters and utilizes
storage efficiently.
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering.
• A nonleaf node in the tree has descendants or "children".
• The nonleaf nodes store the sums of the CFs of their children.
A CF tree has two parameters:
• Branching factor: specifies the maximum number of children.
• Threshold: the maximum diameter of subclusters stored at the leaf nodes.
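A minimal sketch of a single clustering feature, assuming CF = (N, LS, SS) with N the point count (0th moment), LS the linear sum (1st moment), and SS the sum of squared norms (2nd moment); class and variable names are illustrative:

import math

class CF:
    def __init__(self, dim):
        self.n = 0                 # N: number of points (0th moment)
        self.ls = [0.0] * dim      # LS: linear sum of the points (1st moment)
        self.ss = 0.0              # SS: sum of squared norms (2nd moment)

    def add(self, point):
        # Additivity: absorbing a point only updates the three statistics,
        # which is what lets nonleaf nodes store sums of their children's CFs.
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)

    def centroid(self):
        return [x / self.n for x in self.ls]

    def diameter(self):
        # Root-mean-square pairwise distance, computable from (N, LS, SS)
        # alone; this is the quantity the leaf-node threshold bounds.
        if self.n < 2:
            return 0.0
        num = 2 * self.n * self.ss - 2 * sum(x * x for x in self.ls)
        return math.sqrt(num / (self.n * (self.n - 1)))

cf = CF(dim=2)
for p in [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]:
    cf.add(p)
print(cf.centroid(), cf.diameter())   # [2.0, 2.0] 2.0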
15. ROCK: RObust Clustering using linKs
Major idea:
• Use links to measure similarity/proximity.
16. Links: the number of common neighbors
• C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e},
{a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
• C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}.
• link(T1, T2) = 4, since they have the 4 common neighbors
{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
• link(T1, T3) = 3, since they have the 3 common neighbors
{a, b, d}, {a, b, e}, {a, b, g}
Thus link is a better measure than the Jaccard coefficient: Jaccard
rates T1 closer to T3 (2 of 4 items shared, 0.5) than to T2 (1 of 5,
0.2), even though T1 and T2 belong to the same cluster C1, while link
correctly scores the T1-T2 pair higher.
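A minimal sketch reproducing the counts above, assuming (as the numbers imply) that two transactions are neighbors when their Jaccard similarity reaches a threshold theta = 0.5 and that a transaction does not count as its own neighbor; the function names are illustrative:

def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbors(t, transactions, theta):
    # All other transactions whose Jaccard similarity with t reaches theta.
    return {frozenset(u) for u in transactions
            if frozenset(u) != frozenset(t) and jaccard(set(t), set(u)) >= theta}

def link(t1, t2, transactions, theta=0.5):
    # link = number of common neighbors.
    return len(neighbors(t1, transactions, theta) &
               neighbors(t2, transactions, theta))

C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
C2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]
ALL = C1 + C2
print(link({'a','b','c'}, {'c','d','e'}, ALL))   # 4
print(link({'a','b','c'}, {'a','b','f'}, ALL))   # 3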
17. CHAMELEON: measures similarity based on a dynamic model
• Two clusters are merged only if the interconnectivity and
closeness (proximity) between the two clusters are high relative to
the internal interconnectivity of the clusters and the closeness of
items within the clusters
A two-phase algorithm:
1. Use a graph partitioning algorithm: cluster objects into a large
number of relatively small subclusters
2. Use an agglomerative hierarchical clustering algorithm: find the
genuine clusters by repeatedly combining these subclusters
18. [Figure: overall framework: Data Set → Construct Sparse Graph →
Partition the Graph → Merge Partitions → Final Clusters]
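A minimal sketch of the first stage of this framework, assuming the sparse graph is a k-nearest-neighbor graph built from Euclidean distances (the partitioning and merging phases are not shown); the names, the data, and the choice knn=2 are illustrative assumptions:

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_sparse_graph(points, knn):
    # Keep an edge only from each point to its knn nearest neighbors.
    edges = set()
    for i, p in enumerate(points):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: euclidean(p, points[j]))[:knn]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 5.2)]
print(knn_sparse_graph(pts, knn=2))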