CLUSTERING
TUTORIAL

GABOR VERESS
2013.10.16
CONTENTS
What is clustering?

Distance: Similarity and dissimilarity
Data types in cluster analysis
Clustering methods
Evaluation of clustering
Summary

2
WHAT IS CLUSTERING?
Grouping of objects

3
CLUSTERING I. (BY TYPE)
Fruit

Veggie

4
CLUSTERING II. (BY COLOR)
Yellow

Green

5
CLUSTERING III.
(BY SHAPE)

Bushy

Longish

Ball
Chili shape

6
ANOTHER CLUSTERING EXAMPLE

7
IMAGE PROCESSING EXAMPLE

Figure from “Image and video segmentation: the normalised cut framework” by Shi and Malik, copyright IEEE, 1998

8
YET ANOTHER EXAMPLE
Original

Clustering 1

Clustering 2

9
CLUSTERING BY COLOR EXAMPLE
Item      Cyan  Magenta  Yellow  Black
Chili       72        0      51     57
Cucumber    11        0      45     19
Broccoli    15        0      23     31
Apple       25        0      74     20
Paprika      0       52     100     11
Lemon        0       20      93      5
Orange       0       18      65      3
Banana       0        1     100      1
10
CLUSTERING BY COLOR EXAMPLE
Item      Cyan  Magenta  Yellow  Black  Cluster
Chili       72        0      51     57  Cluster 1
Cucumber    11        0      45     19  Cluster 1
Broccoli    15        0      23     31  Cluster 1
Apple       25        0      74     20  Cluster 1
Paprika      0       52     100     11  Cluster 2
Lemon        0       20      93      5  Cluster 2
Orange       0       18      65      3  Cluster 2
Banana       0        1     100      1  Cluster 2
11
WHAT IS CLUSTERING?
Grouping of objects into classes in such a way that
• Objects in the same cluster are similar
• Objects in different clusters are dissimilar

Segmentation vs. Clustering
• Clustering is finding borders between groups
• Segmenting is using borders to form groups
Clustering is the method of creating segments

12
SUPERVISED VS. UNSUPERVISED
CLASSIFICATION VS. CLUSTERING
Classification – Supervised
Classes are predetermined
we know the labels in advance

For example if we have already diagnosed some disease
Or we know who has churned

Clustering – Unsupervised
Classes are not known in advance
we don’t know the labels in advance
Market behaviour segmentation
Or gene analysis
13
APPLICATIONS OF CLUSTERING
Marketing: segmentation of the customer based on behavior

Banking: ATM Fraud detection (outlier detection)
ATM classification: segmentation based on time series

Gene analysis: Identifying genes responsible for a disease
Chemistry: Periodic table of the elements
Image processing: identifying objects on an image (face detection)

Insurance: identifying groups of car insurance policy holders with a
high average claim cost
Houses: identifying groups of houses according to their house type,
value, and geographical location

14
TYPICAL DATABASE
id       age  sex     region      income  married  children  car  save_act  current_act  mortgage  pep
ID12101   48  FEMALE  INNER_CITY  17,546  NO       1         NO   NO        NO           NO        YES
ID12102   40  MALE    TOWN        30,085  YES      3         YES  NO        YES          YES       NO
ID12103   51  FEMALE  INNER_CITY  16,575  YES      0         YES  YES       YES          NO        NO
ID12104   23  FEMALE  TOWN        20,375  YES      3         NO   NO        YES          NO        NO
ID12105   57  FEMALE  RURAL       50,576  YES      0         NO   YES       NO           NO        NO
ID12106   57  FEMALE  TOWN        37,870  YES      2         NO   YES       YES          NO        YES
ID12107   22  MALE    RURAL        8,877  NO       0         NO   NO        YES          NO        YES
ID12108   58  MALE    TOWN        24,947  YES      0         YES  YES       YES          NO        NO
ID12109   37  FEMALE  SUBURBAN    25,304  YES      2         YES  NO        NO           NO        NO
ID12110   54  MALE    TOWN        24,212  YES      2         YES  YES       YES          NO        NO
ID12111   66  FEMALE  TOWN        59,804  YES      0         NO   YES       YES          NO        NO

How do we define similarity or dissimilarity?
Especially for categorical variables?
15
WHAT TO DERIVE
FROM THE DATABASE?
Upper: Original database of the objects (customers), i.e. the same
table as on the previous slide
Right: Similarity or dissimilarity measure of the objects
(similarity of customers)

id       ID12101  ID12102  ID12103  ID12104  ID12105
ID12101        0       12       23       19       13
ID12102       12        0       25       13       17
ID12103       23       25        0        9       21
ID12104       19       13        9        0       12
ID12105       13       17       21       12        0

16
REQUIREMENTS OF CLUSTERING
• Ability to deal with different types of attributes

• Discovery of clusters with arbitrary shape
• Able to deal with noise and outliers
• Insensitive to order of input records

• High dimensionality
• Scalability
• Minimal requirements for domain knowledge to
determine input parameters
• Incorporation of user-specified constraints
• Interpretability and usability
17
DISTANCE:
SIMILARITY AND
DISSIMILARITY
SIMILARITY AND DISSIMILARITY
There is no single definition of similarity or
dissimilarity between data objects
The definition of similarity or dissimilarity between
objects depends on

• the type of the data considered
• what kind of similarity we are looking for

19
DISTANCE MEASURE
Similarity/dissimilarity between objects is often
expressed in terms of a distance measure d(x,y)
Ideally, every distance measure should be a metric, i.e.,
it should satisfy the following conditions:
1. d(x,y) ≥ 0
2. d(x,y) = 0 if x = y
3. d(x,y) = d(y,x)

4. d(x,z) ≤ d(x,y) + d(y,z)

20
TYPE OF VARIABLES
id       age  sex     region      income  married  children  car  save_act  current_act  mortgage  pep
ID12101   48  FEMALE  INNER_CITY  17,546  NO       1         NO   NO        NO           NO        YES
ID12102   40  MALE    TOWN        30,085  YES      3         YES  NO        YES          YES       NO
ID12103   51  FEMALE  INNER_CITY  16,575  YES      0         YES  YES       YES          NO        NO
ID12104   23  FEMALE  TOWN        20,375  YES      3         NO   NO        YES          NO        NO
ID12105   57  FEMALE  RURAL       50,576  YES      0         NO   YES       NO           NO        NO
ID12106   57  FEMALE  TOWN        37,870  YES      2         NO   YES       YES          NO        YES
ID12110   54  MALE    TOWN        24,212  YES      2         YES  YES       YES          NO        NO

Interval-scaled variables: Age
Binary variables: Car, Mortgage
Nominal, Ordinal, and Ratio variables
Variables of mixed types
Complex data types: Documents, GPS coordinates
21
INTERVAL-SCALED VARIABLES
Continuous measurements of a roughly linear scale

for example, age, weight and height
The measurement unit can affect the cluster analysis
To avoid dependence on the measurement unit, we should
standardize the data

22
STANDARDIZATION
To standardize the measurements:

• calculate the mean absolute deviation
  s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|)
where
  m_f = (1/n) (x_1f + x_2f + … + x_nf)
and
• calculate the standardized measurement (z-score)
  z_if = (x_if − m_f) / s_f

23
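As a quick illustration, here is a minimal Python sketch of this standardization (assuming NumPy; the age values are taken from the typical database slide):

```python
import numpy as np

def standardize(x):
    """Z-score using the mean absolute deviation, as defined above."""
    m = x.mean()                 # m_f: mean of the variable
    s = np.abs(x - m).mean()     # s_f: mean absolute deviation
    return (x - m) / s           # z_if = (x_if - m_f) / s_f

ages = np.array([48, 40, 51, 23, 57, 57, 22, 58, 37, 54, 66], dtype=float)
print(standardize(ages))
```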
DISTANCE MEASURE I.
One group of popular distance measures for interval-scaled variables are the Minkowski distances

  d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)

where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are
two p-dimensional data objects, and q is a positive integer

24
DISTANCE MEASURES II.
If q = 1, the distance measure is Manhattan (or city
block) distance

  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|

If q = 2, the distance measure is Euclidean distance

  d(i, j) = (|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)^(1/2)

25
EXAMPLE: DISTANCE MEASURES
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

[Scatter plot of the four points in the x-y plane]

Manhattan Distance
      p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

Euclidean Distance
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
26
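A minimal sketch (assuming NumPy) that reproduces both matrices above from the Minkowski formula; q = 1 gives Manhattan, q = 2 gives Euclidean:

```python
import numpy as np

def minkowski(a, b, q):
    """Minkowski distance of order q between two points."""
    return (np.abs(a - b) ** q).sum() ** (1.0 / q)

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # p1..p4
for q, name in [(1, "Manhattan"), (2, "Euclidean")]:
    D = [[minkowski(a, b, q) for b in points] for a in points]
    print(name)
    print(np.round(D, 3))
```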
WHY STANDARDIZATION?
Age and Income
No standardization

Income >> Age
No separation on age

With standardization
Separation based on both
age and income

27
RATIO-SCALED VARIABLES
A positive measurement on a nonlinear scale, approximately on an
exponential scale, Ae^(Bt) or Ae^(−Bt)
Methods:

1. treat them like interval-scaled variables (not a good choice!)
2. apply a logarithmic transformation y_if = log(x_if)
3. treat them as continuous ordinal data and treat their rank as
interval-scaled
4. create a better variable

Object    On-net  Off-net  Ratio  Log-Ratio  On-net/Total
1             95        6   0.06      -1.20           94%
2             56       15   0.27      -0.57           79%
Dist 1-2                    0.04       0.39           0.02
3             12       23   1.92       0.28           34%
4             12       29   2.42       0.38           29%
Dist 3-4                    0.25       0.01           0.00

28
ORDINAL VARIABLES
An ordinal variable can be discrete or continuous

Order of values is important, e.g., rank
Can be treated like interval-scaled:
• replace x_if by its rank r_if ∈ {1, …, M_f}
• map the range of each variable onto [0, 1] by replacing the
i-th object in the f-th variable by
  z_if = (r_if − 1) / (M_f − 1)
• compute the dissimilarity using methods for interval-scaled
variables
29
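A one-line sketch of the rank mapping above (the 5-level scale is an illustrative assumption):

```python
def ordinal_to_unit(rank, m_f):
    """Map a rank r_if in {1, ..., M_f} onto [0, 1]."""
    return (rank - 1) / (m_f - 1)

# e.g. a 5-level satisfaction scale: 1 -> 0.0, 3 -> 0.5, 5 -> 1.0
print([ordinal_to_unit(r, 5) for r in (1, 3, 5)])
```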
BINARY VARIABLES I.
Binary variables have 2 outcomes: 0/1, Y/N, …

Symmetric binary variable:
No preference on which outcome
should be coded 0 or 1,
like gender

Distance  FEMALE  MALE
FEMALE         0     1
MALE           1     0

Asymmetric binary variable:
Outcomes are not equally important,
or based on one outcome the objects
are similar but based on the other
outcome we can’t tell.
Like Has Mortgage or HIV positive

Distance     Mortgage  No Mortgage
Mortgage            0            1
No Mortgage         1        undef

30
BINARY VARIABLES II.
If we have several binary variables in the database we can
calculate the distance based on the contingency table
A contingency table for binary data:

                      Object j
                     1      0      SUM
Object i   1         a      b      a+b
           0         c      d      c+d
           SUM       a+c    b+d    t

31
BINARY VARIABLES III.

                      Object j
                     1      0      SUM
Object i   1         a      b      a+b
           0         c      d      c+d
           SUM       a+c    b+d    t

Simple matching coefficient (invariant similarity, if the
binary variable is symmetric):
  sim(i, j) = (a + d) / (a + b + c + d)
  d(i, j)   = (b + c) / (a + b + c + d)

Jaccard coefficient (non-invariant similarity, if the binary
variable is asymmetric):
  sim(i, j) = a / (a + b + c)
  d(i, j)   = (b + c) / (a + b + c)
32
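As a sketch, both distances can be computed directly from two 0/1 vectors; the example compares ID12101 and ID12102 on the car, save_act, current_act, mortgage and pep columns of the typical database (YES coded as 1):

```python
def binary_distances(x, y):
    """Simple matching and Jaccard distances from the contingency counts."""
    a = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))
    b = sum(xi == 1 and yi == 0 for xi, yi in zip(x, y))
    c = sum(xi == 0 and yi == 1 for xi, yi in zip(x, y))
    d = sum(xi == 0 and yi == 0 for xi, yi in zip(x, y))
    simple_matching = (b + c) / (a + b + c + d)  # for symmetric variables
    jaccard = (b + c) / (a + b + c)              # for asymmetric variables
    return simple_matching, jaccard

print(binary_distances([0, 0, 0, 0, 1], [1, 0, 1, 1, 0]))  # (0.8, 1.0)
```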
NOMINAL VARIABLES
Generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue

Distance matrix:
Distance  Red  Yellow  Blue
Red         0       1     1
Yellow      1       0     1
Blue        1       1     0

More variables
Method 1: simple matching
• m: # of matches, p: total # of variables
  sim(i, j) = m / p
  d(i, j)   = (p − m) / p

Method 2: use a large number of binary variables
• create a new binary variable for each of the k nominal states
33
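Both methods in a short sketch (the example values are illustrative):

```python
def nominal_distance(x, y):
    """Method 1, simple matching: d(i, j) = (p - m) / p."""
    p = len(x)
    m = sum(xi == yi for xi, yi in zip(x, y))
    return (p - m) / p

print(nominal_distance(["INNER_CITY", "FEMALE"], ["TOWN", "FEMALE"]))  # 0.5

# Method 2: one new binary variable per nominal state ("one-hot" coding)
states = ["red", "yellow", "blue"]
print([int("yellow" == s) for s in states])  # [0, 1, 0]
```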
VARIABLES OF MIXED TYPES
A database usually contains different types of variables
• symmetric binary, asymmetric binary, nominal, ordinal,
interval
Approaches

1. Group each type of variable together, performing
a separate cluster analysis for each type.
2. Bring different variables onto a common scale of
the interval [0.0, 1.0], performing a single cluster
analysis
34
WEIGHTED FORMULA
  d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)

Weight δ_ij^(f) = 0
• if x_if or x_jf is missing
• or x_if = x_jf = 0 and variable f is asymmetric binary

Otherwise weight δ_ij^(f) = 1
Another option is to choose the weights based on
business aspects
35
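A sketch of the weighted formula (the per-variable dissimilarities d_f are assumed to be already scaled to [0, 1]):

```python
import math

def mixed_dissimilarity(contribs):
    """d(i, j) = sum(delta_f * d_f) / sum(delta_f) over the p variables.

    contribs is a list of (delta_f, d_f) pairs; delta_f is 0 for missing
    values (and for 0-0 matches on asymmetric binary variables), else 1.
    """
    num = sum(delta * d for delta, d in contribs)
    den = sum(delta for delta, _ in contribs)
    return num / den if den else math.nan

# three variables: one excluded (delta = 0), two contributing
print(mixed_dissimilarity([(0, 0.0), (1, 0.3), (1, 0.8)]))  # 0.55
```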
VECTOR OBJECTS:
COSINE SIMILARITY
Vector objects: keywords in documents, gene features in micro-arrays, …
Applications: information retrieval, biologic taxonomy, ...
Cosine measure: If d1 and d2 are two vectors, then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
where · indicates the vector dot product and ||d|| is the length of vector d

Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
||d2|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150

36
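The same example in a few lines of Python (standard library only):

```python
import math

def cosine(d1, d2):
    """Cosine similarity: dot product divided by the vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315
```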
COMPLEX DATA TYPES
All non-relational objects => complex types of data

• examples: spatial data, location data, multimedia data,
genetic data, time-series data, text data and data
collected from the Web
We can define our own similarity or dissimilarity
measures beyond the previous ones
• this can, for example, mean the use of string and/or
sequence matching, or methods of information retrieval

37
CLUSTERING
METHODS
MAJOR CLUSTERING APPROACHES
Partitioning algorithms: Construct various partitions and then
evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of
the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters
and the idea is to find the best fit of that model to the data
39
PARTITIONING
BASIC CONCEPT
Partitioning method: Construct a partition of a database D of n
objects into k clusters
• each cluster contains at least one object
• each object belongs to exactly one cluster
Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion (min distance from cluster centers)
• Global optimum: exhaustively enumerate all Stirling(n,k) partitions
(S(10,3) = 9,330; S(20,3) = 580,606,446; …)
• Heuristic methods: k-means and k-medoids algorithms
• k-means: Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids): Each cluster is
represented by one of the objects in the cluster
40
PARTITIONING
K-MEANS ALGORITHM
Input: k clusters, n objects of database D.

Output: A set of k clusters which minimizes the squared-error function
Algorithm:

1. Choose k objects as the initial cluster centers
2. Assign each object to the cluster whose mean point (centroid) is
closest under the squared Euclidean distance metric
3. When all objects have been assigned, recalculate the positions of
the k mean points (centroids)

4. Repeat steps 2 and 3 until the centroids no longer change

41
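A minimal NumPy sketch of these four steps (illustrative only; it does not handle the corner case of a cluster losing all its objects):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 1
    for _ in range(n_iter):
        # step 2: assign each object to the closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                           # step 4
            break
        centroids = new
    return labels, centroids

X = np.vstack([np.random.default_rng(1).normal(m, 0.5, (20, 2)) for m in (0, 5)])
labels, centers = k_means(X, k=2)
print(centers)  # close to (0, 0) and (5, 5)
```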
PARTITIONING
K-MEANS ALGORITHM

Source: Clustering: A survey 2008, R. Capaldo F. Collovà

42
PARTITIONING
K-MEANS
+ Easy to implement
+ The k-means method is relatively efficient: O(tkn), where n is the
number of objects, k the number of clusters, and t the number of
iterations. Normally, k, t << n.

- Often terminates at a local optimum. The global optimum may be
found using techniques such as deterministic annealing and genetic
algorithms
- Not applicable to categorical data
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes

K-medoids (PAM) was introduced to overcome some of these problems

43
PARTITIONING
K-MEDOID ALGORITHM
The K-medoid or PAM (Partitioning Around Medoids) method is the
same as k-means, but instead of the mean it uses the medoid
m_q (q = 1, 2, …, k) as the most representative object of a cluster

The medoid is the most centrally located object in a cluster

Source: Clustering: A survey 2008, R. Capaldo F. Collovà

44
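A short sketch of what "most centrally located" means: the medoid minimizes the total distance to the other objects of the cluster (the Manhattan distance and the points are illustrative):

```python
import numpy as np

def medoid(cluster):
    """Object with minimal total Manhattan distance to the others."""
    D = np.abs(cluster[:, None, :] - cluster[None, :, :]).sum(axis=2)
    return cluster[D.sum(axis=1).argmin()]

pts = np.array([[1.0, 1.0], [1.2, 0.9], [1.1, 1.1], [9.0, 9.0]])  # one outlier
print(medoid(pts))        # stays among the three close points
print(pts.mean(axis=0))   # the mean is dragged toward the outlier
```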
PARTITIONING
K-MEDOID OR PAM
+ PAM is more robust than k-means in the presence of noise and
outliers, because a medoid is less influenced by outliers or other
extreme values than a mean

- PAM works efficiently for small data sets but does not scale well
to large data sets. In fact: O(k(n−k)²) per iteration, where n is
the number of objects and k the number of clusters

To overcome these problems the following were introduced:
CLARA (Clustering LARge Applications) -> sampling-based method
CLARANS -> a clustering algorithm based on randomized search

45
PARTITIONING
CLARA
CLARA (Clustering LARge Applications) (Kaufmann and Rousseeuw,
1990) draws multiple samples of the dataset and applies PAM on each
sample in order to find the medoids.

+ Deals with larger data sets than PAM
+ Experiments show that 5 samples of size 40+2k give satisfactory
results

- Efficiency depends on the sample size, which must also be chosen

- A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the sample is
biased; multiple sampling is used to avoid this

46
PARTITIONING
CLARANS
CLARANS (CLustering Algorithm based on RANdomized Search)
(Ng and Han, '94)

A clustering method that draws samples of neighbors dynamically
There are 2 parameters: maxneighbour, the maximum number of
neighbours examined, and numlocal, the number of local minima obtained
The algorithm searches for new neighbours and replaces the current
setup with a lower-cost setup until the number of examined neighbours
reaches maxneighbour or the number of new local minima obtained
reaches numlocal

+ It is more efficient and scalable than both PAM and CLARA
+ Returns higher-quality clusters
+ Has the benefit of not confining the search to a restricted area
- Depending on the parameters it can be very time consuming (close to PAM)

47
HIERARCHICAL
BASIC CONCEPT
Hierarchical clustering

Construct a hierarchy of clusters, not just a single partition
of objects
• Use a distance matrix as the clustering criterion

• Does not require the number of clusters as an input, but
needs a termination condition, e.g., the number of clusters
or a distance threshold for merging

48
HIERARCHICAL
CLUSTERING TREE, DENDROGRAM

The hierarchy of clustering is given
as a clustering tree or dendrogram
• leaves of the tree represent the
individual objects
• internal nodes of the tree represent
the clusters

Two main types of hierarchical
clustering
• agglomerative (bottom-up)
• place each object in its own cluster (a
singleton)
• merge in each step the two most similar
clusters until there is only one cluster left or
the termination condition is satisfied

• divisive (top-down)
• start with one big cluster containing all the
objects
• divide the most distinctive cluster into
smaller clusters and proceed until there are
n clusters or the termination condition is
satisfied

49
HIERARCHICAL
CLUSTER DISTANCE MEASURES
Single link (nearest neighbor). The distance
between two clusters is determined by the
distance of the two closest objects (nearest
neighbors) in the different clusters.
Complete link (furthest neighbor). The
distances between clusters are determined by
the greatest distance between any two objects
in the different clusters (i.e., by the "furthest
neighbors").
Pair-group average link. The distance
between two clusters is calculated as the
average distance between all pairs of objects
in the two different clusters.

Pair-group centroid (centroid link). The distance between
two clusters is determined as the distance
between centroids.

50
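These linkage criteria are available directly in SciPy; a short sketch on the four points of the earlier distance example (assuming SciPy is installed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # p1..p4
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # agglomerative merges
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
    print(method, labels)
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree (needs matplotlib)
```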
HIERARCHICAL
EXAMPLE WITH DENDROGRAM

[Figure: example clustering with its dendrogram]

51
HIERARCHICAL
+ Conceptually simple
+ Theoretical properties are well understood
+ When clusters are merged/split, the decision is permanent => the
number of different alternatives that need to be examined is
reduced

- Merging/splitting of clusters is permanent => erroneous decisions
are impossible to correct later
- Divisive methods can be computationally hard
- Methods are not (necessarily) scalable for large data sets

52
EVALUATION
EVALUATION BASICS
Business

• Segment sizes
• Meaningful segments
Technical

• Compactness
• Separation

54
COMPACTNESS AND SEPARATION
Compactness
intra-cluster variance
Separation
inter-cluster distance
Sometimes the two
measures lead to different
results

[Chart: Dens_bw (separation) and Scatt_orig (compactness) as a
function of the number of clusters, 2-6]

55
INDEX FUNCTIONS
Number of clusters
• Finding the minimum/maximum of an index
function we can determine the
optimal number of clusters

Comparing clustering methods
• Using the index functions we
can compare the results of
different clustering methods on
the same database

[Charts: DB and SD index values for k = 2…10]

56
SAMPLE DATABASE
We generated a sample with 4 clusters
• 2 dimensions
• Real values between (−10, 15)
• With outliers

57
TWO-STEP AND K-MEANS
CLUSTERING
[Scatter plots: cluster assignments of the sample data for the
Two-step and K-means methods, k = 3…8]

58
DB (DAVIES-BOULDIN) INDEX
The DB index summarizes the similarity of a given cluster and the most
dissimilar cluster, and then takes the average over all clusters

[Charts: DB index for the Two-step (DB TS) and K-means (DB KM)
clusterings, k = 2…10]

59
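A sketch of this model-selection loop with scikit-learn, which implements the Davies-Bouldin index (the generated blobs are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.7, (50, 2))
               for m in ((0, 0), (4, 0), (0, 4), (4, 4))])  # 4 blobs
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(davies_bouldin_score(X, labels), 3))
# a low DB value indicates compact, well-separated clusters;
# the minimum should land near k = 4, the true number of blobs
```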
S_DBW INDEX

2 components:
• Dens_bw: cluster separation
• Scatt: the average variance of the
clusters divided by the variance of
all objects

[Charts: S_Dbw, Dens_bw and Scatt for the Two-step (TS) and
K-means (KM) clusterings, k = 2…10]

60
SD INDEX

2 components:
• Scatt: compactness of the clusters
• Dis: function of the centroids of the clusters
We should know the maximum number of clusters

[Charts: SD index for the Two-step (SD TS) and K-means (SD KM)
clusterings, k = 2…10]

61
RS, RMSSTD INDEXES
RS (R-squared) = variance between clusters / total variance
RMSSTD (Root-mean-square standard deviation)
= within-cluster variance

[Charts: RS, RS_diff, RMSSTD and RMSSTD_diff for the Two-step (TS)
and K-means (KM) clusterings, k = 2…10]

62
SEGMENTATION IN BANK
Needs-based segmentation for new tariff
plans

When the number of clusters is 4 or 5
we get one segment that is too big
(ca. 60,000 customers)
Above 6 segments we cannot identify
more significant segments
Balance decrease is the cutting variable

[Charts: Separation and Diameter as a function of the number of
clusters, 2-10]

63
BANK SEGMENTATION – INDEXES
[Charts: DB, SD, RS, RS_diff, Dens_bw, Scatt_orig, RMSSTD and
RMSSTD_diff index values for the bank segmentation, k = 2…10]

Based on the indexes there are 4-6 really different segments

64
LITERATURE I.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann
Publishers, August 2000 (k-means, k-medoids or PAM; deterministic annealing,
genetic algorithms).
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990 (CLARA, AGNES, DIANA).
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94 (CLARANS).
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method
for very large databases. SIGMOD'96 (BIRCH).
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large
databases. SIGMOD'98 (CURE).

65
LITERATURE II.
Karypis G., Eui-Hong Han, Kumar V. Chameleon: hierarchical clustering using
dynamic modeling (CHAMELEON).

M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96 (DBSCAN).
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to
identify the clustering structure, SIGMOD’99 (OPTICS).
A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large Multimedia
Databases with Noise. Proceedings of the 4th ICKDDM, New York '98
(DENCLUE).

Abramowitz, M. and Stegun, I. A. (Eds.). "Stirling Numbers of the Second Kind."
§24.1.4 in Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables, 9th printing. New York: Dover, pp. 824-825, 1972.
P.-N. Tan, M. Steinbach, V. Kumar. Introduction to Data Mining. Addison-Wesley,
2006.

66
THANK YOU!

GABOR VERESS
LYNX ANALYTICS

More Related Content

What's hot

Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
 
Fuzzy c-means clustering for image segmentation
Fuzzy c-means  clustering for image segmentationFuzzy c-means  clustering for image segmentation
Fuzzy c-means clustering for image segmentationDharmesh Patel
 
k Nearest Neighbor
k Nearest Neighbork Nearest Neighbor
k Nearest Neighborbutest
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
 
Lecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.pptLecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.pptSyedNahin1
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision treeKrish_ver2
 
Unsupervised learning: Clustering
Unsupervised learning: ClusteringUnsupervised learning: Clustering
Unsupervised learning: ClusteringDeepak George
 
Machine Learning With Logistic Regression
Machine Learning  With Logistic RegressionMachine Learning  With Logistic Regression
Machine Learning With Logistic RegressionKnoldus Inc.
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)Cory Cook
 
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Edureka!
 
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...Edureka!
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail MarketingJonathan Sedar
 

What's hot (20)

Clique and sting
Clique and stingClique and sting
Clique and sting
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Fuzzy c-means clustering for image segmentation
Fuzzy c-means  clustering for image segmentationFuzzy c-means  clustering for image segmentation
Fuzzy c-means clustering for image segmentation
 
k Nearest Neighbor
k Nearest Neighbork Nearest Neighbor
k Nearest Neighbor
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Clustering
ClusteringClustering
Clustering
 
Lecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.pptLecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.ppt
 
2.2 decision tree
2.2 decision tree2.2 decision tree
2.2 decision tree
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Unsupervised learning: Clustering
Unsupervised learning: ClusteringUnsupervised learning: Clustering
Unsupervised learning: Clustering
 
Clustering
ClusteringClustering
Clustering
 
Machine Learning With Logistic Regression
Machine Learning  With Logistic RegressionMachine Learning  With Logistic Regression
Machine Learning With Logistic Regression
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)
 
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
Decision Tree Algorithm | Decision Tree in Python | Machine Learning Algorith...
 
Clustering
ClusteringClustering
Clustering
 
Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
K Means Clustering Algorithm | K Means Example in Python | Machine Learning A...
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail Marketing
 

Viewers also liked

Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster AnalysisSSA KPI
 
Clustering Analysis of Issues to arrive at Voter Segments
Clustering Analysis of Issues to arrive at Voter SegmentsClustering Analysis of Issues to arrive at Voter Segments
Clustering Analysis of Issues to arrive at Voter SegmentsAashish Juneja
 
Cluster analysis in prespective to Marketing Research
Cluster analysis in prespective to Marketing ResearchCluster analysis in prespective to Marketing Research
Cluster analysis in prespective to Marketing ResearchSahil Kapoor
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysisguru_prasadg
 
Association Analysis
Association AnalysisAssociation Analysis
Association Analysisguest0edcaf
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationAdnan Masood
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionAdnan Masood
 
Bayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesBayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesGilad Barkan
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
 
K means Clustering
K means ClusteringK means Clustering
K means ClusteringEdureka!
 

Viewers also liked (14)

Malhotra20
Malhotra20Malhotra20
Malhotra20
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
Clustering Analysis of Issues to arrive at Voter Segments
Clustering Analysis of Issues to arrive at Voter SegmentsClustering Analysis of Issues to arrive at Voter Segments
Clustering Analysis of Issues to arrive at Voter Segments
 
Cluster analysis in prespective to Marketing Research
Cluster analysis in prespective to Marketing ResearchCluster analysis in prespective to Marketing Research
Cluster analysis in prespective to Marketing Research
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A Survey
 
Association Analysis
Association AnalysisAssociation Analysis
Association Analysis
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian Classification
 
Bayesian Networks - A Brief Introduction
Bayesian Networks - A Brief IntroductionBayesian Networks - A Brief Introduction
Bayesian Networks - A Brief Introduction
 
Bayesian Belief Networks for dummies
Bayesian Belief Networks for dummiesBayesian Belief Networks for dummies
Bayesian Belief Networks for dummies
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 

Similar to Clustering training

Module 1 Powerpoint 2.pptx
Module 1 Powerpoint 2.pptxModule 1 Powerpoint 2.pptx
Module 1 Powerpoint 2.pptxZyrenMisaki
 
Google BigQuery is a very popular enterprise warehouse that’s built with a co...
Google BigQuery is a very popular enterprise warehouse that’s built with a co...Google BigQuery is a very popular enterprise warehouse that’s built with a co...
Google BigQuery is a very popular enterprise warehouse that’s built with a co...Abebe Admasu
 
On Estimation of Population Variance Using Auxiliary Information
On Estimation of Population Variance Using Auxiliary InformationOn Estimation of Population Variance Using Auxiliary Information
On Estimation of Population Variance Using Auxiliary Informationinventionjournals
 
First term notes 2020 econs ss2 1
First term notes 2020 econs ss2 1First term notes 2020 econs ss2 1
First term notes 2020 econs ss2 1OmotaraAkinsowon
 
Quantitative Methods for Lawyers - Class #15 - Chi Square Distribution and Ch...
Quantitative Methods for Lawyers - Class #15 - Chi Square Distribution and Ch...Quantitative Methods for Lawyers - Class #15 - Chi Square Distribution and Ch...
Quantitative Methods for Lawyers - Class #15 - Chi Square Distribution and Ch...Daniel Katz
 
Estadística investigación _grupo1_ Zitácuaro
Estadística investigación _grupo1_ ZitácuaroEstadística investigación _grupo1_ Zitácuaro
Estadística investigación _grupo1_ ZitácuaroYasminSotoEsquivel
 
Quantitative Methods for Lawyers - Class #12 - Chi Square Distribution and Ch...
Quantitative Methods for Lawyers - Class #12 - Chi Square Distribution and Ch...Quantitative Methods for Lawyers - Class #12 - Chi Square Distribution and Ch...
Quantitative Methods for Lawyers - Class #12 - Chi Square Distribution and Ch...Daniel Katz
 
Yolos you only look one sequence
Yolos you only look one sequenceYolos you only look one sequence
Yolos you only look one sequencetaeseon ryu
 
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical InferenceSpanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inferencejemille6
 
Applied Math 40S March 12, 2008
Applied Math 40S March 12, 2008Applied Math 40S March 12, 2008
Applied Math 40S March 12, 2008Darren Kuropatwa
 
Correlation_and_Regression-3.ppt
Correlation_and_Regression-3.pptCorrelation_and_Regression-3.ppt
Correlation_and_Regression-3.pptRidaIrfan10
 
Clasification approaches
Clasification approachesClasification approaches
Clasification approachesgscprasad1111
 
Paper Summary of Disentangling by Factorising (Factor-VAE)
Paper Summary of Disentangling by Factorising (Factor-VAE)Paper Summary of Disentangling by Factorising (Factor-VAE)
Paper Summary of Disentangling by Factorising (Factor-VAE)준식 최
 

Similar to Clustering training (20)

Module 1 Powerpoint 2.pptx
Module 1 Powerpoint 2.pptxModule 1 Powerpoint 2.pptx
Module 1 Powerpoint 2.pptx
 
Ch01_03.ppt
Ch01_03.pptCh01_03.ppt
Ch01_03.ppt
 
Google BigQuery is a very popular enterprise warehouse that’s built with a co...
Google BigQuery is a very popular enterprise warehouse that’s built with a co...Google BigQuery is a very popular enterprise warehouse that’s built with a co...
Google BigQuery is a very popular enterprise warehouse that’s built with a co...
 
On Estimation of Population Variance Using Auxiliary Information
On Estimation of Population Variance Using Auxiliary InformationOn Estimation of Population Variance Using Auxiliary Information
On Estimation of Population Variance Using Auxiliary Information
 
First term notes 2020 econs ss2 1
First term notes 2020 econs ss2 1First term notes 2020 econs ss2 1
First term notes 2020 econs ss2 1
 
Lilliefors test
Lilliefors testLilliefors test
Lilliefors test
 
Quantitative Methods for Lawyers - Class #15 - Chi Square Distribution and Ch...
Quantitative Methods for Lawyers - Class #15 - Chi Square Distribution and Ch...Quantitative Methods for Lawyers - Class #15 - Chi Square Distribution and Ch...
Quantitative Methods for Lawyers - Class #15 - Chi Square Distribution and Ch...
 
Estadística investigación _grupo1_ Zitácuaro
Estadística investigación _grupo1_ ZitácuaroEstadística investigación _grupo1_ Zitácuaro
Estadística investigación _grupo1_ Zitácuaro
 
02Data-osu-0829.pdf
02Data-osu-0829.pdf02Data-osu-0829.pdf
02Data-osu-0829.pdf
 
Quantitative Methods for Lawyers - Class #12 - Chi Square Distribution and Ch...
Quantitative Methods for Lawyers - Class #12 - Chi Square Distribution and Ch...Quantitative Methods for Lawyers - Class #12 - Chi Square Distribution and Ch...
Quantitative Methods for Lawyers - Class #12 - Chi Square Distribution and Ch...
 
Ch01 03
Ch01 03Ch01 03
Ch01 03
 
Data Mining Lecture_5.pptx
Data Mining Lecture_5.pptxData Mining Lecture_5.pptx
Data Mining Lecture_5.pptx
 
Yolos you only look one sequence
Yolos you only look one sequenceYolos you only look one sequence
Yolos you only look one sequence
 
Cluster analysis (2)
Cluster analysis (2)Cluster analysis (2)
Cluster analysis (2)
 
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical InferenceSpanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
Spanos: Lecture 1 Notes: Introduction to Probability and Statistical Inference
 
Applied Math 40S March 12, 2008
Applied Math 40S March 12, 2008Applied Math 40S March 12, 2008
Applied Math 40S March 12, 2008
 
Alkaire foster methodology
Alkaire foster methodologyAlkaire foster methodology
Alkaire foster methodology
 
Correlation_and_Regression-3.ppt
Correlation_and_Regression-3.pptCorrelation_and_Regression-3.ppt
Correlation_and_Regression-3.ppt
 
Clasification approaches
Clasification approachesClasification approaches
Clasification approaches
 
Paper Summary of Disentangling by Factorising (Factor-VAE)
Paper Summary of Disentangling by Factorising (Factor-VAE)Paper Summary of Disentangling by Factorising (Factor-VAE)
Paper Summary of Disentangling by Factorising (Factor-VAE)
 

Recently uploaded

Mastering Affiliate Marketing: A Comprehensive Guide to Success
Mastering Affiliate Marketing: A Comprehensive Guide to SuccessMastering Affiliate Marketing: A Comprehensive Guide to Success
Mastering Affiliate Marketing: A Comprehensive Guide to SuccessAbdulsamad Lukman
 
W.H.Bender Quote 61 -Influential restaurant and food service industry network...
W.H.Bender Quote 61 -Influential restaurant and food service industry network...W.H.Bender Quote 61 -Influential restaurant and food service industry network...
W.H.Bender Quote 61 -Influential restaurant and food service industry network...William (Bill) H. Bender, FCSI
 
The 9th May Incident in Pakistan A Turning Point in History.pptx
The 9th May Incident in Pakistan A Turning Point in History.pptxThe 9th May Incident in Pakistan A Turning Point in History.pptx
The 9th May Incident in Pakistan A Turning Point in History.pptxelizabethella096
 
Gain potential customers through Lead Generation
Gain potential customers through Lead GenerationGain potential customers through Lead Generation
Gain potential customers through Lead Generationvidhyalakshmiveerapp
 
Alpha Media March 2024 Buyers Guide.pptx
Alpha Media March 2024 Buyers Guide.pptxAlpha Media March 2024 Buyers Guide.pptx
Alpha Media March 2024 Buyers Guide.pptxDave McCallum
 
The seven principles of persuasion by Dr. Robert Cialdini
The seven principles of persuasion by Dr. Robert CialdiniThe seven principles of persuasion by Dr. Robert Cialdini
The seven principles of persuasion by Dr. Robert CialdiniSurya Prasath
 
Instant Digital Issuance: An Overview With Critical First Touch Best Practices
Instant Digital Issuance: An Overview With Critical First Touch Best PracticesInstant Digital Issuance: An Overview With Critical First Touch Best Practices
Instant Digital Issuance: An Overview With Critical First Touch Best PracticesMedia Logic
 
Best 5 Graphics Designing Course In Chandigarh
Best 5 Graphics Designing Course In ChandigarhBest 5 Graphics Designing Course In Chandigarh
Best 5 Graphics Designing Course In Chandigarhhamitthakurdma01
 
SP Search Term Data Optimization Template.pdf
SP Search Term Data Optimization Template.pdfSP Search Term Data Optimization Template.pdf
SP Search Term Data Optimization Template.pdfPauleneNicoleLapira
 
Crypto Quantum Leap - Digital - membership area
Crypto Quantum Leap -  Digital - membership areaCrypto Quantum Leap -  Digital - membership area
Crypto Quantum Leap - Digital - membership areajaynee G
 
HITECH CITY CALL GIRL IN 9234842891 💞 INDEPENDENT ESCORT SERVICE HITECH CITY
HITECH CITY CALL GIRL IN 9234842891 💞 INDEPENDENT ESCORT SERVICE HITECH CITYHITECH CITY CALL GIRL IN 9234842891 💞 INDEPENDENT ESCORT SERVICE HITECH CITY
HITECH CITY CALL GIRL IN 9234842891 💞 INDEPENDENT ESCORT SERVICE HITECH CITYNiteshKumar82226
 
Social Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh BendaySocial Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh BendayMaharshBenday
 
VIP Call Girls Dongri WhatsApp +91-9833363713, Full Night Service
VIP Call Girls Dongri WhatsApp +91-9833363713, Full Night ServiceVIP Call Girls Dongri WhatsApp +91-9833363713, Full Night Service
VIP Call Girls Dongri WhatsApp +91-9833363713, Full Night Servicemeghakumariji156
 
Discover Ardency Elite: Elevate Your Lifestyle
Discover Ardency Elite: Elevate Your LifestyleDiscover Ardency Elite: Elevate Your Lifestyle
Discover Ardency Elite: Elevate Your LifestyleMy Heart Throw Pillow
 
Social Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh BendaySocial Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh BendayMaharshBenday
 
SALES-PITCH-an-introduction-to-sales.pptx
SALES-PITCH-an-introduction-to-sales.pptxSALES-PITCH-an-introduction-to-sales.pptx
SALES-PITCH-an-introduction-to-sales.pptx23397013
 
How consumers use technology and the impacts on their lives
How consumers use technology and the impacts on their livesHow consumers use technology and the impacts on their lives
How consumers use technology and the impacts on their livesMathuraa
 
Aiizennxqc Digital Marketing | SEO & SMM
Aiizennxqc Digital Marketing | SEO & SMMAiizennxqc Digital Marketing | SEO & SMM
Aiizennxqc Digital Marketing | SEO & SMMaiizennxqc
 
Optimizing Your Marketing with AI-Powered Prompts
Optimizing Your Marketing with AI-Powered PromptsOptimizing Your Marketing with AI-Powered Prompts
Optimizing Your Marketing with AI-Powered PromptsVbout.com
 
Resumé Karina Perez | Digital Strategist
Resumé Karina Perez | Digital StrategistResumé Karina Perez | Digital Strategist
Resumé Karina Perez | Digital StrategistKarina Perez
 

Recently uploaded (20)

Mastering Affiliate Marketing: A Comprehensive Guide to Success
Mastering Affiliate Marketing: A Comprehensive Guide to SuccessMastering Affiliate Marketing: A Comprehensive Guide to Success
Mastering Affiliate Marketing: A Comprehensive Guide to Success
 
W.H.Bender Quote 61 -Influential restaurant and food service industry network...
W.H.Bender Quote 61 -Influential restaurant and food service industry network...W.H.Bender Quote 61 -Influential restaurant and food service industry network...
W.H.Bender Quote 61 -Influential restaurant and food service industry network...
 
The 9th May Incident in Pakistan A Turning Point in History.pptx
The 9th May Incident in Pakistan A Turning Point in History.pptxThe 9th May Incident in Pakistan A Turning Point in History.pptx
The 9th May Incident in Pakistan A Turning Point in History.pptx
 
Gain potential customers through Lead Generation
Gain potential customers through Lead GenerationGain potential customers through Lead Generation
Gain potential customers through Lead Generation
 
Alpha Media March 2024 Buyers Guide.pptx
Alpha Media March 2024 Buyers Guide.pptxAlpha Media March 2024 Buyers Guide.pptx
Alpha Media March 2024 Buyers Guide.pptx
 
The seven principles of persuasion by Dr. Robert Cialdini
The seven principles of persuasion by Dr. Robert CialdiniThe seven principles of persuasion by Dr. Robert Cialdini
The seven principles of persuasion by Dr. Robert Cialdini
 
Instant Digital Issuance: An Overview With Critical First Touch Best Practices
Instant Digital Issuance: An Overview With Critical First Touch Best PracticesInstant Digital Issuance: An Overview With Critical First Touch Best Practices
Instant Digital Issuance: An Overview With Critical First Touch Best Practices
 
Best 5 Graphics Designing Course In Chandigarh
Best 5 Graphics Designing Course In ChandigarhBest 5 Graphics Designing Course In Chandigarh
Best 5 Graphics Designing Course In Chandigarh
 
SP Search Term Data Optimization Template.pdf
SP Search Term Data Optimization Template.pdfSP Search Term Data Optimization Template.pdf
SP Search Term Data Optimization Template.pdf
 
Crypto Quantum Leap - Digital - membership area
Crypto Quantum Leap -  Digital - membership areaCrypto Quantum Leap -  Digital - membership area
Crypto Quantum Leap - Digital - membership area
 
HITECH CITY CALL GIRL IN 9234842891 💞 INDEPENDENT ESCORT SERVICE HITECH CITY
HITECH CITY CALL GIRL IN 9234842891 💞 INDEPENDENT ESCORT SERVICE HITECH CITYHITECH CITY CALL GIRL IN 9234842891 💞 INDEPENDENT ESCORT SERVICE HITECH CITY
HITECH CITY CALL GIRL IN 9234842891 💞 INDEPENDENT ESCORT SERVICE HITECH CITY
 
Social Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh BendaySocial Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh Benday
 
VIP Call Girls Dongri WhatsApp +91-9833363713, Full Night Service
VIP Call Girls Dongri WhatsApp +91-9833363713, Full Night ServiceVIP Call Girls Dongri WhatsApp +91-9833363713, Full Night Service
VIP Call Girls Dongri WhatsApp +91-9833363713, Full Night Service
 
Discover Ardency Elite: Elevate Your Lifestyle
Discover Ardency Elite: Elevate Your LifestyleDiscover Ardency Elite: Elevate Your Lifestyle
Discover Ardency Elite: Elevate Your Lifestyle
 
Social Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh BendaySocial Media Marketing Portfolio - Maharsh Benday
Social Media Marketing Portfolio - Maharsh Benday
 
SALES-PITCH-an-introduction-to-sales.pptx
SALES-PITCH-an-introduction-to-sales.pptxSALES-PITCH-an-introduction-to-sales.pptx
SALES-PITCH-an-introduction-to-sales.pptx
 
How consumers use technology and the impacts on their lives
How consumers use technology and the impacts on their livesHow consumers use technology and the impacts on their lives
How consumers use technology and the impacts on their lives
 
Aiizennxqc Digital Marketing | SEO & SMM
Aiizennxqc Digital Marketing | SEO & SMMAiizennxqc Digital Marketing | SEO & SMM
Aiizennxqc Digital Marketing | SEO & SMM
 
Optimizing Your Marketing with AI-Powered Prompts
Optimizing Your Marketing with AI-Powered PromptsOptimizing Your Marketing with AI-Powered Prompts
Optimizing Your Marketing with AI-Powered Prompts
 
Resumé Karina Perez | Digital Strategist
Resumé Karina Perez | Digital StrategistResumé Karina Perez | Digital Strategist
Resumé Karina Perez | Digital Strategist
 

Clustering training

  • 2. CONTENTS What is clustering? Distance: Similarity and dissimilarity Data types in cluster analysis Clustering methods Evaluation of clustering Summary 2
  • 4. CLUSTERING I. (BY TYPE) Fruit Veggie 4
  • 5. CLUSTERING II. (BY COLOR) Yellow Green 5
  • 8. IMAGE PROCESSING EXAMPLE Figure from “Image and video segmentation: the normalised cut framework”by Shi and Malik, copyright IEEE, 1998 8
  • 10. CLUSTERING BY COLOR EXAMPLE Item Cian Magenta Yellow Black Chili 72 0 51 57 Cucamber 11 0 45 19 Broccoli 15 0 23 31 Apple 25 0 74 20 Paprika 0 52 100 11 Lemon 0 20 93 5 Orange 0 18 65 3 Banana 0 1 100 1 10
  • 11. CLUSTERING BY COLOR EXAMPLE Item Cian Magenta Yellow Black Cluster Chili 72 0 51 57 Cluster 1 Cucamber 11 0 45 19 Cluster 1 Broccoli 15 0 23 31 Cluster 1 Apple 25 0 74 20 Cluster 1 Paprika 0 52 100 11 Cluster 2 Lemon 0 20 93 5 Cluster 2 Orange 0 18 65 3 Cluster 2 Banana 0 1 100 1 Cluster 2 11
  • 12. WHAT IS CLUSTERING? Grouping of objects into classes such a way that • Objects in same cluster are similar • Objects in different clusters are dissimilar Segmentation vs. Clustering • Clustering is finding borders between groups, • Segmenting is using borders to form groups Clustering is the method of creating segments 12
  • 13. SUPERVISED VS. UNSUPERVISED CLASSIFICATION VS. CLUSTERING Classification – Supervised Classes are predetermined we know in advance the stamping For example if we already diagnosed some disease Or we know who has churned Clustering – Unsupervised Classes are not known in advance we don’t know in advance the stamping Market behaviour segmentation Or Gene analysis 13
  • 14. APPLICATIONS OF CLUSTERING Marketing: segmentation of the customer based on behavior Banking: ATM Fraud detection (outlier detection) ATM classification: segmentation based on time series Gene analysis: Identifying gene responsible for a disease Chemistry: Periodic table of the elements Image processing: identifying objects on an image (face detection) Insurance: identifying groups of car insurance policy holders with a high average claim cost Houses: identifying groups of houses according to their house type, value, and geographical location 14
  • 15. TYPICAL DATABASE id age sex ID12101 48 FEMALE ID12102 40 MALE ID12103 51 FEMALE ID12104 23 FEMALE ID12105 57 FEMALE ID12106 57 FEMALE ID12107 22 MALE ID12108 58 MALE ID12109 37 FEMALE ID12110 54 MALE ID12111 66 FEMALE region income married children car INNER_CITY 17,546 NO 1 NO TOWN 30,085 YES 3 YES INNER_CITY 16,575 YES 0 YES TOWN 20,375 YES 3 NO RURAL 50,576 YES 0 NO TOWN 37,870 YES 2 NO RURAL 8,877 NO 0 NO TOWN 24,947 YES 0 YES SUBURBAN 25,304 YES 2 YES TOWN 24,212 YES 2 YES TOWN 59,804 YES 0 NO save_act NO NO YES NO YES YES NO YES NO YES YES current_act NO YES YES YES NO YES YES YES NO YES YES mortgage NO YES NO NO NO NO NO NO NO NO NO pep YES NO NO NO NO YES YES NO NO NO NO How we define similarity or dissimilarity? Especially for categorical variables? 15
  • 16. WHAT TO DERIVE FORM THE DATABASE? id age sex ID12101 48 FEMALE ID12102 40 MALE ID12103 51 FEMALE ID12104 23 FEMALE ID12105 57 FEMALE ID12106 57 FEMALE ID12107 22 MALE ID12108 58 MALE ID12109 37 FEMALE ID12110 54 MALE ID12111 66 FEMALE region income married children car INNER_CITY 17,546 NO 1 NO TOWN 30,085 YES 3 YES INNER_CITY 16,575 YES 0 YES TOWN 20,375 YES 3 NO RURAL 50,576 YES 0 NO TOWN 37,870 YES 2 NO RURAL 8,877 NO 0 NO TOWN 24,947 YES 0 YES SUBURBAN 25,304 YES 2 YES TOWN 24,212 YES 2 YES TOWN 59,804 YES 0 NO Upper: Original database of the objects (customers) Right: Similarity or dissimilarity measure of the objects (similarity of customers) save_act NO NO YES NO YES YES NO YES NO YES YES current_act NO YES YES YES NO YES YES YES NO YES YES mortgage NO YES NO NO NO NO NO NO NO NO NO pep YES NO NO NO NO YES YES NO NO NO NO id ID12101 ID12102 ID12103 ID12104 ID12105 ID12101 0 12 23 19 13 ID12102 12 0 25 13 17 ID12103 23 25 0 9 21 ID12104 19 13 9 0 12 ID12105 13 17 21 12 0 16
  • 17. REQUIREMENTS OF CLUSTERING • Ability to deal with different types of attributes • Discovery of clusters with arbitrary shape • Able to deal with noise and outliers • Insensitive to order of input records • High dimensionality • Scalability • Minimal requirements for domain knowledge to determine input parameters • Incorporation of user-specified constraints • Interpretability and usability 17
  • 19. SIMILARITY AND DISSIMILARITY There is no single definition of similarity or dissimilarity between data objects The definition of similarity or dissimilarity between objects depends on • the type of the data considered • what kind of similarity we are looking for 19
  • 20. DISTANCE MEASURE Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x,y) Ideally, every distance measure should be a metric, i.e., it should satisfy the following conditions: 1. d(x,y) ≥ 0 2. d(x,y) = 0 if x = y 3. d(x,y) = d(y,x) 4. d(x,z) ≤ d(x,y) + d(y,z) 20
  • 21. TYPE OF VARIABLES id age sex ID12101 48 FEMALE ID12102 40 MALE ID12103 51 FEMALE ID12104 23 FEMALE ID12105 57 FEMALE ID12106 57 FEMALE ID12110 54 MALE region income married children car INNER_CITY 17,546 NO 1 NO TOWN 30,085 YES 3 YES INNER_CITY 16,575 YES 0 YES TOWN 20,375 YES 3 NO RURAL 50,576 YES 0 NO TOWN 37,870 YES 2 NO TOWN 24,212 YES 2 YES save_act NO NO YES NO YES YES YES current_act NO YES YES YES NO YES YES mortgage NO YES NO NO NO NO NO pep YES NO NO NO NO YES NO Interval-scaled variables: Age Binary variables: Car, Mortgage Nominal, Ordinal, and Ratio variables Variables of mixed types Complex data types: Documents, GPS coordinates 21
  • 22. INTERVAL-SCALED VARIABLES Continuous measurements of a roughly linear scale for example, age, weight and height The measurement unit can affect the cluster analysis To avoid dependence on the measurement unit, we should standardize the data 22
  • 23. STANDARDIZATION To standardize the measurements: • calculate the mean absolute deviation s f  1 (| x1 f  m f |  | x2 f  m f | ... | xnf  m f |) n where m f  1 (x1 f  x2 f  ...  xnf ), n and • calculate the standardized measurement (z-score) xif  m f zif  sf 23
  • 24. DISTANCE MEASURE I. One group of popular distance measures for intervalscaled variables are Minkowski distances d (i, j)  q (| x  x |q  | x  x |q ... | x  x |q ) i1 j1 i2 j 2 ip jp where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer 24
  • 25. DISTANCE MEASURES II. If q = 1, the distance measure is Manhattan (or city block) distance d (i, j) | x  x |  | x  x | ... | x  x | i1 j1 i2 j2 ip jp If q = 2, the distance measure is Euclidean distance d (i, j)  (| x  x |2  | x  x |2 ... | x  x |2 ) i1 j1 i2 j 2 ip jp 25
  • 26. EXAMPLE: DISTANCE MEASURES point p1 p2 p3 p4 x 0 2 3 5 y 2 0 1 1 Manhattan Distance 0, 2 y 2 1 3, 1 0 5, 1 2, 0 0 1 2 3 4 5 6 p2 p3 p4 p1 p2 p3 p4 3 p1 0 4 4 6 4 0 2 4 4 2 0 2 6 4 2 0 Euclidean Distance p1 p2 p3 p4 p1 p2 p3 p4 0 2.828 3.162 5.099 2.828 0 1.414 3.162 3.162 1.414 0 2 5.099 3.162 2 0 x Distance Matrix 26
  • 27. WHY STANDARDIZATION? Age and Income No standardization Income >> Age No separation on age With standardization Separation based on both age and income 27
  • 28. RATIO-SCALED VARIABLES A positive measurement on a nonlinear scale, approximately at exponential scale AeBt or Ae-Bt Methods: 1. treat them like interval-scaled variables is not a good choice! 2. apply logarithmic transformation yif = log(xif) 3. treat them as continuous ordinal data and treat their rank as interval-scaled 4. create a better variable Object On-net Off-net Ratio Log-Ratio On-net/Total 1 95 6 0.06 -1.20 94% 2 56 15 0.27 -0.57 79% Dist 1-2 0.04 0.39 0.02 3 12 23 1.92 0.28 34% 4 12 29 2.42 0.38 29% Dist 3-4 0.25 0.01 0.00 28
  • 29. ORDINAL VARIABLES An ordinal variable can be discrete or continuous Order of values is important, e.g., rank Can be treated like interval-scaled rif {1, ..., M f } • replacing x by their rank if • map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable by rif 1 zif  M f 1 • compute the dissimilarity using methods for intervalscaled variables 29
  • 30. BINARY VARIABLES I. Binary variables has 2 outcomes 0/1, Y/N, … Distances Symmetric binary variable: FEMALE MALE FEMALE 0 1 MALE 1 0 No preference on which outcome should be coded 0 or 1 like gender Asymmetric binary variable: Outcomes are not equally important, or based on one outcome the objects are similar but based on the other outcome we can’t tell Like Has Mortgage or HIV positive Mortgage No Mortgage Mortgage 0 1 No Mortgage 1 undef 30
• 31. BINARY VARIABLES II.
If we have more binary variables in the database, we can calculate the distance based on the contingency table.
A contingency table for binary data:
                       Object j
                       1      0      SUM
  Object i  1          a      b      a+b
            0          c      d      c+d
            SUM        a+c    b+d    t 31
• 32. BINARY VARIABLES III.
Using the contingency table from the previous slide:
Simple matching coefficient (invariant similarity, if the binary variable is symmetric):
  sim(i,j) = (a + d) / (a + b + c + d),   d(i,j) = (b + c) / (a + b + c + d)
Jaccard coefficient (non-invariant similarity, if the binary variable is asymmetric):
  sim(i,j) = a / (a + b + c),   d(i,j) = (b + c) / (a + b + c) 32
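Both coefficients follow directly from the contingency counts; a sketch for two 0/1 vectors (the helper function is illustrative, not from the slides):

```python
import numpy as np

def binary_dissimilarity(x, y, asymmetric=False):
    """Simple matching distance (b+c)/(a+b+c+d) for symmetric variables,
    Jaccard distance (b+c)/(a+b+c) for asymmetric ones."""
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 1) & (y == 1))   # both 1
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))   # both 0 (ignored by Jaccard)
    return (b + c) / (a + b + c) if asymmetric else (b + c) / (a + b + c + d)

i = [1, 0, 1, 1, 0]
j = [1, 1, 0, 1, 0]
print(binary_dissimilarity(i, j))                   # 0.4  (simple matching)
print(binary_dissimilarity(i, j, asymmetric=True))  # 0.5  (Jaccard)
```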
• 33. NOMINAL VARIABLES
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue.
Method 1: simple matching
• m: # of matches, p: total # of variables
  sim(i,j) = m / p,   d(i,j) = (p − m) / p
Distance matrix:
          Red  Yellow  Blue
  Red     0    1       1
  Yellow  1    0       1
  Blue    1    1       0
Method 2: use a large number of binary variables
• create a new binary variable for each of the k nominal states 33
• 34. VARIABLES OF MIXED TYPES
A database usually contains different types of variables
• symmetric binary, asymmetric binary, nominal, ordinal, interval
Approaches:
1. Group each type of variable together and perform a separate cluster analysis for each type.
2. Bring the different variables onto a common scale, the interval [0.0, 1.0], and perform a single cluster analysis. 34
• 35. WEIGHTED FORMULA
  d(i,j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f) / Σ_{f=1..p} δ_ij^(f)
Weight δ_ij^(f) = 0
• if x_if or x_jf is missing,
• or x_if = x_jf = 0 and variable f is asymmetric binary;
otherwise weight δ_ij^(f) = 1.
Another option is to choose the weights based on business aspects. 35
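A Gower-style sketch of the weighted formula under simple assumptions (interval contributions scaled by the variable's range; the type labels, the range value and the example records are illustrative):

```python
def mixed_distance(xi, xj, types, ranges):
    """d(i,j) = sum_f delta^(f) * d^(f) / sum_f delta^(f), with delta^(f) = 0
    for missing values and for 0/0 matches on asymmetric binary variables."""
    num = den = 0.0
    for f, (a, b) in enumerate(zip(xi, xj)):
        if a is None or b is None:                          # missing value
            continue
        if types[f] == 'asymmetric' and a == 0 and b == 0:  # 0/0 match
            continue
        if types[f] == 'interval':
            d = abs(a - b) / ranges[f]   # scale contribution onto [0, 1]
        else:                            # binary or nominal: match / mismatch
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den if den else 0.0

# age (interval), mortgage (asymmetric binary), region (nominal)
xi = (48, 0, 'INNER_CITY')
xj = (40, 1, 'TOWN')
print(mixed_distance(xi, xj, ['interval', 'asymmetric', 'nominal'], {0: 44.0}))
```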
• 36. VECTOR OBJECTS: COSINE SIMILARITY
Vector objects: keywords in documents, gene features in micro-arrays, …
Applications: information retrieval, biologic taxonomy, …
Cosine measure: if d1 and d2 are two vectors, then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.
Example:
  d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
  d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
  d1 · d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
  ||d1|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
  ||d2|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 = 2.449
  cos(d1, d2) = 5 / (6.481 · 2.449) = 0.315 36
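The example can be checked in a couple of lines (NumPy assumed):

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315 = 5 / (6.481 * 2.449)
```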
• 37. COMPLEX DATA TYPES
All non-relational objects are complex types of data
• examples: spatial data, location data, multimedia data, genetic data, time-series data, text data and data collected from the Web
We can define our own similarity or dissimilarity measures beyond the previous ones
• this can, for example, mean the use of string and/or sequence matching, or methods of information retrieval 37
• 39. MAJOR CLUSTERING APPROACHES
Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
Density-based: based on connectivity and density functions.
Grid-based: based on a multiple-level granularity structure.
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given models. 39
• 40. PARTITIONING BASIC CONCEPT
Partitioning method: construct a partition of a database D of n objects into k clusters
• each cluster contains at least one object
• each object belongs to exactly one cluster
Given a k, find the partition of k clusters that optimizes the chosen partitioning criterion (minimal distance from the cluster centers)
• Global optimum: exhaustively enumerate all Stirling(n,k) partitions (S(10,3) = 9,330; S(20,3) = 580,606,446; …)
• Heuristic methods: the k-means and k-medoids algorithms
• k-means: each cluster is represented by the center of the cluster
• k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster 40
• 41. PARTITIONING K-MEANS ALGORITHM
Input: k, the number of clusters; a database D of n objects.
Output: a set of k clusters that minimizes the squared-error function.
Algorithm:
1. Choose k objects as the initial cluster centers
2. Assign each object to the cluster with the closest mean point (centroid) under the squared Euclidean distance metric
3. When all objects have been assigned, recalculate the positions of the k mean points (centroids)
4. Repeat steps 2 and 3 until the centroids do not change any more 41
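A minimal NumPy sketch of the four steps (illustrative, without the safeguards a production implementation would need):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(max_iter):
        # step 2: assign each object to the closest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step 3: recompute the centroids (keep the old one if a cluster empties)
        new = np.array([X[labels == q].mean(axis=0) if np.any(labels == q)
                        else centroids[q] for q in range(k)])
        # step 4: stop once the centroids no longer move
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```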
• 42. PARTITIONING K-MEANS ALGORITHM
[Figure: k-means iterations. Source: Clustering: A survey, 2008, R. Capaldo, F. Collovà] 42
• 43. PARTITIONING K-MEANS
+ Easy to implement
+ The k-means method is relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
− Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
− Not applicable to categorical data
− Need to specify k, the number of clusters, in advance
− Unable to handle noisy data and outliers
− Not suitable for discovering clusters with non-convex shapes
To overcome some of these problems, k-medoids or PAM was introduced. 43
• 44. PARTITIONING K-MEDOID ALGORITHM
The k-medoid or PAM (Partitioning Around Medoids) method is the same as k-means, but instead of the mean it uses the medoid m_q (q = 1, 2, …, k) as the most representative object of the cluster; the medoid is the most centrally located object in a cluster.
[Figure source: Clustering: A survey, 2008, R. Capaldo, F. Collovà] 44
• 45. PARTITIONING K-MEDOID OR PAM
+ PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
− PAM works efficiently for small data sets but does not scale well to large data sets; in fact each iteration costs O(k(n − k)²), where n is the number of objects and k the number of clusters
To overcome these problems, two extensions were introduced:
CLARA (Clustering LARge Applications) -> a sampling-based method
CLARANS -> a clustering algorithm based on randomized search 45
• 46. PARTITIONING CLARA
CLARA (Clustering LARge Applications) (Kaufman and Rousseeuw, 1990) draws multiple samples of the dataset and applies PAM to each sample in order to find the medoids.
+ Deals with larger data sets than PAM
+ Experiments show that 5 samples of size 40 + 2k give satisfactory results
− Efficiency depends on the sample size, which also has to be determined
− A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased; to avoid this we use multiple samples 46
• 47. PARTITIONING CLARANS
CLARANS (CLustering Algorithm based on RANdomized Search) (Ng and Han, 1994): a clustering method that draws samples of neighbours dynamically.
There are 2 parameters: maxneighbour, the maximum number of neighbours examined, and numlocal, the number of local minima obtained.
The algorithm searches for new neighbours and replaces the current setup with a lower-cost setup until the number of examined neighbours reaches maxneighbour or the number of local minima obtained reaches numlocal.
+ More efficient and scalable than both PAM and CLARA
+ Returns higher quality clusters
+ Has the benefit of not confining the search to a restricted area
− Depending on the parameters it can be very time consuming (close to PAM) 47
• 48. HIERARCHICAL BASIC CONCEPT
Hierarchical clustering constructs a hierarchy of clusters, not just a single partition of the objects
• Uses the distance matrix as the clustering criterion
• Does not require the number of clusters as an input, but needs a termination condition, e.g., a number of clusters or a distance threshold for merging 48
• 49. HIERARCHICAL CLUSTERING TREE, DENDROGRAM
The hierarchy of clusterings is given as a clustering tree or dendrogram
• the leaves of the tree represent the individual objects
• the internal nodes of the tree represent the clusters
Two main types of hierarchical clustering:
• agglomerative (bottom-up)
  • place each object in its own cluster (a singleton)
  • in each step merge the two most similar clusters, until there is only one cluster left or the termination condition is satisfied
• divisive (top-down)
  • start with one big cluster containing all the objects
  • divide the most distinctive cluster into smaller clusters and proceed until there are n clusters or the termination condition is satisfied 49
• 50. HIERARCHICAL CLUSTER DISTANCE MEASURES
Single link (nearest neighbour): the distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters.
Complete link (furthest neighbour): the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbours").
Pair-group average link: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
Pair-group centroid (centroid link): the distance between two clusters is determined as the distance between their centroids. 50
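These linkage criteria map directly onto SciPy's `scipy.cluster.hierarchy`; a short agglomerative sketch (the random data and the cut at 3 clusters are arbitrary choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(20, 2)

# method: 'single', 'complete', 'average' or 'centroid' -- the four
# cluster distance measures listed above
Z = linkage(X, method='average')

# Termination condition: cut the tree into a fixed number of clusters
labels = fcluster(Z, t=3, criterion='maxclust')

# dendrogram(Z) plots the clustering tree (needs matplotlib)
```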
• 52. HIERARCHICAL
+ Conceptually simple
+ Theoretical properties are well understood
+ When clusters are merged/split, the decision is permanent => the number of different alternatives that need to be examined is reduced
− Merging/splitting of clusters is permanent => erroneous decisions are impossible to correct later
− Divisive methods can be computationally hard
− The methods are not (necessarily) scalable for large data sets 52
• 54. EVALUATION BASICS
Business aspects:
• segment sizes
• meaningful segments
Technical aspects:
• compactness
• separation 54
• 55. COMPACTNESS AND SEPARATION
Compactness: intra-cluster variance.
Separation: inter-cluster distance.
Sometimes the two measures lead to different results.
[Chart: Dens_bw (separation) and Scatt_orig (compactness) plotted against the number of clusters, k = 2…6.] 55
• 56. INDEX FUNCTIONS
Number of clusters
• By finding the minimum/maximum of an index function we can determine the optimal number of clusters.
Comparing clustering methods
• Using the index functions we can compare the results of different clustering methods on the same database.
[Charts: DB and SD indexes for k = 2…10, for the two methods labelled KM and TS.] 56
• 57. SAMPLE DATABASE
We generated a sample with 4 clusters
• 2 dimensions
• real values between (−10, 15)
• with outliers 57
• 59. DB (DAVIES–BOULDIN) INDEX
The DB index summarizes, for each cluster, its similarity to the most similar other cluster, and then takes the average of these values; the lower the value, the better the clustering.
[Charts: DB index for k = 2…10 with the KM and TS methods.] 59
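scikit-learn ships a Davies–Bouldin implementation, so scanning k takes only a few lines; a sketch on synthetic data (the four blob centers are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(50, 2))
               for c in [(0, 0), (6, 0), (0, 6), (6, 6)]])

# Lower DB index is better; its minimum suggests the number of clusters
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(davies_bouldin_score(X, labels), 3))
```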
• 60. S_DBW INDEX
2 components:
• Dens_bw: cluster separation
• Scatt: the average variance of the clusters divided by the variance of all objects
[Charts: S_Dbw, Dens_bw and Scatt for k = 2…10 with the KM and TS methods.] 60
• 61. SD INDEX
2 components:
• Scatt: compactness of the clusters
• Dis: a function of the centroids of the clusters
We should know the maximum number of clusters.
[Charts: SD index for k = 2…10 with the KM and TS methods.] 61
• 62. RS, RMSSTD INDEXES
RS (R-squared) = variance between clusters / total variance
RMSSTD (root-mean-square standard deviation) = within-cluster variance
[Charts: RS, RS_diff, RMSSTD and RMSSTD_diff for k = 2…10 with the KM and TS methods.] 62
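Both indexes can be computed directly from a labelling; a sketch under one common set of definitions (RMSSTD as the root of the pooled within-cluster variance; other degrees-of-freedom conventions exist):

```python
import numpy as np

def rs_and_rmsstd(X, labels):
    """RS = between-cluster sum of squares / total sum of squares;
    RMSSTD = sqrt(pooled within-cluster variance)."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    within_ss, dof = 0.0, 0
    for q in np.unique(labels):
        C = X[labels == q]
        within_ss += ((C - C.mean(axis=0)) ** 2).sum()
        dof += (len(C) - 1) * X.shape[1]   # degrees of freedom per cluster
    return (total_ss - within_ss) / total_ss, np.sqrt(within_ss / dof)
```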
• 63. SEGMENTATION IN A BANK
Needs-based segmentation for new tariff plans.
When the number of clusters is 4 or 5, we get one segment that is too big (ca. 60,000 customers).
Above 6 segments we cannot identify any more significant segments.
Balance decrease is the cutting variable.
[Charts: Separation (Szeparáltság) and Diameter (Átmérő) for k = 2…10.] 63
• 64. BANK SEGMENTATION – INDEXES
[Charts: DB, SD, RS, RS_diff, Dens_bw, Scatt_orig, RMSSTD and RMSSTD_diff indexes for k = 2…10.]
Based on the indexes there are 4–6 really different segments. 64
• 65. LITERATURE I.
J. Han and M. Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000 (k-means, k-medoids or PAM; deterministic annealing, genetic algorithms).
L. Kaufman and P. J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990 (CLARA, AGNES, DIANA).
R. Ng and J. Han: Efficient and effective clustering method for spatial data mining. VLDB'94 (CLARANS).
T. Zhang, R. Ramakrishnan, and M. Livny: BIRCH: an efficient data clustering method for very large databases. SIGMOD'96 (BIRCH).
S. Guha, R. Rastogi, and K. Shim: CURE: an efficient clustering algorithm for large databases. SIGMOD'98 (CURE). 65
• 66. LITERATURE II.
G. Karypis, Eui-Hong Han, V. Kumar: Chameleon: hierarchical clustering using dynamic modeling (CHAMELEON).
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu: A density-based algorithm for discovering clusters in large spatial databases. KDD'96 (DBSCAN).
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander: OPTICS: ordering points to identify the clustering structure. SIGMOD'99 (OPTICS).
A. Hinneburg and D. A. Keim: An efficient approach to clustering in large multimedia databases with noise. Proceedings of the 4th ICKDDM, New York, 1998 (DENCLUE).
M. Abramowitz and I. A. Stegun (Eds.): "Stirling Numbers of the Second Kind." §24.1.4 in Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. New York: Dover, pp. 824–825, 1972.
P.-N. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining. Addison-Wesley, 2006. 66