2. Cluster analysis is a multivariate statistical technique
in which a large data set is segregated into several
groups based on homogeneity or similarity measures.
Cluster analysis makes a sensible and informative
classification of an initially unclassified set of data
with the desired accuracy, using the variable values
observed on each individual.
It saves a lot of resources in terms of time, money, etc.
4. To assign observations to groups (‘clusters’)
To divide the observations into homogeneous and
distinct groups
To reduce the complexity of data
5. Generates several groups of data that are
homogeneous within each group and, as far as
possible, heterogeneous with respect to other groups.
Normally, the data consist of objects or persons.
Segregation is done based on more than two
variables.
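The idea of "homogeneous within, heterogeneous between" can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the similarity measure and a small made-up data set with three variables per observation; the function names are illustrative only.

```python
from itertools import combinations
from math import dist


def mean_within_distance(groups):
    """Average distance between pairs of points that share a group."""
    pairs = [dist(a, b) for g in groups for a, b in combinations(g, 2)]
    return sum(pairs) / len(pairs)


def mean_between_distance(groups):
    """Average distance between pairs of points from different groups."""
    pairs = [
        dist(a, b)
        for g1, g2 in combinations(groups, 2)
        for a in g1
        for b in g2
    ]
    return sum(pairs) / len(pairs)


# Hypothetical observations, each measured on three variables.
groups = [
    [(1.0, 2.0, 1.5), (1.2, 1.8, 1.4), (0.9, 2.1, 1.6)],
    [(8.0, 9.0, 7.5), (8.3, 8.7, 7.9), (7.8, 9.2, 7.4)],
]

within = mean_within_distance(groups)
between = mean_between_distance(groups)
print(f"within: {within:.2f}, between: {between:.2f}")
```

A good grouping shows a small within-group distance and a large between-group distance, as is the case for this toy data.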
7. Hierarchical clustering is a method of cluster analysis that
seeks to build a hierarchy of clusters.
Two types:
Agglomerative (bottom-up):
◦ Start with each document being a single cluster.
◦ Eventually all documents belong to the same cluster.
Divisive (top-down):
◦ Start with all documents belonging to the same cluster.
◦ Eventually each document forms a cluster of its own.
The number of clusters need not be fixed in advance at k.
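The agglomerative (bottom-up) variant can be sketched in a few lines. This is only an illustration, assuming single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance; the slides do not prescribe a linkage rule, and the sample points are made up.

```python
from math import dist


def single_linkage(c1, c2, points):
    """Distance between two clusters: distance of their closest members."""
    return min(dist(points[i], points[j]) for i in c1 for j in c2)


def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [[i] for i in range(len(points))]  # each item starts alone
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda p: single_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        history.append((clusters[a], clusters[b]))
        merged = sorted(clusters[a] + clusters[b])
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history


# Hypothetical two-dimensional items, identified by their index.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
merges = agglomerative(points)
for left, right in merges:
    print(left, "+", right)
```

The list of merges is the hierarchy itself: stopping after any number of merges yields a clustering with that many fewer clusters, which is why the number of clusters does not have to be chosen beforehand.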
8. Construction of a tree-based hierarchical diagram,
usually called a dendrogram. E.g., in the case of taxonomic
classification:
animal
◦ vertebrate: fish, reptile, amphibian, mammal
◦ invertebrate: worm, insect, crustacean
9. In centroid-based clustering, clusters are represented by a
central vector, which may not necessarily be a member of the
data set.
Aims to partition n observations into k clusters.
Each observation belongs to the cluster with the nearest mean.
Here, the number of clusters is fixed to k (k-means clustering).
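The assignment-to-nearest-mean rule can be sketched as follows. This is a minimal version of the standard iterative k-means procedure (often called Lloyd's algorithm), assuming Euclidean distance and initial centroids supplied by the caller; the data points are made up for illustration.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and mean-update steps until the means stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical observations; k = 2 is fixed by passing two starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)
print(centroids)
```

Note that the final centroids are averages and do not coincide with any observation, which illustrates that the central vector need not be a member of the data set.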
10. Clusters can be defined as objects most likely belonging to the same
distribution.
This model can also capture correlation and dependence between attributes.
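As a small illustration of the distribution-based view, each object can be assigned to the distribution under which it is most likely. The sketch below assumes two one-dimensional normal distributions whose parameters are fixed by hand; in practice they would be estimated from the data, for example by fitting a Gaussian mixture model. All values are hypothetical.

```python
from statistics import NormalDist

# Hypothetical component distributions (assumed known for this sketch).
components = [NormalDist(mu=0.0, sigma=1.0), NormalDist(mu=10.0, sigma=2.0)]


def most_likely_component(x, components):
    """Index of the distribution under which x has the highest density."""
    return max(range(len(components)), key=lambda k: components[k].pdf(x))


observations = [-0.5, 0.3, 1.2, 8.0, 9.5, 12.1]
labels = [most_likely_component(x, components) for x in observations]
print(labels)
```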
11. Clusters are defined as areas of higher density than the rest of the data set.
Objects in the sparse areas that are required to separate
clusters are usually considered to be noise and border
points.
The most popular density-based clustering method is
DBSCAN (density-based spatial clustering of applications
with noise).
OPTICS (Ordering Points To Identify the Clustering
Structure) is a generalization of DBSCAN that handles
different densities much better.
12. Density-based clustering with DBSCAN.
DBSCAN assumes clusters of similar density, and may have
problems separating nearby clusters.
OPTICS is a DBSCAN variant that handles different densities
much better.
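The core idea of DBSCAN can be sketched without any library. The version below is a simplified illustration, assuming Euclidean distance and a brute-force neighbor search; the parameter names eps (neighborhood radius) and min_pts (minimum number of neighbors for a dense point) follow common usage, and the data are made up.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is dense too: keep expanding
    return labels


# Two dense squares and one isolated point (hypothetical data).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (5.0, 5.0),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point lies in a sparse area and is reported as noise, while each dense square becomes its own cluster. Because a single eps is used everywhere, this approach implicitly assumes clusters of similar density.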
13. 1. Forming the clusters from the given data set, resulting
in a new variable that identifies cluster membership among
the cases (one phase cluster)
2. Describing the clusters by re-crossing with the data
(two phase cluster)
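The two phases can be illustrated with a short sketch. It assumes that phase one has already produced a cluster variable for each case (here simply given by hand) and interprets "re-crossing with the data" as computing per-cluster averages of the original variables; the variable names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cases; "cluster" is the new variable produced in phase one.
cases = [
    {"age": 23, "spend": 120.0, "cluster": 0},
    {"age": 25, "spend": 100.0, "cluster": 0},
    {"age": 58, "spend": 310.0, "cluster": 1},
    {"age": 62, "spend": 290.0, "cluster": 1},
]


def describe_clusters(cases, variables, key="cluster"):
    """Phase two: cross the cluster variable with the data (per-cluster means)."""
    grouped = defaultdict(list)
    for case in cases:
        grouped[case[key]].append(case)
    return {
        label: {v: mean(c[v] for c in members) for v in variables}
        for label, members in grouped.items()
    }


profile = describe_clusters(cases, ["age", "spend"])
for label, summary in profile.items():
    print(label, summary)
```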
14. One phase cluster
Forming of clusters from the chosen data set.
E.g., fish cutlet, fish finger and fish burger form the cluster
"value added products".
16. Advantages:
Cuts down the cost of preparing a sampling frame and
other administrative factors.
No special scales of measurement are necessary.
The visual graphic provides a clear understanding of the
clusters.
Disadvantages:
The choice of cluster-forming variables is often not based on
theory but made at random.
In some cases, determining the clusters is difficult.
17. Marketing: Helps marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
City planning: Identifying groups of houses according to their
house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults