Clustering –
-> the process of grouping a set of objects into classes of similar objects
-> the task is to create groups and assign each data point to a group
Partitioning Clustering Approach-
a typical cluster-analysis approach: iteratively partition the training data set to learn a partition of the given data space
learning a partition on a data set produces several non-empty clusters (usually, the number of clusters is given in advance)
in principle, the optimal partition is achieved by minimising the sum of squared distances from each point to the “representative object” of its cluster (sketched below)
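A minimal sketch of this objective in Python (the data points, assignments, and the use of cluster means as representatives are made up for illustration):

import numpy as np

def within_cluster_ss(X, labels, centroids):
    # sum of squared Euclidean distances from each point to its cluster's representative
    return sum(np.sum((X[labels == c] - centroids[c]) ** 2)
               for c in range(len(centroids)))

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == c].mean(axis=0) for c in range(2)])
print(within_cluster_ss(X, labels, centroids))  # partitioning aims to minimise this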
3. The maximum norm is given by:
D(x, y) = max_{1 ≤ i ≤ p} |x_i − y_i|
4. The Mahalanobis distance corrects data for different scales and correlations in the variables.
5. Inner product space: the angle between two vectors can be used as a distance measure when clustering high-dimensional data
6. Hamming distance (a restricted form of edit distance) measures the minimum number of substitutions required to change one member into another (see the sketch after this list)
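The distance measures above can be sketched in Python as follows (the vectors, the strings, and the made-up sample used to estimate a covariance for Mahalanobis are all illustrative; SciPy is assumed to be available):

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 5.5])

# 3. maximum (Chebyshev) norm: largest coordinate-wise difference
d_max = np.max(np.abs(x - y))

# 4. Mahalanobis distance needs the inverse covariance of the data;
#    here it is estimated from a made-up sample
sample = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(sample, rowvar=False))
d_mahal = distance.mahalanobis(x, y, VI)

# 5. angle-based (cosine) distance: 1 - cos(angle between x and y)
d_cos = distance.cosine(x, y)

# 6. Hamming distance between equal-length strings (substitutions only)
d_ham = sum(a != b for a, b in zip("karolin", "kathrin"))  # -> 3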
Points are near the centroid of their cluster
->What if clusters are overlapping?
- Hard to tell which cluster is right
- Maybe we should try to remain uncertain
->What if a cluster has a non-circular shape?
Gaussian mixture models:
- Clusters modeled as Gaussians (not just by their mean)
- The EM algorithm assigns each data point to a cluster with some probability
- Gives a probability model of x!
The EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models where the model depends on unobserved latent variables.
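A minimal sketch of probabilistic clustering with a Gaussian mixture fitted by EM, assuming scikit-learn's GaussianMixture (the two-blob data set is made up):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, size=(100, 2)),   # blob 1
               rng.normal(5, 1.5, size=(100, 2))])  # blob 2

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
probs = gmm.predict_proba(X)    # soft assignments: P(cluster | x)
log_px = gmm.score_samples(X)   # log p(x): a probability model of x itself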
Probabilistic Clustering – EM with Gaussian distributions
K-means is relatively efficient and fast: it computes its result in O(tkn), where n is the number of objects (points), k is the number of clusters and t is the number of iterations.
It is used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization) and for Image Segmentation.
It is also used for choosing color palettes on old-fashioned graphical display devices and for Image Quantization.
- Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
- It is sensitive to initial conditions: different initial conditions may produce different clusterings, and the algorithm may be trapped in a local optimum (the sketch below demonstrates this).
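A small sketch of that sensitivity, assuming scikit-learn's KMeans (the data is synthetic; n_init=1 keeps a single random initialisation per run, so different seeds can end in different local optima):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0, 3, 6)])

for seed in (0, 1, 2):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)  # within-cluster sum of squares; may differ per seed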
The simplest method is rescaling the range of features to [0, 1] or [−1, 1]; selecting the target range depends on the nature of the data.
We can handle various types of data, e.g. audio signals and pixel values for image data, and this data can include multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean (by subtracting the mean in the numerator) and unit variance.
- The reason is that normalization gives the same importance to all the variables.
The standard example is age (in years) and height (in cm). Age may range over [18, 50], while height may range over [130, 180] (made-up numbers). With the classical Euclidean distance, height has disproportionately more influence on the computation than age (the sketch below demonstrates this).
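A sketch of both normalisations and of the age/height effect (the numbers are made up, as in the text):

import numpy as np

X = np.array([[18.0, 130.0],
              [35.0, 160.0],
              [50.0, 180.0]])  # columns: age (years), height (cm)

# min-max rescaling to [0, 1]: (x - min) / (max - min)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# before scaling, Euclidean distance is dominated by height (larger range);
# after scaling, both features contribute comparably
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_std[0] - X_std[1]))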
Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and in the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the vocabulary.
-K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster. For categorical data, use Hamming distance instead of Euclidean distance, i.e. if two categorical values are the same the distance is 0, else 1.
- k-medoids minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster.
- The medoid is always an actual member of the data set
K-medoids is also a partitioning technique of clustering that clusters the data set of n objects into k clusters with k known a priori.
https://en.wikipedia.org/wiki/K-medoids
The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm. PAM uses a greedy search which may not find the optimum solution, but it is faster than exhaustive search. It works as follows (a runnable sketch follows the steps):
Initialize: randomly select (without replacement) k of the n data points as the medoids
Associate each data point to the closest medoid.
While the cost of the configuration decreases:
For each medoid m, for each non-medoid data point o:
Swap m and o, recompute the cost (sum of distances of points to their medoid)
If the total cost of the configuration increased in the previous step, undo the swap
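A runnable sketch of the PAM loop above, in plain NumPy (a didactic implementation, not an optimised PAM):

import numpy as np

def pam(X, k, rng=np.random.default_rng(0)):
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))  # init: random, without replacement

    def cost(meds):
        # sum of distances of all points to their closest medoid
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:              # while the cost of the configuration decreases
        improved = False
        for i in range(k):       # for each medoid position ...
            for o in range(n):   # ... and each non-medoid data point
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o     # swap medoid i and point o
                c = cost(trial)
                if c < best:     # keep the swap only if the cost decreased
                    medoids, best, improved = trial, c, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best

X = np.random.default_rng(1).normal(size=(30, 2))  # made-up data
meds, labels, total_cost = pam(X, k=3)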
Algorithms other than PAM have been suggested in the literature, including the following Voronoi iteration method [2] (a code sketch follows the steps):
Select initial medoids
Iterate while the cost decreases:
In each cluster, make the point that minimizes the sum of distances within the cluster the medoid
Reassign each point to the cluster defined by the closest medoid determined in the previous step.
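A sketch of the Voronoi iteration in plain NumPy (didactic; assumes no cluster becomes empty):

import numpy as np

def kmedoids_voronoi(X, k, n_iter=100, rng=np.random.default_rng(0)):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)   # select initial medoids
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)             # assign each point to its closest medoid
        new = np.empty(k, dtype=int)
        for c in range(k):
            members = np.where(labels == c)[0]
            within = D[np.ix_(members, members)].sum(axis=1)
            new[c] = members[within.argmin()]             # in-cluster point minimising summed distance
        if np.array_equal(new, medoids):                  # cost no longer decreases
            break
        medoids = new
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels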