Clustering
Viet-Trung Tran
High Dimensional Data
•  Given a cloud of data points, we want to understand its structure
Credits
•  Jure Leskovec, Anand Rajaraman, Jeff Ullman
•  Stanford University
The Problem of Clustering
•  Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that:
–  Members of a cluster are close/similar to each other
–  Members of different clusters are dissimilar
•  Usually:
–  Points are in a high-dimensional space
–  Similarity is defined using a distance measure
•  Euclidean, Cosine, Jaccard, edit distance, …
Example: Clusters & Outliers
[Figure: a 2-D cloud of points forming a few groups; one group is labeled “Cluster” and an isolated point is labeled “Outlier”.]
Clustering is a hard problem!
Why is it hard?
•  Clustering in two dimensions looks easy
•  Clustering small amounts of data looks easy
•  And in most cases, looks are not deceiving
•  Many applications involve not 2, but 10 or 10,000 dimensions
•  High-dimensional spaces look different: almost all pairs of points are at about the same distance
Clustering Problem: Galaxies
•  A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands)
•  Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
•  Sloan Digital Sky Survey
Clustering Problem: Music CDs
•  Intuitively: Music divides into categories, and customers prefer a few categories
–  But what are categories really?
•  Represent a CD by the set of customers who bought it
•  Similar CDs have similar sets of customers, and vice-versa
Clustering Problem: Music CDs
Space of all CDs:
•  Think of a space with one dimension for each customer
–  Values in a dimension may be 0 or 1 only
–  A CD is a point in this space (x1, x2, …, xk), where xi = 1 iff the ith customer bought the CD
•  For Amazon, the dimension is tens of millions
•  Task: Find clusters of similar CDs
Clustering Problem: Documents
Finding topics:
•  Represent a document by a vector (x1, x2, …, xk), where xi = 1 iff the ith word (in some order) appears in the document
–  It actually doesn’t matter if k is infinite; i.e., we don’t limit the set of words
•  Documents with similar sets of words may be about the same topic
Cosine, Jaccard, and Euclidean
•  As with CDs, we have a choice when we think of documents as sets of words or shingles:
–  Sets as vectors: Measure similarity by the cosine distance
–  Sets as sets: Measure similarity by the Jaccard distance
–  Sets as points: Measure similarity by the Euclidean distance
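To make the three options concrete, here is a small Python sketch (my own illustration, not part of the slides) computing all three measures for two documents represented as sets of words:

```python
import math

def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B| for two sets."""
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(a: set, b: set) -> float:
    """Treat sets as 0/1 vectors: similarity = |A ∩ B| / sqrt(|A| * |B|).
    Here distance = 1 - similarity; the angle arccos(similarity) is
    another common convention."""
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

def euclidean_distance(a: set, b: set) -> float:
    """Treat sets as 0/1 points: squared distance = |A △ B|
    (size of the symmetric difference)."""
    return math.sqrt(len(a ^ b))

doc1 = {"cluster", "distance", "point"}
doc2 = {"cluster", "point", "centroid"}
print(jaccard_distance(doc1, doc2))    # 0.5
print(cosine_distance(doc1, doc2))     # ~0.333
print(euclidean_distance(doc1, doc2))  # ~1.414
```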
Overview: Methods of Clustering
•  Hierarchical:
–  Agglomerative (bottom up):
•  Initially, each point is a cluster
•  Repeatedly combine the two “nearest” clusters into one
–  Divisive (top down):
•  Start with one cluster and recursively split it
•  Point assignment:
–  Maintain a set of clusters
–  Points belong to the “nearest” cluster
Hierarchical Clustering
•  Key operation: Repeatedly combine the two nearest clusters
•  Three important questions:
–  1) How do you represent a cluster of more than one point?
–  2) How do you determine the “nearness” of clusters?
–  3) When to stop combining clusters?
Hierarchical Clustering
•  Key operation: Repeatedly combine the two nearest clusters
•  (1) How to represent a cluster of many points?
–  Key problem: As you merge clusters, how do you represent the “location” of each cluster, to tell which pair of clusters is closest?
•  Euclidean case: each cluster has a centroid = average of its (data)points
•  (2) How to determine “nearness” of clusters?
–  Measure cluster distances by distances of centroids
Example: Hierarchical clustering
[Figure: six data points o at (0,0), (1,2), (2,1), (4,1), (5,0), and (5,3); as clusters merge, centroids x appear at (1.5,1.5), (4.5,0.5), (1,1), and (4.7,1.3). Legend: o … data point, x … centroid. A dendrogram records the merge order.]
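The merge sequence in this figure can be reproduced with a short centroid-based agglomerative loop. Below is a minimal pure-Python sketch (my own illustration; only the coordinates come from the slide), which repeatedly merges the pair of clusters with the nearest centroids:

```python
import math

def centroid(cluster):
    """Average of the points in a cluster."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Start: every point is its own cluster.
clusters = [[(0, 0)], [(1, 2)], [(2, 1)], [(4, 1)], [(5, 0)], [(5, 3)]]

while len(clusters) > 1:
    # Find the pair of clusters whose centroids are nearest (naive O(N^2) scan).
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
    )
    merged = clusters[i] + clusters[j]
    print("merge", clusters[i], "+", clusters[j], "-> centroid", centroid(merged))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

Run on these six points, the loop produces exactly the centroids shown: (1.5,1.5), then (4.5,0.5), then (1,1), then approximately (4.7,1.3).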
And in the Non-Euclidean Case?
What about the Non-Euclidean case?
•  The only “locations” we can talk about are the points themselves
–  i.e., there is no “average” of two points
•  Approach 1:
–  (1) How to represent a cluster of many points? Clustroid = (data)point “closest” to the other points
–  (2) How do you determine the “nearness” of clusters? Treat the clustroid as if it were the centroid when computing inter-cluster distances
“Closest” Point?
•  (1) How to represent a cluster of many points? Clustroid = point “closest” to the other points
•  Possible meanings of “closest”:
–  Smallest maximum distance to the other points
–  Smallest average distance to the other points
–  Smallest sum of squares of distances to the other points
•  For distance metric d, the clustroid c of cluster C is: min_c Σ_{x ∈ C} d(x, c)²
Centroid is the average of all (data)points in the cluster; this means the centroid is an “artificial” point. Clustroid is an existing (data)point that is “closest” to all the other points in the cluster.
[Figure: a cluster of 3 data points, with the centroid (an artificial point) and the clustroid (one of the data points) marked.]
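The sum-of-squares version in code (a sketch of my own; the hamming helper is just a toy metric for illustration):

```python
def clustroid(cluster, d):
    """Return the point in `cluster` minimizing the sum of squared
    distances d(x, c)^2 to all points in the cluster."""
    return min(cluster, key=lambda c: sum(d(x, c) ** 2 for x in cluster))

# Toy metric: Hamming distance on equal-length strings.
def hamming(a, b):
    return sum(ch1 != ch2 for ch1, ch2 in zip(a, b))

print(clustroid(["abcde", "abcdx", "abxde", "xbcde"], hamming))  # "abcde"
```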
Defining “Nearness” of Clusters
•  (2) How do you determine the “nearness” of clusters?
–  Approach 2: Intercluster distance = minimum of the distances between any two points, one from each cluster
–  Approach 3: Pick a notion of “cohesion” of clusters, e.g., maximum distance from the clustroid
•  Merge the clusters whose union is most cohesive
Cohesion
•  Approach 3.1: Use the diameter of the merged cluster = maximum distance between points in the cluster
•  Approach 3.2: Use the average distance between points in the cluster
•  Approach 3.3: Use a density-based approach
–  Take the diameter or avg. distance, e.g., and divide by the number of points in the cluster
Implementation
•  Naïve implementation of hierarchical clustering:
–  At each step, compute pairwise distances between all pairs of clusters, then merge
–  O(N³)
•  A careful implementation using a priority queue can reduce the time to O(N² log N)
–  Still too expensive for really big datasets that do not fit in memory


k-means clustering
k–means Algorithm(s)
•  Assumes Euclidean space/distance
•  Start by picking k, the number of clusters
•  Initialize clusters by picking one point per cluster
–  Example: Pick one point at random, then k−1 other points, each as far away as possible from the previous points
Populating Clusters
•  1) For each point, place it in the cluster whose current centroid is nearest
•  2) After all points are assigned, update the locations of the centroids of the k clusters
•  3) Reassign all points to their closest centroid
–  Sometimes moves points between clusters
•  Repeat 2 and 3 until convergence
–  Convergence: Points don’t move between clusters and centroids stabilize
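A minimal NumPy sketch of this loop (an illustration with my own function and variable names, not code from the lecture):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest centroid, recompute
    centroids, and repeat until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    assign = None
    for _ in range(iters):
        # Steps 1/3: assign each point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)
        if assign is not None and np.array_equal(assign, new_assign):
            break  # convergence: no point changed cluster
        assign = new_assign
        # Step 2: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = points[assign == j].mean(axis=0)
    return centroids, assign

pts = np.array([[0, 0], [1, 2], [2, 1], [4, 1], [5, 0], [5, 3]], dtype=float)
centroids, labels = kmeans(pts, k=2)
```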
Example: Assigning Clusters
[Figure: data points (x) assigned to their nearest centroids; clusters after round 1.]
Example: Assigning Clusters
[Figure: clusters after round 2; the centroids have moved and points are reassigned.]
Example: Assigning Clusters
[Figure: clusters at the end; assignments and centroids no longer change.]
Getting the k right
How to select k?
•  Try different k, looking at the change in the average distance to centroid as k increases
•  The average falls rapidly until the right k, then changes little
[Figure: plot of average distance to centroid vs. k; the curve drops steeply and then flattens at the best value of k.]
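This “elbow” heuristic is easy to script on top of the kmeans sketch above (again my own illustration, reusing kmeans and pts from the earlier block):

```python
# Run k-means for a range of k and watch the average distance from each
# point to its assigned centroid level off.
import numpy as np

for k in range(1, 6):
    centroids, labels = kmeans(pts, k)
    avg_dist = np.mean(np.linalg.norm(pts - centroids[labels], axis=1))
    print(f"k={k}: average distance to centroid = {avg_dist:.3f}")
# Pick the k where the curve stops falling rapidly.
```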
Example: Picking k
[Figure: the point cloud clustered with k too small. Too few clusters; many long distances to centroid.]
Example: Picking k
[Figure: the point cloud clustered with the right k. Just right; distances rather short.]
Example: Picking k
[Figure: the point cloud clustered with k too large. Too many clusters; little improvement in average distance.]


The BFR Algorithm
Extension of k-means to large data
BFR Algorithm
•  BFR [Bradley-Fayyad-Reina] is a variant of k-means designed to handle very large (disk-resident) data sets
•  Assumes that clusters are normally distributed around a centroid in a Euclidean space
–  Standard deviations in different dimensions may vary
•  Clusters are axis-aligned ellipses
•  Efficient way to summarize clusters (want memory required O(clusters), not O(data))
BFR Algorithm
•  Points are read from disk one main-memory-full at a time
•  Most points from previous memory loads are summarized by simple statistics
•  To begin, from the initial load we select the initial k centroids by some sensible approach:
–  Take k random points
–  Take a small random sample and cluster optimally
–  Take a sample; pick a random point, and then k−1 more points, each as far from the previously selected points as possible
Three Classes of Points
Three sets of points that we keep track of:
•  Discard set (DS):
–  Points close enough to a centroid to be summarized
•  Compression set (CS):
–  Groups of points that are close together but not close to any existing centroid
–  These points are summarized, but not assigned to a cluster
•  Retained set (RS):
–  Isolated points waiting to be assigned to a compression set
BFR: “Galaxies” Picture
[Figure: a cluster whose points are in the DS, with its centroid marked; several small compressed sets whose points are in the CS; scattered points in the RS.]
Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
Summarizing Sets of Points
For each cluster, the discard set (DS) is summarized by:
•  The number of points, N
•  The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension
•  The vector SUMSQ, whose ith component is the sum of squares of the coordinates in the ith dimension
[Figure: a cluster, all of whose points are in the DS, with its centroid marked.]
Summarizing Points: Comments
•  2d + 1 values represent a cluster of any size
–  d = number of dimensions
•  The average in each dimension (the centroid) can be calculated as SUMi / N
–  SUMi = ith component of SUM
•  The variance of a cluster’s discard set in dimension i is: (SUMSQi / N) − (SUMi / N)²
–  And the standard deviation is the square root of that
•  Next step: Actual clustering
Note: Dropping the “axis-aligned” clusters assumption would require storing a full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dimensional vector, it would be a d × d matrix, which is too big!
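These statistics support O(1) updates and cheap centroid/variance queries. A minimal sketch (my own class, assuming points are NumPy arrays and clusters are axis-aligned, as the slides do):

```python
import numpy as np

class ClusterSummary:
    """BFR-style summary: 2d + 1 numbers (N, SUM, SUMSQ) per cluster."""
    def __init__(self, d):
        self.n = 0
        self.sum = np.zeros(d)
        self.sumsq = np.zeros(d)

    def add(self, point):
        # Adding a point only touches the three statistics;
        # the point itself can then be discarded.
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # Per-dimension variance: SUMSQ_i / N - (SUM_i / N)^2
        return self.sumsq / self.n - (self.sum / self.n) ** 2

    def merge(self, other):
        # Two summaries combine by component-wise addition
        # (used when merging CS subclusters).
        self.n += other.n
        self.sum += other.sum
        self.sumsq += other.sumsq
```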
The “Memory-Load” of Points
Processing the “Memory-Load” of points (1):
•  1) Find those points that are “sufficiently close” to a cluster centroid and add those points to that cluster and to the DS
–  These points are so close to the centroid that they can be summarized and then discarded
•  2) Use any main-memory clustering algorithm to cluster the remaining points and the old RS
–  Clusters go to the CS; outlying points to the RS
The “Memory-Load” of Points
Processing the “Memory-Load” of points (2):
•  3) DS: Adjust the statistics of the clusters to account for the new points
–  Add Ns, SUMs, SUMSQs
•  4) Consider merging compressed sets in the CS
•  5) If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster
BFR: “Galaxies” Picture
[Figure: the same “galaxies” picture as before: a DS cluster with its centroid, compressed sets in the CS, and points in the RS.]
A Few Details…
•  Q1) How do we decide if a point is “close enough” to a cluster that we will add the point to that cluster?
•  Q2) How do we decide whether two compressed sets (CS) deserve to be combined into one?
How Close is Close Enough?
•  Q1) We need a way to decide whether to put a new point into a cluster (and discard it)
•  BFR suggests two ways:
–  The Mahalanobis distance is less than a threshold
–  High likelihood of the point belonging to the currently nearest centroid
Mahalanobis Distance
•  Normalized Euclidean distance from the centroid
•  For point (x1, …, xd) and centroid (c1, …, cd):
1.  Normalize in each dimension: yi = (xi − ci) / σi
2.  Take the sum of the squares of the yi
3.  Take the square root
(σi … standard deviation of the points in the cluster in the ith dimension)
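Written out in code (a sketch of my own, reusing the ClusterSummary class from above for the per-dimension standard deviations):

```python
import numpy as np

def mahalanobis(point, centroid, sigma):
    """Normalized Euclidean distance: sqrt(sum(((x_i - c_i) / sigma_i)^2)).
    `sigma` holds the per-dimension standard deviations of the cluster."""
    y = (point - centroid) / sigma
    return np.sqrt(np.sum(y ** 2))

def close_enough(point, summary, threshold_sds=2.0):
    """Accept a point into a cluster if its Mahalanobis distance is below
    a threshold, e.g. 2 standard deviations, i.e. 2 * sqrt(d)."""
    sigma = np.sqrt(summary.variance())
    d = len(sigma)
    return mahalanobis(point, summary.centroid(), sigma) < threshold_sds * np.sqrt(d)
```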
Mahalanobis Distance
•  If clusters are normally distributed in d dimensions, then after the transformation, one standard deviation = √d
–  i.e., 68% of the points of the cluster will have a Mahalanobis distance < √d
•  Accept a point for a cluster if its M.D. is < some threshold, e.g., 2 standard deviations
Picture: Equal M.D. Regions
•  Euclidean vs. Mahalanobis distance
[Figure: contours of equidistant points from the origin for (a) uniformly distributed points with Euclidean distance, (b) normally distributed points with Euclidean distance, and (c) normally distributed points with Mahalanobis distance.]
Should 2 CS clusters be combined?
Q2) Should 2 CS subclusters be combined?
•  Compute the variance of the combined subcluster
–  N, SUM, and SUMSQ allow us to make that calculation quickly
•  Combine if the combined variance is below some threshold
•  Many alternatives: Treat dimensions differently, consider density
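With additive summaries the test is cheap. A sketch of my own on top of ClusterSummary (whether the threshold applies per dimension or in aggregate is a design choice; here it is per dimension):

```python
import numpy as np

def should_merge(a, b, max_variance):
    """Decide whether two CS subclusters deserve to be combined: form the
    combined statistics by addition and check the resulting variance."""
    n = a.n + b.n
    s = a.sum + b.sum
    sq = a.sumsq + b.sumsq
    combined_variance = sq / n - (s / n) ** 2  # per-dimension variance
    return np.all(combined_variance < max_variance)
```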


The CURE Algorithm
Extension of k-means to clusters of arbitrary shapes
The CURE Algorithm
•  Problem with BFR/k-means:
–  Assumes clusters are normally distributed in each dimension
–  And the axes are fixed – ellipses at an angle are not OK
•  CURE (Clustering Using REpresentatives):
–  Assumes a Euclidean distance
–  Allows clusters to assume any shape
–  Uses a collection of representative points to represent clusters
[Figure: an axis-aligned ellipse vs. a cluster at an angle.]
Example: Stanford Salaries
[Figure: a scatter plot of salary vs. age; points labeled “e” and “h” form two elongated groups.]
Starting CURE
Two-pass algorithm. Pass 1:
•  0) Pick a random sample of points that fit in main memory
•  1) Initial clusters:
–  Cluster these points hierarchically – group nearest points/clusters
•  2) Pick representative points:
–  For each cluster, pick a sample of points, as dispersed as possible
–  From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster
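Step 2 is the heart of CURE: pick a dispersed sample, then shrink it toward the centroid. A sketch of that step (my own illustration; cluster is assumed to be a NumPy array of the points in one initial cluster):

```python
import numpy as np

def pick_representatives(cluster, n_reps=4, shrink=0.2):
    """Pick n_reps dispersed points from `cluster`, then move each one
    `shrink` (e.g. 20%) of the way toward the cluster centroid."""
    centroid = cluster.mean(axis=0)
    # Greedy dispersal: start from the point farthest from the centroid,
    # then repeatedly add the point farthest from the chosen set.
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(n_reps, len(cluster)):
        dists = np.min(
            [np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0
        )
        reps.append(cluster[np.argmax(dists)])
    # Shrink each representative toward the centroid.
    return np.array([r + shrink * (centroid - r) for r in reps])
```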
Example: Initial Clusters
[Figure: the salary-vs-age plot with the “e” and “h” points grouped into initial hierarchical clusters.]
Example: Pick Dispersed Points
[Figure: the same plot; for each cluster, pick (say) 4 remote points as representatives.]
Example: Pick Dispersed Points
[Figure: the same plot; the representative points are moved (say) 20% toward the centroid of their cluster.]
Finishing CURE
Pass 2:
•  Now rescan the whole dataset and visit each point p
•  Place p in the “closest cluster”
–  Normal definition of “closest”: Find the closest representative to p and assign p to that representative’s cluster
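Pass 2 then reduces to a nearest-representative lookup. A sketch under the same assumptions (reps_by_cluster is a hypothetical mapping from cluster id to that cluster’s representative points):

```python
import numpy as np

def assign(point, reps_by_cluster):
    """Place `point` in the cluster owning the closest representative."""
    return min(
        reps_by_cluster,
        key=lambda c: min(np.linalg.norm(point - r) for r in reps_by_cluster[c]),
    )
```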
Summary
•  Clustering: Given a set of points, with a notion of distance between points, group the points into some number of clusters
•  Algorithms:
–  Agglomerative hierarchical clustering: centroid and clustroid
–  k-means: initialization, picking k
–  BFR
–  CURE

