Comparison of 6 K-Means Clustering Algorithms under
Differing Data Conditions
Colleen M. Farrelly
BST699
November, 2014
"Submitted to the Graduate Programs in Public Health
Sciences in partial fulfillment of the requirements for the
degree of Master of Science in Biostatistics"
Abstract: Many variations of the k-means algorithm exist,
and this study aims to test promising k-means algorithms
from previous studies with new algorithms largely untested
against other algorithms in simulation studies under
varying conditions of noise, cluster overlap, cluster size,
and heterogeneity. Algorithms tested include a genetic
algorithm-based method, an expectation-maximization
algorithm-based method, a Bayesian method, a chain-restart
method, a simple restart method, and an unmodified k-means algorithm.
Each of these seemed to have preferred conditions under
which it performed favorably. The MacQueen algorithm with
10 restarts showed exceptionally strong performance
compared to other algorithms across simulations. Three
algorithms (MacQueen algorithm with 10 restarts, the
expectation-maximization algorithm, and the Bayesian
algorithm) were applied to a real dataset to examine
whether or not these algorithms find the same patterns
within a dataset; all three algorithms found distinctly
different patterns within the data. This finding, combined
with differing performance across simulations for each
algorithm, suggests that care must be taken when selecting
a k-means algorithm to apply to an empirical dataset.
Introduction
The goal of k-means clustering is to partition a data
set into a series of mutually exclusive and exhaustive
clusters such that the overall sum-of-squares error between
each cluster mean and the observations belonging to that
cluster is minimized (i.e., the variance within each
cluster is as small as possible1). Generally, it is assumed
that there is an underlying set of latent categories or
groups in the dataset. Further, it has been shown that,
under certain conditions, the k-means clustering problem is
equivalent to finite mixture modeling, of which latent
class analysis (LCA) and latent profile analysis (LPA) are
also special instances1. To achieve the goal of constructing
clusters in this fashion, many algorithmic variations exist
(referred to throughout as "k-means clustering
algorithms"), providing researchers with a plethora of
options for conducting a "simple" k-means cluster analysis.
There is some debate in the literature regarding
whether cluster analysis or LCA should be used2. In cases
where hard categories are desired or assumed, LCA (which
uses probabilistic partitions and does not assume that each
case fits into only one category) may not be useful. Other
conditions under which k-means clustering performs equally
well or better than LCA (and, hence, where cluster-analytic
methods would be more efficacious) include homogeneity, in
which observations within a cluster have similar scores on
the predictor variables used in analysis3-5 and small
overlap between groups. Additionally, when investigating
the performance of finite mixture modeling, Steinley and
Brusco6-7 found that k-means clustering performed as well as
or better when aspects about the mixture model were unknown
(specifically, both the complexity of the within-group
covariance matrices and the number of clusters themselves).
One of the inherent difficulties in cluster analysis
is that the problem of dividing N observations into K
clusters is NP-hard; that is, it is a computationally
complex problem for which finding the optimal answer would
require complete enumeration. Unfortunately, the number of
possible solutions can be approximated by K^N/K!, creating a
search space that is infeasible to exhaust. To combat this
problem, several heuristic approaches have been developed,
each designed to search the solution space in a different
manner in hopes of obtaining a better solution. Indeed, in
the past few years, many studies within the social and
medical sciences have been employing various k-means
algorithms as a method of data partitioning and
classification8-15. However, with this recent proliferation
of new and modified k-means cluster algorithms, it is
difficult to discern whether a given study has employed a
k-means algorithm that is appropriate for the associated
dataset8,16-19. Few studies have systematically compared
different k-means clustering algorithms, and those studies
have not examined noise or heterogeneous conditions during
their simulations, leaving a gap in knowledge about
applying k-means algorithms to collected datasets1,20-21. In
addition, extant simulation studies comparing multiple k-
means algorithms have not explored newer approaches, such
as Bayesian methodology for k-means clustering. Therefore,
the real-world data conditions under which a variety of
algorithms perform well, and which of these conditions vary
between or among algorithms, are largely unknown.
Researchers hoping to employ one of these techniques in
their data analysis need a guide to proper implementation
and adaptation of the various k-means algorithms if they
are to reach correct conclusions regarding the
relationships that exist in their data.
The present study builds on the Brusco and Steinley20
study in which several k-means algorithms were tested under
several simulation parameters and then applied to several
empirical datasets. In their study, Brusco and Steinley20
examined nine algorithms, ranging from the earliest
conceptions of the k-means algorithm to newer computational
methods, such as simulated annealing and genetic
algorithms. In simulations, the genetic algorithm and the
k-means hybrid algorithms performed exceptionally well, and
algorithms similar to these were chosen for further
investigation in the current study to determine the extent
to which these algorithms retain their high performance in
settings with many noise variables (defined as variables in
the data set that are not related to the cluster
structure), or with heterogeneity among items entered into
the cluster analysis. In addition, we included algorithms
employing novel solutions that have not yet been tested in
simulation comparison studies, as a way of determining how
these new approaches perform relative to those that have
been previously tested.
In this paper, we examine six distinct variants of k-
means clustering that are being employed in studies today
(genetic and expectation-maximization algorithms that
improve optimization, a simple restart method, a chain
restart method, a Bayesian-based method, and an unmodified
k-means algorithm) through a simulation of small datasets
with varying signal-to-noise, mixing fraction (relative
size of clusters), cluster overlap, and cluster
heterogeneity conditions. In doing so, we hope to
understand the relative strengths and weaknesses of each
algorithm under conditions common in social science
datasets. To understand how these algorithms might function
on a real dataset, we simulated two datasets with larger
sample sizes, high overlap among clusters, heterogeneity,
and more variables to assess which algorithms seem to
perform well under these complex problems. We then applied
the most promising algorithms to a real dataset, the Multi-
Site University Study of Identity and Culture (MUSIC)22-23,
which is similar to the larger simulated datasets. We took
this last step to examine the extent to which these
algorithms show similar clustering solutions in a real
dataset. This two-step approach provides information as to
whether the various algorithms are detecting the same
latent data partitions in both simulated and real data.
K-Means Algorithm Structure
The goal of k-means algorithms is minimizing the sum
of squares error (SSE), which maximizes within-cluster
homogeneity and between-cluster heterogeneity20. The SSE is
calculated as follows:

   SSE = min over {μ1, ..., μk} of  Σ_{h=1}^{k} Σ_{x ∈ Ch} ||x − μh||²
where there are k clusters, each with mean μh, and x
represents an observation assigned to the hth cluster. The
basic k-means algorithm proceeds as follows (where specific
variations are described in their respective section
below). First, k initial seeds are chosen as centers of the
clusters, either by randomly choosing observations within a
dataset or by setting cluster centers a priori. Second,
each observation is assigned to one of the k clusters
according to the minimization of the SSE. Based on this new
assignment, means are recomputed for each cluster and
points are reassigned to clusters to minimize SSE. This
process is repeated until no observations can be reassigned
(or for a fixed number of iterations). As discussed in
detail by Steinley1, this stopping criterion is only
guaranteed to result in a locally optimal solution; in
other words, there could exist another assignment of
observations to clusters that resulted in a lower value for
SSE.
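The loop described above can be sketched in a few lines. The following is an illustrative Python version (the study itself used R implementations; the random seeding and Euclidean distance here are simplifying choices for exposition):

```python
import random

def kmeans(points, k, max_iter=90, seed=0):
    """Sketch of the basic k-means loop: seed k centers from the
    data, assign points to nearest centers, recompute means, and
    repeat until no observation can be reassigned."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest center.
        new_assign = [
            min(range(k),
                key=lambda h: sum((x - c) ** 2 for x, c in zip(p, centers[h])))
            for p in points
        ]
        if new_assign == assign:
            break  # no observation moved: a locally optimal solution
        assign = new_assign
        # Update step: recompute each cluster mean from its members.
        for h in range(k):
            members = [p for p, a in zip(points, assign) if a == h]
            if members:
                centers[h] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sum((x - c) ** 2 for x, c in zip(p, centers[a]))
              for p, a in zip(points, assign))
    return assign, centers, sse
```

On well-separated data this converges in a handful of iterations; on harder data it stops at whatever local optimum the initial seeds lead to, which motivates the modifications discussed in the following sections.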
Modifications of K-Means Algorithms and Overview of
Algorithms Tested in This Study
Lloyd-Forgy and MacQueen Algorithms without Modification
Two basic k-means algorithms exist that do not involve
modification to the fundamental k-means algorithm's steps
and calculation procedure. The Lloyd-Forgy algorithm
randomly selects k points from the dataset as starting
centers for k clusters and proceeds through the above
steps, waiting to recalculate clusters until all points in
the dataset have been assigned24. However, the MacQueen
algorithm proceeds through the above algorithm with
recalculation of clusters each time a point is moved,
rather than recalculating after all points have been
assigned25.
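The difference is easiest to see in code. The following Python fragment sketches a MacQueen-style pass in which the affected cluster mean is updated immediately after each assignment via a running-mean formula (an illustrative simplification; the function name and seeding convention are hypothetical, not the original algorithm's exact procedure):

```python
def macqueen_pass(points, centers):
    """One MacQueen-style pass: the affected cluster mean is updated
    immediately after each point is assigned, using running counts,
    rather than after a full pass as in Lloyd-Forgy. Assumes each
    center was seeded from one observation."""
    counts = [1] * len(centers)
    assign = []
    for p in points:
        h = min(range(len(centers)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
        assign.append(h)
        counts[h] += 1
        # Incremental mean update: mu <- mu + (p - mu) / n
        centers[h] = [c + (x - c) / counts[h] for c, x in zip(centers[h], p)]
    return assign, centers
```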
Need for Improvement of Basic Algorithms
One of the largest hurdles in k-means clustering is
the hill-climbing nature of the algorithm in its search for
a solution. The basic k-means search to minimize error and
imprecision does not escape local optima well and often
fails to find the global optimum16,19. This situation is akin
to a mountain climber getting stuck on a lower peak rather
than making it all the way to the mountain's summit.
Unfortunately, the analyst cannot see the higher peak of
the function as the mountain climber can see the summit
when at the (lower) peak of the mountain. Several
modifications of the k-means clustering algorithm have been
developed to deal specifically with this problem, including
use of an expectation/maximization algorithm, a genetic
algorithm, two restart methods to find optimal starting
points, and calculation of posteriors through Bayesian
methods16,19,17,8.
Genetic Algorithm Approach
Rather than relying on a hill-climbing algorithm, one
solution to this problem employs a genetic algorithm to
optimize clustering partitions16. Genetic algorithms are
computing tools based on the principles of evolutionary
biology and population genetics, such that they efficiently
search a given space of possible solutions to a problem,
including variable selection and optimal partitioning26.
These algorithms thrive in situations in which other
enumerative and machine-learning algorithms stall or fail
to converge upon global solutions, and genetic algorithms
have been successfully adapted for problems in statistical
physics27, quantum chromodynamics28, aerospace engineering29,
molecular chemistry30, spline-fitting with function
estimation31, and statistics16,32-33.
Genetic algorithms start with a population of
different, randomly-generated possible solutions. Each
solution is described by the set of cluster memberships and
is placed into a vector format referred to as a chromosome.
The algorithm evaluates each individual solution based on
the SSE, ranking them from smallest to largest16,26. The
algorithm then enters a mutation step in which each
solution undergoes further evolution/change. The amount of
mutation depends on a given individual solution's ranking
on SSE. Solutions that have smaller SSE undergo less
mutation because they may be near the optimum; those with
higher SSE undergo more mutation because they are further
away from the optimum. The number of solutions remains
constant in size, and these steps are repeated to create a
new set of solutions until a convergence criterion is met,
usually a pre-specified number of sets of solutions (often
referred to as generations in the genetic algorithm
literature). At the designated stopping point, the
individual solution with the lowest SSE is chosen as the
best solution and other individual solutions are discarded.
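A toy version of this scheme, with membership vectors as chromosomes, rank-dependent mutation, and an elitist step that preserves the best solution, might look like the following Python sketch (parameter values are illustrative, not those of Krishna and Murty16):

```python
import random

def sse(points, assign, k):
    """Within-cluster sum of squares for a given labeling."""
    total = 0.0
    for h in range(k):
        members = [p for p, a in zip(points, assign) if a == h]
        if not members:
            continue
        mu = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in members)
    return total

def ga_cluster(points, k, pop_size=20, generations=50, seed=1):
    """Toy genetic search over cluster-membership 'chromosomes':
    solutions are ranked by SSE, worse-ranked solutions mutate more,
    and the current best is carried forward unchanged."""
    rng = random.Random(seed)
    n = len(points)
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: sse(points, a, k))
        new_pop = [pop[0]]  # elitism: keep the best solution
        for rank, chrom in enumerate(pop[1:], start=1):
            rate = 0.05 + 0.3 * rank / pop_size  # more mutation for worse ranks
            child = [rng.randrange(k) if rng.random() < rate else a for a in chrom]
            new_pop.append(child)
        pop = new_pop
    pop.sort(key=lambda a: sse(points, a, k))
    return pop[0], sse(points, pop[0], k)
```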
One of the advantages of the genetic algorithm
approach is that this algorithm is computationally
efficient and appears to find better solutions than does
the unmodified k-means algorithm16. In addition, by
utilizing a population of solutions (i.e., the set of
solutions to be evaluated by the genetic algorithm), it is
unlikely that all solutions in a population will converge
on a local optimum at each step because individuals vary at
the initialization of the population, greatly increasing
the chances of convergence to a global optimum within a
large search space34-35.
In addition to escaping local optima, this genetic
algorithm variation also allows k-means clustering to
accommodate functions other than straightforward SSE
measures, including functions that may be nonlinear or
multimodal16. This algorithm seems to find better solutions
(nearly error-free in Krishna and Murty's test data) and in
smaller datasets it is also computationally efficient16. We
note that there are numerous manners in which genetic
algorithms can be operationalized. For instance, Brusco and
Steinley20 used a genetic algorithm that was based on a
hybridized k-means algorithm; however, such an approach is
unlikely to be utilized by applied researchers because it
is not available in standard software packages. As such, we
use the Krishna and Murty16 algorithm as implemented in an R
package.
Restart Modification
Another possible solution to the local optimum
problem, without resorting to a potentially
computationally-intensive genetic algorithm, is to search
for better initial clustering points (rather than a random
or nearest-point start to the algorithm) in the hopes that
starting conditions will be closer to the global optimum
than to local optima.
The simple restart methods are the most straightforward
modification. These methods involve n runs of the
algorithm, each with a different starting point chosen at
random17. Like the genetic algorithm, this method allows for
many iterations of the k-means algorithm; however, this
approach does not optimize the starting cluster points as
in a genetic algorithm and does not search the solution
space to obtain its starting point variations. However, the
restart algorithm is less computationally intensive, is
available as a setting in several software packages, and
has served as the starting point for several newer
algorithms17,19. With enough starting points chosen, it is
possible that this algorithm will find an initial
clustering close enough to a global optimum to converge on
the best clustering solution. Additionally, Steinley and
Brusco20 found that numerous random initializations compared
favorably to other methods for initializing k-means
clustering algorithms. Brusco and Steinley36 indicated that
a version of k-means clustering (one that combined the
method described by Lloyd24 with MacQueen25) that relied on
multiple restarts compared favorably with their genetic
algorithm implementation.
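The restart method itself is only a thin wrapper around the base algorithm: run it n times from different random seedings and keep the lowest-SSE solution. A self-contained Python sketch (analogous in spirit to nstart=10 in R's kmeans function, though not that implementation):

```python
import random

def lloyd_once(points, k, rng, max_iter=90):
    """One run of basic k-means from a random seeding; returns
    (assignments, SSE). Compact illustrative version."""
    centers = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(max_iter):
        new = [min(range(k),
                   key=lambda h: sum((x - c) ** 2 for x, c in zip(p, centers[h])))
               for p in points]
        if new == assign:
            break
        assign = new
        for h in range(k):
            mem = [p for p, a in zip(points, assign) if a == h]
            if mem:
                centers[h] = [sum(col) / len(mem) for col in zip(*mem)]
    err = sum(sum((x - c) ** 2 for x, c in zip(p, centers[a]))
              for p, a in zip(points, assign))
    return assign, err

def kmeans_restarts(points, k, n_starts=10, seed=0):
    """Simple restart method: n_starts independent random starts,
    keeping the solution with the smallest SSE."""
    rng = random.Random(seed)
    best_assign, best_err = None, float("inf")
    for _ in range(n_starts):
        assign, err = lloyd_once(points, k, rng)
        if err < best_err:
            best_assign, best_err = assign, err
    return best_assign, best_err
```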
Chain Variation Restart Method
An extension of the restart method, when applied to a
base k-means clustering algorithm, achieves starting point
initialization using a chain variation method17. This
variation involves moving one point to another cluster and
running the k-means algorithm to evaluate whether the move
yields a better clustering solution (a smaller SSE); the
chain of variations consists of a sequence of these moves
(called a Kernighan-Lin chain), each evaluated by running
the k-means algorithm17. In this way, the chain-variation restart
method is somewhat similar to the genetic algorithm's
mutation approach in that a series of initial clusterings
is iteratively changed to identify a better solution by
starting from an existing solution. However, the restart
method does not seek to identify a population of possible
solutions; rather, only one individual solution, chosen at
random, undergoes this chain procedure. Experimental
results suggest that this is an efficacious adaptation that
does allow the algorithm to escape local optima. However,
with the largest datasets in the initial development of
this method, the algorithm typically requires a chain of 30
or more moves to reach an initial clustering that yields a
different solution from that of the original k-means
algorithm17. It is unknown how this algorithm
performs as datasets increase further in size.
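The core move, relocating a single observation and keeping the change only if the SSE drops, can be sketched as follows. Note that a true Kernighan-Lin chain also permits temporarily worsening moves; this simplified Python version accepts improvements only:

```python
def partition_sse(points, assign, k):
    """Within-cluster sum of squares for a labeling."""
    total = 0.0
    for h in range(k):
        mem = [p for p, a in zip(points, assign) if a == h]
        if not mem:
            continue
        mu = [sum(col) / len(mem) for col in zip(*mem)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in mem)
    return total

def chain_moves(points, assign, k, chain_length=30):
    """Greedy sketch of a chain of single-point relocations: move
    one observation to another cluster and keep the move if it
    lowers the SSE; stop when no single move improves."""
    best = list(assign)
    best_err = partition_sse(points, best, k)
    for _ in range(chain_length):
        improved = False
        for i in range(len(points)):
            for h in range(k):
                if h == best[i]:
                    continue
                trial = list(best)
                trial[i] = h  # relocate observation i to cluster h
                err = partition_sse(points, trial, k)
                if err < best_err:
                    best, best_err = trial, err
                    improved = True
        if not improved:
            break
    return best, best_err
```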
Expectation-Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a two-
step iterative algorithm commonly used to estimate model
parameters when there are missing or unobserved data37. This
method builds on maximum likelihood, which assumes a
distribution where the data meet the assumptions of a
parametric model. Each iteration of this algorithm involves
two steps: (1) the Expectation step, given an initial guess
for the parameters of the data model, uses this model to
form the expectation of the missing data conditional on the
model, thereby creating an estimate of the missing or
unobserved data; and (2) the Maximization step, in which,
given the estimates of the missing data, the likelihood
function is maximized to generate a new estimate of the
parameters of the data model. This is repeated for many
iterations until a
convergence criterion is met, usually involving the
difference score between iteration estimates getting very
close to zero37. The EM algorithm fits naturally into the
clustering problem because cluster membership is not
observed in the dataset.
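To make the two steps concrete, here is a minimal one-dimensional, two-component Gaussian mixture fitted by EM, where cluster membership plays the role of the unobserved data (a bare-bones Python sketch: the initialization, fixed component count, and variance floor are simplifying assumptions):

```python
import math

def em_gmm_1d(data, n_iter=50):
    """Two-component 1-D Gaussian mixture fitted by EM."""
    # Initial guesses: the extreme points as means, unit variances.
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point,
        # i.e., the expectation of the unobserved membership.
        resp = []
        for x in data:
            p = [w[j] * pdf(x, mu[j], var[j]) for j in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances given
        # the expected memberships.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = max(
                sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-6)  # floor to avoid a collapsing component
    return mu, var, w
```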
An EM-based initialization procedure was proposed by
Bradley and Fayyad17. Similar to the chain-variation method,
this algorithm runs through the k-means algorithm for each
variation of starting points (in this case, based upon
bootstrap samples) and chooses the best initial clustering
as a starting point for the full k-means algorithm17. This
algorithm allows the user to specify EM-algorithm
parameters for a given problem. The EM algorithm has been
shown to find better solutions compared to non-refined
starting point k-means algorithms, but it shows poor
performance under some conditions20. Although this EM
algorithm variation is one of the more widely tested
algorithms, it has not been tested against the newer
algorithms, including the genetic algorithm method and the
Bayesian method.
Bayesian Approach
Another approach to k-means clustering is to use
Bayesian methods to approximate the posterior probabilities
of class membership given variables of interest, and then
cluster based upon this inferred probability density
function (PDF) shape (basically chopping the PDF into
clusters at natural cut-points). The algorithm proposed by
Fuentes and Casella8 and Gopal et al.38 accomplishes this
through the Metropolis-Hastings algorithm, which is a type
of Markov chain Monte Carlo sampling method.
Stated more simply, the Bayesian approach uses
information from the observed data to infer the PDF from
which the data likely came without knowing the PDF a
priori. This is accomplished through repeated sampling,
with replacement, of the observed data until a sufficient
number of data points, typically many thousands or
millions, have been drawn to construct the distribution. In
this way, enough information is gathered about the
distribution to compute marginal and joint probabilities
directly from the repeated sampling distribution, called
the posterior distribution, or to derive other statistical
information about the joint distribution.
In addition to clustering, a byproduct of the Bayesian
approach is the computation of the Bayes factor, which
tests the hypothesis that the number of existing clusters
is k, against the null, which states that there is only 1
cluster in the data. This is done to ensure that the
correct number of clusters is determined prior to placing
cases into clusters:
   BF10 = m(Y | n = k) / m(Y | n = 1)
This hypothesis test addresses a major limitation of
k-means clustering: specifying the number of clusters39-40.
The function m(Y|n=k) denotes the distribution of the data
given that there are n=k clusters. This Bayesian procedure
has shown promise in the few studies employing it; however,
this algorithm has a long run-time and has not been tested
on many data structures8.
The Present Study
The six algorithms selected for use in this study
were: (1) Genetic Algorithm-Based Hybrid, (2) Unmodified
Lloyd-Forgy, (3) Chain Variation of the Lloyd-Forgy, (4)
Bayesian-Based, (5) EM-Based, and (6) MacQueen Restart with
10 initial point restarts. These represent a range of
approaches to the k-means computational problem and include
some of the more-promising algorithms from Brusco and
Steinley20. In addition, given the scarcity of simulation
studies employing some of these algorithms, the algorithms
selected for the present study have been largely untested
against one another. Our goal was to identify the algorithm
that provided the best fit to the simulated data, as
determined by the Hubert-Arabie Adjusted Rand Index
(ARIHA), and then to investigate whether or not a subset of
these algorithms finds similar clusters within a large
empirical dataset (MUSIC).
The correct method for determining the number of
clusters extant in an empirical dataset is debated1,39-41, and
this remains an open question in the field. Two previous
simulation studies of k-means clustering methods did not
find a significant difference across cluster numbers36,41,
suggesting that this factor is less important in simulation
studies comparing k-means algorithms than other factors,
including cluster overlap, signal-to-noise ratio, and
cluster density36,41. In light of these previous studies, k
was set at 4, and simulation parameters
focused on those factors shown to impact algorithm
performance more noticeably36,41, as the number of clusters
problem is beyond the scope of this investigation and
merits its own simulation study of existing cluster number
determination methodology39.
Methods
Two approaches to cluster study simulations exist: (1)
many conditions, with few replicates and (2) fewer
conditions, with many replicates. Steinley's applied
simulation studies tend to focus on many conditions20,41-43,
whereas studies developing a new method have focused on
many replicates to empirically demonstrate and test
algorithm performance16-17,44.
In terms of testing the robustness of a clustering
algorithm to variations in data parameters, several
parameter variations have been suggested17,20,41, including
number of signal variables (or signal-to-noise ratio with
signal providing relevant information and noise adding no
important information), probability of overlap among
clusters, degree of homogeneity/heterogeneity within
clusters, sample size, and cluster density (in which
classes do not have equal cluster size; e.g., three clusters
each consisting of 30% of cases, and a fourth cluster
consisting of the remaining 10% of cases). At N = 200, the
impact of sample size and number of clusters is
negligible20. Because performance on simulated datasets such
as those used in our study is largely unknown in the
literature (particularly for signal-to-noise ratio
variations), our study will use the few-conditions, many-
replicates technique so as to maximize power, running each
loop 30 times and averaging responses across iterations.
Based on these previous indications, the following
parameters/variations were chosen as parameters to generate
the datasets through the R package MixSim42:
(1) N=200
(2) K=4
(3) Signal Variables=2, 5, 10
(4) Noise Variables=10 (giving three signal-to-noise
ratios when combined with (3))
(5) Cluster Homogeneity=yes or no
(6) Mixing Fraction=0.25, 0.5, or 1 (i.e., one small, all
equal, or one large group size)
(7) Average Overlap among Clusters=0.1, 0.2, 0.4
These variations yield 54 unique small-dataset
conditions, each analyzed with the six algorithms under
study (genetic algorithm-based, Lloyd-Forgy chain
variation, EM-based, Lloyd-Forgy unmodified, MacQueen with
10 restarts, and Bayesian). Previous studies have suggested combining across
conditions to discern the effects of each parameter41.
Therefore, after 30 separate datasets for each parameter
combination were simulated and analyzed by the six
algorithms of interest and the diagnostic criteria averaged
over all 30 replicates, each of the 11 parameter variations
was examined across all other conditions.
To directly compare algorithm classification with true
classification, the Hubert-Arabie Adjusted Rand Index
(ARIHA) was chosen, as it has been shown to perform well in
comparing cluster solutions45. For this index, values of 0
correspond to chance levels of agreement, and values of 1
indicate a perfect classification match between the
algorithm-generated and true class solutions. It has been
shown that the ARIHA is robust across cluster densities
(mixing fractions), number of clusters, and sample size39,45.
Therefore, the ARIHA was chosen as our measure of algorithm
performance across conditions.
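For reference, the ARIHA can be computed directly from the contingency table of the two labelings; a compact Python version of the standard formula (not tied to any particular package):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie Adjusted Rand Index between two labelings:
    1 indicates identical partitions, 0 chance-level agreement."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions trivial
    return (sum_cells - expected) / (max_index - expected)
```

Because the index is computed from pair counts, it is invariant to label permutations, which is why it suits comparing an algorithm's arbitrary cluster labels against the true classes.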
In addition, in the present study we also aimed to
test the performance of these algorithms on a large real-
world simulated dataset. Therefore, two final simulations
were conducted with N=2500, variables=15 (once with
12 signal, 3 noise and a second time with 5 signal, 10
noise), average overlap of 0.2, and mixing fraction of 0.1,
to identify which algorithms perform well on a real
dataset. The three best performing, most disparate
algorithms were selected to be re-estimated on a set of
risk behavior variables from the MUSIC dataset. MUSIC
consists of 9,750 college-student participants who reported
on their engagement in a number of risky behaviors during
the 30 days prior to assessment.
In our simulations, the number of clusters was fixed;
however, in the MUSIC dataset the optimal number of
clusters is unknown. The NbClust package in R provides 20
different, non-graphical algorithms for determining the
appropriate number of clusters in a dataset, including
Ratkowsky, Scott, Marriott, TraceCovW, and TraceW.
Therefore, to assess which of these procedures are most
appropriate for the MUSIC dataset, we assessed the
performance of each of these procedures with 1000 simulated
datasets according to the aforementioned 12 signal, 3 noise
variable scenarios. Results were tallied for each algorithm
according to (a) the correct number of clusters, k; (b) k-1
clusters; and (c) k+1 clusters. The best performing
algorithms were then used to obtain an estimate of the
likely number of clusters within the MUSIC data. The two
most likely numbers of clusters according to these
algorithms were then used in runs of the three algorithms
chosen for case study analysis. Readers interested in
methods used to determine optimal cluster number within a
dataset are referred to Tibshirani et al.46, Milligan and
Cooper39; and Dimitriadou et al.47. The best performing
algorithms were then used on the MUSIC dataset with 10
replicates.
Within a 30-replicate loop per condition, all
simulated datasets were created using the R package MixSim42
and then used to evaluate each of the six algorithms. The
MacQueen restart algorithm was implemented through the use
of the kmeans function contained in the R base package with
10 restarts (nstart=10) and a maximum of 90 iterations of
the k-means algorithm (iter.max=90). The genetic algorithm
variation was implemented using the R package skmeans48 with
parameters calibrated to 140 generations, a mutation rate
of 0.2 per generation, and a population size of 20. The
Lloyd-Forgy algorithm with no restarts or chain variations
was also implemented with the skmeans package and 90
iterations, as this was the number of iterations chosen for
the other algorithms. The chain-restart method was
implemented using the skmeans package using 90 iterations
and chains of 50 moves each. The EM algorithm was
implemented using the default parameter settings in the
mclust R package, which searches through possible cluster
shapes according to the BIC criterion and does not use a
prior on the distribution. Finally, the Bayesian k-means
variation was conducted using the Bayesclust R package38 and
default parameter settings (Metropolis search algorithm
with 100,000 iterations, sigma hyperparameters of 2.01 and
0.99, alpha hyperparameter of 1, and minimum cluster size
of 10%).
Real-World Dataset
MUSIC is a large dataset in which a number of
psychological and health-related surveys were administered
to 9,750 undergraduate students at 30 U.S. colleges and
universities. Among the surveys administered was a series
of questions asking participants how many times they had
engaged in a number of substance-use (marijuana, hard
drugs, inhalants, injection drugs, and misuse of
prescription drugs), sexual (oral sex, anal sex, sex with
strangers or brief acquaintances, unprotected sex, and sex
while drunk or high), and risky driving (drunk/drugged
driving and riding with a drunk/drugged driver) behaviors
during the 30 days prior to assessment. Possible response
choices for each of these behaviors were 0 (none), 1
(once/twice), 2 (3-5 times), 3 (6-10 times), or 4 (more
than 10 times).
There is reason to believe that these behaviors will
be highly correlated with one another and will form
clusters. Problem behavior theory49 suggests that young
people who engage in one risky behavior are more apt to
engage in other risky behaviors; and epidemiologic data
indicate that the adolescent and emerging adult years
(roughly ages 15-24) are characterized by the highest rates
of illicit drug use, sexual risk behavior, and impaired
driving50-51. At the same time, not all college-aged
individuals engage in high levels of risky behavior, and
some forms of risky behavior are more common than others.
For example, many college students engage in casual sexual
relationships52, but far fewer students use inhalants or
injection drugs. One would therefore expect multiple
clusters of students, some of which are characterized by no
or mild engagement in risky behavior and others of which
are characterized by more severe engagement in many or all
of the risk behaviors. In using the MUSIC data, we employed
the clustering techniques on the 13 illicit drug use,
sexual risk taking, and impaired driving variables listed
here: drunk driving, driving in a car driven by a drunk
driver, engagement in casual sex, number of partners,
engaging in sex without condoms, oral sex, anal sex, sex
while drunk, smoking marijuana, using hard drugs (e.g.,
cocaine, heroin), inhalant use, injecting drugs, and
nonmedical use of prescription medications.
Results
Simulations under Parameter Variations
The various algorithms evidenced different performance
under different conditions of the simulation (see Table 1).
In all, the MacQueen algorithm with 10 restarts
outperformed the other five algorithms handily under every
condition, suggesting that this algorithm is a strong,
general clustering technique. However, its performance
decreased markedly as additional signal variables were
added to the dataset, suggesting that it may struggle with
larger predictor sets.
All six algorithms struggled when the mean overlap of
groups increased to 0.4, indicating that strongly
overlapping clusters are difficult to recover. This finding
is consistent with the results reported by Steinley53,54. For
conditions in which there is likely to be large group
overlap, one may want to consider LCA or fuzzy k-means
algorithms, as these deal with probabilistic classification
rather than hard classification2.
Under most conditions, the genetic algorithm-based,
unmodified Lloyd-Forgy, and chain variation Lloyd-Forgy
methods produced very similar results to each other, with
the genetic algorithm approach performing slightly better.
In addition, the chain variation Lloyd-Forgy method seemed
to perform better as the mixing fraction decreased,
suggesting that it performs well with unbalanced group
sizes. The Bayesian algorithm performed maximally well with
increased numbers of salient predictors, low average
overlap, and smaller mixing fractions.
Simulations to Mimic Real-World Conditions
In the "real-world conditions" simulation, all six
algorithms performed better as the number of relevant
predictors increased (Table 2), with the MacQueen algorithm
with 10 restarts and EM algorithm scoring over 0.5 on the
ARIHA for both conditions. The other four algorithms scored
approximately 0.3 for the condition with 5 signal and 10
noise variables; with 12 signal variables, all four
algorithms approached 0.4. This pattern indicates a fair
amount of agreement for the four lower-performing
algorithms and a moderate to substantial agreement for the
MacQueen algorithm with 10 restarts and EM algorithms.
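The ARIHA used here is the Hubert-Arabie adjusted Rand index, which scores agreement between two partitions: chance-level agreement is corrected to 0 and perfect agreement scores 1. A minimal sketch with made-up label vectors, using scikit-learn's implementation:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical "true" memberships and two recovered partitions.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
good = [1, 1, 1, 2, 2, 2, 0, 0, 0]   # same grouping, labels permuted
poor = [0, 1, 2, 0, 1, 2, 0, 1, 2]   # unrelated grouping

# The index is invariant to how clusters are numbered, and values at
# or below 0 indicate agreement no better than chance.
print(adjusted_rand_score(truth, good))  # 1.0
print(adjusted_rand_score(truth, poor))  # negative: below-chance agreement
```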
When examining computing time for the six algorithms
on a standard laptop (Table 3), some of these algorithms
required a large amount of computing resources that may be
less than ideal for large datasets. The genetic algorithm
and Bayesian algorithm both required substantial
computing resources, though neither was unreasonable for
the datasets generated to mimic a large real-world dataset.
However, in very large datasets, computing resources may
become an issue, and alternative algorithms may yield
results more quickly without sacrificing quality. The EM
algorithm required more computing resources than three
other algorithms but quite a bit less than the two most
demanding ones (genetic and Bayesian), suggesting that it
would be reasonable (though not efficient) for larger
datasets. In all, running 30 iterations of the loop which
generated and analyzed one real-world condition for each of
the six algorithms required about 60 hours on a laptop
with 8 GB of RAM and a 2.5 GHz processor.
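Timing comparisons like those in Table 3 can be reproduced with a simple wall-clock wrapper. A sketch, assuming Python with scikit-learn as a stand-in for the R implementations actually timed in the study; the dataset dimensions loosely mirror the MUSIC-like simulation (roughly 10,000 cases and 13 variables):

```python
import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Simulated dataset roughly the size of the MUSIC-like condition.
X, _ = make_blobs(n_samples=10_000, n_features=13, centers=4,
                  random_state=0)

def timed_fit(model, X):
    """Fit a clustering model and return (model, elapsed seconds)."""
    start = time.perf_counter()
    model.fit(X)
    return model, time.perf_counter() - start

km, elapsed = timed_fit(KMeans(n_clusters=4, n_init=10, random_state=0), X)
print(f"elapsed: {elapsed:.2f} s")
```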
MUSIC Dataset
Because the EM method and MacQueen algorithm with 10
restarts gave the best performance on the simulated data,
both of these methods were used to analyze the MUSIC
dataset. We also used the Bayesian algorithm because this
approach is novel among clustering algorithms and has not
been applied to many real-life datasets.
The most promising algorithms for finding the correct
number of clusters on simulated datasets similar to the
MUSIC data were the Ratkowsky, Scott, Marriott, TraceCovW,
and TraceW algorithms (summary of methodology and results
available upon request). These were used to determine the
optimum number of clusters within the MUSIC data. The
average number of clusters detected by each algorithm fell
between 3 and 4 (Table 4); therefore, the three k-means
algorithms were run on the MUSIC dataset with both k=3 and
k=4 clusters. Each
group's mean response across the 13 risk behavior questions
was examined to characterize the cluster solution.
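The cluster-number indices named above (Ratkowsky, Scott, Marriott, TraceCovW, TraceW) are available in R's NbClust package. The general recipe, scoring a range of candidate k values and keeping the best, can be sketched in Python with the silhouette index standing in for those criteria (an illustrative substitution, not the indices used in the study):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Simulated data with a 4-cluster structure unknown to the procedure.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0,
                  random_state=0)

# Fit k-means for each candidate k and score the resulting partition;
# the k with the best index value is taken as the optimum.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```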
After obtaining the results from both k=3 and k=4, it
was determined that the k=3 solutions did not make as much
sense clinically as the k=4 solutions. Therefore, we
further explored the k=4 solutions to characterize and
compare cluster solutions among the three algorithms. As
summarized in Table 5, the MacQueen algorithm with 10
restarts yielded the following four groups: (1) some sexual
risk and low substance abuse (n = 1523), (2) very high risk
in all areas (n = 11), (3) fairly high sexual and substance
abuse risk (n = 713), and (4) low risk in all areas (n =
7704). The second group was extremely small and reported an
average of 69 sexual partners in the month prior to
assessment; it is possible that this algorithm detected
students who lied about (or provided extremely unlikely
reports of) their sexual behavior. Such a conclusion would
suggest that this algorithm classifies outliers within the
same cluster, rather than adding them to the next-closest
cluster, which would skew the clusters to which these
extreme observations are added. This behavior is desirable
for an algorithm applied to an empirical dataset, as it can
isolate distinct subgroups within the sample, including
small groups of individuals with extreme scores. Care
should be taken not to over-interpret such small groups of
extreme scores unless they are replicated in other samples.
The EM algorithm found the following four groups: (1)
moderate sexual and substance abuse risk (n = 277), (2)
high oral sex but very low drug use (n = 7925), (3) risk
from smoking marijuana and alcohol-related behaviors
(n = 1581), and (4) high risk in all areas (n = 168). This
seems reasonable given previous research on college risk
behaviors50-52. Cross-tabulating the classes obtained with
the EM algorithm against those obtained with the MacQueen
algorithm with 10 restarts yielded an ARIHA of .15,
suggesting little overlap across these two algorithms and a
high likelihood of erroneous classification based upon
simulation results.
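The cross-tabulation described above amounts to counting how cases fall into each pair of clusters across the two solutions, then summarizing agreement with the adjusted Rand index. A sketch with hypothetical label vectors, assuming pandas and scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Hypothetical 4-cluster assignments from two algorithms on the same
# 1,000 cases (independent here, so agreement should be near chance).
a = rng.integers(0, 4, size=1000)  # e.g., MacQueen with 10 restarts
b = rng.integers(0, 4, size=1000)  # e.g., EM-based

table = pd.crosstab(pd.Series(a, name="MacQueen"),
                    pd.Series(b, name="EM"))
print(table)
print(adjusted_rand_score(a, b))  # near 0 for unrelated partitions
```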
The Bayesian algorithm found the following groups: (1)
low risk (n = 3969), (2) risk from impaired driving,
marijuana use, and sex (n = 1991), (3) moderate risk in all
areas (n = 1990), and (4) presumably monogamous (i.e., low
risk other than unprotected sex; n = 2001). Considering
that there was very low overlap with the classes extracted
using the MacQueen algorithm with 10 restarts (ARIHA = .18)
and the EM algorithmās outcome (ARIHA = .06), it is likely
that this algorithm is also not finding accurate data
partitions.
Considered together with the simulation results, it
appears that the different algorithms are likely to produce
different results with little overlap among methods. Given
the poor performance of the Bayesian and EM algorithms on
the simulated data, it is likely that these algorithms
produced untrustworthy results with high misclassification
rates. However, both algorithms found clusters within the
MUSIC dataset and produced interpretable results,
underscoring the need for caution: an algorithm may yield
plausible-looking groupings based on faulty clustering of
the data. These results
also support the speculation of Brusco and Steinley20, where
it was suggested that their genetic algorithm-based k-means
algorithm performed well because it was based on the HK-
means+ algorithm, which was based heavily on the MacQueen
algorithm.
Discussion
We conducted the present study to compare six commonly
used k-means algorithms in terms of their ability to
correctly cluster optimally simulated data, data simulated
to mimic a real world dataset, and a collected dataset
consisting of risk behavior reports from nearly 10,000
college students. Although many studies have compared two
or three clustering algorithms to one another, our study is
among the first to compare six different algorithms under
various conditions (e.g., different sample sizes, varying
degrees of overlap between/among clusters, varying degrees
of random error). Our results may therefore have the
potential to provide greater clarity regarding which
algorithms do and do not perform well given specific
properties within the dataset.
Our results also indicate that large differences exist
among these six k-means algorithms in terms of the quality
of their performance and in terms of the content of
clusters that they produce. These results underscore the
need for informed application of these clustering
algorithms to real-world datasets, as results may
drastically vary depending upon which algorithm is chosen
for the situation. Within the small-scale simulation
portion of this study, low signal-to-noise ratios (large
numbers of uninformative variables) presented a significant
challenge to the majority of algorithms. The MacQueen
algorithm with 10 restarts and EM algorithms performed most
favorably in the context of low signal-to-noise ratios,
though the MacQueen algorithm was the only one that
provided a strong performance across conditions.
Nonetheless, our results suggest that large or complex
datasets may present challenges to many of the k-means
algorithms, as the MacQueen algorithm showed a faltering
performance as the number of variables increased and as the
overlap increased. In addition, the performance of five
algorithms deteriorated as the average group overlap
increased beyond 0.2. This finding suggests that correct
classification is most likely when groups are well-
separated, and that the MacQueen algorithm with 10 restarts
may be among the most effective options when cluster
overlap is higher.
When examining two simulated datasets with several
variables, heterogeneity, moderate overlap, and low mixing
fraction, the EM and MacQueen with 10 restarts algorithms
performed much more strongly than the other four algorithms
without requiring an unreasonable amount of computing
power. The Bayesian and genetic algorithm methods, though
showing a comparable performance to the remaining
algorithms (chain-restart and non-restart Lloyd-Forgy),
required much more time to run and more computing
resources. Given the trend towards analyzing datasets with
extremely large numbers of cases and variables (i.e., big
data), the computing resources an algorithm requires may
need to be taken into account.
On the MUSIC dataset, the MacQueen algorithm with 10
restarts identified a small group of exceptionally high
scores, suggesting that it is sensitive enough to detect
possible outliers and may be of use in flagging suspect
observations. This may be useful for researchers who have
large collections of observations and aim to examine
extreme events more closely. The other two algorithms identified
cluster solutions consistent with previous rates of risk
behaviors among a college population55; however, the overlap
with the MacQueen solutions was troublingly low, suggesting
that these algorithms may have found very inaccurate
solutions. Given their poor performance in the simulations,
caution should be taken when applying any of these more
complex algorithms to real datasets, as they are unlikely
to provide as accurate a solution as the MacQueen
restart method.
Our results suggest that different k-means
computational methods may partition data in different ways,
such that the groups produced do not converge well across
algorithms, and that given current implementations, the
more complex algorithms should be used with caution or
disregarded in favor of the MacQueen restart method. It is
possible that each algorithm "sees" the data structure from
a different point of view, analogous to researchers from
different fields approaching a multidisciplinary problem
from different points of view and then solving that problem
using the tools from their particular fieldās approach;
however, it is also likely that some of these algorithms
are unable to find valid clustering solutions. This finding
has important implications for researchers applying these
methods to real-life data, as not all algorithms are
created equal. From the results of Brusco and Steinley20,
it is possible that the computational methods upon which
some of these more complex algorithms are based may yield
k-means solutions that perform well on real data; however,
the versions of these algorithms provided within some
software packages (e.g., R) do not perform well.
Strengths and Limitations
This study has several strengths and weaknesses. Using
simulated datasets with known "true" classifications
allowed us to compare algorithms under controlled
conditions. However, these simulations involved smaller
datasets than might be encountered in large empirical
studies. Even with the MUSIC dataset, which provides a
sample of almost 10,000 participants, only 13 clustering
variables were used. As such, the results of this study
should be interpreted with caution by researchers hoping to
apply these algorithms to studies with large numbers of
variables or in which the number of parameters may exceed
the number of observations. For example, many genomics
studies include millions of genes taken from a comparably
small number of participants. Further work should be done
to examine these algorithms in such situations to
understand which k-means algorithms perform well under such
conditions.
In sum, this study provides a roadmap for the
application of several k-means algorithms to datasets
involving several predictors with varying degrees of signal
and noise. For the algorithms tested in this study, that
roadmap turns out to be quite simple.
Researchers hoping to use the k-means methodology would do
well to use the basic k-means algorithms, particularly with
a simple restart step added to the algorithm, rather than
using some of the newer, more complex algorithms. We hope
that our results have contributed to the literature
comparing various approaches to k-means cluster analysis
and provide a more accurate, if shorter, roadmap for
researchers hoping to employ this methodology.
Acknowledgements:
I would like to thank my coauthors Dr. Daniel Feaster,
Dr. Seth Schwartz, and Dr. Douglas Steinley for their
contributions to this study, as well as the University of
Miami for the coursework that made this study possible.
References
1) Steinley D. K-means clustering: a half-century
synthesis. Br J Math Stat Psychol. 2006; 59(Pt 1):1-34.
2) DiStefano C, Kamphaus RW. Investigating Subtypes of
Child Development: A Comparison of Cluster Analysis and
Latent Class Cluster Analysis in Typology Creation. Educ
Psychol Meas. 2006; 66(5):778-794.
3) Eshghi A, Haughton D, Legrand P, Skaletsky M, Woolford
S. Identifying groups: a comparison of methodologies. J
Data Sci. 2011; 9:271-91.
4) Scott AJ, Knott M. A cluster analysis method for
grouping means in the analysis of variance. Biometrics.
1974; 507-512.
5) Edwards AWF, Cavalli-Sforza LL. A method for cluster
analysis. Biometrics. 1965; 362-375.
6) Steinley D, Brusco MJ. Evaluating the performance of
model-based clustering: Recommendations and cautions.
Psychol. Methods. 2011; 16, 63-79.
7) Steinley D, Brusco MJ. K-means clustering and model-
based clustering: Reply to McLachlan and Vermunt. Psychol.
Methods. 2011; 16, 89-92.
8) Fuentes C, Casella G. Testing for the existence of
clusters. Sort (Barc). 2009; 33(2):115-157.
9) Martis RJ, Chakraborty C. Arrhythmia disease diagnosis
using neural network, SVM, and genetic algorithm-optimized
k-means clustering. J. Mech. Med. Biol. 2011; 11(04): 897-
915.
10) Rovniak LS, Sallis JF, Saelens, et al. Adults' physical
activity patterns across life domains: cluster analysis
with replication. Health Psychology. 2010; 29(5): 496.
11) Newby PK, Muller D, Hallfrisch J, Qiao N, Andres R,
Tucker KL. Dietary patterns and changes in body mass index
and waist circumference in adults. Am. J. Clin. Nutr. 2003;
77(6): 1417-1425.
12) Ahn Y, Park YJ, Park SJ, et al. Dietary patterns and
prevalence odds ratio in middle-aged adults of rural and
mid-size city in Korean Genome Epidemiology Study. Korean
Journal of Nutrition. 2007; 40(3): 259-269.
13) Bauer AK, Rondini EA, Hummel, et al. Identification of
candidate genes downstream of TLR4 signaling after ozone
exposure in mice: a role for heat-shock protein 70.
Environmental health perspectives. 2010; 119(8): 1091.
14) O'Quin KE, Hofmann CM, Hofmann HA, Carleton KL.
Parallel evolution of opsin gene expression in African
cichlid fishes. Mol. Biol. Evol. 2010; 27(12): 2839-2854.
15) Tukiendorf A, Kaźmierski R, Michalak S. The Taxonomy
Statistic Uncovers Novel Clinical Patterns in a Population
of Ischemic Stroke Patients. PLoS ONE. 2013; 8(7): e69816.
16) Krishna K, Murty MN. Genetic K-means algorithm. IEEE
Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics. 1999; 29(3):433-439.
17) Dhillon IS, Guan Y, Kogan J. Iterative clustering of
high dimensional text data augmented by local search. Data
Mining. 2002; 131-138.
18) Cannon RL, Dave JV, Bezdek JC. Efficient implementation
of the fuzzy c-means clustering algorithms. IEEE
Transactions on Pattern Analysis and Machine Intelligence.
1986; (2): 248-255.
19) Bradley PS, Fayyad U. Refining Initial Points for K-
Means Clustering. Proc. 15th International Conf. Machine
Learning. 1998.
20) Brusco MJ, Steinley D. A comparison of heuristic
procedures for minimum within-cluster sums of squares
partitioning. Psychometrika. 2007; 72(4): 583-600.
21) Steinley D. Validating clusters with the lower bound
for sum-of-squares error. Psychometrika. 2007; 72(1): 93-
106.
22) Castillo LG, Schwartz SJ. Introduction to the special
issue on college student mental health. Journal of clinical
psychology. 2013; 69(4): 291-297.
23) Weisskirch RS, Zamboanga BL, Ravert RD, et al. An
introduction to the composition of the Multi-Site
University Study of Identity and Culture (MUSIC): A
collaborative approach to research and mentorship. Cultural
Diversity and Ethnic Minority Psychology. 2013; 19(2): 123.
24) Lloyd SP. "Least square quantization in PCM". Bell
Telephone Laboratories Paper. 1957.
25) MacQueen J. Some methods for classification and
analysis of multivariate observations. Proceedings of the
fifth Berkeley symposium on mathematical statistics and
probability. 1967; 1(14): 281-297.
26) Holland J. Adaptation in Natural and Artificial
Systems. University of Michigan Press. 1975.
27) Somma RD, Boixo S, Barnum H, Knill E. Quantum
simulations of classical annealing processes. Physical
review letters. 2008; 101(13): 130504.
28) Temme K, Osborne TJ, Vollbrecht KG, Poulin D,
Verstraete F. Quantum metropolis sampling. Nature. 2011;
471(7336): 87-90.
29) Hassan R, De Weck O, Springmann P. Architecting a
communication satellite product line. 22nd AIAA
international communications satellite systems conference &
exhibit (ICSSC). 2004.
30) Najafi A, Ardakani SS, Marjani M. Quantitative
structure-activity relationship analysis of the
anticonvulsant activity of some benzylacetamides based on
genetic algorithm-based multiple linear regression.
Tropical Journal of Pharmaceutical Research. 2011; 10(4):
483-490.
31) Pittman J. Adaptive splines and genetic algorithms. J.
Comp. Graph. Stat. 2002; 11(3): 615-638.
32) Paterlini S, Minerva T. Regression model selection
using genetic algorithms. Proceedings of the 11th WSEAS
International Conference on RECENT Advances in Neural
Networks, Fuzzy Systems & Evolutionary Computing. 2010; 19-
27.
33) Maulik U, Bandyopadhyay S. Genetic algorithm-based
clustering technique. Pattern Recogn. 2000; 33(9): 1455-
1465.
34) Forrest S. Genetic algorithms: principles of natural
selection applied to computation. Science. 1993; 261(5123):
872-878.
35) Whitley D. A genetic algorithm tutorial. Stat Comp.
1994; 4(2): 65-85.
36) Steinley D, Brusco, MJ. Initializing K-means batch
clustering: A critical evaluation of several techniques. J.
Classif. 2007; 24, 99-121.
37) Dempster AP, Laird NM, Rubin DB. Maximum likelihood
from incomplete data via the EM algorithm. Journal of the
Royal Statistical Society. Series B (Methodological). 1977;
1-38.
38) Gopal V, Fuentes C, Casella G, Gopal M. The bayesclust
Package. 2009.
39) Milligan GW, Cooper MC. An examination of procedures
for determining the number of clusters in a data set.
Psychometrika. 1985; 50(2): 159-179.
40) Steinley D, Brusco, MJ. Testing for validity and
choosing the number of clusters in K-means clustering.
Psychol. Methods. 2011; 16, 285-297.
41) Steinley D, Brusco MJ. A new variable weighting and
selection procedure for K-means cluster analysis. Multiv
Behav Res. 2008; 43(1): 77-108.
42) Melnykov V, Chen W, Maitra R. MixSim: An R Package for
Simulating Data to Study Performance of Clustering
Algorithms. J Stat Softw. 2012; 51(12), 1-25.
43) Steinley D. Standardizing variables in K-means
clustering. Classification, clustering, and data mining
applications. 2004; 53-60.
44) Huang Z. Clustering large data sets with mixed numeric
and categorical values. Proceedings of the 1st Pacific-Asia
Conference on Knowledge Discovery and Data Mining. 1997;
21-34.
45) Steinley D. Properties of the Hubert-Arabie Adjusted
Rand Index. Psychol Methods. 2004; 9(3): 386.
46) Tibshirani R, Walther G, Hastie T. Estimating the
number of clusters in a data set via the gap statistic.
Journal of the Royal Statistical Society: Series B
(Statistical Methodology). 2001; 63(2): 411-423.
47) Dimitriadou E, Dolničar S, Weingessel A. An examination
of indexes for determining the number of clusters in binary
data sets. Psychometrika. 2002; 67(1): 137-159.
48) Hornik K, Feinerer I, Kober M, Buchta C. Spherical k-
Means Clustering. Journal of Statistical Software. 2012;
50(10), 1-22.
49) Jessor R, Jessor SL. Problem behavior and psychosocial
development: A longitudinal study of youth. 1977.
50) National Highway Traffic Safety Administration. Teen
drivers: Additional resources. 2013. Retrieved October 22,
2014 from
http://www.nhtsa.gov/Driving+Safety/Teen+Drivers/Teen+Drivers+-+Additional+Resources.
51) Centers for Disease Control and Prevention. Youth Risk
Behavior Surveillance - United States, 2013. Morbidity and
Mortality Weekly Report. 2013; 63(1): 1-168.
52) Bogle KA. Hooking up: Sex, dating, and relationships on
campus. New York: New York University Press. 2008.
53) Steinley D. Profiling local optima in K-means
clustering: Developing a diagnostic technique. Psychol.
Methods. 2006; 11, 178-192.
54) Steinley D, Brusco, MJ. A new variable weighting and
selection procedure for K-means cluster analysis. Multiv
Behav Res. 2008; 43, 77-108.
55) Cooper ML. Alcohol use and risky sexual behavior among
college students and youth: Evaluating the evidence. J.
Stud. Alcohol Drugs. 2002; (14): 101.
Table 3: Computing Times in Seconds on Simulation of
Dataset like the MUSIC Dataset

              Genetic    Lloyd-  Chain        Bayesian  EM         MacQueen
              Algorithm  Forgy   Lloyd-Forgy            Algorithm  10 Restarts
Elapsed Time  146.57     0.14    0.50         600.85    14.81      0.36
System Time   0.25       0.00    0.00         0.09      0.02       0.00
User Time     135.15     0.14    0.49         594.83    14.48      0.36
Table 4: Average Number of Clusters Estimated in MUSIC
Dataset

            Average Number of Clusters   Range of Clusters Found
Scott       4                            3 or 5
Marriott    3.2                          3 or 4
TraceCovW   3.1                          3 or 4
TraceW      3.3                          3 or 4
Ratkowsky   3.9                          3 or 4
Table 5: 4-Cluster Solutions by Algorithm

Cluster  MacQueen with 10 Restarts        EM-based                          Bayesian
1        Moderate sexual risk with low    Moderate to high substance use    Low sexual and substance use
         substance use risk (n=1523)      and sexual risk (n=277)           risk (n=3969)
2        High sexual and substance use    Low risk, particularly with       Moderate sexual risk, moderate
         risk (n=11)                      substance use (n=7925)            alcohol and marijuana risk (n=1991)
3        Moderate sexual and substance    Moderate alcohol and marijuana    Low to moderate sexual and
         use risk (n=713)                 risk, low sexual risk (n=1581)    substance abuse risk (n=1990)
4        Low sexual and substance use     High sexual and substance use     Low sexual and substance abuse
         risk (n=7704)                    risk                              risk (sexual risk