Comparison of 6 K-Means Clustering Algorithms under
Differing Data Conditions
Colleen M. Farrelly
BST699
November, 2014
"Submitted to the Graduate Programs in Public Health Sciences in partial fulfillment of the requirements for the degree of Master of Science in Biostatistics"
Abstract: Many variations of the k-means algorithm exist, and this study aims to test promising k-means algorithms from previous studies, together with newer algorithms that remain largely untested against other algorithms in simulation studies, under varying conditions of noise, cluster overlap, cluster size, and heterogeneity. Algorithms tested include a genetic algorithm-based method, an expectation-maximization algorithm-based method, a Bayesian method, a chain-restart method, a simple restart method, and an unmodified k-means algorithm.
Each of these seemed to have preferred conditions under
which it performed favorably. The MacQueen algorithm with
10 restarts showed exceptionally strong performance
compared to other algorithms across simulations. Three
algorithms (MacQueen algorithm with 10 restarts, the
expectation-maximization algorithm, and the Bayesian
algorithm) were applied to a real dataset to examine
whether or not these algorithms find the same patterns
within a dataset; all three algorithms found distinctly
different patterns within the data. This finding, combined
with differing performance across simulations for each
algorithm, suggests that care must be taken when selecting
a k-means algorithm to apply to an empirical dataset.
Introduction
The goal of k-means clustering is to partition a data set into a series of mutually exclusive and exhaustive clusters such that the overall sum-of-squares error between each cluster mean and the observations belonging to that cluster is minimized (i.e., the variances of the clusters are as small as possible1). Generally, it is assumed that there is an underlying set of latent categories or groups that exists in the dataset. Further, it has been shown that, under certain conditions, the k-means clustering problem is equivalent to finite mixture modeling, of which latent class analysis (LCA) and latent profile analysis (LPA) are also special instances1. To achieve the goal of constructing clusters in this fashion, many algorithmic variations exist (referred to throughout as "k-means clustering algorithms"), providing researchers with a plethora of options for conducting a "simple" k-means cluster analysis.
There is some debate in the literature regarding
whether cluster analysis or LCA should be used2. In cases
where hard categories are desired or assumed, LCA (which
uses probabilistic partitions and does not assume that each
case fits into only one category) may not be useful. Other conditions under which k-means clustering performs as well as or better than LCA (and, hence, where cluster-analytic methods would be more efficacious) include homogeneity, in which observations within a cluster have similar scores on the predictor variables used in the analysis3-5, and small overlap between groups. Additionally, when investigating the performance of finite mixture modeling, Steinley and Brusco6-7 found that k-means clustering performed as well as or better than mixture modeling when aspects of the mixture model were unknown (specifically, the complexity of the within-group covariance matrices and the number of clusters themselves).
One of the inherent difficulties in cluster analysis is that the problem of dividing N observations into K clusters is NP-hard; that is, it is a computationally complex problem for which guaranteeing the optimal answer would require complete enumeration. Unfortunately, the number of possible solutions can be approximated by k^n/k!, creating a search space that is infeasible to exhaust. To combat this problem, several heuristic approaches have been developed, each designed to search the solution space in a different manner in the hope of obtaining a better solution. Indeed, in the past few years, many studies within the social and medical sciences have employed various k-means algorithms as a method of data partitioning and classification8-15. However, with this recent proliferation
of new and modified k-means clustering algorithms, it is difficult to discern whether a given study has employed a k-means algorithm that is appropriate for the associated dataset8,16-19. Few studies have systematically compared different k-means clustering algorithms, and those studies have not examined noise or heterogeneous conditions during their simulations, leaving a gap in knowledge for applying k-means methods to collected datasets1,20-21. In addition, extant simulation studies comparing multiple k-means algorithms have not explored newer approaches, such as Bayesian methodology for k-means clustering. Therefore, the real-world data conditions under which a variety of algorithms perform well, and how these conditions vary between or among algorithms, are largely unknown.
Researchers hoping to employ one of these techniques in
their data analysis need a guide to proper implementation
and adaptation of the various k-means algorithms if they
are to reach correct conclusions regarding the
relationships that exist in their data.
The present study builds on the Brusco and Steinley20
study in which several k-means algorithms were tested under
several simulation parameters and then applied to several
empirical datasets. In their study, Brusco and Steinley20
examined nine algorithms, ranging from the earliest
conceptions of the k-means algorithm to newer computational
methods, such as simulated annealing and genetic
algorithms. In simulations, the genetic algorithm and the
k-means hybrid algorithms performed exceptionally well, and
algorithms similar to these were chosen for further
investigation in the current study to determine the extent
to which these algorithms retain their high performance in
settings with many noise variables (defined as variables in
the data set that are not related to the cluster
structure), or with heterogeneity among items entered into
the cluster analysis. In addition, we included algorithms
employing novel solutions that have not yet been tested in
simulation comparison studies, as a way of determining how
these new approaches perform relative to those that have
been previously tested.
In this paper, we examine six distinct variants of k-means clustering that are being employed in studies today (genetic and expectation-maximization algorithms that improve optimization, a simple restart method, a chain restart method, a Bayesian-based method, and an unmodified k-means algorithm) through a simulation of small datasets with varying signal-to-noise, mixing fraction (relative cluster size), cluster overlap, and cluster heterogeneity conditions. In doing so, we hope to
understand the relative strengths and weaknesses of each
algorithm under conditions common in social science
datasets. To understand how these algorithms might function
on a real dataset, we simulated two datasets with larger
sample sizes, high overlap among clusters, heterogeneity,
and more variables to assess which algorithms seem to
perform well on these more complex problems. We then applied
the most promising algorithms to a real dataset, the Multi-
Site University Study of Identity and Culture (MUSIC)22-23,
which is similar to the larger simulated datasets. We took
this last step to examine the extent to which these
algorithms show similar clustering solutions in a real
dataset. This two-step approach provides information as to
whether the various algorithms are detecting the same
latent data partitions in both simulated and real data.
K-Means Algorithm Structure
The goal of k-means algorithms is to minimize the sum of squares error (SSE), which maximizes within-cluster
homogeneity and between-cluster heterogeneity20. The SSE is
calculated as follows:
š‘šš‘–š‘›
{šœ‡1, . . . , šœ‡ā„Ž}
āˆ‘ āˆ‘ || š‘„ āˆ’ šœ‡ā„Ž||2
š‘„āˆˆš‘‹ā„Ž
š‘˜
ā„Ž=1
where there are k clusters, each with mean μ_h, and x represents an observation assigned to the hth cluster (the set X_h). The
basic k-means algorithm proceeds as follows (where specific
variations are described in their respective section
below). First, k initial seeds are chosen as centers of the
clusters, either by randomly choosing observations within a
dataset or by setting cluster centers a priori. Second,
each observation is assigned to one of the k clusters
according to the minimization of the SSE. Based on this new
assignment, means are recomputed for each cluster and
points are reassigned to clusters to minimize SSE. This
process is repeated until no observations can be reassigned
(or for a fixed number of iterations). As discussed in
detail by Steinley1, this stopping criterion is only
guaranteed to result in a locally optimal solution; in
other words, there could exist another assignment of
observations to clusters that resulted in a lower value for
SSE.
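As an illustration of these steps, the following minimal R sketch implements the basic iteration directly (a didactic sketch only, with names of our own choosing; it omits safeguards such as handling clusters that become empty).

    # Didactic sketch of the basic k-means iteration described above.
    basic_kmeans <- function(X, k, max_iter = 90) {
      X <- as.matrix(X)
      # Step 1: choose k observations at random as the initial cluster centers.
      centers <- X[sample(nrow(X), k), , drop = FALSE]
      assign <- rep(0, nrow(X))
      for (iter in seq_len(max_iter)) {
        # Step 2: assign each observation to its nearest center (this minimizes
        # the SSE for the current centers).
        d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
        new_assign <- apply(d, 1, which.min)
        # Stopping criterion: no observation can be reassigned.
        if (all(new_assign == assign)) break
        assign <- new_assign
        # Step 3: recompute each cluster mean from its current members
        # (empty clusters are not handled in this sketch).
        centers <- t(sapply(seq_len(k), function(h)
          colMeans(X[assign == h, , drop = FALSE])))
      }
      sse <- sum((X - centers[assign, , drop = FALSE])^2)
      list(cluster = assign, centers = centers, sse = sse)
    }

As noted above, a run of this procedure converges only to a locally optimal partition, which motivates the modifications described next.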
Modifications of K-Means Algorithms and Overview of
Algorithms Tested in This Study
Lloyd-Forgy and MacQueen Algorithms without Modification
Two basic k-means algorithms exist that do not involve modification to the fundamental k-means algorithm's steps and calculation procedure. The Lloyd-Forgy algorithm randomly selects k points from the dataset as starting centers for the k clusters and proceeds through the above steps, waiting to recalculate cluster centers until all points in the dataset have been assigned24. In contrast, the MacQueen algorithm proceeds through the above steps but recalculates cluster centers each time a point is moved, rather than after all points have been assigned25.
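Both unmodified variants are exposed by the kmeans() function in base R, which accepts an algorithm argument; a minimal illustration follows (X is assumed to be a numeric data matrix).

    # "Lloyd"/"Forgy" perform batch updates after all points are assigned;
    # "MacQueen" updates the cluster centers each time a point is moved.
    set.seed(1)
    fit_lloyd    <- kmeans(X, centers = 4, iter.max = 90, algorithm = "Lloyd")
    fit_macqueen <- kmeans(X, centers = 4, iter.max = 90, algorithm = "MacQueen")
    fit_macqueen$tot.withinss   # the SSE minimized by the algorithm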
Need for Improvement of Basic Algorithms
One of the largest hurdles in k-means clustering is
the hill-climbing nature of the algorithm in its search for
a solution. The basic k-means search to minimize the SSE does not escape local optima well and often fails to find the global optimum16,19. This situation is akin to a mountain climber getting stuck on a lower peak rather than making it all the way to the mountain's summit. Unfortunately, the analyst cannot see the higher peak of the objective function the way the mountain climber can see the summit from the (lower) peak. Several modifications of the k-means clustering algorithm have been developed to deal specifically with this problem, including use of an expectation-maximization algorithm, a genetic algorithm, two restart methods to improve starting points, and calculation of posteriors through Bayesian methods16,19,17,8.
Genetic Algorithm Approach
Rather than relying on a hill-climbing algorithm, one
solution to this problem employs a genetic algorithm to
optimize clustering partitions16. Genetic algorithms are
computing tools based on the principles of evolutionary
biology and population genetics, such that they efficiently
search a given space of possible solutions to a problem,
including variable selection and optimal partitioning26.
These algorithms thrive in situations in which other
enumerative and machine-learning algorithms stall or fail
to converge upon global solutions, and genetic algorithms
have been successfully adapted for problems in statistical
physics27, quantum chromodynamics28, aerospace engineering29,
molecular chemistry30, spline-fitting with function
estimation31, and statistics16,32-33.
Genetic algorithms start with a population of different, randomly generated possible solutions. Each solution is described by the set of cluster memberships and is placed into a vector format referred to as a chromosome. The algorithm evaluates each individual solution based on its SSE, ranking the solutions from smallest to largest16,26. The algorithm then enters a mutation step in which each solution undergoes further evolution or change. The amount of mutation depends on a given individual solution's ranking on SSE: solutions with smaller SSE undergo less mutation because they may be near the optimum, whereas those with higher SSE undergo more mutation because they are further from the optimum. The population of solutions remains constant in size, and these steps are repeated to create new sets of solutions until a convergence criterion is met, usually a pre-specified number of sets of solutions (often referred to as generations in the genetic algorithm literature). At the designated stopping point, the individual solution with the lowest SSE is chosen as the best solution, and the other individual solutions are discarded.
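The following R sketch illustrates this selection-and-mutation cycle over cluster-membership vectors; it is a schematic illustration of the general idea rather than the Krishna and Murty16 implementation, and all function and parameter names are ours.

    # Within-cluster sum of squares for a vector of cluster labels.
    sse_of <- function(X, labels, k) {
      sum(sapply(seq_len(k), function(h) {
        Xh <- X[labels == h, , drop = FALSE]
        if (nrow(Xh) == 0) return(0)
        sum(scale(Xh, scale = FALSE)^2)
      }))
    }

    # Schematic genetic-style search over cluster assignments ("chromosomes").
    ga_kmeans_sketch <- function(X, k, pop_size = 20, generations = 140, mut_rate = 0.2) {
      n <- nrow(X)
      pop <- replicate(pop_size, sample(k, n, replace = TRUE), simplify = FALSE)
      for (g in seq_len(generations)) {
        fitness <- sapply(pop, sse_of, X = X, k = k)
        pop <- pop[order(fitness)]             # rank solutions from smallest to largest SSE
        for (i in seq_along(pop)[-1]) {        # keep the best-ranked solution unmutated
          n_mut <- ceiling(mut_rate * n * i / pop_size)   # worse-ranked solutions mutate more
          idx <- sample(n, n_mut)
          pop[[i]][idx] <- sample(k, n_mut, replace = TRUE)
        }
      }
      pop[[which.min(sapply(pop, sse_of, X = X, k = k))]]  # lowest-SSE chromosome
    }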
One of the advantages of the genetic algorithm
approach is that this algorithm is computationally
efficient and appears to find better solutions than does
the unmodified k-means algorithm16. In addition, by utilizing a population of solutions (i.e., the set of candidate solutions evaluated by the genetic algorithm), it is unlikely that all solutions in a population will converge
on a local optimum at each step because individuals vary at
the initialization of the population, greatly increasing
the chances of convergence to a global optimum within a
large search space34-35.
In addition to escaping local optima, this genetic
algorithm variation also allows k-means clustering to
accommodate functions other than straightforward SSE
measures, including functions that may be nonlinear or
multimodal16. This algorithm seems to find better solutions (nearly error-free in Krishna and Murty's test data), and in smaller datasets it is also computationally efficient16. We note that there are numerous ways in which genetic algorithms can be operationalized. For instance, Brusco and Steinley20 used a genetic algorithm that was based on a hybridized k-means algorithm; however, such an approach is unlikely to be widely utilized by applied researchers because it is not available in standard software packages. As such, we use the Krishna and Murty16 algorithm as implemented in R.
Restart Modification
Another possible solution to the local optimum problem, without resorting to a potentially computationally intensive genetic algorithm, is to search for better initial clustering points (rather than a random or nearest-point start to the algorithm) in the hope that the starting conditions will be closer to the global optimum than to local optima.
The simple restart methods are the most straightforward modification. These methods involve n runs of the
algorithm, each with a different starting point chosen at
random17. Like the genetic algorithm, this method allows for
many iterations of the k-means algorithm; however, this
approach does not optimize the starting cluster points as
in a genetic algorithm and does not search the solution
space to obtain its starting point variations. However, the
restart algorithm is less computationally intensive, is
available as a setting in several software packages, and
has served as the starting point for several newer
algorithms17,19. With enough starting points chosen, it is
possible that this algorithm will find an initial
clustering close enough to a global optimum to converge on
the best clustering solution. Additionally, Steinley and
Brusco20 found that numerous random initializations compared
favorably to other methods for initializing k-means
clustering algorithms. Brusco and Steinley36 indicated that
a version of k-means clustering (one that combined the
method described by Lloyd24 with MacQueen25) that relied on
multiple restarts compared favorably with their genetic
algorithm implementation.
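In practice the simple restart method amounts to a single argument in base R's kmeans(); the call below mirrors the 10-restart MacQueen configuration used later in this study (X again denotes a numeric data matrix).

    # nstart = 10 runs the algorithm from 10 random starting configurations and
    # keeps the solution with the smallest total within-cluster sum of squares.
    fit <- kmeans(X, centers = 4, nstart = 10, iter.max = 90, algorithm = "MacQueen")
    fit$tot.withinss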
Chain Variation Restart Method
An extension of the restart method, when applied to a
base k-means clustering algorithm, achieves starting point
initialization using a chain variation method17. This variation involves moving one point to another cluster and running the k-means algorithm to evaluate whether or not this move provides a better clustering solution (i.e., a smaller SSE); the chain of variations consists of a sequence of these moves (called a Kernighan-Lin chain), each evaluated by running the k-means algorithm17. In this way, the chain-variation restart method is somewhat similar to the genetic algorithm's
mutation approach in that a series of initial clusterings
is iteratively changed to identify a better solution by
starting from an existing solution. However, the restart
method does not seek to identify a population of possible
solutions; rather, only one individual solution, chosen at
random, undergoes this chain procedure. Experimental
results suggest that this is an efficacious adaptation that
does allow the algorithm to escape local optima. However, with the largest datasets used in the initial development of this method, the algorithm typically requires a chain of 30 or more moves to reach an initial clustering that yields a solution different from that of the original k-means algorithm17. It is unknown how this algorithm performs as datasets increase further in size.
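A single move in such a chain can be sketched as follows (an illustration only, not the packaged implementation; for brevity the sketch assumes the move does not empty a cluster).

    # One link in a chain of moves: reassign one randomly chosen point, restart
    # k-means from the implied centers, and keep the move only if the SSE improves.
    chain_move <- function(X, fit, k) {
      labels <- fit$cluster
      i <- sample(nrow(X), 1)
      labels[i] <- sample(setdiff(seq_len(k), labels[i]), 1)
      centers <- t(sapply(seq_len(k), function(h)
        colMeans(X[labels == h, , drop = FALSE])))
      cand <- kmeans(X, centers = centers, iter.max = 90)
      if (cand$tot.withinss < fit$tot.withinss) cand else fit
    }

    # A chain is a sequence of such moves applied to one starting solution.
    fit <- kmeans(X, centers = 4, iter.max = 90)
    for (step in 1:30) fit <- chain_move(X, fit, k = 4)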
Expectation-Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a two-
step iterative algorithm commonly used to estimate model
parameters when there are missing or unobserved data37. This
method builds on maximum likelihood estimation, which assumes that the data follow a specified parametric model. Each iteration of the algorithm involves two steps: (1) the Expectation step, which, given an initial guess for the parameters of the data model, uses that model to form the expectation of the missing data conditional on the model, thereby creating an estimate of the missing or unobserved data; and (2) the Maximization step, which, given the estimates of the missing data, evaluates the likelihood function and generates a new estimate of the parameters for the data model. These steps are repeated until a convergence criterion is met, usually when the difference between successive iterations' estimates becomes very close to zero37. The EM algorithm fits naturally into the clustering problem because cluster membership is not observed in the dataset.
An EM-based initialization procedure was proposed by Bradley and Fayyad19. Similar to the chain-variation method, this algorithm runs the k-means algorithm for each variation of starting points (in this case, based upon bootstrap samples) and chooses the best initial clustering as a starting point for the full k-means algorithm19. This algorithm allows the user to specify EM-algorithm parameters for a given problem. The EM algorithm has been shown to find better solutions than k-means algorithms with unrefined starting points, but it shows poor performance under some conditions20. Although this EM algorithm variation is one of the more widely tested algorithms, it has not been tested against the newer algorithms, including the genetic algorithm method and the Bayesian method.
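In the present study the EM-based approach was run through the mclust package (see Methods); a minimal call of the kind used, assuming X is the data matrix, looks like the following.

    library(mclust)
    # EM-based mixture clustering with the mclust defaults: the package searches
    # over candidate covariance structures and selects among them by BIC.
    fit_em <- Mclust(X, G = 4)
    table(fit_em$classification)   # hard cluster assignments derived from the posteriors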
Bayesian Approach
Another approach to k-means clustering is to use
Bayesian methods to approximate the posterior probabilities
of class membership given variables of interest, and then
cluster based upon this inferred probability density
function (PDF) shape (basically chopping the PDF into
clusters at natural cut-points). The algorithm proposed by
Fuentes and Casella8 and Gopal et al.38 accomplishes this
through the Metropolis-Hastings algorithm, which is a type
of Markov chain Monte Carlo sampling method.
Stated more simply, the Bayesian approach uses
information from the observed data to infer the PDF from
which the data likely came without knowing the PDF a
priori. This is accomplished through repeated sampling,
with replacement, of the observed data until a sufficient
number of data points, typically many thousands or
millions, have been drawn to construct the distribution. In
this way, enough information is gathered about the
distribution to compute marginal and joint probabilities
directly from the repeated sampling distribution, called
the posterior distribution, or to derive other statistical
information about the joint distribution.
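A single Metropolis-Hastings update over cluster assignments can be sketched generically as below; this is a schematic illustration of the sampling idea, not the bayesclust38 implementation, and log_post() is a placeholder for the model's log posterior.

    # One Metropolis-Hastings step: propose reassigning one observation and accept
    # the proposal with probability min(1, posterior ratio).
    mh_step <- function(labels, k, log_post) {
      proposal <- labels
      i <- sample(length(labels), 1)
      proposal[i] <- sample(seq_len(k), 1)
      log_ratio <- log_post(proposal) - log_post(labels)
      if (log(runif(1)) < log_ratio) proposal else labels
    }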
In addition to clustering, a byproduct of the Bayesian
approach is the computation of the Bayes factor, which
tests the hypothesis that the number of existing clusters
is k, against the null, which states that there is only 1
cluster in the data. This is done to ensure that the
correct number of clusters is determined prior to placing
cases into clusters:
BF_{10} = \frac{m(Y \mid n = k)}{m(Y \mid n = 1)}
This hypothesis test addresses a major limitation of k-means clustering, namely specifying the number of clusters39-40. The function m(Y | n = k) denotes the distribution of the data given that there are n = k clusters. This Bayesian procedure
has shown promise in the few studies employing it; however,
this algorithm has a long run-time and has not been tested
on many data structures8.
The Present Study
The six algorithms selected for use in this study
were: (1) Genetic Algorithm-Based Hybrid, (2) Unmodified
Lloyd-Forgy, (3) Chain Variation of the Lloyd-Forgy, (4)
Bayesian-Based, (5) EM-Based, and (6) MacQueen Restart with
10 initial point restarts. These represent a range of
approaches to the k-means computational problem and include
some of the more-promising algorithms from Brusco and
Steinley20. In addition, given the scarcity of simulation
studies employing some of these algorithms, the algorithms
selected for the present study have been largely untested
against one another. Our goal was to identify the algorithm
that provided the best fit to the simulated data, as
determined by the Hubert-Arabie Adjusted Rand Index
(ARIHA), and then to investigate whether or not a subset of
these algorithms finds similar clusters within a large
empirical dataset (MUSIC).
The correct method for determining the number of
clusters extant in an empirical dataset is debated1,39-41, and
this remains an open question in the field. Two previous
simulation studies of k-means clustering methods did not
find a significant difference across cluster numbers36,41,
suggesting that this factor is less important in simulation
studies comparing k-means algorithms than other factors,
including cluster overlap, signal-to-noise ratio, and
cluster density36,41. In light of these previous studies, k was set at 4, and simulation parameters
focused on those factors shown to impact algorithm
performance more noticeably36,41, as the number of clusters
problem is beyond the scope of this investigation and
merits its own simulation study of existing cluster number
determination methodology39.
Methods
Two approaches to cluster study simulations exist: (1) many conditions with few replicates, and (2) fewer conditions with many replicates. Steinley's applied simulation studies tend to focus on many conditions20,41-43, whereas studies developing a new method have focused on many replicates to empirically demonstrate and test algorithm performance16-17,44.
In terms of testing the robustness of a clustering
algorithm to variations in data parameters, several
parameter variations have been suggested17,20,41, including
number of signal variables (or signal-to-noise ratio with
signal providing relevant information and noise adding no
important information), probability of overlap among
clusters, degree of homogeneity/heterogeneity within
clusters, sample size, and cluster density (in which classes do not have equal cluster size; e.g., 3 clusters each consisting of 30% of cases and a fourth cluster consisting of the remaining 10% of cases). At N = 200, the
impact of sample size and number of clusters is
negligible20. Because performance on simulated datasets such
as those used in our study is largely unknown in the
literature (particularly for signal-to-noise ratio
variations), our study will use the few-conditions, many-
replicates technique so as to maximize power, running each
loop 30 times and averaging responses across iterations.
Based on these previous indications, the following
parameters/variations were chosen as parameters to generate
the datasets through the R package MixSim42:
(1) N = 200
(2) K = 4
(3) Signal Variables = 2, 5, 10
(4) Noise Variables = 10 (giving 3 signal-to-noise ratios when combined with the signal-variable settings in (3))
(5) Cluster Homogeneity = yes or no
(6) Mixing Fraction = 0.25, 0.5, or 1 (i.e., 1 small cluster, all clusters equal, or 1 large cluster)
(7) Average Overlap among Clusters = 0.1, 0.2, 0.4
These variations yield 54 unique small-dataset conditions, each analyzed with the six algorithms (genetic algorithm based, Lloyd-Forgy chain variation, EM-based, Lloyd-Forgy unmodified, MacQueen with 10 restarts, and Bayesian). Previous studies have suggested combining across conditions to discern the effects of each parameter41. Therefore, after 30 separate datasets for each parameter combination were simulated and analyzed by the six algorithms of interest, and the diagnostic criterion was averaged over all 30 replicates, each of the 11 parameter levels was examined across all other conditions.
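For reference, one such condition can be generated along the following lines (a sketch assuming the MixSim()/simdataset() interface documented for the MixSim package42; argument names, in particular the mapping of the mixing fraction to PiLow, should be checked against that documentation).

    library(MixSim)
    set.seed(2014)
    # One condition: K = 4 clusters, 5 signal variables, average overlap 0.2,
    # homogeneous clusters, smallest mixing proportion 0.25, plus 10 noise variables.
    Q   <- MixSim(BarOmega = 0.2, K = 4, p = 5, hom = TRUE, PiLow = 0.25)
    sim <- simdataset(n = 200, Pi = Q$Pi, Mu = Q$Mu, S = Q$S, n.noise = 10)
    X    <- sim$X    # 200 x 15 data matrix (5 signal + 10 noise variables)
    true <- sim$id   # true cluster memberships used to score each algorithm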
To directly compare algorithm classification with true
classification, the Hubert-Arabie Adjusted Rand Index
(ARIHA) was chosen, as it has been shown to perform well in
comparing cluster solutions45. For this index, values of 0
correspond to chance levels of agreement, and values of 1
indicate a perfect classification match between the
algorithm-generated and true class solutions. It has been
shown that the ARIHA is robust across cluster densities
(mixing fractions), number of clusters, and sample size39,45.
Therefore, the ARIHA was chosen as our measure of algorithm
performance across conditions.
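Computing the index is straightforward in R; for example, the agreement between a k-means partition and the true memberships generated above can be scored with mclust's adjustedRandIndex() function.

    library(mclust)
    fit <- kmeans(X, centers = 4, nstart = 10, iter.max = 90, algorithm = "MacQueen")
    adjustedRandIndex(fit$cluster, true)   # 0 = chance-level agreement, 1 = perfect recovery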
In addition, in the present study we also aimed to test the performance of these algorithms on a large simulated dataset designed to resemble real-world data. Therefore, two final simulations were conducted with N = 2500, 15 variables (once with 12 signal and 3 noise variables, and a second time with 5 signal and 10 noise variables), average overlap of 0.2, and mixing fraction of 0.1, to identify which algorithms perform well on such a dataset. The three best performing, most disparate algorithms were selected to be re-estimated on a set of risk behavior variables from the MUSIC dataset. MUSIC consists of 9,750 college-student participants who reported on their engagement in a number of risky behaviors during the 30 days prior to assessment.
In our simulations, the number of clusters was fixed;
however, in the MUSIC dataset the optimal number of
clusters is unknown. The NbClust package in R provides 20
different, non-graphical algorithms for determining the
appropriate number of clusters in a dataset, including
Ratkowsky, Scott, Marriott, TraceCovW, and TraceW.
Therefore, to determine which of these procedures are most appropriate for the MUSIC dataset, we assessed the performance of each procedure on 1000 simulated datasets generated according to the aforementioned 12 signal, 3 noise variable scenario. Results were tallied according to how often each procedure identified (a) the correct number of clusters, k; (b) k-1 clusters; and (c) k+1 clusters. The best performing procedures were then used to obtain an estimate of the likely number of clusters within the MUSIC data. The two most likely numbers of clusters according to these procedures were then used in runs of the three algorithms chosen for the case study analysis. Readers interested in methods for determining the optimal cluster number within a dataset are referred to Tibshirani et al.46, Milligan and Cooper39, and Dimitriadou et al.47. The best performing algorithms were then used on the MUSIC dataset with 10 replicates.
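As an illustration, a single NbClust index can be queried as follows (a sketch; the available index names and return values should be checked against the NbClust documentation).

    library(NbClust)
    res <- NbClust(X, distance = "euclidean", min.nc = 2, max.nc = 8,
                   method = "kmeans", index = "ratkowsky")
    res$Best.nc   # number of clusters suggested by this criterion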
Within a 30-replicate loop per condition, all simulated datasets were created using the R package MixSim42 and then used to evaluate each of the six algorithms. The
MacQueen restart algorithm was implemented through the use
of the kmeans function contained in the R base package with
10 restarts (nstart=10) and a maximum of 90 iterations of
the k-means algorithm (iter.max=90). The genetic algorithm
variation was implemented using the R package skmeans48 with
parameters calibrated to 140 generations, a mutation rate
of 0.2 per generation, and a population size of 20. The
Lloyd-Forgy algorithm with no restarts or chain variations
was also implemented with the skmeans package and 90
iterations, as this was the number of iterations chosen for
the other algorithms. The chain-restart method was
implemented using the skmeans package using 90 iterations
and chains of 50 moves each. The EM algorithm was
implemented using the default parameter settings in the
mclust R package, which searches through possible cluster
shapes according to the BIC criterion and does not use a
prior on the distribution. Finally, the Bayesian k-means
variation was conducted using the Bayesclust R package38 and
default parameter settings (Metropolis search algorithm
with 100,000 iterations, sigma hyperparameters of 2.01 and
0.99, alpha hyperparameter of 1, and minimum cluster size
of 10%).
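Putting these pieces together, one 30-replicate loop for a single condition can be sketched as follows (using the same hedged MixSim interface assumptions noted earlier).

    library(MixSim); library(mclust)
    # Simulate, cluster, and score 30 replicates of one parameter combination,
    # then average the ARIHA to obtain a condition-level value like those in Table 1.
    ari <- replicate(30, {
      Q   <- MixSim(BarOmega = 0.2, K = 4, p = 5, hom = TRUE, PiLow = 0.25)
      sim <- simdataset(n = 200, Pi = Q$Pi, Mu = Q$Mu, S = Q$S, n.noise = 10)
      fit <- kmeans(sim$X, centers = 4, nstart = 10, iter.max = 90, algorithm = "MacQueen")
      adjustedRandIndex(fit$cluster, sim$id)
    })
    mean(ari)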
Real-World Dataset
MUSIC is a large dataset in which a number of
psychological and health-related surveys were administered
to 9,750 undergraduate students at 30 U.S. colleges and
universities. Among the surveys administered was a series
of questions asking participants how many times they had
engaged in a number of substance-use (marijuana, hard
drugs, inhalants, injection drugs, and misuse of
prescription drugs), sexual (oral sex, anal sex, sex with
strangers or brief acquaintances, unprotected sex, and sex
while drunk or high), and risky driving (drunk/drugged
driving and riding with a drunk/drugged driver) behaviors
during the 30 days prior to assessment. Possible response
choices for each of these behaviors were 0 (none), 1
(once/twice), 2 (3-5 times), 3 (6-10 times), or 4 (more
than 10 times).
There is reason to believe that these behaviors will
be highly correlated with one another and will form
clusters. Problem behavior theory49 suggests that young
people who engage in one risky behavior are more apt to
engage in other risky behaviors; and epidemiologic data
indicate that the adolescent and emerging adult years
(roughly ages 15-24) are characterized by the highest rates
of illicit drug use, sexual risk behavior, and impaired
driving50-51. At the same time, not all college-aged
individuals engage in high levels of risky behavior, and
some forms of risky behavior are more common than others.
For example, many college students engage in casual sexual relationships52, but far fewer students use inhalants or injection drugs. One would therefore expect multiple
clusters of students, some of which are characterized by no
or mild engagement in risky behavior and others of which
are characterized by more severe engagement in many or all
of the risk behaviors. In using the MUSIC data, we employed
the clustering techniques on the 13 illicit drug use,
sexual risk taking, and impaired driving variables listed
here: drunk driving, driving in a car driven by a drunk
driver, engagement in casual sex, number of partners,
engaging in sex without condoms, oral sex, anal sex, sex
while drunk, smoking marijuana, using hard drugs (e.g.,
cocaine, heroin), inhalant use, injecting drugs, and
nonmedical use of prescription medications.
Results
Simulations under Parameter Variations
The various algorithms evidenced different performance
under different conditions of the simulation (see Table 1).
In all, the MacQueen algorithm with 10 restarts
outperformed the other five algorithms handily under every
condition, suggesting that this algorithm is a strong,
general clustering technique. However, despite outperforming all other algorithms, the MacQueen algorithm with 10 restarts showed a marked decrease in performance as additional signal variables were added to the dataset, suggesting that it may struggle with larger predictor sets.
All six algorithms struggled when the mean overlap of
groups increased to 0.4, indicating that strongly
overlapping clusters are difficult to recover. This finding
is consistent with the results reported by Steinley53,54. For
conditions in which there is likely to be large group
overlap, one may want to consider LCA or fuzzy k-means
algorithms, as these deal with probabilistic classification
rather than hard classification2.
Under most conditions, the genetic algorithm-based,
unmodified Lloyd-Forgy, and chain variation Lloyd-Forgy
methods produced very similar results to each other, with
the genetic algorithm approach performing slightly better.
In addition, the chain variation Lloyd-Forgy method seemed
to perform better as the mixing fraction decreased,
suggesting that it performs well with unbalanced group
sizes. The Bayesian algorithm performed maximally well with
increased numbers of salient predictors, low average
overlap, and smaller mixing fractions.
Simulations to Mimic Real-World Conditions
In the "real-world conditions" simulation, all six
algorithms performed better as the number of relevant
predictors increased (Table 2), with the MacQueen algorithm
with 10 restarts and EM algorithm scoring over 0.5 on the
ARIHA for both conditions. The other four algorithms scored
approximately 0.3 for the condition with 5 signal and 10
noise variables; with 12 signal variables, all four
algorithms approached 0.4. This pattern indicates a fair
amount of agreement for the four lower-performing
algorithms and a moderate to substantial agreement for the
MacQueen algorithm with 10 restarts and EM algorithms.
When examining computing time for the six algorithms
on a standard laptop (Table 3), some of these algorithms
required a large amount of computing resources that may be
less than ideal for large datasets. The genetic algorithm and Bayesian algorithm both required quite a bit of computing resources, though neither was unreasonable for
the datasets generated to mimic a large real-world dataset.
However, in very large datasets, computing resources may
become an issue, and alternative algorithms may yield
results more quickly without sacrificing quality. The EM
algorithm required more computing resources than three
other algorithms but quite a bit less than the two most
demanding ones (genetic and Bayesian), suggesting that it
would be reasonable (though not efficient) for larger
datasets. In all, running 30 iterations of the loop that generated and analyzed one real-world condition for each of the six algorithms required about 60 hours on a laptop with 8 GB of RAM and a 2.5 GHz processor.
MUSIC Dataset
Because the EM method and MacQueen algorithm with 10
restarts gave the best performance on the simulated data,
both of these methods were used to analyze the MUSIC
dataset. We also used the Bayesian algorithm because this
approach is novel among clustering algorithms and has not
been applied to many real-life datasets.
The most promising algorithms for finding the correct
number of clusters on simulated datasets similar to the
MUSIC data were the Ratkowsky, Scott, Marriott, TraceCovW, and TraceW algorithms (summary of methodology and results available upon request). These were used to determine the optimum number of clusters within the MUSIC data. The average number of optimum clusters detected by each algorithm was between 3 and 4 (Table 4); therefore, the three k-means algorithms for the MUSIC dataset were run with k=3 and k=4 clusters. Each group's mean response across the 13 risk behavior questions was examined to characterize the cluster solution.
After obtaining the results from both k=3 and k=4, it
was determined that the k=3 solutions did not make as much
sense clinically as the k=4 solutions. Therefore, we
further explored the k=4 solutions to characterize and
compare cluster solutions among the three algorithms. As
summarized in Table 5, the MacQueen algorithm with 10
restarts yielded the following four groups: (1) some sexual
risk and low substance abuse (n = 1523), (2) very high risk
in all areas (n = 11), (3) fairly high sexual and substance
abuse risk (n = 713), and (4) low risk in all areas (n =
7704). The second group was extremely small and reported an
average of 69 sexual partners in the month prior to
assessment; it is possible that this algorithm detected
students who lied about (or provided extremely unlikely
reports of) their sexual behavior. Such a conclusion would
suggest that this algorithm classifies outliers within the
same cluster, rather than adding them to the next-closest
cluster, which would skew the clusters to which these
extreme observations are added. This is desirable for an
algorithm used on an experimental dataset, as it is able to
find distinct subgroups within the sample, including
extreme scores or distinct small groups of individuals
(such as those extremely high or low scorers). Care should
be taken in over-interpreting small groups of extreme
scores, unless replicated in other samples.
The EM algorithm found the following four groups: (1)
moderate sexual and substance abuse risk (n = 277), (2)
high oral sex but very low drug use (n = 7925), (3) risk from smoking marijuana and alcohol-related behaviors (n = 1581), and (4) high risk in all areas (n = 168). This
seems reasonable given previous research on college risk
behaviors50-52. Cross-tabulating the classes obtained with
the EM algorithm against those obtained with the MacQueen
algorithm with 10 restarts yielded an ARIHA of .15,
suggesting little overlap across these two algorithms and a
high likelihood of erroneous classification based upon
simulation results.
The Bayesian algorithm found the following groups: (1) low risk (n = 3969), (2) risk from impaired driving, marijuana use, and sex (n = 1991), (3) moderate risk in all areas (n = 1990), and (4) assumedly monogamous (i.e., low risks other than unprotected sex; n = 2001). Considering that there was very low overlap with the classes extracted using the MacQueen algorithm with 10 restarts (ARIHA = .18) and with the EM algorithm's outcome (ARIHA = .06), it is likely that this algorithm is also not finding accurate data partitions.
Considered together with the simulation results, it
appears that the different algorithms are likely to produce
different results with little overlap among methods. Given
the poor performance of the Bayesian and EM algorithms on
the simulated data, it is likely that these algorithms
produced untrustworthy results with high misclassification
rates. However, these algorithms did find clusters within
the MUSIC dataset and produced results that were
interpretable, suggesting that these algorithms need to be
applied with caution, as they may yield interpretable
results based on faulty clustering of data. These results
also support the speculation of Brusco and Steinley20, where
it was suggested that their genetic algorithm-based k-means
algorithm performed well because it was based on the HK-
means+ algorithm, which was based heavily on the MacQueen
algorithm.
Discussion
We conducted the present study to compare six commonly
used k-means algorithms in terms of their ability to
correctly cluster optimally simulated data, data simulated
to mimic a real world dataset, and a collected dataset
consisting of risk behavior reports from nearly 10,000
college students. Although many studies have compared two
or three clustering algorithms to one another, our study is
among the first to compare six different algorithms under
various conditions (e.g., different sample sizes, varying
degrees of overlap between/among clusters, varying degrees
of random error). Our results may therefore have the
potential to provide greater clarity regarding which
algorithms do, and do not, perform well given specific
properties within the dataset.
Our results also indicate that large differences exist
among these six k-means algorithms in terms of the quality
of their performance and in terms of the content of
clusters that they produce. These results underscore the
need for informed application of these clustering
algorithms to real-world datasets, as results may
drastically vary depending upon which algorithm is chosen
for the situation. Within the small-scale simulation
portion of this study, low signal-to-noise ratios (large
numbers of uninformative variables) presented a significant
challenge to the majority of algorithms. The MacQueen
algorithm with 10 restarts and EM algorithms performed most
favorably in the context of low signal-to-noise ratios,
though the MacQueen algorithm was the only one that
provided a strong performance across conditions.
Nonetheless, our results suggest that large or complex
datasets may present challenges to many of the k-means
algorithms, as the MacQueen algorithm showed a faltering
performance as the number of variables increased and as the
overlap increased. In addition, the performance of five
algorithms deteriorated as the average group overlap
increased beyond 0.2. This finding suggests that correct
classification is most likely when groups are well-
separated, and that the MacQueen algorithm with 10 restarts
may be among the most effective options when cluster
overlap is higher.
When examining two simulated datasets with several
variables, heterogeneity, moderate overlap, and low mixing
fraction, the EM and MacQueen with 10 restarts algorithms
performed much more strongly than the other four algorithms
without requiring an unreasonable amount of computing
power. The Bayesian and genetic algorithm methods, though
showing a comparable performance to the remaining
algorithms (chain-restart and non-restart Lloyd-Forgy),
required much more time to run and more computing
resources. Given the trend towards analyzing datasets with
extremely large numbers of cases and variables (i.e., big
data), considerations related to computing resources
required may need to be taken into account.
On the MUSIC dataset, the MacQueen algorithm with 10
restarts identified a small group of exceptionally high
scores, suggesting that it is sensitive enough to detect
possible outliers and may be of use in flagging suspect
observations. This may be useful for researchers with large
collections of observations and who aim to examine extreme
events more closely. The other two algorithms identified
cluster solutions consistent with previous rates of risk
behaviors among a college population55; however, the overlap
with the MacQueen solutions was troublingly low, suggesting
that these algorithms may have found very inaccurate
solutions. Given their poor performance in the simulations,
caution should be taken when applying any of these more
complex algorithms to real datasets, as they are unlikely to provide as accurate a solution as the MacQueen restart method.
Our results suggest that different k-means
computational methods may partition data in different ways,
such that the groups produced do not converge well across
algorithms, and that given current implementations, the
more complex algorithms should be used with caution or
disregarded in favor of the MacQueen restart method. It is
possible that each algorithm "sees" the data structure from a different point of view, analogous to researchers from different fields approaching a multidisciplinary problem from their own perspectives and then solving that problem using the tools of their particular field;
however, it is also likely that some of these algorithms
are unable to find valid clustering solutions. This finding
has important implications for researchers applying these
methods to real-life data, as all algorithms are not
created equally. From the results of Brusco and Steinley20,
it is possible that the computational methods upon which
some of these more complex algorithms are based may yield
k-means solutions that perform well on real data; however,
the versions of these algorithms provided within some
software packages (e.g., R) do not perform well.
Strengths and Limitations
This study has several strengths and weaknesses. Using
simulated datasets with known "true" classifications
allowed us to compare algorithms under controlled
conditions. However, these simulations involved smaller
datasets than might be encountered in large empirical
studies. Even with the MUSIC dataset, which provides a
sample of almost 10,000 participants, only 13 clustering
variables were used. As such, the results of this study
should be interpreted with caution by researchers hoping to
apply these algorithms to studies with large numbers of
variables or in which the number of parameters may exceed
the number of observations. For example, many genomics
studies include millions of genes taken from a comparably
small number of participants. Further work should be done
to examine these algorithms in such situations to
understand which k-means algorithms perform well under such
conditions.
In sum, this study provides a roadmap for the
application of several k-means algorithms to datasets
involving several predictors with varying degrees of signal
and noise. For most of the algorithms tested in this study,
it seems that the roadmap is quite simple in scope.
Researchers hoping to use the k-means methodology would do
well to use the basic k-means algorithms, particularly with
a simple restart step added to the algorithm, rather than
using some of the newer, more complex algorithms. We hope
that our results have contributed to the literature
comparing various approaches to k-means cluster analysis
and provide a more accurate, if shorter, roadmap for
researchers hoping to employ this methodology.
Acknowledgements:
I would like to thank my coauthors Dr. Daniel Feaster,
Dr. Seth Schwartz, and Dr. Douglas Steinley for their
contributions to this study, as well as the University of Miami for the coursework that allowed me to complete a study like this.
References
1) Steinley D. K-means clustering: a half-century
synthesis. Br J Math Stat Psychol. 2006; 59(Pt 1):1-34.
2) DiStefano C, Kamphaus RW. Investigating Subtypes of
Child Development A Comparison of Cluster Analysis and
Latent Class Cluster Analysis in Typology Creation. Educ
Psychol Meas. 2006; 66(5):778-794.
3) Eshghi A, Haughton D, Legrand P, Skaletsky M, Woolford
S. Identifying groups: a comparison of methodologies. J
Data Sci. 2011; 9:271-91.
4) Scott AJ, Knott M. A cluster analysis method for
grouping means in the analysis of variance. Biometrics.
1974; 507-512.
5) Edwards AWF, Cavalli-Sforza LL. A method for cluster
analysis. Biometrics. 1965; 362-375.
6) Steinley D, Brusco MJ. Evaluating the performance of
model-based clustering: Recommendations and cautions.
Psychol. Methods. 2011; 16, 63-79.
7) Steinley D, Brusco MJ. K-means clustering and model-
based clustering: Reply to McLachlan and Vermunt. Psychol.
Methods. 2011; 16, 89-92.
8) Fuentes C, Casella G. Testing for the existence of
clusters. Sort (Barc). 2009; 33(2):115-157.
9) Martis RJ, Chakraborty C. Arrhythmia disease diagnosis
using neural network, SVM, and genetic algorithm-optimized
k-means clustering. J. Mech. Med. Biol. 2011; 11(04): 897-
915.
10) Rovniak LS, Sallis JF, Saelens, et al. Adults' physical
activity patterns across life domains: cluster analysis
with replication. Health Psychology. 2010; 29(5): 496.
11) Newby PK, Muller D, Hallfrisch J, Qiao N, Andres R,
Tucker KL. Dietary patterns and changes in body mass index
and waist circumference in adults. Am. J. Clin. Nutr. 2003;
77(6): 1417-1425.
12) Ahn Y, Park YJ, Park SJ, et al. Dietary patterns and
prevalence odds ratio in middle-aged adults of rural and
mid-size city in Korean Genome Epidemiology Study. Korean
Journal of Nutrition. 2007; 40(3): 259-269.
13) Bauer AK, Rondini EA, Hummel, et al. Identification of
candidate genes downstream of TLR4 signaling after ozone
exposure in mice: a role for heat-shock protein 70.
Environmental health perspectives. 2010; 119(8): 1091.
14) O'Quin KE, Hofmann CM, Hofmann HA, Carleton KL.
Parallel evolution of opsin gene expression in African
cichlid fishes. Mol. Biol. Evol. 2010; 27(12): 2839-2854.
15) Tukiendorf A, Kaźmierski R, Michalak S. The Taxonomy
Statistic Uncovers Novel Clinical Patterns in a Population
of Ischemic Stroke Patients. PloS one. 2013; 8(7): e69816.
16) Krishna K, Murty MN. Genetic K-means algorithm.
Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE
Transactions on. 1999; 29(3):433-439.
17) Dhillon IS, Guan Y, Kogan J. Iterative clustering of
high dimensional text data augmented by local search. Data
Mining. 2002; 131-138.
18) Cannon RL, Dave JV, Bezdek JC. Efficient implementation
of the fuzzy c-means clustering algorithms. Pattern
Analysis and Machine Intelligence, IEEE Transactions on.
1986; (2): 248-255.
19) Bradley PS, Fayyad U. Refining Initial Points for K-
Means Clustering. Proc. 15th International Conf. Machine
Learning. 1998.
20) Brusco MJ, Steinley D. A comparison of heuristic
procedures for minimum within-cluster sums of squares
partitioning. Psychometrika. 2007; 72(4): 583-600.
21) Steinley D. Validating clusters with the lower bound
for sum-of-squares error. Psychometrika. 2007; 72(1): 93-
106.
22) Castillo LG, Schwartz SJ. Introduction to the special
issue on college student mental health. Journal of clinical
psychology. 2013; 69(4): 291-297.
23) Weisskirch RS, Zamboanga BL, Ravert RD, et al. An
introduction to the composition of the Multi-Site
University Study of Identity and Culture (MUSIC): A
collaborative approach to research and mentorship. Cultural
Diversity and Ethnic Minority Psychology. 2013; 19(2): 123.
24) Lloyd SP. Least squares quantization in PCM. Bell
Telephone Laboratories Paper. 1957.
25) MacQueen J. Some methods for classification and
analysis of multivariate observations. Proceedings of the
fifth Berkeley symposium on mathematical statistics and
probability. 1967; 1(14): 281-297.
26) Holland J. Adaptation in Natural and Artificial
Systems. University of Michigan Press. 1975.
27) Somma RD, Boixo S, Barnum H, Knill E. Quantum
simulations of classical annealing processes. Physical
review letters. 2008; 101(13): 130504.
28) Temme K, Osborne TJ, Vollbrecht KG, Poulin D,
Verstraete F. Quantum metropolis sampling. Nature. 2011;
471(7336): 87-90.
29) Hassan R, De Weck O, Springmann P. Architecting a
communication satellite product line. 22nd AIAA
international communications satellite systems conference &
exhibit (ICSSC). 2004.
30) Najafi A, Ardakani SS, Marjani M. Quantitative
structure-activity relationship analysis of the
anticonvulsant activity of some benzylacetamides based on
genetic algorithm-based multiple linear regression.
Tropical Journal of Pharmaceutical Research. 2011; 10(4):
483-490.
31) Pittman J. Adaptive splines and genetic algorithms. J.
Comp. Graph. Stat. 2002; 11(3): 615-638.
32) Paterlini S, Minerva T. Regression model selection
using genetic algorithms. Proceedings of the 11th WSEAS
International Conference on RECENT Advances in Neural
Networks, Fuzzy Systems & Evolutionary Computing. 2010; 19-
27.
33) Maulik U, Bandyopadhyay S. Genetic algorithm-based
clustering technique. Pattern Recogn. 2000; 33(9): 1455-
1465.
34) Forrest S. Genetic algorithms: principles of natural
selection applied to computation. Science. 1993; 261(5123):
872-878.
35) Whitley D. A genetic algorithm tutorial. Stat Comp.
1994; 4(2): 65-85.
36) Steinley D, Brusco, MJ. Initializing K-means batch
clustering: A critical evaluation of several techniques. J.
Classif. 2007; 24, 99-121.
37) Dempster AP, Laird NM, Rubin DB. Maximum likelihood
from incomplete data via the EM algorithm. Journal of the
Royal Statistical Society. Series B (Methodological). 1977;
1-38.
38) Gopal V, Fuentes C, Casella G, Gopal M. The bayesclust
Package. 2009.
39) Milligan GW, Cooper MC. An examination of procedures
for determining the number of clusters in a data set.
Psychometrika. 1985; 50(2): 159-179.
40) Steinley D, Brusco, MJ. Testing for validity and
choosing the number of clusters in K-means clustering.
Psychol. Methods. 2011; 16, 285-297.
41) Steinley D, Brusco MJ. A new variable weighting and
selection procedure for K-means cluster analysis. Multiv
Behav Res. 2008; 43(1): 77-108.
42) Melnykov V, Chen W, Maitra R. MixSim: An R Package for
Simulating Data to Study Performance of Clustering
Algorithms. J Stat Softw. 2012; 51(12), 1-25.
43) Steinley D. Standardizing variables in K-means
clustering. Classification, clustering, and data mining
applications. 2004; 53-60.
44) Huang Z. Clustering large data sets with mixed numeric
and categorical values. Proceedings of the 1st Pacific-Asia
Conference on Knowledge Discovery and Data Mining. 1997;
21-34.
45) Steinley D. Properties of the Hubert-Arabie Adjusted
Rand Index. Psychol Methods. 2004; 9(3): 386.
46) Tibshirani R, Walther G, Hastie T. Estimating the
number of clusters in a data set via the gap statistic.
Journal of the Royal Statistical Society: Series B
(Statistical Methodology). 2001; 63(2): 411-423.
47) Dimitriadou E, Dolničar S, Weingessel A. An examination
of indexes for determining the number of clusters in binary
data sets. Psychometrika. 2002; 67(1): 137-159.
48) Hornik K, Feinerer I, Kober M, Buchta C. Spherical k-
Means Clustering. Journal of Statistical Software. 2012;
50(10), 1-22.
49) Jessor R, Jessor SL. Problem behavior and psychosocial
development: A longitudinal study of youth. 1977.
50) National Highway Traffic Safety Administration. Teen
drivers: Additional resources. 2013. Retrieved October 22,
2014 from
http://www.nhtsa.gov/Driving+Safety/Teen+Drivers/Teen+Drivers+-+Additional+Resources.
51) Centers for Disease Control and Prevention. Youth Risk
Behavior Surveillance - United States, 2013. Morbidity and
Mortality Weekly Report. 2013; 63(1): 1-168.
52) Bogle KA. Hooking up: Sex, dating, and relationships on
campus. New York: New York University Press. 2008.
53) Steinley D. Profiling local optima in K-means
clustering: Developing a diagnostic technique. Psychol.
Methods. 2006; 11, 178-192.
54) Steinley D, Brusco, MJ. A new variable weighting and
selection procedure for K-means cluster analysis. Multiv
Behav Res. 2008; 43, 77-108.
55) Cooper ML. Alcohol use and risky sexual behavior among
college students and youth: Evaluating the evidence. J.
Stud. Alcohol Drugs. 2002; (14): 101.
Table 1: Algorithm Performance by Condition (mean ARIHA across replicates)

Condition             Genetic    Lloyd-   Lloyd-Forgy   Bayesian   EM         MacQueen
                      Algorithm  Forgy    Chain Var.               Algorithm  10 Restarts
Salient Predictors
  2                   0.249      0.237    0.251         0.179      0.267      0.972
  5                   0.326      0.298    0.320         0.233      0.293      0.846
  10                  0.306      0.265    0.295         0.280      0.283      0.665
Homogeneity
  TRUE                0.295      0.266    0.288         0.235      0.274      0.806
  FALSE               0.297      0.266    0.289         0.226      0.288      0.819
Average Overlap
  0.1                 0.438      0.419    0.445         0.455      0.539      0.879
  0.2                 0.297      0.221    0.289         0.198      0.256      0.875
  0.4                 0.125      0.116    0.132         0.039      0.048      0.684
Mixing Fraction
  1                   0.301      0.267    0.263         0.222      0.272      0.819
  0.5                 0.291      0.261    0.285         0.217      0.279      0.815
  0.25                0.297      0.270    0.318         0.253      0.292      0.850
Table 2: Algorithm Performance (ARIHA) on Simulation of the MUSIC Dataset

Condition             Genetic    Lloyd-Forgy  Lloyd-   Bayesian   EM         MacQueen
                      Algorithm  Chain        Forgy               Algorithm  10 Restarts
5 Signal, 10 Noise    0.329      0.301        0.327    0.304      0.500      0.674
12 Signal, 3 Noise    0.388      0.363        0.387    0.387      0.594      0.727
Table 3: Computing Times in Seconds on Simulation of a Dataset like the MUSIC Dataset

                      Genetic    Lloyd-Forgy  Lloyd-   Bayesian   EM         MacQueen
                      Algorithm  Chain        Forgy               Algorithm  10 Restarts
Elapsed Time          146.57     0.14         0.50     600.85     14.81      0.36
System Time           0.25       0.00         0.00     0.09       0.02       0.00
User Time             135.15     0.14         0.49     594.83     14.48      0.36
Table 4: Average Number of Clusters Estimated in the MUSIC Dataset

Index        Average Number of Clusters   Range of Clusters Found
Scott        4                            3 or 5
Marriott     3.2                          3 or 4
TraceCovW    3.1                          3 or 4
TraceW       3.3                          3 or 4
Ratkowsky    3.9                          3 or 4
Table 5: 4-Cluster Solutions by Algorithm

MacQueen with 10 Restarts:
  Cluster 1: Moderate sexual risk with low substance use risk (n = 1523)
  Cluster 2: High sexual and substance use risk (n = 11)
  Cluster 3: Moderate sexual and substance use risk (n = 713)
  Cluster 4: Low sexual and substance use risk (n = 7704)

EM-based:
  Cluster 1: Moderate to high substance use and sexual risk (n = 277)
  Cluster 2: Low risk, particularly with substance use (n = 7925)
  Cluster 3: Moderate alcohol and marijuana risk, low sexual risk (n = 1581)
  Cluster 4: High sexual and substance use risk (n = 168)

Bayesian:
  Cluster 1: Low sexual and substance use risk (n = 3969)
  Cluster 2: Moderate sexual risk, moderate alcohol and marijuana risk (n = 1991)
  Cluster 3: Low to moderate sexual and substance abuse risk (n = 1990)
  Cluster 4: Low sexual and substance abuse risk (sexual risk from relationship) (n = 2001)
Enhanced Genetic Algorithm with K-Means for the Clustering ProblemEnhanced Genetic Algorithm with K-Means for the Clustering Problem
Enhanced Genetic Algorithm with K-Means for the Clustering Problem
Anders Viken
Ā 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
IJERD Editor
Ā 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
Dinusha Dilanka
Ā 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
Editor IJMTER
Ā 

Similar to One Graduate Paper (20)

Enhanced Genetic Algorithm with K-Means for the Clustering Problem
Enhanced Genetic Algorithm with K-Means for the Clustering ProblemEnhanced Genetic Algorithm with K-Means for the Clustering Problem
Enhanced Genetic Algorithm with K-Means for the Clustering Problem
Ā 
Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm
Ā 
I017235662
I017235662I017235662
I017235662
Ā 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
Ā 
rule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systemsrule refinement in inductive knowledge based systems
rule refinement in inductive knowledge based systems
Ā 
A03202001005
A03202001005A03202001005
A03202001005
Ā 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
Ā 
Multilevel techniques for the clustering problem
Multilevel techniques for the clustering problemMultilevel techniques for the clustering problem
Multilevel techniques for the clustering problem
Ā 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
Ā 
Statistical modeling in pharmaceutical research and development
Statistical modeling in pharmaceutical research and developmentStatistical modeling in pharmaceutical research and development
Statistical modeling in pharmaceutical research and development
Ā 
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type  Data MK-Prototypes: A Novel Algorithm for Clustering Mixed Type  Data
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data
Ā 
Recommender system
Recommender systemRecommender system
Recommender system
Ā 
Case-Based Reasoning for Explaining Probabilistic Machine Learning
Case-Based Reasoning for Explaining Probabilistic Machine LearningCase-Based Reasoning for Explaining Probabilistic Machine Learning
Case-Based Reasoning for Explaining Probabilistic Machine Learning
Ā 
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...
ON THE PREDICTION ACCURACIES OF THREE MOST KNOWN REGULARIZERS : RIDGE REGRESS...
Ā 
Ijetr042111
Ijetr042111Ijetr042111
Ijetr042111
Ā 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
Ā 
call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...
Ā 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithmsExperimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
Ā 
F04463437
F04463437F04463437
F04463437
Ā 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Ā 

More from Colleen Farrelly

More from Colleen Farrelly (20)

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
Ā 
Hands-On Network Science, PyData Global 2023
Hands-On Network Science, PyData Global 2023Hands-On Network Science, PyData Global 2023
Hands-On Network Science, PyData Global 2023
Ā 
Modeling Climate Change.pptx
Modeling Climate Change.pptxModeling Climate Change.pptx
Modeling Climate Change.pptx
Ā 
Natural Language Processing for Beginners.pptx
Natural Language Processing for Beginners.pptxNatural Language Processing for Beginners.pptx
Natural Language Processing for Beginners.pptx
Ā 
The Shape of Data--ODSC.pptx
The Shape of Data--ODSC.pptxThe Shape of Data--ODSC.pptx
The Shape of Data--ODSC.pptx
Ā 
Generative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptxGenerative AI, WiDS 2023.pptx
Generative AI, WiDS 2023.pptx
Ā 
Emerging Technologies for Public Health in Remote Locations.pptx
Emerging Technologies for Public Health in Remote Locations.pptxEmerging Technologies for Public Health in Remote Locations.pptx
Emerging Technologies for Public Health in Remote Locations.pptx
Ā 
Applications of Forman-Ricci Curvature.pptx
Applications of Forman-Ricci Curvature.pptxApplications of Forman-Ricci Curvature.pptx
Applications of Forman-Ricci Curvature.pptx
Ā 
Geometry for Social Good.pptx
Geometry for Social Good.pptxGeometry for Social Good.pptx
Geometry for Social Good.pptx
Ā 
Topology for Time Series.pptx
Topology for Time Series.pptxTopology for Time Series.pptx
Topology for Time Series.pptx
Ā 
Time Series Applications AMLD.pptx
Time Series Applications AMLD.pptxTime Series Applications AMLD.pptx
Time Series Applications AMLD.pptx
Ā 
An introduction toĀ quantumĀ machine learning.pptx
An introduction toĀ quantumĀ machine learning.pptxAn introduction toĀ quantumĀ machine learning.pptx
An introduction toĀ quantumĀ machine learning.pptx
Ā 
An introduction to time series data with R.pptx
An introduction to time series data with R.pptxAn introduction to time series data with R.pptx
An introduction to time series data with R.pptx
Ā 
NLP: Challenges and Opportunities in Underserved Areas
NLP: Challenges and Opportunities in Underserved AreasNLP: Challenges and Opportunities in Underserved Areas
NLP: Challenges and Opportunities in Underserved Areas
Ā 
Geometry, Data, and One Path Into Data Science.pptx
Geometry, Data, and One Path Into Data Science.pptxGeometry, Data, and One Path Into Data Science.pptx
Geometry, Data, and One Path Into Data Science.pptx
Ā 
Topological Data Analysis.pptx
Topological Data Analysis.pptxTopological Data Analysis.pptx
Topological Data Analysis.pptx
Ā 
Transforming Text Data to Matrix Data via Embeddings.pptx
Transforming Text Data to Matrix Data via Embeddings.pptxTransforming Text Data to Matrix Data via Embeddings.pptx
Transforming Text Data to Matrix Data via Embeddings.pptx
Ā 
Natural Language Processing in the Wild.pptx
Natural Language Processing in the Wild.pptxNatural Language Processing in the Wild.pptx
Natural Language Processing in the Wild.pptx
Ā 
SAS Global 2021 Introduction to Natural Language Processing
SAS Global 2021 Introduction to Natural Language Processing SAS Global 2021 Introduction to Natural Language Processing
SAS Global 2021 Introduction to Natural Language Processing
Ā 
WIDS 2021--An Introduction to Network Science
WIDS 2021--An Introduction to Network ScienceWIDS 2021--An Introduction to Network Science
WIDS 2021--An Introduction to Network Science
Ā 

Recently uploaded

Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
Ā 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
SUHANI PANDEY
Ā 
CHEAP Call Girls in Saket (-DELHI )šŸ” 9953056974šŸ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )šŸ” 9953056974šŸ”(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )šŸ” 9953056974šŸ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )šŸ” 9953056974šŸ”(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
Ā 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
9to5mart
Ā 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
Ā 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
Ā 
āž„šŸ” 7737669865 šŸ”ā–» malwa Call-girls in Women Seeking Men šŸ”malwašŸ” Escorts Ser...
āž„šŸ” 7737669865 šŸ”ā–» malwa Call-girls in Women Seeking Men  šŸ”malwašŸ”   Escorts Ser...āž„šŸ” 7737669865 šŸ”ā–» malwa Call-girls in Women Seeking Men  šŸ”malwašŸ”   Escorts Ser...
āž„šŸ” 7737669865 šŸ”ā–» malwa Call-girls in Women Seeking Men šŸ”malwašŸ” Escorts Ser...
amitlee9823
Ā 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
Ā 
āž„šŸ” 7737669865 šŸ”ā–» Mathura Call-girls in Women Seeking Men šŸ”MathurašŸ” Escorts...
āž„šŸ” 7737669865 šŸ”ā–» Mathura Call-girls in Women Seeking Men  šŸ”MathurašŸ”   Escorts...āž„šŸ” 7737669865 šŸ”ā–» Mathura Call-girls in Women Seeking Men  šŸ”MathurašŸ”   Escorts...
āž„šŸ” 7737669865 šŸ”ā–» Mathura Call-girls in Women Seeking Men šŸ”MathurašŸ” Escorts...
amitlee9823
Ā 
Call Girls In Nandini Layout ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Call Girls In Nandini Layout ā˜Ž 7737669865 šŸ„µ Book Your One night StandCall Girls In Nandini Layout ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Call Girls In Nandini Layout ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
amitlee9823
Ā 
Call Girls In Attibele ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Call Girls In Attibele ā˜Ž 7737669865 šŸ„µ Book Your One night StandCall Girls In Attibele ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Call Girls In Attibele ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
amitlee9823
Ā 
Call Girls Jalahalli Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ban...
amitlee9823
Ā 
Call Girls Bommasandra Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service B...
Call Girls Bommasandra Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service B...Call Girls Bommasandra Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service B...
Call Girls Bommasandra Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service B...
amitlee9823
Ā 

Recently uploaded (20)

Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
Ā 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
Ā 
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi šŸ’Æ Call Us šŸ”9205541914 šŸ”( Delhi) Escorts S...
Ā 
CHEAP Call Girls in Saket (-DELHI )šŸ” 9953056974šŸ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )šŸ” 9953056974šŸ”(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )šŸ” 9953056974šŸ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )šŸ” 9953056974šŸ”(=)/CALL GIRLS SERVICE
Ā 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
Ā 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Ā 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Ā 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
Ā 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Ā 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Ā 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
Ā 
āž„šŸ” 7737669865 šŸ”ā–» malwa Call-girls in Women Seeking Men šŸ”malwašŸ” Escorts Ser...
āž„šŸ” 7737669865 šŸ”ā–» malwa Call-girls in Women Seeking Men  šŸ”malwašŸ”   Escorts Ser...āž„šŸ” 7737669865 šŸ”ā–» malwa Call-girls in Women Seeking Men  šŸ”malwašŸ”   Escorts Ser...
āž„šŸ” 7737669865 šŸ”ā–» malwa Call-girls in Women Seeking Men šŸ”malwašŸ” Escorts Ser...
Ā 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Ā 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Ā 
āž„šŸ” 7737669865 šŸ”ā–» Mathura Call-girls in Women Seeking Men šŸ”MathurašŸ” Escorts...
āž„šŸ” 7737669865 šŸ”ā–» Mathura Call-girls in Women Seeking Men  šŸ”MathurašŸ”   Escorts...āž„šŸ” 7737669865 šŸ”ā–» Mathura Call-girls in Women Seeking Men  šŸ”MathurašŸ”   Escorts...
āž„šŸ” 7737669865 šŸ”ā–» Mathura Call-girls in Women Seeking Men šŸ”MathurašŸ” Escorts...
Ā 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
Ā 
Call Girls In Nandini Layout ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Call Girls In Nandini Layout ā˜Ž 7737669865 šŸ„µ Book Your One night StandCall Girls In Nandini Layout ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Call Girls In Nandini Layout ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Ā 
Call Girls In Attibele ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Call Girls In Attibele ā˜Ž 7737669865 šŸ„µ Book Your One night StandCall Girls In Attibele ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Call Girls In Attibele ā˜Ž 7737669865 šŸ„µ Book Your One night Stand
Ā 
Call Girls Jalahalli Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service Ban...
Ā 
Call Girls Bommasandra Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service B...
Call Girls Bommasandra Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service B...Call Girls Bommasandra Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service B...
Call Girls Bommasandra Just Call šŸ‘— 7737669865 šŸ‘— Top Class Call Girl Service B...
Ā 

One Graduate Paper

  • 4. 4 well or better than LCA (and, hence, where cluster-analytic methods would be more efficacious) include homogeneity, in which observations within a cluster have similar scores on the predictor variables used in analysis3-5 and small overlap between groups. Additionally, when investigating the performance of finite mixture modeling, Steinley and Brusco6-7 found that k-means clustering performed as well as or better when aspects about the mixture model were unknown (specifically, both the complexity of the within-group covariance matrices and the number of clusters themselves). One of the inherent difficulties in cluster analysis is that the problem of dividing N observations into K clusters is NP-hard; that is, it is a computational complex problem where finding the optimal answer would require complete enumeration. Unfortunately, the number of possible solutions can be approximated by k^{n}/k!, creating a search space that is infeasible to exhaust. To combat this problem, several heuristic approaches have been developed, each designed to search the solution space in a different manner in hopes of obtaining a better solution. Indeed, in the past few years, many studies within the social and medical sciences have been employing various k-means algorithms as a method of data partitioning and classification8-15. However, with this recent proliferation
  • 5. 5 of new and modified k-means cluster algorithms, it is difficult to discern whether a given study has employed a k-means algorithm that is appropriate for the associated dataset8,16-19. Few studies have systematically compared different k-means clustering algorithms, and those studies have not examined noise or heterogeneous conditions during their simulations, leaving a gap in knowledge about applying k-means algorithms to collected datasets1,20-21. In addition, extant simulation studies comparing multiple k-means algorithms have not explored newer approaches, such as Bayesian methodology for k-means clustering. Therefore, the real-world data conditions under which a variety of algorithms perform well, and which of these conditions vary between or among algorithms, are largely unknown. Researchers hoping to employ one of these techniques in their data analysis need a guide to proper implementation and adaptation of the various k-means algorithms if they are to reach correct conclusions regarding the relationships that exist in their data. The present study builds on the Brusco and Steinley20 study in which several k-means algorithms were tested under several simulation parameters and then applied to several empirical datasets. In their study, Brusco and Steinley20 examined nine algorithms, ranging from the earliest
  • 6. 6 conceptions of the k-means algorithm to newer computational methods, such as simulated annealing and genetic algorithms. In simulations, the genetic algorithm and the k-means hybrid algorithms performed exceptionally well, and algorithms similar to these were chosen for further investigation in the current study to determine the extent to which these algorithms retain their high performance in settings with many noise variables (defined as variables in the data set that are not related to the cluster structure), or with heterogeneity among items entered into the cluster analysis. In addition, we included algorithms employing novel solutions that have not yet been tested in simulation comparison studies, as a way of determining how these new approaches perform relative to those that have been previously tested. In this paper, we examine six distinct variants of k-means clustering that are being employed in studies today (including genetic and expectation-maximization algorithms that improve optimization, a simple restart method, a chain restart method, a Bayesian-based method, and an unmodified k-means algorithm) through a simulation of small datasets with varying signal-to-noise, mixing fraction (relative size of clusters), cluster overlap, and cluster heterogeneity conditions. In doing so, we hope to
  • 7. 7 understand the relative strengths and weaknesses of each algorithm under conditions common in social science datasets. To understand how these algorithms might function on a real dataset, we simulated two datasets with larger sample sizes, high overlap among clusters, heterogeneity, and more variables to assess which algorithms seem to perform well under these complex problems. We then applied the most promising algorithms to a real dataset, the Multi-Site University Study of Identity and Culture (MUSIC)22-23, which is similar to the larger simulated datasets. We took this last step to examine the extent to which these algorithms show similar clustering solutions in a real dataset. This two-step approach provides information as to whether the various algorithms are detecting the same latent data partitions in both simulated and real data.
K-Means Algorithm Structure
The goal of k-means algorithms is minimizing the sum of squares error (SSE), which maximizes within-cluster homogeneity and between-cluster heterogeneity20. The SSE is calculated as follows:
\min_{\mu_1, \ldots, \mu_k} \sum_{h=1}^{k} \sum_{x \in X_h} \lVert x - \mu_h \rVert^2
where there are k clusters, each with mean μ_h, and x represents an observation assigned to the hth cluster.
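As a concrete illustration of this objective function, the short R sketch below computes the SSE for a given assignment of observations to clusters. It is only a sketch: the data matrix X and assignment vector z are illustrative placeholders, not objects from the study.

# Within-cluster sum-of-squares error (SSE) for a given partition (illustrative sketch)
cluster_sse <- function(X, z) {
  X <- as.matrix(X)
  sum(sapply(unique(z), function(h) {
    Xh <- X[z == h, , drop = FALSE]        # observations assigned to cluster h
    sum(sweep(Xh, 2, colMeans(Xh))^2)      # squared distances to the cluster mean
  }))
}

# Example: SSE of a random 4-cluster assignment of 200 observations on 5 variables
set.seed(1)
X <- matrix(rnorm(200 * 5), ncol = 5)
z <- sample(4, 200, replace = TRUE)
cluster_sse(X, z)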
  • 8. 8 The basic k-means algorithm proceeds as follows (where specific variations are described in their respective sections below). First, k initial seeds are chosen as centers of the clusters, either by randomly choosing observations within a dataset or by setting cluster centers a priori. Second, each observation is assigned to one of the k clusters according to the minimization of the SSE. Based on this new assignment, means are recomputed for each cluster and points are reassigned to clusters to minimize SSE. This process is repeated until no observations can be reassigned (or for a fixed number of iterations). As discussed in detail by Steinley1, this stopping criterion is only guaranteed to result in a locally optimal solution; in other words, there could exist another assignment of observations to clusters that resulted in a lower value for SSE.
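A minimal from-scratch sketch of this batch (Lloyd-style) procedure appears below. It is intended only to make the two alternating steps concrete; the function and variable names are our own and do not come from any of the packages used later in the paper.

# Minimal Lloyd-style k-means sketch (illustrative only; base R)
basic_kmeans <- function(X, k, max_iter = 90) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # random seeds drawn from the data
  assign_old <- rep(0, nrow(X))
  for (it in seq_len(max_iter)) {
    # Assignment step: place each observation with its nearest center (minimizes SSE)
    d2 <- sapply(seq_len(k), function(h) rowSums(sweep(X, 2, centers[h, ])^2))
    assign_new <- max.col(-d2)
    if (all(assign_new == assign_old)) break         # no reassignments: a local optimum
    assign_old <- assign_new
    # Update step: recompute each cluster mean after all points have been assigned
    for (h in seq_len(k)) {
      if (any(assign_new == h))
        centers[h, ] <- colMeans(X[assign_new == h, , drop = FALSE])
    }
  }
  list(cluster = assign_new, centers = centers)
}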
  • 9. 9 Modifications of K-Means Algorithms and Overview of Algorithms Tested in This Study
Lloyd-Forgy and MacQueen Algorithms without Modification
Two basic k-means algorithms exist that do not involve modification to the fundamental k-means algorithm's steps and calculation procedure. The Lloyd-Forgy algorithm randomly selects k points from the dataset as starting centers for k clusters and proceeds through the above steps, waiting to recalculate clusters until all points in the dataset have been assigned24. In contrast, the MacQueen algorithm proceeds through the above steps with recalculation of clusters each time a point is moved, rather than recalculating after all points have been assigned25.
Need for Improvement of Basic Algorithms
One of the largest hurdles in k-means clustering is the hill-climbing nature of the algorithm in its search for a solution. The basic k-means search to minimize error and imprecision does not escape local optima well and often fails to find the global optimum16,19. This situation is akin to a mountain climber getting stuck on a lower peak rather than making it all the way to the mountain's summit. Unfortunately, the analyst cannot see the higher peak of the function as the mountain climber can see the summit when at the (lower) peak of the mountain. Several modifications of the k-means clustering algorithm have been
  • 10. 10 developed to deal specifically with this problem, including use of an expectation/maximization algorithm, a genetic algorithm, two restart methods to find optimal starting points, and calculation of posteriors through Bayesian methods16,19,17,8. Genetic Algorithm Approach Rather than relying on a hill-climbing algorithm, one solution to this problem employs a genetic algorithm to optimize clustering partitions16. Genetic algorithms are computing tools based on the principles of evolutionary biology and population genetics, such that they efficiently search a given space of possible solutions to a problem, including variable selection and optimal partitioning26. These algorithms thrive in situations in which other enumerative and machine-learning algorithms stall or fail to converge upon global solutions, and genetic algorithms have been successfully adapted for problems in statistical physics27, quantum chromodynamics28, aerospace engineering29, molecular chemistry30, spline-fitting with function estimation31, and statistics16,32-33. Genetic algorithms start with a population of different, randomly-generated possible solutions. Each solution is described by the set of cluster memberships and
  • 11. 11 is placed into a vector format referred to as chromosomes. The algorithm evaluates each individual solution based on the SSE, ranking them from smallest to largest16,26. The algorithm then enters a mutation step in which each solution undergoes further evolution/change. The amount of mutation depends on a given individual solution's ranking on SSE. Solutions that have smaller SSE undergo less mutation because they may be near the optimum; those with higher SSE undergo more mutation because they are further away from the optimum. The number of solutions remains constant in size, and these steps are repeated to create a new set of solutions until a convergence criterion is met, usually a pre-specified number of sets of solutions (often referred to as generations in the genetic algorithm literature). At the designated stopping point, the individual solution with the lowest SSE is chosen as the best solution and other individual solutions are discarded. One of the advantages of the genetic algorithm approach is that this algorithm is computationally efficient and appears to find better solutions than does the unmodified k-means algorithm16. In addition, by utilizing a population of solutions (e.g., the number of solutions to be evaluated by the genetic algorithm), it is unlikely that all solutions in a population will converge
  • 12. 12 on a local optimum at each step because individuals vary at the initialization of the population, greatly increasing the chances of convergence to a global optimum within a large search space34-35. In addition to escaping local optima, this genetic algorithm variation also allows k-means clustering to accommodate functions other than straightforward SSE measures, including functions that may be nonlinear or multimodal16. This algorithm seems to find better solutions (nearly error-free in Krishna and Murty's test data) and in smaller datasets it is also computationally efficient16. We note that there are numerous manners in which genetic algorithms can be operationalized. For instance, Brusco and Steinley20 used a genetic algorithm that was based on a hybridized k-means algorithm; however, such an approach is unlikely to be as widely utilized by applied researchers because it is not available in standard software packages. As such, we use the Krishna and Murty16 algorithm available in an R package.
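To make the mutation-and-selection idea concrete, the toy sketch below evolves a small population of candidate partitions, mutating poorer-ranked solutions more aggressively and keeping the lowest-SSE partition at the end. It is an illustration of the general scheme only, not the Krishna and Murty implementation nor the package routine used in this study.

# Toy genetic search over k-means partitions (illustrative sketch)
cluster_sse <- function(X, z) {                       # total within-cluster SSE (helper repeated for completeness)
  sum(sapply(unique(z), function(h) {
    Xh <- X[z == h, , drop = FALSE]
    sum(sweep(Xh, 2, colMeans(Xh))^2)
  }))
}

ga_kmeans <- function(X, k, pop_size = 20, generations = 140, base_rate = 0.2) {
  X <- as.matrix(X); n <- nrow(X)
  pop <- replicate(pop_size, sample(k, n, replace = TRUE), simplify = FALSE)  # random initial "chromosomes"
  for (g in seq_len(generations)) {
    fit <- sapply(pop, function(z) cluster_sse(X, z))  # fitness = total within-cluster SSE
    ord <- order(fit)                                  # rank solutions, best (smallest SSE) first
    pop <- lapply(seq_along(ord), function(i) {
      z <- pop[[ord[i]]]
      rate <- base_rate * i / pop_size                 # better-ranked solutions mutate less
      flip <- runif(n) < rate
      z[flip] <- sample(k, sum(flip), replace = TRUE)  # mutation: reassign a few points at random
      z
    })
  }
  fit <- sapply(pop, function(z) cluster_sse(X, z))
  pop[[which.min(fit)]]                                # return the best partition found
}

# Example (toy data): z_best <- ga_kmeans(matrix(rnorm(400), ncol = 2), k = 4)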
  • 13. 13 Restart Modification
Another possible solution to the local optimum problem, without resorting to a potentially computationally-intensive genetic algorithm, is to search for better initial clustering points (rather than a random or nearest-point start to the algorithm) in the hopes that starting conditions will be closer to the global optimum than to local optima. The simple restart methods are the most straightforward modification. These methods involve n runs of the algorithm, each with a different starting point chosen at random17. Like the genetic algorithm, this method allows for many iterations of the k-means algorithm; however, this approach does not optimize the starting cluster points as in a genetic algorithm and does not search the solution space to obtain its starting point variations. Nevertheless, the restart algorithm is less computationally intensive, is available as a setting in several software packages, and has served as the starting point for several newer algorithms17,19. With enough starting points chosen, it is possible that this algorithm will find an initial clustering close enough to a global optimum to converge on the best clustering solution. Additionally, Steinley and Brusco20 found that numerous random initializations compared favorably to other methods for initializing k-means clustering algorithms. Brusco and Steinley36 indicated that a version of k-means clustering (one that combined the method described by Lloyd24 with MacQueen25) that relied on
  • 14. 14 multiple restarts compared favorably with their genetic algorithm implementation.
Chain Variation Restart Method
An extension of the restart method, when applied to a base k-means clustering algorithm, achieves starting point initialization using a chain variation method17. This variation involves moving one point to another cluster and running the k-means algorithm to evaluate whether or not this move provides a better k-means clustering solution (smaller calculated SSE), and the chain of variations consists of a sequence of these moves (called a Kernighan-Lin chain) evaluated at each step through running the k-means algorithm17. In this way, the chain-variation restart method is somewhat similar to the genetic algorithm's mutation approach in that a series of initial clusterings is iteratively changed to identify a better solution by starting from an existing solution. However, the restart method does not seek to identify a population of possible solutions; rather, only one individual solution, chosen at random, undergoes this chain procedure. Experimental results suggest that this is an efficacious adaptation that does allow the algorithm to escape local optima. However, with the largest datasets in the initial development of
  • 15. 15 this method, the algorithm typically requires a chain of 30 or more moves to reach an initial cluster, yielding a different solution from the original k-means clustering algorithm17. It is unknown how this algorithm performs as datasets increase further in size.
Expectation-Maximization Algorithm
The Expectation-Maximization (EM) algorithm is a two-step iterative algorithm commonly used to estimate model parameters when there are missing or unobserved data37. This method builds on maximum likelihood, which assumes a distribution where the data meet the assumptions of a parametric model. Each iteration of this algorithm involves two steps: (1) the Expectation step, given an initial guess for the parameters of the data model, uses this model to form the expectation of the missing data conditional on the model, thereby creating an estimate of the missing or unobserved data; and (2) the Maximization step, given the estimates of the missing data, evaluates the likelihood function and generates a new estimate of the parameters for the data model. This is repeated for many iterations until a convergence criterion is met, usually involving the difference score between iteration estimates getting very
  • 16. 16 close to zero37. The EM algorithm fits naturally into the clustering problem because cluster membership is not observed in the dataset. An EM-based initialization procedure was proposed by Bradley and Fayyad17. Similar to the chain-variation method, this algorithm runs through the k-means algorithm for each variation of starting points (in this case, based upon bootstrap samples) and chooses the best initial clustering as a starting point for the full k-means algorithm17. This algorithm allows the user to specify EM-algorithm parameters for a given problem. The EM algorithm has been shown to find better solutions than k-means algorithms without refined starting points, but it shows poor performance under some conditions20. Although this EM algorithm variation is one of the more widely tested algorithms, it has not been tested against the newer algorithms, including the genetic algorithm method and the Bayesian method.
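For readers who want to see what an EM-based clustering run looks like in practice, the sketch below uses the mclust package with its default settings (the implementation the Methods section describes); the toy data matrix is illustrative rather than data from the study.

# Hypothetical illustration: EM-based, model-based clustering with mclust defaults
library(mclust)

set.seed(2014)
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2))   # toy two-group data

fit <- Mclust(X)               # default settings: model form and number of groups chosen by BIC
table(fit$classification)      # cluster sizes implied by the EM solution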
  • 17. 17 Bayesian Approach
Another approach to k-means clustering is to use Bayesian methods to approximate the posterior probabilities of class membership given variables of interest, and then cluster based upon this inferred probability density function (PDF) shape (basically chopping the PDF into clusters at natural cut-points). The algorithm proposed by Fuentes and Casella8 and Gopal et al.38 accomplishes this through the Metropolis-Hastings algorithm, which is a type of Markov chain Monte Carlo sampling method. Stated more simply, the Bayesian approach uses information from the observed data to infer the PDF from which the data likely came without knowing the PDF a priori. This is accomplished through repeated sampling, with replacement, of the observed data until a sufficient number of data points, typically many thousands or millions, have been drawn to construct the distribution. In this way, enough information is gathered about the distribution to compute marginal and joint probabilities directly from the repeated sampling distribution, called the posterior distribution, or to derive other statistical information about the joint distribution. In addition to clustering, a byproduct of the Bayesian approach is the computation of the Bayes factor, which tests the hypothesis that the number of existing clusters is k against the null, which states that there is only 1 cluster in the data. This is done to ensure that the correct number of clusters is determined prior to placing cases into clusters:
  • 18. 18 BF_{10} = \frac{m(Y \mid n = k)}{m(Y \mid n = 1)}
This hypothesis test addresses a major limitation of k-means clustering, namely specifying the number of clusters39-40. The function m(Y | n = k) denotes the distribution of the data given that there are n = k clusters. This Bayesian procedure has shown promise in the few studies employing it; however, this algorithm has a long run-time and has not been tested on many data structures8.
The Present Study
The six algorithms selected for use in this study were: (1) Genetic Algorithm-Based Hybrid, (2) Unmodified Lloyd-Forgy, (3) Chain Variation of the Lloyd-Forgy, (4) Bayesian-Based, (5) EM-Based, and (6) MacQueen Restart with 10 initial point restarts. These represent a range of approaches to the k-means computational problem and include some of the more-promising algorithms from Brusco and Steinley20. In addition, given the scarcity of simulation studies employing some of these algorithms, the algorithms selected for the present study have been largely untested against one another. Our goal was to identify the algorithm that provided the best fit to the simulated data, as determined by the Hubert-Arabie Adjusted Rand Index
  • 19. 19 (ARIHA), and then to investigate whether or not a subset of these algorithms finds similar clusters within a large empirical dataset (MUSIC). The correct method for determining the number of clusters extant in an empirical dataset is debated1,39-41, and this remains an open question in the field. Two previous simulation studies of k-means clustering methods did not find a significant difference across cluster numbers36,41, suggesting that this factor is less important in simulation studies comparing k-means algorithms than other factors, including cluster overlap, signal-to-noise ratio, and cluster density36,41. In light of these previous studies, k was chosen to be set at 4, and simulation parameters focused on those factors shown to impact algorithm performance more noticeably36,41, as the number of clusters problem is beyond the scope of this investigation and merits its own simulation study of existing cluster number determination methodology39. Methods Two approaches to cluster study simulations exist: (1) many conditions, with few replicates and (2) fewer
  • 20. 20 conditions, with many replicates. Steinley's applied simulation studies tend to focus on many conditions20,41-43, whereas studies developing a new method have focused on many replicates to empirically demonstrate and test algorithm performance16-17,44. In terms of testing the robustness of a clustering algorithm to variations in data parameters, several parameter variations have been suggested17,20,41, including number of signal variables (or signal-to-noise ratio, with signal providing relevant information and noise adding no important information), probability of overlap among clusters, degree of homogeneity/heterogeneity within clusters, sample size, and cluster density (in which classes do not have equal cluster size, e.g., three clusters each consisting of 30% of cases and a fourth cluster consisting of the remaining 10% of cases). At N = 200, the impact of sample size and number of clusters is negligible20. Because performance on simulated datasets such as those used in our study is largely unknown in the literature (particularly for signal-to-noise ratio variations), our study will use the few-conditions, many-replicates technique so as to maximize power, running each loop 30 times and averaging responses across iterations.
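To make the few-conditions, many-replicates design concrete, the sketch below runs one illustrative cell of the design (described next) for a single algorithm, MacQueen with 10 restarts, over 30 replicates, using the MixSim and base kmeans tools named in the Methods and the adjusted Rand index from mclust. The specific parameter values shown are placeholders for one cell, not the full design.

# Sketch of one simulation cell: 30 replicates, ARI averaged over replicates
library(MixSim)
library(mclust)   # for adjustedRandIndex()

set.seed(699)
ari <- numeric(30)
for (r in 1:30) {
  Q   <- MixSim(BarOmega = 0.2, K = 4, p = 5)            # average overlap 0.2, 4 clusters, 5 signal variables
  sim <- simdataset(n = 200, Pi = Q$Pi, Mu = Q$Mu, S = Q$S,
                    n.noise = 10)                        # add 10 noise variables
  fit <- kmeans(sim$X, centers = 4, nstart = 10, iter.max = 90,
                algorithm = "MacQueen")                  # MacQueen with 10 restarts
  ari[r] <- adjustedRandIndex(fit$cluster, sim$id)       # agreement with the true classes
}
mean(ari)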
  • 21. 21 Based on these previous indications, the following parameters/variations were chosen to generate the datasets through the R package MixSim42:
(1) N = 200
(2) K = 4
(3) Signal Variables = 2, 5, 10
(4) Noise Variables = 10 (giving 3 signal-to-noise ratios when combined with (3))
(5) Cluster Homogeneity = yes or no
(6) Mixing Fraction = 0.25, 0.5, or 1 (i.e., one small, all equal, or one large group)
(7) Average Overlap among Clusters = 0.1, 0.2, 0.4
These variations, coupled with our use of six different algorithms (genetic algorithm based, Lloyd-Forgy chain variation, EM-based, Lloyd-Forgy unmodified, MacQueen with 10 restarts, and Bayesian), yield 54 unique small datasets. Previous studies have suggested combining across conditions to discern the effects of each parameter41. Therefore, after 30 separate datasets for each parameter combination were simulated and analyzed by the six algorithms of interest and the diagnostic criteria averaged
  • 22. 22 over all 30 replicates, each of the 11 parameter variations was examined across all other conditions. To directly compare algorithm classification with true classification, the Hubert-Arabie Adjusted Rand Index (ARIHA) was chosen, as it has been shown to perform well in comparing cluster solutions45. For this index, values of 0 correspond to chance levels of agreement, and values of 1 indicate a perfect classification match between the algorithm-generated and true class solutions. It has been shown that the ARIHA is robust across cluster densities (mixing fractions), number of clusters, and sample size39,45. Therefore, the ARIHA was chosen as our measure of algorithm performance across conditions. In addition, in the present study we also aimed to test the performance of these algorithms on a large real-world simulated dataset. Therefore, two final simulations were conducted with N=2500, variables=15 (once with 12 signal, 3 noise and a second time with 5 signal, 10 noise), average overlap of 0.2, and mixing fraction of 0.1, to identify which algorithms perform well on a real dataset. The three best performing, most disparate algorithms were selected to be re-estimated on a set of risk behavior variables from the MUSIC dataset. MUSIC consists of 9,750 college-student participants who reported
  • 23. 23 on their engagement in a number of risky behaviors during the 30 days prior to assessment. In our simulations, the number of clusters was fixed; however, in the MUSIC dataset the optimal number of clusters is unknown. The NbClust package in R provides 20 different, non-graphical algorithms for determining the appropriate number of clusters in a dataset, including Ratkowsky, Scott, Marriott, TraceCovW, and TraceW. Therefore, to assess which of these procedures are most appropriate for the MUSIC dataset, we assessed the performance of each of these procedures with 1000 simulated datasets according to the aforementioned 12 signal, 3 noise variable scenarios. Results were tallied for each algorithm according to (a) the correct number of clusters, k; (b) k-1 clusters; and (c) k+1 clusters. The best performing algorithms were then used to obtain an estimate of the likely number of clusters within the MUSIC data. The two most likely numbers of clusters according to these algorithms were then used in runs of the three algorithms chosen for case study analysis. Readers interested in methods used to determine optimal cluster number within a dataset are referred to Tibshirani et al.46, Milligan and Cooper39; and Dimitriadou et al.47. The best performing
  • 24. 24 algorithms were then used on the MUSIC dataset with 10 replicates. Within a 30 replicate loop per condition, all simulated datasets were created using the R package MixSim42 and then used to evaluate each of the six algorithms. The MacQueen restart algorithm was implemented through the use of the kmeans function contained in the R base package with 10 restarts (nstart=10) and a maximum of 90 iterations of the k-means algorithm (iter.max=90). The genetic algorithm variation was implemented using the R package skmeans48 with parameters calibrated to 140 generations, a mutation rate of 0.2 per generation, and a population size of 20. The Lloyd-Forgy algorithm with no restarts or chain variations was also implemented with the skmeans package and 90 iterations, as this was the number of iterations chosen for the other algorithms. The chain-restart method was implemented using the skmeans package using 90 iterations and chains of 50 moves each. The EM algorithm was implemented using the default parameter settings in the mclust R package, which searches through possible cluster shapes according to the BIC criterion and does not use a prior on the distribution. Finally, the Bayesian k-means variation was conducted using the Bayesclust R package38 and default parameter settings (Metropolis search algorithm
  • 25. 25 with 100,000 iterations, sigma hyperparameters of 2.01 and 0.99, alpha hyperparameter of 1, and minimum cluster size of 10%). Real-World Dataset MUSIC is a large dataset in which a number of psychological and health-related surveys were administered to 9,750 undergraduate students at 30 U.S. colleges and universities. Among the surveys administered was a series of questions asking participants how many times they had engaged in a number of substance-use (marijuana, hard drugs, inhalants, injection drugs, and misuse of prescription drugs), sexual (oral sex, anal sex, sex with strangers or brief acquaintances, unprotected sex, and sex while drunk or high), and risky driving (drunk/drugged driving and riding with a drunk/drugged driver) behaviors during the 30 days prior to assessment. Possible response choices for each of these behaviors were 0 (none), 1 (once/twice), 2 (3-5 times), 3 (6-10 times), or 4 (more than 10 times). There is reason to believe that these behaviors will be highly correlated with one another and will form clusters. Problem behavior theory49 suggests that young people who engage in one risky behavior are more apt to
  • 26. 26 engage in other risky behaviors; and epidemiologic data indicate that the adolescent and emerging adult years (roughly ages 15-24) are characterized by the highest rates of illicit drug use, sexual risk behavior, and impaired driving50-51. At the same time, not all college-aged individuals engage in high levels of risky behavior, and some forms of risky behavior are more common than others. For example, many college students engage in casual sexual relationships52, but far fewer students use inhalants or injection drugs. One would therefore expect multiple clusters of students, some of which are characterized by no or mild engagement in risky behavior and others of which are characterized by more severe engagement in many or all of the risk behaviors. In using the MUSIC data, we employed the clustering techniques on the 13 illicit drug use, sexual risk taking, and impaired driving variables listed here: drunk driving, driving in a car driven by a drunk driver, engagement in casual sex, number of partners, engaging in sex without condoms, oral sex, anal sex, sex while drunk, smoking marijuana, using hard drugs (e.g., cocaine, heroin), inhalant use, injecting drugs, and nonmedical use of prescription medications.
Results
  • 27. 27 Simulations under Parameter Variations
The various algorithms evidenced different performance under different conditions of the simulation (see Table 1). In all, the MacQueen algorithm with 10 restarts outperformed the other five algorithms handily under every condition, suggesting that this algorithm is a strong, general clustering technique. However, despite outperforming all other algorithms, the MacQueen algorithm with 10 restarts showed a strong decrease in performance as additional signal variables were added to the dataset, suggesting that it may struggle with larger predictor sets. All six algorithms struggled when the mean overlap of groups increased to 0.4, indicating that strongly overlapping clusters are difficult to recover. This finding is consistent with the results reported by Steinley53,54. For conditions in which there is likely to be large group overlap, one may want to consider LCA or fuzzy k-means algorithms, as these deal with probabilistic classification rather than hard classification2. Under most conditions, the genetic algorithm-based, unmodified Lloyd-Forgy, and chain variation Lloyd-Forgy methods produced very similar results to each other, with
  • 28. 28 the genetic algorithm approach performing slightly better. In addition, the chain variation Lloyd-Forgy method seemed to perform better as the mixing fraction decreased, suggesting that it performs well with unbalanced group sizes. The Bayesian algorithm performed maximally well with increased numbers of salient predictors, low average overlap, and smaller mixing fractions.
Simulations to Mimic Real-World Conditions
In the "real-world conditions" simulation, all six algorithms performed better as the number of relevant predictors increased (Table 2), with the MacQueen algorithm with 10 restarts and EM algorithm scoring over 0.5 on the ARIHA for both conditions. The other four algorithms scored approximately 0.3 for the condition with 5 signal and 10 noise variables; with 12 signal variables, all four algorithms approached 0.4. This pattern indicates a fair amount of agreement for the four lower-performing algorithms and a moderate to substantial agreement for the MacQueen algorithm with 10 restarts and EM algorithms. When examining computing time for the six algorithms on a standard laptop (Table 3), some of these algorithms required a large amount of computing resources that may be less than ideal for large datasets. The genetic algorithm
  • 29. 29 and Bayesian algorithm both required quite a bit of computing resources, though neither were unreasonable for the datasets generated to mimic a large real-world dataset. However, in very large datasets, computing resources may become an issue, and alternative algorithms may yield results more quickly without sacrificing quality. The EM algorithm required more computing resources than three other algorithms but quite a bit less than the two most demanding ones (genetic and Bayesian), suggesting that it would be reasonable (though not efficient) for larger datasets. In all, running 30 iterations of the loop which generated and analyzed one real-world condition for each of the six algorithms required about 60 hours on a laptop containing 8 GB of RAM and 2.5 GHz processing speed. MUSIC Dataset Because the EM method and MacQueen algorithm with 10 restarts gave the best performance on the simulated data, both of these methods were used to analyze the MUSIC dataset. We also used the Bayesian algorithm because this approach is novel among clustering algorithms and has not been applied to many real-life datasets. The most promising algorithms for finding the correct number of clusters on simulated datasets similar to the MUSIC data were the Ratkowsky, Scott, Marriott, TraceCovW,
30
and TraceW indices (summary of methodology and results available upon request). These were used to determine the optimum number of clusters within the MUSIC data. The optimum number of clusters detected by each index averaged between 3 and 4 (Table 4); therefore, the three k-means analyses of the MUSIC dataset were conducted with k=3 and k=4 clusters. Each groupā€™s mean response across the 13 risk behavior questions was examined to characterize the cluster solution. After obtaining the results for both k=3 and k=4, it was determined that the k=3 solutions did not make as much sense clinically as the k=4 solutions. Therefore, we further explored the k=4 solutions to characterize and compare cluster solutions among the three algorithms. As summarized in Table 5, the MacQueen algorithm with 10 restarts yielded the following four groups: (1) some sexual risk and low substance abuse (n = 1523), (2) very high risk in all areas (n = 11), (3) fairly high sexual and substance abuse risk (n = 713), and (4) low risk in all areas (n = 7704). The second group was extremely small and reported an average of 69 sexual partners in the month prior to assessment; it is possible that this algorithm detected students who lied about (or provided extremely unlikely reports of) their sexual behavior. Such a conclusion would
31
suggest that this algorithm groups outliers together in their own cluster, rather than adding them to the next-closest cluster, which would skew the clusters to which these extreme observations were added. This is desirable for an algorithm used on an empirical dataset, as it can find distinct subgroups within the sample, including small groups of individuals with extreme scores (such as extremely high or low scorers). Care should be taken not to over-interpret small groups of extreme scores unless they are replicated in other samples. The EM algorithm found the following four groups: (1) moderate sexual and substance abuse risk (n = 277), (2) high oral sex but very low drug use (n = 7925), (3) risk from marijuana smoking and alcohol-related behaviors (n = 1581), and (4) high risk in all areas (n = 168). This seems reasonable given previous research on college risk behaviors50-52. Cross-tabulating the classes obtained with the EM algorithm against those obtained with the MacQueen algorithm with 10 restarts yielded an ARIHA of .15, suggesting little overlap between these two algorithms and a high likelihood of erroneous classification based on the simulation results. The Bayesian algorithm found the following groups: (1) low risk (n = 3969), (2) risk from impaired driving,
32
marijuana use, and sex (n=1991), (3) moderate risk in all areas (n=1990), and (4) presumably monogamous (i.e., low risks other than unprotected sex; n=2001). Considering the very low overlap with the classes extracted using the MacQueen algorithm with 10 restarts (ARIHA = .18) and with the EM algorithmā€™s solution (ARIHA = .06), it is likely that this algorithm is also not finding accurate data partitions. Considered together with the simulation results, it appears that the different algorithms are likely to produce different results with little overlap among methods. Given the poor performance of the Bayesian and EM algorithms on the simulated data, it is likely that these algorithms produced untrustworthy results with high misclassification rates. However, both algorithms did find clusters within the MUSIC dataset and produced interpretable results, suggesting that they need to be applied with caution, as they may yield interpretable results based on faulty clustering of the data. These results also support the speculation of Brusco and Steinley20, who suggested that their genetic algorithm-based k-means method performed well because it was based on the HK-means+ algorithm, which in turn draws heavily on the MacQueen algorithm.
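The agreement statistic used throughout this section can be computed directly in R. The short sketch below is a minimal illustration rather than the code used in this study: it assumes the MixSim package (reference 42) is installed and uses two hypothetical label vectors in place of the actual cluster assignments; mclust::adjustedRandIndex() would return the same quantity.

# Hubert-Arabie adjusted Rand index (ARIHA) between two hard partitions of the same cases
library(MixSim)

# Hypothetical cluster labels for the same 20 observations from two algorithms
labels_macqueen <- c(1, 1, 2, 2, 3, 3, 4, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 1, 2, 2)
labels_em       <- c(1, 1, 1, 2, 3, 3, 4, 2, 1, 2, 3, 4, 2, 2, 3, 4, 1, 3, 2, 2)

# RandIndex() returns the Rand (R), adjusted Rand (AR), Fowlkes-Mallows (F),
# and Mirkin (M) indices; the AR component is the ARIHA reported above
RandIndex(labels_macqueen, labels_em)$AR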
33
Discussion
We conducted the present study to compare six commonly used k-means algorithms in terms of their ability to correctly cluster optimally simulated data, data simulated to mimic a real-world dataset, and a collected dataset consisting of risk behavior reports from nearly 10,000 college students. Although many studies have compared two or three clustering algorithms to one another, our study is among the first to compare six different algorithms under various conditions (e.g., different sample sizes, varying degrees of overlap between clusters, varying degrees of random error). Our results therefore have the potential to provide greater clarity regarding which algorithms do, and do not, perform well given specific properties of a dataset. Our results also indicate that large differences exist among these six k-means algorithms, both in the quality of their performance and in the content of the clusters they produce. These results underscore the need for informed application of these clustering algorithms to real-world datasets, as results may vary drastically depending upon which algorithm is chosen
34
for the situation. Within the small-scale simulation portion of this study, low signal-to-noise ratios (large numbers of uninformative variables) presented a significant challenge to the majority of algorithms. The MacQueen algorithm with 10 restarts and the EM algorithm performed most favorably under low signal-to-noise ratios, though the MacQueen algorithm was the only one that performed strongly across conditions. Nonetheless, our results suggest that large or complex datasets may present challenges to many of the k-means algorithms, as even the MacQueen algorithm showed declining performance as the number of variables increased and as the overlap increased. In addition, the performance of the other five algorithms deteriorated as the average group overlap increased beyond 0.2. This finding suggests that correct classification is most likely when groups are well-separated, and that the MacQueen algorithm with 10 restarts may be among the most effective options when cluster overlap is higher. When examining two simulated datasets with several variables, heterogeneity, moderate overlap, and a low mixing fraction, the EM and MacQueen with 10 restarts algorithms performed much more strongly than the other four algorithms without requiring an unreasonable amount of computing
35
power. The Bayesian and genetic algorithm methods, though performing comparably to the remaining algorithms (chain-restart and non-restart Lloyd-Forgy), required much more time to run and more computing resources. Given the trend toward analyzing datasets with extremely large numbers of cases and variables (i.e., big data), the computing resources an algorithm requires may need to be taken into account. On the MUSIC dataset, the MacQueen algorithm with 10 restarts identified a small group of exceptionally high scores, suggesting that it is sensitive enough to detect possible outliers and may be of use in flagging suspect observations. This may be useful for researchers who have large collections of observations and aim to examine extreme events more closely. The other two algorithms identified cluster solutions consistent with previously reported rates of risk behaviors among college populations55; however, their overlap with the MacQueen solution was troublingly low, suggesting that these algorithms may have found very inaccurate solutions. Given their poor performance in the simulations, caution should be taken when applying any of these more complex algorithms to real datasets, as they are unlikely to provide as accurate a solution as the MacQueen restart method.
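For readers who wish to apply the approach recommended above, the following minimal sketch shows how the MacQueen algorithm with 10 random restarts can be run using base R's kmeans() function; the data matrix here is a placeholder standing in for a set of standardized clustering variables, not the MUSIC data.

# Recommended simple-restart approach: MacQueen updates with 10 random starts;
# kmeans() keeps the run with the lowest total within-cluster sum of squares
set.seed(2014)
risk_items <- matrix(rnorm(500 * 13), nrow = 500, ncol = 13)  # placeholder for 13 standardized items

fit <- kmeans(risk_items, centers = 4, nstart = 10, iter.max = 100,
              algorithm = "MacQueen")

table(fit$cluster)     # cluster sizes
round(fit$centers, 2)  # cluster means, used to characterize each group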
36
Our results suggest that different k-means computational methods may partition data in different ways, such that the groups produced do not converge well across algorithms, and that, given current implementations, the more complex algorithms should be used with caution or set aside in favor of the MacQueen restart method. It is possible that each algorithm ā€œseesā€ the data structure from a different point of view, analogous to researchers from different fields approaching a multidisciplinary problem and solving it with the tools of their own discipline; however, it is also likely that some of these algorithms are simply unable to find valid clustering solutions. This finding has important implications for researchers applying these methods to real-life data, as not all algorithms are created equal. Based on the results of Brusco and Steinley20, it is possible that the computational methods upon which some of these more complex algorithms are based may yield k-means solutions that perform well on real data; however, the versions of these algorithms provided in some software packages (e.g., R) do not perform well.
Strengths and Limitations
This study has several strengths and limitations. Using simulated datasets with known ā€œtrueā€ classifications
37
allowed us to compare algorithms under controlled conditions. However, these simulations involved smaller datasets than might be encountered in large empirical studies. Even the MUSIC dataset, which provides a sample of almost 10,000 participants, included only 13 clustering variables. As such, the results of this study should be interpreted with caution by researchers hoping to apply these algorithms to studies with large numbers of variables or in which the number of parameters may exceed the number of observations. For example, many genomics studies include millions of genetic markers measured on a comparatively small number of participants. Further work should examine these algorithms in such settings to determine which k-means algorithms perform well under those conditions. In sum, this study provides a roadmap for the application of several k-means algorithms to datasets involving several predictors with varying degrees of signal and noise. For most of the algorithms tested in this study, the roadmap is quite simple. Researchers hoping to use the k-means methodology would do well to use the basic k-means algorithms, particularly with a simple restart step added to the algorithm, rather than some of the newer, more complex algorithms. We hope
38
that our results have contributed to the literature comparing various approaches to k-means cluster analysis and provide a more accurate, if shorter, roadmap for researchers hoping to employ this methodology.
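For researchers who want to rehearse this roadmap on data resembling their own design before committing to an algorithm, the sketch below mirrors the logic of the simulations reported here: it generates clusters with a specified average overlap and unbalanced mixing proportions using the MixSim package (reference 42), appends noise variables, clusters the result with the restarted MacQueen algorithm, and scores recovery of the known labels with the adjusted Rand index. The parameter values shown are illustrative and are not the exact settings used in this study.

# Minimal recovery check, not the study's code; assumes the MixSim package
library(MixSim)

set.seed(699)
design <- MixSim(BarOmega = 0.2,  # average pairwise overlap between clusters
                 K = 4,           # number of clusters
                 p = 5,           # number of salient (signal) variables
                 PiLow = 0.10)    # allow unbalanced mixing proportions

sim <- simdataset(n = 1000, Pi = design$Pi, Mu = design$Mu, S = design$S,
                  n.noise = 10)   # append 10 uninformative noise variables

fit <- kmeans(scale(sim$X), centers = 4, nstart = 10, algorithm = "MacQueen")

RandIndex(sim$id, fit$cluster)$AR  # adjusted Rand agreement with the true partition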
39
Acknowledgements: I would like to thank my coauthors Dr. Daniel Feaster, Dr. Seth Schwartz, and Dr. Douglas Steinley for their contributions to this study, as well as the University of Miami for the courses that allowed me to complete a study like this.
40
References
1) Steinley D. K-means clustering: a half-century synthesis. Br J Math Stat Psychol. 2006; 59(Pt 1):1-34.
2) DiStefano C, Kamphaus RW. Investigating Subtypes of Child Development: A Comparison of Cluster Analysis and Latent Class Cluster Analysis in Typology Creation. Educ Psychol Meas. 2006; 66(5):778-794.
3) Eshghi A, Haughton D, Legrand P, Skaletsky M, Woolford S. Identifying groups: a comparison of methodologies. J Data Sci. 2011; 9:271-91.
4) Scott AJ, Knott M. A cluster analysis method for grouping means in the analysis of variance. Biometrics. 1974; 507-512.
5) Edwards AWF, Cavalli-Sforza LL. A method for cluster analysis. Biometrics. 1965; 362-375.
6) Steinley D, Brusco MJ. Evaluating the performance of model-based clustering: Recommendations and cautions. Psychol. Methods. 2011; 16, 63-79.
7) Steinley D, Brusco MJ. K-means clustering and model-based clustering: Reply to McLachlan and Vermunt. Psychol. Methods. 2011; 16, 89-92.
8) Fuentes C, Casella G. Testing for the existence of clusters. Sort (Barc). 2009; 33(2):115-157.
9) Martis RJ, Chakraborty C. Arrhythmia disease diagnosis using neural network, SVM, and genetic algorithm-optimized k-means clustering. J. Mech. Med. Biol. 2011; 11(04): 897-915.
10) Rovniak LS, Sallis JF, Saelens BE, et al. Adults' physical activity patterns across life domains: cluster analysis with replication. Health Psychology. 2010; 29(5): 496.
11) Newby PK, Muller D, Hallfrisch J, Qiao N, Andres R, Tucker KL. Dietary patterns and changes in body mass index and waist circumference in adults. Am. J. Clin. Nutr. 2003; 77(6): 1417-1425.
12) Ahn Y, Park YJ, Park SJ, et al. Dietary patterns and prevalence odds ratio in middle-aged adults of rural and mid-size city in Korean Genome Epidemiology Study. Korean Journal of Nutrition. 2007; 40(3): 259-269.
13) Bauer AK, Rondini EA, Hummel, et al. Identification of candidate genes downstream of TLR4 signaling after ozone exposure in mice: a role for heat-shock protein 70. Environmental Health Perspectives. 2010; 119(8): 1091.
14) O'Quin KE, Hofmann CM, Hofmann HA, Carleton KL. Parallel evolution of opsin gene expression in African cichlid fishes. Mol. Biol. Evol. 2010; 27(12): 2839-2854.
15) Tukiendorf A, Kaźmierski R, Michalak S. The Taxonomy Statistic Uncovers Novel Clinical Patterns in a Population of Ischemic Stroke Patients. PloS one. 2013; 8(7): e69816.
16) Krishna K, Murty MN. Genetic K-means algorithm. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on. 1999; 29(3):433-439.
17) Dhillon IS, Guan Y, Kogan J. Iterative clustering of high dimensional text data augmented by local search. Data Mining. 2002; 131-138.
18) Cannon RL, Dave JV, Bezdek JC. Efficient implementation of the fuzzy c-means clustering algorithms. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 1986; (2): 248-255.
19) Bradley PS, Fayyad U. Refining Initial Points for K-Means Clustering. Proc. 15th International Conf. Machine Learning. 1998.
20) Brusco MJ, Steinley D. A comparison of heuristic procedures for minimum within-cluster sums of squares partitioning. Psychometrika. 2007; 72(4): 583-600.
21) Steinley D. Validating clusters with the lower bound for sum-of-squares error. Psychometrika. 2007; 72(1): 93-106.
22) Castillo LG, Schwartz SJ. Introduction to the special issue on college student mental health. Journal of Clinical Psychology. 2013; 69(4): 291-297.
23) Weisskirch RS, Zamboanga BL, Ravert RD, et al. An introduction to the composition of the Multi-Site University Study of Identity and Culture (MUSIC): A collaborative approach to research and mentorship. Cultural Diversity and Ethnic Minority Psychology. 2013; 19(2): 123.
24) Lloyd SP. Least squares quantization in PCM. Bell Telephone Laboratories Paper. 1957.
25) MacQueen J. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. 1967; 1(14): 281-297.
26) Holland J. Adaptation in Natural and Artificial Systems. University of Michigan Press. 1975.
27) Somma RD, Boixo S, Barnum H, Knill E. Quantum simulations of classical annealing processes. Physical Review Letters. 2008; 101(13): 130504.
28) Temme K, Osborne TJ, Vollbrecht KG, Poulin D, Verstraete F. Quantum metropolis sampling. Nature. 2011; 471(7336): 87-90.
29) Hassan R, De Weck O, Springmann P. Architecting a communication satellite product line. 22nd AIAA International Communications Satellite Systems Conference & Exhibit (ICSSC). 2004.
30) Najafi A, Ardakani SS, Marjani M. Quantitative structure-activity relationship analysis of the anticonvulsant activity of some benzylacetamides based on genetic algorithm-based multiple linear regression. Tropical Journal of Pharmaceutical Research. 2011; 10(4): 483-490.
31) Pittman J. Adaptive splines and genetic algorithms. J. Comp. Graph. Stat. 2002; 11(3): 615-638.
32) Paterlini S, Minerva T. Regression model selection using genetic algorithms. Proceedings of the 11th WSEAS International Conference on Recent Advances in Neural Networks, Fuzzy Systems & Evolutionary Computing. 2010; 19-27.
33) Maulik U, Bandyopadhyay S. Genetic algorithm-based clustering technique. Pattern Recogn. 2000; 33(9): 1455-1465.
34) Forrest S. Genetic algorithms: principles of natural selection applied to computation. Science. 1993; 261(5123): 872-878.
35) Whitley D. A genetic algorithm tutorial. Stat Comp. 1994; 4(2): 65-85.
36) Steinley D, Brusco MJ. Initializing K-means batch clustering: A critical evaluation of several techniques. J. Classif. 2007; 24, 99-121.
37) Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological). 1977; 1-38.
38) Gopal V, Fuentes C, Casella G, Gopal M. The bayesclust Package. 2009.
39) Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985; 50(2): 159-179.
40) Steinley D, Brusco MJ. Testing for validity and choosing the number of clusters in K-means clustering. Psychol. Methods. 2011; 16, 285-297.
41) Steinley D, Brusco MJ. A new variable weighting and selection procedure for K-means cluster analysis. Multiv Behav Res. 2008; 43(1): 77-108.
42) Melnykov V, Chen W, Maitra R. MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms. J Stat Softw. 2012; 51(12), 1-25.
43) Steinley D. Standardizing variables in K-means clustering. Classification, Clustering, and Data Mining Applications. 2004; 53-60.
44) Huang Z. Clustering large data sets with mixed numeric and categorical values. Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining. 1997; 21-34.
45) Steinley D. Properties of the Hubert-Arabie Adjusted Rand Index. Psychol Methods. 2004; 9(3): 386.
46) Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2001; 63(2): 411-423.
47) Dimitriadou E, Dolničar S, Weingessel A. An examination of indexes for determining the number of clusters in binary data sets. Psychometrika. 2002; 67(1): 137-159.
48) Hornik K, Feinerer I, Kober M, Buchta C. Spherical k-Means Clustering. Journal of Statistical Software. 2012; 50(10), 1-22.
49) Jessor R, Jessor SL. Problem behavior and psychosocial development: A longitudinal study of youth. 1977.
50) National Highway Traffic Safety Administration. Teen drivers: Additional resources. 2013. Retrieved October 22, 2014 from http://www.nhtsa.gov/Driving+Safety/Teen+Drivers/Teen+Drivers+-+Additional+Resources.
51) Centers for Disease Control and Prevention. Youth Risk Behavior Surveillance – United States, 2013. Morbidity and Mortality Weekly Report. 2013; 63(1): 1-168.
52) Bogle KA. Hooking up: Sex, dating, and relationships on campus. New York: New York University Press. 2008.
53) Steinley D. Profiling local optima in K-means clustering: Developing a diagnostic technique. Psychol. Methods. 2006; 11, 178-192.
54) Steinley D, Brusco MJ. A new variable weighting and selection procedure for K-means cluster analysis. Multiv Behav Res. 2008; 43, 77-108.
55) Cooper ML. Alcohol use and risky sexual behavior among college students and youth: Evaluating the evidence. J. Stud. Alcohol Drugs. 2002; (14): 101.
50
Table 1: Algorithm Performance by Condition

Condition             Genetic    Lloyd-   Lloyd-Forgy       Bayesian   EM          MacQueen
                      Algorithm  Forgy    Chain Variation              Algorithm   10 Restarts
Salient Predictors
  2                   0.249      0.237    0.251             0.179      0.267       0.972
  5                   0.326      0.298    0.320             0.233      0.293       0.846
  10                  0.306      0.265    0.295             0.280      0.283       0.665
Homogeneity
  TRUE                0.295      0.266    0.288             0.235      0.274       0.806
  FALSE               0.297      0.266    0.289             0.226      0.288       0.819
Average Overlap
  0.1                 0.438      0.419    0.445             0.455      0.539       0.879
  0.2                 0.297      0.221    0.289             0.198      0.256       0.875
  0.4                 0.125      0.116    0.132             0.039      0.048       0.684
Mix Fraction
  1                   0.301      0.267    0.263             0.222      0.272       0.819
  0.5                 0.291      0.261    0.285             0.217      0.279       0.815
  0.25                0.297      0.270    0.318             0.253      0.292       0.850
52
Table 2: Algorithm Performance on Simulation of MUSIC Dataset

Condition             Genetic    Lloyd-   Chain Lloyd-   Bayesian   EM          MacQueen
                      Algorithm  Forgy    Forgy                     Algorithm   10 Restarts
5 Signal, 10 Noise    0.329      0.301    0.327          0.304      0.500       0.674
12 Signal, 3 Noise    0.388      0.363    0.387          0.387      0.594       0.727
53
Table 3: Computing Times (in Seconds) on a Simulated Dataset Resembling the MUSIC Dataset

                      Genetic    Lloyd-   Chain Lloyd-   Bayesian   EM          MacQueen
                      Algorithm  Forgy    Forgy                     Algorithm   10 Restarts
Elapsed Time          146.57     0.14     0.50           600.85     14.81       0.36
System Time           0.25       0.00     0.00           0.09       0.02        0.00
User Time             135.15     0.14     0.49           594.83     14.48       0.36
54
Table 4: Average Number of Clusters Estimated in MUSIC Dataset

Index        Average Number of Clusters   Range of Clusters Found
Scott        4                            3 or 5
Marriott     3.2                          3 or 4
TraceCovW    3.1                          3 or 4
TraceW       3.3                          3 or 4
Ratkowsky    3.9                          3 or 4
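Indices such as those summarized in Table 4 can be computed in R with the NbClust package, which implements many of the Milligan and Cooper39 criteria. The sketch below is a minimal, hypothetical example rather than the procedure used in this study: it assumes NbClust is installed, uses placeholder data, and the index names in the package may differ slightly from the labels used above.

# Estimating the number of clusters with a single index (Ratkowsky-Lance shown here)
library(NbClust)

set.seed(123)
dat <- matrix(rnorm(300 * 13), nrow = 300, ncol = 13)  # placeholder data

nb <- NbClust(data = dat, distance = "euclidean", min.nc = 2, max.nc = 6,
              method = "kmeans", index = "ratkowsky")

nb$Best.nc  # suggested number of clusters and the corresponding index value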
55
Table 5: 4-Cluster Solutions by Algorithm

Cluster 1
  MacQueen with 10 Restarts: Moderate sexual risk with low substance use risk (n = 1523)
  EM-based: Moderate to high substance use and sexual risk (n = 277)
  Bayesian: Low sexual and substance use risk (n = 3969)
Cluster 2
  MacQueen with 10 Restarts: High sexual and substance use risk (n = 11)
  EM-based: Low risk, particularly with substance use (n = 7925)
  Bayesian: Moderate sexual risk, moderate alcohol and marijuana risk (n = 1991)
Cluster 3
  MacQueen with 10 Restarts: Moderate sexual and substance use risk (n = 713)
  EM-based: Moderate alcohol and marijuana risk, low sexual risk (n = 1581)
  Bayesian: Low to moderate sexual and substance abuse risk (n = 1990)
Cluster 4
  MacQueen with 10 Restarts: Low sexual and substance use risk (n = 7704)
  EM-based: High sexual and substance use risk (n = 168)
  Bayesian: Low sexual and substance abuse risk (sexual risk limited to unprotected sex; n = 2001)