1. 2011 Eighth International Joint Conference on Computer Science and Software Engineering (JCSSE)
A Combination of Decision Tree Learning
and Clustering for Data Classification
Chinnapat Kaewchinporn
Department of Computer Science
King Mongkut’s Institute of Technology Ladkrabang, Ladkrabang, Thailand
s0050117@kmitl.ac.th, scriptsds@gmail.com
2. Contents
Abstract
Introduction
C4.5 Decision Tree Algorithm
k-means Clustering Algorithm
Bagging Algorithm
Tree Bagging and Weighted Clustering Algorithm
Datasets
Experimental Results
Conclusion
3. Abstract
We present a new classification algorithm, called
Tree Bagging and Weighted Clustering (TBWC),
which combines decision tree learning and
clustering.
4. Introduction
Data classification is an important problem in
Knowledge Discovery in Databases (KDD).
Currently, there are many techniques to solve this
problem, such as decision trees, naïve Bayes,
instance-based learning, and artificial neural
networks.
5. Introduction
However, the techniques mentioned above have
some problems in classification, for example,
outlier handling, reduced accuracy as the amount
of test data grows, and increased classification
time when there is a large number of attributes.
Many studies have proposed new methods to
enhance the predictive performance of a classifier.
6. Introduction (Reference)
For decision tree learning,
Guo et al. introduced an improved weight-based
C4.5 decision tree, which considers imbalanced
weights between different instances, to address
class-imbalance problems.
For k-means clustering,
Zhu and Wang proposed an approach that optimizes
the value of k by using a genetic algorithm.
Qin et al. addressed the efficiency of the k-means
clustering algorithm on large datasets. Their
improved algorithm avoids unnecessary
calculations by using the triangle inequality.
7. Introduction (Reference)
For ensemble classifier research,
Gaddam et al. [13] proposed a novel method called
“K-Means+ID3” for anomaly detection.
The K-Means+ID3 algorithm first partitions the
training instances into k clusters by using the
k-means clustering method.
Then an ID3 decision tree is built from the
instances in each cluster.
For the final output of K-Means+ID3, the decisions
of the k-means and ID3 methods are combined using
the nearest-neighbor rule and the nearest-consensus
rule.
8. Introduction (Summary)
We present a new classification algorithm, called
Tree Bagging and Weighted Clustering (TBWC), which
combines decision tree learning and clustering.
The idea of the TBWC algorithm is to select important
attributes and assign weights to them by using decision
tree bagging; the weighted attributes are then used to
generate clusters for classifying a new instance.
9. C4.5 Decision Tree Algorithm
C4.5 is an algorithm for generating decision trees
for classification problems.
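C4.5 chooses its splitting attribute by the gain ratio criterion (information gain divided by split information). A minimal sketch of that criterion for a categorical attribute, with illustrative toy data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """C4.5 split criterion: information gain divided by split information."""
    n = len(labels)
    parts = {}  # partition the labels by the attribute's value
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(label)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    return gain / split_info if split_info > 0 else 0.0

rows = [["sunny"], ["sunny"], ["rain"], ["rain"]]
labels = ["no", "no", "yes", "yes"]
print(gain_ratio(rows, labels, 0))  # → 1.0 (this split separates the classes perfectly)
```

The gain ratio, unlike plain information gain, penalizes attributes that split the data into very many small branches.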
10. k-means Clustering algorithm
k-means is a well-known clustering algorithm.
The main idea of k-means is to partition a given
dataset of n objects into k clusters, where k is a
given integer.
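The partitioning idea can be sketched as Lloyd's algorithm: alternately assign each object to its nearest center and recompute each center as the mean of its cluster. A minimal sketch on toy 2-D points:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, clusters = kmeans(points, 2)
```

On these points the algorithm converges to centers near (0.05, 0.1) and (5.1, 4.95), one per visible group.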
11. bagging algorithm
Bagging is a classifier combination method.
The objective of bagging is to construct a composite
model that improves on the predictive performance
of a single model.
Basically, bagging uses a single algorithm to construct
n models and requires diversity in the training data.
For the final prediction, the bagging algorithm
counts the votes from all classification models and
assigns the class with the most votes to the new instance.
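The two ingredients above — bootstrap resampling for diversity and majority voting for the final output — can be sketched as follows. The one-rule `stump` learner is only an illustrative stand-in for a real base learner such as C4.5:

```python
import random
from collections import Counter

def bagging_train(train, learner, n_models, seed=0):
    """Train n models, each on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]  # diversity via resampling
        models.append(learner(sample))
    return models

def bagging_predict(models, x):
    """Majority vote over the predictions of all models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Toy base learner: predicts the majority class of its training sample.
def stump(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

train = [(i, "a") for i in range(9)] + [(9, "b")]
models = bagging_train(train, stump, n_models=5)
print(bagging_predict(models, 0))  # → a
```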
12. Tree Bagging and Weighted Clustering algorithm
Tree Bagging and Weighted Clustering (TBWC)
consists of two main parts.
First, important attributes are selected and
weighted by using the bagging technique with
C4.5 decision trees.
(Figure: the processes of TBWC.)
13. Tree Bagging and Weighted Clustering algorithm
Second, the weighted attributes are used as the
inputs of k-means clustering to create clusters.
14. Tree Bagging and Weighted Clustering algorithm
As shown in the figure “The processes of TBWC”,
there are four processes of TBWC:
1) modeling process
2) attribute selection process
3) weighted clustering process
4) prediction process
15. Tree Bagging and Weighted Clustering algorithm
A. Modeling process
Training data are used to create models with the
bagging technique with C4.5 decision trees.
The output of the bagging algorithm is n decision
tree models.
16. Tree Bagging and Weighted Clustering algorithm
B. Attribute selection process
After the decision trees have been created by
bagging, the attributes that appear in the trees
are selected.
To assign weights to attributes, we consider the
attribute nodes of each tree.
The weight of an attribute depends on the position
at which the attribute appears in a tree and on the
size of that tree.
17. Tree Bagging and Weighted Clustering algorithm
B. Attribute selection process (cont.)
If the same attribute occurs at several positions in
a tree, only the attribute node appearing at the
lowest level is considered.
18. Tree Bagging and Weighted Clustering algorithm
The algorithm starts by defining a weight for each
attribute, initialized to zero.
The next step is to assign a weight to each attribute
by considering all internal nodes of each tree.
19. Tree Bagging and Weighted Clustering algorithm
For example, suppose there are two decision tree
models and the set of attributes is
A = {age, income, student, credit_rating}.
(Figure: decision tree model 1 (M1).)
20. Tree Bagging and Weighted Clustering algorithm
To assign weights, the attribute node at level 0 (the
root) is considered first; thus the weight of the
attribute student is defined by the weighting formula.
21. Tree Bagging and Weighted Clustering algorithm
22. Tree Bagging and Weighted Clustering algorithm
The decision tree model 2 (M2)
23. Tree Bagging and Weighted Clustering algorithm
24. Tree Bagging and Weighted Clustering algorithm
When all tree models have been considered, the next
step is to calculate the actual weight of each
attribute using the equation.
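The slides show the weighting and aggregation formulas only as figures, so the scheme below is a hypothetical illustration, not the paper's exact formula. It follows the stated properties: each attribute's contribution depends on its node's level and the tree's size, weights start at zero, and contributions are summed over all bagged trees. `node_weight`, the tree encoding, and the toy models are all illustrative assumptions:

```python
def node_weight(level, tree_size):
    # Hypothetical scheme (NOT the paper's published formula):
    # level 0 is the root; deeper nodes contribute less, scaled by tree size.
    return (tree_size - level) / tree_size

def attribute_weights(trees, attributes):
    """Accumulate one weight per attribute over all bagged trees.

    Each tree is (levels, size): `levels` maps an attribute to the level of
    its single considered node (one node per attribute per tree, as in the
    slides), and `size` is the number of nodes in the tree.
    """
    weights = dict.fromkeys(attributes, 0.0)  # initialized to zero, as in the slides
    for levels, size in trees:
        for attr, level in levels.items():
            weights[attr] += node_weight(level, size)
    return weights

# Two toy models over A = {age, income, student, credit_rating}.
trees = [
    ({"student": 0, "age": 1}, 5),           # M1: student at the root
    ({"credit_rating": 0, "income": 1}, 3),  # M2
]
w = attribute_weights(trees, ["age", "income", "student", "credit_rating"])
# Under this illustrative scheme: w["student"] == 1.0, w["age"] == 0.8
```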
25. Tree Bagging and Weighted Clustering algorithm
(Figure: the total weight of each attribute.)
26. Tree Bagging and Weighted Clustering algorithm
At the end of the algorithm, the attributes whose
weights are greater than zero are selected.
Finally, the weights of the selected attributes are
normalized into [0, 1] by using min-max normalization.
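Min-max normalization rescales the selected attributes' weights into [0, 1]. A minimal sketch (the example weights are illustrative):

```python
def min_max_normalize(weights):
    """Scale a dict of attribute weights into [0, 1] via min-max normalization."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:  # all weights equal: map everything to 1.0 by convention
        return {a: 1.0 for a in weights}
    return {a: (w - lo) / (hi - lo) for a, w in weights.items()}

w = min_max_normalize({"age": 2.0, "income": 1.0, "student": 5.0})
# age → 0.25, income → 0.0, student → 1.0
```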
27. Tree Bagging and Weighted Clustering algorithm
C. Weighted clustering process
The dataset with the selected attributes and their
weights is used to generate k clusters by k-means
clustering.
Each cluster center is assigned a class label by a
majority vote.
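A sketch of the two pieces this step needs, assuming the attribute weights enter the clustering through a weighted squared Euclidean distance (the exact distance form is not spelled out in the slides) and each center takes the majority label of its members:

```python
from collections import Counter

def weighted_dist2(x, c, weights):
    """Squared Euclidean distance with each attribute dimension scaled by
    its normalized weight from the attribute selection step (assumption)."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(x, c, weights))

def label_centers(cluster_member_labels):
    """Assign each cluster center the majority class of its members."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in cluster_member_labels]

# The weights stretch the first dimension and shrink the second.
d = weighted_dist2((0.0, 0.0), (1.0, 1.0), (2.0, 0.5))      # → 2.5
center_labels = label_centers([["yes", "yes", "no"], ["no", "no"]])  # → ["yes", "no"]
```

Heavily weighted attributes thus dominate the distance, so the clusters align with the attributes that the bagged trees found important.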
28. Tree Bagging and Weighted Clustering algorithm
D. Prediction process
When a new instance x is presented, the distances
between all cluster centers and the instance x are
measured.
The prediction for the instance x is the class label
of the nearest cluster center.
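The prediction step can be sketched as a nearest-center lookup. The weighted distance mirrors the clustering step and is an assumption, since the slides do not give the exact distance formula; the centers and labels below are illustrative:

```python
def predict(x, centers, center_labels, weights):
    """Classify x with the label of the nearest cluster center."""
    def d2(c):
        # Assumed weighted squared Euclidean distance.
        return sum(w * (a - b) ** 2 for a, b, w in zip(x, c, weights))
    nearest = min(range(len(centers)), key=lambda j: d2(centers[j]))
    return center_labels[nearest]

centers = [(0.0, 0.0), (1.0, 1.0)]
center_labels = ["no", "yes"]
print(predict((0.9, 0.8), centers, center_labels, weights=(1.0, 1.0)))  # → yes
```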
29. Datasets
In the experiments, we use five datasets:

TABLE 1. THE CHARACTERISTICS OF EACH DATASET

datasets                   instances   attributes   attribute type               classes
Cardiotocography 1         2,126       23           real                         3
Cardiotocography 2         2,126       23           real                         10
Internet Advertisements    3,279       1,558        categorical, integer, real   2
Libras Movement            360         91           real                         15
Multiple Features          2,000       649          integer, real                10

- Cardiotocography 1, Cardiotocography 2
Cardiotocography 1 and Cardiotocography 2 contain measurements of fetal heart rate (FHR)
and uterine contractions (UC).
- Internet Advertisements
Internet Advertisements concerns the problem of identifying advertisement images on web pages.
- Libras Movement
Libras Movement concerns the problem of classifying hand movements.
- Multiple Features
Multiple Features concerns the problem of classifying handwritten numerals
described by several feature sets.
30. Experimental Results
To evaluate the predictive performance of the proposed
algorithm, the predictive accuracy of TBWC is compared
with that of the C4.5 decision tree and k-means clustering.

TABLE 2. THE COMPARISONS OF THE PREDICTIVE ACCURACY OF EACH ALGORITHM

Datasets                   C4.5     k-means   TBWC
Cardiotocography 1         92.56%   86.97%    96.47%
Cardiotocography 2         82.83%   62.65%    90.35%
Internet Advertisements    96.19%   93.11%    96.65%
Libras Movement            60.56%   43.89%    80.56%
Multiple Features          92.60%   93.45%    96.55%
31. Experimental Results
(Figure: bar chart of the predictive accuracy (%) of C4.5, k-means, and TBWC on each dataset.)
The predictive accuracy of each algorithm.
32. Experimental Results
TABLE 3. PARAMETER SETTINGS FOR EACH DATASET

datasets                   number of models n   number of clusters k
Cardiotocography 1         40                   100
Cardiotocography 2         20                   100
Internet Advertisements    10                   100
Libras Movement            5                    20
Multiple Features          10                   80

The number of models (n) is the number of C4.5 decision tree models
produced by the bagging algorithm.
The number of clusters (k) is the number of clusters in k-means
clustering.
The values of n and k for each dataset are those of the model with the
highest accuracy on a validation dataset.
33. Experimental Results
TABLE 4. ATTRIBUTE REDUCTION FOR EACH DATASET

datasets                   number of attributes   number of selected attributes   percent reduction
Cardiotocography 1         23                     22                              4.35%
Cardiotocography 2         23                     22                              4.35%
Internet Advertisements    1,558                  626                             59.82%
Libras Movement            91                     86                              5.49%
Multiple Features          649                    454                             30.05%

The results show that the proposed algorithm has the ability to select
important attributes.
34. Conclusion
Tree Bagging and Weighted Clustering (TBWC) was
proposed to enhance the efficiency of data
classification.
The TBWC algorithm consists of two main steps: attribute
selection and classification of a new instance.
In the attribute selection step, the bagging algorithm
with C4.5 decision trees is used to select and weight attributes.
Then k-means clustering is applied to assign a class label to
a new instance.
35. Conclusion
The experimental results are summarized in the following
aspects.
1. The TBWC algorithm yields the highest accuracy
compared with the decision tree and clustering
algorithms on all datasets.
2. The TBWC algorithm can greatly improve
predictive performance, especially on multi-class
datasets such as Libras Movement and
Cardiotocography 2.
3. The TBWC algorithm can reduce the number of
attributes by up to 59.82% on the Internet
Advertisements dataset while achieving higher
accuracy than the C4.5 decision tree and
k-means clustering.