1. 2011 Eighth International Joint Conference on Computer Science and Software Engineering (JCSSE)
A Combination of Decision Tree Learning
and Clustering for Data Classification
Chinnapat Kaewchinporn
Department of Computer Science
King Mongkut’s Institute of Technology Ladkrabang, Ladkrabang, Thailand
s0050117@kmitl.ac.th, scriptsds@gmail.com
2. Contents
Abstract
Introduction
C4.5 Decision Tree Algorithm
k-means Clustering Algorithm
Bagging Algorithm
Tree Bagging and Weighted Clustering Algorithm
Datasets
Experimental Results
Conclusion
3. Abstract
We present a new classification algorithm, called
Tree Bagging and Weighted Clustering (TBWC),
which combines decision tree learning and
clustering.
4. Introduction
Data classification is an important problem in
Knowledge Discovery in Databases (KDD).
Currently, there are many techniques to solve this
problem, such as decision trees, naïve Bayes,
instance-based learning, and artificial neural
networks.
5. Introduction
However, the techniques mentioned above have
some problems in classification, for example,
outlier handling, reduced accuracy as the amount
of test data grows, and increased classification
time when there is a large number of attributes.
Many studies have proposed new methods to
enhance the predictive performance of a classifier.
6. Introduction (Reference)
For decision tree learning,
Guo et al. introduced an improved weight-based
C4.5 decision tree, which considers imbalanced
weights between different instances, to address
class-imbalance problems.
For k-means clustering,
Zhu and Wang proposed an approach that optimizes
the value of k by using a genetic algorithm.
Qin et al. addressed the efficiency of the k-means
clustering algorithm on large datasets. Their
improved algorithm avoids unnecessary
calculations by using the triangle inequality.
7. Introduction (Reference)
For ensemble classifier research,
Gaddam et al. [13] proposed a novel method called
“K-Means+ID3” for anomaly detection.
The K-Means+ID3 algorithm first partitions the
training instances into k clusters by using the
k-means clustering method.
Then an ID3 decision tree is built from the
instances in each cluster.
For the final output of K-Means+ID3, the decisions
of the k-means and ID3 methods are combined using
the nearest-neighbor rule and the nearest-consensus
rule.
8. Introduction (Summary)
We present a new classification algorithm, called
Tree Bagging and Weighted Clustering (TBWC), which
combines decision tree learning and clustering.
The idea of the TBWC algorithm is to select important
attributes and assign weights to them by using decision
tree bagging; the weighted attributes are then used to
generate clusters for classifying a new instance.
9. C4.5 Decision Tree Algorithm
C4.5 is an algorithm for generating decision trees
for classification problems.
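C4.5 chooses its splitting attribute by the gain ratio criterion (information gain divided by split information). A minimal sketch of that criterion for a categorical attribute, with illustrative toy data:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """C4.5 split criterion: information gain divided by split information."""
    n = len(labels)
    parts = {}  # partition the labels by the attribute's value
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(label)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    return gain / split_info if split_info > 0 else 0.0

rows = [["sunny"], ["sunny"], ["rain"], ["rain"]]
labels = ["no", "no", "yes", "yes"]
print(gain_ratio(rows, labels, 0))  # → 1.0 (this split separates the classes perfectly)
```

The gain ratio, unlike plain information gain, penalizes attributes that split the data into very many small branches.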
10. k-means Clustering algorithm
k-means is a well-known clustering algorithm.
The main idea of k-means is to partition a given
dataset of n objects into k clusters, where k is a
given integer.
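The partitioning idea can be sketched as Lloyd's algorithm: alternately assign each object to its nearest center and recompute each center as the mean of its cluster. A minimal sketch on toy 2-D points:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers, clusters = kmeans(points, 2)
```

On these points the algorithm converges to centers near (0.05, 0.1) and (5.1, 4.95), one per visible group.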
11. bagging algorithm
Bagging is a classifier combination method.
The objective of bagging is to construct a composite
model that improves on the predictive performance
of a single model.
Basically, bagging uses a single algorithm to construct
n models and requires diversity in the training data.
For the final prediction, the bagging algorithm
counts the votes from all classification models and
assigns the class with the most votes to the new instance.
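The two ingredients above — bootstrap resampling for diversity and majority voting for the final output — can be sketched as follows. The one-rule `stump` learner is only an illustrative stand-in for a real base learner such as C4.5:

```python
import random
from collections import Counter

def bagging_train(train, learner, n_models, seed=0):
    """Train n models, each on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]  # diversity via resampling
        models.append(learner(sample))
    return models

def bagging_predict(models, x):
    """Majority vote over the predictions of all models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Toy base learner: predicts the majority class of its training sample.
def stump(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

train = [(i, "a") for i in range(9)] + [(9, "b")]
models = bagging_train(train, stump, n_models=5)
print(bagging_predict(models, 0))  # → a
```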
12. Tree Bagging and Weighted Clustering algorithm
Tree Bagging and Weighted Clustering (TBWC)
consists of two main parts.
First, important attributes are selected and
weighted by using the bagging technique with
C4.5 decision trees.
(Figure: the processes of TBWC.)
13. Tree Bagging and Weighted Clustering algorithm
Second, the weighted attributes are used as the
inputs of k-means clustering to create clusters.
14. Tree Bagging and Weighted Clustering algorithm
As shown in the figure “The processes of TBWC”,
there are four processes of TBWC:
1) modeling process
2) attribute selection process
3) weighted clustering process
4) prediction process
15. Tree Bagging and Weighted Clustering algorithm
A. Modeling process
Training data are used to create models with the
bagging technique with C4.5 decision trees.
The output of the bagging algorithm is n decision
tree models.
16. Tree Bagging and Weighted Clustering algorithm
B. Attribute selection process
After the decision trees have been created by
bagging, the attributes that appear in the trees
are selected.
To assign weights to attributes, we consider the
attribute nodes of each tree.
The weight of an attribute depends on the position
at which the attribute appears in a tree and on the
size of that tree.
17. Tree Bagging and Weighted Clustering algorithm
B. Attribute selection process (cont.)
If the same attribute occurs at several positions in
a tree, only the attribute node appearing at the
lowest level is considered.
18. Tree Bagging and Weighted Clustering algorithm
The algorithm starts by defining a weight for each
attribute, initialized to zero.
The next step is to assign a weight to each attribute
by considering all internal nodes of each tree.
19. Tree Bagging and Weighted Clustering algorithm
For example, suppose there are two decision tree
models and the set of attributes is
A = {age, income, student, credit_rating}.
(Figure: decision tree model 1 (M1).)
20. Tree Bagging and Weighted Clustering algorithm
To assign weights, the attribute node at level 0 (the
root) is considered first; thus the weight of the
attribute student is defined by the weighting formula.
21. Tree Bagging and Weighted Clustering algorithm
22. Tree Bagging and Weighted Clustering algorithm
The decision tree model 2 (M2)
23. Tree Bagging and Weighted Clustering algorithm
24. Tree Bagging and Weighted Clustering algorithm
When all tree models have been considered, the next
step is to calculate the actual weight of each
attribute using the equation.
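The slides show the weighting and aggregation formulas only as figures, so the scheme below is a hypothetical illustration, not the paper's exact formula. It follows the stated properties: each attribute's contribution depends on its node's level and the tree's size, weights start at zero, and contributions are summed over all bagged trees. `node_weight`, the tree encoding, and the toy models are all illustrative assumptions:

```python
def node_weight(level, tree_size):
    # Hypothetical scheme (NOT the paper's published formula):
    # level 0 is the root; deeper nodes contribute less, scaled by tree size.
    return (tree_size - level) / tree_size

def attribute_weights(trees, attributes):
    """Accumulate one weight per attribute over all bagged trees.

    Each tree is (levels, size): `levels` maps an attribute to the level of
    its single considered node (one node per attribute per tree, as in the
    slides), and `size` is the number of nodes in the tree.
    """
    weights = dict.fromkeys(attributes, 0.0)  # initialized to zero, as in the slides
    for levels, size in trees:
        for attr, level in levels.items():
            weights[attr] += node_weight(level, size)
    return weights

# Two toy models over A = {age, income, student, credit_rating}.
trees = [
    ({"student": 0, "age": 1}, 5),           # M1: student at the root
    ({"credit_rating": 0, "income": 1}, 3),  # M2
]
w = attribute_weights(trees, ["age", "income", "student", "credit_rating"])
# Under this illustrative scheme: w["student"] == 1.0, w["age"] == 0.8
```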
25. Tree Bagging and Weighted Clustering algorithm
(Figure: the total weight of each attribute.)
26. Tree Bagging and Weighted Clustering algorithm
At the end of the algorithm, the attributes whose
weights are greater than zero are selected.
Finally, the weights of the selected attributes are
normalized into [0, 1] by using min-max normalization.
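Min-max normalization rescales the selected attributes' weights into [0, 1]. A minimal sketch (the example weights are illustrative):

```python
def min_max_normalize(weights):
    """Scale a dict of attribute weights into [0, 1] via min-max normalization."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:  # all weights equal: map everything to 1.0 by convention
        return {a: 1.0 for a in weights}
    return {a: (w - lo) / (hi - lo) for a, w in weights.items()}

w = min_max_normalize({"age": 2.0, "income": 1.0, "student": 5.0})
# age → 0.25, income → 0.0, student → 1.0
```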
27. Tree Bagging and Weighted Clustering algorithm
C. Weighted clustering process
The dataset with the selected attributes and their
weights is used to generate k clusters by k-means
clustering.
Each cluster center is assigned a class label by a
majority vote.
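A sketch of the two pieces this step needs, assuming the attribute weights enter the clustering through a weighted squared Euclidean distance (the exact distance form is not spelled out in the slides) and each center takes the majority label of its members:

```python
from collections import Counter

def weighted_dist2(x, c, weights):
    """Squared Euclidean distance with each attribute dimension scaled by
    its normalized weight from the attribute selection step (assumption)."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(x, c, weights))

def label_centers(cluster_member_labels):
    """Assign each cluster center the majority class of its members."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in cluster_member_labels]

# The weights stretch the first dimension and shrink the second.
d = weighted_dist2((0.0, 0.0), (1.0, 1.0), (2.0, 0.5))      # → 2.5
center_labels = label_centers([["yes", "yes", "no"], ["no", "no"]])  # → ["yes", "no"]
```

Heavily weighted attributes thus dominate the distance, so the clusters align with the attributes that the bagged trees found important.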
28. Tree Bagging and Weighted Clustering algorithm
D. Prediction process
When a new instance x is presented, the distances
between all cluster centers and the instance x are
measured.
The prediction for the instance x is the class label
of the nearest cluster center.
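The prediction step can be sketched as a nearest-center lookup. The weighted distance mirrors the clustering step and is an assumption, since the slides do not give the exact distance formula; the centers and labels below are illustrative:

```python
def predict(x, centers, center_labels, weights):
    """Classify x with the label of the nearest cluster center."""
    def d2(c):
        # Assumed weighted squared Euclidean distance.
        return sum(w * (a - b) ** 2 for a, b, w in zip(x, c, weights))
    nearest = min(range(len(centers)), key=lambda j: d2(centers[j]))
    return center_labels[nearest]

centers = [(0.0, 0.0), (1.0, 1.0)]
center_labels = ["no", "yes"]
print(predict((0.9, 0.8), centers, center_labels, weights=(1.0, 1.0)))  # → yes
```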
29. Datasets
In the experiments, we use five datasets:

TABLE 1. THE CHARACTERISTICS OF EACH DATASET

datasets                   instances   attributes   attribute type               classes
Cardiotocography 1         2,126       23           real                         3
Cardiotocography 2         2,126       23           real                         10
Internet Advertisements    3,279       1,558        categorical, integer, real   2
Libras Movement            360         91           real                         15
Multiple Features          2,000       649          integer, real                10

- Cardiotocography 1, Cardiotocography 2
Cardiotocography 1 and Cardiotocography 2 contain measurements of fetal heart rate (FHR)
and uterine contractions (UC).
- Internet Advertisements
Internet Advertisements concerns the problem of identifying advertisement images on web pages.
- Libras Movement
Libras Movement concerns the problem of classifying hand movements.
- Multiple Features
Multiple Features concerns the problem of classifying handwritten numerals
described by several feature sets.
30. Experimental Results
To evaluate the predictive performance of the proposed
algorithm, the predictive accuracy of TBWC is compared
with that of the C4.5 decision tree and k-means clustering.

TABLE 2. THE COMPARISONS OF THE PREDICTIVE ACCURACY OF EACH ALGORITHM

Datasets                   C4.5     k-means   TBWC
Cardiotocography 1         92.56%   86.97%    96.47%
Cardiotocography 2         82.83%   62.65%    90.35%
Internet Advertisements    96.19%   93.11%    96.65%
Libras Movement            60.56%   43.89%    80.56%
Multiple Features          92.60%   93.45%    96.55%
31. Experimental Results
(Figure: bar chart of the predictive accuracy (%) of C4.5, k-means, and TBWC on each dataset.)
The predictive accuracy of each algorithm.
32. Experimental Results
TABLE 3. PARAMETER SETTINGS FOR EACH DATASET

datasets                   number of models n   number of clusters k
Cardiotocography 1         40                   100
Cardiotocography 2         20                   100
Internet Advertisements    10                   100
Libras Movement            5                    20
Multiple Features          10                   80

The number of models (n) is the number of C4.5 decision tree models
produced by the bagging algorithm.
The number of clusters (k) is the number of clusters in k-means
clustering.
The values of n and k for each dataset are those of the model with the
highest accuracy on a validation dataset.
33. Experimental Results
TABLE 4. ATTRIBUTE REDUCTION FOR EACH DATASET

datasets                   number of attributes   number of selected attributes   percent reduction
Cardiotocography 1         23                     22                              4.35%
Cardiotocography 2         23                     22                              4.35%
Internet Advertisements    1,558                  626                             59.82%
Libras Movement            91                     86                              5.49%
Multiple Features          649                    454                             30.05%

The results show that the proposed algorithm has the ability to select
important attributes.
34. Conclusion
Tree Bagging and Weighted Clustering (TBWC) was
proposed to enhance the efficiency of data
classification.
The TBWC algorithm consists of two main steps: attribute
selection and classification of a new instance.
In the attribute selection step, the bagging algorithm
with C4.5 decision trees is used to select and weight attributes.
Then k-means clustering is applied to assign a class label to
a new instance.
35. Conclusion
The experimental results are summarized in the following
aspects.
1. The TBWC algorithm yields the highest accuracy
compared with the decision tree and clustering
algorithms on all datasets.
2. The TBWC algorithm can greatly improve
predictive performance, especially on multi-class
datasets such as Libras Movement and
Cardiotocography 2.
3. The TBWC algorithm can reduce the number of
attributes by up to 59.82% on the Internet
Advertisements dataset while achieving higher
accuracy than the C4.5 decision tree and
k-means clustering.