2011 Eighth International Joint Conference on Computer Science and Software Engineering (JCSSE)




 A Combination of Decision Tree Learning
 and Clustering for Data Classification




  Chinnapat Kaewchinporn

  Department of Computer Science
  King Mongkut’s Institute of Technology Ladkrabang, Ladkrabang, Thailand

  s0050117@kmitl.ac.th, scriptsds@gmail.com
Contents

       Abstract
       Introduction
       C4.5 Decision Tree algorithm
       k-means Clustering algorithm
       Bagging algorithm
       Tree Bagging and Weighted Clustering algorithm
       Datasets
       Experimental Results
       Conclusion



 A Combination of Decision Tree Learning and Clustering for Data Classification
Abstract

       We present a new classification algorithm which is
       a combination of decision tree learning and
       clustering, called Tree Bagging and Weighted
       Clustering (TBWC).




 A Combination of Decision Tree Learning and Clustering for Data Classification
Introduction

       Data classification is an important problem in
       Knowledge Discovery in Databases (KDD).
       Currently, there are many techniques to solve this
       problem, such as decision trees, naïve Bayes,
       instance-based learning, and artificial neural
       networks.




 A Combination of Decision Tree Learning and Clustering for Data Classification
Introduction

       However, the techniques mentioned above have
       some problems in classification, for example,
       outlier handling, accuracy dropping as the amount
       of testing data increases, and classification time
       growing when there are a large number of
       attributes.
       Many studies have proposed new methods to
       enhance the predictive performance of a
       classifier.



 A Combination of Decision Tree Learning and Clustering for Data Classification
Introduction (Reference)

       For decision tree learning,
                Guo et al. introduced an improved weight-based C4.5
                decision tree, which considers imbalanced weights
                between different instances, to address the class
                imbalance problem.
       For k-means clustering,
                Zhu and Wang proposed an approach that optimizes
                the value of k by using a genetic algorithm.
                Qin et al. addressed the efficiency of the k-means
                clustering algorithm on large datasets; their work
                presented an improved algorithm that avoids
                unnecessary calculations by using the triangle
                inequality.

 A Combination of Decision Tree Learning and Clustering for Data Classification
Introduction (Reference)

       For ensemble classifier research,
                Gaddam et al. [13] proposed a novel method called
                “K-Means+ID3” for anomaly detection.
                   The K-Means+ID3 algorithm first partitions the
                   training instances into k clusters using k-means
                   clustering.
                   An ID3 decision tree is then built from the
                   instances in each k-means cluster.
                   For the final output of K-Means+ID3, the decisions
                   of the k-means and ID3 methods are combined using
                   the nearest-neighbor rule and the nearest-consensus
                   rule.
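
As a rough illustration (not the authors' code), here is a minimal Python sketch of this scheme; scikit-learn's DecisionTreeClassifier with the entropy criterion stands in for ID3, and only the nearest-neighbor rule is shown:

```python
# Minimal sketch of the K-Means+ID3 idea (Gaddam et al.), illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def fit_kmeans_id3(X, y, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # One decision tree per cluster, trained on that cluster's instances.
    trees = {c: DecisionTreeClassifier(criterion="entropy")
                .fit(X[km.labels_ == c], y[km.labels_ == c])
             for c in range(k)}
    return km, trees

def predict_nn_rule(km, trees, x):
    # Nearest-neighbor rule: classify x with the tree of the closest center.
    c = int(km.predict(x.reshape(1, -1))[0])
    return trees[c].predict(x.reshape(1, -1))[0]
```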


 A Combination of Decision Tree Learning and Clustering for Data Classification
Introduction (Summary)
       We present a new classification algorithm, a
       combination of decision tree learning and clustering,
       called Tree Bagging and Weighted Clustering (TBWC).
       The idea of the TBWC algorithm is to select important
       attributes and assign weights to them by using decision
       tree bagging; the weighted attributes are then used to
       generate clusters for classifying a new instance.
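
To make the pipeline concrete, here is a condensed, illustrative Python sketch of the four TBWC steps. It is a sketch under stated assumptions, not the paper's implementation: scikit-learn trees (entropy criterion) stand in for C4.5, and impurity-based feature importances stand in for the paper's tree-position attribute weights.

```python
# Illustrative TBWC pipeline; assumes integer class labels 0..C-1
# and that k-means produces no empty clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def tbwc_fit(X, y, n_models=10, k=20, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Tree bagging: n_models trees, each on a bootstrap sample.
    weights = np.zeros(X.shape[1])
    for _ in range(n_models):
        idx = rng.integers(0, len(X), len(X))
        tree = DecisionTreeClassifier(criterion="entropy").fit(X[idx], y[idx])
        weights += tree.feature_importances_
    # 2) Attribute selection: keep attributes with weight > 0 and
    #    min-max normalize the surviving weights into [0, 1].
    selected = weights > 0
    w = weights[selected]
    w = (w - w.min()) / max(w.max() - w.min(), 1e-12)
    # 3) Weighted clustering: scaling features by sqrt(w) makes standard
    #    k-means minimize a w-weighted squared Euclidean distance.
    km = KMeans(n_clusters=k, n_init=10,
                random_state=seed).fit(X[:, selected] * np.sqrt(w))
    # Each cluster center gets the majority class label of its members.
    labels = np.array([np.bincount(y[km.labels_ == c]).argmax()
                       for c in range(k)])
    return selected, w, km, labels

def tbwc_predict(model, x):
    # 4) Prediction: class label of the nearest (weighted) cluster center.
    selected, w, km, labels = model
    c = km.predict((x[selected] * np.sqrt(w)).reshape(1, -1))[0]
    return labels[c]
```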




 A Combination of Decision Tree Learning and Clustering for Data Classification
C4.5 Decision Tree Algorithm
       C4.5 is an algorithm that generates decision trees for
       classification problems.
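
At each node, C4.5 splits on the attribute with the highest gain ratio. A minimal sketch of that criterion, using the standard definitions rather than anything specific to this paper:

```python
# Gain ratio, the split criterion C4.5 uses (standard definition).
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(y, attr_values):
    # Information gain of splitting y by a discrete attribute,
    # divided by the split information (entropy of partition sizes).
    total = entropy(y)
    vals, counts = np.unique(attr_values, return_counts=True)
    p = counts / counts.sum()
    cond = sum(pi * entropy(y[attr_values == v]) for pi, v in zip(p, vals))
    split_info = -(p * np.log2(p)).sum()
    return (total - cond) / split_info if split_info > 0 else 0.0
```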




 A Combination of Decision Tree Learning and Clustering for Data Classification
k-means Clustering algorithm
       k-means is a well-known clustering algorithm.
       Given a dataset of n objects and an integer k, the main
       idea of k-means is to partition the data into k clusters.
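
A minimal sketch of the standard (Lloyd's) k-means iteration, for reference:

```python
# Lloyd's algorithm: alternate nearest-center assignment and center update.
import numpy as np

def kmeans(X, k, n_iter=100, rng=np.random.default_rng(0)):
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Assign each object to its nearest center (squared Euclidean).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # Move each center to the mean of its assigned objects.
        new = np.array([X[assign == c].mean(axis=0) if (assign == c).any()
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, assign
```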




 A Combination of Decision Tree Learning and Clustering for Data Classification
Bagging algorithm
       Bagging is a classifier combination method.
       The objective of bagging is to construct a composite
       model that improves on the predictive performance of a
       single model.
       Basically, bagging uses a single algorithm to construct
       n models and requires diversity in the training data.
       For the final prediction, the bagging algorithm
       counts the votes from all the classification models and
       assigns the class with the most votes to the new data.
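
A minimal sketch of bagging with decision trees as the base learner (scikit-learn used for the single algorithm; integer class labels assumed):

```python
# Bagging: n bootstrap samples, one model each, majority vote at prediction.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=10, rng=np.random.default_rng(0)):
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), len(X))   # bootstrap sample (with replacement)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models]
    return np.bincount(votes).argmax()          # class with the most votes
```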




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
       Tree Bagging and Weighted Clustering (TBWC) consists
       of two main parts.
       First, important attributes are selected and weighted by
       using the bagging technique with C4.5 decision trees.
       [Figure: The processes of TBWC]




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
       Second, the weighted attributes are used as the inputs
       of k-means clustering to create clusters.
       [Figure: The processes of TBWC]




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
       As shown in the figure, TBWC has four processes:
       1) modeling process
       2) attribute selection process
       3) weighted clustering process
       4) prediction process
       [Figure: The processes of TBWC]




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
       A. Modeling process
          Training data are used to create models by the bagging
          technique with C4.5 decision trees.
          The output of the bagging algorithm is n decision tree
          models.
       [Figure: The processes of TBWC]




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm

       B. Attribute selection process
          After the decision trees have been created by bagging,
          the attributes that appear in the trees are selected.
          To assign a weight to an attribute, we consider the
          attribute nodes of each tree.
          The weight of an attribute varies with the position at
          which the attribute appears in a tree and with the size
          of the tree.
       [Figure: The processes of TBWC]

   A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm

       B. Attribute selection process
          If the same attribute occurs at several positions in a
          tree, only the attribute node appearing at the lowest
          level is considered.
       [Figure: The processes of TBWC]




   A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
       The algorithm starts by defining a weight for each
       attribute, initialized to zero.
       The next step is to assign a weight to each attribute
       by considering all internal nodes of each tree.
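
The slide's weight formula is in an equation image that did not survive extraction, so the sketch below assumes one plausible form consistent with the description above: a node's contribution is (tree depth − node level) / tree depth, taken at the attribute's shallowest occurrence (level 0 being the root, following the slide's "lowest level" rule), and summed over all bagged trees. The helper reads these positions out of fitted scikit-learn trees and is illustrative only.

```python
# Assumed tree-position weighting; NOT the paper's exact (lost) formula.
import numpy as np

def attribute_weights(trees, n_attributes):
    w = np.zeros(n_attributes)                  # weights start at zero
    for clf in trees:
        t = clf.tree_
        depth = clf.get_depth()
        if depth == 0:                          # single-leaf tree: no splits
            continue
        best_level = {}                         # attribute -> shallowest level seen
        def walk(node, level):
            if t.children_left[node] != t.children_right[node]:  # internal node
                a = t.feature[node]
                best_level[a] = min(best_level.get(a, level), level)
                walk(t.children_left[node], level + 1)
                walk(t.children_right[node], level + 1)
        walk(0, 0)
        for a, level in best_level.items():
            w[a] += (depth - level) / depth     # closer to root => larger weight
    return w
```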




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
     For example, suppose there are two decision tree models
     and the set of attributes is
     A = {age, income, student, credit_rating}.




                                                [Figure: The decision tree model 1 (M1)]

 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
     To assign a weight, the attribute node at level 0 (the root)
     is considered first; thus the weight of the attribute student
     is defined by the weight formula.




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm




                                     [Figure: The decision tree model 2 (M2)]




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
     When all tree models have been considered, the next step
     is to calculate the actual weight of each attribute using
     the weighting equation.




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
     The total weight of each attribute is:




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm
     At the end of the algorithm, the attributes whose weights
     are greater than zero are selected.
     Finally, the weights of all selected attributes are
     normalized into [0, 1] by min-max normalization.
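
The normalization itself is the standard min-max rescaling; writing $w_i$ for the weight of a selected attribute:

$$ w_i' = \frac{w_i - \min_j w_j}{\max_j w_j - \min_j w_j} $$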




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm

       C. Weighted clustering process
          The dataset with the selected attributes and their
          weights is used to generate k clusters by k-means
          clustering.
          Each cluster center is assigned a class label by a
          majority vote of its members.
       [Figure: The processes of TBWC]
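
The slides' own weighted distance appears only in the lost equation images. A common way to realize attribute-weighted k-means, assumed here, is a weighted squared Euclidean distance, which reduces to standard k-means on features pre-scaled by $\sqrt{w_j}$:

$$ d_w(x, c)^2 = \sum_j w_j (x_j - c_j)^2 = \sum_j \left( \sqrt{w_j}\, x_j - \sqrt{w_j}\, c_j \right)^2 $$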




 A Combination of Decision Tree Learning and Clustering for Data Classification
Tree Bagging and Weighted Clustering algorithm

       D. Prediction process
          When a new instance x is presented, the distances
          between all cluster centers and the instance x are
          measured.
          The prediction for the instance x is the class label
          of the nearest cluster center.
       [Figure: The processes of TBWC]
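
In code, the prediction step is a single nearest-center lookup; a minimal sketch (assuming the centers already live in the weighted feature space from the clustering step):

```python
# Nearest weighted cluster center decides the class of a new instance.
import numpy as np

def predict(centers, center_labels, w, x_selected):
    xw = np.sqrt(w) * x_selected                    # weight the new instance
    c = ((centers - xw) ** 2).sum(axis=1).argmin()  # index of nearest center
    return center_labels[c]
```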




  A Combination of Decision Tree Learning and Clustering for Data Classification
Datasets
     In the experiments, we use five datasets:
                                      TABLE 1. THE CHARACTERISTICS OF EACH DATASET

               datasets                      instances   attributes   attribute type               number of classes
1. Cardiotocography 1                          2,126          23      real                                 3
2. Cardiotocography 2                          2,126          23      real                                10
3. Internet Advertisements                     3,279       1,558      categorical, integer, real           2
4. Libras Movement                               360          91      real                                15
5. Multiple Features                           2,000         649      integer, real                       10

- Cardiotocography 1, Cardiotocography 2
Cardiotocography 1 and Cardiotocography 2 contain fetal heart rate (FHR) and uterine contraction (UC)
measurements.
- Internet Advertisements
Internet Advertisements concerns identifying whether an image on a web page is an advertisement.
- Libras Movement
Libras Movement concerns classifying hand movements.
- Multiple Features
Multiple Features concerns classifying handwritten numerals described by several feature sets.

 A Combination of Decision Tree Learning and Clustering for Data Classification
Experimental Results
     To evaluate the predictive performance of our proposed
     algorithm, we compared the predictive accuracy of TBWC
     with the C4.5 decision tree and k-means clustering.
                 TABLE 2. THE COMPARISONS OF THE PREDICTIVE ACCURACY OF EACH ALGORITHM

               Datasets                      C4.5      k-means     TBWC
               Cardiotocography 1           92.56%     86.97%     96.47%
               Cardiotocography 2           82.83%     62.65%     90.35%
               Internet Advertisements      96.19%     93.11%     96.65%
               Libras Movement              60.56%     43.89%     80.56%
               Multiple Features            92.60%     93.45%     96.55%




 A Combination of Decision Tree Learning and Clustering for Data Classification
Experimental Results

       [Figure: The predictive accuracy of each algorithm. Bar chart, accuracy (%) from 40 to 100,
       for Internet Advertisements, Cardiotocography 1, Cardiotocography 2, Multiple Features,
       and Libras Movement; series: C4.5, k-means, TBWC.]




 A Combination of Decision Tree Learning and Clustering for Data Classification
Experimental Results
                                    TABLE 3. PARAMETER SETTINGS FOR EACH DATASET

                            datasets                     number of models n   number of clusters k
                            Cardiotocography 1                   40                   100
                            Cardiotocography 2                   20                   100
                            Internet Advertisements              10                   100
                            Libras Movement                       5                    20
                            Multiple Features                    10                    80




        The number of models (n) is the number of C4.5 decision tree models
        produced by the bagging algorithm.
        The number of clusters (k) is the number of clusters in k-means
        clustering.
        For each dataset, the values of n and k are those of the model with
        the highest accuracy on a validation dataset.
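
The slides do not show how this search was performed; one straightforward realization, reusing the illustrative tbwc_fit / tbwc_predict sketched earlier, is a grid search over candidate (n, k) pairs scored on a held-out validation set:

```python
# Pick (n, k) by validation accuracy; grids mirror the values in Table 3.
import numpy as np

def select_parameters(X_tr, y_tr, X_val, y_val,
                      n_grid=(5, 10, 20, 40), k_grid=(20, 80, 100)):
    best = (None, None, -1.0)
    for n in n_grid:
        for k in k_grid:
            model = tbwc_fit(X_tr, y_tr, n_models=n, k=k)
            acc = np.mean([tbwc_predict(model, x) == y
                           for x, y in zip(X_val, y_val)])
            if acc > best[2]:
                best = (n, k, acc)
    return best          # (best n, best k, validation accuracy)
```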



 A Combination of Decision Tree Learning and Clustering for Data Classification
Experimental Results
                                    TABLE 4. ATTRIBUTE REDUCTION FOR EACH DATASET

                datasets                     number of attributes   number of selected attributes   attribute reduction (%)
                Cardiotocography 1                    23                        22                          4.35%
                Cardiotocography 2                    23                        22                          4.35%
                Internet Advertisements            1,558                       626                         59.82%
                Libras Movement                       91                        86                          5.49%
                Multiple Features                    649                       454                         30.05%



        The results show that the proposed algorithm has the ability to
        select important attributes.




 A Combination of Decision Tree Learning and Clustering for Data Classification
Conclusion
        The Tree Bagging and Weighted Clustering (TBWC)
        algorithm was proposed to enhance the efficiency of data
        classification.
        The TBWC algorithm consists of two main steps: attribute
        selection and classification of a new instance.
        In the attribute selection step, the bagging algorithm with
        C4.5 decision trees is used to select and weight attributes.
        Then k-means clustering is applied to assign a class label to
        a new instance.




 A Combination of Decision Tree Learning and Clustering for Data Classification
Conclusion
        The experimental results are summarized in the following
        aspects.
         1. The TBWC algorithm yields the highest accuracy
            compared with the decision tree and clustering
            on all datasets.
         2. The TBWC algorithm can greatly improve the
            predictive performance, especially for multi-class
            datasets such as Libras Movement and
            Cardiotocography 2.
         3. The TBWC algorithm can reduce attributes by up to
            59.82% on the Internet Advertisements dataset, and
            has a higher accuracy than the C4.5 decision tree and
            k-means clustering.

 A Combination of Decision Tree Learning and Clustering for Data Classification
