Outline Big Data Project Rule-based Classifier Class Imbalanced Problem Active Learning Ensemble Clustering Hybrid Classifier
Machine Learning for Data Mining
Dr. Dewan Md. Farid
Department of Computer Science & Engineering,
United International University, Bangladesh
December 01, 2016
Dr. Dewan Md. Farid: Department of Computer Science & Engineering, United International University, Bangladesh
Machine Learning for Data Mining
Big Data Project
Rule-based Classifier
Class Imbalanced Problem
Active Learning
Ensemble Clustering
Hybrid Classifier
Data Mining: What is Data Mining?
Data mining (DM) is also known as Knowledge Discovery from
Data, or KDD for short, which turns a large collection of data into
knowledge. DM is a multidisciplinary field including machine learning,
artificial intelligence, pattern recognition, knowledge-based systems,
high-performance computing, database technology and data visualisation.
1. Data mining is the process of analysing data from different
perspectives and summarising it into useful information.
2. Data mining is the process of finding hidden information and
patterns in a huge database.
3. Data mining is the extraction of implicit, previously unknown, and
potentially useful information from data.
Machine Learning
Machine learning (ML) provides the technical basis of data mining,
which concerns the construction and study of systems that can learn
from data.
1. Supervised learning/ Classification - the supervision in the
learning comes from the labeled instances.
2. Unsupervised learning/ Clustering - the learning process is
unsupervised since the instances are not class labeled.
3. Semi-supervised learning - uses both labelled and unlabelled
instances when learning a model.
4. Active learning - lets users play an active role in the learning
process. It asks a user (e.g., a domain expert) to label an instance,
which may be from a set of unlabelled instances.
Learning Algorithms
Decision Tree (DT) Induction
Naïve Bayes (NB) Classifier
NBTree Classifier
RainForest and BOAT Classifier
k Nearest Neighbour (kNN) Classifier
Random Forest, Bagging and Boosting (AdaBoost)
Support Vector Machines (SVM)
k Means Clustering
Similarity based Clustering
Mining Big Data
Mining big data is the process of extracting knowledge to uncover
hidden information from massive amounts of complex data or
databases. Big data is defined by the three V's:
Volume - the quantity of data.
Variety - the category of data.
Velocity - the speed of data in and out.
Some suggest throwing a few more V's into the mix:
Vision - having a purpose/plan.
Verification - ensuring that the data conforms to a set of
specifications.
Validation - checking that its purpose is fulfilled.
Big Data Project
1. BRiDGEIris - Brussels Big Data Platform for Sharing and Discovery
in Clinical Genomics.
Hosted by IB2 (Interuniversity Institute of Bioinformatics in Brussels).
Funded by INNOVIRIS (Brussels Institute for Research and Innovation).
2. FWO research project G004414N “Machine Learning for Data
Mining Applications in Cancer Genomics”.
BRiDGEIris Project
The Brussels big data platform for sharing and discovery in clinical
genomics project aims to address its research challenges through:
1. Design and creation of a multi-site clinical/phenomic and genomic
data warehouse.
2. Development of automated tools for extracting relevant information
from genetic data.
3. Use of the designed tools to extract new knowledge and transfer it
to the medical setting.
VUB AI Lab (CoMo)
The lab focuses on designing and developing strategies for
information discovery on genomic and clinical big data by employing
an optimal ensemble method. The goal is to evaluate ensemble
predictive modelling techniques for:
1. Improving the prediction accuracy of variant identification/ genomic
variants classification.
2. Pathology classification tasks.
Developing new methods/ algorithms to deal with the following issues:
Multi-class classification
High-dimensional data
Class imbalanced data
Big data
Brugada syndrome
Brugada syndrome (BrS), also known as sudden adult death
syndrome (SADS), is a genetic disease. It increases the risk of sudden
cardiac death (SCD) at a young age. The syndrome is named after the
Spanish cardiologists Pedro Brugada and Josep Brugada.
BrS is detected by abnormal electrocardiogram (ECG) findings called
a type 1 Brugada ECG pattern, which is much more common in men.
BrS is a heart rhythm disorder.
Sudden cardiac death (SCD) occurs when the heart doesn't pump
effectively and not enough blood travels to the rest of the body.
The Exome datasets of 148 patients have been analysed for Brugada syndrome
at UZ Brussel (Universitair Ziekenhuis Brussel) (www.uzbrussel.be/)
Knowledge Discovery from Genomic Data
(Diagram: Exome 1 … Exome 148 → data preprocessing → feature selection (gene panel) → formatted data → mining algorithm.)
Figure: The process of extracting knowledge from genomic data in data mining.
Genomic Data of BrS
Table: Classification of DNA variants for Brugada syndrome.
Class Label
Class I Nonpathogenic
Class II VUS1 - Unlikely pathogenic
Class III VUS2 - Unclear
Class IV VUS3 - Likely pathogenic
Class V Pathogenic
Gene Panel of BrS
Table: Gene panel of Brugada syndrome.
Chromosome Name of Gene
Chr 1 KCND3
Chr 3 SCN5A, GPD1L, SLMAP, CAV3, SCN10A
Chr 4 ANK2
Chr 7 CACNA2D1, AKAP9, KCNH2
Chr 10 CACNAB2
Chr 11 KCNE3, SCN3B, SCN2B, KCNJ5,
KCNQ1, SCN4B
Chr 12 CACNA1C, KCNJ8
Chr 15 HCN4
Chr 17 RANGRF, KCNJ2
Chr 19 SCN1B, TRPM4
Chr 20 SNTA1
Chr 21 KCNE1, KCNE2
Chr X KCNE1L
Chromosomes
(Bar chart: no. of variants per chromosome: 1, 3, 4, 7, 11, 12, 15, 17, 19, 21 and X.)
Figure: Chromosomes in 148 Exome Datasets.
Genomic Data
(Bar chart: no. of variants per Exome data set (1-148) for the annotated vcf file, the gene panel and the BrS variants.)
Figure: Genomic Data: 148 Exome Datasets.
Rule-based Classifier
A rule-based classifier can easily deal with complex classification problems.
It has various advantages:
As highly expressive as a DT
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to a DT
New rules can be added to existing rules without disturbing the ones
already there
Rules can be executed in any order
Adaptive Rule-based Classifier
It combines the random subspace and boosting approaches with an
ensemble of decision trees to construct a set of classification rules for
multi-class classification of biological big data.
Random subspace method (or attribute bagging) to avoid
overfitting
Boosting approach for classifying noisy instances
Ensemble of decision trees to deal with class-imbalanced data
It uses two popular classification techniques: decision tree (DT) and
k-nearest-neighbour (kNN) classifiers.
DTs are used for evolving classification rules from the training data.
kNN is used for analysing the misclassified instances and resolving
ambiguity between contradictory rules.
Random Subspace & Boosting Method
Random subspace is an ensemble classifier. It consists of several
classifiers each operating in a subspace of the original feature space, and
outputs the class based on the outputs of these individual classifiers.
It has been used for decision trees (random decision forests).
It is an attractive choice for high dimensional data.
Boosting is designed specifically for classification.
It converts weak classifiers to strong ones.
It is an iterative process.
It uses voting for classification to combine the output of individual
classifiers.
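The two ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the proposed algorithm: it assumes a deliberately simple nearest-centroid base learner in place of a decision tree, trains each learner on a random feature subset (the random subspace idea), and combines their outputs by voting.

```python
import random
from collections import Counter

def centroid_fit(X, y, features):
    """Fit a nearest-centroid learner on a feature subset (a deliberately
    simple stand-in for the decision trees used in practice)."""
    centroids = {}
    for cl in set(y):
        points = [x for x, label in zip(X, y) if label == cl]
        centroids[cl] = [sum(p[f] for p in points) / len(points) for f in features]
    return features, centroids

def centroid_predict(model, x):
    """Predict the class whose centroid is closest in the learner's subspace."""
    features, centroids = model
    projected = [x[f] for f in features]
    return min(centroids, key=lambda cl: sum((a - b) ** 2
                                             for a, b in zip(projected, centroids[cl])))

def random_subspace_fit(X, y, n_learners=5, subspace=2, seed=42):
    """Train one learner per random feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    return [centroid_fit(X, y, rng.sample(range(n_features), subspace))
            for _ in range(n_learners)]

def random_subspace_predict(ensemble, x):
    """Combine the individual outputs by majority vote."""
    votes = [centroid_predict(model, x) for model in ensemble]
    return Counter(votes).most_common(1)[0][0]
```

Each learner sees only `subspace` of the available features, which is what makes the method attractive for high-dimensional data.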
Ensemble Classifier
Figure: An example of an ensemble classifier.
Decision Tree Induction
Decision tree (DT) induction is a top-down, recursive, divide-and-conquer
algorithm for multi-class classification tasks. The goal of DT induction is to
iteratively partition the data into smaller subsets until each subset
belongs to a single class. It is easy to interpret and explain, and
requires little prior knowledge.
Information Gain: ID3 (Iterative Dichotomiser) algorithm
Gain Ratio: C4.5 algorithm
Gini Index: CART algorithm
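All three splitting criteria can be computed directly from class-label counts; a minimal Python sketch:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list (the basis of Information Gain in ID3)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index used by CART: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Gain of a split: parent entropy minus the weighted entropy of the partitions."""
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in partitions)
```

For example, a perfectly balanced binary node has entropy 1.0, and a split that separates the classes completely has an information gain of 1.0.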
Algorithm 1 Decision Tree Induction
Input: D = {x1, · · · , xi , · · · , xN }
Output: A decision tree, DT.
Method:
1: DT = ∅;
2: find the root node with best splitting, Aj ∈ D;
3: DT = create the root node;
4: DT = add arc to root node for each split predicate and label;
5: for each arc do
6: Dj created by applying splitting predicate to D;
7: if stopping point reached for this path, then
8: DT = create a leaf node and label it with cl ;
9: else
10: DT = DTBuild(Dj );
11: end if
12: DT = add DT to arc;
13: end for
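A minimal Python sketch of Algorithm 1 for categorical features, using information gain as the splitting criterion. The nested-dict tree layout and helper names are illustrative assumptions, not part of the algorithm's specification.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label list."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """Recursive divide-and-conquer induction in the spirit of Algorithm 1:
    find the best split, create one arc per split value, and recurse until a
    stopping point (pure subset or no features left) yields a leaf."""
    if len(set(labels)) == 1 or not features:        # stopping point: leaf node
        return Counter(labels).most_common(1)[0][0]
    def gain(f):                                     # information gain of feature f
        parts = {}
        for row, cl in zip(rows, labels):
            parts.setdefault(row[f], []).append(cl)
        return entropy(labels) - sum(len(p) / len(labels) * entropy(p)
                                     for p in parts.values())
    best = max(features, key=gain)                   # root node with best splitting
    node = {'attr': best, 'arcs': {}}
    for value in set(row[best] for row in rows):     # add one arc per split value
        sub = [(r, c) for r, c in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        node['arcs'][value] = build_tree(list(sub_rows), list(sub_labels),
                                         [f for f in features if f != best])
    return node
```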
K-Nearest-Neighbour (kNN) Classifier
The k-nearest-neighbour (kNN) is a simple classifier. It uses
distance measurement techniques that are widely used in pattern recognition.
kNN finds the k instances, X = {x1, x2, · · · , xk } ∈ Dtraining, that are closest to
the test instance, xtest, and assigns the most frequent class label,
cl → xtest, among X. When a classification is to be made for a new
instance, xnew , its distance to each xi ∈ Dtraining must be determined.
Only the k closest instances, X ∈ Dtraining, are considered further.
Closeness is defined in terms of a distance metric, such as the Euclidean
distance. The Euclidean distance between two points,
x1 = (x11, x12, · · · , x1n) and x2 = (x21, x22, · · · , x2n), is shown in Eq. 1
dist(x1, x2) = sqrt( Σ_{i=1}^{n} (x1i − x2i )² )  (1)
Algorithm 2 k-Nearest-Neighbour classifier
Input: D = {x1, · · · , xi , · · · , xn}
Output: kNN classifier, kNN.
Method:
1: find X ⊆ D that identifies the k nearest neighbours of the test instance,
regardless of class label, cl .
2: out of these instances, X = {x1, x2, · · · , xk }, identify the number of
instances, ki , that belong to class cl , l = 1, 2, · · · , M. Obviously,
Σi ki = k.
3: assign xtest to the class cl with the maximum number of ki of instances.
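Eq. 1 and Algorithm 2 translate directly into Python. The list-of-(point, label) training format below is an assumption made for illustration.

```python
from collections import Counter
from math import sqrt

def euclidean(x1, x2):
    """Eq. 1: Euclidean distance between two n-dimensional points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(training, x_test, k=3):
    """Algorithm 2: find the k training instances closest to x_test and
    return the most frequent class label among them."""
    neighbours = sorted(training, key=lambda inst: euclidean(inst[0], x_test))[:k]
    return Counter(cl for _, cl in neighbours).most_common(1)[0][0]

# Example (hypothetical toy data): two clusters of labelled points.
training = [([0, 0], 'a'), ([0, 1], 'a'), ([1, 0], 'a'),
            ([5, 5], 'b'), ([6, 5], 'b'), ([5, 6], 'b')]
```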
Constructing Classification Rules
Extracting classification rules from DTs is an easy and well-known process.
Rules are as expressive as the DT, so the performance of a rule-based
classifier is comparable to that of a DT.
One rule is generated for each leaf of the DT.
Each path in the DT from the root node to a leaf node corresponds to
a rule.
The tree corresponds exactly to the classification rules.
DT vs. Rules
New rules can be added to an existing rule set without disturbing the ones
already there, whereas adding to a tree structure may require reshaping
the whole tree. Rules can be executed in any order.
Algorithm: Adaptive rule-based (ARB) classifier
It considers a series of k iterations.
Initially, an equal weight, 1/N, is assigned to each training instance.
The weights of the training instances are adjusted according to how
they are classified in each iteration.
In each iteration, a sub-dataset Dj is created from the original
training dataset D and the previous sub-dataset Dj−1 with the
maximum-weighted instances. Only in the first iteration is the
sampling-with-replacement technique used to create the sub-dataset
D1 from the original training data D.
A tree DTj is built from the sub-dataset Dj with randomly selected
features in each iteration.
Each rule is generated for each leaf node of DTj .
Each path in DTj from the root to a leaf corresponds with a rule.
Algorithm 3 Adaptive rule-based classifier.
Input:
D = {x1, · · · , xi , · · · , xN }, training dataset;
k, number of iterations;
DT learning scheme;
Output: rule-set; // A set of classification rules.
Method:
1: rule-set = ∅;
2: for i = 1 to N do
3: wi = 1/N; // initialising the weight of each xi ∈ D.
4: end for
5: for j = 1 to k do
6: if j==1 then
7: create Dj , by sampling D with replacement;
8: else
9: create Dj , by Dj−1 and D with maximum weighted X;
10: end if
11: build a tree, DTj ← Dj by randomly selected features;
12: compute error(DTj ); // the error rate of DTj .
13: if error(DTj ) ≥ threshold-value then
14: go back to step 6 and try again;
15: else
16: rules ← DTj ; // extracting the rules from DTj .
17: end if
18: for each xi ∈ Dj that was correctly classified do
19: multiply the weight of xi by error(DTj )/(1 − error(DTj )); // update weights.
20: end for
21: normalise the weight of each xi ∈ Dj ;
22: rule-set = rule-set ∪ rules;
23: end for
24: create sub-dataset, Dmisclassified , with the misclassified instances from Dj ;
25: analyse Dmisclassified employing Algorithm 4;
26: return rule-set;
Error Rate Calculation
The error rate of DTj is calculated as the sum of the weights of the
misclassified instances, as shown in Eq. 2, where err(xi ) is the
misclassification error of an instance xi : if xi is misclassified, then err(xi ) is
one; otherwise, err(xi ) is zero (correctly classified).
error(DTj ) = Σ_{i=1}^{n} wi × err(xi )  (2)
If error rate of DTj is less than the threshold-value, then rules are
extracted from DTj .
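Eq. 2 and the corresponding weight-update step of Algorithm 3 can be sketched in Python. The sketch assumes 0 < error < 1, which holds whenever rules are actually extracted (the error is below the threshold).

```python
def weighted_error(weights, predictions, truth):
    """Eq. 2: error(DTj) as the sum of the weights of misclassified instances
    (err(xi) is 1 when prediction and truth disagree, 0 otherwise)."""
    return sum(w for w, p, t in zip(weights, predictions, truth) if p != t)

def update_weights(weights, predictions, truth):
    """Weight update from Algorithm 3: multiply the weight of every correctly
    classified instance by error/(1 - error), then normalise so the weights
    sum to one. Assumes 0 < error < 1."""
    err = weighted_error(weights, predictions, truth)
    factor = err / (1 - err)
    new = [w * factor if p == t else w
           for w, p, t in zip(weights, predictions, truth)]
    total = sum(new)
    return [w / total for w in new]
```

After the update, misclassified instances carry relatively more weight, so the next iteration concentrates on them.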
Mining Big Data with Rules
Big data is so big (millions of instances) that we cannot process all
the instances together at the same time.
It is not possible to store all the data in main memory at once.
We can create several smaller samples (or subsets) of the big data,
each of which fits in main memory.
Each subset of data is used to construct a set of rules, resulting in
several sets of rules.
The rule sets are then examined and merged to construct the final
set of classification rules for the big data.
This works because new rules can be added to existing rules and
rules can be executed in any order.
Mining Big Data with Rules (con.)
(Diagram: big data → sub-data 1 … N → adaptive rule-based classifier on each sub-data → integrating rules → final classification rules.)
Figure: Mining big data using adaptive rule-based classifier.
Reduced-Error Pruning
Split the original data into two parts: (a) a growing set, and (b) a
pruning set.
Rules are generated using the growing set only. So, important rules
might be missed because some key instances were assigned to the
pruning set.
A rule generated from the growing set is deleted, and the effect is
evaluated by trying out the truncated rule on the pruning set and
seeing whether it performs better than the original rule.
If the new truncated rule performs better, then this new rule is added
to the rule set.
This process continues for each rule and for each class.
The overall best rules are established by evaluating the rules on the
pruning set.
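One simplified reading of this procedure, sketched in Python: rules are hypothetical (conditions, class) pairs, whole rules are deleted rather than individual conditions, and the default class for unmatched instances is an assumption.

```python
def rule_accuracy(rules, data, default='no'):
    """Accuracy of a rule list on a data set of (attribute-dict, class) pairs.
    The first matching rule fires; `default` (an assumption) covers instances
    that no rule matches."""
    def classify(x):
        for conditions, cl in rules:
            if all(x.get(attr) == val for attr, val in conditions):
                return cl
        return default
    return sum(classify(x) == cl for x, cl in data) / len(data)

def reduced_error_prune(rules, pruning_set):
    """Delete each rule in turn; keep the deletion whenever accuracy on the
    pruning set does not drop (a simplified, whole-rule version of
    reduced-error pruning)."""
    pruned = list(rules)
    for rule in list(rules):
        candidate = [r for r in pruned if r != rule]
        if candidate and rule_accuracy(candidate, pruning_set) >= rule_accuracy(pruned, pruning_set):
            pruned = candidate
    return pruned
```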
Algorithm: Analysing Misclassified Instances
To check the classes of misclassified instances, we used the kNN classifier
with a feature selection and weighting approach.
We applied DT induction for the feature selection and weighting.
We build a tree from the misclassified instances.
Each feature that is tested in the tree, Aj ∈ Dmisclassified , is assigned
a weight 1/d, where d is the depth of the tree.
Features that are not tested in the tree are not considered in the
similarity measure of the kNN classifier.
We apply kNN classifier to classify each misclassified instance based
on the weighted features.
We update the class label of misclassified instances.
We check for the contradictory rules, if there is any.
Algorithm 4 Analysing misclassified instances
Input: D, original training data;
Dmisclassified , dataset with misclassified instances;
Output: A set of instances, X with right class labels.
Method:
1: build a tree, DT using Dmisclassified ;
2: for each Aj ∈ Dmisclassified do
3: if Aj is tested in DT then
4: assign weight 1/d to Aj , where d is the depth of DT;
5: else
6: do not consider Aj for the similarity measure;
7: end if
8: end for
9: for each xi ∈ Dmisclassified do
10: find X ∈ D using the similarity over the weighted features A =
{A1, · · · , Aj , · · · , An};
11: find the most frequent class, cl , in X;
12: assign xi ← cl ;
13: end for
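Algorithm 4's weighted similarity measure can be sketched as follows. The per-feature weights (1/d for a feature tested at depth d, 0 for untested features) are supplied by the caller; the values used below are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def weighted_distance(x1, x2, feature_weights):
    """Euclidean distance with each feature scaled by its weight
    (1/d for a feature tested at tree depth d, 0 for untested features,
    which drops them from the similarity measure)."""
    return sqrt(sum(w * (a - b) ** 2
                    for a, b, w in zip(x1, x2, feature_weights)))

def relabel_misclassified(training, misclassified, feature_weights, k=3):
    """Algorithm 4 sketch: give each misclassified instance the most frequent
    class among its k nearest neighbours under the weighted distance."""
    relabelled = []
    for x in misclassified:
        nearest = sorted(training,
                         key=lambda inst: weighted_distance(inst[0], x,
                                                            feature_weights))[:k]
        relabelled.append((x, Counter(cl for _, cl in nearest).most_common(1)[0][0]))
    return relabelled
```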
Performance Measurement
The classification accuracy:
accuracy = ( Σ_{i=1}^{|X|} assess(xi ) ) / |X| ,  xi ∈ X  (3)
If xi is correctly classified then assess(xi ) = 1; if xi is misclassified then
assess(xi ) = 0.
precision = TP / (TP + FP)  (4)
recall = TP / (TP + FN)  (5)
F-score = (2 × precision × recall) / (precision + recall)  (6)
Experiments on Exome datasets
We compared the performance of the proposed ARB classifier against the
RainForest, NB and kNN classifiers on 148 Exome datasets. The ARB
classifier correctly classifies 91% of the gene variants for BrS using the
training data. We used five iterations of the proposed ARB classifier on
each Exome dataset.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and
proposed ARB classifier using training data.
Algorithm | Classification accuracy (%) | Precision (weighted avg.) | Recall (weighted avg.) | F-score (weighted avg.)
RainForest 83.33 0.76 0.83 0.79
NB 83.33 0.79 0.83 0.78
kNN 75 0.56 0.75 0.64
ARB classifier 91.66 0.95 0.91 0.92
Experiments on Exome datasets (con.)
The performance of the proposed ARB classifier against the RainForest, NB
and kNN classifiers using 10-fold cross-validation on the 148 Exome
datasets.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and
the proposed ARB classifier using 10-fold cross-validation.
Algorithm | Classification accuracy (%) | Precision (weighted avg.) | Recall (weighted avg.) | F-score (weighted avg.)
RainForest 58.33 0.46 0.58 0.51
NB 58.33 0.63 0.58 0.6
kNN 50 0.33 0.5 0.4
ARB classifier 75 0.73 0.75 0.68
Experiments on Exome datasets (con.)
The performance of the proposed ARB classifier against the RainForest, NB
and kNN classifiers on unseen test variants from 45 Exome datasets;
103 Exome datasets were used for training the models.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and
proposed ARB classifier using testing data.
Algorithm | Classification accuracy (%) | Precision (weighted avg.) | Recall (weighted avg.) | F-score (weighted avg.)
RainForest 50 0.33 0.5 0.4
NB 50 0.25 0.5 0.62
kNN 50 0.25 0.5 0.33
ARB classifier 66.66 0.44 0.66 0.53
Benchmark Life Sciences Datasets
Table: 10 real benchmark life sciences datasets from UCI (University of
California, Irvine) machine learning repository.
No. Datasets Instances No of Att. Att. Types Classes
1 Appendicitis 106 7 Numeric 2
2 Breast cancer 286 9 Nominal 2
3 Contraceptive 1473 9 Numeric 3
4 Ecoli 336 7 Numeric 8
5 Heart 270 13 Numeric 2
6 Pima diabetes 768 8 Numeric 2
7 Iris 150 4 Numeric 3
8 Soybean 683 35 Nominal 19
9 Thyroid 215 5 Numeric 2
10 Yeast 1484 8 Numeric 10
Classification Accuracy
Table: The classification accuracy (%) of C4.5, kNN, naïve Bayes (NB) and the
proposed adaptive rule-based classifier with 10-fold cross-validation.
Datasets C4.5 kNN NB Proposed
classifier
Appendicitis 85.84 86.79 85.84 87.73
Breast cancer 75.52 73.42 71.67 75.52
Contraceptive 50.98 49.76 48.13 50.1
Ecoli 79.76 83.03 78.86 83.92
Heart 77.40 78.88 83.7 83.7
Pima diabetes 73.82 73.17 76.3 75.65
Iris 96 95.33 96 95.33
Soybean 91.50 90.19 92.97 91.94
Thyroid 98.13 97.2 98.13 98.13
Yeast 56.73 56.94 57.88 61.99
Classification Accuracy (con.)
(Bar chart: classification accuracy of C4.5, kNN, NB and the adaptive rule-based classifier on the 10 UCI benchmark life sciences data sets.)
Accuracy with 20% noisy instances
(Bar chart: classification accuracy of C4.5, kNN, NB and the adaptive rule-based classifier on the same data sets with 20% noisy instances.)
Data Balancing Methods
Classification of multi-class imbalanced data is a difficult task, as real
data sets are noisy, high-dimensional and have small sample sizes, which
results in overfitting and overlapping of classes.
Traditional machine learning algorithms are much more successful at
classifying majority class instances than minority class instances.
The conventional data balancing methods alter the original data
distribution, so they might suffer from overfitting or drop some
potentially useful information.
We propose a new method for dealing with multi-class imbalanced data
based on clustering and selecting the most informative instances from the
majority classes.
Classifying Imbalanced Data
Machine learning algorithms successfully classify majority class instances,
but misclassify the minority class instances in many high-dimensional
data sets.
The following methods are used for class imbalance problems:
1. Sampling methods
Under-sampling
Over-sampling
2. Cost-sensitive learning methods (difficult to get the accurate
misclassification cost)
3. Ensemble methods
Bagging
Boosting
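Random under-sampling, the simplest of these, can be sketched in a few lines of Python (the (instance, label) output format is an assumption for illustration):

```python
import random
from collections import Counter

def random_undersample(instances, labels, seed=42):
    """Randomly keep only as many instances of each class as the rarest
    class has, so all classes end up equally represented."""
    rng = random.Random(seed)
    target = min(Counter(labels).values())
    balanced = []
    for cl in set(labels):
        members = [(x, y) for x, y in zip(instances, labels) if y == cl]
        balanced.extend(rng.sample(members, target))
    return balanced
```

The trade-off noted above is visible here: the dropped majority instances, and whatever information they carried, never reach the learner.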
Proposed Data Balancing Method
Initially, we cluster the majority-class instances into several clusters.
We find the most informative instances in each cluster: those close
to the cluster centers and cluster borders.
Several data sets are then created by combining each cluster's most
informative instances with the minority-class instances.
Every data set should have an almost equal number of minority- and
majority-class instances.
Finally, multiple classifiers are trained on these data sets, and a
voting technique is used to classify existing/new instances.
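The steps above can be sketched as follows. This is a minimal illustration, not the exact method: an even split of the majority class stands in for the clustering-plus-informative-instance step, and `train`/`classify` are placeholder hooks for any base learner (a toy 1-NN model in the usage lines).

```python
def balance_and_vote(majority, minority, n_sets, train, classify):
    """Train one classifier per balanced data set and combine their votes."""
    # stand-in for clustering + informative-instance selection: split the
    # majority-class instances into n_sets groups
    groups = [majority[i::n_sets] for i in range(n_sets)]
    # each balanced data set = one majority group + all minority-class instances
    models = [train(group + minority) for group in groups]

    def predict(x):
        votes = [classify(m, x) for m in models]
        return max(set(votes), key=votes.count)  # majority vote
    return predict

# usage: instances are (value, label) pairs; 1-NN as a toy base learner
train = lambda data: data
classify = lambda model, x: min(model, key=lambda inst: abs(inst[0] - x))[1]
majority = [(v, 'maj') for v in range(10, 20)]   # 10 majority instances
minority = [(0, 'min'), (1, 'min')]              # 2 minority instances
predict = balance_and_vote(majority, minority, 3, train, classify)
```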
Proposed Data Balancing Method (con.)
Figure: Proposed data balancing method. (The imbalanced data is split into majority- and minority-class instances; the majority-class instances are grouped into clusters 1..N; the informative instances of each cluster are combined with the minority-class instances to form balanced data sets 1..N; classifiers 1..N are trained on these sets and their votes are combined to predict new data instances.)
Performance of Data Balancing Methods
The performance of the data balancing methods is measured using the area
under the ROC (Receiver Operating Characteristic) curve (AUC) on 2143
Brugada syndrome (BrS) variants from 148 Exome data sets.
Table: Average AUC values of 148 imbalanced Exome data sets for different
imbalance data handling methods.
Algorithm Average AUC value
Random Under-Sampling 0.8923
Random Over-Sampling 0.8673
Bagging 0.8915
Boosting 0.9136
Proposed Method 0.9317
Active Learning
It achieves high accuracy because the number of instances needed to
learn a concept can often be much lower than the number required in
typical supervised learning.
It interactively queries a user/expert for the class labels of
unlabeled instances.
The objective is to train a classifier using as few labeled instances as
possible by selecting the most informative instances.
Let the data D contain both a set of labeled data, DL, and a set of
unlabeled data, DU. Initially, a model M∗ is trained using DL. A querying
function is then used to select unlabeled instances, XU ∈ DU, and a user
is asked to label them, XU → XL. XL is added to DL and M∗ is trained
again. The process repeats until the user is satisfied.
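The loop just described can be sketched as a small pool-based routine; `train`, `query`, and `oracle` are assumed hooks for the model builder, the querying function, and the human labeler (the usage lines use trivial stand-ins):

```python
def active_learn(labeled, unlabeled, train, query, oracle, rounds):
    """Pool-based active learning: label the most informative instances first."""
    model = train(labeled)                  # initial model M* trained on D_L
    for _ in range(rounds):
        if not unlabeled:
            break
        x = query(model, unlabeled)         # querying function selects X_U
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))      # user labels X_U -> X_L, added to D_L
        model = train(labeled)              # retrain M*
    return model

# usage with trivial hooks: the "model" is just a copy of the labeled pool
labeled, unlabeled = [(0, 'neg')], [3, 7]
oracle = lambda x: 'pos' if x >= 5 else 'neg'
model = active_learn(labeled, unlabeled, lambda d: list(d),
                     lambda m, pool: pool[0], oracle, rounds=2)
```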
Active Learning (con.)
Figure: Active learning process. (The data D is split into labeled data DL and unlabeled data DU; an ensemble model M∗ is trained on DL; selected unlabeled instances XU are labeled by a user/oracle as XL and added to DL.)
Proposed Method
The naïve Bayes (NB) classifier and clustering are used to find the most
informative instances for labeling as part of active learning. Unlabeled
instances are selected for labeling using the following two strategies:
Instances close to the centers and borders of clusters.
Instances whose posterior probabilities for the most likely classes are
equal or very close.
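The second strategy can be sketched as margin-based selection; here `posterior` is assumed to be any posterior estimate (e.g. from an NB classifier) and `eps` a closeness tolerance, both hypothetical names:

```python
def select_uncertain(unlabeled, posterior, classes, eps=0.05):
    """Pick instances whose two highest class posteriors are (almost) equal."""
    chosen = []
    for x in unlabeled:
        probs = sorted((posterior(x, c) for c in classes), reverse=True)
        if probs[0] - probs[1] <= eps:   # small margin => informative instance
            chosen.append(x)
    return chosen

# usage: hypothetical posteriors for two instances over classes 'p' and 'n'
post = {('a', 'p'): 0.51, ('a', 'n'): 0.49, ('b', 'p'): 0.90, ('b', 'n'): 0.10}
posterior = lambda x, c: post[(x, c)]
```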
Performance of Ensemble Methods
Adaptive boosting (the AdaBoost algorithm) is used with the NB classifier
as the base classifier.
Table: The accuracy and F-score of ensemble methods on 2143 DNA variants
of Brugada syndrome.
Algorithm Classification accuracy (%) F-score (weighted avg.)
Random Forest 92.3 0.93
Bagging 87.5 0.83
Boosting 91.66 0.90
AdaBoost with NB classifier 94.73 0.93
Clustering of high-dimensional big data
An ensemble clustering method with feature selection and grouping
approach.
K-means clustering.
Similarity-based clustering.
Biclustering (applied to each cluster generated by the ensemble
clustering to find sub-matrices).
Unlabelled genomic data of Brugada syndrome (148 Exome
datasets).
The proposed method selects the most relevant features in the dataset
and groups them into subsets of features to overcome the problems
associated with traditional clustering methods.
Clustering
It is the process of grouping a set of instances into clusters (subsets or
groups) so that instances within a cluster have high similarity in
comparison to one another, but are very dissimilar to instances in other
clusters.
Let X be the unlabelled data set, that is,

X = {x1, x2, · · · , xN};  (7)

The partition of X into k clusters, C1, · · · , Ck, must satisfy the
following conditions:

Ci ≠ ∅, i = 1, · · · , k;  (8)

C1 ∪ C2 ∪ · · · ∪ Ck = X;  (9)

Ci ∩ Cj = ∅, i ≠ j, i, j = 1, · · · , k;  (10)
Challenges
Pattern extraction from genomic big data.
Genomic data is often too big and too messy.
Genomic data is also high-dimensional, so traditional distance
measures may be dominated by the noise in many dimensions.
In genomic data, we need to find not only the clusters of instances
(genes), but for each cluster a set of features (conditions).
k-Means
It computes the mean value of the instances {xi1, xi2, · · · , xiN } ∈ Ci.
It randomly selects k instances, {xk1, xk2, · · · , xkN } ∈ X, each of
which initially represents a cluster center.
Each remaining instance xi ∈ X is assigned to the cluster to which
it is most similar.
Similarity is measured by the Euclidean distance between xi and the
mean of Ci.
It iteratively reduces the within-cluster variation.
A high degree of similarity among instances within a cluster is obtained,
while a high degree of dissimilarity among instances in different clusters
is achieved simultaneously. The cluster mean of Ci = {xi1, xi2, · · · , xiN }
is defined in equation 11.

Mean(Ci) = (xi1 + xi2 + · · · + xiN) / N  (11)
Algorithm 5 k-Means Clustering
Input: X = {x1, x2, · · · , xN } // A set of unlabelled instances.
k // the number of clusters
Output: A set of k clusters.
Method:
1: arbitrarily choose k instances, {xk1, xk2, · · · , xkN } ∈ X, as the
initial k cluster centers;
2: repeat
3: (re)assign each xi ∈ X to the cluster to which it is most similar,
based on the mean value of that cluster's instances;
4: update the cluster means, that is, recalculate the mean value of the
instances for each cluster;
5: until no change
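Algorithm 5 can be written out as a self-contained sketch; as one valid way of choosing the centers "arbitrarily", it takes the first k instances as the initial centers:

```python
import math

def k_means(X, k, max_iter=100):
    """k-means: assign each instance to the nearest center, then update means."""
    centers = [list(x) for x in X[:k]]          # arbitrary initial centers
    assign = [0] * len(X)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):               # (re)assignment step
            best = min(range(k), key=lambda c: math.dist(x, centers[c]))
            if best != assign[i]:
                assign[i], changed = best, True
        for c in range(k):                      # update step: cluster means (Eq. 11)
            members = [X[i] for i in range(len(X)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:                         # "until no change"
            break
    return assign, centers

assign, centers = k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```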
Similarity-Based Clustering (SCM)
It is robust with respect to initialising the number of clusters.
It detects clusters of different volumes.
Let sim(xi, xl) be the similarity measure between instance xi and the
lth cluster center xl. The goal is to find xl to maximise the total
similarity measure shown in Eq. 12.

Js(C) = Σ_{l=1}^{k} Σ_{i=1}^{N} f(sim(xi, xl))  (12)

where f(sim(xi, xl)) is a reasonable similarity measure and
C = {C1, · · · , Ck}. In general, SCM uses feature values to check the
similarity between instances; however, any suitable distance measure can
be used to check the similarity between the instances.
Algorithm 6 Similarity-based Clustering
Input: X = {x1, x2, · · · , xN } // A set of unlabelled instances.
Output: A set of clusters, C = {C1, C2, · · · , Ck }.
Method:
1: C = ∅;
2: k = 1;
3: Ck = {x1};
4: C = C ∪ Ck ;
5: for i = 2 to N do
6: for l = 1 to k do
7: find the lth cluster center xl ∈ Cl to maximize the similarity
measure, sim(xi , xl );
8: end for
9: if sim(xi , xl ) ≥ threshold value then
10: Cl = Cl ∪ xi
11: else
12: k = k + 1;
13: Ck = {xi };
14: C = C ∪ Ck ;
15: end if
16: end for
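Algorithm 6 in runnable form. Two details are illustrative assumptions, not part of the algorithm as stated: the similarity function 1/(1 + Euclidean distance), and the use of the cluster mean as the cluster center xl:

```python
import math

def scm(X, threshold):
    """Similarity-based clustering: grow a cluster while similarity stays high."""
    sim = lambda a, b: 1.0 / (1.0 + math.dist(a, b))       # assumed similarity
    center = lambda c: [sum(v) / len(c) for v in zip(*c)]  # assumed center
    clusters = [[X[0]]]                                    # C_1 = {x_1}
    for x in X[1:]:
        # find the cluster center maximising sim(x_i, x_l)
        best = max(range(len(clusters)), key=lambda l: sim(x, center(clusters[l])))
        if sim(x, center(clusters[best])) >= threshold:
            clusters[best].append(x)                       # C_l = C_l U {x_i}
        else:
            clusters.append([x])                           # open a new cluster
    return clusters

clusters = scm([(0, 0), (0, 1), (10, 10)], threshold=0.3)
```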
Ensemble Clustering
Ensemble clustering is a process of integrating multiple clustering
algorithms to form a single strong clustering approach that usually
provides better clustering results. It generates a set of clusters from a
given unlabelled data set and then combines the clusters into final
clusters to improve the quality of individual clustering.
No single cluster analysis method is optimal.
Different clustering methods may produce different clusters, because
they impose different structure on the data set.
Ensemble clustering performs more effectively on high-dimensional,
complex data.
It is a good alternative when facing difficult cluster analysis problems.
Ensemble clustering (con.)
Generally, three strategies are applied in ensemble clustering:
1. Using different clustering algorithms on the same data set to create
heterogeneous clusters.
2. Using different samples/subsets of the data with different clustering
algorithms to produce component clusters.
3. Running the same clustering algorithm many times on the same data set
with different parameters or initialisations to create homogeneous
clusters.
The main goal of ensemble clustering is to integrate the component
clusterings into one final clustering with higher accuracy.
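One common way to integrate component clusterings, shown here as an illustration rather than the document's specific method, is a co-association (consensus) matrix: instances that co-cluster in at least half of the component clusterings are merged into one final cluster.

```python
def consensus(labelings, n):
    """Combine component clusterings of n instances via co-association."""
    # M[i][j] = fraction of clusterings in which instances i and j co-cluster
    M = [[sum(lab[i] == lab[j] for lab in labelings) / len(labelings)
          for j in range(n)] for i in range(n)]
    # final clusters: connected components of the graph where M[i][j] >= 0.5
    final, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        comp, stack = set(), [i]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(v for v in range(n) if M[u][v] >= 0.5 and v not in comp)
        seen |= comp
        final.append(sorted(comp))
    return final

# two component clusterings of 4 instances that mostly agree
labelings = [[0, 0, 1, 1], [0, 0, 2, 1]]
```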
Ensemble clustering on genomic/ biological data
Pattern extraction from genomic data by applying ensemble clustering.
Figure: Pipeline for mining big biological data: data → preprocessing → feature selection → feature grouping → ensemble clustering → biclustering → hidden patterns in data.
Data Pre-processing
It transforms raw data into an understandable format, which includes
several techniques:
Data cleaning is the process of dealing with missing values.
Data integration merges data from multiple sources into a coherent
data store, such as a data warehouse, or integrates metadata.
Data transformation includes the following: (a) normalisation, (b)
aggregation, (c) generalisation, and (d) feature construction.
Data reduction obtains a reduced representation of the data set
(eliminating redundant features/instances).
Data discretisation reduces the number of values of a continuous
feature by dividing its range into intervals.
Feature Selection
It is the process of selecting a subset of relevant features from the
original features in the data.
Feature selection is mainly used for three reasons:
Simplification of models
Shorter training times
Reducing overfitting
In biological data, features may contain false correlations, and the
information they add may already be contained in other features. In this
work, we applied an unsupervised feature selection approach based on
measuring similarities between features by the maximum information
compression index. We quantified the information loss in feature
selection using an entropy measure. After selecting the subset of features
from the data, we grouped them into two groups: nominal and numeric features.
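The maximum information compression index of a feature pair is, per Mitra et al., the smallest eigenvalue of their 2x2 covariance matrix; it is zero exactly when one feature is a linear function of the other. A hedged sketch for two numeric features:

```python
def mici(x, y):
    """Maximum information compression index: smallest eigenvalue of the
    2x2 covariance matrix of features x and y (0 => perfectly correlated)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    trace, det = var_x + var_y, var_x * var_y - cov * cov
    # smallest root of lambda^2 - trace*lambda + det = 0
    return (trace - (trace * trace - 4 * det) ** 0.5) / 2
```

A redundant feature pair (y = 2x below) scores near zero and is a candidate for removal; a weakly related pair scores higher.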
Subspace Clustering
The subspace clustering finds subspace clusters in high-dimensional data.
It can be classified into three groups:
1. Subspace search methods.
2. Correlation-based clustering methods.
3. Biclustering methods.
A subspace search method searches various subspaces for clusters (sets
of instances that are similar to each other in a subspace) of the full
space. It uses two kinds of strategies:
Bottom-up approach: start from low-dimensional subspaces and
search higher-dimensional subspaces.
Top-down approach: start with the full space and recursively search
smaller subspaces.
Algorithm 7 δ-Biclustering
Input: E, a data matrix, and δ ≥ 0, the maximum acceptable mean squared
residue score.
Output: EIJ, a δ-bicluster that is a submatrix of E with row set I and
column set J, with a score no larger than δ.
Initialization: I and J are initialized to the instance and feature sets in
the data and EIJ = E.
Deletion phase:
1: compute eiJ for all i ∈ I, eIj for all j ∈ J, eIJ, and H(I, J);
2: if H(I, J) ≤ δ then
3: return EIJ;
4: end if
5: find the rows i ∈ I with d(i) = (1/|J|) Σ_{j∈J} (eij − eiJ − eIj + eIJ)²;
6: find the columns j ∈ J with d(j) = (1/|I|) Σ_{i∈I} (eij − eiJ − eIj + eIJ)²;
7: remove the rows i ∈ I and columns j ∈ J with the larger d;
Addition phase:
1: compute eiJ for all i, eIj for all j, eIJ, and H(I, J);
2: add the columns j ∉ J with (1/|I|) Σ_{i∈I} (eij − eiJ − eIj + eIJ)² ≤ H(I, J);
3: recompute eiJ, eIJ and H(I, J);
4: add the rows i ∉ I with (1/|J|) Σ_{j∈J} (eij − eiJ − eIj + eIJ)² ≤ H(I, J);
5: for each row i ∉ I do
6: if (1/|J|) Σ_{j∈J} (eij − eiJ − eIj + eIJ)² ≤ H(I, J) then
7: add the inverse of i;
8: end if
9: end for
10: return EIJ;
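The score H(I, J) in Algorithm 7 is the mean squared residue of the submatrix; a small sketch of its computation (a perfectly additive submatrix, where each entry is a row effect plus a column effect, scores 0):

```python
def mean_squared_residue(E, I, J):
    """H(I, J): mean squared residue of the submatrix with rows I, columns J."""
    e_iJ = {i: sum(E[i][j] for j in J) / len(J) for i in I}   # row means
    e_Ij = {j: sum(E[i][j] for i in I) / len(I) for j in J}   # column means
    e_IJ = sum(E[i][j] for i in I for j in J) / (len(I) * len(J))
    return sum((E[i][j] - e_iJ[i] - e_Ij[j] + e_IJ) ** 2
               for i in I for j in J) / (len(I) * len(J))
```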
Clustering of BrS variants
Distribution of BrS variants in clusters using the proposed ensemble
clustering.
Experimental Method
To test the performance of the clustering algorithms, we used an
unsupervised evaluation method that computes the compactness (CP) of
the clusters, shown in Eq. 13.

CP = (1/n) Σ_{l=1}^{k} nl · [ Σ_{xi,xj ∈ Cl} d(xi, xj) / (nl(nl − 1)/2) ]  (13)

where d(xi, xj) is the distance between two instances in cluster Cl and nl
is the number of instances in Cl. The smaller the CP for a clustering
result, the more compact and better the clustering result.
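Eq. 13 in runnable form; `d` is any pairwise distance function (Euclidean in the usage line):

```python
import math

def compactness(clusters, d):
    """CP of Eq. 13: weighted average of mean pairwise intra-cluster distance."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        nl = len(c)
        if nl < 2:
            continue                        # a singleton contributes no pairs
        pair_sum = sum(d(c[i], c[j]) for i in range(nl) for j in range(i + 1, nl))
        total += nl * pair_sum / (nl * (nl - 1) / 2)
    return total / n

cp = compactness([[(0, 0), (0, 2)], [(5, 5), (5, 7)]], math.dist)
```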
Results
The proposed ensemble clustering is compared with the following clustering
algorithms:
SimpleKMeans (clustering using the k-means method)
XMeans (an extension of k-means)
DBScan (nearest-neighbor-based; automatically determines the number
of clusters)
MakeDensityBasedCluster (wraps a clusterer to make it return a
distribution and density)
Table: Comparison of clustering results on 148 Exome data sets of BrS.
Clustering Method Compactness (CP)
SimpleKMeans 9.401
XMeans 8.297
MakeDensityBasedCluster 7.483
DBScan 6.351
Ensemble Clustering 5.647
Hybrid Decision Tree & Naïve Bayes Classifiers
The presence of noisy, contradictory instances in the training data causes
learning models to suffer from overfitting and decreases classification
accuracy.
Hybrid Decision Tree (DT) classifier - A naïve Bayes (NB)
classifier is used to remove the noisy, troublesome instances from the
training data before DT induction.
Hybrid Naïve Bayes (NB) classifier - A DT is used to select a
comparatively more important subset of features for applying the
naïve assumption of class conditional independence. It is
extremely computationally expensive for a naïve Bayes classifier to
compute class conditional independence on high-dimensional data
sets.
Algorithm 8 Decision Tree Induction
Input: D = {x1, x2, · · · , xn} // Training dataset, D, which contains a set
of training instances and their associated class labels.
Output: T, Decision tree.
Method:
1: for each class, Ci ∈ D, do
2: Find the prior probabilities, P(Ci ).
3: end for
4: for each attribute value, Aij ∈ D, do
5: Find the class conditional probabilities, P(Aij |Ci ).
6: end for
7: for each training instance, xi ∈ D, do
8: Find the posterior probability, P(Ci |xi )
9: if xi is misclassified, then
10: Remove xi from D;
11: end if
12: end for
13: T = ∅;
14: Determine best splitting attribute;
15: T = Create the root node and label it with the splitting attribute;
16: T = Add arc to the root node for each split predicate and label;
17: for each arc do
18: D = Dataset created by applying splitting predicate to D;
19: if stopping point reached for this path, then
20: T = Create a leaf node and label it with an appropriate class;
21: else
22: T = DTBuild(D);
23: end if
24: T = Add T to arc;
25: end for
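Steps 1-12 of Algorithm 8 (the NB noise filter applied before tree building) can be sketched as follows for categorical attributes; the toy data set at the end is hypothetical:

```python
from collections import Counter, defaultdict

def nb_filter(data):
    """Drop training instances that a naive Bayes classifier misclassifies."""
    counts = Counter(y for _, y in data)
    prior = {c: counts[c] / len(data) for c in counts}   # P(C_i)
    cond = defaultdict(Counter)                          # counts for P(A_ij | C_i)
    for x, y in data:
        for j, v in enumerate(x):
            cond[(y, j)][v] += 1

    def posterior(x, c):                                 # unnormalised P(C_i | x)
        p = prior[c]
        for j, v in enumerate(x):
            p *= cond[(c, j)][v] / counts[c]
        return p

    # keep only instances whose true label wins the posterior comparison
    return [(x, y) for x, y in data
            if max(prior, key=lambda c: posterior(x, c)) == y]

# hypothetical toy data: one 'neg' instance contradicts the 'pos' pattern
data = [(('a',), 'pos'), (('a',), 'pos'), (('b',), 'neg'), (('a',), 'neg')]
cleaned = nb_filter(data)
```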
Algorithm 9 Naïve Bayes classifier
Input: D = {x1, x2, · · · , xn} // Training data.
Output: A classification Model.
Method:
1: T = ∅;
2: Determine the best splitting attribute;
3: T = Create the root node and label it with the splitting attribute;
4: T = Add arc to the root node for each split predicate and label;
5: for each arc do
6: D = Dataset created by applying splitting predicate to D;
7: if stopping point reached for this path, then
8: T = Create a leaf node and label it with an appropriate class;
9: else
10: T = DTBuild(D);
11: end if
12: T = Add T to arc;
13: end for
14: for each attribute, Ai ∈ D, do
15: if Ai is not tested in T, then
16: Wi = 0;
17: else
18: set d as the minimum depth of Ai ∈ T, and Wi = 1/√d;
19: end if
20: end for
21: for each class, Ci ∈ D, do
22: Find the prior probabilities, P(Ci ).
23: end for
24: for each attribute, Ai ∈ D and Wi = 0, do
25: for each attribute value, Aij ∈ Ai , do
26: Find the class conditional probabilities, P(Aij |Ci )^Wi .
27: end for
28: end for
29: for each instance, xi ∈ D, do
30: Find the posterior probability, P(Ci |xi );
31: end for
Accuracy on Benchmark Datasets
Figure: Classification accuracy on 10 datasets with 10-fold cross validation.
Novel Class Instances
Figure: Instances with a fixed number of class labels (left) and instances of a
novel class arriving in the data stream (right).
Novel Class Instances (con.)
Figure: Flow chart of classification and novel class detection.
Novel Class Instances (con.)
Novel Class Instances (con.)
*** THANK YOU ***
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)IJCSEA Journal
 
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...Machine Learning Based Approaches for Cancer Classification Using Gene Expres...
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...mlaij
 
Introduction to Bioinformatics.
 Introduction to Bioinformatics. Introduction to Bioinformatics.
Introduction to Bioinformatics.Elena Sügis
 
One View of Data Science
One View of Data ScienceOne View of Data Science
One View of Data SciencePhilip Bourne
 
AI from the Perspective of a School of Data Science
AI from the Perspective of a School of Data ScienceAI from the Perspective of a School of Data Science
AI from the Perspective of a School of Data SciencePhilip Bourne
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.docbutest
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedPhilip Bourne
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
Adversarial Multi Scale Features Learning for Person Re Identification
Adversarial Multi Scale Features Learning for Person Re IdentificationAdversarial Multi Scale Features Learning for Person Re Identification
Adversarial Multi Scale Features Learning for Person Re Identificationijtsrd
 
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive SurveyPrognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Surveyijtsrd
 
Data mining with human genetics to enhance gene based algorithm and
Data mining with human genetics to enhance gene based algorithm andData mining with human genetics to enhance gene based algorithm and
Data mining with human genetics to enhance gene based algorithm andIAEME Publication
 
GASCAN: A Novel Database for Gastric Cancer Genes and Primers
GASCAN: A Novel Database for Gastric Cancer Genes and PrimersGASCAN: A Novel Database for Gastric Cancer Genes and Primers
GASCAN: A Novel Database for Gastric Cancer Genes and Primersijdmtaiir
 
Application and Implementation of different deep learning
Application and Implementation of different deep learningApplication and Implementation of different deep learning
Application and Implementation of different deep learningJIEJackyZOUChou
 
Comprehensive Survey of Data Classification & Prediction Techniques
Comprehensive Survey of Data Classification & Prediction TechniquesComprehensive Survey of Data Classification & Prediction Techniques
Comprehensive Survey of Data Classification & Prediction Techniquesijsrd.com
 

Similar to Machine Learning for Data Mining (20)

SEMI SUPERVISED BASED SPATIAL EM FRAMEWORK FOR MICROARRAY ANALYSIS
SEMI SUPERVISED BASED SPATIAL EM FRAMEWORK FOR MICROARRAY ANALYSISSEMI SUPERVISED BASED SPATIAL EM FRAMEWORK FOR MICROARRAY ANALYSIS
SEMI SUPERVISED BASED SPATIAL EM FRAMEWORK FOR MICROARRAY ANALYSIS
 
Branch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiersBranch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiers
 
Network Intrusion Detection using MRF Technique
Network Intrusion Detection using MRF Technique Network Intrusion Detection using MRF Technique
Network Intrusion Detection using MRF Technique
 
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)
MICROARRAY GENE EXPRESSION ANALYSIS USING TYPE 2 FUZZY LOGIC(MGA-FL)
 
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...Machine Learning Based Approaches for Cancer Classification Using Gene Expres...
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...
 
Deep Learning for Computer Vision: Medical Imaging (UPC 2016)
Deep Learning for Computer Vision: Medical Imaging (UPC 2016)Deep Learning for Computer Vision: Medical Imaging (UPC 2016)
Deep Learning for Computer Vision: Medical Imaging (UPC 2016)
 
Introduction to Bioinformatics.
 Introduction to Bioinformatics. Introduction to Bioinformatics.
Introduction to Bioinformatics.
 
One View of Data Science
One View of Data ScienceOne View of Data Science
One View of Data Science
 
AI from the Perspective of a School of Data Science
AI from the Perspective of a School of Data ScienceAI from the Perspective of a School of Data Science
AI from the Perspective of a School of Data Science
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.doc
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
 
C0344023028
C0344023028C0344023028
C0344023028
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
Adversarial Multi Scale Features Learning for Person Re Identification
Adversarial Multi Scale Features Learning for Person Re IdentificationAdversarial Multi Scale Features Learning for Person Re Identification
Adversarial Multi Scale Features Learning for Person Re Identification
 
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive SurveyPrognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
Prognosis of Cardiac Disease using Data Mining Techniques A Comprehensive Survey
 
Data mining with human genetics to enhance gene based algorithm and
Data mining with human genetics to enhance gene based algorithm andData mining with human genetics to enhance gene based algorithm and
Data mining with human genetics to enhance gene based algorithm and
 
GASCAN: A Novel Database for Gastric Cancer Genes and Primers
GASCAN: A Novel Database for Gastric Cancer Genes and PrimersGASCAN: A Novel Database for Gastric Cancer Genes and Primers
GASCAN: A Novel Database for Gastric Cancer Genes and Primers
 
Application and Implementation of different deep learning
Application and Implementation of different deep learningApplication and Implementation of different deep learning
Application and Implementation of different deep learning
 
Comprehensive Survey of Data Classification & Prediction Techniques
Comprehensive Survey of Data Classification & Prediction TechniquesComprehensive Survey of Data Classification & Prediction Techniques
Comprehensive Survey of Data Classification & Prediction Techniques
 

Recently uploaded

Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptJasonTagapanGulla
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxsomshekarkn64
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 

Recently uploaded (20)

🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.ppt
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 

Machine Learning for Data Mining

  • 1. Machine Learning for Data Mining. Dr. Dewan Md. Farid, Department of Computer Science & Engineering, United International University, Bangladesh. December 01, 2016.
  • 2. Outline: Big Data Project; Rule-based Classifier; Class Imbalanced Problem; Active Learning; Ensemble Clustering; Hybrid Classifier.
  • 3. Data Mining: What is Data Mining? Data mining (DM) is also known as Knowledge Discovery from Data, or KDD for short; it turns a large collection of data into knowledge. DM is a multidisciplinary field drawing on machine learning, artificial intelligence, pattern recognition, knowledge-based systems, high-performance computing, database technology and data visualisation. 1. Data mining is the process of analysing data from different perspectives and summarising it into useful information. 2. Data mining is the process of finding hidden information and patterns in a huge database. 3. Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.
  • 4. Machine Learning. Machine learning (ML) provides the technical basis of data mining; it concerns the construction and study of systems that can learn from data. 1. Supervised learning/classification: the supervision in the learning comes from the labelled instances. 2. Unsupervised learning/clustering: the learning process is unsupervised, since the instances are not class-labelled. 3. Semi-supervised learning: uses both labelled and unlabelled instances when learning a model. 4. Active learning: lets users play an active role in the learning process; it asks a user (e.g., a domain expert) to label an instance, which may be drawn from a set of unlabelled instances.
  • 5. Learning Algorithms: Decision Tree (DT) Induction; Naïve Bayes (NB) Classifier; NBTree Classifier; RainForest and BOAT Classifiers; k-Nearest-Neighbour (kNN) Classifier; Random Forest, Bagging and Boosting (AdaBoost); Support Vector Machines (SVM); k-Means Clustering; Similarity-based Clustering.
  • 6. Mining Big Data. Mining big data is the process of extracting knowledge from massive amounts of complex data or databases, uncovering information hidden at scale. Big data is defined by the three V's: Volume, the quantity of data; Variety, the category of data; Velocity, the speed of data in and out. A few more V's are sometimes thrown into the mix: Vision, having a purpose/plan; Verification, ensuring that the data conforms to a set of specifications; Validation, checking that its purpose is fulfilled.
  • 7. Big Data Project. 1. BRiDGEIris: Brussels Big Data Platform for Sharing and Discovery in Clinical Genomics. Hosted by IB2 (Interuniversity Institute of Bioinformatics in Brussels); funded by INNOVIRIS (Brussels Institute for Research and Innovation). 2. FWO research project G004414N, "Machine Learning for Data Mining Applications in Cancer Genomics".
  • 8. BRiDGEIris Project. The Brussels big data platform for sharing and discovery in clinical genomics project aims to address its research challenges through: 1. Design and creation of a multi-site clinical/phenomic and genomic data warehouse. 2. Development of automated tools for extracting relevant information from genetic data. 3. Use of the designed tools to extract new knowledge and transfer it to the medical setting.
  • 9. VUB AI Lab (CoMo). The lab focuses on designing and developing strategies for information discovery on genomic and clinical big data by employing an optimal ensemble method. The goal is to evaluate ensemble predictive modelling techniques for: 1. Improving the prediction accuracy of variant identification/genomic variant classification. 2. Pathology classification tasks. It is also developing new methods/algorithms to deal with the following issues: multi-class classification; high-dimensional data; class-imbalanced data; big data.
  • 10. Brugada syndrome. Brugada syndrome (BrS), also known as sudden adult death syndrome (SADS), is a genetic disease that increases the risk of sudden cardiac death (SCD) at a young age. It is named after the Spanish cardiologists Pedro Brugada and Josep Brugada. BrS is a heart rhythm disorder, detected by an abnormal electrocardiogram (ECG) finding called a type 1 Brugada ECG pattern, which is much more common in men. Sudden cardiac death occurs when the heart does not pump effectively and not enough blood travels to the rest of the body. The exome datasets of 148 patients have been analysed for Brugada syndrome at UZ Brussel (Universitair Ziekenhuis Brussel, www.uzbrussel.be).
  • 11. Knowledge Discovery from Genomic Data. Figure: The process of extracting knowledge from genomic data in data mining; the 148 exome datasets pass through data preprocessing and feature selection, are restricted to the gene panel, and the formatted genomic data sets are then fed to the mining algorithm.
  • 12. Genomic Data of BrS. Table: Classification of DNA variants for Brugada syndrome. Class I: Nonpathogenic; Class II: VUS1, unlikely pathogenic; Class III: VUS2, unclear; Class IV: VUS3, likely pathogenic; Class V: Pathogenic.
  • 13. Gene Panel of BrS. Table: Gene panel of Brugada syndrome. Chr 1: KCND3; Chr 3: SCN5A, GPD1L, SLMAP, CAV3, SCN10A; Chr 4: ANK2; Chr 7: CACNA2D1, AKAP9, KCNH2; Chr 10: CACNAB2; Chr 11: KCNE3, SCN3B, SCN2B, KCNJ5, KCNQ1, SCN4B; Chr 12: CACNA1C, KCNJ8; Chr 15: HCN4; Chr 17: RANGRF, KCNJ2; Chr 19: SCN1B, TRPM4; Chr 20: SNTA1; Chr 21: KCNE1, KCNE2; Chr X: KCNE1L.
  • 14. Chromosomes. Figure: Number of variants per chromosome in the 148 exome datasets.
  • 15. Genomic Data. Figure: 148 exome datasets, showing the number of variants per exome in the annotated vcf file, in the gene panel, and among the BrS variants.
  • 16. Rule-based Classifier. Rule-based classifiers handle complex classification problems well and have various advantages: highly expressive, like a DT; easy to interpret; easy to generate; can classify new instances rapidly; performance comparable to a DT; new rules can be added to existing rules without disturbing the ones already there; rules can be executed in any order.
  • 17. Adaptive Rule-based Classifier. It combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules for multi-class classification of biological big data: the random subspace method (or attribute bagging) to avoid overfitting; the boosting approach for classifying noisy instances; an ensemble of decision trees to deal with class-imbalanced data. It uses two popular classification techniques, decision tree (DT) and k-nearest-neighbour (kNN) classifiers: DTs are used for evolving classification rules from the training data, and kNN is used for analysing the misclassified instances and removing ambiguity between contradictory rules.
  • 18. Random Subspace & Boosting Method. Random subspace is an ensemble classifier: it consists of several classifiers, each operating in a subspace of the original feature space, and outputs a class based on the outputs of these individual classifiers. It has been used for decision trees (random decision forests) and is an attractive choice for high-dimensional data. Boosting is designed specifically for classification; it converts weak classifiers into strong ones. It is an iterative process and uses voting to combine the outputs of the individual classifiers.
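As a toy illustration of the random subspace idea described above, the sketch below is an assumption of mine rather than the speaker's implementation: each ensemble member is a simple nearest-centroid classifier trained on a random subset of the feature indices, and the members' votes are combined by majority.

```python
import random
from collections import Counter

def train_subspace_ensemble(X, y, n_members=5, subspace_size=2, seed=0):
    """Train an ensemble whose members each see only a random feature subspace."""
    rng = random.Random(seed)
    n_features = len(X[0])
    members = []
    for _ in range(n_members):
        feats = rng.sample(range(n_features), subspace_size)
        # per-class centroid over the chosen features only
        centroids = {}
        for label in set(y):
            rows = [x for x, c in zip(X, y) if c == label]
            centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in feats]
        members.append((feats, centroids))
    return members

def predict(members, x):
    """Each member votes for its nearest centroid; the majority class wins."""
    votes = []
    for feats, centroids in members:
        proj = [x[f] for f in feats]
        label = min(centroids,
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(proj, centroids[c])))
        votes.append(label)
    return Counter(votes).most_common(1)[0][0]

# two well-separated classes in a 4-dimensional feature space
X = [[0, 0, 0, 0], [1, 0, 1, 0], [9, 9, 9, 9], [8, 9, 8, 9]]
y = ['a', 'a', 'b', 'b']
model = train_subspace_ensemble(X, y)
print(predict(model, [0, 1, 0, 1]))  # prints: a
```

The deck's adaptive rule-based classifier uses decision trees, not centroids, as base learners; the centroid classifier is only a stand-in to keep the subspace-plus-voting mechanics visible.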
  • 19. Ensemble Classifier. Figure: An example of an ensemble classifier.
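The combination step in the figure, majority voting over the base classifiers' predicted labels, can be sketched in a few lines (illustrative only):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label predicted by the most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# three base classifiers vote on one test instance
print(majority_vote(['spam', 'ham', 'spam']))  # prints: spam
```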
  • 20. Decision Tree Induction. Decision tree (DT) induction is a top-down, recursive, divide-and-conquer algorithm for the multi-class classification task. The goal of a DT is to iteratively partition the data into smaller subsets until all the subsets belong to a single class. It is easy to interpret and explain, and requires little prior knowledge. Splitting criteria: Information Gain, used by the ID3 (Iterative Dichotomiser) algorithm; Gain Ratio, used by the C4.5 algorithm; Gini Index, used by the CART algorithm.
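The first of the splitting criteria named above can be computed directly; a minimal sketch of entropy-based information gain on a toy label set (my own example, not the speaker's code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction when `labels` is split into the given `groups`
    (each group is the list of labels landing in one branch)."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

labels = ['yes', 'yes', 'no', 'no']
# a perfect split separates the two classes completely, gaining a full bit
print(information_gain(labels, [['yes', 'yes'], ['no', 'no']]))  # prints: 1.0
```

ID3 evaluates this gain for every candidate attribute and splits on the one with the highest value; C4.5's gain ratio divides it by the split's own entropy to penalise many-valued attributes.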
• 21. Algorithm 1 Decision Tree Induction
Input: D = {x1, · · · , xi , · · · , xN }
Output: A decision tree, DT.
Method:
1: DT = ∅;
2: find the root node with the best split, Aj ∈ D;
3: DT = create the root node;
4: DT = add an arc to the root node for each split predicate and label;
5: for each arc do
6:   Dj created by applying the splitting predicate to D;
7:   if stopping point reached for this path, then
8:     DT = create a leaf node and label it with cl;
9:   else
10:    DT = DTBuild(Dj);
11:  end if
12:  DT = add DT to arc;
13: end for
• 22. K-Nearest-Neighbour (kNN) Classifier
The k-nearest-neighbour (kNN) is a simple classifier. It uses distance measurement techniques that are widely used in pattern recognition. kNN finds the k instances, X = {x1, x2, · · · , xk} ∈ Dtraining, that are closest to the test instance, xtest, and assigns it the most frequent class label, cl → xtest, among X. When a classification is to be made for a new instance, xnew, its distance to each instance in Dtraining must be determined. Only the k closest instances, X ∈ Dtraining, are considered further. "Closest" is defined in terms of a distance metric, such as the Euclidean distance. The Euclidean distance between two points, x1 = (x11, x12, · · · , x1n) and x2 = (x21, x22, · · · , x2n), is shown in Eq. 1:

dist(x1, x2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}   (1)
• 23. Algorithm 2 k-Nearest-Neighbour classifier
Input: D = {x1, · · · , xi , · · · , xn}
Output: kNN classifier, kNN.
Method:
1: find X ∈ D that identifies the k nearest neighbours, regardless of class label, cl.
2: out of these instances, X = {x1, x2, · · · , xk}, identify the number of instances, ki, that belong to class cl, l = 1, 2, · · · , M. Obviously, \sum_i ki = k.
3: assign xtest to the class cl with the maximum number ki of instances.
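A minimal Python sketch of Algorithm 2 together with Eq. (1); the training set is assumed to be a list of (feature vector, label) pairs, which is our own encoding:

```python
import math
from collections import Counter

def euclidean(x1, x2):
    """Euclidean distance between two feature vectors, Eq. (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_predict(train, x_test, k=3):
    """Assign the most frequent class label among the k training
    instances closest to x_test."""
    neighbours = sorted(train, key=lambda xy: euclidean(xy[0], x_test))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]
```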
• 24. Constructing Classification Rules
Extracting classification rules from DTs is an easy and well-known process. Rules are as expressive as DTs, so the performance of a rule-based classifier is comparable to a DT. One rule is generated for each leaf of the DT: each path in the DT from the root node to a leaf node corresponds to a rule, so the tree corresponds exactly to the classification rules.
DT vs. Rules
- New rules can be added to an existing rule set without disturbing the ones already there, whereas adding to a tree structure may require reshaping the whole tree.
- Rules can be executed in any order.
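The path-to-rule extraction described above can be sketched as a short recursion; the nested-tuple tree encoding here is an assumption made for illustration only:

```python
def extract_rules(node, conditions=()):
    """One rule per root-to-leaf path. A node is either a class label
    (leaf) or a (feature, {value: subtree}) pair -- an assumed encoding."""
    if not isinstance(node, tuple):              # leaf: emit one rule
        return [(conditions, node)]
    feature, branches = node
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + ((feature, value),))
    return rules
```

Each returned pair is (antecedent conditions, class label), and because the rules partition the instance space exactly as the tree does, they can be applied in any order.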
• 25. Algorithm: Adaptive rule-based (ARB) classifier
It runs for a series of k iterations. Initially, an equal weight, 1/N, is assigned to each training instance. The weights of the training instances are adjusted according to how they are classified in each iteration. In each iteration, a sub-dataset Dj is created from the original training dataset D and the previous sub-dataset Dj−1 with the maximum-weighted instances. Sampling with replacement is used only to create the sub-dataset D1 from the original training data D in the first iteration. A tree DTj is built from the sub-dataset Dj with randomly selected features in each iteration. One rule is generated for each leaf node of DTj: each path in DTj from the root to a leaf corresponds to a rule.
• 26. Algorithm 3 Adaptive rule-based classifier
Input: D = {x1, · · · , xi , · · · , xN }, training dataset; k, number of iterations; DT learning scheme;
Output: rule-set; // A set of classification rules.
Method:
1: rule-set = ∅;
2: for i = 1 to N do
3:   wi = 1/N; // initialising the weight of each xi ∈ D.
4: end for
5: for j = 1 to k do
6:   if j == 1 then
7:     create Dj by sampling D with replacement;
8:   else
9:     create Dj from Dj−1 and D with the maximum-weighted X;
10:  end if
11:  build a tree, DTj ← Dj, using randomly selected features;
12:  compute error(DTj); // the error rate of DTj.
13:  if error(DTj) ≥ threshold-value then
14:    go back to step 6 and try again;
15:  else
16:    rules ← DTj; // extracting the rules from DTj.
17:  end if
18:  for each xi ∈ Dj that was correctly classified do
19:    multiply the weight of xi by error(DTj)/(1 − error(DTj)); // update weights.
20:  end for
21:  normalise the weight of each xi ∈ Dj;
22:  rule-set = rule-set ∪ rules;
23: end for
24: create sub-dataset Dmisclassified with the misclassified instances from Dj;
25: analyse Dmisclassified employing Algorithm 4;
26: return rule-set;
• 27. Error Rate Calculation
The error rate of DTj is calculated as the sum of the weights of the misclassified instances, as shown in Eq. 2, where err(xi) is the misclassification error of an instance xi: if an instance xi is misclassified, then err(xi) is one; otherwise (correctly classified), err(xi) is zero.

error(DTj) = \sum_{i=1}^{N} w_i × err(x_i)   (2)

If the error rate of DTj is less than the threshold-value, then rules are extracted from DTj.
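Eq. (2) and the weight update of Algorithm 3 (steps 18–21) can be sketched as follows; `correct` is an assumed boolean vector marking which instances the tree classified correctly:

```python
def weighted_error(weights, correct):
    """Eq. (2): the sum of the weights of the misclassified instances."""
    return sum(w for w, ok in zip(weights, correct) if not ok)

def update_weights(weights, correct, error):
    """Algorithm 3 reweighting: correctly classified instances are scaled
    by error/(1 - error), then all weights are renormalised to sum to 1."""
    factor = error / (1.0 - error)
    scaled = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled]
```

Scaling the correctly classified instances down (error/(1 − error) < 1 for error < 0.5) makes the hard, misclassified instances relatively heavier in the next iteration.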
• 28. Mining Big Data with Rules
Big data is so big (millions of instances) that we cannot process all the instances together at the same time; it is not possible to store all the data in main memory at once. We can create several smaller samples (or subsets) of the big data, each of which fits in main memory. Each subset is used to construct a set of rules, resulting in several rule sets. The rule sets are then examined and merged to construct the final set of classification rules for the big data. This works because new rules can be added to existing rules and rules can be executed in any order.
• 29. Mining Big Data with Rules (con.)
Figure: Mining big data using the adaptive rule-based classifier (sub-datasets 1..N are each processed by the adaptive rule-based classifier, and the resulting rules are integrated into the final classification rules).
• 30. Reduced-Error Pruning
Split the original data into two parts: (a) a growing set, and (b) a pruning set. Rules are generated using the growing set only, so important rules might be missed because some key instances were assigned to the pruning set. A condition of a rule generated from the growing set is deleted, and the effect is evaluated by trying the truncated rule on the pruning set and seeing whether it performs better than the original rule. If the new truncated rule performs better, it is added to the rule set. This process continues for each rule and for each class. The overall best rules are established by evaluating the rules on the pruning set.
• 31. Algorithm: Analysing Misclassified Instances
To check the classes of misclassified instances we use the kNN classifier with a feature selection and weighting approach, applying DT induction for the feature selection and weighting. We build a tree from the misclassified instances. Each feature that is tested in the tree, Aj ∈ Dmisclassified, is assigned a weight of 1/d, where d is the depth of the tree at which the feature appears. Features that are not tested in the tree are not considered in the similarity measure of the kNN classifier. We apply the kNN classifier to classify each misclassified instance based on the weighted features, update the class labels of the misclassified instances, and check for contradictory rules, if there are any.
• 32. Algorithm 4 Analysing misclassified instances
Input: D, original training data; Dmisclassified, dataset with misclassified instances;
Output: A set of instances, X, with the right class labels.
Method:
1: build a tree, DT, using Dmisclassified;
2: for each Aj ∈ Dmisclassified do
3:   if Aj is tested in DT then
4:     assign Aj the weight 1/d, where d is the depth of DT;
5:   else
6:     do not consider Aj for the similarity measure;
7:   end if
8: end for
9: for each xi ∈ Dmisclassified do
10:  find X ∈ D with the similarity of weighted A = {A1, · · · , Aj , · · · , An};
11:  find the most frequent class, cl, in X;
12:  assign xi ← cl;
13: end for
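The feature-weighted similarity used in step 10 can be sketched as a weighted Euclidean distance; interpreting untested features as weight 0 is our reading of the slide, not something it states explicitly:

```python
def weighted_distance(x1, x2, weights):
    """Weighted Euclidean distance: features the tree tests carry weight
    1/d (d = depth at which they appear); untested features get weight 0
    and drop out of the measure entirely."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(x1, x2, weights)) ** 0.5
```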
• 33. Performance Measurement
The classification accuracy:

accuracy = \frac{\sum_{i=1}^{|X|} assess(x_i)}{|X|}, x_i ∈ X   (3)

If xi is correctly classified then assess(xi) = 1; if xi is misclassified then assess(xi) = 0.

precision = TP / (TP + FP)   (4)
recall = TP / (TP + FN)   (5)
F-score = (2 × precision × recall) / (precision + recall)   (6)
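Eqs. (4)–(6) in a minimal Python sketch:

```python
def precision(tp, fp):
    """Eq. (4): fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (5): fraction of actual positives that are found."""
    return tp / (tp + fn)

def f_score(p, r):
    """Eq. (6): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)
```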
• 34. Experiments on Exome datasets
The performance of the proposed ARB classifier against the RainForest, NB and kNN classifiers on 148 Exome datasets. The ARB classifier correctly classifies 91% of gene variants for BrS using the training data. Five iterations were used for the proposed ARB classifier on each Exome dataset.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and the proposed ARB classifier using training data.

Algorithm        Accuracy (%)   Precision    Recall       F-score
                                (w. avg.)    (w. avg.)    (w. avg.)
RainForest       83.33          0.76         0.83         0.79
NB               83.33          0.79         0.83         0.78
kNN              75             0.56         0.75         0.64
ARB classifier   91.66          0.95         0.91         0.92
• 35. Experiments on Exome datasets (con.)
The performance of the proposed ARB classifier against the RainForest, NB and kNN classifiers using 10-fold cross-validation on 148 Exome datasets.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and the proposed ARB classifier using 10-fold cross-validation.

Algorithm        Accuracy (%)   Precision    Recall       F-score
                                (w. avg.)    (w. avg.)    (w. avg.)
RainForest       58.33          0.46         0.58         0.51
NB               58.33          0.63         0.58         0.6
kNN              50             0.33         0.5          0.4
ARB classifier   75             0.73         0.75         0.68
• 36. Experiments on Exome datasets (con.)
The performance of the proposed ARB classifier against the RainForest, NB and kNN classifiers using unseen test variants of 45 Exome datasets; 103 Exome datasets were used for training the models.
Table: The accuracy, precision, recall and F-score of RainForest, NB, kNN and the proposed ARB classifier using testing data.

Algorithm        Accuracy (%)   Precision    Recall       F-score
                                (w. avg.)    (w. avg.)    (w. avg.)
RainForest       50             0.33         0.5          0.4
NB               50             0.25         0.5          0.62
kNN              50             0.25         0.5          0.33
ARB classifier   66.66          0.44         0.66         0.53
• 37. Benchmark Life Sciences Datasets
Table: 10 real benchmark life sciences datasets from the UCI (University of California, Irvine) machine learning repository.

No.  Dataset         Instances   No. of Att.   Att. Types   Classes
1    Appendicitis    106         7             Numeric      2
2    Breast cancer   286         9             Nominal      2
3    Contraceptive   1473        9             Numeric      3
4    Ecoli           336         7             Numeric      8
5    Heart           270         13            Numeric      2
6    Pima diabetes   768         8             Numeric      2
7    Iris            150         4             Numeric      3
8    Soybean         683         35            Nominal      19
9    Thyroid         215         5             Numeric      2
10   Yeast           1484        8             Numeric      10
• 38. Classification Accuracy
Table: The classification accuracy (%) of C4.5, kNN, naïve Bayes (NB) and the proposed adaptive rule-based classifier with 10-fold cross-validation.

Dataset         C4.5    kNN     NB      Proposed classifier
Appendicitis    85.84   86.79   85.84   87.73
Breast cancer   75.52   73.42   71.67   75.52
Contraceptive   50.98   49.76   48.13   50.1
Ecoli           79.76   83.03   78.86   83.92
Heart           77.40   78.88   83.7    83.7
Pima diabetes   73.82   73.17   76.3    75.65
Iris            96      95.33   96      95.33
Soybean         91.50   90.19   92.97   91.94
Thyroid         98.13   97.2    98.13   98.13
Yeast           56.73   56.94   57.88   61.99
• 39. Classification Accuracy (con.)
Figure: Classification accuracy of C4.5, kNN, NB and the adaptive rule-based classifier on the UCI benchmark life sciences data sets.
• 40. Accuracy having 20% noisy instances
Figure: Classification accuracy of C4.5, kNN, NB and the adaptive rule-based classifier on the UCI benchmark life sciences data sets with 20% noisy instances.
• 41. Data Balancing Methods
Classification of multi-class imbalanced data is a difficult task, as real data sets are noisy, high-dimensional and have small sample sizes, which results in overfitting and overlapping of classes. Traditional machine learning algorithms are very successful at classifying majority class instances compared to minority class instances. The conventional data balancing methods alter the original data distribution, so they might suffer from overfitting or drop some potentially useful information. We propose a new method for dealing with multi-class imbalanced data based on clustering and selecting the most informative instances from the majority classes.
• 42. Classifying Imbalanced Data
Machine learning algorithms successfully classify majority class instances but misclassify minority class instances in many high-dimensional data sets. The following methods are used for class imbalance problems:
1. Sampling methods: under-sampling, over-sampling
2. Cost-sensitive learning methods (it is difficult to get accurate misclassification costs)
3. Ensemble methods: bagging, boosting
• 43. Proposed Data Balancing Method
Initially, we cluster the majority class instances into several clusters and find the most informative instances in each cluster: those close to the centre of the cluster and to the border of the cluster. Then several data sets are created from these clusters of most informative instances by combining them with the minority class instances; every data set should have an almost equal number of minority and majority class instances. Finally, multiple classifiers are trained using these data sets, and voting is used to classify existing/new instances.
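A minimal sketch of the selection step, assuming Euclidean feature vectors and pre-computed majority-class clusters; the function names and the centre-distance proxy for "informative" are illustrative, not the authors' exact procedure:

```python
def centroid(cluster):
    """Component-wise mean of the instances in a cluster."""
    return [sum(x[j] for x in cluster) / len(cluster)
            for j in range(len(cluster[0]))]

def informative(cluster, m):
    """Keep the m instances nearest the cluster centre (a simple proxy
    for 'close to the centre and the border of the cluster')."""
    c = centroid(cluster)
    return sorted(cluster,
                  key=lambda x: sum((a - b) ** 2 for a, b in zip(x, c)))[:m]

def balanced_sets(majority_clusters, minority):
    """One roughly balanced training set per majority cluster: each pairs
    the reduced cluster with the full minority class."""
    m = len(minority)
    return [informative(cl, m) + minority for cl in majority_clusters]
```

Each returned set would train one classifier, with voting combining the predictions as in the figure on the next slide.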
• 44. Proposed Data Balancing Method (con.)
Figure: Proposed data balancing method (the majority class instances are clustered, informative instances are selected per cluster, each balanced data set trains a classifier, and the votes are combined for prediction).
• 45. Performance of Data Balancing Methods
The performance of data balancing methods using the area under the ROC (Receiver Operating Characteristic) curve (AUC) on 2143 variants of Brugada syndrome (BrS) from 148 Exome data sets.
Table: Average AUC values of 148 imbalanced Exome data sets for different imbalanced-data handling methods.

Algorithm               Average AUC value
Random Under-Sampling   0.8923
Random Over-Sampling    0.8673
Bagging                 0.8915
Boosting                0.9136
Proposed Method         0.9317
• 46. Active Learning
Active learning achieves high accuracy because the number of instances needed to learn a concept can often be much lower than the number required in typical supervised learning. It interactively queries a user/expert for the class labels of unlabeled instances. The objective is to train a classifier using as few labeled instances as possible by selecting the most informative instances. Let the data D contain both a set of labeled data, DL, and a set of unlabeled data, DU. Initially, a model M∗ is trained using DL. Then a querying function is used to select unlabeled instances, XU ∈ DU, and a user is asked to label them, XU → XL. XL is then added to DL and M∗ is trained again. The process repeats until the user is satisfied.
• 47. Active Learning (con.)
Figure: Active learning process (an ensemble model M∗ is trained on the labeled data DL, unlabeled instances XU are selected from DU, labeled by a user/oracle as XL, and added back into DL).
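The loop in the figure, as a generic pool-based sketch; the `train`, `select` and `oracle` callables are caller-supplied placeholders rather than anything specified on the slides:

```python
def active_learn(labeled, unlabeled, oracle, train, select, rounds):
    """Pool-based active learning: train on D_L, pick an informative
    unlabeled instance, ask the oracle for its label, fold it back into
    D_L, and retrain; repeat for a fixed budget of rounds."""
    model = train(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        x = select(model, unlabeled)      # querying function
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))    # user/expert supplies the label
        model = train(labeled)
    return model
```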
• 48. Proposed Method
The naïve Bayes (NB) classifier and clustering are used to find the most informative instances for labeling as part of active learning. Unlabeled instances are selected for labeling using the following two strategies:
- Instances close to the centres of clusters and the borders of clusters.
- Instances whose posterior probabilities are equal or very close.
• 49. Performance of Ensemble Methods
Adaptive boosting (the AdaBoost algorithm) with the NB classifier as the base classifier.
Table: The accuracy and F-score of ensemble methods on 2143 DNA variants of Brugada syndrome.

Algorithm                     Accuracy (%)   F-score (weighted avg.)
Random Forest                 92.3           0.93
Bagging                       87.5           0.83
Boosting                      91.66          0.9
AdaBoost with NB classifier   94.73          0.93
• 50. Clustering of high-dimensional big data
- An ensemble clustering method with a feature selection and grouping approach.
- K-means clustering.
- Similarity-based clustering.
- Biclustering (applied to each cluster generated by ensemble clustering, to find the sub-matrices).
- Unlabelled genomic data of Brugada syndrome (148 Exome datasets).
The proposed method selects the most relevant features in the dataset and groups them into subsets of features to overcome the problems associated with traditional clustering methods.
• 51. Clustering
It is the process of grouping a set of instances into clusters (subsets or groups) so that instances within a cluster have high similarity to one another, but are very dissimilar to instances in other clusters. Let X be the unlabelled data set, that is,

X = {x1, x2, · · · , xN};   (7)

The partition of X into k clusters, C1, · · · , Ck, must meet the following conditions:

Ci ≠ ∅, i = 1, · · · , k;   (8)
∪_{i=1}^{k} Ci = X;   (9)
Ci ∩ Cj = ∅, i ≠ j, i, j = 1, · · · , k;   (10)
• 52. Challenges
- Extracting patterns from genomic big data.
- Genomic data is often too big and too messy.
- Genomic data is also high-dimensional, so traditional distance measures may be dominated by the noise in many dimensions.
- In genomic data, we need to find not only the clusters of instances (genes) but, for each cluster, a set of features (conditions).
• 53. k-Means
k-means defines the mean value of the instances {xi1, xi2, · · · , xiN} ∈ Ci. It randomly selects k instances, {xk1, xk2, · · · , xkN} ∈ X, each of which initially represents a cluster centre. Each remaining instance xi ∈ X is assigned to the cluster to which it is most similar; similarity is measured by the Euclidean distance between xi and the centre of Ci. It iteratively improves the within-cluster variation. A high degree of similarity among instances within a cluster is obtained, while a high degree of dissimilarity among instances in different clusters is achieved simultaneously. The cluster mean of Ci = {xi1, xi2, · · · , xiN} is defined in Eq. 11:

Mean of Ci = \frac{1}{N} \sum_{j=1}^{N} x_{ij}   (11)
• 54. Algorithm 5 k-Means Clustering
Input: X = {x1, x2, · · · , xN} // A set of unlabelled instances. k // the number of clusters.
Output: A set of k clusters.
Method:
1: arbitrarily choose k instances, {xk1, xk2, · · · , xkN} ∈ X, as the initial k cluster centres;
2: repeat
3:   (re)assign each xi ∈ X to the cluster to which xi is the most similar, based on the mean value of the instances in the cluster;
4:   update the k means, that is, calculate the mean value of the instances for each cluster;
5: until no change
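Algorithm 5 as a compact Python sketch (Euclidean similarity, random initial centres; a seeded RNG is used only to keep the example reproducible):

```python
import random

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: pick k random instances as centres, then alternate
    assignment and mean-update until the assignment stops changing."""
    rnd = random.Random(seed)
    centres = list(rnd.sample(X, k))
    assign = None
    for _ in range(iters):
        new = [min(range(k),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, centres[c])))
               for x in X]
        if new == assign:       # no change: converged
            break
        assign = new
        for c in range(k):      # recompute each cluster mean (Eq. 11)
            members = [x for x, a in zip(X, assign) if a == c]
            if members:
                centres[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return assign, centres
```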
• 55. Similarity-Based Clustering (SCM)
It is robust with respect to the initialisation of the number of clusters and detects clusters of different volumes. Let sim(xi, xl) be the similarity measure between an instance xi and the lth cluster centre xl. The goal is to find the xl that maximise the total similarity measure shown in Eq. 12:

Js(C) = \sum_{l=1}^{k} \sum_{i=1}^{N} f(sim(x_i, x_l))   (12)

where f(sim(xi, xl)) is a reasonable similarity measure and C = {C1, · · · , Ck}. In general, SCM uses feature values to check the similarity between instances; however, any suitable distance measure can be used.
• 56. Algorithm 6 Similarity-based Clustering
Input: X = {x1, x2, · · · , xN} // A set of unlabelled instances.
Output: A set of clusters, C = {C1, C2, · · · , Ck}.
Method:
1: C = ∅;
2: k = 1;
3: Ck = {x1};
4: C = C ∪ Ck;
5: for i = 2 to N do
6:   for l = 1 to k do
7:     find the lth cluster centre xl ∈ Cl that maximises the similarity measure sim(xi, xl);
8:   end for
9:   if sim(xi, xl) ≥ threshold value then
10:    Cl = Cl ∪ xi;
11:  else
12:    k = k + 1;
13:    Ck = {xi};
14:    C = C ∪ Ck;
15:  end if
16: end for
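Algorithm 6 sketched in Python, with the first member of each cluster standing in for its centre; that choice of centre is our assumption, since the slide leaves the centre definition to the similarity measure:

```python
def similarity_cluster(X, sim, threshold):
    """Incremental threshold clustering: seed with the first instance;
    each new instance joins its most similar cluster if the similarity
    clears the threshold, otherwise it starts a new cluster."""
    clusters = [[X[0]]]
    for x in X[1:]:
        best = max(range(len(clusters)), key=lambda l: sim(x, clusters[l][0]))
        if sim(x, clusters[best][0]) >= threshold:
            clusters[best].append(x)
        else:
            clusters.append([x])
    return clusters
```

Note that, unlike k-means, the number of clusters is not fixed in advance: it grows whenever no existing cluster is similar enough.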
• 57. Ensemble Clustering
Ensemble clustering is the process of integrating multiple clustering algorithms into a single strong clustering approach that usually provides better clustering results. It generates a set of clusterings from a given unlabelled data set and then combines them into a final clustering to improve on the quality of the individual clusterings. No single cluster analysis method is optimal: different clustering methods may produce different clusters, because they impose different structures on the data set. Ensemble clustering performs more effectively on high-dimensional complex data and is a good alternative when facing difficult cluster analysis problems.
• 58. Ensemble clustering (con.)
Generally, three strategies are applied in ensemble clustering:
1. Using different clustering algorithms on the same data set to create heterogeneous clusters.
2. Using different samples/subsets of the data with different clustering algorithms to produce component clusters.
3. Running the same clustering algorithm many times on the same data set with different parameters or initialisations to create homogeneous clusters.
The main goal of ensemble clustering is to integrate the component clusterings into one final clustering with a higher accuracy.
• 59. Ensemble clustering on genomic/biological data
Pattern extraction from genomic data applying ensemble clustering.
Figure: Big biological data flows through data pre-processing, feature selection and feature grouping, ensemble clustering, and biclustering to reveal hidden patterns in the data.
• 60. Data Pre-processing
It transforms raw data into an understandable format and includes several techniques:
- Data cleaning is the process of dealing with missing values.
- Data integration merges data from multiple sources into a coherent data store, like a data warehouse, or integrates metadata.
- Data transformation includes: (a) normalisation, (b) aggregation, (c) generalisation, and (d) feature construction.
- Data reduction obtains a reduced representation of the data set (eliminating redundant features/instances).
- Data discretisation reduces the number of values of a continuous feature by dividing its range into intervals.
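As a concrete example of the transformation step, min–max normalisation rescales a numeric feature to a target range; this is one standard normalisation technique, not one the slide specifically prescribes:

```python
def min_max(values, lo=0.0, hi=1.0):
    """Min-max normalisation: linearly rescale values to [lo, hi]."""
    a, b = min(values), max(values)
    return [lo + (v - a) * (hi - lo) / (b - a) for v in values]
```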
  • 61. Feature Selection
It is the process of selecting a subset of relevant features from the original set of features in the data. Feature selection is mainly motivated by three reasons:
- Simplification of models
- Shorter training times
- Reduced overfitting
In biological data, features may contain false correlations, and the information they add may already be contained in other features. In this work, we have applied an unsupervised feature selection approach that measures similarity between features using the maximum information compression index, and we have quantified the information loss of feature selection with an entropy measure. After selecting the subset of features from the data, we have grouped them into two groups: nominal and numeric features.
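The maximum information compression index for a pair of features is commonly defined as the smallest eigenvalue of their 2×2 covariance matrix: it is zero exactly when one feature is a linear function of the other, so low values flag redundant pairs. A sketch under that definition; the toy features are assumptions:

```python
import numpy as np

def mici(x, y):
    # Smallest eigenvalue of the 2x2 covariance matrix of (x, y);
    # 0 when the two features are linearly dependent.
    return np.linalg.eigvalsh(np.cov(x, y))[0]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2.0 * a + 1.0         # redundant: linearly dependent on a
c = rng.normal(size=200)  # informative: independent of a

redundant = mici(a, b)    # close to 0
informative = mici(a, c)  # clearly positive
```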
  • 62. Subspace Clustering
Subspace clustering finds clusters in subspaces of high-dimensional data. It can be classified into three groups:
1. Subspace search methods.
2. Correlation-based clustering methods.
3. Biclustering methods.
A subspace search method searches various subspaces for clusters (sets of instances that are similar to each other in a subspace) within the full space. It uses two kinds of strategies:
- Bottom-up approach: start from low-dimensional subspaces and search higher-dimensional subspaces.
- Top-down approach: start with the full space and recursively search smaller subspaces.
  • 63. Algorithm 7: δ-Biclustering
Input: E, a data matrix, and δ ≥ 0, the maximum acceptable mean squared residue score.
Output: E_IJ, a δ-bicluster: a submatrix of E with row set I and column set J, with a score no larger than δ.
Initialisation: I and J are initialised to the instance and feature sets in the data, and E_IJ = E.
Deletion phase:
1: compute e_iJ for all i ∈ I, e_Ij for all j ∈ J, e_IJ, and H(I, J);
2: if H(I, J) ≤ δ then
3:   return E_IJ;
4: end if
5: find the rows i ∈ I with d(i) = (1/|J|) Σ_{j∈J} (e_ij − e_iJ − e_Ij + e_IJ)²;
6: find the columns j ∈ J with d(j) = (1/|I|) Σ_{i∈I} (e_ij − e_iJ − e_Ij + e_IJ)²;
7: remove the rows i ∈ I and columns j ∈ J with the largest d;
Addition phase:
1: compute e_iJ for all i, e_Ij for all j, e_IJ, and H(I, J);
2: add the columns j ∉ J with (1/|I|) Σ_{i∈I} (e_ij − e_iJ − e_Ij + e_IJ)² ≤ H(I, J);
3: recompute e_iJ, e_IJ, and H(I, J);
4: add the rows i ∉ I with (1/|J|) Σ_{j∈J} (e_ij − e_iJ − e_Ij + e_IJ)² ≤ H(I, J);
5: for each row i ∉ I do
6:   if (1/|J|) Σ_{j∈J} (−e_ij + e_iJ − e_Ij + e_IJ)² ≤ H(I, J) then
7:     add the inverse of i;
8:   end if
9: end for
10: return E_IJ;
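The score H(I, J) used throughout Algorithm 7 is the mean squared residue of the submatrix, which is straightforward to compute. A small sketch; the example matrices are assumptions, chosen so that the first is perfectly additive (residue 0) and the second is not:

```python
import numpy as np

def msr(E):
    # Mean squared residue H(I, J) of submatrix E:
    # residue r_ij = e_ij - e_iJ - e_Ij + e_IJ
    e_iJ = E.mean(axis=1, keepdims=True)  # row means
    e_Ij = E.mean(axis=0, keepdims=True)  # column means
    e_IJ = E.mean()                       # overall mean
    return ((E - e_iJ - e_Ij + e_IJ) ** 2).mean()

perfect = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0],
                    [7.0, 8.0, 9.0]])   # additive pattern: H = 0
noisy = perfect + np.array([[0.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [0.0, 0.0, 0.0]])
```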
  • 64. Clustering of BrS variants
Distribution of BrS variants in clusters using the proposed ensemble clustering.
  • 65. Experimental Method
To test the performance of the clustering algorithms, we have used an unsupervised evaluation method that computes the Compactness (CP) of the clusters, shown in Eq. 13:

CP = (1/n) Σ_{l=1}^{k} n_l [ (Σ_{x_i, x_j ∈ C_l} d(x_i, x_j)) / (n_l(n_l − 1)/2) ]   (13)

where d(x_i, x_j) is the distance between two instances in cluster C_l and n_l is the number of instances in C_l. The smaller the CP of a clustering result, the more compact and better the clustering.
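Eq. 13 can be computed directly; the small two-cluster example below is an assumption, used only to exercise the formula:

```python
import numpy as np

def compactness(X, labels):
    # CP of Eq. 13: cluster-size-weighted average of the mean
    # pairwise intra-cluster distance, divided by n
    n, total = len(X), 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        nl = len(pts)
        if nl < 2:
            continue
        dist = np.sqrt(((pts[:, None] - pts[None, :]) ** 2).sum(-1))
        pair_sum = dist.sum() / 2.0              # each unordered pair once
        total += nl * pair_sum / (nl * (nl - 1) / 2.0)
    return total / n

X = np.array([[0.0, 0.0], [0.0, 1.0],     # cluster 0: pair distance 1
              [10.0, 0.0], [10.0, 2.0]])  # cluster 1: pair distance 2
labels = np.array([0, 0, 1, 1])
cp = compactness(X, labels)               # (2*1 + 2*2) / 4 = 1.5
```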
  • 66. Results
The proposed ensemble clustering is compared with the following clustering algorithms:
- SimpleKMeans (clustering using the k-means method)
- XMeans (an extension of k-means)
- DBScan (nearest-neighbour-based; automatically determines the number of clusters)
- MakeDensityBasedCluster (wraps a clusterer to make it return distribution and density)

Table: Comparison of clustering results on 148 Exome data sets of BrS.

Clustering Method        Compactness (CP)
SimpleKMeans             9.401
XMeans                   8.297
MakeDensityBasedCluster  7.483
DBScan                   6.351
Ensemble Clustering      5.647
  • 67. Hybrid Decision Tree & Naïve Bayes Classifiers
The presence of noisy, contradictory instances in the training data causes learning models to suffer from overfitting and decreases classification accuracy.
- Hybrid Decision Tree (DT) classifier: a naïve Bayes (NB) classifier is used to remove the noisy, troublesome instances from the training data before DT induction.
- Hybrid Naïve Bayes (NB) classifier: a DT is used to select a comparatively more important subset of features before applying the naïve assumption of class-conditional independence. Computing the class-conditional probabilities is extremely expensive for a naïve Bayes classifier on high-dimensional data sets.
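The hybrid DT idea can be sketched with scikit-learn: drop the training instances an NB classifier misclassifies, then induce the tree on the cleaned set. Iris and GaussianNB here are illustrative stand-ins, not the data or NB variant used in the original experiments:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: flag noisy/contradictory instances as those NB misclassifies
nb = GaussianNB().fit(X, y)
keep = nb.predict(X) == y
X_clean, y_clean = X[keep], y[keep]

# Step 2: induce the decision tree on the cleaned training data
tree = DecisionTreeClassifier(random_state=0).fit(X_clean, y_clean)
```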
  • 68. Algorithm 8: Decision Tree Induction
Input: D = {x1, x2, ..., xn}, the training data set with its associated class labels.
Output: T, a decision tree.
Method:
1: for each class Ci ∈ D do
2:   find the prior probability P(Ci);
3: end for
4: for each attribute value Aij ∈ D do
5:   find the class-conditional probability P(Aij | Ci);
6: end for
7: for each training instance xi ∈ D do
8:   find the posterior probability P(Ci | xi);
9:   if xi is misclassified then
10:    remove xi from D;
11:  end if
12: end for
13: T = ∅;
14: determine the best splitting attribute;
15: T = create the root node and label it with the splitting attribute;
16: T = add an arc to the root node for each split predicate and label;
17: for each arc do
18:   D = data set created by applying the splitting predicate to D;
19:   if the stopping point is reached for this path then
20:     T′ = create a leaf node and label it with an appropriate class;
21:   else
22:     T′ = DTBuild(D);
23:   end if
24:   T = add T′ to the arc;
25: end for
  • 69. Algorithm 9: Naïve Bayes Classifier
Input: D = {x1, x2, ..., xn}, the training data.
Output: a classification model.
Method:
1: T = ∅;
2: determine the best splitting attribute;
3: T = create the root node and label it with the splitting attribute;
4: T = add an arc to the root node for each split predicate and label;
5: for each arc do
6:   D = data set created by applying the splitting predicate to D;
7:   if the stopping point is reached for this path then
8:     T′ = create a leaf node and label it with an appropriate class;
9:   else
10:    T′ = DTBuild(D);
11:  end if
12:  T = add T′ to the arc;
13: end for
14: for each attribute Ai ∈ D do
15:   if Ai is not tested in T then
16:     Wi = 0;
17:   else
18:     set d to the minimum depth of Ai in T, and Wi = 1/√d;
19:   end if
20: end for
21: for each class Ci ∈ D do
22:   find the prior probability P(Ci);
23: end for
24: for each attribute Ai ∈ D with Wi ≠ 0 do
25:   for each attribute value Aij ∈ Ai do
26:     find the class-conditional probability P(Aij | Ci)^Wi;
27:   end for
28: end for
29: for each instance xi ∈ D, do
30:   find the posterior probability P(Ci | xi);
31: end for
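The weighting step of Algorithm 9 can be sketched as follows: fit a tree, give each attribute the weight W_i = 1/√d from the minimum depth d at which it is tested (0 if never tested), and let W_i scale each feature's log-likelihood so that P(A_ij | C_i)^{W_i} is applied in log space. Iris, the depth-3 tree, and the Gaussian likelihood are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# W_i = 1/sqrt(minimum depth of attribute A_i in T); 0 if never tested
depths = {}
def walk(node, d):
    f = tree.tree_.feature[node]
    if f >= 0:  # internal (split) node; leaves store feature = -2
        depths[f] = min(depths.get(f, d), d)
        walk(tree.tree_.children_left[node], d + 1)
        walk(tree.tree_.children_right[node], d + 1)
walk(0, 1)
weights = np.zeros(X.shape[1])
for f, d in depths.items():
    weights[f] = 1.0 / np.sqrt(d)

# Gaussian NB with per-feature weights: P(A_ij | C_i)^W_i becomes
# W_i * log-likelihood; weight 0 removes the feature entirely
classes = np.unique(y)
priors = np.log([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
var = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])

def predict(Q):
    ll = -0.5 * (np.log(2 * np.pi * var) + (Q[:, None, :] - means) ** 2 / var)
    return classes[(priors + (ll * weights).sum(axis=2)).argmax(axis=1)]
```

Features the tree never tests contribute nothing to the score, which is how the DT trims the feature set for NB.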
  • 70. Accuracy on Benchmark Datasets
Figure: Classification accuracy on 10 data sets with 10-fold cross-validation.
  • 71. Novel Class Instances
Figure: Instances with a fixed number of class labels (left) and instances of a novel class arriving in the data stream (right).
  • 72. Novel Class Instances (con.)
Figure: Flow chart of classification and novel class detection.
  • 73. Novel Class Instances (con.)
  • 74. Novel Class Instances (con.)
  • 75. *** THANK YOU ***