In this project we study the classification problem and compare some traditional statistical models with neural networks. This work was done as part of the postgraduate programme in Web Science at the Department of Mathematics, Aristotle University of Thessaloniki.
Statistical classification: A review of some techniques
1. Statistical Classification: A review of some techniques
Bamparopoulos Giorgos
Master in Web Science, Department of
Mathematics, Aristotle University of Thessaloniki
2. What is Pattern Recognition?
The study of how machines can observe the
environment
learn to distinguish patterns of interest from their
background, and
make sound and reasonable decisions about the
categories of the patterns.
A pattern is an object, process or event that can
be given a name.
A pattern class (or category) is a set of patterns
sharing common attributes and usually originating
from the same source.
3. Definition
In machine learning, pattern recognition is the assignment of
a label to a given input value. An example of pattern
recognition is classification. However, pattern recognition is a
more general problem that encompasses other types of output
as well.
Other examples are regression, which assigns a real-valued
output to each input; sequence labeling, which assigns a class
to each member of a sequence of values (for example, part of
speech tagging, which assigns a part of speech to each word in
an input sentence); and parsing, which assigns a parse tree to
an input sentence, describing the syntactic structure of the
sentence.
- Wikipedia
4. Pattern recognition system
[Block diagram of a pattern recognition system. Training mode: training patterns from the training set pass through preprocessing and feature extraction/selection into learning. Classification mode: test patterns from the test set pass through preprocessing and feature measurement into classification.]
5. Supervised vs Unsupervised Learning
Supervised learning (classification)
The training data are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements or observations, the aim is to establish the existence of classes or clusters in the data
6. Classification
A classification problem occurs when an object needs to be
assigned into a predefined group or class based on a
number of observed attributes related to that object.
The individual observations are analyzed into a set of quantifiable properties, known as explanatory variables, features, etc. These properties may variously be categorical, ordinal, integer-valued or real-valued.
During classification, given objects are assigned to prescribed classes. A classifier is a mathematical function, implemented by a classification algorithm, that maps input data to a category.
7. Application Domains (1/3)
Computer vision
Medical imaging and medical image analysis
Optical character recognition
Video tracking
Drug discovery and development
Toxicogenomics
Quantitative structure-activity relationship
Geostatistics
Speech recognition
9. Application Domains (3/3)
Digit recognition
Object recognition (http://www.glue.umd.edu/~zhelin/recog.html)
Automated protein classification
Phoneme recognition [Waibel, Hanzawa, Hinton, Shikano, Lang 1989]
10. Example of Classification 1/2
Spam filtering: input is an email (e.g. '!!!!$$$!!!!'), output is binary (spam or not spam).
Character recognition: input is an image of a character (e.g. 'C'), output is a multi-class label.
[thanks to Ben Taskar for slide!]
11. Example of Classification 2/2
Handwriting recognition: input is an image of handwriting, output is the written word (e.g. 'brace').
3D object recognition: input is an image of an object, output is the object's class.
[thanks to Ben Taskar for slide!]
12. Classifiers
Neural networks
Quadratic classifiers
Naive Bayes classifier
Kernel estimation and K-nearest
neighbor algorithms
Decision trees, decision lists
Support vector machines
Maximum entropy classifier
13. Linear Classifier
[Scatter plot of grasshoppers and katydids separated by a line; portrait of R.A. Fisher, 1890-1962.]
If a previously unseen instance is above the line, then its class is katydid; else its class is grasshopper.
Eamonn Keogh, Professor Computer Science & Engineering Department, University of California - Riverside
14. Higher Dimensional Spaces
… we can visualize it as
being an n-dimensional
hyperplane
Eamonn Keogh, Professor Computer Science & Engineering Department, University of California - Riverside
15. If we did not have the 3rd dimension…
Eamonn Keogh, Professor Computer Science & Engineering Department, University of California - Riverside
16. We can no longer get perfect accuracy with the simple linear classifier. We might solve this problem by using a simple quadratic classifier or a simple cubic classifier.
Eamonn Keogh, Professor Computer Science & Engineering Department, University of California - Riverside
20. Linear Discriminant Analysis
Assumes that the conditional class densities are (multivariate) Gaussian
Assumes equal covariance for every class
A sample is then assigned to the class whose discriminant function is maximized for that sample.
Classification rule: $G(x) = \arg\max_k \delta_k(x)$, with linear discriminant functions $\delta_k(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_k - \tfrac{1}{2}\hat{\mu}_k^T \hat{\Sigma}^{-1}\hat{\mu}_k + \log \hat{\pi}_k$
Pooled covariance matrix: $\hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$
21. Quadratic Discriminant Analysis
Class conditional probability densities are allowed to have different covariance matrices
The class decision boundaries are not linear but quadratic
Classification rule: $G(x) = \arg\max_k \delta_k(x)$, with quadratic discriminant functions $\delta_k(x) = -\tfrac{1}{2}\log|\hat{\Sigma}_k| - \tfrac{1}{2}(x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) + \log \hat{\pi}_k$
22. Problems in Learning
Dimensionality: the number of features is too large relative to the number of training samples
Classifier complexity: the number of unknown parameters associated with the classifier is large
Overtraining: the classifier is too intensively optimized on the training set
23. Dimensionality Reduction
Feature Extraction
Create new features based on the
original feature set
Transforms are usually involved
Feature Selection
Select the best subset from a given
feature set.
25. Kernel Density Estimation
Kernel density estimation (KDE) is a non-parametric way
to estimate the probability density function of a random variable.
Let (x1, x2, …, xn) be a sample drawn from some distribution with an unknown density f. Its kernel density estimator is
$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$
where K(·) is the kernel function, which integrates to one, and h > 0 is a smoothing parameter called the bandwidth.
Some kernel functions: normal (Gaussian), triangular, Epanechnikov.
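As a hedged illustration (not shown on the original slides), a one-dimensional KDE can be computed in MATLAB with the Statistics Toolbox ksdensity function; the sample x and the bandwidth 0.5 below are made-up values.

```matlab
% Minimal KDE sketch; sample and bandwidth are illustrative assumptions.
x = randn(100, 1);                               % sample with unknown density f
[f, xi] = ksdensity(x, 'kernel', 'normal', 'width', 0.5);
plot(xi, f)                                      % smoothed estimate of the pdf
```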
26. Glass Identification dataset
This data is available from the UCI Machine Learning Repository.
Murphy, P.M., Aha, D.W. (1994). UCI Repository of machine learning databases (http://archive.ics.uci.edu/ml/). Irvine, CA: University of California, Department of Information and Computer Science.
From USA Forensic Science Service.
Sources:
Creator: B. German -- Central Research Establishment Home
Office Forensic Science Service Aldermaston, Reading, Berkshire
RG7 4PN
Donor: Vina Spiehler, Ph.D., DABFT Diagnostic Products
Corporation (213) 776-0180 (ext 3014)
Date: September, 1987
27. Glass Identification dataset
The dataset has 214 instances and 6 types of glass, defined in terms of their oxide content (i.e. Na, Fe, K, etc.).
The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence... if it is correctly identified!
Type of glass (class attribute):
1 building windows, float processed
2 building windows, non float processed
3 vehicle windows, float processed
4 containers
5 tableware
6 headlamps
Attribute information: id; 1 RI: refractive index; 2 Na: sodium; 3 Mg: magnesium; 4 Al: aluminum; 5 Si: silicon; 6 K: potassium; 7 Ca: calcium; 8 Ba: barium; 9 Fe: iron.
*Unit of measurement: weight percent in the corresponding oxide.
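A minimal loading sketch (the slides do not show this step; the file name glass.data and its path within the UCI repository are assumptions):

```matlab
% Assumed location of the raw data file within the UCI repository.
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data';
urlwrite(url, 'glass.data');       % download (newer MATLAB releases use websave)
data = csvread('glass.data');      % 214 rows: id, 9 features, class label
X = data(:, 2:10);                 % RI, Na, Mg, Al, Si, K, Ca, Ba, Fe
Y = data(:, 11);                   % type of glass (class attribute)
```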
28. Scatter plots of features
The scatter plots depict the relationship between features grouped by the type of glass.
[Scatter plot grid. X-axis: refractive index, sodium, magnesium. Y-axis: iron, barium, calcium.]
29. Scatter plots of features
[Scatter plot grid. X-axis: refractive index, sodium, magnesium. Y-axis: potassium, silicon, aluminum.]
30. Scatter plots of features
[Scatter plot grid. X-axis: aluminum, silicon, potassium. Y-axis: iron, barium, calcium.]
32. Linear Discriminant Analysis
First we classify the data using the default linear discriminant analysis (LDA).
In the scatter plot, we draw an X through the misclassified observations.
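A sketch of this step with the Statistics Toolbox classify function, assuming X and Y as loaded above (the slides do not show the exact call):

```matlab
% Train LDA on the full training set and predict on the same set.
[pred, err] = classify(X, X, Y, 'linear');
misclassified = find(pred ~= Y);   % observations to mark with an X in the plot
```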
33. Confusion Matrix
A confusion matrix contains information about known
class labels and predicted class labels.
The (i,j) element in the confusion matrix is the number
of samples whose predicted class is i and whose
known class label is class j.
The diagonal elements represent correctly classified
observations.
Confusion matrix on the training set:
46 16  3  0  1  0
14 41  3  2  1  1
10 12 11  0  0  1
 0  4  0 10  0  2
 0  3  0  0  7  1
 0  0  0  1  0 24
The misclassification error (the proportion of misclassified observations) on the training set is 35.05%.
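A sketch of computing the confusion matrix and training error, assuming pred and Y from the previous snippet; note that confusionmat puts the known classes on the rows, so the matrix on this slide corresponds to its transpose:

```matlab
C = confusionmat(Y, pred)';              % (i,j): predicted class i, known class j
trainErr = sum(pred ~= Y) / numel(Y);    % proportion of misclassified observations
```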
34. Visualization of the regions
The function has separated the plane into regions divided by lines, and assigned different regions to different classes. One way to visualize these regions is to create a grid of (x,y) values and apply the classification function to that grid, as in the sketch below.
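A sketch of this visualization for two features (using features 1 and 2 here is an assumption):

```matlab
% Classify a dense grid of (x,y) values and color each point by its class.
[xg, yg] = meshgrid(linspace(min(X(:,1)), max(X(:,1)), 200), ...
                    linspace(min(X(:,2)), max(X(:,2)), 200));
predGrid = classify([xg(:), yg(:)], X(:, 1:2), Y, 'linear');
gscatter(xg(:), yg(:), predGrid)         % regions appear as colored areas
```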
35. Generalization error
Generalization error is the expected prediction error on an independent set. Cross-validation is a statistical method for estimating the generalization error of classification algorithms.
In k-fold cross-validation the data is first partitioned into k equally (or nearly equally) sized segments or folds. Subsequently, k iterations of training and validation are performed such that within each iteration a different fold of the data is held out for validation while the remaining k-1 folds are used for learning. Here we use 10-fold cross-validation.
The LDA cross-validation error is 40.19%.
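A sketch of the 10-fold estimate using the Statistics Toolbox crossval and cvpartition functions (the slides do not show the exact call):

```matlab
cp = cvpartition(Y, 'kfold', 10);        % 10 stratified folds
predfun = @(xtr, ytr, xte) classify(xte, xtr, ytr, 'linear');
cvErr = crossval('mcr', X, Y, 'predfun', predfun, 'partition', cp);
```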
36. Quadratic discriminant analysis
The covariance matrices of some classes in the training set were not positive definite, so we applied quadratic discriminant analysis (QDA) without taking into account features 6, 8 and 9.
Confusion matrix:
54 42  1  0  0  0
 6 26  0  0  1  0
10  7 16  0  0  0
 0  1  0  9  0  0
 0  0  0  0 12  0
 0  0  0  0  0 29
The misclassification error is 31.78% and the cross-validation error is 42.52%.
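A sketch of the QDA variant, dropping features 6, 8 and 9 (K, Ba, Fe):

```matlab
keep = [1 2 3 4 5 7];                    % exclude features 6, 8 and 9
[predQ, errQ] = classify(X(:, keep), X(:, keep), Y, 'quadratic');
```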
37. Mean of variables
The table below shows the mean of each variable (feature) for each group separately. The accompanying plot depicts the means in three dimensions.

Class                        RI        Na      Mg     Al     Si      K      Ca      Ba     Fe
building windows float       1.518718  13.242  3.552  1.164  72.619  0.447   8.797  0.013  0.057
building windows non float   1.518619  13.112  3.002  1.408  72.598  0.521   9.074  0.050  0.080
vehicle windows float        1.517964  13.437  3.544  1.201  72.405  0.406   8.783  0.009  0.057
containers                   1.518928  12.828  0.774  2.034  72.366  1.470  10.124  0.188  0.061
tableware                    1.517456  14.647  1.306  1.367  73.207  0.000   9.357  0.000  0.000
headlamps                    1.517116  14.442  0.538  2.123  72.966  0.325   8.491  1.040  0.013
38. Variance of variables
The table below shows the variance of each variable (feature) for each group separately. The accompanying plot depicts these values in three dimensions.

Class                        RI        Na     Mg     Al     Si     K      Ca     Ba     Fe
building windows float       5.14e-06  0.249  0.061  0.075  0.324  0.046  0.330  0.007  0.008
building windows non float   1.45e-05  0.441  1.478  0.101  0.525  0.046  3.693  0.131  0.011
vehicle windows float        3.67e-06  0.257  0.026  0.121  0.262  0.053  0.144  0.001  0.012
containers                   1.12e-05  0.604  0.998  0.482  1.644  4.574  4.769  0.370  0.024
tableware                    9.71e-06  1.175  1.204  0.327  1.165  0.000  2.102  0.000  0.000
headlamps                    6.48e-06  0.471  1.249  0.196  0.884  0.447  0.948  0.443  0.001
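Both tables can be reproduced with the Statistics Toolbox grpstats function (a sketch, assuming X and Y as before):

```matlab
% Per-class means and variances of each feature.
[grpMeans, grpVars] = grpstats(X, Y, {'mean', 'var'});
```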
39. Naive Bayes Classifier
We use the Gaussian distribution for features 1, 2, 3, 4, 5 and 7, and kernel density estimation for features 6, 8 and 9.
Confusion matrix:
12  2  0  0  5  3
50 56  7  1  0  0
 5 12 10  0  0  0
 1  0  0 28  1  4
 7  0  0  0  7  0
 1  0  0  0  0  2
The misclassification error is 46.26% and the cross-validation error is 57.01%.
If we assume that the prior probabilities are equal for all classes, the errors are 60.75% and 64.49% respectively.
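A hedged sketch with the current Statistics Toolbox fitcnb API (the original work likely used the older NaiveBayes.fit; the calls below are an assumption about how to reproduce the setup):

```matlab
dn = repmat({'normal'}, 1, 9);           % Gaussian for features 1, 2, 3, 4, 5, 7
dn([6 8 9]) = {'kernel'};                % KDE for features 6, 8 and 9
nb = fitcnb(X, Y, 'DistributionNames', dn);
pred = predict(nb, X);
trainErr = sum(pred ~= Y) / numel(Y);
% The kernel used on the following slides is selected with the 'Kernel'
% option, e.g. 'normal', 'triangle' or 'epanechnikov'; equal priors are
% obtained with 'Prior', 'uniform'.
```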
40. Naive Bayes Classifier
If we use kernel density estimation with the normal kernel function we have:
Confusion matrix:
60 17  3  0  0  0
 4 44  0  0  1  0
 6 11 14  0  0  1
 0  1  0  9  0  0
 0  3  0  0 12  0
 0  0  0  0  0 28
The misclassification error is 21.96% and the cross-validation error is 39.72%.
If we assume that the prior probabilities are equal for all classes, the errors are 29.44% and 45.33% respectively.
41. Naive Bayes Classifier
If we use kernel density estimation with the triangle kernel function we have:
Confusion matrix:
61 16  2  0  0  0
 4 44  0  0  1  0
 5 12 15  0  0  1
 0  1  0  9  0  0
 0  3  0  0 12  0
 0  0  0  0  0 28
The misclassification error is 21.03% and the cross-validation error is 42.06%.
If we assume that the prior probabilities are equal for all classes, the errors are 29.44% and 45.33% respectively.
42. Naive Bayes Classifier
If we use kernel density estimation with the Epanechnikov kernel function we have:
Confusion matrix:
61 16  3  0  0  0
 4 44  0  1  0  0
 5 12 14  0  0  1
 0  4  0 12  0  0
 0  0  0  0  9  0
 0  0  0  0  0 28
The misclassification error is 21.5% and the cross-validation error is 42.52%.
If we assume that the prior probabilities are equal for all classes, the errors are 30.37% and 44.39% respectively.
43. Neural Networks
Biological neural networks are made up of real biological
neurons that are connected or functionally related in a nervous
system. In the field of neuroscience, they are often identified as
groups of neurons that perform a specific physiological function in
laboratory analysis.
Image sources: http://www.mysearch.org.uk/website1/html/106.Connectionist.html and http://www.quora.com/Radiology/Will-MRI-technology-ever-reach-the-resolution-to-image-individual-neurons
44. Artificial Neural Network
An Artificial Neural Network (ANN), usually called neural
network (NN), is a mathematical model or computational
model that is inspired by the structure and/or functional
aspects of biological neural networks.
A neural network consists of an interconnected group
of artificial neurons, and it processes information using
a connectionist approach to computation.
- Wikipedia
45. Multiple Layers of Neurons
A network can have several layers. Each layer has a
weight matrix W, a bias vector b, and an output vector a.
Neural Network Toolbox™ User's Guide, Mark Hudson Beale, Martin T. Hagan, Howard B. Demuth
46. Neural network for Classification
In this classification problem, we use a two-layer feed-forward network with 10 neurons in the hidden layer.
47. Transfer functions
The hyperbolic tangent sigmoid transfer function (tansig) is used in both the hidden and output neurons.
This is mathematically equivalent to tanh(n). It differs in that it runs faster than the MATLAB implementation of tanh, but the results can have very small numerical differences.
48. Initializing weights and bias
We initialized the weights and biases in each layer with the initnw function from the Neural Network Toolbox in MATLAB.
This function initializes a layer's weights and biases
according to the Nguyen-Widrow initialization algorithm.
This algorithm chooses values in order to distribute the
active region of each neuron in the layer approximately
evenly across the layer's input space.
The values contain a degree of randomness, so they
are not the same each time this function is called.
50. Training function
We trained the network using scaled conjugate gradient backpropagation.
The scaled conjugate gradient algorithm is based on conjugate directions, but it does not perform a line search at each iteration (Moller, Neural Networks, Vol. 6, 1993, pp. 525-533).
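A sketch pulling the preceding slides together with the Neural Network Toolbox (the exact options used in the original are not shown; treating tansig and initnw as the layer settings follows slides 47-48):

```matlab
T = full(ind2vec(Y'));                   % one-hot targets, one column per sample
net = patternnet(10, 'trainscg');        % 10 hidden neurons, scaled conjugate gradient
net.layers{1}.transferFcn = 'tansig';    % hyperbolic tangent sigmoid (slide 47)
net.layers{2}.transferFcn = 'tansig';
net.layers{1}.initFcn = 'initnw';        % Nguyen-Widrow initialization (slide 48)
net.layers{2}.initFcn = 'initnw';
net = init(net);                         % randomized, so results vary per call
[net, tr] = train(net, X', T);           % data are split into train/val/test sets
```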
51. Error Backpropagation Algorithm
The gradients are computed through a backpropagation process.
Thanks Sargur Srihari for the slide
53. Confusion Matrices
We observe that the misclassification error is around 31%-32% for the training, validation and test samples.
The total misclassification error is 31.8%.
56. Receiver operating characteristic
The receiver operating characteristic (ROC) is a metric used to check the quality of classifiers.
For each class of a classifier, ROC applies threshold values across the interval [0,1] to the outputs.
For each threshold, two values are calculated (see the sketch after this list):
True positive ratio: the number of outputs greater than or equal to the threshold, divided by the number of one targets
False positive ratio: the number of outputs less than the threshold, divided by the number of zero targets
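A sketch using the Neural Network Toolbox roc and plotroc functions, assuming net, X and T from the training sketch above:

```matlab
outputs = net(X');                        % network outputs for each sample
[tpr, fpr, thresholds] = roc(T, outputs); % one ROC curve per output class
plotroc(T, outputs)                       % curves hugging the top-left are better
```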
57. Receiver operating characteristic
These plots depict the
receiver operating
characteristic for each
output class. The
more each curve hugs
the left and top edges
of the plot, the better
the classification.
60. Conclusion (1/2)
One major limitation of the statistical models is that
they work well only when the underlying
assumptions are satisfied.
The effectiveness of these methods depends on the
various assumptions or conditions under which the
models are developed.
On the other hand, neural networks are data-driven, self-adaptive methods, in that they can adjust themselves to the data without any explicit specification of a functional or distributional form for the underlying model.
61. Conclusion (2/2)
On this dataset, it seems that the neural network is significantly more accurate than linear discriminant analysis, quadratic discriminant analysis and Naive Bayes.
Although in some traditional classification methods the misclassification error was near 21%, the cross-validation error was over 40% for the majority of them. These results imply that, for this dataset, these methods tend to overfit the data.
For the neural network, the misclassification error on the independent test sample was 31.3%.
62. References
M. F. Moller, A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning, Neural Networks, Vol. 6, pp. 525-533, 1993
G. P. Zhang, Neural Networks for Classification: A Survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 30, No. 4, 2000
Hagan, Demuth, and Beale, Neural Network Toolbox™ User's Guide, 2012
Murphy, P.M., Aha, D.W. (1994). UCI Repository of machine learning databases (http://archive.ics.uci.edu/ml/). Irvine, CA: University of California, Department of Information and Computer Science.
Lectures:
I. Antoniou, Statistical Models of Networks 1, Master in Web Science, Aristotle
University of Thessaloniki, 2012
I. Antoniou, Statistical Models of Networks 2, Master in Web Science, Aristotle
University of Thessaloniki, 2012
All computations were performed in MATLAB. The following toolboxes were used:
Neural Network Toolbox™
Statistics Toolbox™