In this project we study the classification problem and compare some traditional statistical models with neural networks. This work was done as part of the postgraduate programme in Web Science at the Department of Mathematics, Aristotle University of Thessaloniki.
Statistical classification: A review of some techniques
1. Statistical Classification: A review of some techniques
Bamparopoulos Giorgos
Master in Web Science, Department of
Mathematics, Aristotle University of Thessaloniki
2. What is Pattern Recognition?
The study of how machines can observe the
environment
learn to distinguish patterns of interest from their
background, and
make sound and reasonable decisions about the
categories of the patterns.
A pattern is an object, process or event that can
be given a name.
A pattern class (or category) is a set of patterns
sharing common attributes and usually originating
from the same source.
3. Definition
In machine learning, pattern recognition is the assignment of
a label to a given input value. An example of pattern
recognition is classification. However, pattern recognition is a
more general problem that encompasses other types of output
as well.
Other examples are regression, which assigns a real-valued
output to each input; sequence labeling, which assigns a class
to each member of a sequence of values (for example, part of
speech tagging, which assigns a part of speech to each word in
an input sentence); and parsing, which assigns a parse tree to
an input sentence, describing the syntactic structure of the
sentence.
- Wikipedia
4. Pattern recognition system
[Block diagram of a pattern recognition system. Training mode: training patterns from the training set pass through preprocessing and feature extraction/selection into learning. Classification mode: test patterns from the test set pass through preprocessing and feature measurement into classification.]
5. Supervised vs Unsupervised Learning
Supervised learning (classification)
The training data are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements or observations, the aim is to establish the existence of classes or clusters in the data
6. Classification
A classification problem occurs when an object needs to be
assigned into a predefined group or class based on a
number of observed attributes related to that object.
The individual observations are analyzed into a set of quantifiable properties, known as explanatory variables, features, etc. These properties may variously be categorical, ordinal, integer-valued or real-valued.
During classification, given objects are assigned to prescribed classes. A classifier is a mathematical function, implemented by a classification algorithm, that maps input data to a category.
7. Application Domains (1/3)
Computer vision
Medical imaging and medical image analysis
Optical character recognition
Video tracking
Drug discovery and development
Toxicogenomics
Quantitative structure-activity relationship
Geostatistics
Speech recognition
9. Application Domains (3/3)
Digit recognition
Object recognition (http://www.glue.umd.edu/~zhelin/recog.html)
Automated protein classification
Phoneme recognition [Waibel, Hanzawa, Hinton, Shikano, Lang 1989]
10. Example of Classification 1/2
Spam filtering: input is an email (e.g. '!!!!$$$!!!!'), output is binary (spam or not spam).
Character recognition: input is an image of a character (e.g. 'C'), output is a multi-class label.
[thanks to Ben Taskar for slide!]
11. Example of Classification 2/2
Handwriting recognition: input is an image of handwriting, output is the written word (e.g. 'brace').
3D object recognition: input is an image of an object, output is the object's class.
[thanks to Ben Taskar for slide!]
12. Classifiers
Neural networks
Quadratic classifiers
Naive Bayes classifier
Kernel estimation and K-nearest
neighbor algorithms
Decision trees, decision lists
Support vector machines
Maximum entropy classifier
13. Linear Classifier
[Scatter plot of grasshoppers and katydids separated by a line; portrait of R.A. Fisher, 1890-1962.]
If a previously unseen instance is above the line, then its class is katydid; else its class is grasshopper.
Eamonn Keogh, Professor Computer Science & Engineering Department, University of California - Riverside
14. Higher Dimensional Spaces
… we can visualize it as
being an n-dimensional
hyperplane
Eamonn Keogh, Professor Computer Science & Engineering Department, University of California - Riverside
15. If we did not have the 3rd dimension…
Eamonn Keogh, Professor Computer Science & Engineering Department, University of California - Riverside
16. We can no longer get perfect accuracy with the simple linear classifier. We might solve this problem by using a simple quadratic classifier or a simple cubic classifier.
Eamonn Keogh, Professor Computer Science & Engineering Department, University of California - Riverside
20. Linear Discriminant Analysis
Assumes that the conditional class densities are (multivariate) Gaussian
Assumes equal covariance for every class
A sample is then assigned to the class whose discriminant function is maximized for that sample.
Classification rule: $G(x) = \arg\max_k \delta_k(x)$, with linear discriminant functions $\delta_k(x) = x^T \hat{\Sigma}^{-1}\hat{\mu}_k - \tfrac{1}{2}\hat{\mu}_k^T \hat{\Sigma}^{-1}\hat{\mu}_k + \log \hat{\pi}_k$
Pooled covariance matrix: $\hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$
21. Quadratic Discriminant Analysis
Class conditional probability densities are allowed to have different covariance matrices
The class decision boundaries are not linear but quadratic
Classification rule: $G(x) = \arg\max_k \delta_k(x)$, with quadratic discriminant functions $\delta_k(x) = -\tfrac{1}{2}\log|\hat{\Sigma}_k| - \tfrac{1}{2}(x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) + \log \hat{\pi}_k$
22. Problems in Learning
Dimensionality: the number of features is too large relative to the number of training samples
Classifier complexity: the number of unknown parameters associated with the classifier is large
Overtraining: the classifier is too intensively optimized on the training set
23. Dimensionality Reduction
Feature Extraction
Create new features based on the
original feature set
Transforms are usually involved
Feature Selection
Select the best subset from a given
feature set.
25. Kernel Density Estimation
Kernel density estimation (KDE) is a non-parametric way
to estimate the probability density function of a random variable.
Let (x1, x2, …, xn) be a sample drawn from some distribution with an unknown density f. Its kernel density estimator is
$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$
where K(·) is the kernel function, which integrates to one, and h > 0 is a smoothing parameter called the bandwidth.
Some kernel functions: normal (Gaussian), triangular, Epanechnikov.
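As a hedged illustration (not shown on the original slides), a one-dimensional KDE can be computed in MATLAB with the Statistics Toolbox ksdensity function; the sample x and the bandwidth 0.5 below are made-up values.

```matlab
% Minimal KDE sketch; sample and bandwidth are illustrative assumptions.
x = randn(100, 1);                               % sample with unknown density f
[f, xi] = ksdensity(x, 'kernel', 'normal', 'width', 0.5);
plot(xi, f)                                      % smoothed estimate of the pdf
```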
26. Glass Identification dataset
This data is available from the UCI Machine Learning Repository.
Murphy, P.M., Aha, D.W. (1994). UCI Repository of machine learning databases (http://archive.ics.uci.edu/ml/). Irvine, CA: University of California, Department of Information and Computer Science.
From USA Forensic Science Service.
Sources:
Creator: B. German -- Central Research Establishment Home
Office Forensic Science Service Aldermaston, Reading, Berkshire
RG7 4PN
Donor: Vina Spiehler, Ph.D., DABFT Diagnostic Products
Corporation (213) 776-0180 (ext 3014)
Date: September, 1987
27. Glass Identification dataset
The dataset has 214 instances and 6 types of glass, defined in terms of their oxide content (i.e. Na, Fe, K, etc.).
The study of classification of types of glass was motivated by criminological investigation. At the scene of the crime, the glass left can be used as evidence... if it is correctly identified!
Type of glass (class attribute):
1 building windows, float processed
2 building windows, non float processed
3 vehicle windows, float processed
4 containers
5 tableware
6 headlamps
Attribute information: id; 1 RI: refractive index; 2 Na: sodium; 3 Mg: magnesium; 4 Al: aluminum; 5 Si: silicon; 6 K: potassium; 7 Ca: calcium; 8 Ba: barium; 9 Fe: iron.
*Unit of measurement: weight percent in the corresponding oxide.
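A minimal loading sketch (the slides do not show this step; the file name glass.data and its path within the UCI repository are assumptions):

```matlab
% Assumed location of the raw data file within the UCI repository.
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data';
urlwrite(url, 'glass.data');       % download (newer MATLAB releases use websave)
data = csvread('glass.data');      % 214 rows: id, 9 features, class label
X = data(:, 2:10);                 % RI, Na, Mg, Al, Si, K, Ca, Ba, Fe
Y = data(:, 11);                   % type of glass (class attribute)
```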
28. Scatter plots of features
The scatter plots depict the relationship between features grouped by the type of glass.
[Scatter plot grid. X-axis: refractive index, sodium, magnesium. Y-axis: iron, barium, calcium.]
29. Scatter plots of features
[Scatter plot grid. X-axis: refractive index, sodium, magnesium. Y-axis: potassium, silicon, aluminum.]
30. Scatter plots of features
[Scatter plot grid. X-axis: aluminum, silicon, potassium. Y-axis: iron, barium, calcium.]
32. Linear Discriminant Analysis
First we classify the data using the default linear discriminant analysis (LDA).
In the scatter plot, we draw an X through the misclassified observations.
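A sketch of this step with the Statistics Toolbox classify function, assuming X and Y as loaded above (the slides do not show the exact call):

```matlab
% Train LDA on the full training set and predict on the same set.
[pred, err] = classify(X, X, Y, 'linear');
misclassified = find(pred ~= Y);   % observations to mark with an X in the plot
```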
33. Confusion Matrix
A confusion matrix contains information about known
class labels and predicted class labels.
The (i,j) element in the confusion matrix is the number
of samples whose predicted class is i and whose
known class label is class j.
The diagonal elements represent correctly classified
observations.
Confusion matrix on the training set:
46 16  3  0  1  0
14 41  3  2  1  1
10 12 11  0  0  1
 0  4  0 10  0  2
 0  3  0  0  7  1
 0  0  0  1  0 24
The misclassification error (the proportion of misclassified observations) on the training set is 35.05%.
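A sketch of computing the confusion matrix and training error, assuming pred and Y from the previous snippet; note that confusionmat puts the known classes on the rows, so the matrix on this slide corresponds to its transpose:

```matlab
C = confusionmat(Y, pred)';              % (i,j): predicted class i, known class j
trainErr = sum(pred ~= Y) / numel(Y);    % proportion of misclassified observations
```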
34. Visualization of the regions
The function has separated the plane into regions divided by lines, and assigned different regions to different classes. One way to visualize these regions is to create a grid of (x,y) values and apply the classification function to that grid, as in the sketch below.
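A sketch of this visualization for two features (using features 1 and 2 here is an assumption):

```matlab
% Classify a dense grid of (x,y) values and color each point by its class.
[xg, yg] = meshgrid(linspace(min(X(:,1)), max(X(:,1)), 200), ...
                    linspace(min(X(:,2)), max(X(:,2)), 200));
predGrid = classify([xg(:), yg(:)], X(:, 1:2), Y, 'linear');
gscatter(xg(:), yg(:), predGrid)         % regions appear as colored areas
```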
35. Generalization error
Generalization error is the expected prediction error on an independent set. Cross-validation is a statistical method for estimating the generalization error of classification algorithms.
In k-fold cross-validation the data is first partitioned into k equally (or nearly equally) sized segments or folds. Subsequently, k iterations of training and validation are performed such that within each iteration a different fold of the data is held out for validation while the remaining k-1 folds are used for learning. Here we use 10-fold cross-validation.
The LDA cross-validation error is 40.19%.
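A sketch of the 10-fold estimate using the Statistics Toolbox crossval and cvpartition functions (the slides do not show the exact call):

```matlab
cp = cvpartition(Y, 'kfold', 10);        % 10 stratified folds
predfun = @(xtr, ytr, xte) classify(xte, xtr, ytr, 'linear');
cvErr = crossval('mcr', X, Y, 'predfun', predfun, 'partition', cp);
```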
36. Quadratic discriminant analysis
The covariance matrices of some classes in the training set were not positive definite, so we applied quadratic discriminant analysis (QDA) without taking into account features 6, 8 and 9.
Confusion matrix:
54 42  1  0  0  0
 6 26  0  0  1  0
10  7 16  0  0  0
 0  1  0  9  0  0
 0  0  0  0 12  0
 0  0  0  0  0 29
The misclassification error is 31.78% and the cross-validation error is 42.52%.
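A sketch of the QDA variant, dropping features 6, 8 and 9 (K, Ba, Fe):

```matlab
keep = [1 2 3 4 5 7];                    % exclude features 6, 8 and 9
[predQ, errQ] = classify(X(:, keep), X(:, keep), Y, 'quadratic');
```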
37. Mean of variables
The table below shows the mean of each variable (feature) for each group separately. The accompanying plot depicts the means in three dimensions.

Class                        RI        Na      Mg     Al     Si      K      Ca      Ba     Fe
building windows float       1.518718  13.242  3.552  1.164  72.619  0.447   8.797  0.013  0.057
building windows non float   1.518619  13.112  3.002  1.408  72.598  0.521   9.074  0.050  0.080
vehicle windows float        1.517964  13.437  3.544  1.201  72.405  0.406   8.783  0.009  0.057
containers                   1.518928  12.828  0.774  2.034  72.366  1.470  10.124  0.188  0.061
tableware                    1.517456  14.647  1.306  1.367  73.207  0.000   9.357  0.000  0.000
headlamps                    1.517116  14.442  0.538  2.123  72.966  0.325   8.491  1.040  0.013
38. Variance of variables
The table below shows the variance of each variable (feature) for each group separately. The accompanying plot depicts these values in three dimensions.

Class                        RI        Na     Mg     Al     Si     K      Ca     Ba     Fe
building windows float       5.14e-06  0.249  0.061  0.075  0.324  0.046  0.330  0.007  0.008
building windows non float   1.45e-05  0.441  1.478  0.101  0.525  0.046  3.693  0.131  0.011
vehicle windows float        3.67e-06  0.257  0.026  0.121  0.262  0.053  0.144  0.001  0.012
containers                   1.12e-05  0.604  0.998  0.482  1.644  4.574  4.769  0.370  0.024
tableware                    9.71e-06  1.175  1.204  0.327  1.165  0.000  2.102  0.000  0.000
headlamps                    6.48e-06  0.471  1.249  0.196  0.884  0.447  0.948  0.443  0.001
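Both tables can be reproduced with the Statistics Toolbox grpstats function (a sketch, assuming X and Y as before):

```matlab
% Per-class means and variances of each feature.
[grpMeans, grpVars] = grpstats(X, Y, {'mean', 'var'});
```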
39. Naive Bayes Classifier
We use the Gaussian distribution for features 1, 2, 3, 4, 5 and 7, and kernel density estimation for features 6, 8 and 9.
Confusion matrix:
12  2  0  0  5  3
50 56  7  1  0  0
 5 12 10  0  0  0
 1  0  0 28  1  4
 7  0  0  0  7  0
 1  0  0  0  0  2
The misclassification error is 46.26% and the cross-validation error is 57.01%.
If we assume that the prior probabilities are equal for all classes, the errors are 60.75% and 64.49% respectively.
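A hedged sketch with the current Statistics Toolbox fitcnb API (the original work likely used the older NaiveBayes.fit; the calls below are an assumption about how to reproduce the setup):

```matlab
dn = repmat({'normal'}, 1, 9);           % Gaussian for features 1, 2, 3, 4, 5, 7
dn([6 8 9]) = {'kernel'};                % KDE for features 6, 8 and 9
nb = fitcnb(X, Y, 'DistributionNames', dn);
pred = predict(nb, X);
trainErr = sum(pred ~= Y) / numel(Y);
% The kernel used on the following slides is selected with the 'Kernel'
% option, e.g. 'normal', 'triangle' or 'epanechnikov'; equal priors are
% obtained with 'Prior', 'uniform'.
```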
40. Naive Bayes Classifier
If we use kernel density estimation with the normal kernel function we have:
Confusion matrix:
60 17  3  0  0  0
 4 44  0  0  1  0
 6 11 14  0  0  1
 0  1  0  9  0  0
 0  3  0  0 12  0
 0  0  0  0  0 28
The misclassification error is 21.96% and the cross-validation error is 39.72%.
If we assume that the prior probabilities are equal for all classes, the errors are 29.44% and 45.33% respectively.
41. Naive Bayes Classifier
If we use kernel density estimation with the triangle kernel function we have:
Confusion matrix:
61 16  2  0  0  0
 4 44  0  0  1  0
 5 12 15  0  0  1
 0  1  0  9  0  0
 0  3  0  0 12  0
 0  0  0  0  0 28
The misclassification error is 21.03% and the cross-validation error is 42.06%.
If we assume that the prior probabilities are equal for all classes, the errors are 29.44% and 45.33% respectively.
42. Naive Bayes Classifier
If we use kernel density estimation with the Epanechnikov kernel function we have:
Confusion matrix:
61 16  3  0  0  0
 4 44  0  1  0  0
 5 12 14  0  0  1
 0  4  0 12  0  0
 0  0  0  0  9  0
 0  0  0  0  0 28
The misclassification error is 21.5% and the cross-validation error is 42.52%.
If we assume that the prior probabilities are equal for all classes, the errors are 30.37% and 44.39% respectively.
43. Neural Networks
Biological neural networks are made up of real biological
neurons that are connected or functionally related in a nervous
system. In the field of neuroscience, they are often identified as
groups of neurons that perform a specific physiological function in
laboratory analysis.
Image sources: http://www.mysearch.org.uk/website1/html/106.Connectionist.html and http://www.quora.com/Radiology/Will-MRI-technology-ever-reach-the-resolution-to-image-individual-neurons
44. Artificial Neural Network
An Artificial Neural Network (ANN), usually called neural
network (NN), is a mathematical model or computational
model that is inspired by the structure and/or functional
aspects of biological neural networks.
A neural network consists of an interconnected group
of artificial neurons, and it processes information using
a connectionist approach to computation.
- Wikipedia
45. Multiple Layers of Neurons
A network can have several layers. Each layer has a
weight matrix W, a bias vector b, and an output vector a.
Neural Network Toolbox™ User's Guide, Mark Hudson Beale, Martin T. Hagan, Howard B. Demuth
46. Neural network for Classification
In this classification problem, we use a two-layer feed-forward network with 10 neurons in the hidden layer.
47. Transfer functions
The hyperbolic tangent sigmoid transfer function (tansig) is used in both the hidden and output neurons.
This is mathematically equivalent to tanh(n). It differs in that it runs faster than the MATLAB implementation of tanh, but the results can have very small numerical differences.
48. Initializing weights and bias
We initialized the weights and biases in each layer with the initnw function from the Neural Network Toolbox in MATLAB.
This function initializes a layer's weights and biases
according to the Nguyen-Widrow initialization algorithm.
This algorithm chooses values in order to distribute the
active region of each neuron in the layer approximately
evenly across the layer's input space.
The values contain a degree of randomness, so they
are not the same each time this function is called.
50. Training function
We trained the network using scaled conjugate gradient backpropagation.
The scaled conjugate gradient algorithm is based on conjugate directions, but it does not perform a line search at each iteration (Moller, Neural Networks, Vol. 6, 1993, pp. 525-533).
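A sketch pulling the preceding slides together with the Neural Network Toolbox (the exact options used in the original are not shown; treating tansig and initnw as the layer settings follows slides 47-48):

```matlab
T = full(ind2vec(Y'));                   % one-hot targets, one column per sample
net = patternnet(10, 'trainscg');        % 10 hidden neurons, scaled conjugate gradient
net.layers{1}.transferFcn = 'tansig';    % hyperbolic tangent sigmoid (slide 47)
net.layers{2}.transferFcn = 'tansig';
net.layers{1}.initFcn = 'initnw';        % Nguyen-Widrow initialization (slide 48)
net.layers{2}.initFcn = 'initnw';
net = init(net);                         % randomized, so results vary per call
[net, tr] = train(net, X', T);           % data are split into train/val/test sets
```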
51. Error Backpropagation Algorithm
The gradients are computed through a backpropagation process.
Thanks Sargur Srihari for the slide
53. Confusion Matrices
We observe that the misclassification error is around 31%-32% for the training, validation and test samples.
The total misclassification error is 31.8%.
56. Receiver operating characteristic
The receiver operating characteristic (ROC) is a metric used to check the quality of classifiers.
For each class of a classifier, ROC applies threshold values across the interval [0,1] to the outputs.
For each threshold, two values are calculated (see the sketch after this list):
True positive ratio: the number of outputs greater than or equal to the threshold, divided by the number of one targets
False positive ratio: the number of outputs less than the threshold, divided by the number of zero targets
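A sketch using the Neural Network Toolbox roc and plotroc functions, assuming net, X and T from the training sketch above:

```matlab
outputs = net(X');                        % network outputs for each sample
[tpr, fpr, thresholds] = roc(T, outputs); % one ROC curve per output class
plotroc(T, outputs)                       % curves hugging the top-left are better
```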
57. Receiver operating characteristic
These plots depict the
receiver operating
characteristic for each
output class. The
more each curve hugs
the left and top edges
of the plot, the better
the classification.
60. Conclusion (1/2)
One major limitation of the statistical models is that
they work well only when the underlying
assumptions are satisfied.
The effectiveness of these methods depends on the
various assumptions or conditions under which the
models are developed.
On the other hand, neural networks are data-driven, self-adaptive methods, in that they can adjust themselves to the data without any explicit specification of a functional or distributional form for the underlying model.
61. Conclusion (2/2)
On this dataset, it seems that the neural network is significantly more accurate than linear discriminant analysis, quadratic discriminant analysis and Naive Bayes.
Although in some traditional classification methods the misclassification error was near 21%, the cross-validation error was over 40% for the majority of them. These results imply that, for this dataset, these methods tend to overfit the data.
For the neural network, the misclassification error on the independent test sample was 31.3%.
62. References
M. F. Moller, A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning, Neural Networks, Vol. 6, pp. 525-533, 1993
G. P. Zhang, Neural Networks for Classification: A Survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 30, No. 4, 2000
Hagan, Demuth, and Beale, Neural Network Toolbox™ User's Guide, 2012
Murphy, P.M., Aha, D.W. (1994). UCI Repository of machine learning databases (http://archive.ics.uci.edu/ml/). Irvine, CA: University of California, Department of Information and Computer Science.
Lectures:
I. Antoniou, Statistical Models of Networks 1, Master in Web Science, Aristotle
University of Thessaloniki, 2012
I. Antoniou, Statistical Models of Networks 2, Master in Web Science, Aristotle
University of Thessaloniki, 2012
All computations were performed in MATLAB. The following toolboxes were used:
Neural Network Toolbox™
Statistics Toolbox™