5. Random Variables and Probability Distributions
• Random Variables - Random responses
corresponding to subjects randomly selected from
a population.
• Probability Distributions - A listing of the
possible outcomes and their probabilities (discrete
r.v.s) or their densities (continuous r.v.s)
• Normal Distribution - Bell-shaped continuous
distribution widely used in statistical inference
• Sampling Distributions - Distributions
corresponding to sample statistics (such as mean
and proportion) computed from random samples
6. Normal Distribution
• Bell-shaped, symmetric family of distributions
• Classified by 2 parameters: Mean (µ) and standard
deviation (σ). These represent location and spread
• Random variables that are approximately normal have the following properties with respect to individual measurements:
• Approximately half (50%) fall above (and below) mean
• Approximately 68% fall within 1 standard deviation of mean
• Approximately 95% fall within 2 standard deviations of mean
• Virtually all fall within 3 standard deviations of mean
• Notation when Y is normally distributed with mean µ
and standard deviation σ :
Y ~ N(µ, σ)
8. Example - Heights of U.S. Adults
• Female and Male adult heights are well approximated by normal
distributions: YF ~ N(63.7, 2.5) and YM ~ N(69.1, 2.6)
[Histogram: adult male heights (INCHESM), Mean = 69.1, Std. Dev = 2.61, N = 99.23, cases weighted by PCTM]
[Histogram: adult female heights (INCHESF), Mean = 63.7, Std. Dev = 2.48, N = 99.68, cases weighted by PCTF]
Source: Statistical Abstract of the U.S. (1992)
9. Standard Normal (Z) Distribution
• Problem: Unlimited number of possible
normal distributions (-∞ < µ < ∞, σ > 0)
• Solution: Standardize the random variable to
have mean 0 and standard deviation 1
Y ~ N(µ, σ)  ⇒  Z = (Y - µ) / σ ~ N(0, 1)
• Probabilities of certain ranges of values and specific
percentiles of interest can be obtained through the standard
normal (Z) distribution
10. Standard Normal (Z) Distribution
• Standard Normal Distribution Characteristics:
• P(Z ≥ 0) = P(Y ≥ µ) = 0.5000
• P(-1 ≤ Z ≤ 1) = P(µ-σ ≤ Y ≤ µ+σ) = 0.6826
• P(-2 ≤ Z ≤ 2) = P(µ-2σ ≤ Y ≤ µ+2σ) = 0.9544
• P(Z ≥ za) = P(Z ≤ -za) = a (using Z-table)
a:   0.500  0.100  0.050  0.025  0.010  0.005
za:  0.000  1.282  1.645  1.960  2.326  2.576
11. Finding Probabilities of Specific Ranges
• Step 1 - Identify the normal distribution of interest
(e.g. its mean (µ) and standard deviation (σ) )
• Step 2 - Identify the range of values that you wish to
determine the probability of observing (YL , YU), where
often the upper or lower bounds are ∞ or -∞
• Step 3 - Transform YL and YU into Z-values:
ZL = (YL - µ) / σ    ZU = (YU - µ) / σ
• Step 4 - Obtain P(ZL ≤ Z ≤ ZU) from the Z-table
12. Example - Adult Female Heights
• What is the probability a randomly selected female
is 5’10” or taller (70 inches)?
• Step 1 - Y ~ N(63.7 , 2.5)
• Step 2 - YL = 70.0, YU = ∞
• Step 3 - ZL = (70.0 - 63.7) / 2.5 = 2.52, ZU = ∞
• Step 4 - P(Y ≥ 70) = P(Z ≥ 2.52) = .0059 (≈ 1/170)
z .00 .01 .02 .03
2.4 .0082 .0080 .0078 .0075
2.5 .0062 .0060 .0059 .0057
2.6 .0047 .0045 .0044 .0043
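• A quick check of this calculation (a minimal sketch; the use of scipy and the variable names are mine, not part of the slides):

# P(a randomly selected female is 70 inches or taller), with Y ~ N(63.7, 2.5)
from scipy.stats import norm

mu, sigma = 63.7, 2.5                       # mean and standard deviation of female heights
z = (70.0 - mu) / sigma                     # Step 3: transform to a Z-value (about 2.52)
print(norm.sf(z))                           # Step 4: upper-tail probability, about 0.0059
print(norm.sf(70.0, loc=mu, scale=sigma))   # same answer computed on the original scale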
13. Finding Percentiles of a Distribution
• Step 1 - Identify the normal distribution of
interest (e.g. its mean (µ) and standard deviation
(σ) )
• Step 2 - Determine the percentile of interest
100p% (e.g. the 90th percentile is the cut-off where 90% of scores fall below it and 10% fall above it)
• Step 3 - Turn the percentile of interest into a
tail probability a and corresponding z-value
(zp):
• If 100p ≥ 50 then a = 1 - p and zp = za
• If 100p < 50 then a = p and zp = -za
• Step 4 - Transform zp back to original units:
Yp = µ + zp σ
14. Example - Adult Male Heights
• Above what height do the tallest 5% of males lie?
• Step 1 - Y ~ N(69.1 , 2.6)
• Step 2 - Want to determine 95th percentile (p = .95)
• Step 3 - Since 100p > 50, a = 1-p = 0.05
zp = za = z.05 = 1.645
• Step 4 - Y.95 = 69.1 + (1.645)(2.6) = 73.4
z .03 .04 .05 .06
1.5 .0630 .0618 .0606 .0594
1.6 .0516 .0505 .0495 .0485
1.7 .0418 .0409 .0401 .0392
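• The same percentile can be checked in a couple of lines (a sketch, assuming scipy is available; the variable names are mine):

# 95th percentile of male heights, with Y ~ N(69.1, 2.6)
from scipy.stats import norm

mu, sigma = 69.1, 2.6
z_p = norm.ppf(0.95)                        # z-value leaving 5% in the upper tail, about 1.645
print(mu + z_p * sigma)                     # Step 4: back to original units, about 73.4 inches
print(norm.ppf(0.95, loc=mu, scale=sigma))  # same answer computed directly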
21. Statistical Models
• When making statistical inference it is useful
to write random variables in terms of model
parameters and random errors
Y = µ + (Y - µ) = µ + ε    where ε = Y - µ
• Here µ is a fixed constant and ε is a random variable
• In practice µ will be unknown, and we will use sample data to estimate or
make statements regarding its value
22. Sampling Distributions and the
Central Limit Theorem
• Sample statistics based on random samples are also
random variables and have sampling distributions
that are probability distributions for the statistic
(outcomes that would vary across samples)
• When samples are large and measurements
independent then many estimators have normal
sampling distributions (CLT):
• Sample Mean: Ȳ ~ N(µ, σ/√n)
• Sample Proportion: π̂ ~ N(π, √(π(1-π)/n))
23. Example - Adult Female Heights
• Random samples of n = 100 females to be
selected
• For each sample, the sample mean is computed
• Sampling distribution:
Ȳ ~ N(63.5, 2.5/√100) ≡ N(63.5, 0.25)
• Note that approximately 95% of all possible random samples of 100
females will have sample means between 63.0 and 64.0 inches
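• The CLT claim is easy to check by simulation (a sketch using the values on this slide; the use of numpy and the names are mine):

# Sampling distribution of the mean for samples of n = 100 female heights
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 63.5, 2.5, 100
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())   # close to 63.5
print(sample_means.std())    # close to sigma / sqrt(n) = 0.25
print(np.mean((sample_means >= 63.0) & (sample_means <= 64.0)))   # roughly 0.95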
25. Topics Covered:
• Is there a relationship between x and y?
• What is the strength of this relationship?
• Pearson’s r
• Can we describe this relationship and use this to predict y from x?
• Regression
• Is the relationship we have described statistically significant?
• t test
• Relevance to SPM
• GLM
26. The relationship between x and y
• Correlation: is there a relationship between 2 variables?
• Regression: how well does a certain independent variable predict the dependent variable?
• CORRELATION ≠ CAUSATION
• In order to infer causality: manipulate independent variable and observe effect
on dependent variable
28. Variance vs Covariance
• Do two variables change together?
Variance:
• Gives information on the variability of a single variable.
sx² = Σ(xi - x̄)² / (n - 1)
Covariance:
• Gives information on the degree to which two variables vary together.
cov(x,y) = Σ(xi - x̄)(yi - ȳ) / (n - 1)
• Note how similar the covariance is to variance: the equation simply multiplies x's error scores by y's error scores as opposed to squaring x's error scores.
29. Covariance
• When X and Y move in the same direction: cov(x,y) is positive
• When X and Y move in opposite directions: cov(x,y) is negative
• When there is no constant relationship: cov(x,y) = 0
cov(x,y) = Σ(xi - x̄)(yi - ȳ) / (n - 1)
30. Example Covariance
x       y       xi - x̄   yi - ȳ   (xi - x̄)(yi - ȳ)
0       3       -3        0         0
2       2       -1        -1        1
3       4       0         1         0
4       0       1         -3        -3
6       6       3         3         9
x̄ = 3   ȳ = 3                       Σ = 7

cov(x,y) = Σ(xi - x̄)(yi - ȳ) / (n - 1) = 7 / 4 = 1.75

What does this number tell us?
[Scatter plot of the five (x, y) points]
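• The same arithmetic in code (a minimal sketch; numpy is my choice, not part of the slides):

import numpy as np

x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov_xy)                       # 1.75, as computed above
print(np.cov(x, y, ddof=1)[0, 1])   # same value from the sample covariance matrix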
31. Problem with Covariance:
• The value obtained by covariance is dependent on the size of the
data’s standard deviations: if large, the value will be greater than if
small… even if the relationship between x and y is exactly the same in
the large versus small standard deviation datasets.
32. Example of how covariance value
relies on variance
               High variance data                     Low variance data
Subject   x     y     x error * y error       x     y     x error * y error
1         101   100   2500                    54    53    9
2         81    80    900                     53    52    4
3         61    60    100                     52    51    1
4         51    50    0                       51    50    0
5         41    40    100                     50    49    1
6         21    20    900                     49    48    4
7         1     0     2500                    48    47    9
Mean      51    50                            51    50
Sum of x error * y error: 7000                Sum of x error * y error: 28
Covariance: 1166.67                           Covariance: 4.67
33. Solution: Pearson’s r
• On its own, the covariance value is hard to interpret
• Solution: standardise this measure
• Pearson’s R: standardises the covariance value.
• Divides the covariance by the multiplied standard
deviations of X and Y:
rxy = cov(x,y) / (sx sy)
34. Pearson’s R continued
cov(x,y) = Σ(xi - x̄)(yi - ȳ) / (n - 1)

rxy = Σ(xi - x̄)(yi - ȳ) / ((n - 1) sx sy)

rxy = Σ Zxi Zyi / (n - 1)
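• A short sketch of the standardisation (numpy and the reuse of the toy data are mine): dividing by the two standard deviations makes r insensitive to the scale of x and y, which fixes the problem noted on slide 31.

import numpy as np

x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
print(r)                                  # Pearson's r for the toy data
print(np.corrcoef(x, y)[0, 1])            # same value from numpy
print(np.corrcoef(10 * x + 5, y)[0, 1])   # unchanged when x is rescaled, unlike the covariance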
35. Limitations of r
• When r = 1 or r = -1:
• We can predict y from x with certainty
• all data points are on a straight line: y = ax + b
• r is actually r̂:
• r = true r of the whole population
• r̂ = estimate of r based on the data
• r is very sensitive to extreme values:
[Scatter plot: a single extreme point can substantially change r̂]
37. Regression
• Correlation tells you if there is an association between x and y but it
doesn’t describe the relationship or allow you to predict one variable
from the other.
• To do this we need REGRESSION!
38. Best-fit Line
• Aim of linear regression is to fit a straight line, ŷ = ax + b, to data that gives the best prediction of y for any value of x
• This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals
[Figure: scatter plot with fitted line ŷ = ax + b, where a = slope and b = intercept; ŷ = predicted value, yi = true value, ε = residual error]
39. Least Squares Regression
• To find the best line we must minimise the sum of
the squares of the residuals (the vertical distances
from the data points to our line)
Residual (ε) = y - ŷ
Sum of squares of residuals = Σ(y – ŷ)²
Model line: ŷ = ax + b    (a = slope, b = intercept)
• We must find the values of a and b that minimise Σ(y – ŷ)²
40. Finding b
• First we find the value of b that gives the min sum of
squares
[Figure: candidate lines with different intercepts b and their residuals ε]
• Trying different values of b is equivalent to shifting the line up and down the scatter plot
41. Finding a
• Now we find the value of a that gives the min sum of
squares
[Figure: candidate lines with different slopes a and the same intercept b]
• Trying out different values of a is equivalent to changing the slope of the line, while b stays constant
42. Minimising sums of squares
• Need to minimise Σ(y–ŷ)2
• ŷ = ax + b
• so need to minimise:
Σ(y - ax - b)2
• If we plot the sums of squares for
all different values of a and b we
get a parabola, because it is a
squared term
• So the min sum of squares is at
the bottom of the curve, where
the gradient is zero.
[Figure: sum of squares (S) plotted against values of a and b; the minimum of S is where the gradient = 0]
43. The maths bit
• The min sum of squares is at the bottom of the curve where the
gradient = 0
• So we can find a and b that give min sum of squares by taking partial
derivatives of Σ(y - ax - b)2 with respect to a and b separately
• Then we solve these for 0 to give us the values of a and b that give the
min sum of squares
44. The solution
• Doing this gives the following equations for a and b:
a = r × (sy / sx)
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
• From this you can see that:
• A low correlation coefficient gives a flatter slope (small value of a)
• Large spread of y, i.e. high standard deviation, results in a steeper slope (high value of a)
• Large spread of x, i.e. high standard deviation, results in a flatter slope (small value of a)
45. The solution cont.
• Our model equation is ŷ = ax + b
• This line must pass through the mean point, so: ȳ = a x̄ + b, which gives b = ȳ – a x̄
• We can put our equation for a into this, giving:
b = ȳ – r × (sy / sx) × x̄
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
• The smaller the correlation, the closer the intercept is to the mean of y
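• A small sketch of these two formulas on the toy data from slide 30 (numpy is my choice; np.polyfit is only used as a cross-check):

import numpy as np

x = np.array([0, 2, 3, 4, 6], dtype=float)
y = np.array([3, 2, 4, 0, 6], dtype=float)

r = np.corrcoef(x, y)[0, 1]
a = r * y.std(ddof=1) / x.std(ddof=1)   # slope: a = r * sy / sx
b = y.mean() - a * x.mean()             # intercept: b = y-bar - a * x-bar
print(a, b)
print(np.polyfit(x, y, 1))              # same slope and intercept from a direct least-squares fit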
46. Back to the model
• If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat straight line crossing the y-axis at ȳ
• But this isn't very useful.
• We can calculate the regression line for any data, but the important question is how well does this line fit the data, or how good is it at predicting y from x
ŷ = ax + b = r × (sy / sx) × x + ȳ – r × (sy / sx) × x̄
Rearranges to:  ŷ = r × (sy / sx) × (x – x̄) + ȳ
47. How good is our model?
• Total variance of y:
sy² = Σ(y – ȳ)² / (n - 1) = SSy / dfy
• Variance of predicted y values (ŷ):
sŷ² = Σ(ŷ – ȳ)² / (n - 1) = SSpred / dfŷ
This is the variance explained by our regression model
• Error variance:
serror² = Σ(y – ŷ)² / (n - 2) = SSer / dfer
This is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model
48. How good is our model cont.
• Total variance = predicted variance + error variance:
sy² = sŷ² + ser²
• Conveniently, via some complicated rearranging:
sŷ² = r² sy²
r² = sŷ² / sy²
• So r² is the proportion of the variance in y that is explained by our regression model
49. How good is our model cont.
• Insert r² sy² into sy² = sŷ² + ser² and rearrange to get:
ser² = sy² – r² sy² = sy² (1 – r²)
• From this we can see that the greater the correlation, the smaller the error variance, so the better our prediction
50. Is the model significant?
• i.e. do we get a significantly better prediction of y
from our regression equation than by just predicting
the mean?
• F-statistic:
F(dfŷ, dfer) = sŷ² / ser² = ... (complicated rearranging) ... = r² (n - 2) / (1 – r²)
• And it follows that:
t(n-2) = r √(n - 2) / √(1 – r²)    (because F = t²)
• So all we need to know are r and n
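• So the test needs only r and n, as in this sketch (scipy is used for the p-value; the numbers are purely illustrative):

import numpy as np
from scipy import stats

r, n = 0.35, 5                                # illustrative values only
F = r**2 * (n - 2) / (1 - r**2)               # F(dfy-hat, dfer) = F(1, n-2)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)    # t(n-2); note F = t**2
print(F, t)
print(2 * stats.t.sf(abs(t), df=n - 2))       # two-sided p-value for the slope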
52. General Linear Model
• Linear regression is actually a form of the General Linear Model
where the parameters are a, the slope of the line, and b, the intercept.
y = ax + b + ε
• A General Linear Model is just any model that describes the data in
terms of a straight line
53. Multiple regression
• Multiple regression is used to determine the effect of a number
of independent variables, x1, x2, x3 etc, on a single dependent
variable, y
• The different x variables are combined in a linear way and each
has its own regression coefficient:
y = a1x1 + a2x2 + … + anxn + b + ε
• The a parameters reflect the independent contribution of each
independent variable, x, to the value of the dependent variable,
y.
• i.e. the amount of variance in y that is accounted for by each x
variable after all the other x variables have been accounted for
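• A minimal multiple-regression sketch (the synthetic data and the use of scikit-learn are my choices, just to illustrate the idea):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 * x1 - 1.0 * x2 + 0.5 + rng.normal(scale=0.1, size=50)   # y = a1*x1 + a2*x2 + b + error

X = np.column_stack([x1, x2])
model = LinearRegression().fit(X, y)
print(model.coef_)        # estimates of a1 and a2 (close to 2.0 and -1.0)
print(model.intercept_)   # estimate of b (close to 0.5)
print(model.score(X, y))  # R^2: proportion of variance in y explained by the model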
55. Dimensionality Reduction
and Feature Construction
• Principal components analysis (PCA)
• Reading: L. I. Smith, A tutorial on principal components analysis (on
class website)
• PCA used to reduce dimensions of data without much loss of
information.
• Used in machine learning and in signal processing and image
compression (among other things).
56. PCA is "an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (first principal component), the second greatest variance lies on the second coordinate (second principal component), and so on."
57. Background for PCA
• Suppose attributes are A1 and A2, and we have n training examples. x's denote values of A1 and y's denote values of A2 over the training examples.
• Variance of an attribute:
var(A1) = Σ(xi - x̄)² / (n - 1)
58. • Covariance of two attributes:
cov(A1, A2) = Σ(xi - x̄)(yi - ȳ) / (n - 1)
• If covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. Zero: no linear relationship between the two dimensions.
59. • Covariance matrix
• Suppose we have n attributes, A1, ..., An.
• Covariance matrix:
C is the n × n matrix with entries ci,j = cov(Ai, Aj)
61. • Eigenvectors:
• Let M be an n × n matrix.
• v is an eigenvector of M if M × v = λv
• λ is called the eigenvalue associated with v
• For any eigenvector v of M and scalar a, M × (av) = λ(av)
• Thus you can always choose eigenvectors of length 1: v1² + ... + vn² = 1
• If M has any eigenvectors, it has n of them, and they are orthogonal to one another.
• Thus eigenvectors can be used as a new basis for an n-dimensional vector space.
62. PCA
1. Given original data set S = {x1, ..., xk},
produce new set by subtracting the mean
of attribute Ai from each xi.
Mean of original data: 1.81, 1.91  →  Mean of adjusted data: 0, 0
64. 2. Calculate the covariance matrix:
3. Calculate the (unit) eigenvectors and
eigenvalues of the covariance matrix:
[2 × 2 covariance matrix with rows and columns corresponding to x and y]
66. 4. Order eigenvectors by eigenvalue, highest to
lowest.
In general, you get n components. To reduce
dimensionality to p, ignore n-p components at
the bottom of the list.
Eigenvalues: 0.0490833989 and 1.28402771
Corresponding unit eigenvectors: (-0.735178956, 0.677873399) and (-0.677873399, -0.735178956)
68. 5. Derive the new data set.
TransformedData = RowFeatureVector × RowDataAdjust
This gives original data in terms of chosen
components (eigenvectors)—that is, along these
axes.
RowFeatureVector1 = ( -0.677873399  -0.735178956
                      -0.735178956   0.677873399 )

RowFeatureVector2 = ( -0.677873399  -0.735178956 )

RowDataAdjust = ( 0.69  -1.31  0.39  0.09  1.29  0.49  0.19  -0.81  -0.31  -0.71
                  0.49  -1.21  0.99  0.29  1.09  0.79  -0.31  -0.81  -0.31  -1.01 )
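• The whole pipeline in a few lines of numpy (a sketch; the data are reconstructed from the means and mean-adjusted values shown above, i.e. the example from the Smith tutorial, and eigenvector signs may be flipped relative to the slides):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

X_adj = X - X.mean(axis=0)              # Step 1: subtract the mean of each attribute
C = np.cov(X_adj, rowvar=False)         # Step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # Step 3: eigenvalues and unit eigenvectors (as columns)

order = np.argsort(eigvals)[::-1]       # Step 4: order by eigenvalue, highest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)                          # approximately 1.28402771 and 0.0490833989

p = 1                                   # Step 5: keep only the top p components
transformed = X_adj @ eigvecs[:, :p]    # the data expressed along the chosen axes
print(transformed.ravel())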
75. Clustering Strategies
• K-means
– Iteratively re-assign points to the nearest cluster center
• Agglomerative clustering
– Start with each point as its own cluster and iteratively merge the closest clusters
• Density-based clustering
– DBSCAN
As we go down this chart, the clustering strategies have more tendency to transitively group points even if they are not nearby in feature space
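• A sketch of the three strategies with scikit-learn (the toy blob data and the parameter values are mine, for illustration only):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # label -1 marks noise points

print(np.bincount(kmeans_labels))
print(np.bincount(agglo_labels))
print(np.unique(dbscan_labels))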
83. Classification
• Apply a prediction function to a feature representation
of the image to get the desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
Slide credit: L. Lazebnik
84. The machine learning framework
y = f(x)
• Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set
• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
[Diagram: image feature x → prediction function f → output y]
Slide credit: L. Lazebnik
87. Classifiers: Nearest neighbor
f(x) = label of the training example nearest to x
• All we need is a distance function for our inputs
• No training required!
[Figure: a test example in feature space along with training examples from class 1 and class 2]
Slide credit: L. Lazebnik
88. Classifiers: Linear
• Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
Slide credit: L. Lazebnik
89. Many classifiers to choose from
• SVM
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized Forests
• Boosted Decision Trees
• K-nearest neighbor
• RBMs
• Etc.
Which is the best one?
Slide credit: D. Hoiem
90. Generalization
• How well does a learned model generalize from
the data it was trained on to a new test set?
Training set (labels known)    Test set (labels unknown)
Slide credit: L. Lazebnik
91. Generalization
• Components of generalization error
• Bias: how much the average model over all training sets differs from the true model
• Error due to inaccurate assumptions/simplifications made by the model
• Variance: how much models estimated from different training sets differ from each other
• Underfitting: model is too "simple" to represent all the relevant class characteristics
• High bias and low variance
• High training error and high test error
• Overfitting: model is too "complex" and fits irrelevant characteristics (noise) in the data
• Low bias and high variance
• Low training error and high test error
Slide credit: L. Lazebnik
92. Bias-Variance Trade-off
• Models with too few parameters are inaccurate because of a large bias (not enough flexibility).
• Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).
Slide credit: D. Hoiem
93. Bias-Variance Trade-off
E(MSE) = noise² + bias² + variance
(noise² = unavoidable error; bias² = error due to incorrect assumptions; variance = error due to variance of training samples)
See the following for explanations of bias-variance (also Bishop's "Neural Networks" book):
• http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf
Slide credit: D. Hoiem
95. Bias-variance tradeoff
[Plot: test error vs model complexity for many vs few training examples; low complexity = high bias/low variance, high complexity = low bias/high variance]
Slide credit: D. Hoiem
96. Effect of Training Size
[Plot: training and testing error vs number of training examples for a fixed prediction model; the gap between the curves is the generalization error]
Slide credit: D. Hoiem
97. How to reduce variance?
• Choose a simpler classifier
• Regularize the parameters
• Get more training data
Slide credit: D. Hoiem
99. Very brief tour of some classifiers
• K-nearest neighbor
• SVM
• Boosted Decision Trees
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized Forests
• RBMs
• Etc.
100. Generative vs. Discriminative Classifiers
Generative Models
• Represent both the data and the labels
• Often, makes use of conditional independence and priors
• Examples
• Naïve Bayes classifier
• Bayesian network
• Models of data may apply to future prediction problems
Discriminative Models
• Learn to directly predict the labels from the data
• Often, assume a simple boundary (e.g., linear)
• Examples
– Logistic regression
– SVM
– Boosted decision trees
• Often easier to predict a label from the data than to model the data
Slide credit: D. Hoiem
101. Classification
• Assign input vector to one of two or more classes
• Any decision rule divides input space into decision regions separated by decision boundaries
Slide credit: L. Lazebnik
102. Nearest Neighbor Classifier
• Assign label of nearest training data point to
each test data point
Voronoi partitioning of feature space
for two-category 2D and 3D data
from Duda et al.
Source: D. Lowe
110. Classifiers: Logistic Regression
[Figure: two classes (male, female) plotted by height and pitch of voice (features x1, x2), separated by a linear boundary]
log [ P(x1, x2 | y = 1) / P(x1, x2 | y = -1) ] = wT x
P(y = 1 | x1, x2) = 1 / (1 + exp(-wT x))
• Maximize likelihood of label given data, assuming a log-linear model
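• A sketch with scikit-learn (the two-feature synthetic data stands in for the pitch/height example; all numbers are made up for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
class0 = rng.normal(loc=[120.0, 178.0], scale=[15.0, 7.0], size=(100, 2))   # one class in (x1, x2)
class1 = rng.normal(loc=[210.0, 165.0], scale=[20.0, 7.0], size=(100, 2))   # the other class
X = np.vstack([class0, class1])
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.coef_, clf.intercept_)             # the learned w and b of the log-linear model
print(clf.predict_proba([[150.0, 172.0]]))   # [P(y=0|x), P(y=1|x)], i.e. the sigmoid of w.x + b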
111. Classifiers: Linear SVM
• Find a linear function to separate the classes: f(x) = sgn(w · x + b)
[Scatter plot: two classes (x's and o's) in (x1, x2) feature space]
112. Classifiers: Linear SVM
• Find a linear function to separate the classes: f(x) = sgn(w · x + b)
[Scatter plot: two classes (x's and o's) in (x1, x2) feature space]
113. Classifiers: Linear SVM
• Find a linear function to separate the classes: f(x) = sgn(w · x + b)
[Scatter plot: two classes (x's and o's) in (x1, x2) feature space]
115. Nonlinear SVMs
• Datasets that are linearly separable work out great:
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space:
[Figure: 1D data along x that is not linearly separable becomes separable after mapping x → (x, x²)]
Slide credit: Andrew Moore
116. Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
Slide credit: Andrew Moore
117. Nonlinear SVMs
• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
K(xi, xj) = φ(xi) · φ(xj)
• This gives a nonlinear decision boundary in the original feature space:
Σi αi yi φ(xi) · φ(x) + b = Σi αi yi K(xi, x) + b
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
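• A sketch of why the kernel matters, using scikit-learn (the concentric-circles dataset is my choice of a "too hard" example, not part of the slides):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)   # kernel trick: no explicit phi(x) computed

print(linear_svm.score(X, y))   # poor: one circle inside another cannot be split by a line
print(rbf_svm.score(X, y))      # near 1.0: nonlinear boundary in the original feature space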
118. Kernels for bags of features
• Histogram intersection kernel:
I(h1, h2) = Σi=1..N min(h1(i), h2(i))
• Generalized Gaussian kernel:
K(h1, h2) = exp( -(1/A) D(h1, h2)² )
• D can be (inverse) L1 distance, Euclidean distance, χ2 distance, etc.
J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, IJCV 2007
119. What about multi-class SVMs?
• Unfortunately, there is no "definitive" multi-class SVM formulation
• In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs
• One vs. others
• Training: learn an SVM for each class vs. the others
• Testing: apply each SVM to the test example and assign to it the class of the SVM that returns the highest decision value
• One vs. one
• Training: learn an SVM for each pair of classes
• Testing: each learned SVM "votes" for a class to assign to the test example
Slide credit: L. Lazebnik
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
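• A sketch of how these imports are typically used (the iris data is just a convenient 3-class example, not part of the slides): SVC combines two-class SVMs one vs. one, while LinearSVC trains one vs. the rest.

from sklearn.datasets import load_iris
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)            # 3 classes

ovo = SVC(kernel='linear').fit(X, y)         # one-vs-one under the hood
ovr = LinearSVC(max_iter=10000).fit(X, y)    # one-vs-rest ("one vs. others")

print(ovo.score(X, y), ovr.score(X, y))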
121. Summary: Classifiers
• Nearest-neighbor and k-nearest-neighbor classifiers
• L1 distance, χ2 distance, quadratic distance
• Support vector machines
• Linear classifiers
• Margin maximization
• The kernel trick
• Kernel functions: generalized Gaussian, RBF
• Multi-class
• Of course, there are many other classifiers out there
• Neural networks, boosting, decision trees, …
Slide credit: L. Lazebnik
137. Perceptron Learning Theorem
• Recap: A perceptron (threshold unit) can learn anything
that it can represent (i.e. anything separable with a
hyperplane)
138. The Exclusive OR problem
A Perceptron cannot represent Exclusive OR
since it is not linearly separable.
140. Minsky & Papert (1969) offered solution to XOR problem by
combining perceptron unit responses using a second layer of
Units. Piecewise linear classification using an MLP with
threshold (perceptron) units
1
2
+1
+1
3
140
141. What does each of the layers do?
1st layer draws
linear boundaries
2nd layer combines
the boundaries
3rd layer can generate
arbitrarily complex
boundaries
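• A sketch of this idea in code (scikit-learn's MLPClassifier and the chosen layer size are mine; the point is only that a hidden layer lets the network combine linear boundaries to solve XOR):

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR labels: not linearly separable

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    solver='lbfgs', max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))                           # should recover [0 1 1 0] (may need another seed)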