5. Random Variables and Probability Distributions
• Random Variables - Random responses
corresponding to subjects randomly selected from
a population.
• Probability Distributions - A listing of the
possible outcomes and their probabilities (discrete
r.v.s) or their densities (continuous r.v.s)
• Normal Distribution - Bell-shaped continuous
distribution widely used in statistical inference
• Sampling Distributions - Distributions
corresponding to sample statistics (such as mean
and proportion) computed from random samples
6. Normal Distribution
• Bell-shaped, symmetric family of distributions
• Classified by 2 parameters: Mean (µ) and standard
deviation (σ). These represent location and spread
• Random variables that are approximately normal have the following properties with respect to individual measurements:
• Approximately half (50%) fall above (and below) mean
• Approximately 68% fall within 1 standard deviation of mean
• Approximately 95% fall within 2 standard deviations of mean
• Virtually all fall within 3 standard deviations of mean
• Notation when Y is normally distributed with mean µ
and standard deviation σ :
Y ~ N(µ, σ)
8. Example - Heights of U.S. Adults
• Female and Male adult heights are well approximated by normal
distributions: YF ~ N(63.7, 2.5) and YM ~ N(69.1, 2.6)
[Histogram: adult male heights (INCHESM), Mean = 69.1, Std. Dev = 2.61, N = 99.23, cases weighted by PCTM]
[Histogram: adult female heights (INCHESF), Mean = 63.7, Std. Dev = 2.48, N = 99.68, cases weighted by PCTF]
Source: Statistical Abstract of the U.S. (1992)
9. Standard Normal (Z) Distribution
• Problem: Unlimited number of possible
normal distributions (-∞ < µ < ∞, σ > 0)
• Solution: Standardize the random variable to
have mean 0 and standard deviation 1
Y ~ N(µ, σ)  ⇒  Z = (Y - µ) / σ ~ N(0, 1)
• Probabilities of certain ranges of values and specific
percentiles of interest can be obtained through the standard
normal (Z) distribution
10. Standard Normal (Z) Distribution
• Standard Normal Distribution Characteristics:
• P(Z ≥ 0) = P(Y ≥ µ) = 0.5000
• P(-1 ≤ Z ≤ 1) = P(µ-σ ≤ Y ≤ µ+σ) = 0.6826
• P(-2 ≤ Z ≤ 2) = P(µ-2σ ≤ Y ≤ µ+2σ) = 0.9544
• P(Z ≥ za) = P(Z ≤ -za) = a (using Z-table)
a:   0.500  0.100  0.050  0.025  0.010  0.005
za:  0.000  1.282  1.645  1.960  2.326  2.576
11. Finding Probabilities of Specific Ranges
• Step 1 - Identify the normal distribution of interest
(e.g. its mean (µ) and standard deviation (σ) )
• Step 2 - Identify the range of values that you wish to
determine the probability of observing (YL , YU), where
often the upper or lower bounds are ∞ or -∞
• Step 3 - Transform YL and YU into Z-values:
ZL = (YL - µ) / σ    ZU = (YU - µ) / σ
• Step 4 - Obtain P(ZL ≤ Z ≤ ZU) from the Z-table
12. Example - Adult Female Heights
• What is the probability a randomly selected female
is 5’10” or taller (70 inches)?
• Step 1 - Y ~ N(63.7 , 2.5)
• Step 2 - YL = 70.0, YU = ∞
• Step 3 - ZL = (70.0 - 63.7) / 2.5 = 2.52, ZU = ∞
• Step 4 - P(Y ≥ 70) = P(Z ≥ 2.52) = .0059 (≈ 1/170)
z .00 .01 .02 .03
2.4 .0082 .0080 .0078 .0075
2.5 .0062 .0060 .0059 .0057
2.6 .0047 .0045 .0044 .0043
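• A quick check of this calculation (a minimal sketch; the use of scipy and the variable names are mine, not part of the slides):

# P(a randomly selected female is 70 inches or taller), with Y ~ N(63.7, 2.5)
from scipy.stats import norm

mu, sigma = 63.7, 2.5                       # mean and standard deviation of female heights
z = (70.0 - mu) / sigma                     # Step 3: transform to a Z-value (about 2.52)
print(norm.sf(z))                           # Step 4: upper-tail probability, about 0.0059
print(norm.sf(70.0, loc=mu, scale=sigma))   # same answer computed on the original scale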
13. Finding Percentiles of a Distribution
• Step 1 - Identify the normal distribution of
interest (e.g. its mean (µ) and standard deviation
(σ) )
• Step 2 - Determine the percentile of interest
100p% (e.g. the 90th percentile is the cut-off where 90% of scores fall below it and 10% fall above it)
• Step 3 - Turn the percentile of interest into a
tail probability a and corresponding z-value
(zp):
• If 100p ≥ 50 then a = 1 - p and zp = za
• If 100p < 50 then a = p and zp = -za
• Step 4 - Transform zp back to original units:
Yp = µ + zp σ
14. Example - Adult Male Heights
• Above what height do the tallest 5% of males lie?
• Step 1 - Y ~ N(69.1 , 2.6)
• Step 2 - Want to determine 95th percentile (p = .95)
• Step 3 - Since 100p > 50, a = 1-p = 0.05
zp = za = z.05 = 1.645
• Step 4 - Y.95 = 69.1 + (1.645)(2.6) = 73.4
z .03 .04 .05 .06
1.5 .0630 .0618 .0606 .0594
1.6 .0516 .0505 .0495 .0485
1.7 .0418 .0409 .0401 .0392
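• The same percentile can be checked in a couple of lines (a sketch, assuming scipy is available; the variable names are mine):

# 95th percentile of male heights, with Y ~ N(69.1, 2.6)
from scipy.stats import norm

mu, sigma = 69.1, 2.6
z_p = norm.ppf(0.95)                        # z-value leaving 5% in the upper tail, about 1.645
print(mu + z_p * sigma)                     # Step 4: back to original units, about 73.4 inches
print(norm.ppf(0.95, loc=mu, scale=sigma))  # same answer computed directly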
21. Statistical Models
• When making statistical inference it is useful
to write random variables in terms of model
parameters and random errors
Y = µ + (Y - µ) = µ + ε    where ε = Y - µ
• Here µ is a fixed constant and ε is a random variable
• In practice µ will be unknown, and we will use sample data to estimate or
make statements regarding its value
22. Sampling Distributions and the
Central Limit Theorem
• Sample statistics based on random samples are also
random variables and have sampling distributions
that are probability distributions for the statistic
(outcomes that would vary across samples)
• When samples are large and measurements
independent then many estimators have normal
sampling distributions (CLT):
• Sample Mean: Ȳ ~ N(µ, σ/√n)
• Sample Proportion: π̂ ~ N(π, √(π(1-π)/n))
23. Example - Adult Female Heights
• Random samples of n = 100 females to be
selected
• For each sample, the sample mean is computed
• Sampling distribution:
Ȳ ~ N(63.5, 2.5/√100) ≡ N(63.5, 0.25)
• Note that approximately 95% of all possible random samples of 100
females will have sample means between 63.0 and 64.0 inches
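• The CLT claim is easy to check by simulation (a sketch using the values on this slide; the use of numpy and the names are mine):

# Sampling distribution of the mean for samples of n = 100 female heights
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 63.5, 2.5, 100
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())   # close to 63.5
print(sample_means.std())    # close to sigma / sqrt(n) = 0.25
print(np.mean((sample_means >= 63.0) & (sample_means <= 64.0)))   # roughly 0.95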
25. Topics Covered:
• Is there a relationship between x and y?
• What is the strength of this relationship?
• Pearson’s r
• Can we describe this relationship and use this to predict y from x?
• Regression
• Is the relationship we have described statistically significant?
• t test
• Relevance to SPM
• GLM
26. The relationship between x and y
• Correlation: is there a relationship between 2 variables?
• Regression: how well does a certain independent variable predict the dependent variable?
• CORRELATION ≠ CAUSATION
• In order to infer causality: manipulate independent variable and observe effect
on dependent variable
28. Variance vs Covariance
• Do two variables change together?
Variance:
• Gives information on the variability of a single variable.
sx² = Σ(xi - x̄)² / (n - 1)
Covariance:
• Gives information on the degree to which two variables vary together.
cov(x,y) = Σ(xi - x̄)(yi - ȳ) / (n - 1)
• Note how similar the covariance is to variance: the equation simply multiplies x's error scores by y's error scores as opposed to squaring x's error scores.
29. Covariance
• When X and Y move in the same direction: cov(x,y) is positive
• When X and Y move in opposite directions: cov(x,y) is negative
• When there is no constant relationship: cov(x,y) = 0
cov(x,y) = Σ(xi - x̄)(yi - ȳ) / (n - 1)
30. Example Covariance
x       y       xi - x̄   yi - ȳ   (xi - x̄)(yi - ȳ)
0       3       -3        0         0
2       2       -1        -1        1
3       4       0         1         0
4       0       1         -3        -3
6       6       3         3         9
x̄ = 3   ȳ = 3                       Σ = 7

cov(x,y) = Σ(xi - x̄)(yi - ȳ) / (n - 1) = 7 / 4 = 1.75

What does this number tell us?
[Scatter plot of the five (x, y) points]
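• The same arithmetic in code (a minimal sketch; numpy is my choice, not part of the slides):

import numpy as np

x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov_xy)                       # 1.75, as computed above
print(np.cov(x, y, ddof=1)[0, 1])   # same value from the sample covariance matrix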
31. Problem with Covariance:
• The value obtained by covariance is dependent on the size of the
data’s standard deviations: if large, the value will be greater than if
small… even if the relationship between x and y is exactly the same in
the large versus small standard deviation datasets.
32. Example of how covariance value
relies on variance
               High variance data                     Low variance data
Subject   x     y     x error * y error       x     y     x error * y error
1         101   100   2500                    54    53    9
2         81    80    900                     53    52    4
3         61    60    100                     52    51    1
4         51    50    0                       51    50    0
5         41    40    100                     50    49    1
6         21    20    900                     49    48    4
7         1     0     2500                    48    47    9
Mean      51    50                            51    50
Sum of x error * y error: 7000                Sum of x error * y error: 28
Covariance: 1166.67                           Covariance: 4.67
33. Solution: Pearson’s r
• On its own, the covariance value is hard to interpret
• Solution: standardise this measure
• Pearson’s R: standardises the covariance value.
• Divides the covariance by the multiplied standard
deviations of X and Y:
rxy = cov(x,y) / (sx sy)
34. Pearson’s R continued
cov(x,y) = Σ(xi - x̄)(yi - ȳ) / (n - 1)

rxy = Σ(xi - x̄)(yi - ȳ) / ((n - 1) sx sy)

rxy = Σ Zxi Zyi / (n - 1)
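• A short sketch of the standardisation (numpy and the reuse of the toy data are mine): dividing by the two standard deviations makes r insensitive to the scale of x and y, which fixes the problem noted on slide 31.

import numpy as np

x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
print(r)                                  # Pearson's r for the toy data
print(np.corrcoef(x, y)[0, 1])            # same value from numpy
print(np.corrcoef(10 * x + 5, y)[0, 1])   # unchanged when x is rescaled, unlike the covariance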
35. Limitations of r
• When r = 1 or r = -1:
• We can predict y from x with certainty
• all data points are on a straight line: y = ax + b
• r is actually r̂:
• r = true r of the whole population
• r̂ = estimate of r based on the data
• r is very sensitive to extreme values:
[Scatter plot: a single extreme point can substantially change r̂]
37. Regression
• Correlation tells you if there is an association between x and y but it
doesn’t describe the relationship or allow you to predict one variable
from the other.
• To do this we need REGRESSION!
38. Best-fit Line
• Aim of linear regression is to fit a straight line, ŷ = ax + b, to data that gives the best prediction of y for any value of x
• This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals
[Figure: scatter plot with fitted line ŷ = ax + b, where a = slope and b = intercept; ŷ = predicted value, yi = true value, ε = residual error]
39. Least Squares Regression
• To find the best line we must minimise the sum of
the squares of the residuals (the vertical distances
from the data points to our line)
Residual (ε) = y - ŷ
Sum of squares of residuals = Σ(y – ŷ)²
Model line: ŷ = ax + b    (a = slope, b = intercept)
• We must find the values of a and b that minimise Σ(y – ŷ)²
40. Finding b
• First we find the value of b that gives the min sum of
squares
[Figure: candidate lines with different intercepts b and their residuals ε]
• Trying different values of b is equivalent to shifting the line up and down the scatter plot
41. Finding a
• Now we find the value of a that gives the min sum of
squares
[Figure: candidate lines with different slopes a and the same intercept b]
• Trying out different values of a is equivalent to changing the slope of the line, while b stays constant
42. Minimising sums of squares
• Need to minimise Σ(y–ŷ)2
• ŷ = ax + b
• so need to minimise:
Σ(y - ax - b)2
• If we plot the sums of squares for
all different values of a and b we
get a parabola, because it is a
squared term
• So the min sum of squares is at
the bottom of the curve, where
the gradient is zero.
[Figure: sum of squares (S) plotted against values of a and b; the minimum of S is where the gradient = 0]
43. The maths bit
• The min sum of squares is at the bottom of the curve where the
gradient = 0
• So we can find a and b that give min sum of squares by taking partial
derivatives of Σ(y - ax - b)2 with respect to a and b separately
• Then we solve these for 0 to give us the values of a and b that give the
min sum of squares
44. The solution
• Doing this gives the following equations for a and b:
a = r × (sy / sx)
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
• From this you can see that:
• A low correlation coefficient gives a flatter slope (small value of a)
• Large spread of y, i.e. high standard deviation, results in a steeper slope (high value of a)
• Large spread of x, i.e. high standard deviation, results in a flatter slope (small value of a)
45. The solution cont.
• Our model equation is ŷ = ax + b
• This line must pass through the mean point, so: ȳ = a x̄ + b, which gives b = ȳ – a x̄
• We can put our equation for a into this, giving:
b = ȳ – r × (sy / sx) × x̄
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
• The smaller the correlation, the closer the intercept is to the mean of y
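• A small sketch of these two formulas on the toy data from slide 30 (numpy is my choice; np.polyfit is only used as a cross-check):

import numpy as np

x = np.array([0, 2, 3, 4, 6], dtype=float)
y = np.array([3, 2, 4, 0, 6], dtype=float)

r = np.corrcoef(x, y)[0, 1]
a = r * y.std(ddof=1) / x.std(ddof=1)   # slope: a = r * sy / sx
b = y.mean() - a * x.mean()             # intercept: b = y-bar - a * x-bar
print(a, b)
print(np.polyfit(x, y, 1))              # same slope and intercept from a direct least-squares fit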
46. Back to the model
• If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat straight line crossing the y-axis at ȳ
• But this isn't very useful.
• We can calculate the regression line for any data, but the important question is how well does this line fit the data, or how good is it at predicting y from x
ŷ = ax + b = r × (sy / sx) × x + ȳ – r × (sy / sx) × x̄
Rearranges to:  ŷ = r × (sy / sx) × (x – x̄) + ȳ
47. How good is our model?
• Total variance of y:
sy² = Σ(y – ȳ)² / (n - 1) = SSy / dfy
• Variance of predicted y values (ŷ):
sŷ² = Σ(ŷ – ȳ)² / (n - 1) = SSpred / dfŷ
This is the variance explained by our regression model
• Error variance:
serror² = Σ(y – ŷ)² / (n - 2) = SSer / dfer
This is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model
48. How good is our model cont.
• Total variance = predicted variance + error variance:
sy² = sŷ² + ser²
• Conveniently, via some complicated rearranging:
sŷ² = r² sy²
r² = sŷ² / sy²
• So r² is the proportion of the variance in y that is explained by our regression model
49. How good is our model cont.
• Insert r² sy² into sy² = sŷ² + ser² and rearrange to get:
ser² = sy² – r² sy² = sy² (1 – r²)
• From this we can see that the greater the correlation, the smaller the error variance, so the better our prediction
50. Is the model significant?
• i.e. do we get a significantly better prediction of y
from our regression equation than by just predicting
the mean?
• F-statistic:
F(dfŷ, dfer) = sŷ² / ser² = ... (complicated rearranging) ... = r² (n - 2) / (1 – r²)
• And it follows that:
t(n-2) = r √(n - 2) / √(1 – r²)    (because F = t²)
• So all we need to know are r and n
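• So the test needs only r and n, as in this sketch (scipy is used for the p-value; the numbers are purely illustrative):

import numpy as np
from scipy import stats

r, n = 0.35, 5                                # illustrative values only
F = r**2 * (n - 2) / (1 - r**2)               # F(dfy-hat, dfer) = F(1, n-2)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)    # t(n-2); note F = t**2
print(F, t)
print(2 * stats.t.sf(abs(t), df=n - 2))       # two-sided p-value for the slope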
52. General Linear Model
• Linear regression is actually a form of the General Linear Model
where the parameters are a, the slope of the line, and b, the intercept.
y = ax + b + ε
• A General Linear Model is just any model that describes the data in
terms of a straight line
53. Multiple regression
• Multiple regression is used to determine the effect of a number
of independent variables, x1, x2, x3 etc, on a single dependent
variable, y
• The different x variables are combined in a linear way and each
has its own regression coefficient:
y = a1x1 + a2x2 + … + anxn + b + ε
• The a parameters reflect the independent contribution of each
independent variable, x, to the value of the dependent variable,
y.
• i.e. the amount of variance in y that is accounted for by each x
variable after all the other x variables have been accounted for
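• A minimal multiple-regression sketch (the synthetic data and the use of scikit-learn are my choices, just to illustrate the idea):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 * x1 - 1.0 * x2 + 0.5 + rng.normal(scale=0.1, size=50)   # y = a1*x1 + a2*x2 + b + error

X = np.column_stack([x1, x2])
model = LinearRegression().fit(X, y)
print(model.coef_)        # estimates of a1 and a2 (close to 2.0 and -1.0)
print(model.intercept_)   # estimate of b (close to 0.5)
print(model.score(X, y))  # R^2: proportion of variance in y explained by the model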
55. Dimensionality Reduction
and Feature Construction
• Principal components analysis (PCA)
• Reading: L. I. Smith, A tutorial on principal components analysis (on
class website)
• PCA used to reduce dimensions of data without much loss of
information.
• Used in machine learning and in signal processing and image
compression (among other things).
56. PCA is "an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (first principal component), the second greatest variance lies on the second coordinate (second principal component), and so on."
57. Background for PCA
• Suppose attributes are A1 and A2, and we have n training examples. x's denote values of A1 and y's denote values of A2 over the training examples.
• Variance of an attribute:
var(A1) = Σ(xi - x̄)² / (n - 1)
58. • Covariance of two attributes:
cov(A1, A2) = Σ(xi - x̄)(yi - ȳ) / (n - 1)
• If covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. Zero: no linear relationship between the two dimensions.
59. • Covariance matrix
• Suppose we have n attributes, A1, ..., An.
• Covariance matrix:
C is the n × n matrix with entries ci,j = cov(Ai, Aj)
61. • Eigenvectors:
• Let M be an n × n matrix.
• v is an eigenvector of M if M × v = λv
• λ is called the eigenvalue associated with v
• For any eigenvector v of M and scalar a, M × (av) = λ(av)
• Thus you can always choose eigenvectors of length 1: v1² + ... + vn² = 1
• If M has any eigenvectors, it has n of them, and they are orthogonal to one another.
• Thus eigenvectors can be used as a new basis for an n-dimensional vector space.
62. PCA
1. Given original data set S = {x1, ..., xk},
produce new set by subtracting the mean
of attribute Ai from each xi.
Mean of original data: 1.81, 1.91  →  Mean of adjusted data: 0, 0
64. 2. Calculate the covariance matrix:
3. Calculate the (unit) eigenvectors and
eigenvalues of the covariance matrix:
[2 × 2 covariance matrix with rows and columns corresponding to x and y]
66. 4. Order eigenvectors by eigenvalue, highest to
lowest.
In general, you get n components. To reduce
dimensionality to p, ignore n-p components at
the bottom of the list.
Eigenvalues: 0.0490833989 and 1.28402771
Corresponding unit eigenvectors: (-0.735178956, 0.677873399) and (-0.677873399, -0.735178956)
68. 5. Derive the new data set.
TransformedData = RowFeatureVector × RowDataAdjust
This gives original data in terms of chosen
components (eigenvectors)—that is, along these
axes.
RowFeatureVector1 = ( -0.677873399  -0.735178956
                      -0.735178956   0.677873399 )

RowFeatureVector2 = ( -0.677873399  -0.735178956 )

RowDataAdjust = ( 0.69  -1.31  0.39  0.09  1.29  0.49  0.19  -0.81  -0.31  -0.71
                  0.49  -1.21  0.99  0.29  1.09  0.79  -0.31  -0.81  -0.31  -1.01 )
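• The whole pipeline in a few lines of numpy (a sketch; the data are reconstructed from the means and mean-adjusted values shown above, i.e. the example from the Smith tutorial, and eigenvector signs may be flipped relative to the slides):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

X_adj = X - X.mean(axis=0)              # Step 1: subtract the mean of each attribute
C = np.cov(X_adj, rowvar=False)         # Step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # Step 3: eigenvalues and unit eigenvectors (as columns)

order = np.argsort(eigvals)[::-1]       # Step 4: order by eigenvalue, highest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)                          # approximately 1.28402771 and 0.0490833989

p = 1                                   # Step 5: keep only the top p components
transformed = X_adj @ eigvecs[:, :p]    # the data expressed along the chosen axes
print(transformed.ravel())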
75. Clustering Strategies
• K-means
– Iteratively re-assign points to the nearest cluster center
• Agglomerative clustering
– Start with each point as its own cluster and iteratively merge the closest clusters
• Density-based clustering
– DBSCAN
As we go down this chart, the clustering strategies have more tendency to transitively group points even if they are not nearby in feature space
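• A sketch of the three strategies with scikit-learn (the toy blob data and the parameter values are mine, for illustration only):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # label -1 marks noise points

print(np.bincount(kmeans_labels))
print(np.bincount(agglo_labels))
print(np.unique(dbscan_labels))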
83. Classification
• Apply a prediction function to a feature representation
of the image to get the desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
Slide credit: L. Lazebnik
84. The machine learning framework
y = f(x)
• Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set
• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
[Diagram: image feature x → prediction function f → output y]
Slide credit: L. Lazebnik
87. Classifiers: Nearest neighbor
f(x) = label of the training example nearest to x
• All we need is a distance function for our inputs
• No training required!
[Figure: a test example in feature space along with training examples from class 1 and class 2]
Slide credit: L. Lazebnik
88. Classifiers: Linear
• Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
Slide credit: L. Lazebnik
89. Many classifiers to choose from
• SVM
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized Forests
• Boosted Decision Trees
• K-nearest neighbor
• RBMs
• Etc.
Which is the best one?
Slide credit: D. Hoiem
90. Generalization
• How well does a learned model generalize from
the data it was trained on to a new test set?
Training set (labels known)    Test set (labels unknown)
Slide credit: L. Lazebnik
91. Generalization
• Components of generalization error
• Bias: how much the average model over all training sets differs from the true model
• Error due to inaccurate assumptions/simplifications made by the model
• Variance: how much models estimated from different training sets differ from each other
• Underfitting: model is too "simple" to represent all the relevant class characteristics
• High bias and low variance
• High training error and high test error
• Overfitting: model is too "complex" and fits irrelevant characteristics (noise) in the data
• Low bias and high variance
• Low training error and high test error
Slide credit: L. Lazebnik
92. Bias-Variance Trade-off
• Models with too few parameters are inaccurate because of a large bias (not enough flexibility).
• Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).
Slide credit: D. Hoiem
93. Bias-Variance Trade-off
E(MSE) = noise² + bias² + variance
(noise² = unavoidable error; bias² = error due to incorrect assumptions; variance = error due to variance of training samples)
See the following for explanations of bias-variance (also Bishop's "Neural Networks" book):
• http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf
Slide credit: D. Hoiem
95. Bias-variance tradeoff
[Plot: test error vs model complexity for many vs few training examples; low complexity = high bias/low variance, high complexity = low bias/high variance]
Slide credit: D. Hoiem
96. Effect of Training Size
[Plot: training and testing error vs number of training examples for a fixed prediction model; the gap between the curves is the generalization error]
Slide credit: D. Hoiem
97. How to reduce variance?
• Choose a simpler classifier
• Regularize the parameters
• Get more training data
Slide credit: D. Hoiem
99. Very brief tour of some classifiers
• K-nearest neighbor
• SVM
• Boosted Decision Trees
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized Forests
• RBMs
• Etc.
100. Generative vs. Discriminative Classifiers
Generative Models
• Represent both the data and the labels
• Often, makes use of conditional independence and priors
• Examples
• Naïve Bayes classifier
• Bayesian network
• Models of data may apply to future prediction problems
Discriminative Models
• Learn to directly predict the labels from the data
• Often, assume a simple boundary (e.g., linear)
• Examples
– Logistic regression
– SVM
– Boosted decision trees
• Often easier to predict a label from the data than to model the data
Slide credit: D. Hoiem
101. Classification
• Assign input vector to one of two or more classes
• Any decision rule divides input space into decision regions separated by decision boundaries
Slide credit: L. Lazebnik
102. Nearest Neighbor Classifier
• Assign label of nearest training data point to
each test data point
Voronoi partitioning of feature space
for two-category 2D and 3D data
from Duda et al.
Source: D. Lowe
110. Classifiers: Logistic Regression
[Figure: two classes (male, female) plotted by height and pitch of voice (features x1, x2), separated by a linear boundary]
log [ P(x1, x2 | y = 1) / P(x1, x2 | y = -1) ] = wT x
P(y = 1 | x1, x2) = 1 / (1 + exp(-wT x))
• Maximize likelihood of label given data, assuming a log-linear model
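• A sketch with scikit-learn (the two-feature synthetic data stands in for the pitch/height example; all numbers are made up for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
class0 = rng.normal(loc=[120.0, 178.0], scale=[15.0, 7.0], size=(100, 2))   # one class in (x1, x2)
class1 = rng.normal(loc=[210.0, 165.0], scale=[20.0, 7.0], size=(100, 2))   # the other class
X = np.vstack([class0, class1])
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.coef_, clf.intercept_)             # the learned w and b of the log-linear model
print(clf.predict_proba([[150.0, 172.0]]))   # [P(y=0|x), P(y=1|x)], i.e. the sigmoid of w.x + b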
111. Classifiers: Linear SVM
• Find a linear function to separate the classes: f(x) = sgn(w · x + b)
[Scatter plot: two classes (x's and o's) in (x1, x2) feature space]
112. Classifiers: Linear SVM
• Find a linear function to separate the classes: f(x) = sgn(w · x + b)
[Scatter plot: two classes (x's and o's) in (x1, x2) feature space]
113. Classifiers: Linear SVM
• Find a linear function to separate the classes: f(x) = sgn(w · x + b)
[Scatter plot: two classes (x's and o's) in (x1, x2) feature space]
115. Nonlinear SVMs
• Datasets that are linearly separable work out great:
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space:
[Figure: 1D data along x that is not linearly separable becomes separable after mapping x → (x, x²)]
Slide credit: Andrew Moore
116. Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
Slide credit: Andrew Moore
117. Nonlinear SVMs
• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
K(xi, xj) = φ(xi) · φ(xj)
• This gives a nonlinear decision boundary in the original feature space:
Σi αi yi φ(xi) · φ(x) + b = Σi αi yi K(xi, x) + b
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
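• A sketch of why the kernel matters, using scikit-learn (the concentric-circles dataset is my choice of a "too hard" example, not part of the slides):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)   # kernel trick: no explicit phi(x) computed

print(linear_svm.score(X, y))   # poor: one circle inside another cannot be split by a line
print(rbf_svm.score(X, y))      # near 1.0: nonlinear boundary in the original feature space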
118. Kernels for bags of features
• Histogram intersection kernel:
I(h1, h2) = Σi=1..N min(h1(i), h2(i))
• Generalized Gaussian kernel:
K(h1, h2) = exp( -(1/A) D(h1, h2)² )
• D can be (inverse) L1 distance, Euclidean distance, χ2 distance, etc.
J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, IJCV 2007
119. What about multi-class SVMs?
• Unfortunately, there is no "definitive" multi-class SVM formulation
• In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs
• One vs. others
• Training: learn an SVM for each class vs. the others
• Testing: apply each SVM to the test example and assign to it the class of the SVM that returns the highest decision value
• One vs. one
• Training: learn an SVM for each pair of classes
• Testing: each learned SVM "votes" for a class to assign to the test example
Slide credit: L. Lazebnik
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
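• A sketch of how these imports are typically used (the iris data is just a convenient 3-class example, not part of the slides): SVC combines two-class SVMs one vs. one, while LinearSVC trains one vs. the rest.

from sklearn.datasets import load_iris
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)            # 3 classes

ovo = SVC(kernel='linear').fit(X, y)         # one-vs-one under the hood
ovr = LinearSVC(max_iter=10000).fit(X, y)    # one-vs-rest ("one vs. others")

print(ovo.score(X, y), ovr.score(X, y))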
121. Summary: Classifiers
• Nearest-neighbor and k-nearest-neighbor classifiers
• L1 distance, χ2 distance, quadratic distance
• Support vector machines
• Linear classifiers
• Margin maximization
• The kernel trick
• Kernel functions: generalized Gaussian, RBF
• Multi-class
• Of course, there are many other classifiers out there
• Neural networks, boosting, decision trees, …
Slide credit: L. Lazebnik
137. Perceptron Learning Theorem
• Recap: A perceptron (threshold unit) can learn anything
that it can represent (i.e. anything separable with a
hyperplane)
138. The Exclusive OR problem
A Perceptron cannot represent Exclusive OR
since it is not linearly separable.
140. Minsky & Papert (1969) offered solution to XOR problem by
combining perceptron unit responses using a second layer of
Units. Piecewise linear classification using an MLP with
threshold (perceptron) units
1
2
+1
+1
3
140
141. What does each of the layers do?
1st layer draws
linear boundaries
2nd layer combines
the boundaries
3rd layer can generate
arbitrarily complex
boundaries
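• A sketch of this idea in code (scikit-learn's MLPClassifier and the chosen layer size are mine; the point is only that a hidden layer lets the network combine linear boundaries to solve XOR):

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR labels: not linearly separable

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    solver='lbfgs', max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))                           # should recover [0 1 1 0] (may need another seed)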