Machine Learning with Scikit-Learn
강박사
Anaconda
• leading open data science platform powered by Python
• https://www.continuum.io/downloads
Jupyter notebook
Normal Distribution
Random Variables and Probability
Distributions
• Random Variables - Random responses
corresponding to subjects randomly selected from
a population.
• Probability Distributions - A listing of the
possible outcomes and their probabilities (discrete
r.v.s) or their densities (continuous r.v.s)
• Normal Distribution - Bell-shaped continuous
distribution widely used in statistical inference
• Sampling Distributions - Distributions
corresponding to sample statistics (such as mean
and proportion) computed from random samples
Normal Distribution
• Bell-shaped, symmetric family of distributions
• Classified by 2 parameters: Mean (µ) and standard
deviation (σ). These represent location and spread
• Random variables that are approximately normal have
the following properties wrt individual measurements:
• Approximately half (50%) fall above (and below) mean
• Approximately 68% fall within 1 standard deviation of mean
• Approximately 95% fall within 2 standard deviations of mean
• Virtually all fall within 3 standard deviations of mean
• Notation when Y is normally distributed with mean µ
and standard deviation σ :
Y ~ N(µ, σ)
Normal Distribution
P(Y ≥ µ) = 0.50,  P(µ − σ ≤ Y ≤ µ + σ) ≈ 0.68,  P(µ − 2σ ≤ Y ≤ µ + 2σ) ≈ 0.95
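As a quick numerical check of these rules, a minimal sketch using scipy.stats (assuming SciPy is available, e.g. via the Anaconda distribution mentioned above):

```python
from scipy.stats import norm

# Standard normal: mean 0, standard deviation 1
print(norm.cdf(0))                     # P(Z <= 0)  = 0.5
print(norm.cdf(1) - norm.cdf(-1))      # P(-1 <= Z <= 1) ~ 0.68
print(norm.cdf(2) - norm.cdf(-2))      # P(-2 <= Z <= 2) ~ 0.95
print(norm.cdf(3) - norm.cdf(-3))      # P(-3 <= Z <= 3) ~ 0.997 ("virtually all")
```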
Example - Heights of U.S. Adults
• Female and Male adult heights are well approximated by normal
distributions: YF~N(63.7,2.5) YM~N(69.1,2.6)
[Histograms of the weighted height data. Males (INCHESM, cases weighted by PCTM): Mean = 69.1, Std. Dev = 2.61, N = 99.23. Females (INCHESF, cases weighted by PCTF): Mean = 63.7, Std. Dev = 2.48, N = 99.68.]
Source: Statistical Abstract of the U.S. (1992)
Standard Normal (Z) Distribution
• Problem: Unlimited number of possible
normal distributions (−∞ < µ < ∞, σ > 0)
• Solution: Standardize the random variable to
have mean 0 and standard deviation 1
Y ~ N(µ, σ)  ⇒  Z = (Y − µ) / σ ~ N(0, 1)
• Probabilities of certain ranges of values and specific
percentiles of interest can be obtained through the standard
normal (Z) distribution
Standard Normal (Z) Distribution
• Standard Normal Distribution Characteristics:
• P(Z ≥ 0) = P(Y ≥ µ) = 0.5000
• P(−1 ≤ Z ≤ 1) = P(µ − σ ≤ Y ≤ µ + σ) = 0.6826
• P(−2 ≤ Z ≤ 2) = P(µ − 2σ ≤ Y ≤ µ + 2σ) = 0.9544
• P(Z ≥ za) = P(Z ≤ −za) = a (using Z-table)
a 0.500 0.100 0.050 0.025 0.010 0.005
za 0.000 1.282 1.645 1.960 2.326 2.576
Finding Probabilities of Specific Ranges
• Step 1 - Identify the normal distribution of interest
(e.g. its mean (µ) and standard deviation (σ) )
• Step 2 - Identify the range of values that you wish to
determine the probability of observing (YL , YU), where
often the upper or lower bounds are ∞ or −∞
• Step 3 - Transform YL and YU into Z-values:
ZL = (YL − µ) / σ          ZU = (YU − µ) / σ
• Step 4 - Obtain P(ZL ≤ Z ≤ ZU) from Z-table
Example - Adult Female Heights
• What is the probability a randomly selected female
is 5’10” or taller (70 inches)?
• Step 1 - Y ~ N(63.7 , 2.5)
• Step 2 - YL = 70.0 YU = ∞
• Step 3 - ZL = (70.0 − 63.7) / 2.5 = 2.52,  ZU = ∞
• Step 4 - P(Y ≥ 70) = P(Z ≥ 2.52) = .0059 (≈ 1/170)
z .00 .01 .02 .03
2.4 .0082 .0080 .0078 .0075
2.5 .0062 .0060 .0059 .0057
2.6 .0047 .0045 .0044 .0043
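The same four steps can be reproduced numerically; a minimal sketch with scipy.stats (the table lookup above rounds z to 2.52):

```python
from scipy.stats import norm

mu, sigma = 63.7, 2.5
z = (70.0 - mu) / sigma                    # Step 3: z = 2.52
print(z, norm.sf(z))                       # Step 4: upper-tail P(Z >= 2.52) ~ 0.0059

# Equivalently, work directly on the Y scale:
print(norm.sf(70.0, loc=mu, scale=sigma))  # P(Y >= 70)
```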
Finding Percentiles of a Distribution
• Step 1 - Identify the normal distribution of
interest (e.g. its mean (µ) and standard deviation
(σ) )
• Step 2 - Determine the percentile of interest
100p% (e.g. the 90th percentile is the cut-off where
only 90% of scores are below and 10% are above)
• Step 3 - Turn the percentile of interest into a
tail probability a and corresponding z-value
(zp):
• If 100p ≥ 50 then a = 1-p and zp = za
• If 100p < 50 then a = p and zp = -za
• Step 4 - Transform zp back to original units:
Yp = µ + zp σ
Example - Adult Male Heights
• Above what height do the tallest 5% of males lie?
• Step 1 - Y ~ N(69.1 , 2.6)
• Step 2 - Want to determine 95th percentile (p = .95)
• Step 3 - Since 100p > 50, a = 1-p = 0.05
zp = za = z.05 = 1.645
• Step 4 - Y.95 = 69.1 + (1.645)(2.6) = 73.4
z .03 .04 .05 .06
1.5 .0630 .0618 .0606 .0594
1.6 .0516 .0505 .0495 .0485
1.7 .0418 .0409 .0401 .0392
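A sketch of the same percentile calculation with scipy.stats, using norm.ppf (the inverse CDF) instead of the Z-table:

```python
from scipy.stats import norm

mu, sigma = 69.1, 2.6
z_05 = norm.ppf(0.95)                       # z.05 ~ 1.645
print(mu + z_05 * sigma)                    # Y.95 ~ 73.4 inches

# Or in one call on the Y scale:
print(norm.ppf(0.95, loc=mu, scale=sigma))
```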
Practice
Statistical Models
• When making statistical inference it is useful
to write random variables in terms of model
parameters and random errors
Y = µ + (Y − µ) = µ + ε, where ε = Y − µ
• Here µ is a fixed constant and ε is a random variable
• In practice µ will be unknown, and we will use sample data to estimate or
make statements regarding its value
Sampling Distributions and the
Central Limit Theorem
• Sample statistics based on random samples are also
random variables and have sampling distributions
that are probability distributions for the statistic
(outcomes that would vary across samples)
• When samples are large and measurements
independent then many estimators have normal
sampling distributions (CLT):
• Sample Mean: Ȳ ~ N(µ, σ/√n)
• Sample Proportion: π̂ ~ N(π, √(π(1 − π)/n))
Example - Adult Female Heights
• Random samples of n = 100 females to be
selected
• For each sample, the sample mean is computed
• Sampling distribution:
Ȳ ~ N(63.5, 2.5/√100) ≡ N(63.5, 0.25)
• Note that approximately 95% of all possible random samples of 100
females will have sample means between 63.0 and 64.0 inches
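A small simulation illustrates this sampling distribution; the sketch below assumes the population of female heights is N(63.5, 2.5) as used on this slide and draws many random samples of n = 100:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 63.5, 2.5, 100

# 10,000 random samples of size 100; record each sample mean
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())   # ~63.5
print(sample_means.std())    # ~0.25 = sigma / sqrt(n)
print(((sample_means > 63.0) & (sample_means < 64.0)).mean())   # ~0.95
```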
Correlation and Regression
Topics Covered:
• Is there a relationship between x and y?
• What is the strength of this relationship?
• Pearson’s r
• Can we describe this relationship and use this to predict y from x?
• Regression
• Is the relationship we have described statistically significant?
• t test
• Relevance to SPM
• GLM
The relationship between x and y
• Correlation: is there a relationship between 2 variables?
• Regression: how well does a certain independent variable predict the dependent
variable?
• CORRELATION ≠ CAUSATION
• In order to infer causality: manipulate independent variable and observe effect
on dependent variable
Scattergrams
[Three scatterplots of Y against X illustrating positive correlation, negative correlation, and no correlation.]
Variance vs Covariance
• Do two variables change together?
Covariance: cov(x, y) = Σi (xi − x̄)(yi − ȳ) / (n − 1)
• Gives information on the degree to which two variables vary together.
• Note how similar the covariance is to variance: the equation simply multiplies x's error scores by y's error scores, as opposed to squaring x's error scores.
Variance: sx² = Σi (xi − x̄)² / (n − 1)
• Gives information on variability of a single variable.
Covariance
• When X and Y increase together: cov(x, y) = pos.
• When one increases as the other decreases: cov(x, y) = neg.
• When no constant relationship: cov(x, y) = 0
cov(x, y) = Σi (xi − x̄)(yi − ȳ) / (n − 1)
Example Covariance
x    y    xi − x̄    yi − ȳ    (xi − x̄)(yi − ȳ)
0    3    -3    0    0
2    2    -1    -1    1
3    4    0    1    0
4    0    1    -3    -3
6    6    3    3    9
x̄ = 3    ȳ = 3    Σ = 7
cov(x, y) = Σi (xi − x̄)(yi − ȳ) / (n − 1) = 7 / 4 = 1.75
What does this number tell us?
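A minimal check of this example with NumPy (np.cov uses the same n − 1 denominator):

```python
import numpy as np

x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

# Direct computation of the formula above
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
print(cov_xy)                 # 1.75

# np.cov returns the full 2x2 covariance matrix; entry [0, 1] is cov(x, y)
print(np.cov(x, y)[0, 1])     # 1.75
```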
Problem with Covariance:
• The value obtained by covariance is dependent on the size of the
data’s standard deviations: if large, the value will be greater than if
small… even if the relationship between x and y is exactly the same in
the large versus small standard deviation datasets.
Example of how covariance value
relies on variance
High variance data Low variance data
Subject    x    y    x error * y error    x    y    x error * y error
1 101 100 2500 54 53 9
2 81 80 900 53 52 4
3 61 60 100 52 51 1
4 51 50 0 51 50 0
5 41 40 100 50 49 1
6 21 20 900 49 48 4
7 1 0 2500 48 47 9
Mean 51 50 51 50
Sum of x error * y error : 7000 Sum of x error * y error : 28
Covariance: 1166.67 Covariance: 4.67
Solution: Pearson’s r
• The covariance value by itself does not really tell us much
• Solution: standardise this measure
• Pearson's r standardises the covariance value.
• Divides the covariance by the product of the standard deviations of X and Y:
rxy = cov(x, y) / (sx sy)
Pearson’s R continued
cov(x, y) = Σi (xi − x̄)(yi − ȳ) / (n − 1)
rxy = Σi (xi − x̄)(yi − ȳ) / ((n − 1) sx sy)
rxy = Σi Zxi Zyi / (n − 1)
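A short sketch computing Pearson's r for the earlier example data, both from the formula and with library calls (np.corrcoef from NumPy, pearsonr from SciPy):

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # divide by the product of the SDs
print(r)                                        # 0.35

print(np.corrcoef(x, y)[0, 1])                  # same value
print(pearsonr(x, y))                           # r together with a p-value
```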
Limitations of r
• When r = 1 or r = -1:
• We can predict y from x with certainty
• all data points are on a straight line: y = ax + b
• The r we compute is actually r̂:
• r = true r of the whole population
• r̂ = estimate of r based on the data
• r is very sensitive to extreme values:
[Scatterplot illustrating how a single extreme value can substantially change r̂.]
Regression
• Correlation tells you if there is an association between x and y but it
doesn’t describe the relationship or allow you to predict one variable
from the other.
• To do this we need REGRESSION!
Best-fit Line
• Aim of linear regression is to fit a straight line, ŷ = ax + b,
to data that gives best prediction of y for any value of x
• This will be the line that
minimises distance between
data and fitted line, i.e.
the residuals
[Diagram: the fitted line ŷ = ax + b, with slope a and intercept b; ŷ is the predicted value, yi is the true value, and ε = residual error between them.]
Least Squares Regression
• To find the best line we must minimise the sum of
the squares of the residuals (the vertical distances
from the data points to our line)
Residual (ε) = y − ŷ
Sum of squares of residuals = Σ(y − ŷ)²
Model line: ŷ = ax + b
We must find values of a and b that minimise Σ(y − ŷ)²
a = slope, b = intercept
Finding b
• First we find the value of b that gives the min sum of
squares
[Plots showing the candidate line shifted to different intercepts b, with the residuals ε marked.]
Trying different values of b is equivalent to shifting the line up and down the scatter plot
Finding a
• Now we find the value of a that gives the min sum of
squares
[Plots showing candidate lines with different slopes a and the same intercept b.]
Trying out different values of a is equivalent to changing the slope of the line, while b stays constant
Minimising sums of squares
• Need to minimise Σ(y − ŷ)²
• ŷ = ax + b
• so need to minimise:
Σ(y − ax − b)²
• If we plot the sums of squares for
all different values of a and b we
get a parabola, because it is a
squared term
• So the min sum of squares is at
the bottom of the curve, where
the gradient is zero.
[Plot: the sum of squares S against the values of a and b is a parabola; the minimum of S is where the gradient = 0.]
The maths bit
• The min sum of squares is at the bottom of the curve where the
gradient = 0
• So we can find a and b that give min sum of squares by taking partial
derivatives of Σ(y - ax - b)2 with respect to a and b separately
• Then we solve these for 0 to give us the values of a and b that give the
min sum of squares
The solution
• Doing this gives the following equations for a and b:
a = r (sy / sx)
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
From this you can see that:
• A low correlation coefficient gives a flatter slope (small value of a)
• Large spread of y, i.e. high standard deviation, results in a steeper slope (high value of a)
• Large spread of x, i.e. high standard deviation, results in a flatter slope (small value of a)
The solution cont.
• Our model equation is ŷ = ax + b
• This line must pass through the mean so:
ȳ = a x̄ + b, so b = ȳ − a x̄
We can put our equation for a into this, giving:
b = ȳ − a x̄ = ȳ − r (sy / sx) x̄
r = correlation coefficient of x and y
sy = standard deviation of y
sx = standard deviation of x
The smaller the correlation, the closer the intercept is to the mean of y
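A sketch verifying these formulas on the small example data, against NumPy's own least-squares fit:

```python
import numpy as np

x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

r = np.corrcoef(x, y)[0, 1]
a = r * y.std(ddof=1) / x.std(ddof=1)   # slope: a = r * sy / sx
b = y.mean() - a * x.mean()             # intercept: the line passes through the means
print(a, b)                             # 0.35, 1.95

# np.polyfit minimises the same sum of squared residuals directly
print(np.polyfit(x, y, deg=1))          # [slope, intercept] -- should match
```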
Back to the model
• If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat straight line at height ȳ
• But this isn't very useful.
• We can calculate the regression line for any data, but the important question is how well does this line fit the data, or how good is it at predicting y from x
ŷ = ax + b = r (sy / sx) x + ȳ − r (sy / sx) x̄
Rearranges to: ŷ = r (sy / sx)(x − x̄) + ȳ
How good is our model?
• Total variance of y:
sy² = Σ(y − ȳ)² / (n − 1) = SSy / dfy
• Variance of predicted y values (ŷ):
sŷ² = Σ(ŷ − ȳ)² / (n − 1) = SSpred / dfŷ
This is the variance explained by our regression model
• Error variance:
serror² = Σ(y − ŷ)² / (n − 2) = SSer / dfer
This is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model
• Total variance = predicted variance + error variance
sy² = sŷ² + ser²
• Conveniently, via some complicated rearranging
sŷ² = r² sy²
r² = sŷ² / sy²
• so r² is the proportion of the variance in y that is explained by our regression model
How good is our model cont.
• Insert sŷ² = r² sy² into sy² = sŷ² + ser² and rearrange to get:
ser² = sy² − r² sy² = sy² (1 − r²)
• From this we can see that the greater the correlation
the smaller the error variance, so the better our
prediction
Is the model significant?
• i.e. do we get a significantly better prediction of y
from our regression equation than by just predicting
the mean?
• F-statistic (via complicated rearranging):
F(dfŷ, dfer) = sŷ² / ser² = ...... = r² (n − 2) / (1 − r²)
And it follows that:
t(n − 2) = r √(n − 2) / √(1 − r²)   (because F = t²)
So all we need to know are r and n
General Linear Model
• Linear regression is actually a form of the General Linear Model
where the parameters are a, the slope of the line, and b, the intercept.
y = ax + b +ε
• A General Linear Model is just any model that describes the data in
terms of a straight line
Multiple regression
• Multiple regression is used to determine the effect of a number
of independent variables, x1, x2, x3 etc, on a single dependent
variable, y
• The different x variables are combined in a linear way and each
has its own regression coefficient:
y = a1x1+ a2x2 +…..+ anxn + b + ε
• The a parameters reflect the independent contribution of each
independent variable, x, to the value of the dependent variable,
y.
• i.e. the amount of variance in y that is accounted for by each x
variable after all the other x variables have been accounted for
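A minimal sketch of such a fit with scikit-learn; the data here is synthetic, made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))     # three independent variables x1, x2, x3
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 3.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_)        # a1, a2, a3 -- close to [2.0, -1.0, 0.5]
print(model.intercept_)   # b -- close to 3.0
print(model.score(X, y))  # R^2 on the training data
```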
Principal Components
Analysis
Dimensionality Reduction
and Feature Construction
• Principal components analysis (PCA)
• Reading: L. I. Smith, A tutorial on principal components analysis (on
class website)
• PCA used to reduce dimensions of data without much loss of
information.
• Used in machine learning and in signal processing and image
compression (among other things).
PCA is “an orthogonal linear transformation that transforms the
data to a new coordinate system such that the greatest
variance by any projection of the data comes to lie on the
first coordinate (first principal component), the second
greatest variance lies on the second coordinate (second
principal component), and so on.”
• Suppose attributes are A1 and A2, and we
have n training examples. x’s denote values
of A1 and y’s denote values of A2 over the
training examples.
• Variance of an attribute:
Background for PCA
var(A1) = Σi (xi − x̄)² / (n − 1)
• Covariance of two attributes:
• If covariance is positive, both dimensions
increase together. If negative, as one increases,
the other decreases. Zero: uncorrelated (no consistent linear relationship).
cov(A1, A2) = Σi (xi − x̄)(yi − ȳ) / (n − 1)
• Covariance matrix
• Suppose we have n attributes, A1, ..., An.
• Covariance matrix:
Cn×n = (ci,j) where ci,j = cov(Ai, Aj)
Covariance matrix
C = ( cov(H,H)  cov(H,M) )  =  ( var(H)  104.5 )  =  ( 47.7   104.5 )
    ( cov(M,H)  cov(M,M) )     ( 104.5  var(M) )     ( 104.5   370  )
• Eigenvectors:
• Let M be an n×n matrix.
• v is an eigenvector of M if M × v = λ v
• λ is called the eigenvalue associated with v
• For any eigenvector v of M and scalar a, M × (a v) = λ (a v)
• Thus you can always choose eigenvectors of length 1: v1² + ... + vn² = 1
• If M has any eigenvectors, it has n of them, and they are orthogonal to one another.
• Thus eigenvectors can be used as a new basis for an n-dimensional vector space.
PCA
1. Given original data set S = {x1, ..., xk},
produce new set by subtracting the mean
of attribute Ai from each xi.
Mean: 1.81 1.91 Mean: 0 0
2. Calculate the covariance matrix:
3. Calculate the (unit) eigenvectors and
eigenvalues of the covariance matrix:
[Plot of the mean-adjusted (x, y) data with the eigenvectors overlaid: the eigenvector with the largest eigenvalue traces the linear pattern in the data.]
4. Order eigenvectors by eigenvalue, highest to
lowest.
In general, you get n components. To reduce
dimensionality to p, ignore n-p components at
the bottom of the list.
v1 = (−.677873399, −.735178956), λ1 = 1.28402771
v2 = (−.735178956, .677873399), λ2 = .0490833989
Construct new feature vector.
Feature vector = (v1, v2, ...vp)
FeatureVector1 = ( −.677873399  −.735178956 )
                 ( −.735178956   .677873399 )
or reduced-dimension feature vector:
FeatureVector2 = ( −.677873399 )
                 ( −.735178956 )
5. Derive the new data set.
TransformedData = RowFeatureVector × RowDataAdjust
This gives original data in terms of chosen
components (eigenvectors)—that is, along these
axes.
RowFeatureVector1 = ( −.677873399  −.735178956 )
                    ( −.735178956   .677873399 )
RowFeatureVector2 = ( −.677873399  −.735178956 )
RowDataAdjust = ( .69  −1.31  .39  .09  1.29  .49   .19  −.81  −.31   −.71 )
                ( .49  −1.21  .99  .29  1.09  .79  −.31  −.81  −.31  −1.01 )
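The same steps can be checked with NumPy and scikit-learn; a sketch on the mean-adjusted data above (note that eigenvectors are only defined up to sign, so components may come back with flipped signs):

```python
import numpy as np
from sklearn.decomposition import PCA

# Mean-adjusted data (RowDataAdjust above), one row per example
X = np.array([
    [ .69,  .49], [-1.31, -1.21], [ .39,  .99], [ .09,  .29], [1.29, 1.09],
    [ .49,  .79], [ .19, -.31], [-.81, -.81], [-.31, -.31], [-.71, -1.01],
])

# Steps 2-3 by hand: covariance matrix and its eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X.T))
print(eigenvalues)    # ~[0.0490..., 1.2840...]
print(eigenvectors)   # columns are the unit eigenvectors

# Steps 4-5 with scikit-learn: keep only the top principal component
pca = PCA(n_components=1)
print(pca.fit_transform(X).ravel())   # the data expressed along that component
```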
Machine Learning
Clustering Strategies
• K-means
– Iteratively re-assign points to the nearest cluster center
• Agglomerative clustering
– Start with each point as its own cluster and iteratively merge the closest clusters
• Density-based clustering
– DBSCAN
As we go down this chart, the clustering strategies have more tendency to transitively group points even if they are not nearby in feature space
K-Means
Agglomerative clustering
(Hierarchical clustering)
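A minimal sketch of the three strategies in scikit-learn, on synthetic blob data (the DBSCAN parameters below are just a plausible starting point and normally need tuning):

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
agglo = AgglomerativeClustering(n_clusters=3).fit(X)
dbscan = DBSCAN(eps=0.8, min_samples=5).fit(X)

print(kmeans.labels_[:10])
print(agglo.labels_[:10])
print(dbscan.labels_[:10])   # -1 marks points DBSCAN leaves as noise
```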
Classification
• Apply a prediction function to a feature representation
of the image to get the desired output:
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
Slide credit: L. Lazebnik
The machine learning framework
y = f(x)
• Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set
• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
(y = output, f = prediction function, x = image feature)
Slide credit: L. Lazebnik
Steps
[Pipeline diagram — Training: training images + training labels → image features → training → learned model. Testing: test image → image features → learned model → prediction.]
Slide credit: D. Hoiem and L. Lazebnik
Features
• Raw pixels
• Histograms
• …
Slide credit: L. Lazebnik
Classifiers: Nearest neighbor
f(x) = label of the training example nearest to x
• All we need is a distance function for our inputs
• No training required!
[Diagram: a test example among training examples from class 1 and training examples from class 2.]
Slide credit: L. Lazebnik
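A sketch of this classifier in scikit-learn on the bundled iris data; n_neighbors=1 gives the pure nearest-neighbor rule described here:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "No training required": fit() essentially just stores the training examples
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on the held-out test set

# k > 1 lets the k nearest training examples vote (see the k-NN slides below)
print(KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).score(X_test, y_test))
```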
Classifiers: Linear
• Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
Slide credit: L. Lazebnik
Many classifiers to choose from
• SVM
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized Forests
• Boosted Decision Trees
• K-nearest neighbor
• RBMs
• Etc.
Which is the best one?
Slide credit: D. Hoiem
Generalization
• How well does a learned model generalize from
the data it was trained on to a new test set?
Training set (labels known) Test set (labels unknown)
Slide credit: L. Lazebnik
Generalization
• Components of generalization error
• Bias: how much does the average model over all training sets differ from the true model?
• Error due to inaccurate assumptions/simplifications made by the model
• Variance: how much do models estimated from different training sets differ from each other?
• Underfitting: model is too “simple” to represent all the relevant class characteristics
• High bias and low variance
• High training error and high test error
• Overfitting: model is too “complex” and fits irrelevant characteristics (noise) in the data
• Low bias and high variance
• Low training error and high test error
Slide credit: L. Lazebnik
Bias-Variance Trade-off
• Models with too few parameters are inaccurate because of a large bias (not enough flexibility).
• Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).
Slide credit: D. Hoiem
Bias-Variance Trade-off
E(MSE) = noise² + bias² + variance
(noise² = unavoidable error; bias² = error due to incorrect assumptions; variance = error due to variance of training samples)
See the following for explanations of bias-variance (also Bishop’s “Neural Networks” book):
•http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf
Slide credit: D. Hoiem
Bias-variance tradeoff
[Plot: training error and test error versus model complexity. Low complexity (high bias, low variance) underfits; high complexity (low bias, high variance) overfits.]
Slide credit: D. Hoiem
Bias-variance tradeoff
[Plot: test error versus model complexity, for many training examples vs. few training examples.]
Slide credit: D. Hoiem
Effect of Training Size
[Plot: for a fixed prediction model, training error and testing error versus the number of training examples; the gap between them is the generalization error.]
Slide credit: D. Hoiem
How to reduce variance?
• Choose a simpler classifier
• Regularize the parameters
• Get more training data
Slide credit: D. Hoiem
Very brief tour of some classifiers
• K-nearest neighbor
• SVM
• Boosted Decision Trees
• Neural networks
• Naïve Bayes
• Bayesian network
• Logistic regression
• Randomized Forests
• RBMs
• Etc.
Generative vs. Discriminative Classifiers
Generative Models
• Represent both the data and the labels
• Often, makes use of conditional independence and priors
• Examples
• Naïve Bayes classifier
• Bayesian network
• Models of data may apply to future prediction problems
Discriminative Models
• Learn to directly predict the labels from the data
• Often, assume a simple boundary (e.g., linear)
• Examples
– Logistic regression
– SVM
– Boosted decision trees
• Often easier to predict a label from the data than to model the data
Slide credit: D. Hoiem
Classification
• Assign input vector to one of two or more classes
• Any decision rule divides input space into decision regions separated by decision boundaries
Slide credit: L. Lazebnik
Nearest Neighbor Classifier
• Assign label of nearest training data point to
each test data point
Voronoi partitioning of feature space
for two-category 2D and 3D data
from Duda et al.
Source: D. Lowe
K-nearest neighbor
[Scatterplot in (x1, x2) space: training examples from two classes (x and o) and query points (+).]
1-nearest neighbor vs.
3-nearest neighbor
[Two scatterplots comparing how the query points (+) are labeled under the 1-nearest-neighbor and 3-nearest-neighbor rules.]
Naïve Bayes Classifier
• For instance, P(+ | p1) vs. P(- | p1)
• We don’t know a model. Just given sample.
• Bayes Rule
Naïve Bayes Classifier
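The slide does not fix an implementation; as one concrete sketch, scikit-learn's GaussianNB applies Bayes' rule under the naïve conditional-independence assumption, with a Gaussian model per feature:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))   # posterior P(class | features) for 3 examples
print(nb.score(X_test, y_test))
```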
Classifiers: Logistic Regression
[Scatterplot: features x1 and x2 (height, pitch of voice) for male and female examples, separated by a line.]
log [ P(x1, x2 | y = 1) / P(x1, x2 | y = −1) ] = wᵀx
P(y = 1 | x1, x2) = 1 / (1 + exp(−wᵀx))
Maximize likelihood of label given data, assuming a log-linear model
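A sketch with scikit-learn's LogisticRegression, which fits this log-linear model by maximising the likelihood of the labels; the height/voice-pitch numbers below are fabricated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Fake data: column 0 = height (cm), column 1 = pitch of voice (Hz)
X_male = rng.normal([175.0, 110.0], [7.0, 15.0], size=(100, 2))
X_female = rng.normal([163.0, 210.0], [6.0, 20.0], size=(100, 2))
X = np.vstack([X_male, X_female])
y = np.array([1] * 100 + [0] * 100)            # 1 = male, 0 = female

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.coef_, clf.intercept_)               # the learned weights w and bias
print(clf.predict_proba([[170.0, 150.0]]))     # [P(female), P(male)] for a new point
```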
Classifiers: Linear SVM
[Scatterplot of training examples from two classes (x and o) in (x1, x2) space.]
• Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
Classifiers: Linear SVM
[Scatterplot of training examples from two classes (x and o) in (x1, x2) space.]
• Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
Classifiers: Linear SVM
[Scatterplot of training examples from two classes (x and o) in (x1, x2) space.]
• Find a linear function to separate the classes:
f(x) = sgn(w · x + b)
• Datasets that are linearly separable work out great:
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space:
[Illustration: 1-D data on the x axis that is not linearly separable becomes separable after mapping each point x to (x, x²).]
Nonlinear SVMs
Slide credit: Andrew Moore
Φ: x → φ(x)
Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Slide credit: Andrew Moore
Nonlinear SVMs
• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
K(xi,xj) = φ(xi) · φ(xj)
• This gives a nonlinear decision boundary in the original feature space:
Σi αi yi φ(xi) · φ(x) + b = Σi αi yi K(xi, x) + b
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
Kernels for bags of features
• Histogram intersection kernel:
• Generalized Gaussian kernel:
• D can be (inverse) L1 distance, Euclidean distance, χ2 distance,
etc.
I(h1, h2) = Σi=1..N min(h1(i), h2(i))
K(h1, h2) = exp(−(1/A) D(h1, h2)²)
J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, IJCV 2007
What about multi-class SVMs?
• Unfortunately, there is no “definitive” multi-class SVM formulation
• In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs
• One vs. others
• Training: learn an SVM for each class vs. the others
• Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value
• One vs. one
• Training: learn an SVM for each pair of classes
• Testing: each learned SVM “votes” for a class to assign to the test example
Slide credit: L. Lazebnik
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
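A minimal, self-contained sketch of how these two classes might be used; for more than two classes, SVC combines two-class SVMs one-vs.-one, while LinearSVC trains one-vs.-the-rest:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)
kernel_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)   # kernel trick

print(linear_svm.score(X_test, y_test))
print(kernel_svm.score(X_test, y_test))
```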
Summary: Classifiers
• Nearest-neighbor and k-nearest-neighbor classifiers
• L1 distance, χ2 distance, quadratic distance
• Support vector machines
• Linear classifiers
• Margin maximization
• The kernel trick
• Kernel functions: generalized Gaussian, RBF
• Multi-class
• Of course, there are many other classifiers out there
• Neural networks, boosting, decision trees, …
Slide credit: L. Lazebnik
Classifiers: Decision Trees
[Scatterplot of training examples from two classes (x and o) in (x1, x2) space.]
Feature Selection
• GINI Index
• Information Gain
• Chi-square
(2×2 contingency table for a term t and a class c: t present/absent (O/X) against c present/absent (O/X), with cell counts A, B, C, D)
Ensemble Training
• Growing the tree deeper → overfitting
• Make a strong classifier with weak classifiers
Random Forest
(bootstrap aggregating, Bagging)
AdaBoost (Boosting)
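A sketch of both ensemble flavours with scikit-learn: a bagged forest of randomized trees, and AdaBoost, which by default boosts shallow decision-tree "stumps" as its weak classifiers:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each tree is grown on a bootstrap sample with randomized feature choices
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Boosting: many weak learners are combined sequentially into a strong classifier
boost = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print(forest.score(X_test, y_test))
print(boost.score(X_test, y_test))
```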
The Biological Neuron
Perceptron Learning Theorem
• Recap: A perceptron (threshold unit) can learn anything
that it can represent (i.e. anything separable with a
hyperplane)
The Exclusive OR problem
A Perceptron cannot represent Exclusive OR
since it is not linearly separable.
Minsky & Papert (1969) offered solution to XOR problem by
combining perceptron unit responses using a second layer of
Units. Piecewise linear classification using an MLP with
threshold (perceptron) units
[Diagram: an MLP with two threshold units in the first layer feeding a third unit in the second layer.]
What do each of the layers do?
1st layer draws
linear boundaries
2nd layer combines
the boundaries
3rd layer can generate
arbitrarily complex
boundaries
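As a concrete sketch of this idea, a small multi-layer perceptron (scikit-learn's MLPClassifier) can learn XOR, which a single perceptron cannot represent; the hidden-layer size and solver below are just one workable choice:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# The XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# One small hidden layer is enough to combine two linear boundaries into the XOR regions
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))   # typically [0 1 1 0]
```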
Local Minima
Activation Function
• Q & A
More Related Content

What's hot

STATISTICS: Normal Distribution
STATISTICS: Normal Distribution STATISTICS: Normal Distribution
STATISTICS: Normal Distribution jundumaug1
 
The standard normal curve & its application in biomedical sciences
The standard normal curve & its application in biomedical sciencesThe standard normal curve & its application in biomedical sciences
The standard normal curve & its application in biomedical sciencesAbhi Manu
 
Standard normal distribution
Standard normal distributionStandard normal distribution
Standard normal distributionNadeem Uddin
 
Measures of dispersion range qd md
Measures of dispersion range qd mdMeasures of dispersion range qd md
Measures of dispersion range qd mdRekhaChoudhary24
 
Measures of Dispersion: Standard Deviation and Co- efficient of Variation
Measures of Dispersion: Standard Deviation and Co- efficient of Variation Measures of Dispersion: Standard Deviation and Co- efficient of Variation
Measures of Dispersion: Standard Deviation and Co- efficient of Variation RekhaChoudhary24
 
Normal distribution
Normal distributionNormal distribution
Normal distributionrishal619
 
Generalized linear model
Generalized linear modelGeneralized linear model
Generalized linear modelRahul Rockers
 
Regression analysis by Muthama JM
Regression analysis by Muthama JMRegression analysis by Muthama JM
Regression analysis by Muthama JMJapheth Muthama
 
Normal distribution slide share
Normal distribution slide shareNormal distribution slide share
Normal distribution slide shareKate FLR
 
multiple regression
multiple regressionmultiple regression
multiple regressionPriya Sharma
 

What's hot (20)

STATISTICS: Normal Distribution
STATISTICS: Normal Distribution STATISTICS: Normal Distribution
STATISTICS: Normal Distribution
 
The standard normal curve & its application in biomedical sciences
The standard normal curve & its application in biomedical sciencesThe standard normal curve & its application in biomedical sciences
The standard normal curve & its application in biomedical sciences
 
Normal distribution
Normal distribution Normal distribution
Normal distribution
 
Standard normal distribution
Standard normal distributionStandard normal distribution
Standard normal distribution
 
Simple linear regression
Simple linear regressionSimple linear regression
Simple linear regression
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Normal distri
Normal distriNormal distri
Normal distri
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Measures of dispersion range qd md
Measures of dispersion range qd mdMeasures of dispersion range qd md
Measures of dispersion range qd md
 
Measures of Dispersion: Standard Deviation and Co- efficient of Variation
Measures of Dispersion: Standard Deviation and Co- efficient of Variation Measures of Dispersion: Standard Deviation and Co- efficient of Variation
Measures of Dispersion: Standard Deviation and Co- efficient of Variation
 
Normal distribution
Normal distributionNormal distribution
Normal distribution
 
Generalized linear model
Generalized linear modelGeneralized linear model
Generalized linear model
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Regression analysis by Muthama JM
Regression analysis by Muthama JMRegression analysis by Muthama JM
Regression analysis by Muthama JM
 
Correlation continued
Correlation continuedCorrelation continued
Correlation continued
 
Normal distribution slide share
Normal distribution slide shareNormal distribution slide share
Normal distribution slide share
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
 
Statistics for entrepreneurs
Statistics for entrepreneurs Statistics for entrepreneurs
Statistics for entrepreneurs
 
multiple regression
multiple regressionmultiple regression
multiple regression
 
5 regression
5 regression5 regression
5 regression
 

Similar to 슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)

Correlation _ Regression Analysis statistics.pptx
Correlation _ Regression Analysis statistics.pptxCorrelation _ Regression Analysis statistics.pptx
Correlation _ Regression Analysis statistics.pptxkrunal soni
 
Regression analysis
Regression analysisRegression analysis
Regression analysisAwais Salman
 
Corr-and-Regress (1).ppt
Corr-and-Regress (1).pptCorr-and-Regress (1).ppt
Corr-and-Regress (1).pptMuhammadAftab89
 
Cr-and-Regress.ppt
Cr-and-Regress.pptCr-and-Regress.ppt
Cr-and-Regress.pptRidaIrfan10
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.pptkrunal soni
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.pptMoinPasha12
 
Correlation & Regression for Statistics Social Science
Correlation & Regression for Statistics Social ScienceCorrelation & Regression for Statistics Social Science
Correlation & Regression for Statistics Social Sciencessuser71ac73
 
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...Maninda Edirisooriya
 
Statr sessions 9 to 10
Statr sessions 9 to 10Statr sessions 9 to 10
Statr sessions 9 to 10Ruru Chowdhury
 
Regression and Co-Relation
Regression and Co-RelationRegression and Co-Relation
Regression and Co-Relationnuwan udugampala
 
Normal Distribution slides(1).pptx
Normal Distribution slides(1).pptxNormal Distribution slides(1).pptx
Normal Distribution slides(1).pptxKinzaSuhail2
 
CORRELATION AND REGRESSION.pptx
CORRELATION AND REGRESSION.pptxCORRELATION AND REGRESSION.pptx
CORRELATION AND REGRESSION.pptxVitalis Adongo
 
Stat 1163 -correlation and regression
Stat 1163 -correlation and regressionStat 1163 -correlation and regression
Stat 1163 -correlation and regressionKhulna University
 
Lecture 01 probability distributions
Lecture 01 probability distributionsLecture 01 probability distributions
Lecture 01 probability distributionsmohamed ali
 

Similar to 슬로우캠퍼스: scikit-learn & 머신러닝 (강박사) (20)

Correlation _ Regression Analysis statistics.pptx
Correlation _ Regression Analysis statistics.pptxCorrelation _ Regression Analysis statistics.pptx
Correlation _ Regression Analysis statistics.pptx
 
regression
regressionregression
regression
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Corr-and-Regress (1).ppt
Corr-and-Regress (1).pptCorr-and-Regress (1).ppt
Corr-and-Regress (1).ppt
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.ppt
 
Cr-and-Regress.ppt
Cr-and-Regress.pptCr-and-Regress.ppt
Cr-and-Regress.ppt
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.ppt
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.ppt
 
Correlation & Regression for Statistics Social Science
Correlation & Regression for Statistics Social ScienceCorrelation & Regression for Statistics Social Science
Correlation & Regression for Statistics Social Science
 
Corr-and-Regress.ppt
Corr-and-Regress.pptCorr-and-Regress.ppt
Corr-and-Regress.ppt
 
Corr And Regress
Corr And RegressCorr And Regress
Corr And Regress
 
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
 
Statistics-2 : Elements of Inference
Statistics-2 : Elements of InferenceStatistics-2 : Elements of Inference
Statistics-2 : Elements of Inference
 
Statr sessions 9 to 10
Statr sessions 9 to 10Statr sessions 9 to 10
Statr sessions 9 to 10
 
Regression and Co-Relation
Regression and Co-RelationRegression and Co-Relation
Regression and Co-Relation
 
Machine learning mathematicals.pdf
Machine learning mathematicals.pdfMachine learning mathematicals.pdf
Machine learning mathematicals.pdf
 
Normal Distribution slides(1).pptx
Normal Distribution slides(1).pptxNormal Distribution slides(1).pptx
Normal Distribution slides(1).pptx
 
CORRELATION AND REGRESSION.pptx
CORRELATION AND REGRESSION.pptxCORRELATION AND REGRESSION.pptx
CORRELATION AND REGRESSION.pptx
 
Stat 1163 -correlation and regression
Stat 1163 -correlation and regressionStat 1163 -correlation and regression
Stat 1163 -correlation and regression
 
Lecture 01 probability distributions
Lecture 01 probability distributionsLecture 01 probability distributions
Lecture 01 probability distributions
 

Recently uploaded

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 

Recently uploaded (20)

ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 

슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)

  • 2. Anaconda • leading open data science platform powered by Python • https://www.continuum.io/downloads
  • 5. Random Variables and Probability Distributions • Random Variables - Random responses corresponding to subjects randomly selected from a population. • Probability Distributions - A listing of the possible outcomes and their probabilities (discrete r.v.s) or their densities (continuous r.v.s) • Normal Distribution - Bell-shaped continuous distribution widely used in statistical inference • Sampling Distributions - Distributions corresponding to sample statistics (such as mean and proportion) computed from random samples
  • 6. Normal Distribution • Bell-shaped, symmetric family of distributions • Classified by 2 parameters: Mean (µ) and standard deviation (σ). These represent location and spread • Random variables that are approximately normal have the following properties wrt individual measurements: • Approximately half (50%) fall above (and below) mean • Approximately 68% fall within 1 standard deviation of mean • Approximately 95% fall within 2 standard deviations of mean • Virtually all fall within 3 standard deviations of mean • Notation when Y is normally distributed with mean µ and standard deviation σ : ),(~ σµNY
  • 8. Example - Heights of U.S. Adults • Female and Male adult heights are well approximated by normal distributions: YF~N(63.7,2.5) YM~N(69.1,2.6) INCHESM 76.5 75.5 74.5 73.5 72.5 71.5 70.5 69.5 68.5 67.5 66.5 65.5 64.5 63.5 62.5 61.5 60.5 59.5 Cases weighted by PCTM 20 10 0 Std. Dev = 2.61 Mean = 69.1 N = 99.23 INCHESF 70.5 69.5 68.5 67.5 66.5 65.5 64.5 63.5 62.5 61.5 60.5 59.5 58.5 57.5 56.5 55.5 Cases weighted by PCTF 20 18 16 14 12 10 8 6 4 2 0 Std. Dev = 2.48 Mean = 63.7 N = 99.68 Source: Statistical Abstract of the U.S. (1992)
  • 9. Standard Normal (Z) Distribution • Problem: Unlimited number of possible normal distributions (-¥< µ < ¥, σ > 0) • Solution: Standardize the random variable to have mean 0 and standard deviation 1 )1,0(~),(~ N Y ZNY σ µ σµ − =⇒ • Probabilities of certain ranges of values and specific percentiles of interest can be obtained through the standard normal (Z) distribution
  • 10. Standard Normal (Z) Distribution • Standard Normal Distribution Characteristics: • P(Z ³0) = P(Y ³µ ) = 0.5000 • P(-1 £Z £1) = P(µ-σ £Y £µ+σ ) = 0.6826 • P(-2 £Z £2) = P(µ-2σ £Y £µ+2σ ) = 0.9544 • P(Z ³za) = P(Z £-za) = a (using Z-table) a 0.500 0.100 0.050 0.025 0.010 0.005 za 0.000 1.282 1.645 1.960 2.326 2.576
  • 11. Finding Probabilities of Specific Ranges • Step 1 - Identify the normal distribution of interest (e.g. its mean (µ) and standard deviation (σ) ) • Step 2 - Identify the range of values that you wish to determine the probability of observing (YL , YU), where often the upper or lower bounds are ¥or -¥ • Step 3 - Transform YL and YU into Z-values: σ µ σ µ − = − = U U L L Y Z Y Z • Step 4 - Obtain P(ZL£Z £ZU) from Z-table
  • 12. Example - Adult Female Heights • What is the probability a randomly selected female is 5’10” or taller (70 inches)? • Step 1 - Y ~ N(63.7 , 2.5) • Step 2 - YL = 70.0 YU = ¥ • Step 3 - ∞== − = UL ZZ 52.2 5.2 7.630.70 • Step 4 - P(Y ³70) = P(Z ³2.52) = .0059 ( »1/170) z .00 .01 .02 .03 2.4 .0082 .0080 .0078 .0075 2.5 .0062 .0060 .0059 .0057 2.6 .0047 .0045 .0044 .0043
  • 13. Finding Percentiles of a Distribution • Step 1 - Identify the normal distribution of interest (e.g. its mean (µ) and standard deviation (σ) ) • Step 2 - Determine the percentile of interest 100p% (e.g. the 90th percentile is the cut-off where only 90% of scores are below and 10% are above) • Step 3 - Turn the percentile of interest into a tail probability a and corresponding z-value (zp): • If 100p ³50 then a = 1-p and zp = za • If 100p < 50 then a = p and zp = -za • Step 4 - Transform zp back to original units: σµ p p zY +=
  • 14. Example - Adult Male Heights • Above what height do the tallest 5% of males lie above? • Step 1 - Y ~ N(69.1 , 2.6) • Step 2 - Want to determine 95th percentile (p = .95) • Step 3 - Since 100p > 50, a = 1-p = 0.05 zp = za = z.05 = 1.645 • Step 4 - Y.95 = 69.1 + (1.645)(2.6) = 73.4 z .03 .04 .05 .06 1.5 .0630 .0618 .0606 .0594 1.6 .0516 .0505 .0495 .0485 1.7 .0418 .0409 .0401 .0392
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21. Statistical Models • When making statistical inference it is useful to write random variables in terms of model parameters and random errors µεεµµµ −=+=−+= YYY )( • Here µ is a fixed constant and ε is a random variable • In practice µ will be unknown, and we will use sample data to estimate or make statements regarding its value
  • 22. Sampling Distributions and the Central Limit Theorem • Sample statistics based on random samples are also random variables and have sampling distributions that are probability distributions for the statistic (outcomes that would vary across samples) • When samples are large and measurements independent then many estimators have normal sampling distributions (CLT): • Sample Mean: • Sample Proportion: ⎟ ⎠ ⎞ ⎜ ⎝ ⎛ n NY σ µ,~ ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ − n N )1( ,~ ^ ππ ππ
  • 23. Example - Adult Female Heights • Random samples of n = 100 females to be selected • For each sample, the sample mean is computed • Sampling distribution: )25.0,5.63( 100 5.2 ,5.63~ NNY ≡⎟ ⎠ ⎞ ⎜ ⎝ ⎛ • Note that approximately 95% of all possible random samples of 100 females will have sample means between 63.0 and 64.0 inches
  • 25. Topics Covered: • Is there a relationship between x and y? • What is the strength of this relationship • Pearson’s r • Can we describe this relationship and use this to predict y from x? • Regression • Is the relationship we have described statistically significant? • t test • Relevance to SPM • GLM
  • 26. The relationship between x and y • Correlation: is there a relationship between 2 variables? • Regression: how well a certain independent variable predict dependent variable? • CORRELATION ¹CAUSATION • In order to infer causality: manipulate independent variable and observe effect on dependent variable
  • 27. Scattergrams Y X Y X Y X YY Y Positive correlation Negative correlation No correlation
  • 28. Variance vs Covariance • Do two variables change together? 1 ))(( ),cov( 1 − −− = ∑= n yyxx yx i n i i Covariance: • Gives information on the degree to which two variables vary together. • Note how similar the covariance is to variance: the equation simply multiplies x’s error scores by y’s error scores as opposed to squaring x’s error scores. 1 )( 2 12 − − = ∑= n xx S n i i x Variance: • Gives information on variability of a single variable.
  • 29. Covariance n When X and Y : cov (x,y) = pos. n When X and Y : cov (x,y) = neg. n When no constant relationship: cov (x,y) = 0 1 ))(( ),cov( 1 − −− = ∑= n yyxx yx i n i i
  • 30. Example Covariance x y xxi − yyi − ( xix − )( yiy − ) 0 3 -3 0 0 2 2 -1 -1 1 3 4 0 1 0 4 0 1 -3 -3 6 6 3 3 9 3=x 3=y å = 7 75.1 4 7 1 )))(( ),cov( 1 == − −− = ∑= n yyxx yx i n i i What does this number tell us? 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
  • 31. Problem with Covariance: • The value obtained by covariance is dependent on the size of the data’s standard deviations: if large, the value will be greater than if small… even if the relationship between x and y is exactly the same in the large versus small standard deviation datasets.
  • 32. Example of how covariance value relies on variance High variance data Low variance data Subject x y x error * y error x y X error * y error 1 101 100 2500 54 53 9 2 81 80 900 53 52 4 3 61 60 100 52 51 1 4 51 50 0 51 50 0 5 41 40 100 50 49 1 6 21 20 900 49 48 4 7 1 0 2500 48 47 9 Mean 51 50 51 50 Sum of x error * y error : 7000 Sum of x error * y error : 28 Covariance: 1166.67 Covariance: 4.67
  • 33. Solution: Pearson’s r • Covariance does not really tell us anything • Solution: standardise this measure • Pearson’s R: standardises the covariance value. • Divides the covariance by the multiplied standard deviations of X and Y: yx xy ss yx r ),cov( =
  • 34. Pearson’s R continued 1 ))(( ),cov( 1 − −− = ∑= n yyxx yx i n i i yx i n i i xy ssn yyxx r )1( ))(( 1 − −− = ∑= 1 * 1 − = ∑= n ZZ r n i yx xy ii
  • 35. Limitations of r • When r = 1 or r = -1: • We can predict y from x with certainty • all data points are on a straight line: y = ax + b • r is actually • r = true r of whole population • = estimate of r based on data • r is very sensitive to extreme values: 0 1 2 3 4 5 0 1 2 3 4 5 6 rˆ rˆ
  • 36.
  • 37. Regression • Correlation tells you if there is an association between x and y but it doesn’t describe the relationship or allow you to predict one variable from the other. • To do this we need REGRESSION!
  • 38. Best-fit Line = ŷ, predicted value • Aim of linear regression is to fit a straight line, ŷ = ax + b, to data that gives best prediction of y for any value of x • This will be the line that minimises distance between data and fitted line, i.e. the residuals interceptε ŷ = ax + b ε = residual error = y i , true value slope
  • 39. Least Squares Regression • To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line) Residual (ε) = y - ŷ Sum of squares of residuals = Σ (y – ŷ)2 Model line: ŷ = ax + b n we must find values of a and b that minimise Σ (y – ŷ)2 a = slope, b = intercept
  • 40. Finding b • First we find the value of b that gives the min sum of squares ε εb b b n Trying different values of b is equivalent to shifting the line up and down the scatter plot
  • 41. Finding a • Now we find the value of a that gives the min sum of squares b b b n Trying out different values of a is equivalent to changing the slope of the line, while b stays constant
  • 42. Minimising sums of squares • Need to minimise Σ(y–ŷ)2 • ŷ = ax + b • so need to minimise: Σ(y - ax - b)2 • If we plot the sums of squares for all different values of a and b we get a parabola, because it is a squared term • So the min sum of squares is at the bottom of the curve, where the gradient is zero. Values of a and b sumsofsquares(S) Gradient= 0 min S
  • 43. The maths bit • The min sum of squares is at the bottom of the curve where the gradient = 0 • So we can find a and b that give min sum of squares by taking partial derivatives of Σ(y - ax - b)2 with respect to a and b separately • Then we solve these for 0 to give us the values of a and b that give the min sum of squares
  • 44. The solution • Doing this gives the following equations for a and b: a = r sy sx r = correlation coefficient of x and y sy = standard deviation of y sx = standard deviation of x n From you can see that: § A low correlation coefficient gives a flatter slope (small value of a) § Large spread of y, i.e. high standard deviation, results in a steeper slope (high value of a) § Large spread of x, i.e. high standard deviation, results in a flatter slope (small value of a)
  • 45. The solution cont. • Our model equation is ŷ = ax + b • This line must pass through the mean so: y = ax + b b = y – ax n We can put our equation for a into this giving: b = y – ax b = y - r sy sx r = correlation coefficient of x and y sy = standard deviation of y sx = standard deviation of x x n The smaller the correlation, the closer the intercept is to the mean of y
  • 46. Back to the model • If the correlationis zero, we will simplypredict the mean of y for every value of x, and our regressionline is just a flat straight line crossing the x-axis at y • But this isn’t very useful. • We can calculate the regressionline for any data, but the important question is how well does this line fit the data, or how good is it at predictingy from x ŷ = ax + b = r sy sx r sy sx x + y - x r sy sx ŷ = (x – x) + yRearranges to: a b a a
  • 47. How good is our model? • Total variance of y: n Variance of predicted y values (ŷ): n Error variance: sŷ 2 = ∑(ŷ – y)2 n - 1 SSpred dfŷ = sy 2 = ∑(y – y)2 n - 1 SSy dfy = This is the variance explained by our regression model serror 2 = ∑(y – ŷ)2 n - 2 SSer dfer = This is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model
  • 48. • Total variance = predicted variance + error variance sy 2 = sŷ 2 + ser 2 • Conveniently, via some complicated rearranging sŷ 2 = r2 sy 2 r2 = sŷ 2 / sy 2 • so r2 is the proportion of the variance in y that is explained by our regression model How good is our model cont.
  • 49. How good is our model cont. • Insert r2 sy 2 into sy 2 = sŷ 2 + ser 2 and rearrange to get: ser 2 = sy 2 – r2sy 2 = sy 2 (1 – r2) • From this we can see that the greater the correlation the smaller the error variance, so the better our prediction
  • 50. Is the model significant? • i.e. do we get a significantly better prediction of y from our regression equation than by just predicting the mean? • F-statistic: F(dfŷ,dfer) = sŷ 2 ser 2 =......= r2 (n - 2)2 1 – r2 complicated rearranging n And it follows that: t(n-2) = r (n - 2) √1 – r2 (because F = t2) So all we need to know are r and n
  • 52. General Linear Model • Linear regression is actually a form of the General Linear Model where the parameters are a, the slope of the line, and b, the intercept. y = ax + b +ε • A General Linear Model is just any model that describes the data in terms of a straight line
  • 53. Multiple regression • Multiple regression is used to determine the effect of a number of independent variables, x1, x2, x3 etc., on a single dependent variable, y • The different x variables are combined in a linear way and each has its own regression coefficient: y = a1x1 + a2x2 + … + anxn + b + ε • The a parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y • i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for
  • 55. Dimensionality Reduction and Feature Construction • Principal components analysis (PCA) • Reading: L. I. Smith, A tutorial on principal components analysis (on class website) • PCA used to reduce dimensions of data without much loss of information. • Used in machine learning and in signal processing and image compression (among other things).
  • 56. PCA is "an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (first principal component), the second greatest variance lies on the second coordinate (second principal component), and so on."
  • 57. Background for PCA • Suppose attributes are A1 and A2, and we have n training examples. x's denote values of A1 and y's denote values of A2 over the training examples. • Variance of an attribute: var(A1) = Σi (xi – x̄)² / (n – 1)
  • 58. • Covariance of two attributes: cov(A1, A2) = Σi (xi – x̄)(yi – ȳ) / (n – 1) • If covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. Zero: no linear relationship between them (uncorrelated).
  • 59. • Covariance matrix • Suppose we have n attributes, A1, ..., An. • Covariance matrix: C (n × n) = (ci,j), where ci,j = cov(Ai, Aj)
  • 61. • Eigenvectors: • Let M be an n × n matrix. • v is an eigenvector of M if M v = λv • λ is called the eigenvalue associated with v • For any eigenvector v of M and scalar a, M (a v) = λ (a v) • Thus you can always choose eigenvectors of length 1: v1² + ... + vn² = 1 • If M is symmetric (as a covariance matrix is), it has n eigenvectors and they are orthogonal to one another • Thus the eigenvectors can be used as a new basis for an n-dimensional vector space
  • 62. PCA 1. Given original data set S = {x1, ..., xk}, produce a new data set by subtracting the mean of each attribute Ai from every value of that attribute. (In the toy example the attribute means are 1.81 and 1.91; after subtraction each attribute has mean 0.)
  • 64. 2. Calculate the covariance matrix of the mean-adjusted data. 3. Calculate the (unit) eigenvectors and eigenvalues of the covariance matrix. (The slide shows the 2 × 2 covariance matrix of the x and y attributes of the toy example.)
  • 65. Eigenvector with largest eigenvalue traces linear pattern in data
  • 66. 4. Order eigenvectors by eigenvalue, highest to lowest. In general, you get n components. To reduce dimensionality to p, ignore the n – p components at the bottom of the list. In the toy example the two eigenvalues are 1.28402771 and 0.0490833989, with (unit) eigenvectors (–.677873399, –.735178956) and (–.735178956, .677873399) respectively.
  • 67. Construct new feature vector. Feature vector = (v1, v2, ... vp). Keeping both components: FeatureVector1 = [ –.677873399  –.735178956 ; –.735178956  .677873399 ]. Or the reduced-dimension feature vector keeping only the top component: FeatureVector2 = [ –.677873399 ; –.735178956 ]
  • 68. 5. Derive the new data set. TransformedData = RowFeatureVector × RowDataAdjust. This gives the original data in terms of the chosen components (eigenvectors), that is, along those axes. RowFeatureVector1 = [ –.677873399  –.735178956 ; –.735178956  .677873399 ], RowFeatureVector2 = [ –.677873399  –.735178956 ], RowDataAdjust = [ .69  –1.31  .39  .09  1.29  .49  .19  –.81  –.31  –.71 ; .49  –1.21  .99  .29  1.09  .79  –.31  –.81  –.31  –1.01 ]
  • 75. Clustering Strategies • K-means: iteratively re-assign points to the nearest cluster center • Agglomerative clustering: start with each point as its own cluster and iteratively merge the closest clusters • Density-based clustering: DBSCAN • As we go down this list, the clustering strategies have more tendency to transitively group points even if they are not nearby in feature space
  • 83. Classification • Apply a prediction function to a feature representation of the image to get the desired output: f( ) = “apple” f( ) = “tomato” f( ) = “cow” Slide credit: L. Lazebnik
  • 84. The machine learning framework y = f(x), where y is the output, f the prediction function, and x the image features • Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set • Testing: apply f to a never before seen test example x and output the predicted value y = f(x) Slide credit: L. Lazebnik
  • 85. Prediction Steps • Training: training images → image features + training labels → learned model • Testing: test image → image features → learned model → prediction Slide credit: D. Hoiem and L. Lazebnik
  • 86. Features • Raw pixels • Histograms • … Slide credit: L. Lazebnik
  • 87. Classifiers: Nearest neighbor f(x) = label of the training example nearest to x • All we need is a distance function for our inputs • No training required! (Figure: a test example among training examples from class 1 and class 2) Slide credit: L. Lazebnik
  • 88. Classifiers: Linear • Find a linear function to separate the classes: f(x) = sgn(w · x + b) Slide credit: L. Lazebnik
  • 89. Many classifiers to choose from • SVM • Neural networks • Naïve Bayes • Bayesian network • Logistic regression • Randomized Forests • Boosted Decision Trees • K-nearest neighbor • RBMs • Etc. Which is the best one? Slide credit: D. Hoiem
  • 90. Generalization • How well does a learned model generalize from the data it was trained on to a new test set? Training set (labels known), test set (labels unknown) Slide credit: L. Lazebnik
  • 91. Generalization • Components of generalization error • Bias: how much the average model over all training sets differs from the true model • Error due to inaccurate assumptions/simplifications made by the model • Variance: how much models estimated from different training sets differ from each other • Underfitting: model is too "simple" to represent all the relevant class characteristics • High bias and low variance • High training error and high test error • Overfitting: model is too "complex" and fits irrelevant characteristics (noise) in the data • Low bias and high variance • Low training error and high test error Slide credit: L. Lazebnik
  • 92. Bias-Variance Trade-off • Models with too few parameters are inaccurate because of a large bias (not enough flexibility). • Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample). Slide credit: D. Hoiem
  • 93. Bias-Variance Trade-off E(MSE) = noise² + bias² + variance • noise²: unavoidable error • bias²: error due to incorrect assumptions • variance: error due to variance of the training samples • See the following for explanations of bias-variance (also Bishop's "Neural Networks" book): http://www.inf.ed.ac.uk/teaching/courses/mlsc/Notes/Lecture4/BiasVariance.pdf Slide credit: D. Hoiem
  • 94. Bias-variance tradeoff (Figure: training error and test error vs. model complexity; underfitting at low complexity with high bias and low variance, overfitting at high complexity with low bias and high variance) Slide credit: D. Hoiem
  • 95. Bias-variance tradeoff (Figure: test error vs. model complexity for many training examples and for few training examples; high bias/low variance at low complexity, low bias/high variance at high complexity) Slide credit: D. Hoiem
  • 96. Effect of Training Size (Figure: training and testing error vs. number of training examples for a fixed prediction model; the gap between them is the generalization error) Slide credit: D. Hoiem
  • 97. How to reduce variance? • Choose a simpler classifier • Regularize the parameters • Get more training data Slide credit: D. Hoiem
  • 99. Very brief tour of some classifiers • K-nearest neighbor • SVM • Boosted Decision Trees • Neural networks • Naïve Bayes • Bayesian network • Logistic regression • Randomized Forests • RBMs • Etc.
  • 100. Generative vs. Discriminative Classifiers • Generative Models • Represent both the data and the labels • Often make use of conditional independence and priors • Examples: Naïve Bayes classifier, Bayesian network • Models of the data may apply to future prediction problems • Discriminative Models • Learn to directly predict the labels from the data • Often assume a simple boundary (e.g., linear) • Examples: logistic regression, SVM, boosted decision trees • Often easier to predict a label from the data than to model the data Slide credit: D. Hoiem
  • 101. Classification • Assign input vector to one of two or more classes • Any decision rule divides input space into decision regions separated by decision boundaries Slide credit: L. Lazebnik
  • 102. Nearest Neighbor Classifier • Assign label of nearest training data point to each test data point Voronoi partitioning of feature space for two-category 2D and 3D data from Duda et al. Source: D. Lowe
  • 104. 1-nearest neighbor vs. 3-nearest neighbor (Figure: two scatter plots of the same x/o training points with query points marked +, classified by the single nearest neighbor and by a vote over the 3 nearest neighbors)
  • 107. Naïve Bayes Classifier • For instance, compare P(+ | p1) vs. P(− | p1) • We don't know the underlying model; we are just given a sample • Bayes' rule lets us turn P(features | class) and the class prior into the posterior P(class | features)
  • 110. Classifiers: Logistic Regression • (Figure: male and female examples plotted by height and pitch of voice, axes x1 and x2) • log [ P(x1, x2 | y = 1) / P(x1, x2 | y = −1) ] = wᵀx • P(y = 1 | x1, x2) = 1 / (1 + exp(−wᵀx)) • Maximize the likelihood of the labels given the data, assuming a log-linear model
  • 111. Classifiers: Linear SVM • Find a linear function to separate the classes: f(x) = sgn(w · x + b) (Figure: x/o training points with a candidate separating line)
  • 112. Classifiers: Linear SVM • Find a linear function to separate the classes: f(x) = sgn(w · x + b) (Figure: the same points with a different separating line)
  • 113. Classifiers: Linear SVM • Find a linear function to separate the classes: f(x) = sgn(w · x + b) (Figure: the same points with an additional example near the boundary)
  • 115. Nonlinear SVMs • Datasets that are linearly separable work out great • But what if the dataset is just too hard? • We can map it to a higher-dimensional space, e.g. mapping 1-D points x to (x, x²) makes them separable Slide credit: Andrew Moore
  • 116. Nonlinear SVMs • General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x) Slide credit: Andrew Moore
  • 117. Nonlinear SVMs • The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(xi, xj) = φ(xi) · φ(xj) • This gives a nonlinear decision boundary in the original feature space: Σi αi yi φ(xi) · φ(x) + b = Σi αi yi K(xi, x) + b • C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
  • 118. Kernels for bags of features • Histogram intersection kernel: I(h1, h2) = Σ(i=1..N) min(h1(i), h2(i)) • Generalized Gaussian kernel: K(h1, h2) = exp(−(1/A) D(h1, h2)²) • D can be (inverse) L1 distance, Euclidean distance, χ2 distance, etc. • J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, IJCV 2007
  • 119. What about multi-class SVMs? • Unfortunately, there is no "definitive" multi-class SVM formulation • In practice, we have to obtain a multi-class SVM by combining multiple two-class SVMs • One vs. others • Training: learn an SVM for each class vs. the others • Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value • One vs. one • Training: learn an SVM for each pair of classes • Testing: each learned SVM "votes" for a class to assign to the test example Slide credit: L. Lazebnik • from sklearn.svm import SVC, LinearSVC
  • 121. Summary: Classifiers • Nearest-neighbor and k-nearest-neighbor classifiers • L1 distance, χ2 distance, quadratic distance • Support vector machines • Linear classifiers • Margin maximization • The kernel trick • Kernel functions: generalized Gaussian, RBF • Multi-class • Of course, there are many other classifiers out there • Neural networks, boosting, decision trees, … Slide credit: L. Lazebnik
  • 122. Classifiers: Decision Trees (Figure: x/o training points in the x1–x2 plane)
  • 124. Feature Selection • GINI Index • Information Gain • Chi-square test (2 × 2 contingency table with classes O/X and cell counts A, B, C, D)
  • 129. Ensemble Training • Growing a single tree too deep → overfitting • Instead, make a strong classifier out of many weak classifiers
  • 137. Perceptron Learning Theorem • Recap: a perceptron (threshold unit) can learn anything that it can represent (i.e. anything separable with a hyperplane)
  • 138. The Exclusive OR problem • A perceptron cannot represent Exclusive OR, since it is not linearly separable
  • 140. Minsky & Papert (1969) offered a solution to the XOR problem by combining perceptron unit responses using a second layer of units • Piecewise linear classification using an MLP with threshold (perceptron) units (Figure: units 1 and 2 feed unit 3 with weights +1, +1)
  • 141. What does each of the layers do? • 1st layer draws linear boundaries • 2nd layer combines the boundaries • 3rd layer can generate arbitrarily complex boundaries
  • 149. • Q & A