Support Vector Machine without tears

Digg Data
Ankit Sharma
www.diggdata.in

Content
SVM and its application
Basic SVM
• Hyperplane
• Understanding of basics
• Optimization
Soft margin SVM
Non-linear decision boundary
SVMs in “loss + penalty” form
Kernel method
• Gaussian kernel
SVM usage beyond classification
Thursday, August 7, 2014 WITHOUT TEARS SERIES | www.diggdata.in 2

SVM: Support Vector Machine
• In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.
• Properties of SVM:
– Duality
– Kernels
– Margin
– Convexity
– Sparseness

Application of SVM
• Time series analysis
• Classification
• Anomaly detection
• Regression
• Machine vision
• Text categorization

Basic concept of SVM
Find a linear decision surface (“hyperplane”) that separates the classes and has the largest distance (i.e., the largest “gap” or “margin”) to the border-line data points (i.e., the “support vectors”).

Hyperplane as a Decision boundary
• A hyperplane is a linear decision surface that splits the space into two parts.
• A hyperplane therefore acts as a binary classifier: a point is labeled according to the side of the hyperplane it falls on.

Equation of a hyperplane
A hyperplane is defined by a point on it (P0) and a vector perpendicular to the plane (𝑤) at that point: every point x on the hyperplane satisfies $w \cdot (x - P_0) = 0$, i.e., $w \cdot x + b = 0$ with $b = -w \cdot P_0$.
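As a minimal sketch of this definition (the helper names and the numbers are illustrative assumptions, not part of the slides), the offset b can be derived from the point P0 and the normal w, and the sign of w · x + b tells which side of the hyperplane a point is on:

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def hyperplane_from_point_normal(p0, w):
    """Return (w, b) such that the hyperplane is {x : w.x + b = 0}."""
    b = -dot(w, p0)
    return w, b

def side(x, w, b):
    """+1 on the positive side, -1 on the negative side, 0 on the plane."""
    g = dot(w, x) + b
    return (g > 0) - (g < 0)

# Illustrative values: the line x1 + x2 = 2 in the plane.
w, b = hyperplane_from_point_normal(p0=(1.0, 1.0), w=(1.0, 1.0))
print(b)                       # -2.0
print(side((3.0, 3.0), w, b))  # 1
print(side((0.0, 0.0), w, b))  # -1
print(side((2.0, 0.0), w, b))  # 0
```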

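Understanding the basics
• g(x) is a linear function: $g(x) = w \cdot x + b$
– A hyperplane in the feature space: $w \cdot x + b = 0$, with $w \cdot x + b > 0$ on one side and $w \cdot x + b < 0$ on the other
– (Unit-length) normal vector of the hyperplane: $n = \dfrac{w}{\|w\|}$
[Figure: the hyperplane and its normal vector n in the (x1, x2) plane]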

How can we classify these points using a linear discriminant function in order to minimize the error rate?
– Infinite number of answers!
– Which one is the best?
[Figure: two classes of points, labeled +1 and -1, in the (x1, x2) plane]

Margin
• The linear discriminant function (classifier) with the maximum margin is the best
– The margin is defined as the width by which the boundary could be increased before hitting a data point (the “safe zone”)
– Why is it the best? It is robust to outliers and thus has strong generalization ability
[Figure: the margin (“safe zone”) around the decision boundary in the (x1, x2) plane]

• Given a set of data points: $\{(x_i, y_i)\},\ i = 1, 2, \dots, n$, where
For $y_i = +1$: $w \cdot x_i + b > 0$
For $y_i = -1$: $w \cdot x_i + b < 0$
– With a scale transformation on both w and b, the above is equivalent to
For $y_i = +1$: $w \cdot x_i + b \ge +1$
For $y_i = -1$: $w \cdot x_i + b \le -1$
[Figure: the two classes in the (x1, x2) plane]

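• We know that, for the support vectors x+ and x−,
$w \cdot x^{+} + b = +1$
$w \cdot x^{-} + b = -1$
– The margin width is:
$M = (x^{+} - x^{-}) \cdot n = (x^{+} - x^{-}) \cdot \dfrac{w}{\|w\|} = \dfrac{2}{\|w\|}$
[Figure: the margin between the support vectors x+ and x−, measured along the normal n, in the (x1, x2) plane]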
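The margin formula can be checked numerically. The sketch below assumes an illustrative canonical hyperplane (w, b) and one hypothetical support vector on each side; it compares 2 / ||w|| with the projection of (x+ − x−) onto the unit normal n:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_width(w):
    """Width of the margin, 2 / ||w||, for a canonically scaled hyperplane."""
    return 2.0 / math.sqrt(dot(w, w))

# Illustrative canonical hyperplane with one support vector on each side.
w, b = (0.5, 0.5), 0.0
x_plus, x_minus = (1.0, 1.0), (-1.0, -1.0)   # w.x + b = +1 and -1

norm_w = math.sqrt(dot(w, w))
n = tuple(wi / norm_w for wi in w)            # unit normal vector
diff = tuple(p - m for p, m in zip(x_plus, x_minus))

print(margin_width(w))   # 2 / ||w||
print(dot(diff, n))      # projection of (x+ - x-) onto n, same value up to rounding
```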

• Formulation:
maximize $\dfrac{2}{\|w\|}$
such that
For $y_i = +1$: $w \cdot x_i + b \ge +1$
For $y_i = -1$: $w \cdot x_i + b \le -1$

• Formulation:
minimize $\dfrac{1}{2}\|w\|^2$
such that
For $y_i = +1$: $w \cdot x_i + b \ge +1$
For $y_i = -1$: $w \cdot x_i + b \le -1$

• Formulation:
minimize $\dfrac{1}{2}\|w\|^2$
such that
$y_i (w \cdot x_i + b) \ge 1$ for all i
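To make the formulation concrete, here is a small sketch that checks the constraint $y_i (w \cdot x_i + b) \ge 1$ and evaluates the objective for a few candidate hyperplanes. The toy data and the candidates are hypothetical; the code only evaluates candidates and does not solve the optimization problem.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_feasible(w, b, X, y):
    """True if every point satisfies y_i * (w.x_i + b) >= 1."""
    return all(yi * (dot(w, xi) + b) >= 1.0 for xi, yi in zip(X, y))

def objective(w):
    """Primal objective (1/2) * ||w||^2."""
    return 0.5 * dot(w, w)

# Hypothetical, linearly separable toy data.
X = [(2.0, 2.0), (1.0, 1.0), (-1.0, -1.0), (-2.0, -3.0)]
y = [1, 1, -1, -1]

candidates = {
    "wide margin": ((0.5, 0.5), 0.0),
    "narrow margin": ((1.0, 1.0), 0.0),
    "too wide": ((0.25, 0.25), 0.0),   # violates the constraints
}
for name, (w, b) in candidates.items():
    print(name, is_feasible(w, b, X, y), objective(w))
```

Among the feasible candidates, the one with the smaller objective has the larger margin 2 / ||w||, which is why minimizing the norm and maximizing the margin are the same problem.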

Basics of optimization: Convex functions
• A function is called convex if, for any two points in the interval, its graph lies on or below the straight line segment connecting them.
• Property: Any local minimum is a global minimum!

Basics of optimization: Quadratic programming
• Quadratic programming (QP) is a special optimization problem: the function to optimize (“objective”) is quadratic, subject to linear constraints.
• Convex QP problems have convex objective functions.
• These problems can be solved easily and efficiently by iterative descent (“greedy”) algorithms, because every local minimum is a global minimum.
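A rough illustration of why convexity helps: on a convex quadratic, plain gradient descent reaches the same minimizer from any starting point. The function, step size, and starting points below are arbitrary assumptions, and the example is unconstrained, whereas the SVM problem also carries linear constraints that dedicated QP solvers handle.

```python
def grad(x):
    """Gradient of f(x) = (x0 - 1)^2 + 2 * (x1 + 3)^2, a convex quadratic."""
    return (2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0))

def descend(x, step=0.1, iters=500):
    """Plain gradient descent: repeatedly take a small step downhill."""
    for _ in range(iters):
        g = grad(x)
        x = (x[0] - step * g[0], x[1] - step * g[1])
    return x

# Two very different (arbitrary) starting points.
a = descend((10.0, 10.0))
c = descend((-50.0, 7.0))
print(a, c)  # both end up close to the unique minimizer (1, -3)
```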

SVM optimization problem: Primal formulation
• The problem of minimizing $\frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ is called the “primal formulation of linear SVMs”.
• It is a convex quadratic programming (QP) optimization problem with n weight variables (wi, i = 1,…,n) plus the offset b, where n is the number of features in the dataset.

SVM optimization problem: Dual formulation
• The previous problem can be recast in the so-called “dual form”, giving rise to the “dual formulation of linear SVMs”.
• Apply the method of Lagrange multipliers.
• We need to minimize this Lagrangian with respect to w and b and simultaneously require that the derivatives with respect to all the αi vanish, all subject to the constraints that αi ≥ 0.
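For reference, the standard textbook Lagrangian of the hard-margin problem and its stationarity conditions are written out below. The symbol $L_P$ and the use of N for the number of samples are notational assumptions, since the slide's own equation is not preserved in this transcript.

```latex
L_P(w, b, \alpha) = \frac{1}{2}\|w\|^2
  - \sum_{i=1}^{N} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right],
  \qquad \alpha_i \ge 0

\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i,
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0
```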

SVM optimization problem: Dual formulation
Contd.
It is also a convex quadratic programming problem, but with N variables (αi, i = 1,…,N), where N is the number of samples.
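In its usual textbook form (given here as a reference sketch, since the slide's equation is not preserved in this transcript), the dual problem reads:

```latex
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\quad \text{subject to} \quad
\alpha_i \ge 0, \;\; \sum_{i=1}^{N} \alpha_i y_i = 0
```

The data enter only through the dot products $x_i \cdot x_j$.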

SVM optimization problem: Benefits of using dual formulation
1) No need to access the original data directly; only dot products between data points are needed.
2) The number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems).
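Both benefits show up in the decision function, which in the dual view is a sum over support vectors of dot products. A minimal sketch, assuming hypothetical support vectors, labels, multipliers, and offset (none of these values come from the slides):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical support vectors with their labels and multipliers (alpha_i > 0).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

def decision(x):
    """f(x) = sum_i alpha_i * y_i * (x_i . x) + b, using dot products only."""
    return sum(a * yi * dot(sv, x)
               for a, yi, sv in zip(alphas, labels, support_vectors)) + b

print(decision((2.0, 2.0)))    # 2.0
print(decision((-2.0, -3.0)))  # -2.5
print(decision((1.0, 1.0)))    # 1.0
```

Only the two support vectors are stored; every other training point has a zero multiplier and drops out of the sum.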

Non linearly separable data: “Soft-margin” linear SVM
Assign a “slack variable” ξi ≥ 0 to each instance, which can be thought of as the distance from the separating hyperplane if the instance is misclassified, and 0 otherwise.
Primal formulation:
Dual formulation:
• When C is very large, the soft-margin SVM is equivalent to the hard-margin SVM;
• When C is very small, we admit misclassifications in the training data in exchange for a w-vector with small norm;
• C has to be selected for the distribution at hand, as will be discussed later in this tutorial.
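For reference, the soft-margin primal and dual problems in their usual textbook form are sketched below. The notation (N samples, slack ξi, multipliers αi) follows the surrounding slides, but the exact layout of the original equations is an assumption.

```latex
\text{Primal:}\quad
\min_{w,\, b,\, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0

\text{Dual:}\quad
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \;\; \sum_{i=1}^{N} \alpha_i y_i = 0
```

Compared with the hard-margin dual, the only change is the upper bound C on each αi.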

SVMs in “loss + penalty” form
• Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem:
Minimize (Loss + λ Penalty)
– Loss measures the error of fitting the data
– Penalty penalizes the complexity of the learned function
– λ is the regularization parameter that balances Loss and Penalty
• Overfitting → Poor generalization
Can also be stated as
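One common instance of this form for a linear SVM takes the hinge loss max(0, 1 − y f(x)) as Loss and the squared norm of w as Penalty. The sketch below minimizes that objective with plain subgradient descent; the step size, λ, epoch count, and toy data are all assumptions chosen for illustration, and this is not a production solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, step=0.1, epochs=200):
    """Subgradient descent on mean hinge loss + (lam / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]       # gradient of the penalty term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:   # hinge loss is active here
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical toy data.
X = [(2.0, 2.0), (1.0, 1.0), (-1.0, -1.0), (-2.0, -3.0)]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, -1, -1]
```

A larger λ pushes toward a smaller ||w|| (wider margin, more training errors tolerated), which mirrors the role of a small C in the soft-margin formulation.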

Nonlinear decision boundary
Non-linear decision boundary → Kernel method

Kernel method
• Kernel methods involve
– Nonlinear transformation of data to a higher dimensional feature space induced by a Mercer kernel
– Detection of optimal linear solutions in the kernel feature space
• Transformation to a higher dimensional space is expected to be helpful in conversion of nonlinear relations
into linear relations (Cover’s theorem)
– Nonlinearly separable patterns to linearly separable patterns
– Nonlinear regression to linear regression
– Nonlinear separation of clusters to linear separation of clusters
• Pattern analysis methods are implemented in such a way that the kernel feature space representation is
not explicitly required. They involve computation of pair-wise inner-products only.
• The pair-wise inner-products are computed efficiently directly from the original representation of data
using a kernel function (Kernel trick)
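The kernel trick can be seen on a small case. The sketch below uses the degree-2 homogeneous polynomial kernel in two dimensions (a choice made here for illustration, not taken from the slides) and checks that evaluating the kernel on the original inputs matches the inner product of the explicit higher-dimensional features:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever building phi."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
print(poly2_kernel(x, z))     # 121.0
print(dot(phi(x), phi(z)))    # same value up to rounding
```

The kernel needs one dot product in the original 2-D space, while the explicit route first maps both points to 3-D.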

Kernel trick
Not every function $\mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ can be a valid kernel; it has to satisfy the so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable.

Popular kernels

Gaussian kernel
Consider the Gaussian kernel:
Geometrically, this is a “bump” or “cavity”
centered at the training data point 𝑥j :
The resulting mapping function is a
combination of bumps and cavities.
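A small sketch of the “bumps and cavities” picture, assuming the common parameterization K(x, xj) = exp(−||x − xj||² / (2σ²)); the slide's own formula is not preserved in this transcript, so the width parameter σ, the centers, and the weights below are assumptions:

```python
import math

def gaussian_kernel(x, xj, sigma=1.0):
    """K(x, xj) = exp(-||x - xj||^2 / (2 * sigma^2)); sigma is an assumed width."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical training points with signed weights alpha_j * y_j:
# a positive weight gives a "bump", a negative weight a "cavity".
centers = [(0.0, 0.0), (3.0, 0.0)]
weights = [1.0, -1.0]

def f(x):
    """Weighted sum of Gaussian bumps and cavities (offset b omitted)."""
    return sum(wt * gaussian_kernel(x, c) for wt, c in zip(weights, centers))

print(gaussian_kernel((0.0, 0.0), (0.0, 0.0)))  # 1.0 at the center
print(f((0.0, 0.0)) > 0, f((3.0, 0.0)) < 0)     # True True
```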

SVM usage beyond classification
• Regression analysis (ε-Support vector regression)
• Anomaly detection (One-class SVM)
• Clustering analysis (Support Vector Domain Description)
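As one example of how the idea carries over, ε-support vector regression is usually described with an ε-insensitive loss: errors smaller than ε cost nothing, and larger errors grow linearly. A minimal sketch, with the function name and the value of ε chosen here as assumptions:

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside a tube of half-width eps, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

print(eps_insensitive_loss(1.0, 1.05))  # 0.0 (inside the tube)
print(eps_insensitive_loss(1.0, 1.5))   # about 0.4 (0.5 error minus eps)
```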

Thank you
