CIS 419/519 Fall’20
Support Vector Machines (SVM)
Dan Roth
danroth@seas.upenn.edu | http://www.cis.upenn.edu/~danroth/ | 461C, 3401 Walnut
Slides were created by Dan Roth (for CIS519/419 at Penn or CS446 at UIUC),
and other authors who have made their ML slides available.
CIS 419/519 Fall’20 2
Administration (11/4/20)
• Remember that all the lectures are available on the website before
the class
– Go over it and be prepared
– A new set of written notes will accompany most lectures, with some more details, examples, and (when relevant) some code.
• HW 3: Due on 11/16/20
– You cannot solve all the problems yet.
– Less time consuming; no programming
• Projects
Are we recording? YES!
Available on the web site
CIS 419/519 Fall’20 3
Projects
• CIS 519 students need to do a team project
– Teams will be of size 2-4
– We will help grouping if needed
• There will be 3 projects.
– Natural Language Processing (Text)
– Computer Vision (Images)
– Speech (Audio)
• In all cases, we will give you datasets and initial ideas
– The problems will be multiclass classification problems
– You will get annotated data only for some of the labels, but will also have to predict other labels
– zero-shot learning; few-shot learning; transfer learning
• A detailed note will come out today.
• Timeline:
– 11/11 Choose a project and team up
– 11/23 Initial proposal describing what your team plans to do
– 12/2 Progress report
– 12/15-20 (TBD) Final paper + short video
• Try to make it interesting!
CIS 419/519 Fall’20 4
COLT approach to explaining Learning
• No Distributional Assumption
• Training Distribution is the same as
the Test Distribution
• Generalization bounds depend on this view and affect model selection.
𝐸𝑟𝑟𝐷(ℎ) < 𝐸𝑟𝑟𝑇𝑅(ℎ) + 𝑃(𝑉𝐶(𝐻), log(1/ϒ), 1/𝑚)
• This is also called the
“Structural Risk Minimization”
principle.
CIS 419/519 Fall’20 5
COLT approach to explaining Learning
• No Distributional Assumption
• Training Distribution is the same as the Test Distribution
• Generalization bounds depend on this view and affect model
selection.
𝐸𝑟𝑟𝐷 ℎ < 𝐸𝑟𝑟𝑇𝑅(ℎ) + 𝑃(𝑉𝐶(𝐻), log(1/ϒ), 1/𝑚)
– As presented, the VC dimension is a combinatorial parameter that is
associated with a class of functions.
• We know that the class of linear functions has a lower VC
dimension than the class of quadratic functions.
– But this notion can be refined to depend on a given data set, and this way
directly affect the hypothesis chosen for a given data set.
CIS 419/519 Fall’20 7
Data Dependent VC dimension
• So far, we discussed the VC dimension in the context of a fixed class of functions.
• We can also parameterize the class of functions in interesting ways.
• Consider the class of linear functions, parameterized by their margin. Note that this is a data-dependent parameterization of the hypothesis class.
CIS 419/519 Fall’20 8
VC and Linear Classification
• Recall the VC based generalization bound:
𝐸𝑟𝑟(ℎ) ≤ 𝑒𝑟𝑟𝑇𝑅(ℎ) + 𝑃𝑜𝑙𝑦{𝑉𝐶(𝐻), 1/𝑚, log(1/ϒ)}
• Here we get the same bound for both classifiers:
𝐸𝑟𝑟𝑇𝑅(ℎ1) = 𝐸𝑟𝑟𝑇𝑅(ℎ2) = 0
ℎ1, ℎ2 ∈ 𝐻𝑙𝑖𝑛(2), 𝑉𝐶(𝐻𝑙𝑖𝑛(2)) = 3
• How, then, can we explain our intuition that ℎ2 should
give better generalization than ℎ1?
CIS 419/519 Fall’20 9
Linear Classification
• Although both classifiers separate the data, the distance with which the separation is achieved is different:
[Figure: the same data set separated by two hyperplanes, h1 and h2, with different margins]
CIS 419/519 Fall’20 10
Concept of Margin
• The margin ϒ𝑖 of a point 𝒙𝑖 ∈ 𝑹𝑛
with respect to a linear
classifier ℎ(𝒙) = 𝑠𝑖𝑔𝑛(𝒘𝑇 ∙ 𝒙 + 𝑏) is defined as the distance
of 𝒙𝑖 from the hyperplane 𝒘𝑇 ∙ 𝒙 + 𝑏 = 0:
ϒ𝑖 = |𝒘𝑇 ∙ 𝒙𝑖 + 𝑏| / ||𝒘||
• The margin of a set of points {𝒙1, … , 𝒙𝑚} with respect to a hyperplane 𝒘 is defined as the margin of the point closest to the hyperplane:
ϒ = min_{𝑖} ϒ𝑖 = min_{𝑖} |𝒘𝑇 ∙ 𝒙𝑖 + 𝑏| / ||𝒘||
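As a concrete illustration of these two definitions (not part of the original slides), here is a minimal NumPy sketch; the weight vector, bias, and data points are hypothetical:

```python
import numpy as np

def point_margin(w, b, x):
    """Margin of a single point x: its distance from the hyperplane w.x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

def set_margin(w, b, X):
    """Margin of a set of points: the margin of the point closest to the hyperplane."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# Hypothetical usage
w, b = np.array([1.0, -2.0]), 0.5
X = np.array([[1.0, 1.0], [3.0, 0.0], [0.0, 2.0]])
print(point_margin(w, b, X[0]), set_margin(w, b, X))
```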
CIS 419/519 Fall’20 11
VC and Linear Classification
• Theorem: If 𝐻ϒ is the space of all linear classifiers in 𝑹𝑛
that separate the training data with margin at least ϒ,
then:
𝑉𝐶(𝐻ϒ) ≤ min(𝑅²/ϒ², 𝑛) + 1,
• where 𝑅 is the radius of the smallest sphere (in 𝑹𝑛) that contains the data.
• Thus, for such classifiers, we have a bound of the form:
𝐸𝑟𝑟(ℎ) ≤ 𝑒𝑟𝑟𝑇𝑅(ℎ) + 𝑂( ((𝑅²/ϒ² + log(4/𝛿)) / 𝑚)^{1/2} )
In particular, you see here that for “general” linear
separators of dimensionality n, the VC is n+1
CIS 419/519 Fall’20
• First observation: When we consider the class 𝐻ϒ of linear
hypotheses that separate a given data set with a margin ϒ, we
see that
– Large Margin ϒ ⇒ Small VC dimension of 𝐻ϒ
• Consequently, our goal could be to find a separating hyperplane
𝒘 that maximizes the margin of the set 𝑆 of examples.
• A second observation that drives an algorithmic approach is that:
– Small ||𝒘|| ⇒ Large Margin
• Together, this leads to an algorithm: from among all those 𝒘’s
that agree with the data, find the one with the minimal size ||𝒘||
– But, if 𝒘 separates the data, so does 𝒘/7….
– We need to better understand the relations between 𝒘 and the margin
12
Towards Max Margin Classifiers
But, how can we do it
algorithmically?
CIS 419/519 Fall’20 13
Maximal Margin
• This discussion motivates the notion of a maximal
margin.
• The maximal margin of a data set 𝑆 is defined as:
ϒ(𝑆) = max_{||𝒘||=1} min_{(𝒙,𝑦)∈𝑆} |𝑦 𝒘𝑇𝒙|
1. For a given 𝒘: find the closest point.
2. Then, across all 𝒘’s (of size 1), find the 𝒘 for which this closest point is the farthest (that gives the maximal margin).
Note: the selection of the point is in the min, and therefore the max does not change if we scale 𝒘, so it’s okay to deal only with normalized 𝒘’s.
The distance between a point 𝒙 and the hyperplane defined by 𝒘 is: |𝒘𝑇𝒙| / ||𝒘||
How does it help us to derive these ℎ’s?
argmax_{||𝒘||=1} min_{(𝒙,𝑦)∈𝑆} |𝑦 𝒘𝑇𝒙|
A hypothesis (𝒘) has many names.
Interpretation 1: among all 𝒘’s, choose the one that maximizes the margin.
CIS 419/519 Fall’20 14
Recap: Margin and VC dimension
• Theorem (Vapnik): If 𝐻ϒ is the space of all linear classifiers in
𝑹𝑛 that separate the training data with margin at least ϒ, then
𝑉𝐶(𝐻ϒ) ≤ 𝑅2/ϒ2
– where 𝑅 is the radius of the smallest sphere (in 𝑹𝑛) that contains the
data.
• This is the first observation that will lead to an algorithmic
approach.
• The second observation is that: Small ||𝒘|| ⇒ Large Margin
• Consequently, the algorithm will be: from among all those 𝒘’s
that agree with the data, find the one with the minimal size
||𝒘||
Believe
We’ll show this
CIS 419/519 Fall’20 15
From Margin to ||𝒘||
• We want to choose the hyperplane that achieves the largest margin. That is, given a data set 𝑆, find:
– 𝒘∗ = argmax_{||𝒘||=1} min_{(𝒙,𝑦)∈𝑆} |𝑦 𝒘𝑇𝒙|
• How to find this 𝒘∗?
• Claim: Define 𝒘0 to be the solution of the optimization problem:
– 𝒘0 = argmin { ||𝒘||² : ∀ (𝒙,𝑦) ∈ 𝑆, 𝑦 𝒘𝑇𝒙 ≥ 1 }.
– Then: 𝒘0/||𝒘0|| = argmax_{||𝒘||=1} min_{(𝒙,𝑦)∈𝑆} 𝑦 𝒘𝑇𝒙
• That is, the normalization of 𝒘0 corresponds to the largest margin separating hyperplane.
Interpretation 2: among all
𝒘’s that separate the data
with margin 1, choose the
one with minimal size.
The next slide will show that the two interpretations
are equivalent
CIS 419/519 Fall’20 16
From Margin to ||𝒘||(2)
• Claim: Define 𝒘0 to be the solution of the optimization problem:
– 𝒘0 = argmin { ||𝒘||² : ∀ (𝒙,𝑦) ∈ 𝑆, 𝑦 𝒘𝑇𝒙 ≥ 1 }   (**)
Then:
– 𝒘0/||𝒘0|| = argmax_{||𝒘||=1} min_{(𝒙,𝑦)∈𝑆} 𝑦 𝒘𝑇𝒙
That is, the normalization of 𝒘0 corresponds to the largest margin separating hyperplane.
• Proof: Define 𝒘’ = 𝒘0/||𝒘0|| and let 𝒘∗ = argmax_{||𝒘||=1} min_{(𝒙,𝑦)∈𝑆} |𝑦 𝒘𝑇𝒙| be the largest-margin separating hyperplane of size 1. We need to show that 𝒘’ = 𝒘∗. Recall that ϒ(𝑆) is the maximal margin for the set 𝑆.
Note first that 𝒘∗/ϒ(𝑆) satisfies the constraints in (**); therefore ||𝒘0|| ≤ ||𝒘∗/ϒ(𝑆)|| = 1/ϒ(𝑆).
• Consequently, for every (𝒙,𝑦) ∈ 𝑆:
𝑦 𝒘’𝑇𝒙 = (1/||𝒘0||) 𝑦 𝒘0𝑇𝒙 ≥ 1/||𝒘0|| ≥ ϒ(𝑆)
(the equality uses the definition of 𝒘’, the first inequality the constraints defining 𝒘0, and the second the previous inequality).
But since ||𝒘’|| = 1, this implies that 𝒘’ corresponds to the largest margin, that is, 𝒘’ = 𝒘∗.
CIS 419/519 Fall’20 17
Margin of a Separating Hyperplane
• A separating hyperplane: 𝒘𝑇𝒙 + 𝑏 = 0
• Assumption: the data is linearly separable
𝒘𝑇𝒙𝑖 + 𝑏 ≥ 1 if 𝑦𝑖 = 1
𝒘𝑇𝒙𝑖 + 𝑏 ≤ −1 if 𝑦𝑖 = −1
→ 𝑦𝑖(𝒘𝑇𝒙𝑖 + 𝑏) ≥ 1
• Let (𝒙0, 𝑦0) be a point on 𝒘𝑇𝒙 + 𝑏 = 1. Then its distance to the separating plane 𝒘𝑇𝒙 + 𝑏 = 0 is |𝒘𝑇𝒙0 + 𝑏|/||𝒘|| = 1/||𝒘||
• The distance between 𝒘𝑇𝒙 + 𝑏 = +1 and 𝒘𝑇𝒙 + 𝑏 = −1 is 2/||𝒘||
What we did:
1. Consider all possible 𝒘 with different angles
2. Scale 𝒘 such that the constraints are tight (the closest points are on the ±1 lines)
3. Pick the one with the largest margin / minimal size
CIS 419/519 Fall’20 19
Administration (11/9/20)
• Remember that all the lectures are available on the website before
the class
– Go over it and be prepared
– A new set of written notes will accompany most lectures, with some more details, examples, and (when relevant) some code.
• HW 3: Due on 11/16/20
– You cannot solve all the problems yet.
– Less time consuming; no programming
• Cheating
– Several problems in HW1 and HW2
Are we recording? YES!
Available on the web site
CIS 419/519 Fall’20 20
Projects
• CIS 519 students need to do a team project: Read the project descriptions
– Teams will be of size 2-4
– We will help grouping if needed
• There will be 3 projects.
– Natural Language Processing (Text)
– Computer Vision (Images)
– Speech (Audio)
• In all cases, we will give you datasets and initial ideas
– The problems will be multiclass classification problems
– You will get annotated data only for some of the labels, but will also have to predict other labels
– zero-shot learning; few-shot learning; transfer learning
• A detailed note will come out today.
• Timeline:
– 11/11 Choose a project and team up
– 11/23 Initial proposal describing what your team plans to do
– 12/2 Progress report
– 12/15-20 (TBD) Final paper + short video
• Try to make it interesting!
CIS 419/519 Fall’20 21
Hard SVM Optimization
• We have shown that the sought-after weight vector 𝒘 is the
solution of the following optimization problem:
SVM Optimization: (***)
– Minimize: ½ ||𝒘||²
– Subject to: ∀ (𝒙,𝑦) ∈ 𝑆: 𝑦 𝒘𝑇𝒙 ≥ 1
• This is a quadratic optimization problem in (𝑛 + 1) variables,
with |𝑆| = 𝑚 inequality constraints.
• It has a unique solution.
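As an illustration (not from the slides), this QP can be handed to an off-the-shelf convex solver. A minimal sketch using cvxpy, assuming the data is linearly separable; the function name and toy data are hypothetical:

```python
import cvxpy as cp
import numpy as np

def hard_svm(X, y):
    """Solve: minimize 0.5*||w||^2  subject to  y_i (w.x_i + b) >= 1 for all i."""
    m, n = X.shape
    w, b = cp.Variable(n), cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Hypothetical separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_svm(X, y)
```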
CIS 419/519 Fall’20 22
Maximal Margin
The margin of a linear separator 𝒘𝑇𝒙 + 𝑏 = 0 is 1/||𝒘||
max 1/||𝒘|| = min ||𝒘|| = min ½ 𝒘𝑇𝒘
min_{𝒘,𝑏} ½ 𝒘𝑇𝒘   s.t. 𝑦𝑖(𝒘𝑇𝒙𝑖 + 𝑏) ≥ 1, ∀ (𝒙𝑖, 𝑦𝑖) ∈ 𝑆
CIS 419/519 Fall’20 23
Support Vector Machines
• The name “Support Vector Machine” stems from the fact that 𝒘∗ is supported by (i.e., is in the linear span of) the examples that are exactly at a distance 1/||𝒘∗|| from the separating hyperplane. These vectors are therefore called support vectors.
• Theorem: Let 𝒘∗ be the minimizer of the SVM optimization problem (***) for 𝑆 = {(𝒙𝑖, 𝑦𝑖)}. Let 𝐼 = {𝑖: 𝑦𝑖𝒘∗𝑇𝒙𝑖 = 1}. Then there exist coefficients 𝛼𝑖 > 0 such that:
𝒘∗ = Σ_{𝑖∈𝐼} 𝛼𝑖 𝑦𝑖 𝒙𝑖
This representation should ring a bell…
CIS 419/519 Fall’20 25
Duality
• This, and other properties of Support Vector Machines, are shown by moving to the dual problem.
• Theorem: Let 𝒘∗ be the minimizer of the SVM optimization problem (***) for 𝑆 = {(𝒙𝑖, 𝑦𝑖)}. Let 𝐼 = {𝑖: 𝑦𝑖(𝒘∗𝑇𝒙𝑖 + 𝑏) = 1}. Then there exist coefficients 𝛼𝑖 > 0 such that:
𝒘∗ = Σ_{𝑖∈𝐼} 𝛼𝑖 𝑦𝑖 𝒙𝑖
CIS 419/519 Fall’20 28
Footnote about the threshold
• Similar to Perceptron, we can augment vectors to handle the bias term:
𝒙 ⇐ (𝒙, 1) ; 𝒘 ⇐ (𝒘, 𝑏) so that 𝒘𝑇𝒙 = 𝒘𝑇𝒙 + 𝑏
• Then consider the following formulation:
min_{𝒘} ½ 𝒘𝑇𝒘   s.t. 𝑦𝑖𝒘𝑇𝒙𝑖 ≥ 1, ∀ (𝒙𝑖, 𝑦𝑖) ∈ 𝑆
• However, this formulation is slightly different from (***), because it is equivalent to:
min_{𝒘,𝑏} ½ 𝒘𝑇𝒘 + ½ 𝑏²   s.t. 𝑦𝑖(𝒘𝑇𝒙𝑖 + 𝑏) ≥ 1, ∀ (𝒙𝑖, 𝑦𝑖) ∈ 𝑆
The bias term is included in the regularization.
This usually doesn’t matter
For simplicity, we ignore the bias term
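A small sketch of this augmentation trick (not part of the original slides); the toy numbers are hypothetical:

```python
import numpy as np

# Fold the bias b into the weight vector: x <- (x, 1), w <- (w, b),
# so that the augmented inner product equals w.x + b.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
w, b = np.array([0.5, -0.25]), 0.1

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

assert np.allclose(X_aug @ w_aug, X @ w + b)
```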
CIS 419/519 Fall’20 29
Key Issues
• Computational Issues
– Training an SVM used to be very time consuming – solving a quadratic program.
– Modern methods are based on Stochastic Gradient Descent and Coordinate Descent and are much faster.
• Is it really optimal?
– Is the objective function we are optimizing the “right” one?
CIS 419/519 Fall’20 30
Real Data
• A 17,000-dimensional context-sensitive spelling task
• Histogram of distance of points from the hyperplane
In practice, even in the
separable case, we may not
want to depend on the points
closest to the hyperplane but
rather on the distribution of the
distance. If only a few are
close, maybe we can dismiss
them.
CIS 419/519 Fall’20 31
Soft SVM
• The hard SVM formulation assumes linearly separable
data.
• A natural relaxation:
– maximize the margin while minimizing the number of examples that violate the margin (separability) constraints.
• However, this leads to a non-convex problem that is hard to solve.
• Instead, we relax in a different way, which results in optimizing a surrogate loss function that is convex.
CIS 419/519 Fall’20 32
Soft SVM
• Notice that the relaxation of the constraint:
𝑦𝑖𝒘𝑇𝒙𝑖 ≥ 1
• Can be done by introducing a slack variable 𝜉𝑖 (per
example) and requiring:
𝑦𝑖𝒘𝑇𝒙𝑖 ≥ 1 − 𝜉𝑖 ; 𝜉𝑖 ≥ 0
• Now, we want to solve:
min_{𝒘,𝜉𝑖} ½ 𝒘𝑇𝒘 + 𝐶 Σ_{𝑖} 𝜉𝑖
s.t. 𝑦𝑖𝒘𝑇𝒙𝑖 ≥ 1 − 𝜉𝑖 ; 𝜉𝑖 ≥ 0 ∀𝑖
• A large value of C means that we
want 𝜉𝑖 to be small; that is,
misclassifications are bad – we
focus on a small training error (at
the expense of margin).
• A small C results in more training
error, but hopefully better true
error.
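As an illustration of the slack-variable formulation (not from the slides), a minimal cvxpy sketch with the bias folded into 𝒘; the function name, parameters, and toy data are hypothetical:

```python
import cvxpy as cp
import numpy as np

def soft_svm(X, y, C=1.0):
    """Solve: minimize 0.5*||w||^2 + C*sum(xi)  s.t.  y_i w.x_i >= 1 - xi_i, xi_i >= 0."""
    m, n = X.shape
    w, xi = cp.Variable(n), cp.Variable(m)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value

# Hypothetical (possibly non-separable) toy data
X = np.array([[2.0, 2.0], [-1.0, -1.5], [0.5, -0.5], [-2.0, 1.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
w = soft_svm(X, y, C=1.0)
```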
CIS 419/519 Fall’20 33
Soft SVM (2)
• Now, we want to solve:
min_{𝒘,𝜉𝑖} ½ 𝒘𝑇𝒘 + 𝐶 Σ_{𝑖} 𝜉𝑖   s.t. 𝑦𝑖𝒘𝑇𝒙𝑖 ≥ 1 − 𝜉𝑖 ; 𝜉𝑖 ≥ 0 ∀𝑖
• Equivalently:
min_{𝒘,𝜉𝑖} ½ 𝒘𝑇𝒘 + 𝐶 Σ_{𝑖} 𝜉𝑖   s.t. 𝜉𝑖 ≥ 1 − 𝑦𝑖𝒘𝑇𝒙𝑖 ; 𝜉𝑖 ≥ 0 ∀𝑖
• In the optimum, 𝜉𝑖 = max(0, 1 − 𝑦𝑖𝒘𝑇𝒙𝑖)
• Which can therefore be written as:
min_{𝒘} ½ 𝒘𝑇𝒘 + 𝐶 Σ_{𝑖} max(0, 1 − 𝑦𝑖𝒘𝑇𝒙𝑖)
• What is the interpretation of this?
CIS 419/519 Fall’20 34
SVM Objective Function
• The problem we solved is:
Min ½ ||𝒘||² + 𝐶 Σ_{𝑖} 𝜉𝑖
• Where 𝜉𝑖 ≥ 0 is called a slack variable, and is defined by:
– 𝜉𝑖 = max(0, 1 − 𝑦𝑖𝒘𝑇𝒙𝑖)
– Equivalently, we can say that: 𝑦𝑖𝒘𝑇𝒙𝑖 ≥ 1 − 𝜉𝑖 ; 𝜉𝑖 ≥ 0
• And this can be written as:
Min ½ ||𝒘||² + 𝐶 Σ_{𝑖} max(0, 1 − 𝑦𝑖𝒘𝑇𝒙𝑖)
– The first term is the regularization term and the second is the empirical loss; each can be replaced by other regularization functions or other loss functions.
• General form of a learning algorithm:
– Minimize empirical loss, and regularize (to avoid overfitting)
– A theoretically motivated improvement over the original algorithm we’ve seen at the beginning of the semester.
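A tiny NumPy sketch of this objective in its “regularization + empirical loss” form (not from the slides; names are hypothetical):

```python
import numpy as np

def svm_objective(w, X, y, C=1.0):
    """0.5*||w||^2 (regularization) + C * sum_i max(0, 1 - y_i w.x_i) (hinge loss)."""
    margins = y * (X @ w)                    # y_i * w^T x_i for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # the slack xi_i at the optimum
    return 0.5 * w @ w + C * hinge.sum()
```

Swapping the hinge term for another loss, or the squared norm for another regularizer, gives the general form mentioned above.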
CIS 419/519 Fall’20 35
Balance between regularization and empirical loss
CIS 419/519 Fall’20 36
Balance between regularization and empirical loss (DEMO)
CIS 419/519 Fall’20 37
Underfitting and Overfitting
[Figure: expected error vs. model complexity, decomposed into bias and variance. Underfitting (simple models, smaller C, high empirical error): high bias and low variance. Overfitting (complex models, larger C, low empirical error): high variance and low bias.]
CIS 419/519 Fall’20 38
What Do We Optimize?
• Logistic Regression:
min_{𝒘} ½ 𝒘𝑇𝒘 + 𝐶 Σ_{𝑖=1..𝑙} log(1 + 𝑒^(−𝑦𝑖𝒘𝑇𝒙𝑖))
• L1-loss SVM:
min_{𝒘} ½ 𝒘𝑇𝒘 + 𝐶 Σ_{𝑖=1..𝑙} max(0, 1 − 𝑦𝑖𝒘𝑇𝒙𝑖)
• L2-loss SVM:
min_{𝒘} ½ 𝒘𝑇𝒘 + 𝐶 Σ_{𝑖=1..𝑙} max(0, 1 − 𝑦𝑖𝒘𝑇𝒙𝑖)²
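For comparison (not from the slides), the three per-example losses as a function of the margin 𝑦𝑖𝒘𝑇𝒙𝑖, in a small NumPy sketch:

```python
import numpy as np

def per_example_losses(margins):
    """margins: array of y_i * w^T x_i values."""
    logistic = np.log1p(np.exp(-margins))           # logistic regression loss
    l1_hinge = np.maximum(0.0, 1.0 - margins)       # L1-loss SVM (hinge)
    l2_hinge = np.maximum(0.0, 1.0 - margins) ** 2  # L2-loss SVM (squared hinge)
    return logistic, l1_hinge, l2_hinge
```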
CIS 419/519 Fall’20 39
What Do We Optimize(2)?
• We get an unconstrained problem.
We can use the (stochastic)
gradient descent algorithm!
• Many other methods
– Iterative scaling; non-linear conjugate gradient; quasi-Newton methods; truncated Newton methods; trust-region Newton methods.
– All are iterative methods that generate a sequence 𝒘𝑘 that converges to the optimal solution of the optimization problem above.
• Currently: Limited-memory BFGS is very popular
CIS 419/519 Fall’20 40
Optimization: How to Solve
1. Earlier methods used Quadratic Programming. Very slow.
2. The soft SVM problem is an unconstrained optimization problem. It is possible to use the gradient descent algorithm.
• Many options within this category:
– Iterative scaling; non-linear conjugate gradient; quasi-Newton methods; truncated Newton methods; trust-region Newton methods.
– All are iterative methods that generate a sequence 𝒘𝑘 that converges to the optimal solution of the optimization problem above.
– Currently: Limited-memory BFGS is very popular
3. 3rd-generation algorithms are based on Stochastic Gradient Descent
– The runtime does not depend on 𝑛 = #(examples); an advantage when 𝑛 is very large.
– The stopping criterion is a problem: the method tends to be too aggressive at the beginning and reaches a moderate accuracy quite fast, but its convergence becomes slow if we are interested in more accurate solutions.
4. Dual Coordinate Descent (& Stochastic Version)
CIS 419/519 Fall’20 41
SGD for SVM
• Goal: min_{𝒘} 𝑓(𝒘) ≡ ½ 𝒘𝑇𝒘 + (𝐶/𝑚) Σ_{𝑖} max(0, 1 − 𝑦𝑖𝒘𝑇𝒙𝑖)   (𝑚: data size)
• Compute a sub-gradient of 𝑓(𝒘):
𝛻𝑓(𝒘) = 𝒘 − 𝐶𝑦𝑖𝒙𝑖 if 1 − 𝑦𝑖𝒘𝑇𝒙𝑖 ≥ 0 ; otherwise 𝛻𝑓(𝒘) = 𝒘
1. Initialize 𝒘 = 𝟎 ∈ 𝑹𝑛
2. For every example (𝒙𝑖, 𝑦𝑖) ∈ 𝐷:
If 𝑦𝑖𝒘𝑇𝒙𝑖 ≤ 1, update the weight vector to
𝒘 ← 𝒘 − 𝛾(𝒘 − 𝐶𝑦𝑖𝒙𝑖) = (1 − 𝛾)𝒘 + 𝛾𝐶𝑦𝑖𝒙𝑖   (𝛾: learning rate)
Otherwise 𝒘 ← (1 − 𝛾)𝒘
3. Continue until convergence is achieved. (This algorithm should ring a bell…)
Convergence can be proved for a slightly more complicated version of SGD (e.g., Pegasos).
The 1/𝑚 factor is there for mathematical correctness; it doesn’t matter from the modeling point of view.
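A minimal NumPy sketch of this update rule (not from the slides); the fixed learning rate, epoch count, and names are hypothetical simplifications:

```python
import numpy as np

def sgd_svm(X, y, C=1.0, lr=0.01, epochs=100):
    """SGD on f(w) = 0.5*w.w + (C/m)*sum_i max(0, 1 - y_i w.x_i), following the slide's update."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):
            if y[i] * (w @ X[i]) <= 1:                   # margin violated: hinge term active
                w = (1 - lr) * w + lr * C * y[i] * X[i]  # w <- w - lr*(w - C*y_i*x_i)
            else:
                w = (1 - lr) * w                         # only the regularizer contributes
    return w
```

A fixed number of epochs stands in for the convergence test; Pegasos-style analyses use a decaying learning rate.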
CIS 419/519 Fall’20 42
Nonlinear SVM
• We can map data to a high-dimensional space: 𝒙 → 𝜙(𝒙) (DEMO)
• Then use the kernel trick: 𝐾(𝒙𝑖, 𝒙𝑗) = 𝜙(𝒙𝑖)𝑇𝜙(𝒙𝑗) (DEMO2)
Primal:
min_{𝒘} ½ 𝒘𝑇𝒘 + 𝐶 Σ_{𝑖} 𝜉𝑖   s.t. 𝑦𝑖𝒘𝑇𝜙(𝒙𝑖) ≥ 1 − 𝜉𝑖 ; 𝜉𝑖 ≥ 0 ∀𝑖
Dual:
min_{𝛼} ½ 𝛼𝑇𝑄𝛼 − 𝑒𝑇𝛼   s.t. 0 ≤ 𝛼𝑖 ≤ 𝐶 ∀𝑖, where 𝑄𝑖𝑗 = 𝑦𝑖𝑦𝑗𝐾(𝒙𝑖, 𝒙𝑗)
Theorem: Let 𝒘∗ be the minimizer of the primal problem and 𝛼∗ the minimizer of the dual problem. Then 𝒘∗ = Σ_{𝑖} 𝛼𝑖∗ 𝑦𝑖𝒙𝑖
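As an illustration of the kernel trick in practice (not from the slides), a short sketch using scikit-learn’s SVC with an RBF kernel; the XOR-style toy data is hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not linearly separable in the original space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

clf = SVC(kernel="rbf", C=10.0, gamma=1.0)  # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
clf.fit(X, y)
print(clf.predict(X))   # recovers the labels
print(clf.support_)     # indices of the support vectors
```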
CIS 419/519 Fall’20 43
Nonlinear SVM
• Tradeoff between training time and accuracy
• Complex model vs. simple model
From: http://www.csie.ntu.edu.tw/~cjlin/papers/lowpoly_journal.pdf