Support Vector Machines

1. Advanced Computing Seminar: Data Mining and Its Industrial Applications — Chapter 8 — Support Vector Machines
   - Zhongzhi Shi, Markus Stumptner, Yalei Hao, Gerald Quirchmayr
   - Knowledge and Software Engineering Lab
   - Advanced Computing Research Centre
   - School of Computer and Information Science
   - University of South Australia
2. Outline
   - Introduction
   - Support Vector Machine
   - Non-linear Classification
   - SVM and PAC
   - Applications
   - Summary
3. History
   - SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
   - SVMs were introduced by Boser, Guyon, and Vapnik at COLT-92.
   - Initially popularized in the NIPS community; now an important and active field of machine learning research.
   - Special issues of the Machine Learning journal and of the Journal of Machine Learning Research have been devoted to SVMs.
4. What is SVM?
   - SVMs are learning systems that:
     - use a hypothesis space of linear functions
     - in a high-dimensional feature space (kernel functions)
     - are trained with a learning algorithm from optimization theory (Lagrangian methods)
     - and implement a learning bias derived from statistical learning theory (generalisation).
   - SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
5-9. Linear Classifiers
   - f(x, w, b) = sign(w · x - b); "+1" and "-1" denote the two classes.
   - How would you classify this data? (Slides 5-9 repeat this question; the accompanying figures were not extracted.)
   - Copyright © 2001, 2003, Andrew W. Moore
10. Maximum Margin
   - f(x, w, b) = sign(w · x - b)
   - The maximum margin linear classifier is the linear classifier with the maximum margin.
   - This is the simplest kind of SVM, called a linear SVM (LSVM).
   - Copyright © 2001, 2003, Andrew W. Moore
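For concreteness (this example is not part of the original deck), a maximum-margin linear classifier can be fit with scikit-learn; the toy data, the large value of C used to approximate a hard margin, and all variable names are assumptions made for this sketch.

```python
# Minimal sketch: fit an (approximately) hard-margin linear SVM on toy 2-D data
# and read off the separating hyperplane. Note scikit-learn's sign convention is
# f(x) = sign(w . x + b), whereas the slides write sign(w . x - b).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],         # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-2.5, -2.0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)                        # hyperplane: w . x + b = 0
print("support vectors:", clf.support_vectors_)
print("margin width =", 2.0 / np.linalg.norm(w))
```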
11. Model of Linear Classification
   - Binary classification is frequently performed by using a real-valued hypothesis function.
   - The input x is assigned to the positive class if the hypothesis value is non-negative, and otherwise to the negative class.
12. The Concept of a Hyperplane
   - For a binary, linearly separable training set, we can find at least one hyperplane (w, b) which divides the space into two half-spaces.
   - The definition of a hyperplane was given as a formula; a reconstruction of it, and of the decision rule from the previous slide, follows.
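The formulas on slides 11 and 12 were embedded as images. A standard reconstruction in the notation of the deck's references (Cristianini and Shawe-Taylor) is given below; treat it as an assumption rather than a copy of the original slides.

```latex
% Real-valued hypothesis and decision rule (slide 11)
f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b, \qquad
\hat{y} = \operatorname{sign}\big(f(\mathbf{x})\big) =
\begin{cases} +1 & \text{if } \langle \mathbf{w}, \mathbf{x} \rangle + b \ge 0,\\
              -1 & \text{otherwise.} \end{cases}

% Hyperplane defined by the pair (w, b) (slide 12)
H_{(\mathbf{w}, b)} = \{\, \mathbf{x} \in \mathbb{R}^{n} : \langle \mathbf{w}, \mathbf{x} \rangle + b = 0 \,\}
```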
13. Tuning the Hyperplane (w, b)
   - The Perceptron algorithm, proposed by Frank Rosenblatt in 1956.
   - Preliminary definition: the functional margin of an example (x_i, y_i) is y_i(⟨w, x_i⟩ + b); a positive functional margin implies correct classification of (x_i, y_i). A small code sketch follows.
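A minimal sketch of the perceptron update rule described above (the data, the learning rate eta, and all names are illustrative assumptions, not taken from the slides):

```python
# Perceptron: sweep over the data and update (w, b) whenever an example
# has a non-positive functional margin, i.e. is misclassified.
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # functional margin <= 0: a mistake
                w += eta * yi * xi              # rotate w towards the example
                b += eta * yi                   # shift the threshold
                mistakes += 1
        if mistakes == 0:                       # converged on a separable set
            break
    return w, b

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(perceptron(X, y))
```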
14. The Perceptron Algorithm
   - The number of mistakes is at most a bound given on the slide as a formula; a reconstruction follows.
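The missing formula is presumably Novikoff's mistake bound; in the form used by Cristianini and Shawe-Taylor (one of the deck's references), with ||x_i|| ≤ R and a separating hyperplane of geometric margin γ, it reads (a reconstruction, not a copy of the slide):

```latex
\#\,\text{mistakes} \;\le\; \left( \frac{2R}{\gamma} \right)^{2}
```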
15. The Geometric Margin
   - The geometric margin is the Euclidean distance of an example (x_i, y_i) from the decision boundary.
16. The Geometric Margin
   - The margin of a training set S.
   - Maximal margin hyperplane: a hyperplane realising the maximum geometric margin.
   - The optimal linear classifier is the one that forms the maximal margin hyperplane. (Reconstructions of the margin formulas follow.)
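The distance and margin formulas on slides 15 and 16 were images; a standard reconstruction, consistent with the definitions in the surrounding slides, is assumed below.

```latex
% Geometric margin of an example (slide 15): its signed Euclidean distance from the boundary
\gamma_i = y_i \left( \Big\langle \frac{\mathbf{w}}{\lVert \mathbf{w} \rVert}, \mathbf{x}_i \Big\rangle
                      + \frac{b}{\lVert \mathbf{w} \rVert} \right)

% Margin of a training set S (slide 16): the smallest geometric margin over S
\gamma = \min_{1 \le i \le \ell} \gamma_i
```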
17. How to Find the Optimal Solution?
   - The drawback of the perceptron algorithm: it may give a different solution depending on the order in which the examples are processed.
   - The superiority of SVM: this kind of learning machine tunes the solution using optimization theory.
18. The Maximal Margin Classifier
   - The simplest model of SVM: it finds the maximal margin hyperplane in a chosen kernel-induced feature space.
   - A convex optimization problem: minimizing a quadratic function under linear inequality constraints.
19. Support Vector Classifiers
   - Support vector machines (Cortes and Vapnik, 1995): well suited for high-dimensional data; binary classification.
   - Training set: D = {(x_i, y_i), i = 1, ..., n}, with x_i ∈ R^m and y_i ∈ {-1, 1}.
   - Linear discriminant classifier with separating hyperplane {x : g(x) = w^T x + w_0 = 0}; model parameters w ∈ R^m and w_0 ∈ R.
20. Formalizing the Geometric Margin
   - Assume the functional margin is normalised so that the closest examples satisfy y_i(⟨w, x_i⟩ + b) = 1.
   - The geometric margin is then 1/||w||: in order to find the maximum margin, we must find the minimum ||w||.
21. Minimizing the Norm
   - Because the geometric margin is 1/||w|| under the normalisation above, we can re-formalize the optimization problem; a reconstruction of the resulting program follows.
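The optimization problem on this slide was an image; its standard primal form under the canonical normalisation (a reconstruction, not a copy of the slide) is:

```latex
\min_{\mathbf{w},\, b} \;\; \frac{1}{2}\,\langle \mathbf{w}, \mathbf{w} \rangle
\qquad \text{subject to} \qquad
y_i\big( \langle \mathbf{w}, \mathbf{x}_i \rangle + b \big) \ge 1, \quad i = 1, \dots, \ell
```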
22. Minimizing the Norm
   - Use the Lagrangian function.
   - Obtain the stationarity conditions.
   - Resubstitute them into the primal to obtain the dual problem. (A reconstruction of these steps follows.)
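The Lagrangian and the dual on this slide were images; the standard derivation, reconstructed under the usual assumptions (multipliers α_i ≥ 0), is:

```latex
% Lagrangian of the primal problem
L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle
  - \sum_{i=1}^{\ell} \alpha_i \Big[\, y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) - 1 \,\Big]

% Setting the derivatives with respect to w and b to zero gives
\mathbf{w} = \sum_{i=1}^{\ell} \alpha_i y_i \mathbf{x}_i, \qquad
\sum_{i=1}^{\ell} \alpha_i y_i = 0

% Resubstituting into L yields the dual problem
\max_{\boldsymbol{\alpha}} \; W(\boldsymbol{\alpha}) = \sum_{i=1}^{\ell} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle
\quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{\ell} \alpha_i y_i = 0
```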
23. Minimizing the Norm
   - Finding the minimum of the primal is equivalent to finding the maximum of the dual.
   - Strategies for optimizing the resulting differentiable function: decomposition; Sequential Minimal Optimization (SMO).
24. The Support Vectors
   - The optimality conditions of the problem imply that α_i can be non-zero only for inputs x_i whose functional margin is one.
   - These are exactly the points that lie closest to the hyperplane; the corresponding examples are the support vectors.
25. The Optimal Hypothesis (w, b)
   - The two parameters can be obtained from the solution of the dual problem.
   - The hypothesis is expressed in terms of the support vectors. (A reconstruction of the missing formulas, including the condition from the previous slide, follows.)
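The formulas on slides 24 and 25 were images; a standard reconstruction (assumed, following Cristianini and Shawe-Taylor) is:

```latex
% Complementarity condition (slide 24): alpha_i is non-zero only when the constraint is tight
\alpha_i \Big[\, y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) - 1 \,\Big] = 0

% Recovering the parameters from the dual solution (slide 25)
\mathbf{w}^{*} = \sum_{i=1}^{\ell} \alpha_i^{*} y_i \mathbf{x}_i, \qquad
b^{*} = y_{s} - \langle \mathbf{w}^{*}, \mathbf{x}_{s} \rangle \;\; \text{for any support vector } (\mathbf{x}_{s}, y_{s})

% The resulting hypothesis
h(\mathbf{x}) = \operatorname{sign}\!\Big( \sum_{i=1}^{\ell} \alpha_i^{*} y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b^{*} \Big)
```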
26. Soft Margin Optimization
   - The main problem with the maximal margin classifier is that it always produces a perfectly consistent hypothesis, that is, a hypothesis with no training error.
   - Relax the boundary by allowing some margin violations. (A reconstruction of the soft-margin program follows.)
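The soft-margin problem on this slide was an image; the standard 1-norm slack formulation, with trade-off parameter C, is reconstructed here under the usual assumptions:

```latex
% Slack variables xi_i >= 0 allow margin violations, penalised through C
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle
  + C \sum_{i=1}^{\ell} \xi_i
\quad \text{s.t.} \quad
y_i\big(\langle \mathbf{w}, \mathbf{x}_i \rangle + b\big) \ge 1 - \xi_i, \;\; \xi_i \ge 0
```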
27. Non-linear Classification
   - The problem: the maximal margin classifier is an important concept, but it cannot be used in many real-world problems, because there will in general be no linear separation in the feature space.
   - The solution: map the data into another space in which it can be separated linearly.
28. A Learning Machine
   - A learning machine f takes an input x and transforms it, somehow using a weight vector α, into a predicted output y_est = ±1.
   - Diagram: x → f(x, α) → y_est, where α is some vector of adjustable parameters.
29-30. Some Definitions
   - Given some machine f, and under the assumptions that all training points (x_k, y_k) were drawn i.i.d. from some distribution and that future test points will be drawn from the same distribution, define the test error TESTERR and the training error TRAINERR (official terminology: the risk and the empirical risk).
   - R = number of training-set data points. (The definitions were given as formulas; a reconstruction follows.)
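The two definitions were images in the original; the standard forms in Andrew Moore's tutorial notation, with R training points, are reconstructed below and should be read as an assumption about what the slides showed:

```latex
% Expected test error (the risk)
\text{TESTERR}(\boldsymbol{\alpha}) \;=\; E\!\left[ \tfrac{1}{2}\,\big| y - f(\mathbf{x}, \boldsymbol{\alpha}) \big| \right]

% Training error (the empirical risk), averaged over the R training points
\text{TRAINERR}(\boldsymbol{\alpha}) \;=\; \frac{1}{R} \sum_{k=1}^{R} \tfrac{1}{2}\,\big| y_k - f(\mathbf{x}_k, \boldsymbol{\alpha}) \big|
```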
31. Vapnik-Chervonenkis Dimension
   - Given some machine f, let h be its VC dimension.
   - h is a measure of f's power (h does not depend on the choice of training set).
   - Vapnik showed that, with probability 1 - η, a bound of the form reconstructed below holds. This gives us a way to estimate the error on future data based only on the training error and the VC dimension of f.
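The bound itself was an image; the standard statement (see Burges's tutorial, cited in the references) is reconstructed here in the same notation, as an assumption about what the slide displayed:

```latex
\text{TESTERR}(\boldsymbol{\alpha}) \;\le\; \text{TRAINERR}(\boldsymbol{\alpha})
 + \sqrt{ \frac{ h\left( \ln\frac{2R}{h} + 1 \right) - \ln\frac{\eta}{4} }{R} }
```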
32. Structural Risk Minimization
   - Let φ(f) = the set of functions representable by f.
   - Suppose φ(f_1) ⊆ φ(f_2) ⊆ ... ⊆ φ(f_n); then h(f_1) ≤ h(f_2) ≤ ... ≤ h(f_n).
   - We're trying to decide which machine to use. We train each machine and make a table with one row per machine f_i (i = 1, ..., 6) and columns for TRAINERR, the VC-confidence term, the probable upper bound on TESTERR, and our choice.
33. Kernel-Induced Feature Space
   - Mapping the data from the input space X into a feature space F.
34. Implicit Mapping into Feature Space
   - For a non-linearly separable data set, we can modify the hypothesis so that it maps the data implicitly into another feature space.
35. Kernel Function
   - A kernel is a function K such that, for all inputs x and z, K(x, z) equals an inner product in some feature space. (A reconstruction of the definition follows.)
   - The benefit: it solves the computational problem of working with many dimensions.
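The definition on this slide, and the kernelized form of the hypothesis from slide 34, were images; the standard versions, reconstructed with the feature map written as φ, are:

```latex
% A kernel computes an inner product in the feature space F reached by phi: X -> F
K(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle
  \qquad \text{for all } \mathbf{x}, \mathbf{z} \in X

% The hypothesis of slide 34, written implicitly through the kernel
h(\mathbf{x}) = \operatorname{sign}\!\Big( \sum_{i=1}^{\ell} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)
```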
36. Kernel Function
37. The Polynomial Kernel
   - This kind of kernel represents the inner product of two vectors (points) in a feature space of higher dimension.
   - For example, see the numerical sketch below.
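A small numerical sketch (not part of the deck; the vectors and function names are invented) showing that the degree-2 polynomial kernel computes the inner product of an explicit quadratic feature map without ever constructing that map:

```python
# k(x, z) = (x . z)^2 equals <phi(x), phi(z)> for phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).
import numpy as np

def poly_kernel(x, z, degree=2):
    return np.dot(x, z) ** degree

def phi(x):                        # explicit degree-2 feature map for 2-D inputs
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(poly_kernel(x, z))           # 121.0, computed directly in the input space
print(np.dot(phi(x), phi(z)))      # 121.0, computed explicitly in the feature space
```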
38. Text Categorization
   - Inductive learning. Input: a feature representation of a document. Output: f(x) = confidence(class).
   - In the case of text classification, the attributes are the words in the document and the classes are the categories.
39. Properties of Text-Classification Tasks
   - High-dimensional feature space.
   - Sparse document vectors.
   - High level of redundancy.
40. Text Representation and Feature Selection
   - Binary features.
   - Term frequency.
   - Inverse document frequency, where n is the total number of documents and DF(w) is the number of documents the word w occurs in.
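The weighting formula was an image; the standard form consistent with the slide's own definitions (a reconstruction, with the log base not specified in the original) is:

```latex
\text{IDF}(w) = \log \frac{n}{\text{DF}(w)}, \qquad
\text{weight}(w, d) = \text{TF}(w, d) \cdot \text{IDF}(w)
```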
41. Learning SVMs
   - To learn the vector of feature weights.
   - Linear SVMs.
   - Polynomial classifiers.
   - Radial basis functions.
42. Processing
   - Text files are processed to produce a vector of words.
   - Select the 300 words with the highest mutual information with each category (after removing stop words).
   - A separate classifier is learned for each category. (A sketch of such a pipeline follows.)
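A minimal scikit-learn sketch of this kind of pipeline. The toy corpus, the labels, the use of TF-IDF weighting, and k=2 (rather than the 300 words used in the slides) are all assumptions made so the example runs; the original experiments (Dumais et al., 1998) used their own tooling.

```python
# Sketch: word vectors with stop words removed, top-k features by mutual
# information for one category, and a binary linear SVM for that category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented mini-corpus; 1 = belongs to the "interest" category, 0 = does not.
docs = ["bank raises prime rate", "discount rate cut announced",
        "world group sees dlrs rise", "company reports year results"]
labels = [1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),   # vector of words, stop words removed
    SelectKBest(mutual_info_classif, k=2),   # k=300 in the slides; 2 fits this toy corpus
    LinearSVC(C=1.0),                        # one linear SVM per category
)
clf.fit(docs, labels)
print(clf.predict(["central bank cuts the prime rate"]))
```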
43. An Example: Reuters (Trends & Controversies)
   - Category: interest.
   - Weight vector with large positive weights: prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46).
   - Large negative weights: group (-.24), year (-.25), sees (-.33), world (-.35), and dlrs (-.71).
44. Text Categorization Results (Dumais et al., 1998)
45. Apply to the Linear Classifier
   - Substitute the kernel into the hypothesis.
   - Substitute the kernel into the margin optimization.
46. SVMs and PAC Learning
   - Theorems connect PAC theory to the size of the margin.
   - Basically, the larger the margin, the better the expected accuracy.
   - See, for example, Chapter 4 of Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2002.
47. PAC and the Number of Support Vectors
   - The fewer the support vectors, the better the generalization will be.
   - Recall that non-support vectors are correctly classified and do not change the learned model if left out of the training set.
   - So, by the usual leave-one-out argument, the expected generalization error can be bounded in terms of the expected number of support vectors relative to the size of the training set.
48. VC-dimension of an SVM
   - Very loosely speaking, there is some theory which, under some different assumptions, puts an upper bound on the VC dimension (a reconstruction of the bound follows), where:
     - Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set;
     - Margin is the smallest margin we'll let the SVM use.
   - This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, the RBF σ, etc. But most people just use cross-validation.
   - Copyright © 2001, 2003, Andrew W. Moore
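The bound itself was an image; consistent with the slide's own wording, the reconstruction below is assumed (loose constants and ceilings omitted):

```latex
h \;\lesssim\; \frac{\text{Diameter}^{2}}{\text{Margin}^{2}}
```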
49. Finding Non-Linear Separating Surfaces
   - Map inputs into a new space. Example: an input with features x1 = 5, x2 = 4 becomes, under the map (x1, x2, x1^2, x2^2, x1*x2), the point (5, 4, 25, 16, 20).
   - Solve the SVM program in this new space. This is computationally complex if there are many features, but a clever trick exists.
50. Summary
   - Maximize the margin between positive and negative examples (connects to PAC theory).
   - Non-linear classification.
   - The support vectors contribute to the solution.
   - Kernels map examples into a new, usually non-linear space.
51. References
   - Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
   - Andrew W. Moore. cmsc726: SVMs. http://www.cs.cmu.edu/~awm/tutorials
   - C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
   - Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
   - Thorsten Joachims. A Statistical Learning Model of Text Classification for Support Vector Machines (joachims_01a).
52. www.intsci.ac.cn/shizz/
   - Questions?!
