Intro to Classification: Logistic Regression & SVM
Kriti Puniyani
Carnegie Mellon University
      kriti@cmu.edu
About me
- Graduate student at Carnegie Mellon University
- Statistical machine learning
  - Topic models
  - Sparse network learning
  - Optimization
- Domains of interest
  - Social media analysis
  - Systems biology
  - Genetics
  - Sentiment analysis
  - Text processing
Machine learning
- Getting computers to "learn with experience".
- Learn : to be able to predict "unseen" things.

- Many applications :
  - Search
  - Machine translation
  - Speech recognition
  - Vision : identify cars, people, sky, apples
  - Robot control

- Introductions :
  - http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
  - http://videolectures.net/mlss2010_lawrence_mlfcs/
Classification

- Is this the digit "9" ?

- Will this patient survive ?

- Will this user click on my ad ?
Predict the next coin toss

Task : predict the next toss.
Data : THTTTTHHTHTHTTT

Model 1 : the coin comes up tails with probability p.
Model 2 : the toss depends on wind condition W, starting pose S, torque T.

p and (W, S, T) are the two models' parameters.
Predict the next coin toss

Data : THTTTTHHTHTHTTT

Learning fits the parameters to the data :

Model 1 : p = 2/3
Model 2 : W = 12.2, S = 1, T = 0.23
Predict the next coin toss

"I predict the next toss to be T."

Inference uses the learned parameters to predict :

Model 1 : p = 2/3
Model 2 : W = 12.2, S = 1, T = 0.23
Inference
- Parameter : p = 2/3

- Predicted next 9 tosses : H H H T T T T T T
  Observed next 9 tosses  : T T T T T T H H H
  Accuracy = 2/9

- Predicted next 9 tosses : T T T T T T T T T
  Observed next 9 tosses  : T T T T T T H H H
  Accuracy = 6/9

- Inference rule :
  - if p > 0.5, always predict T ;
  - if p < 0.5, always predict H.
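
This learn-then-predict loop is easy to write down; a minimal sketch in R (variable names are my own), assuming tosses are encoded as "T"/"H" :

    tosses <- c("T","H","T","T","T","T","H","H","T","H","T","H","T","T","T")

    # Learning : the maximum-likelihood estimate of p = P(tails)
    p <- mean(tosses == "T")                 # 10/15 = 2/3

    # Inference rule from the slide : predict the more likely outcome
    prediction <- if (p > 0.5) "T" else "H"
    prediction                               # "T"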
The anatomy of classification
1. What is the data (features X, label y) ? ★★★
2. What is the model ? Model parameterization (w).
3. Inference : given X and w, predict the label.
4. Learning : given (X, y) pairs, learn the "best" w.
   - Define "best" : maximize an objective function.

Train time : (X, Y) pairs -> Learning  -> w
Test time  : (X, ?)       -> Inference -> predicted Y
Logistic Regression
Predict speaker success
- X = number of hours spent in preparation
- Y = was the speaker "good" ?

Prediction : Y = I(X > h),  where I(a) = 1 if a == TRUE, 0 if a == FALSE.
Predict speaker success

  Y = I(X > h)

- Learning h is difficult.
- Not robust : the hard threshold flips its prediction abruptly at X = h.
The logistic model :

  P(Y = 1 | w, X) = 1 / (1 + e^-(wX + w0))

a smooth, probabilistic version of the hard threshold Y = I(X > 10).
Logistic (sigmoidal) function

[Plot : the sigmoid curve 1/(1 + e^-z), rising smoothly from 0 to 1]
Extend to d dimensions

  P(Y = 1 | w, X) = 1 / (1 + e^-(w1 X1 + w2 X2 + ... + wd Xd + w0))

                  = 1 / (1 + e^-(w . X + w0))
Logistic regression
- Model parameter : w

    P(Y = 1 | w, X) = 1 / (1 + e^-(wX + w0))

- Example : given X = 0.9, w = 1.2
  => wX = 1.08, P(Y=1 | X=0.9) = 0.7465 ~ 0.75
  Toss a coin with p = 3/4.

- Example : given X = -1.1, w = 1.2
  => wX = -1.32, P(Y=1 | X=-1.1) = 0.2108 ~ 0.2
  Toss a coin with p = 1/5.
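
Both worked examples can be checked with base R's plogis, which is exactly this logistic function :

    w <- 1.2          # model parameter (these examples take w0 = 0)

    plogis(w * 0.9)   # wX =  1.08 -> P(Y=1) = 0.7465
    plogis(w * -1.1)  # wX = -1.32 -> P(Y=1) = 0.2108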
Another view of logistic regression
- Log odds : ln[ p/(1-p) ] = wX + ε

- p / (1-p) = e^wX
- p = (1-p) e^wX
- p (1 + e^wX) = e^wX
- p = e^wX / (1 + e^wX) = 1 / (e^-wX + 1)

- Logistic regression is a "linear regression" between the log-odds of
  an event and the features (X).
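
In base R, plogis and qlogis are the logistic function and its inverse, so the log-odds view can be verified numerically :

    p <- plogis(1.08)    # logistic function : P(Y=1) when wX = 1.08
    qlogis(p)            # inverse : the log-odds ln(p/(1-p)) recovers 1.08
    log(p / (1 - p))     # the same, written out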
The anatomy of classification
1. What is the data (features X, label y) ?                 ✔
2. What is the model ? Model parameterization (w).          ✔
3. Inference : given X and w, predict the label.            ✔
4. Learning : given (X, y) pairs, learn the "best" w.
   - Define "best" : maximize an objective function.
Learning : Finding the best w
- Data : (X1, Y1), ..., (Xn, Yn)

- Expressing the conditional log-likelihood :
  - If yi == 1, maximize P(yi = 1 | xi, w)
  - If yi == 0, maximize P(yi = 0 | xi, w)

- Maximize the log-likelihood :

    l(w) = Σi ln P(yi | xi, w)
Learning : Example
- Data : (5, 0), (11, 1), (25, 1)

    l(w) = ln P(y=0 | x=5, w) + ln P(y=1 | x=11, w) + ln P(y=1 | x=25, w)

- P(Y=1 | X, w) is a logistic function, and P(y=1|x) + P(y=0|x) = 1, so :

    l(w) = ln(1 - 1/(1 + e^-(5w + w0)))
         + ln(1/(1 + e^-(11w + w0)))
         + ln(1/(1 + e^-(25w + w0)))
Optimization : Pick the "best" w
1. Weka
2. Matlab : w = mnrfit(X, Y)
3. R : w <- glm(Y~X, family=binomial(link="logit"))
4. IRLS : http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic.m
5. Implement your own
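
As a sketch of option 3, fitting the toy data from the previous slide (note this tiny dataset is linearly separable, so R will warn that fitted probabilities are numerically 0 or 1; the over-fitting slides later explain why) :

    x <- c(5, 11, 25)
    y <- c(0, 1, 1)

    fit <- glm(y ~ x, family = binomial(link = "logit"))
    coef(fit)     # (w0, w) maximizing the conditional log-likelihood

    # inference : P(Y = 1) for a new speaker with x = 15 hours
    predict(fit, data.frame(x = 15), type = "response")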
Decision surface is linear

[Plot : a linear boundary separating the Y=0 region from the Y=1 region ;
points on the wrong side are errors]
Decision surface is linear




http://www.cs.technion.ac.il/~rani/LocBoost/
So far..
- Logistic regression is a binary classifier (a multinomial version
  exists).
- P(Y=1 | X, w) is a logistic function.
- Inference : compute P(Y=1 | X, w), and "round" to the nearer label.
- Parameters are learnt by maximizing the log-likelihood of the data.
- The decision surface is linear (a kernelized version exists).
Improvements in the model
- Prevent over-fitting          -> Regularization
- Maximize accuracy directly    -> SVMs
- Non-linear decision surface   -> Kernel Trick
- Multi-label data
Occam's razor

"The simplest explanation is most likely the correct one."
New and improved learning
- "Best" w == maximize the log-likelihood :
  the Maximum Likelihood Estimate (MLE)

Small concern ... over-fitting :
if the data is linearly separable, the MLE drives w to infinity.
L2 regularization

  ||w||2^2 = Σi wi^2

  max_w  l(w) - λ ||w||2^2

- Prevents over-fitting
- "Pushes" parameters towards zero
- Equivalent to a prior on the parameters :
  a Normal distribution (0 mean, unit covariance)

λ : tuning parameter (e.g. 0.1)
Patient Diagnosis
- Y = disease
- X = [age, weight, BP, blood sugar, MRI, genetic tests ...]

- Don't want all "features" to be relevant.

- Weight vector w should be "mostly zeros".
L1 regularization

  ||w||1 = Σi |wi|

  max_w  l(w) - λ ||w||1

- Prevents over-fitting
- "Pushes" parameters to zero
- Equivalent to a prior on the parameters :
  a Laplace distribution

As λ increases, more features get zero weights (become irrelevant).
L1 v/s L2 example
- MLE estimate : [ 11   0.8 ]
- L2 estimate  : [ 10   0.6 ]    shrinkage
- L1 estimate  : [ 10.2  0 ]     sparsity

- Mini-conclusion :
  - L2 optimization is fast, L1 tends to be slower. If you have the
    computational resources, try both (at the same time) ! See the
    sketch below.
  - ALWAYS run logistic regression with at least some regularization.
  - Corollary : ALWAYS run logistic regression on features that have
    been standardized (zero mean, unit variance).
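
A minimal sketch using the glmnet R package (my choice, not named in the slides) : alpha = 0 gives the L2 (ridge) penalty, alpha = 1 the L1 (lasso) penalty, and features are standardized by default :

    library(glmnet)   # install.packages("glmnet") if needed

    # hypothetical toy data : n x d feature matrix X, 0/1 labels y
    set.seed(1)
    X <- matrix(rnorm(100 * 5), 100, 5)
    y <- rbinom(100, 1, plogis(X[, 1] - 0.5 * X[, 2]))

    fit_l2 <- glmnet(X, y, family = "binomial", alpha = 0)  # L2 : shrinkage
    fit_l1 <- glmnet(X, y, family = "binomial", alpha = 1)  # L1 : sparsity

    # pick lambda by cross-validation instead of hand-tuning
    cv <- cv.glmnet(X, y, family = "binomial", alpha = 1)
    coef(cv, s = "lambda.min")   # sparse coefficient vector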
So far ...
- Logistic regression
  - Model
  - Inference
  - Learning via maximum likelihood
  - L1 and L2 regularization

Next .... SVMs !
Why did we use probability again?
- Aim : maximize "accuracy".

- Logistic regression : an indirect method that maximizes the
  likelihood of the data.

- A more direct approach is to maximize accuracy itself.

        Support Vector Machines (SVMs)
Maximize the margin

[Figure : a linear separator positioned to maximize the margin between
the closest positive and negative examples]
Geometry review

Line : 2x1 + x2 - 2 = 0, separating Y = 1 from Y = -1.

For a point on the line :
(0.5, 1) : d = 2*0.5 + 1 - 2 = 0

Signed "distance" to the line from (x10, x20) :
d = 2*x10 + x20 - 2
Geometry review

Line : 2x1 + x2 - 2 = 0

(1, 2.5) : d = 2*1 + 2.5 - 2 = 2.5 > 0
y(w.x + b) = 1 * 2.5 = 2.5 > γ
Geometry review

Line : 2x1 + x2 - 2 = 0

(0.5, 0.5) : d = 2*0.5 + 0.5 - 2 = -0.5 < 0
y(w.x + b) = y * d = -1 * -0.5 = 0.5
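
These margin computations are easy to verify; a small R sketch of the signed distance and the quantity y(w.x + b) for the three points above :

    w <- c(2, 1); b <- -2                  # the line 2*x1 + x2 - 2 = 0

    margin <- function(x, y) y * (sum(w * x) + b)   # y in {+1, -1}

    sum(w * c(0.5, 1)) + b         # 0   : on the line
    margin(c(1, 2.5),   +1)        # 2.5 : positive point, far from the line
    margin(c(0.5, 0.5), -1)        # 0.5 : negative distance, negative label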
Support Vector Machines
Normalized margin : canonical hyperplanes

[Figure : the separating hyperplane, with points x+ and x- sitting on
the two margins]

Support vectors are the points touching the margins.
(Notation : w.x = Σj w(j) x(j))

Slack variables

- SVMs are made robust by adding "slack variables" ξi that allow the
  training error to be non-zero.
- One for each data point ; slack variable == 0 for correctly
  classified points.

    max  γ - C Σi ξi
Slack variables

    max  γ - C Σi ξi

Need to tune C :
  high C == minimize mis-classifications
  low C  == maximize margin
SVM summary
- Model :     w.x + b > 0   if y = +1
              w.x + b < 0   if y = -1

- Inference : ŷ = sign(w.x + b)

- Learning : maximize { (margin) - C (slack variables) }

        Next ... Kernel SVMs
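
A minimal linear-SVM sketch using the e1071 R package (an interface to LIBSVM, which the software slide later lists; the toy data is hypothetical), where the cost argument is the C above :

    library(e1071)   # install.packages("e1071") if needed

    # toy data : two Gaussian blobs ; labels must be a factor
    set.seed(1)
    X <- rbind(matrix(rnorm(40), 20, 2), matrix(rnorm(40, mean = 2), 20, 2))
    y <- factor(rep(c(-1, 1), each = 20))

    fit <- svm(X, y, kernel = "linear", cost = 1)  # cost is the C trade-off
    mean(predict(fit, X) == y)                     # inference : sign(w.x + b)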
The kernel trick
- Why a linear separator ? What if the data looks like below ?

[Figure : data that no straight line can separate]

The kernel trick allows you to use SVMs with non-linear separators.

Different kernels :
1. Polynomial
2. Gaussian
3. Exponential
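
With e1071 the kernel is a single argument; a sketch on hypothetical ring-shaped data, with tune.svm searching over the Gaussian bandwidth (gamma) and C on validation folds :

    library(e1071)

    # toy ring data : the radius decides the class, so no line separates it
    set.seed(1)
    X <- matrix(rnorm(200), 100, 2)
    y <- factor(sqrt(rowSums(X^2)) > 1)

    fit_poly <- svm(X, y, kernel = "polynomial", degree = 2, cost = 1)
    fit_rbf  <- svm(X, y, kernel = "radial",     gamma = 0.5, cost = 1)

    # tune bandwidth (gamma) and C on validation folds, not by hand
    best <- tune.svm(X, y, gamma = 2^(-3:3), cost = 2^(-2:4))
    best$best.parameters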
Logistic v/s Linear SVM

[Plots : logistic regression and a linear SVM on the same
non-linearly-separable data]

Error ~ 40% in both cases
Kernel SVM with polynomial kernel of degree 2

Polynomial kernels of degree 2 and 4 do very well, but degrees 3 and 5
do very badly.

The Gaussian kernel has a tuning parameter (bandwidth) ; performance
depends on picking the right bandwidth.

Error = 7%
SVMs summary
- Maximize the margin between positive and negative examples.
- The kernel trick is widely implemented, allowing non-linear decision
  surfaces.
- Not probabilistic.

- Software :
  - SVM-light http://svmlight.joachims.org/
  - LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - Weka, Matlab, R
Demo


http://www.cs.technion.ac.il/~rani/LocBoost
Which to use ?
- Linear SVMs and logistic regression behave very similarly in most
  cases.
- Kernelized SVMs (mostly) work better than linear SVMs.
- Kernelized logistic regression is possible, but implementations are
  not easily available.
Recommendations
1. First, try logistic regression. Easy, fast, stable. No "tuning"
   parameters.
2. Equivalently, you can first try linear SVMs, but you need to tune "C".
3. If results are "good enough", stop.
4. Else try SVMs with Gaussian kernels.
   Need to tune bandwidth and C, by using validation data.

If you have more time/computational resources, try random forests as
well.

** Recommendations are opinions of the presenter, and not known facts.
In conclusion ...

- Logistic Regression
- Support Vector Machines

Other classification approaches ...

- Random forests / decision trees
- Naïve Bayes
- Nearest Neighbors
- Boosting (Adaboost)
Thank you
Questions?




Kriti Puniyani
Carnegie Mellon University
      kriti@cmu.edu
Is this athlete doing drugs ?
- X = blood test to detect drugs
- Y = doped athlete ?

- Two types of errors :
  - Athlete is doped, we predict "NO" : false negative
  - Athlete is NOT doped, we predict "YES" : false positive

- Penalize false positives more than false negatives.
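
One way to encode this asymmetry, an assumption on my part since the slides name no tool, is the class.weights argument of e1071's svm, which scales C per class so that errors on the up-weighted class cost more :

    library(e1071)

    # hypothetical toy data : 1-D blood-test score, labels "clean"/"doped"
    set.seed(1)
    X <- matrix(c(rnorm(30, 0), rnorm(10, 2)), ncol = 1)
    y <- factor(rep(c("clean", "doped"), c(30, 10)))

    # weighting "clean" higher makes errors on clean athletes
    # (false positives) cost more than missed dopers (false negatives)
    fit <- svm(X, y, kernel = "linear", cost = 1,
               class.weights = c(clean = 5, doped = 1))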
Outline
- What is classification ?
  - Parameters, data, inference, learning
  - Predicting coin tosses (0-dimensional X)
- Logistic Regression
  - Predicting "speaker success" (1-dimensional X)
  - Formulation, optimization
  - Decision surface is linear
  - Interpreting coefficients
  - Hypothesis testing
  - Evaluating the performance of the model
  - Why is it called "regression" : log-odds
  - L2 regularization
  - Patient survival (d-dimensional X)
  - L1 regularization
- Support Vector Machines
  - Linear SVMs + formulation
  - What are "support vectors"
  - The kernel trick
- Demo : logistic regression v/s SVMs v/s kernel tricks
Overfitting : a more serious problem

The same decision boundary can be written with arbitrarily large
weights :

2x + y - 2 = 0            w = [2 1 -2]
4x + 2y - 4 = 0           w = [4 2 -4]
400x + 200y - 400 = 0     w = [400 200 -400]

=> Absolutely need L2 regularization
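
A quick numeric check of why the weight scale matters even though all three lines are the same boundary : larger weights push the predicted probabilities toward 0/1, so the unregularized likelihood keeps "improving" as w grows (values below are approximate) :

    x <- c(0.6, 0.9)                       # a point slightly on the positive side

    plogis(sum(c(2, 1)     * x) - 2)       # ~ 0.52 : modest confidence
    plogis(sum(c(4, 2)     * x) - 4)       # ~ 0.55
    plogis(sum(c(400, 200) * x) - 400)     # ~ 1.00 : absurdly over-confident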
