A gentle introduction to 2 classification techniques, as presented by Kriti Puniyani to the NYC Predictive Analytics group (April 14, 2011). To download the file please go here: http://www.meetup.com/NYC-Predictive-Analytics/files/
2. About me
• Graduate student at Carnegie Mellon University
• Statistical machine learning
  • Topic models
  • Sparse network learning
  • Optimization
• Domains of interest
  • Social media analysis
  • Systems biology
  • Genetics
  • Sentiment analysis
  • Text processing
3. Machine learning
• Computers that "learn from experience"
• Learn: to be able to predict "unseen" things
• Many applications
  • Search
  • Machine translation
  • Speech recognition
  • Vision: identify cars, people, sky, apples
  • Robot control
• Introductions:
  • http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
  • http://videolectures.net/mlss2010_lawrence_mlfcs/
4. Classification
• Is this the digit "9"?
• Will this patient survive?
• Will this user click on my ad?
5. Predict the next coin toss
Data: THTTTTHHTHTHTTT
Task: predict the next toss
Model 1: the coin is tossed with probability p (of being tails)
Model 2: the toss depends on wind condition W, starting pose S, torque T
Parameters: p for Model 1; W, S, T for Model 2
6. Predict the next coin toss
THTTTTHHTHTHTTT
Learning:
Model 1: p = 2/3
Model 2: W = 12.2, S = 1, T = 0.23
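A minimal sketch of Model 1's learning step in R (one of the tools listed later in the deck): the maximum-likelihood estimate of p is just the fraction of tails in the observed sequence.

```r
# Model 1 learning: MLE of p (the probability of tails) from the observed data.
tosses <- strsplit("THTTTTHHTHTHTTT", "")[[1]]
p_hat  <- mean(tosses == "T")   # 10 tails out of 15 tosses = 2/3
```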
7. Predict the next coin toss
Inference: "I predict the next toss to be T."
Model 1: p = 2/3
Model 2: W = 12.2, S = 1, T = 0.23
8. Inference
• Parameter: p = 2/3
• Predicted next 9 tosses: H H H T T T T T T
  Observed next 9 tosses: T T T T T T H H H
  Accuracy = 3/9 :(
• Predicted next 9 tosses: T T T T T T T T T
  Observed next 9 tosses: T T T T T T H H H
  Accuracy = 6/9 :)
• Inference rule:
  • if p > 0.5, always predict T
  • if p < 0.5, always predict H
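A quick check of the inference rule in R, scoring the "always predict T" strategy against the observed continuation from the slide:

```r
# Inference rule: if p > 0.5 always predict T, otherwise predict H.
predict_next <- function(p) if (p > 0.5) "T" else "H"

observed  <- c("T","T","T","T","T","T","H","H","H")  # next 9 tosses from the slide
predicted <- rep(predict_next(2/3), 9)               # always "T"
mean(predicted == observed)                          # accuracy = 6/9
```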
9. The anatomy of classification
1. What is the data (features X, label y)?
2. What is the model? Model parameterization (w)
3. Inference: given X, w, predict the label.
4. Learning: given (X, y) pairs, learn the "best" w
   • Define "best": maximize an objective function

Train time: (X, Y) pairs → Learning → w
Test time: (X, ?) → Inference → predicted Y
11. Predict speaker success
• X = number of hours spent in preparation
• Y = was the speaker "good"?
• Prediction: Y = I(X > h), where I(a) = 1 if a is TRUE and 0 if a is FALSE
12. Predict speaker success
Y = I(X > h)
• Learning is difficult.
• Not robust.
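A hedged sketch of what learning this model looks like: brute-force search for the threshold h with the best training accuracy. The data below is made up for illustration; the point is that the best h can jump around when a single point changes, which is why the slide calls the approach not robust.

```r
# Hypothetical data: hours of preparation (X) and whether the talk was "good" (Y).
X <- c(2, 4, 5, 7, 9, 12, 15, 20)
Y <- c(0, 0, 1, 0, 1, 1, 1, 1)

# Brute-force learning: try each observed value as the threshold h and keep the
# one with the highest training accuracy.
accuracy <- sapply(X, function(h) mean(as.integer(X > h) == Y))
h_best   <- X[which.max(accuracy)]
```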
13.
$$P(Y \mid w, X) = \frac{1}{1 + e^{-(wX + w_0)}}$$
$$Y = I(X > 10)$$
15. Extend to d dimensions
$$P(Y \mid \vec{w}, \vec{X}) = \frac{1}{1 + e^{-(w_1 X_1 + w_2 X_2 + \dots + w_d X_d + w_0)}}$$
$$P(Y \mid \vec{w}, \vec{X}) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{X} + w_0)}}$$
16. Logistic regression
• Model parameter: w
$$P(Y = 1 \mid w, X) = \frac{1}{1 + e^{-(wX + w_0)}}$$
• Example: given X = 0.9, w = 1.2
  ⇒ wX = 1.08, P(Y=1 | X=0.9) = 0.7465 ≈ 0.75
  Toss a coin with p = 3/4
• Example: given X = -1.1, w = 1.2
  ⇒ wX = -1.32, P(Y=1 | X=-1.1) = 0.2108 ≈ 0.2
  Toss a coin with p = 1/5
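The slide's two worked examples can be checked in R: plogis() is exactly the logistic function 1/(1 + e^(-x)). This assumes w0 = 0, as the examples do.

```r
# P(Y=1 | w, X) = 1 / (1 + exp(-(w*X + w0))); plogis() computes this directly.
w <- 1.2; w0 <- 0
plogis(w * 0.9 + w0)    # 0.7465 -> toss a coin with p ~ 3/4
plogis(w * -1.1 + w0)   # 0.2108 -> toss a coin with p ~ 1/5
```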
17. Another view of logistic regression
• Log odds: $\ln[p/(1-p)] = wX$ (the intercept $w_0$ is absorbed into $wX$ here)
• $p/(1-p) = e^{wX}$
• $p = (1-p)\,e^{wX}$
• $p\,(1 + e^{wX}) = e^{wX}$
• $p = e^{wX}/(1 + e^{wX}) = 1/(1 + e^{-wX})$
• Logistic regression is a "linear regression" between the log-odds of an event and the features (X)
18. The anatomy of classification
1. What is the data (features X, label y)? ✓
2. What is the model? Model parameterization (w) ✓
3. Inference: given X, w, predict the label. ✓
4. Learning: given (X, y) pairs, learn the "best" w
   • Define "best": maximize an objective function
19. Learning : Finding the best w
• Data: (X₁, Y₁), …, (Xₙ, Yₙ)
• Expressing the conditional log-likelihood:
  • if yᵢ == 1, maximize P(yᵢ = 1 | xᵢ, w)
  • if yᵢ == 0, maximize P(yᵢ = 0 | xᵢ, w)
• Maximize the log-likelihood: $\ell(w) = \sum_i \ln P(y_i \mid x_i, w)$
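As a sketch, the objective being maximized can be written in a few lines of R (one feature, with w = c(w1, w0)):

```r
# Conditional log-likelihood of the data: sum_i log P(y_i | x_i, w).
log_lik <- function(w, X, Y) {
  p <- plogis(w[1] * X + w[2])            # P(y=1 | x, w) for every point
  sum(Y * log(p) + (1 - Y) * log(1 - p))  # log-probability of the observed labels
}
```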
21. Optimization : Pick the "best" w
1. Weka
2. Matlab: w = mnrfit(X, Y)
3. R: w <- glm(Y ~ X, family = binomial(link = "logit"))
4. IRLS: http://www.cs.cmu.edu/~ggordon/IRLS-example/logistic.m
5. Implement your own :) (a sketch follows below)
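For option 5, a minimal gradient-ascent sketch in R, using the standard logistic-regression gradient Xᵀ(y − p). Fixed step size, no convergence check, no regularization: an illustration, not production code.

```r
# "Implement your own": plain gradient ascent on the log-likelihood.
fit_logistic <- function(X, Y, lr = 0.1, iters = 1000) {
  X1 <- cbind(X, 1)                   # append a column of 1s for the intercept w0
  w  <- rep(0, ncol(X1))
  for (i in seq_len(iters)) {
    p <- plogis(X1 %*% w)             # current P(y=1 | x, w) for every point
    w <- w + lr * t(X1) %*% (Y - p)   # step up the gradient X^T (y - p)
  }
  w
}
```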
23. Decision surface is linear
http://www.cs.technion.ac.il/~rani/LocBoost/
24. So far..
• Logistic regression is a binary classifier (a multinomial version exists)
• P(Y=1 | X, w) is a logistic function
• Inference: compute P(Y=1 | X, w) and "round"
• Parameters are learnt by maximizing the log-likelihood of the data
• The decision surface is linear (a kernelized version exists)
25. Improvements in the model
• Prevent over-fitting → regularization
• Maximize accuracy directly → SVMs
• Non-linear decision surface → the kernel trick
• Multi-label data
27. New and improved learning
• "Best" w == maximize the log-likelihood
  → the maximum-likelihood estimate (MLE)
• Small concern: over-fitting.
  If the data is linearly separable, ‖w‖ → ∞
28. L2 regularization
$$\|w\|_2^2 = \sum_i w_i^2$$
$$\max_w \; \ell(w) - \lambda \|w\|_2^2$$
• Prevents over-fitting
• "Pushes" parameters towards zero
• Equivalent to a prior on the parameters: a Normal distribution (zero mean, unit covariance)
• λ: tuning parameter (e.g., 0.1)
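The "equivalent to a prior" bullet is the usual MAP argument; here is the one-line derivation, written for a general Gaussian prior $w \sim N(0, \sigma^2 I)$:

$$\max_w \; \ell(w) + \log P(w) \;=\; \max_w \; \ell(w) - \frac{1}{2\sigma^2}\|w\|_2^2 + \text{const}, \qquad \lambda = \frac{1}{2\sigma^2}.$$

With the unit covariance on the slide, this corresponds to λ = 1/2.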
29. Patient Diagnosis
• Y = disease
• X = [age, weight, BP, blood sugar, MRI, genetic tests, …]
• We don't want all "features" to be relevant.
• The weight vector w should be "mostly zeros".
30. L1 regularization
$$\|w\|_1 = \sum_i |w_i|$$
$$\max_w \; \ell(w) - \lambda \|w\|_1$$
• Prevents over-fitting
• "Pushes" parameters to zero
• Equivalent to a prior on the parameters: a Laplace distribution
• As λ increases, more features get zero weight (irrelevant features)
31. L1 v/s L2 example
• MLE estimate: [11, 0.8]
• L2 estimate: [10, 0.6] (shrinkage)
• L1 estimate: [10.2, 0] (sparsity)
• Mini-conclusion:
  • L2 optimization is fast; L1 tends to be slower. If you have the computational resources, try both (at the same time)! A sketch follows below.
  • ALWAYS run logistic regression with at least some regularization.
  • Corollary: ALWAYS run logistic regression on features that have been standardized (zero mean, unit variance).
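A hedged sketch of both penalties in R, using the glmnet package (one common implementation; Weka and Matlab have equivalents). alpha = 0 gives the L2 (ridge) penalty and alpha = 1 gives the L1 (lasso) penalty; cv.glmnet cross-validates λ. X and Y are placeholders for your own data.

```r
library(glmnet)
Xs <- scale(X)   # standardize features: zero mean, unit variance, as advised above
fit_l2 <- cv.glmnet(Xs, Y, family = "binomial", alpha = 0)   # L2 / ridge
fit_l1 <- cv.glmnet(Xs, Y, family = "binomial", alpha = 1)   # L1 / lasso
coef(fit_l1, s = "lambda.min")   # sparse: many coefficients are exactly zero
```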
32. So far β¦
• Logistic regression
  • Model
  • Inference
  • Learning via maximum likelihood
  • L1 and L2 regularization
Next: SVMs!
33. Why did we use probability again?
• Aim: maximize "accuracy"
• Logistic regression: an indirect method that maximizes the likelihood of the data.
• A more direct approach is to maximize accuracy itself.
→ Support Vector Machines (SVMs)
35. Geometry review
The line $2x_1 + x_2 - 2 = 0$ separates Y = 1 from Y = -1.
For a point on the line, (0.5, 1): d = 2(0.5) + 1 - 2 = 0.
Signed "distance" to the line from $(x_1^0, x_2^0)$: $d = 2x_1^0 + x_2^0 - 2$.
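A quick numerical check of the signed distance; the two off-line points are made up for illustration:

```r
# Signed "distance" to the line 2*x1 + x2 - 2 = 0.
d <- function(x1, x2) 2 * x1 + x2 - 2
d(0.5, 1)   #  0 : on the line (the slide's example)
d(2, 2)     #  4 : positive side
d(0, 0)     # -2 : negative side
```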
38. Support Vector Machines
Normalized margin: canonical hyperplanes (figure © 2005-2007 Carlos Guestrin).
Support vectors are the points (x+ and x-) touching the margins.
39. Slack variables
(Figure © 2005-2007 Carlos Guestrin; the margin is computed from $w \cdot x = \sum_j w^{(j)} x^{(j)}$.)
• SVMs are made robust by adding "slack variables" ξᵢ that allow the training error to be non-zero.
• One for each data point; the slack variable is 0 for correctly classified points.
• Maximize the margin minus the slack penalty:
$$\max \; \gamma - C \sum_i \xi_i$$
40. Slack variables
$$\max \; \gamma - C \sum_i \xi_i$$
Need to tune C:
• high C == minimize mis-classifications
• low C == maximize the margin
41. SVM summary
• Model: w·x + b > 0 if y = +1; w·x + b < 0 if y = -1
• Inference: ŷ = sign(w·x + b)
• Learning: maximize { (margin) - C (sum of slack variables) }
Next: kernel SVMs
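A minimal linear-SVM sketch in R with the e1071 package (an interface to LIBSVM, which appears on the software slide). cost is the C from the slack-variable objective; X and Y are placeholders, and Y must be a factor for classification.

```r
library(e1071)
fit  <- svm(X, factor(Y), kernel = "linear", cost = 1)  # C = 1
yhat <- predict(fit, X)   # inference: sign(w.x + b), returned as class labels
```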
42. The kernel trick
• Why a linear separator? What if the data is not linearly separable?
• The kernel trick allows you to use SVMs with non-linear separators.
• Different kernels:
  1. Polynomial
  2. Gaussian
  3. Exponential
43. Logistic regression vs. linear SVM
Error ≈ 40% in both cases (demo figure).
44. Kernel SVM with a polynomial kernel of degree 2
• Polynomial kernels of degree 2 and 4 do very well, but degrees 3 and 5 do very badly.
• The Gaussian kernel has a tuning parameter (the bandwidth); performance depends on picking the right bandwidth.
• Error = 7% (demo figure).
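The demo's kernels, sketched with e1071 (same placeholder X and Y as before). In this library the Gaussian kernel is called "radial", and its gamma parameter plays the role of the inverse bandwidth; both gamma and cost need tuning.

```r
library(e1071)
fit_poly <- svm(X, factor(Y), kernel = "polynomial", degree = 2, cost = 1)
fit_rbf  <- svm(X, factor(Y), kernel = "radial", gamma = 0.5, cost = 1)
```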
45. SVMs summary
• Maximize the margin between positive and negative examples.
• The kernel trick is widely implemented, allowing non-linear decision surfaces.
• Not probabilistic :(
• Software:
  • SVM-light: http://svmlight.joachims.org/
  • LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • Weka, Matlab, R
47. Which to use ?
• Linear SVMs and logistic regression behave very similarly in most cases.
• Kernelized SVMs (mostly) work better than linear SVMs.
• Kernelized logistic regression is possible, but implementations are not easily available.
48. Recommendations
1. First, try logistic regression. Easy, fast, stable. No "tuning" parameters.
2. Equivalently, you can first try linear SVMs, but you need to tune C.
3. If the results are "good enough", stop.
4. Else, try SVMs with Gaussian kernels. Tune the bandwidth and C using validation data (a sketch follows below).
If you have more time/computational resources, try random forests as well.
** Recommendations are opinions of the presenter, and not known facts.
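One way to do the tuning in step 4, sketched with e1071's tune() helper, which grid-searches parameters using cross-validation. The grids below are arbitrary illustrations:

```r
library(e1071)
tuned <- tune(svm, train.x = X, train.y = factor(Y),
              kernel = "radial",
              ranges = list(gamma = 2^(-3:3), cost = 2^(-1:4)))
tuned$best.parameters   # the (gamma, cost) pair with the best CV error
```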
49. In conclusion β¦
• Logistic regression
• Support Vector Machines
Other classification approaches:
• Random forests / decision trees
• Naïve Bayes
• Nearest neighbors
• Boosting (AdaBoost)
52. Is this athlete doing drugs ?
• X = blood test to detect drugs
• Y = is the athlete doped?
• Two types of errors:
  • the athlete is doped and we predict "NO": a false negative
  • the athlete is NOT doped and we predict "YES": a false positive
• Penalize false positives more than false negatives (one way is sketched below)
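With a probabilistic classifier such as logistic regression, one simple way to penalize false positives more is to raise the decision threshold above 0.5. The fitted model, test data, and the 0.9 cutoff below are all hypothetical illustrations:

```r
# Only predict "doped" when the model is very confident.
p_doped <- predict(fit, newdata = test, type = "response")  # from a glm() fit
yhat    <- as.integer(p_doped > 0.9)  # higher cutoff => fewer false positives
```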
53. Outline
• What is classification?
  • Parameters, data, inference, learning
  • Predicting coin tosses (0-dimensional X)
• Logistic regression
  • Predicting "speaker success" (1-dimensional X)
  • Formulation, optimization
  • The decision surface is linear
  • Interpreting coefficients
  • Hypothesis testing
  • Evaluating the performance of the model
  • Why it is called "regression": log-odds
  • L2 regularization
  • Patient survival (d-dimensional X)
  • L1 regularization
• Support Vector Machines
  • Linear SVMs + formulation
  • What are "support vectors"
  • The kernel trick
• Demo: logistic regression v/s SVMs v/s kernel tricks
54. Overfitting: a more serious problem
2x + y - 2 = 0 → w = [2, 1, -2]
4x + 2y - 4 = 0 → w = [4, 2, -4]
400x + 200y - 400 = 0 → w = [400, 200, -400]
⇒ You absolutely need L2 regularization.
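A quick check of the point in R: all three weight vectors define the same line, but scaling w up makes the predicted probabilities saturate, which is exactly what the unregularized MLE does on linearly separable data:

```r
# A point with 2x + y - 2 = 0.6; the same w scaled by s = 1, 2, 200.
x <- c(1, 0.6)
for (s in c(1, 2, 200)) print(plogis(s * (2 * x[1] + 1 * x[2] - 2)))
# 0.646, 0.769, 1.000 : same decision line, increasingly extreme probabilities
```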