Evaluation
Machine Learning and Data Mining (Unit 14)




Prof. Pier Luca Lanzi
References

Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems, 2nd Edition.
Pang-Ning Tan, Michael Steinbach, Vipin Kumar, "Introduction to Data Mining", Addison Wesley.
Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", 2nd Edition.

Evaluation Issues

Which measures?
  Classification accuracy
  Total cost/benefit, when different errors involve different costs
  Lift and ROC curves
  Error in numeric predictions

How reliable are the predicted results?

Credibility Issues

How much should we believe in what was learned?
How predictive is the model we learned?

Error on the training data is not a good indicator of performance on future data.

Question: Why?
Answer:
  Since the classifier has been learned from the very same training data, any estimate based on that data will be optimistic.
  In addition, new data will probably not be exactly the same as the training data!

Overfitting: fitting the training data too precisely usually leads to poor results on new data.

Model Evaluation

Metrics for Performance Evaluation
  How to evaluate the performance of a model?

Methods for Performance Evaluation
  How to obtain reliable estimates?

Methods for Model Comparison
  How to compare the relative performance among competing models?

Model Selection
  Which model should be preferred?

Metrics for Performance Evaluation

Focus on the predictive capability of a model, rather than on how fast it classifies or is built, its scalability, etc.

Confusion Matrix:

                          PREDICTED CLASS
                          Yes       No
    ACTUAL CLASS   Yes    a (TP)    b (FN)
                   No     c (FP)    d (TN)

    a: TP (true positives)    b: FN (false negatives)
    c: FP (false positives)   d: TN (true negatives)

Metrics for Performance Evaluation (continued)

                          PREDICTED CLASS
                          Yes       No
    ACTUAL CLASS   Yes    a (TP)    b (FN)
                   No     c (FP)    d (TN)

The most widely-used metric is accuracy:

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

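Not part of the original slides: a minimal Python sketch of these definitions, assuming binary labels encoded as 1 (Yes) and 0 (No).

    # Tally the four confusion-matrix cells for binary labels.
    def confusion_matrix_2x2(y_true, y_pred):
        """Return (TP, FN, FP, TN) counts."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        return tp, fn, fp, tn

    # Accuracy = (TP + TN) / (TP + TN + FP + FN)
    def accuracy(y_true, y_pred):
        tp, fn, fp, tn = confusion_matrix_2x2(y_true, y_pred)
        return (tp + tn) / (tp + fn + fp + tn)

    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    print(confusion_matrix_2x2(y_true, y_pred))  # (2, 1, 1, 2)
    print(accuracy(y_true, y_pred))              # 0.666...
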
Limitation of Accuracy

Consider a 2-class problem:
  Number of Class 0 examples = 9990
  Number of Class 1 examples = 10

If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9%.

Accuracy is misleading here, because the model does not detect any class 1 example.

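A few lines (not in the original slides) checking this example: a "model" that always answers class 0 scores 99.9% accuracy while detecting nothing.

    # 9990 negatives, 10 positives; the model always predicts class 0
    y_true = [0] * 9990 + [1] * 10
    y_pred = [0] * 10000
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    detected_class1 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    print(accuracy)          # 0.999
    print(detected_class1)   # 0
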
Cost Matrix

                               PREDICTED CLASS
             C(i|j)            Yes           No
    ACTUAL   Yes               C(Yes|Yes)    C(No|Yes)
    CLASS    No                C(Yes|No)     C(No|No)

C(i|j): the cost of misclassifying a class j example as class i.

Computing the Cost of Classification

    Cost Matrix          PREDICTED CLASS
             C(i|j)      +      -
    ACTUAL   +           -1    100
    CLASS    -            1      0

    Model M1             PREDICTED CLASS
                         +      -
    ACTUAL   +           150    40
    CLASS    -            60   250

    Accuracy = 80%, Cost = 3910

    Model M2             PREDICTED CLASS
                         +      -
    ACTUAL   +           250    45
    CLASS    -             5   200

    Accuracy = 90%, Cost = 4255

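Not from the slides: a short sketch that reproduces the M1/M2 comparison above, assuming the confusion and cost matrices are stored as 2x2 nested lists indexed [actual][predicted].

    # conf[j][i] = number of class-j examples predicted as class i
    # cost[j][i] = C(i|j), the cost of predicting i for an actual j
    def total_cost(conf, cost):
        return sum(conf[j][i] * cost[j][i] for j in range(2) for i in range(2))

    cost = [[-1, 100],    # actual '+': C(+|+) = -1, C(-|+) = 100
            [ 1,   0]]    # actual '-': C(+|-) =  1, C(-|-) =   0
    m1 = [[150, 40], [60, 250]]
    m2 = [[250, 45], [ 5, 200]]
    print(total_cost(m1, cost))   # 3910
    print(total_cost(m2, cost))   # 4255
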
Cost vs Accuracy

    Count                PREDICTED CLASS
                         Yes    No
    ACTUAL   Yes         a      b
    CLASS    No          c      d

    Cost                 PREDICTED CLASS
                         Yes    No
    ACTUAL   Yes         p      q
    CLASS    No          q      p

Accuracy is proportional to cost if
  1. C(Yes|No) = C(No|Yes) = q
  2. C(Yes|Yes) = C(No|No) = p

Let N = a + b + c + d. Then:

    Accuracy = (a + d) / N

    Cost = p (a + d) + q (b + c)
         = p (a + d) + q (N - a - d)
         = q N - (q - p)(a + d)
         = N [ q - (q - p) × Accuracy ]

Cost-Sensitive Measures

    Precision p = a / (a + c) = TP / (TP + FP)
    (the higher the precision, the lower the number of false positives)

    Recall r = a / (a + b) = TP / (TP + FN)
    (the higher the recall, the lower the number of false negatives)

    F1 = 2rp / (r + p) = 2 TP / (2 TP + FP + FN)
    (the higher the F1, the lower the number of false positives and false negatives)

Precision is biased towards C(Yes|Yes) and C(Yes|No); recall is biased towards C(Yes|Yes) and C(No|Yes).
The F1-measure is biased towards all cells except C(No|No): it is high when both precision and recall are reasonably high.

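A minimal sketch (not in the original slides) of the three measures, evaluated on the confusion-matrix cells of model M1 from the cost example above.

    # precision = TP / (TP + FP), recall = TP / (TP + FN)
    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    # F1 is the harmonic mean of precision and recall
    def f1(tp, fp, fn):
        p, r = precision(tp, fp), recall(tp, fn)
        return 2 * p * r / (p + r)

    tp, fn, fp, tn = 150, 40, 60, 250   # model M1 above
    print(precision(tp, fp))   # 0.714...
    print(recall(tp, fn))      # 0.789...
    print(f1(tp, fp, fn))      # 0.75
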
Model Evaluation

Metrics for Performance Evaluation
  How to evaluate the performance of a model?

Methods for Performance Evaluation
  How to obtain reliable estimates?

Methods for Model Comparison
  How to compare the relative performance among competing models?

Model Selection
  Which model should be preferred?

Classification Step 1: Split the data into training and test sets

[Diagram: the available data is split into a training set and a testing set]

Classification Step 2: Build a model on the training set

[Diagram: the training set is fed to a model builder, which outputs a classifier (Y/N decisions)]

Classification Step 3: Evaluate on the test set (re-train?)

[Diagram: the model built on the training set produces predictions (+/-) for the testing set, which are then evaluated]

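The three steps above as a runnable Python sketch (not from the slides), under simple assumptions: a random 2/3-1/3 holdout split, and a trivial majority-class learner standing in for any model builder with fit/predict methods.

    import random

    # Step 1: random holdout split
    def holdout_split(data, train_fraction=2/3, seed=0):
        rows = data[:]
        random.Random(seed).shuffle(rows)
        cut = int(len(rows) * train_fraction)
        return rows[:cut], rows[cut:]

    # Step 2: "model builder" -- here, a majority-class classifier
    class MajorityClassifier:
        def fit(self, examples):               # examples: (features, label) pairs
            labels = [y for _, y in examples]
            self.majority = max(set(labels), key=labels.count)
            return self
        def predict(self, x):
            return self.majority

    # Step 3: evaluate on the held-out test set
    data = [((i,), i % 3 != 0) for i in range(90)]   # toy labeled dataset
    train, test = holdout_split(data)
    model = MajorityClassifier().fit(train)
    acc = sum(model.predict(x) == y for x, y in test) / len(test)
    print(f"test accuracy: {acc:.3f}")
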
A Note on Parameter Tuning

It is important that the test data is not used in any way to create the classifier.
Some learning schemes operate in two stages:
  Stage 1: build the basic structure
  Stage 2: optimize the parameter settings
The test data cannot be used for parameter tuning!
The proper procedure uses three sets: training data, validation data, and test data; the validation data is used to optimize the parameters.

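A sketch of the three-set discipline (not from the slides), on toy data where an instance is a (score, label) pair and the only parameter is a decision threshold; stage 1 (structure building on the training set) is trivial for this toy scorer, so only stage 2 is shown. The point is that the threshold is chosen on the validation set, and the test set is touched exactly once at the end.

    import random

    rng = random.Random(1)
    # toy data: the score is noisily related to the label
    data = []
    for _ in range(300):
        s = rng.random()
        data.append((s, s + rng.gauss(0, 0.2) > 0.5))
    train, valid, test = data[:180], data[180:240], data[240:]

    def accuracy_at(threshold, split):
        return sum((s > threshold) == y for s, y in split) / len(split)

    # stage 2: tune the threshold on the validation set only
    best_t = max((t / 20 for t in range(21)),
                 key=lambda t: accuracy_at(t, valid))
    print("chosen threshold:", best_t)
    print("test accuracy:", accuracy_at(best_t, test))   # reported once
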
Making the Most of the Data

Once evaluation is complete, all the data can be used to build the final classifier.
Generally, the larger the training data, the better the classifier (with diminishing returns).
The larger the test data, the more accurate the error estimate.

Classification Steps: Train, Validation, and Test

[Diagram: the data is split three ways; the model builder fits candidate models on the training set, their predictions on the validation set guide parameter tuning, and the final model is evaluated once on the testing set]

Methods for Performance Evaluation

How can we obtain a reliable estimate of performance?

The performance of a model may depend on factors other than the learning algorithm:
  Class distribution
  Cost of misclassification
  Size of the training and test sets

Learning Curve

[Plot: accuracy as a function of the number of training instances]

A learning curve shows how accuracy changes with varying sample size.
It requires a sampling schedule for creating the curve, e.g.:
  Arithmetic sampling
  Geometric sampling

Effects of a small sample size:
  Bias in the estimate
  Variance of the estimate

Methods of Estimation

Holdout
  Reserve 1/2 for training and 1/2 for testing, or
  Reserve 2/3 for training and 1/3 for testing
Random subsampling
  Repeated holdout
Cross validation
  Partition the data into k disjoint subsets
  k-fold: train on k-1 partitions, test on the remaining one
  Leave-one-out: k = n
Stratified sampling
  Oversampling vs undersampling
Bootstrap
  Sampling with replacement

Evaluation on "Small" Data

The holdout method reserves a certain amount of the data for testing and uses the remainder for training.
Usually, one third is reserved for testing and the rest is used for training.
For small or "unbalanced" datasets, the samples might not be representative; for instance, some classes may have few or no instances.
Stratified sampling
  An advanced version of balancing the data
  Makes sure that each class is represented with approximately equal proportions in both subsets

Repeated Holdout Method

The holdout estimate can be made more reliable by repeating the process with different subsamples:
  In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  The error rates of the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method.
Still not optimal: the different test sets overlap.

Cross-Validation

Cross-validation avoids overlapping test sets:
  First step: the data is split into k subsets of equal size
  Second step: each subset in turn is used for testing and the remainder for training

This is called k-fold cross-validation.
Often the subsets are stratified before cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.

More on Cross-Validation

The standard method of evaluation is stratified ten-fold cross-validation.
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation, e.g., ten-fold cross-validation repeated ten times with the results averaged (this further reduces the variance).
Other approaches also appear to be robust, e.g., 5x2 cross-validation.

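A from-scratch sketch of stratified k-fold cross-validation (not from the slides): each class's indices are dealt round-robin into k folds so every fold keeps roughly the overall class mix, and the k test-fold error estimates are averaged.

    import random
    from collections import defaultdict

    def stratified_kfold(labels, k=10, seed=0):
        """Return k lists of indices, each with roughly the overall class mix."""
        by_class = defaultdict(list)
        for i, y in enumerate(labels):
            by_class[y].append(i)
        folds = [[] for _ in range(k)]
        rng = random.Random(seed)
        for idxs in by_class.values():
            rng.shuffle(idxs)
            for j, i in enumerate(idxs):
                folds[j % k].append(i)
        return folds

    def cross_val_error(fit_predict, labels, k=10):
        """fit_predict(train_idx, test_idx) -> predicted labels for test_idx."""
        folds, errors = stratified_kfold(labels, k), []
        for fold in folds:
            train = [i for g in folds if g is not fold for i in g]
            preds = fit_predict(train, fold)
            errors.append(sum(p != labels[i] for p, i in zip(preds, fold)) / len(fold))
        return sum(errors) / k

    # usage with a majority-class learner on a 70/30 dataset
    y = [0] * 70 + [1] * 30
    def majority_fit_predict(train_idx, test_idx):
        train_labels = [y[i] for i in train_idx]
        maj = max(set(train_labels), key=train_labels.count)
        return [maj] * len(test_idx)
    print(cross_val_error(majority_fit_predict, y))   # ≈ 0.3
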
Leave-One-Out Cross-Validation

Leave-one-out is a particular form of cross-validation:
  Set the number of folds to the number of training instances
  That is, for n training instances, build the classifier n times
It makes the best use of the data and involves no random subsampling, but it is very computationally expensive.

Leave-One-Out and Stratification

A disadvantage of leave-one-out is that stratification is not possible.
In fact, it guarantees a non-stratified sample, because there is only one instance in the test set!
Extreme example: a random dataset split equally into two classes
  The best inducer predicts the majority class
  It achieves 50% accuracy on fresh data
  Yet its leave-one-out estimate is 100% error!

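The extreme example above is easy to verify in a few lines (not from the slides): with a 50/50 split, removing the held-out instance always makes its class the training minority, so the majority predictor is wrong every single time.

    # 50 instances of each class
    y = [0] * 50 + [1] * 50
    errors = 0
    for i in range(len(y)):
        train = y[:i] + y[i + 1:]                    # leave instance i out
        majority = max(set(train), key=train.count)  # majority of the other 99
        errors += (majority != y[i])
    print(errors / len(y))   # 1.0 -> 100% LOO error, true accuracy ~50%
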
The Bootstrap

Cross-validation uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set.
The bootstrap uses sampling with replacement to form the training set:
  Sample a dataset of n instances n times with replacement to form a new dataset of n instances
  Use this data as the training set
  Use the instances from the original dataset that do not occur in the new training set for testing

The 0.632 Bootstrap

A particular instance has a probability of 1 - 1/n of not being picked in one draw.
Thus its probability of ending up in the test data, i.e., of never being picked in n draws, is:

    (1 - 1/n)^n ≈ e^(-1) = 0.368

This means the training data will contain approximately 63.2% of the distinct instances.

Estimating Error with the Bootstrap

The error estimate on the test data will be rather pessimistic, since training was done on just ~63% of the distinct instances.
Therefore, it is combined with the resubstitution error:

    err = 0.632 × e_test_instances + 0.368 × e_training_instances

The resubstitution error gets less weight than the error on the test data.
The process is repeated several times with different replacement samples, and the results are averaged.

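A minimal sketch of the procedure (not from the slides), parameterized by a hypothetical error_fn callback so any learner can be plugged in; the weighting and the averaging over rounds follow the slide above.

    import random

    def bootstrap_632(error_fn, n, rounds=100, seed=0):
        """error_fn(train_idx, test_idx) -> error rate of the model trained
        on train_idx and evaluated on test_idx."""
        rng = random.Random(seed)
        estimates = []
        for _ in range(rounds):
            train = [rng.randrange(n) for _ in range(n)]      # with replacement
            in_train = set(train)
            oob = [i for i in range(n) if i not in in_train]  # out-of-bag test set
            e = 0.632 * error_fn(train, oob) + 0.368 * error_fn(train, train)
            estimates.append(e)
        return sum(estimates) / rounds

    # sanity check: a bootstrap sample covers ~63.2% of the distinct instances
    rng, n = random.Random(0), 100_000
    covered = {rng.randrange(n) for _ in range(n)}
    print(len(covered) / n)   # ≈ 0.632
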
More on the Bootstrap

The bootstrap is probably the best way of estimating performance for very small datasets.
However, it has some problems. Consider a random dataset with a 50% class distribution:
  A perfect memorizer will achieve 0% resubstitution error and ~50% error on the test data
  The bootstrap estimate for this classifier is:

      err = 0.632 × 50% + 0.368 × 0% = 31.6%

  whereas the true expected error is 50%.

Model Evaluation

Metrics for Performance Evaluation
  How to evaluate the performance of a model?

Methods for Performance Evaluation
  How to obtain reliable estimates?

Methods for Model Comparison
  How to compare the relative performance among competing models?

Model Selection
  Which model should be preferred?

ROC (Receiver Operating Characteristic)

Developed in the 1950s for signal detection theory, to analyze noisy signals; it characterizes the trade-off between positive hits and false alarms.
A ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis).
The performance of each classifier is represented as a point on the ROC curve: changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.

ROC Curve

[Plot: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive]

At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88

ROC Curve

Notable (TP, FP) points:
  (0,0): declare everything to be the negative class
  (1,1): declare everything to be the positive class
  (1,0): ideal

Diagonal line:
  Random guessing
  Below the diagonal line, the prediction is the opposite of the true class

Using ROC for Model Comparison

[Plot: the ROC curves of two models, M1 and M2, crossing each other]

No model consistently outperforms the other:
  M1 is better for small FPR
  M2 is better for large FPR

Area Under the ROC Curve (AUC):
  Ideal: area = 1
  Random guessing: area = 0.5

How to Construct an ROC Curve

    Instance   P(+|A)   True Class
       1        0.95        +
       2        0.93        +
       3        0.87        -
       4        0.85        -
       5        0.85        -
       6        0.85        +
       7        0.76        -
       8        0.53        +
       9        0.43        -
      10        0.25        +

Use a classifier that produces a posterior probability P(+|A) for each test instance A.
Sort the instances according to P(+|A) in decreasing order.
Apply a threshold at each unique value of P(+|A), and count the number of TP, FP, TN, FN at each threshold; then
  TP rate, TPR = TP / (TP + FN)
  FP rate, FPR = FP / (FP + TN)

How to Construct an ROC Curve

    Class            +     -     +     -     -     -     +     -     +     +
    Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
    TP                5     4     4     3     3     3     3     2     2     1     0
    FP                5     5     4     4     3     2     1     1     0     0     0
    TN                0     0     1     1     2     3     4     4     5     5     5
    FN                0     1     1     2     2     2     2     3     3     4     5
    TPR               1   0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2     0
    FPR               1     1   0.8   0.8   0.6   0.4   0.2   0.2     0     0     0

ROC Curve: [plot of the resulting (FPR, TPR) points]

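The threshold sweep above in a few lines of Python (not from the slides), using the ten instances from the earlier table; the printed points match the TPR/FPR rows.

    # sorted scores and true classes from the slide's table
    scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

    P = labels.count('+')                    # 5 positives
    N = labels.count('-')                    # 5 negatives
    points = [(0.0, 0.0)]                    # threshold above every score
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
        points.append((fp / N, tp / P))      # (FPR, TPR)
    for fpr, tpr in points:
        print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
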
Lift Charts

Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the model.
Cumulative gains and lift charts are visual aids for measuring model performance.
Both charts consist of a lift curve and a baseline; the greater the area between the lift curve and the baseline, the better the model.

Example: Direct Marketing

Consider a mass mailout of promotional offers to 1,000,000 households.
The proportion who normally respond is 0.1%, that is, 1000 responses.
A data mining tool can identify a subset of 100,000 households for which the response rate is 0.4%, that is, 400 responses.
In marketing terminology, this increase of the response rate is known as the lift factor yielded by the model (here, a lift factor of 4).
The same data mining tool may be able to identify 400,000 households for which the response rate is 0.2%, that is, 800 respondents, corresponding to a lift factor of 2.
The overall goal is to find subsets of test instances that have a high proportion of true positives.

Generating a Lift Chart

Given a scheme that outputs probabilities, sort the instances in descending order according to the predicted probability.
Select the instance subset starting from the one with the highest predicted probability.

    Rank   Predicted Probability   Actual Class
    1      0.95                    Yes
    2      0.93                    Yes
    3      0.93                    No
    4      0.88                    Yes
    ...    ...                     ...

In the lift chart, the x axis is the sample size and the y axis is the number of true positives.

Example: Direct Marketing

[Plot: cumulative gains / lift chart for the direct marketing example]

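A sketch of lift-chart generation (not from the slides): the first four (probability, class) rows come from the table two slides back; the remaining rows are made up here to complete the example.

    # (predicted probability, actual class), to be sorted descending
    preds = [(0.95, True), (0.93, True), (0.93, False), (0.88, True),
             (0.60, False), (0.55, True), (0.40, False), (0.20, False)]
    preds.sort(key=lambda r: r[0], reverse=True)

    total_pos = sum(actual for _, actual in preds)
    cum_tp = 0
    for size, (prob, actual) in enumerate(preds, start=1):
        cum_tp += actual
        baseline = size * total_pos / len(preds)   # expected TPs without a model
        print(f"sample size {size}: {cum_tp} true positives (baseline {baseline:.1f})")

Plotting cum_tp against the sample size gives the lift curve; the straight baseline is the sample size times the overall positive rate, and an informative ranking keeps the curve above it.
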
Tests of Significance

Given two models:
  Model M1: accuracy = 85%, tested on 30 instances
  Model M2: accuracy = 75%, tested on 5000 instances

Can we say M1 is better than M2?
  How much confidence can we place on the accuracies of M1 and M2?
  Can the difference in performance be explained as the result of random fluctuations in the test set?

Confidence Interval for Accuracy

A prediction can be regarded as a Bernoulli trial:
  A Bernoulli trial has 2 possible outcomes
  The possible outcomes for a prediction are: correct or wrong
  A collection of Bernoulli trials has a binomial distribution

Given x (the number of correct predictions) or, equivalently, acc = x/N for N test instances, can we predict p, the true accuracy of the model?

Confidence Interval for Accuracy

For large test sets (N > 30), the accuracy acc has a normal distribution with mean p and variance p(1-p)/N.

Confidence interval for p:

    P( Z_{α/2} < (acc - p) / sqrt( p(1-p)/N ) < Z_{1-α/2} ) = 1 - α

Solving the inequality for p yields:

    p = ( 2·N·acc + Z²_{α/2} ± Z_{α/2} · sqrt( Z²_{α/2} + 4·N·acc - 4·N·acc² ) ) / ( 2·(N + Z²_{α/2}) )

Confidence Interval for Accuracy

Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  N = 100, acc = 0.8
  Let 1 - α = 0.95 (95% confidence)
  From the probability table, Z_{α/2} = 1.96

    1-α     Z
    0.99    2.58
    0.98    2.33
    0.95    1.96
    0.90    1.65

The interval shrinks as the test set grows:

    N          50      100     500     1000    5000
    p(lower)   0.670   0.711   0.763   0.774   0.789
    p(upper)   0.888   0.866   0.833   0.824   0.811

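The table above can be reproduced with a short sketch (not from the slides) of the confidence-interval formula from the previous slide.

    import math

    def accuracy_interval(acc, n, z=1.96):
        """Interval for the true accuracy p, at the confidence level matching z."""
        center = 2 * n * acc + z * z
        spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
        denom = 2 * (n + z * z)
        return (center - spread) / denom, (center + spread) / denom

    for n in (50, 100, 500, 1000, 5000):
        lo, hi = accuracy_interval(0.8, n)
        print(f"N={n:>4}: [{lo:.3f}, {hi:.3f}]")
    # N= 100 gives [0.711, 0.866], matching the table above
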
Comparing the Performance of 2 Models

Given two models, say M1 and M2, which is better?
  M1 is tested on D1 (size = n1) and has error rate e1
  M2 is tested on D2 (size = n2) and has error rate e2
  Assume D1 and D2 are independent

If n1 and n2 are sufficiently large, then

    e1 ~ N(μ1, σ1)
    e2 ~ N(μ2, σ2)

and each variance can be approximated by

    σ̂²_i = e_i (1 - e_i) / n_i

Comparing the Performance of 2 Models

To test whether the performance difference is statistically significant, consider d = e1 - e2.
d ~ N(d_t, σ_t), where d_t is the true difference.
Since D1 and D2 are independent, their variances add up:

    σ²_t = σ²_1 + σ²_2 ≈ σ̂²_1 + σ̂²_2
         = e1(1 - e1)/n1 + e2(1 - e2)/n2

At the (1 - α) confidence level,

    d_t = d ± Z_{α/2} · σ̂_t

An Illustrative Example

Given M1 with n1 = 30, e1 = 0.15, and M2 with n2 = 5000, e2 = 0.25:
d = |e2 - e1| = 0.1 (2-sided test)

    σ̂²_d = 0.15 × (1 - 0.15) / 30 + 0.25 × (1 - 0.25) / 5000 = 0.0043

At the 95% confidence level, Z_{α/2} = 1.96, so

    d_t = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128

The interval contains 0, therefore the difference may not be statistically significant.

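A sketch reproducing this example (not from the slides): the normal-approximation interval for the difference of two error rates measured on independent test sets.

    import math

    def diff_interval(e1, n1, e2, n2, z=1.96):
        """Confidence interval for the true difference of two error rates."""
        var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
        d = abs(e1 - e2)
        half = z * math.sqrt(var)
        return d - half, d + half

    lo, hi = diff_interval(0.15, 30, 0.25, 5000)
    print(f"d_t in [{lo:.3f}, {hi:.3f}]")   # ≈ [-0.028, 0.228], contains 0
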
Comparing the Performance of 2 Algorithms

Each learning algorithm may produce k models:
  L1 may produce M11, M12, ..., M1k
  L2 may produce M21, M22, ..., M2k

If the models are evaluated on the same test sets D1, D2, ..., Dk (e.g., via cross-validation):
  For each set, compute d_j = e1j - e2j
  d_j has mean d_t and variance σ²_t, which can be estimated as

      σ̂²_t = Σ_{j=1..k} (d_j - d̄)² / ( k(k-1) )

  At the (1 - α) confidence level,

      d_t = d̄ ± t_{1-α, k-1} · σ̂_t

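A sketch of the paired comparison (not from the slides). The fold errors below are made-up numbers for illustration, and t_{1-α, k-1} is hard-coded for k = 10 folds at 95% confidence (2.262); in general a statistics table or library would supply it.

    import math

    def paired_interval(e1, e2, t_crit=2.262):
        """Interval for the true difference d_t from fold-wise errors."""
        k = len(e1)
        d = [a - b for a, b in zip(e1, e2)]
        d_bar = sum(d) / k
        var = sum((x - d_bar) ** 2 for x in d) / (k * (k - 1))
        half = t_crit * math.sqrt(var)
        return d_bar - half, d_bar + half

    e1 = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.13, 0.14]
    e2 = [0.10, 0.14, 0.12, 0.13, 0.11, 0.15, 0.11, 0.13, 0.12, 0.13]
    lo, hi = paired_interval(e1, e2)
    print(f"d_t in [{lo:.4f}, {hi:.4f}]")   # excludes 0 -> significant here
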
Model Evaluation

Metrics for Performance Evaluation
  How to evaluate the performance of a model?

Methods for Performance Evaluation
  How to obtain reliable estimates?

Methods for Model Comparison
  How to compare the relative performance among competing models?

Model Selection
  Which model should be preferred?

The MDL Principle

MDL stands for Minimum Description Length.
The description length is defined as:

    space required to describe a theory
    +
    space required to describe the theory's mistakes

In our case, the theory is the classifier and the mistakes are the errors on the training data.
Aim: we seek the classifier with the minimal description length.
The MDL principle is a model selection criterion.

Model Selection Criteria

Model selection criteria attempt to find a good compromise between:
  The complexity of a model
  Its prediction accuracy on the training data

Reasoning: a good model is a simple model that achieves high accuracy on the given data.
This is also known as Occam's Razor: the best theory is the smallest one that describes all the facts.

(William of Ockham, born in the village of Ockham in Surrey, England, about 1285, was the most influential philosopher of the 14th century and a controversial theologian.)

Elegance vs. Errors

Theory 1: a very simple, elegant theory that explains the data almost perfectly.
Theory 2: a significantly more complex theory that reproduces the data without mistakes.
Theory 1 is probably preferable.
Classical example: Kepler's three laws of planetary motion were less accurate than Copernicus's latest refinement of the Ptolemaic theory of epicycles, yet far simpler.

MDL and Compression

The MDL principle relates to data compression: the best theory is the one that compresses the data the most.
That is, to compress a dataset we generate a model, then store the model and its mistakes.
We need to compute
  (a) the size of the model, and
  (b) the space needed to encode the errors.
(b) is easy: use the informational loss function.
For (a), we need a method to encode the model.

MDL and Bayes' Theorem

L[T] = "length" of the theory
L[E|T] = training set encoded with respect to the theory
Description length = L[T] + L[E|T]

Bayes' theorem gives the a posteriori probability of a theory given the data:

    Pr[T|E] = Pr[E|T] Pr[T] / Pr[E]

Taking negative logarithms, this is equivalent to

    -log Pr[T|E] = -log Pr[E|T] - log Pr[T] + log Pr[E]

where log Pr[E] is a constant independent of the theory.

MDL and MAP

MAP stands for Maximum A Posteriori probability.
Finding the MAP theory corresponds to finding the MDL theory.
The difficult bit in applying the MAP principle is determining the prior probability Pr[T] of the theory.
This corresponds to the difficult part in applying the MDL principle: the coding scheme for the theory.
That is, if we know a priori that a particular theory is more likely, we need fewer bits to encode it.

Discussion of the MDL Principle

Advantage: it makes full use of the training data when selecting a model.
Disadvantage 1: appropriate coding schemes/prior probabilities for theories are crucial.
Disadvantage 2: there is no guarantee that the MDL theory is the one that minimizes the expected error.
Note: Occam's Razor is an axiom!
Epicurus' principle of multiple explanations: keep all theories that are consistent with the data.

Bayesian Model Averaging

Bayesian model averaging (BMA) reflects Epicurus' principle: all theories are used for prediction, weighted according to P(T|E).

Let I be a new instance whose class we must predict, let C be the random variable denoting the class, let E be the training data, and let the T_j be the possible theories. Then BMA gives the probability of C as:

    Pr[C | I, E] = Σ_j Pr[C | I, T_j] × Pr[T_j | E]

Which Classification Algorithm?

Among the many available algorithms, which one is the "best"?
  Some algorithms have a lower computational complexity
  Different algorithms provide different representations
  Some algorithms allow the specification of prior knowledge

If we are interested in generalization performance, are there any reasons to prefer one classifier over another?
Can we expect any classification method to be superior or inferior overall?
According to the No Free Lunch Theorem, the answer to all these questions is known.

No Free Lunch Theorem

If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one classification method over another.
If one algorithm seems to outperform another in a certain situation, it is a consequence of its fit to the particular problem, not of the general superiority of the algorithm.
When confronting a new problem, the theorem suggests that we should focus on the aspects that matter most:
  Prior information
  Data distribution
  Amount of training data
  Cost or reward
The theorem also justifies skepticism regarding studies that "demonstrate" the overall superiority of a certain algorithm.

No Free Lunch Theorem

"[A]ll algorithms that search for an extremum of a cost [objective] function perform exactly the same, when averaged over all possible cost functions." [1]

"[T]he average performance of any pair of algorithms across all possible problems is identical." [2]

[1] Wolpert, D.H., Macready, W.G. (1995), "No Free Lunch Theorems for Search", Technical Report SFI-TR-95-02-010, Santa Fe Institute.
[2] Wolpert, D.H., Macready, W.G. (1997), "No Free Lunch Theorems for Optimization", IEEE Transactions on Evolutionary Computation 1, 67.

No Free Lunch Theorem

Let
  F be the unknown target function (concept)
  n be the number of training examples
  ε(E|...) be the error of a learning algorithm given the data
  P_i(h|D) be the probability that algorithm i generates hypothesis h from the training data D

For any two learning algorithms P_1(h|D) and P_2(h|D):
  1. Uniformly averaged over all target functions F, ε1(E|F,n) - ε2(E|F,n) = 0
  2. For any fixed training set D, uniformly averaged over F, ε1(E|F,D) - ε2(E|F,D) = 0
  3. Uniformly averaged over all priors P(F), ε1(E|n) - ε2(E|n) = 0
  4. For any training set D, uniformly averaged over P(F), ε1(E|D) - ε2(E|D) = 0

No Free Lunch Theorem

Interpreting the first two results for two learning algorithms P_1(h|D) and P_2(h|D):

(1) No matter how cleverly we choose a "good" algorithm P_1(h|D) and a "bad" algorithm P_2(h|D), if all target functions are equally likely, the "good" algorithm will not outperform the "bad" one.

(2) Even if we know the training set D, when averaged over all target functions no learning algorithm will outperform another.


More Related Content

What's hot

Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsKush Kulshrestha
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset PreparationAndrew Ferlitsch
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IMachine Learning Valencia
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extractionskylian
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
 
Machine Learning
Machine LearningMachine Learning
Machine LearningVivek Garg
 
Feature selection
Feature selectionFeature selection
Feature selectionDong Guo
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningSangath babu
 
Feature selection concepts and methods
Feature selection concepts and methodsFeature selection concepts and methods
Feature selection concepts and methodsReza Ramezani
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for ClassificationPrakash Pimpale
 
Supervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine LearningSupervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine LearningSpotle.ai
 

What's hot (20)

Performance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning AlgorithmsPerformance Metrics for Machine Learning Algorithms
Performance Metrics for Machine Learning Algorithms
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms I
 
Lecture5 - C4.5
Lecture5 - C4.5Lecture5 - C4.5
Lecture5 - C4.5
 
Bayes Belief Networks
Bayes Belief NetworksBayes Belief Networks
Bayes Belief Networks
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
KNN
KNN KNN
KNN
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extraction
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
Feature selection concepts and methods
Feature selection concepts and methodsFeature selection concepts and methods
Feature selection concepts and methods
 
Support Vector Machines for Classification
Support Vector Machines for ClassificationSupport Vector Machines for Classification
Support Vector Machines for Classification
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
Supervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine LearningSupervised and Unsupervised Machine Learning
Supervised and Unsupervised Machine Learning
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 

More from Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i VideogiochiPier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiPier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomePier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaPier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Pier Luca Lanzi
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationPier Luca Lanzi
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationPier Luca Lanzi
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningPier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningPier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesPier Luca Lanzi
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationPier Luca Lanzi
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringPier Luca Lanzi
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringPier Luca Lanzi
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringPier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringPier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesPier Luca Lanzi
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsPier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesPier Luca Lanzi
 

More from Pier Luca Lanzi (20)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clusteringDMTM Lecture 12 Hierarchical clustering
DMTM Lecture 12 Hierarchical clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 

Recently uploaded

Indore Real Estate Market Trends Report.pdf
Indore Real Estate Market Trends Report.pdfIndore Real Estate Market Trends Report.pdf
Indore Real Estate Market Trends Report.pdfSaviRakhecha1
 
Log your LOA pain with Pension Lab's brilliant campaign
Log your LOA pain with Pension Lab's brilliant campaignLog your LOA pain with Pension Lab's brilliant campaign
Log your LOA pain with Pension Lab's brilliant campaignHenry Tapper
 
The Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdfThe Economic History of the U.S. Lecture 18.pdf
The Economic History of the U.S. Lecture 18.pdfGale Pooley
 
The Economic History of the U.S. Lecture 22.pdf
The Economic History of the U.S. Lecture 22.pdfThe Economic History of the U.S. Lecture 22.pdf
The Economic History of the U.S. Lecture 22.pdfGale Pooley
 
Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )
Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )
Vip Call US 📞 7738631006 ✅Call Girls In Sakinaka ( Mumbai )Pooja Nehwal
 
Gurley shaw Theory of Monetary Economics.
Gurley shaw Theory of Monetary Economics.Gurley shaw Theory of Monetary Economics.
Gurley shaw Theory of Monetary Economics.Vinodha Devi
 
The Economic History of the U.S. Lecture 25.pdf
The Economic History of the U.S. Lecture 25.pdfThe Economic History of the U.S. Lecture 25.pdf
The Economic History of the U.S. Lecture 25.pdfGale Pooley
 
Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...
Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...
Dharavi Russian callg Girls, { 09892124323 } || Call Girl In Mumbai ...Pooja Nehwal
 
The Economic History of the U.S. Lecture 30.pdf
The Economic History of the U.S. Lecture 30.pdfThe Economic History of the U.S. Lecture 30.pdf
The Economic History of the U.S. Lecture 30.pdfGale Pooley
 
00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptx00_Main ppt_MeetupDORA&CyberSecurity.pptx
00_Main ppt_MeetupDORA&CyberSecurity.pptxFinTech Belgium
 
VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...
VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...
VIP Call Girl Service Andheri West ⚡ 9920725232 What It Takes To Be The Best ...dipikadinghjn ( Why You Choose Us? ) Escorts
 
WhatsApp 📞 Call : 9892124323 ✅Call Girls In Chembur ( Mumbai ) secure service
WhatsApp 📞 Call : 9892124323  ✅Call Girls In Chembur ( Mumbai ) secure serviceWhatsApp 📞 Call : 9892124323  ✅Call Girls In Chembur ( Mumbai ) secure service
WhatsApp 📞 Call : 9892124323 ✅Call Girls In Chembur ( Mumbai ) secure servicePooja Nehwal
 
The Economic History of the U.S. Lecture 20.pdf
The Economic History of the U.S. Lecture 20.pdfThe Economic History of the U.S. Lecture 20.pdf
The Economic History of the U.S. Lecture 20.pdfGale Pooley
 
Call US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure service
Call US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure serviceCall US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure service
Call US 📞 9892124323 ✅ Kurla Call Girls In Kurla ( Mumbai ) secure servicePooja Nehwal
 
VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...
VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...VIP Independent Call Girls in Andheri 🌹 9920725232 ( Call Me ) Mumbai Escorts...
Machine Learning and Data Mining: 14 Evaluation and Credibility

  • 1. Evaluation Machine Learning and Data Mining (Unit 14) Prof. Pier Luca Lanzi
  • 2. References 2 Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems (Second Edition). Pang-Ning Tan, Michael Steinbach, Vipin Kumar, "Introduction to Data Mining", Addison Wesley. Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", 2nd Edition. Prof. Pier Luca Lanzi
  • 3. Evaluation issues 3 Which measures? Classification accuracy Total cost/benefit when different errors involve different costs Lift and ROC curves Error in numeric predictions How reliable are the predicted results? Prof. Pier Luca Lanzi
  • 4. Credibility issues 4 How much should we believe in what was learned? How predictive is the model we learned? Error on the training data is not a good indicator of performance on future data Question: Why? Answer: Since the classifier has been learned from the very same training data, any estimate based on that data will be optimistic. In addition, new data will probably not be exactly the same as the training data! Overfitting: fitting the training data too precisely usually leads to poor results on new data Prof. Pier Luca Lanzi
  • 5. Model Evaluation 5 Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models? Model selection Which model should we prefer? Prof. Pier Luca Lanzi
  • 6. Metrics for Performance Evaluation 6 Focus on the predictive capability of a model, rather than how fast it takes to classify or build models, scalability, etc. Confusion matrix:

                        PREDICTED CLASS
                        Yes       No
      ACTUAL     Yes    a         b
      CLASS      No     c         d

      a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative). Prof. Pier Luca Lanzi
  • 7. Metrics for Performance Evaluation… 7

                        PREDICTED CLASS
                        Yes       No
      ACTUAL     Yes    a (TP)    b (FN)
      CLASS      No     c (FP)    d (TN)

      Most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN). Prof. Pier Luca Lanzi
  • 8. Limitation of Accuracy 8 Consider a 2-class problem Number of Class 0 examples = 9990 Number of Class 1 examples = 10 If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % Accuracy is misleading because model does not detect any class 1 example Prof. Pier Luca Lanzi
  • 9. Cost Matrix 9

                          PREDICTED CLASS
              C(i|j)      Yes           No
      ACTUAL  Yes         C(Yes|Yes)    C(No|Yes)
      CLASS   No          C(Yes|No)     C(No|No)

      C(i|j): cost of misclassifying a class j example as class i. Prof. Pier Luca Lanzi
  • 10. Computing Cost of Classification 10

      Cost matrix:           PREDICTED CLASS
              C(i|j)         +        -
      ACTUAL  +              -1       100
      CLASS   -              1        0

      Model M1:  PREDICTED CLASS       Model M2:  PREDICTED CLASS
                 +      -                         +      -
      ACTUAL  +  150    40             ACTUAL  +  250    45
      CLASS   -  60     250            CLASS   -  5      200

      M1: Accuracy = 80%, Cost = 3910      M2: Accuracy = 90%, Cost = 4255

      Prof. Pier Luca Lanzi
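To make the arithmetic concrete, here is a minimal Python sketch (not part of the original slides; the function name and keyword defaults are mine) that reproduces the numbers above from the slide's cost matrix:

```python
def accuracy_and_cost(tp, fn, fp, tn, c_pp=-1, c_np=100, c_pn=1, c_nn=0):
    """Accuracy and total cost for a 2x2 confusion matrix.

    Cost weights default to the slide's matrix: C(+|+)=-1, C(-|+)=100,
    C(+|-)=1, C(-|-)=0.
    """
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    cost = tp * c_pp + fn * c_np + fp * c_pn + tn * c_nn
    return acc, cost

print(accuracy_and_cost(150, 40, 60, 250))  # M1: (0.80, 3910)
print(accuracy_and_cost(250, 45, 5, 200))   # M2: (0.90, 4255)
```

Note how the model with higher accuracy (M2) has the higher cost: the cost matrix penalizes false negatives a hundred times more than false positives.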
  • 11. Cost vs Accuracy 11 Accuracy is proportional to cost if (1) C(Yes|No) = C(No|Yes) = q and (2) C(Yes|Yes) = C(No|No) = p.

      Counts:    PREDICTED CLASS        Costs:     PREDICTED CLASS
                 Yes    No                         Yes    No
      ACTUAL Yes  a      b              ACTUAL Yes  p      q
      CLASS  No   c      d              CLASS  No   q      p

      N = a + b + c + d
      Accuracy = (a + d) / N
      Cost = p (a + d) + q (b + c)
           = p (a + d) + q (N – a – d)
           = q N – (q – p)(a + d)
           = N [q – (q – p) × Accuracy]

      Prof. Pier Luca Lanzi
  • 12. Cost-Sensitive Measures 12

      Precision p = a / (a + c) = TP / (TP + FP)
      Recall r = a / (a + b) = TP / (TP + FN)
      F1 = 2rp / (r + p) = 2·TP / (2·TP + FP + FN)

      The higher the precision, the lower the FPs; the higher the recall, the lower the FNs; the higher the F1, the lower both the FPs and the FNs. Precision is biased towards C(Yes|Yes) & C(Yes|No); recall is biased towards C(Yes|Yes) & C(No|Yes); the F1-measure is biased towards all except C(No|No): it is high when both precision and recall are reasonably high. Prof. Pier Luca Lanzi
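A minimal sketch of these three measures (my function name; applied here to model M1 from the previous slide):

```python
def precision_recall_f1(tp, fn, fp):
    """Cost-sensitive measures from confusion-matrix counts a=TP, b=FN, c=FP."""
    precision = tp / (tp + fp)           # a / (a + c)
    recall = tp / (tp + fn)              # a / (a + b)
    f1 = 2 * precision * recall / (precision + recall)  # 2*TP / (2*TP + FP + FN)
    return precision, recall, f1

# Model M1 (TP=150, FN=40, FP=60): precision ~0.714, recall ~0.789, F1 ~0.75
print(precision_recall_f1(tp=150, fn=40, fp=60))
```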
  • 13. Model Evaluation 13 Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models? Model selection Which model should be preferred? Prof. Pier Luca Lanzi
  • 14. Classification Step 1: 14 Split data into train and test sets. (Diagram: the data is partitioned into a training set and a testing set.) Prof. Pier Luca Lanzi
  • 15. Classification Step 2: 15 Build a model on the training set. (Diagram: the training set feeds a model builder, which produces a classifier.) Prof. Pier Luca Lanzi
  • 16. Classification Step 3: 16 Evaluate on the test set (re-train?). (Diagram: the classifier's predictions on the testing set are compared with the true labels to evaluate the model.) Prof. Pier Luca Lanzi
  • 17. A note on parameter tuning 17 It is important that the test data is not used in any way to create the classifier Some learning schemes operate in two stages: Stage 1: builds the basic structure Stage 2: optimizes parameter settings The test data can’t be used for parameter tuning! Proper procedure uses three sets: training data, validation data, and test data Validation data is used to optimize parameters Prof. Pier Luca Lanzi
  • 18. Making the most of the data 18 Once evaluation is complete, all the data can be used to build the final classifier Generally, the larger the training data the better the classifier (but returns diminish) The larger the test data the more accurate the error estimate Prof. Pier Luca Lanzi
  • 19. Classification Step: 19 Train, validation, and test. (Diagram: the training set feeds the model builder; the validation set is used to evaluate and tune candidate models; the testing set is used only for the final evaluation.) Prof. Pier Luca Lanzi
  • 20. Methods for Performance Evaluation 20 How to obtain a reliable estimate of performance? Performance of a model may depend on other factors besides the learning algorithm: Class distribution Cost of misclassification Size of training and test sets Prof. Pier Luca Lanzi
  • 21. Learning Curve 21 Learning curve shows how accuracy changes with varying sample size Requires a sampling schedule for creating learning curve: Arithmetic sampling Geometric sampling Effect of small sample size: Bias in the estimate Variance of estimate Prof. Pier Luca Lanzi
  • 22. Methods of Estimation 22
      Holdout: reserve ½ for training and ½ for testing, or 2/3 for training and 1/3 for testing
      Random subsampling: repeated holdout
      Cross validation: partition the data into k disjoint subsets; k-fold: train on k–1 partitions, test on the remaining one; leave-one-out: k = n
      Stratified sampling: oversampling vs undersampling
      Bootstrap: sampling with replacement
      Prof. Pier Luca Lanzi
  • 23. Evaluation on “small” data 23 The holdout method reserves a certain amount for testing and uses the remainder for training; usually, one third for testing, the rest for training. For small or “unbalanced” datasets, samples might not be representative; for instance, a sample may contain few or no instances of some classes. Stratified sampling is an advanced version of balancing the data: make sure that each class is represented with approximately equal proportions in both subsets. Prof. Pier Luca Lanzi
  • 24. Repeated holdout method 24 Holdout estimate can be made more reliable by repeating the process with different subsamples In each iteration, a certain proportion is randomly selected for training (possibly with stratification) The error rates on the different iterations are averaged to yield an overall error rate This is called the repeated holdout method Still not optimum: the different test sets overlap Prof. Pier Luca Lanzi
  • 25. Cross-validation 25 Avoids overlapping test sets First step: data is split into k subsets of equal size Second step: each subset in turn is used for testing and the remainder for training This is called k-fold cross-validation Often the subsets are stratified before cross-validation is performed The error estimates are averaged to yield an overall error estimate Prof. Pier Luca Lanzi
  • 26. More on cross-validation 26 Standard method for evaluation: stratified ten-fold cross-validation. Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate. Stratification reduces the estimate’s variance. Even better: repeated stratified cross-validation, e.g., ten-fold cross-validation repeated ten times with the results averaged (reduces the variance). Other approaches also appear to be robust, e.g., 5x2 cross-validation. Prof. Pier Luca Lanzi
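A minimal sketch of repeated stratified ten-fold cross-validation using scikit-learn; the library, dataset, and learner here are my choices for illustration, not the slides':

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Ten folds, repeated ten times with different shuffles; stratification keeps
# the class proportions roughly equal in every fold.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")  # averaged over 100 runs
```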
  • 27. Leave-One-Out cross-validation 27 It is a particular form of cross-validation Set number of folds to number of training instances I.e., for n training instances, build classifier n times Makes best use of the data Involves no random subsampling Very computationally expensive Prof. Pier Luca Lanzi
  • 28. Leave-One-Out and stratification 28 A disadvantage of Leave-One-Out is that stratification is not possible: it guarantees a non-stratified sample because there is only one instance in the test set! Extreme example: a random dataset split equally into two classes. The best inducer predicts the majority class, giving 50% accuracy on fresh data, but the Leave-One-Out-CV estimate is 100% error! (Removing the test instance makes its class the minority in the training set, so the majority-class prediction is always wrong.) Prof. Pier Luca Lanzi
  • 29. The bootstrap 29 Cross-validation uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set. The bootstrap uses sampling with replacement to form the training set: sample a dataset of n instances n times with replacement to form a new dataset of n instances; use this data as the training set, and use the instances from the original dataset that don’t occur in the new training set for testing. Prof. Pier Luca Lanzi
  • 30. The 0.632 bootstrap 30 A particular instance has a probability of 1 – 1/n of not being picked in one draw. Thus its probability of ending up in the test data, i.e., of never being picked in n draws, is (1 – 1/n)^n ≈ e^(–1) = 0.368. This means the training data will contain approximately 63.2% of the instances. Prof. Pier Luca Lanzi
  • 31. Estimating error with the bootstrap 31 The error estimate on the test data will be very pessimistic, since training was on just ~63% of the instances. Therefore, combine it with the resubstitution error: err = 0.632 · e_test + 0.368 · e_train. The resubstitution error gets less weight than the error on the test data. Repeat the process several times with different replacement samples and average the results. Prof. Pier Luca Lanzi
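A sketch of the 0.632 bootstrap estimate in NumPy, under the assumption that X and y are NumPy arrays and clf is any classifier with fit/predict methods (the function name is mine):

```python
import numpy as np

def bootstrap_632(clf, X, y, n_iterations=100, seed=0):
    """0.632 bootstrap error estimate, averaged over several resamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    estimates = []
    for _ in range(n_iterations):
        idx = rng.integers(0, n, size=n)          # sample n instances with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # ~36.8% stay out-of-bag for testing
        clf.fit(X[idx], y[idx])
        e_test = np.mean(clf.predict(X[oob]) != y[oob])    # error on unseen instances
        e_train = np.mean(clf.predict(X[idx]) != y[idx])   # resubstitution error
        estimates.append(0.632 * e_test + 0.368 * e_train)
    return float(np.mean(estimates))
```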
  • 32. More on the bootstrap 32 Probably the best way of estimating performance for very small datasets However, it has some problems Consider a random dataset with a 50% class distribution A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data Bootstrap estimate for this classifier: err = 0.632 ⋅ 50% + 0.368 ⋅ 0% = 31.6% True expected error: 50% Prof. Pier Luca Lanzi
  • 33. Model Evaluation 33 Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models? Model selection Which model should be preferred? Prof. Pier Luca Lanzi
  • 34. ROC (Receiver Operating Characteristic) 34 Developed in the 1950s for signal detection theory to analyze noisy signals; it characterizes the trade-off between positive hits and false alarms. The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis). The performance of each classifier is represented as a point on the ROC curve: changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point. Prof. Pier Luca Lanzi
  • 35. ROC Curve 35 A 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive. At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88. Prof. Pier Luca Lanzi
  • 36. ROC Curve 36 Points (TP, FP): (0,0) declares everything to be negative class; (1,1) declares everything to be positive class; (1,0) is ideal. The diagonal line corresponds to random guessing; below the diagonal line, the prediction is the opposite of the true class. Prof. Pier Luca Lanzi
  • 37. Using ROC for Model Comparison 37 No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR. Area Under the ROC Curve: ideal, area = 1; random guess, area = 0.5. Prof. Pier Luca Lanzi
  • 38. How to Construct an ROC curve 38 Use a classifier that produces a posterior probability P(+|A) for each test instance A. Sort the instances according to P(+|A) in decreasing order, apply a threshold at each unique value of P(+|A), and count the number of TP, FP, TN, FN at each threshold. TP rate, TPR = TP/(TP+FN); FP rate, FPR = FP/(FP+TN).

      Instance   P(+|A)   True Class
      1          0.95     +
      2          0.93     +
      3          0.87     -
      4          0.85     -
      5          0.85     -
      6          0.85     +
      7          0.76     -
      8          0.53     +
      9          0.43     -
      10         0.25     +

      Prof. Pier Luca Lanzi
  • 43. How to construct an ROC curve 43 Counts and rates at each threshold (the Class row lists the true classes in increasing score order):

      Class          +     -     +     -     -     -     +     -     +     +
      Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
      TP             5     4     4     3     3     3     3     2     2     1     0
      FP             5     5     4     4     3     2     1     1     0     0     0
      TN             0     0     1     1     2     3     4     4     5     5     5
      FN             0     1     1     2     2     2     2     3     3     4     5
      TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
      FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

      ROC Curve: (plot of the (FPR, TPR) pairs above) Prof. Pier Luca Lanzi
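The threshold sweep can be reproduced with a short sketch (variable names are mine; the scores and labels are the slide's ten instances; note that the three instances tied at 0.85 collapse into a single point here, matching the TP=3, FP=3 column):

```python
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

pos = labels.count('+')
neg = labels.count('-')
# Apply a threshold at each unique score (plus 1.00, where nothing is positive).
for t in sorted(set(scores + [1.0]), reverse=True):
    tp = sum(1 for s, c in zip(scores, labels) if s >= t and c == '+')
    fp = sum(1 for s, c in zip(scores, labels) if s >= t and c == '-')
    print(f"threshold >= {t:.2f}: TPR = {tp / pos:.1f}, FPR = {fp / neg:.1f}")
```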
  • 44. Lift Charts 44 Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model. Cumulative gains and lift charts are visual aids for measuring model performance Both charts consist of a lift curve and a baseline The greater the area between the lift curve and the baseline, the better the model Prof. Pier Luca Lanzi
  • 45. Example: Direct Marketing 45 Mass mailout of promotional offers to 1,000,000 households. The proportion who normally respond is 0.1%, that is, 1,000 responses. A data mining tool can identify a subset of 100,000 households for which the response rate is 0.4%, that is, 400 responses; in marketing terminology, this four-fold increase of the response rate is known as the lift factor yielded by the model. The same data mining tool may be able to identify 400,000 households for which the response rate is 0.2%, that is, 800 respondents, corresponding to a lift factor of 2. The overall goal is to find subsets of test instances that have a high proportion of true positives. Prof. Pier Luca Lanzi
  • 46. Generating a lift chart 46 Given a scheme that outputs probability, sort the instances in descending order according to the predicted probability, then select the instance subset starting from the one with the highest predicted probability.

      Rank   Predicted Probability   Actual Class
      1      0.95                    Yes
      2      0.93                    Yes
      3      0.93                    No
      4      0.88                    Yes
      …      …                       …

      In a lift chart, the x axis is the sample size and the y axis is the number of true positives. Prof. Pier Luca Lanzi
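A sketch of the chart points; the four ranked rows come from the slide's table, the helper logic is mine, and it reports the proportion of true positives captured (cumulative gains) alongside the lift against the random-targeting baseline:

```python
ranked = [(0.95, 'Yes'), (0.93, 'Yes'), (0.93, 'No'), (0.88, 'Yes')]  # sorted descending

total_pos = sum(1 for _, c in ranked if c == 'Yes')
tp = 0
for i, (_, cls) in enumerate(ranked, start=1):
    tp += (cls == 'Yes')
    gain = tp / total_pos              # fraction of all positives captured so far
    lift = gain / (i / len(ranked))    # ratio to random targeting of the same size
    print(f"sample size {i}: cumulative gain {gain:.2f}, lift {lift:.2f}")
```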
  • 47. Example: Direct Marketing 47 Prof. Pier Luca Lanzi
  • 48. Test of Significance 48 Given two models: Model M1: accuracy = 85%, tested on 30 instances Model M2: accuracy = 75%, tested on 5000 instances Can we say M1 is better than M2? How much confidence can we place on accuracy of M1 and M2? Can the difference in performance measure be explained as a result of random fluctuations in the test set? Prof. Pier Luca Lanzi
  • 49. Confidence Interval for Accuracy 49 Prediction can be regarded as a Bernoulli trial; a Bernoulli trial has 2 possible outcomes, and the possible outcomes for a prediction are correct or wrong. A collection of Bernoulli trials has a binomial distribution. Given x (the number of correct predictions) or, equivalently, acc = x/N and N (the number of test instances), can we estimate p (the true accuracy of the model)? Prof. Pier Luca Lanzi
  • 50. Confidence Interval for Accuracy 50 For large test sets (N > 30), the accuracy acc has a normal distribution with mean p and variance p(1–p)/N:

      P( –Z_{α/2} < (acc – p) / sqrt( p(1–p)/N ) < Z_{1–α/2} ) = 1 – α

      Solving for p gives the confidence interval:

      p = [ 2·N·acc + Z²_{α/2} ± Z_{α/2} · sqrt( Z²_{α/2} + 4·N·acc – 4·N·acc² ) ] / [ 2·(N + Z²_{α/2}) ]

      Prof. Pier Luca Lanzi
  • 51. Confidence Interval for Accuracy 51 Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8. Let 1–α = 0.95 (95% confidence); from the probability table, Z_{α/2} = 1.96.

      1–α     Z
      0.99    2.58
      0.98    2.33
      0.95    1.96
      0.90    1.65

      N          50      100     500     1000    5000
      p(lower)   0.670   0.711   0.763   0.774   0.789
      p(upper)   0.888   0.866   0.833   0.824   0.811

      Prof. Pier Luca Lanzi
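The table can be reproduced with the interval formula from slide 50; a minimal sketch (function name is mine), matching the slide's values up to rounding:

```python
from math import sqrt

def acc_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p (slide 50's formula)."""
    centre = 2 * n * acc + z**2
    spread = z * sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (centre - spread) / denom, (centre + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = acc_interval(0.8, n)
    print(f"N = {n}: p in [{lo:.3f}, {hi:.3f}]")  # N = 100 gives [0.711, 0.867]
```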
  • 52. Comparing Performance of 2 Models 52 Given two models, say M1 and M2, which is better? M1 is tested on D1 (size = n1), with error rate e1; M2 is tested on D2 (size = n2), with error rate e2. Assume D1 and D2 are independent. If n1 and n2 are sufficiently large, then e1 ~ N(μ1, σ1) and e2 ~ N(μ2, σ2), where the variance can be approximated as σ̂i² = ei (1 – ei) / ni. Prof. Pier Luca Lanzi
  • 53. Comparing Performance of 2 Models 53 To test if the performance difference is statistically significant, consider d = e1 – e2; d ~ N(d_t, σ_t), where d_t is the true difference. Since D1 and D2 are independent, their variances add up:

      σ_t² = σ1² + σ2² ≅ σ̂1² + σ̂2² = e1(1 – e1)/n1 + e2(1 – e2)/n2

      At the (1–α) confidence level, d_t = d ± Z_{α/2} · σ̂_t. Prof. Pier Luca Lanzi
  • 54. An Illustrative Example 54 Given: M1 with n1 = 30, e1 = 0.15; M2 with n2 = 5000, e2 = 0.25. Then d = |e2 – e1| = 0.1 (2-sided test), and the estimated variance is

      σ̂_d² = 0.15(1 – 0.15)/30 + 0.25(1 – 0.25)/5000 = 0.0043

      At the 95% confidence level, Z_{α/2} = 1.96, so d_t = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128. The interval contains 0, therefore the difference may not be statistically significant. Prof. Pier Luca Lanzi
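The example can be checked numerically with a short sketch (function name is mine):

```python
from math import sqrt

def diff_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e2 - e1)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2  # estimated variance of d
    return d - z * sqrt(var), d + z * sqrt(var)

lo, hi = diff_interval(e1=0.15, n1=30, e2=0.25, n2=5000)
print(f"d_t in [{lo:.3f}, {hi:.3f}]")  # [-0.028, 0.228]: contains 0, not significant
```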
  • 55. Comparing Performance of 2 Algorithms 55 Each learning algorithm may produce k models: L1 may produce M11, M12, …, M1k; L2 may produce M21, M22, …, M2k. If the models are generated on the same test sets D1, D2, …, Dk (e.g., via cross-validation), then for each set compute d_j = e1j – e2j; d_j has mean d_t and variance σ_t². Estimate:

      σ̂_t² = Σ_{j=1}^{k} (d_j – d̄)² / ( k (k – 1) )

      d_t = d̄ ± t_{1–α, k–1} · σ̂_t

      Prof. Pier Luca Lanzi
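A sketch of this paired comparison across k folds; the per-fold error arrays below are hypothetical, and I use the two-sided Student-t quantile from SciPy:

```python
import numpy as np
from scipy.stats import t

def paired_interval(e1, e2, alpha=0.05):
    """Confidence interval for the mean per-fold difference d_j = e1_j - e2_j."""
    d = np.asarray(e1) - np.asarray(e2)
    k = len(d)
    var = np.sum((d - d.mean()) ** 2) / (k * (k - 1))   # slide 55's variance estimate
    half = t.ppf(1 - alpha / 2, k - 1) * np.sqrt(var)
    return d.mean() - half, d.mean() + half

e1 = [0.12, 0.10, 0.15, 0.11, 0.13, 0.14, 0.10, 0.12, 0.13, 0.11]  # hypothetical
e2 = [0.11, 0.12, 0.14, 0.12, 0.12, 0.13, 0.11, 0.13, 0.12, 0.12]  # hypothetical
print(paired_interval(e1, e2))  # an interval containing 0 => not significant
```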
  • 56. Model Evaluation 56 Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to compare the relative performance among competing models? Model selection Which model should be preferred? Prof. Pier Luca Lanzi
  • 57. The MDL principle 57 MDL stands for minimum description length The description length is defined as: space required to describe a theory + space required to describe the theory’s mistakes In our case the theory is the classifier and the mistakes are the errors on the training data Aim: we seek a classifier with minimal DL MDL principle is a model selection criterion Prof. Pier Luca Lanzi
  • 58. Model selection criteria 58 Model selection criteria attempt to find a good compromise between the complexity of a model and its prediction accuracy on the training data. Reasoning: a good model is a simple model that achieves high accuracy on the given data. Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts. William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian. Prof. Pier Luca Lanzi
  • 59. Elegance vs. errors 59 Theory 1: a very simple, elegant theory that explains the data almost perfectly. Theory 2: a significantly more complex theory that reproduces the data without mistakes. Theory 1 is probably preferable. Classical example: Kepler’s three laws on planetary motion, which were less accurate than Copernicus’s latest refinement of the Ptolemaic theory of epicycles. Prof. Pier Luca Lanzi
  • 60. MDL and compression 60 The MDL principle relates to data compression: the best theory is the one that compresses the data the most. I.e., to compress a dataset we generate a model and then store the model and its mistakes. We need to compute (a) the size of the model, and (b) the space needed to encode the errors. (b) is easy: use the informational loss function. (a) requires a method to encode the model. Prof. Pier Luca Lanzi
  • 61. MDL and Bayes’s theorem 61 L[T] = “length” of the theory; L[E|T] = the training set encoded with respect to the theory; description length = L[T] + L[E|T]. Bayes’ theorem gives the a posteriori probability of a theory given the data:

      Pr[T|E] = Pr[E|T] · Pr[T] / Pr[E]

      Taking negative logarithms, –log Pr[T|E] = –log Pr[E|T] – log Pr[T] + log Pr[E], where log Pr[E] is a constant; so maximizing Pr[T|E] is equivalent to minimizing L[E|T] + L[T]. Prof. Pier Luca Lanzi
  • 62. MDL and MAP 62 MAP stands for maximum a posteriori probability Finding the MAP theory corresponds to finding the MDL theory Difficult bit in applying the MAP principle: determining the prior probability Pr[T] of the theory Corresponds to difficult part in applying the MDL principle: coding scheme for the theory I.e. if we know a priori that a particular theory is more likely we need less bits to encode it Prof. Pier Luca Lanzi
  • 63. Discussion of MDL principle 63 Advantage: makes full use of the training data when selecting a model Disadvantage 1: appropriate coding scheme/prior probabilities for theories are crucial Disadvantage 2: no guarantee that the MDL theory is the one which minimizes the expected error Note: Occam’s Razor is an axiom! Epicurus’ principle of multiple explanations: keep all theories that are consistent with the data Prof. Pier Luca Lanzi
  • 64. Bayesian model averaging 64 Reflects Epicurus’ principle: all theories are used for prediction, weighted according to P(T|E). Let I be a new instance whose class we must predict, C the random variable denoting the class, E the training data, and Tj the possible theories. Then BMA gives the probability of C given I and E as: P(C|I,E) = Σ_j P(C|I,Tj) · P(Tj|E). Prof. Pier Luca Lanzi
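A toy illustration of this weighted average (all numbers below are made up for the sake of the example):

```python
posteriors = [0.5, 0.3, 0.2]    # P(T_j | E) for three candidate theories
class_probs = [0.9, 0.6, 0.2]   # P(C = yes | I, T_j) for a new instance I

# BMA prediction: posterior-weighted average over all theories.
p_yes = sum(w * p for w, p in zip(posteriors, class_probs))
print(f"P(C = yes | I, E) = {p_yes:.2f}")  # 0.5*0.9 + 0.3*0.6 + 0.2*0.2 = 0.67
```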
  • 65. Which classification algorithm? 65 Among the several algorithms, which one is the “best”? Some algorithms have a lower computational complexity Different algorithms provide different representations Some algorithms allow the specification of prior knowledge If we are interested in the generalization performance, are there any reasons to prefer one classifier over another? Can we expect any classification method to be superior or inferior overall? According to the No Free Lunch Theorem, the answer to all these questions is known Prof. Pier Luca Lanzi
  • 66. No Free Lunch Theorem 66 If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one classification method over another. If one algorithm seems to outperform another in a certain situation, it is a consequence of its fit to the particular problem, not the general superiority of the algorithm. When confronting a new problem, this theorem suggests that we should focus on the aspects that matter most: prior information, the data distribution, the amount of training data, and the cost or reward. The theorem also justifies skepticism regarding studies that “demonstrate” the overall superiority of a certain algorithm. Prof. Pier Luca Lanzi
  • 67. No Free Lunch Theorem 67 “[A]ll algorithms that search for an extremum of a cost [objective] function perform exactly the same, when averaged over all possible cost functions.” [1] “[T]he average performance of any pair of algorithms across all possible problems is identical.” [2] [1] Wolpert, D.H., Macready, W.G. (1995), No Free Lunch Theorems for Search, Technical Report SFI-TR-95-02-010 (Santa Fe Institute). [2] Wolpert, D.H., Macready, W.G. (1997), No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation 1, 67. Prof. Pier Luca Lanzi
  • 68. No Free Lunch Theorem 68 Let F be the unknown concept, n the number of training examples, ε(E|D) the error of the learning algorithm given the data, and Pi(h|D) the probability that algorithm i generates hypothesis h from the training data D. For any two learning algorithms, P1(h|D) and P2(h|D): (1) uniformly averaged over all the target functions F, ε1(E|F,n) – ε2(E|F,n) = 0; (2) for any fixed training set D, uniformly averaged over F, ε1(E|F,D) – ε2(E|F,D) = 0; (3) uniformly averaged over all priors P(F), ε1(E|n) – ε2(E|n) = 0; (4) for any training set D, uniformly averaged over P(F), ε1(E|D) – ε2(E|D) = 0. Prof. Pier Luca Lanzi
  • 69. No Free Lunch Theorem 69 For any two learning algorithms, P1(h|D) and P2(h|D): (1) uniformly averaged over all the target functions F, ε1(E|F,n) – ε2(E|F,n) = 0; (2) for any fixed training set D, uniformly averaged over F, ε1(E|F,D) – ε2(E|F,D) = 0; (3) uniformly averaged over all priors P(F), ε1(E|n) – ε2(E|n) = 0; (4) for any training set D, uniformly averaged over P(F), ε1(E|D) – ε2(E|D) = 0. Result (1) means that no matter how cleverly we choose a “good” algorithm P1(h|D) and a “bad” algorithm P2(h|D), if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one. Result (2) means that even if we know D, then averaged over all the target functions no learning algorithm will outperform another one. Prof. Pier Luca Lanzi