Cross-validation for detecting and preventing overfitting

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University

www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © Andrew W. Moore                                                    Slide 1

A Regression Problem

y = f(x) + noise
Can we learn f from this data?

    [Scatter plot of noisy (x, y) data]

Let's consider three methods…

Copyright © Andrew W. Moore                                             Slide 2

Linear Regression

    [The same (x, y) data with a straight-line fit]

Copyright © Andrew W. Moore                       Slide 3

Linear Regression
Univariate linear regression with a constant term:

    X   Y
    3   7        X = [3 1 …]ᵀ    y = [7 3 …]ᵀ
    1   3
    :   :        x1 = (3), y1 = 7, …

(Originally discussed in the previous Andrew Lecture: "Neural Nets".)

Copyright © Andrew W. Moore                                               Slide 4

Linear Regression
Univariate linear regression with a constant term:

    X   Y
    3   7        X = [3 1 …]ᵀ    y = [7 3 …]ᵀ
    1   3
    :   :        x1 = (3), y1 = 7, …

Prepend a constant 1 to each input to get Z:

    Z = [1 3     y = [7
         1 1          3
         : :]         :]

    z1 = (1,3), …,   zk = (1, xk)

Copyright © Andrew W. Moore                                  Slide 5

Linear Regression
Univariate linear regression with a constant term:

    X   Y
    3   7        X = [3 1 …]ᵀ    y = [7 3 …]ᵀ
    1   3
    :   :        x1 = (3), y1 = 7, …

    Z = [1 3     y = [7
         1 1          3
         : :]         :]

    z1 = (1,3), …,   zk = (1, xk)

    β = (ZᵀZ)⁻¹(Zᵀy)

    y_est = β₀ + β₁x

Copyright © Andrew W. Moore                                                 Slide 6

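A minimal sketch of this computation in Python (the data values here are hypothetical; numpy assumed available):

    import numpy as np

    x = np.array([3.0, 1.0, 4.0, 2.0])         # hypothetical inputs
    y = np.array([7.0, 3.0, 9.0, 5.0])         # hypothetical outputs

    Z = np.column_stack([np.ones_like(x), x])  # z_k = (1, x_k)
    beta = np.linalg.solve(Z.T @ Z, Z.T @ y)   # β = (ZᵀZ)⁻¹(Zᵀy)
    y_est = beta[0] + beta[1] * x              # y_est = β0 + β1·x
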
Quadratic Regression

    [The same (x, y) data with a quadratic-curve fit]

Copyright © Andrew W. Moore                          Slide 7

Quadratic Regression

    X   Y
    3   7        X = [3 1 …]ᵀ    y = [7 3 …]ᵀ
    1   3
    :   :        x1 = (3), y1 = 7, …

(Much more about this in the future Andrew Lecture: "Favorite Regression Algorithms".)

    Z = [1 3 9     y = [7
         1 1 1          3
         : : :]         :]

    z = (1, x, x²)

    β = (ZᵀZ)⁻¹(Zᵀy)

    y_est = β₀ + β₁x + β₂x²

Copyright © Andrew W. Moore                                                   Slide 8

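The same normal-equations sketch extends to the quadratic case by adding an x² column to Z (again with hypothetical data; numpy assumed):

    import numpy as np

    x = np.array([3.0, 1.0, 4.0, 2.0])               # hypothetical inputs
    y = np.array([7.0, 3.0, 9.0, 5.0])               # hypothetical outputs

    Z = np.column_stack([np.ones_like(x), x, x**2])  # z = (1, x, x²)
    beta = np.linalg.solve(Z.T @ Z, Z.T @ y)         # β = (ZᵀZ)⁻¹(Zᵀy)
    y_est = Z @ beta                                 # y_est = β0 + β1·x + β2·x²
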
Join-the-dots

Also known as piecewise linear nonparametric regression, if that makes you feel better.

    [The same (x, y) data with consecutive points connected by line segments]

Copyright © Andrew W. Moore                                     Slide 9

Which is best?

    [Two of the fits shown side by side on the same data]

Why not choose the method with the best fit to the data?

Copyright © Andrew W. Moore                      Slide 10

What do we really want?

    [The same two fits shown side by side]

Why not choose the method with the best fit to the data?

"How well are you going to predict future data drawn from the same distribution?"

Copyright © Andrew W. Moore                                            Slide 11

The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.

    [Scatter plot with the test-set points marked]

Copyright © Andrew W. Moore                                          Slide 12

The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.

    [Linear regression example: the fit uses only the training points]

Copyright © Andrew W. Moore                                          Slide 13

The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

    (Linear regression example)
    Mean Squared Error = 2.4

Copyright © Andrew W. Moore                                          Slide 14

The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

    (Quadratic regression example)
    Mean Squared Error = 0.9

Copyright © Andrew W. Moore                                          Slide 15

The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

    (Join the dots example)
    Mean Squared Error = 2.2

Copyright © Andrew W. Moore                                          Slide 16

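A minimal sketch of the test set method for one regression method (numpy assumed; fit and predict are hypothetical helpers for whichever regressor is being scored):

    import numpy as np

    rng = np.random.default_rng(0)

    def test_set_mse(x, y, fit, predict, test_frac=0.30):
        """Hold out test_frac of the data, train on the rest, and
        return the mean squared error on the held-out points."""
        n = len(x)
        test = rng.permutation(n)[: int(test_frac * n)]  # 1. random 30% test set
        train = np.setdiff1d(np.arange(n), test)         # 2. the remainder trains
        model = fit(x[train], y[train])                  # 3. regress on training set
        resid = y[test] - predict(model, x[test])        # 4. score on the test set
        return np.mean(resid ** 2)
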
The test set method
Good news:
• Very very simple
• Can then simply choose the method with the best test-set score
Bad news:
• What's the downside?

Copyright © Andrew W. Moore                         Slide 17

The test set method
Good news:
• Very very simple
• Can then simply choose the method with the best test-set score
Bad news:
• Wastes data: we get an estimate of the best method to apply to 30% less data
• If we don't have much data, our test set might just be lucky or unlucky

(We say the "test-set estimator of performance has high variance".)

Copyright © Andrew W. Moore                                        Slide 18

LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (xk, yk) be the kth record.

    [Scatter plot of the full dataset]

Copyright © Andrew W. Moore                                  Slide 19

LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (xk, yk) be the kth record.
2. Temporarily remove (xk, yk) from the dataset.

    [The same plot with one point removed]

Copyright © Andrew W. Moore                                  Slide 20

LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (xk, yk) be the kth record.
2. Temporarily remove (xk, yk) from the dataset.
3. Train on the remaining R-1 datapoints.

    [The fit trained without the removed point]

Copyright © Andrew W. Moore                                  Slide 21

LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (xk, yk) be the kth record.
2. Temporarily remove (xk, yk) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (xk, yk).

    [The fit's error measured at the removed point]

Copyright © Andrew W. Moore                                  Slide 22

LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (xk, yk) be the kth record.
2. Temporarily remove (xk, yk) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (xk, yk).

When you've done all points, report the mean error.

Copyright © Andrew W. Moore                                  Slide 23

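A minimal sketch of the LOOCV loop (numpy assumed; fit and predict are hypothetical helpers for the regressor being scored):

    import numpy as np

    def loocv_mse(x, y, fit, predict):
        """For each record, train on the other R-1 points, note the
        squared error on the held-out point, and report the mean."""
        errs = []
        for k in range(len(x)):
            keep = np.arange(len(x)) != k          # temporarily remove (x_k, y_k)
            model = fit(x[keep], y[keep])          # train on remaining R-1 points
            pred = predict(model, x[k:k + 1])[0]   # predict the held-out point
            errs.append((y[k] - pred) ** 2)
        return np.mean(errs)                       # mean error over all points
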
LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
1. Let (xk, yk) be the kth record.
2. Temporarily remove (xk, yk) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (xk, yk).

When you've done all points, report the mean error.

    [Nine panels: the linear fit retrained with each point in turn left out]

MSE_LOOCV = 2.12

Copyright © Andrew W. Moore                                  Slide 24

LOOCV for Quadratic Regression

For k = 1 to R:
1. Let (xk, yk) be the kth record.
2. Temporarily remove (xk, yk) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (xk, yk).

When you've done all points, report the mean error.

    [Nine panels: the quadratic fit retrained with each point in turn left out]

MSE_LOOCV = 0.962

Copyright © Andrew W. Moore                                  Slide 25

LOOCV for Join The Dots

For k = 1 to R:
1. Let (xk, yk) be the kth record.
2. Temporarily remove (xk, yk) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (xk, yk).

When you've done all points, report the mean error.

    [Nine panels: the join-the-dots fit retrained with each point in turn left out]

MSE_LOOCV = 3.33

Copyright © Andrew W. Moore                                     Slide 26

Which kind of Cross Validation?

                   Downside                             Upside
  Test-set         Variance: unreliable estimate        Cheap
                   of future performance
  Leave-one-out    Expensive.                           Doesn't waste data
                   Has some weird behavior

…can we get the best of both worlds?

Copyright © Andrew W. Moore                                           Slide 27

k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k=3 partitions colored Red, Green and Blue).

    [Scatter plot with the points colored by partition]

Copyright © Andrew W. Moore                                         Slide 28

k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

    [Fit trained without the red points, scored on them]

Copyright © Andrew W. Moore                                          Slide 29

k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

    [Fit trained without the green points, scored on them]

Copyright © Andrew W. Moore                                           Slide 30

k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

    [Fit trained without the blue points, scored on them]

Copyright © Andrew W. Moore                                           Slide 31

k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Then report the mean error.

    (Linear Regression example)
    MSE_3FOLD = 2.05

Copyright © Andrew W. Moore                                           Slide 32

k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Then report the mean error.

    (Quadratic Regression example)
    MSE_3FOLD = 1.11

Copyright © Andrew W. Moore                                           Slide 33

k-fold Cross Validation

Randomly break the dataset into k partitions (in our example we'll have k=3 partitions colored Red, Green and Blue).

For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.

For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Then report the mean error.

    (Join-the-dots example)
    MSE_3FOLD = 2.93

Copyright © Andrew W. Moore                                           Slide 34

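A minimal sketch of k-fold cross-validation (numpy assumed; fit and predict are hypothetical helpers for the regressor being scored):

    import numpy as np

    rng = np.random.default_rng(0)

    def kfold_mse(x, y, fit, predict, k=3):
        """Randomly break the data into k partitions; for each
        partition, train on all the points not in it and sum the
        squared errors on its points; then report the mean error."""
        folds = np.array_split(rng.permutation(len(x)), k)
        total = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(len(x)), fold)
            model = fit(x[train], y[train])
            total += np.sum((y[fold] - predict(model, x[fold])) ** 2)
        return total / len(x)
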
Which kind of Cross Validation?

            Downside                                 Upside
  Test-set  Variance: unreliable estimate of         Cheap
            future performance
  Leave-    Expensive.                               Doesn't waste data
  one-out   Has some weird behavior
  10-fold   Wastes 10% of the data.                  Only wastes 10%. Only 10 times
            10 times more expensive than test set    more expensive instead of R times.
  3-fold    Wastier than 10-fold.                    Slightly better than test-set
            Expensivier than test set
  R-fold    Identical to Leave-one-out

Copyright © Andrew W. Moore                                                        Slide 35

Which kind of Cross Validation?

            Downside                                 Upside
  Test-set  Variance: unreliable estimate of         Cheap
            future performance
  Leave-    Expensive.                               Doesn't waste data
  one-out   Has some weird behavior
  10-fold   Wastes 10% of the data.                  Only wastes 10%. Only 10 times
            10 times more expensive than test set    more expensive instead of R times.
  3-fold    Wastier than 10-fold.                    Slightly better than test-set
            Expensivier than test set
  R-fold    Identical to Leave-one-out

But note: One of Andrew's joys in life is algorithmic tricks for making these cheap.

Copyright © Andrew W. Moore                                                      Slide 36

CV-based Model Selection
• We're trying to decide which algorithm to use.
• We train each machine and make a table…

    i    fi    TRAINERR    10-FOLD-CV-ERR    Choice
    1    f1
    2    f2
    3    f3                                    ⌦
    4    f4
    5    f5
    6    f6

Copyright © Andrew W. Moore                                     Slide 37

CV-based Model Selection
• Example: Choosing number of hidden units in a one-hidden-layer neural net.
• Step 1: Compute 10-fold CV error for six different model classes:

    Algorithm         TRAINERR    10-FOLD-CV-ERR    Choice
    0 hidden units
    1 hidden units
    2 hidden units                                    ⌦
    3 hidden units
    4 hidden units
    5 hidden units

• Step 2: Whichever model class gave best CV score: train it with all the data, and that's the predictive model you'll use.

Copyright © Andrew W. Moore                                            Slide 38

CV-based Model Selection
• Example: Choosing "k" for a k-nearest-neighbor regression.
• Step 1: Compute LOOCV error for six different model classes:

    Algorithm    TRAINERR    LOOCV-ERR    Choice
    K=1
    K=2
    K=3
    K=4                                     ⌦
    K=5
    K=6

• Step 2: Whichever model class gave best CV score: train it with all the data, and that's the predictive model you'll use.

Copyright © Andrew W. Moore                                            Slide 39

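A minimal sketch of this two-step selection for k-NN, reusing the loocv_mse sketch from the LOOCV slides (data arrays x and y, and the knn_fit/knn_predict helpers, are hypothetical):

    candidates = [1, 2, 3, 4, 5, 6]
    scores = {k: loocv_mse(x, y,
                           fit=lambda xs, ys, k=k: knn_fit(xs, ys, k),
                           predict=knn_predict)
              for k in candidates}            # Step 1: CV error per model class
    best_k = min(scores, key=scores.get)      # best CV score wins
    final_model = knn_fit(x, y, best_k)       # Step 2: retrain on ALL the data
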
CV-based Model Selection
• Example: Choosing "k" for a k-nearest-neighbor regression.
• Step 1: Compute LOOCV error for six different model classes:

    Algorithm    TRAINERR    LOOCV-ERR    Choice
    K=1
    K=2
    K=3
    K=4                                     ⌦
    K=5
    K=6

• Step 2: Whichever model class gave best CV score: train it with all the data, and that's the predictive model you'll use.

Q: Why did we use 10-fold-CV for neural nets and LOOCV for k-nearest neighbor?
A: The reason is computational. For k-NN (and all other nonparametric methods) LOOCV happens to be as cheap as regular predictions.

Q: And why stop at K=6?
A: No good reason, except it looked like things were getting worse as K was increasing.

Q: Are we guaranteed that a local optimum of K vs LOOCV will be the global optimum?
A: Sadly, no. And in fact, the relationship can be very bumpy.

Q: What should we do if we are depressed at the expense of doing LOOCV for K = 1 through 1000?
A: Idea One: try K=1, K=2, K=4, K=8, K=16, K=32, K=64 … K=1024.
   Idea Two: Hillclimbing from an initial guess at K.

Copyright © Andrew W. Moore                                                              Slide 40

CV-based Model Selection
    • Can you think of other decisions we can ask Cross
      Validation to make for us, based on other machine learning
      algorithms in the class so far?




Copyright © Andrew W. Moore                                        Slide 41
CV-based Model Selection
• Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far?
    • Degree of polynomial in polynomial regression
    • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier
    • The Kernel Width in Kernel Regression
    • The Kernel Width in Locally Weighted Regression
    • The Bayesian Prior in Bayesian Regression

These involve choosing the value of a real-valued parameter. What should we do?

Copyright © Andrew W. Moore                                                        Slide 42

CV-based Model Selection
• Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far?
    • Degree of polynomial in polynomial regression
    • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier
    • The Kernel Width in Kernel Regression
    • The Kernel Width in Locally Weighted Regression
    • The Bayesian Prior in Bayesian Regression

These involve choosing the value of a real-valued parameter. What should we do?

Idea One: Consider a discrete set of values (often best to consider a set of values with exponentially increasing gaps, as in the K-NN example).

Idea Two: Compute ∂LOOCV/∂Parameter and then do gradient descent.

Copyright © Andrew W. Moore                                                                Slide 43

CV-based Model Selection
• Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far?
    • Degree of polynomial in polynomial regression
    • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier
    • The Kernel Width in Kernel Regression
    • The Kernel Width in Locally Weighted Regression
    • The Bayesian Prior in Bayesian Regression
    (Also: the scale factors of a non-parametric distance metric.)

These involve choosing the value of a real-valued parameter. What should we do?

Idea One: Consider a discrete set of values (often best to consider a set of values with exponentially increasing gaps, as in the K-NN example).

Idea Two: Compute ∂LOOCV/∂Parameter and then do gradient descent.

Copyright © Andrew W. Moore                                                                Slide 44

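Idea One as a minimal sketch, here for a kernel width (data arrays x and y and the lwr_fit/lwr_predict helpers are hypothetical; loocv_mse is the earlier sketch):

    widths = [0.01 * 2 ** i for i in range(12)]   # exponentially increasing gaps
    cv_err = {w: loocv_mse(x, y,
                           fit=lambda xs, ys, w=w: lwr_fit(xs, ys, kernel_width=w),
                           predict=lwr_predict)
              for w in widths}
    best_width = min(cv_err, key=cv_err.get)
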
CV-based Algorithm Choice
• Example: Choosing which regression algorithm to use.
• Step 1: Compute 10-fold-CV error for six different model classes:

    Algorithm       TRAINERR    10-fold-CV-ERR    Choice
    1-NN
    10-NN
    Linear Reg'n
    Quad reg'n                                      ⌦
    LWR, KW=0.1
    LWR, KW=0.5

• Step 2: Whichever algorithm gave best CV score: train it with all the data, and that's the predictive model you'll use.

Copyright © Andrew W. Moore                                            Slide 45

Alternatives to CV-based model selection
• Model selection methods:
    1. Cross-validation
    2. AIC (Akaike Information Criterion)
    3. BIC (Bayesian Information Criterion)
    4. VC-dimension (Vapnik-Chervonenkis Dimension)

(VC-dimension is only directly applicable to choosing classifiers. AIC, BIC and VC-dimension are described in a future Lecture.)

Copyright © Andrew W. Moore                                                      Slide 46

Which model selection method is best?
1. (CV) Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. (SRMVC) Structural Risk Minimization with VC-dimension

• AIC, BIC and SRMVC advantage: you only need the training error.
• CV error might have more variance.
• SRMVC is wildly conservative.
• Asymptotically AIC and Leave-one-out CV should be the same.
• Asymptotically BIC and a carefully chosen k-fold should be the same.
• You want BIC if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding).
• Many alternatives---including proper Bayesian approaches.
• It's an emotional issue.

Copyright © Andrew W. Moore                                      Slide 47

Other Cross-validation issues
• Can do "leave all pairs out" or "leave-all-n-tuples-out" if feeling resourceful.
• Some folks do k-folds in which each fold is an independently-chosen subset of the data.
• Do you know what AIC and BIC are?
  If so…
  • LOOCV behaves like AIC asymptotically.
  • k-fold behaves like BIC if you choose k carefully.
  If not…
  • Nyardely nyardely nyoo nyoo

Copyright © Andrew W. Moore                                       Slide 48

Cross-Validation for regression
    • Choosing the number of hidden units in a
      neural net
    • Feature selection (see later)
    • Choosing a polynomial degree
    • Choosing which regressor to use




Copyright © Andrew W. Moore                      Slide 49
Supervising Gradient Descent
• This is a weird but common use of Test-set validation.
• Suppose you have a neural net with too many hidden units. It will overfit.
• As gradient descent progresses, maintain a graph of MSE-testset-error vs. Iteration.

    [Plot of Mean Squared Error vs. Iteration of Gradient Descent: the
     Training Set curve keeps falling while the Test Set curve falls,
     then rises again. Use the weights you found on the iteration where
     the Test Set curve is lowest.]

Copyright © Andrew W. Moore                                                              Slide 50

Supervising Gradient Descent
• This is a weird but common use of Test-set validation.
• Suppose you have a neural net with too many hidden units. It will overfit.
• As gradient descent progresses, maintain a graph of MSE-testset-error vs. Iteration.

Relies on an intuition that a not-fully-minimized set of weights is somewhat like having fewer parameters. Works pretty well in practice, apparently.

    [Same plot: use the weights you found on the iteration where the
     Test Set curve is lowest.]

Copyright © Andrew W. Moore                                                              Slide 51

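A minimal early-stopping sketch, assuming a hypothetical model object with gradient_step, mse and get_weights methods:

    def early_stopping(model, x_train, y_train, x_test, y_test, max_iters=10000):
        """Run gradient descent while tracking test-set MSE; return the
        weights from the iteration where the test-set error was lowest."""
        best_err, best_weights = float("inf"), None
        for _ in range(max_iters):
            model.gradient_step(x_train, y_train)   # one gradient-descent iteration
            err = model.mse(x_test, y_test)         # the MSE-testset-error graph
            if err < best_err:                      # remember the best iteration
                best_err, best_weights = err, model.get_weights()
        return best_weights
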
Cross-validation for classification
    • Instead of computing the sum squared
      errors on a test set, you should compute…




Copyright © Andrew W. Moore                       Slide 52
Cross-validation for classification
    • Instead of computing the sum squared
      errors on a test set, you should compute…
         The total number of misclassifications on
           a testset.




Copyright © Andrew W. Moore                          Slide 53
Cross-validation for classification
    • Instead of computing the sum squared
      errors on a test set, you should compute…
         The total number of misclassifications on
           a testset.
                              • What’s LOOCV of 1-NN?
                              • What’s LOOCV of 3-NN?
                              • What’s LOOCV of 22-NN?




Copyright © Andrew W. Moore                              Slide 54
Cross-validation for classification
    • Instead of computing the sum squared
      errors on a test set, you should compute…
         The total number of misclassifications on
           a testset.
    • But there’s a more sensitive alternative:
         Compute
                    log P(all test outputs|all test inputs, your model)




Copyright © Andrew W. Moore                                           Slide 55
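A minimal sketch of both classification scores (numpy assumed; predict and predict_proba are hypothetical helpers, where predict_proba returns P(y | x, model) for each test point):

    import numpy as np

    def classification_scores(model, x_test, y_test):
        """Return the misclassification count on the test set, plus the
        more sensitive log P(all test outputs | all test inputs, model)."""
        n_errors = np.sum(predict(model, x_test) != y_test)
        log_lik = np.sum(np.log(predict_proba(model, x_test, y_test)))
        return n_errors, log_lik
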
Cross-Validation for classification
    • Choosing the pruning parameter for decision
      trees
    • Feature selection (see later)
    • What kind of Gaussian to use in a Gaussian-
      based Bayes Classifier
    • Choosing which classifier to use




Copyright © Andrew W. Moore                     Slide 56
Cross-Validation for density estimation
• Compute the sum of log-likelihoods of test points.
Example uses:
• Choosing what kind of Gaussian assumption to use
• Choosing the density estimator
• NOT Feature selection (testset density will almost always look better with fewer features)

Copyright © Andrew W. Moore                        Slide 57

Feature Selection
• Suppose you have a learning algorithm LA and a set of input attributes { X1, X2, …, Xm }.
• You expect that LA will only find some subset of the attributes useful.
• Question: How can we use cross-validation to find a useful subset?
• Four ideas (the first is sketched below):
    • Forward selection
    • Backward elimination
    • Hill Climbing
    • Stochastic search (Simulated Annealing or GAs)

(Another fun area in which Andrew has spent a lot of his wild youth.)

Copyright © Andrew W. Moore                                              Slide 58

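A minimal sketch of greedy forward selection, assuming a hypothetical cv_score function that maps a feature subset to its cross-validation error (lower is better):

    def forward_selection(features, cv_score):
        """Repeatedly add whichever unused feature most improves the CV
        score; stop as soon as no addition helps."""
        chosen, best = [], cv_score([])
        while True:
            trials = {f: cv_score(chosen + [f])
                      for f in features if f not in chosen}
            if not trials:
                break                     # every feature already chosen
            f, err = min(trials.items(), key=lambda kv: kv[1])
            if err >= best:
                break                     # no remaining feature improves CV error
            chosen.append(f)
            best = err
        return chosen
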
Very serious warning
    • Intensive use of cross validation can overfit.
    • How?




    • What can be done about it?


Copyright © Andrew W. Moore                            Slide 59
Very serious warning
    • Intensive use of cross validation can overfit.
    • How?
                    • Imagine a dataset with 50 records and 1000
                      attributes.
                    • You try 1000 linear regression models, each one
                      using one of the attributes.




    • What can be done about it?



Copyright © Andrew W. Moore                                             Slide 60
Very serious warning
    • Intensive use of cross validation can overfit.
    • How?
                    • Imagine a dataset with 50 records and 1000
                      attributes.
                    • You try 1000 linear regression models, each one
                      using one of the attributes.
                    • The best of those 1000 looks good!



    • What can be done about it?



Copyright © Andrew W. Moore                                             Slide 61
Very serious warning
    • Intensive use of cross validation can overfit.
    • How?
                    • Imagine a dataset with 50 records and 1000
                      attributes.
                    • You try 1000 linear regression models, each one
                      using one of the attributes.
                    • The best of those 1000 looks good!
                    • But you realize it would have looked good even if the
                      output had been purely random!
    • What can be done about it?
                    • Hold out an additional testset before doing any model
                      selection. Check that the best model performs well even
                      on the additional testset.
                    • Or: Randomization Testing
Copyright © Andrew W. Moore                                                   Slide 62
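A small sketch of the warning itself, reusing the earlier loocv_mse sketch (numpy assumed; the data here are deliberately pure noise):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 1000))    # 50 records, 1000 random attributes
    y = rng.normal(size=50)            # a purely random output

    def fit(xs, ys):                   # one-attribute linear regression
        return np.polyfit(xs, ys, 1)

    def predict(model, xs):
        return np.polyval(model, xs)

    scores = [loocv_mse(X[:, j], y, fit, predict) for j in range(1000)]
    print(min(scores))   # the best of the 1000 can look deceptively good,
                         # even though the output was purely random
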
What you should know
    • Why you can’t use “training-set-error” to
      estimate the quality of your learning
      algorithm on your data.
    • Why you can’t use “training set error” to
      choose the learning algorithm
    • Test-set cross-validation
    • Leave-one-out cross-validation
    • k-fold cross-validation
    • Feature selection methods
    • CV for classification, regression & densities
Copyright © Andrew W. Moore                           Slide 63

More Related Content

What's hot

Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
Xueping Peng
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
Simplilearn
 
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Simplilearn
 

What's hot (20)

Transfer Learning
Transfer LearningTransfer Learning
Transfer Learning
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
 
Data Structure: Algorithm and analysis
Data Structure: Algorithm and analysisData Structure: Algorithm and analysis
Data Structure: Algorithm and analysis
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
 
Classification using back propagation algorithm
Classification using back propagation algorithmClassification using back propagation algorithm
Classification using back propagation algorithm
 
Random forest
Random forestRandom forest
Random forest
 
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
KNN Algorithm - How KNN Algorithm Works With Example | Data Science For Begin...
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagation
 
K-Means Algorithm
K-Means AlgorithmK-Means Algorithm
K-Means Algorithm
 
Backpropagation algo
Backpropagation  algoBackpropagation  algo
Backpropagation algo
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
AlexNet(ImageNet Classification with Deep Convolutional Neural Networks)
 
Machine Learning (Classification Models)
Machine Learning (Classification Models)Machine Learning (Classification Models)
Machine Learning (Classification Models)
 
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural...
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 

Similar to Cross-Validation

Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithms
guestfee8698
 
Overfit10
Overfit10Overfit10
Overfit10
okeee
 
Predicting Real-valued Outputs: An introduction to regression
Predicting Real-valued Outputs: An introduction to regressionPredicting Real-valued Outputs: An introduction to regression
Predicting Real-valued Outputs: An introduction to regression
guestfee8698
 
Information Gain
Information GainInformation Gain
Information Gain
guest32311f
 

Similar to Cross-Validation (12)

Eight Regression Algorithms
Eight Regression AlgorithmsEight Regression Algorithms
Eight Regression Algorithms
 
Probability density in data mining and coveriance
Probability density in data mining and coverianceProbability density in data mining and coveriance
Probability density in data mining and coveriance
 
Neural Networks
Neural NetworksNeural Networks
Neural Networks
 
Cross product
Cross productCross product
Cross product
 
Overfit10
Overfit10Overfit10
Overfit10
 
IVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methodsIVR - Chapter 5 - Bayesian methods
IVR - Chapter 5 - Bayesian methods
 
Predicting Real-valued Outputs: An introduction to regression
Predicting Real-valued Outputs: An introduction to regressionPredicting Real-valued Outputs: An introduction to regression
Predicting Real-valued Outputs: An introduction to regression
 
Fisicaimpulsivaingles
FisicaimpulsivainglesFisicaimpulsivaingles
Fisicaimpulsivaingles
 
Lesson 27: Integration by Substitution (worksheet)
Lesson 27: Integration by Substitution (worksheet)Lesson 27: Integration by Substitution (worksheet)
Lesson 27: Integration by Substitution (worksheet)
 
Information Gain
Information GainInformation Gain
Information Gain
 
CS571: Gradient Descent
CS571: Gradient DescentCS571: Gradient Descent
CS571: Gradient Descent
 
Gradient Descent
Gradient DescentGradient Descent
Gradient Descent
 

More from guestfee8698

Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
guestfee8698
 
Hidden Markov Models
Hidden Markov ModelsHidden Markov Models
Hidden Markov Models
guestfee8698
 
K-means and Hierarchical Clustering
K-means and Hierarchical ClusteringK-means and Hierarchical Clustering
K-means and Hierarchical Clustering
guestfee8698
 
Gaussian Mixture Models
Gaussian Mixture ModelsGaussian Mixture Models
Gaussian Mixture Models
guestfee8698
 
Short Overview of Bayes Nets
Short Overview of Bayes NetsShort Overview of Bayes Nets
Short Overview of Bayes Nets
guestfee8698
 
A Short Intro to Naive Bayesian Classifiers
A Short Intro to Naive Bayesian ClassifiersA Short Intro to Naive Bayesian Classifiers
A Short Intro to Naive Bayesian Classifiers
guestfee8698
 
Learning Bayesian Networks
Learning Bayesian NetworksLearning Bayesian Networks
Learning Bayesian Networks
guestfee8698
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networks
guestfee8698
 
Instance-based learning (aka Case-based or Memory-based or non-parametric)
Instance-based learning (aka Case-based or Memory-based or non-parametric)Instance-based learning (aka Case-based or Memory-based or non-parametric)
Instance-based learning (aka Case-based or Memory-based or non-parametric)
guestfee8698
 
Gaussian Bayes Classifiers
Gaussian Bayes ClassifiersGaussian Bayes Classifiers
Gaussian Bayes Classifiers
guestfee8698
 
Maximum Likelihood Estimation
Maximum Likelihood EstimationMaximum Likelihood Estimation
Maximum Likelihood Estimation
guestfee8698
 
Probability for Data Miners
Probability for Data MinersProbability for Data Miners
Probability for Data Miners
guestfee8698
 

More from guestfee8698 (16)

PAC Learning
PAC LearningPAC Learning
PAC Learning
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
VC dimensio
VC dimensioVC dimensio
VC dimensio
 
Hidden Markov Models
Hidden Markov ModelsHidden Markov Models
Hidden Markov Models
 
K-means and Hierarchical Clustering
K-means and Hierarchical ClusteringK-means and Hierarchical Clustering
K-means and Hierarchical Clustering
 
Gaussian Mixture Models
Gaussian Mixture ModelsGaussian Mixture Models
Gaussian Mixture Models
 
Short Overview of Bayes Nets
Short Overview of Bayes NetsShort Overview of Bayes Nets
Short Overview of Bayes Nets
 
A Short Intro to Naive Bayesian Classifiers
A Short Intro to Naive Bayesian ClassifiersA Short Intro to Naive Bayesian Classifiers
A Short Intro to Naive Bayesian Classifiers
 
Learning Bayesian Networks
Learning Bayesian NetworksLearning Bayesian Networks
Learning Bayesian Networks
 
Inference in Bayesian Networks
Inference in Bayesian NetworksInference in Bayesian Networks
Inference in Bayesian Networks
 
Bayesian Networks
Bayesian NetworksBayesian Networks
Bayesian Networks
 
Instance-based learning (aka Case-based or Memory-based or non-parametric)
Instance-based learning (aka Case-based or Memory-based or non-parametric)Instance-based learning (aka Case-based or Memory-based or non-parametric)
Instance-based learning (aka Case-based or Memory-based or non-parametric)
 
Gaussian Bayes Classifiers
Gaussian Bayes ClassifiersGaussian Bayes Classifiers
Gaussian Bayes Classifiers
 
Maximum Likelihood Estimation
Maximum Likelihood EstimationMaximum Likelihood Estimation
Maximum Likelihood Estimation
 
Gaussians
GaussiansGaussians
Gaussians
 
Probability for Data Miners
Probability for Data MinersProbability for Data Miners
Probability for Data Miners
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
Cross-Validation

  • 1. Cross-validation for detecting and preventing overfitting. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University. www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received. Copyright © Andrew W. Moore Slide 1
  • 2. A Regression Problem. y = f(x) + noise. Can we learn f from this data? Let’s consider three methods… [scatter plot of y against x] Copyright © Andrew W. Moore Slide 2
  • 3. Linear Regression. [linear fit to the same data] Copyright © Andrew W. Moore Slide 3
  • 4. Linear Regression. Univariate linear regression with a constant term. The data are pairs (X, Y) = (3, 7), (1, 3), …, i.e. x1 = (3), y1 = 7, and so on. (Originally discussed in the previous Andrew Lecture: “Neural Nets”.) Copyright © Andrew W. Moore Slide 4
  • 5. Linear Regression. Univariate linear regression with a constant term: prepend a constant 1 to every input, so Z has rows zk = (1, xk), e.g. z1 = (1, 3), while y = (7, 3, …) is unchanged. Copyright © Andrew W. Moore Slide 5
  • 6. Linear Regression. With Z built as above, solve β = (ZᵀZ)⁻¹(Zᵀy); the prediction is then yest = β0 + β1x. (A code sketch follows.) Copyright © Andrew W. Moore Slide 6
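To make the algebra concrete: a minimal numpy sketch (not from the slides; the data are made up) of fitting β = (ZᵀZ)⁻¹(Zᵀy) for univariate linear regression with a constant term:

    import numpy as np

    # Made-up toy data standing in for the slide's (X, Y) table.
    x = np.array([3.0, 1.0, 4.0, 2.0, 5.0])
    y = np.array([7.0, 3.0, 9.0, 5.0, 11.0])

    # Z has rows z_k = (1, x_k): the constant column gives the intercept.
    Z = np.column_stack([np.ones_like(x), x])

    # beta = (Z^T Z)^{-1} (Z^T y); lstsq solves the same least-squares problem.
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)

    def predict(x_new):
        # y_est = beta_0 + beta_1 * x
        return beta[0] + beta[1] * x_new

    print(beta, predict(3.0))

np.linalg.lstsq solves the same problem as the normal equations but is numerically better behaved than forming (ZᵀZ)⁻¹ explicitly.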
  • 7. Quadratic Regression. [quadratic fit to the same data] Copyright © Andrew W. Moore Slide 7
  • 8. Quadratic Regression. Same data, X = (3, 1, …), y = (7, 3, …), but now each input is expanded to z = (1, x, x²), so Z has rows like (1, 3, 9). Again β = (ZᵀZ)⁻¹(Zᵀy), and yest = β0 + β1x + β2x². (Much more about this in the future Andrew Lecture: “Favorite Regression Algorithms”.) (A code sketch follows.) Copyright © Andrew W. Moore Slide 8
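Quadratic regression then needs only a wider design matrix; a sketch under the same assumptions as above:

    import numpy as np

    x = np.array([3.0, 1.0, 4.0, 2.0, 5.0])
    y = np.array([7.0, 3.0, 9.0, 5.0, 11.0])

    # Rows z_k = (1, x_k, x_k^2), matching the slide's z = (1, x, x^2).
    Z = np.column_stack([np.ones_like(x), x, x**2])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)

    # y_est = beta_0 + beta_1 x + beta_2 x^2
    y_est = Z @ beta
    print(beta, y_est)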
  • 9. Join-the-dots. Also known as piecewise linear nonparametric regression, if that makes you feel better. [piecewise-linear fit through every datapoint] Copyright © Andrew W. Moore Slide 9
  • 10. Which is best? [the fits shown side by side] Why not choose the method with the best fit to the data? Copyright © Andrew W. Moore Slide 10
  • 11. What do we really want? [the same fits] Why not choose the method with the best fit to the data? Because what we really care about is: “How well are you going to predict future data drawn from the same distribution?” Copyright © Andrew W. Moore Slide 11
  • 12. The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. Copyright © Andrew W. Moore Slide 12
  • 13. The test set method, continued: 3. Perform your regression on the training set. (Linear regression example.) Copyright © Andrew W. Moore Slide 13
  • 14. The test set method, continued: 4. Estimate your future performance with the test set. (Linear regression example: Mean Squared Error = 2.4.) Copyright © Andrew W. Moore Slide 14
  • 15. The test set method, same four steps with quadratic regression: Mean Squared Error = 0.9. Copyright © Andrew W. Moore Slide 15
  • 16. The test set method, same four steps with join-the-dots: Mean Squared Error = 2.2. (A code sketch of the procedure follows.) Copyright © Andrew W. Moore Slide 16
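A sketch of the test-set method as just described (hypothetical 70/30 split, toy data, the same normal-equations regressor):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 40)
    y = 2 * x + 1 + rng.normal(0, 1, 40)   # toy data: y = f(x) + noise

    # 1-2. Randomly choose 30% as a test set; the remainder is the training set.
    idx = rng.permutation(len(x))
    n_test = int(0.3 * len(x))
    test, train = idx[:n_test], idx[n_test:]

    # 3. Perform your regression on the training set (linear, via least squares).
    Z = np.column_stack([np.ones_like(x[train]), x[train]])
    beta, *_ = np.linalg.lstsq(Z, y[train], rcond=None)

    # 4. Estimate future performance with the test-set mean squared error.
    pred = beta[0] + beta[1] * x[test]
    print(np.mean((pred - y[test]) ** 2))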
  • 17. The test set method. Good news: • Very very simple. • Can then simply choose the method with the best test-set score. Bad news: • What’s the downside? Copyright © Andrew W. Moore Slide 17
  • 18. The test set method. Good news: • Very very simple. • Can then simply choose the method with the best test-set score. Bad news: • Wastes data: we get an estimate of the best method to apply to 30% less data. • If we don’t have much data, our test set might just be lucky or unlucky. (We say the test-set estimator of performance has high variance.) Copyright © Andrew W. Moore Slide 18
  • 19. LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (xk, yk) be the kth record. Copyright © Andrew W. Moore Slide 19
  • 20. LOOCV, continued: 2. Temporarily remove (xk, yk) from the dataset. Copyright © Andrew W. Moore Slide 20
  • 21. LOOCV, continued: 3. Train on the remaining R-1 datapoints. Copyright © Andrew W. Moore Slide 21
  • 22. LOOCV, continued: 4. Note your error on (xk, yk). Copyright © Andrew W. Moore Slide 22
  • 23. LOOCV, continued: when you’ve done all points, report the mean error. Copyright © Andrew W. Moore Slide 23
  • 24. LOOCV, linear regression example: [nine panels, one fit per held-out point] MSELOOCV = 2.12. Copyright © Andrew W. Moore Slide 24
  • 25. LOOCV for Quadratic Regression: [nine panels, one fit per held-out point] MSELOOCV = 0.962. Copyright © Andrew W. Moore Slide 25
  • 26. LOOCV for Join The Dots: [nine panels, one fit per held-out point] MSELOOCV = 3.33. (A code sketch of the LOOCV loop follows.) Copyright © Andrew W. Moore Slide 26
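The LOOCV loop translates almost line for line into code; a sketch with the same toy setup (numpy assumed, data made up):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 30)
    y = 2 * x + 1 + rng.normal(0, 1, 30)

    def fit_linear(x, y):
        Z = np.column_stack([np.ones_like(x), x])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return beta

    def loocv_mse(x, y):
        R = len(x)
        errors = []
        for k in range(R):                       # for k = 1 to R
            mask = np.arange(R) != k             # temporarily remove (x_k, y_k)
            beta = fit_linear(x[mask], y[mask])  # train on the remaining R-1 points
            pred = beta[0] + beta[1] * x[k]
            errors.append((pred - y[k]) ** 2)    # note your error on (x_k, y_k)
        return np.mean(errors)                   # report the mean error

    print(loocv_mse(x, y))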
  • 27. Which kind of Cross Validation? Test-set: downside is variance (an unreliable estimate of future performance); upside is that it is cheap. Leave-one-out: downside is that it is expensive and has some weird behavior; upside is that it doesn’t waste data. …can we get the best of both worlds? Copyright © Andrew W. Moore Slide 27
  • 28. k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we’ll have k = 3 partitions, colored red, green and blue). Copyright © Andrew W. Moore Slide 28
  • 29. k-fold Cross Validation, continued. For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points. Copyright © Andrew W. Moore Slide 29
  • 30. k-fold Cross Validation, continued. For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points. Copyright © Andrew W. Moore Slide 30
  • 31. k-fold Cross Validation, continued. For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points. Copyright © Andrew W. Moore Slide 31
  • 32. k-fold Cross Validation, continued. Then report the mean error. Linear regression: MSE3FOLD = 2.05. Copyright © Andrew W. Moore Slide 32
  • 33. k-fold Cross Validation, continued. Quadratic regression: MSE3FOLD = 1.11. Copyright © Andrew W. Moore Slide 33
  • 34. k-fold Cross Validation, continued. Join-the-dots: MSE3FOLD = 2.93. (A code sketch follows.) Copyright © Andrew W. Moore Slide 34
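And k-fold CV differs only in how the held-out sets are formed; a sketch with k = 3 under the same assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 30)
    y = 2 * x + 1 + rng.normal(0, 1, 30)

    def fit_linear(x, y):
        Z = np.column_stack([np.ones_like(x), x])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return beta

    def kfold_mse(x, y, k=3, seed=0):
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(x)), k)  # random partitions
        total_sq_err = 0.0
        for fold in folds:
            mask = np.ones(len(x), dtype=bool)
            mask[fold] = False                   # train on points not in this partition
            beta = fit_linear(x[mask], y[mask])
            pred = beta[0] + beta[1] * x[fold]
            total_sq_err += np.sum((pred - y[fold]) ** 2)   # sum of errors on the fold
        return total_sq_err / len(x)             # mean error over all points

    print(kfold_mse(x, y, k=3))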
  • 35. Which kind of Cross Validation? Test-set: unreliable (high variance), but a cheap estimate of future performance. Leave-one-out: expensive and has some weird behavior, but doesn’t waste data. 10-fold: wastes 10% of the data and is 10 times more expensive than the test-set method; but it only wastes 10%, and it is only 10 times more expensive instead of R times. 3-fold: more wasteful than 10-fold and more expensive than the test-set method, but slightly better than test-set. R-fold: identical to leave-one-out. Copyright © Andrew W. Moore Slide 35
  • 36. Which kind of Cross Validation? (Same table as the previous slide.) But note: one of Andrew’s joys in life is algorithmic tricks for making these cheap. Copyright © Andrew W. Moore Slide 36
  • 37. CV-based Model Selection. • We’re trying to decide which algorithm to use. • We train each machine and make a table: one row per model fi (i = 1 … 6), with columns TRAINERR, 10-FOLD-CV-ERR, and Choice; the winning row is marked (⌦). Copyright © Andrew W. Moore Slide 37
  • 38. CV-based Model Selection. • Example: choosing the number of hidden units in a one-hidden-layer neural net. • Step 1: compute the 10-fold CV error for six different model classes (0, 1, 2, 3, 4 and 5 hidden units), marking the winner (⌦). • Step 2: whichever model class gave the best CV score: train it with all the data, and that’s the predictive model you’ll use. Copyright © Andrew W. Moore Slide 38
  • 39. CV-based Model Selection. • Example: choosing “k” for a k-nearest-neighbor regression. • Step 1: compute the LOOCV error for six different model classes (K = 1 … 6), marking the winner (⌦). • Step 2: whichever model class gave the best CV score: train it with all the data, and that’s the predictive model you’ll use. Copyright © Andrew W. Moore Slide 39
  • 40. CV-based Model Selection. Why did we use 10-fold-CV for neural nets and LOOCV for k-nearest neighbor? The reason is computational: for k-NN (and all other nonparametric methods) LOOCV happens to be as cheap as regular predictions. And why stop at K = 6? No good reason, except it looked like things were getting worse as K was increasing. Are we guaranteed that a local optimum of K vs. LOOCV will be the global optimum? Sadly, no; and in fact, the relationship can be very bumpy. What should we do if we are depressed at the expense of doing LOOCV for K = 1 through 1000? Idea One: K = 1, K = 2, K = 4, K = 8, K = 16, K = 32, K = 64 … K = 1024. Idea Two: hill-climbing from an initial guess at K. (A sketch of LOOCV over an exponentially spaced grid of K follows.) Copyright © Andrew W. Moore Slide 40
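A sketch of choosing K for k-NN regression by LOOCV over the exponentially spaced grid suggested above (toy 1-D data; the simple knn_predict helper is mine, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 60)
    y = np.sin(x) + rng.normal(0, 0.2, 60)

    def knn_predict(x_train, y_train, x0, k):
        # Average the k nearest training outputs (1-D Euclidean distance).
        nearest = np.argsort(np.abs(x_train - x0))[:k]
        return y_train[nearest].mean()

    def loocv_knn(x, y, k):
        R = len(x)
        errs = []
        for i in range(R):
            mask = np.arange(R) != i             # hold out point i
            pred = knn_predict(x[mask], y[mask], x[i], k)
            errs.append((pred - y[i]) ** 2)
        return np.mean(errs)

    # Idea One from the slide: a grid with exponentially increasing gaps.
    for k in [1, 2, 4, 8, 16, 32]:
        print(k, loocv_knn(x, y, k))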
  • 41. CV-based Model Selection • Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Copyright © Andrew W. Moore Slide 41
  • 42. CV-based Model Selection • Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? • Degree of polynomial in polynomial regression • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier. • The Kernel Width in Kernel Regression • The Kernel Width in Locally Weighted Regression • The Bayesian Prior in Bayesian Regression These involve choosing the value of a real-valued parameter. What should we do? Copyright © Andrew W. Moore Slide 42
  • 43. CV-based Model Selection. (Same list of decisions as the previous slide.) These involve choosing the value of a real-valued parameter. What should we do? Idea One: consider a discrete set of values (often best to consider a set of values with exponentially increasing gaps, as in the K-NN example). Idea Two: compute ∂LOOCV/∂parameter and then do gradient descent. Copyright © Andrew W. Moore Slide 43
  • 44. CV-based Model Selection. (Same list and ideas as the previous slide.) Also: the scale factors of a nonparametric distance metric. Copyright © Andrew W. Moore Slide 44
  • 45. CV-based Algorithm Choice. • Example: choosing which regression algorithm to use. • Step 1: compute the 10-fold-CV error for six different algorithms: 1-NN, 10-NN, linear regression, quadratic regression, LWR with KW = 0.1, and LWR with KW = 0.5, marking the winner (⌦). • Step 2: whichever algorithm gave the best CV score: train it with all the data, and that’s the predictive model you’ll use. Copyright © Andrew W. Moore Slide 45
  • 46. Alternatives to CV-based model selection. • Model selection methods: 1. Cross-validation. 2. AIC (Akaike Information Criterion). 3. BIC (Bayesian Information Criterion). 4. VC-dimension (Vapnik-Chervonenkis Dimension), only directly applicable to choosing classifiers. (Described in a future Lecture.) Copyright © Andrew W. Moore Slide 46
  • 47. Which model selection method is best? 1. (CV) Cross-validation. 2. AIC (Akaike Information Criterion). 3. BIC (Bayesian Information Criterion). 4. (SRMVC) Structural Risk Minimization with VC-dimension. • AIC, BIC and SRMVC have the advantage that you only need the training error. • CV error might have more variance. • SRMVC is wildly conservative. • Asymptotically AIC and leave-one-out CV should be the same. • Asymptotically BIC and a carefully chosen k-fold should be the same. • You want BIC if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding). • Many alternatives, including proper Bayesian approaches. • It’s an emotional issue. Copyright © Andrew W. Moore Slide 47
  • 48. Other Cross-validation issues • Can do “leave all pairs out” or “leave-all- ntuples-out” if feeling resourceful. • Some folks do k-folds in which each fold is an independently-chosen subset of the data • Do you know what AIC and BIC are? If so… • LOOCV behaves like AIC asymptotically. • k-fold behaves like BIC if you choose k carefully If not… • Nyardely nyardely nyoo nyoo Copyright © Andrew W. Moore Slide 48
  • 49. Cross-Validation for regression • Choosing the number of hidden units in a neural net • Feature selection (see later) • Choosing a polynomial degree • Choosing which regressor to use Copyright © Andrew W. Moore Slide 49
  • 50. Supervising Gradient Descent. • This is a weird but common use of test-set validation. • Suppose you have a neural net with too many hidden units. It will overfit. • As gradient descent progresses, maintain a graph of MSE-testset-error vs. iteration. [Plot: mean squared error against iteration of gradient descent; the training-set curve falls steadily while the test-set curve turns back up. Use the weights you found on the iteration where test-set error was lowest.] Copyright © Andrew W. Moore Slide 50
  • 51. Supervising Gradient Descent. (Same picture as the previous slide.) Relies on an intuition that a not-fully-minimized set of weights is somewhat like having fewer parameters. Works pretty well in practice, apparently. (A schematic sketch follows.) Copyright © Andrew W. Moore Slide 51
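A schematic of this early-stopping idea; to keep it short, the “network” below is just a linear model trained by gradient descent, standing in for the slide’s over-parameterized neural net (toy data and all names are mine):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 100)
    y = 3 * x + rng.normal(0, 0.1, 100)
    train, test = np.arange(70), np.arange(70, 100)

    w, b, lr = 0.0, 0.0, 0.1
    best_mse, best_w, best_b = np.inf, w, b
    for it in range(500):                        # iterations of gradient descent
        err = w * x[train] + b - y[train]
        w -= lr * 2 * np.mean(err * x[train])    # gradient step on training MSE
        b -= lr * 2 * np.mean(err)
        test_mse = np.mean((w * x[test] + b - y[test]) ** 2)
        if test_mse < best_mse:                  # keep the weights from the
            best_mse, best_w, best_b = test_mse, w, b   # iteration with lowest
                                                 # test-set error
    print(best_mse, best_w, best_b)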
  • 52. Cross-validation for classification. • Instead of computing the sum squared errors on a test set, you should compute… Copyright © Andrew W. Moore Slide 52
  • 53. Cross-validation for classification, continued: …the total number of misclassifications on a test set. Copyright © Andrew W. Moore Slide 53
  • 54. Cross-validation for classification, continued: the total number of misclassifications on a test set. • What’s LOOCV of 1-NN? • What’s LOOCV of 3-NN? • What’s LOOCV of 22-NN? Copyright © Andrew W. Moore Slide 54
  • 55. Cross-validation for classification, continued. • But there’s a more sensitive alternative: compute log P(all test outputs | all test inputs, your model). (A sketch of both scores follows.) Copyright © Andrew W. Moore Slide 55
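A sketch of both classification scores mentioned above; misclassification_count and test_log_likelihood are hypothetical helper names, and class_probs is assumed to hold the model’s per-class probabilities:

    import numpy as np

    def misclassification_count(y_true, y_pred):
        # Test-set score for classification: total number of misclassifications.
        return int(np.sum(y_true != y_pred))

    def test_log_likelihood(y_true, class_probs):
        # More sensitive alternative: log P(all test outputs | all test inputs,
        # model), where class_probs[i, c] is the model's P(class c | input i).
        eps = 1e-12                              # guard against log(0)
        return float(np.sum(np.log(class_probs[np.arange(len(y_true)), y_true] + eps)))

    # Hypothetical usage with any classifier's outputs:
    y_true = np.array([0, 1, 1, 0])
    y_pred = np.array([0, 1, 0, 0])
    probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])
    print(misclassification_count(y_true, y_pred), test_log_likelihood(y_true, probs))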
  • 56. Cross-Validation for classification • Choosing the pruning parameter for decision trees • Feature selection (see later) • What kind of Gaussian to use in a Gaussian- based Bayes Classifier • Choosing which classifier to use Copyright © Andrew W. Moore Slide 56
  • 57. Cross-Validation for density estimation. • Compute the sum of log-likelihoods of test points. Example uses: • Choosing what kind of Gaussian assumption to use. • Choosing the density estimator. • NOT feature selection (test-set density will almost always look better with fewer features). (A sketch follows.) Copyright © Andrew W. Moore Slide 57
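A sketch for density estimation: fit a single 1-D Gaussian on training data (an assumed, deliberately simple estimator) and score held-out points by their summed log-likelihood:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(5.0, 2.0, 200)
    train, test = data[:140], data[140:]

    # Fit a 1-D Gaussian density estimator on the training data.
    mu, sigma = train.mean(), train.std()

    # Sum of log-likelihoods of the test points: higher is better.
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                     - (test - mu) ** 2 / (2 * sigma**2))
    print(log_lik)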
  • 58. Feature Selection. • Suppose you have a learning algorithm LA and a set of input attributes { X1, X2, …, Xm }. • You expect that LA will only find some subset of the attributes useful. • Question: how can we use cross-validation to find a useful subset? • Four ideas: • Forward selection (sketched below). • Backward elimination. • Hill climbing. • Stochastic search (simulated annealing or GAs). (Another fun area in which Andrew has spent a lot of his wild youth.) Copyright © Andrew W. Moore Slide 58
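A sketch of the first idea, forward selection driven by cross-validation; cv_mse (a 3-fold linear-regression error) is a stand-in for whichever CV estimator and learner you actually use:

    import numpy as np

    def cv_mse(X, y, k=3, seed=0):
        # k-fold CV error of linear regression on the given feature columns.
        rng = np.random.default_rng(seed)
        total = 0.0
        for fold in np.array_split(rng.permutation(len(y)), k):
            mask = np.ones(len(y), dtype=bool)
            mask[fold] = False
            Z = np.column_stack([np.ones(mask.sum()), X[mask]])
            beta, *_ = np.linalg.lstsq(Z, y[mask], rcond=None)
            pred = np.column_stack([np.ones(len(fold)), X[fold]]) @ beta
            total += np.sum((pred - y[fold]) ** 2)
        return total / len(y)

    def forward_selection(X, y):
        chosen, best_score = [], np.inf
        while True:
            candidates = [j for j in range(X.shape[1]) if j not in chosen]
            if not candidates:
                return chosen
            scores = {j: cv_mse(X[:, chosen + [j]], y) for j in candidates}
            j, score = min(scores.items(), key=lambda kv: kv[1])
            if score >= best_score:              # stop when no attribute helps
                return chosen
            chosen.append(j)
            best_score = score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))
    y = 2 * X[:, 1] - X[:, 3] + rng.normal(0, 0.1, 60)
    print(forward_selection(X, y))   # likely picks columns 1 and 3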
  • 59. Very serious warning • Intensive use of cross validation can overfit. • How? • What can be done about it? Copyright © Andrew W. Moore Slide 59
  • 60. Very serious warning • Intensive use of cross validation can overfit. • How? • Imagine a dataset with 50 records and 1000 attributes. • You try 1000 linear regression models, each one using one of the attributes. • What can be done about it? Copyright © Andrew W. Moore Slide 60
  • 61. Very serious warning • Intensive use of cross validation can overfit. • How? • Imagine a dataset with 50 records and 1000 attributes. • You try 1000 linear regression models, each one using one of the attributes. • The best of those 1000 looks good! • What can be done about it? Copyright © Andrew W. Moore Slide 61
  • 62. Very serious warning. • Intensive use of cross validation can overfit. • How? • Imagine a dataset with 50 records and 1000 attributes. • You try 1000 linear regression models, each one using one of the attributes. • The best of those 1000 looks good! • But you realize it would have looked good even if the output had been purely random! • What can be done about it? • Hold out an additional test set before doing any model selection; check that the best model performs well even on the additional test set. • Or: randomization testing. (A small demonstration follows.) Copyright © Andrew W. Moore Slide 62
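The warning is easy to demonstrate: below, with a purely random output, the best of 1000 single-attribute models still looks impressive on the data used to select it, then evaporates on a fresh hold-out (squared correlation is used as a quick proxy for single-feature fit; all names and numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 1000))       # 50 records, 1000 attributes
    y = rng.normal(size=50)               # output is purely random

    # Score each single-attribute model by squared correlation with y.
    scores = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(1000)])
    best = scores.argmax()
    print("best attribute looks good:", scores[best])

    # But on an additional hold-out set, the 'discovery' evaporates.
    X2 = rng.normal(size=(50, 1000))
    y2 = rng.normal(size=50)
    print("same attribute on fresh data:", np.corrcoef(X2[:, best], y2)[0, 1] ** 2)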
  • 63. What you should know • Why you can’t use “training-set-error” to estimate the quality of your learning algorithm on your data. • Why you can’t use “training set error” to choose the learning algorithm • Test-set cross-validation • Leave-one-out cross-validation • k-fold cross-validation • Feature selection methods • CV for classification, regression & densities Copyright © Andrew W. Moore Slide 63