Machine Learning for Data Mining
Combining Models
Andres Mendez-Vazquez
July 20, 2015
Outline

1 Introduction
2 Bayesian Model Averaging
  Model Combination Vs. Bayesian Model Averaging
3 Committees
  Bootstrap Data Sets
4 Boosting
  AdaBoost Development
  Cost Function
  Selection Process
  Selecting New Classifiers
  Using our Deriving Trick
  AdaBoost Algorithm
  Some Remarks and Implementation Issues
  Explanation about AdaBoost's behavior
  Example
Introduction

Observation
It is often found that improved performance can be obtained by combining multiple classifiers together in some way.

Example: Committees
We might train L different classifiers and then make predictions using the average of the predictions made by each classifier.

Example: Boosting
It involves training multiple models in sequence, where the error function used to train a particular model depends on the performance of the previous models.
In addition

Instead of averaging the predictions of a set of models
You can use an alternative form of combination that selects one of the models to make the prediction.

Where
The choice of model is a function of the input variables.

Thus
Different models become responsible for making decisions in different regions of the input space.
We want this
Models in charge of different sets of inputs (figure).
Example: Decision Trees

We can have a decision tree on top of the models
Given a set of models, a model is chosen to make the decision in a certain area of the input space.

Limitation
It is based on hard splits in which only one model is responsible for making predictions for any given value.

Thus it is better to soften the combination
Suppose we have M classifiers for a conditional distribution p(t|x, k), where
x is the input variable,
t is the target variable,
k = 1, 2, ..., M indexes the classifiers.
This is used in the mixture of distributions

Thus (Mixture of Experts)

p(t|x) = Σ_{k=1}^{M} π_k(x) p(t|x, k)    (1)

where π_k(x) = p(k|x) are the input-dependent mixing coefficients.

This type of model
It can be viewed as a mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables; such models are known as mixtures of experts.
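A minimal sketch of Eq. (1), assuming a softmax gating network and Gaussian experts with linear means; these modelling choices are illustrative and are not prescribed by the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gaussian_pdf(t, mean, var):
    return np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_of_experts_density(x, t, gate_W, expert_W, expert_var):
    """Evaluate p(t|x) = sum_k pi_k(x) p(t|x, k), Eq. (1).

    gate_W:     (M, d) gating parameters, pi_k(x) = softmax(gate_W @ x)_k  (assumed form)
    expert_W:   (M, d) expert parameters, p(t|x, k) = N(t | expert_W[k] @ x, expert_var[k])
    expert_var: (M,)   expert noise variances
    """
    pi = softmax(gate_W @ x)                 # input-dependent mixing coefficients pi_k(x)
    means = expert_W @ x                     # each expert's prediction of t
    components = gaussian_pdf(t, means, expert_var)
    return float(np.sum(pi * components))    # Eq. (1)

# Example with M = 3 experts in d = 2 dimensions (made-up parameters)
rng = np.random.default_rng(0)
x = rng.normal(size=2)
print(mixture_of_experts_density(x, t=0.5,
                                 gate_W=rng.normal(size=(3, 2)),
                                 expert_W=rng.normal(size=(3, 2)),
                                 expert_var=np.ones(3)))
```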
Now

Although
Model combination and Bayesian model averaging look similar.

However
They are actually different.

For this
We have the following example.
Example

For this consider the following
A mixture of Gaussians with a binary latent variable z indicating to which component a point belongs:

p(x, z)    (2)

Thus
The corresponding density over the observed variable x is obtained by marginalization:

p(x) = Σ_z p(x, z)    (3)
In the case of Gaussian distributions

We have

p(x) = Σ_{k=1}^{M} π_k N(x | µ_k, Σ_k)    (4)

Now, for independent, identically distributed data X = {x_1, x_2, ..., x_N}

p(X) = Π_{n=1}^{N} p(x_n) = Π_{n=1}^{N} Σ_{z_n} p(x_n, z_n)    (5)

Now, suppose
We have several different models indexed by h = 1, ..., H with prior probabilities. For instance, one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions.
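A small sketch of Eqs. (4) and (5): evaluating the mixture density and the i.i.d. likelihood. The two-component mixture below is an illustrative assumption, not the example used in the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, pis, mus, Sigmas):
    # Eq. (4): weighted sum of Gaussian components evaluated at a single point x
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

def iid_log_likelihood(X, pis, mus, Sigmas):
    # Eq. (5), in log form for numerical stability
    return sum(np.log(mixture_density(x, pis, mus, Sigmas)) for x in X)

pis = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
X = np.random.default_rng(1).normal(size=(5, 2))
print(iid_log_likelihood(X, pis, mus, Sigmas))
```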
Bayesian Model Averaging

The marginal distribution is

p(X) = Σ_{h=1}^{H} p(X, h) = Σ_{h=1}^{H} p(X|h) p(h)    (6)

Observations
1 This is an example of Bayesian model averaging.
2 The summation over h means that just one model is responsible for generating the whole data set.
3 The probability over h simply reflects our uncertainty about which is the correct model to use.

Thus
As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models.
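A sketch of Eq. (6) and of the remark about posterior concentration, assuming two hypothetical one-dimensional models (a standard Gaussian and a standard Cauchy) with equal priors; the model choices and data are illustrative only.

```python
import numpy as np
from scipy.stats import norm, cauchy

def log_marginal_and_posterior(X, priors=(0.5, 0.5)):
    log_lik = np.array([norm.logpdf(X).sum(),      # log p(X | h=0)
                        cauchy.logpdf(X).sum()])   # log p(X | h=1)
    log_joint = log_lik + np.log(priors)           # log p(X, h)
    log_pX = np.logaddexp.reduce(log_joint)        # log p(X), Eq. (6)
    posterior = np.exp(log_joint - log_pX)         # p(h | X)
    return log_pX, posterior

rng = np.random.default_rng(0)
for n in (5, 50, 500):                             # more data -> posterior concentrates
    X = rng.normal(size=n)                         # data truly generated by the Gaussian model
    _, post = log_marginal_and_posterior(X)
    print(n, post.round(4))
```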
The Differences

Bayesian model averaging
The whole data set is generated by a single model.

Model combination
Different data points within the data set can potentially be generated from different values of the latent variable z, and hence by different components.
Committees

Idea
The simplest way to construct a committee is to average the predictions of a set of individual models.

Thinking as a frequentist
This comes from taking into consideration the trade-off between bias and variance, which decomposes the error of the model into:
The bias component, which arises from differences between the model and the true function to be predicted.
The variance component, which represents the sensitivity of the model to the individual data points.
For example

We can do the following
When we averaged a set of low-bias models, we obtained accurate predictions of the underlying sinusoidal function from which the data were generated.
However

Big Problem
We normally have a single data set.

Thus
We need to introduce certain variability between the different committee members.

One approach
You can use bootstrap data sets.
What to do with the Bootstrap Data Set

Given a Regression Problem
Here, we are trying to predict the value of a single continuous variable.

What?
Suppose our original data set consists of N data points X = {x_1, ..., x_N}.

Thus
We can create a new data set X_B by drawing N points at random from X, with replacement, so that some points in X may be replicated in X_B, whereas other points in X may be absent from X_B.
Furthermore

This is repeated M times
Thus, we generate M data sets, each of size N and each obtained by sampling from the original data set X.
Thus

Use each of them to train a copy y_m(x) of a predictive regression model to predict a single continuous variable. Then,

y_COM(x) = (1/M) Σ_{m=1}^{M} y_m(x)    (7)

This is also known as Bootstrap Aggregation or Bagging.
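A minimal bagging sketch for Eq. (7): train M copies of a regression model on bootstrap resamples and average their predictions. The cubic-polynomial base model and the sinusoidal toy data are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 30, 20
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=N)   # noisy targets

models = []
for m in range(M):
    idx = rng.integers(0, N, size=N)                   # draw N points with replacement
    coeffs = np.polyfit(x[idx], t[idx], deg=3)         # y_m(x): one bootstrap model
    models.append(coeffs)

def y_com(x_new):
    # Eq. (7): average of the individual bootstrap models
    return np.mean([np.polyval(c, x_new) for c in models], axis=0)

print(y_com(np.array([0.25, 0.5, 0.75])))
```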
What do we do with these samples?

Now, assume a true regression function h(x) and an estimate y_m(x)

y_m(x) = h(x) + ε_m(x)    (8)

The average sum-of-squares error over the data takes the form

E_x[{y_m(x) − h(x)}^2] = E_x[ε_m(x)^2]    (9)

What is E_x?
It denotes a frequentist expectation with respect to the distribution of the input vector x.
Meaning

Thus, the average error is

E_AV = (1/M) Σ_{m=1}^{M} E_x[ε_m(x)^2]    (10)

Similarly, the expected error of the committee is

E_COM = E_x[{(1/M) Σ_{m=1}^{M} (y_m(x) − h(x))}^2] = E_x[{(1/M) Σ_{m=1}^{M} ε_m(x)}^2]    (11)
Assume that the errors have zero mean and are uncorrelated

E_x[ε_m(x)] = 0
E_x[ε_m(x) ε_l(x)] = 0, for m ≠ l
Then

We have that

E_COM = (1/M^2) E_x[(Σ_{m=1}^{M} ε_m(x))^2]
      = (1/M^2) E_x[Σ_{m=1}^{M} ε_m^2(x) + Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} ε_h(x) ε_k(x)]
      = (1/M^2) {E_x[Σ_{m=1}^{M} ε_m^2(x)] + Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} E_x[ε_h(x) ε_k(x)]}
      = (1/M^2) E_x[Σ_{m=1}^{M} ε_m^2(x)]          (the cross terms vanish by the assumption above)
      = (1/M) · (1/M) E_x[Σ_{m=1}^{M} ε_m^2(x)]
We finally obtain

E_COM = (1/M) E_AV    (12)

Looks great, BUT!!!
Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated.
Thus

The Reality!!!
The errors are typically highly correlated, and the reduction in overall error is generally small.

Something Notable
However, it can be shown that the expected committee error will not exceed the expected error of the constituent models, so that

E_COM ≤ E_AV    (13)

We need something better
A more sophisticated technique known as boosting.
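A quick Monte Carlo check of Eqs. (10)-(13): with uncorrelated zero-mean errors the committee error is close to E_AV / M, while with highly correlated errors the reduction largely disappears, yet E_COM never exceeds E_AV. The equicorrelated Gaussian error model is an assumption made only for this demonstration.

```python
import numpy as np

def committee_vs_average_error(rho, M=10, n_points=200_000, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.full((M, M), rho) + (1 - rho) * np.eye(M)   # equicorrelated errors eps_m
    eps = rng.multivariate_normal(np.zeros(M), cov, size=n_points)
    E_av = np.mean(eps ** 2)                             # Eq. (10)
    E_com = np.mean(eps.mean(axis=1) ** 2)               # Eq. (11)
    return E_av, E_com

for rho in (0.0, 0.9):
    E_av, E_com = committee_vs_average_error(rho)
    print(f"rho={rho}: E_AV={E_av:.3f}  E_COM={E_com:.3f}  E_AV/M={E_av/10:.3f}")
```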
Boosting

What does Boosting do?
It combines several classifiers to produce a form of committee.

We will describe AdaBoost
"Adaptive Boosting", developed by Freund and Schapire (1995).
Sequential Training

Main difference between boosting and committee methods
The base classifiers are trained in sequence.

Explanation
Consider a two-class classification problem:
1 Samples x_1, x_2, ..., x_N
2 Binary labels (−1, 1): t_1, t_2, ..., t_N
Cost Function

Now
You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way.

Thus
For each pattern x_i, each expert classifier outputs a classification y_j(x_i) ∈ {−1, 1}.

The final decision of the committee of M experts is sign(C(x_i)), with

C(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + ... + α_M y_M(x_i)    (14)
Now

Adaptive Boosting
It works even with a continuum of classifiers.

However
For the sake of simplicity, we will assume that the set of experts is finite.
Getting the correct classifiers

We want the following
Review the possible member classifiers.
Select them if they have certain properties.
Assign a weight to their contribution to the set of experts.
Now

Selection is done in the following way
Test the classifiers in the pool using a training set T of N multidimensional data points x_i, where for each point x_i we have a label t_i = 1 or t_i = −1.

Assigning a cost to actions
We test and rank all classifiers in the expert pool by:
Charging a cost exp{β} any time a classifier fails (a miss).
Charging a cost exp{−β} any time a classifier provides the right label (a hit).
Remarks about β

We require β > 0
Thus misses are penalized more heavily than hits.

Although
It may look strange to penalize hits, this is fine as long as the penalty for a success is smaller than the penalty for a miss: exp{−β} < exp{β}.

Why?
If we assign a cost a to misses and a cost b to hits, where a > b > 0, we can always rewrite such costs as a = c^d and b = c^{−d} for suitable constants c and d. Thus, the exponential form does not compromise generality.
Exponential Loss Function

Something Notable
This kind of error function, different from the usual squared Euclidean distance to the classification target, is called an exponential loss function. AdaBoost uses the exponential loss as its error criterion.
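A small sketch contrasting the exponential loss with the squared error, for labels t in {−1, +1} and a real-valued committee output C(x). The sample values are made up for illustration.

```python
import numpy as np

def exponential_loss(t, C):
    return np.exp(-t * C)          # penalizes confident mistakes very heavily

def squared_loss(t, C):
    return (t - C) ** 2

t = np.array([+1, +1, -1, -1])
C = np.array([2.0, -0.5, -2.0, 0.5])   # committee scores: two hits, two misses
print("exp loss:", exponential_loss(t, C).round(3))
print("sq  loss:", squared_loss(t, C).round(3))
```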
The Matrix S

When we test the M classifiers in the pool, we build a matrix S
We record the misses (with a ONE) and the hits (with a ZERO) of each classifier. Row i in the matrix is reserved for the data point x_i; column m is reserved for the m-th classifier in the pool.

      Classifier 1   2   ...   M
x_1            0     1   ...   1
x_2            0     0   ...   1
x_3            1     1   ...   0
...
x_N            0     0   ...   0
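A sketch of how the miss matrix S can be built: S[i, m] = 1 if classifier m mislabels x_i and 0 otherwise. The random pool of label vectors below stands in for real classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 4
t = rng.choice([-1, 1], size=N)                   # true labels
pool_outputs = rng.choice([-1, 1], size=(N, M))   # y_m(x_i) for every classifier in the pool

S = (pool_outputs != t[:, None]).astype(int)      # 1 = miss, 0 = hit
print(S)
```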
Main Idea

What does AdaBoost want to do?
The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.

Thus
The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.

Thus, at the beginning of the iterations
All data samples are assigned the same weight: just 1, or 1/N if we want the weights to sum to 1.
The process of the weights

As the selection progresses
The more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.

Basically
The selection process concentrates on picking new classifiers for the committee by focusing on those which can help with the still misclassified examples.

Then
The best classifiers are those which can provide new insights to the committee.
The classifiers being selected should complement each other in an optimal way.
Selecting New Classifiers

What we want
In each iteration, we rank all classifiers, so that we can select the current best out of the pool.

At the m-th iteration
We have already included m − 1 classifiers in the committee and we want to select the next one.

Thus, we have the following cost function, which is actually the output of the committee:

C^(m−1)(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + ... + α_{m−1} y_{m−1}(x_i)    (15)
Thus, we have that

Extending the cost function

C^(m)(x_i) = C^(m−1)(x_i) + α_m y_m(x_i)    (16)

At the first iteration m = 1
C^(m−1) is the zero function.

Thus, the total cost, or total error, is defined as the exponential error

E = Σ_{i=1}^{N} exp{−t_i (C^(m−1)(x_i) + α_m y_m(x_i))}    (17)
Thus

We want to determine α_m and y_m in an optimal way. Thus, rewriting,

E = Σ_{i=1}^{N} w_i^(m) exp{−t_i α_m y_m(x_i)}    (18)

where, for i = 1, 2, ..., N,

w_i^(m) = exp{−t_i C^(m−1)(x_i)}    (19)
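A sketch of Eqs. (18)-(19): given the committee built so far, the data-point weights are w_i = exp{−t_i C^(m−1)(x_i)}, and the error of a candidate classifier with coefficient α_m is the weighted exponential loss. The toy arrays are illustrative.

```python
import numpy as np

def exponential_error(t, C_prev, y_candidate, alpha):
    w = np.exp(-t * C_prev)                               # Eq. (19)
    return np.sum(w * np.exp(-t * alpha * y_candidate))   # Eq. (18)

t = np.array([+1, -1, +1, +1, -1])                # labels
C_prev = np.array([0.8, -0.3, -0.2, 1.1, 0.4])    # committee output after m-1 rounds
y_cand = np.array([+1, -1, +1, +1, +1])           # candidate classifier's labels
print(exponential_error(t, C_prev, y_cand, alpha=0.5))
```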
Thus

In the first iteration, w_i^(1) = 1 for i = 1, ..., N. During later iterations, the vector w^(m) represents the weights assigned to the individual data points by the committee built so far.

Splitting the sum in Eq. (18) into the points that y_m classifies correctly (t_i = y_m(x_i)) and those it misses (t_i ≠ y_m(x_i)):

E = Σ_{t_i = y_m(x_i)} w_i^(m) exp{−α_m} + Σ_{t_i ≠ y_m(x_i)} w_i^(m) exp{α_m}    (20)-(21)
Now

Writing the first summand as W_c exp{−α_m} and the second as W_e exp{α_m}, where W_c is the total weight of the correctly classified points and W_e the total weight of the misclassified ones,

E = W_c exp{−α_m} + W_e exp{α_m}    (22)
Then

Multiplying Eq. (22) by exp{α_m} (and noting that exp{2α_m} − 1 > 0 for α_m > 0), we can rewrite it as

exp{α_m} E = W_c + W_e − W_e + W_e exp{2α_m}    (23)

Thus

exp{α_m} E = (W_c + W_e) + W_e (exp{2α_m} − 1)    (24)
Thus

The right-hand side of the equation is minimized
When, at the m-th iteration, we pick the classifier with the smallest total weighted error W_e, since only the term W_e (exp{2α_m} − 1) depends on which classifier we choose.
Deriving against the weight α_m

Going back to the original E, we can use the derivative trick

∂E/∂α_m = −W_c exp{−α_m} + W_e exp{α_m}    (25)

Setting the derivative to zero gives

W_e exp{α_m} = W_c exp{−α_m}    (26)

so that

α_m = (1/2) ln(W_c / W_e)    (27)
Now

Making the total sum of all weights

W = W_c + W_e    (28)

we can rewrite the previous equation as

α_m = (1/2) ln((W − W_e) / W_e) = (1/2) ln((1 − e_m) / e_m)    (29)

where

e_m = W_e / W    (30)

is the weighted error rate of the m-th classifier.
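A sketch of Eqs. (28)-(30): compute the weighted error rate e_m of a candidate classifier and its coefficient α_m. The arrays below are illustrative.

```python
import numpy as np

def alpha_from_weights(w, t, y_m):
    W = w.sum()                                  # Eq. (28): total weight
    W_e = w[y_m != t].sum()                      # weight of the misclassified points
    e_m = W_e / W                                # Eq. (30): weighted error rate
    return 0.5 * np.log((1 - e_m) / e_m), e_m    # Eq. (29)

w = np.array([0.1, 0.3, 0.2, 0.25, 0.15])
t = np.array([+1, -1, +1, +1, -1])
y = np.array([+1, -1, -1, +1, -1])               # one weighted mistake (the third point)
print(alpha_from_weights(w, t, y))
```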
What about the weights?

Using the equation

w_i^(m) = exp{−t_i C^(m−1)(x_i)}    (31)

and because we now have α_m and y_m, the weights for the next iteration become

w_i^(m+1) = exp{−t_i C^(m)(x_i)} = exp{−t_i (C^(m−1)(x_i) + α_m y_m(x_i))}    (32)

that is,

w_i^(m+1) = w_i^(m) exp{−t_i α_m y_m(x_i)}    (33)

so each weight is multiplied by exp{α_m} if the point was missed and by exp{−α_m} if it was hit.
Sequential Training

Thus
AdaBoost trains a new classifier at each round using a data set in which the weighting coefficients have been adjusted according to the performance of the previously selected classifiers, so that misclassified points receive greater weight.
Illustration
Schematic illustration (figure).
AdaBoost Algorithm

Step 1
Initialize the weights w_i^(1) to 1/N.

Step 2
For m = 1, 2, ..., M:
1 Select the weak classifier y_m from the pool that minimizes the weighted error rate e_m = W_e / W, i.e., the total weight of the points it misclassifies divided by the total weight.
Step 3
Set the α_m weight to

α_m = (1/2) ln((1 − e_m) / e_m)    (34)

Now update the weights of the data points for the next iteration:

w_i^(m+1) = w_i^(m) exp{α_m}   if y_m(x_i) ≠ t_i (a miss)    (35)
w_i^(m+1) = w_i^(m) exp{−α_m}  if y_m(x_i) = t_i (a hit)     (36)

which is equivalent to w_i^(m+1) = w_i^(m) exp{−t_i α_m y_m(x_i)}.
Finally, make predictions

For this use

Y_M(x) = sign(Σ_{m=1}^{M} α_m y_m(x))    (37)
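A compact sketch of Steps 1-3 and Eq. (37). It assumes the finite pool of weak classifiers is given through its precomputed outputs on the training set, pool[i, m] = y_m(x_i) in {−1, +1}; a practical implementation would instead train a weak learner on the weighted data at each round. The threshold "stumps" and toy data are illustrative choices.

```python
import numpy as np

def adaboost_select(pool, t, M):
    N, n_pool = pool.shape
    w = np.full(N, 1.0 / N)                      # Step 1: uniform weights
    chosen, alphas = [], []
    for m in range(M):                           # Step 2
        miss = (pool != t[:, None]).astype(float)        # per-candidate miss indicators
        errors = miss.T @ w / w.sum()                     # weighted error rate of every candidate
        best = int(np.argmin(errors))                     # pick the classifier with smallest e_m
        e_m = errors[best]
        if e_m <= 0 or e_m >= 0.5:                        # no usable weak classifier left
            break
        alpha = 0.5 * np.log((1 - e_m) / e_m)             # Step 3, Eq. (34)
        w *= np.exp(-alpha * t * pool[:, best])           # Eqs. (35)-(36)
        chosen.append(best)
        alphas.append(alpha)
    return chosen, np.array(alphas)

def committee_predict(pool, chosen, alphas):
    # Eq. (37): sign of the weighted vote of the selected classifiers
    return np.sign(pool[:, chosen] @ alphas)

# Toy data: 1-D points, pool of threshold classifiers ("stumps") and their flipped versions
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
t = np.where(np.abs(x) < 0.5, 1, -1)                     # labels not separable by one stump
thresholds = np.linspace(-0.9, 0.9, 19)
pool = np.hstack([np.where(x[:, None] > thresholds, 1, -1),
                  np.where(x[:, None] > thresholds, -1, 1)])
chosen, alphas = adaboost_select(pool, t, M=10)
pred = committee_predict(pool, chosen, alphas)
print("training accuracy:", np.mean(pred == t))
```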
Observations

First
The first base classifier is obtained by the usual procedure of training a single classifier (all the weights are equal).

Second
In the subsequent iterations, the weights concentrate on the data points that the committee still misclassifies, as described above, so each new classifier is chosen mainly for its performance on those points.
In addition

The pool of classifiers in Step 1 can be substituted by a family of classifiers
One whose members are not fixed in advance but are generated when needed, for example by training a weak learner on the re-weighted data at each iteration.
We have then

The following
The weighted errors of all M classifiers in the pool can be computed at once as

(W_e^(1), W_e^(2), ..., W_e^(M)) = (w^(m))^T S    (38)

This allows us to reformulate the computation of the weighted errors, and hence the selection of the next committee member, as a single vector-matrix multiplication: we pick the classifier whose entry in this product is smallest.
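A sketch of Eq. (38): with the miss matrix S and the current weight vector w, one vector-matrix product gives the weighted error W_e of every classifier in the pool, and argmin picks the next committee member. The arrays are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 5
S = rng.integers(0, 2, size=(N, M))      # S[i, m] = 1 if classifier m misses x_i
w = rng.random(N)                        # current weights w^(m)

W_e = w @ S                              # Eq. (38): weighted errors of all classifiers
best = int(np.argmin(W_e))
print("W_e per classifier:", W_e.round(3), "-> select classifier", best)
```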
Explanation
Graph of (1 − e_m)/e_m in the range 0 ≤ e_m ≤ 1 (figure).
So we have
Graph of α_m as a function of e_m (figure).
We have the following cases

We have the following
If e_m → 1, all the samples were incorrectly classified, and α_m → −∞, so the classifier receives a large negative coefficient.
If e_m → 0, all the samples were correctly classified, and α_m → +∞, so the classifier receives a large positive coefficient.
Thus, we have

If e_m → 1/2
We have α_m → 0; thus, whether a sample is well or badly classified, exp{−t_i α_m y_m(x_i)} → 1, so the weights are left essentially unchanged and the classifier contributes almost nothing to the committee.
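A tiny numerical companion to the cases above: α_m from Eq. (34) for error rates approaching 0, 1/2 and 1.

```python
import numpy as np

def alpha(e_m):
    return 0.5 * np.log((1 - e_m) / e_m)   # Eq. (34)

for e_m in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"e_m = {e_m:4.2f} -> alpha_m = {alpha(e_m):+.3f}")
```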
Example

Training set classes:
ω_1 = N(0, 1)
ω_2 = N(0, σ^2) − N(0, 1)
(figure)
Example

Weak classifier: a perceptron.
For m = 1, ..., M, we have the following possible classifiers (figure).
At the end of the process
For M = 40 (figure).
  1. 1. Machine Learning for Data Mining Combining Models Andres Mendez-Vazquez July 20, 2015 1 / 65
  2. 2. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 2 / 65
  3. 3. Images/cinvestav- Introduction Observation It is often found that improved performance can be obtained by combining multiple classifiers together in some way. Example: Committees We might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example: Boosting It involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. 3 / 65
  4. 4. Images/cinvestav- Introduction Observation It is often found that improved performance can be obtained by combining multiple classifiers together in some way. Example: Committees We might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example: Boosting It involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. 3 / 65
  5. 5. Images/cinvestav- Introduction Observation It is often found that improved performance can be obtained by combining multiple classifiers together in some way. Example: Committees We might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example: Boosting It involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. 3 / 65
  6. 6. Images/cinvestav- In addition Instead of averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 4 / 65
  7. 7. Images/cinvestav- In addition Instead of averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 4 / 65
  8. 8. Images/cinvestav- In addition Instead of averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 4 / 65
  9. 9. Images/cinvestav- We want this Models in charge of different set of inputs 5 / 65
  10. 10. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  11. 11. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  12. 12. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  13. 13. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  14. 14. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  15. 15. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  16. 16. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  17. 17. Images/cinvestav- This is used in the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 7 / 65
  18. 18. Images/cinvestav- This is used in the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 7 / 65
  19. 19. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 8 / 65
  20. 20. Images/cinvestav- Now Although Model Combinations and Bayesian Model Averaging look similar. However They ara actually different For this We have the following example. 9 / 65
  21. 21. Images/cinvestav- Now Although Model Combinations and Bayesian Model Averaging look similar. However They ara actually different For this We have the following example. 9 / 65
  22. 22. Images/cinvestav- Now Although Model Combinations and Bayesian Model Averaging look similar. However They ara actually different For this We have the following example. 9 / 65
  23. 23. Images/cinvestav- Example For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. p (x, z) (2) Thus Corresponding density over the observed variable x using marginalization: p (x) = z p (x, z) (3) 10 / 65
  24. 24. Images/cinvestav- Example For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. p (x, z) (2) Thus Corresponding density over the observed variable x using marginalization: p (x) = z p (x, z) (3) 10 / 65
  25. 25. Images/cinvestav- In the case of Gaussian distributions We have p (x) = M k=1 πkN (x|µkΣk) (4) Now, for independent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) (5) Now, suppose We have several different models indexed by h = 1, ..., H with prior probabilities. For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions. 11 / 65
  26. 26. Images/cinvestav- In the case of Gaussian distributions We have p (x) = M k=1 πkN (x|µkΣk) (4) Now, for independent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) (5) Now, suppose We have several different models indexed by h = 1, ..., H with prior probabilities. For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions. 11 / 65
  27. 27. Images/cinvestav- In the case of Gaussian distributions We have p (x) = M k=1 πkN (x|µkΣk) (4) Now, for independent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) (5) Now, suppose We have several different models indexed by h = 1, ..., H with prior probabilities. For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions. 11 / 65
  28. 28. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  29. 29. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  30. 30. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  31. 31. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  32. 32. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  33. 33. Images/cinvestav- The Differences Bayesian model averaging The whole data set is generated by a single model. Model combination Different data points within the data set can potentially be generated from different values of the latent variable z and hence by different components. 13 / 65
  34. 34. Images/cinvestav- The Differences Bayesian model averaging The whole data set is generated by a single model. Model combination Different data points within the data set can potentially be generated from different values of the latent variable z and hence by different components. 13 / 65
  35. 35. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  36. 36. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  37. 37. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  38. 38. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  39. 39. Images/cinvestav- For example We can do the follwoing When we averaged a set of low-bias models, we obtained accurate predictions of the underlying sinusoidal function from which the data were generated. 15 / 65
  40. 40. Images/cinvestav- However Big Problem We have normally a single data set Thus We need to introduce certain variability between the different committee members. One approach You can use bootstrap data sets. 16 / 65
  41. 41. Images/cinvestav- However Big Problem We have normally a single data set Thus We need to introduce certain variability between the different committee members. One approach You can use bootstrap data sets. 16 / 65
  42. 42. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 17 / 65
  43. 43. Images/cinvestav- What to do with the Bootstrap Data Set Given a Regression Problem Here, we are trying to predict the value of a single continuous variable. What? Suppose our original data set consists of N data points X = {x1, ..., xN } . Thus We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. 18 / 65
  44. 44. Images/cinvestav- What to do with the Bootstrap Data Set Given a Regression Problem Here, we are trying to predict the value of a single continuous variable. What? Suppose our original data set consists of N data points X = {x1, ..., xN } . Thus We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. 18 / 65
  45. 45. Images/cinvestav- What to do with the Bootstrap Data Set Given a Regression Problem Here, we are trying to predict the value of a single continuous variable. What? Suppose our original data set consists of N data points X = {x1, ..., xN } . Thus We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. 18 / 65
  46. 46. Images/cinvestav- Furthermore This is repeated M times Thus, we generate M data sets each of size N and each obtained by sampling from the original data set X. 19 / 65
  47. 47. Images/cinvestav- Thus Use each of them to train a copy ym (x) of a predictive regression model to predict a single continuous variable Then, ycom (x) = 1 M M m=1 ym (x) (7) This is also known as Bootstrap Aggregation or Bagging. 20 / 65
  48. 48. Images/cinvestav- What we do with this samples? Now, assume a true regression function h (x) and a estimation ym (x) ym (x) = h (x) + m (x) (8) The average sum-of-squares error over the data takes the form Ex {ym (x) − h (x)}2 = Ex { m (x)}2 (9) What is Ex? It denotes a frequentest expectation with respect to the distribution of the input vector x. 21 / 65
  49. 49. Images/cinvestav- What we do with this samples? Now, assume a true regression function h (x) and a estimation ym (x) ym (x) = h (x) + m (x) (8) The average sum-of-squares error over the data takes the form Ex {ym (x) − h (x)}2 = Ex { m (x)}2 (9) What is Ex? It denotes a frequentest expectation with respect to the distribution of the input vector x. 21 / 65
  50. 50. Images/cinvestav- What we do with this samples? Now, assume a true regression function h (x) and a estimation ym (x) ym (x) = h (x) + m (x) (8) The average sum-of-squares error over the data takes the form Ex {ym (x) − h (x)}2 = Ex { m (x)}2 (9) What is Ex? It denotes a frequentest expectation with respect to the distribution of the input vector x. 21 / 65
  51. 51. Images/cinvestav- Meaning Thus, the average error is EAV = 1 M M m=1 Ex { m (x)}2 (10) Similarly the Expected error over the committee ECOM = Ex   1 M M m=1 (ym (x) − h (x)) 2   = Ex   1 M M m=1 m (x) 2   (11) 22 / 65
  52. 52. Images/cinvestav- Meaning Thus, the average error is EAV = 1 M M m=1 Ex { m (x)}2 (10) Similarly the Expected error over the committee ECOM = Ex   1 M M m=1 (ym (x) − h (x)) 2   = Ex   1 M M m=1 m (x) 2   (11) 22 / 65
  53. 53. Images/cinvestav- Assume that the errors have zero mean and are uncorrelated Assume that the errors have zero mean and are uncorrelated Ex [ m (x)] =0 Ex [ m (x) l (x)] = 0, for m = l 23 / 65
  54. 54. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  55. 55. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  56. 56. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  57. 57. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  58. 58. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  59. 59. Images/cinvestav- We finally obtain We obtain ECOM = 1 M EAV (12) Looks great BUT!!! Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated. 25 / 65
  60. 60. Images/cinvestav- We finally obtain We obtain ECOM = 1 M EAV (12) Looks great BUT!!! Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated. 25 / 65
  61. 61. Images/cinvestav- Thus The Reality!!! The errors are typically highly correlated, and the reduction in overall error is generally small. Something Notable However, It can be shown that the expected committee error will not exceed the expected error of the constituent models, so ECOM ≤ EAV (13) We need something better A more sophisticated technique known as boosting. 26 / 65
  62. 62. Images/cinvestav- Thus The Reality!!! The errors are typically highly correlated, and the reduction in overall error is generally small. Something Notable However, It can be shown that the expected committee error will not exceed the expected error of the constituent models, so ECOM ≤ EAV (13) We need something better A more sophisticated technique known as boosting. 26 / 65
  63. 63. Images/cinvestav- Thus The Reality!!! The errors are typically highly correlated, and the reduction in overall error is generally small. Something Notable However, It can be shown that the expected committee error will not exceed the expected error of the constituent models, so ECOM ≤ EAV (13) We need something better A more sophisticated technique known as boosting. 26 / 65
  64. 64. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 27 / 65
  65. 65. Images/cinvestav- Boosting What Boosting does? It combines several classifiers to produce a form of a committee. We will describe AdaBoost “Adaptive Boosting” developed by Freund and Schapire (1995). 28 / 65
  66. 66. Images/cinvestav- Boosting What Boosting does? It combines several classifiers to produce a form of a committee. We will describe AdaBoost “Adaptive Boosting” developed by Freund and Schapire (1995). 28 / 65
  67. 67. Images/cinvestav- Sequential Training Main difference between boosting and committee methods The base classifiers are trained in sequence. Explanation Consider a two-class classification problem: 1 Samples x1, x2,..., xN 2 Binary labels (-1,1) t1, t2, ..., tN 29 / 65
  68. 68. Images/cinvestav- Sequential Training Main difference between boosting and committee methods The base classifiers are trained in sequence. Explanation Consider a two-class classification problem: 1 Samples x1, x2,..., xN 2 Binary labels (-1,1) t1, t2, ..., tN 29 / 65
  69. 69. Images/cinvestav- Cost Function Now You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!! Thus For each pattern xi each expert classifier outputs a classification yj (xi) ∈ {−1, 1} The final decision of the committee of M experts is sign (C (xi)) C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (14) 30 / 65
  70. 70. Images/cinvestav- Cost Function Now You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!! Thus For each pattern xi each expert classifier outputs a classification yj (xi) ∈ {−1, 1} The final decision of the committee of M experts is sign (C (xi)) C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (14) 30 / 65
  71. 71. Images/cinvestav- Cost Function Now You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!! Thus For each pattern xi each expert classifier outputs a classification yj (xi) ∈ {−1, 1} The final decision of the committee of M experts is sign (C (xi)) C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (14) 30 / 65
  72. 72. Images/cinvestav- Now Adaptive Boosting It works even with a continuum of classifiers. However For the sake of simplicity, we will assume that the set of expert is finite. 31 / 65
  73. 73. Images/cinvestav- Now Adaptive Boosting It works even with a continuum of classifiers. However For the sake of simplicity, we will assume that the set of expert is finite. 31 / 65
  74. 74. Images/cinvestav- Getting the correct classifiers We want the following We want to review possible element members. Select them, if they have certain properties. Assigning a weight to their contribution to the set of experts. 32 / 65
  75. 75. Images/cinvestav- Getting the correct classifiers We want the following We want to review possible element members. Select them, if they have certain properties. Assigning a weight to their contribution to the set of experts. 32 / 65
  76. 76. Images/cinvestav- Getting the correct classifiers We want the following We want to review possible element members. Select them, if they have certain properties. Assigning a weight to their contribution to the set of experts. 32 / 65
  77. 77. Images/cinvestav- Now Selection is done the following way Testing the classifiers in the pool using a training set T of N multidimensional data points xi: For each point xi we have a label ti = 1 or ti = −1. Assigning a cost for actions We test and rank all classifiers in the expert pool by Charging a cost exp {β} any time a classifier fails (a miss). Charging a cost exp {−β} any time a classifier provides the right label (a hit). 33 / 65
  78. 78. Images/cinvestav- Now Selection is done the following way Testing the classifiers in the pool using a training set T of N multidimensional data points xi: For each point xi we have a label ti = 1 or ti = −1. Assigning a cost for actions We test and rank all classifiers in the expert pool by Charging a cost exp {β} any time a classifier fails (a miss). Charging a cost exp {−β} any time a classifier provides the right label (a hit). 33 / 65
  79. 79. Images/cinvestav- Now Selection is done the following way Testing the classifiers in the pool using a training set T of N multidimensional data points xi: For each point xi we have a label ti = 1 or ti = −1. Assigning a cost for actions We test and rank all classifiers in the expert pool by Charging a cost exp {β} any time a classifier fails (a miss). Charging a cost exp {−β} any time a classifier provides the right label (a hit). 33 / 65
  80. 80. Images/cinvestav- Now Selection is done the following way Testing the classifiers in the pool using a training set T of N multidimensional data points xi: For each point xi we have a label ti = 1 or ti = −1. Assigning a cost for actions We test and rank all classifiers in the expert pool by Charging a cost exp {β} any time a classifier fails (a miss). Charging a cost exp {−β} any time a classifier provides the right label (a hit). 33 / 65
  81. 81. Images/cinvestav- Now Selection is done the following way Testing the classifiers in the pool using a training set T of N multidimensional data points xi: For each point xi we have a label ti = 1 or ti = −1. Assigning a cost for actions We test and rank all classifiers in the expert pool by Charging a cost exp {β} any time a classifier fails (a miss). Charging a cost exp {−β} any time a classifier provides the right label (a hit). 33 / 65
  82. 82. Images/cinvestav- Remarks about β We require β > 0 Thus misses are penalized more heavily penalized than hits Although It looks strange to penalize hits, but as long that the penalty of a success is smaller than the penalty for a miss exp {−β} < exp {β}. Why? Show that if we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = cd and b = c−d for constants c and d. Thus, it does not compromise generality. 34 / 65
  83. 83. Images/cinvestav- Remarks about β We require β > 0 Thus misses are penalized more heavily penalized than hits Although It looks strange to penalize hits, but as long that the penalty of a success is smaller than the penalty for a miss exp {−β} < exp {β}. Why? Show that if we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = cd and b = c−d for constants c and d. Thus, it does not compromise generality. 34 / 65
  84. 84. Images/cinvestav- Remarks about β We require β > 0 Thus misses are penalized more heavily penalized than hits Although It looks strange to penalize hits, but as long that the penalty of a success is smaller than the penalty for a miss exp {−β} < exp {β}. Why? Show that if we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = cd and b = c−d for constants c and d. Thus, it does not compromise generality. 34 / 65
  85. 85. Images/cinvestav- Exponential Loss Function Something Notable This kind of error function different from the usual squared Euclidean distance to the classification target is called an exponential loss function. AdaBoost uses exponential error loss as error criterion. 35 / 65
  86. 86. Images/cinvestav- Exponential Loss Function Something Notable This kind of error function different from the usual squared Euclidean distance to the classification target is called an exponential loss function. AdaBoost uses exponential error loss as error criterion. 35 / 65
  87. 87. Images/cinvestav- The Matrix S When we test the M classifiers in the pool, we build a matrix S We record the misses (with a ONE) and hits (with a ZERO) of each classifiers. Row i in the matrix is reserved for the data point xi - Column m is reserved for the mth classifier in the pool. Classifiers 1 2 · · · M x1 0 1 · · · 1 x2 0 0 · · · 1 x3 1 1 · · · 0 ... ... ... ... xN 0 0 · · · 0 36 / 65
  88. 88. Images/cinvestav- The Matrix S When we test the M classifiers in the pool, we build a matrix S We record the misses (with a ONE) and hits (with a ZERO) of each classifiers. Row i in the matrix is reserved for the data point xi - Column m is reserved for the mth classifier in the pool. Classifiers 1 2 · · · M x1 0 1 · · · 1 x2 0 0 · · · 1 x3 1 1 · · · 0 ... ... ... ... xN 0 0 · · · 0 36 / 65
  89. 89. Images/cinvestav- The Matrix S When we test the M classifiers in the pool, we build a matrix S We record the misses (with a ONE) and hits (with a ZERO) of each classifiers. Row i in the matrix is reserved for the data point xi - Column m is reserved for the mth classifier in the pool. Classifiers 1 2 · · · M x1 0 1 · · · 1 x2 0 0 · · · 1 x3 1 1 · · · 0 ... ... ... ... xN 0 0 · · · 0 36 / 65
  90. 90. Images/cinvestav- Main Idea What does AdaBoost want to do? The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations. Thus The elements in the data set are weighted according to their current relevance (or urgency) at each iteration. Thus at the beginning of the iterations All data samples are assigned the same weight: Just 1, or 1 N , if we want to have a total sum of 1 for all weights. 37 / 65
  91. 91. Images/cinvestav- Main Idea What does AdaBoost want to do? The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations. Thus The elements in the data set are weighted according to their current relevance (or urgency) at each iteration. Thus at the beginning of the iterations All data samples are assigned the same weight: Just 1, or 1 N , if we want to have a total sum of 1 for all weights. 37 / 65
  92. 92. Images/cinvestav- Main Idea What does AdaBoost want to do? The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations. Thus The elements in the data set are weighted according to their current relevance (or urgency) at each iteration. Thus at the beginning of the iterations All data samples are assigned the same weight: Just 1, or 1 N , if we want to have a total sum of 1 for all weights. 37 / 65
  93. 93. Images/cinvestav- Main Idea What does AdaBoost want to do? The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations. Thus The elements in the data set are weighted according to their current relevance (or urgency) at each iteration. Thus at the beginning of the iterations All data samples are assigned the same weight: Just 1, or 1 N , if we want to have a total sum of 1 for all weights. 37 / 65
  94. 94. Images/cinvestav- The process of the weights As the selection progresses The more difficult samples, those where the committee still performs badly, are assigned larger and larger weights. Basically The selection process concentrates in selecting new classifiers for the committee by focusing on those which can help with the still misclassified examples. Then The best classifiers are those which can provide new insights to the committee. Classifiers being selected should complement each other in an optimal way. 38 / 65
Selecting New Classifiers

What we want
In each iteration we rank all classifiers, so that we can select the current best one out of the pool.

At the m-th iteration
We have already included m − 1 classifiers in the committee and we want to select the next one.

Thus, we have the following cost function, which is actually the output of the committee:
$C^{(m-1)}(x_i) = \alpha_1 y_1(x_i) + \alpha_2 y_2(x_i) + \dots + \alpha_{m-1} y_{m-1}(x_i)$    (15)
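A small sketch of the committee output in (15); alphas and classifiers stand for the coefficients and weak learners already selected in the first m − 1 iterations (illustrative names):

```python
import numpy as np

def committee_output(alphas, classifiers, X):
    """C^{(m-1)}(x) = sum_k alpha_k * y_k(x), evaluated for every row of X.

    Each classifier is assumed to return labels in {-1, +1}.
    For m = 1 the lists are empty and this is the zero function.
    """
    C = np.zeros(X.shape[0])
    for alpha, clf in zip(alphas, classifiers):
        C += alpha * clf.predict(X)
    return C
```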
Thus, we have that

Extending the cost function:
$C^{(m)}(x_i) = C^{(m-1)}(x_i) + \alpha_m y_m(x_i)$    (16)

At the first iteration, m = 1
$C^{(m-1)}$ is the zero function.

Thus, the total cost, or total error, is defined as the exponential error:
$E = \sum_{i=1}^{N} \exp\left\{-t_i\left(C^{(m-1)}(x_i) + \alpha_m y_m(x_i)\right)\right\}$    (17)
Thus

We want to determine $\alpha_m$ and $y_m$ in an optimal way
Thus, rewriting:
$E = \sum_{i=1}^{N} w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\}$    (18)

Where, for i = 1, 2, ..., N:
$w_i^{(m)} = \exp\left\{-t_i C^{(m-1)}(x_i)\right\}$    (19)
Thus

In the first iteration
$w_i^{(1)} = 1$ for i = 1, ..., N. During later iterations, the vector $w^{(m)}$ represents the weight assigned to each data point in the training set at iteration m.

Splitting equation (18):
$E = \sum_{t_i = y_m(x_i)} w_i^{(m)} \exp\{-\alpha_m\} + \sum_{t_i \neq y_m(x_i)} w_i^{(m)} \exp\{\alpha_m\}$    (20)

Meaning
The total cost is the weighted cost of all hits plus the weighted cost of all misses.
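A tiny worked illustration of the split in (20): with made-up weights and hit/miss outcomes, the total cost is the weighted hits scaled by $\exp\{-\alpha_m\}$ plus the weighted misses scaled by $\exp\{\alpha_m\}$ (all numbers below are invented):

```python
import numpy as np

# Invented example: 5 points, current weights, and whether y_m hits each one.
w = np.array([0.1, 0.3, 0.2, 0.25, 0.15])          # w_i^{(m)}
hit = np.array([True, True, False, True, False])   # t_i == y_m(x_i)?
alpha = 0.5                                         # some candidate alpha_m

Wc = w[hit].sum()                                   # weighted hits  = 0.65
We = w[~hit].sum()                                  # weighted misses = 0.35
E = Wc * np.exp(-alpha) + We * np.exp(alpha)        # equation (20)/(21), ~0.97
print(Wc, We, E)
```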
Now

Writing the first summand as $W_c \exp\{-\alpha_m\}$ and the second as $W_e \exp\{\alpha_m\}$:
$E = W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}$    (21)

Now, for the selection of $y_m$, the exact value of $\alpha_m > 0$ is irrelevant
Since, for a fixed $\alpha_m$, minimizing E is equivalent to minimizing $\exp\{\alpha_m\} E$, and
$\exp\{\alpha_m\} E = W_c + W_e \exp\{2\alpha_m\}$    (22)
Then

Since $\exp\{2\alpha_m\} > 1$, we can rewrite (22):
$\exp\{\alpha_m\} E = W_c + W_e - W_e + W_e \exp\{2\alpha_m\}$    (23)

Thus
$\exp\{\alpha_m\} E = (W_c + W_e) + W_e\left(\exp\{2\alpha_m\} - 1\right)$    (24)

Now
$W_c + W_e$ is the total sum W of the weights of all data points, which is constant in the current iteration.
Thus

The right-hand side of the equation is minimized
When, at the m-th iteration, we pick the classifier with the lowest total cost $W_e$ (that is, the lowest weighted error).

Intuitively
The next selected $y_m$ should be the one with the lowest penalty given the current set of weights.
Differentiating with respect to the weight $\alpha_m$

Going back to the original E, we can use the derivative trick:
$\frac{\partial E}{\partial \alpha_m} = -W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}$    (25)

Setting the derivative to zero and multiplying by $\exp\{\alpha_m\}$:
$-W_c + W_e \exp\{2\alpha_m\} = 0$    (26)

The optimal value is thus
$\alpha_m = \frac{1}{2} \ln \frac{W_c}{W_e}$    (27)
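A quick numerical sanity check of (27): for made-up values of $W_c$ and $W_e$, the closed-form $\alpha_m$ is compared against a grid of alternatives (a sketch, not part of the original derivation):

```python
import numpy as np

Wc, We = 0.8, 0.2                      # invented weighted hit/miss masses
alpha_star = 0.5 * np.log(Wc / We)     # equation (27)

def E(alpha):                          # equation (21)
    return Wc * np.exp(-alpha) + We * np.exp(alpha)

grid = np.linspace(0.0, 2.0, 201)
assert E(alpha_star) <= E(grid).min() + 1e-12
print(alpha_star, E(alpha_star))       # ~0.693 and ~0.8 for these numbers
```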
Now

Writing the total sum of all weights as
$W = W_c + W_e$    (28)

We can rewrite the previous equation as
$\alpha_m = \frac{1}{2} \ln \frac{W - W_e}{W_e} = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$    (29)

With the percentage rate of error, given the weights of the data points,
$e_m = \frac{W_e}{W}$    (30)
What about the weights?

Using the equation
$w_i^{(m)} = \exp\left\{-t_i C^{(m-1)}(x_i)\right\}$    (31)

And because we now have $\alpha_m$ and $y_m(x_i)$:
$w_i^{(m+1)} = \exp\left\{-t_i C^{(m)}(x_i)\right\} = \exp\left\{-t_i\left(C^{(m-1)}(x_i) + \alpha_m y_m(x_i)\right)\right\} = w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\}$
Sequential Training

Thus
AdaBoost trains a new classifier using a data set in which the weighting coefficients are adjusted according to the performance of the previously trained classifier, so as to give greater weight to the misclassified data points.
Illustration

Schematic illustration (figure omitted).
Outline
1 Introduction
2 Bayesian Model Averaging
  Model Combination Vs. Bayesian Model Averaging
3 Committees
  Bootstrap Data Sets
4 Boosting
  AdaBoost Development
  Cost Function
  Selection Process
  Selecting New Classifiers
  Using our Deriving Trick
  AdaBoost Algorithm
  Some Remarks and Implementation Issues
  Explanation about AdaBoost's behavior
  Example
AdaBoost Algorithm

Step 1
Initialize $w_i^{(1)}$ to $\frac{1}{N}$.

Step 2
For m = 1, 2, ..., M:
1. Fit a weak classifier $y_m(x)$ to the training data by minimizing the weighted error function
$\arg\min_{y_m} \sum_{i=1}^{N} w_i^{(m)} I\left(y_m(x_i) \neq t_i\right) = \arg\min_{y_m} \sum_{t_i \neq y_m(x_i)} w_i^{(m)} = \arg\min_{y_m} W_e$    (32)
where I is the indicator function.
2. Evaluate
$e_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I\left(y_m(x_n) \neq t_n\right)}{\sum_{n=1}^{N} w_n^{(m)}}$    (33)
AdaBoost Algorithm

Step 3
Set the $\alpha_m$ weight to
$\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$    (34)

Now update the weights of the data for the next iteration.
If $t_i \neq y_m(x_i)$, i.e. a miss:
$w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}}$    (35)

If $t_i = y_m(x_i)$, i.e. a hit:
$w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m\} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}}$    (36)
Finally, make predictions

For this use
$Y_M(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)$    (37)
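Putting Steps 1-3 and the final prediction (37) together, here is a compact AdaBoost sketch in Python. It assumes labels in {−1, +1} and scikit-learn decision stumps as the weak learners; the early stop when $e_m$ reaches 0 or 1/2 is a simplification of the degenerate cases discussed later, and all names are illustrative rather than the author's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, t, M):
    """Train an AdaBoost committee of M decision stumps (a sketch, not a reference implementation)."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                 # Step 1: uniform weights
    alphas, learners = [], []
    for m in range(M):                      # Step 2
        clf = DecisionTreeClassifier(max_depth=1)
        clf.fit(X, t, sample_weight=w)      # minimize the weighted error, (32)
        pred = clf.predict(X)
        miss = (pred != t)
        e_m = np.dot(w, miss) / w.sum()     # (33)
        if e_m <= 0.0 or e_m >= 0.5:        # degenerate weak learner: stop early
            break
        alpha_m = 0.5 * np.log((1 - e_m) / e_m)   # Step 3, (34)
        w = w * np.exp(-alpha_m * t * pred)       # (35)/(36) in one line
        alphas.append(alpha_m)
        learners.append(clf)
    return alphas, learners

def adaboost_predict(alphas, learners, X):
    """Final committee prediction, equation (37)."""
    C = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(C)
```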
Observations

First
Training the first base classifier is the usual procedure of training a single classifier.

Second
From (35) and (36), we can see that the weighting coefficients are increased for data points that are misclassified.

Third
The quantity $e_m$ represents a weighted measure of the error rate. Thus $\alpha_m$ gives more weight to the more accurate classifiers.
In addition

The pool of classifiers in Step 1 can be substituted by a family of classifiers
One whose members are trained to minimize the error function given the current weights.

Now
If a finite set of classifiers is indeed given, we only need to test the classifiers once for each data point.

The Scouting Matrix S
It can be reused at each iteration: multiplying the transposed vector of weights $w^{(m)}$ with S gives the $W_e$ of each machine.
We have then

The following:
$\left( W_e^{(1)}, W_e^{(2)}, \dots, W_e^{(M)} \right) = \left( w^{(m)} \right)^T S$    (38)

This allows us to reformulate the weight update step so that
Only misses lead to a weight modification.

Note
Note that the weight vector $w^{(m)}$ is constructed iteratively. It could be recomputed completely at every iteration, but the iterative construction is more efficient and simpler to implement.
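A short sketch of (38): reusing the scouting matrix built when the pool was first tested, the weighted error of every classifier can be obtained with one matrix product per iteration (variable names are illustrative):

```python
import numpy as np

def weighted_errors(w, S):
    """Return the vector (W_e^{(1)}, ..., W_e^{(M)}) = w^T S, equation (38).

    `w` is the current weight vector of length N; `S` is the N x M scouting
    matrix with 1 for a miss and 0 for a hit.
    """
    return w @ S

# At iteration m: pick the classifier with the smallest weighted error.
# best = int(np.argmin(weighted_errors(w, S)))
```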
Explanation

Graph of $\frac{1 - e_m}{e_m}$ in the range $0 \le e_m \le 1$ (figure omitted).
So we have

Graph of $\alpha_m$ (figure omitted).
We have the following cases

We have the following
If $e_m \to 1$, all the samples were incorrectly classified!!!

Thus
$\lim_{e_m \to 1} \frac{1 - e_m}{e_m} = 0$, so $\alpha_m \to -\infty$.

Finally
For every misclassified sample, $w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} \to 0$. We only need to reverse the answers to get the perfect classifier and select it as the only committee member.
Thus, we have

If $e_m \to 1/2$
We have $\alpha_m \to 0$; thus, whether the sample is well or badly classified,
$\exp\{-\alpha_m t_i y_m(x_i)\} \to 1$    (39)
So the weights do not change at all.

What about $e_m \to 0$?
We have that $\alpha_m \to +\infty$.

Thus, we have
For samples always correctly classified, $w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m t_i y_m(x_i)\} \to 0$. Thus the only committee member that we need is $m$; we do not need $m + 1$.
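A small numerical illustration of these three regimes, tabulating $\alpha_m$ and the multiplicative weight updates for a few arbitrary values of $e_m$:

```python
import numpy as np

for e_m in [0.01, 0.25, 0.5, 0.75, 0.99]:
    alpha = 0.5 * np.log((1 - e_m) / e_m)   # equation (34)
    up_miss = np.exp(alpha)                  # factor applied to misses, (35)
    up_hit = np.exp(-alpha)                  # factor applied to hits, (36)
    print(f"e_m={e_m:4.2f}  alpha={alpha:+6.3f}  miss x{up_miss:6.3f}  hit x{up_hit:6.3f}")
# e_m -> 0: alpha -> +inf; e_m = 1/2: alpha = 0 (no weight change); e_m -> 1: alpha -> -inf.
```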
Example

Training set
Classes:
$\omega_1 = N(0, 1)$
$\omega_2 = N(0, \sigma^2) - N(0, 1)$
Example

Weak Classifier: Perceptron
For m = 1, ..., M, we have the following possible classifiers (figure omitted).
At the end of the process

For M = 40 (figure omitted).
