Machine Learning for Data Mining
Combining Models
Andres Mendez-Vazquez
July 20, 2015
Outline
1 Introduction
2 Bayesian Model Averaging
Model Combination Vs. Bayesian Model Averaging
3 Committees
Bootstrap Data Sets
4 Boosting
AdaBoost Development
Cost Function
Selection Process
Selecting New Classifiers
Using our Deriving Trick
AdaBoost Algorithm
Some Remarks and Implementation Issues
Explanation about AdaBoost’s behavior
Example
Introduction
Observation
It is often found that improved performance can be obtained by combining
multiple classifiers together in some way.
Example: Committees
We might train L different classifiers and then make predictions using the
average of the predictions made by each classifier.
Example: Boosting
It involves training multiple models in sequence in which the error function
used to train a particular model depends on the performance of the
previous models.
In addition
Instead of averaging the predictions of a set of models
You can use an alternative form of combination that selects one of the
models to make the prediction.
Where
The choice of model is a function of the input variables.
Thus
Different models become responsible for making decisions in different regions of the input space.
We want this
Models in charge of different sets of inputs (figure omitted)
Example: Decision Trees
We can have a decision tree on top of the models
Given a set of models, one model is chosen to make the decision in a certain region of the input space.
Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value.
Thus it is better to soften the combination
If we have M classifiers, each giving a conditional distribution p(t|x, k):
x is the input variable.
t is the target variable.
k = 1, 2, ..., M indexes the classifiers.
This is used in the mixture of distributions
Thus (Mixture of Experts)
p(t|x) = \sum_{k=1}^{M} \pi_k(x) p(t|x, k)    (1)
where \pi_k(x) = p(k|x) are the input-dependent mixing coefficients.
This type of model
It can be viewed as a mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables; such models are known as mixtures of experts.
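As a rough illustration, here is a minimal NumPy sketch of evaluating Equation (1) for a mixture of experts, assuming a softmax gating function and Gaussian experts; the gating weights V, the expert mean functions, and the standard deviations are hypothetical placeholders, not anything specified in the slides.

```python
import numpy as np

def moe_density(t, x, V, means, sigmas):
    """Evaluate p(t|x) = sum_k pi_k(x) N(t | mean_k(x), sigma_k^2).

    V      : (M, d) gating weights, pi_k(x) = softmax(V x)_k
    means  : list of M functions x -> scalar mean of expert k
    sigmas : (M,) standard deviations of the Gaussian experts
    """
    logits = V @ x                                  # gate scores, one per expert
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                                  # pi_k(x) = p(k|x), sums to 1
    experts = np.array([
        np.exp(-0.5 * ((t - m(x)) / s) ** 2) / (np.sqrt(2 * np.pi) * s)
        for m, s in zip(means, sigmas)
    ])                                              # p(t|x, k) for each expert
    return float(pi @ experts)                      # Equation (1)

# Toy usage with two linear-mean experts (purely illustrative numbers)
x = np.array([1.0, -0.5])
V = np.array([[0.3, -0.2], [-0.1, 0.4]])
means = [lambda x: 2.0 * x[0], lambda x: -1.0 * x[1]]
sigmas = np.array([1.0, 0.5])
print(moe_density(t=1.5, x=x, V=V, means=means, sigmas=sigmas))
```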
Now
Although
Model Combination and Bayesian Model Averaging look similar.
However
They are actually different.
For this
We have the following example.
Example
For this consider the following
A mixture of Gaussians with a binary latent variable z indicating to which component a point belongs:
p(x, z)    (2)
Thus
The corresponding density over the observed variable x is obtained by marginalization:
p(x) = \sum_{z} p(x, z)    (3)
In the case of Gaussian distributions
We have
p(x) = \sum_{k=1}^{M} \pi_k \mathcal{N}(x|\mu_k, \Sigma_k)    (4)
Now, for independent, identically distributed data X = {x_1, x_2, ..., x_N}
p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{z_n} p(x_n, z_n)    (5)
Now, suppose
We have several different models indexed by h = 1, ..., H with prior probabilities p(h).
For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions.
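A minimal sketch of evaluating the mixture density of Equation (4) and the i.i.d. likelihood of Equation (5), assuming diagonal covariances for simplicity; all parameter values below are made up for illustration.

```python
import numpy as np

def gaussian(x, mu, var):
    """Diagonal-covariance Gaussian density N(x | mu, diag(var))."""
    norm = np.prod(np.sqrt(2 * np.pi * var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / norm

def mixture_density(x, pis, mus, vars_):
    """Equation (4): p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian(x, mu, var) for pi, mu, var in zip(pis, mus, vars_))

def dataset_likelihood(X, pis, mus, vars_):
    """Equation (5): p(X) = prod_n p(x_n) for i.i.d. data."""
    return np.prod([mixture_density(x, pis, mus, vars_) for x in X])

# Illustrative two-component mixture in 2D
pis = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
vars_ = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
X = np.array([[0.1, -0.2], [2.8, 3.1]])
print(dataset_likelihood(X, pis, mus, vars_))
```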
Bayesian Model Averaging
The marginal distribution is
p(X) = \sum_{h=1}^{H} p(X, h) = \sum_{h=1}^{H} p(X|h) p(h)    (6)
Observations
1 This is an example of Bayesian model averaging.
2 The summation over h means that just one model is responsible for generating the whole data set.
3 The probability over h simply reflects our uncertainty about which is the correct model to use.
Thus
As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models.
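The following is a small sketch, under assumed per-model likelihoods, of how Equation (6) and the posterior p(h|X) behave as data accumulate: with more data the posterior tends to concentrate on a single model. The two candidate models (a standard normal and a Cauchy) and the data-generating choice are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
models = {
    "gaussian": lambda X: stats.norm.logpdf(X).sum(),    # log p(X | h = gaussian)
    "cauchy":   lambda X: stats.cauchy.logpdf(X).sum(),  # log p(X | h = cauchy)
}
log_prior = {"gaussian": np.log(0.5), "cauchy": np.log(0.5)}   # log p(h)

for N in (5, 50, 500):
    X = rng.standard_normal(N)                   # data actually drawn from the Gaussian model
    log_joint = {h: log_prior[h] + loglik(X) for h, loglik in models.items()}
    m = max(log_joint.values())                  # log-sum-exp for numerical stability
    log_pX = m + np.log(sum(np.exp(v - m) for v in log_joint.values()))   # Equation (6)
    posterior = {h: np.exp(v - log_pX) for h, v in log_joint.items()}     # p(h|X)
    print(N, {h: round(p, 4) for h, p in posterior.items()})
```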
The Differences
Bayesian model averaging
The whole data set is generated by a single model.
Model combination
Different data points within the data set can potentially be generated from
different values of the latent variable z and hence by different components.
Committees
Idea
The simplest way to construct a committee is to average the predictions of a set of individual models.
Thinking as a frequentist
This comes from taking into consideration the trade-off between bias and variance, which decomposes the error of the model into:
The bias component that arises from differences between the model and the true function to be predicted.
The variance component that represents the sensitivity of the model to the individual data points.
For example
We can do the following
When we averaged a set of low-bias models, we obtained accurate predictions of the underlying sinusoidal function from which the data were generated.
However
Big Problem
Normally, we have only a single data set.
Thus
We need to introduce certain variability between the different committee
members.
One approach
You can use bootstrap data sets.
What to do with the Bootstrap Data Set
Given a Regression Problem
Here, we are trying to predict the value of a single continuous variable.
What?
Suppose our original data set consists of N data points X = {x1, ..., xN } .
Thus
We can create a new data set XB by drawing N points at random from X,
with replacement, so that some points in X may be replicated in XB,
whereas other points in X may be absent from XB.
Furthermore
This is repeated M times
Thus, we generate M data sets each of size N and each obtained by
sampling from the original data set X.
Thus
Use each of them to train a copy y_m(x) of a predictive regression model for the single continuous target variable.
Then, the committee prediction is
y_{com}(x) = \frac{1}{M} \sum_{m=1}^{M} y_m(x)    (7)
This is also known as Bootstrap Aggregation or Bagging.
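As a rough sketch of bagging as just described, assuming scikit-learn's DecisionTreeRegressor as the base regression model (any regressor could be substituted); the sinusoidal data and all parameter choices are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
N, M = 200, 25                                   # N data points, M bootstrap models

# Synthetic regression data: noisy sinusoid
X = rng.uniform(0, 2 * np.pi, size=(N, 1))
t = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(N)

models = []
for m in range(M):
    idx = rng.integers(0, N, size=N)             # draw N points with replacement
    X_b, t_b = X[idx], t[idx]                    # bootstrap data set X_B
    models.append(DecisionTreeRegressor(max_depth=4).fit(X_b, t_b))

def y_com(X_new):
    """Equation (7): average the M bootstrap models' predictions."""
    return np.mean([model.predict(X_new) for model in models], axis=0)

X_test = np.linspace(0, 2 * np.pi, 5).reshape(-1, 1)
print(y_com(X_test))
```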
What do we do with these samples?
Now, assume a true regression function h(x) and an estimate y_m(x)
y_m(x) = h(x) + \epsilon_m(x)    (8)
The average sum-of-squares error over the data takes the form
E_x[\{y_m(x) - h(x)\}^2] = E_x[\{\epsilon_m(x)\}^2]    (9)
What is E_x?
It denotes a frequentist expectation with respect to the distribution of the input vector x.
Meaning
Thus, the average error is
E_{AV} = \frac{1}{M} \sum_{m=1}^{M} E_x[\{\epsilon_m(x)\}^2]    (10)
Similarly, the expected error of the committee is
E_{COM} = E_x\left[\left\{\frac{1}{M}\sum_{m=1}^{M}(y_m(x) - h(x))\right\}^2\right] = E_x\left[\left\{\frac{1}{M}\sum_{m=1}^{M}\epsilon_m(x)\right\}^2\right]    (11)
Assume that the errors have zero mean and are uncorrelated
E_x[\epsilon_m(x)] = 0
E_x[\epsilon_m(x)\epsilon_l(x)] = 0, for m \neq l
Then
We have that
E_{COM} = \frac{1}{M^2} E_x\left[\left(\sum_{m=1}^{M}\epsilon_m(x)\right)^2\right]
= \frac{1}{M^2} E_x\left[\sum_{m=1}^{M}\epsilon_m^2(x) + \sum_{h=1}^{M}\sum_{k=1, k\neq h}^{M}\epsilon_h(x)\epsilon_k(x)\right]
= \frac{1}{M^2}\left[E_x\sum_{m=1}^{M}\epsilon_m^2(x) + E_x\sum_{h=1}^{M}\sum_{k=1, k\neq h}^{M}\epsilon_h(x)\epsilon_k(x)\right]
= \frac{1}{M^2}\left[E_x\sum_{m=1}^{M}\epsilon_m^2(x) + \sum_{h=1}^{M}\sum_{k=1, k\neq h}^{M}E_x(\epsilon_h(x)\epsilon_k(x))\right]
= \frac{1}{M^2} E_x\sum_{m=1}^{M}\epsilon_m^2(x) = \frac{1}{M}\cdot\frac{1}{M} E_x\sum_{m=1}^{M}\epsilon_m^2(x)
We finally obtain
We obtain
E_{COM} = \frac{1}{M} E_{AV}    (12)
Looks great BUT!!!
Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated.
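As a quick sanity check of Equation (12) and of the caveat that follows, here is a small simulation sketch (the error distributions and correlation values are assumed for illustration): with independent zero-mean errors the committee error is roughly E_AV / M, while with strongly correlated errors the improvement largely disappears.

```python
import numpy as np

rng = np.random.default_rng(1)
M, n_points = 10, 100_000

def committee_vs_average(correlation):
    """Simulate M model errors eps_m(x) and compare E_AV with E_COM."""
    shared = rng.standard_normal(n_points)               # common error component
    eps = np.sqrt(correlation) * shared + \
          np.sqrt(1 - correlation) * rng.standard_normal((M, n_points))
    e_av = np.mean(eps ** 2)                              # Equation (10)
    e_com = np.mean(eps.mean(axis=0) ** 2)                # Equation (11)
    return e_av, e_com

for rho in (0.0, 0.9):
    e_av, e_com = committee_vs_average(rho)
    print(f"correlation={rho}: E_AV={e_av:.3f}, E_COM={e_com:.3f}, ratio={e_com/e_av:.3f}")
```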
Thus
The Reality!!!
The errors are typically highly correlated, and the reduction in overall error is generally small.
Something Notable
However, it can be shown that the expected committee error will not exceed the expected error of the constituent models, so
E_{COM} \leq E_{AV}    (13)
We need something better
A more sophisticated technique known as boosting.
Boosting
What Boosting does?
It combines several classifiers to produce a form of a committee.
We will describe AdaBoost
“Adaptive Boosting” developed by Freund and Schapire (1995).
Sequential Training
Main difference between boosting and committee methods
The base classifiers are trained in sequence.
Explanation
Consider a two-class classification problem:
1 Samples x_1, x_2, ..., x_N
2 Binary labels t_1, t_2, ..., t_N with t_i \in \{-1, +1\}
Cost Function
Now
You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!!
Thus
For each pattern x_i each expert classifier outputs a classification y_j(x_i) \in \{-1, 1\}
The final decision of the committee of M experts is sign(C(x_i)), where
C(x_i) = \alpha_1 y_1(x_i) + \alpha_2 y_2(x_i) + ... + \alpha_M y_M(x_i)    (14)
Now
Adaptive Boosting
It works even with a continuum of classifiers.
However
For the sake of simplicity, we will assume that the set of experts is finite.
Getting the correct classifiers
We want the following
Review the possible member classifiers.
Select them if they have certain properties.
Assign a weight to their contribution to the set of experts.
Now
Selection is done in the following way
Test the classifiers in the pool using a training set T of N multidimensional data points x_i:
For each point x_i we have a label t_i = 1 or t_i = -1.
Assigning a cost to actions
We test and rank all classifiers in the expert pool by:
Charging a cost exp{\beta} any time a classifier fails (a miss).
Charging a cost exp{-\beta} any time a classifier provides the right label (a hit).
Remarks about β
We require β > 0
Thus misses are penalized more heavily than hits.
Although
It looks strange to penalize hits, but it is harmless as long as the penalty for a success is smaller than the penalty for a miss, exp{-β} < exp{β}.
Why?
Show that if we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = c^d and b = c^{-d} for constants c and d. Thus, it does not compromise generality.
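One way to see this (a sketch of the argument, not given in the slides): only the ratio of the two costs matters for ranking classifiers, so we may divide both costs by \sqrt{ab}, giving
a/\sqrt{ab} = \sqrt{a/b} = \exp\{\beta\},   b/\sqrt{ab} = \sqrt{b/a} = \exp\{-\beta\},   with \beta = \frac{1}{2}\ln\frac{a}{b} > 0 since a > b.
So any cost pair a > b > 0 is equivalent to the exponential pair used above.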
Exponential Loss Function
Something Notable
This kind of error function, different from the usual squared Euclidean distance to the classification target, is called an exponential loss function.
AdaBoost uses the exponential loss as its error criterion.
The Matrix S
When we test the M classifiers in the pool, we build a matrix S
We record the misses (with a ONE) and hits (with a ZERO) of each classifier.
Row i in the matrix is reserved for the data point x_i; column m is reserved for the m-th classifier in the pool.

         Classifiers
         1   2   ...   M
x_1      0   1   ...   1
x_2      0   0   ...   1
x_3      1   1   ...   0
...      ..  ..  ...   ..
x_N      0   0   ...   0
Main Idea
What does AdaBoost want to do?
The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.
Thus
The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.
Thus at the beginning of the iterations
All data samples are assigned the same weight: just 1, or 1/N if we want the weights to sum to 1.
The process of the weights
As the selection progresses
The more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.
Basically
The selection process concentrates on picking new classifiers for the committee that can help with the still misclassified examples.
Then
The best classifiers are those which can provide new insights to the committee.
Classifiers being selected should complement each other in an optimal way.
Selecting New Classifiers
What we want
In each iteration, we rank all classifiers, so that we can select the current best out of the pool.
At the m-th iteration
We have already included m - 1 classifiers in the committee and we want to select the next one.
Thus, we have the following cost function, which is actually the output of the committee
C^{(m-1)}(x_i) = \alpha_1 y_1(x_i) + \alpha_2 y_2(x_i) + ... + \alpha_{m-1} y_{m-1}(x_i)    (15)
Thus, we have that
Extending the cost function
C^{(m)}(x_i) = C^{(m-1)}(x_i) + \alpha_m y_m(x_i)    (16)
At the first iteration m = 1
C^{(m-1)} is the zero function.
Thus, the total cost or total error is defined as the exponential error
E = \sum_{i=1}^{N} \exp\{-t_i (C^{(m-1)}(x_i) + \alpha_m y_m(x_i))\}    (17)
Thus
We want to determine
\alpha_m and y_m in an optimal way.
Thus, rewriting
E = \sum_{i=1}^{N} w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\}    (18)
Where, for i = 1, 2, ..., N
w_i^{(m)} = \exp\{-t_i C^{(m-1)}(x_i)\}    (19)
Thus
In the first iteration w_i^{(1)} = 1 for i = 1, ..., N
During later iterations, the vector w^{(m)} represents the weight assigned to each data point in the training set at iteration m.
Splitting Equation (18)
E = \sum_{t_i = y_m(x_i)} w_i^{(m)} \exp\{-\alpha_m\} + \sum_{t_i \neq y_m(x_i)} w_i^{(m)} \exp\{\alpha_m\}    (20)
Meaning
The total cost is the weighted cost of all hits plus the weighted cost of all misses.
Now
Writing the first summand as W_c \exp\{-\alpha_m\} and the second as W_e \exp\{\alpha_m\}
E = W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}    (21)
Now, for the selection of y_m the exact value of \alpha_m > 0 is irrelevant
Since, for a fixed \alpha_m, minimizing E is equivalent to minimizing \exp\{\alpha_m\} E, and
\exp\{\alpha_m\} E = W_c + W_e \exp\{2\alpha_m\}    (22)
Then
Since \exp\{2\alpha_m\} > 1, we can rewrite Equation (22)
\exp\{\alpha_m\} E = W_c + W_e - W_e + W_e \exp\{2\alpha_m\}    (23)
Thus
\exp\{\alpha_m\} E = (W_c + W_e) + W_e (\exp\{2\alpha_m\} - 1)    (24)
Now
W_c + W_e is the total sum W of the weights of all data points, which is constant in the current iteration.
Thus
The right-hand side of the equation is minimized
When at the m-th iteration we pick the classifier with the lowest total cost W_e (that is, the lowest weighted error).
Intuitively
The next selected y_m should be the one with the lowest penalty given the current set of weights.
Deriving against the weight \alpha_m
Going back to the original E, we can use the derivative trick
\frac{\partial E}{\partial \alpha_m} = -W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}    (25)
Setting this equal to zero and multiplying by \exp\{\alpha_m\}
-W_c + W_e \exp\{2\alpha_m\} = 0    (26)
The optimal value is thus
\alpha_m = \frac{1}{2} \ln\frac{W_c}{W_e}    (27)
Now
Making the total sum of all weights
W = W_c + W_e    (28)
We can rewrite the previous equation as
\alpha_m = \frac{1}{2} \ln\frac{W - W_e}{W_e} = \frac{1}{2} \ln\frac{1 - e_m}{e_m}    (29)
With the percentage rate of error given the weights of the data points
e_m = \frac{W_e}{W}    (30)
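For instance (a worked example, not from the slides): if the weighted error rate is e_m = 0.25, then \alpha_m = \frac{1}{2}\ln\frac{0.75}{0.25} = \frac{1}{2}\ln 3 \approx 0.549; if e_m = 0.5 then \alpha_m = 0, and as e_m \to 0 the weight \alpha_m grows without bound, so more accurate classifiers get a larger say in the committee.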
What about the weights?
Using the equation
w_i^{(m)} = \exp\{-t_i C^{(m-1)}(x_i)\}    (31)
And because we have \alpha_m and y_m(x_i)
w_i^{(m+1)} = \exp\{-t_i C^{(m)}(x_i)\}
            = \exp\{-t_i (C^{(m-1)}(x_i) + \alpha_m y_m(x_i))\}
            = w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\}
Sequential Training
Thus
AdaBoost trains a new classifier using a data set in which the weighting
coefficients are adjusted according to the performance of the previously
trained classifier so as to give greater weight to the misclassified data
points.
Illustration
Schematic illustration (figure omitted)
AdaBoost Algorithm
Step 1
Initialize w_i^{(1)} = \frac{1}{N}
Step 2
For m = 1, 2, ..., M
1 Select a weak classifier y_m(x) for the training data by minimizing the weighted error function
\arg\min_{y_m} \sum_{i=1}^{N} w_i^{(m)} I(y_m(x_i) \neq t_i) = \arg\min_{y_m} \sum_{t_i \neq y_m(x_i)} w_i^{(m)} = \arg\min_{y_m} W_e    (32)
Where I is an indicator function.
2 Evaluate
e_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}}    (33)
AdaBoost Algorithm
Step 3
Set the \alpha_m weight to
\alpha_m = \frac{1}{2} \ln\frac{1 - e_m}{e_m}    (34)
Now update the weights of the data for the next iteration
If t_i \neq y_m(x_i), i.e. a miss
w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}}    (35)
If t_i = y_m(x_i), i.e. a hit
w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m\} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}}    (36)
Finally, make predictions
For this use
Y_M(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)    (37)
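To make the algorithm above concrete, here is a minimal NumPy sketch of AdaBoost with decision stumps as the pool of weak classifiers, covering Steps 1 through 3 and the final sign vote of Equation (37); the stump pool, the synthetic data, and all names are illustrative assumptions, not the slides' own code.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """A decision stump: predict +1 on one side of a threshold, -1 on the other."""
    return polarity * np.where(X[:, feature] > threshold, 1, -1)

def adaboost_train(X, t, M):
    """Train an AdaBoost committee of M decision stumps on labels t in {-1, +1}."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                       # Step 1: w_i^(1) = 1/N
    committee = []
    for m in range(M):
        # Step 2.1: pick the stump with the lowest weighted error W_e (Eq. 32)
        best = None
        for feature in range(X.shape[1]):
            for threshold in np.unique(X[:, feature]):
                for polarity in (+1, -1):
                    pred = stump_predict(X, feature, threshold, polarity)
                    W_e = np.sum(w * (pred != t))
                    if best is None or W_e < best[0]:
                        best = (W_e, feature, threshold, polarity, pred)
        W_e, feature, threshold, polarity, pred = best
        # Step 2.2: weighted error rate e_m (Eq. 33)
        e_m = np.clip(W_e / np.sum(w), 1e-10, 1 - 1e-10)
        # Step 3: alpha_m and weight update (Eqs. 34-36)
        alpha_m = 0.5 * np.log((1 - e_m) / e_m)
        w = w * np.exp(-alpha_m * t * pred)       # up-weight misses, down-weight hits
        committee.append((alpha_m, feature, threshold, polarity))
    return committee

def adaboost_predict(X, committee):
    """Equation (37): Y_M(x) = sign(sum_m alpha_m y_m(x))."""
    scores = sum(a * stump_predict(X, f, thr, pol) for a, f, thr, pol in committee)
    return np.sign(scores)

# Tiny illustrative run on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # a toy two-class problem
committee = adaboost_train(X, t, M=10)
print("training accuracy:", np.mean(adaboost_predict(X, committee) == t))
```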
Observations
First
The first base classifier is trained using the usual procedure for training a single classifier.
Second
From Equations (35) and (36), we can see that the weighting coefficients are increased for data points that are misclassified.
Third
The quantity e_m represents a weighted measure of the error rate.
Thus \alpha_m gives more weight to the more accurate classifiers.
In addition
The pool of classifiers in Step 1 can be substituted by a family of classifiers
One whose members are trained to minimize the error function given the current weights.
Now
If indeed a finite set of classifiers is given, we only need to test the classifiers once for each data point.
The Scouting Matrix S
It can be reused at each iteration: multiplying the transposed weight vector (w^{(m)})^T with S gives W_e for each classifier.
We have then
The following
[W_e^{(1)}  W_e^{(2)}  \cdots  W_e^{(M)}] = (w^{(m)})^T S    (38)
This allows us to reformulate the weight update step such that
Only misses lead to weight modification.
Note
The weight vector w^{(m)} is constructed iteratively.
It could be recomputed completely at every iteration, but the iterative construction is more efficient and simple to implement.
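A small sketch of the scouting-matrix bookkeeping described above, under the assumption of a fixed, pre-evaluated pool of classifiers; the pool's predictions are random placeholders, only meant to show how W_e for every classifier drops out of a single product (w^{(m)})^T S, as in Equation (38).

```python
import numpy as np

rng = np.random.default_rng(3)
N, M_pool = 6, 4                                   # 6 data points, 4 classifiers in the pool

t = rng.choice([-1, 1], size=N)                    # true labels
preds = rng.choice([-1, 1], size=(N, M_pool))      # pool predictions (placeholder values)

# Scouting matrix S: 1 marks a miss, 0 marks a hit (row i = point x_i, column m = classifier m)
S = (preds != t[:, None]).astype(float)

w = np.full(N, 1.0 / N)                            # current weights w^(m)
W_e = w @ S                                        # Equation (38): one W_e per classifier
best = int(np.argmin(W_e))                         # the classifier to add at this iteration
print("W_e per classifier:", W_e, "-> pick classifier", best)
```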
Explanation
Graph of (1 - e_m)/e_m in the range 0 ≤ e_m ≤ 1 (figure omitted)
So we have
Graph of \alpha_m as a function of e_m (figure omitted)
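To reproduce those two curves, a minimal matplotlib sketch (an assumed tool choice, not part of the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(0.01, 0.99, 200)                  # weighted error rate e_m
ratio = (1 - e) / e                               # the quantity inside the logarithm
alpha = 0.5 * np.log(ratio)                       # alpha_m = (1/2) ln((1 - e_m)/e_m)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(e, ratio); ax1.set_xlabel("$e_m$"); ax1.set_ylabel("$(1-e_m)/e_m$")
ax2.plot(e, alpha); ax2.set_xlabel("$e_m$"); ax2.set_ylabel(r"$\alpha_m$")
plt.tight_layout(); plt.show()
```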
We have the following cases
We have the following
If e_m \to 1, we have that none of the samples were correctly classified!!!
Thus
We get that \lim_{e_m \to 1} \frac{1 - e_m}{e_m} = 0, so \alpha_m \to -\infty.
Finally
We get that for every misclassified sample w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} \to 0.
We only need to reverse the answers to get the perfect classifier and select it as the only committee member.
Thus, we have
If e_m \to 1/2
We have \alpha_m \to 0; thus, whether the sample is well or badly classified:
\exp\{-\alpha_m t_i y_m(x_i)\} \to 1    (39)
So the weight does not change at all.
What about e_m \to 0?
We have that \alpha_m \to +\infty.
Thus, we have
For samples that are always correctly classified
w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m t_i y_m(x_i)\} \to 0
Thus, the only committee member we need is y_m; we do not need iteration m + 1.
Example
Training set (scatter plot omitted)
Classes
ω1 = N(0, 1)
ω2 = N(0, σ²) − N(0, 1)
Example
Weak Classifier: Perceptron
For m = 1, ..., M, we have the following possible classifiers (figure omitted)
At the end of the process
For M = 40 (figure omitted)
[DL輪読会]Generative Models of Visually Grounded ImaginationDeep Learning JP
 

What's hot (20)

18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
 
16 Machine Learning Universal Approximation Multilayer Perceptron
16 Machine Learning Universal Approximation Multilayer Perceptron16 Machine Learning Universal Approximation Multilayer Perceptron
16 Machine Learning Universal Approximation Multilayer Perceptron
 
Machine learning in science and industry — day 2
Machine learning in science and industry — day 2Machine learning in science and industry — day 2
Machine learning in science and industry — day 2
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3
 
06 Machine Learning - Naive Bayes
06 Machine Learning - Naive Bayes06 Machine Learning - Naive Bayes
06 Machine Learning - Naive Bayes
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
 
Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
 
MLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackMLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic track
 
Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEP
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
A Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemA Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation Problem
 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
 
Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic net
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty Detection
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)
 
FUZZY DIAGONAL OPTIMAL ALGORITHM TO SOLVE INTUITIONISTIC FUZZY ASSIGNMENT PRO...
FUZZY DIAGONAL OPTIMAL ALGORITHM TO SOLVE INTUITIONISTIC FUZZY ASSIGNMENT PRO...FUZZY DIAGONAL OPTIMAL ALGORITHM TO SOLVE INTUITIONISTIC FUZZY ASSIGNMENT PRO...
FUZZY DIAGONAL OPTIMAL ALGORITHM TO SOLVE INTUITIONISTIC FUZZY ASSIGNMENT PRO...
 
NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015
 
[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination
 

Similar to 24 Machine Learning Combining Models - Ada Boost

Machine learning (5)
Machine learning (5)Machine learning (5)
Machine learning (5)NYversity
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recapbutest
 
Aaa ped-16-Unsupervised Learning: clustering
Aaa ped-16-Unsupervised Learning: clusteringAaa ped-16-Unsupervised Learning: clustering
Aaa ped-16-Unsupervised Learning: clusteringAminaRepo
 
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble LearningAaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble LearningAminaRepo
 
Toward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss ModelsToward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss ModelsJacques Rioux
 
Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3butest
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...Golden Helix Inc
 
Probability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsProbability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsChirag Gupta
 
BaggingBoosting.pdf
BaggingBoosting.pdfBaggingBoosting.pdf
BaggingBoosting.pdfDynamicPitch
 
Introduction to Machine Learning Lectures
Introduction to Machine Learning LecturesIntroduction to Machine Learning Lectures
Introduction to Machine Learning Lecturesssuserfece35
 
Minghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkMinghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkWei Wang
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 r-kor
 
Decision Tree and Bayesian Classification
Decision Tree and Bayesian ClassificationDecision Tree and Bayesian Classification
Decision Tree and Bayesian ClassificationKomal Kotak
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revisedKrish_ver2
 
Dong Zhang's project
Dong Zhang's projectDong Zhang's project
Dong Zhang's projectDong Zhang
 
support vector machine 1.pptx
support vector machine 1.pptxsupport vector machine 1.pptx
support vector machine 1.pptxsurbhidutta4
 

Similar to 24 Machine Learning Combining Models - Ada Boost (20)

Machine learning (5)
Machine learning (5)Machine learning (5)
Machine learning (5)
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recap
 
Aaa ped-16-Unsupervised Learning: clustering
Aaa ped-16-Unsupervised Learning: clusteringAaa ped-16-Unsupervised Learning: clustering
Aaa ped-16-Unsupervised Learning: clustering
 
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble LearningAaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble Learning
 
Toward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss ModelsToward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss Models
 
Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
 
Probability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsProbability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional Experts
 
BaggingBoosting.pdf
BaggingBoosting.pdfBaggingBoosting.pdf
BaggingBoosting.pdf
 
Introduction to Machine Learning Lectures
Introduction to Machine Learning LecturesIntroduction to Machine Learning Lectures
Introduction to Machine Learning Lectures
 
Minghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkMinghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation Talk
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
 
Decision Tree and Bayesian Classification
Decision Tree and Bayesian ClassificationDecision Tree and Bayesian Classification
Decision Tree and Bayesian Classification
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised
 
Dong Zhang's project
Dong Zhang's projectDong Zhang's project
Dong Zhang's project
 
Model selection
Model selectionModel selection
Model selection
 
support vector machine 1.pptx
support vector machine 1.pptxsupport vector machine 1.pptx
support vector machine 1.pptx
 

More from Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 

Recently uploaded

High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfNainaShrivastava14
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxStephen Sitton
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewsandhya757531
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdfAkritiPradhan2
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsResearcher Researcher
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESkarthi keyan
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
Python Programming for basic beginners.pptx
Python Programming for basic beginners.pptxPython Programming for basic beginners.pptx
Python Programming for basic beginners.pptxmohitesoham12
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptJohnWilliam111370
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming languageSmritiSharma901052
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 

Recently uploaded (20)

High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptx
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overview
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending Actuators
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
Python Programming for basic beginners.pptx
Python Programming for basic beginners.pptxPython Programming for basic beginners.pptx
Python Programming for basic beginners.pptx
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Community
 
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming language
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 

24 Machine Learning Combining Models - Ada Boost

  • 1. Machine Learning for Data Mining Combining Models Andres Mendez-Vazquez July 20, 2015 1 / 65
  • 2. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 2 / 65
  • 3. Images/cinvestav- Introduction Observation It is often found that improved performance can be obtained by combining multiple classifiers together in some way. Example: Committees We might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example: Boosting It involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. 3 / 65
  • 4. Images/cinvestav- Introduction Observation It is often found that improved performance can be obtained by combining multiple classifiers together in some way. Example: Committees We might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example: Boosting It involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. 3 / 65
  • 5. Images/cinvestav- Introduction Observation It is often found that improved performance can be obtained by combining multiple classifiers together in some way. Example: Committees We might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example: Boosting It involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. 3 / 65
  • 6. Images/cinvestav- In addition Instead of averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 4 / 65
  • 7. Images/cinvestav- In addition Instead of averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 4 / 65
  • 8. Images/cinvestav- In addition Instead of averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 4 / 65
  • 9. Images/cinvestav- We want this Models in charge of different set of inputs 5 / 65
  • 10. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 11. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 12. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 13. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 14. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 15. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 16. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 17. Images/cinvestav- This is used in the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 7 / 65
  • 18. Images/cinvestav- This is used in the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 7 / 65
  • 19. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 8 / 65
  • 20. Images/cinvestav- Now Although Model Combinations and Bayesian Model Averaging look similar. However They ara actually different For this We have the following example. 9 / 65
  • 21. Images/cinvestav- Now Although Model Combinations and Bayesian Model Averaging look similar. However They ara actually different For this We have the following example. 9 / 65
  • 22. Images/cinvestav- Now Although Model Combinations and Bayesian Model Averaging look similar. However They ara actually different For this We have the following example. 9 / 65
  • 23. Images/cinvestav- Example For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. p (x, z) (2) Thus Corresponding density over the observed variable x using marginalization: p (x) = z p (x, z) (3) 10 / 65
  • 24. Images/cinvestav- Example For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. p (x, z) (2) Thus Corresponding density over the observed variable x using marginalization: p (x) = z p (x, z) (3) 10 / 65
  • 25. Images/cinvestav- In the case of Gaussian distributions We have p (x) = M k=1 πkN (x|µkΣk) (4) Now, for independent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) (5) Now, suppose We have several different models indexed by h = 1, ..., H with prior probabilities. For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions. 11 / 65
  • 26. Images/cinvestav- In the case of Gaussian distributions We have p (x) = M k=1 πkN (x|µkΣk) (4) Now, for independent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) (5) Now, suppose We have several different models indexed by h = 1, ..., H with prior probabilities. For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions. 11 / 65
  • 27. Images/cinvestav- In the case of Gaussian distributions We have p (x) = M k=1 πkN (x|µkΣk) (4) Now, for independent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) (5) Now, suppose We have several different models indexed by h = 1, ..., H with prior probabilities. For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions. 11 / 65
  • 28. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 29. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 30. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 31. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 32. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 33. Images/cinvestav- The Differences Bayesian model averaging The whole data set is generated by a single model. Model combination Different data points within the data set can potentially be generated from different values of the latent variable z and hence by different components. 13 / 65
  • 34. Images/cinvestav- The Differences Bayesian model averaging The whole data set is generated by a single model. Model combination Different data points within the data set can potentially be generated from different values of the latent variable z and hence by different components. 13 / 65
  • 35. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  • 36. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  • 37. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  • 38. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  • 39. Images/cinvestav- For example We can do the follwoing When we averaged a set of low-bias models, we obtained accurate predictions of the underlying sinusoidal function from which the data were generated. 15 / 65
  • 40. Images/cinvestav- However Big Problem We have normally a single data set Thus We need to introduce certain variability between the different committee members. One approach You can use bootstrap data sets. 16 / 65
  • 41. Images/cinvestav- However Big Problem We have normally a single data set Thus We need to introduce certain variability between the different committee members. One approach You can use bootstrap data sets. 16 / 65
  • 42. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 17 / 65
  • 43. Images/cinvestav- What to do with the Bootstrap Data Set Given a Regression Problem Here, we are trying to predict the value of a single continuous variable. What? Suppose our original data set consists of N data points X = {x1, ..., xN } . Thus We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. 18 / 65
  • 44. Images/cinvestav- What to do with the Bootstrap Data Set Given a Regression Problem Here, we are trying to predict the value of a single continuous variable. What? Suppose our original data set consists of N data points X = {x1, ..., xN } . Thus We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. 18 / 65
  • 45. Images/cinvestav- What to do with the Bootstrap Data Set Given a Regression Problem Here, we are trying to predict the value of a single continuous variable. What? Suppose our original data set consists of N data points X = {x1, ..., xN } . Thus We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. 18 / 65
  • 46. Images/cinvestav- Furthermore This is repeated M times Thus, we generate M data sets each of size N and each obtained by sampling from the original data set X. 19 / 65
  • 47. Images/cinvestav- Thus Use each of them to train a copy ym (x) of a predictive regression model to predict a single continuous variable Then, ycom (x) = 1 M M m=1 ym (x) (7) This is also known as Bootstrap Aggregation or Bagging. 20 / 65
  • 48. Images/cinvestav- What we do with this samples? Now, assume a true regression function h (x) and a estimation ym (x) ym (x) = h (x) + m (x) (8) The average sum-of-squares error over the data takes the form Ex {ym (x) − h (x)}2 = Ex { m (x)}2 (9) What is Ex? It denotes a frequentest expectation with respect to the distribution of the input vector x. 21 / 65
  • 49. Images/cinvestav- What we do with this samples? Now, assume a true regression function h (x) and a estimation ym (x) ym (x) = h (x) + m (x) (8) The average sum-of-squares error over the data takes the form Ex {ym (x) − h (x)}2 = Ex { m (x)}2 (9) What is Ex? It denotes a frequentest expectation with respect to the distribution of the input vector x. 21 / 65
  • 50. Images/cinvestav- What we do with this samples? Now, assume a true regression function h (x) and a estimation ym (x) ym (x) = h (x) + m (x) (8) The average sum-of-squares error over the data takes the form Ex {ym (x) − h (x)}2 = Ex { m (x)}2 (9) What is Ex? It denotes a frequentest expectation with respect to the distribution of the input vector x. 21 / 65
  • 51. Images/cinvestav- Meaning Thus, the average error is EAV = 1 M M m=1 Ex { m (x)}2 (10) Similarly the Expected error over the committee ECOM = Ex   1 M M m=1 (ym (x) − h (x)) 2   = Ex   1 M M m=1 m (x) 2   (11) 22 / 65
  • 52. Images/cinvestav- Meaning Thus, the average error is EAV = 1 M M m=1 Ex { m (x)}2 (10) Similarly the Expected error over the committee ECOM = Ex   1 M M m=1 (ym (x) − h (x)) 2   = Ex   1 M M m=1 m (x) 2   (11) 22 / 65
  • 53. Images/cinvestav- Assume that the errors have zero mean and are uncorrelated Assume that the errors have zero mean and are uncorrelated Ex [ m (x)] =0 Ex [ m (x) l (x)] = 0, for m = l 23 / 65
  • 54. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 55. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 56. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 57. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 58. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 59. Images/cinvestav- We finally obtain We obtain ECOM = 1 M EAV (12) Looks great BUT!!! Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated. 25 / 65
  • 60. Images/cinvestav- We finally obtain We obtain ECOM = 1 M EAV (12) Looks great BUT!!! Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated. 25 / 65
  • 61. Images/cinvestav- Thus The Reality!!! The errors are typically highly correlated, and the reduction in overall error is generally small. Something Notable However, It can be shown that the expected committee error will not exceed the expected error of the constituent models, so ECOM ≤ EAV (13) We need something better A more sophisticated technique known as boosting. 26 / 65
  • 62. Images/cinvestav- Thus The Reality!!! The errors are typically highly correlated, and the reduction in overall error is generally small. Something Notable However, It can be shown that the expected committee error will not exceed the expected error of the constituent models, so ECOM ≤ EAV (13) We need something better A more sophisticated technique known as boosting. 26 / 65
  • 63. Images/cinvestav- Thus The Reality!!! The errors are typically highly correlated, and the reduction in overall error is generally small. Something Notable However, It can be shown that the expected committee error will not exceed the expected error of the constituent models, so ECOM ≤ EAV (13) We need something better A more sophisticated technique known as boosting. 26 / 65
  • 64. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 27 / 65
  • 65. Images/cinvestav- Boosting What Boosting does? It combines several classifiers to produce a form of a committee. We will describe AdaBoost “Adaptive Boosting” developed by Freund and Schapire (1995). 28 / 65
  • 66. Images/cinvestav- Boosting What Boosting does? It combines several classifiers to produce a form of a committee. We will describe AdaBoost “Adaptive Boosting” developed by Freund and Schapire (1995). 28 / 65
  • 67. Images/cinvestav- Sequential Training Main difference between boosting and committee methods The base classifiers are trained in sequence. Explanation Consider a two-class classification problem: 1 Samples x1, x2,..., xN 2 Binary labels (-1,1) t1, t2, ..., tN 29 / 65
  • 68. Images/cinvestav- Sequential Training Main difference between boosting and committee methods The base classifiers are trained in sequence. Explanation Consider a two-class classification problem: 1 Samples x1, x2,..., xN 2 Binary labels (-1,1) t1, t2, ..., tN 29 / 65
  • 69. Images/cinvestav- Cost Function Now You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!! Thus For each pattern xi each expert classifier outputs a classification yj (xi) ∈ {−1, 1} The final decision of the committee of M experts is sign (C (xi)) C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (14) 30 / 65
  • 70. Images/cinvestav- Cost Function Now You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!! Thus For each pattern xi each expert classifier outputs a classification yj (xi) ∈ {−1, 1} The final decision of the committee of M experts is sign (C (xi)) C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (14) 30 / 65
  • 71. Images/cinvestav- Cost Function Now You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!! Thus For each pattern xi each expert classifier outputs a classification yj (xi) ∈ {−1, 1} The final decision of the committee of M experts is sign (C (xi)) C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (14) 30 / 65
  • 72. Images/cinvestav- Now Adaptive Boosting It works even with a continuum of classifiers. However For the sake of simplicity, we will assume that the set of expert is finite. 31 / 65
  • 73. Images/cinvestav- Now Adaptive Boosting It works even with a continuum of classifiers. However For the sake of simplicity, we will assume that the set of expert is finite. 31 / 65
  • 74. Images/cinvestav- Getting the correct classifiers We want the following We want to review possible element members. Select them, if they have certain properties. Assigning a weight to their contribution to the set of experts. 32 / 65
  • 75. Images/cinvestav- Getting the correct classifiers We want the following We want to review possible element members. Select them, if they have certain properties. Assigning a weight to their contribution to the set of experts. 32 / 65
  • 76. Images/cinvestav- Getting the correct classifiers We want the following We want to review possible element members. Select them, if they have certain properties. Assigning a weight to their contribution to the set of experts. 32 / 65
  • 77. Now
    Selection is done the following way: test the classifiers in the pool using a training set T of N multidimensional data points x_i, where each point x_i has a label t_i = 1 or t_i = −1.
    Assigning a cost to each action, we test and rank all classifiers in the expert pool by
      charging a cost exp{β} any time a classifier fails (a miss), and
      charging a cost exp{−β} any time a classifier provides the right label (a hit).
  • 82. Remarks about β
    We require β > 0, so that misses are penalized more heavily than hits.
    It may look strange to penalize hits, but this causes no problem as long as the penalty for a success is smaller than the penalty for a miss: exp{−β} < exp{β}.
    Why does this not compromise generality? If we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = c^d and b = c^{−d} for constants c and d. (Hint: rescaling both costs by 1/√(ab) does not change the ranking of classifiers and turns them into √(a/b) and its reciprocal, so c = √(a/b) and d = 1 work.)
  • 85. Exponential Loss Function
    Something notable: this kind of error function, different from the usual squared Euclidean distance to the classification target, is called an exponential loss function.
    AdaBoost uses the exponential loss as its error criterion.
  • 87. The Matrix S
    When we test the M classifiers in the pool, we build a matrix S in which we record the misses (with a ONE) and the hits (with a ZERO) of each classifier.
    Row i in the matrix is reserved for the data point x_i; column m is reserved for the m-th classifier in the pool.

              Classifiers
            1    2   · · ·   M
      x_1   0    1   · · ·   1
      x_2   0    0   · · ·   1
      x_3   1    1   · · ·   0
      ...   ...  ...         ...
      x_N   0    0   · · ·   0
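A small sketch of how the matrix S could be built, assuming the pool members are plain Python callables and the data live in NumPy arrays (the names scouting_matrix, X, t are ours, not the lecture's):

```python
import numpy as np

def scouting_matrix(classifiers, X, t):
    """S[i, m] = 1 if classifier m misses data point x_i, 0 if it hits it.

    X: (N, d) array of samples; t: (N,) array of labels in {-1, +1};
    classifiers: list of M callables mapping one sample to -1 or +1.
    """
    N, M = len(X), len(classifiers)
    S = np.zeros((N, M), dtype=int)
    for m, clf in enumerate(classifiers):
        preds = np.array([clf(x) for x in X])
        S[:, m] = (preds != t).astype(int)  # ONE for a miss, ZERO for a hit
    return S
```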
  • 90. Main Idea
    What does AdaBoost want to do? The main idea of AdaBoost is to proceed systematically, extracting one classifier from the pool in each of M iterations.
    The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.
    At the beginning of the iterations, all data samples are assigned the same weight: just 1, or 1/N if we want the weights to sum to 1.
  • 94. The process of the weights
    As the selection progresses, the more difficult samples, those on which the committee still performs badly, are assigned larger and larger weights.
    The selection process therefore concentrates on picking new classifiers that can help the committee with the still misclassified examples.
    The best classifiers are those which provide new insight to the committee; the classifiers being selected should complement each other in an optimal way.
  • 98. Selecting New Classifiers
    What we want: in each iteration we rank all classifiers, so that we can select the current best out of the pool.
    At the m-th iteration we have already included m − 1 classifiers in the committee and we want to select the next one.
    Thus, we have the following cost function, which is actually the output of the committee so far:
      C^{(m−1)}(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + ... + α_{m−1} y_{m−1}(x_i)    (15)
  • 101. Thus, we have that
    Extending the cost function:
      C^{(m)}(x_i) = C^{(m−1)}(x_i) + α_m y_m(x_i)    (16)
    At the first iteration, m = 1, C^{(m−1)} is the zero function.
    Thus, the total cost or total error is defined as the exponential error
      E = Σ_{i=1}^{N} exp{−t_i (C^{(m−1)}(x_i) + α_m y_m(x_i))}    (17)
  • 104. Thus
    We want to determine α_m and y_m in an optimal way.
    Thus, rewriting,
      E = Σ_{i=1}^{N} w_i^{(m)} exp{−t_i α_m y_m(x_i)}    (18)
    where, for i = 1, 2, ..., N,
      w_i^{(m)} = exp{−t_i C^{(m−1)}(x_i)}    (19)
  • 107. Thus
    In the first iteration w_i^{(1)} = 1 for i = 1, ..., N. During later iterations, the vector w^{(m)} represents the weight assigned to each data point in the training set at iteration m.
    Splitting equation (Eq. 18),
      E = Σ_{t_i = y_m(x_i)} w_i^{(m)} exp{−α_m} + Σ_{t_i ≠ y_m(x_i)} w_i^{(m)} exp{α_m}    (20)
    Meaning: the total cost is the weighted cost of all hits plus the weighted cost of all misses.
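The split in Eq. (20) into weighted hits and weighted misses amounts to a few lines of NumPy; this is only an illustrative sketch with our own names (split_cost, preds):

```python
import numpy as np

def split_cost(w, preds, t, alpha):
    """E = W_c * exp(-alpha) + W_e * exp(alpha), where W_c is the total weight of
    the hits and W_e the total weight of the misses of the candidate classifier."""
    hits = (preds == t)
    W_c, W_e = w[hits].sum(), w[~hits].sum()
    return W_c * np.exp(-alpha) + W_e * np.exp(alpha)
```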
  • 110. Now
    Writing the first summand as W_c exp{−α_m} and the second as W_e exp{α_m},
      E = W_c exp{−α_m} + W_e exp{α_m}    (21)
    Now, for the selection of y_m the exact value of α_m > 0 is irrelevant, since for a fixed α_m minimizing E is equivalent to minimizing exp{α_m} E, and
      exp{α_m} E = W_c + W_e exp{2α_m}    (22)
  • 112. Then
    Since exp{2α_m} > 1, we can rewrite (Eq. 22) as
      exp{α_m} E = W_c + W_e − W_e + W_e exp{2α_m}    (23)
    Thus
      exp{α_m} E = (W_c + W_e) + W_e (exp{2α_m} − 1)    (24)
    Now, W_c + W_e is the total sum W of the weights of all data points, which is constant in the current iteration.
  • 115. Thus
    The right-hand side of the equation is minimized when, at the m-th iteration, we pick the classifier with the lowest total cost W_e (that is, the lowest weighted error).
    Intuitively, the next selected y_m should be the one with the lowest penalty given the current set of weights.
  • 117. Deriving against the weight α_m
    Going back to the original E, we can use the derivative trick:
      ∂E/∂α_m = −W_c exp{−α_m} + W_e exp{α_m}    (25)
    Setting this equal to zero and multiplying by exp{α_m},
      −W_c + W_e exp{2α_m} = 0    (26)
    The optimal value is thus
      α_m = (1/2) ln(W_c / W_e)    (27)
  • 120. Now
    With the total sum of all weights
      W = W_c + W_e    (28)
    we can rewrite the previous equation as
      α_m = (1/2) ln((W − W_e) / W_e) = (1/2) ln((1 − e_m) / e_m)    (29)
    with the percentage rate of error, given the weights of the data points,
      e_m = W_e / W    (30)
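Eq. (29) translates directly into code. The small eps guard for e_m = 0 or e_m = 1 is our own addition, since the formula itself is undefined at those extremes:

```python
import numpy as np

def alpha_from_error(e_m, eps=1e-12):
    """alpha_m = 0.5 * ln((1 - e_m) / e_m), as in Eq. (29)."""
    e_m = float(np.clip(e_m, eps, 1.0 - eps))
    return 0.5 * np.log((1.0 - e_m) / e_m)
```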
  • 123. What about the weights?
    Using the equation
      w_i^{(m)} = exp{−t_i C^{(m−1)}(x_i)}    (31)
    and because we now have α_m and y_m(x_i),
      w_i^{(m+1)} = exp{−t_i C^{(m)}(x_i)}
                  = exp{−t_i (C^{(m−1)}(x_i) + α_m y_m(x_i))}
                  = w_i^{(m)} exp{−t_i α_m y_m(x_i)}
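The last line above is the whole weight update. As a vectorized sketch (the array names w, preds, t are ours):

```python
import numpy as np

def update_weights(w, alpha_m, preds, t):
    """w_i^(m+1) = w_i^(m) * exp(-t_i * alpha_m * y_m(x_i)) for every training point.

    w, preds, t are NumPy vectors of length N; preds holds y_m(x_i) in {-1, +1}."""
    return w * np.exp(-alpha_m * t * preds)
```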
  • 127. Sequential Training
    Thus, AdaBoost trains a new classifier using a data set in which the weighting coefficients are adjusted according to the performance of the previously trained classifiers, so as to give greater weight to the misclassified data points.
  • 129. Outline
    1 Introduction
    2 Bayesian Model Averaging: Model Combination vs. Bayesian Model Averaging
    3 Committees: Bootstrap Data Sets
    4 Boosting: AdaBoost Development, Cost Function, Selection Process, Selecting New Classifiers, Using our Deriving Trick, AdaBoost Algorithm, Some Remarks and Implementation Issues, Explanation about AdaBoost's behavior, Example
  • 130. AdaBoost Algorithm
    Step 1: Initialize w_i^{(1)} to 1/N.
    Step 2: For m = 1, 2, ..., M
      1. Select a weak classifier y_m(x) for the training data by minimizing the weighted error function,
           arg min_{y_m} Σ_{i=1}^{N} w_i^{(m)} I(y_m(x_i) ≠ t_i) = arg min_{y_m} Σ_{t_i ≠ y_m(x_i)} w_i^{(m)} = arg min_{y_m} W_e    (32)
         where I is an indicator function.
      2. Evaluate
           e_m = Σ_{i=1}^{N} w_i^{(m)} I(y_m(x_i) ≠ t_i) / Σ_{i=1}^{N} w_i^{(m)}    (33)
  • 136. AdaBoost Algorithm
    Step 3: Set the α_m weight to
      α_m = (1/2) ln((1 − e_m) / e_m)    (34)
    Then update the weights of the data for the next iteration.
    If t_i ≠ y_m(x_i), i.e. a miss:
      w_i^{(m+1)} = w_i^{(m)} exp{α_m} = w_i^{(m)} √((1 − e_m) / e_m)    (35)
    If t_i = y_m(x_i), i.e. a hit:
      w_i^{(m+1)} = w_i^{(m)} exp{−α_m} = w_i^{(m)} √(e_m / (1 − e_m))    (36)
  • 139. Finally, make predictions
    For this use
      Y_M(x) = sign( Σ_{m=1}^{M} α_m y_m(x) )    (37)
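Putting Steps 1-3 and the final prediction (Eq. 37) together, here is a minimal AdaBoost sketch over a fixed pool of weak classifiers. It assumes the pool members are callables returning labels in {-1, +1}; the names adaboost_pool, chosen, Y_M are ours, and the small guard on e_m is an implementation convenience not discussed in the slides.

```python
import numpy as np

def adaboost_pool(classifiers, X, t, M):
    """AdaBoost over a fixed pool: returns the committee decision function Y_M.

    classifiers: list of callables x -> {-1, +1}
    X: (N, d) samples; t: (N,) labels in {-1, +1}; M: number of boosting rounds."""
    N = len(X)
    w = np.full(N, 1.0 / N)                                         # Step 1: uniform weights
    preds = np.array([[clf(x) for x in X] for clf in classifiers])  # (pool size, N)
    chosen, alphas = [], []
    for m in range(M):
        miss = (preds != t)                                   # Step 2.1: miss matrix
        W_e = miss.astype(float) @ w                          # weighted error of every pool member
        k = int(np.argmin(W_e))                               # best classifier this round
        e_m = W_e[k] / w.sum()                                # Step 2.2: weighted error rate
        alpha = 0.5 * np.log((1.0 - e_m) / max(e_m, 1e-12))   # Step 3: Eq. (34)
        w = w * np.exp(-alpha * t * preds[k])                 # Eqs. (35)-(36) in one line
        chosen.append(k)
        alphas.append(alpha)

    def Y_M(x):  # Eq. (37)
        return int(np.sign(sum(a * classifiers[k](x) for a, k in zip(alphas, chosen))))

    return Y_M
```

A committee could then be built as Y = adaboost_pool(pool, X, t, M=40) and queried as Y(x_new).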
  • 140. Observations
    First: the first base classifier is trained by the usual procedure of training a single classifier.
    Second: from (Eq. 35) and (Eq. 36) we can see that the weighting coefficients are increased for data points that are misclassified.
    Third: the quantity e_m represents a weighted measure of the error rate; thus α_m gives more weight to the more accurate classifiers.
  • 143. In addition
    The pool of classifiers in Step 1 can be substituted by a family of classifiers whose members are trained to minimize the error function given the current weights.
    If, instead, a finite set of classifiers is given, we only need to test the classifiers once for each data point.
    The scouting matrix S can then be reused at each iteration: multiplying the transposed vector of weights w^{(m)} with S yields W_e for each machine.
  • 146. We have then
    The following:
      (W_e^{(1)}, W_e^{(2)}, ..., W_e^{(M)}) = (w^{(m)})^T S    (38)
    This allows us to reformulate the weight update step so that only misses lead to weight modification.
    Note that the weight vector w^{(m)} is constructed iteratively. It could be recomputed completely at every iteration, but the iterative construction is more efficient and simpler to implement.
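With the scouting matrix built once, Eq. (38) is a single matrix-vector product. A sketch, reusing the hypothetical names from the earlier snippets:

```python
import numpy as np

def rank_pool(w, S):
    """Return (W_e^(1), ..., W_e^(M)) = w^T S and the index of the best classifier.

    w: (N,) current data weights; S: (N, M) scouting matrix of misses."""
    W_e_all = w @ S
    return W_e_all, int(np.argmin(W_e_all))
```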
  • 150. Explanation
    [Figure] Graph of (1 − e_m) / e_m over the range 0 ≤ e_m ≤ 1.
  • 152. We have the following cases
    If e_m → 1, all the samples were incorrectly classified.
    Then lim_{e_m → 1} (1 − e_m) / e_m = 0, so α_m → −∞.
    For every misclassified sample, w_i^{(m+1)} = w_i^{(m)} exp{α_m} → 0.
    We only need to reverse the answers to get the perfect classifier and select it as the only committee member.
  • 155. Thus, we have
    If e_m → 1/2, we have α_m → 0; thus, whether the sample is well or badly classified,
      exp{−α_m t_i y_m(x_i)} → 1    (39)
    so the weights do not change at all.
    What about e_m → 0? We have α_m → +∞, and for samples that are always correctly classified
      w_i^{(m+1)} = w_i^{(m)} exp{−α_m t_i y_m(x_i)} → 0
    Thus, the only committee member we need is m; we do not need m + 1.
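A quick numeric check of these three regimes (a throwaway script, not from the lecture):

```python
import numpy as np

for e_m in (0.999, 0.5, 0.001):
    alpha_m = 0.5 * np.log((1.0 - e_m) / e_m)
    print(f"e_m = {e_m:5.3f}  ->  alpha_m = {alpha_m:+.2f}")

# e_m near 1: large negative alpha_m (reversing the answers would give a strong classifier)
# e_m = 1/2:  alpha_m = 0 (the classifier adds no information)
# e_m near 0: large positive alpha_m (this classifier alone nearly settles the vote)
```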
  • 158. Outline
    1 Introduction
    2 Bayesian Model Averaging: Model Combination vs. Bayesian Model Averaging
    3 Committees: Bootstrap Data Sets
    4 Boosting: AdaBoost Development, Cost Function, Selection Process, Selecting New Classifiers, Using our Deriving Trick, AdaBoost Algorithm, Some Remarks and Implementation Issues, Explanation about AdaBoost's behavior, Example
  • 159. Example
    Training set classes: ω_1 = N(0, 1), ω_2 = N(0, σ²) − N(0, 1).
  • 160. Example
    Weak classifier: the Perceptron. For m = 1, ..., M, we have the following possible classifiers.
  • 162. At the end of the process
    For M = 40.