Machine Learning for Data Mining
Combining Models
Andres Mendez-Vazquez
July 20, 2015
Outline
1 Introduction
2 Bayesian Model Averaging
Model Combination Vs. Bayesian Model Averaging
3 Committees
Bootstrap Data Sets
4 Boosting
AdaBoost Development
Cost Function
Selection Process
Selecting New Classifiers
Using our Deriving Trick
AdaBoost Algorithm
Some Remarks and Implementation Issues
Explanation about AdaBoost’s behavior
Example
Introduction
Observation
It is often found that improved performance can be obtained by combining
multiple classifiers together in some way.
Example: Committees
We might train L different classifiers and then make predictions using the
average of the predictions made by each classifier.
Example: Boosting
It involves training multiple models in sequence in which the error function
used to train a particular model depends on the performance of the
previous models.
In addition
Instead of averaging the predictions of a set of models
You can use an alternative form of combination that selects one of the
models to make the prediction.
Where
The choice of model is a function of the input variables.
Thus
Different models become responsible for making decisions in different regions of the input space.
We want this
Models in charge of different sets of inputs (figure omitted)
Example: Decision Trees
We can have a decision tree on top of the models
Given a set of models, one model is chosen to make the decision in a certain region of the input space.
Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value.
Thus it is better to soften the combination
If we have M classifiers, each giving a conditional distribution p(t|x, k):
x is the input variable.
t is the target variable.
k = 1, 2, ..., M indexes the classifiers.
This is used in the mixture of distributions
Thus (Mixture of Experts)
p(t|x) = \sum_{k=1}^{M} \pi_k(x) p(t|x, k)    (1)
where \pi_k(x) = p(k|x) are the input-dependent mixing coefficients.
This type of model
It can be viewed as a mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables; such models are known as mixtures of experts.
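As a rough illustration, here is a minimal NumPy sketch of evaluating Equation (1) for a mixture of experts, assuming a softmax gating function and Gaussian experts; the gating weights V, the expert mean functions, and the standard deviations are hypothetical placeholders, not anything specified in the slides.

```python
import numpy as np

def moe_density(t, x, V, means, sigmas):
    """Evaluate p(t|x) = sum_k pi_k(x) N(t | mean_k(x), sigma_k^2).

    V      : (M, d) gating weights, pi_k(x) = softmax(V x)_k
    means  : list of M functions x -> scalar mean of expert k
    sigmas : (M,) standard deviations of the Gaussian experts
    """
    logits = V @ x                                  # gate scores, one per expert
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                                  # pi_k(x) = p(k|x), sums to 1
    experts = np.array([
        np.exp(-0.5 * ((t - m(x)) / s) ** 2) / (np.sqrt(2 * np.pi) * s)
        for m, s in zip(means, sigmas)
    ])                                              # p(t|x, k) for each expert
    return float(pi @ experts)                      # Equation (1)

# Toy usage with two linear-mean experts (purely illustrative numbers)
x = np.array([1.0, -0.5])
V = np.array([[0.3, -0.2], [-0.1, 0.4]])
means = [lambda x: 2.0 * x[0], lambda x: -1.0 * x[1]]
sigmas = np.array([1.0, 0.5])
print(moe_density(t=1.5, x=x, V=V, means=means, sigmas=sigmas))
```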
Now
Although
Model Combination and Bayesian Model Averaging look similar.
However
They are actually different.
For this
We have the following example.
Example
For this consider the following
A mixture of Gaussians with a binary latent variable z indicating to which component a point belongs:
p(x, z)    (2)
Thus
The corresponding density over the observed variable x is obtained by marginalization:
p(x) = \sum_{z} p(x, z)    (3)
In the case of Gaussian distributions
We have
p(x) = \sum_{k=1}^{M} \pi_k \mathcal{N}(x|\mu_k, \Sigma_k)    (4)
Now, for independent, identically distributed data X = {x_1, x_2, ..., x_N}
p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{z_n} p(x_n, z_n)    (5)
Now, suppose
We have several different models indexed by h = 1, ..., H with prior probabilities p(h).
For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions.
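A minimal sketch of evaluating the mixture density of Equation (4) and the i.i.d. likelihood of Equation (5), assuming diagonal covariances for simplicity; all parameter values below are made up for illustration.

```python
import numpy as np

def gaussian(x, mu, var):
    """Diagonal-covariance Gaussian density N(x | mu, diag(var))."""
    norm = np.prod(np.sqrt(2 * np.pi * var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / norm

def mixture_density(x, pis, mus, vars_):
    """Equation (4): p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian(x, mu, var) for pi, mu, var in zip(pis, mus, vars_))

def dataset_likelihood(X, pis, mus, vars_):
    """Equation (5): p(X) = prod_n p(x_n) for i.i.d. data."""
    return np.prod([mixture_density(x, pis, mus, vars_) for x in X])

# Illustrative two-component mixture in 2D
pis = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
vars_ = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
X = np.array([[0.1, -0.2], [2.8, 3.1]])
print(dataset_likelihood(X, pis, mus, vars_))
```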
Bayesian Model Averaging
The marginal distribution is
p(X) = \sum_{h=1}^{H} p(X, h) = \sum_{h=1}^{H} p(X|h) p(h)    (6)
Observations
1 This is an example of Bayesian model averaging.
2 The summation over h means that just one model is responsible for generating the whole data set.
3 The probability over h simply reflects our uncertainty about which is the correct model to use.
Thus
As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models.
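The following is a small sketch, under assumed per-model likelihoods, of how Equation (6) and the posterior p(h|X) behave as data accumulate: with more data the posterior tends to concentrate on a single model. The two candidate models (a standard normal and a Cauchy) and the data-generating choice are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
models = {
    "gaussian": lambda X: stats.norm.logpdf(X).sum(),    # log p(X | h = gaussian)
    "cauchy":   lambda X: stats.cauchy.logpdf(X).sum(),  # log p(X | h = cauchy)
}
log_prior = {"gaussian": np.log(0.5), "cauchy": np.log(0.5)}   # log p(h)

for N in (5, 50, 500):
    X = rng.standard_normal(N)                   # data actually drawn from the Gaussian model
    log_joint = {h: log_prior[h] + loglik(X) for h, loglik in models.items()}
    m = max(log_joint.values())                  # log-sum-exp for numerical stability
    log_pX = m + np.log(sum(np.exp(v - m) for v in log_joint.values()))   # Equation (6)
    posterior = {h: np.exp(v - log_pX) for h, v in log_joint.items()}     # p(h|X)
    print(N, {h: round(p, 4) for h, p in posterior.items()})
```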
The Differences
Bayesian model averaging
The whole data set is generated by a single model.
Model combination
Different data points within the data set can potentially be generated from
different values of the latent variable z and hence by different components.
Committees
Idea
The simplest way to construct a committee is to average the predictions of a set of individual models.
Thinking as a frequentist
This comes from taking into consideration the trade-off between bias and variance, which decomposes the error of the model into:
The bias component that arises from differences between the model and the true function to be predicted.
The variance component that represents the sensitivity of the model to the individual data points.
For example
We can do the following
When we averaged a set of low-bias models, we obtained accurate predictions of the underlying sinusoidal function from which the data were generated.
However
Big Problem
Normally, we have only a single data set.
Thus
We need to introduce certain variability between the different committee
members.
One approach
You can use bootstrap data sets.
What to do with the Bootstrap Data Set
Given a Regression Problem
Here, we are trying to predict the value of a single continuous variable.
What?
Suppose our original data set consists of N data points X = {x1, ..., xN } .
Thus
We can create a new data set XB by drawing N points at random from X,
with replacement, so that some points in X may be replicated in XB,
whereas other points in X may be absent from XB.
Furthermore
This is repeated M times
Thus, we generate M data sets each of size N and each obtained by
sampling from the original data set X.
Thus
Use each of them to train a copy y_m(x) of a predictive regression model for the single continuous target variable.
Then, the committee prediction is
y_{com}(x) = \frac{1}{M} \sum_{m=1}^{M} y_m(x)    (7)
This is also known as Bootstrap Aggregation or Bagging.
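As a rough sketch of bagging as just described, assuming scikit-learn's DecisionTreeRegressor as the base regression model (any regressor could be substituted); the sinusoidal data and all parameter choices are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
N, M = 200, 25                                   # N data points, M bootstrap models

# Synthetic regression data: noisy sinusoid
X = rng.uniform(0, 2 * np.pi, size=(N, 1))
t = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(N)

models = []
for m in range(M):
    idx = rng.integers(0, N, size=N)             # draw N points with replacement
    X_b, t_b = X[idx], t[idx]                    # bootstrap data set X_B
    models.append(DecisionTreeRegressor(max_depth=4).fit(X_b, t_b))

def y_com(X_new):
    """Equation (7): average the M bootstrap models' predictions."""
    return np.mean([model.predict(X_new) for model in models], axis=0)

X_test = np.linspace(0, 2 * np.pi, 5).reshape(-1, 1)
print(y_com(X_test))
```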
What do we do with these samples?
Now, assume a true regression function h(x) and an estimate y_m(x)
y_m(x) = h(x) + \epsilon_m(x)    (8)
The average sum-of-squares error over the data takes the form
E_x[\{y_m(x) - h(x)\}^2] = E_x[\{\epsilon_m(x)\}^2]    (9)
What is E_x?
It denotes a frequentist expectation with respect to the distribution of the input vector x.
Meaning
Thus, the average error is
E_{AV} = \frac{1}{M} \sum_{m=1}^{M} E_x[\{\epsilon_m(x)\}^2]    (10)
Similarly, the expected error of the committee is
E_{COM} = E_x\left[\left\{\frac{1}{M}\sum_{m=1}^{M}(y_m(x) - h(x))\right\}^2\right] = E_x\left[\left\{\frac{1}{M}\sum_{m=1}^{M}\epsilon_m(x)\right\}^2\right]    (11)
Assume that the errors have zero mean and are uncorrelated
E_x[\epsilon_m(x)] = 0
E_x[\epsilon_m(x)\epsilon_l(x)] = 0, for m \neq l
Then
We have that
E_{COM} = \frac{1}{M^2} E_x\left[\left(\sum_{m=1}^{M}\epsilon_m(x)\right)^2\right]
= \frac{1}{M^2} E_x\left[\sum_{m=1}^{M}\epsilon_m^2(x) + \sum_{h=1}^{M}\sum_{k=1, k\neq h}^{M}\epsilon_h(x)\epsilon_k(x)\right]
= \frac{1}{M^2}\left[E_x\sum_{m=1}^{M}\epsilon_m^2(x) + E_x\sum_{h=1}^{M}\sum_{k=1, k\neq h}^{M}\epsilon_h(x)\epsilon_k(x)\right]
= \frac{1}{M^2}\left[E_x\sum_{m=1}^{M}\epsilon_m^2(x) + \sum_{h=1}^{M}\sum_{k=1, k\neq h}^{M}E_x(\epsilon_h(x)\epsilon_k(x))\right]
= \frac{1}{M^2} E_x\sum_{m=1}^{M}\epsilon_m^2(x) = \frac{1}{M}\cdot\frac{1}{M} E_x\sum_{m=1}^{M}\epsilon_m^2(x)
We finally obtain
We obtain
E_{COM} = \frac{1}{M} E_{AV}    (12)
Looks great BUT!!!
Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated.
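As a quick sanity check of Equation (12) and of the caveat that follows, here is a small simulation sketch (the error distributions and correlation values are assumed for illustration): with independent zero-mean errors the committee error is roughly E_AV / M, while with strongly correlated errors the improvement largely disappears.

```python
import numpy as np

rng = np.random.default_rng(1)
M, n_points = 10, 100_000

def committee_vs_average(correlation):
    """Simulate M model errors eps_m(x) and compare E_AV with E_COM."""
    shared = rng.standard_normal(n_points)               # common error component
    eps = np.sqrt(correlation) * shared + \
          np.sqrt(1 - correlation) * rng.standard_normal((M, n_points))
    e_av = np.mean(eps ** 2)                              # Equation (10)
    e_com = np.mean(eps.mean(axis=0) ** 2)                # Equation (11)
    return e_av, e_com

for rho in (0.0, 0.9):
    e_av, e_com = committee_vs_average(rho)
    print(f"correlation={rho}: E_AV={e_av:.3f}, E_COM={e_com:.3f}, ratio={e_com/e_av:.3f}")
```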
Thus
The Reality!!!
The errors are typically highly correlated, and the reduction in overall error is generally small.
Something Notable
However, it can be shown that the expected committee error will not exceed the expected error of the constituent models, so
E_{COM} \leq E_{AV}    (13)
We need something better
A more sophisticated technique known as boosting.
Boosting
What Boosting does?
It combines several classifiers to produce a form of a committee.
We will describe AdaBoost
“Adaptive Boosting” developed by Freund and Schapire (1995).
Sequential Training
Main difference between boosting and committee methods
The base classifiers are trained in sequence.
Explanation
Consider a two-class classification problem:
1 Samples x_1, x_2, ..., x_N
2 Binary labels t_1, t_2, ..., t_N with t_i \in \{-1, +1\}
Cost Function
Now
You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!!
Thus
For each pattern x_i each expert classifier outputs a classification y_j(x_i) \in \{-1, 1\}
The final decision of the committee of M experts is sign(C(x_i)), where
C(x_i) = \alpha_1 y_1(x_i) + \alpha_2 y_2(x_i) + ... + \alpha_M y_M(x_i)    (14)
Now
Adaptive Boosting
It works even with a continuum of classifiers.
However
For the sake of simplicity, we will assume that the set of experts is finite.
Getting the correct classifiers
We want the following
Review the possible member classifiers.
Select them if they have certain properties.
Assign a weight to their contribution to the set of experts.
Now
Selection is done in the following way
Test the classifiers in the pool using a training set T of N multidimensional data points x_i:
For each point x_i we have a label t_i = 1 or t_i = -1.
Assigning a cost to actions
We test and rank all classifiers in the expert pool by:
Charging a cost exp{\beta} any time a classifier fails (a miss).
Charging a cost exp{-\beta} any time a classifier provides the right label (a hit).
Remarks about β
We require β > 0
Thus misses are penalized more heavily than hits.
Although
It looks strange to penalize hits, but it is harmless as long as the penalty for a success is smaller than the penalty for a miss, exp{-β} < exp{β}.
Why?
Show that if we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = c^d and b = c^{-d} for constants c and d. Thus, it does not compromise generality.
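One way to see this (a sketch of the argument, not given in the slides): only the ratio of the two costs matters for ranking classifiers, so we may divide both costs by \sqrt{ab}, giving
a/\sqrt{ab} = \sqrt{a/b} = \exp\{\beta\},   b/\sqrt{ab} = \sqrt{b/a} = \exp\{-\beta\},   with \beta = \frac{1}{2}\ln\frac{a}{b} > 0 since a > b.
So any cost pair a > b > 0 is equivalent to the exponential pair used above.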
Exponential Loss Function
Something Notable
This kind of error function, different from the usual squared Euclidean distance to the classification target, is called an exponential loss function.
AdaBoost uses the exponential loss as its error criterion.
The Matrix S
When we test the M classifiers in the pool, we build a matrix S
We record the misses (with a ONE) and hits (with a ZERO) of each classifier.
Row i in the matrix is reserved for the data point x_i; column m is reserved for the m-th classifier in the pool.

         Classifiers
         1   2   ...   M
x_1      0   1   ...   1
x_2      0   0   ...   1
x_3      1   1   ...   0
...      ..  ..  ...   ..
x_N      0   0   ...   0
Main Idea
What does AdaBoost want to do?
The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.
Thus
The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.
Thus at the beginning of the iterations
All data samples are assigned the same weight: just 1, or 1/N if we want the weights to sum to 1.
The process of the weights
As the selection progresses
The more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.
Basically
The selection process concentrates on picking new classifiers for the committee that can help with the still misclassified examples.
Then
The best classifiers are those which can provide new insights to the committee.
Classifiers being selected should complement each other in an optimal way.
Selecting New Classifiers
What we want
In each iteration, we rank all classifiers, so that we can select the current best out of the pool.
At the m-th iteration
We have already included m - 1 classifiers in the committee and we want to select the next one.
Thus, we have the following cost function, which is actually the output of the committee
C^{(m-1)}(x_i) = \alpha_1 y_1(x_i) + \alpha_2 y_2(x_i) + ... + \alpha_{m-1} y_{m-1}(x_i)    (15)
Thus, we have that
Extending the cost function
C^{(m)}(x_i) = C^{(m-1)}(x_i) + \alpha_m y_m(x_i)    (16)
At the first iteration m = 1
C^{(m-1)} is the zero function.
Thus, the total cost or total error is defined as the exponential error
E = \sum_{i=1}^{N} \exp\{-t_i (C^{(m-1)}(x_i) + \alpha_m y_m(x_i))\}    (17)
Thus
We want to determine
\alpha_m and y_m in an optimal way.
Thus, rewriting
E = \sum_{i=1}^{N} w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\}    (18)
Where, for i = 1, 2, ..., N
w_i^{(m)} = \exp\{-t_i C^{(m-1)}(x_i)\}    (19)
Thus
In the first iteration w_i^{(1)} = 1 for i = 1, ..., N
During later iterations, the vector w^{(m)} represents the weight assigned to each data point in the training set at iteration m.
Splitting Equation (18)
E = \sum_{t_i = y_m(x_i)} w_i^{(m)} \exp\{-\alpha_m\} + \sum_{t_i \neq y_m(x_i)} w_i^{(m)} \exp\{\alpha_m\}    (20)
Meaning
The total cost is the weighted cost of all hits plus the weighted cost of all misses.
Now
Writing the first summand as W_c \exp\{-\alpha_m\} and the second as W_e \exp\{\alpha_m\}
E = W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}    (21)
Now, for the selection of y_m the exact value of \alpha_m > 0 is irrelevant
Since, for a fixed \alpha_m, minimizing E is equivalent to minimizing \exp\{\alpha_m\} E, and
\exp\{\alpha_m\} E = W_c + W_e \exp\{2\alpha_m\}    (22)
Then
Since \exp\{2\alpha_m\} > 1, we can rewrite Equation (22)
\exp\{\alpha_m\} E = W_c + W_e - W_e + W_e \exp\{2\alpha_m\}    (23)
Thus
\exp\{\alpha_m\} E = (W_c + W_e) + W_e (\exp\{2\alpha_m\} - 1)    (24)
Now
W_c + W_e is the total sum W of the weights of all data points, which is constant in the current iteration.
Thus
The right-hand side of the equation is minimized
When at the m-th iteration we pick the classifier with the lowest total cost W_e (that is, the lowest weighted error).
Intuitively
The next selected y_m should be the one with the lowest penalty given the current set of weights.
Deriving against the weight \alpha_m
Going back to the original E, we can use the derivative trick
\frac{\partial E}{\partial \alpha_m} = -W_c \exp\{-\alpha_m\} + W_e \exp\{\alpha_m\}    (25)
Setting this equal to zero and multiplying by \exp\{\alpha_m\}
-W_c + W_e \exp\{2\alpha_m\} = 0    (26)
The optimal value is thus
\alpha_m = \frac{1}{2} \ln\frac{W_c}{W_e}    (27)
Now
Making the total sum of all weights
W = W_c + W_e    (28)
We can rewrite the previous equation as
\alpha_m = \frac{1}{2} \ln\frac{W - W_e}{W_e} = \frac{1}{2} \ln\frac{1 - e_m}{e_m}    (29)
With the percentage rate of error given the weights of the data points
e_m = \frac{W_e}{W}    (30)
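For instance (a worked example, not from the slides): if the weighted error rate is e_m = 0.25, then \alpha_m = \frac{1}{2}\ln\frac{0.75}{0.25} = \frac{1}{2}\ln 3 \approx 0.549; if e_m = 0.5 then \alpha_m = 0, and as e_m \to 0 the weight \alpha_m grows without bound, so more accurate classifiers get a larger say in the committee.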
What about the weights?
Using the equation
w_i^{(m)} = \exp\{-t_i C^{(m-1)}(x_i)\}    (31)
And because we have \alpha_m and y_m(x_i)
w_i^{(m+1)} = \exp\{-t_i C^{(m)}(x_i)\}
            = \exp\{-t_i (C^{(m-1)}(x_i) + \alpha_m y_m(x_i))\}
            = w_i^{(m)} \exp\{-t_i \alpha_m y_m(x_i)\}
Sequential Training
Thus
AdaBoost trains a new classifier using a data set in which the weighting
coefficients are adjusted according to the performance of the previously
trained classifier so as to give greater weight to the misclassified data
points.
Illustration
Schematic illustration (figure omitted)
AdaBoost Algorithm
Step 1
Initialize w_i^{(1)} = \frac{1}{N}
Step 2
For m = 1, 2, ..., M
1 Select a weak classifier y_m(x) for the training data by minimizing the weighted error function
\arg\min_{y_m} \sum_{i=1}^{N} w_i^{(m)} I(y_m(x_i) \neq t_i) = \arg\min_{y_m} \sum_{t_i \neq y_m(x_i)} w_i^{(m)} = \arg\min_{y_m} W_e    (32)
Where I is an indicator function.
2 Evaluate
e_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(y_m(x_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}}    (33)
AdaBoost Algorithm
Step 3
Set the \alpha_m weight to
\alpha_m = \frac{1}{2} \ln\frac{1 - e_m}{e_m}    (34)
Now update the weights of the data for the next iteration
If t_i \neq y_m(x_i), i.e. a miss
w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}}    (35)
If t_i = y_m(x_i), i.e. a hit
w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m\} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}}    (36)
Finally, make predictions
For this use
Y_M(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m y_m(x)\right)    (37)
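To make the algorithm above concrete, here is a minimal NumPy sketch of AdaBoost with decision stumps as the pool of weak classifiers, covering Steps 1 through 3 and the final sign vote of Equation (37); the stump pool, the synthetic data, and all names are illustrative assumptions, not the slides' own code.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """A decision stump: predict +1 on one side of a threshold, -1 on the other."""
    return polarity * np.where(X[:, feature] > threshold, 1, -1)

def adaboost_train(X, t, M):
    """Train an AdaBoost committee of M decision stumps on labels t in {-1, +1}."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                       # Step 1: w_i^(1) = 1/N
    committee = []
    for m in range(M):
        # Step 2.1: pick the stump with the lowest weighted error W_e (Eq. 32)
        best = None
        for feature in range(X.shape[1]):
            for threshold in np.unique(X[:, feature]):
                for polarity in (+1, -1):
                    pred = stump_predict(X, feature, threshold, polarity)
                    W_e = np.sum(w * (pred != t))
                    if best is None or W_e < best[0]:
                        best = (W_e, feature, threshold, polarity, pred)
        W_e, feature, threshold, polarity, pred = best
        # Step 2.2: weighted error rate e_m (Eq. 33)
        e_m = np.clip(W_e / np.sum(w), 1e-10, 1 - 1e-10)
        # Step 3: alpha_m and weight update (Eqs. 34-36)
        alpha_m = 0.5 * np.log((1 - e_m) / e_m)
        w = w * np.exp(-alpha_m * t * pred)       # up-weight misses, down-weight hits
        committee.append((alpha_m, feature, threshold, polarity))
    return committee

def adaboost_predict(X, committee):
    """Equation (37): Y_M(x) = sign(sum_m alpha_m y_m(x))."""
    scores = sum(a * stump_predict(X, f, thr, pol) for a, f, thr, pol in committee)
    return np.sign(scores)

# Tiny illustrative run on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # a toy two-class problem
committee = adaboost_train(X, t, M=10)
print("training accuracy:", np.mean(adaboost_predict(X, committee) == t))
```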
Observations
First
The first base classifier is trained using the usual procedure for training a single classifier.
Second
From Equations (35) and (36), we can see that the weighting coefficients are increased for data points that are misclassified.
Third
The quantity e_m represents a weighted measure of the error rate.
Thus \alpha_m gives more weight to the more accurate classifiers.
In addition
The pool of classifiers in Step 1 can be substituted by a family of classifiers
One whose members are trained to minimize the error function given the current weights.
Now
If indeed a finite set of classifiers is given, we only need to test the classifiers once for each data point.
The Scouting Matrix S
It can be reused at each iteration: multiplying the transposed weight vector (w^{(m)})^T with S gives W_e for each classifier.
We have then
The following
[W_e^{(1)}  W_e^{(2)}  \cdots  W_e^{(M)}] = (w^{(m)})^T S    (38)
This allows us to reformulate the weight update step such that
Only misses lead to weight modification.
Note
The weight vector w^{(m)} is constructed iteratively.
It could be recomputed completely at every iteration, but the iterative construction is more efficient and simple to implement.
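A small sketch of the scouting-matrix bookkeeping described above, under the assumption of a fixed, pre-evaluated pool of classifiers; the pool's predictions are random placeholders, only meant to show how W_e for every classifier drops out of a single product (w^{(m)})^T S, as in Equation (38).

```python
import numpy as np

rng = np.random.default_rng(3)
N, M_pool = 6, 4                                   # 6 data points, 4 classifiers in the pool

t = rng.choice([-1, 1], size=N)                    # true labels
preds = rng.choice([-1, 1], size=(N, M_pool))      # pool predictions (placeholder values)

# Scouting matrix S: 1 marks a miss, 0 marks a hit (row i = point x_i, column m = classifier m)
S = (preds != t[:, None]).astype(float)

w = np.full(N, 1.0 / N)                            # current weights w^(m)
W_e = w @ S                                        # Equation (38): one W_e per classifier
best = int(np.argmin(W_e))                         # the classifier to add at this iteration
print("W_e per classifier:", W_e, "-> pick classifier", best)
```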
Explanation
Graph of (1 - e_m)/e_m in the range 0 ≤ e_m ≤ 1 (figure omitted)
So we have
Graph of \alpha_m as a function of e_m (figure omitted)
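To reproduce those two curves, a minimal matplotlib sketch (an assumed tool choice, not part of the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

e = np.linspace(0.01, 0.99, 200)                  # weighted error rate e_m
ratio = (1 - e) / e                               # the quantity inside the logarithm
alpha = 0.5 * np.log(ratio)                       # alpha_m = (1/2) ln((1 - e_m)/e_m)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(e, ratio); ax1.set_xlabel("$e_m$"); ax1.set_ylabel("$(1-e_m)/e_m$")
ax2.plot(e, alpha); ax2.set_xlabel("$e_m$"); ax2.set_ylabel(r"$\alpha_m$")
plt.tight_layout(); plt.show()
```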
We have the following cases
We have the following
If e_m \to 1, we have that none of the samples were correctly classified!!!
Thus
We get that \lim_{e_m \to 1} \frac{1 - e_m}{e_m} = 0, so \alpha_m \to -\infty.
Finally
We get that for every misclassified sample w_i^{(m+1)} = w_i^{(m)} \exp\{\alpha_m\} \to 0.
We only need to reverse the answers to get the perfect classifier and select it as the only committee member.
Thus, we have
If e_m \to 1/2
We have \alpha_m \to 0; thus, whether the sample is well or badly classified:
\exp\{-\alpha_m t_i y_m(x_i)\} \to 1    (39)
So the weight does not change at all.
What about e_m \to 0?
We have that \alpha_m \to +\infty.
Thus, we have
For samples that are always correctly classified
w_i^{(m+1)} = w_i^{(m)} \exp\{-\alpha_m t_i y_m(x_i)\} \to 0
Thus, the only committee member we need is y_m; we do not need iteration m + 1.
Example
Training set (scatter plot omitted)
Classes
ω1 = N(0, 1)
ω2 = N(0, σ²) − N(0, 1)
Example
Weak Classifier: Perceptron
For m = 1, ..., M, we have the following possible classifiers (figure omitted)
At the end of the process
For M = 40 (figure omitted)
[DL輪読会]Generative Models of Visually Grounded ImaginationDeep Learning JP
 

What's hot (20)

18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics18 Machine Learning Radial Basis Function Networks Forward Heuristics
18 Machine Learning Radial Basis Function Networks Forward Heuristics
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
 
16 Machine Learning Universal Approximation Multilayer Perceptron
16 Machine Learning Universal Approximation Multilayer Perceptron16 Machine Learning Universal Approximation Multilayer Perceptron
16 Machine Learning Universal Approximation Multilayer Perceptron
 
Machine learning in science and industry — day 2
Machine learning in science and industry — day 2Machine learning in science and industry — day 2
Machine learning in science and industry — day 2
 
Machine learning in science and industry — day 3
Machine learning in science and industry — day 3Machine learning in science and industry — day 3
Machine learning in science and industry — day 3
 
06 Machine Learning - Naive Bayes
06 Machine Learning - Naive Bayes06 Machine Learning - Naive Bayes
06 Machine Learning - Naive Bayes
 
Vc dimension in Machine Learning
Vc dimension in Machine LearningVc dimension in Machine Learning
Vc dimension in Machine Learning
 
Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
 
MLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackMLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic track
 
Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEP
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
A Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation ProblemA Maximum Entropy Approach to the Loss Data Aggregation Problem
A Maximum Entropy Approach to the Loss Data Aggregation Problem
 
07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization07 Machine Learning - Expectation Maximization
07 Machine Learning - Expectation Maximization
 
Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic net
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty Detection
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)
 
FUZZY DIAGONAL OPTIMAL ALGORITHM TO SOLVE INTUITIONISTIC FUZZY ASSIGNMENT PRO...
FUZZY DIAGONAL OPTIMAL ALGORITHM TO SOLVE INTUITIONISTIC FUZZY ASSIGNMENT PRO...FUZZY DIAGONAL OPTIMAL ALGORITHM TO SOLVE INTUITIONISTIC FUZZY ASSIGNMENT PRO...
FUZZY DIAGONAL OPTIMAL ALGORITHM TO SOLVE INTUITIONISTIC FUZZY ASSIGNMENT PRO...
 
NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015
 
[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination
 

Similar to 24 Machine Learning Combining Models - Ada Boost

Machine learning (5)
Machine learning (5)Machine learning (5)
Machine learning (5)NYversity
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recapbutest
 
Aaa ped-16-Unsupervised Learning: clustering
Aaa ped-16-Unsupervised Learning: clusteringAaa ped-16-Unsupervised Learning: clustering
Aaa ped-16-Unsupervised Learning: clusteringAminaRepo
 
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble LearningAaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble LearningAminaRepo
 
Toward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss ModelsToward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss ModelsJacques Rioux
 
Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3butest
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...Golden Helix Inc
 
Probability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsProbability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsChirag Gupta
 
BaggingBoosting.pdf
BaggingBoosting.pdfBaggingBoosting.pdf
BaggingBoosting.pdfDynamicPitch
 
Introduction to Machine Learning Lectures
Introduction to Machine Learning LecturesIntroduction to Machine Learning Lectures
Introduction to Machine Learning Lecturesssuserfece35
 
Minghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkMinghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkWei Wang
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 r-kor
 
Decision Tree and Bayesian Classification
Decision Tree and Bayesian ClassificationDecision Tree and Bayesian Classification
Decision Tree and Bayesian ClassificationKomal Kotak
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revisedKrish_ver2
 
Dong Zhang's project
Dong Zhang's projectDong Zhang's project
Dong Zhang's projectDong Zhang
 
support vector machine 1.pptx
support vector machine 1.pptxsupport vector machine 1.pptx
support vector machine 1.pptxsurbhidutta4
 

Similar to 24 Machine Learning Combining Models - Ada Boost (20)

Machine learning (5)
Machine learning (5)Machine learning (5)
Machine learning (5)
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recap
 
Aaa ped-16-Unsupervised Learning: clustering
Aaa ped-16-Unsupervised Learning: clusteringAaa ped-16-Unsupervised Learning: clustering
Aaa ped-16-Unsupervised Learning: clustering
 
Aaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble LearningAaa ped-14-Ensemble Learning: About Ensemble Learning
Aaa ped-14-Ensemble Learning: About Ensemble Learning
 
Toward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss ModelsToward a Unified Approach to Fitting Loss Models
Toward a Unified Approach to Fitting Loss Models
 
Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3Machine Learning: Decision Trees Chapter 18.1-18.3
Machine Learning: Decision Trees Chapter 18.1-18.3
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
MM - KBAC: Using mixed models to adjust for population structure in a rare-va...
 
Probability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsProbability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional Experts
 
BaggingBoosting.pdf
BaggingBoosting.pdfBaggingBoosting.pdf
BaggingBoosting.pdf
 
Introduction to Machine Learning Lectures
Introduction to Machine Learning LecturesIntroduction to Machine Learning Lectures
Introduction to Machine Learning Lectures
 
Minghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation TalkMinghui Conference Cross-Validation Talk
Minghui Conference Cross-Validation Talk
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
 
Decision Tree and Bayesian Classification
Decision Tree and Bayesian ClassificationDecision Tree and Bayesian Classification
Decision Tree and Bayesian Classification
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised
 
Dong Zhang's project
Dong Zhang's projectDong Zhang's project
Dong Zhang's project
 
Model selection
Model selectionModel selection
Model selection
 
support vector machine 1.pptx
support vector machine 1.pptxsupport vector machine 1.pptx
support vector machine 1.pptx
 

More from Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 

Recently uploaded

High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfNainaShrivastava14
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxStephen Sitton
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewsandhya757531
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdfAkritiPradhan2
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsResearcher Researcher
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESkarthi keyan
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
Python Programming for basic beginners.pptx
Python Programming for basic beginners.pptxPython Programming for basic beginners.pptx
Python Programming for basic beginners.pptxmohitesoham12
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptJohnWilliam111370
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming languageSmritiSharma901052
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 

Recently uploaded (20)

High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptx
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overview
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdfDEVICE DRIVERS AND INTERRUPTS  SERVICE MECHANISM.pdf
DEVICE DRIVERS AND INTERRUPTS SERVICE MECHANISM.pdf
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending Actuators
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
Python Programming for basic beginners.pptx
Python Programming for basic beginners.pptxPython Programming for basic beginners.pptx
Python Programming for basic beginners.pptx
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Community
 
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming language
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 

24 Machine Learning Combining Models - Ada Boost

  • 1. Machine Learning for Data Mining Combining Models Andres Mendez-Vazquez July 20, 2015 1 / 65
  • 2. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 2 / 65
  • 3. Images/cinvestav- Introduction Observation It is often found that improved performance can be obtained by combining multiple classifiers together in some way. Example: Committees We might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example: Boosting It involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. 3 / 65
  • 4. Images/cinvestav- Introduction Observation It is often found that improved performance can be obtained by combining multiple classifiers together in some way. Example: Committees We might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example: Boosting It involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. 3 / 65
  • 5. Images/cinvestav- Introduction Observation It is often found that improved performance can be obtained by combining multiple classifiers together in some way. Example: Committees We might train L different classifiers and then make predictions using the average of the predictions made by each classifier. Example: Boosting It involves training multiple models in sequence in which the error function used to train a particular model depends on the performance of the previous models. 3 / 65
  • 6. Images/cinvestav- In addition Instead of averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 4 / 65
  • 7. Images/cinvestav- In addition Instead of averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 4 / 65
  • 8. Images/cinvestav- In addition Instead of averaging the predictions of a set of models You can use an alternative form of combination that selects one of the models to make the prediction. Where The choice of model is a function of the input variables. Thus Different Models become responsible for making decisions in different regions of the input space. 4 / 65
  • 9. Images/cinvestav- We want this Models in charge of different set of inputs 5 / 65
  • 10. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 11. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 12. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 13. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 14. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 15. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 16. Images/cinvestav- Example: Decision Trees We can have the decision trees on top of the models Given a set of models, a model is chosen to take a decision in certain area of the input. Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value. Thus it is better to soften the combination by using If we have M classifier for a conditional distribution p (t|x, k). x is the input variable. t is the target variable. k = 1, 2, ..., M indexes the classifiers. 6 / 65
  • 17. Images/cinvestav- This is used in the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 7 / 65
  • 18. Images/cinvestav- This is used in the mixture of distributions Thus (Mixture of Experts) p (t|x) = M k=1 πk (x) p (t|x, k) (1) where πk (x) = p (k|x) represent the input-dependent mixing coefficients. This type of models They can be viewed as mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables and are known as mixture experts. 7 / 65
  • 19. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 8 / 65
  • 20. Images/cinvestav- Now Although Model Combinations and Bayesian Model Averaging look similar. However They ara actually different For this We have the following example. 9 / 65
  • 21. Images/cinvestav- Now Although Model Combinations and Bayesian Model Averaging look similar. However They ara actually different For this We have the following example. 9 / 65
  • 22. Images/cinvestav- Now Although Model Combinations and Bayesian Model Averaging look similar. However They ara actually different For this We have the following example. 9 / 65
  • 23. Images/cinvestav- Example For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. p (x, z) (2) Thus Corresponding density over the observed variable x using marginalization: p (x) = z p (x, z) (3) 10 / 65
  • 24. Images/cinvestav- Example For this consider the following Mixture of Gaussians with a binary latent variable z indicating to which component a point belongs to. p (x, z) (2) Thus Corresponding density over the observed variable x using marginalization: p (x) = z p (x, z) (3) 10 / 65
  • 25. Images/cinvestav- In the case of Gaussian distributions We have p (x) = M k=1 πkN (x|µkΣk) (4) Now, for independent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) (5) Now, suppose We have several different models indexed by h = 1, ..., H with prior probabilities. For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions. 11 / 65
  • 26. Images/cinvestav- In the case of Gaussian distributions We have p (x) = M k=1 πkN (x|µkΣk) (4) Now, for independent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) (5) Now, suppose We have several different models indexed by h = 1, ..., H with prior probabilities. For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions. 11 / 65
  • 27. Images/cinvestav- In the case of Gaussian distributions We have p (x) = M k=1 πkN (x|µkΣk) (4) Now, for independent, identically distributed data X = {x1, x2, ..., xN } p (X) = N n=1 p (xn) = N n=1 zn p (xn, zn) (5) Now, suppose We have several different models indexed by h = 1, ..., H with prior probabilities. For instance one model might be a mixture of Gaussians and another model might be a mixture of Cauchy distributions. 11 / 65
  • 28. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 29. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 30. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 31. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 32. Images/cinvestav- Bayesian Model Averaging The Marginal Distribution is p (X) = H h=1 p (X, h) = H h=1 p (X|h) p (h) (6) Observations 1 This is an example of Bayesian model averaging. 2 The summation over h means that just one model is responsible for generating the whole data set. 3 The probability over h simply reflects our uncertainty of which is the correct model to use. Thus As the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models. 12 / 65
  • 33. Images/cinvestav- The Differences Bayesian model averaging The whole data set is generated by a single model. Model combination Different data points within the data set can potentially be generated from different values of the latent variable z and hence by different components. 13 / 65
  • 34. Images/cinvestav- The Differences Bayesian model averaging The whole data set is generated by a single model. Model combination Different data points within the data set can potentially be generated from different values of the latent variable z and hence by different components. 13 / 65
  • 35. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  • 36. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  • 37. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  • 38. Images/cinvestav- Committees Idea The simplest way to construct a committee is to average the predictions of a set of individual models. Thinking as a frequentist This is coming from taking in consideration the trade-off between bias and variance. Where the error in the model into The bias component that arises from differences between the model and the true function to be predicted. The variance component that represents the sensitivity of the model to the individual data points. 14 / 65
  • 39. Images/cinvestav- For example We can do the follwoing When we averaged a set of low-bias models, we obtained accurate predictions of the underlying sinusoidal function from which the data were generated. 15 / 65
  • 40. Images/cinvestav- However Big Problem We have normally a single data set Thus We need to introduce certain variability between the different committee members. One approach You can use bootstrap data sets. 16 / 65
  • 41. Images/cinvestav- However Big Problem We have normally a single data set Thus We need to introduce certain variability between the different committee members. One approach You can use bootstrap data sets. 16 / 65
  • 42. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 17 / 65
  • 43. Images/cinvestav- What to do with the Bootstrap Data Set Given a Regression Problem Here, we are trying to predict the value of a single continuous variable. What? Suppose our original data set consists of N data points X = {x1, ..., xN } . Thus We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. 18 / 65
  • 44. Images/cinvestav- What to do with the Bootstrap Data Set Given a Regression Problem Here, we are trying to predict the value of a single continuous variable. What? Suppose our original data set consists of N data points X = {x1, ..., xN } . Thus We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. 18 / 65
  • 45. Images/cinvestav- What to do with the Bootstrap Data Set Given a Regression Problem Here, we are trying to predict the value of a single continuous variable. What? Suppose our original data set consists of N data points X = {x1, ..., xN } . Thus We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. 18 / 65
  • 46. Images/cinvestav- Furthermore This is repeated M times Thus, we generate M data sets each of size N and each obtained by sampling from the original data set X. 19 / 65
  • 47. Images/cinvestav- Thus Use each of them to train a copy ym (x) of a predictive regression model to predict a single continuous variable Then, ycom (x) = 1 M M m=1 ym (x) (7) This is also known as Bootstrap Aggregation or Bagging. 20 / 65
  • 48. Images/cinvestav- What we do with this samples? Now, assume a true regression function h (x) and a estimation ym (x) ym (x) = h (x) + m (x) (8) The average sum-of-squares error over the data takes the form Ex {ym (x) − h (x)}2 = Ex { m (x)}2 (9) What is Ex? It denotes a frequentest expectation with respect to the distribution of the input vector x. 21 / 65
  • 49. Images/cinvestav- What we do with this samples? Now, assume a true regression function h (x) and a estimation ym (x) ym (x) = h (x) + m (x) (8) The average sum-of-squares error over the data takes the form Ex {ym (x) − h (x)}2 = Ex { m (x)}2 (9) What is Ex? It denotes a frequentest expectation with respect to the distribution of the input vector x. 21 / 65
  • 50. Images/cinvestav- What we do with this samples? Now, assume a true regression function h (x) and a estimation ym (x) ym (x) = h (x) + m (x) (8) The average sum-of-squares error over the data takes the form Ex {ym (x) − h (x)}2 = Ex { m (x)}2 (9) What is Ex? It denotes a frequentest expectation with respect to the distribution of the input vector x. 21 / 65
  • 51. Images/cinvestav- Meaning Thus, the average error is EAV = 1 M M m=1 Ex { m (x)}2 (10) Similarly the Expected error over the committee ECOM = Ex   1 M M m=1 (ym (x) − h (x)) 2   = Ex   1 M M m=1 m (x) 2   (11) 22 / 65
  • 52. Images/cinvestav- Meaning Thus, the average error is EAV = 1 M M m=1 Ex { m (x)}2 (10) Similarly the Expected error over the committee ECOM = Ex   1 M M m=1 (ym (x) − h (x)) 2   = Ex   1 M M m=1 m (x) 2   (11) 22 / 65
  • 53. Images/cinvestav- Assume that the errors have zero mean and are uncorrelated Assume that the errors have zero mean and are uncorrelated Ex [ m (x)] =0 Ex [ m (x) l (x)] = 0, for m = l 23 / 65
  • 54. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 55. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 56. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 57. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 58. Images/cinvestav- Then We have that ECOM = 1 M2 Ex M m=1 ( m (x)) 2 = 1 M2 Ex    M m=1 2 m (x) + M h=1 M k=1 h=k h (x) k (x)    = 1 M2    Ex M m=1 2 m (x) + Ex    M h=1 M k=1 h=k h (x) k (x)       = 1 M2    Ex M m=1 2 m (x) + M h=1 M k=1 h=k Ex ( h (x) k (x))    = 1 M2 Ex M m=1 2 m (x) = 1 M 1 M Ex M m=1 2 m (x) 24 / 65
  • 59. Images/cinvestav- We finally obtain We obtain ECOM = 1 M EAV (12) Looks great BUT!!! Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated. 25 / 65
  • 60. Images/cinvestav- We finally obtain We obtain ECOM = 1 M EAV (12) Looks great BUT!!! Unfortunately, it depends on the key assumption that the errors due to the individual models are uncorrelated. 25 / 65
  • 61. Images/cinvestav- Thus The Reality!!! The errors are typically highly correlated, and the reduction in overall error is generally small. Something Notable However, It can be shown that the expected committee error will not exceed the expected error of the constituent models, so ECOM ≤ EAV (13) We need something better A more sophisticated technique known as boosting. 26 / 65
  • 62. Images/cinvestav- Thus The Reality!!! The errors are typically highly correlated, and the reduction in overall error is generally small. Something Notable However, It can be shown that the expected committee error will not exceed the expected error of the constituent models, so ECOM ≤ EAV (13) We need something better A more sophisticated technique known as boosting. 26 / 65
  • 63. Images/cinvestav- Thus The Reality!!! The errors are typically highly correlated, and the reduction in overall error is generally small. Something Notable However, It can be shown that the expected committee error will not exceed the expected error of the constituent models, so ECOM ≤ EAV (13) We need something better A more sophisticated technique known as boosting. 26 / 65
  • 64. Images/cinvestav- Outline 1 Introduction 2 Bayesian Model Averaging Model Combination Vs. Bayesian Model Averaging 3 Committees Bootstrap Data Sets 4 Boosting AdaBoost Development Cost Function Selection Process Selecting New Classifiers Using our Deriving Trick AdaBoost Algorithm Some Remarks and Implementation Issues Explanation about AdaBoost’s behavior Example 27 / 65
  • 65. Images/cinvestav- Boosting What Boosting does? It combines several classifiers to produce a form of a committee. We will describe AdaBoost “Adaptive Boosting” developed by Freund and Schapire (1995). 28 / 65
  • 66. Images/cinvestav- Boosting What Boosting does? It combines several classifiers to produce a form of a committee. We will describe AdaBoost “Adaptive Boosting” developed by Freund and Schapire (1995). 28 / 65
  • 67. Images/cinvestav- Sequential Training Main difference between boosting and committee methods The base classifiers are trained in sequence. Explanation Consider a two-class classification problem: 1 Samples x1, x2,..., xN 2 Binary labels (-1,1) t1, t2, ..., tN 29 / 65
  • 68. Images/cinvestav- Sequential Training Main difference between boosting and committee methods The base classifiers are trained in sequence. Explanation Consider a two-class classification problem: 1 Samples x1, x2,..., xN 2 Binary labels (-1,1) t1, t2, ..., tN 29 / 65
  • 69. Images/cinvestav- Cost Function Now You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!! Thus For each pattern xi each expert classifier outputs a classification yj (xi) ∈ {−1, 1} The final decision of the committee of M experts is sign (C (xi)) C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (14) 30 / 65
  • 70. Images/cinvestav- Cost Function Now You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!! Thus For each pattern xi each expert classifier outputs a classification yj (xi) ∈ {−1, 1} The final decision of the committee of M experts is sign (C (xi)) C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (14) 30 / 65
  • 71. Images/cinvestav- Cost Function Now You want to put together a set of M experts able to recognize the most difficult inputs in an accurate way!!! Thus For each pattern xi each expert classifier outputs a classification yj (xi) ∈ {−1, 1} The final decision of the committee of M experts is sign (C (xi)) C (xi) = α1y1 (xi) + α2y2 (xi) + ... + αM yM (xi) (14) 30 / 65
  • 72. Images/cinvestav- Now Adaptive Boosting It works even with a continuum of classifiers. However For the sake of simplicity, we will assume that the set of expert is finite. 31 / 65
  • 73. Images/cinvestav- Now Adaptive Boosting It works even with a continuum of classifiers. However For the sake of simplicity, we will assume that the set of expert is finite. 31 / 65
  • 74. Images/cinvestav- Getting the correct classifiers We want the following We want to review possible element members. Select them, if they have certain properties. Assigning a weight to their contribution to the set of experts. 32 / 65
  • 75. Images/cinvestav- Getting the correct classifiers We want the following We want to review possible element members. Select them, if they have certain properties. Assigning a weight to their contribution to the set of experts. 32 / 65
  • 76. Images/cinvestav- Getting the correct classifiers We want the following We want to review possible element members. Select them, if they have certain properties. Assigning a weight to their contribution to the set of experts. 32 / 65
  • 77. Now
    Selection is done the following way: test the classifiers in the pool using a training set T of N multidimensional data points x_i, where each point x_i has a label t_i = 1 or t_i = −1.
    Assigning a cost to each action, we test and rank all classifiers in the expert pool by
      charging a cost exp{β} any time a classifier fails (a miss), and
      charging a cost exp{−β} any time a classifier provides the right label (a hit).
  • 82. Remarks about β
    We require β > 0, so that misses are penalized more heavily than hits.
    It may look strange to penalize hits, but this causes no problem as long as the penalty for a success is smaller than the penalty for a miss: exp{−β} < exp{β}.
    Why does this not compromise generality? If we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = c^d and b = c^{−d} for constants c and d. (Hint: rescaling both costs by 1/√(ab) does not change the ranking of classifiers and turns them into √(a/b) and its reciprocal, so c = √(a/b) and d = 1 work.)
  • 85. Exponential Loss Function
    Something notable: this kind of error function, different from the usual squared Euclidean distance to the classification target, is called an exponential loss function.
    AdaBoost uses the exponential loss as its error criterion.
  • 87. The Matrix S
    When we test the M classifiers in the pool, we build a matrix S in which we record the misses (with a ONE) and the hits (with a ZERO) of each classifier.
    Row i in the matrix is reserved for the data point x_i; column m is reserved for the m-th classifier in the pool.

              Classifiers
            1    2   · · ·   M
      x_1   0    1   · · ·   1
      x_2   0    0   · · ·   1
      x_3   1    1   · · ·   0
      ...   ...  ...         ...
      x_N   0    0   · · ·   0
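A small sketch of how the matrix S could be built, assuming the pool members are plain Python callables and the data live in NumPy arrays (the names scouting_matrix, X, t are ours, not the lecture's):

```python
import numpy as np

def scouting_matrix(classifiers, X, t):
    """S[i, m] = 1 if classifier m misses data point x_i, 0 if it hits it.

    X: (N, d) array of samples; t: (N,) array of labels in {-1, +1};
    classifiers: list of M callables mapping one sample to -1 or +1.
    """
    N, M = len(X), len(classifiers)
    S = np.zeros((N, M), dtype=int)
    for m, clf in enumerate(classifiers):
        preds = np.array([clf(x) for x in X])
        S[:, m] = (preds != t).astype(int)  # ONE for a miss, ZERO for a hit
    return S
```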
  • 90. Main Idea
    What does AdaBoost want to do? The main idea of AdaBoost is to proceed systematically, extracting one classifier from the pool in each of M iterations.
    The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.
    At the beginning of the iterations, all data samples are assigned the same weight: just 1, or 1/N if we want the weights to sum to 1.
  • 94. The process of the weights
    As the selection progresses, the more difficult samples, those on which the committee still performs badly, are assigned larger and larger weights.
    The selection process therefore concentrates on picking new classifiers that can help the committee with the still misclassified examples.
    The best classifiers are those which provide new insight to the committee; the classifiers being selected should complement each other in an optimal way.
  • 98. Selecting New Classifiers
    What we want: in each iteration we rank all classifiers, so that we can select the current best out of the pool.
    At the m-th iteration we have already included m − 1 classifiers in the committee and we want to select the next one.
    Thus, we have the following cost function, which is actually the output of the committee so far:
      C^{(m−1)}(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + ... + α_{m−1} y_{m−1}(x_i)    (15)
  • 101. Thus, we have that
    Extending the cost function:
      C^{(m)}(x_i) = C^{(m−1)}(x_i) + α_m y_m(x_i)    (16)
    At the first iteration, m = 1, C^{(m−1)} is the zero function.
    Thus, the total cost or total error is defined as the exponential error
      E = Σ_{i=1}^{N} exp{−t_i (C^{(m−1)}(x_i) + α_m y_m(x_i))}    (17)
  • 104. Thus
    We want to determine α_m and y_m in an optimal way.
    Thus, rewriting,
      E = Σ_{i=1}^{N} w_i^{(m)} exp{−t_i α_m y_m(x_i)}    (18)
    where, for i = 1, 2, ..., N,
      w_i^{(m)} = exp{−t_i C^{(m−1)}(x_i)}    (19)
  • 107. Thus
    In the first iteration w_i^{(1)} = 1 for i = 1, ..., N. During later iterations, the vector w^{(m)} represents the weight assigned to each data point in the training set at iteration m.
    Splitting equation (Eq. 18),
      E = Σ_{t_i = y_m(x_i)} w_i^{(m)} exp{−α_m} + Σ_{t_i ≠ y_m(x_i)} w_i^{(m)} exp{α_m}    (20)
    Meaning: the total cost is the weighted cost of all hits plus the weighted cost of all misses.
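The split in Eq. (20) into weighted hits and weighted misses amounts to a few lines of NumPy; this is only an illustrative sketch with our own names (split_cost, preds):

```python
import numpy as np

def split_cost(w, preds, t, alpha):
    """E = W_c * exp(-alpha) + W_e * exp(alpha), where W_c is the total weight of
    the hits and W_e the total weight of the misses of the candidate classifier."""
    hits = (preds == t)
    W_c, W_e = w[hits].sum(), w[~hits].sum()
    return W_c * np.exp(-alpha) + W_e * np.exp(alpha)
```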
  • 110. Now
    Writing the first summand as W_c exp{−α_m} and the second as W_e exp{α_m},
      E = W_c exp{−α_m} + W_e exp{α_m}    (21)
    Now, for the selection of y_m the exact value of α_m > 0 is irrelevant, since for a fixed α_m minimizing E is equivalent to minimizing exp{α_m} E, and
      exp{α_m} E = W_c + W_e exp{2α_m}    (22)
  • 112. Then
    Since exp{2α_m} > 1, we can rewrite (Eq. 22) as
      exp{α_m} E = W_c + W_e − W_e + W_e exp{2α_m}    (23)
    Thus
      exp{α_m} E = (W_c + W_e) + W_e (exp{2α_m} − 1)    (24)
    Now, W_c + W_e is the total sum W of the weights of all data points, which is constant in the current iteration.
  • 115. Thus
    The right-hand side of the equation is minimized when, at the m-th iteration, we pick the classifier with the lowest total cost W_e (that is, the lowest weighted error).
    Intuitively, the next selected y_m should be the one with the lowest penalty given the current set of weights.
  • 117. Deriving against the weight α_m
    Going back to the original E, we can use the derivative trick:
      ∂E/∂α_m = −W_c exp{−α_m} + W_e exp{α_m}    (25)
    Setting this equal to zero and multiplying by exp{α_m},
      −W_c + W_e exp{2α_m} = 0    (26)
    The optimal value is thus
      α_m = (1/2) ln(W_c / W_e)    (27)
  • 120. Now
    With the total sum of all weights
      W = W_c + W_e    (28)
    we can rewrite the previous equation as
      α_m = (1/2) ln((W − W_e) / W_e) = (1/2) ln((1 − e_m) / e_m)    (29)
    with the percentage rate of error, given the weights of the data points,
      e_m = W_e / W    (30)
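Eq. (29) translates directly into code. The small eps guard for e_m = 0 or e_m = 1 is our own addition, since the formula itself is undefined at those extremes:

```python
import numpy as np

def alpha_from_error(e_m, eps=1e-12):
    """alpha_m = 0.5 * ln((1 - e_m) / e_m), as in Eq. (29)."""
    e_m = float(np.clip(e_m, eps, 1.0 - eps))
    return 0.5 * np.log((1.0 - e_m) / e_m)
```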
  • 123. What about the weights?
    Using the equation
      w_i^{(m)} = exp{−t_i C^{(m−1)}(x_i)}    (31)
    and because we now have α_m and y_m(x_i),
      w_i^{(m+1)} = exp{−t_i C^{(m)}(x_i)}
                  = exp{−t_i (C^{(m−1)}(x_i) + α_m y_m(x_i))}
                  = w_i^{(m)} exp{−t_i α_m y_m(x_i)}
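The last line above is the whole weight update. As a vectorized sketch (the array names w, preds, t are ours):

```python
import numpy as np

def update_weights(w, alpha_m, preds, t):
    """w_i^(m+1) = w_i^(m) * exp(-t_i * alpha_m * y_m(x_i)) for every training point.

    w, preds, t are NumPy vectors of length N; preds holds y_m(x_i) in {-1, +1}."""
    return w * np.exp(-alpha_m * t * preds)
```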
  • 127. Sequential Training
    Thus, AdaBoost trains a new classifier using a data set in which the weighting coefficients are adjusted according to the performance of the previously trained classifiers, so as to give greater weight to the misclassified data points.
  • 129. Outline
    1 Introduction
    2 Bayesian Model Averaging: Model Combination vs. Bayesian Model Averaging
    3 Committees: Bootstrap Data Sets
    4 Boosting: AdaBoost Development, Cost Function, Selection Process, Selecting New Classifiers, Using our Deriving Trick, AdaBoost Algorithm, Some Remarks and Implementation Issues, Explanation about AdaBoost's behavior, Example
  • 130. AdaBoost Algorithm
    Step 1: Initialize w_i^{(1)} to 1/N.
    Step 2: For m = 1, 2, ..., M
      1. Select a weak classifier y_m(x) for the training data by minimizing the weighted error function,
           arg min_{y_m} Σ_{i=1}^{N} w_i^{(m)} I(y_m(x_i) ≠ t_i) = arg min_{y_m} Σ_{t_i ≠ y_m(x_i)} w_i^{(m)} = arg min_{y_m} W_e    (32)
         where I is an indicator function.
      2. Evaluate
           e_m = Σ_{i=1}^{N} w_i^{(m)} I(y_m(x_i) ≠ t_i) / Σ_{i=1}^{N} w_i^{(m)}    (33)
  • 136. AdaBoost Algorithm
    Step 3: Set the α_m weight to
      α_m = (1/2) ln((1 − e_m) / e_m)    (34)
    Then update the weights of the data for the next iteration.
    If t_i ≠ y_m(x_i), i.e. a miss:
      w_i^{(m+1)} = w_i^{(m)} exp{α_m} = w_i^{(m)} √((1 − e_m) / e_m)    (35)
    If t_i = y_m(x_i), i.e. a hit:
      w_i^{(m+1)} = w_i^{(m)} exp{−α_m} = w_i^{(m)} √(e_m / (1 − e_m))    (36)
  • 139. Finally, make predictions
    For this use
      Y_M(x) = sign( Σ_{m=1}^{M} α_m y_m(x) )    (37)
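Putting Steps 1-3 and the final prediction (Eq. 37) together, here is a minimal AdaBoost sketch over a fixed pool of weak classifiers. It assumes the pool members are callables returning labels in {-1, +1}; the names adaboost_pool, chosen, Y_M are ours, and the small guard on e_m is an implementation convenience not discussed in the slides.

```python
import numpy as np

def adaboost_pool(classifiers, X, t, M):
    """AdaBoost over a fixed pool: returns the committee decision function Y_M.

    classifiers: list of callables x -> {-1, +1}
    X: (N, d) samples; t: (N,) labels in {-1, +1}; M: number of boosting rounds."""
    N = len(X)
    w = np.full(N, 1.0 / N)                                         # Step 1: uniform weights
    preds = np.array([[clf(x) for x in X] for clf in classifiers])  # (pool size, N)
    chosen, alphas = [], []
    for m in range(M):
        miss = (preds != t)                                   # Step 2.1: miss matrix
        W_e = miss.astype(float) @ w                          # weighted error of every pool member
        k = int(np.argmin(W_e))                               # best classifier this round
        e_m = W_e[k] / w.sum()                                # Step 2.2: weighted error rate
        alpha = 0.5 * np.log((1.0 - e_m) / max(e_m, 1e-12))   # Step 3: Eq. (34)
        w = w * np.exp(-alpha * t * preds[k])                 # Eqs. (35)-(36) in one line
        chosen.append(k)
        alphas.append(alpha)

    def Y_M(x):  # Eq. (37)
        return int(np.sign(sum(a * classifiers[k](x) for a, k in zip(alphas, chosen))))

    return Y_M
```

A committee could then be built as Y = adaboost_pool(pool, X, t, M=40) and queried as Y(x_new).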
  • 140. Observations
    First: the first base classifier is trained by the usual procedure of training a single classifier.
    Second: from (Eq. 35) and (Eq. 36) we can see that the weighting coefficients are increased for data points that are misclassified.
    Third: the quantity e_m represents a weighted measure of the error rate; thus α_m gives more weight to the more accurate classifiers.
  • 143. In addition
    The pool of classifiers in Step 1 can be substituted by a family of classifiers whose members are trained to minimize the error function given the current weights.
    If, instead, a finite set of classifiers is given, we only need to test the classifiers once for each data point.
    The scouting matrix S can then be reused at each iteration: multiplying the transposed vector of weights w^{(m)} with S yields W_e for each machine.
  • 146. We have then
    The following:
      (W_e^{(1)}, W_e^{(2)}, ..., W_e^{(M)}) = (w^{(m)})^T S    (38)
    This allows us to reformulate the weight update step so that only misses lead to weight modification.
    Note that the weight vector w^{(m)} is constructed iteratively. It could be recomputed completely at every iteration, but the iterative construction is more efficient and simpler to implement.
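With the scouting matrix built once, Eq. (38) is a single matrix-vector product. A sketch, reusing the hypothetical names from the earlier snippets:

```python
import numpy as np

def rank_pool(w, S):
    """Return (W_e^(1), ..., W_e^(M)) = w^T S and the index of the best classifier.

    w: (N,) current data weights; S: (N, M) scouting matrix of misses."""
    W_e_all = w @ S
    return W_e_all, int(np.argmin(W_e_all))
```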
  • 150. Explanation
    [Figure] Graph of (1 − e_m) / e_m over the range 0 ≤ e_m ≤ 1.
  • 152. We have the following cases
    If e_m → 1, all the samples were incorrectly classified.
    Then lim_{e_m → 1} (1 − e_m) / e_m = 0, so α_m → −∞.
    For every misclassified sample, w_i^{(m+1)} = w_i^{(m)} exp{α_m} → 0.
    We only need to reverse the answers to get the perfect classifier and select it as the only committee member.
  • 155. Thus, we have
    If e_m → 1/2, we have α_m → 0; thus, whether the sample is well or badly classified,
      exp{−α_m t_i y_m(x_i)} → 1    (39)
    so the weights do not change at all.
    What about e_m → 0? We have α_m → +∞, and for samples that are always correctly classified
      w_i^{(m+1)} = w_i^{(m)} exp{−α_m t_i y_m(x_i)} → 0
    Thus, the only committee member we need is m; we do not need m + 1.
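A quick numeric check of these three regimes (a throwaway script, not from the lecture):

```python
import numpy as np

for e_m in (0.999, 0.5, 0.001):
    alpha_m = 0.5 * np.log((1.0 - e_m) / e_m)
    print(f"e_m = {e_m:5.3f}  ->  alpha_m = {alpha_m:+.2f}")

# e_m near 1: large negative alpha_m (reversing the answers would give a strong classifier)
# e_m = 1/2:  alpha_m = 0 (the classifier adds no information)
# e_m near 0: large positive alpha_m (this classifier alone nearly settles the vote)
```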
  • 158. Outline
    1 Introduction
    2 Bayesian Model Averaging: Model Combination vs. Bayesian Model Averaging
    3 Committees: Bootstrap Data Sets
    4 Boosting: AdaBoost Development, Cost Function, Selection Process, Selecting New Classifiers, Using our Deriving Trick, AdaBoost Algorithm, Some Remarks and Implementation Issues, Explanation about AdaBoost's behavior, Example
  • 159. Example
    Training set classes: ω_1 = N(0, 1), ω_2 = N(0, σ²) − N(0, 1).
  • 160. Example
    Weak classifier: the Perceptron. For m = 1, ..., M, we have the following possible classifiers.
  • 162. At the end of the process
    For M = 40.