Machine Learning: Combining Models - AdaBoost


Machine Learning for Data Mining: Combining Models

Andres Mendez-Vazquez

July 20, 2015

1 / 65


Outline

1 Introduction

2 Bayesian Model Averaging
   Model Combination vs. Bayesian Model Averaging

3 Committees
   Bootstrap Data Sets

4 Boosting
   AdaBoost Development
      Cost Function
      Selection Process
      Selecting New Classifiers
      Using our Deriving Trick
   AdaBoost Algorithm
   Some Remarks and Implementation Issues
   Explanation about AdaBoost's behavior
   Example

Introduction

Observation: It is often found that improved performance can be obtained by combining multiple classifiers together in some way.

Example (Committees): We might train L different classifiers and then make predictions using the average of the predictions made by each classifier.

Example (Boosting): It involves training multiple models in sequence, in which the error function used to train a particular model depends on the performance of the previous models.


In addition

Instead of averaging the predictions of a set of models, you can use an alternative form of combination that selects one of the models to make the prediction, where the choice of model is a function of the input variables.

Thus, different models become responsible for making decisions in different regions of the input space.


We want this

Models in charge of different sets of inputs.


Example: Decision Trees

We can have decision trees on top of the models: given a set of models, a model is chosen to make the decision in a certain region of the input space.

Limitation: It is based on hard splits in which only one model is responsible for making predictions for any given value.

Thus, it is better to soften the combination. Suppose we have M classifiers for a conditional distribution p(t|x, k), where
- x is the input variable.
- t is the target variable.
- k = 1, 2, ..., M indexes the classifiers.


This is used in the mixture of distributions

Thus (Mixture of Experts):

p(t|x) = Σ_{k=1}^{M} π_k(x) p(t|x, k)   (1)

where π_k(x) = p(k|x) are the input-dependent mixing coefficients.

This type of model can be viewed as a mixture distribution in which the component densities and the mixing coefficients are conditioned on the input variables; such models are known as mixtures of experts.

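As a minimal sketch of Eq. (1), the input-dependent mixing coefficients π_k(x) can be produced by a softmax gating function over per-expert scores. The gating scores and the two constant experts below are illustrative assumptions, not part of the slides:

```python
import math

def softmax(scores):
    # Numerically stable softmax: turns gating scores into pi_k(x) = p(k|x)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(x, gates, experts):
    """p(t|x) = sum_k pi_k(x) p(t|x, k), as in Eq. (1)."""
    pi = softmax([g(x) for g in gates])
    return sum(p * e(x) for p, e in zip(pi, experts))

# Hypothetical setup: two experts; the gating favors expert 0 for negative x
gates = [lambda x: -x, lambda x: x]
experts = [lambda x: 0.9, lambda x: 0.2]
p = mixture_of_experts(-2.0, gates, experts)
```

Because the mixing coefficients depend on x, each expert dominates in its own region of the input space, which is exactly the soft version of the hard decision-tree split.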


Now

Although Model Combination and Bayesian Model Averaging look similar, they are actually different.

To see this, we have the following example.


Example

For this, consider the following: a mixture of Gaussians with a binary latent variable z indicating which component a point belongs to:

p(x, z)   (2)

Thus, the corresponding density over the observed variable x is obtained by marginalization:

p(x) = Σ_z p(x, z)   (3)


In the case of Gaussian distributions

We have

p(x) = Σ_{k=1}^{M} π_k N(x | µ_k, Σ_k)   (4)

Now, for independent, identically distributed data X = {x_1, x_2, ..., x_N}:

p(X) = Π_{n=1}^{N} p(x_n) = Π_{n=1}^{N} [Σ_{z_n} p(x_n, z_n)]   (5)

Now, suppose we have several different models indexed by h = 1, ..., H, with prior probabilities p(h). For instance, one model might be a mixture of Gaussians and another might be a mixture of Cauchy distributions.


Bayesian Model Averaging

The marginal distribution is

p(X) = Σ_{h=1}^{H} p(X, h) = Σ_{h=1}^{H} p(X|h) p(h)   (6)

Observations
1 This is an example of Bayesian model averaging.
2 The summation over h means that just one model is responsible for generating the whole data set.
3 The probability over h simply reflects our uncertainty about which is the correct model to use.

Thus, as the size of the data set increases, this uncertainty reduces, and the posterior probabilities p(h|X) become increasingly focused on just one of the models.

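The concentration of p(h|X) can be checked numerically. The sketch below assumes two hypothetical candidate models with equal priors, unit-variance Gaussians centered at 0 and at 1, and data drawn near 0, so model h = 0 is the correct one; the setup is an illustration, not from the slides:

```python
import math

def log_gauss(x, mu):
    # Log density of a unit-variance Gaussian N(x | mu, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

def posterior_h0(data):
    """p(h=0 | X) with equal priors: proportional to p(X|h), Eq. (6),
    computed in log space for numerical stability."""
    log_p = [sum(log_gauss(x, mu) for x in data) for mu in (0.0, 1.0)]
    m = max(log_p)
    w = [math.exp(lp - m) for lp in log_p]
    return w[0] / sum(w)

data = [0.1, -0.2, 0.05, 0.15, -0.1] * 4   # 20 points near 0
small, large = posterior_h0(data[:2]), posterior_h0(data)
```

With 2 points the posterior is still uncertain; with 20 points it is already close to 1, illustrating how the averaging collapses onto a single model as N grows.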

The Differences

Bayesian model averaging: The whole data set is generated by a single model.

Model combination: Different data points within the data set can potentially be generated from different values of the latent variable z, and hence by different components.


Committees

Idea: The simplest way to construct a committee is to average the predictions of a set of individual models.

Thinking as a frequentist, this comes from considering the trade-off between bias and variance, which decomposes the error of the model into:
- The bias component, which arises from differences between the model and the true function to be predicted.
- The variance component, which represents the sensitivity of the model to the individual data points.


For example

When we average a set of low-bias models, we obtain accurate predictions of the underlying sinusoidal function from which the data were generated.


However

Big problem: We normally have a single data set.

Thus, we need to introduce some variability between the different committee members.

One approach: You can use bootstrap data sets.



What to do with the Bootstrap Data Set

Given a regression problem, we are trying to predict the value of a single continuous variable.

Suppose our original data set consists of N data points X = {x_1, ..., x_N}.

Thus, we can create a new data set X_B by drawing N points at random from X, with replacement, so that some points in X may be replicated in X_B, whereas other points in X may be absent from X_B.

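The resampling step above can be sketched in a few lines; the seeded generator and toy data are assumptions for reproducibility:

```python
import random

def bootstrap_sample(X, rng):
    """Draw N points from X with replacement: a bootstrap data set X_B.
    Some points of X may appear several times, others not at all."""
    return [rng.choice(X) for _ in range(len(X))]

rng = random.Random(0)
X = list(range(10))
XB = bootstrap_sample(X, rng)
```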

Furthermore

This is repeated M times: we generate M data sets, each of size N, each obtained by sampling with replacement from the original data set X.


Thus

Use each of them to train a copy y_m(x) of a predictive regression model. Then the committee prediction is

y_com(x) = (1/M) Σ_{m=1}^{M} y_m(x)   (7)

This is also known as Bootstrap Aggregation or Bagging.

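Putting the two steps together, bagging trains one model per bootstrap sample and averages them as in Eq. (7). The deliberately trivial "model" below (predict the sample mean) and the noiseless target are illustrative assumptions so the averaging behavior is easy to verify:

```python
import random

def fit_mean_model(sample):
    # A deliberately simple "regression model": predict the sample mean of t
    mu = sum(t for _, t in sample) / len(sample)
    return lambda x: mu

def bagging(data, M, rng):
    """Train M models on bootstrap resamples and average them (Eq. 7)."""
    models = []
    for _ in range(M):
        sample = [rng.choice(data) for _ in range(len(data))]
        models.append(fit_mean_model(sample))
    return lambda x: sum(m(x) for m in models) / M

rng = random.Random(1)
data = [(x, 2.0) for x in range(20)]   # noiseless target t = 2
y_com = bagging(data, M=5, rng=rng)
pred = y_com(0.0)
```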

What do we do with these samples?

Now, assume a true regression function h(x) and estimates y_m(x):

y_m(x) = h(x) + ε_m(x)   (8)

The average sum-of-squares error over the data takes the form

E_x[{y_m(x) − h(x)}²] = E_x[{ε_m(x)}²]   (9)

What is E_x? It denotes a frequentist expectation with respect to the distribution of the input vector x.


Meaning

Thus, the average error of the individual models is

E_AV = (1/M) Σ_{m=1}^{M} E_x[{ε_m(x)}²]   (10)

Similarly, the expected error of the committee is

E_COM = E_x[{(1/M) Σ_{m=1}^{M} (y_m(x) − h(x))}²] = E_x[{(1/M) Σ_{m=1}^{M} ε_m(x)}²]   (11)


Assume that the errors have zero mean and are uncorrelated

E_x[ε_m(x)] = 0
E_x[ε_m(x) ε_l(x)] = 0, for m ≠ l


Then

We have that

E_COM = (1/M²) E_x[{Σ_{m=1}^{M} ε_m(x)}²]
      = (1/M²) E_x[Σ_{m=1}^{M} ε_m²(x) + Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} ε_h(x) ε_k(x)]
      = (1/M²) {E_x(Σ_{m=1}^{M} ε_m²(x)) + E_x(Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} ε_h(x) ε_k(x))}
      = (1/M²) {E_x(Σ_{m=1}^{M} ε_m²(x)) + Σ_{h=1}^{M} Σ_{k=1, k≠h}^{M} E_x(ε_h(x) ε_k(x))}
      = (1/M²) E_x(Σ_{m=1}^{M} ε_m²(x))          [since E_x(ε_h(x) ε_k(x)) = 0 for h ≠ k]
      = (1/M) {(1/M) Σ_{m=1}^{M} E_x(ε_m²(x))}


We finally obtain

E_COM = (1/M) E_AV   (12)

Looks great, BUT!!! Unfortunately, this result depends on the key assumption that the errors of the individual models are uncorrelated.

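A quick Monte Carlo check of Eqs. (10)-(12): when the model errors are zero-mean and uncorrelated, E_COM comes out close to E_AV/M. The error model below (a shared term mixed with independent Gaussian noise) is an assumption for illustration:

```python
import random

def committee_errors(M, N, shared_frac, rng):
    """Monte Carlo estimates of E_AV (Eq. 10) and E_COM (Eq. 11).
    Each model's error mixes a shared term and independent noise;
    shared_frac = 0 gives zero-mean, uncorrelated errors."""
    e_av = e_com = 0.0
    for _ in range(N):
        common = rng.gauss(0.0, 1.0)
        eps = [shared_frac * common + (1.0 - shared_frac) * rng.gauss(0.0, 1.0)
               for _ in range(M)]
        e_av += sum(e * e for e in eps) / M       # average individual error
        mean_eps = sum(eps) / M
        e_com += mean_eps * mean_eps              # committee error
    return e_av / N, e_com / N

rng = random.Random(0)
e_av, e_com = committee_errors(M=10, N=20000, shared_frac=0.0, rng=rng)
```

Setting shared_frac close to 1 makes the errors highly correlated, and the gap between e_com and e_av shrinks, previewing the point of the next slide.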

Thus

The reality!!! The errors are typically highly correlated, and the reduction in the overall error is generally small.

Something notable: However, it can be shown that the expected committee error will not exceed the expected error of the constituent models, so

E_COM ≤ E_AV   (13)

We need something better: a more sophisticated technique known as boosting.



Boosting

What does Boosting do? It combines several classifiers to produce a form of committee.

We will describe AdaBoost ("Adaptive Boosting"), developed by Freund and Schapire (1995).


Sequential Training

Main difference between boosting and committee methods: The base classifiers are trained in sequence.

Explanation: Consider a two-class classification problem with
1 Samples x_1, x_2, ..., x_N
2 Binary labels t_1, t_2, ..., t_N ∈ {−1, 1}


Cost Function

Now, you want to put together a set of M experts able to recognize the most difficult inputs in an accurate way.

Thus, for each pattern x_i, each expert classifier outputs a classification y_j(x_i) ∈ {−1, 1}.

The final decision of the committee of M experts is sign(C(x_i)), where

C(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + ... + α_M y_M(x_i)   (14)

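The weighted vote of Eq. (14) is straightforward to sketch. The three threshold experts and their weights below are hypothetical, chosen only to exercise the sign rule:

```python
def committee_decision(alphas, experts, x):
    """sign(C(x)) with C(x) = sum_m alpha_m * y_m(x), y_m(x) in {-1, +1} (Eq. 14)."""
    c = sum(a * y(x) for a, y in zip(alphas, experts))
    return 1 if c >= 0 else -1

# Hypothetical experts: each thresholds a scalar input at a different point
experts = [lambda x: 1 if x > 0 else -1,
           lambda x: 1 if x > 1 else -1,
           lambda x: 1 if x > -1 else -1]
alphas = [0.5, 0.3, 0.2]
label = committee_decision(alphas, experts, 0.5)
```

At x = 0.5 the experts vote +1, −1, +1, so C(x) = 0.5 − 0.3 + 0.2 = 0.4 and the committee outputs +1.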

Now

Adaptive Boosting works even with a continuum of classifiers. However, for the sake of simplicity, we will assume that the set of experts is finite.


Getting the correct classifiers

We want the following:
- Review possible member classifiers.
- Select them if they have certain properties.
- Assign a weight to their contribution to the set of experts.


Now

Selection is done the following way: test the classifiers in the pool using a training set T of N multidimensional data points x_i, where each point x_i has a label t_i = 1 or t_i = −1.

Assigning a cost for actions: we test and rank all classifiers in the expert pool by
- Charging a cost exp{β} any time a classifier fails (a miss).
- Charging a cost exp{−β} any time a classifier provides the right label (a hit).


Remarks about β

We require β > 0, so that misses are penalized more heavily than hits.

Although it looks strange to penalize hits, it does not matter as long as the penalty for a success is smaller than the penalty for a miss: exp{−β} < exp{β}.

Why? Exercise: show that if we assign cost a to misses and cost b to hits, where a > b > 0, we can rewrite such costs as a = c^d and b = c^{−d} for constants c and d. Thus, this form does not compromise generality.


Exponential Loss Function

Something notable: this kind of error function, different from the usual squared Euclidean distance to the classification target, is called an exponential loss function.

AdaBoost uses the exponential loss as its error criterion.

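For labels and outputs in {−1, +1}, the two per-point costs above collapse into the standard compact form exp(−β·t·y): a hit gives t·y = 1 and cost exp(−β), a miss gives t·y = −1 and cost exp(β). A one-line sketch:

```python
import math

def exponential_loss(beta, t, y):
    """Cost exp(beta) for a miss (y != t) and exp(-beta) for a hit (y == t),
    written compactly as exp(-beta * t * y) for t, y in {-1, +1}."""
    return math.exp(-beta * t * y)
```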

The Matrix S

When we test the M classifiers in the pool, we build a matrix S
We record the misses (with a ONE) and hits (with a ZERO) of each classifier.

Row i in the matrix is reserved for the data point x_i; column m is reserved for the mth classifier in the pool.

Classifiers

       1   2   ···   M
x1     0   1   ···   1
x2     0   0   ···   1
x3     1   1   ···   0
...
xN     0   0   ···   0
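As a sketch, the scouting matrix S can be built with a few lines of NumPy; the classifiers and data below are illustrative stand-ins, not the ones used later in the lecture:

```python
import numpy as np

def scouting_matrix(classifiers, X, t):
    """Build the N x M matrix S: S[i, m] = 1 if classifier m
    misses data point x_i, 0 if it hits."""
    N, M = len(X), len(classifiers)
    S = np.zeros((N, M), dtype=int)
    for m, clf in enumerate(classifiers):
        S[:, m] = (clf(X) != t).astype(int)  # 1 = miss, 0 = hit
    return S

# Tiny example with three stump-like classifiers on 1-D data
X = np.array([-2.0, -1.0, 1.0, 2.0])
t = np.array([-1, -1, 1, 1])
classifiers = [
    lambda X: np.where(X > 0, 1, -1),     # perfect on this data
    lambda X: np.where(X > 1.5, 1, -1),   # misses x3
    lambda X: np.ones_like(X, dtype=int), # always predicts +1
]
S = scouting_matrix(classifiers, X, t)
print(S)
```

Each column summarizes one classifier's record over the whole data set, which is what the selection step will weight and sum later.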

Main Idea

What does AdaBoost want to do?
The main idea of AdaBoost is to proceed systematically by extracting one classifier from the pool in each of M iterations.

Thus
The elements in the data set are weighted according to their current relevance (or urgency) at each iteration.

Thus, at the beginning of the iterations
All data samples are assigned the same weight:

Just 1, or 1/N, if we want the weights to sum to 1.

The process of the weights

As the selection progresses
The more difficult samples, those where the committee still performs badly, are assigned larger and larger weights.

Basically
The selection process concentrates on picking new classifiers for the committee by focusing on those that can help with the still-misclassified examples.

Then
The best classifiers are those that can provide new insights to the committee.
Classifiers being selected should complement each other in an optimal way.

Selecting New Classifiers

What we want
In each iteration, we rank all classifiers, so that we can select the current best out of the pool.

At the mth iteration
We have already included m − 1 classifiers in the committee and we want to select the next one.

Thus, we have the following cost function, which is actually the output of the committee

C^(m−1)(x_i) = α_1 y_1(x_i) + α_2 y_2(x_i) + ... + α_{m−1} y_{m−1}(x_i)   (15)

Thus, we have that

Extending the cost function

C^(m)(x_i) = C^(m−1)(x_i) + α_m y_m(x_i)   (16)

At the first iteration m = 1
C^(m−1) is the zero function.

Thus, the total cost or total error is defined as the exponential error

E = ∑_{i=1}^{N} exp{−t_i (C^(m−1)(x_i) + α_m y_m(x_i))}   (17)

Thus

We want to determine
α_m and y_m in an optimal way.

Thus, rewriting

E = ∑_{i=1}^{N} w_i^(m) exp{−t_i α_m y_m(x_i)}   (18)

Where, for i = 1, 2, ..., N

w_i^(m) = exp{−t_i C^(m−1)(x_i)}   (19)
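Equations (18) and (19) can be checked with a small sketch; the data are illustrative, and `boost_weights` and `exponential_error` are names introduced here, not part of the derivation:

```python
import numpy as np

def boost_weights(t, committee_output):
    """w_i^(m) = exp(-t_i * C^(m-1)(x_i))  -- Eq. (19)."""
    return np.exp(-t * committee_output)

def exponential_error(t, committee_output, alpha_m, ym_pred):
    """E = sum_i w_i^(m) * exp(-t_i * alpha_m * y_m(x_i))  -- Eq. (18)."""
    w = boost_weights(t, committee_output)
    return np.sum(w * np.exp(-t * alpha_m * ym_pred))

# With an empty committee (C^(0) = 0) all weights are 1, so E reduces
# to the plain exponential loss of the candidate classifier alone.
t = np.array([1, 1, -1, -1])
C0 = np.zeros(4)
ym = np.array([1, -1, -1, -1])   # one miss (the second point)
E = exponential_error(t, C0, alpha_m=0.5, ym_pred=ym)
print(E)   # 3 hits cost e^{-0.5} each, 1 miss costs e^{0.5}
```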

Thus

In the first iteration, w_i^(1) = 1 for i = 1, ..., N
During later iterations, the vector w^(m) represents the weight assigned to each data point in the training set at iteration m.

Splitting equation (Eq. 18)

E = ∑_{t_i = y_m(x_i)} w_i^(m) exp{−α_m} + ∑_{t_i ≠ y_m(x_i)} w_i^(m) exp{α_m}   (20)

Meaning
The total cost is the weighted cost of all hits plus the weighted cost of all misses.
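A minimal sketch of this split, with illustrative weights and predictions (`split_costs` is a name introduced here):

```python
import numpy as np

def split_costs(w, t, ym_pred):
    """Split the weight vector into W_c (weighted hits) and
    W_e (weighted misses), the two sums in Eq. (20)."""
    hits = (ym_pred == t)
    Wc = np.sum(w[hits])
    We = np.sum(w[~hits])
    return Wc, We

w = np.array([0.1, 0.4, 0.3, 0.2])
t = np.array([1, 1, -1, -1])
ym = np.array([1, -1, -1, 1])    # misses the 2nd and 4th points
Wc, We = split_costs(w, t, ym)
print(Wc, We)                    # 0.4 and 0.6
```

E is then Wc·exp{−α_m} + We·exp{α_m}, which is the form minimized next.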

Now

Writing the first summand as W_c exp{−α_m} and the second as W_e exp{α_m}

E = W_c exp{−α_m} + W_e exp{α_m}   (21)

Now, for the selection of y_m, the exact value of α_m > 0 is irrelevant
For a fixed α_m, minimizing E is equivalent to minimizing exp{α_m} E, and

exp{α_m} E = W_c + W_e exp{2α_m}   (22)

Then

Since exp{2α_m} > 1, we can rewrite (Eq. 22)

exp{α_m} E = W_c + W_e − W_e + W_e exp{2α_m}   (23)

Thus

exp{α_m} E = (W_c + W_e) + W_e (exp{2α_m} − 1)   (24)

Now
W_c + W_e is the total sum W of the weights of all data points, which is constant in the current iteration.

Thus

The right-hand side of the equation is minimized
When at the mth iteration we pick the classifier with the lowest total cost W_e (that is, the lowest weighted error).

Intuitively
The next selected y_m should be the one with the lowest penalty given the current set of weights.

Deriving against the weight αm

Going back to the original E, we can use the derivative trick

∂E/∂α_m = −W_c exp{−α_m} + W_e exp{α_m}   (25)

Setting the derivative equal to zero and multiplying by exp{α_m}

−W_c + W_e exp{2α_m} = 0   (26)

The optimal value is thus

α_m = (1/2) ln(W_c/W_e)   (27)
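The optimality of Eq. (27) can be verified numerically; the grid search below is only a sanity check with illustrative values of W_c and W_e, not part of the algorithm:

```python
import numpy as np

# Check that alpha_m = 0.5*ln(Wc/We) minimizes
# E(alpha) = Wc*exp(-alpha) + We*exp(alpha).
Wc, We = 0.7, 0.3
alpha_opt = 0.5 * np.log(Wc / We)

E = lambda a: Wc * np.exp(-a) + We * np.exp(a)
alphas = np.linspace(-2, 2, 2001)
alpha_grid = alphas[np.argmin(E(alphas))]
print(alpha_opt, alpha_grid)   # both close to 0.4236
```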

Now

Letting W be the total sum of all weights

W = W_c + W_e   (28)

We can rewrite the previous equation as

α_m = (1/2) ln((W − W_e)/W_e) = (1/2) ln((1 − e_m)/e_m)   (29)

With the percentage rate of error given the weights of the data points

e_m = W_e / W   (30)

What about the weights?

Using the equation

w_i^(m) = exp{−t_i C^(m−1)(x_i)}   (31)

And because we now have α_m and y_m(x_i)

w_i^(m+1) = exp{−t_i C^(m)(x_i)}
          = exp{−t_i [C^(m−1)(x_i) + α_m y_m(x_i)]}
          = w_i^(m) exp{−t_i α_m y_m(x_i)}
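The multiplicative update in the last line can be sketched directly; the data are illustrative and `update_weights` is a name introduced here:

```python
import numpy as np

def update_weights(w, t, alpha_m, ym_pred):
    """w_i^(m+1) = w_i^(m) * exp(-t_i * alpha_m * y_m(x_i)).
    Misses (t_i != y_m(x_i)) are multiplied by exp(alpha_m),
    hits by exp(-alpha_m)."""
    return w * np.exp(-t * alpha_m * ym_pred)

w = np.ones(4)
t = np.array([1, 1, -1, -1])
ym = np.array([1, -1, -1, -1])   # one miss (the second point)
w_new = update_weights(w, t, alpha_m=0.5, ym_pred=ym)
print(w_new)   # the miss grows to e^{0.5}, the hits shrink to e^{-0.5}
```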

Sequential Training

Thus
AdaBoost trains a new classifier using a data set in which the weighting coefficients are adjusted according to the performance of the previously trained classifier, so as to give greater weight to the misclassified data points.

Illustration

[Figure: schematic illustration of AdaBoost's sequential training]
AdaBoost Algorithm

Step 1
Initialize the weights {w_i^(1)} to 1/N.

Step 2
For m = 1, 2, ..., M

1. Select a weak classifier y_m(x) for the training data by minimizing the weighted error function:

   arg min_{y_m} ∑_{i=1}^{N} w_i^(m) I(y_m(x_i) ≠ t_i) = arg min_{y_m} ∑_{t_i ≠ y_m(x_i)} w_i^(m) = arg min_{y_m} W_e   (32)

   Where I is an indicator function.

2. Evaluate

   e_m = [∑_{n=1}^{N} w_n^(m) I(y_m(x_n) ≠ t_n)] / [∑_{n=1}^{N} w_n^(m)]   (33)

AdaBoost Algorithm

Step 3
Set the α_m weight to

α_m = (1/2) ln{(1 − e_m)/e_m}   (34)

Now update the weights of the data for the next iteration
If t_i ≠ y_m(x_i), i.e. a miss:

w_i^(m+1) = w_i^(m) exp{α_m} = w_i^(m) √((1 − e_m)/e_m)   (35)

If t_i = y_m(x_i), i.e. a hit:

w_i^(m+1) = w_i^(m) exp{−α_m} = w_i^(m) √(e_m/(1 − e_m))   (36)
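A quick numerical check of the identities in Eqs. (35)-(36), assuming an illustrative error rate e_m = 0.2:

```python
import numpy as np

# With alpha_m = 0.5*ln((1-e_m)/e_m), the update factor is exactly
# exp(alpha_m) = sqrt((1-e_m)/e_m) for a miss and
# exp(-alpha_m) = sqrt(e_m/(1-e_m)) for a hit.
e_m = 0.2
alpha_m = 0.5 * np.log((1 - e_m) / e_m)

miss_factor = np.exp(alpha_m)
hit_factor = np.exp(-alpha_m)
print(miss_factor, np.sqrt((1 - e_m) / e_m))   # both equal 2.0
print(hit_factor, np.sqrt(e_m / (1 - e_m)))    # both equal 0.5
```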

Finally, make predictions

For this use

Y_M(x) = sign(∑_{m=1}^{M} α_m y_m(x))   (37)
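Steps 1-3 plus the final vote of Eq. (37) can be put together in a short sketch. The threshold-stump weak learner and the toy data set are illustrative choices made here, not part of the slides:

```python
import numpy as np

def adaboost(X, t, M):
    """Minimal AdaBoost sketch for 1-D data with threshold stumps.
    Returns the lists (alphas, stumps) defining the committee."""
    N = len(X)
    w = np.ones(N) / N                        # Step 1: uniform weights
    alphas, stumps = [], []
    thresholds = np.unique(X)
    for m in range(M):
        # Step 2: pick the stump (threshold, sign) with lowest W_e
        best = None
        for th in thresholds:
            for s in (1, -1):
                pred = s * np.where(X > th, 1, -1)
                We = np.sum(w[pred != t])
                if best is None or We < best[0]:
                    best = (We, th, s)
        We, th, s = best
        e_m = We / np.sum(w)                  # Eq. (33)
        e_m = np.clip(e_m, 1e-10, 1 - 1e-10)  # avoid log of 0
        alpha = 0.5 * np.log((1 - e_m) / e_m) # Eq. (34)
        pred = s * np.where(X > th, 1, -1)
        w = w * np.exp(-t * alpha * pred)     # Eqs. (35)-(36)
        alphas.append(alpha)
        stumps.append((th, s))
    return alphas, stumps

def predict(alphas, stumps, X):
    """Committee vote, Eq. (37)."""
    total = sum(a * s * np.where(X > th, 1, -1)
                for a, (th, s) in zip(alphas, stumps))
    return np.sign(total)

X = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
t = np.array([-1, -1, 1, 1, -1, -1])          # not separable by one stump
alphas, stumps = adaboost(X, t, M=5)
print(predict(alphas, stumps, X))             # matches t
```

No single stump classifies this data set, yet the weighted committee does, which is exactly the behavior the derivation above predicts.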

Observations

First
The first base classifier is trained by the usual procedure of training a single classifier.

Second
From (Eq. 35) and (Eq. 36), we can see that the weighting coefficients are increased for data points that are misclassified.

Third
The quantity e_m represents a weighted measure of the error rate.
Thus α_m gives more weight to the more accurate classifiers.

In addition

The pool of classifiers in Step 1 can be substituted by a family of classifiers
One whose members are trained to minimize the error function given the current weights.

Now
If indeed a finite set of classifiers is given, we only need to test the classifiers once for each data point.

The Scouting Matrix S
It can be reused at each iteration by multiplying the transposed vector of weights w^(m) with S to obtain W_e for each machine.

We have then

The following

[W_e^(1)  W_e^(2)  ···  W_e^(M)] = (w^(m))^T S   (38)

This allows us to reformulate the weight update step so that
Only misses lead to weight modification.

Note
The weight vector w^(m) is constructed iteratively.
It could be recomputed completely at every iteration, but the iterative construction is more efficient and simpler to implement.
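A sketch of Eq. (38) with an illustrative 4 × 3 scouting matrix:

```python
import numpy as np

# The weighted error W_e of every classifier in the pool at once,
# as a single vector-matrix product w^T S.
S = np.array([[0, 1, 1],     # S[i, m] = 1 iff classifier m misses x_i
              [0, 0, 1],
              [1, 1, 0],
              [0, 0, 0]])
w = np.array([0.1, 0.2, 0.3, 0.4])   # current weights w^(m)

We_all = w @ S                       # [W_e^(1), W_e^(2), ..., W_e^(M)]
print(We_all)                        # [0.3 0.4 0.3]
best = np.argmin(We_all)             # classifier to select this iteration
```

Since S never changes, only this product has to be recomputed as the weights evolve.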

Explanation

[Figure: graph of (1 − e_m)/e_m in the range 0 ≤ e_m ≤ 1]

So we have

[Figure: graph of α_m as a function of e_m]

We have the following cases

We have the following
If e_m → 1, we have that all the samples were incorrectly classified!!!

Thus
We get that lim_{e_m→1} (1 − e_m)/e_m → 0, and then α_m → −∞.

Finally
We get that for every misclassified sample, w_i^(m+1) = w_i^(m) exp{α_m} → 0.
We only need to reverse the answers to get the perfect classifier and select it as the only committee member.

Thus, we have

If e_m → 1/2
We have α_m → 0; thus, whether the sample is well or badly classified:

exp{−α_m t_i y_m(x_i)} → 1   (39)

So the weight does not change at all.

What about e_m → 0?
We have that α_m → +∞.

Thus, we have
For samples always correctly classified, w_i^(m+1) = w_i^(m) exp{−α_m t_i y_m(x_i)} → 0.
Thus, the only committee member we need is m; we do not need m + 1.
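The three regimes can be seen numerically, using illustrative error rates:

```python
import numpy as np

# alpha_m = 0.5*ln((1-e_m)/e_m):
# e_m near 1 -> large negative, e_m = 1/2 -> 0, e_m near 0 -> large positive.
alpha = lambda e: 0.5 * np.log((1 - e) / e)

print(alpha(0.999))   # large negative (worse than chance: flip the answers)
print(alpha(0.5))     # 0.0 (no better than chance: zero vote weight)
print(alpha(0.001))   # large positive (nearly perfect: dominant vote weight)
```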

Example

Training set

Classes
ω1 = N(0, 1)
ω2 = N(0, σ²) − N(0, 1)

[Figure: training set drawn from the two classes]

Example

Weak Classifier
Perceptron

For m = 1, ..., M, we have the following possible classifiers:

[Figure: candidate weak classifiers at each iteration]

At the end of the process

For M = 40

[Figure: final committee classification for M = 40]
