
Page 1:

Announcements

• No reading assignment for next week
• Prepare for exam

• Midterm exam next week

Page 2:

• Last Time
  • Neural nets (briefly)
  • Decision trees

• Today
  • More decision trees
  • Ensembles
  • Exam review

• Next time
  • Exam
  • Advanced ML topics

Page 3:

Overview of ID3

[Figure: ID3 builds the tree top-down. At each node it picks a splitting attribute (A1, A2, A3, A4) and partitions the current examples (positives +1 ... +5, negatives -1 ... -3) among the branches, then recurses on the remaining attributes until each leaf is labeled + or -. If a branch receives no examples, use the majority class at the parent node.]

Page 4:

Example Info Gain Calculation

Color     Shape    Size     Class
Red                BIG      +
Red                BIG      +
Yellow             SMALL    -
Red                SMALL    -
Blue               BIG      +

I(f+, f-) = ?     E(color) = ?
E(shape) = ?      E(size) = ?

Page 5:

Info Gain Calculation (contd.)

I(f+, f-) = I(0.6, 0.4) = -0.6 log2(0.6) - 0.4 log2(0.4) ≈ 0.97

E(size)  = 0.6 I(1, 0) + 0.4 I(0, 1) = 0

E(shape) = 0.6 I(2/3, 1/3) + 0.4 I(1/2, 1/2)

E(color) = 0.6 I(2/3, 1/3) + 0.2 I(1, 0) + 0.2 I(0, 1)

Note that “Size” provides complete classification.
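As a quick numerical check (not part of the slides), the sketch below recomputes these quantities for the five-example table; the helper names `entropy` and `expected_entropy` are ours, and E(shape) is skipped because the Shape values are not listed above.

```python
import math

def entropy(pos, neg):
    """I(p, n): entropy of a node with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

# The five training examples as (color, size, class) tuples.
examples = [("Red", "BIG", "+"), ("Red", "BIG", "+"), ("Yellow", "SMALL", "-"),
            ("Red", "SMALL", "-"), ("Blue", "BIG", "+")]

def expected_entropy(feature_index):
    """E(feature): weighted entropy of the partition induced by one feature."""
    groups = {}
    for ex in examples:
        groups.setdefault(ex[feature_index], []).append(ex[-1])
    n = len(examples)
    return sum(len(g) / n * entropy(g.count("+"), g.count("-"))
               for g in groups.values())

print(entropy(3, 2))          # I(0.6, 0.4) ~= 0.97
print(expected_entropy(1))    # E(size)  = 0  (size classifies perfectly)
print(expected_entropy(0))    # E(color) ~= 0.55
```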

Page 6:

Runtime Performance of ID3

• Let E = # examples, F = # features
• At level 1
  • Look at each feature
  • Look at each example (to get its feature value)

Work to choose 1 feature = O(F x E)

Page 7:

Runtime Performance of ID3 (cont.)

• In the worst case, need to consider all features along all paths (full tree)

Reasonably efficient

O(F² x E)

Page 8:

Generating Rules

• Antecedent: conjunction of all decisions leading to the terminal node

• Consequent: label of the terminal node

• Example

[Figure: example tree. The root tests COLOR?: the Red branch leads to a SIZE? test (Big → +, Small → -), the Blue branch is a + leaf, and the Green branch is a - leaf.]

Page 9:

Generating Rules (cont.)

• Generates rules:

Color = Green → -

Color = Blue → +

Color = Red and Size = Big → +

Color = Red and Size = Small → -

• Note:

1. Can "clean up" the rule set (see Quinlan's)

2. Decision trees learn disjunctive concepts (a rule-extraction sketch follows below)
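For concreteness, here is a small sketch (our own nested-dict encoding, not the slides' code) that walks the example tree and prints one rule per root-to-leaf path, reproducing the four rules above.

```python
# Hypothetical nested-dict encoding of the COLOR/SIZE tree on the previous slide.
tree = {"feature": "COLOR",
        "branches": {"Green": "-",
                     "Blue": "+",
                     "Red": {"feature": "SIZE",
                             "branches": {"Big": "+", "Small": "-"}}}}

def tree_to_rules(node, antecedent=()):
    """Return (antecedent, label) pairs, one per root-to-leaf path."""
    if not isinstance(node, dict):          # leaf: node is just the class label
        return [(antecedent, node)]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(tree_to_rules(child, antecedent + ((node["feature"], value),)))
    return rules

for antecedent, label in tree_to_rules(tree):
    conditions = " and ".join(f"{f}={v}" for f, v in antecedent)
    print(f"IF {conditions} THEN {label}")
```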

Page 10:

Noise: A Major Issue in ML

• Worst case
  +, - at the same point in feature space

• Causes
  1. Too few features ("hidden variables") or too few possible values
  2. Incorrectly reported/measured/judged feature values
  3. Mis-classified instances

Page 11:

Noise: A Major Issue in ML (cont.)

• Issue – overfitting

  Producing an "awkward" concept because of a few "noisy" points.

[Figure: the same + / - points fit two ways: a contorted boundary that separates every point, including an isolated - among the +'s (bad performance on future examples?), versus a smooth boundary that ignores the noisy points (better performance?).]

Page 12:

Overfitting Viewed in Terms of Function-Fitting

Data = Red Line + Noise Model

[Figure: f(x) plotted against x; the + data points scatter around the red line (the true function) because of the noise model.]

Page 13:

Definition of Overfitting

• Assume the test set is large enough to be representative. Concept C overfits the training data if there exists a "simpler" concept S such that

Training-set accuracy of C  >  Training-set accuracy of S

but

Test-set accuracy of C  <  Test-set accuracy of S

Page 14:

Remember!

• It is easy to learn/fit the training data

• What's hard is generalizing well to future ("test set") data!

• Overfitting avoidance is a key issue in Machine Learning

Page 15:

Can One Underfit?

• Sure, if not fully fitting the training set
  e.g., just return the majority category (+ or -) in the train set as the learned model.

• But also if there is not enough data to illustrate the important distinctions.

Page 16:

ID3 & Noisy Data

• To avoid overfitting, allow splitting to stop before all examples are of one class.

• Option 1: if the information remaining is below some threshold, don't split
  Empirically failed; bad performance on error-free data (Quinlan)

Page 17:

ID3 & Noisy Data (cont.)

• Option 2: Estimate whether all remaining features are statistically independent of the class of the remaining examples
  Uses the chi-squared test of the original ID3 paper
  Works well on error-free data

Page 18:

ID3 & Noisy Data (cont.)

• Option 3 (not in the original ID3 paper):
  Build the complete tree, then use some "spare" (tuning) examples to decide which parts of the tree can be pruned.

Page 19:

ID3 & Noisy Data (cont.)

• Pruning is currently the best choice; see C4.5 for technical details

• Repeat pruning using a greedy algorithm.

Page 20:

Greedily Pruning D-trees

• Sample (hill-climbing) search space

[Figure: each state in the search space is a pruned tree; move to the best-scoring neighbor, and stop if there is no improvement.]

Page 21:

Pruning by Measuring Accuracy on Tune Set

1. Run ID3 to fully fit the TRAIN' set; measure accuracy on TUNE
2. Consider all subtrees where ONE interior node is removed and replaced by a leaf
   - label the leaf with the majority category in the pruned subtree
   - choose the best such subtree on TUNE
   - if no improvement, quit
3. Go to 2
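A minimal sketch of this tune-set pruning loop, assuming helper functions (`accuracy`, `interior_nodes`, `replace_with_leaf`) that are not defined on the slides:

```python
def greedy_prune(tree, tune_set, accuracy, interior_nodes, replace_with_leaf):
    """Greedy tune-set pruning sketch (assumed helpers, not the slides' code):
    accuracy(tree, data) -> float, interior_nodes(tree) -> list of nodes, and
    replace_with_leaf(tree, node) -> a NEW tree in which that node has been
    replaced by a leaf labeled with the majority class of its subtree."""
    best_tree, best_acc = tree, accuracy(tree, tune_set)
    while True:
        candidates = [replace_with_leaf(best_tree, node)
                      for node in interior_nodes(best_tree)]
        if not candidates:
            return best_tree
        challenger = max(candidates, key=lambda t: accuracy(t, tune_set))
        if accuracy(challenger, tune_set) <= best_acc:
            return best_tree            # no improvement: quit
        best_tree, best_acc = challenger, accuracy(challenger, tune_set)
```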

Page 22:

The Tradeoff in the Greedy Algorithm

• Efficiency vs optimality

E.g.: [Figure: the initial tree is rooted at R, with children A and B and lower subtrees C, D, E, and F.]

IF the "tune"-best cuts are to discard C's and F's subtrees,
BUT the single best cut is to discard B's subtree,
then greedy search will not find the best tree.

Greedy search: a powerful, general-purpose trick of the trade.

Page 23:

Hypothetical Trace of a Greedy Algorithm

Full-tree accuracy = 85% on the TUNE set

[Figure: the tree rooted at R with interior nodes A, B, C, D, E, and F. Each node is annotated with the TUNE-set accuracy we would get if we replaced that node with a leaf (leaving the rest of the tree the same); the values shown are [64], [77], [74], [87], [63], [89], and [88].]

Pruning @ B works best

Page 24:

Hypothetical Trace of a Greedy Algorithm (cont.)

• Full-tree accuracy = 89%
  - STOP, since there is no improvement from cutting again, and return the tree below.

[Figure: the pruned tree keeps R with children A and B; the annotated accuracies are [64], [77], and [89].]

Page 25:

Another Possibility: Rule Post-Pruning
(also a greedy algorithm)

1. Induce a decision tree

2. Convert to rules (see earlier slide)

3. Consider dropping one rule antecedent

• Delete the one that improves tuning-set accuracy the most.

• Repeat as long as progress is being made (a sketch follows below).
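A rough sketch of this greedy antecedent-dropping loop; the rule representation and the `rule_set_accuracy` helper are assumptions, not the slides' code:

```python
def post_prune_rules(rules, tune_set, rule_set_accuracy):
    """Greedy rule post-pruning sketch.

    rules: list of (antecedent, label) pairs, where an antecedent is a tuple
    of (feature, value) tests. rule_set_accuracy(rules, data) is assumed.
    Repeatedly drops the single test whose removal most improves tuning-set
    accuracy; stops when no drop helps."""
    best_acc = rule_set_accuracy(rules, tune_set)
    while True:
        best_candidate = None
        for i, (antecedent, label) in enumerate(rules):
            for j in range(len(antecedent)):
                trimmed = antecedent[:j] + antecedent[j + 1:]
                candidate = rules[:i] + [(trimmed, label)] + rules[i + 1:]
                acc = rule_set_accuracy(candidate, tune_set)
                if acc > best_acc:
                    best_acc, best_candidate = acc, candidate
        if best_candidate is None:
            return rules                 # no single drop improves TUNE accuracy
        rules = best_candidate
```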

Page 26:

Rule Post-Pruning (cont.)

• Advantages
  • Allows an intermediate node to be pruned from some rules but retained in others.
  • Can correct poor early decisions in tree construction.
  • Final concept is more understandable.

Page 27:

Training with Noisy Data

• If we can clean up the training data, should we do so?

• No (assuming one can't clean up the testing data when the learned concept will be used).

• Better to train with the same type of data as will be experienced when the result of learning is put into use.

Page 28:

Overfitting + Noise

• Using the strict definition of overfitting presented earlier, is it possible to overfit noise-free data?
  • In general?
  • Using ID3?

Page 29:

Example of Overfitting of Noise-free Data

Let
• Correct concept = A ^ B
• Feature C be true 50% of the time, for both + and - examples
• Prob(+ example) = 0.9
• Training set:
  • +: ABCDE, ABC¬DE, ABCD¬E
  • -: A¬B¬CD¬E, ¬AB¬C¬DE

Page 30:

Example (Continued)

Tree              Trainset Accuracy    Testset Accuracy
ID3's             100%                 50%
Simpler "tree"    60%                  90%

[Figure: the simpler "tree" is just the single leaf +, while ID3's tree includes a test on C (T → +, F → -).]

Page 31:

Post Pruning

• There are more sophisticated methods of deciding where to prune than simply estimating accuracy on a tuning set.

• See the C4.5 and CART books for details.

• We won't discuss them, except for MDL.

• Tuning sets are also called
  • Pruning sets (in d-tree algorithms)
  • Validation sets (in general)

Page 32:

Tuning Sets vs MDL

• Two ways to deal with overfitting
  • Tuning sets
    • Empirically evaluate pruned trees
  • MDL (Minimal Description Length)
    • Theoretically evaluate/score pruned trees
    • Describe the training data in as few bits as possible ("compression")

Page 33:

MDL (cont.)

• No need to hold aside training data

• But how good is the MDL hypothesis?
  • Heuristic: MDL => good generalization

Page 34:

The Minimal Description Length (MDL) Principle
(Rissanen, 1986; Quinlan and Rivest, 1989)

• Informally, we want to view a training set as
  data = general rule + exceptions to the rule ("noise")

• Tradeoff between
  • Simple rule, but many exceptions
  • Complex rule with few exceptions

• How to make this tradeoff?
  • Try to minimize the "description length" of the rule + exceptions

Page 35:

Trading Off Simplicity vs Coverage

Description Length = (size of rules) + λ x (size of exceptions)

  size of rules: # bits needed to represent a decision tree that covers (possibly incompletely) the training examples
  size of exceptions: # bits needed to encode the exceptions to this decision tree
  λ: a weighting factor, user-defined or chosen with a tuning set

Minimize this description length.

Key issue: what's the best coding strategy to use?

Page 36:

A Simple MDL Algorithm

1. Build the full tree using ID3 (and ALL the training examples)
2. Consider all/many subtrees, keeping the one that minimizes:
   score = (# nodes in tree) + λ x (error rate on training set)
   (A crude scoring function)

Some details:
If # features = Nf and # examples = Ne, then we need Ceiling(log2 Nf) bits to encode each tree node and Ceiling(log2 Ne) bits to encode an exception.
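A sketch of a crude MDL score using the bit counts just described; the function name and the exact way the two terms are combined are our assumptions, not the slides':

```python
import math

def mdl_score(num_tree_nodes, num_exceptions, num_features, num_examples, lam=1.0):
    """Crude MDL score sketch: each tree node costs ceil(log2 Nf) bits, each
    exception (misclassified training example) costs ceil(log2 Ne) bits, and
    lam is the user-chosen (or tuning-set-chosen) weighting factor."""
    node_bits = math.ceil(math.log2(num_features)) if num_features > 1 else 1
    exception_bits = math.ceil(math.log2(num_examples)) if num_examples > 1 else 1
    return num_tree_nodes * node_bits + lam * num_exceptions * exception_bits

# e.g., a 7-node tree with 3 exceptions, Nf = 10 features, Ne = 1000 examples
print(mdl_score(7, 3, num_features=10, num_examples=1000))   # 7*4 + 3*10 = 58
```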

Page 37:

Searching the Space of Pruned D-trees with MDL

• Can use the same greedy search algorithm used with pruning sets

• But use the MDL score rather than pruning-set accuracy as the heuristic function

Page 38:

MDL Summarized

The overfitting problem
• Can exactly fit the training data, but will this generalize well to test data?
  • Trade off some training-set errors for fewer test-set errors

• One solution – the MDL hypothesis
  • Solve the MDL problem (on the training data) and you are likely to generalize well (accuracy on the test data)

The MDL problem
• Minimize |description of general concept| + λ |list of exceptions (in the train set)|

Page 39:

Small Disjuncts
(Holte et al., IJCAI 1989)

• Results of learning can often be viewed as a disjunction of conjunctions

• Definition: small disjuncts – disjuncts that correctly classify few training examples
  • Not necessarily small in area.

Page 40:

The Problem with Small Disjuncts

• Collectively, small disjuncts cover much of the training data, but account for much of the test-set error

• One study
  • They cover 41% of the training data and produce 95% of the test-set error

• The "small-disjuncts problem" is still an open issue (see the Quinlan paper in MLJ for additional discussion).

Page 41:

Overfitting Avoidance Wrapup

• Note: this is a fundamental issue in all of ML, not just decision trees; after all, it is easy to exactly match the training data via "table lookup".

• Approaches
  • Use a simple ML algorithm from the start.
  • Optimize accuracy on a tuning set.
  • Only make distinctions that are statistically justified.
  • Minimize |concept description| + λ |exception list|.
  • Use ensembles to average out overfitting (next topic).

Page 42:

Decision "Stumps"

• Holte (MLJ) compared:
  • Decision trees with only one decision (decision stumps)
    vs
  • Trees produced by C4.5 (with its pruning algorithm used)

• Decision "stumps" do remarkably well on the UC Irvine data sets
  • Archive too easy?

• Decision stumps are a "quick and dirty" control for comparing against new algorithms.
  • But C4.5 is easy to use and probably a better control.
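For reference, a minimal 1R-style decision-stump learner might look like the sketch below (our own code; Holte's 1R also handles details such as discretizing numeric features, which are omitted here):

```python
from collections import Counter, defaultdict

def learn_stump(examples):
    """1R-style stump sketch: examples are (feature_dict, label) pairs.
    Picks the single feature whose value -> majority-label rule makes the
    fewest training errors; returns (feature, {value: predicted_label})."""
    best = None
    for f in examples[0][0]:
        by_value = defaultdict(Counter)
        for x, y in examples:
            by_value[x[f]][y] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(count for v, counts in by_value.items()
                     for label, count in counts.items() if label != rule[v])
        if best is None or errors < best[0]:
            best = (errors, f, rule)
    return best[1], best[2]
```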

Page 43:

C4.5 Compared to 1R ("Decision Stumps")

• Test-set accuracy
  • 1st column: UCI datasets
    • See the Holte paper for the key
  • Max diff: 2nd row
  • Min diff: 5th row
  • UCI datasets too easy?

Dataset   C4.5      1R
BC        72.0%     68.7%
CH        99.2%     68.7%
GL        63.2%     67.6%
G2        74.3%     53.8%
HD        73.6%     72.9%
HE        81.2%     76.3%
HO        83.6%     81.0%
HY        99.1%     97.2%
IR        93.8%     93.5%
LA        77.2%     71.5%
LY        77.5%     70.7%
MU        100.0%    98.4%
SE        97.7%     95.0%
SO        97.5%     81.0%
VO        95.6%     95.2%
V1        89.4%     86.8%

Page 44:

Dealing with Missing Features

• Bayes nets might be the best technique if many features are missing (later)

• Common technique: use the EM algorithm (later)

• Quinlan's suggested approach:
  • During training (on each recursive call)
    • Fill in missing values proportionally
    • If 50% red, 30% blue and 20% green (among the non-missing cases), then fill missing values according to this probability distribution
    • Do this per output category

Page 45:

Simple Example

• Note: by "missing features" we really mean "missing feature values"

Color    Category
red      +
red      +
blue     +
red      -
blue     -
?        +
?        -

Prob(red | +) = 2/3
Prob(blue | +) = 1/3
Prob(red | -) = 1/2
Prob(blue | -) = 1/2

Flip weighted coins to fill in the ?'s (see the sketch below).
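A sketch of this "flip weighted coins" step, done per output category as described above; the data layout and helper name are ours, not the slides':

```python
import random
from collections import Counter

def fill_missing(examples, feature, missing="?"):
    """For each example whose value for 'feature' is missing, sample a value
    from the distribution of observed values within that example's category."""
    by_class = {}
    for x, y in examples:
        if x[feature] != missing:
            by_class.setdefault(y, Counter())[x[feature]] += 1
    filled = []
    for x, y in examples:
        if x[feature] == missing:
            values, counts = zip(*by_class[y].items())
            x = dict(x, **{feature: random.choices(values, weights=counts)[0]})
        filled.append((x, y))
    return filled

data = [({"color": "red"}, "+"), ({"color": "red"}, "+"), ({"color": "blue"}, "+"),
        ({"color": "red"}, "-"), ({"color": "blue"}, "-"),
        ({"color": "?"}, "+"), ({"color": "?"}, "-")]
print(fill_missing(data, "color"))
```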

Page 46:

Missing Feature During Testing

• Follow all paths, weighting the answers proportionally to the probability of each path

[Figure: a node testing Color with branch probabilities 40% red, 20% blue, 40% green.]

out+(color) = 0.4 out+(red) + 0.2 out+(blue) + 0.4 out+(green)

This gives the votes for + being the category (repeat for -).

• Repeat throughout subtrees
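A sketch of this test-time scheme, assuming a hypothetical node representation that stores the branch probabilities observed during training:

```python
def classify_weighted(node, example, missing="?"):
    """Follow all branches below a missing value, weighting each branch by its
    training-time probability; returns {label: vote_weight}. Assumes interior
    nodes are dicts with 'feature', 'branches', and 'branch_probs'; leaves are
    plain class labels."""
    if not isinstance(node, dict):                     # leaf
        return {node: 1.0}
    value = example.get(node["feature"], missing)
    if value != missing:
        return classify_weighted(node["branches"][value], example, missing)
    votes = {}
    for v, child in node["branches"].items():
        weight = node["branch_probs"][v]               # e.g., 0.4 red, 0.2 blue, 0.4 green
        for label, w in classify_weighted(child, example, missing).items():
            votes[label] = votes.get(label, 0.0) + weight * w
    return votes
```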

Page 47:

Why are Features Missing?

• The model on the previous page implicitly assumes feature values are randomly deleted
  • as if hit by a cosmic ray!

• But values might be missing for a reason
  • E.g., the data collector decided the values for some features are not worth recording

• One suggested solution:
  • Let "not-recorded" be another legal value (and, hence, a branch in the decision tree)

Page 48:

A D-Tree Variant that Exploits Info in "Missing" Feature Values

• At each recursive call, only consider features that have no missing values
• E.g.

[Figure: the root splits on Shape; a Color test appears lower in the tree, where maybe all of the examples with missing Color values take that path.]

• Could generalize this algorithm by penalizing features with missing values

Page 49:

ID3 Recap ~ Questions Addressed

• How closely should we fit the training data?
  • Completely, then prune
  • Use MDL or tuning sets to choose

• How do we judge features?
  • Use info theory (Shannon)

• What if a feature has many values?
  • Correction factor based on info theory

• What if some feature values are unknown (in some examples)?
  • Distribute based on other examples (???)

Page 50:

ID3 Recap (cont.)

• What if some features cost more to evaluate (CAT scan vs. temperature)?
  • Ad hoc correction factor

• Batch vs. incremental learning?
  • Basically a "batch" approach; incremental variants exist, but since ID3 is so fast, why not simply rerun "from scratch"?

Page 51:

ID3 Recap (cont.)

• What about real-valued outputs?
  • Could learn a linear approximation for various regions of the feature space, e.g.

• How rich is our language for describing examples?
  • Limited to fixed-length feature vectors (but they are surprisingly effective)

[Figure: a Venn-diagram-like partition of the feature space into regions, each labeled with its own linear model, such as f1 + 2·f2, 3·f1 - f2, or f4.]

Page 52:

Summary of ID3

• Strengths:
  • Good technique for learning discrete-valued functions from "real world" (e.g., noisy) data
  • Fast, simple and robust
  • Considers a complete hypothesis space
  • Successfully applied to many real-world tasks
  • Results (trees or rules) are human-comprehensible

Page 53:

Summary of ID3 (cont.)

• Weaknesses (however, extensions exist)
  • Requires fixed-length feature vectors
  • Only makes axis-parallel (univariate) splits
  • Non-incremental
  • Hill-climbing algorithm (poor early decisions can be disastrous)

Page 54:

Next Topic: Ensembles

• Boosting, Bagging, …

Page 55:

Ensembles
(Bagging, Boosting, etc.)

• Old view:
  • Learn a good model (Naïve Bayes, neural nets, d-trees, etc.)

• New view:
  • Learn a good set of models

• Probably the best example of the interplay between "theory & practice" in Machine Learning

Page 56:

Ensembles of Neural Networks
(or any supervised learner)

• Ensembles often produce gains of 5-10 percentage points!

• Can combine "classifiers" of various types
  • E.g., decision trees, rule sets, neural networks, etc.

[Figure: the INPUT is fed to several networks in parallel; a Combiner merges their predictions into the OUTPUT.]
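The combiner can be as simple as an unweighted majority vote; a minimal sketch (our own, and more elaborate weighted combiners are of course possible):

```python
from collections import Counter

def ensemble_predict(models, x):
    """Majority-vote combiner sketch: each model is any callable mapping an
    input to a class label; the ensemble returns the most common prediction."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# e.g. (hypothetical models): ensemble_predict([tree.predict, net.predict, rules.predict], x)
```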

Page 57:

Some Relevant Early Papers

• Hansen & Salamon, PAMI:12, 1990
  • If (a) the combined predictors have errors that are independent from one another, and (b) each possible input is correctly predicted > 50% of the time, then

lim (test-set error rate of N predictors) = 0   as N → ∞

Page 58:

Some Relevant Early Papers

• Schapire, MLJ:5, 1990 ("Boosting")
  • If you have an algorithm that gets > 50% on any distribution of examples, you can create an algorithm that gets > (100% - ε)%, for any ε > 0
    • Impossible by the NFL theorem (later)???
    • Need an infinite (or at least very large) source of examples
      (Later extensions address this weakness)

• Also see Wolpert, Stacked Generalization, Neural Networks, 1992

Page 59:

Some Methods for Producing "Uncorrelated" Members of an Ensemble

• k times, randomly choose (with replacement) N examples from a training set of size N
  • "Bagging" by Breiman (MLJ, 1996)
  • Want unstable algorithms
  • Part of HW2

• Reweight examples each cycle (if wrong, increase the weight; else decrease the weight)
  • "AdaBoosting" by Freund & Schapire (1995, 1996)
  • More later
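A minimal sketch of the bagging resampling step described in the first bullet; `learner` stands for any train-set-to-model function and is an assumption, not a specific library call:

```python
import random

def bagging(train_set, learner, k):
    """Bagging sketch: draw k bootstrap samples (N examples, with replacement)
    from the training set and train one model on each."""
    n = len(train_set)
    return [learner([random.choice(train_set) for _ in range(n)])
            for _ in range(k)]
```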

Page 60:

Some Methods for Producing "Uncorrelated" Members of an Ensemble (cont.)

• Optimize error⁻¹ + diversity
  • Opitz & Shavlik (1995, 1996)

• Vary the number of hidden units in a neural network, the initial (network) weights, the tie-breaking scheme, the example ordering, etc.

Page 61:

Variance/Diversity Creating Methods (cont.)

• Train with different associated tasks ("multi-task learning")
  • Caruana (1996) and others

• Use different input features, randomly perturb training examples, etc.
  • Cherkauer, among others

[Figure: multi-task learning: besides the main task of X, the learner also predicts related functions such as Xage, Xgender, and Xincome.]

Page 62:

Variance/Diversity Creating Methods (cont.)

• Assign each category an error-correcting code, and train on each bit separately
  • Dietterich et al. (ICML 1995)

    Cat1 = 1 1 1 0 1 1 1
    Cat2 = 1 1 0 1 1 0 0
    Cat3 = 1 0 1 1 0 1 0
    Cat4 = 0 1 1 1 0 0 1

  Predicting 5 of 7 bits correctly suffices.
  Related to "distributed reps" (the rep of each category is distributed over N bits).

Page 63:

Random Forests
(Breiman, MLJ 2001)

• A variant of BAGGING
• Algorithm (let N = # of examples, F = # of features, i = some number << F):
  Repeat k times:
  (1) Draw, with replacement, N examples and put them in the train set
  (2) Build a d-tree, but in each recursive call choose (w/o replacement) i features
      • Choose the best of these i as the root of this (sub)tree
  (3) Do NOT prune
      (Pruning is NOT required in HW2, but it is OK to do so)
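The step that distinguishes random forests from plain bagging is (2); a sketch of that feature-subsampling choice, assuming a `score` function such as information gain:

```python
import random

def choose_split_feature(available_features, i, score):
    """Random-forest split sketch: sample i candidate features without
    replacement and return the one with the best score (e.g., info gain)."""
    candidates = random.sample(list(available_features),
                               min(i, len(available_features)))
    return max(candidates, key=score)
```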

Page 64:

More on Random Forests

• Increasing i
  • Increases correlation among individual trees (BAD)
  • Also increases accuracy of individual trees (GOOD)
  • Can use a tuning set to choose a good setting for i

• Overall, random forests
  • Are very fast (e.g., 50K examples, 10 features, 10 trees/min on a 1 GHz CPU)
  • Deal with large numbers of features
  • Reduce overfitting substantially

Page 65:

AdaBoosting (Freund & Schapire)

Initially weight all examples equally: w1,j = 1/N, where wi,j is the weight on example j on cycle i.

1. Let Hi = the concept/hypothesis learned on the current weighted train set.
2. Let εi = the weighted error on the current train set.
3. If εi > 1/2, return {H1, H2, ..., Hi-1} (all previous hypotheses).
4. Reweight the correct examples: wi+1,j = (εi / (1 - εi)) wi,j
   Note: since εi < 1/2, wi+1,j < wi,j.
5. Renormalize, so that the wi+1,j sum to 1 over all examples.
6. i ← i + 1; go to 1.
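A sketch of this loop, assuming a `weighted_learner(examples, weights)` function that returns a hypothesis `h` with `h(x) -> label`; it returns the list of (hypothesis, error) pairs used in the weighted vote on the next slide:

```python
def adaboost(examples, weighted_learner, rounds):
    """AdaBoost training sketch following the steps above; examples are
    (x, y) pairs."""
    n = len(examples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        h = weighted_learner(examples, weights)
        eps = sum(w for (x, y), w in zip(examples, weights) if h(x) != y)
        if eps >= 0.5:                   # worse than chance: return previous hypotheses
            return ensemble
        ensemble.append((h, eps))
        if eps == 0.0:                   # perfect hypothesis: nothing left to reweight
            return ensemble
        # shrink the weight of correctly classified examples, then renormalize
        weights = [w * (eps / (1 - eps)) if h(x) == y else w
                   for (x, y), w in zip(examples, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble
```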

Page 66:

Using the Set of Hypotheses Produced by AdaBoost

Output for example x =

  argmax over y in categories of:  Σ over i = 1 .. #hypos of  δ(hi(x) = y) log((1 - εi) / εi)

where δ(false) = 0 and δ(true) = 1, i.e., count the weighted votes for the hypotheses that predict y for input x.
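And a matching sketch of the weighted vote, where each hypothesis votes for its prediction with weight log((1 - εi)/εi):

```python
import math

def adaboost_predict(ensemble, x, categories):
    """Weighted-vote sketch over the (hypothesis, error) pairs from training."""
    scores = {y: 0.0 for y in categories}
    for h, eps in ensemble:
        eps = max(eps, 1e-12)            # guard against a perfect hypothesis (eps == 0)
        scores[h(x)] += math.log((1 - eps) / eps)
    return max(scores, key=scores.get)
```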

Page 67:

Dealing with Weighted Examples

Two approaches:
1. Sample from this probability distribution and train as normal.
2. Alter the learning algorithm so it counts weighted examples and not just examples,
   e.g., from accuracy = # correct / # total
   to weighted accuracy = Σ wi of correct / Σ wi of all

#2 preferred – avoids sampling effects

Page 68:

AdaBoosting & ID3

• Apply to PRUNED trees* – otherwise there is no train-set error

• ID3's calculations are all based on weighted sums, so it is easy to extend to weighted examples.

• Do not sample possible data sets; instead, alter the internals of ID3!

*: could also try with decision stumps

Page 69:

Boosting & Overfitting

Often get better test-set results, even when the train error is ZERO.

Hypothesis (see Schuurmans, AAAI-98, and R. Schapire's website):
  Still improving the number/strength of votes even though getting all train-set examples correct
  ("margins" – relates to SVMs)

[Figure: error (on unweighted examples) plotted over time: the train curve drops to zero while the test curve keeps decreasing.]

Page 70:

Empirical Studies

(from Freund & Schapire, reprinted in Dietterich's AI Magazine paper)

[Figure: scatter plots, each point one data set: error rate of C4.5 vs. error rate of bagged (boosted) C4.5, and error rate of AdaBoost vs. error rate of Bagging.]

Boosting and Bagging helped almost always!

On average, boosting slightly better?

Page 71:

Large Empirical Study of Bagging vs. Boosting

Opitz & Maclin (UW CS PhDs), 1999
JAIR Vol. 11, pp. 169-198, www.jair.org/abstracts/opitz99a.html

• Bagging is almost always better than a single D-tree or ANN (artificial neural net)

• Boosting can be much better than Bagging
  • However, boosting can sometimes be harmful (too much emphasis on "outliers"?)

Page 72:

Boosting/Bagging/Etc. Wrapup

- An easy-to-use and usually highly effective technique
- Always consider it when applying ML to practical problems

- Does reduce the "comprehensibility" of learned models
- See work by Craven & Shavlik, though ("rule extraction")

Page 73:

Exam Info

• 75 minutes
• You can use
  • One 8.5 x 11 inch sheet of paper with notes
  • A standard calculator

Page 74:

Main Topics

• Classifiers
  • Naïve Bayes, logistic regression, perceptrons, neural networks, decision trees

• Methodology
  • Computing confidence intervals
  • Paired t-tests
  • ROC curves

• Ensembles

Page 75:

Classifiers

• You should know the algorithm that each classifier uses to learn a concept