
Page 1

Ch 3. Decision Tree Learning

• Decision trees
• Basic learning algorithm (ID3)
• Entropy, information gain
• Hypothesis space
• Inductive bias
• Occam's razor in general
• Overfitting problem & extensions
• Post-pruning; real values, missing values, attribute costs, …

Page 2

Decision Tree for PlayTennis

Outlook
  Sunny    -> Humidity (High -> No, Normal -> Yes)
  Overcast -> Yes
  Rain     -> Wind (Strong -> No, Weak -> Yes)

Page 3

Training Examples

Day  Outlook   Temp.  Humidity  Wind    PlayTennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Page 4

Decision Tree Learning

• Classify an instance by sorting it down the tree to a leaf node containing its class (value); each internal node tests an attribute of the instance, with one branch per attribute value
• Widely used, practical method for approximating discrete-valued functions
• Robust to noisy data
• Capable of learning disjunctive expressions
• Typical bias: prefer smaller trees

Page 5

Decision Tree for PlayTennis

Outlook
  Sunny    -> Humidity (High -> No, Normal -> Yes)
  Overcast -> …
  Rain     -> …

• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification

Page 6

Decision Tree for PlayTennis

Outlook
  Sunny    -> Humidity (High -> No, Normal -> Yes)
  Overcast -> Yes
  Rain     -> Wind (Strong -> No, Weak -> Yes)

New instance:  Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak
Sorted down the tree:  Outlook=Sunny -> Humidity=High -> PlayTennis = No
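A minimal sketch of this sorting process, assuming an illustrative nested-dict representation of the tree (attribute -> value -> subtree or leaf); the TREE constant and classify() helper are my own names, not part of the slides.

```python
# The PlayTennis tree as nested dicts; a leaf is simply a class label.
TREE = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, instance):
    """Sort an instance down the tree until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                  # attribute tested at this node
        tree = tree[attribute][instance[attribute]]   # follow the matching branch
    return tree

print(classify(TREE, {"Outlook": "Sunny", "Temperature": "Hot",
                      "Humidity": "High", "Wind": "Weak"}))   # -> "No"
```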

Page 7

Decision Tree for Conjunction

(Outlook = Sunny) ∧ (Wind = Weak)

Outlook
  Sunny    -> Wind (Strong -> No, Weak -> Yes)
  Overcast -> No
  Rain     -> No

Page 8

Decision Tree for Disjunction

(Outlook = Sunny) ∨ (Wind = Weak)

Outlook
  Sunny    -> Yes
  Overcast -> Wind (Strong -> No, Weak -> Yes)
  Rain     -> Wind (Strong -> No, Weak -> Yes)

Page 9

Decision Tree for XOR

(Outlook = Sunny) XOR (Wind = Weak)

Outlook
  Sunny    -> Wind (Strong -> Yes, Weak -> No)
  Overcast -> Wind (Strong -> No, Weak -> Yes)
  Rain     -> Wind (Strong -> No, Weak -> Yes)

Page 10

Decision Tree

Outlook
  Sunny    -> Humidity (High -> No, Normal -> Yes)
  Overcast -> Yes
  Rain     -> Wind (Strong -> No, Weak -> Yes)

• Decision trees represent disjunctions of conjunctions:

(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)

Page 11

When to use Decision Trees?

• Instances are presented as attribute-value pairs
• The target function has discrete values (classification problems)
• Disjunctive descriptions may be required (disjunctions of conjunctions of constraints on the attribute values of instances)
• The training data may contain errors or missing attribute values

Examples: medical diagnosis, credit risk analysis, object classification for a robot manipulator

Page 12

Top-Down Induction of Decision Trees - ID3

1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes
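A minimal, self-contained Python sketch of this loop, written recursively for a discrete-valued target. The function names (entropy, gain, id3) and the list-of-dicts example format are illustrative choices; the "best" attribute is selected by information gain as defined on the following pages.

```python
import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the target-attribute values in a list of example dicts."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain(examples, attribute, target):
    """Expected reduction in entropy from splitting the examples on `attribute`."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Steps 1-5 above: returns a class label or a nested tree {attribute: {value: subtree}}."""
    classes = [ex[target] for ex in examples]
    if len(set(classes)) == 1:                      # step 5: all examples perfectly classified
        return classes[0]
    if not attributes:                              # no attributes left: fall back to majority class
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))   # steps 1-2
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:     # steps 3-4: one branch per attribute value
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree
```

Called as id3(table, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis") on the training examples from page 3, this sketch should reproduce the tree built on pages 19-22.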

Page 13

Which Attribute is ”best”?

S = [29+, 35−]

A1 = ?   True -> [21+, 5−]    False -> [8+, 30−]
A2 = ?   True -> [18+, 33−]   False -> [11+, 2−]

Page 14

Entropy

• S is a sample of training examples
• p+ is the proportion of positive examples in S
• p− is the proportion of negative examples in S
• Entropy measures the impurity of S:

Entropy(S) = −p+ log2 p+ − p− log2 p−
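A minimal sketch of this two-class entropy, using the usual convention 0·log2 0 = 0; the function name and the (pos, neg) count interface are illustrative.

```python
import math

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                        # 0 log 0 is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))    # ~0.940: the PlayTennis sample S = [9+, 5-] used on later slides
print(entropy(7, 7))    # 1.0: maximally impure (even split)
print(entropy(14, 0))   # 0.0: pure sample
```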

Page 15

Entropy

Entropy(S) = the expected number of bits needed to encode the class (+ or −) of a randomly drawn member of S (under the optimal, shortest-length code)

Why? Information theory: an optimal-length code assigns −log2 p bits to a message having probability p. So the expected number of bits needed to encode the class (+ or −) of a random member of S is:

−p+ log2 p+ − p− log2 p−

Page 16

Information Gain

Gain(S,A): expected reduction in entropy due to sorting S on attribute A

S = [29+, 35−]

A1 = ?   True -> [21+, 5−]    False -> [8+, 30−]
A2 = ?   True -> [18+, 33−]   False -> [11+, 2−]

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)

Entropy([29+, 35−]) = −(29/64) log2(29/64) − (35/64) log2(35/64) = 0.99
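A sketch of Gain(S, A) for the two-class case, taking the (positive, negative) counts shown in the figure; entropy2 repeats the two-class entropy helper, and all names are illustrative.

```python
import math

def entropy2(pos, neg):
    """Two-class entropy with 0 log 0 taken as 0."""
    total = pos + neg
    return sum(-c / total * math.log2(c / total) for c in (pos, neg) if c)

def information_gain(parent, children):
    """Gain(S, A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    total = sum(parent)
    return entropy2(*parent) - sum((p + n) / total * entropy2(p, n) for p, n in children)

print(round(information_gain((29, 35), [(21, 5), (8, 30)]), 2))    # Gain(S, A1) = 0.27
print(round(information_gain((29, 35), [(18, 33), (11, 2)]), 2))   # Gain(S, A2) = 0.12
```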

Page 17

Information Gain

A1 = ?   True -> [21+, 5−]    False -> [8+, 30−]     (S = [29+, 35−])

Entropy([21+, 5−]) = 0.71
Entropy([8+, 30−]) = 0.74
Gain(S, A1) = Entropy(S) − (26/64)·Entropy([21+, 5−]) − (38/64)·Entropy([8+, 30−]) = 0.27

A2 = ?   True -> [18+, 33−]   False -> [11+, 2−]     (S = [29+, 35−])

Entropy([18+, 33−]) = 0.94
Entropy([11+, 2−]) = 0.62
Gain(S, A2) = Entropy(S) − (51/64)·Entropy([18+, 33−]) − (13/64)·Entropy([11+, 2−]) = 0.12

Page 18

Training Examples

Day  Outlook   Temp.  Humidity  Wind    PlayTennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Page 19

Selecting the Best Attribute

S = [9+, 5−], Entropy(S) = 0.940

Humidity:  High -> [3+, 4−], E = 0.985    Normal -> [6+, 1−], E = 0.592
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind:  Weak -> [6+, 2−], E = 0.811    Strong -> [3+, 3−], E = 1.0
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048

Page 20

Selecting the Best Attribute

S = [9+, 5−], Entropy(S) = 0.940

Outlook:  Sunny -> [2+, 3−], E = 0.971    Overcast -> [4+, 0−], E = 0.0    Rain -> [3+, 2−], E = 0.971
Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247

Page 21

ID3 Algorithm

Outlook    [D1, D2, …, D14]  [9+, 5−]
  Sunny    -> Ssunny = [D1, D2, D8, D9, D11]  [2+, 3−]  -> ? (grow further)
  Overcast -> [D3, D7, D12, D13]  [4+, 0−]  -> Yes
  Rain     -> [D4, D5, D6, D10, D14]  [3+, 2−]  -> ? (grow further)

Which attribute should be tested at the Sunny branch?
Gain(Ssunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Ssunny, Temp.)    = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
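As a check, the information_gain sketch from page 16 reproduces these numbers (the small differences come from the slide rounding Entropy(Ssunny) to 0.970):

```python
# (pos, neg) counts per attribute value within Ssunny = [2+, 3-]
print(round(information_gain((2, 3), [(0, 3), (2, 0)]), 3))          # Humidity:    0.971 (slide: 0.970)
print(round(information_gain((2, 3), [(0, 2), (1, 1), (1, 0)]), 3))  # Temperature: 0.571 (slide: 0.570)
print(round(information_gain((2, 3), [(1, 2), (1, 1)]), 3))          # Wind:        0.02  (slide: 0.019)
```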

Page 22

ID3 Algorithm

Outlook
  Sunny    -> Humidity
                High   -> No    [D1, D2, D8]
                Normal -> Yes   [D9, D11]
  Overcast -> Yes   [D3, D7, D12, D13]
  Rain     -> Wind
                Strong -> No    [D6, D14]
                Weak   -> Yes   [D4, D5, D10]

Page 23

Hypothesis Space Search ID3

(Figure: ID3 searches the space of decision trees from simple to complex, starting from the empty tree and greedily growing it by adding attribute tests such as A1, A2, A3, A4; the +/− labels mark the training examples sorted to each partial tree.)

Page 24

Hypothesis Space Search ID3

• The hypothesis space is complete: the target function is surely in it
• Outputs a single hypothesis
• No backtracking on selected attributes (greedy search): can get stuck in local minima (suboptimal splits)
• Statistically based search choices: robust to noisy data
• Inductive bias (search bias): prefer shorter trees; place high-information-gain attributes close to the root

Page 25

Inductive Bias in ID3

• H is the power set of instances X: unbiased?
• Preference for short trees, and for trees with high-information-gain attributes near the root
• The bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H
• Compare the bias with candidate elimination (C-E):
  • ID3: complete space, incomplete search -> bias comes from the search strategy
  • C-E: incomplete space, complete search -> bias comes from the expressive power of H

Page 26

Occam’s Razor

Why prefer short hypotheses?

Argument in favor:
• There are fewer short hypotheses than long hypotheses
• A short hypothesis that fits the data is unlikely to be a coincidence
• A long hypothesis that fits the data might be a coincidence

Argument against:
• There are many ways to define small sets of hypotheses, e.g. all trees with a prime number of nodes that use attributes beginning with "Z"
• What is so special about small sets based on the size of the hypothesis?

Page 27

Overfitting - definition

Consider the error of hypothesis h over
• the training data: error_train(h)
• the entire distribution D of the data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

error_train(h) < error_train(h')   and   error_D(h) > error_D(h')

Possible reasons: noise and small samples
• coincidental regularities are possible with small samples
• noisy data makes the learner grow a large tree h; a smaller h' that does not fit the noise is likely to work better

Page 28

Overfitting in Decision Trees

(Figure: error vs. number of nodes. Training-set error keeps decreasing as the tree grows, while validation-set error reaches its minimum at the optimal number of nodes and then rises again.)

Page 29

Avoid Overfitting

How can we avoid overfitting?
• Stop growing the tree when a data split is not statistically significant
• Grow the full tree, then post-prune (this has been found more successful)

Minimum description length (MDL): minimize size(tree) + size(misclassifications(tree))

Page 30

How to decide tree size?

What criterion to use?
• a separate test (validation) set to evaluate the effect of pruning
• use all the data, but apply a statistical test to estimate whether expanding or pruning a node is likely to produce an improvement
• use an explicit complexity measure (coding length of the data and the tree) and stop growing when it is minimized

Page 31

Training/validation sets

Split the available data into
• a training set: apply learning to this
• a validation set: evaluate the result (accuracy, impact of pruning); a 'safety check' against overfitting

Common strategy: use 2/3 of the data for training

Page 32

Reduced error pruning

Pruning a node: make an inner node a leaf node and assign it the most common class of the training examples sorted to it.

Procedure:
Split the data into a training set and a validation set
Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus the subtree below it)
2. Greedily remove the one that most improves validation-set accuracy

Produces the smallest version of the most accurate subtree.
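A compact sketch of this procedure, assuming an illustrative node format in which every inner node stores the majority class of the training examples sorted to it; this is one way to realize the pruning loop above, not the only one.

```python
import copy

# Hypothetical node format (not from the slides): a leaf is a class label; an inner node is
# {"attribute": name, "majority": most common training class here, "branches": {value: subtree}}.

def classify(node, instance):
    """Sort an instance down the tree; fall back to a node's majority class on unseen values."""
    while isinstance(node, dict):
        child = node["branches"].get(instance.get(node["attribute"]))
        if child is None:
            return node["majority"]
        node = child
    return node

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def inner_node_paths(node, path=()):
    """Yield the branch-value path to every inner node of the tree."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from inner_node_paths(child, path + (value,))

def prune_at(tree, path):
    """Return a copy of `tree` with the inner node at `path` replaced by its majority-class leaf."""
    tree = copy.deepcopy(tree)
    if not path:
        return tree["majority"]
    node = tree
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return tree

def reduced_error_prune(tree, validation, target):
    """Greedily prune the node that most improves validation accuracy; stop when nothing helps."""
    while isinstance(tree, dict):
        candidates = [prune_at(tree, path) for path in inner_node_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation, target))
        if accuracy(best, validation, target) < accuracy(tree, validation, target):
            return tree        # further pruning is harmful
        tree = best            # ties also prune, favouring the smaller tree
    return tree
```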

Page 33

Rule Post-Pruning

Procedure (C4.5 uses a variant):
• infer the decision tree as usual (allowing it to overfit)
• convert the tree to rules (one rule per path)
• prune each rule independently: remove preconditions if the result is more accurate
• sort the rules by estimated accuracy into the desired sequence of use
• apply the rules in this order when classifying

Page 34

Rule post-pruning…

Estimating the accuracy:
• use a separate validation set, or
• use the training data with pessimistic estimates (the training data is too favorable for the rules): compute the accuracy and its standard deviation, and take the lower bound of a given confidence interval (e.g. 95%) as the measure
  • very close to the observed accuracy for large sets
  • not statistically valid, but works well in practice

Page 35

Why convert to rules?

• Distinguishes the different contexts in which a node is used: a separate pruning decision for each path
• No difference between root and inner nodes: no bookkeeping on how to reorganize the tree if the root node is pruned
• Improves readability

Page 36

Converting a Tree to Rules

Outlook
  Sunny    -> Humidity (High -> No, Normal -> Yes)
  Overcast -> Yes
  Rain     -> Wind (Strong -> No, Weak -> Yes)

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
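One way to hold such rules in code is as (preconditions, class) pairs applied in order; the representation below is illustrative. Pruning a rule then simply means deleting an entry from its precondition dictionary and re-checking the estimated accuracy.

```python
# The five rules above as (preconditions, class) pairs.
RULES = [
    ({"Outlook": "Sunny", "Humidity": "High"},   "No"),
    ({"Outlook": "Sunny", "Humidity": "Normal"}, "Yes"),
    ({"Outlook": "Overcast"},                    "Yes"),
    ({"Outlook": "Rain", "Wind": "Strong"},      "No"),
    ({"Outlook": "Rain", "Wind": "Weak"},        "Yes"),
]

def classify_by_rules(rules, instance):
    """Apply the rules in order; the first rule whose preconditions all hold fires."""
    for conditions, label in rules:
        if all(instance.get(a) == v for a, v in conditions.items()):
            return label
    return None   # no rule fires (can happen after aggressive pruning)

print(classify_by_rules(RULES, {"Outlook": "Rain", "Wind": "Weak"}))   # Yes
```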

Page 37

Continuous values

Define a new discrete-valued attribute that partitions the continuous value into a discrete set of intervals: Ac = true iff A < c.
How to select the best threshold c? By information gain.

Example: sort the examples by the continuous value and identify the boundaries where the class changes.

Temperature   15°C  18°C  19°C  22°C  24°C  27°C
PlayTennis    No    No    Yes   Yes   Yes   No
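A small sketch of this threshold search: sort by the continuous value, place a candidate c midway between adjacent examples whose class differs, and keep the candidate with the highest information gain for the test value < c (the helper names are illustrative).

```python
import math

def entropy(labels):
    total = len(labels)
    return -sum(labels.count(c) / total * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Pick the class-change boundary whose split maximizes information gain."""
    pairs = sorted(zip(values, labels))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1) if pairs[i][1] != pairs[i + 1][1]]
    def gain(c):
        below = [l for v, l in pairs if v < c]
        above = [l for v, l in pairs if v >= c]
        return entropy(labels) - (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)
    return max(candidates, key=gain)

temp = [15, 18, 19, 22, 24, 27]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temp, play))   # 18.5: the No/Yes boundary between 18 and 19 wins here
```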

Page 38

Continuous values…

Fact: the value of c that maximizes information gain lies on such a boundary.

Evaluation: compute the gain for each candidate boundary.

Extensions:
• multiple intervals rather than a single threshold
• linear threshold units (LTUs) based on many attributes

Page 39

Alternative selection measures

The information gain measure favors attributes with many values: they separate the data into small subsets, giving high gain but poor prediction. E.g., imagine using Date=xx.yy.zz as an attribute: it perfectly splits the data into subsets of size 1.

Use gain ratio instead of information gain:
• penalize the gain with the split information, which is sensitive to how broadly and uniformly the attribute splits the data
• split information is the entropy of S with respect to the values of A (earlier we used the entropy of S with respect to the target values)

Page 40

Split information

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)

SplitInfo(S, A) = − Σ_{i=1..c} (|Si| / |S|) log2(|Si| / |S|)

where Si is the subset of S for which attribute A has the value vi.

This discourages the selection of attributes with many uniformly distributed values: for n uniform values SplitInfo = log2 n, while for a Boolean attribute it is 1.

It is better to apply a heuristic when selecting attributes: compute the Gain first, and compute the GainRatio only for attributes whose Gain is large enough (above average).
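A sketch of SplitInfo and GainRatio working on subset sizes (the helper names are illustrative), showing how a Date-like attribute that shatters 64 examples into singletons is penalized by a factor of log2 64 = 6.

```python
import math

def split_information(subset_sizes):
    """SplitInfo(S, A) = -sum |Si|/|S| log2(|Si|/|S|) over the partition induced by A."""
    total = sum(subset_sizes)
    return -sum(s / total * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

print(split_information([32, 32]))    # 1.0: a Boolean attribute splitting 64 examples evenly
print(split_information([1] * 64))    # 6.0: a Date-like attribute with 64 singleton subsets
print(gain_ratio(0.27, [26, 38]))     # the A1 example from page 17: 0.27 / 0.975 ≈ 0.277
```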

Page 41

Another alternative

Distance-based measure:
• define a metric between partitions of the data
• evaluate each attribute by the distance between the partition it creates and the perfect partition
• choose the attribute whose partition is closest to the perfect one

This measure has been shown not to be biased towards attributes with large value sets. Many alternatives like this exist in the literature.

Page 42

Missing values

Estimate the missing value from the other examples with a known value.

When computing Gain(S, A) and A(x) is unknown:
• assign the most common value of A in S
• or the most common value among examples with the same class c(x)
• or assign a probability to each value and distribute fractional counts of x down the corresponding branches

Similar techniques are used when classifying new instances with missing values.
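A sketch of the first two strategies: fill an unknown value with the most common known value, optionally restricted to examples of the same class. The fill_missing helper and the use of None as the "unknown" marker are illustrative choices.

```python
from collections import Counter

def fill_missing(examples, attribute, target=None):
    """Replace a missing (None) value of `attribute` by the most common known value,
    optionally restricted to examples with the same target class."""
    filled = []
    for ex in examples:
        if ex[attribute] is None:
            pool = examples if target is None else [e for e in examples if e[target] == ex[target]]
            known = [e[attribute] for e in pool if e[attribute] is not None]
            ex = {**ex, attribute: Counter(known).most_common(1)[0][0]}
        filled.append(ex)
    return filled

data = [{"Wind": "Weak", "Play": "Yes"}, {"Wind": "Weak", "Play": "Yes"},
        {"Wind": "Strong", "Play": "No"}, {"Wind": None, "Play": "Yes"}]
print(fill_missing(data, "Wind")[-1]["Wind"])                  # "Weak": most common value overall
print(fill_missing(data, "Wind", target="Play")[-1]["Wind"])   # "Weak": most common among Play=Yes
```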

Page 43

Attributes with differing costs

Measuring an attribute costs something:
• prefer cheap attributes if possible; use costly ones only if they give good gain
• introduce a cost term in the selection measure; there is no guarantee of finding the optimum, but it biases the search towards cheaper attributes
• e.g., replace Gain(S, A) by Gain²(S, A) / Cost(A)

Example applications:
• robot with a sonar: the time required to position it
• medical diagnosis: the cost of a laboratory test
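The Gain²(S, A)/Cost(A) measure above can be written directly; the tiny hypothetical example below only illustrates the design intent, namely that a higher cost can outweigh a slightly larger gain.

```python
def cost_adjusted_gain(gain, cost):
    """One proposed cost-sensitive measure: Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

# A cheap, moderately informative test can beat an expensive, slightly better one:
print(cost_adjusted_gain(0.25, cost=1.0))   # 0.0625
print(cost_adjusted_gain(0.30, cost=5.0))   # 0.018
```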

Page 44

Summary

• Decision-tree induction is a popular approach to classification that yields interpretable output hypotheses
• Practical learning method for discrete-valued functions
• ID3: a top-down, greedy algorithm
• Complete hypothesis space; preference bias: shorter trees are preferred
• Overfitting is an important issue: reduced-error pruning, rule post-pruning
• Techniques exist to deal with continuous attributes and missing attribute values