CpSc 810: Machine Learning Decision Tree Learning

Page 1: CpSc 810: Machine Learning Decision Tree Learning

CpSc 810: Machine Learning

Decision Tree Learning

Page 2: CpSc 810: Machine Learning Decision Tree Learning

2

Copy Right Notice

Most slides in this presentation are adopted from slides of text book and various sources. The Copyright belong to the original authors. Thanks!

Page 3: CpSc 810: Machine Learning Decision Tree Learning

3

Review

Concept learning as search through H

S and G boundaries characterize learner’s uncertainty

A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.

Inductive leaps possible only if learner is biased

Page 4: CpSc 810: Machine Learning Decision Tree Learning

4

Decision Tree Representation for PlayTennis

A Decision Tree for the concept PlayTennis:

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

Page 5: CpSc 810: Machine Learning Decision Tree Learning

5

Decision Trees

Decision tree representation:
Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.

A decision tree represents a disjunction of conjunctions of constraints on the attribute values of instances.
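For example, the PlayTennis tree on the previous slide corresponds to the expression:

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)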

Page 6: CpSc 810: Machine Learning Decision Tree Learning

6

Appropriate Problems for Decision Tree Learning

Instances are represented by discrete attribute-value pairs (though the basic algorithm was extended to real-valued attributes as well)

The target function has discrete output values (can have more than two possible output values --> classes)

Disjunctive hypothesis descriptions may be required

The training data may contain errors

The training data may contain missing attribute values

Page 7: CpSc 810: Machine Learning Decision Tree Learning

7

ID3: The Basic Decision Tree Learning Algorithm

Top-down induction of decision trees.
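The algorithm listing on this slide did not survive the transcription. As a rough reconstruction (not the slide's own pseudocode), here is a minimal Python sketch of top-down ID3; representing examples as dictionaries and the helper names entropy and information_gain are my own choices:

import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the target attribute over a list of example dicts."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from splitting on `attribute`."""
    total = len(examples)
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    """Top-down induction: grow the tree greedily, never backtracking."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:           # all examples share one class -> leaf
        return labels[0]
    if not attributes:                  # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree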

Page 8: CpSc 810: Machine Learning Decision Tree Learning

8

Which attribute is best?

Page 9: CpSc 810: Machine Learning Decision Tree Learning

9

Which attribute is best?

Choose the attribute that minimizes the disorder in the subtree rooted at a given node.

Disorder and Information are related as follows: the more disorderly a set, the more information is required to correctly guess an element of that set.

Page 10: CpSc 810: Machine Learning Decision Tree Learning

10

Information theory

Information: What is the best strategy for guessing a number from a finite set of possible numbers? That is, how many questions do you need to ask in order to know the answer (we are looking for the minimal number of questions)? Answer: log2(|S|) questions, where S is the set of numbers and |S| is its cardinality.

E.g.: 0 1 2 3 4 5 6 7 8 9 10
Q1: is it smaller than 5?
Q2: is it smaller than 2?

Page 11: CpSc 810: Machine Learning Decision Tree Learning

11

Information theory

Information theory: an optimal-length code assigns -log2(p) bits to a message having probability p. So the expected number of bits to encode the + or - label of a random member of S is:

- p+ log2(p+) - p- log2(p-)

Page 12: CpSc 810: Machine Learning Decision Tree Learning

12

Entropy

S is a set of training examples

p+ is the proportion of positive examples in S

p- is the proportion of negative examples in S

Entropy measures the impurity of S

Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).

Entropy(S) = - p+ log2(p+) - p- log2(p-)

Page 13: CpSc 810: Machine Learning Decision Tree Learning

13

Information Gain

Gain(S, A) = expected reduction in entropy due to sorting on A

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|S_v| / |S|) Entropy(S_v)

Page 14: CpSc 810: Machine Learning Decision Tree Learning

14

Training Examples

Entropy?
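The table of 14 training examples on this slide did not survive the transcription, but the class counts used on the following slides are 9 positive and 5 negative, which gives:

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940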

Page 15: CpSc 810: Machine Learning Decision Tree Learning

15

Which attribute is the best?

S: [9+, 5-], E = 0.94

Split on Humidity:
High: [3+, 4-], E = 0.985
Normal: [6+, 1-], E = 0.592

Gain(S, Humidity) = 0.94 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Page 16: CpSc 810: Machine Learning Decision Tree Learning

16

Which attribute is the best?

S: [9+, 5-], E = 0.94

Split on Wind:
Weak: [6+, 2-], E = 0.811
Strong: [3+, 3-], E = 1.0

Gain(S, Wind) = 0.94 - (8/14)*0.811 - (6/14)*1.0 = 0.048

Page 17: CpSc 810: Machine Learning Decision Tree Learning

17

Which attribute is the best?

S: [9+, 5-], E = 0.94

Split on Temperature:
Hot: [2+, 2-], E = 1.0
Mild: [4+, 2-], E = 0.918
Cool: [3+, 1-], E = 0.811

Gain(S, Temperature) = 0.94 - (4/14)*1.0 - (6/14)*0.918 - (4/14)*0.811 = 0.029

Page 18: CpSc 810: Machine Learning Decision Tree Learning

18

Which attribute is the best?

S: [9+, 5-], E = 0.94

Split on Outlook:
Sunny: [2+, 3-], E = 0.970
Overcast: [4+, 0-], E = 0
Rain: [3+, 2-], E = 0.970

Gain(S, Outlook) = 0.94 - (5/14)*0.970 - (4/14)*0.0 - (5/14)*0.970 = 0.246

Page 19: CpSc 810: Machine Learning Decision Tree Learning

19

Which attribute is the best?

Gain(S, Humidity) = 0.151

Gain(S, Wind) = 0.048

Gain(S, Temperature) = 0.029

Gain(S, Outlook) = 0.246

Outlook has the highest gain, so it is selected as the root attribute.

Page 20: CpSc 810: Machine Learning Decision Tree Learning

20

Selecting the Next attribute

S: [9+, 5-], E = 0.94

Split on Outlook:
Sunny: [2+, 3-], E = 0.970 -> ?
Overcast: [4+, 0-], E = 0
Rain: [3+, 2-], E = 0.970 -> ?

Which attribute should be tested here?

Page 21: CpSc 810: Machine Learning Decision Tree Learning

21

Selecting the Next attribute

S: [9+, 5-], E = 0.94

Split on Outlook:
Sunny: [2+, 3-], E = 0.970 -> ?
Overcast: [4+, 0-], E = 0
Rain: [3+, 2-], E = 0.970 -> ?

Gain(Ssunny, Humidity) = 0.97 - (3/5)*0.0 - (2/5)*0.0 = 0.97

Gain(Ssunny, Temperature) = 0.97 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.57

Gain(Ssunny, Wind) = 0.97 - (2/5)*1.0 - (3/5)*0.918 = 0.019

Page 22: CpSc 810: Machine Learning Decision Tree Learning

22

Hypothesis Space Search in ID3

Hypothesis Space: Set of possible decision trees.

The hypothesis space is the complete space of finite discrete-valued functions, so the target function is surely in there.

Search Method: Simple-to-Complex Hill-Climbing Search

Outputs only a single hypothesis (unlike the candidate-elimination method), so it is unable to explicitly represent all consistent hypotheses.
No backtracking, so it may converge to a locally optimal tree (local minima).

Page 23: CpSc 810: Machine Learning Decision Tree Learning

23

Hypothesis Space Search in ID3

Evaluation Function: Information Gain Measure

Batch Learning: ID3 uses all training examples at each step to make statistically based decisions (unlike the candidate-elimination method, which makes decisions incrementally).

As a result, the search is less sensitive to errors in individual training examples.

Page 24: CpSc 810: Machine Learning Decision Tree Learning

24

Inductive Bias in ID3

Note that H is the power set of the instance space X. Is ID3 therefore unbiased?

Not really:
Preference for short trees.
Preference for trees that place high-information-gain attributes near the root.

Restriction bias vs. preference bias:
ID3 searches a complete hypothesis space, but it searches that space incompletely (a preference, or search, bias).
Candidate-Elimination searches an incomplete hypothesis space, but it searches that space completely (a restriction, or language, bias).

Page 25: CpSc 810: Machine Learning Decision Tree Learning

25

Why Prefer Short Hypotheses?

Occam’s razor: Prefer the simplest hypothesis that fits the data [William Occam (Philosopher), circa 1320]

Scientists seem to do this: e.g., physicists seem to prefer simple explanations for the motions of the planets over more complex ones.

Argument: Since there are fewer short hypotheses than long ones, it is less likely that one will find a short hypothesis that coincidentally fits the training data.

Problem with this argument: it can be made about many other constraints. Why is the “short description” constraint more relevant than others?

Two learners that use different internal representations will arrive at different hypotheses, so what counts as "short" depends on the representation.

Nevertheless: Occam’s razor was shown experimentally to be a successful strategy.

Page 26: CpSc 810: Machine Learning Decision Tree Learning

26

Overfitting

Consider the error of hypothesis h over

Training data: error_train(h)

Entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

error_train(h) < error_train(h')

and

error_D(h) > error_D(h')
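As a purely hypothetical illustration (these numbers are not from the slides): if error_train(h) = 0.05 and error_D(h) = 0.25 while error_train(h') = 0.10 and error_D(h') = 0.15, then h fits the training data better than h' but generalizes worse, so h overfits.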

Page 27: CpSc 810: Machine Learning Decision Tree Learning

27

Overfitting in Decision Trees

Consider adding a noisy training example to the PlayTennis training set:

<Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=Strong, PlayTennis = No>

What effect does this have on the earlier tree?

Page 28: CpSc 810: Machine Learning Decision Tree Learning

28

Overfitting in Decision Trees

S: [9+, 5-], E = 0.94

Split on Outlook:
Sunny: [2+, 3-], E = 0.970 -> Humidity
  High: {D1, D2, D8} [3-]
  Normal: {D9, D11} [2+]
Overcast: [4+, 0-], E = 0
Rain: [3+, 2-], E = 0.970 -> Wind

The new noisy example adds a negative example under Sunny/Normal: that branch becomes {D9, D11, D15} [2+, 1-], so it is no longer pure and ID3 will grow the tree further just to fit the noise.

Page 29: CpSc 810: Machine Learning Decision Tree Learning

29

Overfitting in Decision Trees

Page 30: CpSc 810: Machine Learning Decision Tree Learning

30

Avoiding Overfitting

How can we avoid overfitting?
Stop growing the tree when the data split is not statistically significant.
Grow the full tree, then post-prune it.

How to select the "best" tree:
Measure performance over the training data.
Apply a statistical test to estimate whether pruning is likely to improve performance.
Measure performance over a separate validation data set.
Minimum Description Length (MDL): minimize size(tree) + size(misclassifications(tree)).

Page 31: CpSc 810: Machine Learning Decision Tree Learning

31

Reduced-Error Pruning

Split data into training and validation set

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
2. Greedily remove the one that most improves validation set accuracy.

Produces smallest version of most accurate tree

Drawback: what if data is limited?
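For concreteness, here is a minimal sketch only, not the slide's exact procedure: it assumes the nested-dictionary trees built by the earlier ID3 sketch, uses hypothetical helpers classify, accuracy, and majority_label, and prunes bottom-up (evaluating each subtree directly against the validation set whenever accuracy does not drop) rather than greedily picking the single best node per pass as the slide describes:

from collections import Counter

def classify(tree, example, default=None):
    """Walk a nested-dict tree (as built by the ID3 sketch) down to a leaf label."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        branches = tree[attribute]
        if example[attribute] not in branches:
            return default              # attribute value never seen during training
        tree = branches[example[attribute]]
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

def majority_label(examples, target):
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def reduced_error_prune(tree, train_examples, validation, target):
    """Prune children first, then replace this node by a majority-class leaf
    if that does not hurt accuracy on the validation set."""
    if not isinstance(tree, dict):
        return tree                     # already a leaf
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        subset = [e for e in train_examples if e[attribute] == value]
        if not subset:
            continue                    # no training examples reach this branch
        tree[attribute][value] = reduced_error_prune(subset and subtree, subset, validation, target)
    leaf = majority_label(train_examples, target)
    if accuracy(leaf, validation, target) >= accuracy(tree, validation, target):
        return leaf
    return tree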

Page 32: CpSc 810: Machine Learning Decision Tree Learning

32

Effect of Reduced-Error Pruning

Page 33: CpSc 810: Machine Learning Decision Tree Learning

33

Rule Post-Pruning

Most frequently used method

Page 34: CpSc 810: Machine Learning Decision Tree Learning

34

Converting a Tree to Rules

IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No

IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
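Assuming the same nested-dictionary tree representation used in the earlier sketches (an assumption, not the slide's notation), rule extraction is just an enumeration of root-to-leaf paths:

def tree_to_rules(tree, conditions=()):
    """Yield (conditions, classification) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):      # reached a leaf: emit one rule
        yield list(conditions), tree
        return
    attribute = next(iter(tree))
    for value, subtree in tree[attribute].items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

Applied to the PlayTennis tree, this yields rules like the two shown above.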

Page 35: CpSc 810: Machine Learning Decision Tree Learning

35

Why convert to rules?

Converting to rules allows distinguishing among the different contexts in which a decision node is used

Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves.

Converting to rules improves readability.

Page 36: CpSc 810: Machine Learning Decision Tree Learning

36

Incorporating Continuous-Valued Attributes

For a continuous-valued attribute A, create a new Boolean attribute A_c that is true if A < c and false otherwise.

Select the threshold c that produces the greatest information gain.

Example: two possible thresholds are c1 = (48+60)/2 = 54 and c2 = (80+90)/2 = 85.

Which one is better? (homework)
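A minimal sketch of the threshold search, assuming the examples arrive as parallel lists of attribute values and class labels (the slide's underlying table did not survive transcription, so no data is hard-coded here):

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_threshold(values, labels):
    """Try each midpoint between adjacent examples whose class changes,
    and return the threshold with the greatest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = None
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue                    # candidate thresholds lie at class boundaries
        c = (v1 + v2) / 2               # e.g. (48 + 60) / 2 = 54 on the slide
        below = [l for v, l in pairs if v < c]
        above = [l for v, l in pairs if v >= c]
        if not below or not above:
            continue
        gain = (base
                - (len(below) / len(pairs)) * entropy(below)
                - (len(above) / len(pairs)) * entropy(above))
        if best is None or gain > best[1]:
            best = (c, gain)
    return best                         # (threshold, gain)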

Page 37: CpSc 810: Machine Learning Decision Tree Learning

37

Alternative Measures for Selecting Attributes: GainRatio

Problem: information gain measure favors attributes with many values.

Example: adding a Date attribute to our example. It would have very high gain because it splits the training data into many small, pure subsets, yet it is useless for classifying new instances.

One solution: GainRatio, which penalizes the Gain by the Split Information of the attribute:

SplitInformation(S, A) = - Σ (i = 1..c) (|S_i| / |S|) log2(|S_i| / |S|)

where S_i is the subset of S for which A has value v_i, and

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
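A minimal sketch under the same dictionary-of-examples assumption as the earlier ID3 sketch (the parameter information_gain stands in for the gain function defined there):

import math
from collections import Counter

def split_information(examples, attribute):
    """-sum_i (|S_i|/|S|) log2(|S_i|/|S|), grouping the examples by their value of `attribute`."""
    total = len(examples)
    counts = Counter(e[attribute] for e in examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(examples, attribute, target, information_gain):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return information_gain(examples, attribute, target) / split_information(examples, attribute)

Note that SplitInformation is zero when every example shares one value of A, so practical implementations guard against dividing by zero.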

Page 38: CpSc 810: Machine Learning Decision Tree Learning

38

Attributes with Differing Costs

In some learning tasks the instance attributes may have associated costs. How can we learn a consistent tree with low expected cost?

Approaches:

Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)

Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] determines the importance of cost.
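As a small illustrative sketch (the function names are mine; gain and cost are assumed to already be computed for an attribute):

def tan_schlimmer(gain, cost):
    """Gain^2(S, A) / Cost(A)  (Tan and Schlimmer, 1990)."""
    return gain ** 2 / cost

def nunez(gain, cost, w):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]  (Nunez, 1988)."""
    return (2 ** gain - 1) / (cost + 1) ** w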

Page 39: CpSc 810: Machine Learning Decision Tree Learning

39

Training Examples with Missing Attribute Values

If some examples are missing the value of attribute A, use the training examples anyway. Two strategies:

Assign the most common value of A among the other examples with the same target value.

Assign probability p_i to each possible value v_i of A, then assign fraction p_i of the example to each descendant in the tree.

Classify new examples in the same fashion.
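A minimal sketch of the first strategy only, assuming the dictionary-of-examples convention from the earlier sketches and using None to mark a missing value (both are my assumptions, not the slide's):

from collections import Counter

def fill_missing(examples, attribute, target):
    """Replace a missing value of `attribute` (here, None) with the most common
    value among the other examples that have the same target value."""
    filled = []
    for e in examples:
        if e[attribute] is None:
            peers = [x[attribute] for x in examples
                     if x[target] == e[target] and x[attribute] is not None]
            e = {**e, attribute: Counter(peers).most_common(1)[0][0]}
        filled.append(e)
    return filled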

Page 40: CpSc 810: Machine Learning Decision Tree Learning

40

Summary Points

Decision tree learning provides a practical method for concept learning and for learning other discrete-valued functions

ID3 searches a complete hypothesis space, so the target function is always present in the hypothesis space.

The inductive bias implicit in ID3 includes a preference for smaller trees.

Overfitting the training data is an important issue in decision tree learning.

A large variety of extensions to the basic ID3 algorithm has been developed.