Chapter 3: Decision Tree Learning


Page 1:

Chapter 3: Decision Tree Learning

Page 2:

Decision Tree Learning
Introduction
Decision Tree Representation
Appropriate Problems for Decision Tree Learning
Basic Algorithm
Hypothesis Space Search in Decision Tree Learning
Inductive Bias in Decision Tree Learning
Issues in Decision Tree Learning
Summary

Page 3:

Introduction

A method for approximating discrete-valued target functions
The learned tree is easily converted into if-then rules
Representative systems: ID3, ASSISTANT, C4.5
Searches a completely expressive hypothesis space, with a preference bias toward smaller trees

Page 4:

Decision Tree Representation

An instance is classified by sorting it from the root down to a leaf
Each node tests an attribute; each branch corresponds to one of the attribute's values
A learned tree represents a disjunction of conjunctions of constraints on the attribute values of instances
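A minimal Python sketch (not from the slides) of this representation: a tree as nested dicts, where each internal node tests one attribute and each branch is one of its values. The encoding and the `classify` helper are illustrative choices; the tree shown is the standard PlayTennis tree for the example data used later in the chapter.

```python
# A decision tree as nested dicts: {attribute: {value: subtree-or-leaf}}.
# Leaves are class labels; internal nodes test a single attribute.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def classify(tree, instance):
    """Sort an instance from the root down to a leaf."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                  # attribute tested at this node
        tree = tree[attribute][instance[attribute]]   # follow the matching branch
    return tree

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High"}))  # -> No
```

Read as a disjunction of conjunctions, this tree predicts PlayTennis = Yes exactly when (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak).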

Page 5:
Page 6:

Appropriate Problems for Decision Tree Learning

Instances are represented by attribute-value pairs
The target function has discrete output values
Disjunctive descriptions may be required
The training data may contain errors
The training data may contain missing attribute values

Page 7:

Basic Algorithm

A top-down, greedy search through the space of all possible decision trees
The attribute that best classifies the training examples is placed at the root
Entropy and information gain guide the choice of attribute
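A minimal sketch of this loop in Python (my illustration, not the slides' pseudocode), assuming examples are dicts and trees are the nested dicts above. `choose_best` stands in for the information-gain selection defined on the following slides; full ID3 also handles empty branches by using the parent node's majority class.

```python
from collections import Counter

def id3(examples, target, attributes, choose_best):
    """Skeleton of ID3's top-down greedy tree construction.
    `choose_best(examples, attributes)` returns the attribute that best
    classifies the examples -- in ID3, the information-gain argmax."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:              # all examples in one class: leaf
        return labels[0]
    if not attributes:                     # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = choose_best(examples, attributes)   # greedy step, never revisited
    subtree = {}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        subtree[value] = id3(subset, target, remaining, choose_best)
    return {best: subtree}
```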

Page 8:
Page 9:

Entropy

Minimum number of bits of information needed to encode the classification of an arbitrary member of S

Entropy = 0 if all members belong to the same class
Entropy = 1 if |positive examples| = |negative examples|

Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i

Page 10:

[Figure: Entropy(S) for a boolean classification, plotted against the proportion p_+ of positive examples (0.0 to 1.0); entropy peaks at 1.0 when p_+ = 0.5.]

Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-

Entropy([9+, 5-]) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940
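A quick check of this computation in Python, a minimal sketch assuming a two-class sample given as positive/negative counts:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # treat 0 * log2(0) as 0
            p = count / total
            result -= p * log2(p)
    return result

print(entropy(9, 5))   # ~0.940, matching the [9+, 5-] computation above
```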

Page 11:

Information Gain

Expected reduction in entropy caused by partitioning the examples according to attribute A
That is, the degree to which entropy shrinks once the value of attribute A is known

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} (|S_v| / |S|) \, Entropy(S_v)

Page 12:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Page 13:

Values(Wind) = {Weak, Strong}
S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]

Gain(S, Wind) = Entropy(S) - \sum_{v \in \{Weak, Strong\}} (|S_v| / |S|) \, Entropy(S_v)
             = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
             = 0.940 - (8/14)(0.811) - (6/14)(1.00)
             = 0.048
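The same computation as a Python sketch (mine, not from the slides), reproducing Gain(S, Wind) from the Wind and PlayTennis columns of the table above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, target):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    gain = entropy([ex[target] for ex in examples])
    for v in {ex[attr] for ex in examples}:
        sv = [ex[target] for ex in examples if ex[attr] == v]
        gain -= len(sv) / len(examples) * entropy(sv)
    return gain

# Wind and PlayTennis columns of the table above (D1..D14):
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
examples = [{"Wind": w, "PlayTennis": p} for w, p in zip(wind, play)]
print(round(information_gain(examples, "Wind", "PlayTennis"), 3))  # 0.048
```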

Page 14:

Which Attribute is the Best Classifier? (1)

Humidity splits S: [9+, 5-] (E = 0.940) into:
  High   → [3+, 4-], E = 0.985
  Normal → [6+, 1-], E = 0.592

Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151

Page 15:

Which Attribute is the Best Classifier? (2)

Wind splits S: [9+, 5-] (E = 0.940) into:
  Weak   → [6+, 2-], E = 0.811
  Strong → [3+, 3-], E = 1.000

Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.000) = 0.048

Classifying examples by Humidity provides more information gain than by Wind.

Page 16:
Page 17:

Hypothesis Space Search in Decision Tree Learning (1)

Searches for a single hypothesis that fits the training examples
ID3's hypothesis space: the set of possible decision trees
Simple-to-complex, hill-climbing search, with information gain as the hill-climbing guide

Page 18:
Page 19:

Hypothesis Space Search in Decision Tree Learning (2)

Complete space of finite discrete-valued functions

Maintains only a single current hypothesis
No backtracking
At each step of the search all training examples are considered, so decisions are statistically based

Page 20:

Inductive Bias (1) - Case ID3

Which of the decision trees consistent with the examples should be selected?
Shorter trees are preferred over larger trees
Trees that place high-information-gain attributes close to the root are preferred

Page 21:

Inductive Bias (2)

                   ID3                                   CANDIDATE-ELIMINATION
Hypothesis space   Complete                              Incomplete
Search strategy    Incomplete                            Complete
Inductive bias     From the ordering of hypotheses       From the expressive power of its
                   by its search strategy                hypothesis representation
                   (search strategy → preference bias)   (search space → restriction bias)

Page 22:

Inductive Bias (3)

Occam's razor: prefer the simplest hypothesis that fits the data
Major difficulty: the size of a hypothesis can vary widely with the learner's internal representation

Page 23:

Issues in Decision Tree Learning

How deeply to grow the decision tree
Handling continuous attributes
Choosing an appropriate attribute selection measure
Handling missing attribute values

Page 24:

Avoiding Overfitting the Data (1)

Should the tree be grown until it classifies the training examples perfectly? This is problematic when:
1. the data contain noise
2. the number of training examples is small

Overfitting: for two hypotheses h and h' fit to the training data,
  error of h < error of h' on the training examples, but
  error of h > error of h' over the entire distribution of instances

Page 25:
Page 26:

Avoiding Overfitting the Data (2)

Remedies:
1. Split the examples into a training set and a validation set
2. Use all the data for training, but apply a statistical test to estimate whether pruning a particular node is likely to improve performance
3. Use an explicit measure of the complexity of encoding the training examples and the decision tree (Chapter 6)

Approach 1 is the training and validation set approach: the validation set measures the effect of pruning the hypothesis

Page 27:

Reduced Error Pruning

A node is removed when, on the validation set, the pruned tree performs no worse than the original tree
Leaf nodes added through coincidental regularities in the training set are likely to be pruned, because the same coincidences are unlikely to recur in the validation set
The data are divided into a training set, a validation set, and a test set
Drawback: problematic when the amount of data is small
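A rough Python sketch of this idea (mine, not the slides') over the nested-dict trees used earlier. It is a simplified one-pass, bottom-up variant; the procedure described above instead repeatedly removes whichever single node most improves validation-set accuracy. It assumes the tree was grown from `train` and that missing branches do not occur.

```python
from collections import Counter

def classify(tree, instance):          # same helper as in the representation sketch
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][instance[attr]]
    return tree

def accuracy(tree, examples, target):
    if not examples:
        return 1.0                     # no validation evidence against this node
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def reduced_error_prune(tree, train, valid, target):
    """Bottom-up pass: collapse a node to the majority class of the training
    examples reaching it whenever that is no worse on the validation examples."""
    if not isinstance(tree, dict) or not train:
        return tree
    attr, branches = next(iter(tree.items()))
    tree = {attr: {value: reduced_error_prune(
                       subtree,
                       [ex for ex in train if ex[attr] == value],
                       [ex for ex in valid if ex[attr] == value],
                       target)
                   for value, subtree in branches.items()}}
    leaf = Counter(ex[target] for ex in train).most_common(1)[0][0]
    if accuracy(leaf, valid, target) >= accuracy(tree, valid, target):
        return leaf                    # pruning does not hurt: keep the leaf
    return tree
```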

Page 28:
Page 29:

Rule Post-Pruning (1)

1. Build the decision tree (overfitting is allowed)
2. Convert the tree into rules, one per root-to-leaf path
3. Prune each rule by removing any precondition whose removal improves the rule's estimated accuracy
4. Sort the rules by estimated accuracy, and apply them in that order when classifying subsequent instances

IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
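A sketch of step 2 in Python (the helper name `tree_to_rules` is mine), using the PlayTennis tree from the representation sketch earlier:

```python
tree = {  # the PlayTennis tree from the representation sketch
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def tree_to_rules(tree, conditions=()):
    """One IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        return [(conditions, tree)]            # leaf: the rule is complete
    attr, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

for conds, label in tree_to_rules(tree):
    body = " AND ".join(f"({a} = {v})" for a, v in conds)
    print(f"IF {body} THEN PlayTennis = {label}")
# Among the five printed rules is the example above:
# IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
```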

Page 30:

Rule Post-Pruning (2)

Why convert the decision tree to rules before pruning?
The distinct contexts in which a decision node is used can be treated separately
There is no need to distinguish attribute tests near the root from those near the leaves

Page 31:

Incorporating Continuous-Valued Attributes

Choose the threshold that maximizes information gain:
Sort the examples by the attribute's value
Find adjacent pairs whose target classification differs
Take the midpoint of each such pair as a candidate threshold
Select the candidate that maximizes information gain

Temperature: 40   48   60   72   80   90
PlayTennis:  No   No   Yes  Yes  Yes  No
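A minimal sketch of the candidate-threshold step in Python (my illustration), applied to the Temperature example; the surviving candidates would then be compared by information gain:

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent (sorted) values where the label changes."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, play))   # [54.0, 85.0]
```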

Page 32:

Alternative Measures for Selecting Attributes (1)

The information gain measure favors attributes with many values.

Extreme example: the attribute Date (e.g., March 4, 1979)
It classifies the training data perfectly with respect to the target attribute
But it is a poor predictor on unseen instances

Page 33:

Alternative Measures for Selecting Attributes (2)

The entropy of S with respect to the values of attribute A, where S_1, ..., S_c are the subsets of S induced by the c values of A:

SplitInformation(S, A) = -\sum_{i=1}^{c} (|S_i| / |S|) \log_2 (|S_i| / |S|)

If n values of an attribute separate n examples perfectly:
SplitInformation = -n \cdot (1/n) \log_2(1/n) = \log_2 n

If two values split the examples perfectly into two halves:
SplitInformation = -\{(1/2) \log_2(1/2) + (1/2) \log_2(1/2)\} = 1

Page 34:

Alternative Measures for Selecting Attributes (3)

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
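A small Python sketch (mine) of SplitInformation and GainRatio, checking the two extremes worked out on the previous slide:

```python
from math import log2

def split_information(sizes):
    """SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|),
    where `sizes` are the sizes of the partitions S_i induced by A."""
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s)

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)

print(split_information([1] * 14))   # log2(14) ~= 3.807 for a Date-like attribute
print(split_information([7, 7]))     # 1.0 for a clean two-way split
```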

Page 35:

Handling Training Examples with Missing Attribute Values

Assign the missing value the most common value of the attribute among the examples at node n that share the classification c(x)
Alternatively, assign a probability to each possible value of attribute A, estimated from the observed frequencies of A's values among the examples at node n
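A sketch of the first strategy in Python (the helper name `fill_missing` is mine), assuming missing values are stored as None and that at least one same-class example at the node has the value observed:

```python
from collections import Counter

def fill_missing(examples, attr, target):
    """Replace a missing value of `attr` with the most common observed value
    among examples at this node that share the example's classification."""
    filled = []
    for ex in examples:
        if ex[attr] is None:
            peers = [e[attr] for e in examples
                     if e[attr] is not None and e[target] == ex[target]]
            ex = {**ex, attr: Counter(peers).most_common(1)[0][0]}
        filled.append(ex)
    return filled
```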

Page 36:

Handling Attributes with Differing Costs

Two cost-sensitive selection measures:

Gain^2(S, A) / Cost(A)

(2^{Gain(S, A)} - 1) / (Cost(A) + 1)^w,  where w \in [0, 1]
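The two measures as Python one-liners (function names are mine; Mitchell's text attributes the first to Tan and Schlimmer and the second to Nunez):

```python
def gain_over_cost(gain, cost):
    """Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def cost_weighted_gain(gain, cost, w=0.5):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]
    controlling how strongly cost is weighted."""
    return (2 ** gain - 1) / (cost + 1) ** w
```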

Page 37:

Summary

The ID3 family grows trees downward from the root, greedily searching for the next best attribute
Complete hypothesis space
Preference for smaller trees
Overfitting is avoided by post-pruning