
Page 1

K236: Basis of Data Analytics
Lecture 7: Classification and prediction

Decision tree induction

Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

Page 2

Schedule of K236

1. Introduction to data science (1) データ科学入門 6/9

2. Introduction to data science (2) データ科学入門 6/13

3. Data and databases データとデータベース 6/16

4. Review of univariate statistics 単変量統計 6/20

5. Review of linear algebra 線形代数 6/23

6. Data mining software データマイニングソフトウェア 6/27

7. Data preprocessing データ前処理 6/30

8. Classification and prediction (1) 分類と予測 (1) 7/4

9. Knowledge evaluation 知識評価 7/7

10. Classification and prediction (2) 分類と予測 (2) 7/11

11. Classification and prediction (3) 分類と予測 (3) 7/14

12. Mining association rules (1) 相関ルールの解析 7/18

13. Mining association rules (2) 相関ルールの解析 7/21

14. Cluster analysis クラスター解析 7/25

15. Review and Examination レビューと試験 (the date is not fixed) 7/27

Page 3

Data schemas vs. mining methods データ・スキーマ vs. 学習手法

Types of data

§ Flat data tables 表形式データ

§ Relational databases 関係DB

§ Temporal & spatial data 時空間データ

§ Transactional databases 取引データ

§ Multimedia data マルチメディアデータ

§ Genome databases ゲノムデータ

§ Materials science data 材料データ

§ Textual data テキストデータ

§ Web data ウェブデータ

§ etc.

Mining tasks and methods マイニングの課題と手法

§ Classification/Prediction 分類/予測

q Decision trees 決定木

q Bayesian classification ベイジアン分類

q Neural networks 神経回路網

q Rule induction ルール帰納法

q Support vector machines SVM

q Hidden Markov model 隠れマルコフ

q etc.

§ Description 記述

q Association analysis 相関分析

q Clustering クラスタリング

q Summarization 要約

q etc.

Page 4

Outline

1. Issues Regarding Classification and Prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Page 5

Classification and prediction

[Figure: eight cell images, healthy (H1-H4) and cancerous (C1-C4), shown once without labels (unsupervised data) and once with labels (supervised data).]

The labeled data:

     color   #nuclei   #tails   class/label
H1   light   1         1        healthy
H2   dark    1         1        healthy
H3   light   1         2        healthy
H4   light   2         1        healthy
C1   dark    1         2        cancerous
C2   dark    2         1        cancerous
C3   light   2         2        cancerous
C4   dark    2         2        cancerous

Given: $(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)$
- $\boldsymbol{x}_i$ is a description of an object, phenomenon, etc.
- $y_i$ (the label attribute) is some property of $\boldsymbol{x}_i$; if it is not available, the learning is unsupervised

Find: a function $f(\boldsymbol{x})$ that characterizes $\{\boldsymbol{x}_i\}$ or such that $f(\boldsymbol{x}_i) = y_i$

The problem is usually called classification if the label is categorical, and prediction if the label is continuous (in this case, if the descriptive attributes are numerical, the problem is regression).

Page 6

Classification—a two-step process

• Model construction: describing a set of predetermined classes
q Each tuple/object is assumed to belong to a predefined class, as determined by the class label attribute
q The set of tuples used for model construction: training set
q The model is represented as classification rules, decision trees, or mathematical formulae (classifiers)

• Model usage: for classifying future or unknown objects. Estimate the accuracy of the model:
q The known label of a test object is compared with the classified result from the model
q Accuracy rate is the percentage of test set objects that are correctly classified by the model
q The test set is independent of the training set; otherwise over-fitting will occur
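To make the two steps concrete, here is a minimal Python sketch, assuming scikit-learn is available (the library, the split ratio, and the 0 = light / 1 = dark encoding are illustrative assumptions, not part of the lecture):

# Minimal sketch of the two-step process on the cell data above.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],   # H1-H4: color, #nuclei, #tails
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]   # C1-C4
y = ["healthy"] * 4 + ["cancerous"] * 4

# Step 1, model construction: learn a classifier from the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2, model usage: accuracy is estimated on the independent test set.
print(accuracy_score(y_test, model.predict(X_test)))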

Page 7

Classification—a two-step process

[Figure: model construction: a classification algorithm is applied to the training data (H1-H4, C1, C2) to build a classifier (model), e.g. the rule "IF color = dark AND #tails = 2 THEN cancerous cell"; model usage: the classifier is applied to an unknown object to answer "Cancerous?" with "Cancerous".]

Page 8

Criteria for classification methods

• Predictive accuracy (予測精度): the ability of the classifier to correctly predict unseen data

• Speed: refers to computation cost

• Robustness (頑健性): the ability of the classifier to make correct predictions given noisy data or data with missing values

• Scalability (拡張性): the ability to construct the classifier efficiently given large amounts of data

• Interpretability (解釈容易性): the level of understanding and insight that is provided by the classifier

Page 9

Machine learning: View by nature of methods

The five tribes of machine learning (Pedro Domingos):

Tribes           Origins                Master Algorithm
Symbolists       Logic, philosophy      Inverse deduction
Evolutionaries   Evolutionary biology   Genetic programming
Connectionists   Neuroscience           Backpropagation
Bayesians        Statistics             Probabilistic inference
Analogizers      Psychology             Kernel machines

Page 10

Symbolists

Tom Mitchell, Steve Muggleton, Ross Quinlan

Page 11

Classification with decision trees

[Figure: the eight cell images H1-H4 and C1-C4 beside the decision tree built from them.]

#nuclei?
├─ 1: color?
│   ├─ light → H
│   └─ dark: #tails?
│       ├─ 1 → H
│       └─ 2 → C
└─ 2: #tails?
    ├─ 1: color?
    │   ├─ light → H
    │   └─ dark → C
    └─ 2 → C

K236, L7

Page 12

Analogizers

Peter Hart, Vladimir Vapnik, Douglas Hofstadter

Page 13

Kernel methods: the basic ideas

[Figure: points x_1, x_2, ..., x_n in the input space X are mapped by f to f(x_1), f(x_2), ..., f(x_n) in the feature space F (with inverse map f^-1); the kernel function k: X × X → R, k(x_i, x_j) = f(x_i)·f(x_j), yields the n×n kernel matrix K, and the kernel-based algorithm runs on K (computation done on the kernel matrix).]

K619
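A minimal sketch of this idea in Python (numpy assumed; the linear kernel stands in for any valid kernel function k):

import numpy as np

def kernel_matrix(X, k):
    # n x n kernel (Gram) matrix with K[i, j] = k(x_i, x_j).
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

def linear_kernel(a, b):
    # k(x_i, x_j) = f(x_i) . f(x_j), here with f the identity map.
    return float(np.dot(a, b))

X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]])
K = kernel_matrix(X, linear_kernel)   # the kernel-based algorithm then works on K only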

Page 14

Connectionists

Yann LeCun, Geoff Hinton, Yoshua Bengio

Page 15

Classification with neural networks

[Figure: a neural network whose inputs "color = dark", "#nuclei = 1", and "#tails = 2" are computed from the cell images H1-H4, C1-C4, and whose outputs are "Healthy" / "Cancerous".]

K236, L9

Page 16

Deep learning

K619

Page 17

Bayesians in machine learning

David Heckerman, Judea Pearl, Michael Jordan

K236,  L8

Page 18

Probabilistic graphical models: instances of graphical models

[Figure: a taxonomy in which probabilistic models and graphical models overlap; graphical models divide into directed (Bayes nets, with DBNs, the hidden Markov model (HMM), the naïve Bayes classifier, mixture models, the Kalman filter model, and LDA) and undirected (MRFs, with conditional random fields and MaxEnt). From Murphy, ML for life sciences.]

K619

Page 19

Outline

1. Issues Regarding Classification and Prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Page 20

Mining with decision trees 決定木でのマイニング

A decision tree is a flow-chart-like tree structure: フローチャートのような木構造
§ each internal node denotes a test on an attribute 属性の値を判定するのが中間にある節
§ each branch represents an outcome of the test 値を判定して各枝へ分岐
§ leaf nodes represent classes or class distributions 末端(葉)はクラス/分布
§ the top-most node in a tree is the root node 木構造の頂点は根

{H1, H2, H3, H4, C1, C2, C3, C4}
#nuclei?
├─ 1: {H1, H2, H3, C1} → color?
│   ├─ light: {H1, H3} → H
│   └─ dark: {H2, C1} → #tails?
│       ├─ 1: {H2} → H
│       └─ 2: {C1} → C
└─ 2: {H4, C2, C3, C4} → #tails?
    ├─ 1: {H4, C2} → color?
    │   ├─ light: {H4} → H
    │   └─ dark: {C2} → C
    └─ 2: {C3, C4} → C

Page 21

Decision tree induction (DTI)

§ Decision tree generation consists of two phases
q Tree construction (決定木構築)
§ Partition examples recursively based on selected attributes
§ At start, all the training objects are at the root
q Tree pruning (構築した木の枝刈)
§ Identify and remove branches that reflect noise or outliers

§ Use of decision trees: classify unknown objects (新事例の分類)
q Test the attribute values of the object against the decision tree

Page 22

Tree construction: a general algorithm 木構造を構築する一般的なアルゴリズム

1. At each node, choose the "best" attribute by a given measure for attribute selection 各節では事前に指定した選択基準に対し、最良の属性を選ぶ

2. Extend the tree by adding a new branch for each value of the attribute その属性の値ごとに枝を追加して木を拡張

3. Sort the training examples to the leaf nodes 末端に訓練データを並べ替える

4. If the examples in a node belong to one class, then stop; else repeat steps 1-4 for the leaf nodes ある節のデータが同一クラスだけなら停止、混じっていれば1から繰返す

5. Prune the tree to avoid over-fitting 枝刈をして過学習を防ぐ

Two steps: recursively generate the tree (steps 1-4) (順次、属性を選んでデータを分割), and prune the tree (step 5) (構築した木の枝刈), as sketched below.
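A minimal Python sketch of steps 1-4 (pruning, step 5, is omitted); it assumes objects are dicts of attribute values with a "class" key, and that `measure` is any attribute-selection measure, such as the information gain defined later in this lecture:

from collections import Counter

def build_tree(objects, attributes, measure):
    classes = [o["class"] for o in objects]
    if len(set(classes)) == 1 or not attributes:       # step 4: stop on a pure node
        return Counter(classes).most_common(1)[0][0]   # leaf = (majority) class
    best = max(attributes, key=lambda a: measure(objects, a))   # step 1: best attribute
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree([o for o in objects if o[best] == v],  # step 3
                                 rest, measure)
                   for v in {o[best] for o in objects}}}    # step 2: one branch per value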

[Figure: the decision tree for the cell data from the previous slide, grown by this algorithm.]

Page 23

Training data for concept "play-tennis"

• A typical dataset in machine learning
• 14 objects belonging to two classes {Y, N} are observed on 4 properties.
• Dom(Outlook) = {sunny, overcast, rain}
• Dom(Temperature) = {hot, mild, cool}
• Dom(Humidity) = {high, normal}
• Dom(Wind) = {weak, strong}
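The table itself is an image in the original slide; the following is the standard play-tennis dataset from Mitchell's Machine Learning (1997), which matches the class counts and subsets used on the following slides:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   sunny     hot          high      weak    no
D2   sunny     hot          high      strong  no
D3   overcast  hot          high      weak    yes
D4   rain      mild         high      weak    yes
D5   rain      cool         normal    weak    yes
D6   rain      cool         normal    strong  no
D7   overcast  cool         normal    strong  yes
D8   sunny     mild         high      weak    no
D9   sunny     cool         normal    weak    yes
D10  rain      mild         normal    weak    yes
D11  sunny     mild         normal    strong  yes
D12  overcast  mild         high      strong  yes
D13  overcast  hot          normal    weak    yes
D14  rain      mild         high      strong  no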

Page 24

A decision tree for playing tennis テニスに関する決定木の一例

[Figure: a decision tree that tests temperature at the root (cool, hot, mild) and then outlook, wind, and humidity at lower levels; it splits D1-D14 into many small leaves labeled yes/no (one branch even ends in a null leaf), classifying the training data at the cost of considerable complexity.]

Page 25

A simple decision tree for playing tennis テニスに関する簡潔な決定木

outlook?
├─ sunny: {D1, D2, D8, D9, D11} → humidity?
│   ├─ high: {D1, D2, D8} → no
│   └─ normal: {D9, D11} → yes
├─ o'cast: {D3, D7, D12, D13} → yes
└─ rain: {D4, D5, D6, D10, D14} → wind?
    ├─ true (strong): {D6, D14} → no
    └─ false (weak): {D4, D5, D10} → yes

This tree is much simpler, as "outlook" is selected at the root. How to select a good attribute to split a decision node? 最初の属性として"outlook"を選択することで決定木がかなり簡潔になる。分割条件として適切な属性をどのように選ぶのか?

Page 26

Which attribute is the best? 最良の属性は?

• The "playing-tennis" set S contains 9 positive objects (+) and 5 negative objects (-), denoted by [9+, 5-]. テニスデータ(テニスする(+)9件、しない(-)5件)のクラス分布[9+, 5-]

• If attributes "humidity" and "wind" split S into sub-nodes with the proportions of positive and negative objects below, which attribute is better? データを"humidity"で分割する場合と"wind"で分割する場合とでは、クラスの分布はどちらがよいか?

A1 = humidity: [9+, 5-] → normal [6+, 1-], high [3+, 4-]
A2 = wind: [9+, 5-] → weak [6+, 2-], strong [3+, 3-]

Page 27

Entropy エントロピー

• Entropy characterizes the impurity (purity) of an arbitrary collection of objects (データ集合の純度の指標).
q S is the collection of positive and negative objects (全体)
q $p_\oplus$ is the proportion of positive objects in S (該当データの比率)
q $p_\ominus$ is the proportion of negative objects in S (非該当データの比率)
q In the play-tennis example, these numbers are 14, 9/14, and 5/14, respectively (テニスデータでは、それぞれ14, 9/14, 5/14)

• Entropy is defined as follows エントロピーの定義式

$\mathrm{Entropy}(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$
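A minimal Python sketch of this definition (standard library only), checked against the play-tennis numbers:

from math import log2

def entropy(counts):
    # Entropy of a collection given its class counts; 0 * log2(0) is taken as 0.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))   # 0.940... for the play-tennis set [9+, 5-]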

Page 28

Entropy

[Figure: the entropy function relative to a Boolean classification, as the proportion $p_\oplus$ of positive objects varies between 0 and 1; it is 0 at $p_\oplus = 0$ and $p_\oplus = 1$, and reaches its maximum of 1 at $p_\oplus = 0.5$.]

If the collection has $c$ distinct groups of objects, then the entropy is defined by

$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$

Page 29

Example

From the 14 examples of play-tennis, 9 positive and 5 negative objects (denoted by [9+, 5-]): 14件中、正例9件、負例5件なら

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Notice:
1. Entropy is 0 if all members of S belong to the same class (全データが同じクラスの場合のエントロピーは0). For example, if all members are positive ($p_\oplus = 1$), then $p_\ominus$ is 0, and Entropy(S) = -1·log2(1) - 0·log2(0) = -1·0 - 0·log2(0) = 0 (taking 0·log2(0) = 0).

2. Entropy is 1 if the collection contains an equal number of positive and negative examples. If these numbers are unequal, the entropy is between 0 and 1. (両クラスのデータ件数が等しい場合のエントロピーは1、等しくなければ0から1の間の値)

Page 30

Information gain measures the expected reduction in entropy

We define a measure, called information gain (情報利得), of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the objects according to this attribute. その属性によるデータ分割における不純度低減効果をはかる尺度のひとつが情報利得

$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

where Values(A) is the set of all possible values for attribute A, and $S_v$ is the subset of S for which A has value v. Values(A): 属性Aの値 $S_v$: 全データSのうちValue(A)=vのもの

Page 31

Information gain measures the expected reduction in entropy

Values(Wind) = {Weak, Strong}, S = [9+, 5-]
$S_{weak}$, the subnode with value "weak", is [6+, 2-]; $S_{strong}$, the subnode with value "strong", is [3+, 3-]

$\mathrm{Gain}(S, \mathrm{Wind}) = \mathrm{Entropy}(S) - \sum_{v \in \{weak,\, strong\}} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$
$= \mathrm{Entropy}(S) - \frac{8}{14}\, \mathrm{Entropy}(S_{weak}) - \frac{6}{14}\, \mathrm{Entropy}(S_{strong})$
$= 0.940 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.0 = 0.048$

Page 32

Which attribute is the best classifier?

S: [9+, 5-], E = 0.940
Humidity: high → [3+, 4-], E = 0.985; normal → [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151

S: [9+, 5-], E = 0.940
Wind: weak → [6+, 2-], E = 0.811; strong → [3+, 3-], E = 1.00
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048
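These numbers can be reproduced with a small sketch built on the entropy() function above; representing each split as [positive, negative] counts per attribute value is an illustrative convention, not from the slides:

def gain(parent_counts, splits):
    # Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v).
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(sum(s) / n * entropy(s) for s in splits)

print(gain([9, 5], [[3, 4], [6, 1]]))   # Humidity: 0.151
print(gain([9, 5], [[6, 2], [3, 3]]))   # Wind:     0.048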

Page 33

Information gain of all attributes

Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029

Page 34

Next step in growing the decision tree

{D1, D2, ..., D14} [9+, 5-]
Outlook?
├─ Sunny: {D1, D2, D8, D9, D11} [2+, 3-] → ?
├─ Overcast: {D3, D7, D12, D13} [4+, 0-] → Yes
└─ Rain: {D4, D5, D6, D10, D14} [3+, 2-] → ?

Which attribute should be tested here?

Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, Humidity) = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970
Gain(Ssunny, Temperature) = 0.970 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)(1.0) - (3/5)(0.918) = 0.019

Page 35

Attributes with many values

• If an attribute has many values (e.g., days of the month), ID3 will select it
• C4.5 uses GainRatio instead

$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$

$\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

where $S_i$ is the subset of $S$ for which $A$ has value $v_i$
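Continuing the sketch above: SplitInformation is just the entropy of the branch sizes, so GainRatio can be written as follows (a hypothetical helper using the same conventions):

def gain_ratio(parent_counts, splits):
    split_info = entropy([sum(s) for s in splits])   # SplitInformation(S, A)
    return gain(parent_counts, splits) / split_info

# Wind splits S = [9+, 5-] into branches of sizes 8 and 6:
print(gain_ratio([9, 5], [[6, 2], [3, 3]]))   # 0.048 / 0.985 = 0.049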

Page 36

Measures for attribute selection

For a split, let $n_{ij}$ be the number of class-$i$ objects sent to branch $j$, with marginals $n_{i\cdot}$ and $n_{\cdot j}$, total $n_{\cdot\cdot}$, and proportions $p$ defined analogously.

Gain Ratio (Quinlan, C4.5, 1993): $\mathrm{GainRatio}(S, A) = \mathrm{Gain}(S, A)/\mathrm{SplitInformation}(S, A)$, as on the previous slide

Gini Index (Breiman, CART, 1984): $\sum_j p_{\cdot j} \sum_i p_{i|j}^2 - \sum_i p_{i\cdot}^2$

$\chi^2$ (statistics): $\sum_i \sum_j (e_{ij} - n_{ij})^2 / e_{ij}$, where $e_{ij} = n_{\cdot j}\, n_{i\cdot} / n_{\cdot\cdot}$

R measure (Ho & Nguyen, 1997): $\sum_j p_{\cdot j} \max_i p_{i|j}$

Page 37

Outline

1. Issues Regarding Classification and Prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Page 38

Stopping condition

1. Every attribute has already been included along this path through the tree 木構造の経路内に出現しない属性がなくなったとき

2. The training objects associated with each leaf node all have the same target attribute value (i.e., their entropy is zero) 末端に該当するデータが同一クラスで構成される場合 = エントロピー0

Notice: Algorithm ID3 uses Information Gain; C4.5, its successor, uses Gain Ratio (a variant of Information Gain) 分割の適切さを測る尺度として、ID3では情報利得、その後継C4.5では情報利得比を用いる

Page 39

Generalization problem in classification

[Figure: three fits of a curve to the same data points, illustrating underfitting, good fitting, and overfitting.]

• One of the most common tasks is to fit a "model" to a set of training data, so as to be able to make reliable predictions on general untrained data.

• Overfitting: a statistical model describes random error or noise instead of the underlying relationship.

• Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

• A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Page 40

Over-fitting in decision trees

• The generated tree may overfit the training data
q Too many branches, some of which may reflect anomalies due to noise or outliers
q The result is poor accuracy for unseen objects

• Two approaches to avoid overfitting
q Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold.
• It is difficult to choose an appropriate threshold
q Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees.
• Use a set of data different from the training data to decide which is the "best pruned tree".

Page 41

Converting a tree to rules

[Figure: the simple play-tennis tree again: outlook at the root, humidity under sunny, wind under rain.]

IF (Outlook = Sunny) and (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) and (Humidity = Normal) THEN PlayTennis = Yes
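A minimal Python sketch of the conversion, assuming the tree is stored as the nested-dict structure sketched earlier (one rule per root-to-leaf path):

def tree_to_rules(tree, conditions=()):
    if not isinstance(tree, dict):                        # leaf: a class label
        body = " and ".join(f"({a} = {v})" for a, v in conditions)
        return [f"IF {body} THEN PlayTennis = {tree}"]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}
for rule in tree_to_rules(tree):
    print(rule)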

Page 42

Visualization of decision trees

[Figure: screenshots of tree visualizations: tree map, cone tree, fisheye view, hyperbolic tree, and our D2MS system's T2.5D view.]

Page 43

Ensemble learning

Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.
q Boosting: make examples currently misclassified more important
q Bagging: use different subsets of the training data for each model

[Figure: the training data is resampled into Data 1, Data 2, ..., Data m; Learner 1, ..., Learner m produce Model 1, ..., Model m, and a model combiner merges them into the final model approximating some unknown distribution.]

Page 44

Random forest

• A random forest is a forest of random decision trees (an ensemble)

• Tree bagging: given a training set $(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)$:
q Sample $n$ training examples with replacement → learn a tree
q Repeat $K$ times to learn $K$ decision trees
q Make predictions for an unknown case by the majority vote of the results of the $K$ trees

• Random forest: as tree bagging, but choose a random subset of attributes to build each tree. (Leo Breiman, 1928-2005)
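A minimal sketch of tree bagging with random attribute subsets, assuming numpy and scikit-learn are available; max_features="sqrt" makes each split consider a random subset of attributes, as in Breiman's random forest:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, K=100, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    trees = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=len(X))    # sample n examples with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, x):
    votes = [t.predict([x])[0] for t in trees]        # majority vote of the K trees
    return max(set(votes), key=votes.count)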

Page 45

Issues in decision tree learning

• Attribute selection
• Pruning trees
• From trees to rules (high cost of pruning)
• Visualization
• Data access: recent development on very large training sets; fast, efficient, and scalable (well-known systems: C4.5 and CART)
• Random Forest
• Further reading: http://www.jaist.ac.jp/~bao/DA-K236/TopTenDMAlgorithms.pdf

Page 46

Homework

Page 47

Homework

A company prepared its marketing strategy: it sent out a promotion to various houses and recorded 4 facts (attributes) about each house, as well as whether the people responded or not (the outcome of the promotion). The data is given in the table.

Manually build a decision tree with the method studied in this lecture.