
Page 1

K236: Basis of Data Analytics
Lecture 7: Classification and prediction

Decision tree induction

Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

Page 2

Schedule of K236

1. Introduction to data science (1) データ科学入門 6/9

2. Introduction to data science (2) データ科学入門 6/13

3. Data and databases データとデータベース 6/16

4. Review of univariate statistics 単変量統計 6/20

5. Review of linear algebra 線形代数 6/23

6. Data mining software データマイニングソフトウェア 6/27

7. Data preprocessing データ前処理 6/30

8. Classification and prediction (1) 分類と予測 (1) 7/4

9. Knowledge evaluation 知識評価 7/7

10. Classification and prediction (2) 分類と予測 (2) 7/11

11. Classification and prediction (3) 分類と予測 (3) 7/14

12. Mining association rules (1) 相関ルールの解析 7/18

13. Mining association rules (2) 相関ルールの解析 7/21

14. Cluster analysis クラスター解析 7/25

15. Review and Examination レビューと試験 (the date is not fixed) 7/27

Page 3

Data schemas vs. mining methods データ・スキーマ vs. 学習手法

Types of data

§ Flat data tables 表形式データ

§ Relational databases 関係DB

§ Temporal & spatial data 時空間データ

§ Transactional databases 取引データ

§ Multimedia data マルチメディアデータ

§ Genome databases ゲノムデータ

§ Materials science data 材料データ

§ Textual data テキストデータ

§ Web data ウェブデータ

§ etc.

Mining tasks and methods マイニングの課題と手法

§ Classification/Prediction 分類/予測

q Decision trees 決定木

q Bayesian classification ベイジアン分類

q Neural networks 神経回路網

q Rule induction ルール帰納法

q Support vector machines SVM

q Hidden Markov model 隠れマルコフ

q etc.

§ Description 記述

q Association analysis 相関分析

q Clustering クラスタリング

q Summarization 要約

q etc.

Page 4

Outline

1. Issues Regarding Classification and Prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Page 5

Classification and prediction

[Figure: eight cell images, healthy (H1-H4) and cancerous (C1-C4), shown once without labels (unsupervised data) and once with labels (supervised data).]

The labeled data:

     color   #nuclei   #tails   class/label
H1   light   1         1        healthy
H2   dark    1         1        healthy
H3   light   1         2        healthy
H4   light   2         1        healthy
C1   dark    1         2        cancerous
C2   dark    2         1        cancerous
C3   light   2         2        cancerous
C4   dark    2         2        cancerous

Given: $(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)$
- $\boldsymbol{x}_i$ is a description of an object, phenomenon, etc.
- $y_i$ (the label attribute) is some property of $\boldsymbol{x}_i$; if it is not available, the learning is unsupervised

Find: a function $f(\boldsymbol{x})$ that characterizes $\{\boldsymbol{x}_i\}$ or such that $f(\boldsymbol{x}_i) = y_i$

The problem is usually called classification if the label is categorical, and prediction if the label is continuous (in this case, if the descriptive attributes are numerical, the problem is regression).

Page 6

Classification—a two-step process

• Model construction: describing a set of predetermined classes
q Each tuple/object is assumed to belong to a predefined class, as determined by the class label attribute
q The set of tuples used for model construction: training set
q The model is represented as classification rules, decision trees, or mathematical formulae (classifiers)

• Model usage: for classifying future or unknown objects. Estimate the accuracy of the model:
q The known label of a test object is compared with the classified result from the model
q Accuracy rate is the percentage of test set objects that are correctly classified by the model
q The test set is independent of the training set; otherwise over-fitting will occur
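To make the two steps concrete, here is a minimal Python sketch, assuming scikit-learn is available (the library, the split ratio, and the 0 = light / 1 = dark encoding are illustrative assumptions, not part of the lecture):

# Minimal sketch of the two-step process on the cell data above.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[0, 1, 1], [1, 1, 1], [0, 1, 2], [0, 2, 1],   # H1-H4: color, #nuclei, #tails
     [1, 1, 2], [1, 2, 1], [0, 2, 2], [1, 2, 2]]   # C1-C4
y = ["healthy"] * 4 + ["cancerous"] * 4

# Step 1, model construction: learn a classifier from the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2, model usage: accuracy is estimated on the independent test set.
print(accuracy_score(y_test, model.predict(X_test)))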

Page 7

Classification—a two-step process

[Figure: model construction: a classification algorithm is applied to the training data (H1-H4, C1, C2) to build a classifier (model), e.g. the rule "IF color = dark AND #tails = 2 THEN cancerous cell"; model usage: the classifier is applied to an unknown object to answer "Cancerous?" with "Cancerous".]

Page 8

Criteria for classification methods

• Predictive accuracy (予測精度): the ability of the classifier to correctly predict unseen data

• Speed: refers to computation cost

• Robustness (頑健性): the ability of the classifier to make correct predictions given noisy data or data with missing values

• Scalability (拡張性): the ability to construct the classifier efficiently given large amounts of data

• Interpretability (解釈容易性): the level of understanding and insight that is provided by the classifier

Page 9

Machine learning: View by nature of methods

The five tribes of machine learning (Pedro Domingos):

Tribes           Origins                Master Algorithm
Symbolists       Logic, philosophy      Inverse deduction
Evolutionaries   Evolutionary biology   Genetic programming
Connectionists   Neuroscience           Backpropagation
Bayesians        Statistics             Probabilistic inference
Analogizers      Psychology             Kernel machines

Page 10

Symbolists

Tom Mitchell, Steve Muggleton, Ross Quinlan

Page 11

Classification with decision trees

[Figure: the eight cell images H1-H4 and C1-C4 beside the decision tree built from them.]

#nuclei?
├─ 1: color?
│   ├─ light → H
│   └─ dark: #tails?
│       ├─ 1 → H
│       └─ 2 → C
└─ 2: #tails?
    ├─ 1: color?
    │   ├─ light → H
    │   └─ dark → C
    └─ 2 → C

K236, L7

Page 12

Analogizers

Peter Hart, Vladimir Vapnik, Douglas Hofstadter

Page 13

Kernel methods: the basic ideas

[Figure: points x_1, x_2, ..., x_n in the input space X are mapped by f to f(x_1), f(x_2), ..., f(x_n) in the feature space F (with inverse map f^-1); the kernel function k: X × X → R, k(x_i, x_j) = f(x_i)·f(x_j), yields the n×n kernel matrix K, and the kernel-based algorithm runs on K (computation done on the kernel matrix).]

K619
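A minimal sketch of this idea in Python (numpy assumed; the linear kernel stands in for any valid kernel function k):

import numpy as np

def kernel_matrix(X, k):
    # n x n kernel (Gram) matrix with K[i, j] = k(x_i, x_j).
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

def linear_kernel(a, b):
    # k(x_i, x_j) = f(x_i) . f(x_j), here with f the identity map.
    return float(np.dot(a, b))

X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]])
K = kernel_matrix(X, linear_kernel)   # the kernel-based algorithm then works on K only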

Page 14

Connectionists

Yann LeCun, Geoff Hinton, Yoshua Bengio

Page 15

Classification with neural networks

[Figure: a neural network whose inputs "color = dark", "#nuclei = 1", and "#tails = 2" are computed from the cell images H1-H4, C1-C4, and whose outputs are "Healthy" / "Cancerous".]

K236, L9

Page 16

Deep learning

K619

Page 17

Bayesians in machine learning

David Heckerman, Judea Pearl, Michael Jordan

K236,  L8

Page 18

Probabilistic graphical models: instances of graphical models

[Figure: a taxonomy in which probabilistic models and graphical models overlap; graphical models divide into directed (Bayes nets, with DBNs, the hidden Markov model (HMM), the naïve Bayes classifier, mixture models, the Kalman filter model, and LDA) and undirected (MRFs, with conditional random fields and MaxEnt). From Murphy, ML for life sciences.]

K619

Page 19

Outline

1. Issues Regarding Classification and Prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Page 20

Mining with decision trees 決定木でのマイニング

A decision tree is a flow-chart-like tree structure: フローチャートのような木構造
§ each internal node denotes a test on an attribute 属性の値を判定するのが中間にある節
§ each branch represents an outcome of the test 値を判定して各枝へ分岐
§ leaf nodes represent classes or class distributions 末端(葉)はクラス/分布
§ the top-most node in a tree is the root node 木構造の頂点は根

{H1, H2, H3, H4, C1, C2, C3, C4}
#nuclei?
├─ 1: {H1, H2, H3, C1} → color?
│   ├─ light: {H1, H3} → H
│   └─ dark: {H2, C1} → #tails?
│       ├─ 1: {H2} → H
│       └─ 2: {C1} → C
└─ 2: {H4, C2, C3, C4} → #tails?
    ├─ 1: {H4, C2} → color?
    │   ├─ light: {H4} → H
    │   └─ dark: {C2} → C
    └─ 2: {C3, C4} → C

Page 21

Decision tree induction (DTI)

§ Decision tree generation consists of two phases
q Tree construction (決定木構築)
§ Partition examples recursively based on selected attributes
§ At start, all the training objects are at the root
q Tree pruning (構築した木の枝刈)
§ Identify and remove branches that reflect noise or outliers

§ Use of decision trees: classify unknown objects (新事例の分類)
q Test the attribute values of the object against the decision tree

Page 22

Tree construction: a general algorithm 木構造を構築する一般的なアルゴリズム

1. At each node, choose the "best" attribute by a given measure for attribute selection 各節では事前に指定した選択基準に対し、最良の属性を選ぶ

2. Extend the tree by adding a new branch for each value of the attribute その属性の値ごとに枝を追加して木を拡張

3. Sort the training examples to the leaf nodes 末端に訓練データを並べ替える

4. If the examples in a node belong to one class, then stop; else repeat steps 1-4 for the leaf nodes ある節のデータが同一クラスだけなら停止、混じっていれば1から繰返す

5. Prune the tree to avoid over-fitting 枝刈をして過学習を防ぐ

Two steps: recursively generate the tree (steps 1-4) (順次、属性を選んでデータを分割), and prune the tree (step 5) (構築した木の枝刈), as sketched below.
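A minimal Python sketch of steps 1-4 (pruning, step 5, is omitted); it assumes objects are dicts of attribute values with a "class" key, and that `measure` is any attribute-selection measure, such as the information gain defined later in this lecture:

from collections import Counter

def build_tree(objects, attributes, measure):
    classes = [o["class"] for o in objects]
    if len(set(classes)) == 1 or not attributes:       # step 4: stop on a pure node
        return Counter(classes).most_common(1)[0][0]   # leaf = (majority) class
    best = max(attributes, key=lambda a: measure(objects, a))   # step 1: best attribute
    rest = [a for a in attributes if a != best]
    return {best: {v: build_tree([o for o in objects if o[best] == v],  # step 3
                                 rest, measure)
                   for v in {o[best] for o in objects}}}    # step 2: one branch per value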

[Figure: the decision tree for the cell data from the previous slide, grown by this algorithm.]

Page 23

Training data for concept "play-tennis"

• A typical dataset in machine learning
• 14 objects belonging to two classes {Y, N} are observed on 4 properties.
• Dom(Outlook) = {sunny, overcast, rain}
• Dom(Temperature) = {hot, mild, cool}
• Dom(Humidity) = {high, normal}
• Dom(Wind) = {weak, strong}
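The table itself is an image in the original slide; the following is the standard play-tennis dataset from Mitchell's Machine Learning (1997), which matches the class counts and subsets used on the following slides:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   sunny     hot          high      weak    no
D2   sunny     hot          high      strong  no
D3   overcast  hot          high      weak    yes
D4   rain      mild         high      weak    yes
D5   rain      cool         normal    weak    yes
D6   rain      cool         normal    strong  no
D7   overcast  cool         normal    strong  yes
D8   sunny     mild         high      weak    no
D9   sunny     cool         normal    weak    yes
D10  rain      mild         normal    weak    yes
D11  sunny     mild         normal    strong  yes
D12  overcast  mild         high      strong  yes
D13  overcast  hot          normal    weak    yes
D14  rain      mild         high      strong  no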

Page 24

A decision tree for playing tennis テニスに関する決定木の一例

[Figure: a decision tree that tests temperature at the root (cool, hot, mild) and then outlook, wind, and humidity at lower levels; it splits D1-D14 into many small leaves labeled yes/no (one branch even ends in a null leaf), classifying the training data at the cost of considerable complexity.]

Page 25

A simple decision tree for playing tennis テニスに関する簡潔な決定木

outlook?
├─ sunny: {D1, D2, D8, D9, D11} → humidity?
│   ├─ high: {D1, D2, D8} → no
│   └─ normal: {D9, D11} → yes
├─ o'cast: {D3, D7, D12, D13} → yes
└─ rain: {D4, D5, D6, D10, D14} → wind?
    ├─ true (strong): {D6, D14} → no
    └─ false (weak): {D4, D5, D10} → yes

This tree is much simpler, as "outlook" is selected at the root. How to select a good attribute to split a decision node? 最初の属性として"outlook"を選択することで決定木がかなり簡潔になる。分割条件として適切な属性をどのように選ぶのか?

Page 26

Which attribute is the best? 最良の属性は?

• The "playing-tennis" set S contains 9 positive objects (+) and 5 negative objects (-), denoted by [9+, 5-]. テニスデータ(テニスする(+)9件、しない(-)5件)のクラス分布[9+, 5-]

• If attributes "humidity" and "wind" split S into sub-nodes with the proportions of positive and negative objects below, which attribute is better? データを"humidity"で分割する場合と"wind"で分割する場合とでは、クラスの分布はどちらがよいか?

A1 = humidity: [9+, 5-] → normal [6+, 1-], high [3+, 4-]
A2 = wind: [9+, 5-] → weak [6+, 2-], strong [3+, 3-]

Page 27

Entropy エントロピー

• Entropy characterizes the impurity (purity) of an arbitrary collection of objects (データ集合の純度の指標).
q S is the collection of positive and negative objects (全体)
q $p_\oplus$ is the proportion of positive objects in S (該当データの比率)
q $p_\ominus$ is the proportion of negative objects in S (非該当データの比率)
q In the play-tennis example, these numbers are 14, 9/14, and 5/14, respectively (テニスデータでは、それぞれ14, 9/14, 5/14)

• Entropy is defined as follows エントロピーの定義式

$\mathrm{Entropy}(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$
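A minimal Python sketch of this definition (standard library only), checked against the play-tennis numbers:

from math import log2

def entropy(counts):
    # Entropy of a collection given its class counts; 0 * log2(0) is taken as 0.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))   # 0.940... for the play-tennis set [9+, 5-]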

Page 28

Entropy

[Figure: the entropy function relative to a Boolean classification, as the proportion $p_\oplus$ of positive objects varies between 0 and 1; it is 0 at $p_\oplus = 0$ and $p_\oplus = 1$, and reaches its maximum of 1 at $p_\oplus = 0.5$.]

If the collection has $c$ distinct groups of objects, then the entropy is defined by

$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$

Page 29

Example

From the 14 examples of play-tennis, 9 positive and 5 negative objects (denoted by [9+, 5-]): 14件中、正例9件、負例5件なら

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Notice:
1. Entropy is 0 if all members of S belong to the same class (全データが同じクラスの場合のエントロピーは0). For example, if all members are positive ($p_\oplus = 1$), then $p_\ominus$ is 0, and Entropy(S) = -1·log2(1) - 0·log2(0) = -1·0 - 0·log2(0) = 0 (taking 0·log2(0) = 0).

2. Entropy is 1 if the collection contains an equal number of positive and negative examples. If these numbers are unequal, the entropy is between 0 and 1. (両クラスのデータ件数が等しい場合のエントロピーは1、等しくなければ0から1の間の値)

Page 30

Information gain measures the expected reduction in entropy

We define a measure, called information gain (情報利得), of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the objects according to this attribute. その属性によるデータ分割における不純度低減効果をはかる尺度のひとつが情報利得

$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

where Values(A) is the set of all possible values for attribute A, and $S_v$ is the subset of S for which A has value v. Values(A): 属性Aの値 $S_v$: 全データSのうちValue(A)=vのもの

Page 31

Information gain measures the expected reduction in entropy

Values(Wind) = {Weak, Strong}, S = [9+, 5-]
$S_{weak}$, the subnode with value "weak", is [6+, 2-]; $S_{strong}$, the subnode with value "strong", is [3+, 3-]

$\mathrm{Gain}(S, \mathrm{Wind}) = \mathrm{Entropy}(S) - \sum_{v \in \{weak,\, strong\}} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$
$= \mathrm{Entropy}(S) - \frac{8}{14}\, \mathrm{Entropy}(S_{weak}) - \frac{6}{14}\, \mathrm{Entropy}(S_{strong})$
$= 0.940 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.0 = 0.048$

Page 32

Which attribute is the best classifier?

S: [9+, 5-], E = 0.940
Humidity: high → [3+, 4-], E = 0.985; normal → [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151

S: [9+, 5-], E = 0.940
Wind: weak → [6+, 2-], E = 0.811; strong → [3+, 3-], E = 1.00
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048
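These numbers can be reproduced with a small sketch built on the entropy() function above; representing each split as [positive, negative] counts per attribute value is an illustrative convention, not from the slides:

def gain(parent_counts, splits):
    # Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v).
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(sum(s) / n * entropy(s) for s in splits)

print(gain([9, 5], [[3, 4], [6, 1]]))   # Humidity: 0.151
print(gain([9, 5], [[6, 2], [3, 3]]))   # Wind:     0.048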

Page 33

Information gain of all attributes

Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029

Page 34

Next step in growing the decision tree

{D1, D2, ..., D14} [9+, 5-]
Outlook?
├─ Sunny: {D1, D2, D8, D9, D11} [2+, 3-] → ?
├─ Overcast: {D3, D7, D12, D13} [4+, 0-] → Yes
└─ Rain: {D4, D5, D6, D10, D14} [3+, 2-] → ?

Which attribute should be tested here?

Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, Humidity) = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970
Gain(Ssunny, Temperature) = 0.970 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)(1.0) - (3/5)(0.918) = 0.019

Page 35

Attributes with many values

• If an attribute has many values (e.g., days of the month), ID3 will select it
• C4.5 uses GainRatio instead

$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$

$\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

where $S_i$ is the subset of $S$ for which $A$ has value $v_i$
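Continuing the sketch above: SplitInformation is just the entropy of the branch sizes, so GainRatio can be written as follows (a hypothetical helper using the same conventions):

def gain_ratio(parent_counts, splits):
    split_info = entropy([sum(s) for s in splits])   # SplitInformation(S, A)
    return gain(parent_counts, splits) / split_info

# Wind splits S = [9+, 5-] into branches of sizes 8 and 6:
print(gain_ratio([9, 5], [[6, 2], [3, 3]]))   # 0.048 / 0.985 = 0.049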

Page 36

Measures for attribute selection

For a split, let $n_{ij}$ be the number of class-$i$ objects sent to branch $j$, with marginals $n_{i\cdot}$ and $n_{\cdot j}$, total $n_{\cdot\cdot}$, and proportions $p$ defined analogously.

Gain Ratio (Quinlan, C4.5, 1993): $\mathrm{GainRatio}(S, A) = \mathrm{Gain}(S, A)/\mathrm{SplitInformation}(S, A)$, as on the previous slide

Gini Index (Breiman, CART, 1984): $\sum_j p_{\cdot j} \sum_i p_{i|j}^2 - \sum_i p_{i\cdot}^2$

$\chi^2$ (statistics): $\sum_i \sum_j (e_{ij} - n_{ij})^2 / e_{ij}$, where $e_{ij} = n_{\cdot j}\, n_{i\cdot} / n_{\cdot\cdot}$

R measure (Ho & Nguyen, 1997): $\sum_j p_{\cdot j} \max_i p_{i|j}$

Page 37

Outline

1. Issues Regarding Classification and Prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Page 38

Stopping condition

1. Every attribute has already been included along this path through the tree 木構造の経路内に出現しない属性がなくなったとき

2. The training objects associated with each leaf node all have the same target attribute value (i.e., their entropy is zero) 末端に該当するデータが同一クラスで構成される場合 = エントロピー0

Notice: Algorithm ID3 uses Information Gain; C4.5, its successor, uses Gain Ratio (a variant of Information Gain) 分割の適切さを測る尺度として、ID3では情報利得、その後継C4.5では情報利得比を用いる

Page 39

Generalization problem in classification

[Figure: three fits of a curve to the same data points, illustrating underfitting, good fitting, and overfitting.]

• One of the most common tasks is to fit a "model" to a set of training data, so as to be able to make reliable predictions on general untrained data.

• Overfitting: a statistical model describes random error or noise instead of the underlying relationship.

• Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

• A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Page 40

Over-fitting in decision trees

• The generated tree may overfit the training data
q Too many branches, some of which may reflect anomalies due to noise or outliers
q The result is poor accuracy for unseen objects

• Two approaches to avoid overfitting
q Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold.
• It is difficult to choose an appropriate threshold
q Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees.
• Use a set of data different from the training data to decide which is the "best pruned tree".

Page 41

Converting a tree to rules

[Figure: the simple play-tennis tree again: outlook at the root, humidity under sunny, wind under rain.]

IF (Outlook = Sunny) and (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) and (Humidity = Normal) THEN PlayTennis = Yes
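A minimal Python sketch of the conversion, assuming the tree is stored as the nested-dict structure sketched earlier (one rule per root-to-leaf path):

def tree_to_rules(tree, conditions=()):
    if not isinstance(tree, dict):                        # leaf: a class label
        body = " and ".join(f"({a} = {v})" for a, v in conditions)
        return [f"IF {body} THEN PlayTennis = {tree}"]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
                    "Overcast": "Yes",
                    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}
for rule in tree_to_rules(tree):
    print(rule)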

Page 42

Visualization of decision trees

[Figure: screenshots of tree visualizations: tree map, cone tree, fisheye view, hyperbolic tree, and our D2MS system's T2.5D view.]

Page 43

Ensemble learning

Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.
q Boosting: make examples currently misclassified more important
q Bagging: use different subsets of the training data for each model

[Figure: the training data is resampled into Data 1, Data 2, ..., Data m; Learner 1, ..., Learner m produce Model 1, ..., Model m, and a model combiner merges them into the final model approximating some unknown distribution.]

Page 44

Random forest

• A random forest is a forest of random decision trees (an ensemble)

• Tree bagging: given a training set $(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)$:
q Sample $n$ training examples with replacement → learn a tree
q Repeat $K$ times to learn $K$ decision trees
q Make predictions for an unknown case by the majority vote of the results of the $K$ trees

• Random forest: as tree bagging, but choose a random subset of attributes to build each tree. (Leo Breiman, 1928-2005)
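A minimal sketch of tree bagging with random attribute subsets, assuming numpy and scikit-learn are available; max_features="sqrt" makes each split consider a random subset of attributes, as in Breiman's random forest:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, K=100, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    trees = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=len(X))    # sample n examples with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, x):
    votes = [t.predict([x])[0] for t in trees]        # majority vote of the K trees
    return max(set(votes), key=votes.count)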

Page 45

Issues in decision tree learning

• Attribute selection
• Pruning trees
• From trees to rules (high cost of pruning)
• Visualization
• Data access: recent development on very large training sets; fast, efficient, and scalable (well-known systems: C4.5 and CART)
• Random Forest
• Further reading: http://www.jaist.ac.jp/~bao/DA-K236/TopTenDMAlgorithms.pdf

Page 46

Homework

Page 47

Homework

A company prepared its marketing strategy: it sent out a promotion to various houses and recorded 4 facts (attributes) about each house, as well as whether the people responded or not (the outcome of the promotion). The data is given in the table.

Manually build a decision tree with the method studied in this lecture.