Classification Tree Regression Tree Medical Applications of CART
Classification and Regression Trees
Mihaela van der Schaar
Department of Engineering Science, University of Oxford
March 1, 2017
Mihaela van der Schaar Classification and Regression Trees
Overview

Many decisions are tree structures, e.g. medical treatment:

Fever: if T > 100 → Treatment #1; if T < 100, check Muscle Pain: if High → Treatment #2, if Low → Treatment #3.
Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 1, 2015 20 / 42
Overview
A Decision Tree is a hierarchically organized structure, with each node splitting the data space into pieces based on the value of a feature.

Equivalent to a partition of R^d into K disjoint feature subspaces {R1, ..., RK}, where each Rj ⊂ R^d.

On each feature subspace Rj, the same decision/prediction is made for all x ∈ Rj.
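The partition view can be sketched directly. In this minimal example the regions, thresholds, and leaf values are all hypothetical and hand-built; prediction is just a lookup of which region x falls in:

```python
# A fitted tree is equivalent to a partition {R_1, ..., R_K} of the
# feature space: prediction looks up which region x falls in.
# Hypothetical hand-built regions for R^2.

def region_index(x, thresholds=(0.5, 0.3)):
    """Return which of 3 hypothetical axis-aligned regions x lies in."""
    t1, t2 = thresholds
    if x[0] <= t1:
        return 0                     # R1: x1 <= t1
    return 1 if x[1] <= t2 else 2    # R2 / R3: split on x2

# One constant prediction per region (the beta_j of later slides)
leaf_values = [0.0, 1.0, 2.0]

def predict(x):
    return leaf_values[region_index(x)]

print(predict((0.2, 0.9)))  # falls in R1 -> 0.0
print(predict((0.8, 0.1)))  # falls in R2 -> 1.0
```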
Terminology
Parent of a node c is the immediate predecessor node.

Children of a node c are the immediate successors of c, equivalently nodes which have c as a parent.

Root node is the top node of the tree; the only node without parents.

Leaf nodes are nodes which do not have children.

A K-ary tree is a tree where each node (except for leaf nodes) has K children. Usually we work with binary trees (K = 2).

Depth of a tree is the maximal length of a path from the root node to a leaf node.
Terminology

[Figure: a binary tree of depth 2, annotating the root node, a node C with its parent and children, and the leaf nodes.]
Terminology
The leaves of a tree partition the feature space
[Figure: the feature space (x1, x2) is cut into regions A–E by thresholds θ1, θ2, θ3, θ4; the equivalent tree tests x1 > θ1, then x2 > θ3, x1 ≤ θ4, and x2 ≤ θ2 on the way down to leaves A, B, C, D, E.]
Learning
Learning a tree model
Three things to learn:
1 The structure of the tree.

2 The threshold values (θi).

3 The values for the leaves (A, B, ...).
Overview - Classification Tree
Classification Tree:
Given the dataset D = (x1, y1), ..., (xn, yn), where xi ∈ R^d, yi ∈ Y = {1, ..., m},
minimize the misclassification error in each leaf.
The estimated probability of each class k in region Rj is simply:

βjk = Σi I(yi = k) · I(xi ∈ Rj) / Σi I(xi ∈ Rj)

This is the frequency with which label k occurs in the leaf Rj.

These estimates can be regularized as well.
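As a minimal sketch (with a hypothetical leaf assignment), the estimates βjk are just per-leaf label frequencies:

```python
import numpy as np

# Leaf class-probability estimates beta_{jk}: for each leaf region R_j,
# beta_{jk} is the fraction of training points in R_j with label k.
# 'leaf' is a hypothetical leaf assignment for each training point.

y    = np.array([0, 0, 1, 1, 1, 0])   # labels in {0, 1}
leaf = np.array([0, 0, 0, 1, 1, 1])   # which region R_j each x_i fell in

def leaf_probs(y, leaf, n_classes=2):
    K = leaf.max() + 1
    beta = np.zeros((K, n_classes))
    for j in range(K):
        in_leaf = (leaf == j)
        for k in range(n_classes):
            beta[j, k] = np.mean(y[in_leaf] == k)
    return beta

print(leaf_probs(y, leaf))
# leaf 0 holds labels [0, 0, 1] -> probs [2/3, 1/3]
# leaf 1 holds labels [1, 1, 0] -> probs [1/3, 2/3]
```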
Example
Decide whether to wait for a table at a restaurant, based on the following attributes (example from Russell and Norvig, AIMA):

Alternate: is there an alternative restaurant nearby?
Bar: is there a comfortable bar area to wait in?
Fri/Sat: is today Friday or Saturday?
Hungry: are we hungry?
Patrons: number of people in the restaurant (None, Some, Full)
Price: price range ($, $$, $$$)
Raining: is it raining outside?
Reservation: have we made a reservation?
Type: kind of restaurant (French, Italian, Thai, Burger)
WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Example
A tree model for deciding where to eat

Examples are described by attribute values (binary, discrete, continuous).

[Figure: the restaurant dataset from Russell & Norvig, AIMA]

Classification of examples is positive (T) or negative (F).
Example
First decision: at the root of the tree

Which attribute to split on?

Idea: use information gain to choose which attribute to split on.

Patrons? is a better choice - it gives information about the classification.
Example
Information gain
A chosen attribute A divides the training set E into subsets E1, ..., Ev according to their values for A, where A has v distinct values.

remainder(A) = Σ_{i=1}^{v} (pi + ni)/(p + n) × I(pi/(pi + ni), ni/(pi + ni))

Information Gain (IG) or reduction in uncertainty from the attribute test:

IG(A) = I(p/(p + n), n/(p + n)) − remainder(A)

Choose the attribute with the largest IG(A).
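The remainder and IG formulas can be sketched directly, with I(p, n) taken as the binary entropy of the positive/negative proportions; the counts below are the restaurant example's 6 positive and 6 negative instances:

```python
import math

# Information gain for a candidate attribute, following the slide's
# remainder(A) formula. I(p, n) is the binary entropy, in bits.

def I(p_frac, n_frac):
    """Binary entropy of a (p, n) proportion pair, in bits."""
    h = 0.0
    for q in (p_frac, n_frac):
        if q > 0:
            h -= q * math.log2(q)
    return h

def info_gain(p, n, branches):
    """branches: list of (p_i, n_i) counts after splitting on A."""
    remainder = sum((pi + ni) / (p + n) * I(pi / (pi + ni), ni / (pi + ni))
                    for pi, ni in branches)
    return I(p / (p + n), n / (p + n)) - remainder

# Restaurant example: 6 positive, 6 negative overall.
# Splitting on Patrons gives branches None=(0,2), Some=(4,0), Full=(2,4).
print(round(info_gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))  # 0.541
# Splitting on Type gives four branches, each with p_i = n_i.
print(info_gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]))    # ≈ 0
```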
Example
How to measure information gain?
Idea: gaining information reduces uncertainty. Use entropy to measure uncertainty.

If a random variable X has K different values (a1, ..., aK), its entropy is given by

H[X] = − Σ_{k=1}^{K} P(X = ak) × log P(X = ak)

Different measures of uncertainty:
Misclassification error: 1 − max_k P(X = ak)
Gini index: Σ_{k=1}^{K} 2 P(X = ak)(1 − P(X = ak))

C4.5: classification tree that uses entropy to measure uncertainty.

CART: classification tree that uses the Gini index to measure uncertainty.
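The three uncertainty measures above can be sketched for a discrete distribution given as a list of class probabilities (the Gini form here follows the slide's Σ 2q(1 − q) convention):

```python
import math

# Three impurity measures from the slide, for a discrete distribution
# given as a list of class probabilities summing to 1.

def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def gini(p):
    # As written on the slide: sum of 2*q*(1 - q) over classes.
    return sum(2 * q * (1 - q) for q in p)

def misclassification(p):
    return 1 - max(p)

dist = [0.5, 0.5]
print(entropy(dist))            # 1.0 bit: maximal uncertainty
print(gini(dist))               # 1.0 under the slide's 2q(1-q) convention
print(misclassification(dist))  # 0.5
```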
Example
Learning trees
[Figure: the uncertainty measures for two-class classification, as a function of the proportion p in class 1.]
Example
Partitions and CART

[Figure omitted.]
Example
Algorithm for Classification Trees
Start with R1 = R^d.

For each feature j = 1, ..., d, and each value v ∈ R that we can split on:

Split the data set: I< = {i : xij < v} and I> = {i : xij ≥ v}

Estimate the parameters:

β< = Σi I(yi = 1) · I(i ∈ I<) / Σi I(i ∈ I<)  and  β> = Σi I(yi = 1) · I(i ∈ I>) / Σi I(i ∈ I>)

Quality of split is measured by the weighted sum of the uncertainty:

|I<|/(|I<| + |I>|) × I(β<, 1 − β<) + |I>|/(|I<| + |I>|) × I(β>, 1 − β>)

Choose the split with the minimal weighted sum of the uncertainty.
Recurse on both children, with (xi, yi)_{i∈I<} and (xi, yi)_{i∈I>}.
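The split search above can be sketched for a single numeric feature: try each observed value as threshold v, estimate β< and β>, and score the split by the weighted binary entropy of the two sides (the data below is hypothetical):

```python
import math

# Greedy classification split from the slide, for one numeric feature.

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(xs, ys):
    """xs: feature values, ys: 0/1 labels. Returns (threshold, score)."""
    best = (None, float("inf"))
    for v in sorted(set(xs))[1:]:          # candidate thresholds
        left  = [y for x, y in zip(xs, ys) if x < v]
        right = [y for x, y in zip(xs, ys) if x >= v]
        b_lt, b_gt = sum(left) / len(left), sum(right) / len(right)
        n = len(xs)
        score = (len(left) / n) * binary_entropy(b_lt) \
              + (len(right) / n) * binary_entropy(b_gt)
        if score < best[1]:
            best = (v, score)
    return best

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0,   0,   0,   1,   1,   1]
print(best_split(xs, ys))   # (4.0, 0.0): splitting at x < 4 is pure
```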
Example
Which attribute to split on?

Idea: use information gain to choose which attribute to split on.

Patrons vs. Type?
By choosing Patrons, we end up with a partition (3 branches) with smaller entropy, i.e. smaller uncertainty (0.45 bit).
By choosing Type, we end up with an uncertainty of 1 bit.
Thus, we choose Patrons over Type.
Example
Uncertainty if we go with "Patrons"

For the "None" branch:

−( (0/(0+2)) log (0/(0+2)) + (2/(0+2)) log (2/(0+2)) ) = 0

For the "Some" branch:

−( (4/(4+0)) log (4/(4+0)) + (0/(4+0)) log (0/(4+0)) ) = 0

For the "Full" branch:

−( (2/(2+4)) log (2/(2+4)) + (4/(2+4)) log (4/(2+4)) ) ≈ 0.9

For choosing "Patrons", take the weighted average of the branches; this quantity is called the conditional entropy:

(2/12) × 0 + (4/12) × 0 + (6/12) × 0.9 = 0.45
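The weighted average above can be checked in a few lines, using the branch counts (positive, negative) None=(0,2), Some=(4,0), Full=(2,4):

```python
import math

# Conditional entropy H[Y | Patrons] for the restaurant example.

def branch_entropy(p, n):
    h = 0.0
    for c in (p, n):
        if c > 0:
            q = c / (p + n)
            h -= q * math.log2(q)
    return h

branches = [(0, 2), (4, 0), (2, 4)]
total = sum(p + n for p, n in branches)          # 12 examples
cond_H = sum((p + n) / total * branch_entropy(p, n) for p, n in branches)
print(round(cond_H, 2))   # 0.46; the slide rounds the Full branch to 0.9
```

The slide's 0.45 comes from rounding the Full-branch entropy (≈ 0.918) to 0.9 before averaging; the unrounded value is ≈ 0.459.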
Example
Conditional entropy
Definition. Given two random variables X and Y,

H[Y | X] = Σk P(X = ak) × H[Y | X = ak]

In our example:
X: the attribute to be split
Y: wait or not

Relation to information gain:

Gain = H[Y] − H[Y | X]

When H[Y] is fixed, we need only compare conditional entropies.
Example
Conditional entropy for "Type"

For the "French" and "Italian" branches:

−( (1/(1+1)) log (1/(1+1)) + (1/(1+1)) log (1/(1+1)) ) = 1

For the "Thai" and "Burger" branches:

−( (2/(2+2)) log (2/(2+2)) + (2/(2+2)) log (2/(2+2)) ) = 1

For choosing "Type", the weighted average of the branches (the conditional entropy) is:

(2/12) × 1 + (2/12) × 1 + (4/12) × 1 + (4/12) × 1 = 1
Example
Next split?

Do we split on "None" or "Some"?

No, we do not: the decision is deterministic, as seen from the training data.

We will look only at the 6 instances with Patrons == Full.
Example
Greedily we build the tree and get this:

[Figure: the learned restaurant tree.]
Example
What is the optimal Tree Depth?
We need to be careful to pick an appropriate tree depth.
If the tree is too deep, we can overfit.

If the tree is too shallow, we underfit.

Max depth is a hyper-parameter that should be tuned using the data.

An alternative strategy is to grow a very deep tree and then prune it.
Example
Control the size of the tree

We would prune to obtain a smaller one.

[Figure: the pruned restaurant tree.]

If we stop here, not all training samples would be classified correctly.

More importantly, how do we classify a new instance?

We label the leaves of this smaller tree with the majority of training samples' labels.
Example
We stop after the root (first node).

[Figure: stopping after the root split on Patrons, the three leaves are labeled Wait: no (None), Wait: yes (Some), Wait: no (Full).]
Example
Computational Consideration
Numerical Features
We could split on any feature, with any threshold.
However, for a given feature, the only split points we need to consider are the n values in the training data for this feature.
If we sort each feature by these n values, we can quickly compute our uncertainty metric of interest (cross entropy or others).
This takes O(dn log n) time.

Categorical Features
Assuming q distinct categories, there are 2^(q−1) − 1 possible binary partitions we can consider.
However, things simplify in the case of binary classification (or regression), and we can find the optimal split (for cross entropy and Gini) by only considering q − 1 possible splits.
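The q − 1 shortcut for categorical features can be sketched as follows (with hypothetical category counts): for a binary target, sort the categories by their positive-label proportion, and only the prefix splits of that ordering need to be scored:

```python
# For binary classification, the optimal binary partition of a
# categorical feature with q categories can be found by sorting
# categories by their positive-label proportion and testing only
# the q - 1 prefix splits. Counts below are hypothetical.

counts = {             # category -> (positives, negatives)
    "French":  (1, 1),
    "Thai":    (2, 2),
    "Burger":  (3, 1),
    "Italian": (0, 2),
}

# Sort categories by P(y = 1 | category)
ordered = sorted(counts, key=lambda c: counts[c][0] / sum(counts[c]))
print(ordered)  # ['Italian', 'French', 'Thai', 'Burger']

# The q - 1 candidate partitions are the prefixes of this ordering:
candidates = [set(ordered[:k]) for k in range(1, len(ordered))]
print(candidates)
```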
Example
Summary of learning classification trees
Advantages
Easily interpretable by humans (as long as the tree is not too big)
Computationally efficient
Handles both numerical and categorical data
It is parametric and thus compact: unlike Nearest Neighbor classification, we do not have to carry our training instances around
Building block for various ensemble methods (more on this later)

Disadvantages
Heuristic training techniques
Finding the partition of space that minimizes empirical error is NP-hard.
We resort to greedy approaches with limited theoretical underpinning.
Overview - Regression Tree
Regression Tree:
Given the dataset D = (x1, y1), ..., (xn, yn), where xi ∈ R^d, yi ∈ Y = R,
minimize the error (e.g. squared loss) in each leaf.
The parameterized function is

f(x) = Σ_{j=1}^{K} βj · I(x ∈ Rj)

Using squared loss, the optimal parameters are:

βj = Σi yi · I(xi ∈ Rj) / Σi I(xi ∈ Rj)

which is the sample mean.
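The sample-mean leaf values βj can be sketched directly (the data and leaf assignment below are hypothetical):

```python
import numpy as np

# Regression-tree leaf values: with squared loss, the optimal beta_j
# is the sample mean of y over the points falling in region R_j.

y    = np.array([1.0, 2.0, 3.0, 10.0, 12.0])
leaf = np.array([0,   0,   0,   1,    1])     # hypothetical leaf ids

betas = np.array([y[leaf == j].mean() for j in range(leaf.max() + 1)])
print(betas)   # leaf 0 -> mean(1, 2, 3) = 2.0; leaf 1 -> mean(10, 12) = 11.0
```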
Methodology
Assign the prediction for each leaf
For each leaf, we need to assign the prediction ŷ which minimizes the loss for the regression problem.

Regression Tree:

ŷ = argmin_y Σ_{i∈Rk} (y − yi)² = (1 / Σi I(xi ∈ Rk)) · Σi I(xi ∈ Rk) · yi
Methodology
Growing the Tree: Overview
Ideally, we would like to find the partition that achieves minimal risk: lowest mean-squared error for the regression problem.

The number of potential partitions is too large to search exhaustively.

Greedy search heuristic for a good partition:

Start at the root.
Determine the best feature and value to split on.
Recurse on the children of the node.
Stop at some point (with heuristic pruning rules).
Growth Heuristic for Regression Trees
Algorithm for Regression Trees
Start with R1 = R^d.

For each feature j = 1, ..., d, and each value v ∈ R that we can split on:

Split the data set: I< = {i : xij < v} and I> = {i : xij ≥ v}

Estimate parameters:

β< = Σ_{i∈I<} yi / |I<|  and  β> = Σ_{i∈I>} yi / |I>|

Quality of split is measured by the squared loss:

Σ_{i∈I<} (yi − β<)² + Σ_{i∈I>} (yi − β>)²

Choose the split with minimal loss.

Recurse on both children, with (xi, yi)_{i∈I<} and (xi, yi)_{i∈I>}.
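The regression split search can be sketched for a single numeric feature, mirroring the classification version but scoring each candidate threshold by total squared loss (the data below is hypothetical):

```python
# Greedy regression split from the slide: each candidate threshold v
# yields left/right means beta_< and beta_>, and the split is scored
# by the total squared loss.

def best_regression_split(xs, ys):
    best = (None, float("inf"))
    for v in sorted(set(xs))[1:]:          # candidate thresholds
        left  = [y for x, y in zip(xs, ys) if x < v]
        right = [y for x, y in zip(xs, ys) if x >= v]
        b_lt, b_gt = sum(left) / len(left), sum(right) / len(right)
        loss = sum((y - b_lt) ** 2 for y in left) \
             + sum((y - b_gt) ** 2 for y in right)
        if loss < best[1]:
            best = (v, loss)
    return best

xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 10.0, 10.0]
print(best_regression_split(xs, ys))   # (3.0, 0.0): leaf means 0 and 10
```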
Example of Regression Trees
[Figure: Feature Space after one split, and the equivalent Regression Tree with leaf predictions y1 and y2.]
Example of Regression Trees
[Figure: Feature Space after a second split, and the equivalent Regression Tree with leaf predictions y1, y2, y3.]
Example of Regression Trees
[Figure: Feature Space after a third split, and the equivalent Regression Tree with leaf predictions y1, y2, y3, y4.]
Model Complexity
When should tree growing be stopped?
We will need to control complexity to prevent over-fitting and, in general, to find the optimal tree size with the best predictive performance.

Consider a regularized objective:

Training Error(Tree) + C × Size(Tree)

Two strategies:
1 Grow the tree from scratch and stop once the regularized objective starts to increase.
2 First grow the full tree, then prune nodes (starting at the leaves) until the objective starts to increase.

The second option is preferred, as the choice of tree is less sensitive to wrong choices of split points and variables to split on in the first stages of tree fitting.

Use cross-validation to determine the optimal C.
Model Complexity
Pruning Rule
Stop when there is one instance in each leaf (regression problem)

Stop when all the instances in each leaf have the same label (classification problem)

Stop when the number of leaves is less than a threshold

Stop when the leaf's error is less than a threshold

Stop when the number of instances in each leaf is less than a threshold

Stop when the p-value between two divided leaves is larger than a certain threshold (e.g. 0.05 or 0.01) based on a chosen statistical test
Overview
Decision Trees can be applied in many general applications (e.g. medical applications, recommendation systems).

We focus on medical applications.

1. Neurosurgery
Aim: to recommend a certain type of neurosurgery to patients who have a higher probability of exhibiting clinical improvement.

2. Heart Transplant
Aim: to match the best available heart to the patient whose survival probability is maximized.

Pruning Rule: stop when the p-value between two divided leaves is larger than a certain threshold (0.05), based on the Student t-test.
Student t-test
Student t-test for the pruning rule of decision trees

The Student t-test is a statistical hypothesis test used to determine whether two sets of data are statistically different from each other.

Usually, it is used to test the null hypothesis that the means of two populations are equal.

Therefore, when pruning decision trees, the Student t-test is used to determine whether the two resulting partitions (the populations of the different leaves) have statistically different means.

This is usually decided by the p-value (less than 0.05 or 0.01) computed by the Student t-test.
Student t-test
Details of Student t-test
For unpaired datasets and the unequal-variance setting:

Mean of group A:

ȳA = (1/NA) Σ yi · I(Xi ∈ A)

Variance of group A:

s²A = (1/(NA − 1)) Σ (yi − ȳA)² · I(Xi ∈ A)

Therefore, the t-value can be computed as follows:

t = (ȳA − ȳB) / sqrt(s²A/NA + s²B/NB)

Based on the computed t-value, the p-value is computed as:

p-value = P(Z > |t|), where Z ∼ N(0, 1)
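The t-value and the slide's normal approximation for the p-value can be sketched in a few lines (the two groups below are hypothetical leaf populations):

```python
import math

# Unequal-variance (Welch) t-statistic from the slide, with the
# slide's normal approximation for the p-value.

def welch_t(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((y - ma) ** 2 for y in a) / (na - 1)
    vb = sum((y - mb) ** 2 for y in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def p_value(t):
    # P(Z > |t|) for Z ~ N(0, 1), via the complementary error function.
    return 0.5 * math.erfc(abs(t) / math.sqrt(2))

a = [5.1, 4.9, 5.3, 5.0]    # leaf-A outcomes (hypothetical)
b = [6.2, 6.0, 6.4, 6.1]    # leaf-B outcomes (hypothetical)
t = welch_t(a, b)
print(round(t, 2), p_value(t) < 0.05)   # clearly separated means -> prune test rejects equality
```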
Neurosurgery
Neurosurgery: Block Diagram
[Block diagram: patient information → predictive model (decision tree) → prediction of the success probability of neurosurgery → recommend neurosurgery only for certain patient groups.]
Neurosurgery
Neurosurgery: Achieved Decision Tree
[Figure: the learned tree. Internal splits include SRS Image Score ≤ 3.45, SRS Image Score ≤ 2.9, SRS Mental Score ≤ 4.1, Trunk Shift ≤ 1.55, Sacral Slope ≤ 49.5, and Weight ≤ 76.5; leaf success probabilities range from 24.0% to 91.3%.]
Heart Transplant
Heart Transplant: Block Diagram
[Block diagram: patients/donors information → predictive model (decision tree) → prediction of the success of the heart transplant → optimally match patients and donors to minimize mortality.]
Heart Transplant
Heart Transplant: Achieved Decision Tree
[Figure: the learned tree. Internal splits include Donor Age ≤ 28, No Ventilator Assist, Previous Transplant ≤ 0, No Dialysis in Listing, Creatinine ≤ 1.36, No Diabetes, and No Donor HEP C Antigen; leaf success probabilities range from 13.9% to 91.2%.]