Classification Tree Regression Tree Medical Applications of CART
Classification and Regression Trees
Mihaela van der Schaar
Department of Engineering Science, University of Oxford
March 1, 2017
Mihaela van der Schaar Classification and Regression Trees
Overview

Many decisions are tree structures, e.g. medical treatment:

Fever: if T > 100 → Treatment #1; if T < 100, check Muscle Pain: if High → Treatment #2, if Low → Treatment #3.
Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 1, 2015 20 / 42
Overview
A Decision Tree is a hierarchically organized structure, with each node splitting the data space into pieces based on the value of a feature.

Equivalent to a partition of R^d into K disjoint feature subspaces {R1, ..., RK}, where each Rj ⊂ R^d.

On each feature subspace Rj, the same decision/prediction is made for all x ∈ Rj.
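The partition view can be sketched directly. In this minimal example the regions, thresholds, and leaf values are all hypothetical and hand-built; prediction is just a lookup of which region x falls in:

```python
# A fitted tree is equivalent to a partition {R_1, ..., R_K} of the
# feature space: prediction looks up which region x falls in.
# Hypothetical hand-built regions for R^2.

def region_index(x, thresholds=(0.5, 0.3)):
    """Return which of 3 hypothetical axis-aligned regions x lies in."""
    t1, t2 = thresholds
    if x[0] <= t1:
        return 0                     # R1: x1 <= t1
    return 1 if x[1] <= t2 else 2    # R2 / R3: split on x2

# One constant prediction per region (the beta_j of later slides)
leaf_values = [0.0, 1.0, 2.0]

def predict(x):
    return leaf_values[region_index(x)]

print(predict((0.2, 0.9)))  # falls in R1 -> 0.0
print(predict((0.8, 0.1)))  # falls in R2 -> 1.0
```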
Terminology
Parent of a node c is the immediate predecessor node.

Children of a node c are the immediate successors of c, equivalently nodes which have c as a parent.

Root node is the top node of the tree; the only node without parents.

Leaf nodes are nodes which do not have children.

A K-ary tree is a tree where each node (except for leaf nodes) has K children. Usually we work with binary trees (K = 2).

Depth of a tree is the maximal length of a path from the root node to a leaf node.
Terminology

[Figure: a binary tree of depth 2, annotating the root node, a node C with its parent and children, and the leaf nodes.]
Terminology
The leaves of a tree partition the feature space
[Figure: the feature space (x1, x2) is cut into regions A–E by thresholds θ1, θ2, θ3, θ4; the equivalent tree tests x1 > θ1, then x2 > θ3, x1 ≤ θ4, and x2 ≤ θ2 on the way down to leaves A, B, C, D, E.]
Learning
Learning a tree model
Three things to learn:
1 The structure of the tree.

2 The threshold values (θi).

3 The values for the leaves (A, B, ...).
Overview - Classification Tree
Classification Tree:
Given the dataset D = (x1, y1), ..., (xn, yn), where xi ∈ R^d, yi ∈ Y = {1, ..., m},
minimize the misclassification error in each leaf.
The estimated probability of each class k in region Rj is simply:

βjk = Σi I(yi = k) · I(xi ∈ Rj) / Σi I(xi ∈ Rj)

This is the frequency with which label k occurs in the leaf Rj.

These estimates can be regularized as well.
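As a minimal sketch (with a hypothetical leaf assignment), the estimates βjk are just per-leaf label frequencies:

```python
import numpy as np

# Leaf class-probability estimates beta_{jk}: for each leaf region R_j,
# beta_{jk} is the fraction of training points in R_j with label k.
# 'leaf' is a hypothetical leaf assignment for each training point.

y    = np.array([0, 0, 1, 1, 1, 0])   # labels in {0, 1}
leaf = np.array([0, 0, 0, 1, 1, 1])   # which region R_j each x_i fell in

def leaf_probs(y, leaf, n_classes=2):
    K = leaf.max() + 1
    beta = np.zeros((K, n_classes))
    for j in range(K):
        in_leaf = (leaf == j)
        for k in range(n_classes):
            beta[j, k] = np.mean(y[in_leaf] == k)
    return beta

print(leaf_probs(y, leaf))
# leaf 0 holds labels [0, 0, 1] -> probs [2/3, 1/3]
# leaf 1 holds labels [1, 1, 0] -> probs [1/3, 2/3]
```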
Example
Decide whether to wait for a table at a restaurant, based on the following attributes (example from Russell and Norvig, AIMA):

Alternate: is there an alternative restaurant nearby?
Bar: is there a comfortable bar area to wait in?
Fri/Sat: is today Friday or Saturday?
Hungry: are we hungry?
Patrons: number of people in the restaurant (None, Some, Full)
Price: price range ($, $$, $$$)
Raining: is it raining outside?
Reservation: have we made a reservation?
Type: kind of restaurant (French, Italian, Thai, Burger)
WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Example
A tree model for deciding where to eat

Examples are described by attribute values (binary, discrete, continuous).

[Figure: the restaurant dataset from Russell & Norvig, AIMA]

Classification of examples is positive (T) or negative (F).
Example
First decision: at the root of the tree

Which attribute to split on?

Idea: use information gain to choose which attribute to split on.

Patrons? is a better choice - it gives information about the classification.
Example
Information gain
A chosen attribute A divides the training set E into subsets E1, ..., Ev according to their values for A, where A has v distinct values.

remainder(A) = Σ_{i=1}^{v} (pi + ni)/(p + n) × I(pi/(pi + ni), ni/(pi + ni))

Information Gain (IG) or reduction in uncertainty from the attribute test:

IG(A) = I(p/(p + n), n/(p + n)) − remainder(A)

Choose the attribute with the largest IG(A).
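The remainder and IG formulas can be sketched directly, with I(p, n) taken as the binary entropy of the positive/negative proportions; the counts below are the restaurant example's 6 positive and 6 negative instances:

```python
import math

# Information gain for a candidate attribute, following the slide's
# remainder(A) formula. I(p, n) is the binary entropy, in bits.

def I(p_frac, n_frac):
    """Binary entropy of a (p, n) proportion pair, in bits."""
    h = 0.0
    for q in (p_frac, n_frac):
        if q > 0:
            h -= q * math.log2(q)
    return h

def info_gain(p, n, branches):
    """branches: list of (p_i, n_i) counts after splitting on A."""
    remainder = sum((pi + ni) / (p + n) * I(pi / (pi + ni), ni / (pi + ni))
                    for pi, ni in branches)
    return I(p / (p + n), n / (p + n)) - remainder

# Restaurant example: 6 positive, 6 negative overall.
# Splitting on Patrons gives branches None=(0,2), Some=(4,0), Full=(2,4).
print(round(info_gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))  # 0.541
# Splitting on Type gives four branches, each with p_i = n_i.
print(info_gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]))    # ≈ 0
```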
Example
How to measure information gain?
Idea: gaining information reduces uncertainty. Use entropy to measure uncertainty.

If a random variable X has K different values (a1, ..., aK), its entropy is given by

H[X] = − Σ_{k=1}^{K} P(X = ak) × log P(X = ak)

Different measures of uncertainty:
Misclassification error: 1 − max_k P(X = ak)
Gini index: Σ_{k=1}^{K} 2 P(X = ak)(1 − P(X = ak))

C4.5: classification tree that uses entropy to measure uncertainty.

CART: classification tree that uses the Gini index to measure uncertainty.
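The three uncertainty measures above can be sketched for a discrete distribution given as a list of class probabilities (the Gini form here follows the slide's Σ 2q(1 − q) convention):

```python
import math

# Three impurity measures from the slide, for a discrete distribution
# given as a list of class probabilities summing to 1.

def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

def gini(p):
    # As written on the slide: sum of 2*q*(1 - q) over classes.
    return sum(2 * q * (1 - q) for q in p)

def misclassification(p):
    return 1 - max(p)

dist = [0.5, 0.5]
print(entropy(dist))            # 1.0 bit: maximal uncertainty
print(gini(dist))               # 1.0 under the slide's 2q(1-q) convention
print(misclassification(dist))  # 0.5
```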
Example
Learning trees
[Figure: the uncertainty measures for two-class classification, as a function of the proportion p in class 1.]
Example
Partitions and CART

[Figure omitted.]
Example
Algorithm for Classification Trees
Start with R1 = R^d.

For each feature j = 1, ..., d, and each value v ∈ R that we can split on:

Split the data set: I< = {i : xij < v} and I> = {i : xij ≥ v}

Estimate the parameters:

β< = Σi I(yi = 1) · I(i ∈ I<) / Σi I(i ∈ I<)  and  β> = Σi I(yi = 1) · I(i ∈ I>) / Σi I(i ∈ I>)

Quality of split is measured by the weighted sum of the uncertainty:

|I<|/(|I<| + |I>|) × I(β<, 1 − β<) + |I>|/(|I<| + |I>|) × I(β>, 1 − β>)

Choose the split with the minimal weighted sum of the uncertainty.
Recurse on both children, with (xi, yi)_{i∈I<} and (xi, yi)_{i∈I>}.
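The split search above can be sketched for a single numeric feature: try each observed value as threshold v, estimate β< and β>, and score the split by the weighted binary entropy of the two sides (the data below is hypothetical):

```python
import math

# Greedy classification split from the slide, for one numeric feature.

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(xs, ys):
    """xs: feature values, ys: 0/1 labels. Returns (threshold, score)."""
    best = (None, float("inf"))
    for v in sorted(set(xs))[1:]:          # candidate thresholds
        left  = [y for x, y in zip(xs, ys) if x < v]
        right = [y for x, y in zip(xs, ys) if x >= v]
        b_lt, b_gt = sum(left) / len(left), sum(right) / len(right)
        n = len(xs)
        score = (len(left) / n) * binary_entropy(b_lt) \
              + (len(right) / n) * binary_entropy(b_gt)
        if score < best[1]:
            best = (v, score)
    return best

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0,   0,   0,   1,   1,   1]
print(best_split(xs, ys))   # (4.0, 0.0): splitting at x < 4 is pure
```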
Example
Which attribute to split on?

Idea: use information gain to choose which attribute to split on.

Patrons vs. Type?
By choosing Patrons, we end up with a partition (3 branches) with smaller entropy, i.e. smaller uncertainty (0.45 bit).
By choosing Type, we end up with an uncertainty of 1 bit.
Thus, we choose Patrons over Type.
Example
Uncertainty if we go with "Patrons"

For the "None" branch:

−( (0/(0+2)) log (0/(0+2)) + (2/(0+2)) log (2/(0+2)) ) = 0

For the "Some" branch:

−( (4/(4+0)) log (4/(4+0)) + (0/(4+0)) log (0/(4+0)) ) = 0

For the "Full" branch:

−( (2/(2+4)) log (2/(2+4)) + (4/(2+4)) log (4/(2+4)) ) ≈ 0.9

For choosing "Patrons", take the weighted average of the branches; this quantity is called the conditional entropy:

(2/12) × 0 + (4/12) × 0 + (6/12) × 0.9 = 0.45
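The weighted average above can be checked in a few lines, using the branch counts (positive, negative) None=(0,2), Some=(4,0), Full=(2,4):

```python
import math

# Conditional entropy H[Y | Patrons] for the restaurant example.

def branch_entropy(p, n):
    h = 0.0
    for c in (p, n):
        if c > 0:
            q = c / (p + n)
            h -= q * math.log2(q)
    return h

branches = [(0, 2), (4, 0), (2, 4)]
total = sum(p + n for p, n in branches)          # 12 examples
cond_H = sum((p + n) / total * branch_entropy(p, n) for p, n in branches)
print(round(cond_H, 2))   # 0.46; the slide rounds the Full branch to 0.9
```

The slide's 0.45 comes from rounding the Full-branch entropy (≈ 0.918) to 0.9 before averaging; the unrounded value is ≈ 0.459.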
Example
Conditional entropy
Definition. Given two random variables X and Y,

H[Y | X] = Σk P(X = ak) × H[Y | X = ak]

In our example:
X: the attribute to be split
Y: wait or not

Relation to information gain:

Gain = H[Y] − H[Y | X]

When H[Y] is fixed, we need only compare conditional entropies.
Example
Conditional entropy for "Type"

For the "French" and "Italian" branches:

−( (1/(1+1)) log (1/(1+1)) + (1/(1+1)) log (1/(1+1)) ) = 1

For the "Thai" and "Burger" branches:

−( (2/(2+2)) log (2/(2+2)) + (2/(2+2)) log (2/(2+2)) ) = 1

For choosing "Type", the weighted average of the branches (the conditional entropy) is:

(2/12) × 1 + (2/12) × 1 + (4/12) × 1 + (4/12) × 1 = 1
Example
Next split?

Do we split on "None" or "Some"?

No, we do not: the decision is deterministic, as seen from the training data.

We will look only at the 6 instances with Patrons == Full.
Example
Greedily we build the tree and get this:

[Figure: the learned restaurant tree.]
Example
What is the optimal Tree Depth?
We need to be careful to pick an appropriate tree depth.
If the tree is too deep, we can overfit.

If the tree is too shallow, we underfit.

Max depth is a hyper-parameter that should be tuned using the data.

An alternative strategy is to grow a very deep tree and then prune it.
Example
Control the size of the tree

We would prune to obtain a smaller one.

[Figure: the pruned restaurant tree.]

If we stop here, not all training samples would be classified correctly.

More importantly, how do we classify a new instance?

We label the leaves of this smaller tree with the majority of training samples' labels.
Example
We stop after the root (first node).

[Figure: stopping after the root split on Patrons, the three leaves are labeled Wait: no (None), Wait: yes (Some), Wait: no (Full).]
Example
Computational Consideration
Numerical Features
We could split on any feature, with any threshold.
However, for a given feature, the only split points we need to consider are the n values in the training data for this feature.
If we sort each feature by these n values, we can quickly compute our uncertainty metric of interest (cross entropy or others).
This takes O(dn log n) time.

Categorical Features
Assuming q distinct categories, there are 2^(q−1) − 1 possible binary partitions we can consider.
However, things simplify in the case of binary classification (or regression), and we can find the optimal split (for cross entropy and Gini) by only considering q − 1 possible splits.
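The q − 1 shortcut for categorical features can be sketched as follows (with hypothetical category counts): for a binary target, sort the categories by their positive-label proportion, and only the prefix splits of that ordering need to be scored:

```python
# For binary classification, the optimal binary partition of a
# categorical feature with q categories can be found by sorting
# categories by their positive-label proportion and testing only
# the q - 1 prefix splits. Counts below are hypothetical.

counts = {             # category -> (positives, negatives)
    "French":  (1, 1),
    "Thai":    (2, 2),
    "Burger":  (3, 1),
    "Italian": (0, 2),
}

# Sort categories by P(y = 1 | category)
ordered = sorted(counts, key=lambda c: counts[c][0] / sum(counts[c]))
print(ordered)  # ['Italian', 'French', 'Thai', 'Burger']

# The q - 1 candidate partitions are the prefixes of this ordering:
candidates = [set(ordered[:k]) for k in range(1, len(ordered))]
print(candidates)
```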
Example
Summary of learning classification trees
Advantages
Easily interpretable by humans (as long as the tree is not too big)
Computationally efficient
Handles both numerical and categorical data
It is parametric and thus compact: unlike Nearest Neighbor classification, we do not have to carry our training instances around
Building block for various ensemble methods (more on this later)

Disadvantages
Heuristic training techniques
Finding the partition of space that minimizes empirical error is NP-hard.
We resort to greedy approaches with limited theoretical underpinning.
Overview - Regression Tree
Regression Tree:
Given the dataset D = (x1, y1), ..., (xn, yn), where xi ∈ R^d, yi ∈ Y = R,
minimize the error (e.g. squared loss) in each leaf.
The parameterized function is

f(x) = Σ_{j=1}^{K} βj · I(x ∈ Rj)

Using squared loss, the optimal parameters are:

βj = Σi yi · I(xi ∈ Rj) / Σi I(xi ∈ Rj)

which is the sample mean.
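The sample-mean leaf values βj can be sketched directly (the data and leaf assignment below are hypothetical):

```python
import numpy as np

# Regression-tree leaf values: with squared loss, the optimal beta_j
# is the sample mean of y over the points falling in region R_j.

y    = np.array([1.0, 2.0, 3.0, 10.0, 12.0])
leaf = np.array([0,   0,   0,   1,    1])     # hypothetical leaf ids

betas = np.array([y[leaf == j].mean() for j in range(leaf.max() + 1)])
print(betas)   # leaf 0 -> mean(1, 2, 3) = 2.0; leaf 1 -> mean(10, 12) = 11.0
```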
Methodology
Assign the prediction for each leaf
For each leaf, we need to assign the prediction ŷ which minimizes the loss for the regression problem.

Regression Tree:

ŷ = argmin_y Σ_{i∈Rk} (y − yi)² = (1 / Σi I(xi ∈ Rk)) · Σi I(xi ∈ Rk) · yi
Methodology
Growing the Tree: Overview
Ideally, we would like to find the partition that achieves minimal risk: lowest mean-squared error for the regression problem.

The number of potential partitions is too large to search exhaustively.

Greedy search heuristic for a good partition:

Start at the root.
Determine the best feature and value to split on.
Recurse on the children of the node.
Stop at some point (with heuristic pruning rules).
Growth Heuristic for Regression Trees
Algorithm for Regression Trees
Start with R1 = R^d.

For each feature j = 1, ..., d, and each value v ∈ R that we can split on:

Split the data set: I< = {i : xij < v} and I> = {i : xij ≥ v}

Estimate parameters:

β< = Σ_{i∈I<} yi / |I<|  and  β> = Σ_{i∈I>} yi / |I>|

Quality of split is measured by the squared loss:

Σ_{i∈I<} (yi − β<)² + Σ_{i∈I>} (yi − β>)²

Choose the split with minimal loss.

Recurse on both children, with (xi, yi)_{i∈I<} and (xi, yi)_{i∈I>}.
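The regression split search can be sketched for a single numeric feature, mirroring the classification version but scoring each candidate threshold by total squared loss (the data below is hypothetical):

```python
# Greedy regression split from the slide: each candidate threshold v
# yields left/right means beta_< and beta_>, and the split is scored
# by the total squared loss.

def best_regression_split(xs, ys):
    best = (None, float("inf"))
    for v in sorted(set(xs))[1:]:          # candidate thresholds
        left  = [y for x, y in zip(xs, ys) if x < v]
        right = [y for x, y in zip(xs, ys) if x >= v]
        b_lt, b_gt = sum(left) / len(left), sum(right) / len(right)
        loss = sum((y - b_lt) ** 2 for y in left) \
             + sum((y - b_gt) ** 2 for y in right)
        if loss < best[1]:
            best = (v, loss)
    return best

xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 10.0, 10.0]
print(best_regression_split(xs, ys))   # (3.0, 0.0): leaf means 0 and 10
```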
Example of Regression Trees
[Figure: Feature Space after one split, and the equivalent Regression Tree with leaf predictions y1 and y2.]
Example of Regression Trees
[Figure: Feature Space after a second split, and the equivalent Regression Tree with leaf predictions y1, y2, y3.]
Example of Regression Trees
[Figure: Feature Space after a third split, and the equivalent Regression Tree with leaf predictions y1, y2, y3, y4.]
Model Complexity
When should tree growing be stopped?
We will need to control complexity to prevent over-fitting and, in general, to find the optimal tree size with the best predictive performance.

Consider a regularized objective:

Training Error(Tree) + C × Size(Tree)

Two strategies:
1 Grow the tree from scratch and stop once the regularized objective starts to increase.
2 First grow the full tree, then prune nodes (starting at the leaves) until the objective starts to increase.

The second option is preferred, as the choice of tree is less sensitive to wrong choices of split points and variables to split on in the first stages of tree fitting.

Use cross-validation to determine the optimal C.
Model Complexity
Pruning Rule
Stop when there is one instance in each leaf (regression problem)

Stop when all the instances in each leaf have the same label (classification problem)

Stop when the number of leaves is less than a threshold

Stop when the leaf's error is less than a threshold

Stop when the number of instances in each leaf is less than a threshold

Stop when the p-value between two divided leaves is larger than a certain threshold (e.g. 0.05 or 0.01) based on a chosen statistical test
Overview
Decision Trees can be applied in many general applications (e.g. medical applications, recommendation systems).

We focus on medical applications.

1. Neurosurgery
Aim: to recommend a certain type of neurosurgery to patients who have a higher probability of exhibiting clinical improvement.

2. Heart Transplant
Aim: to match the best available heart to the patient whose survival probability is maximized.

Pruning Rule: stop when the p-value between two divided leaves is larger than a certain threshold (0.05), based on the Student t-test.
Student t-test
Student t-test for the pruning rule of decision trees

The Student t-test is a statistical hypothesis test used to determine whether two sets of data are statistically different from each other.

Usually, it is used to test the null hypothesis that the means of two populations are equal.

Therefore, when pruning decision trees, the Student t-test is used to determine whether the two resulting partitions (the populations of the different leaves) have statistically different means.

This is usually decided by the p-value (less than 0.05 or 0.01) computed by the Student t-test.
Student t-test
Details of Student t-test
For unpaired datasets and the unequal-variance setting:

Mean of group A:

ȳA = (1/NA) Σ yi · I(Xi ∈ A)

Variance of group A:

s²A = (1/(NA − 1)) Σ (yi − ȳA)² · I(Xi ∈ A)

Therefore, the t-value can be computed as follows:

t = (ȳA − ȳB) / sqrt(s²A/NA + s²B/NB)

Based on the computed t-value, the p-value is computed as:

p-value = P(Z > |t|), where Z ∼ N(0, 1)
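The t-value and the slide's normal approximation for the p-value can be sketched in a few lines (the two groups below are hypothetical leaf populations):

```python
import math

# Unequal-variance (Welch) t-statistic from the slide, with the
# slide's normal approximation for the p-value.

def welch_t(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((y - ma) ** 2 for y in a) / (na - 1)
    vb = sum((y - mb) ** 2 for y in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def p_value(t):
    # P(Z > |t|) for Z ~ N(0, 1), via the complementary error function.
    return 0.5 * math.erfc(abs(t) / math.sqrt(2))

a = [5.1, 4.9, 5.3, 5.0]    # leaf-A outcomes (hypothetical)
b = [6.2, 6.0, 6.4, 6.1]    # leaf-B outcomes (hypothetical)
t = welch_t(a, b)
print(round(t, 2), p_value(t) < 0.05)   # clearly separated means -> prune test rejects equality
```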
Neurosurgery
Neurosurgery: Block Diagram
[Block diagram: patient information → predictive model (decision tree) → prediction of the success probability of neurosurgery → recommend neurosurgery only for certain patient groups.]
Neurosurgery
Neurosurgery: Achieved Decision Tree
[Figure: the learned tree. Internal splits include SRS Image Score ≤ 3.45, SRS Image Score ≤ 2.9, SRS Mental Score ≤ 4.1, Trunk Shift ≤ 1.55, Sacral Slope ≤ 49.5, and Weight ≤ 76.5; leaf success probabilities range from 24.0% to 91.3%.]
Heart Transplant
Heart Transplant: Block Diagram
[Block diagram: patients/donors information → predictive model (decision tree) → prediction of the success of the heart transplant → optimally match patients and donors to minimize mortality.]
Heart Transplant
Heart Transplant: Achieved Decision Tree
[Figure: the learned tree. Internal splits include Donor Age ≤ 28, No Ventilator Assist, Previous Transplant ≤ 0, No Dialysis in Listing, Creatinine ≤ 1.36, No Diabetes, and No Donor HEP C Antigen; leaf success probabilities range from 13.9% to 91.2%.]