Decision Trees - Sewoong Oh, CSE/STAT 416, University of Washington


Page 1:

Decision Trees

Sewoong Oh

CSE/STAT 416, University of Washington

Page 3:

Decision trees: An interpretable predictor

Page 4:

Example: predicting potential loan defaults
• Data: discrete for now
• Goal: Given a new loan application, predict whether the applicant will default on the loan
• Learning: fitting a model to data
• Inference: making a prediction using a learned model

Credit     Term   Income  y
excellent  3 yrs  high    safe
fair       5 yrs  low     risky
fair       3 yrs  high    safe
poor       5 yrs  high    risky
excellent  3 yrs  low     risky
fair       5 yrs  low     safe
poor       3 yrs  high    risky
poor       5 yrs  low     safe
fair       3 yrs  high    safe

• Credit: Did I pay previous loans on time? Example: excellent, good, or fair
• Income: What's my income? Example: > $80K per year
• Term: How soon do I need to pay the loan? Example: 3 years, 5 years, …

predictor: f(x) ≈ y

Page 5:

Decision tree

f(poor credit, high income, 3 years) = ?

Start
Credit?
  excellent → Safe
  fair → Term?
    3 years → Risky
    5 years → Safe
  poor → Income?
    low → Risky
    high → Term?
      3 years → Risky
      5 years → Safe

• Each internal node tests a feature x[i]
• Each branch assigns a feature value x[i]=fair (or a subset of feature values {fair, poor})
• Each leaf node assigns a class y
• To predict, traverse the tree from root to a leaf
• Decision trees are naturally human interpretable!
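To make the traversal concrete, here is a minimal sketch (not from the slides) that encodes the tree above as nested Python dicts, with the assumed convention that an internal node stores the tested feature and its branches, while a leaf stores the class label.

```python
# A minimal sketch (assumed representation): the loan-default tree above as nested dicts.
TREE = {
    "feature": "credit",
    "branches": {
        "excellent": "safe",
        "fair": {"feature": "term",
                 "branches": {"3 years": "risky", "5 years": "safe"}},
        "poor": {"feature": "income",
                 "branches": {"low": "risky",
                              "high": {"feature": "term",
                                       "branches": {"3 years": "risky",
                                                    "5 years": "safe"}}}},
    },
}

def predict(node, x):
    """Traverse from the root to a leaf, following the branch that matches x."""
    while isinstance(node, dict):                  # internal node: test a feature
        node = node["branches"][x[node["feature"]]]
    return node                                    # leaf: the predicted class

# The query from this slide: poor credit, high income, 3-year term.
print(predict(TREE, {"credit": "poor", "income": "high", "term": "3 years"}))  # risky
```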

Page 6:

What functions can be represented?
• For discrete input and output data, any function of the input can be represented as a decision tree
• However, in general, it could require exponentially many nodes to represent an arbitrary function (exponential in the dimension of the input)
• For example, a function that is sensitive to a small change in the input

Parity function (the label flips whenever any single input bit flips):

x[1]  x[2]  x[3]  Y
0     0     0     0
0     0     1     1
0     1     0     1
1     0     0     1
0     1     1     0
1     0     1     0
1     1     0     0
1     1     1     1

Its decision tree must test x[1], then x[2], then x[3] on every path, giving 8 leaves with labels 0 1 1 0 1 0 0 1.

Simple function (Y depends only on x[1]):

x[1]  x[2]  x[3]  Y
0     0     0     1
0     0     1     1
0     1     0     1
1     0     0     0
0     1     1     1
1     0     1     0
1     1     0     0
1     1     1     0

Its decision tree is a single split on x[1]: x[1] = 0 → 1, x[1] = 1 → 0.

Page 7:

Which tree is better?
• Trade-off between accuracy vs. simplicity
• Accuracy is measured by the fraction of correct predictions (formula below)
• Simplicity can be measured by depth, number of leaves, etc.
• If your decision tree has just the root node, what should the decision be, and what are the accuracy and the error?

accuracy = (# of correct predictions) / (# of examples)

Training data (visualizing a decision tree and data): N = 40 examples, with 22 Safe loans and 18 Risky loans. The root node is drawn with its label counts: Root (Safe 22, Risky 18).
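As a quick check of the root-only question above, here is a tiny sketch (assumed helper names, not from the slides): with no splits, the tree predicts the majority class, and its accuracy is the majority count divided by N.

```python
# A minimal sketch: majority-rule prediction and accuracy at a single node.
counts = {"safe": 22, "risky": 18}          # label counts at the root (N = 40)

prediction = max(counts, key=counts.get)    # majority class
accuracy = counts[prediction] / sum(counts.values())

print(prediction, accuracy)                 # safe 0.55  (so the error is 18/40 = 0.45)
```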

Page 8:

Decision stump: single-level tree
• We grow the tree by adding one more level of branching, and deciding which hypothesis (or feature) to test at the branch
• At an intermediate node, the prediction is determined by the majority rule
• In a greedy approach, we choose the hypothesis that gives better accuracy at the intermediate nodes; credit: 32/40 = 0.8, term: 30/40 = 0.75

Choice 1: Split on Credit
Root (Safe 22, Risky 18)
  excellent: 9 Safe, 0 Risky → Safe
  fair: 9 Safe, 4 Risky → Safe
  poor: 4 Safe, 14 Risky → Risky

Choice 2: Split on Term
Root (Safe 22, Risky 18)
  3 years: 16 Safe, 4 Risky → Safe
  5 years: 6 Safe, 14 Risky → Risky

(The nodes at the ends of the new branches are intermediate nodes; the node being split is an internal node.)
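A small sketch of the greedy comparison above (the helper split_accuracy is an assumed name, not from the slides): each branch predicts its majority label, and a candidate split is scored by the fraction of the 40 loans it gets right.

```python
# Score a candidate split by majority-rule accuracy, given (safe, risky) counts per branch.
def split_accuracy(branch_counts, n_total):
    # Each branch predicts its majority label, so it gets max(counts) examples right.
    correct = sum(max(safe, risky) for safe, risky in branch_counts)
    return correct / n_total

# Counts from this slide (40 loans total):
credit_split = [(9, 0), (9, 4), (4, 14)]   # excellent, fair, poor
term_split   = [(16, 4), (6, 14)]          # 3 years, 5 years

print(split_accuracy(credit_split, 40))    # 0.8  -> the greedy algorithm picks Credit
print(split_accuracy(term_split, 40))      # 0.75
```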

Page 9:

Greedy algorithm for growing a decision tree
• Start with the root node as an intermediate node
• Repeat while there exists an intermediate node:
  • Choose a feature x[i] to split at the intermediate node that maximizes the accuracy
  • Change the intermediate node into an internal node branching on x[i]
  • Add intermediate nodes to each branch
  • If an intermediate node meets the stopping rule, change it to a leaf node and make a prediction
• Stopping rule:
  • 1. Do not branch if, at that intermediate node, all data have the same label (perfect prediction)
  • 2. Do not branch if there is no feature left to branch on
A recursive sketch of this procedure is given below.
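Here is that recursive sketch, under the assumption that data is a list of (feature-dict, label) pairs; the returned tree uses the same nested-dict format as the predict() sketch on page 5. This is an illustration, not the course's reference implementation.

```python
from collections import Counter

def majority(data):
    return Counter(label for _, label in data).most_common(1)[0][0]

def grow_tree(data, features):
    labels = {label for _, label in data}
    if len(labels) == 1 or not features:          # stopping rules 1 and 2
        return majority(data)                     # leaf: majority prediction

    def accuracy_of(feature):
        # Majority-rule accuracy of the stump that splits this node's data on `feature`.
        groups = {}
        for x, y in data:
            groups.setdefault(x[feature], []).append(y)
        correct = sum(Counter(ys).most_common(1)[0][1] for ys in groups.values())
        return correct / len(data)

    best = max(features, key=accuracy_of)         # greedy choice of split
    node = {"feature": best, "branches": {}}
    remaining = [f for f in features if f != best]
    for value in {x[best] for x, _ in data}:
        subset = [(x, y) for x, y in data if x[best] == value]
        node["branches"][value] = grow_tree(subset, remaining)
    return node

# Usage (with `loans` in the same format as the table on page 4):
#   tree = grow_tree(loans, ["credit", "term", "income"])
```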

Page 10:

Greedy approach

Split on Credit at the root (Safe 22, Risky 18):
  excellent: 9 Safe, 0 Risky → Safe (all data points are Safe, so nothing else to do with this subset of data)
  fair: 9 Safe, 4 Risky
  poor: 4 Safe, 14 Risky

Page 11:

Greedy approach

Split on Credit at the root (Safe 22, Risky 18):
  excellent: 9 Safe, 0 Risky → Safe
  fair: 9 Safe, 4 Risky → build a decision stump with the subset of data where Credit = fair
  poor: 4 Safe, 14 Risky → build a decision stump with the subset of data where Credit = poor

Page 12:

End of second level

Root (Safe 22, Risky 18), split on Credit:
  excellent: 9 Safe, 0 Risky → Safe
  fair: 9 Safe, 4 Risky → split on Term:
    3 years: 0 Safe, 4 Risky → Risky
    5 years: 9 Safe, 0 Risky → Safe
  poor: 4 Safe, 14 Risky → split on Income:
    high: 4 Safe, 5 Risky → build another stump on these data points
    low: 0 Safe, 9 Risky → Risky

Page 13:

Final decision tree

• Branching only increases the accuracy (and decreases the error)
• The final accuracy is 37/40
• Why not branch further?

Root (Safe 22, Risky 18), split on Credit:
  excellent: 9 Safe, 0 Risky → Safe
  fair: 9 Safe, 4 Risky → split on Term:
    3 years: 0 Safe, 4 Risky → Risky
    5 years: 9 Safe, 0 Risky → Safe
  poor: 4 Safe, 14 Risky → split on Income:
    low: 0 Safe, 9 Risky → Risky
    high: 4 Safe, 5 Risky → split on Term:
      3 years: 0 Safe, 2 Risky → Risky
      5 years: 4 Safe, 3 Risky → Safe

Page 14:

Another potential early stopping rule
• If a branching does not increase accuracy, should we still branch?

XOR data:

x[1]   x[2]   y
FALSE  FALSE  FALSE
FALSE  TRUE   TRUE
TRUE   FALSE  TRUE
TRUE   TRUE   FALSE

Root: 2 2 (two examples of each label)

Tree     Classification error
(root)   0.5

Page 15:

Another potential early stopping rule
• If a branching does not increase accuracy, should we still branch?

Tree           Classification error
(root)         0.5
Split on x[1]  0.5

(Same XOR data as the previous page.) Splitting the root (2 2) on x[1] gives branches x[1]=False (1 1) and x[1]=True (1 1); each branch is still an even mix of labels, so the error stays 0.5.

Page 16:

Another potential early stopping rule
• If a branching does not increase accuracy, should we still branch? Yes.

(Same XOR data.) Splitting the root (2 2) on x[1] and then each branch on x[2] separates all four examples:

  x[1]=False, x[2]=False → predict False
  x[1]=False, x[2]=True  → predict True
  x[1]=True,  x[2]=False → predict True
  x[1]=True,  x[2]=True  → predict False

Tree                Classification error
(root)              0.5
Split on x[1]       0.5
Split on x[1],x[2]  0

Page 17:

Decision trees on real-valued data

Page 18:

Binary branching on real-valued data

Income  Credit     Term   y
$105K   excellent  3 yrs  Safe
$112K   good       5 yrs  Risky
$73K    fair       3 yrs  Safe
$69K    excellent  5 yrs  Safe
$217K   excellent  3 yrs  Risky
$120K   good       5 yrs  Safe
$64K    fair       3 yrs  Risky
$340K   excellent  5 yrs  Safe
$60K    good       3 yrs  Risky

Root (Safe 22, Risky 18), split on Income:
  < $60K: 8 Safe, 13 Risky
  >= $60K: 14 Safe, 5 Risky (the subset of data with Income >= $60K)

• Is there any gain in ternary or k-ary branching?

Page 19:

• Suppose we want to branch on a real-valued feature x[i]; how do we choose the threshold?
• Between two adjacent data points, it does not make any difference where you split
• So there are really only a finite number of candidate thresholds
• We choose the one with the lowest error (or maximum accuracy); a small sketch follows the figure below

(Figure: data points labeled Safe/Risky along the Income axis from $10K to $120K; two candidate thresholds vA and vB that fall between the same pair of adjacent points produce exactly the same split.)
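A minimal sketch of the threshold search (the helper best_threshold is an assumed name, not from the slides): candidate thresholds are midpoints between consecutive distinct sorted values, and each candidate stump is scored by majority-rule classification error. The toy data is the nine-loan table from page 18, with income in thousands of dollars.

```python
def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    # Candidate thresholds: midpoints between consecutive distinct sorted values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:]) if a != b]

    def error(t):
        # Each side of the split predicts its majority label.
        left  = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        mistakes = sum(min(side.count("safe"), side.count("risky"))
                       for side in (left, right) if side)
        return mistakes / len(pairs)

    return min(candidates, key=error)

# The nine loans from the table on page 18 (income in $K):
income = [60, 64, 69, 73, 105, 112, 120, 217, 340]
label  = ["risky", "risky", "safe", "safe", "safe", "risky", "safe", "risky", "safe"]
print(best_threshold(income, label))   # 66.5 (2 mistakes out of 9)
```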

Page 20:

Growing a decision tree for real-valued data

(Figure: training data plotted with Age on the horizontal axis and Income on the vertical axis ($0K to $80K); the splits below appear as axis-aligned cuts of this plane. Panels: Input data, Step 1, Step 2.)

Step 1: split the root (counts 9 8) on Age at 38; the age < 38 region predicts Risky and the age >= 38 region predicts Safe.
Step 2: further split the age >= 38 node (counts 4 5) on Income at $48K: income < $48K (4 0) vs. income >= $48K (0 5); the age < 38 node (counts 5 3) is left as a leaf.

Page 21:

Class is 31 body parts: LU/RU/LW/RW head, neck, L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (Left, Right, Upper, loWer).

Page 22:

(Figure: the features used and the resulting decision tree for the body-part classification example.)

Page 23:

Classification with decision trees
• Data

Root (18 −, 13 +), split on x[1]:
  x[1] < -0.07: 13 −, 3 +
  x[1] >= -0.07: 4 −, 11 +

Page 24:

Root (18 −, 13 +), split on x[1]:
  x[1] < -0.07 (13 −, 3 +) → split on x[1] again:
    x[1] < -1.66: 7 −, 0 +
    x[1] >= -1.66: 6 −, 3 +
  x[1] >= -0.07 (4 −, 11 +) → split on x[2]:
    x[2] < 1.55: 1 −, 11 +
    x[2] >= 1.55: 3 −, 0 +

(y values listed as − +)

For threshold splits, the same feature can be used multiple times

Page 25:

Evolution of decision boundaries
• Decision boundaries get more complicated as we increase the model complexity (here measured by the depth of the decision tree)

(Figure: decision boundaries of logistic regression with degree 1, 2, and 6 features vs. decision trees of depth 1, 3, and 10.)

Page 26:

Probabilistic prediction with decision trees
• By taking the empirical fraction of each label among the training data in that leaf, we can make probabilistic predictions

Root (18 −, 13 +), split on x[1]:
  x[1] < -0.07: 13 −, 3 +   →  P(y = −1) = 13/16
  x[1] >= -0.07: 4 −, 11 +  →  P(y = +1) = 11/15

The same rule gives probabilities for the deeper tree from page 24 (root split on x[1], then x[1] again at -1.66 on the left branch and x[2] at 1.55 on the right branch).
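A tiny sketch (assumed representation, not from the slides) of how those numbers arise: the leaf probability is just the empirical label fraction among the training points in the leaf.

```python
# Per-leaf class probabilities from the stump's (negative, positive) label counts.
leaf_counts = {
    "x[1] <  -0.07": (13, 3),
    "x[1] >= -0.07": (4, 11),
}

for leaf, (neg, pos) in leaf_counts.items():
    total = neg + pos
    print(f"{leaf}: P(y=+1) = {pos}/{total} = {pos/total:.3f}, "
          f"P(y=-1) = {neg}/{total} = {neg/total:.3f}")
```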

Page 27:


Overfitting

Page 28:

Trade-off between training error and depth
• As in other regression and classification approaches, training error monotonically decreases (or, more precisely, never increases) with model complexity (here usually measured by depth)
• A training error that is too small is a sign of overfitting

Tree depth      depth = 1  depth = 2  depth = 3  depth = 5  depth = 10
Training error  0.22       0.13       0.10       0.03       0.00

(Figure: the corresponding decision boundaries.) Training error reduces with depth.
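The same trend is easy to reproduce; here is a hedged sketch using scikit-learn's DecisionTreeClassifier on synthetic data (the dataset, noise level, and list of depths are illustrative assumptions, not the course's data).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # a nonlinear, XOR-like target
y[rng.random(200) < 0.1] ^= 1                # add 10% label noise

for depth in (1, 2, 3, 5, 10):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(depth, round(1 - clf.score(X, y), 3))   # training error is non-increasing in depth
```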

Page 29:

Two ways to prevent overfitting
• 1. Early stopping: stop before the tree gets too complicated
• 2. Pruning: simplify after training a complex model

(Figure: classification error vs. tree depth, from simple trees to complex trees; training error keeps decreasing with depth while true error eventually increases, and max_depth marks where we stop.)

Page 30:

1. Early stopping
• Stopping rules (recap):
  • 1. All examples in the subset have the same label
  • 2. No more features left to split on
• Early stopping rules:
  • Only grow up to max_depth (which is chosen via validation); it is hard to figure out the right depth, and an imbalanced tree can be a better one
  • Do not split if it does not give a sufficient decrease in error; there are cases where it takes some depth to get the gain, e.g. XOR (table below)
  • Do not split intermediate nodes with too few data points

x[1]   x[2]   y
FALSE  FALSE  FALSE
FALSE  TRUE   TRUE
TRUE   FALSE  TRUE
TRUE   TRUE   FALSE
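For reference, these early-stopping rules correspond roughly to standard hyperparameters; a hedged sketch using scikit-learn (the specific values are placeholders, and min_impurity_decrease gates splits on impurity improvement rather than raw accuracy).

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,                 # only grow up to this depth (choose via validation)
    min_impurity_decrease=0.01,  # do not split unless the split improves the criterion enough
    min_samples_split=10,        # do not split nodes with too few data points
)
```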

Page 31:

2. Pruning
• Train until the tree overfits, and then simplify
• Pruning is guided by a choice of quality metric that balances fitting the data and simplicity
  • Data fit is measured by the error
  • Simplicity is measured by the number of leaves in the tree

Complex tree → simplify → simpler tree → simplify → … → simplest tree

Loss(T) = Error(T) + λ · L(T),   where L(T) = number of leaf nodes in T

• λ = 0: only the data fit matters, so the fully grown complex tree minimizes the loss
• λ = ∞: only simplicity matters, so the single-leaf (simplest) tree minimizes the loss

Page 32:

Pruning algorithm

• 1. Compute the current loss, Loss(T) = Error(T) + λ · L(T) (where L(T) = number of leaf nodes), and find a candidate split

• 2. Replace the split by a leaf node and recompute the loss

Example: the final tree T from page 13 has error 3/40 and 6 leaves. Candidate: replace the Term? split under Credit = poor, Income = high (counts 4 Safe, 5 Risky) with a single Risky leaf, giving tree T-term.

With λ = 0.03:

Tree      Error  #Leaves  Total
T         3/40   6        0.255
T-term    4/40   5        0.25
T-credit
T-income
T-term'

• 3. Repeat for all splits (there are 4 splits in this example)
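A minimal sketch (the helper total_cost is an assumed name, not from the slides) that reproduces the two filled-in rows of the table with λ = 0.03.

```python
# Total cost used for pruning: error rate plus a per-leaf penalty.
def total_cost(num_errors, num_examples, num_leaves, lam):
    return num_errors / num_examples + lam * num_leaves

print(round(total_cost(3, 40, 6, 0.03), 3))   # T:      0.255
print(round(total_cost(4, 40, 5, 0.03), 3))   # T-term: 0.25  -> pruning this split helps
```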

Page 33:

Decision trees for regression

(Figure: regression data plotted with X on the horizontal axis and Y on the vertical axis.)

Page 34:

Decision trees for regression
• Same process as classification, but:
  • error is measured by squared error
  • the prediction is the mean value of the samples in that partition

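A minimal sketch (assumed helpers, not from the slides) of a single regression split: each side predicts the mean of its y values, and the threshold is chosen to minimize the total squared error. The toy data is made up for illustration.

```python
def sse(ys):
    m = sum(ys) / len(ys)                       # leaf prediction = mean of the samples
    return sum((y - m) ** 2 for y in ys)

def best_regression_split(xs, ys):
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2          # midpoint threshold
        left  = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        err = sse(left) + sse(right)                     # squared error of the split
        if best is None or err < best[1]:
            best = (t, err)
    return best

# Hypothetical toy data:
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 4.8, 5.1, 5.0]
print(best_regression_split(x, y))              # splits between x=3 and x=4 (threshold 3.5)
```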