Lecture 10. Tree-based methods, neural networks
Tree-based methods
Statistical methods in which the input space (feature space) is partitioned into a set of cuboids (rectangles), and then a simple model is set up in each one
Tree structure
- Root
- Nodes
- Leaves (terminal nodes)
- Parent-child relationship
- Condition
- A label is assigned to each leaf
[Figure: a generic tree. The root tests Cond.1, internal nodes test Cond.2–Cond.6, and N4–N7 are the leaves.]
Example
[Figure: decision tree for classifying vertebrates. Root node: Body temperature? Cold → leaf: Non-mammals. Warm → internal node: Gives birth? Yes → leaf: Mammals; No → leaf: Non-mammals.]
How to build a decision tree: Hunt’s algorithm
Proc Hunt(Dt, t)
Given a data set Dt = {(X1i, ..., Xpi, Yi), i = 1..n} and the current node t:
1) If all Yi are equal, mark t as a leaf with label Yi.
2) Otherwise, use a test condition to split Dt into Dt1, ..., Dtn, create children t1, ..., tn, and run Hunt(Dt1, t1), ..., Hunt(Dtn, tn).
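A minimal Python sketch of this recursion, assuming binary threshold splits; find_split is a hypothetical placeholder for the split-selection rule (impurity-based selection is introduced later in the lecture):

```python
# Sketch of Hunt's algorithm with binary threshold splits.
# `find_split` is a placeholder: any rule returning (feature, threshold)
# or None (meaning "stop") would do.
from collections import Counter

def hunt(X, y, find_split):
    # Case 1: all labels equal -> leaf with that label.
    if len(set(y)) == 1:
        return {"leaf": y[0]}
    split = find_split(X, y)
    if split is None:  # no useful split: leaf with the majority class
        return {"leaf": Counter(y).most_common(1)[0][0]}
    j, s = split
    left = [i for i in range(len(y)) if X[i][j] <= s]
    right = [i for i in range(len(y)) if X[i][j] > s]
    if not left or not right:  # degenerate split: stop here
        return {"leaf": Counter(y).most_common(1)[0][0]}
    # Case 2: recurse on the children, as in steps 1)-2) above.
    return {
        "cond": (j, s),
        "left": hunt([X[i] for i in left], [y[i] for i in left], find_split),
        "right": hunt([X[i] for i in right], [y[i] for i in right], find_split),
    }
```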
Hunt’s algorithm example
[Figure: the (X1, X2) plane partitioned into rectangles by splits at X1 = 9 and X1 = 15 and at X2 = 7 and X2 = 16, together with the corresponding tree.]
The fitted model is
$$\hat{f}(X_1, X_2) = \sum_{m=1}^{N} c_m \, I\{(X_1, X_2) \in R_m\}$$
Hunt’s algorithm
What if some combinations of attributes are missing?
- Empty node: it is assigned the label of the majority class among the records (instances, objects, cases) in its parent node.
- All records in a node have identical attributes: the node is declared a leaf with the majority class of this node as its label.
CART: Classification and regression trees
Regression trees: given Dt = {(X1i, ..., Xpi, Yi), i = 1..n} with continuous Y, build a tree that fits the data best.
Classification trees: given Dt = {(X1i, ..., Xpi, Yi), i = 1..n} with categorical Y, build a tree that classifies the observations best.
A CART algorithm: Regression trees
Splitting variables and split points
Aim: we want to find $\min \sum_i (y_i - f(x_i))^2$, but it is computationally expensive to test all possible partitions. Instead, proceed greedily.
Consider a splitting variable j and a split point s, and define the pair of half-planes
$$R_1(j, s) = \{x \mid x_j \le s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x_j > s\}$$
We seek the splitting variable j and split point s that solve
$$\min_{j,\,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$$
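A brute-force Python sketch of this greedy search (unoptimized; the names j, s mirror the formula, and the inner minimizations over c1, c2 are solved by the region means):

```python
import numpy as np

def best_split(X, y):
    """Greedy search over (j, s) minimizing the two-region squared error.
    X: (n, p) array, y: (n,) array. Returns (j, s, sse)."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        # Candidate split points: observed values of feature j
        # (the largest is excluded so the right region is never empty).
        for s in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # Inner minimization: c1, c2 are the region means.
            sse = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best
```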
Post-pruning
How large a tree should we grow? Too large – overfitting!
Grow a large tree T0
Then prune this tree using post-pruning
Define a subtree T and index its terminal nodes by m, with node m representing region Rm.
Let |T| denote the number of terminal nodes in T and set
$$C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2 + \alpha |T|$$
where
$$\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i$$
Then minimize this expression, using cross-validation to select α, the factor that penalizes complex trees.
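As an aside, scikit-learn implements this style of cost-complexity pruning; a sketch, assuming arrays X, y, that scans the α path with cross-validation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def prune_by_cv(X, y):
    # Grow a large tree T0 and get the alphas at which the optimal
    # subtree changes (the cost-complexity path).
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    # Cross-validate each alpha and keep the best one.
    scores = [
        cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                        X, y, cv=5).mean()
        for a in path.ccp_alphas
    ]
    best_alpha = path.ccp_alphas[int(np.argmax(scores))]
    return DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```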
CART: Classification trees
For each node m, define the proportion of class-k observations
$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)$$
Define a measure of impurity:
$$\text{Classification error} = 1 - \max_k \hat{p}_{mk}$$
$$\text{Gini} = \sum_k \hat{p}_{mk} (1 - \hat{p}_{mk})$$
$$\text{Entropy} = -\sum_k \hat{p}_{mk} \log_2 \hat{p}_{mk}$$
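The three measures written out for a vector of class proportions p (a small numpy sketch; by convention 0 · log 0 = 0):

```python
import numpy as np

def class_error(p):  # 1 - max_k p_k
    return 1.0 - float(np.max(p))

def gini(p):         # sum_k p_k (1 - p_k)
    p = np.asarray(p)
    return float(np.sum(p * (1.0 - p)))

def entropy(p):      # -sum_k p_k log2 p_k, with 0 log 0 := 0
    p = np.asarray(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A pure node such as (1, 0) scores 0 on all three measures;
# a maximally mixed node (0.5, 0.5) gives 0.5, 0.5, and 1.0.
```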
Design issues of decision tree induction
How to split the training records?
We need a measure for evaluating the goodness of various test conditions.
How to terminate the splitting procedure?
1) Continue expanding nodes until either all the records belong to the same class or all the records have identical attribute values.
2) Define criteria for early termination.
How to split: CART
Select the split with maximum information gain
$$\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j)$$
where I(·) is the impurity measure of a given node, N is the total number of records at the parent node, and N(vj) is the number of records associated with the child node vj.
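A self-contained sketch of this gain computation for one candidate partition of a node's labels (here with Gini as the impurity I, but any of the measures above can be plugged in):

```python
import numpy as np
from collections import Counter

def proportions(labels):
    counts = Counter(labels)
    n = len(labels)
    return np.array([c / n for c in counts.values()])

def gini(p):
    return float(np.sum(p * (1.0 - p)))

def info_gain(parent_labels, children_labels, impurity=gini):
    # Delta = I(parent) - sum_j N(v_j)/N * I(v_j)
    N = len(parent_labels)
    weighted = sum(len(c) / N * impurity(proportions(c))
                   for c in children_labels)
    return impurity(proportions(parent_labels)) - weighted
```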
How to split: C4.5
Impurity measures such as Gini index tend to favour attributes that have a large number of distinct values
Strategy 1: Restrict the test conditions to binary splits only
Strategy 2: Use the gain ratio as splitting criterion
$$\text{Gain ratio} = \frac{\Delta_{\text{info}}}{\text{Split info}}, \qquad \text{Split info} = -\sum_{j=1}^{k} p(v_j) \log_2 p(v_j)$$
where $\Delta_{\text{info}} = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j)$ and $p(v_j) = N(v_j)/N$.
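The split-info term and the resulting ratio, in the same style as the sketch above (children_sizes are the counts N(vj)):

```python
import numpy as np

def split_info(children_sizes):
    # -sum_j p(v_j) log2 p(v_j), where p(v_j) = N(v_j)/N
    p = np.asarray(children_sizes, dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(delta_info, children_sizes):
    # Large split info penalizes splits into many small partitions.
    return delta_info / split_info(children_sizes)

# e.g. a 10-record node split 5/5 has split info of 1 bit;
# split into ten singletons it is log2(10) ~ 3.32 bits.
```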
Constructing decision trees
[Figure: a partially built tree. Root node: Home owner? Yes → leaf: Defaulted = No. No → test Marital status: Married → leaf: Defaulted = No; Not married → Defaulted = ?, resolved by a further split on Annual income (≤ 100K vs > 100K).]
Training data:

Home owner | Marital status | Annual income | Defaulted borrower
No         | Divorced       | 95K           | Yes
No         | Married        | 100K          | No
No         | Married        | 60K           | No
No         | Married        | 75K           | No
No         | Single         | 70K           | No
No         | Single         | 85K           | Yes
No         | Single         | 90K           | Yes
Yes        | Divorced       | 220K          | No
Yes        | Married        | 120K          | No
Yes        | Single         | 125K          | No
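A worked check against the table above: the Home owner split yields one pure child (all three Yes records have Defaulted = No) and one mixed child (3 defaulters out of 7); a short sketch with Gini impurity:

```python
def gini_counts(counts):
    # Gini impurity from class counts.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = gini_counts([3, 7])      # 3 defaulted, 7 not: 0.42
owner_yes = gini_counts([0, 3])   # pure child: 0.0
owner_no = gini_counts([3, 4])    # mixed child: ~0.4898
weighted = 3/10 * owner_yes + 7/10 * owner_no  # ~0.3429
gain = parent - weighted                        # ~0.0771
print(round(gain, 4))
```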
Expressing attribute test conditions
- Binary attributes: binary splits
- Nominal attributes: binary or multiway splits
- Ordinal attributes: binary or multiway splits honoring the order of the attribute values
- Continuous attributes: binary or multiway splits into disjoint intervals
Characteristics of decision tree induction
Nonparametric approach (no underlying probability model)
Computationally inexpensive techniques have been developed for constructing decision trees. Once a decision tree has been built, classification is extremely fast
The presence of redundant attributes will not adversely affect the accuracy of decision trees
The presence of irrelevant attributes can lower the accuracy of decision trees, especially if no measures are taken to avoid overfitting
At the leaf nodes, the number of records may be too small (data fragmentation)
Principal components regression (PCR)
Extract principal components (transformation of the inputs) as derived features, and then model the target (response) as a linear function of these features
$$Z_m = \alpha_m^T X, \quad m = 1, \ldots, M$$
[Diagram: inputs x1, x2, ..., xp feed into derived features z1, z2, ..., zM, which feed into the output y.]
$$y = \beta_0 + \beta^T Z$$
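A compact sketch of PCR with scikit-learn, where the number of components M is the tuning parameter (inputs are standardized first, as is usual before PCA):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pcr(M):
    # Z_m = alpha_m^T X (principal components), then y = beta_0 + beta^T Z.
    return make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())

# Usage: model = pcr(M=3).fit(X_train, y_train); model.predict(X_test)
```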
Extract linear combinations of the inputs as derived features, and then model the target (response) as a linear function of a sigmoid function of these features:
$$Z_m = \varsigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \ldots, M$$
[Diagram: inputs x1, x2, ..., xp feed into hidden units z1, z2, ..., zM, which feed into the output y.]
$$y = \beta_0 + \beta^T Z$$
Neural networks with a single target
Artificial neural networks
Introduction from biology: neurons, axons, dendrites, synapses.
Capabilities of neural networks:
- Memorization (robust to noise and to fragmentary input!)
- Classification
Terminology
Feed-forward neural network:
- Input layer
- [Hidden layer(s)]
- Output layer
[Diagram: inputs x1, x2, ..., xp; hidden units z1, z2, ..., zM; outputs f1, ..., fK.]
Terminology
Feed-forward network: nodes in one layer are connected to the nodes in the next layer.
Recurrent network: nodes in one layer may also be connected to nodes in a previous layer, or to nodes within the same layer.
Terminology
Formulas for the multilayer perceptron (MLP):
$$Z_m = \varsigma(C_1) = \varsigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \ldots, M$$
$$f_k = g(C_2) = g(\beta_{0k} + \beta_k^T Z), \quad k = 1, \ldots, K$$
C1, C2: combination functions; g, ς: activation functions; α0m, β0k: biases of the hidden and output units; αim, βjk: connection weights.
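A numpy sketch of one forward pass through these formulas, taking the sigmoid for ς and, for a regression-style output, the identity for g:

```python
import numpy as np

def sigmoid(t):                       # the activation ς
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, alpha0, alpha, beta0, beta, g=lambda t: t):
    # Hidden layer: Z_m = ς(α_0m + α_m^T x), m = 1..M
    z = sigmoid(alpha0 + alpha @ x)   # alpha: (M, p), alpha0: (M,)
    # Output layer: f_k = g(β_0k + β_k^T z), k = 1..K
    return g(beta0 + beta @ z)        # beta: (K, M), beta0: (K,)

# Usage with random weights, p = 4 inputs, M = 3 hidden units, K = 2 outputs:
# rng = np.random.default_rng(0)
# f = mlp_forward(rng.normal(size=4), rng.normal(size=3),
#                 rng.normal(size=(3, 4)), rng.normal(size=2),
#                 rng.normal(size=(2, 3)))
```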