MMG991 Session 10
• Binary recursive partitioning
  – Chapter 10 in MASS
  – Zhang et al., PNAS 98: 6730 – 6735
  – Concepts
  – S-Plus implementation
    • Therneau and Atkinson’s RPART routines
  – Microarrays
    • Alon’s colon cancer data set
    • Looking at Khan’s cancer data
  – Unanswered questions
    • Revisiting the concept of data filtration
      – Is this an alternative solution?
    • Compare the output
      – Classification of cancers
      – Selection of genes
• Projects
  – Updates from each group
Binary recursive partitioning
• Tree-based models
  – Graphically encapsulate and express knowledge
  – Aid in decision making
    • Prediction of some outcome
  – Use in a variety of settings
    • Social sciences
    • Machine learning
• S-Plus implementation
  – tree() implementation
    • Integral part of S-Plus distribution
  – rpart()
    • Separate library/distribution
    – Advantages
      » Speed
      » Size of data sets
      » Ability to handle missing data
    • S-Plus Version 6.0
      – www.stats.ox.ac.uk/pub/Mass3/Winlibs
Basic concepts
• Two basic models
  – Classification trees*
    • Supervised method
  – Regression trees
    • Anova
    • Poisson
    • Survival
• Data type
  – Categorical
  – Continuous
• Underlying concept
  – Identify the single variable that best splits the data into two groups
    • Loss functions, conditional likelihoods, measure of impurity
  – Separate data
  – Repeat the process on each subset until it reaches a minimum size or no further improvement occurs
  – Refine model
    • Pruning, snipping, cross-validation
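The "single variable that best splits the data" step can be sketched in a few lines. The example below is a minimal illustration in Python, not the rpart implementation: it scans the midpoints between sorted values of one variable and keeps the threshold with the lowest weighted Gini impurity. The function names and the toy data are hypothetical, and Gini is only one of the impurity measures the slide lists.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best 'value < threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between identical values
        threshold = (lo + hi) / 2.0
        left = [lab for v, lab in pairs if v < threshold]
        right = [lab for v, lab in pairs if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: one expression value per sample.
expression = [12.0, 35.0, 48.0, 75.0, 90.0, 120.0]
tissue = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

threshold, impurity = best_split(expression, tissue)
print(threshold, impurity)
```

A full tree repeats this search over every variable, splits on the winner, and recurses on the two subsets until a stopping rule such as a minimum node size applies.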
Adding the RPART library
• Distributed as .zip file
  – Download to library directory
  – Unzip
    • Requires winzip or comparable archive tool to unpack
    • Installs rpart library automatically
  – Includes help files and documentation
  – RPART Technical reports
    • Useful supplementary reading
    – Long report provides extensive theoretical background
The data set(s)
• The incomplete data set
  – http://www.sph.uth.tmc.edu.hgc
  – Sequentially ordered Affymetrix data
  – No mapping to genes!
  – No reference to original source of data!
  – Cannot reproduce the authors’ observations!
• The original dataset
  – Alon et al., PNAS 96: 6745 – 6750
  – http://microarray.princeton.edu/oncology/
  – Sequentially ordered Affymetrix data
  – Gene map
  – Ordering of experiments
Some new twists
• Overriding S-Plus defaults
  – Necessary because of size of data sets
  – Increase expressions to 4000
  – Increase object.size to 100000000
• Loading the RPART and Mass3 libraries
• Fitted model notation
  – See pg 33 of MASS
  – response ~ expression
    • For classification trees
    – tissue.type ~ spot1 + spot2 … spotn
    – tissue.type ~ ., data = data.frame
    – Default location of response variable
Setting up the data for analysis
• Import data through the GUI
  – Excel files
    • affydata
      – matrix
    • affynames
      – gene names, Affymetrix identifiers, gene annotations
      – Cut and paste
    • affyexpts
• Housekeeping
library(Mass)
library(rpart)
options(object.size=100000000)
options(expressions=4000)
attach("mmg991", pos=1)
The transposed matrix
# Transpose so that rows are samples and columns are spots
taffydata<-as.data.frame(t(affydata))
# Prepend an empty column to hold the response
taffydata<-cbind(matrix("", nrow(taffydata),1), taffydata)
dimnames(taffydata)[[2]][1]<-"tissue.type"
taffydata[,1]<-as.character(taffydata[,1])
# Label each sample from its row name
taffydata[grep("tumor.*",dimnames(taffydata)[[1]]),1]<-"tumor"
taffydata[grep("normal.*",dimnames(taffydata)[[1]]),1]<-"normal"
taffydata[,1]<-as.factor(taffydata[,1])
taffydata<-as.data.frame(taffydata)
Binary recursive partitioning
attach(taffydata)
taffy.rp<-rpart(tissue.type ~ ., data=taffydata[,-1], minsplit=5, maxcompete=10)
plot(taffy.rp, uniform=T, branch=0.25)
text(taffy.rp)
The heatmap
[Figure: heatmap of the expression data; x-axis 0 to 2000, y-axis 0 to 60, colour key 1 to 4]
The output
> taffy.rp
n= 62

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 62 22 tumor (0.3548387 0.6451613)
  2) spot.1671< 59.82813 14 0 normal (1.0000000 0.0000000) *
  3) spot.1671>=59.82813 48 8 tumor (0.1666667 0.8333333)
    6) spot.590< 289.9481 18 8 tumor (0.4444444 0.5555556)
      12) spot.123>=766.2131 7 0 normal (1.0000000 0.0000000) *
      13) spot.123< 766.2131 11 1 tumor (0.0909090 0.9090909) *
    7) spot.590>=289.9481 30 0 tumor (0.0000000 1.0000000) *
The tree
[Figure: classification tree; leaf labels give normal/tumor counts]
spot.1671< 59.83
  yes: normal 14/0
  no:  spot.590< 289.9
    yes: spot.123>=766.2
      yes: normal 7/0
      no:  tumor 1/10
    no:  tumor 0/30

Spot.1671 = M26383, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds.
Spot.590 = R15447, CALNEXIN PRECURSOR (Homo sapiens)
Spot.123 = M28214, RAS-RELATED PROTEIN RAB-3B (HUMAN).
Examining the model
> printcp(taffy.rp)

Classification tree:
rpart(formula = taffydata, method = "class", parms = list(split = "information"),
    minsplit = 5, maxcompete = 2000)

Variables actually used in tree construction:
[1] spot.123 spot.1671 spot.590

Root node error: 22/62 = 0.35484

n= 62

       CP nsplit rel error  xerror    xstd
1 0.63636      0  1.000000 1.00000 0.17125
2 0.15909      1  0.363636 0.72727 0.15661
3 0.01000      3  0.045455 0.63636 0.14965
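The rel error and xerror columns are scaled by the root node error (22/62), so multiplying by 22 turns them back into misclassification counts. The Python sketch below does that arithmetic on the printed table; the variable names are illustrative only, and the rounding assumes the printed values are exact counts divided by 22.

```python
root_errors = 22   # misclassified at the root (22/62 from printcp)
n_samples = 62

# (CP, nsplit, rel error, xerror) as printed by printcp(taffy.rp)
cp_table = [
    (0.63636, 0, 1.000000, 1.00000),
    (0.15909, 1, 0.363636, 0.72727),
    (0.01000, 3, 0.045455, 0.63636),
]

summary = []
for cp, nsplit, rel_error, xerror in cp_table:
    resub_count = round(rel_error * root_errors)   # errors on the training data
    xval_count = round(xerror * root_errors)       # errors under cross-validation
    summary.append((nsplit, resub_count, xval_count))
    print(f"nsplit={nsplit}: {resub_count}/{n_samples} resubstitution errors, "
          f"{xval_count}/{n_samples} cross-validated errors")
```

The full tree misclassifies 1 of 62 training samples but about 14 of 62 under cross-validation, which is the gap behind the question of how well the method really performs.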
Examining the model (cont.)
> summary(taffy.rp)
Call:
rpart(formula = taffydata, method = "class", parms = list(split = "information"),
    minsplit = 5, maxcompete = 2000)
n= 62

         CP nsplit  rel error    xerror      xstd
1 0.6363636      0 1.00000000 1.0000000 0.1712469
2 0.1590909      1 0.36363636 0.7272727 0.1566103
3 0.0100000      3 0.04545455 0.6363636 0.1496463

Node number 1: 62 observations, complexity param=0.6363636
  predicted class=tumor expected loss=0.3548387
    class counts: 22 40
   probabilities: 0.355 0.645
  left son=2 (14 obs) right son=3 (48 obs)
  Primary splits:
    spot.1671 < 59.82813 to the left, improve=18.6972800, (0 missing)
    spot.249 < 1696.228 to the right, improve=16.5197300, (0 missing)
    spot.493 < 379.3894 to the right, improve=16.1353500, (0 missing)
    spot.765 < 842.3056 to the right, improve=15.3041000, (0 missing)
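The improve value for the first primary split can be reproduced by hand from the class counts. The Python sketch below assumes that, for split = "information", improve is the drop in entropy-based deviance (natural log) between the parent and its two sons; that is an assumption about how rpart computes it, although the result agrees with the printed 18.69728. The function names are illustrative.

```python
import math


def deviance(counts):
    """-sum(n_k * ln(n_k / n)) over the non-empty classes of one node."""
    n = sum(counts)
    return -sum(c * math.log(c / n) for c in counts if c > 0)


def improve(parent, left, right):
    """Drop in deviance when the parent node is split into left and right sons."""
    return deviance(parent) - deviance(left) - deviance(right)


# (normal, tumor) counts taken from the printed tree
root = (22, 40)
left_son = (14, 0)     # spot.1671 <  59.82813
right_son = (8, 40)    # spot.1671 >= 59.82813

value = improve(root, left_son, right_son)
print(round(value, 5))  # prints 18.69728
```

The competing splits listed below the winner (spot.249, spot.493, spot.765) are ranked by the same quantity.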
To prune or not to prune…
[Figure: cross-validated relative error (y-axis, 0.4 to 1.2) against cp (Inf, 0.32, 0.04); top axis gives size of tree (1, 2, 4)]
Subset of samples in nodes 5-7
[Figure: scatter plot of log(taffydata[, 124]) (y-axis, 5.5 to 7.5) against log(taffydata[, 591]) (x-axis, 4 to 7); point groups labelled Node 5, Node 6 and Node 7]
The take home message
• Excellent agreement with Zhang et al.
• The model is, however, trivial
  – No pruning required
• How well does the method really perform?
Revisiting Khan’s MDS/ANN data set
• Tumor classification/diagnostic prediction
  – 63 training samples/25 test samples
  – Four tumor/cell types
    • EWS
      – 13 tumors/10 cell lines
    • BL
      – 8 cell lines
    • NB
      – 12 cell lines
    • RMS
      – 10 tumors/10 cell lines
    • Test set
      – 25 samples
  – Relative red index
    • rri = mean spot intensity/mean intensity of filtered genes
    • Expression measured as ln(rri)
  – Strategy
    • Training set
    • Prediction
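As a small worked example of the expression measure, the Python sketch below computes ln(rri) for two spots. The intensities are hypothetical, and treating "mean spot intensity" as a single number per spot is an assumption made for illustration.

```python
import math


def ln_rri(mean_spot_intensity, mean_filtered_intensity):
    """ln of the relative red index: spot mean over the mean of the filtered genes."""
    return math.log(mean_spot_intensity / mean_filtered_intensity)


# Hypothetical intensities, not values from the data set.
filtered_mean = 750.0
bright_spot = ln_rri(1500.0, filtered_mean)   # twice the filtered mean
dim_spot = ln_rri(375.0, filtered_mean)       # half the filtered mean

print(bright_spot, dim_spot)
```

On this scale a spot at the filtered mean scores 0, and equal fold changes up or down are symmetric around it.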
The heatmap
[Figure: heatmap of the ln(rri) data; x-axis 0 to 100, y-axis 0 to 80, colour key -2 to 2]
Updating the model
tnhgri2<-tnhgri
dimnames(tnhgri2)[[2]]<-paste("spot",1:2308,sep=".")
tnhgri2<-as.data.frame(tnhgri2)
# Prepend a column to hold the response
tnhgri2<-cbind(1:nrow(tnhgri2), tnhgri2)
dimnames(tnhgri2)[[2]][1]<-"tissue.type"

# Row indices for each class, matched on the sample names.
# Trailing "*" dropped: read as a regular expression, "EWS.C*" would also match the EWS.T names.
ews.c<-grep("EWS.C", dimnames(tnhgri2)[[1]])
ews.t<-grep("EWS.T", dimnames(tnhgri2)[[1]])
bl<-grep("BL.C", dimnames(tnhgri2)[[1]])
nb<-grep("NB", dimnames(tnhgri2)[[1]])
rms.c<-grep("RMS.C", dimnames(tnhgri2)[[1]])
rms.t<-grep("RMS.T", dimnames(tnhgri2)[[1]])
test<-grep("TEST", dimnames(tnhgri2)[[1]])

# Fill in the class labels and convert to a factor
tnhgri2[,1]<-as.character(tnhgri2[,1])
tnhgri2[ews.c,1]<-"EWS.C"
tnhgri2[ews.t,1]<-"EWS.T"
tnhgri2[bl,1]<-"BL.C"
tnhgri2[nb,1]<-"NB"
tnhgri2[rms.c,1]<-"RMS.C"
tnhgri2[rms.t,1]<-"RMS.T"
tnhgri2[test,1]<-"TEST"
tnhgri2[,1]<-as.factor(tnhgri2[,1])
Creating the classification tree
tnhgri2.rp<-rpart(tnhgri2[1:63,], minsplit=5, maxcompete=20, method="class",
    parms=list(split="information"))
plot(tnhgri2.rp, uniform=T, branch=0.25)
text(tnhgri2.rp, use.n=T)
The output
> tnhgri2.rp
n= 63

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 63 50 EWS.T (0.13 0.16 0.21 0.19 0.16 0.16)
  2) spot.1389>=1.58565 22 9 EWS.T (0 0.41 0.59 0 0 0)
    4) spot.1339>=1.9433 9 0 EWS.C (0 1 0 0 0 0) *
    5) spot.1339< 1.9433 13 0 EWS.T (0 0 1 0 0 0) *
  3) spot.1389< 1.58565 41 29 NB (0.2 0.024 0 0.29 0.24 0.24)
    6) spot.2303< 0.4544 18 8 RMS.T (0.44 0 0 0 0 0.56)
      12) spot.1< 0.74705 8 0 BL.C (1 0 0 0 0 0) *
      13) spot.1>=0.74705 10 0 RMS.T (0 0 0 0 0 1) *
    7) spot.2303>=0.4544 23 11 NB (0 0.043 0 0.52 0.43 0)
      14) spot.262< 0.80825 12 0 NB (0 0 0 1 0 0) *
      15) spot.262>=0.80825 11 1 RMS.C (0 0.091 0 0 0.91 0) *
The classification tree
[Figure: classification tree; leaf labels give counts as BL.C/EWS.C/EWS.T/NB/RMS.C/RMS.T]
spot.1389>=1.586
  yes: spot.1339>=1.943
    yes: EWS.C 0/9/0/0/0/0
    no:  EWS.T 0/0/13/0/0/0
  no:  spot.2303< 0.4544
    yes: spot.1< 0.7471
      yes: BL.C 8/0/0/0/0/0
      no:  RMS.T 0/0/0/0/0/10
    no:  spot.262< 0.8082
      yes: NB 0/0/0/12/0/0
      no:  RMS.C 0/1/0/0/10/0

Spot 1389 = Fc fragment of IgG, receptor, transporter, alpha
Spot 1339 = ubiquitin thiolesterase
Spot 2303 = Homo sapiens clone 23716 mRNA sequence
Spot 1 = catenin (cadherin-associated protein), alpha 1 (102kD)
Spot 262 = B-cell CLL/lymphoma 7b
The output (cont.)
> summary(tnhgri2.rp)
Call:
rpart(formula = tnhgri2[1:63, ], method = "class",
    parms = list(split = "information"), minsplit = 5, maxcompete = 20)
n= 63

    CP nsplit rel error xerror       xstd
1 0.24      0      1.00   1.04 0.06026397
2 0.20      1      0.76   0.94 0.06909850
3 0.18      3      0.36   0.76 0.07766432
4 0.16      4      0.18   0.60 0.07928250
5 0.01      5      0.02   0.34 0.07046332
Cost-complexity pruning
[Figure: cross-validated relative error (y-axis, 0.2 to 1.2) against cp (Inf, 0.22, 0.19, 0.17, 0.04); top axis gives size of tree (1, 2, 4, 5, 6)]
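One common way to read a cp table is the one-standard-error heuristic: take the smallest tree whose xerror is within one xstd of the minimum xerror. The slides do not say this rule was used; the Python sketch below simply applies it to the table printed by summary(tnhgri2.rp), with illustrative variable names.

```python
# (CP, nsplit, xerror, xstd) as printed by summary(tnhgri2.rp)
cp_table = [
    (0.24, 0, 1.04, 0.06026397),
    (0.20, 1, 0.94, 0.06909850),
    (0.18, 3, 0.76, 0.07766432),
    (0.16, 4, 0.60, 0.07928250),
    (0.01, 5, 0.34, 0.07046332),
]

# Row with the lowest cross-validated error, and the 1-SE cut-off built from it.
best_row = min(cp_table, key=lambda row: row[2])
cutoff = best_row[2] + best_row[3]

# Rows are ordered from smallest to largest tree, so the first hit is the smallest.
chosen = next(row for row in cp_table if row[2] <= cutoff)

print(f"cut-off = {cutoff:.4f}, chosen nsplit = {chosen[1]}, cp = {chosen[0]}")
```

Under this heuristic only the full five-split tree falls below the cut-off, so the rule would not prune this model.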
Competing variables
Node number 1: 63 observations, complexity param=0.24
  predicted class=EWS.T expected loss=0.7936508
    class counts: 8 10 13 12 10 10
   probabilities: 0.127 0.159 0.206 0.190 0.159 0.159
  left son=2 (22 obs) right son=3 (41 obs)
  Primary splits:
    spot.1389 < 1.58565 to the right, improve=37.50727, (0 missing)
    spot.1634 < 0.7854 to the right, improve=37.50727, (0 missing)
    spot.1954 < 1.31135 to the right, improve=37.50727, (0 missing)
    spot.1803 < 0.20905 to the left, improve=36.34711, (0 missing)
    spot.157 < 0.17895 to the left, improve=34.68483, (0 missing)
    spot.545 < 2.14155 to the right, improve=34.56965, (0 missing)
    spot.2050 < 0.30165 to the left, improve=34.56965, (0 missing)
    spot.246 < 1.4397 to the right, improve=34.56965, (0 missing)
Surrogates
Surrogate splits:
    spot.1954 < 1.31135 to the right, agree=0.968, adj=0.909, (0 split)
    spot.246 < 1.4397 to the right, agree=0.952, adj=0.864, (0 split)
    spot.545 < 2.14155 to the right, agree=0.952, adj=0.864, (0 split)
    spot.2050 < 0.30165 to the left, agree=0.952, adj=0.864, (0 split)
    spot.1708 < 0.61675 to the right, agree=0.937, adj=0.818, (0 split)
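The agree and adj columns can be checked by hand. agree is the fraction of the 63 samples that the surrogate sends the same way as the primary split; adj rescales it against the baseline of sending every sample to the larger son (41 of 63 here). The Python sketch below assumes that definition of adj, and the agreement counts (61, 60, 59) are inferred from the rounded agree values rather than printed by rpart.

```python
n_samples = 63
majority = 41   # size of the larger son of the primary split (right son, 41 obs)


def adjusted_agreement(agree_count, n, majority_count):
    """Agreement beyond what the 'send everything to the larger son' rule achieves."""
    return (agree_count - majority_count) / (n - majority_count)


# Inferred counts of samples routed identically by surrogate and primary split.
rows = []
for agree_count in (61, 60, 59):
    agree = agree_count / n_samples
    adj = adjusted_agreement(agree_count, n_samples, majority)
    rows.append((round(agree, 3), round(adj, 3)))
    print(f"agree={agree:.3f}, adj={adj:.3f}")
```

The three distinct (agree, adj) pairs in the listing are recovered, which supports reading spot.1954 as a near-duplicate of spot.1389 for this split.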
Prediction
• How well does recursive partitioning compare to Khan’s ANN?
tnhgri2.pred<-predict.rpart(tnhgri2.rp, newdata=tnhgri2[64:88,], type=c("class"))
• 20 test samples
  – Not included in the training set
  – BL (3/3)
  – RMS-T (4/5)
  – NB (6/6)
  – EWS-T (4/5)
  – EWS-C (1/1)
  – Concordance = 90%
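The 90% concordance follows directly from the per-class tallies above. A minimal Python check, using only the counts listed on the slide:

```python
# (correct, total) per class, as listed on the slide
results = {
    "BL": (3, 3),
    "RMS-T": (4, 5),
    "NB": (6, 6),
    "EWS-T": (4, 5),
    "EWS-C": (1, 1),
}

correct = sum(hit for hit, _ in results.values())
total = sum(size for _, size in results.values())
concordance = correct / total

print(f"{correct}/{total} = {concordance:.0%}")
```

The two misses are one RMS-T and one EWS-T sample.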
Some closing thoughts…
• Performance of method appears to be quite good
  – Are the “diagnostic” genes meaningful?
• Multiple “solutions” yield a smaller set of “diagnostic” genes
  – Recursive-recursive partition?
  – Extraction of larger groups of differential genes