
MMG991 Session 10

• Binary recursive partitioning
  – Chapter 10 in MASS
  – Zhang et al., PNAS 98: 6730–6735
  – Concepts
  – S-Plus implementation
    • Therneau and Atkinson's RPART routines
  – Microarrays
    • Alon's colon cancer data set
    • Looking at Khan's cancer data
  – Unanswered questions
• Revisiting the concept of data filtration
  – Is this an alternative solution?
• Compare the output
  – Classification of cancers
  – Selection of genes
• Projects
  – Updates from each group

Binary recursive partitioning

• Tree-based models
  – Graphically encapsulate and express knowledge
  – Aid in decision making
    • Prediction of some outcome
  – Use in a variety of settings
    • Social sciences
    • Machine learning
• S-Plus implementation
  – tree() implementation
    • Integral part of the S-Plus distribution
  – rpart()
    • Separate library/distribution
  – Advantages
    » Speed
    » Size of data sets
    » Ability to handle missing data
• S-Plus Version 6.0
  – www.stats.ox.ac.uk/pub/Mass3/Winlibs

Basic concepts

• Two basic models
  – Classification trees*
    • Supervised method
  – Regression trees
    • Anova
    • Poisson
    • Survival
• Data type
  – Categorical
  – Continuous
• Underlying concept (sketched in the example below)
  – Identify the single variable that best splits the data into two groups
    • Loss functions, conditional likelihoods, measures of impurity
  – Separate the data
  – Repeat the process on each subset until each subset reaches a minimum size or no further improvement occurs
  – Refine the model
    • Pruning, snipping, cross-validation
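A minimal sketch of the split-search idea, using the Gini index as the impurity measure. This is illustrative S/R code, not rpart's actual implementation; gini, split.improvement, and best.split are made-up names, and x and y stand for a hypothetical expression vector and class factor.

gini <- function(y) {                            # impurity of a set of class labels
  p <- table(y) / length(y)
  1 - sum(p^2)
}
split.improvement <- function(x, y, cutpoint) {  # impurity decrease for one candidate cutpoint
  left  <- y[x <  cutpoint]
  right <- y[x >= cutpoint]
  gini(y) - (length(left)  / length(y)) * gini(left) -
            (length(right) / length(y)) * gini(right)
}
best.split <- function(x, y) {                   # scan the observed values of x for the best cut
  cuts <- sort(unique(x))[-1]                    # skip the smallest value: nothing falls to its left
  imp  <- sapply(cuts, function(cp) split.improvement(x, y, cp))
  c(cutpoint = cuts[which.max(imp)], improvement = max(imp))
}

Repeating best.split() over every variable, keeping the winner, and then recursing on the two resulting subsets is the "binary recursive partitioning" of the slide title.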

Adding the RPART library

• Distributed as a .zip file
  – Download to the library directory
  – Unzip
    • Requires WinZip or a comparable archive tool to unpack
    • Installs the rpart library automatically
  – Includes help files and documentation
  – RPART technical reports
    • Useful supplementary reading
    • The long report provides extensive theoretical background

The data set(s)

• The incomplete data set
  – http://www.sph.uth.tmc.edu.hgc
  – Sequentially ordered Affymetrix data
  – No mapping to genes!
  – No reference to the original source of the data!
  – Cannot reproduce the authors' observations!
• The original data set
  – Alon et al., PNAS 96: 6745–6750
  – http://microarray.princeton.edu/oncology/
  – Sequentially ordered Affymetrix data
  – Gene map
  – Ordering of experiments

Some new twists

• Overriding S-Plus defaults
  – Necessary because of the size of the data sets
  – Increase expressions to 4000
  – Increase object.size to 100000000
• Loading the RPART and Mass3 libraries
• Fitted model notation
  – See pg 33 of MASS
  – response ~ expression
• For classification trees (illustrated below)
  – tissue.type ~ spot1 + spot2 + … + spotn
  – tissue.type ~ ., data = data.frame
  – Default location of the response variable
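A small illustration of the two forms of the formula. Here mydata is a hypothetical data frame whose first column is the factor tissue.type and whose remaining columns are spot intensities; the spelled-out form names individual predictors, while the "." shorthand uses every other column of the data frame.

fit.long  <- rpart(tissue.type ~ spot.1 + spot.2 + spot.3, data = mydata)
fit.short <- rpart(tissue.type ~ ., data = mydata)   # "." = all remaining columns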

Setting up the data for analysis

• Import data through the GUI interface
  – Excel files
    • affydata
      – matrix
    • affynames
      – gene names, Affymetrix identifiers, gene annotations
  – Cut and paste
    • affyexpts

• Housekeeping

library(MASS)                         # Venables & Ripley library (Mass3 distribution)
library(rpart)                        # Therneau & Atkinson's recursive partitioning routines
options(object.size=100000000)        # allow large objects
options(expressions=4000)             # raise the expression limit
attach("mmg991", pos=1)               # attach the mmg991 data directory

The transposed matrix

taffydata <- as.data.frame(t(affydata))                    # samples as rows, spots as columns
taffydata <- cbind(matrix("", nrow(taffydata), 1), taffydata)
dimnames(taffydata)[[2]][1] <- "tissue.type"               # empty first column holds the class label
taffydata[,1] <- as.character(taffydata[,1])
taffydata[grep("tumor.*",  dimnames(taffydata)[[1]]), 1] <- "tumor"
taffydata[grep("normal.*", dimnames(taffydata)[[1]]), 1] <- "normal"
taffydata[,1] <- as.factor(taffydata[,1])                  # tissue.type as a factor
taffydata <- as.data.frame(taffydata)
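A quick, purely illustrative check (not from the original slides) that the labels were assigned as intended:

dim(taffydata)                  # samples by (1 + number of spots)
table(taffydata$tissue.type)    # counts of "normal" and "tumor"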

Binary recursive partitioning

attach(taffydata)
taffy.rp <- rpart(tissue.type ~ ., data=taffydata[,-1],
                  minsplit=5, maxcompete=10)

plot(taffy.rp, uniform=T, branch=0.25)
text(taffy.rp)

The heatmap

[Figure: heatmap of the expression data (axes 0–2000 and 0–60)]

The output

> taffy.rp
n= 62

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 62 22 tumor (0.3548387 0.6451613)
   2) spot.1671< 59.82813 14 0 normal (1.0000000 0.0000000) *
   3) spot.1671>=59.82813 48 8 tumor (0.1666667 0.8333333)
     6) spot.590< 289.9481 18 8 tumor (0.4444444 0.5555556)
      12) spot.123>=766.2131 7 0 normal (1.0000000 0.0000000) *
      13) spot.123< 766.2131 11 1 tumor (0.0909090 0.9090909) *
     7) spot.590>=289.9481 30 0 tumor (0.0000000 1.0000000) *

The tree

[Figure: the fitted classification tree. Root split spot.1671 < 59.83 (left leaf: normal 14/0); the right branch splits on spot.590 < 289.9 and then spot.123 >= 766.2, giving leaves normal 7/0, tumor 1/10, and tumor 0/30]

Spot.1671 = M26383, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds.

Spot.590 = R15447, CALNEXIN PRECURSOR (Homo sapiens)

Spot.123 = M28214, RAS-RELATED PROTEIN RAB-3B (HUMAN).

Examining the model

> printcp(taffy.rp)

Classification tree:
rpart(formula = taffydata, method = "class",
      parms = list(split = "information"), minsplit = 5, maxcompete = 2000)

Variables actually used in tree construction:
[1] spot.123  spot.1671 spot.590

Root node error: 22/62 = 0.35484

n= 62

        CP nsplit rel error  xerror    xstd
1  0.63636      0  1.000000 1.00000 0.17125
2  0.15909      1  0.363636 0.72727 0.15661
3  0.01000      3  0.045455 0.63636 0.14965

Examining the model (cont.)

> summary(taffy.rp)
Call:
rpart(formula = taffydata, method = "class",
      parms = list(split = "information"), minsplit = 5, maxcompete = 2000)

n= 62

         CP nsplit  rel error    xerror      xstd
1 0.6363636      0 1.00000000 1.0000000 0.1712469
2 0.1590909      1 0.36363636 0.7272727 0.1566103
3 0.0100000      3 0.04545455 0.6363636 0.1496463

Node number 1: 62 observations,    complexity param=0.6363636
  predicted class=tumor  expected loss=0.3548387
    class counts:    22    40
   probabilities: 0.355 0.645
  left son=2 (14 obs) right son=3 (48 obs)
  Primary splits:
      spot.1671 < 59.82813 to the left,  improve=18.6972800, (0 missing)
      spot.249  < 1696.228 to the right, improve=16.5197300, (0 missing)
      spot.493  < 379.3894 to the right, improve=16.1353500, (0 missing)
      spot.765  < 842.3056 to the right, improve=15.3041000, (0 missing)

To prune or not to prune…

[Figure: cost-complexity (cp) plot — X-val Relative Error (0.4–1.2) against cp (Inf, 0.32, 0.04) and size of tree (1, 2, 4)]
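If one did want to trim the tree, rpart's prune() performs cost-complexity pruning at a chosen cp. A sketch, with the cp value read off the table above (0.16 sits between the first two thresholds, so this call would keep only the first split):

taffy.pruned <- prune(taffy.rp, cp=0.16)     # drop splits whose complexity value falls below cp
plot(taffy.pruned, uniform=T, branch=0.25)
text(taffy.pruned)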

Subset of samples in nodes 5-7

[Figure: log(taffydata[, 124]) plotted against log(taffydata[, 591]) for the samples in nodes 5, 6, and 7, labeled by node]

The take home message

• Excellent agreement with Zhang et al.
• The model is, however, trivial
  – No pruning required
• How well does the method really perform? (a rough estimate follows below)
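One rough answer, using numbers already in the printcp() output above: the xerror column is a cross-validated relative error, so multiplying its minimum by the root node error estimates the true misclassification rate. A small sketch (cptable is the component of the fitted object that printcp() displays):

cp.tab <- taffy.rp$cptable                 # the table printcp(taffy.rp) displays
(22/62) * min(cp.tab[, "xerror"])          # roughly 0.355 * 0.64 = 0.23 cross-validated error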

Revisiting Kahn’s MDS/ANN data set• Tumor classification/diagnostic prediction

– 63 training samples/25 test samples– four tumor/cell types

• EWS– 13 tumors/10 cell lines

• BL– 8 cell lines

• NB– 12 cell lines

• RMS– 10 tumors/10 cell lines

• Test set– 25 samples

– Relative red index• rri = mean spot intensity/mean intensity of filtered genes• Expression measured as ln(rri)

– Strategy• Training set• Prediction
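A purely illustrative computation of ln(rri) for one array, assuming a vector spot.intensity of spot intensities and a logical vector passed.filter marking the genes that survived filtering (both hypothetical names, not from the slides or the original paper):

rri            <- spot.intensity / mean(spot.intensity[passed.filter])
log.expression <- log(rri)       # log() is the natural log, i.e. ln(rri)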

The heatmap

[Figure: heatmap of the Khan expression data (axes 0–100 and 0–80)]

Updating the model

tnhgri2 <- tnhgri
dimnames(tnhgri2)[[2]] <- paste("spot", 1:2308, sep=".")
tnhgri2 <- as.data.frame(tnhgri2)
tnhgri2 <- cbind(1:nrow(tnhgri2), tnhgri2)
dimnames(tnhgri2)[[2]][1] <- "tissue.type"

ews.c <- grep("EWS.C*", dimnames(tnhgri2)[[1]])
ews.t <- grep("EWS.T*", dimnames(tnhgri2)[[1]])
bl    <- grep("BL.C",   dimnames(tnhgri2)[[1]])
nb    <- grep("NB*",    dimnames(tnhgri2)[[1]])
rms.c <- grep("RMS.C*", dimnames(tnhgri2)[[1]])
rms.t <- grep("RMS.T*", dimnames(tnhgri2)[[1]])
test  <- grep("TEST*",  dimnames(tnhgri2)[[1]])

tnhgri2[,1] <- as.character(tnhgri2[,1])
tnhgri2[ews.c,1] <- "EWS.C"
tnhgri2[ews.t,1] <- "EWS.T"
tnhgri2[bl,1]    <- "BL.C"
tnhgri2[nb,1]    <- "NB"
tnhgri2[rms.c,1] <- "RMS.C"
tnhgri2[rms.t,1] <- "RMS.T"
tnhgri2[test,1]  <- "TEST"
tnhgri2[,1] <- as.factor(tnhgri2[,1])
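An illustrative check (not in the original slides) that every sample received a group label:

table(tnhgri2$tissue.type)    # the non-TEST groups should total 63 training samples, TEST should be 25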

Creating the classification tree

tnhgri2.rp <- rpart(tnhgri2[1:63,], minsplit=5, maxcompete=20,
                    method="class", parms=list(split="information"))

plot(tnhgri2.rp, uniform=T, branch=0.25)
text(tnhgri2.rp, use.n=T)

The output

> tnhgri2.rp
n= 63

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 63 50 EWS.T (0.13 0.16 0.21 0.19 0.16 0.16)
   2) spot.1389>=1.58565 22 9 EWS.T (0 0.41 0.59 0 0 0)
     4) spot.1339>=1.9433 9 0 EWS.C (0 1 0 0 0 0) *
     5) spot.1339< 1.9433 13 0 EWS.T (0 0 1 0 0 0) *
   3) spot.1389< 1.58565 41 29 NB (0.2 0.024 0 0.29 0.24 0.24)
     6) spot.2303< 0.4544 18 8 RMS.T (0.44 0 0 0 0 0.56)
      12) spot.1< 0.74705 8 0 BL.C (1 0 0 0 0 0) *
      13) spot.1>=0.74705 10 0 RMS.T (0 0 0 0 0 1) *
     7) spot.2303>=0.4544 23 11 NB (0 0.043 0 0.52 0.43 0)
      14) spot.262< 0.80825 12 0 NB (0 0 0 1 0 0) *
      15) spot.262>=0.80825 11 1 RMS.C (0 0.091 0 0 0.91 0) *

The classification tree

[Figure: the fitted classification tree. Root split spot.1389 >= 1.586; the left branch splits on spot.1339 >= 1.943 into EWS.C (0/9/0/0/0/0) and EWS.T (0/0/13/0/0/0); the right branch splits on spot.2303 < 0.4544, then spot.1 < 0.7471 into BL.C (8/0/0/0/0/0) and RMS.T (0/0/0/0/0/10), and spot.262 < 0.8082 into NB (0/0/0/12/0/0) and RMS.C (0/1/0/0/10/0)]

Spot 1389 = Fc fragment of IgG, receptor, transporter, alpha

Spot 1339 = ubiquitin thiolesterase

Spot 2303 = Homo sapiens clone 23716 mRNA sequence

Spot 1 = catenin (cadherin-associated protein), alpha 1 (102kD)

Spot 262 = B-cell CLL/lymphoma 7b

The output (cont.)

> summary(tnhgri2.rp)
Call:
rpart(formula = tnhgri2[1:63, ], method = "class",
      parms = list(split = "information"), minsplit = 5, maxcompete = 20)
n= 63

     CP nsplit rel error xerror       xstd
1  0.24      0      1.00   1.04 0.06026397
2  0.20      1      0.76   0.94 0.06909850
3  0.18      3      0.36   0.76 0.07766432
4  0.16      4      0.18   0.60 0.07928250
5  0.01      5      0.02   0.34 0.07046332

Cost-complexity pruning

[Figure: cost-complexity (cp) plot — X-val Relative Error (0.2–1.2) against cp (Inf, 0.22, 0.19, 0.17, 0.04) and size of tree (1, 2, 4, 5, 6)]

Competing variables

Node number 1: 63 observations,    complexity param=0.24
  predicted class=EWS.T  expected loss=0.7936508
    class counts:     8    10    13    12    10    10
   probabilities: 0.127 0.159 0.206 0.190 0.159 0.159
  left son=2 (22 obs) right son=3 (41 obs)
  Primary splits:
      spot.1389 < 1.58565 to the right, improve=37.50727, (0 missing)
      spot.1634 < 0.7854  to the right, improve=37.50727, (0 missing)
      spot.1954 < 1.31135 to the right, improve=37.50727, (0 missing)
      spot.1803 < 0.20905 to the left,  improve=36.34711, (0 missing)
      spot.157  < 0.17895 to the left,  improve=34.68483, (0 missing)
      spot.545  < 2.14155 to the right, improve=34.56965, (0 missing)
      spot.2050 < 0.30165 to the left,  improve=34.56965, (0 missing)
      spot.246  < 1.4397  to the right, improve=34.56965, (0 missing)

Surrogates

Surrogate splits:

      spot.1954 < 1.31135 to the right, agree=0.968, adj=0.909, (0 split)
      spot.246  < 1.4397  to the right, agree=0.952, adj=0.864, (0 split)
      spot.545  < 2.14155 to the right, agree=0.952, adj=0.864, (0 split)
      spot.2050 < 0.30165 to the left,  agree=0.952, adj=0.864, (0 split)
      spot.1708 < 0.61675 to the right, agree=0.937, adj=0.818, (0 split)

Prediction

• How well does recursive partitioning compare to Khan's ANN?

tnhgri2.pred <- predict.rpart(tnhgri2.rp, newdata=tnhgri2[64:88,],
                              type=c("class"))

• 20 test samples (tabulated in the sketch below)
  – Not included in the training set
  – BL (3/3)
  – RMS-T (4/5)
  – NB (6/6)
  – EWS-T (4/5)
  – EWS-C (1/1)
  – Concordance = 90%
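A sketch of how the concordance could be tabulated; true.class is a hypothetical factor holding the published labels of the test samples (they are not part of this data frame, which codes them all as "TEST"):

table(tnhgri2.pred, true.class)    # rows = predicted class, columns = published class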

Some closing thoughts…

• Performance of the method appears to be quite good
  – Are the "diagnostic" genes meaningful?
• Multiple "solutions" yield a smaller set of "diagnostic" genes
  – Recursive-recursive partitioning?
  – Extraction of larger groups of differential genes
