MMG991 Session 10: Binary Recursive Partitioning


Page 1

MMG991 Session 10
• Binary recursive partitioning
  – Chapter 10 in MASS
  – Zhang et al., PNAS 98: 6730–6735
  – Concepts
  – S-Plus implementation
    • Therneau and Atkinson's RPART routines
  – Microarrays
    • Alon's colon cancer data set
    • Looking at Khan's cancer data
  – Unanswered questions
    • Revisiting the concept of data filtration
  – Is this an alternative solution?
    • Compare the output
      – Classification of cancers
      – Selection of genes
• Projects
  – Updates from each group

Page 2

Binary recursive partitioning
• Tree-based models
  – Graphically encapsulate and express knowledge
  – Aid in decision making
• Prediction of some outcome
  – Used in a variety of settings
    • Social sciences
    • Machine learning
• S-Plus implementation
  – tree() implementation
    • Integral part of the S-Plus distribution
  – rpart()
    • Separate library/distribution
    – Advantages
      » Speed
      » Size of data sets
      » Ability to handle missing data
    • S-Plus Version 6.0
      – www.stats.ox.ac.uk/pub/Mass3/Winlibs

Page 3

Basic concepts
• Two basic models
  – Classification trees*
    • Supervised method
  – Regression trees
    • ANOVA
    • Poisson
    • Survival
• Data type
  – Categorical
  – Continuous
• Underlying concept
  – Identify the single variable that best splits the data into two groups
    • Loss functions, conditional likelihoods, measures of impurity
  – Separate the data
  – Repeat the process on each subset until each node reaches a minimum size or no further improvement occurs
  – Refine the model
    • Pruning, snipping, cross-validation
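
Before applying this loop to microarray data, it can be tried on a small example. A minimal sketch using the kyphosis data frame that ships with the rpart library (the variables here belong to that example data set, not to the class data):

library(rpart)
# Classify post-operative Kyphosis (present/absent) from three clinical variables
kyph.rp <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class")
printcp(kyph.rp)               # splits actually used and cross-validated error
plot(kyph.rp, uniform = T)
text(kyph.rp)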

Page 4

Adding the RPART library
• Distributed as a .zip file
  – Download to the library directory
  – Unzip
    • Requires WinZip or a comparable archive tool to unpack
    • Installs the rpart library automatically
  – Includes help files and documentation
  – RPART technical reports
    • Useful supplementary reading
    – The long report provides extensive theoretical background

Page 5

The data set(s)
• The incomplete data set
  – http://www.sph.uth.tmc.edu/hgc
  – Sequentially ordered Affymetrix data
  – No mapping to genes!
  – No reference to the original source of the data!
  – Cannot reproduce the authors' observations!
• The original data set
  – Alon et al., PNAS 96: 6745–6750
  – http://microarray.princeton.edu/oncology/
  – Sequentially ordered Affymetrix data
  – Gene map
  – Ordering of experiments

Page 6

Some new twists
• Overriding S-Plus defaults
  – Necessary because of the size of the data sets
  – Increase expressions to 4000
  – Increase object.size to 100000000
• Loading the RPART and MASS3 libraries
• Fitted model notation
  – See p. 33 of MASS
  – response ~ expression
• For classification trees (see the sketch below)
  – tissue.type ~ spot1 + spot2 + … + spotn
  – tissue.type ~ ., data = data.frame
  – Default location of the response variable

Page 7

Setting up the data for analysis
• Import data through the GUI interface
  – Excel files
    • affydata
      – matrix
    • affynames
      – gene names, Affymetrix identifiers, gene annotations
  – Cut and paste
    • affyexpts
• Housekeeping

# Load the libraries and raise the S-Plus limits noted above
library(MASS)
library(rpart)
options(object.size = 100000000)
options(expressions = 4000)
attach("mmg991", pos = 1)

Page 8

The transposed matrix

# Transpose so that rows are samples and columns are spots
taffydata <- as.data.frame(t(affydata))
# Prepend an empty column to hold the class label
taffydata <- cbind(matrix("", nrow(taffydata), 1), taffydata)
dimnames(taffydata)[[2]][1] <- "tissue.type"
# Label each sample from its row name, then make the label a factor
taffydata[, 1] <- as.character(taffydata[, 1])
taffydata[grep("tumor.*", dimnames(taffydata)[[1]]), 1] <- "tumor"
taffydata[grep("normal.*", dimnames(taffydata)[[1]]), 1] <- "normal"
taffydata[, 1] <- as.factor(taffydata[, 1])
taffydata <- as.data.frame(taffydata)

Page 9

Binary recursive partitioning

# tissue.type is found in the attached frame; dropping column 1 from `data`
# keeps the response off the right-hand side of "."
attach(taffydata)
taffy.rp <- rpart(tissue.type ~ ., data = taffydata[, -1],
                  minsplit = 5, maxcompete = 10)

plot(taffy.rp, uniform = T, branch = 0.25)
text(taffy.rp)
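
A quick resubstitution check of the fit, a sketch using predict() on the rpart object (this reuses the training data, so it is optimistic):

# Predicted class for each training sample vs. the true label
table(predict(taffy.rp, type = "class"), taffydata$tissue.type)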

Page 10

The heatmap

[Figure: heatmap of the Alon data matrix; x-axis 0–2000 (spots), y-axis 0–60 (samples)]
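
One plausible way to draw such a heatmap in S-Plus; this is a guess at the call, not the code actually used for the slide:

# Rows of affydata are spots, columns are samples; log to compress the range
image(1:nrow(affydata), 1:ncol(affydata), log(affydata),
      xlab = "spot", ylab = "sample")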

Page 11

The output

> taffy.rp
n= 62

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 62 22 tumor (0.3548387 0.6451613)
   2) spot.1671< 59.82813 14  0 normal (1.0000000 0.0000000) *
   3) spot.1671>=59.82813 48  8 tumor (0.1666667 0.8333333)
     6) spot.590< 289.9481 18  8 tumor (0.4444444 0.5555556)
      12) spot.123>=766.2131  7  0 normal (1.0000000 0.0000000) *
      13) spot.123< 766.2131 11  1 tumor (0.0909090 0.9090909) *
     7) spot.590>=289.9481 30  0 tumor (0.0000000 1.0000000) *

Page 12

The tree

[Figure: the fitted tree]
  spot.1671 < 59.83 → normal (14/0)
  spot.1671 >= 59.83:
    spot.590 < 289.9:
      spot.123 >= 766.2 → normal (7/0)
      spot.123 < 766.2 → tumor (1/10)
    spot.590 >= 289.9 → tumor (0/30)

spot.1671 = M26383, Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds.
spot.590 = R15447, CALNEXIN PRECURSOR (Homo sapiens)
spot.123 = M28214, RAS-RELATED PROTEIN RAB-3B (HUMAN)

Page 13

Examining the model

> printcp(taffy.rp)

Classification tree:
rpart(formula = taffydata, method = "class",
      parms = list(split = "information"),
      minsplit = 5, maxcompete = 2000)

Variables actually used in tree construction:
[1] spot.123  spot.1671 spot.590

Root node error: 22/62 = 0.35484

n= 62

       CP nsplit rel error  xerror    xstd
1 0.63636      0  1.000000 1.00000 0.17125
2 0.15909      1  0.363636 0.72727 0.15661
3 0.01000      3  0.045455 0.63636 0.14965
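
A common way to turn this table into a pruning choice is the one-standard-error rule: take the simplest tree whose xerror is within one xstd of the minimum. A sketch against the cptable component of the fitted object:

ct <- taffy.rp$cptable
best <- which.min(ct[, "xerror"])                # row with the minimum xerror
cutoff <- ct[best, "xerror"] + ct[best, "xstd"]  # one-SE threshold
cp.choice <- ct[min(which(ct[, "xerror"] <= cutoff)), "CP"]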

Page 14

Examining the model (cont.)

> summary(taffy.rp)
Call:
rpart(formula = taffydata, method = "class",
      parms = list(split = "information"),
      minsplit = 5, maxcompete = 2000)
n= 62

         CP nsplit  rel error    xerror      xstd
1 0.6363636      0 1.00000000 1.0000000 0.1712469
2 0.1590909      1 0.36363636 0.7272727 0.1566103
3 0.0100000      3 0.04545455 0.6363636 0.1496463

Node number 1: 62 observations, complexity param=0.6363636
  predicted class=tumor  expected loss=0.3548387
  class counts:    22    40
  probabilities: 0.355 0.645
  left son=2 (14 obs) right son=3 (48 obs)
  Primary splits:
    spot.1671 < 59.82813 to the left,  improve=18.6972800, (0 missing)
    spot.249  < 1696.228 to the right, improve=16.5197300, (0 missing)
    spot.493  < 379.3894 to the right, improve=16.1353500, (0 missing)
    spot.765  < 842.3056 to the right, improve=15.3041000, (0 missing)

Page 15

To prune or not to prune…

[Figure: plotcp output; x-axis cp (Inf, 0.32, 0.04) and size of tree (1, 2, 4), y-axis X-val relative error (0.4–1.2)]
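
Had pruning been warranted, the usual sequence is plotcp() followed by prune() at a complexity parameter read off the plot; cp = 0.04 below is illustrative, taken from the plot's axis:

plotcp(taffy.rp)                            # the cost-complexity plot above
taffy.pruned <- prune(taffy.rp, cp = 0.04)  # snip splits weaker than cp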

Page 16

Subset of samples in nodes 5-7

[Figure: scatterplot of log(taffydata[, 124]) vs. log(taffydata[, 591]); x-axis 4–7, y-axis 5.5–7.5; points labeled Node 5, Node 6, and Node 7]
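
A sketch of how this view can be reproduced; the column indices assume tissue.type occupies column 1, so spot.590 sits in column 591 and spot.123 in column 124 of taffydata:

plot(log(taffydata[, 591]), log(taffydata[, 124]),
     xlab = "log(spot.590)", ylab = "log(spot.123)")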

Page 17

The take home message
• Excellent agreement with Zhang et al.
• The model is, however, trivial
  – No pruning required
• How well does the method really perform?

Page 18

Revisiting Khan's MDS/ANN data set
• Tumor classification/diagnostic prediction
  – 63 training samples/25 test samples
  – Four tumor/cell types
    • EWS: 13 tumors/10 cell lines
    • BL: 8 cell lines
    • NB: 12 cell lines
    • RMS: 10 tumors/10 cell lines
  – Test set: 25 samples
  – Relative red index (see the sketch below)
    • rri = mean spot intensity / mean intensity of filtered genes
    • Expression measured as ln(rri)
  – Strategy
    • Training set
    • Prediction
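
A sketch of the rri transformation; spot.intensity and filtered.mean are hypothetical names for illustration, not objects from the class data:

# rri = mean spot intensity / mean intensity of the filtered genes
rri <- spot.intensity / filtered.mean
expr <- log(rri)    # ln(rri) is the expression value analyzed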

Page 19

The heatmap

[Figure: heatmap of the Khan data set; x-axis 0–100, y-axis 0–80 (samples), color scale −2 to 2]

Page 20

Updating the model

# Copy the matrix, name the columns spot.1 ... spot.2308,
# and prepend a tissue.type column
tnhgri2 <- tnhgri
dimnames(tnhgri2)[[2]] <- paste("spot", 1:2308, sep = ".")
tnhgri2 <- as.data.frame(tnhgri2)
tnhgri2 <- cbind(1:nrow(tnhgri2), tnhgri2)
dimnames(tnhgri2)[[2]][1] <- "tissue.type"

# Index the samples by the class encoded in their row names
ews.c <- grep("EWS.C*", dimnames(tnhgri2)[[1]])
ews.t <- grep("EWS.T*", dimnames(tnhgri2)[[1]])
bl    <- grep("BL.C",   dimnames(tnhgri2)[[1]])
nb    <- grep("NB*",    dimnames(tnhgri2)[[1]])
rms.c <- grep("RMS.C*", dimnames(tnhgri2)[[1]])
rms.t <- grep("RMS.T*", dimnames(tnhgri2)[[1]])
test  <- grep("TEST*",  dimnames(tnhgri2)[[1]])

# Write the class labels and convert to a factor
tnhgri2[, 1] <- as.character(tnhgri2[, 1])
tnhgri2[ews.c, 1] <- "EWS.C"
tnhgri2[ews.t, 1] <- "EWS.T"
tnhgri2[bl, 1]    <- "BL.C"
tnhgri2[nb, 1]    <- "NB"
tnhgri2[rms.c, 1] <- "RMS.C"
tnhgri2[rms.t, 1] <- "RMS.T"
tnhgri2[test, 1]  <- "TEST"
tnhgri2[, 1] <- as.factor(tnhgri2[, 1])
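
A quick sanity check that the relabeling worked; the counts per class should match the sample breakdown described earlier:

table(tnhgri2[, 1])    # samples per class, including TEST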

Page 21

Creating the classification tree

# With no explicit formula, rpart takes the first column (tissue.type)
# as the response; rows 1:63 are the training samples
tnhgri2.rp <- rpart(tnhgri2[1:63, ], minsplit = 5, maxcompete = 20,
                    method = "class", parms = list(split = "information"))

plot(tnhgri2.rp, uniform = T, branch = 0.25)
text(tnhgri2.rp, use.n = T)

Page 22

The output

> tnhgri2.rp
n= 63

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 63 50 EWS.T (0.13 0.16 0.21 0.19 0.16 0.16)
   2) spot.1389>=1.58565 22  9 EWS.T (0 0.41 0.59 0 0 0)
     4) spot.1339>=1.9433  9  0 EWS.C (0 1 0 0 0 0) *
     5) spot.1339< 1.9433 13  0 EWS.T (0 0 1 0 0 0) *
   3) spot.1389< 1.58565 41 29 NB (0.2 0.024 0 0.29 0.24 0.24)
     6) spot.2303< 0.4544 18  8 RMS.T (0.44 0 0 0 0 0.56)
      12) spot.1< 0.74705  8  0 BL.C (1 0 0 0 0 0) *
      13) spot.1>=0.74705 10  0 RMS.T (0 0 0 0 0 1) *
     7) spot.2303>=0.4544 23 11 NB (0 0.043 0 0.52 0.43 0)
      14) spot.262< 0.80825 12  0 NB (0 0 0 1 0 0) *
      15) spot.262>=0.80825 11  1 RMS.C (0 0.091 0 0 0.91 0) *

Page 23

The classification tree

[Figure: the fitted classification tree]
  spot.1389 >= 1.586:
    spot.1339 >= 1.943 → EWS.C (0/9/0/0/0/0)
    spot.1339 < 1.943 → EWS.T (0/0/13/0/0/0)
  spot.1389 < 1.586:
    spot.2303 < 0.4544:
      spot.1 < 0.7471 → BL.C (8/0/0/0/0/0)
      spot.1 >= 0.7471 → RMS.T (0/0/0/0/0/10)
    spot.2303 >= 0.4544:
      spot.262 < 0.8082 → NB (0/0/0/12/0/0)
      spot.262 >= 0.8082 → RMS.C (0/1/0/0/10/0)

Spot 1389 = Fc fragment of IgG, receptor, transporter, alpha
Spot 1339 = ubiquitin thiolesterase
Spot 2303 = Homo sapiens clone 23716 mRNA sequence
Spot 1 = catenin (cadherin-associated protein), alpha 1 (102 kD)
Spot 262 = B-cell CLL/lymphoma 7b

Page 24

The output (cont.)

> summary(tnhgri2.rp)
Call:
rpart(formula = tnhgri2[1:63, ], method = "class",
      parms = list(split = "information"),
      minsplit = 5, maxcompete = 20)
n= 63

    CP nsplit rel error xerror       xstd
1 0.24      0      1.00   1.04 0.06026397
2 0.20      1      0.76   0.94 0.06909850
3 0.18      3      0.36   0.76 0.07766432
4 0.16      4      0.18   0.60 0.07928250
5 0.01      5      0.02   0.34 0.07046332

Page 25

Cost-complexity pruning

[Figure: plotcp output; x-axis cp (Inf, 0.22, 0.19, 0.17, 0.04) and size of tree (1, 2, 4, 5, 6), y-axis X-val relative error (0.2–1.2)]

Page 26

Competing variables

Node number 1: 63 observations, complexity param=0.24
  predicted class=EWS.T  expected loss=0.7936508
  class counts:     8    10    13    12    10    10
  probabilities: 0.127 0.159 0.206 0.190 0.159 0.159
  left son=2 (22 obs) right son=3 (41 obs)
  Primary splits:
    spot.1389 < 1.58565 to the right, improve=37.50727, (0 missing)
    spot.1634 < 0.7854  to the right, improve=37.50727, (0 missing)
    spot.1954 < 1.31135 to the right, improve=37.50727, (0 missing)
    spot.1803 < 0.20905 to the left,  improve=36.34711, (0 missing)
    spot.157  < 0.17895 to the left,  improve=34.68483, (0 missing)
    spot.545  < 2.14155 to the right, improve=34.56965, (0 missing)
    spot.2050 < 0.30165 to the left,  improve=34.56965, (0 missing)
    spot.246  < 1.4397  to the right, improve=34.56965, (0 missing)

Page 27

Surrogates

Surrogate splits:
  spot.1954 < 1.31135 to the right, agree=0.968, adj=0.909, (0 split)
  spot.246  < 1.4397  to the right, agree=0.952, adj=0.864, (0 split)
  spot.545  < 2.14155 to the right, agree=0.952, adj=0.864, (0 split)
  spot.2050 < 0.30165 to the left,  agree=0.952, adj=0.864, (0 split)
  spot.1708 < 0.61675 to the right, agree=0.937, adj=0.818, (0 split)

Page 28

Prediction

• How well does recursive partitioning compare to Khan's ANN?

# Classify the 25 held-out samples (rows 64:88) with the fitted tree
tnhgri2.pred <- predict.rpart(tnhgri2.rp, newdata = tnhgri2[64:88, ],
                              type = c("class"))

• 20 test samples
  – Not included in the training set
  – BL (3/3)
  – RMS-T (4/5)
  – NB (6/6)
  – EWS-T (4/5)
  – EWS-C (1/1)
  – Concordance = 90%
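
A sketch of how the concordance could be tabulated, assuming a hypothetical vector test.class holding the published labels, aligned with the predictions for the classifiable test samples:

table(tnhgri2.pred, test.class)                        # confusion matrix
sum(tnhgri2.pred == test.class) / length(test.class)   # fraction concordant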

Page 29

Some closing thoughts…

• Performance of the method appears to be quite good
  – Are the "diagnostic" genes meaningful?
• Multiple "solutions" yield a smaller set of "diagnostic" genes
  – Recursive-recursive partitioning?
  – Extraction of larger groups of differential genes