Class prediction for experiments with microarrays
Lara Lusa, Inštitut za biomedicinsko informatiko, Medicinska fakulteta, Lara.Lusa at mf.uni-lj.si


Page 1: Class prediction for experiments with microarrays

Class prediction for experiments with microarrays

Lara Lusa
Inštitut za biomedicinsko informatiko, Medicinska fakulteta
Lara.Lusa at mf.uni-lj.si

Page 2: Class prediction for experiments with microarrays

Outline

• Objectives of microarray experiments
• Class prediction
  – What is a predictor?
  – How to develop a predictor?
    • Which are the available methods?
    • Which features should be used in the predictor?
  – How to evaluate a predictor?
    • Internal vs. external validation
  – Some examples of what can go wrong
• The molecular classification of breast cancer

Page 3: Class prediction for experiments with microarrays

Study design
→ Performance of the experiment: sample preparation, hybridization, image analysis
→ Quality control and normalization
→ Data analysis: class comparison, class prediction, class discovery
→ Interpretation of the results

Scheme of an experiment

Page 4: Class prediction for experiments with microarrays

• Class comparison – supervised
  – establish differences in gene expression between predetermined classes (phenotypes)
    • tumor vs. normal tissue
    • recurrent vs. non-recurrent patients treated with a drug (Ma, 2004)
    • ER+ vs. ER- patients (West, 2001)
    • BRCA1, BRCA2 and sporadic breast cancers (Hedenfalk, 2001)

• Class prediction – supervised
  – prediction of the phenotype using gene expression data
    • morphology of a leukemia patient based on his gene expression (ALL vs. AML, Golub 1999)
    • which patients with breast cancer will develop a distant metastasis within 5 years (van’t Veer, 2002)

• Class discovery – unsupervised
  – discover groups of samples or genes with similar expression
    • Luminal A, B, C(?), Basal, ERBB2+, Normal in breast cancer (Perou 2001, Sørlie 2003)

Aims of high-throughput experiments

Page 5: Class prediction for experiments with microarrays

Data from microarray experiments

Page 6: Class prediction for experiments with microarrays

How to develop a predictor?

• On a training set of samples:
  – select a subset of genes (feature selection)
  – use the gene expression measurements (X) to obtain a RULE g(X) for the classification of new samples
• Predict the class membership (Y) of new samples (the test set)

Page 7: Class prediction for experiments with microarrays

An example from Duda et al.

Page 8: Class prediction for experiments with microarrays

Rule: Nearest-neighbor classifier

– For each sample of the independent data set (“test set”), calculate Pearson’s (centered) correlation of its gene expression with each sample from the training set

– Classification rule: assign the new sample to the class of the training-set sample that has the highest correlation with it

[Figure: correlations between a new sample and the samples of the training set; Bishop, 2006]
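As an illustration (not code from the slides), the rule fits in a few lines of numpy; here `X_train` and `X_test` are assumed to hold samples in rows and genes in columns, and the labels are arbitrary:

```python
import numpy as np

def pearson_nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbour classification with Pearson's (centered)
    correlation as the similarity measure."""
    # centre each sample's profile: Pearson correlation is the cosine
    # similarity of the centred, unit-norm profiles
    A = X_train - X_train.mean(axis=1, keepdims=True)
    B = X_test - X_test.mean(axis=1, keepdims=True)
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    corr = B @ A.T                  # (n_test, n_train) correlation matrix
    nearest = corr.argmax(axis=1)   # training sample with highest correlation
    return np.asarray(y_train)[nearest]

# toy example: the first test profile tracks the "A" pattern, the second the "B" pattern
X_train = np.array([[1.0, 2, 3, 4], [4.0, 3, 2, 1]])
pred = pearson_nn_predict(X_train, ["A", "B"], np.array([[2.0, 3, 4, 6], [5.0, 4, 3, 1]]))
```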

Page 9: Class prediction for experiments with microarrays

Rule: K-Nearest-neighbor classifier

– For each sample of the independent data set (“test set”), calculate Pearson’s (centered) correlation of its gene expression with each sample from the training set

– Classification rule: assign the new sample to the class to which the majority of the K most highly correlated training-set samples belong

[Figure: correlations between a new sample and the samples of the training set, K = 3; Bishop, 2006]
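A minimal numpy sketch of this majority-vote rule (illustrative only; with k = 1 it reduces to the nearest-neighbour classifier of the previous slide):

```python
import numpy as np
from collections import Counter

def pearson_knn_predict(X_train, y_train, X_test, k=3):
    """K-nearest-neighbour classification with Pearson's (centered)
    correlation: each test sample gets the majority class among the
    k training samples most correlated with it."""
    A = X_train - X_train.mean(axis=1, keepdims=True)
    B = X_test - X_test.mean(axis=1, keepdims=True)
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    corr = B @ A.T                         # (n_test, n_train) correlations
    y_train = np.asarray(y_train)
    labels = []
    for row in corr:
        top_k = y_train[np.argsort(row)[::-1][:k]]     # k most correlated samples
        labels.append(Counter(top_k).most_common(1)[0][0])
    return labels
```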

Page 10: Class prediction for experiments with microarrays

Rule: Method of centroids (Sørlie et al. 2003)

• Method of centroids – class prediction rule:

– Define a centroid for each class on the original data set (“training set”)

• For each gene, average its expression from the samples assigned to that class

– For each sample of the independent data set (“testing set”) calculate Pearson’s (centered) correlation of its gene expression with each centroid

– Classification rule: assign the sample to the class whose centroid has the highest correlation with the sample (if the highest correlation is below 0.1, leave the sample unassigned)

[Figure: a new sample is assigned to the class whose centroid has the highest correlation with it]
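A sketch of the rule in numpy (an illustration of the method as described above, not the original authors' code); `cutoff=0.1` implements the “do not assign” threshold:

```python
import numpy as np

def centroid_predict(X_train, y_train, X_test, cutoff=0.1):
    """Method-of-centroids classification: the centroid of a class is the
    gene-wise mean profile of its training samples; a new sample goes to
    the class whose centroid it correlates with best, or to None when the
    best correlation is below `cutoff`."""
    y_train = np.asarray(y_train)
    classes = sorted(set(y_train.tolist()))
    centroids = np.vstack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Pearson correlation = cosine similarity of centred, unit-norm profiles
    C = centroids - centroids.mean(axis=1, keepdims=True)
    B = X_test - X_test.mean(axis=1, keepdims=True)
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    corr = B @ C.T                    # (n_test, n_classes)
    best = corr.argmax(axis=1)
    return [classes[j] if corr[i, j] >= cutoff else None
            for i, j in enumerate(best)]
```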

Page 11: Class prediction for experiments with microarrays

Rule: Diagonal Linear Discriminant Analysis (DLDA)

• Calculate the mean expression of the samples from Class 1 and from Class 2 in the training set for each of the G genes (call them m1j and m2j for gene j)

• and the pooled within-class variance of each gene, s²j

• For each sample x* of the test set evaluate whether

  sum_j (x*j − m1j)² / s²j  <  sum_j (x*j − m2j)² / s²j

• where x*j is the expression of the j-th gene for the new sample

• Classification rule: if the above inequality is satisfied, classify the sample to Class 1, otherwise to Class 2.
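The inequality can be evaluated directly; a minimal two-class sketch (illustrative code, genes in columns):

```python
import numpy as np

def dlda_predict(X1, X2, X_test):
    """Diagonal linear discriminant analysis for two classes.
    X1, X2: training samples of Class 1 / Class 2 (rows = samples,
    columns = genes). Each test sample is assigned to the class whose
    mean profile is closer in variance-standardised (diagonal) distance."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # pooled within-class variance of each gene
    s2 = (((X1 - m1)**2).sum(axis=0) + ((X2 - m2)**2).sum(axis=0)) / (n1 + n2 - 2)
    d1 = (((X_test - m1)**2) / s2).sum(axis=1)
    d2 = (((X_test - m2)**2) / s2).sum(axis=1)
    return np.where(d1 < d2, 1, 2)
```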

Page 12: Class prediction for experiments with microarrays

Rule: Diagonal Linear Discriminant Analysis (DLDA)

• A particular case of discriminant analysis, under the assumptions that

  – the features are not correlated

  – the variances of the two classes are the same

• Other methods used in microarray studies are variants of discriminant analysis:

•Compound covariate predictor

•Weighted vote method

Bishop, 2006

Page 13: Class prediction for experiments with microarrays

Other popular classification methods

• Classification and Regression Trees (CART)

• Prediction Analysis of Microarrays (PAM)

• Support Vector Machines (SVM)

• Logistic regression

• Neural networks Bishop, 2006

Page 14: Class prediction for experiments with microarrays

How to choose a classification method?

• No single method is optimal in every situation
  – No Free Lunch Theorem: in the absence of assumptions we should not prefer any classification algorithm over another
  – Ugly Duckling Theorem: in the absence of assumptions there is no “best” set of features

Page 15: Class prediction for experiments with microarrays

The bias-variance tradeoff

Duda et al, 2001

Hastie et al, 2001

MSE = E_D[ (g(x; D) − F(x))² ]
    = ( E_D[ g(x; D) ] − F(x) )²  +  E_D[ ( g(x; D) − E_D[ g(x; D) ] )² ]
    = Bias² + Variance
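The decomposition can be checked numerically. This hypothetical simulation (my construction, not from the slides) fits polynomials of degree 1 and 5 to noisy samples of F(x) = x² over many training sets D: the rigid fit has the larger squared bias, the flexible fit the larger variance, and MSE equals Bias² + Variance exactly at every x:

```python
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: x**2                   # true function
x_grid = np.linspace(-1, 1, 21)      # points where the predictor is evaluated
n, sigma, reps = 20, 0.3, 2000       # training size, noise sd, number of data sets D

results = {}
for degree in (1, 5):
    preds = np.empty((reps, x_grid.size))
    for r in range(reps):                            # draw a fresh training set D
        x = rng.uniform(-1, 1, n)
        y = F(x) + rng.normal(0, sigma, n)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x_grid)
    bias2 = (preds.mean(axis=0) - F(x_grid))**2      # ( E_D[g] - F )^2
    var = preds.var(axis=0)                          # E_D[ (g - E_D[g])^2 ]
    mse = ((preds - F(x_grid))**2).mean(axis=0)      # E_D[ (g - F)^2 ]
    assert np.allclose(mse, bias2 + var)             # the decomposition is exact
    results[degree] = (bias2.mean(), var.mean())
```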

Page 16: Class prediction for experiments with microarrays

Feature selection

• Can ALL the gene expression variables be included in the classifier?
• Which variables should be used to build the classifier?
  – Filter methods
    • applied prior to building the classifier
    • one feature at a time, or joint-distribution approaches
  – Wrapper methods
    • performed implicitly by the classifier (e.g., CART, PAM)

From Fridlyand, CBMB Workshop

Page 17: Class prediction for experiments with microarrays

A comparison of classifiers’ performance for microarray data

• Dudoit, Fridlyand and Speed (2002, JASA), on 3 data sets
  – DA, DLDA, k-NN, SVM, CART
• Good performance of simple classifiers such as DLDA and NN
• Feature selection: a small number of features included in the classifier

Page 18: Class prediction for experiments with microarrays

How to evaluate the performance of a classifier

• Classification error
  – a sample is classified in a class to which it does not belong: g(X) ≠ Y
  – predictive accuracy = % of correctly classified samples

• In a two-class problem, using the terminology of diagnostic tests (“+” = diseased, “−” = healthy):
  – Sensitivity = P(classified + | true +)
  – Specificity = P(classified − | true −)
  – Positive predictive value = P(true + | classified +)
  – Negative predictive value = P(true − | classified −)
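All four quantities follow directly from the 2×2 table of predicted vs. true classes; a small helper (illustrative, the function name is mine):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Diagnostic-test summaries from a 2x2 classification table,
    with "+" = diseased and "-" = healthy:
      tp = true + classified +,  fp = true - classified +,
      fn = true + classified -,  tn = true - classified -."""
    return {
        "sensitivity": tp / (tp + fn),            # P(classified + | true +)
        "specificity": tn / (tn + fp),            # P(classified - | true -)
        "ppv":         tp / (tp + fp),            # P(true + | classified +)
        "npv":         tn / (tn + fn),            # P(true - | classified -)
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
    }
```

For example, a classifier with 90 true positives, 5 false positives, 10 false negatives and 95 true negatives has sensitivity 0.90 and specificity 0.95.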

Page 19: Class prediction for experiments with microarrays

Class prediction: how to assess the predictive accuracy?

• Use an independent data set
• If it is not available?
  – ABSOLUTELY WRONG: apply your predictor to the data you used to develop it and see how well it predicts
  – OK: cross-validation, bootstrap

[Figure: the data are split into folds; each fold serves in turn as the test set while the remaining folds are used for training]

Page 20: Class prediction for experiments with microarrays

How to develop a cross-validated class predictor

• On the training set of each cross-validation split:
  – select a subset of genes
  – use the gene expression measurements to obtain the parameters of a “mathematical function” (the class predictor)
• Predict the class of the samples in the test set using the class predictor built on the training set

[Figure: repeated train/test splits of the data]
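The essential point is that every step, including gene selection, is repeated inside each training fold. A hypothetical sketch (function and variable names are mine, not from the slides):

```python
import numpy as np

def cv_accuracy(X, y, train_and_predict, n_folds=5, seed=0):
    """K-fold cross-validated predictive accuracy.
    `train_and_predict(X_tr, y_tr, X_te)` must carry out the WHOLE
    predictor-building procedure (feature selection included) on the
    training fold only, then return predictions for X_te."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    correct = 0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        pred = np.asarray(train_and_predict(X[train], y[train], X[fold]))
        correct += (pred == y[fold]).sum()
    return correct / len(y)

def select_and_classify(X_tr, y_tr, X_te, n_genes=5):
    """Toy rule: keep the genes with the largest between-class mean
    difference (computed on the training fold only!), then assign each
    test sample to the nearest class mean."""
    m0 = X_tr[y_tr == 0].mean(axis=0)
    m1 = X_tr[y_tr == 1].mean(axis=0)
    keep = np.argsort(np.abs(m1 - m0))[::-1][:n_genes]
    d0 = ((X_te[:, keep] - m0[keep])**2).sum(axis=1)
    d1 = ((X_te[:, keep] - m1[keep])**2).sum(axis=1)
    return (d1 < d0).astype(int)
```

Selecting the genes once on the full data set and only cross-validating the final rule would give the biased estimates discussed on the following slides.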

Page 21: Class prediction for experiments with microarrays

Dupuy and Simon, JNCI 2007

Supervised prediction

12/28 reported a misleading estimate of prediction accuracy

50% of studies contained one or more major flaws

Page 22: Class prediction for experiments with microarrays
Page 23: Class prediction for experiments with microarrays

Class prediction: a famous example

van’t Veer et al. reported the results obtained with the wrong analysis in the paper, and the correct analysis (with less striking results) only in the supplementary material

Page 24: Class prediction for experiments with microarrays

What went wrong?

• The flawed analysis produces highly biased estimates of predictive accuracy

• Going beyond the quantification of predictive accuracy and attempting to make inference with a cross-validated class predictor: THE INFERENCE MADE IS NOT VALID

Page 25: Class prediction for experiments with microarrays

Microarray predictor vs. observed outcome:

                     Observed
                     <5 yrs   >5 yrs
  Good prognosis       31       18
  Bad prognosis         2       26

Odds ratio = 15.0, p-value = 4 × 10^(-6)

  Parameter       Logistic coeff.   Std. error   Odds ratio   95% CI
  ------------------------------------------------------------------------
  Grade                -0.08           0.79          1.1      [0.2, 5.1]
  ER                    0.5            0.94          1.7      [0.3, 10.4]
  PR                   -0.75           0.93          2.1      [0.3, 13.1]
  Size (mm)            -1.26           0.66          3.5      [1.0, 12.8]
  Age                   1.4            0.79          4.0      [0.9, 19.1]
  Angioinvasion        -1.55           0.74          4.7      [1.1, 20.1]
  Microarray            2.87           0.851         7.6      [3.3, 93.7]
  ------------------------------------------------------------------------

Hypothesis: there is no difference between the classes

  Nominal level                   0.01    0.05    0.10
  Prop. of rejected H0, LOO CV    0.268   0.414   0.483
  (n = 100)

Lusa, McShane, Radmacher, Shih, Wright, Simon, Statistics in Medicine, 2007

Page 26: Class prediction for experiments with microarrays

Michiels et al, 2005 Lancet

Page 27: Class prediction for experiments with microarrays

Final remarks

• Simple classification methods such as DLDA have proved to work well for microarray studies and to outperform fancier methods

• Many classification methods that have been proposed in the field under new names are just slight modifications of already known techniques

Page 28: Class prediction for experiments with microarrays

Final remarks

• Report all the necessary information about your classifier so that others can apply it to their data

• Evaluate the predictive accuracy of the classifier correctly
  – in the “early microarray times”, many papers presented analyses that were not correct, or drew wrong conclusions from their work
  – even now, middle- and low-impact-factor journals keep publishing obviously wrong analyses

• Don’t apply methods without understanding exactly
  – what they are doing
  – on which assumptions they rely

Page 29: Class prediction for experiments with microarrays

Other issues in classification

• Missing data
• Class representation
• Choice of the distance function
• Standardization of observations and variables

An example where all of this matters…

Page 30: Class prediction for experiments with microarrays

Class discovery

• Mostly performed through hierarchical clustering of genes and samples
  – an often-abused method in microarray analysis, used instead of supervised methods

• In very few examples:
  – the stability and reproducibility of the clustering is assessed
  – the results are “validated” or further used after the “discovery”
  – a rule for the classification of new samples is given

• “Projection” of the clustering onto new data sets still seems problematic: it becomes a class prediction problem

Page 31: Class prediction for experiments with microarrays

Molecular taxonomy of breast cancer

• Perou/Sørlie (Stanford/Norway)

– Class sub-type discovery (Perou, Nature 2001, Sørlie, PNAS 2001, Sørlie, PNAS 2003)

– Association of discovered classes with survival and other clinical variables (Sørlie, PNAS 2001, Sørlie, PNAS 2003)

– Validation of findings assigning class labels defined from class discovery to independent data sets (Sørlie, PNAS 2003)

Page 32: Class prediction for experiments with microarrays

[Figure: dendrogram with per-node annotations 28 (>.32) 89%, 11 (>.28) 82%, 11 (>.34) 64%, 19 (>.41) 22%, 10 (>.31), 2/3]

Hierarchical clustering of the 122 samples from the paper using the “intrinsic gene-set” (~500 genes)

Average linkage and distance= 1- Pearson’s (centered) correlation

Number of samples in each class (node correlation for the core samples included for each subtype) and percentage of ER positive samples

ER+: n = 79 (64%)

Sørlie et al, PNAS 2003

Page 33: Class prediction for experiments with microarrays

Can we assign subtype membership to samples from independent data sets?

• Method of centroids – class prediction rule:

– Define a centroid for each class on the original data set (“training set”)

• For each gene, average its expression from the samples assigned to that class

– For each sample of the independent data set (“testing set”) calculate Pearson’s (centered) correlation of its gene expression with each centroid

– Classification rule: Assign the sample to the class for which the centroid has the highest correlation with the sample (if below .1 do not assign)

[Figure: a new sample is assigned to the class whose centroid has the highest correlation with it]

Sørlie et al. 2003

West data set

• Cited thousands of times
• Widely used in research papers and praised in editorials
• Recent concerns raised about their reproducibility and robustness

Page 34: Class prediction for experiments with microarrays

Predicted class membership: Sørlie’s centroids applied to our data

• Loris: “I obtained the subtypes on our data! All the samples from Tam113 are Lum A, a bit strange... there are no Lum B in our data set”

• Lara: “Have you tried it also on the BRCA60?”

• Loris: “No [...] Those are mostly Lum A, too. Some are Normal, very strange... there are no Basal among the ER-!”

• Lara: “[...] Have you mean-centered the genes?”

• Loris: “No [...] It looks better on BRCA60: now the ER- are mostly Basal... On Tam113 I get many Lum B... But 50% of the samples from Tam113 are NOT luminal anymore!”

Something is wrong!

Tam113: tamoxifen-treated breast cancers (113 ER+ / 0 ER-)

BRCA60: hereditary breast cancers (42 ER+ / 16 ER-)

Page 35: Class prediction for experiments with microarrays

How are the systematic differences between microarray platforms/batches taken into account?

• Sørlie’s et al 2003 data set

– Genes were mean- (and in some cases median-) centered

“[…], the data file was adjusted for array batch differences as follows; on a gene-by-gene basis, we computed the mean of the nonmissing expression values separately in each batch. Then for each sample and each gene, we subtracted its batch mean for that gene. Hence, the adjusted array would have zero row-means within each batch. This ensures that any variance in a gene is not a result of a batch effect.”

“Rows (genes) were median-centered and both genes and experiments were clustered by using an average hierarchical clustering algorithm.”

• West et al. data set (Affymetrix, single-channel data)

  – Genes were “centered”: “Data were transformed to a compatible format by normalizing to the median experiment […] Each absolute expression value in a given sample was converted to a ratio by dividing by its average expression value across all samples.”

• van’t Veer et al data set

– Genes do not seem to have been mean-centered

• Other data sets where the method was applied
  – Genes were always centered

[Figure: ER- vs. ER+ samples before and after mean-centering]
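The batch adjustment quoted above amounts to subtracting, for every gene, its per-batch mean over the non-missing values; a sketch (illustrative, not the original code):

```python
import numpy as np

def center_genes_by_batch(X, batch):
    """Gene-wise mean-centering within each batch: for every gene,
    subtract its mean (over non-missing values) computed separately in
    each batch, so the adjusted data have zero per-gene mean per batch.

    X : (n_samples, n_genes) array, possibly with NaN for missing values
    batch : length-n_samples array of batch labels
    """
    X = np.array(X, dtype=float)
    batch = np.asarray(batch)
    for b in np.unique(batch):
        rows = batch == b
        X[rows] -= np.nanmean(X[rows], axis=0)   # per-gene mean within this batch
    return X
```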

Page 36: Class prediction for experiments with microarrays

Possible concerns on the application of the method of centroids

• How are the classification results influenced by...
  – the normalization of the data (mean-centering of the genes)?
  – differences in subtype prevalence across data sets?
  – the presence of study (or batch) effects?
  – the choice of the method of centroids as the classification method?
  – the use of an arbitrary cut-off for non-classifiable samples?

Lusa et al., “Challenges in projecting clustering results across gene expression-profiling datasets”, JNCI 2007

Page 37: Class prediction for experiments with microarrays

ER (ligand-binding assay): 34 ER- / 65 ER+; 7650 clones (6878 unique)

Page 38: Class prediction for experiments with microarrays

1. Effects of mean-centering the genes

Sørlie’s centroids (derived from a centered data set) were applied with the method of centroids to Sotiriou’s data set (336/552 common and unique clones), both centered (C) and non-centered (N): the full data set (99 samples), the ER+ subset (65 samples) and the ER- subset (34 samples).

                Full data set (99 samples)                   ER+ subset (65 samples)
                Centered                 Not centered        Centered          Not centered
  Class         N classified (ρ<.1)  ER+  N classified (ρ<.1)  ER+  N classified (ρ<.1)  N classified (ρ<.1)
  Luminal A     43 (5)               41   59 (1)               55   19 (6)               55 (1)
  Luminal B     13 (2)               11   1 (1)                1    13 (3)               1 (0)
  ERBB2+        13 (2)               6    10 (0)               2    11 (1)               2 (0)
  Basal         21 (0)               0    5 (0)                0    11 (5)               0 (0)
  Normal        9 (0)                7    24 (2)               7    11 (1)               7 (0)

Page 39: Class prediction for experiments with microarrays

2. Effects of prevalence of subgroups in the (training and) testing set

[Figure: predictive accuracy (ER+ / ER-) for test sets of varying ER composition; training set 10 ER+ / 10 ER-; test sets: 55 ER+/24 ER-, 24 ER+/24 ER-, 12 ER+/24 ER-, 55 ER+/0 ER-, 0 ER+/24 ER-; reported accuracies: 95%/79%, 78%/88%, 88%/83%, 92%/79%, 53%/ND, ND/62%]

Page 40: Class prediction for experiments with microarrays

2b. What is the role played by the prevalence of subgroups in the training and testing set?

ER status prediction on Sotiriou’s data set with the method of centroids, using 751 variance-filtered unique clones, on centered (C) and non-centered (N) data.

• Multiple (100) random splits into training and testing sets
  – training set: ω_tr = 1/2 (n_tr = 20), i.e. 10 ER+ / 10 ER-
  – testing set: 0 ≤ ω_test ≤ 1 (n_test = 24), from 0 ER+ / 24 ER- to 24 ER+ / 0 ER-
  – ω: % of ER+ samples in the testing set
• Outcomes: % correctly classified among ER+ samples, % correctly classified among ER- samples, % correctly classified overall

Page 41: Class prediction for experiments with microarrays

3. (Possible) study effect on real data

Predicted class membership: Sotiriou’s centroids applied to the van’t Veer data set.

van’t Veer (Centered)
  Class           True ER+ (ρ<.1)   True ER- (ρ<.1)   Cor (min-max)
  Predicted ER+   39 (1)            4 (2)             .42 (.03-.62)
  Predicted ER-   7 (4)             67 (4)            .26 (.01-.55)

van’t Veer (Non-centered)
  Class           True ER+ (ρ<.1)   True ER- (ρ<.1)   Cor (min-max)
  Predicted ER+   43 (43)           8 (7)             .02 (-.24-.13)
  Predicted ER-   3 (3)             63 (53)           -.03 (-.23-.16)

• The predictive accuracy is the same
• Most of the samples in the non-centered analysis would not be classifiable using the threshold

Page 42: Class prediction for experiments with microarrays

Conclusions I

• “Must”s for a clinically useful classifier
  – it classifies a new sample unambiguously, independently of any other samples being considered for classification at the same time
  – the clinical meaning of the subtype assignment (survival probability, probability of response to treatment) must be stable across the populations to which the classifier might be applied
  – the technology used to assay the samples must be stable and reproducible: a sample assayed on different occasions must be assigned to the same subtype

• BUT we showed that the subgroup assignments of new samples can be substantially influenced by
  – the normalization of the data (the appropriateness of gene-centering depends on the situation)
  – the proportion of samples from each subtype in the test set
  – the presence of systematic differences across data sets
  – the use of arbitrary rules for identifying non-classifiable samples

• Most of our conclusions also apply to other classification methods

Page 43: Class prediction for experiments with microarrays

Conclusions II

• Most of the studies claiming to have validated the subtypes have focused only on comparing clinical outcome differences
  – this shows consistency of results between studies
  – BUT it does not provide a direct measure of the robustness of the classification, which is essential before using the subtypes in clinical practice

• Careful thought must be given to the comparability of patient populations and data sets

• Many difficulties remain in validating and extending class discovery results to new samples, and a robust classification rule remains elusive

The subtyping of breast cancer seems promising, BUT a standardized definition of the subtypes, based on a robust measurement method, is needed

Page 44: Class prediction for experiments with microarrays

Some useful resources and readings

• Books
  – Simon et al. – Design and Analysis of DNA Microarray Investigations, Ch. 8
  – Speed (ed.) – Statistical Analysis of Gene Expression Microarray Data, Ch. 3
  – Bishop – Pattern Recognition and Machine Learning
  – Hastie, Tibshirani and Friedman – The Elements of Statistical Learning
  – Duda, Hart and Stork – Pattern Classification

• Software for data analysis
  – R and Bioconductor (www.r-project.org, www.bioconductor.org)
  – BRB-ArrayTools (http://linus.nci.nih.gov)

• Web sites
  – BRB/NCI web site (NIH)
  – Tibshirani’s web site (Stanford)
  – Terry Speed’s web site (Berkeley)