Alexander Statnikov Discovery Systems Laboratory Department of Biomedical Informatics Vanderbilt University 10/3/2007 1


Project history

Joint project with Chun Li and Constantin Aliferis.

Cancer Research 2005 paper by Hu et al.: “Genome-Wide Association Study in Esophageal Cancer Using GeneChip Mapping 10K Array”. It reported near-perfect classification of cancer patients and healthy controls on the basis of SNP data alone from a case-control GWA study.

This finding suggests that esophageal cancer is a solely genetic disease…

The initial idea came from Chun Li.

At DSL we had independently obtained the GWA dataset before Chun and Constantin initiated this project.

Background

SNPs make up >90% of all human genetic variation and have been extensively studied for functional relationships between phenotype and genotype.

Modern high-throughput genotyping technologies allow fast evaluation of SNPs on a genome-wide scale at a relatively low cost.

During the last 2 years, several studies have reported success in using SNP genotyping assays in GWA studies in cancer. Probably the strongest result is the one reported by Hu et al.

Claims of Hu et al.

“Using the generalized linear model (GLM) with adjustment for potential confounders and multiple comparisons, we identified 37 SNPs associated with disease.”

“When the 37 SNPs identified from the GLM recessive mode were used in a principal components analysis, the first principal component correctly predicted 46 of 50 cases and 47 of 50 controls.” […] “The permutation tests indicated that our PCA classification can be generalized.”

Study dataset & its preparation

Study dataset:
50 esophageal squamous cell carcinoma patients
50 healthy controls (matched by age, sex, place of residence)
10k Affymetrix SNP arrays with 11,555 SNPs

Additional variables:
Age
Tobacco use
Alcohol consumption
Family history
Consumption of pickled vegetables

Preparation:
Removed ~1.5k SNPs to minimize genotyping errors
Implemented recessive A encoding
Imputed missing genotypes

SNP selection: Original method of Hu et al. (denoted as GLM1)

Fit a GLM model using data for all 100 subjects:
Probability(Cancer) = 1 / (1 + exp(-f)), where f = a + b ∙ SNP + c ∙ family history + d ∙ alcohol consumption

Obtain deviances:
D1 - deviance of the above fitted model
D0 - deviance of the null model (without predictor variables)

From the χ2 distribution, compute a p-value for the test statistic D0-D1 with 3 degrees of freedom

Perform Bonferroni correction at the 0.05 alpha level

SNP selection: Unbiased GLM-based method (denoted as GLM2)

Fit a GLM model using data for all 100 subjects:
Probability(Cancer) = 1 / (1 + exp(-f)), where f = a + b ∙ SNP + c ∙ family history + d ∙ alcohol consumption

Obtain deviances:
D1 - deviance of the above fitted model
D0΄ - deviance of the model with family history and alcohol consumption only

From the χ2 distribution, compute a p-value for the test statistic D0΄-D1 with 1 degree of freedom

Perform Bonferroni correction at the 0.05 alpha level

Recap of SNP selection methods

Method              | GLM1 (Hu et al.)                         | GLM2 (Current study)
D1                  | SNP, family history, alcohol consumption | SNP, family history, alcohol consumption
D0                  | Null                                     | Family history, alcohol consumption
Degrees of freedom  | 3                                        | 1

Classification: Original method of Hu et al.

Perform principal component analysis (PCA) on selected SNPs using all 100 subjects in the dataset.

Extract the first principal component (PC1).

Use the following rule to classify each of the same 100 subjects as used for the PCA:

If PC1 > 0, classify as control, otherwise classify as case
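The PC1 rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the power-iteration routine is a stand-in (the slides do not say how Hu et al. computed the PCA), the helper names are hypothetical, and the four-subject genotype matrix is made up. Note that the sign of a principal component is arbitrary, so which side of 0 ends up labeled "control" depends on the orientation the PCA routine happens to return.

```python
import math
import random

def first_principal_component(rows, iters=200, seed=0):
    """PC1 score of every subject, found by power iteration (a stand-in PCA routine)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(iters):
        # Multiply v by X^T X without forming the p x p matrix.
        proj = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]

def classify_by_pc1(scores):
    """Rule from the slide: PC1 > 0 -> control, otherwise case."""
    return ["control" if s > 0 else "case" for s in scores]

# Hypothetical recessive-coded genotypes: first two rows cases, last two controls.
toy = [[1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1]]
scores = first_principal_component(toy)
labels = classify_by_pc1(scores)
```

Because the PCA and the rule are both applied to the same subjects, this sketch reproduces the train-and-test-on-the-same-data design, not an out-of-sample prediction.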

Evaluation of classification performance

Hu et al. used the proportion of correct classifications; their classifier is trained and tested on the same dataset.

We employ the area under the ROC curve (AUC) as the performance metric and a repeated 10-fold cross-validation scheme.

[Figure: repeated 10-fold cross-validation on the SNP dataset (100 subjects). Each repetition yields ten per-fold performance values that are averaged: 0.9 0.8 0.9 0.6 0.9 0.8 0.7 0.8 0.9 1.0 → 0.83; 0.6 0.9 0.9 0.6 0.9 0.5 0.9 0.9 0.9 1.0 → 0.81; 1.0 0.8 0.9 0.7 0.9 0.8 0.7 0.8 0.6 0.7 → 0.79; …]
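A minimal sketch of the two ingredients of this evaluation scheme: an AUC computed by pairwise comparison, and the averaging of per-fold values within each repetition. The function names are hypothetical, and averaging the per-repetition means into one overall estimate is an assumption (the slide only shows the per-repetition averages). The fold values are the three repetitions visible in the figure.

```python
def auc(scores, labels):
    """Area under the ROC curve: fraction of (case, control) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

def repeated_cv_estimate(fold_values_per_repeat):
    """Average folds within each repetition, then average the repetitions."""
    per_repeat = [sum(f) / len(f) for f in fold_values_per_repeat]
    return per_repeat, sum(per_repeat) / len(per_repeat)

# Per-fold values of the three repetitions shown on the slide.
folds = [
    [0.9, 0.8, 0.9, 0.6, 0.9, 0.8, 0.7, 0.8, 0.9, 1.0],
    [0.6, 0.9, 0.9, 0.6, 0.9, 0.5, 0.9, 0.9, 0.9, 1.0],
    [1.0, 0.8, 0.9, 0.7, 0.9, 0.8, 0.7, 0.8, 0.6, 0.7],
]
per_repeat, overall = repeated_cv_estimate(folds)
# per_repeat rounds to [0.83, 0.81, 0.79]; overall rounds to 0.81
```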

Reproducing findings of Hu et al.

Using the GLM1 method, Hu et al. reported 37 significant SNPs; we found 226!

Apparently, they used an extra filtering step that was not reported in the paper (personal communication with their PI).

Nevertheless, applying the PCA-based classifier (as in Hu et al.) to the GLM1-significant SNPs resulted in a 0.93 proportion of correct classifications and 0.98 AUC.

The major findings are reproduced using the methods of Hu et al.

Bias in SNP selection method GLM1 of Hu et al.

The p-value calculated in GLM1 does not reflect the significance of the SNP, but the significance of 3 variables combined (SNP, family history, and alcohol consumption).

Family history & alcohol consumption are strong risk factors, so the p-value is biased towards 0.
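The arithmetic behind this bias can be sketched with the deviance tests defined for GLM1 and GLM2. Everything numeric here is an assumption for illustration except the null deviance, which follows from 50 cases and 50 controls (2 ∙ 100 ∙ ln 2): the covariate-only deviance of 100 and the count of 10,000 tested SNPs are hypothetical, and the SNP is taken to add nothing to the fit.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability; closed forms for df = 1 and df = 3 only."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("this sketch implements df = 1 and df = 3 only")

d0 = 2 * 100 * math.log(2)   # null deviance for 50 cases and 50 controls
d0_prime = 100.0             # HYPOTHETICAL: family history + alcohol consumption only
d1 = 100.0                   # HYPOTHETICAL: adding a useless SNP changes nothing

n_snps = 10_000              # ASSUMED number of tested SNPs
bonferroni_alpha = 0.05 / n_snps

p_glm1 = chi2_sf(d0 - d1, df=3)        # tests SNP + both covariates jointly
p_glm2 = chi2_sf(d0_prime - d1, df=1)  # tests the SNP alone

glm1_significant = p_glm1 < bonferroni_alpha
glm2_significant = p_glm2 < bonferroni_alpha
```

Under these assumed numbers the useless SNP passes the Bonferroni threshold with GLM1 purely because of the covariates, while GLM2 gives it a p-value of 1.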

Bias in SNP selection method GLM1 of Hu et al.

The distribution of SNP p-values for method GLM1 is not uniform: most p-values are < 10^-3.

In contrast, GLM2 reflects the significance of SNPs and does not suffer from the above bias:
Its distribution of SNP p-values is uniform
It returns no SNPs significant at the Bonferroni-adjusted alpha level

[Figure: distribution of SNP p-values, with the Bonferroni-adjusted α-level marked.]

Empirical demonstration of bias in SNP selection method

Main idea: Create a null distribution where SNPs are completely unrelated to the response variable and see how frequently methods GLM1 and GLM2 find statistically significant SNPs.

1. Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
2. Apply GLM1 and GLM2 to the permuted SNP data.

Repeat steps 1-2 1,000 times.
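The permutation loop can be sketched as follows. The function names are hypothetical, and `select` is a placeholder for whichever selection method is being checked (GLM1 or GLM2); it is assumed to return the list of SNPs it calls significant. Only the SNP rows are shuffled, so the link between the covariates and the response stays exactly as in the real data.

```python
import random

def permute_snp_rows(snp_rows, rng):
    """Reassign whole SNP profiles to subjects at random (rows stay intact)."""
    order = list(range(len(snp_rows)))
    rng.shuffle(order)
    return [snp_rows[i] for i in order]

def count_runs_with_hits(snp_rows, response, covariates, select,
                         n_permutations=1000, seed=0):
    """Number of permuted datasets in which `select` reports any significant SNP."""
    rng = random.Random(seed)
    runs_with_hits = 0
    for _ in range(n_permutations):
        permuted = permute_snp_rows(snp_rows, rng)
        # response and covariates are passed through unchanged
        if select(permuted, response, covariates):
            runs_with_hits += 1
    return runs_with_hits
```

For an unbiased method applied at the 0.05 level, the returned count should be close to 5% of the permutations.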

Results of permutation experiments


GLM1 found significant SNPs in all 1000 permutations! The number of significant SNPs found in a permuted dataset ranges from 185 to 1,938 (357 on average).

GLM2 found significant SNPs in only 48/1000 permutations. The number of significant SNPs found in a permuted dataset ranges from 1 to 3.

GLM1 is biased, while GLM2 is not.

Bias in the classification performance estimate of Hu et al.

All data-analysis methods of Hu et al. use data for all subjects. Neither cross-validation nor independent sample validation was performed.

We repeated their data analysis (GLM1+PCA) embedded in the repeated 10-fold cross-validation design. The resulting performance is only 0.68 AUC (versus 0.98 AUC).

This is a 0.30 AUC bias (overestimation) in the reported results.

Empirical demonstration of performance estimation bias

Main idea: Create a null distribution where SNPs are completely unrelated to the response variable (i.e. AUC=0.5), apply the GLM1+PCA methodology and record the resulting performance estimates.

1. Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
2. Apply GLM1 to the permuted SNP data.
3. Build and apply the classifier using PCA.
4. Estimate classification performance (AUC).

Repeat steps 1-4 1,000 times.

Results of permutation experiments


Classification performance of GLM1+PCA; both methods applied as in Hu et al. to all data (no cross-validation): 0.99 AUC

Classification performance of GLM1+PCA; GLM1 applied to all data, PCA applied by cross-validation (incomplete cross-validation): 0.98 AUC

Classification performance by GLM1+PCA applied by cross-validation: 0.50 AUC

0.48-0.49 AUC bias (overestimation) under the null
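The gap between incomplete and full cross-validation under the null can be reproduced on pure noise. The sketch below is not the GLM1+PCA pipeline: it swaps in a simple mean-difference filter for feature selection and a mean-difference linear score as the classifier, and the data (60 subjects, 1,000 Gaussian noise features) are simulated. All names and sizes are assumptions chosen to keep the example short.

```python
import random

def auc(scores, labels):
    """AUC by pairwise comparison of case and control scores (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def mean_diffs(rows, labels, idx, features):
    """Class-mean difference (cases minus controls) per feature, subjects in idx only."""
    pos = [i for i in idx if labels[i] == 1]
    neg = [i for i in idx if labels[i] == 0]
    return {j: sum(rows[i][j] for i in pos) / len(pos)
               - sum(rows[i][j] for i in neg) / len(neg) for j in features}

def select_top(rows, labels, idx, k):
    """Stand-in selection step: k features with the largest absolute mean difference."""
    d = mean_diffs(rows, labels, idx, range(len(rows[0])))
    return sorted(d, key=lambda j: abs(d[j]), reverse=True)[:k]

def cv_auc(rows, labels, k, n_folds, select_on_all, seed):
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    everyone = pos + neg
    fold_aucs = []
    for f in range(n_folds):
        test = pos[f::n_folds] + neg[f::n_folds]  # stratified test fold
        held_out = set(test)
        train = [i for i in everyone if i not in held_out]
        # Incomplete CV: selection sees ALL subjects, including the test fold.
        feats = select_top(rows, labels, everyone if select_on_all else train, k)
        w = mean_diffs(rows, labels, train, feats)  # weights from training fold only
        scores = [sum(w[j] * rows[i][j] for j in feats) for i in test]
        fold_aucs.append(auc(scores, [labels[i] for i in test]))
    return sum(fold_aucs) / n_folds

rng = random.Random(1)
labels = [1] * 30 + [0] * 30
rows = [[rng.gauss(0, 1) for _ in range(1000)] for _ in labels]  # pure noise
biased = cv_auc(rows, labels, k=20, n_folds=5, select_on_all=True, seed=2)
honest = cv_auc(rows, labels, k=20, n_folds=5, select_on_all=False, seed=2)
```

With selection done on all subjects, the estimate should land well above 0.5 even though no feature carries signal; moving selection inside each training fold should bring it back to around chance.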


Classification: Support Vector Machines (SVMs)

Supervised baseline technique for many types of high-throughput data (microarray, proteomics, etc.).

Trained and applied by cross-validation

[Figure: cases and controls plotted against SNP 1 and SNP 2, before and after applying a kernel; “?” marks subjects to be classified.]

SNP selection for fitting SVMs: Recursive Feature Elimination

Among the best performing techniques for the analysis of microarray gene expression data

Applied only to a training set during cross-validation

[Figure: recursive feature elimination. An SVM model is fit to 10,000 SNPs and a performance estimate is recorded; the 5,000 SNPs important for classification are kept and the 5,000 not important for classification are discarded. An SVM model is then fit to the remaining 5,000 SNPs and a performance estimate is recorded; 2,500 SNPs are kept and 2,500 are discarded; and so on.]
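The halving scheme of recursive feature elimination can be sketched independently of the SVM itself. In this sketch `fit_weights` and `evaluate` are hypothetical callables: the first stands for fitting a linear SVM on the current SNPs and returning one weight per SNP, the second for a performance estimate obtained on training data only. Returning the subset with the best estimate is an assumption; the slides do not state the final selection rule.

```python
def recursive_feature_elimination(features, fit_weights, evaluate, min_features=1):
    """Repeatedly keep the half of the features with the largest |weight|."""
    history = []
    current = list(features)
    while True:
        weights = fit_weights(current)  # e.g. weights of a linear SVM (placeholder)
        history.append((list(current), evaluate(current)))
        if len(current) // 2 < min_features:
            break
        ranked = sorted(current, key=lambda j: abs(weights[j]), reverse=True)
        current = ranked[:len(current) // 2]  # discard the less important half
    # ASSUMPTION: report the subset with the best performance estimate.
    return max(history, key=lambda h: h[1])

# Toy usage with made-up weights: features 6 and 7 are the informative ones.
best_subset, best_score = recursive_feature_elimination(
    range(8),
    fit_weights=lambda feats: {j: float(j) for j in feats},
    evaluate=lambda feats: 1.0 if set(feats) == {6, 7} else 0.5,
)
```

To avoid the selection bias discussed earlier, the whole routine would be run inside each training fold, never on the test subjects.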

Classification results: repeated 10-fold cross-validation estimates

“+” denotes a classifier built by an ensembling technique

Feedback on our analysis from Hu et al.

1. Concerning bias in SNP selection:
“If we use p-values to rank the SNPs, the two methods [GLM1 and GLM2] will give the same order.”

Our comment:
Ranking of SNPs is irrelevant because the method of Hu et al. (GLM1), as described and used in their paper, is a method for selection (not ranking) of SNPs.


2. Concerning bias in estimation of classifier performance:
“It was not our purpose to develop a classifier in this initial pilot effort.”
“…we made these calculations as a frame of reference only.”

The authors presented results of their “cross-validation effort”. SNPs were selected by GLM1 on all 100 subjects, and the classifier was trained and tested by cross-validation (2/3 of the data used for training and 1/3 for testing). This cross-validation procedure was repeated 1,000 times with different splits into training and testing sets.


The authors obtained the following histogram of classification performance estimates:

[Figure: histogram; x-axis: proportion of correct classifications.]

Our comment:
These results are expected because their SNP selection procedure utilizes both training and testing data. This is “incomplete cross-validation”, which is shown to cause biased performance estimation of the classifier.

Publications


Statnikov A, Li C, Aliferis CF (2007) “Effects of Environment, Genetics and Data Analysis Pitfalls in an Esophageal Cancer Genome-Wide Association Study.” PLoS ONE 2(9): e958.

Statnikov A, Li C, Aliferis CF (2007) “A statistical reappraisal of the findings of an esophageal cancer genome-wide association study.” Cancer Research, (accepted).

Conclusions

Data-analysis pitfalls in Hu et al. led the researchers to (1) identify SNPs that are not statistically significant and (2) derive biased estimates of classification performance.

Environmental factors and family history have a modest association with the disease, while SNPs do not appear to be associated.

It is crucially important to have sound statistical analysis in genome-wide association studies.

The amount of work involved in demonstrating errors (even obvious ones), correcting the analysis, communicating with the authors, and publishing the rebuttal is significantly greater than that of publishing the original paper!
