Alexander Statnikov
Discovery Systems Laboratory
Department of Biomedical Informatics
Vanderbilt University
10/3/2007

Project history
Joint project with Chun Li and Constantin Aliferis
Cancer Research 2005 paper by Hu et al.: “Genome-Wide Association Study in Esophageal Cancer Using GeneChip Mapping 10K Array”
Reported near-perfect classification of cancer patients & healthy controls on the basis of only SNP data from a case-control GWA study.
This finding suggests that esophageal cancer is a solely genetic disease…
Initial idea of Chun Li
At DSL we had independently obtained the GWA dataset before Chun and Constantin initiated this project

Background
SNPs make up >90% of all human genetic variation and have been extensively studied for functional relationships between phenotype & genotype.
Modern high-throughput genotyping technologies allow fast evaluation of SNPs on a genome-wide scale at a relatively low cost.
During the last 2 years, several studies have reported success in using SNP genotyping assays in GWA studies in cancer. Probably the strongest result is the one reported in the study by Hu et al.

Claims of Hu et al.
“Using the generalized linear model (GLM) with adjustment for potential confounders and multiple comparisons, we identified 37 SNPs associated with disease.”
“When the 37 SNPs identified from the GLM recessive mode were used in a principal components analysis, the first principal component correctly predicted 46 of 50 cases and 47 of 50 controls.” […] “The permutation tests indicated that our PCA classification can be generalized.”

Study dataset & its preparation
Study dataset:
50 esophageal squamous cell carcinoma patients
50 healthy controls (matched by age, sex, place of residence)
10k Affymetrix SNP arrays with 11,555 SNPs
Additional variables: age, tobacco use, alcohol consumption, family history, consumption of pickled vegetables
Preparation:
Removed ~1.5k SNPs to minimize genotyping errors
Implemented recessive A encoding
Imputed missing genotypes

SNP selection: Original method of Hu et al. (denoted as GLM1)
Fit a GLM model using data for all 100 subjects:
Probability(Cancer) = 1 / (1 + exp(-f)), where f = a + b ∙ SNP + c ∙ family history + d ∙ alcohol consumption
Obtain deviances:
D1 - deviance of the above fitted model
D0 - deviance of the null model (without predictor variables)
From the χ² distribution, compute a p-value for the test statistic D0-D1 with 3 degrees of freedom
Perform Bonferroni correction at the 0.05 alpha level

SNP selection: Unbiased GLM-based method (denoted as GLM2)
Fit a GLM model using data for all 100 subjects:
Probability(Cancer) = 1 / (1 + exp(-f)), where f = a + b ∙ SNP + c ∙ family history + d ∙ alcohol consumption
Obtain deviances:
D1 - deviance of the above fitted model
D0΄ - deviance of the model with family history and alcohol consumption only
From the χ² distribution, compute a p-value for the test statistic D0΄-D1 with 1 degree of freedom
Perform Bonferroni correction at the 0.05 alpha level

Recap of SNP selection methods
Method               GLM1 (Hu et al.)                           GLM2 (Current study)
D1                   SNP, family history, alcohol consumption   SNP, family history, alcohol consumption
D0 / D0΄             Null                                       family history, alcohol consumption
Degrees of freedom   3                                          1
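
The difference between the two tests can be illustrated with a small sketch. It assumes the deviances have already been obtained from fitted models; the values of `d_covariates`, `d_full` and `n_snps` below are hypothetical, chosen only to show the effect, and the function names are ours. The null deviance is the exact value for 50 cases and 50 controls (2 ∙ 100 ∙ ln 2).

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed forms for df = 1 and df = 3 only)."""
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2.0))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("this sketch implements df = 1 and df = 3 only")

def glm1_p_value(d_null, d_full):
    # GLM1: full model against the intercept-only model, 3 degrees of freedom.
    return chi2_sf(d_null - d_full, 3)

def glm2_p_value(d_covariates, d_full):
    # GLM2: full model against the covariates-only model, 1 degree of freedom.
    return chi2_sf(d_covariates - d_full, 1)

d_null = 2 * 100 * math.log(2)  # intercept-only deviance for 50 cases + 50 controls
d_covariates = 100.0            # hypothetical: family history + alcohol consumption only
d_full = 99.5                   # hypothetical: the SNP adds almost nothing
n_snps = 10_000                 # hypothetical number of SNPs tested
bonferroni_alpha = 0.05 / n_snps

p_glm1 = glm1_p_value(d_null, d_full)
p_glm2 = glm2_p_value(d_covariates, d_full)
print("GLM1 significant:", p_glm1 < bonferroni_alpha)
print("GLM2 significant:", p_glm2 < bonferroni_alpha)
```

With these hypothetical numbers almost all of the deviance reduction comes from the two covariates, yet GLM1 attributes it to the SNP, while GLM2 tests only the SNP's own contribution.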

Classification: Original method of Hu et al.
Perform principal component analysis (PCA) on selected SNPs using all 100 subjects in the dataset.
Extract the first principal component (PC1).
Use the following rule to classify each of the same 100 subjects as used for the PCA:
If PC1 > 0, classify as control, otherwise classify as case
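
A dependency-free sketch of this rule is given below. It is not the code of Hu et al.: PC1 is obtained here by power iteration, the function names and the toy genotype matrix are hypothetical, and the sign of a principal component is arbitrary, so which side of zero corresponds to controls depends on the orientation the PCA routine happens to return.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) of PC1 for a list of equal-length numeric rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        # One power-iteration step: v <- X^T (X v), then normalise.
        proj = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

def classify_by_pc1(rows):
    """Apply the rule from the slide: PC1 > 0 -> control, otherwise case."""
    _, scores = first_principal_component(rows)
    return ["control" if s > 0 else "case" for s in scores]

# Hypothetical 0/1 (recessive-encoded) genotypes: 6 subjects x 4 selected SNPs.
toy = [
    [1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 0],
    [0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 1],
]
predicted = classify_by_pc1(toy)
print(predicted)
```

Note that the same subjects are used both to compute the component and to be classified, which is exactly the train-on-test setup examined later in the talk.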

Evaluation of classification performance
Hu et al. used the proportion of correct classifications; their classifier is trained and tested on the same dataset
We use the area under the ROC curve (AUC) as the performance metric and a repeated 10-fold cross-validation scheme
[Figure: repeated 10-fold cross-validation on the SNP dataset (100 subjects). Each repetition gives 10 per-fold AUC values, which are averaged:]
0.9 0.8 0.9 0.6 0.9 0.8 0.7 0.8 0.9 1.0 → 0.83
0.6 0.9 0.9 0.6 0.9 0.5 0.9 0.9 0.9 1.0 → 0.81
1.0 0.8 0.9 0.7 0.9 0.8 0.7 0.8 0.6 0.7 → 0.79
…
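
The evaluation scheme can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the `fit` and `predict` callables are hypothetical placeholders for any classifier, folds are drawn without stratification (an assumption), and folds that happen to contain a single class are skipped because AUC is undefined there.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity; labels are 0/1."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_kfold_auc(X, y, fit, predict, k=10, repeats=10, seed=0):
    """Average of per-repetition mean AUCs over `repeats` random k-fold splits."""
    rng = random.Random(seed)
    repeat_means = []
    for _ in range(repeats):
        order = list(range(len(y)))
        rng.shuffle(order)
        fold_aucs = []
        for f in range(k):
            test_idx = order[f::k]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            # Everything that learns from data must see the training part only.
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            scores = predict(model, [X[i] for i in test_idx])
            y_test = [y[i] for i in test_idx]
            if 0 < sum(y_test) < len(y_test):  # need both classes in the fold
                fold_aucs.append(auc(scores, y_test))
        repeat_means.append(sum(fold_aucs) / len(fold_aucs))
    return sum(repeat_means) / len(repeat_means)

# Toy check with a feature that equals the label (hypothetical data).
X_toy = [[float(i % 2)] for i in range(100)]
y_toy = [i % 2 for i in range(100)]
toy_auc = repeated_kfold_auc(
    X_toy, y_toy,
    fit=lambda X_train, y_train: None,           # nothing to learn in this toy case
    predict=lambda model, X_test: [x[0] for x in X_test],
)
print(toy_auc)
```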

Reproducing findings of Hu et al.
Using the GLM1 method, Hu et al. reported 37 significant SNPs; we found 226!
Apparently, they used an extra filtering step that was not reported in the paper (personal communication with their PI).
Nevertheless, applying the PCA-based classifier (as in Hu et al.) to the GLM1 significant SNPs resulted in a 0.93 proportion of correct classifications and 0.98 AUC.
The major findings are reproduced using the methods of Hu et al.

Bias in SNP selection method GLM1 of Hu et al.
The p-value calculated in GLM1 does not reflect the significance of the SNP, but the significance of 3 variables combined (SNP, family history, and alcohol consumption)
Family history & alcohol consumption are strong risk factors, so the p-value is biased towards 0.
[Figure: distributions of SNP p-values for GLM1 and GLM2, with the Bonferroni adjusted α-level marked]
The distribution of SNP p-values for method GLM1 is not uniform: most p-values are < 10^-3
In contrast, GLM2 reflects the significance of SNPs and does not suffer from the above bias:
Its distribution of SNP p-values is uniform
It returns no SNPs significant at the Bonferroni adjusted alpha-level

Empirical demonstration of bias in SNP selection method
Main idea: Create a null distribution where SNPs are completely unrelated to the response variable and see how frequently methods GLM1 and GLM2 find statistically significant SNPs.
1. Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
2. Apply GLM1 and GLM2 to the permuted SNP data.
Repeat 1,000 times
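
The permutation loop might look like the sketch below. The function names are ours, and `select_snps` is a hypothetical callable standing in for GLM1 or GLM2; it is assumed to return the indices of SNPs declared significant. Only whole SNP profiles are reassigned between subjects, so the link between response and covariates is preserved while any link between SNPs and response is destroyed.

```python
import random

def permute_snp_rows(snp_rows, rng):
    """Reassign whole SNP profiles to subjects at random; the input is left untouched."""
    shuffled = list(snp_rows)
    rng.shuffle(shuffled)
    return shuffled

def null_experiment(snp_rows, outcome, covariates, select_snps,
                    n_permutations=1000, seed=0):
    """Number of SNPs reported significant in each permuted dataset.

    outcome and covariates keep their original subject order in every permutation.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_permutations):
        permuted = permute_snp_rows(snp_rows, rng)
        counts.append(len(select_snps(permuted, outcome, covariates)))
    return counts

def fraction_with_findings(counts):
    """Share of permutations in which at least one SNP was declared significant."""
    return sum(c > 0 for c in counts) / len(counts)
```

For a valid test at a Bonferroni-corrected 0.05 level, `fraction_with_findings` should be about 0.05 or less under this null.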
Results of permutation experiments
GLM1 found significant SNPs in all 1000 permutations! The number of significant SNPs found in a permuted dataset ranges from 185 to 1,938 (357 on average).
GLM2 found significant SNPs in only 48/1000 permutations. The number of significant SNPs found in a permuted dataset ranges from 1 to 3.
GLM1 is biased, while GLM2 is not.

Bias in the classification performance estimate of Hu et al.
All data-analysis methods of Hu et al. use data for all subjects. Neither cross-validation nor independent sample validation was performed.
We repeated their data analysis (GLM1+PCA) embedded in a repeated 10-fold cross-validation design. The resulting performance is only 0.68 AUC (versus 0.98 AUC).
0.30 AUC bias (overestimation) in the reported results

Empirical demonstration of performance estimation bias
Main idea: Create a null distribution where SNPs are completely unrelated to the response variable (i.e. AUC=0.5), apply the GLM1+PCA methodology and record the resulting performance estimates.
1. Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
2. Apply GLM1 to the permuted SNP data.
3. Build and apply classifier using PCA.
4. Estimate classification performance (AUC).
Repeat 1,000 times

Results of permutation experiments
Classification performance of GLM1+PCA; both methods applied as in Hu et al. to all data (no cross-validation): 0.99 AUC
Classification performance of GLM1+PCA; GLM1 applied to all data, PCA applied by cross-validation (incomplete cross-validation): 0.98 AUC
Classification performance of GLM1+PCA; both methods applied by cross-validation: 0.50 AUC
0.48-0.49 AUC bias (overestimation) under the null
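
The effect of incomplete cross-validation can be reproduced on purely random data. The sketch below deliberately swaps in simpler stand-ins for GLM1 and PCA: features are ranked by the absolute difference of class means, and the classifier is a signed sum of the selected features. The data (100 subjects, 1,000 random 0/1 features), all names and all parameter values are hypothetical.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity; labels are 0/1."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_diffs(X, y, idx):
    """Per-feature difference of class means, computed on subjects in idx only."""
    cases = [i for i in idx if y[i] == 1]
    controls = [i for i in idx if y[i] == 0]
    return [sum(X[i][j] for i in cases) / len(cases)
            - sum(X[i][j] for i in controls) / len(controls)
            for j in range(len(X[0]))]

def top_features(diffs, k):
    return sorted(range(len(diffs)), key=lambda j: abs(diffs[j]), reverse=True)[:k]

def cv_auc(X, y, k, select_inside_cv, folds=10, seed=0):
    """Pooled AUC of held-out scores; selection uses training folds or all data."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    selected_on_all = top_features(mean_diffs(X, y, order), k)  # sees test subjects
    scores = [0.0] * len(y)
    for f in range(folds):
        test_idx = order[f::folds]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_diffs = mean_diffs(X, y, train_idx)
        chosen = top_features(train_diffs, k) if select_inside_cv else selected_on_all
        for i in test_idx:
            # Stand-in classifier: feature signs are learned on the training folds.
            scores[i] = sum(X[i][j] if train_diffs[j] > 0 else -X[i][j] for j in chosen)
    return auc(scores, y)

rng = random.Random(0)
X_null = [[1 if rng.random() < 0.3 else 0 for _ in range(1000)] for _ in range(100)]
y_null = [i % 2 for i in range(100)]  # labels unrelated to the features

auc_incomplete = cv_auc(X_null, y_null, k=20, select_inside_cv=False)
auc_complete = cv_auc(X_null, y_null, k=20, select_inside_cv=True)
print("selection on all data:", auc_incomplete)
print("selection inside CV:  ", auc_complete)
```

Since the labels carry no signal, any estimate well above 0.5 in the first case is produced by the selection step having seen the test subjects.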

Classification: Support Vector Machines (SVMs)
Supervised baseline technique for many types of high-throughput data (microarray, proteomics, etc.).
Trained and applied by cross-validation
[Figure: cases and controls plotted on SNP 1 vs. SNP 2 axes, before and after applying a kernel; “?” marks subjects to be classified]

SNP selection for fitting SVMs: Recursive Feature Elimination
Among the best performing techniques for the analysis of microarray gene expression data
Applied only to a training set during cross-validation
[Figure: recursive feature elimination. An SVM model is fit on 10,000 SNPs and a performance estimate is obtained; the SNPs are split into 5,000 important for classification and 5,000 not important, and the latter are discarded. An SVM model is then fit on the 5,000 retained SNPs with a new performance estimate, these are split into 2,500 important and 2,500 not important (discarded), and so on.]
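
The halving scheme in the figure can be sketched as follows. This is an illustration only, not the implementation used in the study: the linear SVM is trained here by plain full-batch subgradient descent on the regularised hinge loss, features are ranked by absolute weight, the per-step performance estimate used to choose the final subset is omitted, and all names, parameter values and toy data are hypothetical.

```python
import random

def train_linear_svm(X, y, features, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the L2-regularised hinge loss; y takes values -1/+1."""
    w = {j: 0.0 for j in features}
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        grad_w = {j: lam * w[j] for j in features}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[j] * xi[j] for j in features) + b)
            if margin < 1.0:  # only margin violators contribute to the hinge term
                for j in features:
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        for j in features:
            w[j] -= lr * grad_w[j]
        b -= lr * grad_b
    return w, b

def recursive_feature_elimination(X, y, n_keep):
    """Repeatedly refit and discard the lower-ranked half of the remaining features."""
    features = list(range(len(X[0])))
    history = [list(features)]
    while len(features) > n_keep:
        w, _ = train_linear_svm(X, y, features)
        ranked = sorted(features, key=lambda j: abs(w[j]), reverse=True)
        features = sorted(ranked[:max(n_keep, len(ranked) // 2)])
        history.append(list(features))
    return features, history

# Toy training set: 40 subjects, 8 features, only feature 3 carries signal.
rng = random.Random(0)
y_train = [1 if i % 2 else -1 for i in range(40)]
X_train = []
for yi in y_train:
    row = [rng.gauss(0.0, 1.0) for _ in range(8)]
    row[3] = 2.0 * yi + rng.gauss(0.0, 0.5)
    X_train.append(row)

kept, history = recursive_feature_elimination(X_train, y_train, n_keep=2)
print([len(h) for h in history], kept)
```

As on the slide, such a routine should be called on the training portion of each cross-validation split only, never on the full dataset.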

Classification results: repeated 10-fold cross-validation estimates
“+” denotes building of a classifier by an ensembling technique

Feedback on our analysis from Hu et al.
1. Concerning bias in SNP selection:
“If we use p-values to rank the SNPs, the two methods [GLM1 and GLM2] will give the same order.”
Our comment:
Ranking of SNPs is irrelevant because the method of Hu et al. (GLM1), as described and used in their paper, is a method for selection (and not ranking) of SNPs.
2. Concerning bias in estimation of classifier performance:
“It was not our purpose to develop a classifier in this initial pilot effort.”
“…we made these calculations as a frame of reference only.”
The authors presented results of their “cross-validation effort”. SNPs were selected by GLM1 on all 100 subjects and the classifier was trained and tested by cross-validation (2/3 of the data used for training and 1/3 for testing). This cross-validation procedure was repeated 1,000 times with different splits into training and testing sets.
The authors obtained the following histogram of classification performance estimates:
[Figure: histogram of the proportion of correct classifications]
Our comment:
These results are expected because their SNP selection procedure utilizes both training and testing data. This is “incomplete cross-validation” and is shown to cause biased performance estimation of the classifier.

Publications
Statnikov A, Li C, Aliferis CF (2007) “Effects of Environment, Genetics and Data Analysis Pitfalls in an Esophageal Cancer Genome-Wide Association Study.” PLoS ONE 2(9): e958.
Statnikov A, Li C, Aliferis CF (2007) “A statistical reappraisal of the findings of an esophageal cancer genome-wide association study.” Cancer Research (accepted).

Conclusions
Data-analysis pitfalls in Hu et al. led the researchers to (1) identify SNPs that are not statistically significant and (2) derive biased estimates of classification performance.
Environmental factors and family history have a modest association with the disease, while SNPs do not appear to be associated.
It is crucially important to have sound statistical analysis in genome-wide association studies.
The amount of work involved in demonstrating errors (even obvious ones), correcting the analysis, communicating with the authors, and publishing the rebuttal is significantly greater than that of publishing the original paper!