Selection of Differential Expression Genes in Microarray Experiments
James J. Chen, Ph.D.
Division of Biometry and Risk Assessment
National Center for Toxicological Research
Food and Drug Administration
e-mail: [email protected]
FDA/Industry Workshop
September 19, 2003
Analysis of Microarray Data
Class comparison: identifying differentially expressed genes.
Class prediction: association between genes and samples; selecting a minimal combination of genes (classification).
Class discovery: discovering sample subtypes or gene clusters; selecting genes with similar expression patterns (cluster analysis).
Gene expression matrix (genes x samples):

Genes   S1    S2    ...   Sn
g1      y11   y12   ...   y1n
g2      y21   y22   ...   y2n
g3      y31   y32   ...   y3n
.       .     .           .
gm      ym1   ym2   ...   ymn
Identifying Differentially Expressed Genes
An important goal in the data analysis is to identify a set of genes that are differentially expressed among control and treated samples (groups).
To identify disease-related, drug-response, or biomarker genes (class comparison).
To enhance relationships among genes and samples for clustering or prediction (class prediction or class discovery).
Ranking Genes
The normalized data are analyzed one gene at a time (when there is a sufficient number of replicates n) using statistical methods: ANOVA, permutation tests, ROC, etc.
Gene expression matrix with the rank (p-value) of each gene:

Genes   S1    S2    ...   Sn     Rank
g1      y11   y12   ...   y1n    r1 (p1)
g2      y21   y22   ...   y2n    r2 (p2)
g3      y31   y32   ...   y3n    r3 (p3)
.       .     .           .      .
gm      ym1   ym2   ...   ymn    rm (pm)
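As a minimal sketch of gene-by-gene ranking, the following Python uses a Monte Carlo permutation test on the one-way F statistic and sorts genes by p-value. The function names, the dictionary layout of the expression matrix, and the defaults (`n_perm`, `seed`) are illustrative assumptions, not part of the original analysis.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups (lists of floats)."""
    sizes = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    n_total = sum(sizes)
    grand = sum(sum(g) for g in groups) / n_total
    between = sum(n * (mu - grand) ** 2 for n, mu in zip(sizes, means))
    within = sum((y - mu) ** 2 for g, mu in zip(groups, means) for y in g)
    if within == 0.0:
        return float("inf") if between > 0.0 else 0.0
    return (between / (len(groups) - 1)) / (within / (n_total - len(groups)))

def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value for one gene (random label shuffles)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))

    def stat(lab):
        return f_statistic([[v for v, l in zip(values, lab) if l == c]
                            for c in classes])

    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

def rank_genes(matrix, labels, **kwargs):
    """matrix: dict gene -> expression values; returns (gene, p) by ascending p."""
    pvals = {g: permutation_p_value(v, labels, **kwargs) for g, v in matrix.items()}
    return sorted(pvals.items(), key=lambda item: item[1])
```

The first entry of the returned list corresponds to p(1), the gene with the strongest evidence of differential expression.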
P-value Approaches to Gene Selection
The genes on the array are a mixture of altered and unaltered genes; altered genes should have smaller p-values.
How should the cut-off be chosen?
P-values for gene ranking:
Use p-values to rank the genes in order of evidence for differential expression: p(1) ≤ . . . ≤ p(m) (an ordered evidence of differences).
Determining the cut-off: fixed p-value, number of rejections, estimating the number of altered genes, decision (ROC).
Multiple testing issue: FWE or FDR approach.
Approaches to Multiplicity Testing
Two approaches to multiplicity testing:
Family-wise error (FWE) rate approach: controlling the probability of false rejection of unaltered genes among all hypotheses (genes in the array) tested.
False discovery rate (FDR) approach: estimating the probability of false rejection of unaltered genes among the rejected hypotheses (significant genes).
Testing m hypotheses (per-gene probabilities in parentheses):

True state   Significant   Non-significant   Total
Unaltered    V (α)         S (1 - α)         m0
Altered      U (1 - β)     T (β)             m1
Total        R             m - R             m

The number of true null hypotheses m0 is fixed but unknown. V and U are unobservable; R = U + V is observable. The FWE is the probability Pr(V > 0). The FDR is E(V/R) (rejected unaltered genes among the significant genes).
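To make the two error measures concrete, here is a small Python sketch. It assumes independent tests, each carried out at the same per-test level, so that Pr(V > 0) has a closed form; the function names are hypothetical.

```python
def fwe_independent(per_test_level, m0):
    """Pr(V > 0) when m0 true nulls are tested independently at per_test_level."""
    return 1.0 - (1.0 - per_test_level) ** m0

def false_discovery_proportion(v, r):
    """Realized V/R for one experiment, taken as 0 when nothing is rejected."""
    return v / r if r > 0 else 0.0

# Bonferroni at FWE = 0.05 with m = m0 = 1000 independent tests.
print(fwe_independent(0.05 / 1000, 1000))
print(false_discovery_proportion(2, 10))
```

The FDR is the expectation of the second quantity over repeated experiments, whereas the FWE only asks whether V is nonzero.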
P-Value FWE Approach
FWE: the probability of rejecting at least one true null hypothesis in the given family of hypotheses.
Bonferroni adjustment: set the comparison-wise error (CWE) rate at α/m; then FWE ≤ α.
Improvement: Holm (Scand. J. Statist., 1979) step-down procedure, with adjusted p-values
(m p(1), (m-1) p(2), (m-2) p(3), . . . )
Estimating the number of unaltered genes m0: set CWE = FWE/m0, with adjusted p-values
(m0 p(1), m0 p(2), m0 p(3), . . . )
Since m0 ≤ m, this is never more conservative than Bonferroni; the improvement is large when m0 << m.
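The three adjustments can be sketched in Python as follows. This is an illustrative implementation with hypothetical function names; `m0` is assumed to be supplied by the user (for example from an estimate), and adjusted p-values are capped at 1.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: m * p, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multipliers are m, m-1, m-2, ...; the running max keeps monotonicity.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

def m0_adjusted(pvals, m0):
    """Bonferroni-type adjustment using the number of true nulls m0 instead of m."""
    return [min(1.0, m0 * p) for p in pvals]
```

A gene is declared significant at a given FWE level when its adjusted p-value does not exceed that level.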
Estimating Number of True Nulls
Difference of two adjacent p-values:
dj = p(j) - p(j-1), j = 1, ..., (m+1), with p(0) = 0 and p(m+1) = 1.
Under independence and H0, dj ~ Beta(1, m0) with mean E(dj) = 1/(m0+1).
An estimate of m0 is m0{MD} = 1/d̄ - 1 ≈ 1/E(d) - 1, where d̄ is the mean difference.
Graphic algorithm to estimate m0: Benjamini and Hochberg (J. Educ. Behav. Stat., 2000); Hsueh et al. (J. Biopharm. Stat., 2003).
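For illustration only, the sketch below implements a lowest-slope estimator in the spirit of the graphical algorithm of Benjamini and Hochberg (2000), as I understand it: compute m0(k) = (m + 1 - k)/(1 - p(k)) along the ordered p-values and stop the first time it increases. This is a swapped-in, related technique and is not claimed to be the m0{MD} estimator used in the slides; the function name is hypothetical.

```python
import math

def lowest_slope_m0(pvals):
    """Lowest-slope style estimate of the number of true null hypotheses."""
    p = sorted(pvals)
    m = len(p)

    def m0_at(k):
        # k is the 1-based rank of the ordered p-value p(k).
        gap = 1.0 - p[k - 1]
        return float("inf") if gap <= 0.0 else (m + 1 - k) / gap

    previous = m0_at(1)
    for k in range(2, m + 1):
        current = m0_at(k)
        if current > previous:
            # First increase: round up and never exceed m.
            return m if math.isinf(current) else min(m, math.ceil(current))
        previous = current
    return m

print(lowest_slope_m0([0.001, 0.002, 0.003, 0.5, 0.9]))
```

With three very small p-values out of five, the estimate drops below m, which is what makes the m0-based adjustments less conservative.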
Simulation results for the m0{MD} estimator for m = 1,000, based on 10,000 replicates.

Estimation: the effect size is set to give 80% power at FWE = 0.25. Means and standard deviations (s.d.) of the estimates:

        Independent hypotheses    Correlated hypotheses (ρ = .25)
m0      Mean      s.d.            Mean      s.d.
1000    999.35    10.89           992.30    42.29
900     904.43    3.43            899.16    36.47
700     709.40    5.26            703.13    37.07

Testing: empirical familywise error rates at FWE = 0.05, 0.10, 0.25:

        Independent hypotheses    Correlated hypotheses (ρ = .25)
m0      0.05    0.10    0.25      0.05    0.10    0.25
1000    0.049   0.098   0.223     0.039   0.071   0.151
900     0.049   0.095   0.224     0.040   0.070   0.142
700     0.047   0.090   0.213     0.039   0.070   0.142
P-value FDR Methods
FDR: the expected proportion of falsely rejected null hypotheses among the rejected hypotheses.
FDR-controlled (BH, 1995): q-value = m p(r)/r < FDR.
Fixed CWE = α (Storey, 2002): estimate pFDR.
Fixed R = r (Tsai, 2003): estimate cFDR = E(V | R = r)/r. The expected number of false significances is (r x cFDR).
FDRs depend on the distribution of R and the conditional distribution of V | R: FDR = pFDR x Pr(R > 0) = Σ cFDR(r) Pr(R = r).
Chen (ICSA Bulletin, 2003)
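A minimal Python sketch of the BH step-up quantity m p(r)/r follows. The optional `m0` argument, which replaces m by an estimate of the number of true nulls, and the function names are assumptions for illustration; the comparison uses "≤", the usual BH convention.

```python
def bh_adjusted(pvals, m0=None):
    """BH-adjusted p-values (q-value style), returned in the input order."""
    m = len(pvals)
    scale = m if m0 is None else m0
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each value is the minimum of
    # scale * p(j) / j over all ranks j >= r (step-up monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, scale * pvals[i] / rank)
        adjusted[i] = running_min
    return adjusted

def n_rejections(pvals, level, m0=None):
    """Number of hypotheses rejected when the FDR is controlled at `level`."""
    return sum(q <= level for q in bh_adjusted(pvals, m0))
```

For example, `n_rejections(pvals, 0.05)` gives the count r of significant genes at FDR = 0.05.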
Distribution of R and the cFDR for m = 1000 and m0 = 900 at α = .01 and an effect size of 2. Assume a paired t-test with five replicated arrays.

r    Pr(R=r)  cFDR      r    Pr(R=r)  cFDR      r    Pr(R=r)  cFDR
68   .0009    .0748     79   .0509    .0947     90   .0369    .1231
69   .0016    .0763     80   .0592    .0969     91   .0289    .1262
70   .0026    .0779     81   .0664    .0992     92   .0218    .1293
71   .0042    .0795     82   .0719    .1015     93   .0158    .1326
72   .0065    .0812     83   .0750    .1039     94   .0111    .1359
73   .0097    .0830     84   .0756    .1064     95   .0075    .1393
74   .0140    .0848     85   .0734    .1090     96   .0049    .1428
75   .0195    .0866     86   .0688    .1117     97   .0031    .1463
76   .0261    .0885     87   .0622    .1144     98   .0019    .1500
77   .0338    .0905     88   .0542    .1172     99   .0011    .1537
78   .0422    .0926     89   .0455    .1201     100  .0006    .1574

Unconditional estimates: FDR = .1067, pFDR = .1067, mFDR = .1075.
Conditional at E(R) = 83.7 ≈ 84 (mode): cFDR = .1064, eFDR = .1071.
FDR, pFDR, cFDR, and mFDR at α = .01 and .001 for m = 100 and 1000, under independence (F0, F1). The cFDR is evaluated at r = [E(R)+1].

                α = .01                          α = .001
m      m0       FDR    pFDR   cFDR   mFDR       FDR    pFDR   cFDR   mFDR
100    50       .0257  .0257  .0261  .0262      .0071  .0071  .0071  .0072
       80       .0933  .0933  .0960  .0971      .0258  .0271  .0270  .0282
       90       .1824  .1831  .1857  .1948      .0462  .0583  .0586  .0613
       95       .3012  .3129  .3119  .3380      .0650  .1147  .1163  .1212
       100      .6340  1.     1.     1.         .0952  1.     1.     1.
1000   500      .0261  .0261  .0261  .0262      .0072  .0072  .0072  .0072
       800      .0967  .0967  .0969  .0971      .0281  .0281  .0282  .0282
       900      .1935  .1935  .1946  .1948      .0608  .0608  .0609  .0613
       950      .3351  .3351  .3383  .3380      .1193  .1194  .1194  .1212
       1000     .9999  1.     1.     1.         .6324  1.     1.     1.
Conditional Distribution of V | R=r
Given m0 and α, the number of rejections is R = V + U, where V ~ Bin(m0, α) and U ~ Bin(m1, 1-β).
The conditional distribution of V | R = r is a non-central hypergeometric distribution.
The cFDR = E(V | R = r)/r is estimated from the mean of V | R. It can also be computed from the distribution of R.
To estimate the cFDR: use m0{MD} and the distribution of R (parametric or bootstrap method).
$$\mathrm{cFDR}(r) = \frac{m_0\,\alpha}{r}\cdot\frac{P(R = r-1 \mid m_0 - 1,\ m_1)}{P(R = r \mid m_0,\ m_1)}$$
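The following Python sketch computes the cFDR in two ways under the binomial model stated above, V ~ Bin(m0, α) and U ~ Bin(m1, 1-β), assumed independent: directly as E(V | R = r)/r, and as m0 α P(R = r-1 | m0-1, m1) / (r P(R = r | m0, m1)). The function names and the `power` argument (standing for 1-β) are illustrative; the plain `comb`-based probabilities are adequate only for moderate m (roughly up to 1000).

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability P(X = k) for X ~ Bin(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def prob_r(r, m0, m1, alpha, power):
    """P(R = r) for R = V + U with V ~ Bin(m0, alpha), U ~ Bin(m1, power)."""
    return sum(binom_pmf(v, m0, alpha) * binom_pmf(r - v, m1, power)
               for v in range(min(r, m0) + 1))

def cfdr_direct(r, m0, m1, alpha, power):
    """E(V | R = r) / r from the conditional distribution of V given R = r."""
    expected_v = sum(v * binom_pmf(v, m0, alpha) * binom_pmf(r - v, m1, power)
                     for v in range(min(r, m0) + 1))
    return expected_v / (r * prob_r(r, m0, m1, alpha, power))

def cfdr_ratio(r, m0, m1, alpha, power):
    """The same quantity written as a ratio of two distributions of R."""
    return (m0 * alpha * prob_r(r - 1, m0 - 1, m1, alpha, power)
            / (r * prob_r(r, m0, m1, alpha, power)))
```

The two functions agree because v C(m0, v) = m0 C(m0-1, v-1), which moves one null rejection out of the sum.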
Taiwan Academia Sinica (Metal) Data*
Control and 8 metals, 55 one-channel arrays, 684 genes
* Data from Dr. D. T. Lee’s laboratory
Identifying DE Genes: Sinica Data
Objective: Control vs. As vs. Cd.
Design: 6 arrays per group (I, III, IV, VI, VII, IX; 18 arrays).
Microarray: As-chip-TCL01 (one-channel membrane array).
Probes: 708 genes with 16 housekeeping genes.
Data filtering: spots with more than 3 zero or negative intensities were removed, resulting in 540 genes.
Gene expression matrix: 540 (genes) x 18 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Significance test: the p-values were computed using the F statistic from all C(18,12) x C(12,6) permutations.
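As a quick check on the size of the permutation distribution, this short Python snippet counts the distinct ways of assigning 18 arrays to three labeled groups of 6; the variable names are illustrative.

```python
from math import comb, factorial

# Choose 12 of the 18 arrays for the last two groups, then split those 12 into 6 and 6.
n_permutations = comb(18, 12) * comb(12, 6)

# Equivalent multinomial form: 18! / (6! 6! 6!).
assert n_permutations == factorial(18) // factorial(6) ** 3

print(n_permutations)  # 17153136
```

About 17 million assignments is small enough to enumerate completely for 540 genes, which is consistent with using the full permutation distribution rather than a random sample.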
MCP Analysis of Sinica Data
Total number of genes: m = 540
Estimated number of unaltered genes: m0{MD} = 444
Number of rejections (r):
FWE = 0.05: α = 0.05/444: r = 11; α = 0.05/540: r = 9
FDR = 0.05: α = (0.05 x r)/444: r = 39; α = (0.05 x r)/540: r = 27
CWE = α = 0.01: r = 50
m1{MD} = 96: r = 96
The FDR, pFDR, cFDR, and eFDR estimates are close.
pFDR and cFDR Estimates using Different MCP Methods
MCP          α              r    p(r)          pFDR          cFDR          v̂**
FWE(0.05)    1.13 x 10^-4*  11   1.12 x 10^-4  4.52 x 10^-3  4.50 x 10^-3  .05
FDR(0.05)    4.39 x 10^-3*  39   4.29 x 10^-3  4.87 x 10^-2  4.88 x 10^-2  2
CWE(0.01)    0.01           50   9.97 x 10^-3  8.85 x 10^-2  8.59 x 10^-2  4
m1{MD}                      96   3.28 x 10^-2  1.51 x 10^-1  1.53 x 10^-1  15

* FWE: 0.05/444; FDR: (0.05 x r)/444
** v̂ = r x cFDR
m = 540 and m0{MD} = 444
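The cut-offs and the expected numbers of false significances in this analysis can be re-derived with a few lines of Python. The relation pFDR ≈ m0 p(r)/r used below is an inference from the reported values rather than a formula stated in the slides, and all function names are hypothetical.

```python
M0_HAT = 444  # estimated number of unaltered genes, m0{MD}

def bonferroni_cutoff(fwe, n_nulls):
    """Per-gene level that controls the FWE: fwe / n_nulls."""
    return fwe / n_nulls

def bh_cutoff(fdr, r, n_nulls):
    """Per-gene level at the r-th ordered p-value: fdr * r / n_nulls."""
    return fdr * r / n_nulls

def pfdr_estimate(p_r, r, n_nulls):
    """Assumed plug-in estimate: expected false rejections n_nulls * p(r) over r."""
    return n_nulls * p_r / r

def expected_false(r, cfdr):
    """Expected number of false significances, v = r * cFDR."""
    return r * cfdr

print(bonferroni_cutoff(0.05, M0_HAT))    # about 1.13e-4
print(bh_cutoff(0.05, 39, M0_HAT))        # about 4.39e-3
print(pfdr_estimate(1.12e-4, 11, M0_HAT)) # about 4.52e-3
print(expected_false(39, 4.88e-2))        # about 1.9, i.e. roughly 2 genes
```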
Association Study
Relationships between genes and samples: Effects of drugs (toxicants) on gene expression profiles, DNA diagnostic testing, or pathogen detection (classification).
Relationships among samples: Molecular classification of different tissue types or samples on the basis of gene expression (cluster analysis).
Relationships among genes: Genes of similar function yield similar expression patterns in microarray experiments (metabolic pathways, molecular function,
biological process, etc.) (cluster analysis)
Class Prediction
Class prediction (classification): to develop a decision rule to predict the class membership of a new sample based on the expression profiles of some key genes.
Three steps:
1. Selection of the discriminatory (key) gene set.
2. Formation of the discrimination rule: Fisher’s linear discriminant function, nearest-neighbor classifiers, support vector machines, and classification trees.
3. Cross-validation to estimate the accuracy of the prediction.
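The last two steps can be sketched with a k-nearest-neighbor rule and a simple fold-based cross-validation. This is a toy Python illustration with hypothetical names and a deterministic fold assignment; it takes the gene set as already chosen, whereas in practice gene selection is usually repeated inside each fold to avoid an optimistic accuracy estimate.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Predict the class of x by majority vote among its k nearest samples."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def cross_validated_accuracy(xs, ys, n_folds=5, k=3):
    """Fraction of samples classified correctly when each fold is held out once."""
    folds = [list(range(i, len(xs), n_folds)) for i in range(n_folds)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        for i in held_out:
            correct += knn_predict(train_x, train_y, xs[i], k) == ys[i]
    return correct / len(xs)
```

Each sample is a tuple of expression values over the selected genes; the returned value is the cross-validated accuracy.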
Class Prediction: Sinica Data
Nine different treatments: Control, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV for a total of 55 samples (arrays).
Number of genes: 684 genes (some in duplicate or triplicate).
Gene expression matrix: 684 (genes) x 55 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Gene sets: five gene sets are considered.
Classification methods: Fisher’s linear discriminant function, nearest-neighbor classifiers (k-nn)
Cross-validation: 10-fold cross-validation, 11 arrays/group.
Selections of Discriminatory genes
Significance testing approach to gene selection:
1. F: Differentially expressed (global) genes among the 9 groups, using the F test with FWE = 0.05. 38 genes
2. T: Treatment-specific marker genes. A one-vs-all t-test compares each group with the 8 remaining groups at adjusted p = 0.01, giving Gi; T = G1 ∪ … ∪ G9. 89 genes
3. I = F ∩ T: Intersection of F and T. 25 genes
4. U = F ∪ T: Union of F and T. 102 genes
5. A: Original gene set. 684 genes
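The construction of the gene sets from the two selection rules is a matter of set operations, sketched here in Python. The function name and the toy gene identifiers are hypothetical; only the set sizes at the end come from the slides.

```python
def combine_gene_sets(f_genes, marker_sets):
    """Build F, T (union of the per-treatment marker sets), I and U."""
    f = set(f_genes)
    t = set().union(*marker_sets)
    return {"F": f, "T": t, "I": f & t, "U": f | t}

# Toy example with made-up gene names.
sets = combine_gene_sets({"a", "b", "c"}, [{"b"}, {"c", "d"}])
print(sorted(sets["I"]), sorted(sets["U"]))

# Inclusion-exclusion check on the reported sizes: |U| = |F| + |T| - |I|.
assert 38 + 89 - 25 == 102
```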
Average accuracy (%) of k-NN multi-class classification, based on 11-fold cross-validation over 1,000 permutations.

Metal        n     I       F       T       U       A
# of genes         25      38      89      102     684
Control      14    100     100     98.4    98.8    79.0
As           7     99.1    99.1    98.7    98.8    96.6
AsV          5     75.5    78.6    99.8    97.1    38.7
Cd           6     82.0    84.4    81.8    81.5    50.4
Cu           5     61.6    78.5    38.2    41.5    57.1
Ni           4     76.3    99.7    99.5    97.8    94.9
Cr           5     60.0    42.4    37.1    97.1    18.3
Sb           7     81.3    81.4    98.7    81.7    78.7
Pb           5     51.8    72.7    46.0    45.8    45.8
Total        55    81.6    85.3    82.0    80.5    65.6

The FLDA algorithm performed poorly; for example, the overall accuracies are 67.9% and 40.5% for I and F, respectively.
Cluster analysis with a 2-MDS plot for the treatment-specific marker genes in I, using the (1 - ρ) metric with complete linkage. Each gene is labeled with the compound for which it shows a unique expression.

Metal   I
Ctrl    7
As      1
AsV     1
Cd      3
Cu      2
Ni      4
Cr      1
Sb      8
Pb      0
Clustering results with 2-MDS plots for the 55 arrays for the genes I and A
Panels: Gene set I (25 genes); Gene set A (684 genes)
Acknowledgements: Collaborators and Contributors
Dr. Frank Sistare & Staff (CDER/FDA; Merck)
Dr. Sue-Jane Wang (CDER/FDA)
Dr. T-C Lee & Staff (Academia Sinica, Taiwan)
Dr. C-h Chen & Staff (Academia Sinica, Taiwan)
Dr. Suzanne Morris & Staff (NCTR)
Dr. Jim Fuscoe & Staff (NCTR)
Dr. Ralph Kodell (NCTR)
Dr. Robert Delongchamp (NCTR)
Dr. Hueymiin Hsueh (Cheng-chi Univ., Taiwan)
Dr. Chen-an Tsai (NCTR)
Ms. Yi-Ju Chen (Penn State, NCTR)