Theoretical and experimental comparisons of gene expression indexes for oligonucleotide microarrays
Department of Biostatistics, University of North Carolina, Chapel Hill
Division of Human Cancer Genetics, Ohio State University
William J. Lemon, Jeffrey J.T. Palatini, Ralf Krahe, Fred A. Wright
Measuring gene expression with the Affymetrix GeneChip
Perfect Match (PM): 25 bases complementary to a region of the gene
Mismatch (MM): the middle base is different
[Diagram: probe pairs placed along the coding portion of gene X, up to the polyA tail]
•cRNA from sample mRNA is put on the chip
•intensity of binding reflects gene expression
Reproducibility of Probe Sensitivities
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
The Li-Wong Model
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
Li-Wong Full (LWF)
$$MM_{ij} = \nu_j + \theta_i\alpha_j + e, \qquad PM_{ij} = \nu_j + \theta_i\alpha_j + \theta_i\phi_j + e, \qquad e \sim N(0, \sigma^2)$$
Li-Wong Reduced (LWR)
$$y_{ij} = PM_{ij} - MM_{ij} = \theta_i\phi_j + \varepsilon, \qquad \varepsilon \sim N(0, \tau^2), \quad \tau^2 = 2\sigma^2$$
Identifiability constraint
$$\sum_j \phi_j^2 = J$$
i: ith array
j: jth probe pair
J: total no. probe pairs
$\theta_i$: expression
$\phi_j$: sensitivities
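As a rough illustration of how the reduced model can be fitted, here is a minimal sketch that alternates least-squares updates of the expressions and the sensitivities for $y_{ij} \approx \theta_i\phi_j$, rescaling so that $\sum_j \phi_j^2 = J$. The function name `fit_lwr`, the fixed iteration count, and the toy data are assumptions made for illustration; this is not the authors' or Li and Wong's software, and it omits outlier handling.

```python
def fit_lwr(y, n_iter=50):
    """Alternating least squares for y[i][j] ~ theta[i] * phi[j].

    y is a list of arrays (rows), each a list of J probe-pair differences.
    The constraint sum(phi_j^2) = J is restored after every update.
    """
    n_arrays, J = len(y), len(y[0])
    phi = [1.0] * J  # starting value; already satisfies sum(phi^2) = J
    for _ in range(n_iter):
        # Given phi: theta_i = sum_j phi_j * y_ij / sum_j phi_j^2, and sum_j phi_j^2 = J
        theta = [sum(p * v for p, v in zip(phi, row)) / J for row in y]
        # Given theta: phi_j = sum_i theta_i * y_ij / sum_i theta_i^2
        ss = sum(t * t for t in theta)
        phi = [sum(theta[i] * y[i][j] for i in range(n_arrays)) / ss
               for j in range(J)]
        # Rescale to the identifiability constraint
        scale = (J / sum(p * p for p in phi)) ** 0.5
        phi = [p * scale for p in phi]
    theta = [sum(p * v for p, v in zip(phi, row)) / J for row in y]
    return theta, phi


# Toy, noise-free data (hypothetical values): 3 arrays, 4 probe pairs
raw = [0.5, 1.0, 1.5, 2.0]
s = (len(raw) / sum(p * p for p in raw)) ** 0.5
true_phi = [p * s for p in raw]
true_theta = [100.0, 250.0, 400.0]
y = [[t * p for p in true_phi] for t in true_theta]
theta_hat, phi_hat = fit_lwr(y)
print([round(t, 3) for t in theta_hat])
```

With noise-free rank-one data such as the toy example, the updates recover the generating values; with real data the result is only a least-squares fit under the stated constraint.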
How to compare gene expression indexes?
•We get maximum likelihood estimates of the expression $\theta_i$ using either the full data (LWF) or the reduced data (LWR)
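•The Affymetrix software computes:
Average Difference (AD)
$$\mathrm{AD} = \sum_j y_j / J$$
Log-Average (LA)
$$\mathrm{LA} = 10 \sum_j \log(PM_j / MM_j) / J$$
•The log-average might perform particularly poorly. Note that if the $\nu_j$ terms are small and the error variance is small,
$$PM_j / MM_j \approx (\theta\alpha_j + \theta\phi_j) / (\theta\alpha_j) = (\alpha_j + \phi_j) / \alpha_j,$$
which does not depend on the expression $\theta$.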
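The point about the log-average can be checked numerically. The sketch below generates noise-free probe pairs from the full Li-Wong model with zero background, $MM_j = \theta\alpha_j$ and $PM_j = \theta(\alpha_j + \phi_j)$, and computes both indexes at two expression levels. The parameter values are hypothetical, and base-10 logarithms are an assumption, since the slide writes only "log".

```python
import math


def average_difference(pm, mm):
    """AD: mean of PM_j - MM_j over the J probe pairs."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


def log_average(pm, mm):
    """LA: 10 * mean of log10(PM_j / MM_j) over the J probe pairs (base assumed)."""
    return 10 * sum(math.log10(p / m) for p, m in zip(pm, mm)) / len(pm)


def noise_free_probes(theta, alpha, phi):
    """Full-model means with zero background (nu_j = 0) and no error."""
    mm = [theta * a for a in alpha]
    pm = [theta * (a + p) for a, p in zip(alpha, phi)]
    return pm, mm


alpha = [0.5, 1.0, 0.8]  # hypothetical cross-hybridization sensitivities
phi = [1.0, 0.5, 1.4]    # hypothetical specific sensitivities

for theta in (100.0, 1000.0):
    pm, mm = noise_free_probes(theta, alpha, phi)
    print(theta, average_difference(pm, mm), log_average(pm, mm))
```

In this idealized setting AD scales with the expression level while LA stays the same, which is the concern raised in the bullet above.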
•We gain insight by assuming Li-Wong model is true. Then what are the consequences?
•For large sample sizes, the probe-specific parameters (the $\alpha_j$'s and $\phi_j$'s) will be well estimated
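Compare LW estimators directly:
$$RE(\text{full}, \text{reduced}) = \frac{\mathrm{var}(\hat\theta_{\text{reduced}})}{\mathrm{var}(\hat\theta_{\text{full}})} = \frac{2\sum_j \left[\alpha_j^2 + (\alpha_j + \phi_j)^2\right]}{J} \ge 2.0$$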
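A small Monte Carlo check of the full-versus-reduced comparison, under the assumption that the probe parameters $\nu_j$, $\alpha_j$, $\phi_j$ are known exactly (the large-sample idealization). The full-model estimate is the least-squares estimate of $\theta$ from both PM and MM, and the reduced estimate uses only $PM - MM$. All numerical values and the function name `simulate_re` are hypothetical.

```python
import random


def simulate_re(theta, nu, alpha, phi, sigma, n_rep=10000, seed=0):
    """Monte Carlo estimate of var(theta_reduced) / var(theta_full)."""
    rng = random.Random(seed)
    w_full = sum((a + p) ** 2 + a ** 2 for a, p in zip(alpha, phi))
    w_red = sum(p * p for p in phi)
    full, red = [], []
    for _ in range(n_rep):
        mm = [n + theta * a + rng.gauss(0, sigma) for n, a in zip(nu, alpha)]
        pm = [n + theta * (a + p) + rng.gauss(0, sigma)
              for n, a, p in zip(nu, alpha, phi)]
        # Least squares with known probe parameters
        full.append(sum((a + p) * (x - n) + a * (m - n)
                        for a, p, n, x, m in zip(alpha, phi, nu, pm, mm)) / w_full)
        red.append(sum(p * (x - m) for p, x, m in zip(phi, pm, mm)) / w_red)

    def var(v):
        mu = sum(v) / len(v)
        return sum((t - mu) ** 2 for t in v) / (len(v) - 1)

    return var(red) / var(full)


raw = [0.5, 1.0, 1.5, 2.0]
s = (len(raw) / sum(p * p for p in raw)) ** 0.5
phi = [p * s for p in raw]            # rescaled so that sum(phi^2) = J
alpha = [0.5, 1.0, 0.8, 0.2]
nu = [20.0, 35.0, 10.0, 50.0]
J = len(phi)

formula = 2 * sum(a * a + (a + p) ** 2 for a, p in zip(alpha, phi)) / J
print("formula:", formula, "simulated:", simulate_re(200.0, nu, alpha, phi, 30.0))
```

The simulated ratio should sit close to $2\sum_j[\alpha_j^2 + (\alpha_j+\phi_j)^2]/J$, up to Monte Carlo error.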
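Comparing to AD is tricky, but with a correction factor AD is also an unbiased estimate of $\theta$:
$$\hat\theta_{AD} = \mathrm{AD} \cdot J / \sum_j \phi_j$$
$$RE(\text{reduced}, AD) = \frac{\mathrm{var}(\hat\theta_{AD})}{\mathrm{var}(\hat\theta_{\text{reduced}})} = \frac{1}{1 - \mathrm{var}(\phi)} \ge 1.0$$
where $\mathrm{var}(\phi)$ is the variance of the $\phi_j$ across probe pairs.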
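•This also gives insight into “perfect match only” analyses:
$$RE(\text{full}, \text{PM-only}) = \frac{\mathrm{var}(\hat\theta_{PM})}{\mathrm{var}(\hat\theta_{\text{full}})} = 1 + \frac{\sum_j \alpha_j^2}{\sum_j (\alpha_j + \phi_j)^2}$$
and $1 \le RE \le 2$.
Furthermore, PM-only is always at least twice as efficient as LWR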
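The theoretical comparisons on these slides can be collected in one place. The sketch below evaluates the closed-form relative efficiencies for given sensitivities, assuming known probe parameters, non-negative $\alpha_j$ and $\phi_j$, and $\sum_j\phi_j^2 = J$, with $RE(A, B)$ meaning $\mathrm{var}(B)/\mathrm{var}(A)$. The PM-only versus LWR expression, $2\sum_j(\alpha_j+\phi_j)^2/J$, is derived here from the same assumptions rather than taken from the slides; the helper names and parameter values are hypothetical.

```python
def scale_phi(raw):
    """Rescale sensitivities so that sum(phi_j^2) = J."""
    s = (len(raw) / sum(p * p for p in raw)) ** 0.5
    return [p * s for p in raw]


def relative_efficiencies(alpha, phi):
    """Closed-form relative efficiencies; RE(A, B) = var(B) / var(A)."""
    J = len(phi)
    s_alpha = sum(a * a for a in alpha)
    s_pm = sum((a + p) ** 2 for a, p in zip(alpha, phi))
    mean_phi = sum(phi) / J
    var_phi = sum(p * p for p in phi) / J - mean_phi ** 2
    return {
        "full_vs_reduced": 2 * (s_alpha + s_pm) / J,
        "reduced_vs_AD": 1 / (1 - var_phi),
        "full_vs_PM_only": 1 + s_alpha / s_pm,
        "PM_only_vs_reduced": 2 * s_pm / J,
    }


phi = scale_phi([0.5, 1.0, 1.5, 2.0])  # hypothetical sensitivities
alpha = [0.5, 1.0, 0.8, 0.2]
for name, value in relative_efficiencies(alpha, phi).items():
    print(name, round(value, 3))
```

Note that the first entry equals the product of the last two, so the factor-of-two bounds on the slides are mutually consistent.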
Empirical Comparisons
•We propose that an expression index is “good” if it has a high correlation with the underlying true expression (which is usually unknown).
•this correlation can be estimated using a specially designed mixing experiment
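•if r is the correlation coefficient between the measured index and true expression, the “relative efficiency” of two indexes $\hat\theta$ and $\hat\theta'$ can be estimated as
$$\frac{r^2/(1-r^2)}{r'^2/(1-r'^2)}$$
where $r$ and $r'$ are the correlations for $\hat\theta$ and $\hat\theta'$, respectively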
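Suppose the true underlying gene expression for a given gene is $\theta$. Consider two indices of gene expression
$$\hat\theta = \beta_0 + \beta_1\theta + e, \qquad e \sim N(0, \sigma^2)$$
$$\hat\theta' = \beta_0' + \beta_1'\theta + e', \qquad e' \sim N(0, \sigma'^2).$$
$(\hat\theta - \beta_0)/\beta_1$ is an unbiased estimate of $\theta$, with
$$\mathrm{var}\left((\hat\theta - \beta_0)/\beta_1\right) = \sigma^2/\beta_1^2,$$
and similarly for $\hat\theta'$. And we have
$$RE(\hat\theta, \hat\theta') = \frac{\mathrm{var}\left((\hat\theta' - \beta_0')/\beta_1'\right)}{\mathrm{var}\left((\hat\theta - \beta_0)/\beta_1\right)} = \frac{\sigma'^2/\beta_1'^2}{\sigma^2/\beta_1^2}$$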
Can we estimate this relative efficiency?
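•Suppose we could do a regression of $\hat\theta$ on $\theta$.
•the ratio of explained to residual variance in the model can be shown to be
$$\beta_1^2\,\mathrm{var}(\theta)/\sigma^2 = \frac{r^2}{1-r^2}$$
and similarly for $\hat\theta'$, so
$$\frac{r^2/(1-r^2)}{r'^2/(1-r'^2)} = RE(\hat\theta, \hat\theta')$$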
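To make the correlation-based estimate concrete, here is a minimal sketch. It assumes an index of the form $\hat\theta = \beta_0 + \beta_1\theta + e$ with error variance $\sigma^2$, so that $r^2 = \beta_1^2\mathrm{var}(\theta) / (\beta_1^2\mathrm{var}(\theta) + \sigma^2)$. The function names and the numbers are hypothetical.

```python
def explained_to_residual(r):
    """Ratio of explained to residual variance, r^2 / (1 - r^2)."""
    return r * r / (1 - r * r)


def relative_efficiency(r_a, r_b):
    """Estimated efficiency of index A relative to index B from their correlations."""
    return explained_to_residual(r_a) / explained_to_residual(r_b)


# Population check of the identity with hypothetical values
beta1, var_theta, sigma2 = 2.0, 9.0, 12.0
signal = beta1 ** 2 * var_theta              # explained variance
r = (signal / (signal + sigma2)) ** 0.5      # correlation between index and theta
print(explained_to_residual(r), signal / sigma2)

# Two hypothetical indexes with correlations 0.8 and 0.6
print(relative_efficiency(0.8, 0.6))
```

The intercepts and slopes never need to be known: only the two correlations enter the estimate.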
Can we estimate r without ever knowing the true expressions $\theta$?
•Yes, with a specially designed mixing experiment
•we seek two contrasting conditions in which many genes will be differentially expressed
Experimental Design
•Cell culture: human fibroblasts (GM 08330), 5 passages, 20% FBS
•Serum starvation (0.1% FBS) and serum stimulation (20% FBS); time points shown in the flowchart: 48h, 24h
•Harvest total RNA from the Stimulated and Starved cultures
•Add bacterial control genes (Lys, Phe; Dap, Thr)
•Produce 50:50 group
•Produce duplicates each day for 3d (6 replicates for each condition)
•Synthesize cDNA, cRNA; fragment
•Add hybridization control genes (BioB, BioC, BioD, Cre)
•Hybridize HuGeneFL
•Data reduction: gene expression indexes
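BIN1 expression
[Figure: BIN1 expression (full model) on the Stim, 50:50, and Starved arrays]
True expression in the 50:50 group = average of Stim, Starved
Note that
$$r_{\hat\theta,\theta} = r_{\hat\theta,X} \quad\text{or}\quad r_{\hat\theta,\theta} = -r_{\hat\theta,X}$$
where X = 1, 2, 3 (say) for Stim, 50:50, Starved, respectively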
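The reason the mixing design works can be illustrated with a short sketch: if the true expression in the 50:50 group is the average of the Stim and Starved values, then the true expression is exactly linear in the group code X, and correlation is unchanged (up to sign) by a linear transformation. The expression levels, noise level, and index model below are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# 18 arrays: 6 Stim (X=1), 6 50:50 (X=2), 6 Starved (X=3)
X = [1] * 6 + [2] * 6 + [3] * 6
# Hypothetical true expression: 900 in Stim, 300 in Starved, so 600 in 50:50
theta = [1200 - 300 * x for x in X]

rng = random.Random(1)
index = [50 + 0.8 * t + rng.gauss(0, 60) for t in theta]  # a noisy measured index

r_x = pearson(index, X)
r_theta = pearson(index, theta)
print(r_x, r_theta)
```

Here the gene is higher in Stim, so the two correlations differ only in sign, and $r^2$, which is all the relative efficiency needs, can be computed from X alone.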
Mean probe intensity per array
[Figure: mean probe intensity per array for the Stim, 50:50, and Starved groups]
Overall intensity higher in Stimulated
Coefficients of variation for assay (individual probes) and gene expression indexes
[Figure: three histograms of CV (x-axis 0.0 to 2.0) for the Stim arrays, with the value annotated on each panel: Assay Stim (# probes), 0.121; LWF Stim (# genes), 0.149; Affymetrix AD Stim (# genes), 0.293]
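For reference, a coefficient of variation across replicate arrays can be computed as below. This is a minimal sketch: summarizing the per-gene CVs by their median is an assumption (the slide does not say how the annotated values were obtained), and the replicate values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical expression indexes for 3 genes on 6 replicate arrays
replicates = {
    "gene_a": [510.0, 495.0, 530.0, 488.0, 502.0, 520.0],
    "gene_b": [60.0, 85.0, 40.0, 75.0, 55.0, 90.0],
    "gene_c": [1500.0, 1620.0, 1480.0, 1555.0, 1590.0, 1510.0],
}

cvs = {gene: coefficient_of_variation(v) for gene, v in replicates.items()}
print(cvs)
print("median CV:", statistics.median(cvs.values()))
```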
Correlation matrix of 18 arrays as a colorized image for each expression index.
[Figure: four panels (LWF, AD, LWR, LA), with rows and columns ordered Stim, 50:50, Starved]
Comparing Models: Cluster Analysis
[Figure: cluster dendrograms of the 18 arrays (Stim 1-6, 50:50 1-6, Strv 1-6) for four indexes: Affymetrix Log Ave, Full Model, Reduced Model, Affymetrix Ave Diff]
Relative Efficiency
[Figure: bar charts of Median(r^2/(1-r^2)) for LWF, LWR, AD, and LA, in two panels (Unscaled, Scaled); y-axis 0.0 to 1.5]
Correlation of duplicate measurements of 149 genes
LWF median r=.74
LWR median r=.43
AD median r=.08
LA median r=.17
Number of unexpressed genes
•Only 0.2% of the LW estimates are negative
•50:50 group has fewest negative estimates
•could this indicate very few unexpressed genes?
[Figure: negative estimates by group (Stim, 50:50, Starved)]
A conservative approach to estimating number of unexpressed genes
•Let U denote number of unexpressed genes
•genes are ranked according to expression index
$$U \le 2 \times \mathrm{median}(\text{rank of } U \text{ genes among all genes})$$
•This is useful if we can get a random sample of unexpressed genes
[Figure: distribution of the gene expression index, with the unexpressed population marked]
•We use the spiked-out bacterial control genes as a sample of “unexpressed” genes
•the 4 genes are represented 3 times each (different portions of mRNA), for a total of 12 probe sets
•Based on this reasoning, we estimate that greater than 88% of the genes are expressed, even in the Starved samples
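The rank-based bound can be sketched as follows, assuming ranks are counted from the lowest expression index (rank 1) and that the control probe sets behave like a random sample of unexpressed genes. If the U unexpressed genes occupied the lowest U ranks, their median rank would be about U/2, so twice the median rank of the sample bounds U from above. The function name and the toy data are hypothetical.

```python
import statistics


def unexpressed_upper_bound(index, control_ids):
    """2 * median rank of the control probe sets among all probe sets.

    index: expression index per probe set (list of floats)
    control_ids: positions in `index` of the spiked-out ("unexpressed") controls
    """
    order = sorted(range(len(index)), key=lambda g: index[g])
    rank = {g: r for r, g in enumerate(order, start=1)}  # rank 1 = lowest index
    return 2 * statistics.median(rank[g] for g in control_ids)


# Toy example: 100 probe sets with indexes 1..100; controls are the 4 lowest
index = [float(v) for v in range(1, 101)]
controls = [0, 1, 2, 3]

u_max = unexpressed_upper_bound(index, controls)
print("upper bound on U:", u_max)
print("fraction expressed is at least:", 1 - u_max / len(index))
```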
Rank of expression index variance across the 6 Stimulated arrays versus rank of index mean
[Figure: two panels (AD, LWF) plotting Rank(var) against Rank(mean), axes 0 to 6000; the control genes Dap, Thr, Phe, and Lys are marked, with those truly absent in the Stim group highlighted]
Very low estimated expression for truly absent genes when using LWF
Present/absent calls
•We use the statistic
$$z = \hat\theta / SE(\hat\theta)$$
to declare genes present/absent (absolute call)
•we find the vast majority of genes on the array appear to be present
•for the spiked in/out genes, we find vastly improved present/absent calling using LW estimates
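A present/absent call based on the ratio of the estimated expression to its standard error might look like the sketch below. The cutoff value is an assumption chosen only for illustration (the slides do not state one), and the estimates and standard errors are hypothetical.

```python
def z_statistic(theta_hat, se):
    """z = theta_hat / SE(theta_hat)."""
    return theta_hat / se


def call_present(theta_hat, se, z_cut=2.0):
    """Declare a gene present when z exceeds the cutoff (cutoff is hypothetical)."""
    return z_statistic(theta_hat, se) > z_cut


# Hypothetical estimates: (theta_hat, standard error)
genes = {"gene_a": (850.0, 40.0), "gene_b": (35.0, 30.0), "gene_c": (-12.0, 25.0)}
for name, (theta_hat, se) in genes.items():
    label = "present" if call_present(theta_hat, se) else "absent"
    print(name, round(z_statistic(theta_hat, se), 2), label)
```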
ROC curve - spiked in/out genes
[Figure: ROC curves plotting 1 - False Negative Rate (Sensitivity) against False Positive Rate (1 - Specificity), both axes 0.0 to 1.0, for LWF-Z, LWR-Z, Untrimmed AD, Untrimmed LA, LA, AD, and Absolute Call]
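An ROC curve of this kind can be traced by sweeping a cutoff over a calling statistic for genes whose presence or absence is known by design, as with spiked in/out controls. The sketch below is a generic construction with hypothetical scores and labels, not the computation behind the figure.

```python
def roc_points(scores, is_present):
    """(false positive rate, sensitivity) pairs, from strictest to loosest cutoff.

    A gene is called present when its score is >= the cutoff.
    """
    n_pos = sum(1 for p in is_present if p)
    n_neg = len(is_present) - n_pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, p in zip(scores, is_present) if s >= cut and p)
        fp = sum(1 for s, p in zip(scores, is_present) if s >= cut and not p)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical z statistics for 3 spiked-in and 3 spiked-out probe sets
scores = [12.0, 7.5, 1.1, 2.4, 0.3, -0.8]
truth = [True, True, True, False, False, False]
for fpr, tpr in roc_points(scores, truth):
    print(round(fpr, 2), round(tpr, 2))
```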
Variability in estimates
[Figure: log(variance) against log(mean) for the Full Model and Reduced Model, with points grouped as Stim, 50:50, and Starved]
Conclusions
• Model-based estimators are superior to simple averaging
• Full model superior to reduced
• this does not necessarily mean that the mismatch probes are a good idea - but if they are present we should use them
• we have demonstrated this using both analytic considerations and experimental data
• a carefully designed experiment can be used to address many issues
• Many more genes may be expressed than previously thought
Other issues/ future work
•Spiking genes might be used to calibrate and normalize arrays
•relationship between variance and mean of expression indexes may be useful in planning experiments
•our data may be useful for future work, especially in producing indexes that are resistant to probe saturation
•all primary data, this PowerPoint presentation and a preprint are available at http://thinker.med.ohio-state.edu