Upload
rafael-a
View
215
Download
1
Embed Size (px)
Citation preview
A gene expression bar codefor microarray dataMichael J Zilliox1,3 & Rafael A Irizarry2
The ability to measure genome-wide expression holds great
promise for characterizing cells and distinguishing diseased from
normal tissues. Thus far, microarray technology has been useful
only for measuring relative expression between two or more
samples, which has handicapped its ability to classify tissue
types. Here we present a method that can successfully predict
tissue type based on data from a single hybridization.
A preliminary web-tool is available online (http://rafalab.
jhsph.edu/barcode/).
The ability of microarrays to measure thousands of RNA transcriptsat one time allows for the characterization of cells and tissues ingreater depth than was previously possible, but has not yet led tobig advances in diagnosis or treatment. Themain reason for this is that feature char-acteristics, such as probe sequence, cancloud the relationship between observedintensity and actual expression. Althoughthis ‘probe effect’ is large, it is also veryconsistent across different hybridizations,which implies that relative measures ofexpression are substantially more usefulthan absolute ones1,2. To understand this,consider that when comparing intensitiesfrom different hybridizations for the samegene, the probe effect is very similar andcancels out. In contrast, when comparingintensities for two genes from the samehybridization, the different probe effectscan alter the observed differences. For thisreason the overwhelming majority ofresults based on microarray data rely onmeasures of relative expression: genes arereported to be differentially expressedrather than expressed or unexpressed.
Approaches for thresholding noisy datahave been successfully used in many appli-cations, including microarray studies3,4. Weused this as motivation to develop the firstmethod that can accurately demarcate
expressed from unexpressed genes and therefore defines a unique‘gene expression bar code’ for each tissue type. To do this, we usedthe vast amount of publicly available data sets to assess thealgorithm. With clinical data, we found near-perfect predictabilityof normal versus diseased tissue for three cancer studies and oneAlzheimer’s disease study. The bar-code method also discovers newtumor subsets in previously published breast cancer studies that canbe used for the prognosis of tumor recurrence and survival time.
For any given gene and microarray platform, we wanted to knowwhat intensity relates to no expression. A simple way to determinethis would be to hybridize tissues for which we know the gene is notexpressed and look at the distribution of the observed intensities. Ifa new sample is provided, and we want to know if this gene isexpressed, we would simply compare the observed intensity to thepreviously formed distribution. For a single laboratory, creatingthis training data set is logistically impossible for two reasons: (i) wedo not know what genes are expressed in which tissues and (ii) itwould require various hybridizations for each gene. Fortunately,a preliminary version of such a data set already exists for someplatforms/organisms. We downloaded raw data for more than a
NM
_000
315
AW
2766
46
AMP
4 6 8 10 12 Tissue type
Log2 intensity
Log 2
inte
nsity
Log 2
inte
nsity
12
10
8
6
4
12
10
8
6
4
Figure 1 | Across-sample distributions of gene-expression estimates. Data for two human genes are
shown with the Genbank accession number on the y axis. Across sample distribution for all tissues (left).
The vertical line is automatically drawn by the bar-code method, and distinguishes the intensity range
associated with expressed and unexpressed genes. The orange, purple and green ticks denote the observed
values with color denoting the call provided by the manufacturer. Absent calls are shown on top of the
plot while present and marginal calls are shown on the bottom axis. For the gene in the top plot, the calls
appear consistent with the plot, but these calls appear unable to distinguish expressed from unexpressed
for the gene shown in the bottom row. The box plots stratify these calls by tissue. The horizontal line
denotes the expressed-unexpressed boundary. All samples of the same tissue are consistently present or
consistently absent. A, M and P, absent, marginal and present calls (made by the manufacturer).
RECEIVED 27 JUNE; ACCEPTED 7 SEPTEMBER; PUBLISHED ONLINE 30 SEPTEMBER 2007; DOI:10.1038/NMETH1102
1W. Harry Feinstone Department of Molecular Microbiology and Immunology, and 2Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health,615 N. Wolfe Street, Baltimore, Maryland 21205, USA. 3Present address: Emory Vaccine Center and Department of Microbiology and Immunology, Emory UniversitySchool of Medicine, 1510 Clifton Road, Atlanta, Georgia 30322, USA. Correspondence should be addressed to R.A.I. ([email protected]).
NATURE METHODS | VOL.4 NO.11 | NOVEMBER 2007 | 911
BRIEF COMMUNICATIONS©
2007
Nat
ure
Pub
lishi
ng G
roup
ht
tp://
ww
w.n
atur
e.co
m/n
atur
emet
ho
ds
hundred tissues from the public repositories and preprocessed withthe same algorithm1,5–7. Then, for each gene the intensity distribu-tion was determined. Because it is expected that any given gene willonly be expressed in some tissues, multiple modes should beobserved. It is assumed that the lowest intensity mode is due to alack of expression (Fig. 1). Genes that are expected to be expressedare coded with ones and the unexpressed coded with zeros.We refer to this information as the gene expression bar code(see Supplementary Methods online). Wecreated bar codes for 118 human and 44mouse tissues (Supplementary Tables 1and 2 online; a dendrogram and heatmapdisplaying the bar codes and relatedsummaries are available in SupplementaryFig. 1 online).
We compared the bar code to the detec-tion calls from the Affymetrix defaultanalysis software (MAS 5.0). MAS 5.0 callsgenes present, marginal or absent. WithMAS 5.0, only 10% of the 22,215 genesrepresented in the ‘human’ array achievethe same call in all samples within thesame tissue. This number increases to 48%using our approach (SupplementaryFig. 2a,b online). We obtained similarresults with mouse data (SupplementaryFigs. 2c,d). To assess sensitivity, we usedresults from a study that reported proteinspresent in various mouse tissues8. Wemapped these proteins and found that thebar code was more sensitive at declaringgenes present when the correspondingproteins were found in the tissue (Supple-mentary Fig. 2e).
To demonstrate the utility of our algo-rithm we developed a classification schemethat assigns tissue types to unknown sam-ples by comparing their bar code to pre-defined ones (using Euclidean distance).Various sample classification algorithmshave been published by various groups formicroarray data. Many of these were com-pared on the original expression estimates.Predictive analysis of microarrays (PAM)9
produced the best results (data not shown).
We compared our approach to PAM using leave-one-out cross-validation. We included tissues for which detailed annotationwas available and there were 3 or more samples (see Supplemen-tary Table 3 online for results, which include various clinical datasets10–16). The bar code outperformed PAM in all comparisonsexcept two, where it performed as well. Because cross-validationhas a tendency to overestimate the performance of a classifica-tion algorithm, we assessed performance on six independent data
AmygdalaCaudate nucleus
CerebellumCerebellum peduncles
Fetal brainHypothalamus
Medulla oblongataOccipital lobeOlfactory bulb
Parietal lobePons
Prefrontal cortexSpinal cord
Subthalamic nucleusTemporal lobe
Thalamus721 B lymphoblast CL
Adrenal glandAppendix
Atrioventricular node Early progenitor
Bone marrowCiliary ganglion
Colorectal adenocarCultured adipocyte
LungDRG
Fetal lungFetal thyroid
HBECHeart
IsletLeukemia (K562)
Leukemia (MOLT4)
LiverLymph node
Lymphoma Burkitts DaudLymphoma raji
OvaryPancreas
CD4 T cellCD8 T cell
PlacentaProstate
Salivary glandSkin
Smooth muscleSup cervical ganglion
Testi germ cellTesti intersitial
Testi seminifer tubuleThymusThyroidTongue
TonsilTrachea
Trigeminal ganglionUterus
Am
ygda
laC
auda
te n
ucle
usC
ereb
ellu
mC
ereb
ellu
m p
edun
cles
Fet
al b
rain
Hyp
otha
lam
usM
edul
la o
blon
gata
Occ
ipita
l lob
eO
lfact
ory
bulb
Par
ieta
l lob
eP
ons
Pre
fron
tal c
orte
xS
pina
l cor
dS
ubth
alam
ic n
ucle
usTe
mpo
ral l
obe
Tha
lam
us72
1 B
lym
phob
last
CL
Adr
enal
gla
ndA
ppen
dix
Atr
iove
ntric
ular
nod
e E
arly
pro
geni
tor
Bon
e m
arro
wC
iliar
y ga
nglio
nC
olor
ecta
l ade
noca
rC
ultu
red
adip
ocyt
eLu
ngD
RG
Fet
al lu
ngF
etal
thyr
oid
HB
EC
Hea
rtIs
let
Leuk
emia
(K
562)
Leuk
emia
(M
OLT
4
Live
rLy
mph
nod
eLy
mph
oma
Bur
kitts
Dau
dLy
mph
oma
raji
Ova
ryP
ancr
eas
CD
4 T
cel
lC
D8
T c
ell
Pla
cent
aP
rost
ate
Sal
ivar
y gl
and
Ski
nS
moo
th m
uscl
eS
up c
ervi
cal g
angl
ion
Test
i ger
m c
ell
Test
i int
ersi
tial
Test
i sem
inife
r tu
bule
Thy
mus
Thy
roid
Tong
ueTo
nsil
Trac
hea
Trig
emin
al g
angl
ion
Ute
rus
AmygdalaCaudate nucleus
CerebellumCerebellum peduncles
Fetal brainHypothalamus
Medulla oblongataOccipital lobeOlfactory bulb
Parietal lobePons
Prefrontal cortexSpinal cord
Subthalamic nucleusTemporal lobe
Thalamus721 B lymphoblast CL
Adrenal glandAppendix
Atrioventricular node Early progenitor
Bone marrowCiliary ganglion
Colorectal adenocarCultured adipocyte
LungDRG
Fetal lungFetal thyroid
HBECHeart
IsletLeukemia (K562)
Leukemia (MOLT4)
LiverLymph node
Lymphoma Burkitts DaudLymphoma raji
OvaryPancreas
CD4 T cellCD8 T cell
PlacentaProstate
Salivary glandSkin
Smooth muscleSup cervical ganglion
Testi germ cellTesti intersitial
Testi seminifer tubuleThymusThyroidTongue
TonsilTrachea
Trigeminal ganglionUterus
Leukemia (HL–60)
Leukemia (HL–60)
Leuk
mia
(H
L–60
)
a
b
–1 –0.8 –0.6 –0.4 –0.2 0 0.2 0.4 0.6 0.8 1
Figure 2 | Demonstration of the lab effect and
its removal by the bar-code algorithm. (a) The
correlation between every pair of samples from
study E-AFMX-5 is shown in different shades of
blue (negative correlation in red). For each gene
the across-sample mean value was subtracted from
the expression. The different tissues are divided
by black lines and labeled in the axes. Most of the
tissues in this study are completely different types
and therefore the lab effect must be driving the
high correlation. This plot for multiple labs is
available in Supplementary Figure 3. (b) The
same plot as in a but using the bar-code method
to define correlation.
912 | VOL.4 NO.11 | NOVEMBER 2007 | NATURE METHODS
BRIEF COMMUNICATIONS©
2007
Nat
ure
Pub
lishi
ng G
roup
ht
tp://
ww
w.n
atur
e.co
m/n
atur
emet
ho
ds
sets not included in the cross-validation process (Table 1). Hereonly the bar code performed well, with similar accuracy to thatof cross-validation.
The fact that the bar code greatly outperformed PAM on theindependent data sets is likely due to the lab or batch effect. Becausestudies usually target a particular tissue, a primary concern is that astrong lab effect will confound the ability to classify tissues from theability to classify labs2. An example of the lab effect is shown inSupplementary Figure 3a online, where the correlations betweensamples from study E-AFMX-5 are high despite originating from awide variety of tissues. The bar-code approach can remove many ofthese effects because subtle changes in intensity values are notstrong enough to make an absent gene appear present, or vice versa.Notice that the bar code removes most of the correlations in theE-AFMX-5 study without removing the correlation between thebrain tissues, both within the study and between studies (Fig. 2).Another example of how the lab effect can affect results is describedin Supplementary Results online and illustrated in Supplemen-tary Figure 4 online.
To assess the ability of the bar-code algorithm to find undiscov-ered tissue subsets, we used data from three breast cancer studiesthat did not include normal breast tissue samples, but did includepatient survival data14–16. We obtained the distance to all tissue barcodes, and 499 of 500 samples were classified as breast tumor (1 asbladder cancer). When we took out the breast tumor bar code, 37 ofthese samples were close to a variety of normal tissues and the other463 samples to a variety of cancer tissues. We then formed good andbad prognosis bar codes using these 37 and 463 samples, respec-tively. Then we used this new bar code to reclassify the 500 samples.We iterated this procedure until the good and bad prognosis groupsdid not change. The final bar codes resulted in a powerful prognosistool that outperformed the methods described in the original
papers. The performance of our prognosis tool is summarizedin Supplementary Results and illustrated in SupplementaryTables 4–6, and Supplementary Figure 5 online.
We expect the bar-code approach to classification anddiscovery presented in this paper to be improved in variousways. First, the classification algorithm implemented on the barcode was based on a very simple detection method and distancecalculation. Many aspects can be optimized for predictionpurposes. Second, as microarray technology improves so will thebar-code performance. In particular, the emergence of better geneannotation and arrays that probe for individual exons are the mostpromising developments. Finally, we have only implemented thebar code for two widely used platforms: Affymetrix HGU133Ahuman array and MOE430 mouse array. As soon as enough publicdata are available, however, the bar code will be defined for otherplatforms. In conclusion, we would like to acknowledge the effortsfrom the Microarray Gene Expression Data (MGED) Society topromote the sharing of microarray data. The work presented herewould not have been possible without the existing public reposi-tories. In particular, the availability of raw data was key, as themethods used to process raw data into gene-level measurementsalso contribute to study-to-study variability17,18. We hope the trendof sharing raw data continues, as we believe it to be necessary formicroarray technology to fulfill its promise.
Note: Supplementary information is available on the Nature Methods website.
Published online at http://www.nature.com/naturemethodsReprints and permissions information is available online athttp://npg.nature.com/reprintsandpermissions
1. Irizarry, R.A., Gautier, L. & Cope, L.M. in The analysis of gene expression data:methods and software. (eds. Parmigiani, G., Garrett, E.S., Irizarry, R.A. & Zeger,S.I.) 102–119 (Springer-Verlag, New York, 2003).
2. Irizarry, R.A. et al. Nat. Methods 2, 345–350 (2005).3. Kim, S. et al. Genomics 67, 201–209 (2000).4. Pal, R., Datta, A., Fornace, A.J. Jr., Bittner, M.L. & Dougherty, E.R. Bioinformatics
21, 1542–1549 (2005).5. Barrett, T. et al. Nucleic Acids Res. 33, D562–D566 (2005).6. Parkinson, H. et al. Nucleic Acids Res. 33, D553–D555 (2005).7. Carter, S.L., Eklund, A.C., Mecham, B.H., Kohane, I.S. & Szallasi, Z. BMC
Bioinformatics 6, 107 (2005).8. Kislinger, T. et al. Cell 125, 173–186 (2006).9. Tibshirani, R., Hastie, T., Narasimhan, B. & Chu, G. Proc. Natl. Acad. Sci. USA 99,
6567–6572 (2002).10. Blalock, E.M. et al. Proc. Natl. Acad. Sci. USA 101, 2173–2178 (2004).11. Kimchi, E.T. et al. Cancer Res. 65, 3146–3154 (2005).12. Dyrskjot, L. et al. Cancer Res. 64, 4040–4048 (2004).13. Lenburg, M.E. et al. BMC Cancer 3, 31 (2003).14. Miller, L.D. et al. Proc. Natl. Acad. Sci. USA 102, 13550–13555 (2005).15. Pawitan, Y. et al. Breast Cancer Res. 7, R953–R964 (2005).16. Sotiriou, C. et al. J. Natl. Cancer Inst. 98, 262–272 (2006).17. Irizarry, R.A., Wu, Z. & Jaffee, H.A. Bioinformatics 22, 789–794 (2006).18. Shi, L. et al. BMC Bioinformatics 6 (Suppl. 2), S12 (2005).
Table 1 | Percentage accuracy comparison on independent data sets
GEO identifier Data type
PAM
(% correct)
Bar code
(% correct)
GSE5388 Cortex 100 100
GSE2395 Respiratory system
epithelia
0 100
GSE2665 Lymph node/tonsil 35 95
GSE1561 Breast tumor 69 100
GSE2603 Breast tumor 77 90
GSE6344 Kidney: normal versus
cancer
100 100
PAM versus the bar code approach in six randomly selected data sets not included in theoriginal database. The data described in Supplementary Table 1 were used to train theprediction algorithms. GEO, Gene Expression Omnibus.
NATURE METHODS | VOL.4 NO.11 | NOVEMBER 2007 | 913
BRIEF COMMUNICATIONS©
2007
Nat
ure
Pub
lishi
ng G
roup
ht
tp://
ww
w.n
atur
e.co
m/n
atur
emet
ho
ds