
A gene expression bar code for microarray data


Michael J Zilliox 1,3 & Rafael A Irizarry 2

The ability to measure genome-wide expression holds great promise for characterizing cells and distinguishing diseased from normal tissues. Thus far, microarray technology has been useful only for measuring relative expression between two or more samples, which has handicapped its ability to classify tissue types. Here we present a method that can successfully predict tissue type based on data from a single hybridization. A preliminary web-tool is available online (http://rafalab.jhsph.edu/barcode/).

The ability of microarrays to measure thousands of RNA transcripts at one time allows for the characterization of cells and tissues in greater depth than was previously possible, but has not yet led to big advances in diagnosis or treatment. The main reason for this is that feature characteristics, such as probe sequence, can cloud the relationship between observed intensity and actual expression. Although this ‘probe effect’ is large, it is also very consistent across different hybridizations, which implies that relative measures of expression are substantially more useful than absolute ones1,2. To understand this, consider that when comparing intensities from different hybridizations for the same gene, the probe effect is very similar and cancels out. In contrast, when comparing intensities for two genes from the same hybridization, the different probe effects can alter the observed differences. For this reason the overwhelming majority of results based on microarray data rely on measures of relative expression: genes are reported to be differentially expressed rather than expressed or unexpressed.

Approaches for thresholding noisy data have been successfully used in many applications, including microarray studies3,4. We used this as motivation to develop the first method that can accurately demarcate expressed from unexpressed genes and therefore defines a unique ‘gene expression bar code’ for each tissue type. To do this, we used the vast amount of publicly available data sets to assess the algorithm. With clinical data, we found near-perfect predictability of normal versus diseased tissue for three cancer studies and one Alzheimer’s disease study. The bar-code method also discovers new tumor subsets in previously published breast cancer studies that can be used for the prognosis of tumor recurrence and survival time.

Figure 1 | Across-sample distributions of gene-expression estimates. Data for two human genes are shown with the Genbank accession number on the y axis. Across sample distribution for all tissues (left). The vertical line is automatically drawn by the bar-code method, and distinguishes the intensity range associated with expressed and unexpressed genes. The orange, purple and green ticks denote the observed values with color denoting the call provided by the manufacturer. Absent calls are shown on top of the plot while present and marginal calls are shown on the bottom axis. For the gene in the top plot, the calls appear consistent with the plot, but these calls appear unable to distinguish expressed from unexpressed for the gene shown in the bottom row. The box plots stratify these calls by tissue. The horizontal line denotes the expressed-unexpressed boundary. All samples of the same tissue are consistently present or consistently absent. A, M and P, absent, marginal and present calls (made by the manufacturer).

[Figure 1 panels: genes NM_000315 and AW276646; axes: log2 intensity (4 to 12) and tissue type; call labels A, M and P.]

RECEIVED 27 JUNE; ACCEPTED 7 SEPTEMBER; PUBLISHED ONLINE 30 SEPTEMBER 2007; DOI:10.1038/NMETH1102

1 W. Harry Feinstone Department of Molecular Microbiology and Immunology, and 2 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe Street, Baltimore, Maryland 21205, USA. 3 Present address: Emory Vaccine Center and Department of Microbiology and Immunology, Emory University School of Medicine, 1510 Clifton Road, Atlanta, Georgia 30322, USA. Correspondence should be addressed to R.A.I. ([email protected]).

NATURE METHODS | VOL.4 NO.11 | NOVEMBER 2007 | 911

BRIEF COMMUNICATIONS © 2007 Nature Publishing Group http://www.nature.com/naturemethods

For any given gene and microarray platform, we wanted to know what intensity relates to no expression. A simple way to determine this would be to hybridize tissues for which we know the gene is not expressed and look at the distribution of the observed intensities. If a new sample is provided, and we want to know if this gene is expressed, we would simply compare the observed intensity to the previously formed distribution. For a single laboratory, creating this training data set is logistically impossible for two reasons: (i) we do not know what genes are expressed in which tissues and (ii) it would require various hybridizations for each gene. Fortunately, a preliminary version of such a data set already exists for some platforms/organisms. We downloaded raw data for more than a hundred tissues from the public repositories and preprocessed with the same algorithm1,5–7. Then, for each gene the intensity distribution was determined. Because it is expected that any given gene will only be expressed in some tissues, multiple modes should be observed. It is assumed that the lowest intensity mode is due to a lack of expression (Fig. 1). Genes that are expected to be expressed are coded with ones and the unexpressed coded with zeros. We refer to this information as the gene expression bar code (see Supplementary Methods online). We created bar codes for 118 human and 44 mouse tissues (Supplementary Tables 1 and 2 online; a dendrogram and heatmap displaying the bar codes and related summaries are available in Supplementary Fig. 1 online).
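The thresholding idea in the preceding paragraph can be illustrated with a minimal Python sketch. It is not the authors' procedure, which is given in the Supplementary Methods and not reproduced here. The sketch assumes a simple stand-in: locate the lowest-intensity mode of a kernel density estimate, estimate the spread of the unexpressed distribution from the values at or below that mode, and place the cutoff a few standard deviations above the mode. The function names, the bandwidth of 0.3 log2 units, the peak-height filter and the multiplier of 3 are all hypothetical choices made for illustration.

```python
import numpy as np


def lowest_mode(x, bandwidth=0.3, grid_size=512, min_rel_height=0.1):
    """Return the location of the lowest-intensity mode of x (hypothetical helper)."""
    grid = np.linspace(x.min(), x.max(), grid_size)
    # Unnormalized Gaussian kernel density estimate evaluated on the grid.
    dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2).sum(axis=1)
    for i in range(1, grid_size - 1):
        is_peak = dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]
        # Ignore tiny bumps: the peak must reach a fraction of the tallest peak.
        if is_peak and dens[i] >= min_rel_height * dens.max():
            return grid[i]
    return grid[np.argmax(dens)]  # fallback when no interior peak is found


def expression_threshold(log2_intensity, n_sd=3.0, **kde_options):
    """Assumed cutoff for one gene: lowest mode plus n_sd times the left-side spread."""
    x = np.asarray(log2_intensity, dtype=float)
    mode = lowest_mode(x, **kde_options)
    below = x[x <= mode]
    sd = np.sqrt(np.mean((below - mode) ** 2)) if below.size else 0.0
    return mode + n_sd * sd


def bar_code(sample, thresholds):
    """Code each gene as 1 (expressed) or 0 (unexpressed) for a single sample."""
    return (np.asarray(sample, dtype=float) > np.asarray(thresholds, dtype=float)).astype(int)


# Toy data for one gene: many unexpressed samples near 6, fewer expressed near 10.
gene = np.concatenate([np.linspace(5.5, 6.5, 41), np.linspace(9.5, 10.5, 21)])
cutoff = expression_threshold(gene)
print(bar_code([6.0, 10.0], [cutoff, cutoff]))
```

Because the cutoff is estimated per gene from many hybridizations, a new sample can be coded from its own intensities alone, which is the property the text relies on.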

We compared the bar code to the detection calls from the Affymetrix default analysis software (MAS 5.0). MAS 5.0 calls genes present, marginal or absent. With MAS 5.0, only 10% of the 22,215 genes represented in the ‘human’ array achieve the same call in all samples within the same tissue. This number increases to 48% using our approach (Supplementary Fig. 2a,b online). We obtained similar results with mouse data (Supplementary Figs. 2c,d). To assess sensitivity, we used results from a study that reported proteins present in various mouse tissues8. We mapped these proteins and found that the bar code was more sensitive at declaring genes present when the corresponding proteins were found in the tissue (Supplementary Fig. 2e).
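The within-tissue consistency figure quoted above can be computed along the following lines. This is a sketch under one assumed reading of the text: a gene counts as consistent only if, for every tissue, all samples of that tissue receive the same call. The exact counting rule is not spelled out here, and the function name and toy data are hypothetical.

```python
import numpy as np


def fraction_consistent_genes(calls, tissues):
    """calls: samples x genes array of call labels; tissues: one label per sample."""
    calls = np.asarray(calls)
    tissues = np.asarray(tissues)
    consistent = np.ones(calls.shape[1], dtype=bool)
    for tissue in np.unique(tissues):
        block = calls[tissues == tissue]
        # A gene stays consistent only if every sample of this tissue matches the first one.
        consistent &= (block == block[0]).all(axis=0)
    return consistent.mean()


# Toy example: three genes, two liver samples and two brain samples.
calls = [[1, 0, 1],
         [1, 0, 0],
         [0, 1, 1],
         [0, 1, 1]]
tissues = ["liver", "liver", "brain", "brain"]
print(fraction_consistent_genes(calls, tissues))
```

The same function accepts the three-level MAS 5.0 labels (for example "P", "M", "A") as well as binary bar-code calls, so both approaches can be scored in the same way.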

To demonstrate the utility of our algorithm we developed a classification scheme that assigns tissue types to unknown samples by comparing their bar code to predefined ones (using Euclidean distance). Various sample classification algorithms have been published by various groups for microarray data. Many of these were compared on the original expression estimates. Predictive analysis of microarrays (PAM)9 produced the best results (data not shown).
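A nearest-bar-code classifier of the kind described above can be sketched in a few lines of Python. The reference bar codes and tissue names below are made-up toy values, and the function name is hypothetical. Only the use of Euclidean distance comes from the text.

```python
import numpy as np


def classify_by_bar_code(sample_code, reference_codes):
    """Return the tissue whose reference bar code is closest in Euclidean distance."""
    names = list(reference_codes)
    refs = np.array([reference_codes[name] for name in names], dtype=float)
    distances = np.linalg.norm(refs - np.asarray(sample_code, dtype=float), axis=1)
    return names[int(np.argmin(distances))], dict(zip(names, distances))


# Hypothetical four-gene reference bar codes for two tissues.
references = {"liver": [1, 0, 1, 0], "brain": [0, 1, 1, 1]}
tissue, distances = classify_by_bar_code([1, 0, 1, 1], references)
print(tissue, distances)
```

For binary vectors the squared Euclidean distance equals the number of genes whose calls differ, so the nearest reference is simply the one with the fewest disagreements.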

Figure 2 | Demonstration of the lab effect and its removal by the bar-code algorithm. (a) The correlation between every pair of samples from study E-AFMX-5 is shown in different shades of blue (negative correlation in red). For each gene the across-sample mean value was subtracted from the expression. The different tissues are divided by black lines and labeled in the axes. Most of the tissues in this study are completely different types and therefore the lab effect must be driving the high correlation. This plot for multiple labs is available in Supplementary Figure 3. (b) The same plot as in a but using the bar-code method to define correlation.

[Figure 2 panels a and b: both axes are labeled with the tissue and cell-line names (amygdala through uterus); color scale runs from –1 to 1.]

We compared our approach to PAM using leave-one-out cross-validation. We included tissues for which detailed annotation was available and there were 3 or more samples (see Supplementary Table 3 online for results, which include various clinical data sets10–16). The bar code outperformed PAM in all comparisons except two, where it performed as well. Because cross-validation has a tendency to overestimate the performance of a classification algorithm, we assessed performance on six independent data sets not included in the cross-validation process (Table 1). Here only the bar code performed well, with similar accuracy to that of cross-validation.
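Leave-one-out cross-validation for a bar-code classifier can be sketched as follows. How the per-tissue reference bar codes are formed is not stated in this text, so the sketch assumes, purely for illustration, that each reference is the mean of the binary codes of the training samples for that tissue. The function name and toy data are hypothetical.

```python
import numpy as np


def loocv_accuracy(codes, labels):
    """Fraction of samples classified correctly when each one is held out in turn."""
    codes = np.asarray(codes, dtype=float)
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(labels)):
        keep = np.arange(len(labels)) != i
        train_codes, train_labels = codes[keep], labels[keep]
        tissues = np.unique(train_labels)
        # Assumed reference per tissue: mean bar code of the remaining samples.
        refs = np.array([train_codes[train_labels == t].mean(axis=0) for t in tissues])
        nearest = tissues[np.argmin(np.linalg.norm(refs - codes[i], axis=1))]
        correct += int(nearest == labels[i])
    return correct / len(labels)


# Toy data: three samples per tissue, matching the minimum used in the text.
codes = [[1, 0, 1, 0], [1, 0, 1, 1], [1, 0, 0, 0],
         [0, 1, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]]
labels = ["liver"] * 3 + ["brain"] * 3
print(loocv_accuracy(codes, labels))
```

Requiring at least three samples per tissue guarantees that two samples remain to build the reference when one is held out. An independent test set, as used for Table 1, avoids the optimism of this estimate because no sample from the same study contributes to the references.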

The fact that the bar code greatly outperformed PAM on the independent data sets is likely due to the lab or batch effect. Because studies usually target a particular tissue, a primary concern is that a strong lab effect will confound the ability to classify tissues from the ability to classify labs2. An example of the lab effect is shown in Supplementary Figure 3a online, where the correlations between samples from study E-AFMX-5 are high despite originating from a wide variety of tissues. The bar-code approach can remove many of these effects because subtle changes in intensity values are not strong enough to make an absent gene appear present, or vice versa. Notice that the bar code removes most of the correlations in the E-AFMX-5 study without removing the correlation between the brain tissues, both within the study and between studies (Fig. 2). Another example of how the lab effect can affect results is described in Supplementary Results online and illustrated in Supplementary Figure 4 online.
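The sample-by-sample correlations used to display the lab effect can be computed as in the sketch below, which follows the Figure 2 description: subtract each gene's across-sample mean, then correlate every pair of samples. The function name and the toy matrix are hypothetical. The same function can be applied to a matrix of bar codes instead of expression estimates, provided no sample column is constant after centering.

```python
import numpy as np


def sample_correlations(values):
    """values: genes x samples array; returns a samples x samples correlation matrix."""
    values = np.asarray(values, dtype=float)
    # Remove each gene's across-sample mean, as described for Figure 2a.
    centered = values - values.mean(axis=1, keepdims=True)
    # rowvar=False treats each column (sample) as a variable.
    return np.corrcoef(centered, rowvar=False)


# Toy matrix: four genes, three samples. Samples 0 and 1 differ by a constant shift.
expr = np.array([[5.0, 6.0, 9.0],
                 [7.0, 8.0, 4.0],
                 [6.0, 7.0, 10.0],
                 [9.0, 10.0, 5.0]])
print(np.round(sample_correlations(expr), 2))
```

In the toy matrix the first two samples correlate strongly because they share the same gene-wise pattern, which is the kind of structure a shared lab or batch can induce across unrelated tissues.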

To assess the ability of the bar-code algorithm to find undiscovered tissue subsets, we used data from three breast cancer studies that did not include normal breast tissue samples, but did include patient survival data14–16. We obtained the distance to all tissue bar codes, and 499 of 500 samples were classified as breast tumor (1 as bladder cancer). When we took out the breast tumor bar code, 37 of these samples were close to a variety of normal tissues and the other 463 samples to a variety of cancer tissues. We then formed good and bad prognosis bar codes using these 37 and 463 samples, respectively. Then we used this new bar code to reclassify the 500 samples. We iterated this procedure until the good and bad prognosis groups did not change. The final bar codes resulted in a powerful prognosis tool that outperformed the methods described in the original papers. The performance of our prognosis tool is summarized in Supplementary Results and illustrated in Supplementary Tables 4–6, and Supplementary Figure 5 online.
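The iterate-until-stable step described above resembles a two-group reassignment loop, which can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the group bar codes are taken to be the mean of the member samples' binary codes, distance is Euclidean, and the function name, group labels (0 and 1) and toy data are hypothetical.

```python
import numpy as np


def iterate_groups(codes, initial_groups, max_iter=100):
    """Reassign samples to the nearer of two group bar codes until nothing changes."""
    codes = np.asarray(codes, dtype=float)
    groups = np.asarray(initial_groups).copy()
    for _ in range(max_iter):
        if not ((groups == 0).any() and (groups == 1).any()):
            break  # one group is empty, so no reference bar code can be formed
        # Assumed group bar code: mean of the binary codes of the current members.
        refs = [codes[groups == g].mean(axis=0) for g in (0, 1)]
        dist = np.stack([np.linalg.norm(codes - r, axis=1) for r in refs], axis=1)
        new_groups = dist.argmin(axis=1)
        if np.array_equal(new_groups, groups):
            break  # assignments are stable
        groups = new_groups
    return groups


# Toy data: two clear clusters, with the third sample initially placed in the wrong group.
codes = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0],
         [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]]
print(iterate_groups(codes, [0, 0, 1, 1, 1, 1]))
```

The starting assignment matters in a loop of this kind. In the text it comes from whether each sample sits closer to normal or to cancer tissue bar codes once the breast tumor bar code is removed.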

We expect the bar-code approach to classification and discovery presented in this paper to be improved in various ways. First, the classification algorithm implemented on the bar code was based on a very simple detection method and distance calculation. Many aspects can be optimized for prediction purposes. Second, as microarray technology improves so will the bar-code performance. In particular, the emergence of better gene annotation and arrays that probe for individual exons are the most promising developments. Finally, we have only implemented the bar code for two widely used platforms: Affymetrix HGU133A human array and MOE430 mouse array. As soon as enough public data are available, however, the bar code will be defined for other platforms. In conclusion, we would like to acknowledge the efforts from the Microarray Gene Expression Data (MGED) Society to promote the sharing of microarray data. The work presented here would not have been possible without the existing public repositories. In particular, the availability of raw data was key, as the methods used to process raw data into gene-level measurements also contribute to study-to-study variability17,18. We hope the trend of sharing raw data continues, as we believe it to be necessary for microarray technology to fulfill its promise.

Note: Supplementary information is available on the Nature Methods website.

Published online at http://www.nature.com/naturemethods
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions

1. Irizarry, R.A., Gautier, L. & Cope, L.M. in The analysis of gene expression data: methods and software. (eds. Parmigiani, G., Garrett, E.S., Irizarry, R.A. & Zeger, S.I.) 102–119 (Springer-Verlag, New York, 2003).
2. Irizarry, R.A. et al. Nat. Methods 2, 345–350 (2005).
3. Kim, S. et al. Genomics 67, 201–209 (2000).
4. Pal, R., Datta, A., Fornace, A.J. Jr., Bittner, M.L. & Dougherty, E.R. Bioinformatics 21, 1542–1549 (2005).
5. Barrett, T. et al. Nucleic Acids Res. 33, D562–D566 (2005).
6. Parkinson, H. et al. Nucleic Acids Res. 33, D553–D555 (2005).
7. Carter, S.L., Eklund, A.C., Mecham, B.H., Kohane, I.S. & Szallasi, Z. BMC Bioinformatics 6, 107 (2005).
8. Kislinger, T. et al. Cell 125, 173–186 (2006).
9. Tibshirani, R., Hastie, T., Narasimhan, B. & Chu, G. Proc. Natl. Acad. Sci. USA 99, 6567–6572 (2002).
10. Blalock, E.M. et al. Proc. Natl. Acad. Sci. USA 101, 2173–2178 (2004).
11. Kimchi, E.T. et al. Cancer Res. 65, 3146–3154 (2005).
12. Dyrskjot, L. et al. Cancer Res. 64, 4040–4048 (2004).
13. Lenburg, M.E. et al. BMC Cancer 3, 31 (2003).
14. Miller, L.D. et al. Proc. Natl. Acad. Sci. USA 102, 13550–13555 (2005).
15. Pawitan, Y. et al. Breast Cancer Res. 7, R953–R964 (2005).
16. Sotiriou, C. et al. J. Natl. Cancer Inst. 98, 262–272 (2006).
17. Irizarry, R.A., Wu, Z. & Jaffee, H.A. Bioinformatics 22, 789–794 (2006).
18. Shi, L. et al. BMC Bioinformatics 6 (Suppl. 2), S12 (2005).

Table 1 | Percentage accuracy comparison on independent data sets

GEO identifier   Data type                      PAM (% correct)   Bar code (% correct)
GSE5388          Cortex                         100               100
GSE2395          Respiratory system epithelia   0                 100
GSE2665          Lymph node/tonsil              35                95
GSE1561          Breast tumor                   69                100
GSE2603          Breast tumor                   77                90
GSE6344          Kidney: normal versus cancer   100               100

PAM versus the bar code approach in six randomly selected data sets not included in the original database. The data described in Supplementary Table 1 were used to train the prediction algorithms. GEO, Gene Expression Omnibus.
