Preprocessing of oligonucleotide array data

C O R R E S P O N D E N C E

Health Inspection Service (Washington, DC, USA) before it can be returned to regular service. The entire production process must be thoroughly documented and is very tightly controlled.

We feel that the critical underlying focus should be on keeping the recombinant proteins out of the food supply regardless of what crop was used to make them. The crops should be grown under suitable containment conditions, rather than arbitrarily ruling out specific vehicles. Also, the risk associated with each protein should be reviewed on the basis of the properties of that specific protein, not on the basis of the production system used. The risk of potential contamination of the food supply is one associated with all genetically enhanced organisms, not just the food crops. Indeed, food supply contamination can occur with nongenetically modified or natural production systems as well!

Although plant-produced pharmaceuticals have raised concern in political debates, it is important that Nature Biotechnology reports science-based facts and arguments and does not give credence to the whims of special interest groups whose position is not science based. Our discussion here should be focused on a safe food supply, and on appropriate

intention of measuring only background noise and considering the difference PM-MM. Their default expression measure (MAS 5.0) is a robust average of log (PM-MM) values4. A version of the additive-background-multiplicative-error stochastic model2,5 has been demonstrated to describe the behavior of probe-level data well6. We refer to this model as the Wu-Irizarry (WI) model. This stochastic model suggests that MAS 5.0 is quite accurate, but that its accuracy is overwhelmed by its imprecision. These theoretical results have been demonstrated empirically7. RMA uses a global background adjustment based on a PM-only estimation procedure designed to avoid MAS 5.0’s exaggerated variance. Nevertheless, the WI model predicts that RMA’s adjustment does not completely remove nonspecific binding, therefore introducing some bias. However, the sacrifice in accuracy is well worth the large gains in precision, resulting in a superior expression measure7.
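The precision cost of subtracting a noisy background estimate can be illustrated with a small simulation. The sketch below is not MAS 5.0, RMA or the WI model as published: it assumes a toy additive-background, multiplicative-error form (PM = background + signal × error, MM = an independent background draw), and every distribution, parameter value and function name in it is a hypothetical choice made for illustration only.

```python
import numpy as np


def simulate_probes(signal, n=20000, seed=0):
    """Draw PM and MM intensities under a toy additive-background,
    multiplicative-error model (all parameters are assumptions)."""
    rng = np.random.default_rng(seed)
    bg_pm = rng.lognormal(mean=np.log(100.0), sigma=0.5, size=n)  # background on PM
    bg_mm = rng.lognormal(mean=np.log(100.0), sigma=0.5, size=n)  # background on MM
    err = rng.lognormal(mean=0.0, sigma=0.1, size=n)              # multiplicative error
    pm = bg_pm + signal * err
    mm = bg_mm
    return pm, mm


def log2_pm_minus_mm(pm, mm, floor=1.0):
    """Probe-wise subtraction; non-positive differences are floored (an assumption)."""
    return np.log2(np.maximum(pm - mm, floor))


def log2_global_adjust(pm, offset=50.0, floor=1.0):
    """Conservative constant subtraction that ignores MM (a stand-in, not RMA itself)."""
    return np.log2(np.maximum(pm - offset, floor))


def compare(low=50.0, high=100.0):
    """Observed mean log2 fold change (nominal value 1) and spread for each adjustment."""
    pm_lo, mm_lo = simulate_probes(low, seed=1)
    pm_hi, mm_hi = simulate_probes(high, seed=2)
    results = {}
    for name, lo, hi in [
        ("PM-MM", log2_pm_minus_mm(pm_lo, mm_lo), log2_pm_minus_mm(pm_hi, mm_hi)),
        ("global", log2_global_adjust(pm_lo), log2_global_adjust(pm_hi)),
    ]:
        results[name] = {
            "mean_log2_fc": float(hi.mean() - lo.mean()),
            "sd": float(lo.std()),
        }
    return results


if __name__ == "__main__":
    for name, stats in compare().items():
        print(name, stats)
```

Under these assumptions the probe-wise subtraction yields a much larger spread at low signal, whereas the constant subtraction is tighter but reports a fold change attenuated below the nominal value of 1 on the log2 scale, which is the trade-off described above.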

656 VOLUME 22 NUMBER 6 JUNE 2004 NATURE BIOTECHNOLOGY

To the editor: As outlined in a news item in your March issue (Nat. Biotechnol. 22, 256–267, 2004), the United States’ proposed appropriations of billions of dollars for bioterror preparedness in 2005 and, presumably, beyond, constitute a scientific and sociopolitical endeavor unprecedented in scope. The only other undertaking that has approached this magnitude as “a major federal R&D priority” in recent history was the Human Genome Project (HGP). In contrast to the present initiative, however, it was recognized immediately that the HGP had ethical, legal and social implications (ELSI) that needed to be the focus of concerted research efforts. With laudable foresight and responsibility, 3%–5% of the annual HGP budget was earmarked for ELSI research from the outset of the project. This

ELSI and bioterrorism countermeasures?

To the editor: A paper by Zhang et al.1 in the July 2003 issue of Nature Biotechnology presents PerfectMatch, a physical-model-based algorithm for modeling the binding interactions between probes on Affymetrix (Santa Clara, CA, USA) GeneChip oligonucleotide arrays and their target nucleic acids. Stochastic-model-based algorithms have been shown to improve the preprocessing of array data to take into account the presence of optical noise, nonspecific binding and probe-specific effects. Although Zhang et al. demonstrate the superior performance of PerfectMatch over two widely used algorithms, MAS 5.0 and dChip, they fail to consider a leading stochastic-model-based procedure, robust multi-array analysis (RMA), from our group2. Here, we provide evidence that (i) measurements obtained using RMA are as precise as and more accurate than those obtained using PerfectMatch and (ii) the accuracy of stochastic-model-based procedures can be improved by carefully incorporating physical model information.

GeneChip arrays are widely used for quantitative and highly parallel measurements of gene expression. Genes are represented on these arrays by pairs of 25-base oligonucleotide probes3 denoted probe sets. The first component of these pairs is referred to as a perfect match (PM) probe and is designed to be specific to transcripts from the intended gene. However, background noise and measurement error are unavoidable. Affymetrix’s background adjustment approach is to pair each PM with a mismatch (MM) probe designed by changing the middle (13th) base with the


safety and containment issues regarding the cost-effective production of pharmaceuticals. A. thaliana, flax or duckweed may be appropriate choices for production of certain plant-made pharmaceuticals, but in other cases it may be corn, potatoes, tomatoes, bananas, soybeans, rice or some other food crop that makes the most sense both scientifically and economically.

Gregory C Phillips

Arkansas State University, College of Agriculture, PO Box 1080, 119 S. Caraway Road, State University, Arkansas 72467-1080, USA. e-mail: [email protected]

begs a very important question: shouldn’t a portion of the bioterror preparedness budget be similarly allocated to ELSI research?

Shane K Green

Institute for Ethics, American Medical Association, 515 North State St., Chicago, Illinois 60610, USA. e-mail: [email protected]

©2004 Nature Publishing Group http://www.nature.com/naturebiotechnology


Because the MM probes do not perfectly predict nonspecific binding, PerfectMatch’s adjustment uses sequence information and physical models instead1. Yet, the estimation procedure that defines PerfectMatch produces a background adjustment that is even more conservative than RMA’s. The gains in precision reported in the paper are due to a large sacrifice in accuracy and not to improved prediction of nonspecific binding. Figure 1 and Table 1 demonstrate that RMA attains roughly the same gains in precision without adding nearly as much bias. Recently, GeneChip RMA (GCRMA), a new expression measure that combines the strengths of stochastic-model algorithms, such as RMA, and physical models8, has been proposed9 to improve the accuracy of RMA without much sacrifice in precision (for an R package implementation, see http://www.bioconductor.org/).

These differences in precision and accuracy can have dramatic consequences in practice. To demonstrate this, we consider the simple, yet common, application of estimating differential expression between two arrays. Successful fold-change analysis will detect all and only genes that are differentially expressed due to biological variation. We used a subset of the Affymetrix spike-in experiment data1,2 where we could expect only the spiked-in genes to be differentially expressed. To assess precision in this context, we computed log-fold changes for the non-spiked-in genes from all possible pair-wise comparisons. Figure 1a summarizes the empirical log-fold-change distributions in the form of color box-plots10. MAS 5.0 has the widest distribution and RMA and PerfectMatch have the tightest.
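The precision assessment described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes a genes-by-arrays matrix of log2 expression values for the non-spiked-in genes, takes each unordered pair of arrays once, and uses hypothetical function names. The two summaries are the inter-quartile range and the 99.9th percentile of the pooled log-fold changes; for an array representing 10,000 genes, 0.1% of the null log-fold changes, about 10 genes, are expected to exceed the latter.

```python
import itertools

import numpy as np


def pairwise_log_fold_changes(log_expr):
    """Pool log-fold changes over all unordered pairs of arrays.

    log_expr: 2-D array (genes x arrays) of log2 expression values for
    genes that are not expected to change (the non-spiked-in genes).
    """
    log_expr = np.asarray(log_expr, dtype=float)
    n_arrays = log_expr.shape[1]
    diffs = [
        log_expr[:, i] - log_expr[:, j]
        for i, j in itertools.combinations(range(n_arrays), 2)
    ]
    return np.concatenate(diffs)


def precision_summary(lfc):
    """Inter-quartile range and 99.9th percentile of null log-fold changes."""
    q25, q75, q999 = np.percentile(lfc, [25, 75, 99.9])
    return {"iqr": float(q75 - q25), "p99_9": float(q999)}


if __name__ == "__main__":
    # Synthetic stand-in data: 10,000 unchanged genes on 4 arrays.
    rng = np.random.default_rng(0)
    fake = rng.normal(loc=8.0, scale=0.2, size=(10000, 4))
    print(precision_summary(pairwise_log_fold_changes(fake)))
```

Smaller values of both summaries indicate a tighter null distribution, that is, a more precise expression measure.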

To assess accuracy, we plot average observed log concentrations of the spiked-in genes against each nominal concentration value (Fig. 1b). The observed log concentration should be linear in true log concentrations with a slope of 1. Figure 1b demonstrates that PerfectMatch and GCRMA are the least and most accurate measures, respectively. Although the curves in Figure 1b are not lines, we can divide the graph into three sections (low, medium and high expression) where the curves are approximately linear. The slopes of these lines represent the expected observed log-fold-change of a gene with a nominal fold-change of 2. If we order genes of interest according to their fold-change size, the rank of these slopes in the log fold-change distribution of Figure 1a represents the expected rank of a gene with a true fold-change of 2 in an array with only that gene truly differentially expressed. Optimal measures will balance accuracy and precision to produce low ranks for all sections. These slopes and expected ranks (Table 1) demonstrate that RMA and GCRMA produce considerably better results (in addition to the global results presented, we obtained similar findings for individual probe sets; furthermore, all the assessments were performed on a new (HGU133) spike-in experiment; see Supplementary Figs. 1–6). GCRMA performs better for lower expressed genes and RMA for high ones. Because in most hybridizations few genes fall in the high-expressed category we expect GCRMA to perform slightly better.
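The slope-and-rank calculation can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the section boundaries, the least-squares fit, the use of absolute log-fold changes when ranking, and both function names are assumptions.

```python
import numpy as np


def section_slopes(nominal_log2, observed_log2, breaks):
    """Least-squares slope of observed versus nominal log2 concentration
    within each section delimited by `breaks` (hypothetical boundaries)."""
    nominal = np.asarray(nominal_log2, dtype=float)
    observed = np.asarray(observed_log2, dtype=float)
    edges = [-np.inf, *breaks, np.inf]
    slopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (nominal >= lo) & (nominal < hi)
        slope, _intercept = np.polyfit(nominal[mask], observed[mask], 1)
        slopes.append(float(slope))
    return slopes


def expected_rank(slope, null_lfc):
    """Rank of a gene whose observed log2 fold change equals `slope` among
    null log-fold changes ordered by absolute size (1 = largest)."""
    null_lfc = np.asarray(null_lfc, dtype=float)
    return 1 + int(np.sum(np.abs(null_lfc) > slope))


if __name__ == "__main__":
    # Synthetic example: a curve that flattens at both ends.
    nominal = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    observed = np.array([0.0, 0.5, 1.0, 2.0, 3.0, 3.25])
    print(section_slopes(nominal, observed, breaks=[2.0, 4.0]))
    print(expected_rank(0.5, [0.1, -0.7, 0.6, 0.2]))
```

A slope near 1 with a rank near 1 is the ideal: the measure reports the nominal doubling at close to full size, and almost no unchanged gene shows a larger apparent change.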

Note: Supplementary information is available on the Nature Biotechnology website.

1. Zhang, L., Miles, M.F. & Aldape, F.D. Nat. Biotechnol. 21, 818–821 (2003).

2. Irizarry, R. et al. Biostatistics 4, 249–264 (2003).


Table 1 Performance of five different algorithms (MAS 5.0, dChip10, PerfectMatch, RMA and GCRMA) in modeling GeneChip data

               Precision measured                 Slopes (expected ranks)3
Algorithm      99.9%1    Inter-quartile range2    Low expression    Medium expression    High expression
MAS 5.0        4.46      0.77                     0.69 (4,295)      0.81 (3,716)         0.45 (6,007)
dChip          14.83     0.29                     1.40 (415)        0.86 (863)           0.35 (3,512)
PerfectMatch   0.57      0.08                     0.21 (759)        0.47 (28)            0.32 (134)
RMA            0.57      0.12                     0.35 (316)        0.76 (3)             0.47 (52)
GCRMA          0.74      0.15                     0.61 (43)         1.04 (3)             0.47 (207)

Log-fold-changes were computed for the non-spiked-in genes from all possible pair-wise comparisons. Boldface numbers denote the best performing algorithm in each of the assessments.
1 The 99.9th percentile of the distribution of log-fold changes (which is particularly meaningful because for arrays representing 10,000 genes, it is a value that 10 false positives are expected to surpass).
2 The inter-quartile range of log-fold changes.
3 The curves presented in Figure 1b were divided into three sections (low expression, medium expression and high expression) where the curves are approximately linear. The slopes of these lines quantify accuracy. The ranks of these slopes within the log fold-change distribution are shown in parentheses.

[Figure 1, panel a: box-plots for MAS, RMA, PerfectMatch and GCRMA, with a quantile legend at 0.01, 0.1, 0.25, 0.75, 0.9 and 0.99. Panel b: adjusted observed log expression plotted against log nominal concentration, with curves for MAS, RMA, PerfectMatch and GCRMA.]

Figure 1 Graphical comparison of the capacity of four algorithms (MAS 5.0, PerfectMatch, RMA and GCRMA) to measure expression. (a) Box-plots showing the distribution of observed fold changes for non-spiked-in genes. The different colors represent the different quantiles. The relationship of color and quantile is demonstrated in the first box from the left. The six solid horizontal lines represent the smallest, 10th smallest, 50th smallest, 50th largest, 10th largest and largest values, respectively. (b) Average observed log (base 2) intensity plotted against nominal log (base 2) concentration for each spiked-in gene for arrays in Affymetrix spike-in experiment. The observed concentrations are adjusted so that all lines have the same intercept at nominal log (base 2) concentration of 4. The solid black line represents the ideal slope 1 line. The log concentrations are divided into three sections denoted by the solid vertical lines. (The results from dChip10 were similar to those of MAS 5.0 and were omitted for clarity.)

