Preprocessing of oligonucleotide array data

CORRESPONDENCE

656 VOLUME 22 NUMBER 6 JUNE 2004 NATURE BIOTECHNOLOGY

Health Inspection Service (Washington, DC, USA) before it can be returned to regular service. The entire production process must be thoroughly documented and is very tightly controlled.

We feel that the critical underlying focus should be on keeping the recombinant proteins out of the food supply regardless of what crop was used to make them. The crops should be grown under suitable containment conditions, rather than arbitrarily ruling out specific vehicles. Also, the risk associated with each protein should be reviewed on the basis of the properties of that specific protein, not on the basis of the production system used. The risk of potential contamination of the food supply is one associated with all genetically enhanced organisms, not just the food crops. Indeed, food supply contamination can occur with nongenetically modified or natural production systems as well!

Although plant-produced pharmaceuticals have raised concern in political debates, it is important that Nature Biotechnology reports science-based facts and arguments and does not give credence to the whims of special interest groups whose position is not science based. Our discussion here should be focused on a safe food supply, and on appropriate safety and containment issues regarding the cost-effective production of pharmaceuticals. A. thaliana, flax or duckweed may be appropriate choices for production of certain plant-made pharmaceuticals, but in other cases it may be corn, potatoes, tomatoes, bananas, soybeans, rice or some other food crop that makes the most sense both scientifically and economically.

Gregory C Phillips

Arkansas State University, College of Agriculture, PO Box 1080, 119 S. Caraway Road, State University, Arkansas 72467-1080, USA. e-mail: [email protected]

ELSI and bioterrorism countermeasures?

To the editor:
As outlined in a news item in your March issue (Nat. Biotechnol. 22, 256–267, 2004), the United States’ proposed appropriations of billions of dollars for bioterror preparedness in 2005 and, presumably, beyond, constitute a scientific and sociopolitical endeavor unprecedented in scope. The only other undertaking that has approached this magnitude as “a major federal R&D priority” in recent history was the Human Genome Project (HGP). In contrast to the present initiative, however, it was recognized immediately that the HGP had ethical, legal and social implications (ELSI) that needed to be the focus of concerted research efforts. With laudable foresight and responsibility, 3%–5% of the annual HGP budget was earmarked for ELSI research from the outset of the project. This begs a very important question: shouldn’t a portion of the bioterror preparedness budget be similarly allocated to ELSI research?

Shane K Green

Institute for Ethics, American Medical Association, 515 North State St., Chicago, Illinois 60610, USA. e-mail: [email protected]

Preprocessing of oligonucleotide array data

To the editor:
A paper by Zhang et al.1 in the July 2003 issue of Nature Biotechnology presents PerfectMatch, a physical-model-based algorithm for modeling the binding interactions between probes on Affymetrix (Santa Clara, CA, USA) GeneChip oligonucleotide arrays and their target nucleic acids. Stochastic-model-based algorithms have been shown to improve the preprocessing of array data to take into account the presence of optical noise, nonspecific binding and probe-specific effects. Although Zhang et al. demonstrate the superior performance of PerfectMatch over two widely used algorithms, MAS 5.0 and dChip, they fail to consider a leading stochastic-model-based procedure, robust multi-array analysis (RMA), from our group2. Here, we provide evidence that (i) measurements obtained using RMA are as precise as and more accurate than those obtained using PerfectMatch and (ii) the accuracy of stochastic-model-based procedures can be improved by carefully incorporating physical model information.

GeneChip arrays are widely used for quantitative and highly parallel measurements of gene expression. Genes are represented on these arrays by pairs of 25-base oligonucleotide probes3 denoted probe sets. The first component of these pairs is referred to as a perfect match (PM) probe and is designed to be specific to transcripts from the intended gene. However, background noise and measurement error are unavoidable. Affymetrix’s background adjustment approach is to pair each PM with a mismatch (MM) probe designed by changing the middle (13th) base with the intention of measuring only background noise and considering the difference PM-MM. Their default expression measure (MAS 5.0) is a robust average of log (PM-MM) values4. A version of the additive-background-multiplicative-error stochastic model2,5 has been demonstrated to describe the behavior of probe-level data well6. We refer to this model as the Wu-Irizarry (WI) model. This stochastic model suggests that MAS 5.0 is quite accurate, but that its accuracy is overwhelmed by its imprecision. These theoretical results have been demonstrated empirically7. RMA uses a global background adjustment based on a PM-only estimation procedure designed to avoid MAS 5.0’s exaggerated variance. Nevertheless, the WI model predicts that RMA’s adjustment does not completely remove nonspecific binding, therefore introducing some bias. However, the sacrifice in accuracy is well worth the large gains in precision, resulting in a superior expression measure7.
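The trade-off between accuracy and precision described in the preprocessing letter can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the WI model, MAS 5.0 or RMA themselves: it assumes an additive log-normal background, multiplicative log-normal error on the signal, an "MM-style" correction that subtracts an independent per-probe background draw, and a "global" correction that subtracts a single constant chosen to be smaller than the mean background. The function name and every numeric setting (background level 100, constant 50, error sizes) are hypothetical.

```python
import math
import random
import statistics


def simulate_fold_changes(signal=200.0, n=20000, seed=0):
    """Return (mm_style, global_style) lists of estimated log2 fold changes.

    The true fold change is 2 (log2 fold change of 1): one condition has
    specific signal `signal`, the other has 2 * signal.
    """
    rng = random.Random(seed)

    def background():
        # Assumed additive background: log-normal around 100 intensity units.
        return rng.lognormvariate(math.log(100.0), 0.3)

    def pm(true_signal):
        # Assumed PM intensity: background plus signal with multiplicative error.
        return background() + true_signal * math.exp(rng.gauss(0.0, 0.1))

    def log2_floor(x):
        # Floor at 1 so that negative differences do not break the logarithm.
        return math.log2(max(x, 1.0))

    mm_style, global_style = [], []
    for _ in range(n):
        pm_low, pm_high = pm(signal), pm(2.0 * signal)
        # MM-style: subtract a noisy, probe-specific background estimate.
        mm_low, mm_high = background(), background()
        mm_style.append(log2_floor(pm_high - mm_high) - log2_floor(pm_low - mm_low))
        # Global: subtract one constant that removes only part of the background.
        global_style.append(log2_floor(pm_high - 50.0) - log2_floor(pm_low - 50.0))
    return mm_style, global_style


if __name__ == "__main__":
    mm_fc, gl_fc = simulate_fold_changes()
    for name, fc in (("MM-style", mm_fc), ("global", gl_fc)):
        print(name, round(statistics.mean(fc), 3), round(statistics.stdev(fc), 3))
```

Under these assumed settings the MM-style estimate is centered closer to the true log2 fold change of 1 but is more spread out, whereas the global estimate is attenuated toward zero but tighter, which is the qualitative pattern the letter attributes to MAS 5.0 and RMA.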

©2004 Nature Publishing Group http://www.nature.com/naturebiotechnology

Because the MM probes do not perfectly predict nonspecific binding, PerfectMatch’s adjustment uses sequence information and physical models instead1. Yet, the estimation procedure that defines PerfectMatch produces a background adjustment that is even more conservative than RMA’s. The gains in precision reported in the paper are due to a large sacrifice in accuracy and not to improved prediction of nonspecific binding. Figure 1 and Table 1 demonstrate that RMA attains roughly the same gains in precision without adding nearly as much bias. Recently, GeneChip RMA (GCRMA), a new expression measure that combines the strengths of stochastic-model algorithms, such as RMA, and physical models8, has been proposed9 to improve the accuracy of RMA without much sacrifice in precision (for an R package implementation, see http://www.bioconductor.org/).

These differences in precision and accuracy can have dramatic consequences in practice. To demonstrate this, we consider the simple, yet common, application of estimating differential expression between two arrays. Successful fold-change analysis will detect all and only genes that are differentially expressed due to biological variation. We used a subset of the Affymetrix spike-in experiment data1,2 where we could expect only the spiked-in genes to be differentially expressed. To assess precision in this context, we computed log-fold changes for the non-spiked-in genes from all possible pair-wise comparisons. Figure 1a summarizes the empirical log-fold-change distributions in the form of color box-plots10. MAS 5.0 has the widest distribution and RMA and PerfectMatch have the tightest.
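The precision assessment just described can be sketched in a few lines. This is an illustrative outline only, assuming the expression measures are already on the log2 scale and stored per array as equal-length lists covering the non-spiked-in genes; the function names, the data layout and the interpolating percentile rule are assumptions, not the authors' code.

```python
import math
from itertools import combinations


def pairwise_log_fold_changes(log_expr_by_array):
    """Pool log-fold changes of every gene over all unordered array pairs.

    log_expr_by_array maps an array name to a list of log2 expression
    values, one per non-spiked-in gene, in the same gene order.
    """
    pooled = []
    for (_, a), (_, b) in combinations(log_expr_by_array.items(), 2):
        if len(a) != len(b):
            raise ValueError("arrays must cover the same genes")
        pooled.extend(x - y for x, y in zip(a, b))
    return pooled


def percentile(values, p):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    pos = (p / 100.0) * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def precision_summary(log_fold_changes):
    """Return the 99.9th percentile and the inter-quartile range."""
    iqr = percentile(log_fold_changes, 75) - percentile(log_fold_changes, 25)
    return percentile(log_fold_changes, 99.9), iqr
```

A tighter pooled distribution (smaller 99.9th percentile and inter-quartile range) means fewer non-spiked-in genes show large apparent fold changes by chance.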

To assess accuracy, we plot average observed log concentrations of the spiked-in genes against each nominal concentration value (Fig. 1b). The observed log concentration should be linear in true log concentrations with a slope of 1. Figure 1b demonstrates that PerfectMatch and GCRMA are the least and most accurate measures, respectively. Although the curves in Figure 1b are not lines, we can divide the graph into three sections (low, medium and high expression) where the curves are approximately linear. The slopes of these lines represent the expected observed log-fold-change of a gene with a nominal fold-change of 2. If we order genes of interest according to their fold-change size, the rank of these slopes in the log fold-change distribution of Figure 1a represents the expected rank of a gene with a true fold-change of 2 in an array with only that gene truly differentially expressed. Optimal measures will balance accuracy and precision to produce low ranks for all sections. These slopes and expected ranks (Table 1) demonstrate that RMA and GCRMA produce considerably better results (in addition to the global results presented, we obtained similar findings for individual probe sets; furthermore, all the assessments were performed on a new (HGU133) spike-in experiment; see Supplementary Figs. 1–6). GCRMA performs better for genes with low expression and RMA for genes with high expression. Because in most hybridizations few genes fall in the high-expressed category, we expect GCRMA to perform slightly better.
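The slope and expected-rank calculation can be sketched as follows. A nominal fold-change of 2 is one unit on the log2 scale, so the fitted slope within a section is the expected observed log2 fold change there. This is a minimal sketch under assumptions: the section boundaries are hypothetical parameters (the text does not state them), ordinary least squares stands in for whatever fit was actually used, and the rank is taken by absolute fold-change size within the supplied null distribution.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def section_slopes(nominal_log_conc, observed_log_expr, sections):
    """Fit one slope per (low, high) section of nominal log concentration.

    Section bounds are inclusive, so a boundary point is used by both
    neighbouring sections.
    """
    slopes = []
    for low, high in sections:
        pts = [(x, y) for x, y in zip(nominal_log_conc, observed_log_expr)
               if low <= x <= high]
        slopes.append(ols_slope([p[0] for p in pts], [p[1] for p in pts]))
    return slopes


def expected_rank(slope, null_log_fold_changes):
    """Rank of a gene whose observed log2 fold change equals `slope`.

    Counts how many null (non-spiked-in) log-fold changes are larger in
    absolute value, then adds one for the gene itself.
    """
    return 1 + sum(1 for v in null_log_fold_changes if abs(v) > slope)
```

For example, `section_slopes(nominal, observed, [(0, 4), (4, 8), (8, 12)])` would return three slopes for hypothetical low, medium and high sections, and each slope can then be passed to `expected_rank` together with the pooled log-fold changes of the non-spiked-in genes.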

Note: Supplementary information is available on the Nature Biotechnology website.

1. Zhang, L., Miles, M.F. & Aldape, F.D. Nat. Biotechnol. 21, 818–821 (2003).

2. Irizarry, R. et al. Biostatistics 4, 249–264 (2003).

Table 1 Performance of five different algorithms (MAS 5.0, dChip10, PerfectMatch, RMA and GCRMA) in modeling GeneChip data

               Precision measured                  Slopes (expected ranks) (3)
Algorithm      99.9% (1)   Inter-quartile range (2)   Low expression   Medium expression   High expression
MAS 5.0         4.46       0.77                       0.69 (4,295)     0.81 (3,716)        0.45 (6,007)
dChip          14.83       0.29                       1.40 (415)       0.86 (863)          0.35 (3,512)
PerfectMatch    0.57       0.08                       0.21 (759)       0.47 (28)           0.32 (134)
RMA             0.57       0.12                       0.35 (316)       0.76 (3)            0.47 (52)
GCRMA           0.74       0.15                       0.61 (43)        1.04 (3)            0.47 (207)

Log-fold-changes were computed for the non-spiked-in genes from all possible pair-wise comparisons. Boldface numbers denote the best performing algorithm in each of the assessments.
(1) The 99.9th percentile of the distribution of log-fold changes (which is particularly meaningful because for arrays representing 10,000 genes, it is a value that 10 false positives are expected to surpass).
(2) The inter-quartile range of log-fold changes.
(3) The curves presented in Figure 1b were divided into three sections (low expression, medium expression and high expression) where the curves are approximately linear. The slopes of these lines quantify accuracy. The ranks of these slopes within the log fold-change distribution are shown in parentheses.

[Figure 1. Panel a: box-plots for MAS, RMA, PerfectMatch and GCRMA, with a quantile color key (0.01, 0.1, 0.25, 0.75, 0.9, 0.99). Panel b: adjusted observed log expression plotted against log nominal concentration for MAS, RMA, PerfectMatch and GCRMA.]

Figure 1 Graphical comparison of the capacity of four algorithms (MAS 5.0, PerfectMatch, RMA and GCRMA) to measure expression. (a) Box-plots showing the distribution of observed fold changes for non-spiked-in genes. The different colors represent the different quantiles. The relationship of color and quantile is demonstrated in the first box from the left. The six solid horizontal lines represent the smallest, 10th smallest, 50th smallest, 50th largest, 10th largest and largest values, respectively. (b) Average observed log (base 2) intensity plotted against nominal log (base 2) concentration for each spiked-in gene for arrays in the Affymetrix spike-in experiment. The observed concentrations are adjusted so that all lines have the same intercept at a nominal log (base 2) concentration of 4. The solid black line represents the ideal slope 1 line. The log concentrations are divided into three sections denoted by the solid vertical lines. (The results from dChip10 were similar to those of MAS 5.0 and were omitted for clarity.)
