
Winning the Accuracy Game

Three statistical strategies—replicating, blocking and modeling—can help scientists improve accuracy and accelerate progress

Hugh G. Gauch, Jr.

Thomas Edison famously stated that genius is "one percent inspiration and ninety-nine percent perspiration." In crop science, as in many other fields of research, investigators find much truth in this proverb. The discovery of an improved variety of corn, wheat or soybeans is very much a numbers game. The standard varieties already incorporate many genetic refinements from monumental breeding efforts in the past. Improvements come in small increments from testing large numbers of experimental genotypes. Other sciences have analogous challenges: In pharmaceutical research, for example, many compounds must be screened to find one that might make a successful medicine.

However, measurement errors or chance variations can cause an inferior plant to look better than a superior one. In a large field of contenders, the superior breed can get lost in the crowd. Scientists are aware of this problem, of course, but they routinely underestimate its severity.

Several years ago, I analyzed a trial of seven varieties of soybeans. The variety that appeared to be best was 14 percent better than the average of the other six, and 3 percent better than its

Figure 1. How can one identify the best cancer drug, safest automotive design or highest-yielding crop variety? Three statistical strategies can increase success in selecting the best treatment or entry: replicating, blocking and the oft-neglected strategy of modeling. The high-quality turfgrass needed for home lawns and elite golf courses such as this Donald Ross-designed course at the Grove Park Inn in Asheville, North Carolina, has emerged from scientific screening of hundreds of experimental and commercial varieties.

www.americanscientist.org 2006 March-April 133


Figure 2. Sir Ronald A. Fisher, shown calculating at left, made Rothamsted Experimental Station in England the proving ground for major statistical techniques used in science today. He established the importance of randomization and of replication—doing an experiment over and over again—to increase accuracy. The table and graph show how the probability of replication successfully helping accuracy, despite noisy data, grows with the number of replications. Although the first few replications pay off handsomely, the rewards fall off rapidly. (Photograph by A. Barrington-Brown, republished by permission of the R. A. Fisher Memorial Trust.)

    replications    confidence of success (percent)
          2                    60.8
          3                    66.7
          4                    70.5
          5                    73.2
         10                    80.5
         20                    86.0
         40                    90.0
        100                    93.7
        162                    95.0
        500                    97.2

closest rival. Suppose these numbers accurately reflected its superiority, and the experiment were replicated. Would the same soybean necessarily come out on top? Surprisingly, simulations showed only a 49 percent probability that it would; the fourth-best or worse entry would win 10 percent of the time.
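The surprise is easy to reproduce with a short simulation. The yields and error level below are assumed purely for illustration (the trial's actual data are not reproduced here); the qualitative result, that the observed winner often fails to win a re-run, does not depend on the exact numbers chosen.

```python
import numpy as np

# Hypothetical re-creation of the soybean selection problem. All numbers
# below are assumed for illustration, not the article's actual trial data.
rng = np.random.default_rng(42)
true_yields = np.array([106.0, 103.0, 91.0, 91.0, 91.0, 91.0, 91.0])
standard_error = 10.0   # assumed noise on each variety's observed mean
trials = 20_000

observed = true_yields + rng.normal(0.0, standard_error, (trials, 7))
winners = observed.argmax(axis=1)
p_best_wins = np.mean(winners == 0)   # how often the truly best variety wins
print(f"truly best variety wins {p_best_wins:.0%} of simulated re-runs")
```

When the noise is comparable to the margins of superiority, the truly best entry loses a substantial share of re-runs, the same qualitative surprise as the 49 percent figure in the text.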

The odds in favor of selecting the best breed (or pharmaceutical compound, or product-safety modification) improve, of course, if the experiment is replicated more times. Replication decreases the effect of chance variation, thereby improving accuracy, efficiency and repeatability. But in the numbers game, this way of achieving accuracy comes at a severe cost. The more times breeders have to run the same experiment, the fewer alternative breeds they can test.

Fortunately, more replication—or more perspiration, to recall Edison's dictum—is not the only way to improve accuracy. A small investment on the "inspiration" side can make a very large difference. Two other strategies, called blocking and modeling, can provide at least one replication's worth of accuracy (and often more) at almost no cost. Blocking is a method of experimental design that reduces the effects of chance errors. It has become routine in the development of better medicines, safer cars, stronger steels and a host of other applications.

Modeling to gain accuracy is much less familiar to practicing scientists, even though it is frequently applicable and usually improves experimental accuracy more than blocking does. The idea behind it dates back, in some sense, to the medieval master of parsimony, William of Ockham. Scientific data always contain a mixture of signal and noise; the scientist's job is to discern the signal. It almost always shows up as patterns that are inherently simpler than the noise. Noise is idiosyncratic and complex; the reasons why a particular corn plant produced more grain than the one next to it are often unknowable. But signal is simpler; a single environmental difference may cause dozens of breeds to respond similarly. Modeling is a way of amplifying the signal by placing greater weight on simple patterns in the data.

Hugh G. Gauch, Jr., is a senior research specialist in crop and soil sciences at Cornell University. He received a B.S. in botany from the University of Maryland in 1964 and an M.S. in plant genetics from Cornell in 1966. His research specialty has been statistical analysis of ecological and agricultural data. His most recent book is Scientific Method in Practice, published by Cambridge University Press in 2002. He is a fourth-generation member of Sigma Xi (Cornell University, alpha chapter, 1980), having been preceded by his great-grandfather, Charles Wesley Rolfe (a founding member at the University of Illinois, 1903), maternal grandmother, Susan Farley Rolfe (later Mrs. Horace Graham Butler; University of Illinois, 1909), and father, Hugh Gilbert Gauch (Kansas State University, 1937). Address: Crop and Soil Sciences, Cornell University, 519 Bradfield Hall, Ithaca, NY 14853. Internet: [email protected]

Different sciences and different experiments vary widely in the amount of accuracy they can attain. In physics, the gyromagnetic ratio of the electron has been measured to 11 significant digits. Many quantities in science and industry are readily measured to three to six significant digits. The data I work with in breeding trials carry only one significant digit. Nevertheless, every science shares the need for greater accuracy. Gains in accuracy translate to safer products, more effective medicines and more food on the world's tables. Accuracy matters.

The Limitations of Replication

In 1919, a 29-year-old statistician named Ronald A. Fisher started a new job at Rothamsted Experimental Station in Harpenden, England. It was the premier agricultural research site in the country and would become, thanks to Fisher's efforts, the proving ground for many of the statistical techniques that scientists take for granted today. Fisher was hired to make some sense out of 76 years of experimental records, which he later called a "muck heap."

Why were the data at Rothamsted such a mess? In the 19th century, scientists had little conception of the importance of replication. Frequently the productivity of a given breed in a given year would be represented by a single measurement. With only one

134 American Scientist, Volume 94


observation, as scientists now know, it would have been impossible to estimate the amount of error, and therefore impossible to make any meaningful comparisons between measurements. It was hard to know which results to take seriously.

Inaccuracy or error is quantified by two familiar descriptive statistics. The standard deviation is the square root of the mean square error of all the individual observations. (Error is the observed value minus the true value. Since true values are not known, averages over replications are used in error calculations.) The standard error is the root mean square error for an average over N replicates, which equals the standard deviation divided by the square root of N.

Replication is one of the finest ideas in the history of science, but it faces a severe law of diminishing returns. Halving the standard error requires a fourfold increase from one to four replications. The next several halvings require 16, 64, 256 and 1,024 replications, which rapidly become prohibitively expensive. Scientists are familiar with this square-root dependency and the diminishing returns that follow as a consequence.
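The square-root law is easy to check numerically; the measurement scale below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sd = 4.0      # standard deviation of a single measurement (arbitrary units)
results = {}

for n in (1, 4, 16, 64):
    # empirical standard error of a mean of n replicates, from 100,000 samples
    results[n] = rng.normal(0.0, sd, (100_000, n)).mean(axis=1).std()
    print(f"n = {n:2d}  empirical SE = {results[n]:.3f}  "
          f"theory sd/sqrt(n) = {sd / np.sqrt(n):.3f}")
```

Each halving of the standard error (4.0 to 2.0 to 1.0 to 0.5) costs a fourfold increase in replicates.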

A second shortcoming of replication is far less well known. It becomes apparent when one compares the success rate of replicated and unreplicated measurements in estimating true values. Obviously, scientists prefer an average of two replicates to a single unreplicated observation because the former is likely to be more accurate. But that does not mean it will always be more accurate. Just by chance, the first replicate may be quite close to the true value, while the second replicate is far from it. In that case, the average of the two replicates is less accurate than the unreplicated result.

The reader might like to ponder the following three questions before going on. How often is the average of two replicates more accurate than a single measurement? How about the average of five replicates? And how many replicates would be needed to achieve 90 or 95 percent confidence that the average is more accurate than the unreplicated measurement? Curiously and regrettably, few scientists know the answers to these practical questions.

Here are the answers, based on the ordinary assumption that measurement errors are distributed according to a "bell-shaped curve." Two replicates are more accurate than one 60.8 percent of the time. For five replicates, the success rate climbs to 73.2 percent—but this means that a single observation is still more accurate 26.8 percent of the time. To increase replication's success rate to 90 percent, most scientists I have spoken with guess that three to eight replicates would be sufficient. In fact, the actual number is 40. To achieve 95 percent success, a daunting 162 replicates are required, again far beyond what most scientists would expect.

Of course, the lesson here is not that scientists should always replicate their experiments 40 or 162 times. Instead, they should develop realistic expectations about what replication can and cannot accomplish.

A related problem, which I alluded to in the introduction, emerges when scientists face a selection task. In medical research, it is common to perform an experiment in order to select the best treatment. Again, it is obvious that greater accuracy will improve the success rate of the best-treatment selection. But when this simple insight is informed by concrete numbers, most scientists are surprised by the difficulty of selection tasks, and hence the importance of using all the available strategies to gain accuracy.

Consider the simplest selection scenario. An experiment has T inferior treatments and one superior treatment. All of the inferior treatments have the same average effect. What is the probability that the experiment will correctly identify the superior treatment, despite noisy data? A field called order statistics answers such questions.

For instance, for T=10, the best of the observed values for the inferior treatments is likely to be 1.54 standard errors above the mean. If the superior treatment is only one standard error better than its rivals, it does not have a good chance of winning, merely 32.4

Figure 3. Experimental data have both signal and noise components, the former attributable to imposed treatments and the latter arising from uncontrolled factors. In a simulated experiment, ten treatments, A through J, have equal true effectiveness. Entry K meanwhile is superior by one unit (one standard error). Random noise is added to each treatment following a standard normal distribution with a standard deviation of one unit. Entry K appears to rank third, when it is actually best—a typical outcome, since the simulation shows that on average 2.4 inferior treatments surpass the best treatment. Noise hinders the search for the best treatment.




percent. Yet in many actual experiments the odds are even worse. Realistic tasks, including many in desperately important medical research, often involve hundreds of competitors and small margins of superiority.

Suppose now that the number of replications were increased by a factor of four. This would halve the standard error of the estimated effects. The difference between the inferior and superior treatments would increase to two standard errors. (The difference itself has not changed, but the units it is measured in have gotten smaller.) Now the probability that the experiment will identify the superior treatment more than doubles, to 65.8 percent.
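A simulation of this selection scenario, the same setup as in Figure 3, reproduces all three numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
T, trials = 10, 200_000   # ten equal inferior treatments, many simulated trials

def selection_outcome(margin):
    """Simulate a superior treatment whose true effect is `margin` standard
    errors above T equal rivals; return its win probability and the average
    number of inferior treatments that appear to beat it."""
    inferior = rng.normal(0.0, 1.0, (trials, T))
    superior = rng.normal(margin, 1.0, trials)
    beaten_by = (inferior > superior[:, None]).sum(axis=1)
    return np.mean(beaten_by == 0), beaten_by.mean()

p1, avg_beaten = selection_outcome(1.0)   # one-standard-error margin
p2, _ = selection_outcome(2.0)            # after a fourfold replication boost
print(f"margin 1 SE: wins {p1:.1%}, surpassed by {avg_beaten:.1f} rivals on average")
print(f"margin 2 SE: wins {p2:.1%}")
```

The first run yields roughly a 32.4 percent win rate with about 2.4 rivals appearing to surpass the best entry; halving the standard error raises the win rate to about 65.8 percent.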

Given the above considerations, scientists clearly need other strategies for battling with measurement error, besides the expensive brute-force method of increasing the number of replications. However, the concept of replication does provide a useful common currency for quantifying and comparing the benefits of other error-control strategies. The error reductions or accuracy gains achieved by any method can be expressed in terms of the number of additional replications that would be required to achieve the same improvement.

Designing for Accuracy

The second strategy of blocking was also developed from crop research at Rothamsted Experimental Station, by Fisher and his protégé Frank Yates in the 1930s. It requires scientists to carefully think out the design of the experiment before beginning.

A typical experiment has one or more deliberately controlled factors of interest, such as diets for chickens or chemicals for reactions, which can be given the generic label of "treatments." In agricultural yield trials, each combination of genetic and environmental variables is considered to be a treatment. Thus, an experiment may test 30 genotypes (varieties differing in one or more genes) in 20 different environments, for a total of 600 treatments.

Figure 4. Blocking builds accuracy into an experiment. The simplest method, the completely randomized (CR) design, applies treatments at random to experimental units. At top left, 24 hypothetical plots of turfgrass are randomly assigned six treatments, A through F, for a total of four replications. Accuracy can be increased by the randomized complete block (RCB) design, subdividing the experimental units into as many blocks as there are replications. Statistical efficiency increases because the smaller blocks are more uniform than the whole field. The RCB trial shown at top right, with only three replications arranged from left to right, typically provides about the same level of accuracy as the four replications of the CR trial shown at top left. Shown at left is a trial for perennial ryegrass (background) and Kentucky bluegrass (foreground) from the National Turfgrass Evaluation Program, which uses RCB designs with three blocks. (Photograph courtesy of Scott Ebdon, University of Massachusetts, Amherst.)

The experimental design specifies in advance how the treatments will be allocated to experimental units (plants, chickens, people). It usually incorporates both randomization and replication to minimize bias and increase accuracy. Although intended to be the same, units given the same treatment always vary because of uncontrolled factors. In the realm of the life sciences the units are never completely identical, and the application of treatments is never completely uniform. The purpose of an experimental design is to minimize the consequences of uncontrolled variation.

The simplest design, called the completely randomized (CR) design, simply



assigns treatments to experimental units at random. The randomized complete block (RCB) design subdivides the experimental units into as many blocks as there are replications. Within each block, each treatment is allocated to one unit at random. The blocks are chosen to minimize uncontrolled within-block variations. For instance, in an agricultural trial, plots that are near to one another are likely to be more similar than plots separated by a greater distance. Thus the blocks are simply compact parcels that are smaller and more uniform than the whole field.

Extensive experience with RCB designs shows that one can typically achieve a "statistical efficiency" of 1.3, which means that the accuracy of the experiment is comparable to a CR design with 1.3 times as many replications. Thus a RCB with three actual replications is about as accurate as a CR with four, so the researcher has gained a full replication "for free."
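A small simulation illustrates why blocking pays. The block-effect and plot-noise sizes below are assumed for illustration; the relative efficiency one observes depends on how heterogeneous the field actually is.

```python
import numpy as np

rng = np.random.default_rng(3)
t, b = 6, 4                    # treatments and blocks (= replications)
sd_block, sd_plot = 1.0, 1.0   # assumed field heterogeneity and plot noise
sq_err_cr = sq_err_rcb = 0.0

for _ in range(4000):
    block_fx = rng.normal(0.0, sd_block, b)
    # RCB: each treatment appears once per block; block means are removed
    y = block_fx[:, None] + rng.normal(0.0, sd_plot, (b, t))
    est_rcb = (y - y.mean(axis=1, keepdims=True)).mean(axis=0)
    # CR: same heterogeneous field, but treatments scattered at random
    labels = rng.permutation(np.repeat(np.arange(t), b))
    region = np.repeat(np.arange(b), t)   # underlying (ignored) field regions
    y_cr = block_fx[region] + rng.normal(0.0, sd_plot, t * b)
    est_cr = np.array([y_cr[labels == k].mean() for k in range(t)])
    est_cr -= est_cr.mean()
    sq_err_rcb += (est_rcb ** 2).sum()    # true treatment effects are all zero
    sq_err_cr += (est_cr ** 2).sum()

print(f"relative efficiency of RCB over CR: {sq_err_cr / sq_err_rcb:.2f}")
```

Because the block means absorb the field heterogeneity, the RCB estimates carry only the plot noise, so their squared error is substantially smaller than the CR estimates' for the same number of plots.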

And yet there are even better approaches than RCB. So-called "incomplete block designs" (in which each block receives only a subset of the treatments) also reduce the residual error, but can accomplish two other things that RCBs cannot. First, they allow the investigator to adjust estimates of treatment effects closer to their true values. Second, these adjustments can improve rankings, increasing the probability of the truly best entry winning the trial. Although blocking is pervasively popular in scientific research, scientists are rarely aware of these additional benefits of incomplete blocking.

From Frogs to Shrinkage Estimators

Though the idea of parsimony is old, its modern expression in statistics had to wait for two breakthroughs around 1955: the theory of shrinkage estimators, initiated by Charles Stein of Stanford University, and the advent of modern digital computers. Modeling is not as easy to grasp as replication and blocking, so I will begin with a "toy example" before proceeding to some real case studies.

Suppose that you are collecting frogs from a pond for a jumping-frog contest. You collect several frogs and race them over a 10-yard course. They complete the course in a range of times from 30 to 90 seconds, with an average of a minute. What is your best estimate of the time that each frog would take if it hopped the course again?

Naively, you might expect a frog that took 40 seconds the first time to finish in 40 seconds the second time. (For simplicity, let us ignore learning effects.) Indeed, that would be your best guess if you had not collected any other frogs. Also, the actual measurement is the best estimator of the true value if there are two frogs. However, Stein and other statisticians proved that it is not the best estimate if you observed three or more frogs. Instead, you should "shrink" each frog's deviation from the mean by a certain amount, determined from the number of frogs and the variability of the times. Suppose the formula told you to shrink your frog's deviation by 30 percent. Then you would estimate that it would finish 14 seconds faster than the mean the second time—that is, in 46 seconds instead of 40. Such a procedure is called, naturally enough, a "shrinkage estimator."

Shrinkage estimators have been proven both empirically and theoretically. Why do they work, and what do they have to do with parsimony? Without a shrinkage estimator, you would in effect be assuming that every frog is a unique animal, unrelated to every other frog, so that the other frogs' times are irrelevant to determining your frog's ability. But it is much more reasonable—and parsimonious—to assume that all of the frogs come from one genetic talent pool. Their abilities are distributed along a bell-shaped curve, with most frogs close to the mean ability and only a small percentage being especially fast or slow hoppers. Thus your frog's true ability probably lies closer to the mean than its 40-second time in the first race would suggest. Note that the shrinkage estimator uses all of the data when calculating each frog's estimate. This more vigorous use of the data increases accuracy.
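The frog example can be sketched directly. The race times and the variances in the Monte Carlo check are assumed values, and the 0.5 shrinkage fraction used there is the textbook ratio of noise variance to total variance, not a formula quoted from the article.

```python
import numpy as np

def shrink_toward_mean(obs, c):
    """Shrink each observation's deviation from the group mean by fraction c."""
    m = obs.mean()
    return m + (1.0 - c) * (obs - m)

# the frog from the text: 40 s observed, group mean 60 s, 30 percent shrinkage
times = np.array([40.0, 50.0, 60.0, 70.0, 80.0])   # illustrative times, mean 60
first_frog = shrink_toward_mean(times, 0.30)[0]
print(round(first_frog, 6))   # -> 46.0

# Monte Carlo check with assumed variances: abilities ~ N(60, 10^2) and
# race-time noise ~ N(0, 10^2), so the natural shrinkage fraction is
# noise_var / (noise_var + ability_var) = 0.5
rng = np.random.default_rng(4)
raw_err = shrunk_err = 0.0
for _ in range(2000):
    ability = rng.normal(60.0, 10.0, 50)            # 50 frogs from one pool
    observed = ability + rng.normal(0.0, 10.0, 50)  # one noisy race each
    raw_err += ((observed - ability) ** 2).sum()
    shrunk_err += ((shrink_toward_mean(observed, 0.5) - ability) ** 2).sum()

print(shrunk_err < raw_err)   # shrinkage wins in total squared error
```

Pooling information across frogs roughly halves the total squared error here, exactly the accuracy gain the text describes.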

Incidentally, although merely shrinking estimates toward the average value does not change rankings, more complicated experiments or models can change rankings and hence can change winners. Also, numerous statistical analyses, including multiple-regression models, can produce adjusted estimates different from the actual data, even though these may not be called "shrinkage estimators." Fitting models to data often requires millions or even

Figure 5. Modeling recovers variation from signal and noise differently depending on the complexity of the model. Signal is relatively simple, so it is captured rapidly by early model components. But noise is very complex. Noise is captured slowly at first, suppressed by considerable signal recovery, then more quickly by exploiting chance correlations in the noise, and slowly thereafter. Predictive accuracy is improved by capturing signal but degraded by capturing noise. The implied response for signal minus noise is a unimodal response called Ockham's hill, shown in Figures 7 and 9.

billions of arithmetic steps, making computers essential.

Moving from this toy example to real experiments, scientists in many fields routinely use statistical models for various purposes other than gaining accuracy, such as testing for significant effects or summarizing and visualizing complex information. For example, a common data format is a two-way layout or data matrix, such as the yields of 30 genotypes in 20 environments or sales of 19 products in 57 stores. Principal-components analysis, factor analysis, correspondence analysis (also called reciprocal averaging), nonmetric multidimensional scaling and other popular multivariate analyses can reduce the high-dimensional data to a two-dimensional graph that often successfully captures most of the structure in the data.
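A minimal sketch of such a reduction, using a singular value decomposition on a synthetic two-way table. The dimensions echo the 30-genotype, 20-environment example, but the data are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
g, e = 30, 20   # e.g., 30 genotypes by 20 environments (synthetic data)

# synthetic table: a two-dimensional (rank-2) structure plus small noise
structure = rng.normal(0.0, 1.0, (g, 2)) @ rng.normal(0.0, 1.0, (2, e))
table = structure + rng.normal(0.0, 0.1, (g, e))

centered = table - table.mean(axis=0)           # column-center the matrix
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
row_coords = U[:, :2] * s[:2]                   # genotype points for a 2-D plot
col_coords = Vt[:2].T                           # environment directions
captured = (s[:2] ** 2).sum() / (s ** 2).sum()  # variation captured in 2-D
print(f"two components capture {captured:.1%} of the variation")
```

Plotting `row_coords` and `col_coords` together gives the kind of two-dimensional summary graph the text describes; here nearly all the structure survives the reduction because the underlying table really is two-dimensional.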

Unfortunately, precious few scientists—apart from scientists in a few specialties such as signal processing—know and exploit the fact that these familiar analyses can also serve the purpose of gaining accuracy. The models become more complex and less parsimonious as further components are added. Because the signal is



Figure 6. Statistical modeling can help in visualizing complex data as well as increasing accuracy. The AMMI (Additive Main Effects and Multiplicative Interaction) model is useful for two-way data tables whenever main (average) and interaction (differential) effects present researchers with different problems and opportunities, as in agricultural yield-trial research. AMMI first accounts for main effects and then applies principal-component analysis to the interactions. On the horizontal axis, this AMMI1 biplot shows main effects for seven soybean genotypes (circles) in 10 New York State environments (squares). The vertical line marks the grand mean (2,678 kilograms per hectare). Those varieties and environments to the left performed poorly overall, whereas those to the right performed well. On its vertical axis, this biplot shows interaction scores. The interaction for a given genotype and environment is estimated by multiplying their scores; thus genotypes and environments with scores of the same sign have positive interactions, but opposite signs indicate negative interactions. The interaction scores reveal a trend from early-maturing varieties and their preferred short-season environments toward the top to late-maturing varieties and their preferred long-season environments toward the bottom. This biplot successfully captures 96.9 percent of the total variation in this complex data set.

relatively simple and the noise is very complex, the early model components selectively recover signal whereas the late model components selectively recover noise. By using only the early components—a form of shrinkage—one can obtain a parsimonious model that makes more accurate predictions


than the data. Let us see how this plays out in three real examples.
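The signal-versus-noise trade-off, and the Ockham's hill it produces, can be demonstrated with truncated SVD standing in for the analyses named above. All data here are synthetic, with an assumed rank-3 signal.

```python
import numpy as np

rng = np.random.default_rng(6)
g, e, true_rank = 30, 20, 3

signal = rng.normal(0.0, 1.0, (g, true_rank)) @ rng.normal(0.0, 1.0, (true_rank, e))
data = signal + rng.normal(0.0, 0.5, (g, e))   # what the experiment observes

U, s, Vt = np.linalg.svd(data, full_matrices=False)
errors = []
for k in range(0, 21):   # k = 20 is the full model, identical to the data
    model = (U[:, :k] * s[:k]) @ Vt[:k]
    errors.append(((model - signal) ** 2).mean())   # accuracy vs. true signal

best_k = int(np.argmin(errors))
print(f"most accurate model keeps {best_k} components")
print(f"error at best k: {errors[best_k]:.3f}; full model (the data): {errors[20]:.3f}")
```

The early components capture the signal and cut the error well below that of the raw data; adding further components only reintroduces noise, so the error curve dips and then rises, tracing Ockham's hill.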

Agricultural Trials

The turf industry in the United States is huge; about $40 billion is spent annually for lawn care. There are about 50 million acres of maintained turf in the

Figure 7. The AMMI model family exhibits Ockham's hill, with a relatively parsimonious model most accurate and efficient. A quality trial of perennial ryegrass, a common species for golf courses and home lawns, provides an example. In this case, AMMI2 with two principal components is most predictively accurate (white line), achieving an impressive statistical efficiency of 5.6. This means that applying the model to the data produced the same accuracy as gathering 5.6 times as much data and not using the model. Models with fewer components are less accurate because they underfit real signal, whereas models with more components are less accurate because they overfit spurious noise.

United States, which would blanket a square 280 miles on a side. The National Turfgrass Evaluation Program (NTEP) conducts trials with several hundred entries at about 25 locations for its ongoing research to improve turfgrass quality.

Scott Ebdon, of the University of Massachusetts, and I have analyzed the NTEP data with a variant of principal-components analysis called AMMI. The salient point is that AMMI produces a model family—AMMI0, AMMI1, AMMI2, AMMI3 and so on—with more and more parameters until reaching the full model, which is identical to the data matrix. The number of parameters in the most accurate models is usually a small fraction of the number of treatments. Choices with fewer parameters underfit the real signal, whereas models with more parameters (including the full model that equals the data!) overfit spurious noise. This response, fittingly called Ockham's hill, has been explored in detail in my previous American Scientist article ("Prediction, Parsimony and Noise," September-October 1993) and my recent book on scientific method.
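A minimal sketch of the AMMI idea, additive main effects followed by a principal-component (SVD) analysis of the interaction residuals, applied to synthetic data. This is an illustration of the model structure, not the authors' code, and the accuracy comparison here uses a known synthetic truth rather than the cross-validation used in the real study.

```python
import numpy as np

def ammi_predict(Y, k):
    """AMMI-k sketch: additive main effects plus the first k
    principal components of the interaction residuals."""
    grand = Y.mean()
    geno = Y.mean(axis=1) - grand    # genotype main effects
    env = Y.mean(axis=0) - grand     # environment main effects
    additive = grand + geno[:, None] + env[None, :]
    U, s, Vt = np.linalg.svd(Y - additive, full_matrices=False)
    return additive + (U[:, :k] * s[:k]) @ Vt[:k]

# synthetic trial: additive truth plus a rank-2 interaction plus noise
rng = np.random.default_rng(7)
g, e = 30, 20
truth = (10.0 + rng.normal(0, 2, g)[:, None] + rng.normal(0, 2, e)[None, :]
         + rng.normal(0, 1, (g, 2)) @ rng.normal(0, 1, (2, e)))
Y = truth + rng.normal(0.0, 1.0, (g, e))

mse_data = ((Y - truth) ** 2).mean()              # using the raw data as-is
mse_ammi2 = ((ammi_predict(Y, 2) - truth) ** 2).mean()
print(f"raw data MSE {mse_data:.2f}, AMMI2 MSE {mse_ammi2:.2f}")
```

Because the parsimonious AMMI2 model keeps the main effects and the genuine low-dimensional interaction while discarding most of the noise, its estimates are substantially closer to the truth than the data themselves, the same effect that produced the 5.6 statistical efficiency in the turfgrass trial.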

In an actual turfgrass quality trial, where we used the various members of the AMMI family to predict cross-validation data, AMMI2 was most accurate, with a statistical efficiency of 5.6. This means that it produced the same accuracy gain as collecting 5.6 times as much data. To collect that much extra data would have cost NTEP over $1 million. Hence, modeling proved to be an extremely cost-effective stratagem for accelerating improvements in turfgrass quality. Inspiration does save perspiration!

Parsimonious models typically achieve statistical efficiencies of 2 to 4; blocking designs achieve an efficiency of 1.3 or so. But this comparison actually understates the superiority of parsimonious modeling. We should really be comparing the gain in accuracy using a baseline that comes from the experiment itself. The statistical benefit from modeling (equivalent to adding 1 to 3 times the replications in the experiment itself) greatly exceeds that from blocking designs (0.3).

Despite consistent and impressive results, the adoption of aggressive statistical analysis in crop science has been slow. As Donald Nielsen noted in his presidential address to the



American Society of Agronomy several years ago, agricultural scientists still rely mainly on statistical techniques developed before 1940—that is, in the era of Ronald Fisher. At present, worldwide breeding efforts account for an average yield increase of about 1 percent per year in the major crops. I conservatively estimate that aggressive statistical analysis of the same data would make it possible to increase this average to 1.4 percent per year, at virtually no extra cost. Over a decade, that incremental gain would translate into enough food for millions of persons.

QTL Searches

In genetics, the inheritance of a simple trait may involve a single gene, as in Gregor Mendel's classic experiments with peas. But a quantitative trait (such as yield) involves multiple genes, which are called quantitative trait loci (QTLs). Numerous genes with relatively small effects are considerably more difficult to locate on chromosomes than a single decisive gene. Searching for these genes is an extremely important problem, with numerous applications to crop improvements and human diseases.

QTL searches proceed as follows. All individuals in the experiment are screened for numerous genetic markers that have already been located in a chromosome map. Each individual is also measured for the quantitative trait of interest (color, disease resistance or whatever), called its phenotype. If groups of individuals with different versions of a given marker gene also have different phenotypes, it is reasonable to infer that a QTL for the trait exists near that marker.
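A toy version of such a scan, testing each marker one at a time, can be sketched as follows. The sample size, marker count and effect size are invented, and real QTL methods (including the Bayesian approach discussed below) are considerably more sophisticated than this single-marker t-test.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, qtl = 200, 50, 7   # individuals, markers, index of the one true QTL

genotypes = rng.integers(0, 2, (n, m))   # 0/1 marker genotypes (synthetic)
# phenotype: an assumed effect of 1.5 units at the true QTL, plus unit noise
phenotype = 1.5 * genotypes[:, qtl] + rng.normal(0.0, 1.0, n)

t_stats = np.zeros(m)
for j in range(m):
    # compare phenotypes of the two genotype groups at marker j
    a = phenotype[genotypes[:, j] == 1]
    b = phenotype[genotypes[:, j] == 0]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    t_stats[j] = (a.mean() - b.mean()) / se

print("strongest marker:", int(np.argmax(np.abs(t_stats))))   # -> 7
```

The marker nearest the true QTL stands out cleanly here because the effect is large and there is only one QTL; the hard cases the article describes involve many loci of small effect, where parsimony-favoring methods earn their keep.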

Numerous statistical methods have been proposed for QTL searches during the past 15 years. Recently, Min Zhang at Cornell University and her collaborators have developed a new method with several notable advantages. The most intriguing feature, from the point of view of this article, is that its superior performance results from parsimonious modeling.

From experience, geneticists know that only a small proportion of the marker genes are actually near QTLs affecting the trait being studied. Zhang used a kind of statistical method (a Bayesian method) that can readily incorporate this crucial biological information. Her approach explicitly favors a parsimonious model with few QTLs. Extensive testing shows that this analysis efficiently detects QTLs while largely avoiding false detections. By contrast, previous methods lacking this emphasis on parsimony performed worse, and were more vulnerable to problems of missing data and small samples.

All three strategies—replicating, blocking and modeling—are relevant for accurate QTL searches. Replicating and blocking increase the accuracy of the phenotypic data, which in turn improves QTL searches. And modeling also helps, on two counts. First, modeling the phenotypic data increases accuracy before the search, as explained in the previous section on agricultural trials. Second, modeling QTLs parsimoniously improves the robustness and accuracy of the search itself, as explored in this section.

Molecular Shapes

For much biological and medical research, including drug design, scientists need to determine the three-dimensional shape of a protein or other large molecule with great accuracy. The basic shape of a molecule is constructed from information on electron densities. This initial picture is then refined using two kinds of data: noisy, empirical data on x-ray diffraction intensities for the large molecule of interest, and a database of typical distances and angles between atoms in small molecules. Note the analogy with the toy example, which also balanced two kinds of data: the data on a given frog and the data on other frogs.

To increase accuracy, crystallographers must choose an appropriate trade-off between these two kinds of data. In a molecule with 1,000 atoms, there are 3,000 parameters to be determined—namely, the three coordinates in space of each atom. But the database can reduce the number of independent parameters by constraining or shrinking all estimates of the chemical bonds of a given kind toward the same length. Crystallographers estimate that for proteins the number of parameters

Figure 8. Quantitative trait loci (QTLs), which control multiple-gene traits, are of great importance in crop improvement and human disease. By mapping QTLs for yield, Steven Tanksley of Cornell University and his collaborators discovered beneficial genes in a small wild tomato from Peru (top left). Transferring the genes into the commercial variety at upper right produced fruit about 10 percent larger, shown at the bottom. Parsimonious modeling, incorporating the background information that rather few marker genes are near QTLs for a given trait, has been used to greatly improve the effectiveness of QTL searches. Such techniques could be applied to searches for medically important QTLs in the human genome, such as the genes underlying high blood pressure, improving the prospects of new treatments. (Photograph courtesy of Steven Tanksley, Cornell University.)


can usually be reduced by a factor of approximately 7.5 (in this case, from 3,000 to 400). Reduction of parameters, as in the agricultural example, is a hallmark of a parsimonious model.
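The balancing of the two kinds of data can be illustrated with a standard inverse-variance combination, the same logic as the frog toy example. The numbers below are invented for illustration, and this is not the actual crystallographic refinement calculation: a single noisy direct measurement of one bond length is shrunk toward the database value for that bond type.

```python
# Hypothetical numbers: a noisy x-ray estimate of one bond length is
# combined with the database value for that bond type, each weighted
# by the inverse of its variance.
direct = 1.62        # noisy x-ray estimate of this bond (angstroms)
sigma_direct = 0.08  # its measurement uncertainty
database = 1.54      # typical length for this bond type
sigma_db = 0.02      # spread of that bond type in small molecules

w_direct = 1 / sigma_direct ** 2
w_db = 1 / sigma_db ** 2
estimate = (w_direct * direct + w_db * database) / (w_direct + w_db)
print(round(estimate, 3))  # falls between the two inputs, nearer the
                           # more precise database value
```

Because the database value is far more precise here, the combined estimate sits close to it, which is exactly the "constraining or shrinking" effect described above.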

The question then arises of what relative weights one should place on the x-ray data and the database in order to optimize accuracy. For decades, a measure called the R factor had been used, which was based on the amount of agreement between the original data and the restrained model. Unfortunately, in flagrant disregard of parsimony, the R factor can be made arbitrarily good by adding more parameters (overrestraining the model). In 1992, Axel Brünger, then at Yale University but now at Stanford, introduced the "free R statistic," which avoids overfitting the noise by

[Figure 9 graph: predicted model accuracy plotted against the relative weight placed on the direct x-ray data versus the indirect data bank.]

Figure 9. Accuracy is essential in determining the three-dimensional shapes of molecules for drug development. For example, protease inhibitors are an important class of drugs for combating HIV infection, and the inhibitor molecule's interaction with an HIV protease depends on its shape. Biologists determine a protein's shape by combining information from x-ray diffraction, which gives a noisy reading of the crystal structure (top left), with information from a data bank containing known bond lengths and angles. Shown are two cutaway views of a model of the TL-3 inhibitor at the site where it interacts with the protease, a known structure shown extending into the background. In the "before" image, an electron-density map (blue mesh) calculated from the diffraction pattern is overlaid on a structural model based on information from the data bank. When proper weights are applied to these two kinds of data, an accurate structure can be fitted ("after"). A statistic called the R factor measures the agreement between the raw diffraction data and calculated electron densities. A measure called Rfree, which uses cross-validation to assess predictive accuracy, is now widely used to improve the accuracy of modeled protein structures. Before the adoption of the more parsimonious Rfree, modelers typically continued to add parameters to reduce R. As the R factor decreased beyond the optimal weights now found by Rfree, model accuracy actually declined because the models overfitted the noise (graph). (Model images courtesy of Holly Heaslet and Justin Chartron, Scripps Research Institute; diffraction image courtesy of C. David Stout of Scripps.)


using cross-validation. Most of the x-ray data are used to construct the model of the molecule, but about 10 percent of the data, selected at random, are withheld in order to check the predictive accuracy of the model. Just as one would expect, as the emphasis given to the direct x-ray data increases (and the emphasis on the database decreases), the free R statistic displays Ockham's hill. The peak on Ockham's hill (here inverted) indicates the optimal choice of weights for the two kinds of data.

At the time I wrote my previous article for this magazine, Brünger's method had been introduced just the preceding year. That year, only 1 percent of the crystal structures deposited in the international Protein Data Bank had used the free R statistic. Just three years later, adoption of Rfree had reached 33 percent, and after five years, 71 percent. By 2000, adoption had reached 92 percent, and at present Rfree is nearly always reported. Thus crystallography constitutes an encouraging case study in how rapidly parsimonious modeling can become accepted (or even required) when data are expensive and limited, but computation and modeling are cheap.

Science Education

Two facts are evident. On the one hand, all statisticians know that parsimonious modeling can increase accuracy and efficiency. On the other hand, few scientists know about this great opportunity. What explains this mismatch, and what is the remedy?

In part, I would suggest that modeling is neglected because of scientists' complacency. Many scientists think that once they have done replication and blocking, they are finished with their statistical homework, thank you very much. The goal of this article is to disturb this unfounded complacency. Replication accomplishes less than many scientists expect it to, blocking is often done by suboptimal designs, and modeling to gain accuracy is routinely neglected.

Why do scientists sometimes fail at the accuracy game? I would pin the blame on a common deficiency in their training. Scientific research requires mastery of both the general principles of science and the specialized techniques of a particular discipline, but the emphasis can fall too heavily on the latter. The community of investigators in a given specialty can remain quite unaware of fine examples of the uses of statistical methods that can be found in other disciplines or literatures, and thus miss opportunities to gain accuracy through modeling in their own work. Astronomers or geologists who see an example of successful modeling in agriculture or chemistry ought to be able to distinguish transferable general principles that they can import into their own specialties. Parsimony and its relation to accuracy (as described by Ockham's hill) is one such principle, with pervasive relevance in science and technology.

Innovation comes not only from inventors of new ideas, but also from importers of relevant ideas. This fact has been recognized by recent position papers on U.S. science education that emphasize versatility in the scientific workforce and the transfer of knowledge between disciplines.

Manifestly, it is precisely the general principles of scientific method that are the most cosmopolitan facet of science. These constitute the most frequent and promising candidates for enhancing knowledge transfer and greater adaptability.

When a scientist plans an experiment with a statistical consultant, or when a student plans research with a thesis advisor, the conversation is likely to include recommendations for replicating and blocking to gain accuracy. After settling that, the scientist or student should draw but one breath before using it to ask, "And what about also gaining accuracy from parsimonious modeling of the treatments?" Scientific research is expensive, so it is imperative to use resources efficiently. Anything less is irresponsible.

Surely at least 25 percent of scientific projects have one or more key steps where parsimonious modeling is applicable, but not yet implemented. It may well be that the agricultural example, in which the current rate of progress could be improved by 40 percent simply from better use of data already in hand, is a typical case. If so, simple multiplication suggests that we are missing out on 10 percent of the potential return from our investment in scientific experiments. A modest investment in training, with attention paid to such general principles of scientific method as parsimony, could reap tremendous benefits for scientific knowledge and for the development of new commercial and medical products. Extra accuracy at trivial cost is a great bargain.

Bibliography

Berger, J. O., and D. A. Berry. 1988. Statistical analysis and the illusion of objectivity. American Scientist 76:159-165.

Dean, A., and D. Voss. 1999. Design and Analysis of Experiments. New York: Springer-Verlag.

Ebdon, J. S., and H. G. Gauch. 2002. Additive main effect and multiplicative interaction analysis of national turfgrass performance trials: II. Cultivar recommendation. Crop Science 42:497-506.

Gauch, H. G. 1992. Statistical Analysis of Regional Yield Trials: AMMI Analysis of Factorial Designs. New York: Elsevier (Chinese edition 2001, Hangzhou: China National Rice Research Institute).

Gauch, H. G. 1993. Prediction, parsimony and noise. American Scientist 81:468-478.

Gauch, H. G., and R. W. Zobel. 1996. Optimal replication in selection experiments. Crop Science 36:838-843.

Gauch, H. G. 2002. Scientific Method in Practice. Cambridge, U.K.: Cambridge University Press (Chinese edition 2004, Beijing: Tsinghua University Press).

Glazier, A. M., J. H. Nadeau and T. J. Aitman. 2002. Finding genes that underlie complex traits. Science 298:2345-2349.

Jefferys, W. H., and J. O. Berger. 1992. Ockham's razor and Bayesian analysis. American Scientist 80:64-72.

Kleywegt, G. J., and T. A. Jones. 2002. Homo Crystallographicus—Quo Vadis? Structure 10:465-472.

National Academy of Sciences. 1995. Reshaping the Graduate Education of Scientists and Engineers. Washington, D.C.: National Academy Press.

National Science Foundation. 1996. Shaping the Future: New Expectations for Undergraduate Education in Science, Mathematics, Engineering, and Technology. Arlington, VA: National Science Foundation.

Tanksley, S. D., and S. R. McCouch. 1997. Seed banks and molecular maps: Unlocking genetic potential from the wild. Science 277:1063-1066.

Zhang, M., K. L. Montooth, M. T. Wells, A. G. Clark and D. Zhang. 2005. Mapping multiple quantitative trait loci by Bayesian classification. Genetics 169:2305-2318.

For relevant Web links, consult this issue of American Scientist Online:

http://www.americanscientist.org/IssueTOC/issue/821
