- Home
- Documents
- Optimal Experimental Design of Field Trials using ... ?· Optimal Experimental Design of Field Trials…

Published on

19-Apr-2019View

212Download

0

Embed Size (px)

Transcript

<p>Optimal Experimental Design of Field Trialsusing Differential Evolution</p>
<p>An application in Quantitative Genetics and Plant Breeding</p>
<p>Vitaliy FeoktistovStephane Pietravalle</p>
<p>Nicolas HeslotBiostatistics Department</p>
<p>Research Centre of Limagrain EuropeChappes, France</p>
<p>vitaliy.feoktistov@limagrain.com</p>
<p>AbstractWhen setting up field experiments, to test andcompare a range of genotypes (e.g. maize hybrids), it is importantto account for any possible field effect that may otherwise biasperformance estimates of genotypes. To do so, we propose amodel-based method aimed at optimizing the allocation of thetested genotypes and checks between fields and placement withinfield, according to their kinship. This task can be formulated as acombinatorial permutation-based problem. We used DifferentialEvolution concept to solve this problem. We then present resultsof optimal strategies for between-field and within-field placementsof genotypes and compare them to existing optimization strate-gies, both in terms of convergence time and result quality. Thenew algorithm gives promising results in terms of convergenceand search space exploration.</p>
<p>Index TermsOptimization, combinatorial, permutation, diffe-rential evolution, breeding trials, experimental design.</p>
<p>I. INTRODUCTION</p>
<p>In plant breeding trials, field experiments are commonlyused to test genotypes (e.g. maize hybrids or wheat varieties)and estimate their potential for a range of variables, suchas yield, at all stages of the breeding process. Accuratelyestimating those variables is crucial to ensure that only thebest genotypes are kept for further selection.</p>
<p>In a field experiment, each genotype is allocated to one orseveral small plots on a field. Measured performance on thatplot is due to an effect of the genotype, measurement errorand an effect of the field (e.g. nutrient availability). Fisherhighlighted the importance of experimental design in the early1930s [1], to reduce bias and better estimate effects of interest.</p>
<p>In its most basic form, experimental design rely on ran-domization and replication but better design structures, forinstance, through the use of blocking, can further improvethe reliability of the estimates. However, blocking structuresrely on the strong assumption of local homogeneity and alsorequire the design to be replicated. As a result, they cansometimes be difficult to use, especially in early generationvariety trials, where the aim is to test a very large number ofgenotypes with constraints on space and seed availability. Inrecent years, model-based designs have been used more widely</p>
<p>[2][4]. At the local scale, one example of such a design is therepchecks design and relies on a spatial correlation structure oferror first assumed by Gilmour et al. [2]. This design consistsof a number of unreplicated experimental genotypes and some(usually three or four) replicated and well spread-out checkgenotypes to capture and model any potential field effect.In plant breeding trials, it is important to test the genotypesover a range of locations to estimate their performance overa network (e.g. market). Similarly to the constraints presentlocally, it is often practically impossible to have all genotypesreplicated over all locations. As a result, many genotypes areonly present on a subset of the locations in the network; thiscan be achieved through multi-location designs such as thesparse repchecks or sparse p-reps designs [5], [6].</p>
<p>The Section II describes the optimal design of field ex-periments problem. Its mathematical formulation as well asmodelling as combinatorial permutation-based optimizationtask. The problem is split into two phases to be solvedefficiently. The Section III describes a general framework ofDifferential Evolution (DE) algorithm, and then introduces thenew created algorithm for exploring the permutation space.This is followed by proposition of several search strategies,it also mentions the used method to handle constraints. TheSection IV demonstrates the obtained results for both op-timizations. Here we show the efficiency of the approachand compare it with the existing software packages. Real-lifeexamples are presented. The Section V concludes the paperoutlining promising trends and directions.</p>
<p>II. PROBLEM FORMULATION</p>
<p>A. State of the Arts and Motivation</p>
<p>Two PhD theses tried to tackle this problem from a practicalviewpoint [7], [8]. That resulted in two software packages:DiGGer [9], a design search tool in R [10], and OD [11], alsoan R-package. Other recent papers contribute to improvingthe mathematical model of efficient designs [4][6] and showreal-case studies.</p>
<p>arX</p>
<p>iv:1</p>
<p>702.</p>
<p>0081</p>
<p>5v1 </p>
<p> [cs</p>
<p>.NE</p>
<p>] 1</p>
<p> Feb</p>
<p> 201</p>
<p>7</p>
<p>The core of both software packages is coded in Fortranand highly optimized. The problem can be considered as aHPC one and despite of all code tunings, the computations arestill very expensive. To solve this problem, many metaheuris-tics have been tested, among them: Tabu Search, SimulatedAnnealing, Genetic Algorithms, and others. Nevertheless, formany real cases the mentioned algorithms are not suitable dueto their computational time, in other words a poor conver-gence. This motivated us to develop a new approach to solvethis problem more efficiently.</p>
<p>B. Modelling Optimal Design as a Permutation Problem</p>
<p>Let us consider the following linear mixed model [12]</p>
<p>y = X + Zu+ (1)</p>
<p>y is vector of phenotype, X a design matrix for thefixed effects, are fixed effects, Z a design matrixfor the random genetic effect u, so that u is distributedmultivariate normal with covariance G, and the error termis multivariate normal with covariance R. G = K2a, where </p>
<p>2a</p>
<p> is the additive genetic variance and K is the kinship whichcan be based on pedigree or on markers. Kinship informationcan be ignored in the optimization process by setting K to anidentity matrix.</p>
<p>From this model, the mixed model equations can be written[13][</p>
<p>XR1X X</p>
<p>R1Z</p>
<p>ZR1X Z</p>
<p>R1Z +G1</p>
<p>](u</p>
<p>)=</p>
<p>(X</p>
<p>R1y</p>
<p>ZR1y</p>
<p>)(2)</p>
<p>In a general case, the prediction error variance PEV is</p>
<p>PEV = var(u u) = [ZMZ +G1]1 (3)</p>
<p>where</p>
<p>M = R1 R1X(XR1X)1X</p>
<p>R1 (4)</p>
<p>For more details, see [14] and [15].The optimization task consists in finding an optimal design</p>
<p>matrix Z, which minimizes the PEV . Let exists some startdesign matrix Z0 with given properties, then the optimizationtask can be rewritten as finding the optimal permutation , sothat Z = (Z0) is the permutation of lines of Z0 matrix withrespect to constraints representing desirable design properties.</p>
<p>Taking into account relatedness between genotypes ensuresan optimal design is generated avoiding planting closelyrelated genotypes next to each other and splitting familiesevenly between locations. This is a NP -hard problem andis hard to solve for large sized cases, i.e. real situations.Thus, the task is split into two phases. The first one (phase I)is called Between Location phase, and optimizes dispatchingthe genotypes among several locations (sparse design). Thesecond one (phase II) is Within Location phase. At this stage,the optimal design inside of a location is resolved taking inconsideration auto-regressive error and genotype subsets fromthe previous phase. As shown in Eqs. 3 and 4, for both phases,</p>
<p>the optimization can also take into account additional fixedeffects estimation such as block effects within location.</p>
<p>The incorporation of a spatial auto-regressive process(ar(1) ar(1)) with a so-called measurement error variance(nugget) implies a given structure in the variance-covariancematrix of the residuals of the mixed model. Assuming arectangular field of size r rows and c columns, allowing thelast row to be incomplete (with cl columns), the matrix R isof size [c (r 1)+ cl, c (r 1)+ cl] and can be defined as</p>
<p>Ri,i = </p>
<p>Ri,j = |aiaj |r |bibj |c</p>
<p>(5)</p>
<p>where r and c are the autocorrelation coefficients for therows and columns respectively, = 1 + nugget and</p>
<p>ai = (i 1) mod c+ 1aj = (j 1) mod c+ 1bi = (i 1) quo c+ 1bj = (j 1) quo c+ 1</p>
<p>(6)</p>
<p>III. DIFFERENTIAL EVOLUTION</p>
<p>A. General Description</p>
<p>Since its first introduction in 1995 [16], [17] DifferentialEvolution is one of the most successful metaheuristic algo-rithm. It has won many CEC global optimization competi-tions. The algorithm was generalized to different branches ofoptimization such as constrained, multi-modal, multi-objectiveand large-scale optimization as well as made suitable for noisyand mixed-variables objective functions. Thus it covers withsuccess many applications in a number of areas of science andindustry. A detailed history of the algorithm can be found inChapter 1 of [18].</p>
<p>Differential Evolution is a population-based approach. Itsconcept shares the common principles of evolutionary al-gorithms. Starting from an initial population P0 of NP ,solutions DE intensively progresses to the global optimumusing self-learning principles. The initialization can be done indifferent ways, the most often uniformly random solutions aresampled respecting boundary constraints. Then, each solutionis evaluated through the objective function.</p>
<p>To be formal, let the population P consists of indi : i =1, NP solutions, or individuals in evolutionary algorithmsterms. The individual indi contains D variables xij , so-calledgenes. Thus indi = {xij}Dj=1 and P = {indi}NPi=1 .</p>
<p>At the each iteration g, also called generation, all individualsare affected by reproduction cycle. To this end, for each currentindividual ind a set of other individuals = {1, 2, . . . , n}are randomly sampled from the population Pg . All the DEsearch strategies are designed on the basis of this set.</p>
<p>The strategy [19] is how the information about setindividuals is used to create the base and the difference vectors of the main DE operator of the reproduction cycle,often called as differential mutation by analogy with geneticalgorithms, or else differentiation by functional analogy withgradient optimization methods. So the differentiation operator</p>
<p>can be now viewed in its standardized form as the currentpoint and the stochastic gradient step F to do</p>
<p> = + F (7)</p>
<p>where F is constant of differentiation, one of the controlparameters. Usually the recommended values are in [0.5, 2)range.</p>
<p>The next reproduction operator is crossover. It is a typicalrepresentative of genetic algorithms crossovers. Its main func-tion is to be conservative when passing to the new solutionpreserving some part of genes from the old one. The mostused case is when the trial individual inherits the genes ofthe target one with some probability</p>
<p>j =</p>
<p>{j if randj Crindj otherwise</p>
<p>(8)</p>
<p>for j = 1, . . . , D, uniform random numbers randj [0, 1),and crossover constant Cr [0, 1), the second controlparameter.</p>
<p>So, the trial individual is formed and the next stepis to evaluate it through an objective function, also calledfitness. Sometimes, constraints are handled at this stage too.Reparation methods are used before the fitness evaluation, inorder to respect the search domain [L,H] and/or c() 0(hard approach). Other constraints can be evaluated during theobjective function computation (soft approach). It should benoted that in many industrial applications the evaluation ofobjective function and constraints demands significant com-putational efforts in comparison to other parts of the DEalgorithm, thus the algorithms convergence speed during thefirst iterations is very important.</p>
<p>The following step is selection. Often, the elitist selectionis preferred</p>
<p>ind =</p>
<p>{ if f() f(ind)ind otherwise (9)</p>
<p>At this moment the decisions on constraints and multi-objectives values are also influence the final choice of thecandidate.</p>
<p>The Differential Evolution pseudo-code is summarized inAlg. 1.</p>
<p>B. Adaptation to the Permutation Space</p>
<p>Differential Evolution is very successful in solving conti-nuous optimization problems. Nevertheless, there were manyattempts to spread the DE concept on combinatorial optimiza-tion. A good summary of DE adaptations can be found in [20].In this paper, we concentrate on the combinatorial problemsthan can be formulated as permutation-based. We propose anew technique to explore combinatorial space and apply thistechnique to find the optimal design of field experiments, areal application from agriculture and plant breeding.</p>
<p>Further we discuss how to transform the differentiationoperator to handle the permutation space. There are two pointswe need to define</p>
<p>1) what would be the distance in permutation space? and</p>
<p>Algorithm 1 Differential Evolution - a general patternRequire params :F,Cr,NP,Strategy . control parametersf( ) . objective functionc( ), L,H . constraints</p>
<p>procedure DIFFERENTIALEVOLUTION(params)Initialize P0 {ind1, . . . , indNP } [L,H]Evaluate f(P0) {f(ind1), . . . , f(indNP )}while not stopping condition do</p>
<p>for all ind Pg doSample = {1, 2, . . . , n} from PgReproduce Differentiation(,F,Strategy) Crossover(, ind,Cr)</p>
<p>Evaluate f f() . often costly operationSelect ( vs ind) ind</p>
<p>end forg g + 1 . go to the next generation</p>
<p>end whileend procedure</p>
<p>2) how we apply this distance knowledge to compute theDE step from the base point (see Eq. 7)?</p>
<p>As it was shown earlier [18], the crossover operator isoptional and can be missed in many cases without degradationof convergence rate. So we decided not to use it for apermutation-based optimization.</p>
<p>1) Distance: Many types of distances for the permutationspace were invented, explored and used. A good overview ofcombinatorial distances with indicating their complexity andperformance tests is presented in [21].</p>
<p>It is obvious that the Hamming distance is one of the bestcandidates because of its simplicity to compute O(n) andsuitability for the DE context. For two permutations and, the Hamming distance between them is</p>
<p>H(, ) =</p>
<p>Dj=1</p>
<p>j where j ={</p>
<p>1 if j 6= j0 otherwise</p>
<p>(10)2) DE step: Several combinatorial operators are possible</p>
<p>to generate the permutation . Most common are swap,interchange and shift. Swap is considered a local operatoras it concerns the neighbours of some permutation positionj . Interchange is a global operator, it swaps two distantpositions i and j . Shift interchanges two k-length substrings[i, . . . , i+k] and [j , . . . , j+k].</p>
<p>Many combinatorial heuristic algorithms alternate propor-tions of local and global actions to balance the search processin order to avoid stagnation or unnecessary exploration. Wedo not use the different operators to trade-off local andglobal techniques but chose only one global operator, namelyinterchange and control it with locality factor , which can beconsidered as an equivalent of the F differentiation constantbut for the permutation space.</p>
<p>Thereby the size of DE steps is defined as H(). Thusthe trial individual is computed as</p>
<p> = H (11)</p>
<p>where the operator is an interchange operator, which appliesH...</p>