Assessing the predictive accuracy of various statistical methods …. Gauffreteau... · 2014. 7. 16. · Assessing the predictive accuracy of various statistical methods that use

Assessing the predictive accuracy of various statistical methods that use environmental covariates to model Genotype x Environment Interactions in multi-environment trial

A. Gauffreteau, G. Grignon, P. Pachot, J. Lorgeou, F. Piraux, F. Maupas, H. Escriou, C. Pontet, F. Salvi

Why predict GEI?

2/30

Context

• Choice of cultivars adapted to their cropping environments

• Need for cultivars:
– Diversified and adapted to a large range of environmental conditions
– Described: average performance and response to the various environmental conditions they can meet in fields

→ Modeling the GEI in multi-environment trials (MET) carried out during the last steps of selection, the registration process and the post-registration cultivar assessment

[Diagram: climate change, input reduction and diversification of cropping environments lead to higher and more variable biotic and abiotic stresses]

3/30

GEI in MET

$Y_{ijk} = \mu + G_i + E_j + (G \times E)_{ij} + \varepsilon_{ijk}$

9 genotypes, 2 years, 3 cropping practices

[Figure: panels for 2001 and 2002, with variance shares: Env 71%, Var 12%, Var x Env 13%, ε 4%]

4/30

Importance of GEI in MET

| Species | Environment | Genotype | GEI | Source |
|---|---|---|---|---|
| Soft wheat | 50 to 80% | 10% | 25% | Lecomte, 2005 |
| Soft wheat | 80% | 7% | 13% | Arvalis (on average) |
| Sunflower | 55% | 22% | 12% | Casadebaig, 2008 |
| Sunflower | 70% | 3% | 9% | Cetiom (on average) |

→ GEI effect = from 0.5 to 2 times as big as genotype effect
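The shares of environment, genotype and GEI quoted above can be illustrated with a sum-of-squares decomposition. The sketch below is a minimal illustration, assuming a balanced table of cell means with no replication; the yields are hypothetical, and sum-of-squares shares are not the same thing as variance components estimated by a mixed model.

```python
def variance_shares(table):
    """Split the total sum of squares of a balanced genotype x environment
    table of cell means into genotype, environment and interaction shares."""
    n_g, n_e = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_g * n_e)
    g_means = [sum(row) / n_e for row in table]
    e_means = [sum(table[i][j] for i in range(n_g)) / n_g for j in range(n_e)]
    ss_g = n_e * sum((m - grand) ** 2 for m in g_means)
    ss_e = n_g * sum((m - grand) ** 2 for m in e_means)
    # Interaction: what is left after removing both main effects.
    ss_ge = sum((table[i][j] - g_means[i] - e_means[j] + grand) ** 2
                for i in range(n_g) for j in range(n_e))
    total = ss_g + ss_e + ss_ge
    return {"G": ss_g / total, "E": ss_e / total, "GxE": ss_ge / total}


# Hypothetical yields (q/ha): 3 genotypes (rows) x 3 environments (columns).
yields = [[60.0, 80.0, 95.0],
          [55.0, 85.0, 90.0],
          [65.0, 75.0, 100.0]]
shares = variance_shares(yields)
ratio_gei_to_g = shares["GxE"] / shares["G"]
```

In this toy table the environment dominates and the interaction share exceeds the genotype share, which is the kind of situation the slide describes.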

5/30

Modeling GEI to

• Assess the ability of cultivars to maintain their performance in various environments (more or less stressful) in comparison to other cultivars (dynamic stability)
→ Minimizing GEI (Wricke’s ecovalence…)
• Define the trials where each variety performs the best
→ Analysing the GEI matrix (AMMI…)
• Characterize the cultivars (resistance to environmental stress and ability to make use of environmental resources)
→ Explaining GEI by using environmental covariates in models (factorial regression…)
• Recommend cultivars for unexperimented conditions
• Assess the stability (dynamic or static) of one cultivar under unexperimented conditions
• Fill in missing data in the cultivar dataset: identify groups of cultivars with complementary GEI that could be recommended together for a better dynamic stability

→ Predicting GEI with models
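As an illustration of the first use listed above, Wricke's ecovalence of cultivar i is the sum over environments of its squared interaction terms, $W_i = \sum_j (Y_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y}_{..})^2$. The sketch below assumes a balanced table of cell means; the yields are hypothetical and are not taken from the trials discussed here.

```python
def wricke_ecovalence(table):
    """Return one ecovalence value per genotype (row) of a balanced
    genotype x environment table; small values mean little interaction."""
    n_g, n_e = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_g * n_e)
    g_means = [sum(row) / n_e for row in table]
    e_means = [sum(table[i][j] for i in range(n_g)) / n_g for j in range(n_e)]
    return [
        sum((table[i][j] - g_means[i] - e_means[j] + grand) ** 2
            for j in range(n_e))
        for i in range(n_g)
    ]


# Hypothetical yields (q/ha): 3 genotypes x 3 environments.
yields = [[60.0, 80.0, 95.0],
          [55.0, 85.0, 90.0],
          [65.0, 75.0, 100.0]]
ecovalence = wricke_ecovalence(yields)
```

Here the first genotype follows the environment means exactly, so its ecovalence is zero (dynamic stability), while the other two share the whole interaction.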

6/30

The methods implemented to model GEI

7/30

Modeling GEI using environmental covariates

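$Y_{ijk} = \mu + G_i + E_j + (G \times E)_{ij} + \varepsilon_{ijk}$

where the environment and interaction terms are each split into a part explained by the covariates and a residual:

$E_j = \alpha_1 X_{1j} + \dots + \alpha_n X_{nj} + E'_j$

$(G \times E)_{ij} = \beta_{1i} X_{1j} + \dots + \beta_{ni} X_{nj} + (G \times E)'_{ij}$

[Figure: Y plotted against a covariate X for Gen A, Gen B and the mean genotype; annotations: G, +βA, -βB]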
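To make the role of the genotype-specific coefficients concrete, the sketch below estimates one slope per genotype for a single covariate by regressing the doubly centered interaction terms on the centered covariate. It is a simplified illustration, assuming a balanced table and one covariate; the function name and the data are hypothetical.

```python
def interaction_slopes(table, x):
    """Least-squares slope of each genotype's interaction terms on one
    environmental covariate x (one value per environment)."""
    n_g, n_e = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_g * n_e)
    g_means = [sum(row) / n_e for row in table]
    e_means = [sum(table[i][j] for i in range(n_g)) / n_g for j in range(n_e)]
    x_bar = sum(x) / n_e
    xc = [v - x_bar for v in x]
    sxx = sum(v * v for v in xc)
    slopes = []
    for i in range(n_g):
        ge = [table[i][j] - g_means[i] - e_means[j] + grand for j in range(n_e)]
        slopes.append(sum(g * v for g, v in zip(ge, xc)) / sxx)
    return slopes


# Hypothetical data built as 70 + G_i + E_j + beta_i * x_j,
# with G = (2, -2), E = (-5, 0, 5), beta = (0.5, -0.5), x = (0, 10, 20).
yields = [[67.0, 77.0, 87.0],
          [63.0, 63.0, 63.0]]
covariate = [0.0, 10.0, 20.0]
slopes = interaction_slopes(yields, covariate)
```

The recovered slopes are deviations from the mean response, so they sum to zero across genotypes, as the +βA and -βB annotations of the figure suggest.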

8/30

Environmental covariates (soft winter wheat case)

(Meynard and Sebillotte, 1980)

1 covariate = 1 environmental stress x 1 development period

[Figure: examples of covariates computed over development periods up to maturity: sum of radiation, T° max > 25°C, sum of P-ETP>0, sum of ETR-ETM]

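Environmental covariates

• Development stages depending on cultivar earliness

[Figure: covariate X over time, with development periods differing between cultivars G1, G2 and G3]

• Xij: covariates are calculated by environment j and cultivar i
– βi: intrinsic cultivar tolerance
• Xj: covariates are calculated by environment j
– βi: intrinsic cultivar tolerance + escape

$Y_{ijk} = \mu + G_i + (\alpha_1 X_{1j} + \dots + \alpha_n X_{nj} + E'_j) + (\beta_{1i} X_{1j} + \dots + \beta_{ni} X_{nj} + (G \times E)'_{ij}) + \varepsilon_{ijk}$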

10/30

Statistical methods

$Y_{ijk} = \mu + G_i + \sum_n \alpha_n X_{nj} + E'_j + \sum_n \beta_{ni} X_{nj} + (G \times E)'_{ij} + \varepsilon_{ijk}$

• One-step factorial regression with forward selection of environmental covariates according to AIC:
– X strongly correlated
→ Problem for variable selection (unstable)
→ Problem for parameter estimates
→ Low predictive value
• Two-step procedure:
– Mixed model: $Y_{ijk} = \mu + G_i + E_j + (G \times E)_{ij} + \varepsilon_{ijk}$
– Analysis of $(G \times E)_{ij}$ by
• Centered and scaled PLS regression
• Random Forest

11/30

Statistical methods: PLS regression

• The principle:
– Look for orthogonal components th = X.ah maximizing cov(X.ah, Y), i.e. components that explain X and are correlated to Y
– Regress Y on the components th
– Express the regression in terms of the initial covariates X
– Number of components (axes) chosen by cross-validation
• The advantages:
– Deals with correlated X
– Deals with a large number of X (can be larger than the number of individuals)
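The principle can be sketched for the first component only. With centered and scaled X and centered y, the unit-norm weight vector maximizing cov(X.a, y) is proportional to X'y. The code below is a minimal single-response illustration under that assumption, not the procedure used in the study; the function name and the data are hypothetical, and further components would require deflating X.

```python
import math


def first_pls_component(X, y):
    """Return (weights a, scores t, coefficients on scaled X) for the
    first PLS component of a single response y."""
    n, p = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(p)]
    sds = [math.sqrt(sum((row[k] - means[k]) ** 2 for row in X) / (n - 1))
           for k in range(p)]
    Z = [[(row[k] - means[k]) / sds[k] for k in range(p)] for row in X]
    y_bar = sum(y) / n
    yc = [v - y_bar for v in y]
    # Weight vector proportional to X'y, normalised to unit length.
    xty = [sum(Z[i][k] * yc[i] for i in range(n)) for k in range(p)]
    norm = math.sqrt(sum(c * c for c in xty))
    a = [c / norm for c in xty]
    # Scores on the component, then regression of y on the scores.
    t = [sum(Z[i][k] * a[k] for k in range(p)) for i in range(n)]
    c1 = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    coefs = [c1 * ak for ak in a]
    return a, t, coefs


# Two strongly correlated hypothetical covariates and one response.
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
y = [1.0, 2.0, 3.0, 4.0]
a, t, coefs = first_pls_component(X, y)
```

Because the two covariates carry nearly the same information, they receive nearly equal weights instead of the unstable estimates a plain regression would give.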

12/30

Statistical methods: PLS regression

• Structure of data

| VAR | ENV | VARxENV | X1xVAR1 | X1xVAR2 | X2xVAR1 | X2xVAR2 |
|---|---|---|---|---|---|---|
| var1 | env1 | int11 | X11 | 0 | X21 | 0 |
| var2 | env1 | int21 | 0 | X11 | 0 | X21 |

• Variable selection:
– From the model with all the variables
– Successive removal of variables according to their absolute coefficient (from the smallest to the largest)
– At each step, calculation of the RMSEP of the model by cross-validation
– Choice of the model (and the variables) that minimizes RMSEP
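The data structure on this slide, with one column per covariate and cultivar, can be built as follows. This is a sketch with hypothetical names and values: each row carries the covariate values of its environment in the columns of its own cultivar and zeros elsewhere, so that a regression on these columns yields cultivar-specific coefficients.

```python
def interaction_design(observations, covariates, cultivars):
    """observations: list of (cultivar, environment) pairs.
    covariates: dict mapping environment -> list of covariate values.
    Returns (column names, rows) in the order X1xVAR1, X1xVAR2, X2xVAR1, ..."""
    n_cov = len(next(iter(covariates.values())))
    header = [f"X{n + 1}x{c}" for n in range(n_cov) for c in cultivars]
    rows = []
    for cultivar, env in observations:
        row = []
        for n in range(n_cov):
            for c in cultivars:
                # Covariate value in the cultivar's own column, zero elsewhere.
                row.append(covariates[env][n] if c == cultivar else 0.0)
        rows.append(row)
    return header, rows


cultivars = ["var1", "var2"]
covariates = {"env1": [11.0, 21.0]}          # hypothetical X1 and X2 for env1
observations = [("var1", "env1"), ("var2", "env1")]
header, rows = interaction_design(observations, covariates, cultivars)
```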

13/30

Statistical methods: Random Forest

• The principle:
– Building independent decision trees: each tree is a set of dichotomies (nodes), each splitting one group into 2
– Random sample with replacement of n data: training dataset
– For each node of the tree: a subset of variables is randomly sampled and tested; the variable which minimizes the intra-group inertia is selected
– Prediction of each out-of-bag data point
– Prediction of the random forest = mean of the predictions of the individual trees
→ MSEP calculated on the random forest predictions
• Variable importance:
– Random permutation of a variable
– Calculation of the increase of MSEP due to the permutation (IncMSEP)
– The bigger the increase of MSEP, the more important the variable
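The permutation-based importance can be sketched independently of any forest implementation. The code below is a simplified illustration: it permutes one column of a fixed evaluation set and measures the mean increase in squared error, whereas a random forest does this tree by tree on out-of-bag data. The `predict` callable, the data and the parameter names are hypothetical.

```python
import random


def mse(predict, X, y):
    """Mean squared error of a prediction function on rows X."""
    return sum((predict(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Mean increase in MSE when the values of one column are shuffled."""
    rng = random.Random(seed)
    base = mse(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        values = [row[column] for row in X]
        rng.shuffle(values)
        permuted = [row[:column] + [v] + row[column + 1:]
                    for row, v in zip(X, values)]
        increases.append(mse(predict, permuted, y) - base)
    return sum(increases) / n_repeats


# Hypothetical model that uses only the first covariate.
def predict(row):
    return 2.0 * row[0]


X = [[float(i), float(i % 3)] for i in range(12)]
y = [2.0 * row[0] for row in X]
importance_first = permutation_importance(predict, X, y, column=0)
importance_second = permutation_importance(predict, X, y, column=1)
```

Shuffling the covariate the model relies on degrades the predictions, while shuffling the unused one leaves the error unchanged.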

14/30

Statistical methods: Random Forest

• The advantages:
– Deals with interactions between covariates
– Deals with non-linear relationships
– Deals with a large number of X (can be larger than the number of data)
• Variable selection:
– Forward:
• From a model with only cultivar as variable
• Successive introduction of environmental covariates with maximum IncMSEP
• Stop when the introduction of a new variable doesn’t decrease the MSEP of the model
– Backward:
• From the model with all the variables
• Successive removal of environmental covariates with minimum IncMSEP
• Stop when the removal of a new variable doesn’t decrease the MSEP of the model

15/30

Cross-validation

| Cross-validation scheme | Test dataset | Calibration dataset |
|---|---|---|
| Cross validation over the trial | 1 trial | N - 1 trials |
| Cross validation over the year | 1 year | N - 1 years |
| 1/3 – 2/3 cross validation | 1/3 | 2/3 |
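The three schemes differ only in how records are assigned to the test set. The sketch below assumes each record is a dictionary with hypothetical `trial` and `year` keys; the random 1/3 – 2/3 split shown is one possible reading of the third scheme.

```python
import random


def leave_one_group_out(records, key):
    """Yield (group, calibration, test) with one trial or one year held out."""
    for group in sorted({r[key] for r in records}):
        test = [r for r in records if r[key] == group]
        calibration = [r for r in records if r[key] != group]
        yield group, calibration, test


def one_third_split(records, seed=0):
    """Random split: 2/3 of the records for calibration, 1/3 for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) // 3
    return shuffled[cut:], shuffled[:cut]


# Hypothetical records, not taken from the wheat dataset.
records = [
    {"trial": "T1", "year": 2008, "cultivar": "A", "yield": 81.0},
    {"trial": "T1", "year": 2008, "cultivar": "B", "yield": 78.5},
    {"trial": "T2", "year": 2008, "cultivar": "A", "yield": 92.0},
    {"trial": "T2", "year": 2008, "cultivar": "B", "yield": 95.5},
    {"trial": "T3", "year": 2009, "cultivar": "A", "yield": 70.0},
    {"trial": "T3", "year": 2009, "cultivar": "B", "yield": 66.0},
]
year_folds = list(leave_one_group_out(records, "year"))
trial_folds = list(leave_one_group_out(records, "trial"))
calibration, test = one_third_split(records)
```

Holding out a whole year is the most demanding scheme, since no trial from the test year contributes to calibration.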

16/30

Cross-validation

Objective: assess the ability of GxE models to order cultivars and compare their performances in a given environment.

For each environment:
• Spearman correlation between predicted and observed yield (Spear)
• For all pairs of cultivars:
– RMSEP of yield differences between cultivars (RMSEP)
– Type III error: percentage of cases where the model does not order the cultivars correctly (ETIII-0)
– Type III error with a threshold of N t/ha: percentage of cases where the model does not correctly order cultivars showing a yield difference over N t/ha (ETIII-N)
• Reference: additive model
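The three criteria can be sketched for a single environment as follows. Two points are assumptions, since the slides do not specify them: the threshold is applied to the observed yield difference, and the percentage is taken over the pairs that exceed it. The yields are hypothetical.

```python
import math
from itertools import combinations


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(obs, pred):
    """Pearson correlation computed on the ranks."""
    ro, rp = ranks(obs), ranks(pred)
    mo, mp = sum(ro) / len(ro), sum(rp) / len(rp)
    num = sum((a - mo) * (b - mp) for a, b in zip(ro, rp))
    den = math.sqrt(sum((a - mo) ** 2 for a in ro) * sum((b - mp) ** 2 for b in rp))
    return num / den


def pairwise_rmsep(obs, pred):
    """RMSEP of predicted versus observed yield differences over all pairs."""
    errors = [((pred[i] - pred[j]) - (obs[i] - obs[j])) ** 2
              for i, j in combinations(range(len(obs)), 2)]
    return math.sqrt(sum(errors) / len(errors))


def type3_error(obs, pred, threshold=0.0):
    """Percentage of wrongly ordered pairs among pairs whose observed
    difference exceeds the threshold (assumed definition)."""
    pairs = [(i, j) for i, j in combinations(range(len(obs)), 2)
             if abs(obs[i] - obs[j]) > threshold]
    if not pairs:
        return float("nan")
    wrong = sum(1 for i, j in pairs
                if (obs[i] - obs[j]) * (pred[i] - pred[j]) <= 0)
    return 100.0 * wrong / len(pairs)


# Hypothetical observed and predicted yields of 4 cultivars in one environment.
observed = [8.0, 7.0, 6.5, 9.0]
predicted = [7.5, 7.2, 6.8, 7.0]
spear = spearman(observed, predicted)
rmsep = pairwise_rmsep(observed, predicted)
et3_0 = type3_error(observed, predicted)
et3_n = type3_error(observed, predicted, threshold=1.5)
```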

17/30

Results

18/30

Wheat dataset

• 8 years (2003-2010)

• Trials treated against diseases, insects and weeds

• 46 agro-climatic covariates calculated by environment and cultivar

• 30 cultivars, not all tested in every trial

→ 50% of missing data

19/30

Results according to cross-validation

Trial CV

| Method | RMSEP | Spear | ETIII-0 | ETIII-1 | ETIII-3 |
|---|---|---|---|---|---|
| RF (Forward) | 7.05 | 0.51 | 30.5% | 24.9% | 16.6% |
| PLS | 6.71 | 0.58 | 27.7% | 22.4% | 14.3% |
| Additive model | 7.34 | 0.42 | 33.5% | 28.2% | 19.4% |

1/3 – 2/3 CV

| Method | RMSEP | Spear | ETIII-0 | ETIII-1 | ETIII-3 |
|---|---|---|---|---|---|
| RF (Forward) | 7.20 | 0.52 | 30.4% | 25.0% | 16.6% |
| PLS | 7.63 | 0.45 | 33.5% | 27.9% | 18.5% |
| Additive model | 7.40 | 0.43 | 33.4% | 28.1% | 19.3% |

20/30

Cross validation over the Years

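CV 2010

| Method | RMSEP | Spear | ETIII-0 | ETIII-1 | ETIII-3 |
|---|---|---|---|---|---|
| RF (Forward) | 6.7 | 0.41 | 33.9% | 29.4% | 20.2% |
| PLS | 6.79 | 0.38 | 35.6% | 30.5% | 21.0% |
| Additive model | 6.69 | 0.42 | 33.7% | 28.9% | 19.8% |

CV 2009

| Method | RMSEP | Spear | ETIII-0 | ETIII-1 | ETIII-3 |
|---|---|---|---|---|---|
| RF (Forward) | 7.14 | 0.29 | 40.0% | 34.0% | 23.4% |
| PLS | 6.95 | 0.38 | 37.0% | 31.3% | 21.1% |
| Additive model | 6.96 | 0.35 | 37.8% | 31.9% | 21.7% |

CV 2008

| Method | RMSEP | Spear | ETIII-0 | ETIII-1 | ETIII-3 |
|---|---|---|---|---|---|
| RF (Forward) | 9.2 | 0.33 | 37.2% | 33.3% | 25.3% |
| PLS | 9.13 | 0.29 | 39.7% | 35.6% | 27.5% |
| Additive model | 9 | 0.33 | 37.6% | 33.9% | 26.1% |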

21/30

Conclusion

• Same results observed on:
– Sunflower
– Beet
– Potato

• Variability in the results according to the situation considered

• GxE models are not significantly more predictive than additive models

• Same results observed by using “sparse PLS” and “PLS spline”

• GxE models under study using environmental covariates should not be used to recommend cultivars

[Figure: Spearman correlation of the GEI model (Random Forest), on the vertical axis, plotted against the Spearman correlation of the additive model, on the horizontal axis; model efficiency: 0.1]

22/30

Trying to explain the low predictive accuracy of GEI models

| Cause | Hypothesis | Method | Result |
|---|---|---|---|
| Covariates | Presence of diseases in treated trials not taken into account in models | Introducing cultivar tolerance to disease as covariates in the models | V |
| Covariates | Imprecise covariates selected | Check correlation between predictive value and distance to meteorological station | X |
| Cultivars | Better predictive values for cultivars assessed in more experiments and under a larger range of environmental conditions | Check correlation between predictive value of cultivars and their number of occurrences in MET or D-optimality criterion | X |
| MET | Lower predictive value when the calibration dataset is very different from the test one | Check distance in terms of environmental conditions between calibration and test datasets (Mahalanobis distance) | X |
| Statistical model | Statistical models are not able to simulate complex and dynamic mechanisms | Simulate data with a crop model integrating a cultivar component | … |

23/30

Dataset and modelling plan

[Diagram: dataset simulated with the SUNFLO crop model]

• Inputs: cultivar (32), cropping practices (1), soil (30 sites), climate (5 years: 2008-2012)
• Simulated variables: yield (Yij), development stages, biomass, N content, water status
• Environmental covariates (Xnj)
• Annotations: known without error; balanced dataset; no weeds, diseases or insects

24/30

Cultivars under study

• Virtual cultivars defined as the combinations of 5 phenotypic inputs of the SUNFLO model.

• Two levels possible for each phenotypic variable corresponding to minimum and maximum values defined by experts

• 2^5 = 32 cultivars contrasted for their tolerance to water stress (WS)

Phenotypic inputs:
– Area of the biggest leaf at flowering
– Sum of T° from emergence to maturity
– Threshold over which foliar growth is impacted by WS
– Threshold over which stomatal functioning is impacted by WS
– Rank of the widest leaf at flowering

Level codes, one letter per input:
min: a l e c h
max: A L E C H
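The 32 virtual cultivars are the full factorial of five two-level inputs. The sketch below enumerates them with the letter codes of the slide, lowercase for the minimum level and uppercase for the maximum; which letter corresponds to which input is not stated here, so the letters are kept as opaque codes.

```python
from itertools import product

TRAIT_CODES = "alech"


def virtual_cultivars(codes=TRAIT_CODES):
    """Map each cultivar label (e.g. 'aLeCh') to its min/max level per code."""
    cultivars = {}
    for levels in product(("min", "max"), repeat=len(codes)):
        label = "".join(code.upper() if level == "max" else code
                        for code, level in zip(codes, levels))
        cultivars[label] = dict(zip(codes, levels))
    return cultivars


cultivars = virtual_cultivars()
n_cultivars = len(cultivars)
```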

25/30

Environmental covariates

• 42 covariates measuring, for each environment:
– Water stress
– Nitrogen stress
– Radiation supply
– Temperature supply
• during:
– Vegetative period
– Flowering period
– Grain filling period

26/30

Simulated dataset

[Figure: boxplots of the yields simulated by cultivar (q/ha), with the mean, for the 32 cultivars labelled alech to ALECH; axis ticks from 20 to 50]

[Figure: variance shares: E 88%, G 6%, GxE 6%]

• A slight over-estimation of yields in comparison with yields observed in real MET but a range of yields in accordance with expertise

• GEI smaller than those observed in real MET

27/30

Results for a CV over the years

| Method | Spear [part of GEI predicted] | RMSEP | ETIII-1 (%) | ETIII-3 (%) |
|---|---|---|---|---|
| Additive model | 0.73 [0%] | 1.87 | 16.60 | 4.81 |
| RF (Backward) | 0.80 [26%] | 1.52 | 12.22 | 2.72 |
| Factorial regression | 0.82 [33%] | 1.48 | 11.42 | 2.48 |
| PLS | 0.83 [37%] | 1.41 | 10.43 | 2.27 |

• All the GEI models have better predictive values than additive model

• The part of GEI predicted by statistical models remains limited

28/30

Results for a CV over the years

[Figure: Spearman correlation of the GEI model plotted against that of the additive model, for RF (Backward) and for PLS]

• Compared to PLS, Random Forest shows fewer situations where the predictive quality is below that of the additive model

29/30

Conclusion

• The statistical GEI models under study using environmental covariates are not significantly better than additive ones

→ They should not be used to recommend cultivars or to fill in missing data in cultivar datasets

• How can GEI models be improved?

‒ Better characterization of individual trials

‒ Test of various covariates and development period to quantify the environmental stresses and resources over the cropping period

‒ Test of interaction between covariates in the case of PLS regression

‒ Increase the number of trials and their variability in terms of environmental conditions

‒ Model Gene x Environment interactions (Genes assessed in more environments than cultivars)

‒ …

• To be continued

‒ European project in preparation: ANOVA (A NOvel approach for Variety Assessment (in Europe))

‒ …

30/30
