Assessing the predictive accuracy of various statistical methods that use environmental covariates to model Genotype x Environment Interactions in multi-environment trials
A. Gauffreteau, G. Grignon, P. Pachot, J. Lorgeou, F. Piraux, F. Maupas, H. Escriou, C. Pontet, F. Salvi
Why predict GEI?
2/30
Context
• Choice of cultivars adapted to their cropping environments
• Need for cultivars that are:
  – Diversified and adapted to a large range of environmental conditions
  – Described: average performance and response to the various environmental conditions they can meet in the field
→ Modeling the GEI in multi-environment trials (MET) carried out during the last steps of selection, the registration process and the post-registration cultivar assessment
Climate change, input reduction, diversification of cropping environments
→ Higher and more variable biotic and abiotic stresses
3/30
GEI in MET
$Y_{ijk} = \mu + G_i + E_j + (G \times E)_{ij} + \varepsilon_{ijk}$
Dataset: 9 genotypes, 2 years, 3 cropping practices
[Figure: data for 2001 and 2002, with a pie chart of the variance partition; slices: 71%, 12%, 13%, 4%; legend: Env, Var, Var x Env, ε]
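The partition shown on this slide can be illustrated with a small sum-of-squares decomposition. This is a minimal sketch that assumes a balanced table of cell means; the slides do not state how their percentages were computed, and the function name `variance_shares` and the yield values are hypothetical.

```python
def variance_shares(cell_means):
    """Split the sum of squares of a balanced genotype x environment table.

    cell_means[i][j] is the mean yield of genotype i in environment j.
    Returns the share of G, E and GxE in the total sum of squares.
    """
    n_g, n_e = len(cell_means), len(cell_means[0])
    grand = sum(map(sum, cell_means)) / (n_g * n_e)
    # Main effects: deviations of row and column means from the grand mean
    g_eff = [sum(row) / n_e - grand for row in cell_means]
    e_eff = [sum(row[j] for row in cell_means) / n_g - grand for j in range(n_e)]
    ss_g = n_e * sum(g * g for g in g_eff)
    ss_e = n_g * sum(e * e for e in e_eff)
    # Interaction: what remains once both main effects are removed
    ss_gxe = sum(
        (cell_means[i][j] - grand - g_eff[i] - e_eff[j]) ** 2
        for i in range(n_g)
        for j in range(n_e)
    )
    total = ss_g + ss_e + ss_gxe
    return {"G": ss_g / total, "E": ss_e / total, "GxE": ss_gxe / total}


# Illustrative yields (made up): 2 genotypes x 3 environments
table = [[60.0, 80.0, 70.0],
         [55.0, 85.0, 64.0]]
shares = variance_shares(table)
```

With replicated plots, the residual term ε would be estimated from the within-cell variation, which this sketch leaves out.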
4/30
Importance of GEI in MET
Species      Environment   Genotype   GEI    Reference/Study
Soft wheat   50 to 80%     10%        25%    Lecomte, 2005
Soft wheat   80%           7%         13%    Arvalis (on average)
Sunflower    55%           22%        12%    Casadebaig, 2008
Sunflower    70%           3%         9%     Cetiom (on average)
→ GEI effect: from 0.5 to 2 times as large as the genotype effect
5/30
Modeling GEI to
• Assess the ability of cultivars to maintain their performance in various environments (more or less stressful) in comparison to other cultivars (dynamic stability)
→ Minimizing GEI (Wricke's ecovalence…)
• Define the trials where each variety performs best
→ Analysing the GEI matrix (AMMI…)
• Characterize the cultivars (resistance to environmental stress and ability to make use of environmental resources)
→ Explaining GEI by using environmental covariates in models (factorial regression…)
• Recommend cultivars for untested conditions
• Assess the stability (dynamic or static) of a cultivar under untested conditions
• Fill in missing data in cultivar datasets: identify groups of cultivars with complementary GEI that could be recommended together for better dynamic stability
→ Predicting GEI with models
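As an illustration of the first approach, Wricke's ecovalence of a cultivar is the sum over environments of its squared interaction terms. The sketch below assumes a complete table of cell means; the function name and the yield values are hypothetical.

```python
def ecovalence(table):
    """Wricke's ecovalence per genotype for a complete table of cell means.

    table[i][j] is the mean yield of genotype i in environment j.
    W_i = sum_j (y_ij - mean_i. - mean_.j + grand_mean)^2
    """
    n_g, n_e = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_g * n_e)
    row_means = [sum(row) / n_e for row in table]
    col_means = [sum(row[j] for row in table) / n_g for j in range(n_e)]
    return [
        sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2
            for j in range(n_e))
        for i in range(n_g)
    ]


# Illustrative yields (made up): genotype 2 follows the environment mean exactly
table = [[50.0, 60.0, 70.0],
         [53.0, 60.0, 67.0],
         [56.0, 60.0, 64.0]]
w = ecovalence(table)
```

A cultivar with an ecovalence of zero tracks the mean response of all cultivars across environments, which is what dynamic stability refers to here.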
6/30
The methods implemented to model GEI
7/30
Modeling GEI using environmental covariates
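$Y_{ijk} = \mu + G_i + E_j + (G \times E)_{ij} + \varepsilon_{ijk}$
with
$E_j = \alpha_1 X_{1j} + \dots + \alpha_n X_{nj} + E'_j$
$(G \times E)_{ij} = \beta_{1i} X_{1j} + \dots + \beta_{ni} X_{nj} + (G \times E)'_{ij}$
[Figure: Y plotted against a covariate X, with lines for Gen A, Gen B and the mean genotype; annotations: G, +βA, -βB]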
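To make the role of the genotype-specific slopes concrete, here is a minimal sketch for a single covariate and a complete table of cell means. It estimates each slope by regressing the interaction terms of a genotype on the centered covariate. The function name, the covariate values and the yields are hypothetical, and this is not the estimation procedure used by the authors.

```python
def genotype_slopes(table, x):
    """Least-squares slopes beta_i for one environmental covariate.

    table[i][j]: mean yield of genotype i in environment j (complete table).
    x[j]: value of the covariate in environment j.
    """
    n_g, n_e = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n_g * n_e)
    row_means = [sum(row) / n_e for row in table]
    col_means = [sum(row[j] for row in table) / n_g for j in range(n_e)]
    x_bar = sum(x) / n_e
    xc = [v - x_bar for v in x]          # centered covariate
    sxx = sum(v * v for v in xc)
    slopes = []
    for i in range(n_g):
        # Interaction terms of genotype i (main effects removed)
        inter = [table[i][j] - row_means[i] - col_means[j] + grand
                 for j in range(n_e)]
        slopes.append(sum(r * c for r, c in zip(inter, xc)) / sxx)
    return slopes


# Illustrative data (made up): x could be a stress index per environment
table = [[50.0, 60.0, 70.0],
         [53.0, 60.0, 67.0],
         [56.0, 60.0, 64.0]]
x = [-1.0, 0.0, 1.0]
beta = genotype_slopes(table, x)
```

The slopes sum to zero across genotypes, so each one reads as a deviation from the mean response to the covariate.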
8/30
Environmental covariates (soft winter wheat case)
1 covariate = 1 environmental stress x 1 development period (Meynard and Sebillotte, 1980)
[Figure: timeline of development periods up to maturity, with example environmental covariates: sum of radiation, T° max > 25°C, sum of P-ETP > 0, sum of ETR-ETM]
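Environmental covariates
• Development stages depend on cultivar earliness
[Figure: covariate X over time, with development periods differing between cultivars G1, G2 and G3]
• Xij: covariates are calculated by environment j and cultivar i
  → βi: intrinsic cultivar tolerance
• Xj: covariates are calculated by environment j
  → βi: intrinsic cultivar tolerance + escape
$Y_{ijk} = \mu + G_i + (\alpha_1 X_{1j} + \dots + \alpha_n X_{nj} + E'_j) + (\beta_{1i} X_{1j} + \dots + \beta_{ni} X_{nj} + (G \times E)'_{ij}) + \varepsilon_{ijk}$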
10/30
Statistical methods
$Y_{ijk} = \mu + G_i + \sum_n \alpha_n X_{nj} + E'_j + \sum_n \beta_{ni} X_{nj} + (G \times E)'_{ij} + \varepsilon_{ijk}$
• One-step factorial regression with forward selection of environmental covariates according to AIC:
  – X strongly correlated
    → Problem for variable selection (unstable)
    → Problem for parameter estimation
    → Low predictive value
• Two-step procedure:
  – Mixed model: $Y_{ijk} = \mu + G_i + E_j + (G \times E)_{ij} + \varepsilon_{ijk}$
  – Analysis of $(G \times E)_{ij}$ by
    • Centered and scaled PLS regression
    • Random Forest
11/30
Statistical methods: PLS regression
• The principle:
  – Look for orthogonal components $t_h = X a_h$ maximizing $\mathrm{cov}(X a_h, Y)$: they explain X and are correlated with Y
  – Regress Y on the components $t_h$
  – Express the regression in terms of the initial covariates X
  – Number of components chosen by cross-validation
• The advantages:
  – Deals with correlated X
  – Deals with a large number of X (can be larger than the number of individuals)
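The first three steps of the principle can be sketched for a single response and a single component. This is a simplified illustration, not the authors' implementation: it assumes centered data, keeps only the first component (so no deflation and no cross-validation), and the function name and data are hypothetical.

```python
from math import sqrt


def pls1_one_component(X, y):
    """One-component PLS regression of y on X (both assumed centered).

    Returns (a, b): the unit weight vector a defining the component t = X.a,
    and the coefficients b on the original covariates, so that y_hat = X.b.
    """
    n, p = len(X), len(X[0])
    # Under ||a|| = 1, cov(X.a, y) is maximized by a proportional to X'y
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    norm = sqrt(sum(v * v for v in xty))
    a = [v / norm for v in xty]
    # Component scores t = X.a
    t = [sum(X[i][j] * a[j] for j in range(p)) for i in range(n)]
    # Regress y on t (no intercept, since the data are centered)
    q = sum(ti * yi for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
    # Back to the original covariates: y_hat = t.q = X.(q.a)
    b = [q * aj for aj in a]
    return a, b


# Two perfectly collinear covariates: ordinary least squares has no unique
# solution here, while the PLS component is still well defined.
X = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
y = [-2.0, 0.0, 2.0]
a, b = pls1_one_component(X, y)
```

In this toy case the coefficient is shared equally between the two collinear covariates, which shows how PLS copes with correlated X.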
12/30
Statistical methods: PLS regression
• Structure of data
VAR    ENV    VARxENV   X1xVAR1   X1xVAR2   X2xVAR1   X2xVAR2
var1   env1   int11     X11       0         X21       0
var1   env2   int12     X12       0         X22       0
var1   env3   int13     X13       0         X23       0
var2   env1   int21     0         X11       0         X21
var2   env2   int22     0         X12       0         X22
var2   env3   int23     0         X13       0         X23
• Variable selection:
  – Start from the model with all the variables
  – Successive removal of variables according to their absolute coefficient (from the smallest to the largest)
  – At each step, calculation of the RMSEP of the model by cross-validation
  – Choice of the model (and the variables) that minimizes RMSEP
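The data structure on this slide, with one column per covariate x variety pair, can be built as in the sketch below. The function name, the dictionary layout and the covariate values are hypothetical choices made for the illustration.

```python
def interaction_columns(varieties, environments, covariates):
    """Build one column per (covariate, variety) pair.

    covariates[name][env] is the value of covariate `name` in environment
    `env`. For the row (variety v, environment e), column "<name>x<v>" holds
    that value and the columns of all other varieties hold 0.
    """
    header = [f"{x}x{v}" for x in covariates for v in varieties]
    rows = []
    for v in varieties:
        for e in environments:
            row = [covariates[x][e] if v2 == v else 0.0
                   for x in covariates for v2 in varieties]
            rows.append((v, e, row))
    return header, rows


# Illustrative values (made up)
varieties = ["var1", "var2"]
environments = ["env1", "env2", "env3"]
covariates = {
    "X1": {"env1": 1.1, "env2": 1.2, "env3": 1.3},
    "X2": {"env1": 2.1, "env2": 2.2, "env3": 2.3},
}
header, rows = interaction_columns(varieties, environments, covariates)
```

Each variety then gets its own coefficient for each covariate, which is how one PLS regression on this matrix yields variety-specific responses.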
13/30
Statistical methods: Random Forest
• The principle:
  – Build independent decision trees: each tree is a set of dichotomies (nodes), each splitting one group into 2
  – Each tree is trained on a random sample of n data drawn with replacement: the training dataset
  – For each node of the tree: a subset of variables is randomly sampled and tested; the variable that minimizes the intra-group inertia is selected
  – Prediction of each out-of-bag datum
  – Prediction of the random forest = mean of the predictions of the individual trees
  → MSEP calculated on the random forest predictions
• Variable importance:
  – Random permutation of a variable
  – Calculation of the increase in MSEP due to the permutation (IncMSEP)
  – The bigger the increase in MSEP, the more important the variable
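The permutation-based importance can be sketched independently of the forest itself. In the example below the fitted model is replaced by a hypothetical stand-in predictor, and the error is computed on one held-out set rather than on the out-of-bag data of each tree, so this only illustrates the IncMSEP idea.

```python
import random


def mse(predict, rows, y):
    """Mean squared error of predictions on the given rows."""
    return sum((predict(r) - yi) ** 2 for r, yi in zip(rows, y)) / len(y)


def permutation_importance(predict, rows, y, n_vars, seed=0):
    """Increase in MSE when each variable is randomly permuted in turn."""
    rng = random.Random(seed)
    base = mse(predict, rows, y)
    increases = []
    for k in range(n_vars):
        column = [r[k] for r in rows]
        rng.shuffle(column)  # break the link between variable k and y
        permuted = [r[:k] + [v] + r[k + 1:] for r, v in zip(rows, column)]
        increases.append(mse(predict, permuted, y) - base)
    return increases


# Made-up data: y depends on variable 0 only
rows = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * r[0] for r in rows]


def predict(row):
    """Hypothetical stand-in for a fitted forest; it ignores variable 1."""
    return 2.0 * row[0]


inc = permutation_importance(predict, rows, y, n_vars=2)
```

Permuting a variable that the model does not use leaves the error unchanged, so its importance is zero.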
14/30
Statistical methods: Random Forest
• The advantages:
  – Deals with interactions between covariates
  – Deals with non-linear relationships
  – Deals with a large number of X (can be larger than the number of data)
• Variable selection:
  – Forward:
    • Start from a model with cultivar as the only variable
    • Successive introduction of the environmental covariates with maximum IncMSEP
    • Stop when the introduction of a new variable no longer decreases the MSEP of the model
  – Backward:
    • Start from the model with all the variables
    • Successive removal of the environmental covariates with minimum IncMSEP
    • Stop when the removal of a further variable no longer decreases the MSEP of the model
15/30
Cross-validation
Scheme                             Test dataset   Calibration dataset
Cross-validation over the trials   1 trial        N - 1 trials
Cross-validation over the years    1 year         N - 1 years
1/3 – 2/3 cross-validation         1/3            2/3
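The three schemes differ only in how the test set is chosen. A minimal sketch, assuming each observation is a record carrying its trial and year, and assuming that the 1/3 test set is drawn at random over observations (the slides do not specify this); the function and field names are hypothetical.

```python
import random


def leave_one_group_out(records, key):
    """Yield (group, calibration, test): one fold per trial or per year."""
    for group in sorted({r[key] for r in records}):
        test = [r for r in records if r[key] == group]
        calibration = [r for r in records if r[key] != group]
        yield group, calibration, test


def one_third_split(records, seed=0):
    """Random split: 2/3 for calibration, 1/3 for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 3
    return shuffled[cut:], shuffled[:cut]


# Made-up records: 3 trials over 2 years, 2 cultivars per trial
records = [
    {"trial": trial, "year": year, "cultivar": cultivar}
    for trial, year in [("T1", 2008), ("T2", 2008), ("T3", 2009)]
    for cultivar in ("A", "B")
]
year_folds = list(leave_one_group_out(records, "year"))
trial_folds = list(leave_one_group_out(records, "trial"))
calibration, test = one_third_split(records)
```

Leaving out a whole year is the hardest case, since the climate of the test year is absent from the calibration set.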
16/30
Cross-validation
Objective: assess the ability of GxE models to rank cultivars and compare their performances in a given environment.
For each environment:
• Spearman correlation between predicted and observed yield (Spear)
• For all pairs of cultivars:
  – RMSEP of yield differences between cultivars (RMSEP)
  – Type III error: percentage of cases where the model does not rank the cultivars correctly (ETIII-0)
  – Type III error with a threshold of N t/ha: percentage of cases where the model does not correctly rank cultivars showing a yield difference over N t/ha (ETIII-N)
• Reference: additive model
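The criteria computed in each environment can be sketched as follows. Two points are assumptions of this illustration, not statements from the slides: the threshold of ETIII-N is applied to the observed yield difference, and a predicted tie counts as a ranking error. Function names and yields are hypothetical.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(obs, pred):
    """Spearman correlation = Pearson correlation of the ranks."""
    a, b = ranks(obs), ranks(pred)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a)
                      * sum((y - mb) ** 2 for y in b))


def pairwise_scores(obs, pred, threshold=0.0):
    """RMSEP of pairwise yield differences, and Type III error rate among
    pairs whose observed difference exceeds the threshold."""
    squared, wrong, counted = [], 0, 0
    for i, j in combinations(range(len(obs)), 2):
        d_obs, d_pred = obs[i] - obs[j], pred[i] - pred[j]
        squared.append((d_pred - d_obs) ** 2)
        if abs(d_obs) > threshold:
            counted += 1
            if d_obs * d_pred <= 0:  # wrong order (or predicted tie)
                wrong += 1
    rmsep = sqrt(sum(squared) / len(squared))
    return rmsep, (wrong / counted if counted else float("nan"))


# Made-up yields of 4 cultivars in one environment
obs = [80.0, 75.0, 70.0, 60.0]
pred = [78.0, 76.0, 71.0, 72.0]
rho = spearman(obs, pred)
rmsep, et3_0 = pairwise_scores(obs, pred)
_, et3_10 = pairwise_scores(obs, pred, threshold=10.0)
```

Raising the threshold restricts the error rate to pairs of cultivars that differ clearly, which are the comparisons that matter most for a recommendation.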
17/30
Results
18/30
Wheat dataset
• 8 years (2003-2010)
• Trials treated against diseases, insects and weeds
• 46 agro-climatic covariates calculated by environment and cultivar
• 30 cultivars, not all tested in every trial
→ 50% of missing data
19/30
Results according to cross-validation

Trial CV
Method           RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
RF (Forward)     7.05    0.51    30.5%     24.9%     16.6%
PLS              6.71    0.58    27.7%     22.4%     14.3%
Additive model   7.34    0.42    33.5%     28.2%     19.4%

1/3 – 2/3 CV
Method           RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
RF (Forward)     7.20    0.52    30.4%     25.0%     16.6%
PLS              7.63    0.45    33.5%     27.9%     18.5%
Additive model   7.40    0.43    33.4%     28.1%     19.3%
20/30
Cross-validation over the years

CV 2010
Method           RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
RF (Forward)     6.7     0.41    33.9%     29.4%     20.2%
PLS              6.79    0.38    35.6%     30.5%     21.0%
Additive model   6.69    0.42    33.7%     28.9%     19.8%

CV 2009
Method           RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
RF (Forward)     7.14    0.29    40.0%     34.0%     23.4%
PLS              6.95    0.38    37.0%     31.3%     21.1%
Additive model   6.96    0.35    37.8%     31.9%     21.7%

CV 2008
Method           RMSEP   Spear   ETIII-0   ETIII-1   ETIII-3
RF (Forward)     9.2     0.33    37.2%     33.3%     25.3%
PLS              9.13    0.29    39.7%     35.6%     27.5%
Additive model   9       0.33    37.6%     33.9%     26.1%
21/30
Conclusion
• Same results observed on:
  – Sunflower
  – Beet
  – Potato
• Variability in the results according to the situation considered
• GxE models are not significantly more predictive than additive models
• Same results observed by using “sparse PLS” and “PLS spline”
• GxE models under study using environmental covariates should not be used to recommend cultivars
[Figure: Spearman correlation of the GEI model (Random Forest) plotted against the Spearman correlation of the additive model, axes from -0.2 to 0.8; model efficiency: 0.1]
22/30
Trying to explain the low predictive accuracy of GEI models
Cause: Covariates
  Hypothesis: Presence of diseases in treated trials not taken into account in the models
  Method: Introduce cultivar tolerance to disease as covariates in the models
  Result: V
Cause: Covariates
  Hypothesis: Imprecise covariates selected
  Method: Check the correlation between predictive value and distance to the meteorological station
  Result: X
Cause: Cultivars
  Hypothesis: Better predictive values for cultivars assessed in more experiments and under a larger range of environmental conditions
  Method: Check the correlation between the predictive value of cultivars and their number of occurrences in MET or the D-optimality criterion
  Result: X
Cause: MET
  Hypothesis: Lower predictive value when the calibration dataset is very different from the test one
  Method: Check the distance in terms of environmental conditions between calibration and test datasets (Mahalanobis distance)
  Result: X
Cause: Statistical model
  Hypothesis: Statistical models are not able to simulate complex and dynamic mechanisms
  Method: Simulate data with a crop model integrating a cultivar component
  Result: …
23/30
Dataset and modelling plan
[Diagram: the SUNFLO crop model with cultivar (32), cropping practices (1), soil (30 sites) and climate (5 years: 2008-2012); labelled quantities: yield (Yij), environmental covariates (Xnj), development stages, biomass, N content, water status. Notes: known without error; balanced dataset; no weeds, diseases or insects]
24/30
Cultivars under study
• Virtual cultivars defined as the combinations of 5 phenotypic inputs of the SUNFLO model.
• Two levels possible for each phenotypic variable corresponding to minimum and maximum values defined by experts
• 2^5 = 32 cultivars contrasted for their tolerance to water stress (WS)

Phenotypic input                                              min   max
Area of biggest leaf at flowering                             a     A
Sum of T° from emergence to maturity                          l     L
Threshold over which foliar growth is impacted by WS          e     E
Threshold over which stomatal functioning is impacted by WS   c     C
Rank of the widest leaf at flowering                          h     H
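The 32 virtual cultivars are the full factorial combination of two levels for each of the 5 phenotypic inputs. A minimal sketch that enumerates them with the letter coding used on the slides (lowercase for the minimum level, uppercase for the maximum); the function name is hypothetical.

```python
from itertools import product

TRAITS = "alech"  # one letter per phenotypic input, as in the cultivar labels


def virtual_cultivars(traits=TRAITS):
    """All min/max combinations: lowercase = min level, uppercase = max."""
    names = []
    for levels in product((False, True), repeat=len(traits)):
        names.append("".join(letter.upper() if is_max else letter
                             for letter, is_max in zip(traits, levels)))
    return names


cultivars = virtual_cultivars()
```

The list runs from `alech` (all inputs at their minimum) to `ALECH` (all at their maximum).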
25/30
Environmental covariates
• 42 covariates measuring, for each environment:
  – Water stress
  – Nitrogen stress
  – Radiation supply
  – Temperature supply
• during the:
  – Vegetative period
  – Flowering period
  – Grain filling period
26/30
Simulated dataset
[Figure: boxplots of simulated yield (q/ha) by cultivar for the 32 virtual cultivars, from alech to ALECH, with the mean marked; yield axis from 20 to 50]
[Figure: pie chart of the variance partition: E 88%, G 6%, GxE 6%]
• A slight over-estimation of yields compared with yields observed in real MET, but a range of yields in accordance with expertise
• GEI smaller than that observed in real MET
27/30
Results for a CV over the years
Method                 Spear [part of GEI predicted]   RMSEP   ETIII-1 (%)   ETIII-3 (%)
Additive model         0.73 [0%]                       1.87    16.60         4.81
RF (Backward)          0.80 [26%]                      1.52    12.22         2.72
Factorial regression   0.82 [33%]                      1.48    11.42         2.48
PLS                    0.83 [37%]                      1.41    10.43         2.27
• All the GEI models have better predictive values than the additive model
• The part of GEI predicted by the statistical models remains limited
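The slides do not say how the bracketed "part of GEI predicted" is obtained. The values are, however, consistent with the gain in Spearman correlation over the additive model divided by the largest possible gain; the sketch below checks this reading, which is an inference from the numbers and not a definition given by the authors.

```python
def share_of_gei_predicted(spear_model, spear_additive):
    """Gain over the additive model as a fraction of the maximum gain.

    Assumed reading of the bracketed values, not confirmed by the slides.
    """
    return (spear_model - spear_additive) / (1.0 - spear_additive)


SPEAR_ADDITIVE = 0.73
shares = {
    name: round(100 * share_of_gei_predicted(rho, SPEAR_ADDITIVE))
    for name, rho in [("RF (Backward)", 0.80),
                      ("Factorial regression", 0.82),
                      ("PLS", 0.83)]
}
```

Under this reading, the rounded shares reproduce the 26%, 33% and 37% reported for the three GEI models.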
28/30
Results for a CV over the years
[Figure: Spearman correlation of the GEI model plotted against that of the additive model, in two panels: RF (Backward) and PLS]
• Compared to PLS, Random Forest shows fewer situations where the predictive quality falls below that of the additive model
29/30
Conclusion
• The statistical GEI models under study using environmental covariates are not significantly better than additive ones
→ They should not be used to recommend cultivars or to fill in missing data in cultivar datasets
• How to improve GEI models?
‒ Better characterization of individual trials
‒ Test various covariates and development periods to quantify the environmental stresses and resources over the cropping period
‒ Test interactions between covariates in the case of PLS regression
‒ Increase the number of trials and their variability in terms of environmental conditions
‒ Model Gene x Environment interactions (genes are assessed in more environments than cultivars)
‒ …
• To be continued
‒ European project in preparation: ANOVA (A NOvel approach for Variety Assessment (in Europe))
‒ …
30/30