Modeling Crop Phenology in the US Corn Belt Using

Supplementary materials for this article are available at https:// doi.org/ 10.1007/ s13253-020-00419-x .

Modeling Crop Phenology in the US Corn BeltUsing Spatially Referenced SMOS Satellite

DataColin Lewis-Beck, Zhengyuan Zhu , Victoria Walker, and

Brian Hornbuckle

Satellite measurements follow the growth and senescence of vegetation aid in mon-itoring crop development within and across growing seasons. For example, identifyingwhen crops reach their peak growth stage or modeling the seasonal growing cycle isuseful for agronomists and climatologists. In this paper, we analyze remote sensing datafrom an intensively cultivated agricultural region in the Midwest to provide new infor-mation about crop phenology. There is both a temporal and spatial dimension to the dataas they are collected every 12 – 36 hours over regions approximately the size of a 45kmdiameter circle. We represent the measurements using a functional data approach andaccount for spatial dependence between locations through the functional curve coeffi-cients. Modeling across multiple growing years, and including growing degree days as acovariate, we estimate the timing for when crops reach their peak each season and makepredictions at unobserved locations.

Supplementary materials accompanying this paper appear online.

Key Words: Bayesian estimation; Remote sensing; SMOS; Spatial model; Stan; Vege-tation indices.

1. INTRODUCTION

The European Space Agency’s Soil Moisture and Ocean Salinity (SMOS) satellite hasrecently been shown to collect data relevant to farmers and agronomists (Hornbuckle et al.2016). The new data product, referred to as Level 2 τ in the remote sensing literature,measures the water column density of vegetation, which is proportional to the amount ofground vegetation (Jackson and Schmugge 1991). After plants emerge, the water columndensity increases, reaching a peak during the reproductive growth stage. As an example, formaize, following the third reproductive stage (R3), the plant begins to lose water, undergo

C. Lewis-Beck, University of Iowa, Iowa City, USA. Z. Zhu (B) · B. Hornbuckle Iowa State University, Ames,USAZ. Zhu (E-mail: [email protected]). V. Walker, University of Montana, Missoula, USA.

© 2020 International Biometric SocietyJournal of Agricultural, Biological, and Environmental Statistics, Volume 25, Number 4, Pages 657–675https://doi.org/10.1007/s13253-020-00419-x

657

https://doi.org/10.1007/s13253-020-00419-x

http://orcid.org/0000-0002-2266-0646

mailto:[email protected]

http://crossmark.crossref.org/dialog/?doi=10.1007/s13253-020-00419-x&domain=pdf

658 C. Lewis-Beck et al.

senescence, and dry out. Previous research has empirically found a connection betweenLevel 2 τ and the stages of crop development (Patton and Hornbuckle 2013; Lawrence et al.2014).

The goal of this paper is to use the SMOSLevel 2 τ data product (herein referred to as τ ) toprovide new information about crop phenology. There are other data sets that capture similarinformation as SMOS; for example, the Normalized Difference Vegetation Index (NDVI)and the National Agricultural Statistics Service (NASS) Crop Progress Reports. However,the SMOS satellite data are unique due to its combination of high temporal (observations arecollected approximately every 12 to 36 hours) and spatial resolution (over 10 times betterthan USDA ground-based visual surveys).

We model τ over a region in the US corn belt to answer two important questions foragronomists and climatologists. One, how can we model the seasonal pattern of crop devel-opment as measured by the SMOS satellite? Estimating (and quantify the uncertainty in) thecrop growth cycle is useful for agronomists tracking the progression of crop growth withinand across years. The second goal is to estimate the day of the year (DOY)when τ reaches itspeak. For agronomists, this day is of interest as it corresponds to the R3 reproductive stagefor maize; however, the timing is also relevant to climatologists since it corresponds to thestart of crop transpiration. The day when crops begin transpiration is an important covariatein larger climate models, and mid-season information about the timing can improve modelaccuracy (Villarreal-Guerrero et al. 2012).

As often the case with remote sensing data, the observed values of τ are quite noisy.Therefore, we first smooth the data using a kernel smoother appropriate for dependent data.Next, we apply functional principal components analysis (fPCA) to reduce the dimensionof the data and represent τ as a linear combination of principal component curves. Theprimary covariate affecting the timing of peak τ is accumulated thermal time (Hornbuckleet al. 2016). In warmer growing seasons, τ reaches its peak earlier in the summer, whereas incooler seasons τ increases more gradually as plants take longer to develop. We incorporategrowing degree days (GDD) as a covariate into a larger spatial model that accounts fordependence between curve locations. By incorporating a spatial component into our model,we are then able to predict the timing of peak τ at unobserved locations.

The remainder of this paper is organized as follows. Section 2 describes previous workusing a functional approach for modeling vegetation data. Section 3 describes the SMOSsatellite data. Section 4 describes the functional spatial model and Bayesian methods forestimation. Section 5 presents the results, compares the model estimates to ground-basedestimates from the NASS Crop Progress Report, and uses the spatial model to make pre-dictions at new locations. Section 6 contains some concluding remarks and suggestions forfuture research.

2. RELATEDWORK

There has been previouswork analyzing vegetation remote sensing data using a functionaldata approach. Liu et al. (2012) developed a factor rotation method for functional princi-pal components to decompose a time series of vegetation index measurements (EVI data)

Modeling Crop Phenology in the US Corn Belt 659

into periodic and non-periodic functional components. Separating annual versus non-annualvariation in the fPCs resulted in more stable estimates and an interpretable decompositionbetween cyclical and local trends in the curves. More recently, using the same point ref-erenced EVI data, Liu et al. (2017) introduced a parametric model to account for spatialcorrelation between fPC coefficients to improve curve reconstruction. Outside of vegetationindex measures, there are other applications of functional data analysis applied to spatial–temporal point pattern data. Gromenko et al. (2012) developed a new method to estimatethe functional mean and principal components when functional curves are observed at spa-tial locations. Baladandayuthapani et al. (2008) introduced a fully Bayesian hierarchicalapproach to model spatial correlation between regression splines. Spatial dependence wasmodeled on the spline coefficients using a Matérn covariance structure. Lastly, there arenonparametric approaches, such as Li et al. (2007), who propose a kernel-based method toestimate correlation among functions where observations are sampled at regular temporalgrids, and smoothing is performed across different spatial distances.

3. SMOS SATELLITE DATA

The SMOS satellite observes Earth’s brightness temperature via passive microwave (L-band, f = 1.4 GHz, λ = 21 cm) remote sensing (Kerr et al. 2010). In the midlatitudes,an individual location is sampled at multiple incidence angles approximately every 12 to36 hours with overpasses occurring at 6 am and 6 pm local solar time. The observations,τ , are then extracted from version 6.5.0 of the SMOS Level 2 Soil Moisture Output UserData Product. For a detailed summary of the retrieval algorithm, see Kerr et al. (2012). Afterprocessing the signal, the European Space Agency releases a daily estimate of τ averagedover a footprint roughly the shape of a circle with a diameter between 40 and 50 kilometers.We refer to these regions as pixels, and the centroid of each pixel is known. The years weanalyze are from 2011 to 2017.

In addition to reporting the value of τ (bounded between 0 and 1), SMOS records themeasurement time as a fractional day. To make time intervals consistent across years, eachfractional day is rounded to the nearest integer day.Occasionally, there are twomeasurementsper day, in which case we take the sample average. Because we are interested in modelingvegetation, we subset the data to run from May 1 to October 31, which based on cropreports, covers the growing season for corn. Thus, for a single pixel, we have 185 days(measurements) per growing season. Occasionally, due to a corrupted signal, amissing valueis reported; however, for each pixel and season combination, fewer than 3% of observationsare missing.

To minimize the influence of other crops and reduce signal noise from non-agricultureareas, 30 pixels were selected inwestern Iowa (Fig. 1). Iowa is divided into 99 approximatelyequal size counties and each pixel is approximately the size of a county. The maximumdistance between pixel centroids is 321 kilometers; theminimumdistance between centroidsis 25 kilometers. All the pixels have between 75% and 85% of their land area planted witha combination of corn (60%) and soybean (40%). But, because corn is a much larger crop,its signature dominates the SMOS signal (Hornbuckle et al. 2016).


1 2 3

4

5

6

78

9

10

11

12

13

1415

16

17

1819

20

21

22

23

2425

26

27

28

29

30

district

central

east central

north central

northeast

northwest

south central

southeast

southwest

west central

Figure 1. Centroids (dots) and footprints (45km diameter circles) for 30 SMOS pixels. Shading by USDA CropReporting District .

4. FUNCTIONAL SPATIAL MODEL

Rather than modeling τ as a series of discrete random variables, we represent each curveas a smooth continuous function over a bounded interval corresponding to one growingseason. Treating τ as a smooth curve makes sense as the water column density of plantschanges gradually as they develop, reach their peak growth stage, and then dry out. The rawdata, however, are quite noisy, so before modeling τ as a function, smoothing is required.

4.1. SMOOTHING THE DATA

Multiplemethods are available to smooth functional data. One option is to first smooth thedata using a nonparametric kernel smoother or local polynomial fitting technique. Anotherchoice is projecting observations onto a set of basis functions while adding a roughnesspenalty to control the total amount of curvature. We select a smoothing technique thataccounts for temporal dependence. Daily measurements of τ are correlated so standardsmoothing techniques, which assume uncorrelated errors, are known to under smooth thedata. The structure of the dependence, however, is unknown so we applied a smoother out-lined by Brabanter et al. (2011) that smooths time series data without requiring knowledgeof the true correlation structure. The smoother has two stages: The first selects the optimalbandwidth, and the second smooths the data. The first stage pre-smooths the data usinga third-order polynomial with a bimodal weighting kernel K (u) = 2√

πu2 exp(−u2) that


removes temporal correlation in order to select an optimal smoothing bandwidth. The opti-mal bandwidth is chosen that minimizes the residual sum of squares. After a bandwidthis calculated for each curve, the second stage smooths the original data using a Gaussiankernel with an adjusted bandwidth proportional to the bimodal bandwidth. The original andsmoothed data for two representative pixels are presented in Fig. 2.

4.2. FUNCTIONAL PRINCIPAL COMPONENTS

By the Karhunen–Loève decomposition, we can represent any zero mean curve, X (t), inL2 space, as X (t) = ∑∞

k=1 ζkφk(t)where φk are orthonormal eigenfunctions, and the ζk arefunctional principal component scores, random variables defined as

∫(X (t)−μ(t))φk(t)dt

(Ramsay et al. 2009). By construction, the scores are uncorrelated with zero expectationand variance λk . Similar to standard principal components analysis (PCA), which reducesthe dimension of data in Rp, the aim of functional PCA is to represent functional data witha smaller fixed number of basis functions that explain the majority of the variation in thecurves.

As exemplified in Fig. 2, the shape of τ over a growing season has a similar signature:a quadratic rise and fall as crops emerge, reproduce, and undergo senescence. What variesacross pixels within a year, and more clearly across years, is the day of the year of the peak.Therefore, to identify these dominant periods of variation, and reduce the dimension of thedata, we apply functional PCA to the SMOS data.

Including all seasons (t = 1 . . . T = 7) and pixels (p = 1 . . . P = 30), there are 210curves in the data set which we denote as X p,t . The support for each curve runs from

Pix 15 2015 Pix 15 2017

Pix 6 2015 Pix 6 2017

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

0.2

0.4

0.6

0.2

0.4

0.6

Month

T

Figure 2. Raw and smoothed values (lines) of τ for pixel 6 and 15 in 2015 and 2017 .


May 1 to October 31. We denote each integer day between these six months as d whered = 1 . . . D = 185. We estimate the mean function as the sample average over all pixelsand growing seasons X̄(d) = (1/PT )

∑Pp=1

∑Tt=1 X p,t (d). It is reasonable to assume

curves are independent across growing seasons, and, while there is likely correlation acrosscurves within a season, we capture this structure later using a parametric model. Next,we estimate the sample covariance matrix (for time points t and s) as C = 1/(PT −1)

∑Pp=1

∑Tt=1(X p,t (s) − X̄(s)(X p,t (t) − X̄(t)).

Estimating the functional principal components requires an eigendecomposition of thesample covariance matrix. The eigen-equation to solve is Cφ = λφ, where φ are theeigenfunctions, and λ the eigenvalues. After estimating the eigenfunctions and eigenvalues,we estimate the individual fPCA scores as ζ̂p,t,k = ∑D

d=1(X p,t (d) − X̄(d))φ̂k(d). We cannow approximate each curve from the SMOS data as follows,

X p,t (d) ≈ X̄(d) +K∑

k=1

ζ̂p,t,k φ̂k(d). (1)

The first two functional principal curves explain approximately 60% of the variation inthe curves. We can visualize what part of the curve they capture by plotting the overallmean curve and adding and subtracting 1.96 times the standard deviation of each principalcomponent curve. As exemplified below (Fig. 3), the first two fPCs capture the ramp up anddecline of τ . The third and fourth components account for variation at the beginning andend of the growing season, a period of less interest. Thus, we fix K = 2 for the rest of theanalysis.

4.3. MODELING THE FPCA SCORES

4.3.1. Analysis of fPCA Scores

The two main atmospheric variables that impact the growth of row crops are temperatureand precipitation (Hollinger and Angel 2009). To investigate if these variables are relatedto the fPCA scores, we obtained weather data from The Iowa Environmental Mesonet(IEM) (Department of Agronomy. Iowa State University, Accessed September 30, 2018).The IEM has daily data on temperature and precipitation at each pixel location. Each curvecorresponds to an entire growing season so temperature and precipitation are aggregated overeach growing season. Following the convention in agriculture, temperature is transformedto growing degree days (GDD), which is an integral approximation for the total time thetemperature is above 10 degree Celsius after plants emerge (McMaster and Wilhelm 1997).Using data from the USDA National Agricultural Statistics Service (NASS), the day whereat least 50% of the corn was reported planted is the starting day for the GDD calculation(USDA-NASS, Accessed September 30, 2018). This planting data are only available fornine districts in Iowa, so each pixel centroid is mapped to the closest district. It is unknownwhen the crops cease to grow, so to be conservative, GDD was summed until September 1for each pixel and year. Total precipitation was also calculated using the same starting andending dates as for GDD.


0.1

0.2

0.3

0.4

T

0.1

0.2

0.3

0.4

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

Month

T

Figure 3. Overall mean function (solid line). Confidence bands (dashed lines) are ±1.96 standard deviationsabove and below the mean curve for functional PC curve 1 (top), and PC curve 2 (bottom) .

Precipitation and GDD were regressed on the first two fPC scores using ordinary leastsquares. The only significant relationship was between the first fPC score and GDD. Thisis consistent with the previous research as accumulated thermal time is known to be themain cause of changes in the timing of peak τ (Hornbuckle et al. 2016). Figure 4 showsa scatterplot of GDD and fPC1 where the points are colored by year. While the interceptsvary across years, there is an overall positive trend between the two variables.

We fit a linearmodel with a random effects term for the intercept, andGDD as a covariate,to fPC1. We then examined the model residuals, as well as the scores from the other fPCs,to see if any spatial structure remains. We assume the fPCA coefficients are uncorrelatedacross scores, i.e., Cor(ζp,t, j , ζp,t,k) = 0, but spatial correlation exists for each fPCA scoreindividually across pixel locations within a given year. In Fig. 5, we plot the empiricalvariograms for fPC1, and the residuals for fPC1 after regressing GDD. Figure 6 showsthe empirical variogram for fPC2. The variograms are calculated using the Hawkins andCressie robust variogram estimator (Cressie 2015). The solid line is the average variogramacross all seven years, and the dashed lines the corresponding standard errors for the meanestimate. Even after removing the linear trend from GDD, there is spatial structure left inthe residuals for fPC1. Mild spatial correlation is also present in the second fPC scores. Weexplored directional variograms, but there was no evidence that an isotropic assumption wasinappropriate. The third and fourth functional principal components did not exhibit spatialcorrelation.


−0.5

0.0

0.5

2100 2400 2700 3000GDD

fPC

1

year2011201220132014201520162017

Figure 4. Total growing degree days plotted against fPC1 for 2011 to 2017 (colored points). Growing degreedays are accumulated from when 50% of crops are planted until September 1 .

0.00

0.05

0.10

40 80 120 160Distance (km)

Sem

ivar

ianc

e

40 80 120 160Distance (km)

Figure 5. Left: Average variogram for fPC1 across all years. Right: Average variogram of the residuals fromGDD on fPC1 across all years. Dashed lines are 95% confidence bands .


0.00

0.05

0.10

40 80 120 160Distance (km)

Sem

ivar

ianc

e

Figure 6. Average variogram for fPC2 across all years. Dashed lines are 95% confidence bands .

4.3.2. Spatial Dependence of fPCA Scores

Weassumeeachof the two functional principal component scores come from independentstationary Gaussian random fields with a Matérn covariance function (Matérn 1986). Weallow the mean function to vary across growing seasons. The mean of the first functionalprincipal component score is a linear function of GDD as well as a seasonal random effect.The second functional principal component score is modeled with a constant mean andseasonal random effect.

For the covariance parameters, although the spatial covariance could be heterogeneousfrom year to year, with only 30 locations per growing season, there are too few pixels toaccurately estimate annual spatial parameters without strongly informative prior distribu-tions. Thus, we estimate a single set of covariance parameters for all growing seasons. Wefixed the Matérn smoothness parameter to ν = 1/2, which reduces the covariance functionto C(s) = σ 2exp(−sφ), where s is the distance between locations (km), σ 2 the spatialvariance, and φ the spatial range parameter. Putting the mean and covariance functionstogether, we specify the following two models, one for each of the two functional principalcomponent scores,

ζp,t,1 = β0,1 + β0,1,t + β1GDDp,t + ηp,1 + εp,t,1

ζp,t,2 = β0,2 + β0,2,t + ηp,2 + εp,t,2 (2)

where ηp,1, ηp,2 are the spatially correlated errors, and εp,t,1, εp,t,2, are random errorswhich we assume to have independent normal distributions with variances ς2

1 and ς22 . We


let p = 1 . . . P = 30 represent the 30 pixel locations, and t = 1 . . . T = 7 indicate thegrowing seasons from 2011 to 2017.

4.3.3. Prior Distributions

We take a Bayesian approach to estimation and therefore to complete the model, werequire prior distributions on all model parameters. The β parameters are given improperuniform priors. For the spatial, and error, standard deviation terms log-normal (0, 1) priorsare selected. Other priors, such as a half-Cauchy, were tried but did not noticeably affectthe posterior. For the Matérn class of covariance functions, the spatial range parameter isdifficult to estimate (Zhang 2004). Often a weakly informative prior on the range or spatialvariances is required for identification (Banerjee et al. 2014). After reparameterizing therange parameter in terms of 1/φ, a t distribution with 4 degrees of freedom was chosen(Flaxman et al. 2015). This is slightly more informative than a half-Cauchy, but given thelack of data, priors with heavy tails resulted in multimodal posterior distributions. For thestandard deviation of the random effect terms (ε1, ε2), a half-Cauchy prior is selected assuggested by Gelman et al. (2006).

4.3.4. Estimation

Eachmodel was fit using the rstan (Guo et al. 2014) package in R (R Core Team 2020),which implementsHamiltonianMonteCarlo (HMC) (Betancourt andGirolami 2015). HMCjointly updates all model parameters by simulating energy preserving paths with randominitial momentums along the posterior density. This is done to reduce autocorrelation andefficiently explore the posterior. Four chains were run for 5,000 iterations after 2,500 warm-up iterations. The Gelman–Rubin potential scale reduction factor, R̂, was used to providea check for adequate mixing of the multiple chains. Upon convergence, R̂ converges to 1(Gelman et al. 1992). For all model parameters, the value of R̂ was less than 1.1. Plots ofthe posterior draws were inspected for parameters with the fewest effective samples; thesedid not suggest features that were inadequately explored.

5. RESULTS

After fitting the model, we examine the posterior distribution of the parameters for eachmodel. Table 1 shows 95%credible intervals (CIs) for the parametersmodeling fPC1;Table 2shows CIs for the parameters for fPC2. For the fPC1, the parameter on GDD is positiveand its credible interval does not cover zero. For each model, we can calculate the effectivepractical range (3φ) using the median value of φ from the posterior. For fPC1, the practicalrange is 264 km and for fPC2, it is 163 km.

To get an idea of how close the fitted curves match the data, we simulate draws fromthe posterior predictive distribution. For each location and year, we simulate 2,500 drawsfor fPC1 and fPC2. To reproduce the full curve, X (d), we multiply the basis functionsby the simulated coefficients and add them to the mean curve X̄(d). In Fig. 7, we show arepresentative pixel with themodel fit. The dashed line is themedian over all 2,500 simulated


Table 1. Posterior 95% Credible Intervals for fPC1 Model Parameters

Parameter 2.5 (%) 50 (%) 97.5 (%)

β0,1 −3.0 −2.02 −1.12β1 0.005 0.008 0.012σ1 0.15 0.19 0.26ς1 0.02 0.06 0.09ε1 0.01 0.16 0.43φ1 48.4 88.2 194.5

Table 2. Posterior 95% Credible Intervals for fPC2 Model Parameters

Parameter 2.5 (%) 50 (%) 97.5 (%)

β0,2 −0.16 0.02 0.20σ2 0.13 0.16 0.20ς 0.02 0.05 0.09ε2 0.10 0.19 0.45φ2 35.9 54.2 100.40

curves. The solid thick line is the smoothed curve. To get an estimate of the uncertainty inour estimates, point-wise 95% credible intervals are plotted. For most years, we capture thetiming running up to and after the main peak. The confidence bands do miss the magnitudeof peak τ certain years (e.g., 2013, and 2015). However, with only two functional principalcomponents we do not expect the model to fully capture all the local features.

We check if the model captures the spatial structure in the data using simulated replicatedata sets from the posterior predictive distribution and comparing the simulated data to theobserved data. This is a general approach recommended by Gelman et al. (1996) for modelswhere no classical goodness-of-fit test is available. We simulate 50 sets of 30 fPC1 andfPC2 scores, and for each set estimate the empirical variogram. We plot the 50 estimatedvariograms with the annual variograms from the true fPCs layered on top. The simulatedvariogram estimates are plotted with dashed lines; the true variograms for each year arethe bold solid lines. As exemplified in Figs. 8 and 9, the variograms from the true datamainly fall within the center of the simulated variograms. The dispersion in the simulatedvariograms also reflects uncertainty in the variogram as an exploratory tool to detect spatialcorrelation: especially with only 30 locations available per year.

5.1. COMPARISON TO USDA DATA

The primary feature of interest is estimating the day when τ reaches its peak value.Because we are using a Bayesian framework, calculating the posterior for any functional,such as the day when τ reaches its maximum value, is straightforward. For a fixed seasonand pixel, we define the peak as the largest value of X (d) over d. After the maximum valueof τ̂ is identified, the timing of the max is simply the corresponding day of the year (DOY).


0.00.10.20.30.40.5

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

Day

T

Pixel 14 2011

0.00.10.20.30.40.5

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

Day

T

Pixel 14 2012

0.00.10.20.30.40.5

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

Day

T

Pixel 14 2013

0.00.10.20.30.40.5

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

Day

T

Pixel 14 2014

0.00.10.20.30.40.5

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

Day

T

Pixel 14 2015

0.00.10.20.30.40.5

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

Day

T

Pixel 14 2016

0.00.10.20.30.40.5

May 1

June

1Ju

ly 1

Aug 1

Sept 1

Oct 1

Nov 1

Day

T

Pixel 14 2017

Figure 7. Median (dashed line) and point-wise 95% CI (solid lines) for the mean function for pixel 14 across allseasons. The thick black line is the smoothed nonparametric curve .

0.00

0.05

0.10

0.15

0.20

40 80 120 160Distance (km)

Vario

gram

True fPC1Y11

Y12

Y13

Y14

Y15

Y16

Y17

Figure 8. Dashed lines are the estimated variograms for 50 simulated data sets for the first fPC. Solid lines arethe estimated variograms from the true first fPC by year .


0.00

0.05

0.10

0.15

0.20

40 80 120 160Distance (km)

Vario

gram

True fPC1Y11

Y12

Y13

Y14

Y15

Y16

Y17

Figure 9. Dashed lines are the estimated variograms for 50 simulated data sets for the 2nd fPC. Solid lines arethe estimated variograms from the true 2nd fPC by year .

We estimate the day of maximum τ over 2,500 draws from the posterior and calculate themean and variance of this distribution.

The USDA divides Iowa into nine Crop Reporting Districts (Fig. 1). Each week theUSDA’s National Agriculture Statistics Service (NASS) publishes ground-based surveydata aggregated by district. Until 2013, the USDA Crop Reports provided weekly updateson the percentages of crops having reached the R3 growth stage, which corresponds to thetiming of peak τ . To compare our model to the USDA data, we summarize our estimates byCrop Reporting District as well.

Looking at the distribution of the timing of peak τ , there is more variation across yearsthan within a growing season (Tables 3, 4). Given the relative homogeneity of temperatureconditions within Iowa, a larger seasonal (as opposed to pixel location) effect makes sense.Comparing seasons, the 2012 season has the earliest estimate for the timing of peak τ . Thismatches historical data that 2012 was the hottest and driest growing season of the yearsanalyzed. Also, pixels closer to the equator (i.e., in the southwest corner of Iowa) have anearlier mean peak day than locations further north. For 2011 to 2013, we also include anapproximate 50% interval for when crops reached R3 according to the USDA. This intervalwas calculated using linear interpolation to find the DOY when 25% (and 75%) of cropsreached R3 by district. While our model underestimates the Crop Reports point estimates,the USDA reported peak timing is within two standard deviations of our model estimates.


Table 3. Posterior mean and standard deviation of the peak timing DOY summarized by Crop Reporting District

District Region Mean S.D USDA 50 (%) Mean S.D USDA 50 (%)

1 nw 230.50 6.8 (209, 222) 218.5 7.5 (196, 213)2 nc 230.6 6.8 (211, 221) 218.7 7.5 (194, 212)3 ec 228.8 6.2 (209, 219) 218.2 7.5 (189, 212)4 cent 228.9 6.2 (209, 219) 217.6 7.7 (194, 214)5 wc 228.1 5.9 (210, 221) 216.3 7.7 (195, 212)6 sw 226.6 5.2 (209, 223) 214.7 7.8 (192, 210)

Left: 2011. Right: 2012

Table 4. Posterior mean and standard deviation summarized by Crop Reporting District in 2013

District Region Mean S.D USDA 50 (%)

1 nw 237.7 7.9 (218, 237)2 nc 237.8 8.0 (222, 234)3 ec 237.3 8.0 (216, 232)4 cent 236.9 8.0 (217, 235)5 wc 235.7 8.0 (217, 232)6 sw 234.0 7.8 (216, 231)

5.2. UNCERTAINTY PROPAGATION

Because the spatialmodel is based on smoothed data (as opposed to the rawobservations),we are likely underestimating the amount of uncertainty in our reported results. Therefore,to quantify the relative error from the smoothing step (Sect. 4.1) as compared to the modelestimation step (Sect. 4.3.2), we conducted a small simulation study. Using the posteriormedians of the model parameters, we simulated b = 1 . . . B = 210 pairs of functionalprincipal component scores (ζ (b)

1 , ζ(b)2 ) from Model 2. We then generated the smoothed

curves by plugging the simulated scores into Eq. 1. Lastly, we constructed the raw data,X (b)(d), by adding Gaussian random noise, ε(d), to each of the simulated smoothed curves.We empirically estimated ε(d) for each day by averaging the sum of squared differencesbetween the 210 original raw and smoothed SMOS curves.

As in Sect. 5.1, we assumed the day with the largest value of X (b)(d) was the date ofthe true peak. To quantify the smoothing step uncertainty, we calculated the mean squarederror (MSE) between the true peak day with the estimated peak day from the smoothed data.To quantify the model uncertainty, we re-estimated the spatial model using the smootheddata and calculated the MSE between the model estimated peak day and the smoothed peakday. The MSE for the raw data was 16.90 days, compared to 12.60 days for the smoothedversion of the data. Thus, by modeling the smoothed curves, our model underestimates thetrue error in the timing of peak τ by approximately 25% or 4.30 days.


Table 5. Average bias and standard deviation for the estimated timing of peak τ for three different methods:non-parametric bootstrap, spatial model, and hold-one-out spatial prediction

Comparison 2011 2012 2013 2014 2015 2016 2017

Truth vs. Bootstrap Bias 0.8 0.2 1.3 −1.9 −0.3 0.5 −0.2SD 9.4 7.6 7.9 7.2 4.1 7.6 12.1

Truth vs. fPCA Bias 1.60 −1.2 0.1 −5.5 1.4 1.8 0.7SD 10.3 11.2 11.4 8.2 6.4 8.6 15.2

Truth vs. Spatial Pred. Bias 2.2 −7.6 1.5 −3.4 3.4 3.1 3.2SD 10.1 12.7 13.5 9.6 6.6 9.5 17.1

Top: τ for X p,t (d) vs. X p,t,boot (d). Middle: X p,t (d) vs. X̄(d)+∑2k=1 ζ̂p,t,k φ̂k (d). Bottom: X p,t (d) vs. X̄(d)+

∑2k=1 ζ̂med,p,t,k φ̂k (d)

5.3. PREDICTION

A key advantage of estimating a spatial model is the ability to predict at new locations,X0,t , by sampling from the posterior predictive density p(X0,t |X p,t , θ

(m),GDD0,t ), a con-ditional normal distribution that comes from the joint multivariate normal distribution ofX0,t and X p,t . To assess the model’s predictive ability, we hold out one pixel location, esti-mate the model, and then use posterior draws m = 1 . . . M to generate posterior predictivedraws of the two fPCA scores at the removed location. Using the mean curve and the setsof scores, we reconstruct a set of M curves at the held out location X0,t (d).

We have three sets of curves from which to compare estimates, and uncertainty, of thetiming of peak τ : first, the smoothed nonparametric curves; second, the reduced functionalPCA curves; and finally, the parametric spatial model using the fPCA scores. Because thetrue signature of τ is unknown, we use the nonparametric curves from Sect. 4.1 as a proxyfor the truth. To get an uncertainty interval for the true curve, we run a wild bootstrap using atwo point distribution for the errors as outline byMammen (1993). A wild bootstrap methodwas chosen because the residuals from the smoothed curves exhibit heteroskedasticity, withgreater variance in the middle of the growing season. The number of bootstrap replicationswas set to 1,000. For each bootstrap sample, we apply the two-stage smoother to the dataand extract the DOY when X p,t,boot(d) reaches its peak. For the fPCA representation, wesimply use the reconstructed curves from the fPCA reduction (Equation 1) to estimate theday of peak τ . For the spatial model, we estimate the day of the peak using the median dayof the peak over all M curves constructed from the posterior predictive distribution of thescores. For both methods, we compare the estimated DOY for peak τ to the nonparametricsmoother.

Table 5 presents the bias and standard deviation for eachmethod averaged across pixels bygrowing season. In general, going from the nonparametric curves, to the fPCA approxima-tion, to predictions from the spatialmodel, increases the bias and uncertainty in the estimates.To get an idea of how well our model predicts, we focus on the difference between the fPCAapproximation (row two) and the hold-one-out spatial predictions (row three). Unsurpris-ingly, there is more uncertainty with the spatial predictions, but only by around 1-2 days.The average bias is also increases by a similar magnitude except for 2012; however, 2012


1 2 3

4

5

6

7 8

9

10

11

12

13

14 15

16

17

18 19

20

21

22

23

24 25

26

27

28

29

3041

42

43

−96 −94 −92 −90longitude

latit

ude

221

222

223

224

225

226DOY

2012

Figure 10. Predicted median day of the year for peak τ for the 2012 growing season. Numbers correspond toobserved SMOS pixel locations. The predictions range from August 9 (DOY 221) to August 14 (DOY 226) .

was an usually hot summer, and as we assumed a fixed linear trend and spatial parametersfor all growing seasons, this model constraint could be hurting our predictions in years withabnormal weather. Within each growing season, we examined the bias and uncertainty bylocation to see if any spatial pattern exists in either the uncertainty or bias of the predictions;however, there was no clear evidence of spatial structure left in the predictions.

In addition to estimating the timing of peak τ at known locations, we can predict atunobserved locations. Weather data from the IEM are available at intervals of 0.25 degreeslatitude/longitude. Using this resolution, we construct a fine grid of points over north-west/central Iowa and estimate seasonal GDD at each location using the method describedin Sect. 4.3.1. We predict growth curves over the grid using posterior draws from the fittedmodel and observed temperature data at the new locations. Below are two maps showingthe median DOY for peak τ in the 2012, and the widths of the credible intervals (Figs. 10,and 11). Predictions are restricted to counties that overlap or are adjacent to observed SMOSpixels. We also exclude the counties below pixel 26 since that is the start of the Des Moinesmetro area. Although Iowa has a relatively homogeneous distribution of temperatures acrossthe state, we still see spatial structure: pixels closer to the Minnesota boarder reach theirpeak later than locations in southern parts of Iowa. There also appears to be a small tem-perature pocket in the upper northwest corner of the state. As expected, the uncertainty inour predictions increases when we predict at locations further away from the observed data.See Supplemental Material for similar plots from the other growing seasons.


1 2 3

4

5

6

7 8

9

10

11

12

13

14 15

16

17

18 19

20

21

22

23

24 25

26

27

28

29

3041

42

43

−96 −94 −92 −90longitude

latit

ude

10

15

20

Width

2012

Figure 11. Width of 95% CIs for peak τ predictions for the 2012 growing season. Numbers correspond to theobserved pixel locations .

6. DISCUSSION

This paper offers a novel functional data approach to model remote sensing vegetationtime series data. Modeling τ as random curve smooths the data and provides a parsimoniousrepresentation of the satellite signal. By reducing the dimension of the functional curvesusing fPCA, we modeled the range of the curve with the most variability: the day whenit reaches its maximum value. We incorporated accumulated thermal time (GDD) and thespatial dependence between curves to jointly model crop growth signatures from multiplegrowing seasons and locations and make predictions when crops reach their peak (R3) atunobserved locations. Although we focused on making predictions for peak timing, webelieve this approach is flexible enough to be applicable to learning about other features ofvegetation indices (e.g., the timing of an inflection point). In addition, other high temporaland spatial resolution data products, such as the recent SoilMoisture Active Passive (SMAP)satellite, are potential applications for our methodology.

We now address some important model assumptions and limitations from this analysis.First, we assume spatial dependence is constant across years rather than allowing the spatialparameters to vary across growing seasons. In addition, we reduced each curve to a linearcombination of the first and second functional principal component. While this captured theacceleration and deceleration of τ , we did not capture the magnitude of the peak or othersmall scale variation at the beginning and end of the growing season. Lastly, there are likely


other explanatory variables, such as soil quality, seed type, and time of planting that affectthe growth rate of corn (and estimates of GDD) but are not incorporated in our model.

There are a variety of future directions to build upon the current model. For example,the model could better quantify the uncertainty in our estimates if ground observationsfor validation and calibration were available. The research question then becomes how todesign an experiment where a number of ground locations are selected in an optimal wayto observe the plant growth on the ground for the purpose of validating and calibratingthe spatial functional model for the remote sensing data. In a model-based approach, themodel we developed in this paper can be used as the working model to develop appropriatedesign criteria to select spatial locations. Since our model is a spatial functional model, thedesign concepts in classical regression analysis need to be generalized to accommodate thisnew setting. We could also take a design-based approach and use our model predictionsas auxiliary data to stratify the geographical location and select samples so that efficientdesign-based estimation at a regional level can be achieved.

[Received September 2019. Accepted September 2020. Published Online October 2020.]

REFERENCES

Baladandayuthapani, V., Mallick, B. K., Hong, M. Y., Lupton, J. R., Turner, N. D., and Carroll, R. J. (2008),“Bayesian Hierarchical Spatially Correlated Functional Data Analysis with Application to Colon Carcinogen-esis,” Biometrics, 64, 64–73.

Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014), Hierarchical Modeling and Analysis for Spatial Data, CRCPress, Boca Raton, FL.

Betancourt, M. and Girolami, M. (2015), “Hamiltonian Monte Carlo for hierarchical models,” Current trends inBayesian methodology with applications, 79, 2–4.

Brabanter, K. D., Brabanter, J. D., Suykens, J. A., and Moor, B. D. (2011), “Kernel regression in the presence ofcorrelated errors,” Journal of Machine Learning Research, 12, 1955–1976.

Cressie, N. (2015), Statistics for Spatial Data, Wiley, Hoboken, NJ.

Department of Agronomy. Iowa State University (Accessed September 30, 2018), “Iowa Environmental Mesonet.”https://mesonet.agron.iastate.edu/.

Flaxman, S., Gelman, A., Neill, D., Smola, A., Vehtari, A., and Wilson, A. G. (2015), “Fast hierarchical Gaussianprocesses,” Manuscript in Progress.

Gelman, A., Meng, X.-L., and Stern, H. (1996), “Posterior predictive assessment of model fitness via realizeddiscrepancies,” Statistica Sinica, 733–760.

Gelman, A., Rubin, D. B., et al. (1992), “Inference from iterative simulation using multiple sequences,” StatisticalScience, 7, 457–472.

Gelman, A. et al. (2006), “Prior distributions for variance parameters in hierarchical models (comment on articleby Browne and Draper),” Bayesian Analysis, 1, 515–534.

Gromenko, O., Kokoszka, P., Zhu, L., and Sojka, J. (2012), “Estimation and testing for spatially indexed curveswith application to ionospheric and magnetic field trends,” The Annals of Applied Statistics, 669–696.

Guo, J., Betancourt, M., Brubaker, M., Carpenter, B., Goodrich, B., Hoffman, M., Lee, D., Malecki, M., andGelman, A. (2014), “RStan: The R interface to Stan,”.

Hollinger, S. E. and Angel, J. R. (2009), “Weather and crops,” Emerson Nafziger (compiling). Illinois AgronomyHandbook. 24th Edition. Illinois: University of Illinois at Urbana-Champaign.

https://mesonet.agron.iastate.edu/


Hornbuckle, B. K., Patton, J. C., VanLoocke, A., Suyker, A. E., Roby, M. C., Walker, V. A., Iyer, E. R., Herzmann,D. E., and Endacott, E. A. (2016), “SMOSoptical thickness changes in response to the growth and developmentof crops, crop management, and weather,” Remote Sensing of Environment, 180, 320–333.

Jackson, T. and Schmugge, T. (1991), “Vegetation effects on the microwave emission of soils,” Remote Sensing ofEnvironment, 36, 203–212.

Kerr, Y. H., Waldteufel, P., Richaume, P., Wigneron, J. P., Ferrazzoli, P., Mahmoodi, A., Al Bitar, A., Cabot, F.,Gruhier, C., Juglea, S. E., et al. (2012), “The SMOS soil moisture retrieval algorithm,” IEEE Transactions onGeoscience and Remote Sensing, 50, 1384–1403.

Kerr, Y. H., Waldteufel, P., Wigneron, J.-P., Delwart, S., Cabot, F., Boutin, J., Escorihuela, M.-J., Font, J., Reul,N., Gruhier, C., Juglea, S. E., Drinkwater, M. R., Hahne, A., Martin-Neira, M., and Mecklenburg, S. (2010),“The SMOS mission: New tool for monitoring key elements of the global water cycle,” Proceedings of theIEEE, 98, 666–687.

Lawrence, H., Wigneron, J.-P., Richaume, P., Novello, N., Grant, J., Mialon, A., Al Bitar, A., Merlin, O., Guyon,D., Leroux, D., et al. (2014), “Comparison between SMOS Vegetation Optical Depth products and MODISvegetation indices over crop zones of the USA,” Remote Sensing of Environment, 140, 396–406.

Li, Y., Wang, N., Hong, M., Turner, N. D., Lupton, J. R., and Carroll, R. J. (2007), “Nonparametric Estima-tion of Correlation Functions in Longitudinal and Spatial Data, with Application to Colon CarcinogenesisExperiments,” The Annals of Statistics, 35, 1608–1643.

Liu, C., Ray, S., and Hooker, G. (2017), “Functional principal component analysis of spatially correlated data,”Statistics and Computing, 27, 1639–1654.

Liu, C., Ray, S., Hooker, G., and Friedl, M. (2012), “Functional Factor Analysis for Periodic Remote SensingData,” The Annals of Applied Statistics, 6, 601–624.

Mammen, E. (1993), “Bootstrap and wild bootstrap for high dimensional linear models,” The Annals of Statistics,255–285.

Matérn, B. (1986), Spatial Variation, vol. 36 of Lecture Notes in Statistics, Springer Verlag, New York, NY, 2nded.

McMaster, G. S. and Wilhelm, W. (1997), “Growing degree-days: one equation, two interpretations,” Agriculturaland Forest Meteorology, 87, 291–300.

Patton, J. and Hornbuckle, B. (2013), “Initial validation of SMOS vegetation optical thickness in Iowa,” IEEE

Geoscience and Remote Sensing Letters, 10, 647–651.

R Core Team (2020), R: A Language and Environment for Statistical Computing, R Foundation for StatisticalComputing, Vienna, Austria.

Ramsay, J., Hooker, G., andGraves, S. (2009), “Exploring variation: functional principal and canonical componentsanalysis,” Functional Data Analysis with R and MATLAB, 99–115.

USDA-NASS (Accessed September 30, 2018), “NASS Weekly Crop Reports.” https://www.nass.usda.gov/Publications/Reports_By_Date/.

Villarreal-Guerrero, F., Kacira, M., Fitz-Rodríguez, E., Kubota, C., Giacomelli, G. A., Linker, R., and Arbel,A. (2012), “Comparison of three evapotranspiration models for a greenhouse cooling strategy with naturalventilation and variable high pressure fogging,” Scientia Horticulturae, 134, 210–221.

Zhang, H. (2004), “Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics,”Journal of the American Statistical Association, 99, 250–261.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in publishedmaps and institutional affiliations.

https://www.nass.usda.gov/Publications/Reports_By_Date/

https://www.nass.usda.gov/Publications/Reports_By_Date/

Documents

Modeling Crop Phenology in the US Corn Belt Using