

Environ Fluid Mech
DOI 10.1007/s10652-008-9106-3

ORIGINAL ARTICLE

Ensemble Bayesian model averaging using Markov Chain Monte Carlo sampling

Jasper A. Vrugt · Cees G. H. Diks · Martyn P. Clark

Received: 7 June 2008 / Accepted: 25 September 2008
© Springer Science+Business Media B.V. 2008

Abstract Bayesian model averaging (BMA) has recently been proposed as a statistical method to calibrate forecast ensembles from numerical weather models. Successful implementation of BMA, however, requires accurate estimates of the weights and variances of the individual competing models in the ensemble. In their seminal paper, Raftery et al. (Mon Weather Rev 133:1155–1174, 2005) recommended the Expectation–Maximization (EM) algorithm for BMA model training, even though global convergence of this algorithm cannot be guaranteed. In this paper, we compare the performance of the EM algorithm and the recently developed DiffeRential Evolution Adaptive Metropolis (DREAM) Markov Chain Monte Carlo (MCMC) algorithm for estimating the BMA weights and variances. Simulation experiments using 48-hour ensemble data of surface temperature and multi-model streamflow forecasts show that both methods produce similar results, and that their performance is unaffected by the length of the training data set. However, MCMC simulation with DREAM is capable of efficiently handling a wide variety of BMA predictive distributions, and provides useful information about the uncertainty associated with the estimated BMA weights and variances.

Keywords Bayesian model averaging · Markov Chain Monte Carlo · Maximum likelihood · DiffeRential Evolution Adaptive Metropolis · Temperature forecasting · Streamflow forecasting

J. A. Vrugt (B)
Los Alamos National Laboratory (LANL), Center for NonLinear Studies (CNLS), Mail Stop B258, Los Alamos, NM 87545, USA
e-mail: [email protected]

C. G. H. Diks
Center for Nonlinear Dynamics in Economics and Finance, University of Amsterdam, Amsterdam, The Netherlands

M. P. Clark
NIWA, P.O. Box 8602, Riccarton, Christchurch, New Zealand


1 Introduction and scope

During the last decade, multi-model ensemble prediction systems have become the basis for probabilistic weather and climate forecasts at many operational centers throughout the world [2,10,16,18]. Multi-model ensemble predictions aim to capture several sources of uncertainty in numerical weather forecasts, including uncertainty about the initial conditions, lateral boundary conditions, and model physics, and have convincingly demonstrated improvements to numerical weather and climate forecasts and the production of more skilful estimates of forecast probability density functions (pdfs) [4,8,9,13,15,20,21]. However, because the current generation of ensemble systems does not explicitly account for all sources of forecast uncertainty, some form of postprocessing is necessary to provide predictive ensemble pdfs that are meaningful and can be used to provide accurate forecasts [7,9,12,20,22,29].

Ensemble Bayesian model averaging (BMA) has recently been proposed by Raftery et al. [20] as a statistical postprocessing method for producing probabilistic forecasts from ensembles. The BMA predictive pdf of any future weather quantity of interest is a weighted average of pdfs centered on the bias-corrected forecasts from a set of individual models. The weights are the estimated posterior model probabilities, representing each model's relative forecast skill in the training period. Studies applying the BMA method to a range of different forecasting problems have demonstrated that BMA produces more accurate and reliable predictions than other available multi-model techniques [1,17,19,20,23,30].

Successful implementation of BMA, however, requires estimates of the weights and variances of the individual competing models in the ensemble. In their seminal paper, Raftery et al. [20] recommend the use of the Expectation–Maximization (EM) algorithm to estimate the BMA weights and variances. The advantages of the EM algorithm are not difficult to enumerate. The method is relatively easy to implement, computationally efficient, and its algorithmic steps are designed in such a way that they always satisfy the constraint that the BMA weights are positive and add up to one. Various contributions to the literature have indeed demonstrated that the EM algorithm works well, and provides robust estimates of the weights and variances of the individual ensemble members at modest computational cost.

Notwithstanding this progress, convergence of the EM algorithm to the globally optimal BMA weights and variances cannot be guaranteed and becomes especially problematic for high-dimensional problems (ensembles consisting of many different member forecasts). Also, algorithmic modifications are required to adapt the EM method for predictive distributions other than the normal distribution. This will be necessary for variables such as precipitation, and all other quantities primarily driven by precipitation, such as streamflow.

In this paper, we propose using Markov Chain Monte Carlo (MCMC) simulation to estimate the most likely values of the BMA weights and variances, and their underlying posterior pdf. This approach overcomes some of the limitations of the EM approach and has three distinct advantages. First, MCMC simulation does not require algorithmic modifications when using different conditional probability distributions for the individual ensemble members. Second, MCMC simulation provides a full view of the posterior distribution of the BMA weights and variances. This information is helpful to assess the usefulness of individual ensemble members. A small ensemble has important computational advantages, since it requires calibrating and running the smallest possible number of models. Finally, MCMC simulation can handle a relatively high number of BMA parameters, allowing large ensemble sizes containing the predictions of many different constituent models.

We use the recently developed DiffeRential Evolution Adaptive Metropolis (DREAM) MCMC algorithm. This adaptive proposal updating method was recently introduced in [27] and maintains detailed balance and ergodicity while showing excellent efficiency on complex, multi-modal, and high-dimensional sampling problems. To illustrate our approach, we use two different data sets: (a) 48-h forecasts of surface temperature and sea level pressure in the Pacific Northwest in January–June 2000, obtained using the University of Washington mesoscale short-range ensemble system [10], and (b) multi-model streamflow forecasts for the Leaf River basin in Mississippi [24].

The remainder of this paper is organized as follows. In Sect. 2, we briefly describe the BMA method, and Sect. 3 provides a short description of the EM and DREAM algorithms for estimating the values of the BMA weights and variances. In Sect. 4, we compare the results of both algorithms for the two data sets. Here we are especially concerned with algorithm robustness and flexibility. Finally, a summary with conclusions is presented in Sect. 5.

2 Ensemble Bayesian model averaging (BMA)

To explicate the BMA method, let $f = f_1, \ldots, f_K$ denote an ensemble of predictions obtained from $K$ different models, and $\Delta$ the quantity of interest. In BMA, each ensemble member forecast, $f_k$, $k = 1, \ldots, K$, is associated with a conditional pdf, $g_k(\Delta|f_k)$, which can be interpreted as the conditional pdf of $\Delta$ on $f_k$, given that $f_k$ is the best forecast in the ensemble. The BMA predictive model for dynamic ensemble forecasting can then be expressed as a finite mixture model [19,20]

$$p(\Delta|f_1,\ldots,f_K) \;=\; \sum_{k=1}^{K} w_k\, g_k(\Delta|f_k), \qquad (1)$$

where $w_k$ denotes the posterior probability of forecast $k$ being the best one. The $w_k$'s are nonnegative and add up to one, and they can be viewed as weights reflecting an individual model's relative contribution to predictive skill over the training period.

The original BMA method described in [19,20] assumes that the conditional pdfs $g_k(\Delta|f_k)$ of the different ensemble members can be approximated by a normal distribution centered at a linear function of the original forecast, $a_k + b_k f_k$, with standard deviation $\sigma$:

$$\Delta|f_k \sim N(a_k + b_k f_k,\, \sigma^2). \qquad (2)$$

The values for $a_k$ and $b_k$ are bias-correction terms that are derived by simple linear regression of $\Delta$ on $f_k$ for each of the $K$ ensemble members. The BMA predictive mean can be computed as

$$E[\Delta|f_1,\ldots,f_K] \;=\; \sum_{k=1}^{K} w_k\,(a_k + b_k f_k), \qquad (3)$$

which is a deterministic forecast whose predictive performance can be compared with the individual forecasts in the ensemble, or with the ensemble mean. If we denote space and time with subscripts $s$ and $t$ respectively, so that $f_{kst}$ is the $k$th forecast in the ensemble for location $s$ and time $t$, the associated variance of Eq. 3 can be computed as [20]

$$\mathrm{var}[\Delta_{st}|f_{1st},\ldots,f_{Kst}] \;=\; \sum_{k=1}^{K} w_k\left((a_k + b_k f_{kst}) - \sum_{l=1}^{K} w_l\,(a_l + b_l f_{lst})\right)^{\!2} + \sigma^2. \qquad (4)$$

The variance (Eq. 4) of the BMA prediction defined in Eq. 3 consists of two terms, the first representing the ensemble spread, and the second representing the within-ensemble forecast variance.
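To make Eqs. 3 and 4 concrete, the following minimal sketch computes the BMA predictive mean and variance for a single location and time. We use Python for all code examples in this section; the authors note at the end of the paper that their own BMA and DREAM codes are written in MATLAB, so all function and variable names below are illustrative, not taken from that software.

```python
import numpy as np

def bma_mean_and_variance(f, w, a, b, sigma2):
    """BMA predictive mean (Eq. 3) and variance (Eq. 4) for one location/time.

    f      : (K,) raw forecasts of the K ensemble members
    w      : (K,) BMA weights (nonnegative, summing to one)
    a, b   : (K,) linear bias-correction coefficients per member
    sigma2 : common variance of the normal conditional pdfs (Eq. 2)
    """
    fc = a + b * f                           # bias-corrected forecasts
    mean = np.sum(w * fc)                    # Eq. 3: weighted deterministic forecast
    spread = np.sum(w * (fc - mean) ** 2)    # between-member (ensemble spread) term
    return mean, spread + sigma2             # Eq. 4: spread + within-member variance

# Hypothetical five-member temperature ensemble (values are illustrative only)
f = np.array([281.2, 280.5, 282.0, 279.8, 281.7])
w = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
a, b = np.zeros(5), np.ones(5)
print(bma_mean_and_variance(f, w, a, b, sigma2=1.5))
```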


In this development, it is assumed that the pdf of each model at space $s$ and time $t$ can be approximated as a normal distribution with mean equal to the deterministic prediction, $f_k$, and variance $\sigma^2$. This formulation assumes a constant variance across models, which requires estimation of $K+1$ parameters. An alternative approach is to vary $\sigma^2$ across models. The last term on the right-hand side of Eq. 4, $\sigma^2$, then needs to be replaced with $\sum_{k=1}^{K} w_k \sigma_k^2$. This increases the number of parameters to be estimated from $K+1$ to $2K$.

The assumption of a normal distribution for the BMA predictive distribution of the individual ensemble members works well for weather quantities whose predictive pdfs are approximately normal, such as temperature and sea-level pressure. For other variables, such as precipitation, the normal distribution is not appropriate and other conditional pdfs need to be implemented [23]. The second case study reported in this paper considers the gamma distribution for streamflow forecasting; further details on how to implement this distribution can be found there.

3 Computation of the BMA weights and variances

Successful implementation of the BMA method described in the previous section requires estimates of the weights and variances of the pdfs of the individual forecasts. Following [20], we estimate the values of $w_k$, $k = 1, \ldots, K$ and $\sigma^2$ by maximum likelihood (ML) from a calibration data set. Assuming independence of forecast errors in space and time, the log-likelihood function $\ell$ for the BMA predictive model defined in Eq. 1 is:

$$\ell(w_1,\ldots,w_K,\sigma^2 \,|\, f_1,\ldots,f_K, \Delta) \;=\; \sum_{s,t} \log\!\left(\sum_{k=1}^{K} w_k\, g_k(\Delta_{st}|f_{kst})\right), \qquad (5)$$

where $n$ denotes the total number of measurements in the training data set. For reasons of numerical stability we optimize the log-likelihood function rather than the likelihood function itself. Unfortunately, no analytical solutions exist that conveniently maximize Eq. 5. Instead, we need to resort to iterative techniques to find the ML values of the BMA weights and variances. In the next two sections, we discuss two different methods to obtain the ML estimates of $w_k$, $k = 1, \ldots, K$ and $\sigma^2$.
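Both algorithms discussed next target this same objective function. As a reference point, Eq. 5 with normal conditional pdfs can be evaluated directly; a minimal sketch (names are ours):

```python
import numpy as np
from scipy.stats import norm

def bma_log_likelihood(w, sigma2, F, obs):
    """Eq. 5 for normal conditional pdfs g_k.

    w      : (K,) BMA weights
    sigma2 : common variance sigma^2
    F      : (n, K) bias-corrected forecasts for the n training cases
    obs    : (n,) verifying observations Delta_st
    """
    # n x K matrix of g_k(Delta_st | f_kst); the mixture density is its w-weighted sum
    dens = norm.pdf(obs[:, None], loc=F, scale=np.sqrt(sigma2))
    return np.sum(np.log(dens @ w))
```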

3.1 The Expectation–Maximization (EM) algorithm

The Expectation–Maximization algorithm is a broadly applicable approach to the iterative computation of ML estimates, useful in a variety of incomplete-data problems [3,14]. To implement the EM algorithm for the BMA method we use the unobserved quantity $z_{kst}$, which has value 1 if ensemble member $k$ is the best forecast at space $s$ and time $t$, and value 0 otherwise. Hence, for each observation only one of $\{z_{1st}, \ldots, z_{Kst}\}$ is equal to 1; all the others are zero.

After initialization of the weights and variances of the individual ensemble members, the EM algorithm alternates iteratively between an expectation and a maximization step until convergence is achieved. In the expectation step, the values of $z_{kst}$ are re-estimated given the current values of the weights and variances according to

$$z_{kst}^{(j)} \;=\; \frac{w_k\, g(\Delta_{st}|f_{kst}, \sigma^{(j-1)})}{\sum_{l=1}^{K} w_l\, g(\Delta_{st}|f_{lst}, \sigma^{(j-1)})}, \qquad \text{(Expectation Step)} \quad (6)$$


where the superscript $j$ refers to the iteration number, and $g(\Delta_{st}|f_{kst}, \sigma^{(j-1)})$ is the conditional pdf of ensemble member $k$ centered at observation $\Delta_{st}$. Here we implement a normal density with mean $f_{kst}$ and standard deviation $\sigma^{(j-1)}$. In the subsequent maximization step, the values of the weights and variances are updated using the current estimates of $z_{kst}$, $z_{kst}^{(j)}$:

$$w_k^{(j)} = \frac{1}{n}\sum_{s,t} z_{kst}^{(j)}, \qquad \sigma^{2\,(j)} = \frac{1}{n}\sum_{s,t}\sum_{k=1}^{K} z_{kst}^{(j)}\left(\Delta_{st} - f_{kst}\right)^2. \qquad \text{(Maximization Step)} \quad (7)$$

By iterating between Eqs. 6 and 7, the EM algorithm successively improves the values of $w_k$, $k = 1, \ldots, K$, and $\sigma^2$. Convergence of the BMA weights and variances is achieved when the changes in $\ell(w_1,\ldots,w_K,\sigma^2|f_1,\ldots,f_K,\Delta)$, in $w_k$, $k = 1, \ldots, K$, and $\sigma^2$, and in $z_{kst}^{(j)}$ become smaller than predefined tolerance values for each of these individual quantities. This is similar to what [20] proposed.

The EM method exhibits many desirable properties: it is relatively easy to implement, computationally efficient, and the maximization step in Eq. 7 is designed such that the weights are always positive and add up to one, $\sum_{k=1}^{K} w_k = 1$. Yet, because a single search trajectory is used, the EM algorithm is prone to getting stuck in a locally optimal solution. Multiple different trajectories can of course be used, but this is a major source of inefficiency because the individual optimization trials operate completely independently of each other, with no sharing of information [5]. Another disadvantage of the EM method is that it requires different analytical solutions for different formulations of $g_k(\Delta|f_k)$. For instance, it might be necessary to replace the normal distribution in Eq. 2 by a more appropriate distribution for variables such as streamflow and precipitation that are known to be highly skewed [23,24]. Finally, the EM method does not provide any information about the uncertainty associated with the final optimized BMA weights and variances. This information is useful not only to assess BMA model uncertainty, but also to understand the importance of individual ensemble members. A search algorithm that can efficiently handle a wide variety of conditional distributions $g_k(\Delta|f_k)$ and simultaneously provides estimates of BMA parameter uncertainty is therefore desirable.

3.2 Markov Chain Monte Carlo Sampling with DREAM

Another approach to estimating the BMA weights and variances is MCMC simulation. Unlike EM, MCMC simulation uses multiple different trajectories (also called Markov chains) simultaneously to sample $w_k$, $k = 1, \ldots, K$ and $\sigma^2$ preferentially, based on their weight in the likelihood function. Vrugt et al. [27,28] recently presented a novel adaptive MCMC algorithm to efficiently estimate the posterior pdf of parameters in complex, high-dimensional sampling problems. This method, entitled DREAM, runs multiple chains simultaneously for global exploration, and automatically tunes the scale and orientation of the proposal distribution during the evolution to the posterior distribution. This scheme is an adaptation of the Shuffled Complex Evolution Metropolis [25] global optimization algorithm and has the advantage of maintaining detailed balance and ergodicity while showing excellent efficiency on complex, highly nonlinear, and multimodal target distributions [27].

In DREAM, $N$ different Markov chains are run simultaneously in parallel. If the state of a single chain is given by the $d$-dimensional vector $\theta = \{w_1, \ldots, w_K, \sigma^2\}$, then at each generation the $N$ chains in DREAM define a population $\Theta$, which corresponds to an $N \times d$ matrix, with each chain as a row. Jumps in each chain $i = 1, \ldots, N$ are generated by taking a fixed multiple of the difference of randomly chosen other members (chains) of $\Theta$ (without replacement):

$$\vartheta^{\,i} \;=\; \theta^{\,i} + \gamma(\delta)\sum_{j=1}^{\delta}\theta^{\,r(j)} \;-\; \gamma(\delta)\sum_{n=1}^{\delta}\theta^{\,r(n)} + e, \qquad (8)$$

where $\delta$ signifies the number of pairs used to generate the proposal, and $r(j), r(n) \in \{1, \ldots, N-1\}$; $r(j) \neq r(n)$. The Metropolis ratio is used to decide whether to accept candidate points or not. This series of operations results in an MCMC sampler that conducts a robust and efficient search of the parameter space. Because the joint pdf of the $N$ chains factorizes to $\pi(\theta^1) \times \cdots \times \pi(\theta^N)$, the states $\theta^1, \ldots, \theta^N$ of the individual chains are independent at any generation after DREAM has become independent of its initial value. After this so-called burn-in period, the convergence of a DREAM run can thus be monitored with the R-statistic of [6].
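A minimal sketch of the jump defined in Eq. 8 is given below; the choice of $\gamma$ follows the guidelines discussed directly after this sketch. The full DREAM algorithm [27] additionally applies randomized subspace sampling, outlier handling, and the Metropolis accept/reject step; the function and parameter names here are ours:

```python
import numpy as np

def dream_proposal(Theta, i, delta=1, b=1e-4, rng=None):
    """Generate a candidate point for chain i from the population Theta (Eq. 8).

    Theta : (N, d) array; row j holds the current state of chain j
    delta : number of chain pairs used to build the jump
    b     : scale of the small symmetric perturbation e ~ N_d(0, b)
    """
    if rng is None:
        rng = np.random.default_rng()
    N, d = Theta.shape
    others = np.delete(np.arange(N), i)                # exclude the current chain
    idx = rng.choice(others, size=2 * delta, replace=False)
    r1, r2 = idx[:delta], idx[delta:]
    gamma = 2.4 / np.sqrt(2 * delta * d)               # RWM guideline; set gamma = 1 every 5th step
    e = rng.normal(0.0, b, size=d)                     # small symmetric perturbation
    return Theta[i] + gamma * (Theta[r1].sum(axis=0) - Theta[r2].sum(axis=0)) + e
```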

At every step, the points in $\Theta$ contain the most relevant information about the search, and this population of points is used to globally share information about the progress of the search of the individual chains. This information exchange enhances the survivability of individual chains, and facilitates adaptive updating of the scale and orientation of the proposal distribution. Note that the method maintains pointwise detailed balance, as the pair $(\theta_{t-1}^{r_1}, \theta_{t-1}^{r_2})$ is as likely as $(\theta_{t-1}^{r_2}, \theta_{t-1}^{r_1})$, and the value of $e \sim N_d(0, b)$ is drawn from a symmetric distribution with small $b$. Following the guidelines of the Random Walk Metropolis algorithm, the optimal choice of $\gamma$ is $2.4/\sqrt{2\delta d}$. Every 5th generation $\gamma = 1.0$, to facilitate jumping between different disconnected modes [27].

To gear the search towards the high-density region of the parameter space, DREAM does not require the unobserved quantity $z_{kst}$ used in the EM algorithm, but directly evaluates $\ell(w_1,\ldots,w_K,\sigma^2|f_1,\ldots,f_K,\Delta)$ defined in Eq. 5 for given values of $w_k$, $k = 1, \ldots, K$ and $\sigma^2$. There are several advantages to adopting this approach. First, the formulation of $\ell(\cdot|\cdot)$ is similar for different choices of $g_k(\Delta|f_k)$, requiring no modifications to the BMA software if conditional probability distributions other than the normal are used for the individual ensemble members. Second, MCMC simulation with DREAM provides a full view of the posterior distribution of the BMA weights and variances. Finally, MCMC simulation with DREAM can handle a relatively high number of BMA parameters, allowing for large ensemble sizes containing the predictions of many different constituent models.
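The R-statistic of Gelman and Rubin [6], mentioned above as the convergence monitor, compares the between-chain and within-chain variances; a minimal sketch for a single parameter (names are ours):

```python
import numpy as np

def gelman_rubin(chains):
    """R-statistic of Gelman and Rubin [6] for one parameter.

    chains : (N, T) array of N parallel chains after burn-in, T samples each.
    Values close to 1 indicate convergence to a limiting distribution.
    """
    N, T = chains.shape
    B = T * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
    var_hat = (T - 1) / T * W + B / T           # pooled posterior variance estimate
    return np.sqrt(var_hat / W)
```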

The only potential drawback is that MCMC simulation generally requires many more function evaluations than the EM algorithm to estimate $w_k$, $k = 1, \ldots, K$ and $\sigma^2$. This can be especially cumbersome for ensembles that contain probabilistic forecasts of individual members calibrated through stochastic optimization. To date, most multi-model forecasting applications presented in the literature consider ensemble sizes that are relatively small, typically in the order of $K = 5$ to $K = 20$. The methodology presented herein should therefore have widespread applicability.

4 Case studies

We compare the robustness and efficiency of the EM and DREAM algorithms for two different case studies. The first case study considers the 48-h forecasts of surface temperature and sea level pressure in the Pacific Northwest in January–June 2000 using the University of Washington (UW) mesoscale short-range ensemble system. This is a five-member multi-analysis ensemble (hereafter referred to as the UW ensemble) consisting of different runs of the fifth-generation Pennsylvania State University – National Center for Atmospheric Research Mesoscale Model (MM5), in which initial conditions are taken from different operational centers. In the second case study, we apply the BMA method to probabilistic streamflow forecasting using a 36-year calibrated multi-model ensemble of daily streamflow forecasts for the Leaf River basin in Mississippi. This study deals with a variable that is highly skewed and not well described by a normal distribution.

In both studies, the EM and DREAM algorithms were allowed a maximum of 15,000 log-likelihood function evaluations. In the first case study, uniform prior distributions of $w \in [0,1]^K$ and $\sigma^2 \in [0,\, 3\cdot\mathrm{var}(\Delta)]$ were used for the BMA weights and variances, respectively. The log-likelihood function in Eq. 5 then equates directly to the posterior density function. In case study 2, we again used $w \in [0,1]^K$ for the BMA weights, but the gamma shape and scale parameters were sampled in $\alpha, \beta \in [0, 10]$. These initial ranges were established through trial and error, encompass the posterior pdf, and have been shown to work well for a range of problems.
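Under these flat priors, the log-posterior evaluated by DREAM reduces to a bounds check followed by Eq. 5. The text does not specify how the sum-to-one constraint on the weights is handled inside the sampler; renormalizing the sampled weights, as in the sketch below, is one common device and should be read as our assumption:

```python
import numpy as np
from scipy.stats import norm

def log_posterior(theta, F, obs):
    """Log-posterior for case study 1: theta = (w_1, ..., w_K, sigma2).

    F : (n, K) bias-corrected forecasts; obs : (n,) observations.
    """
    w, sigma2 = theta[:-1], theta[-1]
    # Uniform priors: w in [0, 1]^K and sigma2 in [0, 3 * var(obs)]
    if np.any(w < 0) or np.any(w > 1) or not (0 < sigma2 <= 3 * np.var(obs)):
        return -np.inf
    w = w / w.sum()          # enforce sum-to-one (our assumption, see lead-in)
    dens = norm.pdf(obs[:, None], loc=F, scale=np.sqrt(sigma2))
    return np.sum(np.log(dens @ w))
```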

4.1 UW ensemble forecasts of temperature and sea level pressure

We consider the 48-h forecasts of surface temperature and sea level pressure using data from the UW ensemble, consisting of forecasts and observations in the 0000 UTC cycle from 12 January to 9 June 2000. Following [20], a 25-day training period between 16 April and 9 June 2000 is used for BMA model calibration, whereas the remainder of the data set (12 January–16 April) is used to evaluate the model performance. For some days the data were missing, so that the number of calendar days spanned by the training data set is larger than the number of training days used. Note that the performance statistics for the evaluation period were derived without a sliding-window training approach. This is opposite to what was done in [20]. The emphasis of this paper is on the calibration performance of the EM and DREAM algorithms for the training data period rather than the evaluation data period. The evaluation period is simply used to test the statistical consistency of the calibrated BMA weights and variances.

Figure 1 presents histograms of the DREAM-derived posterior marginal probability density functions of the BMA weights and variance of the individual ensemble members for the 25-day training period for surface temperature (panels a–f) and sea level pressure (panels g–l). The DREAM algorithm generated $N = 10$ parallel sequences, each with 1,500 samples. The first 1,000 samples in each chain are used for burn-in, leaving a total of 5,000 samples to draw inferences from. The optimal values derived with the EM algorithm are separately indicated in each panel with the 'x' symbol.

The results in Fig. 1 generally show an excellent agreement between the modes of the histograms derived with MCMC simulation and the ML estimates of the EM solution within this high-density region. Previous applications of the EM algorithm are therefore likely to have yielded robust parameters for the UW ensemble data set. However, DREAM retains a desirable feature: it not only correctly identifies the ML values of the parameters, but simultaneously also samples the underlying probability distribution. This is helpful information to assess the usefulness of individual ensemble members in the BMA prediction, and the correlation among ensemble members. For instance, most of the histograms exhibit an approximately Gaussian shape with relatively small dispersion around their modes. This indicates that there is high confidence in the weights applied to each of the individual models.


Fig. 1 Marginal posterior pdfs of the DREAM-derived BMA weights and variance of the individual ensemble members for the surface temperature (a–f) and sea level pressure (g–l) data sets (members: AVN (NCEP), GEM (CMC), ETA (NCEP), NGM (NCEP), NOGAPS (FNMOC), and the BMA variance σ²). The EM-derived solution is separately indicated in each panel with the symbol 'x'


Fig. 2 Transitions of the value of the BMA weight of ensemble member AVN (a), ETA (b), NGM (c), GEM (d), NOGAPS (e), and of the BMA variance (f) in three different parallel Markov chains during the evolution of the DREAM algorithm to the posterior target distribution. To facilitate comparison, the maximum likelihood estimates computed with the EM algorithm are separately indicated at the right-hand side of each panel with the symbol '∗'

The evolution of the sampled BMA weights, $w_k$, $k = 1, \ldots, K$, of the (a) AVN (NCEP), (b) ETA (NCEP), (c) NGM (NCEP), (d) GEM (CMC), and (e) NOGAPS (FNMOC) ensemble members and of the BMA variance, $\sigma^2$ (f), to the posterior distribution for the 25-day surface temperature data set is shown in Fig. 2. We randomly selected three different Markov chains, and coded each of them with a different color and symbol. The 1D scatter plots of the sampled parameter space indicate that during the initial stages of the evolution the different sequences occupy different parts of the parameter space, resulting in a relatively high value of the R-statistic (not shown). After about 250 draws in each individual Markov chain, the three trajectories settle down in approximately the same region of the parameter space and successively visit solutions stemming from a stable distribution. This demonstrates convergence to a limiting distribution.

To explore whether the performance of the EM and DREAM sampling methods is affected by the length of the training data set, we sequentially increased the length of the training set. We consider training periods of lengths 5, 10, 15, 20 and 25 days. For each fixed length, we ran both methods 30 different times, using a randomly sampled (without replacement) calibration period from the original 25-day training data set. Each time, a bias correction was first applied to the ensemble using simple linear regression of $\Delta_{st}$ on $f_{kst}$ for the training data set. Figure 3 presents the outcome of the 30 individual trials as a function of the length of the training set. The top panels (Fig. 3a–c) present the results for the surface temperature data set, while the bottom panels (Fig. 3d–f) depict the results for the sea level pressure data set. The solid lines denote the Root Mean Square Error (RMSE) of the forecast error of the BMA predictive mean (blue) and the average width of the associated 95% prediction uncertainty intervals (green) obtained with the EM algorithm, whereas the squared symbols represent their DREAM-derived counterparts. The panels distinguish between the calibration and evaluation period.



Fig. 3 Comparison of training period lengths for surface temperature and sea level pressure. The reported results represent averages of 30 independent trials with randomly selected training data sets from the original 25-day data set. Solid lines in panels b–c and e–f denote the EM-derived forecast error of the BMA predictive mean (blue) and the associated width of the prediction intervals (green) for the calibration (left side) and evaluation period (right side), respectively; squared symbols are the DREAM-derived counterparts


The results presented here highlight several important observations. First, the ratio between the ML values derived with the EM and DREAM algorithms closely approximates 1, and appears to be rather unaffected by the length of the training set. This provides strong empirical evidence that the performance of the EM and MCMC sampling methods is quite similar, and not affected by the length of the training data set. A ratio significantly deviating from 1 would indicate inconsistent results, where one of the two methods would consistently find better estimates of $w_k$, $k = 1, \ldots, K$ and $\sigma^2$. Second, the RMSE of the BMA deterministic (mean) forecast (indicated in blue) generally increases with increasing length of the calibration period. This is to be expected, as longer calibration time series are likely to contain a larger variety of weather events. Third, notice that the average width of the BMA 95% prediction uncertainty intervals (indicated in green) increases with the number of training days. This again has to do with the larger diversity of weather events observed in longer calibration time series. Finally, the BMA forecast pdf derived through MCMC simulation with DREAM is sharper (less spread) than its counterpart estimated with the EM method. This is consistent with the results depicted in Fig. 1, which show larger ML values of $\sigma^2$ estimated with the EM method.

To summarize these results, there appear to be small advantages in using MCMC sampling for BMA model training. The DREAM algorithm finds similar values of the log-likelihood function as the multi-start EM algorithm for both the UW surface temperature and sea level pressure data sets. However, MCMC simulation with DREAM yields BMA forecast pdfs that are slightly sharper, and provides estimates of the parameter uncertainty of $w_k$, $k = 1, \ldots, K$ and $\sigma^2$.

4.2 Probabilistic ensemble streamflow forecasting

In this study, we apply the BMA method to probabilistic ensemble streamflow forecasting using historical data from the Leaf River watershed (1,950 km²) located north of Collins, Mississippi. In a previous study [24] we generated a 36-year ensemble of daily streamflow forecasts using eight different conceptual watershed models. In the present study, the first eight years on record (WY 1953–1960) are used for model evaluation purposes, whereas the other 28 years (WY 1961–1988) of the available record are used for model training purposes.

To discern the probabilistic properties of the streamflow ensemble, consider Fig. 4, which presents the deterministic forecasts of the individual, calibrated models of the ensemble for a representative period between 30 September and 6 June 1953. Note that the spread of the ensemble generally brackets the observations (solid circles). This is a desirable characteristic and a prerequisite for accurate ensemble streamflow forecasting with the BMA method. Because the individual hydrologic models were calibrated first using the training data set and the SCE-UA algorithm, a linear bias correction of the individual ensemble members was deemed unnecessary prior to optimization of the BMA weights and variance. Moreover, as argued in [24], a (global) linear bias correction, as suggested by [20], is too simple to be useful in hydrologic modeling. Such an approach is essentially not powerful enough to remove heteroscedastic and non-Gaussian model errors from the individual forecasts in the ensemble. For more information about the watershed models used and the generation of the streamflow ensemble, please refer to [24].

The original development of BMA by [20] was for weather quantities whose predictive pdfs are approximately normal, such as temperature and sea-level pressure. However, one would not expect streamflow data, and more generally any other quantity primarily driven by precipitation, to be normally distributed at short time scales. This is mainly because the distribution of daily streamflow is highly skewed. The normal distribution does not fit data of this kind, and we must therefore modify the conditional pdf, $g_k(\Delta|f_k)$, in Eq. 5.


Fig. 4 Rainfall hyetograph (top panel) and streamflow predictions of the individual models of the BMA ensemble for a representative portion of the historical period (bottom panel). The solid circles represent the verifying streamflow observations



We here implement the gamma distribution, which with shape parameter $\alpha$ and scale parameter $\beta$ has the pdf

$$\Delta|f_k \sim \frac{1}{\beta\,\Gamma(\alpha)}\,\Delta^{\alpha-1}\exp(-\Delta/\beta), \qquad (9)$$

for $\Delta > 0$, and $g_k(\Delta|f_k) = 0$ for $\Delta \le 0$. The mean of this distribution is $\mu = \alpha\beta$, and its variance is $\sigma^2 = \alpha\beta^2$. We derive the parameters $\alpha_k = \mu_k^2/\sigma_k^2$ and $\beta_k = \sigma_k^2/\mu_k$ of the gamma distribution from the original forecast, $f_k$, of the individual ensemble members through $\mu_k = f_k$ and $\sigma_k^2 = b \cdot f_k + c_0$. The gamma distribution is thus centered on the individual forecasts, with an associated variance that is heteroscedastic and depends directly on the actual streamflow prediction.
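A minimal sketch of this heteroscedastic gamma kernel, matching the moments exactly as in the text (it assumes strictly positive forecasts; names are illustrative):

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

def gamma_kernel_pdf(obs, f, b, c0):
    """Gamma conditional pdf of Eq. 9 with mu_k = f_k and sigma_k^2 = b*f_k + c0.

    obs : observed streamflow Delta (> 0); f : member forecast(s) (> 0)
    """
    mu = np.asarray(f, dtype=float)
    var = b * mu + c0                  # heteroscedastic, forecast-dependent variance
    alpha = mu ** 2 / var              # shape parameter alpha_k
    beta = var / mu                    # scale parameter beta_k
    return gamma_dist.pdf(obs, a=alpha, scale=beta)
```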

We estimate the ML values of $w_k$, $k = 1, \ldots, K$, $b$, and $c_0$ using the EM and DREAM algorithms and the training ensemble. When implementing the EM algorithm, there are no analytical solutions for the ML estimates of $b$ and $c_0$, so we estimated them numerically at each iteration using the Simplex algorithm, optimizing the likelihood function in Eq. 5 with the current values of $w_k$, $k = 1, \ldots, K$ derived in Eq. 7.
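Because this M-step has no closed form for $b$ and $c_0$, the inner numerical maximization can be sketched as below, holding the weights at their current EM values and searching over $(b, c_0)$ with a Nelder–Mead (Simplex) routine. This mirrors the procedure described above but is our illustrative reconstruction, not the authors' code:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma as gamma_dist

def neg_log_lik_bc(params, w, F, obs):
    """Negative of Eq. 5 as a function of (b, c0), with the weights w fixed.

    F : (n, K) strictly positive member forecasts; obs : (n,) observations.
    """
    b, c0 = params
    var = b * F + c0                                 # per-member heteroscedastic variance
    if np.any(var <= 0):
        return np.inf                                # invalid gamma parameters
    dens = gamma_dist.pdf(obs[:, None], a=F ** 2 / var, scale=var / F)
    return -np.sum(np.log(dens @ w + 1e-300))        # small floor avoids log(0)

# One inner maximization, given the current EM weights w:
# res = minimize(neg_log_lik_bc, x0=[1.0, 0.5], args=(w, F, obs), method="Nelder-Mead")
```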

Figure 5 depicts the evolution of $w_k$, $k = 1, \ldots, K$, $b$, and $c_0$ to the posterior target distribution using MCMC sampling with DREAM. We plotted three different parallel Markov chains, and coded each of them with a different symbol and color. The ML values derived with the EM algorithm are separately indicated in each panel using the '∗' symbol. The individual Markov chains mix well and require about 200 samples in each sequence to locate the high-probability-density region of the parameter space for this 10-dimensional estimation problem. This indicates the relative efficiency of DREAM. Also notice that the posterior pdf does not narrow greatly, especially for the BMA weights of the GR4J, HBV and SAC-SMA models. Most of this uncertainty is caused by significant correlation between the optimized weights of the individual models (not shown), suggesting that one or more members can be removed from the streamflow ensemble with little loss. The correlation structure induced between the individual weights in the posterior pdf thus provides useful and objective information for selecting the most appropriate ensemble members. This can be valuable given the computational cost of running models and creating ensembles.

Also for this example, the modes of the posterior distribution of $w_k$, $k = 1, \ldots, K$, $b$, and $c_0$ derived through MCMC simulation with DREAM are in very close agreement with the ML estimates of the EM algorithm. This suggests that both methods will achieve similar calibration (and thus evaluation) performance. Thus, the EM algorithm suffices for finding reliable estimates of the BMA weights and variance for this eight-member streamflow ensemble.

Table 1 presents summary statistics of the one-day-ahead streamflow forecasts for the eight-year evaluation period (WY 1953–1960) for the eight individual conceptual watershed models considered in this study. The forecast statistics of the BMA predictive model and the associated weights are also listed. The results in this table suggest that it is not possible to select a single "best" watershed model that minimizes the RMSE and %BIAS of the forecast error while simultaneously maximizing the correlation (CORR) between the simulated and measured time series of streamflow. In this analysis, the Sacramento Soil Moisture Accounting model (SAC-SMA) has the most consistent performance, with the lowest quadratic forecast error during the calibration and evaluation period. However, the SAC-SMA model forecasts exhibit considerable bias.



Fig. 5 Transitions of the value of the BMA weight of ensemble member ABC (a), GR4J (b), HYMOD (c), TOPMO (d), AWBM (e), NAM (f), HBV (g), and SAC-SMA (h), and of the gamma pdf parameters (i, j) in three different parallel Markov chains during the evolution of the DREAM algorithm to the posterior target distribution. To facilitate comparison, the maximum likelihood estimates computed with the EM algorithm are separately indicated at the right-hand side of each panel with the symbol '∗'

Table 1 Summary statistics (RMSE, CORR, and %BIAS) of the one-day-ahead streamflow forecasts using the individual calibrated models and the BMA predictive model for the evaluation period (WY 1953–1960)

Model      RMSE    CORR    %BIAS    Weight
ABC        31.67   0.70    15.59    0.00
GR4J       19.21   0.90     7.51    0.20
HYMOD      19.03   0.90    −0.38    0.03
TOPMO      17.68   0.92    −0.59    0.05
AWBM       26.31   0.80     6.37    0.09
NAM        20.22   0.89    −4.10    0.00
HBV        19.44   0.90    −0.04    0.02
SAC-SMA    16.45   0.93     6.69    0.61
BMA        16.35   0.93     6.16    –

The BMA weights for the individual watershed models are also listed

Note that the theoretical benefit of using multi-model averaging is not realized in this instance: the BMA deterministic forecast given by Eq. 3 has an average quadratic forecast error of similar magnitude to the best-performing model (SAC-SMA) in the ensemble. This suggests that the BMA approach does not necessarily improve predictive capabilities.
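For reference, the summary statistics of Table 1 can be computed as in the sketch below. The exact %BIAS convention is not spelled out in the text, so the formula used here (total error as a percentage of the total observed flow) is our assumption:

```python
import numpy as np

def forecast_scores(sim, obs):
    """RMSE, linear correlation, and %BIAS of a forecast series."""
    rmse = np.sqrt(np.mean((sim - obs) ** 2))
    corr = np.corrcoef(sim, obs)[0, 1]
    pbias = 100.0 * np.sum(sim - obs) / np.sum(obs)   # assumed %BIAS convention
    return rmse, corr, pbias
```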


It is interesting to observe that the optimized BMA weights do not necessarily reflect the predictive skill of the individual ensemble members. For instance, the GR4J model ranks fourth in predictive skill, but receives the second highest weight in the BMA model. Hamill [11] suggests that this de-weighting of some ensemble members and over-weighting of others is a serious flaw of the BMA method, caused by overfitting of the EM method when using small sample sizes. We agree that it is rather counter-intuitive to assign better ensemble members a lower weight than worse-performing members. However, we disagree that this is caused by overfitting. First of all, our DREAM-derived posterior pdfs typically exhibit small dispersion around the ML values, indicating that the BMA weights are well defined by calibration against the observed streamflow data. Second, we argue that it is not the search method that causes the seemingly strange values of the BMA weights. As demonstrated in this paper, the EM and DREAM methods do their job well: they correctly and consistently find the ML value of the log-likelihood function defined in Eq. 5. The problem is instead the strong correlation among the individual predictors in the ensemble. This causes a significant amount of redundancy, and therefore de-weighting of some ensemble members. One approach to reducing correlation among ensemble members is to use principal component analysis of the time series of predictions of the individual models of the ensemble prior to fitting the BMA model. Another approach would be to describe $g_k(\Delta|f_k)$ with flexible (Gaussian) mixture models or through sequential Monte Carlo (SMC) methods. This is the subject of future work, and our results will be presented in due course.

To determine whether the EM and DREAM algorithms consistently find the same location of the mode of the posterior pdf of the BMA weights and gamma distribution parameters, we conducted a comparison of both algorithms for different training period lengths. We followed the same procedure as described previously, and report in Fig. 6 the average outcome of 30 independent trials as a function of the length of the training data set. The results can be summarized as follows. First, the left panel (Fig. 6a) demonstrates that the EM and DREAM algorithms provide consistent results in terms of the ML estimates for this 8-member streamflow ensemble. The results are almost indistinguishable, and suggest that both approaches are viable for optimizing BMA model-specific predictive distributions other than normal distributions, as implemented previously. Second, and consistent with our previous results, longer training data sets generally result in BMA predictive pdfs that exhibit more spread, and in better performance of the deterministic point predictor during the evaluation period. However, the differences again appear minor. Note that the evaluation period exhibits much lower values of the RMSE because the average flow level is significantly lower than during the training period.


Fig. 6 Comparison of training period lengths for streamflow forecasting: (a) ratio of the maximum likelihood value derived with the EM and DREAM algorithm; (b) RMSE of the BMA deterministic forecast (blue) and average width of the 95% prediction intervals (green) for the calibration period; (c) as in (b), but for the evaluation period. Solid lines in panels b and c represent the EM-derived results, while the squared symbols correspond to the DREAM results. The reported results represent averages of 30 independent trials with randomly selected training data sets from the original 28-year calibration data set


Altogether, the results in Fig. 6c suggest that about one year of streamflow data is sufficient to obtain BMA model parameters that achieve good evaluation performance.

5 Summary and conclusions

Bayesian model averaging has recently been proposed as a statistical method to calibrate forecast ensembles from numerical weather models. Successful implementation of BMA, however, requires accurate estimates of the weights and variances of the individual competing models in the ensemble. In their seminal paper, Raftery et al. [20] recommended the Expectation–Maximization (EM) algorithm for BMA model training. This algorithm is relatively simple to use and computationally efficient. However, global convergence to the appropriate BMA model cannot be guaranteed. Algorithmic modifications are also required to adapt the EM method to conditional pdfs of the individual ensemble members other than the normal. This is not very user-friendly.

In this paper, we have compared the performance of the EM algorithm and the recently developed DREAM MCMC algorithm for estimating the BMA weights and variances. Unlike EM, MCMC simulation uses multiple different trajectories (also called Markov chains) simultaneously to sample the BMA parameters preferentially, based on the underlying probability mass of the likelihood function. Simulation experiments using multi-model ensembles of surface temperature, sea level pressure, and streamflow forecasts have demonstrated that EM and DREAM achieve very similar performance, irrespective of the number of ensemble members and the length of the calibration data set. However, MCMC simulation accommodates a large variety of BMA conditional distributions without the need to modify existing source code, provides a full posterior view of the BMA weights and variances, and efficiently handles the high dimensionality that arises when dealing with large ensemble sizes containing the predictions of many different constituent models.

The various simulation experiments have also demonstrated that the ML values of the BMA weights generally do not follow the inverse rank order of the average quadratic error of the individual ensemble members. This appears counter-intuitive, because one would expect that models with good predictive skill would always receive higher weight than ensemble members that exhibit a poorer fit to the observations. This occurs because the predictions of the individual ensemble members are highly correlated. One approach to reducing correlation among ensemble members is to use principal component analysis. This is the subject of future work.

Finally, in our experiments we have optimized the BMA weights and the other parameters of the model-specific predictive distributions by ML estimation. There are, however, various other summary statistics, such as the Continuous Rank Probability Score (CRPS), Mean Absolute Error (MAE), and Ignorance Score, that provide additional and perhaps conflicting information about the optimal parameters of the BMA model, but that are deemed important for accurate probabilistic weather forecasts. In another paper [26], we have posed the BMA inverse problem in a multiobjective framework, and have examined the Pareto set of solutions between these various objectives. The source codes of BMA and DREAM are written in MATLAB and can be obtained from the first author ([email protected]) upon request.

Acknowledgments The first author is supported by a J. Robert Oppenheimer Fellowship from the LANL postdoctoral program. Support for the third author is from the New Zealand Foundation for Research Science and Technology, Contract C01X0401. We thank Adrian Raftery from the Department of Statistics at the University of Washington, Seattle, for providing the MM5 ensemble weather data, and Richard Ibbitt, Tom Hamill and two anonymous reviewers for providing detailed and insightful reviews that have considerably improved the current manuscript.

References

1. Ajami NK, Duan Q, Sorooshian S (2007) An integrated hydrologic Bayesian multimodel combination framework: confronting input, parameter, and model structural uncertainty in hydrologic prediction. Water Resour Res 43:W01403. doi:10.1029/2005WR004745

2. Barnston AG, Mason SJ, Goddard L, DeWitt DF, Zebiak SE (2003) Multimodel ensembling in seasonal climate forecasting at IRI. Bull Am Meteorol Soc 84:1783–1796. doi:10.1175/BAMS-84-12-1783

3. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 39:1–39

4. Doblas-Reyes FJ, Hagedorn R, Palmer TN (2005) The rationale behind the success of multi-model ensembles in seasonal forecasting—II. Calibration and combination. Tellus 57:234–252

5. Duan Q, Sorooshian S, Gupta V (1992) Effective and efficient global optimization for conceptual rainfall-runoff models. Water Resour Res 28(4):1015–1031

6. Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7:457–472

7. Georgakakos KP, Seo DJ, Gupta H, Schaake J, Butts MB (2004) Characterizing streamflow simulation uncertainty through multi-model ensembles. J Hydrol 298(1–4):222–241

8. Gneiting T, Raftery AE (2005) Weather forecasting with ensemble methods. Science 310:248–249

9. Gneiting T, Raftery AE, Westveld AH III, Goldman T (2005) Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon Weather Rev 133:1098–1118

10. Grimit EP, Mass CF (2002) Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. Wea Forecast 17:192–205

11. Hamill TM (2007) Comments on "Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian model averaging". Mon Weather Rev 135:4226–4230. doi:10.1175/2007MWR1963.1

12. Hamill TM, Colucci SJ (1997) Verification of Eta-RSM short-range ensemble forecasts. Mon Weather Rev 125:1312–1327

13. Krishnamurti TN, Kishtawal CM, LaRow TE, Bachiochi D, Zhang Z, Williford CE, Gadgil S, Surendran S (1999) Improved weather and seasonal climate forecasts from multimodel superensembles. Science 285:1548–1550

14. McLachlan GJ, Krishnan T (1997) The EM algorithm and extensions. Wiley, New York, 274 pp

15. Min S, Hense A (2006) A Bayesian approach to climate model evaluation and multi-model averaging with an application to global mean surface temperatures from IPCC AR4 coupled climate models. Geophys Res Lett 33:1–5

16. Molteni F, Buizza R, Palmer TN, Petroliagis T (1996) The ECMWF ensemble prediction system: methodology and validation. Q J R Meteorol Soc 122:73–119

17. Neuman SP (2003) Maximum likelihood Bayesian averaging of uncertain model predictions. Stoch Environ Res Risk Assess 17:291–305. doi:10.1007/s00477-003-0151-7

18. Palmer TN, Alessandri A, Andersen U, Cantelaube P, Davey M, Délécluse P, Déqué M, Diez E, Doblas-Reyes J, Feddersen H, Graham R, Gualdi S, Guérémy J-F, Hagedorn R, Hoshen M, Keenlyside N, Latif M, Lazar A, Maisonnave E, Marletto V, Morse AP, Orfila B, Rogel P, Terres J-M, Thomson MC (2004) Development of a European multi-model ensemble system for seasonal-to-interannual prediction (DEMETER). Bull Am Meteorol Soc 85:853–872. doi:10.1175/BAMS-85-6-853

19. Raftery AE, Madigan D, Hoeting JA (1997) Bayesian model averaging for linear regression models. J Am Stat Assoc 92:179–191

20. Raftery AE, Gneiting T, Balabdaoui F, Polakowski M (2005) Using Bayesian model averaging to calibrate forecast ensembles. Mon Weather Rev 133:1155–1174

21. Rajagopalan B, Lall U, Zebiak SE (2002) Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. Mon Weather Rev 130:1792–1811

22. Richardson DS (2001) Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of sample size. Q J R Meteorol Soc 127:2473–2489

23. Sloughter JM, Raftery AE, Gneiting T (2006) Probabilistic quantitative precipitation forecasting using Bayesian model averaging. University of Washington, Department of Statistics, Technical Report 496, Seattle, WA, 20 pp


24. Vrugt JA, Robinson BA (2006) Treatment of uncertainty using ensemble methods: comparison of sequential data assimilation and Bayesian model averaging. Water Resour Res W01411. doi:10.1029/2005WR004838

25. Vrugt JA, Gupta HV, Bouten W, Sorooshian S (2003) A Shuffled Complex Evolution Metropolis algorithm for optimization and uncertainty assessment of hydrologic model parameters. Water Resour Res 39(8):1201. doi:10.1029/2002WR001642

26. Vrugt JA, Clark MP, Diks CGH, Duan Q, Robinson BA (2006) Multi-objective calibration of forecast ensembles using Bayesian model averaging. Geophys Res Lett 33:L19817. doi:10.1029/2006GL027126

27. Vrugt JA, ter Braak CJF, Diks CGH, Robinson BA, Hyman JM, Higdon D (2008a) Accelerating Markov chain Monte Carlo simulation by self-adaptive differential evolution with randomized subspace sampling. Int J Nonlinear Sci Numer Simul (in press)

28. Vrugt JA, ter Braak CJF, Clark MP, Hyman JM, Robinson BA (2008b) Treatment of input uncertainty in hydrologic modeling: doing hydrology backwards with Markov Chain Monte Carlo simulation. Water Resour Res. doi:10.1029/2007WR006720

29. Wöhling T, Vrugt JA (2008) Combining multi-objective optimization and Bayesian model averaging to calibrate forecast ensembles of soil hydraulic models. Water Resour Res. doi:10.1029/2008WR007154

30. Ye M, Neuman SP, Meyer PD (2004) Maximum likelihood Bayesian averaging of spatial variability models in unsaturated fractured tuff. Water Resour Res 40:W05113. doi:10.1029/2003WR002557
