SUBMITTED TO IEEE TRANSACTIONS ON NEURAL NETWORKS

A Comprehensive Review of Neural Network-based Prediction Intervals and New Advances

Abbas Khosravi, Member, IEEE, Saeid Nahavandi, Senior Member, IEEE, Doug Creighton, and Amir F. Atiya, Member, IEEE

Abstract—This study evaluates the four leading techniques proposed in the literature for the construction of Prediction Intervals (PIs) for Neural Network (NN) point forecasts. The delta, Bayesian, bootstrap, and Mean–Variance Estimation (MVE) methods are reviewed and their performance for generating high quality PIs is compared. PI-based measures are proposed and applied for the objective and quantitative assessment of each method's performance. A selection of twelve synthetic and real world case studies is used to examine each method's performance for PI construction. The comparison is performed based on the quality of the generated PIs, the repeatability of the results, the computational requirements, and the PIs' variability with regard to the data uncertainty. The results obtained in this study indicate that (i) the delta and Bayesian methods are the best in terms of quality and repeatability, and (ii) the MVE and bootstrap methods are the best in terms of low computational load and the width variability of PIs. The paper also introduces the concept of combinations of PIs, and proposes a new method for generating combined PIs from the traditional PIs. A genetic algorithm is applied for adjusting the combiner parameters through minimization of a PI-based cost function subject to two sets of restrictions. It is shown that the quality of PIs produced by the combiners is dramatically better than the quality of PIs obtained from each individual method.

Index Terms—Prediction interval, neural network, delta, Bayesian, bootstrap, mean-variance estimation.

I. INTRODUCTION

As a biologically inspired analytical technique, neural networks have the capacity to learn and model complex nonlinear relationships. Theoretically, multilayered feedforward NNs are universal approximators, and as such, have an excellent ability to approximate any nonlinear mapping to any degree of accuracy [1]. They do not require an a priori model to be assumed or a priori assumptions to be made about the properties of the data [2]. They have been widely employed for modeling, prediction, classification, optimization, and control purposes [3] [4] [5]. Paliwal et al. [6] comprehensively reviewed comparative studies on applications of NNs in accounting and finance, health and medicine, engineering, manufacturing, and marketing.

Abbas Khosravi is with the Centre for Intelligent Systems Research (CISR), Deakin University, Geelong, Vic, 3117, Australia, Tel: +61 3 5227 2179, Fax: +61 3 5227 1046, e-mail: [email protected].

Saeid Nahavandi is the director of the Centre for Intelligent Systems Research (CISR), Deakin University, Geelong, Vic, 3117, Australia, e-mail: [email protected].

Doug Creighton is with the Centre for Intelligent Systems Research (CISR), Deakin University, Geelong, Vic, 3117, Australia, e-mail: [email protected].

Amir F. Atiya is with the Department of Computer Engineering, Cairo University, Cairo 12613, Egypt, e-mail: [email protected].

Manuscript received December 5, 2010.

After reviewing over one hundred comparative studies, they concluded that NN models outperform their traditional rivals in the majority of cases, no matter the source or type of application.

NNs suffer from two basic limitations, despite their popularity. The first problem is their unsatisfactorily low prediction performance when there is uncertainty in the data. The reliability of point forecasts drops significantly due to the prevalence of uncertainty in the operation of the system. Machine breakdowns on the shopfloor, unexpected passenger demand in public transportation systems, or abrupt changes in weather conditions in the national energy market may have direct impacts on the throughput, performance, or reliability of the underlying systems. As none of these events can be properly predicted in advance, the accuracy of point forecasts is questionable. Even if these events are known or predictable, the targets will be multi-valued, making predictions prone to error. This weakness is due to the theoretical point that NNs generate averaged values of targets conditioned on inputs. Such a reduction cannot be mitigated by changing the model structure or repeating the training process. Liu [7] describes this problem for a NN application in the semiconductor industry, where there are large errors in forecasts of industry growth. Similar stories have been reported in other fields, including, but not limited to, surface mount manufacturing [8], electricity load forecasting [9], [10], fatigue lifetime prediction [11], financial services [12], hydrologic case studies [13], transportation systems [14], and baggage handling systems [15].

The second problem of NNs is that they only provide point predictions, without any indication of their accuracy. Point predictions are less reliable and accurate if the training data is sparse, if targets are multi-valued, or if targets are affected by probabilistic events. To improve decision making and operational planning, the modeler should be aware of the uncertainties associated with the point forecasts. It is important to know how well the predictions generated by NN models match the real targets and how large the risk of a mismatch is. Unfortunately, point forecasts provide no information about the associated uncertainties and carry no indication of their reliability.

To effectively cope with these two fundamental problems, several researchers have studied the development of Prediction Intervals (PIs) for NN forecasts. A PI is comprised of upper and lower bounds that bracket a future unknown value with a prescribed probability, called the confidence level ((1 − α)%). The main motivation for the construction of PIs is to quantify the likely uncertainty in the point forecasts. The availability of PIs allows decision makers and operational planners to efficiently quantify the level of uncertainty associated with the point forecasts and to consider multiple solutions/scenarios for the best and worst conditions. Wide PIs are an indication of a high level of uncertainty in the operation of the underlying system. This information can guide decision makers to avoid selecting risky actions under uncertain conditions. In contrast, narrow PIs mean that decisions can be made more confidently, with less chance of confronting an unexpected condition in the future.

The need to construct PIs for unseen targets is not new, yet the need has intensified dramatically in the last two decades. The corresponding number of papers reporting applications of PIs has also increased in recent years. This is primarily due to the increasing complexity of man-made systems. Examples of such systems are manufacturing enterprises, industrial plants, transportation systems, and communication networks, to name a few. More complexity contributes to high levels of uncertainty in the operation of large systems. Operational planning and scheduling in large systems is often performed based on the point forecasts of the system's future (unseen targets). As discussed above, the reliability of these point forecasts is low and there is no indication of their accuracy.

The delta [16], [17], Bayesian [2], [18], mean-variance estimation [19], and bootstrap [20], [21] techniques have been proposed in the literature for the construction of PIs. Studies recommending methods for the construction and use of PIs have been completed in a variety of disciplines, including transportation [14], the energy market [9], [10], manufacturing [22], and financial services [12]. Although past studies have clarified the need for PI (and CI) construction, there has been no effort to quantitatively evaluate the performance of the different methods together. The literature predominantly deals with the individual implementation of these methods, and comprehensive review studies are rare. Furthermore, the existing comparative studies are often presented subjectively rather than objectively [22] [23]. Often the coverage probability index, the percentage of target values covered by PIs, is used for assessment of PI quality, and discussion of the width of PIs is either ignored or vaguely presented [14] [8] [9] [11] [16] [24] [19] [25]. As discussed later, this may lead to a misleading judgment about the quality of PIs and the selection of wide PIs.

The purpose of this study is to comparatively examine the performance of the four frequently used methods for the construction of PIs. Instead of subjective and imprecise discussions, quantitative measures are proposed for the accurate evaluation of PI quality. Unlike other studies in this field, the proposed quantitative measures simultaneously evaluate PIs from two perspectives: width and coverage probability. A number of synthetic and real world case studies are implemented to check the performance of each method. The datasets used feature different numbers of attributes, training samples, and data distributions. The methods are judged based on the quality of the constructed PIs, the repeatability of results, the computational requirements, and the variability of PIs against the data uncertainty.

As a major contribution, this paper also proposes a new method for constructing combined PIs using traditionally built PIs. As far as we know, this is the first study that uses the concept of PI combination. A genetic algorithm-based optimization method is developed for adjusting the parameters of linear combiners. A unique aspect of the combining method is the cost function used for its training. While traditional cost functions are often error-based, the one proposed here is PI-based. Two sets of restrictions are applied to the combiner parameters to make them theoretically meaningful. It is shown that the proposed combiners outperform the traditional techniques for the construction of PIs in the majority of case studies.

This paper is structured as follows. In Section II, we review the theoretical background of the delta, Bayesian, mean-variance estimation, and bootstrap methods for PI construction. Section III describes quantitative measures for assessment of PI quality. Simulation results are discussed in Section IV for the twelve case studies. Section V introduces the new method for constructing optimal combined PIs. The effectiveness of the proposed combiners is comparatively examined in Section VI for different case studies. Section VII concludes the paper with a summary of results.

II. LITERATURE REVIEW OF PI CONSTRUCTION METHODS

It is often assumed that targets can be modeled by,

t_i = y_i + ε_i    (1)

where t_i is the i-th measured target (n targets in total) and ε_i is the noise, also called the error, with a zero expectation. The error term moves the target away from its true regression mean, y_i, toward the measured value, t_i. In all PI construction methods discussed here, it is assumed that the errors are independently and identically distributed. In practice, an estimate of the true regression mean, ŷ_i, is obtained using a model. According to this, we have,

t_i − ŷ_i = [y_i − ŷ_i] + ε_i    (2)

Confidence Intervals (CIs) deal with the variance of the first term on the right hand side of (2). They quantify the uncertainty between the prediction, ŷ_i, and the true regression mean, y_i. CIs are based on the estimation of characteristics of the probability distribution P(y_i ∣ ŷ_i). In contrast, PIs try to quantify the uncertainty associated with the difference between the measured values, t_i, and the predicted values, ŷ_i. This relates to the probability distribution P(t_i ∣ ŷ_i). Accordingly, PIs are wider than CIs and enclose them.

If the two terms in (2) are statistically independent, the total variance associated with the model outcome becomes,

σ_i² = σ²_ŷi + σ²_εi    (3)

The term σ²_ŷi originates from model misspecification and parameter estimation errors, while σ²_εi is the measure of the noise variance. Upon proper estimation of these values, PIs can be constructed for the outcomes of NN models. In the following sections, four traditional methods for approximating these values and constructing PIs are discussed.
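The decomposition in (3) can be checked numerically. The sketch below simulates targets as in (1) with a deliberately imperfect model whose misspecification error is independent of the noise; all names and constants are our own illustrative choices, not from the paper:

```python
# Illustrative check of the variance decomposition in eq. (3).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y_true = np.sin(np.linspace(0, 6, n))      # true regression mean y_i
noise = rng.normal(0.0, 0.2, n)            # eps_i ~ N(0, 0.2^2)
t = y_true + noise                         # measured targets, eq. (1)

# A deliberately imperfect model: its error (y_i - yhat_i) is independent
# of the noise, so the two variances in (3) should add up.
y_hat = y_true + rng.normal(0.0, 0.1, n)

var_total = np.var(t - y_hat)              # sigma_i^2
var_model = np.var(y_true - y_hat)         # sigma_yhat^2
var_noise = np.var(noise)                  # sigma_eps^2
```

With independent terms the cross-covariance vanishes and the two variances add; when the model error and the noise are correlated, (3) no longer holds exactly.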


A. The Delta Method

The delta method has its roots in the theory of nonlinear regression [26]. The method interprets a NN as a nonlinear regression model, which allows us to apply asymptotic theories for PI construction. Consider that w* is the set of optimal NN parameters that approximates the true regression function, y_i = f(x_i, w*). In a small neighborhood of this set, the NN model can be linearized as,

ŷ_0 = f(x_0, w*) + g_0ᵀ (ŵ − w*)    (4)

where g_0ᵀ is the gradient of the NN output with respect to the network parameters, evaluated at w*,

g_0ᵀ = [ ∂f(x_0, w*)/∂w*_1   ∂f(x_0, w*)/∂w*_2   ⋯   ∂f(x_0, w*)/∂w*_p ]    (5)

where p indicates the number of NN parameters. In practice, the NN parameters, ŵ, are adjusted through minimization of the Sum of Squared Errors (SSE) cost function. Under certain regularity conditions, it can be shown that ŵ is very close to w*. Accordingly, we have,

t_0 − ŷ_0 ≃ [y_0 + ε_0] − [f(x_0, w*) + g_0ᵀ (ŵ − w*)] = ε_0 − g_0ᵀ (ŵ − w*)    (6)

and so,

var(t_0 − ŷ_0) = var(ε_0) + var(g_0ᵀ (ŵ − w*))    (7)

Assuming that the error terms are normally distributed, ε ∼ N(0, σ_ε²), the second term on the right hand side of (7) can be expressed as,

σ²_ŷ0 = σ_ε² g_0ᵀ (FᵀF)⁻¹ g_0    (8)

where F in (8) is the Jacobian matrix of the NN model with respect to its parameters, computed for the training samples,

F = ⎡ ∂f(x_1, w)/∂w_1   ∂f(x_1, w)/∂w_2   ⋯   ∂f(x_1, w)/∂w_p ⎤
    ⎢ ∂f(x_2, w)/∂w_1   ∂f(x_2, w)/∂w_2   ⋯   ∂f(x_2, w)/∂w_p ⎥
    ⎢        ⋮                  ⋮                     ⋮        ⎥
    ⎣ ∂f(x_n, w)/∂w_1   ∂f(x_n, w)/∂w_2   ⋯   ∂f(x_n, w)/∂w_p ⎦    (9)

By substituting (8) into (7), the total variance can be expressed as,

σ_0² = σ_ε² (1 + g_0ᵀ (FᵀF)⁻¹ g_0)    (10)

An unbiased estimate of σ_ε² can be obtained from,

s_ε² = (1/(n − 1)) ∑_{i=1}^{n} (t_i − ŷ_i)²    (11)

According to this, the (1 − α)% PI is computed as detailed in [16],

ŷ_0 ± t_{n−p}^{1−α/2} s_ε √(1 + g_0ᵀ (FᵀF)⁻¹ g_0)    (12)

where t_{n−p}^{1−α/2} is the 1 − α/2 quantile of the cumulative t-distribution function with n − p degrees of freedom.

A Weight Decay Cost Function (WDCF) can be used instead of the SSE cost function to minimize the overfitting problem and to improve the NN generalization power. The WDCF tries to keep the magnitudes of the NN parameters as small as possible,

WDCF = SSE + λ wᵀw    (13)

De Veaux et al. [17] derived the following formula for PI construction for the case where NNs are trained using the WDCF,

ŷ_0 ± t_{n−p}^{1−α/2} s_ε √(1 + g_0ᵀ (FᵀF + λI)⁻¹ (FᵀF) (FᵀF + λI)⁻¹ g_0)    (14)

The inclusion of λ in (14) improves the reliability and quality of PIs, particularly in cases where FᵀF is nearly singular. We will return to the singularity problem in the simulation results section.

Computationally, the delta technique is more demanding in its development stage than in its application stage. Both the Jacobian matrix, F, and s_ε² should be calculated and estimated offline. For PI construction for a new sample, we only need to calculate g_0ᵀ and substitute it into (12) or (14). Apart from this, the other calculations are very simple.

The estimation of σ²_ŷ0 and the calculation of the gradient and Jacobian matrices can be potential sources of error in the construction of PIs using (12) or (14) [27]. Also, the literature does not discuss how λ affects the quality of PIs or how its optimal value can be determined. The delta method assumes that s_ε² is constant for all samples (noise homogeneity). However, there are cases in practice where the level of noise is systematically correlated with the target magnitude or the set of NN inputs. Therefore, it is not unexpected that the delta method will generate low quality PIs for these cases.
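As a concrete illustration of the delta formula (12), the sketch below applies it to a model that is exactly linear in its parameters, so the Jacobian F and the gradient g_0 are exact rather than local approximations. The data, design matrix, and new input x_0 are our own illustrative choices, not from the paper; note that the sketch divides by n − p, the usual regression estimate of the noise variance:

```python
# Delta-method PI for a linear-in-parameters model (stand-in for the
# linearized NN): eq. (12) with an exact Jacobian.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # design matrix
t = X @ np.array([0.5, 2.0]) + rng.normal(0, 0.3, n)       # noisy targets

w_ls, *_ = np.linalg.lstsq(X, t, rcond=None)   # least-squares parameters
resid = t - X @ w_ls
s2 = resid @ resid / (n - p)                   # noise variance estimate
F = X                                          # Jacobian w.r.t. parameters
g0 = np.array([1.0, 0.25])                     # gradient at a new input x0

alpha = 0.05
t_q = stats.t.ppf(1 - alpha / 2, n - p)        # t-quantile, n-p dof
half_width = t_q * np.sqrt(s2 * (1 + g0 @ np.linalg.solve(F.T @ F, g0)))
y0 = g0 @ w_ls                                 # point forecast
lower, upper = y0 - half_width, y0 + half_width
```

For a genuine NN, F and g_0 would instead be computed by differentiating the network output with respect to its weights, as in (5) and (9).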

B. The Bayesian Method

In the Bayesian training framework, NNs are trained based on a regularized cost function,

E(w) = α E_w + β E_D    (15)

where E_D is the SSE and E_w is the sum of squares of the network weights (wᵀw). α and β are the two hyperparameters of the cost function, determining the training purpose. The method assumes that the set of NN parameters, w, is a random set of variables with assumed a priori distributions. Upon availability of a training dataset and a NN model, the density function of the weights can be updated using Bayes' rule,

P(w ∣ D, α, β, M) = P(D ∣ w, β, M) P(w ∣ α, M) / P(D ∣ α, β, M)    (16)

where M and D are the NN model and the training dataset. P(D ∣ w, β, M) and P(w ∣ α, M) are the likelihood function of the data occurrence and the prior density of the parameters (representing our prior knowledge), respectively. P(D ∣ α, β, M) is a normalization factor enforcing that the total probability is one.


Assuming that the ε_i are normally distributed, and that P(D ∣ w, β, M) and P(w ∣ α, M) have normal distributions, we can write,

P(D ∣ w, β, M) = (1/Z_D(β)) e^(−β E_D)    (17)

and

P(w ∣ α, M) = (1/Z_w(α)) e^(−α E_w)    (18)

where Z_D(β) = (π/β)^(n/2) and Z_w(α) = (π/α)^(p/2). n and p are the number of training samples and NN parameters, respectively. By substituting (17) and (18) into (16), we have,

P(w ∣ D, α, β, M) = (1/Z_F(α, β)) e^(−(α E_w + β E_D))    (19)

The purpose of NN training is to maximize the posterior probability P(w ∣ D, α, β, M). This maximization corresponds to the minimization of (15), and that makes the connection between the Bayesian methodology and regularized NNs. By taking the derivative of the logarithm of (19) and setting it equal to zero, the optimal values of α and β are obtained [2] [18],

α^MP = γ / (2 E_w(w^MP))    (20)

β^MP = (n − γ) / (2 E_D(w^MP))    (21)

where γ = p − 2 α^MP tr(H^MP)⁻¹ is the so-called effective number of NN parameters, and p is the total number of NN model parameters. w^MP is the most probable value of the NN parameters, and H^MP is the Hessian matrix of E(w),

H^MP = α ∇²E_w + β ∇²E_D    (22)

Usually, the Levenberg–Marquardt optimization algorithm is applied to approximate the Hessian matrix [28]. Applying this training technique results in NNs with a prediction variance of,

σ_i² = σ_D² + σ²_{w^MP} = 1/β + ∇_{w^MP}ᵀ ŷ_i (H^MP)⁻¹ ∇_{w^MP} ŷ_i    (23)

While the first term on the right hand side of (23) quantifies the amount of uncertainty in the training data (the intrinsic noise), the second term corresponds to the misspecification of the NN parameters and their contribution to the variance of predictions. These terms are σ²_εi and σ²_ŷi in (3), respectively.

As the total variance of the i-th future sample is known, a (1 − α)% PI can be constructed,

ŷ_i ± z_{1−α/2} (1/β + ∇_{w^MP}ᵀ ŷ_i (H^MP)⁻¹ ∇_{w^MP} ŷ_i)^(1/2)    (24)

where z_{1−α/2} is the 1 − α/2 quantile of the normal distribution function with zero mean and unit variance. Also, ∇_{w^MP}ᵀ ŷ_i is the gradient of the NN output with respect to its parameters, w^MP.

The Bayesian method for PI construction has a strong mathematical foundation. NNs trained using the Bayesian learning technique typically have better generalization power than other networks. This minimizes the effect of σ²_{w^MP} in (23) on the width of the PIs. Furthermore, it eliminates the hassle of optimally determining the regularizing parameters. The Bayesian method is computationally demanding in the development stage, similar to the delta technique. It requires calculation of the Hessian matrix in (22), which is time-consuming and cumbersome for large NNs and datasets. However, the computational load decreases in the PI construction stage, as we only need to calculate the gradient of the NN output.
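Once a network has been trained and w^MP, β, and H^MP are available, evaluating (23) and (24) is a single closed-form computation. The sketch below uses a small placeholder gradient and Hessian (our own illustrative numbers, not a real trained NN) purely to show the mechanics:

```python
# Mechanics of the Bayesian PI of eq. (24), with placeholder quantities.
import numpy as np
from scipy import stats

beta = 25.0                          # noise precision: 1/beta = sigma_D^2
grad = np.array([0.4, -1.2, 0.7])    # d y_i / d w at w_MP (placeholder)
H = np.diag([50.0, 80.0, 65.0])      # Hessian of E(w) at w_MP (placeholder)
y_i = 3.1                            # point forecast for the i-th sample

alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)
var_i = 1.0 / beta + grad @ np.linalg.solve(H, grad)   # eq. (23)
lower, upper = y_i - z * np.sqrt(var_i), y_i + z * np.sqrt(var_i)
```

The second term of var_i is the parameter-misspecification contribution; for a well-regularized network it is small relative to the noise term 1/β.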

C. The Mean-Variance Estimation Method

The Mean-Variance Estimation (MVE) method was originally proposed by Nix and Weigend [19] for the construction of PIs. This method also assumes that errors are normally distributed around the true mean of the targets, y(x). Therefore, PIs can easily be constructed if the parameters of this distribution (mean and variance) are known. Both the delta and Bayesian techniques use a fixed target variance for PI construction. In contrast to these techniques, the MVE method estimates the target variance using a dedicated NN. This provides enough flexibility for estimating the heteroscedastic variation of the targets. The dependence of the target variance on the set of inputs is the fundamental assumption of this method for PI construction.

Fig. 1 shows a schematic representation of the MVE method. The sets of inputs to the two NN models can be identical or different. There is no limitation on the size and structure of the two networks. Using an exponential activation function for the output unit corresponding to σ² guarantees strictly positive estimates of the variance. Assuming that NN_y accurately estimates y(x), approximate PIs with a (1 − α)% confidence level can be constructed as follows,

ŷ(x, w_y) ± z_{1−α/2} √(σ̂²(x, w_σ))    (25)

where w_y and w_σ are the parameters of the NN models for estimation of y and σ², respectively. The target variance values, σ_i², are not known a priori. This excludes the application of error-based minimization techniques for training NN_σ. Instead, a maximum likelihood estimation approach can be applied for training these NNs. Based on the assumption of normally distributed errors around ŷ_i, the conditional distribution of the data will be,

P(t_i ∣ x_i, NN_y, NN_σ) = (1/√(2π σ_i²)) e^(−(t_i − ŷ_i)² / (2σ_i²))    (26)

Taking the natural log of this distribution and ignoring the constant terms results in the following cost function, which is minimized over all samples,

C_MVE = (1/2) ∑_{i=1}^{n} [ ln(σ_i²) + (t_i − ŷ_i)² / σ_i² ]    (27)

Using this cost function, an indirect three-phase training technique was proposed in [19] for simultaneously adjusting w_y and w_σ. The proposed algorithm needs two datasets, namely D_1 and D_2, for training NN_y and NN_σ. In Phase I of the training algorithm, NN_y is trained to estimate y_i. Training is performed through minimization of an error-based cost function for the first dataset, D_1. To avoid the overfitting problem, D_2 can be used as the validation set and for terminating the training algorithm. Nothing is done with NN_σ in this phase. In Phase II, w_y are fixed, and D_2 is used for adjusting the parameters of NN_σ. The adjustment of w_σ is achieved through minimizing the cost function defined in (27). NN_y and NN_σ are used to approximate y_i and σ_i² for each sample, respectively. The cost function is then evaluated for the current set of NN_σ weights, w_σ. These weights are then updated using traditional gradient descent-based methods. D_1 can also be applied as the validation set to limit the overfitting effects. In Phase III, two new training sets are resampled and applied for the simultaneous adjustment of both networks' parameters. The re-training of NN_y and NN_σ is again carried out through minimization of (27). As before, one of the sets is used as the validation set.
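The maximum likelihood cost in (27) can be written in a few lines. The sketch below (our own minimal numpy version, not the paper's code) also shows why the cost's minimum over the variance is reached when σ_i² matches the squared error: both overconfident (too small) and underconfident (too large) variance estimates are penalized:

```python
# Minimal sketch of the MVE cost function, eq. (27).
import numpy as np

def mve_cost(t, y_hat, var_hat):
    """Negative log-likelihood of eq. (27), up to constants.
    var_hat must be strictly positive (e.g. via an exponential output unit)."""
    return 0.5 * np.sum(np.log(var_hat) + (t - y_hat) ** 2 / var_hat)

t = np.array([1.0, 2.0, 3.0])       # targets (illustrative)
y_hat = np.array([1.1, 1.8, 3.2])   # NN_y point forecasts (illustrative)
err2 = (t - y_hat) ** 2

c_matched = mve_cost(t, y_hat, err2)         # variance = squared error
c_small = mve_cost(t, y_hat, 0.1 * err2)     # overconfident variance
c_large = mve_cost(t, y_hat, 10.0 * err2)    # underconfident variance
```

Per sample, d/dσ² [ln σ² + e/σ²] = 0 at σ² = e, so the matched variance attains the minimum of the cost.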

The main advantages of this method are its simplicity and that there is no need to calculate complex derivatives or to invert the Hessian matrix. Nonstationary variances can be approximated by employing more complex structures for NN_σ or through proper selection of the set of inputs.

The main drawback of the MVE method is that it assumes NN_y precisely estimates the true mean of the targets, y_i. This assumption can be violated in practice for a variety of reasons, including the existence of a bias in fitting the data due to possible under-specification of the NN model or due to the omission of important attributes affecting the target behavior. In these cases, the NN generalization power is weak, resulting in accumulated uncertainty in the estimation of y_i. Therefore, PIs constructed using (25) will underestimate (or overestimate) the actual (1 − α)% PIs, leading to a low coverage probability.

Assuming ŷ_i ≃ y_i implies that the MVE method considers only one portion of the total uncertainty in the construction of PIs. The considered variance is only due to the errors, not to the misspecification of the model parameters (either w_y or w_σ). This can result in misleadingly narrow PIs with a low coverage probability. This critical drawback has been theoretically identified and practically demonstrated in the literature [29].

D. The Bootstrap Method

Bootstrap is by far the most common technique documented in the literature for the construction of CIs and PIs. The method assumes that an ensemble of NN models will produce a less biased estimate of the true regression of the targets [21]. As the generalization errors of the NN models are made on different subsets of the parameter space, the collective decision produced by the ensemble of NNs is less likely to be in error than the decision made by any of the individual NN models. B training datasets are resampled from the original dataset with replacement, {D_b}_{b=1}^{B}. The method estimates the variance due to model misspecification, σ²_ŷ, by building B NN models (Fig. 2). According to this assumption, the true regression is estimated by averaging the point forecasts of the B models,

Fig. 1. A schematic of the mean-variance estimation method for the construction of PIs.

ŷ_i = (1/B) ∑_{b=1}^{B} ŷ_i^b    (28)

where ŷ_i^b is the prediction for the i-th sample generated by the b-th bootstrap model. Assuming that the NN models are unbiased, the model misspecification variance can be estimated using the variance of the B model outcomes,

σ²_ŷi = (1/(B − 1)) ∑_{b=1}^{B} (ŷ_i^b − ŷ_i)²    (29)

This variance is mainly due to the random initialization of the parameters and the use of different datasets for training the NNs.

CIs can be constructed using the approximation of σ²_ŷi in (29). To construct PIs, we also need to estimate the variance of the errors, σ²_εi. From (3), σ_ε² can be calculated as follows,

σ_ε² ≃ E{(t − ŷ)²} − σ²_ŷ    (30)

According to (30), a set of squared residuals is developed,

r_i² = max((t_i − ŷ_i)² − σ²_ŷi, 0)    (31)

where ŷ_i and σ²_ŷi are obtained from (28) and (29). These residuals are linked with the set of corresponding inputs to form a new dataset,

D_r² = {(x_i, r_i²)}_{i=1}^{n}    (32)

A new NN model can be indirectly trained to estimate the unknown values of σ²_εi, so as to maximize the probability of observing the samples in D_r². The procedure for the indirect training of this new NN is very similar to the steps of the MVE method described in Section II-C. The training cost function is defined as,

CBS =1

2

n∑i=1

[ln(�2�i) +

r2i�2�i

] (33)

As noted before, the NN output node activation function is selected to be exponential, enforcing a positive value for σ²_{ε_i}. The minimization of C_BS can be performed using a variety of methods, including traditional gradient descent.

Fig. 2. An ensemble of B NN models used by the bootstrap method.

The described bootstrap method is traditionally called bootstrap pairs. There exists another bootstrap method, called bootstrap residuals, which resamples the prediction residuals. Further information on this method can be found in [27].

For the construction of PIs using the bootstrap method, B + 1 NN models are required in total: B NN models (assumed to be unbiased) are used for the estimation of σ̂²_{y_i}, and one model is used for the estimation of σ²_{ε_i}. Therefore, this method is computationally more demanding than the other methods in its development stage (by a factor of B + 1). However, once the models are trained offline, the online computational load for PI construction is limited to B + 1 NN point forecasts. This is in contrast with the claim in the literature that bootstrap PIs are computationally more intensive than those of other methods [25]. This claim will be precisely examined in the later sections. Simplicity is another advantage of using the bootstrap method for PI construction: there is no need to calculate complex matrices and derivatives, as required by the delta and Bayesian techniques.

The main disadvantage of the bootstrap technique is its dependence on the B NN models. Frequently, some of these models are biased, leading to an inaccurate estimation of σ̂²_{y_i} in (29). As a consequence, the total variance will be underestimated, resulting in narrow PIs with a low coverage probability.

III. PI ASSESSMENT MEASURES

Discussion in the literature on the quality of constructed PIs is often vague and incomplete [14] [8] [9] [11] [16] [24] [19] [25]. Frequently, PIs are assessed from the coverage probability perspective alone, without any discussion of how wide they are. As discussed in [10] [30], such an assessment is subjective and can lead to misleading results. Here we briefly discuss two indices for the quantitative and comprehensive assessment of PIs.

The most important characteristic of PIs is their coverage probability. The PI Coverage Probability (PICP) is measured by counting the number of target values covered by the constructed PIs,

PICP = (1/n_test) Σ_{i=1}^{n_test} c_i    (34)

where,

c_i = 1 if t_i ∈ [L_i, U_i], and c_i = 0 if t_i ∉ [L_i, U_i]    (35)

Here, n_test is the number of samples in the test set, and L_i and U_i are the lower and upper bounds of the i-th PI, respectively. Ideally, PICP should be very close to, or larger than, the nominal confidence level associated with the PIs.

PICP has a direct relationship with the width of PIs. A satisfactorily large PICP can easily be achieved by widening the PIs from either side. However, such PIs are too conservative and less useful in practice, as they do not show the variation of the targets. Therefore, a measure is required to check how wide the PIs are. The Mean PI Width (MPIW) quantifies this aspect of PIs [10],

MPIW = (1/n_test) Σ_{i=1}^{n_test} (U_i − L_i)    (36)

MPIW shows the average width of the PIs. Normalizing MPIW by the range of the underlying target, R, allows us to compare PIs constructed for different datasets (the new measure is called NMPIW),

NMPIW = MPIW / R    (37)

Both PICP and NMPIW evaluate the quality of PIs from one aspect only. A combined index is required for the comprehensive assessment of PIs from both the coverage probability and width perspectives. The new measure should give a higher priority to PICP, as it is the key feature determining whether the constructed PIs are theoretically correct or not. The Coverage Width-based Criterion (CWC) evaluates PIs from both perspectives,

CWC = NMPIW (1 + γ(PICP) e^{−η(PICP − μ)})    (38)

where γ(PICP) is given by,

γ(PICP) = 0 if PICP ≥ μ, and 1 if PICP < μ    (39)

η and μ in (38) are two hyperparameters controlling the location and amount of the CWC jump. They can easily be determined based on the level of confidence associated with the PIs. μ corresponds to the nominal confidence level associated with the PIs and can be set to 1 − α. The design of CWC is based on two principles: (i) if PICP is less than the nominal confidence level, (1 − α)%, CWC should be large regardless of the width of the PIs (measured by NMPIW), and (ii) if PICP is greater than or equal to its corresponding confidence level, then NMPIW should be the influential factor. γ(PICP), as defined in (39), eliminates the exponential term of CWC when PICP is greater than or equal to the nominal confidence level.

In the CWC measure, the exponential term penalizes the violation of the coverage probability. This is, however, a smooth penalty rather than a hard one, for the following reasons: it is appropriate to penalize the degree of violation rather than to apply an abrupt binary penalty, and a smooth penalty also allows for statistical errors due to the finite number of samples.

The CWC measure tries to compromise between the informativeness and the correctness of PIs. As indicated by Halberg et al. [31], precision is much more important than ambiguity. Therefore, PIs should be as narrow as possible from the informativeness perspective. However, as discussed above, narrowness may lead to some targets not being bracketed, resulting in a low coverage probability (low quality PIs). CWC evaluates PIs from these two conflicting perspectives: informativeness (being narrow) and correctness (having an acceptable coverage probability).
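The three measures reduce to a few lines of code; a minimal sketch, with η = 50 and μ = 0.9 supplied as defaults to match the settings used later in Section IV-B:

```python
import math

def picp(targets, lower, upper):
    """Eqs. (34)-(35): the fraction of targets inside their intervals."""
    hits = sum(1 for t, L, U in zip(targets, lower, upper) if L <= t <= U)
    return hits / len(targets)

def nmpiw(lower, upper, target_range):
    """Eqs. (36)-(37): mean PI width normalized by the target range R."""
    mpiw = sum(U - L for L, U in zip(lower, upper)) / len(lower)
    return mpiw / target_range

def cwc(picp_val, nmpiw_val, eta=50.0, mu=0.9):
    """Eqs. (38)-(39): NMPIW inflated by an exponential penalty when PICP < mu."""
    gamma = 0.0 if picp_val >= mu else 1.0
    return nmpiw_val * (1.0 + gamma * math.exp(-eta * (picp_val - mu)))
```

When PICP meets the nominal level the penalty vanishes and CWC equals NMPIW; otherwise the exponential term dominates regardless of the width.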

IV. QUANTITATIVE ASSESSMENT AND SIMULATION RESULTS

A. Case Studies

The performance of the delta, Bayesian, MVE, and bootstrap techniques for the construction of PIs is compared in this section. Twelve case studies are defined in the following list and used to evaluate the effectiveness of each method. More information about the datasets can be found in Table I and the cited references. All datasets are available from the authors on request:

1) Data in the first case study comes from a synthetic mathematical function, g(x₁, x₂, x₃, x₄, x₅) = 0.0647 (12 + 3x₁ − 3.5x₂² + 7.2x₃³)(1 + cos(4πx₄))(1 + 0.8 sin(3πx₅)). Normally distributed noise is added to the samples as data uncertainty.

2) This case study includes a dependent variable (the estimated percentage of body fat) and thirteen continuous independent variables measured in 252 men. The aim is to predict the body fat percentage using the independent variables.

3) The third case study considers a real world baggage handling system. The goal is to estimate the time required to process 70% of each flight's bags (T70). The level of uncertainty in this system is high due to the frequent occurrence of probabilistic events.

4) Similar to the previous item, this case study attempts to estimate the time required to process 90% of each flight's bags (T90). In practice, the prediction of T90 is more difficult than that of T70 due to the higher level of uncertainty affecting it.

5) Case study 5 relates air pollution to the traffic volume and meteorological variables.

6) The relationship between the concrete compressive strength and selected attributes is studied in case study 6.

7) In case study 7, target values are generated through the following model: y = g(x) + ε, where g(x) = x² + sin(x) + 2. The inputs x are randomly generated between −10 and 10. ε follows a Gaussian distribution with mean zero and variance g(x)/τ, where τ = 1, 5, 10. The smaller the τ, the stronger the noise. While the additive noise in case study 1 is homogeneous, it is heterogeneous in this case study. As indicated in [23], heterogeneity of the noise ruins the point prediction performance of regression models.

8) Data in this case study comes from a real industrial dryer, sampled at ten-second intervals. The purpose is to model the relationship between the dry bulb temperature and three independent attributes.

9) As in the previous case study, the wet bulb temperature is approximated using three inputs.

10) This case study is again related to case study 8. The modeling goal is to estimate the moisture content of the raw materials based on independent inputs.

11) The data in this case study is generated from an industrial winding process. NN models are trained to approximate the tension in the web between reels 1 and 2 (T12) from four inputs.

12) Similar to the previous case study, the goal here is to model the tension in the web between reels 2 and 3 (T23) using NN models.

B. Experimental Procedure

Fig. 3 shows the procedure for performing the experiments with the twelve case studies in this paper. The data is randomly split into the 1st training set, D1 (40%), the 2nd training set, D2 (40%), and the test set (20%). D2 is required by the MVE and bootstrap methods for PI construction. The optimal NN structure for each dataset is determined using a five-fold cross validation technique. Mean Absolute Percentage Errors (MAPEs) are calculated and compared for single and two layer NNs to determine the optimal structure.
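The 40/40/20 split can be sketched as follows (the seed parameter is illustrative; the paper re-splits the data at random in every replicate):

```python
import random

def split_data(samples, seed=None):
    """Random 40% / 40% / 20% split into D1, D2, and the test set (Section IV-B)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n1 = int(0.4 * len(shuffled))
    n2 = int(0.4 * len(shuffled))
    return shuffled[:n1], shuffled[n1:n1 + n2], shuffled[n1 + n2:]
```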

After determining the NN structure, the delta, Bayesian, MVE, and bootstrap methods are used to construct PIs for the test samples. PIs are constructed with an associated 90% confidence level (α equal to 0.1). η and μ are set to 50 and 0.9, which greatly penalizes PIs with a PICP lower than 90%. PICP, NMPIW, and CWC are computed for the obtained PIs and are recorded for later analysis. PI construction is repeated ten times using the randomly split datasets and redeveloped NNs. Upon the termination of this loop, the performance of the four methods is judged by calculating the statistical characteristics of PICP, NMPIW, and CWC.

PI_Delta were first constructed using (12). We later decided to use (14) for PI construction due to the singularity problem and the low quality of the PIs generated using (12). λ is equal to 0.9 in all experiments for constructing PI_Delta.

TABLE I
CASE STUDY DATASETS USED IN THE EXPERIMENTS

Case Study | Target | Samples | Attributes | Reference
1 | 5-dimensional function | 300 | 5 | [32] [33]
2 | Percentage of body fat | 252 | 13 | [34]
3 | T70 | 272 | 2 | [15] [35]
4 | T90 | 272 | 2 | [15] [35]
5 | Air pollution (NO2) | 500 | 7 | [34]
6 | Concrete compressive strength | 1030 | 9 | [36]
7 | 1-dimensional function with heterogeneous noise | 500 | 1 | [23]
8 | Dry bulb temperature | 876 | 3 | [37]
9 | Wet bulb temperature | 876 | 3 | [37]
10 | Moisture content of raw material | 876 | 3 | [37]
11 | T12 | 2500 | 5 | [37]
12 | T23 | 2500 | 5 | [37]

Fig. 3. The experiment procedure for construction and evaluation of PIs.

A Simulated Annealing (SA) method [38] is applied for the minimization of the cost functions (27) and (33) in the MVE and bootstrap methods. Traditionally, these cost functions have been optimized using gradient descent-based methods. However, such methods are likely to become trapped in local minima, and may therefore fail to find the optimal set of NN parameters. In contrast, SA has been shown to be highly efficient in finding the optimal solution for complex optimization problems [10]. It is therefore applied here for the minimization of the cost functions and the adjustment of the corresponding NN parameters. A geometric cooling schedule with a cooling factor of 0.9 is applied to guide the optimization process.
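The SA minimization with a geometric cooling schedule (factor 0.9) can be sketched generically; cost and neighbor are problem-specific placeholders here, not the authors' implementation:

```python
import math
import random

def simulated_annealing(cost, x0, neighbor, t0=1.0, cooling=0.9,
                        iters_per_temp=20, t_min=1e-3, seed=None):
    """Generic SA loop; the temperature is multiplied by the cooling
    factor after each sweep (geometric cooling: t_{k+1} = 0.9 * t_k)."""
    rng = random.Random(seed)
    x, fx = x0, cost(x0)
    best, fbest = x, fx
    t = t0
    while t > t_min:
        for _ in range(iters_per_temp):
            y = neighbor(x, rng)
            fy = cost(y)
            # Accept improvements always; accept worse moves with Boltzmann probability.
            if fy < fx or rng.random() < math.exp(-(fy - fx) / t):
                x, fx = y, fy
                if fx < fbest:
                    best, fbest = x, fx
        t *= cooling
    return best, fbest
```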

C. Results and Discussion

Table II shows a summary of all the results obtained for the test samples of the twelve case studies using the four PI construction methods. CWC and its statistics are computed for the quantitative assessment and comparison of the different methods' performance. Hereafter, for ease of reference, the Delta, Bays, MVE, and Bootstrap subscripts are used to indicate the delta, Bayesian, mean-variance estimation, and bootstrap methods for the construction of PIs. The Best subscript corresponds to the highest quality PIs constructed using a method in ten replicates. For the median and standard deviation, the Median and SD subscripts are used, respectively.

As per the results in Table II, the delta technique generates the highest quality PIs in 6 out of 12 case studies. CWC_Best of the delta technique is either the smallest or the second smallest in 11 out of 12 case studies. The medians for the delta technique are also the smallest for the majority of case studies (either rank 1 or 2 in 10 out of 12 case studies). This indicates that the frequency of generating high quality PIs using the delta technique is high. However, the method has a large standard deviation, meaning that it may generate low quality PIs in some cases. These large values are an indication of the unreliability of the PIs in specific cases. This unreliability can be due to violation of the fundamental assumptions of the delta technique, for instance regarding the noise distribution. An over-fitted NN often has a low generalization ability, leading to some constructed PIs not bracketing the corresponding targets. This results in a low PICP and a large CWC. Alternatively, an improperly trained NN has a large prediction error; therefore, s_ε in (14) will be large, leading to unnecessarily wide PIs. In either case, the quality of the constructed PIs will not be the best.

The medians and best values of PI_Bays are close for the majority of case studies. This indicates that the Bayesian method generates consistent results in different replicates of an experiment. The obtained CWCs also show that the constructed PIs effectively bracket the targets. However, the method tends to build wide PIs to achieve a satisfactorily high PICP. Fig. 4 shows the PICPs and NMPIWs of PI_Best for all case studies. Comparing the best PI_Delta and PI_Bays reveals that PICP_Bays is greater than or equal to PICP_Delta in 9 out of 12 case studies. In contrast, the best PI_Delta are on average narrower than PI_Bays. As PICP_Delta in all cases is greater than the prescribed confidence level (90%), one may conclude that the delta technique generates higher quality PIs. However, both the median and the standard deviation of CWC_Bays are smaller than those of CWC_Delta in 7 out of 12 case studies. The increased repeatability of the results highlights the strength of the Bayesian method in regenerating high quality PIs, although they are wider than the others.

An important feature of the Bayesian method for PI construction is the consistency of the obtained results. For PI_Bays, CWC_SD is small in 7 out of 12 case studies. Also, the median of CWC_SD is 30.39, which is the smallest amongst the four methods. This small value is due to the nature of the Bayesian regularization technique used for training the NNs. The MAPE for ten replicates of an experiment is shown in Fig. 5 for the twelve case studies. With the exception of a few cases, the MAPE remains almost the same across the ten replicates of an experiment. As per these results, the performance of NN models trained using the Bayesian technique is less affected by the random initialization of the NN parameters or by the random partition into training/test examples.

TABLE II
STATISTICAL CHARACTERISTICS OF CWC FOR PI_Delta, PI_Bays, PI_MVE, AND PI_Bootstrap FOR THE TWELVE CASE STUDIES

Case Study | Delta: CWC_Best, CWC_Median, CWC_SD | Bayesian: CWC_Best, CWC_Median, CWC_SD | MVE: CWC_Best, CWC_Median, CWC_SD | Bootstrap: CWC_Best, CWC_Median, CWC_SD
1–12 | … | … | … | …

Fig. 4. PICP (top) and NMPIW (bottom) for the best PIs generated by the delta, Bayesian, MVE, and bootstrap methods.

Fig. 5. MAPE for ten replicates of NN_Bays experiments for the twelve case studies.

Bootstrap is also a stable method for the construction of PIs. CWC_Bootstrap does not fluctuate rapidly and remains approximately the same across different replicates of an experiment. The smoothness of CWC is mainly due to the incorporation of the variability caused by the random initialization of NN parameters (using B models instead of a single model). It has long been established that model aggregation is effective in improving generalization ability [39] [40]. Training multiple NNs from various random initial points provides a better coverage of the parameter space. The other methods fail to properly capture the variability due to the random choice of initial parameters.

Compared to the other methods, especially the delta method, PI_Bootstrap are wider. This extra width is due to the overestimation of σ̂²_{y_i} in (29) and σ²_ε in (30). Apart from the NN model selection and training process, such an overestimation might be caused by the small number of bootstrap models. In our experiments, we trained and used 10 bootstrap models for PI construction. However, Efron and Tibshirani [41] and Davison and Hinkley [42] have suggested that B should be at least 250 (and in some cases 1000) for estimating σ̂²_{y_i} in practice. If this claim holds, there should be a negative correlation between B and CWC as the measure of PI quality. The claim is examined for all case studies: the number of bootstrap models, B, is changed from 100 to 1000 with an increment of 100, and the correlation coefficients between CWC and B are calculated for the twelve case studies. These correlation coefficients are shown in Fig. 6. Although the coefficients are negative in 7 out of 12 case studies, the relationships are not strong. Moreover, the quality of PIs decreased in 5 out of 12 case studies. The mean of the correlation coefficients over the twelve case studies is -0.04. According to these results, we may conclude that a strong inverse relationship (a large negative correlation coefficient) between CWC_Best and B does not exist. Greatly increasing the number of bootstrap models does not always improve the quality of PI_Bootstrap.

Fig. 6. Correlation coefficient between the number of bootstrap models and the quality of constructed PIs, measured by CWC, for the twelve case studies (per case study: -0.17, -0.02, 0.24, -0.53, 0.14, -0.04, -0.61, 0.18, 0.11, -0.04, -0.06, 0.28).

PI_MVE have the worst quality among the constructed PIs.

This is mainly due to the unsatisfactorily low PICPs, which stem from the improper estimation of the target variance using the NN models. It is highly likely that the target variance has little systematic relationship with the inputs. Even if such a relationship exists, some important variables may be missing. Another source of the problem can be the invalidity of the fundamental assumption of the MVE method (ŷ_i ≃ y_i). This assumption can be violated for many reasons, especially in practical cases. Misspecification of the NN model parameters, non-optimal selection of the NN architecture, and improper training of the NN model are among the potential causes.

It is also important to observe how the widths of the PIs change for different observations of a target. The variability of the widths of the PIs is an indication of how well they respond to the level of uncertainty in the datasets. Practically, we expect wider PIs in cases where there is more uncertainty in the dataset (e.g., multi-valued targets for approximately equal conditions, or a high level of noise affecting the targets). Fig. 7 shows PI_Median for the delta, Bayesian, MVE, and bootstrap methods calculated for case study 6.¹ Comparing PI_Delta with the others shows that PI_Delta are approximately constant in width and only their centers move up and down (the centers are in fact the point forecasts generated by the NN model). From the mathematical point of view, this means g₀ᵀ(FᵀF + λI)⁻¹(FᵀF)(FᵀF + λI)⁻¹g₀ ≪ 1, and therefore the width of PI_Delta is mainly determined by s_ε. A similar situation occurs for the Bayesian method: the width of PI_Bays is dominated by 1/β in (24) (1/β ≫ ∇ᵀ_{w_MP} ŷ_i (H^MP)⁻¹ ∇_{w_MP} ŷ_i). The MVE and bootstrap methods show a better performance, and their PIs have more variable widths. These variable widths imply that the estimation of variances using NNs enables these methods to respond appropriately to the level of uncertainty in the data.

¹For better visualization, only PIs for samples 150 to 200 are shown.

TABLE III
THE COEFFICIENT OF VARIATION (%) FOR PI_Median CONSTRUCTED USING THE DELTA, BAYESIAN, MVE, AND BOOTSTRAP METHODS

Case Study | Delta | Bayesian | MVE | Bootstrap
1 | 2.40 | 2.75 | 7.77 | 8.45
2 | 7.38 | 8.27 | 8.95 | 10.94
3 | 8.41 | 15.02 | 12.88 | 16.77
4 | 5.06 | 7.29 | 12.65 | 20.25
5 | 6.82 | 6.49 | 12.46 | 13.80
6 | 6.04 | 7.35 | 8.22 | 14.46
7 | 42.75 | 36.39 | 14.17 | 18.59
8 | 9.61 | 4.20 | 11.32 | 19.83
9 | 8.10 | 3.36 | 10.63 | 11.57
10 | 3.12 | 1.75 | 12.11 | 8.44
11 | 1.75 | 4.50 | 11.08 | 12.17
12 | 9.17 | 8.70 | 13.03 | 12.72

The Coefficients Of Variation (COVs) (the ratio of the standard deviation to the mean) for the widths of PI_Median constructed using the delta, Bayesian, MVE, and bootstrap methods are shown in Table III. According to these statistics, PI_Bootstrap have the largest variability in width. The MVE method also shows an acceptable performance in terms of variable PI widths. PI_Delta and PI_Bays have the lowest COVs, on average 9.21% and 8.84% respectively. These are lower than the COVs for the MVE (11.27%) and bootstrap (14.00%) methods.
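The COV statistic in Table III is simply the ratio of the standard deviation to the mean of the interval widths:

```python
import statistics

def coefficient_of_variation(widths):
    """COV of a set of PI widths; multiply by 100 for the
    percentages reported in Table III."""
    return statistics.stdev(widths) / statistics.mean(widths)
```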

To check the effects of noise heterogeneity on the quality of PIs, we change the amount of noise affecting the targets in case study 7. PIs are constructed for three different values of τ: 10, 5, and 1. As per the model of case study 7, τ = 1 means the additive noise has the largest variance (more uncertainty in the data). Table IV shows the characteristics of CWC_Delta, CWC_Bays, CWC_MVE, and CWC_Bootstrap for this experiment. As the noise level (variance) increases, the PIs become wider to keep the coverage probability satisfactorily high (at least 90%). The delta and Bayesian techniques show the best performance in the three cases. PI_Delta and PI_Bootstrap are the narrowest among the four methods, with a small median and standard deviation. It is important to mention that PI_Bays are of higher quality than the others in the three cases; their median is very close to the best PI_Bays. The MVE method is the worst, and its performance is highly affected by the level of noise in the data.

The increase in the width of the PIs reflects the most important advantage of PIs over point predictions: how they respond to uncertainty in the data. While forecasted points carry no indication of the presence of uncertainty in the data, the variable width of PIs is an informative index of how uncertain the data is and of what the risks of decision making are.

Fig. 7. The PI_Median constructed for case study 6 using the delta method (top left), the Bayesian method (top right), the MVE method (bottom left), and the bootstrap method (bottom right).

TABLE IV
STATISTICAL CHARACTERISTICS OF CWC_Delta, CWC_Bays, CWC_MVE, AND CWC_Bootstrap FOR CASE STUDY 7 WITH DIFFERENT VALUES OF τ

Method | Index | τ = 10 | τ = 5 | τ = 1
Delta | CWC_Best | 6.15 | 9.45 | 18.84
Delta | CWC_Median | 20.49 | 24.33 | 104.71
Delta | CWC_SD | 68.76 | 197.14 | 109.06
Bayesian | CWC_Best | 7.99 | 10.92 | 23.09
Bayesian | CWC_Median | 8.47 | 12.65 | 25.31
Bayesian | CWC_SD | 42.43 | 4.48 | 31.35
MVE | CWC_Best | 48.12 | 80.37 | 70.40
MVE | CWC_Median | 107.59 | 102.64 | 103.04
MVE | CWC_SD | 26.29 | 15.52 | 21.34
Bootstrap | CWC_Best | 33.56 | 42.00 | 16.90
Bootstrap | CWC_Median | 90.13 | 91.49 | 76.61
Bootstrap | CWC_SD | 35.64 | 39.64 | 46.66

The computational requirements of the four methods for PI construction are different. It is important to note that NN-based PI construction methods have two types of computational load: offline and online. The offline requirements include the elapsed time for training the NN models and for calculating some matrices, such as the Jacobian or Hessian. The online computational load is the amount of time spent on the construction of PIs for a new sample. From a practical viewpoint, the offline computational load is less important: models can be trained, fine tuned, and saved for later use at a convenient time. Therefore, it is not reasonable to compare PI construction methods by their offline computational requirements. The online computational requirement for the construction of PIs for a new sample is a more important and critical issue. As PIs may be used for optimization purposes and for conducting what-if analysis in real world cases, a cheap construction cost is a real advantage for the underlying method. In the experiments performed here, the PI construction methods are reviewed and compared from this perspective.

Table V summarizes the elapsed times for the construction of PIs for the test samples of the twelve case studies. The tabulated times are for a single replicate only.

TABLE V
REQUIRED TIME FOR PI CONSTRUCTION USING THE DELTA, BAYESIAN, MVE, AND BOOTSTRAP METHODS FOR TEST SAMPLES

Case Study | Delta (s) | Bayesian (s) | MVE (s) | Bootstrap (s)
1 | 1.24 | 2.77 | 0.03 | 0.06
2 | 0.94 | 2.19 | 0.03 | 0.07
3 | 0.98 | 2.37 | 0.01 | 0.05
4 | 0.97 | 2.34 | 0.01 | 0.05
5 | 2.34 | 4.48 | 0.01 | 0.05
6 | 4.91 | 9.32 | 0.01 | 0.06
7 | 1.46 | 4.11 | 0.02 | 0.06
8 | 3.04 | 7.41 | 0.01 | 0.05
9 | 3.03 | 7.44 | 0.01 | 0.06
10 | 3.04 | 7.42 | 0.01 | 0.06
11 | 54.30 | 21.74 | 0.01 | 0.06
12 | 53.91 | 21.74 | 0.01 | 0.06

According to the obtained results, the Bayesian and delta techniques are computationally much more expensive than the MVE and bootstrap methods for PI construction. Both the MVE and bootstrap methods are hundreds of times faster than the other two. The MVE method is the least expensive, as its online computational requirement is negligible. The elapsed time for the construction of PI_Bootstrap is also very small. This is in contrast with the frequently stated claim in the literature that the bootstrap method is computationally highly demanding. The delta method has a lower computational load than the Bayesian method in 10 out of 12 case studies. However, PI_Delta are computationally more expensive than PI_Bays for cases 11 and 12. According to the dataset information shown in Table I, the number of samples for these two cases is greater than for the other cases. This implies that the delta technique has a larger computational burden than the Bayesian technique for large datasets.

The results presented in Table V are an indication of the online computational requirements of the four methods. Consider the case in which we need to construct PIs for the travel times of bags in case study 4, T90. At least one thousand what-if scenarios are required for optimal operational planning and scheduling. According to Table V, the elapsed times for conducting these experiments using the four methods are T_Delta = 2345 s, T_Bays = 4479 s, T_MVE = 12 s, and T_Bootstrap = 55 s. Real time operational planning and optimization using the MVE and bootstrap methods is possible, as both T_MVE and T_Bootstrap are less than one minute. The delta and Bayesian methods do not finish their online computations in a practical time, and are therefore not suitable for real time operational planning in a baggage handling system.

The decision regarding the suitability of a PI construction method for optimization and decision making purposes depends on the size of the dataset and the constraints of the underlying system. For instance, while the delta and Bayesian methods are not the best option for case studies 3 and 4, they can easily be applied to case studies 8-10. There, the PI construction methods have enough time to perform the required calculations, as the rate of completion of the tasks and operations is slow. Therefore, operational planners and schedulers can enjoy the excellent quality of PI_Delta and PI_Bays without concern for the computational load.

D. Summary

To quantify the performance of the PI construction methods, we rank each method from 1 to 4 based on the PIs constructed for the test samples. The lower the rank, the better the method. The ranking scores are given in four categories, as follows:

∙ Quality: CWC_Median is used as an index to quantify each method's performance in producing high quality PIs. For each case study, the CWC_Median values in Table II are sorted in increasing order and scored from 1 (the lowest) to 4 (the greatest) for the four PI construction methods. The scores are then averaged to generate a total score for the method's performance.

∙ Repeatability: This is measured by calculating the 70th percentile of CWC (i.e., the 7th best replicate). As a conservative measure, this provides information about how each method performs in its worst cases, which methods are prone to fail, and which methods do well even in their bad runs. As for the quality metric, the 70th percentiles are sorted, scored from 1 to 4, and then averaged over the twelve case studies.

∙ Computational load: The same scoring method is applied to the elapsed times shown in Table V.

∙ Variability: This metric relates to the response of the PIs to the level of uncertainty associated with the data. To measure this, we first obtain the widths of PI_Best for the delta, Bayesian, MVE, and bootstrap techniques. Then, the COV is calculated for this set as an indication of its variation. The method with the greatest COV is scored 1, the method with the second greatest COV is scored 2, and so on.

TABLE VI
SUMMARY OF RESULTS BASED ON THE AVERAGED RANK OF EACH METHOD FOR THE TWELVE CASE STUDIES

Assessment Metric | Delta | Bayesian | MVE | Bootstrap
Quality | 1.58 | 1.75 | 3.75 | 2.92
Repeatability | 2.75 | 1.67 | 2.92 | 2.67
Computational Load | 3.25 | 3.75 | 1.00 | 2.00
Variability | 3.33 | 3.25 | 2.08 | 1.33
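The rank-then-average scoring used for all four metrics can be sketched as follows (the mapping layout is illustrative; lower metric values are taken as better):

```python
def average_rank(metric_by_case):
    """metric_by_case: case id -> {method: metric value}, lower being better.
    Ranks the methods 1..4 within each case, then averages ranks across cases."""
    totals = {}
    for scores in metric_by_case.values():
        ordered = sorted(scores, key=scores.get)  # ascending: best method first
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    n = len(metric_by_case)
    return {method: total / n for method, total in totals.items()}
```

For the Variability metric the ordering is reversed, since a larger COV earns a better (lower) score.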

It is important to observe that these metrics have unequal importance in practice. While PI quality is the most important metric for some decision makers, the computational load can be the key factor in optimization problems.

Table VI presents these four metrics for the delta, Bayesian, MVE, and bootstrap techniques. According to this table and the results presented in the previous section, we can draw the following conclusions:

∙ PI_Delta have the highest quality of the four methods. They are the narrowest with an acceptable coverage probability above the prescribed confidence level. However, the repeatability of the results is not good, and the method may generate low quality PIs. The computational requirements of the method are also large, and it constructs PIs with an almost fixed width.

∙ PI_Bays are second in terms of quality, and their reproducibility is the best: similar results are generated in different replicates of the method. The method is the worst in terms of computational requirements amongst the four investigated methods. Last but not least, PI_Bays have one of the most fixed widths (scored 3.25 out of 4).

∙ PI_MVE are the least computationally expensive to construct, because only two NNs are used in the process of PI construction. The method's computational requirements are therefore almost negligible compared to the other methods, in particular the delta and Bayesian techniques. In some replicates of this method, high quality PIs are constructed. However, the quality and repeatability metrics of PI_MVE are the worst, making PI_MVE unreliable for real world applications.

∙ The bootstrap method does not generate high quality PIs compared to the delta and Bayesian methods. The method tends to overestimate the variance of the targets, resulting in wider PIs compared to PI_Delta. Increasing the number of bootstrap models does not guarantee an improvement of the PI quality. In terms of variability, PI_Bootstrap are by far the best among the four methods. Also, the method is ranked second in terms of online computational load.

V. COMBINED PIS

Results in Table VI show that no method dominates in all performance criteria. Each method has its own theoretical and practical limitations, dependent on the application area or due to its overall effectiveness. Moreover, the best method in quality was not the top method for every single problem. Given this uncertainty as to which method will obtain the best results on a given problem, we propose an ensemble approach for PI construction. This means that we apply the four available methods and combine their PIs in some way. The rationale for considering a combination of methods is similar to that of ensemble neural networks [43] [44]: it is more robust and mitigates the effect of one method giving bad results and ruining performance.

It is also important to notice that the quality of PIs constructed using a method varies significantly across different replicates of that method (different results for different NNs). As per the results demonstrated in Table II, there is often a large discrepancy between the best result and the median of the results. Standard practice dictates that we trust the PIs constructed using the NN that generates the highest quality PIs for the validation set. However, it is well known that the best results are not guaranteed for another set. It is therefore preferable to keep a set of NNs for each method (an ensemble of models) and run them all with an appropriate collective decision strategy to get high quality PIs.

Simple averaging, weighted averaging, ranking, and nonlinear mapping can be applied as the combination strategy in the ensemble. The greatest advantage of simple averaging is its simplicity; its main drawback is that it treats all individuals equally, though they are not equally good. For the purpose of this study, we consider a linear combination of the PIs generated using the delta, Bayesian, MVE, and bootstrap techniques. As per this mechanism, the lower and upper bounds of the combined PI are equal to the weighted sum of the lower and upper bounds of the four PIs,

PI_comb = θ [PI_Delta, PI_Bays, PI_MVE, PI_Bootstrap]^T    (40)

where θ = [θ_1, θ_2, θ_3, θ_4] is the vector of combiner parameters. Two sets of constraints can be considered for the combiner parameters:

∙ Restriction 1: They are positive and less than one: 0 ≤ θ_i ≤ 1, i = 1, 2, 3, 4. This means that the PIs of the four methods positively contribute to the construction of the new combined PIs. This idea has been advocated in the literature [45] [46].

∙ Restriction 2: The parameters are restricted to sum to one: ∑_{i=1}^{4} θ_i = 1. This restriction makes the new combined PI a weighted average, which is a flexible form of the simple average.
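The weighted combination of bounds in (40), under the two restrictions, can be sketched as follows; the bounds and the weight vector are hypothetical illustration values, not results from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5   # number of samples (hypothetical)

# Hypothetical lower/upper bounds from the four methods,
# row order: delta, Bayesian, MVE, bootstrap.
lowers = rng.normal(-1.0, 0.1, size=(4, n))
uppers = rng.normal(1.0, 0.1, size=(4, n))

def combine(theta, lowers, uppers):
    """Eq. (40): weighted combination of the four methods' PI bounds."""
    theta = np.asarray(theta)
    return theta @ lowers, theta @ uppers

# Restriction 2: weights sum to one, so the combined PI is a weighted average.
theta = np.array([0.4, 0.3, 0.2, 0.1])
lo, up = combine(theta, lowers, uppers)
```

With equal weights θ_i = 0.25, the combiner reduces to the simple average of the four PIs, which is why the weighted form is described as its flexible generalization.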

The key question is how the combiner parameters in (40) can be obtained. Traditionally, these parameters are determined through minimization of an error-based cost function, such as SSE. In those cases, the purpose of the combination is to improve the generalization ability and achieve smoother results over the error space. In our case, however, the purpose of the combiner is to improve the quality of the combined PIs. It is therefore more meaningful to adjust the combiner parameters in (40) through minimization of a PI-based cost function.

Another issue in adjusting the combiner parameters relates to the unavailability of target PIs. The ground truth of the upper and lower bounds of the desired PIs is not known a priori, and cannot be used during the training stage of the combiner. Therefore, a method should be developed that indirectly adjusts the combiner parameters, leading to the highest quality PIs.

The quality of PIs in this study is assessed using the CWC measure defined in (38). As CWC covers both key features of PIs (width and coverage probability), it can be used as the objective function in the problem of enhancing the quality of PIs using the combiner. The combiner parameters can be optimally determined through minimization of CWC as the cost function. In fact, these parameters are indirectly fine-tuned to generate high quality combined PIs. This approach eliminates the need for knowing the desired values of PIs when adjusting the combiner parameters.
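Equations (38) and (39) are not reproduced in this section, so the sketch below uses a CWC form common in the PI literature, CWC = NMPIW · (1 + γ(PICP) · exp(−η(PICP − μ))); the hyperparameters μ and η, and the exact functional form, are assumptions rather than the paper's definition.

```python
import numpy as np

def picp(y, lower, upper):
    """PI Coverage Probability: fraction of targets falling inside the PI."""
    return float(np.mean((y >= lower) & (y <= upper)))

def nmpiw(lower, upper, y_range):
    """Normalized Mean PI Width."""
    return float(np.mean(upper - lower)) / y_range

def cwc(y, lower, upper, mu=0.90, eta=50.0, gamma=None):
    """Coverage-Width-based Criterion (assumed form):
    CWC = NMPIW * (1 + gamma(PICP) * exp(-eta * (PICP - mu))).
    Passing gamma=None mimics the training setting gamma(PICP) = 1."""
    p = picp(y, lower, upper)
    g = 1.0 if gamma is None else gamma(p)
    return nmpiw(lower, upper, float(y.max() - y.min())) * (1.0 + g * np.exp(-eta * (p - mu)))
```

The exponential term heavily penalizes intervals whose coverage falls below the nominal level μ, which is what lets a single scalar drive the combiner toward narrow yet valid PIs.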

As per the restrictions described above, two optimization problems can be defined:

Option A:
    θ_A = arg min_θ CWC
    s.t. 0 ≤ θ ≤ 1                              (41)

Option B:
    θ_B = arg min_θ CWC
    s.t. 0 ≤ θ ≤ 1, ∑_{i=1}^{4} θ_i = 1         (42)

Option A has the advantage over option B that it can lower or raise the absolute level of the PIs (the parameters are not restricted to sum to one), thereby correcting any general bias in the PIs.

During the training of the combiner parameters using CWC as the cost function, we set γ(PICP) = 1. The CWC formulation with γ equal to one has the advantage of leaving some slack in the training, in order to avoid the serious downside of violating the PIs' constraint for the test set (i.e. that PICP ≥ 90%). This conservative approach is applied to avoid excessively narrowing the PIs during the training stage, which may result in a low PICP for test samples. After the training stage, all PIs are assessed using CWC with γ(PICP) as defined in (39).

CWC, as the cost function, is nonlinear and nondifferentiable with many local minima. Therefore, descent-based optimization methods, such as those used in traditional NN training, cannot be applied for its minimization. Here, we use a Genetic Algorithm (GA) [47] [48] [49] for minimization of the cost function in the optimization stage.
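A minimal real-coded GA of the kind that could minimize such a cost is sketched below. The toy quadratic cost (and its "optimum"), the operator choices (truncation selection, blend crossover, Gaussian mutation), and the hyperparameters are all illustrative assumptions, not the GA configuration used in the paper; in the paper, the cost would be CWC evaluated on the combined PIs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the CWC cost of a 4-element combiner weight vector theta.
def cost(theta):
    target = np.array([0.5, 0.3, 0.1, 0.1])   # hypothetical optimum
    return float(np.sum((theta - target) ** 2))

def ga_minimize(cost, dim=4, pop=100, gens=200, crossover_fraction=0.8):
    """Minimal real-coded GA: truncation selection, blend crossover,
    Gaussian mutation; weights are clipped to [0, 1] (restriction 1)."""
    population = rng.random((pop, dim))
    n_elite = pop // 10
    for _ in range(gens):
        fitness = np.array([cost(ind) for ind in population])
        order = np.argsort(fitness)
        elite = population[order[:n_elite]]          # carry the best 10% over unchanged
        children = []
        while len(children) < pop - n_elite:
            i, j = rng.choice(order[: pop // 2], size=2)   # parents from the top half
            a, b = population[i], population[j]
            if rng.random() < crossover_fraction:          # blend crossover
                w = rng.random(dim)
                child = w * a + (1 - w) * b
            else:
                child = a.copy()
            child += rng.normal(0.0, 0.05, dim)            # Gaussian mutation
            children.append(np.clip(child, 0.0, 1.0))
        population = np.vstack([elite] + children)
    fitness = np.array([cost(ind) for ind in population])
    return population[np.argmin(fitness)]

theta_opt = ga_minimize(cost)
```

Because the GA only needs cost evaluations, it is unaffected by CWC being nondifferentiable; the elitism step makes the best cost found non-increasing across generations.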

First, the available data is divided into two training sets (D1 and D2) and a test set (Dtest). Similar to the experiments performed in the previous section, a cross validation technique is applied to determine the optimal NN structure. The optimal NN is trained ten times using samples of D1. PI_D2 are then constructed using the delta, Bayesian, MVE, and bootstrap methods for each set of NN models (ten sets of PIs per method in total). The combiner parameters, θ, are initialized to values between 0 and 1. The optimization algorithm is then applied to find the optimal values of the combiner parameters. In each iteration of the optimization algorithm, PI_comb are constructed using the new set of combiner parameters and PI_D2 from the four methods.

In addition to the PI combination across the four methods, we have another layer of combination over the ten runs. This also improves the robustness of the combination approach, as it averages over different network initializations and different train/validation partitions. The median of PI_comb over the ten runs, called PI_comb^median, is computed, and this gives the final PIs. PI_comb^median are evaluated using the cost function, CWC with γ(PICP) = 1, and their appropriateness is comparatively checked. Upon completion of the optimization stage, the combiner with its set of optimal parameters is used for construction of PI_comb^median for the test samples (Dtest).
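The median-over-runs step amounts to an element-wise median of the combined bounds; the sketch below uses hypothetical bounds for ten runs.

```python
import numpy as np

rng = np.random.default_rng(2)
runs, n = 10, 50   # ten ensemble runs, n test samples (hypothetical)

# PI_comb bounds from each of the ten runs (illustrative values).
lower_runs = rng.normal(-1.0, 0.2, size=(runs, n))
upper_runs = rng.normal(1.0, 0.2, size=(runs, n))

# Final PIs: the element-wise median over the ten runs, PI_comb^median.
lower_med = np.median(lower_runs, axis=0)
upper_med = np.median(upper_runs, axis=0)
```

Taking the median rather than the mean keeps a single badly trained run from dragging the final bounds with it.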

The computational burden of the optimization algorithm in each iteration is limited to generating a new population of the combiner parameters, constructing PI_comb for the new population, calculating the median of PI_comb (PI_comb^median), and evaluating PI_comb^median using CWC. As these tasks do not require complex calculation, the optimization algorithm is computationally inexpensive.

In summary, the method proposed here uses two types of ensembles for achieving high quality PIs: (i) an ensemble of NN models within each method for constructing PIs, and (ii) an ensemble of the four different methods for constructing the combined PI_comb^median using the PIs of the individual methods. This two stage mechanism maximizes diversity (using different models and methods) for constructing high quality PIs.

VI. NUMERICAL RESULTS FOR COMBINED PIS

The performance of the method proposed in the previous section for construction of combined PIs, PI_comb, is here examined for the twelve case studies. The GA is run with a crossover fraction of 0.8, a stall generation limit of 1000, and a population of 100 individuals. Again we randomly split the data into D1, D2, and Dtest sets. The NN structure is determined through a five-fold cross validation technique. PIs for D2 are constructed using the delta, Bayesian, MVE, and bootstrap methods. Then the method proposed in the previous section is applied to adjust the combiner parameters as per (41) and (42).

Fig. 8 shows the CWCs for PI_comb constructed using the combiner trained based on option A and option B. Hereafter, for ease of reference, we refer to these PIs as PI_comb-A and PI_comb-B. For the purpose of comparison, the CWC medians for the other methods are also shown in this figure. As per these results, PI_comb-A and PI_comb-B are the best for 10 and 1 of the 12 case studies respectively (11 of 12 case studies in total). It is only for case 11 that the proposed methods do not generate the best results; the ranks of options A and B for this case study are 2 and 3 respectively. With the exception of this case, both methods, and in particular option A, outperform any individual method in terms of the quality of the constructed PIs.

The difference between the best method for PI construction and the other methods is demonstrated in Table VII. The percentage difference is the difference between a method's CWC and the minimum of the CWCs, normalized by the minimum of the CWCs,

Difference = (CWC − CWC_min) / CWC_min    (43)

where CWC_min is the minimum of the CWCs for each case study shown in Fig. 8. A zero difference for a method means that it has generated the highest quality PIs in the conducted experiments. As per the computed differences, the proposed combining methods, options A and B, significantly improve the quality of PIs. The median values of the differences for the six methods are 41.1%, 20.8%, 149.6%, 147.5%, 0.0%, and 9.6% respectively, confirming that using the proposed combiners for PI construction clearly improves the quality of the constructed PIs.
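The calculation in (43) can be illustrated directly; the CWC values below are hypothetical, and the ratio is scaled by 100 because Table VII reports percentages.

```python
# Hypothetical CWC values for one case study (not the paper's data).
cwcs = {"Delta": 31.2, "Bayesian": 36.4, "MVE": 44.5,
        "Bootstrap": 46.7, "Option A": 30.0, "Option B": 36.1}

# Eq. (43), scaled to a percentage.
cwc_min = min(cwcs.values())
difference = {m: 100.0 * (c - cwc_min) / cwc_min for m, c in cwcs.items()}
# The best method (here Option A) has a difference of exactly zero.
```

For these numbers the delta method, for example, comes out 4% worse than the best combiner.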

VII. CONCLUSION

In this paper, we comprehensively reviewed and examined the performance of four frequently cited methods for construction of PIs using NNs. The theoretical background of the delta, Bayesian, mean-variance estimation, and bootstrap techniques was first studied to identify the advantages and disadvantages of each method. Twelve synthetic and real world case studies were implemented to assess the performance of each method for generating high quality prediction intervals. The effects of homogeneous and heterogeneous noise on the quality of PIs were investigated. Quantitative and comprehensive assessments were performed using a hybrid measure related to the width and coverage probability of prediction intervals. According to the obtained results, the delta technique generates the highest quality prediction intervals, the Bayesian method is the most reliable for reproducing quality PIs, and the mean-variance estimation method is the least computationally expensive. The bootstrap-based PIs have the most variable widths and respond appropriately to the level of uncertainty in the data. The results indicate that there is no best method for all cases. Therefore, the selection and application of a prediction interval construction method will depend on the purpose of the analysis, the computational constraints, and which aspect of the prediction interval is more important.

The paper also proposed a new method for construction of PIs through combination of traditionally built PIs. The proposed method uses an ensemble of NNs for each traditional method to construct PIs, and an ensemble of the four methods to build combined PIs based on the medians of the PIs from each ensemble. The combiner parameters were indirectly adjusted through minimization of a PI-based cost function. A genetic algorithm was applied for minimization of the nonlinear, nondifferentiable cost function. It was shown that the proposed combiner methods outperform any individual method in terms of generating higher quality PIs.

ACKNOWLEDGMENT

This research was fully supported by the Centre for Intelligent Systems Research (CISR) at Deakin University.

[Fig. 8 appears here: bar chart of CWC (0 to 130) versus case study (1 to 12) for the delta, Bayesian, MVE, bootstrap, option A, and option B methods.]

Fig. 8. Performance of two combiners for generating high quality PIs compared to the other four traditional methods.

TABLE VII
THE PERCENTAGE DIFFERENCE BETWEEN THE BEST METHOD FOR PI CONSTRUCTION AND OTHER METHODS

Case Study  Delta   Bayesian  MVE     Bootstrap  Option A  Option B
1           3.95    21.21     42.41   55.55      0.00      20.26
2           14.21   185.71    205.53  166.63     0.00      12.00
3           14.21   185.71    205.53  166.63     0.00      12.00
4           72.56   20.06     96.59   56.53      0.00      10.29
5           2.88    25.13     153.35  191.83     0.30      0.00
6           261.23  306.49    175.26  251.91     0.00      2.68
7           387.01  24.79     1080    1018       0.00      386.29
8           1.76    0.78      145.94  128.77     0.00      9.49
9           208.45  3.90      232.58  164.83     0.00      9.17
10          68.09   20.46     52.14   19.19      0.00      4.58
11          0.00    13.80     103.41  55.16      9.16      9.74
12          191.01  6.28      110.00  130.25     0.00      6.48

REFERENCES

[1] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.

[2] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.

[3] M. Azlan Hussain, "Review of the applications of neural networks in chemical process control – simulation and online implementation," Artificial Intelligence in Engineering, vol. 13, pp. 55–68, 1999.

[4] J. G. De Gooijer and R. J. Hyndman, "25 years of time series forecasting," International Journal of Forecasting, vol. 22, no. 3, pp. 443–473, 2006.

[5] B. K. Bose, "Neural network applications in power electronics and motor drives – an introduction and perspective," IEEE Transactions on Industrial Electronics, vol. 54, pp. 14–33, 2007.

[6] M. Paliwal and U. A. Kumar, "Neural networks and statistical techniques: A review of applications," Expert Systems with Applications, vol. 36, no. 1, pp. 2–17, Jan. 2009.

[7] W.-H. Liu, "Forecasting the semiconductor industry cycles by bootstrap prediction intervals," Applied Economics, vol. 39, no. 13, pp. 1731–1742, 2007.

[8] S. Ho, M. Xie, L. Tang, K. Xu, and T. Goh, "Neural network modeling with confidence bounds: a case study on the solder paste deposition process," IEEE Transactions on Electronics Packaging Manufacturing, vol. 24, no. 4, pp. 323–332, 2001.

[9] J. H. Zhao, Z. Y. Dong, Z. Xu, and K. P. Wong, "A statistical approach for interval forecasting of the electricity price," IEEE Transactions on Power Systems, vol. 23, no. 2, pp. 267–276, 2008.

[10] A. Khosravi, S. Nahavandi, and D. Creighton, "Construction of optimal prediction intervals for load forecasting problem," IEEE Transactions on Power Systems, vol. 25, pp. 1496–1503, 2010.

[11] S. G. Pierce, K. Worden, and A. Bezazi, "Uncertainty analysis of a neural network used for fatigue lifetime prediction," Mechanical Systems and Signal Processing, vol. 22, no. 6, pp. 1395–1411, Aug. 2008.

[12] D. F. Benoit and D. Van den Poel, "Benefits of quantile regression for the analysis of customer lifetime value in a contractual setting: An application in financial services," Expert Systems with Applications, vol. 36, no. 7, pp. 10475–10484, Sep. 2009.

[13] D. L. Shrestha and D. P. Solomatine, "Machine learning approaches for estimation of prediction interval for the model output," Neural Networks, vol. 19, no. 2, pp. 225–235, Mar. 2006.

[14] C. van Hinsbergen, J. van Lint, and H. van Zuylen, "Bayesian committee of neural networks to predict travel times with confidence intervals," Transportation Research Part C: Emerging Technologies, vol. 17, no. 5, pp. 498–509, Oct. 2009.

[15] A. Khosravi, S. Nahavandi, and D. Creighton, "A prediction interval-based approach to determine optimal structures of neural network metamodels," Expert Systems with Applications, vol. 37, pp. 2377–2387, 2010.

[16] J. T. G. Hwang and A. A. Ding, "Prediction intervals for artificial neural networks," Journal of the American Statistical Association, vol. 92, no. 438, pp. 748–757, 1997.

[17] R. D. de Veaux, J. Schumi, J. Schweinsberg, and L. H. Ungar, "Prediction intervals for neural networks via nonlinear regression," Technometrics, vol. 40, no. 4, pp. 273–282, 1998.

[18] D. J. C. MacKay, "The evidence framework applied to classification networks," Neural Computation, vol. 4, no. 5, pp. 720–736, 1992.

[19] D. Nix and A. Weigend, "Estimating the mean and variance of the target probability distribution," in IEEE International Conference on Neural Networks, 1994.

[20] B. Efron, "Bootstrap methods: Another look at the jackknife," The Annals of Statistics, vol. 7, no. 1, pp. 1–26, 1979.

[21] T. Heskes, "Practical confidence and prediction intervals," in Neural Information Processing Systems, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds., vol. 9. MIT Press, 1997, pp. 176–182.

[22] G. Papadopoulos, P. Edwards, and A. Murray, "Confidence estimation methods for neural networks: a practical comparison," IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1278–1287, 2001.

[23] A. Ding and X. He, "Backpropagation of pseudo-errors: neural networks that are adaptive to heterogeneous noise," IEEE Transactions on Neural Networks, vol. 14, no. 2, pp. 253–262, 2003.

[24] F. Giordano, M. La Rocca, and C. Perna, "Forecasting nonlinear time series with neural network sieve bootstrap," Computational Statistics & Data Analysis, vol. 51, no. 8, pp. 3871–3884, May 2007.

[25] I. Rivals and L. Personnaz, "Construction of confidence intervals for neural networks based on least squares estimation," Neural Networks, vol. 13, no. 4-5, pp. 463–484, Jun. 2000.

[26] G. A. F. Seber and C. J. Wild, Nonlinear Regression. New York: Wiley, 1989.

[27] R. Tibshirani, "A comparison of some error estimates for neural network models," Neural Computation, vol. 8, pp. 152–163, 1996.

[28] M. Hagan and M. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 989–993, 1994.

[29] R. Dybowski and S. Roberts, "Confidence intervals and prediction intervals for feed-forward neural networks," in Clinical Applications of Artificial Neural Networks, 2000.

[30] A. Khosravi, S. Nahavandi, D. Creighton, and A. F. Atiya, "A lower upper bound estimation method for construction of neural network-based prediction intervals," IEEE Transactions on Neural Networks, vol. 22, no. 3, pp. 337–346, 2011.

[31] A.-M. Halberg, K. H. Teigen, and K. I. Fostervold, "Maximum vs. minimum values: Preferences of speakers and listeners for upper and lower limit estimates," Acta Psychologica, vol. 132, pp. 228–239, 2009.

[32] S. Hashem, "Optimal linear combinations of neural networks," Neural Networks, vol. 10, no. 4, pp. 599–614, Jun. 1997.

[33] L. Ma and K. Khorasani, "New training strategies for constructive neural networks with application to regression problems," Neural Networks, vol. 17, no. 4, pp. 589–609, May 2004.

[34] P. Vlachos, StatLib datasets archive [http://lib.stat.cmu.edu/datasets], visited Jan. 2010.

[35] A. Khosravi, S. Nahavandi, and D. Creighton, "Constructing prediction intervals for neural network metamodels of complex systems," in International Joint Conference on Neural Networks (IJCNN), 2009, pp. 1576–1582.

[36] A. Asuncion and D. J. Newman, UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science. Visited Jan. 2010.

[37] B. De Moor, DaISy: Database for the Identification of Systems, Department of Electrical Engineering, ESAT/SISTA, K.U.Leuven, Belgium, URL: http://homes.esat.kuleuven.be/~smc/daisy/. Visited Jan. 2010.

[38] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, pp. 671–680, 1983.

[39] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123–140, 1996.

[40] M. Islam, X. Yao, and K. Murase, "A constructive algorithm for training cooperative neural network ensembles," IEEE Transactions on Neural Networks, vol. 14, no. 4, pp. 820–834, 2003.

[41] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. New York: Chapman and Hall, 1993.

[42] A. C. Davison and D. V. Hinkley, Bootstrap Methods and Their Application. Cambridge University Press, 1997.

[43] L. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993–1001, 1990.

[44] X. Yao and M. Islam, "Evolving artificial neural network ensembles," IEEE Computational Intelligence Magazine, vol. 3, no. 1, pp. 31–42, 2008.

[45] S. I. Gunter, "Nonnegativity restricted least squares combinations," International Journal of Forecasting, vol. 8, no. 1, pp. 45–59, Jun. 1992.

[46] J. W. Taylor and S. Majithia, "Using combined forecasts with changing weights for electricity demand profiling," The Journal of the Operational Research Society, vol. 51, no. 1, pp. 72–82, 2000.

[47] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Addison Wesley, 1989.

[48] M. Mitchell, An Introduction to Genetic Algorithms. Cambridge, MA: MIT Press, 1996.

[49] C. R. Reeves and J. E. Rowe, Genetic Algorithms: Principles and Perspectives: A Guide to GA Theory. Kluwer Academic Publishers, 2003.

Abbas Khosravi (M'07) received the BSc in electrical engineering from Sharif University of Technology, Iran, in 2002, the MSc in electrical engineering from Amirkabir University of Technology, Iran, in 2005, and the PhD from Deakin University, Australia, in 2010. In 2006-2007, he was with the eXiT Group, University of Girona, Spain, conducting research in the field of artificial intelligence. He is currently a research fellow in the Centre for Intelligent Systems Research (CISR) at Deakin University. His primary research interests include the theory and application of neural networks and fuzzy logic systems for modeling, analysis, control, and optimization of operations within complex systems. Mr Khosravi is the recipient of an Alfred Deakin Postdoctoral Research Fellowship in 2011.

Saeid Nahavandi (SM'07) received the B.Sc. (Hons.), M.Sc., and Ph.D. degrees in automation and control from Durham University, Durham, U.K. He is the Alfred Deakin Professor, Chair of Engineering, and the director of the Centre for Intelligent Systems Research (CISR), Deakin University, Geelong, VIC, Australia. He has published over 350 peer reviewed papers in various international journals and conferences. He designed the world's first 3-D interactive surface/motion controller. His research interests include modeling of complex systems, simulation-based optimization, robotics, haptics, and augmented reality. Dr. Nahavandi was a recipient of the Young Engineer of the Year title in 1996 and six international awards in engineering. He is an Associate Editor of the IEEE Systems Journal, an Editorial Consultant Board member for the International Journal of Advanced Robotic Systems, and an Editor (South Pacific Region) of the International Journal of Intelligent Automation and Soft Computing. He is a Fellow of Engineers Australia (FIEAust) and the IET (FIET).

Doug Creighton (M'10) received a B.Eng. (Honours) in systems engineering and a B.Sc. in physics from the Australian National University in 1997, where he attended as a National Undergraduate Scholar. He spent several years as a software consultant prior to obtaining his PhD degree in simulation-based optimization from Deakin University in 2004. He is currently a research academic and stream leader with the Centre for Intelligent Systems Research (CISR) at Deakin University. Dr Creighton has been actively involved in modelling, discrete event simulation, intelligent agent technologies, HMI and visualization, and simulation-based optimization research. He has developed algorithms to allow the application of learning agents to industrial scale systems for use in optimization, dynamic control, and scheduling.


Amir F. Atiya (S'86-M'90-SM'97) received his B.S. degree in 1982 from Cairo University, and the M.S. and Ph.D. degrees in 1986 and 1991 from Caltech, Pasadena, CA, all in electrical engineering. Dr. Atiya is currently a Professor at the Department of Computer Engineering, Cairo University. He has recently held several visiting appointments, such as at Caltech and at Chonbuk National University, S. Korea. His research interests are in the areas of neural networks, machine learning, theory of forecasting, computational finance, and Monte Carlo methods. He has obtained several awards, such as the Kuwait Prize in 2005. He was an Associate Editor for the IEEE Transactions on Neural Networks from 1998 to 2008.