



Research Article
A Robust AdaBoost.RT Based Ensemble Extreme Learning Machine

Pengbo Zhang and Zhixin Yang

Department of Electromechanical Engineering, Faculty of Science and Technology, University of Macau, Macau

Correspondence should be addressed to Zhixin Yang; zxyang@umac.mo

Received 21 August 2014; Revised 12 November 2014; Accepted 13 November 2014

Academic Editor: Yi Jin

Copyright © 2015 P. Zhang and Z. Yang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Extreme learning machine (ELM) has been well recognized as an effective learning algorithm with extremely fast learning speed and high generalization performance. However, to deal with regression applications involving big data, the stability and accuracy of ELM shall be further enhanced. In this paper, a new hybrid machine learning method called robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems is proposed, which combines ELM with the novel robust AdaBoost.RT algorithm to achieve better approximation accuracy than using only a single ELM network. The robust threshold for each weak learner is adaptive to the weak learner's performance on the corresponding problem dataset. Therefore, RAE-ELM can output the final hypothesis as an optimally weighted ensemble of weak learners. On the other hand, ELM is a quick learner with high regression performance, which makes it a good candidate for the "weak" learners. We prove that the empirical error of the RAE-ELM is within a significantly superior bound. The experimental verification has shown that the proposed RAE-ELM outperforms other state-of-the-art algorithms on many real-world regression problems.

1. Introduction

In the past decades, computational intelligence methodologies have been widely adopted and effectively utilized in various areas of scientific research and engineering applications [1, 2]. Recently, Huang et al. introduced an efficient learning algorithm, named extreme learning machine (ELM), for single-hidden layer feedforward neural networks (SLFNs) [3, 4]. Unlike conventional learning algorithms such as back-propagation (BP) methods [5] and support vector machines (SVMs) [6], ELM randomly generates the hidden neuron parameters (the input weights and the hidden layer biases) before seeing the training data and analytically determines the output weights without tuning the hidden layer of SLFNs. As the randomly generated hidden neuron parameters are independent of the training data, ELM can reach not only the smallest training error but also the smallest norm of output weights. ELM overcomes several limitations of conventional learning algorithms, such as local minima and slow learning speed, and embodies very good generalization performance.

As a popular and appealing learning algorithm, numerous variants of ELM have been investigated in order to further improve its generalization performance. Rong et al. [7] proposed an online sequential fuzzy extreme learning machine (OS-Fuzzy-ELM) for function approximation and classification problems. Cao et al. [8] combined the voting based extreme learning machine [9] with the online sequential extreme learning machine [10] into a new methodology, called voting based online sequential extreme learning machine (VOS-ELM). In addition, to solve the two drawbacks in the basic ELM, namely, the over-fitting problem and the unstable accuracy, Luo et al. [11] presented a novel algorithm, called sparse Bayesian extreme learning machine (SB-ELM), which estimates the marginal likelihood of the output weights while automatically pruning the redundant hidden nodes. What is more, to overcome the limitations of purely supervised learning algorithms, ELM has also been extended following the theory of semisupervised learning [12].

Although ELM has good generalization performance for classification and regression problems, how to efficiently perform training and testing on big data is challenging for ELM as well.



As a single learning machine, although ELM is quite stable compared to other learning algorithms, its classification and regression performance may still vary slightly among different trials on big datasets. Many researchers have sought various ensemble methods that integrate a set of ELMs into a combined network structure and verified that they could perform better than an individual ELM. Lan et al. [13] proposed an ensemble of online sequential ELM (EOS-ELM), which is comprised of several OS-ELM networks. The mean of the OS-ELM networks' outputs was used as the performance indicator of the ensemble networks. Liu and Wang [14] presented an ensemble-based ELM (EN-ELM) algorithm where a cross-validation scheme was used to create an ensemble of ELM classifiers for decision making. Besides, Xue et al. [15] proposed a genetic ensemble of extreme learning machine (GE-ELM), which adopted genetic algorithms (GAs) to produce a group of candidate networks first; according to a specific ranking strategy, some of the networks were selected to form a new ensemble network. More recently, Wang et al. [16] presented a parallelized ELM ensemble method based on the M3-network, called M3-ELM. It could improve computational efficiency by parallelism and solve imbalanced classification tasks through task decomposition.

To learn from the exponentially increasing amount and variety of data with high accuracy, traditional learning algorithms may tend to suffer from the overfitting problem. Hence, a robust and stable ensemble algorithm is of great importance. Dasarathy and Sheela [17] firstly introduced an ensemble system whose idea is to partition the feature space using multiple classifiers. Furthermore, Hansen and Salamon [18] presented an ensemble of neural networks with a plurality consensus scheme to obtain far better performance on classification issues than approaches using single neural networks. After that, ensemble-based algorithms have been widely explored [19-23]. Among the ensemble-based algorithms, Bagging and Boosting are the most prevailing methods for training neural network ensembles. The Bagging (short for Bootstrap Aggregation) algorithm randomly selects $n$ bootstrap samples from the original training set of cardinality $N$ ($n < N$), and then the diversity in the bagging-based ensembles is ensured by the variations within the bootstrapped replicas on which each classifier is trained. By using relatively weak classifiers, the decision boundaries measurably vary with respect to relatively small perturbations in the training data. As an iterative method presented by Schapire [20] for generating a strong classifier, boosting can achieve arbitrarily low training error from an ensemble of weak classifiers, each of which can barely do better than random guessing. Thereafter, a novel boosting algorithm called adaptive boosting (AdaBoost) was presented by Schapire and Freund [21]. The AdaBoost algorithm improves on traditional boosting methods in two perspectives. One is that the instances are drawn into subsequent subdatasets from an iteratively updated sample distribution of the same training dataset; AdaBoost replaces random subsamples by weighted versions of the same training dataset, which can be repeatedly utilized, so the training dataset is not required to be very large. The other is to define an ensemble classifier through weighted majority voting of a set of weak classifiers, where the voting weights are based on the classifiers' training errors.

However, many of the existing investigations on ensemble algorithms focus on classification problems. Unfortunately, ensemble algorithms for classification problems cannot be directly applied to regression problems. Regression methods provide predicted results through analyzing historical data. Forecasting and prediction are important functional requirements for real-world applications such as temperature prediction, inventory management, and positioning tracking in manufacturing execution systems. To solve regression problems based on the AdaBoost algorithm for classification [24-26], Schapire and Freund [21] extended AdaBoost.M2 to AdaBoost.R. In addition, Drucker [27] proposed the AdaBoost.R2 algorithm, which is based on an ad hoc modification of AdaBoost.R. Besides, Avnimelech and Intrator [28] presented the notion of weak and strong learning and an appropriate equivalence theorem between them, so as to improve the boosting algorithm for regression issues. What is more, Solomatine and Shrestha [29, 30] proposed a novel boosting algorithm called AdaBoost.RT. AdaBoost.RT projects the regression problem into the binary classification domain, which can be processed by the AdaBoost algorithm, while filtering out those examples whose relative estimation error is larger than a preset threshold value.

The proposed hybrid algorithm, which combines the effective learner ELM with the promising ensemble method, the AdaBoost.RT algorithm, can inherit their intrinsic properties and shall be able to achieve good generalization performance for dealing with big data. As with the development effort on general ensemble algorithms, the available ELM ensemble algorithms are mainly aimed at classification problems, while regression problems with ensemble algorithms have received relatively little attention. Tian and Mao [31] presented an ensemble ELM based on a modified AdaBoost.RT algorithm (modified Ada-ELM) in order to predict the temperature of molten steel in a ladle furnace. The novel hybrid learning algorithm combined the modified AdaBoost.RT with ELM, which possesses the advantage of ELM and overcomes the limitation of the basic AdaBoost.RT by a self-adaptively modifiable threshold value. The threshold value $\Phi$ need not be constant; instead, it can be adjusted using a self-adaptive modification mechanism subjected to the change trend of the prediction error at each iteration. The variation range of the threshold value is set to be $[0, 0.4]$, as suggested by Solomatine and Shrestha [29, 30]. However, the initial value of $\Phi$ is manually fixed to be the mean of the variation range of the threshold value, that is, $\Phi_0 = 0.2$, according to an empirical suggestion. When one error rate $\varepsilon_t$ is smaller than that in the previous iteration, $\varepsilon_{t-1}$, the value of $\Phi$ will decrease, and vice versa. Hence, such an empirical-suggestion-based method is not fully self-adaptive in the whole threshold domain. Moreover, the manually fixed initial threshold is not related to the properties of the input dataset and the weak learners, which makes it hard for the ensemble ELM to reach a generally optimized learning effect.


This paper presents a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems, which combines ELM with the robust AdaBoost.RT algorithm. The robust AdaBoost.RT algorithm not only overcomes the limitation of the original AdaBoost.RT algorithm (original Ada-ELM) but also makes the threshold value $\Phi$ adaptive to the input dataset and the ELM networks instead of being preset. The main idea of RAE-ELM is as follows. The ELM algorithm is selected as the "weak" learning machine to build the hybrid ensemble model. A new robust AdaBoost.RT algorithm is proposed that utilizes an error statistics method to dynamically determine the regression threshold value, rather than via manual selection, which may only be ideal for very few regression cases. The mean and the standard deviation of the approximation errors are computed at each iteration, and the robust threshold for each weak learner is defined to be a scaled standard deviation. Based on the concept of standard deviation, those individual training data with error exceeding the robust threshold are regarded as "flaws in this training process" and shall be rejected. The rejected data will be processed in the later part of the weak learners' iterations.

We then analyze the convergence of the proposed robust AdaBoost.RT algorithm. It can be proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound. The proposed robust AdaBoost.RT based ensemble extreme learning machine can avoid overfitting because of the characteristics of ELM: ELM tends to reach the solutions straightforwardly, and the error rate of the regression outcome at each training process is much smaller than 0.5. Therefore, the proposed robust AdaBoost.RT based ensemble extreme learning machine, selecting ELM as the "weak" learner, can avoid overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the robust AdaBoost.RT based ensemble module. The experimental results demonstrate that the proposed robust AdaBoost.RT ensemble ELM (RAE-ELM) has superior learning properties in terms of stability and accuracy for regression issues and has better generalization performance than other algorithms.

This paper is organized as follows. Section 2 gives a brief review of the basic ELM. Section 3 introduces the original and the proposed robust AdaBoost.RT algorithms. The hybrid robust AdaBoost.RT ensemble ELM (RAE-ELM) algorithm is then presented in Section 4. The performance evaluation of RAE-ELM and its regression ability are verified using experiments in Section 5. Finally, the conclusion is drawn in the last section.

2. Brief on ELM

Recently, Huang et al. [3, 4] proposed novel neural networks called extreme learning machines (ELMs) for single-hidden layer feedforward neural networks (SLFNs) [32, 33]. ELM is based on the least-square method: it randomly assigns the input weights and the hidden layer biases, and then the output weights between the hidden nodes and the output layer can be analytically determined. Since the learning process in ELM can take place without iterative tuning, the ELM algorithm tends to reach the solutions straightforwardly without suffering from problems including local minima, slow learning speed, and overfitting.

From the standard optimization theory point of view, the objective of ELM in minimizing both the training errors and the output weights can be presented as [4]

\[
\begin{aligned}
\text{Minimize:}\quad & L_{P_{\mathrm{ELM}}} = \frac{1}{2}\|\beta\|^{2} + C\,\frac{1}{2}\sum_{i=1}^{N}\|\xi_{i}\|^{2} \\
\text{Subject to:}\quad & \mathbf{h}(\mathbf{x}_{i})\beta = \mathbf{t}_{i}^{T} - \xi_{i}^{T}, \quad i = 1, \ldots, N,
\end{aligned}
\tag{1}
\]

where $\xi_{i} = [\xi_{i1}, \ldots, \xi_{im}]^{T}$ is the training error vector of the $m$ output nodes with respect to the training sample $\mathbf{x}_{i}$. According to the KKT theorem, training ELM is equivalent to solving the dual optimization problem

\[
L_{D_{\mathrm{ELM}}} = \frac{1}{2}\|\beta\|^{2} + C\,\frac{1}{2}\sum_{i=1}^{N}\|\xi_{i}\|^{2} - \sum_{i=1}^{N}\sum_{j=1}^{m}\alpha_{ij}\left(\mathbf{h}(\mathbf{x}_{i})\beta_{j} - t_{ij} + \xi_{ij}\right),
\tag{2}
\]

where $\beta_{j}$ is the vector of the weights between the hidden layer and the $j$th output node, $\beta = [\beta_{1}, \ldots, \beta_{m}]$, and $C$ is the regularization parameter representing the trade-off between the minimization of the training errors and the maximization of the marginal distance.

According to the KKT theorem, we can obtain different solutions as follows.

(1) Kernel Case. Consider

\[
\beta = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}.
\tag{3}
\]

The ELM output function is

\[
f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\beta = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}.
\tag{4}
\]

A kernel matrix for ELM is defined as follows

\[
\Omega_{\mathrm{ELM}} = \mathbf{H}\mathbf{H}^{T}:\quad \Omega_{\mathrm{ELM}\,i,j} = \mathbf{h}(\mathbf{x}_{i}) \cdot \mathbf{h}(\mathbf{x}_{j}) = K(\mathbf{x}_{i}, \mathbf{x}_{j}).
\tag{5}
\]

Then the ELM output function can be written as follows:

\[
f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\beta = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}
= \begin{bmatrix} K(\mathbf{x}, \mathbf{x}_{1}) \\ \vdots \\ K(\mathbf{x}, \mathbf{x}_{N}) \end{bmatrix}^{T}\left(\frac{\mathbf{I}}{C} + \Omega_{\mathrm{ELM}}\right)^{-1}\mathbf{T}.
\tag{6}
\]

In the special case, a corresponding kernel $K(\mathbf{x}_{i}, \mathbf{x}_{j})$ is used in ELM instead of the feature mapping $\mathbf{h}(\mathbf{x})$, which then need not be known explicitly. We call $K(\mathbf{x}_{i}, \mathbf{x}_{j}) = \mathbf{h}(\mathbf{x}_{i})\mathbf{h}(\mathbf{x}_{j})$ the ELM random kernel, where the feature mapping $\mathbf{h}(\mathbf{x})$ is randomly generated.

(2) Nonkernel Case. Similarly, based on the KKT theorem, we have

\[
\beta = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}.
\tag{7}
\]

In this case, the ELM output function is

\[
f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\beta = \mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}.
\tag{8}
\]

3. The Proposed Robust AdaBoost.RT Algorithm

In this section, we first describe the original AdaBoost.RT algorithm for regression problems and then present a new robust AdaBoost.RT algorithm. The corresponding analysis of the novel algorithm is also given.

3.1. The Original AdaBoost.RT Algorithm. Solomatine and Shrestha proposed AdaBoost.RT [29, 30], a new boosting algorithm for regression problems, where the letters R and T represent regression and threshold, respectively. The original AdaBoost.RT algorithm is described as follows.

Input the following:
(i) sequence of $m$ examples $(x_{1}, y_{1}), \ldots, (x_{m}, y_{m})$, where the output $y \in \mathbb{R}$;
(ii) weak learning algorithm (weak learner);
(iii) integer $T$ specifying the number of iterations (machines);
(iv) threshold $\Phi$ ($0 < \Phi < 1$) for demarcating correct and incorrect predictions.

Initialize the following:
(i) machine number or iteration $t = 1$;
(ii) distribution $D_{t}(i) = 1/m$ for all $i$;
(iii) error rate $\varepsilon_{t} = 0$.

Learning steps (iterate while $t \leq T$) are as follows.

Step 1. Call the weak learner, providing it with distribution $D_{t}$.

Step 2. Build the regression model $f_{t}(x) \rightarrow y$.

Step 3. Calculate the absolute relative error (ARE) for each training example as

\[
\mathrm{ARE}_{t}(i) = \left|\frac{f_{t}(x_{i}) - y_{i}}{y_{i}}\right|.
\tag{9}
\]

Step 4. Calculate the error rate of $f_{t}(x)$: $\varepsilon_{t} = \sum_{i:\,\mathrm{ARE}_{t}(i) > \Phi} D_{t}(i)$.

Step 5. Set $\beta_{t} = \varepsilon_{t}^{n}$, where $n$ is the power coefficient (e.g., linear, square, or cubic).

Step 6. Update the distribution $D_{t}$ as follows:
if $\mathrm{ARE}_{t}(i) \leq \Phi$, then $D_{t+1}(i) = (D_{t}(i)/Z_{t}) \times \beta_{t}$;
else $D_{t+1}(i) = D_{t}(i)/Z_{t}$,
where $Z_{t}$ is a normalization factor chosen such that $D_{t+1}$ will be a distribution.

Step 7. Set $t = t + 1$.

Output the following: the final hypothesis

\[
f_{\mathrm{fin}}(x) = \frac{\sum_{t}\left(\log(1/\beta_{t})\right) f_{t}(x)}{\sum_{t}\log(1/\beta_{t})}.
\tag{10}
\]

The AdaBoost.RT algorithm projects the regression problem into the binary classification domain. Based on the boosting regression estimators [28] and BEM [34], the AdaBoost.RT algorithm introduces the absolute relative error (ARE) to demarcate samples as either correct or incorrect predictions. If the ARE of any particular sample is greater than the threshold $\Phi$, the predicted value for this sample is regarded as an incorrect prediction; otherwise, it is marked as a correct prediction. Such an indication method is similar to the "misclassification" and "correct-classification" labeling used in classification problems. The algorithm will assign relatively large weights to those weak learners in the front of the learner list that reach a high correct prediction rate. The samples with incorrect prediction will be handled as ad hoc cases by the following weak learners. The outputs from each weak learner are combined into the final hypothesis using the correspondingly computed weights.
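As a concrete illustration of this demarcation, the sketch below performs the bookkeeping of Steps 3-6 for a single iteration. It assumes the targets contain no zeros (so the ARE in (9) is defined) and that the error rate is strictly between 0 and 1; the names phi and n_power simply stand for the preset threshold $\Phi$ and the power coefficient $n$.

```python
import numpy as np

def adaboost_rt_step(y_pred, y_true, D, phi=0.2, n_power=2):
    """One original AdaBoost.RT distribution update (Steps 3-6), as a sketch."""
    are = np.abs((y_pred - y_true) / y_true)      # absolute relative error (9)
    eps = D[are > phi].sum()                       # error rate over "incorrect" samples
    beta = eps ** n_power                          # beta_t = eps_t^n
    D_new = np.where(are <= phi, D * beta, D)      # demote correctly predicted samples
    D_new /= D_new.sum()                           # normalize so D_{t+1} is a distribution
    return D_new, beta, eps
```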

The AdaBoost.RT algorithm requires manual selection of the threshold $\Phi$, which is a main factor sensitively affecting the performance of committee machines. If $\Phi$ is too small, very few samples will be treated as correct predictions and easily get boosted; the following learners then have to handle a large number of ad hoc samples, which makes the ensemble algorithm unstable. On the other hand, if $\Phi$ is too large, say greater than 0.4, most of the samples will be treated as correct predictions, and the algorithm fails to reject the false samples; in fact, this causes low convergence efficiency and overfitting. The initial AdaBoost.RT and its variants suffer from this limitation in setting the threshold value, which is specified either as a manually chosen constant or as a variable changing in the vicinity of 0.2. Both strategies are irrelevant to the regression capability of the weak learner. In order to determine the $\Phi$ value effectively, a novel improvement of AdaBoost.RT is proposed in the following section.

3.2. The Proposed Robust AdaBoost.RT Algorithm. To overcome the limitations suffered by the current works on AdaBoost.RT, we embed statistics theory into the AdaBoost.RT algorithm. This overcomes the difficulty of optimally determining the initial threshold value and enables the intermediate threshold values to be dynamically self-adjustable according to the intrinsic property of the input data samples. The proposed robust AdaBoost.RT algorithm is described as follows.

(1) Input the following:
sequence of $m$ samples $(x_{1}, y_{1}), \ldots, (x_{m}, y_{m})$, where the output $y \in \mathbb{R}$;
weak learning algorithm (weak learner);
maximum number of iterations (machines) $T$.

(2) Initialize the following:
iteration index $t = 1$;
distribution $D_{t}(i) = 1/m$ for all $i$;
the weight vector $w_{i}^{t} = D_{t}(i)$ for all $i$;
error rate $\varepsilon_{t} = 0$.

(3) Iterate while $t \leq T$ the following.
(1) Call weak learner $\mathrm{WL}_{t}$, providing it with the distribution

\[
p^{(t)} = \frac{w^{(t)}}{Z_{t}},
\tag{11}
\]

where $Z_{t}$ is a normalization factor chosen such that $p^{(t)}$ will be a distribution.
(2) Build the regression model $f_{t}(x) \rightarrow y$.
(3) Calculate each error $e_{t}(i) = f_{t}(x_{i}) - y_{i}$.
(4) Calculate the error rate $\varepsilon_{t} = \sum_{i \in P} p_{i}^{(t)}$, where $P = \{i \mid |e_{t}(i) - \bar{e}_{t}| > \lambda\sigma_{t},\ i \in [1, m]\}$, $\bar{e}_{t}$ stands for the expected value, and $\lambda\sigma_{t}$ is defined as the robust threshold ($\sigma_{t}$ stands for the standard deviation; the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). If $\varepsilon_{t} > 1/2$, then set $T = t - 1$ and abort the loop.
(5) Set $\beta_{t} = \varepsilon_{t}/(1 - \varepsilon_{t})$.
(6) Calculate the contribution of $f_{t}(x)$ to the final result: $\alpha_{t} = -\log(\beta_{t})$.
(7) Update the weight vectors:
if $i \notin P$, then $w_{i}^{(t+1)} = w_{i}^{(t)}\beta_{t}$;
else $w_{i}^{(t+1)} = w_{i}^{(t)}$.
(8) Set $t = t + 1$.

(4) Normalize $\alpha_{1}, \ldots, \alpha_{T}$ such that $\sum_{t=1}^{T}\alpha_{t} = 1$.

Output the final hypothesis $f_{\mathrm{fin}}(x) = \sum_{t=1}^{T}\alpha_{t}f_{t}(x)$.
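The listing above can be turned into a short training loop. The sketch below follows steps (1)-(8) and the final normalization; passing the distribution $p^{(t)}$ to the weak learner by resampling, the small numerical guard on $\varepsilon_t$, and the function names are implementation assumptions, and error handling is omitted.

```python
import numpy as np

def robust_adaboost_rt(X, y, make_learner, T=20, lam=0.5, rng=None):
    """Sketch of the robust AdaBoost.RT loop; make_learner() must expose fit/predict."""
    rng = np.random.default_rng(rng)
    m = len(y)
    w = np.full(m, 1.0 / m)
    learners, alphas = [], []
    for _ in range(T):
        p = w / w.sum()                               # step (1): distribution p^(t)
        idx = rng.choice(m, size=m, replace=True, p=p)
        learner = make_learner().fit(X[idx], y[idx])  # step (2): regression model f_t
        e = learner.predict(X) - y                    # step (3): errors e_t(i)
        rejected = np.abs(e - e.mean()) > lam * e.std()   # step (4): robust threshold
        eps = p[rejected].sum()
        if eps > 0.5:                                 # abort if the learner is too weak
            break
        eps = max(eps, 1e-12)                         # numerical guard (assumption)
        beta = eps / (1.0 - eps)                      # step (5)
        alphas.append(-np.log(beta))                  # step (6): alpha_t = -log(beta_t)
        learners.append(learner)
        w = np.where(~rejected, w * beta, w)          # step (7): demote accepted samples
    alphas = np.asarray(alphas)
    alphas /= alphas.sum()                            # final normalization of alpha_t
    return learners, alphas

def predict_ensemble(learners, alphas, X):
    # final hypothesis f_fin(x) = sum_t alpha_t f_t(x)
    return sum(a * l.predict(X) for a, l in zip(alphas, learners))
```

Plugging the ELMRegressor sketch from Section 2 in as make_learner (for example, make_learner=lambda: ELMRegressor(n_hidden=100, C=1.0)) would give one possible realization of the RAE-ELM scheme described in Section 4.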

At each iteration of the proposed robust AdaBoost.RT algorithm, the standard deviation of the approximation error distribution is used as a criterion. In probability and statistics theory, the standard deviation measures the amount of variation or dispersion from the average. If the data points tend to be very close to the mean value, the standard deviation is low; on the other hand, if the data points are spread out over a large range of values, a high standard deviation will result.

The standard deviation may serve as a measure of uncertainty for a set of repeated predictions. When deciding whether predictions agree with their corresponding true values, the standard deviation of the predictions made by the underlying approximation function is of crucial importance: if the averaged distance from the predictions to the true values is large, then the regression model being tested probably needs to be revised, because sample points fall outside the range of values that could reasonably be expected to occur and the prediction accuracy of the model is low.

In the proposed robust AdaBoost.RT algorithm, the approximation error of the $t$th weak learner $\mathrm{WL}_{t}$ for an input dataset can be represented as one statistical distribution with parameters $\mu_{t} \pm \lambda\sigma_{t}$, where $\mu_{t}$ stands for the expected value, $\sigma_{t}$ stands for the standard deviation, and $\lambda$ is an adjustable relative factor that ranges from 0 to 1. The threshold value for $\mathrm{WL}_{t}$ is defined by the scaled standard deviation $\lambda\sigma_{t}$. In the hybrid learning algorithm, the trained weak learners are assumed to be able to generate small prediction error ($\varepsilon_{t} < 1/2$). For all $\mu_{t}$: $\mu_{t} \in (0 - \eta, 0 + \eta)$, $\eta \rightarrow 0$, and $t = 1, \ldots, T$, where $\eta$ denotes a small error limit approaching zero. The population mean of a regression error distribution is closer to the targeted zero than the elements in the population; therefore, the means of the obtained regression errors fluctuate around zero within a small range. The standard deviation $\sigma_{t}$ is solely determined by the individual samples and the generalization performance of the weak learner $\mathrm{WL}_{t}$. Generally, $\sigma_{t}$ is relatively large, such that most of the outputs will be located within the range $[-\sigma_{t}, +\sigma_{t}]$, which tends to make the boosting process unstable. To maintain a stable adjustment of the threshold value, a relative factor $\lambda$ is applied on the standard deviation $\sigma_{t}$, which results in the robust threshold $\lambda\sigma_{t}$. Those samples that fall within the threshold range $[-\lambda\sigma_{t}, +\lambda\sigma_{t}]$ are treated as "accepted" samples; other samples are treated as "rejected" samples. With the introduction of the robust threshold, the algorithm will be stable and resistant to noise in the data. According to the error rate $\varepsilon_{t}$ of each weak learner's regression model, each weak learner $\mathrm{WL}_{t}$ will be assigned one accordingly computed weight $\alpha_{t}$.

For one regression problem, the performances of different weak learners may differ. The regression error distributions for different weak learners under the robust threshold are shown in Figure 1.

In Figure 1(a), the weak learner $\mathrm{WL}_{t}$ generates a regression error distribution with a large error rate, where the standard deviation is relatively large. On the other hand, another weak learner $\mathrm{WL}_{t+j}$ may generate an error distribution with a small standard deviation, as shown in Figure 1(b). Suppose the red triangle points represent "rejected" samples whose regression error exceeds the specified robust threshold, with $\varepsilon_{t} = \sum_{i \in P} p_{i}^{(t)}$, where $P = \{i \mid |e_{t}(i) - \bar{e}_{t}| > \lambda\sigma_{t},\ i \in [1, m]\}$, as described in Step (4) of the proposed algorithm. The green circular points represent those "accepted" samples whose regression errors are less than the robust threshold. The weight vector $w_{i}^{t}$ for every "accepted" sample will be dynamically changed for each weak learner $\mathrm{WL}_{t}$, while that for "rejected" samples will remain unchanged.

Figure 1: Different regression error distributions for different weak learners under the robust threshold. (a) Error distribution of weak learner $\mathrm{WL}_{t}$ that results in a large threshold. (b) Error distribution of weak learner $\mathrm{WL}_{t+j}$ that results in a small threshold.

As illustrated in Figure 1, in terms of stability and accuracy, the regression capability of weak learner $\mathrm{WL}_{t+j}$ is superior to that of $\mathrm{WL}_{t}$. The robust threshold values for $\mathrm{WL}_{t+j}$ and $\mathrm{WL}_{t}$ are computed respectively, where the former is smaller than the latter, so as to discriminate their correspondingly different regression performances. The proposed method overcomes the limitation suffered by the existing methods, where the threshold value is set empirically. The critical factor used in the boosting process, the threshold, becomes robust and self-adaptive to the individual weak learners' performance on the input data samples. Therefore, the proposed robust AdaBoost.RT algorithm is capable of outputting the final hypothesis as an optimally weighted ensemble of the weak learners.

In the following, we show that the training error of the proposed robust AdaBoost.RT algorithm is bounded. One lemma needs to be given in order to prove the convergence of this algorithm.

Lemma 1. The convexity argument $x^{d} \leq 1 - (1 - x)d$ holds for any $x \geq 0$ and $0 \leq d \leq 1$.

Theorem 2. Suppose the improved adaptive AdaBoost.RT algorithm generates hypotheses with errors $\varepsilon_{1}, \varepsilon_{2}, \ldots, \varepsilon_{T} < 1/2$. Then the error $E_{\mathrm{ensemble}}$ of the final hypothesis output by this algorithm is bounded above by

\[
E_{\mathrm{ensemble}} \leq 2^{T}\prod_{t=1}^{T}\sqrt{\varepsilon_{t}(1 - \varepsilon_{t})}.
\tag{12}
\]
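To get a feel for how fast this bound shrinks, note that each factor equals $2\sqrt{\varepsilon_{t}(1 - \varepsilon_{t})} < 1$ whenever $\varepsilon_{t} < 1/2$. For illustration only (the error rates below are assumed, not taken from the experiments):

\[
\varepsilon_{1} = \cdots = \varepsilon_{T} = 0.2:\qquad
E_{\mathrm{ensemble}} \leq \left(2\sqrt{0.2 \times 0.8}\right)^{T} = 0.8^{T},
\]

so the bound is $0.512$ for $T = 3$ and about $0.107$ for $T = 10$, decreasing geometrically with the number of weak learners.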

Proof. In this proof, we transform the regression problem $X \rightarrow Y$ into binary classification problems $X, Y \rightarrow \{0, 1\}$. In the proposed improved adaptive AdaBoost.RT algorithm, the mean of the errors ($\bar{e}$) is assumed to be close to zero; thus, the dynamically adaptive thresholds can ignore the mean of the errors.

Let $I_{t} = \{i \mid |f_{t}(x_{i}) - y_{i}| < \lambda\sigma_{t}\}$; if $i \notin I_{t}$, $I_{t}^{i} = 1$; otherwise $I_{t}^{i} = 0$. The final hypothesis output $f$ makes a mistake on sample $i$ only if

\[
\prod_{t=1}^{T}\beta_{t}^{-I_{t}^{i}} \geq \left(\prod_{t=1}^{T}\beta_{t}\right)^{-1/2}.
\tag{13}
\]

The final weight of any sample $i$ is

\[
w_{i}^{T+1} = D(i)\prod_{t=1}^{T}\beta_{t}^{(1 - I_{t}^{i})}.
\tag{14}
\]

Combining (13) and (14), the sum of the final weights is bounded below by the sum of the final weights of the rejected samples. Consider

\[
\sum_{i=1}^{m}w_{i}^{T+1} \geq \sum_{i \notin I_{T+1}}w_{i}^{T+1} \geq \left(\sum_{i \notin I_{T+1}}D(i)\right)\left(\prod_{t=1}^{T}\beta_{t}\right)^{1/2} = E_{\mathrm{ensemble}}\left(\prod_{t=1}^{T}\beta_{t}\right)^{1/2},
\tag{15}
\]

where $E_{\mathrm{ensemble}}$ is the error of the final hypothesis output. Based on Lemma 1,

\[
\sum_{i=1}^{m}w_{i}^{t+1} = \sum_{i=1}^{m}w_{i}^{t}\beta_{t}^{1 - I_{t}^{i}} \leq \sum_{i=1}^{m}w_{i}^{t}\left(1 - (1 - \beta_{t})(1 - I_{t}^{i})\right) = \left(\sum_{i=1}^{m}w_{i}^{t}\right)\left(1 - (1 - \varepsilon_{t})(1 - \beta_{t})\right).
\tag{16}
\]

Combining these inequalities for $t = 1, \ldots, T$, the following can be obtained:

\[
\sum_{i=1}^{m}w_{i}^{T+1} \leq \prod_{t=1}^{T}\left(1 - (1 - \varepsilon_{t})(1 - \beta_{t})\right).
\tag{17}
\]

Combining (15) and (17), we obtain

\[
E_{\mathrm{ensemble}} \leq \prod_{t=1}^{T}\frac{1 - (1 - \varepsilon_{t})(1 - \beta_{t})}{\sqrt{\beta_{t}}}.
\tag{18}
\]

Considering that all factors in the product are positive, the right-hand side can be minimized by minimizing each factor individually; $\beta_{t}$ is obtained as $\beta_{t} = \varepsilon_{t}/(1 - \varepsilon_{t})$ by setting the derivative of the $t$th factor to zero. Substituting this computed $\beta_{t}$ into (18) completes the proof.

Unlike the original AdaBoost.RT and its existing variants, the robust threshold in the proposed AdaBoost.RT algorithm is determined and self-adaptively adjusted according to the individual weak learners and the data samples. Through the analysis of the convergence of the proposed robust AdaBoost.RT algorithm, it can be proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound. The study shows that the robust AdaBoost.RT algorithm proposed in this paper can overcome the limitations existing in the available AdaBoost.RT algorithms.

4. A Robust AdaBoost.RT Ensemble-Based Extreme Learning Machine

In this paper, a robust AdaBoost.RT ensemble-based extreme learning machine (RAE-ELM), which combines ELM with the robust AdaBoost.RT algorithm described in the previous section, is proposed to improve the robustness and stability of ELM. A set of $T$ ELMs is adopted as the "weak" learners. In the training phase, RAE-ELM utilizes the proposed robust AdaBoost.RT algorithm to train every ELM model and assign an ensemble weight accordingly, so that each ELM achieves a corresponding distribution based on the training output. The optimally weighted ensemble model of ELMs, $E_{\mathrm{ensemble}}$, is the final hypothesis output used for making predictions on the testing dataset. The proposed RAE-ELM is illustrated in Figure 2.

4.1. Initialization. The first weak learner, $\mathrm{ELM}_{1}$, is supplied with $m$ training samples under a uniform distribution of weights, so that each sample has an equal opportunity to be chosen during the first training process for $\mathrm{ELM}_{1}$.

4.2. Distribution Updating. The relative prediction error rates are used to evaluate the performance of each ELM. The prediction error of the $t$th ELM, $\mathrm{ELM}_{t}$, for the input data samples can be represented as one statistical distribution $\mu_{t} \pm \lambda\sigma_{t}$, where $\mu_{t}$ stands for the expected value and $\lambda\sigma_{t}$ is defined as the robust threshold ($\sigma_{t}$ stands for the standard deviation, and the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). The robust threshold is applied to demarcate prediction errors as "accepted" or "rejected". If the prediction error of one particular sample falls into the region $\mu_{t} \pm \lambda\sigma_{t}$ bounded by the robust thresholds, the prediction of this sample is regarded as "accepted" for $\mathrm{ELM}_{t}$, and vice versa for "rejected" predictions. The probabilities of the "rejected" predictions are accumulated to calculate the error rate $\varepsilon_{t}$. ELM attempts to achieve the $f_{t}(x)$ with a small error rate.

The robust AdaBoost.RT algorithm then calculates the distribution for the next learner, $\mathrm{ELM}_{t+1}$. For every sample that is correctly predicted by the current $\mathrm{ELM}_{t}$, the corresponding weight vector is multiplied by the error rate function $\beta_{t}$; otherwise, the weight vector remains unchanged. Such a process is iterated for the next $\mathrm{ELM}_{t+1}$ until the last learner $\mathrm{ELM}_{T}$, unless $\varepsilon_{t}$ becomes higher than 0.5, because once the error rate exceeds 0.5, the AdaBoost algorithm does not converge and tends to overfit [21]. Hence, the error rate must be less than 0.5.

4.3. Decision Making on RAE-ELM. The weight updating parameter $\beta_{t}$ is used as an indicator of the regression effectiveness of $\mathrm{ELM}_{t}$ in the current iteration. According to the relationship between $\varepsilon_{t}$ and $\beta_{t}$, if $\varepsilon_{t}$ increases, $\beta_{t}$ becomes larger as well, and RAE-ELM grants a small ensemble weight to that $\mathrm{ELM}_{t}$. On the other hand, an $\mathrm{ELM}_{t}$ with relatively superior regression performance is granted a larger ensemble weight. The hybrid RAE-ELM model combines the set of ELMs under different weights as the final hypothesis for decision making.
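As a numerical illustration of this weighting (the error rates below are assumed values, not measurements from the experiments), consider two ELMs with $\varepsilon_{t} = 0.1$ and $\varepsilon_{t} = 0.4$:

\[
\beta = \frac{0.1}{0.9} \approx 0.111,\quad \alpha = -\ln 0.111 \approx 2.20
\qquad\text{versus}\qquad
\beta = \frac{0.4}{0.6} \approx 0.667,\quad \alpha = -\ln 0.667 \approx 0.41,
\]

so after normalization the more accurate ELM carries roughly $2.20/(2.20 + 0.41) \approx 84\%$ of the final hypothesis in this two-learner ensemble.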

5. Performance Evaluation of RAE-ELM

In this section, the performance of the proposed RAE-ELM learning algorithm is compared with other popular algorithms on 14 real-world regression problems covering different domains from the UCI Machine Learning Repository [35], whose benchmark dataset specifications are shown in Table 1. The algorithms to be compared include the basic ELM [4], the original AdaBoost.RT based ELM (original Ada-ELM) [30], the modified self-adaptive AdaBoost.RT ELM (modified Ada-ELM) [31], support vector regression (SVR) [36], and least-square support vector regression (LS-SVR) [37]. All the evaluations are conducted in a Matlab environment running on a Windows 7 machine with a 3.20 GHz CPU and 4 GB RAM.

In our experiments, all the input attributes are normalized into the range $[-1, 1]$, while the outputs are normalized into $[0, 1]$. The real-world benchmark datasets are embedded with noise and their distributions are unknown; they cover small sizes, low dimensions, large sizes, and high dimensions. For each trial of simulations, the whole dataset of the application is randomly partitioned into a training dataset and a testing dataset with the numbers of samples shown in Table 1, and 25% of the training data samples are used as the validation dataset. Each partitioned training, validation, and testing dataset is kept fixed as input for all these algorithms.
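The normalization and the testing criterion described here are straightforward to reproduce. The sketch below is one possible implementation; the helper names and the column-wise min-max scaling are assumptions consistent with, but not dictated by, the text.

```python
import numpy as np

def scale_inputs(X):
    """Map each attribute column into [-1, 1] by min-max scaling."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0

def scale_outputs(y):
    """Map the regression targets into [0, 1]."""
    return (y - y.min()) / (y.max() - y.min())

def rmse(y_pred, y_true):
    """Root mean square error, the testing criterion used in Section 5."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```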


Figure 2: Block diagram of RAE-ELM. A sequence of $m$ samples $(x_{1}, y_{1}), \ldots, (x_{m}, y_{m})$ is fed to the weak learners $\mathrm{ELM}_{1}, \ldots, \mathrm{ELM}_{T}$ in turn; after each $\mathrm{ELM}_{t}$ is trained under distribution $D_{t}(i)$, its error rate $\varepsilon_{t}$ is calculated, the distribution is updated for the next learner, and the individual outputs are boosted into the final hypothesis.

For the RAE-ELM, basic ELM, original Ada-ELM, and modified Ada-ELM algorithms, the suitable numbers of hidden nodes are determined using the preserved validation dataset, respectively. The sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b)))$ is selected as the activation function in all the algorithms. Fifty trials of simulations have been conducted for each problem, with training, validation, and testing samples randomly split for each trial. The performances of the algorithms are verified using the average root mean square error (RMSE) in testing. The significantly better results are highlighted in boldface.

5.1. Model Selection. In ensemble algorithms, the number of networks in the ensemble needs to be determined. According to Occam's razor, excessively complex models are affected by statistical noise, whereas simpler models may capture the underlying structure better and may thus have better predictive performance. Therefore, the parameter $T$, the number of weak learners, need not be very large.

We define $\varepsilon_{t}$ in (12) as $\varepsilon_{t} = 0.5 - \gamma_{t}$, where $\gamma_{t} > 0$ is constant; this results in

\[
E_{\mathrm{ensemble}} \leq \prod_{t=1}^{T}\sqrt{1 - 4\gamma_{t}^{2}} = e^{-\sum_{t=1}^{T}\mathrm{KL}(1/2\,\|\,1/2 - \gamma_{t})} \leq e^{-2\sum_{t=1}^{T}\gamma_{t}^{2}},
\tag{19}
\]

where KL is the Kullback-Leibler divergence.

We then simplify (19) by using $0.5 - \gamma$ instead of $0.5 - \gamma_{t}$; that is, each $\gamma_{t}$ is set to be the same. We can get

\[
E_{\mathrm{ensemble}} \leq \left(1 - 4\gamma^{2}\right)^{T/2} = e^{-T\,\mathrm{KL}(1/2\,\|\,1/2 - \gamma)} \leq e^{-2T\gamma^{2}}.
\tag{20}
\]


Table 1: Specification of real-world regression benchmark datasets.

Problems              Training data   Testing data   Attributes
LVST                  90              35             308
Yeast                 837             645            7
Computer hardware     110             99             7
Abalone               3150            1027           7
Servo                 95              72             4
Parkinson disease     3000            2875           21
Housing               350             156            13
Cloud                 80              28             9
Auto price            110             49             9
Breast cancer         100             94             32
Balloon               1500            833            4
Auto-MPG              300             92             7
Bank                  5000            3192           8
Census (house8L)      15000           7784           8

From (20), we can obtain the upper bound on the number of iterations of this algorithm. Consider

\[
T \leq \left[\frac{1}{2\gamma^{2}}\ln\frac{1}{E_{\mathrm{ensemble}}}\right].
\tag{21}
\]
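For illustration only (the margin and target error below are assumed values): with a uniform margin $\gamma = 0.2$ and a desired ensemble training-error bound $E_{\mathrm{ensemble}} = 0.05$, (21) gives

\[
T \leq \left[\frac{1}{2 \times 0.2^{2}}\ln\frac{1}{0.05}\right] = \left[12.5 \times 3.0\right] \approx 37,
\]

which is consistent with the moderate ensemble sizes ($T \leq 30$) examined below.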

For RAE-ELM, the number of ELM networks needs to be determined. The number of ELM networks is set to be 5, 10, 15, 20, 25, and 30 in our simulations, and the optimal parameter is selected as the one which results in the best average RMSE in testing.

Besides, in our simulation trials, the relative factor $\lambda$ in RAE-ELM is a parameter which needs to be optimized within the range $\lambda \in (0, 1)$. We start simulations for $\lambda$ at 0.1 and increase it in steps of 0.1. Table 2 shows examples of setting both $T$ and $\lambda$ for our simulation trials.

As illustrated in Table 2 and Figure 3, RAE-ELM with the sigmoid activation function can achieve good generalization performance on the Parkinson disease dataset as long as the number of ELM networks $T$ is larger than 15. For a given number of ELM networks, the RMSE is less sensitive to the variation of $\lambda$ and tends to be smaller when $\lambda$ is around 0.5. For a fair comparison, we set RAE-ELM with $T = 20$ and $\lambda = 0.5$ in the following experiments. For both original Ada-ELM and modified Ada-ELM, when the number of ELM networks is less than 7, the ensemble model is unstable. The number of ELM networks is therefore also set to be 20 for both original Ada-ELM and modified Ada-ELM.

We use the popular Gaussian kernel function $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma\|\mathbf{u} - \mathbf{v}\|^{2})$ in both SVR and LS-SVR. As is well known, the performances of SVR and LS-SVR are sensitive to the combination of $(C, \gamma)$. Hence, the cost parameter $C$ and the kernel parameter $\gamma$ need to be adjusted appropriately over a wide range so as to obtain good generalization performance. For each data set, 50 different values of $C$ and 50 different values of $\gamma$, that is, 2500 pairs of $(C, \gamma)$, are applied as the adjustment parameters. The different values of $C$ and $\gamma$ are $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$. In both SVR and LS-SVR, the best performed combinations of $(C, \gamma)$ are selected for each data set, as presented in Table 3.

Table 2: Performance (testing RMSE) of the proposed RAE-ELM with different values of $T$ and $\lambda$ for the Parkinson disease dataset.

λ      T = 5     T = 10    T = 15    T = 20    T = 25    T = 30
0.1    0.2472    0.2374    0.2309    0.2259    0.2251    0.2249
0.2    0.2469    0.2315    0.2276    0.2241    0.2249    0.2245
0.3    0.2447    0.2298    0.2248    0.2235    0.2230    0.2233
0.4    0.2426    0.2287    0.2249    0.2227    0.2224    0.2227
0.5    0.2428    0.2296    0.2251    0.2219    0.2216    0.2215
0.6    0.2427    0.2298    0.2246    0.2221    0.2214    0.2213
0.7    0.2422    0.2285    0.2257    0.2229    0.2217    0.2219
0.8    0.2441    0.2301    0.2265    0.2228    0.2225    0.2228
0.9    0.2461    0.2359    0.2312    0.2243    0.2237    0.2230

For the basic ELM and the other ELM-based ensemble methods, the sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b)))$ is selected as the activation function in all the algorithms. The parameters $(C, L)$ need to be selected so as to achieve the best generalization performance, where the cost parameter $C$ is selected from the range $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$ and the candidate values of the number of hidden nodes $L$ are $\{10, 20, \ldots, 1000\}$.

In addition, for the original AdaBoost.RT-based ensemble ELM, the threshold $\Phi$ should be chosen before seeing the datasets. In the original AdaBoost.RT-based ensemble ELM, the threshold $\Phi$ is required to be manually selected according to an empirical suggestion, and it is a sensitive factor affecting the regression performance. If $\Phi$ is too low, then it is generally difficult to obtain a sufficient number of "accepted" samples; however, if $\Phi$ is too high, some wrong samples are treated as "accepted" ones and the ensemble model tends to be unstable. According to Shrestha and Solomatine's experiments, the threshold $\Phi$ shall be defined between 0 and 0.4 in order to make the ensemble model stable [30]. In our simulations, we incrementally set thresholds within the range from 0 to 0.4. The original Ada-ELM with threshold values at $\{0.1, 0.15, \ldots, 0.35, 0.4\}$ can generate satisfactory results for all the regression problems, where the best performed original Ada-ELM is shown in boldface. What is more, the modified Ada-ELM algorithm needs to select an initial value $\Phi_{0}$ to calculate the subsequent thresholds in the iterations. Tian and Mao suggested setting the default initial value of $\Phi_{0}$ to be 0.2 [31]. Considering that the manually fixed initial threshold is not related to the characteristics of the ELM prediction effect on the input dataset, the algorithm may not reach the best generalization performance. In our simulations, we compare the performances of different modified Ada-ELMs at correspondingly different initial threshold values $\Phi_{0}$ set to be $\{0.1, 0.15, \ldots, 0.3, 0.35\}$. The best performed modified Ada-ELM is also presented in Table 3.

Figure 3: The average testing RMSE with different values of $T$ and $\lambda$ in RAE-ELM for the Parkinson disease dataset.

Table 3: Parameters of RAE-ELM and other learning algorithms.

Datasets             RAE-ELM (C, L)   Basic ELM [4] (C, L)   Original Ada-ELM [30] (C, L, Φ)   Modified Ada-ELM [31] (C, L, Φ0)   SVR [36] (C, γ)   LS-SVR [37] (C, γ)
LVST                 (2^13, 960)      (2^5, 710)             (2^-7, 530, 0.3)                  (2^-7, 460, 0.35)                  (2^17, 2^2)       (2^8, 2^0)
Yeast                (2^10, 540)      (2^1, 340)             (2^5, 710, 0.25)                  (2^-1, 800, 0.2)                   (2^-7, 2^0)       (2^-9, 2^3)
Computer hardware    (2^-4, 160)      (2^11, 530)            (2^16, 240, 0.35)                 (2^10, 170, 0.25)                  (2^-9, 2^21)      (2^0, 2^6)
Abalone              (2^0, 150)       (2^0, 200)             (2^-22, 330, 0.2)                 (2^6, 610, 0.15)                   (2^3, 2^18)       (2^1, 2^14)
Servo                (2^1, 340)       (2^-6, 650)            (2^0, 110, 0.3)                   (2^9, 230, 0.2)                    (2^-2, 2^2)       (2^4, 2^-3)
Parkinson disease    (2^9, 830)       (2^-8, 460)            (2^4, 370, 0.35)                  (2^-4, 590, 0.25)                  (2^1, 2^0)        (2^1, 2^7)
Housing              (2^-19, 270)     (2^-9, 310)            (2^-1, 440, 0.25)                 (2^17, 220, 0.2)                   (2^21, 2^-4)      (2^3, 2^12)
Cloud                (2^-5, 590)      (2^13, 880)            (2^0, 1000, 0.2)                  (2^2, 760, 0.25)                   (2^7, 2^5)        (2^4, 2^-11)
Auto price           (2^20, 310)      (2^17, 430)            (2^-2, 230, 0.35)                 (2^-5, 270, 0.35)                  (2^3, 2^23)       (2^7, 2^-10)
Breast cancer        (2^-3, 80)       (2^-1, 160)            (2^-6, 90, 0.15)                  (2^13, 290, 0.25)                  (2^6, 2^1)        (2^6, 2^17)
Balloon              (2^22, 160)      (2^16, 190)            (2^13, 250, 0.2)                  (2^17, 210, 0.2)                   (2^1, 2^8)        (2^0, 2^-7)
Auto-MPG             (2^7, 630)       (2^-10, 820)           (2^21, 380, 0.35)                 (2^11, 680, 0.3)                   (2^-2, 2^-1)      (2^1, 2^5)
Bank                 (2^-1, 660)      (2^0, 590)             (2^2, 710, 0.3)                   (2^-9, 450, 0.15)                  (2^10, 2^-4)      (2^21, 2^-5)
Census (house8L)     (2^3, 580)       (2^5, 460)             (2^1, 190, 0.3)                   (2^3, 600, 0.25)                   (2^5, 2^-6)       (2^0, 2^-4)

5.2. Performance Comparisons between RAE-ELM and Other Learning Algorithms. In this subsection, the performance of the proposed RAE-ELM is compared with other learning algorithms, including the basic ELM [4], original Ada-ELM [30], modified Ada-ELM [31], support vector regression [36], and least-square support vector regression [37]. The result comparisons of RAE-ELM and the other learning algorithms for real-world data regressions are shown in Table 4.

Table 4 lists the averaged results of multiple trials of the four ELM-based algorithms (RAE-ELM, basic ELM, original Ada-ELM [30], and modified Ada-ELM [31]), SVR [36], and LS-SVR [37] for fourteen representative real-world data regression problems. The selected datasets include large-scale and small-scale data, as well as high-dimensional and low-dimensional problems. It is easy to find that the averaged testing RMSE obtained by RAE-ELM is always the best among these six algorithms for all fourteen cases. For the original Ada-ELM, the performance is sensitive to the selection of the threshold value $\Phi$; the best performed original Ada-ELM models for different regression problems have correspondingly different threshold values, so the manual selection strategy is not good. The generalization performance of the modified Ada-ELM is in general better than that of the original Ada-ELM; however, the empirically suggested initial threshold value of 0.2 does not ensure a mapping to the best performed regression model. The three AdaBoost.RT based ensemble ELMs (RAE-ELM, original Ada-ELM, and modified Ada-ELM) all perform better than the basic ELM, which verifies that an ensemble ELM using AdaBoost.RT can achieve better prediction accuracy than using an individual ELM as the predictor. The averaged generalization performance of the basic ELM is better than SVR, while it is slightly worse than LS-SVR.

To find the best performed original Ada-ELM or modified Ada-ELM model for a regression problem, since the input dataset and ELM networks are not related to the threshold selection, the optimal parameter needs to be searched by brute force: one has to carry out a set of experiments with different (initial) threshold values and then search among them for the best ensemble ELM model. Such a process is time consuming. Moreover, the generalization performance of the optimized ensemble ELMs using original Ada-ELM or modified Ada-ELM can hardly be better than that of the proposed RAE-ELM. In fact, the proposed RAE-ELM is always the best performing learner among the six candidates for all fourteen real-world regression problems.


Table 4: Result comparisons of testing RMSE of RAE-ELM and other learning algorithms for real-world data regression problems.

Datasets             RAE-ELM   Basic ELM [4]   Original Ada-ELM [30]   Modified Ada-ELM [31]   SVR [36]   LS-SVR [37]
LVST                 0.2571    0.2854          0.2702                  0.2653                  0.2849     0.2801
Yeast                0.1071    0.1224          0.1195                  0.1156                  0.1238     0.1182
Computer hardware    0.0563    0.0839          0.0821                  0.0726                  0.0976     0.0771
Abalone              0.0731    0.0882          0.0793                  0.0778                  0.0890     0.0815
Servo                0.0727    0.0924          0.0884                  0.0839                  0.0966     0.0870
Parkinson disease    0.2219    0.2549          0.2513                  0.2383                  0.2547     0.2437
Housing              0.1018    0.1259          0.1163                  0.1135                  0.1270     0.1185
Cloud                0.2897    0.3269          0.3118                  0.3006                  0.3316     0.3177
Auto price           0.0809    0.1036          0.0910                  0.0894                  0.1045     0.0973
Breast cancer        0.2463    0.2641          0.2601                  0.2519                  0.2657     0.2597
Balloon              0.0570    0.0639          0.0611                  0.0592                  0.0672     0.0618
Auto-MPG             0.0724    0.0862          0.0799                  0.0781                  0.0891     0.0857
Bank                 0.0415    0.0551          0.0496                  0.0453                  0.0537     0.0498
Census (house8L)     0.0746    0.0820          0.0802                  0.0795                  0.0867     0.0813

6. Conclusion

In this paper, a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems is proposed, which combines ELM with the novel robust AdaBoost.RT algorithm. Combining the effective learner ELM with the novel ensemble method, the robust AdaBoost.RT algorithm, yields a hybrid method that inherits their intrinsic properties and achieves better prediction accuracy than using only an individual ELM as the predictor. ELM tends to reach the solutions straightforwardly, and the error rate of its regression predictions is in general much smaller than 0.5; therefore, selecting ELM as the "weak" learner can avoid overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the ensemble ELM.

The proposed robust AdaBoost.RT algorithm overcomes the limitations existing in the available AdaBoost.RT algorithm and its variants, where the threshold value is manually specified, which may only be ideal for a very limited set of cases. The new robust AdaBoost.RT algorithm utilizes the statistical distribution of the approximation error to dynamically determine a robust threshold. The robust threshold for each weak learner $\mathrm{WL}_{t}$ is self-adjustable and is defined as the scaled standard deviation of the approximation errors, $\lambda\sigma_{t}$. We analyze the convergence of the proposed robust AdaBoost.RT algorithm, and it has been proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound.

The proposed RAE-ELM is robust with respect to the differences among various regression problems, and variations of the approximation error rates do not significantly affect its highly stable generalization performance. As one of the key parameters in the ensemble algorithm, the threshold value does not need any human intervention; instead, it is able to self-adjust according to the real regression effect of the ELM networks on the input dataset. Such a mechanism enables RAE-ELM to make sensitive and adaptive adjustments to the intrinsic properties of the given regression problem. The experimental result comparisons in terms of stability and accuracy among the six prevailing algorithms (RAE-ELM, basic ELM, original Ada-ELM, modified Ada-ELM, SVR, and LS-SVR) for regression issues verify that all the AdaBoost.RT based ensemble ELMs perform better than the SVR and, more remarkably, the proposed RAE-ELM always achieves the best performance. The boosting effect of the proposed method is not significant for small-sized and low-dimensional problems, as an individual learner (ELM network) can already be sufficient to handle such problems well. It is worth pointing out that the proposed RAE-ELM has better performance than the others especially for high-dimensional or large-sized datasets, which is a convincing indicator of good generalization performance.

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank Professor Guang-Bin HuangfromNanyangTechnological University for providing inspir-ing comments and suggestions on our research This work isfinancially supported by the University of Macau with Grantno MYRG079(Y1-L2)-FST13-YZX

References

[1] C L Philip Chen J Wang C H Wang and L Chen ldquoAnew learning algorithm for a Fully Connected Fuzzy InferenceSystem (F-CONFIS)rdquo IEEE Transactions on Neural Networksand Learning Systems vol 25 no 10 pp 1741ndash1757 2014

[2] Z Yang P K Wong C M Vong J Zhong and J Y LiangldquoSimultaneous-fault diagnosis of gas turbine generator systemsusing a pairwise-coupled probabilistic classifierrdquoMathematicalProblems in Engineering vol 2013 Article ID 827128 13 pages2013

12 Mathematical Problems in Engineering

[3] G-B Huang Q-Y Zhu and C-K Siew ldquoExtreme learningmachine theory and applicationsrdquoNeurocomputing vol 70 no1ndash3 pp 489ndash501 2006

[4] G-B Huang H Zhou X Ding and R Zhang ldquoExtremelearning machine for regression and multiclass classificationrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 42 no 2 pp 513ndash529 2012

[5] S C Ng C C Cheung and S H Leung ldquoMagnified gradientfunction with deterministic weight modification in adaptivelearningrdquo IEEE Transactions on Neural Networks vol 15 no 6pp 1411ndash1423 2004

[6] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[7] H J Rong G B Huang N Sundararajan and P SaratchandranldquoOnline sequential fuzzy extreme learningmachine for functionapproximation and classification problemsrdquo IEEE Transactionson Systems Man and Cybernetics Part B Cybernetics vol 39no 4 pp 1067ndash1072 2009

[8] J Cao Z Lin and G B Huang ldquoVoting base online sequentialextreme learning machine for multi-class classificationrdquo inProceedings of the IEEE International Symposium onCircuits andSystems (ISCAS 13) pp 2327ndash2330 IEEE 2013

[9] J Cao Z Lin G B Huang and N Liu ldquoVoting based extremelearning machinerdquo Information Sciences vol 185 no 1 pp 66ndash77 2012

[10] N Y Liang G B Huang P Saratchandran and N Sundarara-jan ldquoA fast and accurate online sequential learning algorithmfor feedforward networksrdquo IEEE Transactions on Neural Net-works vol 17 no 6 pp 1411ndash1423 2006

[11] J Luo C M Vong and P K Wong ldquoSparse Bayesian extremelearningmachine formulti-classificationrdquo IEEETransactions onNeural Networks and Learning Systems vol 25 no 4 pp 836ndash843 2014

[12] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing vol 2 MIT Press Cambridge UK 2006

[13] Y Lan Y C Soh and G B Huang ldquoEnsemble of onlinesequential extreme learning machinerdquoNeurocomputing vol 72no 13 pp 3391ndash3395 2009

[14] N Liu and H Wang ldquoEnsemble based extreme learningmachinerdquo IEEE Signal Processing Letters vol 17 no 8 pp 754ndash757 2010

[15] X Xue M Yao Z Wu and J Yang ldquoGenetic ensemble ofextreme learningmachinerdquoNeurocomputing vol 129 no 10 pp175ndash184 2014

[16] X L Wang Y Y Chen H Zhao and B L Lu ldquoParallelizedextreme learning machine ensemble based on minndashmax mod-ular networkrdquo Neurocomputing vol 128 pp 31ndash41 2014

[17] B V Dasarathy and B V Sheela ldquoA composite classifier systemdesign concepts andmethodologyrdquoProceedings of the IEEE vol67 no 5 pp 708ndash713 1979

[18] L KHansen andP Salamon ldquoNeural network ensemblesrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol12 no 10 pp 993ndash1001 1990

[19] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[20] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990

[21] R E Schapire and Y Freund Boosting Foundations and Algo-rithmsRobert E Schapire Yoav Freund MIT Press CambridgeMass USA 2012

[22] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001

[23] R A Jacobs M I Jordan S J Nowlan and G E HintonldquoAdaptive mixtures of local expertsrdquo Neural Computation vol3 no 1 pp 79ndash87 1991

[24] V Guruswami and A Sahai ldquoMulticlass learning boostingand error-correcting codesrdquo in Proceedings of the 12th AnnualConference on Computational Learning Theory (COLT rsquo99) pp145ndash155 July 1999

[25] R E Schapire and Y Singer ldquoImproved boosting algorithmsusing confidence-rated predictionsrdquo Machine Learning vol 37no 3 pp 297ndash336 1999

[26] N Li ldquoMulticlass boosting with repartitioningrdquo in Proceedingsof the 23rd International Conference on Machine Learning pp569ndash576 ACM June 2006

[27] H Drucker ldquoImproving regressors using boostingrdquo in Proceed-ings of the 14th International Conference on Machine Learningpp 107ndash115 1997

[28] R Avnimelech and N Intrator ldquoBoosting regression estima-torsrdquo Neural Computation vol 11 no 2 pp 499ndash520 1999

[29] D P Solomatine and D L Shrestha ldquoAdaBoostRT a boostingalgorithm for regression problemsrdquo in Proceedings of the IEEEInternational Joint Conference on Neural Networks vol 2 pp1163ndash1168 IEEE 2004

[30] D L Shrestha and D P Solomatine ldquoExperiments withAdaBoostRT an improved boosting scheme for regressionrdquoNeural Computation vol 18 no 7 pp 1678ndash1710 2006

[31] H-X Tian and Z-Z Mao ldquoAn ensemble ELM based on mod-ified AdaBoostRT algorithm for predicting the temperature ofmolten steel in ladle furnacerdquo IEEE Transactions on AutomationScience and Engineering vol 7 no 1 pp 73ndash80 2010

[32] J W Cao T Chen and J Fan ldquoFast online learning algorithmfor landmark recognition based on BoW frameworkrdquo in Pro-ceedings of the 9th IEEE Conference on Industrial Electronics andApplications Hangzhou China June 2014

[33] J Cao Z Lin and G-B Huang ldquoComposite function waveletneural networks with extreme learning machinerdquo Neurocom-puting vol 73 no 7ndash9 pp 1405ndash1416 2010

[34] R Feely Predicting stock market volatility using neural networksBA (Mod) [PhD thesis] Trinity College Dublin DublinIreland 2000

[35] K Bache and M Lichman UCI Machine Learning RepositoryUniversity of California School of Information and ComputerScience Irvine Calif USA 2014 httparchiveicsucieduml

[36] H Drucker C J Burges L Kaufman A Smola and VVapnik ldquoSupport vector regression machinesrdquo Advances inNeural Information Processing Systems vol 9 pp 155ndash161 1997

[37] J A Suykens and J Vandewalle ldquoLeast squares support vectormachine classifiersrdquo Neural Processing Letters vol 9 no 3 pp293ndash300 1999



ELM as well. As a single learning machine, although ELM is quite stable compared with other learning algorithms, its classification and regression performance may still vary slightly among different trials on big datasets. Many researchers have therefore sought ensemble methods that integrate a set of ELMs into a combined network structure and verified that such ensembles perform better than an individual ELM. Lan et al. [13] proposed an ensemble of online sequential ELM (EOS-ELM), which is composed of several OS-ELM networks; the mean of the OS-ELM networks' outputs serves as the output of the ensemble. Liu and Wang [14] presented an ensemble-based ELM (EN-ELM) algorithm, where a cross-validation scheme is used to create an ensemble of ELM classifiers for decision making. Besides, Xue et al. [15] proposed a genetic ensemble of extreme learning machine (GE-ELM), which first adopts genetic algorithms (GAs) to produce a group of candidate networks; according to a specific ranking strategy, some of the networks are then selected to form a new ensemble network. More recently, Wang et al. [16] presented a parallelized ELM ensemble method based on the M3-network, called M3-ELM, which improves computational efficiency through parallelism and handles imbalanced classification tasks through task decomposition.

To learn the exponentially increasing number and types of data with high accuracy, traditional learning algorithms tend to suffer from the overfitting problem; hence a robust and stable ensemble algorithm is of great importance. Dasarathy and Sheela [17] first introduced an ensemble system whose idea is to partition the feature space using multiple classifiers. Furthermore, Hansen and Salamon [18] presented an ensemble of neural networks with a plurality consensus scheme that obtains far better performance on classification problems than approaches using a single neural network. After that, ensemble-based algorithms were widely explored [19-23]. Among the ensemble-based algorithms, Bagging and Boosting are the most prevailing methods for training neural network ensembles. The Bagging (short for Bootstrap Aggregation) algorithm randomly selects n bootstrap samples from the original training set of cardinality N (n < N), and the diversity of bagging-based ensembles is ensured by the variation among the bootstrapped replicas on which each classifier is trained; with relatively weak classifiers, the decision boundaries vary measurably with respect to relatively small perturbations of the training data. Boosting, an iterative method presented by Schapire [20] for generating a strong classifier, can achieve arbitrarily low training error from an ensemble of weak classifiers, each of which can barely do better than random guessing. Thereafter, a novel boosting algorithm called adaptive boosting (AdaBoost) was presented by Schapire and Freund [21]. The AdaBoost algorithm improves on traditional boosting methods in two respects. One is that the instances are drawn into subsequent subdatasets from an iteratively updated sample distribution over the same training dataset: AdaBoost replaces random subsampling by weighted versions of the same training dataset, which can be reused repeatedly, so the training dataset does not need to be very large. The other is that the ensemble classifier is defined through a weighted majority vote of a set of weak classifiers, where the voting weights are based on the classifiers' training errors.

However, many of the existing investigations on ensemble algorithms focus on classification problems, and ensemble algorithms for classification unfortunately cannot be applied directly to regression problems. Regression methods provide predicted results by analyzing historical data; forecasting and prediction are important functional requirements in real-world applications such as temperature prediction, inventory management, and position tracking in manufacturing execution systems. To solve regression problems based on the AdaBoost algorithm for classification [24-26], Schapire and Freund [21] extended AdaBoost.M2 to AdaBoost.R. In addition, Drucker [27] proposed the AdaBoost.R2 algorithm, an ad hoc modification of AdaBoost.R. Besides, Avnimelech and Intrator [28] presented the notion of weak and strong learning and an appropriate equivalence theorem between them, so as to improve boosting for regression problems. Furthermore, Solomatine and Shrestha [29, 30] proposed a novel boosting algorithm called AdaBoost.RT, which projects regression problems into the binary classification domain, where they can be processed by AdaBoost, by filtering out those examples whose relative estimation error is larger than a preset threshold value.

The proposed hybrid algorithm, which combines the effective learner ELM with the promising ensemble method, the AdaBoost.RT algorithm, can inherit their intrinsic properties and should achieve good generalization performance when dealing with big data. As with the development of general ensemble algorithms, the available ELM ensemble algorithms are mainly aimed at classification problems, while regression problems with ensemble algorithms have received relatively little attention. Tian and Mao [31] presented an ensemble ELM based on a modified AdaBoost.RT algorithm (modified Ada-ELM) in order to predict the temperature of molten steel in a ladle furnace. The hybrid learning algorithm combines the modified AdaBoost.RT with ELM, which possesses the advantage of ELM and overcomes the limitation of basic AdaBoost.RT by a self-adaptively modifiable threshold value. The threshold value $\Phi$ need not be constant; instead, it is adjusted using a self-adaptive modification mechanism subject to the change trend of the prediction error at each iteration. The variation range of the threshold value is set to $[0, 0.4]$, as suggested by Solomatine and Shrestha [29, 30]. However, the initial value $\Phi_0$ is manually fixed to the mean of the variation range of the threshold value, that is, $\Phi_0 = 0.2$, according to an empirical suggestion. When one error rate $\varepsilon_t$ is smaller than that of the previous iteration, $\varepsilon_{t-1}$, the value of $\Phi$ decreases, and vice versa. Hence, such an empirically based method is not fully self-adaptive over the whole threshold domain. Moreover, the manually fixed initial threshold is not related to the properties of the input dataset and the weak learners, which makes it hard for the ensemble ELM to reach a generally optimized learning effect.


This paper presents a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems, which combines ELM with the robust AdaBoost.RT algorithm. The robust AdaBoost.RT algorithm not only overcomes the limitation of the original AdaBoost.RT algorithm (original Ada-ELM) but also makes the threshold value $\Phi$ adaptive to the input dataset and the ELM networks instead of being preset. The main idea of RAE-ELM is as follows. The ELM algorithm is selected as the "weak" learning machine to build the hybrid ensemble model. A new robust AdaBoost.RT algorithm is proposed that uses error statistics to dynamically determine the regression threshold value, rather than manual selection, which may only be ideal for very few regression cases. The mean and the standard deviation of the approximation errors are computed at each iteration, and the robust threshold for each weak learner is defined as a scaled standard deviation. Based on this concept, individual training data whose error exceeds the robust threshold are regarded as "flaws in this training process" and are rejected; the rejected data are processed in the later iterations of the weak learners.

We then analyze the convergence of the proposed robust AdaBoost.RT algorithm and prove that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound. The proposed robust AdaBoost.RT based ensemble extreme learning machine can also avoid overfitting because of the characteristics of ELM: ELM tends to reach the solutions straightforwardly, and the error rate of its regression outcome at each training round is much smaller than 0.5, so selecting ELM as the "weak" learner keeps the boosting process from overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the robust AdaBoost.RT based ensemble module. The experimental results demonstrate that the proposed robust AdaBoost.RT ensemble ELM (RAE-ELM) has superior learning properties in terms of stability and accuracy for regression problems and better generalization performance than the other algorithms.

This paper is organized as follows. Section 2 gives a brief review of the basic ELM. Section 3 introduces the original and the proposed robust AdaBoost.RT algorithms. The hybrid robust AdaBoost.RT ensemble ELM (RAE-ELM) algorithm is then presented in Section 4. The performance of RAE-ELM and its regression ability are verified experimentally in Section 5. Finally, the conclusion is drawn in the last section.

2. Brief on ELM

Recently, Huang et al. [3, 4] proposed novel neural networks called extreme learning machines (ELMs) for single-hidden layer feedforward neural networks (SLFNs) [32, 33]. ELM is based on the least-squares method: it randomly assigns the input weights and the hidden layer biases, and the output weights between the hidden nodes and the output layer can then be determined analytically. Since the learning process in ELM takes place without iterative tuning, the ELM algorithm tends to reach the solutions straightforwardly without suffering from problems such as local minima, slow learning speed, and overfitting.

From the standard optimization theory point of view, the objective of ELM, which minimizes both the training errors and the output weights, can be presented as [4]

$$\text{Minimize: } L_{P_{\mathrm{ELM}}} = \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + C\,\frac{1}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_i\|^{2}$$
$$\text{Subject to: } \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = \mathbf{t}_i^{T} - \boldsymbol{\xi}_i^{T}, \quad i = 1, \ldots, N, \qquad (1)$$

where $\boldsymbol{\xi}_i = [\xi_{i,1}, \ldots, \xi_{i,m}]^{T}$ is the training error vector of the $m$ output nodes with respect to the training sample $\mathbf{x}_i$. According to the KKT theorem, training ELM is equivalent to solving the dual optimization problem

$$L_{D_{\mathrm{ELM}}} = \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + C\,\frac{1}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_i\|^{2} - \sum_{i=1}^{N}\sum_{j=1}^{m}\alpha_{i,j}\left(\mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta}_j - t_{i,j} + \xi_{i,j}\right), \qquad (2)$$

where $\boldsymbol{\beta}_j$ is the vector of the weights between the hidden layer and the $j$th output node, $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_m]$, and $C$ is the regularization parameter representing the trade-off between the minimization of the training errors and the maximization of the marginal distance.

According to the KKT theorem, we can obtain different solutions as follows.

(1) Kernel Case. Consider

$$\boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \qquad (3)$$

The ELM output function is

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \qquad (4)$$

A kernel matrix for ELM is defined as follows:

$$\boldsymbol{\Omega}_{\mathrm{ELM}} = \mathbf{H}\mathbf{H}^{T}, \qquad \boldsymbol{\Omega}_{\mathrm{ELM}\,i,j} = \mathbf{h}(\mathbf{x}_i)\cdot\mathbf{h}(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j). \qquad (5)$$

Then the ELM output function can be written as

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T} = \begin{bmatrix} K(\mathbf{x}, \mathbf{x}_1) \\ \vdots \\ K(\mathbf{x}, \mathbf{x}_N) \end{bmatrix}^{T}\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\mathrm{ELM}}\right)^{-1}\mathbf{T}. \qquad (6)$$

In this special case, a corresponding kernel $K(\mathbf{x}_i, \mathbf{x}_j)$ is used in ELM instead of the feature mapping $\mathbf{h}(\mathbf{x})$, which then need not be known explicitly. We call $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{h}(\mathbf{x}_i)\cdot\mathbf{h}(\mathbf{x}_j)$ the ELM random kernel, where the feature mapping $\mathbf{h}(\mathbf{x})$ is randomly generated.

(2) Nonkernel Case. Similarly, based on the KKT theorem, we have

$$\boldsymbol{\beta} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}. \qquad (7)$$

In this case, the ELM output function is

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}. \qquad (8)$$

3. The Proposed Robust AdaBoost.RT Algorithm

In this section, we first describe the original AdaBoost.RT algorithm for regression problems and then present the new robust AdaBoost.RT algorithm. The corresponding analysis of the new algorithm is also given.

3.1. The Original AdaBoost.RT Algorithm. Solomatine and Shrestha proposed AdaBoost.RT [29, 30], a boosting algorithm for regression problems, where the letters R and T stand for regression and threshold, respectively. The original AdaBoost.RT algorithm is described as follows.

Input the following:

(i) sequence of $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$, where the output $y \in \mathbb{R}$;
(ii) weak learning algorithm (weak learner);
(iii) integer $T$ specifying the number of iterations (machines);
(iv) threshold $\Phi$ ($0 < \Phi < 1$) for demarcating correct and incorrect predictions.

Initialize the following:

(i) machine number or iteration $t = 1$;
(ii) distribution $D_t(i) = 1/m$ for all $i$;
(iii) error rate $\varepsilon_t = 0$.

Learning steps (iterate while $t \le T$) are as follows.

Step 1. Call the weak learner, providing it with distribution $D_t$.

Step 2. Build the regression model $f_t(x) \rightarrow y$.

Step 3. Calculate the absolute relative error (ARE) for each training example as
$$\mathrm{ARE}_t(i) = \left|\frac{f_t(x_i) - y_i}{y_i}\right|. \qquad (9)$$

Step 4. Calculate the error rate of $f_t(x)$: $\varepsilon_t = \sum_{i:\,\mathrm{ARE}_t(i) > \Phi} D_t(i)$.

Step 5. Set $\beta_t = \varepsilon_t^{\,n}$, where $n$ is the power coefficient (e.g., linear, square, or cubic).

Step 6. Update the distribution $D_t$ as follows: if $\mathrm{ARE}_t(i) \le \Phi$, then $D_{t+1}(i) = (D_t(i)/Z_t)\times\beta_t$; else $D_{t+1}(i) = D_t(i)/Z_t$, where $Z_t$ is a normalization factor chosen such that $D_{t+1}$ is a distribution.

Step 7. Set $t = t + 1$.

Output the following: the final hypothesis
$$f_{\mathrm{fin}}(x) = \frac{\sum_t \left(\log(1/\beta_t)\right) f_t(x)}{\sum_t \log(1/\beta_t)}. \qquad (10)$$
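To make Steps 3-6 concrete, one AdaBoost.RT round is sketched below in Python (an illustrative assumption on our part; the variable names and the vectorized form are not taken from [29, 30]). Note that the absolute relative error in (9) is undefined whenever $y_i = 0$, an inherent limitation of the original formulation.

```python
import numpy as np

def adaboost_rt_round(y_pred, y_true, D, phi, n=2):
    """One round of the original AdaBoost.RT (Steps 3-6)."""
    are = np.abs((y_pred - y_true) / y_true)       # Step 3: absolute relative error, eq. (9)
    eps = D[are > phi].sum()                       # Step 4: error rate of f_t
    beta = eps ** n                                # Step 5: n = 1, 2, or 3 (linear/square/cubic)
    D_new = np.where(are <= phi, D * beta, D)      # Step 6: shrink weights of correct predictions
    return beta, D_new / D_new.sum()               # divide by Z_t so D_{t+1} is a distribution
```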

The AdaBoost.RT algorithm projects the regression problem into the binary classification domain. Based on the boosting regression estimators [28] and BEM [34], AdaBoost.RT introduces the absolute relative error (ARE) to demarcate samples as either correct or incorrect predictions. If the ARE of any particular sample is greater than the threshold $\Phi$, the predicted value for this sample is regarded as an incorrect prediction; otherwise it is regarded as a correct prediction. This labeling is similar to the "misclassification" and "correct classification" labels used in classification problems. The algorithm assigns relatively large weights to those weak learners at the front of the learner list that reach a high correct-prediction rate, while the samples with incorrect predictions are handled as ad hoc cases by the subsequent weak learners. The outputs of all weak learners are combined into the final hypothesis using the correspondingly computed weights.

The AdaBoost.RT algorithm requires manual selection of the threshold $\Phi$, which is a main factor that sensitively affects the performance of the committee machine. If $\Phi$ is too small, very few samples are treated as correct predictions, which easily get boosted; the subsequent learners must then handle a large number of ad hoc samples, and the ensemble algorithm becomes unstable. On the other hand, if $\Phi$ is too large, say greater than 0.4, most samples are treated as correct predictions and false samples fail to be rejected, which causes low convergence efficiency and overfitting. The initial AdaBoost.RT and its variants suffer from this limitation in setting the threshold value, which is specified either as a manually chosen constant or as a variable changing in the vicinity of 0.2; both strategies are irrelevant to the regression capability of the weak learner. In order to determine the value of $\Phi$ effectively, a novel improvement of AdaBoost.RT is proposed in the following section.

3.2. The Proposed Robust AdaBoost.RT Algorithm. To overcome the limitations suffered by the current works on AdaBoost.RT, we embed statistics theory into the AdaBoost.RT algorithm. This overcomes the difficulty of optimally determining the initial threshold value and enables the intermediate threshold values to be dynamically self-adjusted according to the intrinsic properties of the input data samples. The proposed robust AdaBoost.RT algorithm is described as follows.

(1) Input the following:

sequence of $m$ samples $(x_1, y_1), \ldots, (x_m, y_m)$, where the output $y \in \mathbb{R}$;
weak learning algorithm (weak learner);
maximum number of iterations (machines) $T$.

(2) Initialize the following:

iteration index $t = 1$;
distribution $D_t(i) = 1/m$ for all $i$;
weight vector $w_i^{(t)} = D_t(i)$ for all $i$;
error rate $\varepsilon_t = 0$.

(3) Iterate while $t \le T$ the following.

(1) Call weak learner $\mathrm{WL}_t$, providing it with the distribution
$$p^{(t)} = \frac{w^{(t)}}{Z_t}, \qquad (11)$$
where $Z_t$ is a normalization factor chosen such that $p^{(t)}$ is a distribution.

(2) Build the regression model $f_t(x) \rightarrow y$.

(3) Calculate each error $e_t(i) = f_t(x_i) - y_i$.

(4) Calculate the error rate $\varepsilon_t = \sum_{i \in P} p_i^{(t)}$, where $P = \{\,i \mid |e_t(i) - \bar{e}_t| > \lambda\sigma_t,\ i \in [1, m]\,\}$, $\bar{e}_t$ stands for the expected value of the errors, and $\lambda\sigma_t$ is defined as the robust threshold ($\sigma_t$ stands for the standard deviation; the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). If $\varepsilon_t > 1/2$, then set $T = t - 1$ and abort the loop.

(5) Set $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$.

(6) Calculate the contribution of $f_t(x)$ to the final result: $\alpha_t = -\log(\beta_t)$.

(7) Update the weight vectors: if $i \notin P$, then $w_i^{(t+1)} = w_i^{(t)}\beta_t$; else $w_i^{(t+1)} = w_i^{(t)}$.

(8) Set $t = t + 1$.

(4) Normalize $\alpha_1, \ldots, \alpha_T$ such that $\sum_{t=1}^{T}\alpha_t = 1$. Output the final hypothesis $f_{\mathrm{fin}}(x) = \sum_{t=1}^{T}\alpha_t f_t(x)$.
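The listing above translates into the following Python sketch. It is a hedged illustration rather than a reference implementation: the weak learner is abstracted as a pair of user-supplied functions, the sampling distribution $p^{(t)}$ is passed to the learner as a sample weight (resampling according to $p^{(t)}$ would be an equivalent choice), and a small numerical guard on $\varepsilon_t$ is added that is not part of the listing.

```python
import numpy as np

def robust_adaboost_rt(X, y, fit_weak, predict_weak, T=20, lam=0.5):
    """Robust AdaBoost.RT: fit_weak(X, y, weights) -> model, predict_weak(model, X) -> y_hat."""
    m = X.shape[0]
    w = np.full(m, 1.0 / m)                              # initial weight vector, step (2)
    models, alphas = [], []
    for t in range(T):
        p = w / w.sum()                                  # distribution p^(t), eq. (11)
        model = fit_weak(X, y, p)                        # weak learner WL_t
        e = predict_weak(model, X) - y                   # errors e_t(i), step (3.3)
        rejected = np.abs(e - e.mean()) > lam * e.std()  # robust threshold lambda * sigma_t
        eps = p[rejected].sum()                          # error rate epsilon_t, step (3.4)
        if eps > 0.5:                                    # abort: learner no better than chance
            break
        eps = max(eps, 1e-12)                            # numerical guard (not in the listing)
        beta = eps / (1.0 - eps)                         # step (3.5)
        models.append(model)
        alphas.append(-np.log(beta))                     # alpha_t = -log(beta_t), step (3.6)
        w = np.where(~rejected, w * beta, w)             # shrink weights of "accepted" samples
    alphas = np.asarray(alphas) / np.sum(alphas)         # step (4): normalize the alphas
    return models, alphas

def predict_ensemble(models, alphas, predict_weak, X):
    """Final hypothesis f_fin(x): alpha-weighted sum of the weak learners' outputs."""
    return sum(a * predict_weak(mdl, X) for a, mdl in zip(alphas, models))
```

With an ELM regressor plugged in as the weak learner (for example, the sketch in Section 2, extended to honor the sample weights or combined with resampling), this loop corresponds to the RAE-ELM training procedure described in Section 4.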

At each iteration of the proposed robust AdaBoost.RT algorithm, the standard deviation of the approximation error distribution is used as a criterion. In probability and statistics, the standard deviation measures the amount of variation or dispersion from the average: if the data points tend to be very close to the mean value, the standard deviation is low; if the data points are spread out over a large range of values, a high standard deviation results.

The standard deviation may serve as a measure of uncertainty for a set of repeated predictions. When deciding whether predictions agree with their corresponding true values, the standard deviation of the predictions made by the underlying approximation function is of crucial importance: if the average distance from the predictions to the true values is large, the regression model being tested probably needs to be revised, because sample points then fall outside the range of values that could reasonably be expected and the prediction accuracy of the model is low.

In the proposed robust AdaBoost.RT algorithm, the approximation error of the $t$th weak learner $\mathrm{WL}_t$ on an input dataset can be represented as a statistical distribution with parameters $\mu_t \pm \lambda\sigma_t$, where $\mu_t$ stands for the expected value, $\sigma_t$ stands for the standard deviation, and $\lambda$ is an adjustable relative factor that ranges from 0 to 1. The threshold value for $\mathrm{WL}_t$ is defined by the scaled standard deviation $\lambda\sigma_t$.

In the hybrid learning algorithm, the trained weak learners are assumed to generate small prediction errors ($\varepsilon_t < 1/2$). For all $\mu_t$, $\mu_t \in (0 - \eta, 0 + \eta)$ with $\eta \rightarrow 0$ and $t = 1, \ldots, T$, where $\eta$ denotes a small error limit approaching zero: the population mean of a regression error distribution is closer to the target value of zero than the individual elements of the population, so the means of the obtained regression errors fluctuate around zero within a small range. The standard deviation $\sigma_t$ is determined solely by the individual samples and the generalization performance of the weak learner $\mathrm{WL}_t$. Generally, $\sigma_t$ is relatively large, so that most of the outputs lie within the range $[-\sigma_t, +\sigma_t]$, which tends to make the boosting process unstable. To maintain a stable adjustment of the threshold value, a relative factor $\lambda$ is applied to the standard deviation $\sigma_t$, which results in the robust threshold $\lambda\sigma_t$. Samples that fall within the threshold range $[-\lambda\sigma_t, +\lambda\sigma_t]$ are treated as "accepted" samples; the other samples are treated as "rejected" samples. With the introduction of the robust threshold, the algorithm becomes stable and resistant to noise in the data. According to the error rate $\varepsilon_t$ of each weak learner's regression model, each weak learner $\mathrm{WL}_t$ is assigned a correspondingly computed weight $\alpha_t$.

For one regression problem, the performances of different weak learners may be different. The regression error distributions of different weak learners under the robust threshold are shown in Figure 1.

In Figure 1(a), the weak learner $\mathrm{WL}_t$ generates a regression error distribution with a large error rate, where the standard deviation is relatively large. On the other hand, another weak learner $\mathrm{WL}_{t+j}$ may generate an error distribution with a small standard deviation, as shown in Figure 1(b). Suppose the red triangular points represent "rejected" samples, whose regression errors exceed the specified robust threshold, so that $\varepsilon_t = \sum_{i \in P} p_i^{(t)}$ with $P = \{\,i \mid |e_t(i) - \bar{e}_t| > \lambda\sigma_t,\ i \in [1, m]\,\}$, as described in Step (4) of the proposed algorithm. The green circular points represent the "accepted" samples, whose regression errors are smaller than the robust threshold. The weight vector $w_i^{(t)}$ of every "accepted" sample is dynamically changed for each weak learner $\mathrm{WL}_t$, while that of the "rejected" samples remains unchanged.

Figure 1: Different regression error distributions for different weak learners under the robust threshold: (a) error distribution of weak learner $\mathrm{WL}_t$ that results in a large threshold; (b) error distribution of weak learner $\mathrm{WL}_{t+j}$ that results in a small threshold. (Each panel plots probability against error value, with the accepted region bounded by $\mu \pm \lambda\sigma$.)

As illustrated in Figure 1, in terms of stability and accuracy, the regression capability of weak learner $\mathrm{WL}_{t+j}$ is superior to that of $\mathrm{WL}_t$. The robust threshold values for $\mathrm{WL}_{t+j}$ and $\mathrm{WL}_t$ are computed separately, the former being smaller than the latter so as to discriminate their correspondingly different regression performances. The proposed method thus overcomes the limitation of the existing methods, where the threshold value is set empirically: the threshold, a critical factor in the boosting process, becomes robust and self-adaptive to the individual weak learners' performance on the input data samples. Therefore, the proposed robust AdaBoost.RT algorithm is capable of outputting the final hypothesis as an optimally weighted ensemble of the weak learners.

In the following, we show that the training error of the proposed robust AdaBoost.RT algorithm is bounded. One lemma is needed in order to prove the convergence of the algorithm.

Lemma 1. The convexity argument $x^{d} \le 1 - (1 - x)d$ holds for any $x \ge 0$ and $0 \le d \le 1$.

Theorem 2. Suppose the improved adaptive AdaBoost.RT algorithm generates hypotheses with errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T < 1/2$. Then the error $E_{\mathrm{ensemble}}$ of the final hypothesis output by this algorithm is bounded above by

$$E_{\mathrm{ensemble}} \le 2^{T}\prod_{t=1}^{T}\sqrt{\varepsilon_t\left(1 - \varepsilon_t\right)}. \qquad (12)$$

Proof. In this proof, we need to transform the regression problem $X \rightarrow Y$ into a binary classification problem $X, Y \rightarrow \{0, 1\}$. In the proposed improved adaptive AdaBoost.RT algorithm, the mean of the errors ($\bar{e}$) is assumed to be close to zero; thus the dynamically adaptive thresholds can ignore the mean of the errors.

Let $I_t = \{\,i \mid |f_t(x_i) - y_i| < \lambda\sigma_t\,\}$; if $i \notin I_t$, set $I_t^i = 1$; otherwise $I_t^i = 0$. The final hypothesis output $f$ makes a mistake on sample $i$ only if

$$\prod_{t=1}^{T}\beta_t^{-I_t^i} \ge \left(\prod_{t=1}^{T}\beta_t\right)^{-1/2}. \qquad (13)$$

The final weight of any sample $i$ is

$$w_i^{T+1} = D(i)\prod_{t=1}^{T}\beta_t^{(1 - I_t^i)}. \qquad (14)$$

Combining (13) and (14), the sum of the final weights is bounded from below by the sum of the final weights of the rejected samples. Consider

$$\sum_{i=1}^{m} w_i^{T+1} \ge \sum_{i \notin I_{T+1}} w_i^{T+1} \ge \left(\sum_{i \notin I_{T+1}} D(i)\right)\left(\prod_{t=1}^{T}\beta_t\right)^{1/2} = E_{\mathrm{ensemble}}\left(\prod_{t=1}^{T}\beta_t\right)^{1/2}, \qquad (15)$$

where $E_{\mathrm{ensemble}}$ is the error of the final hypothesis output. Based on Lemma 1,

$$\sum_{i=1}^{m} w_i^{t+1} = \sum_{i=1}^{m} w_i^{t}\,\beta_t^{1 - I_t^i} \le \sum_{i=1}^{m} w_i^{t}\left(1 - \left(1 - \beta_t\right)\left(1 - I_t^i\right)\right) = \left(\sum_{i=1}^{m} w_i^{t}\right)\left(1 - \left(1 - \varepsilon_t\right)\left(1 - \beta_t\right)\right). \qquad (16)$$


Combining these inequalities for $t = 1, \ldots, T$, the following bound is obtained:

$$\sum_{i=1}^{m} w_i^{T+1} \le \prod_{t=1}^{T}\left(1 - \left(1 - \varepsilon_t\right)\left(1 - \beta_t\right)\right). \qquad (17)$$

Combining (15) and (17), we obtain

$$E_{\mathrm{ensemble}} \le \prod_{t=1}^{T}\frac{1 - \left(1 - \varepsilon_t\right)\left(1 - \beta_t\right)}{\sqrt{\beta_t}}. \qquad (18)$$

Considering that all factors in the product are positive, the right-hand side can be minimized by minimizing each factor individually; setting the derivative of the $t$th factor to zero gives $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$. Substituting this $\beta_t$ into (18) completes the proof.
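As a quick numerical illustration of the bound in (12) (the error rates below are hypothetical, not results from our experiments), the right-hand side can be evaluated directly; as long as every $\varepsilon_t$ is strictly below 0.5, the bound shrinks as more weak learners are combined.

```python
import math

def ensemble_error_bound(eps_list):
    """Right-hand side of eq. (12): 2^T * prod_t sqrt(eps_t * (1 - eps_t))."""
    return (2.0 ** len(eps_list)) * math.prod(math.sqrt(e * (1.0 - e)) for e in eps_list)

print(ensemble_error_bound([0.4] * 5))    # ~0.903
print(ensemble_error_bound([0.4] * 20))   # ~0.665
print(ensemble_error_bound([0.3] * 20))   # ~0.175
```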

Unlike the original AdaBoost.RT and its existing variants, the robust threshold in the proposed AdaBoost.RT algorithm is determined, and can be self-adaptively adjusted, according to the individual weak learners and the data samples. Through the analysis of the convergence of the proposed robust AdaBoost.RT algorithm, it has been proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound. The study shows that the robust AdaBoost.RT algorithm proposed in this paper overcomes the limitations of the available AdaBoost.RT algorithms.

4. A Robust AdaBoost.RT Ensemble-Based Extreme Learning Machine

In this paper, a robust AdaBoost.RT ensemble-based extreme learning machine (RAE-ELM), which combines ELM with the robust AdaBoost.RT algorithm described in the previous section, is proposed to improve the robustness and stability of ELM. A set of $T$ ELMs is adopted as the "weak" learners. In the training phase, RAE-ELM uses the proposed robust AdaBoost.RT algorithm to train every ELM model and to assign an ensemble weight accordingly, so that each ELM receives its corresponding distribution based on the training output. The optimally weighted ensemble of ELMs, $E_{\mathrm{ensemble}}$, is the final hypothesis used for making predictions on the testing dataset. The proposed RAE-ELM is illustrated in Figure 2.

4.1. Initialization. For the first weak learner, ELM$_1$ is supplied with $m$ training samples under a uniform distribution of weights, so that each sample has an equal opportunity of being chosen during the first training process for ELM$_1$.

4.2. Distribution Updating. The relative prediction error rates are used to evaluate the performance of each ELM. The prediction error of the $t$th ELM, ELM$_t$, on the input data samples can be represented as a statistical distribution with parameters $\mu_t \pm \lambda\sigma_t$, where $\mu_t$ stands for the expected value and $\lambda\sigma_t$ is defined as the robust threshold ($\sigma_t$ stands for the standard deviation, and the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). The robust threshold is applied to demarcate predictions as "accepted" or "rejected": if the prediction error of a particular sample falls into the region $\mu_t \pm \lambda\sigma_t$ bounded by the robust thresholds, the prediction of this sample is regarded as "accepted" for ELM$_t$, and vice versa for "rejected" predictions. The probabilities of the "rejected" predictions are accumulated to calculate the error rate $\varepsilon_t$; ELM attempts to achieve an $f_t(x)$ with a small error rate.

The robust AdaBoost.RT algorithm then calculates the distribution for the next learner, ELM$_{t+1}$. For every sample that is correctly predicted by the current ELM$_t$, the corresponding weight is multiplied by the error-rate function $\beta_t$; otherwise, the weight remains unchanged. This process is iterated for the next ELM$_{t+1}$ until the last learner ELM$_T$, unless $\varepsilon_t$ exceeds 0.5, because once the error rate is higher than 0.5 the AdaBoost algorithm does not converge and tends to overfit [21]. Hence, the error rate must be less than 0.5.

4.3. Decision Making on RAE-ELM. The weight-updating parameter $\beta_t$ is used as an indicator of the regression effectiveness of ELM$_t$ in the current iteration. According to the relationship between $\varepsilon_t$ and $\beta_t$, if $\varepsilon_t$ increases, $\beta_t$ becomes larger as well, and RAE-ELM grants a small ensemble weight to that ELM$_t$; conversely, an ELM$_t$ with relatively superior regression performance is granted a larger ensemble weight. The hybrid RAE-ELM model combines the set of ELMs under these different weights as the final hypothesis for decision making.
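As a toy illustration of this weighting rule (the error rates below are hypothetical), mapping $\varepsilon_t$ through $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$ and $\alpha_t = -\log\beta_t$ and then normalizing shows that the ELM with the smallest error rate receives the largest ensemble weight.

```python
import numpy as np

eps = np.array([0.10, 0.25, 0.40])   # hypothetical error rates of three ELMs
beta = eps / (1.0 - eps)             # weight-updating parameter beta_t
alpha = -np.log(beta)                # contribution alpha_t of each ELM_t
weights = alpha / alpha.sum()        # normalized ensemble weights (sum to 1)
print(weights.round(3))              # approx [0.594, 0.297, 0.110]
```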

5. Performance Evaluation of RAE-ELM

In this section, the performance of the proposed RAE-ELM learning algorithm is compared with other popular algorithms on 14 real-world regression problems covering different domains from the UCI Machine Learning Repository [35]; the specifications of the benchmark datasets are shown in Table 1. The ELM-based algorithms to be compared include basic ELM [4], original AdaBoost.RT based ELM (original Ada-ELM) [30], and the modified self-adaptive AdaBoost.RT ELM (modified Ada-ELM) [31], in addition to support vector regression (SVR) [36] and least-squares support vector regression (LS-SVR) [37]. All evaluations are conducted in a Matlab environment running on a Windows 7 machine with a 3.20 GHz CPU and 4 GB RAM.

In our experiments, all input attributes are normalized into the range $[-1, 1]$, while the outputs are normalized into $[0, 1]$. As the real-world benchmark datasets are embedded with noise and their distributions are unknown, and they range from small sizes and low dimensions to large sizes and high dimensions, for each trial of simulation the whole dataset of the application is randomly partitioned into a training dataset and a testing dataset with the numbers of samples shown in Table 1; 25% of the training samples are used as the validation dataset. Each partitioned training, validation, and testing dataset is kept fixed as input for all the algorithms.
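A minimal sketch of this preprocessing, assuming simple min-max scaling computed on the training split (the exact scaling used is stated here only through its target ranges, so the implementation below is an assumption):

```python
import numpy as np

def scale_inputs(X):
    """Scale each input attribute to [-1, 1] by min-max normalization."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - x_min) / (x_max - x_min) - 1.0

def scale_outputs(y):
    """Scale the outputs to [0, 1] by min-max normalization."""
    return (y - y.min()) / (y.max() - y.min())
```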

Figure 2: Block diagram of RAE-ELM. (A sequence of $m$ samples $(x_1, y_1), \ldots, (x_m, y_m)$ is passed through distributions $D_1(i), \ldots, D_T(i)$ to train ELM$_1, \ldots,$ ELM$_T$; each learner's error rate $\varepsilon_t$ is used to update the next distribution, and the individual outputs are boosted into the final hypothesis.)

For the RAE-ELM, basic ELM, original Ada-ELM, and modified Ada-ELM algorithms, the suitable numbers of hidden nodes are determined using the preserved validation dataset. The sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a}\cdot\mathbf{x} + b)))$ is selected as the activation function in all these algorithms. Fifty trials of simulation are conducted for each problem, with the training, validation, and testing samples randomly split for each trial. The performance of the algorithms is measured by the average root mean square error (RMSE) in testing. The significantly better results are highlighted in boldface.

5.1. Model Selection. In ensemble algorithms, the number of networks in the ensemble needs to be determined. According to Occam's razor, excessively complex models are affected by statistical noise, whereas simpler models may capture the underlying structure better and may thus have better predictive performance. Therefore, the parameter $T$, the number of weak learners, need not be very large.

We define $\varepsilon_t$ in (12) as $\varepsilon_t = 0.5 - \gamma_t$, where $\gamma_t > 0$ is a constant; it results in

$$E_{\mathrm{ensemble}} \le \prod_{t=1}^{T}\sqrt{1 - 4\gamma_t^{2}} = e^{-\sum_{t=1}^{T}\mathrm{KL}\left(1/2\,\|\,1/2 - \gamma_t\right)} \le e^{-2\sum_{t=1}^{T}\gamma_t^{2}}, \qquad (19)$$

where KL denotes the Kullback-Leibler divergence. We then simplify (19) by using $0.5 - \gamma$ instead of $0.5 - \gamma_t$, that is, by setting each $\gamma_t$ to the same value. We get

$$E_{\mathrm{ensemble}} \le \left(1 - 4\gamma^{2}\right)^{T/2} = e^{-T\,\mathrm{KL}\left(1/2\,\|\,1/2 - \gamma\right)} \le e^{-2T\gamma^{2}}. \qquad (20)$$


Table 1: Specification of the real-world regression benchmark datasets (number of observations for training and testing, and number of attributes).

Problems              Training data   Testing data   Attributes
LVST                  90              35             308
Yeast                 837             645            7
Computer hardware     110             99             7
Abalone               3150            1027           7
Servo                 95              72             4
Parkinson disease     3000            2875           21
Housing               350             156            13
Cloud                 80              28             9
Auto price            110             49             9
Breast cancer         100             94             32
Balloon               1500            833            4
Auto-MPG              300             92             7
Bank                  5000            3192           8
Census (house8L)      15000           7784           8

From (20) we can obtain an upper bound on the number of iterations of the algorithm. Consider

$$T \le \left[\frac{1}{2\gamma^{2}}\ln\frac{1}{E_{\mathrm{ensemble}}}\right]. \qquad (21)$$
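As a hypothetical reading of (21) (the numbers below are illustrative only, not taken from our experiments): if each weak learner reaches $\varepsilon_t = 0.5 - \gamma$, the number of weak learners needed to drive the ensemble error bound below a target level can be estimated directly.

```python
import math

def max_iterations(gamma, target_error):
    """Upper bound on T from eq. (21): (1 / (2 * gamma^2)) * ln(1 / E_ensemble)."""
    return math.ceil((1.0 / (2.0 * gamma ** 2)) * math.log(1.0 / target_error))

print(max_iterations(0.1, 0.05))   # 150 weak learners suffice for gamma = 0.1
print(max_iterations(0.2, 0.05))   # 38 weak learners suffice for gamma = 0.2
```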

For RAE-ELM, the number of ELM networks needs to be determined. It is set to 5, 10, 15, 20, 25, and 30 in our simulations, and the optimal value is selected as the one that yields the best average testing RMSE.

Besides, in our simulation trials the relative factor $\lambda$ in RAE-ELM is a parameter that needs to be optimized within the range $\lambda \in (0, 1)$. We start the simulations at $\lambda = 0.1$ and increase it in steps of 0.1. Table 2 shows examples of setting both $T$ and $\lambda$ for our simulation trials.

As illustrated in Table 2 and Figure 3, RAE-ELM with the sigmoid activation function achieves good generalization performance on the Parkinson disease dataset as long as the number of ELM networks $T$ is larger than 15. For a given number of ELM networks, the RMSE is not very sensitive to the variation of $\lambda$ and tends to be smallest when $\lambda$ is around 0.5. For a fair comparison, we set RAE-ELM with $T = 20$ and $\lambda = 0.5$ in the following experiments. For both original Ada-ELM and modified Ada-ELM, the ensemble model is unstable when the number of ELM networks is less than 7; the number of ELM networks is therefore also set to 20 for both original Ada-ELM and modified Ada-ELM.

We use the popular Gaussian kernel function $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma\|\mathbf{u} - \mathbf{v}\|^{2})$ in both SVR and LS-SVR. As is well known, the performances of SVR and LS-SVR are sensitive to the combination $(C, \gamma)$; hence the cost parameter $C$ and the kernel parameter $\gamma$ need to be tuned over a wide range so as to obtain good generalization performance. For each dataset, 50 different values of $C$ and 50 different values of $\gamma$, that is, 2500 pairs of $(C, \gamma)$, are tried, with $C, \gamma \in \{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$. In both SVR and LS-SVR, the best performing combination of $(C, \gamma)$ is selected for each dataset, as presented in Table 3.

Table 2: Performance (average testing RMSE) of the proposed RAE-ELM with different values of $T$ and $\lambda$ for the Parkinson disease dataset.

λ       T = 5     T = 10    T = 15    T = 20    T = 25    T = 30
0.1     0.2472    0.2374    0.2309    0.2259    0.2251    0.2249
0.2     0.2469    0.2315    0.2276    0.2241    0.2249    0.2245
0.3     0.2447    0.2298    0.2248    0.2235    0.2230    0.2233
0.4     0.2426    0.2287    0.2249    0.2227    0.2224    0.2227
0.5     0.2428    0.2296    0.2251    0.2219    0.2216    0.2215
0.6     0.2427    0.2298    0.2246    0.2221    0.2214    0.2213
0.7     0.2422    0.2285    0.2257    0.2229    0.2217    0.2219
0.8     0.2441    0.2301    0.2265    0.2228    0.2225    0.2228
0.9     0.2461    0.2359    0.2312    0.2243    0.2237    0.2230
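The $(C, \gamma)$ grid described above can be searched, for example, with a standard cross-validated grid search. The sketch below uses scikit-learn's SVR purely as a stand-in for the SVR/LS-SVR implementations actually used in our experiments, so the objects and settings shown are assumptions rather than our experimental code.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "C":     [2.0 ** k for k in range(-24, 26)],   # 50 values: 2^-24, ..., 2^25
    "gamma": [2.0 ** k for k in range(-24, 26)],   # 50 values: 2^-24, ..., 2^25
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: a normalized training split
# best_C, best_gamma = search.best_params_["C"], search.best_params_["gamma"]
```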

For basic ELM and the other ELM-based ensemble methods, the sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a}\cdot\mathbf{x} + b)))$ is selected as the activation function. The parameters $(C, L)$ are selected so as to achieve the best generalization performance, where the cost parameter $C$ is chosen from $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$ and the number of hidden nodes $L$ from $\{10, 20, \ldots, 1000\}$.

In addition, for the original AdaBoost.RT-based ensemble ELM, the threshold $\Phi$ has to be chosen before seeing the data, and it is manually selected according to an empirical suggestion, which makes it a sensitive factor affecting the regression performance. If $\Phi$ is too low, it is generally difficult to obtain a sufficient number of "accepted" samples; if $\Phi$ is too high, some wrong samples are treated as "accepted" ones and the ensemble model tends to be unstable. According to Shrestha and Solomatine's experiments, the threshold $\Phi$ should be defined between 0 and 0.4 in order to keep the ensemble model stable [30]. In our simulations, we incrementally set thresholds within the range from 0 to 0.4; the original Ada-ELM with threshold values $\Phi \in \{0.1, 0.15, \ldots, 0.35, 0.4\}$ generates satisfactory results for all the regression problems, and the best performing original Ada-ELM is shown in boldface. Moreover, the modified Ada-ELM algorithm needs an initial value $\Phi_0$ from which the subsequent thresholds are calculated in the iterations; Tian and Mao suggested setting the default initial value $\Phi_0 = 0.2$ [31]. Considering that a manually fixed initial threshold is not related to the characteristics of the ELM prediction effect on the input dataset, the algorithm may not reach the best generalization performance. In our simulations, we therefore compare the performances of modified Ada-ELM at different initial threshold values $\Phi_0 \in \{0.1, 0.15, \ldots, 0.3, 0.35\}$; the best performing modified Ada-ELM is also presented in Table 3.

Figure 3: The average testing RMSE with different values of $T$ and $\lambda$ in RAE-ELM for the Parkinson disease dataset ($T$ and $\lambda$ on the horizontal axes, RMSE on the vertical axis).

Table 3: Parameters of RAE-ELM and the other learning algorithms.

Datasets             RAE-ELM (C, L)   Basic ELM [4] (C, L)   Original AdaELM [30] (C, L, Φ)   Modified AdaELM [31] (C, L, Φ0)   SVR [36] (C, γ)    LS-SVR [37] (C, γ)
LVST                 (2^13, 960)      (2^5, 710)             (2^-7, 530, 0.3)                 (2^-7, 460, 0.35)                 (2^17, 2^2)        (2^8, 2^0)
Yeast                (2^10, 540)      (2^1, 340)             (2^5, 710, 0.25)                 (2^-1, 800, 0.2)                  (2^-7, 2^0)        (2^-9, 2^3)
Computer hardware    (2^-4, 160)      (2^11, 530)            (2^16, 240, 0.35)                (2^10, 170, 0.25)                 (2^-9, 2^21)       (2^0, 2^6)
Abalone              (2^0, 150)       (2^0, 200)             (2^-22, 330, 0.2)                (2^6, 610, 0.15)                  (2^3, 2^18)        (2^1, 2^14)
Servo                (2^1, 340)       (2^-6, 650)            (2^0, 110, 0.3)                  (2^9, 230, 0.2)                   (2^-2, 2^2)        (2^4, 2^-3)
Parkinson disease    (2^9, 830)       (2^-8, 460)            (2^4, 370, 0.35)                 (2^-4, 590, 0.25)                 (2^1, 2^0)         (2^1, 2^7)
Housing              (2^-19, 270)     (2^-9, 310)            (2^-1, 440, 0.25)                (2^17, 220, 0.2)                  (2^21, 2^-4)       (2^3, 2^12)
Cloud                (2^-5, 590)      (2^13, 880)            (2^0, 1000, 0.2)                 (2^2, 760, 0.25)                  (2^7, 2^5)         (2^4, 2^-11)
Auto price           (2^20, 310)      (2^17, 430)            (2^-2, 230, 0.35)                (2^-5, 270, 0.35)                 (2^3, 2^23)        (2^7, 2^-10)
Breast cancer        (2^-3, 80)       (2^-1, 160)            (2^-6, 90, 0.15)                 (2^13, 290, 0.25)                 (2^6, 2^1)         (2^6, 2^17)
Balloon              (2^22, 160)      (2^16, 190)            (2^13, 250, 0.2)                 (2^17, 210, 0.2)                  (2^1, 2^8)         (2^0, 2^-7)
Auto-MPG             (2^7, 630)       (2^-10, 820)           (2^21, 380, 0.35)                (2^11, 680, 0.3)                  (2^-2, 2^-1)       (2^1, 2^5)
Bank                 (2^-1, 660)      (2^0, 590)             (2^2, 710, 0.3)                  (2^-9, 450, 0.15)                 (2^10, 2^-4)       (2^21, 2^-5)
Census (house8L)     (2^3, 580)       (2^5, 460)             (2^1, 190, 0.3)                  (2^3, 600, 0.25)                  (2^5, 2^-6)        (2^0, 2^-4)

5.2. Performance Comparisons between RAE-ELM and Other Learning Algorithms. In this subsection, the performance of the proposed RAE-ELM is compared with other learning algorithms, including basic ELM [4], original Ada-ELM [30], modified Ada-ELM [31], support vector regression (SVR) [36], and least-squares support vector regression (LS-SVR) [37]. The result comparisons of RAE-ELM and the other learning algorithms on real-world data regression problems are shown in Table 4.

Table 4 lists the averaged results over multiple trials of the four ELM-based algorithms (RAE-ELM, basic ELM, original Ada-ELM [30], and modified Ada-ELM [31]), SVR [36], and LS-SVR [37] on fourteen representative real-world data regression problems. The selected datasets include both large-scale and small-scale data, as well as high-dimensional and low-dimensional problems. It is easy to see that the average testing RMSE obtained by RAE-ELM is always the best among the six algorithms in all fourteen cases. For original Ada-ELM, the performance is sensitive to the selected threshold value $\Phi$: the best performing original Ada-ELM models for different regression problems have correspondingly different threshold values, so the manual selection strategy is not satisfactory. The generalization performance of modified Ada-ELM is in general better than that of original Ada-ELM; however, the empirically suggested initial threshold value of 0.2 does not guarantee the best performing regression model.

The three AdaBoost.RT based ensemble ELMs (RAE-ELM, original Ada-ELM, and modified Ada-ELM) all perform better than basic ELM, which verifies that an ensemble ELM using AdaBoost.RT can achieve better prediction accuracy than an individual ELM used as the predictor. The averaged generalization performance of basic ELM is better than that of SVR, while it is slightly worse than that of LS-SVR.

To find the best performing original Ada-ELM or modified Ada-ELM model for a regression problem, since the threshold selection is not related to the input dataset or the ELM networks, the optimal parameter has to be searched by brute force: one must carry out a set of experiments with different (initial) threshold values and then search among them for the best ensemble ELM model, which is time consuming. Moreover, the generalization performance of the ensemble ELMs optimized in this way, using original Ada-ELM or modified Ada-ELM, can hardly be better than that of the proposed RAE-ELM. In fact, the proposed RAE-ELM is always the best performing learner among the six candidates on all fourteen real-world regression problems.


Table 4: Comparison of testing RMSE of RAE-ELM and other learning algorithms on real-world data regression problems.

Datasets             RAE-ELM   Basic ELM [4]   Original AdaELM [30]   Modified AdaELM [31]   SVR [36]   LS-SVR [37]
LVST                 0.2571    0.2854          0.2702                 0.2653                 0.2849     0.2801
Yeast                0.1071    0.1224          0.1195                 0.1156                 0.1238     0.1182
Computer hardware    0.0563    0.0839          0.0821                 0.0726                 0.0976     0.0771
Abalone              0.0731    0.0882          0.0793                 0.0778                 0.0890     0.0815
Servo                0.0727    0.0924          0.0884                 0.0839                 0.0966     0.0870
Parkinson disease    0.2219    0.2549          0.2513                 0.2383                 0.2547     0.2437
Housing              0.1018    0.1259          0.1163                 0.1135                 0.127      0.1185
Cloud                0.2897    0.3269          0.3118                 0.3006                 0.3316     0.3177
Auto price           0.0809    0.1036          0.0910                 0.0894                 0.1045     0.0973
Breast cancer        0.2463    0.2641          0.2601                 0.2519                 0.2657     0.2597
Balloon              0.0570    0.0639          0.0611                 0.0592                 0.0672     0.0618
Auto-MPG             0.0724    0.0862          0.0799                 0.0781                 0.0891     0.0857
Bank                 0.0415    0.0551          0.0496                 0.0453                 0.0537     0.0498
Census (house8L)     0.0746    0.0820          0.0802                 0.0795                 0.0867     0.0813

6. Conclusion

In this paper, a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems is proposed, which combines ELM with the novel robust AdaBoost.RT algorithm. Combining the effective learner, ELM, with the novel ensemble method, the robust AdaBoost.RT algorithm, yields a hybrid method that inherits their intrinsic properties and achieves better prediction accuracy than using only an individual ELM as the predictor. ELM tends to reach the solutions straightforwardly, and its regression error rate is in general much smaller than 0.5; therefore, selecting ELM as the "weak" learner helps to avoid overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the ensemble ELM.

The proposed robust AdaBoost.RT algorithm overcomes the limitation of the available AdaBoost.RT algorithm and its variants, in which the threshold value is manually specified and may only be ideal for a very limited set of cases. The new robust AdaBoost.RT algorithm uses the statistical distribution of the approximation errors to dynamically determine a robust threshold: the robust threshold for each weak learner $\mathrm{WL}_t$ is self-adjustable and is defined as the scaled standard deviation of the approximation errors, $\lambda\sigma_t$. We analyze the convergence of the proposed robust AdaBoost.RT algorithm, and it has been proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound.

The proposed RAE-ELM is robust with respect to the differences among regression problems and to variations of the approximation error rates, which do not significantly affect its highly stable generalization performance. The threshold value, one of the key parameters of the ensemble algorithm, requires no human intervention; instead, it is self-adjusted according to the actual regression effect of the ELM networks on the input dataset. Such a mechanism enables RAE-ELM to adapt sensitively to the intrinsic properties of the given regression problem. The experimental comparisons in terms of stability and accuracy among the six prevailing algorithms (RAE-ELM, basic ELM, original Ada-ELM, modified Ada-ELM, SVR, and LS-SVR) for regression problems verify that all the AdaBoost.RT based ensemble ELMs perform better than SVR and, more remarkably, that the proposed RAE-ELM always achieves the best performance. The boosting effect of the proposed method is not significant for small-sized and low-dimensional problems, as an individual ELM network is already sufficient to handle such problems well. It is worth pointing out that the proposed RAE-ELM outperforms the others especially on high-dimensional or large-sized datasets, which is a convincing indicator of good generalization performance.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Professor Guang-Bin HuangfromNanyangTechnological University for providing inspir-ing comments and suggestions on our research This work isfinancially supported by the University of Macau with Grantno MYRG079(Y1-L2)-FST13-YZX

References

[1] C L Philip Chen J Wang C H Wang and L Chen ldquoAnew learning algorithm for a Fully Connected Fuzzy InferenceSystem (F-CONFIS)rdquo IEEE Transactions on Neural Networksand Learning Systems vol 25 no 10 pp 1741ndash1757 2014

[2] Z Yang P K Wong C M Vong J Zhong and J Y LiangldquoSimultaneous-fault diagnosis of gas turbine generator systemsusing a pairwise-coupled probabilistic classifierrdquoMathematicalProblems in Engineering vol 2013 Article ID 827128 13 pages2013

12 Mathematical Problems in Engineering

[3] G-B Huang Q-Y Zhu and C-K Siew ldquoExtreme learningmachine theory and applicationsrdquoNeurocomputing vol 70 no1ndash3 pp 489ndash501 2006

[4] G-B Huang H Zhou X Ding and R Zhang ldquoExtremelearning machine for regression and multiclass classificationrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 42 no 2 pp 513ndash529 2012

[5] S C Ng C C Cheung and S H Leung ldquoMagnified gradientfunction with deterministic weight modification in adaptivelearningrdquo IEEE Transactions on Neural Networks vol 15 no 6pp 1411ndash1423 2004

[6] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[7] H J Rong G B Huang N Sundararajan and P SaratchandranldquoOnline sequential fuzzy extreme learningmachine for functionapproximation and classification problemsrdquo IEEE Transactionson Systems Man and Cybernetics Part B Cybernetics vol 39no 4 pp 1067ndash1072 2009

[8] J Cao Z Lin and G B Huang ldquoVoting base online sequentialextreme learning machine for multi-class classificationrdquo inProceedings of the IEEE International Symposium onCircuits andSystems (ISCAS 13) pp 2327ndash2330 IEEE 2013

[9] J Cao Z Lin G B Huang and N Liu ldquoVoting based extremelearning machinerdquo Information Sciences vol 185 no 1 pp 66ndash77 2012

[10] N Y Liang G B Huang P Saratchandran and N Sundarara-jan ldquoA fast and accurate online sequential learning algorithmfor feedforward networksrdquo IEEE Transactions on Neural Net-works vol 17 no 6 pp 1411ndash1423 2006

[11] J Luo C M Vong and P K Wong ldquoSparse Bayesian extremelearningmachine formulti-classificationrdquo IEEETransactions onNeural Networks and Learning Systems vol 25 no 4 pp 836ndash843 2014

[12] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing vol 2 MIT Press Cambridge UK 2006

[13] Y Lan Y C Soh and G B Huang ldquoEnsemble of onlinesequential extreme learning machinerdquoNeurocomputing vol 72no 13 pp 3391ndash3395 2009

[14] N Liu and H Wang ldquoEnsemble based extreme learningmachinerdquo IEEE Signal Processing Letters vol 17 no 8 pp 754ndash757 2010

[15] X Xue M Yao Z Wu and J Yang ldquoGenetic ensemble ofextreme learningmachinerdquoNeurocomputing vol 129 no 10 pp175ndash184 2014

[16] X L Wang Y Y Chen H Zhao and B L Lu ldquoParallelizedextreme learning machine ensemble based on minndashmax mod-ular networkrdquo Neurocomputing vol 128 pp 31ndash41 2014

[17] B V Dasarathy and B V Sheela ldquoA composite classifier systemdesign concepts andmethodologyrdquoProceedings of the IEEE vol67 no 5 pp 708ndash713 1979

[18] L KHansen andP Salamon ldquoNeural network ensemblesrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol12 no 10 pp 993ndash1001 1990

[19] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[20] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990

[21] R E Schapire and Y Freund Boosting Foundations and Algo-rithmsRobert E Schapire Yoav Freund MIT Press CambridgeMass USA 2012

[22] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001

[23] R A Jacobs M I Jordan S J Nowlan and G E HintonldquoAdaptive mixtures of local expertsrdquo Neural Computation vol3 no 1 pp 79ndash87 1991

[24] V Guruswami and A Sahai ldquoMulticlass learning boostingand error-correcting codesrdquo in Proceedings of the 12th AnnualConference on Computational Learning Theory (COLT rsquo99) pp145ndash155 July 1999

[25] R E Schapire and Y Singer ldquoImproved boosting algorithmsusing confidence-rated predictionsrdquo Machine Learning vol 37no 3 pp 297ndash336 1999

[26] N Li ldquoMulticlass boosting with repartitioningrdquo in Proceedingsof the 23rd International Conference on Machine Learning pp569ndash576 ACM June 2006

[27] H Drucker ldquoImproving regressors using boostingrdquo in Proceed-ings of the 14th International Conference on Machine Learningpp 107ndash115 1997

[28] R Avnimelech and N Intrator ldquoBoosting regression estima-torsrdquo Neural Computation vol 11 no 2 pp 499ndash520 1999

[29] D P Solomatine and D L Shrestha ldquoAdaBoostRT a boostingalgorithm for regression problemsrdquo in Proceedings of the IEEEInternational Joint Conference on Neural Networks vol 2 pp1163ndash1168 IEEE 2004

[30] D L Shrestha and D P Solomatine ldquoExperiments withAdaBoostRT an improved boosting scheme for regressionrdquoNeural Computation vol 18 no 7 pp 1678ndash1710 2006

[31] H-X Tian and Z-Z Mao ldquoAn ensemble ELM based on mod-ified AdaBoostRT algorithm for predicting the temperature ofmolten steel in ladle furnacerdquo IEEE Transactions on AutomationScience and Engineering vol 7 no 1 pp 73ndash80 2010

[32] J W Cao T Chen and J Fan ldquoFast online learning algorithmfor landmark recognition based on BoW frameworkrdquo in Pro-ceedings of the 9th IEEE Conference on Industrial Electronics andApplications Hangzhou China June 2014

[33] J Cao Z Lin and G-B Huang ldquoComposite function waveletneural networks with extreme learning machinerdquo Neurocom-puting vol 73 no 7ndash9 pp 1405ndash1416 2010

[34] R Feely Predicting stock market volatility using neural networksBA (Mod) [PhD thesis] Trinity College Dublin DublinIreland 2000

[35] K Bache and M Lichman UCI Machine Learning RepositoryUniversity of California School of Information and ComputerScience Irvine Calif USA 2014 httparchiveicsucieduml

[36] H Drucker C J Burges L Kaufman A Smola and VVapnik ldquoSupport vector regression machinesrdquo Advances inNeural Information Processing Systems vol 9 pp 155ndash161 1997

[37] J A Suykens and J Vandewalle ldquoLeast squares support vectormachine classifiersrdquo Neural Processing Letters vol 9 no 3 pp293ndash300 1999

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 3: Research Article A Robust AdaBoost.RT Based Ensemble ...downloads.hindawi.com/journals/mpe/2015/260970.pdf · AdaBoost.RT algorithm (modi ed Ada-ELM) in order to predict the temperature

Mathematical Problems in Engineering 3

This paper presents a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems, which combines ELM with the robust AdaBoost.RT algorithm. The robust AdaBoost.RT algorithm not only overcomes the limitation of the original AdaBoost.RT algorithm (original Ada-ELM) but also makes the threshold value of $\Phi$ adaptive to the input dataset and the ELM networks instead of being preset. The main idea of RAE-ELM is as follows. The ELM algorithm is selected as the "weak" learning machine to build the hybrid ensemble model. A new robust AdaBoost.RT algorithm is proposed that uses error statistics to dynamically determine the regression threshold value, rather than manual selection, which may only be ideal for very few regression cases. The mean and the standard deviation of the approximation errors are computed at each iteration, and the robust threshold for each weak learner is defined as a scaled standard deviation. Based on this concept, the individual training data whose errors exceed the robust threshold are regarded as "flaws in this training process" and are rejected; the rejected data are processed in the later part of the weak learners' iterations.

We then analyze the convergence of the proposed robust AdaBoost.RT algorithm. It can be proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound. The proposed robust AdaBoost.RT based ensemble extreme learning machine can avoid overfitting because of the characteristics of ELM: ELM tends to reach the solutions straightforwardly, and the error rate of the regression outcome at each training round is much smaller than 0.5. Therefore, selecting ELM as the "weak" learner helps the ensemble avoid overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the robust AdaBoost.RT based ensemble. The experimental results demonstrate that the proposed robust AdaBoost.RT ensemble ELM (RAE-ELM) has superior learning properties in terms of stability and accuracy for regression problems and has better generalization performance than the other compared algorithms.

This paper is organized as follows. Section 2 gives a brief review of the basic ELM. Section 3 introduces the original and the proposed robust AdaBoost.RT algorithms. The hybrid robust AdaBoost.RT ensemble ELM (RAE-ELM) algorithm is then presented in Section 4. The performance evaluation of RAE-ELM and its regression ability are verified using experiments in Section 5. Finally, the conclusion is drawn in the last section.

2. Brief on ELM

Recently, Huang et al. [3, 4] proposed novel neural networks called extreme learning machines (ELMs) for single-hidden layer feedforward neural networks (SLFNs) [32, 33]. ELM is based on the least-squares method: it randomly assigns the input weights and the hidden layer biases, and then the output weights between the hidden nodes and the output layer can be analytically determined. Since the learning process in ELM takes place without iterative tuning, the ELM algorithm tends to reach the solutions straightforwardly without suffering from problems such as local minima, slow learning speed, and overfitting.

From the standard optimization theory point of view, the objective of ELM in minimizing both the training errors and the output weights can be presented as [4]

$$\text{Minimize: } L_{P_{\mathrm{ELM}}} = \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + C\,\frac{1}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_{i}\|^{2}$$
$$\text{Subject to: } \mathbf{h}(\mathbf{x}_{i})\boldsymbol{\beta} = \mathbf{t}_{i}^{T} - \boldsymbol{\xi}_{i}^{T}, \quad i = 1, \ldots, N, \tag{1}$$

where $\boldsymbol{\xi}_{i} = [\xi_{i,1}, \ldots, \xi_{i,m}]^{T}$ is the training error vector of the $m$ output nodes with respect to the training sample $\mathbf{x}_{i}$.

According to the KKT theorem, training ELM is equivalent to solving the dual optimization problem

$$L_{D_{\mathrm{ELM}}} = \frac{1}{2}\|\boldsymbol{\beta}\|^{2} + C\,\frac{1}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_{i}\|^{2} - \sum_{i=1}^{N}\sum_{j=1}^{m}\alpha_{i,j}\left(\mathbf{h}(\mathbf{x}_{i})\boldsymbol{\beta}_{j} - t_{i,j} + \xi_{i,j}\right), \tag{2}$$

where $\boldsymbol{\beta}_{j}$ is the vector of the weights between the hidden layer and the $j$th output node, $\boldsymbol{\beta} = [\boldsymbol{\beta}_{1}, \ldots, \boldsymbol{\beta}_{m}]$, and $C$ is the regularization parameter representing the trade-off between the minimization of the training errors and the maximization of the marginal distance.

According to the KKT theorem, we can obtain different solutions as follows.

(1) Kernel Case. Consider

$$\boldsymbol{\beta} = \mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \tag{3}$$

The ELM output function is

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T}. \tag{4}$$

A kernel matrix for ELM is defined as follows:

$$\boldsymbol{\Omega}_{\mathrm{ELM}} = \mathbf{H}\mathbf{H}^{T}, \qquad \Omega_{\mathrm{ELM}\,i,j} = \mathbf{h}(\mathbf{x}_{i}) \cdot \mathbf{h}(\mathbf{x}_{j}) = K(\mathbf{x}_{i}, \mathbf{x}_{j}). \tag{5}$$

Then the ELM output function can be written as

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{T} = \begin{bmatrix} K(\mathbf{x}, \mathbf{x}_{1}) \\ \vdots \\ K(\mathbf{x}, \mathbf{x}_{N}) \end{bmatrix}^{T}\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{\mathrm{ELM}}\right)^{-1}\mathbf{T}. \tag{6}$$

In this special case, a corresponding kernel $K(\mathbf{x}_{i}, \mathbf{x}_{j})$ is used in ELM instead of the feature mapping $\mathbf{h}(\mathbf{x})$, which would otherwise need to be known. We call $K(\mathbf{x}_{i}, \mathbf{x}_{j}) = \mathbf{h}(\mathbf{x}_{i}) \cdot \mathbf{h}(\mathbf{x}_{j})$ the ELM random kernel, where the feature mapping $\mathbf{h}(\mathbf{x})$ is randomly generated.

(2) Nonkernel Case. Similarly, based on the KKT theorem, we have

$$\boldsymbol{\beta} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}. \tag{7}$$

In this case, the ELM output function is

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} = \mathbf{h}(\mathbf{x})\left(\frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{T}. \tag{8}$$
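To make the non-kernel solution concrete, the following is a minimal NumPy sketch of a basic ELM regressor following (7)-(8): a random sigmoid hidden layer followed by regularized least-squares output weights. The function and parameter names are illustrative assumptions, not identifiers from the paper.

import numpy as np

def elm_train(X, T, L=100, C=1.0, seed=None):
    # Random hidden-layer parameters (input weights A and biases b)
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))
    b = rng.uniform(-1.0, 1.0, size=L)
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))            # hidden-layer output matrix H
    # Output weights beta = (I/C + H'H)^-1 H'T, the non-kernel solution (7)
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                                    # f(x) = h(x) beta, as in (8)

Because the hidden layer is generated once and the output weights are obtained in closed form, training reduces to a single linear solve, which is the source of ELM's speed.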

3. The Proposed Robust AdaBoost.RT Algorithm

We first describe the original AdaBoost.RT algorithm for regression problems and then present the new robust AdaBoost.RT algorithm in this section. The corresponding analysis of the novel algorithm will also be given.

3.1. The Original AdaBoost.RT Algorithm. Solomatine and Shrestha proposed AdaBoost.RT [29, 30], a boosting algorithm for regression problems, where the letters R and T represent regression and threshold, respectively. The original AdaBoost.RT algorithm is described as follows.

Input the following:
(i) sequence of $m$ examples $(x_{1}, y_{1}), \ldots, (x_{m}, y_{m})$, where the output $y \in R$;
(ii) weak learning algorithm (weak learner);
(iii) integer $T$ specifying the number of iterations (machines);
(iv) threshold $\Phi$ ($0 < \Phi < 1$) for demarcating correct and incorrect predictions.

Initialize the following:
(i) machine number or iteration $t = 1$;
(ii) distribution $D_{t}(i) = 1/m$ for all $i$;
(iii) error rate $\varepsilon_{t} = 0$.

Learning steps (iterate while $t \leqslant T$) are as follows.

Step 1. Call the weak learner, providing it with distribution $D_{t}$.

Step 2. Build the regression model $f_{t}(x) \rightarrow y$.

Step 3. Calculate the absolute relative error (ARE) for each training example as
$$\mathrm{ARE}_{t}(i) = \left|\frac{f_{t}(x_{i}) - y_{i}}{y_{i}}\right|. \tag{9}$$

Step 4. Calculate the error rate of $f_{t}(x)$: $\varepsilon_{t} = \sum_{i:\,\mathrm{ARE}_{t}(i) > \Phi} D_{t}(i)$.

Step 5. Set $\beta_{t} = \varepsilon_{t}^{n}$, where $n$ is the power coefficient (e.g., linear, square, or cubic).

Step 6. Update the distribution $D_{t}$ as follows:
if $\mathrm{ARE}_{t}(i) \leq \Phi$, then $D_{t+1}(i) = (D_{t}(i)/Z_{t}) \times \beta_{t}$;
else $D_{t+1}(i) = D_{t}(i)/Z_{t}$,
where $Z_{t}$ is a normalization factor chosen such that $D_{t+1}$ will be a distribution.

Step 7. Set $t = t + 1$.

Output the following: the final hypothesis
$$f_{\mathrm{fin}}(x) = \frac{\sum_{t}\left(\log(1/\beta_{t})\right) f_{t}(x)}{\sum_{t}\log(1/\beta_{t})}. \tag{10}$$

The AdaBoost.RT algorithm projects regression problems into the binary classification domain. Based on the boosting regression estimators [28] and BEM [34], the AdaBoost.RT algorithm introduces the absolute relative error (ARE) to demarcate samples as either correct or incorrect predictions. If the ARE of a particular sample is greater than the threshold $\Phi$, the predicted value for this sample is regarded as an incorrect prediction; otherwise, it is regarded as a correct prediction. This labeling is similar to the "misclassification" and "correct-classification" labeling used in classification problems. The algorithm assigns relatively large weights to those weak learners at the front of the learner list that reach a high correct prediction rate, while the samples with incorrect predictions are handled as ad hoc cases by the following weak learners. The outputs from all weak learners are combined into the final hypothesis using the corresponding computed weights.

The AdaBoost.RT algorithm requires manual selection of the threshold $\Phi$, which is a main factor sensitively affecting the performance of the committee machines. If $\Phi$ is too small, very few samples are treated as correct predictions, which easily get boosted; the subsequent learners then have to handle a large number of ad hoc samples, making the ensemble algorithm unstable. On the other hand, if $\Phi$ is too large, say greater than 0.4, most samples are treated as correct predictions and the false samples fail to be rejected, causing low convergence efficiency and overfitting. The initial AdaBoost.RT and its variant suffer from this limitation in setting the threshold value, which is specified either as a manually chosen constant or as a variable changing in the vicinity of 0.2; both strategies are unrelated to the regression capability of the weak learner. In order to determine the $\Phi$ value effectively, a novel improvement of AdaBoost.RT is proposed in the following section.
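As a concrete reading of Steps 3-6, the sketch below implements one distribution update of the original AdaBoost.RT with a fixed, manually chosen threshold $\Phi$; it assumes NumPy arrays and nonzero targets (the ARE in (9) divides by $y_i$), and all names and defaults are illustrative.

import numpy as np

def adaboost_rt_update(y_true, y_pred, D, phi=0.2, n=2):
    # Step 3: absolute relative error of each training example, (9)
    are = np.abs((y_pred - y_true) / y_true)
    # Step 4: error rate = total weight of "incorrect" predictions (ARE > phi)
    eps = D[are > phi].sum()
    # Step 5: beta_t = eps^n with power coefficient n (linear, square, or cubic)
    beta = eps ** n
    # Step 6: down-weight correctly predicted samples, then renormalize (Z_t)
    D_new = np.where(are <= phi, D * beta, D)
    return D_new / D_new.sum(), beta, eps

The learners trained on these successive distributions are finally combined with weights proportional to $\log(1/\beta_{t})$, as in (10).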

3.2. The Proposed Robust AdaBoost.RT Algorithm. To overcome the limitations of the current works on AdaBoost.RT, we embed statistics theory into the AdaBoost.RT algorithm. This removes the difficulty of optimally determining the initial threshold value and enables the intermediate threshold values to be dynamically self-adjusted according to the intrinsic properties of the input data samples. The proposed robust AdaBoost.RT algorithm is described as follows.

(1) Input the following:
sequence of $m$ samples $(x_{1}, y_{1}), \ldots, (x_{m}, y_{m})$, where the output $y \in R$;
weak learning algorithm (weak learner);
maximum number of iterations (machines) $T$.

(2) Initialize the following:
iteration index $t = 1$;
distribution $D_{t}(i) = 1/m$ for all $i$;
the weight vector $w_{i}^{(t)} = D_{t}(i)$ for all $i$;
error rate $\varepsilon_{t} = 0$.

(3) Iterate while $t \leqslant T$ the following.

(1) Call weak learner $\mathrm{WL}_{t}$, providing it with the distribution
$$p^{(t)} = \frac{w^{(t)}}{Z_{t}}, \tag{11}$$
where $Z_{t}$ is a normalization factor chosen such that $p^{(t)}$ will be a distribution.

(2) Build the regression model $f_{t}(x) \rightarrow y$.

(3) Calculate each error: $e_{t}(i) = f_{t}(x_{i}) - y_{i}$.

(4) Calculate the error rate $\varepsilon_{t} = \sum_{i \in P} p_{i}^{(t)}$, where $P = \{i : |e_{t}(i) - \bar{e}_{t}| > \lambda\sigma_{t},\ i \in [1, m]\}$, $\bar{e}_{t}$ stands for the expected value, and $\lambda\sigma_{t}$ is defined as the robust threshold ($\sigma_{t}$ stands for the standard deviation; the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). If $\varepsilon_{t} > 1/2$, then set $T = t - 1$ and abort the loop.

(5) Set $\beta_{t} = \varepsilon_{t}/(1 - \varepsilon_{t})$.

(6) Calculate the contribution of $f_{t}(x)$ to the final result: $\alpha_{t} = -\log(\beta_{t})$.

(7) Update the weight vectors:
if $i \notin P$, then $w_{i}^{(t+1)} = w_{i}^{(t)}\beta_{t}$;
else $w_{i}^{(t+1)} = w_{i}^{(t)}$.

(8) Set $t = t + 1$.

(4) Normalize $\alpha_{1}, \ldots, \alpha_{T}$ such that $\sum_{t=1}^{T}\alpha_{t} = 1$. Output the final hypothesis: $f_{\mathrm{fin}}(x) = \sum_{t=1}^{T}\alpha_{t}f_{t}(x)$.
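The sketch below mirrors steps (3)(1)-(3)(7) of the listing for a single iteration: the threshold is not a preset constant but the scaled standard deviation $\lambda\sigma_{t}$ of the current errors. It assumes NumPy arrays; the small guard against $\beta_{t} = 0$ is an implementation detail added here, not part of the paper.

import numpy as np

def robust_adaboost_rt_update(y_true, y_pred, w, lam=0.5):
    p = w / w.sum()                                  # distribution p^(t), (11)
    e = y_pred - y_true                              # errors e_t(i)
    # Reject samples whose error deviates from the mean by more than lambda*sigma_t
    rejected = np.abs(e - e.mean()) > lam * e.std()
    eps = p[rejected].sum()                          # error rate epsilon_t
    if eps > 0.5:                                    # step (4): abort the boosting loop
        return None
    beta = max(eps, 1e-12) / (1.0 - eps)             # beta_t = eps / (1 - eps)
    alpha = -np.log(beta)                            # contribution alpha_t of f_t
    w_new = np.where(rejected, w, w * beta)          # only "accepted" samples shrink
    return w_new, alpha, eps

After the last iteration, the collected $\alpha_{t}$ are normalized to sum to one and the final hypothesis is the $\alpha$-weighted sum of the $f_{t}$, as in step (4) of the listing.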

At each iteration of the proposed robust AdaBoost.RT algorithm, the standard deviation of the approximation error distribution is used as a criterion. In probability and statistics, the standard deviation measures the amount of variation or dispersion from the average. If the data points tend to be very close to the mean value, the standard deviation is low; on the other hand, if the data points are spread out over a large range of values, a high standard deviation results.

The standard deviation may serve as a measure of uncertainty for a set of repeated predictions. When deciding whether predictions agree with their corresponding true values, the standard deviation of the predictions made by the underlying approximation function is of crucial importance: if the average distance from the predictions to the true values is large, then the regression model being tested probably needs to be revised, because sample points fall outside the range of values that could reasonably be expected to occur, and the prediction accuracy of the model is low.

In the proposed robust AdaBoost.RT algorithm, the approximation error of the $t$th weak learner $\mathrm{WL}_{t}$ for an input dataset can be represented as a statistical distribution with parameters $\mu_{t} \pm \lambda\sigma_{t}$, where $\mu_{t}$ stands for the expected value, $\sigma_{t}$ stands for the standard deviation, and $\lambda$ is an adjustable relative factor that ranges from 0 to 1. The threshold value for $\mathrm{WL}_{t}$ is defined by the scaled standard deviation $\lambda\sigma_{t}$.

In the hybrid learning algorithm, the trained weak learners are assumed to be able to generate small prediction errors ($\varepsilon_{t} < 1/2$). For all $\mu_{t}$, $\mu_{t} \in (0 - \eta, 0 + \eta)$, $\lim \eta \rightarrow 0$, and $t = 1, \ldots, T$, where $\eta$ denotes a small error limit approaching zero. The population mean of a regression error distribution is closer to the targeted zero than the elements in the population; therefore, the means of the obtained regression errors fluctuate around zero within a small range. The standard deviation $\sigma_{t}$ is solely determined by the individual samples and the generalization performance of the weak learner $\mathrm{WL}_{t}$. Generally, $\sigma_{t}$ is relatively large, such that most of the outputs are located within the range $[-\sigma_{t}, +\sigma_{t}]$, which tends to make the boosting process unstable. To maintain a stable adjustment of the threshold value, a relative factor $\lambda$ is applied to the standard deviation $\sigma_{t}$, which results in the robust threshold $\lambda\sigma_{t}$. Samples that fall within the threshold range $[-\lambda\sigma_{t}, +\lambda\sigma_{t}]$ are treated as "accepted" samples; the other samples are treated as "rejected" samples. With the introduction of the robust threshold, the algorithm becomes stable and resistant to noise in the data. According to the error rate $\varepsilon_{t}$ of each weak learner's regression model, each weak learner $\mathrm{WL}_{t}$ is assigned one accordingly computed weight $\alpha_{t}$.

For one regression problem, the performances of different weak learners may differ. The regression error distributions of different weak learners under the robust threshold are shown in Figure 1.

In Figure 1(a), the weak learner $\mathrm{WL}_{t}$ generates a regression error distribution with a large error rate, where the standard deviation is relatively large. On the other hand, another weak learner $\mathrm{WL}_{t+j}$ may generate an error distribution with a small standard deviation, as shown in Figure 1(b). Suppose the red triangular points represent "rejected" samples, whose regression errors are greater than the specified robust threshold: $\varepsilon_{t} = \sum_{i \in P} p_{i}^{(t)}$, where $P = \{i : |e_{t}(i) - \bar{e}_{t}| > \lambda\sigma_{t},\ i \in [1, m]\}$, as described in Step (4) of the proposed algorithm. The green circular points represent the "accepted" samples, whose regression errors are less than the robust threshold. The weight vector $w_{i}^{(t)}$ of every "accepted" sample is dynamically changed for each weak learner $\mathrm{WL}_{t}$, while that of the "rejected" samples remains unchanged.

[Figure 1: Different regression error distributions for different weak learners under the robust threshold. (a) Error distribution of weak learner $\mathrm{WL}_{t}$ that results in a large threshold. (b) Error distribution of weak learner $\mathrm{WL}_{t+j}$ that results in a small threshold.]

As illustrated in Figure 1, in terms of stability and accuracy the regression capability of weak learner $\mathrm{WL}_{t+j}$ is superior to that of $\mathrm{WL}_{t}$. The robust threshold values for $\mathrm{WL}_{t+j}$ and $\mathrm{WL}_{t}$ are computed respectively, and the former is smaller than the latter, discriminating their correspondingly different regression performances. The proposed method overcomes the limitation of the existing methods, where the threshold value is set empirically. The critical factor used in the boosting process, the threshold, becomes robust and self-adaptive to the individual weak learners' performance on the input data samples. Therefore, the proposed robust AdaBoost.RT algorithm is capable of outputting the final hypothesis as an optimally weighted ensemble of the weak learners.

In the following, we show that the training error of the proposed robust AdaBoost.RT algorithm is bounded. One lemma needs to be given in order to prove the convergence of the algorithm.

Lemma 1. The convexity argument $x^{d} \leq 1 - (1 - x)d$ holds for any $x \geq 0$ and $0 \leq d \leq 1$.

Theorem 2. Suppose the improved adaptive AdaBoost.RT algorithm generates hypotheses with errors $\varepsilon_{1}, \varepsilon_{2}, \ldots, \varepsilon_{T} < 1/2$. Then the error $E_{\mathrm{ensemble}}$ of the final hypothesis output by this algorithm is bounded above by
$$E_{\mathrm{ensemble}} \leq 2^{T}\prod_{t=1}^{T}\sqrt{\varepsilon_{t}(1 - \varepsilon_{t})}. \tag{12}$$

Proof. In this proof, we transform the regression problem $X \rightarrow Y$ into binary classification problems $X, Y \rightarrow \{0, 1\}$. In the proposed improved adaptive AdaBoost.RT algorithm, the mean of the errors ($\bar{e}$) is assumed to be close to zero; thus the dynamically adaptive thresholds can ignore the mean of the errors.

Let $I_{t} = \{i : |f_{t}(x_{i}) - y_{i}| < \lambda\sigma_{t}\}$; if $i \notin I_{t}$, then $I_{t}^{i} = 1$; otherwise $I_{t}^{i} = 0$. The final hypothesis output $f$ makes a mistake on sample $i$ only if

$$\prod_{t=1}^{T}\beta_{t}^{-I_{t}^{i}} \geq \left(\prod_{t=1}^{T}\beta_{t}\right)^{-1/2}. \tag{13}$$

The final weight of any sample $i$ is

$$w_{i}^{T+1} = D(i)\prod_{t=1}^{T}\beta_{t}^{(1 - I_{t}^{i})}. \tag{14}$$

Combining (13) and (14), the sum of the final weights is bounded below by the sum of the final weights of the rejected samples. Consider

$$\sum_{i=1}^{m}w_{i}^{T+1} \geq \sum_{i \notin I_{T+1}}w_{i}^{T+1} \geq \left(\sum_{i \notin I_{T+1}}D(i)\right)\left(\prod_{t=1}^{T}\beta_{t}\right)^{1/2} = E_{\mathrm{ensemble}}\left(\prod_{t=1}^{T}\beta_{t}\right)^{1/2}, \tag{15}$$

where $E_{\mathrm{ensemble}}$ is the error of the final hypothesis output. Based on Lemma 1,

$$\sum_{i=1}^{m}w_{i}^{t+1} = \sum_{i=1}^{m}w_{i}^{t}\beta_{t}^{1 - I_{t}^{i}} \leq \sum_{i=1}^{m}w_{i}^{t}\left(1 - (1 - \beta_{t})(1 - I_{t}^{i})\right) = \left(\sum_{i=1}^{m}w_{i}^{t}\right)\left(1 - (1 - \varepsilon_{t})(1 - \beta_{t})\right). \tag{16}$$

Combining these inequalities for $t = 1, \ldots, T$, the following is obtained:

$$\sum_{i=1}^{m}w_{i}^{T+1} \leq \prod_{t=1}^{T}\left(1 - (1 - \varepsilon_{t})(1 - \beta_{t})\right). \tag{17}$$

Combining (15) and (17), we obtain

$$E_{\mathrm{ensemble}} \leq \prod_{t=1}^{T}\frac{1 - (1 - \varepsilon_{t})(1 - \beta_{t})}{\sqrt{\beta_{t}}}. \tag{18}$$

Since all factors in the product are positive, the right-hand side is minimized by minimizing each factor individually; setting the derivative of the $t$th factor to zero gives $\beta_{t} = \varepsilon_{t}/(1 - \varepsilon_{t})$. Substituting this $\beta_{t}$ into (18) completes the proof.

Unlike the original AdaBoost.RT and its existing variants, the robust threshold in the proposed AdaBoost.RT algorithm is determined and self-adaptively adjusted according to the individual weak learners and the data samples. Through the analysis of the convergence of the proposed robust AdaBoost.RT algorithm, it is proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound. The study shows that the robust AdaBoost.RT algorithm proposed in this paper overcomes the limitations of the available AdaBoost.RT algorithms.
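A quick numerical reading of the bound in (12), under the assumption that every weak learner reaches the same error rate $\varepsilon_{t} = 0.2$: each factor $2\sqrt{\varepsilon_{t}(1-\varepsilon_{t})}$ equals 0.8, so the bound decays geometrically with the number of learners.

import numpy as np

eps = 0.2
for T in (5, 10, 20):
    bound = np.prod([2.0 * np.sqrt(eps * (1.0 - eps))] * T)   # 2^T * prod sqrt(eps(1-eps))
    print(T, round(bound, 4))   # prints 0.3277, 0.1074, 0.0115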

4. A Robust AdaBoost.RT Ensemble-Based Extreme Learning Machine

In this paper, a robust AdaBoost.RT ensemble-based extreme learning machine (RAE-ELM), which combines ELM with the robust AdaBoost.RT algorithm described in the previous section, is proposed to improve the robustness and stability of ELM. A set of $T$ ELMs is adopted as the "weak" learners. In the training phase, RAE-ELM utilizes the proposed robust AdaBoost.RT algorithm to train every ELM model and assigns an ensemble weight accordingly, so that each ELM receives its corresponding distribution based on the training output. The optimally weighted ensemble model of ELMs, $E_{\mathrm{ensemble}}$, is the final hypothesis output used for making predictions on the testing dataset. The proposed RAE-ELM is illustrated in Figure 2.

4.1. Initialization. For the first weak learner, $\mathrm{ELM}_{1}$ is supplied with $m$ training samples under a uniform distribution of weights, so that each sample has an equal opportunity to be chosen during the first training process for $\mathrm{ELM}_{1}$.

4.2. Distribution Updating. The relative prediction error rates are used to evaluate the performance of each ELM. The prediction error of the $t$th ELM, $\mathrm{ELM}_{t}$, for the input data samples can be represented as a statistical distribution $\mu_{t} \pm \lambda\sigma_{t}$, where $\mu_{t}$ stands for the expected value and $\lambda\sigma_{t}$ is defined as the robust threshold ($\sigma_{t}$ stands for the standard deviation, and the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). The robust threshold is applied to demarcate prediction errors as "accepted" or "rejected". If the prediction error of a particular sample falls into the region $\mu_{t} \pm \lambda\sigma_{t}$ bounded by the robust thresholds, the prediction of this sample is regarded as "accepted" for $\mathrm{ELM}_{t}$, and vice versa for "rejected" predictions. The probabilities of the "rejected" predictions are accumulated to calculate the error rate $\varepsilon_{t}$; ELM attempts to achieve an $f_{t}(x)$ with a small error rate.

The robust AdaBoost.RT algorithm then calculates the distribution for the next learner $\mathrm{ELM}_{t+1}$. For every sample that is correctly predicted by the current $\mathrm{ELM}_{t}$, the corresponding weight is multiplied by the error-rate function $\beta_{t}$; otherwise, the weight remains unchanged. This process is iterated for the next $\mathrm{ELM}_{t+1}$ until the last learner $\mathrm{ELM}_{T}$, unless $\varepsilon_{t}$ exceeds 0.5, because once the error rate is higher than 0.5 the AdaBoost algorithm does not converge and tends to overfit [21]. Hence the error rate must be less than 0.5.

4.3. Decision Making on RAE-ELM. The weight-updating parameter $\beta_{t}$ is used as an indicator of the regression effectiveness of $\mathrm{ELM}_{t}$ in the current iteration. According to the relationship between $\varepsilon_{t}$ and $\beta_{t}$, if $\varepsilon_{t}$ increases, $\beta_{t}$ becomes larger as well, and RAE-ELM grants a small ensemble weight to that $\mathrm{ELM}_{t}$. On the other hand, an $\mathrm{ELM}_{t}$ with relatively superior regression performance is granted a larger ensemble weight. The hybrid RAE-ELM model combines the set of ELMs under different weights as the final hypothesis for decision making.
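Putting Sections 3 and 4 together, the following is a minimal end-to-end RAE-ELM sketch for single-output regression. Since a basic ELM has no native sample-weight input, the sketch resamples the training set according to the current distribution $p^{(t)}$, which is one common way of passing a boosting distribution to a weak learner; all names, defaults, and that resampling choice are assumptions for illustration.

import numpy as np

def train_rae_elm(X, y, T=20, L=100, C=1.0, lam=0.5, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    w = np.full(m, 1.0 / m)                      # initial uniform weights
    learners, alphas = [], []
    for _ in range(T):
        p = w / w.sum()                          # distribution p^(t)
        idx = rng.choice(m, size=m, replace=True, p=p)    # resample by p^(t)
        A = rng.uniform(-1, 1, size=(X.shape[1], L))      # random ELM hidden layer
        b = rng.uniform(-1, 1, size=L)
        H = 1.0 / (1.0 + np.exp(-(X[idx] @ A + b)))
        beta_out = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ y[idx])
        pred = (1.0 / (1.0 + np.exp(-(X @ A + b)))) @ beta_out
        e = pred - y
        rejected = np.abs(e - e.mean()) > lam * e.std()   # robust threshold lambda*sigma_t
        eps = p[rejected].sum()
        if eps > 0.5:                            # abort as in step (4) of the algorithm
            break
        beta_t = max(eps, 1e-12) / (1.0 - eps)
        learners.append((A, b, beta_out))
        alphas.append(-np.log(beta_t))
        w = np.where(rejected, w, w * beta_t)    # shrink weights of "accepted" samples
    alphas = np.array(alphas) / np.sum(alphas)   # normalize so the alphas sum to one
    return learners, alphas

def predict_rae_elm(X, learners, alphas):
    preds = [(1.0 / (1.0 + np.exp(-(X @ A + b)))) @ beta for A, b, beta in learners]
    return alphas @ np.vstack(preds)             # alpha-weighted final hypothesis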

5. Performance Evaluation of RAE-ELM

In this section, the performance of the proposed RAE-ELM learning algorithm is compared with other popular algorithms on 14 real-world regression problems covering different domains from the UCI Machine Learning Repository [35]; the specifications of the benchmark datasets are shown in Table 1. The compared algorithms include basic ELM [4], original AdaBoost.RT based ELM (original Ada-ELM) [30], the modified self-adaptive AdaBoost.RT ELM (modified Ada-ELM) [31], support vector regression (SVR) [36], and least-squares support vector regression (LS-SVR) [37]. All the evaluations are conducted in the Matlab environment running on a Windows 7 machine with a 3.20 GHz CPU and 4 GB RAM.

In our experiments, all the input attributes are normalized into the range $[-1, 1]$, while the outputs are normalized into $[0, 1]$. The real-world benchmark datasets are embedded with noise and their distributions are unknown; they cover small and large sizes as well as low and high dimensions. For each trial of simulations, the whole dataset of the application is randomly partitioned into a training dataset and a testing dataset with the numbers of samples shown in Table 1, and 25% of the training data samples are used as the validation dataset. Each partitioned training, validation, and testing dataset is kept fixed as input for all these algorithms.
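A small helper of the kind used for this preprocessing step (min-max scaling each attribute column into a target range) might look as follows; the function name and the epsilon guard for constant columns are assumptions, not details from the paper.

import numpy as np

def scale_to_range(A, lo, hi):
    a_min, a_max = A.min(axis=0), A.max(axis=0)
    span = np.maximum(a_max - a_min, 1e-12)      # guard against constant columns
    return lo + (A - a_min) * (hi - lo) / span

# inputs into [-1, 1], outputs into [0, 1]:
# X_norm = scale_to_range(X, -1.0, 1.0)
# y_norm = scale_to_range(y.reshape(-1, 1), 0.0, 1.0).ravel()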

[Figure 2: Block diagram of RAE-ELM. A sequence of $m$ samples $(x_{1}, y_{1}), \ldots, (x_{m}, y_{m})$ is passed to $\mathrm{ELM}_{1}, \ldots, \mathrm{ELM}_{T}$ in turn; after training each $\mathrm{ELM}_{t}$, its error rate $\varepsilon_{t}$ is calculated and the distribution $D_{t}(i)$ is updated for the next learner, and the boosted outputs are combined into the final hypothesis.]

For the RAE-ELM, basic ELM, original Ada-ELM, and modified Ada-ELM algorithms, the suitable numbers of hidden nodes are determined using the preserved validation dataset, respectively. The sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b)))$ is selected as the activation function in all the algorithms. Fifty trials of simulations have been conducted for each problem, with the training, validation, and testing samples randomly split for each trial. The performances of the algorithms are verified using the average root mean square error (RMSE) in testing. The significantly better results are highlighted in boldface.
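The reported metric is the testing root mean square error averaged over the repeated random splits; a sketch of that bookkeeping is shown below, where run_trial is a hypothetical helper (not from the paper) that trains a model on one random split and returns the test targets and predictions.

import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# scores = [rmse(*run_trial(seed)) for seed in range(50)]   # 50 random splits
# print(np.mean(scores), np.std(scores))                    # average testing RMSE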

5.1. Model Selection. In ensemble algorithms, the number of networks in the ensemble needs to be determined. According to Occam's razor, excessively complex models are affected by statistical noise, whereas simpler models may capture the underlying structure better and may thus have better predictive performance. Therefore, the parameter $T$, the number of weak learners, need not be very large.

We define $\varepsilon_{t}$ in (12) as $\varepsilon_{t} = 0.5 - \gamma_{t}$, where $\gamma_{t} > 0$ is constant; this yields

$$E_{\mathrm{ensemble}} \leq \prod_{t=1}^{T}\sqrt{1 - 4\gamma_{t}^{2}} = e^{-\sum_{t=1}^{T}\mathrm{KL}(1/2\,\|\,1/2 - \gamma_{t})} \leq e^{-2\sum_{t=1}^{T}\gamma_{t}^{2}}, \tag{19}$$

where KL is the Kullback-Leibler divergence. We then simplify (19) by using $0.5 - \gamma$ instead of $0.5 - \gamma_{t}$; that is, each $\gamma_{t}$ is set to be the same. We get

$$E_{\mathrm{ensemble}} \leq \left(1 - 4\gamma^{2}\right)^{T/2} = e^{-T\,\mathrm{KL}(1/2\,\|\,1/2 - \gamma)} \leq e^{-2T\gamma^{2}}. \tag{20}$$

Table 1: Specification of real-world regression benchmark datasets.

Problems          | Training data | Testing data | Attributes
LVST              | 90            | 35           | 308
Yeast             | 837           | 645          | 7
Computer hardware | 110           | 99           | 7
Abalone           | 3150          | 1027         | 7
Servo             | 95            | 72           | 4
Parkinson disease | 3000          | 2875         | 21
Housing           | 350           | 156          | 13
Cloud             | 80            | 28           | 9
Auto price        | 110           | 49           | 9
Breast cancer     | 100           | 94           | 32
Balloon           | 1500          | 833          | 4
Auto-MPG          | 300           | 92           | 7
Bank              | 5000          | 3192         | 8
Census (house8L)  | 15000         | 7784         | 8

From (20), we can obtain the upper bound on the number of iterations of this algorithm:

$$T \leq \left[\frac{1}{2\gamma^{2}}\ln\frac{1}{E_{\mathrm{ensemble}}}\right]. \tag{21}$$
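As a worked example of (21), suppose every weak learner beats 1/2 by the same assumed margin $\gamma = 0.2$ and the target ensemble error is 0.05; then roughly 38 boosting iterations already suffice.

import numpy as np

gamma, target_error = 0.2, 0.05
T_bound = int(np.ceil(np.log(1.0 / target_error) / (2.0 * gamma ** 2)))   # bound (21)
print(T_bound)   # 38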

For RAE-ELM, the number of ELM networks needs to be determined. The number of ELM networks is set to 5, 10, 15, 20, 25, and 30 in our simulations, and the optimal value is selected as the one that results in the best average RMSE in testing.

Besides, in our simulation trials, the relative factor $\lambda$ in RAE-ELM needs to be optimized within the range $\lambda \in (0, 1)$. We start the simulations at $\lambda = 0.1$ and increase it in steps of 0.1. Table 2 shows examples of setting both $T$ and $\lambda$ for our simulation trials.

As illustrated in Table 2 and Figure 3, RAE-ELM with the sigmoid activation function achieves good generalization performance on the Parkinson disease dataset as long as the number of ELM networks $T$ is larger than 15. For a given number of ELM networks, the RMSE is less sensitive to the variation of $\lambda$ and tends to be smaller when $\lambda$ is around 0.5. For a fair comparison, we set RAE-ELM with $T = 20$ and $\lambda = 0.5$ in the following experiments. For both original Ada-ELM and modified Ada-ELM, when the number of ELM networks is less than 7, the ensemble model is unstable; the number of ELM networks is therefore also set to 20 for both original Ada-ELM and modified Ada-ELM.

We use the popular Gaussian kernel function $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma\|\mathbf{u} - \mathbf{v}\|^{2})$ in both SVR and LS-SVR. As is well known, the performances of SVR and LS-SVR are sensitive to the combination of $(C, \gamma)$; hence the cost parameter $C$ and the kernel parameter $\gamma$ need to be adjusted appropriately over a wide range so as to obtain good generalization performance. For each dataset, 50 different values of $C$ and 50 different values of $\gamma$, that is, 2500 pairs of $(C, \gamma)$, are tried; the candidate values of $C$ and $\gamma$ are $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$. For both SVR and LS-SVR, the best-performing combination of $(C, \gamma)$ is selected for each dataset, as presented in Table 3.
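The $(C, \gamma)$ search described above can be enumerated as below; validate is a hypothetical function (not from the paper) that would return the validation RMSE of SVR or LS-SVR for one parameter pair.

from itertools import product

grid = [2.0 ** k for k in range(-24, 26)]        # 50 values: 2^-24 ... 2^25
pairs = list(product(grid, grid))                # 2500 (C, gamma) combinations
# best_C, best_gamma = min(pairs, key=lambda cg: validate(C=cg[0], gamma=cg[1]))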

Table 2: Performance of the proposed RAE-ELM with different values of T and λ for the Parkinson disease dataset.

λ \ T | 5      | 10     | 15     | 20     | 25     | 30
0.1   | 0.2472 | 0.2374 | 0.2309 | 0.2259 | 0.2251 | 0.2249
0.2   | 0.2469 | 0.2315 | 0.2276 | 0.2241 | 0.2249 | 0.2245
0.3   | 0.2447 | 0.2298 | 0.2248 | 0.2235 | 0.2230 | 0.2233
0.4   | 0.2426 | 0.2287 | 0.2249 | 0.2227 | 0.2224 | 0.2227
0.5   | 0.2428 | 0.2296 | 0.2251 | 0.2219 | 0.2216 | 0.2215
0.6   | 0.2427 | 0.2298 | 0.2246 | 0.2221 | 0.2214 | 0.2213
0.7   | 0.2422 | 0.2285 | 0.2257 | 0.2229 | 0.2217 | 0.2219
0.8   | 0.2441 | 0.2301 | 0.2265 | 0.2228 | 0.2225 | 0.2228
0.9   | 0.2461 | 0.2359 | 0.2312 | 0.2243 | 0.2237 | 0.2230

For basic ELM and the other ELM-based ensemble methods, the sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b)))$ is selected as the activation function in all the algorithms. The parameters $(C, L)$ need to be selected so as to achieve the best generalization performance, where the cost parameter $C$ is selected from the range $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$ and the candidate numbers of hidden nodes $L$ are $10, 20, \ldots, 1000$.

In addition, for the original AdaBoost.RT-based ensemble ELM, the threshold $\Phi$ has to be chosen before seeing the datasets. In the original Ada-ELM, the threshold $\Phi$ is required to be manually selected according to an empirical suggestion, and it is a sensitive factor affecting the regression performance. If $\Phi$ is too low, it is generally difficult to obtain a sufficient number of "accepted" samples; if $\Phi$ is too high, some wrong samples are treated as "accepted" ones and the ensemble model tends to become unstable. According to Shrestha and Solomatine's experiments, the threshold $\Phi$ should be defined between 0 and 0.4 in order to make the ensemble model stable [30]. In our simulations, we incrementally set thresholds within the range from 0 to 0.4; the original Ada-ELM with threshold values at 0.1, 0.15, . . . , 0.35, 0.4 generates satisfactory results for all the regression problems, and the best-performing original Ada-ELM is shown in boldface. Furthermore, the modified Ada-ELM algorithm needs an initial value $\Phi_{0}$ from which the subsequent thresholds are calculated in the iterations. Tian and Mao suggested setting the default initial value of $\Phi_{0}$ to 0.2 [31]. Since a manually fixed initial threshold is not related to the characteristics of the ELM prediction effect on the input dataset, the algorithm may not reach the best generalization performance. In our simulations, we compare the performances of modified Ada-ELM at different initial threshold values $\Phi_{0}$ set to 0.1, 0.15, . . . , 0.3, 0.35; the best-performing modified Ada-ELM is also presented in Table 3.

5.2. Performance Comparisons between RAE-ELM and Other Learning Algorithms. In this subsection, the performance of the proposed RAE-ELM is compared with other learning algorithms, including basic ELM [4], original Ada-ELM [30], modified Ada-ELM [31], support vector regression (SVR) [36], and least-squares support vector regression (LS-SVR) [37]. The result comparisons of RAE-ELM and the other learning algorithms for real-world data regression are shown in Table 4.

[Figure 3: The average testing RMSE with different values of $T$ and $\lambda$ in RAE-ELM for the Parkinson disease dataset.]

Table 3: Parameters of RAE-ELM and other learning algorithms.

Datasets          | RAE-ELM (C, L) | Basic ELM [4] (C, L) | Original Ada-ELM [30] (C, L, Φ) | Modified Ada-ELM [31] (C, L, Φ0) | SVR [36] (C, γ) | LS-SVR [37] (C, γ)
LVST              | (2^13, 960)  | (2^5, 710)   | (2^-7, 530, 0.3)   | (2^-7, 460, 0.35)  | (2^17, 2^2)  | (2^8, 2^0)
Yeast             | (2^10, 540)  | (2^1, 340)   | (2^5, 710, 0.25)   | (2^-1, 800, 0.2)   | (2^-7, 2^0)  | (2^-9, 2^3)
Computer hardware | (2^-4, 160)  | (2^11, 530)  | (2^16, 240, 0.35)  | (2^10, 170, 0.25)  | (2^-9, 2^21) | (2^0, 2^6)
Abalone           | (2^0, 150)   | (2^0, 200)   | (2^-22, 330, 0.2)  | (2^6, 610, 0.15)   | (2^3, 2^18)  | (2^1, 2^14)
Servo             | (2^1, 340)   | (2^-6, 650)  | (2^0, 110, 0.3)    | (2^9, 230, 0.2)    | (2^-2, 2^2)  | (2^4, 2^-3)
Parkinson disease | (2^9, 830)   | (2^-8, 460)  | (2^4, 370, 0.35)   | (2^-4, 590, 0.25)  | (2^1, 2^0)   | (2^1, 2^7)
Housing           | (2^-19, 270) | (2^-9, 310)  | (2^-1, 440, 0.25)  | (2^17, 220, 0.2)   | (2^21, 2^-4) | (2^3, 2^12)
Cloud             | (2^-5, 590)  | (2^13, 880)  | (2^0, 1000, 0.2)   | (2^2, 760, 0.25)   | (2^7, 2^5)   | (2^4, 2^-11)
Auto price        | (2^20, 310)  | (2^17, 430)  | (2^-2, 230, 0.35)  | (2^-5, 270, 0.35)  | (2^3, 2^23)  | (2^7, 2^-10)
Breast cancer     | (2^-3, 80)   | (2^-1, 160)  | (2^-6, 90, 0.15)   | (2^13, 290, 0.25)  | (2^6, 2^1)   | (2^6, 2^17)
Balloon           | (2^22, 160)  | (2^16, 190)  | (2^13, 250, 0.2)   | (2^17, 210, 0.2)   | (2^1, 2^8)   | (2^0, 2^-7)
Auto-MPG          | (2^7, 630)   | (2^-10, 820) | (2^21, 380, 0.35)  | (2^11, 680, 0.3)   | (2^-2, 2^-1) | (2^1, 2^5)
Bank              | (2^-1, 660)  | (2^0, 590)   | (2^2, 710, 0.3)    | (2^-9, 450, 0.15)  | (2^10, 2^-4) | (2^21, 2^-5)
Census (house8L)  | (2^3, 580)   | (2^5, 460)   | (2^1, 190, 0.3)    | (2^3, 600, 0.25)   | (2^5, 2^-6)  | (2^0, 2^-4)

Table 4 lists the average results of multiple trials of the four ELM-based algorithms (RAE-ELM, basic ELM, original Ada-ELM [30], and modified Ada-ELM [31]), SVR [36], and LS-SVR [37] for fourteen representative real-world data regression problems. The selected datasets include large-scale and small-scale data as well as high-dimensional and low-dimensional problems. It is easy to see that the averaged testing RMSE obtained by RAE-ELM is always the best among these six algorithms for all fourteen cases. For original Ada-ELM, the performance is sensitive to the selection of the threshold value $\Phi$: the best-performing original Ada-ELM models for different regression problems have correspondingly different threshold values, so the manual selection strategy is not satisfactory. The generalization performance of modified Ada-ELM is in general better than that of original Ada-ELM; however, the empirically suggested initial threshold value of 0.2 does not guarantee the best-performing regression model.

The three AdaBoost.RT based ensemble ELMs (RAE-ELM, original Ada-ELM, and modified Ada-ELM) all perform better than the basic ELM, which verifies that an ensemble ELM using AdaBoost.RT can achieve better prediction accuracy than using an individual ELM as the predictor. The averaged generalization performance of basic ELM is better than that of SVR, while it is slightly worse than that of LS-SVR.

To find the best-performing original Ada-ELM or modified Ada-ELM model for a regression problem, since the threshold selection is not related to the input dataset and the ELM networks, the optimal parameter has to be searched by brute force: one needs to carry out a set of experiments with different (initial) threshold values and then search among them for the best ensemble ELM model. Such a process is time consuming. Moreover, the generalization performance of the ensemble ELMs optimized in this way, using original Ada-ELM or modified Ada-ELM, can hardly be better than that of the proposed RAE-ELM. In fact, the proposed RAE-ELM is always the best-performing learner among the six candidates for all fourteen real-world regression problems.

Table 4: Result comparisons of testing RMSE of RAE-ELM and other learning algorithms for real-world data regression problems.

Datasets          | RAE-ELM | Basic ELM [4] | Original Ada-ELM [30] | Modified Ada-ELM [31] | SVR [36] | LS-SVR [37]
LVST              | 0.2571  | 0.2854 | 0.2702 | 0.2653 | 0.2849 | 0.2801
Yeast             | 0.1071  | 0.1224 | 0.1195 | 0.1156 | 0.1238 | 0.1182
Computer hardware | 0.0563  | 0.0839 | 0.0821 | 0.0726 | 0.0976 | 0.0771
Abalone           | 0.0731  | 0.0882 | 0.0793 | 0.0778 | 0.0890 | 0.0815
Servo             | 0.0727  | 0.0924 | 0.0884 | 0.0839 | 0.0966 | 0.0870
Parkinson disease | 0.2219  | 0.2549 | 0.2513 | 0.2383 | 0.2547 | 0.2437
Housing           | 0.1018  | 0.1259 | 0.1163 | 0.1135 | 0.1270 | 0.1185
Cloud             | 0.2897  | 0.3269 | 0.3118 | 0.3006 | 0.3316 | 0.3177
Auto price        | 0.0809  | 0.1036 | 0.0910 | 0.0894 | 0.1045 | 0.0973
Breast cancer     | 0.2463  | 0.2641 | 0.2601 | 0.2519 | 0.2657 | 0.2597
Balloon           | 0.0570  | 0.0639 | 0.0611 | 0.0592 | 0.0672 | 0.0618
Auto-MPG          | 0.0724  | 0.0862 | 0.0799 | 0.0781 | 0.0891 | 0.0857
Bank              | 0.0415  | 0.0551 | 0.0496 | 0.0453 | 0.0537 | 0.0498
Census (house8L)  | 0.0746  | 0.0820 | 0.0802 | 0.0795 | 0.0867 | 0.0813

6. Conclusion

In this paper, a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems is proposed, which combines ELM with the novel robust AdaBoost.RT algorithm. Combining the effective learner, ELM, with the novel ensemble method, the robust AdaBoost.RT algorithm, yields a hybrid method that inherits their intrinsic properties and achieves better prediction accuracy than using only an individual ELM as the predictor. ELM tends to reach the solutions straightforwardly, and its regression error rate is in general much smaller than 0.5; therefore, selecting ELM as the "weak" learner can avoid overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the ensemble ELM.

The proposed robust AdaBoost.RT algorithm overcomes the limitations of the available AdaBoost.RT algorithm and its variants, where the threshold value is manually specified and may only be ideal for a very limited set of cases. The new robust AdaBoost.RT algorithm utilizes the statistical distribution of the approximation errors to dynamically determine a robust threshold. The robust threshold for each weak learner $\mathrm{WL}_{t}$ is self-adjustable and is defined as the scaled standard deviation of the approximation errors, $\lambda\sigma_{t}$. We analyze the convergence of the proposed robust AdaBoost.RT algorithm, and it has been proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound.

The proposed RAE-ELM is robust with respect to differences among regression problems and to variations in the approximation error rates, which do not significantly affect its highly stable generalization performance. As one of the key parameters in the ensemble algorithm, the threshold value does not need any human intervention; instead, it is self-adjusted according to the real regression effect of the ELM networks on the input dataset. Such a mechanism enables RAE-ELM to make sensitive and adaptive adjustments to the intrinsic properties of the given regression problem.

The experimental comparisons in terms of stability and accuracy among the six prevailing algorithms (RAE-ELM, basic ELM, original Ada-ELM, modified Ada-ELM, SVR, and LS-SVR) for regression problems verify that all the AdaBoost.RT based ensemble ELMs perform better than SVR and, more remarkably, that the proposed RAE-ELM always achieves the best performance. The boosting effect of the proposed method is not significant for small-sized and low-dimensional problems, as an individual ELM network is already sufficient to handle such problems well. It is worth pointing out that the proposed RAE-ELM outperforms the others especially for high-dimensional or large-sized datasets, which is a convincing indicator of good generalization performance.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Professor Guang-Bin Huang from Nanyang Technological University for providing inspiring comments and suggestions on our research. This work is financially supported by the University of Macau with Grant no. MYRG079(Y1-L2)-FST13-YZX.

References

[1] C. L. Philip Chen, J. Wang, C. H. Wang, and L. Chen, "A new learning algorithm for a Fully Connected Fuzzy Inference System (F-CONFIS)," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1741-1757, 2014.
[2] Z. Yang, P. K. Wong, C. M. Vong, J. Zhong, and J. Y. Liang, "Simultaneous-fault diagnosis of gas turbine generator systems using a pairwise-coupled probabilistic classifier," Mathematical Problems in Engineering, vol. 2013, Article ID 827128, 13 pages, 2013.
[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1-3, pp. 489-501, 2006.
[4] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513-529, 2012.
[5] S. C. Ng, C. C. Cheung, and S. H. Leung, "Magnified gradient function with deterministic weight modification in adaptive learning," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1411-1423, 2004.
[6] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[7] H. J. Rong, G. B. Huang, N. Sundararajan, and P. Saratchandran, "Online sequential fuzzy extreme learning machine for function approximation and classification problems," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 4, pp. 1067-1072, 2009.
[8] J. Cao, Z. Lin, and G. B. Huang, "Voting base online sequential extreme learning machine for multi-class classification," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '13), pp. 2327-2330, IEEE, 2013.
[9] J. Cao, Z. Lin, G. B. Huang, and N. Liu, "Voting based extreme learning machine," Information Sciences, vol. 185, no. 1, pp. 66-77, 2012.
[10] N. Y. Liang, G. B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411-1423, 2006.
[11] J. Luo, C. M. Vong, and P. K. Wong, "Sparse Bayesian extreme learning machine for multi-classification," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 836-843, 2014.
[12] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, vol. 2, MIT Press, Cambridge, UK, 2006.
[13] Y. Lan, Y. C. Soh, and G. B. Huang, "Ensemble of online sequential extreme learning machine," Neurocomputing, vol. 72, no. 13, pp. 3391-3395, 2009.
[14] N. Liu and H. Wang, "Ensemble based extreme learning machine," IEEE Signal Processing Letters, vol. 17, no. 8, pp. 754-757, 2010.
[15] X. Xue, M. Yao, Z. Wu, and J. Yang, "Genetic ensemble of extreme learning machine," Neurocomputing, vol. 129, no. 10, pp. 175-184, 2014.
[16] X. L. Wang, Y. Y. Chen, H. Zhao, and B. L. Lu, "Parallelized extreme learning machine ensemble based on min-max modular network," Neurocomputing, vol. 128, pp. 31-41, 2014.
[17] B. V. Dasarathy and B. V. Sheela, "A composite classifier system design: concepts and methodology," Proceedings of the IEEE, vol. 67, no. 5, pp. 708-713, 1979.
[18] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993-1001, 1990.
[19] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[20] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.
[21] R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms, MIT Press, Cambridge, Mass, USA, 2012.
[22] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[23] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79-87, 1991.
[24] V. Guruswami and A. Sahai, "Multiclass learning, boosting, and error-correcting codes," in Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 145-155, July 1999.
[25] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297-336, 1999.
[26] N. Li, "Multiclass boosting with repartitioning," in Proceedings of the 23rd International Conference on Machine Learning, pp. 569-576, ACM, June 2006.
[27] H. Drucker, "Improving regressors using boosting," in Proceedings of the 14th International Conference on Machine Learning, pp. 107-115, 1997.
[28] R. Avnimelech and N. Intrator, "Boosting regression estimators," Neural Computation, vol. 11, no. 2, pp. 499-520, 1999.
[29] D. P. Solomatine and D. L. Shrestha, "AdaBoost.RT: a boosting algorithm for regression problems," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 1163-1168, IEEE, 2004.
[30] D. L. Shrestha and D. P. Solomatine, "Experiments with AdaBoost.RT, an improved boosting scheme for regression," Neural Computation, vol. 18, no. 7, pp. 1678-1710, 2006.
[31] H.-X. Tian and Z.-Z. Mao, "An ensemble ELM based on modified AdaBoost.RT algorithm for predicting the temperature of molten steel in ladle furnace," IEEE Transactions on Automation Science and Engineering, vol. 7, no. 1, pp. 73-80, 2010.
[32] J. W. Cao, T. Chen, and J. Fan, "Fast online learning algorithm for landmark recognition based on BoW framework," in Proceedings of the 9th IEEE Conference on Industrial Electronics and Applications, Hangzhou, China, June 2014.
[33] J. Cao, Z. Lin, and G.-B. Huang, "Composite function wavelet neural networks with extreme learning machine," Neurocomputing, vol. 73, no. 7-9, pp. 1405-1416, 2010.
[34] R. Feely, Predicting stock market volatility using neural networks, B.A. (Mod) thesis, Trinity College Dublin, Dublin, Ireland, 2000.
[35] K. Bache and M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2014, http://archive.ics.uci.edu/ml.
[36] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155-161, 1997.
[37] J. A. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293-300, 1999.


which need to be known. We call $K(\mathbf{x}_i, \mathbf{x}_j) = h(\mathbf{x}_i) \cdot h(\mathbf{x}_j)$ the ELM random kernel, where the feature mapping $h(\mathbf{x})$ is randomly generated.

(2) Nonkernel Case. Similarly, based on the KKT theorem, we have
$$\boldsymbol{\beta} = \left( \frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H} \right)^{-1} \mathbf{H}^{T}\mathbf{T}. \tag{7}$$

In this case, the ELM output function is
$$f(\mathbf{x}) = h(\mathbf{x})\boldsymbol{\beta} = h(\mathbf{x}) \left( \frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H} \right)^{-1} \mathbf{H}^{T}\mathbf{T}. \tag{8}$$
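To make the regularized least-squares solution in (7)-(8) concrete, the following is a minimal sketch of a nonkernel ELM regressor in Python/NumPy. The class name, the sigmoid feature map, and the parameter names (`n_hidden`, `C`) are illustrative assumptions rather than the authors' code.

```python
import numpy as np

class ELMRegressor:
    """Minimal single-hidden-layer ELM (nonkernel case); a sketch of (7)-(8)."""

    def __init__(self, n_hidden=100, C=1.0, rng=None):
        self.n_hidden = n_hidden          # number of hidden nodes L
        self.C = C                        # regularization parameter
        self.rng = rng or np.random.default_rng()

    def _hidden(self, X):
        # Random feature mapping h(x) with a sigmoid activation (an assumption here)
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        n_features = X.shape[1]
        # Hidden-layer parameters are generated randomly and never tuned
        self.W = self.rng.uniform(-1, 1, size=(n_features, self.n_hidden))
        self.b = self.rng.uniform(-1, 1, size=self.n_hidden)
        H = self._hidden(X)               # N x L hidden-layer output matrix
        # Output weights from (7): beta = (I/C + H^T H)^(-1) H^T T
        A = np.eye(self.n_hidden) / self.C + H.T @ H
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        # ELM output function (8): f(x) = h(x) beta
        return self._hidden(X) @ self.beta
```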

3. The Proposed Robust AdaBoost.RT Algorithm

In this section, we first describe the original AdaBoost.RT algorithm for regression problems and then present a new robust AdaBoost.RT algorithm. The corresponding analysis of the novel algorithm is also given.

3.1. The Original AdaBoost.RT Algorithm. Solomatine and Shrestha proposed AdaBoost.RT [29, 30], a boosting algorithm for regression problems, where the letters R and T represent regression and threshold, respectively. The original AdaBoost.RT algorithm is described as follows.

Input the following:
(i) sequence of $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$, where the output $y \in R$;
(ii) weak learning algorithm (weak learner);
(iii) integer $T$ specifying the number of iterations (machines);
(iv) threshold $\Phi$ ($0 < \Phi < 1$) for demarcating correct and incorrect predictions.

Initialize the following:
(i) machine number or iteration $t = 1$;
(ii) distribution $D_t(i) = 1/m$ for all $i$;
(iii) error rate $\varepsilon_t = 0$.

Learning steps (iterate while $t \leq T$) are as follows.

Step 1. Call the weak learner, providing it with distribution $D_t$.

Step 2. Build the regression model $f_t(x) \rightarrow y$.

Step 3. Calculate the absolute relative error (ARE) for each training example as
$$\mathrm{ARE}_t(i) = \left| \frac{f_t(x_i) - y_i}{y_i} \right|. \tag{9}$$

Step 4. Calculate the error rate of $f_t(x)$: $\varepsilon_t = \sum_{i:\,\mathrm{ARE}_t(i) > \Phi} D_t(i)$.

Step 5. Set $\beta_t = \varepsilon_t^{\,n}$, where $n$ is the power coefficient (e.g., linear, square, or cubic).

Step 6. Update the distribution $D_t$ as follows:
if $\mathrm{ARE}_t(i) \leq \Phi$, then $D_{t+1}(i) = (D_t(i)/Z_t) \times \beta_t$;
else $D_{t+1}(i) = D_t(i)/Z_t$,
where $Z_t$ is a normalization factor chosen such that $D_{t+1}$ will be a distribution.

Step 7. Set $t = t + 1$.

Output the following: the final hypothesis
$$f_{\mathrm{fin}}(x) = \frac{\sum_t \left(\log(1/\beta_t)\right) f_t(x)}{\sum_t \log(1/\beta_t)}. \tag{10}$$

The AdaBoost.RT algorithm projects regression problems into the binary classification domain. Based on the boosting regression estimators [28] and BEM [34], the AdaBoost.RT algorithm introduces the absolute relative error (ARE) to demarcate samples as either correct or incorrect predictions. If the ARE of any particular sample is greater than the threshold $\Phi$, the predicted value for this sample is regarded as an incorrect prediction; otherwise, it is regarded as a correct prediction. Such an indication method is similar to the "misclassification" and "correct-classification" labeling used in classification problems. The algorithm assigns relatively large weights to those weak learners in the front of the learner list that reach a high correct prediction rate. The samples with incorrect predictions are handled as ad hoc cases by the following weak learners. The outputs from each weak learner are combined into the final hypothesis using the correspondingly computed weights.
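For concreteness, the following is a compact Python/NumPy sketch of the loop above; it is not the authors' implementation. The weighted-resampling workaround for supplying the distribution $D_t$ to a learner without native sample weights, and the small floor on $\varepsilon_t$, are assumptions; the outputs $y$ are assumed to be nonzero so that the ARE in (9) is defined.

```python
import numpy as np

def adaboost_rt(X, y, make_weak_learner, T=20, phi=0.2, n_power=2):
    """Sketch of the original AdaBoost.RT loop (Steps 1-7 and output (10))."""
    m = len(y)
    D = np.full(m, 1.0 / m)                  # initial distribution D_1(i) = 1/m
    models, betas = [], []
    for _ in range(T):
        learner = make_weak_learner()
        # Steps 1-2: supply the distribution by weighted resampling, then train
        idx = np.random.choice(m, size=m, replace=True, p=D)
        learner.fit(X[idx], y[idx])
        pred = learner.predict(X)
        # Step 3: absolute relative error of each training example (y assumed nonzero)
        are = np.abs((pred - y) / y)
        # Step 4: error rate = total weight of examples with ARE > phi
        eps = max(D[are > phi].sum(), 1e-12)  # floor avoids log(0) later
        beta = eps ** n_power                 # Step 5: beta_t = eps^n
        # Step 6: down-weight correctly predicted examples, then renormalize
        D = np.where(are <= phi, D * beta, D)
        D /= D.sum()
        models.append(learner)
        betas.append(beta)

    def predict(X_new):
        # Output (10): combination weighted by log(1/beta_t)
        w = np.log(1.0 / np.array(betas))
        preds = np.stack([mdl.predict(X_new) for mdl in models])
        return (w[:, None] * preds).sum(axis=0) / w.sum()

    return predict
```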

The AdaBoost.RT algorithm requires manual selection of the threshold $\Phi$, which is a main factor that sensitively affects the performance of the committee machines. If $\Phi$ is too small, very few samples are treated as correct predictions, and they easily get boosted; the following learners must then handle a large number of ad hoc samples, which makes the ensemble algorithm unstable. On the other hand, if $\Phi$ is too large, say greater than 0.4, most samples are treated as correct predictions and the algorithm fails to reject the false samples, which causes low convergence efficiency and overfitting. The initial AdaBoost.RT and its variant suffer from this limitation in setting the threshold value, which is specified either as a manually chosen constant or as a variable changing in the vicinity of 0.2. Both strategies are irrelevant to the regression capability of the weak learner. In order to determine the $\Phi$ value effectively, a novel improvement of AdaBoost.RT is proposed in the following section.

3.2. The Proposed Robust AdaBoost.RT Algorithm. To overcome the limitations suffered by the current works on AdaBoost.RT, we embed statistics theory into the AdaBoost.RT algorithm. This overcomes the difficulty of optimally determining the initial threshold value and enables the intermediate threshold values to be dynamically self-adjustable according to the intrinsic properties of the input data samples. The proposed robust AdaBoost.RT algorithm is described as follows.

(1) Input the following:
sequence of $m$ samples $(x_1, y_1), \ldots, (x_m, y_m)$, where the output $y \in R$;
weak learning algorithm (weak learner);
maximum number of iterations (machines) $T$.

(2) Initialize the following:
iteration index $t = 1$;
distribution $D_t(i) = 1/m$ for all $i$;
the weight vector $w_i^{(t)} = D_t(i)$ for all $i$;
error rate $\varepsilon_t = 0$.

(3) Iterate while $t \leq T$ the following.

(1) Call weak learner WL$_t$, providing it with the distribution
$$p^{(t)} = \frac{w^{(t)}}{Z_t}, \tag{11}$$
where $Z_t$ is a normalization factor chosen such that $p^{(t)}$ will be a distribution.

(2) Build the regression model $f_t(x) \rightarrow y$.

(3) Calculate each error $e_t(i) = f_t(x_i) - y_i$.

(4) Calculate the error rate $\varepsilon_t = \sum_{i \in P} p_i^{(t)}$, where $P = \{i \mid |e_t(i) - \bar{e}_t| > \lambda\sigma_t,\ i \in [1, m]\}$, $\bar{e}_t$ stands for the expected value, and $\lambda\sigma_t$ is defined as the robust threshold ($\sigma_t$ stands for the standard deviation; the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). If $\varepsilon_t > 1/2$, then set $T = t - 1$ and abort the loop.

(5) Set $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.

(6) Calculate the contribution of $f_t(x)$ to the final result: $\alpha_t = -\log(\beta_t)$.

(7) Update the weight vectors:
if $i \notin P$, then $w_i^{(t+1)} = w_i^{(t)} \beta_t$;
else $w_i^{(t+1)} = w_i^{(t)}$.

(8) Set $t = t + 1$.

(4) Normalize $\alpha_1, \ldots, \alpha_T$ such that $\sum_{t=1}^{T} \alpha_t = 1$.

Output the final hypothesis $f_{\mathrm{fin}}(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$.
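The following is a minimal sketch of the bookkeeping in Steps (3)-(7) of one iteration, written in Python/NumPy. The function name, the small floor on $\varepsilon_t$, and the return convention are illustrative assumptions, not the authors' code; the surrounding loop is assumed to normalize the returned weights into the next distribution $p^{(t+1)}$.

```python
import numpy as np

def robust_boost_step(errors, p, lam=0.5):
    """One iteration of the robust AdaBoost.RT bookkeeping (Steps (3)-(7)).

    errors : e_t(i) = f_t(x_i) - y_i for every training sample
    p      : current normalized distribution p^(t)
    lam    : relative factor lambda in (0, 1)
    Returns (updated unnormalized weights, alpha_t, eps_t), or (None, None, eps_t)
    when eps_t > 1/2, in which case the caller should abort the loop.
    """
    mean_e, sigma_t = errors.mean(), errors.std()
    # Robust threshold lambda * sigma_t; "rejected" samples fall outside it
    rejected = np.abs(errors - mean_e) > lam * sigma_t
    eps_t = p[rejected].sum()                 # error rate (Step (4))
    if eps_t > 0.5:
        return None, None, eps_t              # abort condition
    eps_t = max(eps_t, 1e-12)                 # floor avoids log(0) when nothing is rejected
    beta_t = eps_t / (1.0 - eps_t)            # Step (5)
    alpha_t = -np.log(beta_t)                 # Step (6), normalized later across iterations
    # Step (7): shrink the weights of "accepted" samples only
    w_next = np.where(rejected, p, p * beta_t)
    return w_next, alpha_t, eps_t
```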

At each iteration of the proposed robust AdaBoost.RT algorithm, the standard deviation of the approximation error distribution is used as a criterion. In probability and statistics theory, the standard deviation measures the amount of variation or dispersion from the average. If the data points tend to be very close to the mean value, the standard deviation is low; on the other hand, if the data points are spread out over a large range of values, a high standard deviation results.

The standard deviation may serve as a measure of uncertainty for a set of repeated predictions. When deciding whether predictions agree with their corresponding true values, the standard deviation of the predictions made by the underlying approximation function is of crucial importance: if the averaged distance from the predictions to the true values is large, then the regression model being tested probably needs to be revised, because sample points fall outside the range of values that could reasonably be expected to occur and the prediction accuracy of the model is low.

In the proposed robust AdaBoost.RT algorithm, the approximation error of the $t$th weak learner WL$_t$ for an input dataset could be represented as one statistical distribution with parameters $\mu_t \pm \lambda\sigma_t$, where $\mu_t$ stands for the expected value, $\sigma_t$ stands for the standard deviation, and $\lambda$ is an adjustable relative factor that ranges from 0 to 1. The threshold value for WL$_t$ is defined by the scaled standard deviation $\lambda\sigma_t$. In the hybrid learning algorithm, the trained weak learners are assumed to be able to generate a small prediction error ($\varepsilon_t < 1/2$). For all $\mu_t$, $\mu_t \in (0 - \eta, 0 + \eta)$, $\lim \eta \rightarrow 0$, and $t = 1, \ldots, T$, where $\eta$ denotes a small error limit approaching zero. The population mean of a regression error distribution is closer to the targeted zero than the elements in the population; therefore, the means of the obtained regression errors fluctuate around zero within a small range. The standard deviation $\sigma_t$ is solely determined by the individual samples and the generalization performance of the weak learner WL$_t$. Generally, $\sigma_t$ is relatively large, such that most of the outputs will be located within the range $[-\sigma_t, +\sigma_t]$, which tends to make the boosting process unstable. To maintain a stable adjustment of the threshold value, a relative factor $\lambda$ is applied to the standard deviation $\sigma_t$, which results in the robust threshold $\lambda\sigma_t$. Those samples that fall within the threshold range $[-\lambda\sigma_t, +\lambda\sigma_t]$ are treated as "accepted" samples; other samples are treated as "rejected" samples. With the introduction of the robust threshold, the algorithm will be stable and resistant to noise in the data. According to the error rate $\varepsilon_t$ of each weak learner's regression model, each weak learner WL$_t$ will be assigned one correspondingly computed weight $\alpha_t$.

For one regression problem, the performances of different weak learners may differ. The regression error distributions for different weak learners under the robust threshold are shown in Figure 1.

In Figure 1(a), the weak learner WL$_t$ generates a regression error distribution with a large error rate, where the standard deviation is relatively large. On the other hand, another weak learner WL$_{t+j}$ may generate an error distribution with a small standard deviation, as shown in Figure 1(b). Suppose the red triangular points represent "rejected" samples, whose regression errors exceed the specified robust threshold, that is, $\varepsilon_t = \sum_{i \in P} p_i^{(t)}$, where $P = \{i \mid |e_t(i) - \bar{e}_t| > \lambda\sigma_t,\ i \in [1, m]\}$, as described in Step (4) of the proposed algorithm. The green circular points represent the "accepted" samples, whose regression errors are less than the robust threshold. The weight vector $w_i^{(t)}$ for every "accepted" sample is dynamically changed for each weak learner WL$_t$, while that for "rejected" samples remains unchanged.

[Figure 1: Different regression error distributions for different weak learners under the robust threshold (probability versus error value, with marks at $\mu_t - \lambda\sigma_t$, $\mu_t$, and $\mu_t + \lambda\sigma_t$; rejected and accepted samples indicated). (a) Error distribution of weak learner WL$_t$ that results in a large threshold. (b) Error distribution of weak learner WL$_{t+j}$ that results in a small threshold.]

As illustrated in Figure 1, in terms of stability and accuracy, the regression capability of weak learner WL$_{t+j}$ is superior to that of WL$_t$. The robust threshold values for WL$_{t+j}$ and WL$_t$ are computed respectively, where the former is smaller than the latter, so as to discriminate their correspondingly different regression performances. The proposed method overcomes the limitation suffered by the existing methods, where the threshold value is set empirically. The critical factor used in the boosting process, the threshold, becomes robust and self-adaptive to the individual weak learners' performance on the input data samples. Therefore, the proposed robust AdaBoost.RT algorithm is capable of outputting the final hypothesis as an optimally weighted ensemble of the weak learners.

In the following, we show that the training error of the proposed robust AdaBoost.RT algorithm is bounded. One lemma needs to be given in order to prove the convergence of this algorithm.

Lemma 1. The convexity argument $x^d \leq 1 - (1 - x)d$ holds for any $x \geq 0$ and $0 \leq d \leq 1$.

Theorem 2. Suppose the improved adaptive AdaBoost.RT algorithm generates hypotheses with errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T < 1/2$. Then the error $E_{\mathrm{ensemble}}$ of the final hypothesis output by this algorithm is bounded above by
$$E_{\mathrm{ensemble}} \leq 2^{T} \prod_{t=1}^{T} \sqrt{\varepsilon_t (1 - \varepsilon_t)}. \tag{12}$$

Proof. In this proof, we need to transform the regression problem $X \rightarrow Y$ into binary classification problems $X, Y \rightarrow \{0, 1\}$. In the proposed improved adaptive AdaBoost.RT algorithm, the mean of the errors ($\bar{e}$) is assumed to be close to zero; thus, the dynamically adaptive thresholds can ignore the mean of the errors.

Let $I_t = \{i \mid |f_t(x_i) - y_i| < \lambda\sigma_t\}$; if $i \notin I_t$, then $I_t^i = 1$; otherwise, $I_t^i = 0$. The final hypothesis output $f$ makes a mistake on sample $i$ only if
$$\prod_{t=1}^{T} \beta_t^{-I_t^i} \geq \left( \prod_{t=1}^{T} \beta_t \right)^{-1/2}. \tag{13}$$

The final weight of any sample $i$ is
$$w_i^{T+1} = D(i) \prod_{t=1}^{T} \beta_t^{(1 - I_t^i)}. \tag{14}$$

Combining (13) and (14), the sum of the final weights is bounded from below by the sum of the final weights of the rejected samples. Consider
$$\sum_{i=1}^{m} w_i^{T+1} \geq \sum_{i \notin I_{T+1}} w_i^{T+1} \geq \left( \sum_{i \notin I_{T+1}} D(i) \right) \left( \prod_{t=1}^{T} \beta_t \right)^{1/2} = E_{\mathrm{ensemble}} \left( \prod_{t=1}^{T} \beta_t \right)^{1/2}, \tag{15}$$

where $E_{\mathrm{ensemble}}$ is the error of the final hypothesis output. Based on Lemma 1,
$$\sum_{i=1}^{m} w_i^{t+1} = \sum_{i=1}^{m} w_i^{t}\, \beta_t^{1 - I_t^i} \leq \sum_{i=1}^{m} w_i^{t} \left( 1 - (1 - \beta_t)(1 - I_t^i) \right) = \left( \sum_{i=1}^{m} w_i^{t} \right) \left( 1 - (1 - \varepsilon_t)(1 - \beta_t) \right). \tag{16}$$


Combining these inequalities for $t = 1, \ldots, T$, the following is obtained:
$$\sum_{i=1}^{m} w_i^{T+1} \leq \prod_{t=1}^{T} \left( 1 - (1 - \varepsilon_t)(1 - \beta_t) \right). \tag{17}$$

Combining (15) and (17), we obtain
$$E_{\mathrm{ensemble}} \leq \prod_{t=1}^{T} \frac{1 - (1 - \varepsilon_t)(1 - \beta_t)}{\sqrt{\beta_t}}. \tag{18}$$

Considering that all factors in the multiplication are positive, the minimization of the right-hand side can be achieved by minimizing each factor individually. Setting the derivative of the $t$th factor to zero gives $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$. Substituting this computed $\beta_t$ into (18) completes the proof.

Unlike the original AdaBoost.RT and its existing variants, the robust threshold in the proposed AdaBoost.RT algorithm is determined and self-adaptively adjusted according to the individual weak learners and the data samples. Through the analysis of the convergence of the proposed robust AdaBoost.RT algorithm, it is proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound. The study shows that the robust AdaBoost.RT algorithm proposed in this paper can overcome the limitations existing in the available AdaBoost.RT algorithms.

4. A Robust AdaBoost.RT Ensemble-Based Extreme Learning Machine

In this paper, a robust AdaBoost.RT ensemble-based extreme learning machine (RAE-ELM), which combines ELM with the robust AdaBoost.RT algorithm described in the previous section, is proposed to improve the robustness and stability of ELM. A set of $T$ ELMs is adopted as the "weak" learners. In the training phase, RAE-ELM utilizes the proposed robust AdaBoost.RT algorithm to train every ELM model and assign an ensemble weight accordingly, so that each ELM obtains its corresponding distribution based on the training output. The optimally weighted ensemble model of ELMs is the final hypothesis output used for making predictions on the testing dataset. The proposed RAE-ELM is illustrated in Figure 2.

4.1. Initialization. For the first weak learner, ELM$_1$ is supplied with $m$ training samples under a uniform distribution of weights, so that each sample has an equal opportunity to be chosen during the first training process for ELM$_1$.

4.2. Distribution Updating. The relative prediction error rates are used to evaluate the performance of this ELM. The prediction error of the $t$th ELM, ELM$_t$, for the input data samples could be represented as one statistical distribution $\mu_t \pm \lambda\sigma_t$, where $\mu_t$ stands for the expected value and $\lambda\sigma_t$ is defined as the robust threshold ($\sigma_t$ stands for the standard deviation, and the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). The robust threshold is applied to demarcate prediction errors as "accepted" or "rejected." If the prediction error of one particular sample falls into the region $\mu_t \pm \lambda\sigma_t$ that is bounded by the robust thresholds, the prediction of this sample is regarded as "accepted" for ELM$_t$, and vice versa for "rejected" predictions. The probabilities of the "rejected" predictions are accumulated to calculate the error rate $\varepsilon_t$. ELM attempts to achieve the $f_t(x)$ with a small error rate.

The robust AdaBoost.RT algorithm will then calculate the distribution for the next ELM$_{t+1}$. For every sample that is correctly predicted by the current ELM$_t$, the corresponding weight vector will be multiplied by the error rate function $\beta_t$; otherwise, the weight vector remains unchanged. This process is iterated for the next ELM$_{t+1}$ until the last learner ELM$_T$, unless $\varepsilon_t$ exceeds 0.5, because once the error rate is higher than 0.5, the AdaBoost algorithm does not converge and tends to overfit [21]. Hence, the error rate must be less than 0.5.

4.3. Decision Making on RAE-ELM. The weight updating parameter $\beta_t$ is used as an indicator of the regression effectiveness of ELM$_t$ in the current iteration. According to the relationship between $\varepsilon_t$ and $\beta_t$, if $\varepsilon_t$ increases, $\beta_t$ becomes larger as well, and RAE-ELM grants a small ensemble weight to that ELM$_t$. On the other hand, an ELM$_t$ with relatively superior regression performance is granted a larger ensemble weight. The hybrid RAE-ELM model combines the set of ELMs under different weights as the final hypothesis for decision making.
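Putting Sections 4.1-4.3 together, the following is a sketch of RAE-ELM training that reuses the illustrative `ELMRegressor` and `robust_boost_step` sketches given earlier; the weighted resampling used to realize the distribution $p^{(t)}$ is an assumption, since a plain ELM does not accept per-sample weights, and at least one successful boosting iteration is assumed.

```python
import numpy as np

def train_rae_elm(X, y, T=20, lam=0.5, n_hidden=100, C=1.0):
    """Sketch of RAE-ELM training: robust AdaBoost.RT over ELM weak learners."""
    m = len(y)
    w = np.full(m, 1.0 / m)                      # initial weight vector (Section 4.1)
    models, alphas = [], []
    for _ in range(T):
        p = w / w.sum()                          # distribution p^(t), eq. (11)
        idx = np.random.choice(m, size=m, replace=True, p=p)
        elm = ELMRegressor(n_hidden=n_hidden, C=C).fit(X[idx], y[idx])
        errors = elm.predict(X) - y              # e_t(i) = f_t(x_i) - y_i
        w_next, alpha_t, eps_t = robust_boost_step(errors, p, lam=lam)
        if alpha_t is None:                      # eps_t > 1/2: stop boosting
            break
        models.append(elm)
        alphas.append(alpha_t)
        w = w_next                               # distribution updating (Section 4.2)
    alphas = np.array(alphas) / np.sum(alphas)   # normalize ensemble weights

    def predict(X_new):
        # Decision making (Section 4.3): weighted combination of the ELM outputs
        preds = np.stack([mdl.predict(X_new) for mdl in models])
        return (alphas[:, None] * preds).sum(axis=0)

    return predict
```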

5. Performance Evaluation of RAE-ELM

In this section, the performance of the proposed RAE-ELM learning algorithm is compared with that of other popular algorithms on 14 real-world regression problems covering different domains from the UCI Machine Learning Repository [35]; the specifications of the benchmark datasets are shown in Table 1. The algorithms to be compared include the basic ELM [4], the original AdaBoost.RT based ELM (original Ada-ELM) [30], the modified self-adaptive AdaBoost.RT ELM (modified Ada-ELM) [31], support vector regression (SVR) [36], and least-squares support vector regression (LS-SVR) [37]. All the evaluations are conducted in a Matlab environment running on a Windows 7 machine with a 3.20 GHz CPU and 4 GB RAM.

In our experiments, all the input attributes are normalized into the range $[-1, 1]$, while the outputs are normalized into $[0, 1]$. The real-world benchmark datasets are embedded with noise, their distributions are unknown, and they range from small to large sizes and from low to high dimensions. For each trial of simulations, the whole dataset of the application is randomly partitioned into a training dataset and a testing dataset with the numbers of samples shown in Table 1, and 25% of the training samples are used as the validation dataset. Each partitioned training, validation, and testing dataset is kept fixed as input for all these algorithms.
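A minimal sketch of the scaling just described (the exact scaling formula is not given in the paper, so a standard min-max mapping is assumed; the function name is illustrative):

```python
import numpy as np

def normalize_dataset(X, y):
    """Min-max scale input attributes to [-1, 1] and outputs to [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    y_min, y_max = y.min(), y.max()
    X_scaled = 2.0 * (X - x_min) / (x_max - x_min) - 1.0   # attributes -> [-1, 1]
    y_scaled = (y - y_min) / (y_max - y_min)               # outputs    -> [0, 1]
    return X_scaled, y_scaled
```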


[Figure 2: Block diagram of RAE-ELM. A sequence of $m$ samples $(x_1, y_1), \ldots, (x_m, y_m)$ is fed to ELM$_1$, \ldots, ELM$_T$; each ELM$_t$ is trained on distribution $D_t(i)$, its error rate $\varepsilon_t$ is calculated, the distribution is updated for the next learner, and the boosted outputs are combined into the final hypothesis.]

For the RAE-ELM, basic ELM, original Ada-ELM, and modified Ada-ELM algorithms, the suitable numbers of hidden nodes are determined using the preserved validation dataset. The sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b)))$ is selected as the activation function in all these algorithms. Fifty trials of simulations have been conducted for each problem, with training, validation, and testing samples randomly split for each trial. The performances of the algorithms are evaluated using the average root mean square error (RMSE) in testing. The significantly better results are highlighted in boldface.

5.1. Model Selection. In ensemble algorithms, the number of networks in the ensemble needs to be determined. According to Occam's razor, excessively complex models are affected by statistical noise, whereas simpler models may capture the underlying structure better and may thus have better predictive performance. Therefore, the parameter $T$, which is the number of weak learners, need not be very large.

We define $\varepsilon_t$ in (12) as $\varepsilon_t = 0.5 - \gamma_t$, where $\gamma_t > 0$ is constant; this results in
$$E_{\mathrm{ensemble}} \leq \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} = e^{-\sum_{t=1}^{T} \mathrm{KL}(1/2 \,\|\, 1/2 - \gamma_t)} \leq e^{-2\sum_{t=1}^{T} \gamma_t^2}, \tag{19}$$

where KL is the Kullback-Leibler divergence. We then simplify (19) by using $0.5 - \gamma$ instead of $0.5 - \gamma_t$, that is, each $\gamma_t$ is set to be the same. We can get

$$E_{\mathrm{ensemble}} \leq \left(1 - 4\gamma^2\right)^{T/2} = e^{-T \cdot \mathrm{KL}(1/2 \,\|\, 1/2 - \gamma)} \leq e^{-2T\gamma^2}. \tag{20}$$


Table 1: Specification of real-world regression benchmark datasets.

Problems            | Training data | Testing data | Attributes
LVST                | 90            | 35           | 308
Yeast               | 837           | 645          | 7
Computer hardware   | 110           | 99           | 7
Abalone             | 3150          | 1027         | 7
Servo               | 95            | 72           | 4
Parkinson disease   | 3000          | 2875         | 21
Housing             | 350           | 156          | 13
Cloud               | 80            | 28           | 9
Auto price          | 110           | 49           | 9
Breast cancer       | 100           | 94           | 32
Balloon             | 1500          | 833          | 4
Auto-MPG            | 300           | 92           | 7
Bank                | 5000          | 3192         | 8
Census (house8L)    | 15000         | 7784         | 8

From (20), we can obtain the upper bound on the number of iterations of this algorithm. Consider
$$T \leq \left[ \frac{1}{2\gamma^2} \ln \frac{1}{E_{\mathrm{ensemble}}} \right]. \tag{21}$$
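As a quick numerical check of (21), using illustrative values not taken from the paper (a margin of $\gamma = 0.1$ per weak learner and a target ensemble error of 0.05), the right-hand side evaluates to about 150, so the required number of iterations stays modest:

```python
import math

def max_iterations(gamma, target_error):
    """Evaluate the bound (21): T <= (1 / (2*gamma^2)) * ln(1 / E_ensemble)."""
    return math.floor(math.log(1.0 / target_error) / (2.0 * gamma ** 2))

# Illustrative values (not from the paper): gamma = 0.1, target error = 0.05
print(max_iterations(0.1, 0.05))   # -> 149
```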

For RAE-ELM, the number of ELM networks needs to be determined. The number of ELM networks is set to 5, 10, 15, 20, 25, and 30 in our simulations, and the optimal parameter is selected as the one that results in the best average testing RMSE.

Besides, in our simulation trials, the relative factor $\lambda$ in RAE-ELM is a parameter that needs to be optimized within the range $\lambda \in (0, 1)$. We start simulations for $\lambda$ at 0.1 and increase it in intervals of 0.1. Table 2 shows examples of setting both $T$ and $\lambda$ for our simulation trials.

As illustrated in Table 2 and Figure 3, RAE-ELM with the sigmoid activation function can achieve good generalization performance on the Parkinson disease dataset as long as the number of ELM networks $T$ is larger than 15. For a given number of ELM networks, the RMSE is less sensitive to the variation of $\lambda$ and tends to be smaller when $\lambda$ is around 0.5. For a fair comparison, we set RAE-ELM with $T = 20$ and $\lambda = 0.5$ in the following experiments. For both the original Ada-ELM and the modified Ada-ELM, the ensemble model is unstable when the number of ELM networks is less than 7; the number of ELM networks is therefore also set to 20 for both of them.

We use the popular Gaussian kernel function $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma \|\mathbf{u} - \mathbf{v}\|^2)$ in both SVR and LS-SVR. As is well known, the performances of SVR and LS-SVR are sensitive to the combination $(C, \gamma)$. Hence, the cost parameter $C$ and the kernel parameter $\gamma$ need to be adjusted appropriately over a wide range so as to obtain good generalization performance. For each dataset, 50 different values of $C$ and 50 different values of $\gamma$, that is, 2500 pairs of $(C, \gamma)$, are applied as the adjustment parameters; the different values of $C$ and $\gamma$ are $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$. In both SVR and LS-SVR, the best performing combinations of $(C, \gamma)$ are selected for each dataset, as presented in Table 3.

Table 2: Performance of the proposed RAE-ELM with different values of T and λ for the Parkinson disease dataset.

λ    | T = 5  | T = 10 | T = 15 | T = 20 | T = 25 | T = 30
0.1  | 0.2472 | 0.2374 | 0.2309 | 0.2259 | 0.2251 | 0.2249
0.2  | 0.2469 | 0.2315 | 0.2276 | 0.2241 | 0.2249 | 0.2245
0.3  | 0.2447 | 0.2298 | 0.2248 | 0.2235 | 0.2230 | 0.2233
0.4  | 0.2426 | 0.2287 | 0.2249 | 0.2227 | 0.2224 | 0.2227
0.5  | 0.2428 | 0.2296 | 0.2251 | 0.2219 | 0.2216 | 0.2215
0.6  | 0.2427 | 0.2298 | 0.2246 | 0.2221 | 0.2214 | 0.2213
0.7  | 0.2422 | 0.2285 | 0.2257 | 0.2229 | 0.2217 | 0.2219
0.8  | 0.2441 | 0.2301 | 0.2265 | 0.2228 | 0.2225 | 0.2228
0.9  | 0.2461 | 0.2359 | 0.2312 | 0.2243 | 0.2237 | 0.2230

For the basic ELM and the other ELM-based ensemble methods, the sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b)))$ is selected as the activation function. The parameters $(C, L)$ need to be selected so as to achieve the best generalization performance, where the cost parameter $C$ is selected from the range $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$ and the values considered for the number of hidden nodes $L$ are $\{10, 20, \ldots, 1000\}$.

In addition, for the original AdaBoost.RT-based ensemble ELM, the threshold $\Phi$ should be chosen before seeing the datasets. In that method, the threshold $\Phi$ must be manually selected according to an empirical suggestion, and it is a sensitive factor affecting the regression performance. If $\Phi$ is too low, it is generally difficult to obtain a sufficient number of "accepted" samples; however, if $\Phi$ is too high, some wrong samples are treated as "accepted" ones and the ensemble model tends to be unstable. According to Shrestha and Solomatine's experiments, the threshold $\Phi$ shall be set between 0 and 0.4 in order to keep the ensemble model stable [30]. In our simulations, we incrementally set thresholds within the range from 0 to 0.4. The original Ada-ELM with threshold values at 0.1, 0.15, \ldots, 0.35, 0.4 could generate satisfactory results for all the regression problems, and the best performing original Ada-ELM is shown in boldface. What is more, the modified Ada-ELM algorithm needs an initial value $\Phi_0$ to calculate the subsequent thresholds in the iterations. Tian and Mao suggested setting the default initial value of $\Phi_0$ to 0.2 [31]. Considering that a manually fixed initial threshold is not related to the characteristics of the ELM prediction effect on the input dataset, the algorithm may not reach the best generalization performance. In our simulations, we compare the performances of different modified Ada-ELMs at correspondingly different initial threshold values $\Phi_0$ set to 0.1, 0.15, \ldots, 0.3, 0.35. The best performing modified Ada-ELM is also presented in Table 3.

5.2. Performance Comparisons between RAE-ELM and Other Learning Algorithms. In this subsection, the performance of the proposed RAE-ELM is compared with other learning algorithms, including the basic ELM [4], the original Ada-ELM [30], the modified Ada-ELM [31], support vector regression [36], and least-squares support vector regression (LS-SVR) [37].


[Figure 3: The average testing RMSE with different values of T and λ in RAE-ELM for the Parkinson disease dataset (surface plot of RMSE over T = 5-30 and λ = 0-1).]

Table 3: Parameters of RAE-ELM and other learning algorithms.

Datasets           | RAE-ELM (C, L)  | Basic ELM [4] (C, L) | Original Ada-ELM [30] (C, L, Φ) | Modified Ada-ELM [31] (C, L, Φ0) | SVR [36] (C, γ)   | LS-SVR [37] (C, γ)
LVST               | (2^13, 960)     | (2^5, 710)           | (2^-7, 530, 0.3)                | (2^-7, 460, 0.35)                | (2^17, 2^2)       | (2^8, 2^0)
Yeast              | (2^10, 540)     | (2^1, 340)           | (2^5, 710, 0.25)                | (2^-1, 800, 0.2)                 | (2^-7, 2^0)       | (2^-9, 2^3)
Computer hardware  | (2^-4, 160)     | (2^11, 530)          | (2^16, 240, 0.35)               | (2^10, 170, 0.25)                | (2^-9, 2^21)      | (2^0, 2^6)
Abalone            | (2^0, 150)      | (2^0, 200)           | (2^-22, 330, 0.2)               | (2^6, 610, 0.15)                 | (2^3, 2^18)       | (2^1, 2^14)
Servo              | (2^1, 340)      | (2^-6, 650)          | (2^0, 110, 0.3)                 | (2^9, 230, 0.2)                  | (2^-2, 2^2)       | (2^4, 2^-3)
Parkinson disease  | (2^9, 830)      | (2^-8, 460)          | (2^4, 370, 0.35)                | (2^-4, 590, 0.25)                | (2^1, 2^0)        | (2^1, 2^7)
Housing            | (2^-19, 270)    | (2^-9, 310)          | (2^-1, 440, 0.25)               | (2^17, 220, 0.2)                 | (2^21, 2^-4)      | (2^3, 2^12)
Cloud              | (2^-5, 590)     | (2^13, 880)          | (2^0, 1000, 0.2)                | (2^2, 760, 0.25)                 | (2^7, 2^5)        | (2^4, 2^-11)
Auto price         | (2^20, 310)     | (2^17, 430)          | (2^-2, 230, 0.35)               | (2^-5, 270, 0.35)                | (2^3, 2^23)       | (2^7, 2^-10)
Breast cancer      | (2^-3, 80)      | (2^-1, 160)          | (2^-6, 90, 0.15)                | (2^13, 290, 0.25)                | (2^6, 2^1)        | (2^6, 2^17)
Balloon            | (2^22, 160)     | (2^16, 190)          | (2^13, 250, 0.2)                | (2^17, 210, 0.2)                 | (2^1, 2^8)        | (2^0, 2^-7)
Auto-MPG           | (2^7, 630)      | (2^-10, 820)         | (2^21, 380, 0.35)               | (2^11, 680, 0.3)                 | (2^-2, 2^-1)      | (2^1, 2^5)
Bank               | (2^-1, 660)     | (2^0, 590)           | (2^2, 710, 0.3)                 | (2^-9, 450, 0.15)                | (2^10, 2^-4)      | (2^21, 2^-5)
Census (house8L)   | (2^3, 580)      | (2^5, 460)           | (2^1, 190, 0.3)                 | (2^3, 600, 0.25)                 | (2^5, 2^-6)       | (2^0, 2^-4)

The result comparisons of RAE-ELM and the other learning algorithms for real-world data regression problems are shown in Table 4.

Table 4 lists the averaged results of multiple trials of the four ELM-based algorithms (RAE-ELM, basic ELM, original Ada-ELM [30], and modified Ada-ELM [31]), SVR [36], and LS-SVR [37] for fourteen representative real-world data regression problems. The selected datasets include both large-scale and small-scale data, as well as high-dimensional and low-dimensional problems. The averaged testing RMSE obtained by RAE-ELM is the best among these six algorithms for all fourteen cases. For the original Ada-ELM, the performance is sensitive to the selection of the threshold value $\Phi$; the best performing original Ada-ELM models for different regression problems have correspondingly different threshold values, so the manual selection strategy is not good. The generalization performance of the modified Ada-ELM is, in general, better than that of the original Ada-ELM; however, the empirically suggested initial threshold value of 0.2 does not guarantee the best performing regression model.

The three AdaBoost.RT based ensemble ELMs (RAE-ELM, original Ada-ELM, and modified Ada-ELM) all perform better than the basic ELM, which verifies that an ensemble ELM using AdaBoost.RT can achieve better prediction accuracy than an individual ELM used as the predictor. The averaged generalization performance of the basic ELM is better than that of SVR, while it is slightly worse than that of LS-SVR.

To find the best performing original Ada-ELM or modified Ada-ELM model for a regression problem, since the threshold selection is unrelated to the input dataset and the ELM networks, the optimal parameter has to be searched by brute force: one must carry out a set of experiments with different (initial) threshold values and then search among them for the best ensemble ELM model. Such a process is time consuming. Moreover, the generalization performance of the ensemble ELMs optimized in this way can hardly be better than that of the proposed RAE-ELM. In fact, the proposed RAE-ELM is always the best performing learner among the six candidates for all fourteen real-world regression problems.


Table 4: Result comparisons of testing RMSE of RAE-ELM and other learning algorithms for real-world data regression problems.

Datasets           | RAE-ELM | Basic ELM [4] | Original Ada-ELM [30] | Modified Ada-ELM [31] | SVR [36] | LS-SVR [37]
LVST               | 0.2571  | 0.2854        | 0.2702                | 0.2653                | 0.2849   | 0.2801
Yeast              | 0.1071  | 0.1224        | 0.1195                | 0.1156                | 0.1238   | 0.1182
Computer hardware  | 0.0563  | 0.0839        | 0.0821                | 0.0726                | 0.0976   | 0.0771
Abalone            | 0.0731  | 0.0882        | 0.0793                | 0.0778                | 0.0890   | 0.0815
Servo              | 0.0727  | 0.0924        | 0.0884                | 0.0839                | 0.0966   | 0.0870
Parkinson disease  | 0.2219  | 0.2549        | 0.2513                | 0.2383                | 0.2547   | 0.2437
Housing            | 0.1018  | 0.1259        | 0.1163                | 0.1135                | 0.127    | 0.1185
Cloud              | 0.2897  | 0.3269        | 0.3118                | 0.3006                | 0.3316   | 0.3177
Auto price         | 0.0809  | 0.1036        | 0.0910                | 0.0894                | 0.1045   | 0.0973
Breast cancer      | 0.2463  | 0.2641        | 0.2601                | 0.2519                | 0.2657   | 0.2597
Balloon            | 0.0570  | 0.0639        | 0.0611                | 0.0592                | 0.0672   | 0.0618
Auto-MPG           | 0.0724  | 0.0862        | 0.0799                | 0.0781                | 0.0891   | 0.0857
Bank               | 0.0415  | 0.0551        | 0.0496                | 0.0453                | 0.0537   | 0.0498
Census (house8L)   | 0.0746  | 0.0820        | 0.0802                | 0.0795                | 0.0867   | 0.0813

6. Conclusion

In this paper, a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems is proposed, which combines ELM with the novel robust AdaBoost.RT algorithm. Combining the effective learner, ELM, with this novel ensemble method yields a hybrid method that inherits their intrinsic properties and achieves better prediction accuracy than an individual ELM used as the predictor. ELM tends to reach its solution straightforwardly, and the error rate of its regression predictions is in general much smaller than 0.5; therefore, selecting ELM as the "weak" learner can avoid overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the ensemble ELM.

The proposed robust AdaBoost.RT algorithm overcomes the limitations existing in the available AdaBoost.RT algorithm and its variants, where the threshold value is manually specified and may only be ideal for a very limited set of cases. The new robust AdaBoost.RT algorithm utilizes the statistical distribution of the approximation errors to dynamically determine a robust threshold. The robust threshold for each weak learner WL$_t$ is self-adjustable and is defined as the scaled standard deviation of the approximation errors, $\lambda\sigma_t$. We analyze the convergence of the proposed robust AdaBoost.RT algorithm, and it has been proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{\mathrm{ensemble}}$, is within a significantly superior bound.

The proposed RAE-ELM is robust with respect to the differences among various regression problems, and variations of the approximation error rates do not significantly affect its highly stable generalization performance. As one of the key parameters in the ensemble algorithm, the threshold value does not need any human intervention; instead, it is self-adjusted according to the real regression effect of the ELM networks on the input dataset. Such a mechanism enables RAE-ELM to make sensitive and adaptive adjustments to the intrinsic properties of the given regression problem. The experimental comparisons, in terms of stability and accuracy, among the six prevailing algorithms (RAE-ELM, basic ELM, original Ada-ELM, modified Ada-ELM, SVR, and LS-SVR) for regression problems verify that all the AdaBoost.RT based ensemble ELMs perform better than SVR and, more remarkably, that the proposed RAE-ELM always achieves the best performance. The boosting effect of the proposed method is not significant for small-sized and low-dimensional problems, as an individual classifier (ELM network) may already be sufficient to handle such problems well. It is worth pointing out that the proposed RAE-ELM performs better than the others especially for high-dimensional or large-sized datasets, which is a convincing indicator of good generalization performance.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Professor Guang-Bin Huang from Nanyang Technological University for providing inspiring comments and suggestions on our research. This work is financially supported by the University of Macau under Grant no. MYRG079(Y1-L2)-FST13-YZX.

References

[1] C. L. Philip Chen, J. Wang, C. H. Wang, and L. Chen, "A new learning algorithm for a fully connected fuzzy inference system (F-CONFIS)," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1741-1757, 2014.
[2] Z. Yang, P. K. Wong, C. M. Vong, J. Zhong, and J. Y. Liang, "Simultaneous-fault diagnosis of gas turbine generator systems using a pairwise-coupled probabilistic classifier," Mathematical Problems in Engineering, vol. 2013, Article ID 827128, 13 pages, 2013.
[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1-3, pp. 489-501, 2006.
[4] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513-529, 2012.
[5] S. C. Ng, C. C. Cheung, and S. H. Leung, "Magnified gradient function with deterministic weight modification in adaptive learning," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1411-1423, 2004.
[6] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[7] H. J. Rong, G. B. Huang, N. Sundararajan, and P. Saratchandran, "Online sequential fuzzy extreme learning machine for function approximation and classification problems," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 4, pp. 1067-1072, 2009.
[8] J. Cao, Z. Lin, and G. B. Huang, "Voting base online sequential extreme learning machine for multi-class classification," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '13), pp. 2327-2330, IEEE, 2013.
[9] J. Cao, Z. Lin, G. B. Huang, and N. Liu, "Voting based extreme learning machine," Information Sciences, vol. 185, no. 1, pp. 66-77, 2012.
[10] N. Y. Liang, G. B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411-1423, 2006.
[11] J. Luo, C. M. Vong, and P. K. Wong, "Sparse Bayesian extreme learning machine for multi-classification," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 836-843, 2014.
[12] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, vol. 2, MIT Press, Cambridge, UK, 2006.
[13] Y. Lan, Y. C. Soh, and G. B. Huang, "Ensemble of online sequential extreme learning machine," Neurocomputing, vol. 72, no. 13, pp. 3391-3395, 2009.
[14] N. Liu and H. Wang, "Ensemble based extreme learning machine," IEEE Signal Processing Letters, vol. 17, no. 8, pp. 754-757, 2010.
[15] X. Xue, M. Yao, Z. Wu, and J. Yang, "Genetic ensemble of extreme learning machine," Neurocomputing, vol. 129, no. 10, pp. 175-184, 2014.
[16] X. L. Wang, Y. Y. Chen, H. Zhao, and B. L. Lu, "Parallelized extreme learning machine ensemble based on min-max modular network," Neurocomputing, vol. 128, pp. 31-41, 2014.
[17] B. V. Dasarathy and B. V. Sheela, "A composite classifier system design: concepts and methodology," Proceedings of the IEEE, vol. 67, no. 5, pp. 708-713, 1979.
[18] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993-1001, 1990.
[19] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[20] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.
[21] R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms, MIT Press, Cambridge, Mass, USA, 2012.
[22] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[23] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79-87, 1991.
[24] V. Guruswami and A. Sahai, "Multiclass learning, boosting, and error-correcting codes," in Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 145-155, July 1999.
[25] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297-336, 1999.
[26] N. Li, "Multiclass boosting with repartitioning," in Proceedings of the 23rd International Conference on Machine Learning, pp. 569-576, ACM, June 2006.
[27] H. Drucker, "Improving regressors using boosting," in Proceedings of the 14th International Conference on Machine Learning, pp. 107-115, 1997.
[28] R. Avnimelech and N. Intrator, "Boosting regression estimators," Neural Computation, vol. 11, no. 2, pp. 499-520, 1999.
[29] D. P. Solomatine and D. L. Shrestha, "AdaBoost.RT: a boosting algorithm for regression problems," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 1163-1168, IEEE, 2004.
[30] D. L. Shrestha and D. P. Solomatine, "Experiments with AdaBoost.RT, an improved boosting scheme for regression," Neural Computation, vol. 18, no. 7, pp. 1678-1710, 2006.
[31] H.-X. Tian and Z.-Z. Mao, "An ensemble ELM based on modified AdaBoost.RT algorithm for predicting the temperature of molten steel in ladle furnace," IEEE Transactions on Automation Science and Engineering, vol. 7, no. 1, pp. 73-80, 2010.
[32] J. W. Cao, T. Chen, and J. Fan, "Fast online learning algorithm for landmark recognition based on BoW framework," in Proceedings of the 9th IEEE Conference on Industrial Electronics and Applications, Hangzhou, China, June 2014.
[33] J. Cao, Z. Lin, and G.-B. Huang, "Composite function wavelet neural networks with extreme learning machine," Neurocomputing, vol. 73, no. 7-9, pp. 1405-1416, 2010.
[34] R. Feely, Predicting stock market volatility using neural networks, B.A. (Mod) thesis, Trinity College Dublin, Dublin, Ireland, 2000.
[35] K. Bache and M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2014, http://archive.ics.uci.edu/ml.
[36] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155-161, 1997.
[37] J. A. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293-300, 1999.


Mathematical Problems in Engineering 5

data samples The proposed robust AdaBoostRT algorithmis described as follows

(1) Input the following

sequence of 119898 samples (1199091 1199101) (119909

119898 119910119898)

where output 119910 isin 119877weak learning algorithm (weak learner)maximum number of iterations (machines) 119879

(2) Initialize the following

iteration index 119905 = 1distribution119863

119905(119894) = 1119898 for all 119894

the weight vector 119908119905119894= 119863119905(119894) for all 119894

error rate 120576119905= 0

(3) Iterate while 119905 ⩽ 119879 the following

(1) Call weak learner WLt providing it with distri-bution

119901(119905)=

119908(119905)

119885119905

(11)

where 119885119905is a normalization factor chosen such

that 119901(119905) will be a distribution(2) Build the regression model 119891

119905(119909) rarr 119910

(3) Calculate each error 119890119905(119894) = 119891

119905(119909119894) minus 119910119894

(4) Calculate the error rate 120576119905= sum119894isin119875119901(119905)

119894 where

119875 = 119894 | |119890119905(119894)minus119890119905| gt 120582120590

119905 119894 isin [1 119898] 119890

119905stands for

the expected value and 120582120590119905is defined as robust

threshold (120590119905stands for the standard deviation

the relative factor 120582 is defined as 120582 isin (0 1))If 120576119905gt 12 then set 119879 = 119905 minus 1 and abort loop

(5) Set 120573119905= 120576119905(1 minus 120576

119905)

(6) Calculate contribution of 119891119905(119909) to the final

result 120572119905= minus log(120573

119905)

(7) Update the weight vectors

if 119894 notin 119875 then 119908(119905+1)119894

= 119908(119905)

119894120573119905

else 119908(119905+1)119894

= 119908(119905)

119894

(8) Set 119905 = 119905 + 1

(4) Normalize 1205721 120572

119879 such that sum119879

119905=1120572119905= 1

Output the final hypotheses 119891fin(119909) =

sum119879

119905=1120572119905119891119905(119909)

At each iteration of the proposed robust AdaBoostRTalgorithm the standard deviation of the approximation errordistribution is used as a criterion In the probability andstatistics theory the standard deviationmeasures the amountof variation or dispersion from the average If the data pointstend to be very close to themean value the standard deviationis low On the other hand if the data points are spread out

over a large range of values a high standard deviation will beresulted in

Standard deviation may be served as a measure ofuncertainty for a set of repeated predictions When decidingwhether predictions agree with their correspondingly truevalues the standard deviation of those predictions madeby the underlined approximation function is of crucialimportance if the averaged distance from the predictions tothe true values is large then the regressionmodel being testedprobably needs to be revised Because the sample points thatfall outside the range of values could reasonably be expectedto occur the prediction accuracy rate of the model is low

In the proposed robust AdaBoostRT algorithm theapproximation error of 119905th weak learner WLt for an inputdataset could be represented as one statistics distributionwithparameters 120583

119905plusmn 120582120590119905 where 120583

119905stands for the expected value

120590119905stands for the standard deviation and 120582 is an adjustable

relative factor that ranges from 0 to 1 The threshold valuefor WLt is defined by the scaled standard deviation 120582120590

119905

In the hybrid learning algorithm the trained weak learnersare assumed to be able to generate small prediction error(120576119905lt 12) For all 120583

119905 120583119905isin (0 minus 120578 0 + 120578) lim 120578 rarr

0 and 119905 = 1 119879 where 120578 denotes a small error limitapproaching zero The population mean of a regression errordistribution is closer to the targeted zero than elements in thepopulation Therefore the means of the obtained regressionerrors are fluctuating around zero within a small range Thestandard deviation 120590

119905 is solely determined by the individual

samples and the generalization performance of the weaklearnerWLt Generally 120590119905 is relatively large such that most ofthe outputs will be located within the range [minus120590

119905 +120590119905] which

tends to make the boosting process unstable To maintain astable adjusting of the threshold value a relative factor 120582 isapplied on the standard deviation 120590

119905 which results in the

robust threshold 120582120590119905 For those samples that fall within the

threshold range [minus120582120590119905 +120582120590119905] they are treated as ldquoacceptedrdquo

samples Other samples are treated as ldquorejectedrdquo samplesWith the introduction of the robust threshold the algorithmwill be stable and resistant to noise in the data Accordingto the error rate 120576

119905of each weak learnerrsquos regression model

each weak learner WLt will be assigned with one accordinglycomputed weight 120572

119905

For one regression problem the performances of differentweak learners may be differentThe regression error distribu-tions for different weak learners under robust threshold areshown in Figure 1

In Figure 1(a) the weak learner WLt generates a regres-sion error distribution with large error rate where thestandard deviation is relatively large On the other handanother weak learner WLt+j may generate an error distribu-tion with small standard deviations as shown in Figure 1(b)Suppose red triangle points represent ldquorejectedrdquo sampleswhose regression error rate is greater than the specified robustthreshold 120576

119905= sum119894isin119875119901(119905)

119894 where 119875 = 119894 | |119890

119905(119894) minus 119890

119905| gt 120582120590

119905 119894 isin

[1 119898] as described in Step (4) of the proposed algorithmThe green circular points represent those ldquoacceptedrdquo sampleswhich own regression errors less than the robust thresholdThe weight vector 119908119905

119894for every ldquoacceptedrdquo sample will be

6 Mathematical Problems in Engineering

Probability

Rejected sampleAccepted sample

Value120583t minus 120582120590t 120583t 120583t + 120582120590t

(a)

Value

Probability

120583t+j t+jminus 120582120590 120583t+j 120583t+j t+j+ 120582120590

Rejected sampleAccepted sample

(b)

Figure 1 Different regression error distributions for different weak learners under robust threshold (a) Error distribution of weak learnerWLt that results in a large threshold (b) Error distribution of weak learner WLt+j that results in a small threshold

dynamically changed for each weak learner WLt while thatfor ldquorejectedrdquo samples will be unchanged

As illustrated in Figure 1 in terms of stability andaccuracy the regression capability of weak learner WLt+jis superior to WLt The robust threshold values for WLt+jand WLt are computed respectively where the former issmaller than the latter to discriminate their correspondinglydifferent regression performances The proposed methodovercomes the limitation suffered by the existing methodswhere the threshold value is set empirically The criticalfactor used in the boosting process the threshold becomesrobust and self-adaptive to the individual weak learnersrsquoperformance on the input data samples Therefore the pro-posed robust AdaBoostRT algorithm is capable to output thefinal hypotheses in optimally weighted ensemble of the weaklearners

In the following we show that the training error of theproposed robust AdaBoostRT algorithm is as bounded Onelemma needs to be given in order to prove the convergence ofthis algorithm

Lemma 1 The convexity argument 119909119889 le 1 minus (1 minus 119909)119889 holdsfor any 119909 ge 0 and 0 le 119889 le 1

Theorem 2 The improved adaptive AdaBoostRT algorithmgenerates hypotheses with errors 120576

1 1205762 120576

119879lt 12 Then the

error 119864119890119899119904119890119898119887119897119890

of the final hypothesis output by this algorithmis bounded above by

119864119890119899119904119890119898119887119897119890

le 2119879

119879

prod

119905=1

radic120576119905(1 minus 120576

119905) (12)

Proof In this proof we need to transform the regressionproblem 119883 rarr 119884 into binary classification problems119883 119884 rarr 0 1 In the proposed improved adaptiveAdaBoostRT algorithm the mean of errors (119890) is assumed tobe closed to zero Thus the dynamically adaptive thresholdscan ignore the mean of errors

Let 119868119905= 119894 | |119891

119905(119909119894) minus 119910119894| lt 120582120590

119905 if 119894 notin 119868

119905119868119894

119905= 1 otherwise

119868119894

119905= 0The final hypothesis output 119891makes a mistake on sample

119894 only if

119879

prod

119905=1

120573minus119868119894

119905

119905ge (

119879

prod

119905=1

120573119905)

minus12

(13)

The final weight of any sample 119894 is

119908119879+1

119894= 119863 (119894)

119879

prod

119905=1

120573(1minus119868119894

119905)

119905 (14)

Combining (13) and (14) the sum of the final weights isbounded by the sum of the final weights of rejected samplesConsider

119898

sum

119894=1

119908119879+1

119894ge sum

119894notin119868119894

119879+1

119908119879+1

119894ge ( sum

119894notin119868119894

119879+1

119863 (119894))(

119879

prod

119905=1

120573119905)

12

= 119864ensemble(119879

prod

119905=1

120573119905)

12

(15)

where the 119864ensemble is the error of the final hypothesis outputBased on Lemma 1

119898

sum

119894=1

119908119905+1

119894=

119898

sum

119894=1

119908119905

1198941205731minus119868119894

119905

119905le

119898

sum

119894=1

119908119905

119894(1 minus (1 minus 120573

119905) (1 minus 119868

119894

119905))

= (

119898

sum

119894=1

119908119905

119894)(1 minus (1 minus 120576

119905) (1 minus 120573

119905))

(16)

Mathematical Problems in Engineering 7

Combining those inequalities for $t = 1, \ldots, T$, the following inequality is obtained:

$$\sum_{i=1}^{m} w_i^{T+1} \le \prod_{t=1}^{T} \left( 1 - (1 - \varepsilon_t)(1 - \beta_t) \right). \qquad (17)$$

Combining (15) and (17), we obtain that

$$E_{ensemble} \le \prod_{t=1}^{T} \frac{1 - (1 - \varepsilon_t)(1 - \beta_t)}{\sqrt{\beta_t}}. \qquad (18)$$

Considering that all factors in the multiplication are positive, the minimization of the right-hand side can be reduced to minimizing each factor individually. Setting the derivative of the $t$th factor to zero gives $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$. Substituting this computed $\beta_t$ into (18) completes the proof.
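The last substitution can be written out explicitly (this intermediate step is added here for clarity): with $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$, each factor in (18) becomes

$$\frac{1 - (1 - \varepsilon_t)(1 - \beta_t)}{\sqrt{\beta_t}} = \frac{2\varepsilon_t}{\sqrt{\varepsilon_t / (1 - \varepsilon_t)}} = 2\sqrt{\varepsilon_t (1 - \varepsilon_t)},$$

and taking the product over $t = 1, \ldots, T$ yields exactly the bound (12).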

Unlike the original AdaBoost.RT and its existing variants, the robust threshold in the proposed AdaBoost.RT algorithm is determined and self-adaptively adjusted according to the individual weak learners and the data samples. Through the analysis of the convergence of the proposed robust AdaBoost.RT algorithm, it has been proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{ensemble}$, is within a significantly superior bound. The study shows that the robust AdaBoost.RT algorithm proposed in this paper can overcome the limitations existing in the available AdaBoost.RT algorithms.

4. A Robust AdaBoost.RT Ensemble-Based Extreme Learning Machine

In this paper, a robust AdaBoost.RT ensemble-based extreme learning machine (RAE-ELM), which combines ELM with the robust AdaBoost.RT algorithm described in the previous section, is proposed to improve the robustness and stability of ELM. A set of $T$ ELMs is adopted as the "weak" learners. In the training phase, RAE-ELM utilizes the proposed robust AdaBoost.RT algorithm to train every ELM model and assign an ensemble weight accordingly, so that each ELM receives its corresponding distribution based on the training output. The optimally weighted ensemble model of ELMs is the final hypothesis output used for making predictions on the testing dataset. The proposed RAE-ELM is illustrated in Figure 2.

4.1. Initialization. For the first weak learner, $\mathrm{ELM}_1$ is supplied with $m$ training samples under a uniform distribution of weights, so that each sample has an equal opportunity to be chosen during the first training process for $\mathrm{ELM}_1$.

4.2. Distribution Updating. The relative prediction error rates are used to evaluate the performance of each ELM. The prediction error of the $t$th ELM, $\mathrm{ELM}_t$, on the input data samples can be represented by the statistical distribution $\mu_t \pm \lambda\sigma_t$, where $\mu_t$ stands for the expected value and $\lambda\sigma_t$ is defined as the robust threshold ($\sigma_t$ stands for the standard deviation, and the relative factor $\lambda$ is defined as $\lambda \in (0, 1)$). The robust threshold is applied to demarcate prediction errors as "accepted" or "rejected". If the prediction error of one particular sample falls into the region $\mu_t \pm \lambda\sigma_t$ that is bounded by the robust thresholds, the prediction of this sample is regarded as "accepted" for $\mathrm{ELM}_t$, and vice versa for "rejected" predictions. The probabilities of the "rejected" predictions are accumulated to calculate the error rate $\varepsilon_t$. Each ELM attempts to achieve an $f_t(x)$ with a small error rate.
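To make this step concrete, the following Python sketch shows how the robust threshold $\mu_t \pm \lambda\sigma_t$ and the error rate $\varepsilon_t$ of one weak learner could be computed. The function name and the use of the weighted mean and standard deviation under the current distribution $D_t$ are illustrative choices, not taken from the original implementation.

```python
import numpy as np

def robust_error_rate(pred_errors, weights, lam=0.5):
    """Error rate of one weak learner under the robust threshold mu_t +/- lam * sigma_t.

    pred_errors : prediction errors f_t(x_i) - y_i on the m training samples
    weights     : current sample distribution D_t(i) (non-negative, summing to 1)
    lam         : relative factor lambda in (0, 1)
    """
    mu = np.average(pred_errors, weights=weights)                      # expected value mu_t
    sigma = np.sqrt(np.average((pred_errors - mu) ** 2, weights=weights))
    # a prediction is "accepted" if its error falls inside mu_t +/- lam * sigma_t
    accepted = np.abs(pred_errors - mu) < lam * sigma
    # the error rate eps_t accumulates the probabilities of the "rejected" predictions
    eps_t = weights[~accepted].sum()
    return eps_t, accepted
```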

The robust AdaBoost.RT algorithm will then calculate the distribution for the next learner, $\mathrm{ELM}_{t+1}$. For every sample that is correctly predicted by the current $\mathrm{ELM}_t$, the corresponding weight will be multiplied by the error rate function $\beta_t$; otherwise, the weight remains unchanged. This process is iterated for the next $\mathrm{ELM}_{t+1}$ until the last learner $\mathrm{ELM}_T$, unless $\varepsilon_t$ becomes higher than 0.5, because once the error rate exceeds 0.5 the AdaBoost algorithm does not converge and tends to overfit [21]. Hence the error rate must be less than 0.5.

4.3. Decision Making on RAE-ELM. The weight-updating parameter $\beta_t$ is used as an indicator of the regression effectiveness of $\mathrm{ELM}_t$ in the current iteration. According to the relationship between $\varepsilon_t$ and $\beta_t$, if $\varepsilon_t$ increases, $\beta_t$ becomes larger as well, and RAE-ELM grants a small ensemble weight to that $\mathrm{ELM}_t$. On the other hand, an $\mathrm{ELM}_t$ with relatively superior regression performance is granted a larger ensemble weight. The hybrid RAE-ELM model combines the set of ELMs under different weights as the final hypothesis for decision making.
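Putting Sections 4.1-4.3 together, a minimal sketch of the RAE-ELM training and prediction loop might look as follows. It assumes an ELM regressor exposing `fit`/`predict` with a `sample_weight` argument (such as the sketch given in Section 5), reuses the `robust_error_rate` helper above, renormalizes the weights after each round, and takes the ensemble weight of $\mathrm{ELM}_t$ as $\ln(1/\beta_t)$, the usual AdaBoost-style choice; the paper only states that learners with smaller $\beta_t$ receive larger weights, so these details are illustrative assumptions.

```python
import numpy as np

def train_rae_elm(X, y, make_elm, T=20, lam=0.5):
    """Illustrative sketch of the RAE-ELM boosting loop (not the authors' code)."""
    m = X.shape[0]
    weights = np.full(m, 1.0 / m)             # Section 4.1: uniform initial distribution D_1(i)
    learners, alphas = [], []
    for t in range(T):
        elm = make_elm()                       # a fresh "weak" ELM network
        elm.fit(X, y, sample_weight=weights)
        eps_t, accepted = robust_error_rate(elm.predict(X) - y, weights, lam)
        if eps_t >= 0.5 or eps_t == 0.0:       # AdaBoost does not converge once eps_t >= 0.5
            break
        beta_t = eps_t / (1.0 - eps_t)         # weight-updating parameter beta_t
        weights[accepted] *= beta_t            # Section 4.2: shrink weights of "accepted" samples
        weights /= weights.sum()               # keep D_{t+1} a probability distribution
        learners.append(elm)
        alphas.append(np.log(1.0 / beta_t))    # assumed ensemble weight: larger when beta_t is small
    return learners, np.asarray(alphas)

def predict_rae_elm(learners, alphas, X):
    preds = np.stack([elm.predict(X) for elm in learners])
    return np.average(preds, axis=0, weights=alphas)   # weighted combination (Section 4.3)
```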

5. Performance Evaluation of RAE-ELM

In this section, the performance of the proposed RAE-ELM learning algorithm is compared with other popular algorithms on 14 real-world regression problems covering different domains from the UCI Machine Learning Repository [35]; the specifications of the benchmark datasets are shown in Table 1. The algorithms to be compared include basic ELM [4], the original AdaBoost.RT based ELM (original Ada-ELM) [30], the modified self-adaptive AdaBoost.RT ELM (modified Ada-ELM) [31], support vector regression (SVR) [36], and least-square support vector regression (LS-SVR) [37]. All the evaluations are conducted in the Matlab environment running on a Windows 7 machine with a 3.20 GHz CPU and 4 GB RAM.

In our experiments, all the input attributes are normalized into the range $[-1, 1]$, while the outputs are normalized into $[0, 1]$. The real-world benchmark datasets are embedded with noise and their distributions are unknown; they cover small and large sizes as well as low and high dimensions. For each trial of simulations, the whole data set of the application is randomly partitioned into a training dataset and a testing dataset with the numbers of samples shown in Table 1, and 25% of the training data samples are used as the validation dataset. Each partitioned training, validation, and testing dataset is kept fixed as input for all these algorithms.
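As a small illustration of this preprocessing step, the min-max scaling to the stated ranges could be written as below; the paper does not give its preprocessing code, so this is only a sketch.

```python
import numpy as np

def min_max_scale(data, lo, hi):
    """Linearly rescale each column of `data` into the interval [lo, hi]."""
    d_min, d_max = data.min(axis=0), data.max(axis=0)
    span = np.where(d_max > d_min, d_max - d_min, 1.0)   # guard against constant columns
    return lo + (hi - lo) * (data - d_min) / span

# inputs to [-1, 1] and outputs to [0, 1], as described above:
# X_scaled = min_max_scale(X, -1.0, 1.0)
# y_scaled = min_max_scale(y.reshape(-1, 1), 0.0, 1.0).ravel()
```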


Figure 2: Block diagram of RAE-ELM.

For the RAE-ELM, basic ELM, original Ada-ELM, and modified Ada-ELM algorithms, the suitable numbers of hidden nodes are determined using the preserved validation dataset, respectively. The sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b)))$ is selected as the activation function in all the algorithms. Fifty trials of simulations have been conducted for each problem, with training, validation, and testing samples randomly split for each trial. The performances of the algorithms are verified using the average root mean square error (RMSE) in testing. The significantly better results are highlighted in boldface.
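For completeness, a compact sketch of a basic single-hidden-layer ELM regressor with the sigmoid activation above is shown below: hidden-layer parameters are drawn at random and only the output weights are solved for in closed form, following [3, 4]. The class interface, including the optional `sample_weight` argument used by the boosting sketch in Section 4, is an illustrative assumption rather than the authors' code.

```python
import numpy as np

class SigmoidELM:
    """Minimal single-hidden-layer ELM regressor (illustrative sketch)."""

    def __init__(self, n_hidden=100, C=1.0, rng=None):
        self.L = n_hidden                       # number of hidden nodes L
        self.C = C                              # cost (regularization) parameter C
        self.rng = rng if rng is not None else np.random.default_rng()

    def _hidden(self, X):
        # G(a, b, x) = 1 / (1 + exp(-(a . x + b))) for each hidden node
        return 1.0 / (1.0 + np.exp(-(X @ self.A + self.b)))

    def fit(self, X, y, sample_weight=None):
        self.A = self.rng.uniform(-1, 1, size=(X.shape[1], self.L))   # random input weights
        self.b = self.rng.uniform(-1, 1, size=self.L)                 # random hidden biases
        H = self._hidden(X)
        if sample_weight is not None:                                 # optional weighted least squares
            sw = np.sqrt(sample_weight)[:, None]
            H, y = H * sw, y * sw.ravel()
        # regularized least squares for the output weights: (I/C + H^T H) beta = H^T y
        self.beta = np.linalg.solve(np.eye(self.L) / self.C + H.T @ H, H.T @ y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta
```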

5.1. Model Selection. In ensemble algorithms, the number of networks in the ensemble needs to be determined. According to Occam's Razor, excessively complex models are affected by statistical noise, whereas simpler models may capture the underlying structure better and may thus have better predictive performance. Therefore, the parameter $T$, which is the number of weak learners, need not be very large.

We define $\varepsilon_t$ in (12) as $\varepsilon_t = 0.5 - \gamma_t$, where $\gamma_t > 0$ is constant; it results in

$$E_{ensemble} \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} = e^{-\sum_{t=1}^{T} \mathrm{KL}(1/2 \,\|\, 1/2 - \gamma_t)} \le e^{-2 \sum_{t=1}^{T} \gamma_t^2}, \qquad (19)$$

where KL is the Kullback-Leibler divergence. We then simplify (19) by using $0.5 - \gamma$ instead of $0.5 - \gamma_t$; that is, each $\gamma_t$ is set to be the same. We get

$$E_{ensemble} \le \left( 1 - 4\gamma^2 \right)^{T/2} = e^{-T \cdot \mathrm{KL}(1/2 \,\|\, 1/2 - \gamma)} \le e^{-2T\gamma^2}. \qquad (20)$$


Table 1: Specification of real-world regression benchmark datasets.

Problems              Training data   Testing data   Attributes
LVST                  90              35             308
Yeast                 837             645            7
Computer hardware     110             99             7
Abalone               3150            1027           7
Servo                 95              72             4
Parkinson disease     3000            2875           21
Housing               350             156            13
Cloud                 80              28             9
Auto price            110             49             9
Breast cancer         100             94             32
Balloon               1500            833            4
Auto-MPG              300             92             7
Bank                  5000            3192           8
Census (house8L)      15000           7784           8

From (20) we can obtain the upper bound on the number of iterations of this algorithm. Consider

$$T \le \left[ \frac{1}{2\gamma^2} \ln \frac{1}{E_{ensemble}} \right]. \qquad (21)$$
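As a quick numerical illustration of (21) (the values below are hypothetical, not from the paper): if every weak learner has an edge of $\gamma = 0.2$ and the target ensemble error is 0.05, the bound gives $T \le \lceil \ln(20)/0.08 \rceil = 38$ rounds, so a few tens of weak learners suffice.

```python
import math

def max_iterations(gamma, target_error):
    """Upper bound (21) on the number of boosting rounds T."""
    return math.ceil(math.log(1.0 / target_error) / (2.0 * gamma ** 2))

print(max_iterations(0.2, 0.05))   # -> 38
```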

For RAE-ELM, the number of ELM networks needs to be determined. The number of ELM networks is set to 5, 10, 15, 20, 25, and 30 in our simulations, and the optimal parameter is selected as the one which results in the best average RMSE in testing.

Besides, in our simulation trials, the relative factor $\lambda$ in RAE-ELM is a parameter which needs to be optimized within the range $\lambda \in (0, 1)$. We start the simulations at $\lambda = 0.1$ and increase it in steps of 0.1. Table 2 shows examples of setting both $T$ and $\lambda$ for our simulation trials.

As illustrated in Table 2 and Figure 3, RAE-ELM with the sigmoid activation function could achieve good generalization performance for the Parkinson disease dataset as long as the number of ELM networks $T$ is larger than 15. For a given number of ELM networks, the RMSE is less sensitive to the variation of $\lambda$ and tends to be smaller when $\lambda$ is around 0.5. For a fair comparison, we set RAE-ELM with $T = 20$ and $\lambda = 0.5$ in the following experiments. For both original Ada-ELM and modified Ada-ELM, the ensemble model is unstable when the number of ELM networks is less than 7. The number of ELM networks is therefore also set to 20 for both original Ada-ELM and modified Ada-ELM.

We use the popular Gaussian kernel function $K(\mathbf{u}, \mathbf{v}) = \exp(-\gamma \|\mathbf{u} - \mathbf{v}\|^2)$ in both SVR and LS-SVR. As is well known, the performances of SVR and LS-SVR are sensitive to the combination of $(C, \gamma)$. Hence, the cost parameter $C$ and the kernel parameter $\gamma$ need to be adjusted appropriately over a wide range so as to obtain good generalization performance. For each data set, 50 different values of $C$ and 50 different values of $\gamma$, that is, 2500 pairs of $(C, \gamma)$, are applied as the adjustment parameters; the candidate values of $C$ and $\gamma$ are $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$.

Table 2: Performance of the proposed RAE-ELM with different values of $T$ and $\lambda$ for the Parkinson disease dataset.

λ \ T   5        10       15       20       25       30
0.1     0.2472   0.2374   0.2309   0.2259   0.2251   0.2249
0.2     0.2469   0.2315   0.2276   0.2241   0.2249   0.2245
0.3     0.2447   0.2298   0.2248   0.2235   0.2230   0.2233
0.4     0.2426   0.2287   0.2249   0.2227   0.2224   0.2227
0.5     0.2428   0.2296   0.2251   0.2219   0.2216   0.2215
0.6     0.2427   0.2298   0.2246   0.2221   0.2214   0.2213
0.7     0.2422   0.2285   0.2257   0.2229   0.2217   0.2219
0.8     0.2441   0.2301   0.2265   0.2228   0.2225   0.2228
0.9     0.2461   0.2359   0.2312   0.2243   0.2237   0.2230

In both SVR and LS-SVR, the best performing combinations of $(C, \gamma)$ are selected for each data set, as presented in Table 3.
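Such a grid search can be sketched with scikit-learn's SVR as below; this is an illustrative reconstruction rather than the authors' setup (the paper selects parameters on a held-out validation set, whereas `cv=5` here is only a stand-in, and LS-SVR would require a separate least-squares implementation).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# exponents -24 ... 25 give the 50 candidate values {2^-24, ..., 2^25} described above
param_grid = {
    "C":     [2.0 ** k for k in range(-24, 26)],
    "gamma": [2.0 ** k for k in range(-24, 26)],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
# search.fit(X_train, y_train)                        # fit on the training split
# best_C, best_gamma = search.best_params_["C"], search.best_params_["gamma"]
```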

For basic ELM and the other ELM-based ensemble methods, the sigmoid function $G(\mathbf{a}, b, \mathbf{x}) = 1/(1 + \exp(-(\mathbf{a} \cdot \mathbf{x} + b)))$ is selected as the activation function. The parameters $(C, L)$ need to be selected so as to achieve the best generalization performance, where the cost parameter $C$ is selected from the range $\{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$ and the candidate numbers of hidden nodes $L$ are $\{10, 20, \ldots, 1000\}$.

In addition, for the original AdaBoost.RT-based ensemble ELM, the threshold $\Phi$ should be chosen before seeing the datasets. In the original AdaBoost.RT-based ensemble ELM, the threshold $\Phi$ is required to be manually selected according to an empirical suggestion, and it is a sensitive factor affecting the regression performance. If $\Phi$ is too low, then it is generally difficult to obtain a sufficient number of "accepted" samples. However, if $\Phi$ is too high, some wrong samples are treated as "accepted" ones and the ensemble model tends to be unstable. According to Shrestha and Solomatine's experiments, the threshold $\Phi$ shall be defined between 0 and 0.4 in order to make the ensemble model stable [30]. In our simulations, we incrementally set thresholds within the range from 0 to 0.4. The original Ada-ELM with threshold values of 0.1, 0.15, ..., 0.35, 0.4 could generate satisfactory results for all the regression problems, and the best performing original Ada-ELM is shown in boldface. What is more, the modified Ada-ELM algorithm needs to select an initial value $\Phi_0$ to calculate the subsequent thresholds in the iterations. Tian and Mao suggested setting the default initial value of $\Phi_0$ to 0.2 [31]. Considering that the manually fixed initial threshold is not related to the characteristics of the ELM prediction effect on the input dataset, the algorithm may not reach the best generalization performance. In our simulations, we compare the performances of different modified Ada-ELMs at correspondingly different initial threshold values $\Phi_0$ set to 0.1, 0.15, ..., 0.3, 0.35. The best performing modified Ada-ELM is also presented in Table 3.

5.2. Performance Comparisons between RAE-ELM and Other Learning Algorithms. In this subsection, the performance of the proposed RAE-ELM is compared with other learning algorithms, including basic ELM [4], original Ada-ELM [30], modified Ada-ELM [31], support vector regression [36], and least-square support vector regression [37].


Figure 3: The average testing RMSE with different values of $T$ and $\lambda$ in RAE-ELM for the Parkinson disease dataset.

Table 3: Parameters of RAE-ELM and other learning algorithms.

Datasets            RAE-ELM (C, L)   Basic ELM [4] (C, L)   Original Ada-ELM [30] (C, L, Φ)   Modified Ada-ELM [31] (C, L, Φ0)   SVR [36] (C, γ)   LS-SVR [37] (C, γ)
LVST                (2^13, 960)      (2^5, 710)             (2^-7, 530, 0.3)                  (2^-7, 460, 0.35)                  (2^17, 2^2)       (2^8, 2^0)
Yeast               (2^10, 540)      (2^1, 340)             (2^5, 710, 0.25)                  (2^-1, 800, 0.2)                   (2^-7, 2^0)       (2^-9, 2^3)
Computer hardware   (2^-4, 160)      (2^11, 530)            (2^16, 240, 0.35)                 (2^10, 170, 0.25)                  (2^-9, 2^21)      (2^0, 2^6)
Abalone             (2^0, 150)       (2^0, 200)             (2^-22, 330, 0.2)                 (2^6, 610, 0.15)                   (2^3, 2^18)       (2^1, 2^14)
Servo               (2^1, 340)       (2^-6, 650)            (2^0, 110, 0.3)                   (2^9, 230, 0.2)                    (2^-2, 2^2)       (2^4, 2^-3)
Parkinson disease   (2^9, 830)       (2^-8, 460)            (2^4, 370, 0.35)                  (2^-4, 590, 0.25)                  (2^1, 2^0)        (2^1, 2^7)
Housing             (2^-19, 270)     (2^-9, 310)            (2^-1, 440, 0.25)                 (2^17, 220, 0.2)                   (2^21, 2^-4)      (2^3, 2^12)
Cloud               (2^-5, 590)      (2^13, 880)            (2^0, 1000, 0.2)                  (2^2, 760, 0.25)                   (2^7, 2^5)        (2^4, 2^-11)
Auto price          (2^20, 310)      (2^17, 430)            (2^-2, 230, 0.35)                 (2^-5, 270, 0.35)                  (2^3, 2^23)       (2^7, 2^-10)
Breast cancer       (2^-3, 80)       (2^-1, 160)            (2^-6, 90, 0.15)                  (2^13, 290, 0.25)                  (2^6, 2^1)        (2^6, 2^17)
Balloon             (2^22, 160)      (2^16, 190)            (2^13, 250, 0.2)                  (2^17, 210, 0.2)                   (2^1, 2^8)        (2^0, 2^-7)
Auto-MPG            (2^7, 630)       (2^-10, 820)           (2^21, 380, 0.35)                 (2^11, 680, 0.3)                   (2^-2, 2^-1)      (2^1, 2^5)
Bank                (2^-1, 660)      (2^0, 590)             (2^2, 710, 0.3)                   (2^-9, 450, 0.15)                  (2^10, 2^-4)      (2^21, 2^-5)
Census (house8L)    (2^3, 580)       (2^5, 460)             (2^1, 190, 0.3)                   (2^3, 600, 0.25)                   (2^5, 2^-6)       (2^0, 2^-4)

The result comparisons of RAE-ELM and the other learning algorithms on the real-world data regression problems are shown in Table 4.

Table 4 lists the averaged results over multiple trials of the four ELM-based algorithms (RAE-ELM, basic ELM, original Ada-ELM [30], and modified Ada-ELM [31]), SVR [36], and LS-SVR [37] for the fourteen representative real-world data regression problems. The selected datasets include both large-scale and small-scale data as well as high-dimensional and low-dimensional problems. It is easy to see that the averaged testing RMSEs obtained by RAE-ELM are always the best among these six algorithms for all fourteen cases. For original Ada-ELM, the performance is sensitive to the selection of the threshold value $\Phi$; the best performing original Ada-ELM models for different regression problems have correspondingly different threshold values, so the manual selection strategy is not satisfactory. The generalization performance of modified Ada-ELM is in general better than that of original Ada-ELM; however, the empirically suggested initial threshold value of 0.2 does not guarantee a mapping to the best performing regression model.

The three AdaBoost.RT based ensemble ELMs (RAE-ELM, original Ada-ELM, and modified Ada-ELM) all perform better than the basic ELM, which verifies that an ensemble ELM using AdaBoost.RT can achieve better prediction accuracy than using an individual ELM as the predictor. The averaged generalization performance of basic ELM is better than that of SVR, while it is slightly worse than that of LS-SVR.

To find the best performing original Ada-ELM or modified Ada-ELM model for a regression problem, the optimal threshold has to be searched by brute force, since the input dataset and the ELM networks are not related to the threshold selection. One needs to carry out a set of experiments with different (initial) threshold values and then search among them for the best ensemble ELM model, which is time consuming. Moreover, the generalization performance of the optimized ensemble ELMs using original Ada-ELM or modified Ada-ELM can hardly be better than that of the proposed RAE-ELM. In fact, the proposed RAE-ELM is the best performing learner among the six candidates for all fourteen real-world regression problems.


Table 4: Result comparisons of testing RMSE of RAE-ELM and other learning algorithms for real-world data regression problems.

Datasets            RAE-ELM   Basic ELM [4]   Original Ada-ELM [30]   Modified Ada-ELM [31]   SVR [36]   LS-SVR [37]
LVST                0.2571    0.2854          0.2702                  0.2653                  0.2849     0.2801
Yeast               0.1071    0.1224          0.1195                  0.1156                  0.1238     0.1182
Computer hardware   0.0563    0.0839          0.0821                  0.0726                  0.0976     0.0771
Abalone             0.0731    0.0882          0.0793                  0.0778                  0.0890     0.0815
Servo               0.0727    0.0924          0.0884                  0.0839                  0.0966     0.0870
Parkinson disease   0.2219    0.2549          0.2513                  0.2383                  0.2547     0.2437
Housing             0.1018    0.1259          0.1163                  0.1135                  0.127      0.1185
Cloud               0.2897    0.3269          0.3118                  0.3006                  0.3316     0.3177
Auto price          0.0809    0.1036          0.0910                  0.0894                  0.1045     0.0973
Breast cancer       0.2463    0.2641          0.2601                  0.2519                  0.2657     0.2597
Balloon             0.0570    0.0639          0.0611                  0.0592                  0.0672     0.0618
Auto-MPG            0.0724    0.0862          0.0799                  0.0781                  0.0891     0.0857
Bank                0.0415    0.0551          0.0496                  0.0453                  0.0537     0.0498
Census (house8L)    0.0746    0.0820          0.0802                  0.0795                  0.0867     0.0813

6. Conclusion

In this paper, a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems is proposed, which combines ELM with the novel robust AdaBoost.RT algorithm. Combining the effective learner (ELM) with the novel ensemble method (the robust AdaBoost.RT algorithm) constructs a hybrid method that inherits their intrinsic properties and achieves better prediction accuracy than using only an individual ELM as the predictor. ELM tends to reach the solutions straightforwardly, and its regression error rate is in general much smaller than 0.5; therefore, selecting ELM as the "weak" learner can avoid overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the ensemble ELM.

The proposed robust AdaBoost.RT algorithm overcomes the limitations existing in the available AdaBoost.RT algorithm and its variants, where the threshold value is manually specified and may only be ideal for a very limited set of cases. The new robust AdaBoost.RT algorithm utilizes the statistical distribution of the approximation errors to dynamically determine a robust threshold. The robust threshold for each weak learner $\mathrm{WL}_t$ is self-adjustable and is defined as the scaled standard deviation of the approximation errors, $\lambda\sigma_t$. We analyze the convergence of the proposed robust AdaBoost.RT algorithm, and it has been proved that the error of the final hypothesis output by the proposed ensemble algorithm, $E_{ensemble}$, is within a significantly superior bound.

The proposed RAE-ELM is robust with respect to the differences among various regression problems, and variations of the approximation error rates do not significantly affect its highly stable generalization performance. As one of the key parameters in the ensemble algorithm, the threshold value does not need any human intervention; instead, it is self-adjusted according to the real regression effect of the ELM networks on the input dataset. Such a mechanism enables RAE-ELM to make sensitive and adaptive adjustments to the intrinsic properties of the given regression problem.

The experimental result comparisons in terms of stability and accuracy among the six prevailing algorithms (RAE-ELM, basic ELM, original Ada-ELM, modified Ada-ELM, SVR, and LS-SVR) for regression problems verify that all the AdaBoost.RT based ensemble ELMs perform better than SVR and, more remarkably, that the proposed RAE-ELM always achieves the best performance. The boosting effect of the proposed method is not significant for small-sized and low-dimensional problems, as an individual learner (ELM network) can already handle such problems well. It is worth pointing out that the proposed RAE-ELM outperforms the others especially on high-dimensional or large-sized datasets, which is a convincing indicator of good generalization performance.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Professor Guang-Bin Huang from Nanyang Technological University for providing inspiring comments and suggestions on our research. This work is financially supported by the University of Macau with Grant no. MYRG079(Y1-L2)-FST13-YZX.

References

[1] C. L. Philip Chen, J. Wang, C. H. Wang, and L. Chen, "A new learning algorithm for a Fully Connected Fuzzy Inference System (F-CONFIS)," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1741-1757, 2014.
[2] Z. Yang, P. K. Wong, C. M. Vong, J. Zhong, and J. Y. Liang, "Simultaneous-fault diagnosis of gas turbine generator systems using a pairwise-coupled probabilistic classifier," Mathematical Problems in Engineering, vol. 2013, Article ID 827128, 13 pages, 2013.
[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1-3, pp. 489-501, 2006.
[4] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513-529, 2012.
[5] S. C. Ng, C. C. Cheung, and S. H. Leung, "Magnified gradient function with deterministic weight modification in adaptive learning," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1411-1423, 2004.
[6] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[7] H. J. Rong, G. B. Huang, N. Sundararajan, and P. Saratchandran, "Online sequential fuzzy extreme learning machine for function approximation and classification problems," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 4, pp. 1067-1072, 2009.
[8] J. Cao, Z. Lin, and G. B. Huang, "Voting base online sequential extreme learning machine for multi-class classification," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '13), pp. 2327-2330, IEEE, 2013.
[9] J. Cao, Z. Lin, G. B. Huang, and N. Liu, "Voting based extreme learning machine," Information Sciences, vol. 185, no. 1, pp. 66-77, 2012.
[10] N. Y. Liang, G. B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411-1423, 2006.
[11] J. Luo, C. M. Vong, and P. K. Wong, "Sparse Bayesian extreme learning machine for multi-classification," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 836-843, 2014.
[12] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, vol. 2, MIT Press, Cambridge, UK, 2006.
[13] Y. Lan, Y. C. Soh, and G. B. Huang, "Ensemble of online sequential extreme learning machine," Neurocomputing, vol. 72, no. 13, pp. 3391-3395, 2009.
[14] N. Liu and H. Wang, "Ensemble based extreme learning machine," IEEE Signal Processing Letters, vol. 17, no. 8, pp. 754-757, 2010.
[15] X. Xue, M. Yao, Z. Wu, and J. Yang, "Genetic ensemble of extreme learning machine," Neurocomputing, vol. 129, no. 10, pp. 175-184, 2014.
[16] X. L. Wang, Y. Y. Chen, H. Zhao, and B. L. Lu, "Parallelized extreme learning machine ensemble based on min-max modular network," Neurocomputing, vol. 128, pp. 31-41, 2014.
[17] B. V. Dasarathy and B. V. Sheela, "A composite classifier system design: concepts and methodology," Proceedings of the IEEE, vol. 67, no. 5, pp. 708-713, 1979.
[18] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993-1001, 1990.
[19] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[20] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197-227, 1990.
[21] R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms, MIT Press, Cambridge, Mass, USA, 2012.
[22] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[23] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79-87, 1991.
[24] V. Guruswami and A. Sahai, "Multiclass learning, boosting, and error-correcting codes," in Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 145-155, July 1999.
[25] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297-336, 1999.
[26] N. Li, "Multiclass boosting with repartitioning," in Proceedings of the 23rd International Conference on Machine Learning, pp. 569-576, ACM, June 2006.
[27] H. Drucker, "Improving regressors using boosting," in Proceedings of the 14th International Conference on Machine Learning, pp. 107-115, 1997.
[28] R. Avnimelech and N. Intrator, "Boosting regression estimators," Neural Computation, vol. 11, no. 2, pp. 499-520, 1999.
[29] D. P. Solomatine and D. L. Shrestha, "AdaBoost.RT: a boosting algorithm for regression problems," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 1163-1168, IEEE, 2004.
[30] D. L. Shrestha and D. P. Solomatine, "Experiments with AdaBoost.RT, an improved boosting scheme for regression," Neural Computation, vol. 18, no. 7, pp. 1678-1710, 2006.
[31] H.-X. Tian and Z.-Z. Mao, "An ensemble ELM based on modified AdaBoost.RT algorithm for predicting the temperature of molten steel in ladle furnace," IEEE Transactions on Automation Science and Engineering, vol. 7, no. 1, pp. 73-80, 2010.
[32] J. W. Cao, T. Chen, and J. Fan, "Fast online learning algorithm for landmark recognition based on BoW framework," in Proceedings of the 9th IEEE Conference on Industrial Electronics and Applications, Hangzhou, China, June 2014.
[33] J. Cao, Z. Lin, and G.-B. Huang, "Composite function wavelet neural networks with extreme learning machine," Neurocomputing, vol. 73, no. 7-9, pp. 1405-1416, 2010.
[34] R. Feely, Predicting stock market volatility using neural networks, B.A. (Mod) thesis, Trinity College Dublin, Dublin, Ireland, 2000.
[35] K. Bache and M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2014, http://archive.ics.uci.edu/ml.
[36] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155-161, 1997.
[37] J. A. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293-300, 1999.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: Research Article A Robust AdaBoost.RT Based Ensemble ...downloads.hindawi.com/journals/mpe/2015/260970.pdf · AdaBoost.RT algorithm (modi ed Ada-ELM) in order to predict the temperature

6 Mathematical Problems in Engineering

Probability

Rejected sampleAccepted sample

Value120583t minus 120582120590t 120583t 120583t + 120582120590t

(a)

Value

Probability

120583t+j t+jminus 120582120590 120583t+j 120583t+j t+j+ 120582120590

Rejected sampleAccepted sample

(b)

Figure 1 Different regression error distributions for different weak learners under robust threshold (a) Error distribution of weak learnerWLt that results in a large threshold (b) Error distribution of weak learner WLt+j that results in a small threshold

dynamically changed for each weak learner WLt while thatfor ldquorejectedrdquo samples will be unchanged

As illustrated in Figure 1 in terms of stability andaccuracy the regression capability of weak learner WLt+jis superior to WLt The robust threshold values for WLt+jand WLt are computed respectively where the former issmaller than the latter to discriminate their correspondinglydifferent regression performances The proposed methodovercomes the limitation suffered by the existing methodswhere the threshold value is set empirically The criticalfactor used in the boosting process the threshold becomesrobust and self-adaptive to the individual weak learnersrsquoperformance on the input data samples Therefore the pro-posed robust AdaBoostRT algorithm is capable to output thefinal hypotheses in optimally weighted ensemble of the weaklearners

In the following we show that the training error of theproposed robust AdaBoostRT algorithm is as bounded Onelemma needs to be given in order to prove the convergence ofthis algorithm

Lemma 1 The convexity argument 119909119889 le 1 minus (1 minus 119909)119889 holdsfor any 119909 ge 0 and 0 le 119889 le 1

Theorem 2 The improved adaptive AdaBoostRT algorithmgenerates hypotheses with errors 120576

1 1205762 120576

119879lt 12 Then the

error 119864119890119899119904119890119898119887119897119890

of the final hypothesis output by this algorithmis bounded above by

119864119890119899119904119890119898119887119897119890

le 2119879

119879

prod

119905=1

radic120576119905(1 minus 120576

119905) (12)

Proof In this proof we need to transform the regressionproblem 119883 rarr 119884 into binary classification problems119883 119884 rarr 0 1 In the proposed improved adaptiveAdaBoostRT algorithm the mean of errors (119890) is assumed tobe closed to zero Thus the dynamically adaptive thresholdscan ignore the mean of errors

Let 119868119905= 119894 | |119891

119905(119909119894) minus 119910119894| lt 120582120590

119905 if 119894 notin 119868

119905119868119894

119905= 1 otherwise

119868119894

119905= 0The final hypothesis output 119891makes a mistake on sample

119894 only if

119879

prod

119905=1

120573minus119868119894

119905

119905ge (

119879

prod

119905=1

120573119905)

minus12

(13)

The final weight of any sample 119894 is

119908119879+1

119894= 119863 (119894)

119879

prod

119905=1

120573(1minus119868119894

119905)

119905 (14)

Combining (13) and (14) the sum of the final weights isbounded by the sum of the final weights of rejected samplesConsider

119898

sum

119894=1

119908119879+1

119894ge sum

119894notin119868119894

119879+1

119908119879+1

119894ge ( sum

119894notin119868119894

119879+1

119863 (119894))(

119879

prod

119905=1

120573119905)

12

= 119864ensemble(119879

prod

119905=1

120573119905)

12

(15)

where the 119864ensemble is the error of the final hypothesis outputBased on Lemma 1

119898

sum

119894=1

119908119905+1

119894=

119898

sum

119894=1

119908119905

1198941205731minus119868119894

119905

119905le

119898

sum

119894=1

119908119905

119894(1 minus (1 minus 120573

119905) (1 minus 119868

119894

119905))

= (

119898

sum

119894=1

119908119905

119894)(1 minus (1 minus 120576

119905) (1 minus 120573

119905))

(16)

Mathematical Problems in Engineering 7

Combining those inequalities for 119905 = 1 119879 the follow-ing equation could be obtained

119898

sum

119894=1

119908119879+1

119894le

119879

prod

119905=1

(1 minus (1 minus 120576119905) (1 minus 120573

119905)) (17)

Combining (15) and (17) we obtain that

119864ensemble le119879

prod

119905=1

1 minus (1 minus 120576119905) (1 minus 120573

119905)

radic120573119905

(18)

Considering that all factors in multiplication are positivethe minimization of the right hand side could be resortedto compute the minimization of each factor individually120573119905could be computed as 120573

119905= 120576119905(1 minus 120576

119905) when setting

the derivative of the 119905th factor to be zero Substitute thiscomputed 120573

119905into (18) completing the proof

Unlike the original AdaBoostRT and its existent variantsthe robust threshold in the proposed AdaBoostRT algorithmis determined and could be self-adaptively adjusted accordingto the individual weak learners and data samples Throughthe analysis on the convergence of the proposed robustAdaBoostRT algorithm it could be proved that the errorof the final hypothesis output by the proposed ensemblealgorithm 119864ensemble is within a significantly superior boundThe study shows that the robust AdaBoostRT algorithmproposed in this paper can overcome the limitations existingin the available AdaBoostRT algorithms

4 A Robust AdaBoostRT Ensemble-BasedExtreme Learning Machine

In this paper a robust AdaBoostRT ensemble-based extremelearning machine (RAE-ELM) which combines ELM withthe robust AdaBoostRT algorithm described in previoussection is proposed to improve the robustness and stabilityof ELM A set of 119879 number of ELMs is adopted as the ldquoweakrdquolearners In the training phase the RAE-ELM utilizes theproposed robust AdaBoostRT algorithm to train every ELMmodel and assign an ensemble weight accordingly in orderthat each ELM achieves corresponding distribution based onthe training output The optimally weighted ensemble modelof ELMs 119864ensemble is the final hypothesis output used formaking prediction on testing dataset The proposed RAE-ELM is illustrated as follows in Figure 2

41 Initialization For the first weak learner ELM1is supplied

with 119898 training samples with the uniformed distribution ofweights in order that each sample owns equal opportunity tobe chosen during the first training process for ELM

1

42 Distribution Updating The relative prediction error ratesare used to evaluate the performance of this ELM Theprediction error of 119905th ELM ELMt for the input data samplescould be represented as one statistics distribution 120583

119905plusmn 120582120590119905

where 120583119905stands for the expected value and 120582120590

119905is defined

as robust threshold (120590119905stands for the standard deviation

and the relative factor 120582 is defined as 120582 isin (0 1)) Therobust thresholdis applied to demarcate prediction errorsas ldquoacceptedrdquo or ldquorejectedrdquo If the prediction error of oneparticular sample falls into the region 120583

119905plusmn120582120590119905that is bounded

by the robust thresholds the prediction of this sample isregarded as ldquoacceptedrdquo for ELMt and vice versa for ldquorejectedrdquopredictionsThe probabilities of the ldquorejectedrdquo predictions areaccumulated to calculate the error rate 120576

119905 ELM attempts to

achieve the 119891119905(119909) with small error rate

The robust AdaBoostRT algorithm will calculate thedistribution for next ELMt+1 For every sample that iscorrectly predicted by the current ELMt the correspondingweight vector will be multiplied by the error rate function120573119905 Otherwise the weight vector remains unchanged Such

process will be iterated for the next ELMt+1 till the last learnerELMT unless 120576119905 is higher than 05 Because once the error rateis higher than 05 the AdaBoost algorithm does not convergeand tends to overfitting [21] Hence the error ratemust be lessthan 05

43 Decision Making on RAE-ELM The weight updatingparameter120573

119905is used as an indicator of regression effectiveness

of the ELMt in the current iteration According to therelationship between 120576

119905and 120573

119905 if 120576119905increases 120573

119905will become

larger as well The RAE-ELM will grant a small ensembleweight for the ELMt On the other hand the ELMt withrelatively superior regression performance will be grantedwith a larger ensemble weight The hybrid RAE-ELM modelcombines the set of ELMs under different weights as the finalhypothesis for decision making

5 Performance Evaluation of RAE-ELM

In this section the performance of the proposed RAE-ELM learning algorithm is compared with other popularalgorithms on 14 real-world regression problems coveringdifferent domains from UCI Machine Learning Repository[35] whose specifications of benchmark datasets are shownin Table 1The ELMbased algorithms to be compared includebasic ELM [4] original AdaBoostRT based ELM (originalAda-ELM) [30] the modified self-adaptive AdaBoostRTELM (modified Ada-ELM) [31] support vector regression(SVR) [36] and least-square support vector regression (LS-SVR) [37] All the evaluations are conducted in Mat-lab environment running on a Windows 7 machine with320GHzCPU and 4GBRAM

In our experiments all the input attributes are normalizedinto the range of [minus1 1] while the outputs are normalizedinto [0 1] As the real-world benchmark datasets are embed-ded with noise and their distributions are unknown whichare of small sizes low dimensions large sizes and highdimensions for each trial of simulations the whole dataset of the application is randomly partitioned into trainingdataset and testing dataset with the number of samples shownin Table 1 25 of the training data samples are used asthe validation dataset Each partitioned training validationand testing dataset will be kept fixed as inputs for all thesealgorithms

8 Mathematical Problems in Engineering

Training Training Training

Updatedistribution

Updatedistribution

Boosting

The final hypothesis

Sequence of M samples

Distribution Distribution

(x1 y1) (xm ym)

D1(i) Dt (i)Distribution

DT (i)

ELM1 ELMt

Calculate theerror rate

1205761

Calculate theerror rate

120576t

Output1 Outputt

ELMT

OutputT

middot middot middot middot middot middot

Figure 2 Block diagram of RAE-ELM

For RAE-ELM basic ELM original Ada-ELM and mod-ified Ada-ELM algorithms the suitable numbers of hiddennodes of them are determined using the preserved validationdataset respectivelyThe sigmoid function119866(a 119887 x) = 1(1+exp(minus(asdotx+119887))) is selected as the activation function in all thealgorithms Fifty trails of simulations have been conductedfor each problem with training validation and testingsamples randomly split for each trailThe performances of thealgorithms are verified using the average root mean squareerror (RMSE) in testing The significantly better results arehighlighted in boldface

51 Model Selection In ensemble algorithms the number ofnetworks in the ensemble needs to be determined Accordingto Occamrsquos Razor theory excessively complex models areaffected by statistical noise whereas simpler models maycapture the underlying structure better and may thus have

better predictive performance Therefore the parameter 119879which is the number of weak learners need not be very large

We define 120576119905in (12) as 120576

119905= 05 minus 120574

119905 where 120574

119905gt 0 is

constant it results in

119864ensemble le119879

prod

119905=1

radic1 minus 41205742

119905= 119890(minussum119879

119905=1KL(1212minus120574119905))

le 119890(minus2sum

119879

119905=11205742

119905)

(19)

where KL is the Kullback-Leibler divergenceWe then simplify (19) by using 05 minus 120574 instead of 05 minus 120574

119905

that is each 120574119905is set to be the same We can get

119864ensemble le (1 minus 41205742)

1198792

= 119890(minus119879lowastKL(1212minus120574))

le 119890(minus2119879120574

2)

(20)

Mathematical Problems in Engineering 9

Table 1 Specification of real-world regression benchmark datasets

Problems Observations AttributesTraining data Testing data

LVST 90 35 308Yeast 837 645 7Computer hardware 110 99 7Abalone 3150 1027 7Servo 95 72 4Parkinson disease 3000 2875 21Housing 350 156 13Cloud 80 28 9Auto price 110 49 9Breast cancer 100 94 32Balloon 1500 833 4Auto-MPG 300 92 7Bank 5000 3192 8Census (house8L) 15000 7784 8

From (20) we can obtain the upper bound number ofiterations of this algorithm Consider

119879 le [

1

21205742ln 1

119864ensemble] (21)

For RAE-ELM the number of ELM networks need bedeterminedThenumber of ELMnetworks is set to be 5 10 1520 25 and 30 in our simulations and the optimal parameteris selected as the one which results in the best average RMSEin testing

Besides in our simulation trails the relative factor 120582 inRAE-ELM is the parameter which needs to be optimizedwithin the range 120582 isin (0 1) We start simulations for 120582 at 01and increase them at the interval of 01 Table 2 shows theexamples of setting both 119879 and 120582 for our simulation trail

As illustrated in Table 2 and Figure 3 RAE-ELM withsigmoid activation function could achieve good generaliza-tion performance for Parkinson disease dataset as long asthe number of ELM networks 119879 is larger than 15 For agiven number of ELM networks RMSE is less sensitive tothe variation of 120582 and tends to be smaller when 120582 is around05 For a fair comparison we set RAE-ELM with 119879 = 20

and 120582 = 05 in the following experiments For both originalAda-ELMandmodifiedAda-ELMwhen the number of ELMnetworks is less than 7 the ensemble model is unstable Thenumber of ELM networks is also set to be 20 for both originalAda-ELM and modified Ada-ELM

We use the popular Gaussian kernel function 119870(u v) =exp(minus120574u minus v2) in both SVR and LS-SVR As is known toall the performances of SVR and LS-SVR are sensitive to thecombinations of (119862 120574) Hence the cost parameter 119862 and thekernel parameter 120574 need to be adjusted appropriately in awide range so as to obtain good generalization performancesFor each data set 50 different values of 119862 and 50 differentvalues of 120574 that is 2500 pairs of (119862 120574) are applied as theadjustment parameters The different values of 119862 and 120574 are2minus24 2minus23 2

24 225 In both SVR and LS-SVR the best

Table 2 Performance of proposed RAE-ELM with different valuesof 119879 and 120582 for Parkinson disease dataset

120582

119879

5 10 15 20 25 3001 02472 02374 02309 02259 02251 0224902 02469 02315 02276 02241 02249 0224503 02447 02298 02248 02235 02230 0223304 02426 02287 02249 02227 02224 0222705 02428 02296 02251 02219 02216 0221506 02427 02298 02246 02221 02214 0221307 02422 02285 02257 02229 02217 0221908 02441 02301 02265 02228 02225 0222809 02461 02359 02312 02243 02237 02230

performed combinations of (119862 120574) are selected for each dataset as presented in Table 3

For basic ELM and other ELM-based ensemble methodsthe sigmoid function 119866(a 119887 x) = 1(1 + exp(minus(a sdot x + 119887)))is selected as the activation function in all the algorithmsThe parameters (119862 119871) need be selected so as to achieve thebest generalization performance where the cost parameter119862 is selected from the range 2minus24 2minus23 224 225 and thedifferent values of the hidden nodes 119871 are 10 20 1000

In addition for the original AdaBoostRT-based ensem-ble ELM the thresholdΦ should be chosen before seeing thedatasets In the original AdaBoostRT-based ensemble ELMthe thresholdΦ is required to bemanually selected accordingto an empirical suggestionwhich is a sensitive factor affectingthe regression performance IfΦ is too low then it is generallydifficult to obtain a sufficient number of ldquoacceptedrdquo samplesHowever if Φ is too high some wrong samples are treated asldquoacceptedrdquo ones and the ensemblemodel tends to be unstableAccording to Shrestha and Solomatinersquos experiments thethreshold Φ shall be defined between 0 and 04 in order tomake the ensemble model stable [30] In our simulationswe incrementally set thresholds within the range from 0to 04 The original Ada-ELM with threshold values at01 015 035 04 could generate satisfied results for allthe regression problems where the best performed originalAda-ELM is shown in boldface What is more the modifiedAda-ELM algorithm needs to select an initial value of Φ

0

to calculate the followed thresholds in the iterations Tianand Mao suggested setting the default initial value of Φ

0

to be 02 [31] Considering that the manually fixed initialthreshold is not related to the characteristics of ELM pre-diction effect on input dataset the algorithm may not reachthe best generalization performance In our simulations wecompare the performances of different modified Ada-ELMsat correspondingly different initial threshold values Φ

0set

to be 01 015 03 035 The best performed modifiedAda-ELM is also presented in Table 3 as well

52 Performance Comparisons between RAE-ELM and OtherLearning Algorithms In this subsection the performance ofthe proposed RAE-ELM is compared with other learningalgorithms including basic ELM [4] original Ada-ELM [30]and modified Ada-ELM [31] support vector regression [36]

10 Mathematical Problems in Engineering

5 10 15 2025

30

002

0406

081

RMSE

0220225

0230235

0240245

025

120582T

Figure 3 The average testing RMSE with different values of 119879 and 120582 in RAE-ELM for Parkinson disease dataset

Table 3 Parameters of RAE-ELM and other learning algorithms

Datasets RAE ELM Basic ELM [4] Original AdaELM [30] Modified AdaELM [31] SVR [36] LS SVR [37](119862 119871) (119862 119871) (119862 119871 120601) (119862 119871 120601

0) (119862 120574) (119862 120574)

LVST (213 960) (25 710) (2minus7 530 03) (2minus7 460 035) (217 22) (28 20)Yeast (210 540) (21 340) (25 710 025) (2minus1 800 02) (2minus7 20) (2minus9 23)Computer hardware (2minus4 160) (211 530) (216 240 035) (210 170 025) (2minus9 221) (20 26)Abalone (20 150) (20 200) (2minus22 330 02) (26 610 015) (23 218) (21 214)Servo (21 340) (2minus6 650) (20 110 03) (29 230 02) (2minus2 22) (24 2minus3)Parkinson disease (29 830) (2minus8 460) (24 370 035) (2minus4 590 025) (21 20) (21 27)Housing (2minus19 270) (2minus9 310) (2minus1 440 025) (217 220 02) (221 2minus4) (23 212)Cloud (2minus5 590) (213 880) (20 1000 02) (22 760 025) (27 25) (24 2minus11)Auto price (220 310) (217 430) (2minus2 230 035) (2minus5 270 035) (23 223) (27 2minus10)Breast cancer (2minus3 80) (2minus1 160) (2minus6 90 015) (213 290 025) (26 21) (26 217)Balloon (222 160) (216 190) (213 250 02) (217 210 02) (21 28) (20 2minus7)Auto-MPG (27 630) (2minus10 820) (221 380 035) (211 680 03) (2minus2 2minus1) (21 25)Bank (2minus1 660) (20 590) (22 710 03) (2minus9 450 015) (210 2minus4) (221 2minus5)Census (house8L) (23 580) (25 460) (21 190 03) (23 600 025) (25 2minus6) (20 2minus4)

and least-square support vector regression [37] The resultscomparisons of RAE-ELM and other learning algorithms forreal-world data regressions are shown in Table 4

Table 4 lists the averaging results of multiple trails of thefour ELM based algorithms (RAE-ELM basic ELM originalAda-ELM [30] and modified Ada-ELM [31]) SVR [36]and LS-SVR [37] for fourteen representative real-world dataregression problemsThe selected datasets include large scaleof data and small scale of data as well as high dimensionaldata problems and low dimensional problems It is easyto find that averaged testing RMSE obtained by RAE-ELMfor all the fourteen cases are always the best among thesesix algorithms For original Ada-ELM the performance issensitive to the selection of threshold value of Φ The bestperformed original Ada-ELM models for different regres-sion problems own their correspondingly different thresholdvalues Therefore the manual chosen strategy is not goodThe generalization performance of modified Ada-ELM ingeneral is better than original Ada-ELM However theempirical suggested initial threshold value at 02 does notensure a mapping to the best performed regression model

The three AdaBoostRT based ensemble ELMs (RAE-ELMoriginal Ada-ELM and modified Ada-ELM) all performbetter than the basic ELM which verifies that an ensembleELM using AdaBoostRT can achieve better predicationaccuracy than using individual ELM as the predictor Theaveraged generalization performance of basic ELM is betterthan SVR while it is slightly worse than LS-SVR

To find the best performed original Ada-ELM model ormodified Ada-ELM for a regression problem as the inputdataset and ELM networks are not related to the thresholdselection the optimal parameter need be searched by brute-force One needs to carry out a set of experiments withdifferent (initial) threshold values and then searches amongthem for the best ensemble ELM model Such process istime consuming Moreover the generalization performanceof the optimized ensemble ELMs using original Ada-ELMor modified Ada-ELM can hardly be better than that ofthe proposed RAE-ELM In fact the proposed RAE-ELM isalways the best performed learner among the six candidatesfor all fourteen real-world regression problems

Mathematical Problems in Engineering 11

Table 4 Result comparisons of testing RMSE of RAE-ELM and other learning algorithms for real-world data regression problems

Datasets RAE-ELM Basic ELM [4] Original AdaELM [30] Modified AdaELM [31] SVR [36] LSSVR [37]LVST 02571 02854 02702 02653 02849 02801Yeast 01071 01224 01195 01156 01238 01182Computer hardware 00563 00839 00821 00726 00976 00771Abalone 00731 00882 00793 00778 00890 00815Servo 00727 00924 00884 00839 00966 00870Parkinson disease 02219 02549 02513 02383 02547 02437Housing 01018 01259 01163 01135 0127 01185Cloud 02897 03269 03118 03006 03316 03177Auto price 00809 01036 00910 00894 01045 00973Breast cancer 02463 02641 02601 02519 02657 02597Balloon 00570 00639 00611 00592 00672 00618Auto-MPG 00724 00862 00799 00781 00891 00857Bank 00415 00551 00496 00453 00537 00498Census (house8L) 00746 00820 00802 00795 00867 00813

6 Conclusion

In this paper a robust AdaBoostRT based ensemble ELM(RAE-ELM) for regression problems is proposedwhich com-bined ELM with the novel robust AdaBoostRT algorithmCombing the effective learner ELM with the novel ensemblemethod the robust AdaBoostRT algorithm could constructa hybrid method that inherits their intrinsic properties andachieves better predication accuracy than using only indi-vidual ELM as predictor ELM tends to reach the solutionsstraightforwardly and the error rate of regression predictionis in general much smaller than 05 Therefore selectingELM as the ldquoweakrdquo learner can avoid overfittingMoreover asELM is a fast learner with quite high regression performanceit contributes to the overall generalization performance of theensemble ELM

The proposed robust AdaBoostRT algorithm overcomesthe limitations existing in the available AdaBoostRT algo-rithm and its variants where the threshold value is manuallyspecified which may only be ideal for a very limited set ofcases The new robust AdaBoostRT algorithm is proposedto utilize the statistics distribution of approximation errorto dynamically determine a robust threshold The robustthreshold for each weak learner WLt is self-adjustable and isdefined as the scaled standard deviation of the approximationerrors 120582120590

119905 We analyze the convergence of the proposed

robust AdaBoostRT algorithm It has been proved that theerror of the final hypothesis output by the proposed ensemblealgorithm 119864ensemble is within a significantly superior bound

The proposed RAE-ELM is robust with respect to thedifference in various regression problems and variation ofapproximation error rates that do not significantly affectits highly stable generalization performance As one of thekey parameters in ensemble algorithm threshold value doesnot need any human intervention instead it is able tobe self-adjusted according to the real regression effect ofELM networks on the input dataset Such mechanism enableRAE-ELM to make sensitive and adaptive adjustment tothe intrinsic properties of the given regression problem

The experimental result comparisons in terms of stability andaccuracy among the six prevailing algorithms (RAE-ELMbasic ELM original Ada-ELMmodifiedAda-ELM SVR andLS-SVR) for regression issues verify that all the AdaBoostRTbased ensemble ELMs perform better than the SVR andmore remarkably the proposed RAE-ELM always achievesthe best performance The boosting effect of the proposedmethod is not significant for small sized and low dimensionalproblems as the individual classifier (ELM network) couldalready be sufficient to handle such problems well It isworth pointing out that the proposed RAE-ELM has betterperformance than others especially for high dimensional orlarge sized datasets which is a convincing indicator for goodgeneralization performance

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank Professor Guang-Bin HuangfromNanyangTechnological University for providing inspir-ing comments and suggestions on our research This work isfinancially supported by the University of Macau with Grantno MYRG079(Y1-L2)-FST13-YZX

References

[1] C L Philip Chen J Wang C H Wang and L Chen ldquoAnew learning algorithm for a Fully Connected Fuzzy InferenceSystem (F-CONFIS)rdquo IEEE Transactions on Neural Networksand Learning Systems vol 25 no 10 pp 1741ndash1757 2014

[2] Z Yang P K Wong C M Vong J Zhong and J Y LiangldquoSimultaneous-fault diagnosis of gas turbine generator systemsusing a pairwise-coupled probabilistic classifierrdquoMathematicalProblems in Engineering vol 2013 Article ID 827128 13 pages2013

12 Mathematical Problems in Engineering

[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1–3, pp. 489–501, 2006.

[4] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513–529, 2012.

[5] S. C. Ng, C. C. Cheung, and S. H. Leung, "Magnified gradient function with deterministic weight modification in adaptive learning," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1411–1423, 2004.

[6] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[7] H. J. Rong, G. B. Huang, N. Sundararajan, and P. Saratchandran, "Online sequential fuzzy extreme learning machine for function approximation and classification problems," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 4, pp. 1067–1072, 2009.

[8] J. Cao, Z. Lin, and G. B. Huang, "Voting base online sequential extreme learning machine for multi-class classification," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '13), pp. 2327–2330, IEEE, 2013.

[9] J. Cao, Z. Lin, G. B. Huang, and N. Liu, "Voting based extreme learning machine," Information Sciences, vol. 185, no. 1, pp. 66–77, 2012.

[10] N. Y. Liang, G. B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411–1423, 2006.

[11] J. Luo, C. M. Vong, and P. K. Wong, "Sparse Bayesian extreme learning machine for multi-classification," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 836–843, 2014.

[12] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, vol. 2, MIT Press, Cambridge, UK, 2006.

[13] Y. Lan, Y. C. Soh, and G. B. Huang, "Ensemble of online sequential extreme learning machine," Neurocomputing, vol. 72, no. 13, pp. 3391–3395, 2009.

[14] N. Liu and H. Wang, "Ensemble based extreme learning machine," IEEE Signal Processing Letters, vol. 17, no. 8, pp. 754–757, 2010.

[15] X. Xue, M. Yao, Z. Wu, and J. Yang, "Genetic ensemble of extreme learning machine," Neurocomputing, vol. 129, no. 10, pp. 175–184, 2014.

[16] X. L. Wang, Y. Y. Chen, H. Zhao, and B. L. Lu, "Parallelized extreme learning machine ensemble based on min–max modular network," Neurocomputing, vol. 128, pp. 31–41, 2014.

[17] B. V. Dasarathy and B. V. Sheela, "A composite classifier system design: concepts and methodology," Proceedings of the IEEE, vol. 67, no. 5, pp. 708–713, 1979.

[18] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993–1001, 1990.

[19] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[20] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.

[21] R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms, MIT Press, Cambridge, Mass, USA, 2012.

[22] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[23] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.

[24] V. Guruswami and A. Sahai, "Multiclass learning, boosting, and error-correcting codes," in Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 145–155, July 1999.

[25] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.

[26] N. Li, "Multiclass boosting with repartitioning," in Proceedings of the 23rd International Conference on Machine Learning, pp. 569–576, ACM, June 2006.

[27] H. Drucker, "Improving regressors using boosting," in Proceedings of the 14th International Conference on Machine Learning, pp. 107–115, 1997.

[28] R. Avnimelech and N. Intrator, "Boosting regression estimators," Neural Computation, vol. 11, no. 2, pp. 499–520, 1999.

[29] D. P. Solomatine and D. L. Shrestha, "AdaBoost.RT: a boosting algorithm for regression problems," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 1163–1168, IEEE, 2004.

[30] D. L. Shrestha and D. P. Solomatine, "Experiments with AdaBoost.RT, an improved boosting scheme for regression," Neural Computation, vol. 18, no. 7, pp. 1678–1710, 2006.

[31] H.-X. Tian and Z.-Z. Mao, "An ensemble ELM based on modified AdaBoost.RT algorithm for predicting the temperature of molten steel in ladle furnace," IEEE Transactions on Automation Science and Engineering, vol. 7, no. 1, pp. 73–80, 2010.

[32] J. W. Cao, T. Chen, and J. Fan, "Fast online learning algorithm for landmark recognition based on BoW framework," in Proceedings of the 9th IEEE Conference on Industrial Electronics and Applications, Hangzhou, China, June 2014.

[33] J. Cao, Z. Lin, and G.-B. Huang, "Composite function wavelet neural networks with extreme learning machine," Neurocomputing, vol. 73, no. 7–9, pp. 1405–1416, 2010.

[34] R. Feely, Predicting stock market volatility using neural networks, B.A. (Mod) [Ph.D. thesis], Trinity College Dublin, Dublin, Ireland, 2000.

[35] K. Bache and M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2014, http://archive.ics.uci.edu/ml.

[36] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155–161, 1997.

[37] J. A. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.

Page 8: Research Article A Robust AdaBoost.RT Based Ensemble ...downloads.hindawi.com/journals/mpe/2015/260970.pdf · AdaBoost.RT algorithm (modi ed Ada-ELM) in order to predict the temperature

8 Mathematical Problems in Engineering

Training Training Training

Updatedistribution

Updatedistribution

Boosting

The final hypothesis

Sequence of M samples

Distribution Distribution

(x1 y1) (xm ym)

D1(i) Dt (i)Distribution

DT (i)

ELM1 ELMt

Calculate theerror rate

1205761

Calculate theerror rate

120576t

Output1 Outputt

ELMT

OutputT

middot middot middot middot middot middot

Figure 2 Block diagram of RAE-ELM

For RAE-ELM basic ELM original Ada-ELM and mod-ified Ada-ELM algorithms the suitable numbers of hiddennodes of them are determined using the preserved validationdataset respectivelyThe sigmoid function119866(a 119887 x) = 1(1+exp(minus(asdotx+119887))) is selected as the activation function in all thealgorithms Fifty trails of simulations have been conductedfor each problem with training validation and testingsamples randomly split for each trailThe performances of thealgorithms are verified using the average root mean squareerror (RMSE) in testing The significantly better results arehighlighted in boldface

51 Model Selection In ensemble algorithms the number ofnetworks in the ensemble needs to be determined Accordingto Occamrsquos Razor theory excessively complex models areaffected by statistical noise whereas simpler models maycapture the underlying structure better and may thus have

better predictive performance Therefore the parameter 119879which is the number of weak learners need not be very large

We define 120576119905in (12) as 120576

119905= 05 minus 120574

119905 where 120574

119905gt 0 is

constant it results in

119864ensemble le119879

prod

119905=1

radic1 minus 41205742

119905= 119890(minussum119879

119905=1KL(1212minus120574119905))

le 119890(minus2sum

119879

119905=11205742

119905)

(19)

where KL is the Kullback-Leibler divergenceWe then simplify (19) by using 05 minus 120574 instead of 05 minus 120574

119905

that is each 120574119905is set to be the same We can get

119864ensemble le (1 minus 41205742)

1198792

= 119890(minus119879lowastKL(1212minus120574))

le 119890(minus2119879120574

2)

(20)

Mathematical Problems in Engineering 9

Table 1 Specification of real-world regression benchmark datasets

Problems Observations AttributesTraining data Testing data

LVST 90 35 308Yeast 837 645 7Computer hardware 110 99 7Abalone 3150 1027 7Servo 95 72 4Parkinson disease 3000 2875 21Housing 350 156 13Cloud 80 28 9Auto price 110 49 9Breast cancer 100 94 32Balloon 1500 833 4Auto-MPG 300 92 7Bank 5000 3192 8Census (house8L) 15000 7784 8

From (20) we can obtain the upper bound number ofiterations of this algorithm Consider

119879 le [

1

21205742ln 1

119864ensemble] (21)

For RAE-ELM the number of ELM networks need bedeterminedThenumber of ELMnetworks is set to be 5 10 1520 25 and 30 in our simulations and the optimal parameteris selected as the one which results in the best average RMSEin testing

Besides in our simulation trails the relative factor 120582 inRAE-ELM is the parameter which needs to be optimizedwithin the range 120582 isin (0 1) We start simulations for 120582 at 01and increase them at the interval of 01 Table 2 shows theexamples of setting both 119879 and 120582 for our simulation trail

As illustrated in Table 2 and Figure 3 RAE-ELM withsigmoid activation function could achieve good generaliza-tion performance for Parkinson disease dataset as long asthe number of ELM networks 119879 is larger than 15 For agiven number of ELM networks RMSE is less sensitive tothe variation of 120582 and tends to be smaller when 120582 is around05 For a fair comparison we set RAE-ELM with 119879 = 20

and 120582 = 05 in the following experiments For both originalAda-ELMandmodifiedAda-ELMwhen the number of ELMnetworks is less than 7 the ensemble model is unstable Thenumber of ELM networks is also set to be 20 for both originalAda-ELM and modified Ada-ELM

We use the popular Gaussian kernel function 119870(u v) =exp(minus120574u minus v2) in both SVR and LS-SVR As is known toall the performances of SVR and LS-SVR are sensitive to thecombinations of (119862 120574) Hence the cost parameter 119862 and thekernel parameter 120574 need to be adjusted appropriately in awide range so as to obtain good generalization performancesFor each data set 50 different values of 119862 and 50 differentvalues of 120574 that is 2500 pairs of (119862 120574) are applied as theadjustment parameters The different values of 119862 and 120574 are2minus24 2minus23 2

24 225 In both SVR and LS-SVR the best

Table 2 Performance of proposed RAE-ELM with different valuesof 119879 and 120582 for Parkinson disease dataset

120582

119879

5 10 15 20 25 3001 02472 02374 02309 02259 02251 0224902 02469 02315 02276 02241 02249 0224503 02447 02298 02248 02235 02230 0223304 02426 02287 02249 02227 02224 0222705 02428 02296 02251 02219 02216 0221506 02427 02298 02246 02221 02214 0221307 02422 02285 02257 02229 02217 0221908 02441 02301 02265 02228 02225 0222809 02461 02359 02312 02243 02237 02230

performed combinations of (119862 120574) are selected for each dataset as presented in Table 3

For basic ELM and other ELM-based ensemble methodsthe sigmoid function 119866(a 119887 x) = 1(1 + exp(minus(a sdot x + 119887)))is selected as the activation function in all the algorithmsThe parameters (119862 119871) need be selected so as to achieve thebest generalization performance where the cost parameter119862 is selected from the range 2minus24 2minus23 224 225 and thedifferent values of the hidden nodes 119871 are 10 20 1000

In addition for the original AdaBoostRT-based ensem-ble ELM the thresholdΦ should be chosen before seeing thedatasets In the original AdaBoostRT-based ensemble ELMthe thresholdΦ is required to bemanually selected accordingto an empirical suggestionwhich is a sensitive factor affectingthe regression performance IfΦ is too low then it is generallydifficult to obtain a sufficient number of ldquoacceptedrdquo samplesHowever if Φ is too high some wrong samples are treated asldquoacceptedrdquo ones and the ensemblemodel tends to be unstableAccording to Shrestha and Solomatinersquos experiments thethreshold Φ shall be defined between 0 and 04 in order tomake the ensemble model stable [30] In our simulationswe incrementally set thresholds within the range from 0to 04 The original Ada-ELM with threshold values at01 015 035 04 could generate satisfied results for allthe regression problems where the best performed originalAda-ELM is shown in boldface What is more the modifiedAda-ELM algorithm needs to select an initial value of Φ

0

to calculate the followed thresholds in the iterations Tianand Mao suggested setting the default initial value of Φ

0

to be 02 [31] Considering that the manually fixed initialthreshold is not related to the characteristics of ELM pre-diction effect on input dataset the algorithm may not reachthe best generalization performance In our simulations wecompare the performances of different modified Ada-ELMsat correspondingly different initial threshold values Φ

0set

to be 01 015 03 035 The best performed modifiedAda-ELM is also presented in Table 3 as well

52 Performance Comparisons between RAE-ELM and OtherLearning Algorithms In this subsection the performance ofthe proposed RAE-ELM is compared with other learningalgorithms including basic ELM [4] original Ada-ELM [30]and modified Ada-ELM [31] support vector regression [36]

10 Mathematical Problems in Engineering

5 10 15 2025

30

002

0406

081

RMSE

0220225

0230235

0240245

025

120582T

Figure 3 The average testing RMSE with different values of 119879 and 120582 in RAE-ELM for Parkinson disease dataset

Table 3 Parameters of RAE-ELM and other learning algorithms

Datasets RAE ELM Basic ELM [4] Original AdaELM [30] Modified AdaELM [31] SVR [36] LS SVR [37](119862 119871) (119862 119871) (119862 119871 120601) (119862 119871 120601

0) (119862 120574) (119862 120574)

LVST (213 960) (25 710) (2minus7 530 03) (2minus7 460 035) (217 22) (28 20)Yeast (210 540) (21 340) (25 710 025) (2minus1 800 02) (2minus7 20) (2minus9 23)Computer hardware (2minus4 160) (211 530) (216 240 035) (210 170 025) (2minus9 221) (20 26)Abalone (20 150) (20 200) (2minus22 330 02) (26 610 015) (23 218) (21 214)Servo (21 340) (2minus6 650) (20 110 03) (29 230 02) (2minus2 22) (24 2minus3)Parkinson disease (29 830) (2minus8 460) (24 370 035) (2minus4 590 025) (21 20) (21 27)Housing (2minus19 270) (2minus9 310) (2minus1 440 025) (217 220 02) (221 2minus4) (23 212)Cloud (2minus5 590) (213 880) (20 1000 02) (22 760 025) (27 25) (24 2minus11)Auto price (220 310) (217 430) (2minus2 230 035) (2minus5 270 035) (23 223) (27 2minus10)Breast cancer (2minus3 80) (2minus1 160) (2minus6 90 015) (213 290 025) (26 21) (26 217)Balloon (222 160) (216 190) (213 250 02) (217 210 02) (21 28) (20 2minus7)Auto-MPG (27 630) (2minus10 820) (221 380 035) (211 680 03) (2minus2 2minus1) (21 25)Bank (2minus1 660) (20 590) (22 710 03) (2minus9 450 015) (210 2minus4) (221 2minus5)Census (house8L) (23 580) (25 460) (21 190 03) (23 600 025) (25 2minus6) (20 2minus4)

and least-square support vector regression [37] The resultscomparisons of RAE-ELM and other learning algorithms forreal-world data regressions are shown in Table 4

Table 4 lists the averaging results of multiple trails of thefour ELM based algorithms (RAE-ELM basic ELM originalAda-ELM [30] and modified Ada-ELM [31]) SVR [36]and LS-SVR [37] for fourteen representative real-world dataregression problemsThe selected datasets include large scaleof data and small scale of data as well as high dimensionaldata problems and low dimensional problems It is easyto find that averaged testing RMSE obtained by RAE-ELMfor all the fourteen cases are always the best among thesesix algorithms For original Ada-ELM the performance issensitive to the selection of threshold value of Φ The bestperformed original Ada-ELM models for different regres-sion problems own their correspondingly different thresholdvalues Therefore the manual chosen strategy is not goodThe generalization performance of modified Ada-ELM ingeneral is better than original Ada-ELM However theempirical suggested initial threshold value at 02 does notensure a mapping to the best performed regression model

The three AdaBoostRT based ensemble ELMs (RAE-ELMoriginal Ada-ELM and modified Ada-ELM) all performbetter than the basic ELM which verifies that an ensembleELM using AdaBoostRT can achieve better predicationaccuracy than using individual ELM as the predictor Theaveraged generalization performance of basic ELM is betterthan SVR while it is slightly worse than LS-SVR

To find the best performed original Ada-ELM model ormodified Ada-ELM for a regression problem as the inputdataset and ELM networks are not related to the thresholdselection the optimal parameter need be searched by brute-force One needs to carry out a set of experiments withdifferent (initial) threshold values and then searches amongthem for the best ensemble ELM model Such process istime consuming Moreover the generalization performanceof the optimized ensemble ELMs using original Ada-ELMor modified Ada-ELM can hardly be better than that ofthe proposed RAE-ELM In fact the proposed RAE-ELM isalways the best performed learner among the six candidatesfor all fourteen real-world regression problems

Mathematical Problems in Engineering 11

Table 4 Result comparisons of testing RMSE of RAE-ELM and other learning algorithms for real-world data regression problems

Datasets RAE-ELM Basic ELM [4] Original AdaELM [30] Modified AdaELM [31] SVR [36] LSSVR [37]LVST 02571 02854 02702 02653 02849 02801Yeast 01071 01224 01195 01156 01238 01182Computer hardware 00563 00839 00821 00726 00976 00771Abalone 00731 00882 00793 00778 00890 00815Servo 00727 00924 00884 00839 00966 00870Parkinson disease 02219 02549 02513 02383 02547 02437Housing 01018 01259 01163 01135 0127 01185Cloud 02897 03269 03118 03006 03316 03177Auto price 00809 01036 00910 00894 01045 00973Breast cancer 02463 02641 02601 02519 02657 02597Balloon 00570 00639 00611 00592 00672 00618Auto-MPG 00724 00862 00799 00781 00891 00857Bank 00415 00551 00496 00453 00537 00498Census (house8L) 00746 00820 00802 00795 00867 00813

6 Conclusion

In this paper a robust AdaBoostRT based ensemble ELM(RAE-ELM) for regression problems is proposedwhich com-bined ELM with the novel robust AdaBoostRT algorithmCombing the effective learner ELM with the novel ensemblemethod the robust AdaBoostRT algorithm could constructa hybrid method that inherits their intrinsic properties andachieves better predication accuracy than using only indi-vidual ELM as predictor ELM tends to reach the solutionsstraightforwardly and the error rate of regression predictionis in general much smaller than 05 Therefore selectingELM as the ldquoweakrdquo learner can avoid overfittingMoreover asELM is a fast learner with quite high regression performanceit contributes to the overall generalization performance of theensemble ELM

The proposed robust AdaBoostRT algorithm overcomesthe limitations existing in the available AdaBoostRT algo-rithm and its variants where the threshold value is manuallyspecified which may only be ideal for a very limited set ofcases The new robust AdaBoostRT algorithm is proposedto utilize the statistics distribution of approximation errorto dynamically determine a robust threshold The robustthreshold for each weak learner WLt is self-adjustable and isdefined as the scaled standard deviation of the approximationerrors 120582120590

119905 We analyze the convergence of the proposed

robust AdaBoostRT algorithm It has been proved that theerror of the final hypothesis output by the proposed ensemblealgorithm 119864ensemble is within a significantly superior bound

The proposed RAE-ELM is robust with respect to thedifference in various regression problems and variation ofapproximation error rates that do not significantly affectits highly stable generalization performance As one of thekey parameters in ensemble algorithm threshold value doesnot need any human intervention instead it is able tobe self-adjusted according to the real regression effect ofELM networks on the input dataset Such mechanism enableRAE-ELM to make sensitive and adaptive adjustment tothe intrinsic properties of the given regression problem

The experimental result comparisons in terms of stability andaccuracy among the six prevailing algorithms (RAE-ELMbasic ELM original Ada-ELMmodifiedAda-ELM SVR andLS-SVR) for regression issues verify that all the AdaBoostRTbased ensemble ELMs perform better than the SVR andmore remarkably the proposed RAE-ELM always achievesthe best performance The boosting effect of the proposedmethod is not significant for small sized and low dimensionalproblems as the individual classifier (ELM network) couldalready be sufficient to handle such problems well It isworth pointing out that the proposed RAE-ELM has betterperformance than others especially for high dimensional orlarge sized datasets which is a convincing indicator for goodgeneralization performance

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank Professor Guang-Bin HuangfromNanyangTechnological University for providing inspir-ing comments and suggestions on our research This work isfinancially supported by the University of Macau with Grantno MYRG079(Y1-L2)-FST13-YZX

References

[1] C L Philip Chen J Wang C H Wang and L Chen ldquoAnew learning algorithm for a Fully Connected Fuzzy InferenceSystem (F-CONFIS)rdquo IEEE Transactions on Neural Networksand Learning Systems vol 25 no 10 pp 1741ndash1757 2014

[2] Z Yang P K Wong C M Vong J Zhong and J Y LiangldquoSimultaneous-fault diagnosis of gas turbine generator systemsusing a pairwise-coupled probabilistic classifierrdquoMathematicalProblems in Engineering vol 2013 Article ID 827128 13 pages2013

12 Mathematical Problems in Engineering

[3] G-B Huang Q-Y Zhu and C-K Siew ldquoExtreme learningmachine theory and applicationsrdquoNeurocomputing vol 70 no1ndash3 pp 489ndash501 2006

[4] G-B Huang H Zhou X Ding and R Zhang ldquoExtremelearning machine for regression and multiclass classificationrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 42 no 2 pp 513ndash529 2012

[5] S C Ng C C Cheung and S H Leung ldquoMagnified gradientfunction with deterministic weight modification in adaptivelearningrdquo IEEE Transactions on Neural Networks vol 15 no 6pp 1411ndash1423 2004

[6] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[7] H J Rong G B Huang N Sundararajan and P SaratchandranldquoOnline sequential fuzzy extreme learningmachine for functionapproximation and classification problemsrdquo IEEE Transactionson Systems Man and Cybernetics Part B Cybernetics vol 39no 4 pp 1067ndash1072 2009

[8] J Cao Z Lin and G B Huang ldquoVoting base online sequentialextreme learning machine for multi-class classificationrdquo inProceedings of the IEEE International Symposium onCircuits andSystems (ISCAS 13) pp 2327ndash2330 IEEE 2013

[9] J Cao Z Lin G B Huang and N Liu ldquoVoting based extremelearning machinerdquo Information Sciences vol 185 no 1 pp 66ndash77 2012


Table 1: Specification of real-world regression benchmark datasets.

Problems              Training data   Testing data   Attributes
LVST                  90              35             308
Yeast                 837             645            7
Computer hardware     110             99             7
Abalone               3150            1027           7
Servo                 95              72             4
Parkinson disease     3000            2875           21
Housing               350             156            13
Cloud                 80              28             9
Auto price            110             49             9
Breast cancer         100             94             32
Balloon               1500            833            4
Auto-MPG              300             92             7
Bank                  5000            3192           8
Census (house8L)      15000           7784           8

From (20), we can obtain the upper bound on the number of iterations of this algorithm:

$$ T \le \left[ \frac{1}{2\gamma^{2}} \ln \frac{1}{E_{\text{ensemble}}} \right]. \tag{21} $$
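As a quick numerical illustration of (21), the bound can be evaluated directly. The values of γ and E_ensemble below are hypothetical and chosen only for illustration (γ plays the role of the constant appearing in the analysis leading to (20)):

```python
import math

gamma = 0.1          # hypothetical value of the constant gamma in (21)
E_ensemble = 0.05    # hypothetical target ensemble error

# Upper bound on the number of boosting iterations from (21)
T_bound = (1.0 / (2.0 * gamma ** 2)) * math.log(1.0 / E_ensemble)
print(math.ceil(T_bound))   # -> 150 for these illustrative values
```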

For RAE-ELM, the number of ELM networks needs to be determined. The number of ELM networks is set to 5, 10, 15, 20, 25, and 30 in our simulations, and the optimal value is selected as the one that results in the best average testing RMSE.

Besides, in our simulation trials, the relative factor λ in RAE-ELM is a parameter that needs to be optimized within the range λ ∈ (0, 1). We start the simulations with λ = 0.1 and increase it in steps of 0.1. Table 2 shows examples of setting both T and λ for our simulation trials.

As illustrated in Table 2 and Figure 3, RAE-ELM with the sigmoid activation function could achieve good generalization performance on the Parkinson disease dataset as long as the number of ELM networks T is larger than 15. For a given number of ELM networks, the RMSE is less sensitive to the variation of λ and tends to be smaller when λ is around 0.5. For a fair comparison, we set RAE-ELM with T = 20 and λ = 0.5 in the following experiments. For both the original Ada-ELM and the modified Ada-ELM, when the number of ELM networks is less than 7, the ensemble model is unstable; the number of ELM networks is therefore also set to 20 for both the original Ada-ELM and the modified Ada-ELM.
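The selection of T and λ described above is a plain grid search over the candidate values. A minimal sketch is given below, where train_and_evaluate(T, lam) is a hypothetical routine that trains one RAE-ELM ensemble (averaged over several trials) and returns its mean testing RMSE:

```python
import itertools

T_grid = [5, 10, 15, 20, 25, 30]
lambda_grid = [round(0.1 * i, 1) for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9

def select_T_and_lambda(train_and_evaluate):
    """Return the (T, lambda) pair with the smallest average testing RMSE."""
    best_rmse, best_pair = float("inf"), None
    for T, lam in itertools.product(T_grid, lambda_grid):
        rmse = train_and_evaluate(T, lam)
        if rmse < best_rmse:
            best_rmse, best_pair = rmse, (T, lam)
    return best_pair, best_rmse
```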

Table 2: Performance of the proposed RAE-ELM with different values of T and λ for the Parkinson disease dataset (values are average testing RMSE).

λ \ T     5        10       15       20       25       30
0.1       0.2472   0.2374   0.2309   0.2259   0.2251   0.2249
0.2       0.2469   0.2315   0.2276   0.2241   0.2249   0.2245
0.3       0.2447   0.2298   0.2248   0.2235   0.2230   0.2233
0.4       0.2426   0.2287   0.2249   0.2227   0.2224   0.2227
0.5       0.2428   0.2296   0.2251   0.2219   0.2216   0.2215
0.6       0.2427   0.2298   0.2246   0.2221   0.2214   0.2213
0.7       0.2422   0.2285   0.2257   0.2229   0.2217   0.2219
0.8       0.2441   0.2301   0.2265   0.2228   0.2225   0.2228
0.9       0.2461   0.2359   0.2312   0.2243   0.2237   0.2230

We use the popular Gaussian kernel function K(u, v) = exp(−γ‖u − v‖²) in both SVR and LS-SVR. As is well known, the performances of SVR and LS-SVR are sensitive to the combination of (C, γ). Hence, the cost parameter C and the kernel parameter γ need to be adjusted appropriately over a wide range so as to obtain good generalization performance. For each dataset, 50 different values of C and 50 different values of γ, that is, 2500 pairs of (C, γ), are tried as the adjustment parameters. The different values of C and γ are 2^−24, 2^−23, ..., 2^24, 2^25. In both SVR and LS-SVR, the best performing combination of (C, γ) is selected for each dataset, as presented in Table 3.
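The (C, γ) search can be written, for example, with scikit-learn. This is only an assumption about tooling: the paper does not name its SVR implementation, and it selects the best pair on the testing split rather than by cross-validation as done in this sketch.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# 50 candidate values each for C and gamma: 2^-24, 2^-23, ..., 2^25
param_grid = {
    "C": [2.0 ** k for k in range(-24, 26)],
    "gamma": [2.0 ** k for k in range(-24, 26)],
}

def tune_svr(X_train, y_train):
    # Note: the full 50 x 50 grid (2500 pairs) is computationally heavy.
    search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                          scoring="neg_root_mean_squared_error", cv=5)
    search.fit(X_train, y_train)
    return search.best_params_, -search.best_score_
```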

For basic ELM and the other ELM-based ensemble methods, the sigmoid function G(a, b, x) = 1/(1 + exp(−(a · x + b))) is selected as the activation function in all the algorithms. The parameters (C, L) need to be selected so as to achieve the best generalization performance, where the cost parameter C is selected from the range 2^−24, 2^−23, ..., 2^24, 2^25 and the candidate numbers of hidden nodes L are 10, 20, ..., 1000.
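For reference, a minimal sketch of a basic regularized ELM regressor with the sigmoid activation above is shown next. It uses the common ridge-regularized solution β = (I/C + HᵀH)⁻¹Hᵀy and is an illustrative implementation under that assumption, not the authors' code:

```python
import numpy as np

def train_elm(X, y, L=100, C=1.0, seed=0):
    """Train a basic ELM regressor with L sigmoid hidden nodes."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A = rng.standard_normal((d, L))            # random input weights a
    b = rng.standard_normal(L)                 # random hidden biases b
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))     # hidden-layer output matrix
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ y)
    return A, b, beta

def predict_elm(model, X):
    A, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```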

In addition, for the original AdaBoost.RT-based ensemble ELM, the threshold Φ has to be chosen before seeing the datasets. In the original AdaBoost.RT-based ensemble ELM, the threshold Φ is required to be manually selected according to an empirical suggestion, and it is a sensitive factor affecting the regression performance. If Φ is too low, it is generally difficult to obtain a sufficient number of "accepted" samples. However, if Φ is too high, some wrong samples are treated as "accepted" ones and the ensemble model tends to be unstable. According to Shrestha and Solomatine's experiments, the threshold Φ should be kept between 0 and 0.4 in order to make the ensemble model stable [30]. In our simulations, we incrementally set thresholds within the range from 0 to 0.4. The original Ada-ELM with threshold values of 0.1, 0.15, ..., 0.35, 0.4 could generate satisfactory results for all the regression problems, and the best performing original Ada-ELM is shown in boldface. What is more, the modified Ada-ELM algorithm needs to select an initial value Φ0 to calculate the subsequent thresholds in the iterations. Tian and Mao suggested setting the default initial value of Φ0 to 0.2 [31]. Considering that a manually fixed initial threshold is not related to the actual prediction behavior of the ELM networks on the input dataset, the algorithm may not reach the best generalization performance. In our simulations, we compare the performances of modified Ada-ELMs with different initial threshold values Φ0 set to 0.1, 0.15, ..., 0.3, 0.35. The best performing modified Ada-ELM is also presented in Table 3.

Figure 3: The average testing RMSE with different values of T and λ in RAE-ELM for the Parkinson disease dataset.

Table 3: Parameters of RAE-ELM and other learning algorithms.

Datasets              RAE-ELM (C, L)   Basic ELM [4] (C, L)   Original Ada-ELM [30] (C, L, Φ)   Modified Ada-ELM [31] (C, L, Φ0)   SVR [36] (C, γ)   LS-SVR [37] (C, γ)
LVST                  (2^13, 960)      (2^5, 710)             (2^-7, 530, 0.3)                  (2^-7, 460, 0.35)                  (2^17, 2^2)       (2^8, 2^0)
Yeast                 (2^10, 540)      (2^1, 340)             (2^5, 710, 0.25)                  (2^-1, 800, 0.2)                   (2^-7, 2^0)       (2^-9, 2^3)
Computer hardware     (2^-4, 160)      (2^11, 530)            (2^16, 240, 0.35)                 (2^10, 170, 0.25)                  (2^-9, 2^21)      (2^0, 2^6)
Abalone               (2^0, 150)       (2^0, 200)             (2^-22, 330, 0.2)                 (2^6, 610, 0.15)                   (2^3, 2^18)       (2^1, 2^14)
Servo                 (2^1, 340)       (2^-6, 650)            (2^0, 110, 0.3)                   (2^9, 230, 0.2)                    (2^-2, 2^2)       (2^4, 2^-3)
Parkinson disease     (2^9, 830)       (2^-8, 460)            (2^4, 370, 0.35)                  (2^-4, 590, 0.25)                  (2^1, 2^0)        (2^1, 2^7)
Housing               (2^-19, 270)     (2^-9, 310)            (2^-1, 440, 0.25)                 (2^17, 220, 0.2)                   (2^21, 2^-4)      (2^3, 2^12)
Cloud                 (2^-5, 590)      (2^13, 880)            (2^0, 1000, 0.2)                  (2^2, 760, 0.25)                   (2^7, 2^5)        (2^4, 2^-11)
Auto price            (2^20, 310)      (2^17, 430)            (2^-2, 230, 0.35)                 (2^-5, 270, 0.35)                  (2^3, 2^23)       (2^7, 2^-10)
Breast cancer         (2^-3, 80)       (2^-1, 160)            (2^-6, 90, 0.15)                  (2^13, 290, 0.25)                  (2^6, 2^1)        (2^6, 2^17)
Balloon               (2^22, 160)      (2^16, 190)            (2^13, 250, 0.2)                  (2^17, 210, 0.2)                   (2^1, 2^8)        (2^0, 2^-7)
Auto-MPG              (2^7, 630)       (2^-10, 820)           (2^21, 380, 0.35)                 (2^11, 680, 0.3)                   (2^-2, 2^-1)      (2^1, 2^5)
Bank                  (2^-1, 660)      (2^0, 590)             (2^2, 710, 0.3)                   (2^-9, 450, 0.15)                  (2^10, 2^-4)      (2^21, 2^-5)
Census (house8L)      (2^3, 580)       (2^5, 460)             (2^1, 190, 0.3)                   (2^3, 600, 0.25)                   (2^5, 2^-6)       (2^0, 2^-4)

5.2. Performance Comparisons between RAE-ELM and Other Learning Algorithms

In this subsection, the performance of the proposed RAE-ELM is compared with other learning algorithms, including basic ELM [4], original Ada-ELM [30], modified Ada-ELM [31], support vector regression (SVR) [36], and least-squares support vector regression (LS-SVR) [37]. The result comparisons of RAE-ELM and the other learning algorithms on real-world data regression problems are shown in Table 4.

Table 4 lists the averaged results of multiple trials of the four ELM-based algorithms (RAE-ELM, basic ELM, original Ada-ELM [30], and modified Ada-ELM [31]), SVR [36], and LS-SVR [37] on fourteen representative real-world data regression problems. The selected datasets include both large-scale and small-scale data, as well as high-dimensional and low-dimensional problems. It is easy to see that the averaged testing RMSE obtained by RAE-ELM is the best among these six algorithms for all fourteen cases. For the original Ada-ELM, the performance is sensitive to the selected threshold value Φ: the best performing original Ada-ELM models for different regression problems have correspondingly different threshold values. Therefore, the manual selection strategy is not satisfactory. The generalization performance of the modified Ada-ELM is in general better than that of the original Ada-ELM. However, the empirically suggested initial threshold value of 0.2 does not guarantee a mapping to the best performing regression model.

The three AdaBoost.RT based ensemble ELMs (RAE-ELM, original Ada-ELM, and modified Ada-ELM) all perform better than the basic ELM, which verifies that an ensemble ELM using AdaBoost.RT can achieve better prediction accuracy than using an individual ELM as the predictor. The averaged generalization performance of basic ELM is better than that of SVR, while it is slightly worse than that of LS-SVR.

To find the best performing original Ada-ELM or modified Ada-ELM model for a regression problem, the optimal threshold has to be searched by brute force, because the threshold selection is not tied to the input dataset or to the ELM networks. One needs to carry out a set of experiments with different (initial) threshold values and then search among them for the best ensemble ELM model. Such a process is time consuming. Moreover, the generalization performance of the ensemble ELMs optimized in this way, using the original Ada-ELM or the modified Ada-ELM, can hardly be better than that of the proposed RAE-ELM. In fact, the proposed RAE-ELM is always the best performing learner among the six candidates for all fourteen real-world regression problems.
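The brute-force procedure amounts to retraining the whole ensemble once per candidate threshold, which is where the time cost comes from. A sketch is given below, where train_ada_elm(phi) is a hypothetical routine returning the testing RMSE of an original Ada-ELM ensemble built with the fixed threshold phi:

```python
candidate_phi = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]

def brute_force_threshold(train_ada_elm):
    """Retrain one ensemble per candidate threshold and keep the best."""
    rmse_by_phi = {phi: train_ada_elm(phi) for phi in candidate_phi}
    best_phi = min(rmse_by_phi, key=rmse_by_phi.get)
    return best_phi, rmse_by_phi[best_phi]
```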


Table 4: Result comparisons of testing RMSE of RAE-ELM and other learning algorithms for real-world data regression problems.

Datasets              RAE-ELM   Basic ELM [4]   Original Ada-ELM [30]   Modified Ada-ELM [31]   SVR [36]   LS-SVR [37]
LVST                  0.2571    0.2854          0.2702                  0.2653                  0.2849     0.2801
Yeast                 0.1071    0.1224          0.1195                  0.1156                  0.1238     0.1182
Computer hardware     0.0563    0.0839          0.0821                  0.0726                  0.0976     0.0771
Abalone               0.0731    0.0882          0.0793                  0.0778                  0.0890     0.0815
Servo                 0.0727    0.0924          0.0884                  0.0839                  0.0966     0.0870
Parkinson disease     0.2219    0.2549          0.2513                  0.2383                  0.2547     0.2437
Housing               0.1018    0.1259          0.1163                  0.1135                  0.1270     0.1185
Cloud                 0.2897    0.3269          0.3118                  0.3006                  0.3316     0.3177
Auto price            0.0809    0.1036          0.0910                  0.0894                  0.1045     0.0973
Breast cancer         0.2463    0.2641          0.2601                  0.2519                  0.2657     0.2597
Balloon               0.0570    0.0639          0.0611                  0.0592                  0.0672     0.0618
Auto-MPG              0.0724    0.0862          0.0799                  0.0781                  0.0891     0.0857
Bank                  0.0415    0.0551          0.0496                  0.0453                  0.0537     0.0498
Census (house8L)      0.0746    0.0820          0.0802                  0.0795                  0.0867     0.0813

6. Conclusion

In this paper, a robust AdaBoost.RT based ensemble ELM (RAE-ELM) for regression problems is proposed, which combines ELM with the novel robust AdaBoost.RT algorithm. Combining the effective learner, ELM, with the novel ensemble method, the robust AdaBoost.RT algorithm, yields a hybrid method that inherits their intrinsic properties and achieves better prediction accuracy than using only an individual ELM as the predictor. ELM tends to reach its solutions straightforwardly, and its regression error rate is in general much smaller than 0.5; therefore, selecting ELM as the "weak" learner helps avoid overfitting. Moreover, as ELM is a fast learner with quite high regression performance, it contributes to the overall generalization performance of the ensemble ELM.

The proposed robust AdaBoost.RT algorithm overcomes the limitation of the available AdaBoost.RT algorithm and its variants, in which the threshold value is manually specified and may only be ideal for a very limited set of cases. The new robust AdaBoost.RT algorithm utilizes the statistical distribution of the approximation errors to dynamically determine a robust threshold. The robust threshold for each weak learner WL_t is self-adjustable and is defined as the scaled standard deviation of the approximation errors, λσ_t. We analyze the convergence of the proposed robust AdaBoost.RT algorithm. It has been proved that the error of the final hypothesis output by the proposed ensemble algorithm, E_ensemble, is within a significantly superior bound.
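To make the self-adjusting threshold concrete, the sketch below implements one boosting round in the AdaBoost.RT style with the threshold set to λ times the standard deviation of the current absolute relative errors. The weight update follows the generic AdaBoost.RT scheme of [29, 30]; it is a schematic illustration rather than the exact formulas of the proposed algorithm.

```python
import numpy as np

def robust_boosting_round(y_true, y_pred, weights, lam=0.5, power=2):
    """One AdaBoost.RT-style round with a self-adjusting threshold lam * sigma_t.

    y_pred holds the predictions of the current weak learner (an ELM) and
    weights is the current sample distribution (nonnegative, sums to 1).
    Targets are assumed nonzero so that relative errors are well defined.
    """
    are = np.abs((y_pred - y_true) / y_true)          # absolute relative errors
    threshold = lam * np.std(are)                     # robust threshold for this learner
    wrong = are > threshold                           # samples treated as mispredicted
    eps = max(float(np.sum(weights[wrong])), 1e-12)   # weighted error rate
    beta = eps ** power                               # confidence term of this learner
    # Down-weight the correctly predicted samples, then renormalize
    new_weights = np.where(wrong, weights, weights * beta)
    new_weights = new_weights / new_weights.sum()
    learner_weight = np.log(1.0 / beta)               # used in the final weighted combination
    return new_weights, learner_weight, threshold
```

The point illustrated here is that the threshold is recomputed from the error distribution of each weak learner instead of being fixed in advance; the T weak ELMs are then combined using their learner weights.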

The proposed RAE-ELM is robust with respect to the differences among regression problems and to variations of the approximation error rates, which do not significantly affect its highly stable generalization performance. As one of the key parameters in the ensemble algorithm, the threshold value does not need any human intervention; instead, it is self-adjusted according to the real regression behavior of the ELM networks on the input dataset. Such a mechanism enables RAE-ELM to make sensitive and adaptive adjustments to the intrinsic properties of the given regression problem.

The experimental comparisons, in terms of stability and accuracy, among the six prevailing algorithms (RAE-ELM, basic ELM, original Ada-ELM, modified Ada-ELM, SVR, and LS-SVR) on regression problems verify that all the AdaBoost.RT based ensemble ELMs perform better than SVR and, more remarkably, that the proposed RAE-ELM always achieves the best performance. The boosting effect of the proposed method is not significant for small-sized and low-dimensional problems, as an individual learner (a single ELM network) can already handle such problems well. It is worth pointing out that the proposed RAE-ELM outperforms the others especially for high-dimensional or large-sized datasets, which is a convincing indicator of good generalization performance.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Professor Guang-Bin Huang from Nanyang Technological University for providing inspiring comments and suggestions on our research. This work is financially supported by the University of Macau under Grant no. MYRG079(Y1-L2)-FST13-YZX.

References

[1] C. L. Philip Chen, J. Wang, C. H. Wang, and L. Chen, "A new learning algorithm for a Fully Connected Fuzzy Inference System (F-CONFIS)," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 10, pp. 1741–1757, 2014.
[2] Z. Yang, P. K. Wong, C. M. Vong, J. Zhong, and J. Y. Liang, "Simultaneous-fault diagnosis of gas turbine generator systems using a pairwise-coupled probabilistic classifier," Mathematical Problems in Engineering, vol. 2013, Article ID 827128, 13 pages, 2013.


[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1–3, pp. 489–501, 2006.
[4] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513–529, 2012.
[5] S. C. Ng, C. C. Cheung, and S. H. Leung, "Magnified gradient function with deterministic weight modification in adaptive learning," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1411–1423, 2004.
[6] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[7] H. J. Rong, G. B. Huang, N. Sundararajan, and P. Saratchandran, "Online sequential fuzzy extreme learning machine for function approximation and classification problems," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 4, pp. 1067–1072, 2009.
[8] J. Cao, Z. Lin, and G. B. Huang, "Voting base online sequential extreme learning machine for multi-class classification," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '13), pp. 2327–2330, IEEE, 2013.
[9] J. Cao, Z. Lin, G. B. Huang, and N. Liu, "Voting based extreme learning machine," Information Sciences, vol. 185, no. 1, pp. 66–77, 2012.
[10] N. Y. Liang, G. B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411–1423, 2006.
[11] J. Luo, C. M. Vong, and P. K. Wong, "Sparse Bayesian extreme learning machine for multi-classification," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 836–843, 2014.
[12] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, vol. 2, MIT Press, Cambridge, UK, 2006.
[13] Y. Lan, Y. C. Soh, and G. B. Huang, "Ensemble of online sequential extreme learning machine," Neurocomputing, vol. 72, no. 13, pp. 3391–3395, 2009.
[14] N. Liu and H. Wang, "Ensemble based extreme learning machine," IEEE Signal Processing Letters, vol. 17, no. 8, pp. 754–757, 2010.
[15] X. Xue, M. Yao, Z. Wu, and J. Yang, "Genetic ensemble of extreme learning machine," Neurocomputing, vol. 129, no. 10, pp. 175–184, 2014.
[16] X. L. Wang, Y. Y. Chen, H. Zhao, and B. L. Lu, "Parallelized extreme learning machine ensemble based on min–max modular network," Neurocomputing, vol. 128, pp. 31–41, 2014.
[17] B. V. Dasarathy and B. V. Sheela, "A composite classifier system design: concepts and methodology," Proceedings of the IEEE, vol. 67, no. 5, pp. 708–713, 1979.
[18] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993–1001, 1990.
[19] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[20] R. E. Schapire, "The strength of weak learnability," Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
[21] R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms, MIT Press, Cambridge, Mass, USA, 2012.
[22] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[23] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
[24] V. Guruswami and A. Sahai, "Multiclass learning, boosting, and error-correcting codes," in Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 145–155, July 1999.
[25] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
[26] N. Li, "Multiclass boosting with repartitioning," in Proceedings of the 23rd International Conference on Machine Learning, pp. 569–576, ACM, June 2006.
[27] H. Drucker, "Improving regressors using boosting," in Proceedings of the 14th International Conference on Machine Learning, pp. 107–115, 1997.
[28] R. Avnimelech and N. Intrator, "Boosting regression estimators," Neural Computation, vol. 11, no. 2, pp. 499–520, 1999.
[29] D. P. Solomatine and D. L. Shrestha, "AdaBoost.RT: a boosting algorithm for regression problems," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 1163–1168, IEEE, 2004.
[30] D. L. Shrestha and D. P. Solomatine, "Experiments with AdaBoost.RT, an improved boosting scheme for regression," Neural Computation, vol. 18, no. 7, pp. 1678–1710, 2006.
[31] H.-X. Tian and Z.-Z. Mao, "An ensemble ELM based on modified AdaBoost.RT algorithm for predicting the temperature of molten steel in ladle furnace," IEEE Transactions on Automation Science and Engineering, vol. 7, no. 1, pp. 73–80, 2010.
[32] J. W. Cao, T. Chen, and J. Fan, "Fast online learning algorithm for landmark recognition based on BoW framework," in Proceedings of the 9th IEEE Conference on Industrial Electronics and Applications, Hangzhou, China, June 2014.
[33] J. Cao, Z. Lin, and G.-B. Huang, "Composite function wavelet neural networks with extreme learning machine," Neurocomputing, vol. 73, no. 7–9, pp. 1405–1416, 2010.
[34] R. Feely, Predicting Stock Market Volatility Using Neural Networks [B.A. (Mod) thesis], Trinity College Dublin, Dublin, Ireland, 2000.
[35] K. Bache and M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2014, http://archive.ics.uci.edu/ml.
[36] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155–161, 1997.
[37] J. A. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 10: Research Article A Robust AdaBoost.RT Based Ensemble ...downloads.hindawi.com/journals/mpe/2015/260970.pdf · AdaBoost.RT algorithm (modi ed Ada-ELM) in order to predict the temperature

10 Mathematical Problems in Engineering

5 10 15 2025

30

002

0406

081

RMSE

0220225

0230235

0240245

025

120582T

Figure 3 The average testing RMSE with different values of 119879 and 120582 in RAE-ELM for Parkinson disease dataset

Table 3 Parameters of RAE-ELM and other learning algorithms

Datasets RAE ELM Basic ELM [4] Original AdaELM [30] Modified AdaELM [31] SVR [36] LS SVR [37](119862 119871) (119862 119871) (119862 119871 120601) (119862 119871 120601

0) (119862 120574) (119862 120574)

LVST (213 960) (25 710) (2minus7 530 03) (2minus7 460 035) (217 22) (28 20)Yeast (210 540) (21 340) (25 710 025) (2minus1 800 02) (2minus7 20) (2minus9 23)Computer hardware (2minus4 160) (211 530) (216 240 035) (210 170 025) (2minus9 221) (20 26)Abalone (20 150) (20 200) (2minus22 330 02) (26 610 015) (23 218) (21 214)Servo (21 340) (2minus6 650) (20 110 03) (29 230 02) (2minus2 22) (24 2minus3)Parkinson disease (29 830) (2minus8 460) (24 370 035) (2minus4 590 025) (21 20) (21 27)Housing (2minus19 270) (2minus9 310) (2minus1 440 025) (217 220 02) (221 2minus4) (23 212)Cloud (2minus5 590) (213 880) (20 1000 02) (22 760 025) (27 25) (24 2minus11)Auto price (220 310) (217 430) (2minus2 230 035) (2minus5 270 035) (23 223) (27 2minus10)Breast cancer (2minus3 80) (2minus1 160) (2minus6 90 015) (213 290 025) (26 21) (26 217)Balloon (222 160) (216 190) (213 250 02) (217 210 02) (21 28) (20 2minus7)Auto-MPG (27 630) (2minus10 820) (221 380 035) (211 680 03) (2minus2 2minus1) (21 25)Bank (2minus1 660) (20 590) (22 710 03) (2minus9 450 015) (210 2minus4) (221 2minus5)Census (house8L) (23 580) (25 460) (21 190 03) (23 600 025) (25 2minus6) (20 2minus4)

and least-square support vector regression [37] The resultscomparisons of RAE-ELM and other learning algorithms forreal-world data regressions are shown in Table 4

Table 4 lists the averaging results of multiple trails of thefour ELM based algorithms (RAE-ELM basic ELM originalAda-ELM [30] and modified Ada-ELM [31]) SVR [36]and LS-SVR [37] for fourteen representative real-world dataregression problemsThe selected datasets include large scaleof data and small scale of data as well as high dimensionaldata problems and low dimensional problems It is easyto find that averaged testing RMSE obtained by RAE-ELMfor all the fourteen cases are always the best among thesesix algorithms For original Ada-ELM the performance issensitive to the selection of threshold value of Φ The bestperformed original Ada-ELM models for different regres-sion problems own their correspondingly different thresholdvalues Therefore the manual chosen strategy is not goodThe generalization performance of modified Ada-ELM ingeneral is better than original Ada-ELM However theempirical suggested initial threshold value at 02 does notensure a mapping to the best performed regression model

The three AdaBoostRT based ensemble ELMs (RAE-ELMoriginal Ada-ELM and modified Ada-ELM) all performbetter than the basic ELM which verifies that an ensembleELM using AdaBoostRT can achieve better predicationaccuracy than using individual ELM as the predictor Theaveraged generalization performance of basic ELM is betterthan SVR while it is slightly worse than LS-SVR

To find the best performed original Ada-ELM model ormodified Ada-ELM for a regression problem as the inputdataset and ELM networks are not related to the thresholdselection the optimal parameter need be searched by brute-force One needs to carry out a set of experiments withdifferent (initial) threshold values and then searches amongthem for the best ensemble ELM model Such process istime consuming Moreover the generalization performanceof the optimized ensemble ELMs using original Ada-ELMor modified Ada-ELM can hardly be better than that ofthe proposed RAE-ELM In fact the proposed RAE-ELM isalways the best performed learner among the six candidatesfor all fourteen real-world regression problems

Mathematical Problems in Engineering 11

Table 4 Result comparisons of testing RMSE of RAE-ELM and other learning algorithms for real-world data regression problems

Datasets RAE-ELM Basic ELM [4] Original AdaELM [30] Modified AdaELM [31] SVR [36] LSSVR [37]LVST 02571 02854 02702 02653 02849 02801Yeast 01071 01224 01195 01156 01238 01182Computer hardware 00563 00839 00821 00726 00976 00771Abalone 00731 00882 00793 00778 00890 00815Servo 00727 00924 00884 00839 00966 00870Parkinson disease 02219 02549 02513 02383 02547 02437Housing 01018 01259 01163 01135 0127 01185Cloud 02897 03269 03118 03006 03316 03177Auto price 00809 01036 00910 00894 01045 00973Breast cancer 02463 02641 02601 02519 02657 02597Balloon 00570 00639 00611 00592 00672 00618Auto-MPG 00724 00862 00799 00781 00891 00857Bank 00415 00551 00496 00453 00537 00498Census (house8L) 00746 00820 00802 00795 00867 00813

6 Conclusion

In this paper a robust AdaBoostRT based ensemble ELM(RAE-ELM) for regression problems is proposedwhich com-bined ELM with the novel robust AdaBoostRT algorithmCombing the effective learner ELM with the novel ensemblemethod the robust AdaBoostRT algorithm could constructa hybrid method that inherits their intrinsic properties andachieves better predication accuracy than using only indi-vidual ELM as predictor ELM tends to reach the solutionsstraightforwardly and the error rate of regression predictionis in general much smaller than 05 Therefore selectingELM as the ldquoweakrdquo learner can avoid overfittingMoreover asELM is a fast learner with quite high regression performanceit contributes to the overall generalization performance of theensemble ELM

The proposed robust AdaBoostRT algorithm overcomesthe limitations existing in the available AdaBoostRT algo-rithm and its variants where the threshold value is manuallyspecified which may only be ideal for a very limited set ofcases The new robust AdaBoostRT algorithm is proposedto utilize the statistics distribution of approximation errorto dynamically determine a robust threshold The robustthreshold for each weak learner WLt is self-adjustable and isdefined as the scaled standard deviation of the approximationerrors 120582120590

119905 We analyze the convergence of the proposed

robust AdaBoostRT algorithm It has been proved that theerror of the final hypothesis output by the proposed ensemblealgorithm 119864ensemble is within a significantly superior bound

The proposed RAE-ELM is robust with respect to thedifference in various regression problems and variation ofapproximation error rates that do not significantly affectits highly stable generalization performance As one of thekey parameters in ensemble algorithm threshold value doesnot need any human intervention instead it is able tobe self-adjusted according to the real regression effect ofELM networks on the input dataset Such mechanism enableRAE-ELM to make sensitive and adaptive adjustment tothe intrinsic properties of the given regression problem

The experimental result comparisons in terms of stability andaccuracy among the six prevailing algorithms (RAE-ELMbasic ELM original Ada-ELMmodifiedAda-ELM SVR andLS-SVR) for regression issues verify that all the AdaBoostRTbased ensemble ELMs perform better than the SVR andmore remarkably the proposed RAE-ELM always achievesthe best performance The boosting effect of the proposedmethod is not significant for small sized and low dimensionalproblems as the individual classifier (ELM network) couldalready be sufficient to handle such problems well It isworth pointing out that the proposed RAE-ELM has betterperformance than others especially for high dimensional orlarge sized datasets which is a convincing indicator for goodgeneralization performance

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank Professor Guang-Bin HuangfromNanyangTechnological University for providing inspir-ing comments and suggestions on our research This work isfinancially supported by the University of Macau with Grantno MYRG079(Y1-L2)-FST13-YZX

References

[1] C L Philip Chen J Wang C H Wang and L Chen ldquoAnew learning algorithm for a Fully Connected Fuzzy InferenceSystem (F-CONFIS)rdquo IEEE Transactions on Neural Networksand Learning Systems vol 25 no 10 pp 1741ndash1757 2014

[2] Z Yang P K Wong C M Vong J Zhong and J Y LiangldquoSimultaneous-fault diagnosis of gas turbine generator systemsusing a pairwise-coupled probabilistic classifierrdquoMathematicalProblems in Engineering vol 2013 Article ID 827128 13 pages2013

12 Mathematical Problems in Engineering

[3] G-B Huang Q-Y Zhu and C-K Siew ldquoExtreme learningmachine theory and applicationsrdquoNeurocomputing vol 70 no1ndash3 pp 489ndash501 2006

[4] G-B Huang H Zhou X Ding and R Zhang ldquoExtremelearning machine for regression and multiclass classificationrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 42 no 2 pp 513ndash529 2012

[5] S C Ng C C Cheung and S H Leung ldquoMagnified gradientfunction with deterministic weight modification in adaptivelearningrdquo IEEE Transactions on Neural Networks vol 15 no 6pp 1411ndash1423 2004

[6] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[7] H J Rong G B Huang N Sundararajan and P SaratchandranldquoOnline sequential fuzzy extreme learningmachine for functionapproximation and classification problemsrdquo IEEE Transactionson Systems Man and Cybernetics Part B Cybernetics vol 39no 4 pp 1067ndash1072 2009

[8] J Cao Z Lin and G B Huang ldquoVoting base online sequentialextreme learning machine for multi-class classificationrdquo inProceedings of the IEEE International Symposium onCircuits andSystems (ISCAS 13) pp 2327ndash2330 IEEE 2013

[9] J Cao Z Lin G B Huang and N Liu ldquoVoting based extremelearning machinerdquo Information Sciences vol 185 no 1 pp 66ndash77 2012

[10] N Y Liang G B Huang P Saratchandran and N Sundarara-jan ldquoA fast and accurate online sequential learning algorithmfor feedforward networksrdquo IEEE Transactions on Neural Net-works vol 17 no 6 pp 1411ndash1423 2006

[11] J Luo C M Vong and P K Wong ldquoSparse Bayesian extremelearningmachine formulti-classificationrdquo IEEETransactions onNeural Networks and Learning Systems vol 25 no 4 pp 836ndash843 2014

[12] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing vol 2 MIT Press Cambridge UK 2006

[13] Y Lan Y C Soh and G B Huang ldquoEnsemble of onlinesequential extreme learning machinerdquoNeurocomputing vol 72no 13 pp 3391ndash3395 2009

[14] N Liu and H Wang ldquoEnsemble based extreme learningmachinerdquo IEEE Signal Processing Letters vol 17 no 8 pp 754ndash757 2010

[15] X Xue M Yao Z Wu and J Yang ldquoGenetic ensemble ofextreme learningmachinerdquoNeurocomputing vol 129 no 10 pp175ndash184 2014

[16] X L Wang Y Y Chen H Zhao and B L Lu ldquoParallelizedextreme learning machine ensemble based on minndashmax mod-ular networkrdquo Neurocomputing vol 128 pp 31ndash41 2014

[17] B V Dasarathy and B V Sheela ldquoA composite classifier systemdesign concepts andmethodologyrdquoProceedings of the IEEE vol67 no 5 pp 708ndash713 1979

[18] L KHansen andP Salamon ldquoNeural network ensemblesrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol12 no 10 pp 993ndash1001 1990

[19] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[20] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990

[21] R E Schapire and Y Freund Boosting Foundations and Algo-rithmsRobert E Schapire Yoav Freund MIT Press CambridgeMass USA 2012

[22] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001

[23] R A Jacobs M I Jordan S J Nowlan and G E HintonldquoAdaptive mixtures of local expertsrdquo Neural Computation vol3 no 1 pp 79ndash87 1991

[24] V Guruswami and A Sahai ldquoMulticlass learning boostingand error-correcting codesrdquo in Proceedings of the 12th AnnualConference on Computational Learning Theory (COLT rsquo99) pp145ndash155 July 1999

[25] R E Schapire and Y Singer ldquoImproved boosting algorithmsusing confidence-rated predictionsrdquo Machine Learning vol 37no 3 pp 297ndash336 1999

[26] N Li ldquoMulticlass boosting with repartitioningrdquo in Proceedingsof the 23rd International Conference on Machine Learning pp569ndash576 ACM June 2006

[27] H Drucker ldquoImproving regressors using boostingrdquo in Proceed-ings of the 14th International Conference on Machine Learningpp 107ndash115 1997

[28] R Avnimelech and N Intrator ldquoBoosting regression estima-torsrdquo Neural Computation vol 11 no 2 pp 499ndash520 1999

[29] D P Solomatine and D L Shrestha ldquoAdaBoostRT a boostingalgorithm for regression problemsrdquo in Proceedings of the IEEEInternational Joint Conference on Neural Networks vol 2 pp1163ndash1168 IEEE 2004

[30] D L Shrestha and D P Solomatine ldquoExperiments withAdaBoostRT an improved boosting scheme for regressionrdquoNeural Computation vol 18 no 7 pp 1678ndash1710 2006

[31] H-X Tian and Z-Z Mao ldquoAn ensemble ELM based on mod-ified AdaBoostRT algorithm for predicting the temperature ofmolten steel in ladle furnacerdquo IEEE Transactions on AutomationScience and Engineering vol 7 no 1 pp 73ndash80 2010

[32] J W Cao T Chen and J Fan ldquoFast online learning algorithmfor landmark recognition based on BoW frameworkrdquo in Pro-ceedings of the 9th IEEE Conference on Industrial Electronics andApplications Hangzhou China June 2014

[33] J Cao Z Lin and G-B Huang ldquoComposite function waveletneural networks with extreme learning machinerdquo Neurocom-puting vol 73 no 7ndash9 pp 1405ndash1416 2010

[34] R Feely Predicting stock market volatility using neural networksBA (Mod) [PhD thesis] Trinity College Dublin DublinIreland 2000

[35] K Bache and M Lichman UCI Machine Learning RepositoryUniversity of California School of Information and ComputerScience Irvine Calif USA 2014 httparchiveicsucieduml

[36] H Drucker C J Burges L Kaufman A Smola and VVapnik ldquoSupport vector regression machinesrdquo Advances inNeural Information Processing Systems vol 9 pp 155ndash161 1997

[37] J A Suykens and J Vandewalle ldquoLeast squares support vectormachine classifiersrdquo Neural Processing Letters vol 9 no 3 pp293ndash300 1999

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 11: Research Article A Robust AdaBoost.RT Based Ensemble ...downloads.hindawi.com/journals/mpe/2015/260970.pdf · AdaBoost.RT algorithm (modi ed Ada-ELM) in order to predict the temperature

Mathematical Problems in Engineering 11

Table 4 Result comparisons of testing RMSE of RAE-ELM and other learning algorithms for real-world data regression problems

Datasets RAE-ELM Basic ELM [4] Original AdaELM [30] Modified AdaELM [31] SVR [36] LSSVR [37]LVST 02571 02854 02702 02653 02849 02801Yeast 01071 01224 01195 01156 01238 01182Computer hardware 00563 00839 00821 00726 00976 00771Abalone 00731 00882 00793 00778 00890 00815Servo 00727 00924 00884 00839 00966 00870Parkinson disease 02219 02549 02513 02383 02547 02437Housing 01018 01259 01163 01135 0127 01185Cloud 02897 03269 03118 03006 03316 03177Auto price 00809 01036 00910 00894 01045 00973Breast cancer 02463 02641 02601 02519 02657 02597Balloon 00570 00639 00611 00592 00672 00618Auto-MPG 00724 00862 00799 00781 00891 00857Bank 00415 00551 00496 00453 00537 00498Census (house8L) 00746 00820 00802 00795 00867 00813

6 Conclusion

In this paper a robust AdaBoostRT based ensemble ELM(RAE-ELM) for regression problems is proposedwhich com-bined ELM with the novel robust AdaBoostRT algorithmCombing the effective learner ELM with the novel ensemblemethod the robust AdaBoostRT algorithm could constructa hybrid method that inherits their intrinsic properties andachieves better predication accuracy than using only indi-vidual ELM as predictor ELM tends to reach the solutionsstraightforwardly and the error rate of regression predictionis in general much smaller than 05 Therefore selectingELM as the ldquoweakrdquo learner can avoid overfittingMoreover asELM is a fast learner with quite high regression performanceit contributes to the overall generalization performance of theensemble ELM

The proposed robust AdaBoostRT algorithm overcomesthe limitations existing in the available AdaBoostRT algo-rithm and its variants where the threshold value is manuallyspecified which may only be ideal for a very limited set ofcases The new robust AdaBoostRT algorithm is proposedto utilize the statistics distribution of approximation errorto dynamically determine a robust threshold The robustthreshold for each weak learner WLt is self-adjustable and isdefined as the scaled standard deviation of the approximationerrors 120582120590

119905 We analyze the convergence of the proposed

robust AdaBoostRT algorithm It has been proved that theerror of the final hypothesis output by the proposed ensemblealgorithm 119864ensemble is within a significantly superior bound

The proposed RAE-ELM is robust with respect to thedifference in various regression problems and variation ofapproximation error rates that do not significantly affectits highly stable generalization performance As one of thekey parameters in ensemble algorithm threshold value doesnot need any human intervention instead it is able tobe self-adjusted according to the real regression effect ofELM networks on the input dataset Such mechanism enableRAE-ELM to make sensitive and adaptive adjustment tothe intrinsic properties of the given regression problem

The experimental result comparisons in terms of stability andaccuracy among the six prevailing algorithms (RAE-ELMbasic ELM original Ada-ELMmodifiedAda-ELM SVR andLS-SVR) for regression issues verify that all the AdaBoostRTbased ensemble ELMs perform better than the SVR andmore remarkably the proposed RAE-ELM always achievesthe best performance The boosting effect of the proposedmethod is not significant for small sized and low dimensionalproblems as the individual classifier (ELM network) couldalready be sufficient to handle such problems well It isworth pointing out that the proposed RAE-ELM has betterperformance than others especially for high dimensional orlarge sized datasets which is a convincing indicator for goodgeneralization performance

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to thank Professor Guang-Bin HuangfromNanyangTechnological University for providing inspir-ing comments and suggestions on our research This work isfinancially supported by the University of Macau with Grantno MYRG079(Y1-L2)-FST13-YZX

References

[1] C L Philip Chen J Wang C H Wang and L Chen ldquoAnew learning algorithm for a Fully Connected Fuzzy InferenceSystem (F-CONFIS)rdquo IEEE Transactions on Neural Networksand Learning Systems vol 25 no 10 pp 1741ndash1757 2014

[2] Z Yang P K Wong C M Vong J Zhong and J Y LiangldquoSimultaneous-fault diagnosis of gas turbine generator systemsusing a pairwise-coupled probabilistic classifierrdquoMathematicalProblems in Engineering vol 2013 Article ID 827128 13 pages2013

12 Mathematical Problems in Engineering

[3] G-B Huang Q-Y Zhu and C-K Siew ldquoExtreme learningmachine theory and applicationsrdquoNeurocomputing vol 70 no1ndash3 pp 489ndash501 2006

[4] G-B Huang H Zhou X Ding and R Zhang ldquoExtremelearning machine for regression and multiclass classificationrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 42 no 2 pp 513ndash529 2012

[5] S C Ng C C Cheung and S H Leung ldquoMagnified gradientfunction with deterministic weight modification in adaptivelearningrdquo IEEE Transactions on Neural Networks vol 15 no 6pp 1411ndash1423 2004

[6] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[7] H J Rong G B Huang N Sundararajan and P SaratchandranldquoOnline sequential fuzzy extreme learningmachine for functionapproximation and classification problemsrdquo IEEE Transactionson Systems Man and Cybernetics Part B Cybernetics vol 39no 4 pp 1067ndash1072 2009

[8] J Cao Z Lin and G B Huang ldquoVoting base online sequentialextreme learning machine for multi-class classificationrdquo inProceedings of the IEEE International Symposium onCircuits andSystems (ISCAS 13) pp 2327ndash2330 IEEE 2013

[9] J Cao Z Lin G B Huang and N Liu ldquoVoting based extremelearning machinerdquo Information Sciences vol 185 no 1 pp 66ndash77 2012

[10] N Y Liang G B Huang P Saratchandran and N Sundarara-jan ldquoA fast and accurate online sequential learning algorithmfor feedforward networksrdquo IEEE Transactions on Neural Net-works vol 17 no 6 pp 1411ndash1423 2006

[11] J Luo C M Vong and P K Wong ldquoSparse Bayesian extremelearningmachine formulti-classificationrdquo IEEETransactions onNeural Networks and Learning Systems vol 25 no 4 pp 836ndash843 2014

[12] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing vol 2 MIT Press Cambridge UK 2006

[13] Y Lan Y C Soh and G B Huang ldquoEnsemble of onlinesequential extreme learning machinerdquoNeurocomputing vol 72no 13 pp 3391ndash3395 2009

[14] N Liu and H Wang ldquoEnsemble based extreme learningmachinerdquo IEEE Signal Processing Letters vol 17 no 8 pp 754ndash757 2010

[15] X Xue M Yao Z Wu and J Yang ldquoGenetic ensemble ofextreme learningmachinerdquoNeurocomputing vol 129 no 10 pp175ndash184 2014

[16] X L Wang Y Y Chen H Zhao and B L Lu ldquoParallelizedextreme learning machine ensemble based on minndashmax mod-ular networkrdquo Neurocomputing vol 128 pp 31ndash41 2014

[17] B V Dasarathy and B V Sheela ldquoA composite classifier systemdesign concepts andmethodologyrdquoProceedings of the IEEE vol67 no 5 pp 708ndash713 1979

[18] L KHansen andP Salamon ldquoNeural network ensemblesrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol12 no 10 pp 993ndash1001 1990

[19] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[20] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990

[21] R E Schapire and Y Freund Boosting Foundations and Algo-rithmsRobert E Schapire Yoav Freund MIT Press CambridgeMass USA 2012

[22] L Breiman ldquoRandom forestsrdquoMachine Learning vol 45 no 1pp 5ndash32 2001

[23] R A Jacobs M I Jordan S J Nowlan and G E HintonldquoAdaptive mixtures of local expertsrdquo Neural Computation vol3 no 1 pp 79ndash87 1991

[24] V Guruswami and A Sahai ldquoMulticlass learning boostingand error-correcting codesrdquo in Proceedings of the 12th AnnualConference on Computational Learning Theory (COLT rsquo99) pp145ndash155 July 1999

[25] R E Schapire and Y Singer ldquoImproved boosting algorithmsusing confidence-rated predictionsrdquo Machine Learning vol 37no 3 pp 297ndash336 1999

[26] N Li ldquoMulticlass boosting with repartitioningrdquo in Proceedingsof the 23rd International Conference on Machine Learning pp569ndash576 ACM June 2006

[27] H Drucker ldquoImproving regressors using boostingrdquo in Proceed-ings of the 14th International Conference on Machine Learningpp 107ndash115 1997

[28] R Avnimelech and N Intrator ldquoBoosting regression estima-torsrdquo Neural Computation vol 11 no 2 pp 499ndash520 1999

[29] D P Solomatine and D L Shrestha ldquoAdaBoostRT a boostingalgorithm for regression problemsrdquo in Proceedings of the IEEEInternational Joint Conference on Neural Networks vol 2 pp1163ndash1168 IEEE 2004

[30] D L Shrestha and D P Solomatine ldquoExperiments withAdaBoostRT an improved boosting scheme for regressionrdquoNeural Computation vol 18 no 7 pp 1678ndash1710 2006

[31] H-X Tian and Z-Z Mao ldquoAn ensemble ELM based on mod-ified AdaBoostRT algorithm for predicting the temperature ofmolten steel in ladle furnacerdquo IEEE Transactions on AutomationScience and Engineering vol 7 no 1 pp 73ndash80 2010

[32] J W Cao T Chen and J Fan ldquoFast online learning algorithmfor landmark recognition based on BoW frameworkrdquo in Pro-ceedings of the 9th IEEE Conference on Industrial Electronics andApplications Hangzhou China June 2014

[33] J Cao Z Lin and G-B Huang ldquoComposite function waveletneural networks with extreme learning machinerdquo Neurocom-puting vol 73 no 7ndash9 pp 1405ndash1416 2010

[34] R Feely Predicting stock market volatility using neural networksBA (Mod) [PhD thesis] Trinity College Dublin DublinIreland 2000

[35] K Bache and M Lichman UCI Machine Learning RepositoryUniversity of California School of Information and ComputerScience Irvine Calif USA 2014 httparchiveicsucieduml

[36] H Drucker C J Burges L Kaufman A Smola and VVapnik ldquoSupport vector regression machinesrdquo Advances inNeural Information Processing Systems vol 9 pp 155ndash161 1997

[37] J A Suykens and J Vandewalle ldquoLeast squares support vectormachine classifiersrdquo Neural Processing Letters vol 9 no 3 pp293ndash300 1999

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 12: Research Article A Robust AdaBoost.RT Based Ensemble ...downloads.hindawi.com/journals/mpe/2015/260970.pdf · AdaBoost.RT algorithm (modi ed Ada-ELM) in order to predict the temperature

12 Mathematical Problems in Engineering

[3] G-B Huang Q-Y Zhu and C-K Siew ldquoExtreme learningmachine theory and applicationsrdquoNeurocomputing vol 70 no1ndash3 pp 489ndash501 2006

[4] G-B Huang H Zhou X Ding and R Zhang ldquoExtremelearning machine for regression and multiclass classificationrdquoIEEE Transactions on Systems Man and Cybernetics Part BCybernetics vol 42 no 2 pp 513ndash529 2012

[5] S C Ng C C Cheung and S H Leung ldquoMagnified gradientfunction with deterministic weight modification in adaptivelearningrdquo IEEE Transactions on Neural Networks vol 15 no 6pp 1411ndash1423 2004

[6] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[7] H J Rong G B Huang N Sundararajan and P SaratchandranldquoOnline sequential fuzzy extreme learningmachine for functionapproximation and classification problemsrdquo IEEE Transactionson Systems Man and Cybernetics Part B Cybernetics vol 39no 4 pp 1067ndash1072 2009

[8] J Cao Z Lin and G B Huang ldquoVoting base online sequentialextreme learning machine for multi-class classificationrdquo inProceedings of the IEEE International Symposium onCircuits andSystems (ISCAS 13) pp 2327ndash2330 IEEE 2013

[9] J Cao Z Lin G B Huang and N Liu ldquoVoting based extremelearning machinerdquo Information Sciences vol 185 no 1 pp 66ndash77 2012

[10] N Y Liang G B Huang P Saratchandran and N Sundarara-jan ldquoA fast and accurate online sequential learning algorithmfor feedforward networksrdquo IEEE Transactions on Neural Net-works vol 17 no 6 pp 1411ndash1423 2006

[11] J Luo C M Vong and P K Wong ldquoSparse Bayesian extremelearningmachine formulti-classificationrdquo IEEETransactions onNeural Networks and Learning Systems vol 25 no 4 pp 836ndash843 2014

[12] O Chapelle B Scholkopf and A Zien Semi-Supervised Learn-ing vol 2 MIT Press Cambridge UK 2006

[13] Y Lan Y C Soh and G B Huang ldquoEnsemble of onlinesequential extreme learning machinerdquoNeurocomputing vol 72no 13 pp 3391ndash3395 2009

[14] N Liu and H Wang ldquoEnsemble based extreme learningmachinerdquo IEEE Signal Processing Letters vol 17 no 8 pp 754ndash757 2010

[15] X Xue M Yao Z Wu and J Yang ldquoGenetic ensemble ofextreme learningmachinerdquoNeurocomputing vol 129 no 10 pp175ndash184 2014

[16] X L Wang Y Y Chen H Zhao and B L Lu ldquoParallelizedextreme learning machine ensemble based on minndashmax mod-ular networkrdquo Neurocomputing vol 128 pp 31ndash41 2014

[17] B V Dasarathy and B V Sheela ldquoA composite classifier systemdesign concepts andmethodologyrdquoProceedings of the IEEE vol67 no 5 pp 708ndash713 1979

[18] L KHansen andP Salamon ldquoNeural network ensemblesrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol12 no 10 pp 993ndash1001 1990

[19] L Breiman ldquoBagging predictorsrdquoMachine Learning vol 24 no2 pp 123ndash140 1996

[20] R E Schapire ldquoThe strength of weak learnabilityrdquo MachineLearning vol 5 no 2 pp 197ndash227 1990

[21] R E Schapire and Y Freund Boosting Foundations and Algo-rithmsRobert E Schapire Yoav Freund MIT Press CambridgeMass USA 2012

[22] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[23] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.

[24] V. Guruswami and A. Sahai, "Multiclass learning, boosting, and error-correcting codes," in Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT '99), pp. 145–155, July 1999.

[25] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.

[26] N. Li, "Multiclass boosting with repartitioning," in Proceedings of the 23rd International Conference on Machine Learning, pp. 569–576, ACM, June 2006.

[27] H. Drucker, "Improving regressors using boosting," in Proceedings of the 14th International Conference on Machine Learning, pp. 107–115, 1997.

[28] R. Avnimelech and N. Intrator, "Boosting regression estimators," Neural Computation, vol. 11, no. 2, pp. 499–520, 1999.

[29] D. P. Solomatine and D. L. Shrestha, "AdaBoost.RT: a boosting algorithm for regression problems," in Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 2, pp. 1163–1168, IEEE, 2004.

[30] D. L. Shrestha and D. P. Solomatine, "Experiments with AdaBoost.RT: an improved boosting scheme for regression," Neural Computation, vol. 18, no. 7, pp. 1678–1710, 2006.

[31] H.-X. Tian and Z.-Z. Mao, "An ensemble ELM based on modified AdaBoost.RT algorithm for predicting the temperature of molten steel in ladle furnace," IEEE Transactions on Automation Science and Engineering, vol. 7, no. 1, pp. 73–80, 2010.

[32] J. W. Cao, T. Chen, and J. Fan, "Fast online learning algorithm for landmark recognition based on BoW framework," in Proceedings of the 9th IEEE Conference on Industrial Electronics and Applications, Hangzhou, China, June 2014.

[33] J. Cao, Z. Lin, and G.-B. Huang, "Composite function wavelet neural networks with extreme learning machine," Neurocomputing, vol. 73, no. 7–9, pp. 1405–1416, 2010.

[34] R. Feely, Predicting stock market volatility using neural networks, B.A. (Mod) [Ph.D. thesis], Trinity College Dublin, Dublin, Ireland, 2000.

[35] K. Bache and M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2014, http://archive.ics.uci.edu/ml.

[36] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155–161, 1997.

[37] J. A. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
