ORIGINAL ARTICLE
A robust recurrent simultaneous perturbation stochastic approximation training algorithm for recurrent neural networks
Zhao Xu • Qing Song • Danwei Wang
Received: 22 February 2012 / Accepted: 21 May 2013
© Springer-Verlag London 2013
Abstract Training of recurrent neural networks (RNNs)
introduces considerable computational complexities due to
the need for gradient evaluations. How to get fast conver-
gence speed and low computational complexity remains a
challenging and open topic. Besides, the transient response
of the learning process of RNNs is a critical issue, especially for
online applications. Conventional RNN training algorithms
such as the backpropagation through time and real-time
recurrent learning have not adequately satisfied these
requirements because they often suffer from slow conver-
gence speed. If a large learning rate is chosen to improve
performance, the training process may become unstable in
terms of weight divergence. In this paper, a novel training
algorithm of RNN, named robust recurrent simultaneous
perturbation stochastic approximation (RRSPSA), is
developed with a specially designed recurrent hybrid adap-
tive parameter and adaptive learning rates. RRSPSA is a
powerful novel twin-engine simultaneous perturbation sto-
chastic approximation (SPSA) type of RNN training algo-
rithm. It utilizes three specially designed adaptive
parameters to maximize training speed for a recurrent
training signal while exhibiting certain weight convergence
properties with only two objective function measurements
as in the original SPSA algorithm. RRSPSA is proved to guarantee weight convergence and system stability in the sense of a Lyapunov function. Computer simulations were carried out to demonstrate the applicability of the theoretical results.
Keywords Recurrent neural networks · Stability and convergence analysis · SPSA
List of symbols
$k$  Discrete time step
$m$  Output dimension
$n_I$  Input dimension
$n_h$  Hidden neuron number
$p_v = m \times n_h$  Dimension of the output layer weight vector
$p_w = n_I \times n_h$  Dimension of the hidden layer weight vector
$l$  Number of time-delayed outputs
$u_i(k)$, $i = 1, \ldots, m$  External input at time step $k$
$x(k) \in \mathbb{R}^{n_I}$  Neural network input vector
$\Delta x(k) \in \mathbb{R}^{n_I}$  Perturbation vector on the input $x(k-1)$
$y(k) \in \mathbb{R}^m$  Desired output vector
$\hat{y}(k) \in \mathbb{R}^m$  Estimated output vector
$\varepsilon(k) \in \mathbb{R}^m$  Total disturbance vector
$e(k) \in \mathbb{R}^m$  Output estimation error vector
$\hat{V}(k) \in \mathbb{R}^{p_v}$  Estimated weight vector of the output layer
$V^* \in \mathbb{R}^{p_v}$  Ideal weight vector of the output layer
$\tilde{V}(k) \in \mathbb{R}^{p_v}$  Estimation error vector of the output layer
$\hat{W}(k) \in \mathbb{R}^{p_w}$  Estimated weight vector of the hidden layer
$W^* \in \mathbb{R}^{p_w}$  Ideal weight vector of the hidden layer
$\tilde{W}(k) \in \mathbb{R}^{p_w}$  Estimation error vector of the hidden layer
Z. Xu (✉) · Q. Song · D. Wang
School of Electrical and Electronic Engineering, Nanyang
Technological University, Singapore, Singapore
e-mail: [email protected]
Q. Song
e-mail: [email protected]
D. Wang
e-mail: [email protected]
123
Neural Comput & Applic
DOI 10.1007/s00521-013-1436-5
$\hat{W}_{j,:}(k) \in \mathbb{R}^{n_I}$  The jth row vector of the hidden layer weight matrix $\hat{\mathbf{W}}(k)$
$d_v(k)$  Equivalent approximation error of the loss function of the output layer
$d_w(k)$  Equivalent approximation error of the loss function of the hidden layer
$c$  Perturbation gain parameter of the SPSA
$\Delta_v \in \mathbb{R}^{p_v}$  Perturbation vector of the output layer
$\nabla_v \in \mathbb{R}^{p_v}$  Reciprocal perturbation vector of the output layer
$\Delta_w \in \mathbb{R}^{p_w}$  Perturbation vector of the hidden layer
$\nabla_w \in \mathbb{R}^{p_w}$  Reciprocal perturbation vector of the hidden layer
$H(\hat{W}(k), x(k)) = \hat{H}(k) \in \mathbb{R}^{m \times p_v}$  Nonlinear activation function matrix
$h_j(k) = h(\hat{W}_{j,:}(k)x(k))$  Nonlinear activation function; scalar element of $H(\hat{W}(k), x(k))$
$a_v$  Adaptive learning rate of the output layer
$a_w$  Adaptive learning rate of the hidden layer
$\bar{a}_v, \bar{a}_w$  Positive scalars
$q_v$  Normalization factor of the output layer
$q_w$  Normalization factor of the hidden layer
$\beta_v$  Recurrent hybrid adaptive parameter of the output layer
$\beta_w$  Recurrent hybrid adaptive parameter of the hidden layer
$\mu_j(k) \in \mathbb{R}$, $1 \le j \le n_h$  Mean value of the input vectors of the jth hidden layer neuron
$\tilde{\mu}_j(k) \in \mathbb{R}$, $1 \le j \le n_h$  Mean value of the hidden layer weight vector of the jth hidden layer neuron
$s$  Positive scalar
$\lambda$  Positive gain parameter of the threshold function
$\eta$  A small perturbation parameter
1 Introduction
Recurrent neural networks (RNNs) are inherently dynamic
nonlinear systems which have signal feedback connections
and are able to process temporal sequences. RNNs with free
synaptic weights are more useful for complex dynamical
systems such as nonlinear dynamic system identification,
time series prediction, and adaptive control as compared to
Hopfield neural networks (HNNs), in which the weights are
fixed and symmetric [1–8].
To maximize potentials and capabilities of the RNNs, one
of the preliminary tasks is to design an efficient training
algorithm with fast weight convergence speed and low
computational complexity. Gradient descent optimization
techniques are often used for neural networks which consist
of adjustable weights along the negative direction of error
gradient. However, due to the dynamic structure of RNNs, the error gradient obtained by a classical BP training algorithm with a fixed learning step tends to be very small, because the RNN weights depend not only on the current output but also on the outputs of past steps. It is widely acknowledged that
recurrent derivative of the multilayered RNNs, that is, the
recurrent training signal (see Fig. 1), which contains infor-
mation of dynamic time dependence in training, is necessary
to obtain good time response of RNNs [9, 10]. From com-
putational efficiency and weight convergence points of
view, Spall proposed a simultaneous perturbation stochastic
approximation (SPSA) method to find roots and minima of
neural network functions which are contaminated with noise
in [11]. The name "simultaneous perturbation" arises from the fact that all elements of the unknown weight parameters are varied simultaneously. SPSA has the potential
to be significantly more efficient than the usual p-dimen-
sional algorithms (of Kiefer–Wolfowitz/Blum type) that are
based on standard finite-difference gradient approximations,
and it is shown in [11] that approximately the same level of estimation accuracy can typically be achieved with only 1/p-th the amount of data needed in the standard approach. The
main advantage of SPSA is its simplicity and excellent
weight convergence property, in which only two objective
function measurements are used. Thus, SPSA is more eco-
nomical in terms of loss function evaluation, which is usu-
ally the most expensive part of an optimization problem.
SPSA has attracted great attention in application areas such
as neural networks training, adaptive control, and model
parameter estimation [7, 12–16].
Fig. 1 Block diagram of RRSPSA training for the output feedback-
based RNN
Conventional training methods of RNNs are usually
classified as backpropagation through time (BPTT) [17, 18]
and real-time recurrent learning (RTRL) [19, 20] algo-
rithms, which often suffer from drawbacks of slow con-
vergence speed and heavy computational load. In fact,
BPTT cannot be used for online application due to com-
putational problems. For a standard RTRL training,
recursive weight updating requires a small learning step to
obtain weight convergence and system stability of RNNs.
If a large learning rate is selected to speed up weight
updating, the training process of RTRL may lead to a large
steady-state error or even weight divergence [6, 9].
In previous work, we developed an adaptive SPSA-based training scheme for FNNs and demonstrated that the adaptive SPSA algorithm can be used to derive a fast training algorithm with guaranteed weight convergence [7]. In
this paper, we will extend our SPSA research to the RNNs
with free weight updating and propose a robust recurrent
simultaneous perturbation stochastic approximation
(RRSPSA) algorithm under the framework of deterministic
system with guaranteed weight convergence. Compared with
FNNs, RNNs are dynamic systems and contain recurrent
connections in their structure. Considering the time dependence of the signals of RNNs, not only the weights at the current time step must be perturbed, but also those at previous time steps, which makes the learning prone to divergence and the convergence analysis complicated.
The key characteristic of RRSPSA training algorithm is to use
a recurrent hybrid adaptive parameter together with adaptive
learning rate and dead zone concept. The recurrent hybrid
adaptive parameter can automatically switch off recurrent
training signal whenever the weight convergence and stability
conditions are violated (see Fig. 1 and the explanations in Sects. 3 and 4). There are several advantages to training RNNs based on the RRSPSA algorithm. First, RRSPSA is relatively easy to
implement as compared to other RNN training methods such
as BPTT due to its RTRL type of training and simplicity.
Second, RRSPSA provides excellent convergence property
based on simultaneous perturbation of weight, which is sim-
ilar to the standard SPSA. Finally, RRSPSA has the capability to improve RNN training speed over the standard RTRL algorithm, with guaranteed weight convergence, by using an adaptive learning method. RRSPSA is a powerful twin-engine
SPSA type of RNN training algorithm. It utilizes three specifically designed adaptive parameters to maximize training speed for the recurrent training signal while exhibiting
certain weight convergence properties with only two objec-
tive function measurements as a standard SPSA algorithm.
The combination of the three adaptive parameters makes RRSPSA converge quickly, with the maximum effective learning rate and guaranteed weight convergence. Robust stability and
weight convergence proofs of RRSPSA algorithm are pro-
vided based on Lyapunov function.
The rest of this paper is organized as follows. In Sect. 2, we briefly introduce the structure of the RNN and the conventional SPSA algorithm. In Sects. 3 and 4, the RRSPSA algorithm and its robustness analysis are derived for the output and hidden layers, respectively. Computer simulation results are presented in Sect. 5 to show the efficiency of the proposed RRSPSA. Section 6 draws the final conclusions.
2 Basics of the RNN and SPSA
To simplify the notation of this paper, we consider a three-layer m-external-input and m-output RNN with the output feedback structure shown in Fig. 1, which can easily be extended to an RNN with arbitrary numbers of inputs and outputs. The output vector of the RNN is $\hat{y}(k)$, which can be presented as

$$\hat{y}(k) = H(\hat{W}(k), x(k))\hat{V}(k) = [\hat{y}_1(k), \ldots, \hat{y}_m(k)]^T \in \mathbb{R}^m \qquad (1)$$

where $\hat{V}(k) \in \mathbb{R}^{p_v}$ ($p_v = m \times n_h$) is the long vector of the output layer weight matrix $\hat{\mathbf{V}}(k) \in \mathbb{R}^{n_h \times m}$ defined by

$$\hat{V}(k) = \big[\hat{V}_{1,1}(k), \ldots, \hat{V}_{n_h,1}(k), \ldots, \hat{V}_{1,m}(k), \ldots, \hat{V}_{n_h,m}(k)\big]^T \qquad (2)$$

and $x(k) \in \mathbb{R}^{n_I}$ ($n_I = (l+1) \times m$) is the input vector of the RNN defined by

$$x(k) = [x_1(k), x_2(k), \ldots, x_{n_I}(k)]^T = \big[\hat{y}^T(k-1), \ldots, \hat{y}^T(k-l), u_1(k), \ldots, u_m(k)\big]^T \qquad (3)$$

and $\hat{W}(k) \in \mathbb{R}^{p_w}$ ($p_w = n_I \times n_h$) is the long vector of the hidden layer weight matrix $\hat{\mathbf{W}}(k) \in \mathbb{R}^{n_h \times n_I}$ defined by

$$\hat{W}(k) = \big[\hat{W}_{1,:}(k), \hat{W}_{2,:}(k), \ldots, \hat{W}_{n_h,:}(k)\big]^T. \qquad (4)$$
$H(\hat{W}(k), x(k)) \in \mathbb{R}^{m \times p_v}$ is the nonlinear activation block-diagonal matrix given by

$$\hat{H}(k) = H(\hat{W}(k), x(k)) = \begin{bmatrix} h_1(k), \ldots, h_{n_h}(k) & 0, \ldots, 0 & \cdots \\ 0, \ldots & \ddots & \vdots \\ 0, \ldots, 0 & \cdots & h_1(k), \ldots, h_{n_h}(k) \end{bmatrix} \qquad (5)$$

where $h_j(k)$ with $1 \le j \le n_h$ is one of the most popular nonlinear activation functions,

$$h_j(k) = h\big(\hat{W}_{j,:}(k)x(k)\big) = \frac{1}{1 + \exp\big(-4\lambda \hat{W}_{j,:}(k)x(k)\big)} \qquad (6)$$

with $\hat{W}_{j,:}(k) \in \mathbb{R}^{n_I}$ being the jth row vector of $\hat{\mathbf{W}}(k)$, and $\lambda > 0$ is the gain parameter of the threshold function, which is
defined specifically for ease of use later when deriving the
sector condition of the hidden layer.
The output vector $\hat{y}(k)$ is an approximation of the ideal nonlinear function $y(k) = H(W^*, x(k))V^*$, with $V^*$ and $W^*$ being the ideal output layer and hidden layer weights, respectively.
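As a concrete illustration of Eqs. (1)-(6), the forward pass of this output-feedback RNN can be sketched as follows. This is a minimal sketch under assumed small dimensions; the helper names (`activation`, `rnn_output`) are ours, not from the paper.

```python
import numpy as np

def activation(z, lam=1.0):
    """Sigmoid threshold function of Eq. (6) with gain parameter lam > 0."""
    return 1.0 / (1.0 + np.exp(-4.0 * lam * z))

def rnn_output(W, V_long, x, m, lam=1.0):
    """Compute y_hat(k) = H(W_hat(k), x(k)) V_hat(k) of Eq. (1).

    W      : (n_h, n_I) hidden layer weight matrix
    V_long : (m * n_h,) long output-layer weight vector of Eq. (2)
    x      : (n_I,) input vector of Eq. (3): l delayed outputs + m external inputs
    """
    n_h = W.shape[0]
    h = activation(W @ x, lam)  # h_j(k), j = 1..n_h
    # Block-diagonal H(k) of Eq. (5): output i uses its own slice of V.
    return np.array([h @ V_long[i * n_h:(i + 1) * n_h] for i in range(m)])

# Example sizes: m = 2 outputs, l = 2 delayed outputs, n_h = 3 hidden neurons
m, l, n_h = 2, 2, 3
n_I = (l + 1) * m                       # per Eq. (3)
rng = np.random.default_rng(0)
W = rng.standard_normal((n_h, n_I)) * 0.1
V = rng.standard_normal(m * n_h) * 0.1
y_prev = [np.zeros(m), np.zeros(m)]     # y_hat(k-1), y_hat(k-2)
u = np.ones(m)                          # external input u(k)
x = np.concatenate(y_prev + [u])
y_hat = rnn_output(W, V, x, m)
```

At each step the newest output would be pushed into the delay line to form the next input $x(k+1)$, which is exactly the output feedback loop of Fig. 1.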
The SPSA algorithm solves the problem of finding an optimal weight $\theta^*$ of the gradient equation of neural networks

$$g(\hat\theta(k)) = \frac{\partial L(\hat\theta(k))}{\partial \hat\theta(k)} = 0$$

for some differentiable loss function $L : \mathbb{R}^p \to \mathbb{R}^1$, where $\hat\theta(k)$ is the estimated weight vector of the neural network. The basic idea is to vary all elements of the vector $\hat\theta(k)$ simultaneously and approximate the gradient function using only the two measurements of the loss function in (7) and (8), without requiring exact derivatives or a large number of function evaluations, which provides a significant improvement in efficiency over standard gradient descent algorithms. Define

$$g^+(\hat\theta(k)) = L(\hat\theta(k) + c(k)\Delta(k)) + \epsilon^+(k) \qquad (7)$$

$$g^-(\hat\theta(k)) = L(\hat\theta(k) - c(k)\Delta(k)) + \epsilon^-(k) \qquad (8)$$

where $\epsilon^+(k)$ and $\epsilon^-(k)$ represent measurement noise terms, $c(k)$ is a scalar, and $\Delta(k) = [\Delta_1(k), \ldots, \Delta_p(k)]$ is the perturbation vector. Then, the estimate of $g(\hat\theta)$ at the kth iteration is

$$\hat{g}(\hat\theta(k)) = \left[\frac{g^+(\hat\theta(k)) - g^-(\hat\theta(k))}{2c(k)\Delta_1(k)}, \ldots, \frac{g^+(\hat\theta(k)) - g^-(\hat\theta(k))}{2c(k)\Delta_p(k)}\right]^T \qquad (9)$$
Theory and numerical experience in [11] indicate that
SPSA can be significantly more efficient than the standard
finite-difference-based algorithms in large-dimensional
problems.
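The two-measurement estimate of Eqs. (7)-(9), combined with the update (12), can be sketched in a few lines. The quadratic test loss, constant gains, and noise-free measurements below are illustrative assumptions of ours, not the paper's setup.

```python
import numpy as np

def spsa_gradient(loss, theta, c, rng):
    """SPSA gradient estimate of Eq. (9) from two loss measurements.

    Every component of theta is perturbed simultaneously by a Bernoulli
    +/-1 direction, so only two evaluations of the loss are needed no
    matter how large the dimension p is.
    """
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # perturbation vector
    g_plus = loss(theta + c * delta)                   # Eq. (7), noise-free here
    g_minus = loss(theta - c * delta)                  # Eq. (8)
    return (g_plus - g_minus) / (2.0 * c * delta)      # Eq. (9), componentwise

# Minimize L(theta) = ||theta||^2 with the recursion of Eq. (12)
rng = np.random.default_rng(1)
loss = lambda th: float(th @ th)
theta = np.array([1.0, 2.0, 3.0])
start = loss(theta)
for _ in range(200):
    theta = theta - 0.05 * spsa_gradient(loss, theta, c=0.01, rng=rng)
```

For this quadratic loss the two measurements give an exact directional difference, and the iterate contracts toward the minimizer even though no true gradient is ever computed.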
Inspired by the excellent economy and convergence properties of SPSA, we propose the RRSPSA algorithm under the framework of a deterministic system, based on the idea of simultaneous perturbation with guaranteed weight convergence. The loss function for RRSPSA is defined as

$$L(\hat\theta(k)) = \frac{1}{2}\|e(k)\|^2 \qquad (10)$$

$$e(k) = y(k) - \hat{y}(k) + \varepsilon(k) \qquad (11)$$

where $y(k)$ is the ideal output and a function of $\theta^*$, $\hat{y}(k)$ is a function of $\hat\theta(k)$, and $\varepsilon(k)$ is a disturbance term. Similar to SPSA, the RRSPSA algorithm seeks an optimal weight vector $\theta^*$ of the gradient equation, that is, the weight vector that minimizes the differentiable $L(\hat\theta(k))$.

That is, the proposed RRSPSA algorithm for updating $\hat\theta(k) \in \mathbb{R}^{(p_w+p_v)}$ as an estimate of the ideal weight vector $\theta^*$ is of the form
$$\hat\theta(k+1) = \hat\theta(k) - a(k+1)\hat{g}(\hat\theta(k)) \qquad (12)$$
where $a(k+1)$ is an adaptive learning rate and $\hat{g}(\hat\theta(k))$ is an approximation of the gradient of the loss function. This (normalized) approximated gradient function is of the SPSA form (13), where $d(k)$ is an equivalent approximation error of the loss function, also called the measurement noise, and $q(k+1)$ is the normalization factor. $\Delta(k+1)$, $\nabla(k+1)$, and $c(k+1) > 0$ are the controlling parameters of the algorithm, defined as follows:
1. The perturbation vector

$$\Delta(k+1) = [\Delta_1(k+1), \ldots, \Delta_{p_w+p_v}(k+1)]^T \in \mathbb{R}^{(p_w+p_v)} \qquad (14)$$

is a bounded random directional vector that is used to perturb the weight vector simultaneously and can be generated randomly.

2. The sequence $\nabla(k+1)$ is defined as

$$\nabla(k+1) = \left[\frac{1}{\Delta_1(k+1)}, \ldots, \frac{1}{\Delta_{p_w+p_v}(k+1)}\right]^T \in \mathbb{R}^{(p_w+p_v)}. \qquad (15)$$
3. $c(k+1) > 0$ is a sequence of positive numbers [11, 14, 15]. $c(k+1)$ can be chosen as a small constant for a slowly time-varying system, as pointed out in [15], and it controls the size of the perturbations.
Note that, for simplicity of notation, we ignore the time subscripts of several parameters where no confusion is caused, such as $\Delta(k+1)$, $\nabla(k+1)$, $c(k+1)$, $d(k)$, and $a(k+1)$.
$$\hat{g}(\hat\theta(k)) = \frac{L\big(\hat\theta(k) + c(k+1)\Delta(k+1)\big) - L\big(\hat\theta(k) - c(k+1)\Delta(k+1)\big) + d(k)}{2c(k+1)q(k+1)}\,\nabla(k+1) \qquad (13)$$
3 Output layer training and stability analysis
In a multilayered neural network, the output layer weight vector $\hat{V}(k)$ and hidden layer weight vector $\hat{W}(k)$ are normally updated separately using different learning rates, as shown in Fig. 1, as in the standard BP training algorithm [9]. During the updating of the output layer weights, the hidden layer weights are fixed. We are now ready to propose the RRSPSA algorithm to update the estimated weight vector $\hat{V}(k)$ of the output layer. That is,

$$\hat{V}(k+1) = \hat{V}(k) - a_v(k+1)\hat{g}_v(\hat{V}(k)) \qquad (16)$$

where $\hat{g}_v(\hat{V}(k)) \in \mathbb{R}^{p_v}$ is the normalized gradient approximation that uses the simultaneous perturbation vectors $\Delta_v(k+1) \in \mathbb{R}^{p_v}$ and $\nabla_v(k+1) \in \mathbb{R}^{p_v}$ to stimulate the weights. $\Delta_v(k+1)$ and $\nabla_v(k+1)$ are, respectively, the first $p_v$ components of the perturbation vectors defined in (14) and (15).
$$\hat{g}_v(\hat{V}(k)) = -e^T(k)\hat{H}(k)\Delta_v\nabla_v (q_v)^{-1} - \beta_v(k+1)e^T(k)A(k)\nabla_v \big(2cq_v(k+1)\big)^{-1} \qquad (17)$$

where $A(k) \in \mathbb{R}^m$ is the extended recurrent gradient defined as

$$A(k) = H\big(\hat{W}(k), D_v(k)(\hat{V}(k) + c\Delta_v)\big)\hat{V}(k) - H\big(\hat{W}(k), D_v(k)(\hat{V}(k) - c\Delta_v)\big)\hat{V}(k) \qquad (18)$$

$$D_v(k) = \big[\hat{H}^T(k-1), \ldots, \hat{H}^T(k-l), 0, \ldots, 0\big]^T. \qquad (19)$$
For simplicity of notation, we ignore the time subscripts of $\Delta_v(k+1)$, $\nabla_v(k+1)$, $\beta_v(k+1)$, $q_v(k+1)$, and $a_v(k+1)$ in the remainder of the paper.
Remark 1 Our purpose is to find the desired output layer weight vector $V^*$ of the gradient equation

$$g(\hat{V}(k)) = \frac{\partial L(\hat{V}(k))}{\partial \hat{V}(k)} = 0.$$

Inspired by the excellent properties of SPSA, we obtain $\hat{g}(\hat{V}(k))$ by varying all elements of the unknown weight parameters simultaneously, as shown in the proof of Lemma 1, under the framework of a deterministic system. We will show that RRSPSA has the same form as SPSA, and only two objective function measurements are used at each iteration, which maintains the efficiency of SPSA. However, the weight convergence proof and stability analysis are quite different from those in [11] under the deterministic-system framework, as shown in Theorems 1 and 2. For RNNs, the outputs depend not only on the weights at the current time step, but also on those at previous time steps, since the current input vector contains the delayed outputs. The first term on the right side of (17) is the result of the perturbation of the feedforward signals in the RNN, and $A(k)$ in (18) is the result of the perturbation of the recurrent signals, which makes the training complicated and the system prone to instability. In addition to the simplicity and efficiency of RRSPSA, one powerful aspect of the developed algorithm is that it can use an effectively large adaptive learning rate $a_v$ to accelerate weight convergence with guaranteed system stability. To avoid weight divergence at each time step, an adaptive parameter $\beta_v$ is used in (17) to control the recurrent signals and switch off the recurrent connections whenever the convergence condition is violated, as shown in (38). The normalization factor $q_v$ is also used to bound the signals in the learning algorithm, as is traditionally done in adaptive control systems [5, 7]. $a_v$ guarantees weight convergence during training, which means the weights are only updated when the convergence and stability requirements are met, and it does not need to be small as in the RTRL training algorithm. The sufficient conditions for the three key parameters are shown in (34), (37), and (38). $D_v(k)$ is obtained during the mathematical deduction in the proof of Lemma 1.
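One step of the output-layer update (16)-(19) can be sketched as below. The helper names, the way past activations are cached to build $D_v(k)$, and the fixed scalars $q_v$, $\beta_v$, $a_v$ are our own simplifications for illustration; the paper's adaptive conditions for these parameters are given in (34)-(38).

```python
import numpy as np

def act(z, lam=1.0):
    """Sigmoid of Eq. (6)."""
    return 1.0 / (1.0 + np.exp(-4.0 * lam * z))

def HV(h, V_long, m):
    """H V for the block-diagonal H of Eq. (5); h holds h_1..h_nh."""
    n_h = h.size
    return np.array([h @ V_long[i * n_h:(i + 1) * n_h] for i in range(m)])

def gv_hat(e, h, past_h, W, V_long, delta_v, c, beta_v, q_v, m, lam=1.0):
    """Normalized gradient approximation of Eqs. (17)-(19) (sketch).

    e      : (m,) training error e(k)
    h      : (n_h,) current hidden activations (defines H(k))
    past_h : list of the l previous activation vectors, used to build
             D_v(k)(V +/- c Delta_v) as in Eq. (19)
    """
    nabla_v = 1.0 / delta_v
    n_ext = W.shape[1] - len(past_h) * m   # external-input slots are zero
    def recur_out(V_pert):
        # input built from D_v(k) V_pert: delayed outputs re-evaluated at V_pert
        x = np.concatenate([HV(hp, V_pert, m) for hp in past_h]
                           + [np.zeros(n_ext)])
        return HV(act(W @ x, lam), V_long, m)
    A = recur_out(V_long + c * delta_v) - recur_out(V_long - c * delta_v)  # Eq. (18)
    term1 = -(e @ HV(h, delta_v, m)) * nabla_v / q_v        # feedforward part
    term2 = -beta_v * (e @ A) * nabla_v / (2.0 * c * q_v)   # recurrent part
    return term1 + term2

# One update of Eq. (16) with illustrative numbers
m, l, n_h = 2, 2, 3
n_I = (l + 1) * m
rng = np.random.default_rng(4)
W = rng.standard_normal((n_h, n_I)) * 0.1
V = rng.standard_normal(m * n_h) * 0.1
h = act(W @ rng.standard_normal(n_I))
past_h = [act(W @ rng.standard_normal(n_I)) for _ in range(l)]
e = rng.standard_normal(m)
delta_v = rng.choice([-1.0, 1.0], size=m * n_h)
g = gv_hat(e, h, past_h, W, V, delta_v, c=0.01, beta_v=1.0, q_v=5.0, m=m)
V_new = V - 0.1 * g                       # Eq. (16) with a_v = 0.1
```

Setting `beta_v=0` switches off the recurrent term, leaving only the feedforward perturbation, which is exactly the role the hybrid parameter plays in the stability analysis.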
Lemma 1 The normalized gradient approximation in
(17) is an equivalent presentation of the standard SPSA
algorithm in equation (13).
Proof We will prove that (17) is obtained by perturbing all the weights simultaneously, as SPSA does. Rewrite the loss function defined in (10) as

$$L(\hat\theta(k)) = L(\hat{V}(k), \hat{W}(k)) = \frac{1}{2}\|y(k) - \hat{y}(k) + \varepsilon(k)\|^2. \qquad (20)$$

Further, we use $\hat{y}(k) = \hat{H}(k)\hat{V}(k)$ from (1) to define

$$\begin{aligned}
\hat{y}_{v+}(\hat{V}(k)) &= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}, x(k) + \Delta x(k+1)\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}, [\hat{y}^T(k-1), \ldots, \hat{y}^T(k-l), 0, \ldots, 0]^T + \Delta x(k+1)\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}, [\hat{V}^T(k-1)\hat{H}^T(k-1), \ldots, \hat{V}^T(k-l)\hat{H}^T(k-l), 0, \ldots, 0]^T + \Delta x(k+1)\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}, [(\hat{V}(k-1) + c\Delta_v)^T\hat{H}^T(k-1), \ldots, (\hat{V}(k-l) + c\Delta_v)^T\hat{H}^T(k-l), 0, \ldots, 0]^T\big)\hat{V}(k). \qquad (21)
\end{aligned}$$
$\hat{y}_{v+}(\hat{V}(k))$ has the same motivation as $g^+(\hat\theta(k))$ of SPSA, shown in (7), which is to perturb all the weights simultaneously. Different from FNN training, which perturbs only the current weights, the input vector must be perturbed as well for RNN training because of the time dependence in RNNs. The disturbance $\Delta x(k+1)$ is to
perturb the weights at previous time steps, which are contained in $x(k)$, as seen in (3). However, the last equation of (21) is difficult to implement. Here, we assume the model parameters do not change appreciably between iterations, that is, $\hat{V}(k) \approx \hat{V}(k-1) \approx \cdots \approx \hat{V}(k-l)$. We can then derive an approach similar to RTRL, introduced by Williams and Zipser [19, 20]. Moreover, such an approximation makes the proposed algorithm real-time and adaptive in a recursive fashion. Under this approximation,
$$\begin{aligned}
\hat{y}_{v+}(\hat{V}(k)) &\approx \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}(k), [(\hat{V}(k) + c\Delta_v)^T\hat{H}^T(k-1), \ldots, (\hat{V}(k) + c\Delta_v)^T\hat{H}^T(k-l), 0, \ldots, 0]^T\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}(k), [\hat{H}^T(k-1), \ldots, \hat{H}^T(k-l), 0, \ldots, 0]^T\big(\hat{V}(k) + c\Delta_v\big)\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}(k), D_v(k)\big(\hat{V}(k) + c\Delta_v\big)\big)\hat{V}(k) \qquad (22)
\end{aligned}$$

where $D_v(k)$ is defined in (19). By the approximation $\hat{V}(k) \approx \hat{V}(k-1) \approx \cdots \approx \hat{V}(k-l)$, the recurrent terms $\hat{H}^T(k-1), \ldots, \hat{H}^T(k-l)$ of (19) can be obtained iteratively from the previous steps to speed up the calculation of RRSPSA (normally, the input $u(k)$ is independent of the weights, so the last entries of $D_v(k)$ are zero vectors). Weight convergence and system stability can be guaranteed by using the three adaptive parameters, which will be explained in detail later.
Similar to (21), we can get

$$\hat{y}_{v-}(\hat{V}(k)) = \hat{H}(k)\big(\hat{V}(k) - c\Delta_v\big) + H\big(\hat{W}(k), x(k) - \Delta x(k+1)\big)\hat{V}(k) \approx \hat{H}(k)\big(\hat{V}(k) - c\Delta_v\big) + H\big(\hat{W}(k), D_v(k)\big(\hat{V}(k) - c\Delta_v\big)\big)\hat{V}(k) \qquad (23)$$

so that

$$\hat{y}_{v+}(\hat{V}(k)) - \hat{y}_{v-}(\hat{V}(k)) \approx 2\hat{H}(k)c\Delta_v + A(k) \qquad (24)$$

where the extended recurrent gradient $A(k)$ is defined in (18). $A(k)$ in (24) is the result of the recurrent connections. To guarantee weight convergence and system stability during training, we insert an adaptive recurrent hybrid parameter $\beta_v(k+1)$ (see Fig. 1) to control the recurrent training signal. As shown in (38), the convergence condition is examined at each step, and the recurrent contribution is switched off whenever the condition is violated. Thus, (24) can be rewritten as

$$\hat{y}_{v+}(\hat{V}(k)) - \hat{y}_{v-}(\hat{V}(k)) = 2\hat{H}(k)c\Delta_v + \beta_v A(k). \qquad (25)$$

Note that we change the approximation in (24) to an equality in (25) because we use the exact formula on the right-hand side of the equation to update the weights during learning.
According to (13), we can get

$$\begin{aligned}
\hat{g}_v(\hat{V}(k)) &= \frac{L\big(\hat{W}(k), \hat{V}(k) + c\Delta_v\big) - L\big(\hat{W}(k), \hat{V}(k) - c\Delta_v\big) + d_v(k)}{2cq_v}\,\nabla_v \\
&= \frac{\big\|y(k) - \hat{y}_{v+}(\hat{V}(k))\big\|^2 - \big\|y(k) - \hat{y}_{v-}(\hat{V}(k))\big\|^2 + 2d_v(k)}{4cq_v}\,\nabla_v \\
&= \Big\{\big(y(k) - \hat{y}_{v+}(\hat{V}(k)) + y(k) - \hat{y}_{v-}(\hat{V}(k))\big)^T \big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big) + 2d_v(k)\Big\}\nabla_v (4cq_v)^{-1} \\
&= \big(y(k) - \hat{y}(k) + \varepsilon(k)\big)^T \big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big)\nabla_v (2cq_v)^{-1} \\
&= e^T(k)\big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big)\nabla_v (2cq_v)^{-1} \\
&= -e^T(k)\hat{H}(k)\Delta_v\nabla_v (q_v)^{-1} - \beta_v e^T(k)A(k)\nabla_v (2cq_v)^{-1}. \qquad (26)
\end{aligned}$$

$e(k)$ in the fifth equality of (26) comes from (11). The fourth equality of (26) reveals the relationship between the gradient approximation error $d_v(k)$ and the system disturbance $\varepsilon(k)$, that is,

$$d_v(k) = \big({-2}y(k) + \hat{y}_{v+}(\hat{V}(k)) + \hat{y}_{v-}(\hat{V}(k)) + \varepsilon(k)\big)^T\big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big) = e^T(k)\big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big). \qquad (27)$$

In the fifth equality of (26), the gradient approximation appears exactly the same as the original one in [11] (see [11, eq. (2.2)]). Only two objective function measurements are used at each step, which maintains the efficiency of SPSA.
The training error $e(k)$ comes from the weight estimation error and the disturbance. For the stability analysis, $e(k)$ will be linked to the weight estimation error of the output layer and the disturbance. Define the weight estimation errors of the output layer and the hidden layer as

$$\tilde{V}(k) = V^* - \hat{V}(k) \qquad (28)$$

$$\tilde{W}(k) = W^* - \hat{W}(k) \qquad (29)$$

where $W^*$ and $V^*$ are the ideal (optimal) weight vectors of $\hat{W}(k)$ and $\hat{V}(k)$. According to (11), the training error can be rewritten as

$$e(k) = y(k) - \hat{y}(k) + \varepsilon(k) = H(W^*, x(k))V^* - \hat{H}(k)\hat{V}(k) + \varepsilon(k) = H(W^*, x(k))V^* - \hat{H}(k)V^* + \hat{H}(k)V^* - \hat{H}(k)\hat{V}(k) + \varepsilon(k). \qquad (30)$$

Because the term $H(W^*, x(k))V^* - \hat{H}(k)V^*$ is temporarily bounded during output layer training (the hidden layer weights are fixed), we define
$$\tilde{\varepsilon}_v(k) = H(W^*, x(k))V^* - \hat{H}(k)V^* + \varepsilon(k) = \tilde{H}\big(\hat{W}(k), W^*, x(k)\big)V^* + \varepsilon(k) \qquad (31)$$

with $\tilde{H}\big(\hat{W}(k), W^*, x(k)\big) = H(W^*, x(k)) - \hat{H}(k)$. The training error caused by the weight estimation error of the output layer is defined as

$$\phi_v(k) = -\hat{H}(k)\tilde{V}(k). \qquad (32)$$

Then, (30) can be rewritten as

$$e(k) = -\phi_v(k) + \tilde{\varepsilon}_v(k). \qquad (33)$$

Hence, the training error is directly linked to the weight estimation error of the output layer, contained in $\phi_v(k)$, and the disturbance $\tilde{\varepsilon}_v(k)$. This is critical for the convergence proof.
Theorem 1 If the output layer of the RNN is trained by the RRSPSA algorithm (16), $\hat{V}(k)$ is guaranteed to be convergent in the sense of the Lyapunov function

$$\big\|\tilde{V}(k+1)\big\|^2 - \big\|\tilde{V}(k)\big\|^2 \le 0, \quad \forall k, \quad \text{with } \tilde{V}(k) = V^* - \hat{V}(k)$$

and the training error will be $L_2$-stable in the sense of $\tilde{\varepsilon}_v(k) \in L_2$ and $\phi_v(k) \in L_2$.
Proof See the Appendix.
The conditions for the three adaptive parameters in (16) and (17) are as follows. The detailed procedure for obtaining these conditions is shown in the proof of Theorem 1.

1. $q_v$ is the normalization factor defined by

$$q_v = sq_v(k) + \max\left(\frac{\bar{a}_v Z(k)}{p_v} + 0.5\beta_v Y(k) + 1,\; \bar{q}\right) \qquad (34)$$

where $0 < s < 1$, $0 < \bar{a}_v \le 1$, and $\bar{q}$ are positive constants and

$$Y(k) = (\nabla_v)^T\big(\eta I + \hat{H}^T(k)\hat{H}(k)\big)^{-1}\frac{\hat{H}^T(k)A(k)}{c} \qquad (35)$$

where $\eta$ is a small perturbation parameter, and

$$Z(k) = \big\|2\hat{H}(k)c\Delta_v + \beta_v A(k)\big\|^2\|\nabla_v\|^2\big(4c^2\big)^{-1}. \qquad (36)$$

2. $a_v$ is an adaptive learning rate similar to the dead zone approach in classical neural network and adaptive control,

$$a_v = \begin{cases} \bar{a}_v, & \text{if } \|e(k)\| \ge \Lambda(k) \\ 0, & \text{if } \|e(k)\| < \Lambda(k) \end{cases} \qquad (37)$$

with $\Lambda(k) = \tilde{\varepsilon}^v_m(k)\Big/\sqrt{1 - \bar{a}_v (q_v)^{-1}\big(Z(k)/p_v + 0.5\beta_v Y(k)\big)}$ and $\tilde{\varepsilon}^v_m(k)$ being the maximum value of $\tilde{\varepsilon}_v(k)$ defined in (31).

3. $\beta_v$ is the recurrent hybrid adaptive parameter determined by

$$\beta_v = \begin{cases} 1, & \text{if } Y(k) \ge 0 \\ 0, & \text{if } Y(k) < 0. \end{cases} \qquad (38)$$
Remark 2 According to the theoretical analysis, the three parameters $\beta_v$, $a_v$, and $q_v$ play a key role in the proof of weight convergence and the stability analysis. They are designed to guarantee the negativeness of the Lyapunov function, as in (72). The function of $\beta_v$ is to automatically switch off the important recurrent training signal in the lower left corner of Fig. 1 when the convergence and stability requirements, which result from the recurrent training signals, are violated. The normalization factor $q_v$ is also used to bound the signals in the learning algorithm, as is traditionally done in adaptive control systems [5, 7]. $a_v$ guarantees weight convergence during training, which means that the weights are only updated when the convergence and stability requirements are met. Besides, $a_v$ does not need to be small as in the RTRL training algorithm, which makes training by RRSPSA faster than by RTRL. From the definition of $a_v$ in (37), it can be seen that the weights are updated ($a_v = \bar{a}_v$) only if the training error is bigger than the dead zone value; that is, learning stops if the training error falls below the dead zone level. This provides good generalization rather than learning the system noise. In this paper, we choose a suitable value for $\tilde{\varepsilon}^v_m(k)$ according to the noise level.
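The switching logic of (37) and (38) is simple to state in code. In the sketch below, $Y(k)$ and $Z(k)$ are assumed precomputed via (35)-(36), and all numeric values are placeholders of ours:

```python
import numpy as np

def adaptive_parameters(e, Y, Z, eps_m, a_bar, q_v, p_v):
    """Dead-zone learning rate a_v of Eq. (37) and hybrid switch beta_v
    of Eq. (38) (sketch; Y, Z come from Eqs. (35)-(36) in a real run)."""
    beta_v = 1.0 if Y >= 0.0 else 0.0    # Eq. (38): drop the recurrent signal
    # Eq. (37): dead-zone threshold built from the disturbance bound eps_m
    lam = eps_m / np.sqrt(1.0 - (a_bar / q_v) * (Z / p_v + 0.5 * beta_v * Y))
    a_v = a_bar if np.linalg.norm(e) >= lam else 0.0
    return a_v, beta_v

# Large error: the weights are updated; small error: learning pauses
a1, b1 = adaptive_parameters(np.array([2.0, 2.0]), Y=0.5, Z=1.0,
                             eps_m=0.1, a_bar=0.8, q_v=10.0, p_v=6)
a2, b2 = adaptive_parameters(np.array([0.01, 0.0]), Y=-0.5, Z=1.0,
                             eps_m=0.1, a_bar=0.8, q_v=10.0, p_v=6)
```

The dead zone keeps the update from chasing noise: once $\|e(k)\|$ falls below $\Lambda(k)$, the learning rate drops to zero and the weights freeze.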
Remark 3 The perturbation size $c(k+1)$ is a critical variable, as discussed in the original SPSA papers [7, 11, 14]. A natural question is whether $c(k+1)$ affects the learning procedure in the recurrent learning algorithm. The answer is yes. This is not obvious from the updating formula of $\hat{g}_v(\hat{V}(k))$ in (17). However, it can easily be seen by expanding the second term $A(k)$ in (17) to be explicit in $c(k)$.
According to the mean value theorem and the fact that the activation function in (6) is nondecreasing, it can readily be shown that there exist unique positive mean values $\mu_j(k)$ such that

$$h\big(\hat{W}_{j,:}(k)x_1(k)\big) - h\big(\hat{W}_{j,:}(k)x_2(k)\big) = \mu_j(k)\hat{W}_{j,:}(k)\big(x_1(k) - x_2(k)\big) \qquad (39)$$

where $\hat{W}_{j,:}(k)$ is the estimated weight vector linked to the jth hidden layer neuron. From (39) and the matrix in (5), $A(k)$ in (18) can be further expanded as

$$A(k) = 2cX(k) \qquad (40)$$

where $X(k) \in \mathbb{R}^m$ is defined as
$$X(k) = \begin{bmatrix} X_1(k), \ldots, X_{n_h}(k) & 0, \ldots, 0 & \cdots \\ 0, \ldots & \ddots & \vdots \\ 0, \ldots, 0 & \cdots & X_1(k), \ldots, X_{n_h}(k) \end{bmatrix}\hat{V}(k) \qquad (41)$$

and $X_j(k) = \mu_j(k)\hat{W}_{j,:}(k)D_v(k)\Delta_v$, $j = 1, \ldots, n_h$. By such an expansion of $A(k)$, $c(k)$ is explicit, as shown in (40). To reveal the effect of $c(k)$, we further expand (25) as

$$\hat{y}_{v+}(\hat{V}(k)) - \hat{y}_{v-}(\hat{V}(k)) = 2\hat{H}(k)c\Delta_v + 2\beta_v(k+1)cX(k). \qquad (42)$$
Using (25) and (27), we obtain the representation of the system disturbance in terms of the measurement noise $d_v(k)$ of the RRSPSA algorithm as

$$\varepsilon(k) = -\Big(\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)^T\Big)^{-1}\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)d_v(k)(2c)^{-1}. \qquad (43)$$

Furthermore, according to the equivalent disturbance (31), we have

$$\tilde{\varepsilon}_v(k) = \tilde{H}\big(\hat{W}(k), W^*, x(k)\big)V^* + \varepsilon(k) = \tilde{H}\big(\hat{W}(k), W^*, x(k)\big)V^* - \Big(\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)^T\Big)^{-1}\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)\frac{d_v(k)}{2c}. \qquad (44)$$

The equivalent disturbance $\tilde{\varepsilon}_v(k)$ is important, since its maximum value $\tilde{\varepsilon}^v_m(k)$ decides the range of the dead zone for $a_v$ in (37). Note also that the measurement noise $d_v(k)$ is defined in the same way as in [11]; it measures the difference between the true gradient and the gradient approximation, as derived in (26). Because the system disturbance $\varepsilon(k)$ is decided only by external conditions, a relatively small gain parameter $c$ implies a small measurement noise $d_v(k)$ according to (43). This allows us to choose the dead zone range mainly according to the external system disturbance $\varepsilon(k)$. Furthermore, as shown in (37), $a_v$ is nonzero whenever the norm of $e(k)$ is bigger than the dead zone value, which is related to $\tilde{\varepsilon}^v_m(k)$. If the external disturbance is also small, we can choose a small dead zone range, which ensures that the adaptive learning rate $a_v$ has a nonzero value for a prolonged period of training time.
4 Hidden layer training and stability analysis
The algorithm for updating the estimated weight vector $\hat{W}(k)$ of the hidden layer is
$$\hat{W}(k+1) = \hat{W}(k) - a_w(k+1)\hat{g}_w(\hat{W}(k)) \qquad (45)$$

where $\hat{g}_w(\hat{W}(k)) \in \mathbb{R}^{p_w}$ is the normalized gradient approximation that uses the simultaneous perturbation vectors $\Delta_w(k+1) \in \mathbb{R}^{p_w}$ and $\nabla_w(k+1) \in \mathbb{R}^{p_w}$ to stimulate the weights of the hidden layer. $\Delta_w(k+1)$ and $\nabla_w(k+1)$ are, respectively, the last $p_w$ components of the perturbation vectors defined in (14) and (15).

$$\hat{g}_w(\hat{W}(k)) = \frac{-e^T(k)\big(B_+(k) - B_-(k) + \beta_w(k+1)D_w(k)\big)}{2cq_w(k+1)}\,\nabla_w(k+1) \qquad (46)$$

where

$$B_+(k) = H\big(\hat{W}(k) + c\Delta_w(k+1), x(k)\big)\hat{V}(k) \in \mathbb{R}^m \qquad (47)$$

$$B_-(k) = H\big(\hat{W}(k) - c\Delta_w(k+1), x(k)\big)\hat{V}(k) \in \mathbb{R}^m \qquad (48)$$

$$D_w(k) = \Big(H\big(\hat{W}(k), [B_+^T(k-1), \ldots, B_+^T(k-l), 0, \ldots, 0]^T\big) - H\big(\hat{W}(k), [B_-^T(k-1), \ldots, B_-^T(k-l), 0, \ldots, 0]^T\big)\Big)\hat{V}(k) \in \mathbb{R}^m. \qquad (49)$$
For simplicity of notation, we ignore the time subscripts of $\Delta_w(k+1)$, $\nabla_w(k+1)$, $\beta_w(k+1)$, $q_w(k+1)$, and $a_w(k+1)$ in the remainder of the paper.
Remark 4 As for the output layer, our purpose is to find the desired hidden layer weight vector $W^*$ of the gradient equation

$$g(\hat{W}(k)) = \frac{\partial L(\hat{W}(k))}{\partial \hat{W}(k)} = 0.$$

The detailed mathematical deduction is given in the proof of Lemma 2. $B_+(k) - B_-(k)$ is the result of the perturbations of the feedforward signals, and $D_w(k)$ is the result of the perturbations of the recurrent signals in the RNN. The function of $\beta_w$ is to automatically switch off the recurrent contribution when the convergence and stability requirements are violated. The conditions for the key parameters appearing in (45) and (46) are shown in (64), (67), and (68).
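A sketch of the hidden-layer gradient (46)-(48) follows. Here $D_w(k)$ is passed in precomputed (per (49) it would be assembled from the stored $B_\pm(k-1), \ldots, B_\pm(k-l)$), and all helper names, sizes, and the fixed scalars are illustrative assumptions of ours.

```python
import numpy as np

def act(z, lam=1.0):
    """Sigmoid of Eq. (6)."""
    return 1.0 / (1.0 + np.exp(-4.0 * lam * z))

def HV(h, V_long, m):
    """H V for the block-diagonal H of Eq. (5)."""
    n_h = h.size
    return np.array([h @ V_long[i * n_h:(i + 1) * n_h] for i in range(m)])

def gw_hat(e, W, V_long, x, delta_w, c, beta_w, q_w, D_w, m, lam=1.0):
    """Normalized hidden-layer gradient of Eq. (46) (sketch).

    delta_w is the (p_w,) perturbation reshaped onto W; D_w is the
    recurrent term of Eq. (49), assumed precomputed from past B+/B-.
    """
    n_h = W.shape[0]
    Dw_mat = delta_w.reshape(n_h, -1)
    B_plus = HV(act((W + c * Dw_mat) @ x, lam), V_long, m)   # Eq. (47)
    B_minus = HV(act((W - c * Dw_mat) @ x, lam), V_long, m)  # Eq. (48)
    nabla_w = 1.0 / delta_w
    return -(e @ (B_plus - B_minus + beta_w * D_w)) * nabla_w / (2.0 * c * q_w)

# Illustrative sizes: m = 2, n_h = 3, n_I = 6, so p_w = 18
m, n_h, n_I = 2, 3, 6
rng = np.random.default_rng(5)
W = rng.standard_normal((n_h, n_I)) * 0.1
V = rng.standard_normal(m * n_h) * 0.1
x = rng.standard_normal(n_I)
e = rng.standard_normal(m)
delta_w = rng.choice([-1.0, 1.0], size=n_h * n_I)
g_w = gw_hat(e, W, V, x, delta_w, c=0.01, beta_w=1.0, q_w=5.0,
             D_w=np.zeros(m), m=m)
W_new = W.reshape(-1) - 0.1 * g_w     # Eq. (45) with a_w = 0.1
```

As with the output layer, only two perturbed evaluations are needed per step, and setting `beta_w=0` removes the recurrent contribution entirely.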
Lemma 2 The normalized gradient approximation in
(46) is an equivalent presentation of the standard SPSA
algorithm in equation (13).
Proof Similar to the definitions of $\hat{y}_{v+}(\hat{V}(k))$ and $\hat{y}_{v-}(\hat{V}(k))$ in (21) and (23), we define two perturbation functions $\hat{y}_{w+}(\hat{W}(k))$ and $\hat{y}_{w-}(\hat{W}(k))$, that is,

$$\hat{y}_{w+}(\hat{W}(k)) = H\big(\hat{W}(k) + c\Delta_w, x(k)\big)\hat{V}(k) + H\big(\hat{W}(k), x(k) + \Delta x(k+1)\big)\hat{V}(k). \qquad (50)$$

From the definition of $B_+(k)$ in (47),
$$\begin{aligned}
\hat{y}_{w+}(\hat{W}(k)) &= H\big(\hat{W}(k) + c\Delta_w, x(k)\big)\hat{V}(k) + H\big(\hat{W}(k), x(k) + \Delta x(k+1)\big)\hat{V}(k) \\
&= B_+(k) + H\big(\hat{W}(k), [\hat{y}^T(k-1), \ldots, \hat{y}^T(k-l), 0, \ldots, 0]^T + \Delta x(k+1)\big)\hat{V}(k) \\
&= B_+(k) + H\big(\hat{W}(k), [\hat{V}^T(k-1)H^T(\hat{W}(k-1) + c\Delta_w, x(k-1)), \ldots, \hat{V}^T(k-l)H^T(\hat{W}(k-l) + c\Delta_w, x(k-l)), 0, \ldots, 0]^T\big)\hat{V}(k) \\
&\approx B_+(k) + H\big(\hat{W}(k), [B_+^T(k-1), \ldots, B_+^T(k-l), 0, \ldots, 0]^T\big)\hat{V}(k). \qquad (51)
\end{aligned}$$

From the definition of $B_-(k)$ in (48), and similar to (51), we get

$$\hat{y}_{w-}(\hat{W}(k)) \approx B_-(k) + H\big(\hat{W}(k), [B_-^T(k-1), \ldots, B_-^T(k-l), 0, \ldots, 0]^T\big)\hat{V}(k). \qquad (52)$$
Then, we have

$$\hat{y}_{w+}(\hat{W}(k)) - \hat{y}_{w-}(\hat{W}(k)) \approx B_+(k) - B_-(k) + D_w(k). \qquad (53)$$

$D_w(k)$ in (53) is the result of the recurrent training signal, as shown in Fig. 1. To guarantee weight convergence and system stability during training, we add an adaptive recurrent hybrid parameter $\beta_w(k+1)$ to control the recurrent learning signal path. That is,

$$\hat{y}_{w+}(\hat{W}(k)) - \hat{y}_{w-}(\hat{W}(k)) = B_+(k) - B_-(k) + \beta_w D_w(k). \qquad (54)$$

Note that we change the approximation in (53) to an equality because we use the exact formula on the right-hand side of the equation to update the weights during learning.
According to (13), we can get

$$\begin{aligned}
g_w(W(k)) &= \frac{L\big(W(k)+c\Delta_w,\,V(k)\big) - L\big(W(k)-c\Delta_w,\,V(k)\big) + d_w(k)}{2c\,q_w}\,r_w \\
&= \frac{\|y(k)-y_{w+}(k)\|^2 - \|y(k)-y_{w-}(k)\|^2 + 2d_w(k)}{4c\,q_w}\,r_w \\
&= \frac{\big(2y(k)-y_{w+}(k)-y_{w-}(k)\big)^T\big(y_{w-}(k)-y_{w+}(k)\big) + 2d_w(k)}{4c\,q_w}\,r_w \\
&= \big(2y(k)-y_{w+}(k)-y_{w-}(k) + y_{w+}(k)+y_{w-}(k)-2\hat y(k)+2\varepsilon(k)\big)^T \\
&\qquad \times \big(y_{w-}(k)-y_{w+}(k)\big)\,r_w\,(4c\,q_w)^{-1} \\
&= \big(y(k)-\hat y(k)+\varepsilon(k)\big)^T\big(y_{w-}(k)-y_{w+}(k)\big)\,r_w\,(2c\,q_w)^{-1} \\
&= e^T(k)\big(y_{w-}(k)-y_{w+}(k)\big)\,r_w\,(2c\,q_w)^{-1} \\
&= -e^T(k)\big(B^+(k)-B^-(k)+b_w\Delta_w(k)\big)\,r_w\,(2c\,q_w)^{-1}
\end{aligned} \tag{55}$$

with $\hat y(k)$ the network output and $\varepsilon(k)$ the measurement noise, as in (60),
where the equivalent disturbance $d_w(k)$ is defined as

$$d_w(k) = \big(y_{w-}(W(k)) - y_{w+}(W(k))\big)^T\varepsilon(k) + \frac{1}{2}\big(y_{w-}(W(k)) - y_{w+}(W(k))\big)^T\big(y_{w+}(W(k)) + y_{w-}(W(k)) - 2\hat y(k)\big). \tag{56}$$

The fifth equality in (55) reveals the relationship between $d_w(k)$ and $\varepsilon(k)$ as in (56).
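The third equality in (55) rests on the elementary vector identity $\|y-a\|^2 - \|y-b\|^2 = (2y-a-b)^T(b-a)$. A quick numerical check, with arbitrary vectors of our own choosing standing in for $y_{w+}(k)$ and $y_{w-}(k)$:

```python
# Numerical check of the identity used in the third equality of (55):
# ||y - a||^2 - ||y - b||^2 = (2y - a - b)^T (b - a)
y = [0.5, -1.0, 2.0]
a = [1.0, 0.25, -0.5]   # plays the role of y_{w+}(k)
b = [-0.75, 2.0, 1.5]   # plays the role of y_{w-}(k)

sq = lambda u, v: sum((ui - vi) ** 2 for ui, vi in zip(u, v))
lhs = sq(y, a) - sq(y, b)
rhs = sum((2 * yi - ai - bi) * (bi - ai) for yi, ai, bi in zip(y, a, b))
assert abs(lhs - rhs) < 1e-12
```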
For stability analysis, the training error will be linked to the parameter estimation error of the hidden layer and the disturbance. According to the mean value theorem and the fact that the activation function is nondecreasing, it can be readily shown that there exist unique positive mean values $\tilde\mu_j(k)$ such that

$$h\big(W^*_{j,:}x(k)\big) - h\big(W_{j,:}(k)x(k)\big) = \tilde\mu_j(k)\big(W^*_{j,:} - W_{j,:}(k)\big)x(k) = \tilde\mu_j(k)\tilde W_{j,:}(k)x(k) \tag{57}$$
where $W_{j,:}(k)$, $W^*_{j,:}$, and $\tilde W_{j,:}(k)$ are the estimated weight, ideal weight, and weight estimation error vectors associated with the $j$th hidden-layer neuron, respectively. From (57) and the matrix in (5), it can be shown that

$$\tilde H\big(W(k),W^*,x(k)\big)V(k) = \big(H(W^*,x(k)) - H(k)\big)V(k) = \tilde X(k)\tilde W(k) \tag{58}$$
where $\tilde X(k) \in \mathbb{R}^{m\times p_w}$ is defined as

$$\tilde X(k) = \begin{bmatrix}
\tilde\mu_1(k)V_1(k)x^T(k) & \cdots & \tilde\mu_{n_h}(k)V_{n_h}(k)x^T(k) \\
\tilde\mu_1(k)V_{n_h+1}(k)x^T(k) & \cdots & \tilde\mu_{n_h}(k)V_{2n_h}(k)x^T(k) \\
\vdots & & \vdots \\
\tilde\mu_1(k)V_{(m-1)n_h+1}(k)x^T(k) & \cdots & \tilde\mu_{n_h}(k)V_{p_v}(k)x^T(k)
\end{bmatrix}
= \big[\tilde\mu_1(k)V^T_{1,:}(k)x^T(k)\;\cdots\;\tilde\mu_{n_h}(k)V^T_{n_h,:}(k)x^T(k)\big] \tag{59}$$
and $V^T_{j,:}(k)$ is the transpose of the $j$th row of the estimated output-layer weight matrix $V(k)$. From (6), the maximum value of the derivative of $h_j(k)$, $\dot h_j(k) = \partial h\big(W_{j,:}(k)x(k)\big)\big/\partial\big(W_{j,:}(k)x(k)\big)$, is $\kappa$, and hence $0 \le \tilde\mu_j(k) \le \kappa$ $(1 \le j \le n_h)$. We can rewrite the estimation error as
$$\begin{aligned}
e(k) &= y(k) - \hat y(k) + \varepsilon(k) \\
&= H\big(W^*,x(k)\big)V^* - H\big(W^*,x(k)\big)V(k) + H\big(W^*,x(k)\big)V(k) - H(k)V(k) + \varepsilon(k) \\
&= \tilde H\big(W(k),W^*,x(k)\big)V(k) + H\big(W^*,x(k)\big)\tilde V(k) + \varepsilon(k) \\
&= -\phi_w(k) + \tilde e_w(k)
\end{aligned} \tag{60}$$
where

$$\phi_w(k) = -\tilde H\big(W(k),W^*,x(k)\big)V(k) = -\tilde X(k)\tilde W(k) \tag{61}$$
and the bounded equivalent disturbance $\tilde e_w(k)$ of the hidden layer is defined as

$$\tilde e_w(k) = H\big(W^*,x(k)\big)\tilde V(k) + \varepsilon(k). \tag{62}$$

Note that $\tilde e_w(k)$ is a bounded signal because $\tilde V(k)$ has been proven to be bounded in Sect. 3.
According to the definitions of $B^+(k)$ and $B^-(k)$ in (47) and (48), we can get

$$\big(B^+(k) - B^-(k)\big)(2c)^{-1} = X(k)\Delta_w \tag{63}$$

where $X(k) \in \mathbb{R}^{m\times p_w}$ is similar to $\tilde X(k)$ in (59) except with different mean values.
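The mean-value bound $0 \le \tilde\mu_j(k) \le \kappa$ that underlies (57)–(59) can be checked numerically for a concrete activation. The choice of a logistic activation here is ours for illustration; its derivative is bounded by $\kappa = 0.25$:

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

# Slope of the chord between two pre-activations u and v:
# mu = (h(u) - h(v)) / (u - v).  The mean value theorem bounds this
# by the maximum derivative kappa = 0.25 of the logistic function.
kappa = 0.25
for u, v in [(-3.0, 2.0), (0.1, 0.2), (5.0, -5.0), (1.0, 1.0001)]:
    mu = (logistic(u) - logistic(v)) / (u - v)
    assert 0.0 <= mu <= kappa + 1e-12
```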
Theorem 2 If the hidden layer of the RNN is trained by the adaptive RRSPSA algorithm (45), the weight $W(k)$ is guaranteed to be convergent in the sense of the Lyapunov function

$$\big\|\tilde W(k+1)\big\|^2 - \big\|\tilde W(k)\big\|^2 \le 0, \quad \forall k,$$

with $\tilde W(k) = W^* - W(k)$, and the training error will be $L_2$-stable in the sense that $\tilde e_w(k) \in L_2$ and $\phi_w(k) \in L_2$.
Proof See Sect. 8.
The three adaptive learning rates appearing in (45) and (46) are defined as follows.
1. The normalization factor is defined as

$$q_w = s_{q_w}(k) + \max\Big\{\big(1 + a_w M(k)\|r_w\|^2\big)\Omega^{-1}(k),\; q\Big\} \tag{64}$$

where

$$M(k) = \bigg\|\frac{B^+(k) - B^-(k) + b_w\Delta_w(k)}{2c}\bigg\|^2$$

$$\Omega(k) = \frac{\kappa_{\min}}{\kappa p_w} + \frac{b_w}{2\kappa c}\,(r_w)^T\big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k)$$

with $0 < a_w \le 1$ a positive constant, $\kappa$ the maximum value of the derivative $\dot h_j(k)$ of the activation function in (6), and $\kappa_{\min} \ne 0$ the minimum nonzero value of the mean value $\tilde\mu_j(k)$, that is,

$$0 \le \tilde\mu_j(k) \le \kappa \quad (1 \le j \le n_h) \tag{65}$$

and

$$X(k) = \begin{bmatrix}
V_1(k)x^T(k) & \cdots & V_{n_h}(k)x^T(k) \\
V_{n_h+1}(k)x^T(k) & \cdots & V_{2n_h}(k)x^T(k) \\
\vdots & & \vdots \\
V_{(m-1)n_h+1}(k)x^T(k) & \cdots & V_{p_v}(k)x^T(k)
\end{bmatrix} \tag{66}$$

where $V_j(k)$ is the $j$th element of $V(k)$, $1 \le j \le p_v$.

2. The adaptive dead-zone learning rate of the hidden layer is defined as

$$a_w = \begin{cases} a_w, & \text{if } \|e(k)\| \ge \Gamma(k) \\ 0, & \text{if } \|e(k)\| < \Gamma(k) \end{cases} \tag{67}$$

with $\Gamma(k) = \tilde e_{wm}(k)\Big/\sqrt{1 - a_w\|r_w\|^2 M(k)\big(q_w\Omega(k)\big)^{-1}}$ and $\tilde e_{wm}(k)$ being the maximum value of $\tilde e_w(k)$ defined in (62).
3. The recurrent hybrid adaptive parameter of the hidden layer, $b_w$, is defined as

$$b_w = \begin{cases} 1, & \text{if } (r_w)^T v(k) \ge 0 \\ 0, & \text{if } (r_w)^T v(k) < 0 \end{cases} \tag{68}$$

with $v(k) = \big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k)$. $b_w$ can automatically cut off the recurrent training signal, as in the lower right corner of Fig. 1, whenever the weight convergence and stability conditions of the RNN are violated, as shown in Theorem 2 of Sect. 3.
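The switching rules (67) and (68) amount to two scalar tests per step. The sketch below is ours (function names, the argument `v` standing for $v(k)$, and the sample values are illustrative, not taken from the paper):

```python
def dead_zone_rate(err_norm, gamma, a_w):
    """Adaptive dead-zone learning rate (67): train only when the
    output error norm exceeds the disturbance-dependent threshold."""
    return a_w if err_norm >= gamma else 0.0

def hybrid_parameter(r_w, v):
    """Recurrent hybrid parameter (68): keep the recurrent training
    signal only while (r_w)^T v(k) is nonnegative."""
    inner = sum(ri * vi for ri, vi in zip(r_w, v))
    return 1.0 if inner >= 0.0 else 0.0

# Example: error below the threshold -> learning is switched off.
assert dead_zone_rate(0.1, gamma=0.5, a_w=0.9) == 0.0
assert dead_zone_rate(0.8, gamma=0.5, a_w=0.9) == 0.9
assert hybrid_parameter([1.0, -2.0], [0.5, -1.0]) == 1.0
assert hybrid_parameter([1.0, 0.0], [-0.5, 3.0]) == 0.0
```

The design choice mirrors the paper's argument: the dead zone disables adaptation when the error could be dominated by the bounded disturbance, and the hybrid parameter removes the recurrent contribution whenever it would point against the stability condition.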
Remark 5 Summary of the RRSPSA algorithm for RNNs.
1. Set a maximum training step as the stop criterion;
2. Initialize the weights for the RNN;
3. Form new input x(k) as defined in Eq. (3);
4. Perform the forward computation and calculate the output $\hat y(k)$ and error $e(k)$ with the input and the current estimated weights;
5. Calculate qv, av, and bv using (34), (37), and (38),
respectively;
6. Update the output layer weights Vðk þ 1Þ via the
learning law (16) for the next iteration;
7. Calculate qw, aw, and bw using (64), (67), and (68),
respectively;
8. Update the hidden layer weights Wðk þ 1Þ via the
learning law (45) for the next iteration;
9. Go back to step 3 to continue the iteration until the
stop criterion is reached.
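Steps 1–9 can be organized as a simple training loop. The sketch below keeps only the SPSA-style control flow (two loss evaluations per layer update) and replaces the paper's adaptive parameters (34), (37), (38), (64), (67), (68) with fixed placeholder values, so it illustrates the structure rather than the full RRSPSA algorithm:

```python
import random

def two_sided_update(loss, w, c, a):
    """One SPSA-style update: perturb all weights simultaneously,
    measure the loss twice, and move against the gradient estimate."""
    delta = [random.choice((-1.0, 1.0)) for _ in w]
    diff = (loss([wi + c * di for wi, di in zip(w, delta)])
            - loss([wi - c * di for wi, di in zip(w, delta)]))
    # g_i = diff / (2 c delta_i), and 1/delta_i = delta_i for +-1 entries
    return [wi - a * diff / (2.0 * c) * di for wi, di in zip(w, delta)]

def train(loss_v, loss_w, v, w, c=0.01, a=0.05, max_steps=200):
    for _ in range(max_steps):                 # step 1: stop criterion
        v = two_sided_update(loss_v, v, c, a)  # steps 4-6: output layer
        w = two_sided_update(loss_w, w, c, a)  # steps 7-8: hidden layer
    return v, w

# Toy usage: drive two quadratic losses toward their minima.
random.seed(1)
v, w = train(lambda v: sum(x * x for x in v),
             lambda w: sum((x - 1.0) ** 2 for x in w),
             [2.0, -1.5], [0.0, 3.0])
```

Even with fixed rates, the two-measurement structure is visible: the cost of one update is independent of the number of weights, which is the property RRSPSA inherits from standard SPSA.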
5 Examples
5.1 Time series prediction
Sunspots are dark blotches on the sun caused by magnetic storms on its surface. The underlying mechanism for sunspot appearances is not exactly known. The sunspot data are a classical example of a combination of periodic and chaotic phenomena that has served as a benchmark in the time-series statistics literature, and much work has been done on analyzing the sunspot data using neural networks [21, 22]. One set of high-quality American sunspot numbers, the monthly sunspot numbers from December 1949 to October 2009 in the database, is used as an example to minimize unnecessary human measurement errors. The monthly database contains 778 data points in total. The output sequence is scaled by dividing by 100 to bring it into a range suitable for the network. The number of hidden neurons of the RNN is ten. The selected learning rates for BP and RTRL are 0.09 and 0.01, respectively. The perturbation step size for
RRSPSA is c = 0.01. Figure 2 shows outputs of the
plant y*(k) and estimated outputs y(k) of the RNN
trained by the three different methods. Table 1 gives
comparison results of the three methods in terms of MSE
and STD.
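The MSE and STD figures reported in the tables are the usual sample statistics of the training error sequence. For reference, one common convention (whether the paper normalizes by $n$ or $n-1$ is not stated, so this is an assumption, as is the made-up error sequence):

```python
import math

def error_stats(errors):
    """Mean squared error and (population) standard deviation
    of a training-error sequence."""
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mean = sum(errors) / n
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    return mse, std

mse, std = error_stats([0.1, -0.2, 0.05, 0.0])
```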
5.2 Michaelis–Menten equation
Consider the continuous-time Michaelis–Menten equation [23]

$$\dot y(t) = \frac{0.45\,y(t)}{2.27 + y(t)} - y(t) + u(t). \tag{69}$$
Fig. 2 Output $y(k)$, reference signal $y^*(k)$, and error $e(k)$ using the BP, RTRL, and RRSPSA training algorithms for time series prediction

Given a unit impulse as input $u(t)$, a fourth-order Runge–Kutta algorithm is used to simulate this model with integration step size $\Delta t = 0.01$, and 1,000 equi-spaced samples are
obtained from input and output with a sampling interval of
T = 0.02 time units. The measured noise is white Gaussian
noise with mean 0 and variance 0.05. An RNN is utilized to emulate the dynamics of the given system and approximate the system output in (69). The gradient function of this model has many local minima, and the global minimum has a very narrow domain of attraction [23]. We compare the performance of RRSPSA to that of both the BP and RTRL training algorithms. The number of hidden neurons of the RNNs in this example is selected as ten. The fixed learning rates for
BP and RTRL are 0.03 and 0.01, respectively. The perturbation step size $c = 0.01$ is selected for RRSPSA. Figure 3 shows the outputs of the plant $y^*(k)$ and the estimated outputs $y(k)$ of the RNN trained by the three different methods. Table 2 shows the comparison results of the three methods in terms of MSE and STD.

Table 1 Comparison of mean squared error and standard deviation of the training error

              RRSPSA     RTRL       BP
  MSE         6.01e-4    1.30e-3    4.30e-3
  STD         2.45e-2    3.61e-2    6.57e-2

Fig. 3 Output $y(k)$, reference signal $y^*(k)$, and error $e(k)$ using the BP, RTRL, and RRSPSA training algorithms for simulation of the Michaelis–Menten equation
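The simulated plant in (69) can be reproduced with a textbook fourth-order Runge–Kutta integrator at the stated step size $\Delta t = 0.01$. The one-step pulse used to realize the unit impulse below is our own choice, since the paper does not spell out its discretization of the impulse:

```python
def f(y, u):
    # Michaelis-Menten dynamics (69): dy/dt = 0.45 y/(2.27 + y) - y + u
    return 0.45 * y / (2.27 + y) - y + u

def rk4_step(y, u, dt):
    # Classical fourth-order Runge-Kutta step with the input held over dt
    k1 = f(y, u)
    k2 = f(y + 0.5 * dt * k1, u)
    k3 = f(y + 0.5 * dt * k2, u)
    k4 = f(y + dt * k3, u)
    return y + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

dt = 0.01
y, ys = 0.0, []
for i in range(2000):                  # simulate 20 time units
    u = 1.0 / dt if i == 0 else 0.0    # one-step unit-impulse realization
    y = rk4_step(y, u, dt)
    ys.append(y)
```

The impulse carries the state to roughly 1 in a single step, after which the output decays monotonically, since $0.45y/(2.27+y) - y < 0$ for all $y > 0$.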
Remark 6 Among the simulation results presented in this paper, the BP training algorithm performs worst compared with the other methods. The reason is that second partial derivatives, similar to those in (17), are not considered in BP training. For RTRL, the slower convergence compared with RRSPSA is mainly caused by the small fixed learning rate and the absence of the recurrent hybrid adaptive parameter for guaranteed weight convergence in the presence of noise. RRSPSA is able to use a relatively larger effective learning rate to accelerate the system response with guaranteed weight convergence.
6 Conclusion
We proposed a robust recurrent simultaneous perturbation stochastic approximation (RRSPSA) algorithm under the framework of a deterministic system with guaranteed weight convergence. Compared with FNNs, RNNs are dynamic systems that contain recurrent connections in their structure. Because of the time dependence of the signals in RNNs, not only the weights at the current time step must be perturbed, but also those at previous time steps, which makes the learning prone to divergence and the convergence analysis difficult to obtain. The key characteristic of the RRSPSA training algorithm is the use of a recurrent hybrid adaptive parameter together with an adaptive learning rate and a dead-zone concept. The recurrent hybrid adaptive parameter can automatically switch off the recurrent training signal whenever the weight convergence and stability conditions are violated (see Fig. 1 and the explanations in Sects. 2 and 3). There are several advantages to training RNNs with the RRSPSA algorithm. First, RRSPSA is relatively easy to implement compared with other RNN training methods such as BPTT, owing to its RTRL type of training and its simplicity. Second, RRSPSA provides an excellent convergence property based on simultaneous perturbation of the weights, similar to the standard SPSA. Finally, RRSPSA is capable of improving RNN training speed with guaranteed weight convergence over the standard RTRL algorithm by using an adaptive learning method. RRSPSA is a powerful twin-engine SPSA type of RNN training algorithm: it utilizes three specially designed adaptive parameters to maximize the training speed for the recurrent training signal while exhibiting certain weight convergence properties, with only two objective function measurements as in the standard SPSA algorithm. The combination of the three adaptive parameters makes RRSPSA converge quickly with the maximum effective learning rate and guaranteed weight convergence. Robust stability and weight convergence proofs of the RRSPSA algorithm are provided based on a Lyapunov function.
7 Proof of Theorem 1
According to the training algorithm in (16), we can get

$$\tilde V(k+1) = \tilde V(k) + a_v\,g_v(V(k)) \tag{70}$$

where $g_v(V(k))$ is defined in (26). Then we have
$$\begin{aligned}
\big\|\tilde V(k+1)\big\|^2 - \big\|\tilde V(k)\big\|^2
&= 2a_v\,g_v(V(k))^T\tilde V(k) + \big\|a_v\,g_v(V(k))\big\|^2 \\
&= -\frac{a_v}{c\,q_v}\,e^T(k)\big(2H(k)c\Delta_v + b_v A(k)\big)(r_v)^T\tilde V(k) + \big\|a_v\,g_v(V(k))\big\|^2 \\
&= -\frac{a_v}{c\,q_v}\Big\{2c\,e^T(k)H(k)\Delta_v(r_v)^T\tilde V(k) + b_v\,e^T(k)A(k)(r_v)^T\tilde V(k)\Big\} + \big\|a_v\,g_v(V(k))\big\|^2 \\
&\le \frac{2a_v p_v}{q_v}\,e^T(k)\phi_v(k) + \frac{a_v b_v}{c\,q_v}\,e^T(k)\phi_v(k)\,(r_v)^T\big(\eta I + H^T(k)H(k)\big)^{-1}H^T(k)A(k) \\
&\qquad + \big\|a_v\,g_v(V(k))\big\|^2
\end{aligned} \tag{71}$$

Note that $H(k)\tilde V(k)$ in the fourth equality is replaced with $\phi_v(k)$ as defined in (32). A small perturbation parameter $\eta$ is added to approximate $[H^T(k)H(k)]^{-1}$ and to guarantee full rank of the inverse matrix $[\eta I + H^T(k)H(k)]^{-1}$, similar to a general nonlinear iteration algorithm [9].
Table 2 Comparison of the mean squared error and standard deviation of the error

              RRSPSA     RTRL       BP
  MSE         4.2e-3     1.64e-2    2.98e-2
  STD         6.46e-2    1.28e-1    1.72e-1
$$\begin{aligned}
\big\|\tilde V(k+1)\big\|^2 - \big\|\tilde V(k)\big\|^2
&\le \frac{2a_v p_v}{q_v}\,e^T(k)\big(\tilde e_v(k) - e(k)\big) \\
&\qquad + \frac{a_v b_v}{c\,q_v}\,e^T(k)\big(\tilde e_v(k) - e(k)\big)(r_v)^T\big(\eta I + H^T(k)H(k)\big)^{-1}H^T(k)A(k) \\
&\qquad + \|e(k)\|^2\,\frac{a_v^2\big\|2H(k)c\Delta_v + b_v A(k)\big\|^2\|r_v\|^2}{(2c\,q_v)^2} \\
&\le \frac{a_v p_v}{q_v}\Big\{\|\tilde e_v(k)\|^2 - \|e(k)\|^2\Big\} \\
&\qquad + \frac{a_v b_v}{2c\,q_v}\Big\{\|\tilde e_v(k)\|^2 - \|e(k)\|^2\Big\}(r_v)^T\big(\eta I + H^T(k)H(k)\big)^{-1}H^T(k)A(k) \\
&\qquad + \|e(k)\|^2\,\frac{a_v^2\big\|2H(k)c\Delta_v + b_v A(k)\big\|^2\|r_v\|^2}{(2c\,q_v)^2} \\
&= \frac{a_v}{q_v}\big\{p_v + 0.5\,b_v Y(k)\big\}\Bigg\{\|\tilde e_v(k)\|^2 - \bigg(1 - \frac{a_v\,q_v^{-1}Z(k)}{p_v + 0.5\,b_v Y(k)}\bigg)\|e(k)\|^2\Bigg\}
\end{aligned} \tag{72}$$

where $Y(k)$ and $Z(k)$ are defined in (35) and (36), respectively. Note that the second inequality in (72) follows from the fact that $2e^T(k)\tilde e_v(k) \le \|\tilde e_v(k)\|^2 + \|e(k)\|^2$. Here $p_v$ is the dimension of the output-layer weight vector, which is positive, and $a_v$ is a fixed positive scalar. Moreover, $b_v Y(k) \ge 0$ by the definition of the recurrent adaptive learning rate $b_v(k+1)$ in (38), and $q_v > a_v Z(k)\big/\big(p_v + 0.5\,b_v Y(k)\big)$ by the definition of the normalization factor in (34). The adaptive learning rate $a_v(k+1)$ in (37) implies that the recurrent training of RRSPSA uses $a_v$ only if the output error of the RNN is larger than a predefined value; otherwise it is zero, which guarantees that $\|\tilde V(k+1)\|^2 - \|\tilde V(k)\|^2$ is non-positive and non-increasing. The limit of $\|\tilde V(k+1)\|^2 - \|\tilde V(k)\|^2$ therefore exists, and $\|\tilde e_v(k)\|^2 - \big(1 - a_v q_v^{-1}Z(k)/(p_v + 0.5\,b_v Y(k))\big)\|e(k)\|^2$ goes to zero. Then, we finally have

$$\big\|\tilde V(k+1)\big\|^2 - \big\|\tilde V(k)\big\|^2 \le 0.$$
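In simulation studies, where an ideal weight $V^*$ is available, the monotonicity established by Theorem 1 can be monitored directly during training. A sketch (function names and the toy trajectory are ours):

```python
def lyapunov_differences(weight_history, v_star):
    """Successive differences ||V~(k+1)||^2 - ||V~(k)||^2 with
    V~(k) = V* - V(k); Theorem 1 asserts these are non-positive."""
    def sq_err(v):
        return sum((s - vi) ** 2 for s, vi in zip(v_star, v))
    errs = [sq_err(v) for v in weight_history]
    return [b - a for a, b in zip(errs, errs[1:])]

# A toy weight trajectory approaching v_star = [1, -1]:
diffs = lyapunov_differences([[0.0, 0.0], [0.5, -0.5], [0.9, -0.9]],
                             [1.0, -1.0])
assert all(d <= 0 for d in diffs)
```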
8 Proof of Theorem 2
Considering that SPSA is an approximation of the gradient algorithm [11], and using the property of local minimum points of the gradient, $-\big(\partial(e^T(k)e(k))/\partial W(k)\big)^T\tilde W(k) \ge 0$ [5], we have

$$\begin{aligned}
0 &\le -\frac{\partial\big(e^T(k)e(k)\big)}{\partial W(k)}\,\tilde W(k) \\
&= \sum_{i=1}^{n_h}\Big\{\dot h_i(k)\,e^T(k)V^T_{i,:}(k)x^T(k)\tilde W_{i,:}(k)\Big\} \\
&= \sum_{i=1}^{n_h}\frac{\dot h_i(k)}{\tilde\mu_i(k)}\Big\{\tilde\mu_i(k)\,e^T(k)V^T_{i,:}(k)x^T(k)\tilde W_{i,:}(k)\Big\} \\
&\le \frac{\kappa}{\kappa_{\min}}\sum_{i=1}^{n_h}\Big\{\tilde\mu_i(k)\,e^T(k)V^T_{i,:}(k)x^T(k)\tilde W_{i,:}(k)\Big\} \\
&= \frac{\kappa}{\kappa_{\min}}\,e^T(k)\tilde X(k)\tilde W(k) = -\frac{\kappa}{\kappa_{\min}}\,e^T(k)\phi_w(k)
\end{aligned} \tag{73}$$

where $\kappa$ is the maximum value of the derivative $\dot h_j(k)$ of the activation function in (6), and $\kappa_{\min} \ne 0$ is the minimum nonzero value of the mean value $\tilde\mu_j(k)$ defined in (65). (Note that the inequality is always true if $\kappa_{\min} = 0$, which implies $\dot h_j(k) = \tilde\mu_j(k) = 0$.) According to the training algorithm for the hidden layer in (45), we get
$$\tilde W(k+1) = \tilde W(k) + a_w\,g_w(W(k)) \tag{74}$$

where $g_w(W(k))$ is defined in (55). Using (54), (74), and (73), together with the definition of $X(k)$ in (66), we get
$$\begin{aligned}
\big\|\tilde W(k+1)\big\|^2 - \big\|\tilde W(k)\big\|^2
&= 2a_w\,g_w(W(k))^T\tilde W(k) + \big\|a_w\,g_w(W(k))\big\|^2 \\
&= -\frac{a_w}{c\,q_w}\,e^T(k)\big(B^+(k) - B^-(k) + b_w\Delta_w(k)\big)(r_w)^T\tilde W(k) + \big\|a_w\,g_w(W(k))\big\|^2 \\
&\le \frac{2a_w\kappa_{\min}}{\kappa p_w\,q_w}\,e^T(k)\phi_w(k) + \frac{a_w b_w}{\kappa c\,q_w}\,e^T(k)\phi_w(k)\,(r_w)^T\big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k) \\
&\qquad + \|e(k)\|^2\,\bigg\|\frac{B^+(k) - B^-(k) + b_w\Delta_w(k)}{2c}\bigg\|^2\|r_w\|^2\bigg(\frac{a_w}{q_w}\bigg)^2
\end{aligned} \tag{75}$$

where the inequality follows from (73) and the regularized inverse $\big(\eta I + X^T(k)X(k)\big)^{-1}$, as in the proof of Theorem 1.
Similar to the output-layer case, $\phi_w(k)$ is replaced with $\big(\tilde e_w(k) - e(k)\big)$, and we have the following Lyapunov function for convergence analysis:

$$\begin{aligned}
\big\|\tilde W(k+1)\big\|^2 - \big\|\tilde W(k)\big\|^2
&\le \frac{a_w}{q_w}\bigg\{\frac{\kappa_{\min}}{\kappa p_w} + \frac{b_w}{2\kappa c}(r_w)^T\big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k)\bigg\} \\
&\quad\times\Bigg\{\|\tilde e_w(k)\|^2 - \Bigg(1 - \frac{a_w}{q_w}\bigg\|\frac{B^+(k) - B^-(k) + b_w\Delta_w(k)}{2c}\bigg\|^2\|r_w\|^2 \\
&\qquad\times\bigg(\frac{\kappa_{\min}}{\kappa p_w} + \frac{b_w}{2\kappa c}(r_w)^T\big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k)\bigg)^{-1}\Bigg)\|e(k)\|^2\Bigg\}
\end{aligned} \tag{76}$$

The three adaptive parameters defined in (64), (67), and (68) guarantee the non-positivity of the Lyapunov difference, that is,

$$\big\|\tilde W(k+1)\big\|^2 - \big\|\tilde W(k)\big\|^2 \le 0, \quad \forall k.$$
References
1. Hopfield JJ (1982) Neural networks and physical systems with
emergent collective computational abilities. Proc Natl Acad Sci
79:2554–2558
2. Talebi HA (2009) A recurrent neural-network-based sensor and
actuator fault detection and isolation for nonlinear systems with
application to the satellite’s attitude control subsystem. IEEE
Trans Neural Netw 20:45–60
3. Hou ZG, Gupta MM, Nikiforuk PN, Tan M, Cheng L (2007) A
recurrent neural network for hierarchical control of intercon-
nected dynamic systems. IEEE Trans Neural Netw 18:466–481
4. Al Seyab RK, Cao Y (2008) Nonlinear system identification for
predictive control using continuous time recurrent neural networks
and automatic differentiation. J Process Control 18:568–581
5. Song Q, Xiao J, Soh YC (1999) Robust backpropagation training
algorithm for multilayered neural tracking controller. IEEE Trans
Neural Netw 10:1133–1141
6. Song Q, Wu Y, Soh YC (2008) Robust adaptive gradient-descent
training algorithm for recurrent neural networks in discrete time
domain. IEEE Trans Neural Netw 19:1841–1853
7. Song Q, Spall JC, Soh YC, Ni J (2008) Robust neural network
tracking controller using simultaneous perturbation stochastic
approximation. IEEE Trans Neural Netw 19:817–835
8. Song Q (2008) On the weight convergence of Elman networks.
IEEE Trans Neural Netw 21:463–480
9. Haykin S (1999) Neural networks: a comprehensive foundation. Prentice Hall, New Jersey
10. Mandic DP, Chambers JA (2001) Recurrent neural networks for
prediction: learning algorithms, architectures and stability. Wiley,
New York
11. Spall JC (1992) Multivariate stochastic approximation using a
simultaneous perturbation gradient approximation. IEEE Trans
Autom Control 37:332–341
12. Maeda Y, Wakamura M (2005) Simultaneous perturbation
learning rule for recurrent neural networks and its FPGA imple-
mentation. IEEE Trans Neural Netw 16:1664–1672
13. Trunov AB, Polycarpou MM (2000) Automated fault diagnosis in
nonlinear multivariable systems using a learning methodology.
IEEE Trans Neural Netw 11:91–101
14. Spall JC, Cristion JA (1998) Model-free control of nonlinear
stochastic systems with discrete-time measurements. IEEE Trans
Autom Control 43:1198–1210
15. Spall JC, Cristion JA (1997) A neural network controller for
systems with unmodeled dynamics with applications to waste-
water treatment. IEEE Trans Syst Man Cybern Part B Cybern
27:369–375
16. Maeda Y, De Figueiredo RJP (1997) Learning rules for neuro-
controller via simultaneous perturbation. IEEE Trans Neural
Netw 8:1119–1130
17. Werbos PJ (1988) Generalization of backpropagation with
application to a recurrent gas market model. Neural Netw
1:339–356
18. Rumelhart D, Hinton G, Williams R (1986) Learning internal
representations by error backpropagation. Parallel Distrib Process
1:318–362
19. Williams RJ, Zipser D (1989) A learning algorithm for continu-
ally running fully recurrent neural networks. Neural Comput
1:270–280
20. Williams RJ, Zipser D (1995) Gradient-based learning algorithms
for recurrent networks and their computational complexity.
Backpropag Theory Archit Appl 2:433–501
21. Lin T, Giles C, Horne B, Kung S (1997) A delay damage model
selection algorithm for NARX neural networks. IEEE Trans
Signal Process 45:2719–2730
22. Park Y, Murray T, Chen C (2002) Predicting sun spots using a
layered perceptron neural network. IEEE Trans Neural Netw
7:501–505
23. Ljung L (2010) Perspectives on system identification. Annu Rev
Control 34(1):1–12