ORIGINAL ARTICLE
A robust recurrent simultaneous perturbation stochastic approximation training algorithm for recurrent neural networks
Zhao Xu • Qing Song • Danwei Wang
Received: 22 February 2012 / Accepted: 21 May 2013
© Springer-Verlag London 2013
Abstract Training of recurrent neural networks (RNNs)
introduces considerable computational complexities due to
the need for gradient evaluations. How to get fast conver-
gence speed and low computational complexity remains a
challenging and open topic. Besides, the transient response
of the learning process of RNNs is a critical issue, especially for
online applications. Conventional RNN training algorithms
such as the backpropagation through time and real-time
recurrent learning have not adequately satisfied these
requirements because they often suffer from slow conver-
gence speed. If a large learning rate is chosen to improve
performance, the training process may become unstable in
terms of weight divergence. In this paper, a novel training
algorithm of RNN, named robust recurrent simultaneous
perturbation stochastic approximation (RRSPSA), is
developed with a specially designed recurrent hybrid adap-
tive parameter and adaptive learning rates. RRSPSA is a
powerful novel twin-engine simultaneous perturbation sto-
chastic approximation (SPSA) type of RNN training algo-
rithm. It utilizes three specially designed adaptive
parameters to maximize training speed for a recurrent
training signal while exhibiting certain weight convergence
properties with only two objective function measurements
as in the original SPSA algorithm. RRSPSA is proved to guarantee weight convergence and system stability in the sense of a Lyapunov function. Computer simulations were carried out to demonstrate the applicability of the theoretical results.
Keywords Recurrent neural networks · Stability and convergence analysis · SPSA
List of symbols
$k$  Discrete time step
$m$  Output dimension
$n_I$  Input dimension
$n_h$  Hidden neuron number
$p_v = m \times n_h$  Dimension of the output layer weight vector
$p_w = n_I \times n_h$  Dimension of the hidden layer weight vector
$l$  Number of time-delayed outputs
$u_i(k)$, $i = 1, \ldots, m$  External input at time step $k$
$x(k) \in \mathbb{R}^{n_I}$  Neural network input vector
$\Delta x(k) \in \mathbb{R}^{n_I}$  Perturbation vector on the input $x(k-1)$
$y(k) \in \mathbb{R}^m$  Desired output vector
$\hat{y}(k) \in \mathbb{R}^m$  Estimated output vector
$\varepsilon(k) \in \mathbb{R}^m$  Total disturbance vector
$e(k) \in \mathbb{R}^m$  Output estimation error vector
$\hat{V}(k) \in \mathbb{R}^{p_v}$  Estimated weight vector of the output layer
$V^* \in \mathbb{R}^{p_v}$  Ideal weight vector of the output layer
$\tilde{V}(k) \in \mathbb{R}^{p_v}$  Estimation error vector of the output layer
$\hat{W}(k) \in \mathbb{R}^{p_w}$  Estimated weight vector of the hidden layer
$W^* \in \mathbb{R}^{p_w}$  Ideal weight vector of the hidden layer
$\tilde{W}(k) \in \mathbb{R}^{p_w}$  Estimation error vector of the hidden layer
Z. Xu (✉) · Q. Song · D. Wang
School of Electrical and Electronic Engineering, Nanyang
Technological University, Singapore, Singapore
e-mail: [email protected]
Q. Song
e-mail: [email protected]
D. Wang
e-mail: [email protected]
123
Neural Comput & Applic
DOI 10.1007/s00521-013-1436-5
$\hat{W}_{j,:}(k) \in \mathbb{R}^{n_I}$  The jth row vector of the hidden layer weight matrix $\hat{\mathbf{W}}(k)$
$d_v(k)$  Equivalent approximation error of the loss function of the output layer
$d_w(k)$  Equivalent approximation error of the loss function of the hidden layer
$c$  Perturbation gain parameter of the SPSA
$\Delta_v \in \mathbb{R}^{p_v}$  Perturbation vector of the output layer
$\nabla_v \in \mathbb{R}^{p_v}$  Reciprocal perturbation vector of the output layer
$\Delta_w \in \mathbb{R}^{p_w}$  Perturbation vector of the hidden layer
$\nabla_w \in \mathbb{R}^{p_w}$  Reciprocal perturbation vector of the hidden layer
$H(\hat{W}(k), x(k)) = \hat{H}(k) \in \mathbb{R}^{m \times p_v}$  Nonlinear activation function matrix
$h_j(k) = h(\hat{W}_{j,:}(k)x(k))$  Nonlinear activation function; scalar element of $H(\hat{W}(k), x(k))$
$a_v$  Adaptive learning rate of the output layer
$a_w$  Adaptive learning rate of the hidden layer
$\bar{a}_v, \bar{a}_w$  Positive scalars
$q_v$  Normalization factor of the output layer
$q_w$  Normalization factor of the hidden layer
$\beta_v$  Recurrent hybrid adaptive parameter of the output layer
$\beta_w$  Recurrent hybrid adaptive parameter of the hidden layer
$\mu_j(k) \in \mathbb{R}$, $1 \le j \le n_h$  Mean value of the input vectors of the jth hidden layer neuron
$\tilde{\mu}_j(k) \in \mathbb{R}$, $1 \le j \le n_h$  Mean value of the hidden layer weight vector of the jth hidden layer neuron
$s$  Positive scalar
$\lambda$  Positive gain parameter of the threshold function
$\eta$  A small perturbation parameter
1 Introduction
Recurrent neural networks (RNNs) are inherently dynamic
nonlinear systems which have signal feedback connections
and are able to process temporal sequences. RNNs with free
synaptic weights are more useful for complex dynamical
systems such as nonlinear dynamic system identification,
time series prediction, and adaptive control as compared to
Hopfield neural networks (HNNs), in which the weights are
fixed and symmetric [1–8].
To maximize potentials and capabilities of the RNNs, one
of the preliminary tasks is to design an efficient training
algorithm with fast weight convergence speed and low
computational complexity. Gradient descent optimization
techniques are often used for neural networks which consist
of adjustable weights along the negative direction of error
gradient. However, due to the dynamic structure of RNNs, the error gradient obtained by a classical BP training algorithm with a fixed learning step tends to be very small, because the RNN weights depend not only on the current output but also on the outputs of past steps. It is widely acknowledged that
recurrent derivative of the multilayered RNNs, that is, the
recurrent training signal (see Fig. 1), which contains infor-
mation of dynamic time dependence in training, is necessary
to obtain good time response of RNNs [9, 10]. From com-
putational efficiency and weight convergence points of
view, Spall proposed a simultaneous perturbation stochastic
approximation (SPSA) method to find roots and minima of
neural network functions which are contaminated with noise
in [11]. The name "simultaneous perturbation" arises from the fact that all elements of the unknown weight parameters are varied simultaneously. SPSA has the potential
to be significantly more efficient than the usual p-dimen-
sional algorithms (of Kiefer–Wolfowitz/Blum type) that are
based on standard finite-difference gradient approximations,
and it is shown in [11] that approximately the same level of estimation accuracy can typically be achieved with only 1/p-th the amount of data needed in the standard approach. The
main advantage of SPSA is its simplicity and excellent
weight convergence property, in which only two objective
function measurements are used. Thus, SPSA is more eco-
nomical in terms of loss function evaluation, which is usu-
ally the most expensive part of an optimization problem.
SPSA has attracted great attention in application areas such
as neural networks training, adaptive control, and model
parameter estimation [7, 12–16].
Fig. 1 Block diagram of RRSPSA training for the output feedback-
based RNN
Conventional training methods of RNNs are usually
classified as backpropagation through time (BPTT) [17, 18]
and real-time recurrent learning (RTRL) [19, 20] algo-
rithms, which often suffer from drawbacks of slow con-
vergence speed and heavy computational load. In fact,
BPTT cannot be used for online application due to com-
putational problems. For a standard RTRL training,
recursive weight updating requires a small learning step to
obtain weight convergence and system stability of RNNs.
If a large learning rate is selected to speed up weight
updating, the training process of RTRL may lead to a large
steady-state error or even weight divergence [6, 9].
In previous work, we developed an adaptive SPSA-based training scheme for FNNs and demonstrated that the adaptive SPSA algorithm can be used to derive a fast training algorithm with guaranteed weight convergence [7]. In
this paper, we will extend our SPSA research to the RNNs
with free weight updating and propose a robust recurrent
simultaneous perturbation stochastic approximation
(RRSPSA) algorithm under the framework of deterministic
system with guaranteed weight convergence. Compared with
FNNs, RNNs are dynamic systems and contain recurrent
connections in their structure. Considering the time dependence of the signals of RNNs, not only the weights at the current time step must be perturbed, but also those at previous time steps, which makes the learning prone to divergence and the convergence analysis complicated.
The key characteristic of RRSPSA training algorithm is to use
a recurrent hybrid adaptive parameter together with adaptive
learning rate and dead zone concept. The recurrent hybrid
adaptive parameter can automatically switch off recurrent
training signal whenever the weight convergence and stability
conditions are violated (see Fig. 1 and the explanations in Sects. 3 and 4). There are several advantages to training RNNs based on the RRSPSA algorithm. First, RRSPSA is relatively easy to
implement as compared to other RNN training methods such
as BPTT due to its RTRL type of training and simplicity.
Second, RRSPSA provides excellent convergence property
based on simultaneous perturbation of weight, which is sim-
ilar to the standard SPSA. Finally, RRSPSA has the capability to improve RNN training speed over the standard RTRL algorithm, with guaranteed weight convergence, by using an adaptive learning method. RRSPSA is a powerful twin-engine
SPSA type of RNN training algorithm. It utilizes three specifically designed adaptive parameters to maximize training speed for the recurrent training signal while exhibiting
certain weight convergence properties with only two objec-
tive function measurements as a standard SPSA algorithm.
The combination of the three adaptive parameters makes RRSPSA converge quickly, with the maximum effective learning rate and guaranteed weight convergence. Robust stability and
weight convergence proofs of RRSPSA algorithm are pro-
vided based on Lyapunov function.
The rest of this paper is organized as follows. In Sect. 2, we briefly introduce the structure of the RNN and the conventional SPSA algorithm. In Sects. 3 and 4, the RRSPSA algorithm and its robustness analysis are derived for the output and hidden layers, respectively. Computer simulation results are presented in Sect. 5 to show the efficiency of the proposed RRSPSA. Section 6 draws the final conclusions.
2 Basics of the RNN and SPSA
To simplify the notation of this paper, we consider a three-layer m-external-input and m-output RNN with the output feedback structure shown in Fig. 1, which can easily be extended to an RNN with arbitrary numbers of inputs and outputs. The output vector of the RNN is $\hat{y}(k)$, which can be presented as

$$\hat{y}(k) = H(\hat{W}(k), x(k))\hat{V}(k) = [\hat{y}_1(k), \ldots, \hat{y}_m(k)]^T \in \mathbb{R}^m \qquad (1)$$

where $\hat{V}(k) \in \mathbb{R}^{p_v}$ ($p_v = m \times n_h$) is the long vector of the output layer weight matrix $\hat{\mathbf{V}}(k) \in \mathbb{R}^{n_h \times m}$ defined by

$$\hat{V}(k) = \big[\hat{V}_{1,1}(k), \ldots, \hat{V}_{n_h,1}(k), \ldots, \hat{V}_{1,m}(k), \ldots, \hat{V}_{n_h,m}(k)\big]^T \qquad (2)$$

and $x(k) \in \mathbb{R}^{n_I}$ ($n_I = (l+1) \times m$) is the input vector of the RNN defined by

$$x(k) = [x_1(k), x_2(k), \ldots, x_{n_I}(k)]^T = \big[\hat{y}^T(k-1), \ldots, \hat{y}^T(k-l), u_1(k), \ldots, u_m(k)\big]^T \qquad (3)$$

and $\hat{W}(k) \in \mathbb{R}^{p_w}$ ($p_w = n_I \times n_h$) is the long vector of the hidden layer weight matrix $\hat{\mathbf{W}}(k) \in \mathbb{R}^{n_h \times n_I}$ defined by

$$\hat{W}(k) = \big[\hat{W}_{1,:}(k), \hat{W}_{2,:}(k), \ldots, \hat{W}_{n_h,:}(k)\big]^T. \qquad (4)$$
$H(\hat{W}(k), x(k)) \in \mathbb{R}^{m \times p_v}$ is the nonlinear activation block-diagonal matrix given by

$$\hat{H}(k) = H(\hat{W}(k), x(k)) = \begin{bmatrix} h_1(k), \ldots, h_{n_h}(k) & 0, \ldots, 0 & \cdots \\ 0, \ldots & \ddots & \vdots \\ 0, \ldots, 0 & \cdots & h_1(k), \ldots, h_{n_h}(k) \end{bmatrix} \qquad (5)$$

where $h_j(k)$ with $1 \le j \le n_h$ is one of the most popular nonlinear activation functions,

$$h_j(k) = h\big(\hat{W}_{j,:}(k)x(k)\big) = \frac{1}{1 + \exp\big(-4\lambda \hat{W}_{j,:}(k)x(k)\big)} \qquad (6)$$

with $\hat{W}_{j,:}(k) \in \mathbb{R}^{n_I}$ being the jth row vector of $\hat{\mathbf{W}}(k)$, and $\lambda > 0$ is the gain parameter of the threshold function, which is
defined specifically for ease of use later when deriving the
sector condition of the hidden layer.
The output vector $\hat{y}(k)$ is an approximation of the ideal nonlinear function $y(k) = H(W^*, x(k))V^*$, with $V^*$ and $W^*$ being the ideal output layer and hidden layer weights, respectively.
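As a concrete illustration of Eqs. (1)-(6), the forward pass of this output-feedback RNN can be sketched as follows. This is a minimal sketch under assumed small dimensions; the helper names (`activation`, `rnn_output`) are ours, not from the paper.

```python
import numpy as np

def activation(z, lam=1.0):
    """Sigmoid threshold function of Eq. (6) with gain parameter lam > 0."""
    return 1.0 / (1.0 + np.exp(-4.0 * lam * z))

def rnn_output(W, V_long, x, m, lam=1.0):
    """Compute y_hat(k) = H(W_hat(k), x(k)) V_hat(k) of Eq. (1).

    W      : (n_h, n_I) hidden layer weight matrix
    V_long : (m * n_h,) long output-layer weight vector of Eq. (2)
    x      : (n_I,) input vector of Eq. (3): l delayed outputs + m external inputs
    """
    n_h = W.shape[0]
    h = activation(W @ x, lam)  # h_j(k), j = 1..n_h
    # Block-diagonal H(k) of Eq. (5): output i uses its own slice of V.
    return np.array([h @ V_long[i * n_h:(i + 1) * n_h] for i in range(m)])

# Example sizes: m = 2 outputs, l = 2 delayed outputs, n_h = 3 hidden neurons
m, l, n_h = 2, 2, 3
n_I = (l + 1) * m                       # per Eq. (3)
rng = np.random.default_rng(0)
W = rng.standard_normal((n_h, n_I)) * 0.1
V = rng.standard_normal(m * n_h) * 0.1
y_prev = [np.zeros(m), np.zeros(m)]     # y_hat(k-1), y_hat(k-2)
u = np.ones(m)                          # external input u(k)
x = np.concatenate(y_prev + [u])
y_hat = rnn_output(W, V, x, m)
```

At each step the newest output would be pushed into the delay line to form the next input $x(k+1)$, which is exactly the output feedback loop of Fig. 1.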
The SPSA algorithm solves the problem of finding an optimal weight $\theta^*$ of the gradient equation of neural networks

$$g(\hat\theta(k)) = \frac{\partial L(\hat\theta(k))}{\partial \hat\theta(k)} = 0$$

for some differentiable loss function $L : \mathbb{R}^p \to \mathbb{R}^1$, where $\hat\theta(k)$ is the estimated weight vector of the neural network. The basic idea is to vary all elements of the vector $\hat\theta(k)$ simultaneously and approximate the gradient function using only the two measurements of the loss function in (7) and (8), without requiring exact derivatives or a large number of function evaluations, which provides a significant improvement in efficiency over standard gradient descent algorithms. Define

$$g^+(\hat\theta(k)) = L(\hat\theta(k) + c(k)\Delta(k)) + \epsilon^+(k) \qquad (7)$$

$$g^-(\hat\theta(k)) = L(\hat\theta(k) - c(k)\Delta(k)) + \epsilon^-(k) \qquad (8)$$

where $\epsilon^+(k)$ and $\epsilon^-(k)$ represent measurement noise terms, $c(k)$ is a scalar, and $\Delta(k) = [\Delta_1(k), \ldots, \Delta_p(k)]$ is the perturbation vector. Then, the estimate of $g(\hat\theta)$ at the kth iteration is

$$\hat{g}(\hat\theta(k)) = \left[\frac{g^+(\hat\theta(k)) - g^-(\hat\theta(k))}{2c(k)\Delta_1(k)}, \ldots, \frac{g^+(\hat\theta(k)) - g^-(\hat\theta(k))}{2c(k)\Delta_p(k)}\right]^T \qquad (9)$$
Theory and numerical experience in [11] indicate that
SPSA can be significantly more efficient than the standard
finite-difference-based algorithms in large-dimensional
problems.
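The two-measurement estimate of Eqs. (7)-(9), combined with the update (12), can be sketched in a few lines. The quadratic test loss, constant gains, and noise-free measurements below are illustrative assumptions of ours, not the paper's setup.

```python
import numpy as np

def spsa_gradient(loss, theta, c, rng):
    """SPSA gradient estimate of Eq. (9) from two loss measurements.

    Every component of theta is perturbed simultaneously by a Bernoulli
    +/-1 direction, so only two evaluations of the loss are needed no
    matter how large the dimension p is.
    """
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # perturbation vector
    g_plus = loss(theta + c * delta)                   # Eq. (7), noise-free here
    g_minus = loss(theta - c * delta)                  # Eq. (8)
    return (g_plus - g_minus) / (2.0 * c * delta)      # Eq. (9), componentwise

# Minimize L(theta) = ||theta||^2 with the recursion of Eq. (12)
rng = np.random.default_rng(1)
loss = lambda th: float(th @ th)
theta = np.array([1.0, 2.0, 3.0])
start = loss(theta)
for _ in range(200):
    theta = theta - 0.05 * spsa_gradient(loss, theta, c=0.01, rng=rng)
```

For this quadratic loss the two measurements give an exact directional difference, and the iterate contracts toward the minimizer even though no true gradient is ever computed.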
Inspired by the excellent economy and convergence properties of SPSA, we propose the RRSPSA algorithm under the framework of a deterministic system, based on the idea of simultaneous perturbation with guaranteed weight convergence. The loss function for RRSPSA is defined as

$$L(\hat\theta(k)) = \frac{1}{2}\|e(k)\|^2 \qquad (10)$$

$$e(k) = y(k) - \hat{y}(k) + \varepsilon(k) \qquad (11)$$

where $y(k)$ is the ideal output and a function of $\theta^*$, $\hat{y}(k)$ is a function of $\hat\theta(k)$, and $\varepsilon(k)$ is a disturbance term. Similar to SPSA, the RRSPSA algorithm seeks an optimal weight vector $\theta^*$ of the gradient equation, that is, the weight vector that minimizes the differentiable $L(\hat\theta(k))$.

That is, the proposed RRSPSA algorithm for updating $\hat\theta(k) \in \mathbb{R}^{(p_w+p_v)}$ as an estimate of the ideal weight vector $\theta^*$ is of the form
$$\hat\theta(k+1) = \hat\theta(k) - a(k+1)\hat{g}(\hat\theta(k)) \qquad (12)$$
where $a(k+1)$ is an adaptive learning rate and $\hat{g}(\hat\theta(k))$ is an approximation of the gradient of the loss function. This (normalized) approximated gradient function is of the SPSA form (13), where $d(k)$ is an equivalent approximation error of the loss function, also called the measurement noise, and $q(k+1)$ is the normalization factor. $\Delta(k+1)$, $\nabla(k+1)$, and $c(k+1) > 0$ are the controlling parameters of the algorithm, defined as follows:
1. The perturbation vector

$$\Delta(k+1) = [\Delta_1(k+1), \ldots, \Delta_{p_w+p_v}(k+1)]^T \in \mathbb{R}^{(p_w+p_v)} \qquad (14)$$

is a bounded random directional vector that is used to perturb the weight vector simultaneously and can be generated randomly.

2. The sequence $\nabla(k+1)$ is defined as

$$\nabla(k+1) = \left[\frac{1}{\Delta_1(k+1)}, \ldots, \frac{1}{\Delta_{p_w+p_v}(k+1)}\right]^T \in \mathbb{R}^{(p_w+p_v)}. \qquad (15)$$
3. $c(k+1) > 0$ is a sequence of positive numbers [11, 14, 15]. $c(k+1)$ can be chosen as a small constant for a slowly time-varying system, as pointed out in [15], and it controls the size of the perturbations.
Note that, for simplicity of notation, we ignore the time subscripts of several parameters where no confusion is caused, such as $\Delta(k+1)$, $\nabla(k+1)$, $c(k+1)$, $d(k)$, and $a(k+1)$.
$$\hat{g}(\hat\theta(k)) = \frac{L\big(\hat\theta(k) + c(k+1)\Delta(k+1)\big) - L\big(\hat\theta(k) - c(k+1)\Delta(k+1)\big) + d(k)}{2c(k+1)q(k+1)}\,\nabla(k+1) \qquad (13)$$
3 Output layer training and stability analysis
In a multilayered neural network, the output layer weight vector $\hat{V}(k)$ and hidden layer weight vector $\hat{W}(k)$ are normally updated separately using different learning rates, as shown in Fig. 1, as in the standard BP training algorithm [9]. During the updating of the output layer weights, the hidden layer weights are fixed. We are now ready to propose the RRSPSA algorithm to update the estimated weight vector $\hat{V}(k)$ of the output layer. That is,

$$\hat{V}(k+1) = \hat{V}(k) - a_v(k+1)\hat{g}_v(\hat{V}(k)) \qquad (16)$$

where $\hat{g}_v(\hat{V}(k)) \in \mathbb{R}^{p_v}$ is the normalized gradient approximation that uses the simultaneous perturbation vectors $\Delta_v(k+1) \in \mathbb{R}^{p_v}$ and $\nabla_v(k+1) \in \mathbb{R}^{p_v}$ to stimulate the weights. $\Delta_v(k+1)$ and $\nabla_v(k+1)$ are, respectively, the first $p_v$ components of the perturbation vectors defined in (14) and (15).
$$\hat{g}_v(\hat{V}(k)) = -e^T(k)\hat{H}(k)\Delta_v\nabla_v (q_v)^{-1} - \beta_v(k+1)e^T(k)A(k)\nabla_v \big(2cq_v(k+1)\big)^{-1} \qquad (17)$$

where $A(k) \in \mathbb{R}^m$ is the extended recurrent gradient defined as

$$A(k) = H\big(\hat{W}(k), D_v(k)(\hat{V}(k) + c\Delta_v)\big)\hat{V}(k) - H\big(\hat{W}(k), D_v(k)(\hat{V}(k) - c\Delta_v)\big)\hat{V}(k) \qquad (18)$$

$$D_v(k) = \big[\hat{H}^T(k-1), \ldots, \hat{H}^T(k-l), 0, \ldots, 0\big]^T. \qquad (19)$$
For simplicity of notation, we ignore the time subscripts of $\Delta_v(k+1)$, $\nabla_v(k+1)$, $\beta_v(k+1)$, $q_v(k+1)$, and $a_v(k+1)$ in the remainder of the paper.
Remark 1 Our purpose is to find the desired output layer weight vector $V^*$ of the gradient equation

$$g(\hat{V}(k)) = \frac{\partial L(\hat{V}(k))}{\partial \hat{V}(k)} = 0.$$

Inspired by the excellent properties of SPSA, we obtain $\hat{g}(\hat{V}(k))$ by varying all elements of the unknown weight parameters simultaneously, as shown in the proof of Lemma 1, under the framework of a deterministic system. We will show that RRSPSA has the same form as SPSA, and only two objective function measurements are used at each iteration, which maintains the efficiency of SPSA. However, the weight convergence proof and stability analysis are quite different from those in [11] under the deterministic-system framework, as shown in Theorems 1 and 2. For RNNs, the outputs depend not only on the weights at the current time step, but also on those at previous time steps, since the current input vector contains the delayed outputs. The first term on the right side of (17) is the result of the perturbation of the feedforward signals in the RNN, and $A(k)$ in (18) is the result of the perturbation of the recurrent signals, which makes the training complicated and the system prone to instability. In addition to the simplicity and efficiency of RRSPSA, one powerful aspect of the developed algorithm is that it can use an effectively large adaptive learning rate $a_v$ to accelerate weight convergence with guaranteed system stability. To avoid weight divergence at each time step, an adaptive parameter $\beta_v$ is used in (17) to control the recurrent signals and switch off the recurrent connections whenever the convergence condition is violated, as shown in (38). The normalization factor $q_v$ is also used to bound the signals in the learning algorithm, as is traditionally done in adaptive control systems [5, 7]. $a_v$ guarantees weight convergence during training, which means the weights are only updated when the convergence and stability requirements are met, and it does not need to be small as in the RTRL training algorithm. The sufficient conditions for the three key parameters are shown in (34), (37), and (38). $D_v(k)$ is obtained during the mathematical deduction in the proof of Lemma 1.
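One step of the output-layer update (16)-(19) can be sketched as below. The helper names, the way past activations are cached to build $D_v(k)$, and the fixed scalars $q_v$, $\beta_v$, $a_v$ are our own simplifications for illustration; the paper's adaptive conditions for these parameters are given in (34)-(38).

```python
import numpy as np

def act(z, lam=1.0):
    """Sigmoid of Eq. (6)."""
    return 1.0 / (1.0 + np.exp(-4.0 * lam * z))

def HV(h, V_long, m):
    """H V for the block-diagonal H of Eq. (5); h holds h_1..h_nh."""
    n_h = h.size
    return np.array([h @ V_long[i * n_h:(i + 1) * n_h] for i in range(m)])

def gv_hat(e, h, past_h, W, V_long, delta_v, c, beta_v, q_v, m, lam=1.0):
    """Normalized gradient approximation of Eqs. (17)-(19) (sketch).

    e      : (m,) training error e(k)
    h      : (n_h,) current hidden activations (defines H(k))
    past_h : list of the l previous activation vectors, used to build
             D_v(k)(V +/- c Delta_v) as in Eq. (19)
    """
    nabla_v = 1.0 / delta_v
    n_ext = W.shape[1] - len(past_h) * m   # external-input slots are zero
    def recur_out(V_pert):
        # input built from D_v(k) V_pert: delayed outputs re-evaluated at V_pert
        x = np.concatenate([HV(hp, V_pert, m) for hp in past_h]
                           + [np.zeros(n_ext)])
        return HV(act(W @ x, lam), V_long, m)
    A = recur_out(V_long + c * delta_v) - recur_out(V_long - c * delta_v)  # Eq. (18)
    term1 = -(e @ HV(h, delta_v, m)) * nabla_v / q_v        # feedforward part
    term2 = -beta_v * (e @ A) * nabla_v / (2.0 * c * q_v)   # recurrent part
    return term1 + term2

# One update of Eq. (16) with illustrative numbers
m, l, n_h = 2, 2, 3
n_I = (l + 1) * m
rng = np.random.default_rng(4)
W = rng.standard_normal((n_h, n_I)) * 0.1
V = rng.standard_normal(m * n_h) * 0.1
h = act(W @ rng.standard_normal(n_I))
past_h = [act(W @ rng.standard_normal(n_I)) for _ in range(l)]
e = rng.standard_normal(m)
delta_v = rng.choice([-1.0, 1.0], size=m * n_h)
g = gv_hat(e, h, past_h, W, V, delta_v, c=0.01, beta_v=1.0, q_v=5.0, m=m)
V_new = V - 0.1 * g                       # Eq. (16) with a_v = 0.1
```

Setting `beta_v=0` switches off the recurrent term, leaving only the feedforward perturbation, which is exactly the role the hybrid parameter plays in the stability analysis.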
Lemma 1 The normalized gradient approximation in
(17) is an equivalent presentation of the standard SPSA
algorithm in equation (13).
Proof We will prove that (17) is obtained by perturbing all the weights simultaneously, as SPSA does. Rewrite the loss function defined in (10) as

$$L(\hat\theta(k)) = L(\hat{V}(k), \hat{W}(k)) = \frac{1}{2}\|y(k) - \hat{y}(k) + \varepsilon(k)\|^2. \qquad (20)$$

Further, we use $\hat{y}(k) = \hat{H}(k)\hat{V}(k)$ from (1) to define

$$\begin{aligned}
\hat{y}_{v+}(\hat{V}(k)) &= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}, x(k) + \Delta x(k+1)\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}, [\hat{y}^T(k-1), \ldots, \hat{y}^T(k-l), 0, \ldots, 0]^T + \Delta x(k+1)\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}, [\hat{V}^T(k-1)\hat{H}^T(k-1), \ldots, \hat{V}^T(k-l)\hat{H}^T(k-l), 0, \ldots, 0]^T + \Delta x(k+1)\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}, [(\hat{V}(k-1) + c\Delta_v)^T\hat{H}^T(k-1), \ldots, (\hat{V}(k-l) + c\Delta_v)^T\hat{H}^T(k-l), 0, \ldots, 0]^T\big)\hat{V}(k). \qquad (21)
\end{aligned}$$
$\hat{y}_{v+}(\hat{V}(k))$ has the same motivation as $g^+(\hat\theta(k))$ of SPSA, shown in (7), which is to perturb all the weights simultaneously. Different from FNN training, which perturbs only the current weights, the input vector must be perturbed as well for RNN training because of the time dependence in RNNs. The disturbance $\Delta x(k+1)$ is to
perturb the weights at previous time steps, which are contained in $x(k)$, as seen in (3). However, the last equation of (21) is difficult to implement. Here, we assume the model parameters do not change appreciably between iterations, that is, $\hat{V}(k) \approx \hat{V}(k-1) \approx \cdots \approx \hat{V}(k-l)$. We can then derive an approach similar to RTRL, introduced by Williams and Zipser [19, 20]. Moreover, such an approximation makes the proposed algorithm real-time and adaptive in a recursive fashion. Under this approximation,
$$\begin{aligned}
\hat{y}_{v+}(\hat{V}(k)) &\approx \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}(k), [(\hat{V}(k) + c\Delta_v)^T\hat{H}^T(k-1), \ldots, (\hat{V}(k) + c\Delta_v)^T\hat{H}^T(k-l), 0, \ldots, 0]^T\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}(k), [\hat{H}^T(k-1), \ldots, \hat{H}^T(k-l), 0, \ldots, 0]^T\big(\hat{V}(k) + c\Delta_v\big)\big)\hat{V}(k) \\
&= \hat{H}(k)\big(\hat{V}(k) + c\Delta_v\big) + H\big(\hat{W}(k), D_v(k)\big(\hat{V}(k) + c\Delta_v\big)\big)\hat{V}(k) \qquad (22)
\end{aligned}$$

where $D_v(k)$ is defined in (19). By the approximation $\hat{V}(k) \approx \hat{V}(k-1) \approx \cdots \approx \hat{V}(k-l)$, the recurrent terms $\hat{H}^T(k-1), \ldots, \hat{H}^T(k-l)$ of (19) can be obtained iteratively from the previous steps to speed up the calculation of RRSPSA (normally, the input $u(k)$ is independent of the weights, so the last entries of $D_v(k)$ are zero vectors). Weight convergence and system stability can be guaranteed by using the three adaptive parameters, which will be explained in detail later.
Similar to (21), we can get

$$\hat{y}_{v-}(\hat{V}(k)) = \hat{H}(k)\big(\hat{V}(k) - c\Delta_v\big) + H\big(\hat{W}(k), x(k) - \Delta x(k+1)\big)\hat{V}(k) \approx \hat{H}(k)\big(\hat{V}(k) - c\Delta_v\big) + H\big(\hat{W}(k), D_v(k)\big(\hat{V}(k) - c\Delta_v\big)\big)\hat{V}(k) \qquad (23)$$

so that

$$\hat{y}_{v+}(\hat{V}(k)) - \hat{y}_{v-}(\hat{V}(k)) \approx 2\hat{H}(k)c\Delta_v + A(k) \qquad (24)$$

where the extended recurrent gradient $A(k)$ is defined in (18). $A(k)$ in (24) is the result of the recurrent connections. To guarantee weight convergence and system stability during training, we insert an adaptive recurrent hybrid parameter $\beta_v(k+1)$ (see Fig. 1) to control the recurrent training signal. As shown in (38), the convergence condition is examined at each step, and the recurrent contribution is switched off whenever the condition is violated. Thus, (24) can be rewritten as

$$\hat{y}_{v+}(\hat{V}(k)) - \hat{y}_{v-}(\hat{V}(k)) = 2\hat{H}(k)c\Delta_v + \beta_v A(k). \qquad (25)$$

Note that we change the approximation in (24) to an equality in (25) because we use the exact formula on the right-hand side of the equation to update the weights during learning.
According to (13), we can get

$$\begin{aligned}
\hat{g}_v(\hat{V}(k)) &= \frac{L\big(\hat{W}(k), \hat{V}(k) + c\Delta_v\big) - L\big(\hat{W}(k), \hat{V}(k) - c\Delta_v\big) + d_v(k)}{2cq_v}\,\nabla_v \\
&= \frac{\big\|y(k) - \hat{y}_{v+}(\hat{V}(k))\big\|^2 - \big\|y(k) - \hat{y}_{v-}(\hat{V}(k))\big\|^2 + 2d_v(k)}{4cq_v}\,\nabla_v \\
&= \Big\{\big(y(k) - \hat{y}_{v+}(\hat{V}(k)) + y(k) - \hat{y}_{v-}(\hat{V}(k))\big)^T \big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big) + 2d_v(k)\Big\}\nabla_v (4cq_v)^{-1} \\
&= \big(y(k) - \hat{y}(k) + \varepsilon(k)\big)^T \big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big)\nabla_v (2cq_v)^{-1} \\
&= e^T(k)\big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big)\nabla_v (2cq_v)^{-1} \\
&= -e^T(k)\hat{H}(k)\Delta_v\nabla_v (q_v)^{-1} - \beta_v e^T(k)A(k)\nabla_v (2cq_v)^{-1}. \qquad (26)
\end{aligned}$$

$e(k)$ in the fifth equality of (26) comes from (11). The fourth equality of (26) reveals the relationship between the gradient approximation error $d_v(k)$ and the system disturbance $\varepsilon(k)$, that is,

$$d_v(k) = \big({-2}y(k) + \hat{y}_{v+}(\hat{V}(k)) + \hat{y}_{v-}(\hat{V}(k)) + \varepsilon(k)\big)^T\big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big) = e^T(k)\big(\hat{y}_{v-}(\hat{V}(k)) - \hat{y}_{v+}(\hat{V}(k))\big). \qquad (27)$$

In the fifth equality of (26), the gradient approximation appears exactly the same as the original one in [11] (see [11, eq. (2.2)]). Only two objective function measurements are used at each step, which maintains the efficiency of SPSA.
The training error $e(k)$ comes from the weight estimation error and the disturbance. For the stability analysis, $e(k)$ will be linked to the weight estimation error of the output layer and the disturbance. Define the weight estimation errors of the output layer and the hidden layer as

$$\tilde{V}(k) = V^* - \hat{V}(k) \qquad (28)$$

$$\tilde{W}(k) = W^* - \hat{W}(k) \qquad (29)$$

where $W^*$ and $V^*$ are the ideal (optimal) weight vectors of $\hat{W}(k)$ and $\hat{V}(k)$. According to (11), the training error can be rewritten as

$$e(k) = y(k) - \hat{y}(k) + \varepsilon(k) = H(W^*, x(k))V^* - \hat{H}(k)\hat{V}(k) + \varepsilon(k) = H(W^*, x(k))V^* - \hat{H}(k)V^* + \hat{H}(k)V^* - \hat{H}(k)\hat{V}(k) + \varepsilon(k). \qquad (30)$$

Because the term $H(W^*, x(k))V^* - \hat{H}(k)V^*$ is temporarily bounded during output layer training (the hidden layer weights are fixed), we define
$$\tilde{\varepsilon}_v(k) = H(W^*, x(k))V^* - \hat{H}(k)V^* + \varepsilon(k) = \tilde{H}\big(\hat{W}(k), W^*, x(k)\big)V^* + \varepsilon(k) \qquad (31)$$

with $\tilde{H}\big(\hat{W}(k), W^*, x(k)\big) = H(W^*, x(k)) - \hat{H}(k)$. The training error caused by the weight estimation error of the output layer is defined as

$$\phi_v(k) = -\hat{H}(k)\tilde{V}(k). \qquad (32)$$

Then, (30) can be rewritten as

$$e(k) = -\phi_v(k) + \tilde{\varepsilon}_v(k). \qquad (33)$$

Hence, the training error is directly linked to the weight estimation error of the output layer, contained in $\phi_v(k)$, and the disturbance $\tilde{\varepsilon}_v(k)$. This is critical for the convergence proof.
Theorem 1 If the output layer of the RNN is trained by the RRSPSA algorithm (16), $\hat{V}(k)$ is guaranteed to be convergent in the sense of the Lyapunov function

$$\big\|\tilde{V}(k+1)\big\|^2 - \big\|\tilde{V}(k)\big\|^2 \le 0, \quad \forall k, \quad \text{with } \tilde{V}(k) = V^* - \hat{V}(k)$$

and the training error will be $L_2$-stable in the sense of $\tilde{\varepsilon}_v(k) \in L_2$ and $\phi_v(k) \in L_2$.
Proof See the Appendix.
The conditions for the three adaptive parameters in (16) and (17) are as follows. The detailed procedure for obtaining these conditions is shown in the proof of Theorem 1.

1. $q_v$ is the normalization factor defined by

$$q_v = sq_v(k) + \max\left(\frac{\bar{a}_v Z(k)}{p_v} + 0.5\beta_v Y(k) + 1,\; \bar{q}\right) \qquad (34)$$

where $0 < s < 1$, $0 < \bar{a}_v \le 1$, and $\bar{q}$ are positive constants and

$$Y(k) = (\nabla_v)^T\big(\eta I + \hat{H}^T(k)\hat{H}(k)\big)^{-1}\frac{\hat{H}^T(k)A(k)}{c} \qquad (35)$$

where $\eta$ is a small perturbation parameter, and

$$Z(k) = \big\|2\hat{H}(k)c\Delta_v + \beta_v A(k)\big\|^2\|\nabla_v\|^2\big(4c^2\big)^{-1}. \qquad (36)$$

2. $a_v$ is an adaptive learning rate similar to the dead zone approach in classical neural network and adaptive control,

$$a_v = \begin{cases} \bar{a}_v, & \text{if } \|e(k)\| \ge \Lambda(k) \\ 0, & \text{if } \|e(k)\| < \Lambda(k) \end{cases} \qquad (37)$$

with $\Lambda(k) = \tilde{\varepsilon}^v_m(k)\Big/\sqrt{1 - \bar{a}_v (q_v)^{-1}\big(Z(k)/p_v + 0.5\beta_v Y(k)\big)}$ and $\tilde{\varepsilon}^v_m(k)$ being the maximum value of $\tilde{\varepsilon}_v(k)$ defined in (31).

3. $\beta_v$ is the recurrent hybrid adaptive parameter determined by

$$\beta_v = \begin{cases} 1, & \text{if } Y(k) \ge 0 \\ 0, & \text{if } Y(k) < 0. \end{cases} \qquad (38)$$
Remark 2 According to the theoretical analysis, the three parameters $\beta_v$, $a_v$, and $q_v$ play a key role in the proof of weight convergence and the stability analysis. They are designed to guarantee the negativeness of the Lyapunov function, as in (72). The function of $\beta_v$ is to automatically switch off the important recurrent training signal in the lower left corner of Fig. 1 when the convergence and stability requirements, which result from the recurrent training signals, are violated. The normalization factor $q_v$ is also used to bound the signals in the learning algorithm, as is traditionally done in adaptive control systems [5, 7]. $a_v$ guarantees weight convergence during training, which means that the weights are only updated when the convergence and stability requirements are met. Besides, $a_v$ does not need to be small as in the RTRL training algorithm, which makes training by RRSPSA faster than by RTRL. From the definition of $a_v$ in (37), it can be seen that the weights are updated ($a_v = \bar{a}_v$) only if the training error is bigger than the dead zone value; that is, learning stops if the training error falls below the dead zone level. This provides good generalization rather than learning the system noise. In this paper, we choose a suitable value for $\tilde{\varepsilon}^v_m(k)$ according to the noise level.
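The switching logic of (37) and (38) is simple to state in code. In the sketch below, $Y(k)$ and $Z(k)$ are assumed precomputed via (35)-(36), and all numeric values are placeholders of ours:

```python
import numpy as np

def adaptive_parameters(e, Y, Z, eps_m, a_bar, q_v, p_v):
    """Dead-zone learning rate a_v of Eq. (37) and hybrid switch beta_v
    of Eq. (38) (sketch; Y, Z come from Eqs. (35)-(36) in a real run)."""
    beta_v = 1.0 if Y >= 0.0 else 0.0    # Eq. (38): drop the recurrent signal
    # Eq. (37): dead-zone threshold built from the disturbance bound eps_m
    lam = eps_m / np.sqrt(1.0 - (a_bar / q_v) * (Z / p_v + 0.5 * beta_v * Y))
    a_v = a_bar if np.linalg.norm(e) >= lam else 0.0
    return a_v, beta_v

# Large error: the weights are updated; small error: learning pauses
a1, b1 = adaptive_parameters(np.array([2.0, 2.0]), Y=0.5, Z=1.0,
                             eps_m=0.1, a_bar=0.8, q_v=10.0, p_v=6)
a2, b2 = adaptive_parameters(np.array([0.01, 0.0]), Y=-0.5, Z=1.0,
                             eps_m=0.1, a_bar=0.8, q_v=10.0, p_v=6)
```

The dead zone keeps the update from chasing noise: once $\|e(k)\|$ falls below $\Lambda(k)$, the learning rate drops to zero and the weights freeze.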
Remark 3 The perturbation size $c(k+1)$ is a critical variable, as discussed in the original SPSA papers [7, 11, 14]. A natural question is whether $c(k+1)$ affects the learning procedure in the recurrent learning algorithm. The answer is yes. This is not obvious from the updating formula of $\hat{g}_v(\hat{V}(k))$ in (17). However, it can easily be seen by expanding the second term $A(k)$ in (17) to be explicit in $c(k)$.
According to the mean value theorem and the fact that the activation function in (6) is nondecreasing, it can readily be shown that there exist unique positive mean values $\mu_j(k)$ such that

$$h\big(\hat{W}_{j,:}(k)x_1(k)\big) - h\big(\hat{W}_{j,:}(k)x_2(k)\big) = \mu_j(k)\hat{W}_{j,:}(k)\big(x_1(k) - x_2(k)\big) \qquad (39)$$

where $\hat{W}_{j,:}(k)$ is the estimated weight vector linked to the jth hidden layer neuron. From (39) and the matrix in (5), $A(k)$ in (18) can be further expanded as

$$A(k) = 2cX(k) \qquad (40)$$

where $X(k) \in \mathbb{R}^m$ is defined as
$$X(k) = \begin{bmatrix} X_1(k), \ldots, X_{n_h}(k) & 0, \ldots, 0 & \cdots \\ 0, \ldots & \ddots & \vdots \\ 0, \ldots, 0 & \cdots & X_1(k), \ldots, X_{n_h}(k) \end{bmatrix}\hat{V}(k) \qquad (41)$$

and $X_j(k) = \mu_j(k)\hat{W}_{j,:}(k)D_v(k)\Delta_v$, $j = 1, \ldots, n_h$. By such an expansion of $A(k)$, $c(k)$ is explicit, as shown in (40). To reveal the effect of $c(k)$, we further expand (25) as

$$\hat{y}_{v+}(\hat{V}(k)) - \hat{y}_{v-}(\hat{V}(k)) = 2\hat{H}(k)c\Delta_v + 2\beta_v(k+1)cX(k). \qquad (42)$$
Using (25) and (27), we obtain the representation of the system disturbance in terms of the measurement noise $d_v(k)$ of the RRSPSA algorithm as

$$\varepsilon(k) = -\Big(\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)^T\Big)^{-1}\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)d_v(k)(2c)^{-1}. \qquad (43)$$

Furthermore, according to the equivalent disturbance (31), we have

$$\tilde{\varepsilon}_v(k) = \tilde{H}\big(\hat{W}(k), W^*, x(k)\big)V^* + \varepsilon(k) = \tilde{H}\big(\hat{W}(k), W^*, x(k)\big)V^* - \Big(\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)^T\Big)^{-1}\big(\hat{H}(k)\Delta_v + \beta_v X(k)\big)\frac{d_v(k)}{2c}. \qquad (44)$$

The equivalent disturbance $\tilde{\varepsilon}_v(k)$ is important, since its maximum value $\tilde{\varepsilon}^v_m(k)$ decides the range of the dead zone for $a_v$ in (37). Note also that the measurement noise $d_v(k)$ is defined in the same way as in [11]; it measures the difference between the true gradient and the gradient approximation, as derived in (26). Because the system disturbance $\varepsilon(k)$ is decided only by external conditions, a relatively small gain parameter $c$ implies a small measurement noise $d_v(k)$ according to (43). This allows us to choose the dead zone range mainly according to the external system disturbance $\varepsilon(k)$. Furthermore, as shown in (37), $a_v$ is nonzero whenever the norm of $e(k)$ is bigger than the dead zone value, which is related to $\tilde{\varepsilon}^v_m(k)$. If the external disturbance is also small, we can choose a small dead zone range, which ensures that the adaptive learning rate $a_v$ has a nonzero value for a prolonged period of training time.
4 Hidden layer training and stability analysis
The algorithm for updating the estimated weight vector $\hat{W}(k)$ of the hidden layer is
$$\hat{W}(k+1) = \hat{W}(k) - a_w(k+1)\hat{g}_w(\hat{W}(k)) \qquad (45)$$

where $\hat{g}_w(\hat{W}(k)) \in \mathbb{R}^{p_w}$ is the normalized gradient approximation that uses the simultaneous perturbation vectors $\Delta_w(k+1) \in \mathbb{R}^{p_w}$ and $\nabla_w(k+1) \in \mathbb{R}^{p_w}$ to stimulate the weights of the hidden layer. $\Delta_w(k+1)$ and $\nabla_w(k+1)$ are, respectively, the last $p_w$ components of the perturbation vectors defined in (14) and (15).

$$\hat{g}_w(\hat{W}(k)) = \frac{-e^T(k)\big(B_+(k) - B_-(k) + \beta_w(k+1)D_w(k)\big)}{2cq_w(k+1)}\,\nabla_w(k+1) \qquad (46)$$

where

$$B_+(k) = H\big(\hat{W}(k) + c\Delta_w(k+1), x(k)\big)\hat{V}(k) \in \mathbb{R}^m \qquad (47)$$

$$B_-(k) = H\big(\hat{W}(k) - c\Delta_w(k+1), x(k)\big)\hat{V}(k) \in \mathbb{R}^m \qquad (48)$$

$$D_w(k) = \Big(H\big(\hat{W}(k), [B_+^T(k-1), \ldots, B_+^T(k-l), 0, \ldots, 0]^T\big) - H\big(\hat{W}(k), [B_-^T(k-1), \ldots, B_-^T(k-l), 0, \ldots, 0]^T\big)\Big)\hat{V}(k) \in \mathbb{R}^m. \qquad (49)$$
For simplicity of notation, we ignore the time subscripts of $\Delta_w(k+1)$, $\nabla_w(k+1)$, $\beta_w(k+1)$, $q_w(k+1)$, and $a_w(k+1)$ in the remainder of the paper.
Remark 4 As for the output layer, our purpose is to find the desired hidden layer weight vector $W^*$ of the gradient equation

$$g(\hat{W}(k)) = \frac{\partial L(\hat{W}(k))}{\partial \hat{W}(k)} = 0.$$

The detailed mathematical deduction is given in the proof of Lemma 2. $B_+(k) - B_-(k)$ is the result of the perturbations of the feedforward signals, and $D_w(k)$ is the result of the perturbations of the recurrent signals in the RNN. The function of $\beta_w$ is to automatically switch off the recurrent contribution when the convergence and stability requirements are violated. The conditions for the key parameters appearing in (45) and (46) are shown in (64), (67), and (68).
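A sketch of the hidden-layer gradient (46)-(48) follows. Here $D_w(k)$ is passed in precomputed (per (49) it would be assembled from the stored $B_\pm(k-1), \ldots, B_\pm(k-l)$), and all helper names, sizes, and the fixed scalars are illustrative assumptions of ours.

```python
import numpy as np

def act(z, lam=1.0):
    """Sigmoid of Eq. (6)."""
    return 1.0 / (1.0 + np.exp(-4.0 * lam * z))

def HV(h, V_long, m):
    """H V for the block-diagonal H of Eq. (5)."""
    n_h = h.size
    return np.array([h @ V_long[i * n_h:(i + 1) * n_h] for i in range(m)])

def gw_hat(e, W, V_long, x, delta_w, c, beta_w, q_w, D_w, m, lam=1.0):
    """Normalized hidden-layer gradient of Eq. (46) (sketch).

    delta_w is the (p_w,) perturbation reshaped onto W; D_w is the
    recurrent term of Eq. (49), assumed precomputed from past B+/B-.
    """
    n_h = W.shape[0]
    Dw_mat = delta_w.reshape(n_h, -1)
    B_plus = HV(act((W + c * Dw_mat) @ x, lam), V_long, m)   # Eq. (47)
    B_minus = HV(act((W - c * Dw_mat) @ x, lam), V_long, m)  # Eq. (48)
    nabla_w = 1.0 / delta_w
    return -(e @ (B_plus - B_minus + beta_w * D_w)) * nabla_w / (2.0 * c * q_w)

# Illustrative sizes: m = 2, n_h = 3, n_I = 6, so p_w = 18
m, n_h, n_I = 2, 3, 6
rng = np.random.default_rng(5)
W = rng.standard_normal((n_h, n_I)) * 0.1
V = rng.standard_normal(m * n_h) * 0.1
x = rng.standard_normal(n_I)
e = rng.standard_normal(m)
delta_w = rng.choice([-1.0, 1.0], size=n_h * n_I)
g_w = gw_hat(e, W, V, x, delta_w, c=0.01, beta_w=1.0, q_w=5.0,
             D_w=np.zeros(m), m=m)
W_new = W.reshape(-1) - 0.1 * g_w     # Eq. (45) with a_w = 0.1
```

As with the output layer, only two perturbed evaluations are needed per step, and setting `beta_w=0` removes the recurrent contribution entirely.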
Lemma 2 The normalized gradient approximation in
(46) is an equivalent presentation of the standard SPSA
algorithm in equation (13).
Proof Similar to the definitions of $\hat{y}_{v+}(\hat{V}(k))$ and $\hat{y}_{v-}(\hat{V}(k))$ in (21) and (23), we define two perturbation functions $\hat{y}_{w+}(\hat{W}(k))$ and $\hat{y}_{w-}(\hat{W}(k))$, that is,

$$\hat{y}_{w+}(\hat{W}(k)) = H\big(\hat{W}(k) + c\Delta_w, x(k)\big)\hat{V}(k) + H\big(\hat{W}(k), x(k) + \Delta x(k+1)\big)\hat{V}(k). \qquad (50)$$

From the definition of $B_+(k)$ in (47),
$$\begin{aligned}
\hat{y}_{w+}(\hat{W}(k)) &= H\big(\hat{W}(k) + c\Delta_w, x(k)\big)\hat{V}(k) + H\big(\hat{W}(k), x(k) + \Delta x(k+1)\big)\hat{V}(k) \\
&= B_+(k) + H\big(\hat{W}(k), [\hat{y}^T(k-1), \ldots, \hat{y}^T(k-l), 0, \ldots, 0]^T + \Delta x(k+1)\big)\hat{V}(k) \\
&= B_+(k) + H\big(\hat{W}(k), [\hat{V}^T(k-1)H^T(\hat{W}(k-1) + c\Delta_w, x(k-1)), \ldots, \hat{V}^T(k-l)H^T(\hat{W}(k-l) + c\Delta_w, x(k-l)), 0, \ldots, 0]^T\big)\hat{V}(k) \\
&\approx B_+(k) + H\big(\hat{W}(k), [B_+^T(k-1), \ldots, B_+^T(k-l), 0, \ldots, 0]^T\big)\hat{V}(k). \qquad (51)
\end{aligned}$$

From the definition of $B_-(k)$ in (48), and similar to (51), we get

$$\hat{y}_{w-}(\hat{W}(k)) \approx B_-(k) + H\big(\hat{W}(k), [B_-^T(k-1), \ldots, B_-^T(k-l), 0, \ldots, 0]^T\big)\hat{V}(k). \qquad (52)$$
Then, we have

$$\hat{y}_{w+}(\hat{W}(k)) - \hat{y}_{w-}(\hat{W}(k)) \approx B_+(k) - B_-(k) + D_w(k). \qquad (53)$$

$D_w(k)$ in (53) is the result of the recurrent training signal, as shown in Fig. 1. To guarantee weight convergence and system stability during training, we add an adaptive recurrent hybrid parameter $\beta_w(k+1)$ to control the recurrent learning signal path. That is,

$$\hat{y}_{w+}(\hat{W}(k)) - \hat{y}_{w-}(\hat{W}(k)) = B_+(k) - B_-(k) + \beta_w D_w(k). \qquad (54)$$

Note that we change the approximation in (53) to an equality because we use the exact formula on the right-hand side of the equation to update the weights during learning.
According to (13), we can get

$$\begin{aligned}
g_w(W(k)) &= \frac{L\big(W(k)+c\Delta_w,\,V(k)\big) - L\big(W(k)-c\Delta_w,\,V(k)\big) + d_w(k)}{2c\,q_w}\,r_w \\
&= \frac{\|y(k)-y_{w+}(k)\|^2 - \|y(k)-y_{w-}(k)\|^2 + 2d_w(k)}{4c\,q_w}\,r_w \\
&= \frac{\big(2y(k)-y_{w+}(k)-y_{w-}(k)\big)^T\big(y_{w-}(k)-y_{w+}(k)\big) + 2d_w(k)}{4c\,q_w}\,r_w \\
&= \big(2y(k)-y_{w+}(k)-y_{w-}(k) + y_{w+}(k)+y_{w-}(k)-2\hat y(k)+2\varepsilon(k)\big)^T \\
&\qquad \times \big(y_{w-}(k)-y_{w+}(k)\big)\,r_w\,(4c\,q_w)^{-1} \\
&= \big(y(k)-\hat y(k)+\varepsilon(k)\big)^T\big(y_{w-}(k)-y_{w+}(k)\big)\,r_w\,(2c\,q_w)^{-1} \\
&= e^T(k)\big(y_{w-}(k)-y_{w+}(k)\big)\,r_w\,(2c\,q_w)^{-1} \\
&= -e^T(k)\big(B^+(k)-B^-(k)+b_w\Delta_w(k)\big)\,r_w\,(2c\,q_w)^{-1}
\end{aligned} \tag{55}$$

with $\hat y(k)$ the network output and $\varepsilon(k)$ the measurement noise, as in (60),
where the equivalent disturbance $d_w(k)$ is defined as

$$d_w(k) = \big(y_{w-}(W(k)) - y_{w+}(W(k))\big)^T\varepsilon(k) + \frac{1}{2}\big(y_{w-}(W(k)) - y_{w+}(W(k))\big)^T\big(y_{w+}(W(k)) + y_{w-}(W(k)) - 2\hat y(k)\big). \tag{56}$$

The fifth equality in (55) reveals the relationship between $d_w(k)$ and $\varepsilon(k)$ as in (56).
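The third equality in (55) rests on the elementary vector identity $\|y-a\|^2 - \|y-b\|^2 = (2y-a-b)^T(b-a)$. A quick numerical check, with arbitrary vectors of our own choosing standing in for $y_{w+}(k)$ and $y_{w-}(k)$:

```python
# Numerical check of the identity used in the third equality of (55):
# ||y - a||^2 - ||y - b||^2 = (2y - a - b)^T (b - a)
y = [0.5, -1.0, 2.0]
a = [1.0, 0.25, -0.5]   # plays the role of y_{w+}(k)
b = [-0.75, 2.0, 1.5]   # plays the role of y_{w-}(k)

sq = lambda u, v: sum((ui - vi) ** 2 for ui, vi in zip(u, v))
lhs = sq(y, a) - sq(y, b)
rhs = sum((2 * yi - ai - bi) * (bi - ai) for yi, ai, bi in zip(y, a, b))
assert abs(lhs - rhs) < 1e-12
```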
For stability analysis, the training error will be linked to the parameter estimation error of the hidden layer and the disturbance. According to the mean value theorem and the fact that the activation function is nondecreasing, it can be readily shown that there exist unique positive mean values $\tilde\mu_j(k)$ such that

$$h\big(W^*_{j,:}x(k)\big) - h\big(W_{j,:}(k)x(k)\big) = \tilde\mu_j(k)\big(W^*_{j,:} - W_{j,:}(k)\big)x(k) = \tilde\mu_j(k)\tilde W_{j,:}(k)x(k) \tag{57}$$
where $W_{j,:}(k)$, $W^*_{j,:}$, and $\tilde W_{j,:}(k)$ are the estimated weight, ideal weight, and weight estimation error vectors associated with the $j$th hidden-layer neuron, respectively. From (57) and the matrix in (5), it can be shown that

$$\tilde H\big(W(k),W^*,x(k)\big)V(k) = \big(H(W^*,x(k)) - H(k)\big)V(k) = \tilde X(k)\tilde W(k) \tag{58}$$
where $\tilde X(k) \in \mathbb{R}^{m\times p_w}$ is defined as

$$\tilde X(k) = \begin{bmatrix}
\tilde\mu_1(k)V_1(k)x^T(k) & \cdots & \tilde\mu_{n_h}(k)V_{n_h}(k)x^T(k) \\
\tilde\mu_1(k)V_{n_h+1}(k)x^T(k) & \cdots & \tilde\mu_{n_h}(k)V_{2n_h}(k)x^T(k) \\
\vdots & & \vdots \\
\tilde\mu_1(k)V_{(m-1)n_h+1}(k)x^T(k) & \cdots & \tilde\mu_{n_h}(k)V_{p_v}(k)x^T(k)
\end{bmatrix}
= \big[\tilde\mu_1(k)V^T_{1,:}(k)x^T(k)\;\cdots\;\tilde\mu_{n_h}(k)V^T_{n_h,:}(k)x^T(k)\big] \tag{59}$$
and $V^T_{j,:}(k)$ is the transpose of the $j$th row of the estimated output-layer weight matrix $V(k)$. From (6), the maximum value of the derivative of $h_j(k)$, $\dot h_j(k) = \partial h\big(W_{j,:}(k)x(k)\big)\big/\partial\big(W_{j,:}(k)x(k)\big)$, is $\kappa$, and hence $0 \le \tilde\mu_j(k) \le \kappa$ $(1 \le j \le n_h)$. We can rewrite the estimation error as
$$\begin{aligned}
e(k) &= y(k) - \hat y(k) + \varepsilon(k) \\
&= H\big(W^*,x(k)\big)V^* - H\big(W^*,x(k)\big)V(k) + H\big(W^*,x(k)\big)V(k) - H(k)V(k) + \varepsilon(k) \\
&= \tilde H\big(W(k),W^*,x(k)\big)V(k) + H\big(W^*,x(k)\big)\tilde V(k) + \varepsilon(k) \\
&= -\phi_w(k) + \tilde e_w(k)
\end{aligned} \tag{60}$$
where

$$\phi_w(k) = -\tilde H\big(W(k),W^*,x(k)\big)V(k) = -\tilde X(k)\tilde W(k) \tag{61}$$
and the bounded equivalent disturbance $\tilde e_w(k)$ of the hidden layer is defined as

$$\tilde e_w(k) = H\big(W^*,x(k)\big)\tilde V(k) + \varepsilon(k). \tag{62}$$

Note that $\tilde e_w(k)$ is a bounded signal because $\tilde V(k)$ has been proven to be bounded in Sect. 3.
According to the definitions of $B^+(k)$ and $B^-(k)$ in (47) and (48), we can get

$$\big(B^+(k) - B^-(k)\big)(2c)^{-1} = X(k)\Delta_w \tag{63}$$

where $X(k) \in \mathbb{R}^{m\times p_w}$ is similar to $\tilde X(k)$ in (59) except with different mean values.
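The mean-value bound $0 \le \tilde\mu_j(k) \le \kappa$ that underlies (57)–(59) can be checked numerically for a concrete activation. The choice of a logistic activation here is ours for illustration; its derivative is bounded by $\kappa = 0.25$:

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

# Slope of the chord between two pre-activations u and v:
# mu = (h(u) - h(v)) / (u - v).  The mean value theorem bounds this
# by the maximum derivative kappa = 0.25 of the logistic function.
kappa = 0.25
for u, v in [(-3.0, 2.0), (0.1, 0.2), (5.0, -5.0), (1.0, 1.0001)]:
    mu = (logistic(u) - logistic(v)) / (u - v)
    assert 0.0 <= mu <= kappa + 1e-12
```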
Theorem 2 If the hidden layer of the RNN is trained by the adaptive RRSPSA algorithm (45), the weight $W(k)$ is guaranteed to be convergent in the sense of the Lyapunov function

$$\big\|\tilde W(k+1)\big\|^2 - \big\|\tilde W(k)\big\|^2 \le 0, \quad \forall k,$$

with $\tilde W(k) = W^* - W(k)$, and the training error will be $L_2$-stable in the sense that $\tilde e_w(k) \in L_2$ and $\phi_w(k) \in L_2$.
Proof See Sect. 8.
The three adaptive learning rates appearing in (45) and (46) are defined as follows.
1. The normalization factor is defined as

$$q_w = s_{q_w}(k) + \max\Big\{\big(1 + a_w M(k)\|r_w\|^2\big)\Omega^{-1}(k),\; q\Big\} \tag{64}$$

where

$$M(k) = \bigg\|\frac{B^+(k) - B^-(k) + b_w\Delta_w(k)}{2c}\bigg\|^2$$

$$\Omega(k) = \frac{\kappa_{\min}}{\kappa p_w} + \frac{b_w}{2\kappa c}\,(r_w)^T\big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k)$$

with $0 < a_w \le 1$ a positive constant, $\kappa$ the maximum value of the derivative $\dot h_j(k)$ of the activation function in (6), and $\kappa_{\min} \ne 0$ the minimum nonzero value of the mean value $\tilde\mu_j(k)$, that is,

$$0 \le \tilde\mu_j(k) \le \kappa \quad (1 \le j \le n_h) \tag{65}$$

and

$$X(k) = \begin{bmatrix}
V_1(k)x^T(k) & \cdots & V_{n_h}(k)x^T(k) \\
V_{n_h+1}(k)x^T(k) & \cdots & V_{2n_h}(k)x^T(k) \\
\vdots & & \vdots \\
V_{(m-1)n_h+1}(k)x^T(k) & \cdots & V_{p_v}(k)x^T(k)
\end{bmatrix} \tag{66}$$

where $V_j(k)$ is the $j$th element of $V(k)$, $1 \le j \le p_v$.

2. The adaptive dead-zone learning rate of the hidden layer is defined as

$$a_w = \begin{cases} a_w, & \text{if } \|e(k)\| \ge \Gamma(k) \\ 0, & \text{if } \|e(k)\| < \Gamma(k) \end{cases} \tag{67}$$

with $\Gamma(k) = \tilde e_{wm}(k)\Big/\sqrt{1 - a_w\|r_w\|^2 M(k)\big(q_w\Omega(k)\big)^{-1}}$ and $\tilde e_{wm}(k)$ being the maximum value of $\tilde e_w(k)$ defined in (62).
3. The recurrent hybrid adaptive parameter of the hidden layer, $b_w$, is defined as

$$b_w = \begin{cases} 1, & \text{if } (r_w)^T v(k) \ge 0 \\ 0, & \text{if } (r_w)^T v(k) < 0 \end{cases} \tag{68}$$

with $v(k) = \big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k)$. $b_w$ can automatically cut off the recurrent training signal, as in the lower right corner of Fig. 1, whenever the weight convergence and stability conditions of the RNN are violated, as shown in Theorem 2 of Sect. 3.
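The switching rules (67) and (68) amount to two scalar tests per step. The sketch below is ours (function names, the argument `v` standing for $v(k)$, and the sample values are illustrative, not taken from the paper):

```python
def dead_zone_rate(err_norm, gamma, a_w):
    """Adaptive dead-zone learning rate (67): train only when the
    output error norm exceeds the disturbance-dependent threshold."""
    return a_w if err_norm >= gamma else 0.0

def hybrid_parameter(r_w, v):
    """Recurrent hybrid parameter (68): keep the recurrent training
    signal only while (r_w)^T v(k) is nonnegative."""
    inner = sum(ri * vi for ri, vi in zip(r_w, v))
    return 1.0 if inner >= 0.0 else 0.0

# Example: error below the threshold -> learning is switched off.
assert dead_zone_rate(0.1, gamma=0.5, a_w=0.9) == 0.0
assert dead_zone_rate(0.8, gamma=0.5, a_w=0.9) == 0.9
assert hybrid_parameter([1.0, -2.0], [0.5, -1.0]) == 1.0
assert hybrid_parameter([1.0, 0.0], [-0.5, 3.0]) == 0.0
```

The design choice mirrors the paper's argument: the dead zone disables adaptation when the error could be dominated by the bounded disturbance, and the hybrid parameter removes the recurrent contribution whenever it would point against the stability condition.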
Remark 5 Summary of the RRSPSA algorithm for RNNs.
1. Set a maximum training step as the stop criterion;
2. Initialize the weights for the RNN;
3. Form new input x(k) as defined in Eq. (3);
4. Perform the forward computation and calculate the output $\hat y(k)$ and error $e(k)$ with the input and the current estimated weights;
5. Calculate qv, av, and bv using (34), (37), and (38),
respectively;
6. Update the output layer weights Vðk þ 1Þ via the
learning law (16) for the next iteration;
7. Calculate qw, aw, and bw using (64), (67), and (68),
respectively;
8. Update the hidden layer weights Wðk þ 1Þ via the
learning law (45) for the next iteration;
9. Go back to step 3 to continue the iteration until the
stop criterion is reached.
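Steps 1–9 can be organized as a simple training loop. The sketch below keeps only the SPSA-style control flow (two loss evaluations per layer update) and replaces the paper's adaptive parameters (34), (37), (38), (64), (67), (68) with fixed placeholder values, so it illustrates the structure rather than the full RRSPSA algorithm:

```python
import random

def two_sided_update(loss, w, c, a):
    """One SPSA-style update: perturb all weights simultaneously,
    measure the loss twice, and move against the gradient estimate."""
    delta = [random.choice((-1.0, 1.0)) for _ in w]
    diff = (loss([wi + c * di for wi, di in zip(w, delta)])
            - loss([wi - c * di for wi, di in zip(w, delta)]))
    # g_i = diff / (2 c delta_i), and 1/delta_i = delta_i for +-1 entries
    return [wi - a * diff / (2.0 * c) * di for wi, di in zip(w, delta)]

def train(loss_v, loss_w, v, w, c=0.01, a=0.05, max_steps=200):
    for _ in range(max_steps):                 # step 1: stop criterion
        v = two_sided_update(loss_v, v, c, a)  # steps 4-6: output layer
        w = two_sided_update(loss_w, w, c, a)  # steps 7-8: hidden layer
    return v, w

# Toy usage: drive two quadratic losses toward their minima.
random.seed(1)
v, w = train(lambda v: sum(x * x for x in v),
             lambda w: sum((x - 1.0) ** 2 for x in w),
             [2.0, -1.5], [0.0, 3.0])
```

Even with fixed rates, the two-measurement structure is visible: the cost of one update is independent of the number of weights, which is the property RRSPSA inherits from standard SPSA.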
5 Examples
5.1 Time series prediction
Sunspots are dark blotches on the sun caused by magnetic storms on its surface. The underlying mechanism for sunspot appearances is not exactly known. The sunspot data are a classical example of a combination of periodic and chaotic phenomena that has served as a benchmark in the time-series statistics literature, and much work has been done on analyzing the sunspot data using neural networks [21, 22]. One set of high-quality American sunspot numbers, the monthly sunspot numbers from December 1949 to October 2009 in the database, is used as an example to minimize unnecessary human measurement errors. The monthly database contains 778 data points in total. The output sequence is scaled by dividing by 100 to bring it into a range suitable for the network. The number of hidden neurons of the RNN is ten. The selected learning rates for BP and RTRL are 0.09 and 0.01, respectively. The perturbation step size for
RRSPSA is c = 0.01. Figure 2 shows outputs of the
plant y*(k) and estimated outputs y(k) of the RNN
trained by the three different methods. Table 1 gives
comparison results of the three methods in terms of MSE
and STD.
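The MSE and STD figures reported in the tables are the usual sample statistics of the training error sequence. For reference, one common convention (whether the paper normalizes by $n$ or $n-1$ is not stated, so this is an assumption, as is the made-up error sequence):

```python
import math

def error_stats(errors):
    """Mean squared error and (population) standard deviation
    of a training-error sequence."""
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mean = sum(errors) / n
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    return mse, std

mse, std = error_stats([0.1, -0.2, 0.05, 0.0])
```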
5.2 Michaelis–Menten equation
Consider the continuous-time Michaelis–Menten equation [23]

$$\dot y(t) = \frac{0.45\,y(t)}{2.27 + y(t)} - y(t) + u(t). \tag{69}$$
Fig. 2 Output $y(k)$, reference signal $y^*(k)$, and error $e(k)$ using the BP, RTRL, and RRSPSA training algorithms for time series prediction

Given a unit impulse as input $u(t)$, a fourth-order Runge–Kutta algorithm is used to simulate this model with integration step size $\Delta t = 0.01$, and 1,000 equi-spaced samples are
obtained from input and output with a sampling interval of
T = 0.02 time units. The measured noise is white Gaussian
noise with mean 0 and variance 0.05. An RNN is utilized to emulate the dynamics of the given system and approximate the system output in (69). The gradient function of this model has many local minima, and the global minimum has a very narrow domain of attraction [23]. We compare the performance of RRSPSA to that of both the BP and RTRL training algorithms. The number of hidden neurons of the RNNs in this example is selected as ten. The fixed learning rates for
BP and RTRL are 0.03 and 0.01, respectively. The perturbation step size $c = 0.01$ is selected for RRSPSA. Figure 3 shows the outputs of the plant $y^*(k)$ and the estimated outputs $y(k)$ of the RNN trained by the three different methods. Table 2 shows the comparison results of the three methods in terms of MSE and STD.

Table 1 Comparison of mean squared error and standard deviation of the training error

              RRSPSA     RTRL       BP
  MSE         6.01e-4    1.30e-3    4.30e-3
  STD         2.45e-2    3.61e-2    6.57e-2

Fig. 3 Output $y(k)$, reference signal $y^*(k)$, and error $e(k)$ using the BP, RTRL, and RRSPSA training algorithms for simulation of the Michaelis–Menten equation
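The simulated plant in (69) can be reproduced with a textbook fourth-order Runge–Kutta integrator at the stated step size $\Delta t = 0.01$. The one-step pulse used to realize the unit impulse below is our own choice, since the paper does not spell out its discretization of the impulse:

```python
def f(y, u):
    # Michaelis-Menten dynamics (69): dy/dt = 0.45 y/(2.27 + y) - y + u
    return 0.45 * y / (2.27 + y) - y + u

def rk4_step(y, u, dt):
    # Classical fourth-order Runge-Kutta step with the input held over dt
    k1 = f(y, u)
    k2 = f(y + 0.5 * dt * k1, u)
    k3 = f(y + 0.5 * dt * k2, u)
    k4 = f(y + dt * k3, u)
    return y + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

dt = 0.01
y, ys = 0.0, []
for i in range(2000):                  # simulate 20 time units
    u = 1.0 / dt if i == 0 else 0.0    # one-step unit-impulse realization
    y = rk4_step(y, u, dt)
    ys.append(y)
```

The impulse carries the state to roughly 1 in a single step, after which the output decays monotonically, since $0.45y/(2.27+y) - y < 0$ for all $y > 0$.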
Remark 6 Among the simulation results presented in this paper, the BP training algorithm performs worst compared with the other methods. The reason is that second partial derivatives, similar to those in (17), are not considered in BP training. For RTRL, the slower convergence compared with RRSPSA is mainly caused by the small fixed learning rate and the absence of the recurrent hybrid adaptive parameter for guaranteed weight convergence in the presence of noise. RRSPSA is able to use a relatively larger effective learning rate to accelerate the system response with guaranteed weight convergence.
6 Conclusion
We proposed a robust recurrent simultaneous perturbation stochastic approximation (RRSPSA) algorithm under the framework of a deterministic system with guaranteed weight convergence. Compared with FNNs, RNNs are dynamic systems that contain recurrent connections in their structure. Because of the time dependence of the signals in RNNs, not only the weights at the current time step must be perturbed, but also those at previous time steps, which makes the learning prone to divergence and the convergence analysis difficult to obtain. The key characteristic of the RRSPSA training algorithm is the use of a recurrent hybrid adaptive parameter together with an adaptive learning rate and a dead-zone concept. The recurrent hybrid adaptive parameter can automatically switch off the recurrent training signal whenever the weight convergence and stability conditions are violated (see Fig. 1 and the explanations in Sects. 2 and 3). There are several advantages to training RNNs with the RRSPSA algorithm. First, RRSPSA is relatively easy to implement compared with other RNN training methods such as BPTT, owing to its RTRL type of training and its simplicity. Second, RRSPSA provides an excellent convergence property based on simultaneous perturbation of the weights, similar to the standard SPSA. Finally, RRSPSA is capable of improving RNN training speed with guaranteed weight convergence over the standard RTRL algorithm by using an adaptive learning method. RRSPSA is a powerful twin-engine SPSA type of RNN training algorithm: it utilizes three specially designed adaptive parameters to maximize the training speed for the recurrent training signal while exhibiting certain weight convergence properties, with only two objective function measurements as in the standard SPSA algorithm. The combination of the three adaptive parameters makes RRSPSA converge quickly with the maximum effective learning rate and guaranteed weight convergence. Robust stability and weight convergence proofs of the RRSPSA algorithm are provided based on a Lyapunov function.
7 Proof of Theorem 1
According to the training algorithm in (16), we can get

$$\tilde V(k+1) = \tilde V(k) + a_v\,g_v(V(k)) \tag{70}$$

where $g_v(V(k))$ is defined in (26). Then we have
$$\begin{aligned}
\big\|\tilde V(k+1)\big\|^2 - \big\|\tilde V(k)\big\|^2
&= 2a_v\,g_v(V(k))^T\tilde V(k) + \big\|a_v\,g_v(V(k))\big\|^2 \\
&= -\frac{a_v}{c\,q_v}\,e^T(k)\big(2H(k)c\Delta_v + b_v A(k)\big)(r_v)^T\tilde V(k) + \big\|a_v\,g_v(V(k))\big\|^2 \\
&= -\frac{a_v}{c\,q_v}\Big\{2c\,e^T(k)H(k)\Delta_v(r_v)^T\tilde V(k) + b_v\,e^T(k)A(k)(r_v)^T\tilde V(k)\Big\} + \big\|a_v\,g_v(V(k))\big\|^2 \\
&\le \frac{2a_v p_v}{q_v}\,e^T(k)\phi_v(k) + \frac{a_v b_v}{c\,q_v}\,e^T(k)\phi_v(k)\,(r_v)^T\big(\eta I + H^T(k)H(k)\big)^{-1}H^T(k)A(k) \\
&\qquad + \big\|a_v\,g_v(V(k))\big\|^2
\end{aligned} \tag{71}$$

Note that $H(k)\tilde V(k)$ in the fourth equality is replaced with $\phi_v(k)$ as defined in (32). A small perturbation parameter $\eta$ is added to approximate $[H^T(k)H(k)]^{-1}$ and to guarantee full rank of the inverse matrix $[\eta I + H^T(k)H(k)]^{-1}$, similar to a general nonlinear iteration algorithm [9].
Table 2 Comparison of the mean squared error and standard deviation of the error

              RRSPSA     RTRL       BP
  MSE         4.2e-3     1.64e-2    2.98e-2
  STD         6.46e-2    1.28e-1    1.72e-1
$$\begin{aligned}
\big\|\tilde V(k+1)\big\|^2 - \big\|\tilde V(k)\big\|^2
&\le \frac{2a_v p_v}{q_v}\,e^T(k)\big(\tilde e_v(k) - e(k)\big) \\
&\qquad + \frac{a_v b_v}{c\,q_v}\,e^T(k)\big(\tilde e_v(k) - e(k)\big)(r_v)^T\big(\eta I + H^T(k)H(k)\big)^{-1}H^T(k)A(k) \\
&\qquad + \|e(k)\|^2\,\frac{a_v^2\big\|2H(k)c\Delta_v + b_v A(k)\big\|^2\|r_v\|^2}{(2c\,q_v)^2} \\
&\le \frac{a_v p_v}{q_v}\Big\{\|\tilde e_v(k)\|^2 - \|e(k)\|^2\Big\} \\
&\qquad + \frac{a_v b_v}{2c\,q_v}\Big\{\|\tilde e_v(k)\|^2 - \|e(k)\|^2\Big\}(r_v)^T\big(\eta I + H^T(k)H(k)\big)^{-1}H^T(k)A(k) \\
&\qquad + \|e(k)\|^2\,\frac{a_v^2\big\|2H(k)c\Delta_v + b_v A(k)\big\|^2\|r_v\|^2}{(2c\,q_v)^2} \\
&= \frac{a_v}{q_v}\big\{p_v + 0.5\,b_v Y(k)\big\}\Bigg\{\|\tilde e_v(k)\|^2 - \bigg(1 - \frac{a_v\,q_v^{-1}Z(k)}{p_v + 0.5\,b_v Y(k)}\bigg)\|e(k)\|^2\Bigg\}
\end{aligned} \tag{72}$$

where $Y(k)$ and $Z(k)$ are defined in (35) and (36), respectively. Note that the second inequality in (72) follows from the fact that $2e^T(k)\tilde e_v(k) \le \|\tilde e_v(k)\|^2 + \|e(k)\|^2$. Here $p_v$ is the dimension of the output-layer weight vector, which is positive, and $a_v$ is a fixed positive scalar. Moreover, $b_v Y(k) \ge 0$ by the definition of the recurrent adaptive learning rate $b_v(k+1)$ in (38), and $q_v > a_v Z(k)\big/\big(p_v + 0.5\,b_v Y(k)\big)$ by the definition of the normalization factor in (34). The adaptive learning rate $a_v(k+1)$ in (37) implies that the recurrent training of RRSPSA uses $a_v$ only if the output error of the RNN is larger than a predefined value; otherwise it is zero, which guarantees that $\|\tilde V(k+1)\|^2 - \|\tilde V(k)\|^2$ is non-positive and non-increasing. The limit of $\|\tilde V(k+1)\|^2 - \|\tilde V(k)\|^2$ therefore exists, and $\|\tilde e_v(k)\|^2 - \big(1 - a_v q_v^{-1}Z(k)/(p_v + 0.5\,b_v Y(k))\big)\|e(k)\|^2$ goes to zero. Then, we finally have

$$\big\|\tilde V(k+1)\big\|^2 - \big\|\tilde V(k)\big\|^2 \le 0.$$
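In simulation studies, where an ideal weight $V^*$ is available, the monotonicity established by Theorem 1 can be monitored directly during training. A sketch (function names and the toy trajectory are ours):

```python
def lyapunov_differences(weight_history, v_star):
    """Successive differences ||V~(k+1)||^2 - ||V~(k)||^2 with
    V~(k) = V* - V(k); Theorem 1 asserts these are non-positive."""
    def sq_err(v):
        return sum((s - vi) ** 2 for s, vi in zip(v_star, v))
    errs = [sq_err(v) for v in weight_history]
    return [b - a for a, b in zip(errs, errs[1:])]

# A toy weight trajectory approaching v_star = [1, -1]:
diffs = lyapunov_differences([[0.0, 0.0], [0.5, -0.5], [0.9, -0.9]],
                             [1.0, -1.0])
assert all(d <= 0 for d in diffs)
```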
8 Proof of Theorem 2
Considering that SPSA is an approximation of the gradient algorithm [11], and using the property of local minimum points of the gradient, $-\big(\partial(e^T(k)e(k))/\partial W(k)\big)^T\tilde W(k) \ge 0$ [5], we have

$$\begin{aligned}
0 &\le -\frac{\partial\big(e^T(k)e(k)\big)}{\partial W(k)}\,\tilde W(k) \\
&= \sum_{i=1}^{n_h}\Big\{\dot h_i(k)\,e^T(k)V^T_{i,:}(k)x^T(k)\tilde W_{i,:}(k)\Big\} \\
&= \sum_{i=1}^{n_h}\frac{\dot h_i(k)}{\tilde\mu_i(k)}\Big\{\tilde\mu_i(k)\,e^T(k)V^T_{i,:}(k)x^T(k)\tilde W_{i,:}(k)\Big\} \\
&\le \frac{\kappa}{\kappa_{\min}}\sum_{i=1}^{n_h}\Big\{\tilde\mu_i(k)\,e^T(k)V^T_{i,:}(k)x^T(k)\tilde W_{i,:}(k)\Big\} \\
&= \frac{\kappa}{\kappa_{\min}}\,e^T(k)\tilde X(k)\tilde W(k) = -\frac{\kappa}{\kappa_{\min}}\,e^T(k)\phi_w(k)
\end{aligned} \tag{73}$$

where $\kappa$ is the maximum value of the derivative $\dot h_j(k)$ of the activation function in (6), and $\kappa_{\min} \ne 0$ is the minimum nonzero value of the mean value $\tilde\mu_j(k)$ defined in (65). (Note that the inequality is always true if $\kappa_{\min} = 0$, which implies $\dot h_j(k) = \tilde\mu_j(k) = 0$.) According to the training algorithm for the hidden layer in (45), we get
$$\tilde W(k+1) = \tilde W(k) + a_w\,g_w(W(k)) \tag{74}$$

where $g_w(W(k))$ is defined in (55). Using (54), (74), and (73), together with the definition of $X(k)$ in (66), we get
$$\begin{aligned}
\big\|\tilde W(k+1)\big\|^2 - \big\|\tilde W(k)\big\|^2
&= 2a_w\,g_w(W(k))^T\tilde W(k) + \big\|a_w\,g_w(W(k))\big\|^2 \\
&= -\frac{a_w}{c\,q_w}\,e^T(k)\big(B^+(k) - B^-(k) + b_w\Delta_w(k)\big)(r_w)^T\tilde W(k) + \big\|a_w\,g_w(W(k))\big\|^2 \\
&\le \frac{2a_w\kappa_{\min}}{\kappa p_w\,q_w}\,e^T(k)\phi_w(k) + \frac{a_w b_w}{\kappa c\,q_w}\,e^T(k)\phi_w(k)\,(r_w)^T\big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k) \\
&\qquad + \|e(k)\|^2\,\bigg\|\frac{B^+(k) - B^-(k) + b_w\Delta_w(k)}{2c}\bigg\|^2\|r_w\|^2\bigg(\frac{a_w}{q_w}\bigg)^2
\end{aligned} \tag{75}$$

where the inequality follows from (73) and the regularized inverse $\big(\eta I + X^T(k)X(k)\big)^{-1}$, as in the proof of Theorem 1.
Similar to the output-layer case, $\phi_w(k)$ is replaced with $\big(\tilde e_w(k) - e(k)\big)$, and we have the following Lyapunov function for convergence analysis:

$$\begin{aligned}
\big\|\tilde W(k+1)\big\|^2 - \big\|\tilde W(k)\big\|^2
&\le \frac{a_w}{q_w}\bigg\{\frac{\kappa_{\min}}{\kappa p_w} + \frac{b_w}{2\kappa c}(r_w)^T\big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k)\bigg\} \\
&\quad\times\Bigg\{\|\tilde e_w(k)\|^2 - \Bigg(1 - \frac{a_w}{q_w}\bigg\|\frac{B^+(k) - B^-(k) + b_w\Delta_w(k)}{2c}\bigg\|^2\|r_w\|^2 \\
&\qquad\times\bigg(\frac{\kappa_{\min}}{\kappa p_w} + \frac{b_w}{2\kappa c}(r_w)^T\big(\eta I + X^T(k)X(k)\big)^{-1}X^T(k)\Delta_w(k)\bigg)^{-1}\Bigg)\|e(k)\|^2\Bigg\}
\end{aligned} \tag{76}$$

The three adaptive parameters defined in (64), (67), and (68) guarantee the non-positivity of the Lyapunov difference, that is,

$$\big\|\tilde W(k+1)\big\|^2 - \big\|\tilde W(k)\big\|^2 \le 0, \quad \forall k.$$
References
1. Hopfield JJ (1982) Neural networks and physical systems with
emergent collective computational abilities. Proc Natl Acad Sci
79:2554–2558
2. Talebi HA (2009) A recurrent neural-network-based sensor and
actuator fault detection and isolation for nonlinear systems with
application to the satellite’s attitude control subsystem. IEEE
Trans Neural Netw 20:45–60
3. Hou ZG, Gupta MM, Nikiforuk PN, Tan M, Cheng L (2007) A
recurrent neural network for hierarchical control of intercon-
nected dynamic systems. IEEE Trans Neural Netw 18:466–481
4. Al Seyab RK, Cao Y (2008) Nonlinear system identification for
predictive control using continuous time recurrent neural networks
and automatic differentiation. J Process Control 18:568–581
5. Song Q, Xiao J, Soh YC (1999) Robust backpropagation training
algorithm for multilayered neural tracking controller. IEEE Trans
Neural Netw 10:1133–1141
6. Song Q, Wu Y, Soh YC (2008) Robust adaptive gradient-descent
training algorithm for recurrent neural networks in discrete time
domain. IEEE Trans Neural Netw 19:1841–1853
7. Song Q, Spall JC, Soh YC, Ni J (2008) Robust neural network
tracking controller using simultaneous perturbation stochastic
approximation. IEEE Trans Neural Netw 19:817–835
8. Song Q (2008) On the weight convergence of Elman networks.
IEEE Trans Neural Netw 21:463–480
9. Haykin S (1999) Neural networks: a comprehensive foundation. Prentice Hall, New Jersey
10. Mandic DP, Chambers JA (2001) Recurrent neural networks for
prediction: learning algorithms, architectures and stability. Wiley,
New York
11. Spall JC (1992) Multivariate stochastic approximation using a
simultaneous perturbation gradient approximation. IEEE Trans
Autom Control 37:332–341
12. Maeda Y, Wakamura M (2005) Simultaneous perturbation
learning rule for recurrent neural networks and its FPGA imple-
mentation. IEEE Trans Neural Netw 16:1664–1672
13. Trunov AB, Polycarpou MM (2000) Automated fault diagnosis in
nonlinear multivariable systems using a learning methodology.
IEEE Trans Neural Netw 11:91–101
14. Spall JC, Cristion JA (1998) Model-free control of nonlinear
stochastic systems with discrete-time measurements. IEEE Trans
Autom Control 43:1198–1210
15. Spall JC, Cristion JA (1997) A neural network controller for
systems with unmodeled dynamics with applications to waste-
water treatment. IEEE Trans Syst Man Cybern Part B Cybern
27:369–375
16. Maeda Y, De Figueiredo RJP (1997) Learning rules for neuro-
controller via simultaneous perturbation. IEEE Trans Neural
Netw 8:1119–1130
17. Werbos PJ (1988) Generalization of backpropagation with
application to a recurrent gas market model. Neural Netw
1:339–356
18. Rumelhart D, Hinton G, Williams R (1986) Learning internal
representations by error backpropagation. Parallel Distrib Process
1:318–362
19. Williams RJ, Zipser D (1989) A learning algorithm for continu-
ally running fully recurrent neural networks. Neural Comput
1:270–280
20. Williams RJ, Zipser D (1995) Gradient-based learning algorithms
for recurrent networks and their computational complexity.
Backpropag Theory Archit Appl 2:433–501
21. Lin T, Giles C, Horne B, Kung S (1997) A delay damage model
selection algorithm for NARX neural networks. IEEE Trans
Signal Process 45:2719–2730
22. Park Y, Murray T, Chen C (2002) Predicting sun spots using a
layered perceptron neural network. IEEE Trans Neural Netw
7:501–505
23. Ljung L (2010) Perspectives on system identification. Annu Rev
Control 34(1):1–12