

An Optimal Learning Algorithm for S-Model Environments

L. G. MASON

Abstract-A class of stochastic automata models is proposed for the synthesis of a parameter-optimizing controller. The automaton can operate in environments characterized by reward strengths (S-models) or reward probabilities (P-models). In the P-model case the proposed algorithm is equivalent to the ε-optimal algorithm reported by Shapiro and Narendra. The algorithm discussed here was originally reported by Mason with emphasis on the P-model case. In this paper, emphasis is placed on the S-model case. Recently, an equivalent ε-optimal algorithm has been reported by Viswanathan and Narendra. It is shown herein that only the optimal solution is stable and that the expected performance converges monotonically.

Simulation results are presented that corroborate the analytical results. It is demonstrated that the proposed algorithm is superior to McLaren's linear reinforcement scheme in regard to expediency.

I. INTRODUCTION

In various control systems it is desirable to optimize some performance index by the appropriate selection of a system parameter. When the relationship between the parameter and the performance is unknown a priori, it is necessary to adjust the parameter on-line.

One approach to the solution of this problem is to use a learning controller that improves the system performance as it gains experience. Variable-structure stochastic automata have been proposed by Fu and McMurtry [1] and McLaren [2] as models of adaptive and learning controllers. In [1] and [2] the environment of the learning controller is of the S-model type; that is, it is characterized by penalty strengths corresponding to the various parameter settings. Varshavskii and Vorontsova [3] have considered a two-state variable-structure automaton in an environment characterized by penalty probabilities (P-models). Glorioso and Grueneich [4] have proposed a training algorithm for stochastic automata operating in P-model environments. Chandrasekaran and Shen [5] have compared the expediency and convergence properties of various learning algorithms in both P-model and S-model environments. None of the algorithms discussed in [5] are optimal in S-model environments. Since S-model

Manuscript received January 14, 1972; revised July 12, 1972. Paper recommended by G. N. Saridis, Chairman of the IEEE S-CS Adaptive and Learning Systems, Pattern Recognition Committee. This work was supported in part by the National Research Council of Canada under Grant A1080.

The author is with the Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Sask., Canada.


environments characterize many control systems, it is desirable that an optimal learning algorithm be available for them. Shapiro and Narendra [6] have described an approach using stochastic automata for parameter optimization. They first transform the performance index into a reward probability. An algorithm that is ε-optimal for the resulting P-model environment is then employed for parameter optimization. For the algorithm described in this paper it is not necessary to transform the S-model environment into a P-model one as was done in [6]. The algorithm described here was reported by Mason [7], with emphasis on the P-model case. This paper is concerned mainly with the S-model case. Recently, Viswanathan and Narendra [8] have suggested a method for extending P-model algorithms to the S-model case.

It is shown that only optimal parameter settings are stable, and that the expected performance increases monotonically in stationary environments. Simulation results for the proposed algorithm and McLaren's linear reinforcement scheme corroborate the analytical results. It is demonstrated that the proposed algorithm is superior in regard to expediency.

II. THE ENVIRONMENT

The learning controller and its environment comprise a closed-loop system as shown in Fig. 1. Two classes of environments are defined as follows. In P-model environments, there is a probability of success associated with each action or input to the environment. If the choice u(t) = u_i is made, then the environment's output is z(t) = 1 with probability p_i, and z(t) = 0 with probability 1 − p_i. It is to be understood that z(t) is the environment's output resulting from the input u(t). If there is delay in the environment's response, then z(t) will be delayed accordingly. The succeeding input u(t + 1) will not be applied until z(t) has been obtained. In S-model environments, the output is a strength z(t), which is a random variable such that 0 ≤ z(t) ≤ 1 and E{z(t) | u(t) = u_i} = s_i. The previous comments concerning the delay in the environment's response apply here as well. In the case of parameter-adaptive control systems, u(t) represents the parameter setting at time t and z(t) represents the resulting noisy normalized performance measurement.
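As a concrete illustration, the following Python sketch (not from the paper; the class names, the respond interface, and the S-model noise distribution are assumptions made here) models the two environment classes:

```python
import random

# A minimal sketch of the two environment classes defined above
# (illustrative only; names and distributions are not from the paper).

class PModelEnvironment:
    """Responds with z(t) = 1 (success) with probability p_i, else z(t) = 0."""
    def __init__(self, p):
        self.p = p  # success probabilities, one per action u_1..u_m

    def respond(self, i):
        return 1.0 if random.random() < self.p[i] else 0.0

class SModelEnvironment:
    """Responds with a random strength z(t) in [0, 1] with mean s_i.

    The paper only requires 0 <= z(t) <= 1 and E{z(t) | u(t) = u_i} = s_i;
    a symmetric uniform distribution is assumed here for illustration.
    """
    def __init__(self, s, half_width=0.2):
        self.s = s
        self.half_width = half_width

    def respond(self, i):
        # shrink the interval so it stays inside [0, 1] and keeps mean s_i
        w = min(self.s[i], 1.0 - self.s[i], self.half_width)
        return random.uniform(self.s[i] - w, self.s[i] + w)
```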

III. THE STOCHASTIC AUTOMATON

A stochastic automaton is defined by the sextuple {Z, U, Φ, g, π̄, T}. Z is the set of inputs to the automaton. In P-model environments, Z = {0, 1}, whereas in S-model environments Z is a subset of the closed interval [0, 1]. The lower-case form z(t) indicates the environment's response to the automaton's output u(t). This environmental response is an input of the automaton as shown in Fig. 1. U = {u_1, u_2, ..., u_m} is the set of m automaton outputs. The particular output realized at time t is denoted by u(t). Φ = {φ_1, φ_2, ..., φ_m} is the set of states that the automaton may reach, while φ(t) is the state realized at time t. π̄ = (π_1, π_2, ..., π_m) is a probability vector governing the choice of automaton states. The state φ(t) = φ_i is chosen with probability π_i(t), where

\sum_{i=1}^{m} \pi_i(t) = 1.    (1)

T is an operator that defines the reinforcement scheme or learning algorithm. This operator updates the automaton's state probability vector as follows:

\bar{\pi}(t+1) = T(\phi(t), z(t))\,\bar{\pi}(t)    (2)

where it is shown explicitly that T is a function of the automaton's state and the environment's response.

In this paper the learning algorithm is defined by

\pi_i(t+1) = \pi_i(t) + G\,(\delta_{ik} - \pi_i(t))\,z(t)

for i = 1, 2, ..., m − 1 and φ(t) = φ_k, and

\pi_m(t+1) = 1 - \sum_{i=1}^{m-1} \pi_i(t+1).    (3)

Fig. 1. The learning system: the stochastic automaton in closed loop with the environment (P-model or S-model).

The output function is given by the mapping g: φ_i → u_i. The algorithm is linear in the state probabilities. These probabilities are themselves random variables, since φ(t) and, hence, δ_ik are random variables. It will later be shown that the recurrence relationship governing the expected probabilities is nonlinear. Therefore, according to [5], the proposed algorithm is nonlinear.
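One iteration of the scheme (3) can be sketched in code as follows; this is a minimal illustration under assumed names (automaton_step, and the env.respond interface from the earlier environment sketch), not the author's implementation:

```python
import random

def automaton_step(pi, env, G):
    """One iteration of the learning algorithm (3).

    pi  : list of state probabilities pi_1..pi_m (sums to 1)
    env : environment object with a respond(k) method returning z(t)
    G   : gain parameter, 0 < G <= 1
    """
    m = len(pi)
    # choose the state phi(t) = phi_k with probability pi_k
    k = random.choices(range(m), weights=pi)[0]
    z = env.respond(k)  # environment's response to output u_k

    new_pi = pi[:]
    for i in range(m - 1):
        delta = 1.0 if i == k else 0.0          # Kronecker delta
        new_pi[i] = pi[i] + G * (delta - pi[i]) * z
    new_pi[m - 1] = 1.0 - sum(new_pi[: m - 1])  # last component, eq. (3)
    return new_pi, k, z
```

Note that when the selected state is φ_m, the first m − 1 probabilities all decay and the renormalization in the last line raises π_m, exactly as (3) prescribes.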

IV. CONVERGENCE PROPERTIES

The convergence properties will be investigated by the method outlined in [5]. The quantities Δπ_i are defined as

\Delta\pi_i = E[\pi_i(t+1) \mid \bar{\pi}(t)] - \pi_i(t), \quad i = 1, 2, \ldots, m.    (4)

Now the asymptotic expected values of the probability vector must satisfy the Martingale equations

\Delta\pi_i = 0, \quad i = 1, 2, \ldots, m.    (5)

For S-model environments and the algorithm defined by (3), the Martingale equations become

\Delta\pi_i = G\,\pi_i^{*}\Bigl(s_i - \sum_{k=1}^{m} s_k \pi_k^{*}\Bigr) = 0, \quad i = 1, 2, \ldots, m    (6)

whose solutions include

(\pi_i^{*} = \delta_{i1}),\ (\pi_i^{*} = \delta_{i2}),\ \ldots,\ (\pi_i^{*} = \delta_{im}), \quad i = 1, 2, \ldots, m    (7)

where δ_ij is the Kronecker delta function and the asterisk denotes the asymptotic expected value.

According to Chandrasekaran and Shen [5], a vector solution π̄* to the Martingale equations is stable if dΔπ_i/dπ_i evaluated at π̄* is negative for i = 1, 2, ..., m. In order to satisfy the constraints given by (1), the following condition is imposed on the differentials:

\sum_{i=1}^{m} d\pi_i = 0.    (8)

It is sufficient to illustrate the investigation of stability for the equilibrium given by (π_i* = δ_i1), since the other equilibria can be investigated in a similar manner. First, consider dΔπ_1/dπ_1, which by the chain rule is

d\Delta\pi_1/d\pi_1 = \partial\Delta\pi_1/\partial\pi_1 + (\partial\Delta\pi_1/\partial\pi_2)\,d\pi_2/d\pi_1 + \cdots + (\partial\Delta\pi_1/\partial\pi_m)\,d\pi_m/d\pi_1.    (9)


Fig. 2. Trajectories of the state probability vector π̄(t); horizontal axis: iterations.

By repeating this procedure for dΔπ_i/dπ_i it is found that

d\Delta\pi_i/d\pi_i \big|_{\pi_i^{*} = \delta_{i1}} = -G\,(s_1 - s_i), \quad i = 2, \ldots, m.    (13)

The only way in which (12) and (13) can be negative, and therefore the equilibrium given by π_i* = δ_i1 be stable, is if s_1 ≥ s_i for i = 2, 3, ..., m. Thus only the optimal solution is stable. The optimal solution is an absorbing barrier for the Markov process defined by (3), since it is stable and the variance there can be shown to vanish.

The monotone convergence of the expected performance will now be proved, where the performance is defined by

P(t) = \sum_{i} s_i \pi_i(t).    (14)

This quantity is random since π_i(t) is a random variable. The expected performance, or expediency, is given by

M(t) = E\{P(t)\} = \sum_{i} s_i E\{\pi_i(t)\}.    (15)

Taking expectations of (11) results in

E\{\pi_i(t+1)\} - E\{\pi_i(t)\} = E\Bigl\{\sum_{k} G\,(\delta_{ik} - \pi_i(t))\,s_k \pi_k(t)\Bigr\}.    (16)

Multiplying (16) through by s_i and summing over i, one obtains after some rearrangement

M(t+1) - M(t) = G\,E\Bigl\{\sum_{i} \pi_i(t)\,s_i^{2} - \Bigl(\sum_{k} \pi_k(t)\,s_k\Bigr)^{2}\Bigr\}.    (17)
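The rearrangement behind (17) is worth making explicit; the following worked expansion is added here for completeness (it is implied by, but not printed in, the original):

```latex
\begin{align*}
M(t+1) - M(t)
  &= \sum_i s_i \left( E\{\pi_i(t+1)\} - E\{\pi_i(t)\} \right) \\
  &= G\, E\left\{ \sum_i \sum_k s_i \left( \delta_{ik} - \pi_i(t) \right) s_k \pi_k(t) \right\} \\
  &= G\, E\left\{ \sum_k s_k^2 \pi_k(t)
       - \left( \sum_i s_i \pi_i(t) \right) \left( \sum_k s_k \pi_k(t) \right) \right\}
\end{align*}
```

The Kronecker delta collapses the double sum to Σ_k s_k² π_k(t), and the remaining product of sums is the square in (17).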

Fig. 3. Performance trajectories for reward strengths s_1 = 0.7, s_2 = 0.6, s_3 = 0.4; the legend marks the expected probability trajectory of the proposed algorithm.

The monotone convergence of the expected performance is proven, since the expression under the expectation operator in (17) is always nonnegative: it is the variance of the reward strength under the probability vector π̄(t). It is zero only at the equilibria.

The optimal solution is an absorbing barrier of the Markov process defined by (3). If the environment is nonstationary, this is an undesirable property, since once the system has converged it will remain in that state. It should be noted, however, that every optimal algorithm has this inflexibility. Various modifications can be made to the proposed algorithm to gain flexibility at the expense of reduced expediency.

An approximate expression for the trajectories of the expected probabilities E{π̄(t)} and the expected performance can be obtained if the gain parameter G is small. Assuming that the variances and covariances of the probabilities, namely

\sigma^{2}_{\pi_i \pi_k} = E\{\pi_i \pi_k\} - E\{\pi_i\}\,E\{\pi_k\}    (18)

can be neglected when G is small, (16) and (17) can be replaced by the approximations

E\{\pi_i(t+1)\} \simeq E\{\pi_i(t)\}\Bigl(1 + G\Bigl(s_i - \sum_{k} s_k E\{\pi_k(t)\}\Bigr)\Bigr)

for i = 1, 2, ..., m − 1, and

E\{\pi_m(t+1)\} \simeq 1 - \sum_{i=1}^{m-1} E\{\pi_i(t+1)\}    (19)

and

M(t+1) \simeq M(t) + G\Bigl(\sum_{i} s_i^{2}\,E\{\pi_i(t)\} - \Bigl(\sum_{k} s_k E\{\pi_k(t)\}\Bigr)^{2}\Bigr).    (20)

Equations (19) and (20) can be solved recursively to obtain the approximate expected behavior of the learning system. Since no bound for the variance σ²_{π_i π_k} has been given, the accuracy of the approximation will be checked by simulation.
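A minimal sketch of this recursion in Python (an illustration under names chosen here; the paper gives no code):

```python
def approximate_trajectories(s, pi0, G, steps):
    """Iterate (19) and (20) to get the approximate expected behavior.

    s     : reward strengths s_1..s_m
    pi0   : initial expected probabilities E{pi_i(0)}
    G     : gain parameter (the approximation assumes G is small)
    steps : number of iterations
    Returns the sequences of E{pi(t)} vectors and M(t) values.
    """
    m = len(s)
    pi = list(pi0)
    M = sum(si * pii for si, pii in zip(s, pi))   # M(0), eq. (15)
    pis, Ms = [pi[:]], [M]
    for _ in range(steps):
        avg = sum(sk * pk for sk, pk in zip(s, pi))          # sum_k s_k E{pi_k}
        M = M + G * (sum(si**2 * pii for si, pii in zip(s, pi)) - avg**2)  # (20)
        new_pi = [pi[i] * (1.0 + G * (s[i] - avg)) for i in range(m - 1)]
        new_pi.append(1.0 - sum(new_pi))                     # last component, (19)
        pi = new_pi
        pis.append(pi[:])
        Ms.append(M)
    return pis, Ms
```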

V. SIMULATION RESULTS

Several simulations were performed using various values for the reward strengths, the initial conditions, and the gain parameter G. A representative example is shown in Figs. 2 and 3, which give the trajectories of the state probability vector π̄(t) and the performance P(t), respectively. Also shown are the approximate trajectories of E{π̄(t)} and M(t), which were obtained by solving (19) and (20) recursively. The state probability trajectory and performance obtained using McLaren's linear algorithm are shown for comparison purposes.


The results corroborate the theory presented herein. They also clearly demonstrate the superiority of the proposed algorithm over that of McLaren in regard to expediency.
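A rough reconstruction of such an experiment, reusing the sketches above: the reward strengths come from the legend of Fig. 3, while the gain, initial probabilities, seed, and run length are assumptions, since the paper does not state them.

```python
import random

random.seed(1)
s = [0.7, 0.6, 0.4]            # reward strengths from the legend of Fig. 3
env = SModelEnvironment(s)     # from the earlier environment sketch
G = 0.05                       # assumed gain; the paper's value is not stated
pi = [1 / 3, 1 / 3, 1 / 3]     # assumed uniform initial probabilities

performance = []
for t in range(2000):
    pi, k, z = automaton_step(pi, env, G)   # from the earlier sketch
    performance.append(sum(si * pii for si, pii in zip(s, pi)))  # P(t), eq. (14)

print("final probabilities:", [round(p, 3) for p in pi])
print("final performance:", round(performance[-1], 3))
# With s_1 the largest strength, pi should drift toward (1, 0, 0) and the
# performance toward s_1 = 0.7, as the stability analysis predicts.
```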

VI. CONCLUSIONS

A learning algorithm has been proposed that exhibits ε-optimal behavior in environments characterized either by reward probabilities or reward strengths. An equivalent algorithm has been independently derived by Shapiro and Narendra [6] and Viswanathan and Narendra [9]. It is shown here that only the optimal state of the automaton is stable and that the expected performance increases monotonically in stationary environments.

Expressions for an approximate learning curve have been derived, and comparisons have been made with McLaren's linear reinforcement scheme for the S-model case. The proposed algorithm is superior in regard to expediency. Simulation results have been given that corroborate the analysis.

It is thought that the proposed algorithm may have application in the synthesis of adaptive control systems and pattern recognition systems.

ACKNOWLEDGMENT

The author wishes to thank Dr. P. N. Nikiforuk for his encouragement and suggestions in the preparation of this paper.

An Adaptive Observer and Identifier for a Linear System

GERD LÜDERS AND KUMPATI S. NARENDRA

Abstract-An adaptive scheme is devised which observes the state and simultaneously identifies all the parameters of a single-input single-output nth-order linear system. The scheme uses only the input and the output signals of the system and does not involve the use of derivatives of these signals. The adaptive scheme is proved to be globally asymptotically stable, thus ensuring the convergence of the identification process.

I. INTRODUCTION

The design of model reference adaptive systems using Lyapunov's direct method has the important advantage over other adaptive schemes in that the global stability of these systems is automatically guaranteed. Of course, with other adaptive designs a stability proof can be attempted, but this is, in general, a very difficult problem

Manuscript received December 22, 1972. Paper recommended by G. N. Saridis, Chairman of the IEEE S-CS Adaptive and Learning Systems, Pattern Recognition Committee.

The authors are with the Department of Engineering and Applied Science, Yale University, New Haven, Conn. 06520.

because an adaptive system is highly nonlinear, and in many situations a proof of global stability may not exist.

However, up to now the main disadvantage of adaptive designs using Lyapunov's method as compared to other methods has been the fact that they require the complete state of the controlled or identified system for their implementation. This is a very restrictive assumption because in most practical situations only the outputs of a system can be measured, and these are fewer in number than the state variables. Further, measurement noise is invariably present in the system, and therefore the complete state cannot be reconstructed by taking an adequate number of derivatives of the outputs. When the system is completely known, this problem may be overcome by the use of an observer; such a procedure, however, cannot be used directly in the adaptive context since the system parameters necessary to build the observer are not known.

Nevertheless, given any system which is completely observable and a sufficiently general input, it is well known that this system can be completely identified by taking only a finite number of measurements of its input(s) and output(s). Because of this it is natural to expect that it should also be possible to identify such systems adaptively using Lyapunov's design. In the following sections it is shown that this is indeed the case.

A solution to the problem mentioned above has been reported in [1]. However, the results obtained in this paper are somewhat involved, and the motivation for the steps taken in the adaptive procedure, as well as the interpretation of the results, are rather difficult. Perhaps much more significant from a practical viewpoint is that, while the results of [1] would involve a prohibitively high cost of implementation (as readily acknowledged by the authors), the procedure outlined in this paper leads to a scheme which is easily implemented. The final solution is presented in terms of a new canonical state representation of the system which provides considerable insight into the entire adaptive procedure and permits easy interpretation of the results. It is interesting to point out that this new canonical representation was further modified in [2] and used to obtain an identification scheme whose practical implementation is even simpler than the one described herein.

II. REPRESENTATION OF A LINEAR SYSTEM

Consider the following linear time-invariant nth-order single-input single-output system:

\dot{x} = A x + b\,u(t)
y = h^{T} x    (1)

where x is an nth-order state vector, u(t) is the single input, and y is the single output. The problem proposed is to devise an adaptive scheme which would 1) estimate the state vector x, and 2) identify the parameters of the triple (A, h, b), given that nothing is known about the system except that it is of order n, linear, and time-invariant, and that the only available signals are u(t) and y(t).
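For concreteness, a minimal forward-Euler simulation of (1) is sketched below; the particular triple (A, h, b), the step size, and the input signal are arbitrary choices made here, not values from the paper:

```python
import numpy as np

def simulate_siso(A, b, h, u, x0, dt):
    """Forward-Euler simulation of xdot = A x + b u(t), y = h^T x."""
    x = np.asarray(x0, dtype=float)
    ys = []
    for uk in u:
        ys.append(float(h @ x))        # output y(t) = h^T x(t)
        x = x + dt * (A @ x + b * uk)  # Euler step for the state equation
    return ys

# Arbitrary illustrative second-order system (unknown to the identifier);
# only u(t) and the returned y(t) would be available to the adaptive scheme.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
b = np.array([0.0, 1.0])
h = np.array([1.0, 0.0])
u = np.sin(0.5 * np.arange(5000) * 0.01)  # a persistently varying input
y = simulate_siso(A, b, h, u, [0.0, 0.0], dt=0.01)
```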

In the problem stated above, since one is only interested in the input-output characteristics of (1), there is considerable freedom in choosing the internal state representation x or, equivalently, in the choice of the triple (A, h, b). In particular, since the system (1) can also be described by an nth-order transfer function, it would be necessary to identify only 2n parameters; the remaining n² parameters of A, h, b can be chosen arbitrarily, fixing therewith the state representation x.

It is shown in Appendix I that if and only if the system described by (1) is completely observable, then it can always be represented in the following form:

y = (1\ 0\ \cdots\ 0)\,x = x_1