
Absorbing stochastic estimator learning automata for S-model stationary environments

G.I. Papadimitriou *, A.S. Pomportsis, S. Kiritsi, E. Talahoupi

Department of Informatics, Aristotle University, Box 888, 54006 Thessaloniki, Greece

Received 26 September 2001; received in revised form 10 March 2002; accepted 8 May 2002

Abstract

An S-model absorbing learning automaton (LA) which is based on the use of a stochastic estimator is introduced. According to the proposed stochastic estimator scheme, the estimates of the mean rewards of actions are computed stochastically. Actions that have not been selected many times have the opportunity to be estimated as optimal, to increase their choice probabilities, and consequently, to be selected. In this way, the automaton's accuracy and speed of convergence are significantly improved.
© 2002 Elsevier Science Inc. All rights reserved.

1. Introduction

Learning automata (LA) [1,2] have attracted considerable interest in the last decades due to their potential usefulness in a variety of engineering problems that are characterized by nonlinearity and a high level of uncertainty. They have the property of progressively improving their performance. The environment communicates with the learning system and supplies it with feedback

information. The automaton uses this information to find which action is optimal. A LA is one that adapts itself to the environment by learning the optimal action and that ultimately chooses this action more frequently than other actions.

Information Sciences 147 (2002) 193–199
www.elsevier.com/locate/ins

* Corresponding author. Fax: +30-310-998419.
E-mail address: gp@csd.auth.gr (G.I. Papadimitriou).

0020-0255/02/$ - see front matter © 2002 Elsevier Science Inc. All rights reserved.
PII: S0020-0255(02)00263-3

The environment is said to be stationary if the mean rewards of the actions do not depend on time; otherwise it is said to be nonstationary.

Different kinds of LA have been proposed. According to the nature of its input, a LA can be characterized as a P-, Q- or S-model. A LA is a Q-model automaton [3,4] if the input set is a finite set of distinct symbols. If the input set is binary ({0,1}), it is called a P-model automaton [3–5]. Finally, if the automaton's input can take any real value in the [0,1] range, the automaton is called an S-model one [3,4,6]. S-model LA are considered in this paper.

With respect to their Markovian representation, LA are classified into two main categories: ergodic automata and automata possessing absorbing barriers [2–4]. The ergodic automata converge with a distribution independent of the initial state. On the other hand, the automata with absorbing states get locked into (converge to) a particular action after a finite number of steps.

Non-estimator learning algorithms update the probability vector based directly on the environment's feedback. On the other hand, estimator algorithms [7] are characterized by the use of a running estimate of the mean reward of each action. The change in the probability of choosing an action is based on the running estimates of the probability of reward rather than on the feedback from the environment. These algorithms, at every time instant, increase the probability of choosing the action with the maximum current estimate of mean reward. Simulation results have demonstrated the superiority of the estimator algorithms over the traditional learning algorithms [5,7].

The Stochastic Estimator Learning Algorithm scheme was introduced in [6] in an effort to achieve high adaptivity when the automaton operates in rapidly switching nonstationary environments. In this paper we present a new stochastic-estimator-based scheme, which is capable of achieving high accuracy and rapid convergence when operating in stationary environments.

Section 2 introduces the reader to the basic concepts of the stochastic esti-

mator. The proposed Absorbing Stochastic Estimator Learning Automaton

(ASELA) scheme is presented in Section 3. In Section 4 extensive simulation

results are presented, which indicate that the proposed scheme achieves a high

performance when operating in stationary environments. Finally, concluding

remarks are given in Section 5.

2. The stochastic estimator

An S-model stationary environment is considered. The LA keeps running estimates of the actions' mean rewards. The estimates are computed stochastically, so they are not strictly dependent on the environmental responses. A zero-mean normally distributed random variable is added to the deterministic estimates in order to obtain the stochastic ones. The variance of this normally distributed random variable differs from action to action and is inversely proportional to the number of times that each action has been selected. In this way, the LA gives actions that have been selected only a few times, and whose estimates are therefore considered unreliable, the opportunity to be estimated as optimal, to increase their choice probability and, consequently, to be selected.

This kind of estimator, which determines the estimates of actions in a nondeterministic way, is called a stochastic estimator. The use of a stochastic estimator in S-model nonstationary environments has been studied in [6]. In the present paper we study the use of a stochastic estimator in S-model stationary environments.

3. The absorbing stochastic estimator learning automaton

The Absorbing Stochastic Estimator Learning Automaton (ASELA) is defined by the quintuple {A, B, P, E, T}.

A is the set of r actions that the automaton can choose from: $A = \{a_1, \ldots, a_r\}$. The automaton is allowed one choice at each time instant $t$ and its choice is denoted by $a(t)$, where $a(t) \in A$.

The set of possible responses from the environment is denoted by B. Since an S-model environment is considered, $B = [0, 1]$. The environmental response at time instant $t$ is denoted by $b(t)$, where $b(t) \in B$.

The mean reward of action $a_i$ at time instant $t$ is denoted by $d_i(t)$. Since the environment is stationary, $d_i(t)$ is constant over time, so the time index is dropped and the quantity is denoted by $d_i$. The set of the actions' mean rewards is defined as $D = \{d_1, \ldots, d_r\}$. It is assumed that the set $D$ has a unique maximum $d_b$, where $d_b = \max_{1 \le i \le r}\{d_i\}$. The action possessing $d_b$, namely $a_b$, is referred to as the best action. The value of each $d_i \in D$ is unknown to the automaton, so its task is to decide which action is the best. It bases its decision on the information gained by selecting actions and observing the environmental feedback. This cycle continues until the learning process is terminated.

P is a probability distribution over the set of actions: $P(t) = \{p_1(t), \ldots, p_r(t)\}$, where $p_i(t)$ is the probability of selecting action $a_i$ at time instant $t$. In discretized automata there are only finitely many values for $p_i(t)$; namely, $p_i(t) \in \{0, \Delta, 2\Delta, 3\Delta, \ldots, 1\}$ for all $t$. Here $\Delta$ is referred to as the smallest step size and is inversely proportional to the total number of subdivisions of the probability space [0,1]. The parameter $\Delta$ is defined by $\Delta = 1/N$, where $N = rn$, $r$ is the number of actions and $n$ is the resolution parameter.

E is the estimator that at any time instant contains the estimated environmental characteristics. We define $E(t) = \langle D'(t), M(t), U(t) \rangle$, where $D'(t) = \{d'_1(t), \ldots, d'_r(t)\}$ is the Deterministic Estimator Vector, which contains the current deterministic estimates of the mean rewards of the actions, computed as follows (for $i = 1, \ldots, r$):

$$d'_i(t) = \frac{\text{total reward received up to time } t \text{ for the selections of } a_i}{\text{number of times } a_i \text{ has been selected up to time } t} = \frac{\sum_{k=1}^{T_i} Q_i^k(t)}{m_i(t)}, \qquad (1)$$

where $Q_i^k(t)$, $k = 1, \ldots, T_i$, are the rewards received at each time $k$ that action $a_i$ was selected and $T_i$ is the last time that $a_i$ was selected. $M(t) = \{m_1(t), \ldots, m_r(t)\}$, where $m_i(t)$ ($i = 1, \ldots, r$) is the number of times that action $a_i$ has been selected up to time instant $t$. $U(t) = \{u_1(t), \ldots, u_r(t)\}$ is the Stochastic Estimator Vector, which at any time instant $t$ contains the current stochastic estimates of the mean rewards of the actions. The current stochastic estimate $u_i(t)$ ($i = 1, \ldots, r$) of the mean reward of action $a_i$ is defined as follows:

$$u_i(t) = d'_i(t) + N(0, s_i^2(t)), \qquad (2)$$

where $s_i(t) = \min\{a \cdot \frac{1}{m_i(t)}, s_{\max}\}$.

$N(0, s_i^2(t))$ denotes a random number selected with a normal probability distribution, with mean equal to 0 and variance equal to $s_i^2(t)$. The quantity $a$ is an internal parameter of the automaton that determines how rapidly the stochastic estimates become independent of the deterministic ones; when $a = 0$, no noise is added to the deterministic estimates. $s_{\max}$ is the maximum permitted value of $s_i(t)$ ($i = 1, \ldots, r$); it limits the variance of the stochastic estimates so that it does not grow infinitely.
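Relation (2) can be illustrated in a few lines of Python. This is a minimal sketch of our own; the function name, argument names and default values are illustrative choices, not taken from the paper:

```python
import random

def stochastic_estimate(d_det, m, a=0.25, s_max=1.0):
    """u_i(t) = d'_i(t) + N(0, s_i(t)^2), with s_i(t) = min(a / m_i(t), s_max).

    d_det : deterministic estimate d'_i(t)
    m     : number of times action a_i has been selected, m_i(t)
    """
    s = min(a / m, s_max)  # noise spread shrinks as the action is selected more often
    return d_det + random.gauss(0.0, s)
```

With $a = 0$ the noise term vanishes and the stochastic estimate equals the deterministic one; for a rarely selected action (small `m`) the added noise can push its estimate above the current best, giving it a chance to be selected.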

Finally, T is the learning algorithm, which is presented below:

Step 1: Select an action $a(t) = a_k$ according to the probability vector.
Step 2: Receive the feedback $b(t) \in [0, 1]$ from the environment.
Step 3: Update $M(t)$ by setting $m_k(t+1) = m_k(t) + 1$ and $m_i(t+1) = m_i(t)$ for all $i \ne k$.
Step 4: Compute the new deterministic estimate $d'_k(t)$ as given by relation (1).
Step 5: For every action $a_i$ ($i = 1, \ldots, r$), compute the new stochastic estimate $u_i(t)$ as given by relation (2).
Step 6: Select the optimal action $a_m$ that has the highest stochastic estimate of mean reward; thus, $u_m(t) = \max_i \{u_i(t)\}$.
Step 7: Update the probability vector in the following way: for every action $a_i$ ($i = 1, \ldots, m-1, m+1, \ldots, r$) with $p_i(t) \ge 1/N$, set $p_i(t+1) = p_i(t) - 1/N$. For the optimal action $a_m$, set
$$p_m(t+1) = 1 - \sum_{i \ne m} p_i(t+1).$$
Step 8: If $p_m(t+1) < 1$ then GOTO Step 1, else CONVERGE to action $a_m$.
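As a reading aid, Steps 1–8 can be sketched as a self-contained Python loop. This is our own illustrative implementation, not the authors' code: the `env` callable, the parameter defaults, and the integer representation of the discretized probabilities (counts $c_i$ with $p_i = c_i/N$) are assumptions made for clarity.

```python
import random

def asela(env, r, n, a=0.25, s_max=1.0, max_steps=100_000):
    """Sketch of the ASELA loop (Steps 1-8). `env(i)` returns a reward in
    [0, 1] for action i; r = number of actions, n = resolution parameter."""
    N = r * n                  # Delta = 1/N is the probability step size
    c = [n] * r                # discretized probabilities: p_i = c_i / N
    total = [0.0] * r          # accumulated reward per action
    m = [0] * r                # selection counts M(t)
    for i in range(r):         # initialize D' by selecting each action once
        total[i] += env(i)
        m[i] += 1
    for _ in range(max_steps):
        k = random.choices(range(r), weights=c)[0]          # Step 1
        total[k] += env(k)                                  # Steps 2 and 4
        m[k] += 1                                           # Step 3
        u = [total[i] / m[i]                                # Step 5: relation (2)
             + random.gauss(0.0, min(a / m[i], s_max))
             for i in range(r)]
        best = max(range(r), key=lambda i: u[i])            # Step 6
        for i in range(r):                                  # Step 7
            if i != best and c[i] >= 1:
                c[i] -= 1
        c[best] = N - sum(c[i] for i in range(r) if i != best)
        if c[best] == N:                                    # Step 8: p_m = 1
            return best
    return max(range(r), key=lambda i: c[i])                # safety fallback
```

With a deterministic three-action environment such as `lambda i: [0.2, 0.9, 0.4][i]`, this sketch quickly locks into the best action; storing the probabilities as integer counts keeps the arithmetic of Step 7 exact.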


An S-model environment consists of three components denoted by $(A, L, B)$, where $A$ and $B$ are as defined above and $L = \langle D, F \rangle$. $D = \{d_1, \ldots, d_r\}$ is the set that contains the mean rewards of the actions at any time instant. $F(t) = \{f_1^t(x), \ldots, f_r^t(x)\}$ is the set that contains the probability density functions of the actions' rewards at every time instant $t$; each density $f_i^t(x)$ is symmetric about the line $x = d_i$. Since our automaton operates in a stationary environment, the means and the density functions of the actions' rewards are time-invariant.
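For intuition, such an S-model stationary environment can be simulated by drawing each reward from a normal density centered on the action's mean reward and truncated to [0, 1] (the kind of noise used in Section 4). The sketch below is our own; in particular, truncation by rejection sampling is an assumption about how the truncated samples are drawn:

```python
import random

def make_s_model_env(means, sigma=0.5):
    """Return env(i): a reward in [0, 1] drawn from a normal density with
    mean means[i] and standard deviation sigma, truncated to [0, 1]."""
    def env(i):
        while True:  # rejection sampling: redraw until the sample lies in [0, 1]
            x = random.gauss(means[i], sigma)
            if 0.0 <= x <= 1.0:
                return x
    return env
```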

4. Simulation results

Simulations were performed to demonstrate the superiority of the proposed stochastic estimator scheme over the deterministic estimator one [8]. Both schemes were simulated operating in S-model stationary environments. The speed and accuracy of convergence were used as performance metrics in order to evaluate the two schemes under comparison. Each automaton was said to have converged when the probability of choosing an action was exactly unity. The two automata under comparison were placed in a ten-action environment. The mean reward of the optimal action was fixed at 0.85 for all simulations, while the mean rewards of the other actions were equally spaced in the interval [0.1, 0.5]; that is, in the $r$-action case we have $d_1 = 0.85$ and $d_i = 0.5 - (i-2)\delta$ for $i = 2, 3, \ldots, r$, where $\delta = (0.5 - 0.1)/(r-2)$.
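The reward configuration just described can be generated as follows (a small helper of our own, shown only to make the spacing rule concrete):

```python
def mean_rewards(r):
    """d_1 = 0.85; d_2, ..., d_r equally spaced over [0.1, 0.5], descending."""
    delta = (0.5 - 0.1) / (r - 2)
    return [0.85] + [0.5 - (i - 2) * delta for i in range(2, r + 1)]
```

For the ten-action case used in the simulations this yields d = [0.85, 0.5, 0.45, ..., 0.1].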

Before starting the algorithm, initial estimates for $D'$ were obtained by selecting each action once. These extra iterations were included in the total number of iterations counted until the algorithm converged. The average results are shown in Tables 1–3. The value of the resolution parameter appears in the first column of each table. In each case the parameter $N$ is defined by the relation $\Delta = 1/N$, $N = rn$, where, as noted earlier, $r$ is the number of actions and $\Delta$ is the step size of the probability vector. The second column contains the accuracy that corresponds to the resolution parameter of the first column; the automaton is said to have converged accurately if it converged to the best action. The mean number of iterations required for convergence appears in the third column.

Table 1
Stochastic vs. deterministic estimator in an S-model automaton with a = 0.25, in a 10-action environment with σ² = 0.25

            Stochastic estimator      Deterministic estimator
            Accuracy    Speed         Accuracy    Speed
N = 25      95.48       11.95         86.62        4.95
N = 50      98.58       18.67         91.42        9.72
N = 100     99.51       33.21         96.33       17.7
N = 200     99.71       55.19         97.97       37.01
N = 300     –           –             98.96       45.79


The noise in the environment was simulated by taking truncated samples from normal (Gaussian) distributions having the actions' mean rewards as means and a variance σ². From our experiments we observe that, for a given accuracy, the stochastic estimator scheme is much faster than the deterministic one.

5. Conclusion

An S-model absorbing LA that uses a stochastic estimator in order to achieve high accuracy and a high speed of convergence in stationary environments has been introduced. Extensive simulation results indicate that the proposed ASELA scheme achieves superior performance over the well-known deterministic-estimator-based absorbing schemes when operating in S-model stationary environments.

References

[1] K. Najim, A.S. Poznyak, Learning Automata: Theory and Applications, Pergamon, Oxford,

1994.

Table 2
Stochastic vs. deterministic estimator in an S-model automaton with a = 0.35, in a 10-action environment with σ² = 0.25

            Stochastic estimator      Deterministic estimator
            Accuracy    Speed         Accuracy    Speed
N = 25      97.2        16.66         86.62        4.95
N = 50      99.41       25.7          91.42        9.72
N = 100     99.8        39.28         96.33       17.7
N = 200     99.89       61.7          97.97       37.01
N = 300     –           –             98.96       45.79

Table 3
Stochastic vs. deterministic estimator in an S-model automaton with a = 0.35, in a 10-action environment with σ² = 0.35

            Stochastic estimator      Deterministic estimator
            Accuracy    Speed         Accuracy    Speed
N = 25      90.39       16.78         75.23        8.72
N = 50      96.27       29.58         82.36       16.13
N = 100     97.93       49.03         90.6        30.89
N = 200     98.6        76.51         95.01       48.5
N = 300     –           –             96.75       66.16
N = 400     –           –             97.82       79.3


[2] K.S. Narendra, M.A.L. Thathachar, Learning Automata: An Introduction, Prentice Hall, New

Jersey, 1989.

[3] K.S. Narendra, M.A.L. Thathachar, Learning automata: A survey, IEEE Trans. Syst. Man Cybern. SMC-4 (4) (1974) 323–334.

[4] K.S. Narendra, S. Lakshmivarahan, Learning automata: A critique, J. Cybern. Inform. Sci. 1 (1977) 53–66.

[5] G.I. Papadimitriou, Hierarchical discretized pursuit nonlinear learning automata with rapid convergence and high accuracy, IEEE Trans. Knowledge Data Eng. 6 (4) (1994).

[6] G.I. Papadimitriou, A new approach to the design of reinforcement schemes for learning

automata: stochastic estimator learning algorithms, IEEE Trans. Knowledge Data Eng. 6 (4)

(1994).

[7] M.A.L. Thathachar, P.S. Sastry, A class of rapidly converging algorithms for learning automata, IEEE Trans. Syst. Man Cybern. SMC-15 (1) (1985) 168–175.

[8] B.J. Oommen, J.K. Lanctot, Discretized pursuit learning automata, IEEE Trans. Syst. Man Cybern. SMC-20 (4) (1990) 931–938.
