Transcript

Absorbing stochastic estimator learning automata for S-model stationary environments

G.I. Papadimitriou *, A.S. Pomportsis, S. Kiritsi, E. Talahoupi

Department of Informatics, Aristotle University, Box 888, 54006 Thessaloniki, Greece

* Corresponding author. Fax: +30-310-998419. E-mail address: [email protected] (G.I. Papadimitriou).

Received 26 September 2001; received in revised form 10 March 2002; accepted 8 May 2002

Abstract

An S-model absorbing learning automaton (LA) which is based on the use of a stochastic estimator is introduced. According to the proposed stochastic estimator scheme, the estimates of the mean rewards of actions are computed stochastically. Actions that have not been selected many times have the opportunity to be estimated as optimal, to increase their choice probabilities and, consequently, to be selected. In this way, the automaton's accuracy and speed of convergence are significantly improved. © 2002 Elsevier Science Inc. All rights reserved.

1. Introduction

Learning automata (LA) [1,2] have attracted considerable interest in the last decades due to their potential usefulness in a variety of engineering problems that are characterized by nonlinearity and a high level of uncertainty. They have the property of progressively improving their performance. The environment communicates with the learning system and supplies it with feedback information. The automaton uses this information to find which action is optimal.

A LA is one that adapts itself to the environment by learning the optimal action and that ultimately chooses this action more frequently than other actions.

The environment is said to be stationary if the mean rewards of the actions do not depend on time; otherwise it is said to be nonstationary.

Different kinds of LA have been proposed. According to the nature of its input, a LA can be characterized as a P-, Q- or S-model. A LA is a Q-model automaton [3,4] if the input set is a finite set of distinct symbols. If the input set is binary ({0,1}), it is called a P-model automaton [3–5]. Finally, if the automaton's input can take any real value in the [0,1] range, the automaton is called an S-model one [3,4,6]. S-model LA are considered in this paper.

With respect to their Markovian representation, LA are classified into two main categories: ergodic automata and automata possessing absorbing barriers [2–4]. The ergodic automata converge with a distribution that is independent of the initial state. On the other hand, the automata with absorbing states get locked into (converge to) a particular action after a finite number of steps.

Non-estimator learning algorithms update the probability vector based directly on the environment's feedback. On the other hand, estimator algorithms [7] are characterized by the use of a running estimate of the mean reward of each action. The change in the probability of choosing an action is based on the running estimates of the probability of reward rather than on the feedback from the environment. These algorithms, at every time instant, increase the probability of choosing the action with the maximum current estimate of mean reward. Simulation results have demonstrated the superiority of the estimator algorithms over the traditional learning algorithms [5,7].

Stochastic estimator learning algorithms were introduced in [6] in an effort to achieve high adaptivity when the automaton operates in rapidly switching nonstationary environments. In this paper we present a new stochastic-estimator-based scheme, which is capable of achieving high accuracy and rapid convergence when operating in stationary environments.

Section 2 introduces the reader to the basic concepts of the stochastic estimator. The proposed Absorbing Stochastic Estimator Learning Automaton (ASELA) scheme is presented in Section 3. In Section 4 extensive simulation results are presented, which indicate that the proposed scheme achieves a high performance when operating in stationary environments. Finally, concluding remarks are given in Section 5.

2. The stochastic estimator

An S-model stationary environment is considered. The LA keeps running estimates of the actions' mean rewards. The estimates are computed stochastically, so they are not strictly dependent on the environmental responses. A zero-mean, normally distributed random variable is added to the deterministic estimates in order to obtain the stochastic ones.


The variance of the normally distributed random variable differs from action to action and is inversely proportional to the number of times that each action has been selected. The LA thus gives actions that have been selected only a few times – and whose estimates are therefore considered unreliable – the opportunity to be estimated as "optimal", to increase their choice probability and, consequently, to be selected.

This kind of estimator, which determines the estimates of the actions in a nondeterministic way, is called a stochastic estimator. The use of a stochastic estimator in S-model nonstationary environments has been studied in [6]. In the present paper we study the use of a stochastic estimator in S-model stationary environments.

3. The absorbing stochastic estimator learning automaton

The Absorbing Stochastic Estimator Learning Automaton (ASELA) is defined by the quintuple {A, B, P, E, T}.

A is the set of r actions that the automaton can choose from. The various actions are the elements of A = {a_1, ..., a_r}. The automaton is allowed one choice at each time instant t, and its choice is denoted by a(t), where a(t) ∈ A.

The set of possible responses from the environment is denoted by B. Since an S-model environment is considered, B = [0, 1]. The environmental response at time instant t is denoted by β(t), where β(t) ∈ B.

The mean reward of action a_i at time instant t is denoted by d_i(t). Since the environment is stationary, d_i(t) is constant for all t, so the time index is dropped and the quantity is denoted by d_i. The set of the actions' mean rewards is defined as D = {d_1, ..., d_r}. It is assumed that the set D has a unique maximum, called d_b, where d_b = max_{1≤i≤r} {d_i}. The action possessing d_b, namely a_b, is referred to as the best action. The value of each d_i ∈ D is unknown to the automaton, and so its task is to decide which action is the best. It bases its decision on the information gained by selecting actions and observing the environmental feedback. This cycle continues until the learning process is terminated.

P is a probability distribution over the set of actions. We have P(t) = {p_1(t), ..., p_r(t)}, where p_i(t) is the probability of selecting action a_i at time instant t. In discretized automata there are only finitely many values for p_i(t); namely, p_i(t) is one of {0, Δ, 2Δ, 3Δ, ..., 1} for all t. Here Δ is referred to as the smallest step size and is inversely proportional to the total number of subdivisions of the probability space [0, 1]. The parameter Δ is defined by Δ = 1/N, where N = rn, r is the number of actions and n is the resolution parameter.

E is the estimator that, at any time instant, contains the estimated environmental characteristics. We define E(t) = (D'(t), M(t), U(t)), where D'(t) = {d'_1(t), ..., d'_r(t)} is the Deterministic Estimator Vector, which contains the current deterministic estimates of the mean rewards of the actions, computed as shown below (for i = 1, ..., r):

d'_i(t) = (Total reward received up to time t for the selections of action a_i) / (Number of times action a_i has been selected up to time t) = ( Σ_{k=1}^{T_i} Q_i^k(t) ) / m_i(t),   (1)

where Q_i^k(t), for k = 1, ..., T_i, are the rewards received at each time k that action a_i was selected and T_i is the last time that a_i was selected. M(t) = {m_1(t), ..., m_r(t)}, where m_i(t) (i = 1, ..., r) is the number of times that action a_i has been selected up to time instant t. U(t) = {u_1(t), ..., u_r(t)} is the Stochastic Estimator Vector, which at any time instant t contains the current stochastic estimates of the mean rewards of the actions. The current stochastic estimate u_i(t) (i = 1, ..., r) of the mean reward of action a_i is defined as follows:

u_i(t) = d'_i(t) + N(0, s_i^2(t)),   (2)

where s_i(t) = min{ a · (1/m_i(t)), s_max }.

N(0, s_i^2(t)) denotes a random number drawn from a normal probability distribution with mean equal to 0 and variance equal to s_i^2(t). The parameter a is an internal parameter of the automaton that determines how rapidly the stochastic estimates become independent of the deterministic ones. When a = 0, no noise is added to the deterministic estimates. s_max is the maximum permitted value of s_i(t) (i = 1, ..., r); it bounds the variance of the stochastic estimates so that it does not grow without limit.
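As a concrete illustration, relation (2) can be computed as in the short Python sketch below; the function name and the representation of the estimator vectors as plain lists are assumptions made for this example and are not part of the paper's formulation.

import random

def stochastic_estimates(det_est, counts, a, s_max):
    # det_est: deterministic estimates d'_i(t); counts: selection counts m_i(t), assumed > 0.
    # a and s_max are the parameters of relation (2).
    estimates = []
    for d, m in zip(det_est, counts):
        s = min(a / m, s_max)                       # s_i(t) = min{a * 1/m_i(t), s_max}
        estimates.append(d + random.gauss(0.0, s))  # u_i(t) = d'_i(t) + N(0, s_i(t)^2)
    return estimates

Actions with a small selection count m_i(t) receive a large standard deviation s_i(t), so their stochastic estimates fluctuate widely and can occasionally exceed the estimate of the action that currently appears to be the best.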

Finally, T is the learning algorithm, which is presented below:

Step 1: Select an action a(t) = a_k according to the probability vector.
Step 2: Receive the feedback β(t) ∈ [0, 1] from the environment.
Step 3: Update M(t) by setting m_k(t+1) = m_k(t) + 1 and m_i(t+1) = m_i(t) for all i ≠ k.
Step 4: Compute the new deterministic estimate d'_k(t) as given by relation (1).
Step 5: For every action a_i (i = 1, ..., r) compute the new stochastic estimate u_i(t) as given by relation (2).
Step 6: Select the "optimal" action a_m that has the highest stochastic estimate of mean reward; thus, u_m(t) = max_i {u_i(t)}.
Step 7: Update the probability vector in the following way: for every action a_i (i = 1, ..., m-1, m+1, ..., r) with p_i(t) ≥ 1/N, set p_i(t+1) = p_i(t) - 1/N. For the "optimal" action a_m set

p_m(t+1) = 1 - Σ_{i≠m} p_i(t+1).

Step 8: If p_m(t) < 1 then go to Step 1; else converge to action a_m.
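To make the flow of Steps 1–8 concrete, the following Python sketch implements one possible reading of the ASELA loop. The class name, the environment interface (a callable taking an action index and returning a reward in [0, 1]) and the initialisation of the estimator by selecting each action once (as in the simulation setup of Section 4) are assumptions; probabilities are stored as integer multiples of Δ = 1/N so that the absorbing condition p_m(t) = 1 can be tested exactly.

import random

class ASELA:
    def __init__(self, r, n, a, s_max):
        self.r = r                 # number of actions
        self.N = r * n             # number of probability steps: Delta = 1/N
        self.a = a                 # noise parameter of relation (2)
        self.s_max = s_max         # bound on the standard deviation s_i(t)
        self.c = [n] * r           # p_i(t) stored as integer multiples of Delta
        self.total = [0.0] * r     # total reward received per action
        self.m = [0] * r           # selection counts m_i(t)

    def _observe(self, k, beta):
        # Steps 3-4: update the count and the running reward total of action k.
        self.m[k] += 1
        self.total[k] += beta

    def step(self, environment):
        # Step 1: select an action according to the probability vector.
        k = random.choices(range(self.r), weights=self.c)[0]
        # Step 2: receive the feedback beta(t) in [0, 1] and update the estimator.
        self._observe(k, environment(k))
        # Step 5: stochastic estimates u_i(t) = d'_i(t) + N(0, s_i(t)^2).
        u = []
        for i in range(self.r):
            d = self.total[i] / self.m[i]            # deterministic estimate, relation (1)
            s = min(self.a / self.m[i], self.s_max)  # s_i(t) of relation (2)
            u.append(d + random.gauss(0.0, s))
        # Step 6: the "optimal" action is the one with the highest stochastic estimate.
        best = max(range(self.r), key=lambda i: u[i])
        # Step 7: shift one probability step Delta from every other eligible action to it.
        for i in range(self.r):
            if i != best and self.c[i] >= 1:
                self.c[i] -= 1
        self.c[best] = self.N - sum(self.c[i] for i in range(self.r) if i != best)
        # Step 8: the automaton is absorbed when p_best = 1, i.e. c_best = N.
        return best if self.c[best] == self.N else None

    def run(self, environment):
        # Assumed initialisation: select each action once to obtain the first estimates.
        for i in range(self.r):
            self._observe(i, environment(i))
        iterations = self.r
        while True:
            iterations += 1
            best = self.step(environment)
            if best is not None:
                return best, iterations

Note that run() performs the initial round of forced selections before entering the loop, so step() can assume every selection count m_i(t) is positive when the deterministic estimates are computed.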


An S-model environment consists of three components denoted by (A, L, B), where A and B are as defined above and L = (D, F). D = {d_1, ..., d_r} is the set that contains the mean rewards of the actions at any time instant. F(t) = {f_1^t(x), ..., f_r^t(x)} is the set that contains the probability density functions of the actions' rewards at every time instant t; each f_i^t(x) is symmetric about the line x = d_i. As our automaton operates in a stationary environment, the means and the density functions of the actions' rewards are time-invariant.
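For concreteness, the sketch below shows one way such an environment can be simulated in Python; the factory-function name is hypothetical, and truncating the Gaussian reward to [0, 1] by rejection sampling is an assumption that matches the description of the noise model used in the simulations of Section 4.

import random

def make_s_model_environment(means, sigma):
    # means: the time-invariant mean rewards d_1, ..., d_r
    # sigma: standard deviation of the Gaussian reward noise
    def environment(action):
        # Draw until the sample falls in [0, 1], i.e. a truncated Gaussian reward.
        while True:
            reward = random.gauss(means[action], sigma)
            if 0.0 <= reward <= 1.0:
                return reward
    return environment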

4. Simulation results

Simulations were performed to demonstrate the superiority of the proposed stochastic estimator scheme over the deterministic estimator one [8]. Both schemes were simulated to operate in S-model stationary environments. The speed and accuracy of convergence were used as the performance metrics for evaluating the two schemes under comparison. Each automaton was said to have converged when the probability of choosing an action was exactly unity. The two automata under comparison were placed in a ten-action environment. The mean reward of the optimal action was fixed at 0.85 for all simulations, while the mean rewards of the other actions were equally spaced in the interval [0.1, 0.5]; in the r-action case we have d_1 = 0.85 and d_i = 0.5 - (i - 2)δ for i = 2, 3, ..., r, where δ = (0.5 - 0.1)/(r - 2).
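Using the hypothetical make_s_model_environment and ASELA sketches given earlier, this setup can be reproduced roughly as follows; the reward means, the variance σ² = 0.25 and the value a = 0.25 correspond to Table 1, while the resolution n and the bound s_max are placeholder values.

import math

# Ten-action environment of the simulations: d_1 = 0.85 and the remaining
# mean rewards equally spaced in the interval [0.1, 0.5].
r = 10
delta = (0.5 - 0.1) / (r - 2)
means = [0.85] + [0.5 - (i - 2) * delta for i in range(2, r + 1)]

env = make_s_model_environment(means, sigma=math.sqrt(0.25))
automaton = ASELA(r=r, n=10, a=0.25, s_max=0.5)
best, iterations = automaton.run(env)
print("converged to action", best + 1, "after", iterations, "iterations")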

Before starting the algorithm, initial estimates for D' were obtained by selecting each action once. These extra iterations were then included in the total number of iterations counted until the algorithm converged. The average results are shown in Tables 1–3. The value of the resolution parameter appears in the first column of each table; in each case the resolution parameter N is defined by the relation Δ = 1/N, N = rn, where, as mentioned earlier, r is the number of actions and Δ is the step size of the probability vector. The second column contains the accuracy that corresponds to the resolution parameter of the first column; the automaton is said to have converged accurately if it converged to the best action. The mean number of iterations required for convergence appears in the third column.

Table 1
Stochastic vs. deterministic estimator in an S-model automaton with a = 0.25, in a 10-action environment with σ² = 0.25

          Stochastic estimator                Deterministic estimator
          Accuracy (%)   Speed (iterations)   Accuracy (%)   Speed (iterations)
N = 25    95.48          11.95                86.62           4.95
N = 50    98.58          18.67                91.42           9.72
N = 100   99.51          33.21                96.33          17.7
N = 200   99.71          55.19                97.97          37.01
N = 300   –              –                    98.96          45.79


The noise in the environment was simulated by taking truncated samples from normal (Gaussian) distributions centred at the mean rewards and having variance σ². As a result of our experiments, we observe that for a given accuracy the stochastic estimator scheme is much faster than the deterministic one.

Table 2
Stochastic vs. deterministic estimator in an S-model automaton with a = 0.35, in a 10-action environment with σ² = 0.25

          Stochastic estimator                Deterministic estimator
          Accuracy (%)   Speed (iterations)   Accuracy (%)   Speed (iterations)
N = 25    97.2           16.66                86.62           4.95
N = 50    99.41          25.7                 91.42           9.72
N = 100   99.8           39.28                96.33          17.7
N = 200   99.89          61.7                 97.97          37.01
N = 300   –              –                    98.96          45.79

Table 3
Stochastic vs. deterministic estimator in an S-model automaton with a = 0.35, in a 10-action environment with σ² = 0.35

          Stochastic estimator                Deterministic estimator
          Accuracy (%)   Speed (iterations)   Accuracy (%)   Speed (iterations)
N = 25    90.39          16.78                75.23           8.72
N = 50    96.27          29.58                82.36          16.13
N = 100   97.93          49.03                90.6           30.89
N = 200   98.6           76.51                95.01          48.5
N = 300   –              –                    96.75          66.16
N = 400   –              –                    97.82          79.3

5. Conclusion

An S-model absorbing LA that uses a stochastic estimator in order to achieve high accuracy and a high speed of convergence in stationary environments is introduced. Extensive simulation results are presented which indicate that the proposed ASELA scheme achieves a superior performance over the well-known deterministic-estimator-based absorbing schemes when they operate in S-model stationary environments.

References

[1] K. Najim, A.S. Poznyak, Learning Automata: Theory and Applications, Pergamon, Oxford, 1994.



[2] K.S. Narendra, M.A.L. Thathachar, Learning Automata: An Introduction, Prentice Hall, New Jersey, 1989.

[3] K.S. Narendra, M.A.L. Thathachar, Learning automata: A survey, IEEE Trans. Syst. Man Cybern. SMC-4 (4) (1974) 323–334.

[4] K.S. Narendra, S. Lakshmivarahan, Learning automata: A critique, J. Cybern. Inform. Sci. 1 (1977) 53–66.

[5] G.I. Papadimitriou, Hierarchical discretized pursuit nonlinear learning automata with rapid convergence and high accuracy, IEEE Trans. Knowledge Data Eng. 6 (4) (1994).

[6] G.I. Papadimitriou, A new approach to the design of reinforcement schemes for learning automata: stochastic estimator learning algorithms, IEEE Trans. Knowledge Data Eng. 6 (4) (1994).

[7] M.A.L. Thathachar, P.S. Sastry, A class of rapidly converging algorithms for learning automata, IEEE Trans. Syst. Man Cybern. SMC-15 (1) (1985) 168–175.

[8] B.J. Oommen, J.K. Lanctot, Discretized pursuit learning automata, IEEE Trans. Syst. Man Cybern. SMC-20 (4) (1990) 931–938.
