Absorbing stochastic estimator learning automata for S-model stationary environments



G.I. Papadimitriou *, A.S. Pomportsis, S. Kiritsi, E. Talahoupi

    Department of Informatics, Aristotle University, Box 888, 54006 Thessaloniki, Greece

    Received 26 September 2001; received in revised form 10 March 2002; accepted 8 May 2002

    Abstract

An S-model absorbing learning automaton (LA) which is based on the use of a stochastic estimator is introduced. According to the proposed stochastic estimator scheme, the estimates of the mean rewards of actions are computed stochastically. Actions that have not been selected many times have the opportunity to be estimated as optimal, to increase their choice probabilities and, consequently, to be selected. In this way, the automaton's accuracy and speed of convergence are significantly improved.

© 2002 Elsevier Science Inc. All rights reserved.

    1. Introduction

Learning automata (LA) [1,2] have attracted considerable interest in the last decades due to their potential usefulness in a variety of engineering problems that are characterized by nonlinearity and a high level of uncertainty. They have the property of progressively improving their performance. The environment communicates with the learning system and supplies it with feedback information. The automaton uses this information to find which action is optimal. A LA is one that adapts itself to the environment by learning the

Information Sciences 147 (2002) 193–199

www.elsevier.com/locate/ins

* Corresponding author. Fax: +30-310-998419.

E-mail address: gp@csd.auth.gr (G.I. Papadimitriou).

0020-0255/02/$ - see front matter © 2002 Elsevier Science Inc. All rights reserved.
PII: S0020-0255(02)00263-3

optimal action and that ultimately chooses this action more frequently than other actions.

The environment is said to be stationary if the mean rewards of the actions do not depend on time; otherwise it is said to be nonstationary. Different kinds of LA have been proposed. According to the nature of its input, a LA can be characterized as a P-, Q- or S-model. A LA is a Q-model automaton [3,4] if the input set is a finite set of distinct symbols. If the input set is binary ({0,1}), it is called a P-model automaton [3–5]. Finally, if the automaton's input can take any real value in the [0,1] range, the automaton is called an S-model one [3,4,6]. S-model LA are considered in this paper.

With respect to their Markovian representation, LA are classified into two main categories: ergodic automata and automata possessing absorbing barriers [2–4]. The ergodic automata converge with a distribution that is independent of the initial state. On the other hand, the automata with absorbing states get locked into (converge to) a particular action after a finite number of steps.

Non-estimator learning algorithms update the probability vector based directly on the environment's feedback. On the other hand, estimator algorithms [7] are characterized by the use of a running estimate of the mean reward of each action. The change in the probability of choosing an action is based on the running estimates of the probability of reward rather than on the feedback from the environment. These algorithms, at every time instant, increase the probability of choosing the action with the maximum current estimate of mean reward. Simulation results have demonstrated the superiority of the estimator algorithms over the traditional learning algorithms [5,7].

Stochastic estimator learning algorithms were introduced in [6] in an effort to achieve high adaptivity when the automaton operates in rapidly switching nonstationary environments. In this paper we present a new stochastic-estimator-based scheme, which is capable of achieving a high accuracy and a rapid convergence when operating in stationary environments.

Section 2 introduces the reader to the basic concepts of the stochastic estimator. The proposed Absorbing Stochastic Estimator Learning Automaton (ASELA) scheme is presented in Section 3. In Section 4 extensive simulation results are presented, which indicate that the proposed scheme achieves a high performance when operating in stationary environments. Finally, concluding remarks are given in Section 5.

    2. The stochastic estimator

An S-model stationary environment is considered. The LA keeps running estimates of the actions' mean rewards. The estimates are computed stochastically, so they are not strictly dependent on the environmental responses. A zero-mean normally distributed random variable is added to the deterministic estimates in order to obtain the stochastic ones. The variance of the normally distributed random variable differs from action to action and is inversely proportional to the number of times that each action has been selected. In this way, the LA gives actions that have been selected only a few times, and whose estimates are therefore considered unreliable, the opportunity to be estimated as optimal, to increase their choice probability and, consequently, to be selected.

This kind of estimator, which determines the estimates of the actions in a nondeterministic way, is called a stochastic estimator. The use of a stochastic estimator in S-model nonstationary environments has been studied in [6]. In the present paper we study the use of a stochastic estimator in S-model stationary environments.

    3. The absorbing stochastic estimator learning automaton

The Absorbing Stochastic Estimator Learning Automaton (ASELA) is defined by the quintuple {A, B, P, E, T}.

A is the set of r actions that the automaton can choose from, so $A = \{\alpha_1, \ldots, \alpha_r\}$. The automaton is allowed one choice at each time instant $t$, and its choice is denoted as $\alpha(t)$, where $\alpha(t) \in A$.

The set of possible responses from the environment is denoted by B. Since an S-model environment is considered, $B = [0, 1]$. The environmental response at time instant $t$ is denoted by $\beta(t)$, where $\beta(t) \in B$.

The mean reward of action $\alpha_i$ at time instant $t$ is denoted by $d_i(t)$. Since the environment is stationary, $d_i(t)$ is constant for all $t$, so the time index is dropped and the quantity is denoted $d_i$. The set of the actions' mean rewards is defined as $D = \{d_1, \ldots, d_r\}$. It is assumed that the set $D$ has a unique maximum $d_b = \max_{1 \le i \le r}\{d_i\}$; the action possessing $d_b$, namely $\alpha_b$, is referred to as the best action. The value of each $d_i \in D$ is unknown to the automaton, so its task is to decide which action is the best. It bases its decision on the information gained by selecting actions and observing the environmental feedback. This cycle continues until the learning process is terminated.

P is a probability distribution over the set of actions: $P(t) = \{p_1(t), \ldots, p_r(t)\}$, where $p_i(t)$ is the probability of selecting action $\alpha_i$ at time instant $t$. In discretized automata there are only finitely many values for $p_i(t)$; namely, $p_i(t) \in \{0, \Delta, 2\Delta, 3\Delta, \ldots, 1\}$ for all $t$. Here $\Delta$ is referred to as the smallest step size and is inversely proportional to the total number of subdivisions of the probability space [0,1]. This parameter is defined by $\Delta = 1/N$, where $N = rn$, $r$ is the number of actions and $n$ is the resolution parameter.
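As an illustration of the discretized probability space, the following sketch builds the admissible probability grid; the values r = 10 and n = 20 are assumptions for the example, not values from the paper.

```python
# Illustrative sketch of the discretized probability space of a
# discretized LA; r = 10 and n = 20 are assumed example values.
r, n = 10, 20              # number of actions, resolution parameter
N = r * n                  # total number of subdivisions of [0, 1]
delta = 1.0 / N            # smallest step size Delta = 1/N
grid = [k * delta for k in range(N + 1)]   # admissible values 0, Delta, 2*Delta, ..., 1
print(N, delta, len(grid))
```

Every choice probability is restricted to this grid, which is what makes the automaton's Markov chain finite.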

E is the estimator, which at any time instant contains the estimated environmental characteristics. We define $E(t) = \langle D'(t), M(t), U(t) \rangle$, where $D'(t) = \{d'_1(t), \ldots, d'_r(t)\}$ is the Deterministic Estimator Vector, which contains the


current deterministic estimates of the mean rewards of the actions, computed as follows (for $i = 1, \ldots, r$):

$$d'_i(t) = \frac{\text{Total reward received up to time } t \text{ for the selections of action } \alpha_i}{\text{Number of times action } \alpha_i \text{ has been selected up to time } t} = \frac{\sum_{k=1}^{T_i} Q_i^k(t)}{m_i(t)}, \quad (1)$$

where $Q_i^k(t)$, for $k = 1, \ldots, T_i$, are the rewards received at each time $k$ that action $\alpha_i$ was selected, and $T_i$ is the last time that $\alpha_i$ was selected. $M(t) = \{m_1(t), \ldots, m_r(t)\}$, where $m_i(t)$ $(i = 1, \ldots, r)$ is the number of times that action $\alpha_i$ has been selected up to time instant $t$. $U(t) = \{u_1(t), \ldots, u_r(t)\}$ is the Stochastic Estimator Vector, which at any time instant $t$ contains the current stochastic estimates of the mean rewards of the actions. The current stochastic estimate $u_i(t)$ $(i = 1, \ldots, r)$ of the mean reward of action $\alpha_i$ is defined as follows:

$$u_i(t) = d'_i(t) + N(0, \sigma_i^2(t)), \quad (2)$$

where $\sigma_i(t) = \min\{a \cdot \tfrac{1}{m_i(t)}, \sigma_{\max}\}$.

$N(0, \sigma_i^2(t))$ denotes a random number selected with a normal probability distribution, with a mean equal to 0 and a variance equal to $\sigma_i^2(t)$. The parameter $a$ is an internal automaton parameter that determines how rapidly the stochastic estimates become independent from the deterministic ones. When $a = 0$, no noise is added to the deterministic estimates. $\sigma_{\max}$ is the maximum permitted value of $\sigma_i(t)$ $(i = 1, \ldots, r)$; it limits the variance of the stochastic estimates so that it does not increase infinitely.
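Relation (2) can be sketched in code as follows; the default values for a and the variance cap are illustrative assumptions only.

```python
import random

def stochastic_estimate(d_det, m, a=0.25, s_max=1.0):
    """Relation (2): u_i(t) = d'_i(t) + N(0, sigma_i(t)^2), with
    sigma_i(t) = min(a * (1 / m_i(t)), sigma_max).

    d_det -- deterministic estimate d'_i(t)
    m     -- m_i(t), number of times action alpha_i has been selected
    a     -- internal noise parameter (a = 0 adds no noise)
    s_max -- cap on the standard deviation (illustrative default)
    """
    sigma = min(a * (1.0 / m), s_max)  # rarely selected actions get more noise
    return d_det + random.gauss(0.0, sigma)
```

An action selected only once receives noise with standard deviation a, while the estimate of a frequently selected action stays close to its deterministic value; this is exactly the mechanism that lets unreliable actions be (temporarily) estimated as optimal.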

Finally, T is the learning algorithm, which is presented below:

Step 1: Select an action $\alpha(t) = \alpha_k$ according to the probability vector.
Step 2: Receive the feedback $\beta(t) \in [0, 1]$ from the environment.
Step 3: Update $M(t)$ by setting $m_k(t+1) = m_k(t) + 1$ and $m_i(t+1) = m_i(t)$ for all $i \ne k$.
Step 4: Compute the new deterministic estimate $d'_k(t)$ as given by relation (1).
Step 5: For every action $\alpha_i$ $(i = 1, \ldots, r)$ compute the new stochastic estimate $u_i(t)$ as given by relation (2).
Step 6: Select the optimal action $\alpha_m$ that has the highest stochastic estimate of mean reward; thus $u_m(t) = \max_i\{u_i(t)\}$.
Step 7: Update the probability vector in the following way: for every action $\alpha_i$ $(i = 1, \ldots, m-1, m+1, \ldots, r)$ with $p_i(t) \ge 1/N$, set $p_i(t+1) = p_i(t) - 1/N$. For the optimal action $\alpha_m$ set
$$p_m(t+1) = 1 - \sum_{i \ne m} p_i(t+1).$$
Step 8: If $p_m(t) < 1$ then GOTO Step 1, else CONVERGE to action $\alpha_m$.
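The eight steps above can be sketched as a single Python loop. This is a minimal reading of Section 3, with illustrative parameter defaults (not the paper's values); probabilities are kept as integer counts $c_i = p_i N$ so the discretized updates stay exact, and the environment is modeled as any callable env(i) returning a reward in [0, 1].

```python
import random

def asela(env, r, n=20, a=0.25, s_max=1.0, max_steps=200000):
    """Minimal sketch of the ASELA loop (Steps 1-8); parameter defaults
    are illustrative assumptions.  env(i) returns a reward in [0, 1]."""
    N = r * n                      # resolution: Delta = 1/N
    c = [n] * r                    # integer counts: p_i = c_i/N, initially 1/r
    totals = [0.0] * r             # cumulative reward per action
    m = [0] * r                    # selection counts m_i(t)
    for i in range(r):             # obtain initial estimates: try each action once
        totals[i] += env(i)
        m[i] += 1
    for _ in range(max_steps):
        k = random.choices(range(r), weights=c)[0]      # Step 1: select action
        beta = env(k)                                   # Step 2: get feedback
        m[k] += 1                                       # Step 3: update M(t)
        totals[k] += beta                               # Step 4: d'_k = totals/m
        u = [totals[i] / m[i]                           # Step 5: relation (2)
             + random.gauss(0.0, min(a / m[i], s_max))
             for i in range(r)]
        best = max(range(r), key=u.__getitem__)         # Step 6: highest estimate
        for i in range(r):                              # Step 7: shift probability
            if i != best and c[i] >= 1:
                c[i] -= 1
        c[best] = N - sum(c[i] for i in range(r) if i != best)
        if c[best] == N:                                # Step 8: p_m = 1, absorbed
            return best
    return max(range(r), key=c.__getitem__)             # safety fallback
```

With a = 0 no noise is added, and the sketch reduces to a purely deterministic-estimator pursuit scheme.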


An S-model environment consists of three components denoted by (A, L, B), where A and B are as defined above and $L = \langle D, F \rangle$. $D = \{d_1, \ldots, d_r\}$ is the set that contains the mean rewards of the actions at any time instant. $F(t) = \{f_1^t(x), \ldots, f_r^t(x)\}$ is the set that contains the probability density functions of the actions' rewards at every time instant $t$; each $f_i^t(x)$ is symmetric about the line $x = d_i$. As our automaton operates in a stationary environment, the means and the density functions of the actions' rewards are time-invariant.

    4. Simulation results

Simulations were performed to demonstrate the superiority of the proposed stochastic estimator scheme over the deterministic estimator one [8]. Both schemes were simulated operating in S-model stationary environments. The speed and accuracy of convergence were used as the performance metrics for evaluating the two schemes under comparison. An automaton was said to have converged when the probability of choosing an action was exactly unity. The two automata under comparison were placed in a ten-action environment. The mean reward of the optimal action was fixed at 0.85 for all simulations, while the mean rewards of the other actions were equally spaced in the interval [0.1, 0.5]; that is, in the $r$-action case we have $d_1 = 0.85$ and $d_i = 0.5 - (i - 2)d$ for $i = 2, 3, \ldots, r$, where $d = (0.5 - 0.1)/(r - 2)$.
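The reward configuration just described can be written out as a short sketch; the helper name is ours, not the paper's.

```python
def mean_rewards(r):
    """Section 4 configuration: d_1 = 0.85 and the remaining r - 1 mean
    rewards equally spaced in [0.1, 0.5] (helper name is illustrative)."""
    d = (0.5 - 0.1) / (r - 2)                    # spacing between actions
    return [0.85] + [0.5 - (i - 2) * d for i in range(2, r + 1)]
```

For the ten-action environment this gives a spacing of d = 0.05 and mean rewards 0.85, 0.5, 0.45, ..., 0.1.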

Before starting the algorithm, initial estimates for $D'$ were obtained by selecting each action once; these initial selections were included in the total number of iterations until the algorithm converged. The average results are shown in Tables 1–3. The value of the resolution parameter appears in the first column of each table; in each case the resolution parameter $N$ is defined by $\Delta = 1/N$, $N = rn$, where, as noted earlier, $r$ is the number of actions and $\Delta$ is the step size of the probability vector. The second column contains the accuracy that corresponds to the resolution parameter of the first column; the automaton is said to have converged accurately if it converged to the best action. The mean number of iterations required for convergence appears in the third column.

Table 1
Stochastic vs. deterministic estimator in an S-model automaton with a = 0.25, in a 10-action environment with σ² = 0.25

          Stochastic estimator     Deterministic estimator
          Accuracy   Speed         Accuracy   Speed
N = 25    95.48      11.95         86.62       4.95
N = 50    98.58      18.67         91.42       9.72
N = 100   99.51      33.21         96.33      17.7
N = 200   99.71      55.19         97.97      37.01
N = 300   —          —             98.96      45.79


The noise in the environment was simulated by taking truncated samples from normal (Gaussian) distributions with the given mean rewards and a variance σ². Our experiments show that, for a given accuracy, the stochastic estimator scheme is much faster than the deterministic one.
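The reward sampling described above can be sketched as follows; rejection sampling is our assumed implementation of the truncation, since the paper does not state how it was performed.

```python
import random

def s_model_reward(d_i, variance=0.25):
    """Sample a reward for an action with mean reward d_i: a Gaussian
    sample truncated to [0, 1].  Rejection sampling is an assumed
    implementation of the truncation, not the paper's."""
    sigma = variance ** 0.5
    while True:
        x = random.gauss(d_i, sigma)
        if 0.0 <= x <= 1.0:
            return x
```

With variance 0.25 the standard deviation is 0.5, so the actions' reward distributions overlap heavily, which is what makes the estimation problem noisy.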

    5. Conclusion

An S-model absorbing LA that uses a stochastic estimator in order to achieve a high accuracy and a high speed of convergence in stationary environments has been introduced. Extensive simulation results are presented which indicate that the proposed ASELA scheme achieves superior performance over the well-known deterministic-estimator-based absorbing schemes when they operate in S-model stationary environments.

    References

[1] K. Najim, A.S. Poznyak, Learning Automata: Theory and Applications, Pergamon, Oxford, 1994.

Table 2
Stochastic vs. deterministic estimator in an S-model automaton with a = 0.35, in a 10-action environment with σ² = 0.25

          Stochastic estimator     Deterministic estimator
          Accuracy   Speed         Accuracy   Speed
N = 25    97.2       16.66         86.62       4.95
N = 50    99.41      25.7          91.42       9.72
N = 100   99.8       39.28         96.33      17.7
N = 200   99.89      61.7          97.97      37.01
N = 300   —          —             98.96      45.79

Table 3
Stochastic vs. deterministic estimator in an S-model automaton with a = 0.35, in a 10-action environment with σ² = 0.35

          Stochastic estimator     Deterministic estimator
          Accuracy   Speed         Accuracy   Speed
N = 25    90.39      16.78         75.23       8.72
N = 50    96.27      29.58         82.36      16.13
N = 100   97.93      49.03         90.6       30.89
N = 200   98.6       76.51         95.01      48.5
N = 300   —          —             96.75      66.16
N = 400   —          —             97.82      79.3


[2] K.S. Narendra, M.A.L. Thathachar, Learning Automata: An Introduction, Prentice Hall, New Jersey, 1989.

[3] K.S. Narendra, M.A.L. Thathachar, Learning automata: A survey, IEEE Trans. Syst. Man Cybern. SMC-4 (4) (1974) 323–334.

[4] K.S. Narendra, S. Lakshmivarahan, Learning automata: A critique, J. Cybern. Inform. Sci. 1 (1977) 53–66.

[5] G.I. Papadimitriou, Hierarchical discretized pursuit nonlinear learning automata with rapid convergence and high accuracy, IEEE Trans. Knowledge Data Eng. 6 (4) (1994).

[6] G.I. Papadimitriou, A new approach to the design of reinforcement schemes for learning automata: stochastic estimator learning algorithms, IEEE Trans. Knowledge Data Eng. 6 (4) (1994).

[7] M.A.L. Thathachar, P.S. Sastry, A class of rapidly converging algorithms for learning automata, IEEE Trans. Syst. Man Cybern. SMC-15 (1) (1985) 168–175.

[8] B.J. Oommen, J.K. Lanctot, Discretized pursuit learning automata, IEEE Trans. Syst. Man Cybern. SMC-20 (4) (1990) 931–938.
