
A new approach to the design of reinforcement schemes for learning automata: Stochastic estimator learning algorithm




ELSEVIER Neurocomputing 7 (1995) 275-297

A new approach to the design of reinforcement schemes for learning automata:

Stochastic estimator learning algorithm

Athanasios V. Vasilakos a,*, Georgios I. Papadimitriou b

a Hellenic Air Force Academy, Department of Computer Science, P.O. Box 1010, Dekeleia-Attiki, Greece
b Department of Computer Engineering, University of Patras, 26500 Patras, Greece

Received 26 May 1992; accepted 7 March 1994

Abstract

In this paper a new approach to the design of S-model ergodic reinforcement learning algorithms is introduced. The new scheme utilizes a stochastic estimator and is able to operate in non-stationary environments with high accuracy and a high adaptation rate. According to the stochastic estimator scheme, which is the first attempt of its kind in the field, the estimates of the mean rewards of the actions are computed stochastically, so they are not strictly dependent on the environmental responses. The dependence between the stochastic estimates and the deterministic estimator's contents becomes more relaxed when the latter have not been updated recently. In this way, actions that have not been selected recently have the opportunity to be estimated as 'optimal', to increase their choice probability and consequently to be selected. Thus, the estimator is always recently updated and consequently able to adapt to environmental changes. The performance of the presented Stochastic Estimator Learning Automaton (SELA) is superior to that of all previous well-known S-model ergodic schemes. Furthermore, it is proved that SELA is ε-optimal in every S-model random environment.

Keywords: Stochastic estimator; Learning window; Ergodic learning algorithm; Discretized learning algorithm; Probability slice

1. Introduction

Adaptive Learning is one of the main fields of Artificial Intelligence. A Learning Automaton is one of the most powerful tools in this scientific area. It is a

* Corresponding author. Address: A.V. Vasilakos, P.O. Box 65251, 15410 Psychiko, Athens, Greece

0925-2312/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0925-2312(94)00027-P


Fig. 1. An S-model learning automaton that interacts with a nonstationary stochastic environment. The automaton selects an action and receives a feedback b(t) ∈ [0,1] from an environment whose mean rewards D(t) = {d_1^t, d_2^t, ..., d_r^t} may vary with time.

finite state machine that interacts with a stochastic environment trying to learn the optimal action this environment offers. Over the last few years, learning automata have been extensively studied in the literature. We refer the reader to the outstanding survey papers by Narendra and Thathachar [12] and Narendra and Lakshmivarahan [13] for a review of various families of learning automata.

A learning automaton that interacts with a stochastic environment tries to learn the optimal action offered by the environment, via a learning process. The learning process is the following (see Fig. 1). The automaton chooses one of the offered actions according to a probability vector, which at every instant contains the probability of choosing each action. The chosen action triggers the environment that responds with a feedback, dependent on the stochastic characteristics of the chosen action. The automaton takes into account this answer and modifies its state by means of a transition function. The new state of the automaton corresponds to a new probability vector given by a function, called output function. A learning automaton is one which learns the action that has the maximum mean reward and which ultimately chooses this action more frequently than other actions.
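A minimal sketch of this interaction loop is given below in Python; the environment object and its sample_reward(action) method are illustrative assumptions used only to make the loop concrete, not part of the paper.

```python
import random

def automaton_environment_loop(env, probs, update_rule, steps=1000):
    """Generic variable-structure learning automaton loop.

    env         -- object with a sample_reward(action) method returning b(t) in [0, 1]
    probs       -- initial action probability vector P(0)
    update_rule -- function (probs, action, feedback) -> new probability vector
    """
    r = len(probs)
    for _ in range(steps):
        # Choose an action according to the current probability vector P(t).
        action = random.choices(range(r), weights=probs)[0]
        # The environment answers with a feedback b(t) that depends on the
        # stochastic characteristics of the chosen action.
        feedback = env.sample_reward(action)
        # The transition/output functions map the old state and the feedback
        # to a new probability vector P(t + 1).
        probs = update_rule(probs, action, feedback)
    return probs
```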

According to the nature of its input, a learning automaton can be characterized as a P-, Q- or S-model one. If the input set is binary ({0, 1}) it is characterized as a P-model [1,2,3,5,6,7,12,13,21]. A learning automaton is a Q-model [12,13,21] if the input set is a finite set of distinct symbols. Finally, if the automaton's input can take any real value in the [0,1] range, it is characterized as an S-model [8,9,12,13,21,22,23].

Another way to classify learning automata is according to the character of the previously mentioned transition and output functions. From this point of view, learning automata can be classified into two main categories: fixed structure automata and variable structure automata. A fixed structure automaton is one whose transition and output functions are time-invariant. The reader can consult Tsetlin [18,19] for examples of fixed structure learning automata. A variable structure stochastic automaton is one whose transition and output functions are time-varying. Variable structure automata are more powerful learning machines, and so most of the research in the learning automata area has involved this category of automata.

According to their Markovian representation, learning automata are classified into two main categories: ergodic [2,3,15,20], or automata possessing absorbing


barriers [2,3,6,7,10,12]. The ergodic automata converge with a distribution independent of the initial state. If the stochastic characteristics of the actions are not stable (nonstationary environment), ergodic automata are preferred. On the other hand, the automata with absorbing states get 'locked' (converge) into a specific state after a finite number of steps. Absorbing automata are preferred when the stochastic characteristics of the actions are stable (stationary environment).

With respect to the values the action probabilities can take, a learning automaton can be characterized as a continuous or a discretized one. In the former case the action probabilities can take any value in the [0,1] interval. This class of learning automata suffers from slow convergence. In the latter, the action probabilities can take values from a finite set only. By discretizing the probability space there is an increase in the speed of the automaton's convergence. In the continuous automaton's case, as time tends to infinity, the choice probability of the optimal action tends to unity but never becomes equal to unity. The discretization of the probability space wipes this disadvantage out.

Important results on the application of learning automata in computer networks can be found in [26,27] and [28].

The slow convergence rate of learning automata has been a limiting factor in their applications. Estimator learning algorithms were introduced by Thathachar and Sastry [14] as an effort to solve this problem, by using running estimates of the environmental characteristics. Other authors [29,30,31] have proposed learning models consisting of networks of neuron-like elements.

All classic learning algorithms update the probability vector based directly on the environment’s answer. If this answer is a reward then the automaton increases the probability of choosing this action (which caused the environment’s reward) at the next time instant. Otherwise, the choice probability of the selected action is decreased.

Estimator learning algorithms are characterized by the use of a running estimate of the mean reward of each action. The change of the probability of choosing an action is based on its current estimated mean reward, rather than on the feedback of the environment. The environment's response determines the probability vector not directly, but indirectly, by determining the estimates of the actions' mean rewards. Even when an action is rewarded, it is possible that the choice probability of another action is increased. Usually, estimator algorithms increase the choice probability of the action that has the highest estimated mean reward.
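To make the contrast concrete, the following schematic update (a generic illustration, not any specific published scheme) refreshes only the running estimate of the selected action and then shifts probability toward the action with the highest estimate; the names, the exponential smoothing rule and the step size are illustrative assumptions.

```python
def estimator_style_update(probs, estimates, action, feedback, step=0.05):
    """Schematic estimator-style update: the feedback only refreshes the running
    estimate of the selected action; probability mass then moves toward the
    action whose *estimated* mean reward is currently the highest."""
    # Running estimate of the selected action's mean reward (illustrative rule).
    estimates[action] = 0.9 * estimates[action] + 0.1 * feedback
    best = max(range(len(probs)), key=lambda i: estimates[i])
    for i in range(len(probs)):
        if i != best:
            delta = min(step, probs[i])   # never drive a probability below zero
            probs[i] -= delta
            probs[best] += delta
    return probs
```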

Simulation results have shown the superiority of the estimator learning algorithms over the classic learning algorithms [5,6,10,14].

The performance of estimator learning algorithms decreases when they operate in a nonstationary stochastic environment. This is due to the existence of old, and consequently invalid, feedback information in the estimator.

In reference [5], Vasilakos and Papadimitriou introduce a window technique that eliminates the low adaptation rate problem of P-model ergodic schemes.

In this paper we present a stochastic estimator scheme which, in combination with the window technique presented in [5], gives a satisfactory solution to the low adaptation rate problem of S-model learning automata.


According to this scheme the estimates of the mean rewards of actions are computed stochastically. So, they are not strictly dependent on environmental responses. The dependence between the stochastic estimates and the deterministic estimator’s contents is more relaxed if the latter are not updated and probably invalid. In this way, actions that have not been selected recently, have the opportunity to be estimated as ‘optimal’, to increase their choice probability and consequently, to be selected. Thus, the estimator is always recently updated and, consequently, able to adapt to environmental changes.

Extensive simulation results indicate that the proposed stochastic estimator learning automaton achieves a significantly higher performance than the previous well-known S-model ergodic schemes.

The structure of this paper is as follows. The introduction to the mathematical model of S-model learning automata in Section 2 is followed by a review of other S-model ergodic schemes in Section 3. Section 4 introduces the reader to the new stochastic estimator scheme, while in Section 5 a detailed description of the new learning algorithm is presented. In Section 6 we give the formal definition of the SELA learning automaton, followed by the proof of its ε-optimality in Section 7.

Extended simulation results that prove the superiority of SELA’s performance are presented in Section 8. Finally, a brief discussion of the proposed scheme and its possible applications closes the paper in Section 9.

2. S-model learning automata

An S-model learning automaton is defined [8,26] by the sextuple (A, B, Q, P, T, G) where: A = {a_1, a_2, ..., a_r} is the set of the r actions (2 ≤ r < ∞) offered by the environment. The action selected at the time instant t is symbolized as:

a(t) = a_k ∈ A.

B = [0,1] is the input set of the possible environmental responses. The environmental response can take any value in the [0,1] interval. Thus, the environmental response at time instant t is symbolized as:

b(t) ∈ B = [0,1].

Q is the set of the possible internal states of the automaton. Since there is a one-to-one relation (G) between the automaton's states and the actions, the state set Q could be omitted from the sextuple (A, B, Q, P, T, G). However, for reasons of completeness we include Q in the formal definition of the SELA learning automaton. P is the probability distribution over the set of states. We have: P(t) = [P_1(t), P_2(t), ..., P_r(t)], where P_i(t) is the probability of selecting action a_i ∈ A at time instant t. T is the learning algorithm that modifies the probability vector P at each iteration by using the environmental feedback. G: Q → A is the output function. G is usually a deterministic one-to-one function; thus P is also a probability distribution over the set of actions.


For automata that maintain estimates of the environmental characteristics, the sextuple defined above is extended to the septuple (A, B, Q, P, T, G, E) by adding the estimator E, which contains the estimated environmental characteristics.

Given that the action a_i is selected at time instant t, the environment responds with a reward drawn with a mean d_i^t and a density function f_i^t(x) (−∞ < x < +∞). Usually, f_i^t(x) is symmetric about the line x = d_i^t. Thus, f_i^t(d_i^t + h) = f_i^t(d_i^t − h) for every real number h.

The environment in which the automaton operates is defined by the triple (A, L, B) where A and B are as defined above and L(t) = (D(t), F(t)). We define: D(t) = {d_1^t, d_2^t, ..., d_r^t} is the set that contains the mean rewards of the actions at any time instant t. Thus, d_i^t = E[b(t) | a(t) = a_i ∈ A]. F(t) = {f_1^t(x), f_2^t(x), ..., f_r^t(x)} is the set that contains the probability density functions of the actions' rewards at every time instant t.

The environment is characterized as ‘stationary’ if the means and the density functions of the actions’ rewards are time-invariant and as ‘nonstationary’ if they are time-variant.

A learning automaton operating in a stationary stochastic environment is 'optimal' if lim_{t→∞} P_m(t) = 1 with probability 1, where a_m is the action that has the highest mean reward. A learning automaton is characterized as 'ε-optimal' if there is an internal parameter N such that:

$$\lim_{N \to \infty} \lim_{t \to \infty} E[P_m(t)] = 1$$

3. Other S-model ergodic schemes

The main goal in the design of the S-model ergodic learning automata is to achieve a high accuracy and a high rate of adaptation to environmental changes.

The classic SL_RP scheme [8,12,26] was a first effort to construct an S-model variable structure learning automaton able to operate in nonstationary environments. The accuracy of the SL_RP scheme is not satisfactory, while its adaptation rate is low. The absence of any kind of environmental estimation leads to a decrease of the accuracy of the scheme. Furthermore, the low speed problem of all continuous learning automata is also present in the SL_RP case.

The G_2 learning automaton, introduced by Kurose and Simha in [9], uses a gradient-projection based learning algorithm in order to achieve high accuracy. The G_2 automaton keeps estimates of the mean rewards of the actions and modifies the action probabilities according to these estimates. Although the utilization of an estimator increases the accuracy, the G_2 scheme suffers from a low rate of adaptation to environmental changes. This is due to the absence of any kind of adaptation mechanism.

Consider the following situation: At time instant t an action a_i has a low mean reward d_i. Its estimated mean reward d_i'(t) is also low, and its choice probability is close to zero (P_i(t) ≈ 0). At time instant t + 1 an environmental switching causes


an increase of the mean reward of action a_i, so that action a_i is now the optimal one. Since the choice probability of action a_i is close to zero, it is probable that this action will not be selected in the near future. Thus, the G_2 automaton keeps a wrong estimate of the mean reward of action a_i for a long time and, consequently, is unable to adapt to the environmental switching. This leads to a serious decrease of the automaton's performance when it operates in a nonstationary environment.

4. The stochastic estimator

Consider a learning automaton that operates in a nonstationary environment and keeps estimates of the actions' mean rewards. As the time which has elapsed since the last instant when action a_i was selected increases, the probability of an environmental switching increases. Consequently, the probability that an unchanged estimate is still valid gradually decreases. An estimator that contains 'old' and consequently invalid feedback information leads to an automaton which is incapable of adapting to environmental changes.

Although it may look strange, the main disadvantage of a classical estimator is its absolute confidence in its value. An estimator specially designed for operation in nonstationary environments must be able to ‘doubt’ the validity of ‘old’ environ- mental responses and simultaneously to give actions that have not been selected recently the opportunity to be chosen.

In this paper we present an estimator scheme which expresses its ‘doubt’ about the validity of its ‘old’ contents by adding a zero mean normally distributed random number to each action’s estimate. The variance of the normal distribution differs from action to action, and is proportional to the time which has elapsed from the last instant when the action was selected. In this way, it gives to actions that have not been selected often the opportunity to be reckoned as ‘optimal’, and thus to increase their choice probability and, consequently, to be selected. This kind of estimator which determines the estimated mean rewards of the actions in a non-deterministic way is called stochastic estimator. Note that this is the first attempt in the field of learning algorithms not utilizing the true estimate of the reward (or penalty) probabilities in the search of the optimal action in non-sta- tionary environments.
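A minimal sketch of this stochastic-estimate computation follows (in Python; the variance law mirrors the definition σ_i(t) = min(α·m_i(t), σ_max) given later in Section 6, and the parameter values and function name are illustrative assumptions):

```python
import random

def stochastic_estimates(det_estimates, oldness, alpha=0.05, sigma_max=0.5):
    """Perturb each deterministic estimate with zero-mean Gaussian noise whose
    spread grows with the time elapsed since the action was last selected."""
    u = []
    for d_i, m_i in zip(det_estimates, oldness):
        sigma_i = min(alpha * m_i, sigma_max)   # bounded 'doubt' about old estimates
        u.append(d_i + random.gauss(0.0, sigma_i))
    return u
```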

As simulation results show, the utilization of a stochastic estimator leads to a dramatic improvement of the automaton’s performance when it operates in a nonstationary stochastic environment.

5. The stochastic estimator learning automaton (SELA)

At each step the automaton selects an action according to the probability vector.

The environment responds with a feedback in the range [0,1]. After the receipt of the environmental response, the new deterministic estimate of the mean reward


of the selected action is computed, while the deterministic estimates of all other actions remain invariant. The proposed scheme uses a window [5] of size W in order to compute the deterministic estimate of the mean reward of an action. It is defined as the sum of the W last environmental responses to this action, divided by W. The utilization of a window ensures that 'old' - and probably invalid - environmental responses are ignored.

Afterwards, the current stochastic estimate of the mean reward of each action is computed by adding to its deterministic estimate a zero mean normally distributed random number with a variance proportional to the time which has elapsed from the last time each action was selected.

After each iteration the choice probability of the action that has the highest stochastic estimate of mean reward is increased. We note that the proposed learning automaton is a discretized one. Thus, the choice probability of the action that has the highest stochastically estimated mean reward is increased by a constant 'probability slice' Δ = 1/N. N is an internal parameter of the automaton, called the 'resolution parameter'. As discussed in the introduction, the discretization of the probability space leads to an increase of the automaton's adaptation rate.
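The two bookkeeping steps described in this section (the window-based deterministic estimate and the discretized probability-slice update) can be sketched as follows; the window size W and resolution N shown are arbitrary example values and the function names are not from the paper:

```python
from collections import deque

def update_window_estimate(windows, action, feedback, W=10):
    """Deterministic estimate: mean of the (up to) W last responses to this action."""
    windows.setdefault(action, deque(maxlen=W)).append(feedback)
    return sum(windows[action]) / len(windows[action])

def probability_slice_update(probs, best, N=100):
    """Discretized update: every non-optimal action with positive probability
    loses one slice 1/N; the 'optimal' action receives the freed mass."""
    delta = 1.0 / N
    for i in range(len(probs)):
        if i != best and probs[i] > 0.0:
            probs[i] = max(0.0, probs[i] - delta)
    probs[best] = 1.0 - sum(p for i, p in enumerate(probs) if i != best)
    return probs
```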

6. The formal definition of the SELA learning automaton

The SELA learning automaton is defined as the septuple (A, B, Q, P, T, G, E) where A, B, Q, P, T and G are as defined in Section 2, and E is the estimator which, at any time instant, contains the estimated environmental characteristics. We define: E(t) = (D'(t), M(t), U(t)) where: D'(t) = {d_1'(t), d_2'(t), ..., d_r'(t)} is the Deterministic Estimator Vector which - at any time instant t - contains the current deterministic estimates of the mean rewards of the actions. The current deterministic estimate d_i'(t) of the mean reward of action a_i is defined as:

$$d_i'(t) = \frac{1}{W} \sum_{j=1}^{W} w_i^j(t) \qquad (1)$$

where w_i^j(t) is the j-th of the last W environmental responses to action a_i.

W is an integer; it is the automaton's internal parameter called the 'learning window' [5]. M(t) = {m_1(t), m_2(t), ..., m_r(t)} is the Oldness Vector which - at any time instant t - contains the time which has elapsed (time is counted in number of iterations) since the last time each action was selected. Thus, for every action a_i we define:

$$m_i(t) = t - \max\{T : T \le t \text{ and } a(T) = a_i\}$$

U(t) = {u_1(t), u_2(t), ..., u_r(t)} is the Stochastic Estimator Vector which - at any time instant t - contains the current stochastic estimates of the mean rewards of


the actions. The current stochastic estimate u_i(t) of the mean reward of action a_i is defined as:

$$u_i(t) = d_i'(t) + N(0, \sigma_i^2(t)) \qquad (2)$$

where σ_i(t) = min(α m_i(t), σ_max). N(0, σ_i²(t)) symbolizes a random number selected from a normal probability distribution with a mean equal to 0 and a variance equal to σ_i²(t). α is an internal parameter of the automaton that determines how rapidly the stochastic estimates deviate from the deterministic ones. In a rapidly switching environment, where an 'old' deterministic estimate rapidly loses its validity, a large α is preferable, while a small α is more efficient if SELA operates in a slowly switching environment. σ_max is the maximum permitted value of σ_i(t); it bounds the variance of the stochastic estimates, to prevent it from increasing infinitely. T is the learning algorithm discussed in Section 5. The specification of T constitutes the design of the automaton. Its algorithmic description is presented below.

Initialization: Set P_i = 1/r for all i.
Step 1. Select an action a(t) = a_k according to the probability vector.
Step 2. Receive the feedback b(t) ∈ [0,1] from the environment.
Step 3. Compute the new deterministic estimate d_k'(t) of the mean reward of action a_k as given by Eq. (1).
Step 4. Update the Oldness Vector by setting m_k(t) = 0 and m_i(t) = m_i(t − 1) + 1 for every i ≠ k.
Step 5. For every action a_i (i = 1, 2, ..., r) compute the new stochastic estimate u_i(t) as given by Eq. (2).
Step 6. Select the 'optimal' action a_m that has the highest stochastic estimate of mean reward: u_m(t) = max_i{u_i(t)}.
Step 7. Update the probability vector in the following way: For every action a_i (i = 1, 2, ..., m − 1, m + 1, ..., r) with P_i(t) > 0 set P_i(t + 1) := P_i(t) − 1/N. For the 'optimal' action a_m set P_m(t + 1) := 1 − Σ_{i ≠ m} P_i(t + 1). Further, if for any t the unit simplex S_r is defined as

$$S_r = \left\{ P(t) \;\middle|\; P_i \ge 0,\ \sum_{i=1}^{r} P_i = 1 \right\}$$

then the learning scheme represents the mapping T: S_r → S_r.
Step 8. Go to Step 1.

NOTES. N is an internal parameter of the automaton which is called the 'resolution parameter' [5], and determines the step size Δ (Δ = 1/N) of the probability updating. Δ is also called the 'probability slice'. The initial probability distribution is computed as follows: ⌊N/r⌋ probability slices Δ (Δ = 1/N) are equally distributed to all the actions. After this distribution, if there are remaining probability slices, they are randomly distributed to the actions. For optimal values of the learning parameters, the reader can consult [5].
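Putting the steps and notes together, the following is a self-contained sketch of one possible reading of the SELA algorithm (it is not the authors' code; the environment interface and the default parameter values are assumptions made for illustration):

```python
import random
from collections import deque

def sela(env, r, N=50, W=10, alpha=0.05, sigma_max=0.5, steps=10000):
    """Stochastic Estimator Learning Automaton (SELA), as described in Section 6.

    env   -- object with a sample_reward(action) method returning b(t) in [0, 1]
    r     -- number of actions
    N     -- resolution parameter (probability slice = 1/N)
    W     -- learning window
    alpha -- growth rate of the stochastic estimator's spread
    """
    # Initialization: distribute floor(N/r) slices to every action, then hand
    # out any remaining slices at random.
    slices = [N // r] * r
    for _ in range(N - sum(slices)):
        slices[random.randrange(r)] += 1
    probs = [s / N for s in slices]

    windows = [deque(maxlen=W) for _ in range(r)]   # last W responses per action
    det_est = [0.0] * r                             # deterministic estimates d'_i(t)
    oldness = [0] * r                               # m_i(t)

    for _ in range(steps):
        # Steps 1-2: select an action and receive the environmental feedback.
        k = random.choices(range(r), weights=probs)[0]
        b = env.sample_reward(k)

        # Step 3: deterministic estimate of action k (Eq. (1)).
        windows[k].append(b)
        det_est[k] = sum(windows[k]) / len(windows[k])

        # Step 4: update the Oldness Vector.
        for i in range(r):
            oldness[i] = 0 if i == k else oldness[i] + 1

        # Step 5: stochastic estimates (Eq. (2)).
        u = [det_est[i] + random.gauss(0.0, min(alpha * oldness[i], sigma_max))
             for i in range(r)]

        # Step 6: the 'optimal' action is the one with the highest u_i(t).
        m = max(range(r), key=lambda i: u[i])

        # Step 7: discretized probability update with slice 1/N.
        for i in range(r):
            if i != m and probs[i] > 0.0:
                probs[i] = max(0.0, probs[i] - 1.0 / N)
        probs[m] = 1.0 - sum(p for i, p in enumerate(probs) if i != m)

    return probs
```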


7. Proof of ε-optimality

Theorem 1. The SELA learning automaton is ε-optimal in every stochastic environment that offers symmetrically distributed noise. Thus, if action a_m is the optimal one (d_m = max_i{d_i} for i = 1, ..., r) and P_m(t) = Pr[a(t) = a_m], then for every value N of the resolution parameter there is a time instant t_0 such that for every t ≥ t_0 it holds that E[P_m(t)] = 1.

Note: (1) The assumption of symmetrically distributed noise is not arbitrary. The noise of all known S-model stochastic environments is symmetrically distributed about the mean rewards of the actions. (2) There need not be a single optimal action for all time; different actions may become optimal successively.

Proof. The Proof is given in the Appendix.

8. Simulation results

The superiority of the Stochastic Estimator Learning Automaton (SELA) was confirmed in practice via extensive simulations.

SELA was compared with other S-model ergodic schemes, such as the classic SL_RP learning automaton [8,12,26] and the gradient-projection based G_2 scheme [9].

All of the automata were simulated operating in Markovian switching environments. The environment was made to cyclically switch between five states E_1, E_2, E_3, E_4 and E_5 according to a Markov chain which determined the probability with which it switched from one environment to the next one.

Given that the environment is E_i (i ∈ {1, 2, ..., 5}) at time instant t, the probability of remaining in the same environment at time instant t + 1 is equal to 1 − δ (where δ is a parameter that characterizes the Markov chain). The probability of switching to the next environment E_j (with j = (i mod 5) + 1) is equal to δ. More formally, we can say that if the probability of being in the E_i environment at time instant t is P_{E_i}(t), then the probability of being in the next environment E_j at the next time instant t + 1 is:

$$P_{E_j}(t + 1) = (1 - \delta)\, P_{E_j}(t) + \delta\, P_{E_i}(t)$$

where i, j ∈ {1, ..., 5} and j = (i mod 5) + 1.
Obviously, a large δ leads to a rapidly varying Markov chain. In this case the probability of environmental switching after each step is high. A small δ leads to a slowly varying Markov chain, so the probability of environmental switching after each step is low.
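A sketch of this switching mechanism, combined with the cyclically shifted mean rewards and the truncated Gaussian noise described in the following paragraphs, is given below; the class and method names and the default parameter values are illustrative assumptions.

```python
import random

class SwitchingEnvironment:
    """Five cyclically shifted S-model environments; with probability delta the
    environment advances to the next shift after every iteration."""

    BASE_REWARDS = [0.70, 0.45, 0.40, 0.35, 0.30]

    def __init__(self, delta=0.10, sigma=0.1):
        self.delta = delta   # switching probability per step
        self.sigma = sigma   # std of the Gaussian reward noise (0.1 -> variance 0.01)
        self.shift = 0       # current environment E_1 ... E_5

    def step(self):
        # Markov switching: stay with probability 1 - delta, advance otherwise.
        if random.random() < self.delta:
            self.shift = (self.shift + 1) % 5

    def sample_reward(self, action):
        # Cyclic shift: in E_2 action 1 has mean 0.70, action 0 has mean 0.30, etc.
        mean = self.BASE_REWARDS[(action - self.shift) % 5]
        # Truncated Gaussian sample in [0, 1] around the action's current mean.
        return min(1.0, max(0.0, random.gauss(mean, self.sigma)))
```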

In our simulation all of the environments offer five actions with the following mean rewards: 0.70, 0.45, 0.40, 0.35 and 0.30. When an environment switches to


the next one, then the mean rewards of the actions are cyclically shifted by one place. Thus, we have:

E_1 = {0.70, 0.45, 0.40, 0.35, 0.30}
E_2 = {0.30, 0.70, 0.45, 0.40, 0.35}
E_3 = {0.35, 0.30, 0.70, 0.45, 0.40}

etc. The noise in the environment was simulated by taking truncated samples from

normal (Gaussian) distributions with the above mean values and a variance σ². A reliable performance index for an automaton that operates in a nonstationary

environment is the average reward received by the automaton during its operation. The average reward received R* is computed as:

$$R^* = \frac{1}{k} \sum_{t=1}^{k} E[R(t)]$$

where R(t) is the reward received at time instant t and k is the number of iterations per run (k is a very large integer).

We subtract the initial average reward

$$R_0 = \frac{0.70 + 0.45 + 0.40 + 0.35 + 0.30}{5} = 0.44$$

from R* in order to compute the automaton's power P. Thus we have:

P = R* - R_0

Since the optimal action has a mean reward equal to 0.70, the maximum power in such an environment is equal to 0.70 - 0.44 = 0.26. The three learning automata were simulated in Markovian nonstationary environments of the type described above for various values of the δ parameter and the variance σ² of the Gaussian noise.
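For concreteness, a small helper that turns a recorded reward trace into the power figure used in Table 1 (an illustrative sketch; the function name is not from the paper):

```python
def automaton_power(rewards, r0=0.44):
    """Power P = R* - R_0, where R* is the average reward over a (long) run."""
    r_star = sum(rewards) / len(rewards)
    return r_star - r0

# Example: the optimal action has mean reward 0.70 in these environments,
# so the maximum attainable power is 0.70 - 0.44 = 0.26.
```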

The power P = R* - R_0 that each automaton achieves, by using the optimum values of its internal parameters (α, W, N, σ_max for SELA; the corresponding parameters of G_2 [9]; a, b for SL_RP [8,12,26]), for various values of the environmental parameters δ and σ², is shown in Table 1. The SELA and G_2 [9] algorithms can also be run with W = 1, so that the d_i' are just the most recent observations.

These results confirm the superiority of the Stochastic Estimator Learning Automaton over the previous schemes in both rapidly and slowly switching environments, and in both high and low noise environments.

In a low noise rapidly switching environment (σ² = 0.01, δ = 0.10) the Stochastic Estimator Learning Automaton achieves a power P = R* - R_0 equal to 0.144. In the same environment the G_2 automaton achieves a power equal to 0.063, while SL_RP's power is only 0.053.

When all of the automata operate in a low noise slowly switching environment (σ² = 0.01, δ = 0.01) the Stochastic Estimator Learning Automaton achieves a


Table 1
Comparison of the power P = R* - R_0 of the SELA scheme and the G_2, SL_RP automata in nonstationary stochastic environments

                          Rapidly switching          Slowly switching
                          environment (δ = 0.10)     environment (δ = 0.01)
                          SELA    G_2    SL_RP       SELA    G_2    SL_RP
Low noise (σ² = 0.01)     0.144   0.063  0.053       0.233   0.183  0.136
High noise (σ² = 0.25)    0.114   0.036  0.048       0.230   0.152  0.134

power equal to 0.233. Thus, its power is close to the maximum one (0.260). In the same environment the G_2 automaton achieves a power of only 0.183, while SL_RP's power is 0.136.

In the case of a high noise rapidly switching environment (σ² = 0.25, δ = 0.10) the Stochastic Estimator Learning Automaton achieves a power equal to 0.114. In the same environment the G_2 automaton achieves a power of only 0.036, while SL_RP's power is equal to 0.048.

Finally, in a high noise slowly switching environment (σ² = 0.25, δ = 0.01) the Stochastic Estimator Learning Automaton achieves a power equal to 0.230. In the same environment the G_2 automaton achieves a power equal to 0.152, while SL_RP's power is 0.134.

We can perceive that the Stochastic Estimator scheme achieves a very high power (close to the maximum one) in both high and low noise, in both rapidly and slowly switching environments.

Apart from the numerical results reported above, graphs that represent the performance of SELA, G_2 and SL_RP in switching environments are also presented (see Figs. 2-9). These graphs represent the average expected reward (Figs. 2, 4, 6, 8) and the average probability of selecting the optimal action (Figs. 3, 5, 7, 9) as a function of time.

The environmental changes are shown on the iterations’ axis of the graph, so we can perceive the adaptivity of each scheme to these changes.

Three main results can be derived from the presented graphs:
(1) The SELA learning automaton achieves a high (close to unity) choice probability of the optimal action (accuracy).
(2) Although the above result could reduce the automaton's adaptivity to environmental changes, the SELA scheme remains very sensitive to these changes.
(3) As a result of its high accuracy and its high rate of adaptation to environmental changes, the SELA learning automaton achieves a high power (received reward) in any nonstationary stochastic environment.



Fig. 2. Expected reward versus iterations in a low noise slowly switching environment.

9. Conclusion and future work

The low adaptation rate of the previous S-model ergodic schemes when they operate in a nonstationary environment has been a limiting factor in their application.

In this paper we have presented a new S-model ergodic discretized learning automaton which uses a stochastic estimator in order to achieve a high adaptation rate and a high accuracy in nonstationary environments.

Via extensive simulation results, we have demonstrated that the proposed SELA scheme achieves a superior performance over the previous well-known S-model ergodic schemes when they operate in nonstationary random environments.


Fig. 3. Optimal action probability versus iterations in a low noise slowly switching environment.


Fig. 4. Expected reward versus iterations in a high noise slowly switching environment.


Furthermore, we proved that the proposed SELA learning automaton is ε-optimal in every S-model random environment.

Note that the SELA learning automaton is also able to operate in P-model nonstationary stochastic environments, giving very satisfactory results.

We are currently working in the direction of applying the SELA scheme to the service quality control and traffic management problems of future Asynchronous Transfer Mode (ATM) communication networks.


Fig. 5. Optimal action probability versus iterations in a high noise slowly switching environment.



Fig. 6. Expected reward versus iterations in a low noise rapidly switching environment.

Because the precise characteristics of the source traffic are not known and the service quality requirements change over time, building an efficient network controller that can control the network traffic is a difficult task. The proposed SELA scheme can be used as an ATM network controller which learns the relations between the offered traffic and the service quality, and it can be easily implemented. This work will appear in a following paper.

The philosophy of stochastic estimation of environmental characteristics is also applicable in the field of absorbing learning automata. We are also currently


Fig. 7. Optimal action probability versus iterations in a low noise rapidly switching environment.



Fig. 8. Expected reward versus iterations in a high noise rapidly switching environment.

working toward generalizing stochastic estimator learning automata operating in random environments, using a context vector (associative learning), for the synthesis of neural networks.

Appendix: Proof of ε-optimality

We shall first prove some supplementary lemmas.

Lemma 1. Assume two statistically independent random variables x_1, x_2 with means equal to d_1, d_2 respectively. Define the random variable y = x_1 + x_2. If the density functions f_1(x), f_2(x) of x_1, x_2 are symmetric (with respect to their means d_1, d_2), then the random variable y is also symmetric (with respect to its mean E[y] = d_1 + d_2).


Fig. 9. Optimal action’s probability versus iterations in a high noise rapidly switching environment.



Obviously, this lemma can easily be generalized to y = x_1 + x_2 + ... + x_n, where x_1, x_2, ..., x_n are symmetric random variables.

Proof. Define the random variables z_1, z_2 as

z_1 = x_1 - d_1 and z_2 = x_2 - d_2

with density functions g_1(x), g_2(x) and characteristic functions Φ_{z_1}(φ), Φ_{z_2}(φ) respectively.
Obviously, the random variables z_1 and z_2 have means equal to 0 and are symmetric with respect to 0. It is known (reference [24], Theorem 6.2.6) that under these conditions Φ_{z_1}(φ), Φ_{z_2}(φ) are real valued.
Now let us define the random variable u = z_1 + z_2. It is known (reference [25], Theorem 13) that the characteristic function of u is Φ_u(φ) = Φ_{z_1+z_2}(φ) = Φ_{z_1}(φ) Φ_{z_2}(φ). Since Φ_{z_1}(φ) and Φ_{z_2}(φ) are real valued, it is derived that Φ_u(φ) is also real valued. Thus, u is symmetric with respect to 0. (3)
Since u = z_1 + z_2 = x_1 + x_2 - (d_1 + d_2) = y - (d_1 + d_2) = y - E[y], from (3) it is derived that y is symmetric with respect to E[y] = d_1 + d_2. Lemma 1 has been proved. □

Lemma 2. Let d_1, d_2, ..., d_r be (correspondingly) the means and σ_1², σ_2², ..., σ_r² the variances of the symmetrically distributed (with respect to their means) rewards of the actions a_1, a_2, ..., a_r offered by the environment. Let u_1(t), u_2(t), ..., u_r(t) be the corresponding current stochastic estimates of the mean rewards at time instant t. Then u_1(t), u_2(t), ..., u_r(t) are symmetrically distributed (with respect to their means) random variables with means equal to d_1, d_2, ..., d_r correspondingly.

Proof. The symbolism x ~ N(μ, σ²) denotes that the random variable x is normally distributed with a mean μ and a variance σ².
Let W be the window size. Consider action a_k. Let w_k^i(t) be the content of the i-th place of action a_k's window at time instant t. If the action a_k was last selected at time t - m_k(t) and α m_k(t) < σ_max, we have:

$$u_k(t) = \frac{\sum_{i=1}^{W} w_k^i(t)}{W} + V \quad \text{where } V \sim N(0, \alpha^2 m_k^2(t))$$

By definition, the random variables w_k^i(t) for i = 1, ..., W are symmetrically distributed with a mean d_k and a variance σ_k². V is also a symmetric random variable. Thus, u_k(t) is a sum of symmetric random variables. So, from Lemma 1 it is


derived that u_k(t) is also a symmetric random variable. Its mean E[u_k(t)] and its variance are as follows:

$$E[u_k(t)] = \frac{W d_k}{W} + 0 = d_k$$

$$\mathrm{Var}[u_k(t)] = \frac{\sigma_k^2}{W} + \alpha^2 m_k^2(t)$$

In the same way, if α m_k(t) > σ_max then Var[u_k(t)] = σ_k²/W + σ_max². Thus, u_k(t) is a symmetrically distributed (with respect to its mean) random variable with a mean equal to d_k (for every k = 1, 2, ..., r). Lemma 2 has been proved. □
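A quick Monte Carlo check of this unbiasedness claim can be run as follows (a numerical illustration only, not part of the proof; the parameter values are arbitrary):

```python
import random

# Monte Carlo check that u_k(t) = (1/W) * (sum of W window samples) + N(0, (alpha*m_k)^2)
# has mean d_k, as stated in Lemma 2.
W, d_k, sigma_k, alpha, m_k = 10, 0.6, 0.2, 0.05, 7
total = 0.0
runs = 100000
for _ in range(runs):
    window_mean = sum(random.gauss(d_k, sigma_k) for _ in range(W)) / W
    total += window_mean + random.gauss(0.0, alpha * m_k)
print(total / runs)   # approximately d_k = 0.6
```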

Lemma 3. For any two actions a_i and a_j we define P_i^j(t) = Pr[u_i(t) > u_j(t)]. If d_i > d_j then there is a positive real number Q_ij (Q_ij > 0) such that P_i^j(t) - P_j^i(t) ≥ 2Q_ij, and consequently P_i^j(t) > P_j^i(t), for any time instant t. We assume that Q_ij has a non-zero limit.

Proof. Define the random variable z_i^j(t) = u_i(t) - u_j(t). From Lemmas 1 and 2 it is derived that z_i^j(t) is a symmetric (with respect to its mean) random variable with a mean equal to d_i - d_j and a variance equal to Var[u_i(t)] + Var[u_j(t)].
Let f_{z_i^j}(x) be the density function of z_i^j(t). As discussed earlier, f_{z_i^j}(x) is symmetric about the line x = d_i - d_j.
If we define P_i^j(t) = Pr[u_i(t) > u_j(t)] we have:

z_i^j(t) > 0 ⟺ u_i(t) > u_j(t).

Thus, we have

$$P_i^j(t) = \int_0^{+\infty} f_{z_i^j}(x)\, dx \qquad (4)$$

It is known that d_i - d_j > 0 and f_{z_i^j}(x) is symmetric about the line x = d_i - d_j. So, (4) can be written as:

$$P_i^j(t) = 0.5 + \int_0^{d_i - d_j} f_{z_i^j}(x)\, dx \ge 0.5 + Q_{ij}$$

where Q_ij = min_t ∫_0^{d_i - d_j} f_{z_i^j}(x) dx.
Since the variance of z_i^j(t) is bounded (Var[z_i^j] ≤ σ_i²/W + σ_j²/W + 2σ_max²), it is derived that Q_ij > 0. It is obvious that P_i^j(t) + P_j^i(t) = 1. Thus P_i^j(t) ≥ 0.5 + Q_ij ⟺ P_j^i(t) ≤ 0.5 - Q_ij, so P_i^j(t) - P_j^i(t) ≥ 2Q_ij. Thus, we have that P_i^j(t) ≥ 0.5 + Q_ij > 0.5 - Q_ij ≥ P_j^i(t) for any time instant t. Lemma 3 has been proved. Note that this is not proven conditional on the state of the learning automaton. □


Lemma 4. Assume any subset A_k of the action set A, such that |A_k| = k ≤ r. Thus, A_k = {a_{l1}, a_{l2}, ..., a_{lk}} ⊆ A = {a_1, a_2, ..., a_r}. For any action a_l ∈ A_k define B_l(A_k, t) = Pr[u_l(t) = max{u_j(t)} for j = l1, l2, ..., lk]. Then, according to the SELA learning algorithm, for any two actions a_i, a_j such that a_i ∈ A_k and a_j ∈ A_k we have:

$$\frac{B_i(A_k, t)}{B_j(A_k, t)} = \frac{P_i^j(t)}{P_j^i(t)} \quad \text{for any time instant } t$$

(where P_i^j(t) is defined as in Lemma 3).

Proof. We shall prove Lemma 4 by mathematical induction on the size k of the subset A_k.
Basis step. We have to prove that the theorem holds for |A_k| = k = 2.
Proof of basis step. For any subset A_2 = {a_i, a_j} ⊆ A we have:

B_i(A_2, t) = P_i^j(t) and B_j(A_2, t) = P_j^i(t).

Thus,

$$\frac{B_i(A_2, t)}{B_j(A_2, t)} = \frac{P_i^j(t)}{P_j^i(t)}$$

for any time instant t.
Induction hypothesis. Assume that the theorem holds for |A_k| = k = n. Thus, for any subset A_n of the action set A such that |A_n| = n it holds that, according to the learning algorithm of the SELA automaton, for any two actions a_i, a_j such that a_i ∈ A_n and a_j ∈ A_n we have:

$$\frac{B_i(A_n, t)}{B_j(A_n, t)} = \frac{P_i^j(t)}{P_j^i(t)} \quad \text{for any time instant } t.$$

Induction step. We have to prove that the theorem holds for |A_k| = k = n + 1. Thus, we have to prove that for any subset A_{n+1} of the action set A such that |A_{n+1}| = n + 1 it holds that, according to the SELA learning algorithm, for any two actions a_i, a_j such that a_i ∈ A_{n+1} and a_j ∈ A_{n+1} we have:

$$\frac{B_i(A_{n+1}, t)}{B_j(A_{n+1}, t)} = \frac{P_i^j(t)}{P_j^i(t)} \quad \text{for any time instant } t.$$

Proof of induction step. Assume any subset A_{n+1} of the action set A such that |A_{n+1}| = n + 1, and any two actions a_i and a_j such that a_i ∈ A_{n+1} and a_j ∈ A_{n+1}. Let a_h be an action such that a_h ∈ A_{n+1} and a_h ≠ a_i, a_h ≠ a_j. Consider the set A_n = A_{n+1} - {a_h}. We have:

$$B_i(A_{n+1}, t) = \Pr[u_i(t) = \max_l\{u_l(t)\} \text{ for every action } a_l \in A_n] \cdot \Pr[u_h(t) \ne \max_f\{u_f(t)\} \text{ for every action } a_f \in A_{n+1}] = B_i(A_n, t)\,(1 - B_h(A_{n+1}, t)) \qquad (6)$$


In the same way:

$$B_j(A_{n+1}, t) = \Pr[u_j(t) = \max_l\{u_l(t)\} \text{ for every action } a_l \in A_n] \cdot \Pr[u_h(t) \ne \max_f\{u_f(t)\} \text{ for every action } a_f \in A_{n+1}] = B_j(A_n, t)\,(1 - B_h(A_{n+1}, t)) \qquad (7)$$

By utilizing (6) and (7) we have:

$$\frac{B_i(A_{n+1}, t)}{B_j(A_{n+1}, t)} = \frac{B_i(A_n, t)\,(1 - B_h(A_{n+1}, t))}{B_j(A_n, t)\,(1 - B_h(A_{n+1}, t))} = \frac{B_i(A_n, t)}{B_j(A_n, t)} \qquad (8)$$

A_n is a subset of the action set A and, additionally, |A_n| = n = k. Thus, from the induction hypothesis we have that

$$\frac{B_i(A_n, t)}{B_j(A_n, t)} = \frac{P_i^j(t)}{P_j^i(t)}$$

So Eq. (8) becomes:

$$\frac{B_i(A_{n+1}, t)}{B_j(A_{n+1}, t)} = \frac{P_i^j(t)}{P_j^i(t)}$$

Thus, Lemma 4 has been proved. □

Lemma 5. Let d_m = max_i{d_i} for i = 1, ..., r. Thus action a_m is an optimal one. If B_i(t) = Pr[u_i(t) = max_j{u_j(t)} for j = 1, ..., r], then according to the SELA learning algorithm, for any time instant t we have B_m(t) = max_i{B_i(t)} for i = 1, ..., r.

Proof. Let action a_m be an optimal one. Thus, d_m = max_i{d_i} for i = 1, ..., r. From Lemma 3 it is derived that for every i = 1, ..., m - 1, m + 1, ..., r it holds that P_m^i(t) > P_i^m(t). Thus, for every i = 1, ..., m - 1, m + 1, ..., r it holds that:

$$\frac{P_m^i(t)}{P_i^m(t)} > 1 \quad \text{for every time instant } t. \qquad (9)$$

Now, by combining (9) with Lemma 4 (for the subset A_k = A) it is derived that for every i = 1, ..., m - 1, m + 1, ..., r it holds that:

$$\frac{B_m(t)}{B_i(t)} = \frac{P_m^i(t)}{P_i^m(t)} > 1 \;\Rightarrow\; B_m(t) > B_i(t)$$

Thus, B_m(t) = max_i{B_i(t)} for i = 1, ..., r. Lemma 5 has been proved. □

Theorem 1. The SELA learning automaton is ε-optimal in every environment that offers symmetrically distributed noise. Thus, if action a_m is an optimal one (d_m = max_i{d_i} for i = 1, ..., r) and P_m(t) = Pr[a(t) = a_m], then for every value N > N_0 (N_0 > 0) of the resolution parameter there is a time instant t_0 < ∞ such that for every t ≥ t_0 it holds that E[P_m(t)] = 1.


Proof. For every two actions a_i, a_j and every time instant t define ΔP_i^j(t) = P_i(t) - P_j(t).
As in Lemma 5, define the quantity:

B_i(t) = Pr[u_i(t) = max_j{u_j(t)} for j = 1, ..., r].

Assume that at a time instant t we have P_i(t) ≠ 0 and P_j(t) ≠ 0. If neither action a_i nor a_j has the highest estimated mean reward at this instant (u_i(t), u_j(t) ≠ max_l{u_l(t)} for l = 1, ..., r), then both P_i(t) and P_j(t) are decreased by the same quantity 1/N (where N is the resolution parameter of the SELA automaton). Thus

ΔP_i^j(t + 1) = ΔP_i^j(t) and ΔP_i^j(t + 1) - ΔP_i^j(t) = 0

If action a_i or action a_j has the highest estimated mean reward at the time instant t (u_i(t) or u_j(t) = max_l{u_l(t)} for l = 1, ..., r), then the quantity ΔP_i^j(t) is on average increased by (1/N)(B_i(t) - B_j(t)), where the quantities B_i(t), B_j(t) are as defined above.
Thus, if at a time instant t we have P_i(t) ≠ 0 and P_j(t) ≠ 0, then the average modification H_i^j(t) of the quantity ΔP_i^j(t) is:

$$E[\Delta P_i^j(t+1) - \Delta P_i^j(t)] = 0\,(1 - B_i(t) - B_j(t)) + (1/N)(B_i(t) - B_j(t))(B_i(t) + B_j(t)) = (1/N)(B_i(t)^2 - B_j(t)^2) \qquad (10)$$

For any pair of actions a_i, a_j define the quantity H_i^j(t) as:

$$H_i^j(t) = (1/N)(B_i(t)^2 - B_j(t)^2)$$

If E[P_i(t)] ≠ 0 and E[P_j(t)] ≠ 0 then E[ΔP_i^j(t + 1) - ΔP_i^j(t)] = H_i^j(t). If d_i > d_j then from Lemma 3 it is derived that P_i^j(t) > P_j^i(t) for any time instant t. In this case we have:

$$H_i^j(t) = (1/N)(B_i(t)^2 - B_j(t)^2) \ge 4(1/N)(Q_{ij} - Q_{ij}^2) \qquad (11)$$


It is known (see the proof of Lemma 3) that 0 < Q_ij < 1 for all t. Thus, from (11) it is derived that if a_i, a_j are two actions such that d_i > d_j then:

H_i^j(t) ≥ 4(1/N)(Q_ij - Q_ij²) > 0 for all t.

Let us define G_ij = 4(1/N)(Q_ij - Q_ij²) > 0.

We have: H_i^j(t) ≥ G_ij for all t. (12)

From (12) it is derived that for any pair of actions a_i, a_j with d_i > d_j there is a time instant t_ij (t_ij ≤ 1/G_ij) such that:

$$\sum_{T=1}^{t_{ij}} H_i^j(T) = 1 \qquad (13)$$

If E[P_j(t)] ≠ 0 then E[ΔP_i^j(t + 1) - ΔP_i^j(t)] = H_i^j(t). Thus, before P_j(t) becomes equal to zero, the average difference between P_i(t) and P_j(t) is:

$$E[P_i(t) - P_j(t)] = \sum_{T=1}^{t} \left( E[\Delta P_i^j(T+1) - \Delta P_i^j(T)] \right) = \sum_{T=1}^{t} H_i^j(T)$$

There are two possible cases: (i) E[P_j(t)] = 0 and E[P_i(t) - P_j(t)] ≤ Σ_{T=1}^{t} H_i^j(T); (ii) E[P_j(t)] ≠ 0 and E[P_i(t) - P_j(t)] ≤ Σ_{T=1}^{t} H_i^j(T).

If condition (13) is satisfied then case (ii) becomes:

$$E[P_i(t)] - E[P_j(t)] = E[P_i(t) - P_j(t)] = \sum_{T=1}^{t_{ij}} H_i^j(T) = 1 \;\Rightarrow\; E[P_i(t)] = 1 \;\Rightarrow\; E[P_j(t)] = 0.$$

Thus, if condition (13) is satisfied then in both cases (i) and (ii) we have E[P_j(t)] = 0. Thus, E[P_j(t_ij)] = 0.
We have proved that for every pair of actions a_i, a_j with d_i > d_j there is a time instant t_ij such that:

E[P_j(t_ij)] = 0 (14)

Let action a_m be an optimal one (d_m = max_l{d_l} for l = 1, ..., r). Thus, for any action a_i (i ≠ m) we have d_m > d_i and consequently, from (14), it is derived that for any action a_i (i ≠ m) there is a time instant t_mi such that E[P_i(t_mi)] = 0.
If we select t_0 = max_i{t_mi} then E[P_i(t_0)] = 0 for every i ≠ m and subsequently E[P_m(t_0)] = 1.
Theorem 1 has been proved. The SELA learning automaton is ε-optimal in every stationary stochastic environment. □


References

[l] M.A.L. Thathachar and B.J. Oommen, Discretized reward inaction learning automata, J. Cybernet. Informat. Sci. (Spring 1979) 24-29.

[2] B.J. Oommen, Absorbing and ergodic discretized two-action learning automata, IEEE Trans. Systems Man Cybernet. SMC-16 (2) (March/April 1986) 282-293.

[3] B.J. Oommen and J.P.R. Christensen, Epsilon-optimal discretized linear reward-penalty learning automata, IEEE Trans. Systems Man Cybernet. SMC-18 (3) (May/June 1988) 451-458.

[4] M.F. Norman, Markov Processes and Learning Models (New York and London, Academic Press, 1972).

[5] A.V. Vasilakos and G.I. Papadimitriou, Ergodic discretized estimator learning automata with high accuracy and high adaptation rate for nonstationary environments, Neurocomputing 4, (3-4) (May 1992) 181-196.

[6] A.V. Vasilakos, G.I. Papadimitriou and C.T. Paximadis, New absorbing hierarchical discretized pursuit nonlinear learning automata with rapid convergence and high accuracy, IEEE Int. Conf. Systems, Man and Cybernetics, University of Virginia (Oct. 1991).

[7] A.V. Vasilakos, V.G. Polimenis and V.E. Avgerinos, A new nonlinear discretized learning automaton with rapid convergence and high accuracy, IEEE Int. Conf on Systems, Man and Cybernetics, Cambridge, MA, (Nov. 1989).

[8] R. Viswanathan and K.S. Narendra, Stochastic automata models with applications to learning systems, IEEE Trans. Systems Man Cybernet. (Jan. 1973) 107-111.

[9] R. Simha and J.F. Kurose, Relative reward strength algorithms for learning automata, IEEE Trans. Systems Man Cybernet. SMC-19 (2) (March/April 1989) 388-398.

[10] B.J. Oommen and J.K. Lanctot, Epsilon-optimal discretized pursuit learning automata, IEEE Int. Conf. Systems Man Cybernetics, Cambridge, MA (Nov. 1989).

[ll] V.I. Varshavskii and I.P. Vorontsova, On the behavior of stochastic automata with variable structure, Automat. Telemekh. (USSR) 24 (1963) 327-333.

[12] K.S. Narendra and M.A.L. Thathachar, Learning automata - A survey, IEEE Trans. Systems Man Cybernet. SMC-4 (4) (July 1974) 323-334.

[13] K.S. Narendra and S. Lakshmivarahan, Learning automata: A critique, J. Cybernet. Informat. Sci. (1977) 53-66.

[14] M.A.L. Thathachar and P.S. Sastry, A class of rapidly converging algorithms for learning automata, IEEE Trans. Systems Man Cybernet. SMC-15 (1) (Jan./Feb. 1985) 168-175.

[15] B.J. Oommen and M.A.L. Thathachar, Multiaction learning automata possessing ergodicity of the mean, Informat. Sci. 35 (3) (June 1985) 183-198.

[16] A.O. Allen, Probability, Statistics and Queuing Theory with Computer Science Applications (New York, Academic Press, 1978).

[17] D.L. Isaacson and R.W. Madsen, Markov Chains: Theory and Applications (New York, Wiley, 1976).

[18] M.L. Tsetlin, On the behavior of the finite automata in random media, Automat. Telemekh. (USSR) 22 (Oct. 1961) 1345-1354.

[19] M.L. Tsetlin, Automaton Theory and Modeling of Biological Systems (New York, Academic Press, 1973).

[20] Y.Z. Tsypkin and A.S. Poznyak, Finite learning automata, Eng. Cybernet. 10 (1972) 478-490.

[21] S. Lakshmivarahan, Learning Algorithms: Theory and Applications (New York, Springer-Verlag, 1981).

[22] B. Chandrasekaran and D.W.C. Shen, On expediency and convergence in variable structure automata, IEEE Trans. Syst. Sci. Cybernet. SSC-4 (March 1968) 52-60.

[23] L.G. Mason, An optimal learning algorithm for S-model environments, IEEE Trans. Automatic Control (Oct. 1973) 493-496.

[24] K.L. Chung, A Course in Probability Theory (New York, Harcourt, Brace & World, Inc., 1968).

[25] H. Cramer, Random Variables and Probability Distributions (Cambridge University Press, 1970).


[26] O.V. Nedzelnitsky and K.S. Narendra, Nonstationary models of learning automata routing in data communication networks, IEEE Trans. Systems Man Cybernet. SMC-17 (6) (Nov./Dec. 1987) 1004-1015.

[27] A.V. Vasilakos, C.A. Moschonas and C.T. Paximadis, Adaptive window flow control and learning algorithms for adaptive routing in data networks, Proc. 1990 ACM SIGMETRICS Boulder, CO (22-25 May 1990).

[28] A.V. Vasilakos, C.A. Moschonas and C.T. Paximadis, Variable window flow control and ergodic discretized learning algorithms for adaptive routing in data networks, Comput. Networks ISDN Syst. 22 (1991) 235-248.

[29] A.G. Barto, R.S. Sutton and C.W. Anderson, Neuronlike elements that can solve difficult learning control problems, IEEE Trans. Systems Man Cybernet. SMC-13 (1983) 834-846.

[30] D. Gorse and J.G. Taylor, A continuous input RAM-based stochastic neural model, Neural Networks (1991) 637-665.

[31] C. Myers, Reinforcement learning when results are delayed and interleaved in time, Proc. Int. Neural Network Conf. '90, Paris (Kluwer, Dordrecht, 1991) 860-863.

Athanasios V. Vasilakos was born in Gianitsa, Greece in 1959. He received the B.Sc. degree in Electrical Engineering from the University of Thrace, Greece in 1983, the M.S. degree from the University of Massachusetts, Amherst, MA, USA, in 1986 and the Ph.D. from the University of Patras, Greece in 1988. Since November 1988 he has been with the Computer Engineering Department of the University of Patras and a researcher with the Computer Technology Institute (CTI), Patras. Recently he has become a full Professor at the Computer Science Dept. of the Hellenic Air Force Academy. He has been involved in ESPRIT Project 2198, FCPN (Factory Customer Premises Network). He has published more than 30 technical papers in the areas of computer networks, learning theory and artificial intelligence. Dr. Vasilakos is known for his work on exponentially discretized and stochastic estimator learning algorithms and their application to traffic management in high-speed packet-switched networks.

His main interests are data communication networks, performance evaluation of computer networks, learning theory, neural networks, learning automata, architecture and performance of ATM/B-ISDN networks and mobile cellular networks. Dr. Vasilakos was awarded the Panhellenic Prize of the Mathematical Foundation of Greece in 1977. He is a member of the IEEE Communications Society, the IEEE Computer Society, the ACM and the Technical Chamber of Greece.

Georgios I. Papadimitriou was born in Thessaloniki, Greece on July 27, 1966. He received the B.Sc. degree in Computer Engineering from the University of Patras, Patras, Greece in 1989. He is now a graduate student at the University of Patras. His research interests include high-speed networks and learning algorithms with applications in high-speed networks. Mr. Papadimitriou is a member of the IEEE Computer and Communications Societies.