
Accepted Manuscript

Self-learning Fuzzy logic controllers for pursuit-evasion differential games

Sameh F. Desouky, Howard M. Schwartz

PII: S0921-8890(10)00160-0
DOI: 10.1016/j.robot.2010.09.006
Reference: ROBOT 1789

To appear in: Robotics and Autonomous Systems

Received date: 17 February 2010
Accepted date: 13 September 2010

Please cite this article as: S.F. Desouky, H.M. Schwartz, Self-learning Fuzzy logic controllers for pursuit-evasion differential games, Robotics and Autonomous Systems (2010), doi:10.1016/j.robot.2010.09.006

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Self-learning Fuzzy Logic Controllers for Pursuit-Evasion Differential Games

Sameh F. Desouky∗, Howard M. Schwartz

Department of Systems and Computer Engineering, Carleton University
1125 Colonel By Drive, Ottawa, ON, Canada

Abstract

This paper addresses the problem of tuning the input and the output parameters of a fuzzy logic controller. The system learns autonomously without supervision or a priori training data. Two novel techniques are proposed. The first technique combines Q(λ)-learning with function approximation (fuzzy inference system) to tune the parameters of a fuzzy logic controller operating in continuous state and action spaces. The second technique combines Q(λ)-learning with genetic algorithms to tune the parameters of a fuzzy logic controller in the discrete state and action spaces. The proposed techniques are applied to different pursuit-evasion differential games. The proposed techniques are compared with the optimal strategy, Q(λ)-learning only, reward-based genetic algorithms learning, and to the technique proposed by Dai et al. (2005) in which a neural network is used as a function approximation for Q-learning. Computer simulations show the usefulness of the proposed techniques.

Key words: Differential game, function approximation, fuzzy control, genetic algorithms, Q(λ)-learning, reinforcement learning.

1. Introduction

Fuzzy logic controllers (FLCs) are currently being used in engineering applications [1, 2], especially for plants that are complex and ill-defined [3, 4] and plants with high uncertainty in the knowledge about their environment, such as autonomous mobile robotic systems [5, 6]. However, a FLC has the drawback that finding its knowledge base relies on a tedious and unreliable trial-and-error process. To overcome this drawback one can use supervised learning [7–11], which needs a teacher or input/output training data. However, in many practical cases the model is totally or partially unknown and it is difficult, expensive, or in some cases impossible to get training data. In such cases it is preferable to use reinforcement learning (RL).

RL is a computational approach to learning through interaction with the environment [12, 13]. The main advantage of RL is that it does not need either a teacher or a known model. RL is suitable for intelligent robot control, especially in the field of autonomous mobile robots [14–18].

∗Corresponding author. Tel.: +1 613-520-2600 Ext. 5725
Email addresses: [email protected] (Sameh F. Desouky), [email protected] (Howard M. Schwartz)

1.1. Related work

Limited studies have applied RL alone to solve environmental problems, but its use with other learning algorithms has increased. In [19], a RL approach is used to tune the parameters of a FLC. This approach is applied to a single case of one robot following another along a straight line. In [15] and [20], the authors proposed a hybrid learning approach that combines a neuro-fuzzy system with RL in a two-phase structure applied to an obstacle avoidance mobile robot. In phase 1, supervised learning is used to tune the parameters of a FLC; then in phase 2, RL is employed so that the system can re-adapt to a new environment. The limitation in their approach is that if the training data are hard or expensive to obtain then supervised learning cannot be applied. In [21], the authors overcame this limitation by using Q-learning as an expert to obtain training data. The training data are then used to tune the weights of an artificial neural network controller applied to a mobile robot path planning problem.

In [22], a multi-robot pursuit-evasion game is investigated. The model consists of a combination of aerial and ground vehicles. However, the unmanned vehicles are not learning; they only execute the actions they receive from a central computer system. In [23], the use of RL in the multi-agent pursuit-evasion problem is discussed. The individual agents learn a particular pursuit strategy. However, the authors do not use a realistic robot model or robot control structure. In [24], RL is used to tune the output parameters of a FLC in a pursuit-evasion game.

A number of articles used a fuzzy inference system (FIS) as a function approximation with Q-learning [25–28]; however, these works have the following disadvantages: (i) the action space is considered to be discrete and (ii) only the output parameters of the FIS are tuned.

1.2. Paper motivation

The problem addressed in this paper is that we assume that the pursuer/evader does not know its control strategy. It is not told which actions to take so as to be able to optimize its control strategy. We assume that we do not even have a simplistic PD controller strategy. The learning goal is to make the pursuer/evader able to self-learn its control strategy. It should do that on-line, by interaction with the evader/pursuer.

From several learning techniques we choose RL. RL methods learn without a teacher, without anybody telling them how to solve the problem. RL is suited to problems in which the learning agent does not know what it must do. It is the most appropriate learning technique for our problem.

However, using RL alone has the limitation that it is too hard to visit all the state-action pairs. We try to cover most of the state-action space, but we cannot cover all of it. In addition, there are hidden states that are not taken into consideration due to the discretization process. Hence RL alone cannot find the optimal strategy.

The proposed Q(λ)-learning based genetic fuzzy controller (QLBGFC) and the proposed Q(λ)-learning fuzzy inference system (QLFIS) are two novel techniques used to overcome the limitation of RL, namely that the RL method is designed only for discrete state-action spaces. Since we want to use RL in the robotics domain, which is a continuous domain, we need to use some type of function approximation, such as a FIS, to generalize the discrete state-action space into a continuous state-action space. Therefore, from the RL point of view, a FIS is used as a function approximator to compensate for the limitation of RL. From the FIS point of view, RL is used to tune the input and/or the output parameters of the fuzzy system, especially when the model is partially or completely unknown and it is hard or expensive to get a priori training data or a teacher to learn from. In this case, the FIS is used as an adaptive controller whose parameters are tuned on-line by RL. Therefore, combining RL and a FIS has two objectives: to compensate for the limitation of RL and to tune the parameters of the FLC.

In this paper we design a self-learning FLC using the proposed QLFIS and the proposed QLBGFC. The proposed QLBGFC is used when the state and the action spaces can be discretized in such a way that the resulting state and action spaces have acceptable dimensions. This can be done, as we will see in our case, if the state and the action values are bounded. If not, then the proposed QLFIS will be suitable. However, for comparison purposes, we will use both of the proposed techniques in this work.

The learning process in the proposed QLFIS is performed simultaneously as shown in Fig. 1. The proposed QLFIS is used "directly" with the continuous state and action spaces. The FIS is used as a function approximation to estimate the optimal action-value function, Q∗(s, a), in the continuous state and action spaces while Q(λ)-learning is used to tune the input and the output parameters of both the FIS and the FLC.

Figure 1: The proposed QLFIS technique. The FLC produces the control u from the state s_t, exploration noise n(0, σ_n) gives u_c, the FIS estimates Q_t(s_t, u), and the TD-error between r + max Q(s_{t+1}, u′) and Q_t(s_t, u) drives the tuning of both the FIS and the FLC.


In the proposed QLBGFC, the learning process is performed sequentially as shown in Fig. 2. The proposed QLBGFC can be considered an "indirect" method of using function approximation. First, in phase 1, the state and the action spaces are discretized and Q(λ)-learning is used to obtain an estimate of the desired training data set, (s, a∗). Then this training data set is used by genetic algorithms (GAs) in phase 2 stage 1 to tune the input and the output parameters of the FLC, which is used at the same time to generalize the discrete state and action values over the continuous state and action spaces. Finally, in phase 2 stage 2, the FLC is further tuned during the interaction between the pursuer and the evader.

The proposed techniques are applied to two pursuit-evasion games. In the first game, we assume that the pursuer does not know its control strategy, whereas in the second game we increase the complexity of the system by assuming that both the pursuer and the evader do not know their control strategies or the other's control strategy.

The rest of this paper is organized as follows: some basic terminologies for RL, FIS and GAs are reviewed in Section 2, Section 3 and Section 4, respectively. In Section 5, the pursuit-evasion game is described. The proposed QLFIS and the proposed QLBGFC techniques are described in Section 6 and Section 7, respectively. Section 8 presents the computer simulation and the results are discussed in Section 9. Finally, conclusions and future work are discussed in Section 10.

Figure 2: The proposed QLBGFC technique

2. Reinforcement Learning

Agent-environment interaction in RL is shown in Fig. 3 [12]. It consists mainly of two blocks, an agent which tries to take actions so as to maximize the discounted return, R, and an environment which provides the agent with rewards. The discounted return, R_t, at time t is defined as

R_t = ∑_{k=0}^{τ} γ^k r_{t+k+1}    (1)

where r_{t+1} is the immediate reward, γ is the discount factor (0 < γ ≤ 1), and τ is the terminal point. Any task can be divided into independent episodes, and τ is the end of an episode. If τ is finite then the model is called a finite-horizon model [13]. If τ → ∞ then the model is called an infinite-horizon discounted model, and in this case γ < 1 to avoid infinite total rewards.

The performance of an action, a, taken in a state, s, under policy, π, is evaluated by the action value function, Q^π(s, a),

Q^π(s, a) = E_π(R_t | s_t = s, a_t = a) = E_π( ∑_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a )    (2)

where E_π(·) is the expected value under policy π. The way to choose an action is a trade-off between exploitation and exploration. The ε-greedy action selection method is a common way of choosing the actions. This method can be stated as

a_t = { a∗,             with probability 1 − ε;
        random action,  with probability ε.    (3)

Figure 3: Agent-environment interaction in RL. The agent takes action a_t in state s_t and receives the reward r_{t+1} and the next state s_{t+1} from the environment.


where ε ∈ (0, 1) and a∗ is the greedy action defined as

a∗ = arg max_{a′} Q(s, a′)    (4)

The RL method is searching for the optimal policy, π∗, by searching for the optimal value function, Q∗(s, a), where

Q∗(s, a) = max_π Q^π(s, a)    (5)

Many algorithms have been proposed for estimating the optimal value functions. The most widely used and well-known control algorithm is Q-learning [29].

Q-learning, which was first introduced by Watkins in his Ph.D. thesis [30], is an off-policy algorithm. Therefore, it has the ability to learn without following the current policy. The state and action spaces are discrete and their corresponding value function is stored in what is known as a Q-table. To use Q-learning with continuous systems (continuous state and action spaces), one can discretize the state and action spaces [11, 21, 31–34], use some type of function approximation such as FISs [26, 35, 36] or neural networks (NNs) [12, 19, 37, 38], or use some type of optimization technique such as GAs [39, 40]. A one-step update rule for Q-learning is defined as

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α Δ_t    (6)

where α is the learning rate (0 < α ≤ 1) and Δ_t is the temporal difference error (TD-error) defined as

Δ_t = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)    (7)

Equation (6) is a one-step update rule. It updates the value function according to the immediate reward obtained from the environment. To update the value function based on a multi-step update rule one can use eligibility traces [12].

Eligibility traces are used to modify a one-step TD algorithm, TD(0), into a multi-step TD algorithm, TD(λ). One type of eligibility trace is the replacing eligibility trace [41] defined as: ∀ s, a,

e_t(s, a) = { 1,                 if s = s_t and a = a_t;
              0,                 if s = s_t and a ≠ a_t;
              λγ e_{t−1}(s, a),  if s ≠ s_t.    (8)

where e_0 = 0 and λ is the trace-decay parameter (0 ≤ λ ≤ 1). When λ = 0 the update is the one-step update, TD(0), and when λ = 1 the update is an infinite-step update. Eligibility traces are used to speed up the learning process and hence to make it suitable for on-line applications. Now we modify (6) to be

Q_{t+1}(s, a) = Q_t(s, a) + α e_t Δ_t    (9)

For the continuous state and action spaces, the eligibility trace is defined as

e_t = γλ e_{t−1} + ∂Q_t(s_t, a_t)/∂φ    (10)

where φ is the parameter to be tuned.
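To make the preceding update rules concrete, the following is a minimal Python sketch of tabular Q(λ)-learning that combines the ε-greedy selection (3), the TD-error (7), the replacing traces (8) and the multi-step update (9). The environment interface (env.reset, env.step) and the default parameter values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def q_lambda(env, n_states, n_actions, episodes=200, steps=6000,
             alpha=0.1, gamma=0.5, lam=0.3, eps=0.1):
    """Tabular Q(lambda) with replacing eligibility traces: eqs. (3), (7), (8), (9)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)                 # eligibility traces, e_0 = 0
        s = env.reset()                      # assumed environment interface
        for _ in range(steps):
            # eps-greedy action selection, eq. (3)
            if np.random.rand() > eps:
                a = int(np.argmax(Q[s]))
            else:
                a = np.random.randint(n_actions)
            s_next, r, done = env.step(a)    # assumed to return next state, reward, end flag
            # TD-error, eq. (7)
            delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
            # replacing eligibility traces, eq. (8)
            e[s, :] = 0.0
            e[s, a] = 1.0
            # multi-step update, eq. (9), then decay all traces by gamma*lambda
            Q += alpha * delta * e
            e *= gamma * lam
            s = s_next
            if done:
                break
    return Q
```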

3. Fuzzy Inference System

The most widely used FIS models are the Mamdani model [42] and the Takagi-Sugeno-Kang (TSK) model [43]. In this work we are interested in using the TSK model. A first-order TSK means that the output is a linear function of its inputs, while a zero-order TSK means that the output is a constant function. For simplicity we use a zero-order TSK model. A rule used in a zero-order TSK model for N inputs has the form

R^l : IF x_1 is A_1^l AND ... AND x_N is A_N^l THEN f_l = K_l    (11)

where A_i^l is the fuzzy set of the i-th input variable, x_i, in rule R^l, l = 1, 2, ..., L, and K_l is the consequent parameter of the output, f_l, in rule R^l.

The fuzzy output can be defuzzified into a crisp output using one of the defuzzification techniques. Here, the weighted average method is used and is defined as follows

f(x) = ∑_{l=1}^{L} ( ∏_{i=1}^{N} μ_{A_i^l}(x_i) ) K_l  /  ∑_{l=1}^{L} ( ∏_{i=1}^{N} μ_{A_i^l}(x_i) )    (12)

where μ_{A_i^l}(x_i) is the membership value for the fuzzy set A_i^l of the input x_i in rule R^l. Due to its simple formula and computational efficiency, the Gaussian membership function (MF) has been used extensively, especially in real-time implementation and control. The Gaussian MF depicted in Fig. 4 is defined as

μ_{A_i^l}(x_i) = exp( −( (x_i − m_i^l) / σ_i^l )^2 )    (13)

where σ and m are the standard deviation and the mean, respectively.

The structure of the FIS used in this work is shown in Fig. 5. Without loss of generality, we assume that the FIS model has 2 inputs, x_1 and x_2, and one output, f, and that each input has 3 Gaussian MFs. The structure has two types of nodes. The first type is an adaptive node (a squared shape) whose output needs to be adapted (tuned), and the second type is a fixed node (a circled shape) whose output is a known function of its inputs.

Figure 4: Gaussian MF

Figure 5: Structure of a FIS model

The structure has 5 layers. In layer 1, all nodes are adaptive. This layer has 6 outputs denoted by O^1. The output of each node in layer 1 is the membership value of its input defined by (13). In layer 2, all nodes are fixed. The AND operation (multiplication) between the inputs of each rule is calculated in this layer. This layer has 9 outputs denoted by O_l^2, l = 1, 2, ..., 9. The output of each node in layer 2, known as the firing strength of the rule, ω_l, is calculated as follows

O_l^2 = ω_l = ∏_{i=1}^{2} μ_{A_i^l}(x_i)    (14)

In layer 3, all nodes are fixed. This layer has 9 outputs denoted by O_l^3. The output of each node in layer 3 is the normalized firing strength, ω̄_l, which is calculated as follows

O_l^3 = ω̄_l = O_l^2 / ∑_{l=1}^{9} O_l^2 = ω_l / ∑_{l=1}^{9} ω_l    (15)

In layer 4, all nodes are adaptive. The defuzzification process that uses the weighted average method, defined by (12), is performed in this layer and the next layer. Layer 4 has 9 outputs denoted by O_l^4. The output of each node is

O_l^4 = O_l^3 K_l = ω̄_l K_l    (16)

Layer 5 is the output layer and has only one fixed node whose output, f, is the sum of all its inputs as follows

O^5 = f = ∑_{l=1}^{9} O_l^4 = ∑_{l=1}^{9} ω̄_l K_l    (17)

which is the same as (12).
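As a concrete illustration of the forward pass (13)–(17), the following is a small Python sketch of the zero-order TSK system of Fig. 5 (two inputs, three Gaussian MFs per input, nine rules). The parameter values in the example call are arbitrary placeholders.

```python
import numpy as np

def tsk_output(x, sigma, mean, K):
    """Zero-order TSK forward pass, eqs. (13)-(17).

    x:     crisp inputs, shape (2,)
    sigma: MF standard deviations, shape (2, 3)   (2 inputs x 3 MFs each)
    mean:  MF means, shape (2, 3)
    K:     consequent constants, one per rule, shape (9,)
    """
    mu = np.exp(-((x[:, None] - mean) / sigma) ** 2)     # layer 1: eq. (13)
    w = np.outer(mu[0], mu[1]).ravel()                   # layer 2: rule firing strengths, eq. (14)
    w_bar = w / np.sum(w)                                # layer 3: normalization, eq. (15)
    return float(np.dot(w_bar, K))                       # layers 4-5: weighted average, eqs. (16)-(17)

# example call with placeholder parameters
f = tsk_output(np.array([0.2, -0.4]),
               sigma=np.full((2, 3), 0.6),
               mean=np.array([[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]),
               K=np.linspace(-0.5, 0.5, 9))
```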

4. Genetic Algorithms

Genetic algorithms (GAs) are search and optimization techniques that are based on a formalization of natural genetics [44, 45]. GAs have been used to overcome the difficulty and complexity in the tuning of FLC parameters such as MFs, scaling factors and control rules [5, 6, 8, 46–48].

A GA searches a multidimensional parameter space to find an optimal solution. A given set of parameters is referred to as a chromosome. The parameters can be either real or binary numbers. The GA is initialized with a number of randomly selected parameter vectors or chromosomes. This set of chromosomes is the initial population. Each chromosome is tested and evaluated based on a fitness function (in control engineering we would refer to this as a cost function). The chromosomes are sorted based on the ranking of the fitness functions.


One then selects a number of the best chromosomes, according to the fitness function, to be parents of the next generation of chromosomes. A new set of chromosomes is selected based on reproduction.

In the reproduction process, we generate new chromosomes, which are called children. We use two GA operations. The first operation is a crossover in which we choose a pair of parents, select a random point in their chromosomes and make a cross replacement from one parent to the other. The second operation is a mutation in which a parent is selected and we change one or more of its parameters to get a new child. Now, we have a new population to test again with the fitness function.

The genetic process is repeated until a termination condition is met. There are different conditions to terminate the genetic process, such as: (i) the maximum number of iterations is reached, (ii) a fitness threshold is achieved, (iii) a maximum time is reached, or (iv) a combination of the previous conditions. In our work we cannot use the maximum time condition since we use the learning time as a parameter in our comparison. We use the maximum iteration condition, which is actually based on a threshold condition. The number of GA iterations is determined based on simulations. We tried different numbers of iterations and we determined the number of iterations beyond which further training would not have any significant improvement in performance. The coding process in the proposed QLBGFC technique using GAs will be described in detail in Section 7.

5. Pursuit-Evasion Differential Game

The pursuit-evasion differential game is one application of differential games [49] in which a pursuer tries to catch an evader in minimum time where the evader tries to escape from the pursuer. The pursuit-evasion game is shown in Fig. 6. Equations of motion for the pursuer and the evader robots are [50, 51]

ẋ_i = V_i cos(θ_i)
ẏ_i = V_i sin(θ_i)    (18)
θ̇_i = (V_i / L_i) tan(u_i)

where i is p for the pursuer and e for the evader, (x_i, y_i) is the position of the robot, V_i is the velocity, θ_i is the orientation, L_i is the distance between the front and rear axles, and u_i is the steering angle, where u_i ∈ [−u_{i max}, u_{i max}]. The minimum turning radius is calculated as

R_{d_i min} = L_i / tan(u_{i max})    (19)

Our strategies are to make the pursuer faster than the evader (V_p > V_e) but at the same time to make it less maneuverable than the evader (u_{p max} < u_{e max}). The control strategies that we compare our results with are defined by

u_i = { −u_{i max},   δ_i < −u_{i max};
        δ_i,          −u_{i max} ≤ δ_i ≤ u_{i max};
        u_{i max},    δ_i > u_{i max}    (20)

where

δ_i = tan^{−1}( (y_e − y_p) / (x_e − x_p) ) − θ_i    (21)

where i is p for the pursuer and e for the evader. Capture occurs when the distance between the pursuer and the evader is less than a certain amount, ℓ, called the capture radius, that is, when

√( (x_e − x_p)^2 + (y_e − y_p)^2 ) ≤ ℓ    (22)

One reason for choosing the pursuit-evasion game is that the time-optimal control strategy is known, so it can serve as a reference for our results. In this way, we can check the validity of our proposed techniques.
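To make (18)–(22) concrete, the sketch below advances one robot by a single sampling interval and computes the classical steering strategy; it uses simple Euler integration and arctan2 for the angle in (21), both of which are implementation choices rather than prescriptions of the paper.

```python
import numpy as np

def step_robot(x, y, theta, u, V, L, T=0.1):
    """One Euler step of the kinematic model in eq. (18)."""
    x_new = x + V * np.cos(theta) * T
    y_new = y + V * np.sin(theta) * T
    theta_new = theta + (V / L) * np.tan(u) * T
    return x_new, y_new, theta_new

def classical_strategy(x_p, y_p, theta, x_e, y_e, u_max):
    """Steering command of eqs. (20)-(21), usable by either the pursuer or the evader."""
    delta = np.arctan2(y_e - y_p, x_e - x_p) - theta      # eq. (21)
    return float(np.clip(delta, -u_max, u_max))           # saturation of eq. (20)

def captured(x_p, y_p, x_e, y_e, capture_radius=0.1):
    """Capture test of eq. (22)."""
    return np.hypot(x_e - x_p, y_e - y_p) <= capture_radius
```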

Figure 6: The pursuit-evasion model, showing the pursuer at (x_p, y_p) with velocity V_p and orientation θ_p, and the evader at (x_e, y_e) with velocity V_e and orientation θ_e.


6. The proposed Q(λ)-learning Fuzzy Inference System

A FIS is used as a function approximation for Q(λ)-learning to generalize the discrete state and action spaces into continuous state and action spaces, and at the same time Q(λ)-learning is used to tune the parameters of the FIS and the FLC. The structure of the proposed QLFIS is shown in Fig. 1, which is a modified version of the techniques proposed in [19] and [24].

The difference between the proposed QLFIS and that proposed in [24] is that in [24] the authors used a FIS to approximate the value function, V(s), whereas the proposed QLFIS is used to approximate the action-value function, Q(s, a). In addition, in [24] the authors tune only the output parameters of the FIS and the FLC, while in this work the input and the output parameters of the FIS and the FLC are tuned. The reason for choosing Q-learning in our work is that it outperforms actor-critic learning [52]. The main advantage of Q-learning over actor-critic learning is exploration insensitivity, since Q-learning is an off-policy algorithm (see Section 2) whereas actor-critic learning is an on-policy algorithm.

The difference between the proposed QLFIS and that proposed in [19] is that in [19] the authors used neural networks (NNs) as a function approximation, while here we use a FIS as a function approximation. There are some advantages of using a FIS rather than NNs, such as: (i) linguistic fuzzy rules can be obtained from human experts [53] and (ii) the ability to represent fuzzy and uncertain knowledge [54]. In addition, our results show that the proposed QLFIS outperforms the technique proposed in [19] in both learning time and performance.

Now we will derive the adaptation laws for the input and the output parameters of the FIS and the FLC. The adaptation laws will be derived only once and are applied for both the FIS and the FLC. Our objective is to minimize the TD-error, Δ_t, and by using the mean square error (MSE) we can formulate the error as

E = (1/2) Δ_t^2    (23)

We use the gradient descent approach and, according to the steepest descent algorithm, we make a change along the negative gradient to minimize the error, so

φ(t + 1) = φ(t) − η ∂E/∂φ    (24)

where η is the learning rate and φ is the parameter vector of the FIS and the FLC, where φ = [σ, m, K]. The parameter vector, φ, is to be tuned. From (23) we get

∂E/∂φ = Δ_t ∂Δ_t/∂φ    (25)

Then from (7),

∂E/∂φ = −Δ_t ∂Q_t(s_t, u_t)/∂φ    (26)

Substituting in (24), we get

φ(t + 1) = φ(t) + η Δ_t ∂Q_t(s_t, u_t)/∂φ    (27)

We can obtain ∂Q_t(s_t, u_t)/∂φ for the output parameter, K_l, from (17), where f is Q_t(s_t, u_t) for the FIS and f is u for the FLC, as follows

∂Q_t(s_t, u_t)/∂K_l = ω_l / ∑_l ω_l    (28)

Then we can obtain ∂Q_t(s_t, u_t)/∂φ for the input parameters, σ_i^l and m_i^l, based on the chain rule,

∂Q_t(s_t, u_t)/∂σ_i^l = (∂Q_t(s_t, u_t)/∂ω_l)(∂ω_l/∂σ_i^l)    (29)

∂Q_t(s_t, u_t)/∂m_i^l = (∂Q_t(s_t, u_t)/∂ω_l)(∂ω_l/∂m_i^l)    (30)

The term ∂Q_t(s_t, u_t)/∂ω_l is calculated from (17) and (15). The terms ∂ω_l/∂σ_i^l and ∂ω_l/∂m_i^l are calculated from both (14) and (13), so

∂Q_t(s_t, u_t)/∂σ_i^l = ( (K_l − Q_t(s_t, u_t)) / ∑_l ω_l ) ω_l ( 2(x_i − m_i^l)^2 / (σ_i^l)^3 )    (31)

∂Q_t(s_t, u_t)/∂m_i^l = ( (K_l − Q_t(s_t, u_t)) / ∑_l ω_l ) ω_l ( 2(x_i − m_i^l) / (σ_i^l)^2 )    (32)


Substituting from (28), (31) and (32) in (10) and modifying (27) to use the eligibility trace, the update law for the FIS parameters becomes

φ_Q(t + 1) = φ_Q(t) + η Δ_t e_t    (33)

The update law in (27) is applied also to the FLC by replacing Q_t(s_t, u_t) with the output of the FLC, u. In addition, and as shown in Fig. 1, a random Gaussian noise, n(0, σ_n), with zero mean and standard deviation σ_n is added to the output of the FLC in order to solve the exploration/exploitation dilemma, as the ε-greedy exploration method does, for example, in the discrete state and action spaces. Then the update law for the FLC parameters is defined by

φ_u(t + 1) = φ_u(t) + ξ Δ_t (∂u/∂φ) ( (u_c − u) / σ_n )    (34)

where u_c is the output of the random Gaussian noise generator and ξ is the learning rate for the FLC parameters. The term ∂u/∂φ can be calculated by replacing Q_t(s_t, u_t) with the output of the FLC, u, in (28), (31) and (32).
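The following Python sketch pulls together the gradients (28)–(32) and the update laws (33)–(34) for the two-input, nine-rule structure of Fig. 5. It is a minimal illustration under two stated assumptions: the FIS approximating Q takes the same two state inputs as the FLC, and the max over the next action in (7) is approximated by the FIS value at the next state. Names such as tsk_grads and qlfis_step are ours, not the paper's.

```python
import numpy as np

def tsk_grads(x, sigma, mean, K):
    """TSK output plus its gradients w.r.t. K, sigma and mean, following eqs. (28), (31), (32)."""
    mu = np.exp(-((x[:, None] - mean) / sigma) ** 2)       # (2, 3) membership values, eq. (13)
    w_grid = np.outer(mu[0], mu[1])                        # (3, 3) rule firing strengths, eq. (14)
    w = w_grid.ravel()
    w_sum = w.sum()
    f = float(np.dot(w, K)) / w_sum                        # eq. (17)
    dK = w / w_sum                                         # eq. (28)
    coef = ((K - f) * w / w_sum).reshape(3, 3)             # (K_l - f) w_l / sum(w), per rule
    # each MF is shared by three rules, so eqs. (31)-(32) are accumulated over those rules
    c1, c2 = coef.sum(axis=1), coef.sum(axis=0)
    dsigma = np.vstack([c1 * 2 * (x[0] - mean[0]) ** 2 / sigma[0] ** 3,
                        c2 * 2 * (x[1] - mean[1]) ** 2 / sigma[1] ** 3])
    dmean = np.vstack([c1 * 2 * (x[0] - mean[0]) / sigma[0] ** 2,
                       c2 * 2 * (x[1] - mean[1]) / sigma[1] ** 2])
    return f, dK, dsigma, dmean

def qlfis_step(s, r, s_next, u, u_c, fis, flc, e, gamma=0.95, lam=0.9,
               eta=0.05, xi=0.005, sigma_n=0.08):
    """One QLFIS update of the FIS (eq. 33) and the FLC (eq. 34); fis, flc, e are dicts of arrays."""
    q, dK, dsigma, dmean = tsk_grads(s, fis["sigma"], fis["mean"], fis["K"])
    q_next, _, _, _ = tsk_grads(s_next, fis["sigma"], fis["mean"], fis["K"])
    delta = r + gamma * q_next - q                         # TD-error, eq. (7)
    for name, grad in (("K", dK), ("sigma", dsigma), ("mean", dmean)):
        e[name] = gamma * lam * e[name] + grad             # eligibility trace, eq. (10)
        fis[name] = fis[name] + eta * delta * e[name]      # FIS update law, eq. (33)
    _, dK_u, dsigma_u, dmean_u = tsk_grads(s, flc["sigma"], flc["mean"], flc["K"])
    scale = xi * delta * (u_c - u) / sigma_n               # FLC update law, eq. (34)
    flc["K"] = flc["K"] + scale * dK_u
    flc["sigma"] = flc["sigma"] + scale * dsigma_u
    flc["mean"] = flc["mean"] + scale * dmean_u
    return delta
```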

7. The proposed Q(λ)-learning Based Genetic Fuzzy Logic Controller

The proposed QLBGFC combines Q(λ)-learning with GAs to tune the parameters of the FLC. The learning process passes through two phases, as shown in Fig. 2. We now describe the FLC used in the proposed technique and then discuss the learning in the two phases.

7.1. Fuzzy Logic Controller

A block diagram of a FLC system is shown in Fig. 7. The FLC has two inputs, the error in the pursuer angle, δ, defined in (21), and its derivative, δ̇, and the output is the steering angle, u_p. For the inputs of the FLC, we use the Gaussian MF described by (13).

Figure 7: Block diagram of a FLC system

For the rules we modify (11) to be

R^l : IF δ is A_1^l AND δ̇ is A_2^l THEN f_l = K_l    (35)

where l = 1, 2, ..., 9. The crisp output, u_p, is calculated using (12) as follows

u_p = ∑_{l=1}^{9} ( ∏_{i=1}^{2} μ_{A_i^l}(x_i) ) K_l  /  ∑_{l=1}^{9} ( ∏_{i=1}^{2} μ_{A_i^l}(x_i) )    (36)

7.2. Learning in phase 1

In phase 1, Q(λ)-learning is used to obtain a suitable estimate of the optimal strategy of the pursuer. The state, s, consists of the error in angle of the pursuer, δ, and its derivative, δ̇, and the action, a, is the steering angle of the pursuer, u_p. The states, (δ, δ̇), and their corresponding greedy actions, a∗, are then stored in a lookup table.

7.2.1. Building the discrete state and action spaces

To build the state space we discretize the ranges of the inputs, δ and δ̇, with a step of 0.2. The ranges of δ and δ̇ are set to be from -1.0 to 1.0, so the discretized values for δ and δ̇ will be (−1.0, −0.8, −0.6, ..., 0.0, ..., 0.8, 1.0). There are 11 discretized values for δ and 11 discretized values for δ̇. These values are combined to form 11 × 11 = 121 states.

To build the action space we discretize the range of the action, u_p, with a step of 0.1. The range of the action is set to be from -0.5 to 0.5, so the discretized values for u_p will be (−0.5, −0.4, −0.3, ..., 0.0, ..., 0.4, 0.5). There are 11 actions, and the dimension of the Q-table will be 121-by-11. A short sketch of these grids is given below.
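The following short Python sketch builds the grids just described and the corresponding empty Q-table; the variable names are ours.

```python
import numpy as np

# 11 discrete values for delta and for its derivative (step 0.2 on [-1, 1])
delta_grid = np.round(np.arange(-1.0, 1.0 + 1e-9, 0.2), 1)
# 11 discrete steering actions (step 0.1 on [-0.5, 0.5])
action_grid = np.round(np.arange(-0.5, 0.5 + 1e-9, 0.1), 1)

# 11 x 11 = 121 states, each a (delta, delta_dot) pair
states = [(d, dd) for d in delta_grid for dd in delta_grid]
assert len(states) == 121 and len(action_grid) == 11

# the Q-table of phase 1 is therefore 121-by-11
Q_table = np.zeros((len(states), len(action_grid)))
```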

7.2.2. Constructing the reward function

How to choose the reward function is very important in RL because the agent depends on the reward function in updating its value function. The reward function differs from one system to another according to the desired task. In our case we want the pursuer to catch the evader in minimum time. In other words, we want the pursuer to decrease the distance to the evader at each time step. The distance between the pursuer and the evader at time t is calculated as follows

D(t) = √( (x_e(t) − x_p(t))^2 + (y_e(t) − y_p(t))^2 )    (37)


The difference between two successive distances, ΔD(t), is calculated as

ΔD(t) = D(t) − D(t + 1)    (38)

A positive value of ΔD(t) means that the pursuer approaches the evader. The maximum value of ΔD(t) is defined as

ΔD_max = V_{r max} T    (39)

where V_{r max} is the maximum relative velocity of the pursuer with respect to the evader (V_{r max} = V_p + V_e) and T is the sampling time. So, we choose the reward, r, to be

r_{t+1} = ΔD(t) / ΔD_max    (40)
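As a small illustration, the reward of (37)–(40) can be computed from two successive pursuer and evader positions as follows; V_p, V_e and T take the values given later in Section 8.

```python
import numpy as np

def reward(p_t, e_t, p_next, e_next, Vp=1.0, Ve=0.5, T=0.1):
    """Reward of eq. (40) computed from two successive pursuer/evader positions (x, y)."""
    D_t = np.hypot(e_t[0] - p_t[0], e_t[1] - p_t[1])                  # eq. (37) at time t
    D_next = np.hypot(e_next[0] - p_next[0], e_next[1] - p_next[1])   # eq. (37) at time t+1
    delta_D = D_t - D_next                                            # eq. (38)
    delta_D_max = (Vp + Ve) * T                                       # eq. (39)
    return delta_D / delta_D_max                                      # eq. (40)
```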

The learning process in phase 1 is described in Algorithm 1.

Algorithm 1 (Phase 1: Q(λ)-learning)
1: Discretize the state space, S, and the action space, A.
2: Initialize Q(s, a) = 0 ∀ s ∈ S, a ∈ A.
3: Initialize e(s, a) = 0 ∀ s ∈ S, a ∈ A.
4: For each episode
   a: Initialize (x_p, y_p) = (0, 0).
   b: Initialize (x_e, y_e) randomly.
   c: Compute s_t = (δ, δ̇) according to (21).
   d: Select a_t using (3).
   e: For each play
      i: Receive r_{t+1} according to (40).
      ii: Observe s_{t+1}.
      iii: Select a_{t+1} using (3).
      iv: Calculate e_{t+1} using (8).
      v: Update Q(s_t, a_t) according to (9).
   f: End
5: End
6: Q ← Q∗.
7: Assign a greedy action, a∗, to each state, s, using (4).
8: Store the state-action pairs in a lookup table.

7.3. Learning in phase 2

Phase 2 consists of two stages. In stage 1, the state-action pairs stored in the lookup table are used as the training data to tune the parameters of the FLC using GAs. Stage 1 is an off-line tuning. In this stage, the fitness function used is the mean square error (MSE) defined as

MSE = (1 / 2M) ∑_{m=1}^{M} (a_m^∗ − u_{flc}^m)^2    (41)

where M is the number of input/output data pairs and is equivalent to the number of states, a_m^∗ is the m-th greedy action obtained from phase 1, and u_{flc}^m is the output of the FLC. The GA in this stage is used as supervised learning, so the results of this stage will not be better than those of phase 1; therefore we need to perform stage 2.

In stage 2, we run the pursuit-evasion game with the tuned FLC as the controller. The GA is then used to fine-tune the parameters of the FLC during the interaction with the evader. In this stage, the capture time, which the pursuer wants to minimize, is used as the fitness function, and the GA acts as a reward-based learning technique with a priori knowledge obtained from stage 1.

7.3.1. Coding a FLC into a chromosome

In phase 2, we use a GA to tune the input and the output parameters of the FLC. Now, we will describe the coding of the FLC parameters using the GA. Note that the coding process of the FLC is the same for all the different GAs used in this paper, so we describe it in general. The FLC to be learned has 2 inputs, the error, δ, and its derivative, δ̇. Each input variable has 3 MFs, with a total of 6 MFs. We use the Gaussian MF defined by (13), which has 2 parameters to be tuned. These parameters are the standard deviation, σ, and the mean, m. The total number of input parameters to be tuned is therefore 12. The rules being used are defined by (35), each having a parameter, K, to be tuned, with a total of 9 parameters to be tuned for the output part.

A FLC will be coded into a chromosome with length 12 + 9 = 21 genes, as shown in Fig. 8. We use real numbers in the coding process. The population consists of a set of chromosomes, P (coded FLCs). In the reproduction process, we generate new chromosomes by using two GA operations. The first operation is crossover, in which we choose a pair of parents, select a random gene, g, between 1 and 20, and make a cross replacement from one parent to the other, as shown in Fig. 9. The second operation is mutation, in which we generate a chromosome randomly to avoid a local minimum/maximum of the fitness function. Now, we have a new population to test again with the fitness function. The genetic process is repeated until the termination condition is met (see Section 4). The learning process in phase 2, with its two stages, is described in Algorithm 2 and Algorithm 3; a code sketch of the coding and reproduction operations follows.
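The sketch below illustrates the 21-gene coding and the two GA operations just described. The gene ordering follows Fig. 8; the gene ranges used for the random mutation are assumptions, since the paper does not state them.

```python
import numpy as np

rng = np.random.default_rng()

def encode(sigma, mean, K):
    """Pack the FLC parameters into a 21-gene chromosome (Fig. 8): 6 sigmas, 6 means, 9 K's."""
    return np.concatenate([sigma.ravel(), mean.ravel(), K.ravel()])

def decode(chrom):
    """Unpack a chromosome back into (sigma, mean, K) for a 2-input, 9-rule FLC."""
    return chrom[:6].reshape(2, 3), chrom[6:12].reshape(2, 3), chrom[12:21]

def crossover(parent_a, parent_b):
    """Single-point crossover at a random gene g between 1 and 20 (Fig. 9)."""
    g = rng.integers(1, 21)
    return (np.concatenate([parent_a[:g], parent_b[g:]]),
            np.concatenate([parent_b[:g], parent_a[g:]]))

def mutate(sigma_range=(0.1, 1.0), other_range=(-1.5, 1.5)):
    """Mutation as described in Section 7.3.1: generate a whole chromosome randomly.

    The gene ranges are assumptions; standard deviations are kept positive.
    """
    sigmas = rng.uniform(*sigma_range, size=6)
    rest = rng.uniform(*other_range, size=15)      # 6 means and 9 consequents
    return np.concatenate([sigmas, rest])
```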

7.4. Reward-based genetic algorithm learning

For comparative purposes, we will also implement a general reward-based GA learning technique. The reward-based GA learning is initialized with randomly chosen FLC parameters (chromosomes). The GA adjusts the parameters to maximize the closing distance given by (40). Therefore, (40) acts as the fitness function for the reward-based GA learning.

In the proposed QLBGFC, a GA is used in phase 2 stage 1 to tune the FLC parameters as determined from Q(λ)-learning in phase 1. The GA uses an MSE criterion, given by (41), that measures the difference between the control action defined by Q(λ)-learning and the output of the FLC. The FLC parameters are then tuned by the GA to achieve the greedy actions defined by Q(λ)-learning in phase 1. In phase 2 stage 2, the GA fine-tunes the input and the output parameters of the FLC to minimize the capture time. The learning process of the reward-based GA learning is described in Algorithm 4.

Algorithm 2 (Phase 2 Stage 1: GA learning)
1: Get the state-action pairs from the lookup table.
2: Initialize a set of chromosomes in a population, P, randomly.
3: For each iteration
   a: For each chromosome in the population
      i: Construct a FLC.
      ii: For each state, s,
         • Calculate the FLC output, u_flc, using (36).
      iii: End
      iv: Calculate the fitness value using (41).
   b: End
   c: Sort the entire chromosomes of the population according to their fitness values.
   d: Select a portion of the sorted population as the new parents.
   e: Create a new generation for the remaining portion of the population using crossover and mutation.
4: End

Figure 8: A FLC coded into a chromosome (genes 1–6: the standard deviations σ_1, ..., σ_6; genes 7–12: the means m_1, ..., m_6; genes 13–21: the output parameters K_1, ..., K_9)

Figure 9: Crossover process in a GA: (a) old chromosomes, (b) new chromosomes. The genes of the two parents are exchanged on one side of the randomly selected crossover point g.


Algorithm 3 (Phase 2 Stage 2: GA learning)
1: Initialize a set of chromosomes in a population, P, from the tuned FLC obtained from stage 1.
2: For each iteration
   a: Initialize (x_e, y_e) randomly.
   b: Initialize (x_p, y_p) = (0, 0).
   c: Calculate s_t = (δ, δ̇) according to (21).
   d: For each chromosome in the population
      i: Construct a FLC.
      ii: For each play
         • Calculate the FLC output, u_p, using (36).
         • Observe s_{t+1}.
      iii: End
      iv: Observe the fitness value, which is the capture time that the pursuer wants to minimize.
   e: End
   f: Sort the entire chromosomes of the population according to their fitness values.
   g: Select a portion of the sorted population as the new parents.
   h: Create a new generation for the remaining portion of the population using crossover and mutation.
3: End

Algorithm 4 (Reward-based GA learning)
1: Initialize a set of chromosomes in a population, P, randomly.
2: For each iteration
   a: Initialize (x_e, y_e) randomly.
   b: Initialize (x_p, y_p) = (0, 0).
   c: Calculate s_t = (δ, δ̇) according to (21).
   d: For each chromosome in the population
      i: Construct a FLC.
      ii: For each play
         • Calculate the FLC output, u_p, using (36).
         • Observe s_{t+1}.
      iii: End
      iv: Observe the fitness value defined by (40).
   e: End
   f: Sort the entire chromosomes of the population according to their fitness values.
   g: Select a portion of the sorted population as the new parents.
   h: Create a new generation for the remaining portion of the population using crossover and mutation.
3: End

8. Computer Simulation

We use a Core 2 Duo with a 2.0 GHz clock frequency and 4.0 gigabytes of RAM. We do the computer simulation with MATLAB software. Q(λ)-learning and GAs have many parameters to be set a priori; therefore we ran computer simulations for different parameter values and different parameter value combinations and chose the values that give the best performance. The initial position of the evader is randomly chosen from a set of 64 different positions in the space.

8.1. The pursuit-evasion game

The pursuer starts motion from the position (0, 0) with an initial orientation θ_p = 0 and with a constant velocity V_p = 1 m/s. The distance between the front and rear axles is L_p = 0.3 m and the steering angle u_p ∈ [−0.5, 0.5]. From (19), R_{d_p min} ≈ 0.55 m.

The evader starts motion from a random position for each episode with an initial orientation θ_e = 0 and with a constant velocity V_e = 0.5 m/s, which is half that of the pursuer (slower). The distance between the front and rear axles is L_e = 0.3 m and the steering angle u_e ∈ [−1, 1], which is twice that of the pursuer (more maneuverable). From (19), R_{d_e min} ≈ 0.19 m, which is about one third that of the pursuer. The duration of a game is 60 seconds. The game ends when 60 seconds pass without a capture or when the capture occurs before the end of this time. The capture radius is ℓ = 0.10 m. The sampling time is set to 0.1 sec.

8.2. The proposed QLFIS

We choose the number of episodes (games) to be 1000, the number of plays (steps) in each episode is 600, γ = 0.95, and λ = 0.9. We make the learning rate for the FIS, η, decrease with each episode such that

η = 0.1 − 0.09 ( i / Max. Episodes )    (42)


and also make the learning rate for the FLC, ξ, decrease with each episode such that

ξ = 0.01 − 0.009 ( i / Max. Episodes )    (43)

where i is the current episode. Note that the value of η is 10 times the value of ξ, i.e., the FIS converges faster than the FLC, to avoid instability in tuning the parameters of the FLC. We choose σ_n = 0.08.

8.3. The proposed QLBGFC

We choose the number of episodes to be 200, the number of plays in each episode is 6000, γ = 0.5, and λ = 0.3. We make the learning rate, α, decrease with each episode such that

α = 1 / (i)^{0.7}    (44)

and we also make ε decrease with each episode such that

ε = 0.1 / i    (45)

where i is the current episode. The position of the evader, (x_e, y_e), is chosen randomly at the beginning of each episode to cover most of the states. Table 1 shows the values of the GA parameters used in phase 2 stage 1 and stage 2.
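For reference, the decaying schedules (42)–(45) written out as code; counting episodes from i = 1 is our assumption.

```python
def qlfis_rates(i, max_episodes=1000):
    """Learning rates of eqs. (42)-(43) for the FIS and the FLC at episode i."""
    eta = 0.1 - 0.09 * (i / max_episodes)
    xi = 0.01 - 0.009 * (i / max_episodes)
    return eta, xi

def qlbgfc_rates(i):
    """Learning rate and exploration rate of eqs. (44)-(45) at episode i."""
    alpha = 1.0 / i ** 0.7
    eps = 0.1 / i
    return alpha, eps
```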

8.4. Compared techniques

To validate the proposed QLFIS and QLBGFC techniques, we compare their results with the results of the optimal strategy, Q(λ)-learning only, the technique proposed in [19], and the reward-based GA learning. The optimal strategies of the pursuer and the evader are defined by (20) and (21). The parameters of Q(λ)-learning only have the following values: the number of episodes is set to 200, the number of plays in each episode is 6000, γ = 0.5 and λ = 0.3. The learning rate, α, and ε are defined by (44) and (45), respectively.

Table 1: Values of GA parameters
                              Phase 2
                              Stage 1                Stage 2
Number of iterations          800                    200
Population size               40                     10
Number of plays               –                      300
Crossover probability         0.2                    0.2
Mutation probability          0.1                    0.1
Fitness function              MSE defined by (41)    capture time
Fitness function objective    minimize               minimize

For the technique proposed in [19], we choose the same values for the parameters of the NN. The NN has a three-layer structure with 7-21-1 nodes. The RL parameters and the initial values of the input and the output parameters of the FLC are all chosen to be the same as those chosen in the proposed QLFIS. We choose σ_n = 0.1, which decreases each episode by 1/i, where i is the current episode. The parameters of the reward-based GA learning are chosen as follows: the number of iterations = 1000, the population size = 40, the number of plays = 300, the probability of crossover = 0.2 and the probability of mutation = 0.1.

Note that in phase 2 stage 2 of the proposed QLBGFC we already have a tuned FLC (obtained from phase 2 stage 1) and we just fine-tune it, but in the reward-based GA learning we have no idea about the FLC parameters, so we initialize them randomly. This random initialization of the FLC will make the pursuer unable to catch the evader, or take a long time to do so, for some iterations, and therefore the learning time will increase, as we see in Section 9.

9. Results

Fig. 10 and Table 2 show the input and the output parameters of the FLC after tuning using the proposed QLFIS, where "N", "Z", and "P" refer to the linguistic values "Negative", "Zero", and "Positive". Fig. 11 and Table 3 show the input and the output parameters of the FLC after tuning using the proposed QLBGFC.

Table 4 shows the capture times for different initial positions of the evader using the optimal strategy of the pursuer, Q(λ)-learning only, the proposed QLFIS, the technique proposed in [19], the reward-based GA, and the proposed QLBGFC. In addition, the learning times for the different techniques are also shown in this table. From Table 4 we can see that although Q(λ)-learning only has the minimum learning time, it is not enough to get the desired performance in comparison with the optimal strategy and the other techniques. The reward-based GA gets the best performance in comparison to the other techniques and its performance approaches that of the optimal strategy. However, the learning process using the reward-based GA takes a comparatively long learning time. The proposed QLFIS outperforms the technique proposed in [19] in both performance and learning time, and both of them have better performance than Q(λ)-learning only. We can also see that the proposed QLBGFC has the best performance, like the reward-based GA. In addition, it takes only 47.8 seconds in the learning process, which is about 10% of the learning time taken by the reward-based GA and about 18% of the learning time taken by the technique proposed in [19].

Figure 10: MFs for the inputs after tuning using the proposed QLFIS: (a) the input δ_p, (b) the input δ̇_p

Table 2: Fuzzy decision table after tuning using the proposed QLFIS
δ̇_p \ δ_p     N         Z         P
N          -0.5452   -0.2595   -0.0693
Z          -0.2459    0.0600    0.2299
P           0.0235    0.3019    0.5594

Figure 11: MFs for the inputs after tuning using the proposed QLBGFC: (a) the input δ_p, (b) the input δ̇_p

Table 3: Fuzzy decision table after tuning using the proposed QLBGFC
δ̇_p \ δ_p     N         Z         P
N          -1.0927   -0.4378   -0.7388
Z          -0.5315   -0.2145    0.0827
P           0.9100    0.1965    0.0259

Table 4: Capture time, in seconds, for different evader initial positions and learning time, in seconds, for the different techniques
                              Evader initial position                  Learning
                              (-6,7)   (-7,-7)   (2,4)   (3,-8)        time
Optimal strategy              9.6      10.4      4.5     8.5           —
Q(λ)-learning only            12.6     15.6      8.5     11.9          32.0
Technique proposed in [19]    10.9     12.9      4.7     9.1           258.6
Proposed QLFIS                10.0     10.7      4.6     8.8           65.2
Reward-based GA               9.7      10.5      4.5     8.6           460.8
Proposed QLBGFC               9.9      10.5      4.6     8.7           47.8

Now, we will increase the complexity of the model by making both the pursuer and the evader self-learn their control strategies simultaneously. The difficulty in the learning process is that each robot will try to find its optimal control strategy based on the control strategy of the other robot which, at the same time, is still learning.

Fig. 12 and Table 5 show the input and the output parameters of the FLC for the pursuer using the proposed QLFIS. Fig. 13 and Table 6 show the input and the output parameters of the FLC for the evader using the proposed QLFIS. Fig. 14 and Table 7 show the input and the output parameters of the FLC for the pursuer using the proposed QLBGFC. Fig. 15 and Table 8 show the input and the output parameters of the FLC for the evader using the proposed QLBGFC.

Figure 12: MFs for the inputs of the pursuer after tuning using the proposed QLFIS: (a) the input δ_p, (b) the input δ̇_p

Table 5: Fuzzy decision table for the pursuer after tuning using the proposed QLFIS
δ̇_p \ δ_p     N         Z         P
N          -1.2990   -0.6134   -0.4064
Z          -0.3726   -0.0097    0.3147
P           0.3223    0.5763    0.9906

Figure 13: MFs for the inputs of the evader after tuning using the proposed QLFIS: (a) the input δ_e, (b) the input δ̇_e

Table 6: Fuzzy decision table for the evader after tuning using the proposed QLFIS
δ̇_e \ δ_e     N         Z         P
N          -1.4827   -0.4760   -0.0184
Z          -0.5365   -0.0373    0.5500
P          -0.0084    0.4747    1.1182

Figure 14: MFs for the inputs of the pursuer after tuning using the proposed QLBGFC: (a) the input δ_p, (b) the input δ̇_p

Table 7: Fuzzy decision table for the pursuer after tuning using the proposed QLBGFC
δ̇_p \ δ_p     N         Z         P
N          -0.4677   -0.2400   -0.7332
Z          -0.8044   -1.2307   -0.2618
P           0.8208    0.1499    0.9347

Figure 15: MFs for the inputs of the evader after tuning using the proposed QLBGFC: (a) the input δ_e, (b) the input δ̇_e

Table 8: Fuzzy decision table for the evader after tuning using the proposed QLBGFC
δ̇_e \ δ_e     N         Z         P
N          -1.1479   -0.0022   -0.2797
Z          -0.0529   -0.9777   -0.1257
P           0.4332    0.4061    0.7059

To check the performance of the different techniques we cannot use the capture time as a measure, as we did in Table 4, because in this game both the pursuer and the evader are learning, so we may find capture times that are smaller than those corresponding to the optimal solution. Of course that does not mean that the performance is better than the optimal solution; rather, it means that the evader does not learn well and as a result is captured in a shorter time. Therefore the measure that we use is the paths of both the pursuer and the evader instead of the capture time. Fig. 16, Fig. 17, Fig. 18, Fig. 19, and Fig. 20 show the paths of the pursuer and the evader for the different techniques against the optimal strategies of the pursuer and the evader. We can see that the best performance is that of the proposed QLFIS and the proposed QLBGFC. We can also see that the performance of the reward-based GA diminishes as a result of increasing the complexity of the system by making both the pursuer and the evader learn their control strategies simultaneously.

Table 9 shows the learning time for the different techniques and shows that the proposed QLBGFC has the minimum learning time. Finally, we can conclude that the proposed QLBGFC has the best performance and the best learning time among all the other techniques. We can also see that the proposed QLFIS still outperforms the technique proposed in [19] in both performance and learning time.

Table 9: Learning time, in seconds, for the different techniques
                              Learning time
Q(λ)-learning only            62.8
Technique proposed in [19]    137.0
Proposed QLFIS                110.3
Reward-based GA               602.5
Proposed QLBGFC               54.7

10. Conclusion

In this paper we proposed two novel techniques to tune the parameters of a FLC. In the first, RL is combined with a FIS as a function approximation to generalize the state and the action spaces to the continuous case. The second technique combines RL with GAs as a powerful optimization technique. The proposed techniques are applied to a pursuit-evasion game. First, we assume that the pursuer does not know its control strategy. However, it can self-learn its optimal control strategy by interaction with the evader. Second, we increase the complexity of the model by assuming that both the pursuer and the evader do not know their control strategies. Computer simulation results show that the proposed QLFIS and the proposed QLBGFC techniques outperform all the other techniques both in performance, when compared with the optimal strategy, and in learning time, which is also an important factor, especially in on-line applications.

In future work, we will test our proposed techniques in more complex pursuit-evasion games in which multiple pursuers and multiple evaders self-learn their control strategies.

Figure 16: Paths of the pursuer and the evader (solid line) using Q(λ)-learning only against the optimal strategies (dotted line)

Figure 17: Paths of the pursuer and the evader (solid line) using the technique proposed in [19] against the optimal strategies (dotted line)

Figure 18: Paths of the pursuer and the evader (solid line) using the proposed QLFIS against the optimal strategies (dotted line)

Figure 19: Paths of the pursuer and the evader (solid line) using the reward-based GA against the optimal strategies (dotted line)

Figure 20: Paths of the pursuer and the evader (solid line) using the proposed QLBGFC against the optimal strategies (dotted line)

References

[1] S. Micera, A. M. Sabatini, and P. Dario, "Adaptive fuzzy control of electrically stimulated muscles for arm movements," Medical and Biological Engineering and Computing, vol. 37, pp. 680–685, Jan. 1999.

[2] F. Daldaban, N. Ustkoyuncu, and K. Guney, "Phase inductance estimation for switched reluctance motor using adaptive neuro-fuzzy inference system," Energy Conversion and Management, vol. 47, pp. 485–493, 2006.

[3] S. Labiod and T. M. Guerra, "Adaptive fuzzy control of a class of SISO nonaffine nonlinear systems," Fuzzy Sets and Systems, vol. 158, no. 10, pp. 1126–1137, 2007.



[4] H. K. Lam and F. H. F. Leung, "Fuzzy controller with stability and performance rules for nonlinear systems," Fuzzy Sets and Systems, vol. 158, pp. 147–163, 2007.

[5] H. Hagras, V. Callaghan, and M. Colley, "Learning and adaptation of an intelligent mobile robot navigator operating in unstructured environment based on a novel online fuzzy-genetic system," Fuzzy Sets and Systems, vol. 141, pp. 107–160, Jan. 2004.

[6] M. Mucientes, D. L. Moreno, A. Bugarin, and S. Barro, "Design of a fuzzy controller in mobile robotics using genetic algorithms," Applied Soft Computing, vol. 7, no. 2, pp. 540–546, 2007.

[7] L. X. Wang, "Generating fuzzy rules by learning from examples," IEEE Transactions on Systems, Man, and Cybernetics, vol. 22, pp. 1414–1427, Nov./Dec. 1992.

[8] F. Herrera, M. Lozano, and J. L. Verdegay, "Tuning fuzzy logic controllers by genetic algorithms," International Journal of Approximate Reasoning, vol. 12, pp. 299–315, 1995.

[9] A. Lekova, L. Mikhailov, D. Boyadjiev, and A. Nabout, "Redundant fuzzy rules exclusion by genetic algorithms," Fuzzy Sets and Systems, vol. 100, pp. 235–243, 1998.

[10] S. F. Desouky and H. M. Schwartz, "Genetic based fuzzy logic controller for a wall-following mobile robot," in American Control Conference, (St. Louis, MO), pp. 3555–3560, Jun. 2009.

[11] S. F. Desouky and H. M. Schwartz, "Different hybrid intelligent systems applied for the pursuit-evasion game," in 2009 IEEE International Conference on Systems, Man, and Cybernetics, pp. 2677–2682, Oct. 2009.

[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.

[13] L. P. Kaelbling, M. L. Littman, and A. P. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.

[14] W. D. Smart and L. P. Kaelbling, "Effective reinforcement learning for mobile robots," in IEEE International Conference on Robotics and Automation, vol. 4, (Washington, DC), pp. 3404–3410, May 2002.

[15] C. Ye, N. H. C. Yung, and D. Wang, "A fuzzy controller with supervised learning assisted reinforcement learning algorithm for obstacle avoidance," IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, vol. 33, pp. 17–27, Feb. 2003.

[16] T. Kondo and K. Ito, "A reinforcement learning with evolutionary state recruitment strategy for autonomous mobile robots control," Robotics and Autonomous Systems, vol. 46, pp. 111–124, Feb. 2004.

[17] D. A. Gutnisky and B. S. Zanutto, "Learning obstacle avoidance with an operant behavior model," Artificial Life, vol. 10, no. 1, pp. 65–81, 2004.

[18] M. Rodríguez, R. Iglesias, C. V. Regueiro, J. Correa, and S. Barro, "Autonomous and fast robot learning through motivation," Robotics and Autonomous Systems, vol. 55, pp. 735–740, Sep. 2007.

[19] X. Dai, C. K. Li, and A. B. Rad, "An approach to tune fuzzy controllers based on reinforcement learning for autonomous vehicle control," IEEE Transactions on Intelligent Transportation Systems, vol. 6, pp. 285–293, Sep. 2005.

[20] M. J. Er and C. Deng, "Obstacle avoidance of a mobile robot using hybrid learning approach," IEEE Transactions on Industrial Electronics, vol. 52, pp. 898–905, Jun. 2005.

[21] H. Xiao, L. Liao, and F. Zhou, "Mobile robot path planning based on Q-ANN," in IEEE International Conference on Automation and Logistics, (Jinan, China), pp. 2650–2654, Aug. 2007.

[22] R. Vidal, O. Shakernia, H. J. Kim, D. H. Shim, and S. Sastry, "Probabilistic pursuit-evasion games: theory, implementation, and experimental evaluation," IEEE Transactions on Robotics and Automation, vol. 18, pp. 662–669, Oct. 2002.

[23] Y. Ishiwaka, T. Sato, and Y. Kakazu, "An approach to the pursuit problem on a heterogeneous multiagent system using reinforcement learning," Robotics and Autonomous Systems, vol. 43, no. 4, pp. 245–256, 2003.

[24] S. N. Givigi, H. M. Schwartz, and X. Lu, "A reinforcement learning adaptive fuzzy controller for differential games," Journal of Intelligent and Robotic Systems, 2009. DOI: 10.1007/s10846-009-9380-4.

[25] L. Jouffe, "Fuzzy inference system learning by reinforcement methods," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 28, pp. 338–355, Aug. 1998.

[26] Y. Duan and X. Hexu, "Fuzzy reinforcement learning and its application in robot navigation," in International Conference on Machine Learning and Cybernetics, vol. 2, (Guangzhou), pp. 899–904, IEEE, Aug. 2005.

[27] L. Busoniu, R. Babuska, and B. de Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, pp. 156–172, Mar. 2008.

[28] A. Waldock and B. Carse, "Fuzzy Q-learning with an adaptive representation," in 2008 IEEE 16th International Conference on Fuzzy Systems (FUZZ-IEEE), (Piscataway, NJ), pp. 720–725, 2008.

[29] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279–292, May 1992.

[30] C. J. C. H. Watkins, Learning From Delayed Rewards. PhD thesis, Cambridge University, 1989.

[31] B. Bhanu, P. Leang, C. Cowden, Y. Lin, and M. Patterson, "Real-time robot learning," in IEEE International Conference on Robotics and Automation, vol. 1, (Seoul, Korea), pp. 491–498, May 2001.

[32] K. Macek, I. Petrovic, and N. Peric, "A reinforcement learning approach to obstacle avoidance of mobile robots," in 7th International Workshop on Advanced Motion Control, (Maribor, Slovenia), pp. 462–466, 2002.

[33] T. M. Marin and T. Duckett, "Fast reinforcement learning for vision-guided mobile robots," in IEEE International Conference on Robotics and Automation, (Barcelona, Spain), pp. 4170–4175, Apr. 2005.

[34] Y. Yang, Y. Tian, and H. Mei, "Cooperative Q-learning based on blackboard architecture," in International Conference on Computational Intelligence and Security Workshops, (Harbin, China), pp. 224–227, Dec. 2007.

[35] C. Deng and M. J. Er, "Real-time dynamic fuzzy Q-learning and control of mobile robots," in 5th Asian Control Conference, vol. 3, pp. 1568–1576, Jul. 2004.

[36] M. J. Er and Y. Zhou, "Dynamic self-generated fuzzy systems for reinforcement learning," in International Conference on Intelligence For Modelling, Control and Automation, Jointly with International Conference on Intelligent Agents, Web Technologies and Internet Commerce, pp. 193–198, 2005.

[37] C. K. Tham, "Reinforcement learning of multiple tasks using a hierarchical CMAC architecture," Robotics and Autonomous Systems, vol. 15, no. 4, pp. 247–274, 1995.

[38] V. Stephan, K. Debes, and H. M. Gross, "A new control scheme for combustion processes using reinforcement learning based on neural networks," International Journal of Computational Intelligence and Applications, vol. 1, pp. 121–36, Jun. 2001.

[39] X. W. Yan, Z. D. Deng, and Z. Q. Sun, "Genetic Takagi-Sugeno fuzzy reinforcement learning," in IEEE International Symposium on Intelligent Control, (Mexico City, Mexico), pp. 67–72, Sep. 2001.

[40] D. Gu and H. Hu, "Accuracy based fuzzy Q-learning for robot behaviours," in International Conference on Fuzzy Systems, vol. 3, (Budapest, Hungary), pp. 1455–1460, Jul. 2004.

[41] S. P. Singh and R. S. Sutton, "Reinforcement learning with replacing eligibility traces," Machine Learning, vol. 22, pp. 123–158, Jan./Mar. 1996.

[42] E. H. Mamdani and S. Assilian, "An experiment in linguistic synthesis with a fuzzy logic controller," International Journal of Man-Machine Studies, vol. 7, no. 1, pp. 1–13, 1975.

[43] T. Takagi and M. Sugeno, "Fuzzy identification of systems and its applications to modelling and control," IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-15, pp. 116–132, Jan. 1985.

[44] J. H. Holland, "Genetic algorithms and the optimal allocation of trials," SIAM Journal on Computing, vol. 2, no. 2, pp. 88–105, 1973.

[45] D. E. Goldberg, Genetic Algorithms for Search, Optimization, and Machine Learning. Reading: Addison-Wesley, 1989.

[46] C. Karr, "Genetic algorithms for fuzzy controllers," AI Expert, vol. 6, no. 2, pp. 26–33, 1991.

[47] L. Ming, G. Zailin, and Y. Shuzi, "Mobile robot fuzzy control optimization using genetic algorithm," Artificial Intelligence in Engineering, vol. 10, no. 4, pp. 293–298, 1996.

[48] Y. C. Chiou and L. W. Lan, "Genetic fuzzy logic controller: An iterative evolution algorithm with new encoding method," Fuzzy Sets and Systems, vol. 152, pp. 617–635, Jun. 2005.

[49] R. Isaacs, Differential Games. John Wiley and Sons, 1965.

[50] S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.

[51] S. H. Lim, T. Furukawa, G. Dissanayake, and H. F. D. Whyte, "A time-optimal control strategy for pursuit-evasion games problems," in International Conference on Robotics and Automation, (New Orleans, LA), Apr. 2004.

[52] J. Peng and R. J. Williams, "Incremental multi-step Q-learning," Machine Learning, vol. 22, pp. 283–290, Jan. 1996.

[53] L. Jouffe, "Actor-critic learning based on fuzzy inference system," in IEEE International Conference on Systems, Man, and Cybernetics, vol. 1, (Beijing, China), pp. 339–344, Oct. 1996.

[54] X. S. Wang, Y. H. Cheng, and J. Q. Yi, "A fuzzy actor-critic reinforcement learning network," Information Sciences, vol. 177, pp. 3764–3781, 2007.


Sameh F. Desouky received his B.Eng. and M.Sc. degrees from the Military Technical College, Cairo, Egypt in June 1996 and January 2002, respectively. He is currently working toward the Ph.D. degree in the Department of Systems and Computer Engineering at Carleton University, Ottawa, Canada, and his research interests include adaptive and fuzzy control, genetic algorithms, reinforcement learning, and mobile robots.

Howard M. Schwartz received his B.Eng. degree from McGill University, Montreal, Quebec in June 1981 and his M.Sc. and Ph.D. degrees from M.I.T., Cambridge, Massachusetts in 1982 and 1987, respectively. He is currently the chief of the Department of Systems and Computer Engineering at Carleton University, Ottawa, Canada, and his research interests include adaptive and intelligent control systems, robotics and process control, system modeling, system identification, system simulation and computer vision systems.
