FROM REACTIVE TO COGNITIVE AGENTS: EXTENDING REINFORCEMENT LEARNING TO GENERATE SYMBOLIC KNOWLEDGE BASES

Romulo Guedes Cerqueira∗ Augusto Loureiro da Costa∗

Stephen G. McGill† Daniel Lee† George Pappas†

∗Robotics Laboratory, Postgraduate Program in Electrical Engineering

Federal University of Bahia, 40210-630 Salvador, Bahia, Brazil

†GRASP Lab, University of Pennsylvania, Philadelphia, Pennsylvania, USA

Email: [email protected], [email protected],

[email protected], [email protected], [email protected]

Abstract— A new methodology for knowledge-based agents to learn from interactions with their environment is presented in this paper. This approach combines Reinforcement Learning and Knowledge-Based Systems. A Q-Learning algorithm obtains the optimal policy, which is automatically coded into a symbolic rule base, using first-order logic as the knowledge representation formalism. The knowledge base was embedded in an omnidirectional mobile robot, making it able to navigate autonomously in unpredictable environments with known and unknown obstacles using the same knowledge base. Additionally, a method of space abstraction based on human reasoning was formalized to reduce the large number of environment states and to accelerate learning. Experimental results of autonomous navigation executed by the real robot are also presented here.

Keywords— Robot Navigation, Knowledge-Based Systems, Reinforcement Learning, Embedded Intelligent Systems.

Resumo— Uma nova metodologia para agentes baseados em conhecimento aprenderem através de interações com seus ambientes é apresentada neste artigo. Esta abordagem combina Aprendizado por Reforço e Sistemas Baseados em Conhecimento. Um algoritmo Q-Learning obtém a política ótima, que é automaticamente codificada para uma base de regras simbólica, utilizando lógica de primeira ordem como formalismo de representação do conhecimento. A base de conhecimento foi embarcada em um robô móvel omnidirecional, tornando-o capaz de navegar autonomamente em ambientes imprevisíveis com obstáculos conhecidos e desconhecidos usando a mesma base de conhecimento. Adicionalmente, um método de abstração do espaço baseado em raciocínio humano foi formalizado para reduzir o complexo número de estados do ambiente e acelerar o aprendizado. Os resultados experimentais da navegação autônoma executada pelo robô real também são apresentados aqui.

Palavras-chave— Navegação de Robôs, Sistemas Baseados em Conhecimento, Aprendizado por Reforço, Sistemas Inteligentes Embarcados.

1 Introduction

In recent years, the Artificial Intelligence community has sought to replace explicit programming with the process of teaching a task, and a large number of works address autonomous robot skill acquisition through learning processes. In this context, Reinforcement Learning (RL) is a widely used technique, in which an agent learns a policy π by exploring a partially or totally unknown environment, based on the rewards obtained from its actions. The optimal policy π∗ maps each environment state st to the best action at and is stored in a hash table.

During task execution, this method exhibits the classical limitations of reactive agents: in some situations, such as when the robot is surrounded by mobile obstacles, it fails. Moreover, the machine learning process usually takes a long time to learn a new desired behaviour, and new learning episodes are needed for each new environment configuration. In unpredictable, changing environments with unpredictable obstacles, the agent can spend more time learning than executing the desired task. Furthermore, the storage and management of several look-up tables becomes hard work for a reactive agent.

A two-layered approach is presented in (Fatmi et al., 2006), where basic behaviours are implemented in a low-level layer and a supervisory fuzzy rule-based system, implemented in a high-level layer, decides the appropriate basic behaviour for the current environment state. The fuzzy rule base was automatically generated by a Reinforcement Learning algorithm. This improves on the behaviour-based navigation idea of (Brooks, 1991) and extends Reinforcement Learning to generate a fuzzy rule-based system. In another approach (Torrey, 2009), Inductive Logic Programming is used for transfer learning, where relevant knowledge from previous learning experiences is applied to help learning a new task. Recently, a significant amount of interesting research on mobile robot navigation has been presented. In these works, some widely used methods are neural networks (Ouarda, 2010), fuzzy logic (Shayestegan and Marhaban, 2012), reinforcement learning (Jaradat et al., 2011), genetic algorithms (Manikas, 2007), and a rule-based neuro-fuzzy technique (Pradhan et al., 2007).

Aiming to give mobile robots full decision autonomy in high-level task execution, a cognitive agent called the Concurrent Autonomous Agent (CAA) architecture (da Costa and Bittencourt, 1999) was chosen to be embedded in the mobile robot. This agent architecture embodies a model for higher-level cognition that supports learning and reasoning mechanisms using logical expressions. The cognitive model is based on the following fundamental hypotheses (Bittencourt, 2007):

• Cognition is an emergent property of a cyclic dynamic self-organizing process based on the interaction of a large number of functionally independent units of a few types.

• Any model of the cognitive activity should be epistemologically compatible with the Theory of Evolution (McCarthy and Hayes, 1969). That applies not only to the “hardware” components of this activity but also to its “psychological” aspects.

• Learning and cognitive activity are closely related and, therefore, the cognitive modelling process should strongly depend on the cognitive agent's particular history (Bittencourt, 2007).

Since learning and cognitive activity are closely related, the methodology that combines RL and Knowledge-Based Systems was implemented to allow the CAA to learn from its interaction with the environment. This approach makes the CAA fully compatible with the three hypotheses for a cognitive model. In the previous CAA implementation, learning was possible only at the reactive level, since both the instinctive and the cognitive levels use Knowledge-Based Systems for automatic reasoning. Here, the methodology was applied to the instinctive level, where a knowledge-based system is used for plan execution. For a given goal gi, the RL obtains the optimal policy, which is automatically coded using a knowledge representation formalism based on first-order logic. Each knowledge base KBi stores the knowledge necessary for the agent to achieve the goal gi, combining a set of basic behaviours that can be performed in the environment. A new approach for mapping the environment around the mobile robot's neighbourhood allows a significant reduction of the relevant state space.

The paper is organized as follows: in Section 2, the proposed approach for abstraction of the mobile robot neighbourhood is presented. Section 3 describes the algorithm based on RL, where the optimal policy π∗ is learned and then coded into the knowledge bases KBi. Section 4 presents experimental results from simulated and real mobile robot locomotion. Section 5 presents conclusions and future work.

2 Reduced State Space Mapping

When humans try to walk safely in a dynamic environment, they care about the relative or approximate distances and directions to the goal and to the closest obstacle, navigating safely regardless of the number of obstacles. In this work, the agent learns based on this kind of human reasoning. The robot is considered the centre of its universe, and the surrounding environment is divided into four orthogonal regions: R1, R2, R3 and R4, as seen in Figure 1.

Figure 1: The Environment Zones and States Space

The environment state at each time t is determined by the goal region (Rg) and the closest obstacle region (Ro):

St = (Rg, Ro),   Rg, Ro ∈ {R1, R2, R3, R4}   (1)

This approach dramatically reduces the environment to 16 states, accelerates Q-Learning, decreases computational cost and, more specifically, generates a small knowledge base, which is very appropriate for embedding in mobile robots.
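
As an illustration, the region and state computation could be written in C along the following lines; the quadrant numbering convention, the 0-based state index and the function names are assumptions made for this sketch, not taken from the paper.

/* Quadrant of a point (x, y) relative to the robot at (rx, ry).
 * Assumed convention: R1 = front-right, R2 = front-left,
 * R3 = rear-left, R4 = rear-right; any fixed convention works,
 * as long as learning and execution share it. */
static int region(double rx, double ry, double x, double y)
{
    double dx = x - rx;
    double dy = y - ry;
    if (dx >= 0.0 && dy >= 0.0) return 1;   /* R1 */
    if (dx <  0.0 && dy >= 0.0) return 2;   /* R2 */
    if (dx <  0.0 && dy <  0.0) return 3;   /* R3 */
    return 4;                               /* R4 */
}

/* State index in [0, 15] built from the goal region and the
 * closest-obstacle region, following Equation (1). */
static int state_index(int goal_region, int obstacle_region)
{
    return (goal_region - 1) * 4 + (obstacle_region - 1);
}

With this encoding, Rg and Ro each take one of four values, giving the 16 states of Equation (1).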

This work uses a global vision system, in which the environment surrounding the robot is known, but an omnidirectional local vision system could also be used. The following assumptions are adopted:

1. The robot's position and velocity are known at each instant;

2. The goal's position and velocity are known at each instant;

3. The closest obstacle's position and velocity are known at each instant.

The new methodology consists of three steps: agent learning, execution in a simulated environment, and task execution with the knowledge base embedded in the mobile robot. The first and second steps are carried out in a simulated environment, using the Player/Stage robotics simulator. Once the optimal policy π∗ is generated, it is coded into a symbolic knowledge base KB, which is embedded into the CAA's instinctive level. Then new simulation experiments, in which the environment assumes different configurations, are performed. Finally, the agent is embedded into the omnidirectional mobile robot, and collision-free navigation experiments are carried out. In a second experiment, using a NAO humanoid robot, the agent's reactive level was replaced by a version that embeds a Zero Moment Point controller and the local vision system, provided by the Unified Humanoid Robotic Software Platform (McGill et al., 2010); the instinctive level and, in particular, the knowledge bases were kept the same, and the same collision-free navigation experiments were successfully repeated with the humanoid robot.

3 Reinforcement Learning for Trajectory Planning

The RL model that the CAA uses in this work is presented in Figure 2. At each time instant t (iteration), the agent uses its sensors to perceive the current environment state st. Following a policy π, an action at is chosen and executed by its actuators. This action leads the environment to assume a new state st+1. The agent is rewarded if the action taken was good; otherwise it is punished. Over time, this reinforcement signal causes corrections to the policy, defining a dynamic, interactive learning process.

A policy π maps environment states into actions and defines the agent's behaviour over time. Thus, the role of the RL system is to find an optimal policy π∗ that maximizes the reward R returned by the environment. Once the optimal policy π∗ is learned, it is coded into knowledge bases.

Figure 2: RL model with CAA architecture

The proposed algorithm, implemented in the C programming language, is based on Q-Learning, a machine learning technique that has been successfully used to solve the mobile robot navigation problem in dynamic environments.

3.1 The Set of Actions

An omnidirectional robot and a robot soccer field with RoboCup F180 League official dimensions compose the environment in this work. Given the holonomic nature of an omnidirectional robot, the set of actions is defined by eight movement actions:

North, Northeast, East, Southeast, South, Southwest, West, Northwest.
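
In C, this action set could be encoded as below; the unit displacement vectors are an assumed convention for commanding the omnidirectional drive, not part of the paper.

typedef enum {
    NORTH, NORTHEAST, EAST, SOUTHEAST,
    SOUTH, SOUTHWEST, WEST, NORTHWEST,
    NUM_ACTIONS
} action_t;

/* Unit displacement of each action (x to the east, y to the north).
 * An omnidirectional robot can follow any of these directions
 * directly, which is why all eight actions are admissible. */
static const double DIAG = 0.7071067811865476;
static const double action_dx[NUM_ACTIONS] = { 0.0,  DIAG, 1.0,  DIAG, 0.0, -DIAG, -1.0, -DIAG };
static const double action_dy[NUM_ACTIONS] = { 1.0,  DIAG, 0.0, -DIAG, -1.0, -DIAG, 0.0,  DIAG };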

3.2 The Reward Function

The reward function is defined by the Euclidean distances between the robot and the goal (drg) and between the robot and the closest obstacle (dro). Thus, the set of states is classified as follows:

• Safe Zone (SZ), when the robot has no or low possibility of collision with any obstacle;

• Not-Safe Zone (NZ), when the robot has a high possibility of collision with any obstacle;

• Victory Zone (VZ), when the robot reaches its goal (the ball in this case);

• Failure Zone (FZ), when the robot collides with an obstacle.

Moreover:

• radvz is the radius of VZ around the goal;

• radnz is the radius of NZ around the closest obstacle;

• radcol is the collision radius around the closest obstacle.

The robot zone at time instant t (Figure 1) is defined as:

$$
Z(t) =
\begin{cases}
VZ, & d_{rg} \le rad_{vz} \\
FZ, & d_{ro} \le rad_{col} \\
NZ, & rad_{col} < d_{ro} \le rad_{nz} \\
SZ, & d_{ro} > rad_{nz}
\end{cases}
\qquad (2)
$$

From these data, the reward function is defined as:

$$
r =
\begin{cases}
4, & Z \subset SZ \rightarrow VZ \\
4, & Z \subset NZ \rightarrow VZ \\
1, & Z \subset NZ \rightarrow SZ \\
-1, & Z \subset SZ \rightarrow NZ \\
3, & Z \subset NZ \rightarrow NZ,\; d_{ro}(n+1) < d_{ro}(n),\; d_{rg}(n+1) < d_{rg}(n),\; d_{ro}(n+1) \ge d_{rb}(n+1) \\
-3, & Z \subset NZ \rightarrow NZ,\; d_{ro}(n+1) < d_{ro}(n),\; d_{rg}(n+1) < d_{rg}(n),\; d_{ro}(n+1) < d_{rb}(n+1) \\
-2, & Z \subset NZ \rightarrow NZ,\; d_{ro}(n+1) < d_{ro}(n),\; d_{rg}(n+1) \ge d_{rg}(n) \\
2, & Z \subset NZ \rightarrow NZ,\; d_{ro}(n+1) \ge d_{ro}(n),\; d_{rg}(n+1) < d_{rg}(n) \\
-1, & Z \subset NZ \rightarrow NZ,\; d_{ro}(n+1) \ge d_{ro}(n),\; d_{rg}(n+1) \ge d_{rg}(n) \\
0, & Z \subset SZ \rightarrow SZ \\
-4, & Z \subset NZ \rightarrow FZ
\end{cases}
\qquad (3)
$$
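
A compact C version of the zone test of Equation (2) might look as follows; the radius values are placeholders, since the paper does not report the thresholds used in the experiments.

typedef enum { SZ, NZ, VZ, FZ } zone_t;

/* Placeholder radii (in metres); the actual values are not given in the paper. */
static const double rad_vz  = 0.10;   /* victory zone around the goal        */
static const double rad_col = 0.12;   /* collision radius around an obstacle */
static const double rad_nz  = 0.40;   /* not-safe zone around an obstacle    */

/* Zone of the robot given its distance to the goal (d_rg) and to the
 * closest obstacle (d_ro), following Equation (2). */
static zone_t robot_zone(double d_rg, double d_ro)
{
    if (d_rg <= rad_vz)  return VZ;   /* goal reached       */
    if (d_ro <= rad_col) return FZ;   /* collision          */
    if (d_ro <= rad_nz)  return NZ;   /* close to obstacle  */
    return SZ;                        /* safe               */
}

The reward of Equation (3) would then be assigned from the zone transition observed between two consecutive steps.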

3.3 The Value Function

Most RL algorithms are based on value functions. Q-Learning stores the value function in a Q-table, which is initially null. During the learning process, the Q-table is filled as different scenarios are experienced. Each entry combines the immediate reward of the action taken with the maximum accumulated reward over the possible actions in the current state (see Equation 4).

Q(st, at) = r(st, at) + γ ∗max(Q, st+1) (4)

where: st is the state at time instant t; at is the action taken at t; r(st, at) is the immediate reward received when action at is taken in state st; γ is the reduction factor, a number in the range [0, 1] that determines whether the learned policy favours delayed or immediate rewards; and max(Q, st+1) is the maximum Q-value over all possible actions in the state at the next time step.
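
A minimal C sketch of this table and of the update in Equation (4) is given below; the value of γ and the table layout (16 states by 8 actions, following Sections 2 and 3.1) are assumptions for illustration.

#define NUM_STATES  16
#define NUM_ACTIONS  8

static double Q[NUM_STATES][NUM_ACTIONS];   /* initially null, as in the text    */
static const double GAMMA = 0.9;            /* assumed value of the reduction factor */

/* Maximum Q-value reachable from state s_next over all actions,
 * i.e. the max(Q, st+1) term of Equation (4). */
static double max_q(int s_next)
{
    double best = Q[s_next][0];
    for (int a = 1; a < NUM_ACTIONS; ++a)
        if (Q[s_next][a] > best)
            best = Q[s_next][a];
    return best;
}

/* Equation (4): Q(st, at) = r(st, at) + gamma * max(Q, st+1). */
static void update_q(int s, int a, double r, int s_next)
{
    Q[s][a] = r + GAMMA * max_q(s_next);
}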

3.4 Learning Process

The learning process is composed of a given number of episodes. In each episode, a given scenario is presented to the agent, which chooses an action considering the goal and the expected reward. The Q-table is updated by Equation 4. Each scenario consists of the robot position, the desired position, and a single static obstacle. Since the robot only considers the nearest obstacle, one obstacle is enough for training.

The element positions are chosen randomly for each episode. The robot always moves towards the goal, regardless of a possible collision. This allows the Q-table to converge to the optimal policy π∗, under which the robot remains in SZ until it reaches the desired destination.

Once inside SZ, the agent moves towards the goal. If the last action leads to a state transition to VZ or FZ, the episode ends. Thus, the Q-table is updated only when the agent is inside NZ. If it is the last episode, learning is finished. The learning process is presented in Figure 3.

Figure 3: Learning Process

At the end of learning, an algorithm written in the C language automatically converts the optimal policy represented by the Q-table into a rule base KB. Given the set of environment states S and the set of possible actions A, the largest Q-value for each state indicates the best action to be executed by the agent. For these experiments, six rule bases KB were generated, containing the optimal policies obtained after 10,000, 25,000, 50,000, 100,000, 250,000 and 500,000 episodes. Some of the 19 rules contained in KB are presented in Table 1.

(rule_1

(if (logic (robot zone not_safe))

(logic (target quadrant r1))

(logic (obstacle quadrant r1)))

(then (logic (robot action east)))(cf 0.84))

...

(rule_17

(if (logic (robot zone safe))

(logic (target direction ?x)))

(then (logic (robot action ?x)))(cf 0.90))

(rule_18

(if (logic (robot zone victory)))

(then (logic (robot action success)))(cf 0.83))

(rule_19

(if (logic (robot zone colision)))

(then (logic (robot action stop)))(cf 0.85))

Table 1: Rule Base for Mobile Robot Navigation

Rules 1 to 17 choose the basic behaviour according to the robot zone, the goal position, and the closest obstacle position. Rules 18 and 19 both require the reactive level to stop the robot and inform the cognitive level about the rule's conclusion. Rule 18 identifies that the current goal has been satisfied; in this situation, another goal should be chosen by the cognitive level.

For rule 19, which characterizes a state where the mobile robot is surrounded by mobile obstacles, the cognitive level also chooses another goal. This choice implies that another knowledge base must be used to escape from the surrounded state, or that the robot must wait until the collision zone state becomes false, that is, until a mobile obstacle moves away. The reward received for an action of the optimal policy π∗ is normalized and stored in the rule as its certainty factor. A text file is generated containing the rule base, in other words the knowledge needed to execute the human-like locomotion for the desired goal gi. This file is downloaded to the omnidirectional mobile robot, where the agent uses the stored knowledge to control the robot's locomotion.
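
A minimal sketch of this conversion step is shown below; the normalization by the largest Q-value and the state-to-predicate mapping are assumptions made for illustration, and the emitted syntax only mirrors the excerpt in Table 1.

#include <stdio.h>

#define NUM_STATES  16
#define NUM_ACTIONS  8

static const char *action_name[NUM_ACTIONS] = {
    "north", "northeast", "east", "southeast",
    "south", "southwest", "west", "northwest"
};

/* Write one rule per state: the action with the largest Q-value becomes
 * the rule conclusion, and its normalized Q-value the certainty factor. */
static void write_rules(double Q[NUM_STATES][NUM_ACTIONS], FILE *out)
{
    /* global maximum used to normalize the certainty factors (assumption) */
    double q_max = 0.0;
    for (int s = 0; s < NUM_STATES; ++s)
        for (int a = 0; a < NUM_ACTIONS; ++a)
            if (Q[s][a] > q_max) q_max = Q[s][a];

    for (int s = 0; s < NUM_STATES; ++s) {
        int best = 0;
        for (int a = 1; a < NUM_ACTIONS; ++a)
            if (Q[s][a] > Q[s][best]) best = a;

        double cf = (q_max > 0.0) ? Q[s][best] / q_max : 0.0;
        fprintf(out,
                "(rule_%d\n"
                "  (if (logic (robot zone not_safe))\n"
                "      (logic (target quadrant r%d))\n"
                "      (logic (obstacle quadrant r%d)))\n"
                "  (then (logic (robot action %s)))(cf %.2f))\n",
                s + 1, s / 4 + 1, s % 4 + 1, action_name[best], cf);
    }
}

Rules such as 17 to 19 in Table 1, which handle the safe, victory and collision zones, would be appended separately to the generated file.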

4 Experimental Results

Once the symbolic knowledge base was embedded into the CAA, two kinds of mobile robot locomotion experiments were carried out: first with the omnidirectional mobile robot AxeBot, and then with the NAO humanoid robot. In both cases, a simulated environment was used first, followed by the real mobile robot. In the first, simulated, experiment, the execution scenario was composed of the robot, one target, and from 3 (three) to 8 (eight) obstacles (static and dynamic). The positions of the environment elements are randomly chosen, including the ball, which is the target (the goal). One of these scenarios, with 8 (eight) obstacles, is shown in Figure 4.

Figure 4: A successful path execution with 8 obstacles

Figure 5: Testing Process

For each symbolic knowledge base KBi, 500 (five hundred) scenarios, containing from 3 (three) to 9 (nine) obstacles and the ball, all randomly positioned, were tested. The effectiveness of each policy automatically coded into the symbolic knowledge bases is presented in Figure 6. The simulation algorithm is shown in Figure 5.

Figure 6: Number of obstacles vs. Success rate

Clearly, the knowledge base generated from the 500,000-episode learning process presents, in the large majority of scenarios, better effectiveness than the other ones.

The second step seeks to validate the developed methodology on a real mobile robot. The file containing the knowledge base generated by the 500,000-episode learning process was loaded into the CAA instinctive level embedded on the omnidirectional mobile robot. This file stores the motion plan, combining basic reactive behaviours to move the robot to the current ball position, and contains 19 rules. Three different scenarios were used for this experiment. The trajectory tracking for one of these scenarios and the temporal evolution of the robot positions can be seen in Figure 7. The path was correctly planned and the real robot was able to track it under real-time constraints, supporting the proposed methodology.

Figure 7: Temporal Evolution of Positions and Mobile Robot Trajectory

In the second experiment, a new reactive level, which embeds a Zero Moment Point (ZMP) motion controller for humanoid robots, was implemented. It was encapsulated in a process that communicates with the instinctive level using UDP sockets. This new reactive process also encapsulates the local vision system (see Figure 8). The same message exchange pattern between the reactive level and the instinctive level was kept: the reactive level sends the robot pose, the target pose and the positions of the seen obstacles to the instinctive level; the instinctive level identifies the current state and chooses the behaviour that is executed by the ZMP motion controller.

Figure 8: Temporal Evolution of Positions and Humanoid Robot Trajectory
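
As an illustration of this exchange, a reactive-level process could send its perception message over UDP roughly as in the sketch below; the struct layout, field names and helper function are assumptions, since the paper does not describe the actual packet format used by the CAA implementation.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Perception message sent by the reactive level at each cycle
 * (illustrative layout only). */
typedef struct {
    double robot_x, robot_y, robot_theta;
    double target_x, target_y;
    double obstacle_x, obstacle_y;   /* closest seen obstacle */
} perception_msg_t;

/* Send one perception message to the instinctive level over UDP. */
static int send_perception(int sock, const char *host, int port,
                           const perception_msg_t *msg)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, host, &addr.sin_addr);
    return (int)sendto(sock, msg, sizeof(*msg), 0,
                       (struct sockaddr *)&addr, sizeof(addr));
}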

The same three scenarios used in the first experiment with the omnidirectional mobile robot were successfully repeated. The trajectory tracking for one of these scenarios and the temporal evolution of the robot positions can be seen in Figure 9. Again, the path was correctly planned and the real robot was able to track it.

Figure 9: Real Navigation using AxeBot

5 Conclusions and Future Work

A new methodology for knowledge-based agents to learn from their interactions with the environment was presented in this paper. The methodology combines Reinforcement Learning and Knowledge-Based Systems: Reinforcement Learning obtains the optimal policy, which is automatically coded using a knowledge representation formalism based on first-order logic. The generated knowledge base is used by a cognitive agent embedded into the omnidirectional mobile robot, which executes the learned human-like locomotion task.

By considering only the closest obstacle, the state space of the environment is reduced drastically, which accelerates convergence to the optimal policy and, above all, allows the proposed methodology to generate small knowledge bases appropriate for embedding into mobile robots. The experimental tests performed show that the points planned in real time by the instinctive level were received and executed successfully by the reactive level.

For a different goal gj, for example dribbling around an opponent, a new learning process would be necessary. This new learning process would generate a new knowledge base KBj which, once added to the mobile robot, would make it able to achieve both gi and gj, using KBi to achieve gi and KBj to achieve gj. In this sense, where a specific knowledge base is generated for a given task, it is possible to see the proposed methodology as a symbolic knowledge learning methodology. A deeper discussion about symbolic knowledge learning can be pursued in this context, and we hope to support it with future results in which the robot increases its skills with additional learned knowledge bases.

References

Bittencourt, G. (2007). An embodied model for higher-level cognition, 29th Annual Meeting of the Cognitive Science Society (CogSci 2007), pp. 106–115.

Brooks, R. A. (1991). Intelligence without representation, Artificial Intelligence 47: 139–159.

da Costa, A. L. and Bittencourt, G. (1999). From a concurrent architecture to a concurrent autonomous agents architecture, Springer Lecture Notes in Artificial Intelligence 1856: 3rd International Workshop in RoboCup, pp. 85–95.

Fatmi, A., Yahmadi, A. A., Khriji, L. and Masmoudi, N. (2006). A fuzzy logic based navigation of a mobile robot, World Academy of Science, Engineering and Technology 22: 169–174.

Jaradat, M., Al-Rousan, M. and Quadan, L. (2011). Reinforcement based mobile robots navigation in dynamic environments, Robotics and Computer-Integrated Manufacturing 27: 135–149.

Manikas, T. W., A. K. W. R. (2007). Genetic algorithms for autonomous robot navigation, IEEE Instrumentation and Measurement Magazine 10(6): 26–31.

McCarthy, J. and Hayes, P. (1969). Some philosophical problems from the standpoint of artificial intelligence, Machine Intelligence 4: 142–152.

McGill, S., Brindza, J., Yi, S. and Lee, D. (2010). Unified humanoid robotic software platform, 5th Workshop on Humanoid Soccer Robots (Humanoids 2010), pp. 07–11.

Ouarda, H. (2010). A neural network based navigation for intelligent autonomous mobile robots, International Journal of Mathematical Models and Methods in Applied Sciences 4(3): 177–186.

Pradhan, S., Parhi, D. and Panda, A. (2007). Navigation of multiple mobile robots using rule-based-neuro-fuzzy technique, IEEE Instrumentation and Measurement Magazine 34(2): 142–152.

Shayestegan, M. and Marhaban, M. (2012). Mobile robot safe navigation in unknown environment, IEEE International Conference on Control System, Computing and Engineering, pp. 44–49.

Torrey, L. (2009). Relational Transfer in Reinforcement Learning, PhD thesis, University of Wisconsin-Madison, Computer Sciences Department.