Reinforcement Learning of Hierarchical Skills on the Sony Aibo robot

Vishal Soni and Satinder Singh
Computer Science and Engineering
University of Michigan, Ann Arbor
{soniv, baveja}@umich.edu

Abstract— Humans frequently engage in activities for their own sake rather than as a step towards solving a specific task. During such behavior, which psychologists refer to as being intrinsically motivated, we often develop skills that allow us to exercise mastery over our environment. Reference [7] has recently proposed an algorithm for intrinsically motivated reinforcement learning (IMRL) aimed at constructing hierarchies of skills through self-motivated interaction of an agent with its environment. While they were able to successfully demonstrate the utility of IMRL in simulation, we present the first realization of this approach on a real robot. To this end, we implemented a control architecture for the Sony-AIBO robot that extends the IMRL algorithm to this platform. Through experiments, we examine whether the Aibo is indeed able to learn useful skill hierarchies.

Index Terms— Reinforcement Learning, Self-Motivated Learning, Options

I. INTRODUCTION

Reinforcement learning [8] has made great strides in building agents that are able to learn to solve single complex sequential decision-making tasks. Recently, there has been some interest within the reinforcement learning (RL) community to develop learning agents that are able to achieve the sort of broad competence and mastery that most animals have over their environment ([3], [5], [7]). Such a broadly competent agent will possess a variety of skills and will be able to accomplish many tasks. Amongst the challenges in building such agents is the need to rethink how reward functions get defined. In most control problems from engineering, operations research or robotics, reward functions are defined quite naturally by the domain expert. For the agent the reward is then some extrinsic signal it needs to optimize. But how does one define a reward function that leads to broad competence? A significant body of work in psychology shows that humans and animals have a number of intrinsic motivations or reward signals that could form the basis for learning a variety of useful skills (see [1] for a discussion of this work). Building on this work in psychology as well as on some more recent computational models ([2]) of intrinsic reward, [7] have recently provided an initial algorithm for intrinsically motivated reinforcement learning (IMRL). They demonstrated that their algorithm, henceforth the IMRL algorithm, is able to learn a hierarchy of useful skills in a simple simulation environment. The main contribution of this paper is an empirical evaluation of the IMRL algorithm on a physical robot (the Sony-Aibo). We introduce several augmentations to the IMRL algorithm to meet the challenges posed by working in a complex and real-world domain. We show that our augmented IMRL algorithm lets the Aibo learn a hierarchy of useful skills.

The rest of this paper is organized as follows. We first present the IMRL algorithm, then describe our implementation of it on the Aibo robot, and finally we present our results.

II. INTRINSICALLY MOTIVATED REINFORCEMENT LEARNING

The IMRL algorithm builds on two key concepts: a reward mechanism internal to the agent, and the options framework for representing temporally abstract skills in RL developed by [9]. Let us consider each concept in turn.

In the IMRL view of an agent, its environment is factored into an external environment and an internal environment. The critic is part of the internal environment and determines the reward. Typically in RL the reward is a function of external stimuli and is specifically tailored to solving the single task at hand. In IMRL, the internal environment also contains the agent's intrinsic motivational system, which should be general enough that it does not have to be redesigned for different problems. While there are several possibilities for what an agent might consider intrinsically motivating or rewarding, the current instantiation of the IMRL algorithm designs intrinsic rewards around novelty, i.e., around unpredicted but salient events. We give a more precise description of novelty as used in IMRL when we describe our algorithm.

An option (in RL) can be thought of as a temporally extended action or skill that accomplishes some subgoal. An option is defined by three quantities: a policy that directs the agent's behavior when executing the option, a set of initiation states in which the option can be invoked, and a termination condition that defines when the option terminates (Figure 1). Since an option policy can be composed not only of primitive actions but also of other options, this framework allows for the development of hierarchical skills.

Option Definition:

• Initiation set I ⊆ S, which specifies the states in which the option is available.

• Option policy π : I × A → [0, 1], which specifies the probability of taking action a in state s, for all a ∈ A, s ∈ I.

• Termination condition β : S → [0, 1], which specifies the probability of the option terminating in state s, for all s ∈ S.

Fig. 1. Option definition. See text for details.
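To make these three components concrete, the following minimal Python sketch bundles them into a single record. The class and field names (OptionSpec, initiation_set, and so on) are our own illustration and not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable

@dataclass
class OptionSpec:
    """Illustrative container for the three quantities that define an option."""
    initiation_set: Set[State]                  # I ⊆ S: states where the option may be invoked
    policy: Callable[[State], Action]           # π: behavior followed while the option executes
    termination_prob: Callable[[State], float]  # β: S → [0, 1], probability of terminating in a state

    def can_initiate(self, s: State) -> bool:
        """True if the option is available in state s."""
        return s in self.initiation_set
```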

The following two components of the overall options framework are particularly relevant to understanding IMRL:

• Option models: An option model predicts the probabilistic consequences of executing the option. As a function of the state s in which the option o is initiated, the model gives the (discounted) probability P^o(s′|s), for all s′, that the option terminates in state s′, as well as the total expected (discounted) reward R^o(s) over the option's execution. Option models can usually be learned (approximately) from experience with the environment as follows: ∀x ∈ S, current state s_t, next state s_{t+1},

P^o(x|s_t) ←_α γ(1 − β^o(s_{t+1})) P^o(x|s_{t+1}) + γ β^o(s_{t+1}) δ_{s_{t+1}, x}

R^o(s_t) ←_α r^e_t + γ(1 − β^o(s_{t+1})) R^o(s_{t+1})

where α is the learning rate, β^o is the termination condition for option o, γ is the discount factor, δ is the Kronecker delta, and r^e_t is the extrinsic reward. Note that equations of the form x ←_α y are shorthand for x ← (1 − α)x + α y.

• Intra-option learning methods: These methods allow all options consistent with the current primitive action to be updated simultaneously, in addition to the option that is currently being executed. This greatly speeds up the learning of options. Specifically, if a_t is the action executed at time t in state s_t, the option Q-values for option o are updated according to the following (see the code sketch after this list). ∀ options o, s_t ∈ I^o:

Q^o(s_t, a_t) ←_α r^e_t + γ β^o(s_{t+1}) λ^o + γ(1 − β^o(s_{t+1})) max_{a ∈ A∪O} Q^o(s_{t+1}, a)

and ∀ options o′, s_t ∈ I^{o′}, o ≠ o′:

Q^o(s_t, o′) ←_α R^{o′}(s_t) + Σ_{x ∈ S} P^{o′}(x|s_t) [ β^o(x) λ^o + (1 − β^o(x)) max_{a ∈ A∪O} Q^o(x, a) ]

where λ^o is the terminal value for option o.
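As an illustration only, the option-model updates and the first intra-option Q-learning update above can be written in a few lines of Python over dictionary-based tables. The function and variable names (update_option_model, intra_option_q_update, P, R, Q, beta, lam) are our own choices and do not correspond to the authors' robot code.

```python
# Illustrative tabular sketch of the updates above.
# P[o][(x, s)] ~ discounted termination probability P^o(x|s); R[o][s] ~ expected reward R^o(s);
# Q[o][(s, a)] ~ option Q-value; beta[o](s) is the termination condition; lam[o] the terminal value.

def update_option_model(P, R, o, s_t, s_next, r_ext, beta, states, alpha=0.1, gamma=0.99):
    """Learn the option model of o from one transition (x <-_alpha y means x <- (1-alpha)x + alpha*y)."""
    b = beta[o](s_next)
    for x in states:
        kron = 1.0 if x == s_next else 0.0  # Kronecker delta
        target = gamma * (1.0 - b) * P[o].get((x, s_next), 0.0) + gamma * b * kron
        P[o][(x, s_t)] = (1 - alpha) * P[o].get((x, s_t), 0.0) + alpha * target
    r_target = r_ext + gamma * (1.0 - b) * R[o].get(s_next, 0.0)
    R[o][s_t] = (1 - alpha) * R[o].get(s_t, 0.0) + alpha * r_target


def intra_option_q_update(Q, o, s_t, a_t, s_next, r_ext, beta, lam, actions, alpha=0.1, gamma=0.99):
    """Update Q^o(s_t, a_t) for an option o whose policy is consistent with the executed action a_t."""
    b = beta[o](s_next)
    best_next = max(Q[o].get((s_next, a), 0.0) for a in actions)
    target = r_ext + gamma * b * lam[o] + gamma * (1.0 - b) * best_next
    Q[o][(s_t, a_t)] = (1 - alpha) * Q[o].get((s_t, a_t), 0.0) + alpha * target
```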

Next we describe the IMRL algorithm. The agent continually acts according to an ε-greedy policy with respect to an evolving behavior Q-value function.

Loop forever:
  Current state s_t, next state s_{t+1}, current primitive action a_t, current option o_t, extrinsic reward r^e_t, intrinsic reward r^i_t
  If s_{t+1} is a salient event e:
    If an option for e does not exist:
      Create option o^e
      Add s_t to the initiation set I^{o^e} of o^e
      Make s_{t+1} the termination state for o^e
    Determine intrinsic reward r^i_{t+1} = τ[1 − P^{o^e}(s_{t+1}|s_t)]
  Update option models:
    ∀ learned options o ≠ o^e:
      If s_{t+1} ∈ I^o, then add s_t to initiation set I^o
      If a_t is the greedy option for o in state s_t, update the option model of o
  Update the behavior Q-value function:
    ∀ primitive options, do a Q-learning update
    ∀ learned options, do an SMDP-planning update
  ∀ learned options:
    Update the option Q-value function
  Choose the next option to execute, a_{t+1}
  Determine the next extrinsic reward, r^e_{t+1}
  Set s_t ← s_{t+1}; a_t ← a_{t+1}; r^e_t ← r^e_{t+1}; r^i_t ← r^i_{t+1}

Fig. 2. The IMRL algorithm.

The agent starts with an initial set of primitive actions (some of which may be options). The agent also starts with a hardwired notion of salient or interesting events in its environment. The first time a salient event is experienced, the agent initiates in its knowledge base an option for learning to accomplish that salient event. As it learns the option for a salient event, it also updates the option model for that option. Each time a salient event is encountered, the agent gets an internal reward in proportion to the error in the prediction of the salient event from the associated option model. Early encounters with the salient event generate a lot of reward because the associated option model is wrong. This leads the agent's behavior Q-value function to learn to accomplish the salient event, which in turn improves both the option for achieving the salient event and its option model. As the option model improves, the reward for the associated salient event decreases and the agent's action-value function no longer takes the agent there. The agent, in effect, gets "bored" and moves on to other interesting events. Once an option, or equivalently a skill, has been learned and is in the knowledge base of the agent, it becomes available as an action to the agent. This in turn enables more sophisticated behaviors on the part of the agent and leads to the discovery of salient events that are more challenging to achieve. Over time this builds up a hierarchy of skills. The details of the IMRL algorithm are presented in Figure 2 in a more structured manner.
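The "boredom" mechanism described above can be made concrete with a short sketch: the intrinsic reward for a salient event is proportional to the prediction error of the associated option model, so it shrinks as that model improves. The class below is an illustrative skeleton under these assumptions; its names and data layout are not the authors' implementation.

```python
class IntrinsicRewardTracker:
    """Sketch of the novelty-based intrinsic reward r^i = tau * (1 - P^{o_e}(s_{t+1} | s_t))."""

    def __init__(self, tau=1.0):
        self.tau = tau
        self.known_events = set()  # salient events for which an option has already been created

    def on_salient_event(self, event, s_t, s_next, P, options):
        # First encounter: create the option (initiation/termination) and an empty option model.
        if event not in self.known_events:
            self.known_events.add(event)
            options[event] = {"initiation": {s_t}, "goal": s_next}  # illustrative option record
            P[event] = {}                                           # empty option model
        # Intrinsic reward is the option model's prediction error for reaching this event.
        predicted = P[event].get((s_next, s_t), 0.0)
        return self.tau * (1.0 - predicted)
```

Early on, the predicted probability is near zero and the intrinsic reward is large; as the option model converges, the reward decays and the agent's behavior drifts toward events it cannot yet predict.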

One way to view IMRL is as a means of semi-automatically discovering options. There have been other approaches for discovering options, e.g., [4]. The advantage of the IMRL approach is that, unlike previous approaches, it learns options outside the context of any externally specified learning problem, and that it has a kind of self-regulation built in, so that once an option is learned the agent automatically focuses its attention on things not yet learned.

As stated in the Introduction, thus far IMRL has been tested only in a simple simulated world. Our goal here is to extend IMRL to a complex robot in the real world. To this end, we constructed a simple 4 feet x 4 feet play environment for our Sony-Aibo robot containing three objects: a pink ball, an audio speaker that plays pure tones, and a person who can also make sounds by clapping, whistling, or just talking. The experiments have the robot learn skills such as acquiring the ball, approaching the sound, and putting the two together to fetch the ball to the sound or perhaps to the person, and so on. Before presenting our results, we describe some details of our IMRL implementation on the Aibo robot.

III. IMRL ON AIBO

In implementing IMRL on the Aibo, we built a modular system in which we could experiment with different kinds of perceptual processing and different choices for primitive actions in a relatively plug-and-play manner, without introducing significant computational and communication latency overhead. We briefly describe the modules in our system.

a) Perception Module: This module receives sensory input from the Aibo and parses it into an observation vector for the other modules to use. At present, our implementation filters only simple information from the sensors (such as the percentage of certain colors in the Aibo's visual field, the intensity of sound, various pure tones, etc.). Work is underway to extract more complex features from audio (e.g., pitch) and visual data (e.g., shapes). Clearly, the simple observation vector we extract forms a very coarsely abstracted state representation, and hence the learning problem faced by IMRL on the Aibo will be heavily non-Markovian. This constitutes our first challenge in moving to a real robot. Will IMRL, and in particular the option-learning algorithms within it, be able to cope with the non-Markovianness?
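For concreteness, one possible shape of such a coarse observation vector is sketched below; the specific fields and the discretization threshold are our assumptions for illustration, not the exact representation used on the robot.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Coarse observation assembled by a perception module (fields are illustrative)."""
    pink_fraction: float    # fraction of the visual field classified as pink (the ball)
    sound_intensity: float  # overall loudness from the two ear microphones
    tone_detected: bool     # whether a known pure tone is currently heard
    touch_back: bool        # back touch sensor (used for the external reward)

def discretize(obs: Observation, pink_threshold: float = 0.05) -> tuple:
    """Map the continuous observation onto a small discrete state for tabular learning."""
    return (obs.pink_fraction > pink_threshold, obs.tone_detected, obs.touch_back)
```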

• Explore: Look for the pink ball. Terminate when the ball is in the visual field.

• Approach Ball: Walk towards the ball. Terminate when the ball is near.

• Capture Ball: Slowly try to get the ball between the fore limbs.

• Check Ball: Angle the neck down to check if the ball is present between the fore limbs.

• Approach Sound: Walk towards the sound of a specific tone. Terminate when the sound source is near.

Fig. 3. A list of primitive options programmed on the Aibo. See text for details.

b) Options Module: This module stores the primitive options available to the agent. When called upon by the Learning module to execute some option, it computes what action the option would take in the current state and calls the low-level motor control primitives implemented in Carnegie Mellon University's Robo-Soccer code with the parameters needed to implement the action. A list of some key options, with a brief description of each, is provided in Figure 3. This brings up another challenge for IMRL. Many of the options of Figure 3 contain parameters whose values can have a dramatic impact on the performance of those options. For example, the option to Approach Ball has a parameter that determines when the option terminates; it terminates when the ball occupies a large enough fraction of the visual field. The problem is that under different lighting conditions the portion of the ball that looks pink will differ in size. Thus the nearness threshold cannot be set a priori but must be learned for the lighting conditions. Similarly, the Approach Sound option has a nearness threshold based on the intensity of sound that is dependent on details of the environment. One way to deal with these parameterized options is to treat each setting of the parameter as defining a distinct option. Thus, Approach Ball with a nearness threshold of 10% is a different option than Approach Ball with a nearness threshold of 20%. But this would explode the number of options available to the agent and thus slow down learning. Besides, treating each parameter setting as a distinct option ignores the shared structure across all Approach Ball options. As one of our augmentations to IMRL, we treated each parameterized option as a single option and learned the parameter values using a hill-climbing approach. This use of hill climbing to adapt the parameters of an option, while using RL to learn the option policy and option model, is a novel approach to extending options to a real robot. Indeed, to the best of our knowledge there has not been much work in using options on a challenging (non-lookup-table) domain.
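A minimal sketch of one such hill-climbing step, run alongside the RL updates, is given below. The evaluation signal (here, an assumed cost such as the average time to capture the ball after the option terminates), the step size, and the function names are illustrative choices rather than the exact procedure used on the robot.

```python
def hill_climb_threshold(threshold, evaluate, step=5, trials=3, min_value=1):
    """One round of hill climbing on an option parameter such as the nearness threshold.

    `evaluate(threshold)` is assumed to return a cost for running the option with that
    parameter (e.g., average time to capture the ball afterwards); lower is better.
    """
    candidates = [threshold, max(min_value, threshold - step), threshold + step]
    avg_cost = {c: sum(evaluate(c) for _ in range(trials)) / trials for c in candidates}
    return min(avg_cost, key=avg_cost.get)  # keep the best-performing neighbor (or stay put)
```

Under dimmer lighting fewer pink pixels are visible at a given distance, so a procedure of this kind naturally drifts toward a lower threshold, consistent with the behavior reported in Figure 8.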

c) Learning Module: The Learning module implements the IMRL algorithm defined in the previous section.

• Ball Status = {lost, visible, near, captured first time, captured, capture-unsure}

• Destinations = {near no destination, near sound first time, near sound, near experimenter, ball taken to sound}

Fig. 4. State variables for the play environment. Events in bold are salient. A ball status of 'captured' indicates that the Aibo is aware the ball is between its fore limbs. The status 'captured first time' indicates that the ball was previously not captured but is captured now (similarly, the value 'near sound first time' is set when the Aibo transitions from not being near the sound to being near it). The status 'capture-unsure' indicates that the Aibo is unsure whether the ball is between its limbs; if the Aibo goes for a specific length of time without checking for the ball, the ball status transitions to 'capture-unsure'.
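Read literally, the play-environment state is just the cross product of these two variables; a direct encoding might look like the sketch below. The enum names mirror Figure 4, the "salient" comments follow the three salient events listed in Section IV, and the encoding itself is our illustration rather than the paper's data structure.

```python
from enum import Enum, auto

class BallStatus(Enum):
    LOST = auto()
    VISIBLE = auto()
    NEAR = auto()
    CAPTURED_FIRST_TIME = auto()  # salient: acquiring the ball
    CAPTURED = auto()
    CAPTURE_UNSURE = auto()

class Destination(Enum):
    NEAR_NO_DESTINATION = auto()
    NEAR_SOUND_FIRST_TIME = auto()  # salient: arriving at the sound source
    NEAR_SOUND = auto()
    NEAR_EXPERIMENTER = auto()
    BALL_TAKEN_TO_SOUND = auto()    # salient: arriving at the sound source with the ball

# A play-environment state is simply the pair (BallStatus, Destination).
PlayState = tuple
```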

IV. EXPERIMENTS

Recall that for our experiments we constructed a simple 4' x 4' play environment containing three objects: a pink ball, an audio speaker that plays pure tones, and a person who can also make sounds by clapping, whistling, or just talking. The Explore (primitive) option turned the Aibo in place clockwise until the ball became visible or until the Aibo had turned in a circle twice. The Approach Ball option had the Aibo walk quickly to the ball and stop when the nearness threshold was exceeded. The Approach Sound option used the difference in intensity between the microphones in the two ears to define a turning radius and forward motion that made the Aibo move toward the sound until a nearness threshold was exceeded. Echoes of the sound from other objects in the room, as well as from the walls or the moving experimenter, made this a fairly noisy option. The Capture Ball option had the Aibo move very slowly to get the ball between the fore limbs; walking fast would knock the ball and often get it rolling out of sight.
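As an illustration of how such a stereo intensity difference can drive the walk, here is a hedged sketch of one control step; the gains, the normalization, and the mapping onto forward/turn commands are assumptions, not the parameters used on the robot.

```python
def approach_sound_step(left_intensity, right_intensity, nearness_threshold,
                        turn_gain=0.5, forward_speed=0.3):
    """One control step of an Approach-Sound-like behavior (illustrative sketch).

    Returns (forward, turn, done): the louder ear pulls the turn toward the source,
    and the behavior terminates once the overall intensity exceeds the nearness threshold.
    """
    total = left_intensity + right_intensity
    if total >= nearness_threshold:
        return 0.0, 0.0, True  # near enough: terminate the option
    balance = (right_intensity - left_intensity) / max(total, 1e-6)
    return forward_speed, turn_gain * balance, False
```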

We defined an externally rewarded task in this environment as the task of acquiring the ball, taking it to the sound, and then taking it to the experimenter, in sequence. Upon successful completion of the sequence, the experimenter would pat the robot on the back (the Aibo has touch sensors there), thereby rewarding it, and would then pick up the robot and move it (approximately) to an initial location. The ball was also moved to a default initial location at the end of a successful trial. This externally rewarded task is a fairly complex one. For example, just the part of bringing the ball to the sound involves periodically checking whether the ball is still captured, for it tends to roll away from the robot; if it is lost then the robot has to Explore for the ball, Approach Ball, and then Capture Ball again, and then repeat. Figure 9 shows this complexity visually in a sequence of images taken from a video of the robot taking the ball to the sound. In image 5 the robot has lost the ball and has to explore, approach and capture the ball again (images 6 to 9) before proceeding towards the sound again (image 10). We report the results of three experiments.

Experiment 1: We had the Aibo learn with just the external reward available and no intrinsic reward. Consequently, the Aibo did not learn any new options and performed standard Q-learning updates of its (primitive) option Q-value functions. As Figure 5(a) shows, with time the Aibo gets better at achieving its goal and is able to do so with greater frequency. The purpose of this experiment is to serve as a benchmark for comparison. In the next two experiments, we try to determine two things: (a) does learning options hinder learning to perform the external task when both are performed simultaneously? And (b) how good are the options learned? That is, if the Aibo is bootstrapped with the learned options, how does it perform in comparison with the learning in Experiment 1?

Fig. 5. A comparison of learning performance. Each panel depicts the times (in seconds) at which the Aibo was successfully able to accomplish the externally rewarded task: (a) external reward only, (b) external and internal rewards, (c) external reward, with learned options available at start time. See text for details.

Experiment 2: We hardwired the Aibo with three salient events: (a) acquiring the ball, (b) arriving at the sound source, and (c) arriving at the sound source with the ball. Note that (c) is a more complex event to achieve than (a) or (b), and also that (c) is not the same as doing (a) and (b) in sequence, for the skill needed to approach the sound without the ball is much simpler than approaching the sound with the ball (because of the previously noted tendency of the ball to roll away as the robot pushes it with its body). The Aibo is also given an external reward if it brings the ball to the sound source and then to the experimenter in sequence. In this experiment the Aibo will learn new options as it performs the externally rewarded task. The robot starts by exploring its environment randomly. Each first encounter with a salient event initiates the learning of an option for that event. For instance, when the Aibo first captures the ball, it sets up the data structures for learning the Acquire Ball option. As the Aibo explores its environment, all the options and their models are updated via intra-option learning.

Figure 6 presents our results from Experiment 2. Each panel in the figure shows the evolution of the intrinsic reward associated with a particular salient event.

Fig. 6. Evolution of intrinsic rewards as the Aibo learns new options: (a) Acquire Ball, (b) Approach Sound, (c) Approach Sound With Ball (intrinsic reward versus time in seconds). Solid lines depict the occurrences of salient events; the length of each line depicts the associated intrinsic reward. See text for details.

Each encounter with a salient event is denoted by a vertical bar, and the height of the bar denotes the magnitude of the intrinsic reward. We see that in the early stages of the experiment, the event of arriving at the sound and the event of acquiring the ball tend to occur more frequently than the more complex event of arriving at the sound with the ball. With time, however, the intrinsic reward from arriving at the sound without the ball starts to diminish. The Aibo begins to get bored with it and moves on to approaching the sound with the ball. In this manner, the Aibo learns simpler skills before more complex ones. Note, however, that the acquire-ball event continues to occur frequently throughout the learning experience because it is an intermediate step in taking the ball to the sound.

Figure 5(b) shows the times in Experiment 2 at which the Aibo successfully accomplished the externally rewarded task while learning new options using the salient events. Comparing this with the result from Experiment 1 (Figure 5(a)), when the Aibo did not learn any options, we see that the additional onus of learning new options does not significantly impede the agent from learning to perform the external task, indicating that we do not lose much by having the Aibo learn options using IMRL. The question that remains is: do we gain anything? Are the learned options actually effective in achieving the associated salient events?

Experiment 3: We had the Aibo learn, as in Experiment 1, to accomplish the externally rewarded task without specifying any salient events. However, in addition to the primitive options, the Aibo also had available the learned options from Experiment 2. In Figure 7, we see that the Aibo achieved all three salient events quite early in the learning process. Thus the learned options did in fact accomplish the associated salient events, and furthermore these options are indeed accurate enough to be used in performing the external task.

Fig. 7. Performance when the Aibo is bootstrapped with the options learned in Experiment 2: (a) Acquire Ball, (b) Approach Sound, (c) Approach Sound with Ball. The markers indicate the times (in seconds) at which the Aibo was able to achieve each event.

Figure 5(c) shows the times at which the external reward was obtained in Experiment 3. We see that the Aibo was quickly able to determine how to achieve the external goal and was fairly consistent thereafter in accomplishing this task. This result is encouraging since it shows that the options it learned enabled the Aibo to bootstrap more effectively.

Fig. 8. A comparison of the nearness threshold as set by the hill-climbing component in different lighting conditions (number of pink pixels, divided by 16, versus the number of times the ball was captured). The solid line represents the evolution of the threshold under bright lighting, and the dashed line represents the same under dimmer conditions. Here we see that the hill climbing is, at the very least, able to distinguish between the lighting conditions and tries to set a lower threshold for the case with dim lighting.

Finally, we examined the effectiveness of the hill-climbing component that determines the setting of the nearness threshold for the Approach Object option. We ran the second experiment, in which the Aibo learns new options, in two different lighting conditions: one with bright lighting and the other with considerably dimmer lighting. Figure 8 presents these results. Recall that this threshold is used to determine when the Approach Object option should terminate. For a particular lighting condition, a higher threshold means that Approach Object will terminate closer to the ball, since a greater number of pink pixels is required for the ball to be assumed near (and thus capturable).

Fig. 9. The Aibo executing the learned option to take the ball to the sound (the speakers). Frames: 1) approaching the ball, 2) capturing the ball, 3) ball captured, 4) walking with the ball, 5) ball is lost, 6-7) looking for the ball, 8) ball found, 9) ball re-captured, 10) Aibo and ball arrive at the destination. In the first four images, the Aibo endeavors to acquire the ball. The Aibo has learned to periodically check for the ball when walking with it, so in image 5, when the ball rolls away, it realizes that the ball is lost. The Aibo tries to find the ball again, eventually finding it and taking it to the sound.

Also, for a particular distance from the ball, more pink pixels are detected under bright lighting than under dim lighting. The hill-climbing algorithm was designed to find a threshold such that the option terminates as close to the ball as possible without losing it, thereby reducing the average time needed to acquire the ball. Consequently, we would expect the nearness threshold to be set lower for dimmer conditions.

V. CONCLUSIONS

In this paper we have provided the first successful application of the IMRL algorithm to a complex robotic task. Our experiments showed that the Aibo learned a two-level hierarchy of skills and in turn used these learned skills in accomplishing a more complex external task. A key factor in our success was our use of parameterized options and the use of the hill-climbing algorithm in parallel with IMRL to tune the options to the environmental conditions. Our results form a first step towards making intrinsically motivated reinforcement learning viable on real robots. Greater eventual success at this would allow progress on the important goal of making broadly competent agents (see [6] for a related effort).

As future work, we are building more functionality on the Aibo in terms of the richness of the primitive options available to it. We are also building an elaborate physical playground for the Aibo in which we can explore the notion of broad competence in a rich but controlled real-world setting. Finally, we will explore the use of the many sources of intrinsic reward, in addition to novelty, that have been identified in the psychology literature.

ACKNOWLEDGEMENTS

The authors were funded by NSF grant CCF 0432027 and by a grant from DARPA's IPTO program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or DARPA.

REFERENCES

[1] A. G. Barto, S. Singh, and N. Chentanez. Intrinsically motivated reinforcement learning of hierarchical collection of skills. In International Conference on Developmental Learning (ICDL), La Jolla, CA, USA, 2004.

[2] S. Kakade and P. Dayan. Dopamine: generalization and bonuses. Neural Networks, 15(4):549-559, 2002.

[3] F. Kaplan and P.-Y. Oudeyer. Maximizing learning progress: An internal reward system for development. In Embodied Artificial Intelligence, pages 259-270, 2003.

[4] A. McGovern. Autonomous Discovery of Temporal Abstractions from Interactions with an Environment. PhD thesis, University of Massachusetts, 2002.

[5] L. Meeden, J. Marshall, and D. Blank. Self-motivated, task-independent reinforcement learning for robots. In AAAI Fall Symposium on Real-World Reinforcement Learning, Washington, D.C., 2004.

[6] P.-Y. Oudeyer, F. Kaplan, V. Hafner, and A. Whyte. The playground experiment: Task-independent development of a curious robot. In AAAI Spring Symposium Workshop on Developmental Robotics, 2005. To appear.

[7] S. Singh, A. G. Barto, and N. Chentanez. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 17, pages 1281-1288. MIT Press, Cambridge, MA, 2004.

[8] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[9] R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.