

Empowered Skills

Alexander Gabriel, Riad Akrour, Jan Peters and Gerhard Neumann

Abstract— Robot Reinforcement Learning (RL) algorithms return a policy that maximizes a global cumulative reward signal but typically do not create diverse behaviors. Hence, the policy will typically only capture a single solution of a task. However, many motor tasks have a large variety of solutions and the knowledge about these solutions can have several advantages. For example, in an adversarial setting such as robot table tennis, the lack of diversity renders the behavior predictable and hence easy to counter for the opponent. In an interactive setting such as learning from human feedback, an emphasis on diversity gives the human more opportunity for guiding the robot and helps to avoid the latter getting stuck in local optima of the task. In order to increase the diversity of the learned behaviors, we leverage prior work on intrinsic motivation and empowerment. We derive a new intrinsic motivation signal by enriching the description of a task with an outcome space, representing interesting aspects of a sensorimotor stream. For example, in table tennis, the outcome space could be given by the return position and return ball speed. The intrinsic motivation is now given by the diversity of future outcomes, a concept also known as empowerment. We derive a new policy search algorithm that maximizes a trade-off between the extrinsic reward and this intrinsic motivation criterion. Experiments on a planar reaching task and simulated robot table tennis demonstrate that our algorithm can learn a diverse set of behaviors within the area of interest of the tasks.

I. INTRODUCTION

The application of Reinforcement Learning (RL) to robotics is widespread and has several success stories [1], [2], [3]. An RL problem is defined by sets of states and actions as well as a reward function that maps a state-action pair to a score. The goal of RL algorithms is to learn a policy which maximizes the global cumulative reward. The policy can either be deterministic or stochastic, but even in the latter case, stochasticity is only introduced for the sake of exploration [4], [5] and is typically reduced over time, yielding as a result a policy exhibiting a relatively small amount of diversity.

However, diversity can be an advantage in many settings. In dynamic environments, a diverse policy is often easier to adapt to a changing environment or changing task constraints [6]. Moreover, diversity can help to avoid getting stuck in local optima. For example, in the cooperative setting of [7], a robot learns grocery checkout by interacting with a human. The human can add constraints, such as not moving liquids over electronics, by reranking the trajectories of the robot. Unfortunately, the robot can get stuck in local optima, requiring manual modification of the trajectory's waypoints [7]. Increasing the diversity in the robot's behavior can give more feedback opportunities to the human and avoids manual interventions. Another example where diverse behaviors should be preferred is competitive settings. Here, an agent with a repetitive strategy becomes predictable and thus more easily exploitable. In robot table tennis, for example, a task can be to return incoming balls to the opponent's side of the table in a way that is hard to return for the opponent. A deterministic behavior would preclude any effect of surprise and reduce the chances of defeating the opponent.

Therefore, we are interested in allowing the robot to build up a diverse library of high reward motor skills. In order to increase the diversity of the learned skills, we use intrinsic motivation [8] and empowerment [9]. We derive a new intrinsic motivation signal by enriching the description of a task with an outcome space. The outcome space represents interesting aspects of a sensorimotor stream; for example, in table tennis, the outcome space could be given by the return position of the ball or the return ball speed. Our intrinsic motivation signal is now given by the diversity or entropy of the outcomes of the policy. Maximizing the entropy of the future is a concept also known as empowerment [9]; therefore, we will call our approach Empowered Skills. Our new algorithm maximizes a trade-off between an extrinsic reward, e.g., returning the ball on the table in table tennis, and this intrinsic motivation criterion. Experiments on a planar reaching task and a simulated robot table tennis task demonstrate that our algorithm can learn a diverse set of behaviors within the area of interest of the tasks.

A. Related Work

Intrinsic motivation can be categorized into three clusters [8]: knowledge-based models, competence-based models and morphological models.

1) Knowledge-Based Models of Intrinsic Motivation: Agents of this category typically monitor and maximize a measure of learning progress. [10], [11] monitor the learning progress of the transition model (predicting the next state given the current state and action). Its maximization leads to a sequencing of the learning process from the simplest to the most intricate parts of the state-action space, while ignoring the parts that are already learned or impossible to learn. Such an intrinsic motivation signal is also used in [12] and is mixed with an extrinsic, goal-oriented reward. However, unlike in our algorithm, the increased exploration due to the intrinsic motivation serves only the purpose of speeding up the learning process. Moreover, its influence decreases with time and the returned policy is similar to that of standard RL algorithms.

2) Competence-Based Models of Intrinsic Motivation: Agents of this category set challenges or goals for themselves which consist of a specific sensorimotor configuration with an associated difficulty level. They then plan their actions to achieve this goal and get a reward based on the difficulty of the challenge and their actual performance. [8] suggests three possible performance measures based on incompetence, competence and competence progress. Since then, [13] has examined the use of a competence-based model to support an agent's autonomous decision on which skills to learn. They measure the improvement rate of competence on the basis of the Temporal-Difference learning signal. [14] enhanced the framework introduced in [10] with a competence-based intrinsic motivation system. Here, learning is directed by a measure of interestingness which characterizes goals in task space. [15] applies this framework in the context of tool use discovery. They split a high-dimensional structured sensorimotor space into subspaces with an associated measure of interestingness. A multi-armed bandit algorithm then selects in which subspace to perform goal babbling [16] based on the subspaces' interestingness.

3) Morphological Models of Intrinsic Motivation: This category comprises algorithms that maximize mathematical properties of the sensorimotor space of the robot and as such is closely related to the diversity in behavioral space we are striving for. In Evolutionary Robotics, [17] designed intrinsic online fitness functions that encourage a population of robots to seek out experiences that neither the current nor previous generations of robots had before. Thus, the algorithm grows a population of curious explorers that are motivated not by reward but by diverse actions and experiences alone. [17] and [18] evolved neural network controllers where the typical goal-oriented fitness function is replaced by an intrinsically motivated term. Specifically, [18] introduced a distance metric in behavioral space and aimed to find behaviors that are maximally different to all the previously experienced behaviors. However, enumerating all such behaviors might not be sustainable in high dimensional spaces. The authors in [17] propose to search for a behavior exhibiting maximal entropy in its sensorimotor stream. For some settings, such behaviors can result in interesting solutions to a task such as navigating through a maze. Yet, high entropy behaviors do not necessarily coincide with high reward behaviors and, as such, a trade-off between the intrinsic and extrinsic reward is necessary.

Empowerment: The notion of empowerment, introduced in [19] and extended to the continuous case in [9], is defined on the state space. The empowerment of a state is proportional to the number of distinct states that could be reached from it. It was demonstrated in [9] that for a task such as pole balancing, the optimal swing-up policy coincides with the search for the most empowered state, as the upright position has the highest diversity of future states. However, similar to the discussion of [17], such behaviors do not always emerge for other tasks. As such, in our algorithm, the intrinsic motivation is not used on its own but is combined with the extrinsic reward of the task such that we can restrict the diverse solution space to a subset of useful solutions.

Similar to [17], we measure diversity in our algorithm with an entropy function. However, as the sensorimotor stream can be excessively large, we restrict the domain of the entropy to the outcome space (Sec. II-A) that only captures task-relevant aspects of the stream. We call our algorithm Empowered Skills and distinguish it from the notion of empowerment of states [19], [9], as we search for (motor) skills that result in diverse outcomes.

II. EMPOWERED SKILLS

The Empowered Skills algorithm is couched in the Direct Policy Search (DPS) framework [4] and therefore reuses most of its terminology (Sec. II-A). In the following sections, we will describe our algorithm first for discrete and then for continuous outcome spaces.

A. Notation and Problem Statement

Given a reward function $R : \mathcal{T} \mapsto \mathbb{R}$ mapping a robot trajectory $\tau \in \mathcal{T}$ to a real value $R(\tau)$, the goal of DPS algorithms is to find a set of parameters $\theta \in \Theta$ maximizing $E_{p_\theta(\tau)}[R(\tau)]$, which we denote with a slight abuse of notation $R(\theta)$. Specifically, DPS is a class of iterative algorithms maintaining search distributions over $\Theta$ that we refer to as policies¹. The main objective of a DPS algorithm can be formulated as maximizing the policy return $J(\pi) = E_{\theta \sim \pi}[R(\theta)]$.
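To make the DPS setting concrete, the following is a minimal sketch of one such iteration: sample parameters from a Gaussian search distribution, evaluate R(θ), and refit the distribution from weighted samples. The fixed-temperature exponential weighting and the toy quadratic reward are our own simplifications, standing in for the KL-constrained updates derived below.

```python
import numpy as np

def dps_iteration(mean, cov, reward_fn, n_samples=50, temperature=1.0, rng=None):
    """One simplified DPS step: sample theta_i ~ pi, weight by reward, refit a Gaussian.

    The fixed-temperature exponential weighting is only a placeholder for the
    KL-constrained updates of REPS / Empowered Skills derived in this section."""
    rng = np.random.default_rng() if rng is None else rng
    thetas = rng.multivariate_normal(mean, cov, size=n_samples)   # theta_i ~ pi(theta)
    rewards = np.array([reward_fn(th) for th in thetas])          # R(theta_i)
    w = np.exp((rewards - rewards.max()) / temperature)           # exponential weights
    w /= w.sum()
    new_mean = w @ thetas
    diff = thetas - new_mean
    new_cov = (diff.T * w) @ diff + 1e-6 * np.eye(len(mean))      # weighted covariance
    return new_mean, new_cov

# toy usage: maximize a quadratic reward with optimum at theta* = [1, -1]
reward = lambda th: -np.sum((th - np.array([1.0, -1.0]))**2)
m, C = np.zeros(2), np.eye(2)
for _ in range(30):
    m, C = dps_iteration(m, C, reward)
```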

In addition to the maximization of the reward, we introduce an intrinsic motivation term to the objective function of our algorithm with the goal of enforcing diverse solutions. Similar to other intrinsically motivated algorithms [12], [17], [9], [20], diversity is measured by an entropy term. However, instead of computing the entropy over the whole sensorimotor stream $\mathcal{X}$ (as in [17]), which is typically very high dimensional, we provide additional guidance to the algorithm by singling out relevant parts of the sensorimotor stream which we call outcomes.

Formally, the outcome space $\mathcal{O}$ is defined as the image of a non-injective mapping $f : \mathcal{X} \mapsto \mathcal{O}$ such that typically $\mathrm{card}(\mathcal{O}) \ll \mathrm{card}(\mathcal{X})$. For instance, a trajectory $\tau$ that is part of $\mathcal{X}$ and comprised of all joint positions of the robot as well as the positions of the ball during the execution of a robot table tennis strike could retain as outcome $o = f(\tau)$ only the 2D ball position when it hits the table after returning the ball.
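As a small illustration of such a mapping f, the sketch below reduces a table tennis trajectory to the 2D bounce position of the returned ball. The trajectory layout (a dictionary holding an array of 3D ball positions and a table-plane height) is an assumption made for this example, not the representation used in the paper.

```python
import numpy as np

def outcome_from_trajectory(traj, table_height=0.76):
    """Hypothetical f : X -> O, keeping only the 2D position where the returned
    ball first reaches the table plane."""
    ball = np.asarray(traj["ball_positions"])        # shape (T, 3): x, y, z over time
    below = np.where(ball[:, 2] <= table_height)[0]  # time steps at or below the table plane
    if len(below) == 0:
        return None                                  # ball never came down: no outcome
    return ball[below[0], :2]                        # outcome o = (x, y) bounce location
```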

Upon the definition of the outcome space, the intrinsic term is simply given² by the entropy $H_\mathcal{O}(\pi) = -\sum_o p_\pi(o) \ln p_\pi(o)$ of the outcome probabilities $p_\pi(o) = \int p(o|\theta)\,\pi(\theta)\, d\theta$. Finally, the policy $\pi^*$ returned by our algorithm is optimal w.r.t. a trade-off between the extrinsic reward and intrinsic motivation, i.e.,

$$\pi^* = \arg\max_\pi \; J(\pi) + \beta H_\mathcal{O}(\pi),$$

¹For simplicity of notation, only policies of the form $\pi(\theta)$ are considered throughout the paper. An extension to the contextual case $\pi(\theta|c)$ for some initial i.i.d. task-dependent contexts $c \in \mathcal{C}$ is straightforward and similar to that of previous DPS algorithms (see Chapter 2.4.3.2 in [4]).

²$H_\mathcal{O}(\pi)$ is given for a discrete outcome space; the extension to continuous outcomes follows in Section II-C.


with trade-off parameter β. Higher values for β will put more emphasis on the entropy and result in policies exhibiting higher diversity.

B. Finite Outcome Space

Inspired by information-theoretic policy search algorithms [21], [22], we solve this optimization problem in an iterative scheme where at each iteration the new policy is updated by solving the following constrained optimization problem

$$\begin{aligned}
\arg\max_{\pi,\, p_\pi} \quad & J(\pi) + \beta H_\mathcal{O}(\pi), \\
\text{s.t.} \quad & \epsilon \ge \mathrm{KL}(\pi \,\|\, q), & (1)\\
& 1 = \int \pi(\theta)\, d\theta, & (2)\\
& \forall o : \; p_\pi(o) = \int p(o|\theta)\,\pi(\theta)\, d\theta, & (3)\\
& 1 = \sum_o p_\pi(o). & (4)
\end{aligned}$$

The Kullback-Leibler divergence $\mathrm{KL}(\pi \| q)$ is used to specify the step-size [21], [22] of the policy update and is given by

$$\mathrm{KL}(\pi \,\|\, q) = \int \pi(\theta) \log\left(\frac{\pi(\theta)}{q(\theta)}\right) d\theta.$$

Without the outcome entropy term $H_\mathcal{O}(\pi)$ and constraints 3 and 4, this optimization problem would be a standard Policy Search problem where the information loss, given by the Kullback-Leibler divergence between the last and current policy, is bounded by $\epsilon$ [21], [22]. The use of KL constraints is widespread in the robotic RL community, whether in the context of Policy Search [21], Policy Gradient Methods [23] or Optimal Control [20].

The addition of the outcome entropy term $H_\mathcal{O}(\pi)$ introduces an interdependency between $p_\pi(o)$ and $\pi$. This interdependency prohibits the derivation of a closed form policy update directly from the constrained optimization problem³. A set of auxiliary variables $p_\pi(o)$ that we optimize for is therefore introduced to break this dependency, yielding $H_\mathcal{O}(\pi) = -\sum_o p_\pi(o) \ln p_\pi(o)$. For a finite (and relatively small) outcome space, constraints 3 and 4 then allow us to enforce for all $o \in \mathcal{O}$ the equality⁴ $p_\pi(o) = \int p(o|\theta)\,\pi(\theta)\, d\theta$, resulting in the closed form policy update

$$\pi(\theta) \propto q(\theta) \exp\left(\frac{1}{\eta}\Big(R(\theta) + \sum_o \mu_o\, p(o|\theta)\Big)\right),$$

where $\mu_o$ are the Lagrangian multipliers for the constraints given in Equation 3. In comparison to the standard solution for DPS, see the episodic REPS algorithm given in [22], we can identify the term $\sum_o \mu_o\, p(o|\theta)$, which is added to the reward function $R(\theta)$ as intrinsic motivation. This term depends on the Lagrangian multipliers $\mu_o$ that are obtained by optimizing the dual function of the optimization problem.

³The interdependency results in a log-sum expression in the Lagrangian that cannot be solved in closed form.

⁴Within the internal optimization routine of a policy update, a sample estimate of the r.h.s. $\int p(o|\theta)\,\pi(\theta)\, d\theta$ of each equality constraint needs to be computed for different $\pi$. This can be done solely from data generated in previous iterations using importance sampling. However, we do not elaborate further as these constraints will not appear in the continuous outcome case.

C. Continuous Outcome Space

In the continuous case, adding a constraint for every possible outcome is impossible. As a consequence, we need to resort to matching feature expectations instead of single probability values. Hence, we replace the constraints 3 and 4 with

$$\int \phi(o)\, p_\pi(o)\, do = \int \phi(o) \int p(o|\theta)\,\pi(\theta)\, d\theta\, do, \quad (5)$$

$$1 = \int p_\pi(o)\, do. \quad (6)$$

The expression of $H_\mathcal{O}(\pi)$ in the continuous case is simply obtained by replacing the sum over the domain $\mathcal{O}$ by an integral.

D. Solving the Optimization Problem

The policy update can now again be obtained by Lagrangian optimization and is given by

$$\pi(\theta) \propto q(\theta) \exp\left(\frac{1}{\eta}\, \delta_{\boldsymbol{\mu}}(\theta)\right),$$

where $\delta_{\boldsymbol{\mu}}(\theta) = R(\theta) + \boldsymbol{\mu}^T \int p(o|\theta)\,\phi(o)\, do$ and $\boldsymbol{\mu}$ is a vector of Lagrangian multipliers for the constraint given in Equation 5. The policy depends on the Lagrangian multipliers $\eta$ and $\boldsymbol{\mu}$, which can be determined by minimizing the dual function

$$g(\eta, \boldsymbol{\mu}) = \eta\epsilon + \eta \log\left(\int q(\theta) \exp\left(\frac{1}{\eta}\, \delta_{\boldsymbol{\mu}}(\theta)\right) d\theta\right) + \beta \log\left(\int \exp\left(\frac{-\boldsymbol{\mu}^T\phi(o)}{\beta}\right) do\right). \quad (7)$$

Similar to other DPS algorithms such as episodic REPS [22], we can only solve this optimization problem for a finite set of samples. The result of this optimization is then given by a weighting $w_i = \exp\left(\frac{1}{\eta}\left(r_i + \boldsymbol{\mu}^T\phi(o_i)\right)\right)$ for each sample. These weights are subsequently used to estimate a new parametric policy using a Weighted Maximum Likelihood (WML) estimate [22].
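A sample-based sketch of this step is given below. It assumes a reward vector r and an outcome-feature matrix Phi with rows phi(o_i), and it approximates both integrals of the dual (Eq. 7) with the drawn samples before minimizing over η and µ with SciPy's L-BFGS-B; the exact estimator and optimizer used by the authors may differ.

```python
import numpy as np
from scipy.optimize import minimize

def empowered_weights(r, Phi, epsilon=1.0, beta=100.0):
    """Minimize a sample-based version of the dual g(eta, mu) and return the
    sample weights w_i = exp((r_i + mu^T phi(o_i)) / eta).

    r   : (N,)  rewards R(theta_i) of the sampled parameters
    Phi : (N,K) outcome features phi(o_i); both integrals are replaced by
                sample averages, which is one possible approximation."""
    N, K = Phi.shape

    def dual(x):
        eta, mu = x[0], x[1:]
        delta = r + Phi @ mu                          # delta_mu(theta_i)
        # eta * log E_q[exp(delta/eta)], computed with a stabilized log-sum-exp
        term_pi = eta * np.log(np.mean(np.exp((delta - delta.max()) / eta))) + delta.max()
        z = -Phi @ mu / beta
        term_o = beta * np.log(np.mean(np.exp(z - z.max()))) + beta * z.max()
        return eta * epsilon + term_pi + term_o

    x0 = np.concatenate(([1.0], np.zeros(K)))
    bounds = [(1e-6, None)] + [(None, None)] * K      # eta > 0, mu unconstrained
    res = minimize(dual, x0, method="L-BFGS-B", bounds=bounds)
    eta, mu = res.x[0], res.x[1:]
    delta = r + Phi @ mu
    w = np.exp((delta - delta.max()) / eta)           # weights, up to a constant factor
    return w / w.sum(), eta, mu
```

The normalized weights can then be passed directly to the weighted maximum likelihood update described in the next subsection.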

E. Estimating the New Policy

Because of their simplicity, we use multivariate Gaussian distributions for our policies, which is also a common assumption in DPS. The mean and covariance matrix of the new policy $\pi$ are given by

$$\mu_\pi = \frac{\sum_i w_i \theta_i}{\sum_i w_i}, \qquad \Sigma_\pi = \frac{\sum_i w_i (\theta_i - \mu_\pi)(\theta_i - \mu_\pi)^T}{Z}, \qquad Z = \frac{\left(\sum_i w_i\right)^2 - \sum_i w_i^2}{\sum_i w_i},$$

which is the WML estimator for samples $\theta_i$ drawn from the previous policy $q$.
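In code, this update is a direct transcription of the three formulas above; the small diagonal regularization of the covariance is our addition for numerical stability and not part of the paper.

```python
import numpy as np

def wml_gaussian(thetas, w, reg=1e-6):
    """Weighted maximum likelihood fit of the new Gaussian policy.

    thetas : (N, D) parameter samples drawn from the previous policy q
    w      : (N,)   non-negative weights from the dual optimization"""
    w = np.asarray(w, dtype=float)
    mu = (w @ thetas) / w.sum()
    diff = thetas - mu
    Z = (w.sum()**2 - np.sum(w**2)) / w.sum()              # normalizer from the paper
    Sigma = (diff.T * w) @ diff / Z + reg * np.eye(thetas.shape[1])
    return mu, Sigma
```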


III. EXPERIMENTS

The objectives of our experiments are twofold. Firstly, we investigate the ability of our algorithm to learn policies showing high diversity in the outcome space. Secondly, we investigate how we can shape the learned policies by choosing different characteristics as outcomes.

We assessed the abilities and performance of our algorithm in two scenarios: a 2D reaching task involving a 5 DoF robot arm and a table tennis task featuring a 9 DoF robot arm as shown in Figure 4. In all experiments the policy learned by the algorithm is represented by a multivariate Gaussian distribution over the parameters of the Dynamic Movement Primitives (DMP) that control the motion of the respective robot arm. As the outcome space is continuous in all of the experiments, we use Radial Basis Functions (RBF) to approximate the distribution of outcomes $p_\pi(o)$. As bases we use 20 randomly chosen outcomes. For each basis, we weight the distance to each outcome by the squared median of distances to this basis. We chose the median to have an outlier-resistant scaling factor for every basis.
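One possible implementation of these outcome features is sketched below; the squared-exponential kernel shape and the exact way the median bandwidth is applied follow our reading of the text and may differ in detail from the authors' implementation.

```python
import numpy as np

def rbf_outcome_features(outcomes, n_centers=20, rng=None):
    """RBF features phi(o) over the outcome space.

    Centers are n_centers randomly chosen outcomes; for each center, squared
    distances are scaled by the squared median distance to that center, giving
    an outlier-resistant bandwidth per basis (requires len(outcomes) >= n_centers)."""
    rng = np.random.default_rng() if rng is None else rng
    O = np.asarray(outcomes)                                  # (N, d) observed outcomes
    centers = O[rng.choice(len(O), size=n_centers, replace=False)]
    d2 = ((O[:, None, :] - centers[None, :, :])**2).sum(-1)   # (N, n_centers) squared distances
    bw = np.median(np.sqrt(d2), axis=0)**2 + 1e-12            # squared median distance per center
    return np.exp(-d2 / bw)                                   # (N, n_centers) feature matrix
```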

A. Reaching Task

In the reaching task the robot's objective is to move the end effector from the top to a target position on the right. It performs this task in 100 time steps. The parameter space is 30 dimensional and consists of the weights of the DMP (5 basis functions per joint and the target location in joint space). The reward R depends on the action cost u and the distance d between end effector and target in the last time step, i.e.,

$$R(d, \mathbf{u}) = -10^5 d^2 - 0.7\, \mathbf{u}^T\mathbf{u}.$$

In this experiment scenario we define the position of the middle joint at the end of the trajectory as the outcome. This results in a two-dimensional, continuous outcome space.
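For illustration, a compact sketch of this reward and outcome definition; the planar forward kinematics with unit link lengths is invented for the example and is not the simulator used in the paper.

```python
import numpy as np

def reaching_reward_and_outcome(final_joint_angles, actions, target,
                                link_lengths=np.ones(5)):
    """Reward R(d, u) = -1e5 * d^2 - 0.7 * u^T u of the reaching task, with d the
    final end-effector distance to the target, plus the 2D middle-joint outcome.
    The toy forward kinematics below stands in for the real simulation."""
    angles = np.cumsum(final_joint_angles)                     # absolute link orientations
    links = np.stack([link_lengths * np.cos(angles),
                      link_lengths * np.sin(angles)], axis=1)
    joints = np.cumsum(links, axis=0)                          # joint positions along the arm
    d = np.linalg.norm(joints[-1] - np.asarray(target))        # end-effector distance
    u = np.ravel(actions)
    reward = -1e5 * d**2 - 0.7 * (u @ u)
    outcome = joints[len(joints) // 2]                         # middle joint position (outcome)
    return reward, outcome
```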

To learn the policy, the algorithms run 75 iterations of policy updates. In every iteration they each draw 50 sample trajectories from their current policy.

Comparison to State-of-the-Art RL Algorithms

The first set of experiments pits our algorithm against state-of-the-art RL algorithms.

Figure 1 shows how outcome entropy, policy entropy and reward evolve over the 75 iterations of the algorithms.

TABLE I
RESULTS OF THE REACHING TASK EXPERIMENTS

β        Policy Entropy      Outcome Entropy     Avg. Reward
100      −302.47 ± 6.28      −10.980 ± 1.496     −4032 ± 3329
1000     −289.72 ± 10.04     −8.675 ± 1.533      −4316 ± 3363
10000    −264.36 ± 10.87     −5.677 ± 1.429      −4406 ± 2480
REPS     −311.48 ± 1.44      −17.394 ± 0.989     −3900 ± 3270
MORE     −49.56 ± 0.00       −14.291 ± 0.262     −2144 ± 1946

The table shows the results of our reaching task experiments. The experiments were run with 5 trials of 75 iterations per setting. We include REPS [21] and MORE [24] results as a baseline comparison. Of note are the similar policy entropy and average reward, the high standard deviation of the reward, as well as the distinct differences in outcome entropy, which show that our algorithm is able to find more diverse solutions.

Fig. 1. Evolution of outcome entropy, policy entropy and reward over 75 iterations in the reaching task experiments. The figure shows that our algorithm manages to find policies with higher outcome entropy at the expense of a slight reward decrease.

First, we note that for all the runs, the reward stabilizes from iteration 30 onwards while the policy entropy continues to shrink. The outcome entropy converges more slowly than the reward. This is even more apparent for our algorithm, where the outcome entropy declines more slowly than the policy entropy. MORE [24] introduces an entropy constraint on the search distribution regulating the reduction of the parameter entropy at each iteration. The purpose of this constraint is to have better control over the exploration, trading off slower initial progress, as apparent in Figure 1, for a better reward at convergence, as can be seen in Table I. The effect of the entropy constraint of MORE results in a slower and steadier decrease of the policy entropy. However, the higher entropy in parameter space does not translate into entropy in outcome space, where our algorithm clearly outperforms the others.

[Figure 2: four panels of final arm configurations in the x–y plane [m]: REPS, MORE, Empowered REPS (β=1000), Empowered REPS (β=10000).]

Fig. 2. Final arm configuration of 50 trajectories sampled for different algorithms and values of β. Policies returned by standard RL algorithms exhibit no diversity (top row). For our algorithm (bottom row), increasing β increases the behavioral diversity while only slightly decreasing the reward.


The returned policy of each algorithm can be seen in Figure 2, which shows, for different settings of β, the robot arm configurations in the last time step as well as the target point, indicated by the red cross. For this we plotted the arm configurations of all 50 sample trajectories on top of each other. The results make it clear that the solution found by our algorithm is similar to that found by REPS, on which it is based. But where REPS finds a single solution, our algorithm exploits the robot's structure to locally increase diversity in the target joint. Choosing higher values for β leads to increased diversity in posture.

[Figure 3: two panels in the x–y plane [m]: Middle Joint (left) and End Effector (right), for REPS, MORE and Empowered Skills (β=100, 1000, 10000).]

Fig. 3. Middle joint position (left, outcome) and end effector position (right, reward) of 50 trajectories sampled for different algorithms and values of β, illustrating the trade-off between behavioral diversity and reward.

This can be confirmed with a look at Figure 3, which displays the end effector and the middle joint in the last time step of the last iteration. The position distributions appear arc-shaped in the runs of our algorithm. This suggests that the algorithm exploits the configuration of the robot to locally increase diversity. The arcs are evidence of variation in a few specific joints; variation in many joints would dissolve the arc-shape constraint imposed by a single link rotating around a single joint.

We measured a sizable increase in the diversity of the outcomes. For REPS the middle joint positions cover an area smaller than 0.01 mm². For Empowered Skills (β = 100) this area is more than 220 times larger⁵.

The statistical results of the experiment runs can be seen in Table I. While the policy entropy of our algorithm is only marginally higher than that of REPS and much lower than that of MORE, the outcome entropy is distinctly higher.

⁵The size of the markers exaggerates the area covered by REPS due to visibility constraints.

B. Robot Table Tennis

In the table tennis setting there is a simulated robot arm with a racket attached to its end effector. The robot's objective is to return a ball to the opponent's side of the table. The robot arm has six joints and is suspended over the table from a floating base which itself has 3 linear joints to allow small 3D movements. A trajectory is parameterized by the goal positions and velocities of a DMP, yielding an 18 dimensional continuous action space (9 positions and 9 velocities).

Fig. 4. Images of the simulated table tennis robot performing a forehand strike using a policy learned by our algorithm.

In this setting we conduct several experiments in which we choose different characteristics as outcomes. In the first experiment we use the location of the ball when landing on the table after being returned by the robot. This gives us a two-dimensional continuous outcome space. In the second experiment we instead use the speed of the ball at the moment of impact, which gives a one-dimensional outcome space. Because in these sets of experiments the outcomes are not properties of the robot, the relation between policy and outcomes depends on the environment (in this case the physics of the ball movement).

We use imitation learning to initialize the weight parameters of the DMP and optimize for the goal position and velocity. The initial trajectory used for imitation learning was generated using a hand-coded player following [25]. Once the policy is initialized, we run 100 iterations of each of the RL algorithms. The reward optimized by these algorithms is inversely proportional to the distance between a target point⁶ and the location where the ball first enters the table plane after being returned by the robot. If the robot does not hit the ball, the reward is the negative minimum distance between the racket and the ball throughout the trajectory. In all the experiments we ran 5 trials for each algorithm for 100 iterations using 50 samples per iteration.
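A sketch of this reward is shown below; the trajectory fields (ball and racket positions, a hit flag, the table-plane height) are hypothetical, and the scale/(1 + d) form is just one reading of "inversely proportional", since the constant is not given in the paper.

```python
import numpy as np

def table_tennis_reward(traj, target_xy, table_height=0.76, scale=20.0):
    """Sketch of the table tennis reward: shrink with the distance between the
    bounce point and the target if the ball was hit, otherwise return the
    negative minimum racket-ball distance over the trajectory."""
    ball = np.asarray(traj["ball_positions"])       # (T, 3) ball positions
    racket = np.asarray(traj["racket_positions"])   # (T, 3) racket positions
    miss_penalty = -np.min(np.linalg.norm(ball - racket, axis=1))
    if not traj["ball_was_hit"]:
        return miss_penalty
    below = np.where(ball[:, 2] <= table_height)[0]
    if len(below) == 0:                             # returned ball never reached the table plane
        return miss_penalty
    d = np.linalg.norm(ball[below[0], :2] - np.asarray(target_xy))
    return scale / (1.0 + d)                        # placeholder for "inversely proportional"
```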

Experiment 1 - Comparison to a State-of-the-Art RL Algorithm

The first set of experiments again pits our algorithm against a state-of-the-art RL algorithm and explores the influence of an increased β on the shape of the solutions.

⁶The target is marked by a red cross in our plots.

TABLE II
RESULTS OF THE TABLE TENNIS EXPERIMENTS

β       Policy Entropy       Outcome Entropy    Avg. Reward
10      −217.805 ± 1.463     −3.705 ± 0.891     17.694 ± 0.190
100     −216.498 ± 2.951     −2.898 ± 0.616     17.412 ± 0.503
1000    −214.432 ± 2.184     −1.654 ± 0.830     10.536 ± 5.394
REPS    −216.863 ± 1.591     −8.119 ± 1.765     18.124 ± 0.017

These experiments were run in 5 trials with 100 iterations and 50 samples per iteration. Displayed are mean values over the 5 trials. The results show how higher settings of β lead to an increased outcome entropy while having a limited effect on policy entropy and reward, except for β = 1000. In this experiment the outcomes are the 2D location where the ball impacts the table after being returned by the robot.


The outcomes in this case are the position of the ball in the table plane at the moment the ball enters this plane.

Fig. 5. Evolution of outcome entropy, policy entropy and reward over 100 iterations of the table tennis experiment. The figure shows the similarities in the policy entropy and reward, while the outcome (bounce location) curves are more distinct. For β = 1000, the reward decreases due to balls hitting the net.

Figure 5 shows how outcome entropy, policy entropy and reward develop over 100 iterations of running the algorithms. The biggest differences are visible in the outcome entropy. While the policy entropy is comparable for all the runs over all iterations, the outcome entropy clearly decreases when a smaller β is chosen.

[Figure 6: four panels of ball trajectories over the table (x–y plane [m]) for REPS and Empowered Skills with β = 10, 100, 1000.]

Fig. 6. Sample trajectories from REPS and Empowered Skills for different values of β. REPS, although learning a probabilistic policy, always shoots to the target. Our algorithm is able to simultaneously learn to shoot to different areas of the table.

The reward converges more slowly for higher β but does reach a similar score for all but the run with the highest β. Our algorithm seems to be able to keep outcome entropy and reward relatively high while decreasing the policy entropy. The results for the last iteration are shown again in Table II.

The final ball trajectories can be seen in Figure 6. The figure shows 50 ball trajectories for each of the algorithms. While the trajectories for REPS are indistinguishable and those with β = 10 are still very similar, for higher values of β the trajectories cover a much wider area. The figure makes it clear that the Empowered Skills algorithm is able to learn a policy that represents multiple solutions to the posed problem, while the policy learned by REPS yields just a single behavior.

[Figure 7: top view of the table surface (x–y plane [m]) with ball impact points for REPS and Empowered Skills (β = 10, 100, 1000).]

Fig. 7. A top view of the table. Marked on it are the impact points of the ball for 50 sample trajectories per algorithm. Higher values of β lead to a larger spread of outcomes. The mean of the outcomes is different as well, and gets further away from the table's edge.

Figure 7 displays the outcomes of 50 sample trajectories drawn from the final policy for different settings of β. We can see that the impact area grows with β. For the second highest choice of β, two shots miss the table, and for the highest β some of the shots end in the net. Interestingly, with increasing outcome entropy the center of the outcomes moves to the middle of the opponent's side of the table. Had it stayed in the same place (the target point marked by the red cross), approximately half of the outcomes would have missed the table and with it the objective. This shows that our algorithm finds a compromise between getting a high reward and producing outcomes with higher entropy.

Experiment 2 - The Speed Test

The second set of experiments changes the outcomes to the speed at which the ball impacts the table. This leads our algorithm to learn a different set of policies.

Figure 8 shows the distribution of outcomes for REPS and our algorithm at different settings of β. The diversity of impact speeds increases with growing β. Again the algorithm learns policies with different means, shifting away from the table border with increasing β.


This shift is a lot less pronounced than in the experiments with position outcomes in Section III-B.

[Figure 8: impact speed distributions (speed [m/s], roughly 3.8–3.95) for REPS and Empowered Skills with β = 10, 100, 1000.]

Fig. 8. Speed outcomes of 50 trajectories sampled for different values of β and REPS. Here the outcome is the ball's impact speed on the table. The figure shows how increasing β leads to more diverse impact speeds that are distributed around different means.

A look at Figure 9 also makes it clear that the diversity of impact locations is much lower than in the previous experiment. We can thus choose which characteristics are important to us and learn fitting policies by choosing appropriate outcomes.

[Figure 9: four panels of ball trajectories over the table (x–y plane [m]) for REPS and Empowered Skills with β = 10, 100, 1000 in the speed experiment.]

Fig. 9. Speed outcomes of 50 trajectories sampled for different values of β and REPS. The solutions found by our algorithm for the speed experiment show a distinctly reduced spread on the table compared to those for the position experiment (Sec. III-B).

IV. DISCUSSION

In this paper we added an outcome-entropy-based intrinsic motivation term to a state-of-the-art, objective-driven Policy Search algorithm. Our new algorithm is capable of solving problems with continuous action and outcome spaces and is easily extendable to the contextual case. We tested the algorithm in two experimental settings, a reaching task and a robot table tennis task. The experiments show how the Empowered Skills algorithm proposed in this paper is able to learn policies which exhibit higher behavioral diversity in high reward areas of the task. The main limitation of our current setting is the simple form of our Gaussian policy. In order to further increase outcome diversity, our future perspective is to study the integration of this intrinsic motivation term into more complex distributions such as mixture models. Another perspective is to learn a higher-level policy to select the most appropriate skill, in a cooperative or adversarial setting such as robot table tennis, from a library of skills discovered by intrinsic motivation.

REFERENCES

[1] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," IJRR, vol. 32, no. 11, pp. 1238–1274, 2013.
[2] S. Levine, N. Wagener, and P. Abbeel, "Learning Contact-Rich Manipulation Skills with Guided Policy Search," in IEEE ICRA, 2015, pp. 156–163.
[3] P. Kormushev, S. Calinon, and D. G. Caldwell, "Robot motor skill coordination with EM-based reinforcement learning," in IEEE/RSJ IROS, 2010, pp. 3232–3237.
[4] M. P. Deisenroth, G. Neumann, and J. Peters, "A Survey on Policy Search for Robotics," Foundations and Trends in Robotics, vol. 2, no. 1–2, pp. 1–142, 2013.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with Deep Reinforcement Learning," CoRR, 2013.
[6] C. Daniel, G. Neumann, O. Kroemer, and J. Peters, "Hierarchical Relative Entropy Policy Search," JMLR, vol. 17, pp. 1–50, 2016.
[7] A. Jain, B. Wojcik, T. Joachims, and A. Saxena, "Learning Trajectory Preferences for Manipulators via Iterative Improvement," in NIPS, 2013, pp. 575–583.
[8] P. Y. Oudeyer and F. Kaplan, "What is intrinsic motivation? A typology of computational approaches," Frontiers in Neurorobotics, vol. 1, p. 6, 2009.
[9] T. Jung, D. Polani, and P. Stone, "Empowerment for Continuous Agent–Environment Systems," Adaptive Behavior, vol. 19, pp. 16–39, 2011.
[10] A. Baranes and P.-Y. Oudeyer, "R-IAC: Robust Intrinsically Motivated Exploration and Active Learning," IEEE TAMD, vol. 1, no. 3, pp. 155–169, 2009.
[11] J. Schmidhuber, "A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers," in SAB, vol. 1, 1991, pp. 222–227.
[12] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer, "Exploration in model-based reinforcement learning by empirically estimating learning progress," in NIPS, 2012, pp. 206–214.
[13] G. Baldassarre and M. Mirolli, "Deciding Which Skill to Learn When: Temporal-Difference Competence-Based Intrinsic Motivation (TD-CB-IM)," in Intrinsically Motivated Learning in Natural and Artificial Systems, G. Baldassarre and M. Mirolli, Eds. Springer, 2013, pp. 257–278.
[14] A. Baranes and P. Y. Oudeyer, "Active learning of inverse models with intrinsically motivated goal exploration in robots," Robotics and Autonomous Systems, vol. 61, no. 1, pp. 49–73, 2013.
[15] S. Forestier and P.-Y. Oudeyer, "Modular active curiosity-driven discovery of tool use," in IEEE/RSJ IROS, 2016, pp. 3965–3972. [Online]. Available: http://ieeexplore.ieee.org/document/7759584/
[16] M. Rolf, J. J. Steil, and M. Gienger, "Goal babbling permits direct learning of inverse kinematics," IEEE TAMD, vol. 2, no. 3, pp. 216–229, 2010.
[17] P. Delarboulas, M. Schoenauer, and M. Sebag, "Open-Ended Evolutionary Robotics: An Information Theoretic Approach," in PPSN, 2010, pp. 334–343.
[18] J. Lehman and K. O. Stanley, "Abandoning objectives: evolution through the search for novelty alone," Evolutionary Computation, vol. 19, no. 2, pp. 189–223, 2011.
[19] A. S. Klyubin, D. Polani, and C. L. Nehaniv, "All Else Being Equal Be Empowered," in ECAL, 2005, pp. 744–753.
[20] V. Kumar, E. Todorov, and S. Levine, "Optimal Control with Learned Local Models: Application to Dexterous Manipulation," in IEEE ICRA, 2016, pp. 378–383.
[21] J. Peters, K. Mulling, and Y. Altun, "Relative Entropy Policy Search," in AAAI, 2010, pp. 1607–1612.
[22] A. Kupcsik, M. P. Deisenroth, J. Peters, A. P. Loh, P. Vadakkepat, and G. Neumann, "Model-based Contextual Policy Search for Data-Efficient Generalization of Robot Skills," Artificial Intelligence, 2015.
[23] J. Schulman, S. Levine, M. Jordan, and P. Abbeel, "Trust Region Policy Optimization," in ICML, 2015, pp. 1889–1897.
[24] A. Abdolmaleki, R. Lioutikov, J. R. Peters, N. Lau, L. P. Reis, and G. Neumann, "Model-Based Relative Entropy Stochastic Search," in NIPS, 2015, pp. 3523–3531.
[25] K. Mulling, J. Kober, and J. Peters, "Simulating human table tennis with a biomimetic robot setup," in SAB, 2010, pp. 273–282.