

Active Learning of Parameterized Skills

Bruno Castro da Silva (BSILVA@CS.UMASS.EDU)
School of Computer Science, University of Massachusetts Amherst, MA 01003

George Konidaris (GDK@CSAIL.MIT.EDU)
Computer Science and Artificial Intelligence Lab, MIT, Cambridge, MA 02139

Andrew Barto (BARTO@CS.UMASS.EDU)
School of Computer Science, University of Massachusetts Amherst, MA 01003

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

We introduce a method for actively learning parameterized skills. Parameterized skills are flexible behaviors that can solve any task drawn from a distribution of parameterized reinforcement learning problems. Approaches to learning such skills have been proposed, but limited attention has been given to identifying which training tasks allow for rapid skill acquisition. We construct a non-parametric Bayesian model of skill performance and derive analytical expressions for a novel acquisition criterion capable of identifying tasks that maximize expected improvement in skill performance. We also introduce a spatiotemporal kernel tailored for non-stationary skill performance models. The proposed method is agnostic to policy and skill representation and scales independently of task dimensionality. We evaluate it on a non-linear simulated catapult control problem over arbitrarily mountainous terrains.

1. Introduction

One approach to dealing with high-dimensional control problems is to specify or discover hierarchically structured policies. The most widely used hierarchical reinforcement learning formalism is the options framework (Sutton et al., 1999), where options (or skills) define temporally-extended policies that abstract away details of low-level control. One of the motivating principles underlying this framework is the idea that subproblems recur, so that options can be reused in a variety of related tasks or contexts. Options, however, are typically defined as single policies and cannot be used to solve ensembles of related reinforcement learning problems. To address this issue, parameterized skills have recently emerged as a promising framework for producing reusable behaviors that can solve a distribution of related control problems, given only a parameterized description of a task (Kober et al., 2012; da Silva et al., 2012; Neumann et al., 2013; Deisenroth et al., 2014). Once acquired, parameterized skills can produce—on-demand—policies for any tasks in the distribution, even those which the agent has never had direct experience with. As an example, consider a soccer-playing agent tasked with learning a kicking skill parameterized by desired force and direction. Learning a single policy for each possible variation of the task is infeasible. The agent might wish, instead, to learn policies for a few specific kicks and use them to synthesize a single general skill for kicking the ball—parameterized by force and direction—that it can execute on-demand.

In this paper we are concerned with the question of how an agent, tasked with learning a parameterized skill and given a distribution from which future tasks will be drawn, should practice. Such an agent should choose to practice tasks that lead it to maximize skill performance improvement over all tasks in the target distribution. Intuitively, the tasks from which experience is most beneficial are those that allow the skill to better generalize to a wider range of related problems. As we will show, identifying such tasks is not straightforward—sampling tasks according to the target distribution, for instance, is inefficient because it does not account for the varying difficulty of tasks. Furthermore, non-adaptive sampling strategies ignore how some tasks may require policies qualitatively different from those of neighboring tasks, thus demanding more extensive training. Continuing with the previous example, a soccer-playing agent may wish to learn a parameterized kicking skill by practicing different types of kicks, such as accurate ones towards the goal or powerful but inaccurate kicks in the general direction of the opponent's half field. Policies for the former type of kick are harder to learn and generalize, since their parameters are very sensitive to the desired direction. Carefully choosing which kicks to practice allows the agent to identify which ones require less practice and which ones are more challenging, thus focusing on the aspects of the skill that can more readily be improved.

These observations are consistent with recent theories of how human experts acquire professional levels of achievement, which propose that skill improvement involves deliberate efforts to change particular aspects of performance (Ericsson, 2006). Indeed, thoughtful and deliberate practice is one of the defining characteristics of expert performance in sports, arts and science: world-class athletes, for instance, carefully construct training regimens to build on their strengths and shore up their weaknesses.

We construct a non-parametric Bayesian model of skill performance and derive analytical expressions for a novel acquisition criterion capable of identifying tasks that maximize skill performance improvement over a given target distribution. We also introduce a spatiotemporal kernel tailored for modeling non-stationary skill performance functions. The proposed method is agnostic to policy and skill representation and scales independently of task dimensionality. We evaluate it on a non-linear simulated catapult control problem in which different launch profiles are required depending on the target position and on the smoothness characteristics of the terrain.

2. Active Learning of Parameterized Skills

Assume an agent presented with a sequence of tasks drawn from some task distribution. Each task is modeled as a Markov Decision Process (MDP). Furthermore, assume that the MDPs have dynamics and reward functions similar enough so that they can be considered variations of a same task. The objective of a parameterized skill is to maximize the expected reward over the distribution of MDPs:

\int P(\tau)\, J(\pi_\theta, \tau)\, d\tau,    (1)

where π_θ is a policy with parameters θ ∈ R^M, τ is a task parameter vector drawn from a continuous space T, J(π, τ) is the expected return obtained when using policy π to solve task τ (given an initial state distribution), and P(τ) is a probability density function describing the probability of task τ occurring.

A parameterized skill is a function Θ : T → R^M mapping task parameters to policy parameters. When using a parameterized skill to solve a given task, the parameters of the policy to be used are specified by Θ. An efficient parameterized skill maximizes the expected performance of the policies it specifies over the entire distribution of possible tasks:

\int P(\tau)\, J(\pi_{\Theta(\tau)}, \tau)\, d\tau.    (2)
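In practice the integral in Equation 2 can be approximated rather than computed exactly; one simple option is a Monte Carlo estimate over tasks sampled from P. The sketch below illustrates this under assumed interfaces: `sample_task`, `skill` (playing the role of Θ), and `evaluate_policy` (playing the role of J) are hypothetical placeholders, not code from the paper.

```python
import numpy as np

def estimate_skill_performance(sample_task, skill, evaluate_policy, n_samples=1000):
    """Monte Carlo estimate of Eq. 2: E_{tau ~ P}[ J(pi_{Theta(tau)}, tau) ].

    sample_task()               -- draws a task parameter vector tau from P
    skill(tau)                  -- parameterized skill Theta; returns policy parameters theta
    evaluate_policy(theta, tau) -- expected return J of policy theta on task tau
    """
    returns = []
    for _ in range(n_samples):
        tau = sample_task()                      # tau ~ P(tau)
        theta = skill(tau)                       # theta = Theta(tau)
        returns.append(evaluate_policy(theta, tau))
    return float(np.mean(returns))
```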

A parameterized skill can be learned via a set of training tasks and their corresponding policies. The simplest strategy for constructing this set is to select tasks uniformly at random or to draw them from the task distribution P. These strategies, however, ignore how a carefully-chosen task can improve performance not only on that task, but over a wider range of related tasks.

We introduce a general framework for active task selection with arbitrary parameterized skill representations. The framework uses a Bayesian model of skill performance and a specially tailored acquisition function designed to select training tasks that maximize expected skill performance improvement. The process of actively selecting training tasks consists of the following steps: 1) identify, by means of a model of expected skill performance, the most promising task τ to practice; 2) practice the task and learn a corresponding policy π*_θ; 3) update the parameterized skill; 4) evaluate the performance J(τ) of the updated skill at that task; and 5) update the model of expected skill performance with the observed performance. These steps are repeated until the skill cannot be further improved or a maximum number of practicing episodes is reached. This training process is depicted in Figure 1.

Figure 1. The process of actively learning a parameterized skill. (Diagram blocks: skill performance model, policy learning, parameterized skill, and environment; the quantities J(τ), π*_θ, and π_θ flow between them.)
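A minimal sketch of this loop, under the assumption that each numbered step above is available as a separate routine, is given below; the helper names (`argmax_eisp`, `learn_policy`, `update_skill`, `evaluate_skill`, `update_performance_model`) are illustrative placeholders and do not come from the paper.

```python
def active_skill_training(model, skill, max_episodes=50):
    """One possible realization of the five-step training loop described above."""
    for t in range(max_episodes):
        tau = argmax_eisp(model)                  # 1) pick the most promising task via EISP
        policy = learn_policy(tau)                # 2) practice the task, obtaining pi*_theta
        skill = update_skill(skill, tau, policy)  # 3) update the parameterized skill
        j_obs = evaluate_skill(skill, tau)        # 4) measure performance J(tau) of the updated skill
        model = update_performance_model(model, tau, t, j_obs)  # 5) condition the GP on (tau, t, J)
    return skill, model
```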

3. A Bayesian Model of Skill Performance

Let J(π_{Θ(τ)}, τ) be the performance achieved by a parameterized skill Θ on task τ. To simplify notation we suppress the dependence on Θ and write simply J(τ). We propose creating a Bayesian model capable of predicting the expected skill performance over a given set of tasks. This model allows the agent to predict performance—even at tasks which were never directly experienced—without incurring the costs of executing the skill. As will be shown in Section 4, this is particularly useful when estimating how different training tasks may contribute to improving the performance of a skill over a distribution of tasks. Let µ(τ) be the mean predicted performance and σ²(τ) the corresponding variance. One way to represent this model is via a Gaussian Process (GP) prior (Rasmussen & Williams, 2006). A GP is a (possibly infinite) set of random variables, any finite set of which is jointly Gaussian distributed. GPs extend Gaussian distributions to the case of infinite dimensionality, and thus can be seen as distributions over functions—in this case, predictive distributions over skill performance functions. To fully specify a GP one must define a mean function m and a positive-definite kernel k:

m(\tau) = \mathbb{E}[J(\tau)]

k(\tau_p, \tau_q) = \mathbb{E}\left[ (J(\tau_p) - m(\tau_p))(J(\tau_q) - m(\tau_q)) \right].

A kernel specifies properties of J such as smoothness and periodicity. In this section and the following we do not make any assumptions about the form of k; in Section 5 we introduce a kernel designed specifically for use with our framework. We assume that the performance function J is zero-mean, so that m can be dropped from the equations; the extension for the non-zero mean case is straightforward.

Given k and a set of observations D = {(τ_1, J(τ_1)), ..., (τ_N, J(τ_N))}, both the log likelihood of the data conditioned on the model and the Gaussian posterior over skill performance, P(J(τ) | τ, D), can be computed easily for any tasks τ. Under a Gaussian Process, the predictive posterior over skill performance at a task τ is normally distributed according to J(τ) ∼ N(µ(τ), σ²(τ)), where

\mu(\tau) = \mathbf{k}^\top (C_D + \sigma^2 I)^{-1} \mathbf{y}_D    (3)

\sigma^2(\tau) = k(\tau, \tau) + \sigma^2 - \mathbf{k}^\top (C_D + \sigma^2 I)^{-1} \mathbf{k},    (4)

with k = [k(τ, τ_1) ... k(τ, τ_N)]^⊤, C_D is an N × N matrix with entries (C_D)_{ij} = k(τ_i, τ_j), y_D = [J(τ_1) ... J(τ_N)]^⊤, and σ² is the additive noise we assume affects measurements of skill performance.
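Equations 3 and 4 reduce to a few dense linear-algebra operations. The following is a minimal numpy sketch for an arbitrary kernel function `k`; it assumes the noise variance `sigma2` is given and ignores numerical niceties such as Cholesky factorization.

```python
import numpy as np

def gp_posterior(tau, tasks, y, k, sigma2):
    """Posterior mean (Eq. 3) and variance (Eq. 4) of skill performance at task tau.

    tasks  -- previously evaluated tasks tau_1..tau_N
    y      -- observed performances [J(tau_1), ..., J(tau_N)]
    k      -- kernel function k(tau_p, tau_q)
    sigma2 -- assumed observation-noise variance
    """
    N = len(tasks)
    C = np.array([[k(ti, tj) for tj in tasks] for ti in tasks])  # (C_D)_ij = k(tau_i, tau_j)
    kv = np.array([k(tau, ti) for ti in tasks])                  # [k(tau, tau_1) ... k(tau, tau_N)]
    W = np.linalg.inv(C + sigma2 * np.eye(N))                    # (C_D + sigma^2 I)^{-1}
    mu = kv @ W @ np.array(y)                                    # Eq. 3
    var = k(tau, tau) + sigma2 - kv @ W @ kv                     # Eq. 4
    return mu, var
```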

GPs are often used in Bayesian optimization to find the maximum of unknown, expensive-to-sample functions. To do so, a surrogate acquisition function is maximized. Acquisition functions guide the search for the optimum of a function by identifying the most promising points to sample. Standard acquisition functions include Lower Confidence Bounds, Maximum Probability of Improvement and Expected Improvement (Snoek et al., 2012). Although these criteria may seem, at first, appropriate for selecting training tasks, that is not the case: if applied to a skill performance function, standard acquisition criteria would only identify the tasks that can be solved with highest performance. By contrast, our goal is to identify training tasks which result in the highest expected improvement in skill performance. This goal, formally defined in Equation 6, measures how much skill performance may improve over a target distribution of tasks if the skill is updated by the practice of a selected task. A motivating principle behind this objective is that since parameterized skills naturally generalize policies to related tasks, an effective acquisition criterion should focus not on improving the performance at a single task, but on the performance gains that practicing it may bring to the skill as a whole.

In the next section we introduce a novel acquisition criterion specially tailored for parameterized skills. We also derive closed-form expressions for this criterion and its gradient, thus obtaining analytical expressions for the two quantities required for optimizing the process of selecting training tasks.

4. Active Selection of Training Tasks

An acquisition function to identify tasks that provide maximum expected improvement in skill performance should: 1) take into account that tasks may occur with different probabilities and prioritize training accordingly; and 2) model how the practice of a single task may, due to the generalization properties inherent to a parameterized skill, improve performance not only at that task but also over related tasks.

Let us assume we have practiced N tasks, τ_1, ..., τ_N, and used the corresponding (optimal or near-optimal) learned policies to construct a parameterized skill. Assume that we annotate each task τ_i with the time t_i it was practiced, which we denote by τ_i^{t_i}. Here, time refers to the order in which tasks are practiced, so that the i-th task to have been practiced is annotated with time t = i. Assume that we have evaluated the efficiency of the skill on tasks τ_1, ..., τ_N and observed performances J(τ_1), ..., J(τ_N). Let D be a training set of tuples {τ_i^{t_i}, J(τ_i)}, for i ∈ {1, ..., N}. Given this training set, Equations 3 and 4 can be used to compute a posterior distribution over skill performance, P(J(τ) | τ, D), for any tasks τ. Let J_t be the posterior distribution obtained when conditioning the Gaussian Process on all tasks practiced up to time t, and let µ_t(τ) and σ²_t(τ) be its mean and variance, respectively. We define Skill Performance (SP) as the expected performance of the skill with respect to an arbitrary distribution P of tasks:

\mathrm{SP}_t = \int P(\tau)\, \mu_t(\tau)\, d\tau.    (5)

Furthermore, let the Expected Improvement in Skill Performance (EISP), given task τ, be the expected increase in skill performance resulting from an agent practicing τ and observing subsequent skill performance j(τ). Here, j(τ) is an optimistic upper bound on the skill performance at τ, computed with respect to the current posterior distribution J_t. To compute the EISP of a task τ we consider the Gaussian posterior, J_{t+1}, that would result if J_t were to be updated with a new observation {(τ, j(τ))}. Let µ_{t+1}(τ) and σ²_{t+1}(τ) be the mean and variance of J_{t+1}. The EISP of a task τ is defined as

\mathrm{EISP}_t(\tau) = \int P(\tau')\, \big( \mu_{t+1}(\tau') - \mu_t(\tau') \big)\, d\tau'.    (6)

EISP can be understood intuitively as a quantitative way of comparing tasks based on their likely contributions to improving the overall quality of the skill. Tasks whose practice may improve skill performance on a wide range of related tasks have higher EISP; conversely, tasks whose solutions are already well-modeled by the skill have lower EISP. Computing the EISP is similar to executing a mental evaluation of possible training outcomes: the agent uses its model of expected skill performance to estimate—without ever executing the skill—the effects that different training tasks may have on its competence across a distribution of problems.

The maximum of Equation 6 identifies the task τ* which, if used to update the parameterized skill, results in the highest expected improvement in overall performance. This corresponds to an acquisition function which selects training tasks according to:

\tau^* = \arg\max_\tau \ \mathrm{EISP}_t(\tau).    (7)
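Although the rest of this section develops a gradient-based optimizer for Equation 7, the criterion can also be evaluated numerically, which is a useful way to understand it: temporarily add the optimistic observation (τ, j(τ)) to the GP, recompute the posterior mean, and integrate the resulting improvement against P over a grid of tasks. The sketch below does this, reusing the `gp_posterior` helper from the previous sketch and the optimistic bound j(τ) = µ_t(τ) + 1.96 σ_t(τ) defined later in this section; the grid search and the omission of time annotations are simplifications of ours, not the paper's implementation.

```python
import numpy as np

def eisp(tau, tasks, y, k, sigma2, grid, weights):
    """Numerical EISP (Eq. 6) for a candidate task tau.

    grid, weights -- quadrature nodes tau' and weights proportional to P(tau')
    """
    mu_t = np.array([gp_posterior(tp, tasks, y, k, sigma2)[0] for tp in grid])
    mu_tau, var_tau = gp_posterior(tau, tasks, y, k, sigma2)
    j_opt = mu_tau + 1.96 * np.sqrt(var_tau)           # optimistic outcome j(tau)
    tasks_plus, y_plus = list(tasks) + [tau], list(y) + [j_opt]
    mu_t1 = np.array([gp_posterior(tp, tasks_plus, y_plus, k, sigma2)[0] for tp in grid])
    return float(np.sum(weights * (mu_t1 - mu_t)))     # approximates Eq. 6

def argmax_eisp_grid(candidates, tasks, y, k, sigma2, grid, weights):
    """Eq. 7 by exhaustive search over a set of candidate tasks."""
    scores = [eisp(c, tasks, y, k, sigma2, grid, weights) for c in candidates]
    return candidates[int(np.argmax(scores))]
```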

One way of evaluating Equation 7 is to use a gradient-based method. This requires an analytic expression for (or good approximation of) the gradient of EISP_t(τ) with respect to arbitrary tasks. To make notation less cluttered, we focus on the case of 1-dimensional task parameters; the extension to higher dimensions is straightforward. Assume, without loss of generality, that the parameter describing a task is drawn from a bounded, continuous interval [A, B]. To derive the expression for the gradient of EISP, we first observe that µ_t(τ') in Equation 6 does not depend on τ and can be removed from the maximization. It is possible to show that the function to be maximized in Equation 7 is equivalent to:

\mathrm{EISP}_t(\tau) = G_t(\tau)^\top \left( (C_t(\tau) + \sigma^2 I)^{-1} \mathbf{y}(\tau) \right),    (8)

where

G_t(\tau) = [g_t(\tau_1^{t_1}) \ \ldots \ g_t(\tau_N^{t_N}) \ g_t(\tau^{t+1})]^\top,    (9)

g_t(\tau_i^{t_i}) = \int_A^B P(r)\, k(r^{t+1}, \tau_i^{t_i})\, dr,    (10)

\mathbf{y}(\tau) = [J(\tau_1) \ \ldots \ J(\tau_N) \ j(\tau)]^\top,    (11)

and where C_t(τ) is the covariance matrix of the extended training set D ∪ {(τ^{t+1}, j(τ))}:

C_t(\tau) = \begin{bmatrix} k(\tau_1^{t_1}, \tau_1^{t_1}) & \ldots & k(\tau_1^{t_1}, \tau^{t+1}) \\ \vdots & \ddots & \vdots \\ k(\tau_N^{t_N}, \tau_1^{t_1}) & \ldots & k(\tau_N^{t_N}, \tau^{t+1}) \\ k(\tau^{t+1}, \tau_1^{t_1}) & \ldots & k(\tau^{t+1}, \tau^{t+1}) \end{bmatrix}.    (12)
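Given a kernel and an implementation of Equation 10, Equation 8 itself is a single solve against the extended covariance matrix. The sketch below evaluates it directly; the argument names and the convention of passing tasks as (parameter, time) pairs are our own, and `g_t`, `k_C` and `j_opt` are assumed to be supplied.

```python
import numpy as np

def eisp_closed_form(tau, tasks, times, y, k_C, g_t, j_opt, t, sigma2):
    """EISP via Eq. 8 for a candidate task tau practiced at time t+1.

    tasks, times -- previously practiced tasks tau_i and their practice times t_i
    y            -- observed performances J(tau_i)
    k_C          -- spatiotemporal kernel k_C((tau_p, t_p), (tau_q, t_q))
    g_t          -- function implementing Eq. 10 for a (task, time) pair
    j_opt        -- optimistic performance j(tau) at the candidate task
    """
    pts = list(zip(tasks, times)) + [(tau, t + 1)]                    # extended training set
    G = np.array([g_t(p, tm) for p, tm in pts])                       # Eq. 9
    C = np.array([[k_C(p, q) for q in pts] for p in pts])             # Eq. 12
    yv = np.array(list(y) + [j_opt])                                  # Eq. 11
    return float(G @ np.linalg.solve(C + sigma2 * np.eye(len(pts)), yv))  # Eq. 8
```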

Furthermore, the gradient of EISP_t(τ) with respect to any given task τ is

\nabla_\tau \mathrm{EISP}_t(\tau) = \nabla_\tau G_t(\tau)^\top W_t(\tau) \mathbf{y}(\tau) - G_t(\tau)^\top W_t(\tau) \nabla_\tau C_t(\tau) W_t(\tau) \mathbf{y}(\tau) + G_t(\tau)^\top W_t(\tau) \nabla_\tau \mathbf{y}(\tau),    (13)

where

\nabla_\tau G_t(\tau) = [\underbrace{0 \ \ldots \ 0}_{N \text{ times}} \ \ \nabla_\tau g_t(\tau)]^\top,    (14)

\nabla_\tau \mathbf{y}(\tau) = [\underbrace{0 \ \ldots \ 0}_{N \text{ times}} \ \ \nabla_\tau j(\tau)]^\top,    (15)

and W_t(τ) = (C_t(τ) + σ²I)^{-1}. Let j(τ) be the upper endpoint of the 95% confidence interval around the mean predicted performance at τ: j(\tau) = \mu_t(\tau) + 1.96\sqrt{\sigma_t^2(\tau)}. Then \nabla_\tau j(\tau) = \nabla_\tau \big( \mu_t(\tau) + 1.96\sqrt{\sigma_t^2(\tau)} \big). If we assume that the variance σ²_t of the process is approximately constant within infinitesimal neighborhoods of a given task, then \nabla_\tau j(\tau) = \nabla_\tau \mu_t(\tau), which can be rewritten as

\nabla_\tau j(\tau) = \left( \nabla_\tau [k(\tau^t, \tau_1^{t_1}) \ \ldots \ k(\tau^t, \tau_N^{t_N})]^\top \right) (C_D + \sigma^2 I)^{-1} \mathbf{y}_D.    (16)

It can be shown that ∇_τ EISP_t(τ) can be expressed as a linear form φ_1 j(τ) + φ_2 y_D, where both φ_1 and φ_2 depend only on the kernel function k and on the task parameters sampled so far. This reveals an interesting property: given an arbitrary fixed set of training tasks, the gradient of EISP can be linearly decomposed into one component that depends solely on the performances y_D that are achievable by the skill, and another component that depends solely on the optimistic assumptions made when defining j(·). This implies that the direction of maximum improvement of EISP is independently influenced by 1) the generalization capabilities of the skill—specifically, the actual performances it achieves on various tasks; and 2) the optimistic assumptions regarding how further practice of a particular task may improve its performance.


Note that some of the equations in this section depend directly or indirectly on the choice of kernel k. In Section 5 we introduce a novel spatiotemporal kernel specially designed to better model skill performance functions, and in Appendix A we derive analytical expressions for the quantities involving it; namely, ∇_τ k(τ_i^{t_i}, ·), g_t(τ_i^{t_i}) and ∇_τ g_t(τ_i^{t_i}).

5. Modeling Non-Stationary Skill Performance Functions

Kernels encode assumptions about the function being modeled by a GP, such as its smoothness and periodicity. An implicit assumption made by standard kernels is that the underlying function is stationary—that is, it does not change with time. Kernel functions also specify a measure of similarity or correlation between input points, usually defined in terms of the coordinates of the inputs. If dealing with non-stationary functions, however, defining similarities is harder: when a point is resampled, for instance, we generally expect the similarity between its new and previous values to decrease with time. Note that the model of expected performance introduced in Section 3 is intrinsically non-stationary, since skill performance naturally improves with practice. If a standard kernel were to be used to model this function, outdated performance observations would contribute to the predicted mean, thus keeping the GP from properly tracking the changing performance.

To address this issue we introduce a new spatiotemporal kernel designed to better model non-stationary skill performance functions. Let us assume an arbitrary kernel k_S(τ_1, τ_2) capable of measuring the similarity between tasks based solely on their parameters τ_1 and τ_2. We expect k_S to be higher if comparing related tasks, and close to zero otherwise. Note that k_S does not account for the expected decrease in the similarity between observations of a task at very different times. To address this issue we construct a composite spatiotemporal kernel k_C, based on k_S, capable of evaluating the similarity between tasks based on their parameters and on the times they were sampled. Let k_C(τ_1^{t_1}, τ_2^{t_2}) be such a kernel, where τ_i^t denotes a task τ_i sampled at time t. For k_C to be suitable for modeling non-stationary functions, it should ensure the following properties: 1) related tasks have higher similarity if sampled at similar times; that is, k_C(τ_1^{t_1}, τ_2^{t_2}) > k_C(τ_1^{t_1+∆t}, τ_2^{t_2}), for ∆t > 0; 2) if related tasks are sampled at significantly different times, no temporal correlation can be inferred and similarity is defined solely on their task parameters; that is, k_C(τ_1^{t_1}, τ_2^{t_2}) → k_S(τ_1, τ_2) as |t_1 − t_2| → ∞; and 3) the more unrelated tasks are, the smaller the correlation between them, independently of when they were sampled; that is, k_C → 0 as k_S → 0. The first property implies that if tasks are related, closer sampling times suggest higher correlation; the second property implies that nothing besides similarity in task space can be inferred if tasks are sampled at very different times; and the third property implies that sampling times, on their own, carry no correlation information if the tasks being compared are significantly different. To define k_C we introduce an isotropic exponential kernel k_T(t_1, t_2) for measuring the similarity between sampling times:

k_T(t_1, t_2) = 1 + (C - 1) \exp\left( -\rho^{-1} (t_1 - t_2)^2 \right),    (17)

for some C > 0. k_T is such that k_T → C as |t_1 − t_2| → 0, and k_T → 1 as |t_1 − t_2| → ∞. The parameter ρ is similar to the length-scale parameter in squared exponential kernels and regulates our prior assumption regarding how non-stationary the function is. We can now define k_C as

k_C(\tau_1^{t_1}, \tau_2^{t_2}) = k_S(\tau_1, \tau_2) \times \left( 1 + (C - 1)\, e^{-\frac{(t_1 - t_2)^2}{\rho}} \right).    (18)

Intuitively, k_T boosts the correlation between tasks if they were sampled at similar times and ensures that only spatial correlation is taken into account as the difference between sampling times increases. Furthermore, note that when C = 1 all temporal information is ignored and k_C degenerates to the purely-spatial kernel k_S. Several methods, such as evidence maximization, are available to automatically identify suitable parameters for k_C and k_S (Rasmussen & Williams, 2006).
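Equations 17 and 18 translate directly into code. The sketch below pairs them with a squared exponential k_S (the spatial kernel used in Appendix A); the hyperparameter values are arbitrary placeholders and would in practice be chosen by evidence maximization, as noted above.

```python
import numpy as np

def k_S(tau1, tau2, sigma_f2=1.0, L=1.0):
    """Purely spatial squared exponential kernel over task parameters."""
    return sigma_f2 * np.exp(-((tau1 - tau2) ** 2) / L)

def k_T(t1, t2, C=2.0, rho=10.0):
    """Temporal kernel (Eq. 17): tends to C as |t1 - t2| -> 0 and to 1 as |t1 - t2| -> inf."""
    return 1.0 + (C - 1.0) * np.exp(-((t1 - t2) ** 2) / rho)

def k_C(tau1, t1, tau2, t2):
    """Composite spatiotemporal kernel (Eq. 18): k_S boosted by temporal proximity."""
    return k_S(tau1, tau2) * k_T(t1, t2)
```

Setting C = 1 makes k_T identically 1, so k_C reduces to the purely spatial k_S, matching the remark above.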

Figure 2 depicts the predicted posterior mean and variance of a process defined over a synthetic non-stationary function. Two curves are shown: one for the predicted mean if using the purely-spatial kernel k_S, which does not take sampling time into account, and one for the improved predicted mean obtained if using the spatiotemporal kernel k_C. Note how the latter kernel allows the predicted mean to correctly track the non-stationary function.

Figure 2. GP posteriors obtained when using a standard kernel and a spatiotemporal kernel. Lighter-colored points indicate older samples, while darker ones indicate more recent ones.

6. The Catapult Domain

We evaluate our task selection method on a simulated catapult control problem where the agent is tasked with learning a parameterized skill for hitting targets on mountainous terrains (Figure 3).¹ Targets can be placed anywhere on a 2-dimensional terrain with various elevations and slopes—both of which are unknown to the agent. The task space T consists of a single parameter describing the horizontal distance from the catapult to the target; note that this task parameterization does not convey any information about the elevation of the target or the geometry of the terrain. Learning such a skill is difficult because it requires generalizing over an ensemble of continuous-state, continuous-action control problems. We learn the parameterized skill via Gaussian Process regression. The skill maps target positions to continuous launch parameters—namely, angle and velocity. Finally, we define performance of a policy as the distance between where the projectile hits and the intended target.

¹ Code will be made available at http://bitbucket.org/bsilvapoa/active_paramskill.

Figure 3. The Catapult Domain.

Determining which tasks to practice is challenging because irregular, non-smooth terrains may require significantly different launch profiles for hitting neighboring targets. Figure 3 depicts how a change ∆τ in task parameters may require significantly different launch parameters depending on the region of the terrain. Figure 4 depicts the policy manifold associated with launch parameters required for hitting various targets on a randomly-generated terrain. The 1-dimensional task space is represented by the red line, and gray lines mapping points in task space to policy space indicate policy predictions made by the skill. Discontinuities in this mapping indicate irregular regions of the policy manifold in which generalization is difficult. Finally, note that identifying the target with maximum EISP corresponds to optimizing a one-step look-ahead sampling strategy to quickly uncover the structure of this manifold.

Figure 4. Policy manifold of the catapult domain (policy-space axes: launch angle and velocity). The red line represents the task space. Gray lines connecting task to policy space indicate predictions made by the skill. Discontinuities in the mapping indicate task regions where generalization is difficult.

We compared the performance of our method with four alternative approaches. Two of them are baseline, non-adaptive sampling strategies: selecting tasks uniformly at random, and probabilistically according to the task distribution P. We also compare with two active acquisition criteria commonly used in Bayesian optimization: Expected Improvement (EI) and Lower Confidence Bound (LCB). Figures 5 and 6 show skill performance as a function of the number of tasks practiced, for different task-selection methods. Skill performance was measured by evaluating the skill on a set of novel tasks; the observed performances were weighted by the task distribution P to reflect whether the agent was competent at tasks of higher interest. To report an absolute measure of skill quality we computed the mean squared difference between performance of the learned skill and the maximum performance that can be achieved by the skill model. The lower the difference, the better the skill is at solving tasks from the distribution of interest. All curves are averages over 50 randomly generated terrains.

Figure 5 shows how skill performance changes as the agent practices more tasks, assuming a uniform target distribution P of tasks. Note that in this case both non-adaptive sampling strategies—i.e., selecting tasks at random or drawing them from P—are equivalent. Similarly, Figure 6 shows skill performance as a function of tasks practiced but for the case of a non-uniform target distribution P—that is, an agent with stronger preference for becoming competent at targets in particular regions of the terrain. Here, P was defined as a Gaussian centered at the midpoint of the terrain. Under both types of task distribution, EI performed worse than all other methods. EI selects tasks whose individual performances are expected to improve the most by practice. This criterion leads the agent to repeatedly practice tasks that are already well-modeled by the skill but which may be marginally improved. This causes the agent to ignore regions of the task space in which it is not yet competent. LCB suffers from a similar shortcoming; it selects tasks in which the skill has lowest expected performance, thus focusing on improving the agent's weaknesses. This often leads the agent to obsessively practice tasks that it may be unable to solve well. Finally, both random selection of tasks and selection according to the target distribution P fail to account for the varying difficulty of tasks. These criteria choose to practice problems independently of the skill's current capability of solving them; furthermore, they often practice tasks that are irrelevant according to the target task distribution. EISP, on the other hand, correctly identifies which tasks can improve skill performance the most, and takes into account both their relative difficulties and how well their solutions generalize to related tasks.

Figure 5. Average skill performance as a function of the number of sampled training tasks (uniform task distribution). The plot shows the MSE of skill performance with respect to optimal for EI, LCB, Random, and EISP.

Figure 6. Average skill performance as a function of the number of sampled training tasks (non-uniform task distribution). The plot shows the MSE of skill performance with respect to optimal for EI, LCB, Random, P, and EISP.

Figure 7 depicts the regions of task space that each method chooses to explore on a randomly-generated terrain. Random sampling and sampling according to the task distribution P do not adapt their task-selection strategies according to the problem. EI identifies two regions of the task space in which the skill is effective and focuses on trying to further improve those. LCB, on the contrary, samples more densely regions that contain difficult tasks. However, because it does not model whether skill performance is expected to improve, it often focuses on tasks that are too difficult. EISP prioritizes practice according to the target distribution P and selects problems according to how much they are expected to contribute to improving skill performance. In particular, note how it chooses to practice less on tasks at the beginning of the task range, even though those tasks have a high probability of occurring according to P. This happens because EISP quickly realizes that solutions to those tasks can be easily generalized and that no further samples are required. Finally, the peak of samples collected by EISP at the end of the task range corresponds to a particularly difficult part of the terrain which requires prolonged practice. Note how EISP devotes less attention to that region than EI since it is capable of predicting when no further skill improvement is expected.

Figure 7. Density of samples collected by different training strategies (Rand, P, EISP, EI, LCB) across the task space.

We can draw a few important conclusions from our results: 1) non-adaptive strategies implicitly assume that all tasks can be equally well generalized by the skill—or, equivalently, that the manifold has approximately uniform curvature; this causes them to unnecessarily practice tasks that may already be well-modeled by the skill; 2) the LCB criterion always selects tasks with lowest predicted performance, often repeatedly practicing poorly-performing tasks as long as they can be infinitesimally improved; 3) EI focuses on further improving single tasks at which the agent may already be competent, thus refraining from practicing more difficult ones, or ones that are more important according to the task distribution; and finally 4) because EISP uses a model of expected performance to infer the generalization capabilities of a skill, it correctly identifies the task regions in which practice leads to better generalization across a wider range of tasks. Furthermore, the use of an expected skill performance model allows for the identification of tasks that are either too difficult to solve or whose performance cannot yet be further improved, thus leading the agent to focus on problems that are compatible with its current level of competence.

7. Related Work

In Section 5 we introduced a spatiotemporal composite kernel to model non-stationary skill performance functions. Other methods have been proposed to allow GP regression of non-stationary functions. Rottmann and Burgard (2010) manually assigned higher noise levels to older samples, causing them to contribute less to the predicted mean. This approach, however, relies on domain-dependent cost functions, which are difficult to design. Spatiotemporal covariance functions, defined similarly to our composite kernel k_C, have been proposed to solve recursive least-squares problems on non-stationary domains. These functions, constructed by the product of a standard kernel and an Ornstein-Uhlenbeck temporal kernel, often require estimating a forgetting factor (Van Vaerenbergh et al., 2012).

Previous research has also addressed the problem of selecting training tasks to efficiently acquire a skill. Hausknecht and Stone (2011) constructed a skill by solving a large number of tasks uniformly drawn from the task space. They exhaustively varied policy parameters and identified which tasks were solved, thus implicitly acquiring the skill by sampling all possible tasks. Kober, Wilhelm, Oztop, and Peters (2012) proposed a Cost-regularized Kernel Regression method for learning a skill but did not address how to select training tasks; in their experiments, tasks were sampled uniformly at random from the task space. Similarly, da Silva, Konidaris, and Barto (2012) proposed to acquire a skill by analyzing the structure of the underlying policy manifold, but assumed that tasks were selected uniformly at random. Finally, Baranes and Oudeyer (2013) proposed an active learning framework for acquiring parameterized skills based on competence progress. Their approach is similar to ours in that promising tasks are identified via an adaptive mechanism based on expected performance. However, their approach, unlike ours, requires a discrete number of tasks and can only optimize the task-selection problem over finite and discrete subsets of the task space.

8. Conclusions and Future Work

We have presented a general framework for actively learning parameterized skills. Our method uses a novel acquisition criterion capable of identifying tasks that maximize expected skill performance improvement. We have derived the analytical expressions necessary for optimizing it and proposed a new spatiotemporal kernel especially tailored for non-stationary performance models. Our method is agnostic to policy and skill representation and can be coupled with any of the recently-proposed parameterized skill learning algorithms (Kober et al., 2012; da Silva et al., 2012; Neumann et al., 2013; Deisenroth et al., 2014).

This work can be extended in several important directions. The composite kernel k_C can be used to compute a posterior over expected future skill performance, which suggests an extension of EISP to the case of multistep decisions. This can be done by evaluating the predicted posterior mean (Equation 3) over a set of test points {τ_i, T + ∆}, where T is the current time and ∆ is a positive time increment. This is useful in domains where a one-step look-ahead strategy, like the one optimized in Equation 7, is too myopic. Secondly, our model uses a homoscedastic GP prior, which assumes constant observation noise throughout the input domain. This may be limiting if the agent has sensors with variable accuracy depending on the task—for instance, it may be unable to accurately identify the position of distant targets. Heteroscedastic GP models (Kuindersma et al., 2012) may be used to address this limitation. Finally, taking advantage of human demonstrations may help bias EISP towards tasks which an expert deems relevant, which suggests an integration with active learning from demonstration techniques (Silver et al., 2012).

A. Appendix

Analytical solutions to Equations 7 and 8 depend on the choice of kernel. If we assume that P is a uniform distribution over tasks and define k_S as the squared exponential kernel k_S(\tau_1, \tau_2) = \sigma_f^2 \exp(-L^{-1}(\tau_1 - \tau_2)^2), then:

\nabla_\tau k_C(\tau^t, \tau_i^{t_i}) = -\frac{2}{L} (\tau - \tau_i)\, k_C(\tau^t, \tau_i^{t_i})

g_t(\tau_i^{t_i}) = \frac{1}{2} \left( \sigma_f^2 \sqrt{\pi L} \right) k_T(t+1, t_i) \times \left[ \mathrm{erf}\!\left( \frac{\tau_i - A}{\sqrt{L}} \right) - \mathrm{erf}\!\left( \frac{\tau_i - B}{\sqrt{L}} \right) \right]

\nabla_\tau g_t(\tau_i^{t_i}) = \sigma_f^2\, k_T(t+1, t_i) \times \left[ \exp\!\left( \frac{-(A - \tau_i)^2}{L} \right) - \exp\!\left( \frac{-(B - \tau_i)^2}{L} \right) \right],

where erf(z) is the Gauss error function.
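For completeness, the closed-form expressions above can be implemented as follows, assuming the same uniform P over [A, B], the squared exponential k_S, and the temporal kernel k_T of Equation 17; the hyperparameter defaults are placeholders.

```python
import numpy as np
from scipy.special import erf

def _k_T(t1, t2, C=2.0, rho=10.0):
    """Temporal kernel k_T (Eq. 17)."""
    return 1.0 + (C - 1.0) * np.exp(-((t1 - t2) ** 2) / rho)

def g_t(tau_i, t_i, t, A, B, sigma_f2=1.0, L=1.0):
    """Closed-form g_t(tau_i^{t_i}) from Appendix A (uniform P over [A, B])."""
    kt = _k_T(t + 1, t_i)
    return 0.5 * sigma_f2 * np.sqrt(np.pi * L) * kt * (
        erf((tau_i - A) / np.sqrt(L)) - erf((tau_i - B) / np.sqrt(L)))

def grad_g_t(tau_i, t_i, t, A, B, sigma_f2=1.0, L=1.0):
    """Closed-form gradient of g_t with respect to the task parameter, from Appendix A."""
    kt = _k_T(t + 1, t_i)
    return sigma_f2 * kt * (
        np.exp(-((A - tau_i) ** 2) / L) - np.exp(-((B - tau_i) ** 2) / L))
```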


B. Supplementary Material

B.1. The Catapult Domain

We simulate launches in the Catapult Domain using standard ballistic equations. We assumed Earth's gravity. Let us assume that the projectile is launched with velocity v and angle θ, and that C_x and C_y are, respectively, the horizontal and vertical coordinates of the catapult. We can decompose the velocity vector of the projectile in its horizontal and vertical components: v_x = v cos(θ) and v_y = v sin(θ). The projectile follows a parabolic path y(t) = a x(t)² + b x(t) + c, where x(t) and y(t) are the horizontal and vertical coordinates of the projectile at time t, respectively, and a, b and c are parabola parameters:

a = \frac{-9.8}{2 v^2 \cos(\theta)^2}

b = \frac{v_y}{v_x} + \frac{9.8\, C_x}{v_x^2}

c = C_y - \frac{v_y C_x}{v_x} - \frac{1}{2}\, \frac{9.8\, C_x^2}{v_x^2}

The point of impact of the projectile is given by the point where its parabolic path intersects the equation describing the terrain. If using a piecewise-linear approximation of the terrain surface, the impact point can be found by identifying the coordinates where the parabolic path of the projectile first intersects one of the line segments describing the terrain.
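The following sketch makes the simulation concrete: it computes the parabola coefficients above and scans a piecewise-linear terrain for the first segment intersected by the projectile's path. The terrain representation and the assumption that the projectile travels in the +x direction are our own illustrative choices, not specified in the paper.

```python
import numpy as np

G = 9.8  # Earth's gravity

def parabola_coeffs(v, theta, Cx, Cy):
    """Coefficients a, b, c of the projectile path y = a x^2 + b x + c."""
    vx, vy = v * np.cos(theta), v * np.sin(theta)
    a = -G / (2.0 * v ** 2 * np.cos(theta) ** 2)
    b = vy / vx + G * Cx / vx ** 2
    c = Cy - vy * Cx / vx - 0.5 * G * Cx ** 2 / vx ** 2
    return a, b, c

def impact_point(v, theta, Cx, Cy, terrain):
    """First intersection of the parabolic path with a piecewise-linear terrain.

    terrain -- list of points [(x0, y0), (x1, y1), ...] with increasing x
    """
    a, b, c = parabola_coeffs(v, theta, Cx, Cy)
    for (x0, y0), (x1, y1) in zip(terrain[:-1], terrain[1:]):
        m = (y1 - y0) / (x1 - x0)                       # slope of this terrain segment
        # Solve a x^2 + b x + c = y0 + m (x - x0) within [x0, x1]
        A_, B_, C_ = a, b - m, c - y0 + m * x0
        disc = B_ ** 2 - 4 * A_ * C_
        if disc < 0:
            continue
        roots = sorted((-B_ + s * np.sqrt(disc)) / (2 * A_) for s in (-1.0, 1.0))
        for x in roots:
            if x0 <= x <= x1 and x > Cx:                # first hit past the catapult, assuming +x flight
                return x, a * x ** 2 + b * x + c
    return None  # projectile never hits the represented terrain
```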

B.2. Algorithms and Parameters

Our optimistic performance upper bound j(τ) is related to Upper Confidence Bound (UCB) policies (Kroemer et al., 2010). The latter, however, use UCB bounds to identify single tasks with highest expected performance; we use them as base values over which skill improvement on a task distribution is computed.

As mentioned in Section 6, we learned the parameterized skill via Gaussian Process (GP) regression. This is similar to the approach taken by Kober, Wilhelm, Oztop, and Peters (2012) but with no cost regularization. The parameterized skill used a squared exponential kernel, as described in Appendix A. Its parameters were found by evidence maximization. The expected skill performance model used our spatiotemporal kernel (Equation 18).
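As one possible realization of this skill-learning step, the sketch below fits the mapping from target distance to launch parameters with scikit-learn's GaussianProcessRegressor and an RBF (squared exponential) kernel, whose hyperparameters are optimized by maximizing the marginal likelihood; the paper does not prescribe a particular library, so this is only an illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_parameterized_skill(taus, thetas):
    """Fit Theta: T -> R^M by GP regression from practiced tasks to learned policy parameters.

    taus   -- array of shape (N, 1): task parameters (target distances)
    thetas -- array of shape (N, 2): policy parameters (launch angle, velocity)
    """
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)   # squared exponential kernel
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(np.asarray(taus), np.asarray(thetas))           # hyperparameters by evidence maximization
    return gp

# usage: theta = fit_parameterized_skill(taus, thetas).predict([[target_distance]])[0]
```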


References

Baranes, A. and Oudeyer, P. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):69–73, 2013.

da Silva, B.C., Konidaris, G.D., and Barto, A. Learning parameterized skills. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pp. 1679–1686, June 2012.

Deisenroth, M.P., Englert, P., Peters, J., and Fox, D. Multi-task policy search for robotics. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation, 2014.

Ericsson, K. The influence of experience and deliberate practice on the development of superior expert performance, chapter 13, pp. 685–708. Cambridge University Press, 2006.

Hausknecht, M. and Stone, P. Learning powerful kicks on the Aibo ERS-7: The quest for a striker. In RoboCup 2010: Robot Soccer World Cup XIV, volume 6556 of Lecture Notes in Artificial Intelligence, pp. 254–265. Springer Verlag, 2011.

Kober, J., Wilhelm, A., Oztop, E., and Peters, J. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots, pp. 1–19, 2012.

Kroemer, O., Detry, R., Piater, J., and Peters, J. Combining active learning and reactive control for robot grasping. Robotics and Autonomous Systems, 58(9):1105–1116, 2010.

Kuindersma, S., Grupen, R., and Barto, A. Variational Bayesian optimization for runtime risk-sensitive control. In Robotics: Science and Systems VIII (RSS), Sydney, Australia, July 2012.

Neumann, G., Daniel, C., Kupcsik, A., Deisenroth, M., and Peters, J. Information-theoretic motor skill learning. In Proceedings of the AAAI Workshop on Intelligent Robotic Systems, 2013.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Rottmann, A. and Burgard, W. Learning non-stationary system dynamics online using Gaussian processes. In Proceedings of the Thirty-Second Symposium of the German Association for Pattern Recognition, pp. 192–201, 2010.

Silver, D., Bagnell, J., and Stentz, A. Active learning from demonstration for robust autonomous navigation. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, May 2012.

Snoek, J., Larochelle, H., and Adams, R. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pp. 2951–2959, 2012.

Sutton, R., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

Van Vaerenbergh, S., Santamaría, I., and Lázaro-Gredilla, M. Estimation of the forgetting factor in kernel recursive least squares. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, September 2012.