
Reinforcement learning with guided policy search using Gaussian processes

Hunor S. Jakab
Department of Computer Science
Babeș–Bolyai University
RO-400084 Cluj-Napoca, Romania
Email: [email protected]

Lehel Csató
Department of Computer Science
Babeș–Bolyai University
RO-400084 Cluj-Napoca, Romania
Email: [email protected]

Abstract—Gradient-based policy search algorithms benefit greatly from the availability of a properly estimated state or state-action value function, which can be used to reduce the variance of the gradient estimates. Additionally, the use of Gaussian processes for value function approximation provides a fully probabilistic model where – using the uncertainty in the estimated value function – we can assess the amount of exploration required. In this article we present two modalities for adjusting different characteristics of the exploration in on-line learning of control policies for problems with continuous state-action spaces. The proposed methods exploit the fully probabilistic nature of the Gaussian process and aim to constrain the exploration to relevant subspaces, thereby speeding up convergence. We present experiments on a simulated control task to demonstrate the validity of our algorithms.

I. INTRODUCTION

Reinforcement learning plays a central role in applications where a high degree of autonomy is desired, one such application being the problem of optimal robotic motion control. Motion control is accomplished by creating a control policy which defines a state-dependent action selection mechanism, where states are represented by measurable properties (for example joint angles) of the robotic system and actions are the commands that can be sent to its actuators. The task of finding optimal control policies for the achievement of certain goals can be formulated as a learning problem, where the robot has to learn only from its experiences by interacting with the environment. Applying RL algorithms to robotic control problems proves to be a challenging task, mainly due to the continuous nature of the state-action spaces and the limited number of performable experiments. Classical algorithms [21] rely on the representation of the expected utility – also called value – associated with states or state-action pairs. The exact representation of values requires discretization, limiting the range of applicability of the resulting algorithms and reducing performance. The necessity to deal with continuous states and actions led to the use of function approximation in the family of value-based methods [1], [4]. When the learning system presents a significant amount of uncertainty and the state-action spaces are continuous and high dimensional, approximated value functions cannot represent the true value function corresponding to a policy exactly. The combined effect of insufficient representational power and the non-locality of changes induced by parameter updates leads to convergence problems when the action-selection policy is built upon the estimated state-action values [3]. Gradient-based policy search algorithms are more suitable for real-life control problems. In policy gradient (PG) methods [23] a parameterized control policy is improved at each step of the learning algorithm. The direction of improvement is given by the gradient of a performance function with respect to the policy parameters. The performance function is usually defined so as to measure the long-term optimality of the policy. Convergence is guaranteed at least to a local optimum, PG methods are computationally simple, and the incorporation of domain-specific knowledge is easily achieved through the parametric form of the policy. The introduction of exploratory behavior, however, is difficult and plays an important role in both the performance and the applicability of these algorithms.

In this article we investigate the benefits of a fully probabilistic estimation of the action-value function Q(·, ·) through Gaussian process regression from the perspective of efficient exploration in policy gradient algorithms. Our method is part of the actor-critic framework. We focus on the on-line learning of control policies in continuous domains where the system dynamics and the reward function are unknown and the environment presents a high degree of stochasticity. We use a state-action value function approximated with a Gaussian process (GP) and develop two methods to improve exploration, based on the variance and the geometric properties of the approximated value function. Our methods allow the introduction of guided exploration based on current optimality beliefs and change the state distributions induced by the policy to cover specific regions of the state-action space. The method can be viewed as a transition between on-policy and off-policy learning.

The paper is structured as follows: Section II gives a brief introduction to RL and policy gradient algorithms. In Section III we present the possibilities of approximating Q-functions with Gaussian processes. To be able to use the fully probabilistic GP model for exploration at each time step of the learning, we need to avoid re-estimating the action-value function from scratch between gradient update steps.


In Section IV we briefly revise our method from [12], which enables us to gradually exchange old experiences with newly acquired ones; this method also enhances sample efficiency. In Section V we propose two different ways to influence the search directions in PG algorithms and study the changes in gradient estimation induced by the modifications. The proposed exploration scheme bridges the gap between value-based greedy action selection and stochastic exploration in PG algorithms. Section VI illustrates the efficiency of the proposed methods on a simulated control task and provides a performance analysis, followed by conclusions in Section VII.

II. NOTATION AND BACKGROUND

A mathematical representation of the reinforcement learning problem is given by Markov decision processes (MDP) [18]. Equivalent to a stochastic automaton, an MDP is a quadruple M = (S, A, P, R) with the following elements: S, the set of states; A, the set of actions; P(s′|s, a) : S × S × A → [0, 1], the transition probabilities; and R(s, a) : S × A → R, the reward function. Informally, the MDP describes the environment in which an agent can act and the interactions between the agent and the environment. The decision making mechanism of the learning agent can be modeled with the help of an action selection policy. We define a policy πθ : S × A → [0, 1] as a conditional probability distribution πθ(a|s) of taking action a when in state s. The advantage of using stochastic policies is that they allow non-deterministic action selection and thereby the possibility of exploratory behavior. In control problems stochastic policies are constructed by perturbing the output of a controller function cθc : S → A with parameters θc. The controller function provides a direct mapping from states to actions. Perturbing the output of cθc can be accomplished by varying the parameters θc or by adding exploratory noise to the output. In robotic control the latter method is preferred, since even small changes in the parameters of a controller can produce unexpected and unsafe behavior. Frequently a Gaussian distribution is used for the noise, with variance σε:¹

$$\pi_\theta(a|s) = c(s, \theta_c) + \mathcal{N}(0, \sigma_\varepsilon I) \qquad (1)$$

where θ = [θcᵀ σεᵀ]ᵀ is the parameter vector of the policy, composed of the controller parameters θc and the parameters σε of the exploratory noise distribution. To simplify the notation, from now on we drop the explicit θ from the policy, but policy changes are made via the parameter set θ.

When applying reinforcement learning in the context of robotics, we distinguish two major types of problems: (1) motor control and (2) motor planning.² We refer to the learning problem as a "motor control problem" when the mapping between states and actions is directly computed by the parametrized deterministic controller cθc.

¹Our treatment applies equally to multi-dimensional actions; in that case σ would be a vector containing the parameters of the covariance matrix Σ.

²For an in-depth comparison see [20].

The parameters θc influence the generated actions, therefore the size of the effective search space for optimal policies increases exponentially with the complexity of the controllable system. In the case of "motor planning problems", on the other hand, the controller parameters change the shape and duration of a motion trajectory. For example, when learning gait sequences for legged robots, the periodic trajectory of the end effector is optimized. In these cases, through the policy parameters θ we influence the desired joint configurations of the robot during a movement sequence (which can be time or phase dependent). Such policies can be formulated using spline-based trajectory models or dynamic motor primitives [11]. This formulation is preferred when an inverse dynamics/kinematics model is available or can be accurately approximated. In this work we focus on problems from the first category; however, the developed methodology can easily be applied to dynamic motor primitives as well.

The goal of RL problems is to solve the MDP, and the solution is defined as an optimal policy π∗ maximizing the expected cumulative reward:

$$\pi^* = \operatorname*{arg\,max}_{\pi \in \Omega_\theta} J^\pi, \qquad J^\pi = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \right] \qquad (2)$$

Here Ωθ denotes the set of all possible policies determined by the parametrization, Eπ[·] is the expectation with respect to a policy π, rt = R(st, at) is the immediate reward, and γ is a discount factor.

We focus on gradient-based policy search algorithms that optimize eq. (2). One of the earliest PG algorithms was Williams' REINFORCE [23]; other algorithms have been built on the same principles, such as vanilla policy gradients [17], natural policy gradients [13] and a wide variety of their extensions that provide performance enhancements of some form. The gradient of J – according to the policy gradient theorem [22] – with respect to θ is:

$$\nabla_\theta J(\theta) = \int \mathrm{d}s\, p(s) \int \mathrm{d}a\; \pi(a|s)\, \nabla_\theta \log \pi(a|s)\, Q(s, a) \qquad (3)$$

where ∫ds p(s) is the weighted average operator with probability distribution p(s). The importance of the above formulation is that in eq. (3) the state transition probabilities p(s′|s, a) are not present, making it possible to approximate the integrals with sample averages without knowing the dynamics of the system. The difficulty is that the action-value function Q(st, at) is not known; however, it can be replaced by Monte Carlo estimates of the true value function. These simplifications are the core of Williams' REINFORCE algorithm [23], where the integral representation from eq. (3) is replaced with:

$$\nabla_\theta J = \mathbb{E}_\tau\!\left[ \sum_{t=0}^{H-1} \nabla_\theta \log \pi(a_t|s_t) \sum_{i=0}^{H-t} \gamma^i R(s_{t+i}, a_{t+i}) \right] \qquad (4)$$

Here Eτ[·] denotes the sample average over roll-outs – i.e. different experiments with the same policy – and the summations stand for empirical averages. Although episodic REINFORCE is one of the most basic policy gradient algorithms, it is a good candidate for evaluating the efficiency of exploration schemes.

The convergence of the algorithm can be guaranteed at least to a local maximum, but the high variance of the estimated gradient in eq. (4) leads to very slow convergence. A possible improvement is to approximate the action-value function Q(s, a) and use it directly in eq. (3). For value function approximation we use Gaussian processes, presented in the next section.
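For reference, a minimal Monte Carlo estimator of the gradient in eq. (4) could look as follows; this is a hedged sketch, and the trajectory format and the grad_log_pi callback are assumptions, not the authors' implementation.

```python
import numpy as np

def reinforce_gradient(rollouts, grad_log_pi, gamma=0.99):
    """Episodic REINFORCE gradient estimate, eq. (4).

    rollouts    : list of trajectories, each a list of (state, action, reward) tuples
    grad_log_pi : callable (state, action) -> d log pi(a|s) / d theta (numpy array)
    """
    grad = None
    for trajectory in rollouts:
        rewards = np.array([r for (_, _, r) in trajectory], dtype=float)
        H = len(trajectory)
        for t, (s, a, _) in enumerate(trajectory):
            # discounted return accumulated from time t to the end of the episode
            ret = np.sum(gamma ** np.arange(H - t) * rewards[t:])
            g = grad_log_pi(s, a) * ret
            grad = g if grad is None else grad + g
    return grad / len(rollouts)   # sample average over roll-outs
```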

III. GAUSSIAN PROCESS VALUE FUNCTION APPROXIMATION

To approximate action-value functions we use as training data the state-action pairs xt := (st, at) encountered during trajectories and the corresponding – possibly discounted – cumulative rewards Ret(st, at) = Σ_{i=0}^{H−t} γ^i R(st+i, at+i) as noisy targets.³ We assume that n state-action pairs have already been visited, therefore we have a GP built on the data set D = {(xi, Ret_i)}_{i=1,n}, also called the basis vector set. To estimate the action-value of a new state-action pair x∗ := (s∗, a∗), we combine the prior distribution over functions, induced by the specification of our covariance function, with the information from the training data set. In the case of a Gaussian noise model the posterior distribution over Q-values will also be Gaussian:

$$Q^* \,|\, \mathcal{D} \sim \mathcal{N}\big(\mu^*, \operatorname{cov}(Q^*)\big)$$

The predictive mean (5) and variance (6) are given by the following expressions [19]:

$$\mu^* = k^* \alpha_n \qquad (5)$$
$$\operatorname{cov}(Q^*) = k_q(x^*, x^*) - k^* C_n k^{*T}, \qquad (6)$$

where αn and Cn are the parameters of the posterior GP:

$$\alpha_{n+1} = [K^n_q + \Sigma_n]^{-1}\, \mathbf{Ret}, \qquad C_{n+1} = [K^n_q + \Sigma_n]^{-1}, \qquad (7)$$

with Σn the covariance of the observation noise and k∗ the vector containing the covariances between the new point and the training points:

$$k^* = [k_q(x_1, x^*), \ldots, k_q(x_n, x^*)]. \qquad (8)$$

The regression is performed directly in function space, and the resulting Q(·, ·) is the approximation of the action-value function. The elements of the kernel matrix K^n_q are given by K^n_q(i, j) = kq(xi, xj), and we use the notation kq to emphasize that the kernel function operates on state-action pairs.
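A hedged sketch of the prediction step of eqs. (5)-(6) is given below; the squared-exponential kernel on concatenated state-action vectors and the function names are assumptions for illustration, since the paper does not fix a particular covariance function.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance on concatenated (state, action) vectors."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return signal_var * np.exp(-0.5 * np.dot(d, d) / lengthscale ** 2)

def gp_q_prediction(x_star, basis_vectors, alpha, C, kernel=rbf_kernel):
    """Posterior mean, eq. (5), and variance, eq. (6), of Q at x* = (s*, a*)."""
    k_star = np.array([kernel(x_star, x_i) for x_i in basis_vectors])
    mean = k_star @ alpha                                # eq. (5)
    var = kernel(x_star, x_star) - k_star @ C @ k_star   # eq. (6)
    return mean, var
```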

The parameters α and C of the Gaussian process can be updated iteratively each time a new point (xn+1, Retn+1) is processed; this is achieved by combining the likelihood of the new data point with the Gaussian process from the previous step, making use of the parameterization of the posterior moments from [6].

³We assume that the targets have Gaussian noise with equal variance; however, one can easily use different known noise variances within the same framework.

Replacing the Monte Carlo return Σ_{i=0}^{H−t} γ^i R(st+i, at+i) from eq. (4) with the posterior mean from eq. (5),⁴ we get the GP version of the policy gradient algorithm:

$$\nabla_\theta J(\theta) = \mathbb{E}_\tau\!\left[ \sum_{t=0}^{H-1} \nabla_\theta \log \pi(a_t|s_t)\, Q_{GP}(s_t, a_t) \right] \qquad (9)$$

Using the predictive mean of the estimated action-value function instead of Monte Carlo samples significantly reduces the variance of the gradient estimates and improves the convergence rate. The probabilistic nature of the Gaussian process provides new possibilities to improve performance by influencing exploratory action selection, presented in Section V.
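Assuming the two sketches above, the GP variant of the gradient in eq. (9) simply replaces the Monte Carlo return by the GP posterior mean; again, this is only an illustrative sketch rather than the authors' code.

```python
def reinforce_gradient_gp(rollouts, grad_log_pi, q_mean):
    """Policy gradient with a GP critic, eq. (9).

    q_mean : callable (state, action) -> GP posterior mean of Q(s, a),
             e.g. the first output of gp_q_prediction above.
    """
    grad = None
    for trajectory in rollouts:
        for (s, a, _) in trajectory:
            g = grad_log_pi(s, a) * q_mean(s, a)   # GP mean replaces the sampled return
            grad = g if grad is None else grad + g
    return grad / len(rollouts)
```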

A. Related work

The use of Gaussian processes for value function approximation has been investigated in a number of references. In [19] the authors modeled the value function and the system dynamics using GPs, and proposed a policy iteration algorithm. The expected value for a given state was calculated by integrating through the GP posterior, which was analytically tractable for certain covariance functions. The support set was manually chosen, the analytical form of the reward function was considered known, and the algorithm operated in batch mode. [9] applied GPs in a policy gradient framework, using a GP to approximate the gradient of the expected return function with the help of Bayesian quadratures. This extension allowed a fully Bayesian view of the gradient estimation. The algorithm was applied to the bandit problem; however, the dimensionality of the GP outputs was the same as the number of policy parameters, which can largely increase computational complexity in the case of complex policies with a large number of parameters. [8] modeled both the value function and the action-value function with GPs, and proposed a dynamic programming solution using these approximations. The algorithm is called Gaussian process dynamic programming and it relies on evaluating the integrals from the Bellman equations by modeling the system dynamics with Gaussian processes.

In our methods we choose to approximate the state-action value function directly by a Gaussian process and evaluate the gradient estimates based on sample averages combined with the estimated value function. This enables us to obtain gradient estimates without explicit knowledge of the reward function. Estimating the system dynamics is made harder by the fact that in a robotic setting we cannot make arbitrary state transitions to provide training data for a dynamics GP. A dynamics GP has |S| + |A|-dimensional inputs (state-action pairs) and |S| outputs (the elements of the next state). Estimating a Q-function, on the other hand, requires a GP with |S| + |A| inputs and a single output, which is considerably less difficult.

IV. SAMPLE REUSE AND CONTINUITY

In this section we address the problem of restarting the action-value function approximation after a policy change occurs, by briefly presenting a method introduced in [12].

⁴For simplicity we denote the GP predictive mean for a state-action pair x∗ = (s, a) by Q∗_GP(s, a) = µ∗, where the prediction is based on the previously visited data points.


After a gradient update step we would like to build upon our previously estimated value function model while simultaneously incorporating new measurements which provide useful information. To achieve this we make use of a modified version of the Kullback-Leibler distance-based sparsification mechanism from [6]. The sparsification scheme in our case serves two purposes: (1) it decreases computational costs by discarding unimportant inputs; (2) it provides a way to exchange obsolete measurements for newly acquired ones. To decide upon the addition of a new input to the basis vector set of the GP, we test for approximate linear independence in feature space. The projection error in feature space of input n+1 onto the space of existing basis vectors can be expressed as:

$$\xi_{n+1} = k_q(x_{n+1}, x_{n+1}) - k_{n+1} e_{n+1} \qquad (10)$$
$$e_{n+1} = G_n k_{n+1}^T \qquad (11)$$

where en+1 is the vector of projection coordinates minimizing the projection error, Gn is the kernel Gram matrix and ξn+1 is the residual. By setting a threshold value for the residual we can decide which inputs are going to be added to the basis vector set.
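A sketch of this novelty test is shown below. We read eq. (11) as computing the projection coordinates by solving a linear system with the kernel Gram matrix of the current basis vectors (i.e. Gn acting as its inverse); the threshold value is an assumption.

```python
import numpy as np

def novelty_test(x_new, basis_vectors, kernel, threshold=1e-3):
    """Approximate linear independence test in feature space, eqs. (10)-(11)."""
    if len(basis_vectors) == 0:
        return True, kernel(x_new, x_new)
    K = np.array([[kernel(xi, xj) for xj in basis_vectors] for xi in basis_vectors])
    k_new = np.array([kernel(x_new, xi) for xi in basis_vectors])
    e_new = np.linalg.solve(K, k_new)                # projection coordinates, eq. (11)
    xi_res = kernel(x_new, x_new) - k_new @ e_new    # projection residual, eq. (10)
    return xi_res > threshold, xi_res
```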

Additionally, we assign a time variable to every included data point in D, which signifies at which stage of the learning process the data point has been added to the basis vector set:

$$\mathcal{D} = \{(x_i, Q_i)\} \;\rightarrow\; \{(x_i, Q_i, t_i)\}, \quad i = 1, \ldots, n \qquad (12)$$

We also limit the size of the basis vector set. Whenever a new data point needs to be included but the maximum number of basis vectors has been reached, we compute a modified score function ε for each data point:

$$\varepsilon(i) = \frac{\alpha^2(i)}{q(i) + c(i)} + \lambda\, g(t(i)) \qquad (13)$$

The first term in eq. (13) is the Kullback-Leibler distance between two GPs, KL(GP′||GP), where GP′ contains the new data point and GP is obtained by replacing the data point with its projection onto the space spanned by the existing basis vectors.⁵ The second term g(·) penalizes basis vectors that have been introduced in early stages of the learning process; it is a function of the time variable assigned to each basis vector. Since we want to favor the removal of out-of-date basis vectors, this function needs to be monotonically increasing. In our experiments we used an exponential of the form:

$$g(t_i) = e^{c\,(t_i - \min_i(t_i))}, \qquad i = 1 \ldots |\mathcal{D}| \qquad (14)$$

We also experimented with the logit function, which proved to be more efficient at eliminating old components that had high scores from the first component of eq. (13):

$$g(t_i) = c \log\!\left( \frac{t_i / \max(t_i)}{1 - t_i / \max(t_i)} \right), \qquad i = 1 \ldots |\mathcal{D}| \qquad (15)$$

⁵For details of the derivation of the KL distance see [5].

Fig. 1. Composition of the basis vector set as a function of λ: percentage of new basis functions versus the number of policy improvements, for λ = 0, 10, 100, 400, 10³.

We replace the lowest scoring data point from the BV set with the new measurement. The λ term in eq. (13) serves as a trade-off factor between loss of information and accuracy of representation; c is a constant. Figure 1 shows how strongly the choice of λ influences the composition of the basis vector set during on-line learning.

If we set λ to a large value, the time-dependent factor of the scores will outweigh the KL distance-based factor in eq. (13), leading to the inclusion of all newly acquired measurements into the BV set. Setting it too small will allow too many out-of-date basis vectors to remain in our representation, which leads to inaccurate gradient estimates and a poor policy.
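A hedged sketch of the scoring rule of eqs. (13)-(14) is shown below; the names of the per-basis-vector quantities α, q and c follow the sparse GP update of [6], and the default values of λ and c are assumptions.

```python
import numpy as np

def removal_scores(alpha, q, c_diag, times, lam=100.0, c=0.1):
    """Score of each basis vector, eq. (13), with the exponential penalty of eq. (14).

    alpha, q, c_diag : per-basis-vector quantities maintained by the sparse GP
    times            : learning stage at which each basis vector was added
    """
    alpha, q, c_diag = (np.asarray(v, dtype=float) for v in (alpha, q, c_diag))
    times = np.asarray(times, dtype=float)
    kl_term = alpha ** 2 / (q + c_diag)            # KL-based importance of each point
    recency = np.exp(c * (times - times.min()))    # eq. (14): favours recently added points
    return kl_term + lam * recency

# the lowest scoring basis vector is exchanged for the new measurement:
# i_remove = np.argmin(removal_scores(alpha, q, c_diag, times))
```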

V. GUIDED EXPLORATION

In direct policy search algorithms the use of parameterized functions for policy representation induces a large search space, which becomes impossible to explore fully as the number of parameters increases. As a consequence, the agent has to restrict its exploration to the subset of the search space that is the most "promising". Several variants of policy gradient algorithms have been applied to robotic control for policy optimization [2], [10], [14], [15], [17], where a starting policy was obtained via imitation learning or manual setup and the search procedure was restricted to the immediate neighborhood of the initial policy. The drawback of these methods is that, without the existence of a starting policy, random exploration is inefficient and extremely costly. The availability of a fully probabilistic model for the value function provides an interesting opportunity to introduce directed exploratory behavior in our learning algorithms.

In what follows we explore two modalities to influence the exploration process: modifying the exploratory noise, or modifying the direction of the exploration process.

A. Influencing the exploratory noise

Our first guided exploration method is based on changing the variance of the exploratory noise σε in eq. (1). We exploit the properties of the estimated state-action value function QGP(s, a): since it is a random variable, we have access to its posterior variance, providing information about the uncertainty present in different regions of the state-action space. Our modification is that in regions with high uncertainty the exploratory noise should be higher; this is achieved by replacing the fixed noise with one obtained from the GP model of the state-action value function in (7). The modified policy is:

$$\pi_\theta = c(s, \theta_c) + \mathcal{N}(0, \sigma^2_{GP} I), \qquad \sigma^2_{GP} = \lambda \big( k_q(x^*, x^*) - k^* C_n k^{*T} \big), \quad \text{with } x^* = \big(s, c(s, \theta_c)\big) \qquad (16)$$

Here k∗ = [kq(x∗, x1), . . . , kq(x∗, xn)] is the vector containing the covariances between the new data point x∗ and the data points from D. In the early stages of learning the GP-based approximation is inaccurate, therefore the predictive variance is large everywhere. The large predictive variance facilitates higher exploration rates in the early phase of learning. As learning progresses and we add more data, the predictive variance decreases in the neighbourhood of these points, also decreasing the added exploratory noise.
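A minimal sketch of the variance-guided action selection of eq. (16) is given below; the controller and gp_variance callbacks are assumptions standing in for the components described above, and a scalar action is assumed.

```python
import numpy as np

def guided_noise_action(state, controller, gp_variance, lam=1.0,
                        rng=np.random.default_rng()):
    """Action sampling with GP-guided noise variance, eq. (16).

    controller  : callable s -> c(s, theta_c)
    gp_variance : callable (s, a) -> GP posterior variance of Q at (s, a),
                  e.g. the second output of gp_q_prediction above.
    """
    mean_action = controller(state)
    sigma2 = lam * gp_variance(state, mean_action)   # variance at x* = (s, c(s, theta_c))
    sigma2 = max(sigma2, 1e-12)                      # guard against numerical round-off
    noise = rng.normal(0.0, np.sqrt(sigma2))
    return mean_action + noise
```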

The effect of this exploration scheme is displayed in Fig. 2, where – for illustration – we plotted two surfaces corresponding to the predictive variances of the different exploration strategies.⁶ The control task – presented in detail in Section VI – was inverted pendulum control, where the state space contains the angle and the angular velocity, respectively. We see that in the case of fixed exploratory noise the visited states tend to lie in tighter regions of the state space, whereas with guided exploratory noise a much better coverage of the important regions is provided. This is the result of increased exploration at the beginning of learning.

The guided exploratory noise changes the derivative of the policy, since the noise term also depends on the policy parameters through x∗ from eq. (16):

$$\nabla_\theta \log \pi(a|s) = \frac{a - c_\theta(s)}{\sigma^2_{GP}}\, \nabla_\theta c_\theta(s) + \frac{(a - c_\theta(s))^2 - \sigma^2_{GP}}{\sigma^4_{GP}}\, \nabla_\theta \sigma^2_{GP} \qquad (17)$$

The first part of eq. (17) involves the derivative of the deterministic controller, easily calculated for a variety of controller implementations. The second term involves differentiation through the covariance function of the GP approximator:

$$\nabla_\theta \sigma^2_{GP} = \left[ \frac{\delta}{\delta\theta_1} \sigma^2_{GP} \;\ldots\; \frac{\delta}{\delta\theta_m} \sigma^2_{GP} \right], \qquad m = |\theta| \qquad (18)$$

$$\frac{\delta}{\delta\theta_i} \sigma^2_{GP} = \frac{\delta}{\delta\theta_i} k_q(x^*, x^*) - \frac{\delta}{\delta\theta_i} \sum_{j,l=1}^{N} C_{j,l}\, k_q(x_j, x^*) \cdot k_q(x_l, x^*)$$

We consider the covariance matrix C constant at the time of the differentiation, so it does not need to be differentiated.

⁶Similar graphs were obtained for the mean action-value functions; here we used state-value functions for better visibility.

We then get the following expression:

$$\frac{\delta}{\delta\theta_i} \sigma^2_{GP} = \frac{\delta}{\delta\theta_i} k_q(x^*, x^*) - 2\, C k^* \frac{\delta k^*}{\delta\theta_i} \qquad (19)$$

The derivative of kq(·, ·) with respect to the parameters θi, i = 1, . . . , m can be calculated for several covariance functions.
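A hedged sketch of eq. (17) for a one-dimensional action is shown below; instead of the analytic kernel derivatives of eqs. (18)-(19), the gradient of σ²GP is approximated here by finite differences, which is an implementation choice of ours rather than of the paper.

```python
import numpy as np

def grad_log_policy_guided(state, action, theta, controller, grad_controller,
                           sigma2_gp_of_theta, eps=1e-5):
    """Log-policy derivative with GP-guided noise, following eq. (17) as printed.

    controller         : callable (state, theta) -> scalar action c(s, theta_c)
    grad_controller    : callable (state, theta) -> gradient of c w.r.t. theta
    sigma2_gp_of_theta : callable theta -> sigma^2_GP evaluated at (s, c(s, theta))
    """
    theta = np.asarray(theta, dtype=float)
    c = controller(state, theta)
    sigma2 = sigma2_gp_of_theta(theta)
    # finite-difference stand-in for the analytic kernel derivatives of eqs. (18)-(19)
    grad_sigma2 = np.array([
        (sigma2_gp_of_theta(theta + eps * e) - sigma2) / eps
        for e in np.eye(len(theta))
    ])
    term1 = (action - c) / sigma2 * grad_controller(state, theta)
    term2 = ((action - c) ** 2 - sigma2) / sigma2 ** 2 * grad_sigma2
    return term1 + term2
```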

B. Influencing search directions

Our second proposed modification to improve exploration is to influence not only the variance of the noise but also the controller output. The underlying idea is that the agent should better explore regions of the state-action space which have higher Q-values according to the current estimate of the Q-function. Consider the case when the agent is in state st at time t. The next step of the algorithm is to choose an action according to the action selection policy π(a|s).

We are interested in constructing a policy that favours actions with higher estimated Q-values while still taking into account the output of our deterministic controller. We propose a policy π(a|s) in the form of a Gibbs distribution [16] over actions from the neighbourhood of cθ(s):

$$\pi(a|s) = \frac{e^{\beta E(s,a)}}{Z(\beta)}, \qquad \text{where } Z(\beta) = \int \mathrm{d}a\; e^{\beta E(s,a)} \qquad (20)$$

The term Z(β) is a normalizing constant and β is the inverse temperature. To include the deterministic controller cθ in the action selection, we construct the energy function E(s, a) such that only actions neighbouring cθ(s) have a significant selection probability. At the same time we want to assign higher probability to actions that – in the current state st – have higher estimated Q-values, and so the energy function has the following form:

$$E(s, a) = Q_{GP}(s, a) \cdot \exp\!\left[ -\frac{\| a - c_\theta(s) \|^2}{2\sigma_e^2} \right]$$

It is composed of the GP-estimated Q-value QGP(s, a) for the state-action pair (s, a) and a Gaussian on the action space that limits the selection to the neighbourhood of the controller output cθ(s). The variance parameter σe is fixed, but making it dependent on the GP predictive variance could also be considered.

Fig. 3. The Gibbs action selection policy with the temperature set to 10: (a) estimated Q-values QGP(s, a), energies E(s, a) and selection probabilities π(a|s) over the actions, with the controller output cθ(s) marked; (b) action selection frequencies.

The Gibbs distribution based stochastic action selection policy is illustrated in Fig. 3.

Fig. 2. Predictive variances of the GP-estimated state-value functions. The horizontal axes correspond to the state variables, and the vertical axis displays the predictive variance. The dots are the visited states for (a) fixed exploratory noise, (b) adaptive noise variance.

Here we plotted the composition of the action selection probabilities when the temperature β = 10 and the Gaussian process is trained on the current policy using 10 episodes. The deterministic action returned by the controller – marked by a star – is cθ(s) = 0. We plotted the Q-values (blue line) corresponding to the actions in the neighbourhood of cθ(s). The Q-value landscape is multi-modal; there are two promising actions to be selected in the current state s. The energy values E(s, a) – green line – reflect the combined influence of the Q-values and the deterministic controller. We observe that the action selection probabilities (yellow line) in Fig. 3(a) are concentrated around the action with the highest energy value, and the action selection frequencies in Fig. 3(b) reflect this as well. If the temperature β were low, the action selection probabilities would be smeared out, resulting in a close to uniform action selection distribution. To implement this exploration scheme, we have to compute the log-derivative of the policy from eq. (20), that is:

$$\frac{\delta}{\delta\theta} \log \pi(a|s) = \frac{\delta}{\delta\theta}\big( \beta E(s,a) - \log Z(\beta) \big) \qquad (21)$$
$$= \beta \frac{\delta}{\delta\theta} E(s,a) - \frac{\beta}{Z(\beta)} \int \mathrm{d}a\; e^{\beta E(s,a)}\, \frac{\delta}{\delta\theta} E(s,a)$$
$$= \beta \left( \frac{\delta}{\delta\theta} E(s,a) - \int \mathrm{d}a\; \pi(a|s)\, \frac{\delta}{\delta\theta} E(s,a) \right)$$

Differentiating the energy function is not difficult, since only the Gaussian term depends on the parameters:

$$\frac{\delta}{\delta\theta} E(s, a) = E(s, a)\, \frac{a - c_\theta(s)}{\sigma_e^2}\, \frac{\delta}{\delta\theta} c_\theta(s) \qquad (22)$$

Combining eqs. (22) and (21) and inserting the result into eq. (9), we obtain the expression for the gradients. In practice the integral from eq. (21) cannot be evaluated; instead we sample actions from the neighbourhood of a = cθ(s) from a Gaussian distribution with variance corresponding to the model confidence at (s, a). We calculate the predictive Q-values for these points with the help of our GP action-value function approximator and use a discrete Gibbs distribution in the selection process.
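A hedged sketch of this discrete approximation is given below; the number of candidate actions, the default β and σe, and the restriction to a one-dimensional action are our assumptions.

```python
import numpy as np

def gibbs_action(state, controller, q_mean, beta=10.0, sigma_e=1.0,
                 n_candidates=50, rng=np.random.default_rng()):
    """Discrete approximation of the Gibbs policy of eq. (20) for a scalar action.

    controller : callable s -> c(s, theta_c)
    q_mean     : callable (s, a) -> GP posterior mean of Q(s, a)
    """
    c = controller(state)
    # candidate actions sampled around the deterministic controller output
    candidates = rng.normal(c, sigma_e, size=n_candidates)
    q_values = np.array([q_mean(state, a) for a in candidates])
    # energy: Q-value weighted by a Gaussian centred on the controller output
    energies = q_values * np.exp(-(candidates - c) ** 2 / (2.0 * sigma_e ** 2))
    logits = beta * energies
    probs = np.exp(logits - logits.max())   # subtract max for numerical stability
    probs /= probs.sum()                    # discrete Gibbs distribution
    return rng.choice(candidates, p=probs)
```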

Algorithm 1 REINFORCE with GP guided exploration
1: Initialize policy parameterization π(s, a|θ)
2: Initialize GP parameters α = 0, C = 0, D = ∅, M = const, λ = const, maxBV = const, n = 0
3: repeat
4:   for t = 1, H do
5:     at ∼ π(at|st)   eq. (16), (20)
6:     ξn+1 = kq(xn+1, xn+1) − kn+1 en+1   eq. (10)
7:     if ξn+1 > threshold then
8:       if maximum number of BV reached then
9:         for i = 1, n do
10:          ε(i) = α²(i)/(q(i) + c(i)) + λ g(t(i))   eq. (13)
11:        end for
12:        exchange Di where i = arg min ε(i)
13:      else
14:        update α, C   eq. (7); n = n + 1, D = D ∪ {(st, at)}
15:      end if
16:    else
17:      discard (st, at)
18:    end if
19:  end for
20:  Estimate ∇θJ   eq. (17), (9) or eq. (21), (9)
21:  if ∇θJ converged then
22:    Update policy parameters
23:  end if
24: until policy converged

Fig. 4 illustrates the results of the proposed methodology on the inverted pendulum problem. We plotted the surface corresponding to the estimated state-value function.⁷ We see in sub-figure (a) that, with random exploration, there are regions on the perimeter of the state space with high estimated values that are not explored properly.

⁷The state-action value function cannot be graphically represented, therefore we used the state-value function for illustration.

Fig. 4. GP-estimated state-value functions for the inverted pendulum control task. The two horizontal axes correspond to the state variables, namely the angle and the angular velocity, and the vertical axis to the estimated value. The dots are the visited states and the corresponding noisy value measurements in the case of (a) fixed exploratory noise, (b) guided search directions.

If we use the exploratory mechanism defined above, the high values of these regions facilitate exploration, thereby improving the algorithm's performance. Moreover, we see that, for guided exploration, regions of the state space with high values have a higher concentration of visited points. This is important since small differences in value in high-importance regions of the state space can influence the performance of the learned policy. An algorithmic description of our two improved exploration schemes applied to REINFORCE is given in Algorithm 1.

VI. PERFORMANCE EVALUATION

We tested the above presented methods on a simulated pendulum control problem where both the state and the action spaces are continuous. A state variable consists of the angle and angular velocity of the pendulum, s = [φ ω]ᵀ, and we normalized the angle to the [0, 2π] interval. Actions are the torques that we can apply to the system, and are limited to the [−5, 5] interval. The Hamiltonian of the pendulum is:

$$H = \frac{1}{2ml^2}\, p_\phi^2 - mgl \cos(\phi), \qquad p_\phi = ml^2\omega, \quad q_\phi = \phi \qquad (23)$$

The experiments were performed with a quadratic reward function with added Gaussian noise:

$$R(s, a) = (s_1 - \pi)^2 - \left(\frac{s_2}{4}\right)^2 + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_r) \qquad (24)$$

The reward function penalizes the pendulum endpoint's distance from the target region as well as the angular velocity.
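To make the experimental setup concrete, a hedged simulation sketch is given below; the Euler integrator, the mass and length values, and the noise level are assumptions, while the 5 ms time step and the reward form follow the text and eq. (24) as written.

```python
import numpy as np

def pendulum_step(phi, omega, torque, m=1.0, l=1.0, g=9.81, dt=0.005):
    """One Euler integration step of the pendulum dynamics implied by eq. (23)."""
    domega = (torque - m * g * l * np.sin(phi)) / (m * l ** 2)
    omega_new = omega + dt * domega
    phi_new = (phi + dt * omega_new) % (2.0 * np.pi)   # angle normalized to [0, 2*pi]
    return phi_new, omega_new

def reward(phi, omega, sigma_r=0.1, rng=np.random.default_rng()):
    """Noisy quadratic reward, eq. (24) as written in the paper."""
    return (phi - np.pi) ** 2 - (omega / 4.0) ** 2 + rng.normal(0.0, sigma_r)
```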

As a basis for our improvements we used the standard episodic REINFORCE algorithm [23], a basic Monte Carlo policy gradient algorithm. We implemented the two versions of guided exploration discussed in Section V by extending this basic algorithm. The performance data is averaged over 10 separate experiments for each algorithm. The initial values of the learning parameters and the start-state variances were the same in all cases. During one experiment we performed 400 gradient update steps, starting from a predefined policy parameter set. The gradient estimates were obtained by performing 3 episodes, each consisting of 50 steps. In total, during each experiment we executed 6000 steps. To initialize the hyper-parameters of the GP, we sampled 2000 state-action pairs and the corresponding long-term returns before starting the learning process and trained the GP hyper-parameters on this data. It is worth mentioning that, to keep the pendulum simulation stable and achieve good performance, the system needed to be actuated every 5 milliseconds, which corresponds to a frequency of 200 Hz. Under these conditions an episode length of less than 200 would not allow the system enough steps to perform a full swing-up, which would lead to a poor policy. Figure 5(a) shows the performance evaluation of our algorithms. The vertical axis denotes the average reward received during a 50-step episode, while the horizontal axis denotes the number of gradient update steps performed.

We see that the GP-guided versions of the REINFORCE algorithm clearly outperform Williams' basic version in both convergence speed and achieved performance. The learning curves of both our algorithms become much steeper in the early phase of learning, which can be explained by the added flexibility of exploring the more important regions of the state-action space. Figure 5(b) shows the evolution of the policy variances during learning. The GP-guided exploratory noise does not depend directly on the policy parameters, hence it cannot be rapidly decreased by policy parameter updates. In the case of GP-influenced search directions, as long as the controller has not converged to at least a locally optimal point, the Gibbs distribution based policy from eq. (20) will always maintain some degree of exploration.

VII. CONCLUSION

In this article we presented two new modalities for adjusting different characteristics of the exploration in policy gradient algorithms with the help of Gaussian process action-value function approximation. An algorithmic form of our methods is provided in Algorithm 1. We have shown that by using our methods the search for an optimal policy can be restricted to certain regions of the state-action space. This is especially important in the case of continuous state-action spaces, where full exploration is impossible. Our experimental results show that using our guided exploration methods leads to better convergence performance in policy gradient algorithms.

Fig. 5. Evolution of (a) the average return per 50-step episode and (b) the policy variances over the gradient update steps, for episodic REINFORCE (Standard Reinforce), REINFORCE with GP-guided exploratory noise (Guided Variance) and REINFORCE with GP-guided exploration directions (Guided Direction).

The presented methods can also be viewed as a transition between off-policy and on-policy learning, which opens up further interesting research directions.

As future work, we intend to perform a more in-depth theoretical analysis and more detailed testing of the algorithms on high dimensional simulated control problems with continuous state and action spaces. Another interesting approach to reinforcement learning is the probabilistic inference for learning control (PILCO) algorithm introduced by Deisenroth and Rasmussen [7], where both the system dynamics and the control policies are inferred from the recorded data with high sample efficiency. We intend to explore the relation and applicability of the exploration methods presented above to those presented in [7].

ACKNOWLEDGMENT

The authors acknowledge the financial support from grant PN-II-RU-TE-2011-3-0278 of the Romanian Ministry of Education and Research.

REFERENCES

[1] András Antos, Rémi Munos, and Csaba Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, NIPS. MIT Press, 2007.

[2] J. Andrew Bagnell and Jeff Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the International Conference on Robotics and Automation 2001. IEEE, May 2001.

[3] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37. Morgan Kaufmann, 1995.

[4] Steven J. Bradtke, Andrew G. Barto, and Pack Kaelbling. Linear least-squares algorithms for temporal difference learning. In Machine Learning, pages 22–33, 1996.

[5] Lehel Csató. Gaussian Processes – Iterative Sparse Approximation. PhD thesis, Neural Computing Research Group, March 2002.

[6] Lehel Csató and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–669, 2002.

[7] Marc P. Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, June 2011.

[8] Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters. Gaussian process dynamic programming. Neurocomputing, 72(7-9):1508–1524, 2009.

[9] Mohammad Ghavamzadeh and Yaakov Engel. Bayesian policy gradient algorithms. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 457–464, Cambridge, MA, 2007. MIT Press.

[10] G. S. Hornby, S. Takamura, J. Yokono, O. Hanagata, T. Yamamoto, and M. Fujita. Evolving robust gaits with AIBO. In IEEE International Conference on Robotics and Automation (ICRA 2000), pages 3040–3045, 2000.

[11] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In IEEE International Conference on Robotics and Automation (ICRA 2002), 2002.

[12] Hunor Jakab and Lehel Csató. Improving Gaussian process value function approximation in policy gradient algorithms. In Timo Honkela, Włodzisław Duch, Mark Girolami, and Samuel Kaski, editors, Artificial Neural Networks and Machine Learning – ICANN 2011, volume 6792 of Lecture Notes in Computer Science, pages 221–228. Springer, 2011.

[13] Sham Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 2, pages 1531–1538, Cambridge, MA, 2002. MIT Press.

[14] Min Sub Kim and William Uther. Automatic gait optimisation for quadruped robots. In Australasian Conference on Robotics and Automation, 2003.

[15] Nate Kohl and Peter Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 2619–2624, 2004.

[16] Radford M. Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Press, 2010.

[17] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

[18] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

[19] Carl Edward Rasmussen and Christopher Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[20] S. Schaal, P. Mohajerian, and A. Ijspeert. Dynamics systems vs. optimal control – a unifying view. Progress in Brain Research, 165:425–445, 2007.

[21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[22] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Sara A. Solla, Todd K. Leen, and Klaus-Robert Müller, editors, Advances in Neural Information Processing Systems, pages 1057–1063, 1999.

[23] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.