Off-Policy Temporal-Difference Learning with Function Approximation
Doina Precup (McGill University), Rich Sutton and Sanjoy Dasgupta (AT&T Labs)


Page 1

Off-Policy Temporal-Difference Learning with Function Approximation

Doina Precup, McGill University
Rich Sutton and Sanjoy Dasgupta, AT&T Labs

Page 2

Off-policy Learning

Learning about a way of behaving without behaving (exactly) that way

The target policy must be covered by the source (behavior) policy:

$$\pi(s,a) > 0 \;\Rightarrow\; \pi'(s,a) > 0$$

E.g., Q-learning learns about the greedy policy while following something more exploratory

Learning about many macro-action policies at once

We need off-policy learning!
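As a concrete illustration of the Q-learning example above, here is a minimal tabular sketch (mine, not from the slides): the agent behaves with an exploratory ε-greedy policy, yet the bootstrapped target uses the greedy max, so the values learned are those of the greedy policy. Names like `q`, `alpha`, and `rng` are illustrative.

```python
import numpy as np

def epsilon_greedy(q, s, epsilon, rng):
    """Exploratory behavior policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(q.shape[1]))
    return int(np.argmax(q[s]))

def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy target: bootstrap from the greedy (max) action, regardless of how we behaved."""
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])
```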

Page 3

RL Algorithm Space

[Diagram: three overlapping desiderata, TD learning, linear function approximation (FA), and off-policy learning (as in Q-learning and options). Linear TD(λ), combining TD and linear FA on-policy, is stable (Tsitsiklis & Van Roy 1997; Tadic 2000); adding off-policy learning leads to divergence, "Boom!" (Baird 1995; Gordon 1995; NDP 1996).]

We need all 3, but we can only get 2 at a time.

Page 4

Baird’s Counterexample

[Figure: Baird's counterexample. Six states; the approximate value of state i (i = 1, ..., 5) is θ0 + 2θi, and that of the sixth state is 2θ0 + θ6, with transition probabilities ε and 1 − ε. The accompanying plot shows the parameter values θk(i) over 5000 iterations on a log scale (broken at ±1), growing without bound.]

Markov chain (no actions)

All states updated equally often, synchronously

Exact solution exists: θ = 0

Initial θ0 = (1, 1, 1, 1, 1, 10, 1)^T

Page 5

Importance Sampling

Re-weighting samples according to their “importance,” correcting for a difference in sampling distribution

For example, any episode

$$e = s_0\, a_0\, r_1\, s_1\, a_1\, r_2\, \cdots\, s_{T-1}\, a_{T-1}\, r_T\, s_T$$

has probability

$$\Pr(e \mid \pi) = p_0(s_0, a_0)\, p(s_0, s_1, a_0) \prod_{k=1}^{T-1} \pi(s_k, a_k)\, p(s_k, s_{k+1}, a_k)$$

under π, so its importance is

$$\frac{\Pr(e \mid \pi)}{\Pr(e \mid \pi')} = \prod_{k=1}^{T-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)}$$

This corrects for oversampling under π′.
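A minimal sketch (mine, not the paper's) of this episode-level importance weight, assuming the policies are given as arrays of action probabilities `pi[s, a]` and `pi_prime[s, a]` and the episode is a list of (state, action) pairs:

```python
def episode_importance_weight(episode_sa, pi, pi_prime):
    """Full-episode importance weight: product of pi/pi' ratios for k = 1 .. T-1.

    episode_sa: list of (s_k, a_k) pairs for k = 0 .. T-1; the first pair
    (s_0, a_0) is conditioned on, so its ratio is not included.
    """
    weight = 1.0
    for s, a in episode_sa[1:]:
        weight *= pi[s, a] / pi_prime[s, a]
    return weight
```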

Page 6

Naïve Importance Sampling Algorithm

Each per-step update is the regular linear TD(λ) update scaled by the full-episode importance weight:

$$\text{Update}_t \;=\; \left( \prod_{k=1}^{T-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)} \right) \times \big(\text{regular linear TD}(\lambda)\ \text{update}_t\big)$$

This converts the off-policy problem into an on-policy one, so the on-policy convergence theorem applies (Tsitsiklis & Van Roy, 1997; Tadic, 2000).

But the variance is high and convergence is very slow. We can do better!
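Before moving on, here is a rough sketch of the naïve scheme just described for the λ = 0 case (illustrative code, not the paper's pseudocode): accumulate the ordinary linear TD(0) updates over the episode, then scale by the full-episode importance weight. `phi(s, a)` is an assumed feature function, and the value at a terminal state is taken to be zero.

```python
import numpy as np

def naive_is_td0_episode(theta, episode, phi, pi, pi_prime, alpha, gamma):
    """episode: list of (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}); s_{t+1} is None at termination."""
    total_update = np.zeros_like(theta)
    weight = 1.0  # full-episode product, prod_{k=1}^{T-1} pi/pi'
    for t, (s, a, r, s1, a1) in enumerate(episode):
        if t >= 1:
            weight *= pi[s, a] / pi_prime[s, a]
        next_v = 0.0 if s1 is None else theta @ phi(s1, a1)
        delta = r + gamma * next_v - theta @ phi(s, a)
        total_update += alpha * delta * phi(s, a)
    # One high-variance correction applied to the whole episode's update:
    return theta + weight * total_update
```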

Page 7

Linear Function Approximation

Approximate the action-value function

$$Q^\pi(s,a) \;=\; E_\pi\left\{\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T \;\middle|\; s_t = s,\, a_t = a \,\right\}$$

as a linear form

$$\vec\theta^{\,T} \vec\phi_{s,a} \;=\; \sum_i \theta(i)\, \phi_{s,a}(i),$$

where $\vec\phi_{s,a}$ is a feature vector representing the pair $(s,a)$ and $\vec\theta$ is the modifiable parameter vector.
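A minimal sketch of this linear form, using a one-hot (tabular) encoding purely for illustration; any fixed feature vector φ_{s,a} would do:

```python
import numpy as np

def one_hot_features(s, a, n_states, n_actions):
    """phi_{s,a}: a feature vector with a single 1 marking the (s, a) pair."""
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def q_value(theta, phi_sa):
    """Q(s, a) approximated as the inner product theta^T phi_{s,a}."""
    return float(theta @ phi_sa)
```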

Page 8

Updating after each episode:

$$\theta \;\leftarrow\; \theta + \sum_{t=0}^{T-1} \Delta\theta_t$$

Linear TD(λ), λ = 0 case:

$$\Delta\theta_t = \alpha\left[\, r_{t+1} + \gamma\,\theta^T\phi_{s_{t+1}a_{t+1}} - \theta^T\phi_{s_t a_t} \,\right] \phi_{s_t a_t}$$

Per-Decision Importance-Sampled TD(λ), the new algorithm, λ = 0 case:

$$\Delta\theta_t = \alpha\left[\, r_{t+1} + \gamma\,\theta^T\phi_{s_{t+1}a_{t+1}} - \theta^T\phi_{s_t a_t} \,\right] \phi_{s_t a_t} \prod_{k=1}^{t} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)}$$

(see paper for general λ)
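A sketch of the new update for the λ = 0 case shown above (my code, not the paper's): the only change from ordinary linear TD(0) is that the update at time t is multiplied by the importance-sampling product up to time t rather than by the whole-episode product. Assumptions as in the earlier sketches (`phi` feature function, terminal value zero, per-episode batching).

```python
import numpy as np

def per_decision_is_td0_episode(theta, episode, phi, pi, pi_prime, alpha, gamma):
    """episode: list of (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}); s_{t+1} is None at termination."""
    total_update = np.zeros_like(theta)
    rho_prod = 1.0  # prod_{k=1}^{t} pi(s_k, a_k) / pi'(s_k, a_k)
    for t, (s, a, r, s1, a1) in enumerate(episode):
        if t >= 1:
            rho_prod *= pi[s, a] / pi_prime[s, a]
        next_v = 0.0 if s1 is None else theta @ phi(s1, a1)
        delta = r + gamma * next_v - theta @ phi(s, a)
        total_update += alpha * delta * phi(s, a) * rho_prod  # per-decision weighting
    return theta + total_update  # parameters changed only at the end of the episode
```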

Page 9

Main Result

$$E_\pi\{\Delta\theta^{TD} \mid s_0, a_0\} \;=\; E_{\pi'}\{\Delta\theta \mid s_0, a_0\} \qquad \forall\, s_0, a_0$$

where $\Delta\theta$ is the total change over an episode for the new algorithm (experience generated by π′) and $\Delta\theta^{TD}$ is the total change for conventional TD(λ) (experience generated by π).

Convergence Theorem (based on Tsitsiklis & Van Roy 1997): under the usual assumptions, and one annoying assumption, the new algorithm converges to the same θ as on-policy TD(λ).

The annoying assumption:

$$\mathrm{var}_{\pi'}\!\left[\, \prod_{k=1}^{T-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)} \,\right] \;<\; \infty$$

e.g., guaranteed by a bounded episode length.

Page 10

The variance assumption is restrictive

But it can often be satisfied with "artificial" terminations:

• Consider a modified MDP with bounded episode length
  – We have data for this MDP
  – Our result assures good convergence for it
  – Its solution can be made close to the solution of the original problem by choosing the episode bound long relative to γ or the mixing time

• Consider the application to macro-actions
  – Here it is the macro-action that terminates
  – Termination is artificial; the real process is unaffected
  – Yet all results apply directly to learning about macro-actions
  – We can choose macro-action terminations to satisfy the variance condition

Page 11

Empirical Illustration

The agent always starts at S. Terminal states are marked G. Actions are deterministic.

The behavior policy chooses up/down with probabilities 0.4/0.1.

The target policy chooses up/down with probabilities 0.1/0.4.

If the algorithm is successful, it should give positive weight to the rightmost feature and negative weight to the leftmost one.

Page 12

Trajectories of Two Components of θ

λ = 0.9, α decreased over time

θ appears to converge as advertised

[Plot: the leftmost,down and rightmost,down components of θ versus episodes (×100,000, from 0 to 5), shown together with their asymptotic values (marked *); component values range from about -0.4 to 0.5.]

Page 13

Comparison of Naïve and Per-Decision IS Algorithms

[Plot: root mean squared error (about 1 to 2.5) of the Naive IS and Per-Decision IS algorithms as a function of log2 α (constant step sizes, -12 to -17), with λ = 0.9, measured after 100,000 episodes and averaged over 50 runs.]

Precup, Sutton & Dasgupta, 2001

Page 14

Can Weighted IS help the variance?

Return to the tabular case, consider two estimators:

$$Q_n^{IS}(s,a) \;=\; \frac{1}{n} \sum_{i=1}^{n} R_i\, w_i$$

where $R_i$ is the $i$th return observed following an occurrence of $(s,a)$ at time $t$, and $w_i$ is the corresponding IS correction product $\prod_{k=t+1}^{T-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)}$. This estimator converges with finite variance iff the $w_i$ have finite variance.

$$Q_n^{ISW}(s,a) \;=\; \frac{\sum_{i=1}^{n} R_i\, w_i}{\sum_{i=1}^{n} w_i}$$

This weighted estimator converges with finite variance even if the $w_i$ have infinite variance.

Can this be extended to the FA case?
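A quick sketch of the two estimators (illustrative; `returns` holds the R_i observed after visits to (s, a) and `weights` the corresponding IS correction products w_i):

```python
import numpy as np

def ordinary_is_estimate(returns, weights):
    """Q_n^{IS}: plain average of importance-corrected returns."""
    returns, weights = np.asarray(returns), np.asarray(weights)
    return float(np.mean(returns * weights))

def weighted_is_estimate(returns, weights):
    """Q_n^{ISW}: normalizes by the sum of the weights."""
    returns, weights = np.asarray(returns), np.asarray(weights)
    return float(np.sum(returns * weights) / np.sum(weights))
```

The weighted estimator is biased for any finite n but consistent, the usual price paid for its much better variance behavior.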

Page 15

Restarting within an Episode

• We can consider episodes to start at any time

• This alters the weighting of states
  – But we still converge
  – And to near the best answer (for the new weighting)

Page 16

Incremental Implementation

At the start of each episode:

$$c_0 = g_0, \qquad \vec e_0 = c_0\, \vec\phi_0$$

On each step $s_t\, a_t \to r_{t+1}\, s_{t+1}\, a_{t+1}$, for $0 \le t < T$:

$$\rho_{t+1} = \frac{\pi(s_{t+1}, a_{t+1})}{\pi'(s_{t+1}, a_{t+1})}$$

$$\delta_t = r_{t+1} + \gamma\,\rho_{t+1}\,\theta^T\vec\phi_{t+1} - \theta^T\vec\phi_t$$

$$\Delta\theta_t = \alpha\,\delta_t\,\vec e_t$$

$$c_{t+1} = \rho_{t+1}\, c_t + g_{t+1}$$

$$\vec e_{t+1} = \gamma\lambda\,\rho_{t+1}\,\vec e_t + c_{t+1}\,\vec\phi_{t+1}$$
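A direct transcription of these updates into Python, as a sketch only (my reading of the slide): `phi` is an assumed feature function, `g` a given sequence of (re)start weights g_t (one per time step), terminal values are taken to be zero, and parameter updates are applied immediately at each step.

```python
import numpy as np

def incremental_episode(theta, episode, phi, pi, pi_prime, g, alpha, gamma, lam):
    """episode: list of (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}); s_{t+1} is None at termination."""
    s0, a0 = episode[0][0], episode[0][1]
    c = g[0]                     # c_0 = g_0
    e = c * phi(s0, a0)          # e_0 = c_0 * phi_0
    for t, (s, a, r, s1, a1) in enumerate(episode):
        if s1 is None:
            rho, next_q = 1.0, 0.0                  # terminal: no bootstrap term
        else:
            rho = pi[s1, a1] / pi_prime[s1, a1]     # rho_{t+1}
            next_q = theta @ phi(s1, a1)            # theta^T phi_{t+1}
        delta = r + gamma * rho * next_q - theta @ phi(s, a)   # delta_t
        theta = theta + alpha * delta * e                      # apply Delta theta_t
        if s1 is not None:
            c = rho * c + g[t + 1]                             # c_{t+1}
            e = gamma * lam * rho * e + c * phi(s1, a1)        # e_{t+1}
    return theta
```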

Page 17

Conclusion

• First off-policy TD methods with linear FA
  – Certainly not the last
  – Somewhat greater efficiencies are undoubtedly possible

• But the problem is so important. Can't we do better?
  – Is there no other approach?
  – Something other than importance sampling?

• I can’t think of a credible alternative approach

• Perhaps experimentation in a nontrivial domain would suggest other possibilities...