
Reinforcement Learning

Donglin Zeng, Department of Biostatistics, University of North Carolina

Introduction

- Unsupervised learning has no outcome (no feedback).
- Supervised learning has an outcome, so we know what to predict.
- Reinforcement learning is in between: it has no explicit supervision, so it uses a reward system to learn the feature-outcome relationship.
- The crucial advantage of reinforcement learning is its non-greedy nature: we do not aim to improve performance in the short term but to optimize a long-term achievement.


RL terminology

- Reinforcement learning is a dynamic process in which, at each step, a new decision rule or policy is updated based on new data and the reward system.
- Terminology used in reinforcement learning:
  - Agent: whoever applies the learned decisions during the process (e.g., a robot in AI).
  - Action ($A$): a decision to be taken during the process.
  - State ($S$): environment variables that may interact with the action.
  - Reward ($R$): a value system to evaluate the action given the state.
- Note that $(A, S, R)$ are time-step dependent, so we write $(A_t, S_t, R_t)$ to reflect time step $t$.


Reinforcement learning diagram


Maze example


Maze example (continued)


Maze example (continued)


Mountain car problem


RL Notation

- At time step $t$, the agent observes a state $S_t$ from a state space $\mathcal{S}_t$ and selects an action $A_t$ from an action space $\mathcal{A}_t$.
- Both the action and the state result in a transition to a new state $S_{t+1}$.
- Given $(A_t, S_t, S_{t+1})$, the agent receives an immediate reward
  \[ R_t = r_t(S_t, A_t, S_{t+1}) \in \mathbb{R}, \]
  where $r_t(\cdot, \cdot, \cdot)$ is called the immediate reward function.


RL mathematical formulation

- At time $t$, we assume a transition probability function from $(S_t = s, A_t = a)$ to $(S_{t+1} = s')$: $p_t(s'|s, a) \ge 0$, $\int_{s'} p_t(s'|s, a)\, ds' = 1$.
- We also assume $A_t$ given $S_t$ follows a probability distribution: $\pi_t(a|s) \ge 0$, $\int_a \pi_t(a|s)\, da = 1$.
- A trajectory (training sample) $(s_1, a_1, s_2, \ldots, s_T, a_T, s_{T+1})$ is generated as follows (a simulation sketch is given after this list):
  - start from an initial state $s_1$ drawn from a probability distribution $p(s)$;
  - for $t = 1, 2, \ldots, T$ ($T$ is the total number of steps),
    (a) $a_t$ is chosen from $\pi_t(\cdot|s_t)$;
    (b) the next state $s_{t+1}$ is drawn from $p_t(\cdot|s_t, a_t)$.
- The problem is called finite horizon if $T < \infty$ and infinite horizon if $T = \infty$.
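
To make this generation scheme concrete, here is a minimal Python sketch (an illustrative addition, not from the original slides); `sample_initial`, `sample_policy`, and `sample_transition` are hypothetical user-supplied samplers for $p(s)$, $\pi_t(\cdot|s)$, and $p_t(\cdot|s, a)$.

```python
import numpy as np

def rollout(sample_initial, sample_policy, sample_transition, T, rng=None):
    """Generate one trajectory (s_1, a_1, s_2, ..., s_T, a_T, s_{T+1}).

    sample_initial(rng)             -> s_1 drawn from p(s)
    sample_policy(t, s, rng)        -> a_t drawn from pi_t(. | s)
    sample_transition(t, s, a, rng) -> s_{t+1} drawn from p_t(. | s, a)
    """
    if rng is None:
        rng = np.random.default_rng()
    states, actions = [sample_initial(rng)], []
    for t in range(1, T + 1):
        a = sample_policy(t, states[-1], rng)               # step (a): a_t ~ pi_t(.|s_t)
        s_next = sample_transition(t, states[-1], a, rng)   # step (b): s_{t+1} ~ p_t(.|s_t, a_t)
        actions.append(a)
        states.append(s_next)
    return states, actions   # len(states) == T + 1, len(actions) == T

# Example: a toy 2-state, 2-action MDP with a uniform random policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] = p(s' | s, a)
              [[0.6, 0.4], [0.1, 0.9]]])
states, actions = rollout(
    sample_initial=lambda rng: rng.integers(2),
    sample_policy=lambda t, s, rng: rng.integers(2),
    sample_transition=lambda t, s, a, rng: rng.choice(2, p=P[s, a]),
    T=5)
```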


Goal of RL

- Define the return at time $t$ as
  \[ \sum_{j=t}^{T} \gamma^{j-t} r(S_j, A_j, S_{j+1}), \]
  where $\gamma \in [0, 1)$ is called the discount factor (it down-weights rewards far along the trajectory); a small computational sketch follows below.
- An action policy, $\pi = (\pi_1, \ldots, \pi_T)$, is a sequence of probability distribution functions, where $\pi_t$ is a probability distribution for $A_t$ given $S_t$.
- The goal of RL is to learn the optimal action decision, i.e., the policy $\pi^* = (\pi^*_1, \pi^*_2, \ldots, \pi^*_T)$ that maximizes the expected return
  \[ E_\pi\Big[\sum_{j=1}^{T} \gamma^{j-1} r(S_j, A_j, S_{j+1})\Big], \]
  where $E_\pi(\cdot)$ means $A_t|S_t \sim \pi_t(\cdot|S_t)$.
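
As a quick computational illustration (a sketch added here, not part of the original slides), the discounted return at every time $t$ can be computed from a reward sequence by the backward recursion $G_t = r_t + \gamma G_{t+1}$:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum_{j>=t} gamma^(j-t) * r_j for each t, via G_t = r_t + gamma * G_{t+1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```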


Optimal policy

- RL aims to find the best action decision rules such that the average long-term reward is maximized when such rules are implemented.
- Note: $\pi^*$ is a function of states, and for any individual we only know what the action should be at time $t$ after observing the states at time $t$. This is related to so-called adaptive or dynamic decision making.


How is supervised learning framed in the RL context?

- We can imagine $S_t$ to be all data (both features and outcomes) collected by step $t$.
- Then $A_t$ is the prediction rule chosen from a class of prediction functions based on $S_t$ (it need not be a perfect prediction function; it can even be a random prediction), so $\pi_t$ is the probabilistic selection of a prediction function at step $t$.
- Based on $(S_t, A_t)$, the next state $S_{t+1}$ can be $S_t$ with additionally collected data, or $S_t$ with individual errors, or just $S_t$.
- $R_t$ is the prediction error evaluated on the data.
- The goal is to learn the best prediction rule; RL methods can help!


Two important concepts in RL

- State-action value function (SAV): the expected return increment at time $t$ given state $S_t = s$ and action $A_t = a$:
  \[ Q^\pi_t(s, a) = E_\pi\Big[\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1}) \,\Big|\, S_t = s, A_t = a\Big]. \]
  $Q^*_t(s, a) \equiv Q^{\pi^*}_t(s, a)$ is the optimal expected return at time $t$.
- State value function (SV): the expected return increment at time $t$ given state $S_t = s$:
  \[ V^\pi_t(s) = E_\pi\Big[\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1}) \,\Big|\, S_t = s\Big]. \]
  Similarly, $V^*_t(s) = V^{\pi^*}_t(s)$.
- Clearly, $V^\pi_t(s) = \int_a Q^\pi_t(s, a)\, \pi_t(a|s)\, da$ (illustrated below for a discrete action space).
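
For a discrete action space the integral becomes a sum over actions; a tiny sketch (assumed array layout: rows index states, columns index actions):

```python
import numpy as np

# Q[s, a]: state-action values; pi[s, a]: policy probabilities pi(a | s).
Q = np.array([[1.0, 3.0],
              [0.5, 0.5]])
pi = np.array([[0.25, 0.75],
               [0.50, 0.50]])

V = np.sum(pi * Q, axis=1)   # V(s) = sum_a pi(a|s) Q(s, a)
print(V)                     # [2.5, 0.5]
```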


RL methods

- Reinforcement learning methods mostly fall into two groups:
  - (policy iteration) model-based or learning methods to approximate the SAV function;
  - (policy search) model-based or learning methods to directly maximize the SV function for estimating $\pi^*$.


Policy iteration for value function approximation

- The Bellman equation for the SV function:
  \[ V^\pi_t(s) = E_\pi\Big[r_t(s, A_t, S_{t+1}) + \gamma V^\pi_{t+1}(S_{t+1}) \,\Big|\, S_t = s\Big]
     = \int_{s'}\int_a \big[r_t(s, a, s') + \gamma V^\pi_{t+1}(s')\big]\, \pi_t(a|s)\, p_t(s'|s, a)\, da\, ds'. \]
- The Bellman equation for the SAV function:
  \[ Q^\pi_t(s, a) = E_\pi\Big[r_t(s, a, S_{t+1}) + \gamma Q^\pi_{t+1}(S_{t+1}, A_{t+1}) \,\Big|\, S_t = s, A_t = a\Big]
     = \int_{s'}\int_{a'} \big[r_t(s, a, s') + \gamma Q^\pi_{t+1}(s', a')\big]\, \pi_{t+1}(a'|s')\, p_t(s'|s, a)\, da'\, ds'. \]


Value function learning for finite horizon

- For finite $T$, the Bellman equations suggest a backward procedure to evaluate the value function associated with a particular policy (sketched in code below):
  - start from time $T$: we can learn $Q^\pi_T(s, a) = E[R_T \mid S_T = s, A_T = a]$;
  - at time $T - 1$, we learn $Q^\pi_{T-1}(s, a)$ as
    \[ E\big[R_{T-1} + \gamma Q^\pi_T(S_T, A_T) \,\big|\, S_{T-1} = s, A_{T-1} = a\big]; \]
  - ...
  - we perform the learning backwards until time 1.
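
A minimal tabular sketch of this backward evaluation (an illustrative addition, assuming a known discrete MDP with time-homogeneous transition array `P[s, a, s']`, reward array `r[s, a, s']`, and a fixed policy `pi[s, a]`; with real data each conditional expectation would instead be estimated by regression):

```python
import numpy as np

def evaluate_policy_finite(P, r, pi, gamma, T):
    """Backward recursion for Q^pi_t(s, a), t = T, T-1, ..., 1 (stored as Q[t-1])."""
    nS, nA, _ = P.shape
    Q = np.zeros((T, nS, nA))
    for t in range(T - 1, -1, -1):                      # t = T-1 corresponds to time T
        Q[t] = np.einsum('sap,sap->sa', P, r)           # E[r(s, a, S') | s, a]
        if t < T - 1:
            V_next = np.sum(pi * Q[t + 1], axis=1)      # V^pi_{t+1}(s') = sum_a' pi(a'|s') Q_{t+1}(s', a')
            Q[t] += gamma * P @ V_next                  # + gamma * E[V^pi_{t+1}(S') | s, a]
    return Q
```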


Optimal policy learning for finite horizon (Q-learning)

- Start from time $T$: we can learn $Q^\pi_T(s, a) = E[R_T \mid S_T = s, A_T = a]$. We set $\pi^*_T(s)$ to place probability 1 at $a^* = \arg\max_a Q^\pi_T(s, a)$.
- At time $T - 1$, we learn $Q^{\pi^*}_{T-1}(s, a)$ as
  \[ E\big[R_{T-1} + \gamma \max_{a'} Q^{\pi^*}_T(S_T, a') \,\big|\, S_{T-1} = s, A_{T-1} = a\big]. \]
  We obtain $\pi^*_{T-1}$ as the policy placing probability 1 at $a^* = \arg\max_a Q^{\pi^*}_{T-1}(s, a)$.
- We perform the same learning procedure backwards until time 1 to learn all the optimal policies (see the sketch below).
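
The same backward induction with the max over actions, again assuming a known tabular model (a sketch; in practice the conditional expectations are learned from data as the slide describes):

```python
import numpy as np

def backward_q_learning(P, r, gamma, T):
    """Backward induction: Q*_t and the greedy (optimal) action for t = T, ..., 1."""
    nS, nA, _ = P.shape
    Q = np.zeros((T, nS, nA))
    pi_star = np.zeros((T, nS), dtype=int)
    for t in range(T - 1, -1, -1):
        Q[t] = np.einsum('sap,sap->sa', P, r)       # E[R_t | s, a]
        if t < T - 1:
            V_next = Q[t + 1].max(axis=1)           # max_{a'} Q*_{t+1}(s', a')
            Q[t] += gamma * P @ V_next
        pi_star[t] = Q[t].argmax(axis=1)            # pi*_t(s): probability 1 at argmax_a Q*_t(s, a)
    return Q, pi_star
```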


Value function learning for infinite horizon

- When $T = \infty$ or $T$ is large, the backward Q-learning method may not be applicable.
- The salvage is to take advantage of process stability when $t$ is large, so we can assume the following Markov decision process (MDP):
  - the state and action spaces are constant over time;
  - the transition probability $p_t(s'|s, a)$ is independent of $t$;
  - the reward function $r_t(s, a, s')$ is independent of $t$.
- The MDP assumption is plausible for a long horizon and after a certain number of steps.


Bellman equations under MDP ($T = \infty$)

- Under the MDP assumption, $Q^\pi_t(s, a) = Q^\pi(s, a)$ and $V^\pi_t(s) = V^\pi(s)$.
- The Bellman equations become
  \[ V^\pi(s) = E_\pi\big[r(s, A_t, S_{t+1}) + \gamma V^\pi(S_{t+1}) \,\big|\, S_t = s\big], \]
  \[ Q^\pi(s, a) = E_\pi\big[r(s, a, S_{t+1}) + \gamma Q^\pi(S_{t+1}, A_{t+1}) \,\big|\, S_t = s, A_t = a\big]. \]
- Derived equations for the optimal policy (a numerical sketch follows):
  \[ V^{\pi^*}(s) = \max_a Q^{\pi^*}(s, a), \]
  \[ Q^{\pi^*}(s, a) = E_{\pi^*}\big[r(s, a, S_{t+1}) + \gamma V^{\pi^*}(S_{t+1}) \,\big|\, S_t = s, A_t = a\big], \]
  \[ \pi^*(s) \sim I\big\{a = \arg\max_a Q^{\pi^*}(s, a)\big\}. \]
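
For a tabular MDP these fixed-point equations can be solved by value iteration, i.e., repeatedly applying the optimality backup until convergence; the sketch below is an illustrative addition, not a method given on the slides.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Iterate Q(s,a) <- E[r(s,a,S') + gamma * max_{a'} Q(S',a')] to a fixed point."""
    nS, nA, _ = P.shape
    expected_r = np.einsum('sap,sap->sa', P, r)     # E[r(s, a, S') | s, a]
    Q = np.zeros((nS, nA))
    for _ in range(max_iter):
        Q_new = expected_r + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            Q = Q_new
            break
        Q = Q_new
    return Q, Q.argmax(axis=1)   # Q* and the deterministic optimal policy pi*(s)
```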


Policy iteration procedure

- Start from a policy $\pi$.
- Policy evaluation: evaluate $Q^\pi(s, a)$ and thus $V^\pi(s)$.
- Policy improvement: update $\pi(a|s)$ to be $I(a = a^\pi(s))$, where $a^\pi(s)$ is the action maximizing $Q^\pi(s, a)$.
- Iterate between the policy evaluation step and the policy improvement step (see the sketch below).
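
A tabular sketch of this iteration (an illustrative addition; the evaluation step here solves the linear Bellman equation exactly, whereas later slides estimate $Q^\pi$ from data):

```python
import numpy as np

def policy_iteration(P, r, gamma, n_iter=100):
    """Alternate exact policy evaluation and greedy policy improvement (tabular MDP)."""
    nS, nA, _ = P.shape
    expected_r = np.einsum('sap,sap->sa', P, r)
    policy = np.zeros(nS, dtype=int)                  # start from an arbitrary deterministic policy
    for _ in range(n_iter):
        # Policy evaluation: V^pi solves V = r_pi + gamma * P_pi V
        P_pi = P[np.arange(nS), policy]               # (nS, nS) transition matrix under pi
        r_pi = expected_r[np.arange(nS), policy]
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
        # Policy improvement: greedy w.r.t. Q^pi(s, a) = E[r] + gamma * E[V^pi(S')]
        Q = expected_r + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, Q
```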


Soft policy iteration procedure

- Selecting a deterministic policy update may be too greedy if the initial policy is far from the optimal one.
- Softer policy updates include the following (sketched in code below):
  - (softmax policy improvement) $\pi(a|s) \propto \exp\{Q^\pi(s, a)/\tau\}$;
  - ($\varepsilon$-greedy policy improvement) $\pi(a|s)$ places probability $1 - \varepsilon + \varepsilon/m$ at $a = a^\pi(s)$ and probability $\varepsilon/m$ at each other action, where $m$ is the number of possible actions.
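
A small sketch of the two soft updates for a discrete action space (assumed array layout `Q[s, a]`):

```python
import numpy as np

def softmax_policy(Q, tau=1.0):
    """pi(a|s) proportional to exp{Q(s,a)/tau}; tau -> 0 recovers the greedy policy."""
    z = Q / tau
    z -= z.max(axis=1, keepdims=True)          # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def epsilon_greedy_policy(Q, eps=0.1):
    """Probability 1 - eps + eps/m on the greedy action, eps/m on each other action."""
    nS, m = Q.shape
    pi = np.full((nS, m), eps / m)
    pi[np.arange(nS), Q.argmax(axis=1)] += 1.0 - eps
    return pi
```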


Estimation of the state-action value function

- One challenge in the policy iteration is how to estimate $Q^\pi(s, a)$.
- This requires statistical modelling or learning algorithms.
- Parametric/semiparametric models for $Q^\pi(s, a)$ are commonly used.


Least-squares policy iteration

- We assume
  \[ Q^\pi(s, a) = \sum_{b=1}^{B} \theta_b \phi_b(s, a), \]
  where $\phi_b(s, a)$, $b = 1, \ldots, B$, is a sequence of basis functions.
- In other words, the policy is indirectly represented by the $\theta_b$'s.
- From the Bellman equation, we note that $R_t = r(S_t, A_t, S_{t+1})$ satisfies
  \[ E_\pi[R_t \mid S_t, A_t] = Q^\pi(S_t, A_t) - \gamma E_\pi[Q^\pi(S_{t+1}, A_{t+1}) \mid S_t, A_t] \approx \theta^T\psi(S_t, A_t), \]
  so $R_t - \theta^T\psi(S_t, A_t)$ has mean zero given $(S_t, A_t)$ under policy $\pi$, where $\psi(s, a) = \phi(s, a) - \gamma E_\pi[\phi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.


Numerical implementation of least-squares policy iteration

- Suppose we have data from $n$ subjects, each with a training sample of $T$ steps, or $n$ $T$-step training samples from the same agent:
  \[ (S_{i1}, A_{i1}, S_{i2}, \ldots, S_{iT}, A_{iT}, S_{i,T+1}), \quad i = 1, \ldots, n. \]
- We estimate $\psi_b(s, a)$ by
  \[ \hat\psi_b(s, a) = \phi_b(s, a) - \gamma\, \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} I(S_{it} = s, A_{it} = a)\, E_\pi[\phi_b(S_{i,t+1}, A_{i,t+1})]}{\sum_{i=1}^{n}\sum_{t=1}^{T} I(S_{it} = s, A_{it} = a)}. \]
- We perform the least-squares estimation (a code sketch follows)
  \[ \min_\theta \frac{1}{nT} \sum_{i=1}^{n}\sum_{t=1}^{T} I(A_{it}|S_{it} \sim \pi)\big[\theta^T\hat\psi(S_{it}, A_{it}) - R_{it}\big]^2, \]
  where $A_{it}|S_{it} \sim \pi$ means that $A_{it}$ was obtained by following the policy $\pi$.
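
A minimal sketch of this least-squares step for integer-coded states and a discrete action set (an illustrative reading of the slide's formulation: the conditional expectation of the next-step features is computed exactly under the known policy `pi`, and $\theta$ then solves an ordinary least-squares problem):

```python
import numpy as np

def lspi_theta(phi, S, A, R, S_next, pi, gamma):
    """Least-squares estimate of theta in Q^pi(s, a) ~ theta^T phi(s, a).

    phi(s, a) -> 1-D feature vector; S, A, R, S_next: arrays of transitions
    collected under pi; pi[s, a]: policy probabilities over a discrete action set.
    States are assumed integer-coded so they can index pi.
    """
    n_actions = pi.shape[1]
    Phi = np.array([phi(s, a) for s, a in zip(S, A)])
    # E_pi[phi(S_{t+1}, A_{t+1}) | S_t, A_t]: average next-step features over actions under pi.
    Phi_next = np.array([
        sum(pi[s2, a2] * phi(s2, a2) for a2 in range(n_actions)) for s2 in S_next
    ])
    Psi = Phi - gamma * Phi_next                       # psi_hat(S_t, A_t)
    theta, *_ = np.linalg.lstsq(Psi, np.asarray(R), rcond=None)
    return theta
```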


More on numerical implementation

- Regularization may be introduced to obtain a sparser solution.
- The $L_2$-minimization can be replaced by $L_1$-minimization to gain robustness.
- Choice of basis functions: radial basis functions, where the kernel can be the usual Gaussian kernel (one possible definition of $d(s, s')$ is the shortest path from $s$ to $s'$ in the graph defined by the transition probabilities).


Robot-Arm control example


Robot-Arm control example (continued)


Robot-Arm control example (continued)


Off-policy estimation

- In the previous derivation, we essentially estimate
  \[ E_\pi\Big[\sum_{t=1}^{T} \big(\theta^T\psi(S_t, A_t) - R_t\big)^2\Big] \]
  using a history sample $(S_t, A_t)$ obtained by following the target policy $\pi$.
- This is called on-policy reinforcement learning.
- However, not every policy has been seen in the history sample.
- An alternative method is to use importance sampling (a weight-computation sketch follows):
  \[ E_\pi\Big[\sum_{t=1}^{T} \big(\theta^T\psi(S_t, A_t) - R_t\big)^2\Big] = E_{\tilde\pi}\Big[\sum_{t=1}^{T} \big(\theta^T\psi(S_t, A_t) - R_t\big)^2 w_t\Big], \]
  where $\tilde\pi$ denotes the behavior policy that generated the history sample and
  \[ w_t = \frac{\prod_{j=1}^{t} \pi(A_j|S_j)}{\prod_{j=1}^{t} \tilde\pi(A_j|S_j)}. \]
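
A sketch of the cumulative importance weights and the weighted objective (an illustrative addition; `pi_target` and `pi_behavior` are hypothetical functions returning the action probabilities under the target policy $\pi$ and the behavior policy $\tilde\pi$):

```python
import numpy as np

def importance_weights(S, A, pi_target, pi_behavior):
    """w_t = prod_{j<=t} pi(A_j|S_j) / prod_{j<=t} pi_behavior(A_j|S_j)."""
    ratios = np.array([pi_target(a, s) / pi_behavior(a, s) for s, a in zip(S, A)])
    return np.cumprod(ratios)

def weighted_bellman_loss(Psi, R, theta, weights):
    """Importance-weighted least-squares objective sum_t w_t (theta^T psi_t - R_t)^2."""
    residuals = Psi @ theta - R
    return np.sum(weights * residuals**2)
```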


Off-policy iteration: more

- We need one assumption: the behavior policy $\tilde\pi$ that generated the history sample satisfies
  \[ \tilde\pi(a|s) > 0, \quad \forall (a, s). \]
- Adaptive importance weighting replaces $w_t$ by $w_t^\nu$ and chooses $\nu$ via cross-validation.
- When the history sample contains multiple behavior policies, we can obtain the estimate from importance weighting with respect to each policy and aggregate the estimates (sample-reuse policy iteration).


Mountain car example

- Action space: force applied to the car, $a \in \{0.2, -0.2, 0\}$.
- State space: $(x, \dot{x})$, where $x$ is the horizontal position ($\in [-1.2, 0.5]$) and $\dot{x}$ is the velocity ($\in [-1.5, 1.5]$).
- Transition (transcribed into code below):
  \[ x_{t+1} = x_t + \dot{x}_{t+1}\,\delta t, \qquad \dot{x}_{t+1} = \dot{x}_t + \big(-9.8\, w \cos(3 x_t) + a_t/w - k\,\dot{x}_t\big)\,\delta t, \]
  where $w$ is the mass (0.2 kg), $k$ is the friction coefficient (0.3), and $\delta t$ is 0.1 second.
- Reward:
  \[ r(s, a, s') = \begin{cases} 1 & x_{s'} \ge 0.5, \\ -0.01 & \text{otherwise}. \end{cases} \]
- Policy iteration uses Gaussian kernels with centers at $\{-1.2, 0.35, 0.5\} \times \{-1.5, -0.5, 0.5, 1.5\}$ and $\sigma = 1$.
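
A direct transcription of these dynamics into Python (a sketch; clipping the position and velocity to their stated ranges is an assumption, since the slide does not say how boundaries are handled):

```python
import numpy as np

W, K, DT = 0.2, 0.3, 0.1          # mass (kg), friction coefficient, time step (s)

def mountain_car_step(x, xdot, a):
    """One transition of the mountain car: returns (x_{t+1}, xdot_{t+1}, reward)."""
    xdot_new = xdot + (-9.8 * W * np.cos(3.0 * x) + a / W - K * xdot) * DT
    xdot_new = np.clip(xdot_new, -1.5, 1.5)          # assumed boundary handling
    x_new = np.clip(x + xdot_new * DT, -1.2, 0.5)    # note: uses the updated velocity
    reward = 1.0 if x_new >= 0.5 else -0.01
    return x_new, xdot_new, reward

# e.g. one step from the valley with full positive force
print(mountain_car_step(-0.5, 0.0, 0.2))
```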


Experiment results


Experiment results


Direct Policy Search

- The direct policy search approach aims at finding the policy that maximizes the expected return.
- Suppose we model the policy as $\pi(a|s; \theta)$.
- The expected return under $\pi$ is given by
  \[ J(\theta) = \int p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t; \theta) \times \Big\{\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1})\Big\}\, ds_1\, da_1 \cdots da_T\, ds_{T+1}. \]
- We optimize $J(\theta)$ to find the optimal $\theta$.
- A gradient approach can be adopted for the optimization (a sketch follows below).
- EM-based policy search can also be used for the optimization.
- Importance sampling can be used for evaluating $J(\theta)$.
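
As one concrete instance of the gradient approach (a REINFORCE-style sketch added for illustration, not necessarily the method intended on the slide): by the log-derivative trick, $\nabla_\theta J(\theta) = E_{\pi_\theta}\big[\big(\sum_t \nabla_\theta \log \pi(a_t|s_t; \theta)\big) \cdot \text{return}\big]$, which can be approximated from sampled trajectories.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, episodes, gamma):
    """Monte Carlo estimate of grad J(theta) for a tabular softmax policy
    pi(a|s; theta) = softmax(theta[s, :]).

    episodes: list of trajectories, each a list of (s, a, r) triples with integer s, a.
    """
    grad = np.zeros_like(theta)
    for traj in episodes:
        G = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))   # discounted return
        score = np.zeros_like(theta)
        for s, a, _ in traj:
            p = softmax(theta[s])
            score[s] -= p                 # d/dtheta[s, :] log pi(a|s) = e_a - pi(.|s)
            score[s, a] += 1.0
        grad += G * score
    return grad / len(episodes)

# gradient-ascent update: theta += learning_rate * reinforce_gradient(theta, episodes, gamma)
```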


Alternative methods

- Modelling the transition probability functions.
- Active policy iteration (active learning):
  - update the sampling policy actively.

