Reinforcement Learning
Donglin Zeng, Department of Biostatistics, University of North Carolina
Introduction
- Unsupervised learning has no outcome (no feedback).
- Supervised learning has an outcome, so we know what to predict.
- Reinforcement learning is in between: it has no explicit supervision, so it uses a reward system to learn the feature-outcome relationship.
- The crucial advantage of reinforcement learning is its non-greedy nature: we do not need to improve performance in the short term, but rather to optimize a long-term achievement.
RL terminology
- Reinforcement learning is a dynamic process where, at each step, a new decision rule or policy is updated based on new data and the reward system.
- Terminology used in reinforcement learning:
  - Agent: whoever executes the learned decisions during the process (the robot in AI).
  - Action (A): a decision to be taken during the process.
  - State (S): environment variables that may interact with the action.
  - Reward (R): a value system to evaluate an action given a state.
- Note that (A, S, R) are time-step dependent, so we write (A_t, S_t, R_t) to reflect time step t.
Reinforcement learning diagram
Maze example
Maze example: continue
Mountain car problem
RL Notation
- At time step t, the agent observes a state S_t from a state space S and selects an action A_t from an action space A.
- The action and state together result in a transition to a new state S_{t+1}.
- Given (A_t, S_t, S_{t+1}), the agent receives an immediate reward
  R_t = r_t(S_t, A_t, S_{t+1}) ∈ ℝ,
  where r_t(·, ·, ·) is called the immediate reward function.
RL mathematical formulation
- At time t, we assume a transition probability function from (S_t = s, A_t = a) to (S_{t+1} = s'): p_t(s'|s, a) ≥ 0, ∫_{s'} p_t(s'|s, a) ds' = 1.
- We also assume A_t given S_t follows a probability distribution: π_t(a|s) ≥ 0, ∫_a π_t(a|s) da = 1.
- A trajectory (training sample) (s_1, a_1, s_2, ..., s_T, a_T, s_{T+1}) is generated as follows:
  - start from an initial state s_1 drawn from a probability distribution p(s);
  - for t = 1, 2, ..., T (T is the total number of steps),
    (a) a_t is chosen from π_t(·|s_t),
    (b) the next state s_{t+1} is drawn from p_t(·|s_t, a_t).
- The problem is called finite horizon if T < ∞ and infinite horizon if T = ∞.
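The trajectory-generation scheme above can be sketched in Python. The two-state example at the bottom, with its `init`, `pol`, and `trans` functions, is a made-up illustration, not an example from the slides.

```python
import random

def simulate_trajectory(p_init, policy, transition, T, seed=0):
    """Generate a trajectory (s1, a1, s2, ..., sT, aT, s_{T+1}):
    a_t is drawn from pi_t(.|s_t), then s_{t+1} from p_t(.|s_t, a_t)."""
    rng = random.Random(seed)
    s = p_init(rng)                    # initial state s1 ~ p(s)
    traj = [s]
    for _ in range(T):
        a = policy(s, rng)             # a_t ~ pi_t(.|s_t)
        s = transition(s, a, rng)      # s_{t+1} ~ p_t(.|s_t, a_t)
        traj += [a, s]
    return traj

# Hypothetical illustration: states {0, 1}; action 1 flips the state w.p. 0.9.
init = lambda rng: 0
pol = lambda s, rng: rng.choice([0, 1])
trans = lambda s, a, rng: (1 - s) if (a == 1 and rng.random() < 0.9) else s

traj = simulate_trajectory(init, pol, trans, T=5)
print(len(traj))  # 2*T + 1 = 11 entries: s1, a1, s2, ..., a5, s6
```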
Goal of RL
- Define the return at time t as
  ∑_{j=t}^{T} γ^{j−t} r(S_j, A_j, S_{j+1}),
  where γ ∈ [0, 1) is called the discount factor (discounting long trajectories).
- An action policy π = (π_1, ..., π_T) is a sequence of probability distribution functions, where π_t is a probability distribution for A_t given S_t.
- The goal of RL is to learn the optimal action decision, the policy π* = (π*_1, π*_2, ..., π*_T), that maximizes the expected return
  E_π[ ∑_{j=1}^{T} γ^{j−1} r(S_j, A_j, S_{j+1}) ],
  where E_π(·) means A_t|S_t ∼ π_t(·|S_t).
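The discounted return is a one-liner worth checking numerically; the reward numbers here are made up.

```python
def discounted_return(rewards, gamma):
    """Return sum_{j=t}^{T} gamma^(j-t) * r_j for rewards [r_t, ..., r_T]."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```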
Optimal policy
- RL aims to find the best action decision rules such that the average long-term reward is maximized when those rules are implemented.
- Note: π* is a function of states, and for any individual we only know what the action should be at time t after observing the states at time t. This is related to so-called adaptive decisions or dynamic decisions.
How is supervised learning framed in the RL context?
- We can imagine S_t to be all data (both features and outcomes) collected by step t.
- Then A_t is the prediction rule chosen from a class of prediction functions based on S_t (it need not be the perfect prediction function; it can even be a random prediction), so π_t is the probabilistic selection of the prediction function at step t.
- Based on (S_t, A_t), S_{t+1} can be S_t with additionally collected data, or S_t with individual errors, or just S_t.
- R_t is the prediction error evaluated on the data.
- The goal is to learn the best prediction rule, and RL methods can help!
Two important concepts in RL
- State-action value function (SAV): the expected return at time t given state S_t = s and action A_t = a,
  Q^π_t(s, a) = E_π[ ∑_{j=t}^{T} γ^{j−t} r_j(S_j, A_j, S_{j+1}) | S_t = s, A_t = a ].
  Q*_t(s, a) ≡ Q^{π*}_t(s, a) is the optimal expected return at time t.
- State value function (SV): the expected return at time t given state S_t = s,
  V^π_t(s) = E_π[ ∑_{j=t}^{T} γ^{j−t} r_j(S_j, A_j, S_{j+1}) | S_t = s ].
  Similarly, V*_t(s) = V^{π*}_t(s).
- Clearly, V^π_t(s) = ∫_a Q^π_t(s, a) π_t(a|s) da.
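For a finite action space, the integral relating SV and SAV becomes a sum; a minimal sketch, with made-up Q-values and a uniform policy:

```python
def state_value(Q_s, pi_s):
    """Discrete-action version of V^pi(s) = integral_a Q^pi(s, a) pi(a|s) da:
    a policy-weighted average of the action values at state s."""
    return sum(q * p for q, p in zip(Q_s, pi_s))

# Q(s, .) = [2.0, 4.0] under a uniform policy pi(.|s) = [0.5, 0.5]:
print(state_value([2.0, 4.0], [0.5, 0.5]))  # 3.0
```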
RL methods
- Reinforcement learning methods mostly fall into two groups:
  - (policy iteration) model-based or learning methods to approximate the SAV;
  - (policy search) model-based or learning methods to directly maximize the SV for estimating π*.
Policy iteration for value function approximation
- The Bellman equation for the SV:
  V^π_t(s) = E_π[ r_t(s, A_t, S_{t+1}) + γ V^π_{t+1}(S_{t+1}) | S_t = s ]
           = ∫_{s'} ∫_a [ r_t(s, a, s') + γ V^π_{t+1}(s') ] π_t(a|s) p_t(s'|s, a) da ds'.
- The Bellman equation for the SAV:
  Q^π_t(s, a) = E_π[ r_t(s, a, S_{t+1}) + γ Q^π_{t+1}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]
              = ∫_{s'} ∫_{a'} [ r_t(s, a, s') + γ Q^π_{t+1}(s', a') ] π_{t+1}(a'|s') p_t(s'|s, a) da' ds'.
Value function learning for finite horizon
- For finite T, the Bellman equations suggest a backward procedure to evaluate the value function associated with a particular policy:
  - start from time T, where we can learn Q^π_T(s, a) = E[R_T | S_T = s, A_T = a];
  - at time T − 1, learn Q^π_{T−1}(s, a) as
    E[ R_{T−1} + γ Q^π_T(S_T, A_T) | S_{T−1} = s, A_{T−1} = a ];
  - ...
  - continue the learning backwards until time 1.
Optimal policy learning for finite horizon (Q-learning)
- Start from time T: learn Q*_T(s, a) = E[R_T | S_T = s, A_T = a] and set π*_T(s) to put probability 1 at a* = argmax_a Q*_T(s, a).
- At time T − 1, learn Q*_{T−1}(s, a) as
  E[ R_{T−1} + γ max_{a'} Q*_T(S_T, a') | S_{T−1} = s, A_{T−1} = a ].
  We obtain π*_{T−1} as the policy putting probability 1 at a* = argmax_a Q*_{T−1}(s, a).
- Perform the same learning procedure backwards until time 1 to learn all the optimal policies.
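The backward procedure can be sketched for a tabular MDP. This assumes, for brevity, time-homogeneous transition and reward tables; the two-state MDP at the bottom is hypothetical.

```python
import numpy as np

def backward_q_learning(P, R, T, gamma):
    """Finite-horizon Q-learning by backward induction.
    P[a][s, s'] = p(s'|s, a); R[a][s] = E[reward | state s, action a].
    Returns Q-tables and greedy policies for t = 1, ..., T."""
    nS, nA = P[0].shape[0], len(P)
    Qs, pis = [], []
    V_next = np.zeros(nS)                 # value beyond the horizon is 0
    for _ in range(T):                    # loop runs t = T, T-1, ..., 1
        Q = np.stack([R[a] + gamma * P[a] @ V_next for a in range(nA)], axis=1)
        pis.append(Q.argmax(axis=1))      # pi*_t puts mass 1 on argmax_a Q
        V_next = Q.max(axis=1)            # max_a' Q*_t(s, a') for next pass
        Qs.append(Q)
    return Qs[::-1], pis[::-1]            # reorder to t = 1, ..., T

# Hypothetical MDP: action 0 stays put, action 1 jumps to state 1;
# state 1 yields reward 1, state 0 yields reward 0.
P = [np.eye(2), np.array([[0.0, 1.0], [0.0, 1.0]])]
R = [np.array([0.0, 1.0]), np.array([0.0, 1.0])]
Qs, pis = backward_q_learning(P, R, T=3, gamma=0.9)
print(pis[0][0])  # 1: at t = 1 in state 0, jumping toward state 1 is optimal
```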
Value function learning for infinite horizon
- When T = ∞ or T is large, this backward Q-learning method may not be applicable.
- The salvage is to take advantage of process stability when t is large, so we can assume the following Markov decision process (MDP):
  - the state and action spaces are constant over time;
  - p_t(s'|s, a) is independent of t;
  - the reward function r_t(s, a, s') is independent of t.
- The MDP assumption is plausible for a long horizon and after a certain number of steps.
Bellman equations under MDP (T =∞)
- Under an MDP, Q^π_t(s, a) = Q^π(s, a) and V^π_t(s) = V^π(s).
- The Bellman equations become
  V^π(s) = E_π[ r(s, A_t, S_{t+1}) + γ V^π(S_{t+1}) | S_t = s ],
  Q^π(s, a) = E_π[ r(s, a, S_{t+1}) + γ Q^π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ].
- Derived equations for the optimal policy:
  V^{π*}(s) = max_a Q^{π*}(s, a),
  Q^{π*}(s, a) = E_{π*}[ r(s, a, S_{t+1}) + γ V^{π*}(S_{t+1}) | S_t = s, A_t = a ],
  π*(s) ∼ I{ a = argmax_a Q^{π*}(s, a) }.
Policy iteration procedure
- Start from a policy π.
- Policy evaluation: evaluate Q^π(s, a) and thus V^π(s).
- Policy improvement: update π(a|s) to I(a = a_π(s)), where a_π(s) is the action maximizing Q^π(s, a).
- Iterate between the policy evaluation step and the policy improvement step.
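The procedure above can be sketched for a tabular MDP, where policy evaluation is an exact linear solve; the two-state MDP at the bottom is hypothetical.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iter=100):
    """Tabular policy iteration: exact policy evaluation via a linear solve,
    then greedy policy improvement, until the policy is stable.
    P[a][s, s'] = p(s'|s, a); R[a][s] = E[r | s, a]."""
    nS, nA = P[0].shape[0], len(P)
    pi = np.zeros(nS, dtype=int)          # arbitrary initial policy
    for _ in range(max_iter):
        # Policy evaluation: solve V = R_pi + gamma * P_pi V.
        P_pi = np.stack([P[pi[s]][s] for s in range(nS)])
        R_pi = np.array([R[pi[s]][s] for s in range(nS)])
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to Q^pi(s, a).
        Q = np.stack([R[a] + gamma * P[a] @ V for a in range(nA)], axis=1)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break                          # stable policy
        pi = new_pi
    return pi, V

# Hypothetical MDP: action 1 jumps to the rewarding state 1.
P = [np.eye(2), np.array([[0.0, 1.0], [0.0, 1.0]])]
R = [np.array([0.0, 1.0]), np.array([0.0, 1.0])]
pi, V = policy_iteration(P, R, gamma=0.9)
print(pi[0], round(V[1], 6))  # 1 10.0
```

With γ = 0.9 the rewarding state is worth 1/(1 − γ) = 10, and the optimal action in state 0 is to jump, which the iteration finds in two sweeps.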
Soft policy iteration procedure
- Selecting a deterministic policy update may be too greedy if the initial policy is far from the optimal one.
- Softer policy updates include:
  - π(a|s) ∝ exp{Q^π(s, a)/τ}, where τ is a temperature parameter;
  - (ε-greedy policy improvement) π(a|s) has probability (1 − ε + ε/m) at a = a(π) and probability ε/m at the other a's, where m is the number of possible actions.
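Both soft updates are easy to write down; a small sketch with made-up Q-values:

```python
import math

def softmax_policy(Q_s, tau):
    """pi(a|s) proportional to exp{Q(s, a) / tau}; tau is the temperature."""
    m = max(Q_s)                                # subtract max for stability
    w = [math.exp((q - m) / tau) for q in Q_s]
    z = sum(w)
    return [x / z for x in w]

def eps_greedy_policy(Q_s, eps):
    """Probability 1 - eps + eps/m on the greedy action, eps/m on the rest."""
    m = len(Q_s)
    best = max(range(m), key=lambda a: Q_s[a])
    return [(1 - eps + eps / m) if a == best else eps / m for a in range(m)]

p = eps_greedy_policy([1.0, 2.0, 0.0], eps=0.3)  # ~ [0.1, 0.8, 0.1]
```

As τ → 0 the softmax policy approaches the greedy (deterministic) update, and large τ spreads mass over all actions.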
Estimation of state-value function
- One challenge in policy iteration is how to estimate Q^π(s, a).
- This requires statistical modelling or learning algorithms.
- Parametric/semiparametric models for Q^π(s, a) are commonly used.
Least-squares policy iteration
- We assume
  Q^π(s, a) = ∑_{b=1}^{B} θ_b φ_b(s, a),
  where φ_b(s, a) is a sequence of basis functions.
- In other words, the policy is indirectly represented by the θ_b's.
- From the Bellman equation, we note that
  E[R_t | S_t, A_t] = Q^π(S_t, A_t) − γ E_π[Q^π(S_{t+1}, A_{t+1}) | S_t, A_t] = θ^T ψ(S_t, A_t),
  so R_t − θ^T ψ(S_t, A_t) has mean zero given (S_t, A_t) under policy π, where
  ψ(s, a) = φ(s, a) − γ E_π[φ(S_{t+1}, A_{t+1}) | S_t = s, A_t = a].
Numerical implementation of least-squares policy iteration
- Suppose we have data from n subjects, each with a training sample of T steps (or n T-step training samples from the same agent):
  (S_{i1}, A_{i1}, S_{i2}, ..., S_{iT}, A_{iT}, S_{i,T+1}).
- We estimate ψ_b(s, a) by
  ψ̂_b(s, a) = φ_b(s, a) − γ [ ∑_{i=1}^{n} ∑_{t=1}^{T} I(S_{it} = s, A_{it} = a) E_π[φ_b(S_{i,t+1}, A_{i,t+1})] ] / [ ∑_{i=1}^{n} ∑_{t=1}^{T} I(S_{it} = s, A_{it} = a) ].
- We perform a least-squares estimation
  min_θ (1/(nT)) ∑_{i=1}^{n} ∑_{t=1}^{T} I(A_{it}|S_{it} ∼ π) [ θ^T ψ(S_{it}, A_{it}) − R_{it} ]^2,
  where A_{it}|S_{it} ∼ π means that A_{it} was obtained by following the policy π.
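Once the ψ features are built, the least-squares step reduces to solving normal equations; a sketch with synthetic data, where the feature matrix and true θ are fabricated just to verify the solver.

```python
import numpy as np

def lspi_theta(Psi, R, lam=0.0):
    """Solve min_theta sum_t (theta^T psi_t - R_t)^2 via the normal
    equations; lam adds an optional L2 (ridge) penalty.
    Psi: (nT, B) matrix of psi(S_it, A_it) values; R: (nT,) rewards."""
    B = Psi.shape[1]
    return np.linalg.solve(Psi.T @ Psi + lam * np.eye(B), Psi.T @ R)

# Synthetic check: rewards built from a known theta should be recovered.
rng = np.random.default_rng(0)
Psi = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
theta_hat = lspi_theta(Psi, Psi @ theta_true)
print(np.allclose(theta_hat, theta_true))  # True
```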
More on numerical implementation
- Regularization may be introduced to obtain a sparser solution.
- The L2-minimization can be replaced by L1-minimization to gain robustness.
- Choice of basis functions: radial basis functions, where the kernel can be the usual Gaussian kernel (one possible definition of d(s, s') is the shortest path from s to s' in the graph defined by the transition probabilities).
Robot-Arm control example
Robot-Arm control example: continue
Off-policy estimation
- In the previous derivation, we essentially estimate
  E_π[ ∑_{t=1}^{T} (θ^T ψ(S_t, A_t) − R_t)^2 ]
  using history samples (S_t, A_t) obtained by following the target policy π.
- This is called on-policy reinforcement learning.
- However, not every policy has been seen in the history sample.
- An alternative method is to use importance sampling: if the history sample follows a behavior policy π̃,
  E_π[ ∑_{t=1}^{T} (θ^T ψ(S_t, A_t) − R_t)^2 ] = E_{π̃}[ ∑_{t=1}^{T} (θ^T ψ(S_t, A_t) − R_t)^2 w_t ],
  where
  w_t = ∏_{j=1}^{t} π(A_j|S_j) / ∏_{j=1}^{t} π̃(A_j|S_j).
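The cumulative importance weights can be computed along a trajectory; the probabilities below are made up for illustration.

```python
def importance_weights(target_probs, behavior_probs):
    """w_t = prod_{j<=t} pi(A_j|S_j) / prod_{j<=t} pi_tilde(A_j|S_j),
    where pi is the target policy and pi_tilde the behavior policy
    that generated the observed actions."""
    ws, w = [], 1.0
    for p_target, p_behavior in zip(target_probs, behavior_probs):
        w *= p_target / p_behavior   # requires pi_tilde(a|s) > 0
        ws.append(w)
    return ws

# Target policy twice as likely as the behavior policy on each action:
print(importance_weights([0.8, 0.8], [0.4, 0.4]))  # [2.0, 4.0]
```

The product form shows why long horizons are problematic: the weights grow or shrink geometrically, inflating the variance of the estimate.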
Off-policy iteration: more
- We need one assumption: the behavior policy π̃ generating the history sample satisfies
  π̃(a|s) > 0 for all (a, s).
- Adaptive importance weighting replaces w_t by w_t^ν and chooses ν via cross-validation.
- When the history sample contains multiple behavior policies, we can obtain an estimate from importance weighting with respect to each policy and aggregate the estimates (sample-reuse policy iteration).
Mountain car example
- Action space: force applied to the car, {0.2, −0.2, 0}.
- State space: (x, ẋ), where x is the horizontal position (∈ [−1.2, 0.5]) and ẋ is the velocity (∈ [−1.5, 1.5]).
- Transition:
  x_{t+1} = x_t + ẋ_{t+1} δt,
  ẋ_{t+1} = ẋ_t + (−9.8 w cos(3x_t) + a_t/w − k ẋ_t) δt,
  where w is the mass (0.2 kg), k is the friction coefficient (0.3), and δt is 0.1 second.
- Reward:
  r(s, a, s') = 1 if x_{s'} ≥ 0.5, and −0.01 otherwise.
- Policy iteration uses Gaussian kernels with centers at {−1.2, 0.35, 0.5} × {−1.5, −0.5, 0.5, 1.5} and σ = 1.
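The dynamics above translate directly into code. This is a sketch; clipping x and ẋ to their stated ranges is an assumption about how the boundaries are handled, which the slides do not specify.

```python
import math

W, K, DT = 0.2, 0.3, 0.1   # mass (kg), friction coefficient, time step (s)

def mountain_car_step(x, v, a):
    """One transition: v' = v + (-9.8 w cos(3x) + a/w - k v) dt, x' = x + v' dt."""
    v_new = v + (-9.8 * W * math.cos(3 * x) + a / W - K * v) * DT
    v_new = max(-1.5, min(1.5, v_new))           # keep velocity in [-1.5, 1.5]
    x_new = max(-1.2, min(0.5, x + v_new * DT))  # keep position in [-1.2, 0.5]
    return x_new, v_new

def reward(x_next):
    """r(s, a, s') = 1 once the next position reaches 0.5, else -0.01."""
    return 1.0 if x_next >= 0.5 else -0.01

# Push right from the valley floor:
x, v = mountain_car_step(-0.5, 0.0, 0.2)
print(reward(x))  # -0.01: one step is not enough to reach the goal
```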
Experiment results
Direct Policy Search
- The direct policy search approach aims to find the policy maximizing the expected return.
- Suppose we model the policy as π(a|s; θ).
- The expected return under π is given by
  J(θ) = ∫ p(s_1) ∏_{t=1}^{T} π(a_t|s_t; θ) p(s_{t+1}|s_t, a_t) { ∑_{t=1}^{T} γ^{t−1} r(s_t, a_t, s_{t+1}) } ds_1 da_1 ··· da_T ds_{T+1},
  where the integral is over the whole trajectory (s_1, a_1, ..., a_T, s_{T+1}).
- We optimize J(θ) to find the optimal θ.
- A gradient approach can be adopted for the optimization.
- EM-based policy search can be used for the optimization.
- Importance sampling can be used for evaluating J(θ).
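As a sketch of the gradient approach, the score-function (REINFORCE-type) identity ∇J(θ) = E[(∑_t γ^{t−1} r_t) ∑_t ∇log π(a_t|s_t; θ)] can be approximated from sampled trajectories; the `grad_log_pi` callback and the toy trajectory below are hypothetical.

```python
def policy_gradient_estimate(trajectories, grad_log_pi, gamma):
    """Monte Carlo estimate of grad J(theta) via the score-function identity:
    grad J = E[ (sum_t gamma^(t-1) r_t) * sum_t grad log pi(a_t|s_t; theta) ].
    Each trajectory is a list of (s, a, r) triples; grad_log_pi(s, a)
    returns the gradient of log pi(a|s; theta) as a list."""
    s0, a0, _ = trajectories[0][0]
    g = [0.0] * len(grad_log_pi(s0, a0))
    for traj in trajectories:
        # Discounted return of the whole trajectory.
        ret = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
        # Weight each score term by the return, averaged over trajectories.
        for s, a, _ in traj:
            for i, gi in enumerate(grad_log_pi(s, a)):
                g[i] += ret * gi / len(trajectories)
    return g

# Toy check with a constant score function: gradient = return * num_steps.
traj = [(0, 0, 1.0), (0, 0, 1.0)]
g = policy_gradient_estimate([traj], lambda s, a: [1.0], gamma=0.5)
print(g)  # [3.0]: return 1 + 0.5 = 1.5, times 2 steps
```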
Alternative methods
- Modelling the transition probability functions directly.
- Active policy iteration (active learning): update the sampling policy actively.