
Reinforcement Learning

Donglin Zeng, Department of Biostatistics, University of North Carolina

Introduction

- Unsupervised learning has no outcome (no feedback).
- Supervised learning has an outcome, so we know what to predict.
- Reinforcement learning is in between: it has no explicit supervision, so it uses a reward system to learn the feature-outcome relationship.
- The crucial advantage of reinforcement learning is its non-greedy nature: we do not aim to improve performance in the short term but to optimize a long-term achievement.


RL terminology

- Reinforcement learning is a dynamic process in which, at each step, a new decision rule or policy is updated based on new data and the reward system.
- Terminology used in reinforcement learning:
  - Agent: whoever applies the learned decisions during the process (e.g., a robot in AI).
  - Action ($A$): a decision to be taken during the process.
  - State ($S$): environment variables that may interact with the action.
  - Reward ($R$): a value system to evaluate the action given the state.
- Note that $(A, S, R)$ are time-step dependent, so we write $(A_t, S_t, R_t)$ to reflect time step $t$.


Reinforcement learning diagram


Maze example


Maze example (continued)


Maze example (continued)


Mountain car problem


RL Notation

- At time step $t$, the agent observes a state $S_t$ from a state space $\mathcal{S}_t$ and selects an action $A_t$ from an action space $\mathcal{A}_t$.
- Both the action and the state result in a transition to a new state $S_{t+1}$.
- Given $(A_t, S_t, S_{t+1})$, the agent receives an immediate reward
  \[ R_t = r_t(S_t, A_t, S_{t+1}) \in \mathbb{R}, \]
  where $r_t(\cdot, \cdot, \cdot)$ is called the immediate reward function.


RL mathematical formulation

- At time $t$, we assume a transition probability function from $(S_t = s, A_t = a)$ to $(S_{t+1} = s')$: $p_t(s'|s, a) \ge 0$, $\int_{s'} p_t(s'|s, a)\, ds' = 1$.
- We also assume $A_t$ given $S_t$ follows a probability distribution: $\pi_t(a|s) \ge 0$, $\int_a \pi_t(a|s)\, da = 1$.
- A trajectory (training sample) $(s_1, a_1, s_2, \ldots, s_T, a_T, s_{T+1})$ is generated as follows (a simulation sketch is given after this list):
  - start from an initial state $s_1$ drawn from a probability distribution $p(s)$;
  - for $t = 1, 2, \ldots, T$ ($T$ is the total number of steps),
    (a) $a_t$ is chosen from $\pi_t(\cdot|s_t)$;
    (b) the next state $s_{t+1}$ is drawn from $p_t(\cdot|s_t, a_t)$.
- The problem is called finite horizon if $T < \infty$ and infinite horizon if $T = \infty$.
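
To make this generation scheme concrete, here is a minimal Python sketch (an illustrative addition, not from the original slides); `sample_initial`, `sample_policy`, and `sample_transition` are hypothetical user-supplied samplers for $p(s)$, $\pi_t(\cdot|s)$, and $p_t(\cdot|s, a)$.

```python
import numpy as np

def rollout(sample_initial, sample_policy, sample_transition, T, rng=None):
    """Generate one trajectory (s_1, a_1, s_2, ..., s_T, a_T, s_{T+1}).

    sample_initial(rng)             -> s_1 drawn from p(s)
    sample_policy(t, s, rng)        -> a_t drawn from pi_t(. | s)
    sample_transition(t, s, a, rng) -> s_{t+1} drawn from p_t(. | s, a)
    """
    if rng is None:
        rng = np.random.default_rng()
    states, actions = [sample_initial(rng)], []
    for t in range(1, T + 1):
        a = sample_policy(t, states[-1], rng)               # step (a): a_t ~ pi_t(.|s_t)
        s_next = sample_transition(t, states[-1], a, rng)   # step (b): s_{t+1} ~ p_t(.|s_t, a_t)
        actions.append(a)
        states.append(s_next)
    return states, actions   # len(states) == T + 1, len(actions) == T

# Example: a toy 2-state, 2-action MDP with a uniform random policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s'] = p(s' | s, a)
              [[0.6, 0.4], [0.1, 0.9]]])
states, actions = rollout(
    sample_initial=lambda rng: rng.integers(2),
    sample_policy=lambda t, s, rng: rng.integers(2),
    sample_transition=lambda t, s, a, rng: rng.choice(2, p=P[s, a]),
    T=5)
```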


Goal of RL

- Define the return at time $t$ as
  \[ \sum_{j=t}^{T} \gamma^{j-t} r(S_j, A_j, S_{j+1}), \]
  where $\gamma \in [0, 1)$ is called the discount factor (it down-weights rewards far along the trajectory); a small computational sketch follows below.
- An action policy, $\pi = (\pi_1, \ldots, \pi_T)$, is a sequence of probability distribution functions, where $\pi_t$ is a probability distribution for $A_t$ given $S_t$.
- The goal of RL is to learn the optimal action decision, i.e., the policy $\pi^* = (\pi^*_1, \pi^*_2, \ldots, \pi^*_T)$ that maximizes the expected return
  \[ E_\pi\Big[\sum_{j=1}^{T} \gamma^{j-1} r(S_j, A_j, S_{j+1})\Big], \]
  where $E_\pi(\cdot)$ means $A_t|S_t \sim \pi_t(\cdot|S_t)$.
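
As a quick computational illustration (a sketch added here, not part of the original slides), the discounted return at every time $t$ can be computed from a reward sequence by the backward recursion $G_t = r_t + \gamma G_{t+1}$:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = sum_{j>=t} gamma^(j-t) * r_j for each t, via G_t = r_t + gamma * G_{t+1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```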


Optimal policy

- RL aims to find the best action decision rules such that the average long-term reward is maximized when such rules are implemented.
- Note: $\pi^*$ is a function of states, and for any individual we only know what the action should be at time $t$ after observing the states at time $t$. This is related to so-called adaptive or dynamic decision making.


How is supervised learning framed in the RL context?

- We can imagine $S_t$ to be all data (both features and outcomes) collected by step $t$.
- Then $A_t$ is the prediction rule chosen from a class of prediction functions based on $S_t$ (it need not be a perfect prediction function; it can even be a random prediction), so $\pi_t$ is the probabilistic selection of a prediction function at step $t$.
- Based on $(S_t, A_t)$, the next state $S_{t+1}$ can be $S_t$ with additionally collected data, or $S_t$ with individual errors, or just $S_t$.
- $R_t$ is the prediction error evaluated on the data.
- The goal is to learn the best prediction rule; RL methods can help!


Two important concepts in RL

- State-action value function (SAV): the expected return increment at time $t$ given state $S_t = s$ and action $A_t = a$:
  \[ Q^\pi_t(s, a) = E_\pi\Big[\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1}) \,\Big|\, S_t = s, A_t = a\Big]. \]
  $Q^*_t(s, a) \equiv Q^{\pi^*}_t(s, a)$ is the optimal expected return at time $t$.
- State value function (SV): the expected return increment at time $t$ given state $S_t = s$:
  \[ V^\pi_t(s) = E_\pi\Big[\sum_{j=t}^{T} \gamma^{j-t} r_j(S_j, A_j, S_{j+1}) \,\Big|\, S_t = s\Big]. \]
  Similarly, $V^*_t(s) = V^{\pi^*}_t(s)$.
- Clearly, $V^\pi_t(s) = \int_a Q^\pi_t(s, a)\, \pi_t(a|s)\, da$ (illustrated below for a discrete action space).
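
For a discrete action space the integral becomes a sum over actions; a tiny sketch (assumed array layout: rows index states, columns index actions):

```python
import numpy as np

# Q[s, a]: state-action values; pi[s, a]: policy probabilities pi(a | s).
Q = np.array([[1.0, 3.0],
              [0.5, 0.5]])
pi = np.array([[0.25, 0.75],
               [0.50, 0.50]])

V = np.sum(pi * Q, axis=1)   # V(s) = sum_a pi(a|s) Q(s, a)
print(V)                     # [2.5, 0.5]
```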


RL methods

- Reinforcement learning methods mostly fall into two groups:
  - (policy iteration) model-based or learning methods to approximate the SAV function;
  - (policy search) model-based or learning methods to directly maximize the SV function for estimating $\pi^*$.


Policy iteration for value function approximation

- The Bellman equation for the SV function:
  \[ V^\pi_t(s) = E_\pi\Big[r_t(s, A_t, S_{t+1}) + \gamma V^\pi_{t+1}(S_{t+1}) \,\Big|\, S_t = s\Big]
     = \int_{s'}\int_a \big[r_t(s, a, s') + \gamma V^\pi_{t+1}(s')\big]\, \pi_t(a|s)\, p_t(s'|s, a)\, da\, ds'. \]
- The Bellman equation for the SAV function:
  \[ Q^\pi_t(s, a) = E_\pi\Big[r_t(s, a, S_{t+1}) + \gamma Q^\pi_{t+1}(S_{t+1}, A_{t+1}) \,\Big|\, S_t = s, A_t = a\Big]
     = \int_{s'}\int_{a'} \big[r_t(s, a, s') + \gamma Q^\pi_{t+1}(s', a')\big]\, \pi_{t+1}(a'|s')\, p_t(s'|s, a)\, da'\, ds'. \]


Value function learning for finite horizon

- For finite $T$, the Bellman equations suggest a backward procedure to evaluate the value function associated with a particular policy (sketched in code below):
  - start from time $T$: we can learn $Q^\pi_T(s, a) = E[R_T \mid S_T = s, A_T = a]$;
  - at time $T - 1$, we learn $Q^\pi_{T-1}(s, a)$ as
    \[ E\big[R_{T-1} + \gamma Q^\pi_T(S_T, A_T) \,\big|\, S_{T-1} = s, A_{T-1} = a\big]; \]
  - ...
  - we perform the learning backwards until time 1.
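
A minimal tabular sketch of this backward evaluation (an illustrative addition, assuming a known discrete MDP with time-homogeneous transition array `P[s, a, s']`, reward array `r[s, a, s']`, and a fixed policy `pi[s, a]`; with real data each conditional expectation would instead be estimated by regression):

```python
import numpy as np

def evaluate_policy_finite(P, r, pi, gamma, T):
    """Backward recursion for Q^pi_t(s, a), t = T, T-1, ..., 1 (stored as Q[t-1])."""
    nS, nA, _ = P.shape
    Q = np.zeros((T, nS, nA))
    for t in range(T - 1, -1, -1):                      # t = T-1 corresponds to time T
        Q[t] = np.einsum('sap,sap->sa', P, r)           # E[r(s, a, S') | s, a]
        if t < T - 1:
            V_next = np.sum(pi * Q[t + 1], axis=1)      # V^pi_{t+1}(s') = sum_a' pi(a'|s') Q_{t+1}(s', a')
            Q[t] += gamma * P @ V_next                  # + gamma * E[V^pi_{t+1}(S') | s, a]
    return Q
```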


Optimal policy learning for finite horizon (Q-learning)

- Start from time $T$: we can learn $Q^\pi_T(s, a) = E[R_T \mid S_T = s, A_T = a]$. We set $\pi^*_T(s)$ to place probability 1 at $a^* = \arg\max_a Q^\pi_T(s, a)$.
- At time $T - 1$, we learn $Q^{\pi^*}_{T-1}(s, a)$ as
  \[ E\big[R_{T-1} + \gamma \max_{a'} Q^{\pi^*}_T(S_T, a') \,\big|\, S_{T-1} = s, A_{T-1} = a\big]. \]
  We obtain $\pi^*_{T-1}$ as the policy placing probability 1 at $a^* = \arg\max_a Q^{\pi^*}_{T-1}(s, a)$.
- We perform the same learning procedure backwards until time 1 to learn all the optimal policies (see the sketch below).
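
The same backward induction with the max over actions, again assuming a known tabular model (a sketch; in practice the conditional expectations are learned from data as the slide describes):

```python
import numpy as np

def backward_q_learning(P, r, gamma, T):
    """Backward induction: Q*_t and the greedy (optimal) action for t = T, ..., 1."""
    nS, nA, _ = P.shape
    Q = np.zeros((T, nS, nA))
    pi_star = np.zeros((T, nS), dtype=int)
    for t in range(T - 1, -1, -1):
        Q[t] = np.einsum('sap,sap->sa', P, r)       # E[R_t | s, a]
        if t < T - 1:
            V_next = Q[t + 1].max(axis=1)           # max_{a'} Q*_{t+1}(s', a')
            Q[t] += gamma * P @ V_next
        pi_star[t] = Q[t].argmax(axis=1)            # pi*_t(s): probability 1 at argmax_a Q*_t(s, a)
    return Q, pi_star
```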


Value function learning for infinite horizon

- When $T = \infty$ or $T$ is large, the backward Q-learning method may not be applicable.
- The salvage is to take advantage of process stability when $t$ is large, so we can assume the following Markov decision process (MDP):
  - the state and action spaces are constant over time;
  - the transition probability $p_t(s'|s, a)$ is independent of $t$;
  - the reward function $r_t(s, a, s')$ is independent of $t$.
- The MDP assumption is plausible for a long horizon and after a certain number of steps.


Bellman equations under MDP ($T = \infty$)

- Under the MDP assumption, $Q^\pi_t(s, a) = Q^\pi(s, a)$ and $V^\pi_t(s) = V^\pi(s)$.
- The Bellman equations become
  \[ V^\pi(s) = E_\pi\big[r(s, A_t, S_{t+1}) + \gamma V^\pi(S_{t+1}) \,\big|\, S_t = s\big], \]
  \[ Q^\pi(s, a) = E_\pi\big[r(s, a, S_{t+1}) + \gamma Q^\pi(S_{t+1}, A_{t+1}) \,\big|\, S_t = s, A_t = a\big]. \]
- Derived equations for the optimal policy (a numerical sketch follows):
  \[ V^{\pi^*}(s) = \max_a Q^{\pi^*}(s, a), \]
  \[ Q^{\pi^*}(s, a) = E_{\pi^*}\big[r(s, a, S_{t+1}) + \gamma V^{\pi^*}(S_{t+1}) \,\big|\, S_t = s, A_t = a\big], \]
  \[ \pi^*(s) \sim I\big\{a = \arg\max_a Q^{\pi^*}(s, a)\big\}. \]
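
For a tabular MDP these fixed-point equations can be solved by value iteration, i.e., repeatedly applying the optimality backup until convergence; the sketch below is an illustrative addition, not a method given on the slides.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Iterate Q(s,a) <- E[r(s,a,S') + gamma * max_{a'} Q(S',a')] to a fixed point."""
    nS, nA, _ = P.shape
    expected_r = np.einsum('sap,sap->sa', P, r)     # E[r(s, a, S') | s, a]
    Q = np.zeros((nS, nA))
    for _ in range(max_iter):
        Q_new = expected_r + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            Q = Q_new
            break
        Q = Q_new
    return Q, Q.argmax(axis=1)   # Q* and the deterministic optimal policy pi*(s)
```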


Policy iteration procedure

- Start from a policy $\pi$.
- Policy evaluation: evaluate $Q^\pi(s, a)$ and thus $V^\pi(s)$.
- Policy improvement: update $\pi(a|s)$ to be $I(a = a^\pi(s))$, where $a^\pi(s)$ is the action maximizing $Q^\pi(s, a)$.
- Iterate between the policy evaluation step and the policy improvement step (see the sketch below).
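
A tabular sketch of this iteration (an illustrative addition; the evaluation step here solves the linear Bellman equation exactly, whereas later slides estimate $Q^\pi$ from data):

```python
import numpy as np

def policy_iteration(P, r, gamma, n_iter=100):
    """Alternate exact policy evaluation and greedy policy improvement (tabular MDP)."""
    nS, nA, _ = P.shape
    expected_r = np.einsum('sap,sap->sa', P, r)
    policy = np.zeros(nS, dtype=int)                  # start from an arbitrary deterministic policy
    for _ in range(n_iter):
        # Policy evaluation: V^pi solves V = r_pi + gamma * P_pi V
        P_pi = P[np.arange(nS), policy]               # (nS, nS) transition matrix under pi
        r_pi = expected_r[np.arange(nS), policy]
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
        # Policy improvement: greedy w.r.t. Q^pi(s, a) = E[r] + gamma * E[V^pi(S')]
        Q = expected_r + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, Q
```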


Soft policy iteration procedure

- Selecting a deterministic policy update may be too greedy if the initial policy is far from the optimal one.
- Softer policy updates include the following (sketched in code below):
  - (softmax policy improvement) $\pi(a|s) \propto \exp\{Q^\pi(s, a)/\tau\}$;
  - ($\varepsilon$-greedy policy improvement) $\pi(a|s)$ places probability $1 - \varepsilon + \varepsilon/m$ at $a = a^\pi(s)$ and probability $\varepsilon/m$ at each other action, where $m$ is the number of possible actions.
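
A small sketch of the two soft updates for a discrete action space (assumed array layout `Q[s, a]`):

```python
import numpy as np

def softmax_policy(Q, tau=1.0):
    """pi(a|s) proportional to exp{Q(s,a)/tau}; tau -> 0 recovers the greedy policy."""
    z = Q / tau
    z -= z.max(axis=1, keepdims=True)          # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def epsilon_greedy_policy(Q, eps=0.1):
    """Probability 1 - eps + eps/m on the greedy action, eps/m on each other action."""
    nS, m = Q.shape
    pi = np.full((nS, m), eps / m)
    pi[np.arange(nS), Q.argmax(axis=1)] += 1.0 - eps
    return pi
```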


Estimation of the state-action value function

- One challenge in the policy iteration is how to estimate $Q^\pi(s, a)$.
- This requires statistical modelling or learning algorithms.
- Parametric/semiparametric models for $Q^\pi(s, a)$ are commonly used.


Least-squares policy iteration

- We assume
  \[ Q^\pi(s, a) = \sum_{b=1}^{B} \theta_b \phi_b(s, a), \]
  where $\phi_b(s, a)$, $b = 1, \ldots, B$, is a sequence of basis functions.
- In other words, the policy is indirectly represented by the $\theta_b$'s.
- From the Bellman equation, we note that $R_t = r(S_t, A_t, S_{t+1})$ satisfies
  \[ E_\pi[R_t \mid S_t, A_t] = Q^\pi(S_t, A_t) - \gamma E_\pi[Q^\pi(S_{t+1}, A_{t+1}) \mid S_t, A_t] \approx \theta^T\psi(S_t, A_t), \]
  so $R_t - \theta^T\psi(S_t, A_t)$ has mean zero given $(S_t, A_t)$ under policy $\pi$, where $\psi(s, a) = \phi(s, a) - \gamma E_\pi[\phi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.


Numerical implementation of least-squares policy iteration

- Suppose we have data from $n$ subjects, each with a training sample of $T$ steps, or $n$ $T$-step training samples from the same agent:
  \[ (S_{i1}, A_{i1}, S_{i2}, \ldots, S_{iT}, A_{iT}, S_{i,T+1}), \quad i = 1, \ldots, n. \]
- We estimate $\psi_b(s, a)$ by
  \[ \hat\psi_b(s, a) = \phi_b(s, a) - \gamma\, \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} I(S_{it} = s, A_{it} = a)\, E_\pi[\phi_b(S_{i,t+1}, A_{i,t+1})]}{\sum_{i=1}^{n}\sum_{t=1}^{T} I(S_{it} = s, A_{it} = a)}. \]
- We perform the least-squares estimation (a code sketch follows)
  \[ \min_\theta \frac{1}{nT} \sum_{i=1}^{n}\sum_{t=1}^{T} I(A_{it}|S_{it} \sim \pi)\big[\theta^T\hat\psi(S_{it}, A_{it}) - R_{it}\big]^2, \]
  where $A_{it}|S_{it} \sim \pi$ means that $A_{it}$ was obtained by following the policy $\pi$.
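
A minimal sketch of this least-squares step for integer-coded states and a discrete action set (an illustrative reading of the slide's formulation: the conditional expectation of the next-step features is computed exactly under the known policy `pi`, and $\theta$ then solves an ordinary least-squares problem):

```python
import numpy as np

def lspi_theta(phi, S, A, R, S_next, pi, gamma):
    """Least-squares estimate of theta in Q^pi(s, a) ~ theta^T phi(s, a).

    phi(s, a) -> 1-D feature vector; S, A, R, S_next: arrays of transitions
    collected under pi; pi[s, a]: policy probabilities over a discrete action set.
    States are assumed integer-coded so they can index pi.
    """
    n_actions = pi.shape[1]
    Phi = np.array([phi(s, a) for s, a in zip(S, A)])
    # E_pi[phi(S_{t+1}, A_{t+1}) | S_t, A_t]: average next-step features over actions under pi.
    Phi_next = np.array([
        sum(pi[s2, a2] * phi(s2, a2) for a2 in range(n_actions)) for s2 in S_next
    ])
    Psi = Phi - gamma * Phi_next                       # psi_hat(S_t, A_t)
    theta, *_ = np.linalg.lstsq(Psi, np.asarray(R), rcond=None)
    return theta
```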


More on numerical implementation

- Regularization may be introduced to obtain a sparser solution.
- The $L_2$-minimization can be replaced by $L_1$-minimization to gain robustness.
- Choice of basis functions: radial basis functions, where the kernel can be the usual Gaussian kernel (one possible definition of $d(s, s')$ is the shortest path from $s$ to $s'$ in the graph defined by the transition probabilities).


Robot-Arm control example


Robot-Arm control example (continued)


Robot-Arm control example (continued)


Off-policy estimation

- In the previous derivation, we essentially estimate
  \[ E_\pi\Big[\sum_{t=1}^{T} \big(\theta^T\psi(S_t, A_t) - R_t\big)^2\Big] \]
  using a history sample $(S_t, A_t)$ obtained by following the target policy $\pi$.
- This is called on-policy reinforcement learning.
- However, not every policy has been seen in the history sample.
- An alternative method is to use importance sampling (a weight-computation sketch follows):
  \[ E_\pi\Big[\sum_{t=1}^{T} \big(\theta^T\psi(S_t, A_t) - R_t\big)^2\Big] = E_{\tilde\pi}\Big[\sum_{t=1}^{T} \big(\theta^T\psi(S_t, A_t) - R_t\big)^2 w_t\Big], \]
  where $\tilde\pi$ denotes the behavior policy that generated the history sample and
  \[ w_t = \frac{\prod_{j=1}^{t} \pi(A_j|S_j)}{\prod_{j=1}^{t} \tilde\pi(A_j|S_j)}. \]
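
A sketch of the cumulative importance weights and the weighted objective (an illustrative addition; `pi_target` and `pi_behavior` are hypothetical functions returning the action probabilities under the target policy $\pi$ and the behavior policy $\tilde\pi$):

```python
import numpy as np

def importance_weights(S, A, pi_target, pi_behavior):
    """w_t = prod_{j<=t} pi(A_j|S_j) / prod_{j<=t} pi_behavior(A_j|S_j)."""
    ratios = np.array([pi_target(a, s) / pi_behavior(a, s) for s, a in zip(S, A)])
    return np.cumprod(ratios)

def weighted_bellman_loss(Psi, R, theta, weights):
    """Importance-weighted least-squares objective sum_t w_t (theta^T psi_t - R_t)^2."""
    residuals = Psi @ theta - R
    return np.sum(weights * residuals**2)
```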


Off-policy iteration: more

- We need one assumption: the behavior policy $\tilde\pi$ that generated the history sample satisfies
  \[ \tilde\pi(a|s) > 0, \quad \forall (a, s). \]
- Adaptive importance weighting replaces $w_t$ by $w_t^\nu$ and chooses $\nu$ via cross-validation.
- When the history sample contains multiple behavior policies, we can obtain the estimate from importance weighting with respect to each policy and aggregate the estimates (sample-reuse policy iteration).


Mountain car example

- Action space: force applied to the car, $a \in \{0.2, -0.2, 0\}$.
- State space: $(x, \dot{x})$, where $x$ is the horizontal position ($\in [-1.2, 0.5]$) and $\dot{x}$ is the velocity ($\in [-1.5, 1.5]$).
- Transition (transcribed into code below):
  \[ x_{t+1} = x_t + \dot{x}_{t+1}\,\delta t, \qquad \dot{x}_{t+1} = \dot{x}_t + \big(-9.8\, w \cos(3 x_t) + a_t/w - k\,\dot{x}_t\big)\,\delta t, \]
  where $w$ is the mass (0.2 kg), $k$ is the friction coefficient (0.3), and $\delta t$ is 0.1 second.
- Reward:
  \[ r(s, a, s') = \begin{cases} 1 & x_{s'} \ge 0.5, \\ -0.01 & \text{otherwise}. \end{cases} \]
- Policy iteration uses Gaussian kernels with centers at $\{-1.2, 0.35, 0.5\} \times \{-1.5, -0.5, 0.5, 1.5\}$ and $\sigma = 1$.
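
A direct transcription of these dynamics into Python (a sketch; clipping the position and velocity to their stated ranges is an assumption, since the slide does not say how boundaries are handled):

```python
import numpy as np

W, K, DT = 0.2, 0.3, 0.1          # mass (kg), friction coefficient, time step (s)

def mountain_car_step(x, xdot, a):
    """One transition of the mountain car: returns (x_{t+1}, xdot_{t+1}, reward)."""
    xdot_new = xdot + (-9.8 * W * np.cos(3.0 * x) + a / W - K * xdot) * DT
    xdot_new = np.clip(xdot_new, -1.5, 1.5)          # assumed boundary handling
    x_new = np.clip(x + xdot_new * DT, -1.2, 0.5)    # note: uses the updated velocity
    reward = 1.0 if x_new >= 0.5 else -0.01
    return x_new, xdot_new, reward

# e.g. one step from the valley with full positive force
print(mountain_car_step(-0.5, 0.0, 0.2))
```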


Experiment results


Experiment results


Direct Policy Search

- The direct policy search approach aims at finding the policy that maximizes the expected return.
- Suppose we model the policy as $\pi(a|s; \theta)$.
- The expected return under $\pi$ is given by
  \[ J(\theta) = \int p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t; \theta) \times \Big\{\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1})\Big\}\, ds_1\, da_1 \cdots da_T\, ds_{T+1}. \]
- We optimize $J(\theta)$ to find the optimal $\theta$.
- A gradient approach can be adopted for the optimization (a sketch follows below).
- EM-based policy search can also be used for the optimization.
- Importance sampling can be used for evaluating $J(\theta)$.
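
As one concrete instance of the gradient approach (a REINFORCE-style sketch added for illustration, not necessarily the method intended on the slide): by the log-derivative trick, $\nabla_\theta J(\theta) = E_{\pi_\theta}\big[\big(\sum_t \nabla_\theta \log \pi(a_t|s_t; \theta)\big) \cdot \text{return}\big]$, which can be approximated from sampled trajectories.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, episodes, gamma):
    """Monte Carlo estimate of grad J(theta) for a tabular softmax policy
    pi(a|s; theta) = softmax(theta[s, :]).

    episodes: list of trajectories, each a list of (s, a, r) triples with integer s, a.
    """
    grad = np.zeros_like(theta)
    for traj in episodes:
        G = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))   # discounted return
        score = np.zeros_like(theta)
        for s, a, _ in traj:
            p = softmax(theta[s])
            score[s] -= p                 # d/dtheta[s, :] log pi(a|s) = e_a - pi(.|s)
            score[s, a] += 1.0
        grad += G * score
    return grad / len(episodes)

# gradient-ascent update: theta += learning_rate * reinforce_gradient(theta, episodes, gamma)
```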


Alternative methods

- Modelling the transition probability functions.
- Active policy iteration (active learning):
  - update the sampling policy actively.

