Slides from Aditya Tarigoppula's talk at NYC Machine Learning on December 13th.
An Introduction to Reinforcement Learning (RL) and RL Brain Machine Interface (RL-BMI)
Aditya Tarigoppula www.joefrancislab.com
SUNY Downstate Medical Center
Outline
RL Examples
Environment
Value functions
Optimality
Methods for attaining optimality
DP MC TD
BMI & RL-BMI
Eligibility Traces
START / END
RL Examples
Stanford Autonomous Helicopter: http://heli.stanford.edu/
Reinforcement Learning Brain Machine Interface: Joe Francis Lab.
Environment model - Markov decision process (MDP)
1) States 'S'
2) Actions 'A'
3) State transition probabilities
4) Reward
5) Discount factor γ
Deterministic, non-stationary policy
RL Problem: The decision maker, the 'agent', needs to learn the optimal policy in an 'environment' so as to maximize the total amount of reward it receives over the long term.
$P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s,\, a_t = a \}$

$0 \le \gamma \le 1$

Return (continuing tasks): $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

Return (episodic tasks): $R_t = r_{t+1} + r_{t+2} + \dots + r_T$

Policy: $\pi : s \to a$
• The agent performs actions under the policy being followed.
• The environment is everything other than the agent.
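To make these five ingredients concrete, here is a minimal sketch of an MDP as plain Python tables. The two-state problem, its names, and all of its numbers are hypothetical, chosen only to illustrate the structure:

```python
# A hypothetical two-state, two-action MDP written as plain tables.
# Every name and number here is illustrative, not from the slides.

STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# State transition probabilities P^a_{ss'}:
# P[(s, a)] is a list of (next_state, probability) pairs.
P = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "go"):   [("s0", 1.0)],
}

# Expected immediate reward for taking action a in state s.
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 5.0, ("s1", "go"): 0.0}

GAMMA = 0.9  # discount factor, 0 <= gamma <= 1

# A deterministic policy is simply a map from state to action.
policy = {"s0": "go", "s1": "stay"}
```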
Value Functions
State Value Function:

$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \Big\} = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^{\pi}(s') \big]$

State-Action Value Function:

$Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big\}$
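For a finite MDP, the state-value equation above is linear in $V^{\pi}$, so it can be solved exactly rather than iterated: $(I - \gamma P^{\pi})V = R^{\pi}$. A small sketch with numpy, reusing the illustrative two-state MDP from the earlier sketch (an assumption, not the slides' example):

```python
# Solve the Bellman expectation equation exactly as a linear system.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.2, 0.8],    # from s0 under the policy's action "go"
                 [0.0, 1.0]])   # from s1 under "stay"
R_pi = np.array([1.0, 5.0])     # expected one-step reward under the policy

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V_pi)  # -> [~45.12, 50.0], the values V^pi(s0), V^pi(s1)
```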
Optimal Value Function:
Optimal Policy – A policy that is better than or equal to all other policies is called the optimal policy.
(in the sense of maximizing expected reward)
Optimal state value function: $V^*(s) = \max_{\pi} V^{\pi}(s)$

Optimal state-action value function: $Q^*(s,a) = \max_{\pi} Q^{\pi}(s,a)$

Bellman optimality equations:
$V^*(s) = \max_a E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\, a_t = a \}$
$Q^*(s,a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s,\, a_t = a \}$
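A step worth making explicit (these are the standard identities from Sutton & Barto, not anything new): once $Q^*$ is known, acting optimally requires no model at all, since

```latex
V^*(s) = \max_a Q^*(s,a), \qquad \pi^*(s) = \arg\max_a Q^*(s,a)
```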
At time = t: acquire brain state
Decoder: action selection (trying to execute an optimum action)
Action executed
At time = t+1: observe reward, update the decoder
EXAMPLE
Pr = 0.8 (intended transition), Pr = 0.1 and Pr = 0.1 (slips to the two neighboring states), Pr = 0 (all others)
EXAMPLE
$V(s) = 0.8 \big[ R(s,a_1) + \gamma V^*(s_1) \big] + 0.1 \big[ R(s,a_2) + \gamma V^*(s_2) \big] + 0.1 \big[ R(s,a_3) + \gamma V^*(s_3) \big]$

(grid states $S_1$ through $S_4$ shown in the figure)
Prof. Andrew Ng, Lecture 16, Machine learning
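As a concrete instance of this backup, plugging in assumed numbers (illustrative only, not from the slides): $\gamma = 0.9$, $R(s,a_i) = 0$, and current estimates $V^*(s_1) = 1.0$, $V^*(s_2) = 0.5$, $V^*(s_3) = 0.2$:

```latex
V(s) = 0.8\,(0 + 0.9 \cdot 1.0) + 0.1\,(0 + 0.9 \cdot 0.5) + 0.1\,(0 + 0.9 \cdot 0.2)
     = 0.72 + 0.045 + 0.018 = 0.783
```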
Outline
RL Examples
Environment
Value functions
Optimality
Methods for attaining optimality (We're here!)
DP MC TD
BMI & RL-BMI
Eligibility Traces
START / END
Solution Methods for the RL problem
◦ Dynamic Programming (DP) – a method for optimizing problems that exhibit overlapping subproblems and optimal substructure.
◦ Monte Carlo method (MC) – requires only experience: sample sequences of states, actions, and rewards from interaction with an environment.
◦ Temporal Difference learning (TD) – combines the better aspects of DP (bootstrapped estimation) and MC (learning from raw experience) without the 'troublesome' aspects of either.
Dynamic Programming: Policy Evaluation
Dynamic Programming: Policy Improvement
Policy improvement condition: if $Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$ for all states s, then policy $\pi'$ is at least as good as $\pi$.

$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$

E – Policy Evaluation, I – Policy Improvement

Policy Iteration vs. Value Iteration: value iteration replaces the separate evaluation and improvement sweeps with a single max backup:
DYNAMIC
PROGRAMMING
$V_{k+1}(s) = \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V_k(s') \big]$
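A minimal value-iteration sketch of this max backup, assuming the MDP is given as the table-style P and R from the earlier sketch (an illustration, not the slides' code):

```python
# Value iteration: repeatedly apply the max backup until values converge.
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # V(s) <- max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```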
Monte Carlo vs. DP
◦ The estimates for each state are independent. In other words, MC methods do not "bootstrap".
◦ The DP backup diagram includes only one-step transitions, whereas the MC backup diagram goes all the way to the end of the episode.
◦ With MC, the expense of estimating the value of a single state does not grow with the number of states, which is attractive when one only needs the values of a subset of the states.
Monte Carlo Policy Evaluation
Every-visit MC / First-visit MC
-> Without a model, we need Q-value estimates.
-> All state-action pairs should be visited.
-> Exploration techniques: 1) Exploring starts 2) ε-greedy policy
Next Slide: MONTE CARLO
As promised, this is the "NEXT SLIDE"!
MONTE CARLO
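A sketch of first-visit MC policy evaluation, as one might write it. Here `sample_episode` is an assumed helper (not defined in the slides) that runs one episode under the policy and returns (state, reward) pairs:

```python
# First-visit MC: average the return following the first visit to each
# state across many sampled episodes.
from collections import defaultdict

def first_visit_mc(policy, sample_episode, n_episodes=1000, gamma=0.9):
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = sample_episode(policy)   # [(s0, r1), (s1, r2), ...]
        G = 0.0
        first_visit = {}
        # Walk backwards accumulating the discounted return G; the last
        # overwrite for each state corresponds to its EARLIEST visit.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            first_visit[s] = G
        for s, G_s in first_visit.items():
            returns[s].append(G_s)
    return {s: sum(g) / len(g) for s, g in returns.items()}
```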
Temporal Difference Methods
◦ Like MC, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
$V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$
TD(λ)
λ: trace-decay parameter
As λ grows toward 1: bias decreases, variance increases (the bias-variance tradeoff)
Intuition: start with a large λ and then decrease it over time
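A sketch of the TD(0) update above as a per-episode loop. `env.reset()`, `env.step(a)` and `policy(s)` are assumed interfaces (hypothetical, not from the talk); V is a dict from state to current value estimate:

```python
# TD(0): after each step, move V(s_t) toward the bootstrapped target
# r_{t+1} + gamma * V(s_{t+1}).
def td0_episode(env, policy, V, alpha=0.1, gamma=0.9):
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        # delta_t: how much better/worse the step went than predicted
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
        s = s_next
    return V
```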
SARSA (on-policy TD control)
Q-Learning (off-policy TD control)
Difference: SARSA's update target uses the action actually taken next; Q-learning's target uses the greedy (max) next action. A sketch of both updates follows.
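The two update rules side by side, as a hedged sketch; the function names and the dict-of-(s, a) layout are my own illustration, and `actions` and the ε-greedy behavior policy are assumed helpers:

```python
# SARSA backs up the Q-value of the action actually taken next
# (on-policy); Q-learning backs up the max over next actions
# (off-policy). Q is a dict keyed by (state, action).
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q[(s_next, a_next)]   # uses the action taken
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)  # greedy max
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```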
Outline
RL Examples
Environment
Value functions
Optimality
Methods for attaining optimality
DP MC TD
BMI & RL-BMI
Eligibility Traces (We're here!)
START / END
Eligibility Traces
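A sketch of the backward view that eligibility traces implement: each state keeps a decaying trace of how recently it was visited, and every TD error is credited to all states in proportion to those traces. The accumulating-trace choice below is one of the two standard variants (the other is replacing traces):

```python
# Backward-view TD(lambda): one environment step. V and e are dicts
# over states (value estimates and eligibility traces).
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    delta = r + gamma * V[s_next] - V[s]
    e[s] += 1.0            # accumulating trace (e[s] = 1.0 would be replacing)
    for state in V:
        V[state] += alpha * delta * e[state]
        e[state] *= gamma * lam   # traces decay by gamma * lambda
    return V, e
```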
Outline
RL Examples
Environment
Value functions
Optimality
Methods for attaining optimality
DP MC TD
BMI & RL-BMI (We're here!)
Eligibility Traces
START / END
Online/closed-loop RL-BMI architecture
action_output = index of $\max_i Q(s_t, i)$, i.e. $a_t = \arg\max_a Q(s_t, a)$
Decoder: neural network with tanh(.) units
Observe reward
$TD\_err_t = r_t + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$
$delta = TD\_err_t * e\_trace$
‘delta’ used for updating the weights through back-propagation
NEURAL SIGNAL
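Putting the pieces of this slide together as code, under my own reading of the architecture (a one-hidden-layer tanh network as the Q decoder; this is a rough sketch, not the lab's actual implementation):

```python
# Sketch of the closed-loop decoder: neural state vector -> Q-values
# over actions; the TD error scaled by an eligibility trace gives the
# 'delta' that is backpropagated into the weights.
import numpy as np

class QDecoder:
    def __init__(self, n_in, n_hidden, n_actions, lr=0.01):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
        self.W2 = rng.normal(0, 0.1, (n_actions, n_hidden))
        self.lr = lr

    def q_values(self, x):
        self.h = np.tanh(self.W1 @ x)        # tanh(.) hidden layer
        return self.W2 @ self.h              # one Q-value per action

    def update(self, x, a, delta):
        # delta = TD_err * e_trace, backpropagated from the chosen output.
        grad_out = np.zeros(self.W2.shape[0])
        grad_out[a] = delta
        self.W2 += self.lr * np.outer(grad_out, self.h)
        dh = (self.W2.T @ grad_out) * (1 - self.h ** 2)  # tanh derivative
        self.W1 += self.lr * np.outer(dh, x)
```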
Scott, S. H. (1999). "Apparatus for measuring and perturbing shoulder and elbow joint positions and torques during reaching." J Neurosci Methods 89(2): 119-27.
BMI SETUP
Autonomous Helicopter (Stanford Uni) http://heli.stanford.edu/papers/iser04-invertedflight.pdf
Position, orientation, velocity and angular velocity:
$(x, y, z, \phi, \theta, \omega, \dot{x}, \dot{y}, \dot{z}, \dot{\phi}, \dot{\theta}, \dot{\omega})$
[Figure: model-based rollout - states S1, S2 with actions a1, a2 and rewards R1, R2 generated by the Dynamics model driven by a random generator]
Actor-Critic Model
http://drugabuse.gov/researchreports/methamph/meth04.gif
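A minimal tabular actor-critic sketch to accompany the figure: the critic's TD error plays the role of the dopamine-like teaching signal, updating both the value estimate and the actor's action preferences. All names and parameters here are illustrative:

```python
# Tabular actor-critic: one environment step. V maps states to values;
# prefs maps each state to a numpy array of action preferences.
import numpy as np

def actor_critic_step(V, prefs, s, a, r, s_next,
                      alpha_v=0.1, alpha_p=0.1, gamma=0.9):
    delta = r + gamma * V[s_next] - V[s]   # critic's TD error
    V[s] += alpha_v * delta                # critic update
    prefs[s][a] += alpha_p * delta         # actor: reinforce a if delta > 0
    return delta

def softmax_action(prefs, s, rng=np.random.default_rng()):
    p = np.exp(prefs[s] - np.max(prefs[s]))
    p /= p.sum()
    return rng.choice(len(p), p=p)         # sample an action index
```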
References
Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto
Prof. Andrew Ng's Machine Learning lectures
http://heli.stanford.edu
Dr. Joseph T. Francis, www.joefrancislab.com
Prof. Peter Dayan
Dr. Justin Sanchez Group, http://www.bme.miami.edu/nrg/