Inverse Reinforcement Learning in Partially Observable Environments
Jaedeug Choi, Kee-Eung Kim
Korea Advanced Institute of Science and Technology
JMLR, January 2011
Basics
Reinforcement Learning (RL)
Markov Decision Process (MDP)
Reinforcement Learning
[Diagram: agent-environment loop with internal state, actions, reward, and observation]
Inverse Reinforcement Learning
[Diagram: the same agent-environment loop, with the reward function as the unknown to be recovered]
Why the reward function?
It solves the more natural problem.
It is the most transferable representation of the agent's behaviour!
Example 1
[Figure: reward]
Example 2
Agent
Name: Agent
Role: Decision making
Property: Principle of rationality
Environment
Markov Decision Process (MDP)
Partially Observable Markov Decision Process (POMDP)
MDP
A sequential decision-making problem in which states are directly perceived.
POMDP
A sequential decision-making problem in which states are perceived through noisy observations ("Seems like I'm near a wall!").
This motivates the concept of a belief over states.
Policy
The expert's behaviour may be given as an explicit policy or as sample trajectories.
IRL for MDP\R
[Overview diagram: from policies, and from trajectories via a linear approximation; QCP with the projection method; apprenticeship learning]
Using Policies
Ng and Russell (2000): any policy deviating from the expert's policy should not yield a higher value.
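In the Q-function notation defined in the backup slides, this condition can be sketched as follows (a paraphrase of Ng and Russell's constraint, with π_E denoting the expert's policy, not a verbatim quote of the paper):

```latex
% For every state, the expert's own action must be (weakly) optimal
% under the recovered reward R.
\[
  Q^{\pi_E}\bigl(s, \pi_E(s)\bigr) \;\ge\; Q^{\pi_E}(s, a)
  \qquad \forall\, s \in S,\ a \in A
\]
```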
Using Sample Trajectories
Linear approximation for the reward function:
R(s,a) = α₁φ₁(s,a) + α₂φ₂(s,a) + … + α_dφ_d(s,a) = αᵀφ(s,a)
where α ∈ [-1,1]^d and φ: S×A → [0,1]^d are basis functions.
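A minimal Python sketch of this parameterization; the two indicator basis functions and the weights are hypothetical, chosen only for illustration:

```python
import numpy as np

# R(s, a) = alpha^T phi(s, a), with alpha in [-1, 1]^d and each basis
# function phi_i mapping (s, a) into [0, 1].
def linear_reward(alpha, phi, s, a):
    features = np.array([f(s, a) for f in phi])
    return alpha @ features

# Hypothetical basis functions for a toy problem.
phi = [
    lambda s, a: 1.0 if s == 0 else 0.0,   # "in state 0" indicator
    lambda s, a: 1.0 if a == 1 else 0.0,   # "took action 1" indicator
]
alpha = np.array([0.8, -0.2])              # weights in [-1, 1]
print(linear_reward(alpha, phi, s=0, a=1)) # 0.8 - 0.2 = 0.6
```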
Using Linear Programming
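The LP itself appeared as a figure on this slide, so the following Python sketch only follows the spirit of Ng and Russell's formulation (simplified to state-only rewards, without their L1 sparsity penalty; `irl_lp` and its margin objective are illustrative, not the paper's exact program):

```python
import numpy as np
from scipy.optimize import linprog

# P[a] is the |S| x |S| transition matrix for action a (shape (A, S, S));
# pi_E[s] is the expert's action in state s; the reward is per-state.
def irl_lp(P, pi_E, gamma=0.9, r_max=1.0):
    n_actions, n_states, _ = P.shape
    P_pi = np.array([P[pi_E[s], s] for s in range(n_states)])
    # V^pi = (I - gamma P_pi)^{-1} R, so "the expert's action is weakly
    # better in state s" is linear in R:
    #   (P_pi[s] - P[a, s]) @ (I - gamma P_pi)^{-1} @ R >= 0.
    inv = np.linalg.inv(np.eye(n_states) - gamma * P_pi)
    # Variables x = [R, t]; maximize the sum of per-state margins t_s,
    # each upper-bounded by the worst one-step deviation in that state.
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            if a == pi_E[s]:
                continue
            row = np.zeros(2 * n_states)
            row[:n_states] = -(P_pi[s] - P[a, s]) @ inv  # -margin(s, a)
            row[n_states + s] = 1.0                      # +t_s
            A_ub.append(row)                             # t_s <= margin
            b_ub.append(0.0)
    c = np.concatenate([np.zeros(n_states), -np.ones(n_states)])
    bounds = [(-r_max, r_max)] * n_states + [(None, None)] * n_states
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[:n_states]  # recovered reward vector
```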
Apprenticeship Learning
Learns a policy from the expert's demonstrations.
Does not compute the exact reward function.
Using QCP
The QCP is approximated using the projection method!
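One iteration of the projection method, in the style of Abbeel and Ng (2004), can be sketched as below; `projection_step`, its argument names, and the stopping rule are assumptions for illustration:

```python
import numpy as np

# mu_E: expert's feature expectations; mu_bar: current projection point;
# mu_new: feature expectations of the latest policy found.
def projection_step(mu_E, mu_bar, mu_new):
    d = mu_new - mu_bar
    # Orthogonally project mu_E onto the line through mu_bar and mu_new.
    mu_bar_next = mu_bar + ((d @ (mu_E - mu_bar)) / (d @ d)) * d
    w = mu_E - mu_bar_next        # next reward-weight direction
    t = np.linalg.norm(w)         # margin; stop when t is small enough
    return mu_bar_next, w, t
```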
IRL in POMDP
Ill-posed problem: existence, uniqueness, stability (for example, R = 0 is always a solution).
Computationally intractable: exponential increase in problem size!
IRL for POMDP\R
[Overview diagram: extends the IRL for MDP\R methods. From policies: comparing Q functions, Howard's policy improvement theorem, the witness theorem. From trajectories: the MMV, MMFE, and PRJ methods.]
Comparing Q functions
Constraint: the expert's FSC policy must achieve at least the value of any policy that deviates from it by one step.
Disadvantage: for each node n ∈ N there are |A||N|^|Z| ways to deviate one step from the expert; over |N| nodes there are |N||A||N|^|Z| ways, so the number of constraints grows exponentially!
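In the backup slides' notation, with π = <ψ, η>, the one-step-deviation constraint can be sketched as follows (a rough rendering; the paper's exact form may differ):

```latex
% At every node n and belief b, the expert's own action and observation
% strategy must do at least as well as any alternative pair <a, os>.
\[
  Q^{\pi}\bigl(\langle n, b\rangle, \langle \psi(n), \eta(n, \cdot)\rangle\bigr)
  \;\ge\;
  Q^{\pi}\bigl(\langle n, b\rangle, \langle a, os\rangle\bigr)
  \qquad \forall\, a \in A,\ os \in N^{Z}
\]
```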
DP Update Based Approach
Comes from the generalized Howard's policy improvement theorem (Hansen, 1998):
If an FSC policy is not optimal, the DP update transforms it into an FSC policy with a value function that is as good or better for every belief state and better for some belief state.
Comparison
MMV Method (max-margin between values)
MMFE Method (max-margin between feature expectations)
PRJ Method
Approximated using the projection method!
Experimental Results
Domains: Tiger, 1d Maze, 5×5 Grid World, Heaven/Hell, Rock Sample
Illustration
Characteristics
Results from Policy
Results from Trajectories
Questions?
Backup Slides
Inverse Reinforcement Learning
Given: measurements of an agent's behaviour over time, in a variety of circumstances; measurements of the sensory inputs to the agent; a model of the physical environment (including the agent's body).
Determine: the reward function that the agent is optimizing.
Russell (1998)
Partially Observable Environment
A mathematical framework for single-agent planning under uncertainty.
The agent cannot directly observe the underlying states.
Example: studying global warming from your grandfather's diary!
Advantages of IRL
A natural way to examine animal and human behaviour.
The reward function is the most transferable representation of an agent's behaviour.
MDP
Models a sequential decision-making problem.
Five-tuple <S, A, T, R, γ>:
S – finite set of states
A – finite set of actions
T – state transition function, T: S×A → Π(S)
R – reward function, R: S×A → ℝ
γ – discount factor in [0, 1)
Q^π(s,a) = R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s')
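A minimal Python sketch of evaluating Q^π from this equation for a finite MDP; the array shapes and the deterministic policy encoding are assumptions for illustration:

```python
import numpy as np

# T: transitions with shape (|S|, |A|, |S|); R: rewards (|S|, |A|);
# pi: integer array, pi[s] = the policy's action in state s.
def policy_q(T, R, pi, gamma=0.95):
    n_states = T.shape[0]
    idx = np.arange(n_states)
    # Solve V^pi = R_pi + gamma * T_pi @ V^pi as a linear system.
    V = np.linalg.solve(np.eye(n_states) - gamma * T[idx, pi], R[idx, pi])
    # Q^pi(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V^pi(s')
    return R + gamma * T @ V
```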
POMDP
Models a partially observable environment.
Eight-tuple <S, A, Z, T, O, R, b₀, γ>:
Z – finite set of observations
O – observation function, O: S×A → Π(Z)
b₀ – initial state distribution
Belief b – b(s) is the probability that the state is s at the current time step.
(The belief is introduced to reduce the complexity caused by tracking the full history of the action-observation sequence.)
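A minimal Python sketch of the standard belief update implied by this definition (array shapes and indexing conventions are assumptions):

```python
import numpy as np

# T: (|S|, |A|, |S|) transition function; O: (|S|, |A|, |Z|) observation
# function, O[s2, a, z] = probability of observing z after reaching s2 via a.
def belief_update(b, a, z, T, O):
    predicted = b @ T[:, a, :]       # sum_s T(s, a, s') * b(s)
    b_next = O[:, a, z] * predicted  # weight by observation likelihood
    return b_next / b_next.sum()     # normalize to a distribution
```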
Finite State Controller (FSC)
A policy in a POMDP is represented using an FSC, a directed graph <N, E>:
each node n ∈ N is associated with an action a ∈ A
each edge e ∈ E is an outgoing edge per observation z ∈ Z
π = <ψ, η>, where ψ is the action strategy and η is the observation strategy.
Q^π(<n,b>, <a,os>) = ∑_s b(s) Q^π(<n,s>, <a,os>)
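A minimal Python sketch of an FSC as a data structure; the class, its field names, and the toy controller are illustrative, not from the paper:

```python
from dataclasses import dataclass

# pi = <psi, eta>: psi is the action strategy (node -> action) and
# eta the observation strategy ((node, observation) -> successor node).
@dataclass
class FSCPolicy:
    psi: dict   # n -> a
    eta: dict   # (n, z) -> n'

    def act(self, node):
        """Action selected at controller node n."""
        return self.psi[node]

    def next_node(self, node, observation):
        """Successor node after observing z in node n."""
        return self.eta[(node, observation)]

# Hypothetical two-node controller over two observations.
pi = FSCPolicy(psi={0: 1, 1: 0},
               eta={(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1})
print(pi.act(0), pi.next_node(0, 1))  # action 1, then move to node 1
```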
Using the Projection (PRJ) Method