
Meeting 3: POMDP (Partially Observable MDP). 阮鶴鳴, 李運寰 (CSIE, 4th year). Advisor: Prof. 李琳山



Page 1:

Meeting 3
POMDP (Partially Observable MDP)
阮鶴鳴, 李運寰 (CSIE, 4th year). Advisor: Prof. 李琳山

Page 2:

Reference

"Planning and Acting in Partially Observable Stochastic Domains," Leslie Pack Kaelbling, Michael L. Littman, Anthony R. Cassandra; Artificial Intelligence, 1998.

"Spoken Dialogue Management Using Probabilistic Reasoning," Nicholas Roy, Joelle Pineau, and Sebastian Thrun; ACL 2000.

Page 3:

MDP (Markov Decision Process)

An MDP model contains:
– A set of states S
– A set of actions A
– A state transition function T
  Deterministic or stochastic
– A reward function R(s, a)

Page 4:

MDP

For MDPs we can compute the optimal policy π and use it to act by simply executing π(s) for the current state s.
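To make that concrete, here is a minimal value-iteration sketch on a made-up two-state, two-action MDP; the numbers and variable names are illustrative, not from the slides.

```python
import numpy as np

# Toy MDP, for illustration only.
# T[a, s, s2] = P(s2 | s, a); R[s, a] = immediate reward.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]],     # action 0
              [[0.5, 0.5],
               [0.0, 1.0]]])    # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):
    # Q(s, a) = R(s, a) + gamma * sum_s2 T(s, a, s2) * V(s2)
    Q = R + gamma * np.einsum('asn,n->sa', T, V)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)        # pi(s): the best action in each state
s = 0
print("pi(s) =", policy[s])      # acting is just executing pi(s) in the current state
```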

What happens if the agent is no longer able to determine the state it is currently in with complete reliability?

Page 5:

POMDP

A POMDP model contains:
– A set of states S
– A set of actions A
– A state transition function T
– A reward function R(s, a)
– A finite set of observations Ω
– An observation function O: S × A → Π(Ω)
  O(s', a, o) is the probability of observing o after taking action a and landing in state s'

Page 6:

POMDP Problem

1. Belief state
– First approach: choose the most probable state of the world, given past experience
  Informational properties are described via observations
  Not explicit
– Second approach: maintain probability distributions over states of the world (see the update rule below)
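In the second approach the belief is maintained by the standard Bayesian update; written in the notation of these slides (following the cited Kaelbling et al. paper), it reads:

```latex
b'(s') \;=\; \frac{O(s', a, o)\,\sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)},
\qquad
\Pr(o \mid a, b) \;=\; \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)
```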

Page 7:

An example
Actions: EAST and WEST
– Each succeeds with probability 0.9, and when it fails the movement is in the opposite direction. If no movement is possible in a particular direction, the agent remains in the same location.
– Initially [0.33, 0.33, 0, 0.33]
– After taking one EAST movement: [0.1, 0.45, 0, 0.45]
– After taking another EAST movement: [0.1, 0.164, 0, 0.736]
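A minimal sketch that reproduces the numbers above, assuming the four states sit in a row with the third state being a goal the agent can recognize when it is there (that layout is inferred from the example, not stated on the slide):

```python
import numpy as np

N = 4              # four states in a row; index 2 is the recognizable goal
P_SUCCESS = 0.9    # intended move succeeds, otherwise the agent moves the opposite way

def east_transition():
    """T[s, s2] = P(s2 | s, EAST) for the 4-state corridor."""
    T = np.zeros((N, N))
    for s in range(N):
        east = min(s + 1, N - 1)      # bumping into the wall -> stay
        west = max(s - 1, 0)
        T[s, east] += P_SUCCESS
        T[s, west] += 1.0 - P_SUCCESS
    return T

def update(belief, T):
    """One EAST step followed by the observation 'not at the goal'."""
    predicted = belief @ T            # push the belief through the dynamics
    predicted[2] = 0.0                # the goal was not observed
    return predicted / predicted.sum()

b = np.array([1/3, 1/3, 0.0, 1/3])
T = east_transition()
b = update(b, T)
print(np.round(b, 3))   # [0.1, 0.45, 0, 0.45]
b = update(b, T)
print(np.round(b, 3))   # [0.1, 0.164, 0, 0.736]
```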

Page 8:

POMDP Problem

2. Finding an optimal policy:
– Maps belief states to actions

Page 9:

Policy Tree

A tree of depth t that specifies a complete t-step policy (a minimal representation is sketched below)
– Nodes: actions; the top node determines the first action to be taken
– Edges: the resulting observations
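One minimal way to represent such a tree in code (a sketch; the class and field names are my own, not from the paper):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PolicyTree:
    """A t-step policy tree: an action at the root, and one
    (t-1)-step subtree for every possible observation."""
    action: str
    subtrees: Dict[str, "PolicyTree"] = field(default_factory=dict)

    def act(self, observation_sequence):
        """Follow the observations down the tree, returning the actions executed."""
        node, actions = self, [self.action]
        for o in observation_sequence:
            node = node.subtrees[o]
            actions.append(node.action)
        return actions
```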

Page 10:

Sample Policy Tree

Page 11:

Policy Tree

Value Evaluation:
– Vp(s) is the t-step value of starting from state s and executing policy tree p (recursion below)
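The equation on this slide is an image that did not survive extraction; following the cited paper's formulation, the recursion is presumably:

```latex
V_p(s) \;=\; R\bigl(s, a(p)\bigr)
 \;+\; \gamma \sum_{s' \in S} T\bigl(s, a(p), s'\bigr)
 \sum_{o \in \Omega} O\bigl(s', a(p), o\bigr)\, V_{o(p)}(s')
```

where a(p) is the action at the root of p and o(p) is the subtree followed after observing o.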

Page 12:

Policy Tree

Value Evaluation:
– Expected value under policy tree p (equation below)
– Expected value of executing different policy trees from different initial belief states
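The formulas on this slide are likewise missing; reconstructed in the paper's notation they would read:

```latex
V_p(b) \;=\; \sum_{s \in S} b(s)\, V_p(s) \;=\; b \cdot \alpha_p,
\qquad \alpha_p = \bigl(V_p(s_1), \ldots, V_p(s_{|S|})\bigr),
\qquad
V_t(b) \;=\; \max_{p \in \mathcal{P}_t} \; b \cdot \alpha_p
```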

Page 13:

Policy Tree

Value Evaluation:
– Vt with only two states:

Page 14:

Policy Tree

Value Evaluation:
– Vt with three states:

Page 15:

Infinite Horizon

Three algorithms to compute V:
– Naive approach
– Improved by choosing useful policy trees
– Witness algorithm

Page 16:

Infinite Horizon

Naive approach:
– ε is a small number
– Each t-step policy tree contains a number of nodes given by the count below
– Each node can be labeled with |A| possible actions
– Total number of policy trees: given by the count below
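The counts on the slide were images; for a complete tree of depth t with one branch per observation, they are presumably the standard ones:

```latex
\text{nodes per tree: } \sum_{i=0}^{t-1} |\Omega|^i \;=\; \frac{|\Omega|^t - 1}{|\Omega| - 1},
\qquad
\text{number of distinct trees: } |A|^{\left(|\Omega|^t - 1\right)/\left(|\Omega| - 1\right)}
```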

Page 17:

Infinite Horizon

Improved by choosing useful policy trees:
– Vt-1, the set of useful (t – 1)-step policy trees, can be used to construct a superset Vt+ of the useful t-step policy trees
– There are |A|·|Vt-1|^|Ω| elements in Vt+ (see the enumeration sketch below)
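A sketch of this construction (reusing the hypothetical PolicyTree class from the earlier sketch), which makes the |A|·|Vt-1|^|Ω| count explicit:

```python
from itertools import product

def enumerate_vt_plus(actions, observations, useful_prev_trees):
    """Build the superset V_t^+: every action at the root, combined with every
    way of assigning a useful (t-1)-step tree to each observation."""
    vt_plus = []
    for a in actions:
        for choice in product(useful_prev_trees, repeat=len(observations)):
            vt_plus.append(PolicyTree(a, dict(zip(observations, choice))))
    return vt_plus   # |A| * |V_{t-1}| ** |Omega| trees
```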

Page 18:

Infinite Horizon

Improved by choosing useful policy trees:

Page 19:

Infinite Horizon

Witness algorithm:

Page 20:

Infinite Horizon

Witness algorithm:
– Writing Qt_a for the set of t-step policy trees that have action a at their root,
– Qt_a(b) is its value function over belief states,
– and Vt(b) = max_a Qt_a(b)

Page 21:

Infinite Horizon

Witness algorithm:
– Finding a witness:
  At each iteration we ask: is there some belief state b for which the true value Qt_a(b), computed by one-step lookahead using Vt-1, differs from the estimated value computed using the current set U of policy trees?

Page 22:

Infinite Horizon

Witness algorithm:
– Finding a witness:
  Now we can state the witness theorem [25]: the true Q-function Qt_a differs from the approximate Q-function represented by U if and only if there is some tree p in U, some observation o, and some (t – 1)-step tree p' in Vt-1 for which there is some belief state b such that p_new, the tree obtained from p by replacing its o-subtree with p', has greater value at b than every tree in U.

Page 23:

Infinite Horizon

Witness algorithm:
– Finding a witness:

Page 24:

Infinite Horizon

Witness algorithm:
– Finding a witness:
  The linear program used to find witness points:
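The LP itself is missing from the extracted text; the witness linear program from the cited paper, for a candidate tree p_new against the current set U, is presumably:

```latex
\begin{aligned}
\max_{b,\ \delta}\quad & \delta \\
\text{s.t.}\quad & b \cdot \alpha_{p_{\mathrm{new}}} \;\ge\; b \cdot \alpha_{p} + \delta
      && \text{for all } p \in U \\
 & \textstyle\sum_{s \in S} b(s) = 1, \qquad b(s) \ge 0 .
\end{aligned}
```

If the optimum has δ > 0, the maximizing b is a witness point.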

Page 25:

Infinite Horizon

Witness algorithm:
– Complete value iteration (a sketch of the loop follows this list):
  Start with an agenda containing any single policy tree and a set U that will hold the desired policy trees
  Use p_new to determine whether it is an improvement over the policy trees in U
  1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
  2. If a witness point is discovered, the best policy tree for that point is calculated and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.
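A rough sketch of that loop; the helpers find_witness, best_tree_at, and neighbours are hypothetical placeholders standing in for the steps described above, not code from the paper:

```python
def witness_value_iteration_step(initial_tree, find_witness, best_tree_at, neighbours):
    """One step of witness-style value iteration, following the slide's outline.

    find_witness(tree, U) -> a belief b where `tree` beats everything in U, or None
    best_tree_at(b)       -> the best t-step policy tree at belief b (one-step lookahead)
    neighbours(tree)      -> all trees differing from `tree` in a single subtree
    """
    agenda = [initial_tree]             # any single policy tree
    U = []                              # the useful policy trees found so far
    while agenda:
        p_new = agenda[-1]              # examine the tree at the top of the agenda
        b = find_witness(p_new, U)
        if b is None:
            agenda.pop()                # no witness: remove it; empty agenda -> done
            continue
        best = best_tree_at(b)          # witness found: add the best tree at b to U ...
        U.append(best)
        agenda.extend(neighbours(best)) # ... and its single-subtree variants to the agenda
    return U
```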

Page 26:

Infinite Horizon

Witness algorithm:
– Complexity:
  We know that no more than … witness points are discovered (each adds a tree to the set of useful policy trees),
  so only … trees can ever be added to the agenda (in addition to the one tree in the initial agenda).
  Each of these linear programs either removes a policy tree from the agenda (this happens at most … times) or discovers a witness point (this happens at most … times).

Page 27:

Tiger Problem
Two doors:
– Behind one door is a tiger
– Behind the other door is a large reward
Two states:
– the state of the world with the tiger on the left is s_l, and with the tiger on the right, s_r
Three actions:
– left, right, and listen
Rewards:
– the reward for opening the correct door is +10, the penalty for opening the door with the tiger behind it is -100, and the cost of listening is -1
Observations:
– hearing the tiger on the left (T_l) or hearing the tiger on the right (T_r)
– in state s_l, the listen action results in observation T_l with probability 0.85 and observation T_r with probability 0.15; conversely for world state s_r (see the belief-update sketch below)
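A minimal sketch of how listening shifts the belief in this problem, using the 0.85 reliability from the slide; the uniform starting belief is an assumption:

```python
# Tiger problem: belief over (s_l, s_r); listening is 85% reliable.
P_CORRECT = 0.85

def listen_update(belief, heard_left):
    """Bayes update of (P(s_l), P(s_r)) after one 'listen' observation."""
    p_l, p_r = belief
    if heard_left:                      # observation T_l
        post = (P_CORRECT * p_l, (1 - P_CORRECT) * p_r)
    else:                               # observation T_r
        post = ((1 - P_CORRECT) * p_l, P_CORRECT * p_r)
    total = post[0] + post[1]
    return (post[0] / total, post[1] / total)

b = (0.5, 0.5)                          # assumed uniform prior
b = listen_update(b, heard_left=True)   # -> (0.85, 0.15)
b = listen_update(b, heard_left=True)   # -> (~0.97, ~0.03)
print(b)
```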

Page 28:

Tiger Problem

Page 29:

Tiger Problem

Page 30:

Tiger Problem

Decreasing listening reliability from 0.85 down to 0.65:

Page 31:

The End