9/23


Page 1:

9/23

Page 2:

Announcements

• Homework 1 returned today (avg 27.8; highest 37)
  – Homework 2 due Thursday
• Homework 3 socket to open today
• Project 1 due Tuesday
  – A Java offer was made for the Lisp-challenged (but no one took it!)
• TA office hours change?
  – Change from Wed to Tue 10:30-12? (starting next week)
  – Feedback survey going on: http://rakaposhi.eas.asu.edu/cse471/f03-survey1.html
• After MDPs: adversarial search?

Page 3:

MDPs as Utility-based problem solving agents

[Figure: agent's "Repeat" loop]

Page 4:

[can generalize to have action costs C(a,s)]

If the Mij matrix (the transition probabilities) is not known a priori, then we have a reinforcement learning scenario…

[Figure: agent's "Repeat" loop]
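As a rough, self-contained sketch of the ingredients just listed (the transition matrix Mij, the rewards, the optional action costs C(a,s), and a discount factor), here is one way to package an MDP in Python; the container fields and the two-state toy numbers are invented for illustration and are not from the slides:

# A minimal MDP container (field names and the toy numbers are invented for illustration).
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: list          # S
    actions: list         # A
    T: dict               # T[(s, a)] = {s': P(s'|s,a)}, i.e., the M^a_ij entries
    R: dict               # R[s]: immediate reward for being in s
    C: dict = field(default_factory=dict)  # optional action costs C(a, s)
    gamma: float = 0.9    # discount factor

toy = MDP(
    states=["s1", "s2"],
    actions=["a1", "a2"],
    T={("s1", "a1"): {"s1": 0.2, "s2": 0.8},
       ("s1", "a2"): {"s1": 1.0},
       ("s2", "a1"): {"s2": 1.0},
       ("s2", "a2"): {"s1": 0.5, "s2": 0.5}},
    R={"s1": 0.0, "s2": 1.0},
)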

Page 5:

Think of these as h*() values (or as related to h* values); this is called the value function U*

[Figure: agent's "Repeat" loop]

Page 6:

Policies change with rewards…


Page 7:

What does a solution to an MDP look like?

• The solution should tell the optimal action to do in each state (called a "Policy")
  – A policy is a function from states to actions (*see the finite-horizon case below*)
  – Not a sequence of actions anymore
    • Needed because of the non-deterministic actions
  – If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
• How do we get the best policy?
  – Pick the policy that gives the maximal expected reward
  – For each policy (a rough sketch follows below)
    • Simulate the policy (take the actions suggested by the policy) to get behavior traces
    • Evaluate the behavior traces
    • Take the average value of the behavior traces
• How long should behavior traces be?
  – Each trace is no longer than k (finite-horizon case)
    • The policy will be horizon-dependent (the optimal action depends not just on what state you are in, but on how far away your horizon is)
      – E.g., financial portfolio advice for yuppies vs. retirees
  – No limit on the size of the trace (infinite-horizon case)
    • The policy is not horizon-dependent
• Qn: Is there a simpler way than having to evaluate all |A|^|S| policies? Yes…

We will concentrate on infinite-horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state)
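To make the "simulate, evaluate, average" recipe above concrete, here is a minimal sketch reusing the toy MDP container from the earlier sketch; the helper names and the use of discounting are my own choices, not the slides':

import random

def simulate_trace(mdp, policy, start, max_steps=100):
    """Follow the policy from `start`, sampling successors from T; return the discounted reward of the trace."""
    s, total, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        total += discount * mdp.R[s]
        succ = mdp.T.get((s, policy[s]))
        if not succ:              # sink/terminal state: the trace ends here
            break
        states, probs = zip(*succ.items())
        s = random.choices(states, probs)[0]
        discount *= mdp.gamma
    return total

def evaluate_policy(mdp, policy, start, n_traces=1000):
    """Estimate the policy's value as the average value of the sampled behavior traces."""
    return sum(simulate_trace(mdp, policy, start) for _ in range(n_traces)) / n_traces

# e.g., evaluate_policy(toy, {"s1": "a1", "s2": "a1"}, start="s1")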

Page 8:

Page 9:

(Value)

How about the deterministic case? U(si) is then just the (length of the) shortest path to the goal
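A tiny check of that claim (the 4-node graph and the reward of -1 per step are invented for this example): with deterministic actions, 0 reward at the goal, and -1 per step, the Bellman update collapses to a shortest-path computation, so -U(s) is the distance to the goal.

# Deterministic Bellman update: U(s) = -1 + max over successors U(s'), with U(goal) = 0.
succ = {"A": ["B"], "B": ["C", "G"], "C": ["G"], "G": []}   # invented graph; G is the goal
U = {s: 0.0 for s in succ}
for _ in range(10):                     # enough sweeps for this tiny graph
    for s in succ:
        if succ[s]:                     # the goal G keeps U = 0
            U[s] = -1 + max(U[t] for t in succ[s])
print(U)   # {'A': -2.0, 'B': -1.0, 'C': -1.0, 'G': 0.0}  ->  -U(s) = shortest path length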

Page 10:

[Figure: stochastic action model with outcome probabilities 0.8, 0.1, 0.1]

Page 11:

Why are values coming down first? Why are some states reaching their optimal value faster?

Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often

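A minimal sketch of the two update schedules, assuming the toy MDP container from the earlier sketch (the function names and the fixed number of sweeps are my own):

def bellman_backup(mdp, U, s):
    """One Bellman backup: R(s) + gamma * max_a sum_s' P(s'|s,a) * U(s')."""
    best = max(sum(p * U[t] for t, p in mdp.T.get((s, a), {s: 1.0}).items())
               for a in mdp.actions)     # missing (s,a) entries are treated as self-loops here
    return mdp.R[s] + mdp.gamma * best

def value_iteration(mdp, sweeps=50, asynchronous=False):
    """Synchronous sweeps read the previous iteration's vector; asynchronous (in-place)
    sweeps read freshly updated values. Both converge if every state keeps getting updated."""
    U = {s: 0.0 for s in mdp.states}
    for _ in range(sweeps):
        source = U if asynchronous else dict(U)
        for s in mdp.states:
            U[s] = bellman_backup(mdp, source, s)
    return U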

Page 12:

Terminating Value Iteration

• The basic idea is to terminate the value iteration when the values have "converged" (i.e., are not changing much from iteration to iteration)
  – Set a threshold ε and stop when the change across two consecutive iterations is less than ε
  – There is a minor problem, since the value is a vector
    • We can bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε
    • The max norm ||.|| of a vector is the maximal value among all its dimensions. We are basically terminating when ||Ui - Ui+1|| < ε (see the sketch below)
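A sketch of this termination test, reusing the bellman_backup helper from the value-iteration sketch above (the epsilon value and sweep cap are arbitrary choices):

def value_iteration_until_converged(mdp, epsilon=1e-4, max_sweeps=10000):
    """Stop when the max-norm change between two consecutive iterations drops below epsilon."""
    U = {s: 0.0 for s in mdp.states}
    for _ in range(max_sweeps):
        new_U = {s: bellman_backup(mdp, U, s) for s in mdp.states}
        if max(abs(new_U[s] - U[s]) for s in mdp.states) < epsilon:   # ||U_i - U_{i+1}|| < epsilon
            return new_U
        U = new_U
    return U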

Page 14:

Policies converge earlier than values

• There are a finite number of policies but an infinite number of value functions
• So entire regions of the value-vector space are mapped to a specific policy
• So policies may be converging faster than values. Search in the space of policies
• Given a utility vector Ui we can compute the greedy policy πi
• The policy loss of πi is ||U^πi - U*|| (the max-norm difference of two vectors is the maximum amount by which they differ on any dimension)
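A rough sketch of extracting the greedy policy from a utility vector and of the max norm used for the policy loss (again assuming the toy MDP container from earlier; the helper names are mine):

def greedy_policy(mdp, U):
    """The greedy policy induced by a utility vector U: in each state, pick the action
    with the best expected next value."""
    return {s: max(mdp.actions,
                   key=lambda a: sum(p * U[t]
                                     for t, p in mdp.T.get((s, a), {s: 1.0}).items()))
            for s in mdp.states}

def max_norm(U, V):
    """Max-norm distance between two value vectors, as in the policy loss ||U^pi - U*||."""
    return max(abs(U[s] - V[s]) for s in U)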

[Figure: the value space with axes V(S1) and V(S2) for an MDP with 2 states and 2 actions; the space is partitioned into regions P1, P2, P3, P4, each mapped to a specific policy, with U* marked]

Page 15:

We can either solve the linear equations exactly, or solve them approximately by running the value iteration a few times (the update won't have the "max" operation)

n linear equations with n unknowns.
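For the exact route, a sketch that solves the n-by-n linear system U = R + gamma * T_pi * U for a fixed policy with NumPy (the MDP container is the toy one from the earlier sketch):

import numpy as np

def policy_evaluation_exact(mdp, policy):
    """Solve (I - gamma * T_pi) U = R, where T_pi[i, j] = P(s_j | s_i, policy(s_i))."""
    n = len(mdp.states)
    idx = {s: i for i, s in enumerate(mdp.states)}
    T_pi = np.zeros((n, n))
    for s in mdp.states:
        for t, p in mdp.T.get((s, policy[s]), {s: 1.0}).items():
            T_pi[idx[s], idx[t]] = p
    R = np.array([mdp.R[s] for s in mdp.states])
    U = np.linalg.solve(np.eye(n) - mdp.gamma * T_pi, R)
    return dict(zip(mdp.states, U))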

Page 16:

Other ways of solving MDPs

• Value and policy iteration are the bedrock methods for solving MDPs. Both give optimality guarantees
• Both of them tend to be very inefficient for large (several-thousand-state) MDPs
• Many ideas are used to improve the efficiency while giving up optimality guarantees
  – E.g., consider only the part of the policy covering the more likely states (envelope extension method)
  – Interleave "search" and "execution" (Real Time Dynamic Programming)
    • Do a limited-depth analysis based on reachability to find the value of a state (and thereby the best action you should be doing, which is the action that sends you to the best value); see the sketch below
    • The values of the leaf nodes are set to their immediate rewards
    • If all the leaf nodes are terminal nodes, then the backed-up value will be the true optimal value; otherwise, it is an approximation…

[Figure: RTDP limited-depth lookahead]
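A rough sketch of the limited-depth backup described above, as a recursive lookahead over the toy MDP container (the depth bound and helper names are my own; a real RTDP implementation would also cache and update the values it computes):

def lookahead_value(mdp, s, depth):
    """Depth-limited backup: leaf (or terminal) nodes get their immediate reward;
    internal nodes take the Bellman max over their reachable successors."""
    applicable = [a for a in mdp.actions if (s, a) in mdp.T]
    if depth == 0 or not applicable:
        return mdp.R[s]
    return mdp.R[s] + mdp.gamma * max(
        sum(p * lookahead_value(mdp, t, depth - 1) for t, p in mdp.T[(s, a)].items())
        for a in applicable)

def best_action(mdp, s, depth=3):
    """Pick the action whose successors back up the best depth-limited value."""
    return max((a for a in mdp.actions if (s, a) in mdp.T),
               key=lambda a: sum(p * lookahead_value(mdp, t, depth - 1)
                                 for t, p in mdp.T[(s, a)].items()))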

Page 17:

What if you see this as a game? The expected-value computation is fine if you are maximizing "expected" return. But what if you are risk-averse (and think "nature" is out to get you)? Then V2 = min(V3, V4).

If you are a perpetual optimist, then V2 = max(V3, V4).
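The three backup choices side by side, with placeholder numbers:

# Successors V3 and V4 are reached from V2 with probabilities p3 and p4 (values are placeholders).
p3, p4, V3, V4 = 0.5, 0.5, 10.0, 2.0
expected    = p3 * V3 + p4 * V4   # maximize expected return
risk_averse = min(V3, V4)         # assume "nature" picks the worst outcome
optimistic  = max(V3, V4)         # assume nature picks the best outcome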

Page 18:

Incomplete observability (the dreaded POMDPs)

• To model partial observability, all we need to do is look at the MDP in the space of belief states (belief states are fully observable even when world states are not)
  – The policy maps belief states to actions
• In practice, this causes (humongous) problems
  – The space of belief states is "continuous" (even if the underlying world is discrete and finite). {GET IT? GET IT??}
  – Even approximate policies are hard to find (PSPACE-hard)
    • Problems with a few dozen world states are hard to solve currently
  – "Depth-limited" exploration (such as that done in adversarial games) is the only option…

Belief state = { s1: 0.3, s2: 0.4, s4: 0.3 }

This figure basically shows that belief states change as we take actions; see the sketch below

[Figure: belief states after 5 LEFTs and after 5 UPs]
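A minimal sketch of pushing a belief state through one action using the transition model (prediction step only; folding in an observation, which would multiply by P(obs | s') and renormalize, is omitted, and the function name is my own):

def belief_update(mdp, belief, action):
    """New belief over states after taking `action`: b'(s') = sum_s b(s) * P(s'|s,action)."""
    new_belief = {}
    for s, p in belief.items():
        for t, q in mdp.T.get((s, action), {s: 1.0}).items():
            new_belief[t] = new_belief.get(t, 0.0) + p * q
    return new_belief

# b = {"s1": 0.3, "s2": 0.4, "s4": 0.3}   # the belief state from the slide
# repeated calls to belief_update model sequences like "5 LEFTs" or "5 UPs"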

Page 19:

MDPs and Deterministic Search

• Problem-solving agent search corresponds to what special case of MDP?
  – Actions are deterministic; goal states are all equally valued, and are all sink states.
• Is it worth solving the problem using MDPs?
  – The construction of the optimal policy is an overkill
    • The policy, in effect, gives us the optimal path from every state to the goal state(s)
  – The value function, or its approximations, on the other hand, are useful. How?
    • As heuristics for the problem-solving agent's search (see the sketch below)
• This shows an interesting connection between the dynamic programming and "state search" paradigms
  – DP solves many related problems on the way to solving the one problem we want
  – State search tries to solve just the problem we want
  – We can use DP to find heuristics to run state search…
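As a small illustration of that last point (assuming the deterministic, unit-step-cost setting from the earlier shortest-path sketch, where -U*(s) equals the remaining distance to the goal):

def heuristic_from_values(U):
    """Turn a value function into a heuristic for forward state search: with unit step
    costs and U*(goal) = 0, -U*(s) is the remaining distance, so it can be used as h(s).
    An approximate U gives an approximate (possibly inadmissible) heuristic."""
    return lambda s: max(0.0, -U[s])

# e.g., with U from the shortest-path sketch: h = heuristic_from_values(U); h("A") == 2.0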