CS 7180: Behavioral Modeling and Decision-making in AI
Markov Decision Processes for Complex Decision-making
Prof. Amy Sliva, October 17, 2012


Page 1 (title slide)

Page 2

Decisions are nondeterministic
• In many situations, behavior and decisions may have more than one possible outcome
• Action failures
  • e.g., gripper drops its load
• Exogenous events
  • e.g., road closed
• Still need to be able to plan and make decisions in such situations
• One approach: Markov Decision Processes

[Figure: blocks a, b, c with action grasp(c); the intended outcome has c in the gripper, the unintended outcome leaves only a and b on the table (the gripper drops its load)]

Page 3

Decision-making under uncertainty
• Decisions traditionally represented in decision trees
• Example decision problem: Should I have my party inside or outside?
• Limitations of decision trees
  • Exponential in the number of decisions
  • What if there are many stages to consider?
  • What if there are many different outcomes?
• Markov decision processes (MDPs)
  • Framework for complex multi-stage decision problems under uncertainty
  • More compact representation of the problem, with efficient solutions
  • A further specialization of influence diagrams
  • No explicit representation of qualitative dependencies

[Figure: decision tree for the party problem; decision in/out, chance outcome dry/wet; in and dry = Regret, in and wet = Relieved, out and dry = Perfect!, out and wet = Disaster]

Page 4

Stochastic systems of actions
• We have already learned about stochastic systems
  • Markov chain/Markov process
• Stochastic system: a triple Σ = (S, A, P)
  • S = finite set of states
  • A = finite set of actions
  • Pa(s′ | s) = probability of going to s′ if we execute a in s
  • ∑s′ ∈ S Pa(s′ | s) = 1
• Several different possible action representations
  • e.g., Bayes networks (or influence diagrams), probabilistic state-space operators, etc.
  • Same underlying semantics
• Fully specified transition model: explicit enumeration of each Pa(s′ | s)

Page 5

MDPs are stochastic systems with utility
• Model the dynamics of the environment under different actions
• Formally defined as a 4-tuple MDP = (S, A, T, R) (sketched in code below)
• Markov assumption: next state depends only on current state and action
• Full observability: cannot predict exactly which state will result from an action, but once it is realized we know what it is
• Components:
  • S = set of states (environmental context)
  • A = set of actions (possible behaviors)
  • T = transition model, T(s, a, s′) = Pa(s′ | s): what states can result from an action?
  • R = reward model, R(s) and C(s, a): a reward for each state, a cost for each state and action
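To make the 4-tuple concrete, here is a minimal Python sketch of an MDP container. This is not from the slides; the field names and the dict-based transition encoding are illustrative assumptions.

```python
# Minimal sketch of MDP = (S, A, T, R) as described above; all names are illustrative.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                              # S: environmental contexts
    actions: Dict[str, List[str]]                  # A(s): behaviors available in state s
    T: Dict[Tuple[str, str], Dict[str, float]]     # T[(s, a)] = {s': Pa(s' | s)}
    R: Dict[str, float]                            # R(s): reward for each state
    C: Dict[Tuple[str, str], float]                # C[(s, a)]: cost for each state and action

    def successors(self, s: str, a: str) -> Dict[str, float]:
        """Return the distribution Pa(s' | s) over next states."""
        return self.T[(s, a)]
```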

Page 6

Graphical MDP representation
• Nodes are possible states of the world
  • E.g., location of the robot
• Arcs are actions
  • E.g., move to a new location
• Each arc has an associated transition probability
• Each state has an associated reward or cost

Page 7

First-order Markov dynamics and rewards
• First-order Markov dynamics (history independence)
  • P(s_t+1 | a_t, s_t, a_t−1, s_t−1, …, s_0) = P(s_t+1 | a_t, s_t)
  • Next state only depends on current state and current action
• First-order Markov reward process
  • P(r_t | a_t, s_t, a_t−1, s_t−1, …, s_0) = P(r_t | a_t, s_t)
  • Reward only depends on current state and action
  • Assume reward is specified by a deterministic function R(s)
• Stationary dynamics and reward
  • P(s_t+1 | a_t, s_t) = P(s_k+1 | a_k, s_k) for all t, k
  • The world dynamics do not change over time

Page 8

Planning actions in MDPs
• Robot r1 starts at location l1
  • State s1 in the diagram
• Objective is to get r1 to location l4
  • State s4 in the diagram

[Figure: navigation domain; Start = s1, Goal = s4]

Page 9

Probability of initial states
• For every state s, there will be a probability P(s) that the system starts in s
• Sometimes assume a unique state s0 such that the system always starts in s0
• In the example, s0 = s1
  • P(s1) = 1
  • P(s) = 0 for all s ≠ s1

[Figure: navigation domain; Start = s1, Goal = s4]

Page 10

Planning actions in MDPs
• Robot r1 starts at location l1 (state s1 in the diagram)
• Objective is to get r1 to location l4 (state s4 in the diagram)
• No classical plan (sequence of actions) can be a solution
  • Why not?

[Figure: navigation domain; Start = s1, Goal = s4]

Page 11

Planning actions in MDPs
• Robot r1 starts at location l1 (state s1 in the diagram)
• Objective is to get r1 to location l4 (state s4 in the diagram)
• No classical plan (sequence of actions) can be a solution
  • No guarantee we'll be in a state where the next action is applicable
  • π = 〈move(r1,l1,l2), move(r1,l2,l3), move(r1,l3,l4)〉

[Figure: navigation domain; Start = s1, Goal = s4]

Page 12

Policies for choosing actions
• A policy is a function mapping states to actions (both forms are sketched in code below)
• Stationary policy
  • π: S → A
  • π(s) is the action to do at state s (regardless of time): a set of state-action pairs
  • Specifies a continuously reactive controller
• Nonstationary policy
  • π: S × N → A, where N is the non-negative integers
  • π(s, t) is the action to do at state s with t stages-to-go
  • What if we want to keep acting indefinitely?
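A small sketch of how the two kinds of policy might be represented. The stationary policy shown is π1 from the example on Page 13; the nonstationary wrapper is an illustrative assumption.

```python
# Stationary policy: a set of state-action pairs, stored here as a dict.
pi1 = {"s1": "move(r1,l1,l2)", "s2": "move(r1,l2,l3)",
       "s3": "move(r1,l3,l4)", "s4": "wait", "s5": "wait"}

def pi_nonstationary(s: str, t: int) -> str:
    """pi(s, t): action at state s with t stages to go.  This sketch ignores t,
    but a finite-horizon policy would in general choose differently for small t."""
    return pi1[s]
```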

Page 13

Example stationary policy
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
• Robot r1 starts at location l1 (state s1 in the diagram)
• Objective is to get r1 to location l4 (state s4 in the diagram)

[Figure: navigation domain; Start = s1, Goal = s4; wait self-loops at s4 and s5]

Page 14

Histories of previous states
• A history is a sequence of system states
  h = 〈s0, s1, s2, s3, s4, …〉
• Examples:
  h0 = 〈s1, s3, s1, s3, s1, …〉
  h1 = 〈s1, s2, s3, s4, s4, …〉
  h2 = 〈s1, s2, s5, s5, s5, …〉
  h3 = 〈s1, s2, s5, s4, s4, …〉
  h4 = 〈s1, s4, s4, s4, s4, …〉
  h5 = 〈s1, s1, s4, s4, s4, …〉
  h6 = 〈s1, s1, s1, s4, s4, …〉
  h7 = 〈s1, s1, s1, s1, s1, …〉
• Each policy induces a probability distribution over histories
  • If h = 〈s0, s1, …〉 then P(h | π) = P(s0) ∏i ≥ 0 Pπ(si)(si+1 | si)  (computed in the code sketch below)

[Figure: navigation domain; Start = s1, Goal = s4]
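The product formula above translates directly into code. A small sketch, where `P(s, a)` returning a dict of successor probabilities and `P0` for the initial-state probability are assumed interfaces:

```python
def history_probability(history, pi, P, P0):
    """P(h | pi) = P(s0) * prod_{i >= 0} P_{pi(s_i)}(s_{i+1} | s_i), for a finite prefix of h."""
    prob = P0(history[0])
    for s, s_next in zip(history, history[1:]):
        prob *= P(s, pi[s]).get(s_next, 0.0)   # transition probability under the policy's action
    return prob
```

With π1 and the 0.8/0.2 split on move(r1,l2,l3), the prefix 〈s1, s2, s3, s4〉 gives 0.8, matching the next slide.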

Page 15

Example history distribution
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
h1 = 〈s1, s2, s3, s4, s4, …〉   P(h1 | π1) = 1 × 1 × 0.8 × 1 × … = 0.8
h2 = 〈s1, s2, s5, s5, …〉       P(h2 | π1) = 1 × 1 × 0.2 × 1 × … = 0.2
P(h | π1) = 0 for all other h

[Figure: navigation domain; Start = s1, Goal = s4]

Page 16

Example history distribution (continued)
(Same distribution as on Page 15.)
π1 reaches the goal with probability 0.8

Page 17

Example history distribution
π2 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
h1 = 〈s1, s2, s3, s4, s4, …〉   P(h1 | π2) = 1 × 1 × 0.8 × 1 × … = 0.8
h3 = 〈s1, s2, s5, s4, s4, …〉   P(h3 | π2) = 1 × 1 × 0.2 × 1 × … = 0.2
P(h | π2) = 0 for all other h

[Figure: navigation domain; Start = s1, Goal = s4]

Page 18

Example history distribution (continued)
(Same distribution as on Page 17.)
π2 reaches the goal with probability 1.0

Page 19

Example history distribution
π3 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
h4 = 〈s1, s4, s4, s4, …〉       P(h4 | π3) = 0.5 × 1 × 1 × 1 × … = 0.5
h5 = 〈s1, s1, s4, s4, s4, …〉   P(h5 | π3) = 0.5 × 0.5 × 1 × 1 × … = 0.25
h6 = 〈s1, s1, s1, s4, s4, …〉   P(h6 | π3) = 0.5 × 0.5 × 0.5 × 1 × … = 0.125
  ⋮
h7 = 〈s1, s1, s1, s1, s1, …〉   P(h7 | π3) = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × … = 0

[Figure: navigation domain; Start = s1, Goal = s4]

Page 20

Example history distribution (continued)
(Same distribution as on Page 19.)
π3 reaches the goal with probability 1.0

Page 21

Determining the value of a policy
• How good is a policy π? How do we measure "accumulated" reward?
• Utility function V: S → ℝ associates a value with each state (or each state and time for a nonstationary π)
• Vπ(s) denotes the value of policy π at state s
  • Depends on the immediate reward, but also on what you achieve subsequently by following π
• An optimal policy is no worse than any other policy at any state
• The goal of MDP reasoning is to compute an optimal policy (the method depends on how we define utility)

Page 22

Compute utility using reward function
• Numeric cost C(s, a) for each state s and action a
• Numeric reward R(s) for each state s
• No explicit goals now: desirable states have high rewards
• Example:
  • C(s, wait) = 0 at every state except s3
  • C(s, a) = 1 for each "horizontal" action
  • C(s, a) = 100 for each "vertical" action
  • R as shown in the figure
• Utility of a history/policy:
  • If h = 〈s0, s1, …〉, then V(h | π) = ∑i ≥ 0 [R(si) – C(si, π(si))]

[Figure: navigation domain with Start = s1; rewards r = 100 at s4, r = –100 at s5, r = 0 elsewhere; move costs c = 1 (horizontal) and c = 100 (vertical)]

Page 23

Example utility computation
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
h1 = 〈s1, s2, s3, s4, s4, …〉
h2 = 〈s1, s2, s5, s5, …〉

V(h1 | π1) = [R(s1) – C(s1, π1(s1))] + [R(s2) – C(s2, π1(s2))] + [R(s3) – C(s3, π1(s3))] + [R(s4) – C(s4, π1(s4))] + [R(s4) – C(s4, π1(s4))] + …
           = [0 – 100] + [0 – 1] + [0 – 100] + [100 – 0] + [100 – 0] + … = ∞

V(h2 | π1) = [0 – 100] + [0 – 1] + [–100 – 0] + [–100 – 0] + [–100 – 0] + … = –∞

[Figure: navigation domain with Start = s1; rewards r = 100 at s4, r = –100 at s5, r = 0 elsewhere; move costs c = 1 (horizontal) and c = 100 (vertical); wait actions at s4 and s5]

Page 24

Discounted utility
• In a long history of states, distant rewards/costs have less influence on the current decision
• Often need to use a discount factor γ, 0 ≤ γ ≤ 1
• Discounted utility of a history:
  V(h | π) = ∑i ≥ 0 γ^i [R(si) – C(si, π(si))]
• Convergence is guaranteed if 0 ≤ γ < 1
• Expected utility of a policy: E(π) = ∑h P(h | π) V(h | π)
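A sketch of both formulas in Python. Since histories are infinite, the sum is truncated after a fixed number of stages; the truncation length and the function names are assumptions.

```python
def discounted_utility(history, pi, R, C, gamma=0.9, max_stages=500):
    """V(h | pi) = sum_{i >= 0} gamma^i [R(s_i) - C(s_i, pi(s_i))], truncated for computation."""
    return sum(gamma**i * (R(s) - C(s, pi[s])) for i, s in enumerate(history[:max_stages]))

def expected_utility(histories_with_probs, pi, R, C, gamma=0.9):
    """E(pi) = sum_h P(h | pi) V(h | pi), summed over the histories with nonzero probability."""
    return sum(p * discounted_utility(h, pi, R, C, gamma) for h, p in histories_with_probs)
```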

Page 25

Discounted and expected utility
π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
h1 = 〈s1, s2, s3, s4, s4, …〉
h2 = 〈s1, s2, s5, s5, …〉

With γ = 0.9:
V(h1 | π1) = 0.9^0[0 – 100] + 0.9^1[0 – 1] + 0.9^2[0 – 100] + 0.9^3[100 – 0] + 0.9^4[100 – 0] + … = 547.1
V(h2 | π1) = 0.9^0[0 – 100] + 0.9^1[0 – 1] + 0.9^2[–100 – 0] + 0.9^3[–100 – 0] + … = –910.9
E(π1) = 0.8 V(h1 | π1) + 0.2 V(h2 | π1) = 0.8(547.1) + 0.2(–910.9) = 255.5

[Figure: navigation domain with Start = s1; rewards r = 100 at s4, r = –100 at s5, r = 0 elsewhere; move costs c = 1 (horizontal) and c = 100 (vertical); wait actions at s4 and s5]

Page 26

Planning as optimization
• Consider the special case with a start state s0 and all rewards equal to 0
  • Consider cost rather than utility (the negative of what we had before)
  • Equations are slightly simpler; can generalize to the case of nonzero rewards
• Discounted cost of a history h under policy π: C(h | π) = ∑i ≥ 0 γ^i C(si, π(si))
• Expected cost of a policy π: E(π) = ∑h P(h | π) C(h | π)
• A policy π is optimal if for every π′, E(π) ≤ E(π′)
• A policy π is everywhere optimal if for every s and every π′, Eπ(s) ≤ Eπ′(s), where Eπ(s) is the expected cost if we start at s rather than s0

Page 27

Bellman's theorem
• If π is any policy, then for every s,
  Eπ(s) = C(s, π(s)) + γ ∑s′ ∈ S Pπ(s)(s′ | s) Eπ(s′)
• Let Qπ(s, a) be the expected cost in a state s if we start by executing the action a, and use the policy π from then onward:
  Qπ(s, a) = C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ(s′)
• Bellman's theorem: Suppose π* is everywhere optimal. Then for every s,
  Eπ*(s) = min a∈A(s) Qπ*(s, a)

[Figure: state s with action π(s) branching to successor states s1, s2, …, sn]

Page 28

Intuition behind Bellman's theorem
• Bellman's theorem: Suppose π* is everywhere optimal. Then for every s, Eπ*(s) = min a∈A(s) Qπ*(s, a)
• If we use π* everywhere else, then the optimal actions at s are arg min a∈A(s) Qπ*(s, a)
  • If π* is optimal, then at each state it will pick one of those actions
  • Otherwise we can construct a better policy by using an action in arg min a∈A(s) Qπ*(s, a) instead of the action that π* uses
• From Bellman's theorem it follows that for all s,
  Eπ*(s) = min a∈A(s) {C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Eπ*(s′)}

Page 29

Finite horizon decision-making
• First look at policy optimization over a finite horizon
  • Assumes the agent has n time steps to live
  • γ = 1
• To act optimally, should we use a stationary or nonstationary policy?
• Put another way: if you had only one week to live, would you act the same way as if you had fifty years to live?

Page 30

Finite horizon problems
• Value (utility) depends on stages-to-go, hence so should the policy
  • Nonstationary π(s, k)
• Vπk(s) is the k-stage-to-go value function for π
• Expected total reward/cost after executing π for k time steps:
  Vπk(s) = ∑t = 0 to k C(st, π(st)) P(ht | π)
  where C(st, π(st)) denotes the cost received at stage t and P(ht | π) is the probability of the history seen up to stage t, given the policy π

Page 31

Computing the finite-horizon value
• The Markov property facilitates dynamic programming
  • Use it to compute Vπk(s)
  • Work backward from the final stage to determine the optimal policy
• Vπ0(s) = C(s, π(s))
• Vπk(s) = C(s, π(s)) + ∑s′ Pπ(s,k)(s′ | s) Vπk-1(s′)  (a code sketch follows the figure)

[Figure: one backup step from the Vk-1 values to Vk under π(s, k), with transition probabilities 0.7 and 0.3]
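The recursion above can be evaluated by straightforward backward induction over the stages-to-go. A sketch, where the `P(s, a)` dict interface and treating π as a function of (s, k) are assumptions:

```python
def k_stage_policy_value(S, pi, C, P, k):
    """V_pi^0(s) = C(s, pi(s, 0));  V_pi^j(s) = C(s, pi(s, j)) + sum_s' P(s'|s) V_pi^{j-1}(s')."""
    V = {s: C(s, pi(s, 0)) for s in S}                  # 0 stages to go
    for j in range(1, k + 1):                           # build up j = 1 .. k stages to go
        V = {s: C(s, pi(s, j)) + sum(p * V[s2] for s2, p in P(s, pi(s, j)).items())
             for s in S}
    return V
```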

Page 32

Computing the finite-horizon value (continued)
Same recursion as on Page 31, with the terms annotated:
  Vπk(s) = C(s, π(s)) + ∑s′ Pπ(s,k)(s′ | s) Vπk-1(s′)
  where C(s, π(s)) is the immediate cost and the sum is the expected future cost with k-1 stages to go

[Figure: one backup step from Vk-1 to Vk under π(s, k), with transition probabilities 0.7 and 0.3]

Page 33

Bellman backup
• Use Bellman's theorem to incrementally compute utility values
• Back up the values from Vt to Vt+1 using dynamic programming, choosing the minimum:
  Vt+1(s) = min { C(s, a1) + 0.7 Vt(s1) + 0.3 Vt(s4),
                  C(s, a2) + 0.4 Vt(s2) + 0.6 Vt(s3) }
• This backup is used by the procedure called value iteration

[Figure: state s with action a1 (to s1 with probability 0.7, to s4 with probability 0.3) and action a2 (to s2 with probability 0.4, to s3 with probability 0.6); Vt values on the successors, Vt+1(s) on s]

Page 34

General case value iteration algorithm
1. Start with an arbitrary cost E0(s) for each s and a small ε > 0
2. For i = 1, 2, …
   a) For every s in S and a in A:
      i.   Qi(s, a) := C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Ei–1(s′)
      ii.  Ei(s) := min a∈A(s) Qi(s, a)
      iii. πi(s) := arg min a∈A(s) Qi(s, a)
   b) If max s∈S |Ei(s) – Ei–1(s)| < ε then exit
• πi converges to π* after finitely many iterations, but how can we tell that it has converged?
  • In general, Ei ≠ Eπi
  • When πi doesn't change, Ei may still change
  • The changes in Ei may make πi start changing again

Page 35

General case value iteration algorithm (refined)
1. Start with an arbitrary cost E0(s) for each s and a small ε > 0
2. For i = 1, 2, …
   a) For every s in S:
      i.   For each a in A: Q(s, a) := C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Ei–1(s′)
      ii.  Ei(s) := min a∈A(s) Q(s, a)
      iii. πi(s) := arg min a∈A(s) Q(s, a)
   b) If max s∈S |Ei(s) – Ei–1(s)| < ε then exit
• If Ei changes by less than ε and ε is small enough, then πi will no longer change
  • In this case πi has converged to π*
• How small is small enough?
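The algorithm above transcribes directly into Python. A sketch, where `A(s)`, `C(s, a)`, and `P(s, a)` returning `{s': prob}` are assumed interfaces:

```python
def value_iteration(S, A, C, P, gamma=0.9, eps=1e-3):
    """General-case value iteration: repeat Bellman backups until values change by less than eps."""
    E = {s: 0.0 for s in S}                                        # arbitrary E0
    while True:
        Q = {s: {a: C(s, a) + gamma * sum(p * E[s2] for s2, p in P(s, a).items())
                 for a in A(s)}
             for s in S}
        E_new = {s: min(Q[s].values()) for s in S}                 # Ei(s) = min_a Qi(s, a)
        pi = {s: min(Q[s], key=Q[s].get) for s in S}               # pi_i(s) = argmin_a Qi(s, a)
        if max(abs(E_new[s] - E[s]) for s in S) < eps:             # termination test
            return pi, E_new
        E = E_new
```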

Page 36

Value iteration in the finite-horizon case
• Use the Markov property and Bellman backups to avoid enumerating all possibilities
• Value iteration
  • Initialize to the final-stage cost: V0(s) = C(s)
  • Compute Vk(s), the optimal k-stage-to-go value function:
    Vk(s) = min a∈A(s) {C(s, a) + ∑s′ ∈ S Pa(s′ | s) Vk-1(s′)}
  • Derive the optimal k-stage-to-go policy (sketched in code below):
    π*(s, k) = arg min a∈A(s) {C(s, a) + ∑s′ ∈ S Pa(s′ | s) Vk-1(s′)}
• The optimal value function is unique, but the optimal policy is not
  • Many policies can have the same value
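A backward-induction sketch of the finite-horizon case (γ = 1). The `terminal_cost` function standing in for V0(s) = C(s) and the dict interfaces are assumptions:

```python
def finite_horizon_value_iteration(S, A, C, P, terminal_cost, horizon):
    """Compute the k-stage-to-go values and the nonstationary policy pi*(s, k) for k = 1 .. horizon."""
    V = {s: terminal_cost(s) for s in S}                           # V0(s) = C(s)
    policy = {}                                                    # policy[(s, k)] = optimal action
    for k in range(1, horizon + 1):
        V_new = {}
        for s in S:
            q = {a: C(s, a) + sum(p * V[s2] for s2, p in P(s, a).items()) for a in A(s)}
            best = min(q, key=q.get)
            policy[(s, k)] = best                                  # pi*(s, k)
            V_new[s] = q[best]                                     # Vk(s)
        V = V_new
    return policy, V
```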

Page 37

Finite-horizon value iteration example
• V1(s4) = min { C(s4, a1) + 0.7 V0(s1) + 0.3 V0(s4),
                 C(s4, a2) + 0.4 V0(s2) + 0.6 V0(s3) }

[Figure: state s4 with action a1 (to s1 with probability 0.7, to s4 with probability 0.3) and action a2 (to s2 with probability 0.4, to s3 with probability 0.6); table of value functions V0, V1, V2, V3 to be filled in]

Page 38

Finite-horizon value iteration example (continued)
• π*(s4, 1) = arg min over a ∈ {a1, a2} of {C(s4, a) + ∑s′ ∈ S Pa(s′ | s4) V0(s′)}

[Figure: same transition diagram and value table as on Page 37]

Page 39

Complexity of finite-horizon value iteration
• The optimal solution to the (k-1)-stage problem can be used without modification as part of the optimal solution to the k-stage problem
  • Dynamic programming
• Because of the finite horizon, the policy is nonstationary
• What is the computational complexity?
  • T iterations
  • At each iteration, each of n states computes an expectation for |A| actions
  • Each expectation takes O(n) time

Page 40

Complexity of finite-horizon value iteration (continued)
(Same analysis as on Page 39.)
• Total time complexity: O(T |A| n²)
• Polynomial in the number of states. Is this good?

Page 41

Discounted infinite-horizon MDPs
• Defining value as total reward is problematic with infinite horizons
  • Many or all policies have infinite expected reward
• Use discounted utility to discount the future reward at each time step, with discount factor γ, 0 ≤ γ < 1
• Can restrict attention to stationary policies
• Discounted cost of a history h under policy π: C(h | π) = ∑i ≥ 0 γ^i C(si, π(si))
• Expected cost of a policy π: E(π) = ∑h P(h | π) C(h | π)
• A policy π is optimal if for every π′, E(π) ≤ E(π′)

Page 42

Value iteration in the infinite-horizon case
• Use value iteration and Bellman backups to compute optimal policies just as in the finite-horizon case
  • Now just include the discount factor in the computation
• Initialize E0(s) to an arbitrary cost for each state
• Compute the value function iteratively:
  Ei(s) = min a∈A(s) {C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Ei-1(s′)}
• Converges to the optimal value function as i gets large

Page 43

Infinite-horizon value iteration example
• Let aij be the action that moves from si to sj
  • e.g., a11 = wait and a12 = move(r1,l1,l2)
• Start with E0(s) = 0 for all s, and ε = 1

[Figure: navigation domain (Start = s1, goal = s4) with wait self-loops: C(s1, wait) = C(s2, wait) = 1, C(s4, wait) = 0, C(s5, wait) = 100; move costs c = 1 (horizontal) and c = 100 (vertical)]

Page 44

Infinite-horizon value iteration example (continued)
For each s and a, compute Q(s, a) := C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Ei–1(s′), with γ = 0.9 and E0 ≡ 0 (reproduced in the code sketch below):

Q(s1, a11) = 1 + 0.9×0 = 1      Q(s1, a12) = 100 + 0.9×0 = 100    Q(s1, a14) = 1 + 0.9(0.5×0 + 0.5×0) = 1
Q(s2, a21) = 100 + 0.9×0 = 100  Q(s2, a22) = 1 + 0.9×0 = 1        Q(s2, a23) = 1 + 0.9(0.5×0 + 0.5×0) = 1
Q(s3, a32) = 1 + 0.9×0 = 1      Q(s3, a34) = 100 + 0.9×0 = 100
Q(s4, a41) = 1 + 0.9×0 = 1      Q(s4, a43) = 100 + 0.9×0 = 100    Q(s4, a44) = 0 + 0.9×0 = 0    Q(s4, a45) = 100 + 0.9×0 = 100
Q(s5, a52) = 1 + 0.9×0 = 1      Q(s5, a54) = 100 + 0.9×0 = 100    Q(s5, a55) = 100 + 0.9×0 = 100

[Figure: same navigation domain as on Page 43]
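As a check, the backups at s1 above can be reproduced directly from the numbers on this slide. A toy sketch: the action labels are copied from the slide, and everything else is hard-coded.

```python
gamma = 0.9
E0 = {s: 0.0 for s in ["s1", "s2", "s3", "s4", "s5"]}              # E0(s) = 0 for all s

Q_s1 = {
    "a11 (wait)":          1 + gamma * E0["s1"],
    "a12 (move l1->l2)": 100 + gamma * E0["s2"],
    "a14 (move l1->l4)":   1 + gamma * (0.5 * E0["s4"] + 0.5 * E0["s1"]),
}
E1_s1 = min(Q_s1.values())          # = 1, matching E1(s1) = 1 on the next slide
pi1_s1 = min(Q_s1, key=Q_s1.get)    # = "a11 (wait)", matching pi1(s1) = a11
```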

Page 45

Infinite-horizon value iteration example (continued)
For each s, compute E1(s) and π1(s):
E1(s1) = 1    E1(s2) = 1    E1(s3) = 1    E1(s4) = 0    E1(s5) = 1
π1(s1) = a11 = wait
π1(s2) = a22 = wait
π1(s3) = a32 = move(r1,l3,l2)
π1(s4) = a44 = wait
π1(s5) = a52 = move(r1,l5,l2)

[Figure: same navigation domain as on Page 43]

Page 46

Infinite-horizon value iteration example (continued)
(Same E1 and π1 values as on Page 45.)
• What other actions could we have chosen?
• Is ε small enough?

Page 47

Deciding how to act using MDPs
• Given an Ei from value iteration that closely approximates Eπ*, what should we use as our policy?
• Use the greedy policy:
  π(s) = greedy[Ei](s) = arg min a∈A(s) Q(s, a), where Q(s, a) = C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) Ei(s′)
• Note that the value of the greedy policy may not be equal to Ei
• Let EG be the value of the greedy policy. How close is EG to Eπ*?

Page 48

Deciding how to act using MDPs (continued)
(Greedy policy defined as on Page 47.)
• Greedy is not too far from optimal if Ei is close to Eπ*
  • In particular, if Ei is within ε of Eπ*, then EG is within 2εγ/(1 – γ) of Eπ*
  • Furthermore, there exists a finite ε such that the greedy policy is optimal
  • I.e., even if the value estimate is off, the greedy policy is optimal once the estimate is close enough
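Extracting the greedy policy from a value estimate is one minimization per state. A sketch with the same assumed `A`, `C`, `P` interfaces as before and `E` as a dict of value estimates:

```python
def greedy_policy(S, A, C, P, E, gamma=0.9):
    """pi(s) = argmin_a { C(s, a) + gamma * sum_s' Pa(s'|s) E(s') }."""
    return {s: min(A(s), key=lambda a, s=s: C(s, a) + gamma *
                   sum(p * E[s2] for s2, p in P(s, a).items()))
            for s in S}
```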

Page 49

Policy iteration for infinite-horizon planning
• Policy iteration is another way to find π*
• Suppose there are n states s1, …, sn
• Start with an arbitrary initial policy π1
• For i = 1, 2, …
  • Compute πi's expected costs by solving n equations in n unknowns (n instances of the discounted expected cost equation):
    Eπi(s1) = C(s1, πi(s1)) + γ ∑k=1..n Pπi(s1)(sk | s1) Eπi(sk)
      ⋮
    Eπi(sn) = C(sn, πi(sn)) + γ ∑k=1..n Pπi(sn)(sk | sn) Eπi(sk)
  • For every sj,
    πi+1(sj) = arg min a∈A Qπi(sj, a) = arg min a∈A {C(sj, a) + γ ∑k=1..n Pa(sk | sj) Eπi(sk)}
  • If πi+1 = πi then exit
• Converges in a finite number of iterations
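A sketch of the loop above, using numpy to solve the n linear equations in the policy-evaluation step. The dict/function interfaces are assumptions, as in the earlier sketches.

```python
import numpy as np

def policy_iteration(S, A, C, P, gamma=0.9):
    """Alternate policy evaluation (linear solve) and policy improvement until the policy is stable."""
    idx = {s: i for i, s in enumerate(S)}
    pi = {s: A(s)[0] for s in S}                                   # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) E = C_pi for E_pi
        n = len(S)
        P_pi, C_pi = np.zeros((n, n)), np.zeros(n)
        for s in S:
            C_pi[idx[s]] = C(s, pi[s])
            for s2, p in P(s, pi[s]).items():
                P_pi[idx[s], idx[s2]] = p
        E = np.linalg.solve(np.eye(n) - gamma * P_pi, C_pi)
        # Policy improvement: pi_{i+1}(s) = argmin_a Q_{pi_i}(s, a)
        pi_next = {s: min(A(s), key=lambda a, s=s: C(s, a) + gamma *
                          sum(p * E[idx[s2]] for s2, p in P(s, a).items()))
                   for s in S}
        if pi_next == pi:
            return pi, {s: E[idx[s]] for s in S}
        pi = pi_next
```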

Page 50

Policy iteration example
• Assume discount factor γ = 0.9
• s5 is undesirable
  • C(s5, wait) = 100
• Incentive to leave non-goal states:
  • C(s1, wait) = 1
  • C(s2, wait) = 1

[Figure: navigation domain (Start = s1, goal = s4) with C(s4, wait) = 0, C(s5, wait) = 100, C(s1, wait) = C(s2, wait) = 1; move costs c = 1 (horizontal) and c = 100 (vertical)]

Page 51

Policy iteration example (continued)
• Start with the arbitrary policy
  π1 = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
• Compute the expected costs Eπ1(s) across all states

[Figure: same navigation domain as on Page 50]

Page 52
(Repeats Page 51.)

Page 53
(Repeats Page 51.)

Page 54

Policy iteration example (continued)
• Compute the expected costs Eπ1(s) across all states
• At each state s, let π2(s) = arg min a∈A(s) Qπ1(s, a)
• π2 = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}

[Figure: same navigation domain as on Page 50]

Page 55

Policy iteration vs. value iteration
• Policy iteration computes an entire policy in each iteration, and computes values based on that policy
  • More work per iteration: needs to solve a set of simultaneous equations
  • Usually converges in a smaller number of iterations
• Value iteration computes new values in each iteration, and chooses a policy based on those values
  • In general, the values are not the values that one would get from the chosen policy or any other policy
  • Less work per iteration
  • Usually takes more iterations to converge

Page 56

Policy iteration vs. value iteration (continued)
• For both, the number of iterations is polynomial in the number of states
  • But the number of states is usually quite large
  • Need to examine the entire state space in each iteration
• Thus, these algorithms can take huge amounts of time and space
• Use real-time dynamic programming and heuristics to improve efficiency

Page 57

Real-time dynamic programming basics
• Explicitly specify goal states
  • If s is a goal, then actions at s have no cost and produce no change
• For each state s, maintain a value V(s) that gets updated as the algorithm proceeds
  • Initially V(s) = h(s), where h is a heuristic function
• Greedy policy:
  π(s) = greedy[V](s) = arg min a∈A(s) Q(s, a), where Q(s, a) = C(s, a) + γ ∑s′ ∈ S Pa(s′ | s) V(s′)

Page 58

RTDP algorithm
procedure RTDP(s)
  1. Loop until termination condition:
     a) RTDP-trial(s)

procedure RTDP-trial(s)
  1. While s is not a goal state:
     a) a := arg min a∈A(s) Q(s, a)
     b) V(s) := Q(s, a)
     c) Randomly pick s′ with probability Pa(s′ | s)
     d) s := s′

Page 59

RTDP algorithm (continued)
(Same procedure as on Page 58; RTDP-trial performs a forward search from s toward the goal.)
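A sketch of RTDP-trial and the outer loop in Python. A fixed number of trials stands in for the unspecified termination condition; the heuristic h, the goal set, and the dict interfaces are assumptions.

```python
import random

def rtdp(s0, goals, A, C, P, h, gamma=0.9, trials=100):
    """Run RTDP trials from s0.  V is initialized lazily from the heuristic h."""
    V = {}
    def value(s):
        return V.setdefault(s, h(s))                               # V(s) = h(s) on first visit
    for _ in range(trials):                                        # stand-in termination condition
        s = s0
        while s not in goals:                                      # RTDP-trial(s)
            q = {a: C(s, a) + gamma * sum(p * value(s2) for s2, p in P(s, a).items())
                 for a in A(s)}
            a = min(q, key=q.get)                                  # a := argmin_a Q(s, a)
            V[s] = q[a]                                            # V(s) := Q(s, a)
            succs, probs = zip(*P(s, a).items())
            s = random.choices(succs, weights=probs)[0]            # sample s' with probability Pa(s'|s)
    return V
```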

Page 60

Example of RTDP
The RTDP and RTDP-trial procedures from Page 58 are applied to the navigation domain from Page 50.

[Figure: navigation domain (Start = s1, goal = s4); wait costs C(s1, wait) = C(s2, wait) = 1, C(s4, wait) = 0, C(s5, wait) = 100; move costs c = 1 (horizontal) and c = 100 (vertical)]

Page 61

Example of RTDP (continued)
• γ = 0.9, h(s) = 0 for all s, so initially V(s) = 0 everywhere
• The trial starts at the start state with V = 0

Page 62

Example of RTDP (continued)
Computing Q(s, a) for the actions at the current state (step a of RTDP-trial), with all V values still 0:
  Q = 100 + 0.9×0 = 100
  Q = 1 + 0.9(½×0 + ½×0) = 1
  Q = 100 + 0.9×0 = 100

Example of RTDP procedure RTDP(s)

1.  Loop until termination condition a)  RTDP-trial(s)

procedure RTDP-trial(s)

1.  While s is not a goal state a)   a := arg mina∈A(s) Q(s,a) b)  V(s) := Q(s,a) c)  Randomly pick s' with

probability Pa (sʹ′|s) d)  s := s'

Start

c = 100

c = 0

r = 0 c = 1

c = 1 c

= 10

0

c =

100

c = 1

c = 1

wait

wait

wait

wait

Example: γ = 0.9 h(s) = 0 for all s

V = 0

V = 0

Q = 1+.9(½*0+½*0) = 1

V = 0

Page 64

Example of RTDP (continued)
Step (b): V(s) := Q(s, a) = 1 at the current state

Page 65

Example of RTDP (continued)
Steps (c) and (d): randomly pick s′ with probability Pa(s′ | s) and set s := s′; the updated value V = 1 remains stored

Page 66

Example of RTDP (continued)
Back at step (a) on the next iteration, recompute Q with the updated values (V = 1 at the previously updated state):
  Q = 100 + 0.9×0 = 100
  Q = 1 + 0.9(½×1 + ½×0) = 1.45
  Q = 100 + 0.9×0 = 100

Page 67

Example of RTDP (continued)
Step (a): the minimizing action is again the one with Q = 1 + 0.9(½×1 + ½×0) = 1.45

Page 68

Example of RTDP (continued)
Step (b): V(s) := Q(s, a) = 1.45

Page 69

Example of RTDP (continued)
Steps (c) and (d): sample the next state and continue the trial; V = 1.45 at the updated state

Page 70

Example of RTDP (continued)
The trial continues in the same way, keeping the updated value V = 1.45.

Page 71

Example of RTDP (continued)
Control returns to the outer RTDP loop (step a), which runs another RTDP-trial from the start state while keeping the updated values (V = 1.45).

Page 72

Performance of RTDP
• In practice, RTDP can solve much larger problems than policy iteration and value iteration
• It won't always find an optimal solution, and won't always terminate
  • If h does not overestimate, and a goal is reachable (with positive probability) from every state, then it will terminate
  • h should be an admissible heuristic
• If, in addition to the above, there is a positive-probability path between every pair of states, then it will find an optimal solution