
Page 1

Modified MDPs for Concurrent Execution

AnYuan Guo

Victor Lesser

University of Massachusetts

Page 2

Concurrent Execution

A set of tasks, each relatively easy to solve on its own, whose concurrent execution introduces new interactions that complicate the composite task.

Single agent executing multiple tasks in parallel (example: office robot)

Multiple agents acting in parallel (example: a team)

Page 3

Cross Product MDP

The problem of concurrent execution can be solved optimally by solving the cross-product MDP formed by the separate processes.

Problem: exponential blow-up in the size of the joint state and action spaces (illustrated below).
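
The blow-up can be seen directly from the sizes of the joint spaces. A minimal sketch, with made-up task sizes (three tasks, 100 states and 8 actions each):

```python
# Illustration of the cross-product blow-up: the joint MDP over n independent
# tasks has the Cartesian product of the per-task state spaces as its state
# space (and likewise for actions). The task sizes below are made up.
from math import prod

task_state_sizes = [100, 100, 100]   # three hypothetical tasks, 100 states each
task_action_sizes = [8, 8, 8]        # 8 actions each

joint_states = prod(task_state_sizes)    # 100^3 = 1,000,000
joint_actions = prod(task_action_sizes)  # 8^3   = 512

print(f"joint states: {joint_states}, joint actions: {joint_actions}")
# One value-iteration sweep over the joint MDP costs roughly O(|S|^2 * |A|),
# so the cost grows exponentially with the number of concurrent tasks.
```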

Page 4

Related Work

- Deterministic planning: situation calculus [Reiter96]; extending STRIPS [Boutilier97, Knoblock94]
- Termination schemes for temporally extended actions [Rohanimanesh03]
- Planning in the cross-product MDP [Singh98]
- Learning: W-learning [Humphrys96], MAXQ [Dietterich00]

Page 5

The Goal

Break apart the interactions and encapsulate them within each agent, so that the tasks can again be solved independently.

Page 6

Algorithm Summary

1) Define the types of events and interactions of interest.

2) Summarize the other agent's effect on the current agent as a statistic: how often the constraining event occurs.

3) Modify the current agent's own model to reflect this statistic (a schematic sketch follows below).
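
A schematic of these steps for a single one-way constraint, written as a minimal Python sketch. The helpers (`solve`, `event_frequency`, `modify_transitions`) are placeholders for the computations detailed on the following slides, not functions defined in the paper:

```python
def solve_with_summary(mdp1, mdp2, constraining_event, constraint,
                       solve, event_frequency, modify_transitions):
    """Schematic pipeline for one constraint running from agent 1 to agent 2.

    The three callables are placeholders: `solve` solves a single MDP (e.g.
    value iteration), `event_frequency` computes how often the constraining
    event occurs under a policy, and `modify_transitions` folds that
    frequency into the affected agent's transition probability table.
    """
    policy1 = solve(mdp1)                                        # solve the constraining task on its own
    freq = event_frequency(mdp1, policy1, constraining_event)    # summarize agent 1's effect as a frequency
    mdp2_modified = modify_transitions(mdp2, constraint, freq)   # reflect the statistic in agent 2's model
    policy2 = solve(mdp2_modified)                               # agent 2 is solved independently again
    return policy1, policy2
```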

Page 7

Events in an MDP:
- State-based events (e.g. the agent enters s5)
- Action-based events (e.g. the agent moves north one step)
- State-action-based events (e.g. the agent moves north one step from s4)

Events in MDP1 can affect events in MDP2, giving 3 × 3 = 9 types of interactions (one possible encoding is sketched below).
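
One possible encoding of the three event types; the class and field names here are illustrative, not taken from the paper:

```python
from dataclasses import dataclass
from typing import Hashable, Optional

@dataclass(frozen=True)
class Event:
    """State-based if only `state` is set, action-based if only `action`
    is set, state-action-based if both are set."""
    state: Optional[Hashable] = None
    action: Optional[Hashable] = None

enter_s5 = Event(state="s5")                             # state-based
move_north = Event(action="north")                       # action-based
move_north_from_s4 = Event(state="s4", action="north")   # state-action-based
# Any of the 3 source types can constrain any of the 3 target types,
# giving the 3 x 3 = 9 interaction types mentioned above.
```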

Page 8

Assumptions

- The list of possible interactions between the MDPs is given.
- The constraints are one-way only: the effects do not propagate back to the originator of the constraint.

Page 9

Directed Acyclic Constraints

The constraints between events form a directed acyclic graph (a sketch of how this acyclicity might be exploited follows below).
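
A minimal sketch of one way to exploit the acyclicity: treat tasks as nodes and constraints as directed edges, then process the tasks in topological order so each MDP is modified only after all of its constrainers have been solved. The topological-order scheduling is my own reading of the assumption, and the task names are made up:

```python
from graphlib import TopologicalSorter   # Python 3.9+

# Directed edges: "constrainer -> constrained". Acyclic by assumption.
constrains = {
    "task1": ["task2", "task3"],
    "task2": ["task3"],
}

ts = TopologicalSorter()
for src, targets in constrains.items():
    for dst in targets:
        ts.add(dst, src)    # dst can only be modified after src is solved

print(list(ts.static_order()))   # e.g. ['task1', 'task2', 'task3']
```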

Page 10

Event Frequency & MDP Modification

(Diagram: event 1 in one task constrains event 2 in another.)

1) Calculate the frequency of the constraining event (event 1).

2) Modify the MDP of the affected task.

Page 11

Calculating State Visitation Frequency

Given a policy π, solve the system of simultaneous linear equations

$F(s') = \sum_{s \in S} F(s)\, T(s, \pi(s), s')$

under the constraint that

$\sum_{s \in S} F(s) = 1$
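
This system is the stationary-distribution equation of the Markov chain induced by π. A minimal numpy sketch with a made-up 3-state chain, where P[s, s'] = T(s, π(s), s'):

```python
import numpy as np

# Hypothetical policy-induced chain: P[s, s'] = T(s, pi(s), s').
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.7, 0.2],
              [0.0, 0.3, 0.7]])

n = P.shape[0]
# F(s') = sum_s F(s) P[s, s']  is  F = F P; rewrite as (P^T - I) F = 0 and
# replace one redundant row with the normalization constraint sum(F) = 1.
A = P.T - np.eye(n)
A[-1, :] = 1.0
b = np.zeros(n)
b[-1] = 1.0
F = np.linalg.solve(A, b)
print(F, F.sum())   # visitation frequencies, summing to 1
```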

Page 12

Calculating Action Frequencies

Given a policy π, the action frequency F(a) is the sum of the visitation frequencies of all the states in which action a is executed:

$F(a) = \sum_{s \in B} F(s)$, where $B = \{ s \mid \pi(s) = a \}$
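
A direct translation of this formula, assuming a deterministic policy represented as a state-to-action mapping; the states, frequencies, and policy below are made up:

```python
from collections import defaultdict

def action_frequencies(F, policy):
    """F: state -> visitation frequency, policy: state -> action."""
    freq = defaultdict(float)
    for s, f in F.items():
        freq[policy[s]] += f      # F(a) = sum of F(s) over {s | pi(s) = a}
    return dict(freq)

F = {"s1": 0.25, "s2": 0.50, "s3": 0.25}
policy = {"s1": "up", "s2": "up", "s3": "right"}
print(action_frequencies(F, policy))   # {'up': 0.75, 'right': 0.25}
```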

Page 13

Calculating State-Action Frequencies

Now both the action and the state in which it is executed matter:

$F(s, a) = \begin{cases} F(s) & \text{if } \pi(s) = a \\ 0 & \text{otherwise} \end{cases}$

This also generalizes to a set of states and actions.
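
The state-action case in code, again assuming a deterministic policy; the helper for a set of state-action pairs simply sums the individual terms:

```python
def state_action_frequency(F, policy, s, a):
    # F(s, a) = F(s) if pi(s) = a, and 0 otherwise.
    return F[s] if policy[s] == a else 0.0

def set_frequency(F, policy, pairs):
    # Generalization to a set of state-action pairs.
    return sum(state_action_frequency(F, policy, s, a) for s, a in pairs)

F = {"s1": 0.25, "s2": 0.50, "s3": 0.25}
policy = {"s1": "up", "s2": "up", "s3": "right"}
print(state_action_frequency(F, policy, "s2", "up"))            # 0.5
print(set_frequency(F, policy, [("s1", "up"), ("s3", "up")]))   # 0.25
```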

Page 14

Account for the Effects of Constraints

Modify the model $\langle S, A, T, R \rangle$: specifically, modify the transition probability table T.

Intuition: other agents can change the dynamics of my environment.

(Example diagram: two agents A1 and A2.)

Page 15

Account for State Based Events

A constraint from another task can affect the current task's ability to enter certain states.

A slice of the transition probability table (TPT) under action a1 (rows: from state, columns: to state):

from \ to       s1              s2              s3
s1          P(s1, a1, s1)   P(s1, a1, s2)   P(s1, a1, s3)
s2          P(s2, a1, s1)   P(s2, a1, s2)   P(s2, a1, s3)
s3          P(s3, a1, s1)   P(s3, a1, s2)   P(s3, a1, s3)
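
The slides do not spell out the exact update, so the following is a hedged sketch of one plausible state-based modification: if the constraining event occurs with frequency f and, when it occurs, blocks entry into a state, scale every transition into that state by (1 - f) and return the lost probability mass to the originating state (the agent stays put). The redistribution rule is an assumption of mine:

```python
import numpy as np

def modify_state_based(T_a, blocked, f):
    """T_a: TPT slice for one action, shape (n_states, n_states).
    blocked: index of the state whose entry is constrained.
    f: frequency of the constraining event in the other task."""
    T = T_a.copy()
    lost = f * T[:, blocked]            # mass removed from transitions into `blocked`
    T[:, blocked] -= lost
    idx = np.arange(T.shape[0])
    T[idx, idx] += lost                 # assumed rule: the agent stays in place instead
    return T

T_a1 = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
print(modify_state_based(T_a1, blocked=2, f=0.5))   # rows still sum to 1
```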

Page 16

Account for Action Based Events

A constraint from another task can affect the current task's ability to carry out certain actions.

TPT for the affected action a1 (rows: from state, columns: to state):

from \ to       s1              s2              s3
s1          P(s1, a1, s1)   P(s1, a1, s2)   P(s1, a1, s3)
s2          P(s2, a1, s1)   P(s2, a1, s2)   P(s2, a1, s3)
s3          P(s3, a1, s1)   P(s3, a1, s2)   P(s3, a1, s3)
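
For an action-based constraint (the kind used in the experiment later), one plausible modification blends the affected action's whole TPT slice with the identity: with frequency f the action fails and the agent stays where it is. The "failure means no movement" rule is an assumption, not taken from the slides:

```python
import numpy as np

def modify_action_based(T_a, f):
    """With frequency f the affected action has no effect (self-transition)."""
    n = T_a.shape[0]
    return (1.0 - f) * T_a + f * np.eye(n)

T_a1 = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
print(modify_action_based(T_a1, f=0.75))   # rows still sum to 1
```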

Page 17

Account for State-Action Based Events

A constraint from another task can affect the current task's ability to carry out certain actions in certain states.

TPT for the affected action a1 (rows: from state, columns: to state):

from \ to       s1              s2              s3
s1          P(s1, a1, s1)   P(s1, a1, s2)   P(s1, a1, s3)
s2          P(s2, a1, s1)   P(s2, a1, s2)   P(s2, a1, s3)
s3          P(s3, a1, s1)   P(s3, a1, s2)   P(s3, a1, s3)
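
The state-action case touches only a single row of the affected action's TPT slice: the row for the constrained state. Same hedged "stay in place on failure" assumption as above:

```python
import numpy as np

def modify_state_action_based(T_a, s, f):
    """Only the row for state s (under the affected action) is modified."""
    T = T_a.copy()
    stay = np.zeros(T.shape[0])
    stay[s] = 1.0                       # assumed failure outcome: remain in s
    T[s] = (1.0 - f) * T[s] + f * stay
    return T

T_a1 = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
print(modify_state_action_based(T_a1, s=1, f=0.5))   # only row s=1 changes
```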

Page 18

Experiments

The mountain-climbing scenario (a configuration sketch follows below):
- States: the location of the agent
- Actions: move up, down, left, right, or any of the 4 diagonal steps (8 total)
- Transitions: probability 0.05 of slipping to an adjacent state rather than the intended one
- Rewards: -1 per step, -3 for a diagonal step, 100 for the goal
- Constraint: agent 1 taking the "up" action prevents agent 2 from doing so
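
A sketch of this configuration in code. The grid dimensions and goal location are not given on the slide, so the values here are placeholders:

```python
GRID_W, GRID_H = 10, 10                 # hypothetical grid size
GOAL = (9, 9)                           # hypothetical goal cell
SLIP_PROB = 0.05                        # chance of slipping to an adjacent state

ACTIONS = {                             # 8 moves: 4 cardinal + 4 diagonal
    "up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0),
    "up-left": (-1, 1), "up-right": (1, 1),
    "down-left": (-1, -1), "down-right": (1, -1),
}

def step_reward(action, reached_goal):
    if reached_goal:
        return 100.0
    return -3.0 if "-" in action else -1.0   # diagonal steps cost more

# Constraint (action-based): whenever agent 1 executes "up", agent 2 cannot.
# Agent 2's model is repaired with the action-based modification shown
# earlier, using F("up") computed from agent 1's solved policy.
```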

Page 19

Results: Policies

Policies when executing independently

Policies when executed concurrently, after we apply the algorithm

Page 20

Results

(Result plots: size of state space; average value of policy.)

Page 21

Improvements

Explore different ways to modify the MDP (e.g. shrink action set)

Relax the directed-acyclic constraint restriction (take an iterative approach)

Show that it is optimal for summaries that consist of a single random variable

Page 22

New Directions

Different types of summaries:
- steady-state behavior (current work)
- multi-state summaries
- summaries with temporal information

Dynamic task arrival/departure:
- given some model of arrival
- without a model (learning)

Positive interactions (e.g. enable)

Page 23

The End