Hierarchical Methods for Planning under Uncertainty
Thesis Proposal
Joelle Pineau
Thesis Committee:
Sebastian Thrun, Chair
Matthew Mason
Andrew Moore
Craig Boutilier, U. of Toronto
Thesis Proposal: Hierarchical Methods for Planning under Uncertainty Joelle Pineau
Integrating robots in living environments
The robot's role:
- Social interaction
- Mobile manipulation
- Intelligent reminding
- Remote-operation
- Data collection / monitoring
A broad perspective
GOAL = Selecting appropriate actions

[Figure: control loop. The robot applies ACTIONS to the USER + WORLD, receives OBSERVATIONS, and maintains a belief state over the true STATE]
Why is this a difficult problem? UNCERTAINTY.

Cause #1: Non-deterministic effects of actions
Cause #2: Partial and noisy sensor information
Cause #3: Inaccurate model of the world and the user
A solution: Partially Observable Markov Decision Processes (POMDPs)
[Figure: three-state POMDP with states S1, S2, S3, each emitting observations o1, o2; actions a1 and a2 drive the transitions]
The truth about POMDPs
• Bad news:
– Finding an optimal POMDP action selection policy is computationally intractable for complex problems.
• Good news:
– Many real-world decision-making problems exhibit structure inherent to the problem domain.
– By leveraging structure in the problem domain, I propose an algorithm that makes POMDPs tractable, even for large domains.
How is it done?
• Use a “Divide-and-conquer” approach:
– We decompose a large monolithic problem into a collection of loosely-related smaller problems.
[Figure: decomposition into a Dialogue manager, Health manager, Social manager, and Reminding manager]
Thesis statement
Decision-making under uncertainty can be made tractable for complex problems by exploiting hierarchical structure in the problem domain.
Outline
• Problem motivation
Partially observable Markov decision processes
• The hierarchical POMDP algorithm
• Proposed research
POMDPs within the family of Markov models
                              Control problem?
                              no                       yes
Uncertainty in      no        Markov Chain             Markov Decision Process (MDP)
sensor input?       yes       Hidden Markov Model      Partially Observable MDP
                              (HMM)                    (POMDP)
What are POMDPs?

Components:
- Set of states: s ∈ S
- Set of actions: a ∈ A
- Set of observations: o ∈ O

POMDP parameters:
- Initial belief: b0(s) = Pr(s0 = s)
- Observation probabilities: O(s,a,o) = Pr(o | s, a)
- Transition probabilities: T(s,a,s') = Pr(s' | s, a)
- Rewards: R(s,a)

[Figure: HMM transition structure (probabilities 0.5, 0.5, 1) plus MDP actions a1, a2, over states S1 (Pr(o1)=0.5, Pr(o2)=0.5), S2 (Pr(o1)=0.9, Pr(o2)=0.1), and S3 (Pr(o1)=0.2, Pr(o2)=0.8)]
A POMDP example: The tiger problem
S1 = "tiger-left": Pr(o=growl-left) = 0.85, Pr(o=growl-right) = 0.15
S2 = "tiger-right": Pr(o=growl-left) = 0.15, Pr(o=growl-right) = 0.85

Actions = { listen, open-left, open-right }

Reward function:
  R(a=listen) = -1
  R(a=open-right, s=tiger-left) = 10
  R(a=open-left, s=tiger-left) = -100
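The model above can be written down directly; a minimal Python sketch (the uniform reset of the tiger after a door is opened is the standard tiger-problem convention, assumed here because the slide omits the transition function):

```python
# Tiger POMDP model as plain Python (a sketch; the post-opening reset
# behaviour is an assumption, not stated on the slide).
STATES = ["tiger-left", "tiger-right"]
ACTIONS = ["listen", "open-left", "open-right"]
OBSERVATIONS = ["growl-left", "growl-right"]

def observation_prob(s, a, o):
    """O(s, a, o) = Pr(o | s, a): listening is 85% accurate."""
    if a != "listen":
        return 0.5  # growls carry no information after opening a door
    correct = {"tiger-left": "growl-left", "tiger-right": "growl-right"}
    return 0.85 if o == correct[s] else 0.15

def transition_prob(s, a, s2):
    """T(s, a, s') = Pr(s' | s, a): listen leaves the state unchanged."""
    if a == "listen":
        return 1.0 if s2 == s else 0.0
    return 0.5  # opening a door resets the tiger uniformly (assumption)

def reward(s, a):
    """R(s, a) as given on the slide."""
    if a == "listen":
        return -1.0
    opened = "tiger-left" if a == "open-left" else "tiger-right"
    return -100.0 if s == opened else 10.0
```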
What can we do with POMDPs?
1) State tracking: after an action, what is the state of the world, st? (Not so hard.)

2) Computing a policy: which action, aj, should the controller apply next? (Very hard!)

[Figure: the world transitions from state st-1 to st; the robot's control layer maintains belief bt-1, issues action at-1, and receives observation ot]
The tiger problem: State tracking
Starting from belief vector b0 over S1 = "tiger-left" and S2 = "tiger-right", after action = listen and observation = growl-left the belief becomes b1:

  b1(sj) = Pr(o | sj, a) Σ_{si ∈ S} Pr(sj | si, a) b0(si) / Pr(o | a, b0)
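The update above, specialized to the tiger problem (listening leaves the state unchanged, so the transition sum collapses to b0(s)), can be sketched as:

```python
# Belief update for the tiger problem after action=listen, obs=growl-left:
# b1(s) ∝ Pr(o | s, a) * sum_{s'} Pr(s | s', a) * b0(s'), and under listen
# the transition sum collapses to b0(s).
def update_belief(b0, p_obs_given_state):
    unnorm = {s: p_obs_given_state[s] * b0[s] for s in b0}
    z = sum(unnorm.values())             # Pr(o | a, b0), the normalizer
    return {s: v / z for s, v in unnorm.items()}

b0 = {"tiger-left": 0.5, "tiger-right": 0.5}
p_growl_left = {"tiger-left": 0.85, "tiger-right": 0.15}
b1 = update_belief(b0, p_growl_left)
print(b1["tiger-left"])   # 0.85: one growl-left shifts belief toward tiger-left
```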
Policy Optimization
• Which action, aj, should the controller apply next?
– In MDPs:
  • Policy is a mapping from state to action, π: si → aj
– In POMDPs:
  • Policy is a mapping from belief to action, π: b → aj

• Recursively calculate the expected long-term reward for each state/belief:

  V(si) = max_a [ R(si, a) + Σ_{j=1..N} Pr(sj | si, a) V(sj) ]

• Find the action that maximizes the expected reward:

  π(si) = argmax_a [ R(si, a) + Σ_{j=1..N} Pr(sj | si, a) V(sj) ]
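The MDP recursion above can be sketched as a single value-iteration sweep (undiscounted, as on the slide; the two-state example at the bottom is hypothetical, purely for illustration):

```python
# One sweep of the MDP value-iteration recursion. V maps states to values,
# R(s, a) gives rewards, P(s, a) returns a dict of successor probabilities.
def value_iteration_step(V, states, actions, R, P):
    newV, policy = {}, {}
    for s in states:
        q = {a: R(s, a) + sum(p * V[s2] for s2, p in P(s, a).items())
             for a in actions}
        policy[s] = max(q, key=q.get)  # pi(s) = argmax_a
        newV[s] = q[policy[s]]         # V(s)  = max_a
    return newV, policy

# Hypothetical 2-state example: action "a" earns 1 in s1 and loops there.
states, actions = ["s1", "s2"], ["a", "b"]
R = lambda s, a: 1.0 if (s == "s1" and a == "a") else 0.0
P = lambda s, a: {"s1": 1.0} if a == "a" else {"s2": 1.0}
V, pi = value_iteration_step({"s1": 0.0, "s2": 0.0}, states, actions, R, P)
print(V["s1"], pi["s1"])   # 1.0 a
```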
The tiger problem: Optimal policy
[Figure: optimal policy over the belief vector. Open-right when belief is concentrated on S1 "tiger-left"; listen in the uncertain middle; open-left when belief is concentrated on S2 "tiger-right"]
• Finite-horizon POMDPs are in the worst case doubly exponential.
• Infinite-horizon undiscounted stochastic POMDPs are EXPTIME-hard, and may not be decidable.
Complexity (per step of value iteration), where Vn-1 is the set of value-function components from the previous step:

                                   Time                     Space
  POMDP                            |S|² |A| |Vn-1|^|O|      |A| |Vn-1|^|O|
  MDP (recursive upper bound)      |S|² |A|                 |S|
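The doubly exponential behaviour comes from the recurrence |Vn| = |A| · |Vn-1|^|O| on the number of value-function components; iterating it numerically shows the blowup (a sketch, using the tiger problem's |A| = 3, |O| = 2):

```python
# Growth of the number of value-function components (alpha-vectors) under
# exact value iteration: |V_n| = |A| * |V_{n-1}| ** |O|, the source of the
# doubly exponential worst case.
def gamma_sizes(num_actions, num_obs, horizon):
    sizes = [num_actions]            # horizon-1 plans: one per action
    for _ in range(horizon - 1):
        sizes.append(num_actions * sizes[-1] ** num_obs)
    return sizes

print(gamma_sizes(3, 2, 4))   # [3, 27, 2187, 14348907]
```

In practice most of these components are dominated and can be pruned, but the worst case stands.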
The essence of the problem
• How can we find good policies for complex POMDPs?
• Is there a principled way to provide near-optimal policies in reasonable time?
Outline
• Problem motivation
• Partially observable Markov decision processes
The hierarchical POMDP algorithm
• Proposed research
A hierarchical approach to POMDP planning
• Key Idea: Exploit hierarchical structure in the problem domain to break a problem into many “related” POMDPs.
• What type of structure?
Action set partitioning:

  Act
  ├── InvestigateHealth   (subtask; appears in Act as an abstract action)
  │   ├── CheckPulse
  │   └── CheckMeds
  └── Move
      ├── AskWhere
      └── Navigate
          ├── Left
          ├── Right
          ├── Forward
          └── Backward
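A partitioning graph like this can be represented as a nested dictionary; a sketch (the exact placement of AskWhere under Move is an assumption read off the slide layout):

```python
# Action-set partition as a nested dict: leaves are primitive actions,
# internal keys are subtasks that appear as abstract actions in their
# parent controller. (Tree structure reconstructed from the slide.)
HIERARCHY = {
    "Act": {
        "InvestigateHealth": {"CheckPulse": {}, "CheckMeds": {}},
        "Move": {
            "AskWhere": {},
            "Navigate": {"Left": {}, "Right": {}, "Forward": {}, "Backward": {}},
        },
    }
}

def primitive_actions(tree):
    """Flatten a subtree into the set of primitive actions it controls."""
    prims = set()
    for name, sub in tree.items():
        if sub:                      # internal node: recurse into subtask
            prims |= primitive_actions(sub)
        else:                        # leaf: primitive action
            prims.add(name)
    return prims

print(sorted(primitive_actions(HIERARCHY)))
```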
Assumptions
• Each POMDP controller has a subset of the full action set A0.
• Each POMDP controller has the full state set S0 and observation set O0.
• Each controller includes discriminative reward information.
• We are given the action set partitioning graph.
• We are given a full POMDP model of the problem: {S0, A0, O0, M0}.
The tiger problem: An action hierarchy
Pinvestigate = {S0, Ainvestigate, O0, Minvestigate}
Ainvestigate = {listen, open-right}

  act
  ├── open-left
  └── investigate
      ├── listen
      └── open-right
Optimizing the “investigate” controller
[Figure: locally optimal policy for the investigate controller over the belief vector. Open-right when belief is concentrated on S1 "tiger-left"; listen otherwise]
The tiger problem: An action hierarchy
Pact = {S0, Aact, O0, Mact}
Aact = {open-left, investigate}

  act
  ├── open-left
  └── investigate
      ├── listen
      └── open-right

But... R(s, a=investigate) is not defined!
Modeling abstract actions
Insight: use the local policy of the corresponding low-level controller.

General form: R(si, ak) = R(si, Policy(controllerk, si))

Example: R(s=tiger-left, ak=investigate) = ?

                 open-right   listen   open-left
  tiger-left         10         -1       -100
  tiger-right      -100         -1         10

Policy(investigate, s=tiger-left) = open-right, so R(s=tiger-left, ak=investigate) = R(tiger-left, open-right) = 10.
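A minimal sketch of this rule for the tiger problem (the local policy at the two corner beliefs, open-right for tiger-left and listen for tiger-right, is read off the investigate controller's policy figure):

```python
# R(s, investigate) inherited from the investigate controller's local policy,
# following R(s_i, a_k) = R(s_i, Policy(controller_k, s_i)).
R = {("tiger-left", "open-right"): 10,  ("tiger-left", "listen"): -1,
     ("tiger-left", "open-left"): -100, ("tiger-right", "open-right"): -100,
     ("tiger-right", "listen"): -1,     ("tiger-right", "open-left"): 10}

# Local policy of the investigate controller at the corner beliefs
# (read off its policy figure; an assumption of this sketch).
local_policy = {"tiger-left": "open-right", "tiger-right": "listen"}

def abstract_reward(s, local_policy, R):
    """Reward of the abstract action = reward of the action its controller picks."""
    return R[(s, local_policy[s])]

print(abstract_reward("tiger-left", local_policy, R))    # 10
print(abstract_reward("tiger-right", local_policy, R))   # -1
```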
Optimizing the “act” controller
[Figure: locally optimal policy for the act controller over the belief vector. Open-left when belief is concentrated on S2 "tiger-right"; investigate otherwise]
The complete hierarchical policy
[Figure: the hierarchical policy over the belief vector (open-right, listen, open-left from S1 "tiger-left" to S2 "tiger-right") coincides with the optimal policy]
Results for larger simulation domains
                                 POMDP      H-POMDP    MDP
  Navigation problem (|S|=11, |A|=6, |O|=6):
    CPU time (secs)              1119.93    2.84       0.000654
    Average reward               12.5       12.2       0.0
  Dialogue problem (|S|=20, |A|=30, |O|=27):
    CPU time (secs)              >24 hrs    77.99      6.46
    Average reward               -          64.43      53.33
    % Correct actions            -          93.2       80.0
Related work on hierarchical methods
• Hierarchical HMMs: Fine et al., 1998.
• Hierarchical MDPs: Dayan & Hinton, 1993; Dietterich, 1998; McGovern et al., 1998; Parr & Russell, 1998; Singh, 1992.
• Loosely-coupled MDPs: Boutilier et al., 1997; Dean & Lin, 1995; Meuleau et al., 1998; Singh & Cohn, 1998; Wang & Mahadevan, 1999.
• Factored state POMDPs: Boutilier et al., 1999; Boutilier & Poole, 1996; Hansen & Feng, 2000.
• Hierarchical POMDPs: Castanon, 1997; Hernandez-Gardiol & Mahadevan, 2001; Theocharous et al., 2001; Wiering & Schmidhuber, 1997.
Outline
• Problem motivation
• Partially observable Markov decision processes
• The hierarchical POMDP algorithm
Proposed research
Proposed research
1) Algorithmic design
2) Algorithmic analysis
3) Model learning
4) System development and application
Research block #1: Algorithmic design
• Goal 1.1: Developing/implementing hierarchical POMDP algorithm.
• Goal 1.2: Extending H-POMDP for factorized state representation.
• Goal 1.3: Using state/observation abstraction.
• Goal 1.4: Planning for controllers with no local reward information.
Goal 1.3: State/observation abstraction

• Assumption #2: "Each POMDP controller has full state set S0, and observation set O0."
• Can we reduce the number of states/observations, |S| and |O|?

Yes! Each controller only needs a subset of the state/observation features. For example, the Navigate controller {Left, Right, Forward, Backward} and the InvestigateHealth controller {CheckPulse, CheckMeds} each depend only on features relevant to their subtask.

• What is the computational speed-up? Per step of value iteration, the POMDP recursive upper bound on time is |S|² |A| |Vn-1|^|O|, so shrinking each controller's |S| and |O| reduces both the base and the exponent.
Goal 1.4: Local controller reward information
• Assumption #3:
“Each controller includes some amount of discriminative reward information.”
• Can we relax this assumption?
Possibly. Use reward shaping to select a policy-invariant reward function.
• What is the benefit?
– H-POMDP could solve problems with sparse reward functions.
Research block #2: Algorithmic analysis
• Goal 2.1: Evaluating performance of the H-POMDP algorithm.
• Goal 2.2: Quantifying the loss due to the hierarchy.
• Goal 2.3: Comparing different possible decompositions of a problem.
Goal 2.1: Performance evaluation
• How does the hierarchical POMDP algorithm compare to:
  – Exact value function methods: Sondik, 1971; Monahan, 1982; Littman, 1996; Cassandra et al., 1997.
  – Policy search methods: Hansen, 1998; Kearns et al., 1999; Ng & Jordan, 2000; Baxter & Bartlett, 2000.
  – Value approximation methods: Parr & Russell, 1995; Thrun, 2000.
  – Belief approximation methods: Nourbakhsh, 1995; Koenig & Simmons, 1996; Hauskrecht, 2000; Roy & Thrun, 2000.
  – Memory-based methods: McCallum, 1996.
• Consider problems from the POMDP literature and the dialogue management domain.
Goal 2.2: Quantifying the loss
• The hierarchical POMDP planning algorithm provides an approximately-optimal policy.
• How “near-optimal” is the policy?
• Subject to some (very restrictive) conditions: "The value function of the top-level controller is an upper bound on the value of the approximation."

  V_top(b) ≥ V_actual(b)

• Can we loosen the restrictions? Tighten the bound? Find a lower bound?
Goal 2.3: Comparing different decompositions
• Assumption #4:
“We are given an action set partitioning graph.”
• What makes a good hierarchical action decomposition?
• Comparing decompositions is the first step towards automatic decomposition.
[Figure: two alternative action hierarchies over the actions {Manufacture, Examine, Inspect, Replace}, one using abstract actions a1 and a2, the other using a1, a2, and a3]
Research block #3: Model learning
• Goal 3.1: Automatically generating good action hierarchies.
– Assumption #4: “We are given an action set partitioning graph.”
– Can we automatically generate a good hierarchical decomposition?
– Maybe. It is being done for hierarchical MDPs.
• Goal 3.2: Including parameter learning.
– Assumption #5: “We are given a full POMDP model of the problem.”
– Can we introduce parameter learning?
– Yes! Maximum-likelihood parameter optimization (Baum-Welch) can be used for POMDPs.
Research block #4: System development and application

• Goal 4.1: Building an extensive dialogue manager

[Figure: system architecture. The Dialogue Manager connects the User (touchscreen input/messages, speech utterances) with the Robot module (robot sensor readings, motion commands), the Reminding module (reminder messages, status information), and the Teleoperation module (remote-control commands, facemail operations)]
An implemented scenario
[Figure: environment map showing the physiotherapy room, the patient room, and the robot home]

Problem size: |S|=288, |A|=14, |O|=15
State features: {RobotLocation, UserLocation, UserStatus, ReminderGoal, UserMotionGoal, UserSpeechGoal}
Test subjects: 3 elderly residents in an assisted living facility
Contributions
• Algorithmic contribution: A novel POMDP algorithm based on hierarchical structure.
Enables use of POMDPs for much larger problems.
• Application contribution: Application of POMDPs to dialogue management is novel.
Allows design of robust robot behavioural managers.
Research schedule
1) Algorithmic design/implementation: fall 01
2) Algorithmic analysis: spring/summer 02
3) Model learning: spring/summer/fall 02
4) System development and application: ongoing
5) Thesis writing: fall 02 / spring 03
Questions?
A simulated robot navigation example
Domain size: |S|=11, |A|=6, |O|=6
[Figure: action hierarchy with root Act, subtask Navigate(t) over {GoLeft, GoRight, GoBack, GoForward}, and actions Read, ReadMap, OpenDoor, and GetReward(t) ($$)]
A dialogue management example
Act subtasks and their actions:
- Move: AskGoWhere, GoToRoom, GoToKitchen, GoToFollow, VerifyRoom, VerifyKitchen, VerifyFollow
- Greet: GreetGeneral, GreetMorning, GreetNight, RespondThanks
- CheckWeather: AskWeatherTime, SayCurrent, SayToday, SayTomorrow, SayTime
- DoMeds: StartMeds, NextMeds, ForceMeds, QuitMeds
- Phone: AskCallWho, Call911, CallNurse, CallRelative, Verify911, VerifyNurse, VerifyRelative
- CheckHealth: AskHealth, OfferHelp

Domain size: |S|=20, |A|=30, |O|=27
Action hierarchy for implemented scenario
[Figure: action hierarchy with root Act and subtasks Remind, Assist (with sub-subtasks Move, Contact, Inform), and Rest, over actions including BringtoPhysio, CheckUserPresent, DeliverUser, SayWeather, VerifyRequest, SayTime, RemindPhysio, PublishStatus, RingBell, GotoRoom, VerifyBring, VerifyRelease, Recharge, GotoHome]
Sondik’s parts manufacturing problem
Decomposition 1: [Figure: hierarchy over {Manufacture, Examine, Inspect, Replace} with abstract actions a1, a2, a3]

Decomposition 2: [Figure: hierarchy with {Manufacture, Examine, Inspect} separated from {Replace}, with abstract actions a1, a2]

+ 5 more decompositions
Manufacturing task results
[Figure: bar chart of average reward (0 to 0.5) by planning method: POMDP, decompositions D1 through D7, and MDP]
Using state/observation abstraction

Action set / state set per controller:
- CheckHealth: actions {AskHealth, OfferHelp}; UserHealth = {good, poor, emergency}
- Phone: actions {AskCallWho, CallHelp, CallNurse, CallRelative, VerifyHelp, VerifyNurse, VerifyRelative}; CommunicationGoal = {none, nurse, 911, relative}
- DoMeds: ReminderGoal = {none, medsX}
- Abstracted features: ReminderGoal = {none, medsX}, CommunicationGoal = {none, personX}
Related work on robot planning and control
• Manually-scripted dialogue strategies: Denecke & Waibel, 1997; Walker et al., 1997.
• Markov decision processes (MDPs) for dialogue management: Levin et al., 1997; Fromer, 1998; Walker et al., 1998; Goddeau & Pineau, 2000; Singh et al., 2000; Walker, 2000.
• Robot interfaces: Torrance, 1996; Asoh et al., 1999.
• Classical planning: Fikes & Nilsson, 1971; Simmons, 1987; McAllester & Rosenblitt, 1991; Penberthy & Weld, 1992; Kushmerick, 1995; Veloso et al., 1995; Smith & Weld, 1998.
• Execution architectures: Firby, 1987; Musliner, 1993; Simmons, 1994; Bonasso & Kortenkamp, 1996.
Decision-theoretic planning models
The tiger problem: Value function solution

[Figure: value function V over belief, from S=tiger-left to S=tiger-right, y-axis from -100 to 40; the piecewise-linear upper surface is formed by the open-right, listen, and open-left components]
Optimizing the "investigate" controller

[Figure: value function V over belief, from S=tiger-left to S=tiger-right, y-axis from -120 to 0, with components for open-right and listen]
Optimizing the "act" controller

[Figure: value function V over belief, from S=tiger-left to S=tiger-right, y-axis from -60 to 80, with components for open-left and investigate]