
Page 1: Markov Decision Processes: A Survey

Markov Decision Processes: A Survey

Martin L. Puterman

Page 2: Markov Decision Processes: A Survey

Outline

• Example - Airline Meal Planning
• MDP Overview and Applications
• Airline Meal Planning Models and Results
• MDP Theory and Computation
• Bayesian MDPs and Censored Models
• Reinforcement Learning
• Concluding Remarks

Page 3: Markov Decision Processes: A Survey

Airline Meal Planning

• Goal: Get the right number of meals on each flight
• Why is this hard?
  – Meal preparation lead times
  – Load uncertainty
  – Last-minute uploading capacity constraints
• Why is this important to an airline?
  – 500 flights per day × 365 days × $5/meal = $912,500

Page 4: Markov Decision Processes: A Survey

[Figure: box-and-whisker style chart of the change in passenger loads over successive time periods (Diff36_6, Diff6_3, Diff3_2, Diff2_1, Diff1_0); vertical axis: amount, roughly -40 to +60.]

Page 5: Markov Decision Processes: A Survey

How Significant is the Problem?

[Figure: histogram of provisioning error (meals), roughly -20 to +30; vertical axis: frequency.]

Page 6: Markov Decision Processes: A Survey

The Meal Planning Decision Process

• At several key decision points up to 3 hours before departure, the meal planner observes reservations and meals allocated and adjusts the allocated meal quantity.
• Hourly in the last three hours, adjustments are made, but the cost of adjustment is significantly higher and limited by delivery van capacity and uploading logistics.

Page 7: Markov Decision Processes: A Survey

Meal Planning Timeline

[Figure: timeline from 18 hours before departure (schedule, production), to 6 hours (order assembly), to 3 hours (order ready to go), to departure; along the way the passenger load is observed, the order is delivered, and last-minute adjustments are made with the van.]

Page 8: Markov Decision Processes: A Survey

Airline Meal Planning

Operational goal: develop a meal planning strategy that minimizes expected total overage, underage and operational costs.

A meal planning strategy specifies, at each decision point, the number of extra meals to prepare or deliver for any observed meal allocation and reservation quantity.

Page 9: Markov Decision Processes: A Survey

Why is Finding an Optimal Meal Planning Strategy Challenging?

• 6 decision points
• 108 passengers
• 108 possible actions
• One strategy requires 108 × 108 × 6 = 69,984 order quantities.
• There are 7,558,272 strategies to consider.
• Demand must be forecasted.

Page 10: Markov Decision Processes: A Survey

Airline Meal Planning Characteristics

• A similar decision is made at several time points
• There are costs associated with each decision
• The decision has future consequences
• The overall cost depends on several events
• There is uncertainty about the future

Page 11: Markov Decision Processes: A Survey

What is a Markov decision process?

• A mathematical representation of a sequential decision making problem in which:
  – A system evolves through time.
  – A decision maker controls it by taking actions at pre-specified points of time.
  – Actions incur immediate costs or accrue immediate rewards and affect the subsequent system state.

Page 12: Markov Decision Processes: A Survey


MDP Overview

Page 13: Markov Decision Processes: A Survey

Markov Decision Processes are also known as:

• MDPs
• Dynamic Programs
• Stochastic Dynamic Programs
• Sequential Decision Processes
• Stochastic Control Problems

Page 14: Markov Decision Processes: A Survey

Early Historical Perspective

• Massé - Reservoir Control (1940s)
• Wald - Sequential Analysis (1940s)
• Bellman - Dynamic Programming (1950s)
• Arrow, Dvoretzky, Wolfowitz, Kiefer, Karlin - Inventory (1950s)
• Howard (1960) - Finite State and Action Models
• Blackwell (1962) - Theoretical Foundation
• Derman, Ross, Denardo, Veinott (1960s) - Theory - USA
• Dynkin, Krylov, Shiryaev, Yushkevich (1960s) - Theory - USSR

Page 15: Markov Decision Processes: A Survey

Basic Model Ingredients

• Decision epochs – {0, 1, 2, …, N} or [0, N] or {0, 1, 2, …} or [0, ∞)
• State space – S (generic state s)
• Action sets – A_s (generic action a)
• Rewards – r_t(s,a)
• Transition probabilities – p_t(j|s,a)

A model is called stationary if rewards and transition probabilities are independent of t.
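For concreteness, these ingredients can be written down as plain data. The sketch below uses a hypothetical two-state, two-action stationary MDP; the states, actions, rewards and probabilities are invented for illustration and are not taken from the talk.

    # A minimal, hypothetical stationary MDP written out as plain data structures:
    # states S, action sets A_s, rewards r(s, a), transition probabilities p(j | s, a).
    S = ["low", "high"]
    A = {"low": ["wait", "order"], "high": ["wait"]}
    r = {("low", "wait"): 0.0, ("low", "order"): -2.0, ("high", "wait"): 5.0}
    p = {
        ("low", "wait"):  {"low": 0.9, "high": 0.1},
        ("low", "order"): {"low": 0.2, "high": 0.8},
        ("high", "wait"): {"low": 0.5, "high": 0.5},
    }

    # Each transition distribution should sum to one.
    for dist in p.values():
        assert abs(sum(dist.values()) - 1.0) < 1e-9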

Page 16: Markov Decision Processes: A Survey

System Evolution

[Figure: at decision epoch t the system occupies state s_t, action a_t is chosen and reward r_t(s_t, a_t) accrues; at decision epoch t+1 the system occupies state s_{t+1}, action a_{t+1} is chosen and reward r_{t+1}(s_{t+1}, a_{t+1}) accrues.]

Page 17: Markov Decision Processes: A Survey

Another Perspective

[Figure: a decision-tree view with actions a_1 and a_2 branching to possible successor states s_1, s_2, s_3, s_4.]

Page 18: Markov Decision Processes: A Survey

Yet Another Perspective: An Event Timeline

• May 15: place June order
• June 1: May order arrives; ship product to DCs
• June 10: May sales data arrives; prepare July forecast
• June 15: place July order
• …

Page 19: Markov Decision Processes: A Survey

Some Variants on the Basic Model

• There may be a continuum of states and/or actions
• Decisions may be made in continuous time
• Rewards and transition rates may change over time
• The system state may not be observable
• Some model parameters may not be known

Page 20: Markov Decision Processes: A Survey

Derived Quantities

• Decision rules: d_t(s)
• Policies, strategies or plans: π = (d_1, d_2, …) or π = (d_1, d_2, …, d_N)
• Stochastic processes: (X_t, Y_t), E_s^π{ · }
• Value functions: v_t^π(s), v_λ^π(s), g^π, …
  – Value functions differ from immediate rewards; they represent the value, starting in a state, of all future events

Page 21: Markov Decision Processes: A Survey

Objective

Identify a policy that maximizes either the
– expected total reward (finite or infinite horizon)

    v^π(s) = E_s^π { Σ_{t=0} r_t(X_t, Y_t) }

– expected discounted reward
– expected long run average reward
– expected utility
possibly subject to constraints on system performance

Page 22: Markov Decision Processes: A Survey

The Bellman Equation

MDP computation and theory focuses on solving the optimality (Bellman) equation, which for infinite horizon discounted models is

  v(s) = max_{a ∈ A_s} { r(s,a) + λ Σ_{j ∈ S} p(j|s,a) v(j) }

This can also be expressed as

  v = Tv or Bv = 0

v(s) is the value function of the MDP.
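A minimal sketch of solving the discounted Bellman equation numerically by value iteration. It assumes an MDP given as the S, A, r, p structures of the hypothetical toy example earlier and a discount factor lam; this is an illustration, not the only computational scheme discussed in the talk.

    # Value iteration for the discounted optimality equation
    #   v(s) = max_{a in A_s} { r(s,a) + lam * sum_j p(j|s,a) * v(j) }.
    # S: list of states; A: state -> actions; r: (s,a) -> reward;
    # p: (s,a) -> {next state: probability}, matching the toy sketch above.
    def value_iteration(S, A, r, p, lam=0.95, tol=1e-8, max_iter=10_000):
        v = {s: 0.0 for s in S}
        for _ in range(max_iter):
            v_new = {s: max(r[(s, a)] + lam * sum(p[(s, a)][j] * v[j] for j in p[(s, a)])
                            for a in A[s])
                     for s in S}
            if max(abs(v_new[s] - v[s]) for s in S) < tol:   # sup-norm stopping rule
                return v_new
            v = v_new
        return v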

Page 23: Markov Decision Processes: A Survey

Some Theoretical Issues

• When does an optimal policy with nice structure exist?
  – Markov or stationary policy
  – (s,S) or control limit policy
• When do computational algorithms converge? And how fast?
• What properties do solutions of the optimality equation have?

Page 24: Markov Decision Processes: A Survey

Computing Optimal Policies

• Why?
  – Implementation
  – Gold standard for heuristics
• Basic principle: transform a multi-period problem into a sequence of one-period problems.
• Why is computation difficult in practice?
  – Curse of dimensionality

Page 25: Markov Decision Processes: A Survey

Computational Methods

• Finite horizon models
  – Backward Induction (dynamic programming; a sketch follows below)
• Infinite horizon models
  – Value Iteration
  – Policy Iteration
  – Modified Policy Iteration
  – Linear Programming
  – Neuro-Dynamic Programming / Reinforcement Learning
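As an illustration of the finite-horizon case, here is a minimal backward-induction sketch. It assumes the same S, A, r, p structures as the earlier toy example, a horizon of N periods and zero terminal value; it is a sketch, not the talk's implementation.

    # Backward induction for a finite-horizon model: with one more period to go,
    #   v_t(s) = max_{a in A_s} { r(s,a) + sum_j p(j|s,a) * v_{t+1}(j) }, terminal value 0.
    # Returns the value function at epoch 0 and one decision rule per epoch.
    def backward_induction(S, A, r, p, N):
        v = {s: 0.0 for s in S}                      # terminal values
        policy = []
        for _ in range(N):
            d, v_new = {}, {}
            for s in S:
                q = {a: r[(s, a)] + sum(p[(s, a)][j] * v[j] for j in p[(s, a)])
                     for a in A[s]}
                d[s] = max(q, key=q.get)             # optimal action at this epoch
                v_new[s] = q[d[s]]
            policy.insert(0, d)
            v = v_new
        return v, policy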

Page 26: Markov Decision Processes: A Survey

Infinite Horizon Computation

• Iterative algorithms work as follows:
  – Approximate the value function by ṽ(s)
  – Select a new decision rule by (a sketch follows below)

      d(s) = argmax_{a ∈ A_s} { r(s,a) + λ Σ_{j ∈ S} p(j|s,a) ṽ(j) }

  – Re-approximate the value function
• Approximation methods
  – exact - policy iteration
  – iterative - value iteration and modified policy iteration
  – simulation based - reinforcement learning
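The "select a new decision rule" step is the greedy improvement used by policy iteration and its variants. A minimal sketch, again assuming the toy S, A, r, p structures and an approximate value function v_tilde; the names are illustrative.

    # One improvement step: given an approximate value function v_tilde, pick the greedy rule
    #   d(s) = argmax_{a in A_s} { r(s,a) + lam * sum_j p(j|s,a) * v_tilde(j) }.
    def greedy_rule(S, A, r, p, v_tilde, lam=0.95):
        d = {}
        for s in S:
            q = {a: r[(s, a)] + lam * sum(p[(s, a)][j] * v_tilde[j] for j in p[(s, a)])
                 for a in A[s]}
            d[s] = max(q, key=q.get)                 # greedy action in state s
        return d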

Page 27: Markov Decision Processes: A Survey

Applications (A to N)

• Airline Meal Planning
• Behavioural Ecology
• Capacity Expansion
• Decision Analysis
• Equipment Replacement
• Fisheries Management
• Gambling Systems
• Highway Pavement Repair
• Inventory Control
• Job Seeking Strategies
• Knapsack Problems
• Learning
• Medical Treatment
• Network Control

Page 28: Markov Decision Processes: A Survey

Applications (O to Z)

• Option Pricing
• Project Selection
• Queueing System Control
• Robotic Motion
• Scheduling
• Tetris
• User Modeling
• Vision (Computer)
• Water Resources
• X-Ray Dosage
• Yield Management
• Zebra Hunting

Page 29: Markov Decision Processes: A Survey

“Coffee, Tea or …? A Markov Decision Process Model for Airline Meal Provisioning”

J. Goto, M.E. Lewis and MLP

• Decision epochs: T = {1, …, 5}
  – 0 - departure time
  – 1-3: 1, 2 and 3 hours pre-departure
  – 4: 6 hours pre-departure
  – 5: 36 hours pre-departure
• States: {(l,q): 0 ≤ l ≤ Booking Limit, 0 ≤ q ≤ Capacity}
• Actions (meal quantity after delivery):
  – A_{t,(l,q)} = {0, 1, …, Plane Capacity}  for t = 3, 4, 5
  – A_{t,(l,q)} = {q - van capacity, …, q + van capacity}  for t = 1, 2

Page 30: Markov Decision Processes: A Survey

Markov Decision Process Formulation

Costs (depending on t):

  r_t((l,q), a) = meal cost + return penalty + late delivery charge + shortage cost

Transition probabilities:

  p_t((l',q') | (l,q), a) = p_t(l'|l)  if a = q'
                          = 0          if a ≠ q'
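A sketch of this transition structure in code: the meal-quantity component of the state is set by the action, and only the passenger load moves randomly. The small load-transition table below is a hypothetical stand-in for p_t(l'|l), not data from the study.

    # Transition structure implied by the formulation: the meal quantity after the
    # decision equals the action (q' = a); only the passenger load l moves randomly.
    # p_load is a hypothetical stand-in for p_t(l' | l).
    p_load = {(100, 98): 0.3, (100, 100): 0.5, (100, 102): 0.2}

    def p_transition(l_next, q_next, l, q, a):
        """p_t((l', q') | (l, q), a): zero unless the action sets q' = a."""
        if q_next != a:
            return 0.0
        return p_load.get((l, l_next), 0.0)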

Page 31: Markov Decision Processes: A Survey

An Optimal Decision Rule

[Figure: the meal planning timeline (schedule and production at 18 hours, order assembly at 6 hours, order ready to go at 3 hours, van adjustments, departure) with decision epoch 1 marked, and a plot of the optimal meal quantity as a function of passenger load.]

Page 32: Markov Decision Processes: A Survey

Empirical Performance

[Figure: histograms of provisioning error under actual practice and under the model.]

Actual: mean = 9.81, standard deviation = 8.46
Model (out of sample): mean = 7.99, standard deviation = 6.96

Page 33: Markov Decision Processes: A Survey

Overage versus Shortage

• Evaluate the model over a range of terminal costs
• Observe the relationship of average overage and proportion of flights short-catered
• 55 flight number / aircraft capacity combinations (evaluated separately)

Page 34: Markov Decision Processes: A Survey

Overage versus Shortage

Performance of optimal policies

FLIGHT GROUP    Optimal outperforms actual   Optimal close to actual   Actual outperforms optimal   Total
INTERNATIONAL   6                            1                         0                            7
TRANSBORDER     13                           3                         0                            16
TRANSCON        9                            0                         2                            11
WEST            11                           3                         7                            21
Total           39                           7                         9                            55

Page 35: Markov Decision Processes: A Survey


Information Acquisition

Page 36: Markov Decision Processes: A Survey

Information Acquisition and Optimization

• Objective: investigate the tradeoff between acquiring information and optimal policy choice
• Examples:
  – Harpaz, Lee and Winkler (1982) study output decisions of a competitive firm in a market with random demand in which the demand distribution is unknown.
  – Braden and Oren (1994) study dynamic pricing decisions of a firm in a market with unknown consumer demand curves.
  – Lariviere and Porteus (1999) and Ding, Puterman and Bisi (2002) study order decisions of a censored newsvendor with unobservable lost sales and unknown demand distributions.
• Key result - it is optimal to “experiment”

Page 37: Markov Decision Processes: A Survey

Bayesian Newsvendor Model

• Newsvendor cost structure (c - cost, h - salvage value, p - penalty cost)
• Demand assumptions
  – positive, continuous
  – i.i.d. sample from f(x|θ) with θ unknown
  – prior on θ is π_1(θ)
• Assume first that demand is fully observable:
  Demand = Sales + “observed” lost sales

Page 38: Markov Decision Processes: A Survey

Time Line of Events

[Figure: in period 1, with prior π_1(θ), set the order quantity y_1, observe demand x_1 and incur r(x_1, y_1); update to π_2(θ), set y_2, observe x_2 and incur r(x_2, y_2); update to π_3(θ), and so on.]

Page 39: Markov Decision Processes: A Survey

Demand Updating

After observing demand x_n, the prior π_n(θ) is updated to

  π_{n+1}(θ) = f(x_n|θ) π_n(θ) / ∫ f(x_n|θ) π_n(θ) dθ

and the predictive demand density is

  m_n(x) = ∫ f(x|θ) π_n(θ) dθ
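A minimal numerical sketch of this updating step on a grid of parameter values; the exponential demand likelihood, the flat prior and the grid are assumptions made purely for illustration.

    import numpy as np

    # Grid approximation of the update pi_{n+1}(theta) ∝ f(x_n | theta) * pi_n(theta)
    # and of the predictive density m_n(x) = ∫ f(x | theta) pi_n(theta) dtheta.
    theta = np.linspace(0.01, 1.0, 200)            # grid over the unknown parameter
    pi_n = np.ones_like(theta) / theta.size        # prior pi_n(theta) on the grid

    def likelihood(x, theta):
        return theta * np.exp(-theta * x)          # assumed exponential demand density

    def update(pi_n, x_n):
        post = likelihood(x_n, theta) * pi_n       # Bayes' rule, then normalize on the grid
        return post / post.sum()

    def predictive(pi_n, x):
        return float(np.sum(likelihood(x, theta) * pi_n))   # m_n(x), up to grid spacing

    pi_next = update(pi_n, x_n=3.2)                # posterior after observing demand 3.2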

Page 40: Markov Decision Processes: A Survey

Bayesian Newsvendor Model

Bayesian MDP Formulation. At decision epoch n (n = 1, 2, …, N):
– States: {all probability distributions π_n on the unknown parameter}
– Actions: A = R^+ = [0, ∞)
– Costs:

    r_n(π_n, y) = E_{π_n}[ r(X_n, y) ]
                = c y - h ∫_0^y (y - x) m_n(x) dx + p ∫_y^∞ (x - y) m_n(x) dx

– Transition probabilities:

    p_n(π_{n+1} | π_n, y_n, x_n) = m_n(x_n)

Page 41: Markov Decision Processes: A Survey

Bayesian Newsvendor Model

The Optimality Equations

  u_n(π_n) = min_{y ∈ R^+} { r_n(π_n, y) + ∫_{R^+} u_{n+1}(π_{n+1}) m_n(x) dx }

for n = 1, …, N, with the boundary condition u_{N+1}(π_{N+1}) = 0.

Key observation: the transition probabilities are independent of the actions, so the problem can be reduced to a sequence of single-period problems.

Page 42: Markov Decision Processes: A Survey

The Bayesian Newsvendor Policy

The BMDP reduces to a sequence of single-period, two-step problems.
• Demand distribution parameter updating: π_{n-1} is updated to π_n using the observation x_{n-1}
• Cost minimization:

    y_n^BN = M_n^{-1}( (p - c) / (p - h) )  for n = 1, …, N,

  where M_n is the CDF of m_n.
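A sketch of the cost-minimization step: invert the predictive CDF at the critical fractile (p - c)/(p - h). The cost values, the demand grid and the exponential predictive density in the example call are assumptions for illustration only.

    import numpy as np

    # Bayesian newsvendor quantity y_BN = M_n^{-1}((p - c)/(p - h)), where M_n is the
    # CDF of the predictive density m_n (passed in as a callable).
    def bayes_newsvendor_quantity(m_n, c, h, p, x_max=50.0, grid=2000):
        xs = np.linspace(0.0, x_max, grid)
        pdf = np.array([m_n(x) for x in xs])
        cdf = np.cumsum(pdf)
        cdf /= cdf[-1]                                  # normalized predictive CDF
        fractile = (p - c) / (p - h)                    # critical ratio from the slide
        return xs[np.searchsorted(cdf, fractile)]       # smallest y with M_n(y) >= fractile

    # Example with an exponential predictive density (an illustrative assumption):
    y_BN = bayes_newsvendor_quantity(lambda x: 0.3 * np.exp(-0.3 * x), c=1.0, h=0.25, p=4.0)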

Page 43: Markov Decision Processes: A Survey

Bayesian Newsvendor with Unobservable Lost Sales

• Model set-up
  – Same as the fully observable case, but unmet demand is lost and unobservable
• Question
  – Is the Bayesian Newsvendor policy optimal?

Page 44: Markov Decision Processes: A Survey

Bayesian Newsvendor with Unobservable Lost Sales

Demand is censored by the order quantity:
• X_n = x_n < y_n : demand exactly observed
• X_n ≥ y_n (only sales x_n = y_n are seen) : demand censored

Demand updating is different in this case:

  π_{n+1}(θ) = f(x_n|θ) π_n(θ) / ∫ f(x_n|θ) π_n(θ) dθ                                if x_n < y_n

  π_{n+1}^c(θ) = [∫_{y_n}^∞ f(x|θ) dx] π_n(θ) / ∫ [∫_{y_n}^∞ f(x|θ) dx] π_n(θ) dθ    if x_n = y_n
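A grid-based sketch of the two updates: an exact Bayes step when sales fall below the order quantity, and a censored step that multiplies the prior by the survival probability Pr(X ≥ y | θ) when sales hit the order quantity. The exponential likelihood and the grid are again illustrative assumptions.

    import numpy as np

    # Censored Bayesian update: if sales equal the order quantity y, only the event
    # {demand >= y} is learned, so pi^c(theta) ∝ Pr(X >= y | theta) * pi(theta).
    theta = np.linspace(0.01, 1.0, 200)        # grid over the unknown rate parameter
    pi_n = np.ones_like(theta) / theta.size    # current prior pi_n on the grid

    def update_exact(pi_n, x):                 # demand x < y observed exactly
        post = theta * np.exp(-theta * x) * pi_n          # f(x | theta) * pi_n(theta)
        return post / post.sum()

    def update_censored(pi_n, y):              # stockout: only know demand >= y
        post = np.exp(-theta * y) * pi_n                  # Pr(X >= y | theta) * pi_n(theta)
        return post / post.sum()

    pi_exact = update_exact(pi_n, x=3.2)       # sales below the order quantity
    pi_cens = update_censored(pi_n, y=5.0)     # sales hit the order quantity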

Page 45: Markov Decision Processes: A Survey

Bayesian Newsvendor with Unobservable Lost Sales

[Figure: updating tree. After setting y_{n-1}, either demand x_{n-1} below y_{n-1} is observed exactly (each value occurring with predictive probability m_{n-1}(x_{n-1})), giving the usual Bayes update of π_{n-1} with likelihood f(x_{n-1}|θ), or sales equal y_{n-1} (with probability 1 - M_{n-1}(y_{n-1})), giving the censored update π_n^c in which the likelihood is replaced by ∫_{y_{n-1}}^∞ f(x|θ) dx.]

Page 46: Markov Decision Processes: A Survey

Bayesian Newsvendor with Unobservable Lost Sales

Model Formulation
– States, actions, costs: as above
– Transition probabilities:

    p_n(π' | π_n, y_n) = m_n(x_n)       when π' = π_{n+1} (exact update, x_n < y_n)
                       = 1 - M_n(y_n)   when π' = π_{n+1}^c (censored update)

The Optimality Equations

  u_N(π_N) = min_{y ∈ R^+} { r_N(π_N, y) }

  u_n(π_n) = min_{y ∈ R^+} { r_n(π_n, y) + ∫_0^y u_{n+1}(π_{n+1}) m_n(x) dx + u_{n+1}(π_{n+1}^c) [1 - M_n(y)] }

Page 47: Markov Decision Processes: A Survey

Bayesian Newsvendor with Unobservable Lost Sales

The Key Result

  y_n* ≥ y_n^BN = M_n^{-1}( (p - c)/(p - h) )  for n = 1, …, N-1;
  y_N* = y_N^BN = M_N^{-1}( (p - c)/(p - h) )

if f(x|θ) is increasing in θ in the likelihood ratio order.
• In this model, decisions in separate periods are interrelated through the optimality equation.
• This means that it is optimal to trade off short term optimality for learning.
• Question: What is an upper bound on y_n*?

Page 48: Markov Decision Processes: A Survey

Bayesian Newsvendor with Unobservable Lost Sales

Solving the optimality equations gives, for N:

  (c - h) M_N(y_N) = (p - c) [1 - M_N(y_N)]

For n = 1, …, N-1, differentiating the optimality equation gives a first-order condition that involves du_{n+1}(π_{n+1}^c)/dy_n and the difference u_{n+1}(π_{n+1}^c) - u_{n+1}(π_{n+1}), and that can be written as

  (c - h) M_n(y_n) = (p'(y_n) - c) [1 - M_n(y_n)]

where p'(y_n) is a “policy dependent” penalty cost.

The proof of the key result is based on showing p'(y_n) > p for n < N.

Page 49: Markov Decision Processes: A Survey

Bayesian Newsvendor with Unobservable Lost Sales

• Some comments
  – The extra penalty can be interpreted as the marginal expected value of information at decision epoch n.
  – Numerical results show small improvements when using the optimal policy as opposed to the Bayesian Newsvendor policy.
  – We have extended this to a two-level supply chain.

Page 50: Markov Decision Processes: A Survey

Partially Observed MDPs

• In POMDPs, the system state is not observable.
  – The decision maker receives a signal y which is related to the system state by q(y|s,a).
• Analysis is based on using Bayes' theorem to estimate the distribution of the system state given the signal.
  – Similar to the Bayesian MDPs described above
  – The posterior state distribution is a sufficient statistic for decision making
  – The state space is a continuum
  – Early work by Smallwood and Sondik (1972)
• Applications
  – Medical diagnosis and treatment
  – Equipment repair

Page 51: Markov Decision Processes: A Survey


Reinforcement Learning and Neuro-Dynamic Programming

Page 52: Markov Decision Processes: A Survey

Neuro-Dynamic Programming or Reinforcement Learning

• A different way to think about dynamic programming
  – Basis in artificial intelligence research
  – Mimics learning behavior of animals
• Developed to:
  – Solve problems with high dimensional state spaces, and/or
  – Solve problems in which the system is described by a simulator (as opposed to a mathematical model)
• NDP/RL refers to a collection of methods combining Monte Carlo methods with MDP concepts

Page 53: Markov Decision Processes: A Survey

Reinforcement Learning

• Mimics learning by experimenting in real life
  – learn by interacting with the environment
  – the goal is long term
  – uncertainty may be present (a task must be repeated many times to obtain its value)
• Trades off between exploration and exploitation
• Key focus is estimating a value function ( v(s) or Q(s,a) )
  – Start with a guess of the value function
  – Carry out the task and observe the immediate outcome (reward and transition)
  – Update the value function

Page 54: Markov Decision Processes: A Survey

Reinforcement Learning - Example

• Playing Tic-Tac-Toe (Sutton and Barto, 2000)
  – You know the rules of the game but not the opponent's strategy (assumed fixed over time but with a random component)
  – Approach
    • List possible system states
    • Start with an initial guess of the probability of winning in each state
    • Observe the current state (s) and choose the action that will move you to the state with the highest probability of winning
    • Observe the state (s') after the opponent plays
    • Revise the value in state s based on the value in s'.
  – The player might not always choose the “best action”, but try other actions to learn about different states.

Page 55: Markov Decision Processes: A Survey

Reinforcement Learning - Example

• Observations about the Tic-Tac-Toe example
  – The dimension of the state space is 3^9
  – Writing down a mathematical model for the game is challenging; simulating it is easy.
  – The goal is to maximize the probability of winning; there is no immediate reward.
  – A possible updating mechanism using observations (a sketch follows below):

      v_new(s) = v_old(s) + α ( v_old(s') - v_old(s) )

    • α is a step-size parameter
    • v_old(s') - v_old(s) is a temporal difference
  – The subsequent state s' depends on the player's action.
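A minimal sketch of this tabular update; the board encoding and the initial value of 0.5 are placeholders chosen for illustration.

    from collections import defaultdict

    # Tabular temporal-difference update for the win-probability estimates:
    #   v(s) <- v(s) + alpha * ( v(s') - v(s) ).
    alpha = 0.1
    v = defaultdict(lambda: 0.5)            # initial guess: 50% chance of winning

    def td_update(s, s_next):
        v[s] += alpha * (v[s_next] - v[s])

    # After our move from state s, the opponent replies and we observe s'; update v(s).
    s = ("X", "", "", "", "O", "", "", "", "")
    s_next = ("X", "", "X", "", "O", "", "O", "", "")
    td_update(s, s_next)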

Page 56: Markov Decision Processes: A Survey

Reinforcement Learning

Problems can be classified in two ways (model available vs. no model; terminating vs. non-terminating):

• Terminating, model: estimate v(s) by simulating the process many times to termination; use the estimate with the model to choose a policy or improvement.
• Terminating, no model: estimate Q(s,a) by simulating the process many times to termination; use the estimate of Q(s,a) to choose a policy or improvement.
• Non-terminating, model: estimate v(s) by using temporal-difference updating; use the estimate with the model to choose a policy or improvement.
• Non-terminating, no model: estimate Q(s,a) by using temporal-difference updating; use the estimate of Q(s,a) to choose a policy or improvement.

Page 57: Markov Decision Processes: A Survey

Temporal Difference Updating

• No-model example, discounted case - based on

    Q(s,a) = E_{s,a} { Σ_{t=0}^∞ λ^t r(X_t, Y_t) }

• Algorithm (policy specified)
  – Start the system in s, choose action a and observe s' and a'
  – Update Q(s,a) ← Q(s,a) + α ( r + λ Q(s',a') - Q(s,a) )
  – Repeat, replacing (s,a) by (s',a')
• Algorithm (no policy specified) (Q-Learning; a sketch follows below)
  – Start the system in s, choose action a and observe s' and a'
  – Update Q(s,a) ← Q(s,a) + α ( r + λ max_{a' ∈ A_{s'}} Q(s',a') - Q(s,a) )
  – Repeat, replacing s by s'
• Issues include choosing α and stopping criteria
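A sketch of one tabular Q-learning step with ε-greedy action choice; the `step` argument stands for a hypothetical simulator that returns the reward, the next state and the actions available there.

    from collections import defaultdict
    import random

    # Tabular Q-learning:
    #   Q(s,a) <- Q(s,a) + alpha * ( r + lam * max_{a'} Q(s',a') - Q(s,a) ).
    Q = defaultdict(float)
    alpha, lam, eps = 0.1, 0.95, 0.1

    def q_learning_step(s, actions, step):
        # epsilon-greedy choice: explore with probability eps, otherwise exploit.
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        r, s_next, next_actions = step(s, a)          # one simulated transition (assumed interface)
        target = r + lam * max((Q[(s_next, b)] for b in next_actions), default=0.0)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        return s_next, next_actions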

Page 58: Markov Decision Processes: A Survey

RL Function Approximation

• High dimensionality is addressed by
  – replacing v(s) or Q(s,a) by the representation

      Q̃(s,a) = Σ_{i=1}^k w_i φ_i(s,a)

    and then applying the Q-learning algorithm, updating the weights w_i at each iteration (a sketch follows below), or
  – approximating v(s) or Q(s,a) by a neural network
• Issue: choose “basis functions” φ_i(s,a) to reflect problem structure
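A sketch of the linear representation with a stochastic-gradient version of the Q-learning weight update; the feature map phi below is a hypothetical placeholder with k = 4 basis functions.

    import numpy as np

    # Linear approximation Q~(s,a) = sum_{i=1..k} w_i * phi_i(s,a).
    def phi(s, a):
        return np.array([1.0, s, a, s * a])   # placeholder basis functions

    w = np.zeros(4)
    alpha, lam = 0.05, 0.95

    def q_hat(s, a):
        return float(w @ phi(s, a))

    def td_weight_update(s, a, r, s_next, next_actions):
        """w <- w + alpha * ( r + lam * max_{a'} Q~(s',a') - Q~(s,a) ) * phi(s,a)."""
        global w
        target = r + lam * max(q_hat(s_next, b) for b in next_actions)
        w += alpha * (target - q_hat(s, a)) * phi(s, a)

    td_weight_update(s=1.0, a=0.0, r=2.0, s_next=1.5, next_actions=[0.0, 1.0])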

Page 59: Markov Decision Processes: A Survey

RL Applications

• Backgammon
• Checker Player
• Robot Control
• Elevator Dispatching
• Dynamic Telecommunications Channel Allocation
• Job Shop Scheduling
• Supply Chain Management

Page 60: Markov Decision Processes: A Survey

Neuro-Dynamic Programming / Reinforcement Learning

“It is unclear which algorithms and parameter settings will work on a particular problem, and when a method does work, it is still unclear which ingredients are actually necessary for success. As a result, applications often require trial and error in a long process of parameter tweaking and experimentation.”

van Roy - 2002

Page 61: Markov Decision Processes: A Survey


Conclusions

Page 62: Markov Decision Processes: A Survey

Concluding Comments

• MDPs provide an elegant formal framework for sequential decision making
• They are widely applicable
  – They can be used to compute optimal policies
  – They can be used as a baseline to evaluate heuristics
  – They can be used to determine structural results about optimal policies
• Recent research is addressing “The Curse of Dimensionality”

Page 63: Markov Decision Processes: A Survey

Some References

Bertsekas, D.P. and Tsitsiklis, J.N., Neuro-Dynamic Programming, Athena, 1996.
Feinberg, E.A. and Shwartz, A., Handbook of Markov Decision Processes: Methods and Applications, Kluwer, 2002.
Puterman, M.L., Markov Decision Processes, Wiley, 1994.
Sutton, R.S. and Barto, A.G., Reinforcement Learning, MIT, 2000.