
Lectures in Dynamic Programming
and Stochastic Control

Arthur F. Veinott, Jr.

Spring 2008
MS&E 351 Dynamic Programming and Stochastic Control

Department of Management Science and Engineering
Stanford University
Stanford, California 94305


Copyright © 2008 by Arthur F. Veinott, Jr.


Contents

1 Discrete-Time-Parameter Finite Markov Population Decision Chains
   1 FORMULATION
   2 MAXIMUM N-PERIOD VALUE: RECURSION AND EXAMPLES
      Minimum-Cost Chain
      Supply Management
      Exercising a Call Option
      System Reliability
      Maximum Expected N-Period Instantaneous Rate of Return: Portfolio Selection
   3 MAXIMUM EXPECTED N-PERIOD UTILITY WITH CONSTANT RISK POSTURE
      Expected Utilities and Risk Aversion/Preference
      Constant Additive Risk Posture
      Constant Multiplicative Risk Posture
      Additive and Multiplicative Utility Functions
      Maximum Expected N-Period Symmetric Multiplicative Utility
      Maximum N-Period Instantaneous Rate of Expected Return
   4 MAXIMUM VALUE IN CIRCUITLESS SYSTEMS
      Knapsack Problem (Capital Budgeting)
      Portfolio Selection
   5 DECISIONS, POLICIES, OPTIMAL-RETURN OPERATOR
   6 MAXIMUM N-PERIOD VALUE: FORMALITIES
      Advantages of Maximum-N-Period-Value Policies
      Rolling Horizons and the Backward and Forward Recursions
      Limitations of Maximum-N-Period-Value Policies
   7 MAXIMUM VALUE IN TRANSIENT SYSTEMS
      Why Study Infinite-Horizon Problem
      Characterization of Transient Matrices
      Transient System
      Comparison Lemma
      Policy-Improvement Method
      Stationary Maximum-Value Policies: Existence and Characterization
      Newton's Method: A Specialization of Policy-Improvement Method
      Successive Approximations: Maximum N-Period Value Converges to Maximum Value
      Geometric Interpretation: a Single-State Reliability Example
      System Degree, Spectral Radius and Polynomial Boundedness
      Geometric Convergence of Successive Approximations
      Contraction Mappings
      Linear-Programming Method
      State-Action Frequency
      Simplex Method: A Specialization of Policy-Improvement Method
      Running Times
      Stochastic Constraints
      Stationary Randomized Policies
      Supply Management with Service Constraints
      Maximum Present Value
      Multi-Armed Bandit Problem: Optimality of Largest-Index Rule


   8 MAXIMUM PRESENT VALUE WITH SMALL INTEREST RATES IN BOUNDED SYSTEMS
      Bounded System
      Strong Maximum Present Value in Bounded Systems
      Cesàro Limits and Neumann Series
      Stationary and Deviation Matrices
      Laurent Expansion of Resolvent
      Laurent Expansion of Present Value for Small Interest Rates
      Characterization of Stationary Strong Maximum-Present-Value Policies
      Application to Controlling Service and Rework Rates in a G/M/∞ Queue
      Strong Policy-Improvement Method
      Existence and Characterization of Stationary Strong Maximum-Present-Value Policies
      Truncation of Infinite Matrices
      n-Optimality: Efficient Implementation of the Strong Policy-Improvement Method
   9 CESÀRO OVERTAKING OPTIMALITY WITH IMMIGRATION IN BOUNDED SYSTEMS
      Controlled Queueing Network with Proportional Service Rates
      Cash Management
      Manpower Planning
      Insurance Management
      Asset Management
      Immigration Stream
      Cohort and Markov Policies
      Overtaking Optimality
      Cesàro Overtaking Optimality
      Convolutions
      Binomial Coefficients and Sequences
      Polynomial Expansion of Expected Population Sizes with Binomial Immigration
      Polynomial Expansion of N-Period Values
      Comparison Lemma for N-Period Values
      Cesàro Overtaking Optimality with Binomial Immigration Stream
      Reward-Rate Optimality
      Cesàro Overtaking Optimality
      Float Optimality
      Value Interpretation of Immigration Stream
      Combining Physical and Value Immigration Streams
      Cesàro Overtaking Optimality with More General Immigration Streams
      Future-Value Optimality
   10 SUMMARY

2 Team Decisions, Certainty Equivalents and Stochastic Programming
   1 FORMULATION AND EXAMPLES
      Airline Reservations with Uncertain Demand
      Inventory Control with Uncertain Demand
      Transportation Problem with Uncertain Demand
      Capacity Planning with Uncertain Demand
   2 REDUCTION OF STOCHASTIC TO ORDINARY MATHEMATICAL PROGRAMS
      Linear and Quadratic Programs
      Computations
      Comparison of Dynamic and Stochastic Programming


   3 QUADRATIC UNCONSTRAINED TEAM DECISION PROBLEMS
   4 SEQUENTIAL QUADRATIC UNCONSTRAINED TEAMS: CERTAINTY EQUIVALENTS
      Interpretation of Solution
      Computations
      Quadratic Control Problem
      Rocket Control
      Multiproduct Supply Management
      Dynamic Programming Solution with Zero Random Errors
      Solution with Independent Random Errors
      Solution with Dependent Random Errors
      Strengths and Weaknesses

3 Continuous-Time-Parameter Markov Population Decision Processes
   1 FINITE CHAINS: FORMULATION
   2 MAXIMUM T-PERIOD VALUE: BELLMAN'S EQUATION AND EXAMPLES
      Controlled Queues
      Supply Management
      Project Scheduling
   3 PIECEWISE-CONSTANT POLICIES, GENERATORS, TRANSITION MATRICES
      Nonnegative, Substochastic and Stochastic Transition Matrices
   4 CHARACTERIZATION OF MAXIMUM T-PERIOD-VALUE POLICIES
   5 EXISTENCE OF MAXIMUM T-PERIOD-VALUE POLICIES
   6 MAXIMUM T-PERIOD VALUE WITH A SINGLE STATE
   7 EQUIVALENCE PRINCIPLE FOR INFINITE-HORIZON PROBLEMS
   8 MAXIMUM PRINCIPLE
      Linear Control Problem
      Markov Population Decision Chain
   9 MAXIMUM PRESENT VALUE FOR CONTROLLED ONE-DIMENSIONAL DIFFUSIONS
      Diffusions
      Probability of Reaching One Boundary Before the Other
      Mean Time to Reach Boundary
      Controlled Diffusions
      Maximizing the Probability of Accumulating Given Wealth

Appendix: Functions of Matrices
   1 MATRIX NORM
   2 EIGENVALUES AND VECTORS
   3 SIMILARITY
   4 JORDAN FORM
   5 SPECTRAL MAPPING THEOREM
   6 MATRIX DERIVATIVES AND INTEGRALS
   7 MATRIX EXPONENTIALS
   8 MATRIX DIFFERENTIAL EQUATIONS

References
   BOOKS
   SURVEYS
   ARTICLES

Index of Symbols


Homework Assignments

Homework 1 (4/11/08)
   Exercising a Put Option
   Requisition Processing
   Matrix Products
   Bridge Clearance

Homework 2 (4/18/08)
   Dynamic Portfolio Selection with Constant Multiplicative Risk Posture
   Airline Overbooking
   Sequencing: Optimality of Index Policies

Homework 3 (4/25/08)
   Multifacility Linear-Cost Production Planning
   Discovering System Transience
   Successive Approximations and Newton's Method Find Nearly Optimal Policies in Linear Time

Homework 4 (5/2/08)
   Component Replacement
   Optimal Stopping Policy
   Simple Stopping Problems
   House Buying

Homework 5 (5/9/08)
   Bayesian Statistical Quality Control and Repair
   Minimum Expected Present Value of Sojourn Times
   Optimal Control of Tandem Queues

Homework 6 (5/16/08)
   Limiting Present-Value Optimality with Binomial Immigration
   Maximizing Reward Rate by Linear Programming

Homework 7 (5/23/08)
   Discovering System Boundedness
   Finding the Maximum Spectral Radius
   Irreducible Systems and Cesàro-Geometric-Overtaking Optimality

Homework 8 (5/30/08)
   Element-Wise Product of Symmetric Positive Semi-Definite Matrices
   Quadratic Unconstrained Team-Decision Problem with Normally Distributed Observations
   Optimal Baking

Homework 9 (6/4/08)
   Pricing a House for Sale
   Transient Systems in Continuous Time


1  Discrete-Time-Parameter Finite Markov Population Decision Chains

1 FORMULATION

A discrete-time-parameter finite Markov population decision chain is a system that involves a finite population evolving over a sequence of periods labeled 1, 2, … and over which one can exert some control. The system description depends on four data elements, viz., states, actions, rewards and transition rates. In each period N, each individual is in some state s in a set 𝒮 of S < ∞ states. The state summarizes all "relevant" information about the system history H as of period N. Each individual in state s chooses an action a from a finite set A_s^H = A_s of possible actions, earns a reward r(s,a,H) = r(s,a), and generates a (finite) expected number p(t|s,a,H) = p(t|s,a) ≥ 0 of individuals in state t ∈ 𝒮 in period N+1.¹ Call p(t|s,a) the transition rate. The assumptions that A, r and p depend on H only through s are essential for a state in a period to summarize all relevant information about the system history as of the period.

¹ For nearly all optimality concepts considered in the sequel, it suffices to consider only the expected numbers of individuals entering each state rather than the distribution of the actual random numbers of individuals entering each state. For that reason, those distributions are not considered explicitly here.


Observe that the size of the population may vary over time. Also, there is no interaction among the individuals. Figure 1 illustrates this situation.

Figure 1 (schematic: an individual in state s taking action a generates individuals in states t at rates p(t|s,a))

Call a system stochastic (resp., substochastic) if ∑_t p(t|s,a) = 1 (resp., ≤ 1) for each a ∈ A_s and s ∈ 𝒮. An important instance of a stochastic (resp., substochastic) system is that in which p(t|s,a) is the probability that an individual in state s who takes action a in a period generates a single individual in state t. Call a stochastic system deterministic if the p(t|s,a) are all 0 or 1,² because then for each s and a there will be a unique t ∈ 𝒮 for which p(t|s,a) = 1 and p(τ|s,a) = 0 for all τ ≠ t.

Examples of States and Actions in Various Applications. The table below gives examples of states and actions in several application areas.

Application                  State                              Action
Manage supply chain          Inventory levels of products       Choose product order times/quantities
Maintain road                Condition of road                  Select resurfacing option
Invest in securities         Portfolio of securities            Buy/sell securities: times and amounts
Inspect lot                  Number of defectives               Accept/reject lot, continue sampling
Route calls                  Nodes in network                   Send a call at one node to another
Control queueing network     Queue sizes at each station        Set service rates at each station
Overbook flight              Number of reservations             Book/decline reservation request
Market product               Goodwill                           Advertise product
Manage reservoirs            Water levels at reservoirs         Release water from reservoirs
Insure asset                 Risk category                      Set policy premium
Patrol area                  Car locations, service requests    Reposition cars
Guide rocket                 Position and velocity              Choose retrorockets to fire
Care for a patient           Condition of patient               Conduct tests and treatments

² The last condition rules out a system in which an individual in a period generates more than one individual in the next period even though such a system may still be stochastic (resp., substochastic).

In practice, it is often the case that the reward r(s,a,T) that a substochastic system earns when an individual in a period takes action a in state s depends on s, a and an auxiliary random variable T whose conditional distribution given s, a and the system history at that time depends only on s and a. To reduce this problem to the one above, it suffices to let r(s,a) ≡ E(r(s,a,T) | s,a) be the conditional expected reward that the system earns when it is in state s and takes action a therein. For example, if T is the next state that the system visits after taking action a in state s, then r(s,a) = ∑_t r(s,a,t) p(t|s,a).

2 MAXIMUM N-PERIOD VALUE: RECURSION AND EXAMPLES

Why Maximize Expected Finite-Horizon Rewards? The most common goal in the above setting is to find a "policy", i.e., a rule that specifies the action to take in each state with N or less periods to go, that maximizes the expected N-period reward where there is a given terminal value of being in each state with no periods to go. Finite-horizon optimality concepts like this require users to specify the terminal value, though it is often difficult to do. While decision makers are generally not interested in a fixed number of periods, this approach is often used for several reasons.

• Simplicity. Though fixing a particular finite horizon is often rather arbitrary, the concept is simple. By contrast, optimality concepts for an infinite horizon (perhaps the main alternative) are more subtle and varied.

• Realism. Optimality over a finite horizon is often more realistic. Users frequently think that they can specify the terminal value well enough to facilitate making good decisions in earlier periods and are comfortable with finite horizons. This way, they do not have to make explicit projections beyond the finite horizon, which they might view as speculative anyway, though of course these projections must instead be reflected in the terminal-value function. Few users are prepared to think about the indefinite future of which their own lives occupy such a minuscule part.

• Extension to Nonstationary Data. Optimality concepts over a finite horizon generally adapt easily to nonstationary data without any increase in computational complexity. By contrast, the computations needed to assure infinite-horizon optimality rise rapidly with the extent of nonstationarity. This is an important advantage of finite-horizon concepts because nonstationarity is common in life.

Maximum N-Period Value. We begin by examining this problem informally. Let V_s^N be the maximum expected N-period reward, called the N-period value, that an individual (and his progeny) can earn starting from state s, assuming for the moment that this maximum is achieved. Now if an individual chooses an action a ∈ A_s in the first period and the individual's progeny use an "optimal" policy in the remaining N−1 periods, then

    V_s^N  ≥  r(s,a) + ∑_{t∈𝒮} p(t|s,a) V_t^{N−1},

where the first term on the right is the reward in the first period and the second is the expected reward in the remaining N−1 periods when an optimal policy is used therein. Moreover, if the individual chooses an "optimal" action a in the first period, then equality occurs above. This idea, called the principle of optimality, implies the dynamic-programming recursion:

(1)    V_s^N = max_{a∈A_s} [ r(s,a) + ∑_{t∈𝒮} p(t|s,a) V_t^{N−1} ]


for s ∈ 𝒮 and N = 1, 2, …, where V_s^0 is the given terminal value in state s. This recursion permits one to calculate {V_s^1}, then {V_s^2}, then {V_s^3}, and so on. Once V_s^N is computed, one optimal action a_s^N in state s with N periods to go is simply any action in A_s that achieves the maximum on the right-hand side of (1). Thus, when there are N periods to go, each individual in state s can be assumed to take the same action without loss of optimality. The recursion (1) is of fundamental importance in a broad class of applications.
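One way to carry out the recursion (1) numerically is sketched below in Python; the dictionary layout of the data is an illustrative assumption, with A[s] listing the actions in state s, r[(s, a)] giving r(s,a), p[(s, a)] mapping each successor t to p(t|s,a), and V0[s] the terminal value.

```python
def n_period_values(states, A, r, p, V0, N):
    """Backward recursion (1): V^n_s = max_a [ r(s,a) + sum_t p(t|s,a) V^{n-1}_t ].

    Returns (V, policy) where V[n][s] is the maximum expected n-period value
    from state s and policy[n][s] is an action attaining it."""
    V, policy = [dict(V0)], [None]          # V[0] holds the terminal values
    for n in range(1, N + 1):
        Vn, pin = {}, {}
        for s in states:
            best_a, best = None, float("-inf")
            for a in A[s]:
                val = r[(s, a)] + sum(rate * V[n - 1][t]
                                      for t, rate in p[(s, a)].items())
                if val > best:
                    best_a, best = a, val
            Vn[s], pin[s] = best, best_a
        V.append(Vn)
        policy.append(pin)
    return V, policy
```

The same pass that computes V_s^N records a maximizing action, giving an optimal rule for every number of periods to go up to N.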

Minimum N-Period Cost. If r(s,a) is instead a "cost" and the aim is to minimize expected N-period cost, then the "max" in (1) should be replaced by "min". Then V_s^N is the minimum expected N-period cost starting from state s. This alternate form of the dynamic-programming recursion appears often in the sequel with C_s^N replacing V_s^N.

Deterministic Case. In the deterministic case, there may be several actions that take an individual from state s to t. In that event, it is best to choose one with maximum reward and eliminate the others. Then one can identify actions in state s with the states t visited next, so (1) simplifies to

(2)    V_s^N = max_{t∈A_s} [ r(s,t) + V_t^{N−1} ]

for s ∈ 𝒮 and N = 1, 2, …. As Figure 2 illustrates, V_s^N can be thought of as the maximum N-period reward that an individual can earn in traversing an N-step chain that begins in state s and earns a terminal reward V_t^0 when it ends in state t.

Figure 2 (schematic: an N-step chain beginning in state s and earning terminal reward V_t^0 when it ends in state t)

Examples

1 Minimum-Cost Chain. The minimum-cost-chain problem described above has a terminal cost in each state. One specialization of this problem is to find a minimum-cost chain from every state to a given terminal state τ in N steps or less. Thus if c(s,t) is the cost of moving from state s to state t in one step and C_s^N is the minimum N-step-or-less cost of moving from state s to state τ, then C_τ^N = 0 for N ≥ 1, C_s^1 = c(s,τ) and

(3)    C_s^N = min_{t∈A_s} [ c(s,t) + C_t^{N−1} ],   N = 2, 3, … and s ∈ 𝒮 \ {τ}.


If the associated (directed) graph 𝒢 ≡ (𝒮, 𝒜) with node set 𝒮 and arc set 𝒜 ≡ {(s,t) : t ∈ A_s} has no circuit (i.e., directed cycle) around which the total cost incurred is negative, then

(4)    C_s^N = C_s^{S−1}   for s ∈ 𝒮 \ {τ} and N ≥ S−1.

This is because a minimum-cost chain from s to τ need not visit a node twice. For if it did, the chain could be shortened and its cost reduced by eliminating the circuit created by visiting the indicated node twice, as Figure 3 illustrates. Thus a minimum-cost chain from any node in 𝒮 \ {τ} to τ can be assumed to have S−1 arcs or less, justifying (4). If one takes care to choose the minimizer t = t_s^N in (3) so that t_s^N = t_s^{N−1} whenever C_s^N = C_s^{N−1}, then the t_s^N will be independent of N ≥ S−1 by (4). Also, the subgraph of 𝒢 with arcs (s, t_s), s ∈ 𝒮 \ {τ}, is a tree with the unique simple chain therein from each node s ≠ τ to τ being a minimum-cost chain from s to τ.

Figure 3 (schematic: eliminating the circuit created when a chain from s to τ visits a node twice)

Computational Effort (or Complexity). How much computation is required to calculate C_s ≡ C_s^{S−1} for all s ∈ 𝒮 \ {τ}? Let A be the number of arcs in 𝒜. Then for each N, evaluation of (3) requires A additions and nearly the same number of comparisons. Since this must be done for each N = 2, …, S−1, the total number of additions and comparisons is about (S−2)A.
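For instance, the recursion (3) together with (4) can be carried out as in the Python sketch below; the arc-list and cost-dictionary layout is assumed for illustration, and the work done is in line with the (S−2)A count above.

```python
import math

def min_cost_chain(states, arcs, cost, tau):
    """Recursion (3): C_s = min over t in A_s of c(s,t) + C_t, with C_tau = 0,
    iterated S-1 times as justified by (4).  `arcs[s]` lists the states t with
    an arc (s,t), and `cost[(s,t)]` is c(s,t)."""
    C = {s: (0.0 if s == tau else math.inf) for s in states}
    succ = {s: None for s in states}
    for _ in range(len(states) - 1):          # at most S-1 passes suffice by (4)
        for s in states:
            if s == tau:
                continue
            for t in arcs[s]:
                if cost[(s, t)] + C[t] < C[s]:
                    C[s], succ[s] = cost[(s, t)] + C[t], t
    return C, succ
```

The successor map succ is chosen to change only when the cost strictly improves, so its arcs form a tree of minimum-cost chains into τ, as described above.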

2 Supply Management. One of the areas in which dynamic programming has been used most widely is to find optimal supply-management policies. The reason for this is that nearly all firms carry inventories and the investment in them is sizable. For example, US manufacturing and trade inventories alone in 2000 were 1.205 trillion dollars, or 12% of the entire US gross national product of 9.963 trillion dollars that year!³

³ Statistical Abstract of the United States, 2001, Tables 756 and 640.

As an illustration of the role of dynamic programming in such problems, suppose that the demands for a single product in successive periods are independent and identically-distributed random variables. At the beginning of a period, the supply manager observes the initial stock s, 0 ≤ s ≤ S, and orders a nonnegative amount with immediate delivery bringing the starting stock to a, s ≤ a ≤ S. If the demand D in the period exceeds a, the excess demand is lost. There is an ordering cost c(a − s) and a holding and penalty cost h(a − D) in the period. Let C_s^N be the minimum expected N-period cost starting with the initial stock s. Then (C_s^0 ≡ 0)

    C_s^N = min_{s ≤ a ≤ S} [ c(a − s) + E h(a − D) + E C^{N−1}_{(a−D)⁺} ]

for 0 ≤ s ≤ S and N = 1, 2, ….
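A sketch of this recursion in Python follows; the finite demand distribution and the two cost functions passed in are illustrative assumptions standing in for c(·), h(·) and the distribution of D.

```python
def supply_dp(S, N, order_cost, holding_cost, demand_dist):
    """Supply-management recursion of Example 2:
    C^n_s = min over a in {s,...,S} of c(a-s) + E h(a-D) + E C^{n-1}_{(a-D)^+}.
    `demand_dist` is a list of (d, prob) pairs; order_cost is c, holding_cost is h."""
    C = [[0.0] * (S + 1)]                          # C^0 = 0
    policy = []
    for n in range(1, N + 1):
        Cn, an = [0.0] * (S + 1), [0] * (S + 1)
        for s in range(S + 1):
            best = None
            for a in range(s, S + 1):              # candidate starting stocks
                cost = order_cost(a - s) + sum(
                    prob * (holding_cost(a - d) + C[n - 1][max(a - d, 0)])
                    for d, prob in demand_dist)
                if best is None or cost < best:
                    best, an[s] = cost, a
            Cn[s] = best
        C.append(Cn)
        policy.append(an)
    return C, policy
```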

3 Exercising a Call Option. One of the most active places in which dynamic programming is used today is Wall Street. To illustrate, consider the problem of determining when to exercise an (American) call option to buy a stock, ignoring commissions. The option gives the purchaser the right to buy the stock at the strike price s* ≥ 0 on any of the next N days.

Two questions arise. When should the option be exercised? What is its value? To answer these questions requires a stock-price model and a dynamic-programming recursion to find the value of the option as well as an optimal option-exercise policy.⁴

Consider the following stock-price model. Suppose that the stock price is s ≥ 0 on day N, i.e., N days before the option expires. If N > 0, assume that the stock price on the following day N−1 is sR_N where R_1, R_2, … are independent identically distributed nonnegative random variables with the same distribution as a nonnegative random variable R. Then r ≡ R − 1 is the rate of return for a day, and E r is the expected rate of return that day.

Let V_s^N be the value of the option on day N when the market price of the stock is s. There are two alternatives that day. One is to exercise the option to buy the stock at the strike price s* and immediately resell it, which earns s − s*. The other is not to exercise the option that day, in which case the maximum expected income in the remaining N−1 days is E V^{N−1}_{sR}. Since one seeks the alternative with higher expected future income, the value of the option on expiration day is V_s^0 = (s − s*)⁺ and on day N > 0 is given recursively by

(5)    V_s^N = max(s − s*, E V^{N−1}_{sR}),   N = 1, 2, ….

Thus it is optimal to exercise the option on expiration day if s − s* > 0, and not do so otherwise. And it is optimal to exercise the option on day N > 0 if s − s* > E V^{N−1}_{sR}, and not do so otherwise.
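The recursion (5) is easy to evaluate when the daily return R takes finitely many values; the Python sketch below makes that finite-support assumption (it is not part of the model above) and recurses on the number of days to expiration.

```python
from functools import lru_cache

def option_value(s0, strike, N, returns):
    """Value of the call option from recursion (5) when the current price is s0
    and N days remain; `returns` is a list of (R, prob) pairs for the daily
    gross return R."""
    @lru_cache(maxsize=None)
    def V(n, s):
        exercise = s - strike
        if n == 0:
            return max(exercise, 0.0)              # V^0_s = (s - s*)^+
        hold = sum(prob * V(n - 1, s * R) for R, prob in returns)
        return max(exercise, hold)                 # equation (5)
    return V(N, s0)

# Example call with E R = 0.96 < 1, i.e., a negative expected rate of return:
# option_value(s0=100.0, strike=100.0, N=10, returns=[(1.05, 0.4), (0.90, 0.6)])
```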

Nonnegative Expected Rate of Return. Consider now the question of when it is optimal to exercise the option. It turns out that as long as the expected rate of return is nonnegative, i.e., E r ≥ 0 or equivalently E R ≥ 1, the answer is to wait until the expiration day. To establish this fact, it suffices to show that

(6)    V_s^N = E V^{N−1}_{sR}

for each N > 0 and all s ≥ 0. To that end, observe from (5) for N > 1 and by definition for N = 1 that V_s^{N−1} ≥ s − s* for each s ≥ 0. Thus because E R ≥ 1, it follows that E V^{N−1}_{sR} ≥ E(sR − s*) ≥ s − s*. Hence from (5) again, (6) holds. Hence, if the expected rate of return is nonnegative, it is optimal not to exercise the option at any market price when N > 0 days remain until expiration. Furthermore, V_s^N = E(sR^N − s*)⁺ is the value of the option where R^N ≡ R_1 ⋯ R_N, so sR^N is the price at expiration.

⁴ This formulation of the problem of when to exercise an (American) call option addresses the situation faced by an investor who wishes to profit from speculation on the market price of a stock. By contrast, the formulation of the problem in finance addresses the problem faced by an institution who wishes to price an option to avoid market risk and rely on commissions for profits. Though the assumptions differ, the computations are similar.

Negative Expected Rate of Return. Suppose now that the expected rate of return is negative, i.e., E r < 0. To analyze equation (5), it turns out to be useful to subtract s − s* from both sides of (5) and make the change of variables Y_s^N = V_s^N − (s − s*). Then (5) reduces to the equivalent system

(5′)    Y_s^N = max(0, sE r + E Y^{N−1}_{sR})

for N = 1, 2, … where Y_s^0 = (s* − s)⁺. Now we claim that Y_s^N is decreasing and continuous in s ≥ 0 and lim_{s→∞} Y_s^N = 0 for each N. Certainly that is so for N = 0. Suppose it is so for N−1 and consider N. Then E Y^{N−1}_{sR} is decreasing and continuous in s and approaches 0 as s → ∞. Consequently, there is a smallest s = s_N ≥ 0 such that sE r + E Y^{N−1}_{sR} ≤ 0. Thus, it follows that the maximum on the right side of (5′) is sE r + E Y^{N−1}_{sR} if s < s_N and 0 if s ≥ s_N. Since (5) and (5′) are equivalent, it follows that this rule is optimal with (5) as well, i.e., it is optimal to wait if s < s_N and to exercise the option if s ≥ s_N. In short, if the expected rate of return is negative and if N > 0 days remain until expiration of the option, then there is a price limit s_N such that it is optimal to wait if the price is below s_N and to exercise the option if the price is s_N or higher.

4 System Reliability. A system consists of a finite set z of components. The system is observed once a period and each component is found to be "working" or "failed." If s is the subset of components that is observed to be working in some period, then a subset t \ s of the failed components may be replaced at a cost r_{t\s} where t (⊇ s) is the set of working components after replacement. The expected operating cost incurred during the period is then p_t. Some components may fail during the period, with the conditional distribution of the random set w of working components in the next period, given that t is the working set after replacement in the period and given the past history, depending on t, but not otherwise on the past history. Let C_s^N be the minimum expected N-period cost when s is the initial working set. Then (C_s^0 ≡ 0),

    C_s^N = min_{s ⊆ t ⊆ z} [ r_{t\s} + p_t + E(C_w^{N−1} | t) ]

for ∅ ⊆ s ⊆ z and N = 1, 2, ….
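As an illustration only, the sketch below evaluates this recursion under one simplifying assumption that is not made above: given the post-replacement working set t, its components fail independently. The cost-function and failure-probability interfaces are likewise assumed for the sketch.

```python
from itertools import combinations

def all_subsets(xs):
    """All subsets of the iterable xs, as frozensets."""
    xs = list(xs)
    for k in range(len(xs) + 1):
        for c in combinations(xs, k):
            yield frozenset(c)

def reliability_dp(components, N, repl_cost, oper_cost, fail_prob):
    """Recursion of Example 4.  repl_cost(x) is r_x for the replaced set x,
    oper_cost(t) is p_t, and fail_prob[i] is the (assumed independent)
    probability that working component i fails during a period."""
    comps = frozenset(components)
    states = list(all_subsets(comps))
    C = {s: 0.0 for s in states}                   # C^0 = 0

    def expected_next(t):
        # E[ C^{N-1}_w | t ], w = random set of surviving working components
        total = 0.0
        for w in all_subsets(t):
            prob = 1.0
            for i in t:
                prob *= (1 - fail_prob[i]) if i in w else fail_prob[i]
            total += prob * C[w]
        return total

    for _ in range(N):
        C = {s: min(repl_cost(t - s) + oper_cost(t) + expected_next(t)
                    for t in all_subsets(comps) if s <= t)
             for s in states}
    return C
```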


5 Maximum Expected N-Period Instantaneous Rate of Return: Portfolio Selection. Suppose that in a stochastic system, the return the system earns in periods 1, …, N is the product R_1 ⋯ R_N of the nonnegative random returns R_1, …, R_N in those periods. For example, if the rate of return in period i is 15%, then the return in period i is 1.15. Now let 100ρ_N% be the (random) instantaneous rate of return per period over the N periods assuming continuous compounding. Then e^{Nρ_N} = R_1 ⋯ R_N, so

(7)    ρ_N = (1/N) [ln R_1 + ⋯ + ln R_N].

Now since ln R_i is the instantaneous rate of return in period i, the problem of maximizing the expected instantaneous rate of return over N periods reduces to maximizing the sum E ln R_1 + ⋯ + E ln R_N of the expected instantaneous rates of return in those periods.

Now assume that the conditional distribution of R_i given that the system starts in state s in period i and takes action a ∈ A_s in that period, and given the past history, is independent of i and of the past history. Then the corresponding conditional expected value r(s,a) of ln R_i is independent of i.

Note that a necessary condition for r(s,a) to be finite in this example is that the conditional probability that R_i is zero given that the system starts in state s in period i and takes action a ∈ A_s in that period is zero. The reason is that a zero return corresponds to an instantaneous rate of return equal to −∞ and so is to be avoided at all costs if the goal is to maximize the expected instantaneous rate of return.

Observe that maximizing the expected instantaneous rate of return is different from maximizing the instantaneous rate of expected return. The former is equivalent to maximizing the expected value of the sum of the logarithms of the returns whereas the latter entails maximizing the expected value of the product of the returns. Thus, the former involves additive rewards while the latter involves multiplicative rewards. For example, if you have the opportunity to invest in a security whose return R in a period is 0 or 3, each with equal probability, then E ln R = −∞ and E R = 1.5. Thus, the expected instantaneous rate of return is −∞ whereas the instantaneous rate of expected return is 50%.

Optimal Portfolio Selection. A simple example of the above development is portfolio selection. Suppose that the returns that various securities earn depend on the market state s ∈ 𝒮, e.g., market index, earnings, interest rates, time, etc. The actions available in market state s form a set A_s of portfolios of securities. The one-period return of a portfolio a in market state s is a random variable whose conditional distribution given s, a and the past history depends only on s and a. Denote by r(s,a) the corresponding conditional expected value of the instantaneous rate of return. Finally, the probability that the market state in a period is t, given that the market state in the prior period is s and that a portfolio a is chosen at that time, and given the history of the process, depends only on s and t. This reflects the fact that a single portfolio manager generally does not influence the market. Thus, p(t|s,a) = p(t|s). Let V_s^N be the maximum expected value of the instantaneous rate of return over N periods when the market state is initially s. Then (1) holds with p(t|s) replacing p(t|s,a). Consequently, an action in A_s maximizes the right-hand side of (1) if and only if it maximizes r(s,a), because the term ∑_t p(t|s) V_t^{N−1} is independent of a. This means that the optimal policy is myopic, i.e., maximizes the (expected) reward in each period alone without regard for the future. Thus, in the present case, it suffices to separately maximize the expected instantaneous rate of return in each period.
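Since the optimal policy is myopic here, it can be computed one state at a time; a small Python sketch follows, with the layout of the return distributions assumed for illustration.

```python
import math

def myopic_portfolio(states, A, return_dist):
    """Myopic rule of Example 5: in each market state s choose a portfolio a
    maximizing r(s,a) = E ln R.  `return_dist[(s, a)]` is a list of
    (gross_return, prob) pairs; gross returns must be positive for r(s,a)
    to be finite."""
    def expected_log(s, a):
        return sum(prob * math.log(R) for R, prob in return_dist[(s, a)])
    return {s: max(A[s], key=lambda a: expected_log(s, a)) for s in states}
```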

3 MAXIMUM EXPECTED N-PERIOD UTILITY WITH CONSTANT RISK POSTURE

[HM72], [Ar65], [Ro75a]

Expected Utility and Risk Aversion/Preference

Expected Utility. A decision maker's preferences among gambles, which we take to be random N-vectors, can often be expressed by a utility function u, i.e., a real-valued function defined on the range (assumed in ℝ^N) of the set of gambles. When this is so, a decision maker who has a choice between two gambles will prefer one with higher expected utility, i.e., if X and Y are gambles, then the decision maker prefers X to Y if E u(X) ≥ E u(Y). Eminently plausible axioms implying that a decision maker's preferences can be represented by a utility function are discussed in books on the foundations of decision theory.

Risk Aversion/Preference. A decision maker whose preferences can be represented by a utility function u is called a risk averter (resp., risk preferrer) if E u(X) ≤ u(E X) (resp., E u(X) ≥ u(E X)) for all gambles X with finite expectations, i.e., the decision maker prefers a certain (resp., an uncertain) gamble to an uncertain (resp., a certain) one having the same expected value. As examples, managers and investors are usually risk averters as are buyers of insurance. By contrast, those who gamble at casinos and at race tracks are usually risk preferrers. Of course, a risk averter may still prefer an uncertain gamble to a certain one if the former has higher expectation and u is increasing (which is usually the case). Indeed, in that event only such gambles will be of interest if they are real valued. For if Y is a constant random variable and X is an uncertain one that is preferred to Y, then u(Y) ≤ E u(X) ≤ u(E X), whence Y ≤ E X. That is why investors do not exhibit much interest in risky securities whose expected returns are less than those available on safe ones.

A decision maker is a risk averter (resp., preferrer) if and only if u is concave (resp., convex). To see this, observe first that it suffices to establish the claim for a risk averter, since if u is the utility function for a risk preferrer, −u is the utility function for a risk averter. To show the "only if" part, observe that for all N-vectors x, y and nonnegative numbers p, q with p + q = 1, a risk averter would rather receive the certain gamble px + qy than the gamble that yields x with probability p and y with probability q (and so has the same expected value), i.e., u(px + qy) ≥ p u(x) + q u(y), which is the definition of concavity. The "if part" is known as Jensen's inequality and may be proved as follows. Suppose X is a random N-vector with finite expectation E X and that u is concave. Then since u is concave, there is a d ∈ ℝ^N (the supergradient of u at E X or the gradient there if it exists) such that u(x) ≤ u(E X) + d(x − E X) for all x ∈ ℝ^N. Thus u(X) ≤ u(E X) + d(X − E X), so on taking expected values, E u(X) ≤ u(E X).
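As a quick numerical illustration of the inequality, take the concave utility u(x) = ln x and a gamble X equal to 1 or 3 with probability 1/2 each. Then E u(X) = (ln 1 + ln 3)/2 ≈ 0.55, which is less than u(E X) = ln 2 ≈ 0.69, so a risk averter prefers the certain amount E X = 2 to the gamble.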

Examples. Suppose λ = (λ_i), x = (x_i) ∈ ℝ^N, x ≫ 0 if x_i > 0 for each i, x^λ ≡ x_1^{λ_1} ⋯ x_N^{λ_N} and ln x ≡ (ln x_i). Then the functions λx, −e^{−λx} for λ ≥ 0 and, for x ≫ 0, ln x^λ = ∑_i λ_i ln x_i exhibit risk aversion, while the first and the negatives of the last two exhibit risk preference. For x ≫ 0, the functions ±x^λ, and λ ln x when neither λ ≥ 0 nor λ ≤ 0, generally exhibit neither risk aversion nor risk preference.

General (even concave) utility functions are usually too complex to permit one to do the computations needed to choose between gambles. For that reason, it is of interest to consider more tractable utility functions with special structures that arise naturally in applications. One such class of utility functions that is tractable in dynamic programming is that with constant risk posture, i.e., for which one's posture towards a class of gambles is independent of one's level of wealth. We now discuss this concept for the classes of additive and multiplicative gambles.

Constant Additive Risk Posture

A decision maker has constant additive risk posture if his posture towards every additive gamble Y is independent of his wealth x ∈ ℝ^N, i.e., E u(x + Y) − u(x) has constant sign in x whenever u(x + Y) has finite expected value for every x. This hypothesis is plausible for large firms whose total assets are much larger than the investments they consider. In any case, the hypothesis is satisfied if and only if, apart from an affine transformation au + b of u, u has the form

(1)    u(x) = e^{λx}   for all x

or

(2)    u(x) = λx   for all x

for some row vector λ ∈ ℝ^N. For the "if" part of the above result, observe that the sign of

    E u(x + Y) − u(x) = e^{λx} E[e^{λY} − 1],   if (1) holds
                      = E λY,                   if (2) holds

is independent of x. Notice that au + b is concave (resp., convex) if (1) holds and a ≤ 0 (resp., a ≥ 0), in which case the decision maker is a constant additive risk averter (resp., preferrer).


Constant Multiplicative Risk Posture

Alternately, a decision maker has constant multiplicative risk posture if his posture toward any positive multiplicative gamble Y ≫ 0 is independent of his positive wealth x ∈ ℝ^N, i.e., E u(x ∘ Y) − u(x) has constant sign in x ≫ 0 whenever u(x ∘ Y) has finite expected value for all x ≫ 0, where x ∘ Y ≡ (x_i Y_i). In most situations this hypothesis seems more reasonable for a wide range of wealth levels than does constant additive risk posture. In any case, the hypothesis is satisfied if and only if, apart from an affine transformation au + b of u, u has the form

(3)    u(x) = x^λ   for all x ≫ 0

or

(4)    u(x) = λ ln x   for all x ≫ 0

and some λ ∈ ℝ^N. This result follows from that for the additive case on making the change of variables x′ = ln x and Y′ = ln Y, and defining u′ by the rule u′(x′) = u(x). Then u(x ∘ Y) = u′(x′ + Y′), so u exhibits constant multiplicative risk posture if and only if u′ exhibits constant additive risk posture. In particular,

    u(x) = x^λ if and only if u′(x′) = e^{λx′},

while

    u(x) = λ ln x if and only if u′(x′) = λx′.

Additive and Multiplicative Utility Functions

A utility function u(r) of rewards r = (r_1, …, r_N) earned in periods 1, …, N that exhibits additive or multiplicative risk posture is either additive, i.e.,

(5a)    u(r) = u_1(r_1) + ⋯ + u_N(r_N)

for some functions u_1, …, u_N on the real line, or multiplicative, i.e.,

(5b)    u(r) = ± u_1(r_1) ⋯ u_N(r_N)

for some nonnegative functions u_1, …, u_N on a subset of the real line. The situation may be summarized as follows.

• Constant Additive Risk Posture. Suppose u(r) exhibits constant additive risk posture. Then up to a positive affine transformation, either u(r) = λr is additive with u_i(v) = λ_i v, or u(r) = ±e^{λr} is multiplicative with u_i(v) = e^{λ_i v}. Thus, the only strictly risk-averting (resp., -preferring) constant-additive-risk-posture utility functions are multiplicative and exponential.

• Constant Multiplicative Risk Posture. Suppose u(r) exhibits constant multiplicative risk posture on the positive orthant. Then, up to a positive affine transformation, either u(r) = λ ln r = ln r^λ is additive with u_i(v) = λ_i ln v, or u(r) = ±r^λ is multiplicative with u_i(v) = v^{λ_i} for v > 0. Thus, the only strictly risk-averting (resp., -preferring) constant-multiplicative-risk-posture utility functions are additive and logarithmic.


Maximum Expected N-Period Symmetric Multiplicative Utility

Now consider an S-state stochastic system that consists of a single individual with state-space 𝒮, action sets A_s, transition probabilities q(t|s,a) and, for this paragraph only, one-period rewards r(s,a,t) that depend not only on s and a, but also on t. Suppose also that the utility u(r) of the rewards r = (r_1, …, r_N) earned in periods 1, …, N is additive or multiplicative, i.e., (5a) or (5b) holds. For simplicity, also assume that u is symmetric, i.e., u_1 = ⋯ = u_N = û, say. When u is additive, the dynamic-programming recursion (1) in §1.2 applies directly by simply replacing the one-period rewards r(s,a,t) by their utilities û(r(s,a,t)). Now consider the case when u is multiplicative. Then let V_s^N be the maximum expected N-period utility starting from state s. Hence, since u(r_1, …, r_N) = û(r_1) u(r_2, …, r_N) for N ≥ 2 and u(r_1) = ±û(r_1) where û ≥ 0, one sees that

    V_s^N = max_{a∈A_s} ∑_{t∈𝒮} û(r(s,a,t)) q(t|s,a) V_t^{N−1}

for s ∈ 𝒮 and N = 1, 2, … where V^0 ≡ ±1 (1 is an S-vector of ones). Thus,

(6)    V_s^N = max_{a∈A_s} ∑_{t∈𝒮} p(t|s,a) V_t^{N−1}

for s ∈ 𝒮 and N = 1, 2, … where for each a ∈ A_s and s, t ∈ 𝒮,

(7)    p(t|s,a) ≡ û(r(s,a,t)) q(t|s,a).

Observe that ∑_{t∈𝒮} p(t|s,a) may exceed one even though ∑_{t∈𝒮} q(t|s,a) = 1. Thus (7) is an instance of branching and of maximizing (resp., minimizing) the size of the expected total population in all states at the end of N periods from each initial state where V^0 = 1 (resp., V^0 = −1).

Maximum N-Period Instantaneous Rate of Expected Return.

The above development also applies to the problem in which the goal is to maximize the instantaneous rate of expected return. In that event, û(r) = r, so (7) simplifies to

(7′)    p(t|s,a) ≡ r(s,a,t) q(t|s,a).
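A compact sketch of the recursion (6)-(7) follows; the dictionary layout of q and r, and passing the terminal vector V0 of ±1's explicitly, are assumptions made for illustration.

```python
def multiplicative_utility_values(states, A, q, r, u_hat, V0, N):
    """Recursions (6)-(7): V^n_s = max_a sum_t p(t|s,a) V^{n-1}_t with
    p(t|s,a) = u_hat(r(s,a,t)) q(t|s,a).  `q[(s, a)]` maps each successor t
    to q(t|s,a), `r[(s, a, t)]` is the one-period reward and V0[s] is +1 or -1."""
    V = dict(V0)
    for _ in range(N):
        V = {s: max(sum(u_hat(r[(s, a, t)]) * qt * V[t]
                        for t, qt in q[(s, a)].items())
                    for a in A[s])
             for s in states}
    return V
```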

4 MAXIMUM VALUE IN CIRCUITLESS SYSTEMS

In the preceding two subsections we studied the problem of maximizing the N-period value. In many circumstances one is interested in earning the maximum value over an infinite horizon. Problems of this type are not generally well posed because the sum of the expected rewards in periods 1, 2, … may diverge.

The simplest situation in which this difficulty does not arise is that in which the system is "circuitless". To describe this concept, it is useful to introduce the system graph 𝒢, i.e., the directed graph whose nodes are the states and whose arcs are the ordered pairs (s,t) of states for which p(t|s,a) > 0 for some a ∈ A_s. Call the system circuitless if the system graph has no circuits, i.e., there is no sequence of states that begins and ends with the same state and for which each successive ordered pair (s,t) of states in the sequence is an arc of 𝒢.

For example, the N-period problems of §1.2 and §1.3 are circuitless systems. To see this, include the period in the state so the state-space for the N-period problem consists of the pairs (s,n) ∈ 𝒮 × 𝒩 where 𝒩 = {1, …, N} is the set of periods. Then since no period can be revisited, the system is circuitless.

In circuitless systems, it is possible to relabel the states as 1, …, S so that each state is accessible only to states with higher numbers. Consequently, the number of transitions before the population disappears is at most |𝒮| − 1 = S − 1 because no state can be revisited. Thus, the maximum value V_s starting from state s is finite because it is a sum of at most S finite terms. Then by an argument like that used to justify (1) of §1.2, one sees that V_s satisfies

(1)    V_s = max_{a∈A_s} [ r(s,a) + ∑_{t>s} p(t|s,a) V_t ],   s ∈ 𝒮.

The V_s may then be calculated recursively in the order V_S, …, V_1.
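A sketch of this calculation in Python: given the states already relabeled so that every transition leads to a later state in the list, a single backward pass over (1) yields all the V_s. The dictionary layout of the data is an assumption for the sketch.

```python
def circuitless_values(order, A, r, p):
    """Recursion (1) of §1.4.  `order` lists the states so that every
    transition goes from a state to one appearing later in the list;
    `p[(s, a)]` maps each such later state t to p(t|s,a) (it is empty for
    actions with no successors)."""
    V = {}
    for s in reversed(order):                      # compute V_S, ..., V_1
        V[s] = max(r[(s, a)] + sum(rate * V[t] for t, rate in p[(s, a)].items())
                   for a in A[s])
    return V
```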

Examples

1 Knapsack Problem (Capital Budgeting). The problem of choosing items to fill a knapsack to maximize total value, or of allocating limited capital to projects to maximize revenue, are both instances of an important problem called the knapsack problem. To describe the problem, let C be a finite set of positive integers representing amounts of capital investment required for each of a group of projects. Let r_c be the projected revenue when c is invested and S be the (positive) capital available. The problem can be posed as the integer program of finding an integer vector x = (x_c) that

    maximizes ∑_{c∈C} r_c x_c

subject to

    ∑_{c∈C} c x_c = S and x ≥ 0.

(This formulation allows replication of projects; if this is impossible, then assume x_c ≤ 1.) Also assume 1 ∈ C, so the knapsack problem is feasible for S = 0, 1, …. Let 𝒮 = {0, …, S} be the state-space. The actions available in state s are the investments c ∈ C for which c ≤ s. Since each investment reduces the amount of capital available for subsequent investments, the system is circuitless. Let V_s be the maximum revenue that can be earned when the capital available is s. Now V_0 ≡ 0.


A Dynamic-Programming Recursion. One dynamic-programming recursion for this problem is

V_s = max_{c∈C, c≤s} [r_c + V_{s−c}]

for s = 1, …, S. Of course, V_S is the desired maximum revenue, and 100((V_S/S)^{1/S} − 1)% is the internal rate of return on the invested capital S assuming that all revenue is received at a common time.
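The recursion is easy to code directly. The following minimal Python sketch computes the V_s; the set C of investment levels and the revenues r_c below are hypothetical illustrative data, not part of the text.

```python
# Sketch of the knapsack recursion V_s = max_{c in C, c <= s} [r_c + V_{s-c}], V_0 = 0.
C = {1: 1.0, 2: 3.0, 3: 5.0}    # hypothetical data: investment c -> revenue r_c (1 is in C)
S = 7                           # capital available

V = [0.0] * (S + 1)             # V[0] = 0
for s in range(1, S + 1):
    V[s] = max(r + V[s - c] for c, r in C.items() if c <= s)

print(V[S])                     # maximum revenue obtainable from capital S
```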

An Alternate Dynamic-Programming Recursion. Dynamic-programming recursions for solving a problem are not unique. For example, the V_s may also be determined from the alternate dynamic-programming (branching) recursion

V_s = max_{u+v=s, u,v>0} [V_u + V_v] ∨ r_s,  s ∈ C,

and

V_s = max_{u+v=s, u,v>0} [V_u + V_v],  s ∉ C.

2 Portfolio Selection (Manne). In each period s, 1 ≤ s < S, there is a set A_s of securities available for investment. One dollar invested in security a ∈ A_s in period s generates p(t|s,a) ≥ 0 dollars in period t, t = s+1, …, S. Of course Σ_{t=s+1}^S p(t|s,a) normally exceeds one, so the system is branching. The state-space is 𝒮 = {1, …, S}. Since no period can be revisited, the system is circuitless. The goal is to find, for each s, the maximum income V_s that can be earned by period S from each dollar invested in period s. Then V_S = 1 and

V_s = max_{a∈A_s} Σ_{t=s+1}^S p(t|s,a) V_t,  s = 1, …, S−1.

In this case, the internal rate of return on capital invested in period s is 100((V_s)^{1/(S−s)} − 1)%.

5 DECISIONS, POLICIES, OPTIMAL-RETURN OPERATOR

An informal definition of a “policy” appears on page 3. It is now time to make that concept

precise. At the same time, we introduce matrix and operator notation to simplify the development.

A decision is a function δ that assigns to each state s ∈ 𝒮 an action δ_s ∈ A_s. Thus Δ ≡ ×_{s∈𝒮} A_s is the set of decisions. A policy is a sequence π = (δ^1, δ^2, …) = (δ^N) of decisions. Using π means that if an individual is in state s in period N, then the individual uses action δ_s^N in that period. Now Δ^∞ ≡ Δ × Δ × ⋯ is the set of policies. Finally, call a policy stationary if it uses the same decision δ in each period, and denote it δ^∞ = (δ, δ, …).

For any δ ∈ Δ, let r_δ be the S-element column vector whose s-th element is r(s, δ_s), i.e., r_δ is the one-period reward vector when using the decision δ. Also let P_δ be the S × S transition matrix whose st-th element is p(t|s, δ_s), i.e., P_δ is the one-step transition matrix using δ. If π = (δ^1, δ^2, …) is a policy, let P_π^N ≡ P_{δ^1} ⋯ P_{δ^N} (P_π^0 ≡ I) be the N-step transition matrix using π. Thus P_{δ^∞}^N ≡ P_δ^N = (P_δ)^N. Observe that the st-th element of P_π^N is the expected number of individuals in state t in period N+1 that one individual in state s in period one and his progeny generate when they use π. Then the S-element column vector (V_π^0 ≡ 0)

(1)  V_π^N ≡ Σ_{i=1}^N P_π^{i−1} r_{δ^i}

is the expected N-period reward or N-period value of π.

Define the optimal-return operator ℛ: ℝ^S → ℝ^S by

(2)  ℛV ≡ max_{δ∈Δ} [r_δ + P_δ V],  V ∈ ℝ^S.

Observe that ℛV is the maximum expected one-period reward when V is the terminal reward in that period. Of course ℛ^0 V = V, ℛ^1 V = ℛV, ℛ^2 V = ℛ(ℛV), etc. It is important to recognize that the maximum in (2) is attained; indeed the s-th element of ℛV is

(ℛV)_s = max_{a∈A_s} [r(s,a) + Σ_{t∈𝒮} p(t|s,a) V_t]

for all s ∈ 𝒮, i.e., the maximum in (2) is taken coordinatewise. Also note that ℛV is increasing in V, a fact that we will use often in the sequel.
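The coordinatewise maximum in (2) translates directly into code. The sketch below assumes a particular data layout (r[s][a] for r(s,a) and P[s][a] for the row p(·|s,a)); that layout is an illustrative choice, not part of the text.

```python
import numpy as np

# Sketch of the optimal-return operator: (RV)_s = max_{a in A_s} [r(s,a) + sum_t p(t|s,a) V_t].
def optimal_return(V, r, P):
    """Apply the operator once to the S-vector V (coordinatewise maximum over actions)."""
    return np.array([max(r[s][a] + np.dot(P[s][a], V) for a in range(len(r[s])))
                     for s in range(len(r))])
```

Successive approximations, discussed later in this section, is then just repeated application of this function to an initial vector V.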

6 MAXIMUM N-PERIOD VALUE: FORMALITIES

Our goal now is to justify the dynamic-programming recursion (1) of §1.2 in a more formal

way. To that end, let V_π^N(u) ≡ V_π^N + P_π^N u be the N-period value when using π and the terminal value vector at the end of period N is u ∈ ℝ^S.

Proposition 1. Maximum N-Period Value. ℛ^N u = max_{π∈Δ^∞} V_π^N(u), N = 0, 1, … and u ∈ ℝ^S.

Proof. The result is trivial for N = 0. Suppose it holds for N−1 ≥ 0 and consider N. Now

r_δ + P_δ V_π^{N−1}(u) = V_{(δ,π)}^N(u)

for all δ, π, so because P_δ ≥ 0,

(1a)  ℛ^N u = ℛ(ℛ^{N−1} u) = max_{δ∈Δ} [r_δ + P_δ ℛ^{N−1} u] = max_{(δ,π)} V_{(δ,π)}^N(u).  ∎

Remark 1. The first equality in (1a) can be written in the alternate form

(1b)  V^N = max_{δ∈Δ} [r_δ + P_δ V^{N−1}],  N = 1, 2, …,

where V^N ≡ ℛ^N u and V^0 ≡ u, or equivalently,

(1c)  V_s^N = max_{a∈A_s} [r(s,a) + Σ_{t∈𝒮} p(t|s,a) V_t^{N−1}]

for N = 1, 2, … and s ∈ 𝒮. Of course, (1c) is precisely (1) in §1.2.

Remark 2. The proof is constructive. For if we have found π maximizing V_·^{N−1}(u), then (δ, π) maximizes V_·^N(u) if and only if δ attains the maximum on the right-hand side of (1b).

Advantages of Maximum-N-Period-Value Policies

Policies having maximum N-period value are very useful in practice for several reasons.

• Ease of Computation. They are easy to compute for moderate size N by the recursion (1c).

• Dependence on Horizon Length. The recursion for computing them automatically finds the best first-period decision for each horizon length N, thus enabling one to study the impact of the horizon length on the best first-period decision.

• Nonstationary Data. The recursion extends immediately to the case of nonstationary data without any additional computational effort. However, in this event, it is no longer possible to study the impact of the horizon length on the best first-period decision without extra computation except in the deterministic case. We now explain briefly why this is so.

Rolling Horizons and the Backward and Forward Recursions

Consider first the deterministic case in which the reward r_n(s,t) earned in period n when an individual moves from state s in period n to state t in period n+1 is nonstationary, and one is given the states σ and τ in which one must begin and end in periods 1 and N+1 respectively. Then the obvious generalization of the recursion (1c) is the backward recursion

(2)  B_s^n = max_t [r_n(s,t) + B_t^{n+1}],  s ∈ 𝒮 and n = 1, …, N,

where B_s^n is the maximum reward that can be earned in periods n, …, N by an individual starting in state s in period n, and B_t^{N+1} = 0 or −∞ according as t = τ or t ≠ τ. Observe that although we have suppressed the fact in the notation, B_s^n depends on N since B_s^n is the maximum reward earned in periods n, …, N. Thus, if one wishes to tabulate the maximum N-period value B_σ^1 for N = 1, …, M, then one must compute the B_s^n from (2) for all 1 ≤ n ≤ N ≤ M and s ∈ 𝒮. If we fix the number of state-action pairs, this requires O(M²) additions and comparisons.⁵

However, it is possible to do this computation with at most O(M) additions and comparisons by instead using the forward recursion

(2′)  F_t^n = max_s [r_n(s,t) + F_s^{n−1}],  t ∈ 𝒮 and n = 1, …, N,

where F_t^n is the maximum reward that can be earned in periods 1, …, n by an individual ending in state t at the end of period n, and F_s^0 = 0 or −∞ according as s = σ or s ≠ σ. Now the maximum N-period value is simply F_τ^N, which can be tabulated for all 1 ≤ N ≤ M with at most O(M) additions and comparisons. Thus, if one is interested in studying the effect of changes in the horizon length on the maximum N-period value for a deterministic problem, it is much better to use the forward than the backward recursion.

⁵ If f and g are real-valued functions, write f(x) = O(g(x)) if there is a constant K such that |f(x)| ≤ K g(x) for all x.
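A minimal sketch of the forward recursion (2′): one pass over n = 1, …, M tabulates F_τ^N, and hence the maximum N-period value, for every N ≤ M. The data layout r[n][(s,t)] and the function name are illustrative assumptions.

```python
import math

# Forward recursion F^n_t = max_s [r_n(s,t) + F^{n-1}_s], with F^0_s = 0 if s = sigma
# and -infinity otherwise.  A single pass records the maximum N-period value for all N <= M.
def forward_values(states, r, sigma, tau, M):
    F = {s: (0.0 if s == sigma else -math.inf) for s in states}
    best = []                                   # best[N-1] = maximum N-period value F^N_tau
    for n in range(1, M + 1):
        F = {t: max(r[n].get((s, t), -math.inf) + F[s] for s in states)
             for t in states}
        best.append(F[tau])
    return best
```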

Unfortunately, the forward recursion does not have a natural generalization to stochastic sys-

tems. For that reason one is left only with the nonstationary generalization of the backward re-

cursion (2) in that case.

Limitations of Maximum-N-Period-Value Policies

Policies having maximum N-period value do have some significant limitations, including the following.

• Computational Burden. The computational effort rises linearly with the horizon length N.

• Nonstationarity of Optimal Policies. Optimal N-period policies are generally nonstationary, even with stationary data, and so are difficult to implement.

• Dependence on Horizon Length. The dependence of the best first-period decision on the horizon length N is generally complex and, especially in the case of large N, requires the user to be rather arbitrary in choosing it.

These considerations lead us to look for an asymptotic theory as N gets large, or alternately, to

consider instead the infinite-horizon problem. The latter approach is particularly fruitful because

the symmetries present in the problem generally assure that there is an “optimal” policy that is

stationary. Also, those policies are “nearly optimal” for all large enough N.

7 MAXIMUM VALUE IN TRANSIENT SYSTEMS [Sh53], [Ho60], [D’E63], [deG60], [Bl62],

[Der62], [Den67], [Ve69a], [Ve74], [Ro75c], [Ro78], [RV92]

Why Study Infinite-Horizon Problems? What is the point of studying infinite-horizon prob-

lems in a stationary environment? After all, life is finite. No doubt for this reason men and wo-

men of practical affairs are usually concerned with the near term, often concerned with the in-

termediate term, and occasionally concerned with the long term. But is the long term infinite?

Probably not. Also, is it reasonable to consider models in which the environment remains immut-

able and unchanging forever? Hard to believe.

In view of these observations, what is the justification for studying the stationary infinite-

horizon problem? Here are some reasons for so doing.

• Near Optimality for Long Finite-Horizons. The stationary infinite-horizon problem provides a good approximation to the stationary long finite-horizon problem in the sense that optimal infinite-horizon policies are nearly optimal for long (and often surprisingly short) horizons.


• Optimality of Stationary Policies. The stationarity of the environment in the stationary infinite-horizon problem generally assures that there is an optimal policy that is stationary and independent of the horizon. This fact is important in practice because stationary policies are much easier to store and implement than nonstationary ones.

• Computational Effort Independent of Horizon. By contrast with the finite-horizon problem, the computational effort to find an optimal infinite-horizon policy is independent of the horizon.

• Includes Many Nonstationary Problems. The stationary infinite-horizon problem includes the n-period nonstationary problem as a special case. To see this, append to the original state s each period N that the system could be in that state, so the states of the augmented system are the pairs (s, N) for s ∈ 𝒮 and N = 1, …, n, and transitions are from each period N to N+1 and from period n to exiting the system. If n = m + p and transitions in period n are instead to period m+1, the system repeats every p periods after period m. This reduces such “eventually p-periodic” problems to stationary ones.

A natural approach to the infinite-horizon problem is to seek a policy π = (δ^N) whose value

(1)  V_π ≡ Σ_{N=0}^∞ P_π^N r_{δ^{N+1}}

is a maximum. However, this definition makes sense only if the right-hand side of (1) is defined and finite. In order for this to be the case, it suffices to assume that π is transient, i.e., Σ_{N=0}^∞ P_π^N is finite. The st-th element of the last series is the sum of the expected sojourn times of all individuals in state t generated by one individual starting in state s in period one. Indeed, the only way that (1) can be defined and finite for every choice of r_· is that π be transient. And in that event, V_π is the S-vector of expected infinite-horizon rewards, i.e., the value π earns by starting in each state. Incidentally, it is often convenient to call a decision δ or its transition matrix P_δ transient if δ^∞ is transient. When is a transition matrix transient?

Characterization of Transient Matrices

The sequel often uses the (Tchebychev) norm ‖P‖ of a matrix P = (p_ij) defined by ‖P‖ = max_i Σ_j |p_ij|. This norm has the property that ‖PQ‖ ≤ ‖P‖ ‖Q‖ whenever the matrix product PQ is defined. Though §1 of the Appendix discusses several properties of this and other norms, the term “norm” means the Tchebychev norm throughout the main text.

Lemma 2. Characterization of Transient Matrices. If P is a square complex matrix, the following are equivalent.

1° P^N → 0.

2° The Neumann series Σ_{N=0}^∞ P^N converges absolutely.

Moreover, each of the above implies

3° I − P is nonsingular and its inverse equals the Neumann series in 2°.

Proof. 2° ⇒ 1°. Immediate.

1° ⇒ 2°. Since P^N → 0, ‖P^M‖ ≤ 1/2 for some M ≥ 1. Let L = Σ_{j=0}^{M−1} ‖P^j‖. Since ‖P^{j+kM}‖ ≤ ‖P^j‖ ‖P^M‖^k, it follows that Σ_{N=0}^∞ ‖P^N‖ = Σ_{j=0}^{M−1} Σ_{k=0}^∞ ‖P^{j+kM}‖ ≤ L Σ_{k=0}^∞ 2^{−k} ≤ 2L < ∞.

1° ⇒ 3°. Evidently,

(Σ_{N=0}^{M−1} P^N)(I − P) = Σ_{N=0}^{M−1} (P^N − P^{N+1}) = I − P^M.

Since P^M → 0, the right-hand side above is nonsingular for large M, so I − P is also nonsingular. Thus, postmultiplying by (I − P)^{−1} and letting M → ∞ establishes 3°. ∎

An alternate (more advanced) proof of the equivalence of 1° and 2° of the Lemma is to apply the Spectral Mapping Theorem 4 and Corollary 2 of the Appendix.

Observe that Lemma 2 implies that γ^∞ is transient if and only if P_γ^N → 0, i.e., the expected population size in period N generated by γ^∞ converges to zero as N → ∞. And in that event,

0 ≤ Σ_{N=0}^∞ P_γ^N = (I − P_γ)^{−1}.

For example, if P_γ is strictly substochastic, i.e., P_γ 1 ≪ 1, then γ^∞ is transient. More generally, if P_γ is substochastic, then γ^∞ is transient if and only if P_γ^S is strictly substochastic.

Transient System. As discussed above, V_π is defined and finite for every π and r_· if and only if every π is transient. In that event, call the system transient. As an example, note that circuitless systems are transient because then P_π^S = 0 for all π.

The goal now is to show that for transient systems, there exists a policy having maximum value, and one such policy is stationary. The constructive proof of these claims depends on the following fundamental Comparison Lemma. It asserts that the difference between the values of two transient policies π = (γ^N) and ρ is the sum over N of the product of two matrices that depend on N. The first is the matrix P_π^N of expected numbers of individuals in each state after N periods starting from each state when π is used in periods 1, …, N. The second is the difference between the values of the policies (γ^{N+1}, ρ)—which uses γ^{N+1} and then ρ—and ρ.

Comparison Lemma

Lemma 3. Comparison Lemma. If π = (γ^N) and ρ are transient policies, then

V_π − V_ρ = Σ_{N=0}^∞ P_π^N G_{γ^{N+1}ρ}

where the comparison function G_{γρ} is defined by

G_{γρ} ≡ r_γ + P_γ V_ρ − V_ρ  for γ ∈ Δ.

If also π = γ^∞, then

V_γ − V_ρ = Σ_{N=0}^∞ P_γ^N G_{γρ} = (I − P_γ)^{−1} G_{γρ}

where V_γ ≡ V_{γ^∞}. Moreover, V_γ = V_ρ if and only if V_ρ = r_γ + P_γ V_ρ.

Proof. Since ρ is transient, V_ρ is finite and P_π^N → 0, so V_π^N(V_ρ) = V_π^N + P_π^N V_ρ → V_π. Thus,

V_π − V_ρ = lim_{N→∞} [V_π^N(V_ρ)] − V_ρ = Σ_{N=0}^∞ [V_π^{N+1}(V_ρ) − V_π^N(V_ρ)]
          = Σ_{N=0}^∞ P_π^N [r_{γ^{N+1}} + P_{γ^{N+1}} V_ρ − V_ρ] = Σ_{N=0}^∞ P_π^N G_{γ^{N+1}ρ}.

If π = γ^∞, the last term above becomes [Σ_{N=0}^∞ (P_γ)^N] G_{γρ}. The remaining assertions then follow from Lemma 2 and what was just shown. ∎

Policy-Improvement Method

It is now time to apply the Comparison Lemma to give a policy-improvement method for finding a maximum-value stationary policy for the case where each stationary policy is transient. Then since (I − P_γ)^{−1} ≥ I ≥ 0, it follows from the Comparison Lemma that V_γ − V_δ ≥ G_{γδ} > 0 (resp., V_γ − V_δ ≤ G_{γδ} ≤ 0) if G_{γδ} > 0 (resp., G_{γδ} ≤ 0). For this reason we say γ improves δ if G_{γδ} > 0. If no decision improves δ, we claim that max_{γ∈Δ} G_{γδ} = 0, so V_γ ≤ V_δ for all γ. The claim is an easy consequence of the facts that G_{δδ} = 0 and the s-th element G_{γδ}^s = r(s, γ_s) + Σ_{t∈𝒮} p(t|s, γ_s) V_δ^t − V_δ^s of G_{γδ} depends on γ_s, but not on γ_t for t ≠ s. For suppose that G_{γδ}^s > 0 for some γ and s. Define the decision η by η_s = γ_s and η_t = δ_t for t ≠ s. Then G_{ηδ}^s = G_{γδ}^s > 0 and G_{ηδ}^t = G_{δδ}^t = 0 for t ≠ s. Hence G_{ηδ} > 0 and η improves δ, which is a contradiction and establishes the claim. The above remarks permit us to prove the following result constructively.

Lemma 4. Existence of Maximum-Value Stationary Policy. If every stationary policy is transient, there is a maximum-value stationary policy δ^∞. Moreover,

(2a)  0 = max_{γ∈Δ} G_{γδ},

or equivalently, V ≡ V_δ is a fixed point of ℛ, i.e., ℛV = V, or what is the same thing,

(2b)  V = max_{γ∈Δ} [r_γ + P_γ V].

Moreover, the maximum value over the stationary policies is the unique fixed point of ℛ.

Proof. Let δ^0 be arbitrary and choose decisions δ^1, δ^2, … inductively so δ^N improves δ^{N−1}, and terminate otherwise. Since the value of each policy in the sequence is higher than that of its predecessor, no decision can occur twice in the sequence. Thus because there are only finitely many decisions, the sequence must terminate with a δ = δ^N having no improvement. Then from the discussion preceding the Lemma, (2a) or equivalently (2b) holds, so δ^∞ is a maximum-value stationary policy. Conversely, if V is a fixed point of ℛ, then V = r_η + P_η V for some η ∈ Δ, whence V = V_η. Thus max_{γ∈Δ} G_{γη} = 0 and so V_η = V is the maximum value over the stationary policies. ∎


The finite algorithm given in the above proof finds a maximum-value stationary policy. Call the algorithm the policy-improvement method. The process of finding an improvement of δ, if one exists, involves two steps, viz., finding the value of δ and then seeking an improvement of δ. The first step entails solving the system

(I − P_δ)V = r_δ

for V = V_δ. The second is to find, for each state s ∈ 𝒮, an action γ_s ∈ A_s for which

r(s, γ_s) + Σ_{t∈𝒮} p(t|s, γ_s) V_t ≥ V_s

with strict inequality holding for some s, i.e., a decision γ = (γ_s) such that r_γ + P_γ V_δ > V_δ.
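The two steps translate directly into code. The following sketch performs one pass of the method under an assumed data layout (r[s][a] for r(s,a), P[s][a] for the row p(·|s,a) as a NumPy vector); these names are illustrative assumptions. Choosing the maximizing action in every state, as done here, is the “best” improvement that Newton's method discussed below also uses.

```python
import numpy as np

# One pass of the policy-improvement method (sketch).  delta[s] is the current action in s.
def improve(delta, r, P):
    S = len(delta)
    P_d = np.array([P[s][delta[s]] for s in range(S)])
    r_d = np.array([r[s][delta[s]] for s in range(S)])
    V = np.linalg.solve(np.eye(S) - P_d, r_d)          # step 1: (I - P_delta) V = r_delta
    # Step 2: in each state pick an action maximizing r(s,a) + sum_t p(t|s,a) V_t.
    gamma = [max(range(len(r[s])), key=lambda a: r[s][a] + np.dot(P[s][a], V))
             for s in range(S)]
    improved = any(r[s][gamma[s]] + np.dot(P[s][gamma[s]], V) > V[s] + 1e-9
                   for s in range(S))
    return gamma, V, improved
```

Repeating this pass until improved is False terminates, by Lemma 4, with a maximum-value stationary policy.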

Stationary Maximum-Value Policies: Existence and Characterization

We now come to our main results for transient systems.

Theorem 5. Characterization of Transient Systems. A system is transient if and only if every stationary policy is transient.

Proof. It suffices to prove the “if ” part. By Lemma 4, there is a maximum-value stationary policy δ^∞, say, for the case r_γ ≡ 1 for all γ. Let π = (γ^1, γ^2, …) be any policy. Then since 1 ≤ V_δ ≪ ∞ is a fixed point of ℛ, ℛ is increasing, and ℛ^M 1 is the maximum M-period value with terminal value 1, it follows that

V_δ = ℛV_δ = ⋯ = ℛ^M V_δ ≥ ℛ^M 1 ≥ Σ_{N=0}^M P_π^N 1

for each M = 0, 1, …, so π is transient. ∎

Theorem 6. Existence and Characterization of Maximum-Value Policies and Maximum Value. In a transient system, there is a stationary maximum-value policy. Moreover, a policy ρ has maximum value if and only if max_{γ∈Δ} G_{γρ} = 0. Finally, the maximum value V* is the unique fixed point of ℛ.

Proof. For the first assertion, observe from Lemma 4 that there is a maximum-value stationary policy δ^∞, say, with max_{γ∈Δ} G_{γδ} = 0. Hence, by the Comparison Lemma, V_π ≤ V_δ for all π, so δ^∞ is a maximum-value policy. For the second assertion, if ρ has maximum value, V_ρ = V_δ, so max_{γ∈Δ} G_{γρ} = max_{γ∈Δ} G_{γδ} = 0. Conversely, if max_{γ∈Δ} G_{γρ} = 0, then ρ has maximum value by the Comparison Lemma. The third assertion is immediate from the first assertion and Lemma 4. ∎

Key Ideas. The above developments illustrate three important ideas that will recur frequent-

ly in the sequel in studying optimality concepts for nontransient systems.


• Policy Improvement and Comparison Lemma. It is often possible to use suitably generalized versions of the policy-improvement method and Comparison Lemma together to establish constructively the existence of stationary “optimal” policies, and to characterize them, for general systems—transient or not.

• Convenient Terminal Values. It is often useful to study a problem with the aid of a terminal value that makes calculations easy and then find the effect of changing that value to the desired one. For example, in the proof of Theorem 5, V_δ ≥ 1 is a fixed point of ℛ, whence ℛ^M V_δ = V_δ ≥ ℛ^M 1.

• System Properties. To establish properties of the system that are independent of the rewards, e.g., transience in Theorem 5, it is frequently useful to set all rewards and/or terminal values equal to 1 (or 0 or −1), and then apply existence or characterization results about “optimal” policies and their values.

The fact that V* is a fixed point of ℛ is intuitive on writing that fact as

(3)  V* = max_{γ∈Δ} [r_γ + P_γ V*],

i.e., the maximum value over periods 1, 2, … equals the reward in period 1 using γ plus the maximum value in periods 2, 3, … given γ is used in period 1.

The fact that V* is the unique solution of the system ℛV = V of S nonlinear equations in S variables suggests the possibility of using classical iterative methods to solve the equations for V*, e.g., Newton’s method, successive approximations, Gauss-Seidel method, etc. We first discuss Newton’s method.

Newton’s Method: A Specialization of Policy-Improvement Method

Newton’s method is an instance of the policy-improvement method. To see this, initiate Newton’s method with V = V_δ. The first step of the method is to find a linear approximation ℛ̃ of ℛ with ℛ̃V = ℛV. The natural linear approximation is ℛ̃U ≡ r_γ + P_γ U for U ∈ ℝ^S where γ is a maximizer of r_γ + P_γ V. This is equivalent to choosing γ to maximize G_{γδ}, i.e., choosing the “best” improvement of δ. The linear approximation has the properties that ℛ̃V = ℛV and ℛ̃U ≤ ℛU for all U. The second step is to solve the linear equations U = ℛ̃U, i.e., U = r_γ + P_γ U, for U = V_γ.

Now replace δ by γ and repeat the above two steps iteratively. Thus Newton’s method for solving ℛV = V is the specialization of the policy-improvement method that, given δ, chooses a γ that not only improves δ, but also maximizes G_{γδ}. Moreover, the method converges in finitely many steps.

Successive Approximations: Maximum N-Period Value Converges to Maximum Value

Another classical method for finding the fixed point of ℛ is successive approximations. This method entails choosing the sequence V^0, V^1, … iteratively by the rule V^N = ℛV^{N−1}, N = 1, 2, …. The method has special interest in dynamic programming because ℛ^N V^0 is the maximum N-period value with terminal reward V^0. Since P_δ^N → 0 in transient systems for each δ, it is plausible that V^N → V* in such systems. The sequel establishes this fact and provides the rate of convergence as well.

In order to see what the rate of convergence might be, consider the special case where Δ consists of only a single decision δ. Then

‖V* − ℛ^N V‖ = ‖ℛ^N V* − ℛ^N V‖ = ‖P_δ^N (V* − V)‖ ≤ ‖P_δ^N‖ ‖V* − V‖.

Thus the rate at which ℛ^N V converges to V* is bounded above by the rate at which P_δ^N converges to 0. And the rate of convergence of the latter is determined by the spectral radius of P_δ. Before establishing this fact, it seems useful to give an example.

Geometric Interpretation: a Single-State Reliability Example

It is informative to illustrate the above results geometrically on a concrete example for the case of a transient system with a single state and three decisions. The example is interesting for another reason too, viz., the time between transitions is random. Such systems are semi-Markov decision chains, though this does not really introduce anything new for infinite-horizon transient systems as the discussion below reveals.

Suppose that a contractor provides a service, e.g., lighting, automotive transportation, or desktop computing, for a client. The client can cancel the service at the beginning of any period without prior notice. The contractor estimates that the client will cancel independently in each period with probability 1 − λ, 0 < λ < 1. The contractor has three products with which to provide the service, e.g., three types of lights, cars or computers. The service lives of the copies of product δ that the contractor places in service are independent random variables whose distributions coincide with that of a nonnegative integer-valued random variable L_δ. Let P_δ ≡ E λ^{L_δ} be the probability that the client does not cancel the service during the life of product δ. Also let r_δ be the expected revenue that product δ earns during its lifetime. Now V* = max_δ V_δ.

The table below provides the problem data and the value of each policy; the rows correspond to the three products in the order in which they are discussed below.

   product     r_δ     P_δ     (I − P_δ)^{−1}     V_δ
   first        8      1/5         5/4            10
   second      12      1/4         4/3            16
   third        6      3/4          4             24

The system is transient since the maximum of the three P_δ is less than 1. Also, the maximum expected revenue that the contractor can earn, viz., 24, is the unique solution V = V* of the nonlinear equation


V = max(8 + (1/5)V, 12 + (1/4)V, 6 + (3/4)V),

where the three terms correspond to the first, second and third products respectively. Note that the third product is optimal even though its expected lifetime revenue is smallest.

It is useful to examine how the policy-improvement method would produce this result with Newton improvements, i.e., the best improvement at each step. To that end, note that the optimal-return operator is given by ℛV = max(8 + (1/5)V, 12 + (1/4)V, 6 + (3/4)V).

• First product. Suppose one begins with the first product. The first step is to solve V = 8 + (1/5)V for the value of this policy, i.e., V = 10. Next ℛ(10) = max(8 + (1/5)10, 12 + (1/4)10, 6 + (3/4)10) = max(10, 14½, 13½) = 14½, so the comparison value of the second product is 14½ − 10 = 4½ > 0. Therefore the second product improves the first.

• Second product. Now solve V = 12 + (1/4)V for V = 16. Then ℛ(16) = max(8 + (1/5)16, 12 + (1/4)16, 6 + (3/4)16) = max(11⅕, 16, 18) = 18. Thus the comparison value of the third product is 18 − 16 = 2 > 0. Therefore the third product improves the second.

• Third product. Now solve V = 6 + (3/4)V for V = 24. Then ℛ(24) = max(8 + (1/5)24, 12 + (1/4)24, 6 + (3/4)24) = max(12⅘, 18, 24) = 24. Thus no product improves the third, establishing that it is optimal.
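The example is small enough to check numerically. The sketch below uses the data in the table above (r = (8, 12, 6), P = (1/5, 1/4, 3/4)) and refers to the products by position as in the bullets; it reproduces the policy-improvement iterates 10, 16, 24 and the fixed point V* = 24 by successive approximations.

```python
import numpy as np

r = np.array([8.0, 12.0, 6.0])      # expected lifetime revenues of the three products
P = np.array([1/5, 1/4, 3/4])       # probabilities the client does not cancel during a lifetime

a = 0                               # start with the first product
while True:
    V = r[a] / (1 - P[a])           # value of the stationary policy: V = r_a + P_a V
    b = int(np.argmax(r + P * V))   # best improvement (Newton step)
    if r[b] + P[b] * V <= V:        # no improvement: V is the fixed point
        break
    a = b
print(a, V)                         # -> 2 24.0  (third product, value 24)

V = 0.0                             # successive approximations from V^0 = 0
for _ in range(200):
    V = float(np.max(r + P * V))
print(round(V, 6))                  # -> 24.0 (approximately)
```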

Figure 4a displays the graph of the optimal-return operator for the above example as a wide bold line. Note that ℛV is continuous and strictly increasing, that ℛV − V changes sign once from + to −, and so has a unique zero V*, viz., the unique fixed point of ℛ. Also ℛV_δ ≥ V_δ for all decisions δ, and equality holds if and only if V_δ ≡ V*.

Figures 4b and 4c respectively illustrate Newton’s method and successive approximations for the above example. In each case, if U and V are values calculated at successive iterations, wide bold line segments are drawn from (U, U) to (U, ℛU) to (V, V). Together these line segments form a path that traces the trajectory of successive iterates from the initial value to V*. In particular, starting from the first product, Newton’s method next chooses the second and then the third. By contrast, starting from the value V^0, successive approximations follows an infinite staircase trajectory.

Observe from Figure 4b,c that starting from an initial value V^0 for which V^0 ≤ ℛV^0, the value that Newton’s method calculates at each iteration lies above the one that successive ap-

proximations computes, and this is so in general. Thus, Newton’s method converges at least as

fast as successive approximations. Of course this does not necessarily mean that Newton’s meth-

od is more efficient than successive approximations because the former requires more work at

each iteration than the latter.

System Degree, Spectral Radius and Polynomial Boundedness

The degree of a sequence P_0, P_1, … of matrices is 0 if ‖P_N‖ = O(α^N) for some 0 < α < 1 (i.e., ‖P_N‖ ≤ Kα^N for some K), is the smallest positive integer k for which ‖P_N‖ = O(N^{k−1}) if such an integer exists and the degree is not 0, and is ∞ otherwise. The degree d_P of a square matrix P is the degree of the sequence P^0, P^1, P^2, … of nonnegative integer powers of P.


[Figure 4a. Maximum Value is Unique Fixed Point of Optimal-Return Operator. Figure 4b. Newton’s Method. Figure 4c. Successive Approximations. Each panel plots ℛV against V for the example above; the fixed point satisfies ℛV* = V*.]


Examples of Degrees of Sequences

   Degree of {P_N}:     0        1      2      3      ∞
   Example P_N:       (1/2)^N    1      N      N²     2^N

The spectral radius σ_P of a square complex matrix P is the radius of the smallest circle in the complex plane that is centered at the origin and contains the eigenvalues of P, as Figure 5 illustrates. If σ_P > 0, then the normalized matrix P̄ ≡ σ_P^{−1} P has spectral radius one.

[Figure 5. Spectral Circle: the eigenvalues of P in the complex plane and the circle of radius σ_P about the origin.]

Lemma 7. If P is an S × S complex matrix, the following are equivalent:

1° 1 ≤ d_P < ∞;  1 ≤ d_P ≤ S;  and σ_P = 1 (P polynomially bounded).

Moreover, the following are equivalent:

2° d_P = 0;  P^N → 0;  and σ_P < 1 (P transient).

Finally, the following are equivalent:

3° P^N = 0 for some N ≥ 0;  σ_P = 0;  and P = TUT^{−1} for some upper-triangular matrix U with zero diagonal elements and some nonsingular matrix T. If also P is real and nonnegative, one such T is a permutation matrix, so the system graph of P is circuitless (P circuitless).

Proof. Use the Jordan form of P. ∎

Corollary 8. If P is an S × S complex matrix with σ_P > 0 and the normalized matrix P̄ ≡ σ_P^{−1}P has degree d, then 1 ≤ d ≤ S and P^N = O(N^{d−1} σ_P^N).

Proof. Since P̄ has spectral radius one, the result follows from 1° of Lemma 7. ∎

The degree d_π of a policy π is the degree of the sequence (P_π^N) of endogenous expected population sizes that π generates. The degree of a policy is of fundamental importance in the development of the theory and is a natural measure of the endogenous rate of growth of the sequence of expected population sizes when using the policy.

In particular, the degree of π is zero if and only if the sequence of endogenous expected population sizes when using π converges to zero geometrically, so π is transient. If the degree of π is nonzero and finite, it is one plus the order of the lowest-order polynomial majorizing the sequence of endogenous expected population sizes when using π. For example, if the system is stochastic, the degree of each policy is one because the sequence of endogenous expected population sizes is bounded (by one). If the degree of π is infinite, the sequence of endogenous expected population sizes when using π is not majorized by any polynomial.


Put d_δ ≡ d_{δ^∞} and call d ≡ max_{δ∈Δ} d_δ the system degree. Let σ_δ ≡ σ_{P_δ} and call σ ≡ max_{δ∈Δ} σ_δ the system spectral radius. If σ > 0, call the system with transition matrices P̄_δ ≡ σ^{−1}P_δ for all δ ∈ Δ the normalized system.

The next result is very important. It implies that d = max_{π∈Δ^∞} d_π, i.e., the system degree majorizes the degree of every policy.

Theorem 9. System Degree. The sequence max_{π∈Δ^∞} ‖P_π^N‖, N = 0, 1, …, has degree d.

Proof. We give the proof only for the case d = 0. (Later we give the proof for the case d = 1.) Put ℛ̂V ≡ max_{δ∈Δ} [1 + P_δ V] and ℛ̄V ≡ max_{δ∈Δ} P_δ V. Since each stationary policy is transient, ℛ̂ has a unique fixed point V by Lemma 4, and V = ℛ̂V ≥ 1. Also 0 ≤ ℛ̄V ≤ ℛ̂V = V. Thus ℛ̄^N V is decreasing in N and is bounded below by 0, so ℛ̄^N V ↓ U, say. Since ℛ̄ is continuous, ℛ̄U = ℛ̄(lim_{N→∞} ℛ̄^N V) = U, so U is a fixed point of ℛ̄, whence U = 0 because 0 is the unique fixed point of ℛ̄ by Lemma 4. Thus ℛ̄^N V ↓ 0. Hence ℛ̄^M V ≤ (1/2)V for some M ≥ 1. Then ℛ̄^{kM} V ≤ (1/2)^k V for k = 1, 2, …, from which fact it follows that ℛ̄^N V = O(α^N) with α ≡ (1/2)^{1/M}. Since ℛ̄^N V = max_{π∈Δ^∞} P_π^N V and V ≥ 1, max_{π∈Δ^∞} ‖P_π^N‖ = O(α^N). ∎

Transient Systems. For the case d = 0, Theorem 9 strengthens Theorem 5 by implying not only that every policy is transient, but also that every policy has degree zero.

Geometric Convergence of Successive Approximations

It is now possible to establish the convergence rate of ℛ^N V to V* in transient systems.

Theorem 10. Geometric Convergence of Successive Approximations. In a system with positive maximum spectral radius σ and degree d of the normalized system,

‖ℛ^N U − ℛ^N V‖ = O(N^{d−1} σ^N) ‖U − V‖  uniformly in U, V ∈ ℝ^S.

If also the system is transient, i.e., σ < 1, then ℛ^N V → V*.

Proof. Suppose U, V ∈ ℝ^S and let P̄_π^N be the N-step transition matrix for π in the normalized system. Choose π so ℛ^N U = V_π^N(U). Then ℛ^N V ≥ V_π^N(V), so by Theorem 9 and Lemma 7,

ℛ^N U − ℛ^N V ≤ P_π^N (U − V) = σ^N P̄_π^N (U − V) = O(N^{d−1} σ^N) ‖U − V‖

uniformly in U, V. Interchanging the roles of U, V establishes the first claim. The second claim then follows from the first by choosing U = V* and using the facts that V* is finite (because the system is transient) and ℛ^N V* = V*. ∎


Theorem 10 depends on the proof of Theorem 9 for arbitrary d. There is a weaker result that depends on Theorem 9 only for d = 0, the case for which that theorem was proved above. The weaker result is that ‖ℛ^N V − V*‖ = O(τ^N) ‖V − V*‖ uniformly in V for any τ with σ < τ < 1. The proof is identical to that of Theorem 10 except that one normalizes the P_δ with respect to τ rather than σ.

In a transient system, Theorem 10 provides a sharp estimate of the rate of convergence of ℛ^N V to V*. However, the estimate depends on σ which is, in general, difficult to compute. For that reason it is useful to have an upper bound on σ that is easier to compute. The next result gives such a bound.

Lemma 11. If P is a square complex matrix, then σ_P ≤ ‖P‖.

Proof. If λ is an eigenvalue of P, then there is a v ≠ 0 such that λv = Pv. Hence |λ| ‖v‖ = ‖Pv‖ ≤ ‖P‖ ‖v‖. On dividing by ‖v‖, the result follows. ∎

Corollary 12. If μ ≡ max_{δ∈Δ} ‖P_δ‖ < 1, the system is transient and ‖ℛ^N V − V*‖ ≤ μ^N ‖V − V*‖.

Proof. Iterate the inequality ‖ℛV − V*‖ ≤ μ ‖V − V*‖. ∎

Remark. Notice that Corollary 12 is weaker than Theorem 10 when σ ≠ μ because then, by Lemma 11, σ < μ. If σ = μ, then the Corollary gives the constant in Theorem 10. However, then the rate of convergence is the same in both cases because d = 1, since ‖P̄_π^N‖ ≤ 1 for all N and π and d is the degree of P̄_δ for some δ.

Contraction Mappings. The usual proof of Corollary 12 employs the contraction-mapping theorem. Indeed, when the hypothesis of Corollary 12 is valid, an easy modification of the proof of Theorem 10 shows that the mapping ℛ is a contraction with modulus μ, 0 ≤ μ < 1, i.e., one has ‖ℛU − ℛV‖ ≤ μ ‖U − V‖ for all U, V ∈ ℝ^S. The contraction-mapping theorem in the Appendix then assures that {ℛ^N V} is a Cauchy sequence and converges geometrically to the unique fixed point V* of ℛ at the rate μ that Corollary 12 provides. The development here employs a different approach because, as the above remark discusses, it gives the sharper result of Theorem 10.

Actually, when the system is transient, but the hypothesis of Corollary 12 does not hold, it is possible to modify the contraction-mapping method to establish geometric convergence of ℛ^N V to V* at any rate τ with σ < τ < 1 in two different ways. One is to apply Theorem 10 to show that even though ℛ is not a contraction with modulus τ in the usual (Tchebychev) norm, ℛ^M is a contraction with modulus τ^M for some M > 0 in that norm. The other is to show that ℛ is in fact a contraction with modulus τ in a different “weighted” (Tchebychev) norm.


Linear-Programming Method

So far in this section we have explored the possibility of finding maximum-value policies in transient systems by finding the unique fixed point of ℛ. It is possible to find this fixed point by linear programming if we consider instead the larger set of vectors V satisfying V ≥ ℛV, or equivalently the linear inequalities V ≥ r_δ + P_δ V for all δ ∈ Δ. Call such vectors V excessive. As we shall see, there is a least excessive vector, and it is the fixed point. The simplest way to see that there is a least excessive vector is to observe that the set of excessive vectors is nonempty, closed, bounded below (by each V_δ) and a meet sublattice, i.e., whenever U and V are excessive, so is the pointwise minimum U ∧ V. Moreover, the least excessive vector may be found by minimizing any positive linear function of V over the set of excessive vectors. It might seem odd that this linear program entails minimization while our goal is to maximize value. The explanation for this apparent paradox is simply that this linear program is, as the sequel shows, the dual of the maximum-value problem.

The linear-programming method is interesting in its own right and because the theory and

algorithms of linear programming are well developed, codes for solving such problems are widely

available, and the method is readily adapted to encompass additional stochastic constraints. A

typical stochastic constraint might involve a bound on a linear combination of the expected num-

ber of times individuals in various states take various actions, e.g., an upper bound on the ex-

pected number of times that individuals enter an undesirable set of states. Another stochastic

constraint might assure that if one uses an action in one of a subset of states, then one uses the

action in all states the subset. The last constraint can be expressed in terms of linear inequal- in

ities in integer and continuous variables.

In order to formulate the linear programs, suppose that there is an S-element row vector w ≥ 0 of initial population sizes of individuals in each state in period one. Then consider the following pair of dual linear programs for a transient system.

Primal Linear Program. Find S-element row vectors x_δ, δ ∈ Δ, that

maximize Σ_{δ∈Δ} x_δ r_δ

subject to

Σ_{δ∈Δ} x_δ (I − P_δ) = w

x_δ ≥ 0,  δ ∈ Δ.

Dual Linear Program. Find an S-element column vector V that

minimizes wV

subject to

V ≥ r_δ + P_δ V,  δ ∈ Δ.

In order to understand the primal program, let x = (x_γ) be such that x_δ(I − P_δ) = w for one δ ∈ Δ and x_γ = 0 for all γ ≠ δ. Then by Lemma 2, I − P_δ is nonsingular and x_δ = w(I − P_δ)^{−1} = w Σ_{N=0}^∞ P_δ^N ≥ 0, so x is a basic feasible solution of the primal. Call I − P_δ the basis associated with δ. The basic feasible solution has the following interpretation. Since wP_δ^N is the vector of expected population sizes in each state in period N+1 when all individuals use δ^∞, x_δ is the sum of the expected sojourn times of all individuals in each state when they all use δ^∞. Hence since r_δ is the vector of one-period rewards earned by one individual in each state who uses δ, x_δ r_δ = w(I − P_δ)^{−1} r_δ = wV_δ is the expected total reward earned by all individuals in all periods when they use δ^∞. The aim is to find a δ^∞ that maximizes the expected total reward, which motivates the primal objective function. The next result establishes that the primal and dual always have optimal solutions in transient systems.

Theorem 13. Linear Programming. In a transient system, the maximum value V* is the least feasible solution of the dual, is optimal for the dual for all w ≥ 0, and is the unique optimal solution of the dual for w ≫ 0. Moreover, there is a δ such that V* = r_δ + P_δ V*; also each such δ^∞ has maximum value and I − P_δ is an optimal basis for the primal for all w ≥ 0.

Proof. The maximum value V* is a fixed point of ℛ, and so is feasible for the dual. Choose δ ∈ Δ so V* = r_δ + P_δ V*, whence V_δ = V*. Suppose w ≥ 0. Choose x = (x_γ) so x_δ(I − P_δ) = w and x_γ = 0 otherwise. As discussed above, x is a basic feasible solution of the primal. Hence, since x and V* satisfy complementary slackness, they are optimal for the primal and dual respectively. Moreover, since I − P_δ has a nonnegative inverse, it is an optimal basis for all w ≥ 0. Thus, since V* is optimal for the dual for all w ≥ 0, V* is the least feasible solution of the dual. Also, V* is the unique optimal solution of the dual for w ≫ 0 since then wV > wV* for V > V*. ∎

It is important to recognize that although the matrix formulations of the primal and dual programs given above are convenient for theoretical analyses like those above, both are highly redundant because there are Π_{t≠s} |A_t| distinct decisions that take action a ∈ A_s in state s. As a consequence the primal will have a distinct variable and the dual a distinct inequality for each of those distinct decisions that take action a in state s. On the other hand, since the dual inequalities associated with taking action a in state s are identical, all but one can be omitted because they are redundant. Hence, the primal variables corresponding to those redundant dual inequalities are also redundant and so can also be eliminated.

After stripping out the redundant variables of the primal and corresponding redundant inequalities of the dual, let x_{sa} be the (unique) primal variable corresponding to taking action a in state s. Then the primal and dual programs take the following irredundant forms. It is these irredundant forms that should be used for computation.

Irredundant Primal. Choose x = (x_{sa}) ≥ 0 that

maximizes Σ_{s∈𝒮, a∈A_s} r(s,a) x_{sa}

subject to

Σ_{a∈A_s} x_{sa} − Σ_{t∈𝒮, a∈A_t} p(s|t,a) x_{ta} = w_s,  s ∈ 𝒮.

Irredundant Dual. Choose V = (V_s) that

minimizes Σ_{s∈𝒮} w_s V_s

subject to

V_s ≥ r(s,a) + Σ_{t∈𝒮} p(t|s,a) V_t,  a ∈ A_s, s ∈ 𝒮.
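The irredundant primal is an ordinary linear program and can be handed to any LP solver. The sketch below builds it with SciPy's linprog; the data layout (r[s][a], P[s][a] for the row p(·|s,a)) and the helper name are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: solve the irredundant primal  max r'x  s.t. the state balance constraints, x >= 0.
def solve_irredundant_primal(r, P, w):
    S = len(r)
    pairs = [(s, a) for s in range(S) for a in range(len(r[s]))]
    c = np.array([-r[s][a] for s, a in pairs])           # linprog minimizes, so negate revenue
    A_eq = np.zeros((S, len(pairs)))
    for j, (s, a) in enumerate(pairs):
        A_eq[s, j] += 1.0                                 # leaving state s via (s,a)
        A_eq[:, j] -= np.asarray(P[s][a])                 # entering each state t with p(t|s,a)
    res = linprog(c, A_eq=A_eq, b_eq=np.asarray(w, dtype=float))   # default bounds give x >= 0
    x = dict(zip(pairs, res.x))
    # A maximum-value decision takes, in each state s, an action a with x[s, a] > 0.
    return x, -res.fun
```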

We next study the set of all feasible solutions of the irredundant primal. To do this, it is convenient to introduce two definitions.

State-Action Frequency. The state-action frequency of a policy is the vector whose sa-th component x_{sa} is the expected number of times that individuals are in state s and take action a when the policy is used. If the state-action frequency x = (x_{sa}) of a policy π is finite, e.g., as when π is transient, x is feasible for the irredundant primal. To see this, observe that the expected number of times that individuals leave state s when π is used is Σ_{a∈A_s} x_{sa}. On the other hand, the expected number of times that individuals enter state s when π is used is w_s + Σ_{t∈𝒮, a∈A_t} p(s|t,a) x_{ta}. The first term accounts for the initial population in state s in period one and the second for the expected number of times (after period one) that individuals enter state s from the preceding period when π is used. Since x ≥ 0 and the expected number of times that individuals enter and leave state s must coincide when π is used, the state-action frequency x of π is feasible for the irredundant primal.


The above development shows that the state-action frequency of every transient policy is feas-

ible for the irredundant primal. The question arises whether the converse is also true. In short, is

every feasible solution of the irredundant primal the state-action frequency of some policy? The

answer is ‘no’—at least for the “nonrandomized” policies considered so far—as the following ex-

ample shows.

Example. Nonconvexity of Set of State-Action Frequencies of Nonrandomized Policies.

Suppose 𝒮 = {1, 2} and w = (1 0). There are only two actions in state one. The first generates no individuals in either state in the next period and the second sends the individual in state one to state two in the next period. There is only one action in state two and it generates no individuals in either state in the next period. Then there are effectively only two nonrandomized policies—both transient—and the set of state-action frequencies is {(1 0 0), (0 1 1)}, which is not convex. Thus there are many feasible solutions of the irredundant primal that are not state-action frequencies of nonrandomized policies.

Initially Randomized Policies. On the other hand, Theorem 15 below asserts that every feasible solution of the irredundant primal is a state-action frequency if the system is transient and the class of admissible policies is enlarged to permit initial randomization. Call a policy initially randomized if at the beginning of period one, the policy selects each stationary policy δ^∞ with some given probability μ_δ ≥ 0, Σ_{δ∈Δ} μ_δ = 1. Call such a policy nonrandomized if all but one of the μ_δ’s is zero. In particular, if only μ_δ is nonzero, and so equal to one, the corresponding nonrandomized policy is simply the stationary one δ^∞.

Theorem 15 also asserts that the extreme solutions of the irredundant primal are state-action frequencies of nonrandomized stationary policies even when the system is not transient. To establish this result, we first require the following characterization of the existence of a solution and of a least solution of the following system of linear inequalities where w and P are nonnegative.

(4)  x = w + xP,  x ≥ 0.

Lemma 14. Suppose that the row vector w and square matrix P are nonnegative and have a like number of columns, and let y ≡ Σ_{N=0}^∞ wP^N. Then the following are equivalent.

1° y is finite.

2° The inequalities (4) have a solution, e.g., y.

3° y is finite and is the least solution of the inequalities (4).

Proof. 1° ⇒ 2°. Observe that y = w + Σ_{N=1}^∞ wP^N = w + yP ≥ 0, i.e., y satisfies (4).

2° ⇒ 3°. If x satisfies (4), then x = w + xP = ⋯ = Σ_{i=0}^{N−1} wP^i + xP^N ≥ Σ_{i=0}^{N−1} wP^i for N = 1, 2, …, so x ≥ y. Thus since 1° ⇒ 2°, y is the least solution of (4).

3° ⇒ 1°. Immediate.


Theorem 15. Feasible Solutions of Irredundant Primal are State-Action Frequencies of Initially Randomized Policies. If w ≥ 0, then x is an extreme solution of the irredundant primal if and only if x is the finite state-action frequency of a nonrandomized stationary policy. If w ≫ 0, the feasible row bases of the irredundant primal are the matrices I − P_δ for which δ^∞ is transient. If w ≥ 0 and the system is transient, then the set of feasible solutions of the irredundant primal coincides with the set of state-action frequencies of the initially randomized policies.

Proof. Suppose w ≥ 0. We begin by proving the sufficiency in the first assertion. Suppose x is the finite state-action frequency of δ^∞, so x_δ = Σ_{N=0}^∞ wP_δ^N is finite. Also suppose x = ½(u + v) for some feasible solutions u and v of the irredundant primal. Then the nonzero elements of x, u and v are contained among the state-action pairs defining δ, so u_δ and v_δ satisfy (4) with P = P_δ. Since x_δ is the least solution of (4) by Lemma 14, x_δ ≤ u_δ, v_δ, whence x_δ = u_δ = v_δ because x = ½(u + v). Hence, x = u = v, so x is an extreme solution of the irredundant primal as asserted.

To establish the necessity in the first assertion, let L be the Σ_{s∈𝒮}|A_s|-row by S-column constraint matrix of the irredundant primal. Let x be an extreme solution of xL = w, x ≥ 0, and let x_B be the row subvector of positive elements of x. Then the rows B of L corresponding to x_B are linearly independent. Let B₁ (resp., B₀) be the columns of B that contain at least one positive (resp., no positive) element, and let w₁ (resp., w₀) be the corresponding row subvector of w. Since x_B B = w, x_B ≫ 0, B₀ ≤ 0 and w ≥ 0, it follows that B₀ = 0 and w₀ = 0. Thus, since each row of L, and hence of B, has at most one positive element, each column of B₁ has exactly one positive element. For if not, B₁ has more rows than columns, contradicting the fact that the rows of B, and hence B₁, are linearly independent. Thus B₁ is nonsingular. Let δ be a decision that uses the state-action pairs corresponding to the rows of B and arbitrary actions in other states. On possibly permuting the rows of L, we can write B₁ = I − P for the restriction P ≥ 0 of P_δ to the set 𝒮₁ of states corresponding to w₁. Since B₁ is nonsingular, x_B is the unique solution of x = w₁ + xP, x ≥ 0, so x_B = Σ_{N=0}^∞ w₁P^N by Lemma 14. Hence since 𝒮₁ is closed under δ, x is the state-action frequency of δ^∞, which proves the first assertion. The second assertion follows from what was just proved on noting that 𝒮₁ = 𝒮 when w ≫ 0.

To establish the third assertion, suppose w ≥ 0 and the system is transient. Then the set of feasible solutions of the irredundant primal is bounded because it has an optimal solution when r(s,a) = 1 for all a ∈ A_s and s ∈ 𝒮 by Theorem 13, and so is a polytope, i.e., is polyhedral and bounded. Also, it is known that each element of a polytope is a convex combination of its extreme points. The third assertion then follows from the first if we take the weights on the extreme points in the convex combination to be the initial probabilities of choosing the various nonrandomized stationary policies. ∎


Finding An Optimal Policy. Observe from Theorems 13 and 15 that if the system is transient and x = (x_{sa}) is a basic optimal solution of the irredundant primal program, then one maximum-value decision in state s is to take the unique action a ∈ A_s for which x_{sa} > 0 if such an action exists, and to take any action if x_{sa} = 0 for all a ∈ A_s.

Simplex Method: A Specialization of Policy-Improvement Method. Application of the simplex method to solve the irredundant primal is equivalent to the specialization of the policy-improvement method in which the action for only a single state is changed at each iteration. To see this, suppose that δ ∈ Δ. The basis corresponding to δ is I − P_δ and the price vector is given by V_δ = (I − P_δ)^{−1} r_δ, so that the reduced profits corresponding to γ ∈ Δ are r_γ − (I − P_γ)V_δ = G_{γδ}. The simplex method selects a γ and an s for which G_{γδ}^s > 0 and γ differs from δ only in state s. (This selection rule agrees with the ordinary rule by which the simplex method chooses a variable to leave the basis when w ≫ 0 because then the only feasible bases are associated with decisions. The selection rule breaks ties in the choice of the exiting variable when w ≥ 0 but not w ≫ 0.) By contrast, the policy-improvement method amounts to carrying out block pivots in the irredundant primal program, which although not generally justified in linear programming, is valid here because of the special “totally Leontief ” structure of the transpose of the irredundant primal constraint matrix L defined in the above proof.

We illustrate the simplex method in Figure 6 for a two-state problem in the space of dual inequalities. In that example there are two states 1, 2, three actions b, c, d in state 1, and two actions b, c in state 2. There are five state-action pairs. For each state-action pair sa, we plot the line that is the set of solutions to the linear equation

V_s = r(s,a) + Σ_{t∈𝒮} p(t|s,a) V_t.

The value of a decision is the intersection of the two lines corresponding to the two state-action pairs defining the decision. A decision is a pair of state-action pairs, one for each state. For example, the maximum-value decision (1d, 2b) takes action d in state one and action b in state two, and the maximum value is the least excessive point. One possible sequence of decisions selected by applying the simplex method to the irredundant primal is: (1b, 2c) → (1c, 2c) → (1c, 2b) → (1d, 2b). The heavy bold line in Figure 6 traces the progress of the simplex method in order through the values of these decisions in the space of dual variables. Notice that the action for exactly one state is changed at each iteration.

Running Times

So far we have discussed three related methods for finding maximum-value policies in trans-

ient systems, viz., successive approximations, policy improvement (including Newton’s method)


and linear-programming methods (including the simplex method and interior-point algorithms).

[Figure 6. Progress of Primal Simplex Method in Dual Space for Two-State Problem: the lines for the state-action pairs 1b, 1c, 1d, 2b, 2c in the (V_1, V_2) plane, the set of excessive points, and the path of the simplex method through the values of successive decisions.]

Each of these methods can be used in practice to solve problems with thousands—and sometimes

even millions—of states if implemented properly on a computer.

Uniformly Strictly Substochastic Systems Strongly Polynomial-Time Exact and Linear-Time:

%-Approximation Algorithms. uniformly strictly substo Suppose the system is chastic, i.e., each of the

row sums of the transition matrices is uniformly bounded by a fixed constant . W! Ÿ "" ith ra-

tional data, ] has shown that a maximum-value policy can be found in strongly polynom-Ye [Ye03

ial time with interior-point methods. Even without rational data, the running time can be reduced

to linear time using successive approximations if instead one is satisfied to find an -optimal pol-%

icy. We outline briefly why this is so, leaving a more complete account to a homework problem.

Call a policy if its value is within %%-optimal "!!% ! of the maximum value. Call an algo-

rithm that finds such a policy an It turns out that for fixed , suc-%-approximation algorithm. % !

cessive approximations is a strongly linear-time -approximation algorithm. That is so be% cause the

method can be implemented to find a policy that is -optimal with % SÐ8Ñ arithmetic operations

Ð ß ß ‚ ß ƒ ß ” ß • Ñ 8 where is the number of nonzero one-period rewards and transition rates

over all states and actions. To that end, it is necessary to compute for Z œ Z R œ "ß á ß QR R"e

with and to assure that any optimal decision in the first period of an -perZ ´ ! Q ´ Q! Ð"Î Ñ

Ð"ΠѲ ³ln

ln%

" -

iod problem has value within 100 % of the maximum with an infinite horizon. Since each of the%

Q Z needed evaluations of requirese R" arithmeticSÐ8Ñ operations and depends only on Q %

and , successive approximations finds an -optimal policy in strongly" % linear time.

Page 42: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 36 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Since assured of find-one must examine all the data of a transient system at least once to be

ing a maximum-value policy, the running time of every algorithm for finding a maximum-value

policy is at least . Hence,SÐ8Ñ under the above hypothesis and for fixed , successive approxima-%

tions is as efficient as possible up to a constant factor.

Gauss-Seidel Method. One way to improve successive approximations is to update the value

Z Ð Z Ñ = Ð Z Ñ = after computing for a state , rather than waiting to compute for all states e e= =

before updating . This is the Gauss-Seidel method, and it converges faster than successive ap-Z

proximations.

Newton’s Method. A different way to reduce the number of iterations of successive approxi-

mations is to use Newton’s method. However, this method requires the added work of comput-

ing the value of the policy found at each iteration. The number of arithmetical operations to do

this by Gaussian elimination is , which is higher than linear time if the number of state-SÐW Ñ E$

action pairs is much larger than . If instead, one uses successive approximations to estimateW

these values, then the overall number of arithmetic operations to execute Newton’s method falls

to for fixed as with successive approximations.SÐ8Ñ %

Balancing Pricing and Updating Values. In all three algorithms, one should take care to strike

an appropriate balance between the work of pricing (computing ) and updating the estimateeZ

of the value . In particular, this suggests that successive approximations is likely to be more orZ

less efficient than Newton’s method according as the number of actions in each state is respect-

ively relatively small or large. This is simply because is relatively cheap to compute in theeZ

former case and relatively expensive to compute in the latter when compared to the work to

update . Also, successive approximations is likely to be more or less efficient than Newton’sZ

method using Gaussian elimination according as the maximum spectral radius of the transition

matrices is respectively relatively small or large. This is because the work to compute the new

value by successive approximations relative to Gaussian elimination is relatively large or small

according as the maximum spectral radius is respectively relatively large or small.

Simplex Method. If the simplex method is used to solve the irredundant primal, it is proba-

bly not a good idea to compute at each iteration to find the best single state-action pair toeZ

improve. (For with the same work, one could use Newton’s method to improve the actions in all

states at once!) The reason is simply that too much work is expended in pricing. A better ap-

proach is either to price out the actions for a single state at each iteration (which requires=

merely computing ) and perhaps cycle through the states to be priced at each iterationÐ Z Ñe =

periodically. In this way one strikes a better balance between the work of pricing and finding a

good estimate of the value.

Transient Systems Polynomial-Time Algorithms.: It is possible to check whether not a sub- or

stochastic system is transient in time with a combinatorial algorithm that depends only onSÐ8Ñ

Page 43: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 37 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

the number of state-actions pairs and positive transition rates. It is possible to check whether8 or

not an arbitrary system is transient by solving the irredundant primal linear program with objec-

tive function for all <Ð=ß +Ñ œ " + − E= and . If the data are rational, this linear program and= − f

its dual can be solved with arithmetic operations where is the number of bitsSÐÐEW E WÑPÑ P# "Þ&

required to represent the reward and transition rates. Indeed, this can be done with an interior-

point algorithm of Vaidya [Va90] and a refinement of Ye, Todd and Mizuno [YTM94]. If the sys-

tem is transient, a maximum-value policy can then be found by solving the irredundant primal

and dual linear programs with the same interior-point method and running time. An open ques-

tion is whether these methods can be refined to run in linear time.

Stochastic Constraints. Practitioners often want to impose , i.e., con-stochastic constraints

straints on the state-action frequencies, to achieve service goals or avoid undesirable states. Here

are some examples. The expected number of stockouts does not exceed . The expected number5

of backorders is at most . The expected number of equipment failures is at most . Constraints" 9

of this type can often be expressed by appending linear inequalities to the irredundant primal. Al-

ternately, a stochastic constraint might be necessary to assure that if the system uses an action in

one of a subset of states, then the system uses the action in all states in the subset. A constraint

of this type can be expressed in terms of linear inequalities in integer and continuous variables.

If stochastic constraints are imposed on the irredundant primal, then the optimal solution of

the linear program will not generally be an extreme point of the irredundant primal. For this

reason, it is necessary to use randomized policies. As we have seen, if the system is transient, then

one way of doing this is to use an initially randomized policy. However, such a policy has the

unattractive feature that the policies used after the initial randomization can have radically dif-

ferent performance. This feature seems artificial and is likely to be unacceptable to users.

Stationary Randomized Policies. In practice it is often more desirable to use a stationary ran-

domized policy $ $ $ f $ $_ = = == with where, for each , is a probability vector on ,œ Ð Ñ = − œ Ð Ñ E!

i.e., a nonnegative vector on whose sum is one. The interpretation is that chooses actionE=_$

+ − E = −==+ in state with probability each time an individual enters that state . The station-f $

ary (nonrandomized) policies are equivalent to the stationary randomized policies where the prob-

ability vectors are all unit vectors.

Theorem 16. Feasible Solutions of the Irredundant Primal are State-Action Frequencies

of Stationary Randomized Policies. Suppose and the system is transient. Then the set ofA   !

feasible solutions of the irredundant primal coincides with the set of state-action frequencies of

the stationary randomized policies. In particular, if is a feasible solution of the irredundantB

primal, then is the state-action frequency of the stationary randomized policy whereB œ Ð Ñ$ $_ =

the probability vector on is given by for each action for$ $ $= =+ =+= =+ = =−Eœ Ð Ñ E œ B Î B + − E!

! !=

states with and is arbitrary for states with .= B ! = B œ !! !! !! !−E −E= =

== =

$

Page 44: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 38 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Before turning to the proof, it is worthwhile noting why the last claim of the Theorem is

intuitive. The reason is that the probability of choosing action in state can reasonably be+ =

expected to be the ratio of the expected number of individual visits to state that B=+ = choose

action therein to the expected number of individual visits to state .+ =!! !−E ==

B

Proof. We show first that each stationary randomized policy is transient. To that end, for each

state , let be the set of probability vectors on . Consider the stationary= − œ Ð Ñ Ef # #= == =+

randomized policy where and# # #_ =œ Ð Ñ for each . Let be the transition# # f= =+=œ Ð Ñ − = − T#

matrix of , i.e., the element of is .# #=> T :Ð> l =ß +Ñ>2 =++# ! To see that is transient, observe fromT#

Lemma 4 that because the system is transient, there is a unique vector such thatZ   "

Z œ Ð" :Ð> l =ß +ÑZ Ñ = −= >+−E >

max , .=

" f

This implies that

Z œ Ð" :Ð> l =ß +Ñ Z Ñ = −= >− >ß+

max , ,( =

=

" ( f=+

because, for each , the maximum on the right-hand side of the above equation is attained by= − f

setting for some and otherwise. Now on setting for each , it follows( ( ( #=+ = = ==œ " + − E œ ! œ =!

from the above equation that . Iterating this inequality shows thatZ   " T Z#

Z   T " T Z   T "" "3œ! 3œ!

R" R3 R 3# # # ,

so is increasing in and bounded above by , whence is transient as claimed.!R3œ!

3T " R Z T# #

Next oberve that since each stationary randomized policy is transient, its state-action frequency

is finite and so feasible for the irredundant primal by the argument in the paragraph titled “state-

action frequency on page 31. Conversely, it remains to show that if is a feasible solution of the” B

irredun icy dant primal, then is the state-action frequency of the stationary randomized polB $_

defined in the Theorem. To that end, let be the expected number of exits from \= state when=

using . Since is transient, the are finite. It remains to show that $ $_ _= =\ \ œ !

! !−E ==B for

each state . To see this, observe first that because the expected number of entries into= state =

equals the expected number of exits therefrom, it follows that

(5) , .\ :Ð= l >ß +Ñ \ œ A = −= > =

>ß+

" $>+ f

It is enough to show that this equation has a unique solution since is feasible for the irredundantB

primal and so, as is readily verified by substitution, is one such solution. Let \ œ B T= =−E!

! ! $=

be the matrix whose element is . Rewriting (5) in matrix form yields>= :Ð= l >ß +Ñ>2+

! $>+

\ \T œ A$ ,

which as a unique solution because is transient. T$ è

Page 45: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 39 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Example. Supply Management with Service Constraints. Consider a dynamic supply prob-

lem with an obsolescence probability, backorders up to , say, starting inventories up to , say,F M

and a zero lead time for delivery of orders. Then the states are . Now the constraintFß á ß M

that the expected number of stockouts is at most can be expressed by the inequality5

" "=œF

"

+

=+ .B Ÿ 5

Another possible constraint is to require that the expected fraction of the periods in which stock

is short be at most . This can be expressed as the ratio constraint0

" "" "

=œF

"

+

=+

=œF

M

+

=+

,

B

B

Ÿ 0

or equivalently, the linear inequality

" " " "=œF =œF

" M

+ +

=+ =+ ,B Ÿ 0 B

On the other hand, the constraint that the expected number of backorders be at most would"

take the form

=Ð B Ñ Ÿ" "=œF

"

+

=+ ."

Augmented Irredundant Primal. In order to find optimal policies with stochastic constraints,

it is necessary to append the stochastic constraints to the irredundant primal program and solve

the resulting program. Call this the .augmented irredundant primal

Number of Randomized Actions. If there are, say, linear stochastic constraints, then each5

basic feasible solution of the augmented irredundant primal will have at most positiveW 5

variables because there can be at most one basic variable for each constraint. Thus, there can be

up to state-action pairs that are positive in a basic feasible solution of the augmented irre-W 5

dundant primal. In that event, the number of states in which it may be optimal to randomize

can be as large as . If is small in comparison with , relatively little randomization is neces-5 5 W

sary. But if is large in comparison with , randomization becomes important. In the supply5 W

problem above, there are only three stochastic constraints. Thus, it will be optimal to randomize

in at most three initial inventory levels.

Page 46: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 40 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Maximum Present Value

The maximum-value criterion may be a reasonable one if the population is indifferent between

receiving income in one period or another. However this is often not the case because income has

a time value. One reflection of this is that it is usually possible to borrow and lend money at ap-

propriate interest rates. Assume now that these two interest rates both equal 100 % per period,3

say. In that event, the amount of money that must be invested in one period to generate one

unit of money in the following period is . Call the , reflecting the dis-" "´"

"3discount factor

count that one enjoys for buying one unit of money one period before receiving it. Thus, the

amount of money that would be needed in period zero to generate the income stream < ß á ß <" R

in periods at the interest rate 100 % is"ß á ß R 3

Z ´ <3R 3 3R

3œ"

" " .

Call this sum the of the stream discounted to period zero.present value Ð< Ñ3

The above development tacitly assumes that interest is not taxed. If, as is usually the case,

interest is taxed, it is more natural to let 100 % be the , e.g., 60% of the3 after-tax interest rate

original interest rate for populations in the 40% tax bracket. Then is the after-tax presentZ 3R

value of the income stream .Ð< Ñ3

In the above discussion we have also ignored another important factor, viz., inflation. In in-

flationary times, income earned in one period has greater purchasing power than income earned

later. This suggests that income streams and interest rates should be adjusted to account for

inflation. To be concrete about this, suppose that 100 % is the ordinary after-tax interest rate3w

and 100 % is the rate of inflation. Then the after-tax interest rate 100 % is0 inflation-adjusted 3

given by . Thus, even if is positive, will be negative if the rate of inflation ex-" œ3 3 3""0

3ww

ceeds the after-tax rate of interest, a not uncommon situation. In any event is the Z 3R inflation-

adjusted after-tax present value of the income stream . Since the inflation-adjusted after-taxÐ< Ñ3

present value is equivalent to an ordinary present value with an appropriate choice of the inter-

est rate, we can and do simply speak about interest rates and present values in the sequel.

The above discussion leads us to consider both positive and negative interest rates. We will

make the fairly mild assumption that , however. For if , then either , in3 3 3 " Ÿ " œ "

which case money invested in one period is lost in the following period, or , in which case3 "

a unit investment in one period becomes a liability of in the following period. Both" !3

cases rarely occur, and almost never repeatedly.

The above considerations suggest that policies be compared on the basis of their1 $œ Ð Ñ3

R -period present values (discounted to period zero)

Z ´ T <31 1 $

R 3 3"R

3œ"

" " 3 .

Page 47: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 41 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

On setting , and , the above can be rewritten asT ´ T < ´ < T ´ T âTs s s ss$ $ $ $ $ $1" "3

" 3

Z ´ T <s s31 1 $

RR

3œ"

3""3.

This reduces the problem of finding a policy with maximum -period present value to an equiv-R

alent problem of finding a policy with maximum -period value.R

Observe that if and only if . Thus the equivalent problem with transition5 "5Ts T$ $

" "

matrices is transient if and only if , or equivalently, the (discounted toT "s $ "5 present value

period zero)

Z ´ T <31 1 $"_

3œ"

3 3""3

of each policy is absolutely convergent for every choice of . When this is so, the 1 $œ Ð Ñ <3 † maxi-

mum present-value problem is equivalent to a maximum-value problem- and so the results for the

latter problem apply at once. In particular, there is a stationary maximum-present-value policy.

Multi-Armed Bandit Problem: Optimality of Largest-Index Rule [GJ74], [KV87], [Gi89]

A fundamental sequential statistical decision problem that attracted the interest of statistic-

ians at least as early as the 1940s is the multi-armed bandit problem. The problem entails find-

ing a maximum-present-value policy in a particular multidimensional Markov decision chain.

The difficulty of the problem was thought so great that Allied scientists during World Warto be

II joked that dropping leaflets describing the problem over Axis countries might be a good way to

distract their scientists from the war effort. The problem remained unsolved until 1972 when

Gittins and Jones found a remarkably simple solution. There are many applications of this problem

including sequential allocation of several able treatments to patients and sequential allocationavail

of effort to several parallel projects. For definiteness in the sequel, consider the problem to be one

of project scheduling.

There are independent projects, labeled . In each period, only one project may beR "ß á ß R

active. The states of the inactive projects in a period are frozen. The state of project R" = 88>

at the beginning of the period it is active is a finite-state Markov chain with stationary tran->>2

sition probabilities. Since the state of project does not change while it is inactive, is also8 =8"

the state of that project at the beginning of period . Put and assume that " = œ Ð= ß = ß á Ñ =8 8 8 4" #

and are independent for all . If project is active in a period when it is in state , the= 4 Á 8 8 =8

project earns a reward . tor<8= Inactive projects earn no rewards. There is a discount fac

! "" . A policy is a nonanticipative rule for deciding which project to activate in each per-

iod, i.e., the choice of the project to activate in a period depends only on the projects active in

prior periods and the states of the projects observed in that and prior periods. This is a finite-

state-and-action Markov decision chain. The goal is to find a maximum-present-value policy.

Page 48: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 42 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Random and Stopping Times. random time To do this requires a few definitions. A is a _

or nonnegative integer-valued random variable. A is a random timestopping time for project 8

5 5 that is for project , i.e., Pr ) is independent of for eachnonanticipative 8 Ð Ÿ > l = ß = ß á= 8 8>" >#

nonnegative integer and . A policy may be described inductively by > œ Ð= ß á ß = Ñ= " R selecting

for each a period and a project that will be active during periods > œ "ß #ß á "ß á ß7 7 7> >" >

where and whenever . Since policies are nonanticipative and since 7 7 7 7! >" > >"4´ ! _ =

and are independent for , is the in per= 4 Á 8 " 8> >"7 7 stopping time for the project selected -

iod relative to the state of the project in that period.7>""

Index of a Project in a State. Let be the present value of the rewards that proV ´ <8 4 8>

>" =

! " 84

-

ject earns when it is active in the first periods. Put . The sequel shows that is8 > F ´ " F> >>"

proportional to . The is max E E where the max-!>"

4 8 8=  "" index of project in state 8 = 7 ´ V Î F7 77

imum is over stopping times for project 7" 8 in state . Note that the indices for a particular=

project in its various states depend only on the rewards and transition law of the Markov chain

for that project—not the others. Consequently, as the development below shows, finding the in-

dices for the project entails finding max ent-value policies for a family of Markov deci-imum-pres

sion chains for that project alone, one for each state.

Largest-Index Rule. The policy that, in each period, selects a projectlargest-index rule is a

with maximum index. This simple policy turns out to have maximum present value. This result

reduces the -dimensional Markov decision problem to a collection of one-dimensional ones.R

Present Value a Project Earns. The key fact underlying the development of this result is that

for any sequence of (Borel) functions of , there is a random time for"       â   !! ! 5" # =

which Pr , for , soÐ   > l Ñ œ > œ "ß #ß á5 != >

(6) E E .V œ <8

>œ"

_

>> 8

=5 " ! " 8>

For example, if is5 > 8>>2 is the period that project is active when using a given policy and if 5

the random time that for which Pr , then (6) implies E EÐ   > l Ñ œ œ V œ <5 ! " "= >5 > 8 5 8

>œ"_

=> >

8>5

!is from project . Thus, the random time the present value of the rewards that the policy earns 8 5

adjusts the discounting of rewards that project earns to reflect its active periods .8 5 ß 5 ß á" #

Theorem 17. Optimality of Largest-Index Rule. In the multi-armed-bandit problem, the largest-

index rule has maximum present value.

Proof. Let be a policy that uses the largest-index rule and let be the period in1 7‡ >2>"" >

which selects project one, say, in the first per-1 1‡ ‡ selects a project. By relabeling, assume that

iod. Put and E E . By hypothesis, , so for each stopping time 7 7 5´ 7 œ V Î F 7   7 """ " " 8

=7 7 8"

for project ,8

Page 49: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 43 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

(7) E E .V Ÿ 7 F8 "5 5

Let herein be the present value of any policy starting from the given initial state. LetZ1 1

1w be an arbitrary policy, be a number and be a policy that activates each project infin-% 1 ! !

itely often and for which . Because rewards in distant periods are heavily discount-Z   Z 1 1!w %

ed, can be formed, for example, by modifying to cyclically activate each project for one1 1!w

period every R periods starting in a sufficiently distant period. Let be the policy that1"

permutes the order in which activates projects by shifting the first times that activates1 7 1! !

project one to the first periods. Let be the period in which activates project one for the7 16 X !

7 7>2 time if is finite and let otherwise. Figure 7 illustrates these definitions for the caseX œ _

R œ $.

� � � � � � � � � � � � � � �

� � � � � � � � � � � � � � �

� � � � � � � �� �� �� �� �� ��

�����

��

� �

��

� ����

��

��

Figure 7. Active Projects in each Period

For each fixed and , let (resp., ) be the period in which (resp., ) activates pro8 >   " 5 6> > ! "1 1 -

ject for the time. For , set if and if . For8 > " Ÿ 5 Ÿ X œ 8 œ " œ 8 ">2 5 > 5 > 6 >> > >! " ! " "> > >

5 X 5 œ 6 œ ! ! Ÿ Ÿ " >   " 5 > !> > > > > >, and set . Observe that is decreasing in because ! !

is increasing and, if , is increasing in . Thus, from (6), there is a random8 " 5 6 Ÿ ! >   "> >

time with Pr for . Also by (6) and the fact that for ,5 5 !8 8> > > >Ÿ X Ð   > l Ñ œ >   " 5 œ 6 5 X=

the difference between the present values that and earn from project is E1 1 " "! "_ 5 6 8

=8 Ð Ñ<!1

> >8>

œ Ð Ñ< V VE , which is E E !X 5 6 8 " "=1 " "> >

8>

"5 7 if and E if . Hence,8 œ " V 8 "858

(8) E E .Z Z œ V V1 1 7 5" ! 8"

8œ"

R8"

If the reward that each active project earns in each state is instead , the left-hand side ofÐ" ÑÎ" "

(8) vanishes, so because , (8) reduces to!>"

4 >" " " "Ð" ÑÎ œ "

(9) E E .! œ F F7 5"8œ"

R

8

6For to be a policy, it must be nonanticipative. That this is so may be shown by noting that essen-1 1" "

tially consists of activating project one for the first periods and then using afresh thereafter, being care7 1! -ful to skip the first times that activates project one.7 1!

Page 50: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 44 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Now for each , Pr is inde-8 "   Ð " Ÿ > l Ñ œ " 5 5 !8 8>" 8 is a stopping time for project since =

pendent of ition of ,= ß = ß á 78 8 ">" ># . Hence, from (7)-(9) and the defin

Z Z   7 Ð F F Ñ œ !1 1 7 5" !8

"

8œ"

R

E E ."Using the above construction, it follows by induction on that there exist policies that>   " 1>

agree with through period and have the property that . Thus, 1 7‡>   > Z Z   ! Z Z1 1 1 1> >" > !

  ! Z œ Z   Z   Z Z   Z and lim . Since is arbitrary, . 1 1 1 1 1 1‡ w ‡ w> !>Ä_ % % è

Restart Problem. To facilitate computation, it is useful to characterize the index of a project

in state , say, as the value of the project in an associated “restart-in- ” problem. For notational= =

simplicity, drop the superscript designating the project in the sequel without further mention.

Now the problem entails choosing a maximum-present-value stopping time inrestart-in-= "7

state for the problem in which one starts in state , chooses the stopping time and re-= = "7

starts afresh in state in period . If has maximum present value for this problem, then= " @7 =

@ œ ÐV @ @=  " = =max E ). Now if is the (expected) present value of rewards that the stop-7 7 77"

ping time earns, then E E , so E E . Hence, max7 "" @ œ V Ð Ñ@ @ œ V Î F @ œ @ œ 77 7 7 7 7 7 7 77

= = = =  " = =

is the index of the project in state .=

In order to examine the computational efficiency of this formulation, suppose that the se-

quence of states that the project enters is a Markov chain on the finite state-space of size .f W

The above approach to the restart-in- problem looks at the system only when it is in state = =

and chooses a stopping time. Because the system is Markovian, it suffices to limit attention to a

nonrandomized Markovian stopping time, or equivalently a restarting set, i.e., a set of states in

which to restart the system in state . Though this approach requires only one= state, the number

of actions (restarting sets) in that state is , and so grows geometrically with .# WW

A better approach to this problem is to retain the original state-space of the Markov chain.

Then there are only two actions in each state , viz., continue or restart in . To implement this4 =

approach, let be the row of the transition matrix of the Markov chain on . Then theT 44>2 f

maximum-present-value vector for this form of the restart-in- problem is the unique solution=

@ œ Ð@ Ñ4 of

(10) max .@ œ Ð< T @ß < T @Ñß 4 −4 4 4 = =" f

Of course, is the desired index of the project in state .7 œ @ == =

Size Comparison. Suppose that the sizes of the state-spaces of the Markov chains are eachR

W R W R R. Then the original -dimensional problem has states, actions in each state and R WR

state-action pairs. By contrast, the largest-index rule requires solving restarting WR problems,

one for each state of each project, with each such problem having states, two actions and W #W

Page 51: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 45 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

state-action pairs. For example, if , the original problem would have 256 states andW œ R œ %

2 -&"# state-action pairs. By contrast the largest-index rule would require solving 16 restarting prob

lems each with 4 states and 8 state-action pairs.

On-Line Computation. The “discounted computational effort required to solve the restarting”

problems to find the needed indices can be reduced significantly if one computes . Initiallyon-line

one must solve restarting problems. Thereafter, in order to implement the largest-index rule, itW

suffices to wait until the state of the project that was active in the previous period is observed to

be in the restart set and then solve the restart problem for that project and state only. This en-

tails solving at most one restarting problem in each period. Thus the “discounted” number of re-

starting problems that must be solved is at most , and so is independent of ." " "ÐR ÎÐ" ÑÑ W

8 MAXIMUM PRESENT VALUE WITH SMALL INTEREST RATES IN BOUNDED SYSTEMS

[Bl62], [Ve66], [MV69], [Ve69a], [Ve74], [Ro75c], [RV92]

When a decision maker is indifferent between receiving income in different periods, attention

shifts to problems in which the interest rate is zero. Problems of this type are quite important.

Some examples appear below. In the first, the inflation-adjusted after-tax interest rate is zero. In

all but the first two, there is no interest rate. However, it is useful to introduce small interest rates

to address these problems over long time horizons.

ì Zero Inflation-Adjusted After-Tax Interest Rate. Suppose that the interest rate is 5%, the mar-ginal tax rate is 40% and the rate of inflation is 3%. Then the after-tax rate of interest is 3% whichcoincides with the rate of inflation. Consequently, the inflation-adjusted after-tax interest rate is zero.

ì Maximum Present Value for an Interval of Interest Rates. A decision maker may wish to find apolicy that simultaneously has maximum present value for all small enough interest rates major-izing a given nonnegative interest rate.

ì Maximize Sustainable Yield of a Renewable Resource. One may wish to maximize the long-termsustainable yield of fish, crops, lumber, etc.

ì Maximize Expected Instantaneous Rate of Return. This problem arises in infinite-horizon ver-sions of Example 5 on p. 8.

ì Maximize Throughput of a Factory. The general manager of a factory may wish to maximizeits throughput.

ì Maximize Yield of a Production Process. A production engineer may wish to maximize the yieldof chips from a fab plant.

ì Maximize Expected Number of Patients Treated Successfully. A hospital may wish to maxi-mize the expected number of patients treated successfully.

ì Maximize Expected Number of On-Time Deliveries. A trucking, shipping, or air-cargo companymay wish to maximize the expected number on-time deliveries.

ì Maximize the Expected Number of Customers Served. A firm may wish to maximize the ex-pected number of satisfied customers.

ì Minimize Expected Number of Fatalities. A public agency may wish to minimize the expectednumber of fatalities from various hazards.

Page 52: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 46 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

For transient systems, the maximum-present-value problem with zero interest rates is pre-

cisely the maximum-value problem discussed in §1.7. But for nontransient systems, the value of

a policy is typically not well defined or is infinite. For this reason, generalizing the notion of max-

imum value from transient to nontransient systems requires a different formulation.

Bounded System

Assume throughout this section that the system is , bounded i.e., its system degree is zero or.

one. For example, substochastic systems are bounded because then T R1 is substochastic for each

R and so is bounded. However, not all bounded systems are substochastic. In bounded systems,

the present value of a policy is well defined and finite for all Z !31 1 3 even though that is not

generally the case where . However, for nontransient bounded systems 3 œ ! with , it is not3 œ !

generally possible or useful to compare policies by means of their present values because those

values may not be well defined or finite.

Strong Maximum Present Value

One way of resolving this dilemma is to consider present-value preference relations for small

(positive) interest rates. Such preference relations are of special interest when decisions are made

frequently, say every week or day. As will be seen subsequently, the strongest preference relation

of this type is the following. A policy has if has simultaneous- -strong maximum present value

maximum present value for all sufficiently small positive interest rates, i.e., there is a 3‡ !

such that

(1) for all and all .Z Z   ! ! 3-

31 3 3 1‡

This preference relation will play a fundamental role in what follows, especially for systems in

which there is exogenous immigration. The development to follow shows that stationary strong

maximum-present-value policies exist and generalizes the policy improvement method to find

such a policy.

Cesàro Limits and Neumann Series

First, however, it is useful to generalize the Neumann series expansion of Lemma 2 from

transient to bounded systems. If T ß T ß T ´ Ts! " R 3R3œ!á T W ‚ W and are complex matrices and ! ,

write C if . Similarly, write C and say that theT Ä T Ð ß "Ñ R T Ä T T œ T Ð ß "ÑsR R" R" _

Rœ!!

first series C if C , or equivalently, . On the otherconverges Ð ß "Ñ T Ä T Ð ß "Ñ R T Ä Ts sR 3" R"

3œ!!

hand, write C if . Incidentally, C (resp., C ) stands for T Ä T Ð ß !Ñ T Ä T Ð ß !Ñ Ð ß "ÑR R Cesàro limit

of order zero one (resp., ). Thus Cesàro limits of order zero are ordinary limits.

It is easy to see that if a sequence converges C , the sequence also converges C , andÐ ß !Ñ Ð ß "Ñ

the two limits coincide. However the converse is generally false. For example, every periodic se-

quence converges C . But such a sequence does not converge C except in the trivial caseÐ ß "Ñ Ð ß !Ñ

Page 53: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 47 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

of a constant sequence. Thus, one important feature of C convergence is that it assures thatÐ ß "Ñ

a broader class of sequences converges than does ordinary (C 0) convergence.ß

In the above terminology, Lemma 2 asserts that if is a square complex matrix and T T Ä !R

Ð ß Ñ M T ßC 0 , then is nonsingular and has a (C 0) Neumann series expansion. The next lemma as-

serts that this result remains valid for (C 1) limits as well.ß

Lemma . Cesàro Neumann Series.18 CIf is a square complex matrix and , henT T ! Ð ß "Ñ >R Ä

M T is nonsingular and

Ð2 C .Ñ ÐM T Ñ œ T Ð ß "Ñ" R_

Rœ!

"Proof. Recall as in the proof of Lemma 2 that

M T œ ÐT T Ñ œ ÐM T Ñ T œ ÐM T ÑTs4 3 3" 34" 4"

3œ! 3œ!

4"" "

where .T ´ Ts 443œ!

3! Summing the above equation over from to and dividing by , yields4 " R R

(3) , .M R T T œ ÐM T ÑÐR T Ñ R œ "ß #ß ás s" "R" 3

R"

3œ!

"Thus if C , then the left hand side of (3) is nearly for large enough and so isT Ä ! Ð ß "Ñ M RR

nonsingular. Hence is nonsingular. Now premultiply (3) by and let . M T ÐM T Ñ R Ä _" è

In order to appreciate the significance of this result, observe that since C impliesT Ä ! Ð ß !ÑR

T Ä !R Ð ß "Ñ T Ä ! Ð ß !ÑC , it follows that (2) will certainly hold whenever C . But (2) holds inR

other cases as well. For example, if , then C . T œ M T Ä ! Ð ß "ÑR In that event (2) asserts that

Ð ÑM œ ÐMÑ Ð ß "Ñ"#

!_Rœ!

R C . Moreover, the last two limits cannot be replaced by limitsÐ ß "Ñ Ð ß !ÑC C .

Stationary and Deviation Matrices

The next two results characterize two important matrices associated with a square matrix .T

Theorem 19. Stationary Matrix. If is a square complex matrix with ,T . Ÿ "T then there is a

stationary matrix for such thatT T‡

Ð Ñ T T Ð ß . Ñ4 CR ‡TÄ .

Moreover,

Ð Ñ T T œ T T œ T T œ T5 ‡ ‡ ‡ ‡ ‡.

If is real resp., nonnegative, substochastic, stochastic , then so is . Also, if, and pro-T Ð Ñ T T œ !‡ ‡

vided that is nonnegative, only if is transient. T T Finally, if and only if .T œ M T œ M‡

Page 54: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 48 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Proof. If , then and the result is clear. Suppose . Put .. œ ! T œ ! . œ " F ´ R TT T R‡ " 3R"

3œ!!

Then and are bounded. Hence, to prove (4), it suffices to show that the limit points ofÖT × ÖF ×RR

ÖF ×R coincide. (A of a sequence of matrices is, of course, the limit of a subsequence limit point of

the matrices). Let be any limit point of . Then haveN ÖF × F R ÐT MÑ œ T F œ F TR R R R" R

the common limit point . Thus and so . Now letN œ T N œ N T N œ T N œ N T N œ F N œ N FR RR R

O ÖF × N œ be any limit point of . Then the preceding equalities have the common limit point R

ON œ N O N O O œ N O. Since and are arbitrary limit points, they can be interchanged, giving

œ ON N œ O. Thus , so ( ) holds. Moreover, this also establishes (5). We claim that if 4 T is non-

negative and , then is transient. To show this, observe from Lemma 18 that T œ ! T H ´ ÐM T ч "

exists and is nonnegative, so for all . Thus, sinceH œ M T H œ â œ T T H   T R! !R R! !

3 R" 3

T is nonnegative, it is transient as claimed. The remaining assertions follow from ( ) and (5). 4 è

In view of (4), when is nonnegative, the element of can be interpreted as the longT => T>2 ‡

run expected average number of individuals in state per period generated by one individual>

starting in state . Notice from in (5) that each row of is left stationary when= T T œ T T‡ ‡ ‡

postmultiplied by . Thus, by (4), the row of is the of the associ-T = T>2 ‡ stationary distribution

ated Markov population chain to which the -step transition rates from state converge R = Ð ß . ÑC .T

Theorem 20. Deviation Matrix. If is an complex matrix with , then isT W ‚ W . Ÿ " T UT‡

nonsingular where is the stationary matrix for and . Also the deviation matrix T T U ´ T M H‡

for , definedT by

Ð Ñ H ´ ÐT UÑ ÐM T Ñ œ ÐM T ÑÐT UÑ6 ‡ " ‡ ‡ ‡ ",

and satisfyT ‡

Ð Ñ H œ ÐT T Ñ Ð ß . Ñ7 C"_

Rœ!

R ‡T

Ð Ñ HT œ T H œ !8 ‡ ‡

and

Ð Ñ9 ÐT UÑ œ T H‡ " ‡ .

Moreover, is the complex numbers is the direct sum of the ranges of and . The sum of‚ ‚W ‡Ð Ñ T H

the ranks of and is . Also if and only if . And if and only ifT H W T œ ! H œ ÐM T Ñ T œ M‡ ‡ " ‡

H œ ! T H. If further is real, then so is .

Proof. It follows readily from (5) that

ÐT T Ñ œ T T R   "‡ R R ‡ for .

On setting , it is apparent from (4) and the above equation that C .F œ T T F Ä ! Ð ß . ч RT

Hence, is nonsingular by Lemmas 2 and 18, so by the above facts,T U œ M F‡

Page 55: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 49 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Ð Ñ ÐT UÑ œ ÐT T Ñ œ ÐT T Ñ T Ð ß . Ñ10 C .‡ "_ _

Rœ! Rœ!

‡ R R ‡ ‡T" "

Now postmultiply (resp., premultiply) this equation by and use (5) to establish (6), (7)M T ‡

and the fact that and commute. Next postmultiply (resp., premultiply) (6)M T ÐT Uч ‡ "

by and use (5) to establish (8). Also, (9) follows from (7) and 10 . In view of (9), is theT Рч W‚

direct sum of the ranges of and provided implies that . To see that theT H B œ T C œ HD B œ !‡ ‡

last is so, observe from (8) that , whence by (9). ÐT HÑB œ T HD HT C œ ! B œ !‡ ‡ ‡ The char-

acterizations of and follow from ) and 8 . The fact that is real implies that T œ ! T œ M Ð Ð Ñ T H‡ ‡ 6

is real follows from ( ) and ( ).4 6 è

Notice from (5) and (6) that if is nonnegative, the element of is the C limitT => H Ð ß . Ñ>2T

of the amount by which the sum of the expected -period sojourn times of all individuals inR

state generated by one individual starting in state exceeds that when the system starts in-> =

stead with the stationary population distribution for state . Thus measures the deviation= H

from stationarity.

Laurent Expansion of Resolvent

Suppose that is a square complex matrix and that , so . Then T . Ÿ " Ÿ " M U œT T5 3

Ð" ÑM T ! Ð" Ñ T3 3 3 is nonsingular for because has spectral radius less than one. Call"

V ´ Ð M UÑ U ! V œ ÐM T Ñ œ T3 33 3 " " "" " R" R_Rœ! the of for . Also whereresolvent !

" ´ T V =>"

" 3. Therefore, if is real and nonnegative, then is the matrix whose element is3 >2

the sum of the expected discounted sojourn times of all individuals in state generated by one>

individual in state initially where is the discount factor and 100 % is the interest rate.= " 3

In order to study policies with maximum present value for small interest rates , it is3 !

useful to develop an expan in . It turns out that the desired expansion is a Laurentsion of V3 3

series about the origin whose coefficients are the stationary matrix and powers of the devia-T ‡

tion matrix .H

Theorem 21. Laurent Expansion of Resolvent. If is a square complex matrix with T Ÿ ".T

and if and are respectively the stationary and deviation matrices for , then for all T H T‡ 3 !

with ,35H "

Ð11 .Ñ V œ T Ð Ñ H3 3 3" ‡_

8œ!

8 8""If , then . If , then . If is real, so is .T œ ! V œ Ð Ñ ÐM T Ñ H œ ! V œ M T V‡ 8 8" "_

8œ!3 3 3! 3 3

Proof. It follows from Theorem 19 that , so .UT œ ! œ T U Ð M UÑT œ T œ T Ð M Uч ‡ ‡ ‡ ‡3 3 3

Premultiplying the next-to-last equation by and postmultiplying the last equation by yieldsV V3 3

(12) .V T œ T œ T V3 3‡ " ‡ ‡3

Also, from Theorems 19 and 20,

Page 56: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 50 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

(13) .ÐM T ÑÐM HÑ œ ÐT UÑH H œ Ð M UÑH‡ ‡3 3 3

Now for all with , is nonsingular, so its inverse exists and has a Neumann3 35 3 ! " M HH

series expansion. Thus, premultiplying (13) by and postmultiplying (13) by yieldsV ÐM HÑ3 3 "

(14) .V ÐM T Ñ œ HÐM HÑ œ Ð Ñ H3 ‡ "_

8œ!

8 8"3 3"Now add the first equation in (12) to (14). The rest follows from (10). è

Laurent Expansion of Present Value in Interest Rate

Suppose that the system is bounded. Then on putting and , itU ´ T M V ´ Ð M U Ñ$ $ $3$ 3 "

follows that the present value of is$

Z œ T < œ ÐM T Ñ < œ V <3 3$ $$ $ $ $ $Š ‹" .

_

Rœ!

R" R "" " "

This fact and Theorem 21 imply that for all with ,3 35 ! "H

(15) Z œ @3$ $"_

8œ.

8 83

where and for with and being respectively the@ ´ T < @ ´ Ð"Ñ H < 8 œ !ß "ß á ß T H" ‡ 8 8 ‡8"$ $$ $$ $ $$

stationary and deviation matrices for , and is the system degree. Evidently (15) gives theT .$

desired Laurent expansion of in .Z 3$ 3

In order to obtain a clearer understanding of the fundamentally important Laurent expan-

sion (15) of the present value of a stationary policy for small interest rates, it is useful to$_

interpret briefly the quantities appearing in the expansion. To that end, observe first that the

present value of is the product of the expected present value of the sojourn times of allZ V3 3$ $$

individuals in each state and the unit rewards per period <$ that individuals earn in those states.

If , the first coefficient in the Laurent expansion of is . In order to interpret this. œ " Z @3$ $

"

term, observe that depositing Z 3$ in a bank that pays interest at the rate 100 % earns the inter-3

est payment each period in perpetuity. Thus the present value is equivalent to the in-3Z Z3 3$ $

finite stream of equal interest payments in each period. Hence, it follows from (15) that the3Z 3$

limit of the interest payments in each period as the interest rate approaches zero is .@"$

There is a related interpretation of that may be described as follows. Observe from @ Z œ"$

3$

V < T Z œ @3 3$ $$ $ $, (12) and (15) that is the present value of starting with its stationary dis-‡ " "3 $

tribution in each state. Thus, is also the interest payment in each period resulting from depos@"$ -

iting in a bank that pays interest at the rate 100 %.T Z‡$

3$ 3

If , the second coefficient in the expansion (15) of is . It can be interpreted by ob-. œ " Z @3$ $

!

serving from (15) that where Z œ @ @ 9Ð"Ñ 9Ð"Ñ3$ $ $3" " ! is a function of that converges to3

zero as 3 Æ !. Thus is the limit as of the difference between the present@ Æ ! Z @! " "$

3$ $3 3

values of starting in each state and starting with the stationary distribution for that state.$

Page 57: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 51 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Characterization of Stationary Strong Maximum-Present-Value Policies as -Optimal_

The Laurent expansion of the present value of a stationary policy in the interest rate plays a

key role in characterizing stationary strong maximum-present-value policies in a bounded sys-

tem. To see this requires a definition. Call a real matrix , writtenF lexicographically nonnegative

F ¤ ! F, if the first nonvanishing element (if any) of each row of is positive. Similarly, write

F ¢ ! F ¤ ! F Á ! F £ ! F ¡ ! F ¤ ! F ¢ ! if and . Also write (resp., ) if (resp., ).

Let be the matrix of coefficients of the Laurent expansion (15) of .Z ´ Ð@ @ â Ñ Z$ $ $ $3. ."

This matrix has rows and infinitely many columns. The reader should note that differsW Z$

from its earlier interpretation as the value of a transient policy . Indeed, it follows from (15)$_

that in the present notation, the value of would then be lim because in that event$_ !Æ!3

3$ $Z œ @

@ œ !"$ . Thus in the sequel the reader will need to be careful to remember which of the two

interpretations of is intended. The reason for the twin interpretations of is that many ofZ Z$ $

the following results regarding strong present-value optimality for bounded systems have the

same form as for maximum-value optimality for transient systems.

Now observe from (15) that

(16) .Z Z œ Ð@ @ Ñ3$

3# $ #"_

8œ.

8 8 83

This difference is nonnegative, i.e., the present value of is at least as large as that for , for all$ #

3 3 5 ! " Z Z ¤ ! with max if and only if . The reason for this is that for all such ( ? $ #− H( posi-

tive , the sign of each row of (16) is simply the sign of the first nonvanishing coefficient of the3

same row of the Laurent expansion. Thus since there is a stationary maximum-present-value

policy for it follows that each ,3 ! $ $_ _ has strong maximum present value if and only if is

_-optimal, i.e., for all . Let be the set of for which is -optimal.Z ¤ Z − − _$ # # ? ? $ ? $__

Application to Controlling Service and Rework Rates in a QueueKÎQÎ_

As an illustration of the above ideas, consider the following application to controlled KÎQÎ_

queues. Suppose jobs arrive for service at a machine shop at the beginning of each hour accord-

ing to an arbitrary stochastic process. Each job may be processed on a machine during the hour

following arrival by a skilled or unskilled worker since the numbers of machines and of workers

of each type is adequate to serve all arriving jobs (this amounts to assuming that there are

infinitely many channels in the queue). Unskilled workers earn per hour while working on aA

job and skilled workers earn twice that much. A job completed by a worker is and earnsshipped

the shop if the job is acceptable and is (i.e., reprocessed during the next hour) if it is< reworked

not. Each job completed by an unskilled worker requires rework with probability , ,: : ""

#whether or not the job was previously reworked. By contrast each job completed by a skilled

worker requires rework with probability . Thus, the probability of shipping a job#: "

Page 58: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 52 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

completed by a skilled worker is , which is double the corresponding probability #Ð" :Ñ " :

for an unskilled worker. The problem, which is illustrated in Figure 8, is to decide whether to

use skilled or unskilled workers to process jobs when the goal is to achieve strong maximum

present value. In short, the issue is whether it is better to use a slow worker with low pay or a

fast worker with high pay.

Rework

Arrival Ship

Figure 8. Optimal Control of Service and Rework Rates in a QueueKÎQÎ_

Initially it is convenient to ignore the arrival process and instead assume that there is a sin-

gle job awaiting service. Observe that this is a single-state system in which there are two deci-

sions and corresponding respectively to using an unskilled or skilled worker to process the8 5

job. Put . Then the one-period reward and transition rates are? 8 5œ Ö ß ×

< œ Ð" :Ñ< A T œ : < œ #< T œ #: "8 8 5 8 5, , and .

For each , denote by the expected present value of net income (revenue minus wages)$ ?− Z 3$

received from a job when a worker of type processes the job until it is shipped and the hourly$

interest rate is .3 !

Notice that since , the system is transient and so . Thus, and: ” Ð#: "Ñ " . œ ! T œ !‡$

H œ Ð" T Ñ − H œ H œ @ œ H < @ œ H @$ $ 8 5 $ $ $$ $ $" ! " ! for , so and . Also, and for$ ?

" "

": #Ð":Ñ$ ?− @ œ @ ´ @ ´ < @ œ H @ @ œ H @, so , and . Hence! ! ! " ! " !

8 5 8 58 5A

":

(17) .Z Z œ ÐH H Ñ@ 9Ð Ñ œ 9Ð Ñ3 35 8 8 5

!3 3 3@

#Ð":Ñ

!3

Observe that the expected profit when a skilled worker processes a job is the same as that@!

for an unskilled worker, i.e., a job is profitable or unprofitable independently of the type of

worker who processes the job. Thus if the goal of the shop is to use a maximum-value policy

(with ), the shop will be indifferent between using a skilled or an unskilled worker. How-3 œ !

ever, if instead the shop seeks a strong maximum-present-value policy, it follows from (17) that

it is optimal to use a skilled worker if a job is profitable and to use an unskilled worker if the

job is unprofitable! The explanation for this is that if a job is profitable, it is better to use a

skilled worker who earns the profit earlier when the expected discount factor is larger, while if a

job is unprofitable, it is better to use an unskilled worker who incurs the loss later when the ex-

pected discount factor is smaller.

Page 59: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 53 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

But what if a job is neither profitable nor unprofitable, i.e., ? In that event, @ œ ! @ œ! 8$

Ð"Ñ H @ œ ! 8   ! Z Z œ ! !8 8 !$

3 35 8 for all , whence for all , i.e., the shop is indifferent be-3

tween using a skilled or an unskilled worker for all positive interest rates.

Incidentally, so far no consideration has been given to the effect of the arrival process into

the queue. Does it matter? The answer is it does as the developments in the sequel will show.

But for now, return to the general theory.

Strong Policy-Improvement Method

It is now possible to generalize the policy-improvement method to find a strong maximum

present-value policy for bounded systems. To that end, it is useful first to adapt the Comparison

Lemma to the present-value problem.

Comparison Lemma for Present Values. Observe that the maximum-present-value problem

for a bounded system is equivalent to a maximum-value problem for a transient system with

one-period reward vectors " " $ ?< T −$ $ and transition matrices for each . Then on applying the

Comparison Lemma for transient systems to the maximum-present-value problem, the resulting

Comparison Lemma for stationary policies with becomes3 !

(18) ,Z Z œ V K3 3# #

3 3$ #$

where K ´ < U Z Z3 3 3#$ $ $# # 3 is the . For a direct proof of thispresent-value comparison function

fact, observe that [ ] . Also, note that in the special case inZ Z œ V < Ð M U ÑZ œ V K3 3 3# # #

3 3 3$ $ #$# #3

which and are transient, setting the interest rate reduces (18) to the Comparison Lemma# $ 3 œ !

for stationary transient policies.

Laurent Expansion of Present-Value Comparison Function. It is now possible to give the

desired generalization of the policy-improvement method. Observe that if the goal were merely

seeking an improvement of for a single fixed positive interest rate , it would be enough to$ 3 !

choose so that for that single value of . For then by (18), . However, the# 3K ! Z Z3 3#$ $

3#

goal now is to be sure that is an improvement of simultaneously for all sufficiently small# $

positive . One way to achieve this is to choose so that simultaneously for all suffic-3 # K !3#$

iently small positive . To find such a , it is useful to develop the Laurent expansion of 3 # ! K3#$

in . To that end, substitute the Laurent expansion (15) of into and collect terms of3 ! Z K3 3$ #$

like powers of . The result is that3

(19) K œ 13#$ #$"_

8œ.

8 83

for all small enough where 3 ! 1 ´ < U @ @ 8 œ .ß ."ß á ß < ´ ! 8 Á !8 8 8 8" 8#$ $ $# ## for for ,

< ´ < @ ´ ! . K ´ Ð1 1 1 â Ñ! ." . ." .## # #$$ #$ #$ #$, and is the system degree. Let be the matrix of

coefficients of the Laurent expansion (19). Observe that is a matrix with rows and infin-K W#$

Page 60: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 54 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

itely many columns. The reader is warned again that generalizes its earlier interpretation asK#$

comparing the values of the transient policies and to comparing the present values ofÐ ß Ñ# $ $_ _

those policies for all small enough positive interest rates. Moreover, in the present notation and

under the hypothesis that the system is transient, lim which, be-3 # #3#$ #$ $ $Æ!

! ! !K œ 1 œ < T @ @

cause is then the value of , is the difference between the values of the transient policies @ Ð ß Ñ! _$ $ # $

and .$_

Strong Policy Improvement. It follows from (19) that (resp., ) for all small enoughK ! Ÿ !3#$

3 " 3 ! K ¢ ! K £ ! V   M ! if and only if (resp., ). Since for , it follows from (16), (18)#$ #$3$

and (19) that (resp., ) if (resp., ). For this reason, say that Z Z ¢ ! £ ! K ¢ ! £ !# $ #$ # strongly

improves if . Now observe from (18) that since and is nonsingular, $ K ¢ ! Z Z œ ! V K œ#$3 3 3 3

$ $ $ $$

! ! K œ ! for all . Thus, by (19). Hence, if no decision strongly improves , then, analogously3 $$$

with the corresponding claim for the transient case, for all . To see why this is so, ob-K £ !#$ #

serve that it is possible to improve the action in one state without affecting the others. This is be-

cause the row of depends on the action that takes in state , but not on the ac= K K =>2 ==#$ #$ # # -

tion that takes in any other state . Thus, if there is any state and action for which# # #> ==> = − E

K œ > Á = K ¢ !#$ #$=> > is lexicographically positive, then choosing for all will assure that since# $

then for . It follows from the above discussion that K œ K œ ! > Á =#$ $$> > $_ is -optimal if and_

only if for all .K £ !#$ #

Existence and Characterization of Stationary Strong Maximum-Present-Value Policies

Using these facts, it is now possible to prove the main result of this section.

Theorem 22. Existence and Characterization of Stationary Strong Maximum-Present-Value

Policies. In a bounded system, there is a stationary strong maximum-present-value policy. If $ −

? $ $, the following are equivalent has strong maximum present value is -optimal and: ; ;" # _‰ _ ‰ _

$ K £ !‰ for all #$ #.

Proof. The equivalence of Thus,#‰ ‰ ‰ and (resp., ) was shown above (resp., following (16)). $ "

since implies , it suffices to show that there is a for which holds. let $ " $ −‰ ‰ ‰!$ $To that end,

? $ $ ? $ $ be arbitrary and choose in inductively as follows. Given , let strongly im-" # R R"ß ß á

prove , if such a decision exists, and terminate with otherwise. Since increases lexico-$ $R R Z$R

graphically with , no decision can appear twice in the sequence. Thus because there are onlyR

finitely many decisions, the sequence must terminate with a that no $ $œ R decision strongly

improves. Then from the discussion preceding the Theorem.$‰ holds è

Proof of System-Degree Theorem 9 where Theorem 22 provides the tool necessary. œ ".

to prove Theorem 9 where , i.e., the sequence max has degree one. To that end,. œ " mT m1 ? 1−R

_

Page 61: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 55 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

put for all . Then by Theorem 22, there is a such that for all , so that< œ " − K £ !$ #$$ ? $ #

1 Ÿ ! Z   ! ! Z ¤ ! @ œ @" "#$ $

3$ $ for all and for all small , so . On setting , it follows that# 3

@ œ T <   ! N ß O @ ¦ ! @ œ !‡N O$ $ . Let be a partition of the states for which and . Now rewrite

the inequality as for all . Iterating this inequality shows that for1 Ÿ ! @   T @ − @   T @" R#$ # 1# ?

all , so and for all and . Because , the first inequality1 1@   T @ ! œ T @ R   " @ ¦ !N N N NR RN N ON1 1

implies that max and the second implies for all and , i.e., in-1 ? 1 1−R RN N ON_mT m œ SÐ"Ñ T œ ! R1

dividuals in cannot generate individuals in . Thus, since and , .O N @ œ ! Z ¤ ! ? ´ @   !O!O$ $

Also and so for all , i.e., for all . Hence the restriction of the1 œ ! 1 Ÿ ! ?   " T ?" !O O OO#$ #$ ## #

problem to states in is a transient system, so by Theorem 9 for that case, maxO mT m1 ? 1−ROO_ œ

SÐ Ñ ! Ÿ " œ Ð ß ß á Ñ! ! 1 # #R" # for some . Now on setting and , it follows that1 # #3 3" 3#œ Ð ß ß á Ñ

T œ T T TRN O

R

3œ"

3" R3N N OON O1 1 1#"

3 3.

To see this, consider the element of the term of the summand on the right. That element45 3>2 >2

is the expected number of ancestors in state in period of an indi5 − O R" vidual in state 4 − N

in period one for which some descendant of in in period generates4 N 3 an ancestor of in in5 O

period . Thus, since max and 3" mT m œ SÐ"Ñ œ1 ? 1−RN N_ max1 ? 1−

ROO_mT m SÐ Ñ!R , it follows that

max and so max . 1 ? 1 ?1 1− −R RN O_ _mT m œ SÐ"Ñ mT m œ SÐ"Ñ è

Call the algorithm given in the proof of Theorem 22 for finding a stationary strong maxi-

mum-present-value policy the . The method requires only fi-strong policy-improvement method

nitely many iterations, each involving two steps. For a given , first compute . Then seek a $ #Z$

with . The last two steps do not appear to be executable in finite time since they seemK ¢ !#$

to require computing the respective infinite matrices and . However, it is possible to refineZ K$ #$

the algorithm to run in finite time as the development below shows.

Truncation of Infinite Matrices

In a bounded system, it turns out to be sufficient to compute the finite truncated matrices

Z ´ Ð@ â @ Ñ K ´ Ð1 â 1 Ñ 8 œ W X X T8 . 8 8 . 8 ‡$ $ $ #$ #$ #$ $ and for where is the rank of . As examples,

observe that if is transient, then ; if is stochastic and irreducible, then ; ifT X œ ! T X œ "$ $

T œ M X œ W$ , then . In order to establish the claim, a preliminary result is necessary.

Lemma 23. If is a real matrix with rank , is a subspace of , and F W ‚ W 6 P d B − d F B − PW W 3

for , then for ." Ÿ 3 Ÿ 6 F B − P 3 œ "ß #ß á3

Proof. The result is trivial if since then . Thus suppose that . The first step6 œ ! F œ ! 6   "

is to show that there is a positive integer such that is a linear combination of4 Ÿ 6 F B4"

Page 62: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 56 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

F Bß á ß F B F 6" 4 . Since the dimension of the subspace spanned by the columns of is , the vectors

F Bß á ß F B" 6" are linearly dependent, from which the assertion follows.

Next show by induction on that is a linear combination of 5 F Bß á ßF B F B 5   "5 4" for all ,

which, because is a subspace, will complete the proof. This is so by construction for P " Ÿ 5 Ÿ

4" 5" Ð   4"Ñ F B œ F B. Suppose it holds for , so . Premultiplying this equation by5" 343œ" 3! -

F F B œ F B F B F Bß á ß F B F B gives . Since is a linear combination of , so is . 5 3" 4" " 4 543œ" 3! - è

Theorem 24. Truncation. Suppose the system is bounded, and # $ ?ß − T ‡$ has rank T. Then

" K œ ! K œ !‰ if and only if .#$ #$WX

# Z œ Z Z œ Z‰ if and only if .# $ # $WX WX

Proof. For part 1 , it suffices to show the “if ” part. Observe that since ,‰ 8K œ !$$

(20) , .1 œ 1 1 œ ÐT T Ñ@ 8 œ "ß #ß á8 8 8 8#$ #$ $$ $# $

Because , it follows from (20) thatK œ !WX#$

(21) , for .ÐT T Ñ@ œ ! " Ÿ 8 Ÿ WX# $ $8

In view of (20), it suffices to show that (21) holds for . That this is so follows from8 WX

Lemma 23 with , the null space of , and , on noting from Theorem F œ H P T T B œ @$ # $ $! 20

that because the rank of is , the rank of is .T X H WX‡$ $

For part 2 , it suffices to show the “if ” part. Now implies that‰ WX WXZ œ Z# $

(22) , for .ÐH H Ñ@ œ ! ! Ÿ 8 Ÿ WX "# $ $8

To show that , it suffices to establish (22) for . The last follows from LemmaZ œ Z 8   W X# $

23 with , the null space of , and , on noting, as above, that the rankF œ H P H H B œ <$ # $ $

of is . H WX$ è

Remark 1. If the rank of is , then and by Theorem 20. Then X T W T œ M H œ ! @ œ <‡ ‡ "$ $ $ $$

and for .@ œ ! 8 œ !ß "ß á8$

Remark 2. Part of Theorem 24 implies that if the , say, row of vanishes, the same" = K‰ >2 WX#$

is so of the row of . As a consequence, if strongly improves , then . Thus, in= K K ¢ !>2 WX#$ #$# $

searching for a decision that strongly improves , it is never necessary to look beyond the ma-$

trices and , or if is known, and , for . Observe also that if andZ K X Z K − X œ !W W WX WX$ #$ $ #$ # ?

only if is transient. Otherwise .T X   "$

Page 63: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 57 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

8-Optimality: Efficient Implementation of the Strong Policy-Improvement Method

A more efficient way to execute the strong policy-improvement method for bounded -stateW

systems is to lexicographically maximize by first maximizing , then , then ,Z Z Z Z$ $ $ $. ." .#

etc. Indeed, as the sequel shows, for each , it is often of interest to seek a policy that is8   . $_

8-optimal, i.e., , all . Let be the set of decisions for which is -optimal.Z ¤ Z − 88 8 _8$ # # ? ? $ $

Theorem 25. Selectivity. In a bounded S-state system with system degree , . ª ª? ? ?. ."

ª â ª œ œ â œ? ? ?WX WX " _ where T is the minimum of the ranks of the stationary ma-

trices over all decisions.

Proof. Immediate from the definitions and the Truncation Theorem 24.

8-Improvements. It is useful now to develop a method of finding an -optimal policy. To that8

end, put . Call an - of if and if whenK ´ Ð1 â 1 Ñ − 8 − K ¢ ! œ8 . 8 8 = =#$ #$ #$ #$# ? $ ? # $improvement -

ever the row of vanishes. Clearly strongly improves .= K>2 8#$ # $

Theorem 26. -improvements and -optimal policies.8 8 In a bounded system with system degree

. 8 Z ¢ Z Ð8.Ñ 8, if is an -improvement of , then . If has no -improvement, then is -op-# $ $ $8 8 _# $

timal.

Proof. If is an -improvement of , then is a strong improvement of because # $ # $8 K œ !8=#$

implies . Thus for all small enough . Moreover, since ,K œ K œ ! K ! ! V   M#$ $$3#$

3#= = 3 "

lim lim lim lim lim ,3 3 3 3 3

# $ # # #$3 33 3 3

$ #$ #$Æ! Æ! Æ! Æ! Æ!

8 8

5œ. 5œ.

58 5 5 8 8 8 58 5" "3 3 3 "3 3Ð@ @ Ñ œ ÐZ Z Ñ œ V K   K œ 1 !

so as claimed.Z ¢ Z8 8# $

Suppose has no -improvement. Then for each , . Also lim$ # 3 3Ð8.Ñ K Ÿ 9Ð"Ñ V8. .Æ!

3#$ 3

3#

exists and is finite by Theorem 20 and is nonnegative for all small enough . ThusV !3# 3

lim lim lim ,3 3 3

# $ # #3 33 3

$ #$Æ! Æ! Æ!

8

5œ.

58 5 5 8 . 8." 3 3 3 3Ð@ @ Ñ œ ÐZ Z Ñ œ V K Ÿ !

whence for each , i.e., is -optimal. Z £ Z 88 8# $ # $ è

Example. , Has No -Improvement, and . . œ " 8 Â$ $ ?8 It is natural to hope that if . œ "

and has no -improvement, then . However, that is not so. For example, suppose that$ $ ?8 − 8

there is a single state and two decisions , with , so , and , .# $ T œ T œ " U œ U œ ! < œ " < œ !# $ # $ # $

Then and , so , whence has no -improvement. But is a@ œ " @ œ ! 1 œ U @ œ ! "" " " "# $ #$ $# $ #

! 1 œ < U @ @ œ " ! "-improvement of because . Also, is -optimal.$ #! ! "#$ $# # $

Example. , Has a -Improvement.. œ " "$ Suppose there are two states and two decisions

# $ # $ and with . Let . Let , so . Let , so # # X" "œ < œ < œ Ð! "Ñ T œ M U œ ! T œ Ð! "Ñ U œ# $ $ $ # #

Ð" "Ñ @ œ Ð! "Ñ K œ U @ œ " K œ K œ ! ". Then , and , so -improves ." X " " " "" " " # #"$ #$ $ #$ $$# # $

Page 64: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 58 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Computing . Z 8$ The only method discussed so far for computing requires first calculat-Z 8

$

ing the stationary and deviation matrices for . It turns out that this is not necessary. Indeed, it$

suffices to solve a system of linear equations for .Z 8$

Theorem 27. Computation of Z 8$ . Suppose the system is bounded with system degree . Let . Z 8

œ Ð@ â @ Ñ @ W @ ´ ! Z œ Z. 8 4 ." 8. 8. where the are -vectors and . Then satisfies the linear$

equations

(23) < U @ œ @ . Ÿ 4 Ÿ 8 .4 4 4"$ $ for .

Conversely, if satisfies these equations, then .Z Z œ Z8. 8 8$

Proof. Observe first that since , then , or equivalently, satis-K œ ! K œ ! Z œ Z$$ $$ $8. 8.8.

fies (23). Conversely, suppose that satisfies (23). Since also satisfies (23), it followsZ Z8. 8.$

that satisfiesY ´ Z Z8. 8. 8.$

(24) for U ? œ ? . Ÿ 4 Ÿ 8.$4 4"

where and . Now (24) implies thatY œ Ð? ß á ß ? Ñ ? ´ !8. . 8. ."

(25) for U ? œ ? . Ÿ 4 Ÿ 8$4 4"

and, on premultiplying (24) by and using the facts that and, if , ,T T U œ ! . œ ! T œ !‡ ‡ ‡$ $ $$

(26) for .T ? œ ! . Ÿ 4 Ÿ 8‡ 4$

Subtracting (26) from (25) gives

(27) for .ÐU T Ñ? œ ? . Ÿ 4 Ÿ 8$ $‡ 4 4"

Since is nonsingular and , it follows from (27) that . ProceedingT U ? œ ! ? œ !‡ ." .$ $

inductively, it follows that , i.e., as claimed. Y œ ! Z œ Z8 8 8$ è

Example. Why Must Satisfy (23) to Assure with .Z Z œ Z . œ "8. 8 8$ Suppose ,T œ M$

so , and ,. Then the first equation in (23) is , and so does not conU œ ! 8 œ # !@ œ !$" strain

@ < !@ œ @ @ œ <" ! " " at all. However, the second equation in (23) is , which shows that .$ $

Policy Improvement in Bounded Systems. The above results lead to the following policy-im-

provement method for finding an -optimal policy for a bounded system with that requires8 . œ "

only -improvements of decisions. Ð8"Ñ Let be given. Choose inductively so that is$ $ $ $! " R 5ß á ß

an -improvement of for givenÐ8"Ñ $5" where is the first integer (which exists)5 œ "ß á ß R R

for which has no -improvement. de-$R Ð8"Ñ Then is -optimal. In order to check whether a $R 8

cision has an -improvement, it is necessary to compute satisfying (23) with $ Ð8"Ñ Z 8"8#

replacing there. Then from Theorem 27, , which is needed to begin the search for8 Z œ Z8" 8"$

an -improvement of .Ð8"Ñ $

Page 65: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 59 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Sequential Implementation generic implementation. The above of policy improvement to find

an element of can be made more efficient by , i.e., by first seeking a?8 sequential implementation

$ ? $ ? $ ? $ ?" " ! ! " " 8 8− − − −, then a , then a , etc., until a is found. More precisely, suppose

one has found a . Then repeatedly seek -improvements until a decision is found$ ? $5" 5" 5− Ð5"Ñ

with no -improvement. Then Ð5"Ñ by Theorem 26, .$ ?5 5−

Sequential implementation is more efficient than generic implementation for two reasons. First,

sequential implementation generally requires fewer iterations because it focuses on maximizing the

coefficients of the Laurent expansion of in order of declining importance. Thus, sequenZ Z8$

3$ tial

implementation first finds maximizing , then from among all such a maximiz-$ $ $ $ $œ @ œ" !"$

ing @ œ @! ""$ $, then from among all such a maximizing , etc.$ $ $

Second, sequential implementation requires much less computation at each iteration than gen-

eric implementation. The implementation below explains why by showing how to use Theorems

26 and 27 to process inputs and to produce outputs $ ?− Z5"5

$ # ?− Z55" and .#

Ð+Ñ @ Find 5"$ . S and olve < U @ œ @ < U @ œ @ @ œ @5" 5#5" 5 5# 5" 5" 5#

$ $$ $$ and for @5"$ .

Ð,Ñ Ð5"Ñ nd a -improvement of if one exists.Fi # $ Since , has no -improvement.$ ? $− Ð5"Ñ5"

Consider the for which , so . Choose from among those one# #K œ K œ ! K œ Ð! 1 1 Ñ5" 5" 55" 5"#$ $$ #$#$ #$

for which and whenever . If there is a that -improves ,Ð1 1 Ñ ¢ ! œ Ð1 1 Ñ œ ! Ð5"Ñ5 = = 55" 5"= =#$ #$#$ #$# $ # $

go to . Otherwise set and go to . In both cases and Ð-Ñ œ Ð.Ñ Z œ Z# $ 5" 5"# $ # ?− 5".

Ð-Ñ @ Find .5# Solve set and< U @ œ @ < U @ œ @ @ œ5 5 5" 5" 5" 5 5# ## #$ and for @ @5 5"

# and . Re $ #œ

go to .Ð+Ñ

Ð.Ñ − Z Terminate with and . # ?55"# Since has no -improvement, . olve# # $ ?Ð5"Ñ œ − 5 S

< U @ œ @ < U @ œ @ @ œ5" 5" 5 5# 5# 5" 5"# ## #$ and for @ @5" 5#

# and .

Note that steps , and each require solving linear equations. And finding the -improveÐ+Ñ Ð-Ñ Ð.Ñ #W Ð5"Ñbest -

ment of in step entails lexicographically maximizing , which requires up to operations.# $ Ð,Ñ Ð1 1 Ñ #E5 5"#$ #$

7

Uniqueness. In practice, it is usually so that for relatively small and -optimal, only 5 5 œ$ # $_

satisfies . Then, by the Selectivity TheK œ K œ !5 5#$ $$ orem 25, is -optimal for all . Hence$_ 8 8   5

one can terminate sequential implementation early at iteration .5

Policy Improvement in Transient Systems. It is particularly easy to describe the amount of

work required to find a decision in in transient systems, or equivalently, . In that event, let?8 . œ !

Z œ − −8 Ð@ @ â @ Ñ œ Z K £ !! " 8 8 88 8 . for$ #$for any Now from Theorem 26, if and only if $ $? ?

all , or equivalently from Theorem 27 , ,# ? ? ?− Ð@ œ ! œ Ñ""

(28) max [ ], .@ œ < @ T @ 5 œ !ß á ß 85 5 5" 5

−# ?# #

5"

7By contrast with sequential implementation, the generic implementation entails solving a system of Ð5%ÑWlinear seeking the best -improvement at each iteration requires upequations at each iteration. And Ð5"Ñto operations to do each lexicographic maximization.Ð5$ÑE

Page 66: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 60 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Moreover, the set of decisions that maximize the right-hand side of the equation in (28) is# 5>2

? ?5 5"! " 8 5"© @ @ ß á ß @ @. Notice that these equations can be solved for , in that order. Once

has been found, the work of finding is precisely that of finding a stationary maximum-value@5

policy for a transient system with a restricted decision set and a new reward vector ?5"5< #

@ − 8"5"8. Thus the total amount of work required to find a is at most times that re-$ ?

quired to find a stationary maximum-value policy in a transient system—and usually much less.

In particular, the work required to find a strong maximum-present-value policy is at most W"

times that required to find a stationary maximum-value policy in a transient system. Although

we do not discuss it in detail here, we remark that the situation described above is similar for

bounded systems. In short, one should expect to be able to find strong maximum-present-value

policies on PCs for transient or bounded systems with thousands of states.

9 CESÀRO OVERTAKING OPTIMALITY WITH IMMIGRATION IN BOUNDED SYSTEMS

[Ve66], [Ve68], [DM68], [Li69], [Ve74], [RV92]

The preceding section generalizes the maximum-value criterion from transient to bounded

systems by means of maximum-present-value preference relations for small interest rates. Anoth-

er approach to bounded-systems is instead to consider preference relations that make a suitable

function of the rewards large over long horizons. For example, it is often of interest to consider

preference relations that focus on making the long-run expected reward rate, total reward, total

float, future value with a negative interest rate, instantaneous rate of return, or expected multi-

plicative utility large. It is possible to encompass these and other preference relations in a uni-

form and simple way by generalizing our model to allow exogenous (possibly negative) immigra-

tion of individuals into the population in each period. Then, above preference relationseach of the

reduces to Cesàro overtaking optimality for a suitable choice of the immigration stream.

Call a policy Cesàro overtaking optimal if the difference between the expected -period re-R

wards earned by that policy and any other policy has a nonnegative Cesàro limit inferior of ap-

propriate order. Cesàro limits are needed to smooth out oscillations in the differences of expect-

ed -period rewards that would otherwise generally prevent the existence of overtaking-optimalR

policies from being assured. It might be supposed that a policy that is Cesàro overtaking opti-

mal for each individual immigrant would also be Cesàro overtaking optimal for the immigration

stream itself. But that is not so! The reason is fundamentally that each individual immigrant

would be indifferent between two policies that produce the same permanent income even though

one policy produces more float, i.e., temporary income, than the other. But the system would

prefer the policy that produces more float if, as is possible, the aggregate of the float that each

individual immigrant produces provides extra permanent income for the system. That is the rea-

son why the set of Cesàro-overtaking-optimal policies depends on the immigration stream and

permits several preference relations to be reduced thereto.

Page 67: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 61 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Allowing immigration into the population is also interesting in its own right in the study of

controlled populations. For example, exogenous arrivals into a queueing network comprise an im-

migration stream therefor.

Examples

Controlled Queueing Network with Proportional Service Rates. Consider a controlled -staW -

tion queueing network. Customers arrive exogenously at the various stations according to a gen-

eral stochastic process that is independent of the service process, with being the (finite) ex-AR=

pected number of such arrivals at station in period . Each station has infinitely many= R "

independent and identical servers, each with geometrically-distributed services times. A custom-

er arriving exogenously or endogenously at a station at the beginning of a period begins service

there immediately because there are infinitely many servers available. In each period and at

each station, the network manager chooses an action from a finite set that determines the

common service rate of customers at the station and the common rates at which customers leave

the station in the period to exit the network or to go to various stations in the next period. The

cost of servicing a customer and the revenue earned when the customer completes service at a

station depends on the action taken by the manager. This is a Markov decision chain with

immigration in which the states are the stations, is the reward earned W <Ð=ß +Ñ by a customer at

station when the manager takes action at that station, is the probability that a= + :Ð> l =ß +Ñ

customer being served at station completes service at that station during the per= iod and is sent

to station at the beginning of the following period, and is the probability that> " :Ð> l =ß +Ñ!>

a customer completes service at station and leaves the system.=

Keep in mind that the assumption that a station has infinitely many identical servers has an

alternate interpretation, viz., that the station has a single server whose service rate is propor-

tional to the number of individuals waiting to be served at the station. In this event, selection of

an action by the manager amounts to choosing the proportionality constant in the proportional

service rate there.

Figure 9 illustrates a four-station network. Exogenous arrivals occur only at stations 1 and

2. Stations 1 and 2 generate output at both stations 3 and 4. Stations 2 and 3 also generate

output that must be reworked respectively by starting afresh at stations 2 and 1 respectively.

This model of controlled queueing networks has several strengths. One is that only one state is

required for each station (it is not necessary to keep track of the number of customers in a sta-

tion because there are infinitely many servers), so an -station network leads to a problem withW

only states. For this reason, it is computationally tractable to find optimal policies forW

controlling queueing networks with as many as several thousand stations! Two other strengths

are that the exogenous arrival process is arbitrary and the manager is allowed to control the ser-

vice rates and routing of customers.

Page 68: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 62 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

��������

��������

� ����

� ����

� ������ �

� ������ �

Figure 9. Optimal Control of Service/Rework Rates in a Four-Station Network of Queues

Of course the model also has limitations. The service rate at each station must be propor-

tional to the number of customers there; the number of such customers cannot be restricted; and

rewards are linear in the expected number of customers at each station.

Cash Management. Bills arrive daily at a firm for payment from all over the country. The

firm pays the bills by writing checks drawn on several banks throughout the country at which it

has accounts. The length of time for a check to clear depends on the location of the payee and

of the bank on which the check is drawn. Generally, the more distant the payee from the bank

on which the check is written, the longer the check takes to clear. From past experience, the

firm knows the distribution of the random time to clear checks drawn on each bank to pay bills

from each payee. The problem is to choose the bank on which to write a check to pay bills from

each payee.

Manpower Planning. A personnel director manages a work force in which individuals are hired,

move between various jobs and skill-levels over time so as to meet an organization’s needs, and

eventually leave. An individual’s state may be differentiated by factors like age, compensation,

current assignment, education, skill, etc. Actions may include policies on recruitment, hiring, train-

ing, compensation, job assignment, promotion, layoff, firing, retirement, etc.

Insurance Management. An insurance company insures its policy holders against risks of var-

ious types. A policy holder’s state could include the type and terms of his policy, age, sex, health,

safety record, occupation, prior claims, etc. The company’s actions might include rules for accept-

ing or rejecting new policy holders, premiums charged, dividends paid, coverage, cancellation pro-

visions, etc.

Page 69: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 63 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Asset Management. Managers of assets like vehicles, machines, equipment, computers, build-

ings, etc., maintain them to provide service to internal and/or external customers. An asset’s

state may include its type, condition, current status, reliability, age, capabilities, market value,

location, etc. Actions may include buying, renting, leasing, operating, pricing, maintaining, sell-

ing, storing, moving, etc.

Immigration Stream

As several of the above examples suggest, it is useful to generalize the systems considered

heretofore to allow individuals to enter the population through exogenous uncontrolled immi-

gration. In particular, the (possibly negative) number of immigrants in state in period = R "

  " A A œ ÐA Ñ W ‚ W is . Let diag be the diagonal whose R R R= = immigration matrix =>2 diagonal

element is . Call immigrants in a period or according as the numAR= positive negative ber of immi-

grants in the period is positive or negative. Positive (resp., negative) immigrants add (resp.,

subtract) their rewards and those of their descendants to (resp., from) that of the system. Call

the sequence of immigration matrices the .A œ ÐA A â Ñ! " immigration stream

An alternate approach is to represent an immigration stream by appending immigrant-gener-

ating states. This device can be useful for studying certain well behaved immigration streams. In-

deed this viewpoint is necessary when it is possible to control the immigration stream. However,

it two reasons. One is is usually better not to append additional states to reflect immigration for

that this approach facilitates the study of the effect of changes in the immigration stream on the

set of Cesàro-overtaking-optimal policies. The other is that this method does not enlarge the state-

space and leads to more efficient computational methods.

Cohort and Markov Policies

Each immigrant in a period generates populations in subsequent periods. Call the collection

of all immigrants in a given period and their descendants a . The cohort is the onecohort R >2

that the immigrants in period generate. Of course each individual in the population in aR

given period belongs to exactly one cohort, so there is a partition of the population in any given

period into cohorts as Figure 10 illustrates.

In the presence of immigration, a policy has at least two different interpretations.1 $œ Ð ÑR

One is as a , i.e., every individual in period uses . The other is as a Markov policy cohortR $R

policy, i.e., all members of cohort use in period . Equivalently, a cohort policy starts3 R3"$R

afresh for each cohort. Figure 11 illustrates these two types of policies. Of course, the distinction

between Markov and cohort policies vanishes when they are stationary.

Nonstationary Markov policies might seem more natural than nonstationary cohort policies

because the action an individual takes in a given state intuitively should depend only on the state

Page 70: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 64 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Cohort 2

Cohort 5

1 2 3 4 5

Cohort 4

Cohort 3

Cohort 1

Periods

Figure 10. Cohorts

Markov Policy Cohort Policy

Period Period

Cohort 1 2 3 Cohort 1 2 3

1 1

2 2

3 3

â â

â â

ã ã ã ã

$ $ $ $ $ $

$ $ $ $

$ $

" # $ " # $

# $ " #

$ "

Figure 11. Interpretations of as Markov and Cohort PoliciesÐ$RÑ

and not the cohort to which the individual belongs. However, perhaps surprisingly, it turns out

that cohort policies are more convenient for two reasons. One is that they are easier to work with

because the formulas for the relevant quantities are nicer, and results about Markov policies can

easily be derived from them in any case. The second reason is that cumulative immigration streams

have an alternate interpretation in terms of unit values of rewards earned in different periods. In

that event, when there is no exogenous immigration, all individuals are in the same cohort so co-

hort policies are the most general that need to be considered. For these reasons, in the remainder

of this section policies mean cohort policies unless explicitly stated to the contrary.

Let be the -period value when is the immigration stream and is a (cohort) policy.Z R ARA1 1

Observe that since periods remain when the immigrants in cohort arrive, then, asR 5 5 "

Figure 12 illustrates,

Z œ A ZRAR

5œ!

5 R51 1" .

Page 71: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 65 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Periods

1 2 3

Cohort 1

Cohort 2 Immigrants

Cohort 3 by Period

Cohort

Ä

â R

A

A â

A

Æ ã ã

R A

!

"

#

R"

Z

Z

Z

Z

R

R"

R#

"

1

1

1

1

Figure 12. -Period Value with Immigration Stream R A

Overtaking Optimality

It might be expected that the analog of finding a strong maximum-present-value policy, i.e.,

one that simultaneously maximizes the present value for all small enough positive interest rates,

would be to find a policy that maximizes the -period value simultaneously for all suffi-- R Z RA-

ciently large . Unfortunately, as the example below illustrates, such policies do not generallyR

exist because the set of maximum- -period-value policies usually depends on for large , evenR R R

for transient systems. Of course, there are some exceptions to this general rule, e.g., circuitless

systems with no immigration after the first period, minimum-cost-chain problems, etc.

But a satisfactory treatment of bounded systems requires a weaker concept. To motivate one

such concept, observe that in a transient system, lim for each . Thus a policy RÄ_RZ œ Z1 1 1 -

has maximum value, i.e., for all , if and only if lim inf for allZ Z   ! ÐZ Z Ñ   !- 1 - 11 RÄ_R R

1. Both definitions are well defined in a transient system, but only the latter is generally well

defined in a bounded system. This suggests the following definition in a bounded system.

Call a policy for if- overtaking optimal A

Ð Ñ ÐZ Z Ñ   !1 lim inf for all .RÄ_

RA RA- 1 1

Call if replaces in (1). Observe that instead of maxi-- overtaking optimal Z Z Z ZR R RA RA- -1 1

mizing a criterion function, overtaking optimality is a binary on the set ofpreference relation

policies. Moreover, that relation is a , i.e., is reflexive and transitive, but not necessar-quasiorder

ily antisymmetric.

Stationary overtaking-optimal policies exist for transient systems and, when the stopping value

is finite, for . Indeed, they exist provided that the limit inferior in 1 can be re-stopping problems Ð Ñ

placed by a finite limit, as occurs in , and often in (all rewards are nonconvergent systems positive -

negative) and (all rewards are nonpositive) . Unfortunately, overtaking-optimalnegative systems

policies do not exist in general because can oscillate about zero for some states as theZ Z =R R= =$ #

next example illustrates.

Page 72: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 66 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Example. Nonexistence of Overtaking-Optimal Policies when and Dependence of Max- . œ "

imum- -Period-Value Policies on R R . Consider the three-state deterministic system in Figure 13.

There are two decisions and , and the rewards are as indicated on each arc. Note that and# $ #_

$_ are effectively the only policies and . Now. œ " alternates betweenZ Z œ #Ð"Ñ "R R R"! !$ #

$ " and so has maximum -per iod value R$_ iod value for odd and has maximum -perR R#_

for even . Thus, no policy has maximum -period value for all large enough . Also,R R R

lim inf and lim inf ,RÄ_ RÄ_

R R R R! !! !ÐZ Z Ñ œ " ÐZ Z Ñ œ $$ $# #

so no policy is overtaking optimal.

��

��

Figure 13. Nonexistence of Overtaking-Optimal Policies

Cesàro Overtaking Optimality

One way to address this problem is to smooth out the fluctuations in by substi-Z ZRA RA- 1

tuting a C limit inferior for the limit inferior in 1 . If is a sequence, the C limitÐ ß .Ñ Ð Ñ , ß , ß á Ð ß .Ñ! "

inferior of the sequence is the ordinary limit inferior of the sequence if and the limit. œ !

inferior of the sequence of arithmetic averages, viz., lim inf , if .RÄ_ 3" R"

3œ!R , . œ "!Call for if- Cesàro overtaking optimal A

(1) lim inf C for all .w RA RA

RÄ_ÐZ Z Ñ   ! Ð ß .Ñ- 1 1

Call if replaces in (1) Observe that in the ex-- Cesàro overtaking optimal Z Z Z Z ÞR R RA RA w- -1 1

ample above, is the unique Cesàro-overtaking-optimal policy, though it is not overtaking opti-$_

mal. It turns out that stationary Cesàro-overtaking-optimal policies always exist for bounded sys-

tems under natural assumptions on the immigration stream. In order to show this, it is necessary

to develop some preliminary concepts.

Convolutions

It is convenient to represent as a “convolution” of the two sequences and .ÐZ Ñ ÐZ Ñ ÐA ÑRA R R$ $

To define this notion, suppose and are sequences for which and oneF œ ÐF Ñ G œ ÐG Ñ R   !R R

or both are either scalars or matrices of common size and for which the products are wellF G3 4

defined for all . The last is always so if one or both of the sequences are scalars or if both are3ß 4

Page 73: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 67 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

matrices and the number of columns of each equals the number of rows of each . F G3 4 Denote

by the sequence whose element is ; call the F ‡ G R ÐF ‡ GÑ F G F ‡ G>2 R 3 R3R!

! convolution of

F G ‡ ÐF ‡ GÑ ‡ I œ F ‡ ÐG ‡ IÑ and . The convolution operation is , i.e., , provided thatassociative

F ‡ G 3ß 4 and all , then the convolution operationG ‡ I are well defined. If also, for F G œ G F3 4 4 3

is also , i.e., . In particular, that is so if or is a sequences of sca-commutative F ‡ G œ G ‡ F F G

lars. If and are each finite sequences with elements, denote by the middle ele-F G /   " F ì G

ment of the convolution is the sum over of theF ‡ G #/ " of elements. Thus, F ì G 5 œ "ß á ß /

product of the element of with smallest index and that of with largest index.F 5 G 5>2 >2

Binomial Coefficients and Sequences

For any positive integer , set , and for any number , define ,4 4x ´ Ð4ÑÐ4"ÑâÐ"Ñ 5 5 4choose

denoted , byŠ ‹5

4

Š ‹5 5Ð5"ÑâÐ54"Ñ

4 4x´ .

For , put and . Of course, if also are nonnegative integers, then4 œ ! !x ´ " ´ " 5   4Š ‹5

!

Š ‹ Š ‹5 5x 5

4 4xÐ54Ñx 54œ œ .

It is easy to verify for any and nonnegative integer that 5 4 ´ !Š ‹Š ‹5

"

Ð Ñ œ5 5 5"

4" 4 42 .Š ‹ Š ‹ Š ‹

It turns out that the immigration streams in which the numbers of immigrants in a state in

a period are independent of the state and in which the common sequence is “binomial” play a

fundamental role in the analysis. Define the , denoted , by induc-binomial sequence of order 8 "8

tion on . Call the because for any sequence8 " ´ Ð" ! ! â Ñ " ‡ F œ F! !identity sequence

F œ ÐF F â Ñ " ´ Ð" " " â Ñ R! " >2" . Call the because the element ofsumming sequence

" ‡ F F " ´ Ð" " ! ! â Ñ" "R3œ!

3 is . The inverse of summing is differencing. Call the ! dif-

ferencing sequence because the element of is where 0.R " ‡ F F F F ´>2 R R" ""

Now define inductively by the rule:" ´ Ð" Ñ8R8

" œ" ‡ " 8 "

" ‡ " 8 "8

" 8"

" 8" ,

, .

Call and respectively the - and - of . Observe that" ‡ F " ‡ F 8 8 F8 8 fold sum fold difference

" ‡ " œ " 7ß 8 œ !ß „"ß „#ß á7 8 78, .

It is easy to show from 2 by induction on thatÐ Ñ 8

Page 74: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 68 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

" œ Rß 8 œ !ß "ß #ß áR8"

RR8 Š ‹, | | .

Thus,

" œ 8 !R8"

8"R8 Š ‹, ,

which is a polynomial in of degree , whence has degree . AlsoR 8 " " 88

" œ Ð"Ñ 8 Ÿ !8

RR R8 Š ‹, ,

which alternates in sign for and is zero for .! Ÿ R Ÿ 8 R 8

The table below gives the first five elements of the binomial sequences for | | ." 8 Ÿ #8

8 "# " # " ! ! â" " " ! ! !

! " ! ! ! ! â" " " " " "# " # $ % & â

8

Figure 14. Some Binomial Sequences

Polynomial Expansion of Expected Population Sizes with Binomial Immigration

Our goal now is to give a polynomial expansion in of the -period value of a policyR R Z R81

1 for a , i.e., is the immigration matrix in eachbinomial immigration stream of order"8 8 " MR"8

period . Note that the symbol refers both to the binomial sequence and immigrationR   " "8

stream of order . The context will make the meaning clear, though either interpreta8 tion will

usually do.

For each policy , let and . If , re-1 • • • 1 $1 11 1 1 1 1´ ÐZ Z â Ñ ´ ÐZ Z â Ñ ´ " ‡ œ! " 8 !8 "8 _8

place by in these definitions and let . It follows that 1 $ • $ $ $ $$ $´ Ð! T T â Ñ œ Ð" ‡ Ñ<! "" and

• 88"$ $ $œ Ð" ‡ Ñ< .

There are two interpretations of the element of . One is=> Ð" ‡ Ñ œ Ð" ‡ Ð" ‡ ÑÑ>2 R R8" 8 " $ $

the expected population size in state in period generated by binomial immigration of> R   "

order into state . The other is the sum of the expected sojourn times in state in the first8" = >

R   " = periods of the immigrants into state and their descendants when there is binomial immi-

gration of order into state .8 =

Thus, in order to obtain a polynomial expansion of in , it is necessary to find a poly-Z RR8$

nomial expansion of the element of in . The polynomial expansion is analogous toR " ‡ R>28" $

the Laurent expansion of the resolvent of that was needed to study the problem with small in-U$

terest rates. The desired polynomial expansion is provided in the next theorem.

Page 75: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 69 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Theorem 28. Polynomial Expansion of Convolutions of Binomial Sequences with Matrix

Powers. Suppose that P is a square complex matrix with and with stationary and devia-. Ÿ "T

tion matrices and respectively. Then for and ,T H R   " 8   "‡

Ð Ñ Ð Ñ œ T Ð"Ñ H T ÐHÑ ÐM T ÑR8 R8

8" 843 " ‡8" R 4 4" R8 8"

8

4œ!

Š ‹ Š ‹"‡ ‡

where . œ Ð! T T â Ñ! "

Before proving this Theorem, it is useful to discuss it. Notice that on letting be the se-ƒ8"

quence of matrices and be the sequence of binomial coefficientsÐT H H â Ð"Ñ H Ñ ,‡ " # 8 8" R8

Š ‹ Š ‹R8 R8

! 8"ß á ß , the expansion 3 can be rewritten more compactly using convolutions asÐ Ñ

Ð Ñ Ð Ñ œ , ì 3 w R R R8 88"" ‡8" ƒ %

where . Observe that , i.e., C , since and% % %R R8 8" ‡ R R8 8 8. T´ T ÐHÑ ÐM T Ñ œ 9 Ð"Ñ Ä ! Ð ß . Ñ TT

H T Ä T Ð ß . Ñ T T œ T Ð Ñ commute, C and . Thus, 3 provides an expansion that is a sumR ‡ ‡ ‡ ‡ wT

of a polynomial in of degree and an error term that converges to C .R 8 " ! Ð ß . ÑT

The case in which is of special interest. In that event, 3 becomes8 œ ! Ð Ñw

Ð Ñ T œ RT H 9 Ð"Ñ3 .wwR"

3œ!

3 ‡."

T

Notice that this representation is simply a restatement of the representation 7 of the deviationÐ Ñ

matrix in §1.8 as may be seen by subtracting from both sides of the above equation andRT ‡

taking the C limit as .Ð ß . Ñ R Ä _T

The expansion 3 is the sum of a polynomial in and a small error term, and is analogousÐ Ñ Rw

to a corresponding expansion of as the sum of a polynomial in and a small error3 38 "V3

term. The last expansion is obtained from the Laurent expansion of and isV3

3 3 38 8" ‡ 8 8 8"V œ T H â Ð"Ñ H 9Ð"Ñ3 .

Observe that the polynomial in this expansion has the same degree and the same coeffi-8 "

cient matrix as that in 3 . Incidentally, the multiplication of by in the aboveƒ 38"w 8Ð Ñ V3

expression arises because the present value of used in 3 , when discounted to period , is" Ð Ñ 88w

38. In the special case in which , this last expansion becomes8 œ !

V œ T H 9Ð"Ñ3 3" ‡ .

The right-hand side of this expansion is similar to 3 with replacing .Ð Ñ Rww "3

Page 76: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 70 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Proof of Theorem 28. To prove 3 , it suffices to show for and thatÐ Ñ R   " 8   "

Ð Ñ Ð Ñ T œ TR8

8"4 " ‡8" R ‡ ‡Š ‹

and

Ð Ñ Ð Ñ ÐM T Ñ œ Ð"Ñ H T ÐHÑ ÐM T ÑR8

845 ," ‡8" R ‡ 4 4" R8 8" ‡

8

4œ!

"Š ‹since 3 follows by adding and . To prove , observe thatÐ Ñ Ð%Ñ Ð&Ñ Ð%Ñ

Ð Ñ T œ Ð" ‡ " Ñ T œ " T œ TR8

8"" ‡8" R ‡ R" ‡ R" ‡ ‡

8" " 8# Š ‹ .

The next step is to establish 5 by induction on . The result is trivial for . Now obÐ Ñ 8 8 œ " -

serve that . Postmultiplying this equation by and using theÐ ÑÐM T Ñ œ Ð" " ÑM T H" ‡" " !

facts that and from Theorem 20 yieldsHT œ T H œ ! HU œ M T œ UH‡ ‡ ‡

Ð Ñ Ð ÑÐM T Ñ œ Ð" " ÑH T ÐM T ÑH6 ," ‡" ‡ ‡" !

establishing for . Suppose now that holds for and consider . By takingÐ&Ñ 8 œ ! Ð&Ñ 8 Ð   !Ñ 8 "

the convolution of with and then using the bilinearity and associativity of convolution, theÐ'Ñ "8

induction hypothesis, the fact that , and , it follows that for ,H U œ H œ UH Ð#Ñ R   "8" 8 8"

œ Ð H T Ð Ñ ÐM T ÑHÐ Ñ ÐM T Ñ" ‡ " " Ñ " ‡8" 8R R8" 8 R ‡ R ‡

œ H T Ð"Ñ H T ÐHÑ ÐM T Ñ HR8" R8"

8 84Š ‹ Š Š ‹ ‹"8

4œ"

4 4 R8" 8 ‡

œ H Ð"Ñ H ÐH MÑ T ÐHÑ ÐM T ÑR8" R8"

8 84Š ‹ Š ‹"8

4œ"

4 4 R8 8" ‡

œ Ð"Ñ H T ÐHÑ ÐM T ÑR8" R8"

84 84""ŠŠ ‹ Š ‹‹8

4œ!

4 4" R8 8" ‡

œ Ð"Ñ H T ÐHÑ ÐM T ÑR8

84"Š ‹8

4œ!

4 4" R8 8" ‡ ,

which establishes . Ð&Ñ è

Polynomial Expansion of -Period ValuesR

Recall that . Thus, it fol-• • • • 7 "$ $$ $ $ $œ œ Ð Ñ œ œ Ð Ñ<" ‡ " ‡ " Ñ ‡ Ð" ‡ " ‡ " ‡7 7 " " 7" 7"

lows from the polynomial expansion and of the sum of the expected sojourn times of in-Ð$Ñ Ð$Ñw

dividuals in the first periods with binomial immigration of order when using thatR 8 œ 7 $_

Page 77: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 71 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Ð Ñ Z œ @ T ? œ , ì Z T ?R7

747 R7 R 7 R 7 R 7

7

4œ.

47$ $ $ $ $ $$

"Š ‹for and where . Observe that this formula gives an expansion of R   " 7   ! ? ´ T @ Z7 7 7 R7

$ $ $ $

as the sum of a polynomial in of degree with coefficient matrix and a term that isR 7 . Z 7$

9 Ð"Ñ. .

Although the error term is small, it is convenient to eliminate it entirely by including an ap-

propriate terminal reward. This important idea will play a fundamental role in the sequel. To

obtain the desired exact formula, let where is the terminal reward re-Z Ð?Ñ œ Z T ? ?R7 R7 R$ $ $

ceived at the end of period . It follows from 7 that is a polynomial in of degreeR Ð Ñ Z Ð? Ñ RR7 7$ $

7 . with

Ð Ñ Z Ð? Ñ œ @ œ , ì ZR7

748 R7 7 R 7

7

4œ.

47$ $ $$

"Š ‹for and . Observe that since , it follows from 7 , 8 and thatR   " 7   ! . Ÿ " Ð Ñ Ð Ñ T H œ !‡

$ $

lim C for . Notice from 8 that maximizes overRÄ_R7 R7 7 R7 7ÒZ Z Ð? ÑÓ œ ! Ð ß .Ñ 7   ! Ð Ñ Z Ð? Ñ$ $ $ $ $$

? $ for all large enough if and only if maximizes lexicographically. This is a strong formR Z 7$

of overtaking optimality for stationary policies except that the terminal value depends on .?7$ $

The case in which is of special interest. In that event, 8 expresses the fact that with7 œ ! Ð Ñ

the proper choice of the terminal reward, viz., , the -period value? R!$

Z Ð? Ñ œ R@ @R ! " !$ $$ $

is affine in . Moreover, the slopeR

@ œ T < œ R Z" ‡ " R

RÄ_$ $$ $ lim

is the or and the interceptstationary reward reward rate

@ œ H < œ ÐZ R@ Ñ Ð ß .Ñ! R "

RÄ_$ $ $ $ $lim C

is the C limit of the difference between the -period values starting from each state and startÐ ß .Ñ R -

ing with the stationary distribution therein.

For every policy , decision , , , and , let , 1 # $ 1 # # 1œ Ð Ñ ? − d 7ß R   ! P   " ´ Ð ß á ß Ñ ´3 " P PW P

Ð ß ß á Ñ ´ Ð ß Ñ ß − ´ Ð Ñ ´ Ð ß Ñ ´ Ð ß Ñ# # 1 $ 1 $ # $ ? # # #$ # $ # $ # $P" P#P P _ P _ P _ P P _ and . If , let , and .

Subscripts on decisions will be reserved for enumerating them.

The next lemma expresses as the sum of the expected rewards earned in the first Z Ð?Ñ PR71

periods and those earned in the subsequent periods when . The result fol-Q ´ R P   ! 7   !

lows readily from the facts that , and .• • •7 " R "7" 7"1 1 1œ " ‡ " œ œ Ð! ÑŠ ‹R 7

7T < T < â! "

1 1# #" #

Lemma 29. If , , , and , then1 # ?œ Ð Ñ − ? − d Pß Qß 7   ! R œ P Q   "3_ W

Z Ð?Ñ œ T < T Z Ð?ÑR37

7R7 3" P Q7

P

3œ"1 1 1 1#"Š ‹ 3 P

.

Page 78: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 72 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Proof. Use the fact that Z Ð?Ñ œ T < T < T ?R7 3" 3" RP R

3œ"

R37 R377 7

3œP"1 1 1 1# #! !Š ‹ Š ‹3 3

. è

Comparison Lemma for -Period ValuesR

Unfortunately, there is no exact comparison lemma for the difference of the -period valuesR

of two policies in terms of a comparison function. However, it is possible to give an exact com-

parison lemma for “eventually stationary” policies with a suitable terminal reward. Subsequent-

ly, we show that the error made in altering the problem in this way is comparatively small.

The following lemma gives an exact representation for the difference between the -periodR

value of the eventually stationary policy and the -period value of theZ Ð? Ñ R Z Ð? ÑR7 7 P R7 71 $ $ $ $P 1 $

corresponding stationary policy , both with the terminal reward . The exactness of the$_ 7?$

representation is a consequence of this choice of the terminal reward. The representation is in

terms of the time-domain comparison function

Ð9Ñ K ´ 1 œ , ì K57

7457 5 7

7

4œ.

47#$ #$#$

"Š ‹where and , .5ß 7   ! −# $ ?

Lemma 30. If , , , , and , then1 # ? $ ?œ Ð Ñ − − . Ÿ " ! Ÿ P Ÿ R 7   !3_

$

Z Ð? Ñ Z Ð? Ñ œ T KR7 7 R7 7 3"P

3œ"

R3ß71 $ $ $ $ 1 # $P

3" .

Proof. First establish the result for . Put . By Lemma 29 with , P œ " œ P œ " Z Ð? Ñ# #"R7 7

#$ $

œ < T Z Ð? Ñ Ð)Ñ Ð#ÑŠ ‹R7"

7# # $ $

R"ß7 7 . Using this fact, , the identity for binomial coefficients,

and , one finds that@ œ !."$

œ < T @ @Z Ð? Ñ Z Ð? ÑR7" R7" R7

7 74 74R7 7 R7 7

7 7

4œ. 4œ.

4 4#$ $ $ $ # # $ $Š ‹ Š ‹ Š ‹" "

œ < U @ @R7" R7" R7"

7 74 74"Š ‹ Š ‹ Š ‹" "# # $ $

7 7"

4œ. 4œ.

4 4

œ KR"ß7#$ ,

which establishes the result for . This fact and Lemma 29 imply that for ,P œ " P   "

œ ÒZ Ð? Ñ Z Ð? ÑÓZ Ð? Ñ Z Ð? ÑR7 7 R7 7 R7 7 R7 7P

3œ"1 $ $ $ $ $ $1 $ 1 $P 3 3""

œ T ÒZ Ð? Ñ Z Ð? ÑÓ œ T K" "P P

3œ" 3œ"

3" 7 7 3"R3"ß7 R3"ß7 R3ß71 $ $ 1# $ $ # $3 3

. è

The above result permits us to deduce the following important Comparison Lemma which

gives an approximate representation for the difference between the -period values of an arbi-R

trary policy and a stationary policy—both with a binomial immigration stream of order .7

Lemma 31. Comparison Lemma. If , , , and , then. Ÿ " œ Ð Ñ − − R   Q   ! 7   !1 # ? $ ?3_

Z Z œ T K SÐR ÑR7 R7RQ

3œ"

3" ."R3ß71 1$ # $ uniformly in ."

31

Page 79: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 73 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Proof. It follows from Lemma 29 and the System-Degree Theorem 9 that

Z Z Ð? Ñ œ T ÒZ Z Ð? ÑÓ œ SÐR ÑR7 R7 7 RQ Q7 Q7 7 ."1 1 11 $ $ $ $RQ RQ

uniformly in 1

for , and the same is so when replaces . The result then follows by observing that theR   Q $ 1_

error from replacing by and by in Lemma 30 is . Z Ð? Ñ Z Z Ð? Ñ ZR7 7 R7 R7 7 R71 $ $ $ $ $1RQ SÐR Ñ." è

Cesàro Overtaking Optimality with Binomial Immigration Streams

Now a policy is Cesàro overtaking optimal for the binomial immigration stream of order if- 8

(10) lim inf C for all RÄ_

R8 R8ÐZ Z Ñ   ! Ð ß .Ñ- 1 1

or, equivalently, since and lim ,!R3œ!

38 R R Rß8"" 8 8" RÄ_Z œ Ð" ‡ Ð" ‡ ÑÑ œ Ð" ‡ Ñ œ Z œ "1 11 1• •

R

R"

(10) lim inf for all .w . Rß8.

RÄ_

Rß8.R ÐZ Z Ñ   !- 1 1

If , Cesàro overtaking optimality for the binomial immigration stream of order is the same. œ ! 8

as overtaking optimality therefor. If , the C limits are needed to damp out the effect of. œ " Ð ß .Ñ

the (bounded) error term in the Comparison Lemma 31.

Call a policy if it is Cesàro overtaking optimal for binomialstrong Cesàro overtaking optimal

immigration streams of all orders. Now apply Lemma 31 to establish the next result.

Theorem 32. Existence and Characterization of Stationary Cesàro-Overtaking-Optimal Poli-

cies. In a bounded system with system degree , there is a stationary strong Cesàro-overtaking-op-.

timal policy. Also, is Cesàro overtaking optimal for the binomial immigration stream of order$_

8   . 8 if and only if is -optimal. Further, is strong Cesàro overtaking optimal if and only$ $_ _

if is -optimal.$_ _

Proof. Since the system is bounded, it follows from Theorem 22 that there is a stationary

policy that has strong maximum present value and for which for all . Thus, if$ # ?_ K £ ! −#$

Q ! K Ÿ ! − 5   Q is large enough, it follows from (9) that for all and . Also since5ß8.#$ # ?

the system is bounded, for all by the System-Degree Theorem 9. Hence, by the Com-. Ÿ "1 1

parison Lemma 31 (with on noting that for ), satisfies7 œ 8 . R 3   Q 3 Ÿ R Q œ- $_

(10) , i.e., is Cesàro overtaking optimal for the binomial immigration stream of order . Sincew _$ 8

this is so for all , is strong Cesàro overtaking optimal. The two characterizations follow from8 $_

(10) and (7). w è

Special Cases. Several examples of Cesàro overtaking optimality for binomial immigration

streams of order appear below. Each of these examples provides an alternate interpretation of8

Cesàro overtaking optimality as a natural concept of optimality for the system without exogen-

ous immigration after the first period.

Reward-Rate Optimality. Since , is the expected re" œ ÐM M ! ! â Ñ Z œ T <"Rß" R"

1 1 #R -

ward that earns in period . Thus, it is natural to call a policy if1 #œ Ð Ñ R3 reward-rate optimal

it is Cesàro overtaking optimal for the binomial immigration stream of order ."

Page 80: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 74 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

As an example, consider the two-state substochastic deterministic system in Figure 15. There

are two decisions with rewards the numbers on each arc. Observe that both and # $ # $ß _ _ have

common unit reward rate starting from either state, and so both are reward-rate optimal. By con-

trast, 2 for , so only is overtaking optimal.Z Z œ R   "R R _" "$ # $

1 2

0

2

1

Figure 15

Cesàro Overtaking Optimality. Since , . Thus, Ces ro overtak-" œ ÐM ! ! â Ñ Z œ Z!R! R à1 1

ing optimality for the binomial stream of order zero is ordinary Ces ro overtaking optimality. In aà

transient system, a policy is Cesàro overtaking optimal if and only if it has maximum value. More-

over, Ces ro overtaking optimality is well deà fined for bounded systems. By contrast, for such sys-

tems, values are often undefined or equal (as in Figure 15), rendering the maximum-value„ _

criterion insufficiently selective. Thus Cesàro overtaking optimality is the proper way of extending

the concept of maximum value from transient to bounded systems.

Now consider the two-state substochastic deterministic system that Figure 16 illustrates.

There are two decisions and with rewards the numbers on each arc. Then both # $ # $ and have

common value zero starting from state one, and so are à .Ces ro overtaking optimal

1 20 1� � �1

Figure 16

However, ial if the immigration stream is binom of order one, each individual who arrives in

state one in period R and uses earns a unit reward in that period. Moreover, as$_ temporary

Figure 17 illustrates, gregate of the temporary rewards that all individuals generate whenthe ag

they use provides a unit reward for the system, i.e., for all . Thus$_ R""permanent Z œ " R   "$

only maximizes the total reward$_ for the system with binomial immigration of order one.

Float Optimality. Since , is the -period , i.e., the sum " œ ÐM M M â Ñ Z R Z"R" 3R

3œ"1 1float !of the cash positions in each period . For this reason, call a policy if it is" Ÿ 3 Ÿ R float optimal

Cesàro overtaking optimal for the binomial immigration stream of order one. As an example,

consider the -station queueing network discussed at the beginning of this section in which theW

expected exogenous arrivals in each period into each station depends on the station, but not on

the period. Then the stream of expected arrivals is a nonnegative diagonal matrix times the bi-

nomial stream of order one. Thus if the goal is to find a policy that is Cesàro overtaking optimal

Page 81: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 75 §1 Discrete Time ParameterCopyright 2008 by Arthur F. Veinott, Jr.©

Immigration Periods

Cohort 1 2 3 4

Total Rewards 1 0 0 0

R

AR""

1 1 1 1

2 1 1 1

3 1 1 1

â

ã ã â

â

Figure 17. Rewards each Immigrant Earns in each Period

with diag when Using A œ Ð" !Ñ † ""_$

for the queueing network, it suffices to find a policy that is float optimal with no exogenous

immigration after the first period.

Value Interpretation of Immigration Stream

In each of the above examples, an immigration stream A, or more precisely the cumulative immigration stream W = (W^N) ≡ A * 1^1, has an alternate interpretation as values of rewards an arbitrary policy earns with no immigration after period one. To see this, let 𝒱_π^A = (V_π^{NA}) be the sequence of N-period values that π earns when the immigration stream is A. Then 𝒱_π^A = A * 𝒱_π, so that V_π^{NA} = Σ_{i=1}^N W^{N−i} γ_π^i, where γ_π^i = P_π^{i−1} r_π is the expected reward that π earns in period i. It follows that W^{N−i} can be thought of alternately as the unit value of rewards in period i, i.e., with N − i + 1 periods remaining.

As examples, observe that when A is the binomial stream of order −1, then W = 1^0, so W^N = 0 for N > 0, i.e., the weight is on rewards only in the last period. In this event, Cesàro overtaking optimality is equivalent to maximizing the reward rate. By contrast, if A is the binomial stream of order 0, then W = 1^1, so W^N = I for all N ≥ 0, i.e., there is equal weight on rewards in each period. In this event, Cesàro overtaking optimality is equivalent to ordinary Cesàro overtaking optimality. If instead A = 1^n and n > 0, then W = 1^{n+1}, i.e., the weight on rewards with i periods remaining is a polynomial in i of degree n. Thus as n increases, the absolute and relative weights on rewards in early periods increase.

Evidently, this (value) interpretation of cumulative immigration streams encompasses most standard optimality concepts (in which there is no immigration after the first period). Moreover, in this setting, the cohort and Markov policies coincide.

Combining Physical and Value Immigration Streams

So far there are two distinct interpretations of immigration streams, viz., physical and value. Can they be combined as would be desirable, for example, in the study of queueing networks where both are needed? The answer is ‘yes’. Indeed, Cesàro overtaking optimality for a system in which there is a physical immigration stream A and a value one A′ is equivalent to Cesàro


overtaking optimality for the system with the single combined immigration stream A′ * A. To see this, observe that the stream of N-period values with the physical stream A and policy π is simply A * 𝒱_π. Now if it is desirable to impose the value stream A′ on this system, the new value-weighted stream of N-period values is A′ * (A * 𝒱_π) = (A′ * A) * 𝒱_π, which justifies our claim.

Thus, for example, if the vector of expected arrivals into the stations of the queueing network is the diagonal matrix A⁰ in each period and the goal is to maximize the float of the network, then A = A⁰·1^1 and A′ = 1^1, so the combined immigration stream is A′ * A = A⁰·1^2. On the other hand, if the goal is merely to maximize the network’s reward rate, then A′ = 1^{−1} and the combined stream is A′ * A = A⁰·1^0.

Cesàro Overtaking Optimality with More General Immigration Streams

It is now time to generalize the above results concerning the existence of stationary Cesàro-

overtaking-optimal policies from binomial to a more general class of immigration streams which

includes positive linear combinations of binomial streams. The basic idea is to reduce Cesàro ov-

ertaking optimality for the more general class of immigration streams to that for corresponding

binomial immigration streams. To see how this can be done, it is useful first to establish a pre-

liminary lemma.

Lemma 33. Nonnegativity of Cesàro Limits of Convolutions. If B = (B^0 B^1 ⋯) and C = (C^0 C^1 ⋯) are sequences of real matrices or scalars for which the convolution B * C is defined, B is nonnegative and has degree zero, and lim inf_{N→∞} C^N ≥ 0 (C, d) with 0 ≤ d ≤ 1, then necessarily lim inf_{N→∞} (B * C)^N ≥ 0 (C, d).

Proof. Since lim inf_{N→∞} C^N ≥ 0 (C, d), lim inf_{N→∞} N^{−d} Ĉ^N ≥ 0 where Ĉ ≡ C * 1^1. Let ε ≫ 0 be a given matrix. Choose M so that Ĉ^i ≥ −ε i^d for i ≥ M. Then since B ≥ 0 has degree zero, there is a matrix K not depending on M or ε such that

(B * Ĉ)^N = Σ_{i=0}^{M−1} B^{N−i} Ĉ^i + Σ_{i=M}^{N} B^{N−i} Ĉ^i ≥ O(N^{d−1}) − ε K N^d.

Therefore, lim inf_{N→∞} N^{−d}(B * Ĉ)^N ≥ −εK, whence lim inf_{N→∞} (B * C)^N ≥ −εK (C, d). Thus because ε is arbitrary, lim inf_{N→∞} (B * C)^N ≥ 0 (C, d). ∎

In many practical settings, interest centers on immigration streams that are more general than binomial. One such class is the set ℒ_n of immigration streams A for which A * 1^{−m} is nonnegative and has degree zero for some m ≤ n + 1. For example, the binomial streams of order m ≤ n + 1 are in ℒ_n because 1^m * 1^{−m} = 1^0 is nonnegative and has degree zero.

Denote by ℒ̄_n the convex cone that consists of the nonnegative linear combinations of elements of ℒ_n. It is of interest that ℒ̄_n ⊋ ℒ_n. To see why, let A = 1^{n+1} + 2·1^n, so A ∈ ℒ̄_n. However, A * 1^{−(n+1)} = 1^0 + 2·1^{−1} = (3I −2I 0 0 ⋯) is not nonnegative and A * 1^{−n} = 1^1 + 2·1^0 =


(3I I I ⋯) does not have degree zero. Thus A * 1^{−m} is not nonnegative for m ≥ n + 1 and does not have degree zero for m < n + 1. Hence A ∉ ℒ_n.

Theorem 34. Reduction of Immigration Streams to Binomial Ones. In a bounded system for which A ∈ ℒ̄_n with n ≥ −1, there is a stationary policy that is Cesàro overtaking optimal for A, viz., any stationary policy that is Cesàro overtaking optimal for the binomial immigration stream of order n.

Proof. Suppose δ ∈ Δ^n and the system degree is d. By hypothesis, A = Σ_{m=k}^{n} A^m for some k ≤ n and immigration streams A^k, …, A^n such that A^m * 1^{−m} is nonnegative and has degree zero for k ≤ m ≤ n. For each policy π,

𝒱_δ^A − 𝒱_π^A = A * (𝒱_δ − 𝒱_π) = Σ_{m=k}^{n} [(A^m * 1^{−m}) * (𝒱_δ^m − 𝒱_π^m)].

Let L (resp., L^m) denote the (C, d) limit inferior of the sequence on the left-hand side (resp., sequence in brackets on the right-hand side) of the above equation. Since δ ∈ Δ^n ⊆ Δ^m for each m ≤ n, it follows that L^m ≥ 0 by Lemma 33. Hence L ≥ Σ_{m=k}^{n} L^m ≥ 0. ∎

One may ask whether the hypothesis A ∈ ℒ̄_n in Theorem 34 can be dropped. The answer is that it cannot, as the following example illustrates.

Example. Nonexistence of a Stationary Cesàro-Overtaking-Optimal Policy When A ∉ ℒ̄_n for all n ≥ −1. Consider the two-state substochastic system that Figure 18 illustrates. The rewards and actions available in each state appear on the arcs. Let A^0 = diag(1 1) and A^i = 0 for i ≥ 1. The nonstationary policy that takes action a in the first period and action b in the second is the unique (Cesàro-) overtaking-optimal policy because it earns the system 1 starting from each state. By contrast, all stationary policies earn the system −1 or 0 in some state and so are not Cesàro overtaking optimal for A. Thus, by Theorem 34, it must be so that A = (A^i) ∉ ℒ̄_n for all n ≥ −1.

Figure 18

It can be shown that the condition that A^m * 1^{−m} has degree zero is not essential to Theorem 34. But in that event it is necessary to replace Cesàro limits of order one by higher-order Cesàro limits. For that reason we omit a discussion of the more general case.


Future-Value Optimality

The above result has a useful application to problems in which the interest rate 100ρ% is negative and ρ > −1. Recall from §1.7 that the last assumption means that one cannot lose more than one invests. Also, negative interest rates arise when the decision maker is interested in inflation-adjusted rewards and in which the after-tax rate of interest is less than the rate of inflation, a not uncommon situation. In this circumstance, the inflation-adjusted present value of a policy is often a divergent series, and so is not well defined. The way out of this difficulty is instead to discount income to the future.

The future value in period N of a policy π is

V̂_π^N ≡ Σ_{i=1}^N β^{N−i} P_π^{i−1} r_π = (β̄ * γ_π)^N = ((β̄ * 1^{−1}) * 𝒱_π)^N

where β ≡ 1 + ρ, β̄ ≡ (β⁰I β¹I β²I ⋯) and γ_π = (γ_π^i) is the stream of expected one-period rewards of π. Observe that W = β̄ can be thought of alternately as a cumulative immigration stream. The hypotheses on ρ imply that 0 < β < 1. Call a policy π future-value optimal if

lim inf_{N→∞} (V̂_π^N − V̂_{π′}^N) ≥ 0 (C, d) for all π′.

Theorem 35. Reduction of Future-Value to Reward-Rate Optimality. In a bounded system, there is a stationary policy that is future-value optimal for all −1 < ρ < 0, viz., any stationary reward-rate optimal policy.

Proof. Set A ≡ β̄ * 1^{−1}. Then A * 1^1 = β̄ is nonnegative and has degree zero, so A ∈ ℒ_{−1}. Now apply Theorem 34 with n = −1. ∎

10 SUMMARY

Figures 19 and 20 below provide assertion-implication directed graphs that summarize many of the results of this chapter, the homework problems and a few others. The graphs describe relations among the various optimality concepts for transient and bounded systems.

The nodes of the graphs are assertions and the arcs are implications. An arc (arrow) from one

assertion to another signifies that the first assertion implies the second. Similarly, a double arrow

between two assertions signifies that the two assertions are equivalent. A chain from one asser-

tion to another assures that the first assertion implies the second. Each strong component of a

graph contains a maximal collection of equivalent assertions. For example, the assertions above

the dashed line are equivalent.

In Figure 20, some arcs have text labels. When this is the case, an arc’s implication is valid when the statement in the arc’s text label is valid. For example, “δ^∞ overtaking optimal” implies “δ^∞ Cesàro overtaking optimal”. The converse is generally false as the example of Figure 13 illustrates. However, it can be shown that “δ^∞ Cesàro overtaking optimal” implies “δ^∞ overtaking optimal” provided that a stationary overtaking-optimal policy exists.


Since there exists a stationary strong present-value optimal policy, there is a stationary poli-

cy that is optimal for each of the concepts considered here except, as discussed above, overtaking

optimality. However, there is a stationary Cesàro-overtaking-optimal policy, and each such policy is overtaking optimal provided only that a stationary overtaking-optimal policy exists.

Figure 19. Optimality of δ^∞ for a Bounded System with Immigration (assertion-implication graph relating strong present-value optimality, S- and n-optimality, Cesàro overtaking optimality and limiting present-value optimality for binomial streams of each order, and the corresponding optimality conditions)


Figure 20. Optimality of δ^∞ for Some Special Bounded Systems (assertion-implication graph relating Cesàro overtaking, overtaking, future-value, reward-rate, float, value, maximum-value, maximum R-period-value, instantaneous-rate-of-return, symmetric-multiplicative-utility-growth and moment optimality, with arcs labeled by the hypotheses, e.g. transient, irreducible or circuitless systems, under which the implications hold)


2 Team Decisions, Certainty Equivalents and Stochastic Programming

1 FORMULATION AND EXAMPLES [Be55], [Da55], [Ra55], [Ve65], [MR72]

Team decision theory is concerned with a team, i.e., a collection of individuals that each

• control different decision variables,

• base their decisions on possibly different information, and

• share a common objective.

In the sequel, we shall formulate such problems as stochastic programs.

A team is composed of n members labeled 1, …, n. The i-th team member observes a random vector Z_i assuming values in a set 𝒵_i and then uses a real-valued decision X_i : 𝒵_i → ℝ to select an action X_i(Z_i). This formulation encompasses team members that have vector-valued decisions since each such member can be considered to be a group of different team members each with common information and real-valued decisions. Let Z = (Z_1, …, Z_{n+1}) be the vector whose elements are the n random vectors Z_1, …, Z_n observed by the team members and a random


vector Z_{n+1} that contains a (possibly empty) subset of those random vectors as well as relevant unobserved information. The distribution of Z is known to all members of the team. Denote by 𝒵 the range of Z, which we take to be finite. Call X ≡ (X_1, …, X_n) a team decision. Let 𝒳 denote the set of all team decisions. Let r(·, ·) : ℝ^n × 𝒵 → ℝ be the given reward function and g(·, ·) : ℝ^n × 𝒵 → ℝ^m be the given constraint function. The problem is to choose X to maximize

(1) R(X) ≡ E r(X(Z), Z)

subject to

(2) g(X(Z), Z) ≥ 0

and

(3) X ∈ 𝒳.

One of the principal goals of team decision theory is to study the effect of different informa-

tion structures on the quality of optimal decisions as measured by the maximum expected re-

ward. Of course, the more information that each team member receives, the greater the maxi-

mum expected reward, but generally so also is the cost of collecting and distributing the infor-

mation. Thus one goal of team decision theory is to quantify the value of additional information

to facilitate comparison with the costs of providing and using such information. This enables the

team to choose the proper amount of information to provide each member. From this viewpoint,

team decision theory can be considered to be a branch of the field of management information

systems.

Example 1. Airline Reservations with Uncertain Demand. An airline wishes to give its agents in n cities rules for deciding how many seats to book on a flight of capacity K > 0 so as to maximize the expected net revenue from the flight, where Z_{n+1} ≡ (D_1, …, D_n) ≥ 0 is the random vector of demands experienced by agents 1, …, n. Let X_i be the number of tickets sold by agent i. Then

r(X, Z) = r Σ_{i=1}^n X_i − p (Σ_{i=1}^n X_i − K)^+

where r is the ticket price and p is the penalty for each oversold ticket. Also (2) becomes

0 ≤ X_i ≤ D_i, 1 ≤ i ≤ n.
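To make the objective (1) concrete, the following minimal sketch evaluates the expected net revenue of a given set of booking rules by enumerating the finite range of Z, assuming a decentralized information structure in which each agent sees only her own demand. The two-agent instance, the demand distribution and all prices are hypothetical.

```python
import itertools

# Hypothetical two-agent instance: demands D1, D2 are independent and take the
# values below with the given probabilities; r = ticket price, p = overbooking
# penalty, K = flight capacity.  All numbers are made up for illustration only.
demand_values = [0, 1, 2]
demand_probs = [0.3, 0.5, 0.2]
r, p, K = 10.0, 15.0, 3

def reward(x, d):
    """r(X, Z) = r*sum(X) - p*(sum(X) - K)^+ for one realization."""
    sold = sum(x)
    return r * sold - p * max(sold - K, 0)

def expected_reward(rules):
    """E r(X(Z), Z) under decentralization: agent i sees only D_i, so the
    decision is X_i(D_i) = rules[i][D_i]; enumerate the finite range of Z."""
    total = 0.0
    for d, q in zip(itertools.product(demand_values, repeat=2),
                    itertools.product(demand_probs, repeat=2)):
        x = [min(rules[i][d[i]], d[i]) for i in range(2)]  # enforce 0 <= X_i <= D_i
        total += reward(x, d) * q[0] * q[1]
    return total

# A naive rule: each agent books every ticket demanded.
book_all = [{0: 0, 1: 1, 2: 2}, {0: 0, 1: 1, 2: 2}]
print(expected_reward(book_all))
```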

Information Structures. The information on ticket demands can be shared among the agents

in many ways. We mention four examples.


(a) Complete Communication. Let Z_i = Z_{n+1} for all i, i.e., every agent gets complete information about tickets demanded at other agents.

(b) Decentralization. Let Z_i = D_i for all i, i.e., each agent receives information only on his own ticket demand.

(c) Partial Communication. Let Z_i = (D_i, Σ_{j=1}^n D_j), i.e., each agent receives information on his own demand and the total demand at all agents.

(d) Management by Exception. Let Z_i = (D_i, T) where

T = Σ_{j=1}^n D_j if Σ_{j=1}^n D_j > M, and T = 0 otherwise,

i.e., each agent receives information on his demand and, if the total demand at all agents exceeds a threshold M, the total demand at all agents. In this event, M can be chosen to be at least K. For if an agent is informed that the total ticket demand at all agents does not exceed the capacity of the plane, there is nothing to be gained by telling the agent the precise demands at the other agents.

Example 2. Inventory Control with Uncertain Demand. An inventory manager seeks to minimize the expected cost of storage and shortage of a single product over n weeks. Let X_i and D_i be respectively the cumulative orders and demands in weeks 1, …, i where i = 1, …, n. Let Z_{n+1} = (D_1, …, D_n). Then the storage and shortage cost is

Σ_{i=1}^n [h(X_i − D_i)^+ + p(D_i − X_i)^+]

where h and p are respectively the unit costs of storage and shortage. The constraints are

X_1 ≤ X_2 ≤ ⋯ ≤ X_n.

Information Structures. Among the information structures that may be appropriate in this

problem are the following.

(a) Complete Information. Let Z_i = (D_1, …, D_{i−1}) = (Z_{i−1}, D_{i−1}), i.e., decisions are made sequentially with the ordering decision in week i being based on all previously observed demands.

(b) Delayed Reporting. Let Z_i = (D_1, …, D_{i−1−l}), i.e., ordering decisions in week i are based on all previously reported demands where there is an l-week delay in reporting.

(c) Forecasting. Let Z_i = (D_1, …, D_{i−1+l}), i.e., ordering decisions are based on all previously observed demands as well as perfect forecasts of the demands in the next l weeks.


(d) Reporting Errors. Let Z_i = (D_1, …, D_{i−1}) + (ε_1, …, ε_{i−1}), i.e., ordering decisions in each period are based on demands observed plus a random error in reporting.

(e) Current Cumulative Demand. Let Z_i = D_{i−1}, i.e., ordering decisions in week i depend only on cumulative observed demand to date. Although this assumption is a natural one and reduces the amount of information needed to make decisions, rules of this type are not generally optimal, even when demands are independent in each week. For in the latter case, the state of the system at the beginning of week i is the difference between cumulative orders and demands through week i − 1 and so depends on demands in all prior weeks through the cumulative orders.

Example 3. Transportation Problem with Uncertain Demand. Consider the stochastic version of the transportation problem in which there are s_i aircraft available at city i = 1, …, n one evening, there is a random demand D_j at city j = 1, …, m for aircraft the following day, and the goal is to minimize the expected costs of deadheading, i.e., sending empty aircraft from one city to another, and lost revenue. There is a cost c_ij of sending an empty aircraft from city i to city j during the night. The aircraft scheduler must decide the number x_ij of empty aircraft to send from city i to city j before observing the demands at the various cities. Denote by y_j(D_j) the difference between the number of aircraft needed at city j and the number sent there. The expected revenue lost from each aircraft needed at city j that is not available there is r_j. Then the cost of deadheading and lost revenues is

Σ_{i=1}^n Σ_{j=1}^m c_ij x_ij + Σ_{j=1}^m r_j y_j(D_j)^+.

The constraints are

Σ_{j=1}^m x_ij ≤ s_i, i = 1, …, n,
Σ_{i=1}^n x_ij + y_j(D_j) = D_j, j = 1, …, m,
x_ij ≥ 0, all i, j.

Information Structure. In this case, the decision functions x_ij and y_j are based respectively on no information, i.e., Z_ij = ∅, and the number of aircraft needed at city j, i.e., Z_j = D_j.

Example 4. Capacity Planning with Uncertain Demand. A firm seeks a minimum-expected-cost policy for leasing and renting autos to meet independently distributed employee demands D_1, …, D_n therefor in weeks 1, …, n. There is a unit cost l_ij of leasing an auto at the


beginning of week i for use during weeks i, …, j. When the firm does not have a leased auto available for an employee who needs one in a week, the firm rents one at a cost r > 0. Decisions on leases that begin in week i are made after observing demands in prior weeks, but before observing the demand in week i. But rental decisions for week i are made after also observing the demand in that week. Let x_ij be the number of autos leased at the beginning of week i for use during weeks i, …, j, 1 ≤ i ≤ j ≤ n. Let y_i(D_i) be the difference between the numbers of autos demanded and leased in week i, 1 ≤ i ≤ n. The total cost of leasing and renting is

Σ_{i≤j} l_ij x_ij + Σ_i r y_i(D_i)^+.

The constraints are

Σ_{i≤j≤k} x_ik + y_j(D_j) = D_j, j = 1, …, n,

and

x_ij ≥ 0, all i, j.

Information Structure. In this case, the decision functions x_ij and y_i are based respectively on no information, i.e., Z_ij = ∅, and the auto demand in week i, i.e., Z_i = D_i. In order to see why it

is not necessary to allow leasing decisions to depend on any of the demands, notice that the

state of the system at the beginning of a week is the vector of numbers of autos leased in prior

weeks for use in the week and thereafter. Thus since the demands are independent, the state in

a week is a function only of leasing decisions in prior weeks and not the demands in prior weeks

themselves. Hence the transition law for the system is deterministic and only the rental decision

in a week depends on the demand in the week. Incidentally, this conclusion would not be valid if

the demands were dependent, e.g., if they formed a Markov chain. In that event, leasing deci-

sions in a week should depend on the demands in all prior weeks—even when the demands are

Markovian.

2 REDUCTION OF STOCHASTIC TO ORDINARY MATHEMATICAL PROGRAMS

Since 𝒵 is finite, the stochastic program (1)-(3) reduces to that of choosing X(z) ≡ (X_1(z_1), …, X_n(z_n)), where z = (z_1, …, z_{n+1}) ∈ 𝒵 and p(z) ≡ P(Z = z) > 0, that maximizes

(1′) Σ_{z∈𝒵} r(X(z), z) p(z)

subject to

(2′) g(X(z), z) ≥ 0, z ∈ 𝒵,


which is an ordinary mathematical program. The variables of this mathematical program are

the values of the decision for each team member corresponding to each value of the random vec-

tor the member observes. The constraints (2′) replicate the original constraints (2) once for each value of the random vector Z. In applications the number of redundant replications in (2′) of an inequality in (2) can be significantly reduced by replicating the inequality only for values of the subvector of Z that appears in the inequality rather than also so doing for the complementary subvector of Z.

Linear and Quadratic Programs

It is of interest to examine when the above mathematical program is a linear or concave quadratic program. If r(·, z) and g(·, z) are affine functions for each z ∈ 𝒵, then (1′)-(2′) is a linear program. If instead r(·, z) is quadratic and the associated form is negative semidefinite for all z ∈ 𝒵 and negative definite for some z ∈ 𝒵, and if g(·, z) is affine for each z ∈ 𝒵, then (1′)-(2′) is a strictly-concave quadratic program. Thus, if it is feasible, it has a unique solution. In particular if g ≡ ∅, i.e., there are no inequality constraints (2′), then the optimal team decision X* is the unique solution to a system of linear equations obtained by setting the partial derivative of (1′) with respect to each decision variable X_i(z_i) equal to zero.

In order to make the above ideas more concrete, we formulate Examples 3 and 4 above as

linear programs. Consider first the Transportation Problem with Uncertain Demand in which there are N values d_{j1}, …, d_{jN} of the random demand D_j and p_{jk} ≡ P(D_j = d_{jk}) for k = 1, …, N and each j. Set y_{jk} = y_j(d_{jk}) and write y_{jk} = y_{jk}^+ − y_{jk}^− as a difference of two nonnegative variables. Then the problem becomes the ordinary linear program of choosing x_ij and y_{jk}^± that minimize

Σ_{i=1}^n Σ_{j=1}^m c_ij x_ij + Σ_{j=1}^m Σ_{k=1}^N r_j p_{jk} y_{jk}^+

subject to

Σ_{j=1}^m x_ij ≤ s_i, i = 1, …, n,
Σ_{i=1}^n x_ij + y_{jk}^+ − y_{jk}^− = d_{jk}, j = 1, …, m and k = 1, …, N,
x_ij, y_{jk}^+, y_{jk}^− ≥ 0, all i, j, k.
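As an illustration of this split-variable formulation, the sketch below builds the LP for a tiny hypothetical instance and solves it with SciPy's linprog; the data and the choice of solver are assumptions for illustration, not part of the text.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: n=2 supply cities, m=2 demand cities, N=2 demand scenarios.
n, m, N = 2, 2, 2
s = np.array([3.0, 2.0])                  # aircraft available at city i
c = np.array([[1.0, 2.0], [2.0, 1.0]])    # deadheading costs c_ij
r = np.array([5.0, 4.0])                  # lost revenue per unmet demand at j
d = np.array([[1.0, 3.0], [2.0, 4.0]])    # demand values d_jk
p = np.array([[0.5, 0.5], [0.6, 0.4]])    # probabilities p_jk

# Variable order: x_ij (n*m), then y_plus_jk (m*N), then y_minus_jk (m*N).
nx, ny = n * m, m * N
cost = np.concatenate([c.ravel(), (r[:, None] * p).ravel(), np.zeros(ny)])

# Supply constraints: sum_j x_ij <= s_i.
A_ub = np.zeros((n, nx + 2 * ny))
for i in range(n):
    A_ub[i, i * m:(i + 1) * m] = 1.0

# Scenario balance: sum_i x_ij + y_plus_jk - y_minus_jk = d_jk.
A_eq = np.zeros((m * N, nx + 2 * ny))
for j in range(m):
    for k in range(N):
        row = j * N + k
        A_eq[row, [i * m + j for i in range(n)]] = 1.0
        A_eq[row, nx + row] = 1.0
        A_eq[row, nx + ny + row] = -1.0

res = linprog(cost, A_ub=A_ub, b_ub=s, A_eq=A_eq, b_eq=d.ravel(), bounds=(0, None))
print(res.x[:nx].reshape(n, m))           # optimal deadheading plan x_ij
```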

Consider now the problem of Capacity Planning with Uncertain Demand in which there are at most N values d_{j1}, …, d_{jN} of the demand D_j and p_{jk} ≡ P(D_j = d_{jk}) for k = 1, …, N and each j. Set y_{jk} = y_j(d_{jk}) and write y_{jk} = y_{jk}^+ − y_{jk}^− as a difference of two nonnegative variables. Then the problem becomes the ordinary linear program of choosing x_ij and y_{jk}^± that minimize

Σ_{i≤j} l_ij x_ij + Σ_{j,k} r p_{jk} y_{jk}^+

subject to

Σ_{g≤j≤h} x_{gh} + y_{jk}^+ − y_{jk}^− = d_{jk}, j = 1, …, n and k = 1, …, N,

where the inner sum is over lease intervals [g, h] that cover week j, and

x_ij, y_{jk}^+, y_{jk}^− ≥ 0, all i, j, k.

Computations

It is useful to examine the size of the mathematical program that results from the above approach to stochastic programming. To that end, let N_j ≡ |𝒵_j| and let N ≡ max_j N_j be the maximum number of values of any Z_j. Also, let S_i be the set of indices j of nonconstant random variables Z_j that appear in the i-th constraint in (2). Let M ≡ max_i |S_i| be the maximum number of nonconstant Z_j that appear in an inequality. Then the mathematical program may have up to Σ_{i=1}^n N_i ≤ nN variables and up to Σ_{i=1}^m Π_{j∈S_i} N_j ≤ mN^M inequality constraints. This

These two conditions are often fulfilled in a problem that either has a very small number of

time periods, e.g., say three or less, or has a deterministic transition law. The Transportation

Problem with Uncertain Demand (Example 3) is an example of the former type and involves

7Ð8 #RÑ 8 7R variables and linear equations or inequalities beyond nonnegativity. The

problem of (Example 4) is an example of the latterCapacity Planning with Uncertain Demand

type and involves up to variables and linear equations as well as nonnegativity"

#8 #8R 8R#

of the variables. In both cases is generally not too large and . As a third example, sup-R Q œ "

pose and the objective is quadratic and strictly concave. Then one simply sets the partial7 œ !

derivatives with respect to each of the variables equal to zero and solves the resulting sys-8R

tem of linear equations in a like number of unknowns. This last problem is tractable if is8R R

not too large since .7 œ !

The stochastic programming approach is not tractable for many multiperiod stochastic sequential decision problems because N is large. To illustrate, consider the problem of Inventory Control with Uncertain Demand (Example 2) and complete information. If each D_i has K ≥ 2 values, then N = |𝒵_n| = K^{n−1} grows exponentially in the number of periods n. Thus the number

of variables, and hence inequalities, in the mathematical programming formulation of this sto-

chastic program is usually impossibly large. This is so even if the demands in successive periods

are independent.
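The following few lines of Python merely evaluate the counts just discussed for hypothetical problem dimensions, to make the contrast between the two situations concrete.

```python
# Illustrative size comparison (dimensions hypothetical).  For the transportation
# problem (Example 3) with n supply cities, m demand cities and N demand values
# per city, the LP has m*(n + 2*N) variables and n + m*N constraints; for the
# inventory problem (Example 2) with complete information and K demand values
# per week, the number of information states after n weeks is K**(n-1).
n_cities, m_cities, N_vals = 10, 20, 50
print("transportation variables:", m_cities * (n_cities + 2 * N_vals))   # 2200
print("transportation constraints:", n_cities + m_cities * N_vals)       # 1010

K, n_weeks = 3, 20
print("inventory information states:", K ** (n_weeks - 1))               # ~1.16e9
```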

Comparison of Dynamic and Stochastic Programming

The dynamic- and stochastic-programming approaches to solving sequential decision prob-

lems under uncertainty are generally useful for computational purposes in different circumstances.


The dynamic-programming approach is useful if the number of state-action pairs is not too

large. The stochastic-programming approach is useful if the number of possible values of the in-

formation that one observes is not too large. To illustrate, the dynamic-programming approach

is useful in Example 2 where the demands in successive periods are independent and there is

complete information, whereas the stochastic-programming approach to this problem is generally

intractable. On the other hand, the stochastic-programming approach is useful in Example 3 where

the demands for capacity in successive periods are independent, whereas the dynamic-program-

ming approach to this problem is generally intractable.

3 QUADRATIC UNCONSTRAINED TEAM DECISION PROBLEMS [Ra61, 62], [MR72]

As discussed above, a team decision problem that is especially easy to solve is one that is

strictly concave and quadratic, and unconstrained. In particular, assume that g ≡ ∅ and

(1) r(X, Z) = Xᵀδ(Z) − ½ XᵀQX

where δ = δ(Z) = (δ_i(Z)), Q = (q_ij), q_i = (q_i1, …, q_in), and E^{Z_i}(·) ≡ E(· | Z_i) denotes conditional expectation given Z_i.

Theorem 1. Quadratic Unconstrained Team Decision Problems. In a quadratic unconstrained team decision problem with symmetric positive-definite Q, the team decision X* is optimal if and only if

(2) E^{Z_i}(δ_i − q_i X*) = 0, i = 1, …, n.

Also there is a unique team decision X* satisfying (2).

Proof. The function r(·, z) is strictly concave for each fixed z ∈ 𝒵. Therefore it follows that Σ_{z∈𝒵} r(X(z), z) p(z) is strictly concave and quadratic in {X_i(z_i) : z_i ∈ 𝒵_i, all i}. Thus the X*_i(z_i) are the unique solution to

∂R(X*)/∂X_i(z_i) = 0, z_i ∈ 𝒵_i, i = 1, …, n.

Now use (1) to rewrite this system as

0 = ∂R(X*)/∂X_i(z_i) = ∂/∂X_i(z_i) E[r(X*(Z), Z)] = E^{Z_i = z_i}[∂r(X*(Z), Z)/∂X_i(z_i)] P(Z_i = z_i) = E^{Z_i = z_i}[δ_i(Z) − q_i X*(Z)] P(Z_i = z_i),

so because P(Z_i = z_i) > 0, (2) holds as claimed. ∎


Theorem 1 requires that a system of up to nN linear equations in a like number of variables be solved and so is not tractable if N is large. In the special case in which δ_i is affine in Z_i for each i and Z has a joint normal distribution, a homework problem shows that the X_i are affine in the Z_i. The coefficients can be found by solving two systems of linear equations, one n × n and the other m × m where m is the number of random variables among Z_1, …, Z_n.
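For a finite 𝒵 the stationarity conditions (2) are linear in the unknowns X_i(z_i) and can be assembled and solved directly. The sketch below does this for a hypothetical two-member team with binary signals; the joint distribution, the δ_i and Q are made-up illustrative data, not an example from the text.

```python
import itertools
import numpy as np

# Hypothetical instance: each member i observes a binary signal Z_i; the joint
# distribution p, the coefficients delta_i(z) in (1), and the symmetric
# positive-definite Q are all made up for illustration.
Z_vals = [0, 1]
p = {z: q for z, q in zip(itertools.product(Z_vals, repeat=2),
                          [0.4, 0.1, 0.2, 0.3])}
def delta(i, z):                      # delta_i(Z), a hypothetical affine form
    return 1.0 + 0.5 * z[i] + 0.25 * z[1 - i]
Q = np.array([[2.0, 0.5], [0.5, 1.0]])

# Unknowns X_i(z_i), indexed by (i, z_i).
idx = {(i, v): k for k, (i, v) in enumerate(itertools.product(range(2), Z_vals))}
A = np.zeros((len(idx), len(idx)))
b = np.zeros(len(idx))

# One equation per (i, z_i):  E[delta_i - sum_j q_ij X_j(Z_j) | Z_i = z_i] = 0,
# written unconditionally by multiplying through by P(Z_i = z_i).
for (i, v), row in idx.items():
    for z, pz in p.items():
        if z[i] != v:
            continue
        b[row] += pz * delta(i, z)
        for j in range(2):
            A[row, idx[(j, z[j])]] += pz * Q[i, j]

X = np.linalg.solve(A, b)             # optimal team decision X_i(z_i)
for (i, v), k in sorted(idx.items(), key=lambda kv: kv[1]):
    print(f"X_{i+1}({v}) = {X[k]:.4f}")
```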

4 SEQUENTIAL QUADRATIC UNCONSTRAINED TEAMS: CERTAINTY EQUIVALENTS

[HMSM60], [MR72]

On the other hand, even without the above assumptions that δ_i is affine in Z_i for each i and Z has a joint normal distribution, but with the added assumption that the problem is “sequential”, we now show how to reduce the number of linear equations that need to be solved in such problems to about two systems of n linear equations in a like number of variables, a dramatic

reduction indeed.

In order to motivate the idea, consider the following example.

Example 5. Single-Person Quadratic Unconstrained Team. Suppose n = 1, Z_1 is the empty vector, Z_2 is real valued, X does not depend on Z, and r(X, Z) = −(Z_2 − X)². Then

min_X E(Z_2 − X)² = min_X E[(Z_2 − EZ_2)² + (EZ_2 − X)²] = E(Z_2 − EZ_2)² + (EZ_2 − X*)² = Var Z_2

where X* = EZ_2. Since X* depends only on the expected value of Z_2, we would have found the same solution if we had initially replaced Z_2 by EZ_2 and solved the “equivalent” deterministic problem min_X (EZ_2 − X)², which equals 0, so again X* = EZ_2. Thus EZ_2 is a “certainty equivalent” for Z_2 with respect to the problem of finding X*.

If U and V are finite-valued random vectors, we say that U determines V if V = f(U) for some function f. In that event, for any finite-valued random variable W, E^V E^U W = E^U E^V W = E^V W. Also, if U determines V and V determines W, then U determines W, i.e., the determines relation is transitive.

Call the team decision problem sequential if Z_{i+1} determines Z_i for i = 1, …, n − 1. The next result is important because it allows one to solve sequential quadratic unconstrained team decision problems by replacing all random variables by their conditional expectations given the then current state of information and solving the equivalent deterministic problem.

Theorem 2. Certainty Equivalents for Sequential Quadratic Unconstrained Team Decision Problems. In a sequential quadratic unconstrained team decision problem with symmetric positive-definite Q, there is a unique optimal team decision X* that satisfies

(3) E^{Z_i}(δ_j − q_j X*) = 0, 1 ≤ i ≤ j ≤ n.

Moreover, on setting L = {1, …, i−1} and F = {i, …, n} for given i,

(4) X*_F = Q_{FF}^{−1} [E^{Z_i} δ_F − Q_{FL} X*_L].


Proof. Recall from Theorem 1 that

E^{Z_j}(δ_j − q_j X*) = 0,

so because Z_j determines Z_i for i ≤ j,

0 = E^{Z_i} E^{Z_j}(δ_j − q_j X*) = E^{Z_i}(δ_j − q_j X*).

For fixed i, these equations can be rewritten as

E^{Z_i}(δ_F − Q_{FL} X*_L − Q_{FF} X*_F) = 0.

Now premultiply this equation by Q_{FF}^{−1} and use the fact that E^{Z_i} X*_L = X*_L. ∎

Interpretation of Solution

Remark 1. Solution as a Sequence of Deterministic Problems. The formula (4) permits X*_i(Z_i) to be computed once the variables X*_1(Z_1), …, X*_{i−1}(Z_{i−1}) have been found. Moreover, since X*_i(Z_i) depends on δ_F only through its conditional expectation E^{Z_i} δ_F given the then current information Z_i, it follows that X*_i(Z_i) has the same value that it would have for the corresponding deterministic problem in which the random vector δ_F is replaced by its conditional expectation E^{Z_i} δ_F!

Remark 2. On-Line Implementation or Wait-and-See Solution. It is not necessary to tabulate X*_i(·) in order to choose X*_i(Z_i) optimally. All that is required is to wait until Z_i is observed—in which case Z_1, …, Z_{i−1} are determined—and then choose X*_i(Z_i) as in Remark 1 above.

Remark 3. Linearity of Optimal Decision Function. Observe from (4) that the optimal decision function X*_i(Z_i) for person i is linear in the history X*_L of decisions of prior persons and the conditional expected value E^{Z_i} δ_F of the future random vector δ_F (yet to be observed except for δ_i(Z_i)) given the state of information Z_i of person i at the time he chooses his decision.

Computations

If we use Remark 2 above, then except for the work in computing the needed conditional expected values E^{Z_i} δ_F, one for each i, the work in solving (4) is independent of N. If one first finds the Cholesky factorization of the matrix Q, i.e., the upper triangular matrix R for which Q = RᵀR, then on the order of n³ operations suffice to find all optimal decisions.


Quadratic Control Problem [Be87], [Ku71]

It is often the case in practice that sequential quadratic team decision problems discussed in Theorem 2 have additional structure that leads to computational simplifications. To illustrate these ideas, consider the following quadratic control problem. Choose controls u_1, …, u_{T−1} and, given an initial state vector x_1, state vectors x_2, …, x_T that minimize the expected value of the quadratic form

(5) Σ_{t=1}^{T−1} (x_tᵀ Q_t x_t + u_tᵀ R_t u_t) + x_Tᵀ Q_T x_T

subject to the linear dynamical equations

(6) x_{t+1} = A_t x_t + B_t u_t + ω_t, t = 1, …, T−1,

where the x_t and ω_t are column vectors of common dimension, the u_t are column vectors of common dimension, the matrices Q_t are symmetric and positive semidefinite, the matrices R_t are symmetric and positive definite, the matrices A_t and B_t have appropriate dimension, the ω_t are random errors whose distributions are independent of the controls applied, and (ω_1, …, ω_{t−1}) is the information available to choose the control u_t for each t. It is natural in most applications to think of t as one of T periods, e.g., days, weeks, months, etc., and we shall use this terminology in what follows. Problems of this type arise in a variety of settings of which we mention two.

Example 6. Rocket Control. Consider the problem of moving a rocket efficiently from its initial position x_1 to a terminal position x_T at time T that is close to a target position. Let x_t be the position of the rocket at time t and u_t be the “control” exerted at that time. The position of a rocket might be a vector including at least its six location and velocity coordinates. The desired target location and velocity vector at time T is the null vector (the distance and velocity should be zero at time T). Assume that the motion of the rocket satisfies the linear dynamical equations (6). The objective function might have Q_t = 0 for all t < T. Then the objective function would entail minimizing the sum of two terms, the first representing the sum of the costs of the controls at the first T − 1 times and the second the cost x_Tᵀ Q_T x_T of missing the target position at time T. Each cost reflects the fact that small quantities have small costs and large quantities have large costs.

Example 7. Multiproduct Supply Management. Consider the problem of managing inventories of several products so as to minimize the sum of the expected resource and storage/shortage costs in the presence of uncertain demands for the products. In this setting, x_t would be the possibly negative vector of inventories of the products in period t, u_t the possibly negative vector of several resources (perhaps labor, materials, capital, etc.) applied to production in period t, and


ω_t the vector of demands for the products in period t. Negative inventories represent backorders and negative resource consumption represents returns of resources. The matrix A_t in period t would be the identity matrix if stocks of products are merely held in storage without transformation. That matrix might differ from the identity matrix if stocks move from one state to another over time, e.g., reflecting age, deterioration, location, etc. The ij-th element of the matrix B_t would be the rate of production of product i in period t per unit of resource j applied in that period. The costs reflect the desirability of low inventories and low resource costs.

Reduction to Unconstrained Problem. It might seem that this problem is more general than the unconstrained sequential decision problem considered to date because of the linear equality constraints (6). But that is not the case because one can use (6) to eliminate the x_t, leaving (5) depending only on the u_t. Then Theorem 2 applies at once. As a consequence, in order to determine the optimal choice of u_1, it suffices to replace the random vectors ω_t by their conditional expectations given the state of information at the time the vector u_1 is chosen. For this reason we can and do assume without loss of generality that the ω_t are constant vectors.

In fact it is possible to eliminate the constant vectors entirely. This may be accomplished by appending an additional state variable y_t in each period and appending to (6) the equations

(7) y_{t+1} = y_t, t = 1, …, T−1,

where y_1 ≡ 1. This assures that y_t = 1 for all t, so we can replace ω_t in (6) by the product ω_t y_t. The augmented system with one additional state variable and the same control variables then has the desired form with zero random errors. If we denote the various quantities in the augmented system with overbars, those quantities have the following relations to the corresponding quantities of the original system: ω̄_t = 0, ū_t = u_t, R̄_t = R_t,

(8) Q̄_t = [ Q_t 0 ; 0 0 ], Ā_t = [ A_t ω_t ; 0 1 ], B̄_t = [ B_t ; 0 ], and x̄_t = [ x_t ; y_t ],

where in each case the second matrix in a column (resp., row) partition has only one column (resp., row).
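For concreteness, here is a small sketch of the augmentation (7)-(8); the particular matrices used in the check at the end are made up.

```python
import numpy as np

def augment(A_t, B_t, Q_t, omega_t):
    """Build the augmented matrices of (8) that absorb the constant error
    omega_t into an extra state y_t == 1 (illustrative sketch)."""
    n, m = A_t.shape[0], B_t.shape[1]
    A_bar = np.block([[A_t, omega_t.reshape(n, 1)],
                      [np.zeros((1, n)), np.ones((1, 1))]])
    B_bar = np.vstack([B_t, np.zeros((1, m))])
    Q_bar = np.block([[Q_t, np.zeros((n, 1))],
                      [np.zeros((1, n + 1))]])
    return A_bar, B_bar, Q_bar

# Tiny hypothetical check: the augmented dynamics reproduce x' = A x + B u + omega.
A, B, Q = np.array([[1.0, 0.1], [0.0, 1.0]]), np.array([[0.0], [1.0]]), np.eye(2)
omega = np.array([0.5, -0.2])
A_bar, B_bar, _ = augment(A, B, Q, omega)
x_bar = np.array([2.0, 1.0, 1.0])            # (x, y) with y = 1
print(A_bar @ x_bar + B_bar @ np.array([0.3]))
```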

Dynamic-Programming Solution with Zero Random Errors. It is natural to solve the zero-random-error problem by dynamic programming. This approach is tractable because even though the state vector x_t in period t may have many components, the minimum cost is a positive semidefinite quadratic form that can be calculated explicitly. To see this, let V_t(x) be the minimum cost in periods t, …, T given x is the state in period t, and let u_t(x) be the corresponding optimal control in that state and period. Evidently


(9) V_t(x) = min_u [xᵀ Q_t x + uᵀ R_t u + V_{t+1}(A_t x + B_t u)], t = 1, …, T−1,

and V_T(x) = xᵀ Q_T x for all x.

Theorem 3. Optimal Quadratic Control. If the quadratic control problem (5)-(6) has zero random errors, each Q_t is symmetric and positive semidefinite, and each R_t is symmetric and positive definite, then the minimum cost V_t and optimal control u_t are

(10) V_t(x) = xᵀ K_t x, t = 1, …, T,

(11) u_t(x) = −S_t^{−1} U_t A_t x, t = 1, …, T−1,

with symmetric positive-semidefinite matrices K_t given from the Riccati recursion

(12) K_t = Q_t + A_tᵀ W_t A_t, t = 1, …, T−1,

and K_T = Q_T, where W_t ≡ K_{t+1} − U_tᵀ S_t^{−1} U_t, S_t ≡ R_t + B_tᵀ K_{t+1} B_t and U_t ≡ B_tᵀ K_{t+1} for t = 1, …, T−1.

Proof. The proof of (10)-(11) is by induction on t. This claim is trivially true for t = T. Suppose it holds for t + 1 ≤ T and consider t. Then S_t is symmetric and positive definite since that is so of R_t and since K_{t+1} is symmetric and positive semidefinite from the induction hypothesis. Thus S_t is nonsingular, whence by (10) for t + 1 and completing the square, the expression in brackets on the right-hand side of (9) is

xᵀ Q_t x + uᵀ R_t u + V_{t+1}(A_t x + B_t u) = xᵀ(Q_t + A_tᵀ K_{t+1} A_t)x + (uᵀ S_t u + 2uᵀ U_t A_t x)
= xᵀ(Q_t + A_tᵀ W_t A_t)x + (u + S_t^{−1} U_t A_t x)ᵀ S_t (u + S_t^{−1} U_t A_t x).

Now since S_t is symmetric and positive definite, u = −S_t^{−1} U_t A_t x is the unique minimizer of the above function, so (10) and (11) follow from (9). Furthermore, it is immediate from (12) and the induction hypothesis that K_t is symmetric. Finally, K_t is positive semidefinite because V_t is nonnegative. The last follows from the fact that the expression in brackets on the right-hand side of (9) is nonnegative for all x and u because Q_t, R_t and K_{t+1} are symmetric and positive semidefinite. ∎

Remark 1. Linearity of Optimal Control. Observe from (11) that the optimal control u_t in period t is linear in the state vector. Thus when there is a random error, u_t is linear in the state vector in period t and the conditional expected value of the random error in that period given the random errors in prior periods.


Remark 2. Linearity of Computational Effort in T. Observe that the matrices K_t can be calculated recursively from (12). Thus the computational effort to determine those matrices and the optimal controls is linear in T.
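The recursion (12) and the feedback law (11) are easy to code directly; the sketch below does so for a hypothetical two-dimensional, three-period instance. It is only an illustration of Theorem 3's formulas, not part of the text.

```python
import numpy as np

def riccati_controls(A, B, Q, R):
    """Backward Riccati recursion (12) and feedback gains of (11) for the
    zero-random-error problem.  Q has one more entry than A, B, R; the last
    entry of Q is the terminal cost matrix (minimal sketch of Theorem 3)."""
    T = len(Q)
    K = [None] * T
    gain = [None] * (T - 1)                          # u_t(x) = -gain[t] @ x
    K[T - 1] = Q[T - 1]
    for t in range(T - 2, -1, -1):
        S = R[t] + B[t].T @ K[t + 1] @ B[t]          # S_t
        U = B[t].T @ K[t + 1]                        # U_t
        gain[t] = np.linalg.solve(S, U @ A[t])       # S_t^{-1} U_t A_t
        W = K[t + 1] - U.T @ np.linalg.solve(S, U)   # W_t
        K[t] = Q[t] + A[t].T @ W @ A[t]              # (12)
    return K, gain

# Hypothetical 3-period instance (all matrices made up): a double integrator.
A = [np.array([[1.0, 1.0], [0.0, 1.0]])] * 2
B = [np.array([[0.0], [1.0]])] * 2
Q = [np.eye(2), np.eye(2), 10 * np.eye(2)]
R = [np.array([[1.0]])] * 2
K, gain = riccati_controls(A, B, Q, R)
x = np.array([1.0, 0.0])
print("u_1(x) =", -gain[0] @ x, "  minimum cost =", x @ K[0] @ x)
```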

Remark 3. Quadratic Plus Linear Costs. It is also possible to apply the above results to the case in which a linear function of the state and control vectors is added to the objective function (5). To do this, complete the square, thereby expressing the generalization of (5), apart from a constant, as

(5′) Σ_{t=1}^{T−1} [(x_t − ξ_t)ᵀ Q_t (x_t − ξ_t) + (u_t − η_t)ᵀ R_t (u_t − η_t)] + (x_T − ξ_T)ᵀ Q_T (x_T − ξ_T)

for some translation vectors ξ_t and η_t for each t. Then substitute x_t′ ≡ x_t − ξ_t, u_t′ ≡ u_t − η_t and ω_t′ ≡ ω_t + A_t ξ_t + B_t η_t − ξ_{t+1} in both (5) and (6) for each t. The resulting problem assumes the form (5)-(6) with x_t′, u_t′ and ω_t′ replacing x_t, u_t and ω_t respectively.

Solution with Independent Random Errors. If the random errors in different periods are independent, then the conditional expectations of the random errors in period t or later given the random errors in periods prior to t are independent of the latter. Consequently the optimal control functions for the corresponding augmented zero-random-error system are optimal for the original system with independent random errors.

Solution with Dependent Random Errors. If, however, the random errors in different periods are dependent, then the optimal control function in period t > 1 determined from the corresponding augmented zero-random-error system is not generally optimal for the original system (with random errors) in period t. The reason is that the augmented matrix Ā_t depends on the conditional expected error ω_t, whose value depends on the values of the conditioning variables. Consequently, the matrix K̄_t depends on ω_{tT} ≡ (ω_t, …, ω_{T−1}). To examine this dependence, partition K̄_t like (8) in the form

K̄_t = [ K_t k_t ; k_tᵀ κ_t ].

Then a straightforward, but tedious, computation using (12) for the corresponding augmented system shows by induction on t that the matrices K_t are independent of ω_{tT}, the vectors k_t satisfy the recursion (k_T ≡ 0)

(13) k_t = A_tᵀ [W_t ω_t + (I − U_tᵀ S_t^{−1} B_tᵀ) k_{t+1}], t = 1, …, T−1,

and the optimal control u_t(x, ω_{tT}) takes the form


(14) u_t(x, ω_{tT}) = −S_t^{−1} [U_t (A_t x + ω_t) + B_tᵀ k_{t+1}], t = 1, …, T−1.

Indeed, the matrices K_t coincide with the corresponding matrices satisfying the Riccati recursion for the unaugmented zero-random-error problem. Also, the matrices defining the coefficients of x, ω_t and k_{t+1} in (13) and (14) are all independent of ω_{tT}. Consequently, to compute the optimal controls from (14) when ω_{tT} is altered because of new information, simply recompute the k_t using the recursion (13). Incidentally, it also follows by iterating (13) that k_t = k_t(ω_{t+1,T}) is linear in ω_{t+1,T}. Thus, in view of (14), u_t(x, ω_{tT}) is linear in x and ω_{tT}, confirming Remark 3 following Theorem 2 in the present instance.

Strengths and Weaknesses

The Certainty-Equivalence Theorem 2 permits sequential decision problems under uncertainty

with thousands of variables and constraints to be solved nearly as easily as corresponding determin-

istic problems provided that the objective function is convex and quadratic, and the constraints

consist only of linear equations—not inequalities. For example, a multiproduct multiperiod

inventory problem with thousands of products that are tied together by joint convex quadratic

production and storage/shortage costs, uncertain demand for each product, and no inequality

constraints can be solved easily by the Certainty-Equivalence Theorem and its implementation as

a quadratic control problem discussed above.

Because of the Certainty-Equivalence Theorem, it is tempting to use its conclusion, viz., re-

placing all random variables by their expectations and solving the resulting deterministic prob-

lem, in circumstances where it cannot be shown to lead to optimal solutions. The reason for so

doing is the hope that the error from so doing will not be great. Indeed, that general approach is

used implicitly in practice for many large sequential decision problems under uncertainty that

could not otherwise be solved.

Nevertheless, it must be recognized that the assumption that the costs are convex and quad-

ratic may not be a reasonable approximation in practice. And the absence of inequality constraints

does not, for example, permit one to require production in each period to be nonnegative, as is us-

ually the case in practice. Thus, if the optimal production schedule is not nonnegative, the optimal

solution may not be meaningful. On the other hand, in practice one can usually modify the

optimal solution in reasonable ways to overcome these limitations—though not without losing

optimality.

Though the certainty-equivalence approach has limitations, it is one of very few approaches

that is useful for finding optimal policies for large sequential decision problems under uncertainty.

For this reason, it is worthy of serious consideration in many circumstances.



3 Continuous-Time-Parameter Markov Population Decision Processes

1 FINITE CHAINS: FORMULATION

Consider a finite population of individuals that is observed continuously and indefinitely beginning at time zero. At each time, each individual will be in some state s in a finite set 𝒮 of S < ∞ states. Each time an individual is observed in state s at time t ≥ 0, an action a is chosen from a finite set A_s of possible actions and a reward rate r(s, a) is earned per unit time. Also the individual generates a finite expected number q(u|s, a) of additional individuals in state u per unit time. Assume that q(u|s, a) ≥ 0 for u ≠ s, since otherwise positive populations may produce negative ones. Call q(u|s, a) the transition rate from state s to state u when using action a.

In the important case where the population consists of exactly (resp., at most) one individual at each point in time, so Σ_{u∈𝒮} q(u|s, a) = 0 (resp., ≤ 0) for all a ∈ A_s and s ∈ 𝒮, call the system stochastic (resp., substochastic). In either case q(s|s, a) ≤ 0 for all a ∈ A_s and s ∈ 𝒮. The condition Σ_{u∈𝒮} q(u|s, a) = 0 has the interpretation that the expected rate of addition of individuals to the system generated by each individual is zero, so the system population size remains unchanged. Similarly, in the substochastic case the system population size remains unchanged or falls.


2 MAXIMUM T-PERIOD VALUE: BELLMAN'S EQUATION AND EXAMPLES [Be57]

A fundamental problem in the above setting is to find a policy that maximizes the T-period expected reward with terminal value vector v = (v_s) in the various states. We begin with an informal analysis and make it rigorous later. Let V_s^t be the maximum expected reward that an individual (and his progeny) in state s at time t earns during the interval [t, T]. Then it is plausible that for small increments of time h > 0,

V_s^{t−h} = max_{a∈A_s} [r(s, a)h + Σ_{u∈𝒮} q(u|s, a)h V_u^t + V_s^t + o(h)]

where o(h) is a function for which lim_{h↓0} h^{−1} o(h) = 0. Subtracting V_s^t from both sides, dividing by h, and letting h ↓ 0 yields Bellman’s equation

(1) −V̇_s^t = max_{a∈A_s} [r(s, a) + Σ_{u∈𝒮} q(u|s, a) V_u^t]

for t ≥ 0 and s ∈ 𝒮 together with the terminal condition V_s^T ≡ v_s for s ∈ 𝒮. (Of course V̇_s^t ≡ (d/dt) V_s^t.)

The expression in brackets on the right-hand side of (1) is the sum of two terms. The first is the reward rate r(s, a) that an individual in state s earns while taking action a in that state at time t. The second is the extra reward rate during the interval [t, T] that results from changes in the population size in each state at time t when an individual in state s takes action a in that state at time t and all individuals use an optimal policy thereafter. To see why this is so, observe that q(u|s, a) V_u^t is the product of the expected rate q(u|s, a) of addition of individuals to state u when an individual in state s takes action a in that state at time t and the maximum expected reward V_u^t that each such individual in state u and his progeny earn in the interval [t, T]. Thus, the second term Σ_{u∈𝒮} q(u|s, a) V_u^t is the total extra expected reward rate during the interval [t, T] that results from addition of individuals to the several states when an individual in state s takes action a in that state at time t. Finally, the right-hand side of (1) is the maximum expected reward rate that an individual and his progeny earn in state s at time t, and equals the left-hand side of (1), viz., the negative of the time derivative of the maximum value in state s at time t.

Observe that (1) is a system of S ordinary nonlinear differential equations. The sequel develops this equation rigorously, shows that it has a unique solution, and establishes the existence of a maximum-T-period-value piecewise-constant (in time) policy.

Sometimes in the sequel, the symbol will instead mean the maximum -period value. (TheZ >>=

reader will be warned when this is so.) In that event, the minus sign on the left-hand side of (1)

disappears and the terminal condition becomes instead for .Z ´ @ = −!= = f
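To make the backward integration of (1) concrete, here is a minimal numerical sketch, not part of the original notes: an explicit Euler march from the terminal condition V(T) = v down to t = 0 for a finite-state, finite-action chain. The arrays r and q and the two-state data in the usage lines are hypothetical.

```python
import numpy as np

def solve_bellman(r, q, v, T, steps=20_000):
    """Integrate -dV_s/dt = max_a [ r(s,a) + sum_u q(u|s,a) V_u ] backward from V(T) = v.

    r : (S, A) array of reward rates r(s, a)
    q : (S, A, S) array of transition rates q(u | s, a)
    v : (S,) terminal value vector
    Returns V(0) and a maximizing decision (one action per state) at t = 0.
    """
    V = np.asarray(v, dtype=float).copy()
    dt = T / steps
    for _ in range(steps):
        gains = r + q @ V                  # (S, A): r(s,a) + sum_u q(u|s,a) V_u
        V = V + dt * gains.max(axis=1)     # V(t - dt) ~ V(t) + dt * max_a [...]
    return V, gains.argmax(axis=1)

# Tiny hypothetical two-state, two-action example.
r = np.array([[1.0, 0.0], [0.0, 2.0]])
q = np.array([[[-1.0, 1.0], [-2.0, 2.0]],
              [[ 1.0, -1.0], [ 3.0, -3.0]]])
V0, decision = solve_bellman(r, q, v=np.zeros(2), T=5.0)
```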

Example 1. Controlled Queues. Consider controlling a single-channel queue with independent Poisson arrivals and exponential service times. The state of the system is the number of customers in the queue (including the one being served), and this number cannot exceed the queue capacity S. Let λ_s ≥ 0 be the arrival rate of customers into the system when s customers are in the queue. Of course, λ_S ≡ 0. The actions are the indices 1, …, K_s of the available service rates μ_{s1}, …, μ_{sK_s} ≥ 0, say, when s customers are in the system. Of course K_0 = 1 and μ_{01} = 0. Then for 0 ≤ s, u ≤ S, the transition rates are

$$q(u|s,a) = \begin{cases} \mu_{sa}, & u = s-1\\ -(\mu_{sa}+\lambda_s), & u = s\\ \lambda_s, & u = s+1\end{cases}$$

and q(u|s,a) = 0 otherwise. Also r(s,a) = rμ_{sa} − c_{sa} where r is the price each customer pays on completing service and c_{sa} is the service cost rate when the action a is chosen with s customers in the queue. In this event Bellman's equation becomes

$$-\dot V_s^t = \max_{1\le a\le K_s}\big[r\mu_{sa} - c_{sa} + \mu_{sa}(V_{s-1}^t - V_s^t) + \lambda_s(V_{(s+1)\wedge S}^t - V_s^t)\big],\quad 0\le s\le S.$$

. . -

Example 2. Supply Management. A supply manager seeks an ordering policy with minimum expected cost. The states are the possible inventory levels 0, …, S. An order for one or more items can be placed or canceled at any time with the time to deliver being exponentially distributed with mean μ^{-1}, 0 < μ < ∞. The actions a in state s are the amounts of stock on hand plus on order, so s ≤ a ≤ S. The times between demands are independent exponentially distributed random variables with common mean λ^{-1}, 0 < λ < ∞. The demands are independent and identically distributed with d_k being the probability of a demand of size k. Unsatisfied demands are lost. The transition rates are

$$q(u|s,a) = \begin{cases} \lambda d_{s-u}, & 0 < u < s\\ \lambda\sum_{k\ge s} d_k, & u = 0\\ \mu, & u = a > s\\ -(\lambda+\mu), & u = s\end{cases}$$

and q(u|s,a) = 0 otherwise. Also, the expected-cost-rate function is $c(s,a) = c(a-s) + h(s) + \lambda\sum_{w=1}^{\infty} d_{s+w}\,p(w)$ where c(·) and p(·) are the ordering and shortage cost functions, h(·) is the storage-cost rate, and c(0) = h(0) = p(0) = 0.

Example 3. Project Scheduling. A project manager wants to allocate funds continuously among the activities of a project to maximize the probability of completing the project by time T. The project consists of a partially-ordered collection 𝒫 of activities. An activity cannot be initiated until all activities preceding it in the partial order are completed. At any given point of time, 𝒫 can be partitioned into two sets, viz., the set s of activities that have been completed and the set s^c (the complement of s) of activities that have not been completed. From what we have said above, the set s is necessarily decreasing, i.e., if an activity is in s, then so is every activity that precedes it. Similarly, s^c is increasing, i.e., if an activity is in s^c, then so is every activity that follows it in the partial order. Denote by 𝒮 the set of decreasing subsets of 𝒫. For each s ∈ 𝒮, denote by σ_s the set of activities in s^c that are not preceded by any other activity in s^c. Evidently, σ_s is the set of activities on which effort may be expended when s ∈ 𝒮 is the set of completed activities. Funds may be spent continuously on the project at the rate of b million dollars per week. At each point in time at which s ∈ 𝒮 is the set of completed activities, the funds can be allocated among the activities in σ_s in multiples of one million dollars. An action in state s ∈ 𝒮 is a |σ_s|-tuple a = (a_α) of nonnegative integers indexed by elements α of σ_s for which $\sum_{\alpha\in\sigma_s} a_\alpha = b$. The interpretation of a is that a_α million dollars per week are allocated to activity α for each such α. Denote by A_s the set of all such actions if s ≠ 𝒫 and the empty action ∅ if s = 𝒫. The times to complete the activities are independent exponentially distributed random variables. The completion rate of activity α ∈ 𝒫 when m ≥ 0 million dollars per week are allocated to that activity is $\lambda_\alpha\sqrt{m}$ with λ_α > 0. Let V_s^t be the maximum probability of completing the project by time T when s ∈ 𝒮 is the subset of activities completed by time t, 0 ≤ t ≤ T. Then v_𝒫 = 1 and v_s = 0 otherwise. Also, r(s,a) = 0 for all a ∈ A_s and s ∈ 𝒮. Moreover, in state 𝒫, i.e., when the project is completed, q(s|𝒫,∅) = 0 for all s ∈ 𝒮. Finally, for each s ∈ 𝒮\{𝒫} and a = (a_α) ∈ A_s, the nonzero transition rates are contained among those for which $q(s\cup\{\alpha\}\,|\,s,a) = \lambda_\alpha\sqrt{a_\alpha}$ for some α ∈ σ_s or $q(s|s,a) = -\sum_{\alpha\in\sigma_s}\lambda_\alpha\sqrt{a_\alpha}$.
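As an aside, the state space 𝒮 of decreasing subsets and the frontier sets σ_s are easy to enumerate mechanically for a small project. The sketch below is illustrative only; the four activities and their partial order are hypothetical and not from the notes.

```python
from itertools import combinations

activities = ["a", "b", "c", "d"]
precedes = {("a", "c"), ("a", "d"), ("b", "d")}      # hypothetical partial order

def decreasing(subset):
    """A completed set must contain every predecessor of each of its members."""
    return all(p in subset for (p, q) in precedes if q in subset)

# The state space: all decreasing subsets of the activity set.
states = [frozenset(c) for k in range(len(activities) + 1)
          for c in combinations(activities, k) if decreasing(frozenset(c))]

def frontier(s):
    """Activities not yet completed all of whose predecessors are completed."""
    return {q for q in activities
            if q not in s and all(p in s for (p, qq) in precedes if qq == q)}

for s in states:
    print(sorted(s), "->", sorted(frontier(s)))
```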

3 PIECEWISE-CONSTANT POLICIES, GENERATORS, TRANSITION MATRICES

A decision δ is a function that assigns to each state s in 𝒮 an action δ_s ∈ A_s. Let Δ ≡ ×_{s∈𝒮} A_s be the set of all decisions. A policy π = (δ^t): [0,∞) → Δ (t → δ^t) is a mapping that is piecewise constant, i.e., there is a sequence of numbers 0 ≡ t_0 < t_1 < t_2 < ⋯ with t_N → ∞ as N → ∞, such that δ^t = δ^{t_N} for t_N ≤ t < t_{N+1} and each N ≥ 0. Using π means that an individual in state s at time t chooses action δ_s^t at that time. Let Π denote the set of all policies. Call a policy δ^∞ that uses δ at each point in time stationary.

For each δ ∈ Δ, let r_δ ≡ (r(s,δ_s)) be the reward-rate vector and Q_δ ≡ (q(u|s,δ_s)) be the generator matrix. Let P_δ^t be the S × S transition matrix whose su-th element is the expected number of individuals in state u at time t given that one individual was in state s at time 0 and δ^∞ is used. Assume that the P_δ^t satisfy P_δ^0 = I,

(1)   $P_\delta^{t+h} = P_\delta^t P_\delta^h = P_\delta^h P_\delta^t$ for h, t ≥ 0

and

(2)   $P_\delta^h = I + Q_\delta h + o(h)$ for h > 0.

Substituting (2) into (1), subtracting P_δ^t from both sides, dividing by h, and letting h → 0 establishes that ($\dot P_\delta^t \equiv \frac{d}{dt}P_\delta^t$)

(3)   $P_\delta^t Q_\delta = \dot P_\delta^t = Q_\delta P_\delta^t$ for t ≥ 0.

These equations describe the growth of the population when the decision δ is used and may be interpreted as follows. Observe that Q_δ is the rate of addition to the population at any time per individual at that time. The left-hand side of (3) is the product of the population at time t and the rate of addition to the population per individual at that time. The right-hand side of (3) is the product of the rate of addition to the population at time zero and the population generated at time t by each individual added at time zero. Both products are simply different ways of expressing the rate of addition $\dot P_\delta^t$ to the population at time t. In any case, the unique solution of each of these two systems of S² linear differential equations satisfying P_δ^0 = I is

(4)   $P_\delta^t = e^{Q_\delta t}$ for t ≥ 0.
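As a quick numerical illustration of (4) and of the stochastic case discussed next, one can exponentiate a hypothetical generator with scipy and check the properties derived below; the matrix is made up.

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[-2.0, 1.5, 0.5],
              [ 1.0, -1.0, 0.0],
              [ 0.0,  2.0, -2.0]])       # off-diagonal entries >= 0 and Q 1 = 0
P = expm(0.7 * Q)                         # P_delta^t = e^{Q_delta t} at t = 0.7
assert (P >= 0).all() and (np.diag(P) > 0).all()
assert np.allclose(P.sum(axis=1), 1.0)    # stochastic because Q 1 = 0
```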

Nonnegative, Substochastic and Stochastic Transition Matrices

The nonnegativity of the off-diagonal elements of Q_δ implies that P_δ^t is nonnegative and has positive diagonal elements for t ≥ 0. To see this, write $Q_\delta = (Q_\delta + \lambda I) - \lambda I$ where λ is a positive number chosen so that $Q_\delta + \lambda I \ge 0$. Now $(Q_\delta + \lambda I)t$ commutes with $\lambda I t$, so that $P_\delta^t = e^{Q_\delta t} = e^{(Q_\delta+\lambda I)t}e^{-\lambda I t}$. Also $e^{(Q_\delta+\lambda I)t} \ge 0$ because $Q_\delta + \lambda I \ge 0$ and t ≥ 0, and $e^{-\lambda I t} = e^{-\lambda t}I \ge 0$, so $P_\delta^t \ge 0$. Further, since $e^{Q_\delta t} = (e^{Q_\delta t/n})^n$ and $e^{Q_\delta t/n} = I + Q_\delta\frac{t}{n} + o(\frac{t}{n})$ is nonnegative and has positive diagonal elements for large enough n, it follows that $P_\delta^t = e^{Q_\delta t}$ also has positive diagonal elements.

If also $Q_\delta 1 = 0$ (resp., $\le 0$), then $P_\delta^t$ is stochastic (resp., substochastic). To see this, observe that

$$P_\delta^t 1 - 1 = (e^{Q_\delta t} - I)1 = \Big(\int_0^t e^{Q_\delta u}\,du\Big)Q_\delta 1,$$

so because the term in large parentheses is nonnegative, P_δ^t is stochastic (resp., substochastic) if $Q_\delta 1 = 0$ (resp., $Q_\delta 1 \le 0$).

4 CHARACTERIZATION OF MAXIMUM T-PERIOD-VALUE POLICIES [Be57], [Mi68b]

Suppose that π = (δ^t) is any policy with δ^t = γ_N for t_N ≤ t < t_{N+1} for some {γ_N} ⊆ Δ and numbers 0 ≡ t_0 < t_1 < t_2 < ⋯ with t_N → ∞ as N → ∞. Then the transition function P_π^t determined by π is, for t_N ≤ t < t_{N+1}, given by

(1)   $P_\pi^t = P_{\gamma_0}^{t_1-t_0}\cdots P_{\gamma_{N-1}}^{t_N-t_{N-1}}P_{\gamma_N}^{t-t_N}.$

Thus, for t_N < t < t_{N+1},

(2)   $\dot P_\pi^t = P_\pi^t Q_{\delta^t}.$

Interpret the su-th element of P_π^t as the expected number of individuals in state u that an individual in state s at time 0 generates when using π. Using π during an interval means that π uses the decision δ^t at each time t in the interval.

Let π = (δ^t) and π* be policies. Fix T > 0 and v ∈ ℝ^S. Let π^t be the policy that uses π during [0,t) and π* during [t,∞), i.e., π^t uses δ^u at u ∈ [0,t) and uses the decisions of π* at u ∈ [t,∞). Set V_π^t = (V_{sπ}^t) with V_{sπ}^t being the expected reward that π earns during the interval [t,T] starting with a single individual in state s at time t.

The following Comparison Lemma gives an expression for the difference between the T-period values of two policies.

Lemma 1. Comparison. If π = (δ^t) and π* are policies, then

$$V_\pi^0 - V_{\pi^*}^0 = \int_0^T P_\pi^t\,G_{\delta^t\pi^*}^t\,dt$$

where the comparison function is given by

(3)   $G_{\gamma\pi^*}^t \equiv r_\gamma + Q_\gamma V_{\pi^*}^t + \dot V_{\pi^*}^t,\quad \gamma\in\Delta$ and $0\le t\le T.$

Proof. Evidently,

(4)   $V_{\pi^t}^0 = \int_0^t P_\pi^u r_{\delta^u}\,du + P_\pi^t V_{\pi^*}^t.$

Thus, for all continuity points t of π and π*,

(5)   $\frac{d}{dt}V_{\pi^t}^0 = P_\pi^t\big[r_{\delta^t} + Q_{\delta^t}V_{\pi^*}^t + \dot V_{\pi^*}^t\big] = P_\pi^t G_{\delta^t\pi^*}^t.$

Hence since $V_\pi^0 - V_{\pi^*}^0 = \int_0^T\big(\frac{d}{dt}V_{\pi^t}^0\big)\,dt$, the result follows from (4) and (5). ∎

The optimal-return operator plays a fundamental role in discrete-time-parameter problems. In continuous-time-parameter problems, the optimal-return generator 𝒱 plays a similar role. Define 𝒱: ℝ^S → ℝ^S by

$$\mathcal V V = \max_{\delta\in\Delta}\,[r_\delta + Q_\delta V]\quad\text{for } V\in\mathbb R^S.$$

Observe that the optimal-return generator stands in the same relation to the generator matrices Q_δ that the optimal-return operator does to the transition matrices P_δ.

Theorem 2. Characterization of Maximum T-Period-Value Policies. The policy π* = (δ^t) has maximum T-period value with terminal-value vector v if and only if V^t ≡ V_{π*}^t is continuously differentiable in t on [0,T], V^T = v and

(6)   $-\dot V^t = \mathcal V V^t = r_{\delta^t} + Q_{\delta^t}V^t,\quad 0\le t\le T.$

Proof. For the "if" part, observe that (6) implies that G_{γπ*}^t ≤ 0 for all γ, so by the Comparison Lemma 1, V_π^0 ≤ V_{π*}^0 for all π. For the "only if" part, observe first that if π = π*, (5) becomes 0 = P_{π*}^t G_{δ^tπ*}^t. Also P_{π*}^t is nonsingular because it is a product of matrix exponentials, each of which is nonsingular. Thus G_{δ^tπ*}^t = 0 for all continuity points t of π*. Now suppose G_{γπ*}^t ≰ 0 for some continuity point t of π* and some γ ∈ Δ. For each state s with G_{sγπ*}^t ≤ 0, redefine γ_s = δ_s^t, whence G_{sγπ*}^t = G_{sδ^tπ*}^t = 0. Thus G_{γπ*}^u ≥ 0 and G_{γπ*}^u ≠ 0 for u in a sufficiently small neighborhood of t. Let π' be the policy that uses γ on that neighborhood and uses π* elsewhere. Observe that P_{π'}^t is nonnegative and has positive diagonal elements because it is a product of nonnegative matrix exponentials with those properties. Thus, it follows from the Comparison Lemma 1 that V_{π'}^0 > V_{π*}^0, which is a contradiction. Hence, max_{γ∈Δ} G_{γπ*}^t = 0 = G_{δ^tπ*}^t, or equivalently (6) holds, at continuity points of π*. But V^t is continuous and so is 𝒱V^t, whence by the first equality in (6), V̇^t is continuous at continuity points of π* and hence on [0,T]. ∎

Incidentally, $-\dot V^t = \mathcal V V^t$ is, of course, Bellman's equation.

5 EXISTENCE OF MAXIMUM T-PERIOD-VALUE POLICIES [Mi68b]

As a first step in showing that there is a maximum-T-period-value piecewise-constant policy, we show how to find a stationary policy that has maximum t-period value for all small enough t > 0. To that end, let

(1)   $\hat V_\delta^t(v) = \int_0^t e^{Q_\delta u}\,du\;r_\delta + e^{Q_\delta t}v$

be the expected t-period reward when δ^∞ is used and the terminal reward is v. For t < 0, $\hat V_\delta^t(v)$ in (1) can be interpreted as the terminal reward for which the expected (−t)-period reward when δ^∞ is used is v. To see this, notice that with this interpretation, we should have

(2)   $v = \int_0^{-t} e^{Q_\delta u}\,du\;r_\delta + e^{-Q_\delta t}\,\hat V_\delta^t(v).$

Solving (2) for $\hat V_\delta^t(v)$ gives

$$\hat V_\delta^t(v) = -\int_0^{-t} e^{Q_\delta(u+t)}\,du\;r_\delta + e^{Q_\delta t}v = \int_0^{t} e^{Q_\delta u}\,du\;r_\delta + e^{Q_\delta t}v,$$

which shows that $\hat V_\delta^t(v)$ satisfies (1) for t < 0.

Now consider the following two problems.

I. Maximum Value. Choose δ ∈ Δ to maximize $\hat V_\delta^t(v)$ for all small enough t > 0.

II. Minimum Balloon Payment. Choose δ ∈ Δ to minimize $\hat V_\delta^t(v)$ for all large enough t < 0.


The first problem has an obvious interpretation. The second problem can be interpreted as follows. Suppose that one must accumulate a total expected reward v_s during the next t periods starting with one individual in state s. Also suppose that there are no funds available initially and that if the system earns a total expected reward differing from v in the t periods, then each individual in state u after t periods must make a (possibly negative) balloon payment $\hat V_{u\delta}^{-t}(v)$ at that time to make up the shortfall. The goal of problem II is to minimize the size of the balloon payments that each individual must make for small t.

Example 4. Suppose , , , , and . Then W œ " œ Ö ß × < œ < œ ! @ œ " U œ " U œ #? # $ ## $ # $_

solves problem I while solves problem II.$_

In order to solve problems I and II, substitute the Maclaurin-series expansion for into (1) and/U ?$

integrate term-by-term yielding

(3) , „ Z Ð@Ñ œ > !s >

8x

„>_

8œ!

88„$ $" <

where

<8„

8"$

$ $ $

œ„ @ ß 8 œ !

Ð „ U Ñ Ð< U @Ñß 8 œ "ß #ß ÞÞÞ Þ

ÚÛÜ

Let and . Let be the set of that maxiG < < G < < ? $ ?8 ! 8 ! "„ „ „„ „„ „$ $ $$ $$œ Ð â Ñ œ Ð â Ñ − mize

(3) for all sufficiently small . From (3) we have all .> ! œ Ö − À − ×? G G„ „ „$ ? # ?$ #¤ ,

Theorem 3. Maximum-Value and Minimum-Balloon-Payment Stationary Policies for Small

k k> . There is a stationary policy maximizing resp., minimizing over the class of station-Ð Ñ Z Ð@Ñs >

.ary policies for all sufficiently small resp., large , i.e., resp., .> ! Ð > !Ñ Á g Ð Á gÑ? ?

Proof. Let , all for . Clearly , ? $ ? G G # ? ? ? ?8 8 8 ! "„ „ „„ „œ Ö − À ¤ − × 8 œ !ß "ß á œ œ$ #

Ö − À < U @   < U @ − × Á g$ ? # ? ?$ $ # # , all . Now suppose inductively that is nonempty8"„

and is a product of action sets, one for each state, for some . Then say,8 "   " œG G8" 8"„ „$

for . Thus , all . Hence since for$ ? ? $ ? < < # ? < <− œ Ö − À   − × œ „ U8" 8 8" 8 8 8" 8 8"„ „ „ „ „ „„ „$ $# $

$ ?− 8"„ , it follows that

? $ ? < < # ?8 8" 8" 8" 8" œ Ö − À U   U − ×$ # , all

and

? $ ? < < # ?8 8" 8" 8" 8" œ Ö − À U Ÿ U − ×$ # , all ,

so is nonempty and is a product of action sets, one for each state. Thus .? ? ?8 8„ „„

_8œ!œ Á g+ è


Theorem 4. Truncation . ? ?W"„ œ W#

„ „œ â œ ? .

Proof. If , thenG GW" W"„ „$ #œ

(4) , .ÐU U Ñ œ ! 8 œ "ß á ß W$ # $<8„

It suffices to show (4) holds for all for this implies from which the Theorem fol8   " œG G$ #„ „ -

lows. That (4) holds for all follows from Lemma 23 of §1.8 with and the null8   " F œ „ U P$

space of . U U$ # è

Theorem 5. Existence of Maximum T-Period-Value Policies. For T > 0 and terminal-value vector v, there is a policy with maximum T-period value. Also V = (V^t), where V^t is the maximum value on [t,T], is the unique solution of Bellman's equation.

Proof. Put . Choose . Let be the -period value when using and? ? $ ? $„ „ " >Ð@Ñ ´ − Ð@Ñ Z Ð@Ñ >s$

the terminal reward is . Now show that maximizes over all piecewise-constant policies for@ Z Ð@Ñs$">$

all sufficiently small . By Theorem 2, this will be so if maximizes for all> ! œ < U Z Ð@Ñs# $">

# # $"

small enough . To see that this is so, notice on using that> ! Z Ð@Ñ œs > _8œ!

8$ $" "

! >

8x

8

<

Ð Ñ < U Z Ð@Ñ œ Ò< U @Ó >U U âs >

#x5 .# # # # # #$ $ $

> "

##

" " "< <

Hence, by construction, lexicographically maximizes the coefficients of the above series and# $œ "

so indeed maximizes for all small enough .< U Z Ð@Ñ > !s# # $>"

Let inf . Then for is> ´ Ö! > Ÿ X À Z Ð@Ñ < U Z Ð@Ñ× ! œ X > Ÿ > Ÿ Xs s" " "> > ‡Z 1 $$ $$ $" "" "

optimal. Now on replacing by , repeat the above construction obtaining @ Ð@Ñ ´Z @ − Ð@ Ñs > "# "

"

"$ $ ? .

Then define for where inf1 $ Z‡ ># " # " " # " "œ X > > Ÿ > X > > œ Ö! > Ÿ X > À Z Ð@ Ñ <s

$ $# #

U Z > ß > ß ás$ $# #

> ‡" #Ð@Ñ× !. Continuing inductively, construct and a sequence 1 of positive num-

bers. If for some , the claim is established. If not, . It turns! !R _3œ" 3œ"3 3

‡>   X R X ´ X >   !

out that this is impossible.

For let lim . From Theorem 3, there is a that minimizes over the class@ œ Z Z Ð@ Ñs‡ > > ‡>ÆX ‡ ‡1 $ .

of decisions for all small enough , i.e., . Now maximizes for> ! − Ð@ Ñ œ < U Z Ð@ Ñs$ ? # $‡ > ‡

# # $

all small enough , say for , . To see this observe> ! ! Ÿ > Ÿ !% %

(6) .< U Z Ð@ Ñ œ Ò< U @ Ó >U U âs >

## # # # # #$ $ $

> ‡ ‡ "

##< <

It follows that on letting ,Z ´ Z Ð@ ѵ s> > ‡

$

(7) , , .Z œ Z Z œ @ ! Ÿ > Ÿµ µ µ†

> > ! ‡Z %

But from Theorem 4 and construction, satisfiesZ ´ Z> >X ‡

‡1


(8) , , .Z œ Z Z œ @ ! Ÿ > Ÿ X X† > > ! ‡ ‡Z

Now the equations (7) and (8) are identical on . Thus, since is Lipschitz continuous, it folÒ!ß Ó% Z -

lows from the theory of nonlinear differential equations in the Appendix that for Z œ Z ! Ÿ >µ > >

Ÿ %. Hence, redefine to be on without altering its optimality, contradicting the as-1 $ %‡ ‡ ‡ÒX ß X Ó

sumption .> Æ !R è

Nonstationary Data. The above results concerning the existence of maximum T-period-value policies extend to situations in which the data are nonstationary. The simplest situation is that in which the action sets A_s^t, reward rates r^t(s,a) and transition rates q^t(u|s,a) at time t are piecewise constant in t on [0,T], i.e., there exist constants 0 = T_0 < T_1 < T_2 < ⋯ < T_N = T such that A_s^t, r^t(s,a) and q^t(u|s,a) are constant in t on the interval [T_{i-1}, T_i) for each i = 1, …, N. To see why there is a maximum T-period-value policy, suppose that one has found the maximum (T − T_i)-period value V^{T_i} and the corresponding policy π on the interval [T_i, T] with terminal value v. Now apply Theorem 5 to find a maximum (T_i − T_{i-1})-period value V^{T_{i-1}} and the corresponding policy μ on the interval [T_{i-1}, T_i) with terminal value V^{T_i}. Then the policy that uses μ on the interval [T_{i-1}, T_i) followed by π on the interval [T_i, T] has maximum (T − T_{i-1})-period value on the interval [T_{i-1}, T] with terminal value v. Repeating this procedure for i = N, …, 1 produces the desired maximum T-period-value policy and its value.

6 MAXIMUM T-PERIOD VALUE WITH A SINGLE STATE

It is instructive to apply the results for the maximum-T-period-value problem to the case where there is only a single state, i.e., S = 1. The general procedure is best described with an example.

Example 5. Suppose there is a single state and five decisions. Also suppose the graph of the piecewise-linear convex function 𝒱V appears (landscape) in the right-hand side of Figure 1, with each linear segment labeled by the decision δ for which 𝒱V = r_δ + Q_δV. From the graph of 𝒱V in the right-hand side of Figure 1, it is possible to sketch a rough graph of V^· in the left-hand side of Figure 1 that illustrates several features of the solution to $-\dot V^t = \mathcal V V^t$, 0 ≤ t ≤ T, given any terminal value v. These features include:

• the monotonicity of V^t in t on [0,T],
• the subinterval of [0,T] during which it is optimal to use a decision,
• the monotonicity of V̇^t on each such subinterval,
• the values of V^t at the ends of each such subinterval and
• the order in which the decisions are used.


[Figure 1 appeared here: left panel, graphs of V^· for four terminal values, obtained from the solution of −V̇^t = 𝒱V^t, 0 ≤ t ≤ T; right panel (landscape), the graph of 𝒱V.]

Figure 1. Graphs of V^· and 𝒱V for a Single-State Problem

the vertical axis of the left-hand graph to the same value on the vertical axis of the right-hand

graph and reading off . Using this information and starting with theZ œ Z œ < U Z† > > >Z , ,

given terminal value , construct the graph of . This is done in the left-hand side of Figure 1@ Z †

for four different terminal values.

To illustrate, consider the second largest , say, of the four terminal values mentioned@

above. For that terminal value, . Let be the value of for whichZ œ @ œ < U @ ! Z† X Z /$ $

< U œ < U >$ $ ( (/ / $ (, i.e., the value for which one is indifferent between using and . Now as

decreases from , decreases and the derivative increases as long as . Call the timeX Z Z Z  †> > > /

> œ Z œ Z7 / at which the . Thus is concave and increasing on the interval> switching time †

Ò ß X Ó > ! Z œ < U Z†

7 7. Now as decreases from to , the derivative decreases, but remains> >( (

positive, with remaining above the value at which the derivative of vanishes, i.e.,Z Z Z> ‡ †

Z 7 (Z œ ! Z Ò!ß Ó‡ . Thus is convex and increasing on the interval . And it is optimal to use on†

Ò!ß Ñ Ò ß X Ó7 $ 7 and on .

Actually, it is possible to give a closed form expression for the switching time . For at , we7 7

have

Z/ Zœ < U Z œ < U / < .? / @ œ / @$ $ $ $ $7

7

7 7 (X

!

U ? U ÐX Ñ U ÐX Ñ$ $ $ .

Solving this equation for yields7


7Z/

Zœ X U

@"$ lnŠ ‹

provided that . In the contrary event, it is optimal to use on the entire interval . Ob7 $  ! Ò!ß X Ó -

serve that is independent of .X X7

It is interesting to note that there are two zeros of , viz., and . Both are equilibria inZ Z Z‡ ‡

the sense that if equals either of them, then for all . But the first equilibrium@ Z œ @ ! Ÿ > Ÿ X>

is a stable attractor and the second is an unstable repeller. The first is so because for all @ Z‡,

Z Ä Z Z X Ä _! ‡‡ as . The second is so from what was just shown and the fact that for all

@ Z Z Ä _‡ !, .

Theorem 6. Single State: Each Decision Used for an Interval. If there is a single state, there is a maximum-T-period-value policy for which the set of times at which any given decision is used is a (possibly empty) interval and the maximum value V^t on [t,T] is monotone in 0 ≤ t ≤ T.

Proof. First show that V̇^t has constant sign on [0,T]. To that end, suppose 𝒱v ≥ 0 (resp., ≤ 0). Then −V̇^t = 𝒱V^t ≥ 0 (resp., ≤ 0) for all 0 ≤ t ≤ T because V̇^t is continuous in t and if there is a t for which V̇^t = 0, then V^τ = V^t for all 0 ≤ τ ≤ t.

Next observe that 𝒱V is piecewise linear and convex in V. Hence, for each δ ∈ Δ, the set of values V for which 𝒱V = r_δ + Q_δV is a subinterval of ℝ. Now combine these facts with Theorem 2. ∎

One important implication of this result is that the number of switching times is bounded above by |Δ| − 1.

7 EQUIVALENCE PRINCIPLE FOR INFINITE-HORIZON PROBLEMS [Ve69a]

This section studies the infinite-horizon version of the continuous-time-parameter problem.

This problem can, for most purposes, be reduced to solving an “equivalent” discrete-time-pa-

rameter problem. To see this, it is useful to begin by discussing transient policies in continuous

time.

Call a policy π transient if $\int_0^\infty \|P_\pi^t\|\,dt < \infty$. If π = (δ^t) is transient, its value is given by

$$V_\pi \equiv \int_0^\infty P_\pi^t r_{\delta^t}\,dt.$$

Now

$$I - e^{Q_\delta T} = -\int_0^T \frac{d}{dt}\big(e^{Q_\delta t}\big)\,dt = (-Q_\delta)\Big(\int_0^T e^{Q_\delta t}\,dt\Big).$$

For δ^∞ to be transient, it is necessary and sufficient that $e^{Q_\delta T}\to 0$ as T → ∞. The necessity is evident. For the sufficiency, observe that $I - e^{Q_\delta T}$ is nonsingular for large enough T, so the same is so of the two matrices in parentheses following the second equality in the last displayed equation. Thus, on premultiplying that equation by $-Q_\delta^{-1}$ and letting T → ∞, it follows that $0 \le \int_0^\infty e^{Q_\delta t}\,dt = -Q_\delta^{-1}$. Hence

$$V_\delta = \int_0^\infty e^{Q_\delta t}\,dt\;r_\delta = -Q_\delta^{-1}r_\delta.$$

Now on defining $P_\delta \equiv I + Q_\delta$, it follows that

$$V_\delta = (I - P_\delta)^{-1}r_\delta,$$

which is the value of δ^∞ for the equivalent discrete-time-parameter problem with P_δ and r_δ provided that P_δ ≥ 0. The nonnegativity of the P_δ can always be assured by taking the unit of time to be small enough, e.g., a minute rather than a day, in the continuous-time-parameter problem. A proportional change in the time unit necessitates a like change in the reward rate and the rate of addition of individuals to the population. In particular, on putting $\hat Q_\delta \equiv aQ_\delta$, $\hat r_\delta \equiv ar_\delta$ and t ≡ aτ where a > 0, and observing that the value V_δ of δ^∞ is independent of the choice of the time unit, it follows that

$$V_\delta = \int_0^\infty e^{aQ_\delta\tau}\,d\tau\;ar_\delta = -\hat Q_\delta^{-1}\hat r_\delta.$$

Thus if a > 0 is small enough, $\hat P_\delta \equiv I + \hat Q_\delta \ge 0$ and so $V_\delta = (I - \hat P_\delta)^{-1}\hat r_\delta$.
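A tiny numerical sanity check of this equivalence, with invented data: for a transient generator, −Q^{-1}r coincides with the discrete-time value (I − P̂)^{-1}r̂ after rescaling the time unit by any a > 0 small enough that P̂ = I + aQ ≥ 0.

```python
import numpy as np

Q = np.array([[-3.0, 1.0],
              [ 0.5, -2.0]])            # Q 1 < 0, so delta^infinity is transient
r = np.array([1.0, 4.0])

V_cont = np.linalg.solve(-Q, r)          # continuous-time value  -Q^{-1} r

a = 0.2                                  # rescaled time unit; I + a Q >= 0 here
P_hat = np.eye(2) + a * Q
V_disc = np.linalg.solve(np.eye(2) - P_hat, a * r)   # (I - P_hat)^{-1} (a r)
assert np.allclose(V_cont, V_disc)
```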

Equivalence of Discrete- and Continuous-Time-Parameter Problems. Suppose for this paragraph and the next that the time unit is chosen so P_δ = I + Q_δ ≥ 0 for each δ and that each δ^∞ is transient in the continuous-time-parameter problem. As discussed above, the last is so if and only if each δ^∞ is transient in the equivalent discrete-time-parameter problem with P_δ and r_δ. Moreover, for each δ, $V_\delta = -Q_\delta^{-1}r_\delta = (I - P_\delta)^{-1}r_\delta$ is the value of δ^∞ in both problems. Thus, in order to find a maximum or near-maximum value stationary policy for the continuous-time-parameter problem, it suffices to find such a policy for the equivalent discrete-time-parameter problem, e.g., by successive approximations, policy improvement or linear programming.

This equivalence is also useful for theoretical studies. For example, V* is the maximum value among the stationary policies in the continuous-time-parameter problem if and only if V* is the maximum value among the stationary policies in the equivalent discrete-time-parameter problem. Thus, V* is the unique fixed point of ℛ (= 𝒱 + I), or equivalently, the unique zero of 𝒱, i.e., 0 = 𝒱V* or

(1)   $0 = \max_{\delta\in\Delta}\,[r_\delta + Q_\delta V^*].$

This is consistent with the results for the t-period problem because if V^t = V* for all t ≥ 0, then V̇^t = 0, whence (1) reduces to Bellman's equation for the t-period problem, viz., $\dot V^t = \mathcal V V^t$ for all t ≥ 0.

Maximum Present Value. Suppose instead that each stationary policy δ^∞ is bounded, i.e., $P_\delta^t = O(1)$. In this event, interest often centers on present-value optimality criteria. For that case, observe from what was shown above that for each δ the finite present value of future population sizes with the nominal interest rate ρ > 0 per unit time is

$$\int_0^\infty e^{-\rho t}P_\delta^t\,dt = \int_0^\infty e^{(Q_\delta-\rho I)t}\,dt = (\rho I - Q_\delta)^{-1} = R_\delta^\rho$$

where $R_\delta^\rho$ is, of course, the resolvent of Q_δ. Also, as long as the time unit is small enough (which here requires common rescaling of the r_δ, Q_δ and ρ) so that P_δ = I + Q_δ ≥ 0 for all δ, then as shown on page 49, $R_\delta^\rho$ is also the present value of future population sizes in the equivalent discrete-time-parameter problem for all δ. Moreover, in both cases the present values of δ^∞ coincide and equal $V_\delta^\rho = R_\delta^\rho r_\delta$ for each δ. Thus both sets of maximum-present-value stationary policies coincide, and so do their present values. Also, the maximum present value V* among the stationary policies for the continuous-time-parameter problem is the unique zero of the analog of (1) formed by replacing Q_δ by Q_δ − ρI, viz.,

(2)   $0 = \max_{\delta\in\Delta}\,[r_\delta + Q_\delta V^* - \rho V^*].$

Thus, in order to find a maximum- or near-maximum-present-value stationary policy for the continuous-time-parameter problem, it suffices to find such a policy for the equivalent discrete-time-parameter problem, e.g., by successive approximations, policy improvement or linear programming. Also, the sets of strong present-value optimal (resp., n-optimal) stationary policies in the continuous- and equivalent discrete-time-parameter problems coincide. Thus, one such policy can be found for the continuous-time-parameter problem by solving the equivalent discrete-time-parameter problem, e.g., by the strong policy-improvement method or linear programming.

8 MAXIMUM PRINCIPLE [PBGM62], [LM67]

It is often useful to consider models in which there are infinitely many states. This section does

that for deterministic control problems whose transition laws are given by differential equations.

The goals are to develop Bellman’s equation and Pontryagin’s celebrated Maximum Principle

for these problems formally. By ‘formal’ is meant that differentiability of appropriate functions

is assumed, but not shown. The Maximum Principle is a necessary condition for optimality that is

also sufficient provided that the problem satisfies suitable convexity conditions, e.g., concavity of

the objective function and linearity of the transition law.

The problem is that of finding a piecewise-continuous control a^·: [0,T] → ℝ^m that maximizes the T-period value

(1)   $\int_0^T r(s^t, a^t)\,dt$

subject to

(2)   $\dot s^t = f(s^t, a^t),\quad s^0 = s_0$

and

(3)   $a^t \in A$

for 0 ≤ t ≤ T where s^·: [0,T] → ℝ^n and s_0 is the given initial state. The interpretation is that s^t ∈ ℝ^n is the state of the system at time t, a^t ∈ A ⊆ ℝ^m is the action chosen at time t, r(s^t, a^t) is the reward rate at time t, and (2) is the transition law governing the evolution of the system through the states.

In order to obtain the desired necessary condition for optimality, first formally generalize Bellman's equation

$$-\frac{\partial}{\partial t}V_s^t = \max_{a\in A}\,[r(s,a) + (Q_aV^t)_s],\quad V_s^T = 0,$$

for 0 ≤ t ≤ T and all s, to the present case where V_s^t is the maximum value in [t,T] starting in state s. The appropriate definition of (Q_aV)_s in the present case is the differential operator

$$(Q_aV)_s = \lim_{h\downarrow 0}\frac{1}{h}\big(V_{s+hf(s,a)} - V_s\big) = f(s,a)\nabla_s V_s$$

where $\nabla_s V_s = \big(\frac{\partial}{\partial s_i}V_s\big)$ is the gradient of V_s, because if (s^t, a^t) = (s,a), then from (2),

$$s^{t+h} = s^t + hf(s^t,a^t) + o(h).$$

Thus Bellman's equation becomes the nonlinear partial differential equation

(4)   $0 = \max_{a\in A}\Big[r(s,a) + f(s,a)\nabla_s V_s^t + \frac{\partial}{\partial t}V_s^t\Big],\quad V_s^T = 0,$

for 0 ≤ t ≤ T and all s. Now let g(s,a,t) be the term in brackets in (4), a^· be an optimal control and s^· be the corresponding optimal trajectory satisfying the differential equations (2). Then from (4),

(5)   $0 = \max_s g(s, a^t, t) = g(s^t, a^t, t)$

because g(s,a,t) ≤ 0 for all states s, actions a ∈ A and 0 ≤ t ≤ T. Thus

(6)   $\nabla_s g(s^t, a^t, t) = 0,$

so on suppressing (s^t, a^t, t),

$$0 = \nabla_s r + \nabla_s f\,\nabla_s V + f\,\nabla^2_{ss}V + \nabla^2_{st}V$$

where $\nabla^2_{ss} = \big(\frac{\partial^2}{\partial s_i\partial s_j}\big)$, $\nabla^2_{st} = \big(\frac{\partial^2}{\partial s_i\partial t}\big)$ and $\nabla_s f = \big(\frac{\partial f_j}{\partial s_i}\big)$. Thus since $\dot s = f$,


(7)   $\frac{d}{dt}\nabla_s V_{s^t}^t = \dot s^t\,\nabla^2_{ss}V_{s^t}^t + \nabla^2_{st}V_{s^t}^t = -\nabla_s r - \nabla_s f\,\nabla_s V_{s^t}^t.$

Now set $v^t \equiv \nabla_s V_{s^t}^t$ and observe that since V_s^T = 0 for all s, (7) becomes the linear adjoint differential equation

(8)   $\dot v^t = -\nabla_s r(s^t,a^t) - \nabla_s f(s^t,a^t)\,v^t,\quad v^T = 0$

for 0 ≤ t ≤ T, and (4) implies the maximum principle

(9)   $-\frac{\partial}{\partial t}V_{s^t}^t = \max_{a\in A}\,[r(s^t,a) + f(s^t,a)v^t] = r(s^t,a^t) + f(s^t,a^t)v^t$

for 0 ≤ t ≤ T. In order to check whether a control (a^t) satisfies (9), solve the ordinary differential equations (2) for (s^t) and then the linear differential equations (8) for (v^t). Finally, check whether a^t attains the maximum on the right-hand side of (9).

In general the conditions (2), (8), and (9) are necessary, but not sufficient, for optimality of a control (a^t) because (6) is necessary, but not sufficient, for (5). Thus, like the analogous Karush-Kuhn-Tucker-Lagrange conditions for finite-dimensional mathematical programs, the maximum principle is necessary, but not sufficient, for optimality. However, again like analogous results for finite-dimensional concave programs, the maximum principle is sufficient for optimality under suitable convexity hypotheses, e.g., when f(·,·) is linear, r(s,a) = p(s) + q(a) is concave, A is convex, and mild regularity conditions are fulfilled. By contrast, a suitable form of Bellman's equation (4) is necessary and sufficient for optimality just as various versions of it introduced throughout this course are also necessary and sufficient for optimality.

Example 6. Linear Control Problem. For the special case in which r and f are linear, i.e.,

$$r(s,a) = sc + ad\quad\text{and}\quad f(s,a) = sG + aD,$$

the adjoint equations (8) reduce to the linear differential equations with constant coefficients

$$\dot v^t = -c - Gv^t,\quad v^T = 0,$$

that are independent of the control and trajectory. Also the maximum principle simplifies to

$$\max_{a\in A} a(d + Dv^t) = a^t(d + Dv^t).$$

Observe that in this case, all that is required to find an optimal control is to solve the adjoint equations and then choose a = a^t to achieve the maximum on the left-hand side of the above equation.
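The recipe in Example 6 is easy to carry out numerically. The sketch below (not from the notes; the matrices, the action box A = [a_lo, a_hi] and the horizon are all hypothetical) integrates the adjoint equation backward by Euler's method and picks the maximizing action pointwise, which here is bang-bang since the objective is linear in a.

```python
import numpy as np

n, T, steps = 2, 1.0, 1000
G = np.array([[0.0, 1.0], [-1.0, 0.0]])
D = np.array([[1.0, 0.0]])            # f(s,a) = s G + a D, with a scalar (m = 1)
c = np.array([1.0, 0.0])
d = np.array([0.5])                   # r(s,a) = s c + a d
a_lo, a_hi = -1.0, 1.0                # action set A = [a_lo, a_hi]

dt = T / steps
v = np.zeros(n)                       # adjoint terminal condition v(T) = 0
controls = []
for _ in range(steps):                # march backward from t = T to t = 0
    coeff = d + D @ v                 # maximize a * coeff over the interval A
    controls.append(a_hi if coeff[0] >= 0 else a_lo)
    v = v + dt * (c + G @ v)          # v(t - dt) ~ v(t) - dt * v'(t),  v' = -c - G v
controls.reverse()                    # controls[k] ~ optimal action at time k * dt
```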


Example 7. Markov Population Decision Chain. Consider the S-state Markov branching decision chain discussed in the preceding section. In this event, s^t is the (row) vector of expected numbers of individuals in each state at time t. Then A = Δ, a = δ, s^0 = s_0,

$$r(s,a) = sr_\delta\quad\text{and}\quad f(s,a) = sQ_\delta.$$

The adjoint equations simplify to the linear differential equations with variable coefficients

$$\dot v^t = -r_{\delta^t} - Q_{\delta^t}v^t,\quad v^T = 0,$$

that depend on the control, but not the trajectory. Thus, the maximum principle simplifies to

$$\max_{\delta\in\Delta}\,[s^t r_\delta + s^t Q_\delta v^t] = s^t r_{\delta^t} + s^t Q_{\delta^t}v^t = -s^t\dot v^t.$$

Now this is so for all initial populations s^0 = s_0 ≥ 0, and hence all s^t ≥ 0 in the corresponding convex cone of dimension S. Thus it follows by factoring out s^t in the above equation that

$$-\dot v^t = \mathcal V v^t = r_{\delta^t} + Q_{\delta^t}v^t,$$

which is precisely Bellman's equation (6) of §3.4 for the continuous-time-parameter problem.

9 MAXIMUM PRESENT VALUE FOR CONTROLLED ONE-DIMENSIONAL DIFFUSIONS [Ma67, 68], [Pu74]

The goal of this section is to introduce the theory of controlled one-dimensional diffusions in

a manner analogous to the development of continuous-time-parameter Markov decision chains.

As in the previous section, the generators here are differential operators. Otherwise, as will be

seen in the sequel, the main features, formulas and results developed for Markov decision chains

carry over in the same form to controlled one-dimensional diffusions, albeit with a few new com-

plications. The development given below is analytic in the spirit of Feller, Volume I, and has the

advantage of leading quickly to computational methods. A more fundamental treatment would,

of course, require demonstrating that there exist stochastic processes with the properties described

here.

Diffusions

Roughly speaking, a with a compact intervalone-dimensional diffusion state spaceÖW À >   !×>

of real numbers , , is a Markov process in which for all small f œ Ò= ß = Ó _ = = _ 2 ! " ! "

! W W W, the conditional distribution of each increment given is approximately normal>2 > >

with mean and variance proportional to and with the proportionality constants depending on2

W>. (If f œ d ! " and the proportionality constants are and respectively, the process is a Wiener


process Brownian motion drift or .) More precisely, assume that there exist piecewise-continuous

and coefficients and , , with being bounded away from zero and bound-diffusion . 5 f 5=# #= == − !

ed above such that for each ,% !

Ð3Ñ œ 9Ð2ÑT Ð=ß .?Ñ ,(k k?=

2

%

Ð33Ñ œ 2 9Ð2ÑÐ? =ÑT Ð=ß .?Ñ ,(k k?= Ÿ

2=

%

.

and

Ð333Ñ œ 2 9Ð2ÑÐ? =Ñ T Ð=ß .?Ñ (k k?= Ÿ

# 2 #=

%

5

at continuity points of and where . Observe that asserts. 5# > > !T Ð=ß FÑ œ T ÐW − F ± W œ =Ñ Ð3Ñ

that the probability that the process deviates from any given initial state by more than in a%

small interval is small. Condition (resp., ) asserts that the truncated mean drift (resp.,Ð33Ñ Ð333Ñ

square deviation) of the process from any initial state in a small interval is essentially propor-=

tional to the length of the interval with the proportionality constant (resp., ) depending on .. 5=#= =

Incidentally, the diffusion coefficient has an alternate interpretation as the infinitesimal vari-

ance. To justify this interpretation, it suffices to show that and are equivalent to andÐ33Ñ Ð333Ñ Ð33Ñ

Ð333Ñ œ 2 9Ð2ÑÐ? = 2Ñ T Ð=ß .?Ñw #

?= Ÿ=

# 2= .(k k %

. 5

To see why this is so, observe that

( (k k k k?= Ÿ ?= Ÿ=

# 2 # 2

% %

Ð? = 2Ñ T Ð=ß .?Ñ œ Ð? =Ñ T Ð=ß .?Ñ.

. # 2 Ð? =ÑT Ð=ß .?Ñ 2 T Ð=ß .?Ñ. .=?= Ÿ ?= Ÿ

2 # # 2=( (k k k k% %

Now the next to last term on the right-hand side of this equation is by and the last9Ð2Ñ Ð33Ñ

term is because the integral is bounded below by zero and above by one. Thus, it follows9Ð2Ñ

from the above equation that and are equivalent to and as claimed. WhileÐ33Ñ Ð333Ñ Ð33Ñ Ð333Ñw

this formulation has intuitive appeal, it has the drawbacks that the conditions and areÐ33Ñ Ð333Ñw

not independent and condition is less convenient to work with.Ð333Ñw

The next step is to derive the generator Q of the process defined by

(1)   $(QV)_s \equiv \lim_{h\downarrow 0}\frac{1}{h}\Big[\int V_u\,P^h(s,du) - V_s\Big]$

for all bounded V with bounded piecewise-continuous V̈ and s an interior continuity point of V̈, μ and σ². It turns out that at such points

(2)   $(QV)_s = \tfrac12\sigma_s^2\ddot V_s + \mu_s\dot V_s.$

To see this, observe from - thatÐ3Ñ Ð333Ñ

œ ÐZ Z ÑT Ð=ß .?ÑÐUZ Ñ"

2= ? =

2Æ! ?= Ÿ

2lim (k k %

limœ ÒZ Ð? =Ñ Z Ð? =Ñ ÓT Ð=ß .?Ñ" "

2 #† ††

2Æ! ?= Ÿ= =

# 2(k k %

lim ÐZ Z ÑÐ? =Ñ T Ð=ß .?Ñ" "

2 #†† ††– —(

2Æ! ?= Ÿ=

# 2

k k %)

œ Z Z"

#†† †

5 .#= = = =

(with ) because , and hence the bracketed term above, can be made arbik k k k ¸ ¸) = Ÿ ? = Z Z†† ††

) = -

trarily small by choosing small enough.% !

The next step is to show that if the reward rate r_s earned per unit time in state s is piecewise continuous and bounded in s, then the expected present value of rewards, viz.,

$$V^\rho = \int_0^\infty e^{-\rho t}P^t r\,dt,$$

for ρ > 0 where $(P^t r)_s \equiv \int_{\mathcal S} r_u\,P^t(s,du)$, satisfies

(3)   $(\rho I - Q)V^\rho = r$

at interior continuity points of r, μ and σ².⁸ To see this, observe from (i) that $\lim_{h\downarrow 0}P^h r = r$ at such points. Thus

$$V^\rho = \int_0^h e^{-\rho t}P^t r\,dt + e^{-\rho h}P^h V^\rho = rh + (1-\rho h)P^h V^\rho + o(h).$$

Subtracting V^ρ, dividing by h and letting h ↓ 0 establishes (3). Thus if, as can be shown, V̈^ρ is bounded and piecewise continuous, it follows by substituting (2) into (3) that V^ρ satisfies the second-order linear differential equation

(4)   $\tfrac12\sigma_s^2\ddot V_s^\rho + \mu_s\dot V_s^\rho - \rho V_s^\rho = -r_s,\quad s\in(s_0,s_1)$

⁸Note that this equation is the same as the corresponding equation for Markov population decision chains except that here Q is a differential operator rather than a matrix.


at continuity points of , and . Of course is not the only solution to (4) because no bound-< Z. 5# 3

ary conditions have been specified.

The simplest type of boundary condition is with a terminal value at for .absorption @ = 3 œ !ß "3 3

This implies that

(5) , .Z œ @ 3 œ !ß "3= 33

It turns out that (4) has a unique solution subject to the (5). Wetwo-point boundary conditions

show this under the hypotheses that To and are piecewise constant. see this, put< ß= =#=. 5

Z ´ U ´ < ´ Z

Z†

! " !

# # #<Œ Î Ñ Î ÑÏ Ò Ï Ò

3

3 , and ,=# # #= = =

= ==3 .

5 5 5

and rewrite (4) as

(4) , .w= = = !ß "=Z œ < U Z = − Ð= = Ñ

In this form, (4) is evidently the differential equation for the period value in a two-state= =!

continuous-time-parameter Markov population chain where and are respectively the reward-< U = =

rate vector and transition-rate matrix units of time from the end of the process. With = =! this

in mind, let be the “reverse time” transition matrix from “time” to “time” . Since hasT = > U

=> =

positive 12 element for , so does for all in . Then>2=>= − T > =

f f

Z œ T Z

= = = =" " ! ! constant.

Hence the first component of is a strictly increasing affine function of . Thus since the firstZ Z

= =" !

component of is given as , the second component of can be chosen to assure the first com-Z @ Z

= ! =! !

ponent of equals . Therefore, (4) has a unique solution subject to the boundary conditions (5).Z @

= ""

Observe that the above argument remains valid even when , whence is the unique3 œ ! Z ´ Z !

solution to (4), (5) with . In that event, is the total expected reward earned start3 œ ! Z= ing from

state .=

It is useful now to apply these results to solve some simple problems associated with the Wiener

process with zero drift and unit diffusion coefficients.

Example 8. Probability of Reaching One Boundary Before the Other. What is the probability V_s of reaching s_1 before s_0 starting from s ∈ [s_0, s_1]? Then r_s = 0 on (s_0, s_1) and V satisfies

$$\tfrac12\ddot V_s = 0,\quad V_{s_0} = 0,\quad V_{s_1} = 1.$$

Thus V_s = α + βs, so 0 = α + βs_0 and 1 = α + βs_1, whence

$$V_s = \frac{s - s_0}{s_1 - s_0}.$$

Example 9. Mean Time to Reach Boundary. What is the mean time V_s until absorption at the boundaries starting from s ∈ [s_0, s_1]? Then r_s = 1 on (s_0, s_1) and V satisfies

$$\tfrac12\ddot V_s = -1,\quad V_{s_0} = V_{s_1} = 0.$$

Thus

$$V_s = -s^2 + \alpha + \beta s,\qquad 0 = -s_0^2 + \alpha + \beta s_0,\qquad 0 = -s_1^2 + \alpha + \beta s_1,$$

so

$$V_s = (s - s_0)(s_1 - s).$$
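Both closed forms are easy to confirm numerically. The finite-difference sketch below (not from the notes; the interval and grid are arbitrary) discretizes ½V̈ = −r with absorbing boundary values and recovers the two formulas.

```python
import numpy as np

s0, s1, n = 0.0, 2.0, 401
x = np.linspace(s0, s1, n)
h = x[1] - x[0]

def solve(reward, v0, v1):
    """Solve 0.5 * V'' = -reward(s) on (s0, s1) with V(s0) = v0, V(s1) = v1."""
    A = np.zeros((n, n)); b = np.zeros(n)
    A[0, 0] = A[-1, -1] = 1.0
    b[0], b[-1] = v0, v1
    for i in range(1, n - 1):                 # central second difference
        A[i, i - 1] = A[i, i + 1] = 0.5 / h**2
        A[i, i] = -1.0 / h**2
        b[i] = -reward(x[i])
    return np.linalg.solve(A, b)

hit_prob = solve(lambda s: 0.0, 0.0, 1.0)     # Example 8
mean_time = solve(lambda s: 1.0, 0.0, 0.0)    # Example 9
assert np.allclose(hit_prob, (x - s0) / (s1 - s0), atol=1e-8)
assert np.allclose(mean_time, (x - s0) * (s1 - x), atol=1e-6)
```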

Controlled Diffusions

A diffusion process is controlled by changing its drift and diffusion coefficients. Thus suppose

E ! is the finite set of available in any state, and that and are respectively theactions . 5=+#=+

drift diffusion coefficients reward rate and and is the when one chooses action in state . As< + ==+ -

sume that , and are each piecewise constant in and is bounded away from zero. 5 5=+ =+# #=+ =+ ! < =

and bounded above for each . Also assume that there is absorption with terminal rewards+ − E

@ @ = = !! " ! " and at the two finite boundaries . When the interest rate is , earning the term-3

inal reward at the boundary is equivalent to earning per unit time there forever.@ = < ´ @3 3 = + 33 3

A is a piecewise constant function on the interval with being the actiondecision $ $Ò= ß = Ó! "=

that chooses when the process is in state . When using , the is $ $= − Ò= ß = Ó < œ! " reward function $

Ð< Ñ = − Ò= ß = Ó U= ! "$ $= for all . Also, in the interior of the state space, the is defined bygenerator

(6) , ÐU Z Ñ œ Z Z = − Ð= ß = Ñ"

#†† †

$ $$= = = = ! "#=5 .= =

at continuity points of , , and . Since there is absorption at the boundaries, it fol-= Z††

= == #

=$ . 5$ $= =

lows from the definition (1) of the generator that is defined at the boundaries byU$

(7) , .ÐU Z Ñ œ ! = œ = ß =$ = ! "


A entails using at each point in time. If the goal is to find a stationarystationary policy $ $_

policy that maximizes the expected present value of the rewards earned over an infinite time in-

terval, Bellman’s equation becomes

(8) max! œ Ò< U Z Z Ó# ?

# #3 3

−3

where is the set of decisions and is the maximum expected present value of rewards. Note? Z 3

that with this formulation, the boundary conditions are absorbed into (8) since its restriction to

the boundary states = @ Z ´ !3 3 = are, in view of (7), equivalent to for , or what is the3 3 33

3 œ !ß "

same thing, , for . To see this, let be the nonnegative linear operator that as-Z ´ @ 3 œ !ß " V3 3$= 33

9

signs to the unique solution of the equation< Z œ Z$3

$

(9) .Ð M U ÑZ œ <3 $ $

Thus, and is invertible with . This leads to the following versionZ œ V < V ÐV Ñ œ M U3 3 3 3$ $ $ $$ $

" 3

of the Comparison Lemma,

Z Z œ V Ò< ÐV Ñ Z Ó œ V K3 3 3 3# # # #

3 3 3$ $ #$#

"

where the is defined bycomparison function

K ´ < U Z Z3 3 3#$ $ $# # 3 .

Thus, maximizes over if for all since (9) can be rewritten as .# $ ? # ?œ Ÿ ! − œ !Z K K3#

3 3#$ $$

Hence Bellman’s equation (8) holds. On substituting the definition (6) and (7) of the differential

operator into (8), observe that (8) becomesU#

(8) max , w

+−E=+ =+ ! "

#=+ = = =! œ Ò< Z Z Z Ó = − Ð= ß = Ñ

"

#†† †

5 . 33 3 3

at interior continuity points of , and , and. 5 <

(8) , ww= 3Z ´ @ 3 œ !ß "3

3

at the boundaries.

Also, it can be shown that there is a stationary optimal policy with piecewise constant,$ $_

even if the data , , and are piecewise constant in . The method of doing this entailsE < ==+ =+#=+5 .

first reducing the infinite-horizon controlled-diffusion problem to an equivalent finite-horizon

( period) two-state continuous-time-parameter Markov population decision chain asX œ = =" !

9Here again the equation (8) is the same as that for the corresponding Markov population decision chainexcept that here the generators are differential operators rather than matrices.U#


discussed above (see (4) ). Then show that a piecewise-constant (now in time) policy is optimalw

for the latter problem by using arguments like those given in §3.5—especially the last paragraph

thereof.

Example 10. Maximizing the Probability of Accumulating Given Wealth. Consider an investor who seeks to maximize his probability of accumulating a million dollars before going bankrupt starting with s million dollars where 0 ≤ s ≤ 1. Denote by V_s the desired maximum probability. Assume that the investor's wealth is a controlled diffusion with drift coefficient μ and diffusion coefficient σ² > 1 or 1 respectively according as the investor acts boldly or timidly.

Now V is the unique solution of Bellman's equation (with ρ = 0)

(10)   $0 = \big(\tfrac12\sigma^2\ddot V_s + \mu\dot V_s\big)\vee\big(\tfrac12\ddot V_s + \mu\dot V_s\big) = \big(\tfrac12\sigma^2\ddot V_s\big)\vee\big(\tfrac12\ddot V_s\big) + \mu\dot V_s$

for s ∈ (0,1) subject to the boundary conditions V_0 = 0 and V_1 = 1. Observe from (10) that bold (resp., timid) investing is optimal if and only if V̈_s ≥ 0 (resp., V̈_s ≤ 0) on [0,1], i.e., the investor's maximum probability V is convex (resp., concave) on [0,1]. This is so if and only if μ ≤ 0 (resp., μ ≥ 0). Here is the argument for bold investing; the other case is similar. Bold investing is optimal if and only if

(11)   $0 = \tfrac12\sigma^2\ddot V_s + \mu\dot V_s,\quad s\in(0,1),$

V_0 = 0 and V_1 = 1. On rewriting (11) as

$$\frac{\ddot V_s}{\dot V_s} = -\frac{2\mu}{\sigma^2}$$

and then integrating gives

$$\tfrac12\sigma^2\ln\dot V_s = -\mu s + \theta$$

for s ∈ [0,1] and some θ. Thus V̇_s is nondecreasing, i.e., V̈_s ≥ 0, if and only if μ ≤ 0 as claimed.

This result has the following intuitive explanation. If the investor earns the mean return, the

investor would eventually accumulate a million dollars if the drift is positive and eventually go

bankrupt if the drift is negative. Consequently, if the drift is positive, timid investing is the better

option since it has higher probability of keeping the wealth close to the mean and this is all that

is needed to reach a million dollars in this case. On the other hand, if the drift is negative, bold

investing is the better option since it has higher probability of raising the wealth significantly

above the mean which is needed to accumulate a million dollars in this case.


Appendix

Functions of Matrices

The purpose of this appendix is to summarize a few useful facts about functions of matrices, especially the Jordan form and its applications to matrix power series, matrix exponentials, etc.

1 MATRIX NORM

A norm ‖P‖ of a complex matrix P = (p_ij) is a real-valued function that satisfies 1°–4° below. A norm of a matrix is a measure of its distance from the null matrix. Throughout, all norms are presumed to satisfy 5° as well. One example is the Tchebychev norm $\|P\| = \max_i\sum_j |p_{ij}|$. Indeed, the term "norm" means the Tchebychev norm in the main text unless stated otherwise.

1° ‖P‖ ≥ 0.

2° ‖P‖ = 0 if and only if P = 0.

3° ‖λP‖ = |λ|‖P‖ for each complex number λ. (homogeneity)

4° ‖P + Q‖ ≤ ‖P‖ + ‖Q‖ if the matrix sum is well defined. (triangle inequality)

5° ‖PQ‖ ≤ ‖P‖‖Q‖ if the matrix product is well defined. (Schwarz inequality)


2 EIGENVALUES AND VECTORS

An eigenvalue of a square complex matrix P is a complex number λ such that λI − P is singular. The spectrum σ(P) of P is the set of its eigenvalues. The number of eigenvalues, allowing for multiplicities, equals the order of P. The spectral radius σ_P of P is the radius of the smallest circle in the complex plane having center at the origin and containing the spectrum of P. For each eigenvalue λ of P, there is a nonnull vector v for which λv = Pv. Call v an eigenvector of P.

As an application of these definitions, note that $\sigma_P = \max\{|\lambda| : \lambda\in\sigma(P)\}$ and $\sigma_P \le \|P\|$. To see the last fact, let λ be an eigenvalue of P and v be an associated eigenvector. Then since λv = Pv,

$$|\lambda|\,\|v\| = \|\lambda v\| = \|Pv\| \le \|P\|\,\|v\|,$$

whence on dividing by ‖v‖ ≠ 0, the result follows.

3 SIMILARITY

A square complex matrix is to a square complex matrix of like order if there isT Usimilar

a nonsingular matrix such that . The similarity relation is evidently an X T œ X UX " equiv-

alence relation, i.e., is transitive, reflexive and symmetric. More important is the fact that two

similar matrices have the spectrum. And if is similar to , is similar to . Moresame T U T UR R

generally, if is absolutely convergent, then so is , and is similar0ÐT Ñ ´ + T 0ÐUÑ 0ÐT Ñ!_Rœ! R

R

to . Also is absolutely convergent if that is so of . It turns out that among0ÐUÑ 0ÐT Ñ 0Ð T Ñl leach equivalence class of similar matrices, there is a matrix having a particularly simple form and

whose properties it is easier to study. That matrix is the .Jordan form

4 JORDAN FORM

Call a square complex matrix if for some . A of or-T T œ ! R   "nilpotent Jordan matrixR

der is an matrix of the form where is a complex number and is theW W ‚ W N N ´ M I I- -

matrix having ones just above the diagonal and zeroes elsewhere. For example, if ,W œ %

I œ

Î ÑÐ ÓÐ ÓÐ ÓÏ Ò

0 1 0 00 0 1 00 0 0 10 0 0 0

Observe that has the following properties:N œ M I-

1 is the unique eigenvalue of ;‰ - N

2 (so is nilpotent);‰ WI œ ! I

3 , ; and‰ R R5 5W"5œ!N œ I R   W! Š ‹R

5 -

4 if and only if .‰ RN Ä ! "k k-


Property 3 follows from 2 and the binomial theorem‰ ‰

ÐT UÑ œ T UR

5R R5 5

R

5œ!

"Π,

which is valid for square matrices that , i.e., , as do and . PropertyT ß U T U œ UT M Icommute

4 follows from 3 . Notice that is a polynomial in of degree . Thus 3 implies that‰ ‰ ‰Š ‹R5 R 5

N œ SÐR ÑR W" R- .

A square complex matrix is in if it is block-diagonal and each diagonal block isJordan form

a Jordan matrix. The spectrum of a matrix in Jordan form is simply the union of the spectraT

of its diagonal blocks, say , where for all . ThenN ß á ß N N ´ M I 3" 5 3 3-

T œ

N !

N

ä

! N

Î ÑÐ ÓÐ ÓÐ ÓÏ Ò

"

#

5

and is the spectrum of .Ö ß á ß × T- -" 5

The following representation theorem for square complex matrices in terms of matrices in Jordan form is of great importance. It enables many questions about square complex matrices to be reduced to the same questions about matrices in Jordan form.

Theorem 1. Jordan Form. Every square complex matrix is similar to a matrix in Jordan form.

Corollary 2. Transient Matrices. If P is a square complex matrix, then P^N → 0 if and only if σ_P < 1.

Proof. The result follows from 4° above for Jordan matrices and so for P by Theorem 1. ∎
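Corollary 2 turns the question of whether the powers of a matrix vanish into a spectral-radius computation. A short Python/NumPy check (the two example matrices are made up for illustration):

    import numpy as np

    def is_transient(P):
        # Corollary 2: P^N -> 0 iff the spectral radius of P is less than 1.
        return max(abs(np.linalg.eigvals(P))) < 1

    P1 = np.array([[0.4, 0.3], [0.2, 0.4]])    # spectral radius about 0.64
    P2 = np.array([[0.6, 0.5], [0.5, 0.6]])    # spectral radius 1.1

    for P in (P1, P2):
        print(is_transient(P), np.abs(np.linalg.matrix_power(P, 200)).max())
        # Entries of P^200 are negligible exactly when is_transient(P) is True.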

Corollary 3. Nilpotent Matrices. If P is an S × S complex matrix, the following are equivalent.

1° P is nilpotent.

2° P^S = 0.

3° σ(P) = {0}.

Proof. The result follows easily from 3° above for Jordan matrices and so is true for P by Theorem 1. ∎

5 SPECTRAL MAPPING THEOREM

The next result asserts that the spectrum of a matrix power series is the set of complex numbers that one obtains by replacing the matrix in the power series by each of its eigenvalues. If f is a function and S is a subset of its domain, let f(S) be the image of S under f, i.e., f(S) = {f(λ) : λ ∈ S}.


Theorem 4. Spectral Mapping. If P is a square complex matrix and f(P) ≡ Σ_{N=0}^∞ a_N P^N is absolutely convergent, then f(λ) is absolutely convergent for each λ ∈ σ(P) and σ(f(P)) = f(σ(P)).

Proof. By Theorem 1, it suffices to prove the result for the Jordan form of P, and hence for each of its Jordan blocks J_λ = λI + E, λ ∈ σ(P). Now J_λ^N is upper triangular with common diagonal element λ^N, so f(J_λ) is upper triangular with common diagonal element f(λ). Thus if f(J_λ) is absolutely convergent, then f(λ) is absolutely convergent and is the unique eigenvalue of f(J_λ), so σ(f(J_λ)) = f(σ(J_λ)). ∎
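The theorem can be illustrated with a matrix polynomial, say f(P) = I + 2P + 3P², applied to an upper-triangular P so that the eigenvalues can be read off the diagonal (Python/NumPy sketch; the matrix and the polynomial are arbitrary choices, not from the notes):

    import numpy as np

    P = np.array([[0.5, 0.2, 0.1],
                  [0.0, 0.4, 0.3],
                  [0.0, 0.0, 0.9]])

    f = lambda x: 1 + 2 * x + 3 * x**2              # scalar version of f
    fP = np.eye(3) + 2 * P + 3 * (P @ P)            # matrix version f(P)

    # sigma(f(P)) = f(sigma(P))
    assert np.allclose(np.sort(np.linalg.eigvals(fP)),
                       np.sort(f(np.linalg.eigvals(P))))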

6 MATRIX DERIVATIVES AND INTEGRALS

Often interest centers on a matrix function X(·) = (x_ij(·)) of real-valued functions x_ij on an interval T of real numbers with X(t) being m × n for each t ∈ T. The matrix derivative of X at t, denoted (d/dt)X(t) (or Ẋ(t)), is the matrix ((d/dt)x_ij(t)) (or (ẋ_ij(t))) of first-order derivatives of the x_ij. Similarly, the matrix integral of X, denoted ∫X(t)dt, is the matrix (∫x_ij(t)dt) of integrals of the x_ij. Matrix-valued functions inherit many properties of real-valued functions. It is easy to verify the following facts. If X(t) and Y(t) are matrix-valued functions, then

    (d/dt)(X(t) + Y(t)) = (d/dt)X(t) + (d/dt)Y(t)   and   ∫(X(t) + Y(t))dt = ∫X(t)dt + ∫Y(t)dt

provided the matrices are of the same size. Also,

    (d/dt)(X(t)Y(t)) = ((d/dt)X(t))Y(t) + X(t)(d/dt)Y(t)

provided the number of columns of X(t) equals the number of rows of Y(t). Finally,

    (d/dt) ∫_0^t X(u)du = X(t).

7 MATRIX EXPONENTIALS

There is a powerful and useful method of extending the definition of a function f(x) of a variable x with a Maclaurin expansion to a function f(Q) of a square matrix Q. The method entails substituting Q^N for x^N for each N in the Maclaurin expansion of f(x), provided that the Maclaurin expansion of f(Q) is absolutely convergent.

An important example of this method of defining a matrix function is the matrix exponential. The function e^x of the variable x has the Maclaurin expansion e^x = Σ_{N=0}^∞ (1/N!) x^N. Hence one can define the matrix exponential e^Q of the square matrix Q by

    e^Q ≡ Σ_{N=0}^∞ (1/N!) Q^N


because the Maclaurin series defining e^Q is absolutely convergent. Thus, from the Spectral Mapping Theorem, σ(e^Q) = e^{σ(Q)}. If also P is a square complex matrix of the same order as Q and commutes therewith, it follows easily from the Binomial Theorem in §A.4 above that

(1)  e^{P+Q} = e^P e^Q = e^Q e^P.

One consequence is that e^Q is nonsingular and its inverse is e^{−Q} because e^Q e^{−Q} = e^0 = I = e^{−Q} e^Q. Also, (d/dt) e^{Qt} = Q e^{Qt} = e^{Qt} Q.
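A Python sketch (NumPy/SciPy; not part of the notes) comparing the series definition of e^Q with a library routine and checking (1) for a pair of commuting matrices, where P is taken to be a polynomial in Q so that PQ = QP:

    import numpy as np
    from math import factorial
    from scipy.linalg import expm

    Q = np.array([[0.0, 1.0], [-2.0, -3.0]])
    P = 0.5 * Q + 0.1 * (Q @ Q)              # a polynomial in Q, so P and Q commute

    # e^Q from the (truncated) Maclaurin series agrees with scipy's expm.
    series = sum(np.linalg.matrix_power(Q, N) / factorial(N) for N in range(30))
    assert np.allclose(series, expm(Q))

    # (1): e^(P+Q) = e^P e^Q = e^Q e^P, and e^Q has inverse e^(-Q).
    assert np.allclose(expm(P + Q), expm(P) @ expm(Q))
    assert np.allclose(expm(P + Q), expm(Q) @ expm(P))
    assert np.allclose(expm(Q) @ expm(-Q), np.eye(2))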

8 MATRIX DIFFERENTIAL EQUATIONS

Matrix exponentials are important because they provide an elegant and useful tool for studying the solution of a system of linear differential equations.

Theorem 5. Solution of Matrix Differential Equation. If Q is an m × m real matrix and q is a continuous function from the nonnegative real line to ℝ^{m×k}, then the system of matrix linear differential equations

(2)  Ẋ(t) = q(t) + QX(t),  X(0) = X_0

has a unique continuously differentiable solution, viz.,

(3)  X(t) = ∫_0^t e^{Q(t−u)} q(u) du + e^{Qt} X_0.

In particular, if k = m, q = 0 and X_0 = I, then X(t) = e^{Qt}.

Proof. To show that (3) satisfies (2), rewrite (3) as

    X(t) = e^{Qt} [ ∫_0^t e^{−Qu} q(u) du + X_0 ].

Then differentiate both sides of this equation. Conversely, if X and X̂ are two solutions of (2), then W ≡ X − X̂ satisfies Ẇ(t) = QW(t), W(0) = 0. Thus

    (d/dt) e^{−Qt} W(t) = −Q e^{−Qt} W(t) + e^{−Qt} Ẇ(t) = 0,

so e^{−Qt} W(t) is constant in t > 0. But that constant is 0 at t = 0, so e^{−Qt} W(t) = 0. Hence W = 0, and so X = X̂. ∎
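To see Theorem 5 at work numerically, consider the homogeneous case q = 0, X_0 = I, whose solution is X(t) = e^{Qt}. The Python/SciPy sketch below (the matrix Q, horizon and step size are arbitrary illustrative choices) compares e^{Qt} with a crude Euler integration of Ẋ = QX:

    import numpy as np
    from scipy.linalg import expm

    Q = np.array([[0.0, 1.0], [-2.0, -3.0]])
    t_end, steps = 2.0, 100_000
    dt = t_end / steps

    X = np.eye(2)                  # Euler integration of Xdot = Q X, X(0) = I
    for _ in range(steps):
        X = X + dt * (Q @ X)

    # Theorem 5 (homogeneous case): the exact solution is e^(Qt).
    assert np.allclose(X, expm(Q * t_end), atol=1e-3)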

The next result is a continuous-time analog of Corollary 2.


Lemma 6. Transient Matrices in Continuous Time. If Q is a square complex matrix, the following are equivalent.

1° e^{Qt} → 0 as t → ∞.

2° σ_{e^Q} < 1.

3° |e^λ| < 1 for all λ ∈ σ(Q).

4° The real parts of the eigenvalues of Q are negative.

Proof. 1° ⟺ 2°. Since e^{QN} = (e^Q)^N for N = 0, 1, ... by (1), the claim follows from Corollary 2 when t runs through the integers. It remains to show that e^{Q[t]} → 0 implies e^{Qt} → 0 where [t] is the integer part of t. Evidently, e^{Qt} = e^{Q[t]} e^{Q(t−[t])}, so because ‖e^{Qu}‖ ≤ Σ_{N=0}^∞ (1/N!) ‖Q‖^N ≤ e^{‖Q‖} for 0 ≤ u ≤ 1,

    ‖e^{Qt}‖ = ‖e^{Q[t]} e^{Qu}‖ ≤ ‖e^{Q[t]}‖ sup_{0≤u≤1} ‖e^{Qu}‖ ≤ ‖e^{Q[t]}‖ e^{‖Q‖},

from which the claim follows.

2° ⟺ 3°. Immediate from σ(e^Q) = e^{σ(Q)}.

3° ⟺ 4°. Immediate from standard properties of complex exponentials. ∎
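Lemma 6 reduces the continuous-time transience question to the signs of the real parts of the eigenvalues of Q. A short Python check (NumPy/SciPy; the example matrices are made up):

    import numpy as np
    from scipy.linalg import expm

    def transient_ct(Q):
        # Lemma 6: e^(Qt) -> 0 as t -> infinity iff every eigenvalue has negative real part.
        return max(np.linalg.eigvals(Q).real) < 0

    Q1 = np.array([[-1.0, 2.0], [0.0, -0.5]])    # eigenvalues -1 and -0.5
    Q2 = np.array([[0.1, 0.0], [1.0, -2.0]])     # eigenvalues 0.1 and -2

    for Q in (Q1, Q2):
        print(transient_ct(Q), np.abs(expm(Q * 50.0)).max())
        # Entries of e^(50Q) are near zero exactly when transient_ct(Q) is True.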


Index of Symbols (appearing on more than two successive pages).

Syntax: symbol, page number, meaning.

+, 1, actionE ==, 1, actions in state ‡, 67, convolution

,R8 , 69, binomial coefficient seq.

", 40, discount factorì , 67, middle element of convolution

G R =R= , 4, -period cost in state

.1, 26, degree of 1

.$, 27, degree of $_

., 27, system degreeH, 48, deviation matrixH$, 50, deviation matrix for $$ $ $, 14, decision œ Ð Ñ=

?, 14, set of decisions$ $ $_, 14, stationary policy Ð ß ß á Ñ?_ß 14, set of policies?8, 57, 8-optimal decisions?_, 51, -optimal decisions_ƒ8"

‡ " 8 8", 69, ÐT H â Ð"Ñ H Ñ

/ UU R_Rœ!

"Rx, 124, matrix exp. !

1 ß < U @ @8 8 8 8"#$ $ $# #53,

K>#1‡ ß 102, < U Z Z

†# # 1 1

> >‡ ‡

K Ð1 â 1 Ñ8 . 8#$ #$ #$, 55,

K , ì K57 5 77#$ , 72, #$

K ß ß < T Z Z#) # # ) )19K Ð1 ß 1 ß á Ñ#$ #$ #$, 53, . ."

K 13#$ #$, 53,!_

8œ.8 83

Z, 12, system graphZ, 102, optimal-return generator¦ , 10, strictly greater than¤ ß 51, lexico greater than or equal to

.=, 114, drift coefficient

.=+, 117, drift coefficient, controlled

8-optimal, 578-improvement, 57l l† , 18, norm (Tchebychev)R -period value, 3

" 88, 67, binomial sequence of order "8, 68, binomial immigration order 8" R "R

8 8, 67, coefficient of

:Ð>l=ß +Ñ, 1, discrete-time trans. rateT$, 14, transition matrix for $T R TR >2

$ $, 14, power of T RR

1 , 14, -step trans. matrix for 1T ß T‡ 47, stationary matrix for T >>

$ , 100, -period trans. matrix for $T >>

1, 101, -period trans. matrix for 1, 69, Ð! T T â Ñ ! "

$, 68, Ð! T T â Ñ ! "$ $

1, 14, policy1 1P, 71, use first decisions of P1 1P, 71, exclude first decisions of P1 $ 1 $P P _, 71, use then

;Ð?l=ß +Ñ, 97, cont-time trans. rateU T M$ $, 50, discrete timeU$, 100, cont.-time generator for $

<Ð=ß +Ñ, 1, one-period reward<$ , 14, one-period reward vector for $< < 8 œ ! 8 Á !8

# #, 53, if ; 0 if V Ð M UÑ œ3, 49, 3 " !_

Rœ!R" R" T

V3$ , 50, Ð M U Ñ œ T3 "$

" R" R_Rœ!

!$

e, 15, optimal-return operatore eR >2, 15, “power of R ”

=, 1, stateW , 1, number of statesf, 1, state space5, 27, system spectral radius

5T , 26, spectral radius of T5$ $, 27, spectral radius of T5#

= , 114, diffusion coefficient5#

=+, 117, diffusion coef., controlled

? T @7 7 7$ $ $, 71,

Z$, 18, discrete-time value of $_

Z$, 108, continuous-time value of $_

Z Ð@ @ â Ñ$ $ $, 51, . ."

@ T < ß 8 œ " Ð"Ñ H < 8   !8 ‡ 8 8"$ $ $ $$, 50, ; ,

Z Ð@ â @ Ñ8 . 8$ $ $, 55,

Z ß RR _$ 15, -period value of $

Z ß RR1 15, -period value of 1

Z Ð?ÑR1 , 15, Z T ?R R

1 1

Z R =R= , 3, -period value from

Z ‡, 21, maximum valueZ1, 18, discrete time value of 1Z1, 108, continuous time value of 1Z ß Ò>ß X Ó>

1 102, value of during 1Z ß3

1 41, present value of 1Z ß

3$ 50, present value of $_

Z ßRA1 64, R A-period value of with 1

Z ßR81 68, R "-period value of with 1 8

Z Ð?Ñ Z T ?R7 R7 R1 1 1, 71,

•1, 68, ÐZ ÑR1 R -period values of 1

•$, 68, ÐZ Ñ RR$ -period values of $

•81, 68, " ‡8 •1

A ÐA Ñ, 29, initial immigration =

A ÐA Ñ, 63, immigration stream R

[ Ð[ Ñ, 75, cum. immigration R

j8, 76, set of immigration streams

B=+, 31, state-action frequency


Index

+,E=,,‡action, 1application,

accumulating wealth, hw0airline overbooking, hw2airline reservation,asset management,baking, hw8Bayesian statistical quality control and repair, hw5bridge clearance, hw1capacity planning,capital budgeting,cash management,chain selection,component replacement, hw4customers served,deliveries on-time,earning and learning, hw9exercising a call option, 0exercising a put option, hw1factory throughput,fatality reduction,house buying, hw4house selling, hw9insurance management,investment,knapsack problem,lot inspection,marketing,matrix products (order of multiplying), hw1medical screening,multi-armed bandit,multi-facility linear-cost production planning, hw3network routing,patients treated successfully,police patrolling,portfolio selection, hw2probability of reaching boundary,project scheduling,queues,

tandem, hw5network of,

reliability,requisition processing, hw1reservoir management,restart problem,road maintenance,rocket guidance,scheduling,search for a plane, hw2

sequencing, hw2supply management,test a machine to find a fault, hw2time to reach boundary, hw5transportation,yield of production process,yield of renewable resource,

,R7 ,

",•Bayesian, hw5Bellman's equation,binary preference relation,binomial coefficient,binomial immigration stream of order ,8binomial sequence of order ,8binomial theorem,bounded,

polynomially,branching,Brownian motion,

certainty equivalent,Cesàro limit,Cesàro-geometric-overtaking optimal, hw7Cesàro Neumann series,Cesàro-overtaking optimal,Cholesky factorization,choose function,combinatorial property,comparison function,

present-value,comparison lemma,

R -period-value,present-value,

concave function,contraction mapping,controlled diffusion,convex function,convolution,

.,

.$,

.1,H$,$,?,$_,?_ß?8,?_,?8

„,


?„,ƒ8,decision,deficient point,degree of a matrix,degree of a policy, 26degree of a sequence,degree of a system, 27determines,deviation matrix,diffusion,

controlled,diffusion coefficient,discount factor,discovering system boundedness,discovering system transience, hw3drift coefficient,dynamic-programming recursion,

/U,eigenvalue,eigenvector,equivalence principle,excessive point,

least,expansion,

Cesàro Neumann,Laurent,Maclaurin,Neumann,polynomial,

fixed point,

1 ß8#$

K>#1ß

K8#$,

K57#$ ,

K Z# ,K#$,K3

#$,Z,generator for deterministic control problem,generator for one-dimensional diffusion,generator matrix,graph,

incidence,minimum-cost chain in,system,

horizon,finite,infinite,rolling,

immigration stream,binomial,

combining physical and value,cumulative,order of binomial,physical,present value of,value,

immigration streams in ,j8

improves,8-,strongly,

index ruleGittens,multi-armed-bandit,task sequencing, hw2

inflation,instantaneous rate-of-return, 8interest rate,

negative,positive,real after-tax,small positive,zero,

irredundant dual,irredundant primal,

Jordan form,

Laurent expansion,Laurent expansion of present value,Laurent expansion of present-value comparison function,Laurent expansion of resolvent,lexicographic comparison,linear optimal decision function,linear program,

dual,finding maximum reward-rate with a, hw6finding maximum stopping-value with a,finding maximum value with a,irredundant dual,irredundant primal,primal,

.=,

.=+,minimum balloon payment,maximum expected instantaneous rate-of-return,maximum expected multiplicative symmetric utility,maximum growth factor,maximum -period value,Rmaximum reward rate,maximum reward rate by linear programming, hw6maximum present value,maximum principle,maximum spectral radius, hw7maximum -period value,Xmaximum value,Markov population decision chain,

finite continuous-time-parameter,finite discrete-time-parameter,


mathematical program,matrix,

characterization of transient,degree of,deviation,generator,Jordan,nilpotent,nonnegative,R -step transition,permutation,positive definite,positive semidefinite,resolvent,stationary,similarity to a,stochastic,strictly substochastic,substochastic,symmetric,transient,transition,upper-triangular,

matrix derivative,matrix differential equation,matrix exponential,matrix integral,matrix norm,multi-armed bandit,

Neumann series,Newton’s method,

linear-time approximation with, hw3R -period value, 3

"8,"R

8 ,on-line implementation,

multi-armed bandit,sequential quadratic unconstrained team,

optimal policy,Cesàro-geometric-overtaking-, hw7Cesàro-overtaking-,Cesàro-symmetric-multiplicative-utility-growth-,float-,future-value-,growth factor,index, hw2instantaneous-rate-of-return-,limiting present-value-, hw6moment-,myopic stopping, hw4R -period-value-,R -period-expected-utility-,nonexistence of an overtaking-,overtaking-,present-value-,reward-rate-,spectral-radius-, hw7

stopping-, hw4strong Cesàro-overtaking-,strong present-value-,total instantaneous-rate-of-return-,value-,

optimal policy with immigration streamCesàro-overtaking-,limiting present-value, hw5overtaking-,present-value,strong present-value,

optimal quadratic control,optimal-return generator,optimal-return operator,

:Ð>l=ß +Ñ,T$,Ts $,T R

$ ,T R

1 ,T ߇

T >$ ,

T >1,

,$,l lT ,1,1P,1P,1 $P ,1 1> ‡ß<8

„$ ,G8

„$ ,<$„,G$„,Perron-Frobenius Theorem, hw7policy,

cohort,deterministic,eventually stationary,initially randomized,Markov,nonanticipative,nonstationary,periodic,piecewise-constant,randomized,stationary,stopping, hw4transient,

policy-improvement method,limiting present-value,maximum reward-rate,8-optimality,Newton’s method: a special case of,simplex method: a special case of,strong,

polynomially bounded,


present value,inflation-adjusted,Laurent expansion of,maximum,strong maximum,

principle of optimality,program,

linear,mathematical,quadratic,stochastic,unconstrained,

;Ð?l=ß +Ñ,U$,quasiorder,

<Ð=ß +Ñ, 1<$ ,<8

$ ,V3

$ ,e,

deficient point of,excessive point of,fixed point of,

eR ,random time,recursion,

backward,forward,

resolvent,Laurent expansion of,

restart problem,reward, 1reward-rate,reward-rate vector,reward vector,Ricatti recursion,risk aversion,risk posture,

constant additive,constant multiplicative,

risk preference,

=,W , 1f,5,5T ,5$,5#

= ,5#

=+,series,

Laurent,Maclaurin,Neumann,

stochastic constraint,similar,

sojourn time,spectral mapping theorem,spectral radius,spectrum,state, 0state-action frequency,stationary matrix,stochastic program,stopping policy, hw4stopping problem, hw4

simple, hw4stopping time,stopping value, hw4successive approximations,

linear-time approximation with, hw3system, 1

bounded, 46circuitless, 13deterministic, 49discovering boundedness of a, hw7discovering transience of a, hw3irreducible, hw7normalized, 27polynomially bounded,stochastic, 2strictly substochastic,substochastic, 2transient, 19

system degree,system graph, 12

strongly connected, hw7system properties,

team decision problem,unconstrained,quadratic convex,

normally distributed observations, hw8sequential,

transient policy,transient system,transition matrix,transition rate, 1

?7$ ,

utility function,additive,multiplicative,symmetric,

@8$ ,

Z 8$ ,

Z$,Z ßR

$

Z ßR1

Z ß>1

Z ßR31

Z ßs R3

1

Z Ð?ÑR1 ,


Z Ð@Ñßs >

$

Z R= ,

Z ‡,Z ß3

$

Z ß31

Z ß31

A

Z ßRA$

Z ßRA1

Z ßR81

Z Ð?ÑR81 ,

•$,•1,•8

1,value,

future,R -period,

present,stopping,terminal,transient,

A,A8,[ ,[ 8,j8,Wiener process,

B=+,

Acknowledgement. Mai Vu suggested the creation of this index. It is a pleasure to acknowledge her encouragement and substantial collaboration in its preparation.


MS&E 351 Dynamic Programming, Spring 2008
Arthur F. Veinott, Jr.

Homework 1 Due April 11

1. Exercising a Put Option. Consider the problem of deciding when to exercise a put option to sell a stock at the strike price s* > 0 during any of the next N days. Suppose that if the price of the stock on one day is s ≥ 0, then the price the next day is sR where the return R is a random variable. Assume that the returns on successive days are independent and have the same distribution as a nonnegative random variable R whose expected value is finite. The goal is to maximize the expected net revenue from an option to sell the stock any time in the next N days at the strike price when the price with N days to expiration is s. Assume that if one exercises the option to sell the stock on a day at the strike price, then one first buys the stock that day at the market price. Let r ≡ R − 1 be the rate of return.

(a) Dynamic-Programming Recursion. Write down a dynamic-programming recursion for the maximum expected net revenue V_s^N when the price with N days to expiration is s.

(b) Nonpositive Expected Rate of Return. Show that if E r ≤ 0, then it is optimal not to exercise the option before expiration.

(c) Positive Expected Rate of Return. Determine the optimal exercise policy with N days to expiration by induction on N where E r > 0. [Hint: First show that V_0^N = s* for each N = 0, 1, .... Next observe that if the stock price falls from a positive level to 0, which occurs with probability 1 − p = Pr(R = 0), then the stock price can never become positive again. For this reason, it suffices to consider the case of positive stock prices s > 0. For that case, rewrite the dynamic-programming recursion in (a) to reflect this fact and then make the change of variables W_s^N ≡ (1/s)(V_s^N − (s* − s)).]

2. Requisition Processing. Requisitions arrive independently at a data-processing center in periods 1, ..., N. At the end of each period, the center manager decides whether to process any requisitions. If the manager decides to process in a period, he processes all previously unprocessed requisitions arriving in that and all prior periods. The manager must process requisitions in a period in which the number of unprocessed requisitions exceeds M > 0. The cost of processing a batch of requisitions of any size is K > 0 and the processing time is negligible. The waiting cost incurred in each period that a requisition awaits processing but is not processed is w > 0. All requisitions, if any, are processed in period N. The problem is to determine the periods in which to process that minimize the combined expected costs of processing and waiting over N periods. Suppose that p_i is the known probability that i = 0, 1, ..., b requisitions arrive in a period, where Σ_{i=0}^{b} p_i = 1. The expected number of requisitions arriving in a period is positive.


(a) Dynamic-Programming Recursion. Write down a dynamic-programming recursion for the minimum expected costs that permits an optimal N-period processing policy to be determined.

(b) Optimality of Processing-Point Policy. Show that one optimal processing policy in period n has the following form: if s is the number of requisitions accumulated before processing at the end of period n, process in period n if s > s̄_n and wait (do not process) if s ≤ s̄_n. Show how to determine the constants s̄_1, ..., s̄_N. [Hint: Prove these facts by induction on n by showing that the minimum expected cost in periods n, ..., N is increasing in s.]

(c) Optimal Preplanned Processing Periods. Answer part (a) under the hypothesis that all the processing times must be chosen before any requisitions arrive, that no requisitions are on hand initially, and that M = ∞. [Hint: It suffices to consider only N states.]

(d) Comparison of Policies in Parts (b) and (c). Will the minimum expected cost in part (a) (assuming no requisitions are on hand initially and M = ∞) be larger or smaller than that in part (c), or can one say? What administrative and/or computational advantages does the policy in part (c) have over that in part (b)?

3. Matrix Products. Let M_1, ..., M_n be matrices with M_i having r_{i−1} rows and r_i columns for i = 1, 2, ..., n and some positive integers r_0, ..., r_n. The problem is to choose the order of multiplying the matrices that will minimize the number of multiplications needed to compute the product M_1 M_2 ⋯ M_n. Assume that matrices are multiplied in the “usual” way.

(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion for finding the optimal order in which to multiply the matrices that requires O(n³) operations.

(b) Example. Find the optimal order in which to multiply the matrices where n = 4 and (r_0 r_1 r_2 r_3 r_4) = (10 30 70 2 100).

(c) Comparison of Minimum and Maximum Number of Multiplications. Also solve the problem of part (b) where the objective is instead to maximize the number of multiplications. What is the ratio of the maximum to minimum number of multiplications?

4. Bridge Clearance. Consider a network of S cities. Between each pair of cities (s, t), there is a road having a bridge whose clearance height is h(s, t) ≥ 0 with h(s, t) = h(t, s). A trucking firm wants to determine a route from each city to city S that maximizes the minimum bridge height along the route subject to the restriction that at most N cities are visited enroute (excluding the initial city). Give a dynamic-programming recursion for solving the problem.


MS&E 351 Dynamic Programming, Spring 2008
Arthur F. Veinott, Jr.


Answers to Homework 1 Due April 11

1. Exercising a Put Option. Suppose that the (nonnegative) market price of a stock on any day is a product of its price on the preceding day and its return that day. Assume that the returns on successive days are independent and have the same distribution as another nonnegative random variable R whose expectation is finite. Call r ≡ R − 1 the rate of return. Let V_s^N be the maximum expected revenue from a put option to sell a stock at the strike price s* > 0 when the market price N days before expiration of the option is s ≥ 0.

(a) Dynamic-Programming Recursion. The V_s^N satisfy the dynamic-programming recursion

(1)  V_s^N = max(s* − s, E V_{sR}^{N−1})

for N = 1, 2, ... where V_s^0 = (s* − s)^+ for all s ≥ 0.

(b) Nonpositive Expected Rate of Return. Suppose E r ≤ 0, so E R ≤ 1. We claim that it is optimal not to exercise the option until the expiration day, and then only if the price that day is below the strike price. It suffices to show by induction that V_s^N = E V_{sR}^{N−1} for N > 0 and for all s. To see this, observe from (1) for N > 1 and by definition for N = 1 that V_s^{N−1} ≥ s* − s for each s ≥ 0. Thus, because E R ≤ 1, it follows that E V_{sR}^{N−1} ≥ E(s* − sR) ≥ s* − s for each s ≥ 0. Thus, V_s^N = E V_{sR}^{N−1} for N > 0, as claimed.

(c) Positive Expected Rate of Return. Suppose E r > 0, so E R > 1. For this part, it is helpful to note that once one reaches the market price s = 0, the subsequent price sR is also zero, so the only market price accessible from 0 is 0. Hence, when the market price is zero, it is optimal to exercise the put option to sell the stock at the strike price s*, so V_0^N = s* for N = 0, 1, .... It is convenient to restrict attention to positive market prices in the sequel. To that end, let p = Pr(R > 0) and, for any random variable W, let Ê W ≡ E(W | R > 0). Then E V_{sR}^{N−1} = (1 − p)s* + p Ê V_{sR}^{N−1}, so the restriction of (1) to positive market prices can be rewritten as

(2)  V_s^N = max(s* − s, (1 − p)s* + p Ê V_{sR}^{N−1})

for s > 0 and N = 1, 2, ... where V_s^0 = (s* − s)^+ for all s > 0.

Subtract s* − s from both sides of (2) and make the change of variables U_s^N = V_s^N − (s* − s) for N = 0, 1, .... Then (2) reduces to the equivalent system

    U_s^N = max(0, −s E r + p Ê U_{sR}^{N−1})

for s > 0 and N = 1, 2, ... where U_s^0 = (s − s*)^+ for all s > 0. Now make the second change of variables W_s^N = (1/s) U_s^N for s > 0. Then the above equation becomes

(2′)  W_s^N = max(0, −E r + p Ê R W_{sR}^{N−1})


for s > 0 and N = 1, 2, ... where W_s^0 = (1 − s*/s)^+ for s > 0. We claim that for each N, W_s^N is increasing and continuous in s > 0 with lim_{s↑∞} W_s^N = 1 and lim_{s↓0} W_s^N = 0. Certainly that is so for N = 0. Suppose it is so for N − 1 and consider N. Then p Ê R W_{sR}^{N−1} is increasing and continuous in s > 0 and converges to E R as s ↑ ∞ and to 0 as s ↓ 0, whence the claim follows. Consequently, there is a largest s = s̄_N such that −E r + p Ê R W_{sR}^{N−1} ≤ 0. Thus, the maximum on the right side of (2′) is −E r + p Ê R W_{sR}^{N−1} if s > s̄_N and 0 if s ≤ s̄_N. Since (2) and (2′) are equivalent, it follows that this rule is optimal with (2) as well, i.e., it is optimal to wait if s > s̄_N and to exercise the option if s ≤ s̄_N. In short, if the expected rate of return is positive and if N days remain until expiration of the option, then there is an exercise price s̄_N such that it is optimal to wait if the price is above s̄_N and to exercise the option if the price is below s̄_N.
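As a numerical companion to this answer (not part of the original solution), the recursion (1) can be run in Python for a hypothetical two-point return distribution in which R = u with probability q and R = 1/u otherwise, so that prices stay on a geometric grid; the script prints an estimate of the threshold s̄_N after each backward step:

    import numpy as np

    u, q, s_star, days = 1.1, 0.6, 100.0, 10      # made-up parameters with E r > 0
    d = 1.0 / u
    assert q * u + (1 - q) * d > 1

    K = 60                                        # grid levels k = -K, ..., K
    prices = s_star * u ** np.arange(-K, K + 1)

    V = np.maximum(s_star - prices, 0.0)          # V^0 = (s* - s)^+
    for n in range(1, days + 1):
        cont = np.empty_like(V)
        cont[1:-1] = q * V[2:] + (1 - q) * V[:-2] # E V_{sR}^{n-1} on the interior
        cont[0], cont[-1] = V[0], V[-1]           # crude treatment of the grid edges
        V = np.maximum(s_star - prices, cont)
        exercise = s_star - prices >= cont
        print(n, prices[exercise].max() if exercise.any() else None)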

2. Requisition Processing. The state of the system is the number of requisitions on hand at the end of a period before processing and the actions are to process or to wait.

(a) Dynamic-Programming Recursion. Let C_s^n be the minimum expected cost in periods n, ..., N when there are s requisitions on hand at the end of period n before processing in that period. Then C_s^N = 0 or K according as s = 0 or s > 0. Also, for each 1 ≤ n < N,

    C_s^n = min[ K + Σ_t p_t C_t^{n+1} , ws + Σ_t p_t C_{s+t}^{n+1} ]   for 0 ≤ s ≤ M, and

    C_s^n = K + Σ_t p_t C_t^{n+1}   for M < s,

where the first and second terms g^n ≡ K + Σ_t p_t C_t^{n+1} and w_s^n ≡ ws + Σ_t p_t C_{s+t}^{n+1} in brackets on the right-hand side of the first equation correspond respectively to processing and to waiting.

(b) Optimality of Processing-Point Policy. We show first that C_s^n is increasing in s by induction on n. This is surely true for n = N. Thus suppose it is so for n+1, ..., N and consider n. Then since C_{s+t}^{n+1} is increasing in s for each t and ws is increasing in s, w_s^n is increasing in s. Thus, since g^n is independent of s, C_s^n = g^n ∧ w_s^n for s ≤ M and C_s^n = g^n for s > M, so C_s^n is increasing in s.

Now let s̄_N = 0 and, for each 1 ≤ n < N, let s̄_n be the largest integer s for which w_s^n ≤ g^n. Since w_s^n is increasing in s, it follows that w_s^n ≤ g^n if and only if s ≤ s̄_n. Thus one optimal policy in period n is to wait if s ≤ s̄_n and to process if s > s̄_n.

(c) Optimal Preplanned Processing Periods. Let C^n be the minimum expected cost in periods 1, ..., n when the processing times must be chosen before any requisitions arrive, there are no requisitions on hand initially, M = ∞, and one processes in period n. Let r = Σ_t t p_t be the expected number of requisitions arriving in a period. Without loss of generality, assume that r > 0 because in the contrary event there are never any requisitions to process. The expected


processing and waiting costs c_i incurred when all requisitions arriving in i consecutive periods are processed in the last of those periods is

    c_i = K + wr Σ_{j=1}^{i} (j − 1) = K + wr i(i − 1)/2.

Then the C^n may be found from the (deterministic) forward equations (C^0 ≡ 0)

    C^n = min_{0≤m<n} [ C^m + c_{n−m} ],   n = 1, ..., N.

Let m_n be the least m minimizing the right-hand side of this equation. Then m_n is the next-to-last optimal period in which to process assuming that one processes in period n. Thus since one processes last in period N, one processes next to last in period m_N, 2nd from last in period m_{m_N}, etc.
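A short Python sketch of these forward equations (the cost data K, w, r and the horizon N are made-up illustrative values):

    K, w, r, N = 10.0, 1.0, 2.0, 12               # hypothetical data

    def batch_cost(i):
        # c_i = K + w*r*i*(i-1)/2: cost when i consecutive periods' arrivals are processed at the end.
        return K + w * r * i * (i - 1) / 2

    C = [0.0] * (N + 1)                           # C^0 = 0
    prev = [0] * (N + 1)                          # back-pointers (the least minimizing m)
    for n in range(1, N + 1):
        C[n], prev[n] = min((C[m] + batch_cost(n - m), m) for m in range(n))

    # Recover the optimal processing periods by following the back-pointers from N.
    periods, n = [], N
    while n > 0:
        periods.append(n)
        n = prev[n]
    print(C[N], sorted(periods))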

(d) Comparison of Policies in Parts (b) and (c). Since the optimal solution in part (b) allows the decision maker to base his decisions on the information accumulated whereas that in part (c) does not, the minimum expected cost in part (b) will not exceed that in part (c). On the other hand, the solution in part (c) has the advantages that it requires less computational effort, it utilizes deterministic rather than random processing times (which makes scheduling easier), and it does not require one to keep records on the number of requisitions remaining unprocessed at the end of each period.

We now give a more precise comparison of the computational effort required to find the optimal policies in parts (b) and (c) under the assumption that M = O(N), as seems reasonable to make the comparison fair. The numbers of comparisons, additions and multiplications required to find the policies in parts (b) and (c) without further exploiting the special structure of the problem is summarized in the table below.

             Comparisons    Additions    Multiplications
    (b)      O(N^2)         O(N^2 b)     O(N^2 b)
    (c)      O(N^2)         O(N^2)       O(N)

    Comparison of Computational Effort to Find Solutions in (b) and (c)

Observe that finding the solution in (b) requires the same order of comparisons, O(b) times as many additions and O(Nb) times as many multiplications when compared to the number of operations required for the solution in (c). The factor O(b) reduction in additions arises because the solution in (c) depends only on the expected number r of requisitions arriving in a period (which we take to be given) whereas that in (b) depends on the number b + 1 of elements in the support of the distribution of the number of requisitions arriving in a period.


3. Matrix Products. Every intermediate matrix product must be of the form M_{i+1} ⋯ M_k, and that matrix product must be a product of M_{i+1} ⋯ M_j and M_{j+1} ⋯ M_k for some i < j < k. Let m_{ik} be the minimum number of multiplications needed to compute M_{i+1} ⋯ M_k. Thus, the minimum number of multiplications needed to compute M_{i+1} ⋯ M_j and M_{j+1} ⋯ M_k is m_{ij} + m_{jk}. Also, the number of multiplications needed to compute the product of those two matrices is r_i r_j r_k. Since j should be chosen optimally, one has the branching recursion (m_{i,i+1} = 0 for 0 ≤ i < n)

    m_{ik} = min_{i<j<k} [ m_{ij} + m_{jk} + r_i r_j r_k ]   for 0 ≤ i < k ≤ n.

(b) Example. With (r_0 r_1 r_2 r_3 r_4) = (10 30 70 2 100),

    m_02 = r_0 r_1 r_2 = 21000
    m_13 = r_1 r_2 r_3 = 4200
    m_24 = r_2 r_3 r_4 = 14000
    m_03 = min[ m_01 + m_13 + r_0 r_1 r_3 , m_02 + m_23 + r_0 r_2 r_3 ]
         = min[ 4200 + 600 , 21000 + 1400 ] = 4800
    m_14 = min[ m_12 + m_24 + r_1 r_2 r_4 , m_13 + m_34 + r_1 r_3 r_4 ]
         = min[ 14000 + 210000 , 4200 + 6000 ] = 10200
    m_04 = min[ m_01 + m_14 + r_0 r_1 r_4 , m_02 + m_24 + r_0 r_2 r_4 , m_03 + m_34 + r_0 r_3 r_4 ]
         = min[ 10200 + 30000 , 21000 + 14000 + 70000 , 4800 + 2000 ] = 6800.

The minimum-multiplication order of matrix multiplication is thus (M_1(M_2 M_3))M_4.

(c) Comparison of Minimum and Maximum Number of Multiplications. If M_{ik} is the maximum number of operations required to compute the matrix product M_{i+1} ⋯ M_k, then M_{ik} = m_{ik} for k − i ≤ 2. Also M_03 = 22400, M_14 = 224000, and

    M_04 = max[ 224000 + 30000 , 21000 + 14000 + 70000 , 22400 + 2000 ] = 254000.

The maximum-multiplication order of matrix multiplication is thus M_1(M_2(M_3 M_4)). Also the ratio of the maximum to the minimum number of multiplications is

    M_04 / m_04 = 254000/6800 ≈ 37.35.
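The branching recursion in (a) is easy to code; the Python sketch below (not part of the original answer) reproduces the figures of parts (b) and (c) for r = (10, 30, 70, 2, 100):

    from functools import lru_cache

    r = (10, 30, 70, 2, 100)
    n = len(r) - 1

    def chain_cost(better):
        # m_{ik} = better over i<j<k of [ m_{ij} + m_{jk} + r_i r_j r_k ], better = min or max.
        @lru_cache(maxsize=None)
        def m(i, k):
            if k - i <= 1:
                return 0
            return better(m(i, j) + m(j, k) + r[i] * r[j] * r[k] for j in range(i + 1, k))
        return m(0, n)

    print(chain_cost(min))   # 6800, the minimum of part (b)
    print(chain_cost(max))   # 254000, the maximum of part (c)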

4. Bridge Clearance. Let V_s^n be the maximum of the minimum bridge heights along routes from s to S among those that visit at most n cities. Set V_S^n ≡ ∞ for n = 1, 2, .... Then

    V_s^1 = h(s, S)

and

    V_s^n = max_{t≠s} [ h(s, t) ∧ V_t^{n−1} ]

for s = 1, ..., S − 1 and n = 2, ..., N.
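A Python sketch of this max-min recursion (the small symmetric height matrix below is made up for illustration; city S is the last index):

    import numpy as np

    INF = float("inf")
    h = np.array([[0.0, 11.0, 14.0, 9.0],         # h(s, t) = h(t, s); diagonal unused
                  [11.0, 0.0, 12.0, 13.0],
                  [14.0, 12.0, 0.0, 10.0],
                  [9.0, 13.0, 10.0, 0.0]])
    S, N = 4, 3                                    # S cities, at most N visited enroute

    V = h[:, S - 1].copy()                         # V^1_s = h(s, S)
    V[S - 1] = INF                                 # V^n_S = infinity by convention
    for n in range(2, N + 1):
        V_new = V.copy()
        for s in range(S - 1):
            V_new[s] = max(min(h[s, t], V[t]) for t in range(S) if t != s)
        V = V_new
    print(V[:-1])                                  # best bottleneck height from each city to S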


MS&E 351 Dynamic Programming, Spring 2008
Arthur F. Veinott, Jr. (rev 4:15 pm 4/13/2008)

Revised Homework 2 Due April 18
(Do any three problems; problems 2-4 if you took MS&E 251.)

1. Dynamic Portfolio Selection with Constant Multiplicative Risk Posture. A financial planner seeks a strategy for dynamically revising a portfolio of M securities that she manages for a client to maximize the expected utility of distributions to him over n years. At the beginning of each year N < n, the planner observes the market value s > 0 (in dollars) of her client’s portfolio, distributes a portion p_0 s to him in cash, and invests fractions p_1, ..., p_M of the remaining funds (1 − p_0)s in securities 1, ..., M. The distribution fraction p_0 and portfolio p ≡ (p_1, ..., p_M) satisfy 0 < p_0 ≤ 1, p_i ≥ 0 and Σ_{i=1}^{M} p_i = 1. At the beginning of the last year n, the planner distributes the entire market value of the portfolio in cash. There are no transaction costs for buying or selling securities. Each dollar she invests in security i in year N is worth R_i^N > 0 dollars at the end of the year. The joint distribution of R^N ≡ (R_1^N, ..., R_M^N) in each year N is known, the random vectors R^1, ..., R^n are independent, and R_i^N and ln R_i^N have finite expectations for all i, N. Her client’s utility of distributing w_1, ..., w_n ≥ 0 to him in years 1, ..., n is Σ_{N=1}^{n} ln w_N.

(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion satisfied by the maximum expected utility V^N(s) of distributions to the client in years N, ..., n when the market value of the client’s portfolio at the beginning of year N is s.

(b) Maximum Expected Utility has Constant Multiplicative Risk Posture. Show by induction on N that V^N(s) = a_N ln s + b_N for some constants a_N and b_N, and show how to evaluate the constants.

(c) Optimal Distribution Fraction and Portfolio. Conclude that the optimal policy in year N < n is to choose the distribution fraction p_0 = 1/(n − N + 1) (the reciprocal of the number of years remaining) and the portfolio p that maximizes E ln(p R^N) (invest myopically). Note that both p_0 and p are independent of the value of the portfolio at the beginning of year N.

2. Airline Overbooking. An airline seeks a reservation policy for a flight with S seats that maximizes its expected profit from the flight. Reservation requests arrive hourly according to a Bernoulli process with p being the probability of a reservation request during an hour. A passenger with a reservation pays the fare f > 0 at flight time. If b ≥ 0 passengers with reservations are denied boarding at flight time, they do not pay the fare and the airline pays them a penalty c(b) (divided among them) where c is increasing on the nonnegative integers and c(0) = 0. Consider the N-th hour (N > 0) before flight time. At the beginning of the hour, the airline reviews the number of reservations on hand and decides whether to book (accept) or decline any reservation request during the hour. Each passenger with a reservation after any reservation request is booked or declined, but before cancellations, independently cancels the reservation with proba-


bility q during the hour. For this reason, the airline is considering the possibility of overbooking the flight to compensate for cancellations. Let V_r^N be the maximum expected future profit when r seats have been booked before reservation requests and cancellations during the hour. Let v_r^N be the maximum expected future profit when r seats have been booked after booking or declining any reservation request, but before cancellations, during the hour.

(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion from which an optimal reservation policy can be determined.

(b) Optimality of Booking-Limit Policies. Assume, as can be shown, that if g is quasiconcave on the integers, then E g(B_r) is quasiconcave in r where B_r is a sum of r independent identically distributed Bernoulli random variables. Use this result to show the following facts. First, v_r^N is quasiconcave in r, i.e., there is a nonnegative integer booking limit b^N such that v_r^N is increasing in r on [0, b^N] and decreasing in r on [b^N, ∞). Second, V_r^N is quasiconcave in r and achieves its maximum at r = b^N. Third, a booking-limit policy is optimal, i.e., accept a reservation request when N hours remain before flight time if and only if r < b^N.

(c) Solving the Problem with MATLAB.¹ Assume that c(b) = fb. Graph V_r^N for 0 ≤ N ≤ 30 and 0 ≤ r ≤ 20 by using MATLAB with the following parameter values: S = 10, f = $300 + F, p = .2 and .3, and q = .05 and .10, where F is the sum of the digits representing the positions in the alphabet of the first letters of your first and last names. For example, Arthur Veinott would have F = 1 + 22 = 23 because A is the first and V is the 22nd letter of the alphabet. In each case, estimate the booking limit ten hours before flight time from your graphs. Discuss whether your graphs confirm the claim in (b) that V_r^N is quasiconcave in r. What conjectures do the graphs that you found suggest about the optimal reservation policy and/or maximum expected reward and their variation with the various data elements? You will lose points on your conjectures only if your graphs are inconsistent with or do not support your conjectures or if you don’t make enough interesting conjectures. The idea here is to brainstorm intelligently.

¹ To develop the graphs required for this part and the next one, you need to run the MATLAB m-file airlin27.m. This file calls the files finhor3r.m and booklim3.m. To do this, download these files from the MS&E 351 course directory. MATLAB is available on the UNIX clusters in Terman and Gates. Printers are available in both places for a fee. You can also run MATLAB remotely. Of course, if you own MATLAB, you can run it on your personal computer as well.

3. Optimal Capacitated Supply Policy. A manager seeks a supply policy for a single product that minimizes the expected costs of ordering, storage and shortage over n periods. The demands D_1, ..., D_n for the product in periods 1, ..., n are independent real-valued random variables. At the beginning of period N, the supply manager observes the possibly negative initial stock x ∈ ℝ on hand. If x < 0, −x units are backlogged. The supply manager then orders a nonnegative

Page 151: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 3 Homework 2 Due April 18, 2008

quantity at unit cost -   ! with immediate delivery, bringing the after receipt ofstarting stock

the order to , where is the given capacity.C B Ÿ C Ÿ B ? ? ! Unsatisfied demands are back-

logged so the stock on hand at the end of period is . There is a convex storage andR C HR

shortage cost function where is the (possibly negative) stock on hand at the end of a per-1ÐDÑ D

iod. Assume that E is finite for all , as and K ÐCÑ ´ 1ÐC H Ñ C K ÐCÑ Ä _ C Ä _ -C K ÐCÑR R R R

Ä _ C Ä _ G ÐBÑ Rßá ß 8 as . Let be the minimum expected cost in periods starting withR

the initial stock in period . Assume that is finite for all . The terminal cost B R G ÐBÑ B G Ð † ÑR 8"

at the end of period is bounded below. Assume, as can be shown, that all convex functions8

arising in this problem are continuous and that convex functions are closed under addition and

positive scalar multiples.

(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion for calculat-

ing .G ÐBÑR

(b) Convexity of G ÐBÑR . Show that if the terminal cost is convex, then is convexG ÐBÑR

and bounded below in . : Use part (d) below.B Ò ÓHint

(c) Optimality of Constrained Base-Stock Policies. Show that one optimal policy inC ÐBÑR

period has the form for some real number .R C ÐBÑ œ ÐC ” BÑ • ÐB ?Ñ CR‡ ‡R R

(d) Projections of Convex Functions. Suppose that is a real-valued convex function0ÐBß CÑ

on a nonempty closed convex set and let inf . Show that if is[ © d JÐBÑ ´ 0ÐBß CÑ J5ÐBßCÑ−[

finite on , then is convex thereon.\ ´ ÖB À ÐBß CÑ − [× J

4. Sequencing: Optimality of Index Policies. Consider a sequence of tasks. When"ßá ß 8 8

a task is , it or . Given a permutation of the tasks, execute them in turnexecuted succeeds fails

until some task succeeds or until all tasks fail, whichever occurs first. The goal is to choose the

order of executing the tasks that minimizes the resulting expected cost. Let be a5 5 5œ Ð ßá ß Ñ" 8

permutation of the tasks. Assume that the cost of executing task depends on , but not-53 5 53 3

the other tasks. Let be the conditional probability that task fails given that the tasks:5 53 3| 53

5 5 53 " 3"´ Ð ßá ß Ñ : " : œ 0 † 1 fail. Assume that has the property that for some5 5 5 5 5 53 3 3 3 3 3| |

positive functions and . Let be the (unconditional) probability that the tasks fail. As-0 1 :53 53

sume that is , i.e., is independent of the order of executing tasks .: ßá ß53 symmetric 5 5" 3"

(a) Optimality of Index Policies by Interchanging Tasks. Show that there is an func-index

tion such that it is optimal to choose any permutation for which isMÐ † Ñ œ Ð ßá ß Ñ MÐ Ñ5 5 5 5" 8 3

increasing in . [ : For an arbitrary permutation of the tasks, consider when3 œ Ð ßá ß ÑHint 5 5 5" 8

it is optimal not to interchange the order of executing tasks and .]5 53 3"

Show how to apply the result of problem (a) to solve each of the following two problems.

Page 152: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 4 Homework 2 Due April 18, 2008

(b) Testing a Machine to Find a Fault. Suppose that a machine does not operate properly

and independent tests can be performed to find the cause. Test costs and has prob8 "ßá ß 8 3 -3 -

ability of fail:3 ing to find the cause given that the tests preceding it fail to find the cause. Once

the cause are executed. All tests may fail to find the cause. What is the is found, no further tests

minimum-expected-cost sequence to carry out the tests?

(c) Search for a Plane. Suppose a plane is down in one of locations with known pos8 "ßá ß 8 -

itive probabilities , . Denote by the time to search location to find the plane; ßá ß ; ; œ " > 3" 8 3 33!

or determine that it is not there. What search sequence minimizes the time to find the plane?

Page 153: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming Spring 2008Arthur F. Veinott, Jr.

1

Answers to Homework 2 Due April 18

1. Dynamic Portfolio Selection with Constant Multiplicative Risk Posture.

(a) Dynamic-Programming Recursion. The dynamic-programming recursion is

(1) max ln( E , Z Ð=Ñ œ Ö =: Ñ Z Ð=Ð": Ñ:V Ñ×ß R œ "ßá ß 8" = !R R" R

!: ":−T

! !!

where is the set of row -vectors , is a column -vectorT Q Ö: À : œ "ß :   !× V œ ÐV Ñ Q!3œ"Q

3R R

3

and ln for .Z Ð=Ñ œ = = !8

(b) Maximum Expected Utility has Constant Multiplicative Risk Posture. It suffices to1

show by induction on thatR

(2) lnZ Ð=Ñ œ + = ,R R R

where and , and for ,+ ´ " , ´ ! R œ "ßá ß 8"8 8

Ð Ñ + ´ 8R" œ " +3 R R"

and

(4 max {ln ln max E ln .Ñ , ´ : + Ð": Ñ× + Ð:V Ñ ,R R" R" R R"

!: "! !

:−T!

To see this, observe that (2) certainly holds for . Suppose (2) holds for . ThenR œ 8 R " Ÿ 8

on substituting (2) for into the right-hand side of (1) and using (3) and (4), it follows thatR"

(2) holds for .R

(c) Optimal Distribution Fraction and Portfolio. It follows from (4) that the optimal distribu-

tion fraction in year maximizes the first term on the right-hand side of (4). Maximiz-: œ : R!R!

ing that term and using (3) yields . Similarly, it follows from (4) that: œ "Î+ œ "ÎÐ8R"ÑR R!

the optimal portfolio in each year 1 maximizes the second term on the right-: œ : Ÿ R 8R

hand side of (4), viz., maximizes E ln subject to . Note that the optimal distributionÐ:V Ñ : − TR

fraction and portfolio in each year are both independent of the level of wealth at the beginning

of the year. Also, the optimal portfolio in each year is , i.e., maximizes the exmyopic pected in-

stantaneous rate of return for that year alone without regard for subsequent investment oppor-

tunities.

1Incidentally, note that the client’s utility function ln for distributions in the years?ÐAÑ œ A A œ ÐA Ñ ¦ ! 8!8Rœ"

R R

exhibits constant multiplicative risk posture. Indeed, up to positive affine transformations, it is the only such utilityfunction that is also symmetric and strictly risk averse.

Page 154: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -2- Answers to Homework 2 Due April 18, 2008

2. Airline Overbooking.

(a) Dynamic-Programming Recursion. Let be the maximum expected profit that canZ R<

be earned from a flight when seats have been booked hours before flight time. Then the<   ! R

profit at flight time ( ) isR œ !

Z œ 0< • Ò0W -ÐÐ< WÑ ÑÓ! < @ ´!

< .

When hours remain before flight time, the airline can either , i.e., accept, or R œ "ß #ßá book de-

cline a new reservation request during the ensuing hour. Reflecting these two possibilities, the maxi-

mum expected profit satisfies the recursion

(1) maxZ œ Ö@ ß Ð" :Ñ@ :@ ×R R R R< < < <"

where E and is the number of the reservations that do not cancel during the hour.@ ´ Z G <R R"< G <<

Then has a binomial distribution with parameters and . Interpret as the maximumG < " ; @<R<

expected profit when there are reservations after booking or declining new requests, but before<

cancellations, with hours remaining until flight time.R

(b) Optimality of Booking-Limit Policies. We prove by induction on that and areR Z @R R< <

quasiconcave in , and a booking-limit policy is optimal with hours to flight time and < R booking

limit , @ R œ ! , œ WR R !< that is a maximizer of . This is true for with booking limit because Z !

<

œ 0<@ œ < Ÿ W Z œ @ œ 0W -ÐÐ< WÑ Ñ <   W! ! ! < < < is increasing in and is decreasing in . Suppose

that R "   ! and is quasiconcave in . Then E is quasiconcave in by the firstZ < @ œ Z <R" R R"< < G<

assertion in the statement of (b). It is clear from (1) that also maximizes and . < œ , Z Z œ @R R R R< , ,R R

Thus since is increasing in , the same is so of . Similarly, since@ < Ÿ , Z œ Ð" :Ñ@ :@R R R R R< < < Ð<"Ñ•, R

@ Z <R R< < is decreas is quasiconcave in . Also, a book-ing in , the same is so of . Thus <   , Z œ @R R R

< <

ing-limit policy is optimal with hours to flight time and booking limit .R ,R

(c) Solving the Problem with MATLAB. The graphs of the and the corresponding book-Z R<

ing limits for and 2 with seats and $ per seat are at-, R œ !ßá ß $! < œ !ßá ß ! W œ "! 0 œ $!!R

tached for .2 .05 , .2 .10 , .3 .05 , .3 .10 . The booking limits 10 hours before flightÐ:ß ;Ñ œ Ð ß Ñ Ð ß Ñ Ð ß Ñ Ð ß Ñ

time are respectively 15, 20, 15, 20. The graphs do confirm the conjecture in (b) that isZ R<

quasiconcave in for each . The graphs are interpolated linearly between successive integer< R

values of to define for fractional values of . The graphs suggest the following additional< Z <R<

conjectures:

ìMonotonicity of Booking Limits. is increasing in and ., ÐRß:ß ;Ñ , œ WR !

ì Equilibrium Reservation Level. There is a (possibly fractional) equilibrium reservation

level such that is independent of . The equilibrium appears to be roughly/ Ÿ W Z œ Z RR/ /

Page 155: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -3- Answers to Homework 2 Due April 18, 2008

the reservation level where the expected number of new reservation requests equals the/

expected number of cancellations, viz., or . This formu-: œ Ð/ :Ñ; / œ :Ð" ;ÑÎ; Ÿ :Î;

la is probably valid only for the case where is significantly smaller than . Also, is in-/ W /

creasing in and decreasing in . Finally, if the number of reservations is less than , the: ; /

expected revenue from the flight never exceeds , regardless of the numberZ œ Z œ /0/!/

of hours before flight time.

ìMonotonicity/Concavity/Quasiconcavity of Maximum Value. is concave in , in-Z <R<

creasing in for , decreasing in for , quasiconcave in for ,R ! Ÿ < Ÿ / R / Ÿ < Ÿ W R W Ÿ <

increasing in , and decreasing in for . In fact, if is sufficiently below , it: ; ! Ÿ < Ÿ W / W

appears that is essentially linear in for sufficiently below . This together withZ < < WR<

the fact from the preceding paragraph implies that Z œ /0 Z ¸ ÒÐ" ;Ñ Ð< /Ñ /Ó0R R R/ <

for sufficiently below .< W

ì RMonotonicity of Differences of Maximum -Period Values. For each and! Ÿ R

! 5 Z ´ Z Z ? < Z œ !, let . Let be the largest number such that .? ?5 5R R5 R R R< < < <5

Then . Also, is positive for , negative for and pos, Ÿ ? Ÿ , Z < / / < ?R R R5 R R5 55 <? -

itive for .? <R5

ì Limiting Maximum Value Independent of Initial Reservations. Then lim ex-RÄ_R<Z

ists and equals for all .Z </

3. Optimal Capacitated Supply Policy.

(a) Dynamic-Programming Recursion. The desired dynamic-programming recursion is

G ÐBÑ œ Ò- † ÐC BÑ K ÐCÑ G ÐC H ÑÓR R R RBŸCŸB?

"min E

for and where is given.B − d R œ "ß #ßá ß 8 G ÐBÑ8"

(b) Convexity of G ÐBÑR . is convex andWe claim that bounded below in by in-G ÐBÑR B

duction on . The claim is so by hypothesis for . Suppose it holds for R R œ 8 " R " Ÿ 8 "

and consider . Then each term in the minimand on the right side of the recursion in (a) isR

convex and bounded below on the convex feasible set . Thus, by (d), isB Ÿ C Ÿ B ? G ÐBÑR

convex and bounded below.

(c) Optimality of Constrained Base-Stock Policies. We claim that one optimal policy C ÐBÑR

in period has the form for some real number . To see this, letR C ÐBÑ œ ÐC ” BÑ • ÐB ?Ñ CR‡ ‡R R

N ÐCÑ ´ -C K ÐCÑ G ÐC H Ñ C œ C N ÐCÑR R R R R"‡RE and be a minimizer of . The existence of

such a minimizer is assured because E is bounded below in , asG ÐC H Ñ C -C K ÐCÑ Ä _R R R"

C Ä _ K ÐCÑ Ä _ C Ä _ N ÐCÑ, as , and is convex on the real line, and so continuous. Also,R R

Page 156: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -4- Answers to Homework 2 Due April 18, 2008

C ÐBÑ N ÐCÑ B Ÿ C Ÿ B ?R R minimizes subject to . To see this, it suffices to consider the following

cases:

• . Since is decreasing in , .B ? C N ÐCÑ C Ÿ C C ÐBÑ œ B ?‡ ‡R RR R

• . Since minimizes , .B Ÿ C Ÿ B ? C œ C N ÐCÑ C ÐBÑ œ C‡ ‡ ‡R R RR R

• . Since is increasing in , C Ÿ B N ÐCÑ C   C C ÐBÑ œ B‡ ‡R RR R

(d) Projections of Convex Functions. Let be a real-valued convex function on a non0ÐBß CÑ -

empty closed convex set and let inf . We show that if is finite on[ © d JÐBÑ ´ 0ÐBß CÑ J5ÐBßCÑ−[

the set , then is convex thereon. To see this, l\ ´ ÖB À ÐBß CÑ − [× J et . Suppose ;% ! :ß :   !w

: : œ " Bß B − \ Cß C ÐBß CÑ ÐB ß C Ñ − [ 0ÐBß CÑ Ÿ JÐBÑ 0ÐB ß C Ñw w w w w w w; and . Choose such that , , and %

Ÿ JÐB Ñ w %. Then

JÐ:B : B Ñ Ÿ 0Ð: † ÐBß CÑ : † ÐB ß C ÑÑ Ÿ :0ÐBß CÑ : 0ÐB ß C Ñ Ÿ :JÐBÑ : JÐB Ñ w w w w w w w w w w %.

Since this is so for all , it is so for , whence is convex.% % ! œ ! J

4. Sequencing: Optimality of Index Policies. Let be a sequence of tasks and be its5 5GÐ Ñ

associated expected cost. Without loss of generality, assume that , i.e., perform the5 œ Ð"ßá ß 8Ñ

tasks in numerical order. Let and for ." œ ! 3 œ Ð"ßá ß 3"Ñ " 3 8

(a) Optimality of Index Policies by Interchanging Tasks. Fix and let be the" Ÿ 3 8 5w

sequence formed from by interchanging tasks and . Let be the expected cost of tasks5 3 3" G3

3 Ð. Let be the expected cost of tasks for . G 4ßá ß 8 " Ÿ 4 Ÿ 8 G ´ G ´ !Ñ4 8"!Then

GÐ Ñ œ G : Ð- : - Ñ G5 3 3 3 3"3 33#

|

and

GÐ Ñ œ G : Ð- : - Ñ G5w 3#3 3 3" 33" 3| .

Consequently, on comparing the above formulas, it follows that if and only ifGÐ Ñ Ÿ GÐ Ñ5 5w

- : - Ÿ - : -3 3" 3" 33 3 3" 3| | ,

or equivalently,- -

" : " :Ÿ

3 3"

3 3 3" 3| |.

Thus since

Ð‡Ñ " : œ 0 † 14 3 4 3|

for each , the above inequality simplifies to4   3

- -

0 0Ÿ

3 3"

3 3".

Page 157: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -5- Answers to Homework 2 Due April 18, 2008

Consider the index rule in which the tests, after possibly renumbering, are carried out in an

order that assures that the index function is increasing in M ´ 33-03

3. We claim that this index rule

is optimal. For in any other sequence , there must 5 be a first test with higher index than its suc-

cessor. Let be the sequence in which one interchanges these two tests. The above argument5w

shows that . Also, is GÐ Ñ Ÿ GÐ Ñ5 5 5w w lexicographically smaller than , i.e., the first nonzero ele-5

ment of is positive. Repeating this pro5 5 w cess up to times produces the index rule be-ˆ ‰8#

cause each sequence is lexicographically smaller than its predecessor and lexicographic order is a

total (or linear) order of the sequences. Thus, the index rule is optimal.

(b) Testing a Machine to Find a Fault. Let be a permutation of the tests, , 5 1 œ " 0 œ " 3 4

:4 and for all . Note that ( ) holds and that and are sym: ´ : â: 3ß 4   " ‡ 1 :3 " 3" 3 3 metric in

3. Thus, the hypotheses of part (a) are satisfied. Hence from part (a), one optimal sequence in5

which to perform the tests is so that the is increasing in .cost-success ratio -":

3

33

(c) Search for a Plane. Let be a permutation of the locations, 5 : ´ " ; â ; 13 " 3" 3,

œ 0 œ ; - œ > 3ß 4   " ‡ 1 :":3

, and for all . Note that ( ) holds and that and are 4 4 4 4 3 3 sym-

metric in . Thus, the hypotheses of part (a) are satisfied. Hence from part (a), one optimal3

sequence in which to search the locations is so that the is increas5 time-success ratio >;3

3ing in .3

Remark. The function can be dispensed with if one replaces the condition ( ) by the1 ‡

equivalent condition that" :

" : 0œ 5ß 4   3

05 3

4 3

5

4

|

| for all .

Page 158: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -6- Answers to Homework 2 Due April 18, 2008

0 2 4 6 8 10 12 14 16 18 200

500

1000

1500

2000

2500

3000Maximum Expected Flight Revenue (p = 0.2, q = 0.05)

Reservations

Rev

enue

in $

0 5 10 15 20 25 300

2

4

6

8

10

12

14

16

18

20

Optimal Booking Limits vs. Hours to Flight Time (p = 0.2, q = 0.05)

Hours to Flight Time

Res

erva

tions

Page 159: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -7- Answers to Homework 2 Due April 18, 2008

0 2 4 6 8 10 12 14 16 18 200

500

1000

1500

2000

2500

3000Maximum Expected Flight Revenue (p = 0.2, q = 0.1)

Reservations

Rev

enue

in $

0 5 10 15 20 25 300

2

4

6

8

10

12

14

16

18

20

Optimal Booking Limits vs. Hours to Flight Time (p = 0.2, q = 0.1)

Hours to Flight Time

Res

erva

tions

Page 160: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -8- Answers to Homework 2 Due April 18, 2008

0 2 4 6 8 10 12 14 16 18 200

500

1000

1500

2000

2500

3000Maximum Expected Flight Revenue (p = 0.3, q = 0.05)

Reservations

Rev

enue

in $

0 5 10 15 20 25 300

2

4

6

8

10

12

14

16

18

20

Optimal Booking Limits vs. Hours to Flight Time (p = 0.3, q = 0.05)

Hours to Flight Time

Res

erva

tions

Page 161: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -9- Answers to Homework 2 Due April 18, 2008

0 2 4 6 8 10 12 14 16 18 200

500

1000

1500

2000

2500

3000Maximum Expected Flight Revenue (p = 0.3, q = 0.1)

Reservations

Rev

enue

in $

0 5 10 15 20 25 300

2

4

6

8

10

12

14

16

18

20

Optimal Booking Limits vs. Hours to Flight Time (p = 0.3, q = 0.1)

Hours to Flight Time

Res

erva

tions

Page 162: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming Spring 2008Arthur F. Veinott, Jr.

Homework 3 Due April 25

1. Multifacility Linear-Cost Production Planning. Suppose facilities, labeled 1 , eachO ßá ßO

produce a single product. Production of each unit at facility in period 1 costs and5 8 œ ßá ßR -58

consumes 0 units at facility in period for each and . /   4 7 4ß 5 7 8 -45 578 8" Ÿ The cost includes

the costs of in-transit storage and shipping units of each product in each period , but/ 4 7 84578

excludes the cost of supplying those units of product in period .4 7 atThe unit cost of storage

facility at the end of period is . There are known nonnegative demands for the product pro-5 8 258

duced at each facility in each period. Let be the minimum total cost of satisfyingG58 a unit of

demand at facility in period . This unit cost5 8 includes all costs of production, storage and in-

transit storage at all facilities involved in supplying that unit of demand at facility in period .5 8

(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion for solving

the problem. [ : Express in terms of the for . Use the fact that the demand atHint G G 7 85 48 7

facility in period can be satisfied either by producing at in or producing at in a prior5 8 5 8 5

period and storing from the previous period 1.]8

(b) Running Time. How many multiplications, additions and comparisons are required to solve

the problem? Your answer should be of the form for some polynomials :ÐRßOÑ SÐ;ÐRßOÑÑ :

and with being of higher order than .; : ;

(c) Running Time: Special Cases. How long would it take a computer to execute the opera-

tions needed to solve the problem when 1000 and 50? 10000 and 500? (AssumeO œ R œ O œ R œ

that the computer will do 10 multiplications, 10 additions and 10 comparisons respectively per) * *

second. Use defined in (b) for these evaluations.):ÐRßOÑ

(d) Nonlinear Costs. Suppose the production and storage costs are nonlinear in the amounts

produced and stored. Can you modify your recursion in part (a) to solve this case without expand-

ing the number of “states” and “actions”? If so, show how to do this; if not, explain why.

2. Discovering System Transience. The purpose of this problem is to give two methods of dis-

covering whether or not a system, i.e., a finite discrete-time-parameter Markov population deci-

sion chain, is transient. Improvements in one of the methods are given for substochastic systems.

(a) Characterization of Transient Matrices. Suppose is a square real nonnegative matrix.T

Show that if and only if the linear inequalities , have a solution forT Ä ! ÐM TÑ@ œ < @   !R

every (resp., some ). [ : For the “if ” part, write the first equation in the form<   ! < ¦ ! Hint

@ œ < T@ and iterate.]

(b) Discovering System Transience by Policy Improvement. Show how the policy-improve-

ment method can be used to check whether or not a system is transient. [ : Let for allHint < œ "$

$ and apply the policy-improvement method. Use the result of (a) to show how to detect when a

decision is found for which is not transient.]$ $

(c) Discovering System Transience in Substochastic Systems by a Combinatorial Method.

Consider a substochastic system with states and state-action pairs. Let be the proba-W E ;Ð=ß +Ñ

Page 163: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 2 Homework 3 Due April 25, 2008

bility that an individual who is in state and chooses action exits the system in one step. Thus,= +

;Ð=ß +Ñ ´ " :Ð>l=ß +Ñ + − E = − ; ´ Ð" T Ñ"!>− =

R Rf 1 1 for each and . Let be the vector of prob-f

abilities that an individual exits the system in steps or less starting from each state whileR   !

using . For , let be the set of states from which it is possible to exit the system1 fR   ! X = −R

in steps or less with every policy, i.e., min . Show that the following are equivalent:R ; !1 1R=

" # X œ $ ; ¦ ! W E‰ ‰ W ‰ W the system is transient; ; min . Give a low order polynomial in and f 1 1

that is an upper bound on the number of comparisons needed to find . [ : For , letX R   "W Hint

Y = − RR be the set of states from which it is possible to exit the system in steps or less withf

every policy, but not less than steps for some policy. Then for . ShowR X œ X Y R   "R R" R

that there is an integer for which . Show that . Show! Ÿ Q Ÿ W X œ X " Ê # Ê $ Ê "Q Q" ‰ ‰ ‰ ‰

that by contraposition.]" Ê #‰ ‰

3. Successive Approximations and Newton's Method Find Nearly Optimal Policies in Linear

Time for Discounted Markov Decision Chains. Consider a finite discrete-time-parameter -stateW

discounted Markov decision chain with discount factor and in" œ"

" 3terest rate 100 %3 !, so

T " œ "$ " $ for all . Since adding a constant to the rewards earned in all states does not affect the

set of maximum-value policies, there is no loss of generality in assuming that for all . Let<   !$ $

8 be the larger of the number of state-action pairs and the number of nonzero data elements

<Ð=ß +Ñ :Ð>l=ß +Ñ ! Z, . Let be given and be the max Call a policy% ‡ imum value over all policies.

1 % -optimal if .l l l lZ Z Ÿ Z‡ ‡1 %

(a) Computing in Linear Time.eZ Show that the number of operations required to com-

pute for any fixed is at most . (An operation is an addition, a comparison, and aeZ Z − d 8W

multiplication).

(b) Successive Approximations Finds an -Optimal Policy in Linear Time.% Show that Z ´R

e % "R! R is increasing in . Also show, for fixed and , how successive approximations can be used

to find a stationary -optimal policy with at most operations. [ : Let be a stationary% $SÐ8Ñ Hint _

maximum-value policy and be such that . Show that # Z œ < T Z Z Ÿ Z Ÿ Z ŸR R" R R# # #$

Z T Z R 8R R _$ $ $. Also show how to choose depending on and , but not on , such that is% " #

%-optimal.]

(c) Newton's Method Finds an -Optimal Policy in Linear Time.% Show that the values ZR

´ Z R$R of the successive decisions selected by Newton’s method are increasing in , and$R ßR   "ß

that for all . Thus conclude that Newton’s method requires no more iterations to findZ   Z RR R

a stationary -optimal policy than does successive approximations. Show that if Gaussian elimina% -

tion is used to compute the value of a decision in Newton’s method, then the number of opera-$R

tions needed to find that value is . Thus conclude that the number of operations requiredSÐW Ñ$

to find a stationary -optimal policy with Newton’s method (when computing values by Gaussian%

elimination) is for fixed and . Show that this running time can be reduced to ifSÐ8 W Ñ SÐ8Ñ$ % "

the values of successive decisions selected by Newton’s method are instead estimated by successive

approximations.

Page 164: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming Spring 2008Arthur F. Veinott, Jr.

1

Answers to Homework 3 Due April 25

1. Multifacility Linear-Cost Production Planning.

(a) Dynamic-Programming Recursion. The forward dynamic-programming recursion for finding

the minimum unit costs is ( )G œ -5 5" "

G œ Ò- / G ß 2 G Ó5 58 8 78 7 8" 8"

78 4

45 4 5 5min " "for and . The right-hand side of the above equation is the minimum of two" 8 Ÿ R " Ÿ 5 Ÿ O

terms, the first representing production of product in period and the second production of that5 8

product before period and storing the product from period to period .8 8" 8

(b) Running Time. We summarize below the number of additions, multiplications and compar-

isons required to compute the .G58

Additions " " " "5 8 78 4

# # # ." œ R O SÐRO Ñ"

#

Multiplications " " " "5 8 78 4

# # # ." œ R O SÐRO Ñ"

#

Comparisons ""5 8

" œ RO SÐOÑ.

(c) Running Time: Special Cases. The approximate running times for special cases are

summarized in the table below.

O "!! "! !!

R

0 , 050 500

Approx. Run Time 14 sec 38 hrs

Approximate Running Times

(d) Nonlinear Costs. It is not possible to modify the recursion in (a) to handle nonlinear costs

without dramatically enlarging the state and actions spaces. The reason is that the recursion in (a)

is based on the fact that the marginal costs are independent of the volume of production. That

would no longer be true with nonlinear cost functions.

2. Discovering System Transience.

(a) Characterization of Transient Matrices. The “only if ” part follows from the fact that

T Ä ! M T M T œ T   ! @ œR " R_Rœ! implies is nonsingular and ( ) by Lemma 3. Thus, !

( ) solves , for all 0. For the “if ” part, suppose 0 and 0M T <   ! ÐM TÑ@ œ < @   ! <   < ¦ @  "

solves , . ThenÐM TÑ@ œ < @   !

Page 165: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -2- Answers to Homework 3 Due April 25, 2008

@ œ < T@ œ â œ T < T @   T <" "R" R

3œ! 3œ!

3 R 3

for 0, , . Thus since 0, converges, whence 0.R œ " á < ¦ T T Ä!_3œ!

3 R

(b) Using Policy Improvement to Discover System Transience. Let 1 for all .< œ −$ $ ?

Suppose and 1 has no nonnegative solution. Then from (a), is not trans-$ ? $− ÐM T Ñ@ œ$_

ient. If the equations have a nonnegative solution, then that solution is . Now choose an imZ   "$ -

provement of , if one exists, replace by , and repeat the above construction. After finitely# $ $ #

many steps, we will either find a that is not transient, in which case the system is not trans-$_

ient, or a that has no improvement. In the latter event, is a fixed point of , whence$ eZ   !$

Z œ Ð" T Z Ñ ¦ T Z < ´ ÐM T ÑZ ¦ !$ # $ # $ # # $# ?max for all . Thus, for each , , so from (a), is− # # #

transient. Hence the system is transient.

(c) Discovering System Transience in Substochastic Systems by a Combinatorial Method.

For , let be the set of states from which it is possible to exit the system in R   ! X = − RR f

steps or less with every policy. it is possi-For , let be the set of states from which R   " Y = −R f

ble to exit the system in steps or less with every policy, but not in less than steps with someR R

policy. Then for .X œ X Y R   "R R" R

" Ê # # X § X œ Ï X‰ ‰ ‰ W W. The proof is by contraposition. If is false, then . Let . By defin-f f

ition of the , , so there is a smallest integer such that X g œ X © X © X © â ! Ÿ Q Ÿ W YR ! " # Q"

œ g X œ X Y œ g R Q R œ, whence . We claim that for all . This is certainly so for Q Q" R

Q ". Suppose it is so for and consider . Then, for each state , each policyR Q R " = − YR"

that exits the system in steps, but not less, first takes an action that sends the system only toR "

states in , which is impossible because . Hence .Y Y œ g X œ XR R Q W Then there is a decision $

such that is stochastic, whence is not transient.T$XX_$

# Ê $‰ ‰. Immediate from the definitions involved.

$ Ê " ; ¦ !‰ ‰ W. By hypothesis, min1 1 and so max . Thus there exists such1 1T " ¥ " ! "W -

that max . Iterating this inequality shows that max for . Since1 11 1T " Ÿ " T " Ÿ " 3 œ "ß #ßáW 3W 3- -

max is decreasing in , , whence is transient.1 1 1 1T " R   ! T " Ÿ W T " Ÿ WÐ" ÑR R 3W W "_ _Rœ! 3œ!

! ! - 1

We show that the total number of comparisons to find is at most . FX EWQ or , let R   ! ER=

be the set of actions in state that assure it is not possible to exit the system in steps+ − E = − R= f

or less starting from and using initially. For , . For and , .= + = − E œ g R   " = − X E œ gf ! R" R= =

For (resp., and , is the set of such that (resp.,R œ " R "Ñ = − Ï X E + − E ;Ð=ß +Ñ œ !f R" R R"= =

:Ð>l=ß +Ñ œ ! > − Y E = − Ï X for ); for all with R" R R"= given , one can find E ßY X ER" R" R"

Ð†Ñ f

(resp., ) comparisons. For , is the set of states for which ,E†lY l R   " Y = − Ï X E œ gR" R R" R=f

and the are disjoint. If (resp., ), it suffices to find , and so and Y X œ § E Y X œR Q R R R=f f

X Y = − Ï X R Ÿ Q R Ÿ Q"ÑR" R R", for and (resp., ; the number of comparisons to do sof

is at most (resp., EÐ" lY lÑ Ÿ E†lX l œ EW EÐ" lY lÑ Ÿ EÐl Ï X l lX lÑ œ! !Q Q"# #

R" Q R" Q Qf

EWÑ.

Page 166: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -3- Answers to Homework 3 Due April 25, 2008

3. Successive Approximations and Newton's Method Find an -Optimal Policy in Linear%

Time for Discounted Markov Decision Chains.

(a) Computing in Linear Time.eZ Recall that

( ) maxeZ œ Ö<Ð=ß +Ñ :Ð>l=ß +ÑZ ×= >+−E >−=

"f

for each and . Observe that each and appears exactly once in theZ − d = − <Ð=ß +Ñ :Ð>l=ß +ÑW f

formula above for each and therefore the number of operations needed to compute ( ) does= Ze =

not exceed the number number of nonzero elements among the and for all8 <Ð=ß +Ñ :Ð>l=ß +Ñ=

+ − E > − = −= and , for each fixed . Thus, the total number of operations required to computef f

eZ is bounded above by . Let .!= =

! !8 œ 8 Z ´ Z ´ !

(b) Successive Approximations Finds an -Optimal Policy in Linear Time.% Notice that

Z œ Z œ !   ! Z œ Z" ! R R" R "e e e e e. Thus, by the monotonicity of , and so of , it follows that

  ! œ Z Z œ < eR R R. Also, # T Z Ÿ < T Z M T Z Ÿ <# # # # #R" R R, so ( ) . Premultiplying the

last inequality by ( ) , which exists and is nonnegative because is transient, gives M T Z#" _ R#

Ÿ ÐM T < œ Z# # #) . Since the system is strictly substochastic, it is transient and there is a sta-"

tionary maximum-value policy . Hence,$_

Z Ÿ Z Ÿ Z Ÿ Z œ Z T Z œ ZR R R R ‡$ $ $# $ $ .

Therefore,

0 Ÿ Z Z Ÿ Z Z Ÿ T Z œ T Z‡ ‡ R R R ‡# $$ $

and so

mZ Z m Ÿ mZ Z m Ÿ mT mmZ m œ mZ m‡ ‡ R R ‡ R ‡# $ " .

Thus, taking iterations with assures that is -optimal and that isR R œ œ Z¶ · ¶ ·ln ln% %

" 3ln ln

Ð"Î Ñ

Ð" Ñ# %_

#

within 100 % of . Since is independent of and each iteration requires at most operations,% Z R 8 8‡

successive approximations finds a stationary -optimal policy with ( ) operations for fixed and .% % "O 8

(c) Newton's Method Finds an -Optimal Policy in Linear Time.% Since Newton’s method is

an instance of the Policy-Improvement Method and the values of successive policies increase with

that method, it follows that for . Thus, if , thenZ   Z R   ! Z   Z R" R R R

Z   < T Z œ < T Z   < T Z œ Z R" R R R R"

$ $ $ $ $ $$ $

R" R"max{ } max{ } .

This shows that the values at iteration with Newton’s method are at least as large as those withR

successive approximations. Thus Newton’s method requires no more iterations than successive ap-

proximations, so the number of iterations in (b) will find a stationary -optimal policy. EachR %

Page 167: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -4- Answers to Homework 3 Due April 25, 2008

iteration of Newton’s method requires finding the “best” improvement and then R $R its value

Z 8 W$R . The first task can be done in ( ) time as shown in (a) and the second can be done in ( )O O $

time if Gaussian elimination is used to solve the required system of linear equations. Since theW

number of iterations is independent of both and , this implementation of New8 W ton’s method

requires ( ) operations to find an -optimal policy. However, if instead of solving a sys-O 8 W$ %

tem of linear equations to compute , we use successive approximations to estZ$R imate this value

as in (b), then both steps of Newton’s method can be implemented in ( ) time. ThenO 8 this imple-

mentation of Newton’s method requires ( ) operations to find an -optimal policy.O 8 %

Page 168: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming Spring 2008Arthur F. Veinott, Jr.

1

Homework 4 Due May 2(Do Problems 1, 2 and either 3 or 4; if you took MS&E 251, do Problems 2, 3 and 4)

1. Component Replacement. A manager of a fleet of new machines desires to develop aQ

maintenance policy that maximizes the expected net profit from the machines as long as they

are “working”. Working machines are rented out each morning and are returned by 5 p.m. that

day with the daily rental revenue per machine being . Each machine consists of four compon-/

ents, two of type and two of type . Each component may be “working” or “failed”. Working! "

components fail independently of one another while in service, though customers receive no re-

funds. The probability that working components of types and fail during a day are .2 and .1! "

respectively. A machine is “working” if it has at least one working component of each type; oth-

erwise the machine is “failed”. When a machine is returned at the end of a day, the manager

observes the numbers of working components of each type therein. If the machine has failed, the

manager discards it. If the machine is working, the manager decides which failed components, if

any, to replace during the evening in order to prolong the machine's working life. Components of

types and cost respectively 1 and 2. The labor cost to replace one failed component of either! "

type in a working machine during an evening is 5 and that to replace two failed components, one

of each type, is 6. Initially, all components of all machines are working.

(a) Solution by Newton’s Method. Find the component replacement policy that maximizes

expected profit for using the MATLAB file (Newton’s method)./ œ % newtonr .m"

(b) Formulation as Linear Program. Formulate the problem of finding a component re-

placement policy for the machines that maximizes the expected net profit as a linear program.

[ : The problem has nine variables and four equality constraints.]Hint

(c) Solution by Linear Programmming. Find the optimal policy and the maximum expect-

ed profit therefrom by solving the linear program in (b) by using the MATLAB file .tranlpsc.m

(Type “help linprog” at the MATLAB prompt to see explanations for the linprog solver.) Do

this for each of . , , and , and where is the sum of the two integers representing the/ œ & # % ' Q

positions in the alphabet of the first letter in your first and last names. For example, Walt Dis-

ney would have to solve the problem for the case since is the twenty-thirdQ œ #$ % œ #( [

and is the fourth letter of the alphabet. Please be sure to state the value of that you use.H Q

Discuss qualitatively the effect that you observe in the above computations of increasing the

daily revenue per machine on the optimal replacement policy./

(d) Extending the Service Life. The manager has been asked by the Vice-President for Mar-

keting to extend the average working life of a machine to at least 17 days. For the case ,/ œ %

what is the maximum-expected profit and corresponding optimal policy among those that satisfy

this constraint? Use the MATLAB file to solve the problem.tranlpsc.m

Page 169: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 2 Homework 4 Due May 2, 2008

2. Optimal Stopping Policy. Consider a finite -state Markov branching decision chain. CallW

a policy if . Call lim sup the of and call the supremum1 1stopping valueT Ä ! Z ´ ZR RRÄ_1 11

Z Z‡ of over all stopping policies the . (The limit superior and suprem1 1 maximum stopping value -

um are taken coordinate-wise.) If there are no stopping policies, let . Let be aZ ´ _ A ¦ !‡

row -vector and consider the primal linear program of finding a vector thatW ÐB Ñ$

maximizes "$ ?

$ $

B <

subject to

"$ ?

$ $

B ÐM T Ñ œ A

all B   ! −$ $ ?

(a) Existence of Stopping Policies. Show that the following statements are equivalent:

" # $‰ ‰ ‰ There is a stopping policy. There is a stationary stopping policy. The primal linear pro-

gram has a feasible solution. Hint [ : Show that 1 implies 2 by establishing that there is a trans-‰ ‰

ient policy that is , i.e., for some , for .1 # # # #œ Ð ß ßá Ñ Q   " œ R œ "ß #ßá" # R RQperiodic

Then, for this part only, replace each by . Show that , that is a finite neg< " ! Æ Z   Z Z$ 1eR -

ative fixed point of , and each for which is a stopping policy.]e $ e_ Z œ " T Z$

(b) Existence of Optimal Stopping Policies. Show that the following statements are equiva-

lent: " #‰ ‰ is finite. There is a stationary maximum-value stopping policy and is the comZ Z‡ ‡ -

mon least fixed point and least excessive point of . An excessive point of is a vector e eÐ Z − dW

with . The primal linear program has an optimal solution. There is a stopping polZ   Z Ñe $ %‰ ‰ -

icy and an excessive point of . Hinte [ : Show that 4 3 2 1 4 . Show that 3‰ ‰ ‰ ‰ ‰ ‰Ê Ê Ê Ê

implies 2 by establishing that one optimal solution of the linear program is basic and that for‰

the corresponding decision , is optimal for the dual program, and so is a fixed point of .$ eZ$

Then show that . Establish that 1 implies 4 by using part (a) to show that since isZ œ Z Z$‡ ‰ ‰ ‡

finite, there is a stationary stopping policy , whence the primal is feasible. Then show is$ e_ RZ$

nondecreasing in and converges to a fixed point of , so the dual is feasible.]R e

(c) Maximum Value Need Not Equal Maximum Stopping Value. Suppose 1 , f ?œ Ö × œ

Ö ß × < œ ! T œ " < œ " T œ ! Z œ " Z œ !# $ , , , and . Show that , but that is the maximum# # $ $‡

value over all policies and is the greatest nonpositive fixed point of .e

Page 170: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming 3 Homework 4 Due May 2, 2008

3. Simple Stopping Problems. Let be the stochastic transition matrix of an -state Mar-T W

kov chain. In each period , say, the chain is observed in state , there are two actions avail-R = − f

able. One is to and receive a . The other is to by paying an stop terminal reward continue en-<=

trance fee that permits the state of the chain to be observed in period . Put and- R " < œ Ð< Ñ= =

- œ Ð- Ñ= .

(a) Existence of Stationary Optimal Stopping Policy. Show that there is a stationary max-

imum-value stopping policy if and only if for some .Z   < ” Ð- TZ Ñ Z − dW

(b) Solution in Iterations by Linear Programming.W Show that if the simplex method is

used to solve the corresponding primal linear program starting with the decision that stops in ev-

ery state, then the simplex method will terminate in iterations or less. [ : Show that sinceW Hint

the value of the stopping policy encountered at each iteration of the simplex method exceeds that

of its predecessor, once the simplex method introduces the continuation action in a state, that ac-

tion will not be changed back to the stopping action in that state at a subsequent iteration.]

(c) Entrance-Fee Problem. An is one in which there are no terminalentrance-fee problem

rewards. Show how to reduce the above problem to an equivalent entrance-fee problem. [ :Hint

Subtract from the equation , thereby forming where< Z œ < ” Ð- TZ Ñ Z œ ! ” Ð- TZ Ñs ss

Z ´ Z < - ´ - < T<s s and .]

(d) Optimality of Myopic Policies for Entrance-Fee Problem. Assume .f œ Ö"ßá ß W×

Consider an entrance-fee problem in which ( ) is upper triangular with diagonal elements less3 T

than , " except that the last one equals , and ( ) there is an , , such that for" 33 = " Ÿ = Ÿ W - !‡ ‡=

= =‡ and for . Show that it is optimal to continue if and to stop if .-   ! =   = = = =   ==‡ ‡ ‡

Call this policy because it maximizes the one-period reward.myopic

4. House Buying. You have just moved to a city and are considering buying a house. The

purchase prices of houses that interest you are independent identically distributed random vari-

ables with being the positive probabilities that the purchase prices are: ßá ß : Ð : œ "Ñ" W 33!

respectively . Each time you look at a house to determine its price, you incur a cost"ßá ß W

- !. Determine explicitly a stationary policy that minimizes the expected cost of looking and

buying among those policies in which you eventually buy with probability one. [ : AssumeHint

that you always have the option of buying the cheapest house that you have observed todate.

Then apply the result of problem 3(d) to find an optimal policy. Also show that one optimal

policy has the property that if you do not buy a house in the period in which you observe its

price, then you never buy it.]

Page 171: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming Spring 2008Arthur F. Veinott, Jr.

1

Answers to Homework 4 Due May 2

1. Component Replacement. The states of a machine at the end of a day are the ordered

pairs ( ) where and are respectively the number of working components of types and , so3ß 4 3 4 ! "

f œ ß ß ß ß 3ß 4 5ß 6  {(1 1), (1 2), (2 1), (2 2)}. The actions available in state ( ) are the ordered pairs ( )

( ) of numbers of working components of the two types after replacement. Let ( ) be3ß 4 : 7ß 8 ± 5ß 6

the conditional probability that a machine in state ( ) after replacements in an evening is in5ß 6

state ( ) ( ) at 5:00 p.m. the following day. Denote by ( ) the net profit earned7ß8 Ÿ 5ß 6 < 3ß 4ß 5ß 6

during a day by a machine in state ( ) at 5:00 p.m. on the previous day when action ( ) is3ß 4 5ß 6

chosen for the machine that evening. The ( ) are given by the formula: > ± +

: 7ß 8l5ß 6 œ5 6

7 8( ) .8 .2 .9 .1ΠΠ7 57 8 68

and are tabulated below. The ( ) are given by ( ) ( ) where< =ß + < =ß + œ / - =ß +

- 3ß 4ß 5ß 6 œ 5 3 6 4 - 5 6 3 4( ) ( ) 2( ) ( )

and ( ) is the labor cost of replacing components in the machine, (0) 0, (1) 5, and (2)- ? ? - œ - œ -

œ - =ß +6. The ( ) are as tabulated below. Also note that the nine state-action pairs are as given

in a table below. In this instance, there are four actions in state 11, two in each of states 12 and

21, and one in state 22.

:Ð>± +Ñ -Ð=ß +Ñ

+ +

> =

""

"#

#"

##

""

"#

#"

##

11 12 21 22 11 12 21 22

Þ(# Þ"%% Þ#)) Þ!&('

Þ'%) Þ#&*#

Þ&(' Þ""&#

Þ&")%

! ( ' *

! '

! (

!

00 0 0

0 -

0

-

= +

11 1111 1211 2111 2212 1212 2221 2121 2222 22

State-Action Pairs

Let ( ) ( ). Let be the number of machines initially in state . Assume that 0: >l=ß + œ : >l+ A > A œ> >

for (2 2).> Á ß

(a) Solution by Newton's Method. Below is the MATLAB output resulting from running a

file to create the input data and then the file 1 in the course directory tocompdata.m newtonr .m

solve the component replacement problem. Then the MATLAB output is:

Page 172: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -2- Answers to Homework 4 Due May 2, 2008

>> compdata

P = 0.7200 0 0 0 0.1440 0.6480 0 0 0.2880 0 0.5760 0 0.0576 0.2592 0.1152 0.5184 0.1440 0.6480 0 0 0.0576 0.2592 0.1152 0.5184 0.2880 0 0.5760 0 0.0576 0.2592 0.1152 0.5184 0.0576 0.2592 0.1152 0.5184

r = 4 -3 -2 -5 4 -2 4 -3 4

>> newtonr1stop =yesans =Halted after 2 iterations

V = 17.6774 20.6774 21.4412 26.6774

LastDecision = 4 2 1 1

The maximum expected net revenues earned and optimal actions in states 11, 12, 21 and 22

are respectively the elements of above and the actions 22, 22, 21 and 22 indicated in theZ

vector (note that action 4 in state 11 is 22, action 2 in state 12 is 22, action 1 inLastDecision

state 21 is 21, action 1 in state 22 is 22).

(b) Formulation as Linear Program. The problem is to choose the expected numbers

B   ! = +=+ of machine-days that machines end a day in state and take replacement action for

each , to maximize the expected net revenue=ß +

(1) ( )"=ß+

=+< =ß + B

subject to

(2) ( ) , ." "+ =ß+

>+ =+ >B : > l =ß + B œ A > − f

The matrix of nonzero coefficients of (2) is tabulated below (with 1) together with ( )A œ < =ß +##

for each of the four cases .5, 2, 4 and 6./ œ

Page 173: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -3- Answers to Homework 4 Due May 2, 2008

Daily Reward Vectors for Several 's/

=

/ +

11 12 21 22 11 12 21 22 12 22 21 22 22

0.5246

0.5 6.5 5.5 8.5 0.5 5.5 0.5 6.5 0.52 5 4 7 2 4 2 5 24 3 2 5 4 2 4 3 46 1 0 3 6 0 6 1 6

Coefficient Matrix of Equations (2)

=

> + A

11 12 21 22 11 12 21 22 12 22 21 22 22

111221

.28 .856 .712 .9424 .144 .0576 .288 .0576 .0576.648 .2592 .352 .7408 .2592 .2592

.576 .11

>

52 .1152 .424 .8848 .1152.5184 .5184 .5184 .4816 1

22

(c) Solution by Linear Programming. The optimal action to take and the maximum unit

value of each machine in each state are tabulated below for each of the required values ofZ =‡=

/. Notice that the optimal actions in each state are the same for all nonnegative initial popula-

tions, and the maximum values are linear therein. In particular, if is the initial number ofQ

machines in state 22, then the maximum value therein is .Z Q‡##

Observe that as the daily rental price increases, it is optimal to replace more components/

to prolong the lives of the machines. Moreover, if it is optimal to replace any components, then

all failed components should be replaced because of the scale economies in labor cost of replace-

ment. Finally, when a machine has only one failed component and 4, then it is optimal to re-/ œ

place component , but not , since the former is both cheaper and more likely to fail than the! "

latter.

Optimal Actions and Maximum Unit Values in Each State for several 'sZ =‡= /

Optimal Action in Maximum Unit Value in = Z =‡=

11 12 21 22 1.79 2.15 2.39 2.9811 12 21 22 7.14 8.60 9.57 11.9222 22

/ = 11 12 21 22 11 12 21 22

0.524 21 22 17.68 20.68 21.44 26.68

22 22 22 22 53.90 56.90 55.90 62.906

Below is the MATLAB output resulting from running the file to create the inputcompdata.m

data and then the file in the course directory to solve the component replacement probtranlpsc.m -

lem by linear programming. The MATLAB output for the case is:/ œ %

Page 174: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -4- Answers to Homework 4 Due May 2, 2008

>> compdata

P = 0.7200 0 0 0 0.1440 0.6480 0 0 0.2880 0 0.5760 0 0.0576 0.2592 0.1152 0.5184 0.1440 0.6480 0 0 0.0576 0.2592 0.1152 0.5184 0.2880 0 0.5760 0 0.0576 0.2592 0.1152 0.5184 0.0576 0.2592 0.1152 0.5184

r = 4 -3 -2 -5 4 -2 4 -3 4

>> w = [0 0 0 1]

w = 0 0 0 1

>> tranlpsc

x = 0 -0.0000 0.0000 1.5696 0 2.9948 3.1392 0 6.9895

V = 17.6774 20.6774 21.4412 26.6774

d = 4 2 1 1

The maximum expected net revenues earned and optimal actions in states 11, 12, 21 and 22

are respectively the elements of above and the actions 22, 22, 21 and 22 indicated in theZ

vector (note that action 4 in state 11 is 22, action 2 in state 12 is 22, action 1 in state 21 is.

21, action 1 in state 22 is 22). These results agree with those using Newton’s method.

(d) Extending the Service Life. The service life constraint

(3) "=ß+

=+B   "(

must be appended to the constraints (2).

Below is the MATLAB output resulting from running the file with andcompdata.m / œ %

then , to solve the component replacement problem with service constraint by lineartranlpsc.m

programming. The MATLAB output is:

Page 175: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -5- Answers to Homework 4 Due May 2, 2008

>> compdata

P = 0.7200 0 0 0 0.1440 0.6480 0 0 0.2880 0 0.5760 0 0.0576 0.2592 0.1152 0.5184 0.1440 0.6480 0 0 0.0576 0.2592 0.1152 0.5184 0.2880 0 0.5760 0 0.0576 0.2592 0.1152 0.5184 0.0576 0.2592 0.1152 0.5184

r = 4 -3 -2 -5 4 -2 4 -3 4

>> w = [ 0 0 0 1]

w = 0 0 0 1

>> tranlpsc

x = 0 0.0000 0 1.3973 0 3.9360 1.8148 0.9799 8.8720pr = 0 0.0000 0 1.0000 0 1.0000 0.6494 0.3506 1.0000

v = 24.9490

Here the elements of are the optimal conditional probabilities of taking each action:<

starting in each state. In particular, it is optimal to take actions 22 in states 11, 12 and 22. In

state 21, it is optimal to choose action 21 with probability .6494 and action 22 with probability

.3506. One consequence of adding the service constraint is to reduce the maximum expected re-

ward from 26.6774 to 24.9490.Z œ @ œ##

2. Optimal Stopping Policy.

(a) Existence of Stopping Policies. 1 2 . Let be a stopping policy. Then there is an ‰ ‰Ê Q1

such that . Let ( ), ( and ( ), so is per-¼ ¼T œ ß ßá œ ßá ß Ñ œ ß ßá"

#Q

" # " QQ Q Q

1 1 # # 1 # # . 1 1 .

iodic. Notice that is also transient because.

m m œ m m m m m m m m" " " " " "_ _ Q" _ Q" Q"

Rœ!

R Q58 Q 5 8 8

5œ! 5œ!8œ! 8œ! 8œ!

T T Ÿ T T Ÿ # T _. . 1 1 1( ) .

Let 1 for all , 0 and for . We claim is nonincreasing in .< œ Z œ Z œ Z R   ! Z R$ $ e! R" R R

This is certainly so for 1 since 1 0 . Thus, for , it follows from theR œ Z œ Ÿ œ Z R "" !

monotonicity of , and so of , that . Since 1 for all , then e e $R R" R RZ Ÿ Z < œ ¥ ! Z œ$

eR R R0 , so the sequence is bounded below and has a limit . Also, by  Z   Z ¦ _ ÖZ × Z. .

Page 176: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -6- Answers to Homework 4 Due May 2, 2008

the continuity of , lim 0 lim 0 . Thus, there is a decision e e e e e $Z œ Ð Ñ œ œ ZRÄ_ RÄ_R R"

such that 1 0, so from Problem 2 of Homework 3, is transient and soZ œ T Z Ÿ " ¥$ $_

stopping.

2 3 . Let be a stationary stopping policy. Then 0 and, by Lemma 3, ( )‰ ‰ R "Ê T Ä M T$ $ $

exists and is nonnegative. Thus 0 for and ( ) is a basic feasible sol-B œ Á B œ A M T   !# $ $# $ "

ution of the primal.

3 1 . If there is a feasible solution to the primal, there is a basic feasible solution. Hence‰ ‰Ê

by Theorem 15, there is a transient policy , and is stopping.$ $_ _

(b) Existence of Optimal Stopping Policies. 4 3 . Since there is a stopping policy, by‰ ‰Ê

(a), the primal is feasible. If there exists such that , then the dual is also feasible.Z Z   Ze

This implies that both the primal and dual have optimal solutions.

3 2 . If there is an optimal solution of the primal, then by Theorem 15 there is a basic‰ ‰Ê

optimal solution with row basis for some decision for which is transient, and soM T$ $ $_

stopping. Thus the basic optimal solution is such that ( ) and 0 for andB M T œ A B œ Á$ $ # # $

there exist optimal dual prices . Since is feasible for the dual, . And by complemen-Z Z Z   Ze

tary slackness, , so is finite and a fixed point of . Then Z œ < T Z Z œ Z Z œ Z œ â œ$ $ $ $ $e e

e 1R R R RZ   Z T Z T Ä$ $1 1 1. If is stopping, 0, so on taking limit superiors on both sides,

Z   Z Z œ Z Z$ 1 $ $ and . That is the least excessive and fixed point of follows from the fact‡ e

that if is excessive, then , i.e., . Premultiplying the last inequalityZ Z   < T Z ÐM T ÑZ   <$ $ $ $

by , which exists and is nonnegative, implies that as claimed.ÐM T Ñ Z   ÐM T Ñ < œ Z$ $ $ $" "

2 1 . Since there exists a stationary maximum-value stopping policy , and,‰ ‰ _ ‡Ê Z œ Z$ $

since is transient, is finite.$ Z ‡

1 4 . Since is finite, there is a stopping policy and so, by (a), a stationary stopping‰ ‰ ‡Ê Z

policy , say. Then . Hence, by the monotonicity of ,$ e e e e_ R" RZ   < T Z œ Z Z   Z$ $ $ $ $ $ $

and so . Let be a policy such that ( ) where .e 1 # e 1 # #R R R R3 " Rœ Ð Ñ Z œ Z Z œ Z ´ Ð ßá ß Ñ$ $1 1 $R _

Since is stopping, so is ; and it's value is , whence . Hence since $ 1 $ e e e_ R _ R R ‡ RZ Z Ÿ Z Z$ $ $

is bounded above and nondecreasing in , lim exists and, by the continuityR   ! Z ´ ZRÄ_Re $

of , . Thus is excessive.e eZ œ Z Z

(c) Maximum Value Need Not Equal Maximum Stopping Value. Observe that since the

rewards are nonpositive, is nonpositive for every policy . Also, notice that 0 is theZ Z œ. #.

maximum value. Let be a stopping policy. Then uses in some period, so 1 is the1 1 $ Z œ 1

maximum stopping value. Now 1 and 0 (resp., ) is the greatest nonpositiveeZ œ Z ” "

(resp., least) fixed point of .e

Page 177: Lectures in Dynamic Programming and Stochastic Control …€¦ ·  · 2013-03-27Lectures in Dynamic Programming and Stochastic Control ... MS&E 351 Dynamic Programming and Stochastic

MS&E 351 Dynamic Programming -7- Answers to Homework 4 Due May 2, 2008

3. Simple Stopping Problems. The irredundant form of the linear program in Problem 2 is

that of choosing -element row vectors and that maximizeW B C

B< C-

subject to

B C M T œ A Bß C  ( ) , 0

where is a positive -element row vector.A W

(a) Existence of Stationary Optimal Stopping Policy. For the “if” part, observe that V is excessive since V ≥ r ∨ (−c + PV) = ℛV. Let δ^∞ be the stopping policy that stops in every state, so P_δ = 0. Then there is a stationary maximum-value stopping policy by Problem 2(b). For the “only if” part, observe that if there is a stationary maximum-value stopping policy, then V* is finite. Hence, by Problem 2(b), there is an excessive point of ℛ.

(b) Solution in S Iterations by Simplex Method. Recall that the simplex method is the special case of the policy-improvement method in which each iteration improves the action in exactly one state. Now suppose the simplex method begins with the decision δ that stops in every state. Each subsequent iteration will either produce a stationary stopping policy (cf. Theorem 15) whose value is greater than that of its predecessor (and strictly so in the state whose action is changed), or will establish that the primal objective function is unbounded above, in which case no stationary maximum-value stopping policy exists. Consider the former case. Observe that it is not possible to improve the action in a state s from “stop” to “continue” and then back to “stop” again at subsequent iterations. This is because the former change increases the value in s from r_s to a strictly higher value, while the latter returns the value in s to r_s, which contradicts the fact that the value in a state rises with each iteration. Thus the set of states in which one continues at an iteration properly contains that at the previous iteration. Hence, by the result of (a) and of Problem 2(b), the simplex method terminates in at most S iterations with a stationary maximum-value stopping policy or by finding that the primal objective function is unbounded above, in which case no such policy exists.
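The one-state-at-a-time improvement just described is easy to mimic numerically. The sketch below is not part of the original answer and uses hypothetical data: starting from the all-stop policy, each iteration switches the single state whose change from "stop" to "continue" most increases its value, and at most S switches occur.

```python
import numpy as np

def stopping_value(r, c, P, cont):
    """Value of the stationary stopping policy that continues exactly on the set `cont`."""
    V = r.astype(float).copy()                      # states where we stop earn r_s
    idx, stop = np.flatnonzero(cont), np.flatnonzero(~cont)
    if idx.size:
        A = np.eye(idx.size) - P[np.ix_(idx, idx)]
        b = -c[idx] + P[np.ix_(idx, stop)] @ r[stop]
        V[idx] = np.linalg.solve(A, b)
    return V

def simplex_like_stopping(r, c, P):
    """Improve the action in exactly one state per iteration, starting from all-stop."""
    S = len(r)
    cont = np.zeros(S, dtype=bool)
    for _ in range(S):
        V = stopping_value(r, c, P, cont)
        gains = (-c + P @ V) - V                    # effect of switching a stop state to continue
        gains[cont] = -np.inf
        s = int(np.argmax(gains))
        if gains[s] <= 1e-12:                       # no single-state improvement remains
            break
        cont[s] = True                              # never switched back, as argued above
    return cont, stopping_value(r, c, P, cont)

# Hypothetical data: 4 states, strictly substochastic continuation matrix.
r = np.array([1.0, 0.5, 4.0, 0.2])
c = np.array([0.3, 0.1, 0.2, 0.4])
P = 0.9 * np.array([[0.2, 0.5, 0.2, 0.1],
                    [0.1, 0.3, 0.5, 0.1],
                    [0.0, 0.2, 0.6, 0.2],
                    [0.3, 0.3, 0.2, 0.2]])
cont, V = simplex_like_stopping(r, c, P)
print("continue in states:", np.flatnonzero(cont), " values:", np.round(V, 3))
```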

(c) Entrance-Fee Problem. Observe that V = r ∨ (−c + PV) if and only if V̂ ≡ V − r satisfies V̂ = 0 ∨ (−ĉ + PV̂) with ĉ ≡ c + r − Pr, and the latter is an entrance-fee problem. Thus there is a solution of the former system if and only if that is so of the latter, so there is a stationary maximum-value stopping policy in one problem if and only if that is so in the other. In either case, V* is the least solution of the former system if and only if V̂* = V* − r is the least solution of the latter. Also, V*_s = r_s if and only if V̂*_s = 0, so the optimal states in which to stop are the same in both problems.

(d) Optimality of Myopic Policies for Entrance-Fee Problems. It is certainly optimal to continue in states s < s* with negative entrance fees c_s < 0 because one earns an immediate reward −c_s > 0 from so doing compared with a zero reward from stopping therein. On the other hand, if s ≥ s* it is optimal to stop and earn a zero reward. This is because the entrance fees are nonnegative in all states in [s*, S], and once one enters this set of states it is never possible to return to any state t < s* with a negative entrance fee because the elements of P below the diagonal are zero. Thus continuation in any state in [s*, S] cannot earn positive expected rewards. A more formal proof may be given by showing by induction on s, starting with s = S, that the least fixed point V* = (V*_s) of the optimal return operator is nonnegative and is given by

V*_s = 0 for s ≥ s*,   and   V*_s = −c_s + ∑_t p_{st} V*_t for s < s*.
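A minimal numerical sketch of the induction just described (not from the original answer; the entrance fees and transition matrix are hypothetical). Because P has no entries below the diagonal, a single backward pass over the states computes the least nonnegative fixed point.

```python
import numpy as np

def myopic_entrance_fee_value(c, P):
    """Least nonnegative fixed point of V = 0 v (-c + PV) when P has no entries
    below the diagonal and every diagonal entry is less than one."""
    S = len(c)
    V = np.zeros(S)
    for s in range(S - 1, -1, -1):                    # induction on s, starting with s = S
        tail = P[s, s + 1:] @ V[s + 1:]               # states above s are already final
        V[s] = max(0.0, (-c[s] + tail) / (1.0 - P[s, s]))
    return V

# Hypothetical entrance fees: negative below s* = 2, nonnegative from s* on.
c = np.array([-1.0, -0.5, 0.3, 0.8])
P = np.array([[0.2, 0.4, 0.3, 0.1],
              [0.0, 0.3, 0.4, 0.3],
              [0.0, 0.0, 0.5, 0.4],
              [0.0, 0.0, 0.0, 0.6]])
V = myopic_entrance_fee_value(c, P)
print("V* =", np.round(V, 4), " continue exactly where c_s < 0:", np.flatnonzero(V > 0))
```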

4. House Buying. This is a special case of the simple stopping problem in Problem 3 in which the states are the lowest prices 1, …, S seen so far. If s is the lowest price seen so far, then one can either stop and buy the house whose price was s, with revenue r_s = −s, or continue looking at a cost c. Then (Pr)_s = −∑_{t=1}^{S} (t ∧ s)p_t. Observe first that V = (V_s) = 0 satisfies V_s ≥ (−s) ∨ (−c + (PV)_s), so by the result of Problem 3(a), a stationary optimal stopping policy exists. Also, by the result of Problem 3(c), the problem is equivalent to the entrance-fee problem in which ĉ = (ĉ_s) = c + r − Pr, so

ĉ_s = c − s + ∑_{t=1}^{S} (t ∧ s)p_t = c − ∑_{t=1}^{S} (s − t)^+ p_t.

Also, the conditions for the optimality of myopic policies in Problem 3(d) are fulfilled with the order of the states reversed. For the elements of P = (p_{st}) above the diagonal are zero and the diagonal elements are less than one except for the first, which equals one, since p_{tt} ≤ 1 − p_1 < 1 for t > 1 and p_{11} = 1. Also, ĉ_1 = c > 0, so there is a maximum s = s* such that ĉ_s ≥ 0. Moreover, ĉ_s is nonincreasing in s, so ĉ_s ≥ 0 for s ≤ s* and ĉ_s < 0 for s > s*. Thus, if s is the lowest house price seen to date and s ≤ s*, it is optimal to buy the house whose price was s; otherwise it is optimal to continue looking.

Observe that in implementing this policy, one buys the first time that the lowest price s to date is s* or less. But this happens only if the lowest price to date is in fact that for the last house observed. Thus, the policy of buying the first house whose price is s* or less is optimal for the problem in which the option of buying a house whose price was observed in a previous period is not available.
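The threshold s* is simple to compute from the entrance-fee formula above and to check against plain value iteration on V_s = max(−s, −c + ∑_t p_t V_{t∧s}). The sketch below is not from the notes; the price distribution and search cost are hypothetical.

```python
import numpy as np

def house_buying_threshold(p, c):
    """p[t-1] = Pr(observed price is t), t = 1..S; c = cost of one more period of looking.
    Returns s* = largest s with chat_s >= 0, where chat_s = c - sum_{t<s}(s - t) p_t."""
    S = len(p)
    chat = np.array([c - sum((s - t) * p[t - 1] for t in range(1, s)) for s in range(1, S + 1)])
    return int(np.flatnonzero(chat >= 0).max()) + 1, chat

def value_iteration(p, c, iters=2000):
    S = len(p)
    s_vals = np.arange(1, S + 1, dtype=float)
    V = np.zeros(S)
    for _ in range(iters):
        cont = -c + np.array([sum(p[t - 1] * V[min(s, t) - 1] for t in range(1, S + 1))
                              for s in range(1, S + 1)])
        V = np.maximum(-s_vals, cont)                 # buy (stop) or keep looking
    return V, cont

p = np.full(10, 0.1)                                  # hypothetical: prices 1..10 equally likely
c = 0.5
s_star, chat = house_buying_threshold(p, c)
V, cont = value_iteration(p, c)
buy = np.flatnonzero(-np.arange(1, 11, dtype=float) >= cont - 1e-9) + 1
print("s* from entrance fees:", s_star, "  buy (value iteration) at prices:", buy)
```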


MS&E 351 Dynamic Programming, Spring 2008. Arthur F. Veinott, Jr.

Homework 5 Due May 9 (Do Problem 3 and Problem 1 or 2; Problem 2 if you took MS&E 251)

1. Bayesian Statistical Quality Control and Repair. A firm manufactures a product under a continuing contract with the government that allows the government to cancel at any time without penalty. During any given day, the process for producing the product is either in control or out of control. Whether the process is out of control is not observed directly, but can only be inferred from the results of production. Each day the firm produces one item, inspects it and classifies it as good or defective. The government accepts a good item and pays the firm r > 0 for it. The firm discards a defective item. When the process is out of control, each item produced is defective. Thus if a good item is produced, the process was in control during its production. When the process is in control, the (known) probability that it produces a good item is p, 0 < p < 1. The (known) probability that the process is in control at the time of production of an item given that it is in control at the time of production of its predecessor is q, 0 < q < 1. Once the process is out of control, it remains that way until it is repaired. Independently of the production process, there is a probability β, 0 < β < 1, that the firm will retain the contract for another day. The government informs the firm at the beginning of a day whether or not the contract is to be continued that day. If the decision is to cancel, the firm receives no further revenue from the government and incurs no further costs. If the decision is to continue, there are two possibilities that day: immediately repair at a cost K > 0 or don't repair. Repair is done quickly and brings the process into control with probability q (regardless of whether or not it was in control at the time of repair). Repair is the only permissible option when S consecutive defectives have been produced since the later of the times of the last repair and last good item. This Bayesian sequential statistical decision problem is to choose a repair policy that maximizes the expected value of profits before the government cancels the contract.

(a) Posterior Probability the Next Item is Good. Let p_s be the conditional probability that the (s+1)st item is good given that items 1, …, s were defective and either item 0 was good or the process was repaired just before producing item 1. Give a formula for p_s in terms of p, q and s.

(b) Maximum Expected Value. Write down explicitly an equation (in terms of β, K, r and the p_s) whose solution is the maximum expected profit before the government cancels the contract. Define the states and actions. Is this system transient? Why? Does there exist a stationary policy achieving the maximum expected profit? Explain briefly. Discuss whether or not the policy-improvement method could be used to solve this problem.

(c) Optimality of Control-Limit Policy. Show for the problem in part (b) that a control-limit policy is optimal, i.e., there is a number s* such that if the number of consecutive defective items since the later of the last good item or repair is s, it is optimal to repair if s ≥ s* and not to repair otherwise. [Hint: First show that p_s is decreasing in s. Next show by induction on N that the maximum expected profit over N days is decreasing in the state where the terminal value function is zero. Then use successive approximations to show that the maximum expected profit over the infinite horizon is decreasing in the state. Finally, establish the optimality of a control-limit policy.]

Remark. The approach to establishing the form of the optimal policy in the infinite-horizon problem given in part (c), viz., establishing a desired property of the N-period value function by induction and then retaining the property as N → ∞, is the most commonly used way to establish the form of the optimal policy in infinite-horizon problems.

2. Minimum Expected Present Value of Sojourn Times. Consider a finite trans substochas-ient

tic Markov decision chain with state space , one-period rewards and transition matrices f < T$ $

for each decision . Append a stopped state to . The vector of probabilities that the sys-$ ? 7 f−

tem enters from each state in in one step using decision is . Let be the sojourn7 f $ ÐM T Ñ" X$ 1=

time of the system in before reaching starting from state and using policy . Let f 7 1= X œ31=!X

4œ"41= " f f be the present value of the sojourn time of the system in starting from state = −

and using policy where 0 is the discount factor. Let , E E ,1 " œ " X œ ÐX Ñ X œ Ð X Ñ""3 1 1 1 1= =

E E , Var Var and E E be column vectors.X œ Ð X Ñ X œ Ð X Ñ X œ Ð X Ñ# #= ==1 1 1 11 1

3 3

(a) Factorial Moments of Sojourn Times. Let , and . Show that X œ X T œ T H œ H H$ $ $

œ ÐM TÑ œ H 5 œ "ß #ßá œ and E 1, where E E is a column" 5Š ‹ Š ‹ Š Š ‹‹X5" X4 X 45 5 5

=

vector. [ : Let if the system starts in state , uses , and does not reach state be-Hint M œ " =R _= $ 7

fore or in period , and let 0 otherwise. Then . Now compute the desiredR M œ X œ MR R= ==

_Rœ!$ !

expectations by induction on using the familiar binomial identity .]5 œ Š ‹ Š ‹ Š ‹X4" X4 X45 5" 5

(b) Minimum Limiting Expected Present Value of Sojourn Times of Individuals and Pop-

ulations, and Moment-Optimality. Suppose one exogenous individual arrives in each state in=

each period. Let be the present value of sojourn times in of all exogenous individuals whoX =3$ f

first enter the system in state and their progeny, and let . Show that the following= X œ ÐX Ñ =3 3$ $

problems are equivalent:

1 Minimum Limiting Expected Present Value of Population Sojourn Time.‰ Find for$

which lim sup E E 0 for all .33$

3#Æ! Ð X X Ñ Ÿ −# ?

2 Minimum Limiting Expected Discounted Individual Sojourn Time of Order One.‰

Find for which lim sup E E 0 for all .$ 3 # ?33$

3#Æ!

"Ð X X Ñ Ÿ −

3 Maximum Expected Population Reward with Negative Unit One-Period Reward.‰

Find for which lim inf 0 for all where for all .$ 3 # ? # ?3 #3$

3#Æ!

"ÐZ Z Ñ   − < œ " −

4 Policies with No 1-Improvement and Negative Unit One-Period Reward‰ . Find $

that maximizes lexicographically where for all .Ð@ ß @ Ñ < œ " −! "$ $ # # ?

5 Moment Optimality‰ . Find that minimizes E Var lexicographically.$ Ð X ß X Ñ$ $


How would you have to modify the above results if the rate of exogenous arrivals into each state

were constant, but state dependent? Random (and independent) with constant rate, but state

dependent? [ : EsHint tablish the following equivalences: 1 2 3 4 5 . To establish‰ ‰ ‰ ‰ ‰Í Í Í Í

the first equivalence, show that is the present value of the number of exogenous individuals3"

entering the system in each given state. Use the result of a to establish the last equivalence.]Ð Ñ

(c) Risk Posture. For fixed , does an individual who wishes to minimize E exhibit risk3 ! X 3$

aversion, risk preference or neither toward his sojourn time in ? Why? Explain briefly whyX$ f

your conclusion is consistent with the equivalence of 2 and 5 in (b).‰ ‰

(d) Limiting Maximum Expected Sojourn Times. The problems in parts (b) and (c) seem

most appropiate in situations where the sojourn times are painful, so we want to get them over

quickly. But often sojourn times are pleasant, e.g., a firm’s time to bankruptcy, a student’s stay

at Stanford, or even life itself, so we want to prolong them. In this circumstance, how would you

modify the results of parts (b) and (c).

3. Optimal Control of Tandem Queues. Consider a job waiting to be processed by a factory at the beginning of some hour. The job must first be processed at station one and then at station two. The job may be processed at each station by a skilled or unskilled worker. Unskilled workers are paid the hourly rate h_i at station i while working on the job and skilled workers are paid twice that much. If the job is processed by a worker at station one during an hour and does not require rework, the job is sent on to station two for service at the beginning of the next hour and earns the shop the progress payment r_1. If the job is processed by a worker at station two during an hour and does not require rework, the job is shipped and earns the shop an additional payment r_2. If the job is processed at either station in an hour by an unskilled worker, the job requires rework at that station during the subsequent hour with probability p, 1/2 < p < 1, whether or not the job was previously reworked. By contrast, if the job is processed by a skilled worker at either station in an hour, the job requires rework at that station during the following hour with probability 2p − 1. Thus, the probability a skilled worker who processes the job at a station during an hour successfully completes the job there is 2(1 − p), which is double the corresponding probability for an unskilled worker. The problem is to decide whether to use a skilled or unskilled worker at each station to process the job there. In short, the issue is whether it is better to use a slow worker with low pay or a fast worker with high pay at each station.

(a) Strong-Maximum-Present-Value Policies by Strong Policy Improvement Method. Assume p = .7, h_i = 3 + 3i and r_i = 17i for i = 1, 2. Determine which stationary policies have maximum value. Use the strong policy improvement method as implemented on page 59 of Lectures in Dynamic Programming and Stochastic Control to find a strong-maximum-present-value policy starting with the policy in which an unskilled worker is used at each station.


(b) Nonstationary Arrivals. Suppose the expected number of jobs arriving for service at the factory in hour N + 1 is w_N = 3000 + N + 8 sin(πN/2), N = 0, 1, … . Also assume that the numbers of unskilled and skilled workers available at each station each hour are large enough so that all jobs requiring service during the hour can be processed without delay. What stationary policies have strong maximum present value? [Hint: Show that the present value of profits is the product of the present values of the arrival stream (w_N) and the profits earned by a single job.]


MS&E 351 Dynamic Programming, Spring 2008. Arthur F. Veinott, Jr.

Answers to Homework 5 Due May 9

1. Bayesian Statistical Quality Control and Repair.

(a) Posterior Probability the Next Item is Good. Let G be the event that item s + 1 is good; L be the event that items 1, …, s are defective and either item 0 is good or the process is repaired before producing item 1; and I be the number of the items 1, …, s that are produced while the process is in control. Then L = ∪_{k=0}^{s} {L, I = k}. Let p_s = Pr(G | L). Then

p_s = Pr(G | L) = Pr(G, L)/Pr(L) = Pr(G, L, I = s) / ∑_{k=0}^{s} Pr(L, I = k)
    = qp[q(1 − p)]^s / ( (1 − q) ∑_{k=0}^{s−1} [q(1 − p)]^k + [q(1 − p)]^s ).
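As a quick sanity check (not part of the original answer), the closed form can be compared with a direct enumeration over the in-control duration I; the values of p and q below are hypothetical.

```python
def p_s_closed_form(p, q, s):
    a = q * (1 - p)
    return q * p * a**s / ((1 - q) * sum(a**k for k in range(s)) + a**s)

def p_s_enumeration(p, q, s):
    # Pr(L, I = k): in control for items 1..k (each defective), then out of control (if k < s).
    pr_L_I = [(q * (1 - p))**k * ((1 - q) if k < s else 1.0) for k in range(s + 1)]
    pr_G_L = (q * (1 - p))**s * q * p      # in control throughout, and item s+1 is good
    return pr_G_L / sum(pr_L_I)

p, q = 0.8, 0.9
for s in range(5):
    assert abs(p_s_closed_form(p, q, s) - p_s_enumeration(p, q, s)) < 1e-12
    print(s, round(p_s_closed_form(p, q, s), 4))   # p_s decreases in s
```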

(b) Maximum Expected Value. The states are the numbers 0, 1, …, S of consecutive defective items produced since the later of the time of the last repair or good item. In each state other than S, there are two actions, repair or don't repair. In state S the only action is to repair. Let V_s be the maximum expected net income given that the system is in state s. Observe that p_0 = qp. Then for 0 ≤ s < S,

(*)  V_s = max( rp_0 − K + β[p_0 V_0 + (1 − p_0)V_1],  rp_s + β[p_s V_0 + (1 − p_s)V_{s+1}] ),

and V_S = rp_0 − K + β[p_0 V_0 + (1 − p_0)V_1]. Since the probability of retaining the contract on a day is β, β < 1, the system is strictly substochastic and so transient. Thus there is a stationary maximum-value policy. Hence the policy-improvement method could be used to solve this problem.

(c) Optimality of Control-Limit Policy. To see that a control-limit policy is optimal, i.e., there exists s* such that one repairs if s ≥ s* and doesn't repair otherwise, first show that p_s is decreasing in s as follows. Put α ≡ q(1 − p). Then

p_s = qpα^s / [ (1 − q) ∑_{k=0}^{s−1} α^k + α^s ] = qp / [ (1 − q) ∑_{k=1}^{s} α^{−k} + 1 ]

is decreasing in s because ∑_{k=1}^{s} α^{−k} is increasing in s and q < 1.

Next show that the maximum value V_s is decreasing in s = 0, …, S. To this end, put V^0_s ≡ 0, s = 0, …, S, so V^0_s is decreasing in s. Now suppose that V^N_s is decreasing in s = 0, …, S and consider N + 1. Then for s = 0, …, S − 1,

(**)  V^{N+1}_s = max( rp_0 − K + β[p_0 V^N_0 + (1 − p_0)V^N_1],  rp_s + β[p_s V^N_0 + (1 − p_s)V^N_{s+1}] ).

Since r > 0 and p_s is decreasing in s, rp_s is decreasing in s = 0, …, S. Also, since p_s > 0, it follows from the induction hypothesis that

p_s V^N_0 + (1 − p_s)V^N_{s+1} ≥ p_{s+1} V^N_0 + (1 − p_{s+1})V^N_{s+1} ≥ p_{s+1} V^N_0 + (1 − p_{s+1})V^N_{s+2},

i.e., p_s V^N_0 + (1 − p_s)V^N_{s+1} is decreasing in s = 0, …, S − 1. Thus, the second term in parentheses on the right side of (**) is decreasing in s = 0, …, S − 1. Since the first term in parentheses on the right side of (**) is a constant, and the pointwise maximum of a constant and a decreasing function is decreasing, V^{N+1}_s is decreasing in s = 0, …, S − 1. Also, V^{N+1}_{S−1} ≥ rp_0 − K + β[p_0 V^N_0 + (1 − p_0)V^N_1] = V^{N+1}_S, so V^{N+1}_s is decreasing in s = 0, …, S.

Now since the system is strictly substochastic, it follows from the fact (shown in Lectures …) that the maximum N-period value converges to the maximum value, so V_s = lim_{N→∞} V^N_s is also decreasing in s = 0, …, S. Consequently, the second term in parentheses on the right-hand side of (*) is decreasing in s = 0, …, S − 1, and the control-limit policy defined by letting s* be the smallest integer s, 0 ≤ s < S, such that rp_0 − K + β[p_0 V_0 + (1 − p_0)V_1] ≥ rp_s + β[p_s V_0 + (1 − p_s)V_{s+1}], if there is such an integer, and letting s* ≡ S otherwise, has maximum value since that decision assures that equality occurs in (*).
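The successive-approximations argument can be mirrored numerically. The sketch below is not from the notes; the parameter values are hypothetical, and it assumes the recursion (*) exactly as written above, with the retention probability β discounting the continuation terms. It iterates (**) to convergence, checks that V_s is decreasing in s, and reads off the control limit s*.

```python
import numpy as np

def quality_control(r, K, beta, p, q, S, iters=5000):
    a = q * (1 - p)
    ps = np.array([q * p * a**s / ((1 - q) * sum(a**k for k in range(s)) + a**s)
                   for s in range(S + 1)])
    V = np.zeros(S + 1)
    for _ in range(iters):
        repair = r * ps[0] - K + beta * (ps[0] * V[0] + (1 - ps[0]) * V[1])
        cont = r * ps[:S] + beta * (ps[:S] * V[0] + (1 - ps[:S]) * V[1:S + 1])
        V = np.append(np.maximum(repair, cont), repair)      # state S must repair
    s_star = next((s for s in range(S) if repair >= cont[s]), S)
    return V, s_star

V, s_star = quality_control(r=10.0, K=4.0, beta=0.95, p=0.8, q=0.9, S=8)
print("V decreasing in s:", bool(np.all(np.diff(V) <= 1e-9)), "  control limit s* =", s_star)
```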

2. Minimum Expected Present Value of Sojourn Times.

(a) Factorial Moments of Sojourn Times. Since the system is transient, 0 and T œ H œ‡

( ) . We now show by induction that E 1. This is trivially so for 0.M T œ H 5 œ" 5Š ‹X5"

5

Suppose it is true for and consider . Then by the induction hypothesis,5   ! 5 "

E E E 1 E .ΠΠΠΠX5 X5" X5" X5"

5" 5 5" 5"œ œ H 5

Let be the event that the system reaches in the first step and Pr . ThenV : œ ÐV Ñ" " "7

E E E (1 ),ΠΠΠΠΠX5" X5" X5"

5" 5" 5"œ ± V : ± V :" " "

-"

Observe that, in the above equation, the first term is 0 because , and the second termŠ ‹5

5"œ !

is E . Thus, E 1 E , so E 1 1.T œ H T œ ÐM TÑ H œ HŠ ‹ Š ‹ Š ‹ Š ‹X5 X5 X5 X5

5" 5" 5" 5"5 " 5 5" 2

(b) Minimum Limiting Expected Present Value of Sojourn Times of Individuals and Pop-

ulations, and Moment-Optimality.

1 2 . The present value of the number of exogenous individuals entering the system in a‰ ‰Í

given state is for since exactly one exogenous individual enters the state in!_3œ"

3 "" 3 3œ !

each period. Then, E E , which implies the claim.X œ X"3 3

$ $3

2 3 . As shown in , for each and all , E 1 ( 1)‰ ‰Í ! X œ V œ V œ ZLectures á # 3 3 3 3 3# # # #

since 1, which establishes the claim.< œ #

3 4 . For small values of and each , since 0. Thus‰ ‰ 8 8 8 8 ‡_ _8œ" 8œ!Í Z œ @ œ @ T œ3 # 3 33

# # # #! !

3 3 3" " ! ! " "( ) ( ) ( ) ( ). On taking the limit inferior as on bothZ Z œ @ @ @ @ 9 " Æ !3$

3# # #$ $

1. This monotonicity argument is due to Gareth James.
2. This induction step is due to Alamuru Krishna.


sides, one sees that the limit inferior on the left is nonnegative if and only if ( ) ( )@ ß @ ¤ @ ß @! " ! "$ $ # #

where is the usual lexicographic order.¤

4 5 . It is enough to observe that E 1 and‰ ‰ !Í X œ H œ H < œ @$ $ $ $ $

Var E E E 2 1 1 ( 1) 2 ( ) .X œ # œ H H H œ @ @ @X " X X

# " "$ $ $

$ $ $$ $ $ $ΠΠΠ## # " ! ! #

No modifications are required in either case.

(c) Risk Posture. Since the present value of the individual’s sojourn time is , minimizingX 3$

E is equivalent to maximizing E , and is a convex function of , it follows that theX X X X3 3 3$ $ $ $

individual exhibits risk preference toward his sojourn time . By the equivalence of 2 and 5 ,X$‰ ‰

minimizing the sum of the first two terms of the expansion of E for small is equiv-3 3" X !3$

alent to lexicographically minimizing (E Var ). Thus the individual prefers policies thatX ß X$ $

minimize E , and among policies that achieve this minimum, those that maximize Var . ThisX X$ $

is consistent with risk preference.

(d) Limiting Maximum Expected Sojourn Times. In (b), modify 1 and 2 by replacing lim sup‰ ‰

by lim inf, and by ; modify 3 and 4 by replacing by ; and modify 5 by replacingŸ   " "‰ ‰ ‰

minimizes by maximizes. The modified version of (c) involves an individual who wishes to maxi-

mize E , and so exhibits risk aversion towards his sojourn time . By the equivalence of theX X3$ $

modified versions of 2 and 5 , maximizing the sum of the first two terms of the expansion of‰ ‰

3 3"E for small is equivalent to lexicographically maximizing (E Var ). Thus theX ! X ß X3$ $ $

individual prefers policies that maximize E , and among policies that achieve this maximum,X$

those that minimize Var . This is consistent with risk aversion.X$

3. Optimal Control of Tandem Queues.

(a) Strong-Maximum-Present-Value Policies by Strong Policy Improvement Method. The state space consists of two states 1 and 2, viz., the station at which the job is located. There are four possible decisions, Δ = {uu, us, su, ss}, where us represents using an unskilled worker at station 1 and a skilled worker at station 2, etc. The transition matrices associated with each decision are (rows listed first to second):

P_uu = [[p, 1−p], [0, p]],   P_us = [[p, 1−p], [0, 2p−1]],
P_su = [[2p−1, 2(1−p)], [0, p]],   P_ss = [[2p−1, 2(1−p)], [0, 2p−1]].

Notice that each P_δ is transient and hence the system is transient, i.e., the system degree is zero. Thus P*_δ = 0 and D_δ = (I − P_δ)^{−1} for all δ ∈ Δ, and, if T is the minimum of the ranks of the stationary matrices over all decisions, then T = 0. Theorem 22 on p. 54 implies that a stationary strong-maximum-present-value policy exists and Theorem 25 on p. 57 implies that Δ ⊇ Δ^0 ⊇ Δ^1 ⊇ Δ^2 = Δ^∞.

Next show that Δ^0 = Δ and Δ^1 = Δ^2 = Δ^∞ = {ss}, i.e., that using skilled workers at both stations is the unique stationary strong-maximum-present-value policy. Saying that Δ^0 = Δ is equivalent to saying that all stationary policies have the same value. That this is so should be clear from the results of the single-station problem. Since the downstream station (station 2) may be considered as a single station in isolation, whatever action one takes in state 2 must give the same value, i.e., v^0_{u2} = v^0_{s2} ≡ v^0_2 = r_2 − h_2/(1 − p) = 34 − 9/.3 = 4. Then for station 1, as far as total expected value is concerned, the expected value earned in state 2 is an additional reward for shipping the job on from station 1 (so r′_1 = r_1 + v^0_2 = 17 + 4 = 21), whence the action taken in state 1 does not affect the expected value obtained in state 2, i.e., v^0_{u1} = v^0_{s1} ≡ v^0_1 = r′_1 − h_1/(1 − p) = 21 − 6/.3 = 1. Thus the value of all stationary policies is V* = (1, 4)′.

It is possible to show this directly by calculating V_δ = D_δ r_δ = (I − P_δ)^{−1} r_δ for each δ ∈ Δ, where the reward vectors associated with each decision are

r_uu = ((1−p)r_1 − h_1, (1−p)r_2 − h_2)′,   r_us = ((1−p)r_1 − h_1, 2(1−p)r_2 − 2h_2)′,
r_su = (2(1−p)r_1 − 2h_1, (1−p)r_2 − h_2)′,   r_ss = (2(1−p)r_1 − 2h_1, 2(1−p)r_2 − 2h_2)′.

Now v^0_δ = lim_{ρ↓0} V^ρ_δ = V_δ for each δ ∈ Δ, and so Δ^0 = Δ and v^0 = (1, 4)′. Now determine v^1 = max_{γ∈Δ^0} (−v^0 + P_γ v^1). As explained on p. 59 of Lectures …, the problem of finding v^1 is precisely that of finding a stationary maximum-value policy for the transient system with the restricted decision set Δ^0 and the new reward vector −v^0. To do this, use policy improvement and, as suggested, begin with the policy uu. Since v^1_uu = −v^0 + P_uu v^1_uu, it follows that

v^1_{uu1} = −1 + .7 v^1_{uu1} + .3 v^1_{uu2}   and   v^1_{uu2} = −4 + .7 v^1_{uu2},

from which one concludes v^1_uu = (−50/3, −40/3)′.

Now consider improving uu with the aid of the comparison function

G_γ v^1_uu = −v^0 + P_γ v^1_uu − v^1_uu.


Set γ = ss. Then

G_ss v^1_uu = (−1, −4)′ + [[.4, .6], [0, .4]] (−50/3, −40/3)′ − (−50/3, −40/3)′ = (1, 4)′,

and so ss improves uu. But ss is, in fact, the decision that gives the “greatest” improvement of uu, i.e., ss is the decision that Newton's method selects. This is clear since (G_γ v^1_uu)_t depends on γ_t, but not on γ_s for s ≠ t, so that one has

G_uu v^1_uu = (0, 0)′,   G_us v^1_uu = (0, 4)′,   and   G_su v^1_uu = (1, 0)′.

Now since v^1_ss = −v^0 + P_ss v^1_ss, it follows that

v^1_{ss1} = −1 + .4 v^1_{ss1} + .6 v^1_{ss2}   and   v^1_{ss2} = −4 + .4 v^1_{ss2},

from which follows v^1_ss = (−25/3, −20/3)′.

Now we claim that G_γ v^1_ss ≤ 0 for all γ ∈ Δ^0, which implies that ss ∈ Δ^1. Further, since G_γ v^1_ss ≠ 0 unless γ = ss, it follows that Δ^1 = {ss}. Since Δ^∞ is non-empty and is a subset of Δ^1, conclude that Δ^∞ = {ss} and hence that ss is the unique stationary strong-maximum-present-value policy.
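A short numerical check of the computations above (not from the notes), using the given data p = .7, h_i = 3 + 3i, r_i = 17i: it confirms that every stationary policy has value (1, 4)′ and reproduces v^1_uu, the comparison values G_γ v^1_uu, and v^1_ss.

```python
import numpy as np

p, h, r = 0.7, {1: 6, 2: 9}, {1: 17, 2: 34}

def P_of(d):                                    # d in {'uu', 'us', 'su', 'ss'}
    stay = {'u': p, 's': 2 * p - 1}             # rework probability at a station
    return np.array([[stay[d[0]], 1 - stay[d[0]]],
                     [0.0,        stay[d[1]]]])

def r_of(d):
    done = {'u': 1 - p, 's': 2 * (1 - p)}       # completion probability
    pay  = {'u': 1.0,   's': 2.0}               # wage multiplier for skilled workers
    return np.array([done[d[0]] * r[1] - pay[d[0]] * h[1],
                     done[d[1]] * r[2] - pay[d[1]] * h[2]])

decisions = ['uu', 'us', 'su', 'ss']
V = {d: np.linalg.solve(np.eye(2) - P_of(d), r_of(d)) for d in decisions}
print({d: np.round(V[d], 6) for d in decisions})            # every value equals (1, 4)

v0 = V['uu']
v1 = {d: np.linalg.solve(np.eye(2) - P_of(d), -v0) for d in ['uu', 'ss']}
G = {d: -v0 + P_of(d) @ v1['uu'] - v1['uu'] for d in decisions}
print("v1_uu =", np.round(v1['uu'], 4), " v1_ss =", np.round(v1['ss'], 4))
print({d: np.round(G[d], 4) for d in decisions})            # G_ss = (1, 4), G_uu = (0, 0)
```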

(b) Nonstationary Analysis. Under the stationary policy δ^∞, the Nth cohort (i.e., the jobs that arrive in hour N + 1) earns a value discounted to time N of V^ρ_δ. Discounting this to time 0, the Nth cohort earns the present value β^N w_N V^ρ_δ. Thus the present value of the policy is

∑_{N=0}^{∞} β^N w_N V^ρ_δ = V^ρ_δ ∑_{N=0}^{∞} β^N w_N.

But ∑_{N=0}^{∞} β^N w_N is the present value of the arrival stream. This demonstrates the hint. It follows immediately that a policy that has strong maximum present value for a single job also has strong maximum present value for the problem with nonstationary arrivals. Thus ss is the unique stationary strong-maximum-present-value policy for this problem.


MS&E 351 Dynamic Programming, Spring 2008. Arthur F. Veinott, Jr.

Homework 6 Due May 16

1. Limiting Present-Value Optimality with Binomial Immigration. Consider a bounded -stateW

discrete-time-parameter Markov population decision chain with immigration stream ,A œ ÐA ÑR

interest rate 100 % 0 and discount factor is . Assume throughout that the series 3 " œ A ´"

" 33

!_Rœ!

R R R ;" 3A ! A œ SÐR Ñ is absolutely-convergent for all . (For example, that is so if for

some .) The diagonal element of is the present value of the stream of immigrants;   ! = A>2 3

into state discounted to period one. Let be the present value of the (cohort) policy with= Z 31A 1

the immigration stream . Call for if lim infA A ÐZ Z Ñ- limiting present-value optimal 33-

31Æ!

A A

  ! ´ − À Z ¤ Z − ´ − À Z ¤ Z − for all . Let { , all } and { , all }.1 ? $ ? # ? ? $ ? # ?8 _8 8$ # $ #

(a) Present Values of Convolutions are Products of Present Values. Suppose that andA

A A As s are immigration streams for which the series defining and are absolutely convergent.3 3

Show that . Show by induction on that for ÐA ‡AÑ œ A A 8 " œ Ð" Ñ œ Ð" Ñ 8 œs s3 3 338

8 8 8" 3 3 k k!ß "ßá " 8 where is the present value of the binomial sequence of order discounted to period one.3

8

(b) Limiting Present-Value Optimality for Binomial Immigration Stream of Order .8 Show

that . Show also that is limiting present-value optimal for the binomial immigra-Z œ A Z3 3 31 1A -

tion stream of order if and only if lim inf for all .8 Ð   "Ñ ÐZ Z Ñ   !33-

31Æ!

83 1

(c) is the Set of Decisions that are Limiting Present-Value Optimal for the Binomial?8

Immigration Stream of Order .8 Show that is limiting present-value optimal for the binom-$_

ial immigration stream of order if and only if .8 Ð   "Ñ −$ ?8

(d) Machine Sequencing. Let be the expected number of jobs arriving at a firm's!R   !

machine shop in week , . The machine shop has three types of machines la-R " R œ !ß "ßá

beled . Each job must be processed through one of the four machine sequences or "ß #ß $ +ß ,ß -ß .

given in the table below. Each job takes three weeks to process through the shop with each se-

quence, one week on each machine in the sequence. There are enough machines of each type so

that jobs are not delayed at any machine to which they are assigned by the machine shop. The

firm incurs a cost of thousand dollars in each week a job is processed by machine . Which se-3 3

quences are (Cesàro) overtaking optimal for the firm when the expected numbers of jobs arriv-

ing each week are respectively and for , for all , and! ! ! !! R R ! ! œ ! R ! œ ! R   !

Machines in each Sequence

Sequence   1st week   2nd week   3rd week
   a           1          3          2
   b           2          1          3
   c           2          3          1
   d           1          3          3


! !R !œ ÐR "Ñ ! R   ! for all ? Which sequences are limiting present-value optimal for each

of the above three job-arrival streams? Which sequences have strong maximum present value for

each of the three job-arrival streams. [ : Tabulate for each sequence andHint • • ! !! !$ # œ Ð ÑR

relevant pairs of decisions directly from the definitions without first finding the terms of the# $ß

expansions.]

(e) Social Security: Aggregating Losses to Produce a Profit. A country that does not per-

mit immigration divides its adult population into two groups, workers and retirees . TheÐ[Ñ ÐVÑ

country has proposed a new social security system that will cover all current and future inhabi-

tants (except retirees in the initial population who will be supported in a different way).

Individuals in the country spend one period of 30 years in each group. Let diag beA œ ÐA !Ñ! ![

the initial population in the two groups to be covered by social security. The vector of exogen-

ous individuals (children) entering the population in period is , . Workers inR " : A : !R !

one period have an even chance of surviving to retire in the next period and retirees in a period

have no chance of surviving to retire again in the next period. Each worker earns an average

income during his working years and pays 10% thereof into social security. Each retireeM !

receives 20 % of from social security during retirement.: M

Ð3Ñ Show that the expected social security payments to each individual (except the initial re-

tirees) is greater than, equal to, or less than the amount collected from the individual respectively

according as , , or , and yet social security is always in the black!: " : œ " : "

Ð33Ñ Show that the ratio of a worker's retirement income from social security to his income

per person before retirement after social security payments is , and so is an increasing#

*:Ð: "Ñ

function of (assuming that a worker's income is divided equally between himself and his chil-: :

dren). In particular that ratio is .17, .44 or .83 respectively according as .5, 1.0 or 1.5.: œ

Ð333Ñ : " In this example with , whence the number of immigrants increases geometrically,

the government pays out more to each individual than it collects (i.e., ) and yet collectsZ ¥ !$

more money from the population as a whole than it pays out (i.e., lim inf C ).RÄ_RAZ ¦ ! Ð ß "Ñ$

Show that this phenomenon cannot occur in a transient Markov population decision chain (as in

this example) with a binomial immigration stream .A

2. Maximizing Reward Rate by Linear Programming. Assume that an -state discrete-time-W

parameter Markov population decision chain has degree one. Let be a row -vector of theA   ! W

num dual pair of linearbers of individuals initially in each state. Consider the following redundant

programs. The primal is to find row -vectors and for thatW B B −" !# # # ?


maximize "#

# #B <"

subject to

Ð Ñ B B U œ A1 " "# #

# # #" !

(2) B U œ !"#

# #"

(3) for .B ß B   ! −" !# # # ?

The dual is to find column -vectors and thatW @ @" !

minimize A@"

subject to

Ð Ñ @ U @   <4 " !# #

and

(5) U @   !#"

for all .# ?−

(a) Solution of Linear Programs. Show that there is a not depending on and$ ?− A   !

optimal solutions and respectively of the primal and dual such that ÐB B Ñ Ð@ @ Ñ B œ" ! " ! "# # #

B œ ! Á Ð@ @ Ñ A   ! @ œ @ œ @! " ! " " "−# ## ? $ for all , is independent of , and max is the least# $

feasible value of hoose so that for all , or equivalently@ K £ ! −" ! for the dual. [ : CHint $ # ?#$

1 Ÿ ! R1 1 Ÿ ! Ð@ @ Ñ ´ Ð@" " ! " ! "#$ #$ $#$ and for all and all # large enough . Next show that R

R@ @ Ñ R" !$ $ is feasible for the dual for all large enough . Then show that for all large enough

R ÐB B Ñ B œ B œ ! Á œ ÐB B Ñ, is feasible for the primal where for all and, for , " ! " ! " !# # # # $ $# $ # $

´ ÐAT AÐRT‡ ‡$ $ H ÑÑ ! Ÿ V œ T H 9Ð"Ñ$

3$ $ $. Do this by using the fact that for all3" ‡

small enough . Next show 3 ! Ð@ @ Ñthat for all large enough , and areR ÐB B Ñ" !# #

" !

respectively optimal for the primal and dual because their objective-function values are equal.]

(b) Average Maximum Equals Maximum Reward Rate. Show that limRÄ_" RR ! œe

max . [ : Choose so for all , or equivalently max# ? # ?# #$ #$#$− −" ! " !@ K £ ! − ÒR1 1 Ó œ !Hint $ # ?

for all large enough . Use this fact to show that for fixed large enough , R Q ÐQ@ @ Ñ œeR " !$ $

ÐQ RÑ@ @ R œ "ß #ßá @ Á !" ! "$ $$ , . The geometric interpretation of this result is that when ,

e carries the half-line into itself. And when , is a fixed point ofÖR@ @ À R   Q× @ œ ! @" ! " !$ $$ $

e. Indeed, Theorem 6 asserts that if the system is transient, then this fixed point is unique and

is the maximum value.]


(c) Stochastic Systems with Strongly Connected System Graph. Suppose the system is sto-

chastic and the system graph is strongly connected. The last means that for each pair of states

= Á > = >, there is a chain in the system graph from to . Show that this implies that there is a de-

cision and a positive integer (both depending on and ) such that . Show that ev-# R = > T !R=>#

ery stationary maximum-reward-rate policy has the property that is independent of .$_ "=@ =$

[ Hint: Set , rewrite (5) as and iterate].@ œ @ @   T @" " " "$ #

(d) Linear Programs for Stochastic Systems with Strongly Connected System Graph. Sup-

pose the system is stochastic, the system graph is strongly connected and . Use the re-!= =A œ "

sult of (c) to reduce the pair of dual linear programs in the problem statement to the equivalent

alternate pair of linear programs given below. The alternate primal is to find row -vectors W B"#

for that# ?−

maximize "#

# #B <"

subject to

Ð Ñ B " œ "6 "#

#"

(7) B U œ !"#

# #"

(8) for .B   ! −"# # ?

The alternate dual is to find a number and a column -vectors that? W @" !

minimize ?"

subject to

Ð Ñ "? U @   <9 " !# #

for all where is here a column -vector of ones.# ?− " W

Determine whether or not these alternate linear programs are dual to one another. Also show

that there is a $ ?− and optimal solutions and respectively of the alternate primalB Ð? @ Ñ" " !#

and dual such that B œ ! Á "? œ @ œ @" " " "−# ## ? $ for all and max , and is the least feas-# $ ?"

ible value of for the alternate dual. [ Postmultiply (1) by the column -vector to form? W "" Hint:

the alternate primal.]


MS&E 351 Dynamic Programming, Spring 2008. Arthur F. Veinott, Jr.

Answers to Homework 6 Due May 16

1. Limiting Present-Value Optimality with Binomial Immigration.

(a) Present Values of Convolutions are Products of Present Values. We have

( ) ( ) .A ‡ A œ A ‡ A œ A A œ A A œ A As s s s s3 3 3" " " " "_ _ R _ _

Rœ! Rœ! Rœ3

R R 3 3 3 R3R

3œ! 3œ!

R3 R3" " " "

We now show by induction that 1 (1 ) (1 ) . The claim holds for 38

8 8 8œ œ 8 œ !ß "ß"" 3 3

since 1 (1 ) , 1 (1 ) and 1 1 . Suppose the claim is true for3 3 3! " "

! 5 "_3œ!œ " œ œ œ œ " " " "!

8 œ ‡ œ œ œ 0. Then 1 ( 1 1 ) 1 1 (1 ) (1 ) (1 ) . Suppose the claim is3 33 38" "" 8 8

" 8 8"" " "

true for 0. Then 1 ( 1 1 ) 1 1 (1 ) (1 ) (1 ) .8 œ ‡ œ œ œ 3 33 38" "" 8 8

8 8"" " "

(b) Limiting Present-Value Optimality for Binomial Immigration Streams of Order . Ev-8

idently, . Then, is limiting present-value optimal for the binomialZ œ A Z œ A Z3= 3 3 31 1 1

!_5œ!

5 5" -

immigration stream 1 of order if lim inf 1 ( ) 0 for all , or equivalently, since3 3 33

3- 18 8Æ!8 Z Z   1

1 (1 ) and lim (1 ) 0, if lim inf ( ) 0 for all .3 33 3

3- 18

8 8 8 8Æ! Æ!œ œ " Z Z  3 3 3 3 1

(c) is the Set of Decisions that are Limiting Present-Value Optimal for the Binomial?8

Immigration Stream of Order 8. Since the system is bounded, there is a stationary policy that

has maximum present value for all small enough . Thus, from (b), 3 ! $_ is limiting present-

value optimal for the binomial immigration stream of order if and only if the following holds:8

lim inf ( ) 0 for all . On substituting the relevant Laurent expansions, the33$

3#Æ!

83 # ?Z Z   −

last is so if and only if lim inf ( ) lim inf ( )3 $ 3 $# #Æ! Æ!8 3 3 3 38 3 3_ 8

3œ" 3œ"3 3 3! !@ @ œ @ @   0 for all

# ?− , or equivalently Z Z ¤ −8 88$ # 0 for all , i.e., .# $ ?

(d) Machine Sequencing. The following table contains all the data required to establish op-

timality of policies under different preference relations. By a suitable choice of units we can and

do assume !! œ ", so the three immigration streams are essentially equivalent to binomial streams

of orders , 1 and 2 respectively. By examining the column corresponding to 3, observe that! R  

for binomial immigration of order 0, policies , and are Cesàro overtaking optimal, and so by+ , -

part (c), limiting present-value optimal. For the binomial immigration stream of order 1, note

that policies and are Cesàro overtaking optimal since + , it suffices to tabulate the difference of

two policies for policies , and because they are the only policies in . Therefore, and + , - + ,?!

are in and, by (c), limiting present-value optimal. Thus, for the third immigration stream,?"

which is binomial of order 2, it suffices to compute the difference of values of policies and .+ ,

The table shows that policy is Cesàro overtaking optimal and, by (c), limiting present-value+

optimal.


                          N = 1    N = 2    N ≥ 3
V^N_c − V^N_d               −1       −1        1
V^N_b − V^N_c                0        2        0
V^N_a − V^N_b                1       −1        0
V^{N1}_b − V^{N1}_c          0        2        2
V^{N1}_a − V^{N1}_b          1        0        0
V^{N2}_a − V^{N2}_b          1        1        1

The only policy that has maximum present value for all sufficiently small positive interest rates is a since it is the unique element of Δ^2 and Δ^2 = Δ^∞ ≠ ∅ by Theorem 25 of §1.8 and the fact that the system is bounded.
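A few lines of code (not from the notes) reproduce the table: they form the cumulative N-period values of each sequence from the weekly machine costs in the problem statement, then the repeated partial sums that the rows labeled with superscripts N1 and N2 appear to denote.

```python
import numpy as np

cost = {'a': [1, 3, 2], 'b': [2, 1, 3], 'c': [2, 3, 1], 'd': [1, 3, 3]}
N = 6
V  = {k: np.cumsum(-np.array(v + [0] * (N - 3))) for k, v in cost.items()}   # V^N for N = 1..6
V1 = {k: np.cumsum(v) for k, v in V.items()}                                 # first-order sums
V2 = {k: np.cumsum(v) for k, v in V1.items()}                                # second-order sums

print("V^N_c - V^N_d   :", V['c'] - V['d'])      # -1 -1  1  1  1  1
print("V^N_b - V^N_c   :", V['b'] - V['c'])      #  0  2  0  0  0  0
print("V^N_a - V^N_b   :", V['a'] - V['b'])      #  1 -1  0  0  0  0
print("V^N1_b - V^N1_c :", V1['b'] - V1['c'])    #  0  2  2  2  2  2
print("V^N1_a - V^N1_b :", V1['a'] - V1['b'])    #  1  0  0  0  0  0
print("V^N2_a - V^N2_b :", V2['a'] - V2['b'])    #  1  1  1  1  1  1
```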

(e) Social Security: Aggregating Losses to Produce a Profit.

Ð3Ñ :M œ :M The expected social security payments to an individual are .2 .1 . However,Œ "#

each individual pays .1 to social security during his working years, which is less than, equal toM

or greater than the expected social security payments to him according as 1, 1 or 1: : œ :

respectively. Yet social security breaks even in every period except the first one when it collects

money but does not make any payments as can be seen from the fact that in period socialR

security collects 0 and pays .2 .Þ" : A M :M: AR ![

R" ![

"

#Ð33Ñ Observe that the income per person before retirement after social security payments is

given by Þ*M

: " as the net income is split among an individual and his children. Then the ratio of a

worker's retirement income from social security to his income before retirement after social secur-

ity payments is .#

*:Ð: "Ñ

Ð333Ñ Z Ÿ ! 8 œ "ß !ßá 8 œ " We claim that lim and all . This is obvious for sinceRÄ_R8$

Z Ä @ œ ! 8   ! ! Z Ÿ # "Rß" "$ $ $. Now suppose that . Choose so that . It is enough to show% %

by induction on that for all large enough and all . To that end, observe that8 Z Ÿ " R 8   !R8$ %

since is transient, so for all large enough . Now suppose the claim is so for$ %Z Ä Z Z Ÿ " RR R$ $$

8 8 " Z œ Z Ÿ " R and consider . Then for all large enough as claimed.Rß8" R3œ!

38$ $

! %

2. Maximizing Reward Rate by Linear Programming. Consider the primal program of finding

row -vectors and for thatW B B −" !# # # ?

maximize "#

# #B <"

subject to

Ð Ñ B B U œ A1 " "# #

# # #" !

(2) B U œ !"#

# #"

(3) for .B ß B   ! −" !# # # ?


Then the dual program is to find column -vectors and thatW @ @" !

minimize A@"

subject to

Ð Ñ @ U @   <4 " !# #

and

(5) U @   !#"

for all .# ?−

(a) Solution of Linear Programs. The first step is to observe from Theorem 22 of §1.8 that

there is a such that for all , or equivalently,$ # ?K £ ! −!#$

(6) and for all and all large enough .1 Ÿ ! R1 1 Ÿ ! − R" " !#$ #$ #$ # ?

Now from Theorem 26 of §1.8, max independently of . Also (6) holds if and only@ œ @ A   !" "−$ # ? #

if is feasible for the dual for all large enough .Ð@ @ Ñ ´ Ð@ R@ @ Ñ R" ! " " ! $ $ $

The next step is to observe from Theorem 21 that for all of §1.8 ! Ÿ V œ T H 9Ð"Ñ3$ $ $3" ‡

small enough . Thus, for all large enough 3 ! B ´ RB B   ! R! " !$ $$ and all where A   ! B ´"

$

AT B ´ AH B ´ B B œ B œ ! Á ÐB B Ñß −‡ ! " " " ! " !$ $ $ $ $ # # # # and . Let and for all . We claim that ,# $ # ?

is feasible for the primal for all large enough . The claim follows from R B U œ AT U œ !" ‡$ $ $$ and

B B U œ B ÐRB B ÑU" ! " " !$ $ $ $$ $ $$ $ $ $ $$œ B B U œ AÐT H U Ñ œ A" ! ‡

since , the last from Theorem 20 of §1.8.H U œ H ÐT U Ñ œ M T$ $ $ $$ $‡ ‡

The final step is to observe that for all . This impliesB < œ AT < œ A@ œ A@ A   !" ‡ " "$ $$ $$

that for all large enough , and are respectively optimal for the primal andR ÐB B Ñ Ð@ @ Ñ" ! " !# #

dual, and is the least feasible value of for the dual.@ @" "$

(b) Average Maximum Equals Maximum Average Reward Rate. Choose so for all$ K £ !!#$

# ?− ÐR1 1 Ñ. Thus max# ? #$ #$−" ! œ ! R R   Q for all large enough , say all , or equivalently,

eÐR@ @ Ñ œ ÐR "Ñ@ @ R   Q" ! " !$ $$ $ for . Thus, by repeated application of the last equation,

e eR " ! R" " ! " !ÐQ@ @ Ñ œ ÐÐQ "Ñ@ @ Ñ œ â œ ÐQ RÑ@ @ R œ "ß #ßá$ $ $$ $ $ , . Now Theorem

10 of §1.7 implies that . Therefore, on combining these facts itm ÐQ@ @ Ñ !m œ SÐ"Ñe eR " ! R$ $

follows that lim lim as claimed.RÄ_ RÄ_" R " R " ! "R ! œ R ÐQ@ @ Ñ œ @e e $ $$

(c) Stochastic Systems with Strongly Connected System Graph. Let be states. Since the= Á >

system graph is strongly connected, there is a chain V in the system graph from to . For each= >

state , let be such that is an arc in the system graph. Thus there is a/ − Ï Ö>× 0 − Ð/ß 0 ÑV V/ /

decision such that for each such state . Hence where is the number# #:Ð0 l/ß Ñ ! / T ! R// R

=>#

of arcs in V, establishing the first claim.

Now let be a stationary maximum-reward-rate policy, , be a state that mini-$_ " "@ ´ @ =$

mizes @ > Á ="= and be any other state. As shown above, since the system graph is strongly con-


nected, there is a decision and a positive integer such that . Then from (5), # R @  T !R=>#

"

T @   â  #" T @ = T @   T @ R " R " R "

= => ># # #, so by the definition of and the fact that is stochastic,

Ð" T Ñ@R "=> =# . Hence since , . However, by definition of , , so T ! @   @ = @ Ÿ @ @R " " " " "

=> = > = > =#

œ @ œ"> @ ="

=$ is independent of as claimed.

(d) Linear Programs for Stochastic Systems with Strongly Connected System Graph. Sup-

pose the system is stochastic, the system graph is strongly connected and . Now on!= =A œ "

postmultiplying (1) by the -column vector of ones and using the facts that (becauseW " U " œ !#

T A" œ "# is stochastic) and , yields (6) of the alternate primal. Thus, the constraint set of the

alternate primal contains that of the original primal. Also, since one optimal solution ofÐ@ @ Ñ" !

the dual takes the form @ œ "?" ", the last inequality (5) of the dual is redundant and the dual

objective function reduces to . Hence the dual is equivalent to the alternate dual.A"? œ ?" "

Further, the alternate dual is the dual of the alternate primal. Now choose as in (a) above.$

Then there are optim al solutions and respectively of the original primal andÐB B Ñ" !# # Ð@ @ Ñ" !

dual such that B œ ! Á @ œ @ œ " B" " " "−# # ## ? $ for all and max . Thus, is optimal for the# $ ?"

alternate primal and ?" is optimal and the least feasible value of for the alternate dual.?"


MS&E 351 Dynamic Programming, Spring 2008. Arthur F. Veinott, Jr.

Homework 7 Due May 23

1. Discovering System Boundedness. Consider the following collections of linear inequalities and nonlinear equations for a discrete-time-parameter finite Markov population decision chain with r_δ = 1 for all δ, viz.,

(1)  P_δ v ≤ v and r_δ + P_δ u ≤ u + v for all δ ∈ Δ, and (v, u) ≥ 0,

and

(2)  max_{δ∈Δ} P_δ v = v,  max_{δ∈Δ} (r_δ + P_δ u) = u + v,  and (v, u) ≥ 0.

Parts (a)-(c) below establish that system boundedness is equivalent to the existence of a (v, u) satisfying either of the conditions (1) or (2). Thus system boundedness can be checked by linear programming.

(a) (1) Implies System Boundedness. Show that if there is a pair (v, u) satisfying (1), then the system is bounded. [Hint: Argue as in the proof of the System-Degree Theorem 9 of §1.7 for degree one. In particular, let J, K be a partition of the states for which v_J ≫ 0 and v_K = 0. Show that P_{δKJ} = 0 for all δ ∈ Δ. Then show the restriction of the system to states in K is transient by considering the second system of inequalities.]

(b) System Boundedness Implies (2). Show that if the system is bounded, then there is a (v, u) that satisfies (2). [Hint: There is a δ^∞ with strong maximum present value, so G_{γδ} ≼ 0 for all γ ∈ Δ. Then show that (v, u) = (v^1_δ, Mv^1_δ + v^0_δ) satisfies (2) for large enough M > 0.]

(c) (2) Implies (1). Show that if (v, u) satisfies (2), then (v, u) satisfies (1).
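Since (1) is a finite system of linear inequalities in (v, u), its feasibility, and hence system boundedness, can be tested with an off-the-shelf LP solver. Below is a minimal sketch (not from the notes) using scipy.optimize.linprog with a zero objective; the two example decision matrices are hypothetical, and is_bounded is simply a feasibility check of (1) with r_δ = 1.

```python
import numpy as np
from scipy.optimize import linprog

def is_bounded(P_list):
    """Feasibility of (1): P_d v <= v and 1 + P_d u <= u + v for all d, with (v, u) >= 0."""
    S = P_list[0].shape[0]
    A, b = [], []
    for P in P_list:
        A.append(np.hstack([P - np.eye(S), np.zeros((S, S))]))   # (P - I) v <= 0
        b.append(np.zeros(S))
        A.append(np.hstack([-np.eye(S), P - np.eye(S)]))         # -v + (P - I) u <= -1
        b.append(-np.ones(S))
    res = linprog(c=np.zeros(2 * S), A_ub=np.vstack(A), b_ub=np.concatenate(b),
                  bounds=[(0, None)] * (2 * S), method="highs")
    return res.success

P1 = np.array([[0.5, 0.5], [0.3, 0.7]])        # stochastic decision
P2 = np.array([[1.2, 0.0], [0.0, 0.9]])        # a row sum exceeds one
print(is_bounded([P1]), is_bounded([P1, P2]))  # expect True, False
```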

2. Finding the Maximum Spectral Radius. Consider a discrete-time-parameter finite Markov population decision chain.

(a) Monotonicity of Spectral Radius. Show that if P and Q are square real matrices of the same order with 0 ≤ P ≤ Q, then σ_P ≤ σ_Q. [Hint: Use Corollary 8 of §1.7.]

(b) Bounds on Spectral Radius. Show that if P = (p_ij) is a square real nonnegative matrix, then

m ≡ min_i ∑_j p_ij ≤ σ_P ≤ max_i ∑_j p_ij ≡ M.

[Hint: Construct matrices P′ and P″ for which 0 ≤ P′ ≤ P ≤ P″, the row sums of P′ all equal m and the row sums of P″ all equal M.]

(c) Bounds on Maximum Spectral Radius. Show that m ≤ σ ≤ M where

m ≡ min_s max_a ∑_t p(t|s, a),   M ≡ max_{s,a} ∑_t p(t|s, a),   and σ ≡ max_δ σ_{P_δ}

is the maximum of the spectral radii of the P_δ.

(d) Finding the Maximum Spectral Radius. Show that if 0 ≤ m < M, m ≤ σ ≤ M, and θ is the midpoint of the interval [m, M], then σ < θ if and only if the system with transition matrices θ^{−1}P_δ for all δ is transient. Use this fact to give an iterative method for estimating σ.


3. Irreducible Systems and Cesàro Geometric-Overtaking Optimality. Consider a discrete-

time-parameter finite Markov population decision chain. Let max . Call a real num-e@ ´ T @$ ? $−

ber an of if there is a real vector , called an associated , such that. eeigenvalue eigenvector@

e . Z@ œ @. Call the system if the system graph is , i.e., there is airreducible strongly connected

chain (directed path) from each state to each other state. Let max be the maximum5 5´ $ ? $−

spectral radius.

(a) System Boundedness. Show that if has a unit eigenvalue and associated eigene .Ð œ "Ñ -

vector , then the system is bounded.@ ¦ !

(b) Unit Eigenvalue of with Positive Eigenvector. Show that if the system is irreduciblee

and bounded, but not transient, then has a unit eigenvalue and associated eigenvector .e @ ¦ !

[ : Use the results of parts (a)-(c) of Problem 1. Show that since the system is not Hint transient,

the set in the hint for Problem 1(a) is not empty. Then use the irreducibility to show N that the

set is empty.]O

(c) Positive Maximum Spectral Radius an Eigenvalue of with Positive Eigenvector. e

Show that if the system is irreducible, then , is an eigenvalue of and that there is an5 5 e !

associated eigenvector . [ : Use the irreducibility and Problem 2(c) to show that .@ ¦ ! !Hint 5

Next consider the normalized system in which is replaced by for each . Let T T ´ T < œ "

$ $ $ $5 $"

for all and be the resulting maximum expected present value with interest rate 100 %$ ? 3− Z 3

! Z œ Ð " T Z Ñ   " ?, so that max . Let be the sum of the absolute values 3 3

$ ? $− " " " l l of the

elements of . Show that has a limit point as with , and ? mZ m Z @   ! Æ ! @ œ " ´3 3" 3 el l 5 e"

has a unit eigenvalue with being an associated eigenvector. Then use the irreducibility to show@

that .]@ ¦ !

(d) Perron-Frobenius Theorem. Show that if is irreducible, then , is an eigen-T !$ $ $5 5

value of and there is an associated eigenvector , i.e., .T @ ¦ ! T @ œ @$ $ $5

(e) Maximum Spectral Radius is Maximum Population Growth Factor. Show that if the

system is irreducible, then max . Then show that the normalized system is$ ? $−R RmT m œ SÐ Ñ5

bounded. [ : Apply the result of part (c).]Hint

(f) Existence of Stationary Cesàro Geometric-Overtaking Optimal Policies. Suppose that

the system is irreducible. Call if- Cesàro geometrically-overtaking optimal

lim inf C for all .RÄ_

R Rß"Rß"5 1ÐZ Z Ñ   ! Ð ß "Ñ- 1

Show that there is a stationary Cesàro geometrically-overtaking optimal policy, viz., any sta-

tionary maximum-reward-rate policy for the normalized system.

(g) Maximum Long-Run Growth of Expected Symmetric Multiplicative Utility. Discuss

the application of the result of part (f) to the problem of maximizing the long-run growth of ex-

pected symmetric multiplicative utility for an irreducible finite Markov decision chain.


MS&E 351 Dynamic Programming, Spring 2008. Arthur F. Veinott, Jr.

Answers to Homework 7 Due May 23

1. Discovering System Boundedness.

(a) (1) Implies System Boundedness. Let be a partition of the states such that N ßO @ ¦N

0 and 0, so from it follows that and 0. Then, by induction@ œ T @ Ÿ @ T @ Ÿ @ T @ ŸO NN N N ON N$ $ $

T @ Ÿ @ T œ S T @ Ÿ @ ¦ T   T œR RNN NNN N ON N N ON$ $ $ $ $ and so (1). Conclude from 0, 0 and 0 that 0

and thus 0 for every . Also, since ( ) satisfies (1), 1 , so by ProblemT œ R @ß ? T ? Ÿ ?RON OO O O$ $

2(a) of Homework 3, the restriction of the system to states in is transient, whence O T œ3OO$

S ( ) for some , 0 1. Observe that, since is arbitrary, it follows from the System-De-! ! ! $3

gree Theorem that for each policy ( ), max (1), max 0 for1 $ $œ ß ßá mT m œ S mT m œ" #R RNN ON1 11 1

every and max ( ). Thus , maxR mT m œ S T œ T T T mT m œ SÐ"Ñ1 $ 11 1 1 1 13 3 R 3" R3 3"OO NO NN OO NN

R3œ" NO! !

3 3

and max ( ), so max (1). Hence max (1).1 1 11 1 1mT m œ S mT m œ S mT m œ SR3 R3 R ROO NO3

!

(b) System Boundedness Implies (2). Since system is bounded, by Theorem 22 there is a

stationary strong-maximum-present-value policy . Then, 0 for every , so ( )$ #_ " !K £ 1 ß 1 £#$ #$ #$

0, and . Observe that since the rewards are positive, for all small enough .K œ ! Z   ! !$$3$ 3

Thus since , ( ) . These facts imply that 0 and that there existsZ œ @ @ ß @ ¤ ! @  3$ $ $ $$

!_8œ"

8 8 " ! "3

Q œ Ñ œ Q  0 with max 0, max 0 and 0. These inequalities restate# ##$ #$ $#$ $1 Ð1 Q1 @ @" ! " " !

the fact that ( ) ( ) satisfies (2).@ß ? ´ @ ßQ@ @" " !$ $ $

(c) (2) Implies (1). Immediate.

2. Finding the Maximum Spectral Radius.

(a) Monotonicity of Spectral Radius. The proof is by contradiction. Suppose 0 , butŸ T Ÿ U

5 5 5 5 5 5 5U T U" " " "T T T T U U T. Then has spectral radius 1. Thus, is transient, and so is

also transient since . But 1 , contradicting the transience of .

5 5 5 5" " "T T TTT Ÿ U œ T5"

T

(b) Bounds on Spectral Radius. Since $m \le \sigma_P \le M$ is trivial if $M = 0$ and the left-hand inequality is trivial if $m = 0$, it suffices to consider the case that $0 < m \le M$. Now observe that if $Q$ is a square stochastic matrix, then $Q^N \mathbf{1} = \mathbf{1}$ for $N = 1, 2, \ldots$, and hence $\sigma_Q = 1$ by Lemma 7. Thus respectively reducing the elements of rows of $P$ with sums above $m$, and then increasing the elements of rows of $P$ with sums below $M$, produces matrices $\underline{P}$ and $\overline{P}$ such that $0 \le \underline{P} \le P \le \overline{P}$, $\underline{P}\mathbf{1} = m\mathbf{1}$ and $\overline{P}\mathbf{1} = M\mathbf{1}$. Hence $m^{-1}\underline{P}$ and $M^{-1}\overline{P}$ are stochastic and so have spectral radius one, whence $\sigma_{\underline{P}} = m$ and $\sigma_{\overline{P}} = M$. Thus, from part (a), $m \le \sigma_P \le M$.
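The row-sum bounds in part (b) are easy to check numerically. The following sketch is not part of the original answer; it assumes NumPy, and the matrix is an arbitrary nonnegative example.

import numpy as np

# Part (b) numerically: for a nonnegative matrix P, the spectral radius
# sigma_P lies between the smallest row sum m and the largest row sum M.
P = np.array([[0.2, 0.5, 0.1],
              [0.3, 0.1, 0.4],
              [0.0, 0.6, 0.3]])

row_sums = P.sum(axis=1)
m, M = row_sums.min(), row_sums.max()
sigma_P = max(abs(np.linalg.eigvals(P)))

print(f"m = {m:.3f} <= sigma_P = {sigma_P:.3f} <= M = {M:.3f}")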

(c) Bounds on the Maximum Spectral Radius. Choose $\gamma$ so that $\sum_t p(t \mid s, \gamma_s) = \max_a \sum_t p(t \mid s, a)$ for each $s$, and put $m = \min_s \sum_t p(t \mid s, \gamma_s)$. Then, by part (b) and Lemma 11 of §1, it follows that $m \le \sigma_\gamma \le \sigma \le \max_\delta \|P_\delta\| = M$, where $\|\cdot\|$ is the Chebyshev norm.

(d) Finding the Maximum Spectral Radius. The $\theta$-normalized system is the one with transition matrices $\theta^{-1} P_\delta$ for all $\delta$. Observe from Lemma 7 of §1 that $\sigma < \theta$ if and only if the $\theta$-normalized system is transient, a fact that can be checked by the method of Problem 2(b) of Homework 3. Beginning with the interval $[m, M]$ from part (c), let $\theta = \tfrac{1}{2}(m + M)$. If the $\theta$-normalized system is transient, then reset $M = \theta$; otherwise, reset $m = \theta$. The resulting interval $[m, M]$ contains $\sigma$ by what was just shown, and the interval is one-half the length of the original interval. Now repeat this construction iteratively until $M - m$ is small enough.
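A minimal sketch of the bisection scheme just described, specialized to a single transition matrix (so that $\Delta$ has one element) and assuming NumPy. The transience test below, an M-matrix check that $(I - \theta^{-1}P)^{-1}$ exists and is nonnegative, is only a stand-in for the Homework 3 method cited above; with several decisions one would run the transience check on the whole $\theta$-normalized system instead.

import numpy as np

def is_transient(P, theta):
    # The theta-normalized matrix Q = P/theta is transient exactly when
    # I - Q is nonsingular with a nonnegative inverse (so the series
    # I + Q + Q^2 + ... converges).
    Q = P / theta
    try:
        inv = np.linalg.inv(np.eye(len(P)) - Q)
    except np.linalg.LinAlgError:
        return False
    return bool(np.all(inv >= -1e-10))

def max_spectral_radius(P, tol=1e-8):
    # Start from the row-sum bounds of part (b) and bisect as in part (d).
    m, M = P.sum(axis=1).min(), P.sum(axis=1).max()
    m = max(m, tol)
    while M - m > tol:
        theta = 0.5 * (m + M)
        if is_transient(P, theta):
            M = theta          # sigma < theta
        else:
            m = theta          # sigma >= theta
    return 0.5 * (m + M)

P = np.array([[0.2, 0.5, 0.1],
              [0.3, 0.1, 0.4],
              [0.0, 0.6, 0.3]])
print(max_spectral_radius(P), max(abs(np.linalg.eigvals(P))))   # the two agree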

3. Irreducible Systems and Cesàro Geometric-Overtaking Optimality.

(a) System Boundedness. Let $v$ be a positive eigenvector associated with the unit eigenvalue of $\mathcal{P}$. Then $v = \mathcal{P}v = \cdots = \mathcal{P}^N v \ge \max_\pi P_\pi^N v$, so $\max_\pi \|P_\pi^N\| = O(1)$.

(b) Unit Eigenvalue of $\mathcal{P}$ with Positive Eigenvector. Suppose the system is bounded and irreducible, but not transient. From Problem 1, there exists $v \ge 0$ with $\max_{\delta\in\Delta} P_\delta v = v$. Thus, $v$ is a nonnegative eigenvector associated with a unit eigenvalue. We claim $v$ is positive. Consider a partition $J, K$ of the state space as in 1(a). Since the system is not transient, $J \ne \emptyset$ because, as shown in 1(a), the restriction of the system to states in $K$ is transient. Furthermore, $K = \emptyset$, for if not, $P_\pi^{N,KJ} = 0$ for every $N$ as in 1(a). But this contradicts the irreducibility of the system. Thus, $v = v_J \gg 0$.

(c) Positive Maximum Spectral Radius an Eigenvalue of $\mathcal{P}$ with Positive Eigenvector. Since the system is irreducible, for every state $s$ there exists an action $a$ such that $\sum_t p(t \mid s, a) > 0$. Thus by 2(c), $\sigma \ge m = \min_s \max_a \sum_t p(t \mid s, a) > 0$. Consider the normalized system $\tilde{P}_\delta = \sigma^{-1} P_\delta$. This system is not transient because its maximum spectral radius is one. Let $r_\delta = 1$ for all $\delta \in \Delta$. Then the maximum present value $V^\rho$ with interest rate $100\rho\%$ is such that $\lim_{\rho \downarrow 0} \|V^\rho\| = \infty$, because in the contrary event $V^\rho$ is bounded and so the system is transient. But since all rewards are nonnegative, $V^\rho = \max_{\delta \in \Delta} (1+\rho)^{-1}\{1 + \sigma^{-1} P_\delta V^\rho\} \ge (1+\rho)^{-1} 1 \gg 0$. Now multiplying both sides by $\|V^\rho\|^{-1}$ yields
$$V^\rho \|V^\rho\|^{-1} = \max_{\delta \in \Delta} (1+\rho)^{-1}\big[\|V^\rho\|^{-1} 1 + \sigma^{-1} P_\delta V^\rho \|V^\rho\|^{-1}\big].$$
Thus since $V^\rho \|V^\rho\|^{-1}$ has unit norm, it has a limit point $v$ as $\rho$ converges to zero. Also $v \ge 0$, $\|v\| = 1$ and $v = \max_{\delta \in \Delta} \sigma^{-1} P_\delta v$, or equivalently, $\mathcal{P} v = \sigma v$. This shows that $\sigma$ is an eigenvalue of $\mathcal{P}$ and $v$ is an associated eigenvector. Observe that this implies $P_\delta v \le \sigma v$ for all $\delta$. Let $J, K$ partition the state space so that $v_K = 0$ and $v_J \gg 0$. Since $\|v\| = 1$, necessarily $J \ne \emptyset$. And, if $K \ne \emptyset$, then it follows from $P_\delta v \le \sigma v$ that $P_\delta^{KJ} v_J \le 0$ for all $\delta$, which implies $P_\delta^{KJ} = 0$, contradicting the irreducibility of the system. Thus $v \gg 0$.

(d) Perron-Frobenius Theorem. This is part (c) when $\Delta = \{\delta\}$.

(e) Maximum Spectral Radius is Maximum Population Growth Rate. Since the system is irreducible, $\sigma > 0$ (from (c)) and there exists $v \gg 0$ such that $\sigma v = \mathcal{P} v \ge P_\delta v$ for all $\delta \in \Delta$, so $\max_\pi P_\pi^N v \le \sigma^N v$, whence $\max_\pi \|P_\pi^N\| = O(\sigma^N)$. Furthermore, $\max_\pi \|\sigma^{-N} P_\pi^N\| = \sigma^{-N} \max_\pi \|P_\pi^N\| = O(1)$, and so the normalized system is bounded.


(f) Existence of Stationary Cesàro Geometric-Overtaking Optimal Policies. By (e) the normalized system is bounded. Then by Theorem 32 there exists a stationary Cesàro-overtaking-optimal policy $\delta^\infty$ for the binomial immigration stream of order 1, i.e.,

$$\liminf_{N\to\infty} \big(\tilde V_\delta^{N,1} - \tilde V_\pi^{N,1}\big) \ge 0 \quad (\mathrm{C},1)$$

for all $\pi = (\gamma_1, \gamma_2, \ldots)$, where $\tilde V_\pi^{N,1} = \tilde P_\pi^{N-1} r_{\gamma_N} = \sigma^{-(N-1)} P_\pi^{N-1} r_{\gamma_N} = \sigma^{-(N-1)} V_\pi^{N,1}$. Thus,

$$\liminf_{N\to\infty} \sigma^{-N}\big(V_\delta^{N,1} - V_\pi^{N,1}\big) \ge 0 \quad (\mathrm{C},1) \ \text{ for all } \pi.$$

(g) Maximum Long-Run Growth of Expected Symmetric Multiplicative Utility. From equation (6) of §1.4 of the Lectures, the maximum expected utility $V^N$ in periods $1, \ldots, N$ satisfies $V^N = \max_{\delta\in\Delta} P_\delta V^{N-1} = \max_\pi P_\pi^N V^0$ where $V^0 = 1$. Let $r_\delta = 1$ for all $\delta \in \Delta$. Then $V_\pi^{N,1} = P_\pi^{N-1} 1$. Since the system is irreducible, apply the result of (f) to find a policy that maximizes the long-run growth of expected symmetric utility.
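As a numerical aside (not part of the original answer), iterating the operator $\mathcal{P}v = \max_{\delta\in\Delta} P_\delta v$ and renormalizing exhibits the growth factor $\sigma$ and the positive eigenvector of part (c). The sketch assumes NumPy, and the two transition matrices are arbitrary choices for an irreducible two-state example.

import numpy as np

# Two arbitrary decision rules for a two-state irreducible system.
P = [np.array([[0.3, 0.6],
               [0.5, 0.4]]),
     np.array([[0.7, 0.1],
               [0.2, 0.6]])]

v = np.ones(2)
for _ in range(200):
    Pv = np.max([Pd @ v for Pd in P], axis=0)   # (Pv)(s) = max over decisions
    sigma = np.linalg.norm(Pv, ord=np.inf)      # current growth-factor estimate
    v = Pv / sigma                              # renormalize

print(sigma, v)   # approximately the maximum spectral radius and its eigenvector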


MS&E 351 Dynamic Programming                                                      Spring 2008
Arthur F. Veinott, Jr.

Homework 8 Due May 30

1. Element-Wise Product of Symmetric Positive Semi-Definite Matrices. Show that if $C$ is an $m \times m$ symmetric positive semi-definite matrix that is partitioned symmetrically into blocks $C_{ij}$ with $C_{ii}$ symmetric positive definite for $i, j = 1, \ldots, n \le m$, and if $Q = (q_{ij})$ is an $n \times n$ symmetric positive definite matrix, then the matrix $H$ composed of blocks $q_{ij} C_{ij}$ is symmetric and positive definite. [Hint: Observe that since $C$ is a symmetric positive semi-definite matrix, there is a diagonal matrix $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_m) \ge 0$ of eigenvalues of $C$ and a matrix $V$ with column $m$-vectors $v_1, \ldots, v_m$ such that $C = V \Lambda V^{\mathsf T} = \sum_{k=1}^m \lambda_k v_k v_k^{\mathsf T}$. One can partition each $m$-element row vector $x^{\mathsf T} = (x_1^{\mathsf T}, \ldots, x_n^{\mathsf T})$ corresponding to the partitioning of $C$, and in particular $v_k^{\mathsf T} = (v_{k1}^{\mathsf T}, \ldots, v_{kn}^{\mathsf T})$. Let $w_k = (x_i^{\mathsf T} v_{ki})$ be an $n$-element column vector for each $k$. Show that $x^{\mathsf T} H x = \sum_{k=1}^m \lambda_k w_k^{\mathsf T} Q w_k$.]
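The following numerical sketch is not part of the assignment; it simply illustrates the claim on randomly generated data, assuming NumPy. The block sizes and the random seed are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 3]                        # partition of m = 5 into n = 2 blocks
m, n = sum(sizes), len(sizes)

A = rng.standard_normal((m, m))
C = A @ A.T                           # symmetric positive semi-definite
B = rng.standard_normal((n, n))
Q = B @ B.T + n * np.eye(n)           # symmetric positive definite

# Assemble H block by block: H_ij = q_ij * C_ij.
idx = np.cumsum([0] + sizes)
H = np.zeros((m, m))
for i in range(n):
    for j in range(n):
        H[idx[i]:idx[i+1], idx[j]:idx[j+1]] = Q[i, j] * C[idx[i]:idx[i+1], idx[j]:idx[j+1]]

print("smallest eigenvalue of H:", np.linalg.eigvalsh(H).min())   # positive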

2. Quadratic Unconstrained Team-Decision Problem with Normally Distributed Observations. Consider the quadratic unconstrained team-decision problem in which $Q$ is symmetric and positive definite, and the random column vectors $Z_1, \ldots, Z_n$ have a joint normal distribution with $\mathrm{E}Z_i = 0$, $C_{ij}$ denoting the matrix of covariances between the components of $Z_i$ and $Z_j$, and $C_{ii}$ being an identity matrix, for $i, j = 1, \ldots, n$. You may assume the known (e.g., Feller, Vol. II, 80-87) facts that the covariance matrix $C = (C_{ij})$ is symmetric and positive semi-definite, and that $\mathrm{E}^{Z_i} Z_j = C_{ji} Z_i$ for $i, j = 1, \ldots, n$. Also assume that $\mathrm{E}^{Z_i} \delta_i = d_i Z_i$ for some row vector $d_i$, $i = 1, \ldots, n$. Show that there is an optimal team decision function $X^*_i(Z_i)$ that is affine, i.e., $X^*_i = b_i Z_i + c_i$, where the row vectors $b_i$ and numbers $c_i$, $i = 1, \ldots, n$, are the unique solutions of certain systems of linear equations. You may assume that $X^*$ is optimal if, with probability one, $\mathrm{E}^{Z_i} \delta_i = \mathrm{E}^{Z_i} q_i X^*$, $i = 1, \ldots, n$. [Hint: Apply the result of Problem 1 about element-wise products of symmetric positive semi-definite matrices.]

3. Optimal Baking. At the beginning of a day, $n$ bakeries make forecasts $Z_1, \ldots, Z_n$ of their respective demands for bread during the day. The forecasts are random variables with known means. After observing only its own forecast, bakery $i$ bakes $X_i$ loaves of bread, $i = 1, \ldots, n$. The actual demands for bread at bakeries $1, \ldots, n$ are random variables $D_1, \ldots, D_n$. Assume that $D_i$ and $Z_i$ have finitely many values, that forecasts are unbiased, i.e., $\mathrm{E}(D_i \mid Z_i) = Z_i$ for all $i$, and that $(D_i, Z_i)$ and $(D_j, Z_j)$ are independent for each $i \ne j$. The loss $L(X, Z, D)$, where $X = (X_i)$, $Z = (Z_i)$ and $D = (D_i)$, is

$$L(X, Z, D) = \sum_{i=1}^n a_i (X_i - D_i)^2 + \Big(\sum_{i=1}^n (X_i - D_i) - b\Big)^2$$

where $a_i > 0$, $i = 1, \ldots, n$, and $b > 0$. The problem is to choose a baking policy from the above class that minimizes the expected one-period loss. Solve the problem explicitly. [Hint: Consider affine functions of the form $X_i = \alpha_i + \beta_i Z_i$ and find the coefficients $\alpha_i$ and $\beta_i$.]


MS&E 351 Dynamic Programming                                                      Spring 2008
Arthur F. Veinott, Jr.

Answers to Homework 8 Due May 30

1. Element-Wise Product of Symmetric Positive Semi-Definite Matrices. Observe that since $C$ is an $m \times m$ symmetric positive semi-definite matrix, there exists a matrix $V$ whose column $m$-vectors are $v_1, \ldots, v_m$ and a diagonal matrix $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_m) \ge 0$ of eigenvalues of $C$ such that $C = V\Lambda V^{\mathsf T} = \sum_{k=1}^m \lambda_k v_k v_k^{\mathsf T}$. On partitioning each $m$-element row vector $x^{\mathsf T} = (x_1^{\mathsf T}, \ldots, x_n^{\mathsf T})$, $m \ge n$, corresponding to the partitioning of $C$, and in particular doing so for $v_k^{\mathsf T} = (v_{k1}^{\mathsf T}, \ldots, v_{kn}^{\mathsf T})$, each block $C_{ij} = \sum_{k=1}^m \lambda_k v_{ki} v_{kj}^{\mathsf T}$. Let $w_k = (x_i^{\mathsf T} v_{ki})$ be an $n$-element column vector for each $k$. Therefore, since $H_{ij} = q_{ij} C_{ij}$, it follows that

$$x^{\mathsf T} H x = \sum_{i,j} x_i^{\mathsf T} q_{ij} \Big(\sum_{k=1}^m \lambda_k v_{ki} v_{kj}^{\mathsf T}\Big) x_j = \sum_{k=1}^m \lambda_k \sum_{i,j} w_{ki}\, q_{ij}\, w_{kj} = \sum_{k=1}^m \lambda_k w_k^{\mathsf T} Q w_k \ge 0$$

since $Q$ is positive definite. But from the fact that $C_{ii}$ is symmetric positive definite, it follows that for every $x_i \ne 0$,

$$0 < x_i^{\mathsf T} C_{ii} x_i = \sum_{k=1}^m \lambda_k x_i^{\mathsf T} v_{ki} v_{ki}^{\mathsf T} x_i = \sum_{k=1}^m \lambda_k (w_{ki})^2.$$

Thus there exists at least one value of $k$ for which $\lambda_k > 0$ and $w_k \ne 0$. Hence, for $x \ne 0$, there is an $i$ such that $x_i \ne 0$, and so $x^{\mathsf T} H x = \sum_{k=1}^m \lambda_k w_k^{\mathsf T} Q w_k > 0$.

2. Quadratic Unconstrained Team-Decision Problem with Normally Distributed Observations. The optimal solution $X^*$ must satisfy $\mathrm{E}^{Z_i}\delta_i = \mathrm{E}^{Z_i} q_i X^*$ with probability one for $i = 1, \ldots, n$. But by assumption, $\mathrm{E}^{Z_i}\delta_i = d_i Z_i$. If $X_j^*(Z_j) = b_j Z_j + c_j$, then

$$d_i Z_i = \mathrm{E}^{Z_i} q_i X^* = \sum_{j=1}^n q_{ij}\,(b_j C_{ji} Z_i + c_j) = \Big(\sum_{j=1}^n b_j q_{ij} C_{ji}\Big) Z_i + \sum_{j=1}^n q_{ij} c_j$$

since $Q$ is symmetric. Let $H$ be as in Problem 1. Then $d_i Z_i = b H_i Z_i + q_i c$ must hold for all $i$ and all $Z_i$, where $b = (b_1, \ldots, b_n)$, $c = (c_1, \ldots, c_n)$ and $H_i = (H_{1i}, \ldots, H_{ni})$ is the $i$th column of blocks of $H$. Thus it must be that $\mathrm{E}\delta = Qc$ and $d = bH$. These systems have the unique solutions $c = Q^{-1}\mathrm{E}\delta$ and $b = dH^{-1}$ since $Q$ is symmetric positive definite and, by Problem 1, so is $H$.
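A short computational sketch of the two linear solves above, assuming NumPy; the function name and the input layout are illustrative choices, not part of the original answer. Only the equations $d = bH$ and $\mathrm{E}\delta = Qc$ come from the argument.

import numpy as np

def affine_team_rule(Q, C_blocks, d_rows, E_delta):
    # Q        : (n, n) symmetric positive definite matrix.
    # C_blocks : nested list with C_blocks[i][j] = Cov(Z_i, Z_j), identity diagonal blocks.
    # d_rows   : list of the row vectors d_i with E[delta_i | Z_i] = d_i Z_i.
    # E_delta  : vector of unconditional means E[delta_i].
    # Returns the blocks b_i and the constants c_i of the rule X_i = b_i Z_i + c_i.
    n = len(d_rows)
    sizes = [len(np.atleast_1d(d)) for d in d_rows]
    idx = np.cumsum([0] + sizes)

    # H has blocks H_ij = q_ij C_ij; by Problem 1 it is symmetric positive definite.
    H = np.zeros((idx[-1], idx[-1]))
    for i in range(n):
        for j in range(n):
            H[idx[i]:idx[i+1], idx[j]:idx[j+1]] = Q[i, j] * np.asarray(C_blocks[i][j])

    d = np.concatenate([np.atleast_1d(di) for di in d_rows])
    b = np.linalg.solve(H, d)        # H symmetric, so d = bH gives b = d H^{-1}
    c = np.linalg.solve(Q, E_delta)  # E[delta] = Qc
    return [b[idx[i]:idx[i+1]] for i in range(n)], c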

3. Optimal Baking. Write the problem in the form of a Quadratic Unconstrained Team-Decision Problem by defining $Q = \operatorname{diag}(a) + \mathbf{1}\mathbf{1}^{\mathsf T}$ and writing the objective function as $\tfrac12 X^{\mathsf T} Q X - \delta^{\mathsf T} X$ plus a constant term, where $\delta_i = a_i D_i + b + \sum_{j=1}^n D_j$. Observe that since $a_i > 0$ for all $i$, $Q$ is positive definite. By Theorem 1 of §2.3, $X^*$ is optimal if and only if $\mathrm{E}^{Z_i}\delta_i = \mathrm{E}^{Z_i} q_i X^*$, i.e.,

$$\mathrm{E}^{Z_i}\Big\{a_i D_i + b + \sum_{j=1}^n D_j\Big\} = \mathrm{E}^{Z_i}\Big\{\sum_{j\ne i} X_j\Big\} + (1 + a_i) X_i,$$

and this is equivalent to

3 3+ H , H œ Ð \ Ñ + \" "š ›and this is equivalent to


$$a_i Z_i + b + Z_i + \sum_{j\ne i}\mathrm{E}D_j = \sum_{j\ne i}\mathrm{E}X_j + (1 + a_i)X_i$$

since $\mathrm{E}^{Z_i}D_i = Z_i$, $\mathrm{E}^{Z_i}D_j = \mathrm{E}D_j$, $\mathrm{E}^{Z_i}X_i = X_i$ and $\mathrm{E}^{Z_i}X_j = \mathrm{E}X_j$. If $X_i = \alpha_i + \beta_i Z_i$, find $\alpha_i, \beta_i$ such that $X_i$ is the solution to the system above. Do this by substituting $X_i$ in the right-hand side and observing that $\mathrm{E}D_i = \mathrm{E}Z_i$ for all $i$, since $\mathrm{E}D_i = \mathrm{E}(\mathrm{E}^{Z_i}D_i) = \mathrm{E}Z_i$. Then this new equivalent system has solution $\beta_i = 1$ and

$$\alpha_i = \frac{b}{a_i\big(1 + \sum_{k} a_k^{-1}\big)},$$

and thus $X_i^* = Z_i + \alpha_i$ is the optimal solution.
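As a quick check (not in the original answer): when each $X_j = Z_j + \alpha_j$, the displayed condition reduces to $a_i\alpha_i + \sum_j \alpha_j = b$, and the offsets above satisfy it. The snippet assumes NumPy and uses arbitrary data.

import numpy as np

a = np.array([1.0, 2.0, 5.0])   # illustrative a_i > 0
b = 3.0                         # illustrative b > 0

alpha = b / (a * (1.0 + np.sum(1.0 / a)))
print(a * alpha + alpha.sum())  # each entry equals b, as required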


MS&E 351 Dynamic Programming                                                      Spring 2008
Arthur F. Veinott, Jr.

Homework 9 Due June 4

1. Revenue Management: Pricing a House for Sale. An executive must sell her house in an interval $[0, t]$. Her goal is to maximize her expected sale price. She is willing to vary her asking price as a piecewise-constant function of time, with her asking price at any moment restricted to a given finite set $P$ of positive numbers. Whenever her asking price is $p \in P$, buyers willing to pay that price arrive according to a Poisson process with mean rate $q_p > 0$ per unit time, with $q_p$ being strictly decreasing in $p$. She sells the house to the first buyer arriving during $[0, t]$ at her then-current asking price, or if no buyer arrives, to her company at the guaranteed price $p^*$ with $0 < p^* < \underline{p} \equiv \min P$. Let $V_t$ be the maximum expected sale price in $[0, t]$.

(a) Optimality Equation. Give an equation whose solution is $(V_t)$.

(b) Existence of Optimal Asking-Price Policy. Explain why there exists a maximum-expected-sale-price asking-price policy.

(c) Monotonicity and Concavity of Maximum Expected Sale Price in Time Remaining. Show that $V_t$ is positive, strictly increasing and concave in $t \ge 0$.

(d) Monotonicity of Optimal Asking-Price in Time Remaining. Show that with one optimal asking-price policy, the asking price $p_t$ used when $t$ time units remain has the properties that

1° $p_t = \overline{p} \equiv \max P$ for all large enough $t > 0$ and

2° $p_t$ is increasing in $t > 0$.

(e) Limit of Maximum $t$-Period Expected Sale Price. Show that $\lim_{t\to\infty} V_t = \overline{p}$.
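A small computational aside, not part of the assignment: for any fixed piecewise-constant asking-price schedule the expected sale price can be evaluated directly, since under price $p$ no willing buyer arrives in a segment of length $\tau$ with probability $e^{-q_p\tau}$. The prices, rates and schedule below are hypothetical.

from math import exp

def expected_sale_price(schedule, q, p_star):
    # schedule : list of (price, duration) pairs covering [0, t].
    # q        : dict mapping each asking price to its Poisson arrival rate.
    # p_star   : guaranteed price if no buyer arrives.
    expected, prob_unsold = 0.0, 1.0
    for price, duration in schedule:
        p_sale = 1.0 - exp(-q[price] * duration)   # a willing buyer arrives in this segment
        expected += prob_unsold * p_sale * price
        prob_unsold *= 1.0 - p_sale
    return expected + prob_unsold * p_star         # fall back on the guaranteed price

q = {100: 0.05, 120: 0.02}                         # illustrative rates, decreasing in price
print(expected_sale_price([(120, 10.0), (100, 5.0)], q, p_star=80.0))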

2. Transient Systems in Continuous Time. Show that if every stationary policy is transient in a

continuous-time-parameter finite Markov population decision chain, then every policy is transient

and there is a stationary maximum-value policy. Explain briefly how successive approximations

can be used to find a stationary policy having value within $\epsilon > 0$ of the maximum. [Hint: For the

first part, adapt the proof of the corresponding result for the discrete-time-parameter problem.]
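One concrete way to carry out the successive approximations, sketched below under the assumption that the continuous-time data are given as transition-rate vectors and reward rates, is to uniformize with a constant $\Lambda$ no smaller than any total transition rate and then run value iteration on the resulting discrete-time chain. This is an illustration of one standard device, not necessarily the construction the notes have in mind; it assumes NumPy, and the function name and data layout are placeholders. After the iterates settle, a stationary policy choosing a maximizing action in each state has value within the desired $\epsilon > 0$ of the maximum.

import numpy as np

def successive_approximations(q, r, Lam, n_iter=1000):
    # q[s][a] : transition-rate vector of action a in state s (length = number
    #           of states, with the negative diagonal rate in position s).
    # r[s][a] : reward rate of action a in state s.
    # Lam     : uniformization constant, at least every total transition rate.
    # Value iteration on the uniformized chain,
    #   V(s) <- max_a [ r[s][a]/Lam + V(s) + (q[s][a] . V)/Lam ],
    # whose fixed point satisfies max_a [ r[s][a] + q[s][a] . V ] = 0,
    # the continuous-time optimality equation.
    n_states = len(q)
    V = np.zeros(n_states)
    for _ in range(n_iter):
        V = np.array([max(r[s][a] / Lam + V[s] + np.dot(q[s][a], V) / Lam
                          for a in range(len(q[s])))
                      for s in range(n_states)])
    return V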