
SIAM REVIEW Vol. 9, No. 3, July, 1967

EXISTENCE OF STATIONARY OPTIMAL POLICIES FOR SOME MARKOV RENEWAL PROGRAMS*

BENNETT FOX†

Markov renewal programming with a finite state space Ω and a finite action space A has been treated by Jewell [5] and Fox [3]. We assume that Ω is finite, but A may be infinite, possibly uncountable. Our main result exhibits conditions that imply the existence of a pure, stationary optimal policy for the average cost case. Here Ω is the set of states of the (not necessarily ergodic) embedded Markov chain and A is the Cartesian product ×_{ω∈Ω} M_ω, where M_ω is the set of actions available in state ω. A pure, stationary policy π selects for each ω ∈ Ω a single action π(ω) ∈ M_ω; equivalently, π is a point of A. Thus, for each ω ∈ Ω, the action π(ω) taken at ω under π is independent of the history of previous states, transition times, and actions. Examples in Derman [2] and Maitra [6] show that our results do not extend to the case where Ω is countably infinite without further restrictions.

We shall pass from the continuous discounting case to the averaging case. Denardo [1] has disposed of the former. With discount rate α, denote by R_i(α; π) the total expected discounted loss when using policy π and starting from state i. Policies that minimize R_i(α; π) for all i ∈ Ω are called α-optimal. With no restrictions on the cardinality of Ω, the results in [1] imply the existence of a pure, stationary α-optimal policy for every positive discount rate α, provided that the contraction and monotonicity properties (defined in [1]) hold in addition to a continuity condition. In the present case, the monotonicity assumption is obviously satisfied.

Let p_ij(k) be the probability of a direct transition to j from i when action k is taken in state i. Similarly, we define the transition time distribution F_ij(t | k) and the loss C_ij(x | t, k) incurred up to time x during a transition of duration t. We assume that the history of the process does not affect p_ij(k), F_ij(t | k), or C_ij(x | t, k). The one-stage expected discounted loss when making decision k in state i is

(1)   $\gamma_i(\alpha; k) = \sum_{j \in \Omega} p_{ij}(k) \int_0^\infty dF_{ij}(t \mid k) \int_0^t e^{-\alpha x} \, d_x C_{ij}(x \mid t, k).$

We assume that

(A)   $\int_0^\infty t \, dF_{ij}(t \mid k) \ge c > 0 \quad \forall k \in M_i, \; \forall i, j \in \Omega,$

(B)   $|\gamma_i(\alpha; k)| \le d < \infty \quad \forall \alpha > 0, \; \forall k \in M_i, \; \forall i \in \Omega$

hold. Verification of the fact that the contraction property holds follows routinely from the discussion of Example 4 in [1]. Actually, (A) and (B) are not necessary for the existence of a pure, stationary α-optimal policy, but we shall not attempt to find the weakest possible sufficient conditions.
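As a numerical aside (not part of the original note), the following sketch evaluates (1) by Monte Carlo for one hypothetical specification of the data: exponential transition times F_ij(· | k) and a loss that accrues at constant rate c during a transition, so that the inner integral is c(1 − e^{−αt})/α and γ_i(α; k) has the closed form c/(λ + α). All names and numbers are illustrative assumptions.

```python
import numpy as np

# Hypothetical data (not from the paper): two successor states j,
# transition times F_ij(. | k) ~ Exp(lam) for both, and a loss accruing
# at constant rate c during a transition, C_ij(x | t, k) = c * x.
# The inner integral of (1) is then c * (1 - exp(-alpha * t)) / alpha,
# and integrating over t gives the closed form c / (lam + alpha).
rng = np.random.default_rng(0)
alpha, lam, c = 0.1, 2.0, 3.0
p = np.array([0.4, 0.6])       # p_1j(k) for the two successors j

t = rng.exponential(1.0 / lam, size=200_000)      # durations t ~ F
inner = c * (1.0 - np.exp(-alpha * t)) / alpha    # discounted loss given t
gamma_mc = p.sum() * inner.mean()                 # same F for both j here

print(f"Monte Carlo estimate of (1): {gamma_mc:.4f}")
print(f"closed form c/(lam+alpha) : {c / (lam + alpha):.4f}")
```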

* Received by the editors June 20, 1966.
† The RAND Corporation, Santa Monica, California 90406.


We also assume that:

(C)   $\gamma_i(\alpha; \, \cdot \,) + \sum_{j \in \Omega} p_{ij}(\, \cdot \,) \, v(j) \int_0^\infty e^{-\alpha t} \, dF_{ij}(t \mid \cdot \,)$

is continuous in a topology with a countable base for which M_i is compact, ∀α > 0, ∀i ∈ Ω, where v is any function from Ω to (−∞, ∞);

(D)   C_ij(· | t, k) is of bounded variation ∀k ∈ M_i, ∀i, j ∈ Ω;

(E)   $\int_0^\infty t^2 \, dF_{ij}(t \mid k) \le c' < \infty, \qquad \int_0^\infty dF_{ij}(t \mid k) \int_0^t x \, d_x C_{ij}(x \mid t, k) \le d' < \infty \quad \forall k \in M_i, \; \forall i, j \in \Omega;$

(F)   Chain continuity: ρ_l a pure, stationary policy with ergodic subchain structure r, l = 1, 2, ⋯, and ρ_l → ρ imply that ρ has chain structure r. (A sufficient condition for (F) is that p_ij(k_l) > 0, l = 1, 2, ⋯, and p_ij(k_l) → p_ij(k) imply p_ij(k) > 0.)

For (C), the discrete topology works automatically if M_i is finite. If M_i is a closed subset of the (possibly extended) real line, the usual (relative) topology often works.

The purpose of this paper is to show that (A)-(F) jointly imply the existence of a pure, stationary policy that minimizes expected cost per unit time, defined as

(2)   $\mathcal{L}_j(\pi) = \limsup_{t \to \infty} L_j(t; \pi)/t,$

where L_j(t; π) is the expected (undiscounted) cost up to time t when using (a measurable) policy π and starting from state j.

LEMMA 1. If ρ is a stationary policy, then R_i(α; ρ) = g(i; ρ)/α + w(i; ρ) + o(1) as α → 0+, where g(i; ·), w(i; ·), and the o(1) term are uniformly bounded over stationary policies.

Proof. The lemma follows from (A)-(F), and results in Jewell [5] and Fox [3], where explicit expressions for g and w are derived.
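To illustrate Lemma 1 numerically (my addition, not the paper's), take the special case of unit transition times, so that under a fixed stationary policy the process is an ordinary discrete-time Markov chain and R(α) = (I − e^{−α}P)^{−1}c. Then αR_i(α) should approach g(i) as α → 0+; for an ergodic chain, g is the stationary expected one-step cost, the same for every starting state. P and c below are made up.

```python
import numpy as np

# Fixed stationary policy, unit transition times: R(alpha) solves
# R = c + e^{-alpha} P R, i.e. R(alpha) = (I - e^{-alpha} P)^{-1} c.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
c = np.array([1.0, -2.0, 0.5])

# Gain g from the stationary distribution (this chain is ergodic).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()
g = pi @ c

for alpha in (1e-1, 1e-2, 1e-3, 1e-4):
    R = np.linalg.solve(np.eye(3) - np.exp(-alpha) * P, c)
    print(f"alpha={alpha:.0e}  alpha*R(alpha) = {alpha * R}")
print(f"g = {g:.6f}  (alpha*R_i(alpha) -> g for every i)")
```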

Let S be the set of pure, stationary policies.

LEMMA 2. S is compact with respect to the topology of pointwise convergence, which has a countable base.

Proof. Since M_i was assumed compact, A is compact relative to the product topology by Tychonoff's theorem; the lemma follows from the definition of the topology of pointwise convergence, since there is a one-to-one correspondence between S and A. The last assertion is a direct consequence of (C) and the finiteness of Ω.

LEMMA 3. Let {α_i} → 0+ and {σ_{α_i}} be subsequences such that σ_{α_i} is a pure, stationary α_i-optimal policy, i = 1, 2, ⋯; σ_{α_i} → σ_0 implies that σ_0 minimizes ℒ_j(π) for all j ∈ Ω.

Proof. The subsequences exist by results of Denardo [1]. By Lemma 1 and standard Abelian and Tauberian theorems (see, e.g., Widder [7, pp. 181 and 192]),


we have, for any policy π and any starting state j,

$$\begin{aligned}
\mathcal{L}_j(\pi) &\ge \limsup_{i \to \infty} \alpha_i R_j(\alpha_i; \pi)\\
&\ge \lim_{i \to \infty} \alpha_i R_j(\alpha_i; \sigma_{\alpha_i})\\
&= \lim_{i \to \infty} \min_{\rho \in S} \alpha_i R_j(\alpha_i; \rho)\\
&= \min_{\rho \in S} \lim_{i \to \infty} \alpha_i R_j(\alpha_i; \rho)\\
&= \min_{\rho \in S} g(j; \rho) = g(j; \sigma_0) = \mathcal{L}_j(\sigma_0).
\end{aligned}$$

Interchange of lim and min follows from the uniform convergence of α_i R_j(α_i; ·). Uniform convergence of αR_j(α; ·) follows from Lemma 1. By (C), αR_j(α; ·) is continuous. Since the uniform limit of continuous functions is continuous, g(j; ·) is continuous, which by Lemma 2 implies that g(j; ·) has a minimum over S. The fourth equality follows from the continuity of g(j; ·), the α_i-optimality of σ_{α_i}, and the fact that σ_{α_i} → σ_0 implies (from the finiteness of Ω) that {σ_{α_i}} converges uniformly.

Let S* = {ρ ∈ S : ℒ_j(ρ) = min_{π∈S} ℒ_j(π) ∀j ∈ Ω} and S** = {ρ ∈ S* : ∃{α_i} → 0+ and {σ_{α_i}} such that σ_{α_i} → ρ and σ_{α_i} is pure, stationary, and α_i-optimal, i = 1, 2, ⋯}.

LEMMA 4. S** is not empty.

Proof. By Lemma 2, every sequence {σ_α} of pure, stationary α-optimal policies (with α → 0+) contains a convergent subsequence, say {σ_{α_i}}, with a limit in S. Then σ_{α_i} → σ_0 implies σ_0 ∈ S** by Lemma 3.

THEOREM. S* is the set of pure, stationary policies minimizing ℒ_j(π) over all policies for all j ∈ Ω. S* is not empty.

Proof. Apply Lemmas 3 and 4.

COROLLARY. If S* contains exactly one policy, say σ_0, and {σ_α} is a sequence of pure, stationary α-optimal policies with α → 0+, then σ_α → σ_0 and σ_0 minimizes ℒ_j(π) over all policies for all j ∈ Ω.

Remark 1. Let d(π, ρ; α) = max_{i∈Ω} |R_i(α; π) − R_i(α; ρ)|. If α′ > 0 and {σ_α} is a sequence of α-optimal policies, then α → α′ implies d(σ_α, σ_{α′}; α′) → 0, since R_i(·; π) is continuous for all i, π. Thus, if S_α is the set of pure, stationary α-optimal policies, there is a continuous correspondence between α and S_α. However, S* may properly contain S**. Often (e.g., [3], [4]) a secondary criterion is introduced that eliminates all policies in S* − S** not "nearly optimal" for small discount rates.

Remark 2. In applications, a constructive optimization scheme is needed. For the discounting case, this is provided by the policy iteration algorithm in [1]. With the criterion of expected cost per unit time, we first consider the simplest case: if the embedded Markov chain is ergodic for all stationary policies, the policy iteration algorithm in Fig. 2 of [5] applies without change. In the general multichain case, where the ergodic subchain structure depends on the policy, either of the policy improvement algorithms in [4] applies without change. If every policy produces a single ergodic subchain plus a policy-dependent (possibly empty) set of transient states, or if A is finite, convergence to an optimal policy is guaranteed; however, convergence is not necessarily finite unless A is finite. For the case of infinite A, convergence follows from continuity and compactness considerations; the details are omitted.
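To make Remark 2 concrete, here is a minimal sketch of policy iteration for the ergodic (unichain) case, in the spirit of Fig. 2 of [5]; the data structures and numbers are invented for illustration, not taken from [5]. Writing c_i(k) and τ_i(k) for the expected one-transition loss and duration under action k, the evaluation step solves w_i = c_i(δ) − g·τ_i(δ) + Σ_j p_ij(δ) w_j with w fixed at zero in a reference state, and the improvement step minimizes c_i(k) − g·τ_i(k) + Σ_j p_ij(k) w_j over k ∈ M_i.

```python
import numpy as np

# Policy iteration for an average-cost Markov renewal program, ergodic
# case (illustrative two-state, two-action data).
# P[i][k]    : transition probabilities from state i under action k
# cost[i][k] : expected one-transition loss;  tau[i][k] : expected duration
P = {0: {0: [0.7, 0.3], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.9, 0.1]}}
cost = {0: {0: 2.0, 1: 0.5}, 1: {0: 1.0, 1: 3.0}}
tau  = {0: {0: 1.0, 1: 2.0}, 1: {0: 1.0, 1: 0.5}}
states = [0, 1]

def evaluate(policy):
    """Solve w_i = c_i - g*tau_i + sum_j p_ij w_j, with w = 0 in the
    last state; unknowns are w_0, ..., w_{n-2} and the gain g."""
    n = len(states)
    A = np.zeros((n, n))
    b = np.zeros(n)
    for i in states:
        k = policy[i]
        p = P[i][k]
        for j in states[:-1]:
            A[i, j] = (1.0 if i == j else 0.0) - p[j]
        A[i, n - 1] = tau[i][k]      # last column carries g
        b[i] = cost[i][k]
    x = np.linalg.solve(A, b)
    return x[-1], np.append(x[:-1], 0.0)   # gain g, relative values w

def improve(g, w):
    """Pick, in each state, the action minimizing the test quantity."""
    return {i: min(P[i], key=lambda k: cost[i][k] - g * tau[i][k]
                   + np.dot(P[i][k], w))
            for i in states}

policy = {0: 0, 1: 0}
while True:
    g, w = evaluate(policy)
    new = improve(g, w)
    if new == policy:
        break
    policy = new
print(f"optimal policy {policy}, cost per unit time g = {g:.4f}")
```

For these made-up data the loop stops after two evaluations with policy {0: 1, 1: 0} and g ≈ 0.5833.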

Remark 3. We close with two examples. The first, due to Eric Denardo, shows that condition (F) is crucial for the theorem. Consider a three-state problem in discrete time, where all transitions are of unit length. In states 2 and 3 there is only one action; these states are absorbing (p_22 = p_33 = 1). In state 1 the action k is chosen from the closed interval [0, 1/2] with the usual topology; k determines the transition probabilities p_12(k) = k, p_13(k) = k². No matter what action is taken, C_11 = C_33 = C_12 = C_13 = 0 and C_22 = −1. It is easily shown that

$g(1; k) = \begin{cases} -1/(1+k), & k > 0, \\ 0, & k = 0. \end{cases}$

Note that inf {g(1; k) : k ∈ [0, 1/2]} = −1, which is not attained; but, of course, chain continuity does not hold. (With this condition dropped, the proof breaks down at Lemma 1.)
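To see the discontinuity numerically (my sketch, using the example's data), note that from state 1 the chain is absorbed in state 2 with probability p_12(k)/(p_12(k) + p_13(k)) = 1/(1 + k) for k > 0, and only state 2 carries nonzero average cost:

```python
def gain(k: float) -> float:
    """g(1; k) in Denardo's example: p12 = k, p13 = k**2, states 2 and 3
    absorbing, C22 = -1 and all other costs 0."""
    if k == 0.0:
        return 0.0                  # state 1 never leaves; cost 0 forever
    absorb_in_2 = k / (k + k * k)   # = 1/(1 + k)
    return -1.0 * absorb_in_2       # long-run cost is -1 iff absorbed in 2

for k in (0.5, 0.1, 0.01, 0.001, 0.0):
    print(f"k = {k:<6}  g(1; k) = {gain(k):+.4f}")
# g(1; k) -> -1 as k -> 0+, yet g(1; 0) = 0: the chain structure jumps
# at k = 0 and the infimum -1 is not attained.
```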

In the preceding example there is no optimal policy, stationary or otherwise. A one-state, discrete-time example with no optimal stationary policy but with a nonstationary optimal policy is obtained by dropping condition (C). Let C_11(k) = 1/k, k = 1, 2, ⋯. Two (of the infinitely many) optimal policies are to take actions k and k², respectively, at step k. From a practical viewpoint, the latter policy is preferable.
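A quick check of the arithmetic behind that comparison (my addition): both action sequences drive the time-average cost to zero, so both are optimal under criterion (2), but only the second keeps the total cost bounded.

```python
# Costs under the two nonstationary policies: action k at step k costs
# 1/k; action k**2 at step k costs 1/k**2.
for n in (10, 1000, 100_000):
    total_harmonic = sum(1.0 / k for k in range(1, n + 1))
    total_squares = sum(1.0 / k**2 for k in range(1, n + 1))
    print(f"n={n:>6}  avg 1/k = {total_harmonic / n:.5f}"
          f"  avg 1/k^2 = {total_squares / n:.7f}"
          f"  total 1/k^2 = {total_squares:.4f}")
# Both averages tend to 0, but sum 1/k diverges while sum 1/k^2 stays
# below pi**2 / 6, which is why the second policy is preferable.
```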

REFERENCES

[1] E. V. DENARDO, Contraction mappings in the theory underlying dynamic programming, this Review, 9 (1967), pp. 165-177.

[2] C. DERMAN, Denumerable state Markovian decision processes-average cost criterion, Ann. Math. Statist., 37 (1966), pp. 1545-1553.

[3] B. FOX, Markov renewal programming by linear fractional programming, SIAM J. Appl. Math., 14 (1966), pp. 1418-1432.

[4] E. V. DENARDO AND B. L. FOX, Multichain Markov renewal programs, RM-5208-PR, The RAND Corporation, Santa Monica, California, 1967.

[5] W. S. JEWELL, Markov-renewal programming. I and II, Operations Res., 11 (1963), pp. 938-971.

[6] A. MAITRA, Dynamic programming for countable state systems, Sankhyā Ser. A, 27 (1965), pp. 241-248.

[7] D. V. WIDDER, The Laplace Transform, Princeton University Press, Princeton, 1946.
