
Top (2007) 15: 161–198. DOI 10.1007/s11750-007-0025-0

INVITED PAPER

Dynamic priority allocation via restless bandit marginal productivity indices

José Niño-Mora

Published online: 27 September 2007. © Sociedad de Estadística e Investigación Operativa 2007

Abstract This paper surveys recent work by the author on the theoretical and algorithmic aspects of restless bandit indexation as well as on its application to a variety of problems involving the dynamic allocation of priority to multiple stochastic projects. The main aim is to present ideas and methods in an accessible form that can be of use to researchers addressing problems of such a kind. Besides building on the rich literature on bandit problems, our approach draws on ideas from linear programming, economics, and multi-objective optimization. In particular, it was motivated to address issues raised in the seminal work of Whittle (Restless bandits: activity allocation in a changing world. In: Gani J. (ed.) A Celebration of Applied Probability, J. Appl. Probab., vol. 25A, Applied Probability Trust, Sheffield, pp. 287–298, 1988) where he introduced the index for restless bandits that is the starting point of this work. Such an index, along with previously proposed indices and more recent extensions, is shown to be unified through the intuitive concept of “marginal productivity index” (MPI), which measures the marginal productivity of work on a project at each of its states. In a multi-project setting, MPI policies are economically sound, as they dynamically allocate higher priority to those projects where work appears to be currently more productive. Besides being tractable and widely applicable, a growing body of computational evidence indicates that such index policies typically achieve a near-optimal performance and substantially outperform benchmark policies derived from conventional approaches.

This invited paper is discussed in the comments available at: http://dx.doi.org/10.1007/s11750-007-0026-z, http://dx.doi.org/10.1007/s11750-007-0027-y, http://dx.doi.org/10.1007/s11750-007-0028-x, http://dx.doi.org/10.1007/s11750-007-0029-9, http://dx.doi.org/10.1007/s11750-007-0030-3, http://dx.doi.org/10.1007/s11750-007-0031-2.

J. Niño-Mora, Department of Statistics, Universidad Carlos III de Madrid, C/ Madrid 126, 28903 Getafe, Madrid, Spain. e-mail: [email protected]


Keywords Priority allocation · Stochastic scheduling · Index policies · Restless bandits · Marginal productivity index · Indexability · Dynamic control of queues · Control by price

Mathematics Subject Classification (2000) 90B36 · 90C40 · 90B05 · 90B22 · 90B18

1 Introduction

The overarching concern with making best use of that most precious resource, time, leads us to ponder how to set priorities among the multifarious activities vying for our attention. Thus, we must decide over time whether to engage in projects of potentially high reward yet unlikely success, or in less rewarding but more realistic alternatives, revising priorities over time in light of actual progress and future prospects. Similar issues arise in the automatic control of modern technological systems, such as those in manufacturing and computer-communication networks, where the flow of distinct traffic streams can be regulated by dynamically prioritizing access to shared resources such as machines or transmission channels. The high level of discretion allowed in such decisions, as well as their often substantial impact on system performance, raises the possibility of optimizing the latter through appropriate design of the priority policy adopted.

Yet, while many such problems are readily formulated in the framework of Markov decision processes (MDPs), their computational solution via the conventional dynamic programming (DP) technique is typically intractable, due to the well-known curse of dimensionality. As for analytical solutions, they are only available for a few models under rather special conditions. Such a state of affairs motivates investigation of heuristic policies that, while not optimal, are both tractable and come close to achieving desired performance objectives.

Perhaps the most natural and simple class of priority allocation policies is based on use of priority indices. Thus, if one must dynamically prioritize work on multiple stochastic projects, an index is defined for each as a function of its state. The resultant priority-index policy engages at each time the required number of projects with currently larger index values. Yet, such a class of policies is still overwhelmingly large, which motivates the quest for ideas that guide us to design sound priority indices yielding good or even optimal policies.

The earliest result on optimality of a priority-index rule is given in Smith (1956), which addresses the problem of sequencing a batch of jobs having known, deterministic processing times and linear holding costs. A sound priority index for a job in such a setting is given by the ratio of its holding cost rate per unit time to its processing time, which measures the rate of cost reduction per unit of effort expended. Hence, such an index can be interpreted as a measure of the average productivity of work on the job. Smith showed that, in the single-machine case, the total weighted completion time is minimized by the resultant index rule. The optimality of the Smith index rule was extended in Rothkopf (1966) to a model where job durations are stochastic. Cox and Smith (1961) showed that such a rule also yields an average-optimal policy for scheduling a multiclass single-server queue with linear holding costs. A more complex optimal index rule for the latter model’s extension that incorporates Bernoulli feedback between job classes was obtained by Klimov (1974), attaching a constant index to each class.

While such index rules are static, in that the index of a job class is constant, in other problems researchers have identified optimal dynamic index rules. In such a vein, the seminal, independent work of Sevcik (1974) and Gittins and Jones (1974) stands out. Of particular relevance to this paper is the celebrated result in the latter paper on the optimal solution of the classic multiarmed bandit problem by a dynamic index rule. The problem concerns the sequential allocation of work to a collection of stochastic projects modeled as Markov chains that, when engaged, yield rewards and change state. The problem is to decide which project to engage at each time to maximize the expected total discounted reward earned over an infinite horizon. The optimal policy turns out to be the priority-index rule corresponding to the Gittins index, which measures the maximum rate of expected discounted reward per unit of expected discounted time that can be achieved under stopping rules for each initial project state. See Gittins (1979, 1989). Hence, again, the “right” index is a measure of the average productivity of work on a project. For alternative proofs of such a fundamental result offering complementary insights see, e.g., Whittle (1980), Varaiya et al. (1985), Weber (1992), and Bertsimas and Niño-Mora (1996).

Actually, the roots of such a result can be traced to earlier work. Thus, the Gittins index extends to a general Markovian setting the index introduced by Bellman (1956) to solve the special Bayesian Bernoulli one-armed bandit problem via calibration.

In turn, Bellman drew on the earlier work of Bradt et al. (1956), as he acknowledged referring to an unpublished version of that paper, which we regard as the origin of bandit indexation. Bradt et al. (1956, Sect. 4) address the problem of optimal sequential design of an experiment where one wishes to maximize the sum of n observations, to be chosen sequentially from either of two Bernoulli processes. The success probability of the first process is known, whereas that of the second is unknown. The second process is modeled as a Bayesian bandit whose state is the posterior distribution. They showed that the optimal policy is characterized by a break-even, critical number, which is a function of the number of remaining observations and of the second process’ state: one should continue sampling from the second population as long as the current break-even value exceeds the known success probability of the first process, and then switch to the latter and keep sampling there until the n observations are completed. Such a break-even quantity is the index of concern, although they did not use such a term. Hence, their work introduced the calibration approach to bandit indexation, which has proven so fruitful in later developments.

Jumping forward in time in this brief history of bandit indexation, Whittle (1988) significantly expanded the latter’s scope beyond the realm of classic bandits, by introducing an index for restless bandits, namely those that can change state while passive. He did so by deploying a Lagrangian relaxation approach to the intractable (cf. Papadimitriou and Tsitsiklis 1999) restless extension of the classic multiarmed bandit problem under the average criterion. Whittle conjectured a form of asymptotic optimality of the resultant index policy, which was established in Weber and Weiss (1990, 1991) under certain conditions.


Yet, Whittle realized that existence of the index is not guaranteed for all restless bandits: only for those that satisfy a so-called indexability property. He stated in Whittle (1988):

. . . one would very much like to have simple sufficient conditions for indexability; at the moment, none are known.

Such a state of affairs prompted the author to address that and other issues on restless bandit indexation, as reported in the work surveyed herein. A cornerstone of our approach is the intuitive concept introduced in Niño-Mora (2006a) of marginal productivity index (MPI), which furnishes a unifying framework for all the indices reviewed above as well as more recent extensions. Such a concept is grounded on insights drawn from the marginal productivity theory in economics developed at the end of the 19th century by several researchers. See, e.g., the classic work by Clark (2005). The MPI of a project is a sound, intuitive priority index, as it measures the marginal productivity of work at each project state. As we will see, in the case of classic bandits the MPI reduces to an average productivity index. In a multi-project setting, MPI policies dynamically assign higher priority to projects where work appears to be currently more productive. Besides being widely tractable and applicable, a growing body of computational evidence indicates that such index policies typically exhibit a near-optimal performance and outperform conventional benchmark policies derived from alternative approaches.

Several applications surveyed below are drawn from the domain of optimal control of queueing systems. While the static optimization of such systems (cf. Combé and Boxma 1994 and Boxma 1995) and some approaches to dynamic optimization have attracted substantial research attention, emerging evidence suggests that, within their scope, the MPI policies advocated in this paper can often yield significant performance gains at a reduced computational expense.

The remainder of the paper is organized as follows. Section 2 reviews the key concepts and results of the theory of restless bandit indexation, as well as of its computational aspects as developed by the author extending Whittle’s work. Section 3 discusses applications to problems of admission control and routing to parallel queues, scheduling a multiclass make-to-order/make-to-stock queue, and scheduling a multiclass queue with finite buffers. Section 4 reports on more recent developments, involving theory, algorithms and applications. Finally, Sect. 5 concludes.

We remark that in the paper we use the terms “bandit” and “project” interchangeably.

2 Restless bandit indexation: theory and computation

We focus the following exposition on a discrete-time single restless bandit model having a finite state space $N$, whose one-period rewards and state-transition probabilities under actions $a = 0$ (passive) and $a = 1$ (active) at state $i$ are denoted by $R^a_i$ and $p^a_{ij}$, respectively. Rewards are discounted over time with factor $0 < \beta < 1$. The project is operated under a policy $\pi$, drawn from the class $\Pi$ of history-dependent randomized policies. We denote by $X(t)$ and $a(t)$ the project state and action processes, respectively.


2.1 Indexability and the MPI

We evaluate a policy $\pi$ by means of two measures. The first is the reward measure

\[
f^\pi_{i_0} \triangleq E^\pi_{i_0}\left[\sum_{t=0}^{\infty} R^{a(t)}_{X(t)} \beta^t\right],
\]

giving the expected total discounted value of rewards earned over an infinite horizon starting at $i_0$. The second measure concerns the associated resource expenditure. Thus, if $Q^a_j$ units of work are expended by taking action $a$ in state $j$, we use the work measure

\[
g^\pi_{i_0} \triangleq E^\pi_{i_0}\left[\sum_{t=0}^{\infty} Q^{a(t)}_{X(t)} \beta^t\right],
\]

giving the corresponding expected total discounted amount of work expended. We assume that such work-expenditure parameters satisfy $Q^1_j \geq Q^0_j \geq 0$.

Note that such a setting allows the possibility that the two actions are identical in their resource consumption and dynamics at some states. We will find it convenient to identify the states, if any, where such is the case,

\[
N^{\{0\}} \triangleq \left\{ i \in N : Q^0_i = Q^1_i \text{ and } p^0_{ij} = p^1_{ij},\ j \in N \right\}, \tag{1}
\]

and call them uncontrollable, while terming controllable the remaining states $N^{\{0,1\}} \triangleq N \setminus N^{\{0\}}$. We denote by $n \geq 1$ the number of controllable states, and adopt the convention that the passive action is taken at uncontrollable states, which is reflected in the notation.

We will further refer to corresponding measures $f^\pi$ and $g^\pi$ obtained by drawing at random the initial state according to an arbitrary positive probability mass function $p_i > 0$ for $i \in N$, i.e., $f^\pi \triangleq \sum_{i \in N} p_i f^\pi_i$ and $g^\pi \triangleq \sum_{i \in N} p_i g^\pi_i$.

Suppose that work is to be paid for at wage rate $\nu$, and consider the $\nu$-wage problem

\[
\max_{\pi \in \Pi} \; f^\pi - \nu g^\pi, \tag{2}
\]

which is to find an admissible project-operating policy that maximizes the value of rewards earned minus labor costs incurred. We will use (2) as a calibrating problem, aimed at measuring the marginal value of work at each project state.

Since (2) is a finite-state and finite-action discounted MDP, standard results (cf. Puterman 1994) ensure existence of an optimal policy that is: (i) stationary deterministic; and (ii) independent of the initial state. It is convenient to represent each such policy by its active set $S \subseteq N^{\{0,1\}}$, which is the subset of states where it prescribes to engage the project. We will thus refer to the $S$-active policy and write, e.g., $f^S$ and $g^S$. We can thus reduce (2) to the combinatorial optimization problem of finding an optimal active set in the family of all subsets of $N^{\{0,1\}}$, denoted by $2^{N^{\{0,1\}}}$:

\[
\max_{S \in 2^{N^{\{0,1\}}}} \; f^S - \nu g^S. \tag{3}
\]
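To make the reduction to (3) concrete, the following minimal Python sketch evaluates the $S$-active policies by solving the linear systems $f^S = R^S + \beta P^S f^S$ and $g^S = Q^S + \beta P^S g^S$, and then solves (3) by enumerating all active sets. The two-state instance, the array layout (`P[a]`, `R[a]`, `Q[a]`), and the function names `evaluate` and `best_active_set` are illustrative assumptions, not part of the paper; all states are assumed controllable.

```python
from itertools import combinations

import numpy as np


def evaluate(active, R, P, Q, beta):
    """Return the vectors (f^S, g^S) of the S-active policy, S = `active`."""
    n = R.shape[1]
    a = np.array([1 if i in active else 0 for i in range(n)])
    rows = np.arange(n)
    A = np.eye(n) - beta * P[a, rows, :]   # I - beta * P^S
    f = np.linalg.solve(A, R[a, rows])     # reward measure f^S
    g = np.linalg.solve(A, Q[a, rows])     # work measure g^S
    return f, g


def best_active_set(nu, R, P, Q, beta, p):
    """Brute-force solution of (3): maximize f^S - nu * g^S over all subsets."""
    n = R.shape[1]
    best_value, best_set = -np.inf, None
    for size in range(n + 1):
        for S in combinations(range(n), size):
            f, g = evaluate(set(S), R, P, Q, beta)
            value = float(p @ (f - nu * g))
            if value > best_value:         # ties go to the set found first
                best_value, best_set = value, set(S)
    return best_set, best_value


# Hypothetical two-state instance (not from the paper).
beta = 0.9
P = np.array([[[0.9, 0.1], [0.4, 0.6]],    # P[0]: passive transitions
              [[0.3, 0.7], [0.2, 0.8]]])   # P[1]: active transitions
R = np.array([[0.0, 0.0], [1.0, 0.5]])     # R[a][i]: one-period rewards
Q = np.array([[0.0, 0.0], [1.0, 1.0]])     # Q[a][i] = a: one unit of work when active
p = np.array([0.5, 0.5])                   # positive initial-state distribution

for nu in (-1.0, 0.25, 0.75, 2.0):
    print(nu, best_active_set(nu, R, P, Q, beta, p))
```

Enumeration costs $2^n$ policy evaluations, so it is only meant for small instances such as the three-state examples discussed later in this section.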


For every wage value $\nu$, the optimal policies are characterized by the unique solution to the Bellman equations

\[
\vartheta^*_i(\nu) = \max_{a \in \{0,1\}} \left\{ R^a_i - Q^a_i \nu + \beta \sum_{j \in N} p^a_{ij} \vartheta^*_j(\nu) \right\}, \quad i \in N, \tag{4}
\]

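where $\vartheta^*_i(\nu)$ denotes the optimal value of (2) starting at $i$. Hence, there exists a minimal optimal active set $S^*(\nu) \subseteq N^{\{0,1\}}$ for (2), which is characterized in terms of (4) by

\[
S^*(\nu) \triangleq \left\{ i \in N^{\{0,1\}} : R^1_i - Q^1_i \nu + \beta \sum_{j \in N} p^1_{ij} \vartheta^*_j(\nu) > R^0_i - Q^0_i \nu + \beta \sum_{j \in N} p^0_{ij} \vartheta^*_j(\nu) \right\}.
\]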
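As a rough illustration of how (4) and the set $S^*(\nu)$ can be computed numerically, the sketch below runs plain value iteration on the Bellman equations and then collects the states where the active action is strictly better. The instance, the function name `solve_wage_problem`, and the numerical tolerances are assumptions made for the example; in floating point the strict inequality can only be tested up to a tolerance.

```python
import numpy as np


def solve_wage_problem(nu, R, P, Q, beta, tol=1e-10, max_iter=10_000):
    """Value iteration for the nu-wage problem (4).

    Returns the (approximate) optimal value vector and the set of states
    where the active action is strictly better than the passive one.
    """
    n = R.shape[1]
    v = np.zeros(n)
    for _ in range(max_iter):
        q = R - nu * Q + beta * (P @ v)    # q[a, i]: value of action a at state i
        v_new = q.max(axis=0)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    q = R - nu * Q + beta * (P @ v)
    # Strict preference for a = 1, tested with a small tolerance (assumption).
    active = {i for i in range(n) if q[1, i] > q[0, i] + 1e-8}
    return v, active


# Hypothetical two-state instance (not from the paper), all states controllable.
beta = 0.9
P = np.array([[[0.9, 0.1], [0.4, 0.6]],    # passive transitions
              [[0.3, 0.7], [0.2, 0.8]]])   # active transitions
R = np.array([[0.0, 0.0], [1.0, 0.5]])
Q = np.array([[0.0, 0.0], [1.0, 1.0]])

for nu in (10.0, 0.5, -10.0):
    v, S = solve_wage_problem(nu, R, P, Q, beta)
    print(nu, sorted(S), np.round(v, 4))
```

For this instance a wage of 10 exceeds any attainable one-period reward, so resting everywhere is optimal, while a wage of -10 makes working optimal in every state.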

Now, it appears reasonable that, at least in some models, active sets $S^*(\nu)$ should expand monotonically from the empty set $\emptyset$ to the full controllable state space $N^{\{0,1\}}$ as the wage $\nu$ is decreased from $+\infty$ to $-\infty$. If such is the case, to each controllable state $i$ will be attached a critical wage value $\nu^*_i$ below which $i$ enters $S^*(\nu)$.

Definition 1 (Indexability; MPI) We say that the project is indexable if there exists an index $\nu^*_i \in \mathbb{R}$ for $i \in N^{\{0,1\}}$ such that

\[
S^*(\nu) = \left\{ i \in N^{\{0,1\}} : \nu^*_i > \nu \right\}, \quad \nu \in \mathbb{R}.
\]

In such a case we say that $\nu^*_i$ is the project’s MPI.

The concept of indexability was introduced by Whittle (1988) in the case $Q^a_i \equiv a$ under the long-run average criterion, in a formulation given in terms of optimal passive sets. He also showed that, natural as it may seem, such a property should not be taken for granted, as there are nonindexable projects. The extension to the discounted criterion was carried out in Niño-Mora (2001). The more general present setting was introduced in Niño-Mora (2002), where the index $\nu^*_i$ was first shown to measure the marginal value, or productivity, of work at each state, which, along with further extensions and results, prompted our proposing the term MPI in Niño-Mora (2006a).

2.2 An achievable work-reward view of indexability: geometric and economic insights

Niño-Mora (2002, 2006a) introduces an achievable work-reward approach to indexability that offers both geometric and economic insights, having deep connections with multi-objective optimization (cf. Hernández-Lerma and Hoyos-Reyes 2001).

Consider the achievable work-reward performance region

\[
\mathbb{H} \triangleq \left\{ \left(g^\pi, f^\pi\right) : \pi \in \Pi \right\},
\]

which is the region spanned in the plane by work-reward performance points under all admissible policies. Note that such a region is a convex polygon, given by the convex hull of the finite collection of performance points $(g^S, f^S)$ achieved by stationary deterministic policies, as represented by their active sets $S$:

\[
\mathbb{H} = \operatorname{conv}\left( \left\{ \left(g^S, f^S\right) : S \in 2^{N^{\{0,1\}}} \right\} \right).
\]

Of particular interest for our purposes is the upper boundary of such a region,

\[
\partial \mathbb{H} \triangleq \left\{ (g, f) \in \mathbb{H} : f^\pi \leq f \text{ for every } \pi \in \Pi \right\},
\]

as the indexability property concerns the latter’s structure. Thus, the project is indexable iff $\partial \mathbb{H}$ is characterized by a nested active-set family

\[
\mathcal{F}_0 \triangleq \left\{ S_0, S_1, \ldots, S_n \right\},
\]

where $S_0 \triangleq \emptyset$, $S_n \triangleq N^{\{0,1\}}$ and $S_k \triangleq \{i_1, \ldots, i_k\}$ for $1 \leq k \leq n$ satisfy

\[
g^{S_0} < g^{S_1} < \cdots < g^{S_n}, \tag{5}
\]

and $i_1, \ldots, i_n$ is an ordering of the project’s $n$ controllable states.

In such a case the MPI has the evaluation

\[
\nu^*_{i_k} = \frac{f^{S_k} - f^{S_{k-1}}}{g^{S_k} - g^{S_{k-1}}}, \quad 1 \leq k \leq n. \tag{6}
\]

The latter representation shows that the MPI measures the reward vs. work trade-off rates or slopes in the upper boundary $\partial \mathbb{H}$, characterizing $\nu^*_{i_k}$ as the marginal value or productivity of work in the project at state $i_k$, which motivates our using the term MPI.

We must emphasize that, while the measures $f^\pi$ and $g^\pi$ depend on the chosen initial-state probabilities $p_i > 0$, the rates defining the MPI in (6) remain invariant under changes in the latter. Further, whereas the present geometric approach requires in general that such probabilities be positive, it can be shown that the MPI can also be expressed as

\[
\nu^*_{i_k} = \frac{f^{S_k}_{i_k} - f^{S_{k-1}}_{i_k}}{g^{S_k}_{i_k} - g^{S_{k-1}}_{i_k}}, \quad 1 \leq k \leq n.
\]

The latter representation sheds light on the relation between the index for restless and classic (nonrestless) bandits. Thus, for a classic bandit with zero passive rewards we have $f^{S_{k-1}}_{i_k} = g^{S_{k-1}}_{i_k} = 0$, and hence

\[
\nu^*_{i_k} = \frac{f^{S_k}_{i_k}}{g^{S_k}_{i_k}}, \quad 1 \leq k \leq n.
\]

Therefore, the MPI for a classic bandit reduces to an average productivity index.

The above interpretation furnishes an intuitive economic justification for use of MPI policies in a multi-project scenario. Thus, such policies seek to dynamically allocate work to those projects that can make better use of it, using the MPI as a proxy (as it ignores interactions) marginal productivity measure. Such a viewpoint draws and builds on the marginal productivity theory in economics, and on its extensions which apply it to optimal resource allocation. See, e.g., Clark (2005), Koopmans (1957), and Kantorovich (1959).

Notice further that the optimal value function $\vartheta^*_i(\nu)$ in (4) of an indexable project as above is given by

\[
\vartheta^*_i(\nu) = \max_{S \in \mathcal{F}_0} \; f^S_i - \nu g^S_i. \tag{7}
\]

Two examples will help illustrate these ideas. Consider first the project with state space $N = \{1, 2, 3\}$ and one-period work consumptions $Q^a_i = a$, discount factor $\beta = 0.9$, one-period active reward vector and one-period transition probabilities

\[
R^1 = \begin{bmatrix} 0.9016 \\ 0.10949 \\ 0.01055 \end{bmatrix}, \quad
P^1 = \begin{bmatrix} 0.2841 & 0.4827 & 0.2332 \\ 0.5131 & 0.0212 & 0.4657 \\ 0.4612 & 0.0081 & 0.5307 \end{bmatrix}, \quad
P^0 = \begin{bmatrix} 0.1810 & 0.4801 & 0.3389 \\ 0.2676 & 0.2646 & 0.4678 \\ 0.5304 & 0.2843 & 0.1853 \end{bmatrix},
\]

and one-period passive reward vector $R^0 = 0$. Figure 1 displays the achievable work-reward performance region $\mathbb{H}$ for such an instance, where points $(g^S, f^S)$ are labeled by their active sets $S$, and the initial-state distribution is uniform over $N$, i.e., $p_i = 1/3$ for $i \in N$. The plot shows that this is an indexable project, relative to the nested active-set family $\mathcal{F}_0 = \{\emptyset, \{1\}, \{1,2\}, \{1,2,3\}\}$, which determines the region’s upper boundary $\partial \mathbb{H}$. The MPI values of states 1, 2, and 3 are given by the successive reward vs. work trade-off rates or slopes in such an upper boundary:

\[
\nu^*_1 = \frac{f^{\{1\}} - f^{\emptyset}}{g^{\{1\}} - g^{\emptyset}}
> \nu^*_2 = \frac{f^{\{1,2\}} - f^{\{1\}}}{g^{\{1,2\}} - g^{\{1\}}}
> \nu^*_3 = \frac{f^{\{1,2,3\}} - f^{\{1,2\}}}{g^{\{1,2,3\}} - g^{\{1,2\}}}.
\]

Fig. 1 Achievable work-reward performance region of an indexable project
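The performance points of the first example can be recomputed with a few lines of Python. The sketch below is only a reading aid under the data as printed above: it evaluates $(g^S, f^S)$ for all eight active sets under the uniform initial distribution and then computes the slopes (6) along the nested family reported in the text. States are indexed from 0 in the code, and the helper name `performance_point` is an assumption.

```python
from itertools import combinations

import numpy as np

beta = 0.9
P = np.array([
    [[0.1810, 0.4801, 0.3389],     # P[0]: passive transition matrix P^0
     [0.2676, 0.2646, 0.4678],
     [0.5304, 0.2843, 0.1853]],
    [[0.2841, 0.4827, 0.2332],     # P[1]: active transition matrix P^1
     [0.5131, 0.0212, 0.4657],
     [0.4612, 0.0081, 0.5307]],
])
R = np.array([[0.0, 0.0, 0.0], [0.9016, 0.10949, 0.01055]])   # R^0 = 0, R^1
Q = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])              # Q^a_i = a
p = np.full(3, 1.0 / 3.0)                                     # uniform initial state


def performance_point(active):
    """Return (g^S, f^S) for the S-active policy under initial distribution p."""
    n = len(p)
    a = np.array([1 if i in active else 0 for i in range(n)])
    rows = np.arange(n)
    A = np.eye(n) - beta * P[a, rows, :]
    f = np.linalg.solve(A, R[a, rows])
    g = np.linalg.solve(A, Q[a, rows])
    return float(p @ g), float(p @ f)


# One performance point per active set (state 1 of the text is index 0 here).
points = {frozenset(S): performance_point(S)
          for k in range(4) for S in combinations(range(3), k)}

# Nested family reported in the text: {}, {1}, {1,2}, {1,2,3}.
family = [frozenset(), frozenset({0}), frozenset({0, 1}), frozenset({0, 1, 2})]
slopes = []
for prev, cur in zip(family, family[1:]):
    (g0, f0), (g1, f1) = points[prev], points[cur]
    slopes.append((f1 - f0) / (g1 - g0))   # trade-off rate as in (6)

for S, (g, f) in sorted(points.items(), key=lambda kv: kv[1]):
    print(sorted(i + 1 for i in S), round(g, 4), round(f, 4))
print("slopes along the nested family:", slopes)
```

Since $R^0 = 0$, the first slope equals $R^1_1 = 0.9016$ whatever the initial distribution; the text reports that the three slopes are decreasing, which the printed values let the reader check.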


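Consider now the project instance having the same states, initial-state distribution and discount factor, transition probabilities

\[
P^1 = \begin{bmatrix} 0.7796 & 0.0903 & 0.1301 \\ 0.1903 & 0.1863 & 0.6234 \\ 0.2901 & 0.3901 & 0.3198 \end{bmatrix}, \quad
P^0 = \begin{bmatrix} 0.1902 & 0.4156 & 0.3942 \\ 0.5676 & 0.4191 & 0.0133 \\ 0.0191 & 0.1097 & 0.8712 \end{bmatrix},
\]

and reward vectors

\[
R^1 = \begin{bmatrix} 0.9631 \\ 0.7963 \\ 0.1057 \end{bmatrix}, \quad
R^0 = \begin{bmatrix} 0.458 \\ 0.5308 \\ 0.6873 \end{bmatrix}.
\]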

Figure 2 displays the achievable work-reward performance region for this instance. The plot reveals that this project is nonindexable, since there is no nested active-set family that determines the region’s upper boundary.

2.3 PCL-indexability conditions and adaptive-greedy index algorithm

While testing for indexability of a given restless bandit instance is a conceptually simple task, as it can be solved, e.g., by visual inspection of plots such as those in Figs. 1 and 2, researchers will more often be interested in establishing analytically that a particular model arising in some application is indexable under a suitable parameter range. The latter task is, in contrast, generally far from trivial. It would thus be useful to have tractable sufficient conditions for indexability that are widely applicable. Niño-Mora (2001, 2002, 2006a) introduces, develops, and deploys the first such conditions, along with a corresponding index algorithm, which we review next.

Fig. 2 Achievable work-reward performance region of a nonindexable project

For a restless bandit as above, given an action $a \in \{0,1\}$ and an active set $S \subseteq N^{\{0,1\}}$, denote by $\langle a, S \rangle$ the policy that takes action $a$ in the initial period and adopts the $S$-active policy thereafter. In addition to the work and reward measures discussed before, let us now define the marginal work measure

\[
w^S_i \triangleq g^{\langle 1,S \rangle}_i - g^{\langle 0,S \rangle}_i = Q^1_i - Q^0_i + \beta \sum_{j \in N} \left( p^1_{ij} - p^0_{ij} \right) g^S_j,
\]

and the marginal reward measure

\[
r^S_i \triangleq f^{\langle 1,S \rangle}_i - f^{\langle 0,S \rangle}_i = R^1_i - R^0_i + \beta \sum_{j \in N} \left( p^1_{ij} - p^0_{ij} \right) f^S_j.
\]

Notice that $w^S_i$ (resp., $r^S_i$) measures the marginal increment in work expended (resp., in value of rewards earned) that results from working instead of resting in the initial period starting at $i$, provided that the $S$-active policy is adopted afterwards.

Further, if $w^S_i \neq 0$, define the marginal productivity measure

\[
\nu^S_i \triangleq \frac{r^S_i}{w^S_i}.
\]
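The second equality in each of the two definitions above follows by conditioning on the first transition. The short derivation below is offered as a reading aid rather than as a quotation from the cited papers; the special case at the end assumes $Q^a_i = a$ and $R^0 = 0$, as in the first example of Sect. 2.2.

```latex
% Policy <a,S>: take action a now, then follow the S-active policy.
\begin{align*}
g^{\langle a,S\rangle}_i &= Q^a_i + \beta \sum_{j \in N} p^a_{ij}\, g^S_j, &
f^{\langle a,S\rangle}_i &= R^a_i + \beta \sum_{j \in N} p^a_{ij}\, f^S_j .
\end{align*}
% Subtracting the a = 0 expression from the a = 1 expression gives
\begin{align*}
w^S_i &= Q^1_i - Q^0_i + \beta \sum_{j \in N} \bigl(p^1_{ij} - p^0_{ij}\bigr) g^S_j, &
r^S_i &= R^1_i - R^0_i + \beta \sum_{j \in N} \bigl(p^1_{ij} - p^0_{ij}\bigr) f^S_j .
\end{align*}
% Special case S = \emptyset with Q^a_i = a and R^0 = 0:
% the never-active policy earns and spends nothing, so f^\emptyset = g^\emptyset = 0 and
\[
w^{\emptyset}_i = 1, \qquad r^{\emptyset}_i = R^1_i, \qquad \nu^{\emptyset}_i = R^1_i .
\]
```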

Recall now the characterization of indexability discussed in Sect. 2.2. Typically, i.e., unless the state space is linearly ordered, we will not be able to identify a priori the nested active-set family $\mathcal{F}_0$ determining the achievable work-reward performance region’s upper boundary. Yet, often we can draw on intuition or experimentation to guess the structure of optimal policies for the particular model at hand, in the form of an active-set family $\mathcal{F} \subseteq 2^{N^{\{0,1\}}}$ that contains $\mathcal{F}_0$, i.e., $\mathcal{F} \supseteq \mathcal{F}_0$, for a range of model parameters. In fact, $\mathcal{F}$ will often be much larger than $\mathcal{F}_0$. In the terminology of combinatorial optimization, $(N^{\{0,1\}}, \mathcal{F})$ is a set system on ground set $N^{\{0,1\}}$ having $\mathcal{F}$ as its family of feasible sets.

Algorithmic considerations, namely the requirement that we can build our way up from the empty set towards a given set $S \in \mathcal{F}$ through successive single-state augmentations, as well as the symmetric requirement for reaching $S$ through successive single-state removals from $N^{\{0,1\}}$, lead us to impose some natural conditions on such a set system, for which we need the following concepts. For $S \in \mathcal{F}$, define the inner boundary of $S$ relative to $\mathcal{F}$ by

\[
\partial^{\mathrm{in}}_{\mathcal{F}} S \triangleq \left\{ i \in S : S \setminus \{i\} \in \mathcal{F} \right\}.
\]

Define further the outer boundary of $S$ relative to $\mathcal{F}$ by

\[
\partial^{\mathrm{out}}_{\mathcal{F}} S \triangleq \left\{ i \in N^{\{0,1\}} \setminus S : S \cup \{i\} \in \mathcal{F} \right\}.
\]

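Assumption 1 Set system $(N^{\{0,1\}}, \mathcal{F})$ satisfies the following conditions:

(i) $\emptyset, N^{\{0,1\}} \in \mathcal{F}$
(ii) For $\emptyset \neq S \in \mathcal{F}$, $\partial^{\mathrm{in}}_{\mathcal{F}} S \neq \emptyset$
(iii) For $N^{\{0,1\}} \neq S \in \mathcal{F}$, $\partial^{\mathrm{out}}_{\mathcal{F}} S \neq \emptyset$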

Consider now the adaptive-greedy algorithm $\mathrm{AG}_{\mathcal{F}}$ shown in Table 1. In essence, referring to the geometric viewpoint in Sect. 2.2, this algorithm seeks to traverse the upper boundary of the achievable work-reward performance region, building up the successive active sets $S_k$ forming the nested family $\mathcal{F}_0$ that determines such a boundary. To do so it restricts attention to active sets drawn from the given family $\mathcal{F}$. Notice further that the algorithm aims to traverse such a boundary from left to right, so that the successive index values or slopes in such a frontier are computed in nonincreasing order, i.e., it is a top–down index algorithm. Also, the algorithm is only well defined when the computed marginal productivity rates have nonzero denominators. The output consists of an ordered string $i_1, \ldots, i_n$ of the $n$ controllable states in $N^{\{0,1\}}$, along with corresponding index values $\nu^*_{i_k}$.

Table 1 Adaptive-greedy index algorithm $\mathrm{AG}_{\mathcal{F}}$

ALGORITHM $\mathrm{AG}_{\mathcal{F}}$:
Output: $\{i_k, \nu^*_{i_k}\}_{k=1}^{n}$
    $S_0 := \emptyset$
    for $k := 1$ to $n$ do
        pick $i_k \in \arg\max \{ \nu^{S_{k-1}}_i : i \in \partial^{\mathrm{out}}_{\mathcal{F}} S_{k-1} \}$
        $\nu^*_{i_k} := \nu^{S_{k-1}}_{i_k}$; $S_k := S_{k-1} \cup \{i_k\}$
    end { for }
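The scheme in Table 1 can be sketched in Python for the simplest choice of family, $\mathcal{F} = 2^{N}$ with all states controllable, in which case the outer boundary of $S_{k-1}$ is just its complement. This is a naive illustration that re-solves two linear systems per step; it is not the implementation discussed in Sect. 4.2, and the names `evaluate` and `adaptive_greedy` are assumptions. The data are those of the first example in Sect. 2.2, with states indexed from 0.

```python
import numpy as np


def evaluate(active, R, P, Q, beta):
    """Return the vectors (f^S, g^S) of the S-active policy."""
    n = R.shape[1]
    a = np.array([1 if i in active else 0 for i in range(n)])
    rows = np.arange(n)
    A = np.eye(n) - beta * P[a, rows, :]
    return np.linalg.solve(A, R[a, rows]), np.linalg.solve(A, Q[a, rows])


def adaptive_greedy(R, P, Q, beta):
    """Adaptive-greedy scheme of Table 1 with F = all subsets (assumption)."""
    n = R.shape[1]
    dP = P[1] - P[0]
    S, order, index = set(), [], []
    for _ in range(n):
        f, g = evaluate(S, R, P, Q, beta)
        w = Q[1] - Q[0] + beta * dP @ g        # marginal work w^S_i
        r = R[1] - R[0] + beta * dP @ f        # marginal reward r^S_i
        candidates = [i for i in range(n) if i not in S]   # outer boundary of S
        if any(abs(w[i]) < 1e-12 for i in candidates):
            raise ZeroDivisionError("marginal work vanishes; scheme not well defined")
        i_k = max(candidates, key=lambda i: r[i] / w[i])
        order.append(i_k)
        index.append(r[i_k] / w[i_k])          # nu*_{i_k} := nu^{S_{k-1}}_{i_k}
        S.add(i_k)
    return order, index


beta = 0.9
P = np.array([
    [[0.1810, 0.4801, 0.3389], [0.2676, 0.2646, 0.4678], [0.5304, 0.2843, 0.1853]],
    [[0.2841, 0.4827, 0.2332], [0.5131, 0.0212, 0.4657], [0.4612, 0.0081, 0.5307]],
])
R = np.array([[0.0, 0.0, 0.0], [0.9016, 0.10949, 0.01055]])
Q = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])

order, index = adaptive_greedy(R, P, Q, beta)
nonincreasing = all(x >= y for x, y in zip(index, index[1:]))
print([i + 1 for i in order], index, nonincreasing)
```

The flag `nonincreasing` only reports whether the computed values come out in nonincreasing order; the output can be read as the MPI only when the conditions reviewed next in Definition 2 hold.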

We next use such an algorithm to define a certain class of restless bandits. Note that the acronym “PCL” refers to the partial conservation laws introduced in Niño-Mora (2001).

Definition 2 (PCL($\mathcal{F}$)-indexability) We say that a bandit is PCL($\mathcal{F}$)-indexable if it satisfies the following conditions:

(i) Positive marginal work: $w^S_i > 0$ for $i \in N^{\{0,1\}}$, $S \in \mathcal{F}$.

(ii) Monotone nonincreasing index computation: the index values produced by algorithm $\mathrm{AG}_{\mathcal{F}}$ satisfy

\[
\nu^*_{i_1} \geq \nu^*_{i_2} \geq \cdots \geq \nu^*_{i_n}.
\]

Note that part (i) of Definition 2 ensures, along with Assumption 1, that the algorithm is well defined. The interest of the class of PCL($\mathcal{F}$)-indexable bandits is based on the following result, proven in Niño-Mora (2001, Cor. 2; 2002, Th. 6.3; 2006a, Th. 4.1) in increasingly general settings.

Theorem 1 A PCL($\mathcal{F}$)-indexable bandit is indexable and algorithm $\mathrm{AG}_{\mathcal{F}}$ gives its MPI.

From the point of view of the practical application of Theorem 1 to a particular model, one would first set out to establish analytically satisfaction of Definition 2(i). Note that it might well happen that $w^S_i < 0$ for some active sets not in $\mathcal{F}$. Yet, as stated, it suffices to prove positivity of marginal work measures for active sets drawn from family $\mathcal{F}$. Then, one would check for satisfaction of Definition 2(ii). This can be done either computationally, simply by running the algorithm and testing whether the index is computed in nonincreasing order, or analytically. We refer the reader, e.g., to Niño-Mora (2002, 2006a, 2006b) for examples of detailed analyses of specific models.

We next comment on an approach we have found useful to establish analytically condition (ii) in Definition 2, once part (i) has been proven. Thus, suppose we want to show that the $k$th and $(k+1)$th computed index values satisfy $\nu^*_{i_k} \geq \nu^*_{i_{k+1}}$ for any $1 \leq k < n$. Now, letting $S_{k-1}$ and $S_k = S_{k-1} \cup \{i_k\}$ be as in Table 1, we can use Niño-Mora (2002, Prop. 6.4(c)) to write

\[
r^{S_k}_i - r^{S_{k-1}}_i = \frac{r^{S_k}_{i_k}}{w^{S_k}_{i_k}} \left( w^{S_k}_i - w^{S_{k-1}}_i \right), \quad i \in N^{\{0,1\}},
\]

which is immediately reformulated using $\nu^*_{i_k} = \nu^{S_{k-1}}_{i_k}$ as

\[
\nu^{S_k}_i = \nu^*_{i_k} - \frac{w^{S_{k-1}}_i}{w^{S_k}_i} \left( \nu^*_{i_k} - \nu^{S_{k-1}}_i \right), \quad i \in N^{\{0,1\}}.
\]

From the latter identity along with condition (i) we obtain

\[
\nu^*_{i_k} \geq \nu^*_{i_{k+1}} \iff \nu^*_{i_k} \geq \nu^{S_{k-1}}_{i_{k+1}}, \tag{8}
\]

which shows that it suffices to prove that $\nu^*_{i_k} \geq \nu^{S_{k-1}}_{i_{k+1}}$.
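For readers who wish to see the intermediate algebra, the steps from the cited identity to (8) can be spelled out as follows. This is a sketch that takes the identity of Niño-Mora (2002, Prop. 6.4(c)) as stated above and assumes Definition 2(i), so that all marginal work measures involved are positive.

```latex
% Step 1: put i = i_k in the identity. Writing c = r^{S_k}_{i_k} / w^{S_k}_{i_k} = \nu^{S_k}_{i_k},
% the identity reads r^{S_k}_{i_k} - r^{S_{k-1}}_{i_k} = c (w^{S_k}_{i_k} - w^{S_{k-1}}_{i_k}),
% and since r^{S_k}_{i_k} = c\, w^{S_k}_{i_k} this leaves r^{S_{k-1}}_{i_k} = c\, w^{S_{k-1}}_{i_k}, i.e.
\[
\nu^{S_k}_{i_k} = \nu^{S_{k-1}}_{i_k} = \nu^*_{i_k}.
\]
% Step 2: for general i, substitute r^{S_{k-1}}_i = \nu^{S_{k-1}}_i w^{S_{k-1}}_i
% and divide by w^{S_k}_i > 0:
\begin{align*}
\nu^{S_k}_i
  &= \frac{\nu^{S_{k-1}}_i\, w^{S_{k-1}}_i + \nu^*_{i_k}\bigl(w^{S_k}_i - w^{S_{k-1}}_i\bigr)}{w^{S_k}_i}
   = \nu^*_{i_k} - \frac{w^{S_{k-1}}_i}{w^{S_k}_i}\bigl(\nu^*_{i_k} - \nu^{S_{k-1}}_i\bigr).
\end{align*}
% Step 3: take i = i_{k+1}, for which \nu^*_{i_{k+1}} = \nu^{S_k}_{i_{k+1}} by Table 1:
\[
\nu^*_{i_k} - \nu^*_{i_{k+1}}
  = \frac{w^{S_{k-1}}_{i_{k+1}}}{w^{S_k}_{i_{k+1}}}\bigl(\nu^*_{i_k} - \nu^{S_{k-1}}_{i_{k+1}}\bigr),
\]
% and the ratio of marginal works is positive, so both sides have the same sign, which is (8).
```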

To illustrate, consider the case of a classic bandit, where $N = N^{\{0,1\}}$. Taking $\mathcal{F} \triangleq 2^N$ it is easily shown (cf. Niño-Mora 2001, Cor. 4) that Definition 2(i) holds. To prove condition (ii) note that, by construction,

\[
\nu^*_{i_k} = \max \left\{ \nu^{S_{k-1}}_i : i \in N \setminus S_{k-1} \right\} \geq \nu^{S_{k-1}}_{i_{k+1}},
\]

and hence, by (8), we obtain $\nu^*_{i_k} \geq \nu^*_{i_{k+1}}$. Therefore, classic bandits are PCL($2^N$)-indexable, or, in the terminology used in Niño-Mora (2001), GCL-indexable, as in such a case they satisfy the generalized conservation laws (GCL) in Bertsimas and Niño-Mora (1996). It is further shown in Niño-Mora (2001, Cor. 5) that any restless bandit is GCL-indexable for small enough values of the discount factor.

The reader may wonder about the intuitive interpretation of Definition 2(i), besides that suggested by definition of the $w^S_i$’s. Further insights into such an issue are given in Niño-Mora (2002, Prop. 6.2), where it is shown that the condition can be reformulated in terms of work measures as follows. For $S \in \mathcal{F}$,

\[
\begin{aligned}
g^S_i &< g^{S \cup \{i\}}_i, \quad i \in N^{\{0,1\}} \setminus S, \\
g^S_i &> g^{S \setminus \{i\}}_i, \quad i \in S.
\end{aligned} \tag{9}
\]

Note that (9) represents a regularity property of work measures for active sets in $\mathcal{F}$ whereby, starting at a state $i$: augmenting an active set in $\mathcal{F}$ by adding $i$ leads to an increase in work expended, whereas shrinking an active set by removing $i$ leads to a decrease in work expended.

Another insightful representation of the index is given in Niño-Mora (2002, Th. 6.4) and Niño-Mora (2006a, Th. 4.2). Letting $\mathcal{F}_0 = \{S_0, S_1, \ldots, S_n\}$ be the active-set family produced by the algorithm, it holds that, for $1 \leq k \leq n$,

\[
\max_{j \in N^{\{0,1\}} \setminus S_{k-1}} \nu^{S_{k-1}}_j = \nu^{S_{k-1}}_{i_k} = \nu^*_{i_k} = \nu^{S_k}_{i_k} = \min_{j \in S_k} \nu^{S_k}_j, \tag{10}
\]

which is equivalent to the more intuitive reformulation

\[
\max_{j \in N^{\{0,1\}} \setminus S_{k-1}} \frac{f^{S_{k-1} \cup \{j\}} - f^{S_{k-1}}}{g^{S_{k-1} \cup \{j\}} - g^{S_{k-1}}}
= \frac{f^{S_k} - f^{S_{k-1}}}{g^{S_k} - g^{S_{k-1}}}
= \nu^*_{i_k}
= \min_{j \in S_k} \frac{f^{S_k} - f^{S_k \setminus \{j\}}}{g^{S_k} - g^{S_k \setminus \{j\}}}. \tag{11}
\]

Such relations characterize the index as a locally optimal marginal productivity rate. Note that (11) has a clear geometric interpretation in the setting of the achievable work-reward region approach in Sect. 2.2.

Further, we have found that, in some models, marginal work measures satisfy the following monotonicity condition:

\[
\text{if } S, S' \in \mathcal{F} \text{ and } i \in S \subset S' \text{ then } w^S_i \leq w^{S'}_i. \tag{12}
\]

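Under (12), it is shown in Niño-Mora (2002, Th. 4.7) that the index has the alternative representation

\[
\nu^*_i = \max_{S \in \mathcal{F}_0 : i \in S} \nu^S_i = \max_{S \in \mathcal{F}_0 : i \in S} \frac{f^S_i - f^{S \setminus \{i\}}_i}{g^S_i - g^{S \setminus \{i\}}_i}, \tag{13}
\]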

which is closely related to the Gittins index representation given in Gittins (1979) as an optimal average reward rate relative to stopping times. Thus, in the special case of a classic bandit with zero passive rewards, (13) reduces to

\[
\nu^*_i = \max_{S \in \mathcal{F}_0 : i \in S} \frac{f^S_i}{g^S_i},
\]

whereas the Gittins index representation referred to above appears, when stationary deterministic stopping times are formulated in terms of their active sets, as

\[
\nu^*_i = \max_{S \in 2^N : i \in S} \frac{f^S_i}{g^S_i}.
\]
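The last representation lends itself to a direct, if exponential, computation. The following sketch enumerates all continuation sets $S$ containing $i$ for a classic bandit, modeled here under the assumptions that the passive action freezes the state, earns nothing and consumes no work, and that each active period consumes one unit of work. The function name `gittins_by_enumeration` is hypothetical, and reusing $R^1$ and $P^1$ from the first example of Sect. 2.2 as a classic bandit is an illustrative choice, not something done in the paper.

```python
from itertools import combinations

import numpy as np


def gittins_by_enumeration(R1, P1, beta):
    """Index of each state as the max over S containing i of f^S_i / g^S_i."""
    n = len(R1)
    best = np.full(n, -np.inf)
    for size in range(1, n + 1):
        for S in combinations(range(n), size):
            active = np.zeros(n, dtype=bool)
            active[list(S)] = True
            # Active rows follow P1; passive rows are frozen (identity rows).
            P_S = np.where(active[:, None], P1, np.eye(n))
            A = np.eye(n) - beta * P_S
            f = np.linalg.solve(A, np.where(active, R1, 0.0))  # discounted reward
            g = np.linalg.solve(A, active.astype(float))       # discounted work
            for i in S:
                best[i] = max(best[i], f[i] / g[i])            # g[i] >= 1 for i in S
    return best


beta = 0.9
R1 = np.array([0.9016, 0.10949, 0.01055])
P1 = np.array([[0.2841, 0.4827, 0.2332],
               [0.5131, 0.0212, 0.4657],
               [0.4612, 0.0081, 0.5307]])

gittins = gittins_by_enumeration(R1, P1, beta)
print(gittins)
```

Two sanity checks follow from the ratio form: the state with the largest active reward has index equal to that reward, and every state's index is at least its own active reward, since $S = \{i\}$ yields the ratio $R^1_i$.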

Such a result is extended in Niño-Mora (2006a, Th. 4.3), where it is shown that, if marginal work measures are wedge-shaped, meaning that they satisfy (12) and the condition

\[
\text{if } S, S' \in \mathcal{F},\ S \subset S' \text{ and } i \notin S' \text{ then } w^S_i \geq w^{S'}_i, \tag{14}
\]

then the index also has the representation

\[
\nu^*_i = \min_{S \in \mathcal{F}_0 : i \notin S} \nu^S_i = \min_{S \in \mathcal{F}_0 : i \notin S} \frac{f^{S \cup \{i\}}_i - f^S_i}{g^{S \cup \{i\}}_i - g^S_i}. \tag{15}
\]

It is worth outlining the evolution of ideas that led to our introduction of PCLs and algorithm $\mathrm{AG}_{\mathcal{F}}$. The early roots are in the pioneering work of Klimov (1974), who introduced an adaptive-greedy algorithm based on LP duality to compute the average-optimal index policy for scheduling a multiclass M/G/1 queue with Bernoulli feedback. Varaiya et al. (1985) extended the scope of the adaptive-greedy algorithm to compute the Gittins index. Then, drawing on earlier research on work conservation laws in multiclass queueing systems by Kleinrock (1965), Coffman and Mitrani (1980), Federgruen and Groenevelt (1988), Shanthikumar and Yao (1992) and Tsoucas (1991), Bertsimas and Niño-Mora (1996) developed a general polyhedral framework for the study of indexation encompassing such previous work, based on the unifying concept of GCLs mentioned above. In short, a generic scheduling problem obeying such laws is solved optimally by an index policy, where the index is computed by an adaptive-greedy algorithm. In essence, the latter corresponds to the case of $\mathrm{AG}_{\mathcal{F}}$ where $\mathcal{F}$ consists of all subsets of the ground set. Yet, unlike classic bandits, restless bandits do not necessarily satisfy GCLs. This prompted the author to develop a framework that applies to the latter, based on a priori identification of a suitable active-set family $\mathcal{F}$. The resultant framework, termed PCL($\mathcal{F}$)-indexability after its grounding on satisfaction of partial conservation laws, was introduced in Niño-Mora (2001) along with algorithm $\mathrm{AG}_{\mathcal{F}}$ and Theorem 1. To be precise, that paper gave the first version of algorithm $\mathrm{AG}_{\mathcal{F}}$, yet in a different formulation than that in Table 1. The present, equivalent formulation was given in Niño-Mora (2002), in an extended setting that grounds the approach on polyhedral LP methods.

We must remark that, as shown in Table 1, $\mathrm{AG}_{\mathcal{F}}$ is not really an algorithm but an algorithmic scheme, as no implementation details are given. For a discussion of actual implementations see Sect. 4.2, which reviews results from Niño-Mora (2007e). The latter paper reveals and exploits the deep connection between the adaptive-greedy index algorithm and the classic parametric-objective simplex method of Gass and Saaty (1955).

2.4 Extensions

2.4.1 Indexation in a general framework and the law of diminishing marginal returns

The indexation theory for restless bandits is developed in a general framework in terms of generic work and reward (or cost) measures in Niño-Mora (2006a, Sects. 3–4), which highlights fundamental properties while hiding ancillary model-specific details. Such an approach reveals the fundamental connection between the indexability property and the classic economic law of diminishing marginal returns, or diminishing marginal productivity, whereby, as usage of a resource increases, its marginal productivity diminishes. Thus, indexable projects are precisely those that obey such a law as it applies to the work expended on them, in such a way that there is a well-defined marginal value of work at every state. See Theorem 3.1 in that paper.

2.4.2 Semi-Markov bandits and bandits with a countable state space

Niño-Mora (2006a) also shows how to adapt the theory to semi-Markov restless bandits, which is relatively straightforward as the latter are readily reformulated into a discrete-stage setting by standard methods. Further, that paper extends the indexation theory to bandits with a countable state space, yet restricting attention to the case where the state space is linearly ordered. The analysis of the general countable-state case raises considerable technical difficulties, as, e.g., it is not clear how to ensure that the state sequence produced by the adaptive-greedy algorithm traverses the entire controllable state space.

2.4.3 Average, discounted, bias and mixed criteria

The theory of restless bandit indexation was introduced by Whittle (1988) under the average criterion, and then extended to the discounted criterion in Niño-Mora (2001). Such criteria fit as special cases in the general indexation theory discussed in Niño-Mora (2006a, Sects. 3–4). We have found that the resultant flexibility furnishes an expanded scope that is relevant in applications.

Consider, e.g., the classic problem of scheduling a multiclass M/G/1 queue to minimize average linear holding costs, which is well known to be solved by the cμ index rule (cf. Cox and Smith 1961). The problem is immediately formulated as a restless bandit problem by viewing each class’ queue as a restless bandit project. Yet, as noticed in Whittle (1996, Chap. 14.7), and also in Veatch and Wein (1996) in a related multiclass make-to-stock model, such bandits are not indexable under the average criterion. As the latter authors put it:

In contrast, the backorder problem is not indexable. ν(x) does not exist (i.e., equals −∞) for all x. The difficulty is that ν is a Lagrange multiplier for the constraint on the time-average number of active arms. For the backorder problem, any stable policy must serve a time-average of ρ classes, so relaxing this constraint does not change the optimal value, and the Lagrange multiplier does not exist. In fact, no scheduling problem with a fixed utilization will be indexable.

We first showed in Niño-Mora (2003), in the setting of a multiclass make-to-order/make-to-stock queue with convex stock and backlog holding cost rates, that such a difficulty is resolved by defining indexability relative to a mixed-criteria version of the ν-wage problem (2), where the reward measure $f_i^\pi$ is evaluated by the average criterion while the work measure $g_i^\pi$ is evaluated by the bias criterion of Blackwell (1962). For a bandit representing an M/G/1 queue with utilization factor ρ < 1 subject to service control, the latter is defined by

$$g_i^\pi \triangleq E_i^\pi\left[\int_0^\infty \{a(t) - \rho\}\, dt\right], \tag{16}$$

and thus gives the expected total cumulative excess work expended over the nominal allocation.

These ideas are developed in Niño-Mora (2006a, Sect. 5.3). Such an approach reveals a new ground of practical application of bias optimality and mixed criteria in MDPs, which previously had been mostly considered topics of theoretical interest. See Lewis and Puterman (2002), and Feinberg and Shwartz (2002).

The scope of indexability is further extended to a pure bias criterion in Niño-Mora (2006b, Sects. 2.3 and 4.2), motivated by the analysis of a model involving the dynamic scheduling of a multiclass queue with finite buffers, which is nonindexable under the average criterion. Of course, the indices defined under the mixed average-bias and the pure bias criteria are related to their discounted counterparts through a vanishing-discount approach.

3 Applications

3.1 Control of admission and routing to parallel queues

In Niño-Mora (2002), the scope of indexation is extended by allowing work-consumption rates to be state-dependent. Besides introducing new LP-based polyhedral methods for the analysis and computation of the resultant indices, the approach is illustrated by developing new index policies for admission control and routing to parallel queues. The individual projects of concern are Markovian birth-death queues subject to control of admission, which have finite buffer space, state-dependent arrival and service rates, and which incur nonlinear holding costs as well as rejection costs. While such problems had been the subject of extensive research typically based on DP analyses, as surveyed in Stidham (1985), the indexation approach yields new insights, along with a fruitful connection with routing problems.

The single-queue admission control model addressed in that paper is an extension of that in Chen and Yao (1990). It incorporates arrival rates $\lambda_i$, service rates $\mu_i$, and holding cost rates $h_i$ that depend on the queue’s current state $i$, taken as the number of customers in system. We show that, if the difference $\mu_i - \lambda_i$ is concave nondecreasing and $h_i$ is convex nondecreasing in $i$, then the model is indexable both under the discounted and the average criterion. Yet, such an indexability result holds relative to an unconventional work measure $g_i^\pi$ introduced in Niño-Mora (2002), which gives the expected total discounted number or the expected long-run average rate per unit time, as the case may be, of arriving customers that are rejected due either to their finding a full buffer upon arrival, or to their finding a closed entry gate (imagine that admission control is implemented through a gatekeeper that opens or shuts an entry gate to the system). Notice that, when the buffer is full, opening or closing the entry gate, i.e., taking the passive or the active action, has identical consequences in terms of work expended and state dynamics. The corresponding state is hence uncontrollable (cf. (1)).

Under the stated conditions, the model is shown to be PCL-indexable, and hence indexable, with an index $\nu_i^*$ that is nondecreasing in $i$. Such an index characterizes the optimal policies to the single-queue admission control problem that incorporates a cost ν per customer rejected: it is optimal to reject a customer who arrives in state $i < n$, where $n$ is the buffer size, iff $\nu_i^* \ge \nu$. Notice that optimal policies are hence of threshold type: reject arrivals if the number of customers in system $i$ is large enough. While the conventional approach to such problems focused on explicit determination of such thresholds, the indexation approach determines them implicitly.

To compute the index, an efficient upwards recursion is given that generates index values starting at $\nu_0^*$. This shows that the index does not depend on the buffer size, and hence extends to a corresponding model with infinite buffer space.

Note further that the index is a state-dependent measure of the value of rejecting an arrival, which suggests a natural index policy in the corresponding multi-project model, involving the admission control and routing to multiple queues in parallel. Thus, consider now a model where customers arrive as a Poisson stream with rate λ, and each rejected customer incurs a cost ν. The controller decides whether or not to reject each customer upon arrival. If admitted, the customer is to be routed irrevocably to one of $K$ queues in parallel, among those that are not full. Each such queue is modeled as in the single-queue model discussed above, yet now assuming a constant arrival rate equal to λ. Hence, if all the individual queues’ buffers are full, an arriving customer is necessarily rejected. If we now denote by $\nu_k^*(i_k)$ the index for queue $k$ in state $i_k$, the resultant admission control and routing policy is as follows. Upon an arrival that finds each queue $k$ in state $i_k$: reject the customer either if all buffers are full or if $\nu_k^*(i_k) \ge \nu$ for each nonfull queue $k$; otherwise, route the customer to a nonfull queue of minimum index $\nu_k^*(i_k)$.

One may also use the index policy in the problem version where only the routing control capability is enabled, and in a model where queues have infinite buffer space. The proposed index policies reduce to the shortest (nonfull) queue routing policy in the symmetric cases in which this is optimal. See Winston (1977), Johri (1989), and Hordijk and Koole (1990). The index policy also recovers the optimal policy in the nonsymmetric routing model solved in Derman et al. (1980), where each queue has a single buffer space.
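The admission control and routing rule just described can be sketched in a few lines. This is only a minimal illustration: the function name `mpi_admit_and_route` and its arguments are hypothetical, and the per-queue index functions are supplied by the caller rather than computed here.

```python
from typing import Callable, Optional, Sequence


def mpi_admit_and_route(
    states: Sequence[int],
    buffer_sizes: Sequence[int],
    index_fns: Sequence[Callable[[int], float]],
    rejection_cost: float,
) -> Optional[int]:
    """Return the queue an arrival is routed to, or None if it is rejected.

    states[k] is the number of customers in queue k, buffer_sizes[k] its
    buffer size, and index_fns[k](i) the index of queue k in state i.
    """
    # Only queues with spare buffer space are eligible.
    nonfull = [k for k, (i, n) in enumerate(zip(states, buffer_sizes)) if i < n]
    if not nonfull:
        return None  # all buffers full: the arrival is necessarily rejected
    # A nonfull queue of minimum index (ties go to the lowest queue label).
    best = min(nonfull, key=lambda k: index_fns[k](states[k]))
    # Reject if every nonfull queue has index >= rejection cost,
    # which is equivalent to the minimum index being >= rejection cost.
    if index_fns[best](states[best]) >= rejection_cost:
        return None
    return best
```

Setting `rejection_cost` to `float("inf")` gives the routing-only version of the policy, in which an arrival is rejected only when every buffer is full.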

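In the simplest case of an M/M/1 queue with parameters λ and μ under linear holding costs $h_i \triangleq c i$, the index is shown in Niño-Mora (2002, Sect. 7) to have the evaluation

$$\nu_j^* = \frac{c}{\mu} \sum_{i=1}^{j+1} \left(1 + \cdots + \rho^{i-1}\right) =
\begin{cases}
\dfrac{c}{\mu} \left[\dfrac{\rho^{j+2} - 1}{(\rho - 1)^2} - \dfrac{j + 2}{\rho - 1}\right] & \text{if } \rho \ne 1, \\[2ex]
\dfrac{c}{\mu}\, \dfrac{(j + 1)(j + 2)}{2} & \text{if } \rho = 1,
\end{cases} \tag{17}$$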

where $\rho \triangleq \lambda/\mu$ is the utilization factor. Note that the latter will typically be greater than one for each queue in a routing model where there are multiple queues in parallel. A closed formula is also given in that paper for the quadratic holding cost case $h_i \triangleq c i^2$. Further, in Niño-Mora (2002, Remark 7.3) it is noted that, under the average criterion, such a model is PCL-indexable provided only that the holding cost rate is nondecreasing, i.e., the convexity assumption can then be dropped.
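As a quick numerical sanity check on (17), the following sketch evaluates the index both from the defining double sum and from the closed form. The function names are hypothetical and chosen only for illustration.

```python
import math


def mpi_from_sum(j: int, c: float, lam: float, mu: float) -> float:
    """Index of state j from the sum in (17): (c/mu) * sum_{i=1}^{j+1} (1 + ... + rho^(i-1))."""
    rho = lam / mu
    return (c / mu) * sum(sum(rho**m for m in range(i)) for i in range(1, j + 2))


def mpi_closed_form(j: int, c: float, lam: float, mu: float) -> float:
    """Index of state j from the closed-form expression in (17)."""
    rho = lam / mu
    if math.isclose(rho, 1.0):
        return (c / mu) * (j + 1) * (j + 2) / 2
    return (c / mu) * ((rho ** (j + 2) - 1) / (rho - 1) ** 2 - (j + 2) / (rho - 1))
```

Both evaluations agree for utilization factors below, at, and above one, and the resulting index increases with the state j, in line with the threshold structure of optimal admission policies.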

In Niño-Mora (2007b) we have investigated the applicability of such index policies to a model involving the dynamic control of admission and routing to parallel multi-server loss queues with reneging. The policies derived from restless bandit indexation are new. More importantly, they appear to be useful, as the preliminary experiments reported in Niño-Mora (2007b) reveal both a near-optimal performance and substantial gains against conventional benchmark policies on the instances investigated.

We remark that a similar model had been addressed in Kallmes and Cassandras (1995) via a static optimization approach (cf. Combé and Boxma 1994 and Boxma 1995), which is hindered by the lack of closed formulae for the functions to be optimized. Another approach that has attracted substantial research attention is based on improving upon the optimal static allocation by carrying out one step of the policy iteration algorithm. Krishnan (1988, 1990) deploys such an approach to problems of optimal routing to parallel queues, obtaining dynamic index policies.


As part of a paper currently under preparation, we are conducting extensive computational experiments comparing the MPI policies both against optimal and alternative benchmark policies. We advance next an illustrative result. Consider the classic problem of routing a Poisson stream of customers arriving with rate λ to two M/M/1 queues in parallel, with queue $k$’s server having service rate $\mu_k$ for $k = 1, 2$. The aim is to find a routing policy that minimizes the long-run average customer sojourn time. To investigate computationally the performance of alternative policies for such a system, we use a modified system where each queue has a finite buffer capable of holding up to 30 jobs, waiting or in service. Also, we fix the arrival rate to the value λ = 1, and let service rates $(\mu_1, \mu_2)$ vary over the range 1/2 < ρ < 1, where $\rho = \lambda/(\mu_1 + \mu_2)$. Note that ρ ≈ 1 corresponds to the heavy-traffic regime. The parameters $\mu_k$ are varied over a finite grid of width 1/100.

We thus obtain that, over such a region, the relative suboptimality gap of the MPI routing policy reaches a maximum value of about 1.55%, being substantially smaller in most of the region. Also, the MPI policy always outperforms the classic join-the-shortest-queue routing rule, achieving relative performance gains of over 84% when both service rates are very different in magnitude. As for the individually optimal routing rule, we find that, though the MPI policy can be worse than it over a small parameter region, it can lose no more than 0.37% in relative performance. Yet, over most of the parameter region the MPI policy performs better, attaining relative performance gains of up to over 15%.

We have also compared the performance of the MPI routing policy against the policy obtained from the best static policy, i.e., optimal Bernoulli splitting (OBS), via one-step policy improvement (OSI), as proposed in Krishnan (1990). The results are displayed in Fig. 3. We emphasize that in such results the OBS policy’s objective was exactly computed for an infinite-buffer system using closed formulae as in Krishnan (1990), while the OSI and the MPI policies’ objectives were numerically computed for the finite-buffer approximation mentioned above. The figure shows a density plot of the relative performance gain of the MPI over the OSI policy, within the parameter region of concern delimited by the displayed thick lines corresponding to ρ = 1/2 and ρ = 1. We find that in some instances the MPI policy’s performance can be worse than the OSI’s, yet in such cases it loses no more than 0.68%. Over most of the parameter region the MPI policy is better, achieving maximum performance gains of over 7.6%.

Fig. 3 Relative performance gain of the MPI over the OSI policy

3.2 Scheduling a multiclass make-to-order/make-to-stock queue

A fundamental problem arising in manufacturing systems is that of dynamic scheduling of production in a multiproduct system, which is conveniently modeled through a multiclass make-to-stock queue that incorporates stock and backorder holding costs.

Wein (1992) addresses the problem of designing a dynamic scheduling policy to minimize average cost, by carrying out a heavy-traffic analysis based on solving an approximating Brownian control problem. The proposed policy emerging from such an analysis has the following form. The server or machine is idle as long as the weighted inventory process defined in that paper is above a certain threshold level, and no classes are in “danger of being backordered,” as explained there. Otherwise, the machine is allocated to some class, according to the following static index rule. If there is some class in danger of being backordered, the machine is assigned to a class of largest index $b_k \mu_k$, where $b_k$ and $\mu_k$ are the backorder cost rate and the service rate for class $k$, respectively. If such is not the case, the machine is assigned to a class of smallest index $h_k \mu_k$, where $h_k$ is class $k$’s stock holding cost rate.

Veatch and Wein (1996) investigate several policies for such a model, based on identifying a hedging point of base stock levels, which determines the idleness region in state space, along with an index policy that determines the machine allocation. Though they consider use of the restless bandit index policy in the lost-sales case where no backorders are allowed, they discard such an approach in the case with backorders, as the index does not exist under the average criterion (cf. Sect. 2.4.3 above). They thus propose a policy based on a heavy-traffic diffusion approximation.

Peña-Pérez and Zipkin (1997) propose several dynamic index policies based on intuitive limited look-ahead arguments. They report experimental results showing that the policy they term myopic($T$), which uses as the look-ahead time determining the index of a class the sojourn time of a job, appears both to be nearly optimal and to outperform alternative index policies. De Véricourt et al. (2000) furnish theoretical support for use of such an index, showing that it partially characterizes optimal policies in a Markovian model.

We must also mention the work in Dusonchet and Hongler (2003a), which applies restless bandit indexation to address a Brownian model of a multiclass make-to-stock system.

Niño-Mora (2006a) develops results announced in Niño-Mora (2003) on indexation analysis for M/G/1 make-to-order/make-to-stock queues, aimed at deriving index policies for a multiclass M/G/1 queueing model where some classes must be processed in make-to-order mode, whereas others may be processed in make-to-stock mode. The model incorporates state-dependent stock holding and backorder cost rates for each class. For earlier work and applications of such combined system models see, e.g., Adan and van der Wal (1998), Soman et al. (2004), and the references therein. Assuming convexity of backorder and holding cost rates, it is shown in Niño-Mora (2006a, Sect. 6) that the restless bandits of concern, which are single make-to-order/make-to-stock queues subject to service control, are PCL-indexable and hence indexable, both under the discounted criterion and under the mixed average-bias criterion reviewed in Sect. 2.4.3.

Taking as the state of a make-to-order/make-to-stock M/G/1 queue the net backorder level $X(t) = X^+(t) - X^-(t)$, where $X^+(t)$ is the number of backordered items and $X^-(t)$ is the number of finished items in stock at each time $t$, we may conveniently represent the backorder and stock holding cost rate by a single state-dependent cost rate $h_i$. The average-bias MPI for such a system is then given by the simple expression

$$\nu_i^* = \mu E\left[\Delta h_{X+i}\right], \tag{18}$$

where $\Delta h_i \triangleq h_i - h_{i-1}$, μ is the service rate and $X$ is a random variable having the steady-state distribution of the number-in-system for a corresponding make-to-order M/G/1 queue.

Consider the special case where backorder and stock holding cost rates are linear, so that

$$h_i = \begin{cases} b i & \text{if } i \ge 1, \\ -h i & \text{otherwise,} \end{cases}$$

where $b$ (resp. $h$) is the backorder (resp. stock) holding cost rate per item per unit time. Then, we show in (42) of Niño-Mora (2006a) that the MPI has the simple evaluation

$$\nu_i^* = \begin{cases} b\mu & \text{if } i \ge 1, \\ \left[(b + h) P\{X \ge -i + 1\} - h\right] \mu & \text{otherwise.} \end{cases} \tag{19}$$
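To see how (19) follows from (18), the sketch below evaluates the expectation in (18) directly for linear costs and compares it with the closed form. It assumes, purely for illustration, the M/M/1 special case, where the steady-state number in system is geometric, so that P{X = x} = (1 − ρ)ρ^x and P{X ≥ m} = ρ^m; the function names are hypothetical.

```python
def delta_h(i: int, b: float, h: float) -> float:
    """First difference h_i - h_{i-1} of the linear cost h_i = b*i (i >= 1), -h*i (i <= 0)."""
    return b if i >= 1 else -h


def mts_mpi_expectation(i: int, b: float, h: float, lam: float, mu: float,
                        n_terms: int = 2000) -> float:
    """Evaluate mu * E[delta_h(X + i)] as in (18), truncating the geometric series."""
    rho = lam / mu  # requires rho < 1 for a steady state to exist
    return mu * sum((1 - rho) * rho**x * delta_h(x + i, b, h) for x in range(n_terms))


def mts_mpi_closed_form(i: int, b: float, h: float, lam: float, mu: float) -> float:
    """Closed form (19), with P{X >= 1 - i} = rho^(1 - i) in the M/M/1 case."""
    rho = lam / mu
    if i >= 1:
        return b * mu
    return ((b + h) * rho ** (1 - i) - h) * mu
```

For states with stock on hand (i ≤ 0) the index interpolates between −hμ and bμ according to the probability that the current stock would be exhausted, which is the intuition behind the look-ahead interpretation discussed next.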

Now, notice that the Peña-Pérez and Zipkin myopic($T$) index is formulated in the present notation by

$$\nu_i^{\mathrm{PPZ}} = \begin{cases} b\mu & \text{if } i \ge 1, \\ \left[(b + h) P\{D(T) \ge -i + 1\} - h\right] \mu & \text{otherwise,} \end{cases}$$

where $D(T)$ has the steady-state distribution of the number of arrivals during a customer’s sojourn time in a corresponding make-to-order M/G/1 queue under FIFO.

It turns out that the indices $\nu_i^{\mathrm{PPZ}}$ and $\nu_i^*$ above are identical. The reason is that $X$ and $D(T)$ have the same distribution, as argued in Kleinrock (1975, Chap. 5.7). Notice that such a result is an instance of the distributional form of Little’s law. See Haji and Newell (1971) and Keilson and Servi (1988). Thus, the MPI recovers the Peña-Pérez and Zipkin myopic($T$) index in the case of linear costs, and further extends it to models with convex nonlinear cost rates and also to the discounted criterion.

Denoting by $\nu_i^{*,\alpha}$ the discounted MPI in the M/M/1 case with discount rate α > 0, it is further of interest to evaluate the following limiting myopic index:

$$\nu_i^{\mathrm{myopic}} \triangleq \lim_{\alpha \nearrow \infty} \alpha\, \nu_i^{*,\alpha} = \mu\, \Delta h_i = \begin{cases} b\mu & \text{if } i \ge 1, \\ -h\mu & \text{otherwise.} \end{cases}$$

As pointed out at the end of Niño-Mora (2006a), such a myopic index is precisely the index obtained by Wein (1992) through a heavy-traffic diffusion analysis.


3.3 Scheduling a multiclass queue with finite buffers

A remarkable paradox in queueing theory is that, while in all queueing systems arising in the real world the storage or buffer space for holding customers is limited, most queueing models assume the latter to be unbounded. Such a state of affairs is probably related to the fact that infinite-buffer models are typically more tractable than their finite-buffer counterparts, as the latter exhibit complex boundary effects that complicate the analyses. The research on models for scheduling multiclass queues is no exception. Thus, e.g., the optimality of the classic cμ and the Klimov index rules is established assuming unlimited storage capacity.

In contrast, limited attention has been devoted to problems involving the dynamic scheduling of multiclass queues where each class has its own finite dedicated buffer space, despite their relevance in manufacturing (scheduling of parts for processing at a flexible machine) and computer-communication (scheduling of packets for access to a transmission channel) systems. Finite buffers introduce the possibility of blocking, which occurs when customers arrive to find a full buffer and are hence rejected and lost. In such systems, optimal policies have only been identified under strong symmetry conditions. Thus, Sparaggis et al. (1993) addressed the dynamic scheduling problem with the goal of minimizing the expected blocking cost, in a Markovian model where all classes have the same arrival, service and blocking cost rates, whereas buffer sizes may differ. They exploited a duality relationship between routing and scheduling problems with finite buffers to infer, from the known optimality of the shortest nonfull queue routing rule established in Hordijk and Koole (1990), the optimality of its dual scheduling rule, which prescribes to dynamically allocate the server at each time to a nonempty class with the smallest residual capacity (SRC). Such a result was extended to the case of general service-time distributions in Wasserman and Bambos (1996).

Kim and Van Oyen (1998) addressed a nonsymmetric Markovian model having two classes $k$ with holding and rejection cost rates $c_k$ and $r_k$, respectively. They identified a condition under which discount-optimal policies are characterized by a monotonic switching curve so that, if it is optimal to serve class 1 in state $(i_1, i_2)$, then it is also optimal to serve it in state $(i_1 + 1, i_2)$, and correspondingly for class 2, where the state is the number of customers in each class. The required condition is that, for each class $k$,

$$\alpha r_k \ge c_k, \tag{20}$$

where α > 0 is the discount rate. Notice that (20) means that the cost $r_k$ of rejecting a customer is greater than or equal to the cost $c_k/\alpha$ of holding it forever in the system. They further show by example that, if such a condition is violated, optimal policies can exhibit complex patterns, pointing out the seemingly counterintuitive phenomenon that, in certain situations, there may be an incentive to award higher priority to a shorter queue.

Niño-Mora (2006b) deploys restless bandit indexation theory to a Markovian multiclass queue with dedicated finite buffers, finding that condition (20) plays a key role in the analyses and results. It turns out that the state ordering induced by the MPI in a traffic class depends critically on whether or not such a condition holds. If it does, let us say that the class is loss-sensitive. Otherwise, we say that it is delay-sensitive, in each case relative to the prevailing discount rate α. Such concepts are particularly relevant in the setting of contemporary research efforts to provide differentiated service to heterogeneous traffic streams in packet-switched computer-communication networks, where some traffic classes are more sensitive to losses (e.g., file transfer), while others are more sensitive to delays (e.g., interactive and multimedia applications).
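The classification induced by condition (20) is simple enough to state as code. The sketch below is illustrative only; the function name and the string labels are hypothetical.

```python
def sensitivity(alpha: float, r: float, c: float) -> str:
    """Classify a traffic class relative to discount rate alpha via condition (20).

    r is the rejection cost per customer and c the holding cost rate.
    """
    if alpha <= 0:
        raise ValueError("the discount rate alpha must be positive")
    # (20): rejecting a customer costs at least as much as holding it forever (c / alpha).
    return "loss-sensitive" if alpha * r >= c else "delay-sensitive"
```

Since the test compares αr with c, a class with c > 0 switches from loss-sensitive to delay-sensitive once α drops below c/r, so the same class can fall on either side depending on the discount rate in force.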

That paper shows that a loss-sensitive class is PCL-indexable, relative to the nested active-set family (cf. Sect. 2.2) consistent with the index ordering

$$\nu_n^{\alpha,*} \le \nu_{n-1}^{\alpha,*} \le \cdots \le \nu_1^{\alpha,*}, \tag{21}$$

where the state $i$ in $\nu_i^{\alpha,*}$ is now taken to be the number of empty buffer spaces, as it does not depend on the buffer size $n$, and the notation makes explicit the index’ dependence on the discount rate. Notice that the ordering in (21) agrees with the intuition that, other things being equal, queues with fewer empty buffer spaces should be awarded higher service priority.

In the pure loss-sensitive case $r > 0 = c$ (where we have dropped the class label $k$ from the notation), the natural index candidate for scheduling under the average criterion is obtained as the limiting index obtained from $\nu_i^{\alpha,*}$ as α vanishes. The latter turns out to be constant, being equal to $r\mu$, which is noninformative in the case where all rejection and service rates are the same. To break ties, we consider the Maclaurin series expansion $\nu_i^{\alpha,*} = r\mu - \alpha \gamma_i^* + o(\alpha)$ as $\alpha \searrow 0$, which yields the second-order MPI

$$\gamma_i^* = \begin{cases}
\dfrac{r}{\rho} \left\{ i + 1 + \dfrac{1/\rho^i - (1 - \rho) i - 1}{(1 - \rho)^2} \right\} & \text{if } \rho \ne 1, \\[2ex]
r\, \dfrac{(i + 1)(i + 2)}{2} & \text{if } \rho = 1,
\end{cases} \tag{22}$$

where $\rho \triangleq \lambda/\mu$. Thus, among classes with the same first-order MPI $r\mu$, the resultant index rule gives higher service priority to classes with smaller values of the second-order MPI $\gamma_i^*$. We must remark that, in the classic Bayesian multiarmed bandit problem with Bernoulli arms, a tie-breaking second-order index was shown in Kelly (1981) to give an optimal policy for values of the discrete-time discount factor near one.
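The two branches of (22) can be cross-checked numerically. The sketch below reads the ρ ≠ 1 branch as (r/ρ){i + 1 + [ρ^(−i) − (1 − ρ)i − 1]/(1 − ρ)²}, which is the reading under which it tends to the ρ = 1 branch as ρ → 1; the function name is hypothetical.

```python
import math


def second_order_mpi(i: int, r: float, rho: float) -> float:
    """Second-order MPI of (22) for state i (number of empty buffer spaces)."""
    if math.isclose(rho, 1.0):
        return r * (i + 1) * (i + 2) / 2
    correction = (rho ** (-i) - (1 - rho) * i - 1) / (1 - rho) ** 2
    return (r / rho) * (i + 1 + correction)
```

Under this reading the index increases with i, so among classes tied on the first-order index rμ the rule favors those with fewer empty buffer spaces, consistent with the ordering (21).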

The paper further shows that a delay-sensitive class is PCL-indexable, relative to the nested active-set family consistent with the ordering (21), which is however interpreted by taking now the state $i$ in $\nu_i^{\alpha,*}$ to represent the number of customers in the system. Hence, other things being equal, such an ordering prescribes to award higher priority to shorter nonempty queues. While such a result might at first sight appear counterintuitive, it agrees with the observation of Kim and Van Oyen referred to above, and also with experimental results on the structure of optimal policies for particular instances. One possible interpretation is that, since holding costs are bounded, when such costs are dominant it is more productive to prevent congestion than to react to it, while the opposite holds otherwise.

Note further that a class with positive holding cost rate $c > 0$ is always delay-sensitive for small enough values of the discount rate α, which leads us to consider the limiting index obtained as the latter vanishes. This has the evaluation

$$\nu_i^* = \begin{cases}
\dfrac{c}{\rho} \left\{ n - \dfrac{\rho}{1 - \rho} + i\, \dfrac{\rho^i}{1 - \rho^i} \right\} + r\mu & \text{if } \rho \ne 1, \\[2ex]
c \left\{ n - \dfrac{i - 1}{2} \right\} + r\mu & \text{if } \rho = 1.
\end{cases} \tag{23}$$

Such a limiting index is shown to be indeed an MPI relative to the bias criterion discussed in that paper, which is appropriate as a priority-index for the corresponding multiclass system.
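A short numerical check of (23) is sketched below, reading the ρ ≠ 1 branch as (c/ρ){n − ρ/(1 − ρ) + iρ^i/(1 − ρ^i)} + rμ. The function name is hypothetical, and states are taken as i = 1, …, n (number of customers in the system), since the expression is not defined at i = 0.

```python
import math


def limiting_bias_mpi(i: int, n: int, c: float, r: float, mu: float, rho: float) -> float:
    """Limiting index (23) for a class with i >= 1 customers and buffer size n."""
    if i < 1:
        raise ValueError("the expression is evaluated for states i >= 1")
    if math.isclose(rho, 1.0):
        return c * (n - (i - 1) / 2) + r * mu
    return (c / rho) * (n - rho / (1 - rho) + i * rho**i / (1 - rho**i)) + r * mu
```

Under this reading the index is decreasing in i for utilization factors both below and above one, so it respects the ordering (21) with i read as the number of customers, and its two branches agree in the limit ρ → 1.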

Yet, are such MPI policies useful? We refer the reader to the experimental results on two-class instances reported in Niño-Mora (2006b). Across the instances investigated, such policies are typically near optimal, and outperform, often substantially, traditional naive policies such as the cμ rule or the SRC rule referred to above. We only identified a range of instances where the MPI policies appear to perform poorly: in the pure loss-sensitive case with distinct first-order indices $r_k \mu_k$, which yields a static priority policy.

4 More recent work

4.1 Algorithmic characterization of indexability

Niño-Mora (2007e) announces results of deploying a parametric LP approach, based on the classic parametric-objective simplex algorithm of Gass and Saaty (1955), to test for indexability and compute the MPI of an indexable semi-Markovian restless bandit instance. The resultant Complete-Pivoting Indexability (CPI) algorithm consists of an initialization stage, which computes the initial simplex tableau, followed by a loop that performs $2n^3 + O(n^2)$ arithmetic operations for an indexable project with $n$ controllable states. The algorithm is given in a detailed block-partitioned form, i.e., based on operations on submatrices (blocks) of a base matrix, which is ready for actual implementation. The importance of such block implementation has been emphasized in the scientific computing literature, where it is advocated as a means to partially overcome the exponentially widening gap between memory access times and processor speed in contemporary computers. See, e.g., Dongarra and Eijkhout (2000).

We also present in that paper a Reduced-Pivoting Indexability (RPI) algorithm, which avoids some unnecessary operations in pivot steps to achieve a reduced operation count of $n^3 + O(n^2)$ operations in its main loop. Yet, such an improved theoretical complexity is obtained at the expense of manipulating submatrices with arbitrary row and column indices, which results in relatively inefficient strided memory-access patterns. The latter are known to produce significant slowdowns in actual running times, which is verified experimentally in the paper. Despite the above operation counts, the CPI algorithm is substantially faster in practice than the RPI algorithm.

We have used such algorithms, in MATLAB implementations developed by the author, to assess empirically via simulation the prevalence of indexability and PCL-indexability on randomly generated restless bandit instances with dense transition probability matrices. Specifically, in each instance active rewards and transition probabilities were generated with MATLAB as Uniform(0,1) pseudo-random numbers, and then transition matrices were properly scaled dividing each row by its sum. For each of the state space sizes $n = 3, \ldots, 7$ a sample of $10^7$ instances was drawn and, for each instance, the discount factor β was varied from 0.1 to 1. Table 2 reports the counts obtained of nonindexable instances and of indexable yet non-PCL instances for each (β, n) combination. The results suggest that both indexability and PCL-indexability are highly prevalent properties, with their incidence sharply increasing as the discount factor gets smaller and as the state space gets larger.

Table 2 Counts on samples of 10^7 restless bandit instances

        Nonindexable                      Indexable non-PCL
        (number of states)                (number of states)
β         3     4     5     6     7           3      4     5     6     7
0.1       0     0     0     0     0           0      0     0     0     0
0.2       0     0     0     0     0           0      0     0     0     0
0.3       0     0     0     0     0           0      0     0     0     0
0.4       0     0     0     0     0           0      0     0     0     0
0.5       0     0     0     0     0           0      0     0     0     0
0.6       0     0     0     0     0           0      0     0     0     0
0.7       0     0     0     0     0          30      0     0     0     0
0.8      16     1     0     0     0         574     32     1     0     0
0.9     135     7     0     0     0        4460    509    36     5     0
1.0     818    66     4     0     0       18631   3640   425    50     3
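The instance-generation step described above can be sketched as follows. This is an illustrative NumPy version, not the author's MATLAB code; the function name is hypothetical, and passive rewards are simply taken to be zero, which is an assumption made here.

```python
import numpy as np


def random_restless_bandit(n: int, rng: np.random.Generator):
    """Draw one n-state instance with dense transition matrices.

    Returns (R1, P1, P0): active rewards, active and passive transition matrices.
    """
    # Active rewards drawn as Uniform(0, 1) pseudo-random numbers.
    R1 = rng.uniform(0.0, 1.0, size=n)
    # Raw Uniform(0, 1) entries for both actions, then each row divided by its sum.
    P = rng.uniform(0.0, 1.0, size=(2, n, n))
    P /= P.sum(axis=2, keepdims=True)
    return R1, P[1], P[0]


rng = np.random.default_rng(0)  # fixed seed only for reproducibility of the sketch
R1, P1, P0 = random_restless_bandit(5, rng)
```

Each returned matrix is row-stochastic with strictly positive entries, which matches the "dense transition probability matrices" used in the experiment.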

To illustrate by a concrete example how a bandit can be indexable yet not PCL-indexable, consider Fig. 4, which displays the achievable work-reward performance region (cf. Sect. 2.2) for the three-state instance with β = 0.9,

$$R^1 = \begin{bmatrix} 0.44138 \\ 0.8033 \\ 0.14257 \end{bmatrix}, \quad
P^1 = \begin{bmatrix} 0.1719 & 0.1749 & 0.6532 \\ 0.0547 & 0.9317 & 0.0136 \\ 0.1547 & 0.6271 & 0.2182 \end{bmatrix}, \quad
P^0 = \begin{bmatrix} 0.3629 & 0.5028 & 0.1343 \\ 0.0823 & 0.7534 & 0.1643 \\ 0.2460 & 0.0294 & 0.7246 \end{bmatrix},$$

and $R^0 = 0$. The plot shows that this is an indexable instance, relative to the nested family of optimal policies $\mathcal{F}_0 = \{\emptyset, \{2\}, \{2,3\}, \{1,2,3\}\}$. Yet, it is not PCL-indexable, since $g^{\{1,2\}} < g^{\{2\}}$ and hence (cf. (9)), $w_1^{\{2\}} < 0$. In contrast, Fig. 1 shows the achievable work-reward performance region of a PCL-indexable project.
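The failure of PCL-indexability in this instance can be reproduced numerically. The sketch below assumes the standard discounted definitions, which are not restated in this section: the work measure g^S is the expected total discounted number of active periods under the policy that is active exactly on S, and the marginal work measure is w_i^S = 1 + β Σ_j (p^1_ij − p^0_ij) g_j^S. Function names are hypothetical.

```python
import numpy as np

BETA = 0.9
P1 = np.array([[0.1719, 0.1749, 0.6532],
               [0.0547, 0.9317, 0.0136],
               [0.1547, 0.6271, 0.2182]])
P0 = np.array([[0.3629, 0.5028, 0.1343],
               [0.0823, 0.7534, 0.1643],
               [0.2460, 0.0294, 0.7246]])


def work_measure(active_set):
    """g^S by initial state: solve g = 1_S + beta * P^S g (assumed definition)."""
    active = np.array([state in active_set for state in (1, 2, 3)])
    # Row i of P^S comes from P1 if state i is active under S, else from P0.
    P_S = np.where(active[:, None], P1, P0)
    return np.linalg.solve(np.eye(3) - BETA * P_S, active.astype(float))


def marginal_work(active_set):
    """w^S by state: extra work from acting now and following S afterwards (assumed definition)."""
    return 1.0 + BETA * (P1 - P0) @ work_measure(active_set)


g_2 = work_measure({2})
g_12 = work_measure({1, 2})
w_2 = marginal_work({2})  # w_2[0] is the marginal work of state 1 under S = {2}
```

With these definitions, `w_2[0]` comes out at roughly −0.40 and `g_12` is smaller than `g_2` from every initial state: adding state 1 to the active set lowers total discounted work, mainly because acting in state 1 steers the project away from state 2, where the active action is highly persistent.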

4.2 More powerful indexability conditions and faster index computation

When faced with a particular restless bandit model, it is of interest for researchers to have practical methods to establish analytically its indexability under an appropriate parameter range. While we developed the PCL-indexability approach discussed in Sect. 2.3 for such a purpose, our more recent work has revealed limitations that have prompted further research. In particular: (i) we have found that analytical verification of the second condition in Definition 2, namely that the sequence of index values produced by the adaptive-greedy algorithm be monotone nonincreasing, can be overly hard or even elusive in models with a multi-dimensional state, such as that discussed in Sect. 4.7; and (ii) we have further found, as discussed in Sect. 4.6, a relevant restless bandit model that is indexable, yet not PCL-indexable, as the first condition in Definition 2 does not necessarily hold.

Fig. 4 Achievable work-reward performance region of an indexable non-PCL bandit

Such a state of affairs motivated the author to introduce in Niño-Mora (2007e) sufficient indexability conditions that are both easier to apply and less restrictive than PCL-indexability. We next review such conditions, termed LP-indexability conditions as they are based on linear programming (LP) analyses. Like the PCL-indexability conditions in Sect. 2.3, the new conditions are also based on positing an appropriate set system $(N^{\{0,1\}}, \mathcal{F})$. Yet, stronger requirements are imposed on the latter than those in Assumption 1, based on the concept of monotonically connected set system introduced in that paper.

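Assumption 2 $(N^{\{0,1\}}, \mathcal{F})$ is a monotonically connected set system, i.e., it satisfies:

(i) $\emptyset, N^{\{0,1\}} \in \mathcal{F}$
(ii) For every $S, S' \in \mathcal{F}$ with $S \subset S'$ there exist $j \in \partial^{\mathrm{out}}_{\mathcal{F}} S$ and $j' \in \partial^{\mathrm{in}}_{\mathcal{F}} S'$ such that $S \cup \{j\} \subseteq S'$ and $S \subseteq S' \setminus \{j'\}$
(iii) For any $S, S' \in \mathcal{F}$, $S \cup S' \in \mathcal{F}$ and $S \cap S' \in \mathcal{F}$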

The term “monotonically connected” is motivated by the fact that, in such a set system, one can always connect two feasible sets $S \subset S'$ by a monotone increasing sequence $S_1 \subset \cdots \subset S_m$ of adjacent sets in $\mathcal{F}$, with $S_1 = S$, $S_m = S'$. Further, one can also connect two distinct feasible sets $S \neq S'$ through two successive monotone sequences of adjacent sets in $\mathcal{F}$, the first of which is monotone increasing and connects $S$ to $S \cup S'$, while the second is monotone decreasing and connects $S \cup S'$ to $S'$.
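The three conditions of Assumption 2 can be checked mechanically on small examples. The Python sketch below assumes the boundary definitions ∂^out_F S = {j ∉ S : S ∪ {j} ∈ F} and ∂^in_F S = {j ∈ S : S \ {j} ∈ F}, which are introduced outside this section and are restated here as an assumption; the function names are illustrative.

```python
def outer_boundary(F, S, N):
    """Assumed definition: states j outside S whose addition keeps the set in F."""
    return {j for j in N - S if (S | {j}) in F}

def inner_boundary(F, S):
    """Assumed definition: states j in S whose removal keeps the set in F."""
    return {j for j in S if (S - {j}) in F}

def is_monotonically_connected(family, ground_set):
    """Check conditions (i)-(iii) of Assumption 2 for a finite set system."""
    F = {frozenset(S) for S in family}
    N = frozenset(ground_set)
    # (i) the empty set and the full set are feasible
    if frozenset() not in F or N not in F:
        return False
    for S in F:
        for T in F:
            # (ii) strictly nested feasible sets are linked through adjacent sets
            if S < T:
                if not any((S | {j}) <= T for j in outer_boundary(F, S, N)):
                    return False
                if not any(S <= (T - {j}) for j in inner_boundary(F, T)):
                    return False
            # (iii) closure under union and intersection
            if (S | T) not in F or (S & T) not in F:
                return False
    return True

# The nested family of the three-state example in Sect. 4.1 passes the check.
nested = [set(), {2}, {2, 3}, {1, 2, 3}]
print(is_monotonically_connected(nested, {1, 2, 3}))
```

A family such as {∅, {1,2,3}} fails condition (ii), since no single state can be added to ∅ while staying in the family.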

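We further write

\[
\bar{r}^S \triangleq \max_{j \in S^c,\, w^S_j = 0} r^S_j
\quad \text{and} \quad
\underline{r}^S \triangleq \min_{j \in S,\, w^S_j = 0} r^S_j, \tag{24}
\]

adopting the convention that the maximum (resp., minimum) over an empty set has the value −∞ (resp., +∞).

Definition 3 (LP(F)-indexability) We say that a restless bandit is LP(F)-indexable if:

(i) $w^{\emptyset}_i, w^{N^{\{0,1\}}}_i \ge 0$ for $i \in N^{\{0,1\}}$, and $\bar{r}^{\emptyset} \le 0 \le \underline{r}^{N^{\{0,1\}}}$
(ii) For each $S \in \mathcal{F}$, $w^S_i > 0$ for $i \in \partial^{\mathrm{in}}_{\mathcal{F}} S \cup \partial^{\mathrm{out}}_{\mathcal{F}} S$
(iii) For every wage $\nu \in \mathbb{R}$ there exists an optimal active set $S \in \mathcal{F}$ for (2)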

In practice, to establish that a bandit model is LP(F)-indexable one would prove conditions (i) and (ii) in Definition 3 by carrying out an analysis of relevant marginal work and reward measures. As for condition (iii), it requires us to prove a structural property of optimal policies for the ν-wage problem (2), for which DP techniques can often be useful.

The interest of such a bandit class is due to the following result.

Theorem 2 The following holds:

(a) An LP(F)-indexable bandit is indexable, and its MPI is computed in nonincreasing order by adaptive-greedy algorithm $\mathrm{AG}_{\mathcal{F}}$ in Table 1.

(b) An indexable bandit is LP($\mathcal{F}_0$)-indexable relative to some nested active-set family $\mathcal{F}_0$.

Note that part (a) of Theorem 2 yields a sufficient condition for indexability, and further extends the scope of the adaptive-greedy algorithm in Table 1 from the class of PCL-indexable bandits to the wider class of LP-indexable bandits, i.e., those that are LP(F)-indexable relative to some family $\mathcal{F}$. Part (b) shows that the latter class encompasses all indexable bandits.

For a specific application of Theorem 2 we refer the reader to Niño-Mora (2007f), which analyzes a bandit model that is LP-indexable, yet not PCL-indexable. That paper further shows that condition (ii) in Definition 2 on PCL(F)-indexability can be replaced by condition (iii) in Definition 3.

The validity of index algorithm $\mathrm{AG}_{\mathcal{F}}$ for LP(F)-indexable bandits allows us to leverage structural knowledge on optimal policies to compute the index faster. Thus, a fast-pivoting block implementation of such an algorithm is given in Niño-Mora (2007e) which, after an initialization stage involving the solution of a block linear equation system, performs in its main loop $(2/3)n^3 + O(n^2)$ arithmetic operations for a project having n controllable states. Such an algorithm extends that introduced in Niño-Mora (2007a) for computing the Gittins index and solving the problem of optimal stopping of a finite Markov chain, which has the same complexity in its loop yet does not require the initialization stage. Notice that $(2/3)n^3 + O(n^2)$ is the complexity of solving an n × n linear equation system by Gaussian elimination.
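For reference, the Gaussian elimination operation count quoted above can be recovered as follows. This is the standard textbook count, stated under the assumptions that each division, multiplication and subtraction counts as one arithmetic operation and that only the elimination of the coefficient matrix is counted, since right-hand side updates and back substitution add only $O(n^2)$ operations.

```latex
% Elimination step k = 1, ..., n-1 acts on the m = n - k rows below the pivot:
% m divisions (multipliers) plus m^2 multiplications and m^2 subtractions.
\[
\sum_{m=1}^{n-1} \left( 2m^2 + m \right)
  = \frac{(n-1)\,n\,(2n-1)}{3} + \frac{n(n-1)}{2}
  = \frac{2}{3}n^3 - \frac{1}{2}n^2 - \frac{1}{6}n
  = \frac{2}{3}n^3 + O(n^2).
\]
```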

To give an idea of actual runtimes of the three algorithms discussed, we report the results of an experiment based on the author’s MATLAB implementations. A random restless bandit instance was generated for each of the state space sizes n = 1000 up to 10000 in increments of 1000, with a discount factor β = 0.8. Each instance was tested both for indexability and PCL-indexability, yielding a positive result in each case. Figure 5 shows the runtime performance of each algorithm, where FPAG refers to the fast-pivoting implementation of the adaptive-greedy algorithm mentioned above. The experiment was conducted on an HP xw9300 dual-processor Opteron 254 (2.8 GHz) workstation running MATLAB 2007a under Windows XP x64. The new hyperthreading capability of MATLAB in that release was enabled to allow computations to use both system processors.

Fig. 5 Runtimes with cubic least-squares fit: FPAG (solid), CPI (dotted) and RPI (dashed) algorithms

The results show that the FPAG algorithm is the fastest of the three, followed by the CPI and then the RPI algorithms. The relative runtime performance does not match what would be expected based on theoretical operation counts. Such a phenomenon is well known in the field of scientific computing, as memory-access patterns are often the dominant factor in an algorithm’s actual performance. See, e.g., Dongarra and Eijkhout (2000). Thus, algorithms RPI and FPAG achieve their reduced operation counts at the expense of working on submatrices with complex row and index patterns, which results in strided access to memory, whereas algorithm CPI’s work is concentrated on a whole matrix manipulated as a single contiguous memory block.

4.3 Scheduling a multiclass wireless queue with finite buffers

The ubiquitous presence of wireless packet-switched computer-communication networks motivates the investigation of suitable modifications of traditional dynamic scheduling models. Thus, e.g., in a system scheduling multiple traffic streams generated by mobile users that vie for access to a transmission channel, the wireless environment creates the possibility that the latter is temporarily unavailable to some users, due to phenomena such as fading. A convenient model for such a system is given by a multiclass queue with a server whose connectivity to each customer class is turned on and off in a random fashion, according to a two-state Markov chain.

Variants of such a model have attracted considerable research interest, mostly aimed at establishing optimality of greedy scheduling policies under rather stringent conditions on model parameters. See, e.g., Tassiulas and Ephremides (1993), Lott and Teneketzis (2000), and Bambos and Michailidis (2002).

Niño-Mora (2006c) announces results of deploying restless bandit indexation to obtain new dynamic scheduling policies in a discrete-time Markovian multiclass wireless queue with finite dedicated buffers. Specifically, we assume that there are K distinct traffic classes, labeled by $k \in \mathcal{K} \triangleq \{1, \ldots, K\}$. During a time period (or slot in computer-communications lingo) the number of class k customers that arrive is distributed as a Poisson random variable with rate $\lambda_k$. If the server is allocated to a class k customer during a period, the probability that the service is completed by the end of the period is $\mu_k$. We thus assume Poisson arrival processes and geometric service times, which are mutually independent across periods and classes. Upon a class k customer’s arrival, it joins its class’ queue, which is held in a dedicated finite buffer of size $n_k$, if this is not full; otherwise, the customer is blocked and lost. We denote by $X_k(t)$ the number of class k customers in the system at the start of period t.

The time-varying connectivity is modeled through a binary connectivity process $s_k(t)$ for each class k, where $s_k(t) = 1$ when class k is connected to the server during period t, and $s_k(t) = 0$, otherwise. We assume that each $s_k(t)$ evolves as a binary ergodic Markov chain with transition probabilities $p_k(e_1, e_2)$, for $e_1, e_2 \in \{0,1\}$, and that connectivity processes are independent across classes.

We take the state of class k at time t as the connectivity-number of customers pair $(s_k(t), X_k(t))$. Its state space is thus

\[
N_k \triangleq \{0,1\} \times \{0, \ldots, n_k\}.
\]

A controller chooses at the start of each period the nonempty connected class, if any, to be served. Customers within a class are serviced in FIFO order. Such choices are represented by binary action processes $a_k(t)$, where $a_k(t) = 1$ if the server is working on class k in period t and $a_k(t) = 0$, otherwise. The corresponding sample-path service capacity constraint is thus

\[
\sum_{k \in \mathcal{K}} a_k(t) \le 1, \quad t \ge 0.
\]

Notice that, in states $(s_k(t), X_k(t))$ with $s_k(t) = 0$ or $X_k(t) = 0$ the only meaningful action is $a_k(t) = 0$. We will thus consider such states, whose union we denote by $N_k^{\{0\}}$, as uncontrollable. The remaining set of class k states, which we denote by $N_k^{\{0,1\}}$, are controllable, in that both actions are available in them and differ.

Action choice is dynamically prescribed through adoption of a scheduling policy π. This is chosen from the space Π of admissible policies, which are only required to be nonanticipative and allow service preemptions at the start of each period.
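A priority-index rule for this model can be sketched as follows: in each period, among the classes in a controllable state (connected and nonempty), serve the one with the largest index, which respects the capacity constraint of at most one class served per period. The Python sketch below is illustrative only. The function names are hypothetical, and the placeholder index used in the usage example (the product of a holding cost rate and a service probability, independent of the backlog) is an assumption standing in for an actual index, whose values are not given in this section.

```python
def is_controllable(state):
    """A class state (s, x) is controllable when connected (s == 1) and nonempty (x > 0)."""
    s, x = state
    return s == 1 and x > 0

def index_rule(states, index):
    """Return the class to serve, or None if no class is controllable.

    states: dict mapping class k -> (s_k, x_k)
    index:  function (k, x_k) -> priority index of class k in controllable state (1, x_k)
    """
    candidates = [k for k, st in states.items() if is_controllable(st)]
    if not candidates:
        return None
    # Ties are broken in favor of the smallest class label (an arbitrary choice).
    return max(sorted(candidates), key=lambda k: index(k, states[k][1]))

# Hypothetical data: holding cost rates c_k and service probabilities mu_k.
c = {1: 1.0, 2: 3.0, 3: 2.0}
mu = {1: 0.9, 2: 0.5, 3: 0.4}
placeholder_index = lambda k, x: c[k] * mu[k]    # a stand-in, not a computed index

states = {1: (1, 4), 2: (0, 7), 3: (1, 2)}       # class 2 is disconnected
served = index_rule(states, placeholder_index)
actions = {k: int(k == served) for k in states}  # a_k(t); sums to at most 1
```

Passing the index as a function keeps the rule independent of how the index is obtained, so a state-dependent index table could be plugged in without changing the selection logic.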

The system incurs linear holding and/or rejection costs separably across classes, time-discounted with factor 0 < β < 1. Class k traffic incurs holding costs at rate $c_k \ge 0$ per period and customer in the system, and rejection costs at rate $r_k \ge 0$ per customer blocked. Denoting by $A_k \sim \mathrm{Poisson}(\lambda_k)$ the number of class k arrivals during a period, we can thus represent the class’ state-dependent cost rate function as follows: for $(a_k, i_k) \in N_k$,

\[
h_k(a_k, i_k) = h_k(i_k) \triangleq c_k i_k + r_k \mathsf{E}\big[\{A_k - (n_k - i_k)\}^+\big]. \tag{25}
\]
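The expectation in (25) reduces to a finite sum for Poisson arrivals, since (A − m)^+ = A − m + (m − A)^+ gives E[(A − m)^+] = λ − m + Σ_{a=0}^{m} (m − a) P(A = a). The Python sketch below evaluates the cost rate this way; the function names and the numerical parameter values are assumptions chosen for illustration.

```python
import math

def expected_overflow(lam, m):
    """E[(A - m)^+] for A ~ Poisson(lam) and integer m >= 0.

    Uses (A - m)^+ = A - m + (m - A)^+, so only a finite sum is needed.
    """
    pmf = math.exp(-lam)          # P(A = 0)
    shortfall = 0.0               # accumulates E[(m - A)^+]
    for a in range(m + 1):
        shortfall += (m - a) * pmf
        pmf *= lam / (a + 1)      # P(A = a + 1) from P(A = a)
    return lam - m + shortfall

def cost_rate(i, c, r, lam, n):
    """Cost rate (25): holding cost c*i plus rejection cost r*E[(A - (n - i))^+]."""
    return c * i + r * expected_overflow(lam, n - i)

# Hypothetical class: c = 1, r = 5, lam = 2 arrivals per slot, buffer size n = 10.
costs = [cost_rate(i, c=1.0, r=5.0, lam=2.0, n=10) for i in range(11)]
```

With a full buffer (i = n) every arrival is blocked, so the rejection term equals r λ, which gives a quick sanity check on the implementation.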

The goal is to design a tractable index policy that comes close to minimizing the expected total discounted value of costs incurred. Since the model is readily formulated as a restless bandit problem, the natural candidate for such an index is the MPI. Note that the latter is a function $\nu^*_k(1, i_k)$ of the controllable states of each class k.

Along the lines in Niño-Mora (2006b) (cf. Sect. 3.3) we distinguish between two types of traffic classes. We consider a class k to be loss-sensitive if $(1 - \beta) r_k \ge c_k - \varepsilon_k$, and delay-sensitive if $(1 - \beta) r_k \le c_k - \varepsilon_k$, where $\varepsilon_k > 0$ is a critical value whose characterization is discussed in the full version of Niño-Mora (2006c), currently under preparation.

In short, the indexability analyses reveal existence of the MPI and structural properties that extend those in the fully-connected special case in Niño-Mora (2006b). Thus, the ordering induced by the MPI $\nu^*_k(1, i)$ on the state space of a class depends on whether this is loss-sensitive or delay-sensitive, with such an ordering being in each case the same as that discussed in Sect. 3.3 for the corresponding MPI $\nu^*_k(i)$.

Detailed analyses will be given in the full version of Niño-Mora (2006c), along with comprehensive experimental results assessing the performance of such an MPI policy.

4.4 Scheduling a multiclass queue with finite buffers and delayed state observation

In some applications, it appears unreasonable due to communication delays to assume that the controller has knowledge of the current system state, which has motivated research into control of systems with delayed state information. See, e.g., Altman and Stidham (1995) and the references therein. In the setting of communication systems for scheduling multiclass traffic, such delays are significant in, e.g., satellite systems. This has motivated recent research efforts to address the design of dynamic scheduling policies in multiclass queues with delayed state observation. Thus, Ehsan and Liu (2006) establish the optimality of a simple greedy scheduling policy, yet under strong conditions on model parameters.

In Niño-Mora (2007c) we announce results of deploying restless bandit indexation in a multiclass finite-buffer queueing scheduling model which differs from that discussed in Sect. 4.3 in that: (i) classes are permanently connected to the server; and (ii) the controller’s state information on queues’ states is delayed by one time period. The model is readily cast as a restless bandit problem, where the observed state process of each bandit (class) k is of the form $X_k(t) \triangleq (a_k(t-1), X_k(t-1))$, i.e., it consists of the previous action $a_k(t-1) \in \{0,1\}$ and buffer occupancy $X_k(t-1)$. Note that in that setting all states for a class are controllable, as the controller might well allocate the server to a class whose previous backlog was empty, anticipating that such need not be the case in the present period.

Again, we find that the required family $\mathcal{F}$ of active sets relative to which indexability is established depends critically on whether a class is loss-sensitive or delay-sensitive. The former case corresponds to a class k satisfying, in the notation of Sect. 4.3, that $0 < (1 - \beta) r_k \ge \beta c_k$, whereas the latter case corresponds to satisfaction of $(1 - \beta) r_k \le \beta c_k > 0$. The paper discusses the structure of $\mathcal{F}$ and of the MPI that emerges in each case, and further reports results of preliminary computational experiments, where the MPI policy is shown to be nearly optimal and to outperform conventional scheduling policies. A full version of the paper with detailed analyses and thorough computational experiments is currently under preparation.


4.5 Multiarmed bandits with switching costs

A critical assumption underlying the optimality of the Gittins index rule for the multiarmed bandit problem is that switching projects is costless. Yet, in many applications there are nonnegligible costs of switching, motivating investigation of multiarmed bandits with switching costs, which are extensively surveyed in Jun (2004). In such a setting, it is natural to consider index policies where the index $\nu_{(a^-, i)}$ of a project is a function of its present state i and previous action $a^-$ (i.e., $a^- = 1$ if it was active and $a^- = 0$, otherwise). As pointed out in Banks and Sundaram (1994), the presence of switching costs leads one to stick longer to the project currently engaged. Such a hysteresis property, also discussed in Dusonchet and Hongler (2003b) in the setting of a continuous-state model, means that the indices should be required to satisfy $\nu_{(1,i)} \ge \nu_{(0,i)}$, i.e., the continuation index is larger than or equal to the switching index.

Though Banks and Sundaram (1994) established that, generally, optimal policies for bandits with switching costs are not of index type, Asawa and Teneketzis (1996) introduced an intuitively appealing index, and went on to prove that it partially characterizes optimal policies. Their continuation index $\nu^{\mathrm{AT}}_{(1,i)}$ is precisely the project’s Gittins index, whereas their switching index $\nu^{\mathrm{AT}}_{(0,i)}$ is the maximum rate of expected discounted reward minus initial switching cost per unit of expected discounted time that can be achieved under stopping rules that engage an initially passive project.

Niño-Mora (2007d) deploys restless bandit indexation to such a problem, by exploiting the natural restless reformulation of a classic bandit with switching costs, where one considers the augmented state $(a^-, i)$. As the appropriate family $\mathcal{F}$ of active sets in the augmented state space $\{0,1\} \times N$, where N is the project’s original state space, we take

\[
\mathcal{F} \triangleq \{ S_0 \oplus S_1 : S_0 \subseteq S_1 \subseteq N \}, \tag{26}
\]

where the notation $S_0 \oplus S_1$ refers to the policy that engages the project when it was previously rested (resp., engaged) iff its current state lies in $S_0$ (resp., $S_1$).
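To make the family (26) concrete, the Python sketch below enumerates it for a three-state project and evaluates the action that a policy S0 ⊕ S1 prescribes in an augmented state. The representation of a policy as a pair of frozensets and the function names are assumptions made for illustration.

```python
from itertools import combinations

def subsets(states):
    """All subsets of `states`, as frozensets."""
    items = sorted(states)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def active_set_family(states):
    """All pairs (S0, S1) with S0 a subset of S1, representing S0 (+) S1 as in (26)."""
    return [(S0, S1) for S1 in subsets(states) for S0 in subsets(S1)]

def action(policy, prev_action, state):
    """Action prescribed by S0 (+) S1 in augmented state (prev_action, state)."""
    S0, S1 = policy
    return int(state in (S1 if prev_action == 1 else S0))

family = active_set_family({1, 2, 3})
print(len(family))  # 27: each state is in neither set, in S1 only, or in both
```

The nesting S0 ⊆ S1 encodes hysteresis: any state in which a rested project would be engaged is also one in which an engaged project is kept going.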

The paper shows that the resultant restless bandits are PCL(F)-indexable, and that their MPI is precisely the Asawa and Teneketzis index. More importantly, computational issues are addressed. Asawa and Teneketzis (1996) had proposed to jointly compute both the active and the passive index of an n-state project as the Gittins index of a certain classic project having 2n states. This results in an eight-fold increase in $O(n^3)$ arithmetic operations and substantially increased memory operations relative to the computational effort to compute the continuation index alone, which is overly expensive in large-scale models.

We thus set out to obtain a more efficient computation method by analyzing the adaptive-greedy algorithm $\mathrm{AG}_{\mathcal{F}}$ in such a model. It turns out that the algorithm naturally decouples into a two-stage method. The first stage computes the Gittins index of the original project along with certain required extra quantities. Then, the second stage is fed the first stage’s output to compute the switching index an order of magnitude faster in at most $n^2 + O(n)$ arithmetic operations. The two-stage method also yields substantially reduced memory operations, as it involves manipulation of n × n matrices instead of 2n × 2n matrices.

A computational study demonstrates that such a theoretical complexity reduction translates in practice into dramatic runtime savings. Further, the paper reports on a comprehensive computational study on two- and three-class project instances showing that the MPI policy is consistently near optimal, and substantially outperforms the Gittins index policy that ignores switching costs.

To help the reader grasp the underlying intuition, we present next an illustrative example. Consider the 3-state bandit instance with constant startup cost c,

\[
\beta = 0.95, \quad
R = \begin{bmatrix} 0.0250 \\ 0.4242 \\ 0.0338 \end{bmatrix}, \quad \text{and} \quad
P = \begin{bmatrix} 0.6635 & 0.0285 & 0.3080 \\ 0.6345 & 0.3583 & 0.0072 \\ 0.4868 & 0.0530 & 0.4602 \end{bmatrix}.
\]

Work and reward measures $g^\pi$ and $f^\pi$ are evaluated assuming that the initial state is uniformly drawn. The left pane in Fig. 6 shows the achievable work-reward performance region in the classic c = 0 case. The four points displayed, determining its upper boundary, are the work-reward performance points corresponding to the policies whose active sets are given, from left to right, by ∅, {2}, {2,3}, and {2,3,1}. The successive work-reward trade-off slopes or rates between such points are the Gittins index values:

\[
\nu^*_2 = 0.4242 > \nu^*_3 = 0.061487 > \nu^*_1 = 0.048002.
\]
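These Gittins index values can be reproduced numerically. The Python sketch below uses a plain largest-index-first recursion (in the spirit of the classic Varaiya, Walrand and Buyukkoc scheme), not the fast-pivoting algorithm discussed in Sect. 4.2: at each step it evaluates, for every state i not yet indexed, the ratio of expected discounted reward to expected discounted time when the project is engaged from i for as long as it remains in the already-indexed states or in i, and assigns the largest ratio as the next index. Function names and the fixed-point solver are choices made here for brevity.

```python
BETA = 0.95
R = [0.0250, 0.4242, 0.0338]
P = [[0.6635, 0.0285, 0.3080],
     [0.6345, 0.3583, 0.0072],
     [0.4868, 0.0530, 0.4602]]

def discounted_sum(payoff, cont, iters=1000):
    """Expected discounted sum of `payoff` collected while the state stays in `cont`.

    Solves v_i = payoff_i + beta * sum_j P_ij v_j for i in cont (v_i = 0 otherwise)
    by fixed-point iteration; states are 0-based here.
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [payoff[i] + BETA * sum(P[i][j] * v[j] for j in range(n)) if i in cont else 0.0
             for i in range(n)]
    return v

def gittins_indices():
    """Largest-index-first computation; returns {state label (1-based): index}."""
    n = len(P)
    indexed, result = set(), {}
    while len(indexed) < n:
        best_state, best_ratio = None, float("-inf")
        for i in set(range(n)) - indexed:
            cont = indexed | {i}
            f = discounted_sum(R, cont)          # reward measure
            g = discounted_sum([1.0] * n, cont)  # work (discounted time) measure
            if f[i] / g[i] > best_ratio:
                best_state, best_ratio = i, f[i] / g[i]
        indexed.add(best_state)
        result[best_state + 1] = best_ratio
    return result

nu = gittins_indices()
for state, value in sorted(nu.items(), key=lambda kv: -kv[1]):
    print(f"nu*_{state} = {value:.6f}")
```

Under these assumptions the computed indices should agree with the values quoted above up to rounding, with states indexed in the order 2, 3, 1.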

The right pane in Fig. 6 shows a corresponding plot for the case with startup cost c = 0.02. The upper work-reward boundary is determined by the seven points displayed, which are the work-reward performance points corresponding, from left to right, to the policies having active sets ∅ ⊕ ∅, ∅ ⊕ {2}, {2} ⊕ {2}, {2} ⊕ {2,3}, {2,3} ⊕ {2,3}, {2,3} ⊕ {2,3,1}, and {2,3,1} ⊕ {2,3,1}. The successive work-reward trade-off slopes between such points give the MPI values:

\[
\nu^*_{(1,2)} = 0.424 > \nu^*_{(0,2)} = 0.411 > \nu^*_{(1,3)} = 0.061 > \nu^*_{(0,3)} = 0.051 > \nu^*_{(1,1)} = 0.048 > \nu^*_{(0,1)} = 0.047.
\]

Fig. 6 Achievable work-reward performance regions and structure of upper boundaries


Fig. 7 Achievable work-reward performance regions and structure of upper boundaries

The plot represents the right end-points giving a continuation index value by a black circle, and those giving a switching index value by a white square. Notice further that the continuation index matches the Gittins index of the previous case.

The left pane of Fig. 7 shows the achievable work-reward performance region for the case c = 0.1. Now, the seven points displayed in the upper boundary correspond, from left to right, to the policies having active sets ∅ ⊕ ∅, ∅ ⊕ {2}, {2} ⊕ {2}, {2} ⊕ {2,3}, {2} ⊕ {2,3,1}, {2,3} ⊕ {2,3,1}, and {2,3,1} ⊕ {2,3,1}. The successive work-reward trade-off slopes give the bandit’s MPI values:

\[
\nu^*_{(1,2)} = 0.424 > \nu^*_{(0,2)} = 0.358 > \nu^*_{(1,3)} = 0.061 > \nu^*_{(1,1)} = 0.048 > \nu^*_{(0,3)} = 0.044 > \nu^*_{(0,1)} = 0.043.
\]

Finally, the right pane of Fig. 7 shows the corresponding plot for the case c = 0.6. The seven points characterizing the upper boundary correspond, from left to right, to the policies having active sets ∅ ⊕ ∅, ∅ ⊕ {2}, ∅ ⊕ {2,3}, ∅ ⊕ {2,3,1}, {2} ⊕ {2,3,1}, {2,3} ⊕ {2,3,1}, and {2,3,1} ⊕ {2,3,1}. The resultant MPI values given by the successive slopes are

\[
\nu^*_{(1,2)} = 0.424 > \nu^*_{(1,3)} = 0.061 > \nu^*_{(1,1)} = 0.048 > \nu^*_{(0,2)} = 0.047 > \nu^*_{(0,3)} = 0.019 > \nu^*_{(0,1)} = 0.018.
\]

Notice that, in each case, the continuation index value $\nu^*_{(1,i)}$ matches the Gittins index value $\nu^*_i$. Further, the successive active sets $S_0 \oplus S_1$ characterizing the regions’ upper boundaries belong to the active-set family $\mathcal{F}$ in (26). Also, the continuation index value $\nu^*_{(1,i)}$ is larger than the corresponding switching index value $\nu^*_{(0,i)}$.
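The switching index values in these examples can be checked directly from the characterization recalled earlier in this section, as the maximum over stopping rules of expected discounted reward minus the startup cost per unit of expected discounted time. The Python sketch below does so by brute force, enumerating every continuation set that contains the initial state and stopping on first exit from the set. This is a check that is only viable for tiny state spaces, not the two-stage method of Niño-Mora (2007d), and all function names are illustrative.

```python
from itertools import combinations

BETA = 0.95
R = [0.0250, 0.4242, 0.0338]
P = [[0.6635, 0.0285, 0.3080],
     [0.6345, 0.3583, 0.0072],
     [0.4868, 0.0530, 0.4602]]

def discounted_sum(payoff, cont, iters=1000):
    """Expected discounted sum of `payoff` while the state stays in `cont` (0-based)."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [payoff[i] + BETA * sum(P[i][j] * v[j] for j in range(n)) if i in cont else 0.0
             for i in range(n)]
    return v

def switching_index(i, c):
    """Max over continuation sets S containing i of (f_i^S - c) / g_i^S (0-based i)."""
    n = len(P)
    others = [j for j in range(n) if j != i]
    best = float("-inf")
    for r in range(n):
        for extra in combinations(others, r):
            cont = {i, *extra}
            f = discounted_sum(R, cont)          # reward measure
            g = discounted_sum([1.0] * n, cont)  # work (discounted time) measure
            best = max(best, (f[i] - c) / g[i])
    return best

for c in (0.02, 0.1, 0.6):
    values = {state: round(switching_index(state - 1, c), 3) for state in (1, 2, 3)}
    print(c, values)
```

Setting c = 0 in the same function gives the continuation (Gittins) index, which is one way to see why the continuation index is unaffected by the startup cost.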

4.6 Multiarmed bandits with switching delays

Besides switching costs, delays for switching projects are clearly relevant in many applications. Think, e.g., of the time required to learn a new technique before one can make productive use of it, of time for preparing the ground in a development project, or of the shutdown time required to dismantle a project. Asawa and Teneketzis (1996) briefly discussed the corresponding multiarmed bandit problem with switching delays, for which they proposed an index, claiming that it also gives a partial characterization of optimal policies. Yet, no algorithm is given in their paper for the computation of such an index.

Niño-Mora (2007f) announces results of extending the restless bandit indexation analysis in Niño-Mora (2007d) to projects that incorporate both switching costs and delays. The required extension is, however, far from trivial. First, the natural restless reformulation of a classic project with switching delays is semi-Markovian. Further and more importantly, the resultant restless projects are no longer PCL-indexable. In fact, it was the analysis of this model which motivated the author to develop the more powerful LP-indexability conditions reviewed in Sect. 4.2.

Thus, the paper shows that the restless projects of concern are LP(F)-indexable, where $\mathcal{F}$ is the active-set family discussed in the previous section. This allows application of the adaptive-greedy algorithm $\mathrm{AG}_{\mathcal{F}}$ to compute the MPI. As in the case of switching costs only, analysis of such an algorithm as it applies to the model with switching delays reveals that it can be decoupled into a two-stage scheme. Again, the first stage computes the Gittins index of the original n-state project along with extra quantities, and the second stage uses the first stage’s output to compute the switching index. This is accomplished an order of magnitude faster, in at most $(5/2)n^2 + O(n)$ arithmetic operations.

The paper further reports on a computational study showing the dramatic speedup gains yielded by such a two-stage computation scheme, relative to joint computation of both indices. Such a study is complemented by a set of experiments aimed at assessing the degree of suboptimality of the MPI policy, and its performance gains over the benchmark index policy that ignores switching penalties. These experiments reveal substantial gains and a near-optimal performance.

4.7 Multiarmed bandits with deadlines

Another critical model assumption underlying the Gittins and Jones result for classic multiarmed bandits is that the planning horizon is infinite. Yet, in many applications it is more appropriate to consider finite horizon scenarios. Thus, projects might be subject to a common deadline at which they expire; or, more generally, each project might have its own deadline, which motivates consideration of the multiarmed bandit problem with deadlines.

While such problems appear to be computationally intractable, their widespread practical relevance motivates the quest for tractable priority-index policies that are nearly optimal. For such a purpose, the index introduced in Bradt et al. (1956), as it extends to a general Markovian setting, is a natural choice. Such an index measures the maximum rate of expected discounted reward per unit of expected discounted time that can be achieved under stopping rules that do not exceed the horizon. Gittins and Jones (1974) discussed the resultant index rule, showing that it is generally not optimal for the finite-horizon multiarmed bandit problem.

Remarkably, to the best of the author’s knowledge, use of such a priority-index rule has received scant, if any, research attention. It appears that the finite-horizon index has been regarded in a subordinate role, simply as a means to approximate the Gittins index. See, e.g., Gittins (1979, Sect. 7; 1989, Chap. 1), and Wang (1997).

One possible explanation accounting for such a limited use of the finite-horizon index might be the lack of a simple, exact algorithm for its computation. The author is only aware of the exact method for Bayesian Bernoulli bandits with beta priors discussed in Gittins (1979, Sect. 7). Yet, such a method involves the numerical computation of a supremum at each step (see formula (11) in that paper), which is not an elementary operation.

An approach to address such issues based on restless bandit indexation is announced in Niño-Mora (2005), based on the observation that a finite-horizon bandit is readily formulated as an infinite-horizon restless bandit, by augmenting the state to include the remaining time, and on the result that the latter’s MPI is the former’s finite-horizon index. In this setting, it is natural to represent Markov deterministic policies for operating a project in the form

\[
S = S_1 \oplus S_2 \oplus \cdots \oplus S_T,
\]

where the notation represents the policy that engages the project when t periods remain to the deadline iff the original state lies in $S_t$, for $1 \le t \le T$. The restless bandits of concern for a T-horizon project turn out to be PCL($\mathcal{F}_T$)-indexable, where the appropriate active-set family $\mathcal{F}_T$ is given by

\[
\mathcal{F}_T \triangleq \{ S_1 \oplus S_2 \oplus \cdots \oplus S_T : S_1 \subseteq S_2 \subseteq \cdots \subseteq S_T \}.
\]

Notice that such a structure is consistent with the intuitive monotonicity property of the index whereby $\nu^*_{(t,i)} \ge \nu^*_{(t-1,i)}$, with t being the remaining time and i the current state: a longer remaining time enlarges the set of admissible stopping rules, and a larger index at longer remaining times is what makes the active sets nested as $S_{t-1} \subseteq S_t$.

Such a result allows us to use the adaptive-greedy algorithm $\mathrm{AG}_{\mathcal{F}_T}$ to compute the MPI. However, the resultant complexity is of order $O(T^3 n^3)$ arithmetic operations for a T-horizon n-state project, which severely hinders applicability. In more recent work, we improve on such a result, by showing that the adaptive-greedy algorithm naturally decouples into a recursive T-stage scheme, which computes the finite-horizon MPI in only $O(T^2 n^3)$ arithmetic operations, thus significantly expanding the size of models that can be addressed. Finally, preliminary computational results on two-project instances show that such MPI policies are consistently near optimal, often substantially outperforming the benchmark Gittins index policy.
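For small instances, the finite-horizon index can also be computed directly from its definition, without the adaptive-greedy algorithm. The Python sketch below uses a calibration argument, which is an assumption of this example rather than the method of the papers cited: for a charge ν per period of work, backward induction gives the maximal expected discounted net reward W_t(i; ν) over stopping rules that engage the project for at least one and at most t periods, and the index for remaining time t and state i is the root in ν of W_t(i; ν) = 0, found by bisection. The instance is the 3-state project of Sect. 4.5, reused here for illustration.

```python
BETA = 0.95
R = [0.0250, 0.4242, 0.0338]
P = [[0.6635, 0.0285, 0.3080],
     [0.6345, 0.3583, 0.0072],
     [0.4868, 0.0530, 0.4602]]

def net_value(t, i, nu):
    """W_t(i; nu): best expected discounted value of rewards minus a charge nu per period,
    over stopping rules that work for at least 1 and at most t periods from state i."""
    n = len(R)
    V = [0.0] * n                      # value of an optional continuation, 0 periods left
    W = [0.0] * n
    for _ in range(t):
        W = [R[j] - nu + BETA * sum(P[j][k] * V[k] for k in range(n)) for j in range(n)]
        V = [max(0.0, w) for w in W]   # stopping is allowed after the first period
    return W[i]

def finite_horizon_index(t, i, tol=1e-12):
    """Root in nu of W_t(i; nu) = 0, located by bisection on [min R, max R] (0-based i)."""
    lo, hi = min(R), max(R)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if net_value(t, i, mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for t in (1, 2, 5, 20, 100):
    print(t, [round(finite_horizon_index(t, i), 6) for i in range(3)])
```

With one period remaining the index reduces to the one-period reward, and for long horizons the printed values approach the Gittins index values of this instance, in line with the role of the finite-horizon index as an approximation to the Gittins index.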

5 Concluding remarks

We have surveyed a unifying approach to the design and computation of tractable priority-index policies for a variety of problems, based on the intuitive concept of MPI. We believe that, while the results reviewed on theory and algorithms for finite-state bandits are in a state that allows them to be readily deployed by researchers, the applications surveyed only give a glimpse of what can be attained both in depth and scope. Many interesting issues remain to be explored, such as: Is the prevalence of indexability and PCL-indexability as high as it seems in computational experiments, and, if so, why? How to analyze the performance of MPI policies in a multi-project setting? Under what conditions do MPI policies perform well? How can one design suitable MPI policies in models with constraints, as in Altman and Shwartz (1989)? We emphasize that the range of applications of the MPI approach is far from exhausted. In fact, the author is currently investigating completely different applications from those surveyed herein, which will be reported in due time. Also, Prof. Weber has introduced in his discussion to this paper a promising idea that might substantially expand the scope of the MPI approach.

Acknowledgements The author’s work surveyed in this paper has been supported in part by the Spanish Ministry of Education & Science under projects TAP98-0229, BEC2000-1027, MTM2004-02334, a Ramón y Cajal Investigator Award and an I3 faculty endowment grant, by the European Union’s Networks of Excellence EuroNGI and EuroFGI, and by the Autonomous Community of Madrid under grants UC3M-MTM-05-075 and CCG06-UC3M/ESP-0767. The author thanks, for their invitations to present parts of his work surveyed herein, the organizers of research seminars at Eurandom (1999, 2001, 2004) and at Beta (2004), Technical Univ. of Eindhoven, and the Department of Operations Research and Statistics at Univ. de Sevilla (2007), as well as the organizers of the Workshop on Fluid Queues (Eurandom, 2000), the Schloss Dagstuhl Seminar on Scheduling in Computer and Manufacturing Systems (Wadern, Germany, 2002), the Workshop on Analysis and Optimization of Stochastic Networks with Applications to Communications and Manufacturing (Eurandom, 2002), the Workshop on Topics in Computer Communication and Networks (Isaac Newton Institute for Mathematical Sciences, Univ. of Cambridge, UK, 2002), the Joint Research Conference on Mathematical Programming of the Spanish Operations Research & Statistics Society and the Spanish Royal Mathematical Society (Univ. Miguel Hernández, Elche, Spain, 2004), the Schloss Dagstuhl Seminar on Algorithms for Optimization with Incomplete Information (Wadern, Germany, 2005), and the First Iberian Conference on Optimization (Coimbra, Portugal, 2006).

References

Adan IJBF, van der Wal J (1998) Combining make to order and make to stock. OR Spektrum 20:73–81Altman E, Shwartz A (1989) Optimal priority assignment: a time sharing approach. IEEE T Autom Contr

34:1098–1102Altman E, Stidham S Jr (1995) Optimality of monotonic policies for two-action Markovian decision

processes, with applications to control of queues with delayed information. Queueing Syst 21:267–291

Asawa M, Teneketzis D (1996) Multi-armed bandits with switching penalties. IEEE T Autom Contr41:328–348

Bambos N, Michailidis G (2002) On parallel queuing with random server connectivity and routing con-straints. Probab Eng Inform Sc 16:185–203

Banks JS, Sundaram RK (1994) Switching costs and the Gittins index. Econometrica 62:687–694Bellman R (1956) A problem in the sequential design of experiments. Sankhya 16:221–229Bertsimas D, Niño-Mora J (1996) Conservation laws, extended polymatroids and multiarmed bandit prob-

lems; a polyhedral approach to indexable systems. Math Oper Res 21:257–306Blackwell D (1962) Discrete dynamic programming. Ann Math Stat 33:719–726Boxma OJ (1995) Static optimization of queueing systems. In: Agarwal RP (ed) Recent trends in optimiza-

tion theory and applications. World sci ser appl anal, vol 5. World Sci Publ, River Edge, pp 1–16Bradt RN, Johnson SM, Karlin S (1956) On sequential designs for maximizing the sum of n observations.

Ann Math Stat 27:1060–1074Chen H, Yao DD (1990) Optimal intensity control of a queueing system with state-dependent capacity

limit. IEEE T Autom Contr 35:459–464Clark JB (2005) The distribution of wealth: a theory of wages, interest and profits. Adamant media corpo-

ration (reissued from the edition published in 1908 by the Macmillan Company, New York, NY; firstpublished: 1899)

Coffman EG Jr, Mitrani I (1980) A characterization of waiting time performance realizable by single-server queues. Oper Res 28:810–821

196 J. Niño-Mora

Combé MB, Boxma OJ (1994) Optimization of static traffic allocation policies. Theor Comput Sci 125:17–43

Cox DR, Smith WL (1961) Queues. Methuen, London

De Véricourt F, Karaesmen F, Dallery Y (2000) Dynamic scheduling in a make-to-stock system: a partial characterization of optimal policies. Oper Res 48:811–819

Derman C, Lieberman GJ, Ross SM (1980) On the optimal assignment of servers and a repairman. J Appl Probab 17:577–581

Dongarra JJ, Eijkhout V (2000) Numerical linear algebra algorithms and software. J Comput Appl Math 123:489–514

Dusonchet F, Hongler MO (2003a) Continuous-time restless bandit and dynamic scheduling for make-to-stock production. IEEE T Robotic Autom 19:977–990

Dusonchet F, Hongler MO (2003b) Optimal hysteresis for a class of deterministic deteriorating two-armed bandit problem with switching costs. Automatica 39:1947–1955

Ehsan N, Liu MY (2006) Optimal bandwidth allocation in a delay channel. IEEE J Sel Area Commun 24:1614–1626

Feinberg EA, Shwartz A (2002) Mixed criteria. In: Feinberg EA, Shwartz A (eds) Handbook of Markov decision processes: methods and applications. Kluwer, Boston, pp 209–230

Federgruen A, Groenevelt H (1988) Characterization and optimization of achievable performance in general queueing systems. Oper Res 36:733–741

Gass S, Saaty T (1955) The computational algorithm for the parametric objective function. Naval Res Log 2:39–46

Gittins JC (1979) Bandit processes and dynamic allocation indices (with discussion). J R Stat Soc B Met 41:148–177

Gittins JC (1989) Multi-armed bandit allocation indices. Wiley, Chichester

Gittins JC, Jones DM (1974) A dynamic allocation index for the sequential design of experiments. In: Gani J, Sarkadi K, Vincze I (eds) Progress in statistics. European meeting of statisticians, Budapest, 1972. North-Holland, Amsterdam, pp 241–266

Haji R, Newell GF (1971) A relation between stationary queue and waiting-time distributions. J Appl Probab 8:617–620

Hernández-Lerma O, Hoyos-Reyes LF (2001) A multiobjective control approach to priority queues. Math Method Oper Res 53:265–277

Hordijk A, Koole G (1990) On the optimality of the generalized shortest queue policy. Probab Eng Inform Sci 4:477–487

Johri PK (1989) Optimality of the shortest line discipline with state-dependent service rates. Eur J Oper Res 41:157–161

Jun T (2004) A survey on the bandit problem with switching costs. De Economist 152:513–541

Kallmes MH, Cassandras CG (1995) Two approaches to optimal routing and admission control in systems with real-time traffic. J Optim Theory Appl 84:311–338

Kantorovich LV (1959) Ekonomicheskii raschet nailuchshego ispolzovania resursov. Academy of Sciences, USSR (Engl trans: The best use of economic resources, Harvard University Press, Cambridge, 1965)

Keilson J, Servi LD (1988) A distributional form of Little’s law. Oper Res Lett 7:223–227

Kelly FP (1981) Multi-armed bandits with discount factor near one: the Bernoulli case. Ann Stat 9:987–1001

Kim E, Van Oyen MP (1998) Beyond the cμ rule: dynamic scheduling of a two-class loss queue. Math Method Oper Res 48:17–36

Kleinrock L (1965) A conservation law for a wide class of queueing disciplines. Naval Res Log 12:181–192

Kleinrock L (1975) Queueing systems, vol I: Theory. Wiley, New York

Klimov GP (1974) Time-sharing service systems. I. Theory Probab Appl 19:532–551

Koopmans TC (1957) Three essays on the state of economic science. McGraw–Hill, New York, chap Allocation of resources and the price system

Krishnan KR (1988) State-dependent allocation of job stream to parallel workstations. In: Proceedings of the 27th conference on decision and control. IEEE, Piscataway, pp 2332–2333

Krishnan KR (1990) Joining the right queue: a state-dependent decision rule. IEEE T Autom Contr 35:104–108

Lewis ME, Puterman ML (2002) Bias optimality. In: Feinberg EA, Shwartz A (eds) Handbook of Markov decision processes. Kluwer Academic, Boston, pp 89–111


Lott C, Teneketzis D (2000) On the optimality of an index rule in multichannel allocation for single-hop mobile networks with multiple service classes. Probab Eng Inform Sc 14:259–297

Niño-Mora J (2001) Restless bandits, partial conservation laws and indexability. Adv Appl Probab 33:76–98

Niño-Mora J (2002) Dynamic allocation indices for restless projects and queueing admission control: a polyhedral approach. Math Program 93:361–413

Niño-Mora J (2003) Restless bandit marginal productivity indices, diminishing returns, and scheduling a multiclass make-to-order/-stock queue. In: Proceedings of the 41st annual Allerton conference on communication, control and computing. Univ of Illinois at Urbana-Champaign, Monticello, pp 100–109

Niño-Mora J (2005) A marginal productivity index policy for the finite-horizon multiarmed bandit problem. In: Proceedings of the 44th IEEE conference on decision and control and European control conference ECC 2005. IEEE, Piscataway, pp 1718–1722

Niño-Mora J (2006a) Restless bandit marginal productivity indices, diminishing returns and optimal control of make-to-order/make-to-stock M/G/1 queues. Math Oper Res 31:50–84

Niño-Mora J (2006b) Marginal productivity index policies for scheduling a multiclass delay-/loss-sensitive queue. Queueing Syst 54:281–312

Niño-Mora J (2006c) Marginal productivity index policies for scheduling multiclass wireless transmissions. In: NGI 2006, proceedings of the 2nd EuroNGI conference on next generation Internet networks—design and engineering. IEEE, Piscataway, pp 342–349

Niño-Mora J (2007a) A (2/3)n^3 fast-pivoting algorithm for the Gittins index and optimal stopping of a Markov chain. INFORMS J Comput, vol 19, DOI: 10.1287/ijoc.1060.0206

Niño-Mora J (2007b) Marginal productivity index policies for admission control and routing to parallel multi-server loss queues with reneging. In: Network control and optimization, proceedings of the first EuroFGI international conference NET-COOP 2007. Lect Notes Comput Sc, vol 4465. Springer, Berlin, pp 138–149

Niño-Mora J (2007c) Marginal productivity index policies for scheduling multiclass delay-/loss-sensitive traffic with delayed state observation. In: NGI 2007, proceedings of the 3rd EuroNGI conference on next generation Internet networks—design and engineering for heterogeneity. IEEE, Piscataway, pp 209–217

Niño-Mora J (2007d) A faster index algorithm and a computational study for bandits with switching costs. INFORMS J Comput (in press)

Niño-Mora J (2007e) Characterization and computation of restless bandit marginal productivity indices. In: SMCtools’07, proceedings from the 2007 workshop on tools for solving structured Markov chains. ACM, New York

Niño-Mora J (2007f) Computing an index policy for bandits with switching penalties. In: SMCtools’07, proceedings from the 2007 workshop on tools for solving structured Markov chains. ACM, New York

Papadimitriou CH, Tsitsiklis JN (1999) The complexity of optimal queuing network control. Math Oper Res 24:293–305

Peña-Pérez A, Zipkin P (1997) Dynamic scheduling rules for a multiproduct make-to-stock queue. Oper Res 45:919–930

Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York

Rothkopf MH (1966) Scheduling with random service times. Manag Sci 12:707–713

Sevcik KC (1974) Scheduling for minimum total loss using service time distributions. J ACM 21:66–75

Shanthikumar JG, Yao DD (1992) Multiclass queueing systems: polymatroidal structure and optimal scheduling control. Oper Res 40:S293–S299

Smith WE (1956) Various optimizers for single-stage production. Naval Res Log 3:59–66

Soman CA, Van Donk DP, Gaalman G (2004) Combined make-to-order and make-to-stock in a food production system. Int J Prod Econ 90:223–235

Sparaggis PD, Cassandras CG, Towsley D (1993) On the duality between routing and scheduling systems with finite buffer space. IEEE T Autom Contr 38:1440–1446

Stidham S Jr (1985) Optimal control of admission to a queueing system. IEEE T Autom Contr 30:705–713

Tassiulas L, Ephremides A (1993) Dynamic server allocation to parallel queues with randomly varying connectivity. IEEE T Inform Theory 39:466–478

Tsoucas P (1991) The region of achievable performance in a model of Klimov. Tech Rep RC16543, IBM TJ Watson Research Center, Yorktown Heights, New York, NY

Varaiya PP, Walrand JC, Buyukkoc C (1985) Extensions of the multiarmed bandit problem: the discounted case. IEEE T Autom Contr 30:426–439

Veatch MH, Wein LM (1996) Scheduling a multiclass make-to-stock queue: index policies and hedging points. Oper Res 44:634–647

Wang YG (1997) Error bounds for calculation of the Gittins indices. Aust J Stat 39:225–233

Wasserman KM, Bambos N (1996) Optimal server allocation to parallel queues with finite-capacity buffers. Probab Eng Inform Sc 10:279–285

Weber RR (1992) On the Gittins index for multiarmed bandits. Ann Appl Probab 2:1024–1033

Weber RR, Weiss G (1990) On an index policy for restless bandits. J Appl Probab 27:637–648

Weber RR, Weiss G (1991) Addendum to “On an index policy for restless bandits”. Adv Appl Probab 23:429–430

Wein LM (1992) Dynamic scheduling of a multiclass make-to-stock queue. Oper Res 40:724–735

Whittle P (1980) Multi-armed bandits and the Gittins index. J R Stat Soc B Met 42:143–149

Whittle P (1988) Restless bandits: activity allocation in a changing world. In: Gani J (ed) A celebration of applied probability. J appl probab, vol 25A. Applied Probability Trust, Sheffield, pp 287–298

Whittle P (1996) Optimal control: basics and beyond. Wiley, Chichester

Winston W (1977) Optimality of the shortest line discipline. J Appl Probab 14:181–189
