Deadline Scheduling as Restless Bandits

Zhe Yu, Yunjian Xu, and Lang Tong

Abstract—The problem of stochastic deadline scheduling is considered. A constrained Markov decision process model is introduced in which jobs arrive randomly at a service center with stochastic job sizes, rewards, and completion deadlines. The service provider faces random processing costs, convex non-completion penalties, and a capacity constraint that limits the simultaneous processing of jobs. Formulated as a restless multi-armed bandit problem, the stochastic deadline scheduling problem is shown to be indexable. A closed-form expression of the Whittle’s index is obtained for the case when the processing costs are constant. An upper bound on the gap-to-optimality for the Whittle’s index policy is obtained, and it is shown that the bound converges to zero as the job arrival rate and the number of available processors increase simultaneously to infinity.

Index Terms—Constrained Markov decision processes, restless multi-armed bandits, stochastic deadline scheduling, Whittle’s index.

I. INTRODUCTION

The deadline scheduling problem, in its most generic setting, is the scheduling of jobs with different workloads and deadlines for completion. Typically, not enough servers are available to satisfy all the demand; the cost of processing may vary with time, and unfinished jobs incur penalties.

In this paper, we are interested in the stochastic deadline scheduling problem where key parameters of the problem such as job arrivals, workloads, deadlines of completion, and processing costs are stochastic. In particular, we consider the problem of maximizing discounted rewards over an infinite scheduling horizon.

A prototype application of such a problem is the charging of electric vehicles (EVs) at a charging service center [2]–[4]. In such applications, EVs arrive at the service center randomly, each with its charging demand and deadline for completion. The charging cost depends on the cost of electricity at the time of charging, and a penalty is imposed when the service provider is unable to fulfill the request. Similar applications include the scheduling of packet transmission for real-time wireless networks [5], of jobs at data centers [6], of nursing personnel in hospitals [7], for internet streaming [8], and at customer service centers [9].

The stochastic deadline scheduling problem is an instance of stochastic dynamic programming, for which obtaining the optimal solution is fundamentally intractable. However, practical applications often mandate that the processing schedule be constructed in real time. This means that, in general, one may have to sacrifice optimality in favor of approximate solutions that are scalable algorithmically and have performance close to that of the optimal scheduler. An important class of such algorithms is the so-called index policies [10] that attach an index to each unfinished job, rank them according to their indices, and assign available processors to the top-ranked jobs. The index of each job is determined by the state of the job itself and is independent of the states of other jobs. Such policies offer scalable solutions if the index and ranking algorithm can be computed online. An index policy becomes especially attractive if its gap-to-optimality can be bounded and shown to be diminishing in cases of practical interest.

Z. Yu is with GEIRI North America, 250 W. Tasman Dr., San Jose, CA, 95134. L. Tong is with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA. Y. Xu is with the Department of Mechanical and Automation Engineering, the Chinese University of Hong Kong, Hong Kong SAR. This work is supported in part by the National Science Foundation under Grant CNS-1248079 and 1549989. An earlier version that contains a subset of the results presented in this paper appeared in [1].

A. Summary of Results

We formulate the stochastic deadline scheduling problem as a restless multi-armed bandit (RMAB) problem initially introduced by Whittle [11]. We examine the indexability of the problem and the performance of the Whittle’s index policy. To this end, we introduce a constrained Markov decision process (MDP) model with the objective of maximizing the expected (discounted) profit subject to a constraint on the maximum number of jobs that can be processed simultaneously. The constructed MDP model captures the randomness in job arrivals, job sizes, deadlines, and processing costs.

Next, we reformulate the MDP as an RMAB problem with simultaneous plays [11]. The RMAB problem remains intractable in general and was shown to be PSPACE-hard in [12], which is in sharp contrast to the original (rested) multi-armed bandit (MAB) problem solved by the Gittins index policy in [13]. Here we consider the celebrated Whittle’s index policy that has been shown to be optimal in some special cases [11], [14]. To this end, we first establish the indexability of the formulated RMAB problem. We then show that, for the deadline scheduling problem in particular, the pre-determined deadline and workload at the time of arrival simplify the computation of the Whittle’s index. For the case with constant processing cost, we derive the Whittle’s index in closed form, which generalizes the result of [15].

When the number of processors is finite, we first provide examples showing that the Whittle’s index policy is not optimal for the deadline scheduling problem. We show, however, that the gap-to-optimality for the Whittle’s index policy is bounded by the conditional value at risk (CVaR) [16] of the number of arrivals per unit time, which allows us to examine the performance loss as a function of the arrival rate and the number of available processors.

A major result of this paper is to characterize the asymptotic optimality of the Whittle’s index policy when the number of processors increases with the job arrival rate. In particular, we show that the gap-to-optimality goes to zero in the light traffic case, indicating a specific regime in which Whittle’s index policy is asymptotically optimal.

B. Related Work

The classical deadline scheduling problem was first considered by Liu and Layland [17] in a deterministic setting. For the single processor case, the results are quite complete. When all jobs can be finished on time, simple index algorithms (with linear complexity) such as the earliest deadline first (EDF) [17], [18] and the least laxity first (LLF) [19] achieve the same performance as the optimal off-line algorithm in the deterministic setting.

There is also substantial literature on the deadline scheduling problem with multiple processors (for a survey, see [20]). It is shown in [21] that optimal online scheduling policies do not exist in general for the worst case performance measure.

The literature on deadline scheduling in the stochastic settings is less extensive. For the single processor case, Panwar, Towsley, and Wolf in [22] and [23] made an early contribution in establishing the optimality of EDF in minimizing the unfinished work when jobs are non-preemptive. The performance of EDF is quantified in the heavy traffic regime using a diffusion model in [24]–[26].

The multiprocessor stochastic deadline scheduling problem is less understood, primarily because the stochastic dynamic programming for such problems is intractable to solve in practice. A particularly relevant class of applications is scheduling in wireless transmissions and routing in networks [27]–[30] where job (packet) arrival is stochastic, and packets sometimes have deadlines for delivery. Another class of applications is in the scheduling of (deadline-constrained) electric vehicle charging with stochastic charging costs [31]–[33]. The works closest to ours are [27]–[30], where the authors considered particular instances of the deadline scheduling problem studied in this paper. In the context of scheduling transmissions in wireless networks, the authors of [27] analyzed the performance of the EDF policy for packet delivery in tree networks. Also related is the deadline scheduling in ad hoc networks [28] where an iterative algorithm was proposed to schedule packets over random channels, and the algorithm was proved to be optimal. Random arrivals of jobs (packets) were considered in [29] where the authors formulated the problem as an RMAB problem and analyzed the indexability. Whittle’s index policy was applied, but its performance was not analyzed. The model considered in [29] is also more restrictive than the model studied in this paper. The work of [30] considers the problem of scheduling multi-hop wireless networks for packets with deadlines where the authors developed decentralized scheduling policies. The constraint on bandwidth in [30] is an average constraint, whereas the constraint treated in this paper is a strict deterministic one.

A recent related work in the operations research literature is [15] where the authors considered the RMAB formulation of the deadline scheduling in knapsack problems. The authors established the indexability of the RMAB problem and a closed form of Whittle’s index. There are several significant differences, however, between the model considered in [15] and the one in this paper. First, the job arrivals are simultaneous in [15] and stochastic in this paper. Second, the processing cost/reward is constant in [15] and random in our model. Our paper also establishes the asymptotic performance of Whittle’s index policy whereas [15] addressed the indexability and developed an iterative algorithm to compute the Whittle’s index.

There is extensive literature on the RMAB problem. See, e.g., [34], [35]. In his seminal work [11], Whittle introduced an index policy (the Whittle’s index policy) for the subclass of indexable RMAB problems. Although in general suboptimal in the finite arm regime except for some special cases [14], Whittle’s index policy was shown by Weber and Weiss in [36] to be asymptotically optimal under some conditions when the number of arms and the number of simultaneous activations grow proportionally to infinity. The optimality conditions, however, are difficult to check. We should also point out that the asymptotic optimality results established in this paper are different from those formulated in [11] and [36].

II. PROBLEM FORMULATION

In this section, we introduce the stochastic deadline scheduling problem as a constrained MDP followed by an RMAB formulation.

A. Nominal Model Assumptions

We begin with a set of nominal assumptions in setting up the MDP formulation:

A1. The time is slotted, indexed by $t$.

A2. There are $M$ processors available at all times. In each time slot, a processor can only work on one job, and each job can receive service from only one processor at any given time. A processor can be switched from one job to another without incurring switching cost.

A3. If a processor works on a job in time slot $t$, it receives a unit payment and incurs a time-varying cost $c[t]$. Here we assume that $c[t]$ is an exogenous stationary Markov process with a transition probability matrix $P = [P_{i,j}]$.

A4. If a job is not completed by its deadline, a penalty defined by a convex function of the amount of unfinished job is imposed on the scheduler at the deadline. Let $F(B)$ be the convex penalty function with $B$ denoting the amount of the unfinished job and $F(0) = 0$.

A5. There is a queue with $N$ positions. Jobs arriving at different positions are statistically independent and identically distributed (i.i.d.).

A6. A job arriving at the $i$th position of the queue at the beginning of time slot $t$ reveals $B_i$ (the total amount of work to be completed) and $T_i$ (the deadline for completion). At the end of time slot $t + T_i$, the job is removed from the queue, regardless of whether the job is completed. When the $i$th position is available, with probability $Q(T, B)$ a new job with deadline $T$ and workload $B$ arrives. With probability $Q(0, 0)$, the position remains empty. We assume the distributions of $T$ and $B$ are bounded, i.e., $0 \le T \le \bar{T}$ and $0 \le B \le \bar{B}$, where $\bar{T}$ and $\bar{B}$ are the maximum possible lead time and job length.

Some comments and clarifications on these assumptions are in order. Assumptions A1 and A2 are standard. A3 assumes that the marginal price of service (the marginal payment to the service provider) is the same for all jobs. The marginal processing cost $c[t]$ is uniform for jobs processed at the same $t$. Several generalizations of A3 are possible. In particular, by including the initial lead time in the state of a job, our model can accommodate the so-called service differentiated deadline scheduling problem [37], where jobs with different deadlines face different marginal prices. Another generalization is that the marginal price (or the cost) of service depends on the position in the queue. This, for instance, can model prioritized services.

Assumption A4 indicates that the deadline is soft, but it can be hardened by setting the non-completion penalty much higher than the processing cost. In this setting, it is always optimal (i.e., reward maximizing) to finish as many jobs as possible, regardless of the processing cost.

The i.i.d. arrival assumption in A5 is limiting but necessary for index policies. This is also consistent with the standard Poisson arrival case when the arrived job is randomly assigned to a position in the queue. A5 and A6 imply that when a job arrives at a position that is occupied by an unfinished job, the newly arrived job is dropped, which seems unreasonable since the job could have been reassigned to an open position (if it exists). However, asymptotically when $N \to \infty$, there is no loss of performance by imposing these assumptions. In Section VI-E, we numerically compare the two scenarios with i.i.d. arrivals following A5-A6 and the conventional Poisson arrival. Numerical results show that the performance of different algorithms under A5-A6 converges to its counterpart under Poisson arrival as the number of available positions increases.

B. Stochastic Deadline Scheduling as a Constrained MDP

Next, we formulate the constrained MDP by defining states, actions of the scheduler, state evolution, rewards, constraints, and decision policies.

1) State Space: Consider first the state of the $i$th position in the queue. Let $T_i[t] \triangleq d_i - t$ be the lead time to deadline $d_i$, $B_i[t]$ be the remaining job length, and $L_i[t] \triangleq T_i[t] - B_i[t]$ be the laxity of job $i$, as illustrated in Figure 1.

The state of the $i$th position in the queue is defined as

$$S_i[t] \triangleq \begin{cases} (0, 0) & \text{if no job is at the $i$th position,} \\ (T_i[t], B_i[t]) & \text{otherwise.} \end{cases}$$

The processing cost $c[t]$ evolves according to an exogenous finite state Markov chain with a transition probability matrix $P = [P_{j,k}]$. This Markovian assumption is practical for studying stochastic prices, e.g., [38], and simplifies both the model and the computation of the policies.

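The state of the MDP is defined by the queue states and the processing cost $c[t]$ as $S[t] \triangleq (c[t], S_1[t], \cdots, S_N[t]) \in \mathcal{S}$, where $\mathcal{S} \triangleq \mathcal{S}_c \times \mathcal{S}_1 \times \cdots \times \mathcal{S}_N$ is the state space of the entire system, $\mathcal{S}_c$ is the cost space, and $\mathcal{S}_i$ is the state space of position $i$.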

Fig. 1: An illustration of job $i$’s state. $r_i$ is the arrival time of a job at position $i$, $d_i$ is its deadline for completion, $B_i[t]$ is the workload to be completed by $d_i$, $T_i[t]$ is the job’s lead time to deadline, and $L_i[t] \triangleq T_i[t] - B_i[t]$ is the job’s laxity.

2) Action: The action of the scheduler in slot $t$ is defined by a binary vector $a[t] = (a_1[t], \cdots, a_N[t]) \in \{0, 1\}^N$. When $a_i[t] = 1$, a processor is assigned to work on the job at position $i$, and the position is called active. When $a_i[t] = 0$, position $i$ is passive, i.e., no processor is assigned. For notational convenience, sometimes we allow a position without a job to be activated, in which case the assigned processor receives no reward and incurs no cost.

3) State Evolution: The processing cost evolves according to the transition matrix $P$, independently of the actions taken by the scheduler. The evolution of the job state $S_i[t]$ depends on the scheduling action $a_i[t]$:

$$S_i[t+1] = \begin{cases} (T_i[t] - 1, (B_i[t] - a_i[t])^+) & \text{if } T_i[t] > 1, \\ (T, B) \text{ with prob. } Q(T, B) & \text{if } T_i[t] \le 1, \end{cases} \tag{1}$$

where $b^+ = \max(b, 0)$. Note that when $T_i[t] = 1$, the deadline is due at the end of the current time slot and the job in position $i$ will be removed.
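The transition in (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the paper's development: the function name `step_position` and the dictionary `arrival_dist`, which stands in for $Q(T, B)$ and must include the entry for $(0, 0)$, are hypothetical names introduced here.

```python
import random

def step_position(state, action, arrival_dist, rng=random):
    """One-slot transition of a single queue position, following (1).

    state: (T, B), with (0, 0) meaning the position is empty.
    action: 1 if a processor is assigned in this slot, else 0.
    arrival_dist: dict mapping (T, B) -> Q(T, B), including (0, 0).
    """
    T, B = state
    if T > 1:
        # Deadline not yet due: lead time drops by one slot and the
        # remaining workload drops by the amount of service received.
        return (T - 1, max(B - action, 0))
    # T <= 1: any job present leaves at the end of the slot, and the
    # position is refilled by a draw from Q, whatever the action was.
    outcomes = list(arrival_dist.keys())
    weights = list(arrival_dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]
```

Note that the branch for $T_i[t] \le 1$ ignores the action, which is the structural property that later makes the continuation value after a deadline identical for the active and passive actions.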

4) Reward: For each job, the scheduler obtains one unit of reward if the job is processed for one time slot. When $T_i[t] = 1$, job $i$ will reach its deadline by the end of the current time slot, and the scheduler will incur a penalty if the job is unfinished. The reward collected from job $i$ at time $t$ is given by

$$R_{a_i[t]}(S_i[t], c[t]) = \begin{cases} (1 - c[t])a_i[t] & \text{if } B_i[t] > 0,\ T_i[t] > 1, \\ (1 - c[t])a_i[t] - F(B_i[t] - a_i[t]) & \text{if } B_i[t] > 0,\ T_i[t] = 1, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$
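As a concrete reading of (2), the per-slot reward of one position can be written as a small function. The sketch below is illustrative only; the name `reward` is hypothetical, and the default quadratic penalty $F(B) = B^2$ is an assumed example of a convex penalty with $F(0) = 0$.

```python
def reward(state, cost, action, penalty=lambda b: b ** 2):
    """Per-slot reward of one position, following (2).

    state: (T, B); cost: processing cost c[t]; action: 0 or 1.
    penalty: convex non-completion penalty F with F(0) = 0
             (the quadratic default is only an example).
    """
    T, B = state
    if B <= 0 or T < 1:
        # Empty position or finished job: nothing is earned or paid.
        return 0.0
    r = (1.0 - cost) * action      # unit payment minus processing cost
    if T == 1:
        # Deadline is due at the end of this slot: pay for leftover work.
        r -= penalty(B - action)
    return r
```

For example, with $c[t] = 0.5$, serving a job in state $(1, 2)$ earns $0.5$ but still pays $F(1) = 1$ for the unit of work left at the deadline.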

5) Objective: Given the initial system state $S[0] = s$ and a policy $\pi$ that maps each system state $S[t]$ to an action vector $a[t]$, the expected discounted system reward is defined by

$$G^N_\pi(s) \triangleq \mathbb{E}_\pi\left( \sum_{t=0}^{\infty} \sum_{i=1}^{N} \beta^t R_{a_i[t]}(S_i[t], c[t]) \,\Big|\, S[0] = s \right), \tag{3}$$

where $\mathbb{E}_\pi$ is the conditional expectation over the randomness in costs and job arrivals under a given scheduling policy $\pi$ and $0 < \beta < 1$ is the discount factor.

6) Constrained MDP and Optimal Policies: We impose a constraint on the maximum number of processors that can be activated simultaneously, i.e., $\sum_{i=1}^{N} a_i[t] \le M$. This constraint represents the processing capacity of the service provider. For the EV charging application, this assumption translates directly to the physical power limit imposed on the charging facility. Thus, the deadline scheduling problem can be formulated as a constrained MDP:

$$G^N(s) = \sup_{\{\pi:\ \sum_{i=1}^{N} a^\pi_i[t] \le M,\ \forall t\}} G^N_\pi(s), \tag{4}$$

where $a^\pi_i[t]$ is the action sequence generated by policy $\pi$ for position $i$. A policy $\pi^*$ is optimal if $G^N_{\pi^*}(s) = G^N(s)$. Without loss of optimality, we will restrict our attention to stationary policies [39].

C. A Restless Multi-armed Bandit Problem

Unfortunately, the MDP formulation does not result in a scalable optimal scheduling policy because the state space grows exponentially with $N$. We, therefore, seek to obtain an effective index policy [10] that scales linearly with $N$. We identify each position in the queue as an arm and formulate (4) as an RMAB problem. To this end, “playing” an arm is equivalent to assigning a processor to process the job (if there is one) at a position in the queue. The resulting multi-armed bandit problem is restless because the state of position $i$ (the $i$th arm) evolves regardless of whether arm $i$ is active or passive.

A complication of casting (4) as an RMAB problem comes from the inequality constraint on the maximum number of simultaneously activated positions, as the standard RMAB formulation imposes an equality constraint on the number of arms that can be activated. This can be circumvented by introducing $M$ dummy arms and requiring that exactly $M$ arms be activated in each time slot. Specifically, each dummy arm $i$ always accrues zero reward, and its state stays at $S_i = (0, 0)$. The reformulated RMAB problem has $N + M$ arms. We let $\{1, \cdots, N\}$ be the set of regular arms that generate reward (penalty) and $\{N + 1, \cdots, N + M\}$ be the set of dummy arms.

We define an extended state of each arm as $\tilde{S}_i[t] \triangleq (S_i[t], c[t])$ and denote the extended state space as $\tilde{\mathcal{S}}_i \triangleq \mathcal{S}_i \times \mathcal{S}_c$. The state transition of each arm and the associated reward are inherited from (1)–(2) of the original MDP. We have the following RMAB problem that is equivalent to the original MDP (4):

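$$\begin{aligned} \sup_{\pi}\ & \mathbb{E}_\pi\left\{ \sum_{t=0}^{\infty} \sum_{i=1}^{N+M} \beta^t R_{a_i[t]}(\tilde{S}_i[t]) \,\Big|\, \tilde{S}_i[0] \right\} \\ \text{s.t.}\ & \sum_{i=1}^{N+M} a_i[t] = M, \quad \forall\, t. \end{aligned} \tag{5}$$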

In (5), arms are coupled by the processing cost. With the addition of dummy arms, the inequality constraint on the maximum number of activated arms in the original MDP problem is transformed to the equality constraint in (5).

III. WHITTLE’S INDEX POLICY

To tackle the deadline scheduling problem as an RMAB, we first establish the indexability of the RMAB and formally define the Whittle’s index policy in this section.

A. Indexability

Following [11], we consider a $\nu$-subsidized single arm reward maximization problem that seeks a policy $\pi$ to activate/deactivate a single arm to maximize the discounted accumulative reward:

$$V^\nu_i(s) = \sup_{\pi} \mathbb{E}_\pi\left( \sum_{t=0}^{\infty} \beta^t R^\nu_{a_i[t]}(\tilde{S}_i[t]) \,\Big|\, \tilde{S}_i[0] = s \right), \tag{6}$$

where the subsidized reward is given by

$$R^\nu_{a_i[t]}(\tilde{S}_i[t]) = R_{a_i[t]}(\tilde{S}_i[t]) + \nu \mathbf{1}(a_i[t] = 0).$$

Here $R_{a_i[t]}(\cdot)$ is defined in (2), and $\mathbf{1}(\cdot)$ is the indicator function. In the $\nu$-subsidized problem, the scheduler receives a subsidy $\nu$ whenever an arm is passive.

Let $\mathcal{L}_a$ be an operator on $V^\nu_i$ defined by

$$(\mathcal{L}_a V^\nu_i)(s) \triangleq \mathbb{E}\left( V^\nu_i(\tilde{S}_i[t+1]) \,\Big|\, \tilde{S}_i[t] = s,\ a_i[t] = a \right).$$

The maximum discounted reward $V^\nu_i(\cdot)$ in (6) is determined by the Bellman equation

$$V^\nu_i(s) = \max_{a \in \{0, 1\}} \left\{ R^\nu_a(s) + \beta (\mathcal{L}_a V^\nu_i)(s) \right\}. \tag{7}$$

Let $\mathcal{S}_i(\nu)$ be the set of states under which it is optimal to take the passive action in the $\nu$-subsidy problem. The indexability of the RMAB is defined by the monotonicity of $\mathcal{S}_i(\nu)$ as the subsidy level $\nu$ increases.

Definition 1 (Indexability [11]). Arm $i$ is indexable if the set $\mathcal{S}_i(\nu)$ increases monotonically from $\emptyset$ to the entire state space $\tilde{\mathcal{S}}_i$ as $\nu$ increases from $-\infty$ to $+\infty$. The RMAB problem is indexable if all arms are indexable.

We establish the indexability for the stochastic deadline scheduling problem.

Theorem 1 (Indexability). Each arm is indexable, and the RMAB problem (5) is indexable.

The indexability of the two-dimensional state model without arrivals is proved in [15] based on the partial conservation law principle. We provide an elementary proof in Appendix A that also includes the random arrivals of jobs.

B. Whittle’s Index Policy

The following definition of Whittle’s index is based on Definition 1.

Definition 2 (Whittle’s index [11]). If arm $i$ is indexable, its Whittle’s index $\nu_i(s)$ of state $s$ is the infimum of the subsidy $\nu$ under which the passive action is optimal at state $s$, i.e.,

$$\nu_i(s) \triangleq \inf_{\nu} \{\nu : R_0(s) + \nu + \beta(\mathcal{L}_0 V^\nu_i)(s) \ge R_1(s) + \beta(\mathcal{L}_1 V^\nu_i)(s)\}.$$

If arm $i$ is indexable, in a $\nu$-subsidized problem with $\nu < \nu_i(s)$ it is optimal to activate arm $i$. Likewise, if $\nu \ge \nu_i(s)$ it is optimal to deactivate arm $i$.
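Definition 2 suggests a direct numerical procedure: for a given subsidy, solve the Bellman equation (7) and check whether the passive action is optimal, then search for the smallest such subsidy. The sketch below does this by bisection for the constant-cost case. It is an illustration under assumptions, not the procedure used in the paper: the name `whittle_index_numeric`, the search bracket `[lo, hi]`, and the tolerance are hypothetical choices, and the code uses the observation from (1) that the state after a deadline is drawn from $Q$ regardless of the action, so that continuation value cancels when the two actions are compared and is set to zero.

```python
from functools import lru_cache

def whittle_index_numeric(T, B, c0, beta, F, lo=-50.0, hi=50.0, tol=1e-9):
    """Bisection on the subsidy nu for state (T, B) with constant cost c0.

    Assumes the index lies inside [lo, hi] and that the arm is indexable,
    so 'passive is optimal' switches from False to True exactly once.
    """

    def passive_advantage(nu):
        # Q-value of action a in state (t, b) of the nu-subsidized problem.
        def q(t, b, a):
            r = nu if a == 0 else 0.0          # subsidy for staying passive
            if b > 0:
                r += (1.0 - c0) * a            # processing profit, as in (2)
                if t == 1:
                    r -= F(b - a)              # non-completion penalty
            # After the deadline the next state does not depend on the
            # action, so that term cancels in comparisons; use 0 for it.
            future = value(t - 1, max(b - a, 0)) if t > 1 else 0.0
            return r + beta * future

        @lru_cache(maxsize=None)
        def value(t, b):
            return max(q(t, b, 0), q(t, b, 1))  # Bellman equation (7)

        return q(T, B, 0) - q(T, B, 1)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if passive_advantage(mid) >= 0.0:
            hi = mid        # passive already optimal: index is at most mid
        else:
            lo = mid        # active strictly better: index is above mid
    return hi
```

With $c_0 = 0.5$, $\beta = 0.9$, and the assumed penalty $F(B) = B^2$, state $(1, 1)$ gives an index of about $1.5$: staying passive forfeits the profit $1 - c_0 = 0.5$ and pays $F(1) = 1$, so the subsidy must cover both.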

To compute the Whittle’s index for arm $i$, we solve a parametric program where the subsidy $\nu$ appears in the constraints.

$$\begin{aligned} \min_{u_i(s)}\ & \sum_{s \in \tilde{\mathcal{S}}_i} p(s) u_i(s) \\ \text{s.t.}\ & u_i(s) \ge R_1(s) + \beta \sum_{s' \in \tilde{\mathcal{S}}_i} P^1_{s,s'} u_i(s'), \quad \forall s, \\ & u_i(s) \ge R_0(s) + \nu + \beta \sum_{s' \in \tilde{\mathcal{S}}_i} P^0_{s,s'} u_i(s'), \quad \forall s, \end{aligned}$$

where $s = (T, B, c)$ is the extended state of arm $i$, $p(s)$ the initial state probability, and $P^a_{s,s'}$ the transition probability from $s$ to $s'$ given action $a$. For a particular value of $\nu$, the optimal solution $u^*_i(s)$ equals the value function $V^\nu_i(s)$, and the active constraints give the optimal actions. We solve this parametric program to find the break point of $\nu$ where the optimal action changes. The simplex method can be used to solve this parametric program [40].

The unique structure of the deadline problem allows us to obtain a closed-form solution when the processing cost is time-invariant.

Theorem 2. If $c[t] = c_0$ for all $t$, the Whittle’s index of a regular arm $i \in \{1, \cdots, N\}$ is given by

$$\nu_i(T, B, c_0) = \begin{cases} 0 & \text{if } B = 0, \\ 1 - c_0 & \text{if } 1 \le B \le T - 1, \\ \beta^{T-1} F(B - T + 1) - \beta^{T-1} F(B - T) + 1 - c_0 & \text{if } T \le B. \end{cases} \tag{8}$$

The Whittle’s index of a dummy arm is zero, i.e., $\nu_i(0, 0, c_0) = 0$, $i \in \{N + 1, \cdots, N + M\}$.

The proof of Theorem 2 can be found in Appendix B. In (8), when job $i$ has positive laxity (i.e., its lead time strictly exceeds its remaining processing time), job $i$’s Whittle’s index is merely the (per-unit) processing profit $1 - c_0$. When the laxity is zero or negative, so that leaving the job idle for one slot creates or increases a non-completion penalty, the index takes into account both the processing profit and the discounted increase in the non-completion penalty. We note that the Whittle’s index gives higher priority to jobs with less laxity.
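The closed form (8) translates directly into code. The following is a minimal sketch; the function name `whittle_index_closed_form` is hypothetical, and the penalty `F` is passed in by the caller (the quadratic penalty in the usage note is an assumed example).

```python
def whittle_index_closed_form(T, B, c0, beta, F):
    """Whittle's index of state (T, B) under constant cost c0, per (8)."""
    if B == 0:
        # Empty position, finished job, or dummy arm.
        return 0.0
    if B <= T - 1:
        # Positive laxity: only the per-unit processing profit matters.
        return 1.0 - c0
    # T <= B: one idle slot raises the terminal penalty from
    # F(B - T) to F(B - T + 1), discounted back from the deadline.
    return beta ** (T - 1) * (F(B - T + 1) - F(B - T)) + 1.0 - c0
```

For instance, with $c_0 = 0.5$, $\beta = 0.9$, and $F(B) = B^2$, the states $(2, 1)$, $(2, 2)$, and $(2, 3)$ have indices $0.5$, $1.4$, and $3.2$, so for a fixed lead time the index grows as laxity shrinks.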

We are now ready to define the Whittle’s index policy based on Definition 2.

Definition 3 (Whittle’s index policy [11]). For the RMAB problem defined in (5), the Whittle’s index policy sorts all arms by their Whittle’s indices in descending order and activates the top $M$ arms.

Since the state spaces of jobs and processing costs are finite, the Whittle’s indices can be computed off-line. In real-time scheduling, at the beginning of each time slot, the scheduler looks up the indices for all existing jobs based on the current system state and processes the ones with the highest indices. When there is a tie, the scheduler breaks the tie uniformly at random.
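The selection step of Definition 3, including the uniform tie-breaking described above, can be sketched as follows. The name `whittle_policy` is hypothetical, and the caller is assumed to supply precomputed indices for all $N + M$ arms, with the dummy arms carrying index zero.

```python
import random

def whittle_policy(indices, M, rng=random):
    """Return the positions of the M arms with the largest indices.

    indices: list of Whittle's indices, one per arm (dummy arms included).
    Ties are broken uniformly at random: the arms are shuffled first, and
    Python's stable sort then keeps tied arms in that shuffled order.
    """
    order = list(range(len(indices)))
    rng.shuffle(order)
    order.sort(key=lambda i: indices[i], reverse=True)
    return sorted(order[:M])
```

For example, `whittle_policy([1.4, 0.5, 3.2, 0.0], 2)` activates arms 0 and 2. If every regular arm had a negative index, the zero-index dummy arm would be selected instead, which corresponds to leaving a processor idle.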

We note that the Whittle’s index policy does not distinguish jobs with positive laxity, which leaves some room for improvement. In Section IV-C, we apply the Less Laxity and Longer Processing Time (LLLP) principle (originally proposed in [31]) to improve the Whittle’s index policy.

IV. PERFORMANCE OF WHITTLE’S INDEX POLICY FOR FINITE-ARMED RESTLESS BANDITS

In this section, we examine the performance of Whittle’s index policy for the stochastic deadline scheduling problem when the number of servers ($M$) is finite. We show that when $M < N$, there does not exist an optimal index policy in general. Hence Whittle’s index policy is not optimal. We further derive an upper bound on the gap-to-optimality of the Whittle’s index policy. This result provides the essential ingredient for establishing asymptotic optimality of the Whittle’s index policy in Section V.

A. Performance in Finite Processor Cases

In general, Whittle’s index policy is not optimal except in some special cases [14]. For the deadline scheduling problem, the same conclusion holds. We show in fact that no optimal index policy exists in general.

Proposition 1. When $M = N$, the Whittle’s index policy is optimal. When $M < N$, an optimal index policy for the RMAB problem formulated in (5) may not exist in general.

Proof. The fact that Whittle’s index policy is optimal when $M = N$ is intuitive. A formal proof can be found in Appendix C of [41]. To show that an optimal index policy does not exist in general, it suffices to construct a counterexample in which no index policy is optimal.

Set the capacity of the queue to be $N = 3$, the number of processors $M = 1$, the discount factor $\beta = 0.4$, the penalty function $F(B) = B^2$, and the processing cost $c[t] = 1$. Assume the arrival is busy ($Q(0, 0) = 0$) and the initial laxity is zero ($T = B$ at arrival). For this small scale MDP, a linear programming formulation is used to solve for the optimal policy [42].

Consider two different states,

$$s = ((1, 1), (2, 2), (2, 2)), \qquad s' = ((1, 1), (1, 1), (2, 2)),$$

where $s = ((T_1, B_1), (T_2, B_2), (T_3, B_3)) \in \mathcal{S}$ is the state of the system including the states of each arm. The constant processing cost is omitted in the state.

For state $s$, the optimal action is to process job $(2, 2)$. In this case, the job $(2, 2)$ is preferred to $(1, 1)$. Processing $(2, 2)$ will cause an immediate penalty of $1$, and the state will change to $((T, B), (1, 1), (1, 2))$, where $(T, B)$ is a new arrival. In the next stage, a penalty of $2$ from the last two jobs will be incurred. If a policy instead processes $(1, 1)$ in state $s$, there will be no penalty in the first stage, and the state will change to $((T, B), (1, 2), (1, 2))$. The last two jobs will then incur a penalty of at least $5$.

For state $s'$, the optimal action is to process the job $(1, 1)$. The job $(1, 1)$ is preferred to $(2, 2)$ in this case. Processing $(1, 1)$ will cause an instant penalty of $1$, and the state will change to $((T, B), (T', B'), (1, 2))$, where $(T, B)$ and $(T', B')$ are new arrivals. If a policy instead processes $(2, 2)$ in state $s'$, there will be an instant penalty of $2$ from the first two jobs in the first stage and the state will change to $((T, B), (T', B'), (1, 1))$. In this case, a penalty of $1$ can be saved in the second stage by processing $(2, 2)$ in the first one. However, due to the discount factor, it is more profitable to process $(1, 1)$.
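The penalty figures quoted for $s$ and $s'$ can be checked with a short calculation. The sketch below is a simplified two-stage tally and not the linear program of [42]: it tracks only the jobs present in the initial state, ignores new arrivals, and assumes the processor serves one of the carried-over jobs in the second stage. All function names are hypothetical.

```python
BETA = 0.4                 # discount factor of the counterexample
F = lambda b: b * b        # penalty F(B) = B^2; cost c = 1, so profit is 0

def stage_penalty(jobs, served):
    """Penalty paid in this slot by jobs whose deadline is due (T == 1)."""
    total = 0
    for k, (T, B) in enumerate(jobs):
        left = max(B - (1 if k == served else 0), 0)
        if T == 1:
            total += F(left)
    return total

def advance(jobs, served):
    """Jobs carried into the next slot (T > 1); expired jobs are dropped."""
    return [(T - 1, max(B - (1 if k == served else 0), 0))
            for k, (T, B) in enumerate(jobs) if T > 1]

def two_stage_cost(jobs, served):
    """First-stage penalty plus discounted best second-stage penalty."""
    first = stage_penalty(jobs, served)
    nxt = advance(jobs, served)
    second = min(stage_penalty(nxt, k) for k in range(len(nxt))) if nxt else 0
    return first + BETA * second

s = [(1, 1), (2, 2), (2, 2)]
s_prime = [(1, 1), (1, 1), (2, 2)]
```

Under these simplifications, serving a $(2, 2)$ job in $s$ costs $1 + 0.4 \cdot 2 = 1.8$ against $0 + 0.4 \cdot 5 = 2.0$ for serving $(1, 1)$, whereas in $s'$ serving a $(1, 1)$ job costs $1 + 0.4 \cdot 1 = 1.4$ against $2 + 0.4 \cdot 0 = 2.0$ for serving $(2, 2)$. The preferred job type therefore flips between the two states.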

An index policy assigns each job an index (that depends only on the job’s current state) and processes the jobs with the highest indices [10]. Therefore, for any “index” policy, the indices of jobs $(1, 1)$ and $(2, 2)$ are the same in states $s$ and $s'$, and the preference between these two jobs should remain the same in the two cases, which is violated by the result here. This counterexample shows that no “index” policy is optimal in general.

Note that the Whittle’s index policy is an example of an index policy, and thus is sub-optimal in general. However, with particular combinations of parameters, optimal index policies may exist.

B. An Upper Bound of the Gap-to-Optimality

In the following lemma, we first establish a result that applies quite generally to the case of a finite queue size $N$ and a finite number of processors $M$.

Lemma 1. Let $G^N(s)$ be the optimal value function defined in (4) and $G^N_{W}(s)$ be the value function achieved by the Whittle’s index policy. We have

$$G^N(s) - G^N_{W}(s) \le \frac{C}{1 - \beta} \mathbb{E}\big[I^N[t] \,\big|\, I^N[t] > M\big] \Pr(I^N[t] > M), \tag{9}$$

where $I^N[t]$ is the number of jobs admitted in the queue with $N$ positions within time $[t - \bar{T} + 1, t]$, $\bar{T}$ is the maximum lead time of jobs, and $C$ is a constant determined by the processing cost and the penalty of non-completion.

The proof can be found in Appendix C. The gap-to-optimality is bounded by the tail expectation of the jobs admitted to the system. Note that the conditional expectation on the right-hand side (RHS) of (9) is connected to the conditional value at risk (CVaR) [16], which measures the expected losses at a particular risk level and is widely used in risk management.

C. Less Laxity and Longer Processing Time (LLLP) Principle

In this subsection, we apply the Less Laxity and Longer remaining Processing time (LLLP) principle (originally proposed in [31]) to improve the Whittle’s index policy. As a priority rule for stochastic deadline scheduling, the LLLP principle is defined as follows.

Definition 4 (LLLP Order [31]). Consider jobs $i$ and $j$ at time $t$. We say $j$ dominates $i$ ($j \succ i$) if $j$ has less laxity and longer remaining processing time than $i$, i.e., $L_j[t] \le L_i[t]$ and $B_j[t] \ge B_i[t]$, with at least one of the inequalities holding strictly.

LLLP defines a partial order over the jobs’ states such that the job with less laxity and longer remaining job length should be given priority. Compared to LLF, LLLP takes into account both the laxity and the remaining workload, whereas LLF considers laxity only.
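The dominance relation of Definition 4 is easy to express as a predicate on job states. In the minimal sketch below, the name `dominates` is hypothetical, states are $(T, B)$ pairs with laxity $L = T - B$, and a position in state $(0, 0)$ is treated as neither dominating nor dominated, following the convention stated for the proposed algorithm.

```python
def dominates(job_j, job_i):
    """True if job j dominates job i in the LLLP order (Definition 4).

    Each job is a state (T, B); its laxity is L = T - B.
    """
    (Tj, Bj), (Ti, Bi) = job_j, job_i
    if (Tj, Bj) == (0, 0) or (Ti, Bi) == (0, 0):
        # Empty positions and dummy arms take no part in the order.
        return False
    Lj, Li = Tj - Bj, Ti - Bi
    # Less (or equal) laxity and longer (or equal) remaining work,
    # with at least one of the two comparisons strict.
    return Lj <= Li and Bj >= Bi and (Lj < Li or Bj > Bi)
```

The order is only partial: a job in state $(5, 4)$, with laxity $1$ and workload $4$, and a job in state $(2, 2)$, with laxity $0$ and workload $2$, are incomparable, since each is ahead on one criterion.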

An LLLP interchange enhancement policy is proposed in [31]. Specifically, it is shown that applying the LLLP interchange on a policy $\pi$ leads to a policy that performs no worse than $\pi$. Numerical experiments in [31] demonstrate that the LLLP enhancement often performs significantly better than the policy to which LLLP is applied. This insight leads us to consider an LLLP enhancement on the Whittle’s index policy in the context of the RMAB approach to stochastic deadline scheduling.

Denote the set of arms by $\mathcal{N} = \{1, \cdots, N + M\}$. Consider a policy $\pi$ that activates arms (jobs) in $\mathcal{X}$ and deactivates those in $\mathcal{X}^c = \mathcal{N} \setminus \mathcal{X}$ at system state $S[t]$. We say that a policy $\pi$ follows the LLLP principle if there does not exist a pair of jobs $(i, j)$ such that $i \in \mathcal{X}$, $j \in \mathcal{X}^c$, and $j \succ i$. We propose a Whittle’s index based algorithm that activates arms with the highest indices without violating the LLLP principle.

As shown in Figure 2, the LLLP order defines a directed acyclic graph (DAG) $\mathcal{G} = \{\mathcal{N}, \mathcal{E}\}$ of all arms, where $\mathcal{N}$ represents the arm set and $\mathcal{E}$ is the edge set. An edge from $i$ to $j$ indicates that job $i$ dominates job $j$ in the sense of the LLLP order. A topological sorting is a linear ordering of the vertices so that for each directed edge $(i, j) \in \mathcal{E}$, $i$ comes before $j$ in the ordering.

Typically, topological sorting of a DAG is not unique. We employ a stable topological sorting to guarantee that the resulting ordering preserves the order of the Whittle’s indices of arms whenever possible. In the proposed algorithm, we employ a depth-first search with linear complexity in the number of vertices and edges [43]. In Figure 2, arms are pre-ordered in descending order of their Whittle’s indices, and the LLLP ordering is indicated by the directed edges.¹ The stable topological sorting gives an order of $\{1, 3, 4, 5, 6, 9, 2, 7, 8, 10\}$. The LLLP enhanced Whittle’s index policy is formulated in Algorithm 1.

Fig. 2: A directed acyclic graph indicating the LLLP order (nodes are arms 1 to 10).

Algorithm 1 Whittle’s index policy with LLLP enhancement

1. Calculate the Whittle’s indices of all arms and sort them in descending order.
2. Generate a DAG according to the LLLP principle.
3. Carry out a stable topological sort.
4. Activate the M arms with the highest priority.
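The steps of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper uses a depth-first search for the stable sort, whereas the sketch below swaps in Kahn's algorithm with a priority queue keyed by Whittle's-index rank, which is one common way to obtain an ordering that deviates from the index ranking only where a dominance edge forces it. The function names and the toy edge set are hypothetical; the edges were chosen for illustration and are not claimed to be those of Figure 2.

```python
import heapq

def stable_topological_order(ranked_arms, edges):
    """Topological order that follows ranked_arms whenever the DAG allows.

    ranked_arms: arm ids sorted by descending Whittle's index (assumed input).
    edges: iterable of (i, j) pairs meaning arm i dominates arm j (LLLP order).
    """
    rank = {arm: pos for pos, arm in enumerate(ranked_arms)}
    successors = {arm: [] for arm in ranked_arms}
    indegree = {arm: 0 for arm in ranked_arms}
    for i, j in edges:
        successors[i].append(j)
        indegree[j] += 1

    # Arms with no unplaced dominating arm, ordered by index rank.
    ready = [(rank[a], a) for a in ranked_arms if indegree[a] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, arm = heapq.heappop(ready)
        order.append(arm)
        for nxt in successors[arm]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (rank[nxt], nxt))
    if len(order) != len(ranked_arms):
        raise ValueError("dominance relation contains a cycle")
    return order

def lllp_enhanced_activation(ranked_arms, edges, m):
    """Step 4 of Algorithm 1: activate the first m arms of the stable order."""
    return stable_topological_order(ranked_arms, edges)[:m]

# Hypothetical example: ten arms pre-ordered by index, three dominance edges.
ranked = list(range(1, 11))
toy_edges = [(9, 2), (2, 7), (2, 8)]
print(stable_topological_order(ranked, toy_edges))
print(lllp_enhanced_activation(ranked, toy_edges, 3))
```

With these toy edges, arm 2 cannot be placed before arm 9, and arms 7 and 8 cannot be placed before arm 2; every other arm keeps its index rank.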

Footnote 1: In the proposed algorithm, if an arm has a state of (0, 0) (either no job or a dummy arm), it dominates no arm and no arm dominates it.


V. ASYMPTOTIC PERFORMANCE OF THE WHITTLE’S INDEX POLICY

In this section, we establish the asymptotic optimality of the Whittle’s index policy when the job arrival rate µ and the number of servers M increase to infinity simultaneously while the system stays stable.

We first consider the case when the aggregated arrival of jobs follows a Poisson distribution. Let I[t] be the total number of jobs that arrived at the system within [t − T + 1, t], recalling that T is the maximum lead time of jobs. Note that I[t] is Poisson distributed.

When the queue at the service center is finite with N positions, we assume that each position is equally likely to receive 1/N of the traffic (see footnote 2). Because a newly arrived job may be rejected when the assigned position is occupied (A5), the total number of jobs $I^N[t]$ admitted to the system in slot t satisfies $I^N[t] \le I[t]$. However, as $N \to \infty$, $I^N[t] \to I[t]$ in distribution [41]. Define

$$G(s) - G_W(s) \triangleq \limsup_{N \to \infty} \left[ G^N(s) - G^N_W(s) \right].$$

Then, by Lemma 1,

$$G(s) - G_W(s) \le \frac{C}{1-\beta}\, \mathbb{E}\big[ I[t]\, \mathbb{1}(I[t] > M) \big]. \tag{10}$$

Equation (10) characterizes the performance gap of the Whittle’s index policy in the asymptotic regime where N increases while the arrival process and the number of processors stay constant. We now examine the performance of the Whittle’s index policy when the number of processors M increases and the mean of the arrival process I[t] grows as a function of M.

Theorem 3. Suppose that the aggregated arrival I[t] is Poisson with mean µ. The Whittle’s index policy is asymptotically optimal as M → ∞ if µ < M/e. In particular,

$$G(s) - G_W(s) = O\!\left( \frac{\mu e^{-\mu}}{\sqrt{M}} \right). \tag{11}$$

The proof of Theorem 3 can be found in Appendix D. Besides showing that the Whittle’s index policy is asymptotically optimal, Theorem 3 also indicates that the gap-to-optimality decays sub-exponentially when µ grows with M at a constant rate less than 1/e. When µ grows slower than M, the gap decays to zero but at a slower rate.

In general, suppose that we do not have the aggregated Poisson arrival, but $I^N[t]$ converges in distribution to I[t] ≤ I[t] as N → ∞. If I[t] with mean µ has a light-tailed distribution, i.e., there exist constants a ≥ 1 and b ≥ 0 with

$$\Pr(I[t] \ge i) \le a \exp[-ib/\mu], \quad \forall i \ge 0, \tag{12}$$

it is shown in [44] that

$$G(s) - G_W(s) = O\!\left[ \exp\!\left(-\frac{Mb}{\mu}\right)(Mb + \mu) \right], \tag{13}$$

as M → ∞.

If I[t] has a heavy-tailed distribution with mean µ, i.e., there exist constants a > 0 and b > 2 with

$$\Pr(I[t] \ge i) \le a\mu / i^{b}, \quad \forall i > 0, \tag{14}$$

it is shown in [44] that

$$G(s) - G_W(s) = O\!\left( \mu / M^{b-1} \right), \tag{15}$$

as M → ∞.

In both cases, the Whittle’s index policy is asymptotically optimal if the arrival rate grows in the order of o(M).

Footnote 2: The thinning property of the Poisson process justifies A6.

VI. NUMERICAL RESULTS

In this section, we present numerical results to compare the performance of the Whittle’s index policy with other simple heuristic (index) policies, namely EDF (earliest deadline first) [17], LLF (least laxity first) [18], and the Whittle’s index policy with LLLP enhancement (cf. Algorithm 1).

If feasible, EDF processes M jobs with the earliest deadlines, and LLF processes M jobs with the least laxity. Both algorithms break ties randomly. Note that both policies will fully utilize the processing capacity and activate M jobs as long as there are at least M unfinished jobs in the system. The Whittle’s index policy, on the other hand, ranks all arms by the Whittle’s index and activates the first M arms, and may put some (regular) arms idle (deactivated) when the processing cost is high. The performance upper bound was obtained by replacing the strict capacity limit constraint by a constraint on the average [11].
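The two baseline rules described above can be sketched as follows. This is a minimal illustration, assuming that each job is summarized by a pair (T, B) of time to deadline and remaining job length and that laxity is T − B. The function names are hypothetical, and ties are broken by position in the queue rather than randomly so that the example is reproducible.

```python
def edf_select(jobs, m):
    """Indices of up to m unfinished jobs with the earliest deadlines.

    jobs: list of (T, B) pairs; B == 0 marks an empty or finished position.
    Ties are broken by queue position (the paper breaks ties randomly).
    """
    unfinished = [k for k, (t, b) in enumerate(jobs) if b > 0]
    return sorted(unfinished, key=lambda k: (jobs[k][0], k))[:m]

def llf_select(jobs, m):
    """Indices of up to m unfinished jobs with the least laxity T - B (assumed definition)."""
    unfinished = [k for k, (t, b) in enumerate(jobs) if b > 0]
    return sorted(unfinished, key=lambda k: (jobs[k][0] - jobs[k][1], k))[:m]

# Hypothetical queue with five positions, the last one empty.
queue = [(5, 1), (3, 3), (4, 4), (2, 1), (0, 0)]
print(edf_select(queue, 2))  # job 3 (T = 2) and job 1 (T = 3)
print(llf_select(queue, 2))  # jobs 1 and 2, both with zero laxity
```

In this toy queue the two rules disagree: EDF picks the short job with the nearest deadline, while LLF picks the two jobs that can no longer afford to wait.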

A. Time-invariant Processing Cost

We first considered a special case of problem (5) with a constant processing cost. Since the processing cost was time-invariant, it was optimal to fully utilize the capacity to process M unfinished jobs.

In Figure 3, we fixed the job arrival process and the length of the queue N and varied the processing capacity M. All policies except the EDF policy performed well and achieved an average reward close to the performance upper bound. When M/N = 1, all jobs could be finished, and all policies achieved optimality.

In Figure 4, we considered the case when M/N = 0.5 and varied the maximum queue length N. We observed that the Whittle’s index policy with LLLP enhancement and LLF achieved similar performance since both policies roughly followed the least laxity first principle. The performance of these two policies was close to the performance upper bound. The EDF policy performed poorly because it did not take the remaining job length into account. The gap between the Whittle’s index policy and the Whittle’s index policy with LLLP enhancement came from the reordering of jobs with positive laxity (cf. the discussion following Definition 3).

B. Dynamic Processing Cost

For the dynamic processing cost case, we used the real-time electricity price signal from the California Independent System Operator (CAISO) and trained a Markovian model that described the marginal processing costs (cf. Sections III and V of [45]). Each time slot of the constructed Markov chain (on processing cost) lasted for 1 hour. For each time slot, the real-time price was quantized into discrete price states, and the transition probability (of the Markov chain) was simply the frequency with which the price changed from one state to another.

Fig. 3: Performance comparison with a constant processing cost: c[t] = 0.5, Q(0, 0) = 0.3, T = 12, B = 9, β = 0.999, F(B) = 0.2B^2, N = 10. (Horizontal axis: M/N; vertical axis: total reward/N ($); curves: upper bound, EDF, LLF, Whittle’s index, Whittle’s + LLLP.)

Fig. 4: Performance comparison with a constant processing cost: c[t] = 0.5, Q(0, 0) = 0.3, T = 12, B = 9, β = 0.999, F(B) = 0.2B^2, M/N = 0.5. (Horizontal axis: N; vertical axis: total reward/N ($).)
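The frequency-based estimate of the price transition probabilities described for the dynamic processing cost model can be sketched as below. This is an illustration under assumptions, not the training procedure of [45]: the quantization thresholds, the toy price series, and the uniform fallback for states that are never visited are all hypothetical choices.

```python
import bisect

def quantize(prices, thresholds):
    """Map each price to a discrete state index using sorted thresholds (assumed bins)."""
    return [bisect.bisect_right(thresholds, p) for p in prices]

def estimate_transition_matrix(states, n_states):
    """Empirical transition frequencies between consecutive states."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # State never left in the data: fall back to uniform (an assumption).
            matrix.append([1.0 / n_states] * n_states)
        else:
            matrix.append([c / total for c in row])
    return matrix

# Hypothetical hourly prices and two thresholds giving three price states.
hourly_prices = [20, 35, 50, 30, 22, 60, 41, 25]
price_states = quantize(hourly_prices, [30, 45])
P = estimate_transition_matrix(price_states, 3)
print(price_states)
for row in P:
    print(row)
```

Each row of the estimated matrix sums to one, so it can be used directly as the matrix of transition probabilities $P_{j,k}$ between cost states.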

In Figure 5, we fixed the job arrival process and the maximum queue length N = 10 and varied the processing capacity M. When the processing limit was low and M/N was small, there were not enough processors to finish all jobs, and the non-completion penalty dominated the processing profit. In this case, the performance of different policies was close because the limited processing resource left little flexibility. When the processing capacity was adequate and M/N = 1, all jobs could be finished on time. In this case, the Whittle’s index policy solved the problem optimally and achieved the upper bound (which was consistent with Proposition 1). The LLLP enhancement did not make a difference because the Whittle’s index policy followed the LLLP principle in this case. EDF and LLF did not utilize any information about the stochastic processing cost process and achieved sub-optimal performance. When the processing capacity constraint was neither too tight (M/N ≈ 0) nor too loose (M/N ≈ 1), the Whittle’s index policy with the LLLP principle tended to break large unfinished jobs (with long remaining processing time) into smaller jobs and therefore improved the overall performance by processing more tasks when the processing cost was low and reducing the non-completion penalty.

Fig. 5: Performance comparison with dynamic processing costs: Q(0, 0) = 0.3, T = 12, B = 9, β = 0.999, F(B) = 0.2B^2, N = 10. (Horizontal axis: M/N; vertical axis: total reward/N ($).)

Fig. 6: Performance comparison with dynamic processing costs: Q(0, 0) = 0.3, T = 12, B = 9, β = 0.999, F(B) = 0.2B^2, M/N = 0.5. (Horizontal axis: N; vertical axis: total reward ($).)

In Figure 6, we compared the performance of different policies by fixing the ratio M/N = 0.5 and varying the maximum queue length N. Both the EDF and LLF policies sought to activate as many jobs as possible, up to the processing capacity M. The Whittle’s index policy, on the other hand, took pricing fluctuation into account: it processed more jobs at price valleys and kept processors idle when the processing cost was high. Building on the Whittle’s index policy, the LLLP enhancement further reduced the penalty of unfinished jobs and improved the performance. The total reward achieved by the Whittle’s index policy with LLLP enhancement was more than 1.7 times that obtained by EDF, and the performance gap between the Whittle’s index policy with LLLP enhancement and the LLF policy was over 25%. We also noticed that the LLLP principle improved the Whittle’s index policy by around 10%.

C. Asymptotic Optimality

Fig. 7: Gap-to-optimality of different index policies under dynamic processing costs: Q(0, 0) = 0.3, T = 12, B = 9, β = 0.999, F(B) = 0.2B^2, µ = M, N = 1000. (Horizontal axis: M; vertical axis: total reward gap; curves: G(s) − G_W(s), G(s) − G_EDF(s), G(s) − G_LLF(s), G(s) − G_W+LLLP(s), and the bound on the gap in (11).)

In Figure 7, simulation results are presented to compare the performance achieved by various heuristic policies and to validate the theoretical results established in Lemma 1.

In this simulation, we fixed the queue size N = 1000 and varied the processing capacity M as a parameter. The arrival sequence within T time slots was generated from a Poisson process with mean µ = M. The dynamic cost evolved according to a Markovian model that was trained using real-time electricity price signals from CAISO. Each time slot of the constructed Markov chain lasted for 1 hour, and the entire simulation horizon lasted for 300 days (with 24 × 300 time slots).

The EDF and LLF policies did not take into account the dynamics of processing costs, and their gap-to-optimality increased as both the job arrival rate and the processing capacity grew, as shown in Figure 7. On the other hand, the gap between the total rewards achieved by the Whittle’s index policy and the performance upper bound quickly decreased to zero as M increased. We noted that the Whittle’s index policy’s actual gap-to-optimality was less than the performance gap bound derived in (11), as shown in Theorem 3. We also showed in Figure 7 the gap-to-optimality for the LLLP enhanced Whittle’s index policy. The performance gap between the Whittle’s index policy and the LLLP enhanced one was small because the arrival traffic was relatively light.

D. Hard Deadlines

In this subsection, we examine the performance of the proposed algorithms in a setting with hard deadlines. In this setting, we seek to finish as many jobs as possible regardless of the processing cost. Our framework can incorporate the hard deadline scenario by setting the non-completion penalty much higher than the processing costs. In our simulation, we set the processing cost c = 0.95 and considered a linear penalty function with a slope of 10, F(B) = 10B. In this setting, it was optimal (i.e., reward maximizing) to finish as many jobs as possible.

Fig. 8: Job completion ratio: c[t] = 0.95, Q(0, 0) = 0.3, T = 12, B = 9, β = 0.999, F(B) = 10B, N = 100. (Horizontal axis: M/N; vertical axis: completion ratio; curves: Whittle’s index, Whittle’s + LLLP, Whittle’s + LLSP, EDF, LLF.)

The ratios of completed jobs achieved by various algorithms are plotted in Figure 8. We noted that the Whittle’s index policy outperformed the EDF and LLF policies. Although the LLLP principle improved the Whittle’s index policy in the sense of total reward, it completed fewer jobs, as LLLP can result in many small unfinished jobs. Interestingly, we observed from Figure 8 that the Less Laxity Smaller Processing time (LLSP) principle could significantly enhance the job completion ratio achieved by the Whittle’s index policy. The LLSP enhancement is the same as the LLLP enhancement (introduced in Section IV-C), except that priority is given to smaller unfinished jobs instead of larger unfinished ones.

E. Validation of Assumptions A5-A6

We conducted numerical experiments to evaluate the effect of the independent arrival assumption in A5-A6. We considered two scenarios. In the first scenario, the job arrivals at each position followed an independent binomial distribution (according to A5-A6). In the second scenario, the aggregate job arrivals at the system followed a Poisson distribution with the same mean as that in the first scenario. When a job arrived at the system, it was randomly assigned to one of the empty positions following a uniform distribution.

Fig. 9: Comparison between Poisson arrival and independent arrival under dynamic processing cost: Q(0, 0) = 0.3, T = 12, B = 9, β = 0.999, F(B) = 0.2B^2, M = 10, µ = M. (Horizontal axis: N; vertical axis: total reward; curves: Whittle’s index, upper bound, Whittle’s + LLLP, EDF, and LLF, each with A5-A6 and with Poisson arrival.)

We let the number of available processors be M = 10 and fixed the mean of the total job arrivals (within T time slots) as µ = M. As shown in Figure 9, as the number of available positions in the queue increased, the performance of the different algorithms under A5-A6 converged to its counterpart under Poisson arrival.
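The convergence seen in Figure 9 is in line with the classical Poisson approximation of a binomial count. The sketch below illustrates that effect in isolation, under a simplifying assumption that is not part of the paper's simulation: each of the N positions independently receives a job with probability µ/N and rejections due to occupied positions are ignored, so the aggregate count is Binomial(N, µ/N). The code computes the total variation distance between this count and a Poisson count with mean µ; all function names are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Binomial(n, p) probability of k, computed in the log domain for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(k, mu):
    """Poisson(mu) probability of k, computed in the log domain."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def total_variation(n, mu):
    """Total variation distance between Binomial(n, mu/n) and Poisson(mu)."""
    if not 0 < mu < n:
        raise ValueError("requires 0 < mu < n")
    p = mu / n
    inside = sum(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu))
                 for k in range(n + 1))
    # The Poisson law also puts (tiny) mass on counts above n.
    poisson_tail = max(0.0, 1.0 - sum(poisson_pmf(k, mu) for k in range(n + 1)))
    return 0.5 * (inside + poisson_tail)

for n in (10, 100, 1000):
    print(n, total_variation(n, 2.0))
```

The printed distances shrink as the number of positions grows, consistent with Le Cam's inequality, which bounds this distance by µ²/N in the setting assumed here.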

VII. CONCLUSION

We consider the problem of large scale deadline scheduling, a problem that has a wide range of applications in call centers, cloud computing, and EV charging. In such settings, it is essential to develop highly efficient online scheduling algorithms. To this end, the index policies considered in this paper are attractive for their simplicity of implementation and their versatility in incorporating various operational uncertainties. It is particularly reassuring that the upper bound on the gap-to-optimality of the Whittle’s index policy converges to zero, thus establishing the asymptotic optimality of the Whittle’s index policy in the light traffic regime.

APPENDIX A
PROOF OF THEOREM 1

In [15], the indexability of the bi-dimensional state model is proved without arrivals. In this appendix, we provide an elementary proof of the indexability of the RMAB problem formulated in (5) with random arrivals. In particular, we show that for any state s of an arm, there is a critical $\nu(s)$ such that the first term in the Bellman equation (7) is larger than or equal to the second term in the single-arm ν-subsidy problem if and only if $\nu \ge \nu(s)$.

A. Indexability of Dummy Arms

The indexability of dummy arms is straightforward. For $i \in \{N+1, \cdots, N+M\}$, there is no job arrival, and only the processing cost evolves. The Bellman equation of the ν-subsidy problem is given by

$$V_i^{\nu}(0,0,c_j) = \max\Big\{ \beta \sum_k P_{j,k} V_i^{\nu}(0,0,c_k) + \nu,\; \beta \sum_k P_{j,k} V_i^{\nu}(0,0,c_k) \Big\}.$$

If and only if ν ≥ 0, the first term is larger than or equal to the second term and it is optimal to deactivate the dummy arm. Otherwise, the active action is optimal. So a dummy arm is indexable and its Whittle’s index is $\nu_i(0,0,c_j) = 0$.

B. Indexability of Regular Arms

Proof. We now prove the indexability of regular arms by induction. We first show that the Whittle’s index $\nu_i(T,B,c_j)$ exists for T ≤ 1 and all B and $c_j$, and establish some useful properties of the difference of the value function $g^{\nu}(T,B,c_j) \triangleq V_i^{\nu}(T,B+1,c_j) - V_i^{\nu}(T,B,c_j)$ for the case T = 1. Then, under the conditions that the Whittle’s index $\nu_i(T,B,c_j)$ exists and the property of $g^{\nu}(T,B,c_j)$ holds for T = t − 1, we show that $\nu_i(T,B,c_j)$ exists and the property of $g^{\nu}(T,B,c_j)$ holds for T = t.

1) T = 0: There is no job waiting in the position. The Bellman equation is stated as

$$V_i^{\nu}(0,0,c_j) = \max\{ \nu + \beta W_j^{\nu},\; \beta W_j^{\nu} \},$$

where

$$W_j^{\nu} = \sum_{T'} \sum_{B} \sum_{k} Q(T',B)\, P_{j,k}\, V_i^{\nu}(T',B,c_k) + Q(0,0) \sum_k P_{j,k}\, V_i^{\nu}(0,0,c_k)$$

is the expected reward of possible arrivals. If and only if ν ≥ 0, the first term is larger than or equal to the second and the passive action is optimal. Thus $\nu_i(0,0,c_j) = 0$.

2) T = 1: There are two cases.

• If B = 0, the Bellman equation is stated as

$$V_i^{\nu}(1,0,c_j) = \max\{ \nu + \beta W_j^{\nu},\; \beta W_j^{\nu} \}.$$

Thus $\nu_i(1,0,c_j) = 0$.

• If B ≥ 1, the Bellman equation is stated as

$$V_i^{\nu}(1,B,c_j) = \max\{ \nu - F(B) + \beta W_j^{\nu},\; 1 - c_j - F(B-1) + \beta W_j^{\nu} \}.$$

If and only if $\nu \ge 1 - c_j + F(B) - F(B-1)$, the passive action is optimal.

Thus the Whittle’s index for T = 1 exists, and the closed form is given by

$$\nu_i(1,B,c_j) = \begin{cases} 0, & \text{if } B = 0; \\ 1 - c_j + F(B) - F(B-1), & \text{if } B \ge 1. \end{cases} \tag{16}$$
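As a quick sanity check of (16), the index for T = 1 can be recovered numerically as the subsidy at which the passive and active terms of the Bellman equation coincide. The sketch below is illustrative only: the function names, the bisection bracket, and the example parameters (the quadratic penalty and the cost value) are assumptions chosen for the demonstration. The common term $\beta W_j^{\nu}$ appears in both branches and therefore cancels.

```python
def passive_minus_active(nu, b, cost, penalty):
    """Passive term minus active term of the T = 1 Bellman equation for B >= 1.

    The shared continuation term beta * W cancels in the difference.
    """
    passive = nu - penalty(b)
    active = 1.0 - cost - penalty(b - 1)
    return passive - active

def index_by_bisection(b, cost, penalty, lo=-100.0, hi=100.0, tol=1e-9):
    """Smallest subsidy (within tol) at which the passive action is at least as good."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if passive_minus_active(mid, b, cost, penalty) >= 0.0:
            hi = mid
        else:
            lo = mid
    return hi

def closed_form_index(b, cost, penalty):
    """Equation (16)."""
    if b == 0:
        return 0.0
    return 1.0 - cost + penalty(b) - penalty(b - 1)

def quadratic_penalty(b):
    # Convex penalty with an assumed coefficient, used only for this example.
    return 0.2 * b * b

for b in (1, 2, 3):
    print(b, closed_form_index(b, 0.5, quadratic_penalty),
          index_by_bisection(b, 0.5, quadratic_penalty))
```

The two columns agree up to the bisection tolerance, and the index grows with B because the penalty is convex.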

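Let the difference of the value function be

$$g^{\nu}(T,B,c_j) \triangleq V_i^{\nu}(T,B+1,c_j) - V_i^{\nu}(T,B,c_j).$$

We note that the difference of the value function is continuous and piecewise linear in ν. Specifically, denote by $\mathcal{G}$ the set of functions of ν such that $g(\nu) \in \mathcal{G}$ if and only if $g(\nu)$ is a continuous piecewise linear function of ν and there exist $\underline{\nu}$ and $\bar{\nu}$ such that $\partial g(\nu)/\partial \nu \ge -1$ when $\nu \in [\underline{\nu}, \bar{\nu}]$, and $\partial g(\nu)/\partial \nu = 0$ when $\nu \notin [\underline{\nu}, \bar{\nu}]$. We show that, when T = 1, $g^{\nu}(T,B,c_j) \in \mathcal{G}$.

• If B = 0,

$$g^{\nu}(1,0,c_j) = V_i^{\nu}(1,1,c_j) - V_i^{\nu}(1,0,c_j).$$

– If $\nu_i(1,1,c_j) > \nu_i(1,0,c_j) = 0$,

$$g^{\nu}(1,0,c_j) = \begin{cases} 1 - c_j, & \text{if } \nu < 0; \\ 1 - c_j - \nu, & \text{if } 0 \le \nu < \nu_i(1,1,c_j); \\ -F(1), & \text{if } \nu_i(1,1,c_j) \le \nu. \end{cases}$$

– If $\nu_i(1,1,c_j) \le \nu_i(1,0,c_j) = 0$,

$$g^{\nu}(1,0,c_j) = \begin{cases} 1 - c_j, & \text{if } \nu < \nu_i(1,1,c_j); \\ \nu - F(1), & \text{if } \nu_i(1,1,c_j) \le \nu < 0; \\ -F(1), & \text{if } 0 \le \nu. \end{cases}$$

• If B ≥ 1,

$$g^{\nu}(1,B,c_j) = V_i^{\nu}(1,B+1,c_j) - V_i^{\nu}(1,B,c_j).$$

Since $\nu_i(1,B+1,c_j) \ge \nu_i(1,B,c_j)$ by (16),

$$g^{\nu}(1,B,c_j) = \begin{cases} F(B-1) - F(B), & \text{if } \nu < \nu_i(1,B,c_j); \\ 1 - c_j - \nu, & \text{if } \nu_i(1,B,c_j) \le \nu < \nu_i(1,B+1,c_j); \\ F(B) - F(B+1), & \text{if } \nu_i(1,B+1,c_j) \le \nu. \end{cases}$$

So $g^{\nu}(1,B,c_j)$ is continuous and piecewise linear in ν, and there exist $\underline{\nu}$ and $\bar{\nu}$ such that $\partial g^{\nu}(1,B,c_j)/\partial \nu \ge -1$ when $\nu \in [\underline{\nu}, \bar{\nu}]$ and $\partial g^{\nu}(1,B,c_j)/\partial \nu = 0$ otherwise.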

3) T ≥ 2: Assuming that the Whittle’s index $\nu_i(T,B,c_j)$ exists and $g^{\nu}(T,B,c_j) \in \mathcal{G}$ for T = t − 1, we show that $\nu_i(T,B,c_j)$ exists and $g^{\nu}(T,B,c_j) \in \mathcal{G}$ for the case T = t.

First, we show the existence of $\nu_i(T,B,c_j)$ when T = t.

• If B = 0, the Bellman equation is stated as follows.

$$V_i^{\nu}(t,0,c_j) = \max\Big\{ \beta \sum_k P_{j,k} V_i^{\nu}(t-1,0,c_k) + \nu,\; \beta \sum_k P_{j,k} V_i^{\nu}(t-1,0,c_k) \Big\}.$$

If and only if ν ≥ 0, the first term is larger than or equal to the second term and the passive action is optimal. Thus $\nu_i(t,0,c_j) = 0$.

• If B ≥ 1, the Bellman equation is stated as follows.

$$V_i^{\nu}(t,B,c_j) = \max\Big\{ \beta \sum_k P_{j,k} V_i^{\nu}(t-1,B,c_k) + \nu,\; \beta \sum_k P_{j,k} V_i^{\nu}(t-1,B-1,c_k) + 1 - c_j \Big\}. \tag{17}$$

Denote the difference between the two actions as

$$f^{\nu}(t,B,c_j) \triangleq \beta \sum_k P_{j,k}\, g^{\nu}(t-1,B-1,c_k) + \nu - (1 - c_j),$$

where

$$g^{\nu}(t-1,B-1,c_k) = V_i^{\nu}(t-1,B,c_k) - V_i^{\nu}(t-1,B-1,c_k).$$

Since $g^{\nu}(t-1,B-1,c_k) \in \mathcal{G}$ by assumption, $f^{\nu}(t,B,c_j)$ is continuous and piecewise linear in ν. Let

$$\underline{\nu}(t,B,c_j) \triangleq \min_k \underline{\nu}(t-1,B-1,c_k), \qquad \bar{\nu}(t,B,c_j) \triangleq \max_k \bar{\nu}(t-1,B-1,c_k),$$

where $\partial g^{\nu}(t-1,B-1,c_k)/\partial \nu \ge -1$ if and only if $\nu \in [\underline{\nu}(t-1,B-1,c_k), \bar{\nu}(t-1,B-1,c_k)]$. We have

$$\frac{\partial f^{\nu}(t,B,c_j)}{\partial \nu} \begin{cases} \ge 0, & \text{if } \nu \in [\underline{\nu}(t,B,c_j), \bar{\nu}(t,B,c_j)]; \\ = 1, & \text{otherwise}. \end{cases}$$

So $f^{\nu}(t,B,c_j)$ is continuous and non-decreasing in ν. When ν → −∞, $f^{\nu}(t,B,c_j) \to -\infty$. When ν → +∞, $f^{\nu}(t,B,c_j) \to +\infty$. Thus $f^{\nu}(t,B,c_j)$ crosses the ν-axis. Define $\nu_i(t,B,c_j) \triangleq \min\{\nu : f^{\nu}(t,B,c_j) = 0\}$. If and only if $\nu \ge \nu_i(t,B,c_j)$, the first term in (17) is larger than or equal to the second term and the passive action is optimal. By definition, $\nu_i(t,B,c_j)$ is the Whittle’s index.

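The existence of $\nu_i(t,B,c_j)$ is shown. Next we show that $g^{\nu}(t,B,c_j) \in \mathcal{G}$.

• If B = 0,

$$g^{\nu}(t,0,c_j) = V_i^{\nu}(t,1,c_j) - V_i^{\nu}(t,0,c_j).$$

– If $\nu_i(t,1,c_j) > \nu_i(t,0,c_j) = 0$,

$$g^{\nu}(t,0,c_j) = \begin{cases} 1 - c_j, & \text{if } \nu < 0; \\ 1 - c_j - \nu, & \text{if } 0 \le \nu < \nu_i(t,1,c_j); \\ \beta \sum_k P_{j,k}\, g^{\nu}(t-1,0,c_k), & \text{if } \nu_i(t,1,c_j) \le \nu. \end{cases}$$

– If $\nu_i(t,1,c_j) \le \nu_i(t,0,c_j) = 0$,

$$g^{\nu}(t,0,c_j) = \begin{cases} 1 - c_j, & \text{if } \nu < \nu_i(t,1,c_j); \\ \nu + \beta \sum_k P_{j,k}\, g^{\nu}(t-1,0,c_k), & \text{if } \nu_i(t,1,c_j) \le \nu < 0; \\ \beta \sum_k P_{j,k}\, g^{\nu}(t-1,0,c_k), & \text{if } 0 \le \nu. \end{cases}$$

Thus, $g^{\nu}(t,0,c_j)$ is a linear combination of $g^{\nu}(t-1,0,c_k)$. Since $g^{\nu}(t-1,0,c_k) \in \mathcal{G}$ for all $c_k$ by assumption, we have $g^{\nu}(t,0,c_j) \in \mathcal{G}$ as well.

• If B ≥ 1,

$$g^{\nu}(t,B,c_j) = V_i^{\nu}(t,B+1,c_j) - V_i^{\nu}(t,B,c_j).$$

– If $\nu_i(t,B+1,c_j) > \nu_i(t,B,c_j)$,

$$g^{\nu}(t,B,c_j) = \begin{cases} \beta \sum_k P_{j,k}\, g^{\nu}(t-1,B-1,c_k), & \text{if } \nu < \nu_i(t,B,c_j); \\ 1 - c_j - \nu, & \text{if } \nu_i(t,B,c_j) \le \nu < \nu_i(t,B+1,c_j); \\ \beta \sum_k P_{j,k}\, g^{\nu}(t-1,B,c_k), & \text{if } \nu_i(t,B+1,c_j) \le \nu. \end{cases}$$

– If $\nu_i(t,B+1,c_j) \le \nu_i(t,B,c_j)$,

$$g^{\nu}(t,B,c_j) = \begin{cases} \beta \sum_k P_{j,k}\, g^{\nu}(t-1,B-1,c_k), & \text{if } \nu < \nu_i(t,B+1,c_j); \\ \beta \sum_k P_{j,k} \big[ g^{\nu}(t-1,B,c_k) + g^{\nu}(t-1,B-1,c_k) \big] + \nu - (1 - c_j), & \text{if } \nu_i(t,B+1,c_j) \le \nu < \nu_i(t,B,c_j); \\ \beta \sum_k P_{j,k}\, g^{\nu}(t-1,B,c_k), & \text{if } \nu_i(t,B,c_j) \le \nu. \end{cases}$$

Clearly, $g^{\nu}(t,B,c_j)$ is a linear combination of $g^{\nu}(t-1,B,c_k)$ and $g^{\nu}(t-1,B-1,c_k)$. Since $g^{\nu}(t-1,B,c_k) \in \mathcal{G}$ for all B and $c_k$ by assumption, we have $g^{\nu}(t,B,c_j) \in \mathcal{G}$ as well.

Thus, by induction, the Whittle’s index $\nu_i(T,B,c_j)$ exists and $g^{\nu}(T,B,c_j) \in \mathcal{G}$ for all T, B, and $c_j$.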

APPENDIX B
PROOF OF THEOREM 2

Proof. Since the processing cost $c_0$ is constant, we omit the cost from the state of arms for simplicity. In Appendix A-A we have shown that the Whittle’s index of the dummy arms is $\nu_i(0,0) = 0$. For regular arms, we showed in (16) that $\nu_i(1,0) = 0$ and $\nu_i(1,B) = 1 - c_0 + F(B) - F(B-1)$ when B ≥ 1. Next, we derive the closed form of the Whittle’s index for the case T ≥ 2 by induction.

A. T = 2

The discussion is divided into two cases.

• If B = 1,

$$V_i^{\nu}(2,1) = \max\{ \nu + \beta V_i^{\nu}(1,1),\; 1 - c_0 + \beta V_i^{\nu}(1,0) \}.$$

The difference between the passive and active actions,

$$f^{\nu}(2,1) = \nu - (1 - c_0) + \beta g^{\nu}(1,0) = \begin{cases} \nu - (1-\beta)(1 - c_0), & \text{if } \nu < 0; \\ (1-\beta)[\nu - (1 - c_0)], & \text{if } 0 \le \nu < 1 - c_0 + F(1); \\ \nu - (1 - c_0) - \beta F(1), & \text{if } 1 - c_0 + F(1) \le \nu; \end{cases}$$

equals 0 when $\nu = 1 - c_0$. Thus $\nu_i(2,1) = 1 - c_0$.

• If B ≥ 2, the Bellman equation is stated as follows.

$$V_i^{\nu}(2,B) = \max\{ \nu + \beta V_i^{\nu}(1,B),\; 1 - c_0 + \beta V_i^{\nu}(1,B-1) \}.$$

Let $\Delta F(B) = F(B) - F(B-1)$. The difference between the passive and active actions,

$$f^{\nu}(2,B) = \nu - (1 - c_0) + \beta g^{\nu}(1,B-1) = \begin{cases} \nu - (1 - c_0) - \beta \Delta F(B-1), & \text{if } \nu < 1 - c_0 + \Delta F(B-1); \\ (1-\beta)[\nu - (1 - c_0)], & \text{if } 1 - c_0 + \Delta F(B-1) \le \nu < 1 - c_0 + \Delta F(B); \\ \nu - (1 - c_0) - \beta \Delta F(B), & \text{if } 1 - c_0 + \Delta F(B) \le \nu; \end{cases}$$

equals 0 when $\nu = 1 - c_0 + \beta[F(B-1) - F(B-2)]$. Thus $\nu_i(2,B) = 1 - c_0 + \beta[F(B-1) - F(B-2)]$ when B ≥ 2.

So (8) is true when T = 2.

B. T > 2

Assuming that equation (8) holds when T = t − 1, we show that it holds when T = t.

• If B = 1,

$$V_i^{\nu}(t,1) = \max\{ \nu + \beta V_i^{\nu}(t-1,1),\; 1 - c_0 + \beta V_i^{\nu}(t-1,0) \}.$$

The difference between actions is

$$f^{\nu}(t,1) = \nu - (1 - c_0) + \beta g^{\nu}(t-1,0) = \begin{cases} \nu - (1-\beta)(1 - c_0), & \text{if } \nu < 0; \\ (1-\beta)[\nu - (1 - c_0)], & \text{if } 0 \le \nu < 1 - c_0; \\ \nu - (1 - c_0) + \beta^2 g^{\nu}(t-2,0), & \text{if } 1 - c_0 \le \nu. \end{cases}$$

The last case can be rewritten as

$$\nu - (1 - c_0) + \beta^2 g^{\nu}(t-2,0) = (1-\beta)[\nu - (1 - c_0)] + \beta[\nu - (1 - c_0)] + \beta^2 [V_i^{\nu}(t-2,1) - V_i^{\nu}(t-2,0)],$$

which equals 0 when $\nu = 1 - c_0$ since by assumption $\nu_i(t-1,1) = 1 - c_0$. Thus $\nu_i(t,1) = 1 - c_0$.

• If 2 ≤ B ≤ t − 2, the difference between actions is stated as follows.

$$f^{\nu}(t,B) = \nu - (1 - c_0) + \beta g^{\nu}(t-1,B-1) = \begin{cases} \beta^2 g^{\nu}(t-2,B-2) + \nu - (1 - c_0), & \text{if } \nu < 1 - c_0; \\ \beta^2 g^{\nu}(t-2,B-1) + \nu - (1 - c_0), & \text{if } 1 - c_0 \le \nu. \end{cases}$$

The latter case equals 0 when $\nu = 1 - c_0$ because $\nu_i(t-1,B) = 1 - c_0$ when 2 ≤ B ≤ t − 2 by assumption. Thus $\nu_i(t,B) = 1 - c_0$ when 2 ≤ B ≤ t − 2.

• If B = t − 1,

$$f^{\nu}(t,B) = \nu - (1 - c_0) + \beta g^{\nu}(t-1,B-1) = \begin{cases} \nu - (1 - c_0) + \beta^2 g^{\nu}(t-2,B-2), & \text{if } \nu < 1 - c_0; \\ (1-\beta)[\nu - (1 - c_0)], & \text{if } 1 - c_0 \le \nu < 1 - c_0 + \beta^{t-2} F(1); \\ \nu - (1 - c_0) + \beta^2 g^{\nu}(t-2,B-1), & \text{if } 1 - c_0 + \beta^{t-2} F(1) \le \nu; \end{cases}$$

equals 0 when $\nu = 1 - c_0$. So $\nu_i(t,B) = 1 - c_0$ when B = t − 1.

• If B ≥ t,

$$f^{\nu}(t,B) = \nu - (1 - c_0) + \beta g^{\nu}(t-1,B-1) = \begin{cases} \nu - (1 - c_0) + \beta^2 g^{\nu}(t-2,B-2), & \text{if } \nu < \nu_i(t-1,B-1); \\ (1-\beta)[\nu - (1 - c_0)], & \text{if } \nu_i(t-1,B-1) \le \nu < \nu_i(t-1,B); \\ \nu - (1 - c_0) + \beta^2 g^{\nu}(t-2,B-1), & \text{if } \nu_i(t-1,B) \le \nu. \end{cases} \tag{18}$$

If $\nu < \nu_i(t-1,B-1)$, according to (8),

$$\nu < \nu_i(t-1-T', B-1-T') \le \nu_i(t-1-T', B-T'),$$

for all 0 ≤ T′ ≤ t − 1. Thus the first case of (18) can be written as

$$\begin{aligned} \nu - (1 - c_0) + \beta^2 g^{\nu}(t-2,B-2) &= \nu - (1 - c_0) + \beta^3 g^{\nu}(t-3,B-3) \\ &= \cdots \\ &= \nu - (1 - c_0) + \beta^{t-1} g^{\nu}(1,B-t+1) \\ &= \nu - (1 - c_0) + \beta^{t-1}[-F(B-t+1) + F(B-t)]. \end{aligned}$$

As a result, when $\nu = 1 - c_0 + \beta^{t-1}[F(B-t+1) - F(B-t)]$, the first case in equation (18) equals 0. Thus when B ≥ t, the closed form of the index is

$$\nu_i(t,B) = 1 - c_0 + \beta^{t-1}[F(B-t+1) - F(B-t)].$$

We therefore conclude that (8) holds when T = t. By induction, we have established (8) for all T.
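The cases derived in Appendices A and B can be collected into a single function for the constant-cost index. The sketch below is assembled from those cases only (index 0 for empty positions and for B = 0, index $1 - c_0$ for 1 ≤ B ≤ T − 1, and the discounted marginal penalty term for B ≥ T); the function name, argument order, and example parameters are hypothetical.

```python
def whittle_index_constant_cost(t, b, c0, beta, penalty):
    """Closed-form Whittle's index for state (T, B) = (t, b) under constant cost c0.

    penalty: the non-completion penalty function F, assumed convex with F(0) = 0.
    """
    if t == 0 or b == 0:
        # Empty position, dummy arm, or nothing left to process.
        return 0.0
    if b <= t - 1:
        # Positive laxity: the job can still be finished without penalty.
        return 1.0 - c0
    # B >= T: even with continuous processing, B - T + 1 units are at stake.
    return 1.0 - c0 + beta ** (t - 1) * (penalty(b - t + 1) - penalty(b - t))

def quadratic_penalty(b):
    # Assumed example penalty, used only for illustration.
    return 0.2 * b * b

# Example values with c0 = 0.5 and beta = 0.5 (illustrative parameters).
for state in [(1, 3), (2, 1), (3, 5), (4, 2), (3, 0)]:
    print(state, whittle_index_constant_cost(*state, 0.5, 0.5, quadratic_penalty))
```

For T = 1 the function reduces to (16), and for a fixed T the index is non-decreasing in B once it is positive, which is the monotonicity used in the proof of Lemma 3.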

APPENDIX C
PROOF OF LEMMA 1

Proof. For Problem (5), we use $\pi_W$ to denote the Whittle’s index policy with the processing limit M (that activates the M arms with the highest indices at each time). In particular, when the limit is loose, e.g., M = N, we denote the Whittle’s index policy (that activates the N arms with the highest indices at each time) by $\pi_N$.

In Appendix C of [41], we have shown that when M = N, the Whittle’s index policy is optimal. Thus the reward of $\pi_N$ serves as an upper bound on the optimal reward for any case with M ≤ N, i.e.,

$$G^N_{\pi_N}(s) \ge G^N(s) \ge G^N_W(s),$$

where $G^N_{\pi_N}(s)$ is the reward collected by $\pi_N$, $G^N_W(s)$ is the reward collected by the Whittle’s index policy $\pi_W$ when M ≤ N, and $G^N(s)$ is the maximum reward defined in (4).

In this appendix, we establish an upper bound on the difference of the value functions of $\pi_W$ and $\pi_N$, $G^N_{\pi_N}(s) - G^N_W(s)$, which serves as an upper bound on the gap-to-optimality of the Whittle’s index policy, $G^N(s) - G^N_W(s)$. We first quantify $G^N_{\pi_N}(s) - G^N_W(s)$ by the number of different actions in the processing sequences generated by $\pi_W$ and $\pi_N$. Then we relate the number of different actions to the number of job arrivals in Lemma 3, which gives us the result in (9).

Note that due to the lack of a capacity limit, policy $\pi_N$ activates a regular arm if and only if its Whittle’s index is positive. On the other hand, the policy $\pi_W$ activates a regular arm if and only if its index is among the M largest positive ones. Due to the capacity limit M, when facing the same trajectory of processing costs and the same sequence of arrivals, the two policies $\pi_N$ and $\pi_W$ may generate different processing sequences on a single job. Figure 10 plots the processing sequences of a job $J_i$ with arrival time r and departure time d determined by $\pi_N$ and $\pi_W$. We define two events as follows.

• Event A: $\pi_N$ processes $J_i$ but $\pi_W$ does not.
• Event B: $\pi_W$ processes $J_i$ but $\pi_N$ does not.

Fig. 10: Processing sequences generated by π_N and π_W on a single job. (Two timelines from arrival r to departure d, one per policy, with the slots of events A and B marked.)

• If event A happens at time t, the instant reward difference between $\pi_N$ and $\pi_W$ is bounded, i.e.,

$$R_i^{\pi_N}[t] - R_i^{\pi_W}[t] \le |1 - c_{\min}|,$$

where $R_i^{\pi}[t]$ is the instant reward collected from $J_i$ by policy π at time t.

• If event B happens at time t, the instant reward difference between $\pi_N$ and $\pi_W$ is also bounded, i.e.,

$$R_i^{\pi_N}[t] - R_i^{\pi_W}[t] \le |1 - c_{\max}|.$$

• At the deadline of $J_i$, the difference of the unfinished job lengths resulting from the two policies is bounded by the number of occurrences of event A. Thus the penalty difference of the two policies is also bounded, i.e.,

$$F_i^{\pi_N}[d] - F_i^{\pi_W}[d] \le F\Big(B + \sum_{t=r}^{d} \mathbb{1}(A[t])\Big) - F(B) \le F(\bar{B}) \sum_{t=r}^{d} \mathbb{1}(A[t]),$$

where $F_i^{\pi}[d]$ is the penalty of $J_i$ incurred under π at deadline d, B is the leftover job size of job i under $\pi_N$, $\mathbb{1}(A[t]) = 1$ if and only if event A happens at time t, and $F(\bar{B})$ is the maximum penalty that a job can incur.

The reward difference collected from $J_i$ up to time t < d is the sum of the first two cases, i.e.,

$$\sum_{h=r}^{t} \beta^h \big( R_i^{\pi_N}[h] - R_i^{\pi_W}[h] \big) \le |1 - c_{\min}| \sum_{h=r}^{t} \mathbb{1}(A[h]) \beta^h + |1 - c_{\max}| \sum_{h=r}^{t} \mathbb{1}(B[h]) \beta^h.$$

The difference up to the deadline t = d is the sum of the three cases, i.e.,

$$\sum_{h=r}^{d} \beta^h \big( R_i^{\pi_N}[h] - R_i^{\pi_W}[h] \big) \le |1 - c_{\min}| \sum_{h=r}^{d} \mathbb{1}(A[h]) \beta^h + |1 - c_{\max}| \sum_{h=r}^{d} \mathbb{1}(B[h]) \beta^h + F(\bar{B}) \beta^d \sum_{h=r}^{d} \mathbb{1}(A[h]).$$

For each time $t \in [r, d]$, we enlarge the penalty term (using $\beta^d \le \beta^h$ for h ≤ d) and get a general bound as follows.

$$\sum_{h=r}^{t} \beta^h \big( R_i^{\pi_N}[h] - R_i^{\pi_W}[h] \big) \le |1 - c_{\min}| \sum_{h=r}^{t} \mathbb{1}(A[h]) \beta^h + |1 - c_{\max}| \sum_{h=r}^{t} \mathbb{1}(B[h]) \beta^h + F(\bar{B}) \sum_{h=r}^{t} \mathbb{1}(A[h]) \beta^h. \tag{19}$$

Note that the cumulative number of occurrences of event A up to any fixed time t is never smaller than that of event B. Formally, we state the following lemma to describe the relationship between events A and B. The proof is deferred to Appendix C-A.

Lemma 2. Let $\mathbb{1}(A[t])$ indicate whether event A happens at t, and let #A[t] be the cumulative number of occurrences of event A from r to time t. Define $\mathbb{1}(B[t])$ and #B[t] analogously. For any $t \in [r, d]$,

$$\#A[t] = \sum_{h=r}^{t} \mathbb{1}(A[h]) \ge \#B[t] = \sum_{h=r}^{t} \mathbb{1}(B[h]), \qquad \sum_{h=r}^{t} \mathbb{1}(A[h]) \beta^h \ge \sum_{h=r}^{t} \mathbb{1}(B[h]) \beta^h.$$

So the reward difference in (19) is bounded as follows.

$$\sum_{h=r}^{t} \beta^h \big( R_i^{\pi_N}[h] - R_i^{\pi_W}[h] \big) \le \big( |1 - c_{\min}| + F(\bar{B}) + |1 - c_{\max}| \big) \sum_{h=r}^{t} \mathbb{1}(A[h]) \beta^h.$$

Now we want to quantify the cumulative number of occurrences of event A. Event A happens only when there are more than M jobs with positive Whittle’s index in the system under $\pi_W$. This event can only occur when there are at least M jobs in the queue. To bound the number of occurrences of event A, we have the following lemma. The proof is deferred to Appendix C-B.

Lemma 3. Let $I^N[t]$ be the number of jobs admitted to the system within [t − T + 1, t]. Then for any t,

$$\mathbb{1}(A[t]) \le \mathbb{1}(I^N[t] > M).$$

Thus for each job, we have

$$\sum_{h=r}^{t} \beta^h \big( R_i^{\pi_N}[h] - R_i^{\pi_W}[h] \big) \le C \sum_{h=r}^{t} \beta^h\, \mathbb{1}(I^N[h] > M), \tag{20}$$

for any t, where $C = |1 - c_{\min}| + F(\bar{B}) + |1 - c_{\max}|$.

Summing over arrivals and taking the expectation, we bound the difference of the expected value functions as follows.

$$G^N_{\pi_N}(s) - G^N_W(s) \le C \sum_t \beta^t\, \mathbb{E}\big[ \mathbb{1}(I^N[t] > M)\, I^N[t] \big] = \frac{C\, \mathbb{E}\big[ \mathbb{1}(I^N[t] > M)\, I^N[t] \big]}{1 - \beta}.$$

Since $G^N_{\pi_N}(s)$ is an upper bound on $G^N(s)$, we have

$$G^N(s) - G^N_W(s) \le \frac{C}{1-\beta}\, \mathbb{E}\big[ I^N[t]\, \mathbb{1}(I^N[t] > M) \big] = \frac{C}{1-\beta}\, \mathbb{E}\big[ I^N[t] \mid I^N[t] > M \big] \Pr(I^N[t] > M),$$

which is the expression (9) in Lemma 1.

A. Proof of Lemma 2

Proof. At time t, we denote the remaining job size of $J_i$ under policies $\pi_N$ and $\pi_W$ by $B_N[t]$ and $B_W[t]$, respectively. When $B_W[t] = B_N[t]$, job $J_i$ has the same state and Whittle’s index under both policies. If $\pi_W$ processes $J_i$, which means that the Whittle’s index of $J_i$ is positive, $\pi_N$ also processes $J_i$. Since $B_W[r] = B_N[r]$ at the arrival time r, event B can only happen when $B_W[t] > B_N[t]$, which means that event A must have happened before.

This also implies that $B_W[t] \ge B_N[t]$ for all t.

B. Proof of Lemma 3

Proof. Recall that the remaining job size under $\pi_W$ is never smaller than the one under $\pi_N$, i.e., $B_W[t] \ge B_N[t]$. Whenever $\pi_N$ processes some job $J_i$, the Whittle’s index of this job under $\pi_N$ must be positive. If we can show that the Whittle’s index is monotonically increasing in B, the index under policy $\pi_W$ must also be positive, and $\pi_W$ will also process this job if the capacity limit allows. Thus the occurrence of event A implies that there are more than M jobs with positive Whittle’s index, which requires the number of admitted jobs to be larger than M, i.e., $I^N[t] > M$.

In this subsection, we show by induction that the Whittle’s index is indeed increasing in B when the index is positive and that the value function is concave when ν > 0. That is, $\nu_i(T,B+1,c_j) \ge \nu_i(T,B,c_j)$ if $\nu_i(T,B,c_j) > 0$, and $V_i^{\nu}(T,B,c_j)$ is concave in B when ν > 0.

1) T = 1: The Whittle’s index is

$$\nu_i(1,B,c_j) = \begin{cases} 0, & \text{if } B = 0; \\ 1 - c_j + F(B) - F(B-1), & \text{if } B \ge 1. \end{cases}$$

If $\nu_i(1,B,c_j) > 0$, then $\nu_i(1,B+1,c_j) > \nu_i(1,B,c_j)$ due to the convexity of F(B).

The value function is concave in B when ν > 0:

$$V_i^{\nu}(1,B+2,c_j) - 2V_i^{\nu}(1,B+1,c_j) + V_i^{\nu}(1,B,c_j) = \begin{cases} -F(B+2) + 2F(B+1) - F(B), & \text{if } \nu_i(1,B,c_j) < \nu,\ \nu_i(1,B+1,c_j) < \nu, \text{ and } \nu_i(1,B+2,c_j) < \nu; \\ 1 - c_j - \nu + F(B+1) - F(B), & \text{if } \nu_i(1,B,c_j) < \nu,\ \nu_i(1,B+1,c_j) < \nu, \text{ and } \nu \le \nu_i(1,B+2,c_j); \\ \nu - 1 + c_j + F(B) - F(B+1), & \text{if } \nu_i(1,B,c_j) < \nu \le \nu_i(1,B+1,c_j); \\ -F(B+1) + 2F(B) - F(B-1), & \text{if } \nu \le \nu_i(1,B,c_j); \end{cases} \;\le 0.$$

The first and last cases are non-positive because of the convexity of the penalty. The second and third cases are non-positive because of the definition of $\nu_i(1,B,c_j)$.

2) T > 1: Assume νi(T,B + 1, cj) ≥ νi(T,B, cj) whenνi(T,B, cj) > 0, and V ν

i (T,B, cj) is concave inB whenν > 0 for T = t− 1. We show that these properties are truefor T = t.

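The difference between the values of the passive and active actions at state $(t, B+1, c_j)$ is given by

$$
\begin{aligned}
f^{\nu}(t, B+1, c_j)
&= \beta \sum_k P_{j,k}\left[V_i^{\nu}(t-1, B+1, c_k) - V_i^{\nu}(t-1, B, c_k)\right] + \nu - 1 + c_j \\
&= \beta \sum_k P_{j,k}\left[V_i^{\nu}(t-1, B+1, c_k) - 2V_i^{\nu}(t-1, B, c_k) + V_i^{\nu}(t-1, B-1, c_k)\right] \\
&\quad + \beta \sum_k P_{j,k}\left[V_i^{\nu}(t-1, B, c_k) - V_i^{\nu}(t-1, B-1, c_k)\right] + \nu - 1 + c_j \\
&= \beta \sum_k P_{j,k}\left[V_i^{\nu}(t-1, B+1, c_k) - 2V_i^{\nu}(t-1, B, c_k) + V_i^{\nu}(t-1, B-1, c_k)\right] + f^{\nu}(t, B, c_j).
\end{aligned}
$$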

When $\nu = \nu_i(t, B, c_j) > 0$, we have $f^{\nu}(t, B, c_j) = 0$ according to the definition of $\nu_i(t, B, c_j)$. The first term in the above equation is nonpositive due to the concavity of the value function when $\nu > 0$. We thus have $f^{\nu}(t, B+1, c_j) \le 0$ when $\nu = \nu_i(t, B, c_j) > 0$, which implies $\nu_i(t, B+1, c_j) \ge \nu_i(t, B, c_j)$.

We have shown the monotonicity of the Whittle’s index when $T = t$. Next, we show the concavity of the value function for $T = t$ when $\nu > 0$.

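$$
V_i^{\nu}(t, B+2, c_j) + V_i^{\nu}(t, B, c_j) - 2V_i^{\nu}(t, B+1, c_j)
=
\begin{cases}
\beta \sum_k P_{j,k} V_i^{\nu}(t-1, B+2, c_k) - 2\beta \sum_k P_{j,k} V_i^{\nu}(t-1, B+1, c_k) + \beta \sum_k P_{j,k} V_i^{\nu}(t-1, B, c_k), \\
\qquad \text{if } \nu_i(t, B, c_j) < \nu,\ \nu_i(t, B+1, c_j) < \nu, \text{ and } \nu_i(t, B+2, c_j) < \nu; \\[4pt]
\beta \sum_k P_{j,k} V_i^{\nu}(t-1, B, c_k) - \beta \sum_k P_{j,k} V_i^{\nu}(t-1, B+1, c_k) + 1 - c_j - \nu, \\
\qquad \text{if } \nu_i(t, B, c_j) < \nu,\ \nu_i(t, B+1, c_j) < \nu, \text{ and } \nu \le \nu_i(t, B+2, c_j); \\[4pt]
\beta \sum_k P_{j,k} V_i^{\nu}(t-1, B+1, c_k) - \beta \sum_k P_{j,k} V_i^{\nu}(t-1, B, c_k) + \nu - (1 - c_j), \\
\qquad \text{if } \nu_i(t, B, c_j) < \nu \le \nu_i(t, B+1, c_j); \\[4pt]
\beta \sum_k P_{j,k} V_i^{\nu}(t-1, B+1, c_k) - 2\beta \sum_k P_{j,k} V_i^{\nu}(t-1, B, c_k) + \beta \sum_k P_{j,k} V_i^{\nu}(t-1, B-1, c_k), \\
\qquad \text{if } \nu \le \nu_i(t, B, c_j);
\end{cases}
\le 0.
$$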

The first and fourth cases are nonpositive because, by the induction hypothesis, the value function is concave when $\nu > 0$ for $t - 1$. The second and third cases are nonpositive by the definition of $\nu_i(t, B+1, c_j)$: they equal $-f^{\nu}(t, B+1, c_j)$ and $f^{\nu}(t, B+1, c_j)$, respectively. So the value function $V_i^{\nu}(t, B, c_j)$ is concave in $B$ when $\nu > 0$.

By induction, we have $\nu_i(T, B+1, c_j) \ge \nu_i(T, B, c_j)$ when $\nu_i(T, B, c_j) > 0$, and $V_i^{\nu}(T, B, c_j)$ is concave in $B$ when $\nu > 0$, for all $T$.
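The induction can also be checked numerically by backward recursion. The sketch below is illustrative only and assumes a recursion inferred from the expressions for the passive-minus-active difference and the $T = 1$ value function in this proof: the passive action earns the subsidy $\nu$ and leaves $B$ unchanged, the active action earns $1 - c_j$ and reduces $B$ by one, continuation values are discounted by $\beta$ and averaged over the cost transition matrix, and the penalty $F(B)$ is charged on the work left at the deadline. The two-state cost chain, the discount factor, and the quadratic penalty are hypothetical values chosen for illustration.

```python
def value_tables(horizon, b_max, costs, trans, beta, nu, penalty):
    """Return tables with tables[t - 1][j][b] = V^nu(t, b, c_j) for t = 1..horizon."""
    n = len(costs)
    # Continuation value at the deadline: the penalty on unfinished work.
    cont = [[-penalty(b) for b in range(b_max + 1)] for _ in range(n)]
    tables = []
    for _ in range(horizon):
        current = []
        for j in range(n):
            row = []
            for b in range(b_max + 1):
                passive = nu + cont[j][b]
                if b == 0:
                    row.append(passive)  # no work left, only the passive action
                else:
                    active = 1.0 - costs[j] + cont[j][b - 1]
                    row.append(max(passive, active))
            current.append(row)
        tables.append(current)
        # Discounted expected continuation value for the next step of the recursion.
        cont = [[beta * sum(trans[j][k] * current[k][b] for k in range(n))
                 for b in range(b_max + 1)] for j in range(n)]
    return tables


def max_second_difference(table):
    # Largest value of V(B + 2) - 2 V(B + 1) + V(B) over all cost states and B.
    return max(row[b + 2] - 2.0 * row[b + 1] + row[b]
               for row in table for b in range(len(row) - 2))


if __name__ == "__main__":
    tables = value_tables(horizon=6, b_max=15, costs=[0.3, 0.8],
                          trans=[[0.7, 0.3], [0.4, 0.6]], beta=0.95, nu=1.0,
                          penalty=lambda b: float(b * b))
    for t, table in enumerate(tables, start=1):
        print(t, max_second_difference(table))
```

A nonpositive maximum second difference at every horizon is what the concavity claim predicts for $\nu > 0$; the check is a sanity test under the stated assumptions, not a substitute for the proof.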

APPENDIX D
PROOF OF THEOREM 3

Proof. For a Poisson process $I[t]$ with mean $\mu$, we have

$$
E\big[I[t] \mid I[t] > M\big] \Pr(I[t] > M) = \mu \Pr(I[t] \ge M).
$$

For any $M > \mu - 1$, we have the following inequality [46]:

$$
\begin{aligned}
\mu \Pr(I[t] \ge M)
&< \frac{\mu \Pr(I[t] = M)}{1 - \frac{\mu}{M+1}} \\
&= \frac{\mu^{M+1} e^{-\mu} (M+1)}{(M+1-\mu)\, M!} \\
&\le \frac{\mu^{M+1} e^{M-\mu} (M+1)}{\sqrt{2\pi}\, M^{M+1/2} (M+1-\mu)} \\
&= O\!\left(\frac{\mu e^{-\mu}}{\sqrt{M}}\right),
\end{aligned}
\tag{21}
$$

where the second inequality follows from Stirling’s formula. When $\mu \le M/e$, the right-hand side decreases to zero as $\mu$ and $M$ grow, which indicates the asymptotic optimality of the Whittle’s index policy.
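The identity and the tail bound used above can be checked numerically. The following minimal sketch uses only the Python standard library; the function names and the truncation length of the infinite sums are illustrative choices, not part of the paper. It evaluates both sides of the identity $E[I \, 1\{I > M\}] = \mu \Pr(I \ge M)$, compares $\mu \Pr(I \ge M)$ with the bound $\mu \Pr(I = M) / (1 - \mu/(M+1))$, and evaluates the bound along $\mu = M/e$.

```python
import math


def poisson_pmf(k, mu):
    # Pr(I = k) evaluated in log space to avoid overflow for large k.
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def tail_prob(m, mu, terms=400):
    # Pr(I >= m), with the infinite sum truncated after `terms` terms.
    return sum(poisson_pmf(k, mu) for k in range(m, m + terms))


def truncated_mean(m, mu, terms=400):
    # E[I * 1{I > m}] = E[I | I > m] * Pr(I > m), truncated in the same way.
    return sum(k * poisson_pmf(k, mu) for k in range(m + 1, m + 1 + terms))


def geometric_tail_bound(m, mu):
    # mu * Pr(I = M) / (1 - mu / (M + 1)), meaningful only for M > mu - 1.
    if m <= mu - 1:
        raise ValueError("the bound requires M > mu - 1")
    return mu * poisson_pmf(m, mu) / (1.0 - mu / (m + 1))


if __name__ == "__main__":
    mu, m = 5.0, 14
    print("identity:", truncated_mean(m, mu), mu * tail_prob(m, mu))
    print("bound:   ", mu * tail_prob(m, mu), geometric_tail_bound(m, mu))
    for m in (14, 28, 56, 112):
        print(m, geometric_tail_bound(m, m / math.e))
```

Along $\mu = M/e$ the printed bound shrinks rapidly as $M$ doubles, consistent with the $O(\mu e^{-\mu}/\sqrt{M})$ rate.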

REFERENCES

[1] Z. Yu, Y. Xu, and L. Tong, “Deadline scheduling as restless bandits: A multi-armed bandit approach,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2016.

[2] Z. Yu, S. Chen, and L. Tong, “An intelligent energy management system for large-scale charging of electric vehicles,” CSEE Journal of Power and Energy Systems, vol. 2, no. 1, pp. 47–53, 2016.

[3] Z. Yu, Y. Xu, and L. Tong, “Large scale charging of electric vehicles,” in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2015, pp. 389–395.

[4] Z. Yu and L. Tong, “Demand response via large scale charging of electric vehicles,” in Power and Energy Society General Meeting (PESGM), 2016. IEEE, 2016, pp. 1–5.

[5] I.-H. Hou and P. R. Kumar, “Packets with deadlines: A framework for real-time wireless networks,” Synthesis Lectures on Communication Networks, vol. 6, no. 1, pp. 1–116, 2013.

[6] J. Vilaplana, F. Solsona, I. Teixido, J. Mateo, F. Abella, and J. Rius, “A queuing theory model for cloud computing,” The Journal of Supercomputing, vol. 69, no. 1, pp. 492–507, 2014.

[7] D. M. Warner and J. Prawda, “A mathematical programming model for scheduling nursing personnel in a hospital,” Management Science, vol. 19, no. 4-part-1, pp. 411–422, 1972.

[8] B. B. Chen and P. V.-B. Primet, “Scheduling deadline-constrained bulk data transfers to minimize network congestion,” in Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid’07). IEEE, 2007, pp. 410–417.

[9] J. Dai and S. He, “Queues in service systems: Customer abandonment and diffusion approximations,” Tutorials in Operations Research, INFORMS: Hanover, MD, vol. 3, pp. 36–59, 2011.

[10] J. C. Gittins, “Bandit processes and dynamic allocation indices,” Journal of the Royal Statistical Society, vol. 41, no. 2, pp. 148–177, 1979.

[11] P. Whittle, “Restless bandits: Activity allocation in a changing world,” Journal of Applied Probability, vol. 25, pp. 287–298, 1988.

[12] C. H. Papadimitriou and J. N. Tsitsiklis, “The complexity of optimal queuing network control,” Mathematics of Operations Research, vol. 24, no. 2, pp. 293–305, 1999.

[13] J. C. Gittins and D. M. Jones, “A dynamic allocation index for the discounted multiarmed bandit problem,” Biometrika, vol. 66, no. 3, pp. 561–565, 1979.

[14] K. Liu and Q. Zhao, “Indexability of restless bandit problems and optimality of Whittle index for dynamic multichannel access,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5547–5567, 2010.

[15] D. Graczova and P. Jacko, “Generalized restless bandits and the knapsack problem for perishable inventories,” Operations Research, vol. 62, no. 3, pp. 696–711, 2014.

[16] R. T. Rockafellar and S. Uryasev, “Optimization of conditional value-at-risk,” Journal of Risk, vol. 2, pp. 21–42, 2000.

[17] C. L. Liu and J. W. Layland, “Scheduling algorithms for multiprogramming in a hard-real-time environment,” Journal of the ACM, vol. 20, pp. 46–61, 1973.

[18] M. Dertouzos, “Control robotics: the procedural control of physical processes,” in Proceedings of International Federation for Information Processing Congress, 1974, pp. 807–813.

[19] A. Mok, “Fundamental design problems of distributed systems for the hard real-time environment,” Ph.D. dissertation, MIT, 1983.

[20] R. I. Davis and A. Burns, “A survey of hard real-time scheduling for multiprocessor systems,” ACM Computing Surveys, vol. 43, no. 4, 2011.

[21] M. L. Dertouzos and A. K. Mok, “Multiprocessor online scheduling of hard-real-time tasks,” IEEE Transactions on Software Engineering, vol. 5, pp. 1497–1506, 1989.

[22] S. S. Panwar, D. Towsley, and J. K. Wolf, “Optimal scheduling policies for a class of queues with customer deadlines to the beginning of service,” Journal of the Association for Computing Machinery, vol. 35, no. 4, pp. 832–844, October 1988.

[23] D. Towsley and S. Panwar, “On the optimality of minimum laxity and earliest deadline scheduling for real-time multiprocessors,” in Proceedings of IEEE Euromicro 90’ Workshop on Real-Time, Jun. 1990, pp. 17–24.

[24] J. Lehoczky, “Real-time queueing theory,” in Proceedings of 17th IEEE Real-Time Systems Symposium, Dec. 1996, pp. 186–195.

[25] B. Doytchinov, J. Lehoczky, and S. Shreve, “Real-time queues in heavy traffic with earliest-deadline-first queue discipline,” Annals of Applied Probability, vol. 11, no. 2, pp. 332–378, 2001.

[26] L. Kruk, J. Lehoczky, K. Ramanan, and S. Shreve, “Heavy traffic analysis for EDF queues with reneging,” Annals of Applied Probability, vol. 21, no. 2, pp. 484–545, 2011.


[27] P. P. Bhattacharya, L. Tassiulas, and A. Ephremides, “Optimal scheduling with deadline constraints in tree networks,” IEEE Transactions on Automatic Control, vol. 42, no. 12, pp. 1703–1705, 1997.

[28] J. J. Jaramillo, R. Srikant, and L. Ying, “Scheduling for optimal rate allocation in ad hoc networks with heterogeneous delay constraints,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 5, pp. 979–987, 2011.

[29] V. Raghunathan, V. Borkar, M. Cao, and P. R. Kumar, “Index policies for real-time multicast scheduling for wireless broadcast systems,” in INFOCOM 2008. The 27th Conference on Computer Communications. IEEE, 2008, pp. 1570–1578.

[30] R. Singh and P. R. Kumar, “Decentralized throughput maximizing policies for deadline-constrained wireless networks,” in 2015 IEEE 54th Annual Conference on Decision and Control (CDC). IEEE, 2015, pp. 3759–3766.

[31] Y. Xu and F. Pan, “Scheduling for charging plug-in hybrid electric vehicles,” in Proceedings of 2012 IEEE 51st Annual Conference on Decision and Control (CDC). IEEE, 2012, pp. 2495–2501.

[32] Q. Huang, Q. S. Jia, Z. Qiu, X. Guan, and G. Deconinck, “Matching EV charging load with uncertain wind: A simulation-based policy improvement approach,” IEEE Trans. on Smart Grid, vol. 6, no. 3, pp. 1425–1433, 2015.

[33] Y. Xu, F. Pan, and L. Tong, “Dynamic scheduling for charging electric vehicles: A priority rule,” IEEE Trans. on Automatic Control, vol. 61, no. 12, pp. 4094–4099, 2016.

[34] D. Ruiz-Hernandez, Indexable restless bandits: Index policies for some families of stochastic scheduling and dynamic allocation problems. VDM Verlag, 2008.

[35] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed bandit allocation indices. John Wiley & Sons, 2011.

[36] R. R. Weber and G. Weiss, “On an index policy for restless bandits,” Journal of Applied Probability, pp. 637–648, 1990.

[37] E. Bitar and Y. Xu, “Deadline differentiated pricing of deferrable electric loads,” IEEE Trans. on Smart Grid, vol. 8, no. 1, pp. 13–25, 2017.

[38] S. M. Kakade and M. Kearns, “Trading in Markovian price models,” in International Conference on Computational Learning Theory. Springer, 2005, pp. 606–620.

[39] E. Altman, Constrained Markov decision processes. CRC Press, 1999, vol. 7.

[40] G. B. Dantzig, Linear programming and extensions. Princeton University Press, 1998.

[41] Z. Yu, Y. Xu, and L. Tong, “Deadline scheduling as restless bandits,” Cornell University, Tech. Rep., Jan. 2017. [Online]. Available: https://arxiv.org/pdf/1610.00399.pdf

[42] S. M. Ross, Introduction to stochastic dynamic programming. Academic Press, 2014.

[43] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms. MIT Press, Cambridge, 2001, vol. 6.

[44] Z. Yu, “Large scale charging of electric vehicles: Technology and economy,” Ph.D. dissertation, Cornell University, ECE, 2017.

[45] S. Kwon, Y. Xu, and N. Gautam, “Meeting inelastic demand in systems with storage and renewable sources,” IEEE Transactions on Smart Grid, 2017.

[46] B. Klar, “Bounds on tail probabilities of discrete distributions,” Probability in the Engineering and Informational Sciences, vol. 14, no. 02, pp. 161–171, 2000.

Zhe Yu (S’12-M’17) received his B.E. degree from Tsinghua University, Beijing, China in 2009, M.S. degree from Carnegie Mellon University, Pittsburgh, PA, USA in 2010, and Ph.D. degree from Cornell University, Ithaca, NY, USA in 2016, all in electrical engineering. He joined Global Energy Interconnection Research Institute North America (GEIRI North America) in 2017. His current research interests focus on power system and smart grid, demand response, dynamic programming, data processing, and optimization.

Yunjian Xu (S’06-M’10) received the B.S. and M.S. degrees in Electrical Engineering from Tsinghua University, Beijing, China, in 2006 and 2008, respectively, and the Ph.D. degree from the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, in 2012.

Dr. Xu was a CMI (Center for the Mathematics of Information) postdoctoral fellow at the California Institute of Technology, Pasadena, CA, USA, in 2012-2013. Before joining the Chinese University of Hong Kong (CUHK) as an assistant professor, he was an assistant professor at the Singapore University of Technology and Design in 2013-2017. His research interests lie in power system control and optimization, wholesale electricity market design, and the aggregation of distributed energy resources in power distribution systems. Dr. Xu was a recipient of the MIT-Shell Energy Fellowship.

Lang Tong (F’05) is the Irwin and Joan Jacobs Professor in Engineering of Cornell University and the site director of Power Systems Engineering Research Center (PSERC). He received the B.E. degree from Tsinghua University in 1985, and M.S. and Ph.D. degrees in electrical engineering in 1987 and 1991, respectively, from the University of Notre Dame. He was a Postdoctoral Research Affiliate at the Information Systems Laboratory, Stanford University in 1991. He was the 2001 Cor Wit Visiting Professor at the Delft University of Technology and has held visiting positions at Stanford University and the University of California at Berkeley. Lang Tong’s research is in the general area of statistical inference, communications, and complex networks. His current research focuses on inference, optimization, and economic problems in energy and power systems. He received the 1993 Outstanding Young Author Award from the IEEE Circuits and Systems Society, the 2004 best paper award from IEEE Signal Processing Society, and the 2004 Leonard G. Abraham Prize Paper Award from the IEEE Communications Society. He is also a coauthor of seven student paper awards. He received the Young Investigator Award from the Office of Naval Research. He was a Distinguished Lecturer of the IEEE Signal Processing Society.
