
Meso-Parametric Value Function Approximation for Dynamic ...web.winforms.phil.tu-bs.de/paper/ulmer/Ulmer_meso.pdf · Dimensional Knapsack Problem, Approximate Dynamic Programming,



Meso-Parametric Value Function Approximation for

Dynamic Customer Acceptances in Delivery Routing

Marlin W. Ulmer Barrett W. Thomas

Abstract

In this paper, we introduce a novel method of value function approximation (VFA). In stochastic, dynamic decision problems, VFAs approximate the reward-to-go. Conventionally, VFAs are either parametric or non-parametric. Parametric VFAs (P-VFAs) approximate the value function using a particular functional form. Non-parametric VFAs (N-VFAs) approximate value functions without assuming a functional form. Both VFAs have advantages and shortcomings. While P-VFAs provide fast and reliable approximation, reliable in the sense that there is an approximate value for every state, the approximation is often inaccurate. N-VFAs can provide more accurate approximations, but require significant computational effort to do so. To combine the advantages and to alleviate the shortcomings of P-VFA and N-VFA used individually, we present a novel method, meso-parametric value function approximation (M-VFA). This method combines P-VFA and N-VFA approximations. Most importantly, we demonstrate that simultaneous tuning of the approximations leads to better outcomes than either N- or P-VFA individually or an ex-post combination. Using a benchmark problem that allows combining elements of routing problems, problems for which N-VFA has shown superior performance, and knapsack problems, problems for which P-VFA has shown superior performance, we compare the proposed approach with the individual VFAs and online rollout algorithms. We show how M-VFA offers the advantages of the individual VFAs while alleviating their shortcomings.

Keywords: Dynamic Customer Acceptances, Dynamic Vehicle Routing, Dynamic Multi-Dimensional Knapsack Problem, Approximate Dynamic Programming, Value Function Approximation



1 Introduction

Many decision-making problems involve a sequence of decisions in which one must decide to allocate a finite set of resources for an immediate reward or to conserve resources to maintain the possibility of taking advantage of some future and yet unknown opportunity. Examples of such decision-making problems include project scheduling, fleet management, and capital budgeting. These sequential decision-making problems under uncertainty are naturally modeled as Markov Decision Processes (MDPs). Given the scale of most real-world problems, solutions of these MDPs rely on approximate dynamic programming (ADP) techniques.

A common ADP technique is value function approximation (VFA). VFAs approximate the cost-to-go of the optimality equation. VFAs generally operate by reducing the dimensionality of the state through the selection of a set of features to which all states can be mapped. The cost-to-go is then approximated via this set of features using either parametric or non-parametric methods. Parametric VFA (P-VFA) approximates the cost-to-go by using the feature set as variables in a pre-specified functional form. Non-parametric VFA (N-VFA) operates by directly approximating the value of an observed instance of features, often using lookup tables. Both P- and N-VFA offer the computational advantage that they can be tuned offline. P-VFAs are known for providing "reliable" approximations. That is, P-VFAs can return a value for any given state. Further, P-VFA provides a reliable approximation "using a relatively small number of observations" (Powell, 2011, p. 237). In addition, given the tuned parameters, P-VFAs can be quickly evaluated. However, even if it can return a value, the value returned by a P-VFA is often inaccurate as a result of the simplifying functional assumptions. Because they do not rely on any particular functional form, N-VFAs can be "very accurate approximations of very general functions, as long as we have enough observations" (Powell, 2011, p. 238). In practice, it can be challenging to "have enough observations." As Bertsimas and Demir (2002) show, N-VFAs therefore tend to provide unreliable approximations for large problem sizes.

To mitigate the challenges associated with each method without diminishing each method's advantages, we develop a meso-parametric value function approximation (M-VFA) that combines both P- and N-VFA. The proposed M-VFA simultaneously tunes the parameter values of both the N-VFA and P-VFA and combines the two. In this paper, we propose a state-space aggregation via a lookup table for the N-VFA and a linear basis function for our P-VFA. We propose combining the two via a linear combination, but theoretically, a variety of methods could be used to combine or even choose between the two approximations.

We demonstrate the effectiveness of the proposed method by applying it to the capacitated customer acceptance problem with stochastic requests (CAPSR). In the CAPSR, throughout the day, a dispatcher receives random customer requests for service. The dispatcher must instantly accept or reject each request. Each request comprises a location in the service area, a required capacity, and a revenue. Accepted requests are delivered the next day by a capacitated vehicle within a working shift. To determine whether or not a request can be accepted, the dispatcher must determine whether or not a request can be served feasibly, and if it can be, whether the revenue that would be earned is worth the consumption of resources. The objective is to maximize the expected overall revenue.

We select the CAPSR as our test environment for two reasons. First, it is a problem important for delivery companies (Esser and Kurte, 2015; Savelsbergh and Van Woensel, 2016). Second, the problem represents a combination of a routing problem with a dynamic knapsack problem. The two individual problems exhibit different value function structures. For knapsack problems, parametric VFAs perform well (Bertsimas and Demir, 2002). The routing component introduces interdependencies between decisions and time consumption. Therefore, the value function structure may be complex, and it is difficult to identify a particularly effective functional form. In such cases, N-VFA has been more effective (Ulmer et al., 2017).

This paper makes the following contributions. First, we introduce the M-VFA and demonstrate how to simultaneously tune N- and P-VFA to create M-VFA. Using the CAPSR as a test environment, we then show that the M-VFA performs better than either the N- or P-VFA alone. To analyze the performance of M-VFA for different value function structures, we also systematically shift the focus between the dynamic knapsack and the dynamic routing problem. As expected, P-VFA performs well for the knapsack problem and N-VFA performs well for the routing problem. However, both methods are significantly outperformed by M-VFA.

The paper is outlined as follows. In Section 2, we present the literature on VFA. In Section 3, we provide an overview of ADP. Section 4 defines the M-VFA and describes the procedure for tuning it. In Section 5, we formally present the CAPSR. We also describe the details of our tuning approach for the M-VFA applied to the CAPSR and present the benchmark policies. For a variety of instances based on customer data of Iowa City, we evaluate and analyze the approaches in Section 6. The paper concludes with a summary and an outlook in Section 7.

2 Related Literature

Our work presents a general and novel ADP method for a dynamic decision problem. In our literature review, we present an overview of methodological literature for related ADP methods. We first present literature analyzing the performance of N- and P-VFA as well as developing methods to reinforce their functionality. We further give an overview of work combining different ADP methods.

Generally, the literature confirms that P-VFAs are reliable but inaccurate, while N-VFAs are accurate but not always reliable (He et al., 2012; Fang et al., 2013; Powell and Meisel, 2016). That is, P-VFAs can produce a value for any state, but the value may not well represent the value of being in the state. On the other hand, N-VFAs can offer more accurate approximations, but there is often a large computational burden in doing so, and even then it might not be possible to return an estimate for every state.

The only systematic comparison of N- and P-VFA is conducted by Bertsimas and Demir (2002), who compare the two on a deterministic multi-dimensional knapsack problem. The N-VFA proposed by Bertsimas and Demir (2002) approximates the individual value for every potential state. The P-VFA approximates the value function using a linear function based on each dimension's remaining capacity. Bertsimas and Demir (2002) show that the success of N-VFA and P-VFA depends on the problem's dimensions. Results in He et al. (2012) suggest that when the functional form is known, the P-VFA can offer significantly better performance than the N-VFA.

2.1 N-VFA Literature

N-VFA methods have a long history in the literature. See Powell (2011) and Bertsekas and Tsitsiklis (1996) for overviews. N-VFAs have proven particularly effective in domains in which the functional form of the value function is challenging to quantify. One such application area is dynamic vehicle routing (Goodson et al., 2013, 2016; Ulmer et al., to appear, 2017), the domain to which we apply the method proposed in this paper.

In this paper, we focus on N-VFAs that store VFAs in the form of a lookup table. Methods using value-function approximations in the form of a lookup table are often referred to as state-space aggregation. Powell and Meisel (2016) state that N-VFAs based on "[lookup tables] are particularly impacted by the curses of dimensionality." The key challenge with lookup tables or state-space aggregation is determining at what level to aggregate or partition the state space. Partitioning at too fine a level leads to a lookup table that is too large, and even if it can be stored in memory, there are often areas of the partition for which there are no values. Coarser approximations can overcome the problem of having empty partitions, but they do so at the risk of greater approximation error. George et al. (2008) refer to these as sampling and aggregation error, respectively.

To alleviate this shortcoming of lookup tables, researchers propose several methods to improve the performance of the approximation. George et al. (2008) introduce a method that uses multiple lookup tables, each with increasing levels of coarseness. For any given state, the value of the future is given by combining the values in each of the lookup tables. Fang et al. (2013) demonstrate the effectiveness of the technique for a supply chain sourcing problem.
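The multi-level idea can be sketched compactly. The following is a simplified illustration, not the method of George et al. (2008): it blends per-level lookup estimates with weights proportional to observation counts, whereas the original derives weights from bias and variance estimates. The class interface and the feature maps are hypothetical.

```python
from collections import defaultdict

class HierarchicalLookup:
    """Simplified multi-level lookup table in the spirit of George et al.
    (2008): each level aggregates the feature vector more coarsely, and a
    state's value estimate blends the per-level means. Here the blend
    weights are proportional to observation counts (an illustrative choice;
    the original method uses bias/variance-based weights)."""

    def __init__(self, levels):
        # levels: list of functions, each mapping a feature vector to a
        # coarser lookup key (level 0 = finest partition)
        self.levels = levels
        self.sums = [defaultdict(float) for _ in levels]
        self.counts = [defaultdict(int) for _ in levels]

    def update(self, features, observed_value):
        # record the observation in every aggregation level
        for g, agg in enumerate(self.levels):
            key = agg(features)
            self.sums[g][key] += observed_value
            self.counts[g][key] += 1

    def value(self, features):
        # count-weighted combination of the per-level mean estimates
        estimates, weights = [], []
        for g, agg in enumerate(self.levels):
            key = agg(features)
            n = self.counts[g][key]
            if n > 0:
                estimates.append(self.sums[g][key] / n)
                weights.append(n)
        if not estimates:
            return 0.0  # state unexplored at every level
        total = sum(weights)
        return sum(w * v for w, v in zip(weights, estimates)) / total

# Two levels over a (time, capacity) feature vector:
# fine = rounded to integers, coarse = rounded to multiples of 10.
table = HierarchicalLookup([
    lambda f: (round(f[0]), round(f[1])),
    lambda f: (round(f[0], -1), round(f[1], -1)),
])
table.update((12.0, 31.0), 100.0)
print(table.value((12.4, 30.8)))
```

Coarse levels accumulate more observations and therefore dominate early in the learning process; as the fine level fills in, its estimates carry increasing weight.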

As an alternative to the method of George et al. (2008), Ulmer et al. (2017) propose a method that dynamically partitions the (post-decision) state space in response to the learning process. We refer to the method as dynamic lookup table (DLT). Ulmer et al. (2017) show that for routing problems, the value function is often complex and not amenable to any particular functional form. They further show that DLT outperforms the approach of George et al. (2008) in effectiveness and efficiency. Thus, we incorporate the approach of Ulmer et al. (2017) into our M-VFA to reinforce the N-VFA component. We further compare M-VFA with N-VFA based on DLT.

Papadaki and Powell (2002) introduce a method for applying lookup tables to monotonic value functions. One important feature of the proposed algorithm is that it takes advantage of the monotonicity and updates neighboring cells of a just observed cell in a lookup table. This approach improves the convergence of the estimates in the lookup table. This feature of the algorithm presented in Papadaki and Powell (2002) also updates regions of the lookup table in which there are no observations. The combination of the N-VFA with the P-VFA in this paper also allows for values in unexplored regions of the lookup table, but does not require monotonicity to do so. Jiang and Powell (2015b) generalize the work in Papadaki and Powell (2002) and introduce a provably convergent algorithm. Jiang and Powell (2015a) demonstrate the application of the technique in solving an energy management problem.

Because of their ability to theoretically approximate any continuous function (Hornik et al., 1989), neural networks are also often used to approximate state values. Bertsekas and Tsitsiklis (1996) provide a well-known overview of the methods, with a recent overview available in Liu et al. (2017). With the rise of "deep learning" (see LeCun et al. (2015) for an introduction to deep learning), the use of neural nets to approximate value functions has recently received renewed interest. The best known example is the work of Mnih et al. (2013) that uses deep learning combined with Q-learning, a method similar to post-decision state lookup table methods, to learn to play Atari 2600 games at levels similar to human players. As we note in Section 2.3, our method can be used with neural-net-based approximations.

We note that one could view lookahead methods as a form of N-VFA. Lookahead methods approximate the value function either by solving value functions by looking a limited number of steps into the future or by approximating the future with a heuristic policy. These methods are non-parametric in the sense that they do not assume any particular functional form for the approximated values. Lookahead methods are what are known as "online" VFA in that the approximations are solved at runtime. In contrast, the work in this paper focuses on offline methods for which the approximations are determined offline in advance of execution. Given their success in solving both knapsack and dynamic routing problems (see Goodson et al. (2017) and Ulmer et al. (to appear)), we compare the M-VFA to an online lookahead method known as rollout. Powell (2011) provides an overview of lookahead methods, with Goodson et al. (2017) providing the latest advances in rollout algorithms.

2.2 P-VFA Literature

Like N-VFAs, P-VFAs have a long history in the literature. Powell (2011) provides a general overview. Geist and Pietquin (2013) provide an overview of determining parameters in P-VFA. The most common P-VFA is a linear basis function approximation. A basis function maps features of the state into real values and then linearly combines the values. Examples of successful application of linear basis functions to approximate value functions include ambulance redeployment (Maxwell et al., 2010; Schmid, 2012), dynamic vehicle routing (Meisel, 2011), technician scheduling (Chen et al., 2017), and truckload trucking (Simao et al., 2009).

There has also been a large body of literature exploiting known non-linear functional forms. Piecewise linear approximations have proven particularly successful. Examples of piecewise linear VFAs can be found in fleet management (Godfrey and Powell, 2002a,b; Topaloglu and Powell, 2006), infertility treatment (He et al., 2012), and inventory management (Godfrey and Powell, 2001). To the best of the authors' knowledge, piecewise linear approximations in the literature rely on monotonicity. Godfrey and Powell (2001) introduce a method for tuning a piecewise linear approximation of concave functions. In some applications, monotonicity may not hold across all states. To overcome this challenge, He et al. (2012) seek to improve the quality of a piecewise linear approximation by partitioning the state space and finding piecewise linear approximations for each partition.

Piecewise linear approximations for nonlinear value functions have the advantage that the preservation of linearity often allows for the application of efficient math programming techniques to solve the approximate value function. Yet, there are fields, particularly economics and finance, in which continuous, nonlinear approximations are favored. An overview of nonlinear P-VFA and a discussion of numerous applications can be found in Cai and Judd (2014). Recent work in the area focuses on the challenges of solving nonlinear approximate Bellman equations. Examples include Cai et al. (2017) and Shen and Wang (2015).

For the M-VFA proposed in this paper, we use linear basis functions for our P-VFA. We do so because we know of no particular functional form that fits the problem that we are studying. Further, our results demonstrate that even a linear approximation improves solution quality. However, the general idea of our proposed scheme does not rely on a linear form of the approximation, and the combination of N- and P-VFA proposed in this paper could use a functional approximation other than linear.

2.3 Literature on Combining VFAs

In our proposed algorithm, we combine N- and P-VFAs. The literature on such methods is limited. Both Powell (2011, pp. 242) and Bertsekas and Tsitsiklis (1996, pp. 70) propose approximations that combine N- and P-VFA. However, neither presents an application of the proposed approaches.

Powell (2011, pp. 242) proposes embedding a lookup table into a P-VFA. The lookup-table values are filled a priori by a domain expert. Our presented method differs in two ways. First, instead of drawing on a domain expert, we use offline simulation to fill the values of the lookup table. Second, in our method, the lookup-table values are approximated not sequentially but simultaneously with the P-VFA. We show the advantage of the simultaneous approximation in our computational evaluation by comparing M-VFA to an ex-post combination of the individual VFAs.

Bertsekas and Tsitsiklis (1996, pp. 70) propose combining a neural-net-based approximation with a linear basis function. The authors propose a two-stage scheme in which they first tune the neural network, and then, having fixed the neural-network approximation, they learn the values of the basis function. Again, we propose learning the N- and P-VFA values simultaneously. Our results demonstrate the advantage of this approach.

Additional work combines online lookahead methods with offline VFAs. Online methods determine a state's value during the decision-making process. Thus, in contrast to offline methods, online methods require real-time computation time. Online methods often provide detailed approximation, while offline methods generally offer a more reliable approximation based on many simulation runs. Li and Womer (2015) and Ulmer et al. (to appear) present online rollout algorithms (RAs) with VFAs as base policies. Ulmer and Hennig (2016) limit the horizon of an RA and estimate the remaining horizon via a VFA value. A similar idea is sketched by Powell et al. (2012), who simulate the vehicle routing via an online lookahead and estimate the value of the resulting vehicle locations with a VFA. These methods are related to Monte Carlo tree search, often applied to generate policies for complex games with long horizons such as Go (Browne et al., 2012). Our method differs from the online/offline methods in that our approximation scheme is based on the combination of two offline methods.

3 Approximate Dynamic Programming

In this section, we provide an overview of approximate dynamic programming. We first recall the terminology of finite Markov decision processes and then describe the approximate Bellman Equation.

[Figure 1: Markov Decision Tree. Squares represent decision states, circles post-decision states, solid arrows decisions, and dashed arrows realizations of the exogenous random variable.]

3.1 Markov Decision Process

Markov decision processes (MDPs) are models of sequences of decisions, and stochastic dynamic decision problems are generally modeled as MDPs. In the following, we recall the terminology of an MDP to later illustrate the procedure of the M-VFA. The terminology is reflected in the Markov decision tree shown in Figure 1. An MDP contains a sequence of decision points k = 0, ..., K. Parameter K may be a random variable. At each decision point k, a decision state S_k and a set of potential decisions X(S_k) are given. In Figure 1, decision states are represented by the squares and decisions by the solid arrows. Each decision x ∈ X(S_k) for state S_k provides a reward R(S_k, x). This reward may be an expectation. The application of a decision x to a state S_k leads to a deterministic transition to a post-decision state S^x_k, represented by a circle in Figure 1. We utilize post-decision states in the M-VFA. A realization ω_k ∈ Ω_k(S^x_k) of an exogenous random variable, indicated by the dashed arrows in Figure 1, leads to a new decision state S_{k+1} = (S^x_k, ω_k). This procedure continues until a termination state S_K is reached.

A policy π : S → X is a sequence of decision rules that assigns a decision X^π_k(S_k) ∈ X(S_k) to every state S_k ∈ S. Here, X^π_k(S_k) is the decision made in state S_k under policy π at decision point k. An optimal policy π* maximizes the expected rewards over all decision points beginning from an initial state S_0. Formally, π* is given by:

$$\pi^* = \arg\max_{\pi \in \Pi} \; \mathbb{E}\left[\left.\sum_{k=0}^{K} R\big(X^\pi_k(S_k)\big) \,\right|\, S_0\right]. \qquad (1)$$
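In practice, the expectation in (1) is typically estimated by simulation: repeatedly sample exogenous realizations, apply the policy's decision rule at each decision point, and average the accumulated rewards. A minimal sketch of this evaluation, with a toy accept/reject MDP as a stand-in (the function names and the toy model are illustrative assumptions, not the CAPSR):

```python
import random

def simulate_policy(s0, policy, reward, transition, is_terminal, runs=1000, seed=0):
    """Monte Carlo estimate of E[sum_k R(X^pi_k(S_k)) | S_0] for a fixed policy.
    policy(s) returns a decision, reward(s, x) its reward, and transition(s, x, rng)
    samples the next decision state (deterministic decision transition followed
    by an exogenous realization)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        s, path_reward = s0, 0.0
        while not is_terminal(s):
            x = policy(s)
            path_reward += reward(s, x)
            s = transition(s, x, rng)
        total += path_reward
    return total / runs

# Toy example: at each of K = 5 decision points, accept (x = 1) a request of
# random value or reject (x = 0); accepting consumes one unit of capacity.
# State: (decision point k, remaining capacity c, value of current request).
def accept_policy(s):              # greedy: accept whenever capacity remains
    k, c, value = s
    return 1 if c > 0 else 0

def request_reward(s, x):
    return s[2] if x == 1 else 0.0

def request_transition(s, x, rng):
    k, c, _ = s
    return (k + 1, c - x, rng.uniform(0, 1))

est = simulate_policy((0, 3, 0.5), accept_policy, request_reward,
                      request_transition, is_terminal=lambda s: s[0] == 5)
print(round(est, 2))
```

With capacity 3 and request values uniform on [0, 1], the greedy policy collects the first request (value 0.5) plus two random requests, so the estimate concentrates near 1.5.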

3.2 The Approximate Bellman Equation

Equation (1) can be rewritten recursively as

$$V(S_k) = \max_{x \in \mathcal{X}(S_k)} \Big\{ R(S_k, x) + \mathbb{E}\big[V(S_{k+1}) \mid S_k\big] \Big\}. \qquad (2)$$

The value function V represents the expected reward-to-go originating from a given state. Traditionally, Equation (2) is solved by backward induction. For most real-world applications, however, the backward induction approach suffers from the well known "curses of dimensionality." To overcome this challenge, researchers turn to solving approximate forms of Equation (2). This method is often called approximate dynamic programming (ADP). Powell (2011) provides an overview of ADP. In ADP, we replace the second term of Equation (2) with an approximated value, resulting in the approximate Bellman Equation given by

$$\bar{V}(S_k) = \max_{x \in \mathcal{X}(S_k)} \Big\{ R(S_k, x) + \mathbb{E}\big[\bar{V}(S_{k+1}) \mid S_k\big] \Big\}. \qquad (3)$$

In this paper, we will operate on an equivalent approximate Bellman Equation, the post-decision approximate Bellman Equation, given as:

$$\bar{V}(S_k) = \max_{x \in \mathcal{X}(S_k)} \Big\{ R(S_k, x) + \bar{V}(S^x_k) \Big\}, \qquad (4)$$

where \(\bar{V}(S^x_k)\) is known as the value of the post-decision state.
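Because the transition from S_k to the post-decision state S^x_k is deterministic, Equation (4) can be maximized by simple enumeration, with no expectation inside the max. A minimal sketch (the function names and the toy accept/reject example are illustrative assumptions):

```python
def greedy_decision(s, decisions, reward, post_state, v_bar):
    """Solve the post-decision approximate Bellman equation (4) for one state:
    enumerate decisions, score each by immediate reward plus the approximate
    value v_bar of the deterministic post-decision state, and keep the best."""
    best_x, best_v = None, float("-inf")
    for x in decisions(s):
        v = reward(s, x) + v_bar(post_state(s, x))
        if v > best_v:
            best_x, best_v = x, v
    return best_x, best_v

# Toy accept/reject check: accepting pays 5 now, but the VFA values the
# resulting post-decision state lower because capacity is consumed.
x, v = greedy_decision(
    s={"capacity": 2},
    decisions=lambda s: [0, 1],
    reward=lambda s, x: 5.0 * x,
    post_state=lambda s, x: {"capacity": s["capacity"] - x},
    v_bar=lambda ps: 3.0 * ps["capacity"],
)
print(x, v)  # accepting wins: 5 + 3*1 = 8 beats 0 + 3*2 = 6
```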

4 Meso-Parametric Value Function Approximation

In this section, we describe the proposed M-VFA. We first formalize N- and P-VFA. We then use N- and P-VFA to describe M-VFA. We conclude the section by describing the approximate value iteration for M-VFA (AVI-M-VFA), the method that we use to tune the M-VFA. The key feature of AVI-M-VFA is the simultaneous approximation of the N- and P-VFA in the creation of a VFA that is a combination of both.

4.1 M-VFA: Combining P-VFA and N-VFA

Generally, in VFA, states are represented by quantifications based on a (sub-)set of state dimensions called features φ ∈ Φ. These features are functions mapping states to real numbers, indicators, or ordinal numbers for specific state characteristics. For example, consider a state that includes the location of a vehicle at a particular time. We could map this state to a single feature, the point of time. Both P-VFA and N-VFA as well as M-VFA use features to approximately evaluate states.

Our proposed M-VFA is a combination of N- and P-VFA. To apply P-VFA, two assumptions are made. First, we assume that there is a known subset of features Φ^p = (φ^p_1, ..., φ^p_{l_p}) ⊆ Φ. Second, we assume a general functional form f_V is given (e.g., linear, polynomial, logarithmic, etc.). The functional form may be a sum of individual functions f_{V_1}, ..., f_{V_m}, such as monomials in a polynomial. These individual functions may draw on all or on subsets of the features Φ^p. A P-VFA is fitted to a particular problem using a set of tuneable parameters Θ = (θ_1, ..., θ_m), usually one for each individual function in f_V, resulting in a specific function f_V(Θ). In this paper, we focus on what are known as linear basis functions, which, for a post-decision state S^x, result in

$$\bar{V}^p(S^x) = f_V\big(\Phi^p(S^x), \Theta\big) = \theta_0 + \sum_{i=1}^{m} \theta_i\, \phi^p_i(S^x). \qquad (5)$$

In contrast to P-VFA, N-VFAs do not assume a functional form. In this paper, we focus on the methods known as state-space aggregation. In these methods, and similar to the case of the P-VFA, the state is mapped to a set of features Φ^n = (φ^n_1, ..., φ^n_{l_n}) ⊂ Φ, and the N-VFA approximates the value for each individual feature combination. The resulting approximated values V_LT are stored in an l_n-dimensional lookup table, each dimension representing a feature. Thus, the value of a post-decision state S^x is

$$\bar{V}^n(S^x) = V_{LT}\big(\phi^n_1(S^x), \ldots, \phi^n_{l_n}(S^x)\big).$$

The M-VFA is a combination of the two approximations, V^p and V^n, which we represent generally as V = g(V^p, V^n). While the algorithm for determining the values of V^p and V^n is agnostic to the form of the combination, in this paper, we focus on a convex combination of V^p and V^n. Given a post-decision state S^x and a user-defined parameter λ, the M-VFA is given by

$$\bar{V}^\lambda(S^x) = g_\lambda\big(\bar{V}^p, \bar{V}^n\big)(S^x) = (1-\lambda)\,\bar{V}^p(S^x) + \lambda\,\bar{V}^n(S^x). \qquad (6)$$
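Equations (5) and (6) combine into a small data structure. The following sketch assumes a one-dimensional feature map, a rounding-based lookup key, and a parametric fallback for unexplored lookup entries; these are illustrative choices, not part of the formal definition:

```python
class MesoVFA:
    """Illustrative sketch of the M-VFA of Eq. (6): a convex combination of a
    linear-basis P-VFA (Eq. (5)) and a lookup-table N-VFA. The feature maps,
    the lookup key, and the fallback rule for unexplored entries are
    assumptions of this sketch."""

    def __init__(self, theta, p_features, n_features, lam):
        self.theta = theta            # (theta_0, theta_1, ..., theta_m)
        self.p_features = p_features  # state -> (phi^p_1, ..., phi^p_m)
        self.n_features = n_features  # state -> lookup key (phi^n_1, ...)
        self.lam = lam                # lambda in [0, 1]
        self.table = {}               # lookup-table values V_LT

    def v_parametric(self, s):
        # Eq. (5): theta_0 + sum_i theta_i * phi^p_i(s)
        phis = self.p_features(s)
        return self.theta[0] + sum(t * p for t, p in zip(self.theta[1:], phis))

    def v_nonparametric(self, s):
        # fall back to the parametric value when the entry is unexplored;
        # one simple way to keep the combination defined for every state
        return self.table.get(self.n_features(s), self.v_parametric(s))

    def value(self, s):
        # Eq. (6): convex combination of the two components
        return ((1 - self.lam) * self.v_parametric(s)
                + self.lam * self.v_nonparametric(s))

# Single feature (remaining capacity), lookup keyed on its rounded value.
vfa = MesoVFA(theta=(1.0, 0.5),
              p_features=lambda s: (s["capacity"],),
              n_features=lambda s: (round(s["capacity"]),),
              lam=0.5)
vfa.table[(4,)] = 10.0
print(vfa.value({"capacity": 4.0}))  # 0.5 * (1 + 0.5*4) + 0.5 * 10 = 6.5
```

For a state whose lookup entry is unexplored, the fallback reduces the value to the parametric estimate alone, which is what makes the combined approximation defined for every state.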

4.2 Approximate Value Iteration for the M-VFA

In this section, we define our method for determining the values of the parametric and non-parametric components of the M-VFA. We denote these two components as M-VFA(P) for parametric and M-VFA(N) for non-parametric. We refer to our method as AVI-M-VFA.

AVI-M-VFA is based on approximate value iteration (AVI) (see Powell (2011) for an overview of AVI). Like AVI, AVI-M-VFA iterates through a set of sample path realizations. At each iteration and each step in a given sample path realization, the algorithm either explores the state space or exploits the current value function. AVI-M-VFA solves the approximate Bellman Equation using the current approximated values V of the sampled post-decision states. These values are a combination of the current values of M-VFA(N) (V^n) and M-VFA(P) (V^p). The key difference between AVI-M-VFA and AVI, as well as between AVI-M-VFA and related methods discussed in the literature review, is that AVI-M-VFA updates V^n and V^p simultaneously.

The details of the AVI-M-VFA are presented in Algorithm 1. Input for the algorithm is the initial parametric approximate value function M-VFA(P) V^p. The set of M-VFA(N) values is initially empty. Throughout, the algorithm carries the observed states and their approximated values. This set of observations O is initially empty.

After initialization, the algorithm generates a series of sample paths. For each sample path, the algorithm records the realized post-decision states and the running value of the rewards. These values are stored respectively in sets R and S^x that are empty at the start of each iteration of the algorithm.

Given a decision state S_k along sample path i, the algorithm solves the approximate Bellman Equation. To this end, the algorithm iterates through the potential decisions. For each post-decision state S^x_k resulting from the current state and decision, the algorithm solves the approximate Bellman Equation. The algorithm selects the decision that maximizes the approximate Bellman Equation. We note that the algorithm can be modified to include some randomization in the decision selection.

Algorithm 1: Approximate Value Iteration for the M-VFA (AVI-M-VFA)

Input: initial M-VFA(P) V^p
Output: M-VFA(N) V^n, M-VFA(P) V^p

// Initialization
i ← 1; V^n ← ∅; O ← ∅
// Simulation
while i ≤ N do
    k ← −1; x ← ∅; S^x_{−1} ← ∅; S^x ← ∅; R ← ∅; R_{−1} ← 0
    while S^x_k ≠ S_K do
        k ← k + 1
        ω^i_k ← GenerateExogenous(S_k, x)
        S_k ← (S^x_{k−1}, ω^i_k)
        v ← −BigM
        for all x ∈ X(S_k) do
            S^x_k ← (S_k, x)
            v_temp ← R(S_k, x) + g(V^p, V^n)(S^x_k)
            if v_temp > v then
                v ← v_temp; x* ← x
            end if
        end for
        S^x_k ← (S_k, x*)
        R_k ← R_{k−1} + R(S_k, x*)
        S^x ← S^x ∪ {S^x_k}
        R ← R ∪ {R_k}
    end while
    // Update
    O ← UpdateObservations(O, V^n, V^p, S^x, R)
    V^n ← UpdateN(V^n, O)
    V^p ← UpdateP(V^p, O)
    i ← i + 1
end while
// Termination
return V^n, V^p, O

A sample path ends upon reaching a termination state. Before beginning a new sample path,

the states and values observed during the sample path are added to the set of observations. Most

importantly, M-VFA(N) and M-VFA(P) are updated. The specific updates depend on the design of


M-VFA(N) and M-VFA(P). In Section 5.4, we provide an example related to the test problem used

in this paper.

After exploring N sample paths, the algorithm returns the VFAs of M-VFA(N) and M-VFA(P)

as well as the observation information O, which is potentially required to determine the values of V.
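The control flow above can be sketched in code. The following is a structural sketch only, not the authors' implementation: `initial_state`, `decisions`, `transition`, `reward`, the component value functions `v_p`/`v_n`, and the update routines are hypothetical placeholders for the problem-specific parts, and the combination parameter `lam` is assumed to weight the parametric component.

```python
import random

def avi_m_vfa(n_iters, initial_state, decisions, transition, reward,
              v_p, v_n, update_p, update_n, lam=0.5, seed=0):
    """Structural sketch of AVI-M-VFA: both value-function components
    are updated simultaneously from the same set of observations O."""
    rng = random.Random(seed)
    observations = []                              # the observation set O
    for _ in range(n_iters):
        state = initial_state(rng)
        path, running_reward = [], 0.0             # the sets S^x and R
        while state is not None:                   # None marks termination
            # greedy decision via the blended approximate Bellman Equation
            best = max(decisions(state),
                       key=lambda x: reward(state, x)
                       + lam * v_p(transition(state, x))
                       + (1.0 - lam) * v_n(transition(state, x)))
            running_reward += reward(state, best)
            post = transition(state, best)
            path.append((post, running_reward))
            state = post                           # exogenous step folded in
        observations.extend(path)
        update_p(observations)                     # simultaneous update of
        update_n(observations)                     # M-VFA(P) and M-VFA(N)
    return observations
```

The decision rule could also be randomized to force exploration, as noted above.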

5 Application: The Customer Acceptance Problem with Stochas-

tic Requests

In this section, we define the CAPSR and model it as a Markov decision process (MDP). We

then introduce an implementation of M-VFA specific to CAPSR and present a set of benchmark

policies from the literature. For a review of the literature related to CAPSR, we refer the reader to

Appendix A.2.

5.1 Problem Statement

In the CAPSR, a dispatcher receives orders from customers located in a given service area. These

customers place requests dynamically during the horizon [0, t^c_max], and the orders are unknown at the start of the horizon. Each requesting customer C offers an individual revenue P(C) and requires a specific capacity κ(C).

Accepted orders are served by a vehicle with capacity κ_max that delivers orders during a delivery phase [0, t^d_max] that does not overlap the capture phase. Each delivered order consumes the same service time ζ, and the travel time between two customers and/or the depot is d(·, ·).

Upon receiving a request for service, the dispatcher must immediately accept or reject the

request. Once accepted, an order must be served. An order can be accepted only if the addition of

the order to the vehicle does not violate the capacity and if a feasible planned tour τ incorporating

the new request exists. This means the overall travel and service duration d(τ) does not exceed the

time limit of the delivery phase t^d_max. The dispatcher can also reject a request. The dispatcher seeks

to maximize the expected sum of revenues.
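The acceptance condition can be expressed as a simple feasibility check. This is an illustrative sketch: `tour_duration` stands for d(τ) of a feasible planned tour including the new request, and the default parameter values are arbitrary examples, not part of the problem definition.

```python
def is_feasible(tour_duration, used_capacity, new_order_capacity,
                t_d_max=480, kappa_max=100):
    """A request can only be accepted if a feasible tour exists (overall
    travel and service duration within the delivery phase) and the
    vehicle's capacity is not exceeded."""
    return (tour_duration <= t_d_max
            and used_capacity + new_order_capacity <= kappa_max)
```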


5.2 Markov Decision Process

We model the CAPSR as a route-based MDP (See Ulmer et al. (2016a) for an overview of route-

based MDPs). An example of the CAPSR can be found in Appendix A.3.

A decision point k occurs when a new order is issued. A state S_k = (t_k, C_k, C^new_k, τ_k) contains the point of time t_k ∈ [0, t^c_max] at which the order occurs, the set of already accepted orders C_k = {C^1_k, . . . , C^m_k}, the new order C^new_k, and the currently planned tour τ_k = (D, C^{τ_k}_1, . . . , C^{τ_k}_m, D) through the already accepted customers, starting and ending at the depot D.

At each decision point k, a decision x(S_k) is made about whether to accept or reject the customer C^new_k, and, if the customer is accepted, how to accommodate it in τ_k. A decision x is feasible if the resulting tour duration d(τ^x_k) does not exceed t^d_max and the sum of capacities does not exceed the overall capacity:

κ_max − Σ_{C ∈ C^x_k} κ(C) ≥ 0.

The reward for an accepted customer C^new_k is R(S_k, x) = P(C^new_k) and is R(S_k, x) = 0 otherwise.

The decision to accept customer C^new_k leads to a transition in which the customer C^new_k is added to the set C^x_k and the tour τ^x_k is updated to include C^new_k, resulting in the post-decision state S^x_k = (t_k, C^x_k, τ^x_k). The realization of the next request ω_k leads to a new decision state S_{k+1} = (t_{k+1}, C^x_k, C^new_{k+1}, τ^x_k).

The MDP is initialized at the point of the first order with S_0 = (t_0, ∅, C^new_0, (D, D)). The initial tour contains only the depot. The termination state is S_K = (t^c_max, C^x_{K−1}, τ^x_{K−1}).
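The state and the accept transition can be sketched with a small data structure; the class and field names are illustrative, and the exogenous arrival of the next request is omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    t: float            # point of time t_k of the current request
    accepted: tuple     # already accepted orders C_k
    new_order: object   # the new request C_k^new (None in a post-decision state)
    tour: tuple         # planned tour tau_k, starting and ending at depot 'D'

def accept(state, updated_tour):
    """Transition to the post-decision state S_k^x after accepting C_k^new."""
    return State(state.t, state.accepted + (state.new_order,),
                 None, updated_tour)
```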

5.3 M-VFA for CAPSR

In the following, we describe how we apply and tune the M-VFA for the CAPSR. We describe the

selected features and the required tuning of M-VFA. Finally, even though the M-VFA overcomes

the curse of dimensionality related to the state space, the CAPSR is also challenged by the dimen-

sionality of the action space. Thus, we reduce the decision space by applying a routing heuristic.

We start with the parametric and non-parametric components of M-VFA and then describe how the

steps of Algorithm 1 are executed.


Parametric and Non-Parametric Components

For both the M-VFA(P) and M-VFA(N) that we apply to the CAPSR, we use the features free time budget b^x_k and free capacity κ^x_k of a post-decision state S^x_k. The free time budget b^x_k, 0 ≤ b^x_k ≤ t^d_max, is computed as

b^x_k = t^d_max − d(τ^x_k).

The free capacity κ^x_k follows from the currently consumed capacity and is computed as

κ^x_k = κ_max − Σ_{C ∈ C^x_k} κ(C).

An example of these two features is presented in Appendix A.3.
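Both features follow directly from the post-decision state. A minimal sketch (the default parameter values are illustrative):

```python
def features(t, tour_duration, used_capacity, t_d_max=480, kappa_max=100):
    """Feature vector (t, b, kappa) of a post-decision state: free time
    budget b = t^d_max - d(tau) and free capacity kappa = kappa_max
    minus the capacity consumed by accepted orders."""
    b = t_d_max - tour_duration
    kappa = kappa_max - used_capacity
    return (t, b, kappa)
```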

For the purpose of presentation, we write b and κ in the remainder of this section. We also use

the current point of time t. For both the M-VFA(P) and M-VFA(N), a state is therefore represented

by a three-dimensional feature-vector Φn = Φp = (t, b, κ).

For the parametric component M-VFA(P), we approximate a linear function f^V. Our choice

is motivated by Bertsimas and Demir (2002) who demonstrate the effectiveness of such a function

when applied to a knapsack problem, a problem related to the capacity component of the CAPSR.

Because preliminary tests integrating t as a feature into the M-VFA(P) resulted in inferior policies,

we discretize time into unit intervals and derive a function f^V_ι(b, κ) for each of the resulting time intervals ι ∈ T, where T is the set of intervals resulting from discretizing [0, t^c_max]. The overall function is therefore stepwise-linear over the time dimension. The function takes as variables b and κ and is formally written as

V^p(S^x) = θ^b_ι × b + θ^κ_ι × κ + θ^a_ι, (7)

where ι is the time interval in T associated with Sx.

The coefficients Θ_ι = (θ^b_ι, θ^κ_ι, θ^a_ι) ∈ R^3 determine the function and are approximated. Coefficient θ^a_ι represents the intercept of the function. This term is zero for the optimal value function because, by definition, a budget of zero in both the free time and the vehicle’s capacity leads to a value of zero. Yet, we add this parameter to increase the number of considered functions since preliminary tests that did not include an intercept led to inferior results. To estimate the coefficients θ^b_ι, θ^κ_ι, θ^a_ι for each interval ι, we draw on multiple linear regression, minimizing the mean-squared error between the realized values and the functional values over the last ν observations. Based on preliminary tests, we set ν = 100.
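The per-interval fit can be sketched with ordinary least squares; the `(b, κ, value)` triple layout of the observations is an assumption for illustration.

```python
import numpy as np

def fit_interval(observations, nu=100):
    """Least-squares fit of V^p(b, kappa) = theta_b*b + theta_kappa*kappa
    + theta_a over the last nu observations of one time interval.
    observations: list of (b, kappa, observed_value) triples."""
    obs = np.asarray(observations[-nu:], dtype=float)
    # design matrix with a column of ones for the intercept theta_a
    X = np.column_stack([obs[:, 0], obs[:, 1], np.ones(len(obs))])
    theta, *_ = np.linalg.lstsq(X, obs[:, 2], rcond=None)
    return theta  # (theta_b, theta_kappa, theta_a)
```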

The non-parametric component M-VFA(N) approximates the values of a three-dimensional lookup table (LT): one dimension for the point of time t, one for the budget b, and one for the capacity κ. To

allow a fast and efficient approximation, we draw on a dynamic state space partitioning scheme,

the dynamic lookup table (DLT) introduced in Ulmer et al. (2017). The DLT partitions the three-

dimensional vector space in response to observed values. The partitioning is defined by intervals

in all three dimensions. The DLT starts with large intervals to achieve a first approximation. Then,

the DLT creates finer partitions for “important” and “reliable” areas of the vector space, those in which there are many observations and which thus represent frequently visited states. The

DLT further refines the partitions in areas with high volatility across the observed values. This

means that areas with high variance across the observed values and with a sufficient number of

observations are partitioned into smaller intervals. Given a current partition of the DLT, the value

of a post-decision state is calculated as

V^n(S^x) = V_DLT(t, b, κ). (8)

As proposed in Ulmer et al. (2017), the DLTs start with an interval length of 16 units in both time and capacity, decreasing to 1. We set the DLT-threshold parameter to 3.0 as proposed in Ulmer et al. (2016b). This parameter controls the speed at which the DLT algorithm creates new partitions. The values for M-VFA(N) are updated after each run to the running average.
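As a heavily simplified stand-in for the DLT, the following sketch uses a fixed partitioning with one interval width and running-average value updates; the adaptive refinement of “important” and “reliable” areas is omitted, so this only illustrates the lookup-and-update mechanics.

```python
class LookupTable:
    """Fixed-width lookup table over the feature space (t, b, kappa)
    with incremental running-average value updates per cell."""
    def __init__(self, width=16):
        self.width = width
        self.cells = {}                  # cell -> (count, mean value)

    def _cell(self, t, b, kappa):
        return (t // self.width, b // self.width, kappa // self.width)

    def value(self, t, b, kappa):
        entry = self.cells.get(self._cell(t, b, kappa))
        return None if entry is None else entry[1]   # None: unobserved

    def update(self, t, b, kappa, observed):
        cell = self._cell(t, b, kappa)
        n, mean = self.cells.get(cell, (0, 0.0))
        self.cells[cell] = (n + 1, mean + (observed - mean) / (n + 1))
```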

5.4 AVI-M-VFA Implementation for the CAPSR

In this section, we describe the implementation of AVI-M-VFA (Algorithm 1) for the CAPSR.

For each value of λ = 0, 0.1, 0.2, . . . , 1, we run N = 1 million trials of Algorithm 1. Because

the action space is so large, we make routing decisions using an insertion routing heuristic. The

routing heuristic is described in Appendix A.4. This leads to at most two decisions, accept or

reject, for the new request. To evaluate a decision, we use the result of the routing heuristic (in


the case of an accept decision) to transition to a post-decision state and then solve the approximate

Bellman Equation. In this case, the approximate Bellman Equation is given by

V^λ(S^x) = λ × (θ^b_ι × b + θ^κ_ι × κ + θ^a_ι) + (1 − λ) × V_DLT(t, b, κ), (9)

where b, κ, t, and ι are the features and time interval associated with the post-decision state Sx.

We note that, if no LT-entry for the given post-decision state exists at the time of the decision, we

set V(S^x_k) = V^p(S^x_k).
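The evaluation with the fallback rule can be sketched as follows. Consistent with the text's convention that λ = 1 corresponds to the pure P-VFA, `lam` is assumed to weight the parametric component; `lookup` is any object whose `value(t, b, kappa)` method returns `None` for unobserved states (the names are illustrative).

```python
def blended_value(t, b, kappa, v_p, lookup, lam):
    """Blended approximate value of a post-decision state with the
    fallback rule: if no LT-entry exists yet, use the parametric
    value V^p alone instead of the blend."""
    v_n = lookup.value(t, b, kappa)    # None when the state is unobserved
    if v_n is None:
        return v_p(t, b, kappa)
    return lam * v_p(t, b, kappa) + (1.0 - lam) * v_n
```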

At the end of a given trial, we update the M-VFA(N) and M-VFA(P) values based on the

observations. For the DLT, this additionally means that the partitions are updated. For M-VFA(P),

we use the last ν = 100 observations for each time interval, and the new parameters Θι of function

fVι of M-VFA(P) are determined by means of multiple linear regression.

Having determined an M-VFA for each λ = 0, 0.1, 0.2, . . . , 1, we then determine the best

setting for λ for each instance setting. To do this, we compare the M-VFA for each λ across an

additional 10,000 trials and choose the λ whose M-VFA leads to the best performance for the given

instance setting.

5.5 Benchmark Policies

In this section, we present benchmark policies. We are interested in the performance of the M-VFA

procedure and the general performance of VFA for the CAPSR. To this end, we compare M-VFA

with VFA-methods from the literature and with an online rollout algorithm.

To show the advantages of combining N- and P-VFA, we first compare our approach with

conventional N- and P-VFA. To do so, we approximate both VFAs individually. The individual

P-VFA can be seen as our M-VFA with λ = 1. The N-VFA is similar to the M-VFA with λ = 0,

but differs in cases in which we observe a new post-decision state and hence a potentially empty

LT-entry. If an empty partition is observed during the tuning phase, the N-VFA selects the unvisited

partition to force exploration.

To examine the impact of the simultaneous tuning of the N- and P-VFA, we also compare

the M-VFA to an ex-post combination of the just described individual N- and P-VFA. We call

these policies E-VFA for ex-post combination. To create E-VFA, we first tune N- and P-VFA


individually using 1 million simulation runs. Then, to use these individually derived N- and P-

VFAs in Equation (9), we must find a value of λex. To do so, we run 10,000 trials for each

λ = 0, 0.1, 0.2, . . . , 1 and choose the best λex for each instance setting. We emphasize that the

selection of λex is different from the selection of λ described in the previous section. In the previous

section, the N- and P-VFA are tuned simultaneously. Here, we combine the two VFAs after having

tuned them individually. In contrast to the use of λ in the tuning of the M-VFA, λex is not used in

the tuning phase, but only in the execution phase of E-VFA. Again, the M-VFA and E-VFA policies are identical for λ = λex = 1 and similar for λ = λex = 0.

We also compare the M-VFA to an online rollout algorithm policy RA. Our benchmark RA is

motivated by Campbell and Savelsbergh (2005) in which the expected reward-to-go for a particular

state is estimated using the number of future feasible customers, their request probability, and their

revenues. Because for the CAPSR the number of potential customers is vast and their revenues

are unknown, we sample a set of requests. We also extend the method proposed in Campbell and

Savelsbergh (2005) by adding a time dimension. This means that we do not assume that customers request all at once but rather individually over the time horizon. Ulmer et al. (to appear) show that

the addition of a time-dimension leads to a better approximation and a better rollout policy for

dynamic routing problems.

To evaluate the second term of the Bellman Equation in state Sk using the RA, for each decision

x ∈ X(Sk), the RA samples a set of m realizations starting in post-decision state Sxk and ending

in S_K. Within each sampled realization, the RA draws on a myopic base-policy that accepts every

feasible request. The average of the realized revenues over the simulated realizations is then the estimate of the reward-to-go for S^x_k.

One drawback of rollout algorithms is that they are online. That is, unlike the proposed M-VFA,

rollout algorithms perform their computation at the time of execution. Given that the time available

for computation in real time is highly limited, we limit the number of simulated realizations. Based

on preliminary tests, we set m = 16.
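The rollout estimate reduces to an average over sampled futures; `simulate_path` is a hypothetical placeholder that samples the remaining requests and applies the myopic accept-all-feasible base policy.

```python
import random

def rollout_estimate(post_state, simulate_path, m=16, seed=0):
    """Rollout estimate of the reward-to-go for post-decision state S_k^x:
    the average revenue over m sampled realizations, each completed by
    the myopic base policy inside simulate_path."""
    rng = random.Random(seed)
    return sum(simulate_path(post_state, rng) for _ in range(m)) / m
```

Because this computation happens online, m trades off estimate quality against response time.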


6 Computational Evaluation

In this section, we analyze the approaches for a variety of instances defined in Section 6.1. We present the

results in Section 6.2. We analyze the results for the different VFAs in Section 6.3 and the M-VFA

approximation process in Section 6.4. Finally, we compare the performance of N-VFA and P-VFA

with respect to varying resource shortages.

6.1 Instances

The customer locations are provided by Ulmer and Thomas (2016) and based on Iowa City census

data. The depot is located in the upper left corner of the service area. We calculate the distances

d(·, ·) using the Haversine distance measure (Shumaker and Sinnott, 1984). This distance measure

is the equivalent of the Euclidean distance on a globe. We multiply each resulting distance by

1.4 to account for the impact of traveling on a road network (Boscoe et al., 2012). We set both

capture phase and delivery phase to t^c_max = t^d_max = 480 minutes, which is equivalent to assuming that orders are placed the day before deliveries take place. To ensure a surplus in orders, we set the expected number of orders to 50 and the service time to ζ = 10. This means that it is generally not possible to serve every order and a selection is necessary. The orders are

generated over time by means of a minute-by-minute Poisson process. The revenue per customer

P ∈ U [1, 10] is discretely uniformly distributed. This can be seen as an extension of Campbell

and Savelsbergh (2005). Following Bertsimas and Demir (2002), we set the discrete capacity

distribution to κ ∈ U [1, 10]. With these parameters, we generate 10,000 trials to compare the

M-VFA to each of the benchmarks.
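The request stream can be sketched as a minute-by-minute Poisson process with discrete uniform revenues and capacities; the Poisson draw below uses Knuth's method, and the `(minute, revenue, capacity)` tuple layout is illustrative.

```python
import math
import random

def sample_requests(rng, horizon=480, expected_orders=50):
    """Sample one day of requests: a minute-by-minute Poisson process with
    rate expected_orders/horizon; revenue and capacity are discrete
    uniform on {1, ..., 10}, following the instance description."""
    rate = expected_orders / horizon
    requests = []
    for minute in range(horizon):
        # Knuth's method for a Poisson draw with a small rate
        n, p, limit = 0, 1.0, math.exp(-rate)
        while True:
            p *= rng.random()
            if p <= limit:
                break
            n += 1
        for _ in range(n):
            requests.append((minute, rng.randint(1, 10), rng.randint(1, 10)))
    return requests
```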

To analyze the impact of resource shortages, we vary the vehicle’s capacity and the vehicle’s travel speed. We define varying speeds of v = 20 km/h, 25 km/h, and 30 km/h. All travel durations are rounded to the minute. These speeds reflect heavy, moderate, and light traffic conditions. We further define different maximal capacities κ_max = 100, 120, 140, 160. Capacity κ_max = 100 represents a transport van allowing only for the service of around 18 customer orders, while κ_max = 160 represents a truck that can serve about 30 customer orders. The variation of

capacities and speeds results in 12 different instance settings. We note that our instance settings

result in up to 481 × 481 × 161 = 37,249,121 different feature combinations for the instance settings with a capacity of 160. Given the use of continuous time and service areas, the state space is infinite.

Figure 2: Improvement Compared to RA-Policy

6.2 Solution Quality

In this section, we analyze the performance of the VFAs. First, we compare the improvement of

the VFA policies compared to the RA-policy. For a detailed presentation of the individual results,

see Table A1 in Appendix A.1. Let Q(π, i) be the average revenue for policy π and instance setting i. We then define the improvement of policy π over the RA-policy by calculating

(Q(π, i) − Q(RA, i)) / Q(RA, i) × 100%.
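In code, the improvement measure is simply:

```python
def improvement_pct(q_policy, q_ra):
    """Relative improvement of policy pi over the RA-policy, in percent."""
    return (q_policy - q_ra) / q_ra * 100.0
```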

Figure 2 presents the average improvement over all instance settings. On the x-axis, the policy

is depicted. On the y-axis, the improvement compared to the RA-policy is shown.

All VFA-approaches significantly outperform the RA-policy. The M-VFA provides the greatest improvement at 25.5%, with the N-VFA, P-VFA, and E-VFA returning improvements of 18.7%,

20.6%, and 21.1%, respectively. Interestingly, the RA has an advantage over the VFAs in that

it evaluates states using all the detail of the state and not just features extracted from the states.

Figure 3: Improvement of M-VFA Compared to N-VFA, P-VFA, and E-VFA

However, the RA uses a myopic policy in its estimation of the reward-to-go. For the CAPSR at least, the results indicate that the value of the full information of the state does not overcome the poor quality of the myopic heuristic policy.

Figure 3 presents the average improvement of the M-VFA compared to the other VFAs. The

percentage differences in Figure 3 are computed similarly to those for Figure 2. On average, M-

VFA outperforms N-VFA by 5.4%, P-VFA by 3.9%, and E-VFA by 3.5%. Further, as shown in

Table A1 in the Appendix A.1, M-VFA outperforms all other policies not only on average, but also

for each of the individual instance settings. Overall, these results show that the combination of the N- and P-VFA, whether during the approximation or ex-post, has value relative to either the N- or P-VFA individually. However, significant further improvement is possible if the two are combined during the approximation phase, which is the difference between the M-VFA method and the E-VFA.

Also of interest is the fact that, on average, P-VFA provides better solution quality than N-

VFA. This result supports the conclusion of Bertsimas and Demir (2002) that high-dimensional

state spaces result in unreliable approximation by N-VFA and that P-VFA is superior in such

circumstances. Yet, as we show in Section 6.5, the performance of N- and P-VFA strongly depends

on the instance settings. In particular, the time budget and capacity have an important role.

Figure 4: Comparison of M-VFA and E-VFA per λ

6.3 The Value of Simultaneous Approximation: M-VFA vs. E-VFA

In this section, we analyze the impact of simultaneous approximation by M-VFA. To this end,

we compare M-VFA with E-VFA to show why the simultaneous approximation of LT-values and

feature coefficients is advantageous. Policy class E-VFA first approximates N-VFA and P-VFA

individually and later combines the two components in the determination of the values. As shown

in Section 6.2, E-VFA provides inferior results on average. In the following, we first show the

average improvement compared to the RA for both policy classes for varying λ and λex. We then

analyze the structure of the approximated value function in one example.

Figure 4 shows the average improvement of M- and E-VFA compared to the RA for varying

λ. The x-axis depicts the parameter λ = λex = 0, 0.1, . . . , 1. The improvement is shown on the

y-axis.

First, we analyze the extreme cases λ = λex = 0 and λ = λex = 1. Because they both result

in pure P-VFA, both policies M- and E-VFA perform similarly for λ = λex = 1. There is a gap

between M- and E-VFA with λ = λex = 0. This result is surprising because, with λ = λex = 0,

both only draw on the N-VFA and M-VFA(N) values. Yet, due to our rule based on the observa-

tions, M-VFA with λ = 0 can access the M-VFA(P) component in the evaluation of unobserved

states. Therefore, the initial approximation is more reliable and the overall approximation quality


is higher. This finding indicates that it may be generally beneficial to initiate an N-VFA with the

values of a P-VFA.

We now analyze the development for 0.0 ≤ λ = λex ≤ 1.0. For E-VFA, we see a slight, but

constant increase in solution quality up to λ = 0.9. In essence, instead of utilizing the advantages of

N- and P-VFA, the ex-post combination just provides the convex combination of the two weighted

by λex. For M-VFA, we observe a peak at an intermediate λ. The best results are generally achieved with 0.4 ≤ λ ≤ 0.5 (for details of the best λ per instance setting, we refer to Table A2 in Appendix A.1). These results suggest that the M-VFA is doing more than returning a

weighted value of the M-VFA(N) and M-VFA(P). Rather, the simultaneous approximation leads to an approximation that is greater than the sum of its parts.

To better understand the simultaneous approximation, we present the example of the instance

setting v = 25 km/h and κ_max = 120. For this instance setting, the best M-VFA values are approximated with λ = 0.4. Because the approximated value functions are three-dimensional, we fix the

point of time and time budget parameters to t = 240 and b = 120, respectively. Figure 5 shows the

values by capacity. The capacity is depicted on the x-axis. States with capacities higher than 70

are usually not observed for this instance setting when t = 240 and b = 120. The y-axis depicts

the approximated values. The gray lines show the values for the individual approximation. The

black lines indicate the values for M-VFA(N) and M-VFA(P). The dashed lines represent P-VFA

or M-VFA(P) while the solid lines represent N-VFA or M-VFA(N).

The parametric VFAs have higher values than the non-parametric VFAs. This result can be

explained by the fact that the parametric coefficients are determined for all potential values of b.

For this particular b, they are therefore slightly higher. However, this phenomenon does not say

anything about the quality of approximation. The plateaus of the non-parametric approximations

are the result of the dynamic LT-partitioning.

The most notable feature shown in Figure 5 is the difference between the N- and P-VFA versus

the difference between M-VFA(N) and M-VFA(P). The N-VFA and P-VFA values show a signifi-

cant difference. The difference between the M-VFA components is less distinct. This result occurs

because the two components of the M-VFA, M-VFA(N) and M-VFA(P), are tuned simultaneously,

and thus reinforce one another.

Figure 5: Approximate Value Functions for Time t = 240, Free Budget b = 120

A result of the simultaneous tuning can be seen in the non-monotonic nature of the N-VFA, whose value drops around κ = 30, even though increasing capacity should yield increasing values, as is the case in M-VFA(N). The decrease results from the fact that the N-VFA’s first observation of the entry at κ = 30 produced a poor approximation; the AVI for the N-VFA subsequently avoided states represented by this entry. In the case of M-VFA(N), observations leading to such results are overcome by the M-VFA(P), as demonstrated in the figure. This shows how the combination of M-VFA(N) and M-VFA(P) leads to an approximation that is both reliable and accurate.

6.4 Approximation Process

In the following, we analyze how the combination impacts the approximation process. Figure 6

depicts the approximation process for the example used in the previous section. Each axis in the spiderweb represents the parameter λ, starting at the top with λ = 0.0 and increasing clockwise in steps of 0.1 up to λ = 1.0. The values represent the improvement of M-VFA compared to the RA, starting at −10% in the center of the spiderweb. We show four different steps of the approximation process, after 1k, 10k, 100k, and 1 million approximation runs, which are indicated by the dotted line, the two dashed lines, and the solid line, respectively.

Figure 6: Approximation Process of M-VFA: Improvement per λ compared to the Rollout Algorithm

We first analyze the solution quality after 1k approximation runs. Starting with λ = 0.0, we observe a constant increase in solution quality with increasing λ. The best results are achieved by λ = 1.0. This means that with only a few approximation runs, a focus on reliable approximation as given by P-VFA provides better results than a focus on accurate approximation as given by

N-VFA. With an increase in approximation runs, the best tuning shifts from λ = 1.0 for 1k to

λ = 0.9 for 10k runs and to λ = 0.7 for 100k runs before eventually reaching λ = 0.4 for 1

million approximation runs. Importantly, the improvement for λ = 1.0 stagnates after the early

approximation phase and the improvement for λ = 0.0 after 1 million approximation runs is still

low at 21.8%. Thus, except in the case of P-VFA for a very low number of trials, neither the

N- nor the P-VFA is capable of providing the best results. Rather, an explicit combination of non-

parametric and parametric VFAs with 0 < λ < 1 significantly strengthens the entire approximation

process.

Figure 7: Non-Parametric vs. Parametric VFA

6.5 Resource Shortages: Routing vs. Knapsack Problem

Finally, we analyze the results with respect to each instance’s resource shortages. We show that

the performance of N- and P-VFA depends on the importance in the instance of routing (time) and

capacity. In particular, we show that the P-VFA is superior when we observe a shortage in the vehicle’s capacity and thus the CAPSR is closer to a dynamic knapsack problem. If routing decisions become more important, N-VFA outperforms P-VFA.

To demonstrate this behavior, we analyze the results with respect to different vehicle capacities

as shown in Figure 7. The x-axis shows the capacity. For the M-VFA policy, the consumption of

the time budget and capacity averaged over the three speeds is indicated by the dashed lines and

the left y-axis. For the time budget, we observe a constant increase with respect to the vehicle’s

capacity. This means that a low capacity of 100 may lead to fewer acceptances, shorter routes, and

less consumption of the available time. Thus, when capacity is low, the number of orders is limited

by the capacity, and routing is less important. If the available capacity increases, more orders can

be served and the routing dimension gains in importance.

We now analyze how the different resource shortages affect the performance of N- and P-VFA.

We depict the improvement of N-VFA compared with the P-VFA by the solid line and the right

y-axis in Figure 7. For a low capacity, P-VFA provides a better solution quality than N-VFA.


This can be explained by the linear approximation providing good results for the dynamic knap-

sack problem with independent capacity consumptions of the items. With increasing capacity, the

routing becomes more important and N-VFA outperforms P-VFA. This confirms the observation

by Ulmer et al. (2017) that, for dynamic routing problems, the structure of the value function is

complex. Hence, a functional (in our case linear) approximation may not be able to capture this

complexity.

7 Conclusion

In this research, we have presented a new ADP-method that combines the advantages of non-

parametric and parametric VFAs. Further, as tuning can be done offline, the M-VFA allows im-

mediate responses to real-time requests. Using the CAPSR as a testbed, we demonstrate that

the proposed method provides excellent solution quality. Importantly, our results demonstrate the

value of simultaneously tuning the two components of the M-VFA.

Future research may focus on both extensions of the M-VFA and the CAPSR. The M-VFA is

a novel and general ADP-method. Hence, the M-VFA may be applicable to a variety of dynamic

and stochastic decision problems. Further, it may be interesting to analyze the performance of

M-VFA analytically for different artificial value-function structures. Finally, the procedure of M-

VFA may be further improved. Our computational study indicates that VFAs benefit from an

initial, reliable parametric approximation followed by a detailed N-VFA approximation. To this

end, more sophisticated rules based on the observations may dynamically adapt the combination parameter λ over the course of the approximation process and even determine state-dependent values of λ. For example, λ could be varied based on the lookup table (LT) entries.
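The blending of the two components, and an observation-count rule for a state-dependent λ, might be sketched as follows. The interface, the direction of the weighting, and the threshold rule are illustrative assumptions rather than the paper's exact procedure.

```python
def meso_value(state, p_vfa, n_vfa, lam):
    """M-VFA estimate: blend non-parametric and parametric values.

    p_vfa, n_vfa: callables mapping a state to an approximate value;
    lam: combination weight in [0, 1] (hypothetical interface).
    """
    return lam * n_vfa(state) + (1.0 - lam) * p_vfa(state)

def observation_based_lambda(lt_entries, state, threshold=25):
    """Illustrative state-dependent rule: trust the lookup table (LT)
    in proportion to how often its entry for this state has been
    observed; fall back to the parametric component otherwise.
    (Hypothetical rule, not from the paper.)"""
    return min(1.0, lt_entries.get(state, 0) / threshold)
```

Under such a rule, rarely visited states lean on the reliable parametric estimate while frequently visited states exploit the more accurate non-parametric one, which mirrors the offline-tuning motivation above.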

For the CAPSR, fleets of vehicles may be considered, as well as additional constraints like time windows. Because it increases the dimensionality of the state and action spaces, the consideration of fleets is challenging. These problems may be approached with features differing for the

parametric and the non-parametric VFA component. Particularly, because its dimensionality is

unaffected by an increase in the number of features, the parametric component of the M-VFA may

capture additional features. Time windows may require the determination of additional features

and/or change the structure of the value function. In this case, the non-parametric component may


provide significant benefit. Finally, the presented results for the CAPSR may be used to develop

anticipatory pricing algorithms.

Acknowledgment

The authors thank Warren Powell and Ulf Jesper for their valuable advice.

References

Bertsekas, Dimitri P, John N Tsitsiklis. 1996. Neuro-dynamic programming. Athena Scientific, Belmont, Massachusetts.

Bertsimas, Dimitris, Ramazan Demir. 2002. An approximate dynamic programming approach to

multidimensional knapsack problems. Management Science 48(4) 550–565.

Boscoe, Francis P., Kevin A. Henry, Michael S. Zdeb. 2012. A nationwide comparison of driving distance versus straight-line distance to hospitals.

Browne, Cameron B, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling,

Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, Simon Colton.

2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational

Intelligence and AI in Games 4(1) 1–43.

Cai, Yongyang, Kenneth L Judd. 2014. Advances in numerical dynamic programming and new

applications. Karl Schmedders, Kenneth L Judd, eds., Computational Economics, Handbooks

of computational economics, vol. 3. North-Holland, Amsterdam, 479–516.

Cai, Yongyang, Kenneth L Judd, Thomas S Lontzek, Valentina Michelangeli, Che-Lin Su. 2017.

A nonlinear programming method for dynamic programming. Macroeconomic Dynamics 21(2)

336–361.

Campbell, Ann M, Martin Savelsbergh. 2005. Decision support for consumer direct grocery ini-

tiatives. Transportation Science 39(3) 313–327.


Chen, Xi, Mike Hewitt, Barrett W. Thomas. 2017. Approximate dynamic programming for the

multi-period technician scheduling with experience-based service times and stochastic cus-

tomers. Submitted for publication.

Ehmke, Jan F, Ann M Campbell. 2014. Customer acceptance mechanisms for home deliveries in

metropolitan areas. European Journal of Operational Research 233(1) 193–207.

Esser, Klaus, Judith Kurte. 2015. Kep 2015. Marktanalyse, Bundesverband Paket und Expresslo-

gistik e. V.

Fang, Jiarui, Lei Zhao, Jan C Fransoo, Tom Van Woensel. 2013. Sourcing strategies in supply

risk management: An approximate dynamic programming approach. Computers & Operations

Research 40(5) 1371–1382.

Geist, Matthieu, Olivier Pietquin. 2013. Algorithmic survey of parametric value function approxi-

mation. IEEE Transactions on Neural Networks and Learning Systems 24(6) 845–867.

George, Abraham, Warren B Powell, Sanjeev R Kulkarni, Sridhar Mahadevan. 2008. Value func-

tion approximation using multiple aggregation for multiattribute resource management. Journal

of Machine Learning Research 9(10) 2079–2111.

Godfrey, Gregory A, Warren B Powell. 2001. An adaptive, distribution-free algorithm for the

newsvendor problem with censored demands, with applications to inventory and distribution.

Management Science 47(8) 1101–1112.

Godfrey, Gregory A, Warren B Powell. 2002a. An adaptive dynamic programming algorithm for

dynamic fleet management, I: Single period travel times. Transportation Science 36(1) 21–39.

Godfrey, Gregory A, Warren B Powell. 2002b. An adaptive dynamic programming algorithm for

dynamic fleet management, II: Multiperiod travel times. Transportation Science 36(1) 40–54.

Goodson, Justin C., Jeffrey W. Ohlmann, Barrett W. Thomas. 2013. Rollout policies for dynamic

solutions to the multivehicle routing problem with stochastic demand and duration limits. Op-

erations Research 61(1) 138–154.


Goodson, Justin C., Barrett W. Thomas, Jeffrey W. Ohlmann. 2016. Restocking-based rollout poli-

cies for the vehicle routing problem with stochastic demand and duration limits. Transportation

Science 50(2) 591–607.

Goodson, Justin C, Barrett W Thomas, Jeffrey W Ohlmann. 2017. A rollout algorithm frame-

work for heuristic solutions to finite-horizon stochastic dynamic programs. European Journal

of Operational Research 258(1) 216–229.

He, Miao, Lei Zhao, Warren B Powell. 2012. Approximate dynamic programming algorithms for

optimal dosage decisions in controlled ovarian hyperstimulation. European Journal of Opera-

tional Research 222(2) 328–340.

Hornik, Kurt, Maxwell Stinchcombe, Halbert White. 1989. Multilayer feedforward networks are

universal approximators. Neural Networks 2(5) 359–366.

Jabali, Ola, Roel Leus, Tom Van Woensel, Ton de Kok. 2013. Self-imposed time windows in

vehicle routing problems. OR Spectrum 37(2) 331–352.

Jiang, Daniel R, Warren B Powell. 2015a. An approximate dynamic programming algorithm for

monotone value functions. Operations Research 63(6) 1489–1511.

Jiang, Daniel R, Warren B Powell. 2015b. Optimal hour ahead bidding in the real time electricity

market with battery storage using approximate dynamic programming. INFORMS Journal on

Computing 27(3) 525–543.

Klapp, Mathias A, Alan L Erera, Alejandro Toriello. 2016. The one-dimensional dynamic dispatch

waves problem. Transportation Science.

Kleywegt, Anton J, Jason D Papastavrou. 1998. The dynamic and stochastic knapsack problem.

Operations Research 46(1) 17–35.

Kleywegt, Anton J, Jason D Papastavrou. 2001. The dynamic and stochastic knapsack problem

with random sized items. Operations Research 49(1) 26–41.

LeCun, Yann, Yoshua Bengio, Geoffrey Hinton. 2015. Deep learning. Nature 521(7553) 436–444.


Li, H., N. Womer. 2015. Solving stochastic resource-constrained project scheduling problems by

closed-loop approximate dynamic programming. European Journal of Operational Research

246 20–33.

Liu, Derong, Qinglai Wei, Ding Wang, Xiong Yang, Hongliang Li. 2017. Adaptive Dynamic

Programming with Applications in Optimal Control. Advances in Industrial Control, Springer,

Cham, Switzerland.

Maxwell, Matthew S, Mateo Restrepo, Shane G Henderson, Huseyin Topaloglu. 2010. Approx-

imate dynamic programming for ambulance redeployment. INFORMS Journal on Computing

22(2) 266–281.

Meisel, Stephan. 2011. Anticipatory Optimization for Dynamic Decision Making, Operations

Research/Computer Science Interfaces Series, vol. 51. Springer.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan

Wierstra, Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. preprint

arXiv:1312.5602, arXiv.

Papadaki, Katerina P., Warren B. Powell. 2002. Exploiting structure in adaptive dynamic pro-

gramming algorithms for a stochastic batch service problem. European Journal of Operational

Research 142(1) 108–127.

Papastavrou, Jason D, Srikanth Rajagopalan, Anton J Kleywegt. 1996. The dynamic and stochastic

knapsack problem with deadlines. Management Science 42(12) 1706–1718.

Powell, Warren B. 2011. Approximate Dynamic Programming: Solving the Curses of Dimension-

ality, Wiley Series in Probability and Statistics, vol. 842. John Wiley & Sons, New York.

Powell, Warren B, Stephan Meisel. 2016. Tutorial on stochastic optimization in energy—Part II:

An energy storage illustration. IEEE Transactions on Power Systems 31(2) 1468–1475.

Powell, Warren B, Hugo P Simao, Belgacem Bouzaiene-Ayari. 2012. Approximate dynamic pro-

gramming in transportation and logistics: a unified framework. EURO Journal on Transporta-

tion and Logistics 1(3) 237–284.


Ritzinger, Ulrike, Jakob Puchinger, Richard F Hartl. 2015. A survey on dynamic and stochastic

vehicle routing problems. International Journal of Production Research 1–17.

Savelsbergh, Martin, Tom Van Woensel. 2016. 50th anniversary invited article—city logistics:

Challenges and opportunities. Transportation Science 50(2) 579–590.

Schmid, Verena. 2012. Solving the dynamic ambulance relocation and dispatching problem using

approximate dynamic programming. European Journal of Operational Research 219(3) 611–

621.

Shen, Weiwei, Jun Wang. 2015. Transaction costs-aware portfolio optimization via fast Löwner-John ellipsoid approximation. Proceedings of the Twenty-Ninth AAAI Conference on Artificial

Intelligence. AAAI Press, Menlo Park, California, 1854–1860.

Shumaker, BP, RW Sinnott. 1984. Astronomical computing: 1. Computing under the open sky. 2. Virtues of the haversine. Sky and Telescope 68 158–159.

Simao, Hugo P, Jeff Day, Abraham P George, Ted Gifford, John Nienow, Warren B Powell. 2009.

An approximate dynamic programming algorithm for large-scale fleet management: A case

application. Transportation Science 43(2) 178–197.

Topaloglu, Huseyin, Warren B. Powell. 2006. Dynamic-programming approximations for stochas-

tic time-staged integer multicommodity-flow problems. INFORMS Journal on Computing 18(1)

31–42.

Ulmer, Marlin W, Justin C Goodson, Dirk C Mattfeld, Marco Hennig. to appear. Offline-online

approximate dynamic programming for dynamic vehicle routing with stochastic requests. Trans-

portation Science.

Ulmer, Marlin W, Justin C Goodson, Dirk C Mattfeld, Barrett W Thomas. 2016a. Route-based

Markov decision processes for dynamic vehicle routing problems. Submitted.

Ulmer, Marlin W, Marco Hennig. 2016. Value function approximation-based limited horizon roll-

out algorithms for dynamic multi-period routing. Submitted.


Ulmer, Marlin W, Dirk C Mattfeld, Marco Hennig, Justin C Goodson. 2015. A rollout algorithm for

vehicle routing with stochastic customer requests. Logistics Management. Springer, 217–227.

Ulmer, Marlin W., Dirk C. Mattfeld, Felix Koster. 2017. Budgeting time for dynamic vehicle

routing with stochastic customer requests. Transportation Science.

Ulmer, Marlin W, Dirk C Mattfeld, Ninja Soeffker. 2016b. Dynamic multi-period vehicle routing:

approximate value iteration based on dynamic lookup tables. Submitted.

Ulmer, Marlin W, Barrett W Thomas. 2016. Enough waiting for the cable guy - estimating arrival

times for service vehicle routing. Submitted.

Yang, Xinan, Arne K Strauss. 2016. An approximate dynamic programming approach to attended

home delivery management. Submitted.

Yang, Xinan, Arne K Strauss, Christine SM Currie, Richard Eglese. 2014. Choice-based demand

management and vehicle routing in e-fulfillment. Transportation Science 50(2) 473–488.

Appendix

A.1 Results

In the Appendix, we present the results for every individual instance setting and related literature

for the CAPSR. Table A1 shows the average revenue for the policies and varying speed and ca-

pacity. The best tuning parameters λ for M-VFA and E-VFA per instance setting are depicted in

Table A2.

A.2 Literature for the CAPSR

In this section, we present the literature related to the CAPSR. The works most closely related to the CAPSR are Ehmke and Campbell (2014) and Campbell and Savelsbergh (2005). As in the CAPSR, the problems studied in these papers focus on customer acceptance decisions. However,

they differ in objective and constraints. Ehmke and Campbell (2014) determine customer accep-

tances based on the probability that the integration of the customer does not lead to time window


Table A1: Results: Revenue

Speed  Capacity  M-VFA   E-VFA   P-VFA   N-VFA   RA
20     100       166.41  159.61  159.06  157.95  126.93
20     120       178.44  170.49  169.02  169.25  142.92
20     140       184.90  174.67  173.77  173.93  150.05
20     160       185.48  179.94  174.86  179.25  150.13
25     100       167.98  163.40  163.40  158.97  126.64
25     120       184.04  177.80  177.80  172.93  145.44
25     140       196.17  189.36  189.36  186.32  161.17
25     160       202.47  195.42  193.98  191.66  169.76
30     100       168.32  163.93  163.63  159.86  127.37
30     120       185.52  180.44  180.44  172.72  145.04
30     140       200.14  194.08  193.96  188.80  162.07
30     160       209.61  202.59  202.50  196.33  175.75

violations. These probabilities are determined with respect to stochastic travel times which con-

trasts with the CAPSR in which acceptance decisions are determined with regard to potential new

requests.

In Campbell and Savelsbergh (2005), customers from a known set of customers dynamically

request service. Each potential customer has a request probability and time-window preferences

known at the start of the horizon. Customers can choose from a set of time slots offered by the

service provider. The objective is to maximize the expected revenue. Campbell and Savelsbergh

(2005) determine acceptance by solving the static stochastic vehicle routing problem on a rolling

horizon. They evaluate a planned tour in terms of expected revenue. The approach does not

consider the dynamic development resulting from dynamically requesting customers. Because for

the CAPSR, the number of customers is vast and the vehicle is capacitated, a direct transfer of the

approach in Campbell and Savelsbergh (2005) to the CAPSR is not possible. Thus, we present

an online rollout algorithm that extends the approach of Campbell and Savelsbergh (2005) by

subsequently sampling requests over a simulated request horizon. We use this rollout algorithm as

a benchmark for the M-VFA.

Papastavrou et al. (1996), Kleywegt and Papastavrou (1998), and Kleywegt and Papastavrou

(2001) consider a dynamic knapsack problem in which items of random weight and reward arrive

over time and must be either accepted or rejected for inclusion in the knapsack. The problem is

similar to the CAPSR but without the routing component. Papastavrou et al. (1996), Kleywegt and


Table A2: Results: Best Parameter λ

Speed  Capacity  M-VFA  E-VFA
20     100       0.4    0.9
20     120       0.4    0.8
20     140       0.4    0.5
20     160       0.3    0.2
25     100       0.6    1.0
25     120       0.5    1.0
25     140       0.4    0.9
25     160       0.4    0.9
30     100       0.4    0.9
30     120       0.6    1.0
30     140       0.5    0.9
30     160       0.5    0.9

Papastavrou (1998), and Kleywegt and Papastavrou (2001) characterize the optimal policies for a

variety of versions of the problem. As a result of its routing component, these results do not apply

to the CAPSR. More recently, Goodson et al. (2017) demonstrate the effectiveness of a rollout

algorithm (RA) applied to a variant of the problem presented in Kleywegt and Papastavrou (1998).

We use RA as one of our benchmark algorithms.

The customer acceptance decision making in the CAPSR is also related to work on dynamic

routing with stochastic requests. For an overview on dynamic routing, the interested reader is

referred to Ritzinger et al. (2015). In these dynamic routing problems, vehicles are already on

the road when new requests occur. For such a problem, Ulmer et al. (2017) present a customer acceptance policy that evaluates the remaining free time budget by means of a non-parametric VFA. The

non-parametric part of the M-VFA presented in this paper can be seen as a generalization of this

approach adapted to the needs of the CAPSR, especially, considering delivery routing and capacity

constraints. Other work has shown rollout algorithms (RAs) to be effective approaches for dynamic

routing with stochastic requests (Klapp et al., 2016; Ulmer et al., 2015, to appear). As noted

previously, in our computational study, we apply an RA as benchmark.

Finally, customer acceptances in delivery routing are related to time-slot pricing. For these

problems, the customers select time-windows for delivery, but the selection can be influenced

by the dispatcher and, in the extreme case, no time-slot is offered and the customer is rejected.

Recently, Yang and Strauss (2016) present a pricing policy estimating delivery costs via parametric


Figure A.1: State, Decision, Post-Decision State, Transition

VFA. For a general overview on time-slot pricing, the interested reader is referred to Yang et al.

(2014).

A.3 CAPSR Example

In the following, we present an example for the MDP. Figure A.1 depicts a state, decision, post-

decision state, and stochastic information for the seventh decision point, k = 7. For the purpose

of presentation, we assume a Manhattan-style grid with a travel duration of 10 minutes for each segment. We further assume a service time of ζ = 10 minutes, a duration for both the capture and the delivery phase of t^c_max = t^d_max = 480 minutes, and a maximal capacity of κ_max = 100. The depot

is indicated by the gray square, the accepted customers by the gray circles, and the new request

by the white circle. The reward and the capacity required by each customer are indicated by the

adjacent white squares. As an example, the previous acceptance of Customer 2 led to a reward of

P (C2) = 3 and a capacity consumption of κ(C2) = 5.

The state S_7 = (100, C2, C3, C4, C5, C7, (D, C2, C3, C5, C4, D)) is depicted on the left. The current point of time is t = 100 minutes. This means there are still 380 minutes to receive orders before the actual delivery starts. Four customers are already accepted, C_k = {C2, C3, C4, C5}. Customer C7 requested service at t = 100. The current planned tour τ_k starts and ends at the depot and traverses Customers 2, 3, 5, and 4. The current tour duration d(τ_k) is the sum of travel and service times, 160 + 4 × 10 = 200 minutes. The current capacity consumed is 5 + 2 + 8 + 7 = 22.


Decision x accepts customer C7 and updates the tour τ_k accordingly. The immediate reward of decision x is R(S_7, x) = P(C7) = 5. The new planned tour is τ_7^x = (D, C2, C7, C3, C5, C4, D). This leads to post-decision state S_7^x = (100, C2, C3, C4, C5, C7, (D, C2, C7, C3, C5, C4, D)), depicted in the center of Figure A.1. The new tour duration is d(τ_7^x) = 160 + 5 × 10 = 210. The capacity consumed is 22 + 10 = 32. These values are reflected in the features free time budget b_7^x and free capacity κ_7^x. Generally, the free time budget b_k^x with 0 ≤ b_k^x ≤ t^d_max is defined as

b_k^x = t^d_max − d(τ_k^x).

In the example, the free time budget is b_7^x = 480 − 210 = 270 minutes. This means that 270 minutes of travel time and service time are free to integrate new customers. The currently consumed capacity determines the free capacity κ_k^x as

κ_k^x = κ_max − Σ_{C ∈ C_k^x} κ(C).

In the example, the free capacity is κ_7^x = 100 − 32 = 68. The next decision point k = 8 occurs when the next stochastic customer C8 requests service. The new decision state

S_8 = (100, C2, C3, C4, C5, C7, C8, (D, C2, C7, C3, C5, C4, D))

is depicted on the right side of Figure A.1.
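The two feature computations in this example can be sketched in a few lines. The constants follow the example above; the function names are hypothetical.

```python
T_D_MAX = 480      # delivery-phase duration limit t^d_max (minutes)
KAPPA_MAX = 100    # maximal vehicle capacity kappa_max
SERVICE_TIME = 10  # per-customer service time zeta (minutes)

def free_time_budget(travel_time, n_customers):
    """b_k^x = t^d_max - d(tau_k^x), where the tour duration d() is
    the sum of travel time and per-customer service times."""
    return T_D_MAX - (travel_time + n_customers * SERVICE_TIME)

def free_capacity(consumptions):
    """kappa_k^x = kappa_max minus the accepted customers' capacity
    consumptions."""
    return KAPPA_MAX - sum(consumptions)

# Post-decision state S_7^x from the example: five customers,
# 160 minutes of travel, consumptions 5 + 2 + 8 + 7 + 10 = 32.
print(free_time_budget(160, 5))         # → 270
print(free_capacity([5, 2, 8, 7, 10]))  # → 68
```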

A.4 Routing Heuristic

Since τ_k^x needs to be determined in real time while the customer is waiting, M-VFA and the benchmark heuristics draw on the efficient cheapest-insertion routing heuristic (CI) as applied by Campbell and Savelsbergh (2005) for a problem similar to the CAPSR. At each decision point, CI maintains the current route τ_k and inserts the new request at the position leading to a minimal extension

timal TSP-solutions for the Iowa City data set while requiring significantly less calculation time.

Further, CI allows for the instant communication of approximate delivery times, a feature often

desired by customers (Jabali et al., 2013).
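The cheapest-insertion step described above can be sketched as follows; this is an illustrative implementation, not the authors' code.

```python
def cheapest_insertion(tour, new_stop, dist):
    """Insert new_stop at the position that minimally extends the tour.

    tour: list of stops starting and ending at the depot;
    dist: travel-duration function between two stops.
    Returns the extended tour and the duration increase.
    """
    best_pos, best_delta = None, float("inf")
    for i in range(len(tour) - 1):
        a, b = tour[i], tour[i + 1]
        # Cost of detouring a -> new_stop -> b instead of a -> b.
        delta = dist(a, new_stop) + dist(new_stop, b) - dist(a, b)
        if delta < best_delta:
            best_pos, best_delta = i + 1, delta
    return tour[:best_pos] + [new_stop] + tour[best_pos:], best_delta

# Example on a Manhattan-style grid as in the A.3 example:
manhattan = lambda p, q: abs(p[0] - q[0]) + abs(p[1] - q[1])
tour, delta = cheapest_insertion([(0, 0), (2, 0), (0, 0)], (1, 0), manhattan)
# tour → [(0, 0), (1, 0), (2, 0), (0, 0)], extension delta → 0
```

Keeping the rest of the route fixed is what makes CI fast enough for real-time acceptance decisions, at the cost of possibly missing a better full re-optimization of the tour.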
