
MS&E 223 Lecture Notes #11
Simulation: Making Decisions via Simulation
Peter J. Haas, Spring Quarter 2003-04

    Making Decisions via Simulation

Ref: Law and Kelton, Chapter 10; Handbook of Simulation, Chapters 8 and 9; Haas, Sec. 6.3.6.

We give an introduction to some methods for selecting the best system design (or parameter setting for a given design), where the performance under each alternative is estimated via simulation.

The methods presented are chosen for their simplicity; see the references for more details on exactly when they are applicable, and whether improved versions of the algorithms are available. Current understanding of the behavior of these algorithms is incomplete, and they should be applied with due caution.

    1. Continuous Stochastic Optimization

    Robbins-Monro Algorithm

Here we will consider the problem min_{θ∈Θ} f(θ) where, for a given value of θ, we are not able to evaluate f(θ) analytically or numerically, but must obtain (noisy) estimates of f(θ) using simulation. We will assume for now that the set Θ of possible solutions is uncountably infinite. In particular, suppose that Θ is an interval of real numbers. One approach to solving the problem min f(θ) is to estimate f'(θ) using simulation and then use an iterative method to solve the equation f'(θ) = 0. The most basic such method is called the Robbins-Monro algorithm, and can be viewed as a stochastic version of Newton's method for solving nonlinear equations.

The basic iteration is

θ_{n+1} = θ_n − (a/n) Z_n,

where a > 0 is a fixed parameter of the algorithm (the quantity a/n is sometimes called the gain), Z_n is an unbiased estimator of f'(θ_n), and, under appropriate conditions, θ_n converges to the global minimizer θ*


as n → ∞ with probability 1. (Otherwise the algorithm can converge to some other local minimizer.) For large n, it can be shown that θ_n has approximately a normal distribution. Thus, if the procedure is run m independent times (where m = 5 to 10) with n iterations for each run, generating i.i.d. replicates θ_{n,1}, ..., θ_{n,m}, then a point estimator for θ* is

θ̄_m = (1/m) Σ_{j=1}^m θ_{n,j},

and confidence intervals can be formed in the usual way, based on the Student-t distribution with m − 1 degrees of freedom.
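The procedure above can be sketched as follows. This is a minimal illustration: the quadratic objective, its noisy gradient estimator, and all numerical values are assumptions chosen for the example, not part of the notes.

```python
import math
import random
import statistics

def robbins_monro(grad_estimate, theta0, a=1.0, n_iters=5000, rng=None):
    """One Robbins-Monro run: theta_{n+1} = theta_n - (a/n) * Z_n,
    where Z_n is an unbiased (noisy) estimate of f'(theta_n)."""
    rng = rng or random.Random()
    theta = theta0
    for n in range(1, n_iters + 1):
        theta -= (a / n) * grad_estimate(theta, rng)
    return theta

# Toy problem (an assumption for illustration): f(theta) = E[(theta - W)^2]
# with W ~ N(2, 1), so f'(theta) = 2*(theta - E[W]) and theta* = 2.
def noisy_grad(theta, rng):
    return 2.0 * (theta - rng.gauss(2.0, 1.0))  # unbiased estimate of f'(theta)

# m independent runs -> point estimator and t-based confidence interval.
m = 10
reps = [robbins_monro(noisy_grad, theta0=0.0, rng=random.Random(seed))
        for seed in range(m)]
mean = statistics.mean(reps)
half = 2.262 * statistics.stdev(reps) / math.sqrt(m)  # t quantile, m - 1 = 9 d.f.
print(f"theta* estimate: {mean:.3f} +/- {half:.3f}")
```

Each of the m runs uses its own random number stream, so the replicates θ_{n,j} are independent and the usual t-based interval applies.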

    Remark: Variants of the above procedure are available for multi-parameter problems.

    Remark: The main drawback of the above algorithm is that convergence can be slow and sensitive to

    the choice of the parameter a; current research is focused on finding methods that converge faster. One

easy, practical way to stabilize the algorithm a bit is to keep track of, and return, the best parameter value seen so far.

Remark: A variant of the Robbins-Monro procedure, due to Kiefer and Wolfowitz, replaces the gradient estimate by a finite difference that is obtained by running the simulation at various values of the parameter θ. This procedure breaks down as the dimension of the parameter space increases. Recent work on simultaneous perturbation stochastic approximation (SPSA) tries to avoid the dimensionality problem as follows. At the kth iteration of a d-dimensional problem, a finite difference is obtained by running the simulation at the two parameter values θ ± cΔ. Here θ is the current value of the parameter vector (of length d), c is a nonnegative real number that is a parameter of the algorithm, and Δ is a d-vector of independent Bernoulli random variables, each equal to +1 or −1 with equal probability.
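A single SPSA gradient estimate can be sketched as follows. The noisy quadratic objective `f_sim`, the parameter values, and the averaging over M estimates are illustrative assumptions:

```python
import random

def spsa_gradient(f_sim, theta, c, rng):
    """One SPSA estimate of the gradient of f at theta: simulate only at
    theta + c*Delta and theta - c*Delta (two runs, whatever the dimension d),
    where Delta has independent components equal to +1 or -1 with equal
    probability."""
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    up = f_sim([t + c * d for t, d in zip(theta, delta)], rng)
    dn = f_sim([t - c * d for t, d in zip(theta, delta)], rng)
    return [(up - dn) / (2.0 * c * d) for d in delta]

# Hypothetical noisy objective (an assumption): f(theta) = sum(theta_i^2) + noise,
# whose true gradient at (1, -2, 0.5) is (2, -4, 1).
def f_sim(theta, rng):
    return sum(t * t for t in theta) + rng.gauss(0.0, 0.01)

# A single estimate is noisy; averaging many shows it is unbiased.
rng = random.Random(1)
M, theta = 2000, [1.0, -2.0, 0.5]
avg = [0.0] * len(theta)
for _ in range(M):
    g = spsa_gradient(f_sim, theta, c=0.1, rng=rng)
    avg = [a + gi / M for a, gi in zip(avg, g)]
print(avg)  # close to the true gradient (2, -4, 1)
```

Note that each estimate costs two simulation runs regardless of d, which is the point of SPSA.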

How do we obtain the derivative estimates needed for the Robbins-Monro algorithm? One popular method for estimating f'(θ) is via likelihood ratios.

    Tutorial on Likelihood Ratios

Suppose that we wish to estimate α = E[g_n(X_0, X_1, ..., X_n)], where X_0, X_1, ..., X_n are i.i.d. replicates of a discrete random variable X with state space S. The straightforward approach would be as follows: for k = 1, 2, ..., K, generate n + 1 i.i.d. replicates of X, say X_{0,k}, X_{1,k}, ..., X_{n,k}, and set

α(k) = g_n(X_{0,k}, X_{1,k}, ..., X_{n,k}).

Then estimate α by the average of α(1), ..., α(K).

Suppose that Y is another random variable with state space S. Also suppose, for all s ∈ S, that q(s) = P{Y = s} is positive whenever p(s) = P{X = s} is positive. Then α can be estimated based on simulation of Y instead of X. Let Y_0, ..., Y_n be i.i.d. replicates of Y, and set

  • 7/27/2019 Arena Stanfordlecturenotes11

    3/9

    MS&E 223 Lecture Notes #11

    Simulation Making Decisions via Simulation

    Peter J. Haas Spring Quarter 2003-04

    Page 3 of9

L_n = ∏_{i=0}^n p(Y_i) / q(Y_i).

L_n is called a likelihood ratio and represents the likelihood of seeing the observations under the distribution p relative to the likelihood of seeing the observations under distribution q. We claim that

(1)  E[g_n(Y_0, Y_1, ..., Y_n) L_n] = E[g_n(X_0, X_1, ..., X_n)].

To see this, observe that

E[g_n(Y_0, ..., Y_n) L_n]
  = Σ_{s_0 ∈ S} ⋯ Σ_{s_n ∈ S} g_n(s_0, ..., s_n) ( ∏_{i=0}^n p(s_i)/q(s_i) ) P{Y_0 = s_0, ..., Y_n = s_n}
  = Σ_{s_0 ∈ S} ⋯ Σ_{s_n ∈ S} g_n(s_0, ..., s_n) ( ∏_{i=0}^n p(s_i)/q(s_i) ) ∏_{i=0}^n q(s_i)
  = Σ_{s_0 ∈ S} ⋯ Σ_{s_n ∈ S} g_n(s_0, ..., s_n) ∏_{i=0}^n p(s_i)
  = E[g_n(X_0, ..., X_n)].

Thus, we can simulate Y's instead of X's, as long as the performance measure g is multiplied by the likelihood-ratio correction factor.
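The claim in (1) can be checked numerically. Below is a minimal sketch; the pmfs p and q, the rare-event function g, and all numerical values are assumptions for illustration:

```python
import random

def lr_estimate(n, K, p, q, g, rng):
    """Estimate E[g(X_0,...,X_n)] under pmf p by simulating i.i.d. draws from
    pmf q and weighting each replicate by L_n = prod_i p(Y_i)/q(Y_i)."""
    states = list(q)
    weights = [q[s] for s in states]
    total = 0.0
    for _ in range(K):
        ys = rng.choices(states, weights=weights, k=n + 1)
        L = 1.0
        for y in ys:
            L *= p[y] / q[y]
        total += g(ys) * L
    return total / K

# Toy example (assumed for illustration): X is uniform on {0, 1}, and we
# estimate the probability that X_0 = ... = X_9 = 1, i.e. alpha = 2**-10.
p = {0: 0.5, 1: 0.5}
q = {0: 0.1, 1: 0.9}   # simulate Y biased toward the event of interest
g = lambda ys: 1.0 if all(y == 1 for y in ys) else 0.0

rng = random.Random(7)
est = lr_estimate(n=9, K=20000, p=p, q=q, g=g, rng=rng)
print(est, 2.0 ** -10)  # the estimate is close to the exact value
```

Choosing q to put more mass on the event of interest is exactly the importance-sampling use of this identity.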

This idea can be applied in great generality. For example, if X and Y are real-valued random variables with density functions f_X and f_Y, respectively, then (1) holds with

L_n = ∏_{i=0}^n f_X(Y_i) / f_Y(Y_i).

As another example, if X_0, X_1, ..., X_n are observations of a discrete-time Markov chain with initial distribution ν and transition matrix P, and Y_0, Y_1, ..., Y_n are observations of a discrete-time Markov chain with initial distribution ν̃ and transition matrix P̃, then (1) holds with

  • 7/27/2019 Arena Stanfordlecturenotes11

    4/9

    MS&E 223 Lecture Notes #11

    Simulation Making Decisions via Simulation

    Peter J. Haas Spring Quarter 2003-04

    Page 4 of9

(2)  L_n = [ ν(Y_0) ∏_{i=1}^n P(Y_{i−1}, Y_i) ] / [ ν̃(Y_0) ∏_{i=1}^n P̃(Y_{i−1}, Y_i) ].

    Remark: The relation in (1) continues to hold if n is replaced by a random time N.

    Remark: Ln can easily be computed in a recursive manner.
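The recursive computation can be illustrated for the Markov-chain likelihood ratio in (2). The two-state chains and the observed path below are hypothetical:

```python
def update_lr(L, y_prev, y_curr, P, P_tilde):
    """One recursive update of the likelihood ratio in (2): multiply the
    running value by P(y_prev, y_curr) / P_tilde(y_prev, y_curr)."""
    return L * P[y_prev][y_curr] / P_tilde[y_prev][y_curr]

# Hypothetical two-state chains for illustration.
P       = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
P_tilde = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
nu, nu_tilde = {0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}

path = [0, 0, 1, 1, 0]                    # observed path Y_0, ..., Y_4
L = nu[path[0]] / nu_tilde[path[0]]       # initial-distribution factor
for y_prev, y_curr in zip(path, path[1:]):
    L = update_lr(L, y_prev, y_curr, P, P_tilde)
print(L)  # 0.7 * 0.3 * 0.6 * 0.4 / 0.5**4 = 0.8064
```

The update touches only the most recent transition, so the ratio can be maintained online as the chain is simulated.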

Likelihood ratios can be computed in the context of GSMPs in an analogous way. The idea is to initially set the running likelihood ratio equal to 1. Then, whenever there is a state transition from S_n to S_{n+1} caused by the occurrence of the events in E*_n, the running likelihood ratio is multiplied by p(S_{n+1}; S_n, E*_n) / p̃(S_{n+1}; S_n, E*_n). Moreover, for each new clock reading X_i generated for an event e_i, the running likelihood ratio is multiplied by f(X_i; S_{n+1}, e_i, S_n, E*_n) / f̃(X_i; S_{n+1}, e_i, S_n, E*_n). (Here f and f̃ denote the densities of the new clock-setting distributions.) Similar multiplications are required initially to account for any differences in the initial distribution.

    Application of Likelihood Ratios to Gradient Estimation

Likelihood ratio techniques can be used to yield an estimate of f'(θ) based on simulation at the single parameter value θ. Suppose that we have the representation f(θ) = E_θ[c(X, θ)]; we write E_θ to indicate that the distribution of the random variable X depends on θ.

Example: Consider an M/M/1 queue with arrival rate λ and service rate μ. A performance measure of the above form is obtained by letting X be the average waiting time for the first k customers and setting c(x, μ) = aμ + bx. This cost function reflects the tradeoff between the operating costs due to an increase in the service rate and the costs due to increased waiting times for the customers. We assume that μ > λ, so that the queue is stable for all possible choices of μ.

As discussed above, we have

f(θ + h) = E_{θ+h}[c(X, θ + h)] = E_θ[c(X, θ + h) L(h)],

where L(h) is an appropriate likelihood ratio. Noting that L(0) = 1 and assuming that we can interchange the expectation and limit operations, we have

f'(θ) = lim_{h→0} ( f(θ + h) − f(θ) ) / h = lim_{h→0} E_θ[ ( c(X, θ + h) L(h) − c(X, θ) L(0) ) / h ]

  • 7/27/2019 Arena Stanfordlecturenotes11

    5/9

    MS&E 223 Lecture Notes #11

    Simulation Making Decisions via Simulation

    Peter J. Haas Spring Quarter 2003-04

    Page 5 of9

= E_θ[ lim_{h→0} ( c(X, θ + h) L(h) − c(X, θ) L(0) ) / h ] = E_θ[ d/dh ( c(X, θ + h) L(h) ) |_{h=0} ] = E_θ[ c'(X, θ) + c(X, θ) L'(0) ].

(Here c' = ∂c/∂θ.) To estimate r(θ) = f'(θ), generate i.i.d. replicates X_1, ..., X_n and compute the estimator

r_n(θ) = (1/n) Σ_{i=1}^n [ c'(X_i, θ) + c(X_i, θ) L'(0) ].

Confidence intervals can be formed in the usual way. In the context of the Robbins-Monro algorithm (where computing confidence intervals is not an issue), we often take n to be small, perhaps even n = 1, because in practice it is often more efficient to execute many R-M iterations with relatively noisy derivative estimates than to execute fewer iterations with more precise derivative estimates.

Example: For the M/M/1 queue discussed above, denote by V_1, ..., V_k the k service times generated in the course of simulating the k waiting times. Then the likelihood ratio L(h) is given by

L(h) = ∏_{i=1}^k [ (μ + h) e^{−(μ+h)V_i} ] / [ μ e^{−μV_i} ] = ( (μ + h)/μ )^k e^{−h Σ_{i=1}^k V_i},

so that

L'(0) = Σ_{i=1}^k (1/μ − V_i)  and  r_n(μ) = (1/n) Σ_{j=1}^n [ a + (aμ + b X̄_j) Σ_{i=1}^k (1/μ − V_{ij}) ],

where X̄_j is the average of the k waiting times generated during the jth simulation and V_{ij} is the ith service time generated during the jth simulation. Confidence intervals for f'(μ) can be formed in the usual way.

Remark: The gradient information is useful even outside the context of stochastic optimization, because it can be used to explore the sensitivity of system performance to specified parameters.

Remark: There are other methods that can be used to estimate gradients, such as infinitesimal perturbation analysis (IPA). See Chapter 9 in the Handbook of Simulation for further discussion.

    Remark: The main drawback to the likelihood-ratio approach is that the variance of the likelihood ratio

    increases with the length of the simulation. Thus the estimates can be very noisy for long simulation runs,

    and many simulation replications must be performed to get reasonable estimates.


2. Small Number of Alternatives: Ranking and Selection

Now let's look at the case of a finite number of alternatives. Law and Kelton discuss the case of two competing alternatives in Section 10.2. We focus on the more typical case where there are a number of alternatives to be compared. Unless stated otherwise, we will assume that "bigger is better," i.e., our goal is to maximize some performance measure. (Any minimization problem can be converted to a maximization problem by maximizing the negative of the original performance measure.)

Suppose that we are comparing k systems, where k is on the order of 3 to 20. Denote by θ_i the value of the performance measure for the ith alternative. (E.g., θ_i might be the expected revenue from system i over one year.) Sometimes we write θ_(1) < θ_(2) < ... < θ_(k), so that θ_(k) is the optimal value of the performance measure.

The goal is to select the best system with (high) probability 1 − α whenever the performance measure for the best system is at least δ better than the others, i.e., whenever θ_(k) − θ_(l) > δ for l ≠ k. The parameter δ is called the indifference zone. If in fact there exist solutions within the indifference zone (e.g., when θ_(k) − θ_(l) ≤ δ for one or more values of l), the algorithm described below typically selects a "good" solution (i.e., a system lying within the indifference zone) with probability 1 − α.

The inputs to the procedure are observations Y_ij, where Y_ij is the jth estimate of θ_i. For a fixed alternative i, the observations (Y_ij : j ≥ 1) are assumed to be i.i.d. (i.e., based on independent simulation runs). Each Y_ij is required to have (at least approximately) a normal distribution. Thus, an observation Y_ij may be

- an average of a large enough number of i.i.d. observations of a finite-horizon performance measure so that the CLT is applicable,
- an estimate of a steady-state performance measure based on a large enough number of regenerative cycles, or
- an estimate of a steady-state performance measure based on a large enough number of batch means.

The following algorithm (due to Rinott, with modifications by Nelson and Matejcik) will not only select the best system as defined above, but will also provide simultaneous 100(1 − α)% confidence intervals J_i on the quantities

θ_i − max_{j≠i} θ_j

for 1 ≤ i ≤ k. That is, P( θ_i − max_{j≠i} θ_j ∈ J_i for 1 ≤ i ≤ k ) ≥ 1 − α. (This is a stronger assertion than P( θ_i − max_{j≠i} θ_j ∈ J_i ) ≥ 1 − α for each 1 ≤ i ≤ k, as with individual confidence intervals.) This "multiple comparisons with the best" approach indicates how close each of the inferior systems is to being the best. This information is useful in identifying other good solutions; one of them may be needed if the best solution is ruled out because of secondary


    considerations (ease of implementation, maintenance costs, etc.). See Handbook of Simulation, Section

    8.3.2 for further details.

Algorithm

1. Specify α, δ, and an initial sample size n_0 (n_0 ≥ 20).
2. For i = 1 to k, take an i.i.d. sample Y_{i1}, Y_{i2}, ..., Y_{i,n_0}.
3. For i = 1 to k, compute Ȳ_i = (1/n_0) Σ_{j=1}^{n_0} Y_{ij} and s_i² = (1/(n_0 − 1)) Σ_{j=1}^{n_0} (Y_{ij} − Ȳ_i)².
4. Look up the value of h from Table 8.3 (given at the end of the lecture notes).
5. For i = 1 to k, take N_i − n_0 additional samples Y_{ij}, independent of the first-stage sample, where N_i = max( n_0, ⌈(h s_i / δ)²⌉ ).
6. For i = 1 to k, compute θ̂(i) = (1/N_i) Σ_{j=1}^{N_i} Y_{ij}.
7. Select the system with the largest value of θ̂(i) as the best.
8. Form the simultaneous 100(1 − α)% confidence intervals for θ_i − max_{j≠i} θ_j as [a_i, b_i], where

   a_i = min( 0, θ̂(i) − max_{j≠i} θ̂(j) − δ )
   b_i = max( 0, θ̂(i) − max_{j≠i} θ̂(j) + δ ).
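The two-stage procedure can be sketched as follows. The `simulate` function and the system means are illustrative assumptions, and the value of h below is a placeholder rather than an entry from Table 8.3:

```python
import math
import random
import statistics

def rinott_select(simulate, k, n0, delta, h, rng):
    """Sketch of the two-stage procedure: simulate(i, rng) returns one
    (approximately normal) observation Y_ij for system i; h is the constant
    looked up from Table 8.3 for the given k, n0, and confidence level."""
    means = []
    for i in range(k):
        first = [simulate(i, rng) for _ in range(n0)]
        s2 = statistics.variance(first)                        # first-stage s_i^2
        Ni = max(n0, math.ceil((h * math.sqrt(s2) / delta) ** 2))
        second = [simulate(i, rng) for _ in range(Ni - n0)]    # second stage
        means.append(statistics.mean(first + second))
    best = max(range(k), key=lambda i: means[i])
    # Simultaneous MCB intervals [a_i, b_i] for theta_i - max_{j != i} theta_j.
    intervals = []
    for i in range(k):
        others = max(m for j, m in enumerate(means) if j != i)
        intervals.append((min(0.0, means[i] - others - delta),
                          max(0.0, means[i] - others + delta)))
    return best, means, intervals

# Hypothetical systems: normal observations with means 0, 0.5, 3 (system 2 best).
# h = 2.3 is a placeholder value, not taken from the actual table.
rng = random.Random(5)
mus = [0.0, 0.5, 3.0]
simulate = lambda i, rng: rng.gauss(mus[i], 1.0)
best, means, intervals = rinott_select(simulate, k=3, n0=20, delta=0.5, h=2.3, rng=rng)
print(best)
```

Note how the second-stage sample size N_i grows with the first-stage variance s_i², so noisier systems are simulated longer.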

The justification of this algorithm is similar to (but more complicated than) the discussion in Section 10A in Law and Kelton. See Rinott's 1978 paper in Commun. Statist. A: Theory and Methods for details. (The algorithm presented here is simpler to execute, however, than the algorithm discussed in Law and Kelton.)

Remark: Improved versions of this procedure can be obtained through the use of common random numbers. See Handbook of Simulation, Chapter 8.

    3. Large Number of Alternatives: Discrete Stochastic Optimization

We wish to solve the problem max_{θ∈Θ} f(θ), where the value of f(θ) is estimated using simulation and Θ is a very large set of discrete design alternatives. For example, θ might be the number of servers at a service center and f(θ) might be the expected value of the total revenue over a year of operation.


Development of good random search algorithms is a topic of ongoing research. Most of the current algorithms generate a sequence (θ_n : n ≥ 0) with θ_{n+1} ∈ N(θ_n) ∪ {θ_n}, where N(θ) is a neighborhood of θ. A currently-optimal solution θ*_n is maintained. The algorithms differ in the structure of the neighborhood, the rule used to determine θ_{n+1} from θ_n, the rule used to compute θ*_n, and the rule used to stop the algorithm.

We present a typical algorithm of this type (due to Andradottir) which uses a simulated annealing-type rule to determine θ_{n+1} from θ_n. This rule usually accepts a better solution but, with a small probability, sometimes accepts a worse solution in order to avoid getting stuck at a local maximum.

Algorithm

1. Fix T, N > 0. Select a starting point θ_0, and generate an estimate f̂(θ_0) of f(θ_0) via simulation. Set A(θ_0) = f̂(θ_0) and C(θ_0) = 1. Set n = 0 and θ*_0 = θ_0.
2. Select a candidate solution θ'_n randomly and uniformly from Θ \ {θ_n}. Generate an estimate f̂(θ'_n) of f(θ'_n) via simulation.
3. If f̂(θ'_n) ≥ f̂(θ_n), then set θ_{n+1} = θ'_n. Otherwise, generate a random variable U distributed uniformly on [0, 1]. If

   U ≤ exp( ( f̂(θ'_n) − f̂(θ_n) ) / T ),

   then set θ_{n+1} = θ'_n; otherwise, set θ_{n+1} = θ_n.
4. Set A(θ_{n+1}) = A(θ_{n+1}) + f̂(θ_{n+1}) and C(θ_{n+1}) = C(θ_{n+1}) + 1 (with A(θ) and C(θ) taken to be 0 for points not yet visited). If A(θ_{n+1}) / C(θ_{n+1}) > A(θ*_n) / C(θ*_n), then set θ*_{n+1} = θ_{n+1}; otherwise, set θ*_{n+1} = θ*_n.
5. If n ≥ N, return the final answer θ*_{n+1} with estimated objective function value A(θ*_{n+1}) / C(θ*_{n+1}). Otherwise, set n = n + 1 and go to Step 2.

In the above algorithm, the neighborhood was defined as N(θ) = Θ \ {θ}. If the problem has more structure, the neighborhood may be defined in a problem-specific manner.
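The algorithm above can be sketched as follows; the noisy objective, T, N, and the set Θ are illustrative assumptions:

```python
import math
import random
from collections import defaultdict

def anneal_search(f_sim, Theta, T, N, rng):
    """Sketch of the random-search algorithm above, with the neighborhood
    equal to Theta minus {theta}: a worse candidate is accepted with
    probability exp((fhat_cand - fhat_curr)/T); the answer is the point
    with the best running average A(theta)/C(theta)."""
    A = defaultdict(float)   # sum of the estimates generated at each theta
    C = defaultdict(int)     # number of estimates generated at each theta
    theta = rng.choice(Theta)
    fhat = f_sim(theta, rng)
    A[theta] += fhat; C[theta] += 1
    star = theta
    for _ in range(N):
        cand = rng.choice([t for t in Theta if t != theta])
        fcand = f_sim(cand, rng)
        if fcand >= fhat or rng.random() <= math.exp((fcand - fhat) / T):
            theta, fhat = cand, fcand        # accept the candidate
        A[theta] += fhat; C[theta] += 1      # update the running averages
        if A[theta] / C[theta] > A[star] / C[star]:
            star = theta
    return star, A[star] / C[star]

# Hypothetical problem: Theta = {0, ..., 10}, f(theta) = -(theta - 6)^2 + noise,
# so the true maximizer is theta = 6.
f_sim = lambda th, rng: -(th - 6) ** 2 + rng.gauss(0.0, 0.5)
rng = random.Random(11)
best, value = anneal_search(f_sim, Theta=list(range(11)), T=1.0, N=500, rng=rng)
print(best, round(value, 2))
```

Returning the point with the best running average A/C, rather than the last point visited, is what protects the answer against a single lucky noisy estimate.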
