
10 Simulation-Assisted Estimation

10.1 Motivation

So far we have examined how to simulate choice probabilities but have not investigated the properties of the parameter estimators that are based on these simulated probabilities. In the applications we have presented, we simply inserted the simulated probabilities into the log-likelihood function and maximized this function, the same as if the probabilities were exact. This procedure seems intuitively reasonable. However, we have not actually shown, at least so far, that the resulting estimator has any desirable properties, such as consistency, asymptotic normality, or efficiency. We have also not explored the possibility that other forms of estimation might perhaps be preferable when simulation is used rather than exact probabilities.

The purpose of this chapter is to examine various methods of estimation in the context of simulation. We derive the properties of these estimators and show the conditions under which each estimator is consistent and asymptotically equivalent to the estimator that would arise with exact values rather than simulation. These conditions provide guidance to the researcher on how the simulation needs to be performed to obtain desirable properties of the resultant estimator. The analysis also illuminates the advantages and limitations of each form of estimation, thereby facilitating the researcher's choice among methods.

We consider three methods of estimation:

1. Maximum Simulated Likelihood: MSL. This procedure is the same as maximum likelihood (ML) except that simulated probabilities are used in lieu of the exact probabilities. The properties of MSL have been derived by, for example, Gourieroux and Monfort (1993), Lee (1995), and Hajivassiliou and Ruud (1994).

2. Method of Simulated Moments: MSM. This procedure, suggested by McFadden (1989), is a simulated analog to the traditional method of moments (MOM). Under traditional MOM for discrete choice, residuals are defined as the difference between the 0–1 dependent variable that identifies the chosen alternative and the probability of the alternative. Exogenous variables are identified that are uncorrelated with the model residuals in the population. The estimates are the parameter values that make the variables and residuals uncorrelated in the sample. The simulated version of this procedure calculates residuals with the simulated probabilities rather than the exact probabilities.

3. Method of Simulated Scores: MSS. As discussed in Chapter 8, the gradient of the log likelihood of an observation is called the score of the observation. The method of scores finds the parameter values that set the average score to zero. When exact probabilities are used, the method of scores is the same as maximum likelihood, since the log-likelihood function is maximized when the average score is zero. Hajivassiliou and McFadden (1998) suggested using simulated scores instead of the exact ones. They showed that, depending on how the scores are simulated, MSS can differ from MSL and, importantly, can attain consistency and efficiency under more relaxed conditions.

In the next section we define these estimators more formally and relate them to their nonsimulated counterparts. We then describe the properties of each estimator in two stages. First, we derive the properties of the traditional estimator based on exact values. Second, we show how the derivation changes when simulated values are used rather than exact values. We show that the simulation adds extra elements to the sampling distribution of the estimator. The analysis allows us to identify the conditions under which these extra elements disappear asymptotically, so that the estimator is asymptotically equivalent to its nonsimulated analog. We also identify more relaxed conditions under which the estimator, though not asymptotically equivalent to its nonsimulated counterpart, is nevertheless consistent.

10.2 Definition of Estimators

10.2.1. Maximum Simulated Likelihood

The log-likelihood function is

$$\mathrm{LL}(\theta) = \sum_n \ln P_n(\theta),$$

where $\theta$ is a vector of parameters, $P_n(\theta)$ is the (exact) probability of the observed choice of observation $n$, and the summation is over a sample of $N$ independent observations. The ML estimator is the value of $\theta$ that maximizes $\mathrm{LL}(\theta)$. Since the gradient of $\mathrm{LL}(\theta)$ is zero at the maximum, the ML estimator can also be defined as the value of $\theta$ at which

$$\sum_n s_n(\theta) = 0,$$

where $s_n(\theta) = \partial \ln P_n(\theta)/\partial\theta$ is the score for observation $n$.

Let $\check{P}_n(\theta)$ be a simulated approximation to $P_n(\theta)$. The simulated log-likelihood function is $\mathrm{SLL}(\theta) = \sum_n \ln \check{P}_n(\theta)$, and the MSL estimator is the value of $\theta$ that maximizes $\mathrm{SLL}(\theta)$. Stated equivalently, the estimator is the value of $\theta$ at which $\sum_n \check{s}_n(\theta) = 0$, where $\check{s}_n(\theta) = \partial \ln \check{P}_n(\theta)/\partial\theta$.

A preview of the properties of MSL can be given now, with a full explanation reserved for the next section. The main issue with MSL arises because of the log transformation. Suppose $\check{P}_n(\theta)$ is an unbiased simulator of $P_n(\theta)$, so that $E_r \check{P}_n(\theta) = P_n(\theta)$, where the expectation is over the draws used in the simulation. All of the simulators that we have considered are unbiased for the true probability. However, since the log operation is a nonlinear transformation, $\ln \check{P}_n(\theta)$ is not unbiased for $\ln P_n(\theta)$ even though $\check{P}_n(\theta)$ is unbiased for $P_n(\theta)$. The bias in the simulator of $\ln P_n(\theta)$ translates into bias in the MSL estimator. This bias diminishes as more draws are used in the simulation.

To determine the asymptotic properties of the MSL estimator, the question arises of how the simulation bias behaves when the sample size rises. The answer depends critically on the relationship between the number of draws that are used in the simulation, labeled $R$, and the sample size, $N$. If $R$ is considered fixed, then the MSL estimator does not converge to the true parameters, because of the simulation bias in $\ln \check{P}_n(\theta)$. Suppose instead that $R$ rises with $N$; that is, the number of draws rises with sample size. In this case, the simulation bias disappears as $N$ (and hence $R$) rises without bound. MSL is consistent in this case. As we will see, if $R$ rises faster than $\sqrt{N}$, MSL is not only consistent but also efficient, asymptotically equivalent to maximum likelihood on the exact probabilities.

In summary, if $R$ is fixed, then MSL is inconsistent. If $R$ rises at any rate with $N$, then MSL is consistent. If $R$ rises faster than $\sqrt{N}$, then MSL is asymptotically equivalent to ML.

The primary limitation of MSL is that it is inconsistent for fixed $R$. The other estimators that we consider are motivated by the desire for a simulation-based estimator that is consistent for fixed $R$. Both MSM and MSS, if structured appropriately, attain this goal. This benefit comes at a price, however, as we see in the following discussion.
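To make the MSL recipe concrete, the following is a minimal sketch for a binary logit with one normally distributed random coefficient. Everything here (the data-generating process, the names X, y, R, draws) is an illustrative assumption rather than anything from the text. Note that the simulation draws are generated once and reused at every trial value of the parameters, for reasons discussed in Section 10.5.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, R = 500, 100                       # sample size and draws per observation
X = rng.normal(size=N)                # one explanatory variable
beta_true, sigma_true = 1.0, 0.5

# Generate choices from a mixed logit: coefficient b_n ~ N(beta, sigma^2).
b_n = beta_true + sigma_true * rng.normal(size=N)
y = (rng.random(N) < 1 / (1 + np.exp(-b_n * X))).astype(float)

draws = rng.normal(size=(N, R))       # fixed draws, reused at every theta (no chatter)

def neg_sll(theta):
    beta, sigma = theta
    b = beta + sigma * draws                      # R draws of the random coefficient
    p1 = 1 / (1 + np.exp(-b * X[:, None]))        # P(y=1 | draw)
    P_check = np.where(y[:, None] == 1, p1, 1 - p1).mean(axis=1)
    return -np.sum(np.log(P_check))               # negative simulated log likelihood

res = minimize(neg_sll, x0=[0.0, 0.1], method="BFGS")
print(res.x)                          # MSL estimates of (beta, sigma)
```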


10.2.2. Method of Simulated Moments

The traditional MOM is motivated by the recognition that the residuals of a model are necessarily uncorrelated in the population with factors that are exogenous to the behavior being modeled. The MOM estimator is the value of the parameters that make the residuals in the sample uncorrelated with the exogenous variables. For discrete choice models, MOM is defined as the parameters that solve the equation

$$\sum_n \sum_j \left[ d_{nj} - P_{nj}(\theta) \right] z_{nj} = 0, \tag{10.1}$$

where

$d_{nj}$ is the dependent variable that identifies the chosen alternative: $d_{nj} = 1$ if $n$ chose $j$, and $d_{nj} = 0$ otherwise, and

$z_{nj}$ is a vector of exogenous variables called instruments.

The residuals are $d_{nj} - P_{nj}(\theta)$, and the MOM estimator is the parameter values at which the residuals are uncorrelated with the instruments in the sample.

This MOM estimator is analogous to MOM estimators for standard regression models. A regression model takes the form $y_n = x_n'\beta + \varepsilon_n$. The MOM estimator for this regression is the $\beta$ at which

$$\sum_n (y_n - x_n'\beta)\, z_n = 0$$

for a vector of exogenous instruments $z_n$. When the explanatory variables in the model are exogenous, then they serve as the instruments. The MOM estimator in this case becomes the ordinary least squares estimator:

$$\sum_n (y_n - x_n'\beta)\, x_n = 0,$$

$$\sum_n x_n y_n = \sum_n x_n x_n'\beta,$$

$$\beta = \left(\sum_n x_n x_n'\right)^{-1} \left(\sum_n x_n y_n\right),$$

which is the formula for the least squares estimator. When instruments are specified to be something other than the explanatory variables, the MOM estimator becomes the standard instrumental variables estimator:

$$\sum_n (y_n - x_n'\beta)\, z_n = 0,$$

$$\sum_n z_n y_n = \sum_n z_n x_n'\beta,$$

$$\beta = \left(\sum_n z_n x_n'\right)^{-1} \left(\sum_n z_n y_n\right),$$

which is the formula for the instrumental variables estimator. This estimator is consistent if the instruments are independent of $\varepsilon$ in the population. The estimator is more efficient the more highly correlated the instruments are with the explanatory variables in the model. When the explanatory variables, $x_n$, are themselves exogenous, then the ideal instruments (i.e., those that give the highest efficiency) are the explanatory variables themselves, $z_n = x_n$.
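As a quick numeric illustration of the two formulas above, here is a sketch in Python with simulated data; the variable names and data-generating process are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000
z = rng.normal(size=N)                 # instrument
x = 0.8 * z + rng.normal(size=N)       # explanatory variable, correlated with z
y = 2.0 * x + rng.normal(size=N)       # true beta = 2

X, Z = x[:, None], z[:, None]          # column form so the formulas read directly
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # (sum x_n x_n')^{-1} (sum x_n y_n)
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)    # (sum z_n x_n')^{-1} (sum z_n y_n)
print(beta_ols, beta_iv)               # both near 2, since x is exogenous here
```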

For discrete choice models, MOM is defined analogously and has a similar relation to other estimators, especially ML. The researcher identifies instruments $z_{nj}$ that are exogenous and hence independent in the population of the residuals $[d_{nj} - P_{nj}(\theta)]$. The MOM estimator is the value of $\theta$ at which the sample correlation between instruments and residuals is zero. Unlike the linear case, equation (10.1) cannot be solved explicitly for $\theta$. Instead, numerical procedures are used to find the value of $\theta$ that solves this equation.

As with regression, ML for a discrete choice model is a special case of MOM. Let the instruments be the scores: $z_{nj} = \partial \ln P_{nj}(\theta)/\partial\theta$. With these instruments, MOM is the same as ML:

$$\sum_n \sum_j \left[ d_{nj} - P_{nj}(\theta) \right] z_{nj} = 0,$$

$$\sum_n \left[ \left( \sum_j d_{nj} \frac{\partial \ln P_{nj}(\theta)}{\partial\theta} \right) - \left( \sum_j P_{nj}(\theta) \frac{\partial \ln P_{nj}(\theta)}{\partial\theta} \right) \right] = 0,$$

$$\sum_n \frac{\partial \ln P_{ni}(\theta)}{\partial\theta} - \sum_n \sum_j P_{nj}(\theta) \frac{1}{P_{nj}(\theta)} \frac{\partial P_{nj}(\theta)}{\partial\theta} = 0,$$

$$\sum_n s_n(\theta) - \sum_n \sum_j \frac{\partial P_{nj}(\theta)}{\partial\theta} = 0,$$

$$\sum_n s_n(\theta) = 0,$$

which is the defining condition for ML. In the third line, $i$ is the chosen alternative, recognizing that $d_{nj} = 0$ for all $j \neq i$. The fourth line uses the fact that the sum of $\partial P_{nj}/\partial\theta$ over alternatives is zero, since the probabilities must sum to 1 before and after the change in $\theta$.

Since MOM becomes ML and hence is fully efficient when the instruments are the scores, the scores are called the ideal instruments. MOM is consistent whenever the instruments are independent of the model residuals. It is more efficient the higher the correlation between the instruments and the ideal instruments.

An interesting simplification arises with standard logit. For the standard logit model, the ideal instruments are the explanatory variables themselves. As shown in Section 3.7.1, the ML estimator for standard logit is the value of $\theta$ that solves $\sum_n \sum_j [d_{nj} - P_{nj}(\theta)] x_{nj} = 0$, where $x_{nj}$ are the explanatory variables. This is a MOM estimator with the explanatory variables as instruments.

A simulated version of MOM, called the method of simulated moments (MSM), is obtained by replacing the exact probabilities $P_{nj}(\theta)$ with simulated probabilities $\check{P}_{nj}(\theta)$. The MSM estimator is the value of $\theta$ that solves

$$\sum_n \sum_j \left[ d_{nj} - \check{P}_{nj}(\theta) \right] z_{nj} = 0$$

for instruments $z_{nj}$. As with its nonsimulated analog, MSM is consistent if $z_{nj}$ is independent of $d_{nj} - P_{nj}(\theta)$.

The important feature of this estimator is that $\check{P}_{nj}(\theta)$ enters the equation linearly. As a result, if $\check{P}_{nj}(\theta)$ is unbiased for $P_{nj}(\theta)$, then $[d_{nj} - \check{P}_{nj}(\theta)]z_{nj}$ is unbiased for $[d_{nj} - P_{nj}(\theta)]z_{nj}$. Since there is no simulation bias in the estimation condition, the MSM estimator is consistent, even when the number of draws $R$ is fixed. In contrast, MSL contains simulation bias due to the log transformation of the simulated probabilities. By not taking a nonlinear transformation of the simulated probabilities, MSM avoids simulation bias.

MSM still contains simulation noise (variance due to simulation). This noise becomes smaller as $R$ rises and disappears when $R$ rises without bound. As a result, MSM is asymptotically equivalent to MOM if $R$ rises with $N$.
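A minimal sketch of the MSM condition for the binary mixed logit used earlier may help. The instruments (the regressor and its square) and all other specifics are assumptions for this illustration; the point is that the simulated probability enters the residual linearly, so $R$ can stay small and fixed.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(2)
N, R = 2_000, 20                      # MSM tolerates a small, fixed R
X = rng.normal(size=N)
b_n = 1.0 + 0.5 * rng.normal(size=N)  # true (beta, sigma) = (1, 0.5)
y = (rng.random(N) < 1 / (1 + np.exp(-b_n * X))).astype(float)
draws = rng.normal(size=(N, R))       # fixed draws, reused at every theta

def moments(theta):
    beta, sigma = theta
    p1 = 1 / (1 + np.exp(-(beta + sigma * draws) * X[:, None]))
    resid = y - p1.mean(axis=1)       # simulated probability enters linearly
    return [np.sum(resid * X), np.sum(resid * X**2)]  # instruments: x and x^2

sol = root(moments, x0=[0.8, 0.3])
print(sol.x)                          # roots of the simulated moment condition
```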

Just like its unsimulated analog, MSM is less efficient than MSL unless the ideal instruments are used. However, the ideal instruments are functions of $\ln P_{nj}$. They cannot be calculated exactly for any but the simplest models, and, if they are simulated using the simulated probability, simulation bias is introduced by the log operation. MSM is usually applied with nonideal weights, which means that there is a loss of efficiency. MSM with ideal weights that are simulated without bias becomes MSS, which we discuss in the next section.

In summary, MSM has the advantage over MSL of being consistent with a fixed number of draws. However, there is no free lunch, and the cost of this advantage is a loss of efficiency when nonideal weights are used.

10.2.3. Method of Simulated Scores

MSS provides a possibility of attaining consistency without a loss of efficiency. The cost of this double advantage is numerical: the versions of MSS that provide efficiency have fairly poor numerical properties, such that calculation of the estimator can be difficult.

The method of scores is defined by the condition

$$\sum_n s_n(\theta) = 0,$$

where $s_n(\theta) = \partial \ln P_n(\theta)/\partial\theta$ is the score for observation $n$. This is the same defining condition as ML: when exact probabilities are used, the method of scores is simply ML.

The method of simulated scores replaces the exact score with a simulated counterpart. The MSS estimator is the value of $\theta$ that solves

$$\sum_n \check{s}_n(\theta) = 0,$$

where $\check{s}_n(\theta)$ is a simulator of the score. If $\check{s}_n(\theta)$ is calculated as the derivative of the log of the simulated probability, that is, $\check{s}_n(\theta) = \partial \ln \check{P}_n(\theta)/\partial\theta$, then MSS is the same as MSL. However, the score can be simulated in other ways. When the score is simulated in other ways, MSS differs from MSL and has different properties.

Suppose that an unbiased simulator of the score can be constructed. With this simulator, the defining equation $\sum_n \check{s}_n(\theta) = 0$ does not incorporate any simulation bias, since the simulator enters the equation linearly. MSS is therefore consistent with a fixed $R$. The simulation noise decreases as $R$ rises, such that MSS is asymptotically efficient, equivalent to ML, when $R$ rises with $N$. In contrast, MSL uses the biased score simulator $\check{s}_n(\theta) = \partial \ln \check{P}_n(\theta)/\partial\theta$, which is biased due to the log operator. MSS with an unbiased score simulator is therefore better than MSL with its biased score simulator in two regards: it is consistent under less stringent conditions (with fixed $R$ rather than $R$ rising with $N$) and is efficient under less stringent conditions ($R$ rising at any rate with $N$ rather than $R$ rising faster than $\sqrt{N}$).


The difficulty with MSS comes in finding an unbiased score simulator. The score can be rewritten as

$$s_n(\theta) = \frac{\partial \ln P_{nj}(\theta)}{\partial\theta} = \frac{1}{P_{nj}(\theta)}\,\frac{\partial P_{nj}(\theta)}{\partial\theta}.$$

An unbiased simulator for the second term, $\partial P_{nj}/\partial\theta$, is easily obtained by taking the derivative of the simulated probability. Since differentiation is a linear operation, $\partial \check{P}_{nj}/\partial\theta$ is unbiased for $\partial P_{nj}/\partial\theta$ if $\check{P}_{nj}(\theta)$ is unbiased for $P_{nj}(\theta)$. Since the second term in the score can be simulated without bias, the difficulty arises in finding an unbiased simulator for the first term, $1/P_{nj}(\theta)$. Of course, simply taking the inverse of the simulated probability does not provide an unbiased simulator, since $E_r\big(1/\check{P}_{nj}(\theta)\big) \neq 1/P_{nj}(\theta)$. Like the log operation, an inverse introduces bias.

One proposal is based on the recognition that $1/P_{nj}(\theta)$ is the expected number of draws of the random terms that are needed before an "accept" is obtained. Consider drawing balls from an urn that contains many balls of different colors. Suppose the probability of obtaining a red ball is 0.20; that is, one-fifth of the balls are red. How many draws would it take, on average, to obtain a red ball? The answer is $1/0.2 = 5$. The same idea can be applied to choice probabilities. $P_{nj}(\theta)$ is the probability that a draw of the random terms of the model will result in alternative $j$ having the highest utility. The inverse $1/P_{nj}(\theta)$ can be simulated as follows:

1. Take a draw of the random terms from their density.
2. Calculate the utility of each alternative with this draw.
3. Determine whether alternative $j$ has the highest utility.
4. If so, call the draw an accept. If not, then call the draw a reject and repeat steps 1 to 3 with a new draw. Define $B_r$ as the number of draws that are taken until the first accept is obtained.
5. Perform steps 1 to 4 $R$ times, obtaining $B_r$ for $r = 1, \ldots, R$. The simulator of $1/P_{nj}(\theta)$ is $(1/R)\sum_{r=1}^{R} B_r$.

This simulator is unbiased for $1/P_{nj}(\theta)$. The product of this simulator with the simulator of $\partial P_{nj}/\partial\theta$ provides an unbiased simulator of the score. MSS based on this unbiased score simulator is consistent for fixed $R$ and asymptotically efficient when $R$ rises with $N$.
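Here is a sketch of this accept–reject count, using the urn example with $p = 0.2$ rather than an actual utility model (an assumption made only to keep the code short): each $B_r$ is the number of draws until the first accept, and the average of the $B_r$ is unbiased for $1/p$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, R = 0.2, 10_000                    # "red ball" probability and repetitions

def draws_until_accept():
    b = 1
    while rng.random() >= p:          # reject: keep drawing
        b += 1
    return b                          # B_r, the count at the first accept

B = [draws_until_accept() for _ in range(R)]
print(np.mean(B))                     # close to 1/p = 5, the expected count
```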

Unfortunately, the simulator of $1/P_{nj}(\theta)$ has the same difficulties as the accept–reject simulators that we discussed in Section 5.6. There is no guarantee that an accept will be obtained within any given number of draws. Also, the simulator is not continuous in the parameters. The discontinuity hinders the numerical procedures that are used to locate the parameters that solve the MSS equation.

In summary, there are advantages and disadvantages of MSS relative to MSL, just as there are of MSM. Understanding the capabilities of each estimator allows the researcher to make an informed choice among them.

10.3 The Central Limit Theorem

Prior to deriving the properties of our estimators, it is useful to review the central limit theorem. This theorem provides the basis for the distributions of the estimators.

One of the most basic results in statistics is that, if we take draws from a distribution with mean $\mu$ and variance $\sigma$, the mean of these draws will be normally distributed with mean $\mu$ and variance $\sigma/N$, where $N$ is a large number of draws. This result is the central limit theorem, stated intuitively rather than precisely. We will provide a more complete and precise expression of these ideas.

Let $\bar{t} = (1/N)\sum_n t_n$, where each $t_n$ is a draw from a distribution with mean $\mu$ and variance $\sigma$. The realizations of the draws are called the sample, and $\bar{t}$ is the sample mean. If we take a different sample (i.e., obtain different values for the draws of each $t_n$), then we will get a different value for the statistic $\bar{t}$. Our goal is to derive the sampling distribution of $\bar{t}$.

For most statistics, we cannot determine the sampling distribution exactly for a given sample size. Instead, we examine how the sampling distribution behaves as sample size rises without bound. A distinction is made between the limiting distribution and the asymptotic distribution of a statistic. Suppose that, as sample size rises, the sampling distribution of statistic $\bar{t}$ converges to a fixed distribution. For example, the sampling distribution of $\bar{t}$ might become arbitrarily close to a normal with mean $t^*$ and variance $\sigma$. In this case, we say that $N(t^*, \sigma)$ is the limiting distribution of $\bar{t}$ and that $\bar{t}$ converges in distribution to $N(t^*, \sigma)$. We denote this situation as $\bar{t} \xrightarrow{d} N(t^*, \sigma)$.

In many cases, a statistic will not have a limiting distribution. As $N$ rises, the sampling distribution keeps changing. The mean of a sample of draws is an example of a statistic without a limiting distribution. As stated, if $\bar{t}$ is the mean of a sample of draws from a distribution with mean $\mu$ and variance $\sigma$, then $\bar{t}$ is normally distributed with mean $\mu$ and variance $\sigma/N$. The variance decreases as $N$ rises. The distribution changes as $N$ rises, becoming more and more tightly dispersed around the mean. If a limiting distribution were to be defined in this case, it would have to be the degenerate distribution at $\mu$: as $N$ rises without bound, the distribution of $\bar{t}$ collapses on $\mu$. This limiting distribution is useless in understanding the variance of the statistic, since the variance of this limiting distribution is zero. What do we do in this case to understand the properties of the statistic?

If our original statistic does not have a limiting distribution, then we often can transform the statistic in such a way that the transformed statistic has a limiting distribution. Suppose, as in our example of a sample mean, that the statistic we are interested in does not have a limiting distribution because its variance decreases as $N$ rises. In that case, we can consider a transformation of the statistic that normalizes for sample size. In particular, we can consider $\sqrt{N}(\bar{t} - \mu)$. Suppose that this statistic does indeed have a limiting distribution, for example $\sqrt{N}(\bar{t} - \mu) \xrightarrow{d} N(0, \sigma)$. In this case, we can derive the properties of our original statistic from the limiting distribution of the transformed statistic. Recall from basic principles of probability that, for fixed $a$ and $b$, if $a(\bar{t} - b)$ is distributed normal with zero mean and variance $\sigma$, then $\bar{t}$ itself is distributed normal with mean $b$ and variance $\sigma/a^2$. This statement can be applied to our limiting distribution. For large enough $N$, $\sqrt{N}(\bar{t} - \mu)$ is distributed approximately $N(0, \sigma)$. Therefore, for large enough $N$, $\bar{t}$ is distributed approximately $N(\mu, \sigma/N)$. We denote this as $\bar{t} \stackrel{a}{\sim} N(\mu, \sigma/N)$. Note that this is not the limiting distribution of $\bar{t}$, since $\bar{t}$ has no nondegenerate limiting distribution. Rather, it is called the asymptotic distribution of $\bar{t}$, derived from the limiting distribution of $\sqrt{N}(\bar{t} - \mu)$.

We can now restate precisely our concepts about the sampling distribution of a sample mean. The central limit theorem states the following. Suppose $\bar{t}$ is the mean of a sample of $N$ draws from a distribution with mean $\mu$ and variance $\sigma$. Then $\sqrt{N}(\bar{t} - \mu) \xrightarrow{d} N(0, \sigma)$. With this limiting distribution, we can say that $\bar{t} \stackrel{a}{\sim} N(\mu, \sigma/N)$.

There is another, more general version of the central limit theorem. In the version just stated, each $t_n$ is a draw from the same distribution. Suppose instead that $t_n$ is a draw from a distribution with mean $\mu$ and variance $\sigma_n$, for $n = 1, \ldots, N$. That is, each $t_n$ is from a different distribution; the distributions have the same mean but different variances. The generalized version of the central limit theorem states that $\sqrt{N}(\bar{t} - \mu) \xrightarrow{d} N(0, \sigma)$, where $\sigma$ is now the average variance: $\sigma = (1/N)\sum_n \sigma_n$. Given this limiting distribution, we can say that $\bar{t} \stackrel{a}{\sim} N(\mu, \sigma/N)$. We will use both versions of the central limit theorem when deriving the distributions of our estimators.
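The first statement is easy to check by Monte Carlo. The sketch below (the distribution, sample size, and replication count are arbitrary choices for the illustration) draws many samples from a skewed distribution and confirms that $\sqrt{N}(\bar{t} - \mu)$ is approximately $N(0, \sigma)$, where $\sigma$ denotes the variance, as in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, var = 2.0, 4.0                    # exponential with scale 2: mean 2, variance 4
N, reps = 10_000, 2_000
samples = rng.exponential(scale=2.0, size=(reps, N))
z = np.sqrt(N) * (samples.mean(axis=1) - mu)   # sqrt(N)*(tbar - mu), one per sample
print(z.mean(), z.var())              # approximately 0 and 4, i.e. N(0, var)
```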


10.4 Properties of Traditional Estimators

In this section, we review the procedure for deriving the properties of estimators and apply that procedure to the traditional, non-simulation-based estimators. This discussion provides the basis for analyzing the properties of the simulation-based estimators in the next section.

The true value of the parameters is denoted $\theta^*$. The ML and MOM estimators are roots of an equation that takes the form

$$\sum_n g_n(\hat\theta)/N = 0. \tag{10.2}$$

That is, the estimator $\hat\theta$ is the value of the parameters that solves this equation. We divide by $N$, even though this division does not affect the root of the equation, because doing so facilitates our derivation of the properties of the estimators. The condition states that the average value of $g_n(\theta)$ in the sample is zero at the parameter estimates. For ML, $g_n(\theta)$ is the score $\partial \ln P_n(\theta)/\partial\theta$. For MOM, $g_n(\theta)$ is the set of first moments of residuals with a vector of instruments, $\sum_j (d_{nj} - P_{nj})z_{nj}$. Equation (10.2) is often called the moment condition. In its nonsimulated form, the method of scores is the same as ML and therefore need not be considered separately in this section. Note that we call (10.2) an equation even though it is actually a set of equations, since $g_n(\theta)$ is a vector. The parameters that solve these equations are the estimators.

At any particular value of $\theta$, the mean and variance of $g_n(\theta)$ can be calculated for the sample. Label the mean as $\bar{g}(\theta)$ and the variance as $W(\theta)$. We are particularly interested in the sample mean and variance of $g_n(\theta)$ at the true parameters, $\theta^*$, since our goal is to estimate these parameters.

The key to understanding the properties of an estimator comes in realizing that each $g_n(\theta^*)$ is a draw from a distribution of $g_n(\theta^*)$'s in the population. We do not know the true parameters, but we know that each observation has a value of $g_n(\theta^*)$ at the true parameters. The value of $g_n(\theta^*)$ varies over people in the population. So, by drawing a person into our sample, we are essentially drawing a value of $g_n(\theta^*)$ from its distribution in the population.

The distribution of $g_n(\theta^*)$ in the population has a mean and variance. Label the mean of $g_n(\theta^*)$ in the population as $\mathbf{g}$ and its variance in the population as $\mathbf{W}$. The sample mean and variance at the true parameters, $\bar{g}(\theta^*)$ and $W(\theta^*)$, are the sample counterparts to the population mean and variance, $\mathbf{g}$ and $\mathbf{W}$.


We assume that $\mathbf{g} = 0$. That is, we assume that the average of $g_n(\theta^*)$ in the population is zero at the true parameters. Under this assumption, the estimator provides a sample analog to this population expectation: $\hat\theta$ is the value of the parameters at which the sample average of $g_n(\theta)$ equals zero, as given in the defining condition (10.2). For ML, the assumption that $\mathbf{g} = 0$ simply states that the average score in the population is zero, when evaluated at the true parameters. In a sense, this can be considered the definition of the true parameters: namely, $\theta^*$ are the parameters at which the log-likelihood function for the entire population obtains its maximum and hence has zero slope. The estimated parameters are the values that make the slope of the likelihood function in the sample zero. For MOM, the assumption is satisfied if the instruments are independent of the residuals. In a sense, the assumption with MOM is simply a reiteration that the instruments are exogenous. The estimated parameters are the values that make the instruments and residuals uncorrelated in the sample.

We now consider the population variance of $g_n(\theta^*)$, which we have denoted $\mathbf{W}$. When $g_n(\theta)$ is the score, as in ML, this variance takes a special meaning. As shown in Section 8.7, the information identity states that $\mathbf{V} = -\mathbf{H}$, where

$$-\mathbf{H} = -E\left( \frac{\partial^2 \ln P_n(\theta^*)}{\partial\theta\,\partial\theta'} \right)$$

is the information matrix and $\mathbf{V}$ is the variance of the scores evaluated at the true parameters: $\mathbf{V} = \mathrm{Var}(\partial \ln P_n(\theta^*)/\partial\theta)$. When $g_n(\theta)$ is the score, we have $\mathbf{W} = \mathbf{V}$ by definition and hence $\mathbf{W} = -\mathbf{H}$ by the information identity. That is, when $g_n(\theta)$ is the score, $\mathbf{W}$ is the information matrix. For MOM with nonideal instruments, $\mathbf{W} \neq -\mathbf{H}$, so that $\mathbf{W}$ does not equal the information matrix.

Why does this distinction matter? We will see that knowing whether $\mathbf{W}$ equals the information matrix allows us to determine whether the estimator is efficient. The lowest variance that any estimator can achieve is $-\mathbf{H}^{-1}/N$. For a proof, see, for example, Greene (2000) or Ruud (2000). An estimator is efficient if its variance attains this lower bound. As we will see, this lower bound is achieved when $\mathbf{W} = -\mathbf{H}$ but not when $\mathbf{W} \neq -\mathbf{H}$.

Our goal is to determine the properties of $\hat\theta$. We derive these properties in a two-step process. First, we examine the distribution of $\bar{g}(\theta^*)$, which, as stated earlier, is the sample mean of $g_n(\theta^*)$. Second, the distribution of $\hat\theta$ is derived from the distribution of $\bar{g}(\theta^*)$. This two-step process is not necessarily the most direct way of examining traditional estimators. However, as we will see in the next section, it provides a very convenient way for generalizing to simulation-based estimators.

Step 1: The Distribution of $\bar{g}(\theta^*)$

Recall that the value of $g_n(\theta^*)$ varies over decision makers in the population. When taking a sample, the researcher is drawing values of $g_n(\theta^*)$ from its distribution in the population. This distribution has zero mean by assumption and variance denoted $\mathbf{W}$. The researcher calculates the sample mean of these draws, $\bar{g}(\theta^*)$. By the central limit theorem, $\sqrt{N}(\bar{g}(\theta^*) - 0) \xrightarrow{d} N(0, \mathbf{W})$, such that the sample mean has distribution $\bar{g}(\theta^*) \stackrel{a}{\sim} N(0, \mathbf{W}/N)$.

Step 2: Derive the Distribution of $\hat\theta$ from the Distribution of $\bar{g}(\theta^*)$

We can relate the estimator $\hat\theta$ to its defining term $\bar{g}(\theta)$ as follows. Take a first-order Taylor's expansion of $\bar{g}(\hat\theta)$ around $\bar{g}(\theta^*)$:

$$\bar{g}(\hat\theta) = \bar{g}(\theta^*) + D[\hat\theta - \theta^*], \tag{10.3}$$

where $D = \partial \bar{g}(\theta^*)/\partial\theta'$. By definition of $\hat\theta$ (that is, by defining condition (10.2)), $\bar{g}(\hat\theta) = 0$, so that the left-hand side of this expansion is 0. Then

$$0 = \bar{g}(\theta^*) + D[\hat\theta - \theta^*],$$

$$\hat\theta - \theta^* = -D^{-1}\bar{g}(\theta^*),$$

$$\sqrt{N}(\hat\theta - \theta^*) = \sqrt{N}\,(-D^{-1})\,\bar{g}(\theta^*). \tag{10.4}$$

Denote the mean of $\partial g_n(\theta^*)/\partial\theta'$ in the population as $\mathbf{D}$. The sample mean of $\partial g_n(\theta^*)/\partial\theta'$ is $D$, as defined for equation (10.3). The sample mean $D$ converges to the population mean $\mathbf{D}$ as the sample size rises. We know from step 1 that $\sqrt{N}\,\bar{g}(\theta^*) \xrightarrow{d} N(0, \mathbf{W})$. Using this fact in (10.4), we have

$$\sqrt{N}(\hat\theta - \theta^*) \xrightarrow{d} N(0, \mathbf{D}^{-1}\mathbf{W}\mathbf{D}^{-1}). \tag{10.5}$$

This limiting distribution tells us that $\hat\theta \stackrel{a}{\sim} N(\theta^*, \mathbf{D}^{-1}\mathbf{W}\mathbf{D}^{-1}/N)$.

We can now observe the properties of the estimator. The asymptotic distribution of $\hat\theta$ is centered on the true value, and its variance decreases as the sample size rises. As a result, $\hat\theta$ converges in probability to $\theta^*$ as the sample size rises without bound: $\hat\theta \xrightarrow{p} \theta^*$. The estimator is therefore consistent. The estimator is asymptotically normal. And its variance is $\mathbf{D}^{-1}\mathbf{W}\mathbf{D}^{-1}/N$, which can be compared with the lowest possible variance, $-\mathbf{H}^{-1}/N$, to determine whether it is efficient.

For ML, $g_n(\cdot)$ is the score, so that the variance of $g_n(\theta^*)$ is the variance of the scores: $\mathbf{W} = \mathbf{V}$. Also, the mean derivative of $g_n(\theta^*)$ is the mean derivative of the scores: $\mathbf{D} = \mathbf{H} = E(\partial^2 \ln P_n(\theta^*)/\partial\theta\,\partial\theta')$, where the expectation is over the population. By the information identity, $\mathbf{V} = -\mathbf{H}$. The asymptotic variance of $\hat\theta$ becomes $\mathbf{D}^{-1}\mathbf{W}\mathbf{D}^{-1}/N = \mathbf{H}^{-1}\mathbf{V}\mathbf{H}^{-1}/N = \mathbf{H}^{-1}(-\mathbf{H})\mathbf{H}^{-1}/N = -\mathbf{H}^{-1}/N$, which is the lowest possible variance of any estimator. ML is therefore efficient. Since $\mathbf{V} = -\mathbf{H}$, the variance of the ML estimator can also be expressed as $\mathbf{V}^{-1}/N$, which has a readily interpretable meaning: the variance of the estimator is the inverse of the variance of the scores evaluated at the true parameters, divided by sample size.

For MOM, $g_n(\cdot)$ is a set of moments. If the ideal instruments are used, then MOM becomes ML and is efficient. If any other instruments are used, then MOM is not ML. In this case, $\mathbf{W}$ is the population variance of the moments and $\mathbf{D}$ is the mean derivative of the moments, rather than the variance and mean derivative of the scores. The asymptotic variance of $\hat\theta$ does not equal $-\mathbf{H}^{-1}/N$. MOM without ideal instruments is therefore not efficient.
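For a concrete sense of the sandwich formula $\mathbf{D}^{-1}\mathbf{W}\mathbf{D}^{-1}/N$ and the information identity, here is a sketch that estimates both quantities from simulated logit data; the sample analogs and variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N, beta = 5_000, 1.0
x = rng.normal(size=N)
y = (rng.random(N) < 1 / (1 + np.exp(-beta * x))).astype(float)

P = 1 / (1 + np.exp(-beta * x))       # logit probability at the true beta
scores = (y - P) * x                   # g_n: score of each observation
W = np.mean(scores**2)                 # sample analog of W (variance of scores)
D = np.mean(-P * (1 - P) * x**2)       # sample analog of D (mean score derivative)
print(W / (D**2 * N), -1 / (D * N))    # sandwich variance vs -H^{-1}/N: nearly equal
```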

10.5 Properties of Simulation-Based Estimators

Suppose that the terms that enter the defining equation for an estimator are simulated rather than calculated exactly. Let $\check{g}_n(\theta)$ denote the simulated value of $g_n(\theta)$, and $\check{g}(\theta)$ the sample mean of these simulated values, so that $\check{g}(\theta)$ is the simulated version of $\bar{g}(\theta)$. Denote the number of draws used in simulation for each $n$ as $R$, and assume that independent draws are used for each $n$ (i.e., separate draws are taken for each $n$). Assume further that the same draws are used for each value of $\theta$ when calculating $\check{g}_n(\theta)$. This procedure prevents chatter in the simulation: the difference between $\check{g}(\theta_1)$ and $\check{g}(\theta_2)$ for two different values of $\theta$ is not due to different draws.

These assumptions on the simulation draws are easy for the researcher to implement and simplify our analysis considerably. For interested readers, Lee (1992) examines the situation when the same draws are used for all observations. Pakes and Pollard (1989) provide a way to characterize an equicontinuity condition that, when satisfied, facilitates analysis of simulation-based estimators. McFadden (1989) characterizes this condition in a different way and shows that it can be met by using the same draws for each value of $\theta$, which is the assumption that we make. McFadden (1996) provides a helpful synthesis that includes a discussion of the need to prevent chatter.

The estimator is defined by the condition $\check{g}(\hat\theta) = 0$. We derive the properties of $\hat\theta$ in the same two steps as for the traditional estimators.


Step 1: The Distribution of $\check{g}(\theta^*)$

To identify the various components of this distribution, let us reexpress $\check{g}(\theta^*)$ by adding and subtracting some terms and rearranging:

$$\check{g}(\theta^*) = \bar{g}(\theta^*) + \check{g}(\theta^*) - \bar{g}(\theta^*) + E_r\check{g}(\theta^*) - E_r\check{g}(\theta^*)$$
$$= \bar{g}(\theta^*) + [E_r\check{g}(\theta^*) - \bar{g}(\theta^*)] + [\check{g}(\theta^*) - E_r\check{g}(\theta^*)],$$

where $\bar{g}(\theta^*)$ is the nonsimulated value and $E_r\check{g}(\theta^*)$ is the expectation of the simulated value over the draws used in the simulation. Adding and subtracting terms obviously does not change $\check{g}(\theta^*)$. Yet the subsequent rearrangement of the terms allows us to identify components that have intuitive meaning.

The first term, $\bar{g}(\theta^*)$, is the same as arises for the traditional estimator. The other two terms are extra elements that arise because of the simulation. The term $E_r\check{g}(\theta^*) - \bar{g}(\theta^*)$ captures the bias, if any, in the simulator of $\bar{g}(\theta^*)$. It is the difference between the true value of $\bar{g}(\theta^*)$ and the expectation of the simulated value. If the simulator is unbiased for $\bar{g}(\theta^*)$, then $E_r\check{g}(\theta^*) = \bar{g}(\theta^*)$ and this term drops out. Often, however, the simulator will not be unbiased for $\bar{g}(\theta^*)$. For example, with MSL, $\check{g}_n(\theta) = \partial \ln \check{P}_n(\theta)/\partial\theta$, where $\check{P}_n(\theta)$ is an unbiased simulator of $P_n(\theta)$. Since $\check{P}_n(\theta)$ enters nonlinearly via the log operator, $\check{g}_n(\theta)$ is not unbiased. The third term, $\check{g}(\theta^*) - E_r\check{g}(\theta^*)$, captures simulation noise, that is, the deviation of the simulator for given draws from its expectation over all possible draws.

Combining these concepts, we have

$$\check{g}(\theta^*) = A + B + C, \tag{10.6}$$

where

$A$ is the same as in the traditional estimator,
$B$ is the simulation bias,
$C$ is the simulation noise.

To see how the simulation-based estimators differ from their traditional counterparts, we examine the simulation bias $B$ and noise $C$.

Consider first the noise. This term can be reexpressed as

$$C = \check{g}(\theta^*) - E_r\check{g}(\theta^*) = \frac{1}{N}\sum_n \left[ \check{g}_n(\theta^*) - E_r\check{g}_n(\theta^*) \right] = \sum_n d_n/N,$$

where $d_n$ is the deviation of the simulated value for observation $n$ from its expectation. The key to understanding the behavior of the simulation noise comes in noting that $d_n$ is simply a statistic for observation $n$. The sample constitutes $N$ draws of this statistic, one for each observation: $d_n$, $n = 1, \ldots, N$. The simulation noise $C$ is the average of these $N$ draws. Thus, the central limit theorem gives us the distribution of $C$.

In particular, for a given observation, the draws that are used in simulation provide a particular value of $d_n$. If different draws had been obtained, then a different value of $d_n$ would have been obtained. There is a distribution of values of $d_n$ over the possible realizations of the draws used in simulation. The distribution has zero mean, since the expectation over draws is subtracted out when creating $d_n$. Label the variance of the distribution as $S_n/R$, where $S_n$ is the variance when one draw is used in simulation. There are two things to note about this variance. First, $S_n/R$ is inversely related to $R$, the number of draws that are used in simulation. Second, the variance is different for different $n$. Since $g_n(\theta^*)$ is different for different $n$, the variance of the simulation deviation also differs.

We take a draw of $d_n$ for each of the $N$ observations; the overall simulation noise, $C$, is the average of these $N$ draws of observation-specific simulation noise. As just stated, each $d_n$ is a draw from a distribution with zero mean and variance $S_n/R$. The generalized version of the central limit theorem tells us the distribution of a sample average of draws from distributions that have the same mean but different variances. In our case,

$$\sqrt{N}\,C \xrightarrow{d} N(0, \mathbf{S}/R),$$

where $\mathbf{S}$ is the population mean of $S_n$. Then $C \stackrel{a}{\sim} N(0, \mathbf{S}/NR)$.

The most relevant characteristic of the asymptotic distribution of $C$ is that its variance decreases as $N$ increases, even when $R$ is fixed. Simulation noise disappears as sample size increases, even without increasing the number of draws used in simulation. This is a very important and powerful fact. It means that increasing the sample size is a way to decrease the effects of simulation on the estimator. The result is intuitively meaningful. Essentially, simulation noise cancels out over observations. The simulation for one observation might, by chance, make that observation's $\check{g}_n(\theta)$ too large. However, the simulation for another observation is likely, by chance, to be too small. By averaging the simulations over observations, the errors tend to cancel each other. As sample size rises, this canceling-out property becomes more powerful until, with large enough samples, simulation noise is negligible.
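A toy numeric check of this canceling-out property (the setup is an assumption: each observation's simulator is a mean of $R$ standard normal draws, so $d_n$ has variance $1/R$):

```python
import numpy as np

rng = np.random.default_rng(6)
R = 10                                     # fixed number of draws per observation
for N in (100, 1_000, 10_000):
    d = rng.normal(size=(N, R)).mean(axis=1)   # d_n: zero mean, variance S_n/R = 1/R
    print(N, d.mean())                         # overall noise C shrinks ~ 1/sqrt(N*R)
```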


Consider now the bias. If the simulator $\check{g}(\theta)$ is unbiased for $\bar{g}(\theta)$, then the bias term $B$ in (10.6) is zero. However, if the simulator is biased, as with MSL, then the effect of this bias on the distribution of $\check{g}(\theta^*)$ must be considered.

Usually, the defining term $\check{g}_n(\theta)$ is a function of a statistic, $\ell_n$, that can be simulated without bias. For example, with MSL, $\check{g}_n(\theta)$ is a function of the choice probability, which can be simulated without bias; in this case $\ell_n$ is the probability. More generally, $\ell_n$ can be any statistic that is simulated without bias and serves to define $\check{g}_n(\theta)$. We can write the dependence in general as $g_n(\theta) = g(\ell_n(\theta))$ and the unbiased simulator of $\ell_n(\theta)$ as $\check{\ell}_n(\theta)$, where $E_r\check{\ell}_n(\theta) = \ell_n(\theta)$.

We can now reexpress $\check{g}_n(\theta)$ by taking a Taylor's expansion around the unsimulated value $g_n(\theta)$:

$$\check{g}_n(\theta) = g_n(\theta) + \frac{\partial g(\ell_n(\theta))}{\partial \ell_n}\left[\check{\ell}_n(\theta) - \ell_n(\theta)\right] + \frac{1}{2}\,\frac{\partial^2 g(\ell_n(\theta))}{\partial \ell_n^2}\left[\check{\ell}_n(\theta) - \ell_n(\theta)\right]^2,$$

$$\check{g}_n(\theta) - g_n(\theta) = g_n'\left[\check{\ell}_n(\theta) - \ell_n(\theta)\right] + \tfrac{1}{2}\, g_n''\left[\check{\ell}_n(\theta) - \ell_n(\theta)\right]^2,$$

where $g_n'$ and $g_n''$ are simply shorthand ways to denote the first and second derivatives of $g_n(\ell(\cdot))$ with respect to $\ell$. Since $\check{\ell}_n(\theta)$ is unbiased for $\ell_n(\theta)$, we know $E_r\, g_n'[\check{\ell}_n(\theta) - \ell_n(\theta)] = g_n'[E_r\check{\ell}_n(\theta) - \ell_n(\theta)] = 0$. As a result, only the variance term remains in the expectation:

$$E_r\check{g}_n(\theta) - g_n(\theta) = \tfrac{1}{2}\, g_n''\, E_r\left[\check{\ell}_n(\theta) - \ell_n(\theta)\right]^2 = \tfrac{1}{2}\, g_n''\, \mathrm{Var}_r\,\check{\ell}_n(\theta).$$

Denote $\mathrm{Var}_r\,\check{\ell}_n(\theta) = Q_n/R$ to reflect the fact that the variance is inversely proportional to the number of draws used in the simulation. The simulation bias is then

$$E_r\check{g}(\theta) - \bar{g}(\theta) = \frac{1}{N}\sum_n \left[ E_r\check{g}_n(\theta) - g_n(\theta) \right] = \frac{1}{N}\sum_n g_n''\,\frac{Q_n}{2R} = \frac{Z}{R},$$

where $Z$ is the sample average of $g_n'' Q_n/2$.
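This $1/R$ rate is easy to verify numerically. The sketch below uses a toy unbiased, strictly positive simulator of a probability $p$ (a lognormal construction chosen only so that the log is always defined, an assumption not from the text) and shows the bias of the log falling roughly in half each time $R$ doubles:

```python
import numpy as np

rng = np.random.default_rng(7)
p, v, reps = 0.3, 0.5, 100_000
for R in (5, 10, 20, 40):
    e = rng.normal(scale=np.sqrt(v), size=(reps, R))
    P_check = (p * np.exp(e - v / 2)).mean(axis=1)   # unbiased for p, always > 0
    bias = np.log(P_check).mean() - np.log(p)
    print(R, bias)                    # bias approx -Z/R: halves as R doubles
```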


Since $B = Z/R$, the value of this statistic normalized for sample size is

$$\sqrt{N}\,B = \frac{\sqrt{N}}{R}\,Z. \tag{10.7}$$

If $R$ is fixed, then $B$ is nonzero. Even worse, $\sqrt{N}\,B$ rises with $N$, in such a way that it has no limiting value. Suppose instead that $R$ is considered to rise with $N$. The bias term then disappears asymptotically: $B = Z/R \xrightarrow{p} 0$. However, the normalized bias term does not necessarily disappear. Since $\sqrt{N}$ enters the numerator of this term, $\sqrt{N}\,B = (\sqrt{N}/R)Z \xrightarrow{p} 0$ only if $R$ rises faster than $\sqrt{N}$, so that the ratio $\sqrt{N}/R$ approaches zero as $N$ increases. If $R$ rises slower than $\sqrt{N}$, the ratio $\sqrt{N}/R$ rises, such that the normalized bias term does not disappear but in fact gets larger and larger as sample size increases.

We can now collect our results for the distribution of the defining term normalized by sample size:

$$\sqrt{N}\,\check{g}(\theta^*) = \sqrt{N}(A + B + C), \tag{10.8}$$

where

$\sqrt{N}\,A \xrightarrow{d} N(0, \mathbf{W})$, the same as in the traditional estimator,

$\sqrt{N}\,B = \dfrac{\sqrt{N}}{R}\,Z$, capturing simulation bias,

$\sqrt{N}\,C \xrightarrow{d} N(0, \mathbf{S}/R)$, capturing simulation noise.

Step 2: Derive the Distribution of $\hat\theta$ from the Distribution of $\check{g}(\theta^*)$

As with the traditional estimators, the distribution of $\hat\theta$ is directly related to the distribution of $\check{g}(\theta^*)$. Using the same Taylor's expansion as in (10.3), we have

$$\sqrt{N}(\hat\theta - \theta^*) = -\check{D}^{-1}\sqrt{N}\,\check{g}(\theta^*) = -\check{D}^{-1}\sqrt{N}(A + B + C), \tag{10.9}$$

where $\check{D}$ is the derivative of $\check{g}(\theta^*)$ with respect to the parameters, which converges to its expectation $\mathbf{D}$ as sample size rises. The estimator itself is expressed as

$$\hat\theta = \theta^* - \check{D}^{-1}(A + B + C). \tag{10.10}$$

We can now examine the properties of our estimators.


10.5.1. Maximum Simulated Likelihood

For MSL, $\check{g}_n(\theta)$ is not unbiased for $g_n(\theta)$. The bias term in (10.9) is $\sqrt{N}\,B = (\sqrt{N}/R)Z$. Suppose $R$ rises with $N$. If $R$ rises faster than $\sqrt{N}$, then

$$\sqrt{N}\,B = (\sqrt{N}/R)Z \xrightarrow{p} 0,$$

since the ratio $\sqrt{N}/R$ falls to zero. Consider now the third term in (10.9), which captures simulation noise: $\sqrt{N}\,C \xrightarrow{d} N(0, \mathbf{S}/R)$. Since $\mathbf{S}/R$ decreases as $R$ rises, we have $\mathbf{S}/R \xrightarrow{p} 0$ as $N \to \infty$ when $R$ rises with $N$. The second and third terms disappear, leaving only the first term. This first term is the same as appears for the nonsimulated estimator. We have

$$\sqrt{N}(\hat\theta - \theta^*) = -\check{D}^{-1}\sqrt{N}\,A \xrightarrow{d} N(0, \mathbf{D}^{-1}\mathbf{W}\mathbf{D}^{-1}) = N(0, \mathbf{H}^{-1}\mathbf{V}\mathbf{H}^{-1}) = N(0, -\mathbf{H}^{-1}),$$

where the next-to-last equality occurs because $g_n(\theta)$ is the score, and the last equality is due to the information identity. The estimator is distributed

$$\hat\theta \stackrel{a}{\sim} N(\theta^*, -\mathbf{H}^{-1}/N).$$

This is the same asymptotic distribution as ML. When $R$ rises faster than $\sqrt{N}$, MSL is consistent, asymptotically normal and efficient, and asymptotically equivalent to ML.

Suppose that $R$ rises with $N$ but at a rate that is slower than $\sqrt{N}$. In this case, the ratio $\sqrt{N}/R$ grows larger as $N$ rises. There is no limiting distribution for $\sqrt{N}(\hat\theta - \theta^*)$, because the bias term, $(\sqrt{N}/R)Z$, rises with $N$. However, the estimator itself converges on the true value. $\hat\theta$ depends on $(1/R)Z$, not multiplied by $\sqrt{N}$. This bias term disappears when $R$ rises at any rate. Therefore, the estimator converges on the true value, just like its nonsimulated counterpart, which means that $\hat\theta$ is consistent. However, the estimator is not asymptotically normal, since $\sqrt{N}(\hat\theta - \theta^*)$ has no limiting distribution. Standard errors cannot be calculated, and confidence intervals cannot be constructed.

When $R$ is fixed, the bias rises as $N$ rises. $\sqrt{N}(\hat\theta - \theta^*)$ does not have a limiting distribution. Moreover, the estimator itself, $\hat\theta$, contains a bias $B = (1/R)Z$ that does not disappear as sample size rises with fixed $R$. The MSL estimator is neither consistent nor asymptotically normal when $R$ is fixed.


The properties of MSL can be summarized as follows:

1. If $R$ is fixed, MSL is inconsistent.
2. If $R$ rises slower than $\sqrt{N}$, MSL is consistent but not asymptotically normal.
3. If $R$ rises faster than $\sqrt{N}$, MSL is consistent, asymptotically normal and efficient, and equivalent to ML.

10.5.2. Method of Simulated Moments

For MSM with fixed instruments, $\check{g}_n(\theta) = \sum_j [d_{nj} - \check{P}_{nj}(\theta)]z_{nj}$, which is unbiased for $g_n(\theta)$, since the simulated probability enters linearly. The bias term is zero. The distribution of the estimator is determined only by term $A$, which is the same as in the traditional MOM without simulation, and term $C$, which reflects simulation noise:

$$\sqrt{N}(\hat\theta - \theta^*) = -\check{D}^{-1}\sqrt{N}(A + C).$$

Suppose that $R$ is fixed. Since $\check{D}$ converges to its expectation $\mathbf{D}$, we have $-\sqrt{N}\,\check{D}^{-1}A \xrightarrow{d} N(0, \mathbf{D}^{-1}\mathbf{W}\mathbf{D}^{-1})$ and $-\sqrt{N}\,\check{D}^{-1}C \xrightarrow{d} N(0, \mathbf{D}^{-1}(\mathbf{S}/R)\mathbf{D}^{-1})$, so that

$$\sqrt{N}(\hat\theta - \theta^*) \xrightarrow{d} N(0, \mathbf{D}^{-1}[\mathbf{W} + \mathbf{S}/R]\mathbf{D}^{-1}).$$

The asymptotic distribution of the estimator is then

$$\hat\theta \stackrel{a}{\sim} N(\theta^*, \mathbf{D}^{-1}[\mathbf{W} + \mathbf{S}/R]\mathbf{D}^{-1}/N).$$

The estimator is consistent and asymptotically normal. Its variance is greater than that of its nonsimulated counterpart by $\mathbf{D}^{-1}\mathbf{S}\mathbf{D}^{-1}/RN$, reflecting simulation noise.

Suppose now that $R$ rises with $N$ at any rate. The extra variance due to simulation noise disappears, so that $\hat\theta \stackrel{a}{\sim} N(\theta^*, \mathbf{D}^{-1}\mathbf{W}\mathbf{D}^{-1}/N)$, the same as its nonsimulated counterpart. When nonideal instruments are used, $\mathbf{D}^{-1}\mathbf{W}\mathbf{D}^{-1} \neq -\mathbf{H}^{-1}$, and so the estimator (in either its simulated or nonsimulated form) is less efficient than ML.

If simulated instruments are used in MSM, then the properties of the estimator depend on how the instruments are simulated. If the instruments are simulated without bias and independently of the probability that enters the residual, then this MSM has the same properties as MSM with fixed weights. If the instruments are simulated with bias and the instruments are not ideal, then the estimator has the same properties as MSL, except that it is not asymptotically efficient, since the information identity does not apply. MSM with simulated ideal instruments is MSS, which we discuss next.

10.5.3. Method of Simulated Scores

With MSS using unbiased score simulators, $\check{g}_n(\theta)$ is unbiased for $g_n(\theta)$, and, moreover, $g_n(\theta)$ is the score, such that the information identity applies. The analysis is the same as for MSM except that the information identity makes the estimator efficient when $R$ rises with $N$. As with MSM, we have

$$\hat\theta \stackrel{a}{\sim} N(\theta^*, \mathbf{D}^{-1}[\mathbf{W} + \mathbf{S}/R]\mathbf{D}^{-1}/N),$$

which, since $g_n(\theta)$ is the score, becomes

$$\hat\theta \stackrel{a}{\sim} N\left(\theta^*, \frac{\mathbf{H}^{-1}[\mathbf{V} + \mathbf{S}/R]\mathbf{H}^{-1}}{N}\right) = N\left(\theta^*, \frac{-\mathbf{H}^{-1}}{N} + \frac{\mathbf{H}^{-1}\mathbf{S}\mathbf{H}^{-1}}{RN}\right).$$

When $R$ is fixed, the estimator is consistent and asymptotically normal, but its covariance is larger than that of ML because of simulation noise. If $R$ rises at any rate with $N$, then we have

$$\hat\theta \stackrel{a}{\sim} N(\theta^*, -\mathbf{H}^{-1}/N).$$

MSS with unbiased score simulators is asymptotically equivalent to ML when $R$ rises at any rate with $N$.

This analysis shows that MSS with unbiased score simulators has better properties than MSL in two regards. First, for fixed $R$, MSS is consistent and asymptotically normal, while MSL is neither. Second, for $R$ rising with $N$, MSS is equivalent to ML no matter how fast $R$ is rising, while MSL is equivalent to ML only if the rate is faster than $\sqrt{N}$.

As we discussed in Section 10.2.3, finding unbiased score simulators with good numerical properties is difficult. MSS is sometimes applied with biased score simulators. In this case, the properties of the estimator are the same as with MSL: the bias in the simulated scores translates into bias in the estimator, which disappears from the limiting distribution only if $R$ rises faster than $\sqrt{N}$.

10.6 Numerical Solution

The estimators are defined as the value of $\theta$ that solves $\check{g}(\theta) = 0$, where $\check{g}(\theta) = \sum_n \check{g}_n(\theta)/N$ is the sample average of a simulated statistic $\check{g}_n(\theta)$. Since $\check{g}_n(\theta)$ is a vector, we need to solve the set of equations for the parameters. The question arises: how are these equations solved numerically to obtain the estimates?

Chapter 8 describes numerical methods for maximizing a function. These procedures can also be used for solving a set of equations. Let $T$ be the negative of the inner product of the defining term for an estimator:

$$T = -\check{g}(\theta)'\check{g}(\theta) = -\Big(\sum_n \check{g}_n(\theta)\Big)'\Big(\sum_n \check{g}_n(\theta)\Big)\Big/N^2.$$

$T$ is necessarily less than or equal to zero, since it is the negative of a sum of squares. $T$ has a highest value of 0, which is attained only when the squared terms that compose it are all 0; that is, the maximum of $T$ is attained when $\check{g}(\theta) = 0$. Maximizing $T$ is therefore equivalent to solving the equation $\check{g}(\theta) = 0$. The approaches described in Chapter 8, with the exception of BHHH, can be used for this maximization. BHHH cannot be used, because that method assumes that the function being maximized is a sum of observation-specific terms, whereas $T$ takes the square of each sum of observation-specific terms. The other approaches, especially BFGS and DFP, have proven very effective at locating the parameters at which $\check{g}(\theta) = 0$.

With MSL, it is usually easier to maximize the simulated likelihood function itself rather than $T$. BHHH can be used in this case, as well as the other methods.