
Score tests in GMM: Why Use Implied Probabilities?∗

Saraswata Chaudhuri† and Eric Renault‡

August 20, 2017

Abstract

While simple to implement and thus attractive in practice, the GMM score tests introduced by Newey and West (1987) often display upward size distortion under common scenarios involving skewed moment vectors and/or models with weak identification. Inference based on Generalized Empirical Likelihood (GEL) is seen as a general solution to this problem. Kleibergen (2005) and, more generally, Guggenberger and Smith (2005) devised an elegant theory for the GEL score tests. However, strictly speaking, GEL score tests do not nest the Newey-West score test. Our paper provides a unified framework for score tests in GMM that nests all of the above as special cases and helps us to understand the precise mechanism by which the standard first order asymptotic theory on size and power of tests well approximates the finite sample behavior of some score tests (namely, a subset of GEL score tests) but not others. Special attention is paid to models with weak identification. Moreover, we argue that the apparent computational burden of GEL can be overcome in practice by recognizing the fundamental common role played by the GEL implied probabilities under all special cases of our framework. In particular, we show that all GEL implied probabilities are asymptotically equivalent at the first order – both under the null and under appropriate sequences of alternatives – and thus are exchangeable across computationally burdensome (e.g. Empirical Likelihood) and easy (e.g. Euclidean Empirical Likelihood) GEL score tests without affecting the first order asymptotics. Extensive simulation evidence is provided to corroborate all our theoretical results. The simulation results also support a simple and yet important insight on the power of the tests: the use of implied probabilities to efficiently estimate the components of the score statistic, namely the Jacobian and the asymptotic variance of the moment vector, can provide significant improvement in power of the score test in finite samples.

JEL Classification: C12; C13; C30

Keywords: GEL; GMM; Implied probabilities; Score test

∗ We gratefully acknowledge helpful comments from various conference and seminar participants. This version replaces a part of an earlier version of the paper titled “Finite-sample improvements of score tests by the use of implied probabilities from Generalized Empirical Likelihood”.

† Department of Economics, McGill University, Montreal, Canada. Email: [email protected].
‡ Department of Economics, Brown University, Providence, United States. Email: [email protected].


1 Introduction

The focus of our interest in this paper is the test of a null hypothesis

H0 : θ = θ0    (1.1)

about a vector θ ∈ Θ ⊂ R^p of p unknown parameters identified by H > p moment conditions:

E[ψ(W_i, θ)] = 0.    (1.2)

More precisely, the true unknown value θ⁰ is uniquely defined by:

E[ψ(W_i, θ)] = 0  ⟺  θ = θ⁰.

The true unknown value θ⁰, and as a consequence also the hypothetical one θ0, are assumed to belong to the interior of the compact set Θ of parameters. Almost surely for the distribution of W_i, the function ψ(w, θ) is continuously differentiable in θ on the interior of Θ.

For the sake of expositional simplicity, the observations W_i, i = 1, . . . , n, will be treated throughout as an i.i.d. sequence¹ such that a uniform law of large numbers and a central limit theorem apply to the sequence ψ_i(θ) ≡ ψ(W_i, θ), with an asymptotic covariance matrix:

V(θ) = Var[√n ψ̄_n(θ)] = Var[ψ_1(θ)],  where  ψ̄_n(θ) = (1/n) Σ_{i=1}^n ψ_i(θ).

However, most of the results could easily be extended to the case of estimating equations ψ_i(θ), i = 1, . . . , n, resulting from a preliminary smoothing, as put forward by Kitamura and Stutzer (1997), to get efficient inference in the presence of serial dependence.

¹ In order to accommodate weak identification, we will consider throughout some drifting Data Generating Processes (DGPs) such that the probability distribution of (W_i)_{1≤i≤n} is a product of n identical distributions that may shift with n. However, the only relevant impact of this shift is on the moment conditions E[ψ_i(θ)] and does not impair the law of large numbers for higher order moments.

It has been known since Newey and West (1987) that a test of the null hypothesis (1.1) should not be directly based on a norm of the sample mean ψ̄_n(θ0) = (1/n) Σ_{i=1}^n ψ_i(θ0), but should instead maximize local power by setting the focus on the p moment conditions that are most informative about θ. For this purpose, Newey and West (1987) propose a score test based on the criterion function of the efficient two-step GMM.

The important idea put forward by the empirical likelihood (EL) literature and all its extensions, in particular the class of generalized empirical likelihood (GEL) estimators (see Smith (1997), Newey and Smith (2004)), is that the GEL optimization provides implied probabilities that may lead us to revise our empirical views about the data generating process and thereby ensure better performance in finite samples. It is now well known (see e.g. Kleibergen (2005), Guggenberger and Smith (2005)) that the GEL finite-sample improvements on GMM inference are even more warranted in the case of weak identification of the parameters specified by the moment conditions.

We provide in this paper a general framework to take advantage of these implied probabilities for improving the finite-sample performance of the score test of Newey and West (1987). The key intuition is that, while power of tests is improved by setting the focus on efficient directions, it is improved even further by estimating these directions efficiently. Moreover, by erasing perverse correlation between the moment conditions and the estimated efficient directions, this efficient estimation provides an almost perfect hedge against size distortion effects due to weak identification.

The theoretical contribution of the paper is as follows. We derive a comprehensive theory of the asymptotic behavior of implied probabilities. Since our focus of interest is not only size but also power of tests, it takes a theory of these probabilities not only when computed at the true value of the structural parameters as hypothesized by the null (1.1), but also under convenient sequences of alternatives. These alternatives may be only local, in case of strong identification of all structural parameters, but also more global in case of weak identification of some components of the vector θ of parameters.

In this respect, our paper can be seen as following up on the important contribution of Guggenberger and Smith (2005). Like in their paper, our results are valid for any GEL criterion inspired by the Cressie-Read family of statistical discrepancies. However, for the sake of notational simplicity, we sometimes exclude the case of exponential tilting. For inference with weak instruments, exponential tilting has been the focus of interest of Caner (2010). A companion paper, Chaudhuri and Renault (2017), extends the main ideas put forward in this paper to cases with a more general null, e.g., H0 : a(θ) = 0 where a(θ) is any (possibly smooth) function of θ, including the special case that a(θ) = θ_1 where θ can be partitioned as θ = (θ′_1, θ′_2)′.

For the purpose of user-friendly practical applications, our contribution is to illustrate the excellent performance of 3SEEL (Three Step Euclidean Empirical Likelihood) with shrinkage proposed by Antoine, Bonnal and Renault (2007) (henceforth, ABR-07). Thanks to the closed form formulas for implied probabilities stemming from quadratic optimization, EEL does display some significant computational advantages with respect to competitors like EL or GS-EL (EL partially modified by Guggenberger and Smith (2005)). Joint use of shrinkage and 3SEEL allows us to resort to user-friendly non-negative implied probabilities that are well suited to improve the estimators of both Jacobian and Variance in the GMM score function. While this shrunk 3SEEL is shown to be first order asymptotically equivalent with its main competitors, EL and GS-EL, it is not only much more user friendly but also displays better (size-corrected) finite-sample power, according to our Monte Carlo experiments. This seems to confirm, in particular, the higher-order asymptotic equivalence between EL and 3SEEL discussed by ABR-07.

The paper is organized as follows. Section 2 provides a general overview of the implied probabilities with particular focus on EL and EEL. Section 3 develops the asymptotic theory of the implied probabilities that we apply in the rest of the paper to design and describe the properties of score tests. Section 4 puts on the table several alternatives to the Newey-West score test that take advantage of implied probabilities based on the hypothesis (1.1). Section 5 discusses various score tests of the hypothesis (1.1), and presents a Monte Carlo comparison of the finite-sample performance of these tests. Section 6 concludes. All proofs can be found in the Appendix.

2 Implied probabilities: Overview

2.1 Cressie-Read Lagrange multipliers

The information theoretic approaches to inference in moment condition models have become popular in econometrics since the seminal paper by Imbens et al. (1998). The idea in the context of general moment conditions (1.2) is first, for any given value of the vector θ of parameters, to look for implied probabilities π^(γ)_n(θ) = (π^(γ)_{i,n}(θ))_{1≤i≤n} as solutions of:

min_{π∈R^n}  [1/(γ(γ+1))] Σ_{i=1}^n [(nπ_i)^{1+γ} − 1]    (2.1)

subject to:

Σ_{i=1}^n π_i = 1  and  Σ_{i=1}^n π_i ψ_i(θ) = 0.

The objective function (2.1) is defined for any real γ, including the two limit cases γ → 0 and γ → (−1). The family of these functions, indexed by γ, is generally referred to as the Cressie-Read family of power divergence statistics (see Imbens et al. (1998) and the references therein). It is well known (see for instance Schennach (2007) for a review with the same notations) that, up to a scaling factor, the vector λ_n(θ) of H Lagrange multipliers associated to the constraint Σ_{i=1}^n π_i ψ_i(θ) = 0 in the optimization program above is characterized as the solution of the dual problem:

max_{λ∈Λ^(γ)_n(θ)}  Σ_{i=1}^n ϱ^(γ)[λ′ψ_i(θ)]    (2.2)

with the following notations, for any γ ∈ R:

Q(γ) = {x ∈ R : 1 + γx > 0}    (2.3)

Λ^(γ)_n(θ) = {λ ∈ R^H : λ′ψ_i(θ) ∈ Q(γ), ∀i = 1, ..., n}    (2.4)

dϱ^(γ)(x)/dx = −[1 + γx]^{1/γ},  ∀x ∈ Q(γ).

Note that, for all γ ∈ R, Q(γ) is an open interval of the real line, containing x = 0, and:

d²ϱ^(γ)(x)/dx² = −[1 + γx]^{(1/γ)−1} < 0,  ∀x ∈ Q(γ).

Thus, ϱ^(γ) is a strictly concave function defined on Q(γ) with:

dϱ^(γ)(0)/dx = d²ϱ^(γ)(0)/dx² = −1,

while the function ϱ^(γ) itself is defined by (2.3) up to an additive constant, but, whenever it is convenient, will be normalized by:

ϱ^(γ)(0) = 0.

Define

τ^(γ)(x) = dϱ^(γ)(x)/dx = −[1 + γx]^{1/γ}

(including the limit case ϱ^(γ)(x) = 1 − exp(x) for γ → 0). Then, we easily check that:

π^(γ)_{i,n}(θ) = c_n(θ) τ^(γ)(λ^(γ)_n(θ)′ψ_i(θ)) = −c_n(θ)[1 + γλ^(γ)_n(θ)′ψ_i(θ)]^{1/γ}    (2.5)

where the scaling constant c_n(θ) is determined by the normalization condition:

Σ_{i=1}^n π^(γ)_{i,n}(θ) = 1.
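To make the dual problem (2.2) and the resulting implied probabilities (2.5) concrete, the following minimal numerical sketch (our own illustration in Python; the function names and the use of a derivative-free optimizer are assumptions, not part of the paper) computes λ^(γ)_n(θ) and π^(γ)_{i,n}(θ) for a fixed θ from the n × H matrix whose rows are ψ_i(θ).

```python
import numpy as np
from scipy.optimize import minimize

def rho(x, gamma):
    # Cressie-Read GEL objective rho^(gamma), normalized so that rho(0) = 0
    # and rho'(x) = -(1 + gamma*x)^(1/gamma), as in the text.
    if abs(gamma) < 1e-10:            # gamma -> 0: exponential tilting limit
        return 1.0 - np.exp(x)
    if abs(gamma + 1.0) < 1e-10:      # gamma -> -1: empirical likelihood limit
        return np.log1p(-x)
    return -((1.0 + gamma * x) ** (1.0 + 1.0 / gamma) - 1.0) / (1.0 + gamma)

def gel_multipliers_and_probs(psi, gamma):
    """Solve the dual problem (2.2) at a fixed theta and return the Lagrange
    multipliers lambda_n^(gamma)(theta) together with the implied
    probabilities pi_{i,n}^(gamma)(theta) of (2.5).  psi is (n, H)."""
    n, H = psi.shape

    def neg_dual(lam):
        x = psi @ lam
        # stay inside Lambda_n^(gamma)(theta), i.e. 1 + gamma * x > 0
        if abs(gamma) > 1e-10 and np.any(1.0 + gamma * x <= 0.0):
            return np.inf
        return -np.sum(rho(x, gamma))

    lam = minimize(neg_dual, np.zeros(H), method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-12}).x
    x = psi @ lam
    w = np.exp(x) if abs(gamma) < 1e-10 else (1.0 + gamma * x) ** (1.0 / gamma)
    return lam, w / w.sum()           # normalization: probabilities sum to one
```

For γ = 1, the output coincides (up to numerical tolerance) with the closed-form EEL probabilities of Corollary 1 below, which provides a convenient check.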

Our first technical result below in Proposition 1 is tightly related to the setup of Guggenberger and Smith (2005), but a comprehensive proof is provided in the appendix for the sake of self-containedness. Our maintained assumption is the following:

Assumption 1: Θ_n, n = 1, 2, ..., is a sequence of subsets of Θ, containing the true unknown value θ⁰, and such that:

(i) sup_{θ∈Θ_n} ∥E(ψ̄_n(θ))∥ = O(1/√n).

(ii) max_{1≤i≤n} sup_{θ∈Θ_n} ∥ψ_i(θ)∥ = o_P(√n).

(iii) sup_{θ∈Θ_n} ∥ψ_i(θ)∥ = O_P(1) for each i = 1, . . . , n.

(iv) sup_{θ∈Θ_n} ∥ψ̄_n(θ) − E(ψ̄_n(θ))∥ = O_P(1/√n).

The interpretation of Assumption 1 obviously depends on the choice of the sequence of sets Θ_n, n = 1, 2, . . .

We first note that Assumption 1 is hardly restrictive if we choose Θ_n = {θ⁰} for all n = 1, 2, . . .. For this choice, Assumptions 1(i) and 1(iii) are fulfilled by definition. Moreover, Assumptions 1(ii) and 1(iv) are then a direct consequence of the fact that ψ(W_i, θ⁰) is an i.i.d. sequence with zero mean and finite variance: Assumption 1(ii) is a standard result of the empirical likelihood literature (see Owen (1990), Lemma 3) whose proof (see ABR-07) is based on the Borel-Cantelli Lemma, while Assumption 1(iv) is obviously implied by the Central Limit Theorem.

However, the choice Θ_n = {θ⁰} is obviously not very useful when the focus of our interest is mainly power issues. For instance, when working with sequences Θ_n of √n-local alternatives in the case of strong identification, Assumption 1(i) will hold by the mean value formula. Besides the sequences of local alternatives naturally suggested by asymptotic power studies, we will also have to consider larger sets Θ_n when θ (or some components of θ) will be only weakly identified.

We can then prove:

Proposition 1: For all θ ∈ Θ_n and γ ∈ R, there exists a sequence of vectorial functions λ^(γ)_n(θ) ∈ Λ^(γ)_n(θ), where Λ^(γ)_n(θ) is defined in (2.4), such that:

lim_{n→∞} Pr[ ∩_{θ∈Θ_n} { Σ_{i=1}^n ϱ^(γ)[λ^(γ)_n(θ)′ψ_i(θ)] ≥ Σ_{i=1}^n ϱ^(γ)[λ′ψ_i(θ)], ∀λ ∈ Λ^(γ)_n(θ) } ] = 1

with:

sup_{θ∈Θ_n} ∥λ^(γ)_n(θ)∥ = O_P(1/√n).

Then we prove the following result, already derived by Newey and Smith (2004), that gives the vector of Lagrange multipliers as a simple re-scaling of the moment conditions.

Proposition 2: For all θ ∈ Θ_n:

λ^(γ)_n(θ) = −(Ω^(γ)_n(θ))^{−1} ψ̄_n(θ)    (2.6)

where:

Ω^(γ)_n(θ) = (1/n) Σ_{i=1}^n k^(γ)(λ^(γ)_n(θ)′ψ_i(θ)) ψ_i(θ)ψ_i(θ)′

and:

k^(γ)(x) = −(1/x)[1 + τ^(γ)(x)].

The interpretation of Proposition 2 is even more transparent in two particular cases:

Case 1: The function k^(γ)(·) is constant.

With k^(γ)(x) = −(1/x)[1 + τ^(γ)(x)] = −[1 − (1 + γx)^{1/γ}]/x, we see that this happens if and only if γ = 1, that is, in the Euclidean Empirical Likelihood (EEL) case. The criterion function (2.1) is then quadratic with respect to the probabilities π_i. In this case, Ω^(1)_n(θ) = Ω_n(θ), where Ω_n(θ) is the estimator of the covariance matrix V(θ) based on the empirical distribution that gives equal weights π_{i,n} = 1/n to each observation:

Ω_n(θ) = Σ_{i=1}^n π_{i,n} ψ_i(θ)ψ_i(θ)′.

We have:

λ^(1)_n(θ) = −Ω_n(θ)^{−1} ψ̄_n(θ).

Case 2: The function k^(γ)(·) is proportional to the function τ^(γ)(·).

With k^(γ)(x) = −(1/x)[1 + τ^(γ)(x)] = −[1 − (1 + γx)^{1/γ}]/x, we see that this happens if and only if γ = −1, that is, in the Empirical Likelihood (EL) case. The criterion function (2.1) is then proportional to Σ_{i=1}^n ln π_i. Then:

Ω^(−1)_n(θ) = (1/n) Σ_{i=1}^n [1 − λ^(−1)_n(θ)′ψ_i(θ)]^{−1} ψ_i(θ)ψ_i(θ)′

is proportional to:

V^(−1)_{n,θ}(θ) = Σ_{i=1}^n π^(−1)_{i,n}(θ) ψ_i(θ)ψ_i(θ)′,

that is, an alternative estimator of the covariance matrix V(θ) of the moment vector ψ(W_i, θ), taking advantage of the implied probabilities. We deduce that:

λ^(−1)_n(θ) ∝ −[V^(−1)_{n,θ}(θ)]^{−1} ψ̄_n(θ).    (2.7)

Hence, we conclude that in both cases of EL and EEL (and seemingly only in these two cases), the vector of Lagrange multipliers is proportional to V_n^{−1}(θ) ψ̄_n(θ) for some estimator V_n(θ) of V(θ). We will set a special focus on these two cases, in particular because they allow us to extend the seminal argument of Back and Brown (1993) about taking advantage of the informational content of moment restrictions to estimate additional moments.

2.2 Implied Probabilities from Empirical Likelihood

Assume we want to estimate (again, for a given value of θ) the expectation of a known vector function g(W_i, θ) of data and parameters, while taking advantage of the information that ψ(W_i, θ) has a zero expectation. The idea put forward by Back and Brown (1993) is then to consider an extended set of moment conditions, that is, not only:

E[ψ(W_i, θ)] = 0    (2.8)

but also:

E[µ − g(W_i, θ)] = 0    (2.9)

for an extended set (θ′, µ′)′ of parameters of interest.

When revisiting the minimization problem (2.1) with not only the constraints produced by (2.8), namely:

Σ_{i=1}^n π_i ψ(W_i, θ) = 0,    (2.10)

but also the constraints produced by (2.9), that is:

Σ_{i=1}^n π_i [µ − g(W_i, θ)] = 0,    (2.11)

we will in general end up with different solutions π_i, i = 1, 2, . . . , n. However, for given data W_i, i = 1, 2, . . . , n, there is one specific value of µ such that the additional constraints (2.11) do not change the solution described above. It is the case where the additional constraints are not binding, because this given value of µ is nothing but:

µ^(γ)_n(θ) = Σ_{i=1}^n π^(γ)_{i,n}(θ) g(W_i, θ).

In this case, the Lagrange multipliers associated to the constraints (2.11) will all be nil. This allows us to derive the following result.

Proposition 3: For all θ ∈ Θ_n, we have asymptotically with probability one, for γ = ±1:

Σ_{i=1}^n π^(γ)_{i,n}(θ) g(W_i, θ) = ḡ_n(θ) − Cov^(γ)_{n,θ}[g(W_i, θ), ψ(W_i, θ)] [V^(γ)_{n,θ}(θ)]^{−1} ψ̄_n(θ)    (2.12)

with:

Cov^(1)_{n,θ}[g(W_i, θ), ψ(W_i, θ)] = (1/n) Σ_{i=1}^n g(W_i, θ)[ψ_i(θ) − ψ̄_n(θ)]′,

Cov^(−1)_{n,θ}[g(W_i, θ), ψ(W_i, θ)] = Σ_{i=1}^n π^(−1)_{i,n}(θ) g(W_i, θ)[ψ_i(θ)]′,

V^(1)_n(θ) = (1/n) Σ_{i=1}^n ψ_i(θ)[ψ_i(θ) − ψ̄_n(θ)]′,

V^(−1)_{n,θ}(θ) = Σ_{i=1}^n π^(−1)_{i,n}(θ) ψ_i(θ)[ψ_i(θ)]′.

Formula (2.12), albeit algebraically true for any θ, should be statistically interpreted for θ = θ⁰, the true unknown value such that E[ψ(W_i, θ⁰)] = 0. Then, under standard regularity conditions, V^(1)_n(θ⁰) and V^(−1)_{n,θ}(θ⁰) will both be consistent estimators of the variance matrix V(θ⁰) of the vector ψ(W_i, θ⁰). Similarly, Cov^(1)_{n,θ}[g(W_i, θ⁰), ψ(W_i, θ⁰)] and Cov^(−1)_{n,θ}[g(W_i, θ⁰), ψ(W_i, θ⁰)] are consistent estimators of the covariance matrix Cov[g(W_i, θ⁰), ψ(W_i, θ⁰)]. In other words, (2.12) can be read at the true value as:

Σ_{i=1}^n π^(γ)_{i,n}(θ⁰) g(W_i, θ⁰) = (1/n) Σ_{i=1}^n [g(W_i, θ⁰) − b^(γ)_n(θ⁰)′ψ(W_i, θ⁰)],  for γ = ±1,

where b^(γ)_n(θ⁰) is a consistent estimator of the regression coefficient of g(W_i, θ⁰) on ψ(W_i, θ⁰). In yet other words, the two formulas encapsulated in (2.12) point out that, when estimating the expectation E[g(W_i, θ⁰)] (for known θ⁰), the naive estimator, the sample average of the observed g(W_i, θ⁰), can be optimally improved (to get an asymptotic minimum variance among consistent estimators) by the control variables ψ(W_i, θ⁰), the expectation of which is known to be zero. While for γ = 1 formula (2.12) had been extensively discussed in ABR-07 for an efficient use of the informational content of moment equations in the context of EEL (see also Stein (1956) for an earlier reference), the same formula for γ = −1 is the extension of ABR-07 to genuine EL. However, the EL case does not deliver closed form formulas as the EEL case does, because it must be kept in mind that the corresponding sample moments actually involve two kinds of occurrences of θ.
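A quick numerical sanity check of (2.12) in the EEL case γ = 1 may be useful here. The sketch below (our own illustration; the DGP and all variable names are hypothetical) verifies that the weighted average Σ_i π^(1)_{i,n}(θ) g(W_i, θ) coincides exactly with the control-variable form ḡ_n(θ) − Cov^(1)_{n,θ}[g, ψ][V^(1)_n(θ)]^{−1} ψ̄_n(θ).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical DGP: psi_i(theta) has mean zero; g(W_i, theta) is correlated with it.
psi = rng.normal(size=(n, 2))
g = 1.0 + psi @ np.array([0.5, -0.3]) + rng.normal(size=n)

psibar = psi.mean(axis=0)
V1 = (psi.T @ (psi - psibar)) / n                                     # V_n^(1)(theta)
pi_eel = 1.0 / n - ((psi - psibar) @ np.linalg.solve(V1, psibar)) / n  # (2.13) below

lhs = pi_eel @ g                                    # sum_i pi_i^(1)(theta) g(W_i, theta)
cov = ((psi - psibar).T @ g) / n                    # Cov_n^(1)[g, psi]
rhs = g.mean() - cov @ np.linalg.solve(V1, psibar)  # control-variable form of (2.12)
print(np.isclose(lhs, rhs))                         # True: the identity holds exactly
```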

Up to the additional difficulty just mentioned for EL, the formulas of Proposition 3 can, both for γ = 1 and for γ = −1, also be interpreted in terms of tilting the empirical distribution π_{i,n} = 1/n, as follows.

Corollary 1: In the context of Proposition 3:

π^(1)_{i,n}(θ) = 1/n − ψ̄_n(θ)′[V^(1)_n(θ)]^{−1} [ψ_i(θ) − ψ̄_n(θ)]/n    (2.13)

and:

π^(−1)_{i,n}(θ) = 1/n − ψ̄_n(θ)′[V^(−1)_{n,θ}(θ)]^{−1} π^(−1)_{i,n}(θ) ψ_i(θ).    (2.14)

Formulas (2.13) and (2.14) have quite similar interpretations. Assume for simplicity that H = 1 and ψ̄_n(θ) > 0. Then, a specific observation W_i will be downplayed (resp. magnified) by the tilting of probability weight from π_{i,n} = 1/n to the implied probability π^(γ)_{i,n}(θ), γ = ±1, when the value of ψ_i(θ) is larger (resp. smaller) than what it is supposed to be, that is, ψ̄_n(θ) (in case γ = 1) or zero (in case γ = −1). The rationale of both tiltings is actually the same, since in both cases we want to use the information that the population expectation of ψ_i(θ) is supposed to be zero. Note however an important difference between (2.13) and (2.14). While in (2.13) the implied probability π^(1)_{i,n}(θ) is explicitly given as a function of the observations ψ_i(θ), this is not the case for the implied probability π^(−1)_{i,n}(θ) in (2.14). It must actually be kept in mind that the implied probabilities are also hidden in the definition of V^(−1)_{n,θ}(θ).

While, on computational grounds, a closed-form formula to compute implied probabilities like (2.13) is of course very welcome ((2.14) does not give such a formula), it comes with a cost: non-negativity of the EEL implied probabilities is not warranted, whereas it is warranted for the EL ones (since they maximize Σ_{i=1}^n ln π_i). However, ABR-07 have shown that this flaw of the EEL implied probabilities can be fixed by a simple shrinkage, defining instead probabilities shrunk towards 1/n as:

π^(1),Sh_{i,n}(θ) = [1/(1 + ε_n(θ))] π^(1)_{i,n}(θ) + [ε_n(θ)/(1 + ε_n(θ))] π_{i,n}    (2.15)

where:

ε_n(θ) = −n × min[ min_{1≤i≤n} π^(1)_{i,n}(θ), 0 ].

Note that this shrinkage is performed in a minimal way, so that the information content of the implied probabilities is not wasted asymptotically (see Proposition 6 below). Therefore, one can also contemplate the use of the shrunk probabilities π^(1),Sh_{i,n}(θ) instead of the EEL ones π^(1)_{i,n}(θ) for the purpose of score testing.
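Since (2.13) and (2.15) are both closed-form, the EEL and shrunk EEL probabilities are straightforward to compute. A minimal sketch (our own illustration; the function names are assumptions):

```python
import numpy as np

def eel_probs(psi):
    """Closed-form EEL implied probabilities pi_i^(1)(theta) of (2.13).
    psi is the (n, H) array with rows psi_i(theta) at a fixed theta."""
    n = psi.shape[0]
    psibar = psi.mean(axis=0)
    V1 = (psi.T @ (psi - psibar)) / n           # V_n^(1)(theta)
    return 1.0 / n - ((psi - psibar) @ np.linalg.solve(V1, psibar)) / n

def shrunk_eel_probs(psi):
    """Shrunk EEL probabilities pi_i^(1),Sh(theta) of (2.15): shrink towards
    1/n just enough to make every probability non-negative."""
    n = psi.shape[0]
    pi = eel_probs(psi)
    eps = -n * min(pi.min(), 0.0)               # epsilon_n(theta)
    return pi / (1.0 + eps) + (eps / (1.0 + eps)) / n
```

By construction, the smallest shrunk probability equals exactly zero whenever some EEL probability was negative, and the shrinkage is the identity otherwise.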

3 Asymptotic theory of implied probabilities

3.1 General Motivation

While it is well known that, for any γ ≠ 0 (as already mentioned, we omit the case of exponential tilting for the sake of notational simplicity), minimization of the Cressie-Read discrepancy delivers implied probabilities π^(γ)_{i,n}(θ), i = 1, . . . , n, defined by (2.5) (up to normalization), little has been done to compare these different probability distributions for different values of the real number γ. Hausman et al. (2011) stated (with our notations) that, if θ^EL_n (resp. θ^CU_n) stands for the EL (resp. EEL) estimator of θ:

π^(−1)_{i,n}(θ^EL_n) = π^(1)_{i,n}(θ^CU_n) + O_P(1/n),    (3.1)

which is little informative for our purpose, since our goal is to improve upon the naive probabilities π_{i,n} = 1/n, which also fulfill, for any γ and for any √n-consistent estimator θ_n:

π^(γ)_{i,n}(θ_n) = π_{i,n} + O_P(1/n).

For the purpose of efficient use of the information content of the moment conditions, what really matters is that the asymptotic distribution of the estimator:

ḡ^(γ)_n(θ) = Σ_{i=1}^n π^(γ)_{i,n}(θ) g(W_i, θ)    (3.2)

is an improvement upon the naive estimator ḡ_n(θ) (see Proposition 3 for the particular cases γ = ±1). Our main goal in this section is to show that all these estimators ḡ^(γ)_n(θ) for any γ ≠ 0 are asymptotically equivalent, at least for θ in a convenient neighborhood of the true unknown value θ⁰, and indeed asymptotically more accurate than ḡ_n(θ).

This result will be applied to the various score tests (to be introduced in the next section) in the sense that all score test statistics involve some sample means whose asymptotic variance can be reduced by using instead optimally weighted averages as (3.2). The key intuition is that the improvement of accuracy of estimators like (3.2) should result in a power gain for the corresponding score tests, leading to score tests that are more powerful than the standard score test of Newey and West (1987). Moreover, thanks to the aforementioned asymptotic equivalence, Proposition 3 shows that in estimators like (3.2) any perverse correlation with the moment conditions has been erased, which should give us a hedge against bad size distortions.

Since the ultimate goal is to compare asymptotic distributions, we want to get equivalence results warranting that, for all θ in some neighborhood of θ⁰ and for any pair (γ, γ*) of non-zero real numbers:

Σ_{i=1}^n π^(γ)_{i,n}(θ) g(W_i, θ) − Σ_{i=1}^n π^(γ*)_{i,n}(θ) g(W_i, θ) = o_P(1/√n).

Since we are summing over n terms, such a degree of asymptotic equivalence intuitively requires that, for i = 1, 2, . . . , n:

π^(γ)_{i,n}(θ) − π^(γ*)_{i,n}(θ) = o_P(1/(n√n)).    (3.3)

Note that the equivalence result (3.3) is much more powerful than (3.1). In particular, it would not work with the naive (unconstrained) probabilities π_{i,n} = 1/n. We will prove it in the next subsection. Note however that we do not prove that:

max_{1≤i≤n} [π^(γ)_{i,n}(θ) − π^(γ*)_{i,n}(θ)] = o_P(1/(n√n)).

Therefore, (3.3) does not directly provide the asymptotic equivalence result between weighted averages that we are looking for. It will take additional assumptions and proofs in a further subsection.

3.2 Multipliers and Implied Probabilities

We derive in this subsection a formal theory of Lagrange multipliers and implied probabilities that will justify the informal analysis of Section 2.1. In particular, while we maintain the regularity conditions announced in the introduction, as well as the concept of a unique true unknown value defined by the moment conditions, we want to allow for various identification patterns, involving both strong and weak identification. Typically, the definition of the latter involves drifting data generating processes and thus the introduction of double arrays of observations indexed by both n and i = 1, . . . , n. We will keep the notations simple by not making explicit these double arrays, keeping the notation (W_i)_{1≤i≤n} instead of the correct notation (W_{i,n})_{1≤i≤n}.

For the purpose of the results from this section onward, it is useful to bestow Θ_n, defined in Assumption 1, with two additional properties.

Assumption 2: Let the sequence Θ_n, n = 1, 2, . . ., defined in Assumption 1 also satisfy:

(i) sup_{θ∈Θ_n} ∥Ω_n(θ) − V(θ)∥ = o_P(1),  sup_{θ∈Θ_n} ∥V^(1)_n(θ) − V(θ)∥ = o_P(1),

sup_{θ∈Θ_n} ∥[Ω_n(θ)]^{−1} − V^{−1}(θ)∥ = o_P(1),  sup_{θ∈Θ_n} ∥[V^(1)_n(θ)]^{−1} − V^{−1}(θ)∥ = o_P(1).

(ii) If γ_min(θ) (resp. γ_max(θ)) stands for the smallest (resp. largest) eigenvalue of V(θ):

0 < inf_{θ∈Θ_n} γ_min(θ) < sup_{θ∈Θ_n} γ_max(θ) < +∞.

Assumption 2(i) only maintains the validity of a law of large numbers for the sample covariance matrix, with a population covariance matrix. The two convergences on the first line of (i) are equivalent, thanks to Assumptions 1(i) and 1(iv). The same is true for the convergences on the second line under the additional condition of Assumption 2(ii), which maintains that the population covariance matrix is finite and positive definite.

We first prove the asymptotic equivalence of the sequences of Lagrange multipliers λ^(γ)_n(θ) for θ ∈ Θ_n, defined in Proposition 1.

Proposition 4: Under Assumptions 1 and 2, for any γ ≠ 0:

sup_{θ∈Θ_n} ∥λ^(γ)_n(θ) − λ^(1)_n(θ)∥ = o_P(1/√n)

with:

λ^(1)_n(θ) = −Ω_n(θ)^{−1} ψ̄_n(θ),  Ω_n(θ) = (1/n) Σ_{i=1}^n ψ_i(θ)ψ_i(θ)′.

Note that, while we only knew from Proposition 1 that, for all γ:

sup_{θ∈Θ_n} ∥λ^(γ)_n(θ)∥ = O_P(1/√n),

Proposition 4 implies in addition a smaller order of magnitude for the difference between Lagrange multipliers corresponding to different values of γ:

γ, γ* ∈ R*  ⇒  sup_{θ∈Θ_n} ∥λ^(γ)_n(θ) − λ^(γ*)_n(θ)∥ = o_P(1/√n).

In terms of the implied probabilities, Proposition 4 implies directly:

Proposition 5: Under Assumptions 1 and 2, we have, for any γ ≠ 0 and every i = 1, . . . , n:

sup_{θ∈Θ_n} |π^(γ)_{i,n}(θ) − π^(1)_{i,n}(θ)| = o_P(1/(n√n));

sup_{θ∈Θ_n} |π^(γ)_{i,n}(θ) − 1/n| = O_P(1/(n√n));

with:

π^(1)_{i,n}(θ) = 1/n − ψ̄_n(θ)′[V^(1)_n(θ)]^{−1}[ψ_i(θ) − ψ̄_n(θ)]/n;

V^(1)_n(θ) = (1/n) Σ_{i=1}^n ψ_i(θ)[ψ_i(θ) − ψ̄_n(θ)]′.

Note that the result of Proposition 5 is similar to Lemma 14 of Ramalho and Smith (2006). As mentioned in subsection 2.2, one may also contemplate the use of the shrunk probabilities π^(1),Sh_{i,n}(θ) to avoid the perverse possible negativity of the EEL probabilities π^(1)_{i,n}(θ). The nice thing with the shrunk probabilities is that they also define implied probabilities that are asymptotically equivalent to the EEL probabilities at the same rate as in Proposition 5.

Proposition 6: Under Assumptions 1 and 2, for every i = 1, . . . , n:

sup_{θ∈Θ_n} |π^(1)_{i,n}(θ) − π^(1),Sh_{i,n}(θ)| = o_P(1/(n√n)).

3.3 Laws of Large Numbers and Central Limit Theorems

Propositions 5 and 6 imply that we can safely build consistent estimators of population expectations by using any family of implied probabilities. More precisely:

Proposition 7: Let Assumptions 1 and 2 hold. Let (Y_i)_{1≤i≤n} be a sequence of i.i.d. square integrable random vectors. Then, for any γ ≠ 0:

sup_{θ∈Θ_n} ∥Σ_{i=1}^n π^(γ)_{i,n}(θ) Y_i − (1/n) Σ_{i=1}^n Y_i∥ = o_P(1),

sup_{θ∈Θ_n} ∥Σ_{i=1}^n π^(1),Sh_{i,n}(θ) Y_i − (1/n) Σ_{i=1}^n Y_i∥ = o_P(1).

In particular, Proposition 7 implies that all estimators of population expectations, based on the implied probabilities π^(γ)_{i,n}(θ) or π^(1),Sh_{i,n}(θ), are uniformly consistent like the ones based on the naive probabilities π_{i,n} = 1/n, for which the said uniformity holds trivially. The proof of Proposition 7 is quite straightforward:

∥Σ_{i=1}^n π^(γ)_{i,n}(θ) Y_i − (1/n) Σ_{i=1}^n Y_i∥ ≤ [max_{1≤i≤n} ∥Y_i∥] Σ_{i=1}^n |π^(γ)_{i,n}(θ) − 1/n|

with:

Σ_{i=1}^n |π^(γ)_{i,n}(θ) − 1/n| ≤ n O_P(1/(n√n)),

which gives the required result since, (Y_i)_{1≤i≤n} being a sequence of i.i.d. square integrable random vectors:

max_{1≤i≤n} ∥Y_i∥ = o_P(√n).

To move from laws of large numbers to central limit theorems, one would need to be able to replace o_P(1) by o_P(1/√n) in Proposition 7. Of course, this is not true when comparing with the naive weights π_{i,n} = 1/n, since the implied probabilities have precisely been used to provide a more accurate estimator (see Proposition 3). But it might become true when comparing two different sets of implied probabilities for Cressie-Read parameters γ and γ*:

∥Σ_{i=1}^n π^(γ)_{i,n}(θ) Y_i − Σ_{i=1}^n π^(γ*)_{i,n}(θ) Y_i∥ ≤ [max_{1≤i≤n} |π^(γ)_{i,n}(θ) − π^(γ*)_{i,n}(θ)|] Σ_{i=1}^n ∥Y_i∥.

Since Σ_{i=1}^n ∥Y_i∥ = O_P(n), we would get the required result if we knew that:

max_{1≤i≤n} |π^(γ)_{i,n}(θ) − π^(γ*)_{i,n}(θ)| = o_P(1/(n√n)).

Unfortunately, Propositions 5 and 6 only give this order of magnitude for each i = 1, . . . , n, but there is no such thing as a uniform upper bound over i = 1, . . . , n. Hence, we must resort to direct proofs of the requested equivalences. We first show that, as already pointed out in ABR-07, shrinking implied probabilities is immaterial in terms of first order asymptotics.

Proposition 8: Let Assumptions 1 and 2 hold. Consider a sequence (Y_i)_{1≤i≤n} of i.i.d. square integrable random vectors such that:

sup_{θ∈Θ_n} ∥(1/n) Σ_{i=1}^n Y_i(ψ_i(θ) − ψ̄_n(θ))′ − Cov[Y_i, ψ_i(θ)]∥ = o_P(1),

sup_{θ∈Θ_n} ∥Cov[Y_i, ψ_i(θ)]∥ = O(1).

Then:

sup_{θ∈Θ_n} ∥Σ_{i=1}^n π^(1)_{i,n}(θ) Y_i − Σ_{i=1}^n π^(1),Sh_{i,n}(θ) Y_i∥ = o_P(1/√n).

Using different sets of implied probabilities is also immaterial, insofar as the probability distribution of Y_i does not display overly fat tails.

Proposition 9: Let Assumptions 1 and 2 hold. Consider a sequence (Y_i)_{1≤i≤n} of i.i.d. random vectors such that E[∥Y_i∥⁴] < ∞. Furthermore, suppose that a suitable functional central limit theorem gives:

sup_{θ∈Θ_n} ∥(1/√n) Σ_{i=1}^n {ψ_i(θ)Y_i − E[ψ_i(θ)Y_i]}∥ = O_P(1).    (3.4)

Then, for any γ, γ* ≠ 0:

sup_{θ∈Θ_n} ∥Σ_{i=1}^n π^(γ)_{i,n}(θ) Y_i − Σ_{i=1}^n π^(γ*)_{i,n}(θ) Y_i∥ = o_P(1/√n).

The condition in (3.4) is apparently strong. It is only required to show the uniformity in θ of the closeness between the two weighted averages in the result of Proposition 9. On the other hand, if the focus lies on the said closeness for a given θ ∈ Θ_n, then one could accordingly weaken (3.4), which would then amount to a standard central limit theorem in the case Θ_n = {θ⁰}.

Propositions 8 and 9 will allow us to show that all these estimators are asymptotically normal with the same asymptotic variances, as follows:

Proposition 10: Let Assumptions 1 and 2 hold. Consider a sequence (Y_i)_{1≤i≤n} of i.i.d. random vectors with E[∥Y_i∥⁴] < ∞ and such that:

(1/√n) Σ_{i=1}^n [ Y_i − E[Y_i]
                   ψ_i(θ) − E[ψ_i(θ)] ]  →d  N( 0, [ Var(Y_i)          Cov[Y_i, ψ_i(θ)]
                                                     Cov[ψ_i(θ), Y_i]  Var(ψ_i(θ)) ] )

for each θ ∈ Θ_n. Then, for all θ ∈ Θ_n, the vector:

U_n = √n [ Σ_{i=1}^n π^(1)_{i,n}(θ) Y_i − (E[Y_i] − Cov[Y_i, ψ_i(θ)][Var(ψ_i(θ))]^{−1} E[ψ_i(θ)])
           ψ̄_n(θ) − E[ψ_i(θ)] ]

is asymptotically normal with mean zero and a block diagonal variance matrix, the North-West block of the variance matrix being given by:

Var(Y_i) − Cov[Y_i, ψ_i(θ)] [Var(ψ_i(θ))]^{−1} Cov[ψ_i(θ), Y_i].

Note that the statement above involves an abuse of notation, since ψ_i(θ) depends indirectly on n (for θ picked in Θ_n). However, the statement is clear if we rather say that the characteristic function of U_n minus the characteristic function of the normal distribution described above converges to zero. We will actually assume that this convergence is uniform (valid, as for all Propositions above, for the supremum over θ ∈ Θ_n).

Note also that Proposition 10 confirms that the use of the implied probabilities provides a more accurate estimator of E[Y]; its asymptotic variance is smaller than Var(Y), albeit at the cost of introducing an asymptotic bias that is due to the non-zero E[ψ(θ)] (more precisely, the non-o_P(1/√n) E[ψ̄_n(θ)]). In our application, where we use the implied probabilities for the score test, this bias only has a first order asymptotic effect if some elements of θ are weakly identified, in which case, however, their use turns out to be even more important, since not doing so leads to an over-sized score test (see Kleibergen (2005)). In a companion paper, Chaudhuri and Renault (2017) analyze the effect of this increased bias and reduced variance in the context of testing nonlinear restrictions on parameters with different strengths of identification (also see Chaudhuri (2017)).

4 Score vector using the implied probabilities

4.1 General framework

A score test for the hypothesis (1.1) on the full vector must be built from the first order conditions of a minimization program dedicated to the estimation of the unknown vector θ of parameters. Following the logic of Section 2.1, θ would be estimated through an information theoretic approach applied to the implied probabilities π^(γ)_{i,n}(θ) as follows:

min_{θ∈Θ}  [1/(γ*(γ*+1))] Σ_{i=1}^n [ (n π^(γ)_{i,n}(θ))^{1+γ*} − 1 ]    (4.1)

for some real number γ*. Then, the first order conditions for computing an estimator of θ would be:

Σ_{i=1}^n { [(π^(γ)_{i,n}(θ))^{γ*} − 1]/γ* } · ∂π^(γ)_{i,n}(θ)/∂θ = 0.    (4.2)

Note that, except in the limit case γ* → 0, (4.2) can be rewritten in the more concise way:

Σ_{i=1}^n (π^(γ)_{i,n}(θ))^{γ*} · ∂π^(γ)_{i,n}(θ)/∂θ = 0    (4.3)

since we obviously have:

Σ_{i=1}^n π^(γ)_{i,n}(θ) = 1, ∀θ  ⇒  Σ_{i=1}^n ∂π^(γ)_{i,n}(θ)/∂θ = 0.    (4.4)

Different, albeit first-order asymptotically equivalent, estimators of θ will be obtained, depending on the choice of a couple (γ, γ*) of real numbers that define the two Cressie-Read distances at stake.

(i) The most common strategy is arguably to take the parameter γ* of the Cressie-Read distance (4.1) equal to the parameter γ that has been previously at play to define the Lagrange multiplier vectors λ^(γ)_n(θ) and the associated implied probabilities π^(γ)_{i,n}(θ).

(ii) An alternative strategy has been put forward by Schennach (2007). She actually recommends to choose γ* = −1 (empirical likelihood, EL) to take advantage of the higher order efficiency properties of empirical likelihood, but uses the implied probabilities associated to exponential tilting, that is, γ = 0 (KLIC minimization as put forward by Kitamura and Stutzer (1997)). Schennach (2007) shows that keeping γ = 0 for computing implied probabilities does not impair the higher order efficiency brought by γ* = −1, while ensuring some robustness when the moment conditions are misspecified. In a slightly different context with (local) conditional moment restrictions, Gagliardini et al. (2011) also remain true to exponential tilting for computing implied probabilities, while they rather take γ* = 1 (Euclidean Empirical Likelihood, EEL) to estimate θ.

(iii) We will put forward in this paper a third strategy: we take advantage of the simplification brought by the condition γ = γ* to simplify the formula (π^(γ)_{i,n}(θ))^{γ*}, but we keep the freedom of a different value of γ to compute ∂π^(γ)_{i,n}(θ)/∂θ.

In other words, using (2.5) and disregarding the limit case γ = 0 for the sake of notational simplicity (see Section 4.3 for a special focus on this case), we can simplify the first order conditions (4.3) as follows:

Σ_{i=1}^n [1 + γ* λ^(γ*)_n(θ)′ψ_i(θ)] · ∂π^(γ)_{i,n}(θ)/∂θ = 0    (4.5)

or, using (4.4):

[Σ_{i=1}^n (∂π^(γ)_{i,n}(θ)/∂θ) ψ_i(θ)′] λ^(γ*)_n(θ) = 0.

However, it is worth noting that:

Σ_{i=1}^n π^(γ)_{i,n}(θ) ψ_i(θ) = 0, ∀θ  ⇒  Σ_{i=1}^n (∂π^(γ)_{i,n}(θ)/∂θ) ψ_i(θ)′ = −Σ_{i=1}^n π^(γ)_{i,n}(θ) ∂ψ_i′(θ)/∂θ.

Hence, we can eventually rewrite (4.5) as follows:

Σ_{i=1}^n π^(γ)_{i,n}(θ) (∂ψ_i′(θ)/∂θ) · λ^(γ*)_n(θ) = 0.    (4.6)

When remembering that for γ = ±1 we have:

λ^(γ)_n(θ) ∝ [Ω^(γ)_n(θ)]^{−1} ψ̄_n(θ),

playing with γ, γ* = ±1 in (4.6) gives us several possible score vectors. As discussed below, formula (4.6) will allow us to bridge the gap between a score test implied by the information-theoretic approach (4.1) and a Newey-West score test. For the sake of improving the finite-sample performance, we will actually consider, even more generally, score vectors of the following form:

l_n(θ, π^G(θ), π^V(θ)) = Σ_{i=1}^n π^G_{i,n}(θ) G_i′(θ) [Σ_{i=1}^n π^V_{i,n}(θ) V_{i,n}(θ)]^{−1} √n ψ̄_n(θ)    (4.7)

where:

G_i(θ) = ∂ψ_i(θ)/∂θ′,   V_{i,n}(θ) = ψ_i(θ)(ψ_i(θ) − ψ̄_n(θ))′,

and π^G_{i,n}(θ) and π^V_{i,n}(θ) may be different, but such that:

π^G_{i,n}(θ), π^V_{i,n}(θ) ∈ {π^(γ)_{i,n}(θ); γ ∈ R*} ∪ {π_{i,n}}.    (4.8)

Note that the score vector (4.7) is consistent with the general framework (4.6) since, for all γ:

Σ_{i=1}^n π^(γ)_{i,n}(θ) ψ_i(θ)ψ_i′(θ) = Σ_{i=1}^n π^(γ)_{i,n}(θ) V_{i,n}(θ).

It is only in the case of the unconstrained (naive) empirical probabilities π_{i,n} that we slightly modify the score vector (4.6), by considering second order moments in mean deviation form, V^(1)_n(θ), instead of the naive estimator Ω^(1)_n(θ) = Ω_n(θ).
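For concreteness, a minimal sketch of the score vector (4.7) (our own illustrative code; the array conventions and function name are assumptions, not part of the paper) takes as inputs the stacked moment contributions and their Jacobians, together with any admissible pair of weights from (4.8):

```python
import numpy as np

def score_vector(psi, G, pi_G, pi_V):
    """Score vector l_n(theta, pi^G(theta), pi^V(theta)) of (4.7) at a fixed theta.
    psi : (n, H) array, rows psi_i(theta)
    G   : (n, H, p) array, G_i(theta) = d psi_i(theta) / d theta'
    pi_G, pi_V : length-n weight vectors chosen from (4.8)."""
    n = psi.shape[0]
    psibar = psi.mean(axis=0)
    Ghat = np.einsum("i,ihp->hp", pi_G, G)                    # sum_i pi^G_i G_i(theta)
    Vhat = np.einsum("i,ih,ik->hk", pi_V, psi, psi - psibar)  # sum_i pi^V_i V_{i,n}(theta)
    return Ghat.T @ np.linalg.solve(Vhat, np.sqrt(n) * psibar)
```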

4.2 Various score vectors

Out of the nine possible score vectors that can be deduced from (4.7) and the possible choices of implied probabilities given by (4.8) with γ = ±1, we will pick seven of them because they have already been considered in the literature. Note that we use the acronyms 2SGMM (Two Stage GMM) and 3SEEL (Three Step EEL) that are typically used for estimation procedures based on a first-step estimator. As far as score testing of hypotheses such as (1.1) is concerned, where θ is completely specified by the null hypothesis, there is no such thing as a multi-step procedure, since we just plug in the hypothesized value θ0 given by (1.1).

(i) S^2SGMM_n, the score vector of the standard efficient 2SGMM of Hansen (1982):

S^2SGMM_n(θ) = l_n(θ, π^G(θ), π^V(θ));  π^G_{i,n}(θ) = π^V_{i,n}(θ) = π_{i,n}.

The solution θ^2SGMM_n of the equation S^2SGMM_n(θ) = 0 is known to reach the semi-parametric efficiency bound in terms of first order asymptotics. Moreover, the score vector S^2SGMM_n is typically the one put forward by Newey and West (1987) for score testing. Note however that we apply 2SGMM with an efficient weighting matrix in which the second order moments have been computed in mean deviation form. We follow in this respect the lesson of Hall (2000) and Chaudhuri and Renault (2015), with the hope of better finite-sample performance.

(ii) S^EL_n, the score vector of EL:

S^EL_n(θ) = l_n(θ, π^G(θ), π^V(θ));  π^G_{i,n}(θ) = π^V_{i,n}(θ) = π^(−1)_{i,n}(θ).

The EL estimator θ^EL_n is the solution of the equation S^EL_n(θ) = 0. It is known (see Newey and Smith (2004)) to be higher order efficient (up to bias correction). As far as score testing is concerned, the score vector S^EL_n has been promoted by Guggenberger and Smith (2005).

(iii) S^EEL_n, the score vector of EEL:

S^EEL_n(θ) = l_n(θ, π^G(θ), π^V(θ));  π^G_{i,n}(θ) = π^(1)_{i,n}(θ);  π^V_{i,n}(θ) = π_{i,n}.

Newey and Smith (2004) and ABR-07 have shown that the solution θ^CU_n of the equation S^EEL_n(θ) = 0 is the Continuously Updated (CU-) GMM estimator of Hansen et al. (1996). By contrast with 2SGMM, it is numerically immaterial for CU-GMM whether the efficient weighting matrix has its second order moments computed in mean deviation form or not. However, as shown by Newey and Smith (2004), the fact that θ^CU_n is based on a naive estimation of this weighting matrix (π^V_{i,n}(θ) = π_{i,n}) is in general responsible for some higher order bias that θ^EL_n (or θ^3SEEL_n defined below) does not display. However, Proposition 3 (applied with g(W_i, θ) given by any coefficient of the matrix ψ_i(θ⁰)ψ_i′(θ⁰)) shows that it is asymptotically harmless to use the naive empirical probabilities π_{i,n} for π^V_{i,n}(θ) (instead of π^(γ)_{i,n}(θ), γ = ±1), insofar as, at the true value θ⁰, the coefficients of ψ_i(θ⁰)ψ_i′(θ⁰) are not correlated with the moment vector ψ_i(θ⁰). As also noted by Newey and Smith (2004), this will be the case in particular when the latter are produced by cross-products with instruments h(z_i) (functions of the z_i's) for conditional moment restrictions:

E[u_i(θ) | z_i] = 0

with an error term u_i(θ⁰) that has a zero conditional skewness given z_i. As far as score testing with weak instruments is concerned, the score vector S^EEL_n has been promoted by Kleibergen (2005). Note however that the latter paper gets this score vector not from a GEL approach but from CU-GMM.

(iv) S^GS_n, a score vector considered by Guggenberger and Smith (2005), has a hybrid form similar to EEL, but using instead the EL implied probabilities π^(−1)_{i,n}(θ):

S^GS_n(θ) = l_n(θ, π^G(θ), π^V(θ));  π^G_{i,n}(θ) = π^(−1)_{i,n}(θ);  π^V_{i,n}(θ) = π_{i,n}.

Note that an estimator defined as the solution of the equation S^GS_n(θ) = 0 would have, for the same reasons, exactly the same bias properties as CU-GMM described above: an additional higher order bias term by contrast with θ^EL_n (or θ^3SEEL_n), and no additional bias in the “symmetric” case (i.e., when the coefficients of ψ_i(θ⁰)ψ_i′(θ⁰) are not correlated with the moment vector ψ_i(θ⁰)).

(v) S^3SEEL_n, the score vector of 3SEEL:

S^3SEEL_n(θ) = l_n(θ, π^G(θ), π^V(θ));  π^G_{i,n}(θ) = π^V_{i,n}(θ) = π^(1)_{i,n}(θ).

ABR-07 have introduced the 3SEEL estimator θ^3SEEL_n and shown that it is as efficient as θ^EL_n in terms of second order asymptotics, while being much more user-friendly on computational grounds. Even though they suggest to plug θ^2SGMM_n into the defining equations of θ^3SEEL_n (only ψ̄_n(θ) would depend on the unknown θ) to get even simpler computations (hence the terminology three-step GMM), we can consider a higher-order asymptotically equivalent estimator θ^3SEEL_n as the solution of the equations S^3SEEL_n(θ) = 0. It follows the logic of the EL estimator θ^EL_n defined above, while replacing the EL probabilities π^(−1)_{i,n}(θ) by the more user-friendly ones π^(1)_{i,n}(θ).

(vi) S^GP_n is the hybrid score vector considered by Guay and Pelgrin (2015):

S^GP_n(θ) = l_n(θ, π^G(θ), π^V(θ));  π^G_{i,n}(θ) = π_{i,n};  π^V_{i,n}(θ) = π^(1)_{i,n}(θ).

Note that the implied probabilities are chosen in a hybrid way that is symmetric to the one at play in EEL. Even though it was not studied by Guay and Pelgrin (2015), one could actually define an estimator θ^GP_n as the solution of the equations S^GP_n(θ) = 0. Proposition 3 shows that it is asymptotically harmless to use the naive empirical probabilities π_{i,n} for π^G_{i,n}(θ) (instead of π^(γ)_{i,n}(θ), γ = ±1), insofar as, at the true value θ⁰, the coefficients of the Jacobian matrix G_i(θ⁰) are not correlated with the moment conditions ψ_i(θ⁰). This will be the case in particular when, as in a non-linear least squares context, the moment restrictions are produced by cross-products with instruments h(z_i) for conditional moment restrictions:

E[u_i(θ) | z_i] = 0

with an error term u_i(θ⁰) which is the disturbance term of a regression equation given z_i:

u_i(θ) = y_i − g(z_i, θ).

In the even more particular case of minimum distance estimation (g(z_i, θ) = g(θ)), Chaudhuri and Renault (2015) have extensively documented the finite-sample improvement in score testing provided by the use of implied probabilities.

(vii) S^GPEL_n, a score vector also considered by Guay and Pelgrin (2015), has a hybrid form similar to S^GP_n, but using instead the EL implied probabilities π^(−1)_{i,n}(θ):

S^GPEL_n(θ) = l_n(θ, π^G(θ), π^V(θ));  π^G_{i,n}(θ) = π_{i,n};  π^V_{i,n}(θ) = π^(−1)_{i,n}(θ).

Note that an estimator defined as the solution of the equation S^GPEL_n(θ) = 0 would have, for the same reasons, exactly the same efficiency properties as the estimator θ^GP_n described above: lack of higher order efficiency in general, but higher order efficiency in the “regression” case (i.e., when the elements of the Jacobian matrix G_i(θ⁰) are not correlated with the moment vector ψ_i(θ⁰)).

Note that, for the purpose of testing the null hypothesis H0, the seven score vectors listed above, namely S^2SGMM_n(θ), S^EL_n(θ), S^EEL_n(θ), S^GS_n(θ), S^3SEEL_n(θ), S^GP_n(θ), S^GPEL_n(θ), will all be used after plugging in the hypothesized value θ = θ0. They all differ in the way they sometimes replace the naive probabilities π_{i,n} by either the EL implied probabilities π^(−1)_{i,n}(θ) or the EEL implied probabilities π^(1)_{i,n}(θ). This can be done either to improve the estimation of the Jacobian matrix of the moment vector (probabilities π^G_{i,n}(θ)), or to improve the estimation of the variance matrix of the moment vector (probabilities π^V_{i,n}(θ)), or both. In the simpler context of a non-random Jacobian matrix of the moment vector (minimum distance estimation), Chaudhuri and Renault (2015) extensively document the impact of the choice of probabilities π^V_{i,n}(θ), especially to improve the finite-sample performance of score tests in the presence of skewness. This is the reason why we will set the priority in this paper on the choice of probabilities π^G_{i,n}(θ), while still considering the additional impact of the choice of probabilities π^V_{i,n}(θ). But we will never adopt an approach similar to Guay and Pelgrin (2015) (score vectors S^GP_n and S^GPEL_n above), where the focus would be set on the choice of probabilities π^V_{i,n}(θ) while sticking to the naive ones for π^G_{i,n}(θ). Note that, besides higher order improvement issues, it is quite natural to set the focus in priority on the probabilities π^G_{i,n}(θ) in order to devise inference procedures that are well-behaved even in the presence of weak identification. Kleibergen (2002, 2005)'s and Moreira (2003)'s key contribution had been to note that, in order to get a score test robust to weak identification, the estimator of the Jacobian matrix of the moment vector had to be made asymptotically independent of the sample mean of the moment vector. Our Proposition 3 actually confirms that the implied probabilities delivered by EEL will exactly perform the job of orthogonalization put forward in the aforementioned papers. However, Proposition 3 also shows the newer result that the implied probabilities delivered by EL will do exactly the same job, albeit even more efficiently (through efficient estimation of the regression coefficients). Of course, this efficiency gain in the case of EL comes at the cost of the absence of a closed form formula and more computational burden. This is the reason why we study extensively in this paper the five possible strategies listed above as (i) to (v) and summarized as follows:

Score    π^G    π^V
2SGMM    naive  naive
EL       EL     EL
EEL      EEL    naive
GS       EL     naive
3SEEL    EEL    EEL

Table 1: Score vectors considered in the paper
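In terms of the sketches introduced earlier (the hypothetical helpers gel_multipliers_and_probs, eel_probs and score_vector defined in the previous snippets), the five strategies of Table 1 then amount to five choices of the pair (π^G, π^V):

```python
import numpy as np

def all_score_vectors(psi, G):
    """Assemble the five score vectors of Table 1 at a fixed theta (a sketch
    reusing the hypothetical helpers defined in the earlier snippets)."""
    n = psi.shape[0]
    naive = np.full(n, 1.0 / n)
    eel = eel_probs(psi)                                  # pi^(1)_{i,n}(theta)
    _, el = gel_multipliers_and_probs(psi, gamma=-1.0)    # pi^(-1)_{i,n}(theta)
    choices = {                                           # name: (pi^G, pi^V)
        "2SGMM": (naive, naive),
        "EL":    (el, el),
        "EEL":   (eel, naive),
        "GS":    (el, naive),
        "3SEEL": (eel, eel),
    }
    return {name: score_vector(psi, G, pG, pV)
            for name, (pG, pV) in choices.items()}
```

Each entry is a p-vector; plugged into the statistic of Section 5.1 below, the five entries deliver the corresponding score tests.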

4.3 Score vectors based on Kullback Leibler Information Criterion (KLIC)

Even though we have eventually omitted the KLIC case, corresponding to the limit case γ → 0, we know from Proposition 2 that it corresponds to the following functions:

ϱ^(0)(x) = 1 − exp(x),

τ^(0)(x) = dϱ^(0)(x)/dx = −exp(x),

k^(0)(x) = −[1 + τ^(0)(x)]/x = −[1 − exp(x)]/x,

so that:

π^(0)_{i,n}(θ) ∝ τ^(0)(λ^(0)_n(θ)′ψ_i(θ))    (4.9)

⟹ π^(0)_{i,n}(θ) = exp(λ^(0)_n(θ)′ψ_i(θ)) / Σ_{j=1}^n exp(λ^(0)_n(θ)′ψ_j(θ))

with Lagrange multipliers:

λ^(0)_n(θ) = −(Ω^(0)_n(θ))^{−1} ψ̄_n(θ),

Ω^(0)_n(θ) = −(1/n) Σ_{i=1}^n { [1 − exp(λ^(0)_n(θ)′ψ_i(θ))] / [λ^(0)_n(θ)′ψ_i(θ)] } ψ_i(θ)ψ_i(θ)′.

Recall that formula (4.9) leads to characterizing the KLIC implied probabilities as “Exponential Tilting” (ET). It is then natural to consider additional score vectors by allowing γ and/or γ* to be zero in the formula:

l_n(θ, γ, γ*) = Σ_{i=1}^n π^(γ)_{i,n}(θ) (∂ψ_i′(θ)/∂θ) · √n λ^(γ*)_n(θ).

Note that we do not use the notation l_n(θ, π^G(θ), π^V(θ)) anymore, because the weights that define Ω^(0)_n(θ) cannot be interpreted as implied probabilities. However, we are led to consider the three following score vectors:

(i) S^KLIC_n, the score vector corresponding to the genuine KLIC minimization (γ = γ* = 0):

S^KLIC_n(θ) = l_n(θ, 0, 0) = Σ_{i=1}^n π^(0)_{i,n}(θ) (∂ψ_i′(θ)/∂θ) [ (1/n) Σ_{i=1}^n { [1 − exp(λ^(0)_n(θ)′ψ_i(θ))] / [λ^(0)_n(θ)′ψ_i(θ)] } ψ_i(θ)ψ_i(θ)′ ]^{−1} √n ψ̄_n(θ).

This score vector has been studied by Caner (2010). One may, however, argue that it is a shame to weight the matrices ψ_i(θ)ψ_i′(θ) by coefficients that cannot be interpreted as implied probabilities. This suggests using the exponential tilting probabilities not only to estimate the Jacobian matrix, but also to build an estimator of the variance, similarly to the estimators of variance that we had considered for EL or 3SEEL. This is the main intuition behind the score vector S^ET_n defined below.

(ii) S^ET_n, the score vector where the ET probabilities are used at the two levels:

S^ET_n(θ) = Σ_{i=1}^n π^(0)_{i,n}(θ) (∂ψ_i′(θ)/∂θ) [Σ_{i=1}^n π^(0)_{i,n}(θ) ψ_i(θ)ψ_i(θ)′]^{−1} √n ψ̄_n(θ).

(iii) S^Kl−ET_n, the score vector where the ET probabilities are used only to estimate the Jacobian matrix, while the covariance matrix is estimated with the naive probabilities:

S^Kl−ET_n(θ) = Σ_{i=1}^n π^(0)_{i,n}(θ) (∂ψ_i′(θ)/∂θ) [(1/n) Σ_{i=1}^n ψ_i(θ)ψ_i(θ)′]^{−1} √n ψ̄_n(θ).

One may note the striking analogy between the score vector S^Kl−ET_n and the score vector S^EEL_n that we have considered in the previous subsection (with only π^(1)_{i,n}(θ) replaced by π^(0)_{i,n}(θ)). Since the latter corresponds to CU-GMM and had been put forward by Kleibergen (2005), Caner (2010) dubbed the score vector S^Kl−ET_n Kleibergen-ET.
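The ET probabilities π^(0)_{i,n}(θ) used by all three of these score vectors are easy to compute, since the γ → 0 dual is smooth and globally convex in λ after a sign change. A minimal sketch (our own illustration; the function name is an assumption):

```python
import numpy as np
from scipy.optimize import minimize

def et_probs(psi):
    """Exponential-tilting (KLIC) implied probabilities pi_i^(0)(theta) of (4.9):
    pi_i proportional to exp(lambda' psi_i), with lambda maximizing the
    gamma -> 0 dual criterion sum_i [1 - exp(lambda' psi_i)]."""
    n, H = psi.shape
    neg_dual = lambda lam: np.sum(np.exp(psi @ lam)) - n   # minus the dual of (2.2)
    lam = minimize(neg_dual, np.zeros(H), method="BFGS").x
    w = np.exp(psi @ lam)
    return w / w.sum()   # first order condition implies sum_i pi_i psi_i = 0
```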

5 Testing for H0 : θ = θ0

We revisit in this section the main result of Guggenberger and Smith (2005), namely the fact that all the GEL test statistics have the standard chi-square asymptotic null distribution independently of the strength or weakness of identification. We point out that this achievement of size control, even in case of weak identification, is basically an extension of Kleibergen (2005)'s original idea in the context of CU-GMM. What really matters in order to be robust to weak identification is to use a quadratic form in which the Jacobian matrix has been estimated with implied probabilities π^G_{i,n}(θ), irrespective of the choice of these probabilities. We show that all deliver the correct asymptotic null distribution, even in case of weak identification, thanks to the equivalence results of Propositions 7 and 8. Note that Proposition 8 will also allow us to build a new test based on the shrunk Euclidean probabilities π^G_{i,n}(θ) = π^(1),Sh_{i,n}(θ). By contrast, the use of implied probabilities π^V_{i,n}(θ) is not really needed for the estimation of the variance matrix, at least as far as the first order asymptotics are concerned.

This is the reason why score testing based on the four strategies (ii)-(v) of obtaining the score vector that we have proposed in Section 4.2 will deliver this size control, while, on the other hand, strategy (i) will not, which further precludes its asymptotic equivalence with the rest under weak identification. Our equivalence result is also relevant for inference under strong (resp. weak) identification; it implies that all the score test statistics are also equivalent under sequences of local (resp. global) alternatives.

Indeed, we show even more. By suitably adapting Proposition 10 in a uniform sense to the context of this section, we show that the four strategies (ii)-(v) are asymptotically equivalent uniformly in a suitable neighborhood around the true θ⁰. Note that this uniformity is not required to establish asymptotic equivalence of the tests against suitable “given” local alternatives. However, it has implications if one considers confidence intervals obtained by inverting the tests following the strategies (ii)-(v).

5.1 Score test statistics

As discussed in Section 4, the score statistic for testing $H_0: \theta = \theta_0$ is designed as a quadratic form of $\ell_n(\theta_0, \pi^G(\theta_0), \pi^V(\theta_0))$ in (4.7) with respect to the inverse of a (consistent under the null) estimator of its asymptotic variance. From formula (4.7), it is natural to estimate the asymptotic variance as:
\[
I_n\!\left(\theta_0, \pi^G(\theta_0), \pi^V(\theta_0)\right) = \sum_{i=1}^n \pi^G_{i,n}(\theta_0)\,G_i'(\theta_0)\left[\sum_{i=1}^n \pi^V_{i,n}(\theta_0)V_{i,n}(\theta_0)\right]^{-1}\widehat{Var}_n\!\left[\sqrt{n}\,\bar{\psi}_n(\theta_0)\right]\left[\sum_{i=1}^n \pi^V_{i,n}(\theta_0)V_{i,n}(\theta_0)\right]^{-1}\sum_{i=1}^n \pi^G_{i,n}(\theta_0)\,G_i(\theta_0)
\]


where $\widehat{Var}_n[\sqrt{n}\,\bar{\psi}_n(\theta_0)]$ stands for a (consistent under the null) estimator of the asymptotic variance of $\sqrt{n}\,\bar{\psi}_n(\theta_0)$. While one is free to use different sets of weights to reweigh the estimators of the Jacobian and/or the variance matrix appearing in $\ell_n(\theta_0, \pi^G(\theta_0), \pi^V(\theta_0))$ and $I_n(\theta_0, \pi^G(\theta_0), \pi^V(\theta_0))$ respectively, we choose, for the sake of notational simplicity and for the purpose of our simulations as well, to use the same set of weights, so that we take:
\[
I_n\!\left(\theta_0, \pi^G(\theta_0), \pi^V(\theta_0)\right) = \sum_{i=1}^n \pi^G_{i,n}(\theta_0)\,G_i'(\theta_0)\left[\sum_{i=1}^n \pi^V_{i,n}(\theta_0)V_{i,n}(\theta_0)\right]^{-1}\sum_{i=1}^n \pi^G_{i,n}(\theta_0)\,G_i(\theta_0)
\]
and the associated score test statistic:
\[
LM_n\!\left(\theta_0, \pi^G(\theta_0), \pi^V(\theta_0)\right) = \ell_n'\!\left(\theta_0, \pi^G(\theta_0), \pi^V(\theta_0)\right)\left[I_n\!\left(\theta_0, \pi^G(\theta_0), \pi^V(\theta_0)\right)\right]^{-1}\ell_n\!\left(\theta_0, \pi^G(\theta_0), \pi^V(\theta_0)\right).
\]
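In code, the statistic is a single quadratic form once the two weight vectors are chosen. The following Python sketch (a hypothetical helper of our own, assuming, as in the score vectors of Section 4.1.3, that $V_{i,n}(\theta) = \psi_i(\theta)\psi_i'(\theta)$) makes the roles of $\pi^G$ and $\pi^V$ explicit:

```python
import numpy as np

def lm_statistic(psi, dpsi, pi_G, pi_V):
    """Score statistic LM_n(theta0, piG, piV) as a quadratic form in the
    score vector l_n, with the same weights used in l_n and I_n.
    psi: n x H moments at theta0; dpsi: n x H x p observation Jacobians;
    pi_G, pi_V: length-n weight vectors (implied or naive probabilities)."""
    n = psi.shape[0]
    G = np.einsum("i,ihp->hp", pi_G, dpsi)         # weighted Jacobian, H x p
    V = np.einsum("i,ih,ik->hk", pi_V, psi, psi)   # weighted variance, H x H
    l = G.T @ np.linalg.solve(V, np.sqrt(n) * psi.mean(axis=0))  # l_n
    I = G.T @ np.linalg.solve(V, G)                # I_n
    return float(l @ np.linalg.solve(I, l))        # ~ chi-square(p) under H0
```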

5.2 Kleibergen-type statistics

We want to show in this section that all the score test statistics defined above are first-order asymptotically equivalent to the K-statistic of Kleibergen (2005), insofar as probabilities are chosen as follows:
\[
\pi^G_{i,n}(\theta_0) \in \left\{\pi^{(\gamma)}_{i,n}(\theta_0);\ \gamma\in\mathbb{R}^*\right\} \cup \left\{\pi^{(1),Sh}_{i,n}(\theta_0)\right\} \tag{5.1}
\]
\[
\pi^V_{i,n}(\theta_0) \in \left\{\pi^{(\gamma)}_{i,n}(\theta_0);\ \gamma\in\mathbb{R}^*\right\} \cup \left\{\pi^{(1),Sh}_{i,n}(\theta_0)\right\} \cup \left\{\pi_{i,n}\right\}.
\]
Note that the K-statistic of Kleibergen (2005) itself, even though it was introduced directly from a CU-GMM optimization, is actually nested in our framework with the choice:
\[
\pi^G_{i,n}(\theta_0) = \pi^{(1)}_{i,n}(\theta_0), \qquad \pi^V_{i,n}(\theta_0) = \pi_{i,n}.
\]

In this case, we will use the notation:
\[
LM_n\!\left(\theta_0, \pi^{(1)}_{\cdot,n}(\theta_0), \pi_{\cdot,n}\right) = K_n(\theta_0).
\]
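For instance, using the closed form of the Euclidean (EEL) implied probabilities in the Back and Brown (1993) form, the K-statistic can be assembled directly from the `lm_statistic` sketch above; this is our own illustration, with hypothetical names:

```python
import numpy as np

def eel_probabilities(psi):
    """Closed-form EEL implied probabilities,
    pi_i = (1/n) [1 - (psi_i - psibar)' S^{-1} psibar],
    with S the centered second-moment matrix of the psi_i."""
    n = psi.shape[0]
    pbar = psi.mean(axis=0)
    dev = psi - pbar
    S = dev.T @ dev / n
    return (1.0 - dev @ np.linalg.solve(S, pbar)) / n

# K_n(theta0): EEL probabilities for the Jacobian, naive 1/n for the variance.
# k_stat = lm_statistic(psi, dpsi, eel_probabilities(psi),
#                       np.full(len(psi), 1.0 / len(psi)))
```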

We will actually be able to show also the asymptotic equivalence with a simplified version of the K-statistic of Kleibergen (2005) in which probabilities are taken as implied probabilities $\pi^{(1)}_{i,n}(\theta_0)$ only for the components of the score vector corresponding to weak identification, while we can safely keep the naive probabilities $\pi_{i,n}$ for the other components. To see that, we will consider a setup of weak identification similar to the one proposed by Stock and Wright (2000):

Assumption ID: (Weak identification) The parameter space Θ is a Cartesian product:
\[
\Theta = \Theta_w \times \Theta_s, \quad \Theta_w \subset \mathbb{R}^{p_w},\ \Theta_s \subset \mathbb{R}^{p_s},
\]
and accordingly, we have a partition of the general and true unknown values of $\theta = (\theta_w', \theta_s')'$ as:
\[
\theta^0 = (\theta_w^{0\prime}, \theta_s^{0\prime})', \quad \theta_w^0 \in \mathrm{Int}(\Theta_w),\ \theta_s^0 \in \mathrm{Int}(\Theta_s),
\]
and a decomposition of the expectation of the moment vector as:
\[
E\!\left[\bar{\psi}_n(\theta)\right] = m(\theta_s) + \frac{1}{\sqrt{n}}\,m^w_n(\theta),
\]
where the sequence $m^w_n(\theta)$ converges uniformly towards a continuous function $m^w(\theta)$:
\[
\sup_{\theta\in\Theta}\left\|m^w_n(\theta) - m^w(\theta)\right\| = o(1)
\]
with
\[
m(\theta_s) = 0 \iff \theta_s = \theta_s^0, \qquad m^w(\theta^0) = 0.
\]

The above partition of the parameters leads to a corresponding partition of the Jacobian matrix:
\[
G_i(\theta) = \frac{\partial\psi_i(\theta)}{\partial\theta'} = \left[\frac{\partial\psi_i(\theta)}{\partial\theta_w'},\ \frac{\partial\psi_i(\theta)}{\partial\theta_s'}\right].
\]

We can then formalize as follows the general idea of taking implied probabilities $\pi^{(\gamma)}_{i,n}(\theta_0)$ only for the components of the score vector corresponding to weak identification, while safely keeping the naive probabilities $\pi_{i,n}$ for the other components. Of course, the case without weak identification is the particular case where the dimension $p_w$ is zero. Let us define another score vector as follows:
\[
\tilde{\ell}_n\!\left(\theta, \tilde{\pi}^G(\theta), \pi^V(\theta)\right) = \begin{bmatrix} \sum_{i=1}^n \pi^G_{i,n}(\theta)\,\dfrac{\partial\psi_i'(\theta)}{\partial\theta_w} \\[6pt] \sum_{i=1}^n \pi_{i,n}\,\dfrac{\partial\psi_i'(\theta)}{\partial\theta_s} \end{bmatrix}\left[\sum_{i=1}^n \pi^V_{i,n}(\theta)V_{i,n}(\theta)\right]^{-1}\sqrt{n}\,\bar{\psi}_n(\theta)
\]

and with obvious notations, a corresponding score test statistic:
\[
\widetilde{LM}_n\!\left(\theta_0, \tilde{\pi}^G(\theta_0), \pi^V(\theta_0)\right) = \tilde{\ell}_n'\!\left(\theta_0, \tilde{\pi}^G(\theta_0), \pi^V(\theta_0)\right)\left[\tilde{I}_n\!\left(\theta_0, \tilde{\pi}^G(\theta_0), \pi^V(\theta_0)\right)\right]^{-1}\tilde{\ell}_n\!\left(\theta_0, \tilde{\pi}^G(\theta_0), \pi^V(\theta_0)\right).
\]
In this expression of the score test statistic, we use $\tilde{\pi}^G(\theta_0)$ instead of the $\pi^G(\theta_0)$ introduced earlier in (5.1). This is done to emphasize that here the columns corresponding to $\theta_w$ (the weakly identified elements of θ) in the Jacobian matrix are weighted using the implied probabilities $\pi^{(\gamma)}_{i,n}(\theta_0)$, while the remaining columns are weighted by the naive probabilities $\pi_{i,n}$.

Then our simplified K-statistic will be:
\[
\tilde{K}_n(\theta_0) = \widetilde{LM}_n\!\left(\theta_0, \tilde{\pi}^G(\theta_0), \pi_{\cdot,n}\right) \quad\text{with}\quad \tilde{\pi}^G(\theta_0) = \pi^{(1)}_{\cdot,n}(\theta_0).
\]

We want to show the general asymptotic equivalence between any score test statistic conformable to (5.1) and both $K_n(\theta_0)$ and $\tilde{K}_n(\theta_0)$ under several possible circumstances:

(i) Either the hypothesized value $\theta_0$ is the true unknown value $\theta^0$.

(ii) Or some components of $\theta_0$ may be arbitrarily different from $\theta^0$, typically because we have to face weak identification issues for these particular components.

(iii) Or finally, when some components are, on the contrary, strongly identified, the (corresponding components of the) hypothesized value $\theta_0$ are picked along a sequence of local alternatives to the true unknown value $\theta^0$.

The bottom line is the definition of a set $\Theta_n$ of possible hypothesized values $\theta_0$ such that the moments of interest are local to zero in the sense of Assumption 1(i). For this purpose we define $\Theta_n$ as follows. Let $r > 0$ be such that the open ball $B(\theta_s^0, r)$ of center $\theta_s^0$ and radius $r$ is included in $\Theta_s$. Then we define:
\[
\Theta_n = \Theta_w \times B\!\left(\theta_s^0, \frac{r}{\sqrt{n}}\right). \tag{5.2}
\]

The key idea is that both local alternatives and weak identification are set up so that the moments stay "local", which will allow us to apply the equivalence results of Propositions 6, 7 and 8. In particular, a further smoothness assumption on the population moment (SPM1) $m(\theta_s)$, coupled with Assumption ID, reinforces the key condition, Assumption 1(i), in Lemma 1.

Assumption SPM1: $M_s(\theta_s) = \frac{\partial}{\partial\theta_s'}m(\theta_s)$ is continuous in $\theta_s \in B\!\left(\theta_s^0, \frac{r}{\sqrt{n}}\right)$ and $M_s(\theta_s^0)$ is finite.

Lemma 1: Under Assumptions ID and SPM1, and for $\Theta_n$ defined by (5.2), we have:
\[
\sup_{\theta\in\Theta_n}\left\|E\!\left(\bar{\psi}_n(\theta)\right)\right\| = O\!\left(\frac{1}{\sqrt{n}}\right).
\]
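The mechanics behind Lemma 1 can be spelled out in one line (our paraphrase of the argument, not the authors' proof): by Assumption ID and a mean-value expansion under SPM1,
\[
E\!\left[\bar{\psi}_n(\theta)\right] = m(\theta_s) - m(\theta_s^0) + \frac{1}{\sqrt{n}}\,m^w_n(\theta) = M_s(\bar{\theta}_s)\,(\theta_s - \theta_s^0) + O\!\left(\frac{1}{\sqrt{n}}\right),
\]
and $\|\theta_s - \theta_s^0\| \le r/\sqrt{n}$ on $\Theta_n$ while $m^w_n$ is uniformly bounded, so both terms are $O(1/\sqrt{n})$ uniformly on $\Theta_n$.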

To get the required equivalence result, we maintain, in addition to Assumptions ID and SPM1, the following set of assumptions CLT, SSM, SPM2, O and R, which are similar to Kleibergen (2005), although slightly strengthened in two directions: (i) since we are interested in equivalence results uniformly in $\Theta_n$, the assumptions under CLT and O are made accordingly; (ii) Assumption R augments the rank condition in Kleibergen (2005) in a way whose significance has recently been emphasized by Andrews and Guggenberger (2014). Equipped with these, we will then reinforce the result of Proposition 10 in a uniform sense. Accordingly, we proceed as follows.

Let $\Longrightarrow$ denote the weak convergence of random functions on $\Theta_n$ with respect to the sup norm.

Assumption CLT: (Functional CLT) Consider the stochastic process $\left\{\Psi_n(\theta);\ \theta\in\Theta_w\times\{\theta_s^0\}\right\}$ with:
\[
\Psi_n(\theta) = \sqrt{n}\begin{bmatrix} \bar{\psi}_n(\theta) - E\!\left[\bar{\psi}_n(\theta)\right] \\[4pt] \mathrm{vec}\!\left(\dfrac{\partial\bar{\psi}_n(\theta)}{\partial\theta_w'} - E\!\left[\dfrac{\partial\bar{\psi}_n(\theta)}{\partial\theta_w'}\right]\right) \end{bmatrix}.
\]
Then $\Psi_n(\theta) \Longrightarrow \Psi(\theta)$, where $\Psi(\theta)$ is a Gaussian process with zero mean and covariance function $C(\theta_1,\theta_2)$ for $\theta_1, \theta_2 \in \Theta_w\times\{\theta_s^0\}$. Let
\[
\mathcal{C}(\theta) \equiv C(\theta,\theta) = \begin{bmatrix} V(\theta) & C(\theta)' \\ C(\theta) & D(\theta) \end{bmatrix}.
\]

Note that Assumption CLT implicitly assumes that the function $\psi(W, \theta_w, \theta_s^0)$ is differentiable with respect to $\theta_w$ on some open set containing the compact set $\Theta_w$. We will actually assume even more.

Assumption SSM: (Smoothness of sample moment) Almost surely in W, the function $\psi(W, \theta_w, \theta_s^0)$ is differentiable with respect to θ on $\Theta_w^* \times B(\theta_s^0, r)$, where $\Theta_w^*$ is an open set containing $\Theta_w$ and $B(\theta_s^0, r)$ is as defined in (5.2).

Furthermore, to characterize the expected Jacobian $E\!\left[\frac{\partial\bar{\psi}_n(\theta)}{\partial\theta'}\right]$, which is key to our local asymptotic analysis, we maintain, in addition to Assumption SPM1, that:

Assumption SPM2: $M^w_s(\theta) = \frac{\partial}{\partial\theta_s'}m^w(\theta)$ and $M^w_w(\theta) = \frac{\partial}{\partial\theta_w'}m^w(\theta)$ exist and are continuous in $\theta\in\Theta_n$. Also, $\frac{\partial}{\partial\theta'}E[\psi_i(\theta)] = E\!\left[\frac{\partial}{\partial\theta'}\psi_i(\theta)\right]$ for $\theta\in\Theta_n$.

The other (O) assumptions below are made in accordance with Propositions 8 and 9. In particular, the assumptions O(i)-(ii), related to $\partial\psi_i(\theta)/\partial\theta_w'$, i.e., the Jacobian for the weakly identified components, resemble those from Proposition 9. For the other terms of interest, namely $\partial\psi_i(\theta)/\partial\theta_s'$ and the third moment of $\psi_i(\theta)$, the assumptions O(iii)-(v) are weaker when contrasted with Assumption CLT.


Assumption O:

(i) $\displaystyle\sup_{\theta\in\Theta_n}\left\|\frac{1}{n}\sum_{i=1}^n \frac{\partial\psi_i(\theta)}{\partial\theta_w'}\left(\psi_i(\theta)-\bar{\psi}_n(\theta)\right)' - \mathrm{Cov}\!\left(\frac{\partial\psi_i(\theta)}{\partial\theta_w'},\,\psi_i(\theta)\right)\right\| = o_p(1)$,

(ii) $\displaystyle\sup_{\theta\in\Theta_n}\left\|\frac{1}{\sqrt{n}}\sum_{i=1}^n\left(\frac{\partial\psi_i(\theta)}{\partial\theta_w'}\,\psi_i'(\theta) - E\!\left[\frac{\partial\psi_i(\theta)}{\partial\theta_w'}\,\psi_i(\theta)'\right]\right)\right\| = O_p(1)$,

(iii) $\displaystyle\sup_{\theta\in\Theta_n}\left\|\frac{\partial}{\partial\theta_s'}\bar{\psi}_n(\theta) - E\!\left[\frac{\partial}{\partial\theta_s'}\bar{\psi}_n(\theta)\right]\right\| = o_p(1)$,

(iv) $\displaystyle\sup_{\theta\in\Theta_n}\left\|\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta_s'}\psi_i(\theta)\left(\psi_i(\theta)-\bar{\psi}_n(\theta)\right)'\right\| = O_p(1)$,

(v) $\displaystyle\sup_{\theta\in\Theta_n}\left\|\frac{1}{n}\sum_{i=1}^n \psi_i(\theta)\psi_i'(\theta)\otimes\left(\psi_i(\theta)-\bar{\psi}_n(\theta)\right)\right\| = O_p(1)$.

Assumptions SPM2 and O(iii) imply that $\displaystyle\sup_{\theta\in\Theta_n}\left\|\frac{\partial}{\partial\theta_s'}\bar{\psi}_n(\theta) - \left[M_s(\theta_s) + \frac{1}{\sqrt{n}}M^w_s(\theta)\right]\right\| = o_p(1)$.

To simplify the expressions for the relevant quantities obtained from Assumptions CLT, SPM1-2 and O, use the notation $A_{c:d}(\theta)$ to denote the block of rows from c to d of any $a \times b$ matrix $A(\theta)$ with $1 \le c \le d \le a$, and define:

(i) $\Psi_{1:H}(\theta_w,\theta_s^0)$ is the $H\times 1$ random vector consisting of the first H rows of $\Psi(\theta_w,\theta_s^0)$.

(ii) $\xi(\theta) = (\xi_1(\theta),\dots,\xi_{p_w}(\theta))$ is an $H\times p_w$ random matrix with:
\[
\xi_j(\theta) = \Psi_{jH+1:(j+1)H}(\theta) - C_{(j-1)H+1:jH}(\theta)\,[V(\theta)]^{-1}\,\Psi_{1:H}(\theta) \quad\text{for } j = 1,\dots,p_w.
\]

(iii) $G_w(\theta) = [G_{w,1}(\theta),\dots,G_{w,p_w}(\theta)]$ is an $H\times p_w$ non-random matrix with:
\[
G_{w,j}(\theta) = M^w_{w,j}(\theta) - C_{(j-1)H+1:jH}(\theta)\,[V(\theta)]^{-1}\,m^w(\theta) \quad\text{for } j = 1,\dots,p_w,
\]
where $M^w_{w,j}(\theta)$ denotes the j-th column of $M^w_w(\theta)$ for $j = 1,\dots,p_w$.

Lemma 2: Under Assumptions 1, 2, ID, SSM, SPM1-2, CLT and O, we have for any choice of implied probabilities $\pi^G(\theta)$ and $\pi^V(\theta)$ conformable to (5.1):

(i) Asymptotic independence: For $\theta\in\Theta_n$,
\[
\sqrt{n}\left[\bar{\psi}_n(\theta) - E[\bar{\psi}_n(\theta)],\ \sum_{i=1}^n \pi^G_{i,n}(\theta)\,\frac{\partial\psi_i(\theta)}{\partial\theta_w'}\right] \Longrightarrow \left[\Psi_{1:H}(\theta_w,\theta_s^0),\ G_w(\theta_w,\theta_s^0) + \xi(\theta_w,\theta_s^0)\right],
\]
where $\left[\Psi_{1:H}(\theta)',\,[\mathrm{vec}(\xi(\theta))]'\right]'$ is a mean-zero Gaussian process with covariance function $\mathcal{D}(\theta_1,\theta_2)$ for any $\theta_1,\theta_2\in\Theta_w\times\{\theta_s^0\}$, with:
\[
\mathcal{D}(\theta) \equiv \mathcal{D}(\theta,\theta) = \begin{bmatrix} V(\theta) & 0 \\ 0 & D(\theta) - C(\theta)[V(\theta)]^{-1}C(\theta)' \end{bmatrix}.
\]

(ii) Asymptotic equivalence of averages using the naive and the implied probabilities:
\[
\sup_{\theta\in\Theta_n}\left\|\sum_{i=1}^n \pi_{i,n}\,\frac{\partial\psi_i(\theta)}{\partial\theta_s'} - M_s(\theta_s)\right\| = \sup_{\theta\in\Theta_n}\left\|\sum_{i=1}^n \pi^G_{i,n}(\theta)\,\frac{\partial\psi_i(\theta)}{\partial\theta_s'} - M_s(\theta_s)\right\| = o_p(1),
\]
\[
\sup_{\theta\in\Theta_n}\left\|\sum_{i=1}^n \pi^V_{i,n}(\theta)\,V_{i,n}(\theta) - V(\theta)\right\| = o_p(1).
\]


Combining the results of Lemma 2 (i) and (ii), it follows under the same assumptions that:
\[
\begin{bmatrix} \sqrt{n}\sum_{i=1}^n \pi^G_{i,n}(\theta)\,\dfrac{\partial\psi_i'(\theta)}{\partial\theta_w} \\[6pt] \sum_{i=1}^n \pi_{i,n}\,\dfrac{\partial\psi_i'(\theta)}{\partial\theta_s} \end{bmatrix} \Longrightarrow \left[G^*(\theta_w,\theta_s^0)\right]' \quad\text{and}\quad \begin{bmatrix} \sqrt{n}\sum_{i=1}^n \pi^G_{i,n}(\theta)\,\dfrac{\partial\psi_i'(\theta)}{\partial\theta_w} \\[6pt] \sum_{i=1}^n \pi^G_{i,n}(\theta)\,\dfrac{\partial\psi_i'(\theta)}{\partial\theta_s} \end{bmatrix} \Longrightarrow \left[G^*(\theta_w,\theta_s^0)\right]'
\]
for $(\theta_w',\theta_s')' \in \Theta_n$, and where:
\[
G^*(\theta) \equiv G^*(\theta_w,\theta_s) = \left[G_w(\theta) + \xi(\theta),\ M_s(\theta_s)\right].
\]
Furthermore, $G^*(\theta_w,\theta_s^0)$ is asymptotically independent of $\sqrt{n}\,\bar{\psi}_n(\theta_w,\theta_s)$ for $(\theta_w',\theta_s')' \in \Theta_n$.

Utilizing this result and the notation introduced above, the standard rank condition on the Jacobian is modified in Assumption R below following Andrews and Guggenberger (2014), who showed that this is particularly important when considering the asymptotic size of the score test.

Assumption R: (Rank condition) $G^*(\theta)$ is full column rank almost surely for $\theta\in\Theta_w\times\{\theta_s^0\}$.

Theorem 1: Under Assumptions 1, 2, ID, SSM, SPM1-2, CLT, O and R, we have for any choice of implied probabilities $\pi^G(\theta)$ and $\pi^V(\theta)$ conformable to (5.1):
\[
\sup_{\theta\in\Theta_n}\left|LM_n\!\left(\theta, \pi^G(\theta), \pi^V(\theta)\right) - K_n(\theta)\right| = o_P(1),
\]
\[
\sup_{\theta\in\Theta_n}\left|LM_n\!\left(\theta, \pi^G(\theta), \pi^V(\theta)\right) - \tilde{K}_n(\theta)\right| = o_P(1).
\]

Recall that Kleibergen (2005)'s K test rejects the null hypothesis $H_0: \theta = \theta_0$ at the level α if
\[
K_n(\theta_0) > \chi^2_p(1-\alpha),
\]
where $\chi^2_p(1-\alpha)$ is the $(1-\alpha)$-th quantile of the chi-square distribution with p degrees of freedom. In general, the asymptotic properties of this test for $\theta_0 = \theta^0$, i.e., when $H_0$ is true, and for $\theta_0 = (\theta_w', (\theta_s^0 + \frac{d}{\sqrt{n}})')' \in \Theta_n$, i.e., under "suitable local" alternatives, are of interest.
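Operationally, the decision rule is one line; a minimal sketch using scipy's chi-square quantile, with the `lm_statistic` helper from the earlier sketch producing the statistic:

```python
from scipy.stats import chi2

def k_test_reject(k_stat, p, alpha=0.05):
    """Kleibergen-type decision: reject H0 : theta = theta0 when the
    statistic exceeds the (1 - alpha) quantile of chi-square(p)."""
    return k_stat > chi2.ppf(1.0 - alpha, df=p)
```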

Among other things, Theorem 1 implies that the K-test's first-order asymptotic size and power against local alternatives are all inherited by the score tests based on the four strategies (ii)-(v) put forward in our paper. The question, then, is whether these four strategies, which involve efficient estimation (as described in Sections 2 and 3) of the Jacobian and possibly the variance matrix, are worthwhile. In a simulation exercise we now demonstrate that, in spite of the first-order asymptotic equivalence in Theorem 1, these four strategies provide demonstrable improvements not only in the finite-sample size but also, and even more so, in the finite-sample power of the score tests.

5.3 Monte Carlo experiments

5.3.1 Design

We use as Data Generating Process (DGP) for the observable variables $W_i = (y_i, X_i, Z_i')'$, $i = 1,\dots,n$, the following triangular system:
\begin{align}
y_i &= X_i\theta^0 + u_i \tag{5.3}\\
X_i &= Z_i'\Pi + v_i \notag
\end{align}
whose inputs are i.i.d. draws of $(u_i, v_i, Z_i)$, $i = 1,\dots,n$. In this system, there is a single right-hand-side (possibly) endogenous variable $X_i \in \mathbb{R}$, no included exogenous variables, and H instrumental variables $Z_i \in \mathbb{R}^H$. We consider the cases H = 2 and H = 4.


By definition, the structural errors $(u_i, v_i)$ are independent of the instruments $Z_i$. The endogeneity of $X_i$ is governed by the correlation ϱ between $u_i$ and $v_i$. We consider three levels of endogeneity: ϱ = 0.9 (highly endogenous), ϱ = 0.5 (moderately endogenous) and ϱ = 0 (not endogenous).

The moment vector is given by:
\[
\psi_i(\theta) \equiv \psi(W_i, \theta) = Z_i\,(y_i - X_i\theta).
\]
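For this linear IV model the moment contributions and their derivatives are immediate; the following minimal sketch (our own, consistent with the array layout of the earlier helpers) constructs both:

```python
import numpy as np

def moments_and_jacobian(y, X, Z, theta):
    """psi_i(theta) = Z_i (y_i - X_i theta) and
    dpsi_i/dtheta = -Z_i X_i for the single-parameter model (5.3)."""
    psi = Z * (y - X * theta)[:, None]          # n x H
    dpsi = -(Z * X[:, None])[:, :, None]        # n x H x 1 (p = 1)
    return psi, dpsi
```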

As explained in subsection 4.1.2, it is important to allow implied probabilities to improve the estimation of the variance matrix. This implies considering moment conditions that may display some kind of multivariate asymmetry, because for some $h, k, l = 1,\dots,H$:
\[
E\!\left[\psi_{i,h}(\theta^0)\,\psi_{i,k}(\theta^0)\,\psi_{i,l}(\theta^0)\right] \neq 0.
\]
By the law of iterated expectations:
\[
E\!\left[\psi_{i,h}(\theta^0)\,\psi_{i,k}(\theta^0)\,\psi_{i,l}(\theta^0)\right] = E\!\left[Z_{i,h}\,Z_{i,k}\,Z_{i,l}\right]E\!\left[u_i^3\right].
\]

Therefore, we take the following route to allow for a possible asymmetry in the distribution of the moment vector. First, we draw $Z_i$ from a distribution with non-zero mean; we actually take $Z_{i,h}$, $h = 1,\dots,H$, mutually independent normal with unit mean and unit variance. Then, the desired multivariate asymmetry of the moment condition $\psi_i$ is completely governed by the skewness of the error term $u_i$.

Note that a common way to accommodate an asymmetric moment condition is to endow u with the probability distribution of a demeaned even power of a normal (while the other error term v would be a standard normal). However, we do not want to do that, since it would preclude endogeneity. This is the reason why we consider the two following cases:

1st case (Symmetric): $u_i$ and $v_i$ are standard normal, with correlation ϱ.

2nd case (Asymmetric): Let $e_{1,i}, e_{2,i}$, $i = 1,\dots,n$, stand for $2n$ independent draws from the exponential(1) distribution. For a given real number c, define the zero-mean variables:
\[
u_i = \frac{1}{\sqrt{2}}\left[e_{1,i} + e_{2,i} - 2\right], \qquad v_i = \frac{1}{\sqrt{1+c^2}}\left[e_{1,i} + c\,e_{2,i} - (1+c)\right].
\]
Then the correlation between $u_i$ and $v_i$ is:
\[
\frac{1+c}{\sqrt{2}\,\sqrt{1+c^2}}.
\]
We get the requested level ϱ of correlation by choosing c as a solution of:
\[
2\varrho^2(1+c^2) = (1+c)^2.
\]
We choose:
\[
c = -\,\frac{1 - 2\varrho\sqrt{1-\varrho^2}}{1 - 2\varrho^2}.
\]
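For completeness, one can verify this choice directly (our own check, not part of the original text): the correlation condition is a quadratic equation in c,
\[
2\varrho^2(1+c^2) = (1+c)^2 \;\Longleftrightarrow\; (1-2\varrho^2)\,c^2 + 2c + (1-2\varrho^2) = 0,
\]
whose roots are
\[
c = \frac{-1 \pm \sqrt{1-(1-2\varrho^2)^2}}{1-2\varrho^2} = \frac{-1 \pm 2\varrho\sqrt{1-\varrho^2}}{1-2\varrho^2},
\]
and the value displayed above is the root with the plus sign. For instance, ϱ = 0 gives c = -1, so that $v_i = (e_{1,i}-e_{2,i})/\sqrt{2}$ is indeed uncorrelated with $u_i$.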

It is worth noting that the level of skewness in $u_i$ and $v_i$ is not huge. For instance, the skewness coefficient of $u_i$ is equal to $\sqrt{2}$. This reasonable amount of skewness makes the evidence of improvement that we display in the case of asymmetric moment conditions even more compelling (see especially the power comparisons below).


We assume that the H instruments have the same degree of weakness by taking:
\[
\Pi = \frac{a}{\sqrt{n}}\,\mathbf{1}_H,
\]
where $\mathbf{1}_H$ is the H-dimensional vector with all components equal to 1. We consider two sample sizes: n = 100 (small sample) and n = 1000 (reasonable for a micro-econometric application).

For a given sample size, a governs the strength of identification through the concentration parameter:
\[
\mu = \frac{1}{H}\,\Pi'\left(\sum_{i=1}^n Z_iZ_i'\right)\Pi.
\]
We want to describe three different patterns of identification: strong, weak, and complete lack of identification. While each pattern will be studied through the draw of 10000 samples of size n each, we slightly modify the parameter a from one sample to the other in order to keep the same value of the concentration parameter for each draw in a given pattern: µ = 10 for strong identification, µ = 1 for weak identification, and µ = 0 for lack of identification.
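The full design can be condensed into a short simulation routine. The following Python sketch is our own rendering of the design; the normalization of the true value to $\theta^0 = 0$ and the in-sample calibration of a to hit the target µ are our reading of the text:

```python
import numpy as np

def simulate(n, H, rho, mu, asymmetric, rng):
    """One Monte Carlo draw from the triangular DGP (5.3)."""
    Z = rng.normal(loc=1.0, scale=1.0, size=(n, H))   # unit-mean instruments
    if asymmetric:
        # skewed errors from exponential(1) draws;
        # c solves 2 rho^2 (1 + c^2) = (1 + c)^2
        c = -(1 - 2 * rho * np.sqrt(1 - rho**2)) / (1 - 2 * rho**2)
        e1 = rng.exponential(size=n)
        e2 = rng.exponential(size=n)
        u = (e1 + e2 - 2) / np.sqrt(2)
        v = (e1 + c * e2 - (1 + c)) / np.sqrt(1 + c**2)
    else:
        u = rng.normal(size=n)
        v = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n)
    # Pi = (a / sqrt(n)) 1_H, with a calibrated so that
    # mu = (1/H) Pi' (sum_i Z_i Z_i') Pi holds for this draw
    ones = np.ones(H)
    q = ones @ (Z.T @ Z) @ ones / (H * n)
    a = np.sqrt(mu / q)
    Pi = (a / np.sqrt(n)) * ones
    X = Z @ Pi + v
    theta0 = 0.0                       # true value, normalized (assumption)
    y = X * theta0 + u
    return y, X, Z

rng = np.random.default_rng(0)
y, X, Z = simulate(n=100, H=2, rho=0.5, mu=1.0, asymmetric=True, rng=rng)
```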

To summarize, we are considering 72 = 2 × 3 × 2 × 3 × 2 specifications of the DGP, each with 10000 replications:

• 2 values of the number H of instruments (H = 2 or 4),

• 3 levels of endogeneity (ϱ = 0 or ϱ = 0.5 or ϱ = 0.9),

• 2 patterns of symmetry: one symmetric distribution, one asymmetric distribution,

• 3 levels of instrument strength (µ = 10 or µ = 1 or µ = 0),

• 2 possible sample sizes (n = 100 and n = 1000).

5.3.2 Size comparisons

Empirical sizes are first calculated using 5% asymptotic critical values for the nine score tests ofinterest, namely:

• The five tests with score statistics corresponding to the score vectors described in Section 4.1.2 (see Table 1): 2SGMM, GS, EL, EEL and 3S (which stands for 3SEEL).

• The modification of 3SEEL which uses shrunk probabilities as defined at the end of subsection2.2. The test statistic is denoted by 3S-Sh.

• The three score tests with statistics based on KLIC (see subsection 4.1.3), namely KLIC, ET and Kl-ET.

We do not impose conditional (on Z) homoskedasticity of the errors when computing the asymptotic variance matrix of the average moment vector. In this sense, the statistics belong to the genre of what Guggenberger and Smith (2005, pages 689-690) refer to as the KHET statistic.

With a sample size n = 1000 and a symmetric moment vector, the empirical sizes of all the score tests except 2SGMM are controlled below 6%. This result is all the more compelling since, at the same time, size distortions of 2SGMM may go as far as 35.8% (case of complete lack of identification, strong endogeneity and 4 instruments) and 17.2% when, in the same circumstances, identification exists but only at a weak level. Generally speaking, when some endogeneity is at stake, the empirical size of 2SGMM is almost never below 6%; the only exceptions arise when identification is strong. Note moreover that none of the tests is overly conservative (empirical sizes are never below 4.4%). With the same sample size n = 1000, the introduction of an asymmetric moment vector leads to slightly more oversized tests, but nothing dramatic except for the 2SGMM case (never beyond an empirical size of 6.4% for the rest).


Not surprisingly, the 5% asymptotic critical values are less reliable when the sample size is only n = 100. In this case, besides 2SGMM, the worst performance is achieved by the 3SEEL methods, with or without shrinkage. Empirical sizes are typically between 8% and 17.1%, which is not compellingly better than 2SGMM. By contrast, the EEL approach, as promoted by Kleibergen (2005), does an excellent job: the empirical size is always below 6% in the symmetric case and below 8.5% in the asymmetric case. Interestingly enough, EEL dominates both kinds of empirical likelihood approaches (EL and GS) as well as all three kinds of ET approaches (KLIC, ET and Kl-ET). The small-sample over-rejection seems especially significant for KLIC and ET (empirical size as large as 10% in the symmetric case and 11.1% in the asymmetric case). By contrast, GS, EL and Kl-ET all do a decent job (empirical size below 8% in the symmetric case and 9% in the asymmetric case). Note, however, that except for the two 3SEEL methods, the performance of all implied-probabilities-based methods is still significantly better than that of 2SGMM. The somewhat disappointing performance of the 3SEEL methods in small samples is not surprising: when the sample is too small, the probability of negative values of the implied probabilities is far from negligible, and shrinking them just amounts to putting too many of them equal to the naive empirical probabilities. Over the 10000 replications of samples of size n = 100 (repeated for each specification), we actually encountered a significant proportion of samples for which some EEL implied probabilities were negative: always between 5% and 6% of them in the symmetric case, and between 17% and 21% of them in the asymmetric case. By contrast, over the 10000 replications of samples of size n = 1000, negative EEL implied probabilities never show up in the symmetric case, and they occur in less than 1% of the samples in the asymmetric case. Note also that it is only for small samples that increasing the number of instruments (from 2 to 4) leads to more over-rejection, in particular for the less reliable methods in small samples, namely 3S, 3S-Sh, KLIC, ET, GS and EL.

The bottom line is that only the 2SGMM tests have an over-rejection rate significantly increased by the weakness of instruments. The satisfactory performance in this respect of all eight alternative statistics is clearly explained by the fact that they all use implied probabilities for the estimation of the Jacobian part. By contrast, it appears relatively useless (or even detrimental in small samples) to use implied probabilities for the estimation of the variance matrices. The excellent performance of the EEL/K statistic (and also Kl-ET) is especially convincing in this respect. The main message is that, as far as size control in the presence of weak identification is concerned, it is the use of implied probabilities to estimate the Jacobian part (and, in particular, to ensure that this estimator is asymptotically independent of the moment conditions) that matters, more than the use of the genuine GEL-based inference methods.

5.3.3 Power comparisons

Empirical size-adjusted power curves are calculated for the preceding eight 5% score tests that are alternatives to 2SGMM. Recall that these eight tests were shown to have correct asymptotic size under scenarios that cover the cases considered in our Monte Carlo experiment. On the other hand, since 2SGMM suffers from severe size distortions, we have excluded it from these comparisons of size-adjusted power curves.

With a sample size n = 1000 and a symmetric moment vector, the eight size-adjusted power curves are pretty much the same (see Figures 4, 5, 6 and 10, 11, 12). By contrast, when considering asymmetric moment vectors, the 3SEEL approach, and even more so the 3SEEL with shrinkage (3S-Sh), display a better power performance (see Figures 1, 2, 3 and 7, 8, 9), especially when identification is weak (µ = 1) or absent (µ = 0). In other words, the use of the EEL implied probabilities not only to estimate the Jacobian matrix (as in the EEL/K case) but also to improve the estimation of the variance matrix is important as far as power is concerned. For the same reason, EL (resp. ET or KLIC) dominates GS (resp. Kl-ET). This remark is all the more interesting since, for such large sample sizes, there is no cost in terms of size control in re-using the same implied probabilities for the estimation of both the Jacobian and the variance matrices, while this is not what the GEL approach would lead one to do, except in the EL case. If anything, the power performance would rather lead us to use the EEL probabilities everywhere; these also turn out to be the most user-friendly, since they are the only ones for which we have closed-form formulas (both for the implied probabilities and for the estimated expectations, as residuals of regressions on the moment conditions). To put it differently, the naive least squares approach dominates the more computationally involved alternative approaches, i.e., the EL and KLIC ones. While the only cost of using Euclidean implied probabilities might have been the finding of negative probabilities, the shrinkage trick fixes this issue without deteriorating the power performance, quite the contrary indeed.

By contrast, with a small sample size (n = 100), we have shown that the 3S approaches are somewhat dominated by the standard GEL approaches in terms of size control. Interestingly enough, as far as size-adjusted power is concerned, the 3S approaches keep the edge over all six other tests under study. They still strictly dominate when the moment vectors have an asymmetric probability distribution and identification is weak (µ = 1) or absent (µ = 0).

To be complete, it must be acknowledged that the genuine EL approach may display some edge over the Euclidean methods in one circumstance, namely when we observe (see Figure 2, both for n = 100 and n = 1000), for a high degree of endogeneity (ϱ = 0.9), some spurious decline in power far from the null hypothesis. This effect was already documented for EEL by Kleibergen (2005)², and may still show up, not only for EEL but also, to a lesser extent, for the 3SEEL methods and even for the genuine EL. However, EL may sometimes do the best job in this case.

² We noted before that Kleibergen (2005) arrived at the same EEL score test by instead working with the derivative (with respect to θ) of the CU-GMM objective function evaluated at the hypothesized value. The weighted norm squared, with weight the inverse of the asymptotic variance of the derivative, is the statistic put forward by Kleibergen. However, since this derivative can be zero not only at the minima of the CU-GMM objective function but also at the maxima and the points of inflection, the latter two being completely irrelevant and perhaps far from the true θ, the test statistic is zero at these values. Hence Kleibergen's test may display a spurious decline in power in regions where the (possibly but not necessarily degenerate) distribution of these maxima and/or points of inflection is concentrated. Naturally, the decline in power for EEL can be attributed to the same cause.

5.3.4 Main Message of the Monte Carlo study

While it had already been well documented (see in particular Guggenberger and Smith (2005) and Caner (2010)) that GEL approaches are able to do a much better job than 2SGMM in case of weak identification (much better size control and sufficiently good power performance), we argue from our results that this good performance is due more to the use of implied probabilities than to the use of the genuine GEL inference. We actually show that there is no compelling reason to prefer the numerically involved implied probabilities provided by empirical likelihood or by exponential tilting, while the naive Euclidean probabilities do an excellent job. As already put forward by Kleibergen (2005), they do (at least for sufficient sample sizes) the best possible job when used to estimate the Jacobian matrix in order to obtain a size of the test robust to weak identification. With our so-called three-step EEL methods (following the terminology of ABR-07), we show that the Euclidean methods are also well suited for improving the estimation of the variance matrix (in case of asymmetrically distributed moment conditions) and, in turn, the power of the test. When it comes to estimation of the variance, the shrinkage procedure proposed by ABR-07 to ensure non-negativity of implied probabilities is even more relevant.

6 Conclusion

While information theoretic approaches have become a popular alternative to GMM, their main use in econometrics is rather as a black box providing satisfactory solutions to the poor finite-sample performance of GMM, in particular in case of weak instruments as well as a large number of instruments. While the mechanism of this black box is built from implied probabilities, the information content of these probabilities has been relatively little documented. In the context of score testing, this paper has promoted the use of implied probabilities in two ways:

First, it extends Kleibergen’s (2005) seminal contribution, showing that the use of an estimatorof the Jacobian that is uncorrelated with moment conditions allows a dramatic improvement of thesize control in case of weak identification.

Second, it draws the lesson from Newey and Smith (2004) as well as ABR-07 to note that implied probabilities allow efficient estimation of any moment matrix, and hence a more accurate estimation of the selection matrix of the estimating equations, for a more powerful score test. In particular, as explained by these authors, it is in cases of skewness of the moment conditions that an efficient use of implied probabilities for estimation of the variance matrix is especially relevant.

The value added by the present paper is threefold:

First, we show that the implied probabilities provided by EL produce an orthogonalization of the estimator of the Jacobian matrix with respect to the moment conditions, in the same way as the first-order conditions of continuously updated GMM do for Kleibergen (2005) (or, equivalently, the estimator of probabilities provided by EEL for ABR-07). It is important to recognize that this orthogonalization does not need to appeal to asymptotics, even in the case of EL, as long as we rely on the EL implied probabilities instead of the empirical probabilities.

Second, we notice that, more generally, the implied probabilities obtained by minimization of any Cressie-Read discrepancy are asymptotically equivalent, in the sense that they all produce asymptotically equivalent (up to the order $o_p(n^{-1/2})$) estimators of any moment functions. In particular, this follows from our result that the implied probabilities from different Cressie-Read discrepancies may only differ by the order of $o_p(n^{-3/2})$.

Third, we document that the use of implied probabilities is a valuable hedge against the size

distortion effects of weak identification when it is performed for the estimation of the Jacobian matrix, while efficient estimation of the variance matrix is not necessary for this purpose. By contrast, the latter is important for the power of the score tests. The role of implied probabilities as a hedge against weak identification is even more explicitly pinned down by showing that, in a setting with a vector of parameters à la Stock and Wright (2000), with only a subvector that is weakly identified, it is for the weakly identified components that the efficient estimation of the expected partial derivatives matters.

Moreover, an important technical insight of this paper is an asymptotic theory, generalizing the GEL theory of Guggenberger and Smith (2005), with results that are uniform not only under the null hypothesis but also under sequences of alternatives relevant for the power of tests. These alternatives are local in the case of strong identification but global in the case of weak identification.

The innovation with respect to Guggenberger and Smith (2005) is that we establish a uniform asymptotic equivalence of implied probabilities, so that we do not need to deal with the genuine GEL score vectors but can instead freely use different implied probabilities for estimating the Jacobian matrix and the variance matrix respectively. We document in particular that the use of implied probabilities based on EEL, with a possible shrinkage à la ABR-07 for non-negativity, may be very efficient for score testing, especially for the power of tests based on skewed moments.

A byproduct of this degree of freedom regarding the choice of well-suited implied probabilities is a computational advantage, since, for instance, the implied probabilities based on EEL amount to linear regressions. However, we only consider in this paper score tests for null hypotheses that specify the value of the full vector of parameters. Testing the value of a subvector (or, more generally, of a function of the parameters) would be more demanding, since it may require some numerically involved constrained estimation. A companion paper, Chaudhuri and Renault (2017), proposes to use the tool of C(α) tests to circumvent the computational issues. Moreover, testing functions of the parameters paves the way for important cases where, as studied by Antoine and Renault (2012), different identification strengths may be at stake in different directions of the parameter space.


References

Andrews, D. W. K. and Guggenberger, P. (2014). Asymptotic Size of Kleibergen’s LM and Condi-tional LR Tests for Moment Condition Models. Working Paper.

Back, K. and Brown, D. (1993). Implied Probabilities in GMM estimators. Econometrica, 61: 971–976.

Caner, M. (2010). Exponential tilting with weak instruments: Estimation and testing. OxfordBulletin of Economics and Statistics, 72: 307–326.

Chaudhuri, S. (2017). Improved Projection GMM-LM Tests for Linear Restrictions. Technical report,McGill University.

Chaudhuri, S. and Renault, E. (2015). Shrinkage of Variance for Minimum Distance Based Tests.Econometric Reviews, 34: 328–351.

Chaudhuri, S. and Renault, E. (2017). Tests for non-linear restrictions with heterogeneity in identi-fication strengths. Mimeo.

Gagliardini, P., Gourieroux, C., and Renault, E. (2011). Efficient Derivative Pricing by the ExtendedMethod of Moments. Econometrica, 79: 1181–1232.

Guay, A. and Pelgrin, F. (2015). Using Implied Probabilities to Improve the Estimation of Uncon-ditional Moment Restrictions for Weakly Dependent Data. Econometric Reviews, 35: 344–372.

Guggenberger, P. and Smith, R. (2005). Generalized Empirical Likelihood Estimators and Testsunder Partial, Weak and Strong Identification. Econometric Theory, 21: 667–709.

Hall, A. R. (2000). Covariance Matrix Estimation and the Power of the Overidentifying RestrictionsTest. Econometrica, 68: 1517–1527.

Hansen, L. P. (1982). Large Sample Properties of Generalized Method of Moments Estimators.Econometrica, 50: 1029–1054.

Hansen, L. P., Heaton, J., and Yaron, A. (1996). Finite-Sample Properties of Some AlternativeGMM Estimators. Journal of Business and Economic Statistics, 14: 262–280.

Hausman, J., Lewis, R., Menzel, K., and Newey, W. (2011). Properties of the CUE estimator and amodification with moments. Journal of Econometrics, 165: 45–57.

Imbens, G. W., Spady, R. H., and Johnson, P. (1998). Information Theoretic Approaches to Inferencein Moment Condition Models. Econometrica, 66: 333–357.

Kitamura, Y. and Stutzer, M. (1997). An Information-Theoretic Alternative to Generalized Methodof Moments Estimation. Econometrica, 65: 861–874.

Kleibergen, F. (2002). Pivotal Statistics for Testing Structural Parameters in Instrumental VariablesRegression. Econometrica, 70: 1781–1803.

Kleibergen, F. (2005). Testing Parameters In GMM Without Assuming That They Are Identified.Econometrica, 73: 1103–1123.

Moreira, M. J. (2003). A Conditional Likelihood Ratio Test for Structural Models. Econometrica,71: 1027–1048.

Newey, W. K. and Smith, R. J. (2004). Higher Order Properties of GMM and Generalized EmpiricalLikelihood Estimators. Econometrica, 72: 219–255.


Newey, W. K. and West, K. D. (1987). Hypothesis Testing with Efficient Method of MomentsEstimation. International Economic Review, 28: 777–787.

Owen, A. B. (1990). Empirical Likelihood Ratio Confidence Regions. Annals of Statistics, 18: 90–120.

Ramalho, J. and Smith, R. (2006). Goodness of Fit Tests for Moment Condition Models. WorkingPaper.

Schennach, S. M. (2007). Point estimation with exponentially tilted empirical likelihood. The Annalsof Statistics, 35: 634–672.

Smith, R. J. (1997). Alternative semi-parametric likelihood approaches to generalized method ofmoments estimation. Economic Journal, 107: 503–519.

Stein, C. (1956). Inadmissibility of the Usual Estimator for the Mean of a Multivariate NormalDistribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics andProbability, volume 1, pages 197–206. University of California Press.

Stock, J. H. and Wright, J. H. (2000). GMM with Weak Identification. Econometrica, 68: 1055–1096.

Tables and Figures

The following tables and figures are referred to in the Monte Carlo experiment in Section 5.3. In particular, the tables serve as references for the study of empirical size in Section 5.3.2, while the figures serve for that of empirical power in Section 5.3.3.


n = 100:
              Asymmetric moment vector               Symmetric moment vector
H  ϱ    µ    2SGMM  GS    EL    EEL   3S    3S-Sh   2SGMM  GS    EL    EEL   3S    3S-Sh
2  0.9  0    15.7   7.3   7.1   6.5   10    10.2    14.2   6.2   6.5   5.9   9.5   9.7
2  0.9  1    8      6.6   6.4   6.3   9.5   9.6     9.1    5.3   5.4   4.9   8.4   8.2
2  0.9  10   5.3    7     6.2   6.5   9.9   9.5     5.2    5.2   5.3   4.9   8.3   8
2  0.5  0    9.8    7.2   7.2   6.5   10.6  10.4    8.6    6.1   6.6   5.8   9.4   9.2
2  0.5  1    6.1    6.8   7.1   6.2   10.1  10      6.6    5.7   6     5.3   8.8   8.6
2  0.5  10   6.7    7.2   6.5   6.8   9.9   9.6     5.1    5.1   5.4   4.8   8.2   8
2  0    0    6.1    6.6   6.4   6.1   9.8   9.6     5.5    5.9   6.2   5.4   9.3   9.1
2  0    1    7.1    7.1   7     6.6   10.2  9.9     5.9    5.8   6.1   5.5   9.3   9.1
2  0    10   6.9    6.5   5.9   6     8.9   8.7     5.3    5.3   5.6   5     8.4   8.1
4  0.9  0    39.6   8.3   8.7   6.7   14.9  17.1    38.2   6.5   7.5   5.5   13.5  15.5
4  0.9  1    14     8.5   8.1   7.7   14.2  13.6    18.8   5.9   6.6   4.9   12.4  12
4  0.9  10   6.1    8.9   8     7.7   14.8  12.6    7.3    5.6   6.6   4.5   11.8  10.5
4  0.5  0    16.8   8.4   8.7   6.8   15.2  13.8    15.3   6.7   8     5.4   14    12.9
4  0.5  1    6.7    8.1   8.1   6.9   14.3  12.4    9.7    6     6.8   4.8   12.2  11.3
4  0.5  10   8.2    9.5   8.9   8.2   15.6  13.3    6.4    5.8   6.7   4.9   12    10.8
4  0    0    7.1    8.9   8.8   7     15.8  13.8    6.2    6.8   7.9   5.9   13.8  12.4
4  0    1    8.6    8.3   8.6   6.8   15.1  13      6.3    6.2   7.1   5.2   12.9  11.5
4  0    10   9.4    8.6   7.6   7.4   14    11.7    5.8    5.6   6.7   4.6   12.2  10.6

n = 1000:
              Asymmetric moment vector               Symmetric moment vector
H  ϱ    µ    2SGMM  GS    EL    EEL   3S    3S-Sh   2SGMM  GS    EL    EEL   3S    3S-Sh
2  0.9  0    13.2   5.2   5.2   5.2   5.6   5.7     13     4.9   4.9   4.8   5.2   5.2
2  0.9  1    8.4    5.2   5.1   5.1   5.6   5.6     8      4.5   4.4   4.4   4.8   4.8
2  0.9  10   4.3    5.1   4.8   5.1   5.1   5.1     4.9    5.1   5     5     5.3   5.3
2  0.5  0    7.5    4.9   4.7   4.7   5.2   5.2     7.8    4.7   4.6   4.7   5.1   5.1
2  0.5  1    5.7    5.3   5.2   5.3   5.8   5.8     6.3    5.2   5.2   5.1   5.5   5.5
2  0.5  10   5.2    5.6   5.1   5.5   5.7   5.6     5.2    5.1   5     5     5.3   5.3
2  0    0    5.4    5.6   5.3   5.4   5.9   5.9     5      5     4.9   4.9   5.3   5.3
2  0    1    5.2    5.2   5.2   5.2   5.7   5.7     5.3    5.2   5.1   5.1   5.6   5.6
2  0    10   5.2    5.2   5     5.1   5.4   5.4     5.3    5.3   5.3   5.2   5.6   5.6
4  0.9  0    35.5   5.5   5.2   5.2   5.9   6       35.8   5     4.9   4.8   5.7   5.7
4  0.9  1    15.6   5.4   5     5.3   5.7   5.8     17.2   5.3   5.2   5.1   5.6   5.6
4  0.9  10   4.9    5.4   4.7   5.3   5.6   5.6     6.2    5     4.9   4.9   5.4   5.4
4  0.5  0    14.3   5.3   5.1   5     5.8   5.8     13.9   5     4.8   4.8   5.4   5.4
4  0.5  1    7.3    5.4   5     5.2   5.8   5.8     8.9    5.1   4.9   5     5.5   5.5
4  0.5  10   5.1    5.5   5.3   5.4   6     6       5      4.5   4.5   4.4   5     5
4  0    0    5.5    5.8   5.5   5.5   6.4   6.3     4.7    4.9   4.8   4.7   5.5   5.5
4  0    1    5      5.1   5     4.8   5.7   5.7     5.2    5.2   5.1   5     5.7   5.7
4  0    10   5.3    5.1   4.7   5     5.4   5.3     5.1    5.1   5     4.9   5.5   5.5

Table 2: Empirical size of 5% score tests based on 10000 Monte Carlo trials. H: number of instruments (moments), ϱ: level of endogeneity, and µ: strength of instruments (moments).


n = 100:
             Asymmetric moment vector   Symmetric moment vector
H  ϱ    µ    Kl-ET   KLIC   ET          Kl-ET   KLIC   ET
2  0.9  0    6.9     7.2    8.2         6.1     6.7    7.8
2  0.9  1    6.5     6.7    7.6         5.1     5.6    6.6
2  0.9  10   6.7     6.8    7.7         5.1     5.5    6.5
2  0.5  0    7       7.5    8.5         5.9     6.6    7.8
2  0.5  1    6.4     7      8.1         5.4     6      7.2
2  0.5  10   7       7      8           5       5.6    6.4
2  0    0    6.3     6.8    7.9         5.6     6.4    7.6
2  0    1    6.7     7.1    8.4         5.6     6.3    7.4
2  0    10   6.2     6.2    7.2         5.1     5.8    6.9
4  0.9  0    7.3     8.2    10.3        5.9     7.3    9.9
4  0.9  1    8.1     8      10.2        5.2     6.3    9
4  0.9  10   8.2     8.2    10          5       6      8.4
4  0.5  0    7.4     8.1    10.9        5.8     7.4    10
4  0.5  1    7.2     7.8    9.9         5.2     6.3    8.7
4  0.5  10   8.8     8.7    10.9        5.3     6.3    8.7
4  0    0    7.6     8.6    11.1        6.4     7.6    9.6
4  0    1    7.5     8.2    10.6        5.6     6.7    8.9
4  0    10   8.1     8      9.3         5.1     6.3    8.5

n = 1000:
             Asymmetric moment vector   Symmetric moment vector
H  ϱ    µ    Kl-ET   KLIC   ET          Kl-ET   KLIC   ET
2  0.9  0    5.2     5.3    5.5         4.9     4.9    5
2  0.9  1    5.1     5.3    5.4         4.4     4.5    4.6
2  0.9  10   5.1     5      4.9         5.1     5.1    5.2
2  0.5  0    4.8     4.8    4.9         4.7     4.8    4.8
2  0.5  1    5.3     5.4    5.5         5.2     5.3    5.4
2  0.5  10   5.5     5.4    5.4         5.1     5.1    5.1
2  0    0    5.5     5.5    5.5         4.9     5      5.1
2  0    1    5.2     5.2    5.5         5.1     5.2    5.3
2  0    10   5.1     5.1    5.2         5.3     5.4    5.4
4  0.9  0    5.3     5.4    5.5         4.9     5.1    5.3
4  0.9  1    5.4     5.2    5.3         5.2     5.3    5.4
4  0.9  10   5.3     5.1    5.2         5       5.1    5.2
4  0.5  0    5.1     5.2    5.3         4.9     5      5.1
4  0.5  1    5.3     5.2    5.3         5.1     5.1    5.2
4  0.5  10   5.4     5.4    5.6         4.5     4.6    4.7
4  0    0    5.6     5.6    5.8         4.8     5      5.2
4  0    1    5       5      5.3         5.1     5.3    5.4
4  0    10   5       4.9    5.1         5       5.1    5.3

Table 3: Empirical size of 5% score tests based on 10000 Monte Carlo trials. H: number of instruments (moments), ϱ: level of endogeneity, and µ: strength of instruments (moments). The Jacobian is re-weighted by the ET implied probabilities for all three tests. Kl-ET: empirical variance matrix. KLIC: variance matrix re-weighted by $k_i/\sum_j k_j$. ET: variance matrix re-weighted by the ET implied probabilities. All re-weighting imposes the null hypothesis.


[Figure 1: Size-adjusted power of 5% score tests based on 10000 Monte Carlo trials. Horizontal axis: true minus hypothesized θ. Asymmetric moment vector, µ = 0 (strength of instruments). Panels: n ∈ {100, 1000}, H ∈ {2, 4}, ϱ ∈ {0, 0.5, 0.9}. Curves: GS, EL, EEL, 3S, 3S-Sh.]

[Figure 2: As Figure 1, with µ = 1.]

[Figure 3: As Figure 1, with µ = 10.]

[Figure 4: As Figure 1 for the symmetric moment vector, with µ = 0.]

[Figure 5: As Figure 4, with µ = 1.]

[Figure 6: As Figure 4, with µ = 10.]

[Figure 7: Size-adjusted power of 5% score tests based on 10000 Monte Carlo trials. Horizontal axis: true minus hypothesized θ. Asymmetric moment vector, µ = 0. Panels: n ∈ {100, 1000}, H ∈ {2, 4}, ϱ ∈ {0, 0.5, 0.9}. Curves: EL, 3S-Sh, Kl-ET, KLIC, ET.]

[Figure 8: As Figure 7, with µ = 1.]

[Figure 9: As Figure 7, with µ = 10.]

[Figure 10: As Figure 7 for the symmetric moment vector, with µ = 0.]

[Figure 11: As Figure 10, with µ = 1.]

[Figure 12: As Figure 10, with µ = 10.]

A Appendix: Proofs

Proof of Proposition 1: The proof comes in four steps.

Step 1: We define a sequence of compact subsets $B_n$, $n = 1, 2, \dots$ of $\mathbb{R}^H$ by:
\[
B_n = \left\{\lambda\in\mathbb{R}^H;\ \|\lambda\|\le r_n\right\}
\]
with:
\[
(r_n)^{-2} = \sqrt{n}\,\sup_{\theta\in\Theta_n}\max_{1\le i\le n}\|\psi_i(\theta)\| = o_P(n),
\]
where the second equality comes from Assumption 1(ii) and we assume throughout:
\[
\sup_{\theta\in\Theta_n}\max_{1\le i\le n}\|\psi_i(\theta)\| \neq 0.
\]
This latter assumption can be maintained without loss of generality since all the results become straightforward otherwise.

Then, we show that for all $\gamma\in\mathbb{R}$:
\[
\lim_{n\to\infty}\Pr\left[B_n \subset \bigcap_{\theta\in\Theta_n}\Lambda^{(\gamma)}_n(\theta)\right] = 1.
\]

Proof for Step 1: For all $\lambda\in B_n$:
\[
\sup_{\theta\in\Theta_n}\max_{1\le i\le n}\left|\lambda'\psi_i(\theta)\right| \le \|\lambda\|\,\sup_{\theta\in\Theta_n}\max_{1\le i\le n}\|\psi_i(\theta)\| \le \frac{(r_n)^{-1}}{\sqrt{n}} = o_P(1) \tag{A.1}
\]
by definition of the sequence $r_n$. Therefore:
\[
\forall\,\varepsilon > 0,\quad \lim_{n\to\infty}\Pr\left[\sup_{\lambda\in B_n}\sup_{\theta\in\Theta_n}\max_{1\le i\le n}\left|\lambda'\psi_i(\theta)\right| < \varepsilon\right] = 1.
\]
For a given $\gamma\in\mathbb{R}$, let us consider $\varepsilon = (2|\gamma|)^{-1}$ (ε an arbitrary positive number when γ = 0). We deduce from the above that:
\[
\lim_{n\to\infty}\Pr\left[\inf_{\lambda\in B_n}\inf_{\theta\in\Theta_n}\min_{1\le i\le n}\left[1 + \gamma\lambda'\psi_i(\theta)\right] > \frac{1}{2}\right] = 1,
\]
which implies that:
\[
\lim_{n\to\infty}\Pr\left[B_n \subset \bigcap_{\theta\in\Theta_n}\Lambda^{(\gamma)}_n(\theta)\right] = 1.
\]

Step 2: We recall that the sets $B_n$ and $\Lambda^{(\gamma)}_n(\theta)$ are random since they depend on the state of nature through $\psi_i(\theta)$, $i = 1,\dots,n$. We denote by $A_n$ the set of states of nature such that:
\[
B_n \subset \bigcap_{\theta\in\Theta_n}\Lambda^{(\gamma)}_n(\theta).
\]
For all $\omega\in A_n$, $\gamma\in\mathbb{R}$, $\theta\in\Theta_n$, the function $h^{(\gamma)}_{n,\theta}(\lambda) = \sum_{i=1}^n \varrho^{(\gamma)}\left[\lambda'\psi_i(\theta)\right]$ is well defined for all $\lambda\in B_n$ and can be maximized on this compact set to define:
\[
\lambda^{*(\gamma)}_n(\theta) = \arg\max_{\lambda\in B_n}\sum_{i=1}^n \varrho^{(\gamma)}\left[\lambda'\psi_i(\theta)\right].
\]
Note that the random variable $\lambda^{*(\gamma)}_n(\theta)$, $\theta\in\Theta_n$, is defined only on the set $A_n$ of states of nature but, by definition, this set asymptotically has probability one. Therefore, it makes sense to study the stochastic convergence of the sequence $\lambda^{*(\gamma)}_n(\theta)$, $n = 1, 2, \dots$. Note also that this sequence is uniquely defined since the function $\varrho^{(\gamma)}$ is strictly concave. We first show:
\[
\sup_{\theta\in\Theta_n}\left\|\lambda^{*(\gamma)}_n(\theta)\right\| = O_P\!\left(1/\sqrt{n}\right).
\]

Proof for Step 2: By definition of $\lambda_n^{*(\gamma)}(\theta)$, we have:
\[
\sum_{i=1}^{n}\varrho^{(\gamma)}\left[\lambda_n^{*(\gamma)\prime}(\theta)\psi_i(\theta)\right] \;\geq\; \sum_{i=1}^{n}\varrho^{(\gamma)}\left[0'\psi_i(\theta)\right] = 0 = \varrho^{(\gamma)}[0]. \tag{A.2}
\]
We can write the Taylor expansion:
\[
\varrho^{(\gamma)}\left[\lambda_n^{*(\gamma)\prime}(\theta)\psi_i(\theta)\right] = \varrho^{(\gamma)}[0] + \frac{\partial\varrho^{(\gamma)}[0]}{\partial x}\left(\lambda_n^{*(\gamma)\prime}(\theta)\psi_i(\theta)\right) + \frac{1}{2}\,\frac{\partial^2\varrho^{(\gamma)}[u_{in}(\theta)]}{\partial x^2}\left(\lambda_n^{*(\gamma)\prime}(\theta)\psi_i(\theta)\right)^2 \tag{A.3}
\]
with:
\[
0 \leq \sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}|u_{in}(\theta)| \leq \sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}\left|\lambda_n^{*(\gamma)\prime}(\theta)\psi_i(\theta)\right| \leq \frac{(r_n)^{-1}}{\sqrt{n}} = o_P(1)
\]
where the last inequality is deduced from (A.1). Therefore, we have by continuity:
\[
\frac{\partial^2\varrho^{(\gamma)}[0]}{\partial x^2} = -1 \;\Longrightarrow\; \lim_{n\to\infty}\Pr\left[\sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}\frac{\partial^2\varrho^{(\gamma)}[u_{in}(\theta)]}{\partial x^2} \leq -\frac{1}{2}\right] = 1. \tag{A.4}
\]
Considering together (A.2), (A.3), (A.4) and the fact that:
\[
\varrho^{(\gamma)}[0] = 0, \qquad \frac{\partial\varrho^{(\gamma)}[0]}{\partial x} = -1,
\]
we deduce:
\[
\lim_{n\to\infty}\Pr\left[\sup_{\theta\in\Theta_n}\left\{\frac{1}{4n}\sum_{i=1}^{n}\left(\lambda_n^{*(\gamma)\prime}(\theta)\psi_i(\theta)\right)^2 - \lambda_n^{*(\gamma)\prime}(\theta)\bar\psi_n(\theta)\right\} \leq 0\right] = 1
\]
that is:
\[
\lim_{n\to\infty}\Pr\left[\sup_{\theta\in\Theta_n}\left\{\lambda_n^{*(\gamma)\prime}(\theta)\bar\Omega_n(\theta)\lambda_n^{*(\gamma)}(\theta) - 4\lambda_n^{*(\gamma)\prime}(\theta)\bar\psi_n(\theta)\right\} \leq 0\right] = 1.
\]
In particular, if we introduce:
\[
\inf_{\theta\in\Theta_n}\gamma_{\min}(\theta) = \eta > 0,
\]
we have:
\[
\lim_{n\to\infty}\Pr\left[\sup_{\theta\in\Theta_n}\left\{\eta\left\|\lambda_n^{*(\gamma)}(\theta)\right\|^2 - 4\left\|\lambda_n^{*(\gamma)}(\theta)\right\|\left\|\bar\psi_n(\theta)\right\|\right\} \leq 0\right] = 1.
\]
Thus:
\[
\lim_{n\to\infty}\Pr\left[\sup_{\theta\in\Theta_n}\left\{\left\|\lambda_n^{*(\gamma)}(\theta)\right\| - \frac{4}{\eta}\left\|\bar\psi_n(\theta)\right\|\right\} \leq 0\right] = 1.
\]
Note that, by virtue of Assumption 1, we have:
\[
\sup_{\theta\in\Theta_n}\left\|\bar\psi_n(\theta)\right\| = O_P\!\left(1/\sqrt{n}\right).
\]
Hence the announced result for Step 2.

Step 3:
\[
\lim_{n\to\infty}\Pr\left[\bigcap_{\theta\in\Theta_n}\left\{\left\|\lambda_n^{*(\gamma)}(\theta)\right\| < r_n\right\}\right] = 1.
\]
Proof for Step 3: Let $\varepsilon > 0$. We want to find $n_\varepsilon\in\mathbb{N}$ such that:
\[
\forall n \geq n_\varepsilon,\qquad \Pr\left[\sup_{\theta\in\Theta_n}\left\|\lambda_n^{*(\gamma)}(\theta)\right\| < r_n\right] \geq 1 - \varepsilon.
\]
We know that we can find $n_\varepsilon\in\mathbb{N}$ and a finite number $M_\varepsilon$ such that for all $n \geq n_\varepsilon$:
\[
\Pr\left[\sup_{\theta\in\Theta_n}\sqrt{n}\left\|\lambda_n^{*(\gamma)}(\theta)\right\| \leq M_\varepsilon\right] \geq 1 - \frac{\varepsilon}{2}
\]
and:
\[
\Pr\left[M_\varepsilon < r_n\sqrt{n}\right] \geq 1 - \frac{\varepsilon}{2}.
\]
The first inequality is implied by Step 2 while the second inequality is implied by the fact (see Step 1) that:
\[
(r_n)^{-2} = o_P(n).
\]
Then for $n \geq n_\varepsilon$:
\begin{align*}
\Pr\left[\sup_{\theta\in\Theta_n}\left\|\lambda_n^{*(\gamma)}(\theta)\right\| < r_n\right] &\geq \Pr\left[\left\{\sup_{\theta\in\Theta_n}\sqrt{n}\left\|\lambda_n^{*(\gamma)}(\theta)\right\| \leq M_\varepsilon\right\} \cap \left\{M_\varepsilon < r_n\sqrt{n}\right\}\right]\\
&\geq \Pr\left[\sup_{\theta\in\Theta_n}\sqrt{n}\left\|\lambda_n^{*(\gamma)}(\theta)\right\| \leq M_\varepsilon\right] + \Pr\left[M_\varepsilon < r_n\sqrt{n}\right] - 1 \;\geq\; 1 - \varepsilon.
\end{align*}

Step 4: We conclude from Steps 2 and 3 that, for $n$ sufficiently large and for all $\omega\in A_n$, $\gamma\in\mathbb{R}$, $\theta\in\Theta_n$, the vector $\lambda_n^{*(\gamma)}(\theta)$ that maximizes the function $h_{n,\theta}^{(\gamma)}(\lambda)$ on $B_n$ is actually an interior point of $B_n$. Therefore, it must fulfill the first-order condition:
\[
\frac{\partial h_{n,\theta}^{(\gamma)}\left(\lambda_n^{*(\gamma)}(\theta)\right)}{\partial\lambda} = 0.
\]
Since the function $h_{n,\theta}^{(\gamma)}(\lambda)$ is actually well defined and strictly concave on $\Lambda_n^{(\gamma)}(\theta)$, the first-order condition is sufficient to ensure that $\lambda_n^{*(\gamma)}(\theta) = \lambda_n^{(\gamma)}(\theta)$ maximizes this function on $\Lambda_n^{(\gamma)}(\theta)$. This completes the proof of Proposition 1.
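For concreteness, the inner maximization in Proposition 1 can be carried out by Newton's method, since $h_{n,\theta}^{(\gamma)}$ is strictly concave. The following minimal sketch (not from the paper) assumes the exponential-tilting-type member $\varrho(x) = 1 - e^x$, which satisfies the normalizations $\varrho(0) = 0$ and $\varrho_1(0) = \varrho_2(0) = -1$ used in the proof, together with a purely illustrative i.i.d. mean-zero Gaussian moment vector.

```python
import numpy as np

rng = np.random.default_rng(1)

def gel_lambda(psi, steps=25):
    """Maximize h(lam) = sum_i rho(lam'psi_i) with rho(x) = 1 - exp(x) by Newton.
    The objective is strictly concave, so the iteration is well behaved near lam = 0."""
    lam = np.zeros(psi.shape[1])
    for _ in range(steps):
        w = np.exp(psi @ lam)                   # w_i = exp(lam'psi_i) = -rho_1(lam'psi_i)
        grad = -(psi * w[:, None]).sum(axis=0)  # gradient of h at lam
        hess = -(psi * w[:, None]).T @ psi      # Hessian of h (negative definite)
        lam -= np.linalg.solve(hess, grad)      # Newton ascent step
    return lam

for n in (100, 1000, 10000):
    psi = rng.standard_normal((n, 3))           # H = 3 mean-zero moments at the null
    lam = gel_lambda(psi)
    print(f"n={n:6d}  sqrt(n)*||lambda|| = {np.sqrt(n) * np.linalg.norm(lam):.3f}")
```

In line with Step 2, the printed values of $\sqrt{n}\|\hat\lambda_n\|$ remain bounded as $n$ grows.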

Proof of Proposition 2: The first-order conditions associated with (2.2) are:
\[
\sum_{i=1}^{n}\tau^{(\gamma)}\left[\lambda_n^{(\gamma)\prime}(\theta)\psi_i(\theta)\right]\psi_i(\theta) = 0.
\]
We can rewrite them as:
\[
\sum_{i=1}^{n}\left\{\tau^{(\gamma)}\left[\lambda_n^{(\gamma)\prime}(\theta)\psi_i(\theta)\right] + 1\right\}\psi_i(\theta) - n\bar\psi_n(\theta) = 0
\]
or:
\[
\sum_{i=1}^{n}k^{(\gamma)}\left[\lambda_n^{(\gamma)\prime}(\theta)\psi_i(\theta)\right]\psi_i(\theta)\left(\lambda_n^{(\gamma)\prime}(\theta)\psi_i(\theta)\right) - n\bar\psi_n(\theta) = 0
\]
or, equivalently:
\[
n\,\Omega_n^{(\gamma)}(\theta)\lambda_n^{(\gamma)}(\theta) - n\bar\psi_n(\theta) = 0,
\]
which gives the announced formula (2.6) for $\lambda_n^{(\gamma)}(\theta)$.

Proof of Proposition 3: From Proposition 2, the Lagrange multipliers associated with the constraints (2.11) correspond to the second set of coefficients of the vector:
\[
\left(\Omega_n^{(\gamma)}(\theta)\right)^{-1}\bar\Psi_n(\theta)
\]
where $\bar\Psi_n(\theta) = [\bar\psi_n'(\theta),\, \mu_n^{(\gamma)\prime}(\theta) - \bar g_n'(\theta)]'$ stands for the sample counterpart of the augmented set of moment conditions.

Therefore, with the generic notation $\Sigma(\theta) = \Omega_n^{(\gamma)}(\theta)$, we have:
\[
\Sigma_n^{21}(\theta)\bar\psi_n(\theta) - \Sigma_n^{22}(\theta)\left[\mu_n^{(\gamma)}(\theta) - \bar g_n(\theta)\right] = 0
\]
where $\Sigma^{21}(\theta)$ and $\Sigma^{22}(\theta)$ denote respectively the South-West and South-East blocks of the matrix $\Sigma^{-1}(\theta)$ according to the partition $[\psi(W_i,\theta)', g(W_i,\theta)']'$. Then:
\[
\mu_n^{(\gamma)}(\theta) = \bar g_n(\theta) + \left[\Sigma_n^{22}(\theta)\right]^{-1}\Sigma_n^{21}(\theta)\bar\psi_n(\theta) \tag{A.5}
\]
that is:
\[
\mu_n^{(\gamma)}(\theta) = \bar g_n(\theta) - \Sigma_{21,n}(\theta)\left[\Sigma_{11,n}(\theta)\right]^{-1}\bar\psi_n(\theta) \tag{A.6}
\]
where, with obvious notations, (A.6) is deduced from (A.5) thanks to the identity:
\[
\Sigma\Sigma^{-1} = \mathrm{Id}_H \;\Rightarrow\; \Sigma^{21}\Sigma_{11} + \Sigma^{22}\Sigma_{21} = 0 \;\Leftrightarrow\; \left[\Sigma^{22}\right]^{-1}\Sigma^{21} = -\Sigma_{21}\left[\Sigma_{11}\right]^{-1}.
\]
In the EL case ($\gamma = -1$), we know that:
\[
\Omega_n^{(-1)}(\theta) = \sum_{i=1}^{n}\pi_{i,n}^{(-1)}(\theta)\Psi_i(\theta)\Psi_i'(\theta) \qquad\text{with } \Psi_i(\theta) = [\psi_i'(\theta), g'(W_i,\theta)]'.
\]
Thus, for $\gamma = -1$, (A.6) exactly coincides with the value for $\mu_n^{(-1)}(\theta)$ stated by Proposition 3.

In the EEL case ($\gamma = 1$), we know that:
\[
\Omega_n^{(1)}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\Psi_i(\theta)\Psi_i'(\theta).
\]
Thus, for $\gamma = 1$, (A.6) does not exactly coincide with the value for $\mu_n^{(1)}(\theta)$ stated by Proposition 3, since Proposition 3 considers sample counterparts of variances and covariances in mean-deviation form.

However, this modification is immaterial for the following reason. Proposition 2 tells us that:
\[
\lambda_n^{(1)}(\theta) = \left(\Omega_n^{(1)}(\theta)\right)^{-1}\bar\Psi_n(\theta).
\]
Let us define:
\[
W_n^{(1)}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\Psi_i(\theta)\left[\Psi_i(\theta) - \bar\Psi_n(\theta)\right]'.
\]
Then, we get:
\[
\bar\Psi_n(\theta) = \Omega_n^{(1)}(\theta)\lambda_n^{(1)}(\theta) = \left[W_n^{(1)}(\theta) + \bar\Psi_n(\theta)\bar\Psi_n(\theta)'\right]\lambda_n^{(1)}(\theta).
\]
Hence:
\[
\left[W_n^{(1)}(\theta)\right]^{-1}\bar\Psi_n(\theta) - \left[W_n^{(1)}(\theta)\right]^{-1}\bar\Psi_n(\theta)\bar\Psi_n(\theta)'\lambda_n^{(1)}(\theta) = \lambda_n^{(1)}(\theta)
\]
which can be rewritten:
\[
\frac{\lambda_n^{(1)}(\theta)}{1 - \bar\Psi_n(\theta)'\lambda_n^{(1)}(\theta)} = \left[W_n^{(1)}(\theta)\right]^{-1}\bar\Psi_n(\theta).
\]
Therefore, the second set of coefficients of the vector $\lambda_n^{(1)}(\theta)$ is nil if and only if the second set of coefficients of $[W_n^{(1)}(\theta)]^{-1}\bar\Psi_n(\theta)$ is itself nil. In other words, when $\gamma = 1$, the formula (A.6) is also valid with $\Sigma(\theta) = W_n^{(1)}(\theta)$, which gives the result in mean-deviation form as stated in Proposition 3.
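The partitioned-inverse identity that carries (A.5) into (A.6) is easy to verify numerically. The sketch below (illustrative only; the matrix is arbitrary) checks $[\Sigma^{22}]^{-1}\Sigma^{21} = -\Sigma_{21}[\Sigma_{11}]^{-1}$ for a generic symmetric positive definite $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)     # an arbitrary symmetric positive definite matrix
k = 3                                  # dimension of the psi-block in the partition
Sinv = np.linalg.inv(Sigma)

S11 = Sigma[:k, :k]                    # Sigma_11 (North-West block of Sigma)
S21 = Sigma[k:, :k]                    # Sigma_21 (South-West block of Sigma)
I21 = Sinv[k:, :k]                     # Sigma^21 (South-West block of Sigma^{-1})
I22 = Sinv[k:, k:]                     # Sigma^22 (South-East block of Sigma^{-1})

lhs = np.linalg.solve(I22, I21)        # [Sigma^22]^{-1} Sigma^21
rhs = -S21 @ np.linalg.inv(S11)        # -Sigma_21 [Sigma_11]^{-1}
print(np.max(np.abs(lhs - rhs)))       # agrees up to machine precision
```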

Proof of Proposition 4: For expositional simplicity, we always consider a state of nature in the set $A_n$ defined in Step 2 of the proof of Proposition 1. This set is asymptotically of probability one and allows us to use that:
\[
B_n \subset \bigcap_{\theta\in\Theta_n}\Lambda_n^{(\gamma)}(\theta).
\]
In particular, for all $\gamma\in\mathbb{R}^*$ and $\theta\in\Theta_n$, $\lambda_n^{(\gamma)}(\theta)$ fulfills the first-order conditions for maximization of the function $h_{n,\theta}^{(\gamma)}(\lambda) = \sum_{i=1}^{n}\varrho^{(\gamma)}\left[\lambda'\psi_i(\theta)\right]$, so that:
\[
0 = \sum_{i=1}^{n}\varrho_1^{(\gamma)}\left[\lambda_n^{(\gamma)}(\theta)'\psi_i(\theta)\right]\psi_i(\theta)
\]
where $\varrho_1^{(\gamma)}[\cdot]$ stands for the first derivative of the function $\varrho^{(\gamma)}[\cdot]$. Since $\varrho_1^{(\gamma)}[0] = -1$, a Taylor expansion around zero gives a set of $H$ equations:
\begin{align*}
0 &= -\bar\psi_n(\theta) + \frac{1}{n}\sum_{i=1}^{n}\varrho_2^{(\gamma)}\left[v_{i,n}^{(\gamma)}(\theta)\right]\psi_i(\theta)\psi_i(\theta)'\lambda_n^{(\gamma)}(\theta)\\
&= -\bar\psi_n(\theta) - \frac{1}{n}\sum_{i=1}^{n}\psi_i(\theta)\psi_i(\theta)'\lambda_n^{(\gamma)}(\theta) + R_n^{(\gamma)}(\theta)\\
R_n^{(\gamma)}(\theta) &= \frac{1}{n}\sum_{i=1}^{n}\left[\varrho_2^{(\gamma)}\left[v_{i,n}^{(\gamma)}(\theta)\right] - \varrho_2^{(\gamma)}[0]\right]\psi_i(\theta)\psi_i(\theta)'\lambda_n^{(\gamma)}(\theta)
\end{align*}
where $v_{i,n}^{(\gamma)}(\theta)$ is for each equation (with a standard abuse of notation overlooking the fact that it may take different values for different components) a number between $0$ and $\lambda_n^{(\gamma)}(\theta)'\psi_i(\theta)$. Here $\varrho_2^{(\gamma)}[\cdot]$ stands for the second derivative of the function $\varrho^{(\gamma)}[\cdot]$ and we have used the fact that $\varrho_2^{(\gamma)}[0] = -1$.

From the second equation, we deduce that:
\begin{align*}
\lambda_n^{(\gamma)}(\theta) &= -\left[\bar\Omega_n(\theta)\right]^{-1}\bar\psi_n(\theta) + \left[\bar\Omega_n(\theta)\right]^{-1}R_n^{(\gamma)}(\theta)\\
&= \lambda_n^{(1)}(\theta) + \left[\bar\Omega_n(\theta)\right]^{-1}R_n^{(\gamma)}(\theta).
\end{align*}
Since:
\[
\sup_{\theta\in\Theta_n}\left\|\left[\bar\Omega_n(\theta)\right]^{-1} - \Omega^{-1}(\theta)\right\| = o_P(1)
\qquad\text{and}\qquad
\inf_{\theta\in\Theta_n}\gamma_{\min}(\theta) > 0,
\]
we will get the result of Proposition 4 if we show that:
\[
\sup_{\theta\in\Theta_n}\left\|R_n^{(\gamma)}(\theta)\right\| = o_P(1/\sqrt{n}).
\]
However, by definition:
\begin{align*}
\left\|R_n^{(\gamma)}(\theta)\right\| &\leq \max_{1\leq i\leq n}\left|\varrho_2^{(\gamma)}\left[v_{i,n}^{(\gamma)}(\theta)\right] - \varrho_2^{(\gamma)}[0]\right|\,\left\|\bar\Omega_n(\theta)\right\|\,\left\|\lambda_n^{(\gamma)}(\theta)\right\|\\
&\leq \max_{1\leq i\leq n}\left|\varrho_3^{(\gamma)}\left[z_{i,n}^{(\gamma)}(\theta)\right]\right|\,\max_{1\leq i\leq n}\left|v_{i,n}^{(\gamma)}(\theta)\right|\,\left\|\bar\Omega_n(\theta)\right\|\,\left\|\lambda_n^{(\gamma)}(\theta)\right\|
\end{align*}
where $z_{i,n}^{(\gamma)}(\theta)$ is a number between $0$ and $v_{i,n}^{(\gamma)}(\theta)$ while $\varrho_3^{(\gamma)}[\cdot]$ stands for the third derivative of the function $\varrho^{(\gamma)}[\cdot]$. Note that:
\begin{align}
\sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}\left|z_{i,n}^{(\gamma)}(\theta)\right| &\leq \sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}\left|v_{i,n}^{(\gamma)}(\theta)\right| \leq \sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}\left|\lambda_n^{(\gamma)}(\theta)'\psi_i(\theta)\right| \tag{A.7}\\
&\leq \sup_{\theta\in\Theta_n}\left\|\lambda_n^{(\gamma)}(\theta)\right\|\,\sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}\left\|\psi_i(\theta)\right\| = O_P(1/\sqrt{n})\,o_P(\sqrt{n}) = o_P(1).\nonumber
\end{align}
Therefore, we can always consider a state of nature in a subset $A_n^*$ of $A_n$ that is also asymptotically of probability one and such that there exists a compact subset $K^{(\gamma)}$ of $Q^{(\gamma)}$ containing $z_{i,n}^{(\gamma)}(\theta)$ for all $\theta\in\Theta_n$ and $i = 1, \ldots, n$. Let us introduce:
\[
b^{(\gamma)} = \sup_{x\in K^{(\gamma)}}\left|\varrho_3^{(\gamma)}[x]\right|.
\]
Then:
\begin{align*}
\sup_{\theta\in\Theta_n}\left\|R_n^{(\gamma)}(\theta)\right\| &\leq b^{(\gamma)}\sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}\left|v_{i,n}^{(\gamma)}(\theta)\right|\,\left\|\bar\Omega_n(\theta)\right\|\,\left\|\lambda_n^{(\gamma)}(\theta)\right\|\\
&\leq b^{(\gamma)}\sup_{\theta\in\Theta_n}\left\|\bar\Omega_n(\theta)\right\|\,\sup_{\theta\in\Theta_n}\left\|\lambda_n^{(\gamma)}(\theta)\right\|^2\,\sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}\left\|\psi_i(\theta)\right\| = O_P(1/n)\,o_P(\sqrt{n}) = o_P(1/\sqrt{n})
\end{align*}
where we have used (A.7) again as well as the fact that, by Assumption 2, $\left\|\bar\Omega_n(\theta)\right\|$ is upper bounded on $\Theta_n$.
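The statement of Proposition 4 can be visualized with a Newton solver like the one in the sketch after Proposition 1: the ET tilting parameter and the EEL one, $\lambda_n^{(1)}(\theta) = -\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta)$ in the sign convention of this proof, should differ by $o_P(1/\sqrt{n})$. The data below are again a hypothetical mean-zero design, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

def et_lambda(psi, steps=25):
    """Newton solve of the ET first-order condition sum_i exp(lam'psi_i) psi_i = 0."""
    lam = np.zeros(psi.shape[1])
    for _ in range(steps):
        w = np.exp(psi @ lam)
        grad = (psi * w[:, None]).sum(axis=0)
        jac = (psi * w[:, None]).T @ psi
        lam -= np.linalg.solve(jac, grad)
    return lam

for n in (100, 1000, 10000):
    psi = rng.standard_normal((n, 2))
    lam_et = et_lambda(psi)
    lam_eel = -np.linalg.solve(psi.T @ psi / n, psi.mean(axis=0))   # lambda^(1)
    gap = np.sqrt(n) * np.linalg.norm(lam_et - lam_eel)
    print(f"n={n:6d}  sqrt(n)*||lambda_ET - lambda_EEL|| = {gap:.4f}")
```

The printed quantity $\sqrt{n}\|\hat\lambda^{ET} - \hat\lambda^{(1)}\|$ shrinks toward zero, consistent with the $o_P(1/\sqrt{n})$ bound on the remainder $[\bar\Omega_n(\theta)]^{-1}R_n^{(\gamma)}(\theta)$.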

Proof of Proposition 5: We know from (2.5) that:
\[
\pi_{i,n}^{(\gamma)}(\theta) = \frac{\tau^{(\gamma)}\left[\lambda_n^{(\gamma)}(\theta)'\psi_i(\theta)\right]}{\sum_{j=1}^{n}\tau^{(\gamma)}\left[\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)\right]}. \tag{A.8}
\]

Each term in the sum in the denominator of (A.8) can be expanded as follows:

\[
\tau^{(\gamma)}\left[\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)\right] = \tau^{(\gamma)}(0) + \tau_1^{(\gamma)}(0)\left[\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)\right] + \left[\tau_1^{(\gamma)}(v_{j,n}(\theta)) - \tau_1^{(\gamma)}(0)\right]\left[\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)\right]
\]
with $v_{j,n}(\theta)$ in the interval between $0$ and $\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)$ and:
\[
\tau_1^{(\gamma)}(x) = \frac{d\tau^{(\gamma)}(x)}{dx}.
\]
Hence:
\begin{align*}
\tau^{(\gamma)}\left[\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)\right] &= 1 + \lambda_n^{(\gamma)}(\theta)'\psi_j(\theta) + \left[\tau_1^{(\gamma)}(v_{j,n}(\theta)) - 1\right]\left[\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)\right]\\
&= 1 - \psi_j(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta) + \psi_j(\theta)'u_n(\theta) + \left[\tau_1^{(\gamma)}(v_{j,n}(\theta)) - 1\right]\left[\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)\right]
\end{align*}
where $u_n(\theta) = \lambda_n^{(\gamma)}(\theta) - \lambda_n^{(1)}(\theta)$ and, by Proposition 4:
\[
\sup_{\theta\in\Theta_n}\left\|u_n(\theta)\right\| = o_P\!\left(\frac{1}{\sqrt{n}}\right).
\]

Then, dividing both numerator and denominator by $n$, we can rewrite (A.8) as:
\[
\pi_{i,n}^{(\gamma)}(\theta) = \frac{\frac{1}{n}\left[1 - \psi_i(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta) + R_{i,n}(\theta)\right]}{1 - \bar\psi_n(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta) + \frac{1}{n}\sum_{j=1}^{n}R_{j,n}(\theta)} \tag{A.9}
\]
where:
\begin{align}
R_{i,n}(\theta) &= \psi_i(\theta)'u_n(\theta) + \left[\tau_1^{(\gamma)}(v_{i,n}(\theta)) - 1\right]\left[\lambda_n^{(\gamma)}(\theta)'\psi_i(\theta)\right] \tag{A.10}\\
\frac{1}{n}\sum_{j=1}^{n}R_{j,n}(\theta) &= \bar\psi_n(\theta)'u_n(\theta) + \frac{1}{n}\sum_{j=1}^{n}\left[\tau_1^{(\gamma)}(v_{j,n}(\theta)) - 1\right]\left[\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)\right].\nonumber
\end{align}

Note that, by the definition of $v_{j,n}(\theta)$, Proposition 1 and Assumption 1:
\begin{align*}
\max_{1\leq j\leq n}\sup_{\theta\in\Theta_n}\left|v_{j,n}(\theta)\right| &\leq \max_{1\leq j\leq n}\sup_{\theta\in\Theta_n}\left|\lambda_n^{(\gamma)}(\theta)'\psi_j(\theta)\right|\\
&\leq O_P\!\left(\frac{1}{\sqrt{n}}\right)\max_{1\leq j\leq n}\sup_{\theta\in\Theta_n}\left\|\psi_j(\theta)\right\| = O_P\!\left(\frac{1}{\sqrt{n}}\right)o_P\!\left(\sqrt{n}\right) = o_P(1).
\end{align*}
Thus, there exists a closed ball $B$ such that, for $n$ sufficiently large, $v_{j,n}(\theta)$ belongs to $B$ for all $j = 1, \ldots, n$ and all $\theta\in\Theta_n$. Thus, for $n$ sufficiently large:
\[
\max_{1\leq j\leq n}\sup_{\theta\in\Theta_n}\left|\tau_1^{(\gamma)}(v_{j,n}(\theta)) - 1\right| \leq \sup_{x\in B}\left|\frac{d\tau_1^{(\gamma)}(x)}{dx}\right|\,\max_{1\leq j\leq n}\sup_{\theta\in\Theta_n}\left|v_{j,n}(\theta)\right| = o_P(1). \tag{A.11}
\]
Hence, the first equation of (A.10) gives:
\[
\sup_{\theta\in\Theta_n}\left|R_{i,n}(\theta)\right| \leq \sup_{\theta\in\Theta_n}\left\|\psi_i(\theta)\right\|\left\|u_n(\theta)\right\| + o_P(1)\left\|\lambda_n^{(\gamma)}(\theta)\right\| = o_P\!\left(\frac{1}{\sqrt{n}}\right)
\]
since, by Assumption 1:
\[
\sup_{\theta\in\Theta_n}\left\|\psi_i(\theta)\right\| = O_P(1).
\]

Moreover, using (A.11) again, the second equation of (A.10) gives:
\[
\sup_{\theta\in\Theta_n}\left|\frac{1}{n}\sum_{j=1}^{n}R_{j,n}(\theta)\right| \leq \sup_{\theta\in\Theta_n}\left\|\bar\psi_n(\theta)\right\|\left\|u_n(\theta)\right\| + o_P(1)\sup_{\theta\in\Theta_n}\left|\lambda_n^{(\gamma)}(\theta)'\bar\psi_n(\theta)\right| = o_P\!\left(\frac{1}{n}\right)
\]
since, by Assumption 1:
\[
\sup_{\theta\in\Theta_n}\left\|\bar\psi_n(\theta)\right\| = O_P\!\left(\frac{1}{\sqrt{n}}\right).
\]

Thus, we have shown:
\[
\pi_{i,n}^{(\gamma)}(\theta) = \frac{\frac{1}{n}\left[1 - \psi_i(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta)\right] + \alpha_{i,n}(\theta)}{1 - \bar\psi_n(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta) + \beta_n(\theta)} \tag{A.12}
\]
with:
\[
\alpha_{i,n}(\theta) = \frac{1}{n}R_{i,n}(\theta), \qquad \beta_n(\theta) = \frac{1}{n}\sum_{j=1}^{n}R_{j,n}(\theta)
\]
where:
\[
\sup_{\theta\in\Theta_n}\left|\alpha_{i,n}(\theta)\right| = o_P\!\left(\frac{1}{n\sqrt{n}}\right), \qquad \sup_{\theta\in\Theta_n}\left|\beta_n(\theta)\right| = o_P\!\left(\frac{1}{n}\right).
\]

Now, consider the denominator of the expression (A.12) and, for brevity, write it as:
\[
D_n(\theta) = 1 - \bar\psi_n(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta) + \beta_n(\theta).
\]
Note that:
\[
\sup_{\theta\in\Theta_n}\left|D_n(\theta) - 1\right| \leq \sup_{\theta\in\Theta_n}\bar\psi_n(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta) + o_P\!\left(\frac{1}{n}\right) = O_P\!\left(\frac{1}{n}\right) \tag{A.13}
\]
since:
\[
\sup_{\theta\in\Theta_n}\bar\psi_n(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta) \leq \sup_{\theta\in\Theta_n}\left\|\bar\psi_n(\theta)\right\|^2\sup_{\theta\in\Theta_n}\left\|\bar\Omega_n^{-1}(\theta)\right\| = O_P\!\left(\frac{1}{n}\right)O_P(1) = O_P\!\left(\frac{1}{n}\right)
\]
by Assumption 1 and noting in particular that $\left\|\bar\Omega_n^{-1}(\theta)\right\|$ is uniformly bounded in probability by virtue of Assumption 2. We deduce from (A.13) that:
\[
\sup_{\theta\in\Theta_n}\frac{1}{\left|D_n(\theta)\right|} = O_P(1).
\]
Hence:
\[
\pi_{i,n}^{(\gamma)}(\theta) = \frac{\frac{1}{n}\left[1 - \psi_i(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta)\right]}{D_n(\theta)} + o_{UP}\!\left(\frac{1}{n\sqrt{n}}\right)
\]
where we use the notation $o_{UP}(\cdot)$ to mean that the bound $o_P(\cdot)$ is valid uniformly in $\theta\in\Theta_n$.

Therefore:
\begin{align*}
\pi_{i,n}^{(\gamma)}(\theta) - \frac{1}{n} &= \left(-\frac{1}{n}\right)\frac{\left[\psi_i(\theta) - \bar\psi_n(\theta)\right]'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta)}{D_n(\theta)} - \frac{\beta_n(\theta)}{nD_n(\theta)} + o_{UP}\!\left(\frac{1}{n\sqrt{n}}\right)\\
&= \left(-\frac{1}{n}\right)\frac{\left[\psi_i(\theta) - \bar\psi_n(\theta)\right]'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta)}{D_n(\theta)} + o_{UP}\!\left(\frac{1}{n\sqrt{n}}\right) = O_{UP}\!\left(\frac{1}{n\sqrt{n}}\right)
\end{align*}
where the last equality is obviously implied by Assumptions 1 and 2.

Therefore, we deduce:
\begin{align*}
\pi_{i,n}^{(\gamma)}(\theta) - \pi_{i,n}^{(1)}(\theta) &= \left[\pi_{i,n}^{(\gamma)}(\theta) - \frac{1}{n}\right] - \left[\pi_{i,n}^{(1)}(\theta) - \frac{1}{n}\right]\\
&= \left(-\frac{1}{n}\right)\frac{\left[\psi_i(\theta) - \bar\psi_n(\theta)\right]'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta)}{D_n(\theta)} + o_{UP}\!\left(\frac{1}{n\sqrt{n}}\right)\\
&\quad - \left(-\frac{1}{n}\right)\left[\psi_i(\theta) - \bar\psi_n(\theta)\right]'\left[V_n^{(1)}(\theta)\right]^{-1}\bar\psi_n(\theta).
\end{align*}
Hence:
\begin{align*}
\pi_{i,n}^{(\gamma)}(\theta) - \pi_{i,n}^{(1)}(\theta) &= \left(-\frac{1}{n}\right)\frac{\left[\psi_i(\theta) - \bar\psi_n(\theta)\right]'\left\{\bar\Omega_n^{-1}(\theta) - \left[V_n^{(1)}(\theta)\right]^{-1}\right\}\bar\psi_n(\theta)}{D_n(\theta)}\\
&\quad + \left(-\frac{1}{n}\right)\left[\psi_i(\theta) - \bar\psi_n(\theta)\right]'\left[V_n^{(1)}(\theta)\right]^{-1}\bar\psi_n(\theta)\cdot\frac{1 - D_n(\theta)}{D_n(\theta)} + o_{UP}\!\left(\frac{1}{n\sqrt{n}}\right).
\end{align*}
In this decomposition of the difference of implied probabilities, we see that the first term is $o_{UP}\!\left(\frac{1}{n\sqrt{n}}\right)$ by Assumptions 1 and 2, while, by taking also (A.13) into account, we get that the second term is $O_{UP}\!\left(\frac{1}{n^2\sqrt{n}}\right)$. We have then proved that:
\[
\pi_{i,n}^{(\gamma)}(\theta) - \pi_{i,n}^{(1)}(\theta) = o_{UP}\!\left(\frac{1}{n\sqrt{n}}\right).
\]
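A quick Monte Carlo makes Proposition 5 concrete: EL implied probabilities (computed here in Owen's standard parametrization, which should coincide with the probabilities delivered by any GEL parametrization of EL, since both solve the same primal problem) and the closed-form EEL probabilities in mean-deviation form agree to an order smaller than $n^{-3/2}$. The moment function below is a hypothetical mean-zero design, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(4)

def el_probs(psi, steps=25):
    """EL implied probabilities pi_i = 1/(n(1 + t'psi_i)), with t solving the
    dual first-order condition sum_i psi_i/(1 + t'psi_i) = 0 (Newton's method)."""
    n = psi.shape[0]
    t = np.zeros(psi.shape[1])
    for _ in range(steps):
        d = 1.0 + psi @ t
        grad = (psi / d[:, None]).sum(axis=0)
        jac = -(psi / d[:, None] ** 2).T @ psi
        t -= np.linalg.solve(jac, grad)
    return 1.0 / (n * (1.0 + psi @ t))

def eel_probs(psi):
    """EEL probabilities in mean-deviation form, as in the proof of Proposition 6."""
    n = psi.shape[0]
    dev = psi - psi.mean(axis=0)
    V = dev.T @ dev / n
    return (1.0 - dev @ np.linalg.solve(V, psi.mean(axis=0))) / n

for n in (100, 1000, 10000):
    psi = rng.standard_normal((n, 2))      # correctly specified, mean-zero moments
    gap = np.max(np.abs(el_probs(psi) - eel_probs(psi)))
    print(f"n={n:6d}  max|pi_EL - pi_EEL| = {gap:.2e}   n^(-3/2) = {n ** -1.5:.2e}")
```

The same exchangeability carries over to probability-weighted sample averages $\sum_i \pi_{i,n}(\theta)Y_i$, which is the content of Proposition 9 below.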

Proof of Proposition 6: Recall that by definition (2.15):
\[
\pi_{i,n}^{(1),Sh}(\theta) = \frac{1}{1+\varepsilon_n(\theta)}\pi_{i,n}^{(1)}(\theta) + \frac{\varepsilon_n(\theta)}{1+\varepsilon_n(\theta)}\cdot\frac{1}{n}.
\]
Then:
\[
\pi_{i,n}^{(1)}(\theta) - \pi_{i,n}^{(1),Sh}(\theta) = \frac{\varepsilon_n(\theta)}{1+\varepsilon_n(\theta)}\left[\pi_{i,n}^{(1)}(\theta) - \frac{1}{n}\right]
\]
that is, by (2.13):
\[
\pi_{i,n}^{(1)}(\theta) - \pi_{i,n}^{(1),Sh}(\theta) = -\frac{\varepsilon_n(\theta)}{1+\varepsilon_n(\theta)}\,\bar\psi_n(\theta)'\left[V_n^{(1)}(\theta)\right]^{-1}\frac{\psi_i(\theta) - \bar\psi_n(\theta)}{n}.
\]
Therefore:
\[
\left|\pi_{i,n}^{(1)}(\theta) - \pi_{i,n}^{(1),Sh}(\theta)\right| \leq \frac{1}{n}\,\frac{\varepsilon_n(\theta)}{1+\varepsilon_n(\theta)}\left\|\bar\psi_n(\theta)\right\|\left\|\left[V_n^{(1)}(\theta)\right]^{-1}\right\|\left\|\psi_i(\theta) - \bar\psi_n(\theta)\right\|.
\]
We deduce from Assumptions 1 and 2, continuity of the functions $\psi_i(\theta)$ and compactness of $\Theta$ that:
\[
\sup_{\theta\in\Theta_n}\left|\pi_{i,n}^{(1)}(\theta) - \pi_{i,n}^{(1),Sh}(\theta)\right| = \sup_{\theta\in\Theta_n}\left[\frac{\varepsilon_n(\theta)}{1+\varepsilon_n(\theta)}\right]O_P\!\left(\frac{1}{n\sqrt{n}}\right) \leq \sup_{\theta\in\Theta_n}\left[\varepsilon_n(\theta)\right]O_P\!\left(\frac{1}{n\sqrt{n}}\right) = o_P\!\left(\frac{1}{n\sqrt{n}}\right)
\]
since:
\[
\sup_{\theta\in\Theta_n}\left[\varepsilon_n(\theta)\right] = o_P(1). \tag{A.14}
\]
To see that, note that:
\[
\varepsilon_n(\theta) = \max\left\{\max_{1\leq i\leq n}\left(-n\pi_{i,n}^{(1)}(\theta)\right),\; 0\right\}
\]
with:
\[
\max_{1\leq i\leq n}\left(-n\pi_{i,n}^{(1)}(\theta)\right) = -1 + \max_{1\leq i\leq n}\left[\bar\psi_n(\theta)'\left[V_n^{(1)}(\theta)\right]^{-1}\left(\psi_i(\theta) - \bar\psi_n(\theta)\right)\right]
\]
so that:
\[
\sup_{\theta\in\Theta_n}\max_{1\leq i\leq n}\left(-n\pi_{i,n}^{(1)}(\theta)\right) = -1 + O_P\!\left(\frac{1}{\sqrt{n}}\right)
\]
which obviously implies (A.14).
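The shrinkage (2.15) has a transparent implementation: $\varepsilon_n(\theta)$ is the smallest amount of mixing with the uniform weights $1/n$ that restores non-negativity while preserving $\sum_i \pi_{i,n} = 1$. A minimal sketch (with made-up numbers, purely illustrative) follows.

```python
import numpy as np

def shrunk_probs(pi1):
    """Shrink EEL probabilities toward the uniform weights 1/n, as in (2.15):
    eps_n = max(max_i(-n pi_i), 0) is exactly the amount of mixing that lifts
    the most negative weight to zero while keeping the weights summing to one."""
    n = pi1.shape[0]
    eps = max(np.max(-n * pi1), 0.0)
    return pi1 / (1.0 + eps) + (eps / (1.0 + eps)) / n

pi1 = np.array([0.30, 0.25, 0.25, 0.22, -0.02])   # made-up EEL weights, one negative
pi_sh = shrunk_probs(pi1)
print(pi_sh, pi_sh.sum())                          # all weights >= 0, sum still 1
```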

Proof of Proposition 7: See main text.

Proof of Proposition 8: We have:
\[
\sum_{i=1}^{n}\pi_{i,n}^{(1)}(\theta)Y_i - \sum_{i=1}^{n}\pi_{i,n}^{(1),Sh}(\theta)Y_i = \frac{\varepsilon_n(\theta)}{1+\varepsilon_n(\theta)}\left[\frac{1}{n}\sum_{i=1}^{n}Y_i\left(\psi_i(\theta) - \bar\psi_n(\theta)\right)'\right]\left[V_n^{(1)}(\theta)\right]^{-1}\bar\psi_n(\theta).
\]
Then, from Assumptions 1 and 2, (A.14), and the additional assumptions for Proposition 8:
\[
\sup_{\theta\in\Theta_n}\left\|\sum_{i=1}^{n}\pi_{i,n}^{(1)}(\theta)Y_i - \sum_{i=1}^{n}\pi_{i,n}^{(1),Sh}(\theta)Y_i\right\| = o_P(1)\,O_P(1)\,O_P\!\left(\frac{1}{\sqrt{n}}\right) = o_P\!\left(\frac{1}{\sqrt{n}}\right)
\]
which gives the announced result.

Proof of Proposition 9: We have from the proof of Proposition 5:
\[
\pi_{i,n}^{(\gamma)}(\theta) = \frac{1}{nD_n(\theta)}\left[1 - \psi_i(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta)\right] + \frac{1}{nD_n(\theta)}R_{i,n}(\theta)
\]
\[
\Longrightarrow\quad \sum_{i=1}^{n}\pi_{i,n}^{(\gamma)}(\theta)Y_i = A_n(\theta) + B_n(\theta)
\]
with:
\[
A_n(\theta) = \frac{1}{nD_n(\theta)}\sum_{i=1}^{n}R_{i,n}(\theta)Y_i, \qquad B_n(\theta) = \frac{1}{nD_n(\theta)}\sum_{i=1}^{n}\left[1 - \psi_i(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta)\right]Y_i.
\]
We first show that:
\[
A_n(\theta) = o_{UP}\!\left(\frac{1}{\sqrt{n}}\right).
\]

By using (A.13), we only have to show that:
\[
\sum_{i=1}^{n}R_{i,n}(\theta)Y_i = o_{UP}\!\left(\sqrt{n}\right).
\]
Recall from (A.10) that:
\[
R_{i,n}(\theta) = \psi_i(\theta)'u_n(\theta) + \left[\tau_1^{(\gamma)}(v_{i,n}(\theta)) - 1\right]\left[\lambda_n^{(\gamma)}(\theta)'\psi_i(\theta)\right]
\]
so that:
\[
\sum_{i=1}^{n}R_{i,n}(\theta)Y_i = A_{1,n}(\theta) + A_{2,n}(\theta)
\]
with:
\[
A_{1,n}(\theta) = u_n(\theta)'\sum_{i=1}^{n}\psi_i(\theta)Y_i = \sqrt{n}\,u_n(\theta)'\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{\psi_i(\theta)Y_i - E\left[\psi_i(\theta)Y_i\right]\right\} + n\,u_n(\theta)'\frac{1}{n}\sum_{i=1}^{n}E\left[\psi_i(\theta)Y_i\right].
\]
By the functional central limit theorem:
\[
u_n(\theta) = o_{UP}\!\left(\frac{1}{\sqrt{n}}\right) \;\Longrightarrow\; \sqrt{n}\,u_n(\theta)'\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left\{\psi_i(\theta)Y_i - E\left[\psi_i(\theta)Y_i\right]\right\} = o_{UP}(1) \leq o_{UP}\!\left(\sqrt{n}\right).
\]
On the other hand, applying the Cauchy-Schwarz and the triangle inequalities, and then using Assumptions 1(i) and 2(ii) and the square integrability of $Y_i$, gives:
\[
\left|n\,u_n(\theta)'\frac{1}{n}\sum_{i=1}^{n}E\left[\psi_i(\theta)Y_i\right]\right| \leq n\left\|u_n(\theta)\right\|\frac{1}{n}\sum_{i=1}^{n}\left\|E\left[\psi_i(\theta)Y_i\right]\right\| \leq n\left\|u_n(\theta)\right\|O_{UP}\!\left(\frac{1}{\sqrt{n}}\right)O_{UP}(1) = o_{UP}\!\left(\sqrt{n}\right),
\]
where, as with $o_{UP}(\cdot)$, here we use the notation $O_{UP}(\cdot)$ to mean that the bound $O_P(\cdot)$ is valid uniformly in $\theta\in\Theta_n$. Hence:
\[
A_{1,n}(\theta) = o_{UP}\!\left(\sqrt{n}\right).
\]
On the other hand:
\begin{align*}
\left|A_{2,n}(\theta)\right| &= \left|\lambda_n^{(\gamma)}(\theta)'\sum_{i=1}^{n}\psi_i(\theta)Y_i\left[\tau_1^{(\gamma)}(v_{i,n}(\theta)) - 1\right]\right|\\
&\leq \left\|\lambda_n^{(\gamma)}(\theta)\right\|\max_{1\leq j\leq n}\sup_{\theta\in\Theta_n}\left|\tau_1^{(\gamma)}(v_{j,n}(\theta)) - 1\right|\left\|\sum_{i=1}^{n}\psi_i(\theta)Y_i\right\| = o_{UP}\!\left(\frac{1}{\sqrt{n}}\right)\left\|\sum_{i=1}^{n}\psi_i(\theta)Y_i\right\| = o_{UP}\!\left(\sqrt{n}\right),
\end{align*}
respectively by Proposition 1, (A.11) and since the argument above about $A_{1,n}(\theta)$ has shown that:
\[
\left\|\sum_{i=1}^{n}\psi_i(\theta)Y_i\right\| = O_{UP}(n).
\]
We complete the proof by showing that:
\[
B_n(\theta) = \sum_{i=1}^{n}\pi_{i,n}^{(1)}(\theta)Y_i + o_{UP}\!\left(\frac{1}{\sqrt{n}}\right).
\]

We have:
\begin{align*}
B_n(\theta) &= \frac{1}{nD_n(\theta)}\sum_{i=1}^{n}\left[1 - \psi_i(\theta)'\bar\Omega_n^{-1}(\theta)\bar\psi_n(\theta)\right]Y_i\\
&= \frac{1}{nD_n(\theta)}\sum_{i=1}^{n}\left[1 - \psi_i(\theta)'\left[V_n^{(1)}(\theta)\right]^{-1}\bar\psi_n(\theta)\right]Y_i\\
&\quad + \frac{1}{nD_n(\theta)}\sum_{i=1}^{n}\psi_i(\theta)'\left\{\left[V_n^{(1)}(\theta)\right]^{-1} - \bar\Omega_n^{-1}(\theta)\right\}\bar\psi_n(\theta)Y_i.
\end{align*}
We first note that:
\begin{align*}
\left|\sum_{i=1}^{n}\psi_i(\theta)'\left\{\left[V_n^{(1)}(\theta)\right]^{-1} - \bar\Omega_n^{-1}(\theta)\right\}\bar\psi_n(\theta)Y_i\right| &\leq \left\|\left[V_n^{(1)}(\theta)\right]^{-1} - \bar\Omega_n^{-1}(\theta)\right\|\left\|\sum_{i=1}^{n}\psi_i(\theta)Y_i\right\|\left\|\bar\psi_n(\theta)\right\|\\
&= o_{UP}(1)\,O_{UP}(n)\,O_{UP}\!\left(\frac{1}{\sqrt{n}}\right) = o_{UP}\!\left(\sqrt{n}\right)
\end{align*}
so that:
\begin{align*}
B_n(\theta) &= \frac{1}{nD_n(\theta)}\sum_{i=1}^{n}\left[1 - \psi_i(\theta)'\left[V_n^{(1)}(\theta)\right]^{-1}\bar\psi_n(\theta)\right]Y_i + o_{UP}\!\left(\frac{1}{\sqrt{n}}\right)\\
&= \frac{1}{nD_n(\theta)}\sum_{i=1}^{n}\left[1 - \left[\psi_i(\theta) - \bar\psi_n(\theta)\right]'\left[V_n^{(1)}(\theta)\right]^{-1}\bar\psi_n(\theta)\right]Y_i\\
&\quad - \frac{1}{D_n(\theta)}\,\bar\psi_n(\theta)'\left[V_n^{(1)}(\theta)\right]^{-1}\bar\psi_n(\theta)\,\frac{1}{n}\sum_{i=1}^{n}Y_i + o_{UP}\!\left(\frac{1}{\sqrt{n}}\right)\\
&= \frac{1}{D_n(\theta)}\sum_{i=1}^{n}\pi_{i,n}^{(1)}(\theta)Y_i + O_{UP}\!\left(\frac{1}{\sqrt{n}}\right)O_{UP}\!\left(\frac{1}{\sqrt{n}}\right)O_{UP}(1) + o_{UP}\!\left(\frac{1}{\sqrt{n}}\right)\\
&= \sum_{i=1}^{n}\pi_{i,n}^{(1)}(\theta)Y_i + \sum_{i=1}^{n}\pi_{i,n}^{(1)}(\theta)Y_i\left[\frac{1 - D_n(\theta)}{D_n(\theta)}\right] + o_{UP}\!\left(\frac{1}{\sqrt{n}}\right)\\
&= \sum_{i=1}^{n}\pi_{i,n}^{(1)}(\theta)Y_i + o_{UP}\!\left(\frac{1}{\sqrt{n}}\right)
\end{align*}
where the last equality is a direct consequence of (A.13).

Proof of Proposition 10: Follows directly from the central limit theorem and the definition of $\pi_{i,n}^{(1)}(\theta)$ (see Proposition 3).

Proof of Lemma 1: Using the uniform convergence of $m_n^w(\theta)$ to $m^w(\theta)$, where the latter is continuous in $\theta$ with $m^w(\theta_0) = 0$, it follows by the compactness of $\Theta$ that for any $\theta\in\Theta_n$:
\[
\sqrt{n}E\left[\bar\psi_n(\theta)\right] = m_n^w(\theta) + \sqrt{n}\,m(\theta_s) = \left(m^w(\theta) + o(1)\right) + \sqrt{n}\,m(\theta_s).
\]
Hence, by the definition and continuous differentiability of $m(\theta_s)$, in conjunction with the definition of $\Theta_n$, we have:
\[
\sqrt{n}\sup_{\theta\in\Theta_n}\left\|E\left[\bar\psi_n(\theta)\right]\right\| \leq \left(\left\|m^w(\theta)\right\| + o(1)\right) + \sqrt{n}\left\|m(\theta_s^0)\right\| + \sqrt{n}\sup_{\theta\in\Theta_n}\left\|M_s(\theta_s)\right\|\frac{\|r\|}{\sqrt{n}} = O(1) + 0 + \left\|O(1)\right\|\|r\| = O(1)
\]
using the triangle inequality, Assumptions ID and SPM1, and the Cauchy-Schwarz inequality.

Proof of Lemma 2: (i) We maintain all the assumptions from Propositions 8 and 9 (and also 6 and 7, by default). Hence, for any two choices of $\pi_{i,n}^G(\theta)$ in (5.1), call them $\pi_{i,n,1}^G(\theta)$ and $\pi_{i,n,2}^G(\theta)$ respectively, it directly follows that:
\[
\sqrt{n}\left\|\sum_{i=1}^{n}\pi_{i,n,1}^G(\theta)\frac{\partial\psi_i(\theta)}{\partial\theta_w'} - \sum_{i=1}^{n}\pi_{i,n,2}^G(\theta)\frac{\partial\psi_i(\theta)}{\partial\theta_w'}\right\| = o_{UP}(1).
\]
Now, noting that Assumption CLT is actually a stronger version of the joint convergence assumption maintained in Proposition 10, it follows from Assumptions CLT and O that, for any choice of $\pi_{i,n}^G(\theta)$, one would obtain:
\[
\sqrt{n}\left[\bar\psi_n(\theta) - E[\bar\psi_n(\theta)],\;\; \sum_{i=1}^{n}\pi_{i,n}^G(\theta)\frac{\partial\psi_i(\theta)}{\partial\theta_w'} - E\left[\sum_{i=1}^{n}\pi_{i,n}^G(\theta)\frac{\partial\psi_i(\theta)}{\partial\theta_w'}\right]\right] \Longrightarrow \left[\Psi_{1:H}(\theta_w, \theta_s^0),\; \xi(\theta_w, \theta_s^0)\right].
\]
Hence, the final result follows by considering the EEL implied probability for the choice of $\pi_{i,n}^G(\theta)$ since then Lemma 1 and Assumptions CLT, SPM2 and O imply that, for $j = 1, \ldots, p_w$:
\begin{align*}
\sqrt{n}E\left[\sum_{i=1}^{n}\pi_{i,n}^G(\theta)\frac{\partial\psi_i(\theta)}{\partial\theta_j}\right] &= \sqrt{n}\frac{\partial}{\partial\theta_j}E\left[\bar\psi_n(\theta)\right] - \mathrm{Cov}\left(\frac{\partial\psi_i(\theta)}{\partial\theta_j}, \psi_i(\theta)\right)\left[V(\theta)\right]^{-1}\sqrt{n}E\left[\bar\psi_n(\theta)\right] + o_U(1)\\
&= G_{w,j}(\theta_w, \theta_s^0) + o_U(1).
\end{align*}
(ii) Directly follows as above, but without the use of Assumption CLT.

Proof of Theorem 1: First note that Lemma 1 and Assumption CLT give:
\[
\sup_{\theta\in\Theta_n}\sqrt{n}\left\|\bar\psi_n(\theta)\right\| = O_P(1), \tag{A.15}
\]
fixing the order of magnitude of the average moment vector occurring in the score vectors in all three types of statistics: $LM_n\left(\theta, \pi^G(\theta), \pi^V(\theta)\right)$, $K_n(\theta)$ and $\bar K_n(\theta)$.

Now, consider the scaling matrix
\[
T_n = \begin{bmatrix} \sqrt{n}\,I_{p_w} & 0\\ 0 & I_{p - p_w} \end{bmatrix},
\]
which is nonsingular for any given $n$. Pre-multiply by $T_n$ all the occurrences of
\[
\sum_{i=1}^{n}\pi_{i,n}^G(\theta)G_i'(\theta), \qquad \sum_{i=1}^{n}\pi_{i,n}^{(1)}(\theta)G_i'(\theta) \qquad\text{and}\qquad
\begin{bmatrix} \displaystyle\sum_{i=1}^{n}\pi_{i,n}^G(\theta)\frac{\partial\psi_i'(\theta)}{\partial\theta_w}\\[2ex] \displaystyle\sum_{i=1}^{n}\pi_{i,n}\frac{\partial\psi_i'(\theta)}{\partial\theta_s} \end{bmatrix}
\]
in $LM_n\left(\theta, \pi^G(\theta), \pi^V(\theta)\right)$, $K_n(\theta)$ and $\bar K_n(\theta)$ respectively. Note that in all three types of statistics, such occurrences happen only four times: once in each of the extreme terms (the score vector) and twice in the middle term (the estimator of the inverse of the variance of the score vector). Thus, for all $n$, this pre-multiplication is harmless because $T_n$ and $T_n^{-1}$ always cancel out.

Therefore, taking (A.15) into consideration, the result of Theorem 1 now directly follows from the intermediate results in Lemma 2.
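The cancellation argument in the proof is just the invariance of a quadratic form under the rescaling $s \mapsto T_n s$, $V \mapsto T_n V T_n'$. The sketch below (arbitrary stand-in vectors and matrices, not the actual statistics) illustrates it numerically.

```python
import numpy as np

rng = np.random.default_rng(5)
p, p_w, n = 4, 2, 1000
Tn = np.diag([np.sqrt(n)] * p_w + [1.0] * (p - p_w))   # the scaling matrix T_n

s = rng.standard_normal(p)              # stand-in for the score vector
A = rng.standard_normal((p, p))
V = A @ A.T + np.eye(p)                 # stand-in for the variance estimator

stat = s @ np.linalg.solve(V, s)
stat_scaled = (Tn @ s) @ np.linalg.solve(Tn @ V @ Tn.T, Tn @ s)
print(stat, stat_scaled)                # identical up to rounding error
```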
