
ASYMPTOTIC ALMOST SURE EFFICIENCY OF AVERAGED STOCHASTIC ALGORITHMS∗

MARIANE PELLETIER†

SIAM J. CONTROL OPTIM. © 2000 Society for Industrial and Applied Mathematics, Vol. 39, No. 1, pp. 49–72

Abstract. First, we define the notion of almost sure efficiency for a decreasing stepsize stochastic algorithm, and then we show that the averaging method, which gives asymptotically efficient algorithms, also gives asymptotically almost surely efficient algorithms. Moreover, we prove that the averaged algorithm also satisfies a law of the iterated logarithm, as well as an almost sure central limit theorem.

Key words. stochastic algorithms, central limit theorem, almost sure invariance principles

AMS subject classifications. 62L20, 62F12, 60F05, 60F15

PII. S0363012998308169

1. Introduction. Many vectorial algorithms are written in the form

Z_{n+1} = Z_n + γ_n [F(Z_n, η_{n+1})],

where the gain (γ_n)_{n≥0} is a nonrandom sequence decreasing to 0 with Σ γ_n = ∞ and the observed quantity at time n + 1 is F(Z_n, η_{n+1}), η_{n+1} being a stochastic disturbance. Such an algorithm is often studied when rewritten as an algorithm used for the search of zeros of a function h,

Z_{n+1} = Z_n + γ_n [h(Z_n) + e_{n+1}],    (1)

where (e_n) is a "small disturbance" and h(Z_n) corresponds to a mean effect of F(Z_n, η_{n+1}), given the past. The classical Robbins–Monro algorithm is obtained when (η_n) is a sequence of independent identically distributed random variables and h(z) = E[F(z, η)]. Extensions to a Markovian disturbance (η_n) are developed in [1].

Throughout this paper, the algorithm (1) is considered in the following framework: (e_n) is a sequence of d-dimensional random vectors defined on a probability space (Ω, A, P), adapted to a filtration F = (F_n)_{n≥0}, and Z_0 is F_0-measurable. The function h is defined on R^d and R^d-valued, and z* is a zero of h such that, on a neighborhood of z*,

h(z) = H(z − z*) + O(‖z − z*‖^a),

where a > 1 and H is a stable matrix (i.e., all the real parts of the eigenvalues of H are strictly negative).

Many criteria ensure the almost sure convergence or the convergence with a strictly positive probability of (Z_n) towards z* (see among many others [1], [9], [12], and [18]). In order to ensure their applications to various cases, our results are conditional with respect to the event Γ(z*) = {Z_n → z*}.

∗Received by the editors October 13, 1998; accepted for publication (in revised form) January 10, 2000; published electronically August 17, 2000.
http://www.siam.org/journals/sicon/39-1/30816.html
†Laboratoire de Mathématiques, Bâtiment Fermat, Université de Versailles Saint-Quentin, 45 Avenue des États-Unis, 78035 Versailles Cedex, France ([email protected]).


It was proved, under some local assumptions stated in section 2, that if P[Γ(z*)] > 0, then the sequence (Z_n) satisfies a conditional central limit theorem (CLT):

given Γ(z*),  Ψ_n = √(1/γ_n) (Z_n − z*) →^D N(0, Σ),    (2)

with →^D denoting the convergence in distribution, N denoting the Gaussian distribution, and Σ being a positive definite matrix (see, for instance, [1], [12], [18], [19] for the case P[Γ(z*)] = 1 and [9] or [20] for the case P[Γ(z*)] > 0). (Equation (2) means that the asymptotic conditional law of Ψ_n with respect to Γ(z*) is N(0, Σ).)

Moreover, the sequence (Z_n) is known to fulfill the three following almost sure properties ([21], [22]), where Σ is the limit covariance matrix of (2), δ_x denotes the point mass at x, and ⇒ is the weak convergence. (Throughout the paper, we say that a property P holds almost surely (a.s.) on Γ(z*) if there exists a subset N ⊂ Γ(z*) such that P(N) = 0 and P holds for all ω ∈ Γ(z*) \ N.)

• A quadratic strong law of large numbers:

a.s. on Γ(z*),  lim_{n→∞} (1/Σ_{k=1}^n γ_k) Σ_{k=1}^n (Z_k − z*)(Z_k − z*)^T = Σ.    (3)

• A law of the iterated logarithm: for any eigenvector w ∈ R^d of H^T,

a.s. on Γ(z*),  lim sup_{n→∞} [1/(2γ_n ln(Σ_{k=1}^n γ_k))] w^T(Z_n − z*)(Z_n − z*)^T w = w^T Σ w.    (4)

• An almost sure central limit theorem (a.s.CLT):

a.s. on Γ(z*),  (1/Σ_{k=1}^n γ_k) Σ_{k=1}^n γ_k δ_{√(1/γ_k)(Z_k − z*)} ⇒ N(0, Σ),    (5)

i.e., there exists a P-null set N ⊂ Γ(z*) such that for all ω ∈ Γ(z*) \ N,

(1/Σ_{k=1}^n γ_k) Σ_{k=1}^n γ_k δ_{√(1/γ_k)(Z_k(ω) − z*)} ⇒ N(0, Σ).

The optimal weak convergence rate of (1) given Γ(z*) is reached when γ_n = γ_0/n with 2Lγ_0 > 1 (−L denoting the greatest real part of the eigenvalues of H), since (2) is then equivalent to

given Γ(z*),  √n (Z_n − z*) →^D N(0, γ_0 Σ).

The question arises as to what the optimal covariance matrix is. For that, let us consider the following class of algorithms:

Z_{n+1} = Z_n + (A/n) [h(Z_n) + e_{n+1}],    (6)

where A is an invertible d×d matrix such that AH + I/2 is stable. For such algorithms (see [9]), it follows from (2) that

given Γ(z*),  √n (Z_n − z*) →^D N(0, Σ(A)),


where Σ(A) is the solution of the Lyapunov equation

[AH + I/2] Σ(A) + Σ(A) [H^T A^T + I/2] = −ACA^T,

with C = lim_{n→+∞} E(e_{n+1} e_{n+1}^T | F_n) a.s. on Γ(z*). The optimal choice of the matrix A in (6) is A = −H^{−1}, which leads to Σ(A) = H^{−1}C(H^{−1})^T, since it minimizes the covariance Σ(A) (with respect to the order of the symmetric matrices).

When A is replaced by −H^{−1}, (6) is Newton's algorithm, which thus has an asymptotically optimal behavior in distribution. Unfortunately, it is often impossible to use this algorithm, the matrix H being generally unknown.

These considerations lead us to set the following definition.
Definition 1. If (Y_n)_{n≥0} is given by a stochastic algorithm used for the search of zeros of a function h observable only together with a disturbance (e_n), h and (e_n) satisfying the assumptions previously given and y* being a zero of h, we say that the algorithm is asymptotically efficient given {Y_n → y*} if

given {Y_n → y*},  √n (Y_n − y*) →^D N(0, H^{−1}C(H^{−1})^T).

The averaging method, simultaneously introduced by Polyak [23] and Ruppert [25], is known to give asymptotically efficient algorithms (see among others Delyon–Juditsky [5], [6], Dippon–Renz [7], Kushner–Yang [13], Polyak–Juditsky [24], and Yin [31]).

The averaged algorithm is built up in the following way; each iteration requires two steps.

Step 1. Z_{n+1} is found from the algorithm (1) where the gain γ_n is "slow"; typically

γ_n = γ_0/n^α,  1/2 < α < 1.

Step 2. We compute the empirical mean of all the previous observations,

Z̄_{n+1} = (1/(n+1)) Σ_{k=1}^{n+1} Z_k.

Note that Z̄_{n+1} can be recursively written as

Z̄_{n+1} = Z̄_n + (1/(n+1)) (Z_{n+1} − Z̄_n).
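To fix ideas, here is a minimal numerical sketch of this two-step scheme (not taken from the paper): the toy drift h(z) = −(z − z*), the i.i.d. Gaussian disturbance, and the constants z*, γ_0, α below are illustrative assumptions, chosen so that H = −1 is stable and C = 1, hence H^{−1}C(H^{−1})^T = 1.

```python
import numpy as np

# Minimal sketch of the averaged algorithm on a toy problem (assumptions:
# h(z) = -(z - z_star), so H = -1, and e_{n+1} i.i.d. N(0, 1), so C = 1).
rng = np.random.default_rng(0)
z_star, gamma0, alpha, N = 2.0, 1.0, 0.7, 100_000

z = 0.0        # Z_0
z_bar = 0.0    # empirical mean of Z_1, ..., Z_n, updated recursively
for n in range(1, N + 1):
    gamma = gamma0 / n**alpha                  # Step 1: "slow" gain
    z += gamma * (-(z - z_star) + rng.standard_normal())
    z_bar += (z - z_bar) / n                   # Step 2: recursive averaging

# Here H^{-1} C (H^{-1})^T = 1, so sqrt(N)(z_bar - z_star) is roughly N(0, 1).
print(z_bar, np.sqrt(N) * (z_bar - z_star))
```

The update of z_bar is exactly the recursive identity displayed above, so the averaging costs one addition per iteration and no extra storage.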

The aim of this paper is to prove that, conditionally on the set of consistency Γ(z*), the averaged algorithm is not only asymptotically efficient, but that it also satisfies the almost sure properties (3), (4), and (5) with the optimal rate and the optimal covariance matrix Σ. Moreover, the law of the iterated logarithm (4) is obtained for any vector w (and not only for eigenvectors of H^T).

Before stating our main results, let us first introduce the almost sure version of the notion of asymptotic efficiency. When γ_n = γ_0/n with 2Lγ_0 > 1 (so that Σ_{k=1}^n γ_k ∼ γ_0 ln n), (3) is equivalent to

a.s. on Γ(z*),  lim_{n→∞} (1/ln n) Σ_{k=1}^n (Z_k − z*)(Z_k − z*)^T = γ_0 Σ,    (7)


and, for slower gains γ_n = γ_0/n^α, 1/2 < α < 1, to

a.s. on Γ(z*),  lim_{n→∞} ((1−α)/n^{1−α}) Σ_{k=1}^n (Z_k − z*)(Z_k − z*)^T = γ_0 Σ.

Thus, the sum of the "squared" differences between Z_k and the estimated parameter z* is optimal when (7) is fulfilled with γ_0 Σ = H^{−1}C(H^{−1})^T. By analogy with the definition of the asymptotic efficiency, and taking up the terminology introduced by Touati [28] for statistical problems, we introduce the following definition.

Definition 2. If (Y_n) is given by a stochastic algorithm used for the search of zeros of a function h observable only together with a disturbance (e_n), h and (e_n) satisfying the assumptions previously given and y* being a zero of h, we say that the algorithm is asymptotically almost surely efficient on {Y_n → y*} if

a.s. on {Y_n → y*},  lim_{n→∞} (1/ln n) Σ_{k=1}^n (Y_k − y*)(Y_k − y*)^T = H^{−1}C(H^{−1})^T.

We shall prove that the averaged algorithm is a.s. efficient on Γ(z*), i.e., that

a.s. on Γ(z*),  lim_{n→∞} (1/ln n) Σ_{k=1}^n (Z̄_k − z*)(Z̄_k − z*)^T = H^{−1}C(H^{−1})^T.

We shall also show that the averaged algorithm fulfills the following law of the iterated logarithm,

a.s. on Γ(z*), ∀w ∈ R^d,  lim sup_{n→∞} [n/(2 ln(ln n))] w^T(Z̄_n − z*)(Z̄_n − z*)^T w = w^T H^{−1}C(H^{−1})^T w,

and the following a.s.CLT:

a.s. on Γ(z*),  (1/ln n) Σ_{k=1}^n (1/k) δ_{√k (Z̄_k − z*)} ⇒ N(0, H^{−1}C(H^{−1})^T).

In fact, we shall extend our framework and study the more general algorithm (including the Kiefer–Wolfowitz algorithm [9])

Z_{n+1} = Z_n + γ_n [h(Z_n) + r_{n+1}] + σ_n ε_{n+1},    (8)

where (ε_n) is a noise (i.e., a sequence of martingale increments), (r_n) a residual term, and (σ_n) a nonrandom sequence decreasing to 0 such that γ_n = O(σ_n); the algorithm (1) then corresponds to the particular case γ_n = σ_n and e_n = r_n + ε_n.

Our results are precisely stated in section 2. In section 3, we give an application to efficient recursive estimators. Finally, section 4 is devoted to the proofs.

2. Assumptions and main results. We first state the required assumptions on (8).

Assumption (A1) about the function h: There exist a > 1, a stable matrix H, and a neighborhood U of z*, such that for any z in U,

h(z) = H(z − z*) + O(‖z − z*‖^a).


Assumption (A2) about the noise (ε_n):
(i) There exist M > 0 and b > 2 such that a.s.

E(ε_{n+1} | F_n) 1_{‖Z_n−z*‖≤M} = 0  and  sup_{n≥0} E(‖ε_{n+1}‖^b | F_n) 1_{‖Z_n−z*‖≤M} < ∞.

(ii) There exists a nonrandom symmetric positive definite matrix C such that lim_{n→∞} C_n = C a.s. on Γ(z*), where C_n = E(ε_{n+1} ε_{n+1}^T | F_n).

Assumption (A3) about the gains: There exist two decreasing positive functions γ and σ, defined over [0, +∞[, such that γ_n = γ(n) and σ_n = σ(n) for every integer n. We define the function v by v(t) = γ(t)/σ²(t) and assume there exist two positive real numbers α and β such that the following conditions are fulfilled.

(i) v is a differentiable increasing function, v(∞) = ∞, and its differential v′ varies regularly with exponent (β − 1) (i.e., lim_{t→∞} v′(tx)/v′(t) = x^{β−1}; see [11] or [26]).
(ii) γ is differentiable, varies regularly with exponent (−α), and θ = (1/γ)′ is decreasing and varies regularly with exponent (α − 1).
(iii) One of the two following conditions (A3.a) or (A3.b) holds.
(A3.a) max{1/2, 2/b} < α < 1 and (1−α)/(a−1) < β ≤ 1;
(A3.b) 1/2 < α < 1 and [(1−α)/(a−1)](1 + 2a/b) < β ≤ 1.

Assumption (A4) about the residual term (r_n): We set

J(t) = ∫_0^t [γ(s)v(s)]^{−1} ds,    (9)

and assume r_{n+1} = r^{(1)}_{n+1} + r^{(2)}_{n+1} with
(i) r^{(2)}_{n+1} = O(‖Z_n − z*‖^a) a.s.,
(ii) the weak assumption (A4.w) or the stronger one (A4.s) is fulfilled:
(A4.w) There exists M > 0 such that

‖r^{(1)}_{n+1}‖ 1_{‖Z_n−z*‖≤M} = O([√(J(n)) γ_n v(n)]^{−1}) a.s.;

(A4.s) There exist M > 0 and ρ > (1 + β − α)/2 such that ‖r^{(1)}_{n+1}‖ 1_{‖Z_n−z*‖≤M} = O(n^{−ρ}) a.s.

Comments on the assumptions.
(a) Our assumptions are local. Thus, the results stated below can be applied as soon as P[Γ(z*)] > 0, whatever the behavior of (Z_n) outside of Γ(z*) may be. In particular, they apply to algorithms obtained by projection or truncation in the framework of [3] or [12].
(b) Since the function s ↦ √(J(s)) γ(s) v(s) varies regularly with exponent (1 + β − α)/2, assumption (A4.s) implies (A4.w).
(c) Assumptions (A2) about the noise and (A4) about the residual term can be applied to Markovian disturbances in the framework of [1], whose application to the averaging method is detailed in [6].
(d) When the conditional moment of order 4 of the noise (ε_n) is bounded (i.e., when b ≥ 4), assumption (A3)(iii) reduces to

1/2 < α < 1  and  (1−α)/(a−1) < β ≤ 1;

thus, the condition (A3.b) is useful only in the case 2 < b < 4.


(e) In most cases, the function h is regular enough so that assumption (A1) holds with a = 2. In this case, assumption (A3) is fulfilled, for instance, by the gains

γ_n = γ_0/n^α,  σ_n = σ_0/√(n^{α+β})  (γ_0 > 0, σ_0 > 0),

with 2/3 ≤ α < 1 and 3(1−α) ≤ β ≤ 1. For these gains, we have

v(n) = γ_0 n^β/σ_0²,  J(n) = σ_0² n^{1+α−β}/(γ_0²(1+α−β)),

and assumption (A4.w) can be rewritten as: there exists M > 0 such that ‖r^{(1)}_{n+1}‖ 1_{‖Z_n−z*‖≤M} = O(1/√(n^{1−α+β})) a.s.

The Robbins–Monro algorithm is given by (8) with γ_n = σ_n and r_n = 0; we then have β = α, J(n) = n, and, if (A1) holds with a = 2, assumptions (A3) and (A4) are fulfilled, for instance, by the gains

γ_n = σ_n = γ_0/n^α  (γ_0 > 0)  with 3/4 ≤ α < 1.

The Kiefer–Wolfowitz algorithm corresponds to the case h = −∇V, where the function V is observable only together with a noise. This algorithm can be written as (8) with γ_n = γ_0/n^α (γ_0 > 0), 1/2 < α ≤ 1, and σ_n = n^τ γ_n, 0 < τ < α/2. We then have β = α − 2τ and J(n) = n^{1+2τ}/(1+2τ). Since (n^{2τ} r_{n+1}) is known to converge a.s. on Γ(z*) toward a deterministic, usually nonzero constant, assumption (A4.w) (respectively, (A4.s)) requires 1/6 ≤ τ < α/2 (respectively, 1/6 < τ < α/2). Consequently, if (A1) holds with a = 2, our assumptions (A3) and (A4) are fulfilled by the gains

γ_n = γ_0/n^α,  σ_n = n^τ γ_n,  with γ_0 > 0,  3/4 + τ/2 ≤ α < 1,  and
1/6 ≤ τ < α/2 for (A4.w),  1/6 < τ < α/2 for (A4.s).    (10)
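These three gain families are easy to collect in code. The sketch below is illustrative only (the function names and the particular parameter values are our assumptions, chosen inside the ranges just stated), and it checks the closed form of J(n) for the first family against the defining integral (9).

```python
import numpy as np

# Gain schedules satisfying (A3)-(A4) in the three examples above
# (illustrative parameter values, chosen inside the stated ranges).
def gains_general(n, gamma0=1.0, sigma0=1.0, alpha=0.8, beta=0.6):
    # case a = 2: 2/3 <= alpha < 1 and 3(1 - alpha) <= beta <= 1
    return gamma0 / n**alpha, sigma0 / np.sqrt(n**(alpha + beta))

def gains_robbins_monro(n, gamma0=1.0, alpha=0.8):
    # Robbins-Monro: gamma_n = sigma_n with 3/4 <= alpha < 1; then J(n) = n
    g = gamma0 / n**alpha
    return g, g

def gains_kiefer_wolfowitz(n, gamma0=1.0, alpha=0.9, tau=0.2):
    # Kiefer-Wolfowitz: sigma_n = n^tau gamma_n with 1/6 < tau < alpha/2
    # and 3/4 + tau/2 <= alpha < 1; then beta = alpha - 2 tau
    g = gamma0 / n**alpha
    return g, n**tau * g

# Check J(n) = sigma0^2 n^{1+alpha-beta} / (gamma0^2 (1+alpha-beta)) against
# the defining integral (9), J(n) = int_0^n [gamma(s) v(s)]^{-1} ds, where
# v = gamma / sigma^2, so that [gamma(s) v(s)]^{-1} = (sigma(s) / gamma(s))^2.
gamma0, sigma0, alpha, beta, T = 1.0, 1.0, 0.8, 0.6, 1000.0
s = np.linspace(1e-9, T, 200_001)
integrand = (sigma0 / np.sqrt(s**(alpha + beta)) / (gamma0 / s**alpha)) ** 2
J_num = float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(s)))
J_closed = sigma0**2 * T**(1 + alpha - beta) / (gamma0**2 * (1 + alpha - beta))
print(J_num, J_closed)  # the two values nearly agree
```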

Our first main result is the following quadratic strong law of large numbers.
Theorem 3 (quadratic strong law of large numbers). Assume (A1), (A2), (A3), and (A4.s) hold. Then, a.s. on Γ(z*),

lim_{n→∞} (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_k − z*)(Z̄_k − z*)^T = H^{−1}C(H^{−1})^T.

Corollary 4 (almost sure efficiency). Assume that (A1), (A2), (A3), and (A4.s) hold and that γ_n = σ_n. Then, a.s. on Γ(z*),

lim_{n→∞} (1/ln n) Σ_{k=1}^n (Z̄_k − z*)(Z̄_k − z*)^T = H^{−1}C(H^{−1})^T

and the almost sure asymptotic efficiency is obtained.


Remarks and examples.
(a) Under the assumptions of Theorem 3, Duflo [9] proved the following conditional CLT:

given Γ(z*),  Y_n = n[J(n)]^{−1/2} (Z̄_n − z*) →^D N(0, H^{−1}C(H^{−1})^T).    (11)

Moreover, the quadratic strong law of large numbers can clearly be rewritten as:

a.s. on Γ(z*),  lim_{n→∞} (1/ln n) Σ_{k=1}^n (1/k) [k[J(k)]^{−1/2}(Z̄_k − z*)][k[J(k)]^{−1/2}(Z̄_k − z*)]^T = H^{−1}C(H^{−1})^T.

Thus, the quadratic strong law of large numbers ensures that the logarithmic average of the Y_k Y_k^T converges a.s. toward the covariance matrix of the asymptotic distribution of (11).
(b) The averaged Robbins–Monro algorithm is asymptotically a.s. efficient on Γ(z*).
(c) In the case of the Kiefer–Wolfowitz algorithm, assumption (A4.s) requires that the parameter τ in (10) satisfies τ > 1/6, and we then have

a.s. on Γ(z*),  lim_{n→∞} ((1+2τ)/ln n) Σ_{k=1}^n (1/k^{2τ}) (Z̄_k − z*)(Z̄_k − z*)^T = H^{−1}C(H^{−1})^T.

However, we did not succeed in proving a quadratic strong law of large numbers when τ = 1/6. In view of remark (a), this is not surprising since, in this case, n[J(n)]^{−1/2}(Z̄_n − z*) is known to converge weakly to a N(m, H^{−1}C(H^{−1})^T) distribution, where m is a deterministic, usually nonzero constant.
The following corollary gives an estimator of the asymptotic covariance matrix H^{−1}C(H^{−1})^T, Z̄_n standing for z* in Theorem 3.

Corollary 5 (strongly consistent estimator of the asymptotic covariance). Set

Σ_n = (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_k − Z̄_n)(Z̄_k − Z̄_n)^T.

Under assumptions (A1), (A2), (A3), and (A4.s), Σ_n is a strongly consistent estimator of H^{−1}C(H^{−1})^T on Γ(z*).

Remark. The combination of (11) and Corollary 5 implies the following conditional CLT:

given Γ(z*),  n[J(n)]^{−1/2} Σ_n^{−1/2} (Z̄_n − z*) →^D N(0, I),

which permits the construction of confidence regions for z*.
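To illustrate this remark, the sketch below computes Σ_n and the resulting confidence ellipsoid from the averaged iterates in the Robbins–Monro case γ_n = σ_n, where J(k) = k and the weights k[J(k)]^{−1} equal 1; the helper names and the χ²-quantile value are our illustrative assumptions, not constructs of the paper.

```python
import numpy as np

# Sketch: Sigma_n from Corollary 5 in the Robbins-Monro case, where J(k) = k,
# so the weights k [J(k)]^{-1} are identically 1 (our simplifying assumption).
def covariance_estimate(z_bar):
    """z_bar: array of shape (n, d) holding the averaged iterates Z-bar_1..n."""
    n = z_bar.shape[0]
    dev = z_bar - z_bar[-1]           # Z-bar_k - Z-bar_n for k = 1, ..., n
    return (dev.T @ dev) / np.log(n)  # Sigma_n

# By the remark, n [J(n)]^{-1/2} Sigma_n^{-1/2} (Z-bar_n - z*) is asymptotically
# N(0, I); here n [J(n)]^{-1/2} = sqrt(n), so an asymptotic confidence region is
# {z : n (Z-bar_n - z)^T Sigma_n^{-1} (Z-bar_n - z) <= q}, q a chi-square(d)
# quantile (q = 5.99 below: the 95% quantile for d = 2, an illustrative choice).
def in_confidence_region(z, z_bar_n, sigma_n, n, q=5.99):
    diff = z_bar_n - z
    return n * diff @ np.linalg.solve(sigma_n, diff) <= q
```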

Our second main result is the following law of the iterated logarithm.
Theorem 6 (law of the iterated logarithm). Assume (A1), (A2), (A3), and (A4.w) hold. Then, for any vector u of R^d, we have, a.s. on Γ(z*),

lim sup_{n→∞} [n/√(2J(n) ln(ln n))] u^T(Z̄_n − z*) = − lim inf_{n→∞} [n/√(2J(n) ln(ln n))] u^T(Z̄_n − z*) = √(u^T H^{−1}C(H^{−1})^T u).


Moreover, we have, a.s. on Γ(z*),

∀u ∈ R^d,  lim sup_{n→∞} [n²/(2J(n) ln(ln n))] u^T(Z̄_n − z*)(Z̄_n − z*)^T u = u^T H^{−1}C(H^{−1})^T u.    (12)

Remarks.
(a) Referring to the order of the symmetric matrices, property (12) can be written as:

a.s. on Γ(z*),  lim sup_{n→∞} [n²/(2J(n) ln(ln n))] (Z̄_n − z*)(Z̄_n − z*)^T = H^{−1}C(H^{−1})^T.

(b) When γ_n = σ_n (in particular, for the Robbins–Monro algorithm), we obtain

a.s. on Γ(z*),  lim sup_{n→∞} [n/(2 ln(ln n))] (Z̄_n − z*)(Z̄_n − z*)^T = H^{−1}C(H^{−1})^T.

We see again that the asymptotic almost sure convergence rate of the averaged algorithm is optimal since the rate (2 ln(ln n))/n is known to be optimal and the limit covariance matrix H^{−1}C(H^{−1})^T is the smallest one (with respect to the order of the symmetric matrices).
(c) In the case of the Kiefer–Wolfowitz algorithm, we have

a.s. on Γ(z*),  lim sup_{n→∞} [(1+2τ)n^{1−2τ}/(2 ln(ln n))] (Z̄_n − z*)(Z̄_n − z*)^T = H^{−1}C(H^{−1})^T

for any τ satisfying (10), and here we can choose τ = 1/6, which ensures the optimal convergence rate of the averaged Kiefer–Wolfowitz algorithm.

(d) Theorems 3 and 6 extend previous results of Le Breton [15] and Le Breton and Novikov [16], [17]. They obtained Theorem 3 under the restriction that h is linear; under the same restriction, they obtained Theorem 6 in the unidimensional case (d = 1), whereas they obtained only an upper bound of (Z̄_n − z*) when d > 1.

Our last main result is the following a.s.CLT.
Theorem 7 (a.s.CLT). Assume (A1), (A2), (A3), and (A4.s) hold. Then, a.s. on Γ(z*),

(1/ln n) Σ_{k=1}^n (1/k) δ_{k[J(k)]^{−1/2}(Z̄_k − z*)} ⇒ N(0, H^{−1}C(H^{−1})^T).

The following corollary is a straightforward consequence of Theorems 3 and 7.
Corollary 8 (logarithmic strong law of large numbers). Assume (A1), (A2), (A3), and (A4.s) hold. Let φ : R^d → R be an almost everywhere continuous function such that, for a positive constant K, |φ(x)| ≤ K(1 + ‖x‖²). Then, a.s. on Γ(z*),

lim_{n→∞} (1/ln n) Σ_{k=1}^n (1/k) φ[k[J(k)]^{−1/2}(Z̄_k − z*)] = ∫ φ(x) dF(x),

where F is the N(0, H^{−1}C(H^{−1})^T) distribution.


3. Application to efficient recursive estimation. Let (Y_k) be a sequence of independent identically distributed random vectors absolutely continuous with respect to some positive σ-finite measure λ. Let us denote by f(θ, ·) the probability density of Y_k, where θ ∈ Θ, Θ is an open subset of R^d, and assume this statistical model to be regular [4].
According to the classical asymptotic theory, the maximum likelihood estimator θ*_n, which maximizes the likelihood of the sample (Y_1, ..., Y_n), is strongly consistent and asymptotically efficient, i.e.,

θ*_n → θ a.s.  and  √n (θ*_n − θ) →^D N(0, [I(θ)]^{−1}),

where I(θ) = E_θ([∇ ln f(θ, Y_k)][∇ ln f(θ, Y_k)]^T) is the Fisher information of the model. Touati [28] proved that θ*_n is also a.s. asymptotically efficient, i.e.,

(1/ln n) Σ_{k=1}^n (θ*_k − θ)(θ*_k − θ)^T → [I(θ)]^{−1} a.s.

However, the explicit computation of the maximum likelihood estimator is often impossible or very complicated, and some approximation procedure is then necessary. For instance, Newton's recursive estimator is given by

θ*_{n+1} = θ*_n + ([I(θ*_n)]^{−1}/n) ∇(ln f(θ*_n, Y_{n+1})),    (13)

i.e.,

θ*_{n+1} = θ*_n + (1/n) [h(θ*_n) + ε_{n+1}],

where h(t) = [I(t)]^{−1} ∫ ∇(ln f(t, x)) dF_θ(x).
Let us assume that there exists a constant b > 2 such that the function

t ↦ ∫ ‖∇(ln f(t, x))‖^b dF_θ(x)

exists and is bounded in the neighborhood of θ for each θ ∈ Θ.
It follows from a straightforward application of (2) and (3) (see [9, p. 168]) that

given {θ*_n → θ},  √n (θ*_n − θ) →^D N(0, [I(θ)]^{−1}),

and

a.s. on {θ*_n → θ},  (1/ln n) Σ_{k=1}^n (θ*_k − θ)(θ*_k − θ)^T → [I(θ)]^{−1}.

Thus, as soon as Newton's estimator is strongly consistent, it is also asymptotically efficient and a.s. asymptotically efficient.

However, the algorithm (13) requires at each step the computation of the inverse of the Fisher information matrix [I(·)]^{−1}. The use of an averaged algorithm does not require such a computation; for that, we proceed as follows. We determine (θ*_n) from the gradient algorithm

θ*_{n+1} = θ*_n + (1/n^α) ∇(ln f(θ*_n, Y_{n+1})),  3/4 ≤ α < 1,


and we compute the empirical mean θ̄*_n = (1/n) Σ_{k=1}^n θ*_k. It follows from a straightforward application of (11) and Theorem 3 that θ̄*_n has the same asymptotic properties as the θ*_n given by (13); more precisely, given the event {θ*_n → θ}, θ̄*_n is a strongly consistent estimator, asymptotically efficient and a.s. asymptotically efficient. Moreover, applying Theorems 6 and 7, we also have

a.s. on {θ*_n → θ},  lim sup_{n→∞} [n/(2 ln(ln n))] (θ̄*_n − θ)(θ̄*_n − θ)^T = [I(θ)]^{−1}

and

a.s. on {θ*_n → θ},  (1/ln n) Σ_{k=1}^n (1/k) δ_{√k(θ̄*_k − θ)} ⇒ N(0, [I(θ)]^{−1}).
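As a concrete (and deliberately simple) instance of this procedure, the sketch below runs the averaged gradient estimator on a Gaussian location model, our illustrative choice: f(θ, ·) is the N(θ, 1) density, so that ∇ ln f(θ, y) = y − θ and I(θ) = 1; the constants are assumptions.

```python
import numpy as np

# Averaged gradient estimator on a Gaussian location model (illustrative
# assumptions: f(theta, y) is the N(theta, 1) density, so that
# grad log f(theta, y) = y - theta and I(theta) = 1).
rng = np.random.default_rng(1)
theta_true, alpha, N = 0.5, 0.8, 200_000     # 3/4 <= alpha < 1

theta = 0.0      # theta*_0
theta_bar = 0.0  # empirical mean of theta*_1, ..., theta*_n
for n in range(1, N + 1):
    y = theta_true + rng.standard_normal()   # Y_{n+1}
    theta += (y - theta) / n**alpha          # gradient step, no [I(.)]^{-1}
    theta_bar += (theta - theta_bar) / n     # averaging step

# sqrt(N)(theta_bar - theta_true) should be roughly N(0, [I(theta)]^{-1}) = N(0, 1).
print(theta_bar, np.sqrt(N) * (theta_bar - theta_true))
```

Note that, unlike (13), no Fisher information (and, in the multidimensional case, no matrix inversion) enters the recursion; the averaging alone recovers the efficiency.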

4. Proofs. In view of assumptions (A1) and (A4), the algorithm (8) can be rewritten as

Z_{n+1} = Z_n + γ_n H(Z_n − z*) + γ_n r_{n+1} + σ_n ε_{n+1},

and we have

H(Z_n − z*) = (1/γ_n)[(Z_{n+1} − z*) − (Z_n − z*)] − r_{n+1} − (σ_n/γ_n) ε_{n+1}.

Let us define the sequences (T_n), (T̄_n), (M_n), and (K_n) by

T_n = Z_n − z*,  T̄_n = (1/n) Σ_{k=1}^n T_k,  M_{n+1} = Σ_{k=1}^n (σ_k/γ_k) ε_{k+1},

K_n = −T_1/γ_1 + T_{n+1}/γ_n − Σ_{k=2}^n T_k [1/γ_k − 1/γ_{k−1}].

Then we have

nHT̄_n = K_n − M_{n+1} − Σ_{k=1}^n r_{k+1}.    (14)
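For completeness, (14) is obtained by summing the identity for H(Z_k − z*) over k = 1, ..., n and applying Abel's summation by parts to the first sum; written out in LaTeX with the notation just introduced:

```latex
% Derivation of (14): sum H T_k = (1/gamma_k)(T_{k+1} - T_k) - r_{k+1} - (sigma_k/gamma_k) eps_{k+1}.
\begin{aligned}
nH\bar T_n = \sum_{k=1}^{n} HT_k
  &= \sum_{k=1}^{n} \frac{1}{\gamma_k}\,(T_{k+1}-T_k)
     - \sum_{k=1}^{n} \frac{\sigma_k}{\gamma_k}\,\varepsilon_{k+1}
     - \sum_{k=1}^{n} r_{k+1} \\
  &= \underbrace{\frac{T_{n+1}}{\gamma_n} - \frac{T_1}{\gamma_1}
     - \sum_{k=2}^{n} T_k\Bigl[\frac{1}{\gamma_k}-\frac{1}{\gamma_{k-1}}\Bigr]}_{K_n}
     \;-\; M_{n+1} \;-\; \sum_{k=1}^{n} r_{k+1}.
\end{aligned}
```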

The proofs of the results stated in section 2 are constructed in the following way. First we establish the almost sure asymptotic properties (a quadratic strong law of large numbers, a law of the iterated logarithm, and an a.s.CLT) of the sequence (M_n). Then we show that (K_n) and (Σ_{k=1}^n r_{k+1}) are small enough on Γ(z*) so that the properties obtained for (M_n) are also satisfied by the sequence (nHT̄_n).
Let us first show how we can strengthen assumptions (A2) and (A4.w).
Note that in order to establish an almost sure property on Γ(z*), it is sufficient to prove it a.s. on Γ_N = Γ(z*) ∩ {sup_{n≥N} ‖Z_n − z*‖ ≤ M} for any N such that P(Γ_N) ≠ 0.
Let Γ_{N,K} be the set of the trajectories of Γ_N such that, for a positive integer K,

sup_{n≥N} E(‖ε_{n+1}‖^b | F_n) ≤ K  and  sup_{n≥N} (√(J(n)) γ_n v(n) ‖r^{(1)}_{n+1}‖) ≤ K.


Since Γ_N equals ∪_K Γ_{N,K} up to a negligible set, it is sufficient, in order to prove a property on Γ_N, to establish it a.s. on Γ_{N,K} for each K such that P(Γ_{N,K}) ≠ 0.
According to a technique often used by Lai and Wei (see [14], for instance), we modify the algorithm (8), without changing it on Γ_{N,K}, in order to have, a.s. on the whole set Ω,

E(ε_{n+1} | F_n) = 0,  sup_{n≥N} E(‖ε_{n+1}‖^b | F_n) ≤ K,  and  sup_{n≥N} (√(J(n)) γ_n v(n) ‖r^{(1)}_{n+1}‖) ≤ K.    (15)

To this end, we replace r^{(1)}_{n+1} by r^{(1)}_{n+1} 1_{{√(J(n)) γ_n v(n) ‖r^{(1)}_{n+1}‖ ≤ K}} and ε_{n+1} by ε_{n+1} 1_{B_n} with

B_n = {E(ε_{n+1} | F_n) = 0 and E(‖ε_{n+1}‖^b | F_n) ≤ K}.

From now on, we shall assume that these modifications have been made. Moreover, replacing (Z_n) by (Z′_n) = (Z_{n+N}), we shall assume that (15) is fulfilled with N = 0, i.e., that the following condition holds: there exists K > 0 such that, a.s. on Ω,

E(ε_{n+1} | F_n) = 0,  sup_{n≥0} E(‖ε_{n+1}‖^b | F_n) ≤ K,  and  sup_{n≥0} (√(J(n)) γ_n v(n) ‖r^{(1)}_{n+1}‖) ≤ K.

In the same way, we strengthen assumption (A4.s) and assume that there exists ρ > (1 − α + β)/2 such that sup_{n≥0} (n^ρ ‖r^{(1)}_{n+1}‖) ≤ K a.s. on Ω.

We now state the three lemmas which give the almost sure properties of the square-integrable martingale (M_n); Lemmas 9 and 10 give, respectively, a law of the iterated logarithm and a quadratic strong law of large numbers for (M_n), whereas Lemma 11 establishes an a.s.CLT for the unidimensional sequence (u^T M_n) for any vector u of R^d.
Lemma 9 (law of the iterated logarithm for (M_n)). Assume (A2) and (A3). Then, for any vector u ∈ R^d,

lim sup_{n→∞} [2J(n) ln(ln n)]^{−1/2} u^T M_n = − lim inf_{n→∞} [2J(n) ln(ln n)]^{−1/2} u^T M_n = √(u^T C u) a.s.

In particular, ‖M_n‖² = O(J(n) ln(ln n)) a.s.
Lemma 10 (quadratic strong law of large numbers for (M_n)). Assume (A2) and (A3). Then,

lim_{n→∞} (1/ln n) Σ_{k=1}^n [kJ(k)]^{−1} M_k M_k^T = C a.s.

Lemma 11 (a.s.CLT for (u^T M_n)). Assume (A2) and (A3). Then, for any vector u ∈ R^d,

(1/ln n) Σ_{k=1}^n (1/k) δ_{[J(k)]^{−1/2} u^T M_k} ⇒ N(0, u^T C u) a.s.

Our proofs are now organized as follows. First, we show in section 4.1 how Theorems 3, 6, and 7 can be deduced from Lemmas 9, 10, and 11. Then, Corollary 5 is established in section 4.2. Finally, section 4.3 is devoted to the proofs of Lemmas 9, 10, and 11.
Throughout the proofs, L denotes a generic, increasing, and slowly varying function.


4.1. Proof of Theorems 3, 6, and 7. In the case where assumptions (A2) and (A3.b) hold, we have 1/b < min{1/2, (1/2a)[((a−1)β)/(1−α) − 1]}; throughout this subsection, we then set δ such that

1/b < δ < min{1/2, (1/2a)[((a−1)β)/(1−α) − 1]}.    (16)

4.1.1. Preliminaries. In this subsection, we establish two lemmas we shall use in the proofs of Theorems 3, 6, and 7.
Lemma 12. Assume (A1), (A2), and (A4.w) hold. Then, we have, a.s. on Γ(z*),
(i) under (A3.a), ‖K_n‖ = O[(1 + n^{α−β/2}) L(n)] and Σ_{k=1}^n ‖r^{(2)}_{k+1}‖ = O[(1 + n^{1−aβ/2}) L(n)];
(ii) under (A3.b), ‖K_n‖ = O[(1 + n^{δ(1−α)−β/2+α}) L(n)] and Σ_{k=1}^n ‖r^{(2)}_{k+1}‖ = O[(1 + n^{1+(a/2)[2δ(1−α)−β]}) L(n)], where δ is given by (16).

Proof of Lemma 12. In view of assumption (A3)(ii), we have

‖K_n‖ = O[1 + ‖T_{n+1}‖/γ_n + Σ_{k=2}^n ‖T_k‖ θ(k)]

with θ = (1/γ)′. Let us apply the following result proved in [21].
Result 1 (almost sure upper bounds of (Z_n − z*) on Γ(z*)). Assume (A1), (A2), and (A4.w) hold. Then, we have
(i) under (A3.a), ‖Z_n − z*‖ = O([v(n)]^{−1/2} [ln(Σ_{k=1}^n γ_k)]^{1/2}) a.s. on Γ(z*);
(ii) under (A3.b), for any ζ such that ζ > 1/b, ‖Z_n − z*‖ = O([v(n)]^{−1/2} (Σ_{k=1}^n γ_k)^ζ) a.s. on Γ(z*).
It follows that, under assumption (A3.a), we have

‖K_n‖ = O[1 + [ln(Σ_{k=1}^{n+1} γ_k)]^{1/2}/(γ_n √(v(n+1))) + Σ_{k=2}^n ([ln(Σ_{j=1}^k γ_j)]^{1/2} θ(k))/√(v(k))]
 = O[(1 + n^{α−β/2}) L(n)] a.s. on Γ(z*)

and, under assumption (A3.b),

‖K_n‖ = O[1 + (Σ_{k=1}^{n+1} γ_k)^δ/(γ_n √(v(n+1))) + Σ_{k=2}^n ((Σ_{j=1}^k γ_j)^δ θ(k))/√(v(k))]
 = O[(1 + n^{δ(1−α)+α−β/2}) L(n)] a.s. on Γ(z*).

On the other hand, since ‖r^{(2)}_{k+1}‖ = O(‖Z_k − z*‖^a), we deduce from Result 1 that, under assumption (A3.a),

Σ_{k=1}^n ‖r^{(2)}_{k+1}‖ = O[Σ_{k=1}^n ([ln(Σ_{j=1}^k γ_j)]^{a/2} [v(k)]^{−a/2})] = O[(1 + n^{1−aβ/2}) L(n)] a.s. on Γ(z*)


and, under assumption (A3.b),

Σ_{k=1}^n ‖r^{(2)}_{k+1}‖ = O[Σ_{k=1}^n ((Σ_{j=1}^k γ_j)^{aδ} [v(k)]^{−a/2})] = O[(1 + n^{1+(a/2)[2δ(1−α)−β]}) L(n)] a.s. on Γ(z*),

which concludes the proof of Lemma 12.
Lemma 13. Assume (A1), (A2), and (A3) hold. Then,
(i) under (A4.w), we have [J(n)]^{−1/2} (‖K_n‖ + Σ_{k=1}^n ‖r_{k+1}‖) = O(1) a.s. on Γ(z*);
(ii) under (A4.s), there exists c > 0 such that [J(n)]^{−1/2} (‖K_n‖ + Σ_{k=1}^n ‖r_{k+1}‖) = O(n^{−c}) a.s. on Γ(z*).
Proof of Lemma 13. The application of Lemma 12 leads to the following almost sure upper bounds on Γ(z*):

Sequence                                    Under assumption (A3.a)
[J(n)]^{−1/2} ‖K_n‖                         O[(n^{−(1−β+α)/2} + n^{−(1−α)/2}) L(n)]
[J(n)]^{−1/2} Σ_{k=1}^n ‖r^{(2)}_{k+1}‖     O[(n^{−(1−β+α)/2} + n^{[(1−α)−(a−1)β]/2}) L(n)]

Sequence                                    Under assumption (A3.b)
[J(n)]^{−1/2} ‖K_n‖                         O[(n^{−(1−β+α)/2} + n^{(δ−1/2)(1−α)}) L(n)]
[J(n)]^{−1/2} Σ_{k=1}^n ‖r^{(2)}_{k+1}‖     O[(n^{−(1−β+α)/2} + n^{[(1+2aδ)(1−α)−(a−1)β]/2}) L(n)]

Since all the exponents are strictly negative, there exists c_1 > 0 such that

[J(n)]^{−1/2} (‖K_n‖ + Σ_{k=1}^n ‖r^{(2)}_{k+1}‖) = O(n^{−c_1}) a.s. on Γ(z*).    (17)

Now, under assumption (A4.w), we have

[J(n)]^{−1/2} Σ_{k=1}^n ‖r^{(1)}_{k+1}‖ = O([J(n)]^{−1/2} Σ_{k=1}^n [√(J(k)) γ_k v(k)]^{−1}) a.s. on Γ(z*).

Since the function s ↦ [√(J(s)) γ(s) v(s)]^{−1} varies regularly with exponent −(1 − α + β)/2 > −1, we have [11, p. 281]

lim_{t→∞} t[√(J(t)) γ(t) v(t)]^{−1}/∫_0^t [√(J(s)) γ(s) v(s)]^{−1} ds = (1/2)[(1−β) + α],  (1/2)[(1−β) + α] ≠ 0.

It follows that

Σ_{k=1}^n [√(J(k)) γ_k v(k)]^{−1} = O(n[√(J(n)) γ_n v(n)]^{−1})
and  [J(n)]^{−1/2} Σ_{k=1}^n ‖r^{(1)}_{k+1}‖ = O(n[J(n) γ_n v(n)]^{−1}) a.s. on Γ(z*).

However, the function s ↦ [γ(s)v(s)]^{−1} varies regularly with exponent α − β > −1; thus

t[γ(t)v(t)]^{−1}/J(t) → (1−β) + α,  (1−β) + α ≠ 0,


which implies that

[J(n)]^{−1/2} Σ_{k=1}^n ‖r^{(1)}_{k+1}‖ = O(1) a.s. on Γ(z*).    (18)

The first part of Lemma 13 then follows from the combination of (17) and (18).
Now, under assumption (A4.s), we have

[J(n)]^{−1/2} Σ_{k=1}^n ‖r^{(1)}_{k+1}‖ = O([J(n)]^{−1/2} Σ_{k=1}^n k^{−ρ}) = O(n^{−(1+α−β)/2} L(n)[n^{1−ρ} + ln n])
 = O([n^{(1−α+β)/2−ρ} + n^{−(1+α−β)/2}] L(n)) = O(n^{−c_2}) with c_2 > 0,    (19)

and the second part of Lemma 13 follows from the combination of (17) and (19).

4.1.2. Proof of Theorem 6. In view of (14) and Lemma 13(i), we have, for any u ∈ R^d,

u^T nHT̄_n/√(2J(n) ln(ln n)) = −u^T M_{n+1}/√(2J(n) ln(ln n)) + u^T(K_n − Σ_{k=1}^n r_{k+1})/√(2J(n) ln(ln n))
 = −u^T M_{n+1}/√(2J(n) ln(ln n)) + o(1) a.s.

It then follows from Lemma 9 that, a.s. on Γ(z*),

lim sup_{n→∞} n u^T H(Z̄_n − z*)/√(2J(n) ln(ln n)) = − lim inf_{n→∞} n u^T H(Z̄_n − z*)/√(2J(n) ln(ln n)) = √(u^T C u)

and, replacing u by (H^T)^{−1} u (H^T is nonsingular), we deduce that, a.s. on Γ(z*),

lim sup_{n→∞} n u^T(Z̄_n − z*)/√(2J(n) ln(ln n)) = − lim inf_{n→∞} n u^T(Z̄_n − z*)/√(2J(n) ln(ln n)) = √(u^T H^{−1}C(H^T)^{−1} u),    (20)

which concludes the proof of the first assertion of Theorem 6.
Now, (20) implies that for any u ∈ R^d, a.s. on Γ(z*),

lim sup_{n→∞} [n²/(2J(n) ln(ln n))] u^T(Z̄_n − z*)(Z̄_n − z*)^T u = u^T H^{−1}C(H^T)^{−1} u

and, Q being a countable set, there exists a P-null set N such that ∀ω ∈ Γ(z*) \ N, ∀u ∈ Q^d,

lim sup_{n→∞} [n²/(2J(n) ln(ln n))] u^T(Z̄_n(ω) − z*)(Z̄_n(ω) − z*)^T u = u^T H^{−1}C(H^T)^{−1} u.    (21)

To conclude the proof of Theorem 6, we have to show that for any ω_0 ∈ Γ(z*) \ N and any v ∈ R^d,

lim sup_{n→∞} v^T Σ_n v = v^T Σ v,    (22)


where Σ_n = n²(Z̄_n(ω_0) − z*)(Z̄_n(ω_0) − z*)^T/(2J(n) ln(ln n)) and Σ = H^{−1}C(H^T)^{−1}.
Set ε > 0 and u ∈ Q^d such that ‖v − u‖ ≤ ε; we have

v^T(Σ_n − Σ)v ≤ u^T(Σ_n − Σ)u + ‖(v − u)^T(Σ_n − Σ)v + u^T(Σ_n − Σ)(v − u)‖,

lim sup_{n→∞} v^T(Σ_n − Σ)v ≤ lim sup_{n→∞} u^T(Σ_n − Σ)u + lim sup_{n→∞} [‖v − u‖(‖Σ_n‖ + ‖Σ‖)(‖u‖ + ‖v‖)]
 ≤ lim sup_{n→∞} [ε(‖Σ_n‖ + ‖Σ‖)(ε + 2‖v‖)] in view of (21).

However, (21) implies that ‖Σ_n‖ = O(1); thus

lim sup_{n→∞} v^T(Σ_n − Σ)v ≤ εC(ε + 2‖v‖), where C > 0.

It follows that lim sup_{n→∞} v^T Σ_n v ≤ v^T Σ v.
On the other hand, (21) implies that there exists a sequence of integers (t(n)) and n_0 ∈ N such that lim_{n→∞} t(n) = ∞ and ∀n ≥ n_0, ‖u^T(Σ_{t(n)} − Σ)u‖ ≤ ε. We then have, ∀n ≥ n_0,

‖v^T(Σ_{t(n)} − Σ)v‖ ≤ εC(ε + 2‖v‖).

Thus lim_{n→∞} v^T Σ_{t(n)} v = v^T Σ v and (22) is proved.

4.1.3. Proof of Theorem 3. In view of (14), we have

n²HT̄_n T̄_n^T H^T = (K_n − M_{n+1} − Σ_{k=1}^n r_{k+1})(K_n − M_{n+1} − Σ_{k=1}^n r_{k+1})^T = M_{n+1}M_{n+1}^T + R_n,

where R_n = n²HT̄_n T̄_n^T H^T − M_{n+1}M_{n+1}^T; thus

(1/ln n) Σ_{k=1}^n k[J(k)]^{−1} HT̄_k T̄_k^T H^T = (1/ln n) Σ_{k=1}^n [kJ(k)]^{−1} M_{k+1}M_{k+1}^T + (1/ln n) Σ_{k=1}^n [kJ(k)]^{−1} R_k.

Lemmas 9 and 12 and assumption (A4.s) lead to the following almost sure upper bounds on Γ(z*):


Sequence                                        Under assumption (A3.a)
[J(k)]^{−1} ‖K_k‖²                              O[(k^{−(1−β+α)} + k^{−(1−α)}) L(k)]
[J(k)]^{−1} ‖Σ_{j=1}^k r^{(1)}_{j+1}‖²          O[(k^{−(1−β+α)} + k^{(1+β−α)−2ρ}) L(k)]
[J(k)]^{−1} ‖Σ_{j=1}^k r^{(2)}_{j+1}‖²          O[(k^{−(1−β+α)} + k^{(1−α)−(a−1)β}) L(k)]
[J(k)]^{−1} ‖K_k‖ ‖M_k‖                         O[(k^{−(1+α−β)/2} + k^{−(1−α)/2}) L(k)]
[J(k)]^{−1} ‖Σ_{j=1}^k r^{(1)}_{j+1}‖ ‖M_k‖     O[(k^{−(1+α−β)/2} + k^{(1+β−α)/2−ρ}) L(k)]
[J(k)]^{−1} ‖Σ_{j=1}^k r^{(2)}_{j+1}‖ ‖M_k‖     O[(k^{−(1+α−β)/2} + k^{[(1−α)−(a−1)β]/2}) L(k)]

Sequence                                        Under assumption (A3.b)
[J(k)]^{−1} ‖K_k‖²                              O[(k^{−(1−β+α)} + k^{(2δ−1)(1−α)}) L(k)]
[J(k)]^{−1} ‖Σ_{j=1}^k r^{(1)}_{j+1}‖²          O[(k^{−(1−β+α)} + k^{(1+β−α)−2ρ}) L(k)]
[J(k)]^{−1} ‖Σ_{j=1}^k r^{(2)}_{j+1}‖²          O[(k^{−(1−β+α)} + k^{(1+2aδ)(1−α)−(a−1)β}) L(k)]
[J(k)]^{−1} ‖K_k‖ ‖M_k‖                         O[(k^{−(1+α−β)/2} + k^{(δ−1/2)(1−α)}) L(k)]
[J(k)]^{−1} ‖Σ_{j=1}^k r^{(1)}_{j+1}‖ ‖M_k‖     O[(k^{−(1+α−β)/2} + k^{(1+β−α)/2−ρ}) L(k)]
[J(k)]^{−1} ‖Σ_{j=1}^k r^{(2)}_{j+1}‖ ‖M_k‖     O[(k^{−(1+α−β)/2} + k^{[(1+2aδ)(1−α)−(a−1)β]/2}) L(k)]

Since all the exponents are strictly negative, we deduce that Σ [kJ(k)]^{−1} ‖R_k‖ < ∞ a.s. on Γ(z*), and thus

(1/ln n) Σ_{k=1}^n k[J(k)]^{−1} HT̄_k T̄_k^T H^T = (1/ln n) Σ_{k=1}^n [kJ(k)]^{−1} M_{k+1}M_{k+1}^T + o(1) a.s. on Γ(z*).

Lemma 10 then implies

lim_{n→∞} (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} HT̄_k T̄_k^T H^T = C a.s. on Γ(z*)

and thus

lim_{n→∞} (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_k − z*)(Z̄_k − z*)^T = H^{−1}C(H^{−1})^T a.s. on Γ(z*).

4.1.4. Proof of Theorem 7. We have to prove that there exists a P-null set N such that ∀ω ∈ Γ(z*) \ N,

(1/ln n) Σ_{k=1}^n (1/k) δ_{k[J(k)]^{−1/2}(Z̄_k(ω) − z*)} ⇒ N(0, H^{−1}C(H^{−1})^T).

We first study the behavior of the characteristic functions of the random measures (1/ln n) Σ_{k=1}^n (1/k) δ_{k[J(k)]^{−1/2}(Z̄_k − z*)}.
Let u be any vector of R^d. In view of (14), we have

(1/ln n) Σ_{k=1}^n (1/k) exp[i[J(k)]^{−1/2} u^T(kHT̄_k)] = (1/ln n) Σ_{k=1}^n (1/k) exp[i[J(k)]^{−1/2} u^T(−M_{k+1} + K_k − Σ_{j=1}^k r_{j+1})].


Applying Lemma 13(ii), it follows that, a.s. on Γ(z*),

(1/ln n) Σ_{k=1}^n (1/k) exp[i[J(k)]^{−1/2} u^T(kHT̄_k)] = (1/ln n) Σ_{k=1}^n (1/k) exp[−i[J(k)]^{−1/2} u^T M_{k+1} + O(k^{−c})]
 = (1/ln n) Σ_{k=1}^n (1/k) exp[−i[J(k)]^{−1/2} u^T M_{k+1}] + o(1).

Thus, in view of Lemma 11,

lim_{n→∞} (1/ln n) Σ_{k=1}^n (1/k) exp[i(H^T u)^T (k[J(k)]^{−1/2} T̄_k)] = exp[−u^T C u/2].

It follows that, for any vector u in R^d, we have, a.s. on Γ(z*),

lim_{n→∞} (1/ln n) Σ_{k=1}^n (1/k) exp[i u^T(k[J(k)]^{−1/2}(Z̄_k − z*))] = exp[−u^T H^{−1}C(H^{−1})^T u/2].

Since Q is a countable set, there exists a P-null set N ⊂ Γ(z*) such that ∀ω ∈ Γ(z*) \ N, ∀u ∈ Q^d,

lim_{n→∞} (1/ln n) Σ_{k=1}^n (1/k) exp[i u^T(k[J(k)]^{−1/2}(Z̄_k(ω) − z*))] = exp[−u^T H^{−1}C(H^{−1})^T u/2].    (23)

Let us now set ω_0 ∈ Γ(z*) \ N and prove that the sequence of the deterministic measures (μ_n(ω_0)) defined by

μ_n(ω_0) = (1/ln n) Σ_{k=1}^n (1/k) δ_{k[J(k)]^{−1/2}(Z̄_k(ω_0) − z*)}

converges weakly to the N(0, H^{−1}C(H^{−1})^T) distribution.
Let μ_0 be a closure point of (μ_n(ω_0)) and μ_{p(n)}(ω_0) a subsequence such that μ_{p(n)}(ω_0) ⇒ μ_0. Since (μ_{p(n)}(ω_0)) is a bounded sequence of measures, μ_0 is a bounded measure; let φ_0 (respectively, φ_{p(n)}) be the characteristic function of μ_0 (respectively, μ_{p(n)}(ω_0)). We then have lim_{n→∞} φ_{p(n)}(u) = φ_0(u) for any u ∈ R^d, and, in view of (23), φ_0(u) = exp[−u^T H^{−1}C(H^{−1})^T u/2] for any u ∈ Q^d. However, the function φ_0 is continuous, thus φ_0(u) = exp[−u^T H^{−1}C(H^{−1})^T u/2] for any u ∈ R^d. We finally deduce that μ_0 is the N(0, H^{−1}C(H^{−1})^T) distribution and μ_n(ω_0) ⇒ N(0, H^{−1}C(H^{−1})^T), which concludes the proof of Theorem 7.

4.2. Proof of Corollary 5. The estimator Σ_n can be written as

Σ_n = (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_k − z*)(Z̄_k − z*)^T + (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_n − z*)(Z̄_n − z*)^T
 − (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_k − z*)(Z̄_n − z*)^T − (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_n − z*)(Z̄_k − z*)^T.


Applying Theorem 6, we have, a.s. on Γ(z*),

(Σ_{k=1}^n k[J(k)]^{−1}) ‖Z̄_n − z*‖² = O[(∫_1^n s[J(s)]^{−1} ds) J(n) ln(ln n)/n²].

Since the function s ↦ s[J(s)]^{−1} varies regularly with exponent β − α > −1, we have

lim_{n→∞} n²[J(n)]^{−1}/∫_1^n s[J(s)]^{−1} ds = 1 + β − α,  1 + β − α ≠ 0;

thus, a.s. on Γ(z*),

(Σ_{k=1}^n k[J(k)]^{−1}) ‖Z̄_n − z*‖² = O[ln(ln n)]

and

lim_{n→∞} (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_n − z*)(Z̄_n − z*)^T = 0.

On the other hand, applying Theorem 6 again, we obtain, a.s. on Γ(z*),

(Σ_{k=1}^n k[J(k)]^{−1} ‖Z̄_k − z*‖) ‖Z̄_n − z*‖ = O[(Σ_{k=1}^n [J(k)]^{−1/2} √(ln(ln k))) √(J(n) ln(ln n))/n]
 = O[(∫_1^n [J(s)]^{−1/2} √(ln(ln s)) ds) √(J(n) ln(ln n))/n].

Since the function s ↦ [J(s)]^{−1/2} √(ln(ln s)) varies regularly with exponent −(1 + α − β)/2 > −1, we have

lim_{n→∞} n[J(n)]^{−1/2} √(ln ln n)/∫_1^n [J(s)]^{−1/2} √(ln(ln s)) ds = (1/2)(1 − α + β),  (1/2)(1 − α + β) ≠ 0.

Thus, a.s. on Γ(z*),

(Σ_{k=1}^n k[J(k)]^{−1} ‖Z̄_k − z*‖) ‖Z̄_n − z*‖ = O(ln ln n)

and

lim_{n→∞} [(1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_k − z*)(Z̄_n − z*)^T + (1/ln n) Σ_{k=1}^n k[J(k)]^{−1} (Z̄_n − z*)(Z̄_k − z*)^T] = 0.

Applying Theorem 3, we finally deduce that lim_{n→∞} Σ_n = H^{−1}C(H^{−1})^T a.s. on Γ(z*).


4.3. Proof of Lemmas 9, 10, and 11. Throughout this subsection, Tr(A) denotes the trace of a matrix A, and ⟨M⟩_n the increasing process of a square-integrable martingale (M_n). Recall that ⟨M⟩_0 = I and E((M_{n+1} − M_n)(M_{n+1} − M_n)^T | F_n) = ⟨M⟩_{n+1} − ⟨M⟩_n.

4.3.1. Proof of Lemma 9. The proof of Lemma 9 is based upon the following adaptation of the law of the iterated logarithm of Stout [27] (see [8] or [10]).
Result 2 (law of the iterated logarithm for unidimensional martingales). Let (η_n) be a sequence of unidimensional random variables adapted to a filtration F such that

E(η_{n+1} | F_n) = 0 ∀n ≥ 0,  lim sup_{n→∞} E(|η_{n+1}|² | F_n) = c² < ∞,  and
∃ξ ∈ ]0, 1[ s.t. sup_{n≥0} E(|η_{n+1}|^{2(1+ξ)} | F_n) < +∞ a.s.

Let (Φ_n) be a sequence of unidimensional random variables adapted to F and set τ_n = Σ_{k=0}^n Φ_k²; if τ_∞ = +∞, Σ Φ_n^{2+2ξ} τ_n^{−1−ξ} < +∞, and Φ_n² = o(τ_n (ln ln τ_n)^{−1/ξ}) a.s., then

lim sup_{n→∞} [2τ_n ln(ln τ_n)]^{−1/2} Σ_{k=0}^n Φ_k η_{k+1} = − lim inf_{n→∞} [2τ_n ln(ln τ_n)]^{−1/2} Σ_{k=0}^n Φ_k η_{k+1} = c a.s.

Set u ∈ R^d; the application of Result 2 with Φ_n = σ_n γ_n^{−1} and η_{n+1} = u^T ε_{n+1} leads to

lim sup_{n→∞} Σ_{k=1}^n σ_k γ_k^{−1} u^T ε_{k+1}/[2(Σ_{k=1}^n σ_k² γ_k^{−2}) ln ln(Σ_{k=1}^n σ_k² γ_k^{−2})]^{1/2}
 = − lim inf_{n→∞} Σ_{k=1}^n σ_k γ_k^{−1} u^T ε_{k+1}/[2(Σ_{k=1}^n σ_k² γ_k^{−2}) ln ln(Σ_{k=1}^n σ_k² γ_k^{−2})]^{1/2} = √(u^T C u) a.s.

However, σ_k² γ_k^{−2} = [γ_k v(k)]^{−1}; thus Σ_{k=1}^n σ_k² γ_k^{−2} ∼ J(n) and we obtain

lim sup_{n→∞} [2J(n) ln(ln n)]^{−1/2} u^T M_n = − lim inf_{n→∞} [2J(n) ln(ln n)]^{−1/2} u^T M_n = √(u^T C u) a.s.

4.3.2. Proof of Lemma 10. The proof of Lemma 10 is based upon the following martingale version of a result established by Wei [30] for regressive sequences.
Lemma 14 (a strong law of large numbers for martingales). Let (M_n) be a d-dimensional, square-integrable martingale with respect to a filtration F. Set

H_n = Σ_{k=1}^n M_k^T [⟨M⟩_{k−1}^{−1} − ⟨M⟩_k^{−1}] M_k,
f_n = Tr(⟨M⟩_{n+1}^{−1/2} [⟨M⟩_{n+1} − ⟨M⟩_n] ⟨M⟩_{n+1}^{−1/2}) = d − Tr(⟨M⟩_n ⟨M⟩_{n+1}^{−1}),


F_n = f_0 + ··· + f_n,

and assume there exists a constant a > 1 such that

sup_n E(([M_{n+1} − M_n]^T ⟨M⟩_{n+1}^{−1} [M_{n+1} − M_n])^a | F_n) < ∞ a.s.

Then, a.s. on {F_n → +∞},

lim_{n→∞} (M_n^T ⟨M⟩_n^{−1} M_n + H_n)/F_n = 1.

Let us first take up briefly the outlines of the proof of Lemma 14. Setting V_n = M_n^T ⟨M⟩_n^{−1} M_n, we have

V_{n+1} = [M_n + (M_{n+1} − M_n)]^T ⟨M⟩_{n+1}^{−1} [M_n + (M_{n+1} − M_n)] = V_n − A_n + B_{n+1} + D_n + D_n^T

with A_n = M_n^T (⟨M⟩_n^{−1} − ⟨M⟩_{n+1}^{−1}) M_n, B_{n+1} = (M_{n+1} − M_n)^T ⟨M⟩_{n+1}^{−1} (M_{n+1} − M_n), and D_n = (M_{n+1} − M_n)^T ⟨M⟩_{n+1}^{−1} M_n. We deduce that

V_{n+1} = V_1 − H_n + Σ_{k=1}^n B_{k+1} + Σ_{k=1}^n (D_k^T + D_k).

Under the moment assumption, lim_{n→∞} (Σ_{k=1}^n B_{k+1})/F_n = 1 a.s. on {F_n → +∞}. On the other hand,

E(|D_n|² | F_n) = M_n^T ⟨M⟩_{n+1}^{−1} [⟨M⟩_{n+1} − ⟨M⟩_n] ⟨M⟩_{n+1}^{−1} M_n = O(A_n).

Thus the series Σ_{k=1}^∞ (D_k^T + D_k) converges a.s. on {H_∞ < +∞} and Σ_{k=1}^n (D_k^T + D_k) = o(H_n) a.s. on {H_n → +∞}. It follows that, a.s. on {F_n → +∞},

lim_{n→∞} (V_{n+1} + H_n + o(H_n) 1_{{H_n→+∞}})/F_n = 1

and thus

lim_{n→∞} (V_{n+1} + H_n)/F_n = 1,

which concludes the proof of Lemma 14.
We now prove Lemma 10. Let u be any nonzero vector of R^d, and set W_n = u^T M_n; (W_n) is a square-integrable martingale, whose increasing process ⟨W⟩_{n+1} = Σ_{k=1}^n σ_k² γ_k^{−2} u^T C_k u = Σ_{k=1}^n [γ_k v(k)]^{−1} u^T C_k u satisfies

⟨W⟩_{n+1} ∼ J(n) u^T C u.    (24)

We have

[W_{n+1} − W_n]^T ⟨W⟩_{n+1}^{−1} [W_{n+1} − W_n] = [u^T(M_{n+1} − M_n)]² ⟨W⟩_{n+1}^{−1} = (1/(γ_n v(n))) [u^T ε_{n+1}]² ⟨W⟩_{n+1}^{−1}.


Thus,

E(([W_{n+1} − W_n]^T ⟨W⟩_{n+1}^{−1} [W_{n+1} − W_n])^{b/2} | F_n) = (1/(γ_n v(n)))^{b/2} ⟨W⟩_{n+1}^{−b/2} E(|u^T ε_{n+1}|^b | F_n)

and, in view of (24),

E(([W_{n+1} − W_n]^T ⟨W⟩_{n+1}^{−1} [W_{n+1} − W_n])^{b/2} | F_n) = O([γ_n v(n) J(n)]^{−b/2}) = O[n^{−b/2} L(n)].    (25)

Since b > 2, (W_n) fulfills the moment assumption of Lemma 14, and we deduce that, a.s. on {Σ_{k=1}^n (1 − ⟨W⟩_k ⟨W⟩_{k+1}^{−1}) → +∞},

lim_{n→∞} [W_n^T ⟨W⟩_n^{−1} W_n + Σ_{k=1}^n W_k^T(⟨W⟩_{k−1}^{−1} − ⟨W⟩_k^{−1}) W_k]/[Σ_{k=1}^n (1 − ⟨W⟩_k ⟨W⟩_{k+1}^{−1})] = 1.    (26)

Now, in view of (24), (1 − ⟨W⟩_k ⟨W⟩_{k+1}^{−1}) ∼ 1 − J(k−1)[J(k)]^{−1}. Since J′ varies regularly with exponent α − β, we have J(k−1)[J(k)]^{−1} = 1 − (α − β + 1)k^{−1} + o(k^{−1}); thus

1 − ⟨W⟩_k ⟨W⟩_{k+1}^{−1} ∼ (α − β + 1)/k    (27)

and

Σ_{k=1}^n (1 − ⟨W⟩_k ⟨W⟩_{k+1}^{−1}) ∼ (α − β + 1) ln n.    (28)

On the other hand, Lemma 9 gives ‖W_n‖² = O(J(n) ln(ln n)) a.s. Using (24) again, we obtain ‖W_n^T ⟨W⟩_{n+1}^{−1} W_n‖ = O(ln(ln n)) and, in view of (28),

W_n^T ⟨W⟩_{n+1}^{−1} W_n = o[Σ_{k=1}^n (1 − ⟨W⟩_k ⟨W⟩_{k+1}^{−1})].    (29)

Finally,

Σ_{k=1}^n W_k^T [⟨W⟩_{k−1}^{−1} − ⟨W⟩_k^{−1}] W_k = Σ_{k=1}^n ⟨W⟩_{k−1}^{−1} [1 − ⟨W⟩_{k−1} ⟨W⟩_k^{−1}] W_k W_k^T

and, in view of (24) and (27),

Σ_{k=1}^n W_k^T [⟨W⟩_{k−1}^{−1} − ⟨W⟩_k^{−1}] W_k ∼ [(α − β + 1)/(u^T C u)] Σ_{k=1}^n [kJ(k)]^{−1} u^T M_k M_k^T u.    (30)

It follows from the combination of (26), (28), (29), and (30) that

lim_{n→∞} (1/ln n) Σ_{k=1}^n [kJ(k)]^{−1} u^T M_k M_k^T u = u^T C u a.s.    (31)


Let us define Σ_n by

Σ_n = (1/ln n)(Σ_{k=1}^n [kJ(k)]^{−1} M_k M_k^T) − C

and denote by Σ_n^{(i,j)} the coefficient of the ith row and jth column of Σ_n. Let (e_1, ..., e_d) be the canonical basis of R^d. It is easy to see that, for i, j ∈ {1, ..., d},

Σ_n^{(i,j)} = (1/2)[(e_i + e_j)^T Σ_n (e_i + e_j) − e_i^T Σ_n e_i − e_j^T Σ_n e_j].

Applying (31) to the three terms of the right-hand side of this equation, it follows that lim_{n→∞} Σ_n^{(i,j)} = 0 a.s. and thus lim_{n→∞} Σ_n = 0 a.s., which completes the proof of Lemma 10.

4.3.3. Proof of Lemma 11. The proof of Lemma 11 is based upon the following result proved by Chaabane [2].
Result 3 (an a.s.CLT for unidimensional martingales). Let (M_n) be a unidimensional square-integrable martingale with respect to a filtration F and assume there exists a > 1 such that

Σ_{n≥1} E(([M_{n+1} − M_n]^T ⟨M⟩_{n+1}^{−1} [M_{n+1} − M_n])^a | F_n) < ∞ a.s.

Then,

(1/ln⟨M⟩_n) Σ_{k=1}^n [(⟨M⟩_k − ⟨M⟩_{k−1})/⟨M⟩_k] δ_{⟨M⟩_k^{−1/2} M_k} ⇒ N(0, 1) a.s.

Set u ∈ R^d, u ≠ 0, and W_n = u^T M_n. In view of (25), (W_n) satisfies the assumption of Result 3. It follows that

(1/ln⟨W⟩_n) Σ_{k=1}^n [1 − ⟨W⟩_{k−1} ⟨W⟩_k^{−1}] δ_{⟨W⟩_k^{−1/2} W_k} ⇒ N(0, 1) a.s.

Then, in view of (24) and (27),

(1/((α − β + 1) ln n)) Σ_{k=1}^n [(α − β + 1)/k] δ_{[u^T C u · J(k)]^{−1/2} u^T M_k} ⇒ N(0, 1) a.s.

and thus

(1/ln n) Σ_{k=1}^n (1/k) δ_{[J(k)]^{−1/2} u^T M_k} ⇒ N(0, u^T C u) a.s.

This last property is also clearly satisfied by u = 0, and thus the proof of Lemma 11 is completed.


Acknowledgments. I gratefully acknowledge Marie Duflo for her invaluable help. I also deeply thank Abdelkader Mokkadem for his thoughtful comments and an anonymous referee whose suggestions have significantly improved the overall presentation.

REFERENCES

[1] A. Benveniste, M. Métivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximation, Springer-Verlag, New York, 1990.
[2] F. Chaabane, Version forte du théorème de la limite centrale fonctionnel pour les martingales, C. R. Acad. Sci. Paris Sér. I Math., 323 (1996), pp. 195–198.
[3] H. F. Chen, L. Guo, and A. J. Gao, Convergence and robustness of the Robbins–Monro algorithm truncated at randomly varying bounds, Stochastic Process. Appl., 27 (1988), pp. 217–231.
[4] D. Dacunha-Castelle and M. Duflo, Probability and Statistics, Vol. 1, Springer-Verlag, New York, 1986.
[5] B. Delyon and A. Juditsky, Stochastic optimization with averaging of trajectories, Stochastics Stochastics Rep., 39 (1992), pp. 107–118.
[6] B. Delyon and A. Juditsky, Stochastic Approximation with Averaging, preprint 952, Institut de Recherche en Informatique et Systèmes Aléatoires, Rennes, France, 1995.
[7] J. Dippon and J. Renz, Weighted means of processes in stochastic approximation, Math. Methods Statist., 5 (1996), pp. 32–60.
[8] M. Duflo, Méthodes Récursives Aléatoires, Masson, Paris, 1990. Random Iterative Models, Springer-Verlag, New York, 1997 (translation).
[9] M. Duflo, Algorithmes Stochastiques, Math. Appl. 23, Springer-Verlag, Berlin, 1996.
[10] M. Duflo, R. Senoussi, and A. Touati, Sur la loi des grands nombres pour les martingales vectorielles et l'estimateur des moindres carrés d'un modèle de régression, Ann. Inst. H. Poincaré Probab. Statist., 26 (1990), pp. 549–566.
[11] W. Feller, An Introduction to Probability Theory and Its Applications, Vol. 2, 3rd ed., Wiley, New York, 1968.
[12] H. J. Kushner and D. S. Clark, Stochastic Approximation for Constrained and Unconstrained Systems, Appl. Math. Sci. 26, Springer-Verlag, New York, 1978.
[13] H. J. Kushner and J. Yang, Stochastic approximation with averaging of the iterates: Optimal asymptotic rate of convergence for general processes, SIAM J. Control Optim., 31 (1993), pp. 1045–1062.
[14] T. Z. Lai and C. Z. Wei, A note on martingale difference sequences satisfying the local Marcinkiewicz–Zygmund condition, Bull. Inst. Math. Acad. Sinica, 11 (1983), pp. 1–13.
[15] A. Le Breton, About the averaging approach schemes for stochastic approximation, Math. Methods Statist., 2 (1993), pp. 295–315.
[16] A. Le Breton and A. Novikov, Averaging for estimating covariances in stochastic approximation, Math. Methods Statist., 3 (1994), pp. 244–266.
[17] A. Le Breton and A. Novikov, Some results about averaging in stochastic approximation, Metrika, 42 (1995), pp. 153–171.
[18] L. Ljung, G. Pflug, and H. Walk, Stochastic Approximation and Optimization of Random Systems, Birkhäuser, Basel, 1992.
[19] M. B. Nevelson and R. Z. Has'minskii, Stochastic Approximation and Recursive Estimation, Nauka, Moscow, 1972. Transl. Math. Monogr. 47, Amer. Math. Soc., Providence, RI, 1973.
[20] M. Pelletier, Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing, Ann. Appl. Probab., 8 (1998), pp. 10–44.
[21] M. Pelletier, On the almost sure asymptotic behaviour of stochastic algorithms, Stochastic Process. Appl., 78 (1998), pp. 217–244.
[22] M. Pelletier, An almost sure central limit theorem for stochastic approximation algorithms, J. Multivariate Anal., 71 (1999), pp. 76–93.
[23] B. T. Polyak, A new method of stochastic approximation type, Avtomat. i Telemekh., 7 (1990), pp. 98–107 (in Russian); Automat. Remote Control, 51 (1990), pp. 937–946 (in English).
[24] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control Optim., 30 (1992), pp. 838–855.
[25] D. Ruppert, Stochastic approximation, in Handbook in Sequential Analysis, B. K. Ghosh and P. K. Sen, eds., Marcel Dekker, New York, 1991, pp. 503–529.
[26] E. Seneta, Regularly Varying Functions, Lecture Notes in Math. 508, Springer-Verlag, New York, 1976.


[27] W. F. Stout, A martingale analogue of Kolmogorov's law of the iterated logarithm, Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 15 (1970), pp. 279–290.
[28] A. Touati, Sur les versions fortes du théorème de la limite centrale, preprint 23, Université de Marne-la-Vallée, Marne-la-Vallée, France, 1995.
[29] C. Z. Wei, Asymptotic properties of least-squares estimates in stochastic regression models, Ann. Statist., 13 (1985), pp. 1498–1508.
[30] C. Z. Wei, Adaptive prediction by least squares predictors in stochastic regression models with applications to time series, Ann. Statist., 15 (1987), pp. 1667–1682.
[31] G. Yin, On extensions of Polyak's averaging approach to stochastic approximation, Stochastics Stochastics Rep., 36 (1991), pp. 245–264.
