
Robbins, Monro: A Stochastic Approximation Method

Robert Bassett
University of California, Davis

Student Run Optimization Seminar
Oct 10, 2017


Motivation

You are a carrot farmer. The amount of carrots that you plant plays a part in how much carrots cost in the store, and hence how much profit you make on your harvest.

• Too many carrots planted → glut in the market and low profit.

• Too few carrots planted → unexploited demand for carrots and low profit.

The connection between carrots planted and profit is complicated! We will model this as a stochastic optimization problem, in an attempt to deal with the random components of this model (weather, economic climate, transportation costs, etc.).


Goal: Find a carrot planting decision θ∗ which minimizes your expected loss.

• Model loss as a random function of your planting decision θ:
  Loss ∼ q(x, θ), with q known.

• x: random vector of external factors, independent of θ:
  x ∼ F(x), with F unknown.

• Carrot Planter's Problem (CPP):
  $$\min_{\theta \in \Theta}\; Q(\theta) = \mathbb{E}[q(x,\theta)]$$

• $q : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$

• $\Theta \subseteq \mathbb{R}^n$, the set of possible decisions.


$$\min_{\theta \in \Theta}\; Q(\theta) = \mathbb{E}[q(x,\theta)] \qquad \text{(CPP)}$$

Assumptions

1. With probability 1, q(x, θ) is strongly convex in θ, with uniform modulus K. That is,
   $$\langle \theta_1 - \theta_2,\, \nabla q(x,\theta_1) - \nabla q(x,\theta_2) \rangle \ge K\,\|\theta_1 - \theta_2\|^2 \quad x\text{-a.s.}$$

2. We have access to a random variable y ∼ P(y, θ) with the property that
   $$\mathbb{E}[y \mid \theta] = \nabla Q(\theta).$$

3. y has second moment uniformly bounded in θ:
   $$\int_Y \|y\|^2\, dP(y,\theta) \le C^2 \quad \forall\, \theta \in \Theta.$$

4. Q is differentiable, so (CPP) is equivalent to solving ∇Q(θ) = 0.
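As a concrete check of Assumption 1 (not on the slides), consider the squared loss $q(x,\theta) = \tfrac12\|A\theta - x\|^2$ that appears later: then $\nabla q(x,\theta) = A^T(A\theta - x)$, so
$$\langle \theta_1 - \theta_2,\, \nabla q(x,\theta_1) - \nabla q(x,\theta_2) \rangle = \|A(\theta_1 - \theta_2)\|^2 \ge \lambda_{\min}(A^T A)\,\|\theta_1 - \theta_2\|^2,$$
and the assumption holds with modulus $K = \lambda_{\min}(A^T A)$ whenever A has full column rank.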


(2) says we must have access to an unbiased estimator of the gradient.

For many common loss functions this is true, e.g. squared loss:
$$q(x,\theta) = \tfrac{1}{2}\,\|A\theta - x\|^2
\;\Rightarrow\; y = \nabla q(x,\theta) = A^T(A\theta - x) \text{ satisfies}$$
$$\mathbb{E}[y \mid \theta] = \mathbb{E}[\nabla q(x,\theta) \mid \theta] = \mathbb{E}[A^T(A\theta - x) \mid \theta] = \nabla Q(\theta).$$

The last equality follows from the mild regularity assumptions that allow us to interchange derivative and integral. Know them!
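To see the unbiasedness numerically, here is a small sketch (mine, not the speaker's; the matrix A, the mean of x, and the sample size are arbitrary choices) comparing the Monte Carlo average of $y = A^T(A\theta - x)$ against the exact gradient $\nabla Q(\theta) = A^T(A\theta - \mathbb{E}[x])$.

```python
import numpy as np

# Minimal sketch (not from the slides): Monte Carlo check that
# y = A^T (A theta - x) is an unbiased estimate of grad Q(theta)
# for Q(theta) = E[ 0.5 * ||A theta - x||^2 ].
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))
mu = rng.standard_normal(m)                  # E[x]
theta = rng.standard_normal(n)

grad_Q = A.T @ (A @ theta - mu)              # exact gradient: A^T (A theta - E[x])

x_samples = mu + rng.standard_normal((100_000, m))   # x ~ N(mu, I)
y_bar = ((A @ theta - x_samples) @ A).mean(axis=0)   # average stochastic gradient

print(np.max(np.abs(y_bar - grad_Q)))        # small; shrinks as the sample size grows
```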


• We would like to find a minimizer of Q(θ) without ever attempting to calculate the expectation that defines it.

• Why? It involves random factors that are nasty!

• Instead, we will build a sequence θ_n which depends on random samples from y ∼ P(y, θ) for different θ.

• Hope to generate iterates from samples which allow us to minimize the expectation.

• We want θ_n to be a consistent estimator of θ∗. That is,
  $$\forall\, \varepsilon > 0\ \ \exists\, N \text{ s.t. } n \ge N \Rightarrow P(\|\theta_n - \theta^*\|^2 > \varepsilon) < \varepsilon,$$
  i.e. θ_n → θ∗ in probability.


Facts of Life (give references!)

1. L²-convergence implies convergence in probability. That is,
   $$\mathbb{E}[\|\theta_n - \theta\|^2] \to 0 \;\Rightarrow\; \|\theta_n - \theta\| \to 0 \text{ in probability.}$$

2. The gradient of a convex function defines a monotone operator. That is, if $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, then
   $$\langle \nabla f(x) - \nabla f(y),\, x - y \rangle \ge 0.$$

3. If f(ξ, x) is convex in x for almost every ξ, then
   $$F(x) = \mathbb{E}[f(\xi, x)]$$
   is convex.
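For F.O.L. 1, the reference is presumably Chebyshev's (Markov's) inequality; a one-line justification:
$$P\bigl(\|\theta_n - \theta\| > \varepsilon\bigr) \;\le\; \frac{\mathbb{E}\bigl[\|\theta_n - \theta\|^2\bigr]}{\varepsilon^2} \;\longrightarrow\; 0 \quad \text{for every } \varepsilon > 0.$$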


Theorem. Let a_n be a positive sequence which satisfies:

• $\displaystyle\sum_{n=0}^{\infty} a_n^2 < \infty$

• $\displaystyle\sum_{n=1}^{\infty} \frac{a_n}{a_0 + \cdots + a_{n-1}} = \infty$

For some initial θ_0, define the sequence
$$\theta_{n+1} = \theta_n - a_n y_n,$$
where y_n ∼ P(y | θ_n). Then θ_n → θ∗ in probability.
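As a quick illustration of the iteration $\theta_{n+1} = \theta_n - a_n y_n$ (a minimal sketch, not the authors' code; the toy quadratic objective, the Gaussian noise, and the step size $a_n = 1/(n+1)$ are arbitrary choices):

```python
import numpy as np

# Toy problem: Q(theta) = 0.5 * (theta - 3)^2, so grad Q(theta) = theta - 3
# and theta* = 3.  The observed y_n is the gradient plus zero-mean noise,
# so E[y | theta] = grad Q(theta) as in Assumption 2.
rng = np.random.default_rng(1)

def noisy_grad(theta):
    return (theta - 3.0) + rng.standard_normal()

theta = 0.0
for n in range(100_000):
    a_n = 1.0 / (n + 1)            # square summable; satisfies the step-size conditions
    theta -= a_n * noisy_grad(theta)

print(theta)                       # close to theta* = 3
```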


Proof: Because of F.O.L. 1, we will instead show L²-convergence.


Define $b_n = \mathbb{E}[\|\theta_n - \theta^*\|^2]$. We want to show that $b_n \to 0$. We have:
$$\begin{aligned}
b_{n+1} &= \mathbb{E}[\|\theta_{n+1} - \theta^*\|^2] \\
&= \mathbb{E}\bigl[\mathbb{E}[\|\theta_{n+1} - \theta^*\|^2 \mid \theta_n]\bigr] \\
&= \mathbb{E}\left[\int_Y \|(\theta_n - \theta^*) - a_n y\|^2\, dP(y \mid \theta_n)\right] \\
&= b_n + a_n^2\, \mathbb{E}\left[\int_Y \|y\|^2\, dP(y \mid \theta_n)\right] - 2a_n\, \mathbb{E}[\langle \theta_n - \theta^*,\, \nabla Q(\theta_n)\rangle]
\end{aligned}$$
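A step compressed in the last equality: expanding the square inside the inner integral and using Assumption 2,
$$\int_Y \|(\theta_n - \theta^*) - a_n y\|^2\, dP(y \mid \theta_n) = \|\theta_n - \theta^*\|^2 + a_n^2 \int_Y \|y\|^2\, dP(y \mid \theta_n) - 2a_n \Bigl\langle \theta_n - \theta^*,\ \underbrace{\textstyle\int_Y y\, dP(y \mid \theta_n)}_{=\,\nabla Q(\theta_n)} \Bigr\rangle,$$
and taking the outer expectation over $\theta_n$ gives the last line.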


$$= b_n + a_n^2\, \underbrace{\mathbb{E}\left[\int_Y \|y\|^2\, dP(y \mid \theta_n)\right]}_{\le\, C^2 \text{ by Assumption 3}} -\; 2a_n\, \mathbb{E}[\langle \theta_n - \theta^*,\, \nabla Q(\theta_n)\rangle]$$


$$\le b_n + a_n^2 C^2 - 2a_n\, \underbrace{\mathbb{E}[\langle \theta_n - \theta^*,\, \nabla Q(\theta_n)\rangle]}_{\ge\, 0}$$

Because θ∗ ∈ argmin_θ Q(θ), we have ∇Q(θ∗) = 0, so
$$\mathbb{E}[\langle \theta_n - \theta^*,\, \nabla Q(\theta_n)\rangle] = \mathbb{E}[\langle \theta_n - \theta^*,\, \nabla Q(\theta_n) - \nabla Q(\theta^*)\rangle] \ge 0$$
by F.O.L. 2 and 3.


$$0 \le b_{n+1} \le b_n + a_n^2 C^2 - 2a_n\, \mathbb{E}[\langle \theta_n - \theta^*,\, \nabla Q(\theta_n)\rangle]$$

Summing,
$$0 \le b_{n+1} \le b_0 + C^2 \sum_{i=0}^{n} a_i^2 - 2\sum_{i=0}^{n} a_i\, \mathbb{E}[\langle \theta_i - \theta^*,\, \nabla Q(\theta_i)\rangle]$$
$$0 \le \sum_{i=0}^{n} a_i\, \mathbb{E}[\langle \theta_i - \theta^*,\, \nabla Q(\theta_i)\rangle] \le \frac{1}{2}\left(b_0 + C^2 \sum_{i=0}^{n} a_i^2\right)$$

The right-hand side is bounded in n, so the nonnegative series $\sum_i a_i\, \mathbb{E}[\langle \theta_i - \theta^*,\, \nabla Q(\theta_i)\rangle]$ converges. The one-step inequality then shows that $b_n - C^2\sum_{i<n} a_i^2 + 2\sum_{i<n} a_i\, \mathbb{E}[\langle \theta_i - \theta^*,\, \nabla Q(\theta_i)\rangle]$ is nonincreasing and bounded below, hence convergent; since both sums converge, taking the limit as $n \to \infty$ gives
$$0 \le \lim_{n\to\infty} b_n < \infty.$$


Great! This shows that $\lim_{n\to\infty} b_n$ exists. But is it equal to 0?

Consider the sequence
$$k_n = \frac{K}{a_1 + \cdots + a_{n-1}}.$$
By strong convexity (Assumption 1),
$$\mathbb{E}[\langle \theta_n - \theta^*,\, \nabla Q(\theta_n)\rangle] \ge K\, \mathbb{E}[\|\theta_n - \theta^*\|^2] = K b_n \ge k_n b_n$$
for large enough n (once $a_1 + \cdots + a_{n-1} \ge 1$).

We proved previously that $\sum_n a_n\, \mathbb{E}[\langle \theta_n - \theta^*,\, \nabla Q(\theta_n)\rangle]$ converges. Since $k_n$ and $b_n$ are both nonnegative, this gives
$$\sum_{n} a_n k_n b_n < \infty.$$


By the second assumption on a_n, we also have
$$\sum_{n} a_n k_n = K \sum_{n} \frac{a_n}{a_1 + \cdots + a_{n-1}} = \infty.$$


We have established:

1. $\displaystyle\sum_{i} a_i k_i b_i < \infty$

2. $\displaystyle\sum_{i} a_i k_i = \infty$

So... $b_n \to 0$!
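One way to spell out this final step (not written on the slide): if $\lim_n b_n = b > 0$, then $b_i \ge b/2$ for all large i, so $\sum_i a_i k_i b_i \ge \tfrac{b}{2}\sum_i a_i k_i = \infty$, contradicting (1). Hence $\lim_n b_n = 0$, i.e. θ_n → θ∗ in L² and therefore in probability by F.O.L. 1. $\square$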


So are there any sequences that satisfy the conditions in the theorem?

• $\displaystyle\sum_{n} a_n^2 < \infty$

• $\displaystyle\sum_{n} \frac{a_n}{a_1 + \cdots + a_{n-1}} = \infty$

Take $a_n = \frac{1}{n}$. This is square summable, and
$$\sum_{n} \frac{1/n}{1 + \tfrac{1}{2} + \cdots + \tfrac{1}{n-1}} \approx \sum_{n} \frac{1}{n \ln(n-1)} \ge \sum_{n} \frac{1}{n \ln n} = \infty,$$
where the last series diverges, e.g. by the integral test.
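A quick numerical sanity check of both conditions for $a_n = 1/n$ (a sketch; the cutoff N is an arbitrary choice):

```python
import numpy as np

# Numerical illustration of the two step-size conditions for a_n = 1/n.
N = 10**6
a = 1.0 / np.arange(1, N + 1)                 # a_1, ..., a_N

print(np.sum(a**2))                           # ~1.6449, approaching pi^2 / 6

partial = np.cumsum(a)                        # a_1 + ... + a_n
ratios = a[1:] / partial[:-1]                 # a_n / (a_1 + ... + a_{n-1}), n >= 2
print(np.sum(ratios))                         # grows (slowly, like ln ln N) without bound
```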


Extensions and loose ends

• The iterates actually converge with probability 1 (Blum).
• Convergence rates:
  • $\mathbb{E}[Q(\theta_n) - Q(\theta^*)] \in O(n^{-1})$ (with strong convexity)
  • $\mathbb{E}[Q(\theta_n) - Q(\theta^*)] \in O(n^{-1/2})$ (without strong convexity)
• $\sqrt{n}\,(\theta_n - \theta^*)$ is asymptotically normal (Sacks).
• The $\sqrt{n}$ rate cannot be beaten in the general convex case (Nemirovski et al.).