
Langevin Dynamics
Diffusions and their numerical approximation / Applications of Langevin algorithms

Loucas Pillaud-Vivien

November 7, 2019


Introduction

Sampling a distribution over a high-dimensional space is an important topic in computational statistics and machine learning.
Example of application: Bayesian inference for high-dimensional models.
Problems:

1. Most sampling techniques do not scale to high dimension (big d).

2. Nor to a large number of data points; recall that HMC needs the full gradient (big N).


Example: Bayesian setting

A Bayesian model is specified by:

1. the sampling distribution of the observed data, i.e. the likelihood Y ∼ L(·|θ);

2. a prior distribution p on the parameter space, θ ∈ Rd.

The inference is based on the posterior distribution

π(dθ) = p(dθ) L(Y|θ) / ∫ L(Y|u) p(du)

The normalizing constant is often intractable (the integral is over a high-dimensional space), so we can only compute:

π(dθ) ∝ p(dθ)L(Y |θ)
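
For concreteness, here is a minimal sketch (added to this write-up, not part of the original slides) of the unnormalized log-posterior for a hypothetical Gaussian prior and Gaussian likelihood; the model, the data, and the function names are illustrative assumptions.

```python
import numpy as np

def log_unnormalized_posterior(theta, Z, sigma_prior=1.0, sigma_noise=1.0):
    """log p(theta) + sum_i log L(z_i | theta), dropping theta-independent constants.

    Hypothetical model: prior N(0, sigma_prior^2 I) on theta, i.i.d. observations
    z_i ~ N(theta, sigma_noise^2 I). The normalizing constant of the posterior is
    never needed, exactly as on the slide above.
    """
    log_prior = -0.5 * np.sum(theta ** 2) / sigma_prior ** 2
    log_lik = -0.5 * np.sum((Z - theta) ** 2) / sigma_noise ** 2
    return log_prior + log_lik

# Toy usage: d-dimensional parameter, N observations.
rng = np.random.default_rng(0)
d, N = 10, 100
theta_true = rng.normal(size=d)
Z = theta_true + rng.normal(size=(N, d))
print(log_unnormalized_posterior(np.zeros(d), Z))
```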


Outline

1. Diffusions and their numerical approximation
   Setting
   Continuous time Markov process: diffusions
   Discretized Langevin diffusion

2. Applications of Langevin algorithms
   Sampling a strongly convex potential
   Stochastic Gradient Langevin Dynamics
   Non-convex learning via SGLD


Framework

We want to sample from the following measure, which has a density w.r.t. the Lebesgue measure known only up to a normalization factor:

dµ(x) = e−V(x) dx / ∫Rd e−V(y) dy

We assume that V is L-smooth, i.e. continuously differentiable and ∃L > 0 such that

‖∇V(x) − ∇V(y)‖ ≤ L‖x − y‖ for all x, y ∈ Rd.


Convergence to equilibrium for Diffusions

Let us consider the overdamped Langevin diffusion in Rd:

dXt = −∇V(Xt) dt + √2 dBt

L-smoothness of V gives existence and uniqueness of a solution.
Stationary measure: dµ(x) = e−V(x) dx / ∫Rd e−V(y) dy.
Semigroup: Pt(f)(x) = E[f(Xt) | X0 = x] −→ "law of Xt".
Infinitesimal generator: Lφ = ∆φ − ∇V · ∇φ.

One can verify that the semigroup follows the dynamics

d/dt Pt(f) = L Pt(f).

−→ Question: at what speed does it converge?


Convergence to equilibrium for Diffusions

Theorem (Poincaré implies convergence to equilibrium). With the notations above, the following propositions are equivalent:

µ satisfies a Poincaré inequality with constant P;
For all smooth f, Varµ(Pt(f)) ≤ e−2t/P Varµ(f) for all t ≥ 0.

Proof: Integration by parts formula (µ is reversible):

−∫ f (Lg) dµ = ∫ ∇f · ∇g dµ = −∫ (Lf) g dµ,

hence (the mean ∫ Pt(f) dµ = ∫ f dµ is constant in t)

d/dt Varµ(Pt(f)) = d/dt ∫ (Pt(f))² dµ = 2 ∫ Pt(f) (L Pt(f)) dµ
                 = −2 ∫ ‖∇Pt(f)‖² dµ
                 ≤ −(2/P) Varµ(Pt(f)).

Grönwall's lemma then gives Varµ(Pt(f)) ≤ e−2t/P Varµ(f).


Poincaré inequalities: definition in modern language

Definition (Poincaré inequality). µ ∈ P(Rd) satisfies a Poincaré inequality with constant P if

Varµ(f) ≤ P ∫ ‖∇f‖² dµ,

for all (bounded) f : Rd −→ R of class C1.

Recall that:

Varµ(f) = ∫ f² dµ − (∫ f dµ)² = ∫ (f − ∫ f dµ)² dµ

∫ ‖∇f‖² dµ = E(f) is the Dirichlet energy.

Spectral interpretation: E(f) = ∫ ∇f · ∇f dµ = ∫ f (−Lf) dµ
−→ 1/P = λ2, the first non-trivial eigenvalue of −L.


Application to the Ornstein-Uhlenbeck process

The Ornstein-Uhlenbeck process follows the SDE in Rd:

dXt = −Xt dt + √2 dBt

Denote by L the operator Lφ = ∆φ − x · ∇φ. Then:

1. For dµ(x) = (2π)−d/2 e−‖x‖²/2 dx, L is self-adjoint in L²(µ).

2. µ is the stationary measure of the O-U process.

3. µ satisfies a Poincaré inequality with constant 1.

4. For all smooth f and all t ≥ 0, Varµ(Pt(f)) ≤ e−2t Varµ(f).
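
A minimal numerical sketch of this exponential relaxation (added here, not from the slides): simulate the O-U process with the Euler-Maruyama scheme and watch the mean decay like e−t while the per-coordinate variance approaches 1, i.e. the law of Xt converges to the standard Gaussian. The function name and parameters are illustrative assumptions.

```python
import numpy as np

def simulate_ou(x0, T, dt=1e-3, n_paths=20_000, seed=0):
    """Euler-Maruyama simulation of dX_t = -X_t dt + sqrt(2) dB_t started from X_0 = x0."""
    rng = np.random.default_rng(seed)
    X = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
    for _ in range(int(T / dt)):
        X += -X * dt + np.sqrt(2 * dt) * rng.normal(size=X.shape)
    return X

# Started from the point mass at (3, ..., 3): E[X_t] = 3 e^{-t} and
# Var[X_t] = 1 - e^{-2t} per coordinate, so the law relaxes to N(0, I).
x0 = 3.0 * np.ones(2)
for T in [0.5, 1.0, 2.0, 4.0]:
    X = simulate_ou(x0, T)
    print(T, X.mean(axis=0), X.var(axis=0))
```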


Poincaré inequalities

Long story short:

Poincaré inequality ⇐⇒ spectral gap for L ⇐⇒ exponential convergence to equilibrium for the diffusion


Poincaré inequalities

For which distributions do they hold?

When V is m-strongly convex: P = 1/m (compare with the linear convergence of gradient descent).

When V is only convex: a Poincaré inequality holds, but with no explicit bound on P...

A generic condition for a not necessarily convex potential:

(1/2)|∇V|² − ∆V ≥ α

For a mixture of Gaussians, P explodes exponentially.


OK, fine. But how do I get back to the real world and draw samples?


Discretized Langevin Diffusion

Idea: sample the diffusion paths using the Euler-Maruyama scheme:

dXt = −∇V(Xt) dt + √2 dBt

Xk+1 = Xk − γk+1 ∇V(Xk) + √(2γk+1) ξk+1

where
(ξk)k is i.i.d. N(0, Id),
(γk)k is a sequence of stepsizes, either constant or decreasing to 0.

Note the similarity with gradient descent and its stochastic counterpart.
This algorithm is referred to as the Unadjusted Langevin Algorithm (ULA), Langevin Monte Carlo, or Gradient Langevin Dynamics.
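
A minimal sketch of the scheme (added for concreteness; the constant-stepsize version and the Gaussian example target are illustrative assumptions):

```python
import numpy as np

def ula(grad_V, x0, gamma, n_iters, seed=0):
    """Unadjusted Langevin Algorithm: X_{k+1} = X_k - gamma grad V(X_k) + sqrt(2 gamma) xi_{k+1}."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iters, x.size))
    for k in range(n_iters):
        x = x - gamma * grad_V(x) + np.sqrt(2 * gamma) * rng.normal(size=x.size)
        samples[k] = x
    return samples

# Example target: standard Gaussian, V(x) = ||x||^2 / 2, so grad V(x) = x.
samples = ula(grad_V=lambda x: x, x0=np.zeros(5), gamma=0.05, n_iters=50_000)
# After a burn-in, the empirical mean is ~0 and the variance is ~1 (up to an O(gamma) bias).
print(samples[10_000:].mean(axis=0), samples[10_000:].var(axis=0))
```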


Discretized Langevin Diffusion: constant stepsize

When γk = γ for all k, (Xk)k is a homogeneous Markov chain with Markov kernel Rγ.
Under some mild assumptions, Rγ is irreducible and positive recurrent, hence has an invariant distribution dµγ ≠ dµ.
Typical questions:

For a given precision ε, how do we choose the stepsize γ and the number of iterations n such that

dist(δx Rγⁿ, dµ) ≤ ε ?

How do we choose x?
How do we quantify dist(dµγ, dµ)?
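
A worked Gaussian example of dµγ ≠ dµ (added here, not from the slides): for V(x) = x²/2 the ULA recursion reads Xk+1 = (1 − γ)Xk + √(2γ) ξk+1, which is Gaussian at stationarity with variance 1/(1 − γ/2) instead of the target value 1, so the invariant law of the chain carries an O(γ) bias.

```python
import numpy as np

def ula_stationary_variance(gamma, n_iters=200_000, burn_in=20_000, seed=0):
    """Empirical stationary variance of ULA for V(x) = x^2 / 2 with constant stepsize gamma."""
    rng = np.random.default_rng(seed)
    x, xs = 0.0, []
    for k in range(n_iters):
        x = (1.0 - gamma) * x + np.sqrt(2.0 * gamma) * rng.normal()
        if k >= burn_in:
            xs.append(x)
    return np.var(xs)

# The empirical variance matches 1 / (1 - gamma/2), not the target value 1:
# the smaller gamma, the closer d(mu_gamma) is to d(mu).
for gamma in [0.5, 0.1, 0.01]:
    print(gamma, ula_stationary_variance(gamma), 1.0 / (1.0 - gamma / 2.0))
```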


Outline

1. Diffusions and their numerical approximation
   Setting
   Continuous time Markov process: diffusions
   Discretized Langevin diffusion

2. Applications of Langevin algorithms
   Sampling a strongly convex potential
   Stochastic Gradient Langevin Dynamics
   Non-convex learning via SGLD


Result for a strongly convex potential

Theorem (Durmus, Moulines 2016). Assume that V is m-strongly convex and L-smooth. Set γ ∈ (0, 1/(m + L)] and κ = mL/(m + L); then for all x ∈ Rd,

W2²(δx Rγⁿ, π) ≤ 2(1 − κγ)ⁿ W2²(δx, π) + C d γ

Remarks:
Decomposition into bias + variance, as for SGD.
A geometric rate (first term), then the distance from dµγ to dµ (second term).
One may choose γ such that, for n = Θ(d/ε²) iterations, W2²(δx Rγⁿ, π) ≤ ε.
An explicit way of choosing γ (this used to be a problem, see MALA).
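
A short worked derivation of the Θ(d/ε²) scaling (added here; it targets accuracy ε in W2, i.e. ε² for the squared distance, and treats C and W2(δx, π) as fixed constants):

```latex
\[
  W_2^2(\delta_x R_\gamma^n, \pi)
  \;\le\; \underbrace{2(1-\kappa\gamma)^n\, W_2^2(\delta_x,\pi)}_{\text{initialization error}}
  \;+\; \underbrace{C d \gamma}_{\text{discretization bias}} .
\]
\[
  \text{Take } \gamma = \frac{\varepsilon^2}{2Cd}
  \;\Longrightarrow\; C d \gamma \le \frac{\varepsilon^2}{2},
  \qquad
  2(1-\kappa\gamma)^n\, W_2^2(\delta_x,\pi) \le 2 e^{-\kappa\gamma n}\, W_2^2(\delta_x,\pi) \le \frac{\varepsilon^2}{2}
  \;\text{ once }\;
  n \ge \frac{2Cd}{\kappa\varepsilon^2}\,\log\!\frac{4\,W_2^2(\delta_x,\pi)}{\varepsilon^2}
  = \Theta\!\Big(\frac{d}{\varepsilon^2}\Big).
\]
```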


Result for a strongly convex potential: remarks

Remarks:
Exactly the same results hold for:
  total variation (Dalalyan 2014),
  KL divergence (Bartlett et al. 2017).
The same result holds with decreasing step sizes, and then there is no parameter to tune!
Quadratic improvement by Jordan et al. 2018 by considering the underdamped Langevin diffusion (similar to HMC): W2²(δx Rγⁿ, π) ≤ ε after n = Θ(√d/ε) iterations (and only strong convexity outside of a ball is needed).


Grrrrr... But you know... I do not like to compute all the gradients...


Stochastic Gradient Langevin Dynamics (SGLD)

Recall: the ULA algorithm is a discretization of the overdamped Langevin diffusion, which leaves the target distribution dµ invariant.
To further reduce the computational cost, SGLD uses unbiased estimates of the gradient!
Initially proposed by Welling and Teh (2011).


SGLD algorithm

We are interested in situations where the distribution dµ arises as the posterior distribution in a Bayesian inference problem with prior dµ0 and a large number N ≫ 1 of i.i.d. observations zi with likelihoods p(zi|X):

dµ(X|z1, ..., zN) ∝ dµ0(X) ∏_{i=1}^{N} p(zi|X).

Denote, for i ∈ {1, ..., N},
Vi(X) = −log p(zi|X),
V0(X) = −log dµ0(X),
V = ∑_{i=0}^{N} Vi.

The cost of one ULA iteration is then Nd, which is prohibitively large.


SGLD algorithm

Welling and Teh suggested replacing ∇V with the unbiased estimate

∇V0 + (N/p) ∑_{i∈S} ∇Vi,

where S is a minibatch of size p.
A single SGLD update is thus (cost pd):

Xk+1 = Xk − γ (∇V0(Xk) + (N/p) ∑_{i∈Sk+1} ∇Vi(Xk)) + √(2γ) Zk+1

Same idea as SGD.
Two sources of randomness: the gradient estimates and the Gaussian noise added to sample.
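
A minimal sketch of this update (added for concreteness; the function names are assumptions, and grad_V0 / grad_Vi stand for ∇V0 and ∇Vi from the previous slide):

```python
import numpy as np

def sgld(grad_V0, grad_Vi, N, x0, gamma, n_iters, batch_size, seed=0):
    """Stochastic Gradient Langevin Dynamics with minibatch gradient estimates.

    grad_V0(x): gradient of the prior potential V_0.
    grad_Vi(x, i): gradient of the i-th data potential V_i, i in {0, ..., N-1}.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iters, x.size))
    for k in range(n_iters):
        S = rng.choice(N, size=batch_size, replace=False)          # minibatch S_{k+1}
        grad_est = grad_V0(x) + (N / batch_size) * sum(grad_Vi(x, i) for i in S)
        x = x - gamma * grad_est + np.sqrt(2 * gamma) * rng.normal(size=x.size)
        samples[k] = x
    return samples

# Toy usage with the Gaussian model from the Bayesian example:
# V_0(x) = ||x||^2 / 2 and V_i(x) = ||x - z_i||^2 / 2.
rng = np.random.default_rng(1)
d, N = 10, 1_000
Z = rng.normal(size=(N, d))
out = sgld(grad_V0=lambda x: x, grad_Vi=lambda x, i: x - Z[i],
           N=N, x0=np.zeros(d), gamma=1e-4, n_iters=5_000, batch_size=32)
```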


SGLD algorithm: need for variance reduction

Xk+1 = Xk − γ (∇V0(Xk) + (N/p) ∑_{i∈Sk+1} ∇Vi(Xk)) + √(2γ) Zk+1

Two sources of noise. For γ = γ0/N:

1. The noise from the gradient estimates is too big ⇒ no sampling.

2. Need to decrease the variance: assume x* is the unique minimizer of V,

Xk+1 = Xk − γ (∇V0(Xk) − ∇V0(x*) + (N/p) ∑_{i∈Sk+1} (∇Vi(Xk) − ∇Vi(x*))) + √(2γ) Zk+1

If γ = γ0/N, SGLD behaves like SGD; use variance control to actually sample.
Precise analysis in Moulines et al. (2018).
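
A sketch of the fixed-point control-variate estimator above, usable as a drop-in replacement for grad_est in the sgld sketch (x_star is assumed to be a precomputed minimizer of V, e.g. obtained by a few optimization passes beforehand):

```python
def sgld_fp_gradient(x, x_star, grad_V0, grad_Vi, S, N, batch_size):
    """Variance-reduced gradient estimate: since grad V(x*) = 0,
    grad V(x) = grad V0(x) - grad V0(x*) + sum_i (grad Vi(x) - grad Vi(x*)),
    and the sum over i is estimated on the minibatch S."""
    return (grad_V0(x) - grad_V0(x_star)
            + (N / batch_size) * sum(grad_Vi(x, i) - grad_Vi(x_star, i) for i in S))
```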


Non-convex Learning via SGLD

Classical learning problem: find the minimum of F(w) := EP[f(w, Z)], where f is not necessarily convex.
Call FZ(w) := (1/n) ∑_{i=1}^{n} f(w, zi).
Consider the Langevin diffusion and its associated discretization:

dXt = −∇FZ(Xt) dt + √(2β−1) dBt

Xk+1 = Xk − η ∇f(Xk, zk) + √(2ηβ−1) ξk

This converges to dµZ(dw) ∝ exp(−βFZ(w)); when β ∼ 1/T is large, it concentrates around the minimizers of FZ and hence of F.
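
A small illustration of this concentration (an added toy example, not from the slides; for simplicity it uses the exact gradient of a one-dimensional double-well FZ rather than a stochastic one):

```python
import numpy as np

def gld_double_well(beta, eta=1e-3, n_iters=200_000, seed=0):
    """Gradient Langevin Dynamics on F(w) = (w^2 - 1)^2 + 0.2 w, a double well
    whose global minimizer is near w = -1."""
    grad_F = lambda w: 4.0 * w * (w ** 2 - 1.0) + 0.2
    rng = np.random.default_rng(seed)
    w, traj = 0.0, np.empty(n_iters)
    for k in range(n_iters):
        w = w - eta * grad_F(w) + np.sqrt(2.0 * eta / beta) * rng.normal()
        traj[k] = w
    return traj

# Fraction of time spent in the well of the global minimizer (w < 0):
# it increases towards 1 as beta grows, i.e. the samples concentrate.
for beta in [1.0, 5.0, 20.0]:
    traj = gld_double_well(beta)
    print(beta, np.mean(traj[50_000:] < 0.0))
```

Started deep inside the other well instead, the chain typically needs a time of order exp(β × barrier height) to escape, which is exactly the metastability issue mentioned in the conclusion.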


Non-convex Learning via SGLD

Xk+1 = Xk − η ∇f(Xk, zk) + √(2ηβ−1) ξk

(Xk) converges to dµZ(w) ∝ exp(−βFZ(w)), with β ∼ 1/T.

Theorem (Raginsky, Rakhlin, Telgarsky (2018)). For k > ε⁻⁴ and η ≤ ε⁴,

E[F(Xk)] − F* ≤ cε + (β + d)²/n + (d/β) log(β + 1)

Sketch of proof: control of three terms:
how far the iterates are from the true diffusion and its invariant measure exp(−βFZ(w));
how far FZ is from F;
how close a sample from exp(−βFZ(w)) is to a minimizer of FZ, in terms of β.


Conclusion

We have seen how Langevin dynamics can be used to derive new algorithms for:

Sampling
Bayesian learning
Non-convex optimization

Problem with non-convexity: metastability of the Markov process −→ an old problem in computational chemistry: "Particles remain trapped in wells for a long time before escaping." There has been a huge effort in that community to tackle this problem.

Inspiration for Machine Learning?
