
Langevin Dynamics
Diffusions and their numerical approximation / Applications of Langevin algorithms

Loucas Pillaud-Vivien

November 7, 2019


Introduction

Sampling a distribution over a high-dimensional space is an important topic in computational statistics and machine learning.
Example of application: Bayesian inference for high-dimensional models.
Problems:

1. Most sampling techniques do not scale to high dimension (big d).

2. Nor to a large number of data points; recall that HMC needs the full gradient (big N).


Example: Bayesian setting

A Bayesian model is specified by:

1. the sampling distribution of the observed data, i.e. the likelihood Y ∼ L(·|θ);

2. a prior distribution p on the parameter space, θ ∈ Rd.

The inference is based on the posterior distribution

π(dθ) = p(dθ) L(Y|θ) / ∫ L(Y|u) p(du)

The normalizing constant is often intractable (the integral is over a high-dimensional space), so we can only compute:

π(dθ) ∝ p(dθ)L(Y |θ)
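
For concreteness, here is a minimal sketch (added to this write-up, not part of the original slides) of the unnormalized log-posterior for a hypothetical Gaussian prior and Gaussian likelihood; the model, the data, and the function names are illustrative assumptions.

```python
import numpy as np

def log_unnormalized_posterior(theta, Z, sigma_prior=1.0, sigma_noise=1.0):
    """log p(theta) + sum_i log L(z_i | theta), dropping theta-independent constants.

    Hypothetical model: prior N(0, sigma_prior^2 I) on theta, i.i.d. observations
    z_i ~ N(theta, sigma_noise^2 I). The normalizing constant of the posterior is
    never needed, exactly as on the slide above.
    """
    log_prior = -0.5 * np.sum(theta ** 2) / sigma_prior ** 2
    log_lik = -0.5 * np.sum((Z - theta) ** 2) / sigma_noise ** 2
    return log_prior + log_lik

# Toy usage: d-dimensional parameter, N observations.
rng = np.random.default_rng(0)
d, N = 10, 100
theta_true = rng.normal(size=d)
Z = theta_true + rng.normal(size=(N, d))
print(log_unnormalized_posterior(np.zeros(d), Z))
```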


Outline

1. Diffusions and their numerical approximation
   Setting
   Continuous time Markov process: diffusions
   Discretized Langevin diffusion

2. Applications of Langevin algorithms
   Sampling a strongly convex potential
   Stochastic Gradient Langevin Dynamics
   Non-convex learning via SGLD


Framework

We want to sample from the following measure, which has a density w.r.t. the Lebesgue measure known only up to a normalization factor:

dµ(x) = e−V(x) dx / ∫Rd e−V(y) dy

We assume that V is L-smooth, i.e. continuously differentiable and ∃L > 0 such that

‖∇V(x) − ∇V(y)‖ ≤ L‖x − y‖ for all x, y ∈ Rd.


Convergence to equilibrium for Diffusions

Let us consider the overdamped Langevin diffusion in Rd:

dXt = −∇V(Xt) dt + √2 dBt

L-smoothness of V gives existence and uniqueness of a solution.
Stationary measure: dµ(x) = e−V(x) dx / ∫Rd e−V(y) dy.
Semigroup: Pt(f)(x) = E[f(Xt) | X0 = x] −→ "law of Xt".
Infinitesimal generator: Lφ = ∆φ − ∇V · ∇φ.

One can verify that the semigroup follows the dynamics

d/dt Pt(f) = L Pt(f).

−→ Question: at what speed does it converge?


Convergence to equilibrium for Diffusions

Theorem (Poincaré implies convergence to equilibrium). With the notations above, the following propositions are equivalent:

µ satisfies a Poincaré inequality with constant P;
For all smooth f, Varµ(Pt(f)) ≤ e−2t/P Varµ(f) for all t ≥ 0.

Proof: Integration by parts formula (µ is reversible):

−∫ f (Lg) dµ = ∫ ∇f · ∇g dµ = −∫ (Lf) g dµ,

hence (the mean ∫ Pt(f) dµ = ∫ f dµ is constant in t)

d/dt Varµ(Pt(f)) = d/dt ∫ (Pt(f))² dµ = 2 ∫ Pt(f) (L Pt(f)) dµ
                 = −2 ∫ ‖∇Pt(f)‖² dµ
                 ≤ −(2/P) Varµ(Pt(f)).

Grönwall's lemma then gives Varµ(Pt(f)) ≤ e−2t/P Varµ(f).


Poincaré inequalities: definition in modern language

Definition (Poincaré inequality). µ ∈ P(Rd) satisfies a Poincaré inequality with constant P if

Varµ(f) ≤ P ∫ ‖∇f‖² dµ,

for all (bounded) f : Rd −→ R of class C1.

Recall that:

Varµ(f) = ∫ f² dµ − (∫ f dµ)² = ∫ (f − ∫ f dµ)² dµ

∫ ‖∇f‖² dµ = E(f) is the Dirichlet energy.

Spectral interpretation: E(f) = ∫ ∇f · ∇f dµ = ∫ f (−Lf) dµ
−→ 1/P = λ2, the first non-trivial eigenvalue of −L.


Application to the Ornstein-Uhlenbeck process

The Ornstein-Uhlenbeck process follows the SDE in Rd:

dXt = −Xt dt + √2 dBt

Denote by L the operator Lφ = ∆φ − x · ∇φ. Then:

1. For dµ(x) = (2π)−d/2 e−‖x‖²/2 dx, L is self-adjoint in L²(µ).

2. µ is the stationary measure of the O-U process.

3. µ satisfies a Poincaré inequality with constant 1.

4. For all smooth f and all t ≥ 0, Varµ(Pt(f)) ≤ e−2t Varµ(f).
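
A minimal numerical sketch of this exponential relaxation (added here, not from the slides): simulate the O-U process with the Euler-Maruyama scheme and watch the mean decay like e−t while the per-coordinate variance approaches 1, i.e. the law of Xt converges to the standard Gaussian. The function name and parameters are illustrative assumptions.

```python
import numpy as np

def simulate_ou(x0, T, dt=1e-3, n_paths=20_000, seed=0):
    """Euler-Maruyama simulation of dX_t = -X_t dt + sqrt(2) dB_t started from X_0 = x0."""
    rng = np.random.default_rng(seed)
    X = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
    for _ in range(int(T / dt)):
        X += -X * dt + np.sqrt(2 * dt) * rng.normal(size=X.shape)
    return X

# Started from the point mass at (3, ..., 3): E[X_t] = 3 e^{-t} and
# Var[X_t] = 1 - e^{-2t} per coordinate, so the law relaxes to N(0, I).
x0 = 3.0 * np.ones(2)
for T in [0.5, 1.0, 2.0, 4.0]:
    X = simulate_ou(x0, T)
    print(T, X.mean(axis=0), X.var(axis=0))
```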


Poincaré inequalities

Long story short:

Poincaré inequality ⇐⇒ spectral gap for L ⇐⇒ exponential convergence to equilibrium for the diffusion


Poincaré inequalities

For which distributions do they hold?

When V is m-strongly convex: P = 1/m (compare with the linear convergence of gradient descent).

When V is only convex: a Poincaré inequality holds, but with no explicit bound on P...

A generic condition for a not necessarily convex potential:

(1/2)|∇V|² − ∆V ≥ α

For a mixture of Gaussians, P explodes exponentially.


OK, fine. But how do I get back to the real world and draw samples?


Discretized Langevin Diffusion

Idea: sample the diffusion paths using the Euler-Maruyama scheme:

dXt = −∇V(Xt) dt + √2 dBt

Xk+1 = Xk − γk+1 ∇V(Xk) + √(2γk+1) ξk+1

where
(ξk)k is i.i.d. N(0, Id),
(γk)k is a sequence of stepsizes, either constant or decreasing to 0.

Note the similarity with gradient descent and its stochastic counterpart.
This algorithm is referred to as the Unadjusted Langevin Algorithm (ULA), Langevin Monte Carlo, or Gradient Langevin Dynamics.
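
A minimal sketch of the scheme (added for concreteness; the constant-stepsize version and the Gaussian example target are illustrative assumptions):

```python
import numpy as np

def ula(grad_V, x0, gamma, n_iters, seed=0):
    """Unadjusted Langevin Algorithm: X_{k+1} = X_k - gamma grad V(X_k) + sqrt(2 gamma) xi_{k+1}."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iters, x.size))
    for k in range(n_iters):
        x = x - gamma * grad_V(x) + np.sqrt(2 * gamma) * rng.normal(size=x.size)
        samples[k] = x
    return samples

# Example target: standard Gaussian, V(x) = ||x||^2 / 2, so grad V(x) = x.
samples = ula(grad_V=lambda x: x, x0=np.zeros(5), gamma=0.05, n_iters=50_000)
# After a burn-in, the empirical mean is ~0 and the variance is ~1 (up to an O(gamma) bias).
print(samples[10_000:].mean(axis=0), samples[10_000:].var(axis=0))
```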


Discretized Langevin Diffusion: constant stepsize

When γk = γ for all k, (Xk)k is a homogeneous Markov chain with Markov kernel Rγ.
Under some mild assumptions, Rγ is irreducible and positive recurrent, hence has an invariant distribution dµγ ≠ dµ.
Typical questions:

For a given precision ε, how do we choose the stepsize γ and the number of iterations n such that

dist(δx Rγⁿ, dµ) ≤ ε ?

How do we choose x?
How do we quantify dist(dµγ, dµ)?
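
A worked Gaussian example of dµγ ≠ dµ (added here, not from the slides): for V(x) = x²/2 the ULA recursion reads Xk+1 = (1 − γ)Xk + √(2γ) ξk+1, which is Gaussian at stationarity with variance 1/(1 − γ/2) instead of the target value 1, so the invariant law of the chain carries an O(γ) bias.

```python
import numpy as np

def ula_stationary_variance(gamma, n_iters=200_000, burn_in=20_000, seed=0):
    """Empirical stationary variance of ULA for V(x) = x^2 / 2 with constant stepsize gamma."""
    rng = np.random.default_rng(seed)
    x, xs = 0.0, []
    for k in range(n_iters):
        x = (1.0 - gamma) * x + np.sqrt(2.0 * gamma) * rng.normal()
        if k >= burn_in:
            xs.append(x)
    return np.var(xs)

# The empirical variance matches 1 / (1 - gamma/2), not the target value 1:
# the smaller gamma, the closer d(mu_gamma) is to d(mu).
for gamma in [0.5, 0.1, 0.01]:
    print(gamma, ula_stationary_variance(gamma), 1.0 / (1.0 - gamma / 2.0))
```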


Outline

1. Diffusions and their numerical approximation
   Setting
   Continuous time Markov process: diffusions
   Discretized Langevin diffusion

2. Applications of Langevin algorithms
   Sampling a strongly convex potential
   Stochastic Gradient Langevin Dynamics
   Non-convex learning via SGLD


Result for a strongly convex potential

Theorem (Durmus, Moulines 2016). Assume that V is m-strongly convex and L-smooth. Set γ ∈ (0, 1/(m + L)] and κ = mL/(m + L); then for all x ∈ Rd,

W2²(δx Rγⁿ, π) ≤ 2(1 − κγ)ⁿ W2²(δx, π) + C d γ

Remarks:
Decomposition into bias + variance, as for SGD.
A geometric rate (first term), then the distance from dµγ to dµ (second term).
One may choose γ such that, for n = Θ(d/ε²) iterations, W2²(δx Rγⁿ, π) ≤ ε.
An explicit way of choosing γ (this used to be a problem, see MALA).
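
A short worked derivation of the Θ(d/ε²) scaling (added here; it targets accuracy ε in W2, i.e. ε² for the squared distance, and treats C and W2(δx, π) as fixed constants):

```latex
\[
  W_2^2(\delta_x R_\gamma^n, \pi)
  \;\le\; \underbrace{2(1-\kappa\gamma)^n\, W_2^2(\delta_x,\pi)}_{\text{initialization error}}
  \;+\; \underbrace{C d \gamma}_{\text{discretization bias}} .
\]
\[
  \text{Take } \gamma = \frac{\varepsilon^2}{2Cd}
  \;\Longrightarrow\; C d \gamma \le \frac{\varepsilon^2}{2},
  \qquad
  2(1-\kappa\gamma)^n\, W_2^2(\delta_x,\pi) \le 2 e^{-\kappa\gamma n}\, W_2^2(\delta_x,\pi) \le \frac{\varepsilon^2}{2}
  \;\text{ once }\;
  n \ge \frac{2Cd}{\kappa\varepsilon^2}\,\log\!\frac{4\,W_2^2(\delta_x,\pi)}{\varepsilon^2}
  = \Theta\!\Big(\frac{d}{\varepsilon^2}\Big).
\]
```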


Result for a strongly convex potential: remarks

Remarks:
Exactly the same results hold for:
  total variation (Dalalyan 2014),
  KL divergence (Bartlett et al. 2017).
The same result holds with decreasing step sizes, and then there is no parameter to tune!
Quadratic improvement by Jordan et al. 2018 by considering the underdamped Langevin diffusion (similar to HMC): W2²(δx Rγⁿ, π) ≤ ε after n = Θ(√d/ε) iterations (and only strong convexity outside of a ball is needed).


Grrrrr... But you know... I do not like to compute all the gradients...


Stochastic Gradient Langevin Dynamics (SGLD)

Recall: the ULA algorithm is a discretization of the overdamped Langevin diffusion, which leaves the target distribution dµ invariant.
To further reduce the computational cost, SGLD uses unbiased estimates of the gradient!
Initially proposed by Welling and Teh (2011).


SGLD algorithm

We are interested in situations where the distribution dµ arises as the posterior distribution in a Bayesian inference problem with prior dµ0 and a large number N ≫ 1 of i.i.d. observations zi with likelihoods p(zi|X):

dµ(X|z1, ..., zN) ∝ dµ0(X) ∏_{i=1}^{N} p(zi|X).

Denote, for i ∈ {1, ..., N},
Vi(X) = −log p(zi|X),
V0(X) = −log dµ0(X),
V = ∑_{i=0}^{N} Vi.

The cost of one ULA iteration is then Nd, which is prohibitively large.


SGLD algorithm

Welling and Teh suggested replacing ∇V with the unbiased estimate

∇V0 + (N/p) ∑_{i∈S} ∇Vi,

where S is a minibatch of size p.
A single SGLD update is thus (cost pd):

Xk+1 = Xk − γ (∇V0(Xk) + (N/p) ∑_{i∈Sk+1} ∇Vi(Xk)) + √(2γ) Zk+1

Same idea as SGD.
Two sources of randomness: the gradient estimates and the Gaussian noise added to sample.
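
A minimal sketch of this update (added for concreteness; the function names are assumptions, and grad_V0 / grad_Vi stand for ∇V0 and ∇Vi from the previous slide):

```python
import numpy as np

def sgld(grad_V0, grad_Vi, N, x0, gamma, n_iters, batch_size, seed=0):
    """Stochastic Gradient Langevin Dynamics with minibatch gradient estimates.

    grad_V0(x): gradient of the prior potential V_0.
    grad_Vi(x, i): gradient of the i-th data potential V_i, i in {0, ..., N-1}.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iters, x.size))
    for k in range(n_iters):
        S = rng.choice(N, size=batch_size, replace=False)          # minibatch S_{k+1}
        grad_est = grad_V0(x) + (N / batch_size) * sum(grad_Vi(x, i) for i in S)
        x = x - gamma * grad_est + np.sqrt(2 * gamma) * rng.normal(size=x.size)
        samples[k] = x
    return samples

# Toy usage with the Gaussian model from the Bayesian example:
# V_0(x) = ||x||^2 / 2 and V_i(x) = ||x - z_i||^2 / 2.
rng = np.random.default_rng(1)
d, N = 10, 1_000
Z = rng.normal(size=(N, d))
out = sgld(grad_V0=lambda x: x, grad_Vi=lambda x, i: x - Z[i],
           N=N, x0=np.zeros(d), gamma=1e-4, n_iters=5_000, batch_size=32)
```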


SGLD algorithm: need for variance reduction

Xk+1 = Xk − γ (∇V0(Xk) + (N/p) ∑_{i∈Sk+1} ∇Vi(Xk)) + √(2γ) Zk+1

Two sources of noise. For γ = γ0/N:

1. The noise from the gradient estimates is too big ⇒ no sampling.

2. Need to decrease the variance: assume x* is the unique minimizer of V,

Xk+1 = Xk − γ (∇V0(Xk) − ∇V0(x*) + (N/p) ∑_{i∈Sk+1} (∇Vi(Xk) − ∇Vi(x*))) + √(2γ) Zk+1

If γ = γ0/N, SGLD behaves like SGD; use variance control to actually sample.
Precise analysis in Moulines et al. (2018).
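
A sketch of the fixed-point control-variate estimator above, usable as a drop-in replacement for grad_est in the sgld sketch (x_star is assumed to be a precomputed minimizer of V, e.g. obtained by a few optimization passes beforehand):

```python
def sgld_fp_gradient(x, x_star, grad_V0, grad_Vi, S, N, batch_size):
    """Variance-reduced gradient estimate: since grad V(x*) = 0,
    grad V(x) = grad V0(x) - grad V0(x*) + sum_i (grad Vi(x) - grad Vi(x*)),
    and the sum over i is estimated on the minibatch S."""
    return (grad_V0(x) - grad_V0(x_star)
            + (N / batch_size) * sum(grad_Vi(x, i) - grad_Vi(x_star, i) for i in S))
```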


Non-convex Learning via SGLD

Classical learning problem: find the minimum of F(w) := EP[f(w, Z)], where f is not necessarily convex.
Call FZ(w) := (1/n) ∑_{i=1}^{n} f(w, zi).
Consider the Langevin diffusion and its associated discretization:

dXt = −∇FZ(Xt) dt + √(2β−1) dBt

Xk+1 = Xk − η ∇f(Xk, zk) + √(2ηβ−1) ξk

This converges to dµZ(dw) ∝ exp(−βFZ(w)); when β ∼ 1/T is large, it concentrates around the minimizers of FZ and hence of F.
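
A small illustration of this concentration (an added toy example, not from the slides; for simplicity it uses the exact gradient of a one-dimensional double-well FZ rather than a stochastic one):

```python
import numpy as np

def gld_double_well(beta, eta=1e-3, n_iters=200_000, seed=0):
    """Gradient Langevin Dynamics on F(w) = (w^2 - 1)^2 + 0.2 w, a double well
    whose global minimizer is near w = -1."""
    grad_F = lambda w: 4.0 * w * (w ** 2 - 1.0) + 0.2
    rng = np.random.default_rng(seed)
    w, traj = 0.0, np.empty(n_iters)
    for k in range(n_iters):
        w = w - eta * grad_F(w) + np.sqrt(2.0 * eta / beta) * rng.normal()
        traj[k] = w
    return traj

# Fraction of time spent in the well of the global minimizer (w < 0):
# it increases towards 1 as beta grows, i.e. the samples concentrate.
for beta in [1.0, 5.0, 20.0]:
    traj = gld_double_well(beta)
    print(beta, np.mean(traj[50_000:] < 0.0))
```

Started deep inside the other well instead, the chain typically needs a time of order exp(β × barrier height) to escape, which is exactly the metastability issue mentioned in the conclusion.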


Non-convex Learning via SGLD

Xk+1 = Xk − η ∇f(Xk, zk) + √(2ηβ−1) ξk

(Xk) converges to dµZ(w) ∝ exp(−βFZ(w)), with β ∼ 1/T.

Theorem (Raginsky, Rakhlin, Telgarsky (2018)). For k > ε⁻⁴ and η ≤ ε⁴,

E[F(Xk)] − F* ≤ cε + (β + d)²/n + (d/β) log(β + 1)

Sketch of proof: control of three terms:
how far the iterates are from the true diffusion and its invariant measure exp(−βFZ(w));
how far FZ is from F;
how close a sample from exp(−βFZ(w)) is to a minimizer of FZ, in terms of β.


Conclusion

We have seen how Langevin dynamics can be used to derive new algorithms for:

Sampling
Bayesian learning
Non-convex optimization

Problem with non-convexity: metastability of the Markov process −→ an old problem in computational chemistry: "Particles remain trapped in wells for a long time before escaping." There has been a huge effort in that community to tackle this problem.

Inspiration for Machine Learning?
