RANDOM TOPICS: stochastic gradient descent & Monte Carlo

University of Maryland, tomg/course/cmsc764/L12_stochastic.pdf


Page 1:

RANDOM TOPICS
stochastic gradient descent & Monte Carlo

Page 2:

MASSIVE MODEL FITTING

least squares: $\text{minimize } \tfrac{1}{2}\|Ax - b\|^2 = \sum_i \tfrac{1}{2}(a_i x - b_i)^2$

SVM: $\text{minimize } \tfrac{1}{2}\|w\|^2 + h(LDw) = \tfrac{1}{2}\|w\|^2 + \sum_i h(l_i d_i w)$

low-rank factorization: $\text{minimize } \tfrac{1}{2}\|D - XY\|^2 = \sum_{ij} \tfrac{1}{2}(d_{ij} - x_i y_j)^2$

general form: $\text{minimize } f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, with $n$ big (over 100K)

Page 3:

THE BIG IDEA

For $\text{minimize } f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, the true gradient is

$\nabla f = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x)$

Idea: choose a subset of the data $\Omega \subset [1, N]$ (usually just one sample) and approximate

$\nabla f \approx g_\Omega = \frac{1}{|\Omega|}\sum_{i \in \Omega} \nabla f_i(x)$

Page 4:

INFINITE SAMPLE VS FINITE SAMPLE

infinite sample: $\text{minimize } f(x) = E_s[f_s(x)] = \int_s f_s(x)\,p(s)\,ds$

finite sample: $\text{minimize } f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$

We can solve the finite sample problem to high accuracy…
…but true accuracy is limited by the sample size.

Page 5:

SGD

select data → compute gradient → update

$g_k \approx \nabla f(x, d_8)$, $g_k \approx \nabla f(x, d_{12})$, instead of the full gradient $g_k = \frac{1}{M}\sum_{i=1}^M \nabla f(x, d_i)$

$x^{k+1} = x^k - \tau_k g_k$

Far from the solution, even a gradient with big error improves the solution; near the solution, a small error makes the solution worse. A minimal code sketch follows.
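
To make the select/compute/update loop concrete, here is a minimal SGD sketch in Python for the least-squares problem on page 2. It assumes numpy; the function name and the stepsize constants `a` and `c` are illustrative choices, not from the slides.

```python
import numpy as np

def sgd_least_squares(A, b, a=1.0, c=10.0, n_iters=10000, seed=0):
    """SGD sketch for minimize (1/2)||Ax - b||^2 = sum_i (1/2)(a_i x - b_i)^2,
    with the decreasing stepsize tau_k = a / (c + k) from the later slides."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(n_iters):
        i = rng.integers(n)                  # select one random sample
        g = (A[i] @ x - b[i]) * A[i]         # gradient of (1/2)(a_i x - b_i)^2
        x = x - a / (c + k) * g              # x_{k+1} = x_k - tau_k g_k
    return x

# usage: x_hat = sgd_least_squares(np.random.randn(500, 10), np.random.randn(500))
```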

Page 6:

SGD

select data → compute gradient → update: $g_k \approx \nabla f(x, d_8)$, $x^{k+1} = x^k - \tau_k g_k$

The gradient error must decrease as we approach the solution.

Classical solution: shrink the stepsize, $\lim_{k \to \infty} \tau_k = 0$. Shrinking the stepsize gives slow, $O(1/\sqrt{k})$, convergence.

Variance reduction: correct the error in the gradient approximations.

Page 7:

EXAMPLE: WITHOUT DECREASING STEPSIZE

$\text{minimize } \tfrac{1}{2}\|Ax - b\|^2$ with an inexact gradient.

[Plot: objective value vs. iteration for the inexact gradient.] What's happening? Why does this happen?

Page 8:

DECREASING STEP SIZE

$x^{k+1} = x^k - \tau_k \nabla f_k(x)$ with stepsize $\tau_k = \frac{a}{b + k}$: big stepsize early, small stepsize later. Used for strongly convex problems.

(Almost) equivalent to choosing a larger sample. Why?

Page 9:

AVERAGING

$x^{k+1} = x^k - \tau_k \nabla f_k(x)$, with $\tau_k = \frac{a}{\sqrt{k} + b}$, plus "ergodic" averaging:

$\bar{x}^{k+1} = \frac{1}{k+1} \sum x^i$

Used for weakly convex problems. Does this limit the convergence rate? Why is this bad?

Page 10:

AVERAGING

$\bar{x}^{k+1} = \frac{1}{k+1} \sum x^i$ (ergodic averaging) can be computed without storage:

$\bar{x}^{k+1} = \frac{k}{k+1} \bar{x}^k + \frac{1}{k+1} x^{k+1}$

Short memory version, with $\eta \ge 1$:

$\bar{x}^{k+1} = \frac{k}{k+\eta} \bar{x}^k + \frac{\eta}{k+\eta} x^{k+1}$

This trades variance for bias. A code sketch of the update follows.
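
A one-line sketch of the storage-free update, assuming numpy arrays (or plain floats); `running_average` is a hypothetical helper name.

```python
def running_average(xbar, x_new, k, eta=1.0):
    """One storage-free averaging step: eta = 1 reproduces plain ergodic
    averaging; eta > 1 is the short-memory version that weights recent
    iterates more heavily (trading variance for bias)."""
    return k / (k + eta) * xbar + eta / (k + eta) * x_new
```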

Page 11:

KNOWN RATES: WEAKLY CONVEX

Theorem. Suppose $f$ is convex, $\|\nabla f(x)\| \le G$, and the diameter of $\text{dom}(f)$ is less than $D$. If we use the stepsize

$\tau_k = \frac{c}{\sqrt{k}},$

then

$E[f(x^k) - f^\star] \le \left(\frac{D^2}{c} + cG^2\right)\frac{2 + \log k}{\sqrt{k}}.$

See Shamir and Zhang, ICML '13; Rakhlin, Shamir, Sridharan, CoRR '11.

Page 12:

KNOWN RATES: STRONGLY CONVEX

Theorem (Shamir and Zhang, ICML '13). Suppose $f$ is strongly convex with parameter $m$, and that $\|\nabla f(x)\| \le G$. If you use stepsize $\tau_k = 1/(mk)$ and the limited memory averaging

$\bar{x}^{k+1} = \frac{k}{k+\eta} \bar{x}^k + \frac{\eta}{k+\eta} x^{k+1}$

with $\eta \ge 1$, then

$E[f(\bar{x}^k) - f^\star] \le 58\left(1 + \frac{\eta}{k}\right)\left(\eta(\eta+1) + \frac{(\eta + 0.5)^3(1 + \log k)}{k}\right)\frac{G^2}{mk}.$

Page 13:

EXAMPLE: SVM

PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM

$\text{minimize } \tfrac{1}{2}\|w\|^2 + C\,h(Aw)$

$\nabla h(x) = \begin{cases} -1, & x < 1 \\ 0, & \text{otherwise} \end{cases}$

Note: this is a "subgradient" descent method.

Page 14:

PEGASOS

PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM

$\text{minimize } \frac{\lambda}{2}\|w\|^2 + \sum_i h(a_i w)$, with stepsize $\tau_k = \frac{1}{\lambda k}$.

While "not converged": pick a random training vector $a_k$, then
  If $a_k w^k < 1$: $w^{k+1} = w^k - \tau_k(\lambda w^k - a_k^T)$
  If $a_k w^k \ge 1$: $w^{k+1} = w^k - \tau_k \lambda w^k$

The averaging step $\bar{w}^{k+1} = \frac{1}{k+1}\sum_{i=1}^{k+1} w^i$ is not used in practice. A code sketch follows.
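
A runnable sketch of the loop above, assuming numpy and that each row $a_i$ has already been multiplied by its label (a common Pegasos convention, not stated on the slide):

```python
import numpy as np

def pegasos(A, lam=0.1, n_iters=100000, seed=0):
    """Pegasos-style subgradient sketch for min (lam/2)||w||^2 + sum_i h(a_i w),
    where h is the hinge loss and rows of A are assumed label-scaled."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for k in range(1, n_iters + 1):
        tau = 1.0 / (lam * k)                # stepsize tau_k = 1 / (lam k)
        i = rng.integers(n)                  # pick a random training vector a_i
        if A[i] @ w < 1:                     # margin violated: hinge term is active
            w = w - tau * (lam * w - A[i])
        else:                                # margin satisfied: only the regularizer
            w = w - tau * lam * w
    return w
```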

Page 15:

PEGASOS

[Plot: convergence of the gradient method vs. stochastic methods on the finite sample problem.]

Page 16:

PEGASOS

[Plot: classification error (CV) for the gradient method vs. stochastic methods.]

You hope to converge before SGD gets slow!

Page 17:

ALLREDUCE

…but if you don't: use SGD as a warm start for an iterative method.

Agarwal, "Effective Terascale Linear Learning," '12.

Page 18:

SGD

select data → compute gradient → update: $g_k \approx \nabla f(x, d_8)$, $x^{k+1} = x^k - \tau_k g_k$

The gradient error must decrease as we approach the solution.

The variance reduction solution: make the gradient more accurate and preserve fast convergence.

Page 19:

SGD + VARIANCE REDUCTION

select data → compute gradient → update: $g_k \approx \nabla f(x, d_8) - \text{error}_8$, $x^{k+1} = x^k - \tau_k g_k$

The gradient error must decrease as we approach the solution.

The variance reduction solution: make the gradient more accurate by subtracting an estimate of its error, preserving fast convergence.

Page 20:

VR APPROACHES

• SVRG (Johnson, Zhang, 2013): the original; requires full gradient computations.
• SAG (Le Roux, Schmidt, Bach, 2013) and SAGA (Defazio, Bach, Lacoste-Julien, 2014): avoid full gradient computations.
• Central VR, a VR approach targeting distributed ML: "Efficient Distributed SGD with Variance Reduction," ICDM 2016 (shameless self-promotion).
• …and many more.

Page 21:

SVRG

First epoch gradient tableau:

$\nabla f_1(x^1_m),\ \nabla f_2(x^2_m),\ \nabla f_3(x^3_m),\ \dots,\ \nabla f_{n-1}(x^{n-1}_m),\ \nabla f_n(x^n_m)$

Page 22:

SVRG

Gradient tableau: $\nabla f_1(x^1_m),\ \nabla f_2(x^2_m),\ \nabla f_3(x^3_m),\ \dots,\ \nabla f_n(x^n_m)$

Average gradient over the last epoch:

$g_m = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x^i_m)$

Page 23:

SVRG

Gradient tableau: $\nabla f_1(x^1_m),\ \dots,\ \nabla f_n(x^n_m)$, with average over the last epoch $g_m = \frac{1}{n}\sum_{i=1}^n \nabla f_i(x^i_m)$.

Corrected gradient = new gradient minus the error in the stored gradient:

$\nabla f_3(x^3_{m+1}) - \left(\nabla f_3(x^3_m) - g_m\right)$

A sketch of the full algorithm follows.
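
A sketch of the resulting algorithm, in the standard Johnson-Zhang form: each epoch stores a snapshot and its full gradient, and inner steps use the corrected gradient above. The `grad_i(x, i)` callback interface is an assumption.

```python
import numpy as np

def svrg(grad_i, x0, n, tau=0.01, n_epochs=20, seed=0):
    """SVRG sketch (Johnson & Zhang '13). grad_i(x, i) should return grad f_i(x).
    Inner steps use the corrected gradient grad f_i(x) - (grad f_i(x_snap) - g_snap)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_epochs):
        x_snap = x.copy()                                                # snapshot
        g_snap = np.mean([grad_i(x_snap, i) for i in range(n)], axis=0)  # full gradient
        for _ in range(n):                                               # one epoch
            i = rng.integers(n)
            g = grad_i(x, i) - (grad_i(x_snap, i) - g_snap)              # corrected gradient
            x = x - tau * g
    return x
```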

Page 24:

SVRG GETS BACK DETERMINISTIC RATES

Theorem (Johnson and Zhang, 2013). Suppose the objective terms are $m$-strongly convex with $L$-Lipschitz gradients. If the learning rate is small enough, then

$E[f(w_k) - f^\star] \le c^k \, [f(w_0) - f^\star]$

for some $c < 1$.

Page 25:

MONTE CARLO METHODS

Methods that involve randomly sampling a distribution.

[Image: the Monte Carlo casino. Monaco?]

Page 26:

BAYESIAN LEARNING

Goal: estimate parameters in a model, data = M(parameters): measure the data, estimate the parameters.

Ingredients…
• prior: probability distribution of the unknown parameters
• likelihood: probability of the data given the (unknown) parameters

[Image: Thomas Bayes.]

Page 27:

BAYES RULE

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

$P(\text{parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{parameters})\,P(\text{parameters})}{P(\text{data})}$

$P(\text{parameters} \mid \text{data}) \propto \underbrace{P(\text{data} \mid \text{parameters})}_{\text{likelihood}}\ \underbrace{P(\text{parameters})}_{\text{prior}}$

We really care about the left-hand side: the "posterior distribution."

Page 28:

EXAMPLE: LOGISTIC REGRESSION

$P(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters})\,P(\text{parameters})$

$P(y_i = 1 \mid x_i) = \frac{\exp(x_i^T w)}{1 + \exp(x_i^T w)}$

$P(y, X \mid w) = \prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)}$

Page 29:

EXAMPLE: LOGISTIC REGRESSION

$P(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters})\,P(\text{parameters})$

$P(w \mid y, X) \propto \left( \prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)} \right) P(w)$

Example prior: $\log P(w) = -|w|$

Taking logs (up to an additive constant):

$\log P(w \mid y, X) = \log P(w) + \sum_i -\log\left(1 + \exp(-y_i \cdot x_i^T w)\right)$
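
A small sketch of this log posterior, assuming numpy and labels $y_i \in \{-1, +1\}$; the `lam` parameter is an added knob (with `lam = 1` matching the $-|w|$ prior on the slide):

```python
import numpy as np

def log_posterior(w, X, y, lam=1.0):
    """Unnormalized log posterior from this slide: the prior log P(w) = -lam*|w|
    plus the logistic log likelihood sum_i -log(1 + exp(-y_i x_i^T w))."""
    log_prior = -lam * np.sum(np.abs(w))
    z = y * (X @ w)                              # margins y_i x_i^T w
    log_lik = -np.sum(np.logaddexp(0.0, -z))     # numerically stable -log(1 + exp(-z))
    return log_prior + log_lik
```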

Page 30:

BAYESIANS OFFER NO SOLUTIONS

$P(w \mid D, l)$ defines a whole solution space; optimization picks out a single solution. Monte Carlo methods randomly sample this solution space.

[Images: Bayes, Lagrange.]

Page 31:

WHY MONTE CARLO? You need to know more than just a max/minimizer

Beyond the optimization solution, you may need "error bars":

$E(w) = \int w\,P(w)\,dw$

$\mathrm{Var}(w) = E[(w - \mu)(w - \mu)^T] = \int w w^T P(w)\,dw - \mu\mu^T$

Page 32:

EXAMPLE: POKER BOTS. You can't solve your problem by other (faster) methods

maximize $\log P(w)$, but we don't know the derivative.

Poker bots: model parameters describe player behavior. There are 2.5M different community cards, and 10 players give 59K raise/fold/call permutations per round.

Page 33:

WHY MONTE CARLO? You're incredibly lazy

maximize $\log P(w)$: "I could differentiate this, I just don't wanna."

$\arg\max \log P(w) \approx \max_k \{P(w^k)\}$ over the samples $w^k$.

Page 34:

MARKOV CHAINS

What is an MC?
• Irreducible: can visit any state starting from any state.
• Aperiodic: does not get trapped in deterministic cycles.

An MC has a steady state if it is irreducible and aperiodic.

Page 35:

METROPOLIS HASTINGS

Ingredients:
• $q(y \mid x)$: proposal distribution
• $p(x)$: posterior distribution

MH Algorithm. Start with $x^0$. For $k = 1, 2, 3, \dots$:
  Choose candidate $y$ from $q(y \mid x^k)$.
  Compute the acceptance probability
  $\alpha = \min\left\{1,\ \frac{p(y)\,q(x^k \mid y)}{p(x^k)\,q(y \mid x^k)}\right\}$
  Set $x^{k+1} = y$ with probability $\alpha$; otherwise $x^{k+1} = x^k$.
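
A minimal sketch of this loop, assuming numpy, a symmetric Gaussian random-walk proposal (so the $q$ ratio cancels, the Metropolis case of page 37), and log densities for numerical stability:

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=10000, step=0.5, seed=0):
    """MH sketch with a symmetric random-walk proposal; log_p returns
    the (unnormalized) log posterior."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_samples, x.size))
    for k in range(n_samples):
        y = x + step * rng.standard_normal(x.size)   # candidate from q(y|x)
        log_alpha = min(0.0, log_p(y) - log_p(x))    # log acceptance probability
        if np.log(rng.uniform()) < log_alpha:        # accept with probability alpha
            x = y
        samples[k] = x
    return samples

# usage: samples = metropolis_hastings(lambda w: -0.5 * w @ w, x0=np.zeros(2))
```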

Page 36:

CONVERGENCE

Theorem. Suppose the support of $q$ contains the support of $p$. Then the MH sampler has a stationary distribution, and the distribution is equal to $p$.

• Irreducible: the support of $q$ must contain the support of $p$.
• Aperiodic: there must be states with positive rejection probability.

Page 37:

METROPOLIS ALGORITHM

Ingredients: $q(y \mid x)$, the proposal distribution; $p(x)$, the posterior distribution.

Assume the proposal is symmetric: $q(x^k \mid y) = q(y \mid x^k)$. Then Metropolis Hastings,

$\alpha = \min\left\{1,\ \frac{p(y)\,q(x^k \mid y)}{p(x^k)\,q(y \mid x^k)}\right\},$

reduces to Metropolis:

$\alpha = \min\left\{1,\ \frac{p(y)}{p(x^k)}\right\}.$

Page 38:

EXAMPLE: GMM

[Figure: histogram of Metropolis iterates for a Gaussian mixture. Andrieu, Freitas, Doucet, Jordan '03.]

Page 39:

PROPERTIES OF MH

Pros:
• We don't need a normalization constant for the posterior: instead of
  $P(\text{parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{parameters})\,P(\text{parameters})}{P(\text{data})}$,
  MH only needs
  $P(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters})\,P(\text{parameters})$.
• We can run many chains in parallel (cluster/GPU).
• We don't need any derivatives.

Page 40:

PROPERTIES OF MH

Cons:
• "Mixing time" depends on the proposal distribution: too wide = constant rejections = slow mixing; too narrow = short movements = slow mixing.
• Samples are only meaningful at the stationary distribution: "burn in" samples must be discarded, and many samples are needed because of correlations.

[Figure: a wide posterior with a narrow proposal centered at $x^k$. This is bad…]

Page 41:

SIMULATED ANNEALING

maximize $p(x)$

Easy choice: run MCMC, then take $\max_k p(x^k)$.

Better choice: sample the distribution $p^{\frac{1}{T_k}}(x^k)$, where $T_k$ is the "temperature."

Why is this better? Where does the name come from?
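
A sketch combining the Metropolis sampler with the tempered target $p^{1/T_k}$, using the logarithmic cooling schedule quoted on page 43; the constants `C`, `T0`, and `step` are tuning assumptions:

```python
import numpy as np

def simulated_annealing(log_p, x0, n_iters=20000, step=0.5, C=1.0, T0=2.0, seed=0):
    """Simulated annealing sketch: Metropolis on p^{1/T_k}, i.e. acceptance
    uses (log p(y) - log p(x)) / T_k, with T_k = 1 / (C log(k + T0))."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    best, best_val = x.copy(), log_p(x)
    for k in range(1, n_iters + 1):
        T = 1.0 / (C * np.log(k + T0))                 # cooling schedule T_k
        y = x + step * rng.standard_normal(x.size)     # symmetric proposal
        if np.log(rng.uniform()) < (log_p(y) - log_p(x)) / T:
            x = y
            if log_p(x) > best_val:                    # track best iterate seen
                best, best_val = x.copy(), log_p(x)
    return best
```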

Page 42:

COOLING SCHEDULE

[Figure: the tempered distribution at four temperatures: hot, warm, cool, cold. Andrieu, Freitas, Doucet, Jordan '03.]

Page 43:

CONVERGENCE

Theorem (Granville, "Simulated annealing: A proof of convergence"). Suppose the MCMC mixes fast enough that epsilon-dense sampling occurs in finite time starting at every temperature. For an annealing schedule with temperature

$T_k = \frac{1}{C \log(k + T_0)},$

simulated annealing converges to a global optimum with probability 1.

SA solves non-convex problems, even NP-complete problems, as time goes to infinity.

Page 44:

WHEN TO USE SIMULATED ANNEALING

[Flow chart: "Should I use simulated annealing?" → "no."]

There is no practical way to choose the temperature schedule:
• Too fast = stuck in a local minimum (risky).
• Too slow = no different from MCMC.
An act of desperation!

Page 45:

GIBBS SAMPLER

Want to sample $P(x_1, x_2, x_3)$. On stage $k$, pick some coordinate $j$ and propose

$q(y \mid x^k) = \begin{cases} p(y_j \mid x^k_{j^c}), & \text{if } y_{j^c} = x^k_{j^c} \\ 0, & \text{otherwise} \end{cases}$

Using $P(B) = \frac{P(A \text{ and } B)}{P(A \mid B)}$ with $p(y) = p(y_j \text{ and } y_{j^c})$, the acceptance probability is always one:

$\alpha = \frac{p(y)\,q(x^k \mid y)}{p(x^k)\,q(y \mid x^k)} = \frac{p(y)\,p(x^k_j \mid x^k_{j^c})}{p(x^k)\,p(y_j \mid y_{j^c})} = \frac{p(y_{j^c})}{p(x^k_{j^c})} = 1,$

where the last step uses $y_{j^c} = x^k_{j^c}$.

Page 46:

GIBBS SAMPLER

Want to sample $P(x_1, x_2, \dots, x_n)$. Iterates:

$x^2 \sim P(x_1 \mid x^1_2, x^1_3, x^1_4, \dots, x^1_n)$
$x^3 \sim P(x_2 \mid x^2_1, x^2_3, x^2_4, \dots, x^2_n)$
$x^4 \sim P(x_3 \mid x^3_1, x^3_2, x^3_4, \dots, x^3_n)$
$\cdots$
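
A sketch of this scan on a toy target where the conditionals are available in closed form: a bivariate Gaussian with correlation $\rho$, for which $x_1 \mid x_2 \sim N(\rho x_2,\, 1 - \rho^2)$ and symmetrically (a standard fact, not from the slides).

```python
import numpy as np

def gibbs_bivariate_gaussian(rho=0.9, n_samples=5000, seed=0):
    """Gibbs sketch: alternately resample each coordinate of a standard
    bivariate Gaussian with correlation rho from its exact conditional."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    for k in range(n_samples):
        x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()  # x1 | x2
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()  # x2 | x1
        samples[k] = (x1, x2)
    return samples
```

With $\rho$ near 1 the chain takes tiny steps, which previews the correlation problem discussed on page 53.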

Page 47:

APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM): binary (0/1) random variables $v$ (visible) and $h$ (hidden), coupled by weights $W$:

$E(v, h) = -a^T v - b^T h - v^T W h$

$P(v, h) = \frac{1}{Z} e^{-E(v,h)}$, where $Z$ is the "partition function."

Page 48:

APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM): $E(v, h) = -a^T v - b^T h - v^T W h$, $P(v, h) = \frac{1}{Z} e^{-E(v,h)}$.

Remove the normalization, aggregate the terms not involving $v_i$ into a constant $C$, cancel constants, and recognize the sigmoid function:

$P(v_i = 1 \mid h) = \frac{P(v_i = 1 \mid h)}{P(v_i = 0 \mid h) + P(v_i = 1 \mid h)} = \frac{\exp(a_i + \sum_j w_{ij} h_j + C)}{\exp(C) + \exp(a_i + \sum_j w_{ij} h_j + C)} = \frac{\exp(a_i + \sum_j w_{ij} h_j)}{1 + \exp(a_i + \sum_j w_{ij} h_j)} = \sigma\left(a_i + \sum_j w_{ij} h_j\right)$

Page 49:

BLOCK GIBBS FOR RBM

$P(v_i = 1 \mid h) = \sigma\left(a_i + \sum_j w_{ij} h_j\right)$
$P(h_j = 1 \mid v) = \sigma\left(b_j + \sum_i w_{ij} v_i\right)$

Stage 1: freeze the hidden units, randomly sample the visible units.
Stage 2: freeze the visible units, randomly sample the hidden units.
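
A sketch of the two-stage scan, assuming numpy; the random initialization and step count are arbitrary choices:

```python
import numpy as np

def block_gibbs_rbm(W, a, b, n_steps=1000, seed=0):
    """Block Gibbs sketch for an RBM with P(v_i=1|h) = sigmoid(a_i + sum_j w_ij h_j)
    and P(h_j=1|v) = sigmoid(b_j + sum_i w_ij v_i). W has shape (n_vis, n_hid)."""
    rng = np.random.default_rng(seed)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    n_vis, n_hid = W.shape
    v = rng.integers(0, 2, n_vis).astype(float)   # random binary initialization
    h = rng.integers(0, 2, n_hid).astype(float)
    for _ in range(n_steps):
        v = (rng.uniform(size=n_vis) < sigmoid(a + W @ h)).astype(float)    # stage 1
        h = (rng.uniform(size=n_hid) < sigmoid(b + W.T @ v)).astype(float)  # stage 2
    return v, h
```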

Page 50:

DEEP BELIEF NETS

DBN = layered RBM. Each layer only depends on the layer beneath it (feed forward). The probability for each hidden node is a sigmoid function of the layer below.

Page 51:

EXAMPLE: MNIST

Train a 3-layer DBN with 200 hidden units on 60K MNIST digits (Gan, Henao, Carlson, Carin '15).
• pre-training: layer-by-layer training
• training: train all weights simultaneously

Training is done using a Gibbs sampler; a Gibbs sampler is also used to explore the final solution.

Page 52:

EXAMPLE: MNIST

[Figures: training data; learned features; observations sampled from the deep belief network using a Gibbs sampler.]

Page 53:

WHAT'S WRONG WITH GIBBS?

It is susceptible to strong correlations between coordinates. (This is also a problem for MH: a bad proposal distribution.)

Page 54:

SLICE SAMPLER

Problems with MH:
• hard to choose the proposal distribution
• does not exploit analytical information about the problem
• long mixing time

Slice sampler: lift $p(x)$ to the uniform density on the region under its graph,

$q(x, u) = \begin{cases} 1, & \text{if } 0 \le u \le p(x) \\ 0, & \text{otherwise} \end{cases}$

[Figure: the density $p(x)$ and the region under its graph in the $(x, u)$ plane.]

Page 55:

"LIFTED INTEGRAL"

$p(x) = \int q(x, u)\,du$

$\int f(x)\,p(x) = \int\!\!\int f(x)\,q(x, u)\,du\,dx$

so instead of sampling $p$, we can sample $q$.

[Figure: the region under the graph of $p(x)$ in the $(x, u)$ plane.]

Page 56:

SLICE SAMPLER

$\int f(x)\,p(x) = \int\!\!\int f(x)\,q(x, u)\,du\,dx$, and we do Gibbs on $q$:

Freeze $x$: choose $u$ uniformly from $[0, p(x)]$.
Freeze $u$: choose $x$ uniformly from $\{x \mid p(x) \ge u\}$. We need an analytical formula for this set.

Page 57:

MODEL FITTING

$p(x) = \prod_{i=1}^n p_i(x)$, so $\log p(x) = \sum_{i=1}^n \log p_i(x)$.

Lift with one slice variable per factor:

$q(x, u_1, u_2, \dots, u_n) = \begin{cases} 1, & \text{if } p_i(x) \ge u_i\ \forall i \\ 0, & \text{otherwise} \end{cases}$

The support in $u$ is a box, so

$\int q(x, u_1, u_2, \dots, u_n)\,du = \int_{u_1=0}^{p_1(x)} \int_{u_2=0}^{p_2(x)} \cdots \int_{u_n=0}^{p_n(x)} du = \prod_{i=1}^n p_i(x)$

and therefore

$\int f(x)\,p(x)\,dx = \int_x \int_u f(x)\,q(x, u)\,dx\,du.$

Page 58:

MODEL FITTING

$\int f(x)\,p(x)\,dx = \int_x \int_u f(x)\,q(x, u)\,dx\,du$, with $q(x, u_1, \dots, u_n) = 1$ if $p_i(x) \ge u_i\ \forall i$ and $0$ otherwise.

Freeze $x$: for all $i$, choose $u_i$ uniformly from $[0, p_i(x)]$.
Freeze $u$: choose $x$ uniformly from $\bigcap_i \{x \mid p_i(x) \ge u_i\}$.

Page 59:

SLICE SAMPLER PROPERTIES

pros
• does not require a proposal distribution
• mixes super fast
• simple implementation

cons
• needs an analytical formula for super-level sets
• hard in multiple dimensions

fixes / generalizations
• step-out methods
• random hyper-rectangles

Page 60:

STEP-OUT METHODS

• Pick a small interval around the current point.
• Double the width until the slice is contained.
• Sample uniformly from the slice using rejection sampling: throw away selections outside of the slice.

[Figure: an interval at height $u$ under $p(x)$ being widened until it contains the slice.] A code sketch follows.
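
A sketch of a 1D slice sampler with stepping out and shrinkage (Neal '03), assuming numpy; the initial width `w` is a tuning assumption:

```python
import numpy as np

def slice_sample_1d(p, x0, n_samples=5000, w=1.0, seed=0):
    """Slice sampler sketch: lift to (x, u), step the interval out until both
    ends leave the slice, then draw uniformly, shrinking on each rejection."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    samples = np.empty(n_samples)
    for k in range(n_samples):
        u = rng.uniform(0.0, p(x))             # freeze x: height under the graph
        lo = x - w * rng.uniform()             # randomly place the initial interval
        hi = lo + w
        while p(lo) > u:                       # step out until the ends exit the slice
            lo -= w
        while p(hi) > u:
            hi += w
        while True:                            # uniform draw from the interval,
            y = rng.uniform(lo, hi)            # shrinking it on each rejection
            if p(y) > u:
                break
            if y < x:
                lo = y
            else:
                hi = y
        x = y
        samples[k] = x
    return samples
```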

Page 61:

HIGH DIMS: COORDINATE METHOD

$\int f(x)\,p(x) = \int\!\!\int f(x)\,q(x, u)\,du\,dx$, and we do coordinate Gibbs on $q$:

Freeze $x$: choose $u$ uniformly from $[0, p(x)]$.
Freeze $u$: for $i = 1, \dots, n$, choose $x_i$ uniformly from $\{x_i \mid p(x_1, \cdots, x_i, \cdots, x_n) \ge u\}$.

Can use the stepping-out method. Radford Neal, "Slice Sampling," '03.

Page 62:

HIGH DIMS: HYPER-RECTANGLE

• Choose a "large" rectangle.
• Randomly/uniformly place the rectangle around the current location.
• Draw a random proposal point from the rectangle.
• If the proposal lies outside the slice, shrink the box so the proposal is on the boundary.

This performs much better for poorly-conditioned objectives. Radford Neal, "Slice Sampling," '03.

Page 63:

COMPARISON

method               rate           when to use
Gradient/Splitting   $e^{-ck}$      you value reliability and precision (moderate speed, high accuracy)
SGD                  $1/k$          you value speed over accuracy (high speed, moderate accuracy)
MCMC                 $1/\sqrt{k}$   you value simplicity (no gradient) or need statistical inference (slow and inaccurate)

Page 64:

DO MATLAB EXERCISE: MCMC