RANDOM TOPICS: STOCHASTIC GRADIENT DESCENT & MONTE CARLO
University of Maryland, CMSC 764 (L12_stochastic.pdf)

MASSIVE MODEL FITTING

minimize f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x),    with n big (over 100K)

least squares:
minimize \frac{1}{2}\|Ax - b\|^2 = \sum_i \frac{1}{2}(a_i x - b_i)^2

SVM:
minimize \frac{1}{2}\|w\|^2 + h(LDw) = \frac{1}{2}\|w\|^2 + \sum_i h(l_i d_i w)

low-rank factorization:
minimize \frac{1}{2}\|D - XY\|^2 = \sum_{ij} \frac{1}{2}(d_{ij} - x_i y_j)^2

THE BIG IDEA

minimize f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)

True gradient:
\nabla f = \frac{1}{n} \sum_{i=1}^n \nabla f_i(x)

Idea: choose a subset \Omega \subset [1, N] of the data (usually just one sample):
\nabla f \approx g_\Omega = \frac{1}{|\Omega|} \sum_{i \in \Omega} \nabla f_i(x)
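As a concrete sketch of g_\Omega (the least-squares data, sizes, and function names here are my own toy choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))
b = A @ np.ones(5)                      # planted solution x* = (1,...,1)

def full_grad(x):
    # true gradient of f(x) = (1/n) sum_i (1/2)(a_i x - b_i)^2
    return A.T @ (A @ x - b) / len(b)

def sub_grad(x, omega):
    # g_Omega: same formula, but only over the sampled rows
    return A[omega].T @ (A[omega] @ x - b[omega]) / len(omega)

x = np.zeros(5)
omega = rng.choice(len(b), size=10, replace=False)   # |Omega| = 10 out of n = 1000
g = sub_grad(x, omega)                  # unbiased estimate, 100x cheaper than full_grad(x)
```

The subsampled gradient touches only |Ω| rows of A per step, which is the whole point when n is over 100K.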

INFINITE SAMPLE VS FINITE SAMPLE

infinite sample:  minimize f(x) = E_s[f_s(x)] = \int_s f_s(x) p(s)\, ds

finite sample:    minimize f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)

We can solve the finite sample problem to high accuracy...
...but the true accuracy is limited by the sample size.

SGD

Repeat: select data → compute gradient → update.

g_k = \frac{1}{M} \sum_{i=1}^M \nabla f(x, d_i),    g_k \approx \nabla f(x, d_8),    g_k \approx \nabla f(x, d_{12})

x_{k+1} = x_k - \tau_k g_k

Far from the solution, even a gradient with big error improves the iterate; near the solution, small errors make the iterate worse. The error must decrease as we approach the solution.

Classical solution: shrink the stepsize, \lim_{k \to \infty} \tau_k = 0. Shrinking the stepsize means slow convergence: O(1/\sqrt{k}).

Variance reduction: correct the error in the gradient approximations.

EXAMPLE: WITHOUT DECREASING STEPSIZE

minimize \frac{1}{2}\|Ax - b\|^2

[Plot: with an inexact gradient and a fixed stepsize, the error drops quickly, then stalls.] Why does this happen? What's happening?

DECREASING STEP SIZE

Big stepsize early, small stepsize late:
\tau_k = \frac{a}{b + k},    x_{k+1} = x_k - \tau_k \nabla f_k(x)
used for strongly convex problems. (Almost) equivalent to choosing a larger sample. Why?

AVERAGING

\tau_k = \frac{a}{\sqrt{k} + b},    x_{k+1} = x_k - \tau_k \nabla f_k(x)

"ergodic" averaging, used for weakly convex problems:
\bar{x}_{k+1} = \frac{1}{k+1} \sum x_i

Does this limit the convergence rate? Why is this bad?

AVERAGING

Ergodic averaging can be computed without storage:
\bar{x}_{k+1} = \frac{1}{k+1} \sum x_i = \frac{k}{k+1} \bar{x}_k + \frac{1}{k+1} x_{k+1}

Short memory version, with \eta \ge 1:
\bar{x}_{k+1} = \frac{k}{k+\eta} \bar{x}_k + \frac{\eta}{k+\eta} x_{k+1}
This trades variance for bias.
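A quick numerical check of the storage-free recursions (the \eta/(k+\eta) weight in the short-memory update is my reading of the slide's formula, chosen so the weights sum to 1):

```python
import numpy as np

xs = np.random.default_rng(1).standard_normal(50)   # pretend SGD iterates x_1..x_50

# ergodic average via xbar_{k+1} = k/(k+1)*xbar_k + 1/(k+1)*x_{k+1}
xbar = 0.0
for k, x in enumerate(xs):
    xbar = k / (k + 1) * xbar + 1 / (k + 1) * x
# the recursion reproduces the batch mean without storing any iterates

# short-memory version: eta > 1 weights recent iterates more (variance for bias)
eta = 5.0
xbar_sm = 0.0
for k, x in enumerate(xs):
    xbar_sm = k / (k + eta) * xbar_sm + eta / (k + eta) * x
```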

KNOWN RATES: WEAKLY CONVEX

Theorem (Shamir and Zhang, ICML '13; see also Rakhlin, Shamir, Sridharan, CoRR '11).
Suppose f is convex, \|\nabla f(x)\| \le G, and the diameter of dom(f) is less than D. If we use the stepsize
\tau_k = \frac{c}{\sqrt{k}},
then
E[f(x_k) - f^\star] \le \left( \frac{D^2}{c} + cG^2 \right) \frac{2 + \log(k)}{\sqrt{k}}.

KNOWN RATES: STRONGLY CONVEX

Theorem (Shamir and Zhang, ICML '13).
Suppose f is strongly convex with parameter m, and that \|\nabla f(x)\| \le G. If you use the stepsize \tau_k = 1/(mk), and the limited memory averaging
\bar{x}_{k+1} = \frac{k}{k+\eta} \bar{x}_k + \frac{\eta}{k+\eta} x_{k+1}
with \eta \ge 1, then
E[f(\bar{x}_k) - f^\star] \le 58 \left(1 + \frac{\eta}{k}\right) \left( \eta(\eta+1) + \frac{(\eta + .5)^3 (1 + \log k)}{k} \right) \frac{G^2}{mk}.

EXAMPLE: SVM

PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM

minimize \frac{1}{2}\|w\|^2 + C h(Aw),    \nabla h(x) = \begin{cases} -1, & x < 1 \\ 0, & \text{otherwise} \end{cases}

Note: this is a "subgradient" descent method.

PEGASOS

minimize \sum_i \frac{\lambda}{2}\|w\|^2 + h(a_i w),    \tau_k = \frac{1}{\lambda k}

While "not converged":
    pick a random row a_i
    If a_i^T w_k < 1:   w_{k+1} = w_k - \tau_k (\lambda w_k - a_i^T)
    If a_i^T w_k \ge 1: w_{k+1} = w_k - \tau_k \lambda w_k

Averaging (not used in practice):  \bar{w}_{k+1} = \frac{1}{k+1} \sum_{i=1}^{k+1} w_i
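A minimal numpy sketch of the Pegasos loop on synthetic separable data (the dimensions, \lambda, and iteration count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 2, 0.1
w_star = np.array([1.0, -1.0])
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)          # separable labels
A = y[:, None] * X               # rows a_i = y_i * x_i, so the margin is a_i^T w

w = np.zeros(d)
for k in range(1, 2000):
    tau = 1.0 / (lam * k)        # Pegasos stepsize 1/(lambda k)
    i = rng.integers(n)          # pick a random row
    if A[i] @ w < 1:             # hinge active: subgradient is lam*w - a_i
        w = w - tau * (lam * w - A[i])
    else:                        # hinge flat: subgradient is lam*w
        w = w - tau * lam * w

train_acc = np.mean(np.sign(X @ w) == y)
```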

PEGASOS

[Plots: gradient method vs. stochastic methods on the finite-sample objective; second panel shows classification error (CV).]

You hope to converge before SGD gets slow!

ALLREDUCE

...but if you don't: use SGD as a warm start for an iterative method.
Agarwal et al., "Effective Terascale Linear Learning," '12

SGD

Select data → compute gradient → update:
g_k \approx \nabla f(x, d_8),    x_{k+1} = x_k - \tau_k g_k
The error must decrease as we approach the solution.

SGD + VARIANCE REDUCTION

Variance reduction solution: make the gradient more accurate, and preserve fast convergence:
g_k \approx \nabla f(x, d_8) - \text{error}_8,    x_{k+1} = x_k - \tau_k g_k

VR APPROACHES

SAG: Le Roux, Schmidt, Bach, 2013
SVRG: Johnson, Zhang, 2013
SAGA: Defazio, Bach, Lacoste-Julien, 2014
Central VR: a VR approach targeting distributed ML ("Efficient Distributed SGD with Variance Reduction," ICDM 2016; shameless self-promotion)
many more...

The original requires full gradient computations; the gradient tableau avoids them.

SVRG

First epoch: store each gradient in a tableau,
\nabla f_1(x^1_m), \nabla f_2(x^2_m), \nabla f_3(x^3_m), \ldots, \nabla f_{n-1}(x^{n-1}_m), \nabla f_n(x^n_m),
and keep the average gradient over the last epoch:
g_m = \frac{1}{n} \sum_{i=1}^n \nabla f_i(x^i_m)

SVRG

When term 3 is sampled again, take the new gradient \nabla f_3(x^3_{m+1}) and form the corrected gradient
\nabla f_3(x^3_{m+1}) - (\nabla f_3(x^3_m) - g_m),
i.e., the new gradient minus the error of the stored one, then update the tableau entry.
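The tableau-based correction can be sketched SAGA-style (per the Defazio et al. citation above), updating one stored gradient per step so no full gradient recomputation is ever needed. The least-squares setup, stepsize, and iteration count are my own toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

def grad_i(x, i):
    # gradient of the i-th term f_i(x) = (1/2)(a_i x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(d)
tau = 0.01
table = np.array([grad_i(x, i) for i in range(n)])   # gradient tableau
g_avg = table.mean(axis=0)                           # average stored gradient

for _ in range(5000):
    i = rng.integers(n)
    g_new = grad_i(x, i)
    g = g_new - (table[i] - g_avg)      # corrected gradient: new minus stored error
    g_avg += (g_new - table[i]) / n     # keep the running average consistent...
    table[i] = g_new                    # ...and refresh the tableau entry
    x -= tau * g

residual = np.linalg.norm(A @ x - b)
```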

SVRG GETS BACK DETERMINISTIC RATES

Theorem (Johnson and Zhang, 2013).
Suppose the objective terms are m-strongly convex with L-Lipschitz gradients. If the learning rate is small enough, then
E[f(w_k) - f^\star] \le c^k [f(w_0) - f^\star]
for some c < 1.

MONTE CARLO METHODS

Methods that involve randomly sampling a distribution. (Named for the Monte Carlo casino in Monaco.)

BAYESIAN LEARNING

Goal: estimate parameters in a model
data = M(parameters)
measure the data, estimate the parameters.

Ingredients:
prior: probability distribution of the unknown parameters
likelihood: probability of the data given the (unknown) parameters

BAYES RULE

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

P(\text{parameters}|\text{data}) = \frac{P(\text{data}|\text{parameters}) P(\text{parameters})}{P(\text{data})}

Dropping the normalization, posterior \propto likelihood \times prior:
P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters}) P(\text{parameters})

We really care about this "posterior distribution."

EXAMPLE: LOGISTIC REGRESSION

P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters}) P(\text{parameters})

Likelihood:
P(y_i = 1 | x_i) = \frac{\exp(x_i^T w)}{1 + \exp(x_i^T w)}

P(y, X | w) = \prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)}

EXAMPLE: LOGISTIC REGRESSION

Example prior: \log P(w) = -|w|

P(w | y, X) \propto \left( \prod_i \frac{\exp(y_i \cdot x_i^T w)}{1 + \exp(y_i \cdot x_i^T w)} \right) P(w)

\log P(w | y, X) = \log P(w) + \sum_i -\log(1 + \exp(-y_i \cdot x_i^T w))
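This log-posterior is cheap to evaluate even when derivatives are unavailable; a numpy sketch (the data and the prior weight `lam` are my own illustrative choices):

```python
import numpy as np

def log_posterior(w, X, y, lam=1.0):
    # log P(w|y,X) = log P(w) + sum_i -log(1 + exp(-y_i x_i^T w)),
    # with the Laplace-style prior log P(w) = -|w| (applied entrywise, scaled by lam)
    margins = y * (X @ w)
    return -lam * np.abs(w).sum() - np.logaddexp(0.0, -margins).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, 0.0, -1.0])
y = np.sign(X @ w_true)                 # labels in {-1, +1}
lp_good = log_posterior(w_true, X, y)
lp_bad = log_posterior(-w_true, X, y)   # flipping w should score much worse
```

`np.logaddexp(0, -m)` computes log(1 + e^{-m}) without overflow for large negative margins.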

BAYESIANS OFFER NO SOLUTIONS

P(w | D, l) defines a whole solution space, not a single point. Optimization returns one solution; Monte Carlo methods randomly sample this solution space.

WHY MONTE CARLO?

You need to know more than just a max/minimizer: an optimization solution comes with no "error bars."

E(w) = \int w P(w)\, dw

Var(w) = E[(w - \mu)(w - \mu)^T] = \int w w^T P(w)\, dw - \mu \mu^T
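In practice both integrals are replaced by averages over posterior samples. A sketch with synthetic stand-in draws (the Gaussian "posterior" is my own placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for draws from a posterior P(w); here w is 1-D with mean 2, std 0.5
samples = rng.normal(loc=2.0, scale=0.5, size=(100_000, 1))

mu_hat = samples.mean(axis=0)                               # approximates int w P(w) dw
second_moment = (samples[:, :, None] * samples[:, None, :]).mean(axis=0)
var_hat = second_moment - np.outer(mu_hat, mu_hat)          # E[w w^T] - mu mu^T
```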

EXAMPLE: POKER BOTS

You can't solve your problem by other (faster) methods:
maximize \log P(w), but we don't know the derivative.

Poker bots: model parameters describe player behavior; 2.5M different community cards; 10 players = 59K raise/fold/call permutations per round.

WHY MONTE CARLO?

You're incredibly lazy: maximize \log P(w). "I could differentiate this, I just don't wanna."

\arg\max \log P(w) \approx \max_k \{P(w_k)\}

MARKOV CHAINS

What is an MC? A Markov chain has a steady state if it is:
Irreducible: can visit any state starting at any state
Aperiodic: does not get trapped in deterministic cycles

METROPOLIS HASTINGS

Ingredients:
q(y|x): proposal distribution
p(x): posterior distribution

MH Algorithm: Start with x_0. For k = 1, 2, 3, \ldots
    Choose a candidate y from q(y|x_k)
    Compute the acceptance probability
    \alpha = \min\left\{ 1, \frac{p(y) q(x_k|y)}{p(x_k) q(y|x_k)} \right\}
    Set x_{k+1} = y with probability \alpha; otherwise x_{k+1} = x_k
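A minimal random-walk sampler following this recipe (the proposal here is symmetric, so the q ratio cancels; the unnormalized Gaussian target is my own example):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    # unnormalized posterior: MH never needs the normalization constant
    return np.exp(-0.5 * x**2)

def metropolis_hastings(n_steps, sigma=1.0, x0=0.0):
    x, chain = x0, []
    for _ in range(n_steps):
        y = x + sigma * rng.standard_normal()   # candidate from q(y|x_k)
        alpha = min(1.0, p(y) / p(x))           # acceptance probability
        if rng.random() < alpha:
            x = y                               # accept: x_{k+1} = y
        chain.append(x)                         # reject: x_{k+1} = x_k
    return np.array(chain)

chain = metropolis_hastings(20_000)[5_000:]     # discard burn-in
```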

CONVERGENCE

Theorem. Suppose the support of q contains the support of p. Then the MH sampler has a stationary distribution, and the distribution is equal to p.

Irreducible: the support of q must contain the support of p
Aperiodic: there must be states with positive rejection probability

METROPOLIS ALGORITHM

Ingredients:
q(y|x): proposal distribution
p(x): posterior distribution

Assume the proposal is symmetric: q(x_k|y) = q(y|x_k). Then the Metropolis-Hastings acceptance probability
\alpha = \min\left\{ 1, \frac{p(y) q(x_k|y)}{p(x_k) q(y|x_k)} \right\}
simplifies to the Metropolis rule
\alpha = \min\left\{ 1, \frac{p(y)}{p(x_k)} \right\}

EXAMPLE: GMM

[Figure: histogram of Metropolis iterates for a Gaussian mixture model. Andrieu, de Freitas, Doucet, Jordan '03]

PROPERTIES OF MH

Pros:
• We don't need a normalization constant for the posterior: P(\text{parameters}|\text{data}) \propto P(\text{data}|\text{parameters}) P(\text{parameters}) is enough, since \alpha only uses ratios of p.
• We can run many chains in parallel (cluster/GPU).
• We don't need any derivatives.

PROPERTIES OF MH

Cons:
• "Mixing time" depends on the proposal distribution: too wide = constant rejections = slow mixing; too narrow = short movements = slow mixing.
• Samples are only meaningful at the stationary distribution: "burn in" samples must be discarded, and many samples are needed because of correlations.

[Figure: a narrow proposal centered at x_k inching along a wide posterior. This is bad...]

SIMULATED ANNEALING

maximize p(x)

Easy choice: run MCMC, then take \max_k p(x_k).

Better choice: sample the distribution p^{1/T_k}(x_k), where T_k is the "temperature."

Why is this better? Where does the name come from?

COOLING SCHEDULE

[Figure: the distribution p^{1/T} as T goes from hot (flat) to warm, cool, and cold (peaked at the mode). Andrieu, de Freitas, Doucet, Jordan '03]

CONVERGENCE

Theorem (Granville, "Simulated annealing: A proof of convergence").
Suppose the MCMC mixes fast enough that epsilon-dense sampling occurs in finite time starting at every temperature. For an annealing schedule with temperature
T_k = \frac{1}{C \log(k + T_0)},
simulated annealing converges to a global optimum with probability 1.

SA solves non-convex problems, even NP-complete problems, as time goes to infinity.
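A toy annealing run with the logarithmic schedule above (the objective, proposal width, and constants C, T_0 are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # non-convex objective: a quadratic bowl plus an oscillation
    return 0.1 * (x - 2.0)**2 + np.sin(3.0 * x)

C, T0 = 1.0, 2.0
x, best = -4.0, -4.0
for k in range(1, 20_000):
    T = 1.0 / (C * np.log(k + T0))           # T_k = 1 / (C log(k + T_0))
    y = x + 0.5 * rng.standard_normal()      # random-walk proposal
    # Metropolis step on p(x) ~ exp(-f(x)/T): downhill moves always accepted,
    # uphill moves accepted with probability exp(-(f(y)-f(x))/T)
    if rng.random() < np.exp(min(0.0, -(f(y) - f(x)) / T)):
        x = y
    if f(x) < f(best):
        best = x
```

High temperatures early let the chain hop between basins; as T shrinks, it settles into the deepest one.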

WHEN TO USE SIMULATED ANNEALING

Flow chart: "Should I use simulated annealing?" → no.

There is no practical way to choose the temperature schedule:
too fast = stuck in a local minimum (risky);
too slow = no different from MCMC.
An act of desperation!

GIBBS SAMPLER

Want to sample P(x_1, x_2, x_3). On stage k, pick some coordinate j and propose
q(y|x^k) = \begin{cases} p(y_j | x^k_{j^c}), & \text{if } y_{j^c} = x^k_{j^c} \\ 0, & \text{otherwise} \end{cases}

The MH acceptance ratio is then always 1:
\alpha = \frac{p(y) q(x^k|y)}{p(x^k) q(y|x^k)} = \frac{p(y)\, p(x^k_j | x^k_{j^c})}{p(x^k)\, p(y_j | y_{j^c})} = \frac{p(y_{j^c})}{p(x^k_{j^c})} = 1,
using P(B) = \frac{P(A \text{ and } B)}{P(A|B)} with p(y) = p(y_j \text{ and } y_{j^c}).

GIBBS SAMPLER

Want to sample P(x_1, x_2, x_3). The iterates cycle through the coordinates, resampling one at a time conditioned on the rest:

x^2 \sim P(x_1 | x^1_2, x^1_3, x^1_4, \ldots, x^1_n)
x^3 \sim P(x_2 | x^2_1, x^2_3, x^2_4, \ldots, x^2_n)
x^4 \sim P(x_3 | x^3_1, x^3_2, x^3_4, \ldots, x^3_n)
\cdots
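A two-variable Gibbs sweep for a correlated Gaussian, where both conditionals are known in closed form (the target and chain length are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8   # target: zero-mean bivariate normal with correlation rho

x1, x2, out = 0.0, 0.0, []
for _ in range(30_000):
    # each conditional of a bivariate normal is itself normal:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
    out.append((x1, x2))

samples = np.array(out)[5_000:]           # discard burn-in
rho_hat = np.corrcoef(samples.T)[0, 1]    # should approach rho
```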

APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM): binary 0/1 random variables v (visible) and h (hidden), coupled by weights W:

E(v, h) = -a^T v - b^T h - v^T W h

P(v, h) = \frac{1}{Z} e^{-E(v,h)},    Z = "partition function"

APPLICATION: SAMPLING GRAPHICAL MODELS

Restricted Boltzmann Machine (RBM):
E(v, h) = -a^T v - b^T h - v^T W h,    P(v, h) = \frac{1}{Z} e^{-E(v,h)}

Conditional of one visible unit: remove the normalization, aggregate the terms not involving v_i into a constant C, then cancel the constants:

P(v_i = 1 | h) = \frac{P(v_i = 1 | h)}{P(v_i = 0 | h) + P(v_i = 1 | h)}
             = \frac{\exp(a_i + \sum_j w_{ij} h_j + C)}{\exp(C) + \exp(a_i + \sum_j w_{ij} h_j + C)}
             = \frac{\exp(a_i + \sum_j w_{ij} h_j)}{1 + \exp(a_i + \sum_j w_{ij} h_j)}
             = \sigma(a_i + \sum_j w_{ij} h_j),

where \sigma is the sigmoid function. (The signs follow from e^{-E} with the energy above.)

BLOCK GIBBS FOR RBM

Stage 1: freeze the hidden units, randomly sample the visible units:
P(v_i = 1 | h) = \sigma(a_i + \sum_j w_{ij} h_j)

Stage 2: freeze the visible units, randomly sample the hidden units:
P(h_j = 1 | v) = \sigma(b_j + \sum_i w_{ij} v_i)
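A numpy sketch of the two-stage block update. The weights are random and untrained, just to exercise the sampler, and I take the conditionals to be \sigma(a_i + \sum_j w_{ij} h_j) and \sigma(b_j + \sum_i w_{ij} v_i), as follows from e^{-E} with E(v,h) = -a^T v - b^T h - v^T W h:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nv, nh = 6, 4                                  # tiny RBM, untrained weights
W = 0.1 * rng.standard_normal((nv, nh))
a, b = np.zeros(nv), np.zeros(nh)

h = rng.integers(0, 2, nh).astype(float)
for _ in range(1000):
    # stage 1: freeze hidden, sample all visible units in one block
    v = (rng.random(nv) < sigmoid(a + W @ h)).astype(float)
    # stage 2: freeze visible, sample all hidden units in one block
    h = (rng.random(nh) < sigmoid(b + v @ W)).astype(float)
```

Because no two visible units are connected, all of v can be sampled at once given h (and vice versa), which is why block Gibbs is so cheap for RBMs.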

DEEP BELIEF NETS

DBN = layered RBM: a visible layer at the bottom, hidden layers stacked above. Each layer only depends on the layer beneath it (feed forward). The probability for each hidden node is a sigmoid function.

EXAMPLE: MNIST

Train a 3-layer DBN with 200 hidden units on 60K MNIST digits (Gan, Henao, Carlson, Carin '15):
pre-training: layer-by-layer training
training: train all weights simultaneously, using a Gibbs sampler
A Gibbs sampler is also used to explore the final solution.

[Figures: training data; learned features; observations sampled from the deep belief network using the Gibbs sampler.]

WHAT'S WRONG WITH GIBBS?

Susceptible to strong correlations. (This is also a problem for MH with a bad proposal distribution.)

SLICE SAMPLER

Problems with MH: hard to choose the proposal distribution; does not exploit analytical information about the problem; long mixing time.

Slice sampler: lift p(x) into a joint distribution over (x, u),
q(x, u) = \begin{cases} 1, & \text{if } 0 \le u \le p(x) \\ 0, & \text{otherwise} \end{cases}
[Figure: the area under the curve p(x), viewed as a uniform distribution over pairs (x, u).]

"LIFTED INTEGRAL"

p(x) = \int q(x, u)\, du

\int f(x) p(x) = \int \int f(x) q(x, u)\, du\, dx

Sample this joint distribution instead.

SLICE SAMPLER

\int f(x) p(x) = \int \int f(x) q(x, u)\, du\, dx

Do Gibbs on this:
Freeze x: choose u uniformly from [0, p(x)]
Freeze u: choose x uniformly from \{x \,|\, p(x) \ge u\}   (need an analytical formula for this set)

MODEL FITTING

p(x) = \prod_{i=1}^n p_i(x),    \log p(x) = \sum_{i=1}^n \log p_i(x)

Lift with one auxiliary variable per factor:
q(x, u_1, u_2, \ldots, u_n) = \begin{cases} 1, & \text{if } u_i \le p_i(x)\ \forall i \\ 0, & \text{otherwise} \end{cases}

\int q(x, u_1, u_2, \ldots, u_n)\, du = \int_{u_1=0}^{p_1(x)} \int_{u_2=0}^{p_2(x)} \cdots \int_{u_n=0}^{p_n(x)} du = \prod_{i=1}^n p_i(x)

\int f(x) p(x)\, dx = \int_x \int_u f(x) q(x, u)\, dx\, du

For fixed x, the support in u is a box.

MODEL FITTING

\int f(x) p(x)\, dx = \int_x \int_u f(x) q(x, u)\, dx\, du,    q(x, u_1, \ldots, u_n) = \begin{cases} 1, & \text{if } u_i \le p_i(x)\ \forall i \\ 0, & \text{otherwise} \end{cases}

Freeze x: for all i, choose u_i uniformly from [0, p_i(x)]
Freeze u: choose x uniformly from \bigcap_i \{x \,|\, p_i(x) \ge u_i\}

SLICE SAMPLER PROPERTIES

Pros:
• does not require a proposal distribution
• mixes super fast
• simple implementation

Cons:
• needs an analytical formula for super-level sets
• hard in multiple dimensions

Fixes / generalizations:
• step-out methods
• random hyper-rectangles

STEP-OUT METHODS

[Figure: the slice \{x \,|\, p(x) \ge u\} and a growing interval around the current point.]

• pick a small interval
• double the width until the slice is contained
• sample uniformly from the slice using rejection sampling: throw away selections outside the slice
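A 1-D slice sampler sketch under these ideas (I expand the interval by a fixed width rather than doubling it, and shrink on rejections as in Neal '03; the Gaussian target is my own example):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    return np.exp(-0.5 * x**2)     # unnormalized target

def slice_step(x, w=0.5):
    u = rng.uniform(0.0, p(x))                 # freeze x: vertical level under p(x)
    left = x - w * rng.random()                # small, randomly placed interval
    right = left + w
    while p(left) > u:                         # step out until the interval
        left -= w                              # brackets the slice {x : p(x) >= u}
    while p(right) > u:
        right += w
    while True:                                # freeze u: uniform draw from the slice
        x_new = rng.uniform(left, right)
        if p(x_new) >= u:
            return x_new
        if x_new < x:                          # rejected: shrink toward current point
            left = x_new
        else:
            right = x_new

x, out = 0.0, []
for _ in range(10_000):
    x = slice_step(x)
    out.append(x)
samples = np.array(out)
```

Note there is no proposal distribution to tune: the interval adapts to the local width of the slice.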

HIGH DIMS: COORDINATE METHOD

\int f(x) p(x) = \int \int f(x) q(x, u)\, du\, dx

Do coordinate Gibbs on this:
Freeze x: choose u uniformly from [0, p(x)]
Freeze u: for i = 1 \cdots n, choose x_i uniformly from \{x_i \,|\, p(x_1, \cdots, x_i, \cdots, x_n) \ge u\}

Can use the stepping-out method. Radford Neal, "Slice Sampling," '03

HIGH DIMS: HYPER-RECTANGLE

Radford Neal, "Slice Sampling," '03

• choose a "large" rectangle
• randomly/uniformly place the rectangle around the current location
• draw a random proposal point from the rectangle
• if the proposal lies outside the slice, shrink the box so the proposal is on the boundary

Performs much better for poorly-conditioned objectives.

COMPARISON

method               rate          when to use
Gradient/Splitting   e^{-ck}       you value reliability and precision (moderate speed, high accuracy)
SGD                  1/k           you value speed over accuracy (high speed, moderate accuracy)
MCMC                 1/\sqrt{k}    you value simplicity (no gradient) or need statistical inference (slow and inaccurate)

DO MATLAB EXERCISE: MCMC