
Page 1: CS 188: Artificial Intelligence

Optimization and Neural Nets

Instructors: Brijen Thananjeyan and Aditya Baradwaj --- University of California, Berkeley

[These slides were created by Dan Klein, Pieter Abbeel, and Sergey Levine. All CS188 materials are at http://ai.berkeley.edu.]

Page 2: Logistic Regression: How to Learn?

§ Maximum likelihood estimation

§ Maximum conditional likelihood estimation

Page 3: Best w?

§ Maximum likelihood estimation:

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

with:

$$P(y^{(i)} \mid x^{(i)}; w) = \frac{e^{w_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_y e^{w_y \cdot f(x^{(i)})}}$$

= Multi-Class Logistic Regression
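To make this concrete, here is a minimal NumPy sketch of the objective above; the array shapes and the function name are illustrative assumptions, not anything from the slides:

```python
import numpy as np

def log_likelihood(w, X, y):
    """Sum of log P(y^(i) | x^(i); w) under the softmax model above.

    w: (num_classes, num_features) weight matrix, one row w_y per class
    X: (num_examples, num_features) feature vectors f(x^(i))
    y: (num_examples,) integer class labels y^(i)
    """
    scores = X @ w.T                             # w_y . f(x^(i)) for every class y
    scores -= scores.max(axis=1, keepdims=True)  # stabilize exp() numerically
    log_Z = np.log(np.exp(scores).sum(axis=1))   # log of the softmax normalizer
    return (scores[np.arange(len(y)), y] - log_Z).sum()
```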

Page 4: Hill Climbing

§ Recall from CSPs lecture: simple, general idea
  § Start wherever
  § Repeat: move to the best neighboring state
  § If no neighbors better than current, quit

§ What's particularly tricky when hill-climbing for multiclass logistic regression?
  • Optimization over a continuous space
  • Infinitely many neighbors!
  • How to do this efficiently?

Page 5: 1-D Optimization

[Figure: a curve g(w) with the point w_0 marked on the w axis]

§ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
§ Then step in the best direction

§ Or, evaluate the derivative:

$$\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$$

§ Tells which direction to step in
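The finite-difference form above translates directly into code. A tiny sketch (the helper name and the choice of h are arbitrary):

```python
def numerical_derivative(g, w0, h=1e-5):
    """Central-difference estimate of dg/dw at w0 (the limit above, with small h)."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

# Example: for g(w) = w**2 the derivative at w0 = 3 is 6.
print(numerical_derivative(lambda w: w ** 2, 3.0))  # ~6.0
```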

Page 6: 2-D Optimization

Source: offconvex.org

Page 7: Gradient Ascent

§ Perform update in uphill direction for each coordinate
§ The steeper the slope (i.e. the higher the derivative), the bigger the step for that coordinate

§ E.g., consider: $g(w_1, w_2)$

§ Updates:

$$w_1 \leftarrow w_1 + \alpha \, \frac{\partial g}{\partial w_1}(w_1, w_2)$$

$$w_2 \leftarrow w_2 + \alpha \, \frac{\partial g}{\partial w_2}(w_1, w_2)$$

§ Updates in vector notation:

$$w \leftarrow w + \alpha \, \nabla_w g(w)$$

with:

$$\nabla_w g(w) = \begin{bmatrix} \frac{\partial g}{\partial w_1}(w) \\[4pt] \frac{\partial g}{\partial w_2}(w) \end{bmatrix} = \text{gradient}$$

Page 8: Optimization and Neural Nets - Berkeley AI Materialsinst.cs.berkeley.edu/~cs188/su19/assets/slides/... · Neural Networks Properties § Theorem (Universal Function Approximators)

§ Idea:§ Startsomewhere§ Repeat:Takeastepinthegradientdirection

GradientAscent

Figure source: Mathworks

Page 9: What is the Steepest Direction?

§ First-Order Taylor Expansion:

$$g(w + \Delta) \approx g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

§ Steepest Direction:

$$\max_{\Delta : \, \Delta_1^2 + \Delta_2^2 \le \varepsilon} g(w + \Delta) \;\approx\; \max_{\Delta : \, \Delta_1^2 + \Delta_2^2 \le \varepsilon} g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

§ Recall: $\max_{\Delta : \|\Delta\| \le \varepsilon} \Delta^\top a$ is attained at $\Delta = \varepsilon \frac{a}{\|a\|}$ (by Cauchy-Schwarz, $\Delta^\top a \le \|\Delta\| \|a\|$, with equality when $\Delta$ points along $a$)

§ Hence, solution:

$$\Delta = \varepsilon \frac{\nabla g}{\|\nabla g\|} \qquad \text{with} \qquad \nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\[4pt] \frac{\partial g}{\partial w_2} \end{bmatrix}$$

Gradient direction = steepest direction!

Page 10: Gradient in n dimensions

$$\nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\[4pt] \frac{\partial g}{\partial w_2} \\ \vdots \\ \frac{\partial g}{\partial w_n} \end{bmatrix}$$

Page 11: Optimization Procedure: Gradient Ascent

§ init w
§ for iter = 1, 2, …

$$w \leftarrow w + \alpha \, \nabla g(w)$$

§ $\alpha$: learning rate --- tweaking parameter that needs to be chosen carefully

§ How? Try multiple choices
§ Crude rule of thumb: each update should change w by about 0.1 – 1%
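A minimal sketch of this procedure, assuming a helper grad_g that returns $\nabla g(w)$ (both names are illustrative):

```python
def gradient_ascent(grad_g, w, alpha=0.01, num_iters=1000):
    """Repeatedly step uphill: w <- w + alpha * grad g(w)."""
    for _ in range(num_iters):
        w = w + alpha * grad_g(w)
    return w

# Example: maximize g(w) = -(w - 2)**2, whose gradient is -2 * (w - 2).
w_star = gradient_ascent(lambda w: -2 * (w - 2), w=0.0)
print(w_star)  # converges toward 2.0
```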

Page 12: Batch Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \underbrace{\sum_i \log P(y^{(i)} \mid x^{(i)}; w)}_{g(w)}$$

§ init w
§ for iter = 1, 2, …

$$w \leftarrow w + \alpha \, \sum_i \nabla \log P(y^{(i)} \mid x^{(i)}; w)$$
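A sketch of the batch update, assuming a hypothetical helper grad_log_prob(w, x, y) that returns $\nabla \log P(y \mid x; w)$:

```python
def batch_gradient_ascent(grad_log_prob, X, y, w, alpha=0.01, num_iters=100):
    """Each update sums the gradient over the *entire* training set."""
    for _ in range(num_iters):
        w = w + alpha * sum(grad_log_prob(w, X[i], y[i]) for i in range(len(y)))
    return w
```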

Page 13: Stochastic Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ init w
§ for iter = 1, 2, …
  § pick random j

$$w \leftarrow w + \alpha \, \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$

Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one
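The same sketch with the single-example update; grad_log_prob is the same assumed helper as before:

```python
import random

def stochastic_gradient_ascent(grad_log_prob, X, y, w, alpha=0.01, num_iters=1000):
    """Each update uses the gradient on a single randomly chosen example j."""
    for _ in range(num_iters):
        j = random.randrange(len(y))  # pick random j
        w = w + alpha * grad_log_prob(w, X[j], y[j])
    return w
```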

Page 14: Mini-Batch Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ init w
§ for iter = 1, 2, …
  § pick random subset of training examples J

$$w \leftarrow w + \alpha \, \sum_{j \in J} \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$

Observation: the gradient over a small set of training examples (= mini-batch) can be computed in parallel, might as well do that instead of on a single one
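And the mini-batch variant; here the per-example gradients are simply summed in a loop, though in practice they would be computed in parallel:

```python
import numpy as np

def minibatch_gradient_ascent(grad_log_prob, X, y, w, alpha=0.01,
                              batch_size=32, num_iters=1000):
    """Each update sums gradients over a random mini-batch J
    (assumes batch_size <= len(y))."""
    for _ in range(num_iters):
        J = np.random.choice(len(y), size=batch_size, replace=False)
        w = w + alpha * sum(grad_log_prob(w, X[j], y[j]) for j in J)
    return w
```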

Page 15: Gradient for Logistic Regression

§ Recall perceptron:
  § Classify with current weights

  § If correct (i.e., y = y*), no change!
  § If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.

Page 16: Neural Networks

Page 17: Multi-class Logistic Regression

§ = special case of a neural network

[Figure: features f_1(x), f_2(x), f_3(x), …, f_K(x) feed linear score units z_1, z_2, z_3, followed by a softmax]

$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad \ldots$$

Page 18: Deep Neural Network = Also learn the features!

[Figure: the same network as the previous slide, but now the features f_1(x), f_2(x), f_3(x), …, f_K(x) feeding z_1, z_2, z_3 and the softmax are themselves to be learned]

$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad \ldots$$

Page 19: Deep Neural Network = Also learn the features!

[Figure: inputs x_1, x_2, x_3, …, x_L feed hidden layers z^{(1)}, z^{(2)}, …, z^{(n-1)} (layer k has units z^{(k)}_1, …, z^{(k)}_{K^{(k)}}), ending in output units z^{(OUT)}_1, z^{(OUT)}_2, z^{(OUT)}_3 and the same softmax over P(y_i | x; w) as before]

$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j} \, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$
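A minimal NumPy sketch of this forward pass, using ReLU as the activation g and omitting bias terms for brevity (both are simplifying assumptions):

```python
import numpy as np

def relu(z):
    """One common choice for the nonlinear activation g."""
    return np.maximum(0.0, z)

def forward(x, weights):
    """Forward pass: z^(k) = g(W^(k-1,k) z^(k-1)), softmax on the output scores.

    weights: list of weight matrices W^(k-1,k), one per layer.
    """
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)          # hidden layers apply the nonlinearity g
    scores = weights[-1] @ z     # output scores z^(OUT)
    scores -= scores.max()       # stabilize exp() numerically
    e = np.exp(scores)
    return e / e.sum()           # softmax: P(y | x; w)
```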

Page 20: Deep Neural Network = Also learn the features!

[Figure: the same network, now with a final hidden layer z^{(n)}_1, …, z^{(n)}_{K^{(n)}} before the output units z^{(OUT)} and the softmax]

$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j} \, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$

Page 21: Common Activation Functions

[Figure: plots of common activation functions; source: MIT 6.S191, introtodeeplearning.com]

Page 22: Deep Neural Network: Also Learn the Features!

§ Training the deep neural network is just like logistic regression:

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

just w tends to be a much, much larger vector :)

→ just run gradient ascent + stop when log likelihood of hold-out data starts to decrease
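A sketch of that stopping rule, assuming hypothetical helpers step (one gradient ascent update on the training data) and holdout_ll (log likelihood of the hold-out data):

```python
def train_with_early_stopping(step, holdout_ll, w):
    """Gradient ascent until the hold-out log likelihood starts to decrease."""
    best = holdout_ll(w)
    while True:
        w_new = step(w)
        score = holdout_ll(w_new)
        if score < best:   # hold-out likelihood dropped: overfitting has begun
            return w       # keep the previous (best) weights
        w, best = w_new, score
```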

Page 23: Neural Networks Properties

§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.

§ Practical considerations
  § Can be seen as learning the features

  § Large number of neurons
    § Danger of overfitting
    § (hence early stopping!)

Page 24: Neural Net Demo!

https://playground.tensorflow.org/

Page 25: How about computing all the derivatives?

§ Derivatives tables:

[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]

Page 26: How about computing all the derivatives?

§ But neural net f is never one of those?
§ No problem: CHAIN RULE:

If $f(x) = g(h(x))$

then $f'(x) = g'(h(x)) \, h'(x)$

→ Derivatives can be computed by following well-defined procedures
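For instance, the chain rule on f(x) = sin(x²), with g = sin and h(x) = x², gives f'(x) = cos(x²) · 2x, which a finite-difference check confirms:

```python
import math

def f_prime(x):
    """Chain rule: f'(x) = g'(h(x)) * h'(x) = cos(x**2) * 2*x."""
    return math.cos(x ** 2) * 2 * x

# Sanity check against a central-difference estimate:
x, h = 1.3, 1e-6
numeric = (math.sin((x + h) ** 2) - math.sin((x - h) ** 2)) / (2 * h)
print(f_prime(x), numeric)  # the two values agree to many decimal places
```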

Page 27: Automatic Differentiation

§ Automatic differentiation software
  § e.g. Theano, TensorFlow, PyTorch, Chainer
  § Only need to program the function g(x, y, w)
  § Can automatically compute all derivatives w.r.t. all entries in w
  § This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = "backpropagation"

§ Autodiff / backpropagation can often be done at computational cost comparable to the forward pass

§ Need to know this exists
§ How this is done? -- outside of scope of CS188
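For example, with PyTorch (one of the packages listed above), programming only the forward computation is enough; calling .backward() runs backpropagation and fills in every derivative with respect to w. The particular function below is just an illustrative choice:

```python
import torch

# Program only the forward function; autograd records it, and the backward
# pass (backpropagation) produces all derivatives w.r.t. w.
w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([0.5, 3.0])

g = torch.sin(w[0] * x[0]) + (w[1] * x[1]) ** 2  # forward pass (info cached)
g.backward()                                     # backward pass

print(w.grad)  # dg/dw0 = x0*cos(w0*x0), dg/dw1 = 2*w1*x1**2
```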

Page 28: Summary of Key Ideas

§ Optimize probability of label given input

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ Continuous optimization
  § Gradient ascent:
    § Compute steepest uphill direction = gradient (= just vector of partial derivatives)
    § Take step in the gradient direction
    § Repeat (until held-out data accuracy starts to drop = "early stopping")

§ Deep neural nets
  § Last layer = still logistic regression
  § Now also many more layers before this last layer
    § = computing the features
    § → the features are learned rather than hand-designed

§ Universal function approximation theorem
  § If neural net is large enough
  § Then neural net can represent any continuous mapping from input to output with arbitrary accuracy
  § But remember: need to avoid overfitting / memorizing the training data → early stopping!

§ Automatic differentiation gives the derivatives efficiently (how? = outside of scope of 188)
