
Page 1: CS 188: Artificial Intelligence

Optimization and Neural Nets

Instructors: Brijen Thananjeyan and Aditya Baradwaj --- University of California, Berkeley

[These slides were created by Dan Klein, Pieter Abbeel, and Sergey Levine. All CS188 materials are at http://ai.berkeley.edu.]

Page 2: Logistic Regression: How to Learn?

§ Maximum likelihood estimation

§ Maximum conditional likelihood estimation

Page 3: Best w?

§ Maximum likelihood estimation:

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

with:

$$P(y^{(i)} \mid x^{(i)}; w) = \frac{e^{w_{y^{(i)}} \cdot f(x^{(i)})}}{\sum_y e^{w_y \cdot f(x^{(i)})}}$$

= Multi-Class Logistic Regression
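To make this concrete, here is a minimal NumPy sketch of the objective above; the array shapes and the function name are illustrative assumptions, not anything from the slides:

```python
import numpy as np

def log_likelihood(w, X, y):
    """Sum of log P(y^(i) | x^(i); w) under the softmax model above.

    w: (num_classes, num_features) weight matrix, one row w_y per class
    X: (num_examples, num_features) feature vectors f(x^(i))
    y: (num_examples,) integer class labels y^(i)
    """
    scores = X @ w.T                             # w_y . f(x^(i)) for every class y
    scores -= scores.max(axis=1, keepdims=True)  # stabilize exp() numerically
    log_Z = np.log(np.exp(scores).sum(axis=1))   # log of the softmax normalizer
    return (scores[np.arange(len(y)), y] - log_Z).sum()
```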

Page 4: Hill Climbing

§ Recall from CSPs lecture: simple, general idea
  § Start wherever
  § Repeat: move to the best neighboring state
  § If no neighbors better than current, quit

§ What's particularly tricky when hill-climbing for multiclass logistic regression?
  • Optimization over a continuous space
  • Infinitely many neighbors!
  • How to do this efficiently?

Page 5: 1-D Optimization

[Figure: a curve g(w) with the point w_0 marked on the w axis]

§ Could evaluate $g(w_0 + h)$ and $g(w_0 - h)$
§ Then step in the best direction

§ Or, evaluate the derivative:

$$\frac{\partial g(w_0)}{\partial w} = \lim_{h \to 0} \frac{g(w_0 + h) - g(w_0 - h)}{2h}$$

§ Tells which direction to step in
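The finite-difference form above translates directly into code. A tiny sketch (the helper name and the choice of h are arbitrary):

```python
def numerical_derivative(g, w0, h=1e-5):
    """Central-difference estimate of dg/dw at w0 (the limit above, with small h)."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

# Example: for g(w) = w**2 the derivative at w0 = 3 is 6.
print(numerical_derivative(lambda w: w ** 2, 3.0))  # ~6.0
```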

Page 6: 2-D Optimization

Source: offconvex.org

Page 7: Gradient Ascent

§ Perform update in uphill direction for each coordinate
§ The steeper the slope (i.e. the higher the derivative), the bigger the step for that coordinate

§ E.g., consider: $g(w_1, w_2)$

§ Updates:

$$w_1 \leftarrow w_1 + \alpha \, \frac{\partial g}{\partial w_1}(w_1, w_2)$$

$$w_2 \leftarrow w_2 + \alpha \, \frac{\partial g}{\partial w_2}(w_1, w_2)$$

§ Updates in vector notation:

$$w \leftarrow w + \alpha \, \nabla_w g(w)$$

with:

$$\nabla_w g(w) = \begin{bmatrix} \frac{\partial g}{\partial w_1}(w) \\[4pt] \frac{\partial g}{\partial w_2}(w) \end{bmatrix} = \text{gradient}$$

Page 8: Optimization and Neural Nets - Berkeley AI Materialsinst.cs.berkeley.edu/~cs188/su19/assets/slides/... · Neural Networks Properties § Theorem (Universal Function Approximators)

§ Idea:§ Startsomewhere§ Repeat:Takeastepinthegradientdirection

GradientAscent

Figure source: Mathworks

Page 9: What is the Steepest Direction?

§ First-Order Taylor Expansion:

$$g(w + \Delta) \approx g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

§ Steepest Direction:

$$\max_{\Delta : \, \Delta_1^2 + \Delta_2^2 \le \varepsilon} g(w + \Delta) \;\approx\; \max_{\Delta : \, \Delta_1^2 + \Delta_2^2 \le \varepsilon} g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

§ Recall: $\max_{\Delta : \|\Delta\| \le \varepsilon} \Delta^\top a$ is attained at $\Delta = \varepsilon \frac{a}{\|a\|}$ (by Cauchy-Schwarz, $\Delta^\top a \le \|\Delta\| \|a\|$, with equality when $\Delta$ points along $a$)

§ Hence, solution:

$$\Delta = \varepsilon \frac{\nabla g}{\|\nabla g\|} \qquad \text{with} \qquad \nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\[4pt] \frac{\partial g}{\partial w_2} \end{bmatrix}$$

Gradient direction = steepest direction!

Page 10: Gradient in n dimensions

$$\nabla g = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\[4pt] \frac{\partial g}{\partial w_2} \\ \vdots \\ \frac{\partial g}{\partial w_n} \end{bmatrix}$$

Page 11: Optimization Procedure: Gradient Ascent

§ init w
§ for iter = 1, 2, …

$$w \leftarrow w + \alpha \, \nabla g(w)$$

§ $\alpha$: learning rate --- tweaking parameter that needs to be chosen carefully

§ How? Try multiple choices
§ Crude rule of thumb: each update should change w by about 0.1 – 1%
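A minimal sketch of this procedure, assuming a helper grad_g that returns $\nabla g(w)$ (both names are illustrative):

```python
def gradient_ascent(grad_g, w, alpha=0.01, num_iters=1000):
    """Repeatedly step uphill: w <- w + alpha * grad g(w)."""
    for _ in range(num_iters):
        w = w + alpha * grad_g(w)
    return w

# Example: maximize g(w) = -(w - 2)**2, whose gradient is -2 * (w - 2).
w_star = gradient_ascent(lambda w: -2 * (w - 2), w=0.0)
print(w_star)  # converges toward 2.0
```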

Page 12: Batch Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \underbrace{\sum_i \log P(y^{(i)} \mid x^{(i)}; w)}_{g(w)}$$

§ init w
§ for iter = 1, 2, …

$$w \leftarrow w + \alpha \, \sum_i \nabla \log P(y^{(i)} \mid x^{(i)}; w)$$
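A sketch of the batch update, assuming a hypothetical helper grad_log_prob(w, x, y) that returns $\nabla \log P(y \mid x; w)$:

```python
def batch_gradient_ascent(grad_log_prob, X, y, w, alpha=0.01, num_iters=100):
    """Each update sums the gradient over the *entire* training set."""
    for _ in range(num_iters):
        w = w + alpha * sum(grad_log_prob(w, X[i], y[i]) for i in range(len(y)))
    return w
```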

Page 13: Stochastic Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ init w
§ for iter = 1, 2, …
  § pick random j

$$w \leftarrow w + \alpha \, \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$

Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one
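The same sketch with the single-example update; grad_log_prob is the same assumed helper as before:

```python
import random

def stochastic_gradient_ascent(grad_log_prob, X, y, w, alpha=0.01, num_iters=1000):
    """Each update uses the gradient on a single randomly chosen example j."""
    for _ in range(num_iters):
        j = random.randrange(len(y))  # pick random j
        w = w + alpha * grad_log_prob(w, X[j], y[j])
    return w
```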

Page 14: Mini-Batch Gradient Ascent on the Log Likelihood Objective

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ init w
§ for iter = 1, 2, …
  § pick random subset of training examples J

$$w \leftarrow w + \alpha \, \sum_{j \in J} \nabla \log P(y^{(j)} \mid x^{(j)}; w)$$

Observation: the gradient over a small set of training examples (= mini-batch) can be computed in parallel, might as well do that instead of on a single one
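And the mini-batch variant; here the per-example gradients are simply summed in a loop, though in practice they would be computed in parallel:

```python
import numpy as np

def minibatch_gradient_ascent(grad_log_prob, X, y, w, alpha=0.01,
                              batch_size=32, num_iters=1000):
    """Each update sums gradients over a random mini-batch J
    (assumes batch_size <= len(y))."""
    for _ in range(num_iters):
        J = np.random.choice(len(y), size=batch_size, replace=False)
        w = w + alpha * sum(grad_log_prob(w, X[j], y[j]) for j in J)
    return w
```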

Page 15: Gradient for Logistic Regression

§ Recall perceptron:
  § Classify with current weights

  § If correct (i.e., y = y*), no change!
  § If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.

Page 16: Neural Networks

Page 17: Multi-class Logistic Regression

§ = special case of a neural network

[Figure: features f_1(x), f_2(x), f_3(x), …, f_K(x) feed linear score units z_1, z_2, z_3, followed by a softmax]

$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad \ldots$$

Page 18: Deep Neural Network = Also learn the features!

[Figure: the same network as the previous slide, but now the features f_1(x), f_2(x), f_3(x), …, f_K(x) feeding z_1, z_2, z_3 and the softmax are themselves to be learned]

$$P(y_1 \mid x; w) = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_2 \mid x; w) = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad P(y_3 \mid x; w) = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}} \qquad \ldots$$

Page 19: Deep Neural Network = Also learn the features!

[Figure: inputs x_1, x_2, x_3, …, x_L feed hidden layers z^{(1)}, z^{(2)}, …, z^{(n-1)} (layer k has units z^{(k)}_1, …, z^{(k)}_{K^{(k)}}), ending in output units z^{(OUT)}_1, z^{(OUT)}_2, z^{(OUT)}_3 and the same softmax over P(y_i | x; w) as before]

$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j} \, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$
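A minimal NumPy sketch of this forward pass, using ReLU as the activation g and omitting bias terms for brevity (both are simplifying assumptions):

```python
import numpy as np

def relu(z):
    """One common choice for the nonlinear activation g."""
    return np.maximum(0.0, z)

def forward(x, weights):
    """Forward pass: z^(k) = g(W^(k-1,k) z^(k-1)), softmax on the output scores.

    weights: list of weight matrices W^(k-1,k), one per layer.
    """
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)          # hidden layers apply the nonlinearity g
    scores = weights[-1] @ z     # output scores z^(OUT)
    scores -= scores.max()       # stabilize exp() numerically
    e = np.exp(scores)
    return e / e.sum()           # softmax: P(y | x; w)
```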

Page 20: Deep Neural Network = Also learn the features!

[Figure: the same network, now with a final hidden layer z^{(n)}_1, …, z^{(n)}_{K^{(n)}} before the output units z^{(OUT)} and the softmax]

$$z^{(k)}_i = g\Big(\sum_j W^{(k-1,k)}_{i,j} \, z^{(k-1)}_j\Big) \qquad g = \text{nonlinear activation function}$$

Page 21: Common Activation Functions

[Figure: plots of common activation functions; source: MIT 6.S191, introtodeeplearning.com]

Page 22: Deep Neural Network: Also Learn the Features!

§ Training the deep neural network is just like logistic regression:

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

just w tends to be a much, much larger vector :)

→ just run gradient ascent + stop when log likelihood of hold-out data starts to decrease
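A sketch of that stopping rule, assuming hypothetical helpers step (one gradient ascent update on the training data) and holdout_ll (log likelihood of the hold-out data):

```python
def train_with_early_stopping(step, holdout_ll, w):
    """Gradient ascent until the hold-out log likelihood starts to decrease."""
    best = holdout_ll(w)
    while True:
        w_new = step(w)
        score = holdout_ll(w_new)
        if score < best:   # hold-out likelihood dropped: overfitting has begun
            return w       # keep the previous (best) weights
        w, best = w_new, score
```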

Page 23: Neural Networks Properties

§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.

§ Practical considerations
  § Can be seen as learning the features

  § Large number of neurons
    § Danger of overfitting
    § (hence early stopping!)

Page 24: Neural Net Demo!

https://playground.tensorflow.org/

Page 25: How about computing all the derivatives?

§ Derivatives tables:

[source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]

Page 26: How about computing all the derivatives?

§ But neural net f is never one of those?
§ No problem: CHAIN RULE:

If $f(x) = g(h(x))$

then $f'(x) = g'(h(x)) \, h'(x)$

→ Derivatives can be computed by following well-defined procedures
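For instance, the chain rule on f(x) = sin(x²), with g = sin and h(x) = x², gives f'(x) = cos(x²) · 2x, which a finite-difference check confirms:

```python
import math

def f_prime(x):
    """Chain rule: f'(x) = g'(h(x)) * h'(x) = cos(x**2) * 2*x."""
    return math.cos(x ** 2) * 2 * x

# Sanity check against a central-difference estimate:
x, h = 1.3, 1e-6
numeric = (math.sin((x + h) ** 2) - math.sin((x - h) ** 2)) / (2 * h)
print(f_prime(x), numeric)  # the two values agree to many decimal places
```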

Page 27: Automatic Differentiation

§ Automatic differentiation software
  § e.g. Theano, TensorFlow, PyTorch, Chainer
  § Only need to program the function g(x, y, w)
  § Can automatically compute all derivatives w.r.t. all entries in w
  § This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = "backpropagation"

§ Autodiff / backpropagation can often be done at computational cost comparable to the forward pass

§ Need to know this exists
§ How this is done? -- outside of scope of CS188
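For example, with PyTorch (one of the packages listed above), programming only the forward computation is enough; calling .backward() runs backpropagation and fills in every derivative with respect to w. The particular function below is just an illustrative choice:

```python
import torch

# Program only the forward function; autograd records it, and the backward
# pass (backpropagation) produces all derivatives w.r.t. w.
w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([0.5, 3.0])

g = torch.sin(w[0] * x[0]) + (w[1] * x[1]) ** 2  # forward pass (info cached)
g.backward()                                     # backward pass

print(w.grad)  # dg/dw0 = x0*cos(w0*x0), dg/dw1 = 2*w1*x1**2
```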

Page 28: Summary of Key Ideas

§ Optimize probability of label given input

$$\max_w \; ll(w) = \max_w \sum_i \log P(y^{(i)} \mid x^{(i)}; w)$$

§ Continuous optimization
  § Gradient ascent:
    § Compute steepest uphill direction = gradient (= just vector of partial derivatives)
    § Take step in the gradient direction
    § Repeat (until held-out data accuracy starts to drop = "early stopping")

§ Deep neural nets
  § Last layer = still logistic regression
  § Now also many more layers before this last layer
    § = computing the features
    § → the features are learned rather than hand-designed

§ Universal function approximation theorem
  § If neural net is large enough
  § Then neural net can represent any continuous mapping from input to output with arbitrary accuracy
  § But remember: need to avoid overfitting / memorizing the training data → early stopping!

§ Automatic differentiation gives the derivatives efficiently (how? = outside of scope of 188)
