
NEURAL NETWORKS (CS446, Fall '16)

Page 1: Administration

- HW1 grades should be up! HW3 is due at midnight. HW4 will be released next Tuesday.
  - Please start working on it as soon as possible.
  - Come to sections with questions.
- The deadline for project proposals is close.
  - Make sure to find a partner and explore the ideas.

Questions?

Page 2: Recap: Multi-Layer Perceptrons

- Multi-layer network
  - A global approximator
  - Different rules for training it
- The back-propagation algorithm
  - Forward step
  - Backward propagation of errors

Congrats! Now you know the hardest concept about neural networks!

Today:
- Convolutional Neural Networks
- Recurrent Neural Networks

[Figure: a feed-forward network with input, hidden, and output layers, and a nonlinear activation at each unit.]

Page 3: Receptive Fields

The receptive field of an individual sensory neuron is the particular region of the sensory space (e.g., the body surface, or the retina) in which a stimulus will trigger the firing of that neuron.
- In the auditory system, receptive fields can correspond to volumes in auditory space.

Designing "proper" receptive fields for the input neurons is a significant challenge.

Consider a task with image inputs:
- Receptive fields should give expressive features from the raw input to the system.
- How would you design the receptive fields for this problem?

Page 4: A Fully Connected Layer

A fully connected layer:
- Example:
  - 100x100 images
  - 1000 units in the input
- Problems:
  - 10^7 edges!
  - Spatial correlations lost!
  - Variable-sized inputs.

Slide Credit: Marc'Aurelio Ranzato

Page 5: A Locally Connected Layer

Consider a task with image inputs. A locally connected layer:
- Example:
  - 100x100 images
  - 1000 units in the input
  - Filter size: 10x10
- Local correlations preserved!
- Problems:
  - 10^5 edges
  - This parameterization is good when the input image is registered (e.g., face recognition).
  - Variable-sized inputs, again.

Slide Credit: Marc'Aurelio Ranzato

Page 6: Convolutional Layer

A solution:
- Filters to capture different patterns in the input space.
  - Share parameters across different locations (assuming the input is stationary).
  - Convolutions with learned filters.
- Filters will be learned during training.
- The issue of variable-sized inputs will be resolved with a pooling layer.

So what is a convolution?

Slide Credit: Marc'Aurelio Ranzato

Page 7: Convolution Operator

The convolution operator, $*$:
- takes two functions and gives another function.

One dimension:

$(x * h)(t) = \int x(\tau)\, h(t - \tau)\, d\tau$  (continuous)

$(x * h)[n] = \sum_{m} x[m]\, h[n - m]$  (discrete)

"Convolution" is very similar to "cross-correlation", except that in convolution one of the functions is flipped.
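To make the discrete formula concrete, here is a minimal NumPy sketch (not from the original deck; signal and kernel values are made up). It also checks the flip relation between convolution and cross-correlation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input signal (made-up values)
h = np.array([1.0, 0.0, -1.0])       # kernel

# (x * h)[n] = sum_m x[m] h[n - m]: np.convolve flips the kernel internally
conv = np.convolve(x, h)

# Cross-correlation slides the unflipped kernel over the input; flipping h
# first turns it back into convolution, so the two results agree.
assert np.allclose(conv, np.correlate(x, h[::-1], mode="full"))
print(conv)
```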

Page 8: Convolution Operator (2)

Convolution in two dimensions:
- The same idea: flip one matrix and slide it on the other matrix.
- Example: the sharpen kernel.

Try other kernels: http://setosa.io/ev/image-kernels/
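As an added illustration (not from the slides), a 2D convolution with a common 3x3 sharpen kernel via SciPy; the kernel values are the usual ones and may differ from the kernel pictured on the slide:

```python
import numpy as np
from scipy.signal import convolve2d

# A common 3x3 sharpen kernel (assumed; the slide's exact values are not shown here).
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

image = np.random.rand(8, 8)  # a made-up grayscale "image"

# convolve2d flips the kernel internally, matching the convolution definition.
sharpened = convolve2d(image, sharpen, mode="same", boundary="symm")
print(sharpened.shape)  # (8, 8)
```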

Page 9: Convolution Operator (3)

Convolution in two dimensions:
- The same idea: flip one matrix and slide it on the other matrix.

Slide Credit: Marc'Aurelio Ranzato

Page 10: Complexity of Convolution

The complexity of the convolution operator is $O(n \log n)$ for $n$ inputs.
- It uses the Fast Fourier Transform (FFT).

In two dimensions, each convolution takes $O(MN \log MN)$ time, where the size of the input is $MN$.

Slide Credit: Marc'Aurelio Ranzato
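A sketch of why the FFT gives $O(n \log n)$ convolution (an added example, not from the deck): by the convolution theorem, convolution becomes a pointwise product in the frequency domain. Sizes and values are illustrative.

```python
import numpy as np

def fft_convolve(x, h):
    """Linear 1D convolution via the FFT (convolution theorem)."""
    n = len(x) + len(h) - 1            # full output length
    X = np.fft.rfft(x, n)              # O(n log n)
    H = np.fft.rfft(h, n)
    return np.fft.irfft(X * H, n)      # pointwise product, then inverse FFT

x = np.random.rand(1000)
h = np.random.rand(16)
assert np.allclose(fft_convolve(x, h), np.convolve(x, h))
```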

Page 11: Convolutional Layer

The convolution of the input (vector/matrix) with weights (vector/matrix) results in a response vector/matrix.

We can have multiple filters in each convolutional layer, each producing an output. If it is an intermediate layer, it can have multiple inputs!

One can add a nonlinearity at the output of the convolutional layer.

[Figure: a convolutional layer applying several filters to the same input, one response per filter.]

Page 12: Pooling Layer

How to handle variable-sized inputs?
- A layer which reduces inputs of different sizes to a fixed size.
- Pooling.

Slide Credit: Marc'Aurelio Ranzato

Page 13: Pooling Layer

How to handle variable-sized inputs?
- A layer which reduces inputs of different sizes to a fixed size.
- Pooling.
- Different variations (see the sketch below):
  - Max pooling: $h[n] = \max_{i \in N(n)} \tilde{h}[i]$
  - Average pooling: $h[n] = \frac{1}{n} \sum_{i \in N(n)} \tilde{h}[i]$
  - L2 pooling: $h[n] = \sqrt{\frac{1}{n} \sum_{i \in N(n)} \tilde{h}[i]^2}$
  - etc.
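The three variants above, sketched for non-overlapping windows over a 1D response vector (an added example; window size and values are made up):

```python
import numpy as np

def pool(h, window, kind="max"):
    """Pool non-overlapping windows of a 1D response vector h."""
    h = h[: len(h) // window * window].reshape(-1, window)
    if kind == "max":
        return h.max(axis=1)
    if kind == "avg":
        return h.mean(axis=1)
    if kind == "l2":                       # sqrt of the windowed mean of squares
        return np.sqrt((h ** 2).mean(axis=1))
    raise ValueError(kind)

h = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
print(pool(h, 2, "max"))  # [3. 8. 5.]
print(pool(h, 2, "avg"))  # [2. 5. 4.5]
```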

Page 14: Convolutional Nets

One-stage structure: Convolution → Pooling.

Whole system: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label.

Slide Credit: Dhruv Batra

Page 15: Training a ConvNet

The same procedure from back-propagation applies here.
- Remember that in backprop we started from the error terms in the last stage and passed them back to the previous layers, one by one.

Back-prop for the pooling layer:
- Consider, for example, the case of "max" pooling.
- This layer only routes the gradient to the input that had the highest value in the forward pass.
- Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches), so that gradient routing is efficient during backpropagation (see the sketch below).
- Therefore we have $\delta = \frac{\partial E_d}{\partial y_i}$ at the recorded index, and zero elsewhere.

[Figure: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with the error $E_d$ at the output; each stage carries $\delta^{pool} = \partial E_d / \partial y^{pool}$ and $\delta^{conv} = \partial E_d / \partial y^{conv}$, and $x_i$, $y_i$ mark a layer's input and output.]
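A sketch of the gradient-routing idea (an added example, not from the deck): the forward pass records the argmax "switches", and the backward pass sends each output gradient to the recorded input index.

```python
import numpy as np

def maxpool_forward(x, window):
    """Non-overlapping 1D max pooling; also returns the argmax switches."""
    xw = x[: len(x) // window * window].reshape(-1, window)
    switches = xw.argmax(axis=1)            # index of the max in each window
    return xw.max(axis=1), switches

def maxpool_backward(grad_out, switches, window, input_len):
    """Route each output gradient back to the input that won the max."""
    grad_in = np.zeros(input_len)
    for j, (g, s) in enumerate(zip(grad_out, switches)):
        grad_in[j * window + s] = g
    return grad_in

x = np.array([1.0, 3.0, 2.0, 8.0])
y, sw = maxpool_forward(x, 2)               # y = [3., 8.], sw = [1, 1]
print(maxpool_backward(np.array([0.5, -1.0]), sw, 2, len(x)))
# [ 0.   0.5  0.  -1. ]
```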

Page 16: Training a ConvNet

Back-prop for the convolutional layer:

[Figure: the same pipeline, Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with $E_d$ at the output and $x_i$, $y_i$ marking the layer's input and output.]

We derive the update rules for a 1D convolution, but the idea is the same for bigger dimensions.

The convolution:
$\hat{y} = w * x \iff \hat{y}_i = \sum_{k=0}^{N-1} w_k\, x_{i-k} = \sum_{k=0}^{N-1} w_{i-k}\, x_k \quad \forall i$

A differentiable nonlinearity:
$y = f(\hat{y}) \iff y_i = f(\hat{y}_i) \quad \forall i$

Gradient with respect to the filter (now we have everything in this layer to update the filter):
$\frac{\partial E_d}{\partial w_k} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i}\, \frac{\partial \hat{y}_i}{\partial w_k} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i}\, x_{i-k}$

Through the nonlinearity:
$\frac{\partial E_d}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i}\, \frac{\partial y_i}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i}\, f'(\hat{y}_i)$

Gradient with respect to the input (we need to pass the gradient to the previous layer):
$\delta = \frac{\partial E_d}{\partial x_k} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i}\, \frac{\partial \hat{y}_i}{\partial x_k} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i}\, w_{i-k}$

Now we can repeat this for each stage of the ConvNet.
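A sketch implementing the gradient formulas above and checking one numerically (an added example; circular indexing is assumed to keep the index arithmetic $i-k$ well defined at the borders):

```python
import numpy as np

def conv1d(w, x):
    """yhat_i = sum_k w_k x_{i-k}; circular indexing assumed for simplicity."""
    N, K = len(x), len(w)
    return np.array([sum(w[k] * x[(i - k) % N] for k in range(K))
                     for i in range(N)])

def grads(w, x, g):
    """Given g_i = dE/dyhat_i, return dE/dw and dE/dx per the sums above."""
    N, K = len(x), len(w)
    dw = np.array([sum(g[i] * x[(i - k) % N] for i in range(N))
                   for k in range(K)])
    dx = np.array([sum(g[i] * w[(i - j) % N] for i in range(N)
                       if (i - j) % N < K) for j in range(N)])
    return dw, dx

rng = np.random.default_rng(0)
w, x, c = rng.normal(size=3), rng.normal(size=8), rng.normal(size=8)
E = lambda w, x: c @ conv1d(w, x)        # a toy loss with dE/dyhat = c
dw, dx = grads(w, x, c)

eps = 1e-6                                # numerical check of dE/dw_0
w2 = w.copy(); w2[0] += eps
assert abs((E(w2, x) - E(w, x)) / eps - dw[0]) < 1e-4
```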

Page 17: Convolutional Nets

[Figure: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label.]

Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013].

Page 18: ConvNet Roots

- Fukushima (1980s): designed a network with the same basic structure, but did not train it by backpropagation.
- The first successful applications of convolutional networks were by Yann LeCun in the 1990s (LeNet).
  - It was used to read zip codes, digits, etc.
- Many variants exist nowadays, but the core idea is the same.
  - Example: a system developed at Google (GoogLeNet).
    - Compute different filters.
    - Compose one big vector from all of them.
    - Layer this iteratively.

See more: http://arxiv.org/pdf/1409.4842v1.pdf

Page 19: Depth Matters

Slide from [Kaiming He 2015]

Page 20: Practical Tips

- Before large-scale experiments, test on a small subset of the data and check that the error goes to zero.
  - Overfit on a small training set.
- Visualize features: feature maps need to be uncorrelated and to have high variance.
- Signs of bad training: many hidden units ignore the input and/or exhibit strong correlations.

Figure Credit: Marc'Aurelio Ranzato

Page 21: Debugging

Training diverges:
- The learning rate may be too large → decrease the learning rate.
- BackProp is buggy → do numerical gradient checking.

Loss is minimized, but accuracy is low:
- Check the loss function: is it appropriate for the task you want to solve? Does it have degenerate solutions?

NN is underperforming / under-fitting:
- Compute the number of parameters → if too small, make the network larger.

NN is too slow:
- Compute the number of parameters → use a distributed framework, use a GPU, make the network smaller.

Many of these points apply to many machine learning models, not just neural networks.

Page 22: CNN for Vector Inputs

Let's study another variant of CNNs, for language.
- Example: sentence classification (say, spam or not spam).

First step: represent each word with a vector in $\mathbb{R}^d$:

  This  is  not  a  spam

Concatenate the vectors.

Now we can assume that the input to the system is a vector in $\mathbb{R}^{dl}$:
- where the input sentence has length $l$ ($l = 5$ in our example),
- and each word vector's length is $d$ ($d = 7$ in our example).

[Figure: five 7-dimensional word vectors, concatenated into one long vector.]

Page 23: Convolutional Layer on Vectors

Think about a single convolutional layer:
- A bunch of vector filters, each defined in $\mathbb{R}^{dh}$:
  - where $h$ is the number of words the filter covers,
  - and $d$ is the size of the word vector.
- Find its (modified) convolution with the input vector.
- The result of the convolution with the filter (see the sketch below):
  $c_1 = f(w \cdot x_{1:h}),\; c_2 = f(w \cdot x_{2:h+1}),\; c_3 = f(w \cdot x_{3:h+2}),\; c_4 = f(w \cdot x_{4:h+3})$
  $c = [c_1, \dots, c_{n-h+1}]$
  (here $x_{i:j}$ denotes the concatenation of the word vectors for words $i$ through $j$)
- A convolution with a filter that spans 2 words operates on all of the bi-grams (vectors of two consecutive words, concatenated): "this is", "is not", "not a", "a spam",
- regardless of whether it is grammatical (not appealing linguistically).
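A sketch of this convolution over word windows (an added example; the embeddings are random stand-ins and $f$ is tanh):

```python
import numpy as np

d, h = 7, 2                               # word-vector size, filter span in words
words = ["this", "is", "not", "a", "spam"]
rng = np.random.default_rng(0)
embed = {wd: rng.normal(size=d) for wd in words}   # stand-in word vectors

x = [embed[wd] for wd in words]           # one d-vector per word
w = rng.normal(size=d * h)                # one filter over h consecutive words

# c_i = f(w . x_{i:i+h-1}): slide the filter over every h-word window (bi-grams here)
c = np.array([np.tanh(w @ np.concatenate(x[i:i + h]))
              for i in range(len(x) - h + 1)])
print(c.shape)   # (n - h + 1,) = (4,)
```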

Page 24: Convolutional Layer on Vectors

  This  is  not  a  spam

- Get word vectors for each word.
- Concatenate the vectors.
- Perform a convolution ($*$) with each filter in the filter bank.
- The result is a set of response vectors: one per filter, each of length (#words - length of filter + 1).

How are we going to handle the variable-sized response vectors? Pooling!

[Figure: word vectors concatenated into one long vector; the filter bank, applied by convolution, produces one response vector per filter.]

Page 25: Convolutional Layer on Vectors

  This  is  not  a  spam

- Get word vectors for each word.
- Concatenate the vectors.
- Perform a convolution ($*$) with each filter in the filter bank: #filters response vectors, each of length (#words - length of filter + 1).
- Pool the filter responses into a fixed-sized vector. Some choices for pooling: k-max, mean, etc.

Now we can pass the fixed-sized vector to a logistic unit (softmax), or give it to a multi-layer network (last session).

[Figure: the same pipeline as before, with a pooling step collapsing each response vector to a fixed size.]
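Putting the whole pipeline together, a minimal forward pass (an added sketch; random embeddings and filters, max pooling, and a hypothetical two-class softmax):

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, n_filters = 7, 2, 3
words = ["this", "is", "not", "a", "spam"]
embed = {wd: rng.normal(size=d) for wd in words}   # stand-in word vectors

x = [embed[wd] for wd in words]
W = rng.normal(size=(n_filters, d * h))        # filter bank
U = rng.normal(size=(2, n_filters))            # softmax weights for 2 classes

# responses: one length-(n - h + 1) vector per filter
responses = np.array([[np.tanh(w @ np.concatenate(x[i:i + h]))
                       for i in range(len(x) - h + 1)] for w in W])
pooled = responses.max(axis=1)                 # max pooling -> fixed size (n_filters,)

logits = U @ pooled
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over {spam, not spam}
print(probs)
```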

Page 26: Recurrent Neural Networks

Multi-layer feed-forward NN: a DAG.
- It just computes a fixed sequence of non-linear learned transformations to convert an input pattern into an output pattern.

Recurrent Neural Network: a digraph.
- It has cycles.
- A cycle can act as a memory;
- the hidden state of a recurrent net can carry along information about a "potentially" unbounded number of previous inputs.
- RNNs can model sequential data in a much more natural way.

Page 27: Equivalence between RNN and Feed-Forward NN

Assume that there is a time delay of 1 in using each connection.

The recurrent net is just a layered net that keeps reusing the same weights.

[Figure: a small recurrent net with weights w1, w2, w3, w4, unrolled over time = 0, 1, 2, 3 into a layered net that reuses the same weights W1, W2, W3, W4 at every step.]

Slide Credit: Geoff Hinton

Page 28: Recurrent Neural Networks

Training general RNNs can be hard.
- Here we will focus on a special family of RNNs.

Prediction on chain-like input:
- Example: POS tagging the words of a sentence.

  X = This  is   a   sample  sentence  .
  Y = DT    VBZ  DT  NN      NN        .

- Issues:
  - Structure in the output: there are connections between the labels.
  - Interdependence between elements of the input: the final decision is based on an intricate interdependence of the words on each other.
  - Variable-sized inputs: e.g., sentences differ in size.

How would you go about solving this task?

Page 29: Recurrent Neural Networks

A chain RNN:
- Has a chain-like structure.
- Each input is replaced with its vector representation $x_t$.
- The hidden (memory) unit $h_t$ contains information about previous inputs and previous hidden units $h_{t-1}$, $h_{t-2}$, etc.
  - It is computed from the past memory and the current word. It summarizes the sentence up to that time.

[Figure: an input layer $x_{t-1}, x_t, x_{t+1}$ feeding a memory layer $h_{t-1}, h_t, h_{t+1}$.]

Page 30: Recurrent Neural Networks

A popular way of formalizing it:
$h_t = f(W_h h_{t-1} + W_i x_t)$
- where $f$ is a nonlinear, differentiable (why?) function.

Outputs?
- Many options, depending on the problem and computational resources.

[Figure: the same chain, inputs $x_{t-1}, x_t, x_{t+1}$ into hidden units $h_{t-1}, h_t, h_{t+1}$.]
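A sketch of this recurrence (an added example; toy dimensions, random weights, $f = \tanh$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 7, 4                        # input size, hidden (memory) size
W_i = rng.normal(size=(m, d)) * 0.1
W_h = rng.normal(size=(m, m)) * 0.1

xs = rng.normal(size=(5, d))       # a length-5 input sequence
h = np.zeros(m)                    # h_0
for x_t in xs:
    h = np.tanh(W_h @ h + W_i @ x_t)   # h_t = f(W_h h_{t-1} + W_i x_t)
print(h)                           # summary of the whole sequence so far
```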

Page 31: Recurrent Neural Networks

- Prediction for $x_t$, with $h_t$:  $y_t = \mathrm{softmax}(W_o h_t)$
- Prediction for $x_t$, with $h_t, \dots, h_{t-\tau}$:  $y_t = \mathrm{softmax}\left(\sum_{i=0}^{\tau} \alpha_i\, W_{o-i}\, h_{t-i}\right)$
- Prediction for the whole chain:  $y_T = \mathrm{softmax}(W_o h_T)$

Some inherent issues with RNNs:
- Recurrent neural nets cannot capture phrases without prefix context.
- They often capture too much of the last words in the final vector.

[Figure: the chain with outputs $y_{t-1}, y_t, y_{t+1}$ on top of the hidden units.]

Page 32: Bi-directional RNN

One of the issues with an RNN:
- The hidden variables capture only one-sided context.

A bi-directional structure:
$h_t = f(W_h h_{t-1} + W_i x_t)$
$\tilde{h}_t = f(\tilde{W}_h \tilde{h}_{t+1} + \tilde{W}_i x_t)$
$y_t = \mathrm{softmax}(W_o h_t + \tilde{W}_o \tilde{h}_t)$

[Figure: a forward chain $h_t$ and a backward chain $\tilde{h}_t$ over the same inputs $x_t$, both feeding each output $y_t$.]
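A sketch of the two passes and the combined output (an added example; toy dimensions and random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, c = 7, 4, 3                                   # input, hidden, n. classes
W_i, W_h = rng.normal(size=(m, d)), rng.normal(size=(m, m))
Wb_i, Wb_h = rng.normal(size=(m, d)), rng.normal(size=(m, m))
W_o, Wb_o = rng.normal(size=(c, m)), rng.normal(size=(c, m))

xs = rng.normal(size=(5, d))
T = len(xs)

h = np.zeros((T + 1, m))                            # forward states, h[t+1] = h_t
for t in range(T):
    h[t + 1] = np.tanh(W_h @ h[t] + W_i @ xs[t])

hb = np.zeros((T + 1, m))                           # backward states, hb[t] = h~_t
for t in reversed(range(T)):
    hb[t] = np.tanh(Wb_h @ hb[t + 1] + Wb_i @ xs[t])

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

ys = [softmax(W_o @ h[t + 1] + Wb_o @ hb[t]) for t in range(T)]
print(ys[0])
```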

Page 33: Stack of Bi-directional Networks

Use the same idea to make the model more complex: stack several bi-directional layers on top of each other.

Page 34: Training RNNs

How to train such a model?
- Generalize the same ideas from back-propagation.

Total output error: $E(\vec{y}, \vec{t}) = \sum_{t=1}^{T} E_t(y_t, t_t)$

Reminder:
$y_t = \mathrm{softmax}(W_o h_t)$
$h_t = f(W_h h_{t-1} + W_i x_t)$

$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}, \qquad \frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t}\, \frac{\partial y_t}{\partial h_t}\, \frac{\partial h_t}{\partial h_k}\, \frac{\partial h_k}{\partial W}$

Parameters? $W_o$, $W_i$, $W_h$, plus the vectors for the input.

[Figure: the chain RNN with inputs, hidden units, and outputs at times $t-1, t, t+1$.]

This is sometimes called "Backpropagation Through Time", since the gradients are propagated back through time (a worked sketch follows below).
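A worked sketch of BPTT for $W_h$ alone, on a toy loss $E = c \cdot h_T$ (an added example), checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 3, 4, 6
W_i = rng.normal(size=(m, d)) * 0.5
W_h = rng.normal(size=(m, m)) * 0.5
xs = rng.normal(size=(T, d))
c = rng.normal(size=m)                      # toy loss: E = c . h_T

def forward(W_h):
    hs = [np.zeros(m)]                      # hs[t] = h_t, with h_0 = 0
    for x in xs:
        hs.append(np.tanh(W_h @ hs[-1] + W_i @ x))
    return hs

hs = forward(W_h)
delta = c.copy()                            # delta = dE/dh_t, starting at t = T
dW_h = np.zeros_like(W_h)
for t in range(T, 0, -1):
    dpre = delta * (1 - hs[t] ** 2)         # through tanh: f'(z) = 1 - tanh(z)^2
    dW_h += np.outer(dpre, hs[t - 1])       # accumulate dE/dW_h across time
    delta = W_h.T @ dpre                    # pass the gradient back to h_{t-1}

eps = 1e-6                                  # finite-difference check on W_h[0, 0]
Wp = W_h.copy(); Wp[0, 0] += eps
num = (c @ forward(Wp)[-1] - c @ hs[-1]) / eps
assert abs(num - dW_h[0, 0]) < 1e-3
```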

Page 35: Recurrent Neural Network

Reminder:
$y_t = \mathrm{softmax}(W_o h_t)$
$h_t = f(W_h h_{t-1} + W_i x_t)$

$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t}\, \frac{\partial y_t}{\partial h_t}\, \frac{\partial h_t}{\partial h_k}\, \frac{\partial h_k}{\partial W}$

$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} W_h\, \mathrm{diag}\!\left[f'(W_h h_{j-1} + W_i x_j)\right]$

since $\frac{\partial h_t}{\partial h_{t-1}} = W_h\, \mathrm{diag}\!\left[f'(W_h h_{t-1} + W_i x_t)\right]$, where $\mathrm{diag}(a_1, \dots, a_n) = \begin{pmatrix} a_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & a_n \end{pmatrix}$.

[Figure: the chain RNN with inputs $x_{t-1}, x_t, x_{t+1}$, hidden units $h_{t-1}, h_t, h_{t+1}$, and outputs $y_{t-1}, y_t, y_{t+1}$.]

Page 36: Vanishing/Exploding Gradients

$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} W_h\, \mathrm{diag}\!\left[f'(W_h h_{j-1} + W_i x_j)\right]$

$\left\| \frac{\partial h_t}{\partial h_k} \right\| \le \prod_{j=k+1}^{t} \| W_h \| \left\| \mathrm{diag}\!\left[f'(W_h h_{j-1} + W_i x_j)\right] \right\| \le \prod_{j=k+1}^{t} \alpha \beta = (\alpha \beta)^{t-k}$

The gradient can become very small or very large quickly, and the locality assumption of gradient descent breaks down (vanishing gradient) [Bengio et al. 1994].

Vanishing gradients are quite prevalent and a serious issue. A real example:
- Training a feed-forward network.
- y-axis: sum of the gradient norms per layer.
- Earlier layers have an exponentially smaller sum of gradient norms.
- This will make training the earlier layers much slower.
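A quick numeric illustration of the bound (an added example): with the spectral norm of $W_h$ forced to $\alpha = 0.9$ and $|f'| \le \beta = 1$, the norm of the product decays roughly like $(\alpha \beta)^{t-k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
W_h = rng.normal(size=(m, m))
W_h *= 0.9 / np.linalg.norm(W_h, 2)           # force spectral norm alpha = 0.9

J = np.eye(m)
for k in range(50):
    f_prime = rng.uniform(0.0, 1.0, size=m)   # stand-in for |f'| <= 1 (e.g., tanh)
    J = W_h @ np.diag(f_prime) @ J            # one factor of the product
    if k % 10 == 9:
        print(k + 1, np.linalg.norm(J, 2))    # norm shrinks exponentially
```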

Page 37: Vanishing/Exploding Gradients

In an RNN trained on long sequences (e.g., 100 time steps) the gradients can easily explode or vanish.
- So RNNs have difficulty dealing with long-range dependencies.

Many methods have been proposed to reduce the effect of vanishing gradients, although it is still a problem:
- Introduce shorter paths between long connections.
- Abandon stochastic gradient descent in favor of a much more sophisticated Hessian-Free (HF) optimization.
- Add fancier modules that are robust to handling long memory, e.g., Long Short-Term Memory (LSTM).

One trick to handle exploding gradients, clip gradients that get too big (see the sketch below):

Define $g = \frac{\partial E}{\partial W}$.
If $\|g\| \ge threshold$, then $g \leftarrow \frac{threshold}{\|g\|}\, g$.
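A sketch of the clipping rule (an added example):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g if its norm exceeds the threshold (exploding-gradient trick)."""
    norm = np.linalg.norm(g)
    if norm >= threshold:
        g = (threshold / norm) * g
    return g

g = np.array([3.0, 4.0])            # ||g|| = 5
print(clip_gradient(g, 1.0))        # [0.6 0.8], norm rescaled to 1
```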