
NEURAL NETWORKS (CS446, Fall '16)

Page 1: Administration

- HW1 grades should be up! HW3 is due at midnight. HW4 will be released next Tuesday.
  - Please start working on it as soon as possible.
  - Come to sections with questions.
- The deadline for project proposals is close.
  - Make sure to find a partner and explore the ideas.

Questions?

Page 2: Recap: Multi-Layer Perceptrons

- Multi-layer network
  - A global approximator
  - Different rules for training it
- The back-propagation algorithm
  - Forward step
  - Backward propagation of errors

Congrats! Now you know the hardest concept about neural networks!

Today:
- Convolutional Neural Networks
- Recurrent Neural Networks

[Figure: a feed-forward network with input, hidden, and output layers, and a nonlinear activation at each unit.]

Page 3: Receptive Fields

The receptive field of an individual sensory neuron is the particular region of the sensory space (e.g., the body surface, or the retina) in which a stimulus will trigger the firing of that neuron.
- In the auditory system, receptive fields can correspond to volumes in auditory space.

Designing "proper" receptive fields for the input neurons is a significant challenge.

Consider a task with image inputs:
- Receptive fields should give expressive features from the raw input to the system.
- How would you design the receptive fields for this problem?

Page 4: A Fully Connected Layer

A fully connected layer:
- Example:
  - 100x100 images
  - 1000 units in the input
- Problems:
  - 10^7 edges!
  - Spatial correlations lost!
  - Variable-sized inputs.

Slide Credit: Marc'Aurelio Ranzato

Page 5: A Locally Connected Layer

Consider a task with image inputs. A locally connected layer:
- Example:
  - 100x100 images
  - 1000 units in the input
  - Filter size: 10x10
- Local correlations preserved!
- Problems:
  - 10^5 edges
  - This parameterization is good when the input image is registered (e.g., face recognition).
  - Variable-sized inputs, again.

Slide Credit: Marc'Aurelio Ranzato

Page 6: Convolutional Layer

A solution:
- Filters to capture different patterns in the input space.
  - Share parameters across different locations (assuming the input is stationary).
  - Convolutions with learned filters.
- Filters will be learned during training.
- The issue of variable-sized inputs will be resolved with a pooling layer.

So what is a convolution?

Slide Credit: Marc'Aurelio Ranzato

Page 7: Convolution Operator

The convolution operator, $*$:
- takes two functions and gives another function.

One dimension:

$(x * h)(t) = \int x(\tau)\, h(t - \tau)\, d\tau$  (continuous)

$(x * h)[n] = \sum_{m} x[m]\, h[n - m]$  (discrete)

"Convolution" is very similar to "cross-correlation", except that in convolution one of the functions is flipped.
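To make the discrete formula concrete, here is a minimal NumPy sketch (not from the original deck; signal and kernel values are made up). It also checks the flip relation between convolution and cross-correlation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input signal (made-up values)
h = np.array([1.0, 0.0, -1.0])       # kernel

# (x * h)[n] = sum_m x[m] h[n - m]: np.convolve flips the kernel internally
conv = np.convolve(x, h)

# Cross-correlation slides the unflipped kernel over the input; flipping h
# first turns it back into convolution, so the two results agree.
assert np.allclose(conv, np.correlate(x, h[::-1], mode="full"))
print(conv)
```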

Page 8: Convolution Operator (2)

Convolution in two dimensions:
- The same idea: flip one matrix and slide it on the other matrix.
- Example: the sharpen kernel.

Try other kernels: http://setosa.io/ev/image-kernels/
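As an added illustration (not from the slides), a 2D convolution with a common 3x3 sharpen kernel via SciPy; the kernel values are the usual ones and may differ from the kernel pictured on the slide:

```python
import numpy as np
from scipy.signal import convolve2d

# A common 3x3 sharpen kernel (assumed; the slide's exact values are not shown here).
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

image = np.random.rand(8, 8)  # a made-up grayscale "image"

# convolve2d flips the kernel internally, matching the convolution definition.
sharpened = convolve2d(image, sharpen, mode="same", boundary="symm")
print(sharpened.shape)  # (8, 8)
```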

Page 9: Convolution Operator (3)

Convolution in two dimensions:
- The same idea: flip one matrix and slide it on the other matrix.

Slide Credit: Marc'Aurelio Ranzato

Page 10: Complexity of Convolution

The complexity of the convolution operator is $O(n \log n)$ for $n$ inputs.
- It uses the Fast Fourier Transform (FFT).

In two dimensions, each convolution takes $O(MN \log MN)$ time, where the size of the input is $MN$.

Slide Credit: Marc'Aurelio Ranzato
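A sketch of why the FFT gives $O(n \log n)$ convolution (an added example, not from the deck): by the convolution theorem, convolution becomes a pointwise product in the frequency domain. Sizes and values are illustrative.

```python
import numpy as np

def fft_convolve(x, h):
    """Linear 1D convolution via the FFT (convolution theorem)."""
    n = len(x) + len(h) - 1            # full output length
    X = np.fft.rfft(x, n)              # O(n log n)
    H = np.fft.rfft(h, n)
    return np.fft.irfft(X * H, n)      # pointwise product, then inverse FFT

x = np.random.rand(1000)
h = np.random.rand(16)
assert np.allclose(fft_convolve(x, h), np.convolve(x, h))
```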

Page 11: Convolutional Layer

The convolution of the input (vector/matrix) with weights (vector/matrix) results in a response vector/matrix.

We can have multiple filters in each convolutional layer, each producing an output. If it is an intermediate layer, it can have multiple inputs!

One can add a nonlinearity at the output of the convolutional layer.

[Figure: a convolutional layer applying several filters to the same input, one response per filter.]

Page 12: Pooling Layer

How to handle variable-sized inputs?
- A layer which reduces inputs of different sizes to a fixed size.
- Pooling.

Slide Credit: Marc'Aurelio Ranzato

Page 13: Pooling Layer

How to handle variable-sized inputs?
- A layer which reduces inputs of different sizes to a fixed size.
- Pooling.
- Different variations (see the sketch below):
  - Max pooling: $h[n] = \max_{i \in N(n)} \tilde{h}[i]$
  - Average pooling: $h[n] = \frac{1}{n} \sum_{i \in N(n)} \tilde{h}[i]$
  - L2 pooling: $h[n] = \sqrt{\frac{1}{n} \sum_{i \in N(n)} \tilde{h}[i]^2}$
  - etc.
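The three variants above, sketched for non-overlapping windows over a 1D response vector (an added example; window size and values are made up):

```python
import numpy as np

def pool(h, window, kind="max"):
    """Pool non-overlapping windows of a 1D response vector h."""
    h = h[: len(h) // window * window].reshape(-1, window)
    if kind == "max":
        return h.max(axis=1)
    if kind == "avg":
        return h.mean(axis=1)
    if kind == "l2":                       # sqrt of the windowed mean of squares
        return np.sqrt((h ** 2).mean(axis=1))
    raise ValueError(kind)

h = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
print(pool(h, 2, "max"))  # [3. 8. 5.]
print(pool(h, 2, "avg"))  # [2. 5. 4.5]
```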

Page 14: Convolutional Nets

One-stage structure: Convolution → Pooling.

Whole system: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label.

Slide Credit: Dhruv Batra

Page 15: Training a ConvNet

The same procedure from back-propagation applies here.
- Remember that in backprop we started from the error terms in the last stage and passed them back to the previous layers, one by one.

Back-prop for the pooling layer:
- Consider, for example, the case of "max" pooling.
- This layer only routes the gradient to the input that had the highest value in the forward pass.
- Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches), so that gradient routing is efficient during backpropagation (see the sketch below).
- Therefore we have $\delta = \frac{\partial E_d}{\partial y_i}$ at the recorded index, and zero elsewhere.

[Figure: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with the error $E_d$ at the output; each stage carries $\delta^{pool} = \partial E_d / \partial y^{pool}$ and $\delta^{conv} = \partial E_d / \partial y^{conv}$, and $x_i$, $y_i$ mark a layer's input and output.]
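A sketch of the gradient-routing idea (an added example, not from the deck): the forward pass records the argmax "switches", and the backward pass sends each output gradient to the recorded input index.

```python
import numpy as np

def maxpool_forward(x, window):
    """Non-overlapping 1D max pooling; also returns the argmax switches."""
    xw = x[: len(x) // window * window].reshape(-1, window)
    switches = xw.argmax(axis=1)            # index of the max in each window
    return xw.max(axis=1), switches

def maxpool_backward(grad_out, switches, window, input_len):
    """Route each output gradient back to the input that won the max."""
    grad_in = np.zeros(input_len)
    for j, (g, s) in enumerate(zip(grad_out, switches)):
        grad_in[j * window + s] = g
    return grad_in

x = np.array([1.0, 3.0, 2.0, 8.0])
y, sw = maxpool_forward(x, 2)               # y = [3., 8.], sw = [1, 1]
print(maxpool_backward(np.array([0.5, -1.0]), sw, 2, len(x)))
# [ 0.   0.5  0.  -1. ]
```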

Page 16: Training a ConvNet

Back-prop for the convolutional layer:

[Figure: the same pipeline, Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with $E_d$ at the output and $x_i$, $y_i$ marking the layer's input and output.]

We derive the update rules for a 1D convolution, but the idea is the same for bigger dimensions.

The convolution:
$\hat{y} = w * x \iff \hat{y}_i = \sum_{k=0}^{N-1} w_k\, x_{i-k} = \sum_{k=0}^{N-1} w_{i-k}\, x_k \quad \forall i$

A differentiable nonlinearity:
$y = f(\hat{y}) \iff y_i = f(\hat{y}_i) \quad \forall i$

Gradient with respect to the filter (now we have everything in this layer to update the filter):
$\frac{\partial E_d}{\partial w_k} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i}\, \frac{\partial \hat{y}_i}{\partial w_k} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i}\, x_{i-k}$

Through the nonlinearity:
$\frac{\partial E_d}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i}\, \frac{\partial y_i}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i}\, f'(\hat{y}_i)$

Gradient with respect to the input (we need to pass the gradient to the previous layer):
$\delta = \frac{\partial E_d}{\partial x_k} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i}\, \frac{\partial \hat{y}_i}{\partial x_k} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i}\, w_{i-k}$

Now we can repeat this for each stage of the ConvNet.
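A sketch implementing the gradient formulas above and checking one numerically (an added example; circular indexing is assumed to keep the index arithmetic $i-k$ well defined at the borders):

```python
import numpy as np

def conv1d(w, x):
    """yhat_i = sum_k w_k x_{i-k}; circular indexing assumed for simplicity."""
    N, K = len(x), len(w)
    return np.array([sum(w[k] * x[(i - k) % N] for k in range(K))
                     for i in range(N)])

def grads(w, x, g):
    """Given g_i = dE/dyhat_i, return dE/dw and dE/dx per the sums above."""
    N, K = len(x), len(w)
    dw = np.array([sum(g[i] * x[(i - k) % N] for i in range(N))
                   for k in range(K)])
    dx = np.array([sum(g[i] * w[(i - j) % N] for i in range(N)
                       if (i - j) % N < K) for j in range(N)])
    return dw, dx

rng = np.random.default_rng(0)
w, x, c = rng.normal(size=3), rng.normal(size=8), rng.normal(size=8)
E = lambda w, x: c @ conv1d(w, x)        # a toy loss with dE/dyhat = c
dw, dx = grads(w, x, c)

eps = 1e-6                                # numerical check of dE/dw_0
w2 = w.copy(); w2[0] += eps
assert abs((E(w2, x) - E(w, x)) / eps - dw[0]) < 1e-4
```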

Page 17: Convolutional Nets

[Figure: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label.]

Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013].

Page 18: ConvNet Roots

- Fukushima (1980s): designed a network with the same basic structure, but did not train it by backpropagation.
- The first successful applications of convolutional networks were by Yann LeCun in the 1990s (LeNet).
  - It was used to read zip codes, digits, etc.
- Many variants exist nowadays, but the core idea is the same.
  - Example: a system developed at Google (GoogLeNet).
    - Compute different filters.
    - Compose one big vector from all of them.
    - Layer this iteratively.

See more: http://arxiv.org/pdf/1409.4842v1.pdf

Page 19: Depth Matters

Slide from [Kaiming He 2015]

Page 20: Practical Tips

- Before large-scale experiments, test on a small subset of the data and check that the error goes to zero.
  - Overfit on a small training set.
- Visualize features: feature maps need to be uncorrelated and to have high variance.
- Signs of bad training: many hidden units ignore the input and/or exhibit strong correlations.

Figure Credit: Marc'Aurelio Ranzato

Page 21: Debugging

Training diverges:
- The learning rate may be too large → decrease the learning rate.
- BackProp is buggy → do numerical gradient checking.

Loss is minimized, but accuracy is low:
- Check the loss function: is it appropriate for the task you want to solve? Does it have degenerate solutions?

NN is underperforming / under-fitting:
- Compute the number of parameters → if too small, make the network larger.

NN is too slow:
- Compute the number of parameters → use a distributed framework, use a GPU, make the network smaller.

Many of these points apply to many machine learning models, not just neural networks.

Page 22: CNN for Vector Inputs

Let's study another variant of CNNs, for language.
- Example: sentence classification (say, spam or not spam).

First step: represent each word with a vector in $\mathbb{R}^d$:

  This  is  not  a  spam

Concatenate the vectors.

Now we can assume that the input to the system is a vector in $\mathbb{R}^{dl}$:
- where the input sentence has length $l$ ($l = 5$ in our example),
- and each word vector's length is $d$ ($d = 7$ in our example).

[Figure: five 7-dimensional word vectors, concatenated into one long vector.]

Page 23: Convolutional Layer on Vectors

Think about a single convolutional layer:
- A bunch of vector filters, each defined in $\mathbb{R}^{dh}$:
  - where $h$ is the number of words the filter covers,
  - and $d$ is the size of the word vector.
- Find its (modified) convolution with the input vector.
- The result of the convolution with the filter (see the sketch below):
  $c_1 = f(w \cdot x_{1:h}),\; c_2 = f(w \cdot x_{2:h+1}),\; c_3 = f(w \cdot x_{3:h+2}),\; c_4 = f(w \cdot x_{4:h+3})$
  $c = [c_1, \dots, c_{n-h+1}]$
  (here $x_{i:j}$ denotes the concatenation of the word vectors for words $i$ through $j$)
- A convolution with a filter that spans 2 words operates on all of the bi-grams (vectors of two consecutive words, concatenated): "this is", "is not", "not a", "a spam",
- regardless of whether it is grammatical (not appealing linguistically).
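A sketch of this convolution over word windows (an added example; the embeddings are random stand-ins and $f$ is tanh):

```python
import numpy as np

d, h = 7, 2                               # word-vector size, filter span in words
words = ["this", "is", "not", "a", "spam"]
rng = np.random.default_rng(0)
embed = {wd: rng.normal(size=d) for wd in words}   # stand-in word vectors

x = [embed[wd] for wd in words]           # one d-vector per word
w = rng.normal(size=d * h)                # one filter over h consecutive words

# c_i = f(w . x_{i:i+h-1}): slide the filter over every h-word window (bi-grams here)
c = np.array([np.tanh(w @ np.concatenate(x[i:i + h]))
              for i in range(len(x) - h + 1)])
print(c.shape)   # (n - h + 1,) = (4,)
```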

Page 24: Convolutional Layer on Vectors

  This  is  not  a  spam

- Get word vectors for each word.
- Concatenate the vectors.
- Perform a convolution ($*$) with each filter in the filter bank.
- The result is a set of response vectors: one per filter, each of length (#words - length of filter + 1).

How are we going to handle the variable-sized response vectors? Pooling!

[Figure: word vectors concatenated into one long vector; the filter bank, applied by convolution, produces one response vector per filter.]

Page 25: Convolutional Layer on Vectors

  This  is  not  a  spam

- Get word vectors for each word.
- Concatenate the vectors.
- Perform a convolution ($*$) with each filter in the filter bank: #filters response vectors, each of length (#words - length of filter + 1).
- Pool the filter responses into a fixed-sized vector. Some choices for pooling: k-max, mean, etc.

Now we can pass the fixed-sized vector to a logistic unit (softmax), or give it to a multi-layer network (last session).

[Figure: the same pipeline as before, with a pooling step collapsing each response vector to a fixed size.]
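Putting the whole pipeline together, a minimal forward pass (an added sketch; random embeddings and filters, max pooling, and a hypothetical two-class softmax):

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, n_filters = 7, 2, 3
words = ["this", "is", "not", "a", "spam"]
embed = {wd: rng.normal(size=d) for wd in words}   # stand-in word vectors

x = [embed[wd] for wd in words]
W = rng.normal(size=(n_filters, d * h))        # filter bank
U = rng.normal(size=(2, n_filters))            # softmax weights for 2 classes

# responses: one length-(n - h + 1) vector per filter
responses = np.array([[np.tanh(w @ np.concatenate(x[i:i + h]))
                       for i in range(len(x) - h + 1)] for w in W])
pooled = responses.max(axis=1)                 # max pooling -> fixed size (n_filters,)

logits = U @ pooled
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over {spam, not spam}
print(probs)
```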

Page 26: Recurrent Neural Networks

Multi-layer feed-forward NN: a DAG.
- It just computes a fixed sequence of non-linear learned transformations to convert an input pattern into an output pattern.

Recurrent Neural Network: a digraph.
- It has cycles.
- A cycle can act as a memory;
- the hidden state of a recurrent net can carry along information about a "potentially" unbounded number of previous inputs.
- RNNs can model sequential data in a much more natural way.

Page 27: Equivalence between RNN and Feed-Forward NN

Assume that there is a time delay of 1 in using each connection.

The recurrent net is just a layered net that keeps reusing the same weights.

[Figure: a small recurrent net with weights w1, w2, w3, w4, unrolled over time = 0, 1, 2, 3 into a layered net that reuses the same weights W1, W2, W3, W4 at every step.]

Slide Credit: Geoff Hinton

Page 28: Recurrent Neural Networks

Training general RNNs can be hard.
- Here we will focus on a special family of RNNs.

Prediction on chain-like input:
- Example: POS tagging the words of a sentence.

  X = This  is   a   sample  sentence  .
  Y = DT    VBZ  DT  NN      NN        .

- Issues:
  - Structure in the output: there are connections between the labels.
  - Interdependence between elements of the input: the final decision is based on an intricate interdependence of the words on each other.
  - Variable-sized inputs: e.g., sentences differ in size.

How would you go about solving this task?

Page 29: Recurrent Neural Networks

A chain RNN:
- Has a chain-like structure.
- Each input is replaced with its vector representation $x_t$.
- The hidden (memory) unit $h_t$ contains information about previous inputs and previous hidden units $h_{t-1}$, $h_{t-2}$, etc.
  - It is computed from the past memory and the current word. It summarizes the sentence up to that time.

[Figure: an input layer $x_{t-1}, x_t, x_{t+1}$ feeding a memory layer $h_{t-1}, h_t, h_{t+1}$.]

Page 30: Recurrent Neural Networks

A popular way of formalizing it:
$h_t = f(W_h h_{t-1} + W_i x_t)$
- where $f$ is a nonlinear, differentiable (why?) function.

Outputs?
- Many options, depending on the problem and computational resources.

[Figure: the same chain, inputs $x_{t-1}, x_t, x_{t+1}$ into hidden units $h_{t-1}, h_t, h_{t+1}$.]
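A sketch of this recurrence (an added example; toy dimensions, random weights, $f = \tanh$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 7, 4                        # input size, hidden (memory) size
W_i = rng.normal(size=(m, d)) * 0.1
W_h = rng.normal(size=(m, m)) * 0.1

xs = rng.normal(size=(5, d))       # a length-5 input sequence
h = np.zeros(m)                    # h_0
for x_t in xs:
    h = np.tanh(W_h @ h + W_i @ x_t)   # h_t = f(W_h h_{t-1} + W_i x_t)
print(h)                           # summary of the whole sequence so far
```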

Page 31: Recurrent Neural Networks

- Prediction for $x_t$, with $h_t$:  $y_t = \mathrm{softmax}(W_o h_t)$
- Prediction for $x_t$, with $h_t, \dots, h_{t-\tau}$:  $y_t = \mathrm{softmax}\left(\sum_{i=0}^{\tau} \alpha_i\, W_{o-i}\, h_{t-i}\right)$
- Prediction for the whole chain:  $y_T = \mathrm{softmax}(W_o h_T)$

Some inherent issues with RNNs:
- Recurrent neural nets cannot capture phrases without prefix context.
- They often capture too much of the last words in the final vector.

[Figure: the chain with outputs $y_{t-1}, y_t, y_{t+1}$ on top of the hidden units.]

Page 32: Bi-directional RNN

One of the issues with an RNN:
- The hidden variables capture only one-sided context.

A bi-directional structure:
$h_t = f(W_h h_{t-1} + W_i x_t)$
$\tilde{h}_t = f(\tilde{W}_h \tilde{h}_{t+1} + \tilde{W}_i x_t)$
$y_t = \mathrm{softmax}(W_o h_t + \tilde{W}_o \tilde{h}_t)$

[Figure: a forward chain $h_t$ and a backward chain $\tilde{h}_t$ over the same inputs $x_t$, both feeding each output $y_t$.]
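A sketch of the two passes and the combined output (an added example; toy dimensions and random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, c = 7, 4, 3                                   # input, hidden, n. classes
W_i, W_h = rng.normal(size=(m, d)), rng.normal(size=(m, m))
Wb_i, Wb_h = rng.normal(size=(m, d)), rng.normal(size=(m, m))
W_o, Wb_o = rng.normal(size=(c, m)), rng.normal(size=(c, m))

xs = rng.normal(size=(5, d))
T = len(xs)

h = np.zeros((T + 1, m))                            # forward states, h[t+1] = h_t
for t in range(T):
    h[t + 1] = np.tanh(W_h @ h[t] + W_i @ xs[t])

hb = np.zeros((T + 1, m))                           # backward states, hb[t] = h~_t
for t in reversed(range(T)):
    hb[t] = np.tanh(Wb_h @ hb[t + 1] + Wb_i @ xs[t])

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

ys = [softmax(W_o @ h[t + 1] + Wb_o @ hb[t]) for t in range(T)]
print(ys[0])
```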

Page 33: Stack of Bi-directional Networks

Use the same idea to make the model more complex: stack several bi-directional layers on top of each other.

Page 34: Training RNNs

How to train such a model?
- Generalize the same ideas from back-propagation.

Total output error: $E(\vec{y}, \vec{t}) = \sum_{t=1}^{T} E_t(y_t, t_t)$

Reminder:
$y_t = \mathrm{softmax}(W_o h_t)$
$h_t = f(W_h h_{t-1} + W_i x_t)$

$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}, \qquad \frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t}\, \frac{\partial y_t}{\partial h_t}\, \frac{\partial h_t}{\partial h_k}\, \frac{\partial h_k}{\partial W}$

Parameters? $W_o$, $W_i$, $W_h$, plus the vectors for the input.

[Figure: the chain RNN with inputs, hidden units, and outputs at times $t-1, t, t+1$.]

This is sometimes called "Backpropagation Through Time", since the gradients are propagated back through time (a worked sketch follows below).
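A worked sketch of BPTT for $W_h$ alone, on a toy loss $E = c \cdot h_T$ (an added example), checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 3, 4, 6
W_i = rng.normal(size=(m, d)) * 0.5
W_h = rng.normal(size=(m, m)) * 0.5
xs = rng.normal(size=(T, d))
c = rng.normal(size=m)                      # toy loss: E = c . h_T

def forward(W_h):
    hs = [np.zeros(m)]                      # hs[t] = h_t, with h_0 = 0
    for x in xs:
        hs.append(np.tanh(W_h @ hs[-1] + W_i @ x))
    return hs

hs = forward(W_h)
delta = c.copy()                            # delta = dE/dh_t, starting at t = T
dW_h = np.zeros_like(W_h)
for t in range(T, 0, -1):
    dpre = delta * (1 - hs[t] ** 2)         # through tanh: f'(z) = 1 - tanh(z)^2
    dW_h += np.outer(dpre, hs[t - 1])       # accumulate dE/dW_h across time
    delta = W_h.T @ dpre                    # pass the gradient back to h_{t-1}

eps = 1e-6                                  # finite-difference check on W_h[0, 0]
Wp = W_h.copy(); Wp[0, 0] += eps
num = (c @ forward(Wp)[-1] - c @ hs[-1]) / eps
assert abs(num - dW_h[0, 0]) < 1e-3
```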

Page 35: Recurrent Neural Network

Reminder:
$y_t = \mathrm{softmax}(W_o h_t)$
$h_t = f(W_h h_{t-1} + W_i x_t)$

$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t}\, \frac{\partial y_t}{\partial h_t}\, \frac{\partial h_t}{\partial h_k}\, \frac{\partial h_k}{\partial W}$

$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} W_h\, \mathrm{diag}\!\left[f'(W_h h_{j-1} + W_i x_j)\right]$

since $\frac{\partial h_t}{\partial h_{t-1}} = W_h\, \mathrm{diag}\!\left[f'(W_h h_{t-1} + W_i x_t)\right]$, where $\mathrm{diag}(a_1, \dots, a_n) = \begin{pmatrix} a_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & a_n \end{pmatrix}$.

[Figure: the chain RNN with inputs $x_{t-1}, x_t, x_{t+1}$, hidden units $h_{t-1}, h_t, h_{t+1}$, and outputs $y_{t-1}, y_t, y_{t+1}$.]

Page 36: Vanishing/Exploding Gradients

$\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} W_h\, \mathrm{diag}\!\left[f'(W_h h_{j-1} + W_i x_j)\right]$

$\left\| \frac{\partial h_t}{\partial h_k} \right\| \le \prod_{j=k+1}^{t} \| W_h \| \left\| \mathrm{diag}\!\left[f'(W_h h_{j-1} + W_i x_j)\right] \right\| \le \prod_{j=k+1}^{t} \alpha \beta = (\alpha \beta)^{t-k}$

The gradient can become very small or very large quickly, and the locality assumption of gradient descent breaks down (vanishing gradient) [Bengio et al. 1994].

Vanishing gradients are quite prevalent and a serious issue. A real example:
- Training a feed-forward network.
- y-axis: sum of the gradient norms per layer.
- Earlier layers have an exponentially smaller sum of gradient norms.
- This will make training the earlier layers much slower.
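A quick numeric illustration of the bound (an added example): with the spectral norm of $W_h$ forced to $\alpha = 0.9$ and $|f'| \le \beta = 1$, the norm of the product decays roughly like $(\alpha \beta)^{t-k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
W_h = rng.normal(size=(m, m))
W_h *= 0.9 / np.linalg.norm(W_h, 2)           # force spectral norm alpha = 0.9

J = np.eye(m)
for k in range(50):
    f_prime = rng.uniform(0.0, 1.0, size=m)   # stand-in for |f'| <= 1 (e.g., tanh)
    J = W_h @ np.diag(f_prime) @ J            # one factor of the product
    if k % 10 == 9:
        print(k + 1, np.linalg.norm(J, 2))    # norm shrinks exponentially
```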

Page 37: Vanishing/Exploding Gradients

In an RNN trained on long sequences (e.g., 100 time steps) the gradients can easily explode or vanish.
- So RNNs have difficulty dealing with long-range dependencies.

Many methods have been proposed to reduce the effect of vanishing gradients, although it is still a problem:
- Introduce shorter paths between long connections.
- Abandon stochastic gradient descent in favor of a much more sophisticated Hessian-Free (HF) optimization.
- Add fancier modules that are robust to handling long memory, e.g., Long Short-Term Memory (LSTM).

One trick to handle exploding gradients, clip gradients that get too big (see the sketch below):

Define $g = \frac{\partial E}{\partial W}$.
If $\|g\| \ge threshold$, then $g \leftarrow \frac{threshold}{\|g\|}\, g$.
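A sketch of the clipping rule (an added example):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g if its norm exceeds the threshold (exploding-gradient trick)."""
    norm = np.linalg.norm(g)
    if norm >= threshold:
        g = (threshold / norm) * g
    return g

g = np.array([3.0, 4.0])            # ||g|| = 5
print(clip_gradient(g, 1.0))        # [0.6 0.8], norm rescaled to 1
```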