NEURAL NETWORKS (CS446, Fall '16)
Administration
HW1 grades should be up! HW3 is due midnight. HW4 will be released next Tuesday.
- Please start working on it as soon as possible.
- Come to sections with questions.

The deadline for project proposals is close.
- Make sure to find a partner and explore the ideas.
Questions
Recap: Multi-Layer Perceptrons

Multi-layer network:
- A global approximator.
- Different rules for training it.

Back-propagation:
- Forward step.
- Backward propagation of errors.

Congrats! Now you know the hardest concept about neural networks!

Today:
- Convolutional Neural Networks
- Recurrent Neural Networks
[Figure: a multi-layer network with input, hidden, and output layers; activations flow forward.]
Receptive Fields

The receptive field of an individual sensory neuron is the particular region of the sensory space (e.g., the body surface, or the retina) in which a stimulus will trigger the firing of that neuron.
- In the auditory system, receptive fields can correspond to volumes in auditory space.

Designing "proper" receptive fields for the input neurons is a significant challenge. Consider a task with image inputs:
- Receptive fields should give expressive features from the raw input to the system.
- How would you design the receptive fields for this problem?
A fully connected layer:
- Example:
  - 100x100 images
  - 1000 units in the layer
- Problems:
  - 10^7 edges!
  - Spatial correlations lost!
  - Variable-sized inputs.
Slide Credit: Marc'Aurelio Ranzato
Consider a task with image inputs. A locally connected layer:
- Example:
  - 100x100 images
  - 1000 units in the layer
  - Filter size: 10x10
- Local correlations preserved!
- Problems:
  - 10^5 edges.
  - This parameterization is good when the input image is registered (e.g., face recognition).
  - Variable-sized inputs, again.
Slide Credit: Marc'Aurelio Ranzato
Convolutional Layer

A solution:
- Filters to capture different patterns in the input space.
  - Share parameters across different locations (assuming the input is stationary).
  - Convolutions with learned filters.
- Filters will be learned during training.
- The issue of variable-sized inputs will be resolved with a pooling layer.
Slide Credit: Marc'Aurelio Ranzato
So what is a convolution?
Convolution Operator

The convolution operator $*$ takes two functions and gives another function.

One dimension:

$(x * h)(t) = \int x(\tau)\, h(t - \tau)\, d\tau$

$(x * h)[n] = \sum_m x[m]\, h[n - m]$

"Convolution" is very similar to "cross-correlation", except that in convolution one of the functions is flipped.
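To make the flip concrete, here is a minimal NumPy check of the discrete definition above; the array contents are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # input signal
h = np.array([1.0, 0.0, -1.0])      # filter

# Convolution: sum_m x[m] * h[n - m]  (h is flipped as it slides)
conv = np.convolve(x, h)

# Convolving with h equals cross-correlating with the flipped filter.
assert np.allclose(conv, np.correlate(x, h[::-1], mode="full"))
print(conv)  # [ 1.  2.  2.  2. -3. -4.]
```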
Convolution Operator (2)

Convolution in two dimensions:
- The same idea: flip one matrix and slide it over the other matrix.
- Example: a sharpen kernel.

Try other kernels: http://setosa.io/ev/image-kernels/
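As a hedged illustration (the 4x4 "image" is made up, and SciPy's convolve2d stands in for the sliding computation):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
sharpen = np.array([[ 0., -1.,  0.],
                    [-1.,  5., -1.],
                    [ 0., -1.,  0.]])               # a common sharpen kernel

# convolve2d flips the kernel before sliding it, as in the definition above
# (this particular kernel is symmetric, so the flip changes nothing).
print(convolve2d(image, sharpen, mode="same", boundary="symm"))
```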
Convolution Operator (3)

Convolution in two dimensions:
- The same idea: flip one matrix and slide it over the other matrix.

Slide Credit: Marc'Aurelio Ranzato
Complexity of Convolution

The complexity of the convolution operator is $O(n \log n)$ for $n$ inputs.
- It uses the Fast Fourier Transform (FFT).

In two dimensions, each convolution takes $O(MN \log MN)$ time, where the size of the input is $MN$.
Slide Credit: Marc'Aurelio Ranzato
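To see where the $n \log n$ comes from, here is a minimal NumPy check that linear convolution equals pointwise multiplication in the Fourier domain (signal lengths and contents are made up):

```python
import numpy as np

x, h = np.random.randn(64), np.random.randn(16)
n = len(x) + len(h) - 1                # length of the full convolution

direct = np.convolve(x, h)                                        # O(|x||h|)
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)  # O(n log n)

assert np.allclose(direct, via_fft)
```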
Convolutional Layer

The convolution of the input (vector/matrix) with the weights (vector/matrix) results in a response vector/matrix. We can have multiple filters in each convolutional layer, each producing an output. If it is an intermediate layer, it can have multiple inputs!

[Figure: one input passed through several filters, each producing its own response.]

One can add a nonlinearity at the output of the convolutional layer.
Pooling Layer

How do we handle variable-sized inputs?
- A layer which reduces inputs of different sizes to a fixed size.
- Pooling.

Slide Credit: Marc'Aurelio Ranzato
Pooling Layer

How do we handle variable-sized inputs?
- A layer which reduces inputs of different sizes to a fixed size.
- Pooling.
- Different variations, where $N(n)$ is the set of inputs pooled into output $n$ and $\bar{h}$ is the layer's input:
  - Max pooling: $h[n] = \max_{i \in N(n)} \bar{h}[i]$
  - Average pooling: $h[n] = \frac{1}{|N(n)|} \sum_{i \in N(n)} \bar{h}[i]$
  - L2 pooling: $h[n] = \sqrt{\frac{1}{|N(n)|} \sum_{i \in N(n)} \bar{h}^2[i]}$
  - etc.
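A minimal sketch of the three rules, over non-overlapping windows of a 1D response vector (the window size and data are made up, and the input length is assumed to be a multiple of the window size):

```python
import numpy as np

def pool(h, size, kind="max"):
    windows = h.reshape(-1, size)       # one row per pooling region N(n)
    if kind == "max":
        return windows.max(axis=1)      # max pooling
    if kind == "avg":
        return windows.mean(axis=1)     # average pooling
    if kind == "l2":
        return np.sqrt((windows ** 2).mean(axis=1))  # L2 pooling
    raise ValueError(kind)

h = np.array([1.0, 3.0, -2.0, 2.0, 0.0, 4.0])
print(pool(h, 2, "max"), pool(h, 2, "avg"), pool(h, 2, "l2"))
# [3. 2. 4.] [2. 0. 2.] [2.236... 2. 2.828...]
```

Whatever the input length, choosing the number of regions (rather than their size) yields a fixed-size output.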
Convolutional Nets

One-stage structure: Convolution → Pooling.

Whole system: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label.

Slide Credit: Dhruv Batra
Training a ConvNet

The same procedure from back-propagation applies here.
- Remember that in backprop we started from the error terms in the last stage, and passed them back to the previous layers, one by one.

Back-prop for the pooling layer:
- Consider, for example, the case of "max" pooling.
- This layer only routes the gradient to the input that has the highest value in the forward pass.
- Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
- Therefore we have $\delta = \frac{\partial E_d}{\partial y}$ routed to the argmax input, and zero everywhere else.
[Figure: the pipeline Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, each stage a convolution followed by pooling; the output error $E_d$ induces $\delta_{\text{pooling}} = \partial E_d / \partial y_{\text{pooling}}$ and $\delta_{\text{conv}} = \partial E_d / \partial y_{\text{conv}}$ at the intermediate values $x_i$, $y_i$.]
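A hedged sketch of this routing (window size, shapes, and numbers are made up):

```python
import numpy as np

def maxpool_forward(x, size):
    windows = x.reshape(-1, size)
    switches = windows.argmax(axis=1)     # remember where each max came from
    return windows.max(axis=1), switches

def maxpool_backward(dout, switches, size):
    dx = np.zeros((len(dout), size))
    dx[np.arange(len(dout)), switches] = dout   # route gradient to the max
    return dx.reshape(-1)                       # zeros everywhere else

x = np.array([1.0, 3.0, -2.0, 2.0])
y, sw = maxpool_forward(x, 2)                   # y = [3. 2.], sw = [1 1]
print(maxpool_backward(np.array([0.5, -1.0]), sw, 2))  # [ 0.  0.5  0. -1. ]
```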
Training a ConvNet

Back-prop for the convolutional layer:
We derive the update rules for a 1D convolution, but the idea is the same for higher dimensions.

The convolution:

$\hat{y} = w * x \iff \hat{y}_i = \sum_{a=0}^{N-1} w_a \, x_{i-a} = \sum_{a=0}^{N-1} w_{i-a} \, x_a \quad \forall i$

A differentiable nonlinearity:

$y = f(\hat{y}) \iff y_i = f(\hat{y}_i) \quad \forall i$

Through the nonlinearity:

$\frac{\partial E_d}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i} \frac{\partial y_i}{\partial \hat{y}_i} = \frac{\partial E_d}{\partial y_i} \, f'(\hat{y}_i)$

Now we have everything in this layer to update the filter:

$\frac{\partial E_d}{\partial w_a} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w_a} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i} \, x_{i-a}$

And we need to pass the gradient to the previous layer:

$\delta = \frac{\partial E_d}{\partial x_a} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial x_a} = \sum_{i=0}^{N-1} \frac{\partial E_d}{\partial \hat{y}_i} \, w_{i-a}$

Now we can repeat this for each stage of the ConvNet.
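The two sums above are easy to get off by one, so here is a hedged NumPy sketch of the derived rules, with tanh standing in for the differentiable nonlinearity $f$ and a made-up loss $E = \sum_i y_i$ for the numerical check; both of those choices are assumptions:

```python
import numpy as np

def forward(x, w):
    y_hat = np.convolve(x, w)                   # y_hat[i] = sum_a w[a] x[i-a]
    return np.tanh(y_hat), y_hat

def backward(dE_dy, y_hat, x, w):
    d = dE_dy * (1.0 - np.tanh(y_hat) ** 2)     # dE/dy_hat = dE/dy * f'(y_hat)
    n, K = len(x), len(w)
    dE_dw = np.convolve(d, x[::-1])[n - 1 : n - 1 + K]  # sum_i d[i] x[i-a]
    dE_dx = np.convolve(d, w[::-1])[K - 1 : K - 1 + n]  # sum_i d[i] w[i-a]
    return dE_dw, dE_dx                 # filter update, delta to pass back

x, w = np.array([1.0, -2.0, 0.5]), np.array([0.3, -0.1])
y, y_hat = forward(x, w)
dE_dw, dE_dx = backward(np.ones_like(y), y_hat, x, w)   # E = sum(y)

eps = 1e-6                              # finite-difference check on w[0]
num = (forward(x, w + np.array([eps, 0.0]))[0].sum() - y.sum()) / eps
assert abs(num - dE_dw[0]) < 1e-4
```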
Convolutional Nets

[Figure: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with feature visualizations of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013].]
ConvNet Roots

Fukushima (1980s) designed a network with the same basic structure, but did not train it by backpropagation.

The first successful applications of convolutional networks were by Yann LeCun in the 1990s (LeNet):
- It was used to read zip codes, digits, etc.

There are many variants nowadays, but the core idea is the same.
- Example: a system developed at Google (GoogLeNet):
  - Compute different filters.
  - Compose one big vector from all of them.
  - Layer this iteratively.

See more: http://arxiv.org/pdf/1409.4842v1.pdf
Depth Matters

Slide from [Kaiming He 2015]
Practical Tips

Before large-scale experiments, test on a small subset of the data and check that the error goes to zero.
- A model that cannot overfit a small training set has a problem.

Visualize the features: feature maps need to be uncorrelated and have high variance.

Signs of bad training: many hidden units ignore the input and/or exhibit strong correlations.

Figure Credit: Marc'Aurelio Ranzato
Debugging

Training diverges:
- The learning rate may be too large → decrease the learning rate.
- BackProp is buggy → use numerical gradient checking.

Loss is minimized but accuracy is low:
- Check the loss function: Is it appropriate for the task you want to solve? Does it have degenerate solutions?

NN is underperforming / under-fitting:
- Compute the number of parameters → if too small, make the network larger.

NN is too slow:
- Compute the number of parameters → use a distributed framework, use a GPU, make the network smaller.

Many of these points apply to many machine learning models, not just neural networks.
CNN for Vector Inputs

Let's study another variant of CNNs, for language.
- Example: sentence classification (say, spam or not spam).

First step: represent each word with a vector in $\mathbb{R}^d$:

  This is not a spam

[Figure: each of the five words mapped to a $d$-dimensional vector; the vectors are then concatenated into one long vector.]

Now we can assume that the input to the system is a vector in $\mathbb{R}^{dl}$:
- where the input sentence has length $l$ ($l = 5$ in our example),
- and each word vector's length is $d$ ($d = 7$ in our example).
Convolutional Layer on Vectors

Think about a single convolutional layer:
- A bunch of vector filters:
  - Each defined in $\mathbb{R}^{dh}$,
  - where $h$ is the number of words the filter covers, and $d$ is the size of the word vector.
- Find its (modified) convolution with the input vector.
- The result of the convolution with one filter is one response per window of $h$ consecutive words:

$c_i = f(w \cdot x_{i:i+h-1}), \qquad c = [c_1, \dots, c_{n-h+1}]$

where $n$ is the number of words.
- Convolution with a filter that spans 2 words operates on all of the bi-grams (vectors of two consecutive words, concatenated): "this is", "is not", "not a", "a spam".
- This happens regardless of whether the bi-gram is grammatical (not appealing linguistically).

[Figure: a filter sliding over the concatenated word vectors, producing one response $c_i$ per window.]
Convolutional Layer on Vectors (2)

[Figure: the full pipeline for "This is not a spam": get the word vectors for each word; concatenate the vectors; perform convolution (*) with each filter in the filter bank; the result is a set of response vectors, one per filter, each of length #words − filter length + 1.]

How are we going to handle the variable-sized response vectors? Pooling!
Convolutional Layer on Vectors (3)

[Figure: the same pipeline, now followed by pooling on the filter responses: each variable-length response vector is pooled down to a fixed size, giving one output per filter.]

Some choices for pooling: k-max, mean, etc.

Now we can pass the fixed-size vector to a logistic unit (softmax), or give it to a multi-layer network (last session). A sketch of the whole pipeline follows.
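A hedged end-to-end sketch with one filter: random word vectors, a filter spanning $h = 2$ words, convolution over all windows, then max pooling down to a fixed size. The dimensions, the random vectors, and the choice of tanh for $f$ are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 7, 2                                  # word-vector size, filter span
words = ["This", "is", "not", "a", "spam"]
X = rng.normal(size=(len(words), d))         # one (made-up) vector per word
w = rng.normal(size=h * d)                   # filter over h concatenated words

# c_i = f(w . x_{i:i+h-1}): one response per window of h consecutive words
c = np.array([np.tanh(w @ X[i:i + h].ravel())
              for i in range(len(words) - h + 1)])

print(c.shape)   # (4,) -- "This is", "is not", "not a", "a spam"
print(c.max())   # max pooling: one number per filter, whatever the length
```

With a bank of filters, the pooled responses are stacked into one fixed-size vector and handed to the softmax or multi-layer network above.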
Recurrent Neural Networks

Multi-layer feed-forward NN: a DAG.
- It just computes a fixed sequence of non-linear learned transformations to convert an input pattern into an output pattern.

Recurrent Neural Network: a digraph.
- It has cycles.
- A cycle can act as a memory;
- the hidden state of a recurrent net can carry along information about a "potentially" unbounded number of previous inputs.
- RNNs can model sequential data in a much more natural way.
Equivalence between RNN and Feed-forward NN

Assume that there is a time delay of 1 in using each connection. The recurrent net is then just a layered net that keeps reusing the same weights.

[Figure: a recurrent net with weights w1, w2, w3, w4 unrolled into a layered net over time = 0, 1, 2, 3; every unrolled layer reuses the same weights W1, W2, W3, W4.]

Slide Credit: Geoff Hinton
Recurrent Neural Networks

Training general RNNs can be hard.
- Here we will focus on a special family of RNNs.

Prediction on chain-like input:
- Example: POS tagging the words of a sentence:

  X = This is a sample sentence .
  Y = DT VBZ DT NN NN .

- Issues:
  - Structure in the output: there are connections between the labels.
  - Interdependence between elements of the inputs: the final decision is based on an intricate interdependence of the words on each other.
  - Variable-sized inputs: e.g., sentences differ in size.

How would you go about solving this task?
Recurrent Neural Networks

A chain RNN:
- Has a chain-like structure.
- Each input is replaced with its vector representation $x_t$.
- The hidden (memory) unit $h_t$ contains information about previous inputs and previous hidden units $h_{t-1}$, $h_{t-2}$, etc.
  - It is computed from the past memory and the current word. It summarizes the sentence up to that point.

[Figure: an input layer $x_{t-1}, x_t, x_{t+1}$ feeding a memory layer $h_{t-1}, h_t, h_{t+1}$.]
Recurrent Neural Networks

A popular way of formalizing it:

$h_t = f(W_h h_{t-1} + W_i x_t)$

- where $f$ is a nonlinear, differentiable (why?) function.

Outputs?
- Many options, depending on the problem and the computational resources. A forward-pass sketch follows.

[Figure: the chain of inputs $x_{t-1}, x_t, x_{t+1}$ and hidden states $h_{t-1}, h_t, h_{t+1}$.]
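A minimal sketch of the recurrence with $f = \tanh$; the sizes, the random weights, and the random input sequence are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 7, 5
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))   # hidden-to-hidden weights
W_i = rng.normal(scale=0.1, size=(d_hid, d_in))    # input-to-hidden weights

h = np.zeros(d_hid)                       # h_0: empty memory
for x_t in rng.normal(size=(4, d_in)):    # a length-4 input sequence
    h = np.tanh(W_h @ h + W_i @ x_t)      # h_t summarizes the prefix so far
print(h)
```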
Recurrent Neural Networks

Some options for the outputs:
- Prediction for $x_t$, with $h_t$: $\; y_t = \mathrm{softmax}(W_o h_t)$
- Prediction for $x_t$, with $h_t, \dots, h_{t-\tau}$: $\; y_t = \mathrm{softmax}\!\left(\sum_{i=0}^{\tau} \alpha_i W_o^{(i)} h_{t-i}\right)$
- Prediction for the whole chain: $\; y_T = \mathrm{softmax}(W_o h_T)$

Some inherent issues with RNNs:
- Recurrent neural nets cannot capture phrases without prefix context.
- They often capture too much of the last words in the final vector.

[Figure: outputs $y_{t-1}, y_t, y_{t+1}$ read off the hidden states $h_{t-1}, h_t, h_{t+1}$.]
Bi-directional RNN

One of the issues with RNNs:
- The hidden variables capture only one-sided context.

A bi-directional structure:

$h_t = f(W_h h_{t-1} + W_i x_t)$

$\bar{h}_t = f(\bar{W}_h \bar{h}_{t+1} + \bar{W}_i x_t)$

$y_t = \mathrm{softmax}(W_o h_t + \bar{W}_o \bar{h}_t)$

[Figure: a forward chain $h$ and a backward chain $\bar{h}$ over the inputs $x_{t-1}, x_t, x_{t+1}$, both feeding the outputs $y_{t-1}, y_t, y_{t+1}$.]
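A hedged sketch of these three equations (shapes, weights, and data are made up; the weight names mirror the formulas):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d_in, d_hid, n_cls = 4, 7, 5, 3
X = rng.normal(size=(T, d_in))
W_h = rng.normal(scale=0.1, size=(d_hid, d_hid))   # forward-chain weights
W_i = rng.normal(scale=0.1, size=(d_hid, d_in))
Wb_h = rng.normal(scale=0.1, size=(d_hid, d_hid))  # backward-chain weights
Wb_i = rng.normal(scale=0.1, size=(d_hid, d_in))
W_o = rng.normal(size=(n_cls, d_hid))
Wb_o = rng.normal(size=(n_cls, d_hid))

fwd = np.zeros((T, d_hid))
bwd = np.zeros((T, d_hid))
for t in range(T):                                  # left-to-right pass
    prev = fwd[t - 1] if t > 0 else np.zeros(d_hid)
    fwd[t] = np.tanh(W_h @ prev + W_i @ X[t])
for t in reversed(range(T)):                        # right-to-left pass
    nxt = bwd[t + 1] if t < T - 1 else np.zeros(d_hid)
    bwd[t] = np.tanh(Wb_h @ nxt + Wb_i @ X[t])

y = [softmax(W_o @ fwd[t] + Wb_o @ bwd[t]) for t in range(T)]  # both contexts
```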
Stack of Bi-directional Networks

Use the same idea and make your model more complex: stack several bi-directional layers on top of each other.
Training RNNs

How do we train such a model?
- Generalize the same ideas from back-propagation.

Total output error: $E(\vec{y}, \vec{t}\,) = \sum_{t=1}^{T} E_t(y_t, t_t)$

Reminder:

$y_t = \mathrm{softmax}(W_o h_t)$

$h_t = f(W_h h_{t-1} + W_i x_t)$

Parameters? $W_o$, $W_i$, $W_h$, plus the vectors for the input.

$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W}$

$\frac{\partial E_t}{\partial W} = \sum_{k=0}^{t-1} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-k}} \frac{\partial h_{t-k}}{\partial W}$

This is sometimes called "Backpropagation Through Time", since the gradients are propagated back through time.

[Figure: the unrolled chain of inputs $x_{t-1}, x_t, x_{t+1}$, hidden states $h_{t-1}, h_t, h_{t+1}$, and outputs $y_{t-1}, y_t, y_{t+1}$.]
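As a concrete illustration, here is backpropagation through time for a scalar RNN with a squared-error loss; the data, targets, and loss are made-up assumptions, and the accumulated gradient is checked against a finite difference:

```python
import numpy as np

def run(w_h, w_i, xs, ts):
    hs, h = [], 0.0
    for x in xs:                                 # unrolled forward pass
        h = np.tanh(w_h * h + w_i * x)
        hs.append(h)
    E = 0.5 * sum((h - t) ** 2 for h, t in zip(hs, ts))
    return hs, E

w_h, w_i = 0.5, 0.8
xs, ts = [1.0, -0.5, 0.2], [0.1, 0.0, -0.2]
hs, E = run(w_h, w_i, xs, ts)

dw_h, dh_next = 0.0, 0.0
for t in reversed(range(len(xs))):               # walk back through time
    dh = (hs[t] - ts[t]) + dh_next               # local error + error from t+1
    dpre = dh * (1.0 - hs[t] ** 2)               # through tanh
    dw_h += dpre * (hs[t - 1] if t > 0 else 0.0) # accumulate over all t
    dh_next = dpre * w_h                         # pass gradient one step back

eps = 1e-6
num = (run(w_h + eps, w_i, xs, ts)[1] - E) / eps
assert abs(num - dw_h) < 1e-5
```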
Recurrent Neural Network

Reminder:

$y_t = \mathrm{softmax}(W_o h_t)$

$h_t = f(W_h h_{t-1} + W_i x_t)$

$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-k}} \frac{\partial h_{t-k}}{\partial W}$

The Jacobian between hidden states factors into a product:

$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} W_h \, \mathrm{diag}\!\left[f'(W_h h_{j-1} + W_i x_j)\right]$

since $\frac{\partial h_t}{\partial h_{t-1}} = W_h \, \mathrm{diag}\!\left[f'(W_h h_{t-1} + W_i x_t)\right]$, where $\mathrm{diag}(a_1, \dots, a_n)$ is the diagonal matrix with $a_1, \dots, a_n$ on its diagonal.

[Figure: the same unrolled chain of inputs, hidden states, and outputs.]
Vanishing/Exploding Gradients

$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} W_h \, \mathrm{diag}\!\left[f'(W_h h_{j-1} + W_i x_j)\right]$

Bounding each factor, with $\|W_h\| \le \alpha$ and $\|\mathrm{diag}[f'(\cdot)]\| \le \beta$:

$\left\|\frac{\partial h_t}{\partial h_{t-k}}\right\| \le \prod_{j=t-k+1}^{t} \|W_h\| \left\|\mathrm{diag}\!\left[f'(W_h h_{j-1} + W_i x_j)\right]\right\| \le (\alpha \beta)^k$

The gradient can become very small or very large quickly, and the locality assumption of gradient descent breaks down (vanishing gradient) [Bengio et al. 1994].

Vanishing gradients are quite prevalent and a serious issue. A real example:
- Training a feed-forward network.
- y-axis: sum of the gradient norms.
- Earlier layers have an exponentially smaller sum of gradient norms.
- This makes training the earlier layers much slower.
Vanishing/Exploding Gradients (2)

In an RNN trained on long sequences (e.g., 100 time steps), the gradients can easily explode or vanish.
- So RNNs have difficulty dealing with long-range dependencies.

Many methods have been proposed to reduce the effect of vanishing gradients, although it is still a problem:
- Introduce shorter paths between long-range connections.
- Abandon stochastic gradient descent in favor of a much more sophisticated Hessian-Free (HF) optimization.
- Add fancier modules that are robust to handling long memory, e.g., Long Short-Term Memory (LSTM).

One trick to handle exploding gradients is to clip gradients with big norms (a sketch follows):

Define $g = \frac{\partial E}{\partial W}$. If $\|g\| \ge threshold$, then $g \leftarrow \frac{threshold}{\|g\|} \, g$.