CSC411/2515 Fall 2016
Neural Networks Tutorial
Lluís Castrejón Oct. 2016
Slides adapted from Yujia Li’s tutorial and Prof. Zemel’s lecture notes.
Overfitting
• The training data contains information about the regularities in the mapping from input to output. But it also contains noise
– The target values may be unreliable.
– There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
– So it fits both kinds of regularity.
– If the model is very flexible it can model the sampling error really well. This is a disaster.
Overfitting
Picture credit: Chris Bishop. Pattern Recognition and Machine Learning. Ch.1.1.
Preventing overfitting
• Use a model that has the right capacity:
– enough to model the true regularities
– not enough to also model the spurious regularities (assuming they are weaker)
• Standard ways to limit the capacity of a neural net:
– Limit the number of hidden units.
– Limit the size of the weights.
– Stop the learning before it has time to overfit.
Limiting the size of the weights
Weight-decay involves adding an extra term to the cost function that penalizes the squared weights.
– Keeps weights small unless they have big error derivatives.
C = E + (λ/2) Σᵢ wᵢ²

∂C/∂wᵢ = ∂E/∂wᵢ + λwᵢ

When ∂C/∂wᵢ = 0:  wᵢ = −(1/λ) ∂E/∂wᵢ
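The penalized cost above can be minimized directly by gradient descent; the sketch below uses a hypothetical quadratic error E(w) = ½‖w − t‖² (the error function, λ, and learning rate are all illustrative choices, not from the slides) to show how the λwᵢ term shrinks the learned weights.

```python
import numpy as np

# Hypothetical quadratic error E(w) = 0.5 * ||w - t||^2, chosen for illustration.
def error_grad(w, t):
    return w - t

def cost_grad(w, t, lam):
    # dC/dw_i = dE/dw_i + lambda * w_i, from the penalty (lambda/2) * sum w_i^2
    return error_grad(w, t) + lam * w

lam, lr = 0.1, 0.5
t = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(200):                 # plain gradient descent on C = E + penalty
    w -= lr * cost_grad(w, t, lam)

# At the minimum dC/dw = 0, so w = t / (1 + lambda): the weights are pulled
# toward zero instead of reaching the unpenalized optimum t exactly.
print(w)
```

With λ = 0.1 each weight ends up at t/1.1, which is the "keeps weights small unless they have big error derivatives" behavior in miniature.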
The effect of weight-decay
• It prevents the network from using weights that it does not need
– This can often improve generalization a lot.
– It helps to stop it from fitting the sampling error.
– It makes a smoother model in which the output changes more slowly as the input changes.
• But, if the network has two very similar inputs it prefers to put half the weight on each rather than all the weight on one → other form of weight decay?
[Diagram: weights w/2, w/2 preferred over w, 0]
Deciding how much to restrict the capacity
• How do we decide which limit to use and how strong to make the limit?
– If we use the test data we get an unfair prediction of the error rate we would get on new test data.
– Suppose we compared a set of models that gave random results; the best one on a particular dataset would do better than chance. But it won't do better than chance on another test set.
• So use a separate validation set to do model selection.
Using a validation set
• Divide the total dataset into three subsets:
– Training data is used for learning the parameters of the model.
– Validation data is not used for learning but is used for deciding what type of model and what amount of regularization works best.
– Test data is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data.
• We could then re-divide the total dataset to get another unbiased estimate of the true error rate.
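A minimal sketch of the three-way split described above, using synthetic data (the array shapes, split sizes, and random seed are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 256))        # stand-in inputs (e.g. 16x16 images flattened)
y = rng.integers(0, 10, size=n)      # stand-in digit labels

# Shuffle once, then carve the indices into the three subsets.
idx = rng.permutation(n)
train, val, test = idx[:700], idx[700:850], idx[850:]

X_train, y_train = X[train], y[train]  # fit the parameters here
X_val, y_val = X[val], y[val]          # pick model / regularization strength here
X_test, y_test = X[test], y[test]      # final unbiased estimate; touch only once
```

The key property is that the three index sets are disjoint, so the test estimate is never contaminated by model selection.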
Preventing overfitting by early stopping
• If we have lots of data and a big model, it's very expensive to keep re-training it with different amounts of weight decay.
• It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse.
• The capacity of the model is limited because the weights have not had time to grow big.
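The stopping rule can be sketched as a patience loop. The validation-error function below is a toy stand-in (not from the slides) chosen to fall and then rise, mimicking the onset of overfitting:

```python
# Toy "validation error" as a function of training step: decreases until
# step 50, then increases, like a typical overfitting curve.
def val_error(step):
    return (step - 50) ** 2 / 2500.0 + 0.1

patience = 5                    # how many worse-than-best steps we tolerate
best, best_step, bad = float("inf"), 0, 0
for step in range(200):
    e = val_error(step)
    if e < best:                # validation improved: remember this checkpoint
        best, best_step, bad = e, step, 0
    else:                       # got worse: count a strike
        bad += 1
        if bad >= patience:     # stop before the weights have time to grow big
            break

print(best_step)
```

In a real training loop one would also save the weights at `best_step` and restore them after stopping.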
Why early stopping works
• When the weights are very small, every hidden unit is in its linear range.
– So a net with a large layer of hidden units is linear.
– It has no more capacity than a linear net in which the inputs are directly connected to the outputs!
• As the weights grow, the hidden units start using their non-linear ranges so the capacity grows.
[Diagram: inputs connected directly to outputs]
LeNet
• Yann LeCun and others developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:
– Many hidden layers
– Many pools of replicated units in each layer.
– Averaging the outputs of nearby replicated units.
– A wide net that can cope with several characters at once even if they overlap.
• Demo of LENET
Recognizing Digits
Hand-written digit recognition network
– 7291 training examples, 2007 test examples
– Both contain ambiguous and misclassified examples
– Input pre-processed (segmented, normalized)
• 16x16 gray level [-1, 1], 10 outputs
LeNet: Summary
Main ideas:
• Local → global processing
• Retain coarse position info
Main technique: weight sharing – units arranged in feature maps
Connections: 1256 units, 64,660 connections, 9760 free parameters
Results: 0.14% error (train), 5.0% (test)
vs. 3-layer net w/ 40 hidden units: 1.6% (train), 8.1% (test)
The 82 errors made by LeNet5
Notice that most of the errors are cases that people find quite easy. The human error rate is probably 20 to 30 errors.
A brute force approach
• LeNet uses knowledge about the invariances to design:
– the network architecture
– or the weight constraints
– or the types of feature
• But it's much simpler to incorporate knowledge of invariances by just creating extra training data:
– for each training image, produce new training data by applying all of the transformations we want to be insensitive to
– Then train a large, dumb net on a fast computer.
– This works surprisingly well.
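One simple instance of "creating extra training data": generate translated copies of each image. The sketch below handles only pixel shifts (real augmentation pipelines also use rotations, scalings, and elastic deformations); the function name and shift range are illustrative.

```python
import numpy as np

def shifted_copies(img, max_shift=1):
    """Return zero-padded translated copies of a 2-D image, one per
    (dx, dy) offset in [-max_shift, max_shift]^2, including the identity."""
    copies = []
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            out = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            # np.roll wraps around, so zero the pixels that wrapped in
            if dy > 0:   out[:dy, :] = 0
            elif dy < 0: out[dy:, :] = 0
            if dx > 0:   out[:, :dx] = 0
            elif dx < 0: out[:, dx:] = 0
            copies.append(out)
    return copies

img = np.arange(16, dtype=float).reshape(4, 4)  # small stand-in for a digit image
aug = shifted_copies(img)
print(len(aug))  # 9 copies per training image, including the unshifted one
```

With shifts of up to one pixel this multiplies the training set ninefold; the target label is simply copied along with each transformed image.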
Making backpropagation work for recognizing digits
• Use the standard viewing transformations, and local deformation fields to get lots of data.
• Use many, globally connected hidden layers and learn for a very long time
– This requires a GPU board or a large cluster
• Use the appropriate error measure for multi-class categorization
– Cross-entropy, with softmax activation
• This approach can get 35 errors on MNIST!
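The softmax-plus-cross-entropy error measure mentioned above can be written in a few lines; this is a minimal sketch (the example logits are made up):

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    # -log p(correct class); conveniently, the gradient w.r.t. the logits
    # is just p - one_hot(label), which is what makes this pairing popular.
    p = softmax(logits)
    return -np.log(p[label])

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
loss = cross_entropy(logits, 0)      # loss when class 0 is the true label
```

The loss is small when the correct class already gets most of the probability mass and grows without bound as that probability goes to zero.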
Fabricating training data
Good generalization requires lots of training data, including examples from all relevant input regions.
Improve the solution if good data can be constructed.
Example: ALVINN
ALVINN: simulating training examples
On-the-fly training: current video camera image as input, current steering direction as target.
But: over-training on the same inputs; no experience going off-road.
Method: generate new examples by shifting images.
Replace 10 low-error & 5 random training examples with 15 new ones.
Key: the relation between input and output is known!
Neural Net Demos
Scene recognition - Places MIT
Digit recognition
Neural Nets Playground
Neural Style Transfer