CSC411/2515 Fall 2016
Neural Networks Tutorial
Lluís Castrejón Oct. 2016
Slides adapted from Yujia Li’s tutorial and Prof. Zemel’s lecture notes.
Overfitting
• The training data contains information about the regularities in the mapping from input to output. But it also contains noise
– The target values may be unreliable.
– There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
– So it fits both kinds of regularity.
– If the model is very flexible it can model the sampling error really well. This is a disaster.
Overfitting
Picture credit: Chris Bishop. Pattern Recognition and Machine Learning. Ch.1.1.
Preventing overfitting
• Use a model that has the right capacity:
– enough to model the true regularities
– not enough to also model the spurious regularities (assuming they are weaker)
• Standard ways to limit the capacity of a neural net:
– Limit the number of hidden units.
– Limit the size of the weights.
– Stop the learning before it has time to overfit.
Limiting the size of the weights
Weight-decay involves adding an extra term to the cost function that penalizes the squared weights.
– Keeps weights small unless they have big error derivatives.
C = E + (λ/2) Σᵢ wᵢ²

∂C/∂wᵢ = ∂E/∂wᵢ + λwᵢ

When ∂C/∂wᵢ = 0:  wᵢ = −(1/λ) ∂E/∂wᵢ
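The penalized cost above can be minimized directly by gradient descent; the sketch below uses a hypothetical quadratic error E(w) = ½‖w − t‖² (the error function, λ, and learning rate are all illustrative choices, not from the slides) to show how the λwᵢ term shrinks the learned weights.

```python
import numpy as np

# Hypothetical quadratic error E(w) = 0.5 * ||w - t||^2, chosen for illustration.
def error_grad(w, t):
    return w - t

def cost_grad(w, t, lam):
    # dC/dw_i = dE/dw_i + lambda * w_i, from the penalty (lambda/2) * sum w_i^2
    return error_grad(w, t) + lam * w

lam, lr = 0.1, 0.5
t = np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(200):                 # plain gradient descent on C = E + penalty
    w -= lr * cost_grad(w, t, lam)

# At the minimum dC/dw = 0, so w = t / (1 + lambda): the weights are pulled
# toward zero instead of reaching the unpenalized optimum t exactly.
print(w)
```

With λ = 0.1 each weight ends up at t/1.1, which is the "keeps weights small unless they have big error derivatives" behavior in miniature.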
The effect of weight-decay
• It prevents the network from using weights that it does not need
– This can often improve generalization a lot.
– It helps to stop it from fitting the sampling error.
– It makes a smoother model in which the output changes more slowly as the input changes.
• But, if the network has two very similar inputs it prefers to put half the weight on each rather than all the weight on one → other form of weight decay?
[Diagram: weights w/2, w/2 preferred over w, 0]
Deciding how much to restrict the capacity
• How do we decide which limit to use and how strong to make the limit?
– If we use the test data we get an unfair prediction of the error rate we would get on new test data.
– Suppose we compared a set of models that gave random results; the best one on a particular dataset would do better than chance. But it won't do better than chance on another test set.
• So use a separate validation set to do model selection.
Using a validation set
• Divide the total dataset into three subsets:
– Training data is used for learning the parameters of the model.
– Validation data is not used for learning but is used for deciding what type of model and what amount of regularization works best.
– Test data is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data.
• We could then re-divide the total dataset to get another unbiased estimate of the true error rate.
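A minimal sketch of the three-way split described above, using synthetic data (the array shapes, split sizes, and random seed are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 256))        # stand-in inputs (e.g. 16x16 images flattened)
y = rng.integers(0, 10, size=n)      # stand-in digit labels

# Shuffle once, then carve the indices into the three subsets.
idx = rng.permutation(n)
train, val, test = idx[:700], idx[700:850], idx[850:]

X_train, y_train = X[train], y[train]  # fit the parameters here
X_val, y_val = X[val], y[val]          # pick model / regularization strength here
X_test, y_test = X[test], y[test]      # final unbiased estimate; touch only once
```

The key property is that the three index sets are disjoint, so the test estimate is never contaminated by model selection.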
Preventing overfitting by early stopping
• If we have lots of data and a big model, it's very expensive to keep re-training it with different amounts of weight decay.
• It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse.
• The capacity of the model is limited because the weights have not had time to grow big.
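The stopping rule can be sketched as a patience loop. The validation-error function below is a toy stand-in (not from the slides) chosen to fall and then rise, mimicking the onset of overfitting:

```python
# Toy "validation error" as a function of training step: decreases until
# step 50, then increases, like a typical overfitting curve.
def val_error(step):
    return (step - 50) ** 2 / 2500.0 + 0.1

patience = 5                    # how many worse-than-best steps we tolerate
best, best_step, bad = float("inf"), 0, 0
for step in range(200):
    e = val_error(step)
    if e < best:                # validation improved: remember this checkpoint
        best, best_step, bad = e, step, 0
    else:                       # got worse: count a strike
        bad += 1
        if bad >= patience:     # stop before the weights have time to grow big
            break

print(best_step)
```

In a real training loop one would also save the weights at `best_step` and restore them after stopping.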
Why early stopping works
• When the weights are very small, every hidden unit is in its linear range.
– So a net with a large layer of hidden units is linear.
– It has no more capacity than a linear net in which the inputs are directly connected to the outputs!
• As the weights grow, the hidden units start using their non-linear ranges so the capacity grows.
[Diagram: inputs connected directly to outputs]
LeNet
• Yann LeCun and others developed a really good recognizer for handwritten digits by using backpropagation in a feedforward net with:
– Many hidden layers
– Many pools of replicated units in each layer.
– Averaging the outputs of nearby replicated units.
– A wide net that can cope with several characters at once even if they overlap.
• Demo of LENET
Recognizing Digits
Hand-written digit recognition network
– 7291 training examples, 2007 test examples
– Both contain ambiguous and misclassified examples
– Input pre-processed (segmented, normalized)
• 16x16 gray level [-1, 1], 10 outputs
LeNet: Summary
Main ideas:
• Local → global processing
• Retain coarse position info
Main technique: weight sharing – units arranged in feature maps
Connections: 1256 units, 64,660 connections, 9760 free parameters
Results: 0.14% error (train), 5.0% (test)
vs. 3-layer net w/ 40 hidden units: 1.6% (train), 8.1% (test)
The 82 errors made by LeNet5
Notice that most of the errors are cases that people find quite easy. The human error rate is probably 20 to 30 errors.
A brute force approach
• LeNet uses knowledge about the invariances to design:
– the network architecture
– or the weight constraints
– or the types of feature
• But it's much simpler to incorporate knowledge of invariances by just creating extra training data:
– for each training image, produce new training data by applying all of the transformations we want to be insensitive to
– Then train a large, dumb net on a fast computer.
– This works surprisingly well.
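One simple instance of "creating extra training data": generate translated copies of each image. The sketch below handles only pixel shifts (real augmentation pipelines also use rotations, scalings, and elastic deformations); the function name and shift range are illustrative.

```python
import numpy as np

def shifted_copies(img, max_shift=1):
    """Return zero-padded translated copies of a 2-D image, one per
    (dx, dy) offset in [-max_shift, max_shift]^2, including the identity."""
    copies = []
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            out = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            # np.roll wraps around, so zero the pixels that wrapped in
            if dy > 0:   out[:dy, :] = 0
            elif dy < 0: out[dy:, :] = 0
            if dx > 0:   out[:, :dx] = 0
            elif dx < 0: out[:, dx:] = 0
            copies.append(out)
    return copies

img = np.arange(16, dtype=float).reshape(4, 4)  # small stand-in for a digit image
aug = shifted_copies(img)
print(len(aug))  # 9 copies per training image, including the unshifted one
```

With shifts of up to one pixel this multiplies the training set ninefold; the target label is simply copied along with each transformed image.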
Making backpropagation work for recognizing digits
• Use the standard viewing transformations, and local deformation fields to get lots of data.
• Use many, globally connected hidden layers and learn for a very long time
– This requires a GPU board or a large cluster
• Use the appropriate error measure for multi-class categorization
– Cross-entropy, with softmax activation
• This approach can get 35 errors on MNIST!
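The softmax-plus-cross-entropy error measure mentioned above can be written in a few lines; this is a minimal sketch (the example logits are made up):

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    # -log p(correct class); conveniently, the gradient w.r.t. the logits
    # is just p - one_hot(label), which is what makes this pairing popular.
    p = softmax(logits)
    return -np.log(p[label])

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
loss = cross_entropy(logits, 0)      # loss when class 0 is the true label
```

The loss is small when the correct class already gets most of the probability mass and grows without bound as that probability goes to zero.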
Fabricating training data
Good generalization requires lots of training data, including examples from all relevant input regions.
Improve the solution if good data can be constructed.
Example: ALVINN
ALVINN: simulating training examples
On-the-fly training: current video camera image as input, current steering direction as target.
But: over-training on the same inputs; no experience going off-road.
Method: generate new examples by shifting images.
Replace 10 low-error & 5 random training examples with 15 new ones.
Key: the relation between input and output is known!
Neural Net Demos
Scene recognition - Places MIT
Digit recognition
Neural Nets Playground
Neural Style Transfer