Online Videos - cs224d.stanford.edu/lectures/CS224d-Lecture 4.pdf

Page 1: Online Videos

• FERPA
• Sign waiver or sit on the sides or in the back
• Off-camera question time before and after lecture
• Questions?

Page 2: CS224d Deep NLP

Lecture 4: Word Window Classification and Neural Networks

Richard Socher

Page 3: Feedback so far

• ~70% good with current speed; 15% too fast → please visit office hours; 15% too slow
• Math, when glossed over → not required, food for thought for advanced students
• Lectures dry → understanding the basics is important; starting next week we will become more conceptual, introduce complex models and gain practical intuitions

Page 4: Feedback so far

• Given the feedback: clearly define word vector updates today, move the deadline of PSet 1 by 2 days
• Project ideas: 2 types, more info next week and in my office hour
• Detail: intuition for the word vector context window: the smaller the relative context difference, the more similar the vectors

Page 5: Overview Today

• General classification background
• Updating word vectors for classification
• Window classification & cross-entropy error derivation tips
• A single-layer neural network!
• (Max-margin loss and backprop)

Page 6: Refresher: Classification setup and notation

• Generally we have a training dataset consisting of samples {x_i, y_i}, i = 1, ..., N
• x_i: inputs, e.g. words (indices or vectors!), context windows, sentences, documents, etc.
• y_i: labels we try to predict, e.g. other words; classes such as sentiment, named entities, buy/sell decisions; later: multi-word sequences

Page 7: Classification intuition

• Training data: {x_i, y_i}, i = 1, ..., N
• Simple illustration case: fixed 2d word vectors to classify, using logistic regression → linear decision boundary
• General ML: assume x_i is fixed and only train the logistic regression weights W, i.e. only modify the decision boundary

Visualizations with ConvNetJS by Karpathy: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

Page 8: Classification notation

• Cross-entropy loss function over the dataset {x_i, y_i}, i = 1, ..., N
• Where for each data pair (x_i, y_i):
• We can write f in matrix notation and index elements of it based on class:
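A sketch of the standard softmax cross-entropy loss these bullets refer to, in the lecture's f = Wx notation (C is the number of classes):

J(\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log\!\left(\frac{e^{f_{y_i}}}{\sum_{c=1}^{C} e^{f_c}}\right), \qquad f = W x_i, \quad f_c = W_{c\cdot}\, x_i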

Page 9: Classification: Regularization!

• The really full loss function over any dataset includes regularization over all parameters θ:
• Regularization will prevent overfitting when we have a lot of features (or, later, a very powerful/deep model)

(Figure: x-axis: more powerful model or more training iterations; blue: training error, red: test error)
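A sketch of the regularized objective the first bullet refers to, assuming standard L2 regularization with strength λ:

J(\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log\!\left(\frac{e^{f_{y_i}}}{\sum_{c=1}^{C} e^{f_c}}\right) + \lambda \sum_{k} \theta_k^2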

Page 10: Details: General ML optimization

• For general machine learning, θ usually only consists of the columns of W:
• So we only update the decision boundary

Visualizations with ConvNetJS by Karpathy

Page 11: Classification difference with word vectors

• Common in deep learning: learn both W and the word vectors x
• The parameter vector is then very large: overfitting danger!
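A sketch of what the full parameter vector looks like in this setting (d-dimensional word vectors, C classes, vocabulary size V; the "very large" and "overfitting danger" annotations refer to its size):

\theta = \begin{bmatrix} W_{\cdot 1} \\ \vdots \\ W_{\cdot d} \\ x_{\text{aardvark}} \\ \vdots \\ x_{\text{zebra}} \end{bmatrix} \in \mathbb{R}^{Cd + Vd}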

Page 12: Losing generalization by re-training word vectors

• Setting: training logistic regression for movie review sentiment, and in the training data we have the words "TV" and "telly"
• In the testing data we have "television"
• Originally they were all similar (from pre-training the word vectors)
• What happens when we train the word vectors?

(Figure: "TV", "telly", and "television" close together in vector space)

Page 13: Losing generalization by re-training word vectors

• What happens when we train the word vectors? Those that are in the training data move around; words from pre-training that do NOT appear in training stay where they are
• Example: in the training data: "TV" and "telly"; in the testing data only: "television"

(Figure: "TV" and "telly" have moved, leaving "television" behind :( )

Page 14: Losing generalization by re-training word vectors

• Take-home message:

If you only have a small training dataset, don't train the word vectors.

If you have a very large dataset, it may work better to train word vectors to the task.

Page 15: Side note on word vectors notation

• The word vector matrix L is also called the lookup table
• Word vectors = word embeddings = word representations (mostly)
• Mostly from methods like word2vec or GloVe
• L is a d × |V| matrix with one column per vocabulary word: aardvark, ..., meta, ..., zebra
• These are the word features x_word from now on
• Conceptually you get a word's vector by left-multiplying a one-hot vector e by L: x = L e, with L ∈ R^{d × V} and e ∈ R^{V × 1}
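A minimal numpy sketch of the lookup described in the last bullet (the sizes and the word index are made up for illustration):

```python
import numpy as np

d, V = 4, 10              # toy embedding size and vocabulary size
L = np.random.rand(d, V)  # lookup table: one column per vocabulary word

word_index = 7            # hypothetical index of some word in the vocabulary
e = np.zeros((V, 1))
e[word_index] = 1.0       # one-hot column vector for that word

x_via_matmul = L @ e                 # x = L e, shape (d, 1)
x_via_lookup = L[:, [word_index]]    # equivalent, and much cheaper: just read the column

assert np.allclose(x_via_matmul, x_via_lookup)
```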

Page 16: Window classification

• Classifying single words is rarely done.
• Interesting problems like ambiguity arise in context!
• Example: auto-antonyms:
  • "To sanction" can mean "to permit" or "to punish."
  • "To seed" can mean "to place seeds" or "to remove seeds."
• Example: ambiguous named entities:
  • Paris → Paris, France vs. Paris Hilton
  • Hathaway → Berkshire Hathaway vs. Anne Hathaway

Page 17: Window classification

• Idea: classify a word in its context window of neighboring words.
• For example, named entity recognition into 4 classes: person, location, organization, none
• Many possibilities exist for classifying one word in context, e.g. averaging all the words in a window, but that loses position information

Page 18: Window classification

• Train a softmax classifier by assigning a label to a center word and concatenating all word vectors surrounding it
• Example: classify "Paris" in the context of this sentence with window length 2:

  … museums in Paris are amazing …

  x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]^T

• The resulting vector x_window = x ∈ R^{5d} is a column vector!
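A minimal numpy sketch of building x_window from the lookup table of Page 15 (the vocabulary indices for the five words are made up for illustration):

```python
import numpy as np

d, V = 4, 10
L = np.random.rand(d, V)          # lookup table from Page 15

# Hypothetical vocabulary indices for "museums in Paris are amazing".
window_indices = [3, 1, 7, 2, 9]

# Concatenate five d-dimensional columns into one 5d-dimensional column vector.
x_window = np.concatenate([L[:, [i]] for i in window_indices], axis=0)
assert x_window.shape == (5 * d, 1)
```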

Page 19: Simplest window classifier: Softmax

• With x = x_window we can use the same softmax classifier as before (same W, now applied to the window vector; its output is the predicted model probability)
• With cross-entropy error as before:
• But how do you update the word vectors?
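A sketch of the classifier and loss on this slide, with x = x_window and the same softmax and cross-entropy form as on Page 8:

\hat{y}_y = p(y \mid x) = \frac{e^{W_{y\cdot}\, x}}{\sum_{c=1}^{C} e^{W_{c\cdot}\, x}}, \qquad J = -\log p(y \mid x)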

Page 20: Updating concatenated word vectors

• Short answer: just take derivatives as before
• Long answer: let's go over the steps together (you'll have to fill in the details in PSet 1!)
• Define:
  • ŷ: the softmax probability output vector (see previous slide)
  • t: the target probability distribution (all 0's except at the ground truth index of class y, where it's 1)
  • f_c: the c'th element of the f vector
• Hard the first time, hence some tips now :)

Page 21: Updating concatenated word vectors

• Tip 1: Carefully define your variables and keep track of their dimensionality!
• Tip 2: Know thy chain rule and don't forget which variables depend on what:
• Tip 3: For the softmax part of the derivative: first take the derivative wrt f_c when c = y (the correct class), then take the derivative wrt f_c when c ≠ y (all the incorrect classes)
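Carrying out Tip 3 gives the standard softmax-gradient result (a sketch, using the ŷ and t defined on the previous slide):

\frac{\partial J}{\partial f_c} = \hat{y}_c - t_c, \qquad \text{or in vector form} \qquad \frac{\partial J}{\partial f} = \hat{y} - t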

Page 22: Updating concatenated word vectors

• Tip 4: When you take a derivative wrt one element of f, try to see if you can create a gradient in the end that includes all partial derivatives:
• Tip 5: To later not go insane (& for the implementation!) → express results in terms of vector operations and define single index-able vectors:

Page 23: Updating concatenated word vectors

• Tip 6: When you start with the chain rule, first use explicit sums and look at partial derivatives of e.g. x_i or W_ij
• Tip 7: To clean it up for even more complex functions later: know the dimensionality of your variables & simplify into matrix notation
• Tip 8: Write this out in full sums if it's not clear!

Page 24: Updating concatenated word vectors

• What is the dimensionality of the window vector gradient?
• x is the entire window of five d-dimensional word vectors, so the derivative wrt x has to have the same dimensionality:

Page 25: Updating concatenated word vectors

• The gradient that arrives at and updates the word vectors can simply be split up for each word vector:
• Let δ denote that gradient (see the sketch below)
• With x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
• We have one d-dimensional chunk of δ per word in the window
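A sketch of the split, assuming the cross-entropy window classifier from the previous slides (δ is the gradient wrt the window vector):

\delta = \frac{\partial J}{\partial x} = W^{\top}(\hat{y} - t) \in \mathbb{R}^{5d}, \qquad \delta = \begin{bmatrix} \delta_{\text{museums}} \\ \delta_{\text{in}} \\ \delta_{\text{Paris}} \\ \delta_{\text{are}} \\ \delta_{\text{amazing}} \end{bmatrix}, \qquad \nabla_{x_{\text{museums}}} J = \delta_{\text{museums}} \in \mathbb{R}^{d}, \;\text{etc.}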

Page 26: Updating concatenated word vectors

• This will push word vectors into areas such that they will be helpful in determining named entities.
• For example, the model can learn that seeing x_in as the word just before the center word is indicative of the center word being a location

Page 27: What's missing for training the window model?

• The gradient of J wrt the softmax weights W!
• Similar steps: write down the partial wrt W_ij first! Then we have the full ∇_W J
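A sketch of that gradient for the softmax weights, with the same ŷ, t, and window vector x as above:

\frac{\partial J}{\partial W_{ij}} = (\hat{y}_i - t_i)\, x_j, \qquad \nabla_W J = (\hat{y} - t)\, x^{\top} \in \mathbb{R}^{C \times 5d}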

Page 28: A note on matrix implementations

• There are two expensive operations in the softmax: the matrix multiplication and the exp
• A for loop is never as efficient as a single larger matrix multiplication!
• Example code → (a sketch is given below)
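A minimal numpy sketch of the comparison the next slide describes (the sizes are made up; exact timings will differ by machine):

```python
import numpy as np
from timeit import timeit

C, d, N = 500, 300, 1000          # classes, window-vector dimension, number of windows
W = np.random.rand(C, d)
wordvectors_list = [np.random.rand(d, 1) for _ in range(N)]
wordvectors_one_matrix = np.random.rand(d, N)

# Slow: loop over word vectors, one matrix-vector product each.
def scores_loop():
    return [W.dot(x) for x in wordvectors_list]

# Fast: one large matrix-matrix product over all windows at once.
def scores_matrix():
    return W.dot(wordvectors_one_matrix)

print(timeit(scores_loop, number=10))
print(timeit(scores_matrix, number=10))
```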

Page 29: A note on matrix implementations

• Looping over word vectors instead of concatenating them all into one large matrix and then multiplying the softmax weights with that matrix:

  Loop:   1000 loops, best of 3: 639 µs per loop
  Matrix: 10000 loops, best of 3: 53.8 µs per loop

Page 30: A note on matrix implementations

• The result of the faster method is a C × N matrix:
• Each column is an f(x) in our notation (unnormalized class scores)
• Matrices are awesome!
• You should speed test your code a lot too

Page 31: Softmax (= logistic regression) is not very powerful

• Softmax only gives linear decision boundaries in the original space.
• With little data that can be a good regularizer
• With more data it is very limiting!

Page 32: Softmax (= logistic regression) is not very powerful

• Softmax only gives linear decision boundaries
• → Lame when the problem is complex
• Wouldn't it be cool to get these correct?

Page 33: Neural Nets for the Win!

• Neural networks can learn much more complex functions and nonlinear decision boundaries!

Page 34: From logistic regression to neural nets

Page 35: Demystifying neural networks

Neural networks come with their own terminological baggage

… just like SVMs

But if you understand how softmax models work

Then you already understand the operation of a basic neural network neuron!

A single neuron: a computational unit with n (here 3) inputs and 1 output, and parameters W, b

(Figure: a single neuron with its inputs, an activation function, an output, and a bias unit corresponding to the intercept term)

Page 36: A neuron is essentially a binary logistic regression unit

h_{w,b}(x) = f(w^T x + b)

f(z) = 1 / (1 + e^{-z})

w, b are the parameters of this neuron, i.e., this logistic regression model

b: we can have an "always on" feature, which gives a class prior, or separate it out as a bias term
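A minimal numpy sketch of this single neuron (the input size and parameter values are made up):

```python
import numpy as np

def f(z):
    """Logistic (sigmoid) activation: f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# One neuron with 3 inputs: weights w and bias b.
w = np.array([0.2, -0.5, 0.1])
b = 0.05

x = np.array([1.0, 2.0, 3.0])     # an example input
h = f(w.dot(x) + b)               # h_{w,b}(x) = f(w^T x + b)
print(h)
```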

Page 37: A neural network = running several logistic regressions at the same time

If we feed a vector of inputs through a bunch of logistic regression functions, then we get a vector of outputs…

But we don't have to decide ahead of time what variables these logistic regressions are trying to predict!

Page 38: A neural network = running several logistic regressions at the same time

… which we can feed into another logistic regression function

It is the loss function that will direct what the intermediate hidden variables should be, so as to do a good job at predicting the targets for the next layer, etc.

Page 39: A neural network = running several logistic regressions at the same time

Before we know it, we have a multilayer neural network….

Page 40: Matrix notation for a layer

We have

a_1 = f(W_11 x_1 + W_12 x_2 + W_13 x_3 + b_1)
a_2 = f(W_21 x_1 + W_22 x_2 + W_23 x_3 + b_2)
etc.

In matrix notation

z = W x + b
a = f(z)

where f is applied element-wise:

f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]

(Figure: a layer with inputs x_1, x_2, x_3, activations a_1, a_2, a_3, weights such as W_12, and biases such as b_3)
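A minimal numpy sketch of this layer (the 3×3 weight matrix and bias values are made up; f is the element-wise logistic from Page 36):

```python
import numpy as np

def f(z):
    # Element-wise logistic non-linearity.
    return 1.0 / (1.0 + np.exp(-z))

W = np.random.rand(3, 3)        # weights W_ij for a layer with 3 inputs and 3 units
b = np.random.rand(3)           # biases b_1, b_2, b_3
x = np.array([1.0, 0.5, -1.0])  # an example input

z = W.dot(x) + b                # z = Wx + b
a = f(z)                        # a = f(z), applied element-wise
print(a)
```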

Page 41: Non-linearities (f): why they're needed

• Example: function approximation, e.g., regression or classification
• Without non-linearities, deep neural networks can't do anything more than a linear transform
• Extra layers could just be compiled down into a single linear transform: W_1 W_2 x = W x
• With more layers, they can approximate more complex functions!

Page 42: A more powerful window classifier

• Revisiting
• x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]

Page 43: A Single Layer Neural Network

• A single layer is a combination of a linear layer and a nonlinearity:
• The neural activations a can then be used to compute some function
• For instance, a softmax probability or an unnormalized score:

Page 44: Summary: Feed-forward Computation

Computing a window's score with a 3-layer neural net: s = score(museums in Paris are amazing)

x_window = [x_museums  x_in  x_Paris  x_are  x_amazing]
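A sketch of the feed-forward computation these two slides describe, assuming the single layer z = Wx + b, a = f(z) from Page 40 and an unnormalized score computed with the weight vector U that appears on later slides:

z = W x_{\text{window}} + b, \qquad a = f(z), \qquad s = U^{\top} a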

Page 45: Next lecture:

Training a window-based neural network.

Taking deeper derivatives → backprop

Then we have all the basic tools in place to learn about more complex models :)

Page 46: Probably for next lecture…

Page 47: Another output layer and loss function combo!

• So far: softmax and cross-entropy error (the exp is slow)
• We don't always need probabilities; often unnormalized scores are enough to classify correctly.
• Also: max-margin!
• More on that in future lectures!

Page 48: Neural Net model to classify grammatical phrases

• Idea: train a neural network to produce high scores for grammatical phrases of a specific length and low scores for ungrammatical phrases
• s = score(cat chills on a mat)
• s_c = score(cat chills Menlo a mat)

Page 49: Another output layer and loss function combo!

• Idea for the training objective: make the score of the true window larger and the corrupt window's score lower (until they're good enough): minimize
• This is continuous, so we can perform SGD
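A sketch of the max-margin objective that bullet refers to, in its standard form with s the true window's score and s_c the corrupt window's score:

J = \max(0,\; 1 - s + s_c)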

Page 50: Training with Backpropagation

Assuming the cost J is > 0, it is simple to see that we can compute the derivatives of s and s_c wrt all the involved variables: U, W, b, x

Page 51: Training with Backpropagation

• Let's consider the derivative of a single weight W_ij
• This only appears inside a_i
• For example: W_23 is only used to compute a_2

(Figure: a network with inputs x_1, x_2, x_3 and bias +1, hidden units a_1, a_2, score s with weight U_2, and the weight W_23 highlighted)

Page 52: Training with Backpropagation

Derivative of weight W_ij:

Page 53: Training with Backpropagation

Derivative of a single weight W_ij: the local error signal times the local input signal (see the sketch below),

where for logistic f the derivative is f'(z) = f(z)(1 - f(z))
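A sketch of that per-weight derivative, with δ_i denoting the local error signal and x_j the local input signal:

\frac{\partial s}{\partial W_{ij}} = U_i\, f'(z_i)\, x_j = \delta_i\, x_j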

Page 54: Training with Backpropagation

• From the single weight W_ij to the full W:
• We want all combinations of i = 1, 2 and j = 1, 2, 3
• Solution: the outer product, where δ is the "responsibility" coming from each activation a
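A sketch of the full-matrix form, an outer product of the error vector δ and the input x:

\nabla_W s = \delta\, x^{\top}, \qquad \delta_i = U_i\, f'(z_i)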

Page 55: Training with Backpropagation

• For the biases b, we get: ∂s/∂b_i = U_i f'(z_i) = δ_i

Page 56: Training with Backpropagation

That's almost backpropagation. It's simply taking derivatives and using the chain rule!

Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers

Example: the last derivatives of the model, the word vectors in x

Page 57: Training with Backpropagation

• Take the derivative of the score with respect to a single word vector (for simplicity a 1d vector, but the same holds if it were longer)
• Now, we cannot just take into consideration one a_i, because each x_j is connected to all the neurons above; hence x_j influences the overall score through all of them:

(Re-used part of the previous derivative)
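A sketch of that word-vector derivative, re-using the δ already computed for the layer above:

\frac{\partial s}{\partial x_j} = \sum_i U_i\, f'(z_i)\, W_{ij} = \sum_i \delta_i\, W_{ij}, \qquad \text{i.e.} \qquad \nabla_x s = W^{\top}\delta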

Page 58: Summary