Classification II
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley
Algorithms for NLP
Minimize Training Error?

§ A loss function declares how costly each mistake is
  § E.g. 0 loss for the correct label, 1 loss for a wrong label
  § Can weight mistakes differently (e.g. false positives worse than false negatives, or Hamming distance over structured labels)
§ We could, in principle, minimize training loss:
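In symbols, one standard way to write this objective (counting each training mistake once):

\[
\min_w \; \sum_i \mathbf{1}\big[\arg\max_y \, w^\top f_i(y) \neq y_i^*\big]
\]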
§ This is a hard, discontinuous optimization problem
Examples: Perceptron

§ Separable Case

Examples: Perceptron

§ Non-Separable Case
Objective Functions

§ What do we want from our weights?
§ Depends!
§ So far: minimize (training) errors:
§ This is the "zero-one loss"
§ Discontinuous, minimizing is NP-complete
§ Maximum entropy and SVMs have other objectives related to zero-one loss
Linear Models: Maximum Entropy

§ Maximum entropy (logistic regression)
§ Use the scores as probabilities: exponentiate to make them positive, then normalize
§ Maximize the (log) conditional likelihood of the training data
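In the standard form:

\[
P(y \mid x; w) \;=\; \frac{\exp\!\big(w^\top f(x, y)\big)}{\sum_{y'} \exp\!\big(w^\top f(x, y')\big)},
\qquad
\max_w \; \sum_i \log P(y_i^* \mid x_i; w)
\]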
Maximum Entropy II

§ Motivation for maximum entropy:
  § Connection to the maximum entropy principle (sort of)
  § Might want to do a good job of being uncertain on noisy cases…
  § …in practice, though, posteriors are pretty peaked
§ Regularization (smoothing)
Log-Loss

§ If we view maxent as a minimization problem:
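That is, negating the conditional log-likelihood (standard form):

\[
\min_w \; \sum_i -\log P(y_i^* \mid x_i; w)
\]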
§ This minimizes the "log loss" on each example
§ One view: log loss is an upper bound on zero-one loss
Maximum Margin

§ Non-separable SVMs
  § Add slack to the constraints
  § Make the objective pay (linearly) for slack:
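A standard statement of the resulting objective (matching the primal written out later in these slides):

\[
\min_{w,\xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i
\quad\text{s.t.}\quad
w^\top f_i(y_i^*) \;\ge\; w^\top f_i(y) + \ell_i(y) - \xi_i \quad \forall i, y
\]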
§ C is called the capacity of the SVM – the smoothing knob
§ Learning:
  § Can still stick this into Matlab if you want
  § Constrained optimization is hard; better methods exist!
  § We'll come back to this later

Note: other choices of how to penalize the slacks exist!
Remember SVMs…

§ We had a constrained minimization…
§ …but we can solve for each slack ξᵢ
§ Giving:
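Each constraint forces ξᵢ to be at least the worst-case violation, and at the optimum it is exactly that (with ℓᵢ(yᵢ*) = 0 keeping it nonnegative):

\[
\xi_i = \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*)
\]

giving the unconstrained form

\[
\min_w \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \Big( \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big)
\]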
Hinge Loss

§ Consider the per-instance objective (the summand above):
§ This is called the "hinge loss"
§ Unlike maxent / log loss, you stop gaining objective once the true label wins by enough
§ You can start from here and derive the SVM objective
§ Can solve directly with sub-gradient descent (e.g. Pegasos: Shalev-Shwartz et al. 07); see the sketch below
§ (The loss plot is really only right in the binary case)
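A minimal sketch of that sub-gradient route in the spirit of Pegasos, for the multiclass hinge with 0/1 loss augmentation; the block feature map, data format, and hyperparameters here are illustrative assumptions, not from the slides:

```python
import numpy as np

def feat(x, y, dim, num_labels):
    """Toy joint feature map: x copied into the block for label y."""
    f = np.zeros(dim * num_labels)
    f[y * dim:(y + 1) * dim] = x
    return f

def pegasos(data, num_labels, dim, lam=0.01, epochs=5):
    """data: list of (x, y_true) with x an np.ndarray of length dim."""
    w = np.zeros(dim * num_labels)
    t = 0
    for _ in range(epochs):
        for x, y_true in data:
            t += 1
            eta = 1.0 / (lam * t)            # Pegasos step-size schedule
            # loss-augmented prediction: add 1 to every wrong label's score
            scores = [w @ feat(x, y, dim, num_labels) + (y != y_true)
                      for y in range(num_labels)]
            y_hat = int(np.argmax(scores))
            w *= 1.0 - eta * lam             # gradient of (lam/2)||w||^2
            if y_hat != y_true:              # hinge sub-gradient step
                w += eta * (feat(x, y_true, dim, num_labels)
                            - feat(x, y_hat, dim, num_labels))
    return w
```

When the loss-augmented winner is the true label, the hinge term is zero and only the shrink step applies, which is exactly the "stop gaining objective" behavior above.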
Max vs "Soft-Max" Margin

§ SVMs vs. Maxent (both objectives written out below)
§ Very similar! Both try to make the true score better than a function of the other scores
  § The SVM tries to beat the augmented runner-up
  § The Maxent classifier tries to beat the "soft-max"
§ You can make the SVM's hinge term zero… but never the maxent soft-max term (the log-sum-exp strictly exceeds the true score whenever more than one output is possible)
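Written out in the same notation (standard forms):

SVMs:
\[
\min_w \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \Big( \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big)
\]

Maxent:
\[
\min_w \; \sum_i \Big( \log \sum_y \exp\!\big( w^\top f_i(y) \big) - w^\top f_i(y_i^*) \Big)
\]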
Loss Functions: Comparison

§ Zero-One Loss
§ Hinge
§ Log
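In the same notation, the three losses being compared (standard forms):

\[
\begin{aligned}
\text{Zero-one:}\; & \sum_i \mathbf{1}\big[\arg\max_y w^\top f_i(y) \neq y_i^*\big] \\
\text{Hinge:}\; & \sum_i \Big( \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big) \\
\text{Log:}\; & \sum_i \Big( \log \sum_y \exp\!\big(w^\top f_i(y)\big) - w^\top f_i(y_i^*) \Big)
\end{aligned}
\]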
Structure

Handwriting recognition
§ x: an image of a handwritten word → y: "brace"
§ Sequential structure

[Slides: Taskar and Klein 05]

CFG Parsing
§ x: "The screen was a sea of red" → y: its parse tree
§ Recursive structure
Bilingual Word Alignment
§ x: a sentence pair
  "What is the anticipated cost of collecting fees under the new proposal?"
  "En vertu de les nouvelles propositions, quel est le côut prévu de perception de les droits?"
§ y: a word-by-word alignment between the two sentences
§ Combinatorial structure
Structured Models

§ Assumption: the score is a sum of local "part" scores
§ Parts = nodes, edges, productions
§ Prediction searches the space of feasible outputs
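In symbols, with 𝒴(x) the space of feasible outputs (a standard rendering of the assumption):

\[
\mathrm{score}(x, y) = \sum_{p \,\in\, \mathrm{parts}(x, y)} w^\top f(p),
\qquad
\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} \mathrm{score}(x, y)
\]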
Named Entity Recognition

Apple  Computer  bought  Smart  Systems  Inc.  located  in   Arkansas.
ORG    ORG       ---     ORG    ORG      ORG   ---      ---  LOC
Bilingual word alignment
§ Features: association, position, orthography

[Figure: alignment grid between "What is the anticipated cost of collecting fees under the new proposal?" and "En vertu de les nouvelles propositions, quel est le côut prévu de perception de les droits?", with j indexing English positions and k French positions]
Efficient Decoding

§ Common case: you have a black box which computes \( \arg\max_y \, w^\top f(x, y) \), at least approximately, and you want to learn w
§ Easiest option is the structured perceptron [Collins 01]; a minimal sketch follows below
§ Structure enters here in that the search for the best y is typically a combinatorial algorithm (dynamic programming, matchings, ILPs, A*…)
§ Prediction is structured, but the learning update is not
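A minimal sketch of the structured perceptron under the black-box assumption above; decode and feats stand in for the combinatorial search and the joint feature map, and their exact signatures are illustrative assumptions:

```python
import numpy as np

def structured_perceptron(data, feats, decode, dim, epochs=5):
    """data: list of (x, y_true) pairs.
    feats(x, y): joint feature vector (np.ndarray of length dim).
    decode(x, w): black-box (approximate) argmax_y w . feats(x, y),
    e.g. Viterbi, a matching algorithm, an ILP, or A*."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = decode(x, w)           # structured prediction step
            if y_hat != y_true:            # ordinary perceptron update
                w += feats(x, y_true) - feats(x, y_hat)
    return w
```

Note how the structure lives entirely inside decode; the update itself is the plain perceptron rule.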
Structured Margin (Primal)

Remember our primal margin objective?

\[
\min_w \; \tfrac{1}{2}\|w\|_2^2 \;+\; C \sum_i \Big( \max_y \big( w^\top f_i(y) + \ell_i(y) \big) \;-\; w^\top f_i(y_i^*) \Big)
\]

Still applies with a structured output space!
Structured Margin (Primal)

\[
\min_w \; \tfrac{1}{2}\|w\|_2^2 \;+\; C \sum_i \Big( w^\top f_i(\bar{y}) + \ell_i(\bar{y}) \;-\; w^\top f_i(y_i^*) \Big)
\]

Just need an efficient loss-augmented decode:

\[
\bar{y} = \arg\max_y \; w^\top f_i(y) + \ell_i(y)
\]

\[
\nabla_w = w + C \sum_i \big( f_i(\bar{y}) - f_i(y_i^*) \big)
\]

Still use general subgradient descent methods! (AdaGrad)
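A sketch of plain stochastic sub-gradient descent on this objective; swapping in AdaGrad's per-coordinate step sizes is a drop-in change. loss_aug_decode and the constant step size are illustrative assumptions:

```python
import numpy as np

def margin_subgradient(data, feats, loss_aug_decode, dim,
                       C=1.0, eta=0.1, epochs=5):
    """loss_aug_decode(x, y_true, w): argmax_y w . f_i(y) + l_i(y)."""
    w = np.zeros(dim)
    n = len(data)
    for _ in range(epochs):
        for x, y_true in data:
            y_bar = loss_aug_decode(x, y_true, w)   # loss-augmented decode
            # per-example sub-gradient: w/n from the regularizer,
            # plus C * (f_i(y_bar) - f_i(y*)); zero when y_bar = y*
            g = w / n + C * (feats(x, y_bar) - feats(x, y_true))
            w -= eta * g
    return w
```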
Structured Margin (Dual)

§ Remember the constrained version of the primal:
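As above (one constraint per instance i and output y):

\[
\min_{w,\xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i
\quad\text{s.t.}\quad
w^\top f_i(y_i^*) \;\ge\; w^\top f_i(y) + \ell_i(y) - \xi_i \quad \forall i, y
\]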
§ The dual has a variable for every constraint here
Full Margin: OCR

§ We want the true transcription to beat every alternative; for an image of "brace":
  score("brace") > score(y)  for every other string y
§ Equivalently, one constraint per competing output:
  score("brace") > score("aaaaa")
  score("brace") > score("aaaab")
  …
  score("brace") > score("zzzzz")
  … a lot of constraints!
Parsing example

§ We want the true parse of 'It was red' to beat every alternative:
  score(true tree) > score(y)  for every other tree y
§ Equivalently, one constraint per alternative tree (S → A B over C D, S → A B over D F, S → E F over G H, …)
  … a lot of constraints!
Alignment example

§ We want the true alignment of 'What is the' / 'Quel est le' to beat every alternative:
  score(true alignment) > score(y)  for every other alignment y
§ Equivalently, one constraint per alternative way of aligning positions 1–3 to positions 1–3
  … a lot of constraints!
Cutting Plane (Dual)

§ A constraint induction method [Joachims et al. 09]
§ Exploits the fact that the number of constraints you actually need per instance is typically very small
§ Requires (loss-augmented) primal decode only
§ Repeat:
  § Find the most violated constraint for an instance:
    \( \bar{y} = \arg\max_y \; w^\top f_i(y) + \ell_i(y) \)
  § Add this constraint and re-solve the (non-structured) QP (e.g. with SMO or another QP solver); see the working-set sketch below
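A working-set sketch of this loop, hedged: the helper names (feats, loss_aug_decode, loss) and the use of a single shared slack with scipy's SLSQP solver in place of SMO are simplifying assumptions, not the slides' recipe:

```python
import numpy as np
from scipy.optimize import minimize

def cutting_plane(data, feats, loss_aug_decode, loss, dim, C=1.0, rounds=20):
    working_set = []              # accumulated constraints: (delta_f, loss)
    w = np.zeros(dim)
    for _ in range(rounds):
        # 1. Find the most violated constraint for each instance
        for x, y_true in data:
            y_bar = loss_aug_decode(x, y_true, w)
            if y_bar != y_true:
                working_set.append((feats(x, y_true) - feats(x, y_bar),
                                    loss(y_true, y_bar)))
        # 2. Re-solve a small QP over the working set only
        #    (variables: w plus one shared slack xi >= 0)
        def objective(z):
            return 0.5 * z[:dim] @ z[:dim] + C * z[dim]
        cons = [{'type': 'ineq',
                 'fun': lambda z, d=d, l=l: z[:dim] @ d - l + z[dim]}
                for d, l in working_set]
        cons.append({'type': 'ineq', 'fun': lambda z: z[dim]})  # xi >= 0
        res = minimize(objective, np.zeros(dim + 1),
                       constraints=cons, method='SLSQP')
        w = res.x[:dim]
    return w
```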
Cutting Plane (Dual)

§ Some issues:
  § Can easily spend too much time solving QPs
  § Doesn't exploit shared constraint structure
  § In practice, works pretty well; fast like perceptron/MIRA, more stable, no averaging
Likelihood, Structured

§ Structure is needed to compute:
  § The log-normalizer
  § Expected feature counts
§ E.g. if a feature is an indicator of DT-NN, then we need to compute the posterior marginals P(DT-NN | sentence) for each position and sum
§ Also works with latent variables (more later)
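The reason marginals suffice: the gradient of the structured log-likelihood is the standard observed-minus-expected feature counts, and the expectation decomposes over parts:

\[
\nabla_w \sum_i \log P(y_i^* \mid x_i; w)
\;=\;
\sum_i \Big( f_i(y_i^*) - \mathbb{E}_{y \sim P(\cdot \mid x_i; w)}\big[f_i(y)\big] \Big)
\]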
Comparison
Option 0: Reranking

Input: x = "The screen was a sea of red."
   ↓  Baseline Parser
N-Best List (e.g. n = 100)
   ↓  Non-Structured Classification
Output

[e.g. Charniak and Johnson 05]
Reranking

§ Advantages:
  § Directly reduces to the non-structured case
  § No locality restriction on features
§ Disadvantages:
  § Stuck with the errors of the baseline parser
  § The baseline system must produce n-best lists
  § But, feedback is possible [McClosky, Charniak, Johnson 2006]
M3Ns

§ Another option: express all constraints in a packed form
  § Maximum margin Markov networks [Taskar et al. 03]
  § Integrates solution structure deeply into the problem structure

§ Steps:
  § Express inference over the constraints as an LP
  § Use duality to transform the minimax formulation into min-min
  § Constraints factor in the dual along the same structure as the primal; the alphas essentially act as a dual "distribution"
  § Various optimization possibilities in the dual
Example: Kernels

§ Quadratic kernels:
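One standard instance, for feature vectors x and x′:

\[
K(x, x') = \big( x^\top x' + 1 \big)^2
\]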
Non-Linear Separators

§ Another view: kernels map an original feature space to some higher-dimensional feature space where the training set is (more) separable

Φ: y → φ(y)
Why Kernels?

§ Can't you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?
  § Yes, in principle, just compute them
  § No need to modify any algorithms
  § But, the number of features can get large (or infinite)
  § Some kernels are not as usefully thought of in their expanded representation, e.g. RBF or data-defined kernels [Henderson and Titov 05]
§ Kernels let us compute with these features implicitly
  § Example: the implicit dot product in the quadratic kernel takes much less space and time per dot product; see the sketch below
  § Of course, there's a cost for using the pure dual algorithms…
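A small sketch making the "implicitly" concrete for the quadratic kernel above: the explicit degree-2 feature map uses O(d²) features, while the kernel evaluates the same dot product in O(d). The vectors here are illustrative:

```python
import numpy as np

def phi(x):
    """Explicit feature map with phi(x) . phi(z) = (x . z + 1)^2."""
    d = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * x[i] for i in range(d)]        # linear terms
    feats += [x[i] * x[i] for i in range(d)]              # squares
    feats += [np.sqrt(2) * x[i] * x[j]                    # cross terms
              for i in range(d) for j in range(i + 1, d)]
    return np.array(feats)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
explicit = phi(x) @ phi(z)        # O(d^2) space per vector
implicit = (x @ z + 1) ** 2       # O(d) time, no expansion
assert np.isclose(explicit, implicit)
```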