Classification II
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley
Algorithms for NLP
Minimize Training Error?

§ A loss function declares how costly each mistake is
  § E.g. 0 loss for the correct label, 1 loss for a wrong label
  § Can weight mistakes differently (e.g. false positives worse than false negatives, or Hamming distance over structured labels)
§ We could, in principle, minimize training loss:
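In symbols, one standard way to write this objective (counting each training mistake once):

\[
\min_w \; \sum_i \mathbf{1}\big[\arg\max_y \, w^\top f_i(y) \neq y_i^*\big]
\]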
§ This is a hard, discontinuous optimization problem
Examples: Perceptron

§ Separable Case

Examples: Perceptron

§ Non-Separable Case
Objective Functions

§ What do we want from our weights?
§ Depends!
§ So far: minimize (training) errors:
§ This is the "zero-one loss"
§ Discontinuous, minimizing is NP-complete
§ Maximum entropy and SVMs have other objectives related to zero-one loss
Linear Models: Maximum Entropy

§ Maximum entropy (logistic regression)
§ Use the scores as probabilities: exponentiate to make them positive, then normalize
§ Maximize the (log) conditional likelihood of the training data
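In the standard form:

\[
P(y \mid x; w) \;=\; \frac{\exp\!\big(w^\top f(x, y)\big)}{\sum_{y'} \exp\!\big(w^\top f(x, y')\big)},
\qquad
\max_w \; \sum_i \log P(y_i^* \mid x_i; w)
\]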
Maximum Entropy II

§ Motivation for maximum entropy:
  § Connection to the maximum entropy principle (sort of)
  § Might want to do a good job of being uncertain on noisy cases…
  § …in practice, though, posteriors are pretty peaked
§ Regularization (smoothing)
Log-Loss

§ If we view maxent as a minimization problem:
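That is, negating the conditional log-likelihood (standard form):

\[
\min_w \; \sum_i -\log P(y_i^* \mid x_i; w)
\]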
§ This minimizes the "log loss" on each example
§ One view: log loss is an upper bound on zero-one loss
Maximum Margin

§ Non-separable SVMs
  § Add slack to the constraints
  § Make the objective pay (linearly) for slack:
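A standard statement of the resulting objective (matching the primal written out later in these slides):

\[
\min_{w,\xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i
\quad\text{s.t.}\quad
w^\top f_i(y_i^*) \;\ge\; w^\top f_i(y) + \ell_i(y) - \xi_i \quad \forall i, y
\]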
§ C is called the capacity of the SVM – the smoothing knob
§ Learning:
  § Can still stick this into Matlab if you want
  § Constrained optimization is hard; better methods exist!
  § We'll come back to this later

Note: other choices of how to penalize the slacks exist!
Remember SVMs…

§ We had a constrained minimization…
§ …but we can solve for each slack ξᵢ
§ Giving:
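Each constraint forces ξᵢ to be at least the worst-case violation, and at the optimum it is exactly that (with ℓᵢ(yᵢ*) = 0 keeping it nonnegative):

\[
\xi_i = \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*)
\]

giving the unconstrained form

\[
\min_w \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \Big( \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big)
\]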
Hinge Loss

§ Consider the per-instance objective (the summand above):
§ This is called the "hinge loss"
§ Unlike maxent / log loss, you stop gaining objective once the true label wins by enough
§ You can start from here and derive the SVM objective
§ Can solve directly with sub-gradient descent (e.g. Pegasos: Shalev-Shwartz et al. 07); see the sketch below
§ (The loss plot is really only right in the binary case)
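A minimal sketch of that sub-gradient route in the spirit of Pegasos, for the multiclass hinge with 0/1 loss augmentation; the block feature map, data format, and hyperparameters here are illustrative assumptions, not from the slides:

```python
import numpy as np

def feat(x, y, dim, num_labels):
    """Toy joint feature map: x copied into the block for label y."""
    f = np.zeros(dim * num_labels)
    f[y * dim:(y + 1) * dim] = x
    return f

def pegasos(data, num_labels, dim, lam=0.01, epochs=5):
    """data: list of (x, y_true) with x an np.ndarray of length dim."""
    w = np.zeros(dim * num_labels)
    t = 0
    for _ in range(epochs):
        for x, y_true in data:
            t += 1
            eta = 1.0 / (lam * t)            # Pegasos step-size schedule
            # loss-augmented prediction: add 1 to every wrong label's score
            scores = [w @ feat(x, y, dim, num_labels) + (y != y_true)
                      for y in range(num_labels)]
            y_hat = int(np.argmax(scores))
            w *= 1.0 - eta * lam             # gradient of (lam/2)||w||^2
            if y_hat != y_true:              # hinge sub-gradient step
                w += eta * (feat(x, y_true, dim, num_labels)
                            - feat(x, y_hat, dim, num_labels))
    return w
```

When the loss-augmented winner is the true label, the hinge term is zero and only the shrink step applies, which is exactly the "stop gaining objective" behavior above.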
Max vs "Soft-Max" Margin

§ SVMs vs. Maxent (both objectives written out below)
§ Very similar! Both try to make the true score better than a function of the other scores
  § The SVM tries to beat the augmented runner-up
  § The Maxent classifier tries to beat the "soft-max"
§ You can make the SVM's hinge term zero… but never the maxent soft-max term (the log-sum-exp strictly exceeds the true score whenever more than one output is possible)
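Written out in the same notation (standard forms):

SVMs:
\[
\min_w \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \Big( \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big)
\]

Maxent:
\[
\min_w \; \sum_i \Big( \log \sum_y \exp\!\big( w^\top f_i(y) \big) - w^\top f_i(y_i^*) \Big)
\]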
Loss Functions: Comparison

§ Zero-One Loss
§ Hinge
§ Log
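In the same notation, the three losses being compared (standard forms):

\[
\begin{aligned}
\text{Zero-one:}\; & \sum_i \mathbf{1}\big[\arg\max_y w^\top f_i(y) \neq y_i^*\big] \\
\text{Hinge:}\; & \sum_i \Big( \max_y \big( w^\top f_i(y) + \ell_i(y) \big) - w^\top f_i(y_i^*) \Big) \\
\text{Log:}\; & \sum_i \Big( \log \sum_y \exp\!\big(w^\top f_i(y)\big) - w^\top f_i(y_i^*) \Big)
\end{aligned}
\]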
Structure

Handwriting recognition
§ x: an image of a handwritten word → y: "brace"
§ Sequential structure

[Slides: Taskar and Klein 05]

CFG Parsing
§ x: "The screen was a sea of red" → y: its parse tree
§ Recursive structure
Bilingual Word Alignment
§ x: a sentence pair
  "What is the anticipated cost of collecting fees under the new proposal?"
  "En vertu de les nouvelles propositions, quel est le côut prévu de perception de les droits?"
§ y: a word-by-word alignment between the two sentences
§ Combinatorial structure
Structured Models

§ Assumption: the score is a sum of local "part" scores
§ Parts = nodes, edges, productions
§ Prediction searches the space of feasible outputs
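In symbols, with 𝒴(x) the space of feasible outputs (a standard rendering of the assumption):

\[
\mathrm{score}(x, y) = \sum_{p \,\in\, \mathrm{parts}(x, y)} w^\top f(p),
\qquad
\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} \mathrm{score}(x, y)
\]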
Named Entity Recognition

Apple  Computer  bought  Smart  Systems  Inc.  located  in   Arkansas.
ORG    ORG       ---     ORG    ORG      ORG   ---      ---  LOC
Bilingual word alignment
§ Features: association, position, orthography

[Figure: alignment grid between "What is the anticipated cost of collecting fees under the new proposal?" and "En vertu de les nouvelles propositions, quel est le côut prévu de perception de les droits?", with j indexing English positions and k French positions]
Efficient Decoding

§ Common case: you have a black box which computes \( \arg\max_y \, w^\top f(x, y) \), at least approximately, and you want to learn w
§ Easiest option is the structured perceptron [Collins 01]; a minimal sketch follows below
§ Structure enters here in that the search for the best y is typically a combinatorial algorithm (dynamic programming, matchings, ILPs, A*…)
§ Prediction is structured, but the learning update is not
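A minimal sketch of the structured perceptron under the black-box assumption above; decode and feats stand in for the combinatorial search and the joint feature map, and their exact signatures are illustrative assumptions:

```python
import numpy as np

def structured_perceptron(data, feats, decode, dim, epochs=5):
    """data: list of (x, y_true) pairs.
    feats(x, y): joint feature vector (np.ndarray of length dim).
    decode(x, w): black-box (approximate) argmax_y w . feats(x, y),
    e.g. Viterbi, a matching algorithm, an ILP, or A*."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = decode(x, w)           # structured prediction step
            if y_hat != y_true:            # ordinary perceptron update
                w += feats(x, y_true) - feats(x, y_hat)
    return w
```

Note how the structure lives entirely inside decode; the update itself is the plain perceptron rule.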
Structured Margin (Primal)

Remember our primal margin objective?

\[
\min_w \; \tfrac{1}{2}\|w\|_2^2 \;+\; C \sum_i \Big( \max_y \big( w^\top f_i(y) + \ell_i(y) \big) \;-\; w^\top f_i(y_i^*) \Big)
\]

Still applies with a structured output space!
Structured Margin (Primal)

\[
\min_w \; \tfrac{1}{2}\|w\|_2^2 \;+\; C \sum_i \Big( w^\top f_i(\bar{y}) + \ell_i(\bar{y}) \;-\; w^\top f_i(y_i^*) \Big)
\]

Just need an efficient loss-augmented decode:

\[
\bar{y} = \arg\max_y \; w^\top f_i(y) + \ell_i(y)
\]

\[
\nabla_w = w + C \sum_i \big( f_i(\bar{y}) - f_i(y_i^*) \big)
\]

Still use general subgradient descent methods! (AdaGrad)
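A sketch of plain stochastic sub-gradient descent on this objective; swapping in AdaGrad's per-coordinate step sizes is a drop-in change. loss_aug_decode and the constant step size are illustrative assumptions:

```python
import numpy as np

def margin_subgradient(data, feats, loss_aug_decode, dim,
                       C=1.0, eta=0.1, epochs=5):
    """loss_aug_decode(x, y_true, w): argmax_y w . f_i(y) + l_i(y)."""
    w = np.zeros(dim)
    n = len(data)
    for _ in range(epochs):
        for x, y_true in data:
            y_bar = loss_aug_decode(x, y_true, w)   # loss-augmented decode
            # per-example sub-gradient: w/n from the regularizer,
            # plus C * (f_i(y_bar) - f_i(y*)); zero when y_bar = y*
            g = w / n + C * (feats(x, y_bar) - feats(x, y_true))
            w -= eta * g
    return w
```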
Structured Margin (Dual)

§ Remember the constrained version of the primal:
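As above (one constraint per instance i and output y):

\[
\min_{w,\xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_i \xi_i
\quad\text{s.t.}\quad
w^\top f_i(y_i^*) \;\ge\; w^\top f_i(y) + \ell_i(y) - \xi_i \quad \forall i, y
\]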
§ The dual has a variable for every constraint here
Full Margin: OCR

§ We want the true transcription to beat every alternative; for an image of "brace":
  score("brace") > score(y)  for every other string y
§ Equivalently, one constraint per competing output:
  score("brace") > score("aaaaa")
  score("brace") > score("aaaab")
  …
  score("brace") > score("zzzzz")
  … a lot of constraints!
Parsing example

§ We want the true parse of 'It was red' to beat every alternative:
  score(true tree) > score(y)  for every other tree y
§ Equivalently, one constraint per alternative tree (S → A B over C D, S → A B over D F, S → E F over G H, …)
  … a lot of constraints!
Alignment example

§ We want the true alignment of 'What is the' / 'Quel est le' to beat every alternative:
  score(true alignment) > score(y)  for every other alignment y
§ Equivalently, one constraint per alternative way of aligning positions 1–3 to positions 1–3
  … a lot of constraints!
Cutting Plane (Dual)

§ A constraint induction method [Joachims et al. 09]
§ Exploits the fact that the number of constraints you actually need per instance is typically very small
§ Requires (loss-augmented) primal decode only
§ Repeat:
  § Find the most violated constraint for an instance:
    \( \bar{y} = \arg\max_y \; w^\top f_i(y) + \ell_i(y) \)
  § Add this constraint and re-solve the (non-structured) QP (e.g. with SMO or another QP solver); see the working-set sketch below
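A working-set sketch of this loop, hedged: the helper names (feats, loss_aug_decode, loss) and the use of a single shared slack with scipy's SLSQP solver in place of SMO are simplifying assumptions, not the slides' recipe:

```python
import numpy as np
from scipy.optimize import minimize

def cutting_plane(data, feats, loss_aug_decode, loss, dim, C=1.0, rounds=20):
    working_set = []              # accumulated constraints: (delta_f, loss)
    w = np.zeros(dim)
    for _ in range(rounds):
        # 1. Find the most violated constraint for each instance
        for x, y_true in data:
            y_bar = loss_aug_decode(x, y_true, w)
            if y_bar != y_true:
                working_set.append((feats(x, y_true) - feats(x, y_bar),
                                    loss(y_true, y_bar)))
        # 2. Re-solve a small QP over the working set only
        #    (variables: w plus one shared slack xi >= 0)
        def objective(z):
            return 0.5 * z[:dim] @ z[:dim] + C * z[dim]
        cons = [{'type': 'ineq',
                 'fun': lambda z, d=d, l=l: z[:dim] @ d - l + z[dim]}
                for d, l in working_set]
        cons.append({'type': 'ineq', 'fun': lambda z: z[dim]})  # xi >= 0
        res = minimize(objective, np.zeros(dim + 1),
                       constraints=cons, method='SLSQP')
        w = res.x[:dim]
    return w
```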
Cutting Plane (Dual)

§ Some issues:
  § Can easily spend too much time solving QPs
  § Doesn't exploit shared constraint structure
  § In practice, works pretty well; fast like perceptron/MIRA, more stable, no averaging
Likelihood, Structured

§ Structure is needed to compute:
  § The log-normalizer
  § Expected feature counts
§ E.g. if a feature is an indicator of DT-NN, then we need to compute the posterior marginals P(DT-NN | sentence) for each position and sum
§ Also works with latent variables (more later)
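The reason marginals suffice: the gradient of the structured log-likelihood is the standard observed-minus-expected feature counts, and the expectation decomposes over parts:

\[
\nabla_w \sum_i \log P(y_i^* \mid x_i; w)
\;=\;
\sum_i \Big( f_i(y_i^*) - \mathbb{E}_{y \sim P(\cdot \mid x_i; w)}\big[f_i(y)\big] \Big)
\]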
Comparison
Option 0: Reranking

Input: x = "The screen was a sea of red."
   ↓  Baseline Parser
N-Best List (e.g. n = 100)
   ↓  Non-Structured Classification
Output

[e.g. Charniak and Johnson 05]
Reranking

§ Advantages:
  § Directly reduces to the non-structured case
  § No locality restriction on features
§ Disadvantages:
  § Stuck with the errors of the baseline parser
  § The baseline system must produce n-best lists
  § But, feedback is possible [McClosky, Charniak, Johnson 2006]
M3Ns

§ Another option: express all constraints in a packed form
  § Maximum margin Markov networks [Taskar et al. 03]
  § Integrates solution structure deeply into the problem structure

§ Steps:
  § Express inference over the constraints as an LP
  § Use duality to transform the minimax formulation into min-min
  § Constraints factor in the dual along the same structure as the primal; the alphas essentially act as a dual "distribution"
  § Various optimization possibilities in the dual
Example: Kernels

§ Quadratic kernels:
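One standard instance, for feature vectors x and x′:

\[
K(x, x') = \big( x^\top x' + 1 \big)^2
\]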
Non-Linear Separators

§ Another view: kernels map an original feature space to some higher-dimensional feature space where the training set is (more) separable

Φ: y → φ(y)
Why Kernels?

§ Can't you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?
  § Yes, in principle, just compute them
  § No need to modify any algorithms
  § But, the number of features can get large (or infinite)
  § Some kernels are not as usefully thought of in their expanded representation, e.g. RBF or data-defined kernels [Henderson and Titov 05]
§ Kernels let us compute with these features implicitly
  § Example: the implicit dot product in the quadratic kernel takes much less space and time per dot product; see the sketch below
  § Of course, there's a cost for using the pure dual algorithms…
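A small sketch making the "implicitly" concrete for the quadratic kernel above: the explicit degree-2 feature map uses O(d²) features, while the kernel evaluates the same dot product in O(d). The vectors here are illustrative:

```python
import numpy as np

def phi(x):
    """Explicit feature map with phi(x) . phi(z) = (x . z + 1)^2."""
    d = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * x[i] for i in range(d)]        # linear terms
    feats += [x[i] * x[i] for i in range(d)]              # squares
    feats += [np.sqrt(2) * x[i] * x[j]                    # cross terms
              for i in range(d) for j in range(i + 1, d)]
    return np.array(feats)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
explicit = phi(x) @ phi(z)        # O(d^2) space per vector
implicit = (x @ z + 1) ** 2       # O(d) time, no expansion
assert np.isclose(explicit, implicit)
```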