POS Tagging / Parsing I Taylor Berg-Kirkpatrick – CMU Slides: Dan Klein – UC Berkeley Algorithms for NLP


POS Tagging / Parsing I
Taylor Berg-Kirkpatrick – CMU

Slides: Dan Klein – UC Berkeley

Algorithms for NLP

Speech Training

What Needs to be Learned?

§ Emissions: P(x | phone class)
§ x is MFCC-valued

§ Transitions: P(state | prev state)
§ If between words, this is P(word | history)
§ If inside words, this is P(advance | phone class)
§ (Really a hierarchical model)

[HMM diagram: a chain of hidden states s, each emitting an observation x]

Estimation from Aligned Data
§ What if each time step was labeled with its (context-dependent sub)phone?

§ Can estimate P(x | /ae/) as empirical mean and (co-)variance of x's with label /ae/

§ Problem: Don't know alignment at the frame and phone level

[Figure: frames x aligned to phone labels /k/ /ae/ /ae/ /ae/ /t/]

Forced Alignment
§ What if the acoustic model P(x | phone) was known?

§ … and also the correct sequence of words / phones

§ Can predict the best alignment of frames to phones

§ Called "forced alignment"

ssssssssppppeeeeeeetshshshshllllaeaeaebbbbb

“speech lab”

Forced Alignment
§ Create a new state space that forces the hidden variables to transition through phones in the (known) order

§ Still have uncertainty about durations

§ In this HMM, all the parameters are known
§ Transitions determined by known utterance
§ Emissions assumed to be known
§ Minor detail: self-loop probabilities

§ Just run Viterbi (or approximations) to get the best alignment

/s/ /p/ /ee/ /ch/ /l/ /ae/ /b/

EM for Alignment
§ Input: acoustic sequences with word-level transcriptions

§ We don't know either the emission model or the frame alignments

§ Expectation Maximization (Hard EM for now)
§ Alternating optimization
§ Impute completions for unlabeled variables (here, the states at each time step)
§ Re-estimate model parameters (here, Gaussian means, variances, mixture ids)
§ Repeat
§ One of the earliest uses of EM!

Soft EM
§ Hard EM uses the best single completion
§ Here, single best alignment
§ Not always representative
§ Certainly bad when your parameters are initialized and the alignments are all tied
§ Uses the count of various configurations (e.g. how many tokens of /ae/ have self-loops)

§ What we'd really like is to know the fraction of paths that include a given completion
§ E.g. 0.32 of the paths align this frame to /p/, 0.21 align it to /ee/, etc.
§ Formally want to know the expected count of configurations
§ Key quantity: P(s_t | x)

Computing Marginals

P(s_t = s | x) = (sum of all paths through s at t) / (sum of all paths)

Forward Scores

Backward Scores

Total Scores

Fractional Counts

§ Computing fractional (expected) counts
§ Compute forward / backward probabilities
§ For each position, compute marginal posteriors
§ Accumulate expectations
§ Re-estimate parameters (e.g. means, variances, self-loop probabilities) from ratios of these expected counts
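A minimal forward-backward sketch of the computation above, for a generic HMM: trans[s', s] is a transition matrix, emit[s, t] holds emission scores already evaluated on the observed frames, and state 0 is assumed to be the start state (these names, and the lack of log-space rescaling, are illustrative simplifications, not the speech system's actual setup).

import numpy as np

def posterior_marginals(trans, emit):
    S, T = emit.shape
    alpha = np.zeros((T, S))   # alpha[t, s]: total score of paths ending in s at t
    beta = np.zeros((T, S))    # beta[t, s]: total score of completions from s at t
    alpha[0] = emit[:, 0] * (np.arange(S) == 0)         # start in state 0
    for t in range(1, T):
        alpha[t] = emit[:, t] * (alpha[t - 1] @ trans)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, t + 1] * beta[t + 1])
    total = alpha[T - 1].sum()                           # sum over all paths
    return alpha * beta / total                          # [t, s] = P(s_t = s | x)

The returned posteriors are the fractional counts accumulated in the E-step; expected transition counts come from the analogous pairwise quantities.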

Staged Training and State Tying

§ Creating CD phones:
§ Start with monophone, do EM training
§ Clone Gaussians into triphones
§ Build decision tree and cluster Gaussians
§ Clone and train mixtures (GMMs)

§ General idea:
§ Introduce complexity gradually
§ Interleave constraint with flexibility

Parts of Speech

Parts-of-Speech (English)
§ One basic kind of linguistic structure: syntactic word classes

Open class (lexical) words:
  Nouns – Proper (IBM, Italy), Common (cat / cats, snow)
  Verbs – Main (see, registered), Auxiliary (can, had)
  Adjectives (yellow)
  Adverbs (slowly)
  Numbers (122,312, one)
  … more

Closed class (functional) words:
  Prepositions (to, with)
  Particles (off, up)
  Determiners (the, some)
  Conjunctions (and, or)
  Pronouns (he, its)
  … more

CC    conjunction, coordinating                  and both but either or
CD    numeral, cardinal                          mid-1890 nine-thirty 0.5 one
DT    determiner                                 a all an every no that the
EX    existential there                          there
FW    foreign word                               gemeinschaft hund ich jeux
IN    preposition or conjunction, subordinating  among whether out on by if
JJ    adjective or numeral, ordinal              third ill-mannered regrettable
JJR   adjective, comparative                     braver cheaper taller
JJS   adjective, superlative                     bravest cheapest tallest
MD    modal auxiliary                            can may might will would
NN    noun, common, singular or mass             cabbage thermostat investment subhumanity
NNP   noun, proper, singular                     Motown Cougar Yvette Liverpool
NNPS  noun, proper, plural                       Americans Materials States
NNS   noun, common, plural                       undergraduates bric-a-brac averages
POS   genitive marker                            ' 's
PRP   pronoun, personal                          hers himself it we them
PRP$  pronoun, possessive                        her his mine my our ours their thy your
RB    adverb                                     occasionally maddeningly adventurously
RBR   adverb, comparative                        further gloomier heavier less-perfectly
RBS   adverb, superlative                        best biggest nearest worst
RP    particle                                   aboard away back by on open through
TO    "to" as preposition or infinitive marker   to
UH    interjection                               huh howdy uh whammo shucks heck
VB    verb, base form                            ask bring fire see take
VBD   verb, past tense                           pleaded swiped registered saw
VBG   verb, present participle or gerund         stirring focusing approaching erasing
VBN   verb, past participle                      dilapidated imitated reunifed unsettled
VBP   verb, present tense, not 3rd person sing.  twist appear comprise mold postpone
VBZ   verb, present tense, 3rd person singular   bases reconstructs marks uses
WDT   WH-determiner                              that what whatever which whichever
WP    WH-pronoun                                 that what whatever which who whom
WP$   WH-pronoun, possessive                     whose
WRB   Wh-adverb                                  however whenever where why

Part-of-Speech Ambiguity
§ Words can have multiple parts of speech

§ Two basic sources of constraint:
§ Grammatical environment
§ Identity of the current word

§ Many more possible features:
§ Suffixes, capitalization, name databases (gazetteers), etc…

Fed   raises   interest   rates   0.5   percent
NNP   NNS      NN         NNS     CD    NN
VBN   VBZ      VBP        VBZ
VBD            VB

Why POS Tagging?
§ Useful in and of itself (more than you'd think)
§ Text-to-speech: record, lead
§ Lemmatization: saw[v] → see, saw[n] → saw
§ Quick-and-dirty NP-chunk detection: grep {JJ | NN}* {NN | NNS}

§ Useful as a pre-processing step for parsing
§ Less tag ambiguity means fewer parses
§ However, some tag choices are better decided by parsers

DT  NN      IN  NN        VBD     NNS   VBD
The average of  interbank offered rates plummeted …

DT  NNP     NN     VBD VBN   RP  NN   NNS
The Georgia branch had taken on  loan commitments …

(tags better decided by the parser: offered → VBN, on → IN)

Part-of-Speech Tagging

Classic Solution: HMMs
§ We want a model of sequences s and observations w

§ Assumptions:
§ States are tag n-grams
§ Usually a dedicated start and end state / word
§ Tag/state sequence is generated by a Markov model
§ Words are chosen independently, conditioned only on the tag/state
§ These are totally broken assumptions: why?

[HMM diagram: s0 → s1 → s2 → … → sn, with each si emitting wi]

States
§ States encode what is relevant about the past
§ Transitions P(s | s') encode well-formed tag sequences

§ In a bigram tagger, states = tags

§ In a trigram tagger, states = tag pairs

[Diagram: trigram states s0 = <•,•>, s1 = <•,t1>, s2 = <t1,t2>, …, sn = <tn-1,tn>; bigram states s0 = <•>, s1 = <t1>, …, sn = <tn>; each si emits wi]

Estimating Transitions

§ Use standard smoothing methods to estimate transitions:

  P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_2 \hat{P}(t_i \mid t_{i-1}, t_{i-2}) + \lambda_1 \hat{P}(t_i \mid t_{i-1}) + (1 - \lambda_1 - \lambda_2)\, \hat{P}(t_i)

§ Can get a lot fancier (e.g. KN smoothing) or use higher orders, but in this case it doesn't buy much

§ One option: encode more into the state, e.g. whether the previous word was capitalized (Brants 00)

§ BIG IDEA: The basic approach of state-splitting / refinement turns out to be very important in a range of tasks
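A small sketch of the interpolated estimate above, assuming unigram, bigram, and trigram are dicts of relative-frequency (MLE) estimates over tags; the lambda values are placeholders rather than tuned weights.

def p_transition(t, t1, t2, unigram, bigram, trigram, lam2=0.6, lam1=0.3):
    # P(t | t_{i-1}=t1, t_{i-2}=t2) as a mixture of trigram, bigram, and unigram MLEs
    return (lam2 * trigram.get((t2, t1, t), 0.0)
            + lam1 * bigram.get((t1, t), 0.0)
            + (1.0 - lam2 - lam1) * unigram.get(t, 0.0))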

Estimating Emissions

§ Emissions are trickier:
§ Words we've never seen before
§ Words which occur with tags we've never seen them with
§ One option: break out the fancy smoothing (e.g. KN, Good-Turing)
§ Issue: unknown words aren't black boxes:

§ Basic solution: unknown word classes (affixes or shapes)

§ Common approach: Estimate P(t | w) and invert
§ [Brants 00] used a suffix trie as its (inverted) emission model

343,127.23    11-year    Minteria    reintroducibly
D+,D+.D+      D+-x+      Xx+         x+-"ly"
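A toy word-shape / affix classer in the spirit of the classes above; the class inventory here is invented for illustration, not the one any particular tagger uses.

import re

def word_class(w):
    if re.fullmatch(r"[\d,]+\.?\d*", w):
        return "NUMERIC"              # e.g. 343,127.23
    if re.fullmatch(r"\d+-[a-z]+", w):
        return "NUM-HYPHEN-LOWER"     # e.g. 11-year
    if w.endswith("ly"):
        return "SUFFIX-ly"            # e.g. reintroducibly
    if w[:1].isupper() and w[1:].islower():
        return "CAPITALIZED"          # e.g. Minteria
    return "OTHER"

Unknown words are mapped to their class, and the emission distribution for the class stands in for the unseen word.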

Disambiguation (Inference)
§ Problem: find the most likely (Viterbi) sequence under the model

§ Given model parameters, we can score any tag sequence

§ In principle, we're done – list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence)

Fed   raises   interest   rates   0.5   percent   .
NNP   VBZ      NN         NNS     CD    NN        .

P(NNP | <•,•>) P(Fed | NNP) P(VBZ | <NNP,•>) P(raises | VBZ) P(NN | VBZ,NNP) …

NNP VBZ NN NNS CD NN    logP = -23
NNP NNS NN NNS CD NN    logP = -29
NNP VBZ VB NNS CD NN    logP = -27

<•,•>  <•,NNP>  <NNP,VBZ>  <VBZ,NN>  <NN,NNS>  <NNS,CD>  <CD,NN>  <STOP>

Finding the Best Trajectory
§ Too many trajectories (state sequences) to list
§ Option 1: Beam Search
§ A beam is a set of partial hypotheses
§ Start with just the single empty trajectory
§ At each derivation step:
§ Consider all continuations of previous hypotheses
§ Discard most, keep top k, or those within a factor of the best (see the sketch after the example below)

§ Beam search works ok in practice
§ … but sometimes you want the optimal answer
§ … and you need optimal answers to validate your beam search
§ … and there's usually a better option than naïve beams

[Example beam: <> → {Fed:NNP, Fed:VBN, Fed:VBD} → {Fed:NNP raises:NNS, Fed:NNP raises:VBZ, Fed:VBN raises:NNS, Fed:VBN raises:VBZ, …}]
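A minimal beam-search sketch over tag sequences, kept deliberately generic: tags is the tag inventory and score_step(hyp, tag, i) is assumed to return the log-score increment for extending hypothesis hyp with tag at position i — both are placeholders standing in for a real HMM or MEMM scorer.

def beam_search(words, tags, score_step, k=5):
    beam = [((), 0.0)]                  # start with the single empty trajectory
    for i, _ in enumerate(words):
        candidates = []
        for hyp, score in beam:         # consider all continuations
            for t in tags:
                candidates.append((hyp + (t,), score + score_step(hyp, t, i)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]           # discard most, keep the top k
    return beam[0]                      # best surviving complete hypothesis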

The State Lattice / Trellis

[Figure: trellis with states ^, N, V, J, D, $ at each position over START Fed raises interest rates END]

The State Lattice / Trellis

[Figure: the same trellis over START Fed raises interest rates END]

The Viterbi Algorithm
§ Dynamic program for computing

  \delta_i(s) = \max_{s_0 \ldots s_{i-1}} P(s_0 \ldots s_{i-1}\, s,\ w_0 \ldots w_{i-1})

§ The score of a best path up to position i ending in state s

  \delta_0(s) = \begin{cases} 1 & \text{if } s = \langle \bullet, \bullet \rangle \\ 0 & \text{otherwise} \end{cases}

  \delta_i(s) = \max_{s'} P(s \mid s')\, P(w_{i-1} \mid s')\, \delta_{i-1}(s')

§ Also can store a backtrace (but no one does)

  \psi_i(s) = \arg\max_{s'} P(s \mid s')\, P(w_{i-1} \mid s')\, \delta_{i-1}(s')

§ Memoized solution
§ Iterative solution
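A compact Viterbi sketch of the recurrence. For simplicity it attaches each word's emission to the state at the same position (a standard, equivalent indexing), and trans[(s_prev, s)] and emit[(s, w)] are assumed to be probability dicts with hypothetical START/STOP states.

def viterbi(words, states, trans, emit, START="<s>", STOP="</s>"):
    delta = {START: 1.0}
    backptrs = []
    for w in words:
        new_delta, bp = {}, {}
        for s in states:
            # best way to reach s: extend the best predecessor s'
            best_prev, best_score = None, 0.0
            for s_prev, score in delta.items():
                cand = score * trans.get((s_prev, s), 0.0) * emit.get((s, w), 0.0)
                if cand > best_score:
                    best_prev, best_score = s_prev, cand
            if best_prev is not None:
                new_delta[s], bp[s] = best_score, best_prev
        delta, backptrs = new_delta, backptrs + [bp]
    # transition into STOP, then follow backpointers
    best_last = max(delta, key=lambda s: delta[s] * trans.get((s, STOP), 0.0))
    tags = [best_last]
    for bp in reversed(backptrs[1:]):
        tags.append(bp[tags[-1]])
    return list(reversed(tags))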

So How Well Does It Work?
§ Choose the most common tag
§ 90.3% with a bad unknown word model
§ 93.7% with a good one

§ TnT (Brants, 2000):
§ A carefully smoothed trigram tagger
§ Suffix trees for emissions
§ 96.7% on WSJ text (SOTA is 97+%)

§ Noise in the data
§ Many errors in the training and test corpora
§ Probably about 2% guaranteed error from noise (on this data)

NN    NN        NN            JJ    NN        NN
chief executive officer       chief executive officer

JJ    JJ        NN            NN    JJ        NN
chief executive officer       chief executive officer

DT  NN      IN  NN        VBD     NNS   VBD
The average of  interbank offered rates plummeted …

Overview: Accuracies
§ Roadmap of (known / unknown) accuracies:

§ Most freq tag:      ~90% / ~50%

§ Trigram HMM:        ~95% / ~55%

§ TnT (HMM++):        96.2% / 86.0%
§ Maxent P(t|w):      93.7% / 82.6%
§ MEMM tagger:        96.9% / 86.9%
§ State-of-the-art:   97+% / 89+%
§ Upper bound:        ~98%

Most errors on unknown words

Common Errors
§ Common errors [from Toutanova & Manning 00]

NN/JJ    NN
official knowledge

VBD  RP/IN  DT  NN
made up     the story

RB       VBD/VBN  NNS
recently sold     shares

Richer Features

Better Features
§ Can do surprisingly well just looking at a word by itself:

§ Word              the: the → DT
§ Lowercased word   Importantly: importantly → RB
§ Prefixes          unfathomable: un- → JJ
§ Suffixes          Surprisingly: -ly → RB
§ Capitalization    Meridian: CAP → NNP
§ Word shapes       35-year: d-x → JJ

§ Then build a maxent (or whatever) model to predict tag
§ Maxent P(t|w): 93.7% / 82.6%

[Diagram: a single state s3 emitting w3]

Why Linear Context is Useful
§ Lots of rich local information!

§ We could fix this with a feature that looked at the next word

§ We could fix this by linking capitalized words to their lowercase versions

§ Solution: discriminative sequence models (MEMMs, CRFs)

§ Reality check:
§ Taggers are already pretty good on newswire text…
§ What the world needs is taggers that work on other text!

PRP  VBD  IN  RB   IN  PRP  VBD      .
They left  as  soon as  he   arrived .

NNP       NNS    VBD      VBN        .
Intrinsic flaws  remained undetected .

(the errors: the first 'as' should be RB, 'Intrinsic' should be JJ)

Sequence-Free Tagging?

§ What about looking at a word and its environment, but no sequence information?

§ Add in previous / next word        the __
§ Previous / next word shapes        X __ X
§ Occurrence pattern features        [X: x X occurs]
§ Crude entity detection             __ ….. (Inc.|Co.)
§ Phrasal verb in sentence?          put …… __
§ Conjunctions of these things

§ All features except sequence: 96.6% / 86.8%
§ Uses lots of features: >200K
§ Why isn't this the standard approach?

[Diagram: tag t3 predicted from words w2, w3, w4]

Named Entity Recognition
§ Other sequence tasks use similar models

§ Example: named entity recognition (NER)

Local Context
        Prev    Cur     Next
State   Other   ???     ???
Word    at      Grace   Road
Tag     IN      NNP     NNP
Sig     x       Xx      Xx

Tim/PER Boon/PER has/O signed/O a/O contract/O extension/O with/O Leicestershire/ORG which/O will/O keep/O him/O at/O Grace/LOC Road/LOC ./O

MEMM Taggers
§ Idea: left-to-right local decisions, condition on previous tags and also entire input

§ Train up P(t_i | w, t_{i-1}, t_{i-2}) as a normal maxent model, then use to score sequences

§ This is referred to as an MEMM tagger [Ratnaparkhi 96]
§ Beam search effective! (Why?)
§ What about beam size 1?

§ Subtle issues with local normalization (cf. Lafferty et al 01)
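A bare-bones MEMM-style tagger sketch: a maxent (logistic regression) model over local features of the word and the previous tag, decoded greedily (beam size 1). The feature templates are illustrative, not Ratnaparkhi's.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(words, i, prev_tag):
    return {"w": words[i], "prev_tag": prev_tag,
            "suffix3": words[i][-3:], "is_cap": words[i][0].isupper()}

def train(tagged_sentences):                 # [(words, tags), ...]
    X, y = [], []
    for words, tags in tagged_sentences:
        for i, t in enumerate(tags):
            X.append(features(words, i, tags[i - 1] if i else "<s>"))
            y.append(t)
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    return model.fit(X, y)

def greedy_tag(model, words):
    tags = []
    for i in range(len(words)):
        f = features(words, i, tags[-1] if tags else "<s>")
        tags.append(model.predict([f])[0])
    return tags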

NER Features

Local Context
        Prev    Cur     Next
State   Other   ???     ???
Word    at      Grace   Road
Tag     IN      NNP     NNP
Sig     x       Xx      Xx

Feature Type           Feature    PERS    LOC
Previous word          at         -0.73   0.94
Current word           Grace       0.03   0.00
Beginning bigram       <G          0.45  -0.04
Current POS tag        NNP         0.47   0.45
Prev and cur tags      IN NNP     -0.10   0.14
Previous state         Other      -0.70  -0.92
Current signature      Xx          0.80   0.46
Prev state, cur sig    O-Xx        0.68   0.37
Prev-cur-next sig      x-Xx-Xx    -0.69   0.37
P. state - p-cur sig   O-x-Xx     -0.20   0.82
…
Total:                            -0.58   2.68

Feature Weights
Because of the regularization term, the more common prefixes have larger weights even though entire-word features are more specific.

Decoding
§ Decoding MEMM taggers:
§ Just like decoding HMMs, different local scores
§ Viterbi, beam search, posterior decoding

§ Viterbi algorithm (HMMs):

§ Viterbi algorithm (MEMMs):

§ General:
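One standard way to write the three recurrences — only the local score changes:

  \delta_i(s) = \max_{s'} P(s \mid s')\, P(w_{i-1} \mid s')\, \delta_{i-1}(s')        (HMM)
  \delta_i(s) = \max_{s'} P(s \mid s', \mathbf{w})\, \delta_{i-1}(s')                 (MEMM)
  \delta_i(s) = \max_{s'} \phi(\mathbf{w}, i, s', s)\, \delta_{i-1}(s')               (general local score \phi)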

Conditional Random Fields (and Friends)

Maximum Entropy II

§ Remember: maximum entropy objective

§ Problem: lots of features allow perfect fit to training set
§ Regularization (compare to smoothing)

Derivative for Maximum Entropy

Total count of feature n in correct candidates

Expected count of feature n in predicted candidates

Big weights are bad
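In one standard form (L2-regularized conditional log-likelihood), the pieces labeled above line up as follows:

  \frac{\partial}{\partial \lambda_n} \Big( \sum_i \log P(y_i^* \mid x_i; \lambda) - \frac{\|\lambda\|^2}{2\sigma^2} \Big)
    = \sum_i f_n(x_i, y_i^*) \;-\; \sum_i \sum_{y} P(y \mid x_i; \lambda)\, f_n(x_i, y) \;-\; \frac{\lambda_n}{\sigma^2}

i.e. (total count of feature n in correct candidates) − (expected count of feature n in predicted candidates) − (a penalty that pushes big weights down).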

Perceptron Review

Perceptron
§ Linear models:

§ … that decompose along the sequence

§ … allow us to predict with the Viterbi algorithm

§ … which means we can train with the perceptron algorithm (or related updates, like MIRA)

[Collins 01]

Conditional Random Fields
§ Make a maxent model over entire taggings

§ MEMM

§ CRF
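In one common notation the contrast is local vs. global normalization (the feature function f and weights \lambda are generic placeholders):

  MEMM:  P(\mathbf{t} \mid \mathbf{w}) = \prod_i \frac{\exp\big(\lambda^\top f(t_i, t_{i-1}, \mathbf{w}, i)\big)}{\sum_{t'} \exp\big(\lambda^\top f(t', t_{i-1}, \mathbf{w}, i)\big)}

  CRF:   P(\mathbf{t} \mid \mathbf{w}) = \frac{\exp\big(\sum_i \lambda^\top f(t_i, t_{i-1}, \mathbf{w}, i)\big)}{\sum_{\mathbf{t}'} \exp\big(\sum_i \lambda^\top f(t'_i, t'_{i-1}, \mathbf{w}, i)\big)}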

CRFs
§ Like any maxent model, derivative is:

§ So all we need is to be able to compute the expectation of each feature (for example the number of times the label pair DT-NN occurs, or the number of times NN-interest occurs) under the model distribution

§ Critical quantity: counts of posterior marginals:
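In standard forward-backward notation (with \alpha, \beta, and partition function Z(\mathbf{w})), the needed marginals are:

  P(t_i = s \mid \mathbf{w}) = \frac{\alpha_i(s)\, \beta_i(s)}{Z(\mathbf{w})}
  P(t_{i-1} = s',\ t_i = s \mid \mathbf{w}) = \frac{\alpha_{i-1}(s')\, \phi(\mathbf{w}, i, s', s)\, \beta_i(s)}{Z(\mathbf{w})}

Summing these over positions (and over occurrences of a given word) gives the expected feature counts in the derivative.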

Computing Posterior Marginals
§ How many (expected) times is word w tagged with s?
§ How to compute that marginal?

[Figure: the state lattice / trellis over START Fed raises interest rates END]

Global Discriminative Taggers

§ Newer, higher-powered discriminative sequence models
§ CRFs (also perceptrons, M3Ns)
§ Do not decompose training into independent local regions
§ Can be deathly slow to train – require repeated inference on training set
§ Differences tend not to be too important for POS tagging
§ Differences more substantial on other sequence tasks
§ However: one issue worth knowing about in local models

§ "Label bias" and other explaining-away effects
§ MEMM taggers' local scores can be near one without having both good "transitions" and "emissions"
§ This means that often evidence doesn't flow properly
§ Why isn't this a big deal for POS tagging?
§ Also: in decoding, condition on predicted, not gold, histories

Transformation-Based Learning

§ [Brill 95] presents a transformation-based tagger
§ Label the training set with most frequent tags

DT  MD  VBD  VBD    .
The can was  rusted .

§ Add transformation rules which reduce training mistakes

§ MD → NN : DT __
§ VBD → VBN : VBD __ .

§ Stop when no transformations do sufficient good
§ Does this remind anyone of anything?

§ Probably the most widely used tagger (esp. outside NLP)
§ … but definitely not the most accurate: 96.6% / 82.0%

Learned Transformations
§ What gets learned? [from Brill 95]

EngCG Tagger

§ English constraint grammar tagger
§ [Tapanainen and Voutilainen 94]
§ Something else you should know about
§ Hand-written and knowledge driven
§ "Don't guess if you know" (general point about modeling more structure!)
§ Tag set doesn't make all of the hard distinctions as the standard tag set (e.g. JJ/NN)

§ They get stellar accuracies: 99% on their tag set

§ Linguistic representation matters…
§ … but it's easier to win when you make up the rules

Domain Effects
§ Accuracies degrade outside of domain
§ Up to triple error rate
§ Usually make the most errors on the things you care about in the domain (e.g. protein names)

§ Open questions
§ How to effectively exploit unlabeled data from a new domain (what could we gain?)
§ How to best incorporate domain lexica in a principled way (e.g. UMLS specialist lexicon, ontologies)

Unsupervised Tagging

Unsupervised Tagging?
§ AKA part-of-speech induction
§ Task:
§ Raw sentences in
§ Tagged sentences out

§ Obvious thing to do:
§ Start with a (mostly) uniform HMM
§ Run EM
§ Inspect results

EM for HMMs: Process
§ Alternate between recomputing distributions over hidden variables (the tags) and reestimating parameters
§ Crucial step: we want to tally up how many (fractional) counts of each kind of transition and emission we have under current params:

§ Same quantities we needed to train a CRF!

EM for HMMs: Quantities
§ Total path values (correspond to probabilities here):

The State Lattice / Trellis

[Figure: the state lattice over START Fed raises interest rates END]

EM for HMMs: Process

§ From these quantities, can compute expected transitions:

§ And emissions:
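In standard notation (assuming the convention that state s_i emits word w_i, with forward scores \alpha, backward scores \beta, and total score Z), the expected counts are:

  \text{count}(s \to s') = \sum_i \frac{\alpha_i(s)\, P(s' \mid s)\, P(w_{i+1} \mid s')\, \beta_{i+1}(s')}{Z}
  \text{count}(w \text{ emitted from } s) = \sum_{i:\, w_i = w} \frac{\alpha_i(s)\, \beta_i(s)}{Z}

The M-step then renormalizes these counts into new transition and emission probabilities.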

Merialdo: Setup
§ Some (discouraging) experiments [Merialdo 94]

§ Setup:
§ You know the set of allowable tags for each word
§ Fix k training examples to their true labels
§ Learn P(w|t) on these examples
§ Learn P(t|t-1,t-2) on these examples

§ On n examples, re-estimate with EM

§ Note: we know allowed tags but not frequencies

Merialdo: Results

Distributional Clustering

president   the __ of
president   the __ said
governor    the __ of
governor    the __ appointed
said        sources __ •
said        president __ that
reported    sources __ •

{president, governor}   {said, reported}   {the, a}

• the president said that the downturn was over •

[Finch and Chater 92, Schuetze 93, many others]

Distributional Clustering
§ Three main variants on the same idea:

§ Pairwise similarities and heuristic clustering
§ E.g. [Finch and Chater 92]
§ Produces dendrograms

§ Vector space methods
§ E.g. [Schuetze 93]
§ Models of ambiguity

§ Probabilistic methods
§ Various formulations, e.g. [Lee and Pereira 99]

Nearest Neighbors

Dendrograms

Dendrograms

Vector Space Version
§ [Schuetze 93] clusters words as points in R^n

§ Vectors too sparse, use SVD to reduce

[Figure: word-by-context count matrix M factored as M ≈ U S V; each word w keeps a reduced row of context counts]

Cluster these 50-200 dim vectors instead.
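A small sketch of that setup: build a word-by-context count matrix over immediate neighbors, reduce it with a truncated SVD, and cluster the resulting rows. The corpus format, window, and dimensionality here are illustrative choices.

import numpy as np

def reduced_word_vectors(sentences, vocab, dim=50):
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))            # word-by-context counts
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in (i - 1, i + 1):                   # left and right neighbors as contexts
                if w in index and 0 <= j < len(sent) and sent[j] in index:
                    M[index[w], index[sent[j]]] += 1
    U, S, Vt = np.linalg.svd(M, full_matrices=False)   # M ≈ U S Vt
    return U[:, :dim] * S[:dim]                        # one reduced vector per word

The returned rows can then be fed to any clustering method (e.g. k-means) in place of the sparse count vectors.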

A Probabilistic Version?

  P(S, C) = \prod_i P(w_i \mid c_i)\, P(c_i \mid c_{i-1})

  P(S, C) = \prod_i P(c_i)\, P(w_i \mid c_i)\, P(w_{i-1}, w_{i+1} \mid c_i)

•   the  president  said  that  the  downturn  was  over  •
    c1   c2         c3    c4    c5   c6        c7   c8

What Else?
§ Various newer ideas:

§ Context distributional clustering [Clark 00]
§ Morphology-driven models [Clark 03]
§ Contrastive estimation [Smith and Eisner 05]
§ Feature-rich induction [Haghighi and Klein 06]

§ Also:
§ What about ambiguous words?
§ Using wider context signatures has been used for learning synonyms (what's wrong with this approach?)
§ Can extend these ideas for grammar induction (later)


Syntax

Parse Trees

The move followed a round of similar increases by other lenders, reflecting a continuing decline in that market

Phrase Structure Parsing
§ Phrase structure parsing organizes syntax into constituents or brackets

§ In general, this involves nested trees

§ Linguists can, and do, argue about details

§ Lots of ambiguity

§ Not the only kind of syntax…

new art critics write reviews with computers

[Figure: phrase-structure tree over the sentence above, with nodes S, NP, N', VP, PP]

Constituency Tests

§ How do we know what nodes go in the tree?

§ Classic constituency tests:
§ Substitution by proform
§ Question answers
§ Semantic grounds
§ Coherence
§ Reference
§ Idioms
§ Dislocation
§ Conjunction

§ Cross-linguistic arguments, too

Conflicting Tests
§ Constituency isn't always clear

§ Units of transfer:
§ think about ~ penser à
§ talk about ~ hablar de

§ Phonological reduction:
§ I will go → I'll go
§ I want to go → I wanna go
§ à le centre → au centre

§ Coordination
§ He went to and came from the store.

La vélocité des ondes sismiques ("the velocity of seismic waves")

Classical NLP: Parsing

§ Write symbolic or logical rules:

§ Use deduction systems to prove parses from words
§ Minimal grammar on "Fed raises" sentence: 36 parses
§ Simple 10-rule grammar: 592 parses
§ Real-size grammar: many millions of parses

§ This scaled very badly, didn't yield broad-coverage tools

Grammar (CFG)            Lexicon

ROOT → S                 NN → interest
S → NP VP                NNS → raises
NP → DT NN               VBP → interest
NP → NN NNS              VBZ → raises
NP → NP PP
VP → VBP NP
VP → VBP NP PP
PP → IN NP

Ambiguities

Ambiguities: PP Attachment

Attachments

§ I cleaned the dishes from dinner

§ I cleaned the dishes with detergent

§ I cleaned the dishes in my pajamas

§ I cleaned the dishes in the sink

Syntactic Ambiguities I

§ Prepositional phrases: They cooked the beans in the pot on the stove with handles.

§ Particle vs. preposition: The puppy tore up the staircase.

§ Complement structures: The tourists objected to the guide that they couldn't hear. / She knows you like the back of her hand.

§ Gerund vs. participial adjective: Visiting relatives can be boring. / Changing schedules frequently confused passengers.

Syntactic Ambiguities II
§ Modifier scope within NPs: impractical design requirements / plastic cup holder

§ Multiple gap constructions: The chicken is ready to eat. / The contractors are rich enough to sue.

§ Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall.

Dark Ambiguities

§ Dark ambiguities: most analyses are shockingly bad (meaning, they don't have an interpretation you can get your mind around)

§ Unknown words and new usages
§ Solution: We need mechanisms to focus attention on the best ones; probabilistic techniques do this

This analysis corresponds to the correct parse of "This will panic buyers!"

Ambiguities as Trees

PCFGs

Probabilistic Context-Free Grammars

§ A context-free grammar is a tuple <N, T, S, R>
§ N : the set of non-terminals
§ Phrasal categories: S, NP, VP, ADJP, etc.
§ Parts-of-speech (pre-terminals): NN, JJ, DT, VB
§ T : the set of terminals (the words)
§ S : the start symbol
§ Often written as ROOT or TOP
§ Not usually the sentence non-terminal S
§ R : the set of rules
§ Of the form X → Y1 Y2 … Yk, with X, Yi ∈ N
§ Examples: S → NP VP, VP → VP CC VP
§ Also called rewrites, productions, or local trees

§ A PCFG adds:
§ A top-down production probability per rule P(Y1 Y2 … Yk | X)

Treebank Sentences

Treebank Grammars

§ Need a PCFG for broad coverage parsing.
§ Can take a grammar right off the trees (doesn't work well):

ROOT → S           1
S → NP VP .        1
NP → PRP           1
VP → VBD ADJP      1
…

§ Better results by enriching the grammar (e.g., lexicalization).
§ Can also get state-of-the-art parsers without lexicalization.

[Figure: example treebank tree fragments with categories DET, ADJ, NOUN, PLURAL NOUN, NP, PP, CONJ]
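A tiny sketch of reading a PCFG off treebank trees. Trees are assumed to be nested tuples like ("S", ("NP", ("PRP", "I")), ("VP", ...)); this toy format and the helper names are illustrative, not the Penn Treebank reader.

from collections import Counter, defaultdict

def count_rules(tree, counts):
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        counts[(label, (children[0],))] += 1           # preterminal -> word
        return
    counts[(label, tuple(c[0] for c in children))] += 1
    for c in children:
        count_rules(c, counts)

def estimate_pcfg(trees):
    counts = Counter()
    for t in trees:
        count_rules(t, counts)
    lhs_totals = defaultdict(float)
    for (lhs, rhs), c in counts.items():
        lhs_totals[lhs] += c
    return {(lhs, rhs): c / lhs_totals[lhs] for (lhs, rhs), c in counts.items()}

Each rule's probability is just its count divided by the count of its left-hand side, i.e. the relative frequencies shown above.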

Treebank Grammar Scale

§ Treebank grammars can be enormous
§ As FSAs, the raw grammar has ~10K states, excluding the lexicon
§ Better parsers usually make the grammars larger, not smaller

[Figure: a fragment of the NP grammar drawn as an FSA]

Chomsky Normal Form

§ Chomsky normal form:
§ All rules of the form X → Y Z or X → w
§ In principle, this is no limitation on the space of (P)CFGs
§ N-ary rules introduce new non-terminals
§ Unaries / empties are "promoted"
§ In practice it's kind of a pain:
§ Reconstructing n-aries is easy
§ Reconstructing unaries is trickier
§ The straightforward transformations don't preserve tree scores
§ Makes parsing algorithms simpler!

[Figure: binarizing VP → VBD NP PP PP with intermediate symbols [VP → VBD NP •] and [VP → VBD NP PP •]]

CKY Parsing

A Recursive Parser

§ Will this parser work?
§ Why or why not?
§ Memory requirements?

bestScore(X, i, j, s)
  if (j = i+1)
    return tagScore(X, s[i])
  else
    return max score(X->YZ) *
               bestScore(Y, i, k) *
               bestScore(Z, k, j)

A Memoized Parser
§ One small change:

bestScore(X, i, j, s)
  if (scores[X][i][j] == null)
    if (j = i+1)
      score = tagScore(X, s[i])
    else
      score = max score(X->YZ) *
                  bestScore(Y, i, k) *
                  bestScore(Z, k, j)
    scores[X][i][j] = score
  return scores[X][i][j]

§ Can also organize things bottom-up

A Bottom-Up Parser (CKY)

bestScore(s)
  for (i : [0, n-1])
    for (X : tags[s[i]])
      score[X][i][i+1] = tagScore(X, s[i])
  for (diff : [2, n])
    for (i : [0, n-diff])
      j = i + diff
      for (X->YZ : rule)
        for (k : [i+1, j-1])
          score[X][i][j] = max score[X][i][j],
                               score(X->YZ) * score[Y][i][k] * score[Z][k][j]

[Diagram: X over span [i, j] built from Y over [i, k] and Z over [k, j]]
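A runnable rendering of the CKY loop above for a PCFG in CNF; it returns the best inside (Viterbi) scores without backpointers. The grammar format (dicts of lexical and binary rules) is made up for this sketch: lexicon[w] is a list of (X, prob) and binary[(Y, Z)] a list of (X, rule_prob).

from collections import defaultdict

def cky(words, lexicon, binary):
    n = len(words)
    score = defaultdict(float)                       # (X, i, j) -> best inside score
    for i, w in enumerate(words):
        for X, p in lexicon.get(w, []):              # X -> w
            score[(X, i, i + 1)] = p
    for diff in range(2, n + 1):
        for i in range(0, n - diff + 1):
            j = i + diff
            for k in range(i + 1, j):
                for (Y, Z), rules in binary.items():
                    left, right = score[(Y, i, k)], score[(Z, k, j)]
                    if left and right:
                        for X, p in rules:           # X -> Y Z
                            cand = p * left * right
                            if cand > score[(X, i, j)]:
                                score[(X, i, j)] = cand
    return score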

Unary Rules
§ Unary rules?

bestScore(X, i, j, s)
  if (j = i+1)
    return tagScore(X, s[i])
  else
    return max of
      max score(X->YZ) * bestScore(Y, i, k) * bestScore(Z, k, j)
      max score(X->Y) * bestScore(Y, i, j)

CNF + Unary Closure

§ We need unaries to be non-cyclic
§ Can address by pre-calculating the unary closure
§ Rather than having zero or more unaries, always have exactly one

§ Alternate unary and binary layers
§ Reconstruct unary chains afterwards

[Figure: example tree fragments (NP over DT NN, VP over VBD NP, SBAR over S over VP) showing unary chains before and after closure]

Alternating Layers

bestScoreU(X, i, j, s)
  if (j = i+1)
    return tagScore(X, s[i])
  else
    return max max score(X->Y) *
                   bestScoreB(Y, i, j)

bestScoreB(X, i, j, s)
  return max max score(X->YZ) *
                 bestScoreU(Y, i, k) *
                 bestScoreU(Z, k, j)

Analysis

Memory
§ How much memory does this require?
§ Have to store the score cache
§ Cache size: |symbols| * n² doubles
§ For the plain treebank grammar:
§ X ~ 20K, n = 40, double ~ 8 bytes → ~256 MB (20,000 × 40² × 8 bytes ≈ 2.6 × 10⁸ bytes)
§ Big, but workable.

§ Pruning: Beams
§ score[X][i][j] can get too large (when?)
§ Can keep beams (truncated maps score[i][j]) which only store the best few scores for the span [i, j]

§ Pruning: Coarse-to-Fine
§ Use a smaller grammar to rule out most X[i,j]
§ Much more on this later…

Time: Theory
§ How much time will it take to parse?

§ For each diff (<= n)
§ For each i (<= n)
§ For each rule X → Y Z
§ For each split point k
§ Do constant work

§ Total time: |rules| * n³

§ Something like 5 sec for an unoptimized parse of a 20-word sentence

[Diagram: X over span [i, j] built from Y over [i, k] and Z over [k, j]]

Time: Practice

§ Parsing with the vanilla treebank grammar (~20K rules, not an optimized parser!): observed exponent 3.6

§ Why's it worse in practice?
§ Longer sentences "unlock" more of the grammar
§ All kinds of systems issues don't scale

Same-Span Reachability

[Figure: same-span reachability graph over the nonterminals TOP, ADJP, ADVP, FRAG, INTJ, NP, PP, PRN, QP, S, SBAR, UCP, VP, WHNP, LST, CONJP, WHADJP, WHADVP, WHPP, NX, NAC, SBARQ, SINV, RRC, SQ, X, PRT]

Rule State Reachability

§ Many states are more likely to match larger spans!

Example: NP CC •      — the NP spans [0, n-1] and the CC spans [n-1, n]: 1 alignment
Example: NP CC NP •   — the NP spans [0, n-k-1], the CC spans [n-k-1, n-k], the second NP spans [n-k, n]: n alignments

Efficient CKY

§ Lots of tricks to make CKY efficient
§ Some of them are little engineering details:
§ E.g., first choose k, then enumerate through the Y:[i,k] which are non-zero, then loop through rules by left child.
§ Optimal layout of the dynamic program depends on grammar, input, even system details.

§ Another kind is more important (and interesting):
§ Many X[i,j] can be suppressed on the basis of the input string
§ We'll see this next class as figures-of-merit, A* heuristics, coarse-to-fine, etc.

Agenda-Based Parsing

Agenda-Based Parsing
§ Agenda-based parsing is like graph search (but over a hypergraph)
§ Concepts:
§ Numbering: we number fenceposts between words
§ "Edges" or items: spans with labels, e.g. PP[3,5], represent the sets of trees over those words rooted at that label (cf. search states)
§ A chart: records edges we've expanded (cf. closed set)
§ An agenda: a queue which holds edges (cf. a fringe or open set)

0       1     2       3    4         5
critics write reviews with computers

[Figure: a PP edge over span [3, 5]]

Word Items
§ Building an item for the first time is called discovery. Items go into the agenda on discovery.
§ To initialize, we discover all word items (with score 1.0).

0       1     2       3    4         5
critics write reviews with computers

AGENDA: critics[0,1], write[1,2], reviews[2,3], with[3,4], computers[4,5]

CHART: [EMPTY]

Unary Projection
§ When we pop a word item, the lexicon tells us the tag item successors (and scores) which go on the agenda

0       1     2       3    4         5
critics write reviews with computers

critics[0,1]  write[1,2]  reviews[2,3]  with[3,4]  computers[4,5]
NNS[0,1]      VBP[1,2]    NNS[2,3]      IN[3,4]    NNS[4,5]

Item Successors
§ When we pop items off of the agenda:
§ Graph successors: unary projections (NNS → critics, NP → NNS)
    Y[i,j] with X → Y forms X[i,j]
§ Hypergraph successors: combine with items already in our chart
    Y[i,j] and Z[j,k] with X → Y Z form X[i,k]
§ Enqueue / promote resulting items (if not in chart already)
§ Record backtraces as appropriate
§ Stick the popped edge in the chart (closed set)

§ Queries a chart must support:
§ Is edge X[i,j] in the chart? (What score?)
§ What edges with label Y end at position j?
§ What edges with label Z start at position i?

An Example

0       1     2       3    4         5
critics write reviews with computers
NNS     VBP   NNS     IN   NNS

[Trace: edges discovered in order — NNS[0,1], VBP[1,2], NNS[2,3], IN[3,4], NNS[4,5], NP[0,1], NP[2,3], NP[4,5], VP[1,2], S[0,2], PP[3,5], VP[1,3], ROOT[0,2], S[0,3], VP[1,5], NP[2,5], ROOT[0,3], S[0,5], ROOT[0,5]]

Empty Elements
§ Sometimes we want to posit nodes in a parse tree that don't contain any pronounced words:

I want you to parse this sentence
I want [ ] to parse this sentence

§ These are easy to add to an agenda-based parser!
§ For each position i, add the "word" edge e[i,i]
§ Add rules like NP → e to the grammar
§ That's it!

0  1    2  3     4       5
I  like to parse empties
(empty edges e[i,i] at every fencepost 0 … 5)

UCS / A*

§ With weighted edges, order matters
§ Must expand optimal parse from bottom up (subparses first)
§ CKY does this by processing smaller spans before larger ones
§ UCS pops items off the agenda in order of decreasing Viterbi score
§ A* search also well defined

§ You can also speed up the search without sacrificing optimality
§ Can select which items to process first
§ Can do with any "figure of merit" [Charniak 98]
§ If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]

[Diagram: an X edge over span [i, j] within a sentence of length n]

(Speech) Lattices
§ There was nothing magical about words spanning exactly one position.
§ When working with speech, we generally don't know how many words there are, or where they break.
§ We can represent the possibilities as a lattice and parse these just as easily.

[Figure: word lattice with arcs for I, awe, of, van, eyes, saw, a, 've, an, Ivan]
