Upload
buicong
View
217
Download
0
Embed Size (px)
Citation preview
WordsandMorphology
11-711AlgorithmsforNLP15November2016– PartI
(SlidesmostlyborrowedfromLoriLevin)
Words
• Tokenization• Input:rawtext• Output:sequenceoftokensnormalizedforfurtherprocessing
• Recognition• Input:astringofcharacters• Output:yesorno
• MorphologicalParsing• Input:aword• Output:astructureoranalysisoftheword
• MorphologicalGeneration• Input:astructureoranalysisofaword• Output:aword
Whatisaword?• Thethingsthatareinthedictionary?
• Buthowdidthelexicographersdecidewhattoputinthedictionary?• Thesmallestunitthatcanbeutteredinisolation?
• Youcouldsaythiswordinisolation:Unimpressively• Thisonetoo: impress• Butyouprobablywouldn’tsaytheseinisolation,unlessyouweretalkingaboutmorphology:• un• ive• ly
• Thethingsbetweenspacesandpunctuation?
Sowhatisaword?
• I’mnotgoingtoanswerthatquestion:• didn’t• would’ve• gonna• Ima• finna• blackboard• thepersonwholeft’s hat• acct.• LTI
About1000pages.$139.99
Youdon’thavetoreadit.
Thepointisthatittakes1000pagesjusttosurveytheissuesrelatedtowhatwordsare.
Sowhatisaword?
• Itisuptoyouorthesoftwareyouuseforprocessingwords.• Takelinguisticsclasses.• Makegooddecisionsinsoftwaredesignandengineering.
TokenizationInput:rawtext
Dr. Smith said tokenization of English is “harder than you’ve thought.” When in New York, he paid $12.00 a day for lunch and wondered what it would be like to work for AT&T or Google, Inc.
OutputfromStanfordParser:http://nlp.stanford.edu:8080/parser/index.jspwithpart-of-speechtags:
Dr./NNP Smith/NNP said/VBD tokenization/NN of/IN English/NNP is/VBZ ``/`` harder/JJR than/IN you/PRP 've/VBP thought/VBN ./. ''/’’When/WRB in/IN New/NNP York/NNP ,/, he/PRP paid/VBD $/$ 12.00/CD a/DT day/NN for/IN lunch/NN and/CC wondered/VBD what/WP it/PRP would/MD be/VB like/JJ to/TO work/VB for/IN AT&T/NNP or/CC Google/NNP ,/, Inc./NNP ./.
WhatisLinguisticMorphology?
• Morphologyisthestudyoftheinternalstructureofwords.• Derivationalmorphology. Hownewwordsarecreatedfromexistingwords.
• [grace]• [[grace]ful]• [un[grace]ful]]
• Inflectionalmorphology. Howfeaturesrelevanttothesyntacticcontextofawordaremarkedonthatword.• Thisexampleillustratesnumber(singularandplural)andtense(presentandpast).• Greenindicatesirregular.Blueindicateszeromarkingofinflection.Redindicatesregularinflection.
• This student walks.• These studentswalk.• These students walked.
• Compounding. Creatingnewwordsbycombiningexistingwords• Withorwithoutspaces:surfboard,golfball,blackboard
Morphemes
• Avenerablewayoflookingatmorphology.• Morphemes.Minimalpairingsofformandmeaning.• Roots. The“core”ofawordthatcarriesitsbasicmeaning.
• apple :‘apple’• walk :‘walk’
• Affixes (prefixes,suffixes,infixes,andcircumfixes).Morphemesthatareaddedtoabase(arootorstem)toperformeitherderivationalorinflectionalfunctions.• un- :‘NEG’• -s :‘PLURAL’
IsolatingLanguages:Littlemorphologyotherthancompounding
• Chinese inflection• fewaffixes(prefixesandsuffixes):
• 们: 我们,你们, 他们,。。。同志们mén:wǒmén,nǐmén,tāmén, tóngzhìménplural:we,you(pl.),theycomrades,LGBTpeople
• “suffixes”thatmarkaspect:着 -zhě ‘continuousaspect’
• Chinesederivation•艺术家 yìshùjiā ‘artist’• Chineseisachampionintherealmofcompounding—upto80%ofChinesewordsareactuallycompounds.毒 + 贩 → 毒贩
dú fàn dúfàn
‘poison,drug’ ‘vendor’ ‘drug trafficker’
Fusional Languages:ANewWorldSpanish
Singular Plural
1st 2nd 3rdformal 2nd
1st 2nd 3rd
Present am-o am-as am-a am-a-mos am-áis am-an
Imperfect am-ab-a am-ab-as am-ab-a am-áb-a-mos am-ab-ais am-ab-an
Preterit am-é am-aste am-ó am-a-mos am-asteis am-aron
Future am-aré am-arás am-ará am-are-mos am-aréis am-arán
Conditional am-aría am-arías am-aría am-aría-mos am-aríais am-arían
AgglutinativeLanguages:SwahiliVerbsinSwahilihaveanaverageof4-5morphemes,http://wals.info/valuesets/22A-swa
Swahili English
m-tu a-li-lala ‘Thepersonslept’
m-tu a-ta-lala ‘Thepersonwillsleep’
wa-tu wa-li-lala ‘Thepeopleslept’
wa-tu wa-ta-lala ‘Thepeople willsleep’
• Wordswrittenwithouthyphensorspacesbetweenmorphemes.• Orangeprefixesmarknounclass(likegender,exceptSwahili hasnineinsteadoftwoor
three).• Verbsagreewithnounsinnounclass.• Adjectivesalsoagreewithnouns.• Veryhelpfulinparsing.
• Blackprefixesindicatetense.
TurkishExampleofextremeagglutinationButmostTurkishwordshavearoundthreemorphemes
uygarlaştıramadıklarımızdanmışsınızcasına“(behaving)asifyouareamongthosewhomwewerenotabletocivilize”
uygar “civilized”+laş “become”+tır “causeto”+ama “notable”+dık pastparticiple+larplural+ımız firstpersonpluralpossessive(“our”)+dan ablativecase(“from/among”)+mış past+sınız secondpersonplural(“y’all”)+casına finiteverb→adverb(“asif”)
Operationalization
• operate(opus/opera+ate)• ion• al• ize• ate• ion
PolysyntheticLanguages
• Polysyntheticmorphologiesallowthecreationoffull“sentences”bymorphologicalmeans.• Theyoftenallowtheincorporationofnounsintoverbs.• Theymayalsohaveaffixesthatattachtoverbsandtaketheplaceofnouns.• YupikEskimountu-ssur-qatar-ni-ksaite-ngqiggte-uqreindeer-hunt-FUT-say-NEG-again-3SG.INDIC‘Hehadnotyetsaidagainthathewasgoingtohuntreindeer.’
Root-and-PatternMorphology
• Root-and-pattern.A specialkindoffusional morphologyfoundinArabic,Hebrew,andtheircousins.• Rootusuallyconsistsofasequenceofconsonants.• Wordsarederivedand,tosomeextent,inflectedbypatternsofvowelsintercalatedamongtherootconsonants.• kitaab ‘book’• kaatib ‘writer;writing’• maktab ‘office;desk’• maktaba ‘library’
OtherNon-Concatenative MorphologicalProcesses• Non-concatenativemorphology involvesoperationsotherthantheconcatenationofaffixeswithbases.• Infixation.Amorphemeisinsertedinsideanothermorphemeinsteadofbeforeorafterit.
• Reduplication.Canbeprefixing,suffixing,andeveninfixing.
• Tagalog:• sulat (write,imperative)• susulat (reduplication)(write,future)• sumulat (infixing)(write,past)• sumusulat (infixingandreduplication)(write,present)
• Internalchange (tonechange;stressshift;apophony,suchasumlautandablaut).• Root-and-pattern morphology.• Andmore...
Canyoumakealistofallthewordsinalanguage?ProductivityIntheOxfordEnglishDictionary(OED)(www.oed.com,accessibleforfreefromCMUmachines)
• drinkable• visitable
NotintheOED• mous(e)able• stapl(e)able
InNLP,youneedtobeabletoprocesswordsthatarenotinthedictionary.
Butcouldyoumakealistofallpossiblewords,takingproductivityintoaccount?
Canyoumakealistofallthewordsinalanguage?
Atrie representingalistofwords(lexicon)
Type-TokenCurvesFinnishisagglutinativeIñupiaq ispolysynthetic
0
1000
2000
3000
4000
5000
6000
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
Type
s
Tokens
Type-TokenCurves
English
Arabic
Hocąk
Inupiaq
Finnish
TypesandTokens:“Iliketowalk.Iamwalkingnow.Itookalongwalkearliertoo.”
Thetypewalk occurstwice.Sotherearetwotokensofthetypewalk.
Walking isadifferenttypethatoccursonce.
Recognizingthewordsofalanguage
• Input:astring(fromsomealphabet)• Output:yesorno
FSAforEnglishNouns
Lexicon:
Note:“fox”becomespluralbyadding“es”not“s”.Wewillgettothatlater.
Finite-StateAutomaton
• Q:afinitesetofstates• q0∈ Q:aspecialstartstate• F⊆ Q:asetoffinalstates• Σ:afinitealphabet• Transitions:
• Encodesaset ofstringsthatcanberecognizedbyfollowingpathsfromq0 tosomestateinF.
qiqjs∈ Σ*
......
FSAforEnglishAdjectives
Butnotethatthisacceptswordslike“unbig”.
Big,bigger,biggestHappy,happier,happiest,happilyUnhappy,unhappier,unhappiest,unhappilyClear,clearer,clearest,clearlyUnclear,unclearly
Cool,cooler,coolest,coollyRed,redder,reddestReal,unreal,really
FSAforEnglishDerivationalMorphology
Howbigdotheseautomataget?Reasonablecoverageofalanguagetakesanexpertabouttwotofourmonths.
Whatdoesittaketobeanexpert?Studylinguisticstogetusedtoallthecommonandnot-so-commonthingsthathappen,andthenpractice.
MorphologicalParsing
Input:awordOutput:theword’sstem(s)andfeaturesexpressedbyothermorphemes.
Example: geese→goose+N+Plgooses→goose+V+3P+Sgdog→{dog+N+Sg,dog+V}leaves→{leaf+N+Pl,leave+V+3P+Sg}
UpperSide/LowerSide
talk+Past
talked
FST
uppersideorunderlyingform
lowersideorsurfaceform
FiniteStateTransducers
• Q:afinitesetofstates• q0∈ Q:aspecialstartstate• F⊆ Q:asetoffinalstates• ΣandΔ:twofinitealphabets• Transitions:
qiqj
s :ts∈ Σ*andt∈ Δ*
......
MorphologicalParsingwithFSTs
Note“samesymbol”shorthand.
^denotesamorphemeboundary.
#denotesawordboundary.
EnglishSpellingGettingbacktofox+s =foxes
TheEInsertionRuleasaFST
✏ ! e/
8<
:
s
x
z
9=
; ^ s#
Generateanormallyspelledwordfromanabstractrepresentationofthemorphemes:
Input:fox^s#(fox^εs#)Output:foxes#(foxεes#)
TheEInsertionRuleasaFST
✏ ! e/
8<
:
s
x
z
9=
; ^ s#
Parseanormallyspelledwordintoanabstractrepresentationofthemorphemes:
Input:foxes#(foxεes#)Output:fox^s#(fox^εs#)
CombiningFSTsparse
generate
FSTOperations
Input:fox+N+plOutput:foxes#
Stemming(“PoorMan’sMorphology”)
Input:awordOutput:theword’sstem(approximately)
ExamplesfromthePorterstemmer:•-sses→-ss•-ies→i•-ss→s
nonoahnob
nobilitynobisnoble
noblemannoblemennobleness
noblernobles
noblessenoblestnobly
nobodynocesnod
noddednoddingnoddlenoddlesnoddynods
nonoahnobnobilnobinoblnoblemannoblemennoblnoblernoblnoblessnoblestnoblinobodinocenodnodnodnoddlnoddlnoddinod
Conclusion
• Finitestatemethodsprovideasimpleandpowerfulmeansofgeneratingandanalyzingwords(aswellasthephonologicalalternationsthataccompanywordformation/inflection).• Straightforwardconcatenative morphologyiseasytoimplementusingfinitestatemethods.• Otherphenomenaareeasiesttocapturewithextensionstothefinitestateparadigm.• Co-occurrencerestrictions—flagdiacritics.• Non-concatenative morphology—compile-replace algorithm.Purefinitestate,butcomputedinanovelfashion.
Conclusion
• Finitestatemethodsprovideasimpleandpowerfulmeansofgeneratingandanalyzingwords(aswellasthephonologicalalternationsthataccompanywordformation/inflection).• Straightforwardconcatenative morphologyiseasytoimplementusingfinitestatemethods.• Otherphenomenaareeasiesttocapturewithextensionstothefinitestateparadigm.• Co-occurrencerestrictions—flagdiacritics.• Non-concatenative morphology—compile-replace algorithm.Purefinitestate,butcomputedinanovelfashion.
LanguageComparisonwrt FSTs
• Morphologiesofalltypescanbeanalyzedusingfinitestatemethods.• Somepresentmorechallengesthanothers.• Analyticlanguages.Trivial,sincethereislittleornomorphology(otherthancompounding).• Agglutinatinglanguages.Straightforward—finitestatemorphologywas“made”forlanguageslikethis.• Polysyntheticlanguages.Similartoagglutinatinglanguages,butwithblurredlinesbetweenmorphologyandsyntax.• Fusional languages. Easyenoughtoanalyzeusingfinitestatemethodaslongasoneallows“morphemes”tohavelotsofsimultaneousmeaningsandoneiswillingtoemploysomeadditionaltricks.• Root-and-patternlanguages. Requiresomeveryclevertricks.
TheGoodNews
• Morethanalmostanyotherproblemincomputationallinguistics,morphologyisasolvedproblem(aslongasyoucanaffordtowriterulesbyhand).• Finitestatemorphologyisoneofthegreatsuccessesofnaturallanguageprocessing.• OnebrilliantaspectofusingFSTsformorphology:thesamecode canhandlebothanalysis andgeneration.
Tools
• Therearespecialfinitestatetoolkitsforbuildingmorphologicaltools(andotherlinguistictools).• Thebest-knownoftheseistheXeroxFiniteStateTool orXFST,whichoriginatedatXeroxPARC.• Thereareopensourcereimplementations ofXFSTcalledHFST(HelsinkiFiniteStateTechnology)andFoma,whicharenotasfullyoptimizedasXFSTbutwhicharesometimesmorepleasanttouse.• NoneofthesetoolsallowtheconstructionofweightedFSTs.
MapudunguncomparedtoSpanishMapudungunispolysyntheticSpanishisfusional
0
20
40
60
80
100
120
140
0 500 1,000 1,500
Typ
es, i
n T
hous
ands
Tokens, in Thousands
Mapudungun Spanish
Telugu,Tamil,Kannada,MalayalamDravidianlanguages
• AgglutinatinglikeTurkish,Finnish,andSwahili
Hindi,Urdu,Bengali,Marathi,Punjabi,etc.Indo-european• AlittlericherthanEnglish• LikeEnglish,usesauxiliaryverbsandseparatewordstoexpressthingsthatareaffixesontheverbsinDravidianlanguages.• want,have,be,make,etc.
MoreThanMereConcatenation
• Reduplication. Repeatingallorpartofawordasamorphologicaloperation.• Infixation. Insertinganaffix intoabase.• Root-and-patternmorphology. Interdigitation inSemiticanditsrelatives.• Others.Apophony,includingtheumlautinEnglishtooth→teeth;subtractivemorphology,includingthetruncation inEnglishnicknameformation(David→Dave);andsoon.