POS Tagging for CS Data
Fahad AlGhamdi, Mona Diab, Abdelati Hawari
The George Washington University
Giovanni Molina, Thamar Solorio
University of Houston
Victor Soto, Julia Hirschberg
Columbia University
EMNLP 2016
Outline
• Introduction
o Motivation.
o Main Contribution.
• Approach
• Evaluation
• Discussion
• Conclusion
Introduction
• Code Switching: Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages or dialects.
• Example:
o Arabic intra-sentential CS: wlkn AjhztnA AljnAgyp lAnhA m$ xyAl Elmy lm tjd wlw mElwmp wAHdp.
o English translation: Since our crime investigation departments are not dealing with science fiction, they did not find a single piece of information.
Introduction: Motivation
• Addressing the problem of part-of-speech (POS) tagging for CS data at the intra-sentential level.
• Focusing on two language pairs: Spanish-English (SPA-ENG), and Modern Standard Arabic and the Egyptian Arabic dialect (MSA-EGY).
• Using the same POS tagset for both language pairs: the Universal POS tagset (Petrov et al., 2011).
Introduction: Our Contribution
• Exploring different strategies to leverage monolingual resources for POS tagging of CS data.
• Presenting the first empirical evaluation of POS tagging on two different language pairs.
• All previous work focused on a single language pair.
Outline
• Introduction
✓ Motivation.
✓ Main Contribution.
• Approach
o Monolingual POS Tagging Systems.
o Combined Experimental Conditions.
o Integrated Experimental Conditions.
• Evaluation
• Discussion
• Conclusion
Approach
• We adopt a supervised framework for our experimental setup.
• We compare two frameworks: a COMBINED framework, which leverages monolingual state-of-the-art POS taggers using different strategies, and an INTEGRATED framework, which uses a single POS tagger trained on CS data.
• We explore different strategies to investigate the optimal way of tackling POS tagging of CS data.
Approach: Monolingual POS Tagging Systems
MSA-EGY Language Pair:
• We used the publicly available MADAMIRA tool.
• It is a fast, comprehensive tool for morphological analysis and disambiguation of Arabic.
• MADAMIRA-MSA is trained on newswire data (Penn Arabic Treebanks 1, 2, 3).
• MADAMIRA-EGY is trained on Egyptian blog data comprising a mix of MSA, EGY, and CS (MSA-EGY) data from the LDC Egyptian Treebank parts 1-5 (ARZ1-5) (Maamouri et al., 2012).
Approach: Monolingual POS Tagging Systems
MSA-EGY Language Pair:
• We need a relatively pure monolingual tagger per language variety (MSA or EGY), trained on informal genres for both MSA and EGY.
• We retrained a new version of MADAMIRA-MSA strictly on the pure MSA sentences identified in the EGY Treebank ARZ1-5.
• We created a MADAMIRA-EGY tagger trained specifically on the pure EGY sentences extracted from the same ARZ1-5 Treebank.
Approach: Monolingual POS Tagging Systems
SPA-ENG Language Pair:
• We created models using the TreeTagger monolingual systems for Spanish and English, respectively.
• The data used to train TreeTagger for English was the Penn Treebank data (Marcus et al., 1993), sections 0-22.
• For the Spanish model, we used AnCora-ES.
Approach: Combination Experimental Conditions
COMB1: LID-MonoLT: Language identification followed by monolingual tagging.
1. Input sentence.
2. Language identification (token level).
3. The Lang1 chunk goes to the POS tagger for lang1; the Lang2 chunk goes to the POS tagger for lang2.
4. Each tagger outputs the POS tags for its chunk.
5. Compare the generated POS tags with the gold tags.
6. Results.
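The COMB1 flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_lid`, `lexicon_tagger`, and the two tiny lexicons are hypothetical stand-ins for the real language identifier and monolingual taggers, which are not reproduced here.

```python
from itertools import groupby
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def comb1_lid_then_tag(
    tokens: List[str],
    identify_language: Callable[[str], str],
    taggers: Dict[str, Tagger],
) -> List[Tuple[str, str]]:
    """Split the sentence into same-language chunks, then tag each chunk."""
    langs = [identify_language(tok) for tok in tokens]
    tagged: List[Tuple[str, str]] = []
    start = 0
    # groupby merges consecutive tokens with the same language label into one chunk.
    for lang, run in groupby(langs):
        end = start + len(list(run))
        chunk = tokens[start:end]
        # The monolingual tagger only ever sees its own chunk, not the full sentence.
        tagged.extend(zip(chunk, taggers[lang](chunk)))
        start = end
    return tagged


def lexicon_tagger(lexicon: Dict[str, str]) -> Tagger:
    """Hypothetical toy tagger: dictionary lookup, "X" for unknown words."""
    return lambda chunk: [lexicon.get(tok.lower(), "X") for tok in chunk]


SPANISH_WORDS = {"la", "casa"}  # hypothetical word list for the toy identifier


def toy_lid(token: str) -> str:
    return "lang2" if token.lower() in SPANISH_WORDS else "lang1"


taggers = {
    "lang1": lexicon_tagger({"i": "PRON", "like": "VERB"}),
    "lang2": lexicon_tagger({"la": "DET", "casa": "NOUN"}),
}
example = comb1_lid_then_tag(["I", "like", "la", "casa"], toy_lid, taggers)
```

Because each tagger receives only its chunk, an early language-identification mistake routes a token to the wrong tagger, which is the failure mode this condition is exposed to.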
Approach: Combination Experimental Conditions
COMB1: LID-MonoLT:
• For MSA-EGY: We used the Automatic Identification of Dialectal Arabic (AIDA2) tool to perform token-level language identification for the EGY and MSA tokens in context.
• For SPA-ENG: We trained 6-gram character language models using the SRILM Toolkit.
o The English language model was trained on the AFP section of the English Gigaword.
o The Spanish language model was trained on the AFP section of the Spanish Gigaword.
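To make the idea of character language models for token-level language identification concrete, here is a minimal pure-Python sketch. It is not SRILM and does not reproduce its smoothing: it assumes add-one smoothing, uses trigrams instead of 6-grams because the toy training lists are tiny, and the class name `CharNgramLM` and the word lists are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


class CharNgramLM:
    """Character n-gram model with add-one smoothing (a simplifying assumption)."""

    def __init__(self, n: int, words: Iterable[str]) -> None:
        self.n = n
        self.ngrams: Counter = Counter()
        self.contexts: Counter = Counter()
        alphabet = set()
        for word in words:
            padded = self._pad(word)
            alphabet.update(padded)
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1
        self.vocab_size = len(alphabet)

    def _pad(self, word: str) -> str:
        # "^" marks the word start and "$" the word end.
        return "^" * (self.n - 1) + word.lower() + "$"

    def log_prob(self, token: str) -> float:
        padded = self._pad(token)
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            numerator = self.ngrams[gram] + 1
            denominator = self.contexts[gram[:-1]] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def identify_language(token: str, models: Dict[str, CharNgramLM]) -> str:
    """Pick the language whose model assigns the token the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(token))


# Toy training data (hypothetical); the slides use the AFP sections of Gigaword.
models = {
    "ENG": CharNgramLM(3, "the house is big and the dog is small".split()),
    "SPA": CharNgramLM(3, "la casa es grande y el perro es pequeño".split()),
}
```

With these toy models, `identify_language("casa", models)` selects `"SPA"` and `identify_language("house", models)` selects `"ENG"`; real models trained on large corpora would of course generalize to unseen words far better.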
Approach: Combination Experimental Conditions
COMB1: LID-MonoLT: Language identification followed by monolingual tagging.
Input sentence: بتقولالليالحكمھافتكرودایماارجوكم : حوالكمنالعالمیتغیرفعدھانفسكغیرحولكمنالعالمتغیرانقبل
English translation: Please always remember the wisdom that says: before trying to change the world, change yourself first. Then the world will change.
[Figure: the input sentence passes through token-level language identification; the lang1 chunk goes to the POS tagger for lang1 and the lang2 chunk to the POS tagger for lang2; the generated POS tags are compared with the gold POS tags to produce the results.]
Approach: Combination Experimental Conditions
COMB2: MonoLT-LID: Monolingual tagging then Language ID.
1. Input sentence.
2. The whole sentence is tagged by the POS tagger for lang1 and, separately, by the POS tagger for lang2.
3. Language identification (token level).
4. Extract the POS tags for Lang1 words from the Lang1 POS tagger and the POS tags for Lang2 words from the Lang2 POS tagger.
5. Compare the generated POS tags with the gold tags.
6. Results.
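A minimal sketch of the COMB2 flow, for contrast with COMB1: both taggers see the whole sentence, and language identification is only used afterwards to choose between their outputs. The identifier and taggers below (`toy_lid`, `lexicon_tagger`) are hypothetical toy stand-ins, not the tools used in the study.

```python
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def comb2_tag_then_lid(
    tokens: List[str],
    identify_language: Callable[[str], str],
    taggers: Dict[str, Tagger],
) -> List[Tuple[str, str]]:
    """Tag the whole sentence with every tagger, then pick one tag per token by language."""
    # Each tagger receives the full sentence, so it keeps the sentence context.
    full_outputs = {lang: tagger(tokens) for lang, tagger in taggers.items()}
    result: List[Tuple[str, str]] = []
    for position, token in enumerate(tokens):
        lang = identify_language(token)
        result.append((token, full_outputs[lang][position]))
    return result


def lexicon_tagger(lexicon: Dict[str, str]) -> Tagger:
    """Hypothetical toy tagger: dictionary lookup, "X" for unknown words."""
    return lambda sentence: [lexicon.get(tok.lower(), "X") for tok in sentence]


SPANISH_WORDS = {"la", "casa"}  # hypothetical word list for the toy identifier


def toy_lid(token: str) -> str:
    return "lang2" if token.lower() in SPANISH_WORDS else "lang1"


taggers = {
    "lang1": lexicon_tagger({"i": "PRON", "like": "VERB"}),
    "lang2": lexicon_tagger({"la": "DET", "casa": "NOUN"}),
}
example = comb2_tag_then_lid(["I", "like", "la", "casa"], toy_lid, taggers)
```

The trade-off relative to COMB1 is that each tagger sees a complete sentence, but part of that sentence is in a language it was not trained on.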
Approach: Combination Experimental Conditions
COMB2: MonoLT-LID: Monolingual tagging then Language ID.
Input sentence: بتقولالليالحكمھافتكرودایماارجوكم : حوالكمنالعالمیتغیرفعدھانفسكغیرحولكمنالعالمتغیرانقبل
English translation: Please always remember the wisdom that says: before trying to change the world, change yourself first. Then the world will change.
1. The whole sentence is tagged by MADAMIRA-MSA and, separately, by MADAMIRA-EGY.
2. AIDA language identification (token level).
3. Extract the POS tags for Egyptian words from MADAMIRA-EGY and the POS tags for MSA words from MADAMIRA-MSA.
4. Compare the MADAMIRA POS tags with the ATB tags.
5. Results.
Approach: Combination Experimental Conditions
• COMB3: MonoLT-Conf:
§ Apply the separate taggers.
§ Then use the probability/confidence scores yielded by each tagger to choose which tagger to trust more per token.
• COMB4: MonoLT-SVM:
§ Combine the results from the monolingual taggers (baselines) and COMB3 in an ML framework such as an SVM to decide which tag to choose (MSA vs. EGY, for example, or SPA vs. ENG).
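The per-token selection in COMB3 can be sketched as follows. The input format (one `(tag, confidence)` pair per token from each tagger) and the tie-breaking rule in favour of the first tagger are assumptions made for the illustration; the slides do not specify either.

```python
from typing import List, Tuple

Scored = List[Tuple[str, float]]  # one (tag, confidence) pair per token


def comb3_pick_by_confidence(first: Scored, second: Scored) -> List[str]:
    """For each token, keep the tag of whichever tagger reports higher confidence."""
    if len(first) != len(second):
        raise ValueError("both taggers must score the same number of tokens")
    chosen = []
    for (tag_a, conf_a), (tag_b, conf_b) in zip(first, second):
        # Ties go to the first tagger (an arbitrary choice for this sketch).
        chosen.append(tag_a if conf_a >= conf_b else tag_b)
    return chosen


# Hypothetical scores for a three-token sentence.
tagger_one = [("NOUN", 0.90), ("VERB", 0.40), ("DET", 0.55)]
tagger_two = [("VERB", 0.30), ("NOUN", 0.80), ("PRON", 0.55)]
picked = comb3_pick_by_confidence(tagger_one, tagger_two)
```

COMB4 differs in that the choice is learned: the two taggers' outputs and the COMB3 decision become features for a classifier such as an SVM, rather than being compared by a fixed rule.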
Approach: Integrated Experimental Conditions
• INT1: CSD: Train a supervised ML framework on exclusively code-switched data.
o For MSA-EGY: Train a MADAMIRA model exclusively with the CS data.
o For SPA-ENG: Train a CS model using TreeTagger.
• INT2: AllMonoData: Similar to condition INT1: CSD, but changing the training data for each of the language pairs.
o For MSA-EGY: Merging the training data from MSA and EGY.
o For SPA-ENG: Merging the Spanish and English corpora, creating an integrated SPA-ENG model.
Approach: Integrated Experimental Conditions
• INT3: AllMonoData+CSD: Merging the training data from conditions "INT1: CSD" and "INT2: AllMonoData" to train new taggers for CS POS tagging.
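The three integrated conditions differ only in which training data the single tagger sees. A small sketch of how the training sets relate, assuming each corpus is simply a list of tagged sentences (the function name and the toy corpora are hypothetical):

```python
from typing import Dict, List, Sequence, Tuple

TaggedSentence = List[Tuple[str, str]]  # (token, POS tag) pairs


def build_int_training_sets(
    mono_lang1: Sequence[TaggedSentence],
    mono_lang2: Sequence[TaggedSentence],
    code_switched: Sequence[TaggedSentence],
) -> Dict[str, List[TaggedSentence]]:
    """Assemble the training data for the three integrated conditions."""
    all_mono = list(mono_lang1) + list(mono_lang2)
    return {
        "INT1: CSD": list(code_switched),                       # CS data only
        "INT2: AllMonoData": all_mono,                          # both monolingual corpora
        "INT3: AllMonoData+CSD": all_mono + list(code_switched),  # everything merged
    }


# Hypothetical one-sentence corpora, just to show the sizes.
eng = [[("I", "PRON"), ("run", "VERB")]]
spa = [[("la", "DET"), ("casa", "NOUN")]]
cs = [[("I", "PRON"), ("like", "VERB"), ("la", "DET"), ("casa", "NOUN")]]
training_sets = build_int_training_sets(eng, spa, cs)
```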
Outline
• Introduction
✓ Motivation.
✓ Main Contribution.
• Approach
✓ Monolingual POS Tagging Systems.
✓ Combined Experimental Conditions.
✓ Integrated Experimental Conditions.
• Evaluation
o Datasets
o POS Tag Sets
o Results
• Discussion
• Conclusion
Evaluation: Datasets
• MSA-EGY:
o We use the LDC Egyptian Arabic Treebanks 1-5 (ARZ1-5).
o The ARZ1-5 data is from the discussion forums genre, mostly in the Egyptian Arabic dialect (EGY).
• SPA-ENG: Two datasets were used for the SPA-ENG language pair:
o The transcribed conversation used in the work by Solorio and Liu (Solorio and Liu, 2008), referred to as Spanglish.
o The Bangor Miami corpus, referred to as Bangor. This corpus is conversational speech involving a total of 84 speakers living in Miami, FL.
[Figure: three pie charts showing the code-switched vs. monolingual percentage for MSA-EGY, Spanglish, and Bangor.]
Evaluation: Datasets
Table-1: Data set details.
Dataset     # Sentences   # Words    # Types   CS %
ARZ         13,698        175,361    39,168    38.78%
Spanglish   922           8,022      1,455     20.61%
Bangor      45,605        335,578    13,994    6.21%

Table-1: Dataset distribution.
Dataset     Train/Dev Tokens   Test Tokens
ARZ         154,897            20,464
Spanglish   6,456              1,566
Bangor      268,464            67,114
Evaluation: Part-of-Speech Tagset
MSA-EGY:
• The ARZ1-5 dataset is manually annotated using the Buckwalter (BW) POS tagset.
• The BW POS tagset is considered one of the most popular Arabic POS tagsets.
SPA-ENG:
• The Bangor Miami corpus has been automatically glossed and tagged with part-of-speech tags in the following manner:
o Each word is automatically glossed using the Bangor Autoglosser.
Evaluation: Part-of-Speech Tagset
Table: Mapping table for the BW POS tagset and the Universal POS tagset
BW POS tagset: Mapping to Universal POS tagset
1- Personal, relative, demonstrative, interrogative, and indefinite pronouns: Mapped to Pronoun.
2- Acronyms: Mapped to Proper Nouns.
3- Complementizers and adverbial clause introducers: Mapped to Subordinating Conjunction.
4- Main verbs (content verbs), copulas, participles, and some verb forms such as gerunds and infinitives: Mapped to Verb.
5- Prepositions and postpositions: Mapped to Adpositions.
6- Interrogative, relative, and demonstrative adverbs: Mapped to Adverb.
7- Tense, passive, and modal auxiliaries: Mapped to Auxiliary Verb.
8- Possessive determiners, demonstrative determiners, interrogative determiners, quantity/quantifier determiners, etc.: Mapped to Determiner.
9- Noun and gerunds and infinitives: Mapped to Noun.
10- Negation particle, question particle, sentence modality, and indeclinable aspectual or tense particles: Mapped to Particle.
Evaluation: Part-of-Speech Tagset
SPA-ENG:
• The Bangor corpus went through two editing/annotation stages:
1. Stage one includes:
a) Tokens that were tagged ambiguously with more than one POS tag were disambiguated (e.g. that.CONJ.[or].DET).
b) Ambiguous POS categories like ASV, AV, and SV were disambiguated into either ADJ, NOUN, or VERB.
c) For frequent tokens like "so" and "that", their POS tags were hand-corrected.
d) Mistranscribed terms which were originally labeled as Unknown were hand-corrected and given a correct POS tag.
Evaluation: Part-of-Speech Tagset
SPA-ENG:
2. Stage two includes:
• Mapping the Bangor corpus original POS tagset to the Universal POS tagset.
Evaluation: Part-of-Speech Tagset
Table: Mapping table for the Bangor POS tagset and the Universal POS tagset
Bangor POS tagset: Mapping to Universal POS tagset
1- Exclamations and Intonational Markers: Mapped to Interjections.
2- Possessive Adjectives, Possessive Determiners, Interrogative Adjectives, Demonstrative Adjectives, and Quantifying Adjectives: Mapped to Determiner.
3- Relatives, Interrogatives, and Demonstratives (with no specification as to whether they were Determiners, Adjectives, or Pronouns): Manually labeled.
4- Possessive markers, negation particles, and infinitive "to" tokens: Mapped to Particle (PRT).
5- Conjunctions: Mapped to Coordinating Conjunctions and Subordinating Conjunctions using word lists.
6- A subset of English Verbs: Mapped to Auxiliary Verbs (could, should, might, may, will, shall, etc.).
7- Categories with an obvious match (like Nouns, Adjectives, Verbs, Pronouns, Determiners, Proper Nouns, Numbers, etc.): Automatically mapped to the appropriate category.
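A few of the mapping rules in the table lend themselves to a rule-based function. The sketch below is illustrative only: the source tag labels (`"V"`, `"CONJ"`, `"IM"`, `"N"`, and so on), the target abbreviations, and the coordinating-conjunction word list are assumptions, since the slides describe categories rather than exact tag strings. Only the six auxiliaries are taken from the table, and its "etc." is not expanded.

```python
# Auxiliaries named in the table; the table's "etc." is deliberately left unexpanded.
AUXILIARIES = {"could", "should", "might", "may", "will", "shall"}

# Hypothetical word list for coordinating conjunctions (English and Spanish).
COORDINATING = {"and", "or", "but", "y", "o", "pero"}

# Hypothetical source labels for the categories with an obvious match.
DIRECT_MAP = {
    "N": "NOUN",
    "ADJ": "ADJ",
    "V": "VERB",
    "PRON": "PRON",
    "DET": "DET",
    "NUM": "NUM",
}


def map_to_universal(word: str, source_tag: str) -> str:
    """Map one (word, source tag) pair to a universal-style tag."""
    lowered = word.lower()
    if source_tag == "V" and lowered in AUXILIARIES:
        return "AUX"  # rule 6: a subset of English verbs become auxiliaries
    if source_tag == "CONJ":
        # rule 5: word lists split coordinating from subordinating conjunctions
        return "CONJ" if lowered in COORDINATING else "SCONJ"
    if source_tag == "IM":
        return "INTJ"  # rule 1: exclamations and intonational markers
    # rule 7: obvious matches; anything unknown falls back to "X"
    return DIRECT_MAP.get(source_tag, "X")
```

For instance, `map_to_universal("should", "V")` gives `"AUX"` while `map_to_universal("run", "V")` gives `"VERB"`. Categories that the table marks as manually labeled (rule 3) cannot be handled by a function like this.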
Evaluation: Results
To evaluate the performance of our approaches:
• We compare the output POS tags generated by each condition against the available gold POS tags for each dataset.
• We compare the accuracy of our approaches for each language pair to that of its corresponding monolingual tagger baseline.
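The comparison against gold tags amounts to token-level accuracy. A minimal sketch, assuming the predicted and gold tag sequences are already aligned token by token (the function name is hypothetical):

```python
from typing import Sequence


def tagging_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of tokens whose predicted POS tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold sequences must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Toy example: three of four tags match the gold annotation.
accuracy = tagging_accuracy(
    ["NOUN", "VERB", "DET", "NOUN"],
    ["NOUN", "VERB", "PRON", "NOUN"],
)
```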
Evaluation: Results
Table: POS tagging accuracy for monolingual baseline taggers
MSA-EGY Baseline
Dataset     MADAMIRA-MSA     MADAMIRA-EGY
ARZ         77.23%           72.22%

SPA-ENG Baseline
Dataset     TreeTagger SPA   TreeTagger ENG
Spanglish   44.61%           75.87%
Bangor      45.95%           64.05%
Evaluation: Results
Table: Accuracy Results for ARZ Test Dataset
Approach                 Overall   CS Posts   MSA Posts   EGY Posts
COMB1: LID-MonoLT        77.66     78.03      76.79       78.57
COMB2: MonoLT-LID        77.41     77.41      78.31       77.01
COMB3: MonoLT-Conf       76.66     77.89      76.79       76.11
COMB4: MonoLT-SVM        90.56     90.85      91.63       88.91
INT1: CSD                83.89     82.03      82.48       83.26
INT2: AllMonoData        87.86     87.92      86.82       86
INT3: AllMonoData+CSD    89.36     88.12      85.12       87
Baseline: MSA: 77.23%, EGY: 72.22%
Evaluation: Results
Table: Accuracy Results for Bangor Dataset
Approach                 Overall   CS Posts   ENG Posts   SPA Posts
COMB1: LID-MonoLT        68.35     71.11      66.36       76.02
COMB2: MonoLT-LID        65.51     69.66      64.44       71.32
COMB3: MonoLT-Conf       68.25     68.21      71.93       65.03
COMB4: MonoLT-SVM        96.31     95.39      96.37       96.60
INT1: CSD                95.28     94.41      94.41       95.15
INT2: AllMonoData        78.57     78.62      81.85       76.53
INT3: AllMonoData+CSD    91.04     89.59      92.00       89.48
Baseline: SPA: 45.95%, ENG: 64.05%
Evaluation: Results
Table: Accuracy Results for Spanglish Dataset
Approach                 Overall   CS Posts   ENG Posts   SPA Posts
COMB1: LID-MonoLT        78.73     77.81      80.18       73.99
COMB2: MonoLT-LID        73.52     73.80      73.60       71.57
COMB3: MonoLT-Conf       77.39     76.11      80.20       65.43
COMB4: MonoLT-SVM        90.61     89.43      93.61       87.96
INT1: CSD                82.95     83.03      85.95       77.26
INT2: AllMonoData        84.55     84.84      88.50       76.59
INT3: AllMonoData+CSD    85.06     84.70      90.15       76.59
Baseline: SPA: 44.61%, ENG: 75.87%
Outline
• Introduction
✓ Motivation.
✓ Main Contribution.
• Approach
✓ Monolingual POS Tagging Systems.
✓ Combined Experimental Conditions.
✓ Integrated Experimental Conditions.
• Evaluation
✓ Datasets
✓ POS Tag Sets
✓ Results
• Discussion
o Combined Conditions.
o Integrated Conditions.
• Conclusion
Discussion: Combined conditions
For MSA-EGY:
• All the combined experimental conditions outperform the baselines.
• COMB1: LID-MonoLT yields worse results than COMB2: MonoLT-LID.
o This is expected, since the taggers expect well-formed sentences as input.
• The worst results are for condition MonoLT-Conf.
For SPA-ENG: Spanglish dataset:
• Almost all the accuracies achieved by the combined conditions are higher than the Spanglish dataset's baselines.
• "COMB2: MonoLT-LID" is the only combined condition that falls below a Spanglish baseline (73.52% vs. the 75.87% ENG baseline).
Discussion: Combined conditions
For SPA-ENG: Spanglish dataset:
• Mistakes in the automated language identification cause the wrong tagger to be chosen.
For SPA-ENG: Bangor dataset:
• All the accuracies achieved by the combined conditions are higher than the Bangor dataset's baselines.
Discussion: Combined conditions
• The trends are almost the same between the two language pairs.
• Both language pairs achieve the highest performance with MonoLT-SVM and worse results with MonoLT-Conf.
• The weakness of the MonoLT-Conf approach comes from the fact that if the monolingual taggers are weak, their confidence scores are equally unreliable.
• The results are switched between conditions LID-MonoLT (condition 1) and MonoLT-LID (condition 2) for the two language pairs.
Discussion: Integrated conditions
• In general, except for the "COMB4: MonoLT-SVM" condition, all the INT conditions outperformed the COMB conditions.
For MSA-EGY:
• Adding more data helps: INT2: AllMonoData outperforms INT1: CSD, and when the two conditions are combined as training data, INT3: AllMonoData+CSD outperforms the other INT conditions.
For SPA-ENG:
• The worst INT condition is INT2: AllMonoData for Bangor (accuracy 78.57%) and INT1: CSD for Spanglish (accuracy 82.95%).
Discussion: Integrated conditions
For SPA-ENG:
• The largest gap in performance for Bangor could be due to a higher domain mismatch with the monolingual data used to train the tagger.
• A notable difference between the two language pairs is the significant jump in performance for the Bangor corpus from the first three COMB conditions to COMB4 (from 68.35% to 96.31%).
• We observe a similar jump for the Spanglish corpus, but the gap is much larger for the Bangor corpus.
Discussion: Integrated conditions
• There are some similar trends between the two combinations.
• MSA and EGY share a significant number of homographs, some of which are cognates but many of which are not.
• The homograph overlap is quite limited in SPA-ENG.
• Adding the CSD to the monolingual corpora in the INT3: AllMonoData+CSD condition for MSA-EGY improves performance (1.5% absolute increase in accuracy).
• The results are not consistent across the SPA-ENG data sets.
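The 1.5% figure and the inconsistency across the SPA-ENG data sets can be checked directly from the Overall column of the results tables. The snippet below only redoes that arithmetic; the dictionary layout and function names are illustrative.

```python
# Overall accuracies of the INT conditions, copied from the results tables.
OVERALL = {
    "ARZ": {"INT1: CSD": 83.89, "INT2: AllMonoData": 87.86, "INT3: AllMonoData+CSD": 89.36},
    "Bangor": {"INT1: CSD": 95.28, "INT2: AllMonoData": 78.57, "INT3: AllMonoData+CSD": 91.04},
    "Spanglish": {"INT1: CSD": 82.95, "INT2: AllMonoData": 84.55, "INT3: AllMonoData+CSD": 85.06},
}


def gain_from_adding_csd(dataset: str) -> float:
    """Absolute accuracy gain of INT3 over INT2, in percentage points."""
    scores = OVERALL[dataset]
    return round(scores["INT3: AllMonoData+CSD"] - scores["INT2: AllMonoData"], 2)


def best_int_condition(dataset: str) -> str:
    """Name of the INT condition with the highest overall accuracy."""
    scores = OVERALL[dataset]
    return max(scores, key=scores.get)


gains = {name: gain_from_adding_csd(name) for name in OVERALL}
best = {name: best_int_condition(name) for name in OVERALL}
```

On these numbers the ARZ gain is 1.5 points, and the best INT condition is INT3 for ARZ and Spanglish but INT1 for Bangor, which is the inconsistency noted for SPA-ENG.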
Outline
• Introduction
✓ Motivation.
✓ Main Contribution.
• Approach
✓ Monolingual POS Tagging Systems.
✓ Combined Experimental Conditions.
✓ Integrated Experimental Conditions.
• Evaluation
✓ Datasets
✓ POS Tag Sets
✓ Results
• Discussion
✓ Combined Conditions.
✓ Integrated Conditions.
• Conclusion
o Summary
o Future Work
Conclusion: Summary
• We presented a detailed study of various strategies for POS tagging of CS data in two language pairs.
• The results indicate that, depending on the language pair, there are varying degrees of need for annotated code-switched data in the training phase of the process.
• Languages that share a significant number of homographs when code switched benefit from more code-switched data at training time (e.g., MSA-EGY).
• Languages that are farther apart, such as Spanish and English, when code switched, benefit more from having larger monolingual data mixed.
Conclusion: Future Work
• All COMB conditions use either out-of-context or in-context chunks as input for the monolingual taggers.
• Our plan for future work is to process the out-of-context chunks to provide a meaningful context to the monolingual taggers.
• Extend the feature set used in the COMB4: MonoLT-SVM condition to include Brown clustering, Word2Vec, and deep-learning-based features.
Outline
• Introduction
✓ Motivation.
✓ Main Contribution.
• Approach
✓ Monolingual POS Tagging Systems.
✓ Combined Experimental Conditions.
✓ Integrated Experimental Conditions.
• Evaluation
✓ Datasets
✓ POS Tag Sets
✓ Results
• Discussion
✓ Combined Conditions.
✓ Integrated Conditions.
• Conclusion
✓ Summary
✓ Future Work
Thanks!!