POS Tagging for CS Data - RiTUAL (ritual.uh.edu/wp-content/uploads/2015/06/alghamdi-EtAl.pdf)

POS Tagging for CS Data

Fahad AlGhamdi, Mona Diab, Abdelati Hawari
The George Washington University

Giovanni Molina, Thamar Solorio
University of Houston

Victor Soto, Julia Hirschberg
Columbia University

EMNLP 2016

Outline

• Introduction
  o Motivation
  o Main Contribution
• Approach
• Evaluation
• Discussion
• Conclusion

Introduction

• Code Switching: Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages or dialects.

• Example:
  • Arabic intra-sentential CS: wlkn AjhztnA AljnAgyp lAnhA m$ xyAl Elmy lm tjd wlw mElwmp wAHdp.
  • English translation: Since our crime investigation departments are not dealing with science fiction, they did not find a single piece of information.

Introduction: Motivation

• Addressing the problem of part-of-speech (POS) tagging for CS data at the intra-sentential level.

• Focusing on two language pairs: Spanish-English (SPA-ENG), and Modern Standard Arabic and the Egyptian Arabic dialect (MSA-EGY).

• Using the same POS tagset for both language pairs, the Universal POS tagset (Petrov et al., 2011).

Introduction: Our Contribution

• Exploring different strategies to leverage monolingual resources for POS tagging CS data.

• Presenting the first empirical evaluation of POS tagging with two different language pairs.

• All of the previous work focused on a single language pair combination.


• Introduction
  ✓ Motivation
  ✓ Main Contribution
• Approach
  o Monolingual POS Tagging Systems
  o Combined Experimental Conditions
  o Integrated Experimental Conditions
• Evaluation
• Discussion
• Conclusion

Approach

• We adopt a supervised framework for our experimental setup.

• We compare two frameworks: a COMBINED framework, which leverages monolingual state-of-the-art POS taggers using different strategies, and an INTEGRATED framework, which uses a single POS tagger trained on CS data.

• We explore different strategies to investigate the optimal way of tackling POS tagging of CS data.

Approach: Monolingual POS Tagging Systems

MSA-EGY Language Pair:
• We used the publicly available MADAMIRA tool.
• It is a fast, comprehensive tool for morphological analysis and disambiguation of Arabic.
• MADAMIRA MSA is trained on newswire data (Penn Arabic Treebanks 1, 2, 3).
• MADAMIRA EGY is trained on Egyptian blog data, which comprises a mix of MSA, EGY and CS data (MSA-EGY), from the LDC Egyptian Treebank parts 1-5 (ARZ1-5) (Maamouri et al., 2012).


MSA-EGY Language Pair:
• We need a relatively pure monolingual tagger per language variety (MSA or EGY), trained on informal genres for both MSA and EGY.
• We retrained a new version of MADAMIRA MSA strictly on pure MSA sentences identified in the EGY Treebank ARZ1-5.
• We created a MADAMIRA-EGY tagger trained specifically on the pure EGY sentences extracted from the same ARZ1-5 Treebank.


SPA-ENG Language Pair:
• We created models using the TreeTagger monolingual systems for Spanish and English, respectively.
• The data used to train TreeTagger for English was the Penn Treebank data (Marcus et al., 1993), sections 0-22.
• For the Spanish model, we used Ancora-ES.

Approach: Combination Experimental Conditions

COMB1: LID-MonoLT: Language identification followed by monolingual tagging.

[Figure: COMB1 pipeline. Input sentence → language ID identification (token level) → Lang1 chunk and Lang2 chunk → POS tagger for lang1 and POS tagger for lang2 → POS tags → compare the generated POS tags with gold tags → results.]

Approach: Combination Experimental Conditions

COMB1: LID-MonoLT:
• For MSA-EGY: We used the Automatic Identification of Dialectal Arabic (AIDA2) tool to perform token-level language identification for the EGY and MSA tokens in context.
• For SPA-ENG: We trained 6-gram character language models using the SRILM Toolkit.
  • The English language model was trained on the AFP section of the English GigaWord.
  • The Spanish language model was trained on the AFP section of the Spanish GigaWord.
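The routing step of COMB1 can be sketched as follows. This is only a toy illustration under stated assumptions, not the setup used in the slides: the slides rely on SRILM 6-gram character models trained on GigaWord (and AIDA2 for Arabic), whereas this sketch uses a small add-one-smoothed character trigram model trained on a handful of made-up words. The names `CharNgramLM`, `identify_language`, and `split_into_chunks` are hypothetical.

```python
import math
from collections import Counter
from itertools import groupby


class CharNgramLM:
    """Add-one-smoothed character n-gram model (toy stand-in for an SRILM model)."""

    def __init__(self, words, n=3):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for word in words:
            padded = "^" * (n - 1) + word.lower() + "$"
            self.vocab.update(padded)
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def log_prob(self, word):
        """Sum of smoothed log-probabilities of the word's character n-grams."""
        padded = "^" * (self.n - 1) + word.lower() + "$"
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += math.log((self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v))
        return total


def identify_language(token, models):
    """Return the language whose character model scores the token highest."""
    return max(models, key=lambda lang: models[lang].log_prob(token))


def split_into_chunks(tokens, models):
    """Group consecutive same-language tokens into (lang, [tokens]) chunks."""
    labeled = [(identify_language(t, models), t) for t in tokens]
    return [(lang, [t for _, t in grp])
            for lang, grp in groupby(labeled, key=lambda pair: pair[0])]


# Toy training vocabularies (assumed for illustration only).
models = {
    "ENG": CharNgramLM(["the", "world", "change", "always", "remember", "yourself", "first"]),
    "SPA": CharNgramLM(["el", "mundo", "cambiar", "siempre", "recuerda", "primero", "quiero"]),
}
chunks = split_into_chunks(["quiero", "cambiar", "the", "world"], models)
print(chunks)
```

Each resulting chunk would then be passed to the monolingual tagger for its language, which is why the taggers in this condition see chunks that may be out of context.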



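[Figure: COMB1 pipeline applied to the example sentence below. The sentence is split into a lang1 chunk and a lang2 chunk, each chunk is tagged, and the generated POS tags are compared with the gold POS tags to produce the results.]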

InputSentenceبتقولالليالحكمھافتكرودایماارجوكم : حوالكمنالعالمیتغیرفعدھانفسكغیرحولكمنالعالمتغیرانقبل

English translation: Please always remember the wisdom that says: before trying to change the world, change yourself first. Then the world will change.

COMB1: LID-MonoLT: Language identification followed by monolingual tagging.






COMB2: MonoLT-LID: Monolingual tagging then Language ID.

[Figure: COMB2 pipeline. Input sentence → the whole sentence is passed to the POS tagger for lang1 and to the POS tagger for lang2 → POS tags → extract POS tags for Lang1 words from the Lang1 POS tagger and POS tags for Lang2 words from the Lang2 POS tagger → compare the generated POS tags with gold tags → results.]
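The selection step of COMB2 can be sketched as below. This is a minimal sketch under assumptions: the two monolingual taggers are replaced by toy dictionary lookups, the token-level language IDs are given by hand rather than produced by a language identifier, and `tag_with` and `comb2_mono_lt_lid` are hypothetical names.

```python
def tag_with(lexicon, tokens, default="NOUN"):
    """Toy monolingual tagger: dictionary lookup with a fallback tag."""
    return [lexicon.get(tok.lower(), default) for tok in tokens]


def comb2_mono_lt_lid(tokens, lang_ids, taggers):
    """Tag the whole sentence with every tagger, then keep, per token,
    the tag produced by the tagger of that token's identified language."""
    full_outputs = {lang: tagger(tokens) for lang, tagger in taggers.items()}
    return [full_outputs[lang][i] for i, lang in enumerate(lang_ids)]


# Toy lexicons (assumed for illustration only).
ENG_LEXICON = {"the": "DET", "world": "NOUN", "i": "PRON", "want": "VERB"}
SPA_LEXICON = {"quiero": "VERB", "cambiar": "VERB", "el": "DET", "mundo": "NOUN"}

taggers = {
    "ENG": lambda toks: tag_with(ENG_LEXICON, toks),
    "SPA": lambda toks: tag_with(SPA_LEXICON, toks),
}
tokens = ["quiero", "cambiar", "the", "world"]
lang_ids = ["SPA", "SPA", "ENG", "ENG"]  # assumed output of token-level language ID
tags = comb2_mono_lt_lid(tokens, lang_ids, taggers)
print(list(zip(tokens, tags)))
```

Unlike COMB1, both taggers here see the full sentence, so each tagger keeps its sentence context even for the tokens whose tags are later discarded.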


[Figure: COMB2 pipeline for MSA-EGY. The whole input sentence is tagged by MADAMIRA MSA and by MADAMIRA EGY → AIDA language ID identification (token level) → extract POS tags for Egyptian words from MADAMIRA EGY and POS tags for MSA words from MADAMIRA MSA → compare the MADAMIRA POS tags with the ATB tags → results.]

English translation of the example input sentence: Please always remember the wisdom that says: before trying to change the world, change yourself first. Then the world will change.

COMB2: MonoLT-LID: Monolingual tagging then Language ID.





• COMB3: MonoLT-Conf:
  § Apply separate taggers.
  § Then use the probability/confidence scores yielded by each tagger to choose which tagger to trust more per token.

• COMB4: MonoLT-SVM:
  § Combine the results from the monolingual taggers (baselines) and COMB3 in an ML framework such as an SVM to decide which tag to choose (MSA vs. EGY, for example, or SPA vs. ENG).
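A minimal sketch of the COMB3 decision rule, plus one possible way to assemble inputs for COMB4, is shown below. The slides do not specify the feature set or the tie-breaking rule, so the features in `comb4_features` and the "ties go to the first tagger" choice are assumptions, and both function names are hypothetical. No classifier is trained here; the sketch only builds the feature dictionaries that a learner such as an SVM could consume.

```python
def comb3_mono_lt_conf(outputs_a, outputs_b):
    """Per token, keep the tag from whichever tagger reports higher confidence.

    Each argument is a list of (tag, confidence) pairs, one per token.
    Ties go to the first tagger (an arbitrary choice for this sketch).
    """
    return [a[0] if a[1] >= b[1] else b[0] for a, b in zip(outputs_a, outputs_b)]


def comb4_features(token, out_a, out_b, comb3_tag):
    """Assumed feature dictionary for a stacking classifier (for example an SVM)."""
    return {
        "token": token.lower(),
        "tag_a": out_a[0],
        "conf_a": out_a[1],
        "tag_b": out_b[0],
        "conf_b": out_b[1],
        "comb3_tag": comb3_tag,
        "taggers_agree": out_a[0] == out_b[0],
    }


# Made-up tagger outputs for three tokens: (tag, confidence).
tokens = ["quiero", "the", "world"]
spa_out = [("VERB", 0.92), ("NOUN", 0.31), ("NOUN", 0.40)]
eng_out = [("NOUN", 0.25), ("DET", 0.97), ("NOUN", 0.88)]

comb3_tags = comb3_mono_lt_conf(spa_out, eng_out)
features = [
    comb4_features(tok, a, b, c)
    for tok, a, b, c in zip(tokens, spa_out, eng_out, comb3_tags)
]
print(comb3_tags)
```

The COMB3 rule trusts raw confidence scores directly, while the COMB4 setup lets a trained classifier learn when each tagger's output is reliable.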

Approach: Integrated Experimental Conditions

• INT1: CSD: Train a supervised ML framework on exclusively code-switched data.
  o For MSA-EGY: Train a MADAMIRA model exclusively with the CS data.
  o For SPA-ENG: Train a CS model using TreeTagger.

• INT2: AllMonoData: Similar to condition INT1: CSD, but changing the training data for each of the language pairs.
  o For MSA-EGY: merging the training data from MSA and EGY.
  o For SPA-ENG: merging the Spanish and English corpora, creating an integrated SPA-ENG model.

Approach: Integrated Experimental Conditions

• INT3: AllMonoData+CSD: Merging the training data from conditions "INT1: CSD" and "INT2: AllMonoData" to train new taggers for CS POS tagging.
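The three integrated conditions differ only in which training data is handed to a single tagger. A minimal sketch of that data assembly follows, assuming each corpus is a list of sentences and each sentence is a list of (token, tag) pairs; `build_training_sets` is a hypothetical helper and the toy sentences are invented for illustration.

```python
def build_training_sets(mono_lang1, mono_lang2, code_switched):
    """Return the training sentences used by each integrated condition."""
    all_mono = list(mono_lang1) + list(mono_lang2)
    return {
        "INT1:CSD": list(code_switched),
        "INT2:AllMonoData": all_mono,
        "INT3:AllMonoData+CSD": all_mono + list(code_switched),
    }


# Toy corpora: one sentence each, as lists of (token, tag) pairs.
spa = [[("el", "DET"), ("mundo", "NOUN")]]
eng = [[("the", "DET"), ("world", "NOUN")]]
cs = [[("quiero", "VERB"), ("the", "DET"), ("world", "NOUN")]]

training_sets = build_training_sets(spa, eng, cs)
sizes = {name: len(sentences) for name, sentences in training_sets.items()}
print(sizes)
```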



• Approach
  ✓ Monolingual POS Tagging Systems
  ✓ Combined Experimental Conditions
  ✓ Integrated Experimental Conditions
• Evaluation
  o Datasets
  o POS Tag Sets
  o Results
• Discussion
• Conclusion

Evaluation: Datasets

• MSA-EGY:
  • We use the LDC Egyptian Arabic Treebanks 1-5 (ARZ1-5).
  • The ARZ1-5 data is from the discussion forums genre, mostly in the Egyptian Arabic dialect (EGY).

• SPA-ENG: Two datasets were used for the SPA-ENG language pair:
  • The transcribed conversation used in the work by Solorio and Liu (Solorio and Liu, 2008), referred to as Spanglish.
  • The Bangor Miami corpus, referred to as Bangor. This corpus is conversational speech involving a total of 84 speakers living in Miami, FL.

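[Figure: three pie charts of the code-switched vs. monolingual share of each dataset. Chart titles: "Code-Switched % for MSA-EGY" (slices labeled 40% and 60%), "Code-Switch % for Spanglish" (20.61% code-switched, 79.39% monolingual), and "Code-Switch % for Bangor" (slices labeled 60% and 40%).]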

Evaluation: Datasets

Dataset      #Sentences   #Words    #Types   CS%
ARZ          13,698       175,361   39,168   38.78%
Spanglish    922          8,022     1,455    20.61%
Bangor       45,605       335,578   13,994   6.21%

Table-1: Data set details.

Dataset      Train/Dev Tokens   Test Tokens
ARZ          154,897            20,464
Spanglish    6,456              1,566
Bangor       268,464            67,114

Table-1: Data set distribution.

Evaluation: Part-of-Speech Tagset

MSA-EGY:
• The ARZ1-5 dataset is manually annotated using the Buckwalter (BW) POS tagset.
• The BW POS tagset is considered one of the most popular Arabic POS tagsets.

SPA-ENG:
• The Bangor Miami corpus has been automatically glossed and tagged with part-of-speech tags in the following manner:
  • Each word is automatically glossed using the Bangor Autoglosser.

Evaluation: Part-of-Speech Tagset

BW POS tagset | Mapping to Universal POS tag set
1- Personal, relative, demonstrative, interrogative, and indefinite pronouns. | Mapped to Pronoun.
2- Acronyms. | Mapped to Proper Nouns.
3- Complementizers and adverbial clause introducers. | Mapped to Subordinating Conjunction.
4- Main verbs (content verbs), copulas, participles, and some verb forms such as gerunds and infinitives. | Mapped to Verb.
5- Prepositions and postpositions. | Mapped to Adpositions.
6- Interrogative, relative and demonstrative adverbs. | Mapped to Adverb.
7- Tense, passive and modal auxiliaries. | Mapped to Auxiliary Verb.
8- Possessive determiners, demonstrative determiners, interrogative determiners, quantity/quantifier determiners, etc. | Mapped to Determiner.
9- Noun and gerunds and infinitives. | Mapped to Noun.
10- Negation particle, question particle, sentence modality, and indeclinable aspectual or tense particles. | Mapped to Particle.

Table: Mapping table for BW POS tagset and Universal POS tag set

SPA-ENG:

• The Bangor corpus went through two edition/annotation stages:
  1. Stage one includes:
     a) Tokens that were tagged ambiguously with more than one POS tag were disambiguated (e.g. that.CONJ.[or].DET).
     b) Ambiguous POS categories like ASV, AV and SV were disambiguated into either ADJ, NOUN, or VERB.
     c) For frequent tokens like "so" and "that", their POS tags were hand-corrected.
     d) Mistranscribed terms which were originally labeled as Unknown were hand-corrected and given a correct POS tag.
  2. Stage two includes:
     • Mapping the Bangor corpus original POS tagset to the Universal POS tagset.

Evaluation: Part-of-Speech Tagset

Bangor POS tagset | Mapping to Universal POS tag set
1- Exclamations and Intonational Markers. | Mapped to Interjections.
2- Possessive Adjectives, Possessive Determiners, Interrogative Adjectives, Demonstrative Adjectives and Quantifying Adjectives. | Mapped to Determiner.
3- Relatives, Interrogatives and Demonstratives (with no specification as to whether they were Determiners, Adjectives or Pronouns). | Manually labeled.
4- Possessive markers, negation particles, and infinitive "to" tokens. | PRT.
5- Conjunctions. | Mapped to Coordinating Conjunctions and Subordinating Conjunctions using word lists.
6- A subset of English Verbs. | Mapped to Auxiliary Verbs (could, should, might, may, will, shall, etc.).
7- Categories with an obvious match (like Nouns, Adjectives, Verbs, Pronouns, Determiners, Proper Nouns, Numbers, etc.). | Automatically mapped to the appropriate category.

Table: Mapping table for Bangor POS tagset and Universal POS tag set
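The word-list based rows of the mapping table can be sketched as a small lookup function. Everything beyond the auxiliary list is an assumption: the source tag labels (`VERB`, `CONJ`, `REL`, and so on) are hypothetical placeholders rather than the actual Bangor Autoglosser tags, the conjunction word lists are invented examples, and the target tag names follow a common Universal-style naming that the slides do not spell out.

```python
# Auxiliary list taken from the table; the other word lists are assumed examples.
AUX_WORDS = {"could", "should", "might", "may", "will", "shall"}
COORDINATORS = {"and", "or", "but"}
SUBORDINATORS = {"because", "although", "if", "while"}

# Hypothetical source labels for categories with an obvious match.
DIRECT_MAP = {
    "NOUN": "NOUN", "ADJ": "ADJ", "VERB": "VERB", "PRON": "PRON",
    "DET": "DET", "PROPN": "PROPN", "NUM": "NUM", "EXCL": "INTJ",
}
MANUAL_TAGS = {"REL", "INT", "DEM"}  # underspecified categories, labeled by hand


def map_to_universal(word, source_tag):
    """Return a Universal-style tag, or None when manual labeling is needed."""
    word = word.lower()
    if source_tag == "VERB" and word in AUX_WORDS:
        return "AUX"
    if source_tag == "CONJ":
        if word in COORDINATORS:
            return "CCONJ"
        if word in SUBORDINATORS:
            return "SCONJ"
        return None  # not covered by either word list
    if source_tag in MANUAL_TAGS:
        return None
    return DIRECT_MAP.get(source_tag)


examples = [("should", "VERB"), ("run", "VERB"), ("because", "CONJ"), ("that", "REL")]
mapped = [map_to_universal(word, tag) for word, tag in examples]
print(mapped)
```

Returning `None` for the underspecified categories mirrors the "manually labeled" row: the function flags those tokens instead of guessing.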

Evaluation: Results

To evaluate the performance of our approaches:

• We compare the output POS tags generated from each condition against the available gold POS tags for each dataset.

• We compare the accuracy of our approaches for each language pair to its corresponding monolingual tagger baseline.
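The comparison against gold tags amounts to token-level accuracy. A minimal sketch is given below, assuming predicted and gold tags are aligned lists of equal length; `tagging_accuracy` is a hypothetical helper, and the example tags are invented.

```python
def tagging_accuracy(predicted, gold):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


gold = ["VERB", "VERB", "DET", "NOUN"]
predicted = ["VERB", "NOUN", "DET", "NOUN"]  # one tagging error out of four tokens
accuracy = tagging_accuracy(predicted, gold)
print(f"{accuracy:.2f}%")
```

The same function can be applied to a baseline tagger's output and to each condition's output, so the two percentages are directly comparable.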

Evaluation: Results

Table: POS tagging accuracy for monolingual baseline taggers

MSA-EGY Baseline

Dataset      MADAMIRA-MSA   MADAMIRA-EGY
ARZ          77.23%         72.22%

SPA-ENG Baseline

Dataset      TreeTagger SPA   TreeTagger ENG
Spanglish    44.61%           75.87%
Bangor       45.95%           64.05%

Evaluation: Results

Table: Accuracy Results for ARZ Test Dataset

Approach                 Overall   CS Posts   MSA Posts   EGY Posts
COMB1: LID-MonoLT        77.66     78.03      76.79       78.57
COMB2: MonoLT-LID        77.41     77.41      78.31       77.01
COMB3: MonoLT-Conf       76.66     77.89      76.79       76.11
COMB4: MonoLT-SVM        90.56     90.85      91.63       88.91
INT1: CSD                83.89     82.03      82.48       83.26
INT2: AllMonoData        87.86     87.92      86.82       86
INT3: AllMonoData+CSD    89.36     88.12      85.12       87

Baseline: MSA: 77.23%, EGY: 72.22%

Evaluation: Results

Table: Accuracy Results for Bangor Dataset

Approach                 Overall   CS Posts   ENG Posts   SPA Posts
COMB1: LID-MonoLT        68.35     71.11      66.36       76.02
COMB2: MonoLT-LID        65.51     69.66      64.44       71.32
COMB3: MonoLT-Conf       68.25     68.21      71.93       65.03
COMB4: MonoLT-SVM        96.31     95.39      96.37       96.60
INT1: CSD                95.28     94.41      94.41       95.15
INT2: AllMonoData        78.57     78.62      81.85       76.53
INT3: AllMonoData+CSD    91.04     89.59      92.00       89.48

Baseline: SPA: 45.95%, ENG: 64.05%

Evaluation: Results

Table: Accuracy Results for Spanglish Dataset

Approach                 Overall   CS Posts   ENG Posts   SPA Posts
COMB1: LID-MonoLT        78.73     77.81      80.18       73.99
COMB2: MonoLT-LID        73.52     73.80      73.60       71.57
COMB3: MonoLT-Conf       77.39     76.11      80.20       65.43
COMB4: MonoLT-SVM        90.61     89.43      93.61       87.96
INT1: CSD                82.95     83.03      85.95       77.26
INT2: AllMonoData        84.55     84.84      88.50       76.59
INT3: AllMonoData+CSD    85.06     84.70      90.15       76.59

Baseline: SPA: 44.61%, ENG: 75.87%



• Approach
  ✓ Monolingual POS Tagging Systems
  ✓ Combined Experimental Conditions
  ✓ Integrated Experimental Conditions
• Evaluation
  ✓ Datasets
  ✓ POS Tag Sets
  ✓ Results
• Discussion
  o Combined Conditions
  o Integrated Conditions
• Conclusion

Discussion: Combined conditions

For MSA-EGY:
• All the combined experimental conditions outperform the baselines.
• COMB1: LID-MonoLT yields worse results than COMB2: MonoLT-LID.
• This is expected, because the taggers expect well-formed sentences as input.
• The worst results are for condition MonoLT-Conf.

For SPA-ENG: Spanglish dataset:
• Almost all the accuracies achieved by the combined conditions are higher than the Spanglish dataset's baselines.
• "COMB2: MonoLT-LID" is the only combined condition that is lower than the Spanglish dataset's baselines (73.52% vs. 75.87%).
• Cause: mistakes in the automated language identification that cause the wrong tagger to be chosen.

For SPA-ENG: Bangor dataset:
• All the accuracies achieved by the combined conditions are higher than the Bangor dataset's baselines.

• The trends are almost the same between the two language pairs.
• Both language pairs achieve the highest performance with MonoLT-SVM and worse results with MonoLT-Conf.
• The weaknesses of the MonoLT-Conf approach come from the fact that if the monolingual taggers are weak, their confidence scores are equally unreliable.
• The results are switched between conditions LID-MonoLT (condition 1) and MonoLT-LID (condition 2) for the two language pairs.


Discussion: Integrated conditions

• In general, except for the "COMB4: MonoLT-SVM" condition, all the INT conditions outperformed the COMB conditions.

For MSA-EGY:
• Adding more data helps: INT2: AllMonoData outperforms INT1: CSD, and when the two are combined as training data, INT3: AllMonoData+CSD outperforms the other INT conditions.

For SPA-ENG:
• The worst INT condition is INT2: AllMonoData for Bangor (accuracy 78.57%) and INT1: CSD for Spanglish (accuracy 82.95%).
• The largest gap in performance for Bangor could be due to a higher domain mismatch with the monolingual data used to train the tagger.
• A notable difference between the two language pairs is the significant jump in performance for the Bangor corpus from the first three COMB conditions to COMB4: MonoLT-SVM (from 68.35% to 96.31%).
• We observe a similar jump for the Spanglish corpus, but the gap is much larger for the Bangor corpus.

• There are some similar trends between the two combinations.
• MSA and EGY share a significant number of homographs, some of which are cognates but many of which are not.
• The homograph overlap is quite limited in SPA-ENG.
• Adding the CSD to the monolingual corpora in the INT3: AllMonoData+CSD condition for MSA-EGY improves performance (1.5% absolute increase in accuracy).
• The results are not consistent across the SPA-ENG data sets.
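The gaps quoted in this discussion can be recomputed from the overall accuracies in the result tables. The sketch below copies only the numbers needed; the dictionary layout and the `gain` helper are assumptions made for illustration.

```python
# Overall accuracies copied from the three result tables.
OVERALL = {
    "ARZ": {"COMB1": 77.66, "COMB4": 90.56, "INT2": 87.86, "INT3": 89.36},
    "Bangor": {"COMB1": 68.35, "COMB4": 96.31, "INT2": 78.57, "INT3": 91.04},
    "Spanglish": {"COMB1": 78.73, "COMB4": 90.61, "INT2": 84.55, "INT3": 85.06},
}


def gain(dataset, from_condition, to_condition):
    """Absolute accuracy difference (percentage points) between two conditions."""
    scores = OVERALL[dataset]
    return round(scores[to_condition] - scores[from_condition], 2)


# Effect of adding CS data to the monolingual data (INT2 -> INT3).
int3_gain = {name: gain(name, "INT2", "INT3") for name in OVERALL}
# Jump from COMB1 to COMB4.
comb4_jump = {name: gain(name, "COMB1", "COMB4") for name in OVERALL}
print(int3_gain)
print(comb4_jump)
```

The ARZ gain from INT2 to INT3 comes out at 1.5 points, matching the figure quoted above, and the COMB1 to COMB4 jump is larger for Bangor than for Spanglish.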




• Approach
  ✓ Monolingual POS Tagging Systems
  ✓ Combined Experimental Conditions
  ✓ Integrated Experimental Conditions
• Evaluation
  ✓ Datasets
  ✓ POS Tag Sets
  ✓ Results
• Discussion
  ✓ Combined Conditions
  ✓ Integrated Conditions
• Conclusion
  o Summary
  o Future Work

Conclusion: Summary

• Presenting a detailed study of various strategies for POS tagging of CS data in two language pairs.

• The results indicate that, depending on the language pair, there are varying degrees of need for annotated code-switched data in the training phase of the process.

• Languages that share a significant amount of homographs when code switched will benefit from more code-switched data at training time (e.g., MSA-EGY).

• Languages that are farther apart, such as Spanish and English, when code switched, benefit more from having larger monolingual data mixed

Conclusion: Future Work

• All COMB conditions use either out-of-context or in-context chunks as input for the monolingual taggers.

• Our plan for future work is to process the out-of-context chunks so that they provide a meaningful context to the monolingual taggers.

• Extend the feature set used in the COMB4: MonoLT-SVM condition to include Brown Clustering, Word2Vec, and deep-learning-based features.




Thanks!!
