POS Tagging for CS Data

Fahad AlGhamdi, Mona Diab, Abdelati Hawari
The George Washington University
Giovanni Molina, Thamar Solorio
University of Houston
Victor Soto, Julia Hirschberg
Columbia University

EMNLP 2016


Outline

• Introduction
  o Motivation.
  o Main Contribution.
• Approach
• Evaluation
• Discussion
• Conclusion

Introduction

• Code Switching: Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages or dialects.

• Example:
  • Arabic Intra-sentential CS: wlkn AjhztnA AljnAgyp lAnhA m$ xyAl Elmy lm tjd wlw mElwmp wAHdp.
  • English Translation: Since our crime investigation departments are not dealing with science fiction, they did not find a single piece of information.

Introduction: Motivation

• Addressing the problem of Part-of-Speech (POS) tagging for CS data on the intra-sentential level.

• Focusing on two language pairs: Spanish-English (SPA-ENG), and Modern Standard Arabic and the Egyptian Arabic dialect (MSA-EGY).

• Using the same POS tagset for both language pairs: the Universal POS tagset (Petrov et al., 2011).

Introduction: Our Contribution

• Exploring different strategies to leverage monolingual resources for POS tagging CS data.

• Presenting the first empirical evaluation on POS tagging with two different language pairs.

• All of the previous work focused on a single language pair combination.


Approach

• We adopt a supervised framework for our experimental setup.

• We compare leveraging monolingual state-of-the-art POS taggers using different strategies, in what we call a COMBINED framework, against using a single CS-trained POS tagger, identified as an INTEGRATED framework.

• We explore different strategies to investigate the optimal way of tackling POS tagging of CS data.

Approach: Monolingual POS Tagging Systems

MSA-EGY Language Pair:

• We used the publicly available MADAMIRA tool.

• It is a fast, comprehensive tool for morphological analysis and disambiguation of Arabic.

• MADAMIRA MSA is trained on newswire data (Penn Arabic Treebanks 1, 2, 3).

• MADAMIRA EGY is trained on Egyptian blog data, which comprises a mix of MSA, EGY and CS data (MSA-EGY) from the LDC Egyptian Treebank parts 1-5 (ARZ1-5) (Maamouri et al., 2012).

Approach: Monolingual POS Tagging Systems

MSA-EGY Language Pair:

• We need a relatively pure monolingual tagger per language variety (MSA or EGY), trained on informal genres for both MSA and EGY.

• We retrained a new version of MADAMIRA MSA strictly on pure MSA sentences identified in the EGY Treebank ARZ1-5.

• We created a MADAMIRA-EGY tagger trained specifically on the pure EGY sentences extracted from the same ARZ1-5 Treebank.

Approach: Monolingual POS Tagging Systems

SPA-ENG Language Pair:

• We created models using the TreeTagger monolingual systems for Spanish and English respectively.

• The data used to train TreeTagger for English was the Penn Treebank data (Marcus et al., 1993), sections 0-22.

• For the Spanish model, we used Ancora-ES.

Approach: Combination Experimental Conditions

COMB1: LID-MonoLT: Language identification followed by monolingual tagging.

[Figure: COMB1 pipeline. Input Sentence -> Language ID identification (token level) -> Lang1 chunk and Lang2 chunk -> POS tagger for lang1 and POS tagger for lang2 -> POS tags -> compare the generated POS tags with gold tags -> Results.]
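The COMB1 flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_lid`, `lexicon_tagger`, and the tiny lexicons are hypothetical stand-ins for a real language identifier and real monolingual taggers, and the tag labels are placeholders rather than the tagset used in the experiments.

```python
from itertools import groupby
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[str]]
LanguageId = Callable[[List[str]], List[str]]


def comb1_lid_then_tag(
    tokens: List[str],
    identify_language: LanguageId,
    taggers: Dict[str, Tagger],
) -> List[Tuple[str, str]]:
    """COMB1 sketch: token-level language ID first, then per-chunk tagging."""
    lang_ids = identify_language(tokens)
    tagged: List[Tuple[str, str]] = []
    # Contiguous tokens with the same language label form one chunk.
    for lang, group in groupby(zip(tokens, lang_ids), key=lambda pair: pair[1]):
        chunk = [token for token, _ in group]
        tags = taggers[lang](chunk)  # the tagger sees only its own chunk
        tagged.extend(zip(chunk, tags))
    return tagged


# Hypothetical toy components standing in for a real LID tool and real taggers.
SPA_WORDS = {"pero", "no", "ahora"}
ENG_LEXICON = {"i": "PRON", "want": "VERB", "coffee": "NOUN"}
SPA_LEXICON = {"pero": "CONJ", "no": "PRT", "ahora": "ADV"}


def toy_lid(tokens: List[str]) -> List[str]:
    return ["spa" if t.lower() in SPA_WORDS else "eng" for t in tokens]


def lexicon_tagger(lexicon: Dict[str, str]) -> Tagger:
    return lambda chunk: [lexicon.get(t.lower(), "X") for t in chunk]


TOY_TAGGERS = {"eng": lexicon_tagger(ENG_LEXICON), "spa": lexicon_tagger(SPA_LEXICON)}

result = comb1_lid_then_tag("I want coffee pero no ahora".split(), toy_lid, TOY_TAGGERS)
```

Because each tagger receives only its own chunk, it never sees the neighbouring words from the other language, so this condition tags out-of-context chunks.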

Approach: Combination Experimental Conditions

COMB1: LID-MonoLT:

• For MSA-EGY: We used the Automatic Identification of Dialectal Arabic (AIDA2) tool to perform token-level language identification for the EGY and MSA tokens in context.

• For SPA-ENG: We trained 6-gram character language models using the SRILM Toolkit.

• The English language model was trained on the AFP section of the English GigaWord.

• The Spanish language model was trained on the AFP section of the Spanish GigaWord.
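To make the character language model idea concrete, here is a minimal pure-Python sketch. It is not SRILM and does not reproduce its smoothing: it assumes add-one smoothing, trigrams instead of 6-grams (because the toy training text is tiny), and invented training sentences in place of GigaWord. `CharNgramLM` and `identify_token` are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, Iterable


class CharNgramLM:
    """Character n-gram language model with add-one smoothing (illustrative)."""

    def __init__(self, n: int, texts: Iterable[str]) -> None:
        self.n = n
        self.ngrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab = set()
        for text in texts:
            padded = self._pad(text)
            self.vocab.update(padded)
            for i in range(n - 1, len(padded)):
                context = padded[i - n + 1:i]
                self.ngrams[context + padded[i]] += 1
                self.contexts[context] += 1

    def _pad(self, text: str) -> str:
        # "^" marks the start and "$" the end of the string.
        return "^" * (self.n - 1) + text.lower() + "$"

    def log_prob(self, word: str) -> float:
        padded = self._pad(word)
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            count = self.ngrams[context + padded[i]]
            total += math.log((count + 1) / (self.contexts[context] + vocab_size))
        return total


def identify_token(word: str, models: Dict[str, CharNgramLM]) -> str:
    """Return the language whose model assigns the word the highest score."""
    return max(models, key=lambda lang: models[lang].log_prob(word))


# Invented toy training text; the slides used the AFP sections of GigaWord.
MODELS = {
    "eng": CharNgramLM(3, ["the house is where the heart is", "they think that the weather"]),
    "spa": CharNgramLM(3, ["la casa es donde quiero estar", "que quieres hacer ahora"]),
}
```

Calling `identify_token("quiero", MODELS)` picks the Spanish model here because its trigrams were seen in the Spanish toy text; with real corpora and 6-grams the same comparison of per-language scores applies.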

Approach: Combination Experimental Conditions

COMB1: LID-MonoLT: Language identification followed by monolingual tagging.

[Figure: the COMB1 pipeline (token-level language ID, lang1 and lang2 chunks, one POS tagger per language, comparison of the generated POS tags with gold POS tags, results), applied to the example input below.]

Input Sentence: بتقولالليالحكمھافتكرودایماارجوكم : حوالكمنالعالمیتغیرفعدھانفسكغیرحولكمنالعالمتغیرانقبل

English Translation: Please always remember the wisdom that says: before trying to change the world, change yourself first. Then the world will change.


Approach: Combination Experimental Conditions

COMB2: MonoLT-LID: Monolingual tagging, then Language ID.

[Figure: COMB2 pipeline. Input Sentence -> the whole sentence is sent to the POS tagger for lang1 and to the POS tagger for lang2 -> POS tags -> Language ID identification (token level) -> extract POS tags for Lang1 words from the Lang1 POS tagger and for Lang2 words from the Lang2 POS tagger -> compare the generated POS tags with gold tags -> Results.]
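A minimal sketch of the COMB2 ordering follows, again with hypothetical toy components (`toy_lid`, `lexicon_tagger`, and the lexicons are invented for illustration). The point it shows is that every tagger runs on the whole sentence and language ID is applied only afterwards, to choose which tagger's output to keep for each token.

```python
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[str]]
LanguageId = Callable[[List[str]], List[str]]


def comb2_tag_then_lid(
    tokens: List[str],
    identify_language: LanguageId,
    taggers: Dict[str, Tagger],
) -> List[Tuple[str, str]]:
    """COMB2 sketch: tag the full sentence with every tagger, then select by LID."""
    # Each monolingual tagger sees the whole sentence, not just its own chunks.
    outputs = {lang: tagger(tokens) for lang, tagger in taggers.items()}
    lang_ids = identify_language(tokens)
    # For each token keep the tag produced by the tagger of the token's language.
    return [
        (token, outputs[lang][i])
        for i, (token, lang) in enumerate(zip(tokens, lang_ids))
    ]


# Hypothetical toy components; real systems would replace all of these.
SPA_WORDS = {"pero", "no", "ahora"}
ENG_LEXICON = {"i": "PRON", "want": "VERB", "coffee": "NOUN", "no": "DET"}
SPA_LEXICON = {"pero": "CONJ", "no": "PRT", "ahora": "ADV"}


def toy_lid(tokens: List[str]) -> List[str]:
    return ["spa" if t.lower() in SPA_WORDS else "eng" for t in tokens]


def lexicon_tagger(lexicon: Dict[str, str]) -> Tagger:
    return lambda sentence: [lexicon.get(t.lower(), "X") for t in sentence]


TOY_TAGGERS = {"eng": lexicon_tagger(ENG_LEXICON), "spa": lexicon_tagger(SPA_LEXICON)}

result = comb2_tag_then_lid("I want coffee pero no ahora".split(), toy_lid, TOY_TAGGERS)
```

In this toy run both taggers propose a tag for "no", and the language ID step keeps the Spanish tagger's proposal because the token is labelled Spanish.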

Approach: Combination Experimental Conditions

COMB2: MonoLT-LID: Monolingual tagging, then Language ID.

[Figure: COMB2 pipeline for MSA-EGY. Input Sentence -> the whole sentence is sent to Madamira MSA and to Madamira EGY -> POS tags -> AIDA Language ID identification (token level) -> extract POS tags for MSA words from Madamira MSA and for Egyptian words from Madamira EGY -> compare Madamira POS tags with ATB tags -> Results.]

Input Sentence: بتقولالليالحكمھافتكرودایماارجوكم : حوالكمنالعالمیتغیرفعدھانفسكغیرحولكمنالعالمتغیرانقبل

English Translation: Please always remember the wisdom that says: before trying to change the world, change yourself first. Then the world will change.


Approach: Combination Experimental Conditions

• COMB3: MonoLT-Conf:
  § Apply separate taggers.
  § Then use the probability/confidence scores yielded by each tagger to choose which tagger to trust more per token.

• COMB4: MonoLT-SVM:
  § Combining results from the monolingual taggers (baselines) and COMB3 into an ML framework such as SVM to decide which tag to choose (MSA vs. EGY, for example, or SPA vs. ENG).
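The per-token selection in COMB3 can be sketched as below. The scored taggers are hypothetical: they return a `(tag, confidence)` pair per token from invented lexicons, whereas real taggers would supply their own probability or confidence scores. COMB4 is not sketched here, since it needs a trained classifier.

```python
from typing import Callable, Dict, List, Sequence, Tuple

ScoredTagger = Callable[[List[str]], List[Tuple[str, float]]]


def comb3_by_confidence(
    tokens: List[str], taggers: Sequence[ScoredTagger]
) -> List[Tuple[str, str]]:
    """COMB3 sketch: run every tagger, keep the most confident tag per token."""
    outputs = [tagger(tokens) for tagger in taggers]
    merged: List[Tuple[str, str]] = []
    for i, token in enumerate(tokens):
        # Compare the (tag, confidence) proposals of all taggers for token i.
        best_tag, _ = max((output[i] for output in outputs), key=lambda pair: pair[1])
        merged.append((token, best_tag))
    return merged


def scored_lexicon_tagger(lexicon: Dict[str, Tuple[str, float]]) -> ScoredTagger:
    # Unknown words get a placeholder tag with low confidence (an assumption).
    return lambda sentence: [lexicon.get(t.lower(), ("X", 0.1)) for t in sentence]


# Invented scores for illustration only.
ENG = scored_lexicon_tagger({
    "i": ("PRON", 0.9), "want": ("VERB", 0.9), "coffee": ("NOUN", 0.8), "no": ("DET", 0.4),
})
SPA = scored_lexicon_tagger({
    "pero": ("CONJ", 0.9), "no": ("PRT", 0.8), "ahora": ("ADV", 0.7),
})

result = comb3_by_confidence("I want coffee pero no ahora".split(), [ENG, SPA])
```

The sketch also makes the dependency visible: the merged output can only be as good as the confidence scores it compares.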

Approach: Integrated Experimental Conditions

• INT1: CSD: Train a supervised ML framework on exclusively code-switched data.
  o For MSA-EGY: Train a MADAMIRA model exclusively with the CS data.
  o For SPA-ENG: Train a CS model using TreeTagger.

• INT2: AllMonoData: Similar to condition INT1: CSD, but changing the training data for each of the language pairs.
  o For MSA-EGY: merging the training data from MSA and EGY.
  o For SPA-ENG: merging the Spanish and English corpora, creating an integrated SPA-ENG model.

Approach: Integrated Experimental Conditions

• INT3: AllMonoData+CSD: Merging training data from conditions "INT1: CSD" and "INT2: AllMonoData" to train new taggers for CS POS tagging.
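The three integrated conditions differ only in the training data handed to a single tagger. The sketch below shows that composition; `build_training_sets` and the one-sentence toy corpora are hypothetical, and the actual training step (MADAMIRA or TreeTagger) is outside its scope.

```python
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def build_training_sets(
    mono_lang1: List[TaggedSentence],
    mono_lang2: List[TaggedSentence],
    cs_data: List[TaggedSentence],
) -> Dict[str, List[TaggedSentence]]:
    """Assemble the training data for each integrated condition."""
    all_mono = mono_lang1 + mono_lang2
    return {
        "INT1:CSD": list(cs_data),                         # code-switched data only
        "INT2:AllMonoData": all_mono,                      # both monolingual corpora merged
        "INT3:AllMonoData+CSD": all_mono + list(cs_data),  # everything merged
    }


# Invented one-sentence corpora, only to show the shapes involved.
eng = [[("I", "PRON"), ("want", "VERB"), ("coffee", "NOUN")]]
spa = [[("quiero", "VERB"), ("café", "NOUN")]]
cs = [[("I", "PRON"), ("want", "VERB"), ("café", "NOUN")]]

training_sets = build_training_sets(eng, spa, cs)
sizes = {name: len(sentences) for name, sentences in training_sets.items()}
```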


Evaluation: Datasets

• MSA-EGY:
  • We use the LDC Egyptian Arabic Treebanks 1-5 (ARZ1-5).
  • The ARZ1-5 data is from the discussion forums genre, mostly in the Egyptian Arabic dialect (EGY).

• SPA-ENG: Two datasets were used for the SPA-ENG language pair:
  • The transcribed conversation used in the work by Solorio and Liu (Solorio and Liu, 2008), referred to as Spanglish.
  • The Bangor Miami corpus, referred to as Bangor. This corpus is conversational speech involving a total of 84 speakers living in Miami, FL.

[Figure: three pie charts of code-switched vs. monolingual percentages: MSA-EGY (slices labeled 40% and 60%), Spanglish (20.61% code-switched, 79.39% monolingual), and Bangor (slices labeled 60% and 40%).]

Evaluation: Datasets

Dataset      # Sentences   # Words    # Types   CS %
ARZ          13,698        175,361    39,168    38.78%
Spanglish    922           8,022      1,455     20.61%
Bangor       45,605        335,578    13,994    6.21%

Table-1: Data set details.

Dataset      Train/Dev Tokens   Test Tokens
ARZ          154,897            20,464
Spanglish    6,456              1,566
Bangor       268,464            67,114

Table-1: Data set distribution.

Evaluation: Part-of-Speech Tagset

MSA-EGY:

• The ARZ1-5 dataset is manually annotated using the Buckwalter (BW) POS tagset.

• The BW POS tagset is considered one of the most popular Arabic POS tagsets.

SPA-ENG:

• The Bangor Miami corpus has been automatically glossed and tagged with part-of-speech tags in the following manner:
  • Each word is automatically glossed using the Bangor Autoglosser.

Evaluation: Part-of-Speech Tagset

BW POS tagset -> Mapping to Universal POS tagset

1- Personal, relative, demonstrative, interrogative, and indefinite pronouns. -> Mapped to Pronoun.
2- Acronyms. -> Mapped to Proper Nouns.
3- Complementizers and adverbial clause introducers. -> Mapped to Subordinating Conjunction.
4- Main verbs (content verbs), copulas, participles, and some verb forms such as gerunds and infinitives. -> Mapped to Verb.
5- Prepositions and postpositions. -> Mapped to Adpositions.
6- Interrogative, relative and demonstrative adverbs. -> Mapped to Adverb.
7- Tense, passive and modal auxiliaries. -> Mapped to Auxiliary Verb.
8- Possessive determiners, demonstrative determiners, interrogative determiners, quantity/quantifier determiners, etc. -> Mapped to Determiner.
9- Nouns, gerunds, and infinitives. -> Mapped to Noun.
10- Negation particle, question particle, sentence modality, and indeclinable aspectual or tense particles. -> Mapped to Particle.

Table: Mapping table for the BW POS tagset and the Universal POS tagset
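A tagset mapping of this kind is typically applied as a lookup table. The sketch below is illustrative only: the source tag names on the left are assumed spellings chosen to resemble fine-grained Arabic tags and may not match the actual BW inventory, the target labels are the category names from the table, and the `UNMAPPED` fallback is a hypothetical convention.

```python
from typing import Dict

# Assumed source tag spellings (left) mapped to the table's categories (right).
BW_TO_UNIVERSAL: Dict[str, str] = {
    "PRON": "Pronoun",
    "REL_PRON": "Pronoun",
    "DEM_PRON": "Pronoun",
    "INTERROG_PRON": "Pronoun",
    "ABBREV": "Proper Noun",
    "SUB_CONJ": "Subordinating Conjunction",
    "VERB": "Verb",
    "PREP": "Adposition",
    "REL_ADV": "Adverb",
    "INTERROG_ADV": "Adverb",
    "NOUN": "Noun",
    "NEG_PART": "Particle",
    "INTERROG_PART": "Particle",
}


def map_tag(source_tag: str, default: str = "UNMAPPED") -> str:
    """Look up the universal category for a fine-grained tag."""
    return BW_TO_UNIVERSAL.get(source_tag, default)


mapped = [map_tag(tag) for tag in ["REL_PRON", "PREP", "NEG_PART", "FOO"]]
```

Returning an explicit `UNMAPPED` marker instead of guessing makes gaps in the mapping visible, so they can be resolved by hand.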

Evaluation: Part-of-Speech Tagset

SPA-ENG:

• The Bangor corpus went through two edition/annotation stages:

1. Stage one includes:
   a) Tokens that were tagged ambiguously with more than one POS tag were disambiguated (e.g. that.CONJ.[or].DET).
   b) Ambiguous POS categories like ASV, AV and SV were disambiguated into either ADJ, NOUN, or VERB.
   c) For frequent tokens like so and that, their POS tags were hand-corrected.
   d) Mistranscribed terms which were originally labeled as Unknown were hand-corrected and given a correct POS tag.
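Before ambiguous entries such as that.CONJ.[or].DET can be disambiguated, they have to be split into a token and its candidate tags. The sketch below assumes every entry is written exactly like that one example (`token.TAG` with alternatives joined by `.[or].`); the real corpus format may differ, and `split_ambiguous` is a hypothetical helper.

```python
from typing import List, Tuple

AMBIGUITY_MARKER = ".[or]."


def split_ambiguous(entry: str) -> Tuple[str, List[str]]:
    """Split 'token.TAG1.[or].TAG2' into the token and its candidate tags."""
    first, *alternatives = entry.split(AMBIGUITY_MARKER)
    token, _, first_tag = first.partition(".")
    return token, [first_tag] + alternatives


def is_ambiguous(entry: str) -> bool:
    """An entry needs disambiguation when it carries more than one candidate tag."""
    return len(split_ambiguous(entry)[1]) > 1


token, candidates = split_ambiguous("that.CONJ.[or].DET")
```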

Evaluation: Part-of-Speech Tagset

SPA-ENG:

2. Stage two includes:
   • Mapping the Bangor corpus original POS tagset to the Universal POS tagset.

Evaluation: Part-of-Speech Tagset

Bangor POS tagset -> Mapping to Universal POS tagset

1- Exclamations and Intonational Markers. -> Mapped to Interjections.
2- Possessive Adjectives, Possessive Determiners, Interrogative Adjectives, Demonstrative Adjectives and Quantifying Adjectives. -> Mapped to Determiner.
3- Relatives, Interrogatives and Demonstratives (with no specification as to whether they were Determiners, Adjectives or Pronouns). -> Manually labeled.
4- Possessive markers, negation particles, and infinitive "to" tokens. -> PRT.
5- Conjunctions. -> Mapped to Coordinating Conjunctions and Subordinating Conjunctions using word lists.
6- A subset of English Verbs. -> Mapped to Auxiliary Verbs (could, should, might, may, will, shall, etc.).
7- Categories with an obvious match (like Nouns, Adjectives, Verbs, Pronouns, Determiners, Proper Nouns, Numbers, etc.). -> Automatically mapped to the appropriate category.

Table: Mapping table for the Bangor POS tagset and the Universal POS tagset
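Rows 5 and 6 of the table describe word-list-based refinements, which can be sketched as follows. The auxiliary list uses the examples given in the table; the conjunction word lists and the coarse tag names `CONJ` and `VERB` are assumptions made for illustration, not the lists used in the original work.

```python
# Illustrative word lists; only the auxiliaries come from the table's examples.
COORDINATING = {"and", "or", "but", "y", "o", "pero"}
SUBORDINATING = {"because", "if", "that", "porque", "si", "que"}
AUXILIARIES = {"could", "should", "might", "may", "will", "shall"}


def refine_tag(token: str, coarse_tag: str) -> str:
    """Refine a coarse tag with word lists; leave it unchanged when no list matches."""
    word = token.lower()
    if coarse_tag == "CONJ":
        if word in COORDINATING:
            return "Coordinating Conjunction"
        if word in SUBORDINATING:
            return "Subordinating Conjunction"
    elif coarse_tag == "VERB" and word in AUXILIARIES:
        return "Auxiliary Verb"
    return coarse_tag


refined = [
    refine_tag("pero", "CONJ"),
    refine_tag("because", "CONJ"),
    refine_tag("should", "VERB"),
    refine_tag("run", "VERB"),
]
```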

Evaluation: Results

To evaluate the performance of our approaches:

• We compare the output POS tags generated by each condition against the available gold POS tags for each dataset.

• We compare the accuracy of our approaches for each language pair to its corresponding monolingual tagger baseline.
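The evaluation described above amounts to token-level accuracy against gold tags, followed by a comparison with a baseline. A minimal sketch, assuming predictions and gold tags are aligned lists of the same length (`accuracy` and `gain_over_baseline` are hypothetical helper names):

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


def gain_over_baseline(system_accuracy: float, baseline_accuracy: float) -> float:
    """Absolute difference in accuracy points between a system and a baseline."""
    return system_accuracy - baseline_accuracy


score = accuracy(["NOUN", "VERB", "DET", "NOUN"], ["NOUN", "VERB", "ADJ", "NOUN"])
```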

Evaluation: Results

MSA-EGY Baseline

Dataset      MADAMIRA-MSA   MADAMIRA-EGY
ARZ          77.23%         72.22%

SPA-ENG Baseline

Dataset      TreeTagger SPA   TreeTagger ENG
Spanglish    44.61%           75.87%
Bangor       45.95%           64.05%

Table: POS tagging accuracy for monolingual baseline taggers

Evaluation: Results

Table: Accuracy Results for ARZ Test Dataset

Approach                  Overall   CS Posts   MSA Posts   EGY Posts
COMB1: LID-MonoLT         77.66     78.03      76.79       78.57
COMB2: MonoLT-LID         77.41     77.41      78.31       77.01
COMB3: MonoLT-Conf        76.66     77.89      76.79       76.11
COMB4: MonoLT-SVM         90.56     90.85      91.63       88.91
INT1: CSD                 83.89     82.03      82.48       83.26
INT2: AllMonoData         87.86     87.92      86.82       86
INT3: AllMonoData+CSD     89.36     88.12      85.12       87

Baseline: MSA: 77.23%, EGY: 72.22%

Evaluation: Results

Table: Accuracy Results for Bangor Dataset

Approach                  Overall   CS Posts   ENG Posts   SPA Posts
COMB1: LID-MonoLT         68.35     71.11      66.36       76.02
COMB2: MonoLT-LID         65.51     69.66      64.44       71.32
COMB3: MonoLT-Conf        68.25     68.21      71.93       65.03
COMB4: MonoLT-SVM         96.31     95.39      96.37       96.60
INT1: CSD                 95.28     94.41      94.41       95.15
INT2: AllMonoData         78.57     78.62      81.85       76.53
INT3: AllMonoData+CSD     91.04     89.59      92.00       89.48

Baseline: SPA: 45.95%, ENG: 64.05%

Evaluation: Results

Table: Accuracy Results for Spanglish Dataset

Approach                  Overall   CS Posts   ENG Posts   SPA Posts
COMB1: LID-MonoLT         78.73     77.81      80.18       73.99
COMB2: MonoLT-LID         73.52     73.80      73.60       71.57
COMB3: MonoLT-Conf        77.39     76.11      80.20       65.43
COMB4: MonoLT-SVM         90.61     89.43      93.61       87.96
INT1: CSD                 82.95     83.03      85.95       77.26
INT2: AllMonoData         84.55     84.84      88.50       76.59
INT3: AllMonoData+CSD     85.06     84.70      90.15       76.59

Baseline: SPA: 44.61%, ENG: 75.87%


Discussion: Combined conditions

For MSA-EGY:

• All the combined experimental conditions outperform the baselines.

• COMB1: LID-MonoLT yields worse results than COMB2: MonoLT-LID.

• This is expected, because the taggers expect well-formed sentences as input.

• The worst results are for condition MonoLT-Conf.

For SPA-ENG: Spanglish dataset:

• Almost all the accuracies achieved by the combined conditions are higher than the Spanglish dataset's baselines.

• COMB2: MonoLT-LID is the only combined condition that scores lower than the Spanglish dataset's baseline (73.52% vs. 75.87%).

Discussion: Combined conditions

For SPA-ENG: Spanglish dataset:

• Mistakes in the automated language identification cause the wrong tagger to be chosen.

For SPA-ENG: Bangor dataset:

• All the accuracies achieved by the combined conditions are higher than the Bangor dataset's baselines.

Discussion: Combined conditions

• The trends are almost the same between the two language pairs.

• Both language pairs achieve the highest performance with MonoLT-SVM and worse results with MonoLT-Conf.

• The weakness of the MonoLT-Conf approach comes from the fact that if the monolingual taggers are weak, their confidence scores are equally unreliable.

• The results are switched between conditions LID-MonoLT (condition 1) and MonoLT-LID (condition 2) for the two language pairs.

Discussion: Integrated conditions

• In general, all the INT conditions outperform the COMB conditions, with the exception of COMB4: MonoLT-SVM.

For MSA-EGY:

• Adding more data helps: INT2: AllMonoData outperforms INT1: CSD. Combining the two as training data helps further: INT3: AllMonoData+CSD outperforms the other INT conditions.

For SPA-ENG:

• The worst INT condition is INT2: AllMonoData for Bangor (accuracy 78.57%) and INT1: CSD for Spanglish (accuracy 82.95%).

Discussion: Integrated conditions

For SPA-ENG:

• The largest gap in performance for Bangor could be due to a higher domain mismatch with the monolingual data used to train the tagger.

• A notable difference between the two language pairs is the significant jump in performance for the Bangor corpus from the first three COMB conditions to COMB4 (68.35% to 96.31%).

• We observe a similar jump for the Spanglish corpus, but the gap is much larger for the Bangor corpus.
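The size of that jump can be computed directly from the Overall columns of the results tables. The script below takes the best of the first three COMB conditions per dataset and subtracts it from the COMB4 score; the numbers are copied from the tables and `comb4_jump` is a hypothetical helper name.

```python
# Overall accuracies (%) copied from the results tables.
OVERALL = {
    "ARZ": {"COMB1": 77.66, "COMB2": 77.41, "COMB3": 76.66, "COMB4": 90.56},
    "Bangor": {"COMB1": 68.35, "COMB2": 65.51, "COMB3": 68.25, "COMB4": 96.31},
    "Spanglish": {"COMB1": 78.73, "COMB2": 73.52, "COMB3": 77.39, "COMB4": 90.61},
}


def comb4_jump(dataset: str) -> float:
    """Accuracy points gained by COMB4 over the best of COMB1-COMB3."""
    scores = OVERALL[dataset]
    best_of_first_three = max(scores[name] for name in ("COMB1", "COMB2", "COMB3"))
    return round(scores["COMB4"] - best_of_first_three, 2)


jumps = {dataset: comb4_jump(dataset) for dataset in OVERALL}
```

On these figures the Bangor jump is roughly 28 points, against roughly 12 to 13 points for Spanglish and ARZ.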

Discussion: Integrated conditions

• Some similar trends between the two combinations.

• MSA and EGY share a significant number of homographs, some of which are cognates but many of which are not.

• The homograph overlap is quite limited in SPA-ENG.

• Adding the CSD to the monolingual corpora in the INT3: AllMonoData+CSD condition for MSA-EGY improves performance (1.5% absolute increase in accuracy).

• The results are not consistent across the SPA-ENG data sets.


Conclusion: Summary

• Presenting a detailed study of various strategies for POS tagging of CS data in two language pairs.

• The results indicate that, depending on the language pair, there are varying degrees of need for annotated code-switched data in the training phase of the process.

• Languages that share a significant amount of homographs when code-switched will benefit from more code-switched data at training time (e.g., MSA-EGY).

• Languages that are farther apart, such as Spanish and English, when code-switched, benefit more from having larger monolingual data mixed.

Conclusion: Future Work

• All COMB conditions use either out-of-context or in-context chunks as input for the monolingual taggers.

• Our plan for future work is to process the out-of-context chunks to provide a meaningful context to the monolingual taggers.

• Extend the feature set used in the COMB4: MonoLT-SVM condition to include Brown Clustering, Word2Vec, and deep learning based features.


Thanks!!