Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Recap:BERT Announcements
‣ A4backtoday,A5backsoon
‣ FPcheck-induetoday,willbereturnedsoon
‣ eCISevaluaEons:pleasefilltheseout
Mul$linguality
Dealingwithotherlanguages
‣ SomeofouralgorithmshavebeenspecifiedtoEnglish
‣ SomestructureslikeconsEtuencyparsingdon’tmakesenseforotherlanguages
‣ NeuralmethodsaretypicallytunedtoEnglish-scaleresources,maynotbethebestforotherlanguageswherelessdataisavailable
1)Whatotherphenomena/challengesdoweneedtosolve?
‣ QuesEon:
2)HowcanweleverageexisEngresourcestodobeUerinotherlanguageswithoutjustannotaEngmassivedata?
‣OtherlanguagespresentsomechallengesnotseeninEnglishatall!
ThisLecture
‣ Morphologicalrichness:effectsandchallenges
‣ Cross-lingualtaggingandparsing
‣ Morphologytasks:analysis,inflecEon,wordsegmentaEon
‣ Cross-lingualwordrepresentaEons
Morphology
Whatismorphology?‣ Studyofhowwordsform
‣ DerivaEonalmorphology:createanewlexemefromabase
estrange(v)=>estrangement(n)
become(v)=>unbecoming(adj)
Ibecome/shebecomes
‣ InflecEonalmorphology:wordisinflectedbasedonitscontext
‣ Maynotbetotallyregular:enflame=>inflammable
‣ Mostlyappliestoverbsandnouns
MorphologicalInflecEon‣ InEnglish: Iarrive youarrive he/she/itarrives
wearrive youarrive theyarrive[X]arrived
‣ InFrench:
MorphologicalInflecEon‣ InSpanish:
NounInflecEon
‣ NominaEve:I/he/she,accusaEve:me/him/her,geniEve:mine/his/hers
‣ Notjustverbseither;gender,number,casecomplicatethings
Igivethechildrenabook<=>IchgebeeinBuchItaughtthechildren<=>IchunterrichtedieKinder
‣ DaEve:mergedwithaccusaEveinEnglish,showsrecipientofsomething
denKindern
IrregularInflecEon‣ Commonwordsareodenirregular
‣ Iam/youare/sheis
‣ Lesscommonwordstypicallyfallintosomeregularparadigm—thesearesomewhatpredictable
‣ Jesuis/tues/elleest
‣ Soy/está/es
AggluEnaEngLangauges‣ Finnish/Hungarian(Finno-Ugric),alsoTurkish:whatapreposiEonwoulddoinEnglishisinsteadpartoftheverb
‣ Manypossibleforms—andinnewswiredata,onlyafewareobservedillaEve:“into” adessive:“on”
halata:“hug”
Morphologically-RichLanguages
‣ ManylanguagesspokenallovertheworldhavemuchrichermorphologythanEnglish
‣ CoNLL2006/2007:dependencyparsing+morphologicalanalysesfor~15mostlyIndo-Europeanlanguages
‣ Wordpiece/byte-pairencodingmodelsforMTarepreUygoodathandlingtheseifthere’senoughdata
‣ SPMRLsharedtasks(2013-2014):SyntacEcParsingofMorphologically-RichLanguages
Morphologically-RichLanguages
‣ GreatresourcesforchallengingyourassumpEonsaboutlanguageandforunderstandingmulElingualmodels!
MorphologicalAnalysis/Inflec$on
MorphologicalAnalysis‣ InEnglish,lexicalfeaturesonwordsandwordvectorsarepreUyeffecEve
‣ Whenwe’rebuildingsystems,weprobablywanttoknowbaseform+morphologicalfeaturesexplicitly
‣ Inotherlanguages,lotsmoreunseenwordsduetorichmorphology!Affectsparsing,translaEon,…
‣ Howtodothiskindofmorphologicalanalysis?
MorphologicalAnalysis:Hungarian
Ámakormányegyetlenadócsökkentésétsemjavasolja.
n=singular|case=nomina$ve|proper=no
deg=posi$ve|n=singular|case=nomina$ve
n=singular|case=nomina$ve|proper=no
n=singular|case=accusa$ve|proper=no|pperson=3rd|pnumber=singular
mood=indica$ve|t=present|p=3rd|n=singular|def=yes
Butthegovernmentdoesnotrecommendreducingtaxes.
MorphologicalAnalysis
‣ Givenawordincontext,needtopredictwhatitsmorphologicalfeaturesare
‣ LotsofworkonArabicinflecEon(highamountsofambiguity)
‣ Basicapproach:combinestwomodules:
‣ Lexicon:tellsyouwhatpossibiliEesarefortheword
‣ Analyzer:staEsEcalmodelthatdisambiguates
‣ ModelsarelargelyCRF-like:scoremorphologicalfeaturesincontext
MorphologicalInflecEon‣ Inversetaskofanalysis:givenbaseform+features,inflecttheword
DurreMandDeNero(2013)
‣ Hardforunknownwords—needmodelsthatgeneralize
w i n d e n
MorphologicalInflecEon
Chahuneauetal.(2013)
‣ MachinetranslaEonwherephrasetableisdefinedintermsoflemmas
‣ “Translate-and-inflect”:translateintouninflectedwordsandpredictinflecEonbasedonsourceside
ChineseWordSegmentaEon
‣ LSTMsovercharacterembeddings/characterbigramembeddingstopredictwordboundaries
‣ WordsegmentaEon:somelanguagesincludingChinesearetotallyuntokenized
Chenetal.(2015)
‣ HavingtherightsegmentaEoncanhelpmachinetranslaEon
Cross-LingualTaggingandParsing
Cross-LingualTagging‣ LabelingPOSdatasetsisexpensive
‣ CanwetransferannotaEonfromhigh-resourcelanguages(English,etc.)tolow-resourcelanguages?
English
Rawtext
POSdata
Spanish:
+Rawtext
en-esbitext
POSdata
Malagasy
bitext
Rawtext+ Malagasytagger
Spanishtagger
Cross-LingualTagging‣ Canweleveragewordalignmenthere?
NPRV??
‣ TagwithEnglishtagger,projectacrossbitext,trainFrenchtagger?WorkspreUywell
Ilikeitalot
Jel’aimebeaucoup
align Ilikeitalot
Jel’aimebeaucoup
NVPRDTADJ
tag Ilikeitalot
Jel’aimebeaucoup
Projectedtags
DasandPetrov(2011)
Cross-LingualParsing
McDonaldetal.(2011)
‣ NowthatwecanPOStagotherlanguages,canweparsethemtoo?
‣ Directtransfer:trainaparseroverPOSsequencesinonelanguage,thenapplyittoanotherlanguage
Iliketomatoes
PRONVERBNOUN
JelesaimePRONPRONVERB
Ilikethem
PRONVERBPRON
Parsertrainedtoaccepttaginput
VERBistheheadofPRONandNOUN
parsenewdata
train
Cross-LingualParsing
McDonaldetal.(2011)
‣ MulE-dir:transferaparsertrainedonseveralsourcetreebankstothetargetlanguage
‣ MulE-proj:morecomplexannotaEonprojecEonapproach
Cross-LingualWordRepresenta$ons
MulElingualEmbeddings
Ammaretal.(2016)
‣ mulECluster:usebilingualdicEonariestoformclustersofwordsthataretranslaEonsofoneanother,replacecorporawithclusterIDs,train“monolingual”embeddingsoverallthesecorpora
‣ Worksokaybutnotallthatwell
Ihaveanapple
J’aidesoranges IJeJ’
ID:47aihave
ID:24
4724891981
472418427
‣ Input:corporainmanylanguages.Output:embeddingswheresimilarwordsindifferentlanguageshavesimilarembeddings
MulElingualSentenceEmbeddings
Artetxeetal.(2019)
‣ FormBPEvocabularyoverallcorpora(50kmerges);willincludecharactersfromeveryscript
‣ TakeabunchofbitextsandtrainanMTmodelbetweenabunchoflanguagepairswithsharedparameters,useWassentenceembeddings
MulElingualSentenceEmbeddings
‣ TrainasystemforNLI(entailment/neutral/contradicEonofasentencepair)onEnglishandevaluateonotherlanguages
Artetxeetal.(2019)
MulElingualBERT
Devlinetal.(2019)
‣ Taketop104Wikipedias,trainBERTonallofthemsimultaneously
‣Whatdoesthislooklike?
BeethovenmayhaveproposedunsuccessfullytoThereseMalfay,thesupposeddedicateeof"FürElise";hisstatusasacommonermayagainhaveinterferedwiththoseplans.
当⼈们在⻢尔法蒂身后发现这部⼩曲的⼿稿时,便误认为上⾯写的是“FürElise”(即《给爱丽丝》)[51]。
Китай́(официально—Китай́скаяНаро́днаяРеспуб́лика,сокращённо—КНР;кит.трад.中華⼈⺠共和國,упр.中华⼈⺠
共和国,пиньинь:ZhōnghuáRénmín
MulElingualBERT:Results
Piresetal.(2019)
‣ CantransferBERTdirectlyacrosslanguageswithsomesuccess
‣…butthisevaluaEonisonlanguagesthatallshareanalphabet
MulElingualBERT:Results
Piresetal.(2019)
‣ Urdu(Arabicscript)=>Hindi(Devanagari).Transferswelldespitedifferentalphabets!
‣ Japanese=>English:differentscriptandverydifferentsyntax
ScalingUp:XLM-R
Conneauetal.(2019)
‣ Larger“CommonCrawl”dataset,beUerperformancethanmBERT
‣ Low-resourcelanguagesbenefitfromtrainingonotherlanguages
‣ High-resourcelanguagesseeasmallperformancehit,butnotmuch
Wherearewenow?‣ Universaldependencies:treebanks(+tags)for70+languages
‣ ManylanguagesaresEllsmall,soprojecEontechniquesmaysEllhelp
‣ Morecorporainotherlanguages,lessandlessrelianceonstructuredtoolslikeparsers,andpretrainingonunlabeleddatameansthatperformanceonotherlanguagesisbeUerthanever
‣ MulElingualmodelsseemtobeworkingbeUerandbeUer—butsEllmanychallengesforlow-resourceseyngs
Takeaways
‣ ManylanguageshaverichermorphologythanEnglishandposedisEnctchallenges
‣ Problems:howtoanalyzerichmorphology,howtogeneratewithit
‣ CanleverageresourcesforEnglishusingbitexts
‣ NextEme:wrapup+discussionofethics