

Recap: BERT

Announcements

‣ A4 back today, A5 back soon

‣ FP check-in due today, will be returned soon

‣ eCIS evaluations: please fill these out

Multilinguality

Dealing with other languages

‣ Some of our algorithms have been specific to English

‣ Some structures like constituency parsing don't make sense for other languages

‣ Neural methods are typically tuned to English-scale resources, may not be the best for other languages where less data is available

‣ Other languages present some challenges not seen in English at all!

‣ Question:

1) What other phenomena / challenges do we need to solve?

2) How can we leverage existing resources to do better in other languages without just annotating massive data?

This Lecture

‣ Morphological richness: effects and challenges

‣ Morphology tasks: analysis, inflection, word segmentation

‣ Cross-lingual tagging and parsing

‣ Cross-lingual word representations

Morphology

What is morphology?

‣ Study of how words form

‣ Derivational morphology: create a new lexeme from a base

estrange (v) => estrangement (n)

become (v) => unbecoming (adj)

‣ May not be totally regular: enflame => inflammable

‣ Inflectional morphology: word is inflected based on its context

I become / she becomes

‣ Mostly applies to verbs and nouns

Morphological Inflection

‣ In English:

I arrive, you arrive, he/she/it arrives

we arrive, you arrive, they arrive

[X] arrived

‣ In French: [conjugation table not preserved]

Morphological Inflection

‣ In Spanish: [conjugation table not preserved]

Noun Inflection

‣ Not just verbs either; gender, number, case complicate things

‣ Nominative: I/he/she, accusative: me/him/her, genitive: mine/his/hers

‣ Dative: merged with accusative in English, shows recipient of something

I taught the children <=> Ich unterrichte die Kinder

I give the children a book <=> Ich gebe den Kindern ein Buch

Irregular Inflection

‣ Common words are often irregular

‣ I am / you are / she is

‣ Je suis / tu es / elle est

‣ Soy / eres / es

‣ Less common words typically fall into some regular paradigm; these are somewhat predictable

Agglutinating Languages

‣ Finnish/Hungarian (Finno-Ugric), also Turkish: what a preposition would do in English is instead part of the verb

‣ Example: halata, "hug"; case endings include illative ("into") and adessive ("on")

‣ Many possible forms, and in newswire data only a few are observed

Morphologically-Rich Languages

‣ Many languages spoken all over the world have much richer morphology than English

‣ CoNLL 2006/2007: dependency parsing + morphological analyses for ~15 mostly Indo-European languages

‣ SPMRL shared tasks (2013-2014): Syntactic Parsing of Morphologically-Rich Languages

‣ Word piece / byte-pair encoding models for MT are pretty good at handling these if there's enough data

Morphologically-Rich Languages

‣ Great resources for challenging your assumptions about language and for understanding multilingual models!

Morphological Analysis / Inflection

Morphological Analysis

‣ In English, lexical features on words and word vectors are pretty effective

‣ In other languages, lots more unseen words due to rich morphology! Affects parsing, translation, …

‣ When we're building systems, we probably want to know base form + morphological features explicitly

‣ How to do this kind of morphological analysis?

Morphological Analysis: Hungarian

Ám a kormány egyetlen adó csökkentését sem javasolja.

n=singular|case=nominative|proper=no

deg=positive|n=singular|case=nominative

n=singular|case=nominative|proper=no

n=singular|case=accusative|proper=no|pperson=3rd|pnumber=singular

mood=indicative|t=present|p=3rd|n=singular|def=yes

But the government does not recommend reducing taxes.
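A minimal sketch of how feature strings like the ones in the Hungarian example can be turned into structured data. The function name `parse_features` is a hypothetical helper introduced for illustration; only the `key=value|key=value` layout comes from the slide.

```python
def parse_features(tag):
    """Split a 'key=value|key=value' feature string into a dictionary."""
    return dict(pair.split("=", 1) for pair in tag.split("|") if pair)


# Feature strings copied from the Hungarian example (noun and verb lines).
noun = parse_features(
    "n=singular|case=accusative|proper=no|pperson=3rd|pnumber=singular"
)
verb = parse_features("mood=indicative|t=present|p=3rd|n=singular|def=yes")

print(noun["case"])  # the case feature of the noun
print(verb["mood"])  # the mood feature of the verb
```

Once the features are in a dictionary, downstream components (a parser, an MT system) can condition on individual attributes such as case or number instead of on the opaque string.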

Morphological Analysis

‣ Given a word in context, need to predict what its morphological features are

‣ Basic approach: combines two modules:

  ‣ Lexicon: tells you what possibilities are for the word

  ‣ Analyzer: statistical model that disambiguates

‣ Models are largely CRF-like: score morphological features in context

‣ Lots of work on Arabic inflection (high amounts of ambiguity)
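The lexicon + analyzer split can be sketched as follows. This is a toy illustration under stated assumptions: the lexicon entries and the weights are invented for the example, and the greedy left-to-right scoring is a simple stand-in for a CRF-like model, not the method of any particular paper.

```python
# Hypothetical toy lexicon: each surface form maps to its possible analyses.
LEXICON = {
    "she": ["lemma=she|pos=PRON|case=nominative"],
    "long": ["lemma=long|pos=ADJ"],
    "walks": [
        "lemma=walk|pos=NOUN|n=plural",
        "lemma=walk|pos=VERB|p=3rd|n=singular",
    ],
}

# Hypothetical hand-set weights standing in for learned parameters:
# (POS of previous word, POS of candidate analysis) -> score.
WEIGHTS = {
    ("PRON", "VERB"): 2.0,
    ("PRON", "NOUN"): -1.0,
    ("ADJ", "NOUN"): 2.0,
    ("ADJ", "VERB"): -1.0,
}


def pos_of(analysis):
    """Read the pos feature out of an analysis string."""
    return dict(p.split("=", 1) for p in analysis.split("|"))["pos"]


def analyze(words):
    """Pick one analysis per word, scoring candidates against the left context."""
    chosen = []
    prev_pos = "START"
    for word in words:
        # Lexicon step: list the possibilities for this word.
        candidates = LEXICON.get(word.lower(), ["lemma=%s|pos=UNK" % word])
        # Analyzer step: keep the candidate that scores best in context.
        best = max(
            candidates,
            key=lambda a: WEIGHTS.get((prev_pos, pos_of(a)), 0.0),
        )
        chosen.append(best)
        prev_pos = pos_of(best)
    return chosen


print(analyze(["She", "walks"]))
print(analyze(["long", "walks"]))
```

The same surface form "walks" receives a verb analysis after a pronoun and a noun analysis after an adjective, which is the kind of disambiguation the analyzer module is responsible for.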

Morphological Inflection

‣ Inverse task of analysis: given base form + features, inflect the word

‣ Hard for unknown words; need models that generalize

[Figure: example with the base form "winden"]

Durrett and DeNero (2013)
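To make the input and output of the inflection task concrete, here is a minimal sketch for regular English verbs. The rules are hand-written assumptions for illustration only; they are not the learned model of Durrett and DeNero (2013), and they fail on irregular verbs, which is exactly why models that generalize are needed.

```python
def inflect(lemma, feats):
    """Inflect a regular English verb from a base form and a feature dict."""
    if feats.get("t") == "past":
        # Regular past tense: add -d after a final e, otherwise -ed.
        return lemma + ("d" if lemma.endswith("e") else "ed")
    is_third_singular = feats.get("p") == "3rd" and feats.get("n") == "singular"
    if feats.get("t") == "present" and is_third_singular:
        return lemma + "s"
    # All other present-tense cells share the base form in English.
    return lemma


print(inflect("arrive", {"t": "present", "p": "3rd", "n": "singular"}))
print(inflect("arrive", {"t": "past"}))
```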

Morphological Inflection

‣ Machine translation where phrase table is defined in terms of lemmas

‣ "Translate-and-inflect": translate into uninflected words and predict inflection based on source side

Chahuneau et al. (2013)

Chinese Word Segmentation

‣ Word segmentation: some languages including Chinese are totally untokenized

‣ LSTMs over character embeddings / character bigram embeddings to predict word boundaries

Chen et al. (2015)

‣ Having the right segmentation can help machine translation
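Predicting word boundaries is commonly framed as labeling each character. The sketch below assumes a B/M/E/S tag scheme (begin, middle, end, single-character word), which is an assumption of this example rather than something stated on the slide, and it shows only the conversion between words and per-character tags. The tagger itself (for instance an LSTM over character embeddings) is not implemented here.

```python
def encode_bmes(words):
    """Turn a list of words into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def decode_bmes(chars, tags):
    """Rebuild words from characters and their predicted B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


gold = ["我", "喜欢", "北京"]
tags = encode_bmes(gold)
print(tags)
print(decode_bmes("".join(gold), tags))
```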

Cross-Lingual Tagging and Parsing

Cross-Lingual Tagging

‣ Labeling POS datasets is expensive

‣ Can we transfer annotation from high-resource languages (English, etc.) to low-resource languages?

[Figure labels: English (raw text, POS data); Spanish (raw text, en-es bitext, POS data) => Spanish tagger; Malagasy (bitext, raw text) => Malagasy tagger]

Cross-Lingual Tagging

‣ Can we leverage word alignment here?

‣ Tag with English tagger, project across bitext, train French tagger? Works pretty well

tag: I like it a lot => N V PR DT ADJ

align: I like it a lot <=> Je l'aime beaucoup

Projected tags: Je l' aime beaucoup => N PR V ??

Das and Petrov (2011)
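The tag, align, project recipe can be sketched in a few lines. This is a plain direct projection under stated assumptions: the alignment is given by hand as (source index, target index) pairs, and a target word whose aligned source words disagree on the tag is left as "??". It is not the full method of Das and Petrov (2011).

```python
def project_tags(src_tags, alignment, tgt_len, unknown="??"):
    """Copy source tags onto target words through word alignment links."""
    candidates = [set() for _ in range(tgt_len)]
    for src_idx, tgt_idx in alignment:
        candidates[tgt_idx].add(src_tags[src_idx])
    # Keep a tag only when the aligned source words agree on it.
    return [next(iter(c)) if len(c) == 1 else unknown for c in candidates]


src_words = ["I", "like", "it", "a", "lot"]
src_tags = ["N", "V", "PR", "DT", "ADJ"]
tgt_words = ["Je", "l'", "aime", "beaucoup"]

# Hand-written alignment for the example: "a" and "lot" both link to "beaucoup".
alignment = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 3)]

projected = project_tags(src_tags, alignment, len(tgt_words))
print(list(zip(tgt_words, projected)))
```

The projected tags then serve as noisy training data for a tagger in the target language.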

Cross-Lingual Parsing

‣ Now that we can POS tag other languages, can we parse them too?

‣ Direct transfer: train a parser over POS sequences in one language, then apply it to another language

train: I like tomatoes (PRON VERB NOUN), I like them (PRON VERB PRON)

Parser trained to accept tag input: VERB is the head of PRON and NOUN

parse new data: Je les aime (PRON PRON VERB)

McDonald et al. (2011)
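A toy sketch of direct transfer, assuming a deliberately simple count-based "parser" that sees only POS tags. The `train` and `parse` functions are hypothetical helpers for illustration; McDonald et al. (2011) use a real dependency parser, and this greedy attachment is not guaranteed to produce a well-formed tree in general.

```python
from collections import Counter


def train(treebank):
    """Count root tags and (dependent tag, head tag) arcs in a POS-only treebank.

    Each item is (tags, heads); heads[i] is the head index of token i, -1 for root.
    """
    root_counts, arc_counts = Counter(), Counter()
    for tags, heads in treebank:
        for i, head in enumerate(heads):
            if head == -1:
                root_counts[tags[i]] += 1
            else:
                arc_counts[(tags[i], tags[head])] += 1
    return root_counts, arc_counts


def parse(tags, model):
    """Greedily pick a root, then attach every other token to its best-scoring head."""
    root_counts, arc_counts = model
    root = max(range(len(tags)), key=lambda i: root_counts[tags[i]])
    heads = []
    for i, tag in enumerate(tags):
        if i == root:
            heads.append(-1)
            continue
        others = [j for j in range(len(tags)) if j != i]
        heads.append(max(others, key=lambda j: arc_counts[(tag, tags[j])]))
    return heads


# English training data from the slide, reduced to tags and head indices.
english = [
    (["PRON", "VERB", "NOUN"], [1, -1, 1]),  # I like tomatoes
    (["PRON", "VERB", "PRON"], [1, -1, 1]),  # I like them
]
model = train(english)

# French input "Je les aime", seen by the parser only as tags.
print(parse(["PRON", "PRON", "VERB"], model))
```

Because the model never sees words, the different word order in French does not stop it from attaching both pronouns to the verb.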

Cross-Lingual Parsing

‣ Multi-dir: transfer a parser trained on several source treebanks to the target language

‣ Multi-proj: more complex annotation projection approach

McDonald et al. (2011)

Cross-Lingual Word Representations

Multilingual Embeddings

‣ Input: corpora in many languages. Output: embeddings where similar words in different languages have similar embeddings

‣ multiCluster: use bilingual dictionaries to form clusters of words that are translations of one another, replace corpora with cluster IDs, train "monolingual" embeddings over all these corpora

I have an apple => 47 24 89 1981

J'ai des oranges => 47 24 18 427

ID 47: I, Je, J'

ID 24: ai, have

‣ Works okay but not all that well

Ammar et al. (2016)
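The clustering and ID replacement steps of multiCluster can be sketched as below. The dictionary pairs and helper names are assumptions made for the example, the cluster IDs come out as small integers rather than the IDs shown on the slide, and the final step (training embeddings over the ID corpora) is left out.

```python
def build_clusters(pairs):
    """Group words linked by dictionary entries and give each group an integer ID."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two translation sets

    ids = {}
    return {word: ids.setdefault(find(word), len(ids)) for word in list(parent)}


def to_ids(tokens, lang, clusters):
    """Replace each token with its cluster ID; keep tokens with no dictionary entry."""
    return [clusters.get("%s:%s" % (lang, tok), tok) for tok in tokens]


# Hypothetical bilingual dictionary entries, prefixed with a language code.
pairs = [("en:I", "fr:Je"), ("en:I", "fr:J'"), ("en:have", "fr:ai")]
clusters = build_clusters(pairs)

print(to_ids(["I", "have", "an", "apple"], "en", clusters))
print(to_ids(["J'", "ai", "des", "oranges"], "fr", clusters))
```

Translations now share a token, so any monolingual embedding method run over the combined ID corpora places them at the same point by construction.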

Multilingual Sentence Embeddings

‣ Form BPE vocabulary over all corpora (50k merges); will include characters from every script

‣ Take a bunch of bitexts and train an MT model between a bunch of language pairs with shared parameters, use W as sentence embeddings

Artetxe et al. (2019)
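A compact sketch of how BPE merges are learned over a mixed-script word list. This is the textbook procedure written out for illustration, with a toy corpus and a tiny merge budget in place of real corpora and 50k merges; `learn_bpe` is a hypothetical helper and not the tooling used by Artetxe et al. (2019).

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Return the list of symbol pairs merged, most frequent pair first each round."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


corpus = ["low", "low", "lower", "中国", "中国"]
print(learn_bpe(corpus, 3))
```

Because the starting symbols are the characters of every input word, the vocabulary covers every script present in the corpora, as the slide notes.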

Multilingual Sentence Embeddings

‣ Train a system for NLI (entailment/neutral/contradiction of a sentence pair) on English and evaluate on other languages

Artetxe et al. (2019)

Multilingual BERT

‣ Take top 104 Wikipedias, train BERT on all of them simultaneously

‣ What does this look like?

Beethoven may have proposed unsuccessfully to Therese Malfatti, the supposed dedicatee of "Für Elise"; his status as a commoner may again have interfered with those plans.

当人们在马尔法蒂身后发现这部小曲的手稿时,便误认为上面写的是“Für Elise”(即《给爱丽丝》)[51]。

Кита́й (официально — Кита́йская Наро́дная Респу́блика, сокращённо — КНР; кит. трад. 中華人民共和國, упр. 中华人民共和国, пиньинь: Zhōnghuá Rénmín

Devlin et al. (2019)

Multilingual BERT: Results

‣ Can transfer BERT directly across languages with some success

‣ …but this evaluation is on languages that all share an alphabet

Pires et al. (2019)

Multilingual BERT: Results

‣ Urdu (Arabic script) => Hindi (Devanagari). Transfers well despite different alphabets!

‣ Japanese => English: different script and very different syntax

Pires et al. (2019)

Scaling Up: XLM-R

‣ Larger "Common Crawl" dataset, better performance than mBERT

‣ Low-resource languages benefit from training on other languages

‣ High-resource languages see a small performance hit, but not much

Conneau et al. (2019)

Where are we now?

‣ Universal dependencies: treebanks (+ tags) for 70+ languages

‣ Many languages are still small, so projection techniques may still help

‣ More corpora in other languages, less and less reliance on structured tools like parsers, and pretraining on unlabeled data means that performance on other languages is better than ever

‣ Multilingual models seem to be working better and better, but still many challenges for low-resource settings

Takeaways

‣ Many languages have richer morphology than English and pose distinct challenges

‣ Problems: how to analyze rich morphology, how to generate with it

‣ Can leverage resources for English using bitexts

‣ Next time: wrap up + discussion of ethics