141
PA153 Vít Baisa

PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PA153

Vít Baisa

Page 2: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

ENGLISH TO CZECH MTMoses is an implementation of the statistical (or data-driven) approach tomachine translation (MT). This is the dominant approach in the �eld at the

moment, and is employed by the online translation systems deployed by thelikes of Google and Microsoft.

Mojžíš je implementace statistické (nebo řízené daty) přístupu k strojovéhopřekladu (MT). To je převládajícím přístupem v oblasti v současné době, a jezaměstnán pro on-line překladatelských systémů nasazených likes Google a

Microsoft.

Moses je implementace statistického (nebo daty řízeného) přístupu k strojovémupřekladu (MT). V současné době jde o převažující přístup v rámci strojového

překladu, který je použit online překladovými systémy nasazenými Googlem aMicrosoftem.

Mojžíš je provádění statistické (nebo aktivovaný) přístup na strojový překlad (mt).To je dominantní přístup v oblasti v tuto chvíli, a zaměstnává on - line překlad

systémů uskutečněné takové, Google a Microsoft.

Page 3: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very
Page 4: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

QUESTIONSIs accurate translation possible at all?What is easier: to translate from / to your mothertongue?How we know is equivalent to ?English wind types: airstream, breeze, crosswind, dustdevil, easterly, gale, gust, headwind, jet stream,mistral, monsoon, prevailing wind, sandstorm, seabreeze, sirocco, southwester, tailwind, tornado, tradewind, turbulence, twister, typhoon, whirlwind, wind,windstorm, zephyr

w1 w2

Page 5: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EXAMPLE OF HARD WORDSalkáč, večerníček, telka, čoklbuřt, knížečka, ČSSD ... ?matka, macecha, mamka, máma, maminka, matička,máti, mama, mamča, maminascvrnkls, nejneobhospo....nějšími

: language as a cipherLeacock: Nonsese novels (Literární poklesky)Navajo Code

Page 6: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MACHINE TRANSLATIONWe consider only technical / specialized texts:

web pages,technical manuals,scienti�c documents and papers,lea�ets and catalogues,law texts andin general, texts from speci�c domains.

Nuances on di�erent language levels in art literature areout of scope of current MT systems.

Page 7: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MACHINE TRANSLATION: ISSUESIn fact an output of MT is always revised. We distinguish

pre-editing and post-editing.

MT systems make di�erent types of errors.

These mistakes are characteristic for human translators:

wrong prepositions: (I am in school)missing determiners (I saw man)wrong tense (Viděl jsem: I was seeing), ...

For computers, errors in meaning are characteristic:

Kiss me honey. → Polib mi med.

Page 8: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

Costa, Ângela, et al. "A linguistically motivated taxonomy for Machine Translation error analysis."Machine Translation 29.2 (2015): 127-161.

Page 9: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

FREE WORD ORDERThe more morphologically rich language, the freer word order it has.

Katka snědla kousek koláče.

Kati megevett egy szelet tortát → Katie eating a piece of cakeEgy szelet tortát Kati evett meg → Katie ate a piece of cakeKati egy szelet tortát evett meg → Katie ate a piece of cakeEgy szelet tortát evett meg Kati → Katie ate a piece of cakeMegevett egy szelet tortát Kati → Katie eating a piece of cakeMegevett Kati egy szelet tortát → Katie ate a piece of cake

Page 10: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

FREE WORD ORDER IN CZECHVíš, že se z kávy vyrábí mouka?Víš, že se z kávy mouka vyrábí?Víš, že se mouka vyrábí z kávy?Víš, že se mouka z kávy vyrábí?Víš, že se vyrábí mouka z kávy?Víš, že se vyrábí z kávy mouka?

How their meanings di�er?

Page 11: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

DIRECT METHODS FOR IMPROVING MTQUALITY

limit input to a:sublanguage (indicative sentences)domain (informatics)document type (patents)

text pre-processing (e.g. manual syntactic analysis)

Page 12: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

CLASSIFICATION BASED ON APPROACHrule-based, knowledge-based (RBMT, KBMT)

transferwith interlingua

statistical machine translation (SMT)hybrid machine translation (HMT, HyTran)neural networks

Page 13: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

VAUQUOIS'S TRIANGLE

Page 14: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MACHINE TRANSLATION NOWADAYSbig companies (Microsoft) focused on English as SLlarge pairs (En:Sp, En:Fr): very good translation qualitySMT enriched with syntaxGoogle Translate as a gold standardmorphologically rich languages neglectedEn: a :En pairs prevailneural networks being deployed

Page 15: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MOTIVATION IN 21ST CENTURYtranslation of for gisting (getting the mainmessage)methods for speeding-up human translationsubstantially (translation memories)cross-language extraction of facts and search forinformationinstant translation of e-communicationtranslation on mobile devices

web pages

Page 16: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

RULE-BASED MT

Page 17: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

RULE-BASED MACHINE TRANSLATIONRBMTlinguistic knowledge in form of rulesrules for analysis of SLrules for transfer between languagesrules for generation/rendering/synthesis of TL

Page 18: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

KNOWLEDGE-BASED MACHINETRANSLATION

systems using linguistic knowledge about languagesolder types, more general notionanalysis of meaning of SL is crucialno total meaning (connotations, common sense)to be able to translate vrána na stromě not necessary to know vrána is a bird and can �yterm KBMT rather for systems with interlinguafor us KBMT RBMT=

Page 19: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

KBMT CLASSIFICATIONdirect translationsystems with interlinguatransfer systems

The only types of MT until 90s.

Page 20: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

DIRECT TRANSLATIONthe oldest systemsone step process: transfer

, interest dropped quicklyGeorgetown experiment METEO

Page 21: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

DIRECT TRANSLATIONfocus on S↔T elements correspondences�rst experiments on En-Ru pairall components are bound to a language pair (andone direction)typically consists of:

translation dictionarymonolithic program dealing with analysis andgeneration

necessarily one-directional and bilinguale�cacy: for languages we need ?N

Page 22: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MT WITH INTERLINGUAwe suppose it is possible to convert SL to a language-independent representationinterlingua (IL) must be unambiguoustwo steps: analysis & synthesis (generation)from IL, TL is generatedanalysis is SL-dependent but TL-independentand vice versa for synthesisfor translation among languagesN

Page 23: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

KBMT-89

Nirenburg, Sergei. Knowledge-based machine translation. Machine Translation 4.1 (1989): 5-24.

Page 24: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

TRANSFER TRANSLATIONanalysis up to a certain leveltransfer rules S forms → T formsnot necessarily between same levelsusually on syntactic level → context constraints (not available in direct translation)distinction IL vs. transfer blurredthree-step translation

Page 25: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

INTERLINGUA VS. TRANSFER

Page 26: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SOURCE LANGUAGE ANALYSIS

Page 27: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

TOKENIZATION�rst level in Vauquois' △input text to tokens (words, numbers, punctuation)token = sequence of non-white charactersoutput = list of tokensinput for further processing

Page 28: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

OBSTACLES OF TOKENIZATIONdon't: do n't, do n 't, don 't, ?červeno-černý: červeno - černý, červeno-černý,červeno- černý

Page 29: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SCRIPTIO CONTINUA

What is word?

Page 30: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

TOKENIZATIONin most cases a heuristic is usedalphabetic writing systems: split on spaces and on other punctuation marks ?!.,-()/:;demo: unitok.py

Page 31: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SENTENCE SEGMENTATIONMT almost always uses sentences90% of periods are sentence boundary indicators(Riley 1989)using list of punctuation (!?.<>) Měl jsem 5 (sic!) poznámek.exceptions:

abbreviations (aj. atd. etc. e.g.)degrees (RNDr., prof.)

HTML elements might be used (p, div, td, li)demo: tag_sentencespaper on tokenization

Page 32: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

OBSTACLES OF SENTENCESEGMENTATION

Zeleninu jako rajče, mrkev atd. Petr nemá rád.Složil zkoušku a získal titul Mgr. Petr mu dost záviděl.John F. Kennedy = one token? John F. Kennedy'srelated to named entity recognitionneglected step in the processing (DCEP, EUR-Lex)

Page 33: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MORPHOLOGICAL LEVEL

Page 34: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MORPHOLOGYmorpheme: the smallest item carrying a meaningpří-lež-it-ost-n-ým-ipre�x-root-in�x-su�x-su�x-su�x-a�xcase, number, gender, lemma, a�x

Page 35: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MORPHOLOGIC LEVELsecond level in Vauquois' △reducing the immense amount of wordforms

: lexicon sizes of various corporaconversion from wordforms to lemmata give, gives, gave, given, giving → give dělá, dělám, dělal, dělaje, dělejme, ... → dělatanalysis of grammatical categories of wordforms dělali → dělat + past t. + continuous + plural + 3rd p. did → do + past t. + perfective + person ? + number ?Robertovým → Robert + case ? + adjective + number ?demo:

demo

wwwajka

Page 36: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MORPHOLOGIC ANALYSISfor each token we get a base form, grammarcategories, segmentation to morphemesWhat is a base form? Lemma.nouns: singular, nominative, positive, masculinebycha → bych?, nejpomalejšími → pomalý neschopný → schopný?, mimochodem → mimochod?verbs: in�nitiveneraď → radit?, bojím se → bát (se)Why in�nitive? the most frequent form of verbsexample

Page 37: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MORPHOLOGICAL TAGS, TAGSETlanguage-dependent (various morphological categories)attribute system: pairs category--valuemaminkou k1gFnSc7 udělány k5eAaPmNgFnPpositional system: 16 �xed positions kontury NNFP1-----A---- zdají VB-P---3P-AA---Penn Treebank tagset (English): limited set of tags faster RBR doing VBGCLAWS tagset (English)and others (German) gigantische ADJA.ADJA.Pos.Acc.Sg.Fem erreicht VVPP.VPP.Full.Psp

Page 38: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MORPHOLOGICAL POLYSEMYin many cases: words have more than one tagPoS polysemy (>1 lemma), in Czech jednou k4gFnSc7, k6eAd1, k9 ženu k1gFnSc4, k5eAaImIp1nS k1 + k2, k3 + k5?what about English?demo: auto PoSpolysemy within a PoSin Czech: nominative = accusative víno: k1gNnSc1, k1gNnSc4, ... odhalení: 10 tags

SkELL

Page 39: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MORPHOLOGICAL DISAMBIGUATIONfor each word: one tag and one lemmamorphological disambiguationa tool: taggertranslational polysemy is another issue pubblico – Ö�entlichkeit, Publikum, Zuschauermost of methods use contexti.e. surrounding words, lemmas, tags

Page 40: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

STATISTICAL DISAMBIGUATIONthe most probable sequence of tags Ženu je domů. k5|k1, k3|k5, k6|k1 Mladé muže gF|gM, nS|nPthere are tough situations: dítě škádlí lvíčemachine learning trained on manuallytagged/disambiguated dataBrill's tagger, TreeTagger, Freeling, RFTaggerdemo for Czech: (hybrid, DESAM)Desamb

Page 41: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

RULE-BASED DISAMBIGUATIONthe only option if an annotated corpus not availablealso used as a �lter before a statistical methodrules help to capture wider contextcase, number and gender agreement in noun phrases malému (c3, gIMN) chlapci (nPc157, nSc36, gM)a more matured: valency structure of sentences valency: vidět koho/co, to give OBJ to DIROBJ vidím dům → c4 I gave the present to her → DIROBJ

, VerbaLex PDEV

Page 42: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

GUESSERwe aim at high coverage: as many words as possiblefor out-of-vocabulary (OOV) tokensnew, borrowed, compound wordsstemming, guessing PoSes from word su�xvygooglit, olajkovat, zaxzovatsedm dunhillektřitisícedvěstědevadesátpět znakůfunny errors: Matka božit, topit box

Page 43: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MORPHOLOGICAL DISAMBIGUATION—EXAMPLEslovo analýzy disambiguace

Pravidelné k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1,k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1,k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1,k2eAgNnSc5d1, ... (+ 5)

k2eAgNnSc1d1

krmení k2eAgMnPc1d1, k2eAgMnPc5d1, k1gNnSc1, k1gNnSc4, k1gNnSc5,k1gNnSc6, k1gNnSc3, k1gNnSc2, k1gNnPc2, k1gNnPc1, k1gNnPc4,k1gNnPc5

k1gNnSc1

je k5eAaImIp3nS, k3p3gMnPc4, k3p3gInPc4, k3p3gNnSc4, k3p3gNnPc4,k3p3gFnPc4, k0

k5eAaImIp3nS

pro k7c4 k7c4

správný k2eAgMnSc1d1, k2eAgMnSc5d1, k2eAgInSc1d1, k2eAgInSc4d1,k2eAgInSc5d1, ... (+ 18)

k2eAgInSc4d1

růst k5eAaImF, k1gInSc1, k1gInSc4 k1gInSc4

důležité k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1,k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1,k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1,k2eAgNnSc5d1, ... (+ 5)

k2eAgNnSc1d1

Page 44: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PROBLEMS WITH POSESquality of MA a�ects all further levels of analysisquality depends on a language (English vs. Hungarian)chončaam: my small house (Tajik)kahramoni: you are hero (Tajik)legeslegmagasabb: the very highest (Hungarian)raněný: SUBS / ADJthe big red �re truck: SUBS / ADJ?The Duchess was entertaining last night.Pokojem se neslo tiché pšššš.

Page 45: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MORPHOLOGY—SUMMARYMA introduce critical errors into the analysisthe goal is to limit the immense amount of wordformswordform → lemma + tagmuch simpler for English (cc. 35 tags)PoS tagging accuracy depends on a languageusually around 95%

Page 46: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

LEXICAL LEVELDICTIONARIES

Page 47: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

DICTIONARIES IN MT Iconnection between languagestransfer systems: syntactic leveldictionaries crucial for KBMT systemsGNU-FDL slovníkWiktionary

Page 48: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

DICTIONARIES IN MT IIhow many items in a dict do we need / want? → named entities, slang, MWElisteme: lexical item, which can not be deduced fromthe principle of compositionality (slaměný vdovec)which form in a dict? → lemmatizationhow many di�erent senses is reasonable todistinguish? → granularity

Page 49: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

POLYSEMY IN DICTIONARIESwords relates to senseswhat is meaning of meaning?we need a formal de�nition for computersdata is discrete, meaning is continuousman: an adult male person what about 17-years-old male person?

Page 50: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SMOOTH SENSE TRANSITIONS

log log chair chair

Page 51: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

POLYSEMY ON SEVERAL LEVELSmorphology: -sword level: keymultiword expressions: bílá vránasentence level: I saw a man with a telescope.homonymy: accidental

full homonymy: líčit, kolejpartial homonymy: los, stát

polysemy is natural and ubiquitous

Page 52: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MEANING REPRESENTATIONlist: a common dictionarygraph: senses:vertices, semantic relations:edgesspace: senses:dots, similarity:distance

Page 53: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

WORD SENSE DISAMBIGUATION�nding a proper sense of a word in a given contexttrivial for human, very hard for computerswe need a �nite inventory of sensesaccuracy about 90%crucial task for KBMT: Ludvig dodávka Beethoven, kiss me honey, ... box in the pen (Bar-Hillel)

a�ects the quality of WSDgranularity

Page 54: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SYNTACTIC LEVELMiloš and Vojta

Page 55: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SEMANTIC LEVEL / ANALYSISZuzka Nevěřilová, Adam Rambousek

Page 56: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

TECTOMTPDT formalism, high modularitysplitting tasks to a sequence of blocks—scenariosblocks are Perl scripts communicating via APIblocks allow massive data processing, parallelisationrule-based, statistical, hybrid methodsprocessing: conversion to the tmt format →application of a scenario → conversion to an outputformat

Page 57: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

TECTOMT: A SIMPLE BLOCKEnglish negative particles → verb attributes

sub process_document { my ($self,$document) = @_;

foreach my $bundle ($document->get_bundles()) { my $a_root = $bundle->get_tree('SEnglishA');

foreach my $a_node ($a_root->get_descendants) { my ($eff_parent) = $a_node->get_eff_parents; if ($a_node->get_attr('m/lemma')=~/^(not|n't)$/ and $eff_parent->get_attr('m/tag')=~/^V/ ) { $a_node->set_attr('is_aux_to_parent',1); } } } }

Page 58: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very
Page 59: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

RULE-BASED SYSTEMS: CONCLUSION(purely) rule-based systems not used anymorestatistical systems achieve better resultsstill, some methods from RBMT may improve SMT

Page 60: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

STATISTICAL MACHINETRANSLATION

Page 61: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

INTRODUCTIONrule-based systems motivated by linguistic theoriesSMT inspired by information theory and statisticsGoogle, IBM, Microsoft develop SMT systemsmillions of webpages translated with SMT dailygisting: we don't need exact translation, sometimes agist of a text is enough (on of the most frequent useof SMT)SMT in assisted MT (CAT)trending right now: neural network models for MTdata-driven approach more viable than RBMT

Page 62: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SMT SCHEME

Page 63: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PARALLEL CORPORA Ibasic data source for SMTavailable sources ~10–100 Msize depends heavily on a language pairmultilingual webpages (online newspapers)paragraph and sentence alignment needed

Page 64: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PARALLEL CORPORA II: 11 ls, 40 M words

: parallel texts of various origin, open subtitles,UI localizations

: law documents of EU (20 ls): 1.3 M pairs of text chunks from the o�cial

records of the Canadian Parliament

comparable corpora...

EuroparlOPUS

Acquis CommunautaireHansards

EUR-Lex

Page 65: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SENTENCE ALIGNMENTsometimes sentences are not in 1:1 ratio in corporaChurch-Gale alignmenthunalign

P alignment

0.89 1:1

0.0099 1:0, 0:1

0.089 2:1, 1:2

0.011 2:2

Page 66: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SMT NOISY CHANNEL PRINCIPLEClaude Shannon (1948), self-correcting codes transferedthrough noisy channels based on information about the

original data and errors made in the channels.

Used for MT, ASR, OCR. Optical Character Recognition iserroneous but we can estimate what was damaged in atext (with a language model); errors l 1 I, rn m etc.↔ ↔ ↔

= arg p(e|f)e∗ maxe

= arg maxe

p(e)p(f|e)

p(f)= arg p(e)p(f|e)max

e

Page 67: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SMT COMPONENTS Ilanguage modelhow we get for any string the more looks like proper language the higher should beissue: what is for an unseen ?

p(e) e

e p(e)

p(e) e

Page 68: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SMT COMPONENTS IItranslation modelfor and compute the more looks like a proper translation of , thehigher

e f p(f|e)e f

p(f|e)

Page 69: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SMT COMPONENTS IIIdecoding algorithmbased on TM and LM, �nd a sentence as the besttranslation of as fast as possible and with as few memory aspossibleprune non-perspective hypothesesbut do not lost any valid translations

fe

Page 70: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

LANGUAGE MODELS

Page 71: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

WHAT IT IS GOOD FOR?What is the probability of utterance of ?s

I go to home vs. I go home

What is the next, most probable word?

Ke snídani jsem měl celozrnný ...

{ chléb pečivo zákusek mléko babičku}> > > >

Page 72: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

CHOMSKY WAS WRONGColorless green ideas sleep furiously

vs. Furiously sleep ideas green colorless

LM assigns higher to the 1st! (Mikolov, 2012)p

Page 73: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

GENERATING RANDOM TEXTTo him swallowed confess hear both. Which. Of save ontrail for are ay device and rote life have Every enter now

severally so, let. (unigrams)

Sweet prince, Falsta� shall die. Harry of Monmouth'sgrave. This shall forbid it should be branded, if renown

made it empty. (trigrams)

Can you guess the author of the original text?

CBLM

Page 74: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MAXIMUM LIKELIHOOD ESTIMATION

p( | , ) =w3 w1 w2count( , , )w1 w2 w3

count( , ,w)∑w w1 w2

(the, green, *): 1,748× in EuroParl

w count p(w)

paper 801 0.458

group 640 0.367

light 110 0.063

party 27 0.015

ecu 21 0.012

Page 75: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

LM QUALITYWe need to compare quality of various LMs.

2 approaches: extrinsic and intrinsic evaluation.

A good LM should assign a higher probability to a good(looking) text than to an incorrect text. For a �xed

testing text we can compare various LMs.

Page 76: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

ENTROPYShannon, 1949the expected value (average) of the informationcontained in a messageinformation viewed as the negative of the logarithm ofthe probability distributionevents that always occur do not communicateinformationpure randomness has highest entropy (uniformdistribution )lo ng2

E(X) = − p( ) p( )∑ni=1 xi log2 xi

Page 77: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PERPLEXITYPP = 2H( )pLM

PP(W) = p( …w1w2 wn)− 1N

A good LM should not waste for improbablephenomena. The lower entropy, the better the lower

perplexity, the better.

p→

Minimizing probabilities = minimizing perplexity.

Page 78: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

WHAT INFLUENCES LM QUALITY?size of training dataorder of language modelsmoothing, interpolation, back-o�

Page 79: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

LARGE LM - N-GRAM COUNTSHow many unique n-grams are in a corpus?

order types singletons %

unigram 86,700 33,447 (38,6%)

bigram 1,948,935 1,132,844 (58,1%)

trigram 8,092,798 6,022,286 (74,4%)

4-gram 15,303,847 13,081,621 (85,5%)

5-gram 19,882,175 18,324,577 (92,2%)

Taken from Europarl with 30 mil. tokens.

Page 80: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

ZERO FREQUENCY, OOV, RARE WORDSprobability must always be non zeroto be able to measure perplexitymaximum likelihood bad at ittraining data: work on Tuesday/Friday/Wednesdaytest data: work on Sunday, p(Sunday|work on) = 0

Page 81: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EXTRINSIC EVALUATION: SENTENCE COMPLETIONMicrosoft Research Sentence Completion Challengeevaluation of language modelswhere perplexity not availablefrom �ve Holmes novelstraining data: project Gutenberg

Model Acc

Human 90

smoothed 3-gram 36

smoothed 4-gram 39

RNN 59

RMN (LSTM) 69

Page 82: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SENTENCE COMPLETIONThe stage lost a �ne XXX, even as science lost an acute reasoner, when he

became a specialist in crime. a) linguist b) hunter c) actor d) estate e) horseman

What passion of hatred can it be which leads a man to XXX in such a place atsuch a time.

a) lurk b) dine c) luxuriate d) grow e) wiggle

My heart is already XXX since i have con�ded my trouble to you. a) falling b) distressed c) soaring d) lightened e) punished

My morning’s work has not been XXX, since it has proved that he has the verystrongest motives for standing in the way of anything of the sort.

a) invisible b) neglected c) overlooked d) wasted e) deliberate

That is his XXX fault, but on the whole he’s a good worker. a) main b) successful c) mother’s d) generous e) favourite

Page 83: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

NEURAL NETWORK LANGUAGE MODELSold approach (1940s)only recently applied successfully to LM2003 Bengio et al. (feed-forward NNLM)2012 Mikolov (RNN)

right nowkey concept: distributed representations of words1-of-V, one-hot representation

trending

Page 84: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

RECURRENT NEURAL NETWORKTomáš Mikolov (VUT)hidden layer feeds itselfshown to beat n-grams by large margin

Page 85: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very
Page 86: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

WORD EMBEDDINGSdistributional semantics with vectorsskip-gram, CBOW (continuous bag-of-words)

Page 87: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very
Page 88: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very
Page 89: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very
Page 90: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EMBEDDINGS IN MT

Page 91: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

LONG SHORT-TERM MEMORYRNN model, can learn to memorize and learn toforgetbeats RNN in sequence learningLSTM

Page 92: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

TRANSLATION MODELS

Page 93: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

LEXICAL TRANSLATIONStandard lexicon does not contain information about

frequency of translations of individual meanings ofwords.

key → klíč, tónina, klávesa

How often are the individual translations used intranslations?

key → klíč (0.7), tónina (0.18), klávesa (0.11)

probability distribution :pf

(e) = 1∑e

pf

∀e : 0 ≤ (e) ≤ 1pf

Page 94: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EM ALGORITHM - INITIALIZATION

Page 95: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EM ALGORITHM - FINAL PHASE

Page 96: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

IBM MODELSIBM-1 does not take context into account, cannot add

and skip words. Each of the following models addssomething more to the previous.

IBM-1: lexical translationIBM-2: + absolute alignment modelIBM-3: + fertility modelIBM-4: + relative alignment modelIBM-5: + further tuning

Page 97: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

WORD ALIGNMENT MATRIX

Page 98: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

WORD ALIGNMENT ISSUES

Page 99: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PHRASE-BASE TRANSLATION MODEL

Phrases not linguistically, but statistically motivated.German am is seldom translated with single English to.

Cf. (fun (with (the game)))

Page 100: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

ADVANTAGES OF PBTMtranslating n:m wordsword is not a suitable element for translation formany lang. pairsmodels learn to translate longer phrasessimpler: no fertility, no NULL token etc.

Page 101: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PHRASE-BASED MODELTranslation probability is split to phrases:p(f|e)

p( | ) = ϕ( | )d(star − en − 1)fI1 eI1 ∏

i=1

I

f i ei ti di−1

Sentence is split to phrases , all segmentationsare of the same probability. Function is translationprobability for phrases. Function is distance-based

reordering model. is position of the �rst word ofphrase from sentence , which is translated to -th

phrase of sentence .

f I f i

ϕd

startif i

e

Page 102: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PHRASE EXTRACTION

Page 103: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EXTRACTED PHRASES

phr1 phr2

michael michael

assumes geht davon aus / geht davon aus

that dass / , dass

he er

will stay bleibt

in the im

house haus

michael assumes michael geht davon aus / michael geht davonaus ,

assumes that geht davon aus , dass

assumes that he geht davon aus , dass er

that he dass er / , dass er

Page 104: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

phr1 phr2

in the house im haus

michael assumesthat

michael geht davon aus , dass

Page 105: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PHRASE-BASED MODEL OF SMT

= ϕ( | ) d(star − en − 1)e∗ argmaxe∏i=1

I

f i ei ti di−1

( | . . . )∏i=1

|e|

pLM ei e1 ei−1

Page 106: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

DECODINGGiven a model and translation model weneed to �nd a translation with the highest probability

but from exponential number of all possibletranslations.

pLM p(f|e)

Heuristic search methods are used. It is not guaranteedto �nd the best translation.

Errors in translations are caused by 1) decoding process, when the best translation is not

found owing to the heuristics or 2) models, where the best translation according to the

probability functions is not the best possible.

Page 107: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EXAMPLE OF NOISE-INDUCED ERRORS(GOOGLE TRANSLATE)

Rinneadh clárúchán an úsáideora yxc a eiteach gorathúil.

The user registration yxc made a successful rejection.

Rinneadh clárúchán an úsáideora qqq a eiteach gorathúil.

Qqq made registration a user successfully refused.

Page 108: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

PHRASE-WISE SENTENCE TRANSLATION

In each step of translation we count preliminary valuesof probabilities from the translation, reordering and

language models.

Page 109: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SEARCH SPACE OF TRANSLATIONHYPOTHESES

Exponential space of all possible translations → limit this space!

Page 110: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

HYPOTHESIS CONSTRUCTION, BEAMSEARCH

Page 111: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

BEAM SEARCHbreadth-�rst searchon each level of the tree: generate all children of nodes on that level, sort themaccording to various heuristicsstore only a limited number of the best states on eachlevel (beam width)only these states are investigated furtherthe wider beam the smaller number of children areprunedwith an unlimited width → breadth-�rst searchthe width correlates with memory consumptionthe best �nal state might not be found since it can bepruned

Page 112: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

NEURAL NETWORK MACHINETRANSLATION

very close to state-of-the-art (PBSMT)a problem: variable length input and outputlearning to translate and align at the same time

hot topic (2014, 2015)LISA

Page 113: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

NN MODELS IN MT

Page 114: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SUMMARY VECTOR FOR SENTENCES

Page 115: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

BIDIRECTIONAL RNN

Page 116: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

ATTENTION MECHANISMA neural network with a single hidden layer, a single

scalar output

Page 118: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

ALIGNMENT WITH CO-OCCURRENCESTATISTICS

Dice =2fxy

+fx fylogDice = 14 + Dlog2

biterms in SkE

Page 119: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MT QUALITY EVALUATIONOTHER MINOR TOPICS

Page 120: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

MOTIVATION FOR MT EVALUATION�uency: is the translation �uent, in a natural wordorder?adequacy: does the translation preserve meaning?intelligibility: do we understand the translation?

Page 121: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EVALUATION SCALEadequacy �uency

5 all meaning 5 �awless English

4 most meaning 4 good

3 much meaning 3 non-native

2 little meaning 2 dis-�uent

1 no meaning 1 incomprehensible

Page 122: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

DISADVANTAGES OF MANUALEVALUATION

slow, expensive, subjectiveinter-annotator agreement (IAA) shows people agreemore on �uency than on adequacyanother option: is X better than Y? → higher IAAor time spent on post-editingor how much cost of translation is reduced

Page 123: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

AUTOMATIC TRANSLATION EVALUATIONadvantages: speed, costdisadvantages: do we really measure quality of translation?gold standard: manually prepared reference translationscandidate is compared with reference translations the paradox of automatic evaluation: the task correspondsto situation where students are to assess their own exam:how they know where they made a mistake?various approaches: n-gram shared between and , editdistance, ...

c n ri

c ri

Page 124: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

RECALL AND PRECISION ON WORDS

precision = = = 50%correct

output-length

3

6

recall = = = 43%correct

reference-length

3

7

f-score = 2 × = 2 × = 46%precision × recall

precision + recall

.5 × .43

.5 + .43

Page 125: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

RECALL AND PRECISION:SHORTCOMINGS

metrics system A system B

precision 50% 100%

recall 43% 100%

f-score 46% 100%

It does not capture wrong word order.

Page 126: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

BLEUstandard metrics (2001)IBM, Papinenin-gram match between reference and candidatetranslationsprecision is calculated for 1-, 2- ,3- and 4-grams+ brevity penalty

BLEU = min(1, ) (output-length

reference-length∏i=1

4

precision

Page 127: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

BLEU: AN EXAMPLE

metrics system A system B

precision (1gram) 3/6 6/6

precision (2gram) 1/5 4/5

precision (3gram) 0/4 2/4

precision (4gram) 0/3 1/3

brevity penalty 6/7 6/7

BLEU 0% 52%

Page 128: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

NISTNIST: National Institute of Standards and Technologyweighted matches of n-grams (information value)very similar results as for BLEU (a variant)

Page 129: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

NEVANgram EVAluationBLEU score adapted for short sentencesit takes into account synonyms (stylistic richness)

Page 130: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

WAFTWord Accuracy for Translation

edit distance between and c r

WAFT = 1 − d+s+imax( , )lr lc

Page 131: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

TERTranslation Edit Ratethe least edit steps (deletion, insertion, swap,replacement)

dnes jsem si při fotbalu zlomil kotník při fotbalu jsem si dnes zlomil kotník

TER = ?

r =c =

TER =number of edits

avg. number of ref. words

Page 132: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

HTERHuman TER

manually prepared and then TER is appliedr

Page 133: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

METEORaligns hypotheses to one or more referencesexact, stem (morphology), synonym (WordNet),paraphrase matchesvarious scores including WMT ranking and NISTadequacyextended support for English, Czech, German, French,Spanish, and Arabic.high correlation with human judgments

Page 134: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EVALUATION OF EVALUATION METRICSCorrelation of automatic evaluation with manual

evaluation.

Page 135: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very
Page 137: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very
Page 138: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

EUROMATRIX II

Page 139: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

SYNTACTIC RULES EXTRACTION

Page 140: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very
Page 141: PA153 - Masaryk University · 2016. 12. 14. · VAUQUOIS'S TRIANGLE. MACHINE TRANSLATION NOWADAYS big companies (Microsoft) focused on English as SL large pairs (En:Sp, En:Fr): very

HYBRID SMT+RBMT, UFAL

TectoMT + Mosesbetter than Google Translate (En-Cz)

Chimera