
Yulia Tsvetkov

Algorithms for NLP

IITP, Fall 2019

Lecture 21: Machine Translation I

Machine Translation

from Dream of the Red Chamber, Cao Xueqin (1792)

English: leg, foot, paw

French: jambe, pied, patte, étape

Challenges

Ambiguities:
▪ words
▪ morphology
▪ syntax
▪ semantics
▪ pragmatics

Gaps in data

▪ availability of corpora
▪ commonsense knowledge

+ Understanding of context, connotation, social norms, etc.

Classical Approaches to MT

▪ Direct translation: word-by-word dictionary translation
▪ Transfer approaches: word dictionary + rules

  source language text → morphological analysis → lexical transfer using bilingual dictionary → local reordering → morphological generation → target language text

Levels of Transfer

▪ Syntactic transfer
▪ Semantic transfer

Levels of Transfer: The Vauquois Triangle (1968)

[Figure: the Vauquois triangle, from direct word-level transfer through syntactic and semantic transfer up to an interlingua.]

Statistical approaches

Statistical MT

Modeling correspondences between languages

Sentence-aligned parallel corpus:

Yo lo haré mañana → I will do it tomorrow
Hasta pronto → See you soon
Hasta pronto → See you around

Novel sentence: Yo lo haré pronto

Possible outputs:
I will do it soon
I will do it around
See you tomorrow

Machine translation system: a model of translation learned from the sentence-aligned corpus

Research Problems

▪ How can we formalize the process of learning to translate from examples?

▪ How can we formalize the process of finding translations for new inputs?

▪ If our model produces many outputs, how do we find the best one?

▪ If we have a gold standard translation, how can we tell if our output is good or bad?

Two Views of MT

MT as Code Breaking


The Noisy-Channel Model

[Diagram: a source sentence w is sent through a noisy channel and observed as a; a decoder recovers the best guess of w.]

▪ We want to predict a sentence given acoustics:

  w* = argmax_w P(w | a)

▪ The noisy-channel approach:

  w* = argmax_w P(a | w) P(w)

  where P(a | w) is the channel model and P(w) is the source model

▪ Likelihood: the channel model, e.g. an acoustic model (HMMs) in speech recognition or a translation model in MT
▪ Prior: the source model, i.e. a language model: distributions over sequences of words (sentences)
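As a concrete illustration of the decoder's job, here is a toy sketch (not from the lecture) of noisy-channel decoding over an explicit candidate list; channel_logprob and lm_logprob are placeholder score functions standing in for whatever channel and language models are available, not a real API.

def noisy_channel_decode(observed_a, candidates, channel_logprob, lm_logprob):
    # Pick the candidate w maximizing P(a | w) * P(w), working in log space.
    return max(candidates,
               key=lambda w: channel_logprob(observed_a, w) + lm_logprob(w))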

MT as Direct Modeling

▪ one model does everything
▪ trained to reproduce a corpus of translations

Two Views of MT

▪ Code breaking (aka the noisy channel, Bayes rule)
▪ I know the target language
▪ I have example translated texts (example enciphered data)

▪ Direct modeling (aka pattern matching)
▪ I have really good learning algorithms and a bunch of example inputs (source language sentences) and outputs (target language translations)

Which is better?

▪ Noisy channel
▪ easy to use monolingual target language data
▪ search happens under a product of two models (individual models can be simple, product can be powerful)
▪ obtaining probabilities requires renormalizing

▪ Direct model
▪ directly model the process you care about
▪ model must be very powerful

Where are we in 2019?

▪ Direct modeling is where most of the action is
▪ Neural networks are very good at generalizing and conceptually very simple
▪ Inference in “product of two models” is hard

▪ Noisy channel ideas are incredibly important and still play a big role in how we think about translation

Two Views of MT

Noisy Channel: Phrase-Based MT

[Diagram: the phrase-based pipeline. A Translation Model over (source phrase, target phrase) pairs with translation features, learned from a parallel corpus; a Language Model over target word sequences, learned from a monolingual corpus; and a Reranking Model with feature weights, tuned on a held-out parallel corpus.]

Neural MT: Conditional Language Modeling

http://opennmt.net/

A common problem

Both models must assign probabilities to how a sentence in one language translates into a sentence in another language.

Learning from Data

Parallel corpora

Mining parallel data from microblogs (Ling et al. 2013)

http://opus.nlpl.eu

▪ There is a lot more monolingual data in the world than translated data

▪ Easy to get about 1 trillion words of English by crawling the web

▪ With some work, you can get 1 billion translated words of English-French
▪ What about Japanese-Turkish?

Phrase-Based MT

[Pipeline recap: Translation Model learned from the parallel corpus, Language Model from a monolingual corpus, Reranking Model tuned on a held-out parallel corpus, as in the earlier diagram.]

Phrase-Based System Overview

Sentence-aligned corpus → Word alignments → Phrase table (translation model)

cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
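As a small illustration (not part of the lecture), a sketch of reading lines in the "source ||| target ||| score" format shown above; real Moses-style phrase tables carry several feature scores per entry, but here we assume a single probability as in the toy example.

def load_phrase_table(lines):
    # Each line: "source phrase ||| target phrase ||| score"
    table = {}
    for line in lines:
        src, tgt, score = (field.strip() for field in line.split("|||"))
        table.setdefault(src, []).append((tgt, float(score)))
    return table

table = load_phrase_table(["cat ||| chat ||| 0.9", "the cat ||| le chat ||| 0.8"])
# table["cat"] -> [("chat", 0.9)]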

Many slides and examples from Philipp Koehn or John DeNero

Word Alignment Models

Lexical Translation

▪ How do we translate a word? Look it up in the dictionary

Haus — house, building, home, household, shell

▪ Multiple translations
▪ some more frequent than others
▪ different word senses, different registers, different inflections (?)
▪ house, home are common
▪ shell is specialized (the Haus of a snail is a shell)

How common is each?

Look at a parallel corpus (German text along with English translation)

Estimate Translation Probabilities

Maximum likelihood estimation: p_MLE(e | f) = count(f, e) / count(f)
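A minimal sketch of this count-and-normalize estimate, assuming we are given observed (foreign word, English translation) pairs, e.g. from a hand-aligned corpus; the toy counts below are illustrative, not real corpus statistics.

from collections import Counter

def mle_translation_probs(pairs):
    # p_MLE(e | f) = count(f, e) / count(f)
    joint = Counter(pairs)
    marginal = Counter(f for f, _ in pairs)
    return {(f, e): c / marginal[f] for (f, e), c in joint.items()}

probs = mle_translation_probs([("Haus", "house")] * 8 + [("Haus", "building")] * 2)
# probs[("Haus", "house")] -> 0.8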

Lexical Translation

▪ Goal: a model p(e | f), where e and f are complete English and Foreign sentences
▪ Lexical translation makes the following assumptions:
▪ Each word e_i in e is generated from exactly one word in f
▪ Thus, we have an alignment a_i that indicates which word e_i “came from”; specifically, it came from f_{a_i}
▪ Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word f_{a_i}

Lexical Translation

▪ Putting our assumptions together, we have: p(e, a | f) = p(a | f) × ∏_i t(e_i | f_{a_i})

The Alignment Function

▪ Alignments can be visualized by drawing links between two sentences, and they are represented as vectors of positions:

Reordering

▪ Words may be reordered during translation.

Word Dropping

▪ A source word may not be translated at all

Word Insertion

▪ Words may be inserted during translation
▪ English just does not have an equivalent
▪ But it must be explained - we typically assume every source sentence contains a NULL token

One-to-many Translation

▪ A source word may translate into more than one target word

Many-to-one Translation

▪ More than one source word may translate together as a unit (e.g. into a single target word); lexical translation cannot model this, since each target word is generated from exactly one source word

IBM Model 1

▪ Generative model: break up translation process into smaller steps

▪ Simplest possible lexical translation model
▪ Additional assumptions:
▪ All alignment decisions are independent
▪ The alignment distribution for each a_i is uniform over all source words and NULL

IBM Model 1

▪ Translation probability
▪ for a foreign sentence f = (f_1, ..., f_{l_f}) of length l_f
▪ to an English sentence e = (e_1, ..., e_{l_e}) of length l_e
▪ with an alignment of each English word e_j to a foreign word f_i according to the alignment function a : j → i

  p(e, a | f) = ε / (l_f + 1)^{l_e} × ∏_{j=1..l_e} t(e_j | f_{a(j)})

▪ the parameter ε is a normalization constant
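A short sketch of evaluating the formula above for a given alignment; the NULL token is prepended to the source so that a[j] = 0 means “aligned to NULL”, and t is assumed to be a dict of lexical probabilities t[(e_word, f_word)].

def model1_prob(e, f, a, t, epsilon=1.0):
    # p(e, a | f) = epsilon / (l_f + 1)^{l_e} * prod_j t(e_j | f_{a(j)})
    f = ["<NULL>"] + f                      # position 0 is the NULL token
    prob = epsilon / len(f) ** len(e)       # len(f) is now l_f + 1
    for j, e_word in enumerate(e):
        prob *= t[(e_word, f[a[j]])]
    return prob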

Example

Learning Lexical Translation Models

We would like to estimate the lexical translation probabilities t(e|f) from a parallel corpus

▪ ... but we do not have the alignments

▪ Chicken-and-egg problem
▪ if we had the alignments,
  → we could estimate the parameters of our generative model (MLE)
▪ if we had the parameters,
  → we could estimate the alignments

EM Algorithm

▪ Incomplete data
▪ if we had complete data, we could estimate the model
▪ if we had the model, we could fill in the gaps in the data

▪ Expectation Maximization (EM) in a nutshell

1. initialize model parameters (e.g. uniform, random)

2. assign probabilities to the missing data

3. estimate model parameters from completed data

4. iterate steps 2–3 until convergence

EM Algorithm

▪ Initial step: all alignments equally likely
▪ Model learns that, e.g., la is often aligned with the

EM Algorithm

▪ After one iteration
▪ Alignments, e.g., between la and the are more likely

EM Algorithm

▪ After another iteration
▪ It becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)

EM Algorithm

▪ Convergence
▪ Inherent hidden structure revealed by EM

EM Algorithm

▪ Parameter estimation from the aligned corpus

IBM Model 1 and EM

EM Algorithm consists of two steps

▪ Expectation-Step: apply the model to the data
▪ parts of the model are hidden (here: alignments)
▪ using the model, assign probabilities to possible values

▪ Maximization-Step: estimate the model from the data
▪ take assigned values as fact
▪ collect counts (weighted by lexical translation probabilities)
▪ estimate model from counts

▪ Iterate these steps until convergence

IBM Model 1 and EM

▪ We need to be able to compute:
▪ Expectation-Step: probability of alignments
▪ Maximization-Step: count collection

IBM Model 1 and EM

[Worked example on the t-table.]

▪ Applying the chain rule: p(a | e, f) = p(e, a | f) / p(e | f)

IBM Model 1 and EM: Expectation Step

▪ E-step: using the current t-table, assign a probability p(a | e, f) to every possible alignment
▪ This requires p(e | f) = Σ_a p(e, a | f); the trick is that this sum can be computed without enumerating all alignments (see the sketch below)
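The derivation behind the trick is not reproduced in the text above; the sketch below (illustrative, not the lecture's code) computes p(e | f) both by brute force and via the identity Σ_a ∏_j t(e_j | f_{a(j)}) = ∏_j Σ_i t(e_j | f_i). Here f is assumed to already include the NULL token at position 0, and t is a dict of lexical probabilities t[(e_word, f_word)].

import itertools

def p_e_given_f_bruteforce(e, f, t, epsilon=1.0):
    # Sum p(e, a | f) over every possible alignment (exponential in len(e)).
    total = 0.0
    for a in itertools.product(range(len(f)), repeat=len(e)):
        prob = epsilon / len(f) ** len(e)
        for j, i in enumerate(a):
            prob *= t[(e[j], f[i])]
        total += prob
    return total

def p_e_given_f_trick(e, f, t, epsilon=1.0):
    # Same quantity as a product of per-word sums (linear in len(e) * len(f)).
    prob = epsilon / len(f) ** len(e)
    for e_word in e:
        prob *= sum(t[(e_word, f_word)] for f_word in f)
    return prob

This is what makes the E-step tractable: the posterior for each target word factorizes, p(a(j) = i | e, f) = t(e_j | f_i) / Σ_{i'} t(e_j | f_{i'}).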

IBM Model 1 and EM: Maximization Step

▪ M-step: collect expected counts from the E-step alignment probabilities and update the t-table:

  p(the|la) = c(the|la)/c(la)

IBM Model 1 and EM: Pseudocode
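The pseudocode slide is not reproduced in the text; below is a minimal sketch of EM training for IBM Model 1 following the E-step / M-step description above. Function and variable names are illustrative rather than the lecture's notation; sentences are lists of tokens.

from collections import defaultdict

NULL = "<NULL>"

def train_model1(parallel_corpus, iterations=10):
    """parallel_corpus: list of (foreign_tokens, english_tokens) pairs."""
    # 1. Initialize t(e|f) uniformly over the English vocabulary.
    english_vocab = {e for _, es in parallel_corpus for e in es}
    uniform = 1.0 / len(english_vocab)
    t = defaultdict(lambda: uniform)              # t[(e, f)] ~ t(e|f)

    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(e, f)
        total = defaultdict(float)                # expected counts c(f)

        # 2. E-step: assign probabilities to the hidden alignments.
        for fs, es in parallel_corpus:
            fs = [NULL] + fs                      # every source sentence gets a NULL token
            for e in es:
                z = sum(t[(e, f)] for f in fs)    # normalizer (the "trick")
                for f in fs:
                    delta = t[(e, f)] / z         # posterior that e aligns to f
                    count[(e, f)] += delta
                    total[f] += delta

        # 3. M-step: re-estimate t(e|f) = c(e, f) / c(f) from expected counts.
        t = defaultdict(lambda: uniform,
                        {(e, f): c / total[f] for (e, f), c in count.items()})

    # 4. Iterate steps 2-3 until convergence (here: a fixed number of iterations).
    return t

For example, training on the toy pairs (la maison, the house) and (la fleur, the flower) pushes probability mass toward t(the | la), mirroring the EM walkthrough above.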

Convergence

IBM Model 1 (recap)

▪ Generative model; simplest possible lexical translation model
▪ All alignment decisions are independent; the alignment distribution for each a_i is uniform over all source words and NULL
▪ Translation probability: p(e, a | f) = ε / (l_f + 1)^{l_e} × ∏_{j=1..l_e} t(e_j | f_{a(j)}), where ε is a normalization constant

Example

Evaluating Alignment Models

▪ How do we measure quality of a word-to-word model?

▪ Method 1: use in an end-to-end translation system
▪ Hard to measure translation quality
▪ Option: human judges
▪ Option: reference translations (NIST, BLEU)
▪ Option: combinations (HTER)
▪ Actually, no one uses word-to-word models alone as TMs

▪ Method 2: measure quality of the alignments produced
▪ Easy to measure
▪ Hard to know what the gold alignments should be
▪ Often does not correlate well with translation quality (like perplexity in LMs)

Alignment Error Rate

▪ Compare predicted alignments A against gold-standard annotations with sure links S and possible links P (S ⊆ P)
▪ Precision = |A ∩ P| / |A|
▪ Recall = |A ∩ S| / |S|
▪ AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
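A small sketch of computing these quantities, assuming alignments are given as sets of (source position, target position) pairs, with S the sure links and P the possible links.

def alignment_error_rate(A, S, P):
    # Precision against possible links, recall against sure links, and AER.
    A, S, P = set(A), set(S), set(P)
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    aer = 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return precision, recall, aer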

Problems with Lexical Translation

▪ Complexity -- exponential in sentence length
▪ Weak reordering -- the output is not fluent
▪ Many local decisions -- error propagation

Recommended