Statistical Machine Translation
Part II: Decoding
Trevor Cohn, U. Sheffield
EXPERT winter school
November 2013
Some figures taken from Koehn 2009
Recap
You’ve seen several models of translation
word-based models: IBM 1-5
phrase-based models
grammar-based models
Methods for
learning translation rules from bitexts
learning rule weights
learning several other features: language models, reordering etc.
Decoding
Central challenge is to predict a good translation
Given text in the input language (f )
Generate translation in the output language (e)
Formally: e* = argmax_e p(e | f)
where our model scores each candidate translation e using a translation model and a language model
A decoder is a search algorithm for finding e*
caveat: few modern systems use actual probabilities
Outline
Decoding phrase-based models
linear model
dynamic programming approach
approximate beam search
Decoding grammar-based models
synchronous grammars
string-to-string decoding
Decoding objective
Objective: e* = argmax_e f(e)
where the model score f incorporates
translation frequencies for phrases
distortion cost based on (re)ordering
language model cost of m-grams in e
...
Problem of ambiguity
may be many different sequences of translation decisions mapping f to e
e.g. could translate word by word, or use larger units
Decoding for derivations
A derivation is a sequence of translation decisions
can “read off” the input string f and output e
Define model over derivations not translations
aka the Viterbi approximation
strictly, we should sum over all derivations within the maximisation
instead we maximise over derivations, for tractability
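In symbols (an illustrative notation: d is a derivation and yield(d) the translation it produces):
e* = argmax_e Σ_{d : yield(d) = e} p(d | f)  ≈  yield( argmax_d p(d | f) )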
But see Blunsom, Cohn and Osborne (2008)
sum out derivational ambiguity (during training)
Decoding
Includes a coverage constraint
all input words must be translated exactly once
preserves input information
Cf. ‘fertility’ in IBM word-based models
phrases license one-to-many mappings (insertions) and many-to-one (deletions)
but limited to contiguous spans
Tractability effects on decoding
Translation process
Translate this sentence
translate input words and “phrases”
reorder output to form target string
Derivation = sequence of phrases
1. er – he; 2. ja nicht – does not; 3. geht – go; 4. nach hause – home
Figure from Machine Translation Koehn 2009
Generating process
Consider the translation decisions in a derivation of “er geht ja nicht nach hause”:
1: segment    er | geht | ja nicht | nach hause
2: translate  he | go | does not | home
3: order      he does not go home
Cost of each step:
1: uniform cost (ignore)
2: TM probability
3: distortion cost & LM probability
f = 0
+ φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → home)
+ ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(home | go) + d(2) + ψ(</S> | home)
Linear Model
Assume a linear model over derivations:
f(d) = α_TM Σ_k φ(r_k) + α_d Σ_k d(r_k-1, r_k) + α_LM ψ(e)
where
d is a derivation, a sequence of phrase-pair applications r_1 … r_K
φ(r_k) is the log conditional frequency of a phrase pair
d(·,·) is the distortion cost for two consecutive phrases
ψ is the log language model probability
each component is scaled by a separate weight α
Often mistakenly referred to as log-linear
Model components
Typically:
language model and word count
translation model(s)
distortion cost
Values of α learned by discriminative training (not covered today)
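For concreteness, a minimal Python sketch of scoring one derivation under a linear model of this shape; the phrase table, LM stub, distortion formula and weights below are invented toy values, not those of any real system.

    # Hypothetical component tables; real systems load these from trained models.
    PHRASE_LOGPROB = {("er", "he"): -0.1, ("geht", "go"): -0.9,
                      ("ja nicht", "does not"): -0.4, ("nach hause", "home"): -0.6}

    def lm_logprob(words):
        # Stand-in for a real m-gram language model score ψ over the output string.
        return -0.5 * len(words)

    def score_derivation(derivation, alpha_tm=1.0, alpha_d=-1.0, alpha_lm=1.0):
        """Linear model f(d) for a derivation given as a list of
        (src_start, src_end, src_phrase, tgt_phrase), spans half-open."""
        f, prev_end, output = 0.0, 0, []
        for start, end, src, tgt in derivation:
            f += alpha_tm * PHRASE_LOGPROB[(src, tgt)]   # φ: phrase translation score
            f += alpha_d * abs(start - prev_end)         # d: distortion cost
            output.extend(tgt.split())
            prev_end = end
        f += alpha_lm * lm_logprob(output)               # ψ: language model score
        return f

    # The derivation from the running example:
    d = [(0, 1, "er", "he"), (2, 4, "ja nicht", "does not"),
         (1, 2, "geht", "go"), (4, 6, "nach hause", "home")]
    print(score_derivation(d))

Here score_derivation reproduces the φ + d + ψ decomposition from the running example, with the derivation given as half-open source spans.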
Search problem
Given the translation options, there are 1000s of possible output strings, e.g.
he does not go home
it is not in house
yes he goes not to home …
Figure from Machine Translation Koehn 2009
Search Complexity
Search space
Number of segmentations: 2^5 = 32
Number of permutations: 6! = 720
Number of translation options: 4^6 = 4096
Multiplying gives 94,371,840 derivations
(calculation is naïve, giving a loose upper bound)
How can we possibly search this space?
especially for longer input sentences
Search insight
Consider the sorted list of all derivations
…
he does not go after home
he does not go after house
he does not go home
he does not go to home
he does not go to house
he does not goes home
…
Many similar derivations, each with highly similar scores
Search insight #1
he / does not / go / home
he / does not / go / to home
f = φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → home) + ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(home | go) + d(2) + ψ(</S> | home)
f = φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → to home) + ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(to | go) + ψ(home | to) + d(2) + ψ(</S> | home)
Search insight #1
Consider all possible ways to finish the translation
Score ‘f’ factorises, with shared components across all options.
Can find best completion by maximising f.
Search insight #2
Several partial translations can be finished the same way
Only need to consider the maximal scoring partial translation
Dynamic Programming Solution
Key ideas behind dynamic programming
factor out repeated computation
efficiently solve the maximisation problem
What are the key components for “sharing”?
partial derivations don’t have to be exactly identical; they need the same:
set of untranslated words
right-most output words
last translated input word location
The decoding algorithm aims to exploit this
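As a sketch, that shared state might be packaged like this (the Hypothesis container and its field names are illustrative, not Moses internals):

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        coverage: frozenset   # set of input word positions already translated
        last_words: tuple     # right-most m-1 output words, needed by the LM
        last_end: int         # end position of the last translated input phrase
        score: float = 0.0
        backpointer: object = None   # previous hypothesis, for reading off the output

        def key(self):
            # Two hypotheses agreeing on these three items will be scored
            # identically from here on, so only the best per key is kept.
            return (self.coverage, self.last_words, self.last_end)

    def recombine(hypotheses):
        best = {}
        for h in hypotheses:
            k = h.key()
            if k not in best or h.score > best[k].score:
                best[k] = h
        return list(best.values())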
More formally
Considering the decoding maximisation
where d ranges over all derivations covering f
We can split max_d into max_{d1} max_{d2} …
move some ‘maxes’ inside the expression, over elements not affected by that rule
bracket independent parts of the expression
Akin to Viterbi algorithm in HMMs, PCFGs
Phrase-based Decoding
Start with empty state
Figure from Machine Translation Koehn 2009
Phrase-based Decoding
Expand by choosing input span and generating translation
Figure from Machine Translation Koehn 2009
Phrase-based Decoding
Consider all possible options to start the translation
Figure from Machine Translation Koehn 2009
Phrase-based Decoding
Continue to expand states, visiting uncovered words and generating the output left to right.
Figure from Machine Translation Koehn 2009
Phrase-based Decoding
Read off translation from best complete derivation by back-tracking
Figure from Machine Translation Koehn 2009
Dynamic Programming
Recall that shared structure can be exploited
vertices with the same coverage, last output word, and input position are identical for subsequent scoring
Maximise over these paths
aka “recombination” in the MT literature (but really just dynamic programming)
Figure from Machine Translation Koehn 2009
Complexity
Even with DP, search is still intractable
word-based and phrase-based decoding is NP-complete
(Knight, 1999; Zaslavskiy, Dymetman, and Cancedda, 2009)
whereas SCFG decoding is polynomial
Complexity arises from
the reordering model allowing all permutations (limit: no more than 6 uncovered words)
many translation options (limit: no more than 20 translations per phrase)
coverage constraints, i.e., all words to be translated once
Pruning
Limit the size of the search graph by eliminating bad paths early
Pharaoh / Moses
divide partial derivations into stacks, based on the number of input words translated
limit the number of derivations in each stack (histogram pruning)
limit the score difference within each stack (threshold pruning)
Stack based pruning
Algorithm iteratively “grows” from one stack to the next larger ones, while pruning the entries in each stack.
Figure from Machine Translation Koehn 2009
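A much-simplified, self-contained stack decoder over the running example, with histogram pruning only; the phrase table, the order-insensitive LM stub, the weights, and the absence of recombination, threshold pruning and reordering limits are all simplifications for illustration:

    from collections import namedtuple

    SRC = "er geht ja nicht nach hause".split()
    PHRASES = {  # (start, end) -> list of (target phrase, log score φ); toy values
        (0, 1): [("he", -0.1), ("it", -0.8)],
        (1, 2): [("go", -0.9), ("goes", -1.1)],
        (2, 4): [("does not", -0.4)],
        (2, 3): [("yes", -2.0)],
        (3, 4): [("not", -0.7)],
        (4, 6): [("home", -0.6), ("to home", -0.9)],
    }
    Hyp = namedtuple("Hyp", "coverage last_end output score")

    def lm(words):
        # stand-in for an m-gram language model over the new output words
        return -0.3 * len(words)

    def decode(stack_limit=10):
        stacks = [[] for _ in range(len(SRC) + 1)]   # one stack per number of covered words
        stacks[0].append(Hyp(frozenset(), 0, (), 0.0))
        for n in range(len(SRC)):
            # histogram pruning: keep only the best stack_limit hypotheses per stack
            stacks[n] = sorted(stacks[n], key=lambda h: -h.score)[:stack_limit]
            for h in stacks[n]:
                for (s, e), options in PHRASES.items():
                    if any(i in h.coverage for i in range(s, e)):
                        continue                             # coverage constraint
                    for tgt, phi in options:
                        words = tuple(tgt.split())
                        score = h.score + phi - abs(s - h.last_end) + lm(words)
                        new = Hyp(h.coverage | frozenset(range(s, e)), e,
                                  h.output + words, score)
                        stacks[len(new.coverage)].append(new)
        best = max(stacks[len(SRC)], key=lambda h: h.score)
        return " ".join(best.output)

    print(decode())

With this order-insensitive LM stub the monotone order wins; a real language model is what pushes the search towards “he does not go home”.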
Future cost estimate
Higher scores for translating easy parts first
language model prefers common words
Early pruning will eliminate derivations starting with the difficult words
pruning must incorporate an estimate of the cost of translating the remaining words
“future cost estimate” assuming a unigram LM and monotone translation
Related to A* search and admissible heuristics
but incurs search error (see Chang & Collins, 2011)
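A sketch of computing such a future-cost table over input spans: take the best single-phrase estimate for a span, or the best way of splitting it in two. The per-span numbers below are invented; values are log scores, higher is better.

    # cheapest_option[(i, j)] is a made-up best TM+LM estimate for translating
    # words i..j-1 with a single phrase (unigram LM, monotone translation).
    cheapest_option = {(0, 1): -0.4, (1, 2): -1.2, (2, 3): -2.3, (3, 4): -1.0,
                       (2, 4): -0.9, (4, 6): -1.1, (4, 5): -1.5, (5, 6): -1.4}

    def future_cost_table(n):
        cost = {}
        for length in range(1, n + 1):
            for i in range(0, n - length + 1):
                j = i + length
                best = cheapest_option.get((i, j), float("-inf"))
                # or split the span in two and combine the best estimates
                for k in range(i + 1, j):
                    best = max(best, cost[(i, k)] + cost[(k, j)])
                cost[(i, j)] = best
        return cost

    table = future_cost_table(6)
    print(table[(0, 6)])   # estimated score for translating the whole sentence

During search, a hypothesis is then compared on its own score plus the future cost of its still-uncovered spans.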
Beam search complexity
Limit the number of translation options per phrase to a constant (often 20)
# translations proportional to input sentence length
Stack pruning
number of entries & score ratio
Reordering limits
finite number of uncovered words (typically 6)
but see Lopez EACL 2009
Resulting complexity
O( stack size x sentence length )
k-best outputs
Can recover not just the best solution
but also 2nd, 3rd etc best derivations
straightforward extension of beam search
Useful in discriminative training of feature weights, and other applications
Alternatives for PBMT decoding
FST composition (Kumar & Byrne, 2005)
each process encoded as a WFST or WFSA
simply compose automata, minimise and solve
A* search (Och, Ueffing & Ney, 2001)
Sampling (Arun et al, 2009)
Integer linear programming
Germann et al, 2001
Riedel & Clarke, 2009
Lagrangian relaxation
Chang & Collins, 2011
Outline
Decoding phrase-based models
linear model
dynamic programming approach
approximate beam search
Decoding grammar-based models
tree-to-string decoding
string-to-string decoding
cube pruning
Grammar-based decoding
Reordering in PBMT is poor and must be limited
otherwise too many bad choices available
and inference is intractable
better if reordering decisions were driven by context
simple form of lexicalised reordering in Moses
Grammar based translation
consider hierarchical phrases with gaps (Chiang 05)
(re)ordering constrained by lexical context
inform process by generating syntax tree (Venugopal & Zollmann, 06; Galley et al, 06)
exploit input syntax (Mi, Huang & Liu, 08)
Hierarchical phrase-based MT
yu Aozhou you bangjiao ↔ have diplomatic relations with Australia
Standard PBMT: must ‘jump’ back and forth to obtain the correct ordering, guided primarily by the language model.
Hierarchical PBMT: a grammar rule encodes this common reordering: yu X1 you X2 → have X2 with X1
also correlates yu … you and have … with.
Example from Chiang, CL 2007
SCFG recap
Rules of the form X → <yu X1 you X2, have X2 with X1>
can include aligned gaps
can include informative non-terminal categories (NN, NP, VP etc)
SCFG generation
Synchronous grammars generate parallel texts
Further:
applied to one text, can generate the other text
leverage efficient monolingual parsing algorithms
(Figure: applying X → <yu X1 you X2, have X2 with X1> with X1 → <Aozhou, Australia> and X2 → <bangjiao, dipl. relations> generates both sides of the parallel pair.)
SCFG extraction from bitexts
Step 1: identify aligned phrase-pairs
Step 2: “subtract” out subsumed phrase-pairs
Example grammar
X → <yu X1 you X2, have X2 with X1>
X → <Aozhou, Australia>
X → <bangjiao, diplomatic relations>
S → <X1, X1>
Decoding as parsing
Consider only the foreign (input) side of the grammar:
X → yu X1 you X2
X → Aozhou
X → bangjiao
S → X
Step 1: parse the input text “yu Aozhou you bangjiao” with this input-side grammar
Step 2: translate: traverse the tree, replacing each input production with its highest scoring output side, yielding “have diplomatic relations with Australia”
Chart parsing
Input: yu0 Aozhou1 you2 bangjiao3 (span positions 0 1 2 3 4)
1. length = 1: X → Aozhou gives X1,2; X → bangjiao gives X3,4
2. length = 2: X → yu X gives X0,2; X → you X gives X2,4; S → X gives S0,2
4. length = 4: X → yu X you X gives X0,4; then S → X, and also S → S X (from S0,2 and X2,4), give S0,4
Two derivations yield S0,4; take the one with maximum score
Chart parsing for decoding
• starting at full sentence S0,J
• traverse down to find maximum score derivation
• translate each rule using the maximum scoring right-hand side
• emit the output string
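A toy end-to-end sketch of this procedure: parse the input with the source side of a small SCFG (here with a recursive matcher rather than bottom-up CYK), then read the translation off the best derivation. The grammar, rule scores and the simplification that every gap is category X are all illustrative.

    RULES = [  # (LHS, source RHS, target RHS, log score); "X1"/"X2" link the gaps
        ("X", ("yu", "X1", "you", "X2"), ("have", "X2", "with", "X1"), -1.0),
        ("X", ("Aozhou",), ("Australia",), -0.2),
        ("X", ("bangjiao",), ("diplomatic", "relations"), -0.3),
        ("S", ("X1",), ("X1",), 0.0),
    ]
    SENT = "yu Aozhou you bangjiao".split()

    def parse(lhs, i, j, memo={}):
        """Best (score, rule, child derivations) for lhs spanning SENT[i:j]."""
        key = (lhs, i, j)
        if key in memo:
            return memo[key]
        best = None
        for rule in RULES:
            head, src, _, w = rule
            if head != lhs:
                continue
            for score, kids in match(src, i, j):
                cand = (w + score, rule, kids)
                if best is None or cand[0] > best[0]:
                    best = cand
        memo[key] = best
        return best

    def match(symbols, i, j):
        """Yield (score, {gap: derivation}) for ways symbols can cover SENT[i:j]."""
        if not symbols:
            if i == j:
                yield 0.0, {}
            return
        sym, rest = symbols[0], symbols[1:]
        if sym.startswith("X") and sym[1:].isdigit():   # a linked non-terminal gap
            for k in range(i + 1, j + 1):
                sub = parse("X", i, k)                  # toy: every gap is category X
                if sub is None:
                    continue
                for score, kids in match(rest, k, j):
                    yield sub[0] + score, {sym: sub, **kids}
        elif i < j and SENT[i] == sym:                  # a source terminal
            yield from match(rest, i + 1, j)

    def realise(deriv):
        """Emit the target side of the best derivation."""
        _, (_, _, tgt, _), kids = deriv
        out = []
        for sym in tgt:
            out.extend(realise(kids[sym]) if sym in kids else [sym])
        return out

    best = parse("S", 0, len(SENT))
    print(" ".join(realise(best)))   # -> have diplomatic relations with Australia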
LM intersection
Very efficient
cost of parsing, i.e., O(n^3)
reduces to linear if we impose a maximum span limit
the translation step is a simple O(n) post-processing step
But what about the language model?
CYK assumes model scores decompose with the tree structure
but the language model must span constituents
Problem: LM doesn’t factorise!
LM intersection via lexicalised NTs
Encode LM context in NT categories (Bar-Hillel et al, 1964)
X → <yu X1 you X2, have X2 with X1> becomes
haveXb → <yu aXb1 you cXd2, have cXd2 with aXb1>
where the decorations record the left & right m-1 words of each NT’s output translation
When used in parent rule, LM can access boundary words
score now factorises with tree
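A tiny sketch of what this buys us: if every chart item carries its boundary words, two items can be combined and the missing LM score applied just at the seam (bigram case; the lm() stub and the numbers are made up).

    def lm(prev, word):
        return -0.5                      # stand-in for log P(word | prev)

    def combine(left, right):
        """Each item is (score, first_word, last_word) of its output string."""
        lscore, lfirst, llast = left
        rscore, rfirst, rlast = right
        seam = lm(llast, rfirst)         # the only LM score not yet accounted for
        return (lscore + rscore + seam, lfirst, rlast)

    print(combine((-1.2, "have", "relations"), (-0.7, "with", "Australia")))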
LM intersection via lexicalised NTs
(Figure: the same derivation with plain NTs on the left and lexicalised NTs on the right, e.g. AustraliaXAustralia over Aozhou and diplomaticXrelations over bangjiao; every production contributes its φTM score, and the boundary words in the lexicalised NTs let the ψ terms, including the <S> and </S> boundary scores, be attached to the productions where the words meet.)
+LM Decoding
Same algorithm as before
Viterbi parse with input side grammar (CYK)
for each production, find best scoring output side
read off output string
But input grammar has blown up
number of non-terminals is O(T^(2(m-1)))
overall translation complexity of O(n^3 T^(4(m-1)))
Terrible!
Beam search and pruning
Resort to beam search
prune poor entries from chart cells during CYK parsing
histogram, threshold as in phrase-based MT
rarely have sufficient context for LM evaluation
Cube pruning
uses a lower-order LM estimate as a search heuristic
follows an approximate ‘best first’ order for incorporating child spans into the parent rule
stops once beam is full
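A toy sketch of the lazy best-first enumeration at the heart of cube pruning: pop the k best combinations of (rule, left child, right child) from a heap instead of enumerating the whole cube. Here the combined score is just the sum, so the order is exact; in a real decoder the LM contribution makes it only approximately best-first. The score lists are invented.

    import heapq

    def cube_prune(rule_scores, left_scores, right_scores, k):
        """Pop up to k best sums rule+left+right without building the full cube.
        All three input lists must be sorted in decreasing order."""
        seen = {(0, 0, 0)}
        # max-heap via negated scores
        heap = [(-(rule_scores[0] + left_scores[0] + right_scores[0]), 0, 0, 0)]
        out = []
        while heap and len(out) < k:
            neg, r, l, rt = heapq.heappop(heap)
            out.append(-neg)
            # push the three neighbouring cells of the cube
            for dr, dl, drt in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                nr, nl, nrt = r + dr, l + dl, rt + drt
                if (nr < len(rule_scores) and nl < len(left_scores)
                        and nrt < len(right_scores) and (nr, nl, nrt) not in seen):
                    seen.add((nr, nl, nrt))
                    heapq.heappush(heap, (-(rule_scores[nr] + left_scores[nl]
                                            + right_scores[nrt]), nr, nl, nrt))
        return out

    # e.g. a beam of size 5 over a 4 x 3 x 3 cube of candidate combinations
    print(cube_prune([-0.1, -0.5, -0.9, -2.0], [-0.2, -0.4, -1.1],
                     [-0.3, -0.6, -0.7], 5))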
For more details, see Chiang, “Hierarchical phrase-based translation”, Computational Linguistics 33(2):201–228, 2007.
Further work
Synchronous grammar systems
SAMT (Venugopal & Zollmann, 2006)
ISI’s syntax system (Marcu et al., 2006)
HRGG (Chiang et al., 2013)
Tree to string (Liu, Liu & Lin, 2006)
Probabilistic grammar induction
Blunsom & Cohn (2009)
Decoding and pruning
cube growing (Huang & Chiang, 2007)
left to right decoding (Huang & Mi, 2010)
Summary
What we covered
word-based translation and alignment
linear phrase-based and grammar-based models
phrase-based (finite state) decoding
synchronous grammar decoding
What we didn’t cover
rule extraction process
discriminative training
tree-based models
domain adaptation
OOV translation
…