
Page 1

Statistical Machine Translation

Part II: Decoding

Trevor Cohn, U. Sheffield

EXPERT winter school

November 2013

Some figures taken from Koehn 2009

Page 2

Recap

You’ve seen several models of translation

word-based models: IBM 1-5

phrase-based models

grammar-based models

Methods for

learning translation rules from bitexts

learning rule weights

learning several other features: language models,

reordering etc

Page 3

Decoding

Central challenge is to predict a good translation

Given text in the input language (f )

Generate translation in the output language (e)

Formally, e* = argmax_e p(e | f) = argmax_e p(f | e) · p(e)

where our model scores each candidate translation e using a translation model and a language model

A decoder is a search algorithm for finding e*

caveat: few modern systems use actual probabilities

Page 4

Outline

Decoding phrase-based models

linear model

dynamic programming approach

approximate beam search

Decoding grammar-based models

synchronous grammars

string-to-string decoding

Page 5

Decoding objective

Objective

where the model score, f, incorporates

translation frequencies for phrases

distortion cost based on (re)ordering

language model cost of m-grams in e

...

Problem of ambiguity

may be many different sequences of translation decisions mapping f to e

e.g. could translate word by word, or use larger units

Page 6

Decoding for derivations

A derivation is a sequence of translation decisions

can “read off” the input string f and output e

Define model over derivations not translations

aka Viterbi approximation

should sum over all derivations within the maximisation

instead we maximise for tractability

But see Blunsom, Cohn and Osborne (2008), who sum out derivational ambiguity (during training)

Page 7

Decoding

Includes a coverage constraint

all input words must be translated exactly once

preserves input information

Cf. ‘fertility’ in IBM word-based models

phrases license one-to-many mappings (insertions) and many-to-one mappings (deletions)

but limited to contiguous spans

Tractability effects on decoding
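As a tiny illustration of the coverage constraint, a sketch that checks a candidate derivation; the (src_start, src_end, output) span format is assumed purely for illustration:

```python
def covers_exactly_once(derivation, n_input_words):
    """Check that a derivation translates every input word exactly once.
    `derivation` is assumed to be a list of (src_start, src_end, output_phrase)."""
    counts = [0] * n_input_words
    for src_start, src_end, _ in derivation:
        for i in range(src_start, src_end):
            counts[i] += 1
    return all(c == 1 for c in counts)

# e.g. the derivation from the next slides: er | geht | ja nicht | nach hause
derivation = [(0, 1, "he"), (2, 4, "does not"), (1, 2, "go"), (4, 6, "home")]
print(covers_exactly_once(derivation, 6))   # True
```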

Page 8

Translation process

Translate this sentence

translate input words and “phrases”

reorder output to form target string

Derivation = sequence of phrases

1. er – he; 2. ja nicht – does not;

3. geht – go; 4. nach hause – home

Figure from Machine Translation Koehn 2009

Page 9

Generating process

er geht ja nicht nach hause

1: segment

2: translate

3: order

Consider the translation decisions in a derivation

Page 10

Generating process

er geht ja nicht nach hause

1: segment
2: translate
3: order

Page 11

Generating process

er geht ja nicht nach hause

1: segment
2: translate
3: order

he go does not home

Page 12

Generating process

er geht ja nicht nach hause

1: segment
2: translate
3: order

he go does not home
he does not go home

1: uniform cost (ignore)
2: TM probability
3: distortion cost & LM probability

Page 13

Generating process

er geht ja nicht nach hause

1: segment
2: translate
3: order

he go does not home
he does not go home

f = 0
  + φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → home)
  + ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(home | go) + d(2) + ψ(</S> | home)

Page 14

Linear Model

Assume a linear model, where

d is a derivation, i.e. a sequence of phrase pairs r_1 … r_K

φ(r_k) is the log conditional frequency of a phrase pair

d(·, ·) is the distortion cost for two consecutive phrases

ψ is the log language model probability

each component is scaled by a separate weight α

Often mistakenly referred to as log-linear
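A minimal sketch of this linear model, scoring a derivation as in the worked example a few slides back; the weights α, phrase table φ and bigram LM ψ are illustrative stand-ins, not values from a trained system:

```python
# Hedged sketch: score a derivation (phrase pairs in output order) under the
# linear model described above. alpha, phi and psi are illustrative only.
alpha = {"tm": 1.0, "lm": 0.5, "dist": 0.3}

def distortion(prev_end, next_start):
    # Pharaoh/Moses-style linear distortion penalty between consecutive phrases.
    return -abs(next_start - prev_end)

def score(derivation, phi, psi):
    """derivation: list of (src_start, src_end, output_phrase) in output order.
    phi[(src_start, src_end, output_phrase)]: log phrase-translation frequency.
    psi(word, prev_word): log bigram language model probability."""
    total, prev_end, prev_word = 0.0, 0, "<S>"
    for src_start, src_end, out in derivation:
        total += alpha["tm"] * phi[(src_start, src_end, out)]
        total += alpha["dist"] * distortion(prev_end, src_start)
        for word in out.split():
            total += alpha["lm"] * psi(word, prev_word)
            prev_word = word
        prev_end = src_end
    return total + alpha["lm"] * psi("</S>", prev_word)
```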

Page 15

Model components

Typically:

language model and word count

translation model(s)

distortion cost

Values of α learned by discriminative training (not covered today)

Page 16

Search problem

Given the translation options for each input phrase

1000s of possible output strings

he does not go home

it is not in house

yes he goes not to home …

Figure from Machine Translation Koehn 2009

Page 17

Search Complexity

Search space

Number of segmentations: 32 = 2^(6-1)

Number of permutations: 720 = 6!

Number of translation options: 4096 = 4^6

Multiplying gives 94,371,840 derivations

(calculation is naïve, giving loose upper bound)

How can we possibly search this space?

especially for longer input sentences
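For concreteness, the arithmetic behind the naive bound (6 input words, 4 translation options per word):

```python
import math

segmentations = 2 ** (6 - 1)      # segment-or-not choice at each of the 5 word gaps
permutations = math.factorial(6)  # orderings of (at most) 6 output phrases
options = 4 ** 6                  # 4 candidate translations per word
print(segmentations * permutations * options)   # 94,371,840
```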

Page 18

Search insight

Consider the sorted list of all derivations

he does not go after home

he does not go after house

he does not go home

he does not go to home

he does not go to house

he does not goes home

Many similar derivations, each with highly similar scores

Page 19

Search insight #1

he / does not / go / home

he / does not / go / to home

f = φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → home) + ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(home | go) + d(2) + ψ(</S> | home)

f = φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → to home) + ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(to | go) + ψ(home | to) + d(2) + ψ(</S> | home)

Page 20

Search insight #1

Consider all possible ways to finish the translation

Page 21

Search insight #1

Score ‘f’ factorises, with shared components across all options.

Can find best completion by maximising f.

Page 22

Search insight #2 Several partial translations can be finished the same way

Page 23

Search insight #2 Several partial translations can be finished the same way

Only need to consider maximal scoring partial translation

Page 24

Dynamic Programming Solution

Key ideas behind dynamic programming

factor out repeated computation

efficiently solve the maximisation problem

What are the key components for “sharing”?

don’t have to be exactly identical; need same:

set of untranslated words

right-most output words (the language model context)

last translated input word location

The decoding algorithm aims to exploit this
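A minimal sketch of the hypothesis state and its recombination key; field names are illustrative rather than Moses internals, and a bigram LM is assumed so a single boundary word stands in for the right-most output words:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hypothesis:
    covered: frozenset              # indices of input words translated so far
    last_word: str                  # right-most output word (bigram LM context)
    end_pos: int                    # input position where the last phrase ended
    score: float = 0.0
    back: Optional["Hypothesis"] = None   # back-pointer for reading off the output

    def key(self):
        # Hypotheses with the same key can be recombined: keep only the best one.
        return (self.covered, self.last_word, self.end_pos)
```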

Page 25

More formally

Consider the decoding maximisation over derivations, argmax_d f(d)

where d ranges over all derivations covering f

We can split max_d into max_{d1} max_{d2} …

move some ‘maxes’ inside the expression, over the elements not affected by that rule

bracket independent parts of the expression

Akin to Viterbi algorithm in HMMs, PCFGs
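The ‘move the maxes inside’ step is just the following identity, usable whenever part of the score does not depend on the remaining decisions (a generic sketch, not the slide's own equation):

```latex
\max_{d_1, d_2}\big[\, g(d_1) + h(d_1, d_2) \,\big]
  \;=\; \max_{d_1}\Big[\, g(d_1) + \max_{d_2} h(d_1, d_2) \,\Big]
```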

Page 26

Phrase-based Decoding

Start with empty state

Figure from Machine Translation Koehn 2009

Page 27

Phrase-based Decoding

Expand by choosing input span and generating translation

Figure from Machine Translation Koehn 2009

Page 28

Phrase-based Decoding

Consider all possible options to start the translation

Figure from Machine Translation Koehn 2009

Page 29

Phrase-based Decoding

Continue to expand states, visiting uncovered words and generating the output left to right.

Figure from Machine Translation Koehn 2009

Page 30

Phrase-based Decoding

Read off translation from best complete derivation by back-tracking

Figure from Machine Translation Koehn 2009

Page 31

Dynamic Programming

Recall that shared structure can be exploited

vertices with the same coverage, last output word, and input position are identical for subsequent scoring

Maximise over these paths

aka “recombination” in the MT literature (but really just dynamic programming)

Figure from Machine Translation Koehn 2009

Page 32

Complexity

Even with DP, search is still intractable

word-based and phrase-based decoding is NP-complete

Knight 99; Zaslavskiy, Dymetman, and Cancedda, 2009

whereas SCFG decoding is polynomial

Complexity arises from

reordering model allowing all permutations (limited: no more than 6 uncovered words)

many translation options (limited: no more than 20 translations per phrase)

coverage constraints, i.e., all words to be translated exactly once

Page 33

Pruning

Limit the size of the search graph by eliminating bad paths early

Pharaoh / Moses

divide partial derivations into stacks, based on the number of input words translated

limit the number of derivations in each stack (histogram pruning)

limit the score difference in each stack (threshold pruning)

Page 34

Stack based pruning

Algorithm iteratively “grows” from one stack to the next larger ones, while pruning the entries in each stack.

Figure from Machine Translation Koehn 2009
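A hedged sketch of this stack-decoding loop with histogram pruning and recombination, reusing the Hypothesis sketch from the dynamic programming slide; expand() stands in for ‘apply every applicable phrase pair’ and is not shown:

```python
import heapq

def stack_decode(n_input_words, initial, expand, stack_limit=100):
    # One stack per number of input words translated so far.
    stacks = [dict() for _ in range(n_input_words + 1)]
    stacks[0][initial.key()] = initial

    for n_covered in range(n_input_words):
        # Histogram pruning: keep only the best `stack_limit` hypotheses in this stack.
        # (Real decoders also add the future cost estimate of the next slide here,
        #  and usually apply threshold pruning as well.)
        survivors = heapq.nlargest(stack_limit, stacks[n_covered].values(),
                                   key=lambda h: h.score)
        for hyp in survivors:
            for new_hyp in expand(hyp):                  # translate one more phrase
                stack = stacks[len(new_hyp.covered)]
                old = stack.get(new_hyp.key())
                if old is None or new_hyp.score > old.score:
                    stack[new_hyp.key()] = new_hyp       # recombination
    # Best complete hypothesis; following back-pointers yields the translation.
    return max(stacks[n_input_words].values(), key=lambda h: h.score)

# e.g. initial = Hypothesis(covered=frozenset(), last_word="<S>", end_pos=0)
```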

Page 35

Future cost estimate

Higher scores for translating easy parts first

language model prefers common words

Early pruning will eliminate derivations starting with the difficult words

pruning must incorporate an estimate of the cost of translating the remaining words

“future cost estimate” assuming unigram LM and monotone translation

Related to A* search and admissible heuristics

but incurs search error (see Chang & Collins, 2011)
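A sketch of how such a future-cost table can be built; best_option_score(i, j) is an assumed helper returning the best TM + unigram-LM score of any single phrase pair covering input words i..j−1 (or −inf if none exists):

```python
def future_costs(n, best_option_score):
    """cost[i][j] = optimistic (unigram-LM, monotone) score for translating span [i, j)."""
    cost = [[float("-inf")] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cost[i][j] = best_option_score(i, j)          # cover i..j-1 with one phrase
            for k in range(i + 1, j):                      # ... or split it in two
                cost[i][j] = max(cost[i][j], cost[i][k] + cost[k][j])
    return cost
```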

Page 36

Beam search complexity

Limit the number of translation options per phrase to a constant (often 20)

# translations proportional to input sentence length

Stack pruning

number of entries & score ratio

Reordering limits

finite number of uncovered words (typically 6)

but see Lopez EACL 2009

Resulting complexity

O(stack size × sentence length)

Page 37

k-best outputs

Can recover not just the best solution

but also 2nd, 3rd etc best derivations

straightforward extension of beam search

Useful in discriminative training of feature weights, and other applications

Page 38

Alternatives for PBMT decoding

FST composition (Kumar & Byrne, 2005)

each process encoded in WFST or WFSA

simply compose automata, minimise and solve

A* search (Och, Ueffing & Ney, 2001)

Sampling (Arun et al, 2009)

Integer linear programming

Germann et al, 2001

Riedel & Clarke, 2009

Lagrangian relaxation

Chang & Collins, 2011

Page 39

Outline

Decoding phrase-based models

linear model

dynamic programming approach

approximate beam search

Decoding grammar-based models

tree-to-string decoding

string-to-string decoding

cube pruning

Page 40

Grammar-based decoding

Reordering in PBMT is poor, and must be limited

otherwise too many bad choices available

and inference is intractable

better if reordering decisions were driven by context

simple form of lexicalised reordering in Moses

Grammar based translation

consider hierarchical phrases with gaps (Chiang 05)

(re)ordering constrained by lexical context

inform process by generating syntax tree (Venugopal & Zollmann, 06; Galley et al, 06)

exploit input syntax (Mi, Huang & Liu, 08)

Page 41

Hierarchical phrase-based MT

yu Aozhou you bangjiao
have diplomatic relations with Australia

Standard PBMT: must ‘jump’ back and forth to obtain the correct ordering, guided primarily by the language model.

Hierarchical PBMT: the grammar rule yu X1 you X2 → have X2 with X1 encodes this common reordering, and also correlates yu … you with have … with.

Example from Chiang, CL 2007

Page 42

SCFG recap

Rules of form

can include aligned gaps

can include informative non-terminal categories (NN, NP, VP etc)

e.g. X → <yu X1 you X2, have X2 with X1>

Page 43

SCFG generation

Synchronous grammars generate parallel texts

Further:

applied to one text, can generate the other text

leverage efficient monolingual parsing algorithms

[Figure: a synchronous derivation that simultaneously generates "yu Aozhou you bangjiao" and "have diplomatic relations with Australia"]

Page 44

SCFG extraction from bitexts

Step 1: identify aligned phrase-pairs

Step 2: “subtract” out subsumed phrase-pairs

Page 45

Example grammar

X → <yu X1 you X2, have X2 with X1>
X → <Aozhou, Australia>
X → <bangjiao, diplomatic relations>
S → <X1, X1>

Page 46

Decoding as parsing

Consider only the foreign side of grammar

Source side of the grammar:
X → yu X1 you X2
X → Aozhou
X → bangjiao
S → X

[Figure: parsing the input "yu Aozhou you bangjiao" with these source-side rules]

Step 1: parse input text

Page 47

Step 2: translate

[Figure: the source-side parse tree of "yu Aozhou you bangjiao" is rewritten production by production into the target tree, yielding "have diplomatic relations with Australia"]

Traverse tree, replacing each input production with its highest scoring output side

Page 48

Chart parsing

Input: yu Aozhou you bangjiao (positions 0 1 2 3 4)

1. length = 1: X → Aozhou gives X1,2; X → bangjiao gives X3,4
2. length = 2: X → yu X gives X0,2; X → you X gives X2,4; S → X gives S0,2
4. length = 4: S → S X gives S0,4; X → yu X you X gives X0,4 (and S → X gives S0,4)

Two derivations yield S0,4 — take the one with maximum score

Page 49

Chart parsing for decoding

• starting at full sentence S0,J

• traverse down to find maximum score derivation

• translate each rule using the maximum scoring right-hand side

• emit the output string

[Figure: the chart over "yu Aozhou you bangjiao" (positions 0–4), with items X1,2, X3,4, X0,2, X2,4, X0,4, S0,2 and S0,4, as on the previous slide]
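A toy, self-contained sketch of this two-step procedure: parse the input with the source sides of the rules, then walk the best derivation replacing each production with its output side. The rules and scores are illustrative (roughly the example grammar from the earlier slides), not a real system's grammar:

```python
from collections import namedtuple

# Toy synchronous rules: lhs category, source RHS, target RHS, score.
# Non-terminals are ("NT", category, index) tuples; terminals are plain strings.
Rule = namedtuple("Rule", "lhs src tgt score")
NT = lambda cat, idx: ("NT", cat, idx)

rules = [
    Rule("X", ["Aozhou"], ["Australia"], -0.5),
    Rule("X", ["bangjiao"], ["diplomatic", "relations"], -0.7),
    Rule("X", ["yu", NT("X", 1), "you", NT("X", 2)],
              ["have", NT("X", 2), "with", NT("X", 1)], -1.0),
    Rule("S", [NT("X", 1)], [NT("X", 1)], 0.0),
]

def match(symbols, i, j, words, chart, score, kids=None):
    """Try to align the rule's source symbols with words[i:j]; yield (score, kids)."""
    kids = kids or {}
    if not symbols:
        if i == j:
            yield score, kids
        return
    sym, rest = symbols[0], symbols[1:]
    if isinstance(sym, str):                        # terminal: must match the next word
        if i < j and words[i] == sym:
            yield from match(rest, i + 1, j, words, chart, score, kids)
    else:                                           # non-terminal: try every split point
        _, cat, idx = sym
        for k in range(i + 1, j + 1):
            entry = chart.get((cat, i, k))
            if entry:
                yield from match(rest, k, j, words, chart,
                                 score + entry[0], {**kids, idx: (cat, i, k)})

def parse(words, rules):
    """Step 1: CYK-style parse over the source sides only.
    chart[(cat, i, j)] = (best score, rule, {nt index: child chart key}).
    Unary rules are handled naively, which is fine for this toy grammar."""
    chart = {}
    n = len(words)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for rule in rules:
                for score, kids in match(rule.src, i, j, words, chart, rule.score):
                    key = (rule.lhs, i, j)
                    if key not in chart or score > chart[key][0]:
                        chart[key] = (score, rule, kids)
    return chart

def realise(key, chart):
    """Step 2: read off the target string of the best derivation rooted at `key`."""
    _, rule, kids = chart[key]
    out = []
    for sym in rule.tgt:
        if isinstance(sym, str):
            out.append(sym)
        else:
            out.extend(realise(kids[sym[2]], chart))
    return out

words = "yu Aozhou you bangjiao".split()
chart = parse(words, rules)
print(" ".join(realise(("S", 0, len(words)), chart)))
# -> have diplomatic relations with Australia
```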

Page 50

LM intersection

Very efficient

cost of parsing, i.e., O(n³)

reduces to linear if we impose a maximum span limit

the translation step is a simple O(n) post-processing step

But what about the language model?

CYK assumes model scores decompose with the tree structure

but the language model must span constituents

Problem: LM doesn’t factorise!

Page 51

LM intersection via lexicalised NTs

Encode LM context in NT categories (Bar-Hillel et al, 1964), writing each non-terminal's left and right boundary output words in brackets:

X → <yu X1 you X2, have X2 with X1>
becomes
X[have‥b] → <yu X1[a‥b] you X2[c‥d], have X2[c‥d] with X1[a‥b]>

i.e. the left & right m−1 words of the output translation

When used in a parent rule, the LM can access the boundary words

the score now factorises with the tree
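A bigram-only (m = 2) sketch of why the score now factorises: each chart item carries its boundary output words, so combining children only ever creates new bigrams at the seams. The lm dictionary and the unseen-bigram floor are illustrative, not a real LM API:

```python
import math

lm = {("have", "diplomatic"): -1.2,      # illustrative log bigram probabilities
      ("relations", "with"): -0.9,
      ("with", "Australia"): -1.1}

def lm_logp(prev, word):
    return lm.get((prev, word), math.log(1e-4))   # crude floor for unseen bigrams

def combine(target_rhs, children):
    """target_rhs: list of terminal strings and child indices (ints).
    children: {index: (leftmost_word, rightmost_word)} for already-built items.
    Returns (leftmost, rightmost, lm_score_added) for the new parent item."""
    parts = [(sym, sym) if isinstance(sym, str) else children[sym]
             for sym in target_rhs]
    added = sum(lm_logp(right, left)                   # only the seams are new
                for (_, right), (left, _) in zip(parts, parts[1:]))
    return parts[0][0], parts[-1][1], added

# X -> <yu X1 you X2, have X2 with X1>, with X1 = "Australia", X2 = "diplomatic relations"
print(combine(["have", 2, "with", 1],
              {1: ("Australia", "Australia"), 2: ("diplomatic", "relations")}))
# -> ('have', 'Australia', ≈ -3.2)
```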

Page 52

LM intersection via lexicalised NTs

[Figure: the derivation of "yu Aozhou you bangjiao", with each non-terminal lexicalised by its output boundary words (e.g. X[Australia‥Australia], X[diplomatic‥relations]). Every production contributes its translation score φTM; the rule combining the child spans additionally contributes the LM terms ψ for the bigrams newly formed at the seams, and the root contributes ψ terms for the <S> and </S> sentence boundaries.]

Page 53

+LM Decoding

Same algorithm as before

Viterbi parse with input side grammar (CYK)

for each production, find best scoring output side

read off output string

But input grammar has blown up

number of non-terminals is O(T^(2(m−1))) — i.e. O(T²) for a bigram LM — where T is the target vocabulary size

overall translation complexity of O(n³ T^(4(m−1)))

Terrible!

Page 54

Beam search and pruning

Resort to beam search

prune poor entries from chart cells during CYK parsing

histogram, threshold as in phrase-based MT

rarely have sufficient context for LM evaluation

Cube pruning

uses a lower-order LM estimate as a search heuristic

follows an approximate ‘best first’ order for incorporating child spans into the parent rule

stops once the beam is full

For more details, see

Chiang, “Hierarchical phrase-based translation”. 2007. Computational Linguistics 33(2):201–228.
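A rough illustration of the best-first enumeration at the heart of cube pruning: lazily pop the k best combinations of two sorted lists (e.g. rules × child items) using a frontier heap. Real cube pruning adds the LM score, which breaks strict monotonicity, so the order is only approximately best-first; everything here is an illustrative sketch:

```python
import heapq

def k_best_combinations(rows, cols, score, k):
    """Pop up to k (row, col) pairs in (approximately) best-first order,
    assuming score(i, j) tends to decrease as i or j grows (sorted inputs)."""
    seen = {(0, 0)}
    frontier = [(-score(0, 0), 0, 0)]
    out = []
    while frontier and len(out) < k:
        neg, i, j = heapq.heappop(frontier)
        out.append(((rows[i], cols[j]), -neg))
        for ni, nj in ((i + 1, j), (i, j + 1)):        # push the two neighbours
            if ni < len(rows) and nj < len(cols) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (-score(ni, nj), ni, nj))
    return out

# e.g. rule scores and child item scores, both sorted best-first:
rules_scores = [-0.5, -1.0, -2.5]
item_scores = [-0.2, -0.8]
print(k_best_combinations(rules_scores, item_scores,
                          lambda i, j: rules_scores[i] + item_scores[j], k=3))
```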

Page 55

Further work

Synchronous grammar systems

SAMT (Venugopal & Zollmann, 2006)

ISI’s syntax system (Marcu et al., 2006)

HRG (Chiang et al., 2013)

Tree to string (Liu, Liu & Lin, 2006)

Probabilistic grammar induction

Blunsom & Cohn (2009)

Decoding and pruning

cube growing (Huang & Chiang, 2007)

left to right decoding (Huang & Mi, 2010)

Page 56

Summary

What we covered

word based translation and alignment

linear phrase-based and grammar-based models

phrase-based (finite state) decoding

synchronous grammar decoding

What we didn’t cover

rule extraction process

discriminative training

tree based models

domain adaptation

OOV translation