Statistical Machine Translation
Part II: Decoding
Trevor Cohn, U. Sheffield
EXPERT winter school
November 2013
Some figures taken from Koehn 2009
Recap
You’ve seen several models of translation
word-based models: IBM 1-5
phrase-based models
grammar-based models
Methods for
learning translation rules from bitexts
learning rule weights
learning several other features: language models, reordering etc.
Decoding
Central challenge is to predict a good translation
Given text in the input language (f )
Generate translation in the output language (e)
Formally: e* = argmax_e p(e | f)
where our model scores each candidate translation e using a translation model and a language model
A decoder is a search algorithm for finding e*
caveat: few modern systems use actual probabilities
Outline
Decoding phrase-based models
linear model
dynamic programming approach
approximate beam search
Decoding grammar-based models
synchronous grammars
string-to-string decoding
Decoding objective
Objective: e* = argmax_e f(e)
where the model score f incorporates
translation frequencies for phrases
distortion cost based on (re)ordering
language model cost of m-grams in e
...
Problem of ambiguity
may be many different sequences of translation decisions mapping f to e
e.g. could translate word by word, or use larger units
Decoding for derivations
A derivation is a sequence of translation decisions
can “read off” the input string f and output e
Define model over derivations not translations
aka the Viterbi approximation
strictly, we should sum over all derivations within the maximisation
instead we maximise over derivations, for tractability
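In symbols (an illustrative notation: d is a derivation and yield(d) the translation it produces):
e* = argmax_e Σ_{d : yield(d) = e} p(d | f)  ≈  yield( argmax_d p(d | f) )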
But see Blunsom, Cohn and Osborne (2008)
sum out derivational ambiguity (during training)
Decoding
Includes a coverage constraint
all input words must be translated exactly once
preserves input information
Cf. ‘fertility’ in IBM word-based models
phrases license one-to-many mappings (insertions) and many-to-one (deletions)
but limited to contiguous spans
Tractability effects on decoding
Translation process
Translate this sentence
translate input words and “phrases”
reorder output to form target string
Derivation = sequence of phrases
1. er – he; 2. ja nicht – does not; 3. geht – go; 4. nach hause – home
Figure from Machine Translation Koehn 2009
Generating process
Consider the translation decisions in a derivation of “er geht ja nicht nach hause”:
1: segment    er | geht | ja nicht | nach hause
2: translate  he | go | does not | home
3: order      he does not go home
Cost of each step:
1: uniform cost (ignore)
2: TM probability
3: distortion cost & LM probability
f = 0
+ φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → home)
+ ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(home | go) + d(2) + ψ(</S> | home)
Linear Model
Assume a linear model over derivations:
f(d) = α_TM Σ_k φ(r_k) + α_d Σ_k d(r_k-1, r_k) + α_LM ψ(e)
where
d is a derivation, a sequence of phrase-pair applications r_1 … r_K
φ(r_k) is the log conditional frequency of a phrase pair
d(·,·) is the distortion cost for two consecutive phrases
ψ is the log language model probability
each component is scaled by a separate weight α
Often mistakenly referred to as log-linear
Model components
Typically:
language model and word count
translation model(s)
distortion cost
Values of α learned by discriminative training (not covered today)
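For concreteness, a minimal Python sketch of scoring one derivation under a linear model of this shape; the phrase table, LM stub, distortion formula and weights below are invented toy values, not those of any real system.

    # Hypothetical component tables; real systems load these from trained models.
    PHRASE_LOGPROB = {("er", "he"): -0.1, ("geht", "go"): -0.9,
                      ("ja nicht", "does not"): -0.4, ("nach hause", "home"): -0.6}

    def lm_logprob(words):
        # Stand-in for a real m-gram language model score ψ over the output string.
        return -0.5 * len(words)

    def score_derivation(derivation, alpha_tm=1.0, alpha_d=-1.0, alpha_lm=1.0):
        """Linear model f(d) for a derivation given as a list of
        (src_start, src_end, src_phrase, tgt_phrase), spans half-open."""
        f, prev_end, output = 0.0, 0, []
        for start, end, src, tgt in derivation:
            f += alpha_tm * PHRASE_LOGPROB[(src, tgt)]   # φ: phrase translation score
            f += alpha_d * abs(start - prev_end)         # d: distortion cost
            output.extend(tgt.split())
            prev_end = end
        f += alpha_lm * lm_logprob(output)               # ψ: language model score
        return f

    # The derivation from the running example:
    d = [(0, 1, "er", "he"), (2, 4, "ja nicht", "does not"),
         (1, 2, "geht", "go"), (4, 6, "nach hause", "home")]
    print(score_derivation(d))

Here score_derivation reproduces the φ + d + ψ decomposition from the running example, with the derivation given as half-open source spans.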
Search problem
Given the translation options, there are 1000s of possible output strings, e.g.
he does not go home
it is not in house
yes he goes not to home …
Figure from Machine Translation Koehn 2009
Search Complexity
Search space
Number of segmentations: 2^5 = 32
Number of permutations: 6! = 720
Number of translation options: 4^6 = 4096
Multiplying gives 94,371,840 derivations
(calculation is naïve, giving a loose upper bound)
How can we possibly search this space?
especially for longer input sentences
Search insight
Consider the sorted list of all derivations
…
he does not go after home
he does not go after house
he does not go home
he does not go to home
he does not go to house
he does not goes home
…
Many similar derivations, each with highly similar scores
Search insight #1
he / does not / go / home
he / does not / go / to home
f = φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → home) + ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(home | go) + d(2) + ψ(</S> | home)
f = φ(er → he) + φ(geht → go) + φ(ja nicht → does not) + φ(nach hause → to home) + ψ(he | <S>) + d(0) + ψ(does | he) + ψ(not | does) + d(1) + ψ(go | not) + d(-3) + ψ(to | go) + ψ(home | to) + d(2) + ψ(</S> | home)
Search insight #1
Consider all possible ways to finish the translation
Score ‘f’ factorises, with shared components across all options.
Can find best completion by maximising f.
Search insight #2
Several partial translations can be finished the same way
Only need to consider the maximal scoring partial translation
Dynamic Programming Solution
Key ideas behind dynamic programming
factor out repeated computation
efficiently solve the maximisation problem
What are the key components for “sharing”?
partial derivations don’t have to be exactly identical; they need the same:
set of untranslated words
right-most output words
last translated input word location
The decoding algorithm aims to exploit this
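As a sketch, that shared state might be packaged like this (the Hypothesis container and its field names are illustrative, not Moses internals):

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        coverage: frozenset   # set of input word positions already translated
        last_words: tuple     # right-most m-1 output words, needed by the LM
        last_end: int         # end position of the last translated input phrase
        score: float = 0.0
        backpointer: object = None   # previous hypothesis, for reading off the output

        def key(self):
            # Two hypotheses agreeing on these three items will be scored
            # identically from here on, so only the best per key is kept.
            return (self.coverage, self.last_words, self.last_end)

    def recombine(hypotheses):
        best = {}
        for h in hypotheses:
            k = h.key()
            if k not in best or h.score > best[k].score:
                best[k] = h
        return list(best.values())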
More formally
Considering the decoding maximisation
where d ranges over all derivations covering f
We can split max_d into max_{d1} max_{d2} …
move some ‘maxes’ inside the expression, over elements not affected by that rule
bracket independent parts of the expression
Akin to Viterbi algorithm in HMMs, PCFGs
Phrase-based Decoding
Start with empty state
Figure from Machine Translation Koehn 2009
Phrase-based Decoding
Expand by choosing input span and generating translation
Figure from Machine Translation Koehn 2009
Phrase-based Decoding
Consider all possible options to start the translation
Figure from Machine Translation Koehn 2009
Phrase-based Decoding
Continue to expand states, visiting uncovered words and generating the output left to right.
Figure from Machine Translation Koehn 2009
Phrase-based Decoding
Read off translation from best complete derivation by back-tracking
Figure from Machine Translation Koehn 2009
Dynamic Programming
Recall that shared structure can be exploited
vertices with the same coverage, last output word, and input position are identical for subsequent scoring
Maximise over these paths
aka “recombination” in the MT literature (but really just dynamic programming)
Figure from Machine Translation Koehn 2009
Complexity
Even with DP, search is still intractable
word-based and phrase-based decoding is NP-complete
(Knight, 1999; Zaslavskiy, Dymetman, and Cancedda, 2009)
whereas SCFG decoding is polynomial
Complexity arises from
the reordering model allowing all permutations (limit: no more than 6 uncovered words)
many translation options (limit: no more than 20 translations per phrase)
coverage constraints, i.e., all words to be translated once
Pruning
Limit the size of the search graph by eliminating bad paths early
Pharaoh / Moses
divide partial derivations into stacks, based on the number of input words translated
limit the number of derivations in each stack (histogram pruning)
limit the score difference within each stack (threshold pruning)
Stack based pruning
Algorithm iteratively “grows” from one stack to the next larger ones, while pruning the entries in each stack.
Figure from Machine Translation Koehn 2009
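A much-simplified, self-contained stack decoder over the running example, with histogram pruning only; the phrase table, the order-insensitive LM stub, the weights, and the absence of recombination, threshold pruning and reordering limits are all simplifications for illustration:

    from collections import namedtuple

    SRC = "er geht ja nicht nach hause".split()
    PHRASES = {  # (start, end) -> list of (target phrase, log score φ); toy values
        (0, 1): [("he", -0.1), ("it", -0.8)],
        (1, 2): [("go", -0.9), ("goes", -1.1)],
        (2, 4): [("does not", -0.4)],
        (2, 3): [("yes", -2.0)],
        (3, 4): [("not", -0.7)],
        (4, 6): [("home", -0.6), ("to home", -0.9)],
    }
    Hyp = namedtuple("Hyp", "coverage last_end output score")

    def lm(words):
        # stand-in for an m-gram language model over the new output words
        return -0.3 * len(words)

    def decode(stack_limit=10):
        stacks = [[] for _ in range(len(SRC) + 1)]   # one stack per number of covered words
        stacks[0].append(Hyp(frozenset(), 0, (), 0.0))
        for n in range(len(SRC)):
            # histogram pruning: keep only the best stack_limit hypotheses per stack
            stacks[n] = sorted(stacks[n], key=lambda h: -h.score)[:stack_limit]
            for h in stacks[n]:
                for (s, e), options in PHRASES.items():
                    if any(i in h.coverage for i in range(s, e)):
                        continue                             # coverage constraint
                    for tgt, phi in options:
                        words = tuple(tgt.split())
                        score = h.score + phi - abs(s - h.last_end) + lm(words)
                        new = Hyp(h.coverage | frozenset(range(s, e)), e,
                                  h.output + words, score)
                        stacks[len(new.coverage)].append(new)
        best = max(stacks[len(SRC)], key=lambda h: h.score)
        return " ".join(best.output)

    print(decode())

With this order-insensitive LM stub the monotone order wins; a real language model is what pushes the search towards “he does not go home”.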
Future cost estimate
Higher scores for translating easy parts first
language model prefers common words
Early pruning will eliminate derivations starting with the difficult words
pruning must incorporate an estimate of the cost of translating the remaining words
“future cost estimate” assuming a unigram LM and monotone translation
Related to A* search and admissible heuristics
but incurs search error (see Chang & Collins, 2011)
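A sketch of computing such a future-cost table over input spans: take the best single-phrase estimate for a span, or the best way of splitting it in two. The per-span numbers below are invented; values are log scores, higher is better.

    # cheapest_option[(i, j)] is a made-up best TM+LM estimate for translating
    # words i..j-1 with a single phrase (unigram LM, monotone translation).
    cheapest_option = {(0, 1): -0.4, (1, 2): -1.2, (2, 3): -2.3, (3, 4): -1.0,
                       (2, 4): -0.9, (4, 6): -1.1, (4, 5): -1.5, (5, 6): -1.4}

    def future_cost_table(n):
        cost = {}
        for length in range(1, n + 1):
            for i in range(0, n - length + 1):
                j = i + length
                best = cheapest_option.get((i, j), float("-inf"))
                # or split the span in two and combine the best estimates
                for k in range(i + 1, j):
                    best = max(best, cost[(i, k)] + cost[(k, j)])
                cost[(i, j)] = best
        return cost

    table = future_cost_table(6)
    print(table[(0, 6)])   # estimated score for translating the whole sentence

During search, a hypothesis is then compared on its own score plus the future cost of its still-uncovered spans.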
Beam search complexity
Limit the number of translation options per phrase to a constant (often 20)
# translations proportional to input sentence length
Stack pruning
number of entries & score ratio
Reordering limits
finite number of uncovered words (typically 6)
but see Lopez EACL 2009
Resulting complexity
O( stack size x sentence length )
k-best outputs
Can recover not just the best solution
but also 2nd, 3rd etc best derivations
straightforward extension of beam search
Useful in discriminative training of feature weights, and other applications
Alternatives for PBMT decoding
FST composition (Kumar & Byrne, 2005)
each process encoded as a WFST or WFSA
simply compose automata, minimise and solve
A* search (Och, Ueffing & Ney, 2001)
Sampling (Arun et al, 2009)
Integer linear programming
Germann et al, 2001
Riedel & Clarke, 2009
Lagrangian relaxation
Chang & Collins, 2011
Outline
Decoding phrase-based models
linear model
dynamic programming approach
approximate beam search
Decoding grammar-based models
tree-to-string decoding
string-to-string decoding
cube pruning
Grammar-based decoding
Reordering in PBMT is poor and must be limited
otherwise too many bad choices available
and inference is intractable
better if reordering decisions were driven by context
simple form of lexicalised reordering in Moses
Grammar based translation
consider hierarchical phrases with gaps (Chiang 05)
(re)ordering constrained by lexical context
inform process by generating syntax tree (Venugopal & Zollmann, 06; Galley et al, 06)
exploit input syntax (Mi, Huang & Liu, 08)
Hierarchical phrase-based MT
yu Aozhou you bangjiao ↔ have diplomatic relations with Australia
Standard PBMT: must ‘jump’ back and forth to obtain the correct ordering, guided primarily by the language model.
Hierarchical PBMT: a grammar rule encodes this common reordering: yu X1 you X2 → have X2 with X1
also correlates yu … you and have … with.
Example from Chiang, CL 2007
SCFG recap
Rules of the form X → <yu X1 you X2, have X2 with X1>
can include aligned gaps
can include informative non-terminal categories (NN, NP, VP etc)
SCFG generation
Synchronous grammars generate parallel texts
Further:
applied to one text, can generate the other text
leverage efficient monolingual parsing algorithms
(Figure: applying X → <yu X1 you X2, have X2 with X1> with X1 → <Aozhou, Australia> and X2 → <bangjiao, dipl. relations> generates both sides of the parallel pair.)
SCFG extraction from bitexts
Step 1: identify aligned phrase-pairs
Step 2: “subtract” out subsumed phrase-pairs
Example grammar
X → <yu X1 you X2, have X2 with X1>
X → <Aozhou, Australia>
X → <bangjiao, diplomatic relations>
S → <X1, X1>
Decoding as parsing
Consider only the foreign (input) side of the grammar:
X → yu X1 you X2
X → Aozhou
X → bangjiao
S → X
Step 1: parse the input text “yu Aozhou you bangjiao” with this input-side grammar
Step 2: translate: traverse the tree, replacing each input production with its highest scoring output side, yielding “have diplomatic relations with Australia”
Chart parsing
Input: yu0 Aozhou1 you2 bangjiao3 (span positions 0 1 2 3 4)
1. length = 1: X → Aozhou gives X1,2; X → bangjiao gives X3,4
2. length = 2: X → yu X gives X0,2; X → you X gives X2,4; S → X gives S0,2
4. length = 4: X → yu X you X gives X0,4; then S → X, and also S → S X (from S0,2 and X2,4), give S0,4
Two derivations yield S0,4; take the one with maximum score
Chart parsing for decoding
• starting at full sentence S0,J
• traverse down to find maximum score derivation
• translate each rule using the maximum scoring right-hand side
• emit the output string
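A toy end-to-end sketch of this procedure: parse the input with the source side of a small SCFG (here with a recursive matcher rather than bottom-up CYK), then read the translation off the best derivation. The grammar, rule scores and the simplification that every gap is category X are all illustrative.

    RULES = [  # (LHS, source RHS, target RHS, log score); "X1"/"X2" link the gaps
        ("X", ("yu", "X1", "you", "X2"), ("have", "X2", "with", "X1"), -1.0),
        ("X", ("Aozhou",), ("Australia",), -0.2),
        ("X", ("bangjiao",), ("diplomatic", "relations"), -0.3),
        ("S", ("X1",), ("X1",), 0.0),
    ]
    SENT = "yu Aozhou you bangjiao".split()

    def parse(lhs, i, j, memo={}):
        """Best (score, rule, child derivations) for lhs spanning SENT[i:j]."""
        key = (lhs, i, j)
        if key in memo:
            return memo[key]
        best = None
        for rule in RULES:
            head, src, _, w = rule
            if head != lhs:
                continue
            for score, kids in match(src, i, j):
                cand = (w + score, rule, kids)
                if best is None or cand[0] > best[0]:
                    best = cand
        memo[key] = best
        return best

    def match(symbols, i, j):
        """Yield (score, {gap: derivation}) for ways symbols can cover SENT[i:j]."""
        if not symbols:
            if i == j:
                yield 0.0, {}
            return
        sym, rest = symbols[0], symbols[1:]
        if sym.startswith("X") and sym[1:].isdigit():   # a linked non-terminal gap
            for k in range(i + 1, j + 1):
                sub = parse("X", i, k)                  # toy: every gap is category X
                if sub is None:
                    continue
                for score, kids in match(rest, k, j):
                    yield sub[0] + score, {sym: sub, **kids}
        elif i < j and SENT[i] == sym:                  # a source terminal
            yield from match(rest, i + 1, j)

    def realise(deriv):
        """Emit the target side of the best derivation."""
        _, (_, _, tgt, _), kids = deriv
        out = []
        for sym in tgt:
            out.extend(realise(kids[sym]) if sym in kids else [sym])
        return out

    best = parse("S", 0, len(SENT))
    print(" ".join(realise(best)))   # -> have diplomatic relations with Australia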
LM intersection
Very efficient
cost of parsing, i.e., O(n^3)
reduces to linear if we impose a maximum span limit
the translation step is a simple O(n) post-processing step
But what about the language model?
CYK assumes model scores decompose with the tree structure
but the language model must span constituents
Problem: LM doesn’t factorise!
LM intersection via lexicalised NTs
Encode LM context in NT categories (Bar-Hillel et al, 1964)
X → <yu X1 you X2, have X2 with X1> becomes
haveXb → <yu aXb1 you cXd2, have cXd2 with aXb1>
where the decorations record the left & right m-1 words of each NT’s output translation
When used in parent rule, LM can access boundary words
score now factorises with tree
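A tiny sketch of what this buys us: if every chart item carries its boundary words, two items can be combined and the missing LM score applied just at the seam (bigram case; the lm() stub and the numbers are made up).

    def lm(prev, word):
        return -0.5                      # stand-in for log P(word | prev)

    def combine(left, right):
        """Each item is (score, first_word, last_word) of its output string."""
        lscore, lfirst, llast = left
        rscore, rfirst, rlast = right
        seam = lm(llast, rfirst)         # the only LM score not yet accounted for
        return (lscore + rscore + seam, lfirst, rlast)

    print(combine((-1.2, "have", "relations"), (-0.7, "with", "Australia")))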
LM intersection via lexicalised NTs
(Figure: the same derivation with plain NTs on the left and lexicalised NTs on the right, e.g. AustraliaXAustralia over Aozhou and diplomaticXrelations over bangjiao; every production contributes its φTM score, and the boundary words in the lexicalised NTs let the ψ terms, including the <S> and </S> boundary scores, be attached to the productions where the words meet.)
+LM Decoding
Same algorithm as before
Viterbi parse with input side grammar (CYK)
for each production, find best scoring output side
read off output string
But input grammar has blown up
number of non-terminals is O(T^(2(m-1)))
overall translation complexity of O(n^3 T^(4(m-1)))
Terrible!
Beam search and pruning
Resort to beam search
prune poor entries from chart cells during CYK parsing
histogram, threshold as in phrase-based MT
rarely have sufficient context for LM evaluation
Cube pruning
uses a lower-order LM estimate as a search heuristic
follows an approximate ‘best first’ order for incorporating child spans into the parent rule
stops once beam is full
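A toy sketch of the lazy best-first enumeration at the heart of cube pruning: pop the k best combinations of (rule, left child, right child) from a heap instead of enumerating the whole cube. Here the combined score is just the sum, so the order is exact; in a real decoder the LM contribution makes it only approximately best-first. The score lists are invented.

    import heapq

    def cube_prune(rule_scores, left_scores, right_scores, k):
        """Pop up to k best sums rule+left+right without building the full cube.
        All three input lists must be sorted in decreasing order."""
        seen = {(0, 0, 0)}
        # max-heap via negated scores
        heap = [(-(rule_scores[0] + left_scores[0] + right_scores[0]), 0, 0, 0)]
        out = []
        while heap and len(out) < k:
            neg, r, l, rt = heapq.heappop(heap)
            out.append(-neg)
            # push the three neighbouring cells of the cube
            for dr, dl, drt in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                nr, nl, nrt = r + dr, l + dl, rt + drt
                if (nr < len(rule_scores) and nl < len(left_scores)
                        and nrt < len(right_scores) and (nr, nl, nrt) not in seen):
                    seen.add((nr, nl, nrt))
                    heapq.heappush(heap, (-(rule_scores[nr] + left_scores[nl]
                                            + right_scores[nrt]), nr, nl, nrt))
        return out

    # e.g. a beam of size 5 over a 4 x 3 x 3 cube of candidate combinations
    print(cube_prune([-0.1, -0.5, -0.9, -2.0], [-0.2, -0.4, -1.1],
                     [-0.3, -0.6, -0.7], 5))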
For more details, see Chiang, “Hierarchical phrase-based translation”, Computational Linguistics 33(2):201–228, 2007.
Further work
Synchronous grammar systems
SAMT (Venugopal & Zollmann, 2006)
ISI’s syntax system (Marcu et al., 2006)
HRGG (Chiang et al., 2013)
Tree to string (Liu, Liu & Lin, 2006)
Probabilistic grammar induction
Blunsom & Cohn (2009)
Decoding and pruning
cube growing (Huang & Chiang, 2007)
left to right decoding (Huang & Mi, 2010)
Summary
What we covered
word-based translation and alignment
linear phrase-based and grammar-based models
phrase-based (finite state) decoding
synchronous grammar decoding
What we didn’t cover
rule extraction process
discriminative training
tree-based models
domain adaptation
OOV translation
…