Page 1: Improving SMT with Phrase to Phrase Translations

Improving SMT with Phrase to Phrase Translations

Joy Ying Zang, Ashish Venugopal,

Stephan Vogel, Alex Waibel

Carnegie Mellon University

Project: Mega-RADD

Page 2: Improving SMT with Phrase to Phrase Translations


CMU Mega RADD

The Mega-RADD Team:

SMT: Stephan Vogel, Alex Waibel, John Lafferty
EBMT: Ralf Brown, Bob Frederking
Chinese: Joy Ying Zang, Ashish Venugopal, Bing Zhao, Fei Huang
Arabic: Alicia Tribble, Ahmed Badran

Page 3: Improving SMT with Phrase to Phrase Translations


Overview

• Goals:
  – Develop Data-Driven General Purpose MT Systems
  – Train on Large and Small Corpora, Evaluate to test Portability

• Approaches:
  – Two Data-driven Approaches: Statistical, Example-Based
  – Also Grammar-based Translation System
  – Multi-Engine Translation

• Languages: Chinese and Arabic

• Statistical Translation:
  – Exploit Structure in Language: Phrases
  – Determine Phrases from Mono- and Bi-Lingual Co-occurrences
  – Determine Phrases from Lexical and Alignment Information

Page 4: Improving SMT with Phrase to Phrase Translations


Arabic: Initial System

• 1 million words of UN data, 300 sentences for testing

• Preprocessing: separation of punctuation marks, lower case for English, correction of corrupted numbers

• Adding human knowledge: cleaning the statistical lexicon for the 100 most frequent words; building lists of names, simple date expressions, and numbers (total: 1000 entries; total effort: two part-timers * 4 weeks)

• Alignment: IBM1 plus HMM training, lexicon plus phrase translations

• Language Model: trained on 1m sub-corpus

• Results (20 May 2002):
  UN test data (300 sentences): Bleu = 0.1176
  NIST devtest (203 sentences): Bleu = 0.0242, NIST = 2.0608

Page 5: Improving SMT with Phrase to Phrase Translations


Arabic: Portability to a New Language

• Training on subset of UN corpus chosen to cover vocabulary of test data

• Training English to Arabic for extraction of phrase translations

• Minimalist morphology: strip/add suffixes for ~200 unknown words.
  NIST: 5.5368 -> 5.6700

• Adapting the LM: select stories from 2 years of English Xinhua stories according to an 'Arabic' keyword list (280 entries); size: 6.9m words.
  NIST: 5.5368 -> 5.9183

• Results:
  - 20 May (devtest): 2.0608
  - 13 June (devtest): 6.5805
  - 14 June (evaltest): 5.4662 (final training not completed)
  - 17 June (evaltest): 6.4499 (after completed training)
  - 19 July (devtest): 7.0482

Page 6: Improving SMT with Phrase to Phrase Translations


Two Approaches

• Determine Phrases from Mono- and Bi-Lingual Co-occurrences
  – Joy

• Determine Phrases from Lexical and Alignment Information
  – Ashish

Page 7: Improving SMT with Phrase to Phrase Translations


Why phrases?

• Mismatch between languages: word to word translation doesn't work

• Phrases encapsulate the context of words, e.g. verb tense

Page 8: Improving SMT with Phrase to Phrase Translations


Why phrases? (Cont.)

• Local reordering, e.g. Chinese relative clause

• Using phrases to mitigate word segmentation failures

Page 9: Improving SMT with Phrase to Phrase Translations


Utilizing bilingual information

• Given a sentence pair (S,T),

S=<s1,s2,…,si,…sm>

T=<t1,t2,…,tj,…,tn>, where si/tj are source/target words.

• Given an m*n matrix B, where B(i,j) = co-occurrence(si, tj) is computed from the 2x2 contingency table of si and tj over the sentence pairs:

              tj     ~tj
      si       a      b
      ~si      c      d

  co-occurrence(si, tj) = χ²(si, tj) = N*(a*d - b*c)² / ((a+b)*(c+d)*(a+c)*(b+d)),

  where N = a+b+c+d.
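
For concreteness, here is a minimal Python sketch of this χ² co-occurrence score, assuming the contingency counts a, b, c, d have already been collected over the parallel sentence pairs; the function name is illustrative, not from the original system.

def chi_square_cooccurrence(a, b, c, d):
    """Chi-square association score for a source/target word pair.

    a: sentence pairs containing both si and tj
    b: pairs containing si but not tj
    c: pairs containing tj but not si
    d: pairs containing neither word
    """
    N = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return N * (a * d - b * c) ** 2 / denom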

Page 10: Improving SMT with Phrase to Phrase Translations


Utilizing bilingual information (Cont.)

• Goal: find a partition over matrix B, under the constraint that one src/tgt word can only align to one tgt/src word or phrase (adjacent word sequence)

[Figure: example partitions of B: a legal segmentation with imperfect alignment vs. an illegal segmentation with perfect alignment]

Page 11: Improving SMT with Phrase to Phrase Translations


Utilizing bilingual information (Cont.)

For each sentence pair in the training data:

  While (still has a row or column not aligned) {
    Find cell[i,j] where B(i,j) is the max among all available (not aligned) cells;
    Expand cell[i,j] with similarity sim_thresh to region[RowStart, RowEnd; ColStart, ColEnd];
    Mark all the cells in the region as aligned;
  }
  Output the aligned regions as phrases

-----------------------------------------------------

Sub expand cell[i,j] with sim_thresh {
  Current aligned region: region[RowStart=i, RowEnd=i; ColStart=j, ColEnd=j]
  While (still ok to expand) {
    if all cells[m,n], where m=RowStart-1, ColStart<=n<=ColEnd, have B(m,n) similar to B(i,j)
      then RowStart = RowStart - 1;   // expand to north
    if all cells[m,n], where m=RowEnd+1, ColStart<=n<=ColEnd, have B(m,n) similar to B(i,j)
      then RowEnd = RowEnd + 1;       // expand to south
    ...                               // expand to east
    ...                               // expand to west
  }
}

Define similar(x,y) = true, if abs((x-y)/y) < 1 - similarity_thresh
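
Below is a runnable Python sketch of this greedy region-growing partition, assuming B is a NumPy array of co-occurrence scores. The function names are illustrative, and the sketch additionally refuses to expand into cells that are already aligned, which the pseudocode leaves implicit.

import numpy as np

def similar(x, y, sim_thresh):
    # Two scores count as "similar" when their relative difference is small enough.
    if y == 0:
        return False
    return abs((x - y) / y) < 1.0 - sim_thresh

def expand_region(B, aligned, i, j, sim_thresh):
    """Grow a rectangular region around the seed cell (i, j)."""
    m, n = B.shape
    r0 = r1 = i
    c0 = c1 = j
    grew = True
    while grew:
        grew = False
        # Expand to north.
        if r0 > 0 and all(not aligned[r0 - 1, c] and similar(B[r0 - 1, c], B[i, j], sim_thresh)
                          for c in range(c0, c1 + 1)):
            r0 -= 1
            grew = True
        # Expand to south.
        if r1 < m - 1 and all(not aligned[r1 + 1, c] and similar(B[r1 + 1, c], B[i, j], sim_thresh)
                              for c in range(c0, c1 + 1)):
            r1 += 1
            grew = True
        # Expand to west.
        if c0 > 0 and all(not aligned[r, c0 - 1] and similar(B[r, c0 - 1], B[i, j], sim_thresh)
                          for r in range(r0, r1 + 1)):
            c0 -= 1
            grew = True
        # Expand to east.
        if c1 < n - 1 and all(not aligned[r, c1 + 1] and similar(B[r, c1 + 1], B[i, j], sim_thresh)
                              for r in range(r0, r1 + 1)):
            c1 += 1
            grew = True
    return r0, r1, c0, c1

def extract_phrase_regions(B, sim_thresh=0.8):
    """Greedily partition the co-occurrence matrix B into aligned regions."""
    m, n = B.shape
    aligned = np.zeros((m, n), dtype=bool)
    regions = []
    # Loop while some row or some column still has no aligned cell.
    while not (aligned.any(axis=1).all() and aligned.any(axis=0).all()):
        masked = np.where(aligned, -np.inf, B)
        i, j = np.unravel_index(np.argmax(masked), B.shape)
        r0, r1, c0, c1 = expand_region(B, aligned, i, j, sim_thresh)
        aligned[r0:r1 + 1, c0:c1 + 1] = True
        regions.append((r0, r1, c0, c1))  # source span [r0, r1] aligns to target span [c0, c1]
    return regions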

Page 12: Improving SMT with Phrase to Phrase Translations


Utilizing bilingual information (Cont.)

[Figure: expanding an aligned region to the north, south, east, and west]

Page 13: Improving SMT with Phrase to Phrase Translations


Integrating monolingual information

• Motivation:
  – Use more information in the alignment
  – Easier to align phrases
  – There is much more monolingual data than bilingual data

[Figure: place-name examples: Santa Monica, Pittsburgh, Los Angeles, Somerset, Uniontown, Santa Clarita, Corona]

Page 14: Improving SMT with Phrase to Phrase Translations


Integrating monolingual information (Cont.)

• Given a sentence pair (S,T),

S=<s1,s2,…,si,…sm> and T=<t1,t2,…,tj,…,tn>, where si/tj are source/target words.

• Construct m*m matrix A, where A(i,j) = collocation(si, sj); Only A(i,i-1) and A(i,i+1) have values.

• Construct n*n matrix C, where C(i,j) = collocation(ti, tj); Only C(j-1,j) and C(j+1,j) have values.

• Construct m*n matrix B, where B(i,j)= co-occurrence(si, tj).
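
For illustration, here is a minimal sketch of constructing the three matrices for one sentence pair. The slides do not fix the collocation statistic, so it is passed in as a placeholder callable, as is the bilingual co-occurrence score (e.g. the χ² score above); all names here are assumptions.

import numpy as np

def build_matrices(src_tokens, tgt_tokens, collocation, cooccurrence):
    """Build A (source collocations), B (bilingual co-occurrence), C (target collocations)."""
    m, n = len(src_tokens), len(tgt_tokens)

    # A: m x m source collocation matrix; only adjacent positions are filled.
    A = np.zeros((m, m))
    for i in range(m):
        if i > 0:
            A[i, i - 1] = collocation(src_tokens[i], src_tokens[i - 1])
        if i < m - 1:
            A[i, i + 1] = collocation(src_tokens[i], src_tokens[i + 1])

    # C: n x n target collocation matrix; only adjacent positions are filled.
    C = np.zeros((n, n))
    for j in range(n):
        if j > 0:
            C[j - 1, j] = collocation(tgt_tokens[j - 1], tgt_tokens[j])
        if j < n - 1:
            C[j + 1, j] = collocation(tgt_tokens[j + 1], tgt_tokens[j])

    # B: m x n bilingual co-occurrence matrix.
    B = np.array([[cooccurrence(s, t) for t in tgt_tokens] for s in src_tokens])
    return A, B, C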

Page 15: Improving SMT with Phrase to Phrase Translations


Integrating monolingual information (Cont.)

• Normalize A so that Σ_j A(i,j) = 1 for each row i

• Normalize C so that Σ_i C(i,j) = 1 for each column j

• Normalize B so that Σ_{i=1..m} Σ_{j=1..n} B(i,j) = 1

• Calculate the new src-tgt matrix: B' = A · B · C

[Figure: the co-occurrence matrix B before and after combination into B']
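
A minimal sketch of these normalizations and the product B' = A · B · C, assuming NumPy arrays as above; the function name is illustrative.

import numpy as np

def smooth_cooccurrence(A, B, C):
    """Combine monolingual collocation matrices with the bilingual matrix: B' = A · B · C."""
    # Rows of A sum to 1: Σ_j A(i,j) = 1.
    A = A / A.sum(axis=1, keepdims=True)
    # Columns of C sum to 1: Σ_i C(i,j) = 1.
    C = C / C.sum(axis=0, keepdims=True)
    # All entries of B sum to 1.
    B = B / B.sum()
    return A @ B @ C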

Page 16: Improving SMT with Phrase to Phrase Translations


Discussion and Results

• Simple

• Efficient
  – Partitioning the matrix is linear: O(min(m,n)).
  – The construction of A·B·C is O(m*n).

• Effective
  – Improved the translation quality from baseline (NIST = 6.3775, Bleu = 0.1417) to (NIST = 6.7405, Bleu = 0.1681) on the small data track dev-test.

Page 17: Improving SMT with Phrase to Phrase Translations


Utilizing alignment information: Motivation

• Alignment model associates words and their translations on the sentence level.

• Context and co-occurrence are represented when considering a set of sentence level alignments.

• Extract phrase relations from the alignment information.

Page 18: Improving SMT with Phrase to Phrase Translations


Processing Alignments

• Identification – Selection of target phrase candidates for each source phrase.

• Scoring – Assigning a score to each candidate phrase pair to create a ranking.

• Pruning – Reducing the set of candidate translations to a computationally tractable number.

Page 19: Improving SMT with Phrase to Phrase Translations


Identification

• Extraction from sentence-level alignments.

• For each source phrase, identify the sentences in which it occurs and load the sentence alignment.

• Form a sliding/expanding window in the alignment to identify candidate translations.
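
A minimal sketch of the expanding window over the target side of an aligned sentence; the function name and the max_len cutoff are illustrative assumptions. Applied to the sentence in the identification examples on the following slides, it yields candidates such as "is", "is in step with the", "the establishment", and so on.

def candidate_target_phrases(tgt_tokens, max_len=8):
    """Enumerate contiguous target spans as candidate translations of a source phrase."""
    candidates = []
    for start in range(len(tgt_tokens)):
        for end in range(start + 1, min(start + max_len, len(tgt_tokens)) + 1):
            candidates.append(tgt_tokens[start:end])
    return candidates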

Page 20: Improving SMT with Phrase to Phrase Translations


Identification Example - I

Page 21: Improving SMT with Phrase to Phrase Translations


Identification Example - II

- is

-is in step with the

-is in step with the establishment

-is in step with the establishment of

-is in step with the establishment of its

-is in step with the establishment of its legal

-is in step with the establishment of its legal system

-the

-the establishment

-the establishment of

-……

-the establishment of its legal system

-……

-establishment

-establishment of

-establishment of its

-….

Page 22: Improving SMT with Phrase to Phrase Translations


Scoring - I

• This candidate set H needs to be scored and ranked before pruning.

• Alignment-based scores.

• Similarity clustering
  – Assume that the hypothesis set contains several similar phrases (across several sentences) and several noisy phrases.
  – SimScore(h) = Mean(EditDistance(h, h') / AvgLen(h, h')) for h, h' in H
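
A minimal sketch of this similarity score, assuming each hypothesis is a list of tokens and AvgLen(h, h') is the mean of the two lengths; the helper names are illustrative.

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def sim_score(h, H):
    """Mean normalized edit distance of hypothesis h to the other hypotheses in H."""
    others = [h2 for h2 in H if h2 is not h]
    if not others:
        return 0.0
    return sum(edit_distance(h, h2) / ((len(h) + len(h2)) / 2.0) for h2 in others) / len(others)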

Page 23: Improving SMT with Phrase to Phrase Translations


Scoring Example

Page 24: Improving SMT with Phrase to Phrase Translations


Scoring - II

• Lexicon augmentation
  – Weight each point in the alignment scoring by its lexical probability: P(si | tj), where I, J represent the area of the translation hypothesis being considered. Only the pairs of words where there is an alignment are considered.
  – Calculate the translation probability of the hypothesis: Σi Πj P(si | tj), where all words in the hypothesis are considered.
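
A minimal sketch of the hypothesis-level lexical score as written above (Σi Πj P(si | tj)), assuming the lexicon is a nested dict with lexicon[s][t] = P(s | t) and a small floor probability for unseen pairs; both representations are illustrative assumptions.

def lexical_score(src_words, tgt_words, lexicon, floor=1e-7):
    """Sum over source words of the product over target words of P(si | tj)."""
    total = 0.0
    for s in src_words:
        prod = 1.0
        for t in tgt_words:
            prod *= lexicon.get(s, {}).get(t, floor)
        total += prod
    return total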

Page 25: Improving SMT with Phrase to Phrase Translations


Combining Scores

• Final Score(h) = Πj Scorej(h) for each scoring method.

• Due to additional morphology present in English as compared to Chinese, a length model is used to adjust the final score to prefer longer phrases.

• DiffRatio = (I - J) / J, if I > J

• FinalScore(h) = FinalScore(h) * (1.0 + c * e^(-1.0 * DiffRatio))

– c is an experimentally determined constant
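
A minimal sketch of the score combination with the length bonus above; the value of c and the argument layout are illustrative.

import math

def final_score(method_scores, len_i, len_j, c=0.5):
    """Multiply per-method scores, then apply the length bonus when I > J."""
    score = 1.0
    for s in method_scores:
        score *= s
    if len_i > len_j:
        diff_ratio = (len_i - len_j) / len_j
        score *= 1.0 + c * math.exp(-1.0 * diff_ratio)
    return score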

Page 26: Improving SMT with Phrase to Phrase Translations


Pruning

• This large candidate list is now sorted by score and is ready for pruning.

• Difficult to pick a threshold that will work across different phrases. We need a split point that separates the useful and the noisy candidates.

• Split point = argmax_p {MeanScore(h < p) – MeanScore(h >= p)}, where h represents each hypothesis in the score-ordered set H.
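
A minimal sketch of this split-point rule over a list of candidate scores sorted in descending order; the function name is illustrative.

def split_point(sorted_scores):
    """Index p maximizing MeanScore(scores[:p]) - MeanScore(scores[p:])."""
    best_p, best_gap = 1, float("-inf")
    for p in range(1, len(sorted_scores)):
        kept, dropped = sorted_scores[:p], sorted_scores[p:]
        gap = sum(kept) / len(kept) - sum(dropped) / len(dropped)
        if gap > best_gap:
            best_p, best_gap = p, gap
    return best_p

# Usage: keep the candidates above the split point.
# scores = sorted(candidate_scores, reverse=True)
# kept = scores[:split_point(scores)]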

Page 27: Improving SMT with Phrase to Phrase Translations


Experiments

• Alignment model – experimented with one-way (EF) and two-way (EF-FE union/intersection) alignments for IBM Models 1-4.
  – Best results were found using the union (high-recall model) from Model 4.

• Both lexical augmentation (using model 1 lexicon) scores and length bonus were applied.
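
A minimal sketch of combining the bidirectional alignments, assuming each alignment is given as a set of (source_index, target_index) links; the slide reports that the union (high recall) worked best.

def combine_alignments(ef_links, fe_links, mode="union"):
    """Combine E->F and F->E word-alignment link sets by union or intersection."""
    ef = set(ef_links)
    # Flip F->E links so both sets share the (source, target) orientation.
    fe = {(s, t) for (t, s) in fe_links}
    return ef | fe if mode == "union" else ef & fe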

Page 28: Improving SMT with Phrase to Phrase Translations


Results and Thoughts

                           Small Track   Large Track
Baseline (IBM1+LDC-Dic)       6.3775        6.52
+ Phrases                     6.7405        7.316

- More effective pruning techniques would significantly reduce the experimentation cycle.

- Improved alignment models that better combine bi-directional alignment information.

Page 29: Improving SMT with Phrase to Phrase Translations


Combining Methods

Small Data Track (Dec-01 data), NIST scores by segmentation:

                              standard    improved
Baseline (IBM1+LDC-Dic)        6.2381      6.3775
+ Phrases Ashish               6.5295      6.7405
+ Phrases Joy                  6.5624      6.7987
+ Phrases Joy & Ashish         6.6427      6.8790