
Machine Translation for Digital Library Content


Page 1: Machine Translation for Digital Library Content

Machine Translation for Digital Library Content

Carnegie Mellon University

Jaime Carbonell et al.

Background
1. State of the Art in MT

2. The General MT Challenge

3. MT for Low-Density Languages

4. Active Crowd-Sourcing for MT

Relevant MT Paradigms

1. Statistical/Example-Based MT

2. Context-Based MT

3. Learning Transfer Rules

4. Active Data Collection

5th ICUDL – November 2009

Page 2: Machine Translation for Digital Library Content

The Multilingual Bookstack

English
Arabic
Spanish
Chinese
Telugu

+ Multiple other languages, including rare and ancient ones

Page 3: Machine Translation for Digital Library Content


Multi-lingual Challenges Are Not New

The Great Library + Tower of Babel = ?

But the Technological Solutions Are New

Page 4: Machine Translation for Digital Library Content

Coping with Massive Multilinguality

• Multilingual Content
  – Scanning, indexing, interlingual metadata, …
  – Within scope of UDL
• Multilingual Access
  – Pre-translation upon acquisition?
    • Need to store content N times (N languages)
  – On-demand translation upon query?
    • Needs fast translation, cross-lingual retrieval
• Requisite Technology
  – Accurate OCR, cross-lingual search
  – Accurate machine translation, beyond common languages

Page 5: Machine Translation for Digital Library Content

Challenges for Machine Translation

• Ambiguity Resolution
  – Lexical, phrasal, structural
• Structural divergence
  – Reordering, vanishing/appearing words, …
• Inflectional morphology
  – Spanish has 40+ verb conjugations; Arabic has more
• Training Data
  – Bilingual corpora, aligned corpora, annotated corpora
• Evaluation
  – Automated (BLEU, METEOR, TER) vs. HTER vs. …

Page 6: Machine Translation for Digital Library Content

Context Needed to Resolve Ambiguity

Example: English → Japanese
• Power line – densen (電線)
• Subway line – chikatetsu (地下鉄)
• (Be) on line – onrain (オンライン)
• (Be) on the line – denwachuu (電話中)
• Line up – narabu (並ぶ)
• Line one’s pockets – kanemochi ni naru (金持ちになる)
• Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする)
• Actor’s line – serifu (セリフ)
• Get a line on – joho o eru (情報を得る)

Sometimes local context suffices (as above) and n-grams help . . . but sometimes not.
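To make "local context suffices" concrete, a toy sketch (the collocation table and function name are hypothetical; the translations are taken from the slide):

```python
# Hypothetical collocation table: a neighboring word that locally
# disambiguates "line", mapped to the Japanese rendering from the slide.
COLLOCATIONS = {
    "power": "densen",        # power line
    "subway": "chikatetsu",   # subway line
    "actor's": "serifu",      # actor's line
    "up": "narabu",           # line up
}

def translate_line(context_words):
    """Return the collocate-triggered translation, or None when local
    context alone is not enough to disambiguate."""
    for word in context_words:
        if word in COLLOCATIONS:
            return COLLOCATIONS[word]
    return None
```

When no collocate fires (as in the longer-range examples on the next slide), a system needs more than an n-gram window.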

Page 7: Machine Translation for Digital Library Content

CONTEXT: More is Better

• Examples requiring longer-range context:
  – “The line for the new play extended for 3 blocks.”
  – “The line for the new play was changed by the scriptwriter.”
  – “The line for the new play got tangled with the other props.”
  – “The line for the new play better protected the quarterback.”
• Challenges:
  – Short n-grams (3–4 words) are insufficient
  – Requires more general semantics and domain specificity

Page 8: Machine Translation for Digital Library Content

Low Density Languages

• 6,900 languages in 2000 – Ethnologue
  www.ethnologue.com/ethno_docs/distribution.asp?by=area
• 77 (1.2%) have over 10M speakers
  – 1st is Chinese, 5th is Bengali, 11th is Javanese
• 3,000 have over 10,000 speakers each
• 3,000 may survive past 2100
• 5X to 10X number of dialects
• Number of languages in some interesting countries:
  – Pakistan: 77; India: 400
  – North Korea: 1; Indonesia: 700

Page 9: Machine Translation for Digital Library Content

Some Linguistic Maps

Page 10: Machine Translation for Digital Library Content

Some (very) LD Languages in N. A.

Anishinaabe (Ojibwe, Potawatomi, Odawa) – Great Lakes

Page 11: Machine Translation for Digital Library Content

Additional Challenges for Rare and for Ancient Languages

• Morpho-syntactics is plentiful
  – Beyond inflection: verb-incorporation, agglomeration, …
• Data is scarce
  – Little bilingual or annotated data
• Fluent computational linguists are scarce
  – Field linguists know LD languages best
• Standardization is scarce
  – Orthographic, dialectal, rapid evolution, …

Page 12: Machine Translation for Digital Library Content

Morpho-Syntactics & Multi-Morphemics

• Iñupiaq (North Slope Alaska, Lori Levin)
  – Tauqsiġñiaġviŋmuŋniaŋitchugut.
  – ‘We won’t go to the store.’
• Kalaallisut (Greenlandic, Per Langaard, p.c.)
  – Pittsburghimukarthussaqarnavianngilaq
  – Pittsburgh+PROP+Trim+SG+kar+tuq+ssaq+qar+naviar+nngit+v+IND+3SG
  – "It is not likely that anyone is going to Pittsburgh"

Page 13: Machine Translation for Digital Library Content


Morphotactics in Iñupiaq

Page 14: Machine Translation for Digital Library Content

Type Token Curve for Mapudungun

[Figure: type–token curves for Mapudungun and Spanish; X axis: tokens in thousands (0–1,500), Y axis: types in thousands (0–140)]

• 400,000+ speakers
• Mostly bilingual
• Mostly in Chile

Dialects:
• Pewenche
• Lafkenche
• Nguluche
• Huilliche

Page 15: Machine Translation for Digital Library Content

Paradigms for Machine Translation

[Figure: MT pyramid. Interlingua at the apex; on the source side (e.g. Pashto), syntactic parsing and semantic analysis; on the target side (e.g. English), sentence planning and text generation; transfer rules cross at mid-level; direct approaches (SMT, EBMT, CBMT, …) connect source to target at the base.]

Page 16: Machine Translation for Digital Library Content

Which MT Paradigms are Best? Towards Filling the Table

          | Large T | Med T | LD T
Large S   | SMT     | ???   | ???
Med S     | ???     | ???   | ???
LD S      | ???     | ???   | ???

• DARPA MT: Large S → Large T
  – Arabic → English; Chinese → English

Page 17: Machine Translation for Digital Library Content

An Evolutionary Tree of MT Paradigms

[Figure: evolutionary tree of MT paradigms along a 1950 → 1980 → 2010 timeline. Early roots: Transfer MT, Decoding MT, Analogy MT. Middle generation: Large-scale Transfer MT, Interlingua MT, Example-based MT. Recent: Larger-Scale Transfer MT, Context-Based MT, Statistical MT, Phrasal SMT, Stat Transfer MT, Stat Syntax MT.]

Page 18: Machine Translation for Digital Library Content

Parallel Text: Requiring Less is Better (Requiring None is Best)

• Challenge
  – There is just not enough to approach human-quality MT for major language pairs (we need ~100X to ~10,000X)
  – Much parallel text is not on-point (not on domain)
  – LD languages or distant pairs have very little parallel text
• CBMT Approach [Abir, Carbonell, Sofizade, …]
  – Requires no parallel text, no transfer rules . . .
  – Instead, CBMT needs:
    • A fully-inflected bilingual dictionary
    • A (very large) target-language-only corpus
    • A (modest) source-language-only corpus [optional, but preferred]

Page 19: Machine Translation for Digital Library Content

CBMT System

[Figure: CBMT system diagram. Components: source- and target-language parsers; N-gram Segmenter; Flooder (non-parallel text method); Overlap-based Decoder; N-gram Connector handling n-gram candidates, approved n-gram pairs, and stored n-gram pairs; indexed resources (bilingual dictionary, target corpora, [source corpora]); n-gram builders (translation model); cross-language n-gram database and cache database; gazetteers; Edge Locker; TTR; substitution requests.]

Page 20: Machine Translation for Digital Library Content

Step 1: Source Sentence Chunking

• Segment source sentence into overlapping n-grams via sliding window
• Typical n-gram length: 4 to 9 terms
• Each term is a word or a known phrase
• Any sentence length (for BLEU test: average 27; shortest 8; longest 66 words)

S1 S2 S3 S4 S5 S6 S7 S8 S9

S1 S2 S3 S4 S5
   S2 S3 S4 S5 S6
      S3 S4 S5 S6 S7
         S4 S5 S6 S7 S8
            S5 S6 S7 S8 S9
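The chunking above can be sketched as follows (window size fixed at 5 for illustration; the system uses 4–9 terms, and terms may be known phrases rather than single words):

```python
def segment(tokens, n=5):
    """Slide an n-term window over the sentence, producing overlapping
    n-grams; each window shifts by one token, so adjacent windows share
    n-1 terms -- the overlap later used to join candidates."""
    if len(tokens) <= n:
        return [tokens]
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

for gram in segment("S1 S2 S3 S4 S5 S6 S7 S8 S9".split()):
    print(" ".join(gram))
```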

Page 21: Machine Translation for Digital Library Content

Step 2: Dictionary Lookup

• Using the inflected bilingual dictionary, list all possible target translations for each source word or phrase

Source word-string: S2 S3 S4 S5 S6

Target word lists (the Flooding Set):
  S2 → T2-a, T2-b, T2-c, T2-d
  S3 → T3-a, T3-b, T3-c
  S4 → T4-a, T4-b, T4-c, T4-d, T4-e
  S5 → T5-a
  S6 → T6-a, T6-b, T6-c
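The lookup step can be sketched as follows (the toy dictionary mirrors the slide's placeholder names; `flooding_set` is a hypothetical helper, not the CBMT API):

```python
# Toy inflected bilingual dictionary: each source term maps to all of
# its candidate target translations.
bilingual_dict = {
    "S2": ["T2-a", "T2-b", "T2-c", "T2-d"],
    "S3": ["T3-a", "T3-b", "T3-c"],
    "S4": ["T4-a", "T4-b", "T4-c", "T4-d", "T4-e"],
    "S5": ["T5-a"],
    "S6": ["T6-a", "T6-b", "T6-c"],
}

def flooding_set(source_ngram, dictionary):
    """One group of target candidates per source term; unknown words
    contribute an empty group (left for later handling)."""
    return [dictionary.get(word, []) for word in source_ngram]
```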

Page 22: Machine Translation for Digital Library Content

Step 3: Search Target Text

• Using the Flooding Set, search target text for word-strings containing one word from each group
• Find the maximum number of words from the Flooding Set in the minimum-length word-string
  – Words or phrases can be in any order
  – Ignore function words in the initial step (T5 is a function word in this example)

Flooding Set:
  T2-a, T2-b, T2-c, T2-d
  T3-a, T3-b, T3-c
  T4-a, T4-b, T4-c, T4-d, T4-e
  T5-a
  T6-a, T6-b, T6-c
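A simplified sketch of the search step, assuming the flooding set is flattened into one set of candidate words (the real system keeps one group per source term; `flood` and `max_window` are illustrative names):

```python
def flood(corpus_tokens, flood_words, max_window=8):
    """Scan the target corpus for the window covering the most
    flooding-set words in the fewest tokens (proximity tie-break)."""
    best = (0, max_window + 1, None)  # (matches, length, span)
    n = len(corpus_tokens)
    for i in range(n):
        seen = set()
        for j in range(i, min(i + max_window, n)):
            if corpus_tokens[j] in flood_words:
                seen.add(corpus_tokens[j])
            matches, length = len(seen), j - i + 1
            # prefer more matches, then shorter word-strings
            if matches and (matches, -length) > (best[0], -best[1]):
                best = (matches, length, corpus_tokens[i:j + 1])
    return best[2]
```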

Page 23: Machine Translation for Digital Library Content

Step 3: Search Target Text (Example)

Flooding the target corpus (T(x) denotes any other target word) yields a first match:

  … T(x) T3-b T(x) T2-d T(x) T(x) T6-c T(x) …

Target Candidate 1: T3-b T(x) T2-d T(x) T(x) T6-c

Page 24: Machine Translation for Digital Library Content

Step 3: Search Target Text (Example)

A second match in the target corpus:

  … T(x) T4-a T6-b T(x) T2-c T3-a T(x) …

Target Candidate 2: T4-a T6-b T(x) T2-c T3-a

Page 25: Machine Translation for Digital Library Content

Step 3: Search Target Text (Example)

A third match in the target corpus:

  … T(x) T3-c T2-b T4-e T5-a T6-a T(x) …

Target Candidate 3: T3-c T2-b T4-e T5-a T6-a

Reintroduce function words after the initial match (e.g. T5)

Page 26: Machine Translation for Digital Library Content

Step 4: Score Word-String Candidates

• Scoring of candidates based on:
  – Proximity (minimize extraneous words in target → n-gram precision)
  – Number of word matches (maximize coverage → recall)
  – Regular words given more weight than function words
  – Combine results (e.g., optimize F1 or p-norm or …)

Target word-string candidates:
  Candidate 1: T3-b T(x) T2-d T(x) T(x) T6-c
  Candidate 2: T4-a T6-b T(x) T2-c T3-a
  Candidate 3: T3-c T2-b T4-e T5-a T6-a

               Proximity | Word Matches | “Regular” Words | Total
  Candidate 1:    3rd    |     3rd      |       3rd       |  3rd
  Candidate 2:    1st    |     2nd      |       1st       |  2nd
  Candidate 3:    1st    |     1st      |       1st       |  1st
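The criteria can be combined F-measure style, as the last bullet suggests. A sketch; the 0.5 function-word weight is an arbitrary assumption of this example:

```python
def score_candidate(candidate, flood_words, function_words, beta=1.0):
    """F-measure-style combination of precision (few extraneous tokens)
    and recall (many flooding-set words covered), weighting regular
    words above function words."""
    def weight(w):
        return 0.5 if w in function_words else 1.0  # assumed weighting
    matched = [w for w in candidate if w in flood_words]
    precision = sum(weight(w) for w in matched) / sum(weight(w) for w in candidate)
    recall = sum(weight(w) for w in set(matched)) / sum(weight(w) for w in flood_words)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```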

Page 27: Machine Translation for Digital Library Content

Step 5: Select Candidates Using Overlap
(Propagate context over the entire sentence)

Word-String 1 candidates:
  T(x2) T4-a T6-b T(x3) T2-c
  T(x1) T2-d T3-c T(x2) T4-b
  T(x1) T3-c T2-b T4-e

Word-String 2 candidates:
  T3-b T(x3) T2-d T(x5) T(x6) T6-c
  T4-a T6-b T(x3) T2-c T3-a
  T3-c T2-b T4-e T5-a T6-a

Word-String 3 candidates:
  T6-b T(x11) T2-c T3-a T(x9)
  T6-b T(x3) T2-c T3-a T(x8)
  T2-b T4-e T5-a T6-a T(x8)

Candidates from adjacent word-strings that overlap are linked (e.g. T(x2) T4-a T6-b T(x3) T2-c with T4-a T6-b T(x3) T2-c T3-a, and T(x1) T3-c T2-b T4-e with T3-c T2-b T4-e T5-a T6-a).

Page 28: Machine Translation for Digital Library Content

Step 5: Select Candidates Using Overlap

  T(x1) T3-c T2-b T4-e
        T3-c T2-b T4-e T5-a T6-a
              T2-b T4-e T5-a T6-a T(x8)

  T(x2) T4-a T6-b T(x3) T2-c
        T4-a T6-b T(x3) T2-c T3-a
              T6-b T(x3) T2-c T3-a T(x8)

Best translations selected via maximal overlap:
  Alternative 1: T(x2) T4-a T6-b T(x3) T2-c T3-a T(x8)
  Alternative 2: T(x1) T3-c T2-b T4-e T5-a T6-a T(x8)

Page 29: Machine Translation for Digital Library Content

A (Simple) Real Example of Overlap

N-grams generated from Flooding:

  A United States soldier
    United States soldier died
                  soldier died and two others
                          died and two others were injured
                               two others were injured Monday

N-grams connected via Overlap:

  A United States soldier died and two others were injured Monday

Systran (for comparison):

  A soldier of the wounded United States died and other two were east Monday

Flooding → n-gram fidelity; Overlap → long-range fidelity
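The overlap connection can be sketched as greedy suffix/prefix chaining (the real decoder searches over alternative candidates per position; this toy joins one candidate per word-string, in order):

```python
def join(ngrams):
    """Chain n-grams whose suffix/prefix words overlap maximally,
    propagating context across the sentence (overlap implies the
    neighboring candidates agree on their shared words)."""
    def overlap(a, b):
        for k in range(min(len(a), len(b)), 0, -1):
            if a[-k:] == b[:k]:
                return k
        return 0
    out = list(ngrams[0])
    for gram in ngrams[1:]:
        k = overlap(out, gram)
        out.extend(gram[k:])
    return out

pieces = [
    "A United States soldier".split(),
    "United States soldier died".split(),
    "soldier died and two others".split(),
    "died and two others were injured".split(),
    "two others were injured Monday".split(),
]
print(" ".join(join(pieces)))
# → A United States soldier died and two others were injured Monday
```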

Page 30: Machine Translation for Digital Library Content

System Scores

[Figure: bar chart of BLEU scores (4 reference translations), all on the same Spanish test set; scale 0.3–0.85. Systems: Systran Spanish, SDL Spanish, Google Spanish (’08 top language), Google Arabic (’06 NIST), Google Chinese (’06 NIST), CBMT Spanish (blind), CBMT Spanish (non-blind). Scores shown: 0.3859, 0.5137, 0.5551, 0.5610, 0.7189 (Google Spanish), 0.7447 (CBMT blind), 0.7533 (CBMT non-blind). Human scoring range marked near the top of the scale.]

Page 31: Machine Translation for Digital Library Content

Historical CBMT Scoring

[Figure: BLEU scores (scale 0.3–0.85) from Apr 2004 through 2008, for blind and non-blind test conditions. Non-blind scores rise from .3743 to .7533; blind scores rise to .7447; intermediate points include .5670, .5953, .6129, .6144, .6165, .6267, .6374, .6393, .6456, .6645, .6694, .6929, .6950, .7059, .7276, .7354, .7365. Human scoring range marked near the top of the scale.]

Page 32: Machine Translation for Digital Library Content

An Example

• Source (Spanish): Un soldado de Estados Unidos murió y otros dos resultaron heridos este lunes por el estallido de un artefacto explosivo improvisado en el centro de Bagdad, dijeron funcionarios militares estadounidenses

• CBMT: A United States soldier died and two others were injured monday by the explosion of an improvised explosive device in the heart of Baghdad, American military officials said.

• Systran: A soldier of the wounded United States died and other two were east Monday by the outbreak from an improvised explosive device in the center of Bagdad, said American military civil employees

BTW: Google’s translation is identical to CBMT’s

Page 33: Machine Translation for Digital Library Content

Stat-Transfer MT: Research Goals
(Lavie, Carbonell, Levin, Vogel & Students)

• Long-term research agenda (since 2000) focused on developing a unified framework for MT that addresses the core fundamental weaknesses of previous approaches:
  – Representation – explore richer formalisms that can capture complex divergences between languages
  – Ability to handle morphologically complex languages
  – Methods for automatically acquiring MT resources from available data and combining them with manual resources
  – Ability to address both rich and poor resource scenarios
• Main research funding sources: NSF (AVENUE and LETRAS projects) and DARPA (GALE)

Page 34: Machine Translation for Digital Library Content

Stat-XFER: List of Ingredients

• Framework: statistical search-based approach with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts
• SMT-Phrasal Base: automatic word and phrase translation lexicon acquisition from parallel data
• Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
• Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
• Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
• XFER + Decoder:
  – XFER engine produces a lattice of possible transferred structures at all levels
  – Decoder searches and selects the best scoring combination

Page 35: Machine Translation for Digital Library Content

Stat-XFER MT Approach

[Figure: MT pyramid. Interlingua at the apex; syntactic parsing and semantic analysis on the source side (e.g. Urdu); sentence planning and text generation on the target side (e.g. English); Statistical-XFER operates at the transfer-rule level; direct approaches (SMT, EBMT) at the base.]

Page 36: Machine Translation for Digital Library Content

Avenue Architecture

[Figure: AVENUE/LETRAS architecture diagram, in four stages. Elicitation: an Elicitation Tool applied to an Elicitation Corpus produces a word-aligned parallel corpus. Rule Learning: a Learning Module, with a Morphology Analyzer, derives Learned Transfer Rules and Lexical Resources. Rule Refinement: a Rule Refinement Module with a Translation Correction Tool refines the acquired rules. Run-Time System: the Run-Time Transfer System plus Decoder, optionally supplemented by handcrafted rules, maps INPUT TEXT to OUTPUT TEXT.]

Page 37: Machine Translation for Digital Library Content

Syntax-driven Acquisition Process

Automatic process for extracting syntax-driven rules and lexicons from sentence-parallel data:

1. Word-align the parallel corpus (GIZA++)
2. Parse the sentences independently for both languages
3. Tree-to-tree constituent alignment:
   a) Run our new Constituent Aligner over the parsed sentence pairs
   b) Enhance alignments with additional constituent projections
4. Extract all aligned constituents from the parallel trees
5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
6. Construct a “data-base” of all extracted parallel constituents and synchronous rules with their frequencies and model them statistically (assign them relative-likelihood probabilities)
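Steps 5–6 in miniature, assuming constituents are already aligned and represented as (label, child-labels) pairs (this representation and the function name are illustrative, not Stat-XFER's):

```python
from collections import Counter, defaultdict

def build_rule_db(aligned_pairs):
    """Turn aligned constituent pairs into synchronous rules, count
    them, and convert counts to relative-likelihood probabilities,
    normalized per source-side pattern."""
    counts = Counter()
    for (s_label, s_kids), (t_label, t_kids) in aligned_pairs:
        rule = (s_label, tuple(s_kids), t_label, tuple(t_kids))
        counts[rule] += 1
    totals = defaultdict(int)
    for (s_label, s_kids, _, _), c in counts.items():
        totals[(s_label, s_kids)] += c
    return {rule: c / totals[(rule[0], rule[1])] for rule, c in counts.items()}
```

For example, an NP whose determiner and noun swap order in two of three aligned instances yields that reordering rule with probability 2/3.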

Page 38: Machine Translation for Digital Library Content

PFA Node Alignment Algorithm Example

• Any constituent or sub-constituent is a candidate for alignment
• Triggered by word/phrase alignments
• Tree structures can be highly divergent

Page 39: Machine Translation for Digital Library Content

PFA Node Alignment Algorithm Example

• Tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)
• Resulting aligned nodes are highlighted in the figure
• Transfer rules are partially lexicalized and read off the tree

Page 40: Machine Translation for Digital Library Content

Targeted Data Acquisition

• Elicitation Corpus for Structural Diversity
  – Field Linguist in a Box (Levin et al)
  – Automated selection of next elicitations
• Active Learning
  – To select maximally informative sentences to translate
    • E.g. with max-frequency n-grams in the monolingual corpus
    • E.g. maximally different from the current bilingual corpus
  – To select hand alignments
  – Use Amazon’s Mechanical Turk or a Web game
    • Multi-source independent verification
  – Elicit just bilingual dictionaries (for CBMT)

Page 41: Machine Translation for Digital Library Content

Active Learning for MT

[Figure: loop diagram. An Active Learner samples sentences S from a source-language corpus; an expert translator turns the sampled corpus into (S,T) pairs; the resulting parallel corpus feeds a model trainer, which updates the MT system; the system informs the next round of sampling.]
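The loop in the diagram, as a skeleton (all function and parameter names here are hypothetical stand-ins for the sampler, the human translator, and the SMT trainer):

```python
def active_learning_loop(source_pool, translate, train, select,
                         iterations=20, batch=1000):
    """Repeatedly pick the most informative source sentences, have a
    human produce target translations, grow the parallel corpus, and
    retrain the MT model on the enlarged corpus."""
    parallel = []
    model = None
    for _ in range(iterations):
        batch_sents = select(source_pool, parallel, batch)
        for s in batch_sents:
            parallel.append((s, translate(s)))  # human in the loop
            source_pool.remove(s)
        model = train(parallel)  # retrain on the enlarged corpus
    return model, parallel
```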

Page 42: Machine Translation for Digital Library Content

ACT Framework: Active Crowd Translation

[Figure: the active-learning loop with the single expert replaced by a crowd. Sentence Selection draws sentences S from the source-language corpus; crowd workers produce multiple translations (S,T1), (S,T2), …, (S,Tn); Translation Selection picks among them; the chosen pairs feed the model trainer and MT system.]

Page 43: Machine Translation for Digital Library Content

Active Learning Strategy: Diminishing Density Weighted Diversity Sampling

density(s) = [ Σ_{x ∈ Phrases(s)} P(x/U) · e^(−count(x/L)) ] / |Phrases(s)|

diversity(s) = [ Σ_{x ∈ Phrases(s)} count(x) · δ(x) ] / |Phrases(s)|,
  where δ(x) = 1 if x ∉ L, 0 if x ∈ L

Score(s) = (1 + β²) · density(s) · diversity(s) / (β² · density(s) + diversity(s))

(U: unlabeled source corpus; L: already-translated corpus)

Experiments:
• Language pair: Spanish–English
• Iterations: 20; batch size: 1,000 sentences each
• Translation: Moses phrase-based SMT
• Development set: 343 sentences; test set: 506 sentences
• Graph – X: performance (BLEU); Y: data (thousands of words)
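A toy rendering of density/diversity sampling (the n-gram phrase extractor, the e^(−count) diminishing factor, and the F-style combination are assumptions of this sketch; frequency tables are `collections.Counter` objects, so missing phrases count as zero):

```python
import math
from collections import Counter

def phrases(sentence, max_n=3):
    """All n-grams up to max_n, standing in for the phrase set."""
    toks = sentence.split()
    return [" ".join(toks[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(toks) - n + 1)]

def sample_score(sentence, unlabeled_counts, labeled_counts, total_u, beta=1.0):
    ph = phrases(sentence)
    # density: phrases frequent in the unlabeled pool help, but their
    # weight diminishes exponentially with their count in labeled data
    density = sum((unlabeled_counts[x] / total_u) * math.exp(-labeled_counts[x])
                  for x in ph) / len(ph)
    # diversity: mass of phrases not yet seen in the labeled corpus
    diversity = sum(unlabeled_counts[x] for x in ph
                    if labeled_counts[x] == 0) / len(ph)
    if density + diversity == 0:
        return 0.0
    return (1 + beta**2) * density * diversity / (beta**2 * density + diversity)
```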

Page 44: Machine Translation for Digital Library Content

Translation Selection from Crowd Data

• Translation Reliability
  – Compare translations for agreement with each other
  – Flexible matching is the key
• Translator Reliability
• Translation Selection
  – Maximal agreement criterion
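The maximal-agreement criterion can be sketched with fuzzy string similarity standing in for "flexible matching" (using `difflib` here is an assumption of this sketch, not the ACT system's matcher):

```python
from difflib import SequenceMatcher

def select_translation(candidates):
    """Return the candidate translation that agrees most with the
    other crowd translations, under fuzzy string similarity."""
    def agreement(i):
        return sum(SequenceMatcher(None, candidates[i], candidates[j]).ratio()
                   for j in range(len(candidates)) if j != i)
    best = max(range(len(candidates)), key=agreement)
    return candidates[best]
```

Translator reliability could then be tracked as each worker's average agreement over many sentences.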

Page 45: Machine Translation for Digital Library Content

Which MT Paradigms are Best? Some Tentative Answers

          | Large T      | Med T       | LD T
Large S   | SMT, CBMT    | SMT, SXFER  | RMT?
Med S     | CBMT, SMT    | SXFER, SMT  | ???
LD S      | CBMT, SXFER  | SXFER, RMT? | ???

• Unidirectional: LD → E (intelligence)
• Bi-directional: LD ↔ E (theater ops)
• LDA ↔ LDB

Page 46: Machine Translation for Digital Library Content

Test of Time: 3K years ago, none of the existing languages existed. 3K years from now?