Topic Models for Dynamic Translation Model Adaptation
Vladimir Eidelman
Jordan Boyd-Graber
Philip Resnik
(Typical) Domain Adaptation
[Figure: a training corpus of documents grouped into known subcorpora (Newswire, Web, Europarl). The dev and test sets are drawn from one of these, so each subcorpus is marked "in" (in-domain) or "out" (out-of-domain) with respect to them.]
Motivation
[Figure: the same training documents shown without subcorpus labels, with the dev and test sets alongside; the domain structure that this style of adaptation relies on is not given.]
Aims
• Model domain
  – Induce soft unsupervised domains
    • Latent topics
• Apply to MT
  – Bias translation model
    • Introduce topic-dependent lexical weighting
Lexical Weighting
• Estimate phrase pair quality word-by-word
[Example: 粉丝 很多 (fěnsī hěnduō); 粉丝 can translate as "noodles" or "fans", 很多 as "a lot of"]
Topic Models
• Used MALLET (McCallum, 2002)
• Latent Dirichlet Allocation (Blei, Ng, Jordan 2003)
• Topics inferred only on the source side
• Topic distribution is the same for every sentence in a document
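As a minimal sketch of this step (not the authors' code), the per-document topic distributions could be inferred with gensim's LdaModel standing in for MALLET; the toy documents and token strings below are hypothetical:

```python
# Sketch: per-document topic inference, with gensim's LdaModel standing in
# for MALLET. Topics are trained on the source side only; every sentence
# inherits the topic distribution of its containing document. (When no
# document boundaries exist, each sentence can be treated as a document.)
from gensim import corpora, models

# Hypothetical source-side documents (token lists); the real input is the
# Chinese side of the bitext.
docs = [["fensi", "henduo", "mifen", "tang"],   # a "food" document
        ["fensi", "mingxing", "yanchanghui"]]   # a "pop star" document

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bow, num_topics=3, id2word=dictionary,
                      alpha="auto", passes=10, random_state=0)

for i, d in enumerate(bow):
    # p(topic | document), shared by every sentence in document i
    print(i, lda.get_document_topics(d, minimum_probability=0.0))
```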
Standard Lexical Weighting
A single translation table, estimated over the whole training corpus:

Translation Table
  Source     Target            P(e|f)
  粉丝很多   lots of noodles   .45
  粉丝很多   lots of fans      .33
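For reference, a minimal sketch of the relative-frequency estimate behind such a table, assuming word-aligned training data; the toy word pairs are hypothetical:

```python
# Sketch: standard lexical translation probabilities pooled over the
# whole corpus, P(e|f) = count(f, e) / count(f).
from collections import Counter

# Hypothetical aligned (source, target) word pairs from training data.
aligned = [("粉丝", "noodles"), ("粉丝", "noodles"), ("粉丝", "fans"),
           ("很多", "a lot of")]

pair_count = Counter(aligned)
src_count = Counter(f for f, _ in aligned)

def p_e_given_f(e, f):
    """Maximum-likelihood estimate of P(e|f)."""
    return pair_count[(f, e)] / src_count[f]

print(p_e_given_f("noodles", "粉丝"))  # 2/3
print(p_e_given_f("fans", "粉丝"))     # 1/3
```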
Domain Lexical Weighting (Chiang 2011)
One translation table per annotated training subcorpus:

Translation Table: nw
  Source     Target            Ps=nw(e|f)
  粉丝很多   lots of noodles   .41
  粉丝很多   lots of fans      .32

Translation Table: Web
  Source     Target            Ps=wb(e|f)
  粉丝很多   lots of noodles   .30
  粉丝很多   lots of fans      .58
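A sketch of the domain-conditioned variant under the same assumptions: the identical estimate, but with counts kept separately per subcorpus, yielding one table per domain (the toy data is again hypothetical):

```python
# Sketch: Chiang (2011)-style domain-conditioned lexical weights.
# Counts are kept per training subcorpus (nw, wb, ...), so each domain
# gets its own table P_{s=d}(e|f).
from collections import Counter

# Hypothetical (domain, source, target) aligned word pairs.
aligned = [("nw", "粉丝", "noodles"), ("nw", "粉丝", "fans"),
           ("wb", "粉丝", "fans"), ("wb", "粉丝", "fans"),
           ("wb", "粉丝", "noodles")]

pair_count = Counter(aligned)
src_count = Counter((d, f) for d, f, _ in aligned)

def p_e_given_f(e, f, domain):
    """P_{s=domain}(e|f) from the subcorpus-specific counts."""
    return pair_count[(domain, f, e)] / src_count[(domain, f)]

print(p_e_given_f("fans", "粉丝", "nw"))  # 1/2
print(p_e_given_f("fans", "粉丝", "wb"))  # 2/3
```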
Lexical Weighting with Topic Models
One translation table per induced latent topic:

Translation Table: Topic 1
  Source     Target            Ptopic=1(e|f)
  粉丝很多   lots of noodles   .71
  粉丝很多   lots of fans      .15

Translation Table: Topic 2
  Source     Target            Ptopic=2(e|f)
  粉丝很多   lots of noodles   .41
  粉丝很多   lots of fans      .47

Translation Table: Topic 3
  Source     Target            Ptopic=3(e|f)
  粉丝很多   lots of noodles   .21
  粉丝很多   lots of fans      .68
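A sketch of how such topic-conditional tables can be estimated with soft membership: instead of a hard domain label, each aligned pair contributes the fractional count p(topic=k | its document) to topic k's counts. The topic mixtures and pairs below are hypothetical:

```python
# Sketch: topic-conditioned lexical weights with soft counts. Each aligned
# pair adds p(topic=k | doc) to topic k's counts instead of a hard 1.
from collections import defaultdict

K = 3
pair_count = defaultdict(float)  # (k, f, e) -> expected count
src_count = defaultdict(float)   # (k, f)    -> expected count

# Hypothetical training events: (p(topic | doc) over K topics, source, target).
events = [((0.8, 0.1, 0.1), "粉丝", "noodles"),
          ((0.1, 0.5, 0.4), "粉丝", "fans"),
          ((0.1, 0.2, 0.7), "粉丝", "fans")]

for theta, f, e in events:
    for k in range(K):
        pair_count[(k, f, e)] += theta[k]
        src_count[(k, f)] += theta[k]

def p_e_given_f(e, f, k):
    """P_{topic=k}(e|f) from expected counts."""
    return pair_count[(k, f, e)] / src_count[(k, f)]

print(round(p_e_given_f("noodles", "粉丝", 0), 2))  # 0.8: topic 0 favors "noodles"
print(round(p_e_given_f("fans", "粉丝", 2), 2))     # 0.92: topic 2 favors "fans"
```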
Lexical Weighting Adaptation Features
Each per-topic probability is scaled by that topic's probability under the test sentence, e.g. with p(topic=1 | test sentence) = 0.65:

  ƒ1(e|f) = Ptopic=1(e|f) × p(topic=1 | test sentence)
  ƒ1(lots of noodles | 粉丝很多) = 0.71 × 0.65 ≈ .46
  ƒ1(lots of fans | 粉丝很多) = 0.15 × 0.65 ≈ .09

Translation Table: Topic 1
  Source     Target            Ptopic(e|f)   ƒ1(e|f)
  粉丝很多   lots of noodles   .71           .46
  粉丝很多   lots of fans      .15           .09

Translation Table: Topic 2
  Source     Target            Ptopic(e|f)   ƒ2(e|f)
  粉丝很多   lots of noodles   .41           .09
  粉丝很多   lots of fans      .47           .10

Translation Table: Topic 3
  Source     Target            Ptopic(e|f)   ƒ3(e|f)
  粉丝很多   lots of noodles   .21           .02
  粉丝很多   lots of fans      .68           .08

The resulting features are attached to every grammar rule:

粉丝很多 ||| lots of fans ||| ƒ1(e|f)=.09 ƒ2(e|f)=.10 ƒ3(e|f)=.08 ƒ1(f|e) ƒ2(f|e) ƒ3(f|e) …
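A sketch of the feature computation above; the sentence-level topic mixture (0.65, 0.22, 0.13) is a hypothetical value chosen to be consistent with the slide's 0.71 × 0.65 ≈ .46 example, so the printed values only approximately match the slide's rounding:

```python
# Sketch: per-topic adaptation features,
#   f_k(e|f) = P_{topic=k}(e|f) * p(topic=k | test sentence),
# computed fresh for each test sentence from its inferred topic mixture.
p_topic = {1: {"lots of noodles": 0.71, "lots of fans": 0.15},
           2: {"lots of noodles": 0.41, "lots of fans": 0.47},
           3: {"lots of noodles": 0.21, "lots of fans": 0.68}}

# Hypothetical p(topic | test sentence), consistent with 0.71 * 0.65 = .46.
sent_topics = {1: 0.65, 2: 0.22, 3: 0.13}

def features(target):
    """One f_k(e|f) value per topic for this phrase pair."""
    return {k: round(p_topic[k][target] * sent_topics[k], 2)
            for k in p_topic}

print(features("lots of noodles"))  # {1: 0.46, 2: 0.09, 3: 0.03}
print(features("lots of fans"))     # {1: 0.1, 2: 0.1, 3: 0.09}
```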
Experiments
• Chinese-English
• Two settings
  – Small (FBIS)
    • 300k sentence pairs
    • Document boundaries
  – Large (~NIST)
    • 1.6m sentence pairs
    • No document boundaries
• NIST MT06 tune, MT03 & MT05 test
• MIRA optimizer
Unsupervised Domain Induction
• What is a document (for topic modeling)?
• Only some MT data have document boundaries
• Treat each sentence as a document
FBIS Document v. Sentence Results
[Results figure: document-level vs. sentence-level topic models on FBIS]
Large Setting
[Results figure: the large (~NIST) setting]
Future Work
• Improve topic model
  – Multilingual topic modeling
  – More (mono-, multi-)lingual data
  – Hierarchical models
• Other languages
Conclusions
• Extend domain adaptation
  – No reliance on collection/genre annotation
  – Finer-grained topic distributions
• Bias translation toward topic
  – Lexical weighting adaptation with soft membership
  – Add Ptopic(e|f) and Ptopic(f|e) features to every rule

Thank You!
Questions?
Feature Representation
• Topic identity
  – Probability under topic 1, topic 2?
  – Cross-domain
• Topic distribution
  – Probability under most probable topic? Second most?
  – Dynamic
Global vs. Local Topic Model
[Figure: global vs. local topic models on a large corpus]