
Lecture 11: Statistical/Probabilistic Models for CLIR & Word Alignment

Wen-Hsiang Lu (盧文祥)

Department of Computer Science and Information Engineering, National Cheng Kung University

2011/05/30

Cross-Language Information Retrieval

[Figure: source query → query translation → target translation → information retrieval → target documents]

• Issue a query in the source language and retrieve relevant documents in the target languages

• Example: "Hussein" is rendered as 海珊 / 侯賽因 / 哈珊 / 胡笙 in Traditional Chinese (TC) and as 侯赛因 / 海珊 / 哈珊 in Simplified Chinese (SC)

References

• The Web as a Parallel Corpus
– Philip Resnik and Noah A. Smith, Computational Linguistics, Special Issue on the Web as Corpus, 2003

• Automatic Construction of English/Chinese Parallel Corpora
– Christopher C. Yang and Kar Wing Li, Journal of the American Society for Information Science and Technology, 2003

• Statistical Cross-Language Information Retrieval using N-Best Query Translations (SIGIR 2002)
– Marcello Federico and Nicola Bertoldi, ITC-irst Centro per la Ricerca Scientifica e Tecnologica

• Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval
– Wessel Kraaij, Jian-Yun Nie, and Michel Simard, Computational Linguistics, Special Issue on the Web as Corpus, 2003

• A Probability Model to Improve Word Alignment (ACL 2003)
– Colin Cherry and Dekang Lin, University of Alberta

The Web as Corpus

Outline

• The Web as a Parallel Corpus
– Philip Resnik and Noah A. Smith, Computational Linguistics, Special Issue on the Web as Corpus, 2003

• Automatic Construction of English/Chinese Parallel Corpora
– Christopher C. Yang and Kar Wing Li, Journal of the American Society for Information Science and Technology, 2003

• Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval
– Wessel Kraaij, Jian-Yun Nie, and Michel Simard, Computational Linguistics, Special Issue on the Web as Corpus, 2003

The Web as a Parallel Corpus

Philip Resnik and Noah A. Smith

Computational Linguistics, Special Issue on the Web as Corpus, 2003

Parallel Corpora

• Bitexts, bodies of text in parallel translation, play an important role in machine translation and multilingual natural language processing.

• Not readily available in the necessary quantities
– Canadian parliamentary proceedings (Hansards) in English/French
– United Nations proceedings (Linguistic Data Consortium, http://www.ldc.upenn.edu/)
– Religious texts (Resnik, Olsen, and Diab)
– Localized versions of software manuals (Resnik and Melamed 1997; Menezes and Richardson)

STRAND

• STRAND (Structural Translation Recognition, Acquiring Natural Data): an architecture for finding parallel text on the Web (Resnik 1998, 1999)

• Goal: identify pairs of Web pages that are mutual translations.

• Web page authors disseminate information in multiple languages
– When presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure

Finding Parallel Web Pages

• Finding parallel text on the Web consists of three main steps:
– Location of pages that might have parallel translations
– Generation of candidate pairs that might be translations
– Structural filtering out of nontranslation candidate pairs

• Locating pages
– Two types: parents and siblings
– Ask AltaVista: (anchor:"english" OR anchor:"anglais") AND (anchor:"french" OR anchor:"francais")

Two types of Website Structure

STRAND

• Generating candidate pairs
– Automatic language identification (Dunning 1994)
– URL matching: manually creating a list of substitution rules
  • E.g., http://mysite.com/english/home_en.html => http://mysite.com/big5/home_ch.html
– Document length: length(E) ≈ C · length(F)

• Structural filtering
– The heart of STRAND
– Markup analyzer: determine a set of pair-specific structural values for translation pairs
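The candidate-generation step (URL substitution rules plus a document-length check) can be sketched as follows. This is only an illustration under stated assumptions: the rule list, the constant `c`, the `tolerance` threshold, and all function names are hypothetical and are not taken from STRAND itself.

```python
import re

# Hypothetical substitution rules: (pattern, replacement) pairs that map an
# English-side URL to the URL its Chinese counterpart might have.
SUBSTITUTION_RULES = [
    (r"/english/", "/big5/"),
    (r"_en\.html$", "_ch.html"),
]


def candidate_url(english_url, rules=SUBSTITUTION_RULES):
    """Apply every substitution rule, in order, to an English-side URL."""
    url = english_url
    for pattern, replacement in rules:
        url = re.sub(pattern, replacement, url)
    return url


def plausible_length(len_e, len_f, c=1.0, tolerance=0.5):
    """Keep a pair only if length(E) is roughly c * length(F).

    Both c and tolerance are illustrative values, not tuned ones.
    """
    if len_f == 0:
        return False
    return abs(len_e / (c * len_f) - 1.0) <= tolerance


def generate_candidates(pages, rules=SUBSTITUTION_RULES, c=1.0, tolerance=0.5):
    """pages maps URL -> document length; returns candidate (E, F) URL pairs."""
    pairs = []
    for url_e, len_e in pages.items():
        url_f = candidate_url(url_e, rules)
        # A rule must have fired, and the rewritten URL must actually exist.
        if url_f != url_e and url_f in pages:
            if plausible_length(len_e, pages[url_f], c, tolerance):
                pairs.append((url_e, url_f))
    return pairs
```

Pairs that survive this step would still have to pass the structural filter before being accepted as translations.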

Automatic Construction of English/Chinese Parallel Corpora

Christopher C. Yang and Kar Wing Li

Journal of the American Society for Information Science and Technology, 2003

Web Parallel Corpora

• Some web sites with bilingual text contain a completely separate monolingual sub-tree for each language.

• Title alignment and dynamic programming matching


Statistical Cross-Language Information Retrieval using N-Best Query Translations

Marcello Federico & Nicola Bertoldi, ITC-irst Centro per la Ricerca Scientifica e Tecnologica

Outline

• Statistical CLIR Approach
• Query-Document Model
• Query-Translation Model

Statistical CLIR Approach

• CLIR problem
− Given a query i in the source language (Italian), one would like to find the relevant documents d in the target language (English) within a collection D:

$$P(d \mid \mathbf{i}) \propto P(\mathbf{i}, d)$$

− To bridge the language gap between query and documents, a hidden variable e is introduced, which represents an English translation of i:

$$P(\mathbf{i}, d) = \sum_{\mathbf{e}} P(\mathbf{i}, \mathbf{e}, d) \approx \sum_{\mathbf{e}} P(\mathbf{i}, \mathbf{e})\, P(d \mid \mathbf{e}) = \sum_{\mathbf{e}} P(\mathbf{i}, \mathbf{e})\, \frac{P(\mathbf{e}, d)}{\sum_{d'} P(\mathbf{e}, d')}$$

– P(e, d) is computed by the query-document model
– P(i, e) is computed by the query-translation model

Query-Document Model

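• Statistical LM & smoothing
– Term frequencies of a document are smoothed linearly, and the amount of probability assigned to never-observed terms is proportional to the size of the document vocabulary

$$P(\mathbf{q}, d) = P(\mathbf{q} \mid d)\, P(d), \qquad P(\mathbf{q} = q_1 \ldots q_n \mid d) = \prod_{k=1}^{n} P(q_k \mid d)$$

Local (document-level) estimate:

$$P(q \mid d) = \frac{N(d, q)}{N(d) + |V(d)|} + \frac{|V(d)|}{N(d) + |V(d)|}\, P(q)$$

Global (collection-level) estimate:

$$P(q) = \frac{N(q)}{N + |V|} + \frac{|V|}{N + |V|} \cdot \frac{1}{|V|}$$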
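A minimal sketch of this two-level linear smoothing is given below. It assumes the usual reading of the symbols: N(d, q) is the frequency of term q in document d, N(d) the document length, |V(d)| the number of distinct terms in d, and N(q), N, |V| the corresponding collection-level quantities. Function and parameter names are illustrative only.

```python
import math
from collections import Counter


def collection_prob(term, coll_counts, coll_len, vocab_size):
    """Global P(q): collection frequency interpolated with a uniform 1/|V|."""
    lam = vocab_size / (coll_len + vocab_size)
    return coll_counts.get(term, 0) / (coll_len + vocab_size) + lam / vocab_size


def doc_prob(term, doc_counts, global_p):
    """Local P(q | d): document frequency interpolated with the global P(q)."""
    n_d = sum(doc_counts.values())   # N(d), document length
    v_d = len(doc_counts)            # |V(d)|, distinct terms in d
    if n_d == 0:
        return global_p              # empty document: fall back to P(q)
    lam = v_d / (n_d + v_d)
    return doc_counts.get(term, 0) / (n_d + v_d) + lam * global_p


def log_query_likelihood(query, doc_tokens, coll_counts, coll_len, vocab_size):
    """log P(q_1..q_n | d) = sum over k of log P(q_k | d)."""
    doc_counts = Counter(doc_tokens)
    total = 0.0
    for q in query:
        p_q = collection_prob(q, coll_counts, coll_len, vocab_size)
        total += math.log(doc_prob(q, doc_counts, p_q))
    return total
```

Because the interpolation weight grows with |V(d)|, a document with many distinct terms reserves more probability mass for terms it does not contain.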

Query-Translation Model

• According to the HMM:

$$P(\mathbf{i} = i_1 \ldots i_n,\ \mathbf{e} = e_1 \ldots e_n) = P(e_1)\, P(i_1 \mid e_1) \prod_{k=2}^{n} P(e_k \mid e_{k-1})\, P(i_k \mid e_k)$$

• Determine N-best translations
– The most probable translation e* can be computed through the Viterbi search algorithm.
– Intermediate results of the Viterbi algorithm can be used by the A* search algorithm to efficiently compute the N most probable translations of i.
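The Viterbi step for the single best translation e* can be sketched as below. This is an illustrative implementation, not the authors' code: probabilities are passed in as plain dictionaries, every query term is assumed to have at least one candidate translation, and the N-best extension with A* is not shown.

```python
import math


def viterbi_translate(query, translations, p_init, p_trans, p_emit):
    """Return (log-probability, best translation) under the HMM.

    query        : list of source terms i_1..i_n
    translations : dict, source term -> list of candidate target terms
    p_init       : dict, e -> P(e_1)
    p_trans      : dict, (e_prev, e) -> P(e_k | e_{k-1})
    p_emit       : dict, (i, e) -> P(i_k | e_k)
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[e] = (log-prob of the best path ending in e, that path)
    best = {
        e: (log(p_init.get(e, 0.0)) + log(p_emit.get((query[0], e), 0.0)), [e])
        for e in translations[query[0]]
    }
    for i_k in query[1:]:
        new_best = {}
        for e in translations[i_k]:
            emit = log(p_emit.get((i_k, e), 0.0))
            # Pick the best predecessor state for e.
            score, path = max(
                (
                    (prev_score + log(p_trans.get((e_prev, e), 0.0)) + emit, prev_path)
                    for e_prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[e] = (score, path + [e])
        best = new_best
    return max(best.values(), key=lambda item: item[0])
```

Working in log space avoids underflow on long queries; the per-position tables kept in `best` are the intermediate results that an N-best search could reuse.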

Query-Translation Model

• P(i | e) is estimated from a bilingual dictionary:

$$P(i \mid e) = \frac{\delta(i, e)}{\sum_{i'} \delta(i', e)}, \qquad \delta(i, e) = \begin{cases} 1 & \text{if } (i, e) \text{ is a translation pair} \\ 0 & \text{otherwise} \end{cases}$$

• P(e | e') is estimated on the target document collection (order-free bigram LM):

$$P(e \mid e') = \frac{P(e, e')}{\sum_{e''} P(e'', e')}$$

• Smoothing:

$$P(e, e') = \max\left\{ \frac{C(e, e') - \beta}{N},\ 0 \right\} + \beta\, P(e)\, P(e'), \qquad \beta = \frac{n_1}{n_1 + 2 n_2}$$

where C(e, e') is the number of co-occurrences appearing in the corpus, P(e) is estimated as described above for P(q), and n_k represents the number of term pairs occurring k times in the corpus.

CLIR Algorithm

• Use two approximations to limit the set of possible translations and documents.

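$$\text{Appr. 1:}\quad P'(\mathbf{i}, \mathbf{e}) = \begin{cases} \dfrac{P(\mathbf{i}, \mathbf{e})}{K_1(\mathbf{i})} & \text{if } \mathbf{e} \in T_N(\mathbf{i}) \\ 0 & \text{otherwise} \end{cases}$$

$$\text{Appr. 2:}\quad P'(\mathbf{e}, d) = \begin{cases} \dfrac{P(\mathbf{e}, d)}{K_2(\mathbf{e})} & \text{if } d \in I(\mathbf{e}) \\ 0 & \text{otherwise} \end{cases}$$

These are applied to the ranking formula:

$$P(\mathbf{i}, d) = \sum_{\mathbf{e}} P(\mathbf{i}, \mathbf{e})\, \frac{P(\mathbf{e}, d)}{\sum_{d'} P(\mathbf{e}, d')}$$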
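A rough sketch of how the two approximations restrict the computation: the sum over translations runs only over an N-best list, and for each translation only a limited candidate document set is scored. The sketch assumes that T_N(i) is the N-best list of i and that the candidate set I(e) comes from an inverted index; `query_doc_prob` and `docs_for` are hypothetical callbacks standing in for the query-document model and the index lookup.

```python
from collections import defaultdict


def clir_scores(nbest, query_doc_prob, docs_for):
    """Rank documents for a source query.

    nbest          : list of (translation, P(i, e)) for the N-best translations
    query_doc_prob : function (translation, doc_id) -> P(e, d)
    docs_for       : function translation -> iterable of candidate doc ids
    """
    scores = defaultdict(float)
    for e, p_ie in nbest:
        joint = {d: query_doc_prob(e, d) for d in docs_for(e)}
        norm = sum(joint.values())        # sum over d' of P(e, d')
        if norm == 0.0:
            continue                      # translation matches no document
        for d, p_ed in joint.items():
            scores[d] += p_ie * p_ed / norm
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Any per-translation normalization constant applied to P(e, d) cancels in the ratio, so it is omitted here.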

Complexity of CLIR Algorithm

• n: query length
• N: number of generated translations
• [symbol missing]: average number of translations of a term
• I: average number of documents spanned by each entry of the inverted file index

Text Preprocessing

Blind Relevance Feedback

• The R most relevant terms are selected from the top B ranked documents according to:

$$r_w\, \frac{(r_w + 0.5)(N - N_w - B + r_w + 0.5)}{(N_w - r_w + 0.5)(B - r_w + 0.5)}$$

where r_w is the number of documents containing term w among the top B documents.
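Term selection for blind relevance feedback can be sketched as follows. The weight used is r_w (r_w + 0.5)(N − N_w − B + r_w + 0.5) / ((N_w − r_w + 0.5)(B − r_w + 0.5)); the sketch assumes that N is the number of documents in the collection and N_w the number of documents containing w, which the slide does not spell out. Function names are illustrative.

```python
def term_weight(r_w, n_w, b, n):
    """Weight of a term found in r_w of the top b documents.

    n_w : number of documents in the collection containing the term (assumed)
    n   : total number of documents in the collection (assumed)
    """
    num = (r_w + 0.5) * (n - n_w - b + r_w + 0.5)
    den = (n_w - r_w + 0.5) * (b - r_w + 0.5)
    return r_w * num / den


def expansion_terms(top_docs, doc_freq, n_docs, r):
    """Return the r highest-weighted terms from the top-ranked documents.

    top_docs : list of token lists (the top B documents)
    doc_freq : dict, term -> N_w
    """
    b = len(top_docs)
    r_counts = {}
    for tokens in top_docs:
        for w in set(tokens):             # count each document once per term
            r_counts[w] = r_counts.get(w, 0) + 1
    ranked = sorted(
        r_counts,
        key=lambda w: (-term_weight(r_counts[w], doc_freq[w], b, n_docs), w),
    )
    return ranked[:r]
```

Frequent collection terms receive a low weight even when they appear in many of the top documents, because N_w − r_w dominates the denominator.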

Comparison with other CLIR Models

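• Hiemstra (1999)

$$P(\mathbf{i} \mid d) = \prod_{k=1}^{n} P(i_k \mid d) = \prod_{k=1}^{n} \sum_{e_k} P(i_k, e_k \mid d) = \prod_{k=1}^{n} \sum_{e_k} P(i_k \mid e_k)\, P(e_k \mid d)$$

• Xu (2001)

$$P(\mathbf{i} \mid d) = \prod_{k=1}^{n} \Big[ \alpha\, P(i_k) + (1 - \alpha) \sum_{e_k} P(i_k \mid e_k)\, P(e_k \mid d) \Big]$$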

Term Translation Model using Search Result Pages

• Apply page authority to search-result-based translation extraction

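$$P(t \mid q) \approx P(t \mid R_q) = \sum_{dr} P(t \mid dr)\, P(dr \mid q) = \sum_{dr} \frac{P(t \mid dr)\, P(q \mid dr)\, P(dr)}{P(q)} = \frac{1}{P(q)} \sum_{dr} \big[ P(t \mid dr)\, P(q \mid dr) \big]\, P(dr)$$

where

$$P(dr) = \frac{\#\,\text{links of } dr}{\#\,\text{total links}}$$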
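As an illustration of weighting by page authority, the sketch below scores each candidate translation t by summing P(t | dr) P(q | dr) P(dr) over search-result pages dr, with P(dr) taken as the page's share of links. The constant factor 1/P(q) is dropped since it does not change the ranking. The page record layout (`p_t`, `p_q`, `links`) is a hypothetical data structure, and the probabilities in the usage example are made up.

```python
def translation_scores(pages, total_links):
    """Score candidate translations from search-result pages.

    pages : list of dicts with keys
            "p_t"   -> dict, candidate t -> P(t | dr)
            "p_q"   -> P(q | dr)
            "links" -> number of links of the page
    Returns a dict, t -> unnormalized score proportional to P(t | q).
    """
    scores = {}
    for page in pages:
        authority = page["links"] / total_links      # P(dr)
        for t, p_t in page["p_t"].items():
            scores[t] = scores.get(t, 0.0) + p_t * page["p_q"] * authority
    return scores


# Usage with made-up numbers: two result pages for the query "Hussein".
example_pages = [
    {"p_t": {"胡笙": 0.2, "海珊": 0.8}, "p_q": 0.5, "links": 30},
    {"p_t": {"海珊": 0.5, "侯賽因": 0.5}, "p_q": 0.4, "links": 10},
]
example_scores = translation_scores(example_pages, total_links=40)
best_translation = max(example_scores, key=example_scores.get)
```

A page with more links contributes more to the score of the translations it contains, which is the intended effect of the authority term.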

Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval

Wessel Kraaij, Jian-Yun Nie and Michel Simard

Computational Linguistics, Special Issue on the Web as Corpus, 2003

Web Mining for CLIR

• The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically.

• The resulting translation models can be embedded in several ways in a retrieval model.

• Conventional approach: IR + MT (machine translation)

Problems in Query Translation

• Finding translations
– Lexical coverage: proper names and abbreviations.
– Transliteration: phonemic representation of a named entity.
  • Jeltsin, Eltsine, Yeltsin, and Jelzin (in Latin script)

• Pruning translation alternatives

• Weighting translation alternatives

Exploitation of Parallel Texts

• Using a pseudofeedback approach (Yang et al. 1998)

• Capturing global cross-language term associations (Yang et al. 1998; Lavrenko 2002)

• Transposing to a language-independent semantic space (Dumais et al. 1997; Yang et al. 1998)

• Training a statistical translation model (Nie et al. 1999; Franz et al. 2001; Hiemstra 2001; Xu et al. 2001)

Mining Process in PTMiner

Embedding Translation into IR Model

• Basic language model

• Normalized log-likelihood ratio (NLLR)

Embedding Translation into IR Model

* Query Model:

* Document Model:

* Basic Language Model:

(log-likelihood ratio)

(normalized log-likelihood ratio)
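As a rough illustration of ranking by a normalized log-likelihood ratio, the sketch below assumes the commonly cited form NLLR(Q; D) = Σ_t P(t | Q) · log( ((1 − λ) P(t | D) + λ P(t | C)) / P(t | C) ), with maximum-likelihood estimates for the query and document models and a background collection model P(t | C). The smoothing weight `lam` and all names are assumptions for this example, and the embedding of translation probabilities into the query or document model is not shown.

```python
import math
from collections import Counter


def nllr(query_tokens, doc_tokens, coll_prob, lam=0.3):
    """Normalized log-likelihood ratio of a query given a document.

    coll_prob : dict, term -> P(t | C); must be > 0 for every query term
    lam       : weight of the collection model in the smoothed document model
    """
    q_counts = Counter(query_tokens)
    d_counts = Counter(doc_tokens)
    q_len, d_len = len(query_tokens), len(doc_tokens)
    score = 0.0
    for term, count in q_counts.items():
        p_q = count / q_len                              # P(t | Q)
        p_d = d_counts[term] / d_len if d_len else 0.0   # P(t | D)
        p_c = coll_prob[term]                            # P(t | C)
        score += p_q * math.log(((1 - lam) * p_d + lam * p_c) / p_c)
    return score
```

A document that contains none of the query terms scores exactly log(lam), which gives a convenient floor when comparing documents.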

A Probability Model to Improve Word Alignment

Colin Cherry & Dekang Lin, University of Alberta

Outline

• Introduction
• Probabilistic Word-Alignment Model
• Word-Alignment Algorithm
– Constraints
– Features

Introduction

• Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation.
– E.g., translation lexicons, transfer rules

• Word-alignment problem
– Conventional approaches usually used co-occurrence models
  • E.g., φ² (Gale & Church 1991), log-likelihood ratio (Dunning 1993)
– Indirect association problem: Melamed (2000) proposed competitive linking along with an explicit noise model to solve it

$$score_B(u, v) = \log \frac{B(links(u, v) \mid cooc(u, v), \lambda^{+})}{B(links(u, v) \mid cooc(u, v), \lambda^{-})}$$

• Goal: propose a probabilistic word-alignment model that allows easy integration of context-specific features.

[Figure: candidate links between "CISCO System Inc." and 思科系統, including noise links]
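Competitive linking is usually described as a greedy procedure: candidate word pairs are sorted by association score, and a pair is linked only if neither word has been linked already. The sketch below follows that description; the scores in the usage example are invented for illustration and the noise model is not included.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking.

    scores : dict, (u, v) -> association score
    Returns the list of accepted links, best score first.
    """
    links, used_u, used_v = [], set(), set()
    for (u, v), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if u not in used_u and v not in used_v:
            links.append((u, v))
            used_u.add(u)
            used_v.add(v)
    return links


# Usage with made-up scores: the indirect association (CISCO, 系統) loses
# to the two stronger direct links.
example_links = competitive_linking({
    ("CISCO", "思科"): 5.0,
    ("System", "系統"): 4.0,
    ("CISCO", "系統"): 3.0,
    ("Inc.", "思科"): 1.0,
})
```

Because each word can be used only once, a weaker indirect association cannot be linked after the stronger direct link has claimed one of its words.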

Probabilistic Word-Alignment Model

• Given E = e1, e2, …, em and F = f1, f2, …, fn

– If ei and fj are a translation pair, then the link l(ei, fj) exists

– If ei has no corresponding translation, then the null link l(ei, f0) exists

– If fj has no corresponding translation, then the null link l(e0, fj) exists

– An alignment A is a set of links such that every word in E and F participates in at least one link

• The alignment problem is to find the alignment A that maximizes P(A | E, F)

• IBM's translation models instead maximize P(A, F | E)

Probabilistic Word-Alignment Model (Cont.)

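• Given $A = \{l_1, l_2, \ldots, l_t\}$, where $l_k = l(e_{i_k}, f_{j_k})$, then

$$P(A \mid E, F) = P(l_1^t \mid E, F) = \prod_{k=1}^{t} P(l_k \mid E, F, l_1^{k-1})$$

where $l_i^j$ denotes a consecutive subset of A, $l_i^j = \{l_i, l_{i+1}, \ldots, l_j\}$

• Let $C_k = \{E, F, l_1^{k-1}\}$ represent the context of $l_k$

$$P(l_k \mid C_k) = \frac{P(l_k, C_k)}{P(C_k)} = \frac{P(C_k \mid l_k)\, P(l_k)}{P(C_k, e_{i_k}, f_{j_k})} = \frac{P(C_k \mid l_k)}{P(C_k \mid e_{i_k}, f_{j_k})} \cdot \frac{P(l_k, e_{i_k}, f_{j_k})}{P(e_{i_k}, f_{j_k})} = P(l_k \mid e_{i_k}, f_{j_k})\, \frac{P(C_k \mid l_k)}{P(C_k \mid e_{i_k}, f_{j_k})}$$

using

$$P(C_k, e_{i_k}, f_{j_k}) = P(C_k)\, P(e_{i_k}, f_{j_k} \mid C_k), \qquad P(e_{i_k}, f_{j_k} \mid C_k) = 1$$

$$P(l_k, e_{i_k}, f_{j_k}) = P(l_k)\, P(e_{i_k}, f_{j_k} \mid l_k), \qquad P(e_{i_k}, f_{j_k} \mid l_k) = 1$$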

Probabilistic Word-Alignment Model (Cont.)

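• $C_k = \{E, F, l_1^{k-1}\}$ is too complex to estimate

• $FT_k$ is a set of context-related features such that $P(l_k \mid C_k)$ can be approximated by $P(l_k \mid e_{i_k}, f_{j_k}, FT_k)$

• Let $C_k' = \{e_{i_k}, f_{j_k}\} \cup FT_k$

$$P(l_k \mid C_k') = P(l_k \mid e_{i_k}, f_{j_k})\, \frac{P(C_k' \mid l_k)}{P(C_k' \mid e_{i_k}, f_{j_k})} = P(l_k \mid e_{i_k}, f_{j_k})\, \frac{P(FT_k \mid l_k)}{P(FT_k \mid e_{i_k}, f_{j_k})}$$

$$P(A \mid E, F) = \prod_{k=1}^{t} \left( P(l_k \mid e_{i_k}, f_{j_k}) \times \prod_{ft \in FT_k} \frac{P(ft \mid l_k)}{P(ft \mid e_{i_k}, f_{j_k})} \right)$$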
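The factored form of the model, in which each link contributes P(l_k | e, f) times a product of feature ratios P(ft | l_k) / P(ft | e, f), translates directly into a sum of logs. The sketch below only shows this arithmetic; how the individual probabilities are estimated is outside its scope, and the input layout is an assumption made for the example.

```python
import math


def link_log_score(p_link, feature_ratios):
    """log of P(l | e, f) * prod over ft of P(ft | l) / P(ft | e, f).

    p_link         : P(l_k | e_ik, f_jk)
    feature_ratios : list of (P(ft | l_k), P(ft | e_ik, f_jk)) for active features
    """
    score = math.log(p_link)
    for p_ft_given_link, p_ft_given_pair in feature_ratios:
        score += math.log(p_ft_given_link / p_ft_given_pair)
    return score


def alignment_log_prob(links):
    """log P(A | E, F) for an alignment given as a list of (p_link, feature_ratios)."""
    return sum(link_log_score(p_link, ratios) for p_link, ratios in links)
```

A feature that is more likely when the link holds than it is for the word pair in general has a ratio above 1 and raises the link's score; a ratio below 1 lowers it.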

An Illustrative Example

Word-Alignment Algorithm

• Input: E, F, T_E
– T_E is E's dependency tree, which enables us to make use of features and constraints based on linguistic intuitions

• Constraints
– One-to-one constraint: every word participates in exactly one link
– Cohesion constraint: use T_E to induce T_F with no crossing dependencies

Word-Alignment Algorithm (Cont.)

• Features
– Adjacency features ft_a: for any word pair (e_i, f_j), if a link l(e_i', f_j') exists where −2 ≤ i' − i ≤ 2 and −2 ≤ j' − j ≤ 2, then ft_a(i − i', j − j', e_i') is active for this context.
– Dependency features ft_d: for any word pair (e_i, f_j), let e_i' be the governor of e_i, and let rel be the grammatical relationship between them. If a link l(e_i', f_j') exists, then ft_d(j − j', rel) is active for this context.

Example:
– pair (the_1, l'): ft_a(−1, −1, host), ft_d(−1, det)
– pair (the_1, les): ft_d(3, det)
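The two feature types can be extracted mechanically from the set of links already made. The sketch below implements ft_a(i − i', j − j', e_i') for existing links within a ±2 window and ft_d(j − j', rel) for a link on the governor of e_i. The data layout (position-indexed dictionaries) and the sentence positions in the usage example are assumptions made for illustration.

```python
def adjacency_features(i, j, links, e_words, window=2):
    """Active ft_a features for the candidate pair (e_i, f_j).

    links   : set of (i', j') position pairs already linked
    e_words : dict, source position -> word
    """
    feats = []
    for i2, j2 in sorted(links):
        if (i2, j2) == (i, j):
            continue
        if abs(i2 - i) <= window and abs(j2 - j) <= window:
            feats.append(("ft_a", i - i2, j - j2, e_words[i2]))
    return feats


def dependency_features(i, j, links, governor, relation):
    """Active ft_d features for the candidate pair (e_i, f_j).

    governor : dict, position -> position of its governor in the dependency tree
    relation : dict, position -> grammatical relation to its governor
    """
    feats = []
    if i not in governor:
        return feats
    for i2, j2 in sorted(links):
        if i2 == governor[i]:
            feats.append(("ft_d", j - j2, relation[i]))
    return feats


# Usage (assumed positions): "host" at source position 2 is already linked to
# target position 2, and it governs "the" at position 1 with relation "det".
e_words = {1: "the", 2: "host", 3: "discovers", 4: "all", 5: "the", 6: "devices"}
links = {(2, 2)}
governor, relation = {1: 2}, {1: "det"}
near = adjacency_features(1, 1, links, e_words) + dependency_features(1, 1, links, governor, relation)
far = adjacency_features(1, 5, links, e_words) + dependency_features(1, 5, links, governor, relation)
```

For the distant target position 5, no adjacency feature fires because the offset falls outside the window, while the dependency feature still fires with a larger offset.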

Experimental Results

• Test bed: Hansard corpus
– Training: 50K aligned pairs of sentences (Och & Ney 2000)
– Testing: 500 pairs

Future Work

• The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignments will be pursued.

• The proposed model itself is capable of creating many-to-one alignments, provided the null probabilities of the words added on the "many" side are taken into account.