Cross Lingual Information Retrieval (CLIR)
Rong Jin
The Problem

Increasing pressure to access information in foreign languages:
- find information written in foreign languages
- read and interpret that information
- merge it with information in other languages

Hence the need for multilingual information access.
Why Is Cross-Lingual IR Important?

The Internet is no longer monolingual, and non-English content is growing rapidly.
Non-English speakers are the fastest-growing group of new Internet users:
- in 1997: 8.1 million Spanish-speaking users
- in 2000: 37 million …
[Chart: growth of English vs. non-English web content, 2000-2005. Source: Manning & Napier Information Services, 2000 (confidential, unpublished)]
2. Multilingual Text Processing

- Character encoding
- Language recognition
- Tokenization
- Stop word removal
- Feature normalization (stemming)
- Part-of-speech tagging
- Phrase identification
Character Encoding

Language (alphabet) specific native encodings:
- Chinese: GB, Big5
- Western European: ISO-8859-1 (Latin-1)
- Russian: KOI-8, ISO-8859-5, CP-1251

UNICODE (ISO/IEC 10646):
- UTF-8: variable byte length
- UTF-16, UCS-2: fixed double byte
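The difference between variable-width and fixed-width encodings can be seen directly in Python (a small illustration added here, not part of the original slides):

```python
# UTF-8 uses a variable number of bytes per character, while
# UCS-2 / UTF-16 without surrogates uses two bytes per character.
samples = ["A", "é", "環"]  # ASCII, Latin-1 range, CJK

for ch in samples:
    utf8 = ch.encode("utf-8")       # 1, 2, and 3 bytes respectively
    utf16 = ch.encode("utf-16-le")  # little-endian, no BOM: 2 bytes each
    print(ch, len(utf8), len(utf16))
```

This is why UTF-8 is compact for Western European text but takes three bytes per CJK character, while UTF-16 spends two bytes on everything in the Basic Multilingual Plane.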
Tokenization

- Punctuation is separated from words, including word-separation characters:
  "The train stopped." → "The", "train", "stopped", "."
- Strings are split into lexical units, including segmentation (Chinese) and compound splitting (German)
Chinese Segmentation

[Figure: Chinese word segmentation example]
German Segmentation

Unrestricted compounding in German:
Abendnachrichtensendungsblock

- Use compound analysis together with the CELEX German dictionary (360,000 words):
  Treuhandanstalt → { treuhand, anstalt }
- Use an n-gram representation:
  Treuhandanstalt → { Treuha, reuhan, treuhand, euhand, … }
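The n-gram representation above can be sketched in a few lines of Python (assuming overlapping character n-grams of a fixed length, here n = 6, over the lowercased word):

```python
def char_ngrams(word, n=6):
    """Overlapping character n-grams of a (lowercased) word."""
    w = word.lower()
    return [w[i:i + n] for i in range(len(w) - n + 1)]

# The German compound "Treuhandanstalt" as 6-grams:
grams = char_ngrams("Treuhandanstalt")
# starts with 'treuha', 'reuhan', 'euhand', ...
```

Because the substrings overlap, the n-grams for the compound share many entries with the n-grams of its parts (treuhand, anstalt), which is what lets matching succeed without explicit compound splitting.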
CLIR - Approaches

- Machine translation
- Bilingual dictionaries
- Parallel/comparable corpora

[Diagram: a user query and a document are separated by the language barrier; each is mapped to a query representation and a document representation before matching]

Example: the Chinese query 誰在 1998 年贏得環法自行車大賽 ("Who won the Tour de France in 1998?") against the English document "Marco Pantani of Italy became the first Italian to win the Tour de France of 1998 …"
Machine Translation (MT)

Translate all documents into the query language.

[Diagram: English documents → machine translation → Chinese documents; Chinese queries are matched against the translated documents with Lucene]

- Not viable on large collections (MT is computationally expensive)
- Not viable if there are many possible query languages
Machine Translation

Translate the query into the language(s) of the content being searched.

[Diagram: Chinese queries → machine translation → English queries, matched against English documents with Lucene]

Query translation alone is inadequate for CLIR:
- a short query provides no context for accurate translation
- the MT system selects a single preferred target term
Example of Translating Queries
Who won the Tour de France in 1998?
Using Dictionaries

- Bilingual machine-readable dictionaries (in-house or commercial)
- Look up query terms in the dictionary and replace them with their translations in the document language

[Diagram: Chinese queries → bilingual dictionary → English queries, matched against English documents with Lucene]

Problems:
- ambiguity
- many terms are out-of-vocabulary
- lack of multiword terms (phrase identification)
- a bilingual dictionary is needed for every query-document language pair of interest
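The lookup-and-replace scheme, and why it suffers from ambiguity and out-of-vocabulary terms, can be sketched as follows (the dictionary entries below are invented for illustration):

```python
# Toy bilingual dictionary: each source term maps to all of its
# possible translations.  The entries are hypothetical examples.
bilingual_dict = {
    "bank": ["bank (finance)", "riverbank"],    # ambiguous term
    "press": ["printing press", "news media"],  # ambiguous term
}

def translate_query(terms, dictionary):
    """Replace each query term with all its dictionary translations."""
    translated, oov = [], []
    for t in terms:
        if t in dictionary:
            translated.extend(dictionary[t])  # no way to pick one sense
        else:
            oov.append(t)  # out-of-vocabulary: proper nouns, new words, ...
    return translated, oov

translated, oov = translate_query(["bank", "press", "Pantani"], bilingual_dict)
```

Without context, every sense of every term ends up in the translated query, and names like "Pantani" are simply lost, which is exactly the ambiguity and OOV problem listed above.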
Word Sense Disambiguation

"The sign for independent press to disappear"
Using Corpora

- Parallel corpora: translation-equivalent texts, e.g. the UN corpus in French, Spanish & English
- Comparable corpora: texts similar in topic, style, time, etc., e.g. Hong Kong TV broadcast news in both Chinese and English
Using Corpora: Parallel Corpus

Parallel corpus (source sentence | target sentence):
  A B C | b c a d
  B D E | b d e a
  C A   | c a
  A B E | c b e
  A C E | a c e

Query: A E
Documents: d1 = (a a c e), d2 = (b c d a), d3 = (e d a)

How can we bridge the language barrier using the parallel corpus?
Translate Query using Parallel Corpus (I)

Using the parallel corpus above, translate each query term into its target-language equivalent.

Query: A E → translated query: c e

Matching the translated query (c e) against the documents d1 = (a a c e), d2 = (b c d a), d3 = (e d a) retrieves d1, which contains both c and e.
Translate Query using Parallel Corpus (II)

- Learn word-to-word translation probabilities from the parallel corpus
- Compute the relevance of a document d to a given query q by estimating the probability of translating document d into query q

Word-to-word translation probabilities (rows = source words, columns = target words):

p(X|y)  a     b     c     d     e
A       1     0.67  1     0.5   0.67
B       0.5   1     0.5   1     0.67
C       0.75  0.33  0.75  0.5   0.33
D       0.25  0.33  0     0.5   0.33
E       0.5   0.33  0.5   0.5   1
Translate Query using Parallel Corpus (II)

Q = (A E), d1 = (a a c e)

Using the translation probabilities above, score each document by the probability of translating it into the query:

  d1 = (a a c e): 0.58
  d2 = (b c d a): 0.36
  d3 = (e d a):   0.48

So d1 is ranked highest for the query (A, E).
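The scores above are consistent with a simple translation-based retrieval model: score(q, d) = product over query words w of the average translation probability (1/|d|) Σ_{v∈d} p(w|v). This is a plausible reading of the slides' numbers, sketched below, not necessarily the exact formula the author used:

```python
# Word-to-word translation probabilities p(X|y) from the slide:
# rows = source (query-language) words, columns = target (document) words.
P = {
    "A": {"a": 1.0, "b": 0.67, "c": 1.0, "d": 0.5, "e": 0.67},
    "E": {"a": 0.5, "b": 0.33, "c": 0.5, "d": 0.5, "e": 1.0},
}

def score(query, doc):
    """p(q|d): product over query words of the average
    translation probability from the document's words."""
    s = 1.0
    for w in query:
        s *= sum(P[w][v] for v in doc) / len(doc)
    return s

docs = {"d1": list("aace"), "d2": list("bcda"), "d3": list("eda")}
scores = {name: score(["A", "E"], d) for name, d in docs.items()}
```

This reproduces 0.36 for d2 and 0.48 for d3 exactly; d1 comes out at roughly 0.57, within rounding of the slide's 0.58, and the ranking d1 > d3 > d2 matches.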
Translate Query using Parallel Corpus (II)

How do we obtain the translation probabilities?

Parallel corpus:
  A B C   | b c a d
  B D E A | b d e a
  C A     | c a
  A B E   | c b e
  A C E   | a c e
Approach I: Co-occurrence Counting

Count, for each source word X and target word y, the number of sentence pairs in which X and y co-occur, along with the number of sentences containing each word:

co(X,y)  a  b  c  d  e  | occ(X)
A        4  2  4  1  2  | 4
B        2  3  2  2  2  | 3
C        3  1  3  1  1  | 3
D        1  1  0  1  1  | 1
E        2  1  2  1  3  | 3
occ(y)   4  3  4  2  3

Co-occurrence-based translation model:
  p(X|y) = co(X, y) / occ(y)
e.g. p(A|a) = co(A, a) / occ(a) = 4/4 = 1
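The counting scheme can be sketched as follows on the toy corpus above, with co(X, y) = number of sentence pairs containing both X and y, and occ(y) = number of target sentences containing y. (The slides show the corpus both with and without the extra 'A' in the second pair, so a few table entries differ depending on which version is counted; the version with the extra 'A' is used here.)

```python
from collections import Counter

# Toy parallel corpus: (source words, target words) per sentence pair.
corpus = [
    ("A B C".split(),   "b c a d".split()),
    ("B D E A".split(), "b d e a".split()),
    ("C A".split(),     "c a".split()),
    ("A B E".split(),   "c b e".split()),
    ("A C E".split(),   "a c e".split()),
]

co = Counter()   # sentence-pair co-occurrence counts co[(X, y)]
occ = Counter()  # occ[y]: number of target sentences containing y
for src, tgt in corpus:
    for y in set(tgt):          # count each word once per sentence
        occ[y] += 1
        for X in set(src):
            co[(X, y)] += 1

def p(X, y):
    """Co-occurrence-based translation probability p(X|y)."""
    return co[(X, y)] / occ[y]
```

For example, p(A|a) = 4/4 = 1 and p(B|c) = 2/4 = 0.5, matching the worked examples on the slides.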
Approach I: Co-occurrence Counting

Normalizing the counts by occ(y) gives the translation probability table:

p(X|y)  a     b     c     d     e
A       1     0.67  1     0.5   0.67
B       0.5   1     0.5   1     0.67
C       0.75  0.33  0.75  0.5   0.33
D       0.25  0.33  0     0.5   0.33
E       0.5   0.33  0.5   0.5   1

e.g. p(B|c) = co(B, c) / occ(c) = 2/4 = 0.5

Any problems?
Approach I: Co-occurrence Counting

The table contains many large translation probabilities. Usually a word in one language corresponds mostly to a single word in the other language, so we may be over-counting the co-occurrence statistics.
Approach I: Over-counting

co(A, a) = 4 implies that every occurrence of 'A' is attributed to an occurrence of 'a'.
Approach I: Over-counting

co(X,y)  a  b  c  d  e  | occ(X)
A        4  3  4  1  2  | 4
B        2  3  2  2  2  | 3
C        3  1  3  1  1  | 3
D        1  1  0  1  1  | 1
E        2  1  2  1  3  | 3
occ(y)   4  3  4  2  3

If we believe that the first two occurrences of 'A' are due to 'a', then co(A, b) should be 1, not 3.
But we have no way of knowing whether those occurrences of 'A' are really due to 'a'.
How to Compute Co-occurrence? The IBM Statistical Translation Models

- IBM Research published a series of statistical translation models; we will only discuss IBM Translation Model 1
- It uses an iterative procedure to eliminate the over-counting problem
Step 1: Compute Co-occurrence

From the parallel corpus, compute the co-occurrence table co(X, y) as above.

Assume that translation probabilities are proportional to co-occurrence:

  p(A|a) = co(A,a) / (co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a))
         = 4/12 ≈ 0.33
Step 2: Compute Conditional Probabilities

Assuming translation probabilities proportional to co-occurrence, normalize over the source words for each target word:

  p(A|a) = co(A,a) / (co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a))
         = 4/12 ≈ 0.33

p(X|y)  a     b     c     d     e
A       0.33  0.25  0.36  0.17  0.22
B       0.17  …     …     …     …
C       0.25  …     …     …     …
D       0.08  …     …     …     …
E       0.17  …     …     …     …
Step 3: Re-estimate Co-occurrence

In sentence pair 1 (A B C | b c a d), the occurrence of 'A' can be caused by any one of the target words 'b', 'c', 'a', 'd'. co(A, a) for sentence 1 should therefore be computed by taking this competition into account:

  oc(A, a; 1) = p(A|a) / (p(A|a) + p(A|c) + p(A|b) + p(A|d))
              = 0.33 / (0.33 + 0.25 + 0.36 + 0.17) ≈ 0.41
Step 3: Re-estimate Co-occurrence

Computing oc(A, a) for each sentence pair:
  sentence 1: 0.41
  sentence 2: 0.37
  sentence 3: 0.48
  sentence 4: 0
  sentence 5: 0.36

co(A, a) = 0.41 + 0.37 + 0.48 + 0 + 0.36 = 1.62
Step 3: Re-estimate Co-occurrence

Re-estimated co-occurrence counts (column a shown):

co(X,y)  a     b  c  d  e
A        1.62  …  …  …  …
B        0.75  …  …  …  …
C        1.25  …  …  …  …
D        0.31  …  …  …  …
E        0.56  …  …  …  …
Step 4: Re-compute Conditional Probabilities

  p(A|a) = co(A,a) / (co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a))
         = 1.62 / (1.62 + 0.75 + 1.25 + 0.31 + 0.56) ≈ 0.36

p(X|y)  a     b  c  d  e
A       0.46  …  …  …  …
B       0.15  …  …  …  …
C       0.22  …  …  …  …
D       0.06  …  …  …  …
E       0.11  …  …  …  …
IBM Statistical Translation Model

- Apply the counting and estimation steps iteratively; convergence can be proved
- This is an instance of the Expectation-Maximization (EM) algorithm:
  E-step: compute the expected counts
  M-step: estimate the translation probabilities
- This approach achieved the best CLIR performance in past TREC evaluations
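Steps 1-4 above, repeated to convergence, are essentially the EM loop of IBM Model 1. A minimal sketch on the toy corpus follows (simplified: uniform initialization, no NULL word, and using the corpus version without the extra 'A' in the second pair, as shown on the Step slides):

```python
from collections import defaultdict

# Toy parallel corpus: (source sentence, target sentence) pairs.
corpus = [
    ("A B C".split(), "b c a d".split()),
    ("B D E".split(), "b d e a".split()),
    ("C A".split(),   "c a".split()),
    ("A B E".split(), "c b e".split()),
    ("A C E".split(), "a c e".split()),
]

src_vocab = {w for s, _ in corpus for w in s}
tgt_vocab = {w for _, t in corpus for w in t}

# Uniform initialization of p(X | y).
p = {X: {y: 1.0 / len(src_vocab) for y in tgt_vocab} for X in src_vocab}

for _ in range(20):             # EM iterations
    count = defaultdict(float)  # expected counts c(X, y)  (E-step output)
    total = defaultdict(float)  # marginal count per target word y
    for src, tgt in corpus:
        for X in src:
            # E-step: distribute one count for X over the sentence's
            # target words, proportionally to p(X|y) -- the "competition"
            # that fixes the over-counting problem.
            norm = sum(p[X][y] for y in tgt)
            for y in tgt:
                frac = p[X][y] / norm
                count[(X, y)] += frac
                total[y] += frac
    # M-step: re-normalize over source words for each target word.
    for X in src_vocab:
        for y in tgt_vocab:
            if total[y] > 0:
                p[X][y] = count[(X, y)] / total[y]
```

After each M-step the probabilities p(·|y) form a proper distribution over source words, and over the iterations the mass concentrates on consistent co-occurrence partners (e.g. p(A|a) ends up well above p(D|a)), which is the behavior the slides' Steps 1-4 illustrate by hand.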