Cross Lingual Information Retrieval (CLIR)
Rong Jin
The Problem

Increasing pressure to access information in foreign languages:
- find information written in foreign languages
- read and interpret that information
- merge it with information in other languages

Hence the need for multilingual information access.
Why Is Cross-Lingual IR Important?

The Internet is no longer monolingual, and non-English content is growing rapidly.
Non-English speakers are the fastest-growing group of new Internet users:
- in 1997: 8.1 million Spanish-speaking users
- in 2000: 37 million …
[Chart: growth of English vs. non-English web content, 2000-2005. Source: Manning & Napier Information Services, 2000 (confidential, unpublished)]
2. Multilingual Text Processing

- Character encoding
- Language recognition
- Tokenization
- Stop word removal
- Feature normalization (stemming)
- Part-of-speech tagging
- Phrase identification
Character Encoding

Language (alphabet) specific native encodings:
- Chinese: GB, Big5
- Western European: ISO-8859-1 (Latin-1)
- Russian: KOI-8, ISO-8859-5, CP-1251

UNICODE (ISO/IEC 10646):
- UTF-8: variable byte length
- UTF-16, UCS-2: fixed double byte
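The difference between variable-width and fixed-width encodings can be seen directly in Python (a small illustration added here, not part of the original slides):

```python
# UTF-8 uses a variable number of bytes per character, while
# UCS-2 / UTF-16 without surrogates uses two bytes per character.
samples = ["A", "é", "環"]  # ASCII, Latin-1 range, CJK

for ch in samples:
    utf8 = ch.encode("utf-8")       # 1, 2, and 3 bytes respectively
    utf16 = ch.encode("utf-16-le")  # little-endian, no BOM: 2 bytes each
    print(ch, len(utf8), len(utf16))
```

This is why UTF-8 is compact for Western European text but takes three bytes per CJK character, while UTF-16 spends two bytes on everything in the Basic Multilingual Plane.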
Tokenization

- Punctuation is separated from words, including word-separation characters:
  "The train stopped." → "The", "train", "stopped", "."
- Strings are split into lexical units, including segmentation (Chinese) and compound splitting (German)
Chinese Segmentation

[Figure: Chinese word segmentation example]
German Segmentation

Unrestricted compounding in German:
Abendnachrichtensendungsblock

- Use compound analysis together with the CELEX German dictionary (360,000 words):
  Treuhandanstalt → { treuhand, anstalt }
- Use an n-gram representation:
  Treuhandanstalt → { Treuha, reuhan, treuhand, euhand, … }
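The n-gram representation above can be sketched in a few lines of Python (assuming overlapping character n-grams of a fixed length, here n = 6, over the lowercased word):

```python
def char_ngrams(word, n=6):
    """Overlapping character n-grams of a (lowercased) word."""
    w = word.lower()
    return [w[i:i + n] for i in range(len(w) - n + 1)]

# The German compound "Treuhandanstalt" as 6-grams:
grams = char_ngrams("Treuhandanstalt")
# starts with 'treuha', 'reuhan', 'euhand', ...
```

Because the substrings overlap, the n-grams for the compound share many entries with the n-grams of its parts (treuhand, anstalt), which is what lets matching succeed without explicit compound splitting.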
CLIR - Approaches

- Machine translation
- Bilingual dictionaries
- Parallel/comparable corpora

[Diagram: a user query and a document are separated by the language barrier; each is mapped to a query representation and a document representation before matching]

Example: the Chinese query 誰在 1998 年贏得環法自行車大賽 ("Who won the Tour de France in 1998?") against the English document "Marco Pantani of Italy became the first Italian to win the Tour de France of 1998 …"
Machine Translation (MT)

Translate all documents into the query language.

[Diagram: English documents → machine translation → Chinese documents; Chinese queries are matched against the translated documents with Lucene]

- Not viable on large collections (MT is computationally expensive)
- Not viable if there are many possible query languages
Machine Translation

Translate the query into the language(s) of the content being searched.

[Diagram: Chinese queries → machine translation → English queries, matched against English documents with Lucene]

Query translation alone is inadequate for CLIR:
- a short query provides no context for accurate translation
- the MT system selects a single preferred target term
Example of Translating Queries
Who won the Tour de France in 1998?
Using Dictionaries

- Bilingual machine-readable dictionaries (in-house or commercial)
- Look up query terms in the dictionary and replace them with their translations in the document language

[Diagram: Chinese queries → bilingual dictionary → English queries, matched against English documents with Lucene]

Problems:
- ambiguity
- many terms are out-of-vocabulary
- lack of multiword terms (phrase identification)
- a bilingual dictionary is needed for every query-document language pair of interest
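The lookup-and-replace scheme, and why it suffers from ambiguity and out-of-vocabulary terms, can be sketched as follows (the dictionary entries below are invented for illustration):

```python
# Toy bilingual dictionary: each source term maps to all of its
# possible translations.  The entries are hypothetical examples.
bilingual_dict = {
    "bank": ["bank (finance)", "riverbank"],    # ambiguous term
    "press": ["printing press", "news media"],  # ambiguous term
}

def translate_query(terms, dictionary):
    """Replace each query term with all its dictionary translations."""
    translated, oov = [], []
    for t in terms:
        if t in dictionary:
            translated.extend(dictionary[t])  # no way to pick one sense
        else:
            oov.append(t)  # out-of-vocabulary: proper nouns, new words, ...
    return translated, oov

translated, oov = translate_query(["bank", "press", "Pantani"], bilingual_dict)
```

Without context, every sense of every term ends up in the translated query, and names like "Pantani" are simply lost, which is exactly the ambiguity and OOV problem listed above.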
Word Sense Disambiguation

"The sign for independent press to disappear"
Using Corpora

- Parallel corpora: translation-equivalent texts, e.g. the UN corpus in French, Spanish & English
- Comparable corpora: texts similar in topic, style, time, etc., e.g. Hong Kong TV broadcast news in both Chinese and English
Using Corpora: Parallel Corpus

Parallel corpus (source sentence | target sentence):
  A B C | b c a d
  B D E | b d e a
  C A   | c a
  A B E | c b e
  A C E | a c e

Query: A E
Documents: d1 = (a a c e), d2 = (b c d a), d3 = (e d a)

How can we bridge the language barrier using the parallel corpus?
Translate Query using Parallel Corpus (I)

Using the parallel corpus above, translate each query term into its target-language equivalent.

Query: A E → translated query: c e

Matching the translated query (c e) against the documents d1 = (a a c e), d2 = (b c d a), d3 = (e d a) retrieves d1, which contains both c and e.
Translate Query using Parallel Corpus (II)

- Learn word-to-word translation probabilities from the parallel corpus
- Compute the relevance of a document d to a given query q by estimating the probability of translating document d into query q

Word-to-word translation probabilities (rows = source words, columns = target words):

p(X|y)  a     b     c     d     e
A       1     0.67  1     0.5   0.67
B       0.5   1     0.5   1     0.67
C       0.75  0.33  0.75  0.5   0.33
D       0.25  0.33  0     0.5   0.33
E       0.5   0.33  0.5   0.5   1
Translate Query using Parallel Corpus (II)

Q = (A E), d1 = (a a c e)

Using the translation probabilities above, score each document by the probability of translating it into the query:

  d1 = (a a c e): 0.58
  d2 = (b c d a): 0.36
  d3 = (e d a):   0.48

So d1 is ranked highest for the query (A, E).
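The scores above are consistent with a simple translation-based retrieval model: score(q, d) = product over query words w of the average translation probability (1/|d|) Σ_{v∈d} p(w|v). This is a plausible reading of the slides' numbers, sketched below, not necessarily the exact formula the author used:

```python
# Word-to-word translation probabilities p(X|y) from the slide:
# rows = source (query-language) words, columns = target (document) words.
P = {
    "A": {"a": 1.0, "b": 0.67, "c": 1.0, "d": 0.5, "e": 0.67},
    "E": {"a": 0.5, "b": 0.33, "c": 0.5, "d": 0.5, "e": 1.0},
}

def score(query, doc):
    """p(q|d): product over query words of the average
    translation probability from the document's words."""
    s = 1.0
    for w in query:
        s *= sum(P[w][v] for v in doc) / len(doc)
    return s

docs = {"d1": list("aace"), "d2": list("bcda"), "d3": list("eda")}
scores = {name: score(["A", "E"], d) for name, d in docs.items()}
```

This reproduces 0.36 for d2 and 0.48 for d3 exactly; d1 comes out at roughly 0.57, within rounding of the slide's 0.58, and the ranking d1 > d3 > d2 matches.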
Translate Query using Parallel Corpus (II)

How do we obtain the translation probabilities?

Parallel corpus:
  A B C   | b c a d
  B D E A | b d e a
  C A     | c a
  A B E   | c b e
  A C E   | a c e
Approach I: Co-occurrence Counting

Count, for each source word X and target word y, the number of sentence pairs in which X and y co-occur, along with the number of sentences containing each word:

co(X,y)  a  b  c  d  e  | occ(X)
A        4  2  4  1  2  | 4
B        2  3  2  2  2  | 3
C        3  1  3  1  1  | 3
D        1  1  0  1  1  | 1
E        2  1  2  1  3  | 3
occ(y)   4  3  4  2  3

Co-occurrence-based translation model:
  p(X|y) = co(X, y) / occ(y)
e.g. p(A|a) = co(A, a) / occ(a) = 4/4 = 1
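The counting scheme can be sketched as follows on the toy corpus above, with co(X, y) = number of sentence pairs containing both X and y, and occ(y) = number of target sentences containing y. (The slides show the corpus both with and without the extra 'A' in the second pair, so a few table entries differ depending on which version is counted; the version with the extra 'A' is used here.)

```python
from collections import Counter

# Toy parallel corpus: (source words, target words) per sentence pair.
corpus = [
    ("A B C".split(),   "b c a d".split()),
    ("B D E A".split(), "b d e a".split()),
    ("C A".split(),     "c a".split()),
    ("A B E".split(),   "c b e".split()),
    ("A C E".split(),   "a c e".split()),
]

co = Counter()   # sentence-pair co-occurrence counts co[(X, y)]
occ = Counter()  # occ[y]: number of target sentences containing y
for src, tgt in corpus:
    for y in set(tgt):          # count each word once per sentence
        occ[y] += 1
        for X in set(src):
            co[(X, y)] += 1

def p(X, y):
    """Co-occurrence-based translation probability p(X|y)."""
    return co[(X, y)] / occ[y]
```

For example, p(A|a) = 4/4 = 1 and p(B|c) = 2/4 = 0.5, matching the worked examples on the slides.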
Approach I: Co-occurrence Counting

Normalizing the counts by occ(y) gives the translation probability table:

p(X|y)  a     b     c     d     e
A       1     0.67  1     0.5   0.67
B       0.5   1     0.5   1     0.67
C       0.75  0.33  0.75  0.5   0.33
D       0.25  0.33  0     0.5   0.33
E       0.5   0.33  0.5   0.5   1

e.g. p(B|c) = co(B, c) / occ(c) = 2/4 = 0.5

Any problems?
Approach I: Co-occurrence Counting

The table contains many large translation probabilities. Usually a word in one language corresponds mostly to a single word in the other language, so we may be over-counting the co-occurrence statistics.
Approach I: Over-counting

co(A, a) = 4 implies that every occurrence of 'A' is attributed to an occurrence of 'a'.
Approach I: Over-counting

co(X,y)  a  b  c  d  e  | occ(X)
A        4  3  4  1  2  | 4
B        2  3  2  2  2  | 3
C        3  1  3  1  1  | 3
D        1  1  0  1  1  | 1
E        2  1  2  1  3  | 3
occ(y)   4  3  4  2  3

If we believe that the first two occurrences of 'A' are due to 'a', then co(A, b) should be 1, not 3.
But we have no way of knowing whether those occurrences of 'A' are really due to 'a'.
How to Compute Co-occurrence? The IBM Statistical Translation Models

- IBM Research published a series of statistical translation models; we will only discuss IBM Translation Model 1
- It uses an iterative procedure to eliminate the over-counting problem
Step 1: Compute Co-occurrence

From the parallel corpus, compute the co-occurrence table co(X, y) as above.

Assume that translation probabilities are proportional to co-occurrence:

  p(A|a) = co(A,a) / (co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a))
         = 4/12 ≈ 0.33
Step 2: Compute Conditional Probabilities

Assuming translation probabilities proportional to co-occurrence, normalize over the source words for each target word:

  p(A|a) = co(A,a) / (co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a))
         = 4/12 ≈ 0.33

p(X|y)  a     b     c     d     e
A       0.33  0.25  0.36  0.17  0.22
B       0.17  …     …     …     …
C       0.25  …     …     …     …
D       0.08  …     …     …     …
E       0.17  …     …     …     …
Step 3: Re-estimate Co-occurrence

In sentence pair 1 (A B C | b c a d), the occurrence of 'A' can be caused by any one of the target words 'b', 'c', 'a', 'd'. co(A, a) for sentence 1 should therefore be computed by taking this competition into account:

  oc(A, a; 1) = p(A|a) / (p(A|a) + p(A|c) + p(A|b) + p(A|d))
              = 0.33 / (0.33 + 0.25 + 0.36 + 0.17) ≈ 0.41
Step 3: Re-estimate Co-occurrence

Computing oc(A, a) for each sentence pair:
  sentence 1: 0.41
  sentence 2: 0.37
  sentence 3: 0.48
  sentence 4: 0
  sentence 5: 0.36

co(A, a) = 0.41 + 0.37 + 0.48 + 0 + 0.36 = 1.62
Step 3: Re-estimate Co-occurrence

Re-estimated co-occurrence counts (column a shown):

co(X,y)  a     b  c  d  e
A        1.62  …  …  …  …
B        0.75  …  …  …  …
C        1.25  …  …  …  …
D        0.31  …  …  …  …
E        0.56  …  …  …  …
Step 4: Re-compute Conditional Probabilities

  p(A|a) = co(A,a) / (co(A,a) + co(B,a) + co(C,a) + co(D,a) + co(E,a))
         = 1.62 / (1.62 + 0.75 + 1.25 + 0.31 + 0.56) ≈ 0.36

p(X|y)  a     b  c  d  e
A       0.46  …  …  …  …
B       0.15  …  …  …  …
C       0.22  …  …  …  …
D       0.06  …  …  …  …
E       0.11  …  …  …  …
IBM Statistical Translation Model

- Apply the counting and estimation steps iteratively; convergence can be proved
- This is an instance of the Expectation-Maximization (EM) algorithm:
  E-step: compute the expected counts
  M-step: estimate the translation probabilities
- This approach achieved the best CLIR performance in past TREC evaluations
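Steps 1-4 above, repeated to convergence, are essentially the EM loop of IBM Model 1. A minimal sketch on the toy corpus follows (simplified: uniform initialization, no NULL word, and using the corpus version without the extra 'A' in the second pair, as shown on the Step slides):

```python
from collections import defaultdict

# Toy parallel corpus: (source sentence, target sentence) pairs.
corpus = [
    ("A B C".split(), "b c a d".split()),
    ("B D E".split(), "b d e a".split()),
    ("C A".split(),   "c a".split()),
    ("A B E".split(), "c b e".split()),
    ("A C E".split(), "a c e".split()),
]

src_vocab = {w for s, _ in corpus for w in s}
tgt_vocab = {w for _, t in corpus for w in t}

# Uniform initialization of p(X | y).
p = {X: {y: 1.0 / len(src_vocab) for y in tgt_vocab} for X in src_vocab}

for _ in range(20):             # EM iterations
    count = defaultdict(float)  # expected counts c(X, y)  (E-step output)
    total = defaultdict(float)  # marginal count per target word y
    for src, tgt in corpus:
        for X in src:
            # E-step: distribute one count for X over the sentence's
            # target words, proportionally to p(X|y) -- the "competition"
            # that fixes the over-counting problem.
            norm = sum(p[X][y] for y in tgt)
            for y in tgt:
                frac = p[X][y] / norm
                count[(X, y)] += frac
                total[y] += frac
    # M-step: re-normalize over source words for each target word.
    for X in src_vocab:
        for y in tgt_vocab:
            if total[y] > 0:
                p[X][y] = count[(X, y)] / total[y]
```

After each M-step the probabilities p(·|y) form a proper distribution over source words, and over the iterations the mass concentrates on consistent co-occurrence partners (e.g. p(A|a) ends up well above p(D|a)), which is the behavior the slides' Steps 1-4 illustrate by hand.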