Learning Bilingual Lexicons from Monolingual Corpora
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein
Computer Science Division, University of California, Berkeley
Standard MT Approach
[Figure: Source Text paired with Target Text]
Need (lots of) parallel sentences; these may not always be available.
MT from Monotext
[Figure: Source Text and Target Text as separate monolingual corpora]
This talk: translation without parallel text? Cf. Koehn and Knight (2002) and Fung (1995).
Task: Lexicon Induction
Given monolingual Source Text and Target Text, find a matching m between source words s and target words t.
[Figure: matching m between source words {state, world, name, nation} and target words {estado, política, mundo, nombre}]
Data Representation
Each word type is represented by a feature vector.

"state" (Source Text):
- Orthographic features (weight 1.0 each): #st, tat, te#
- Context features (co-occurrence counts): world 20.0, politics 5.0, society 10.0

"estado" (Target Text):
- Orthographic features (weight 1.0 each): #es, sta, do#
- Context features (co-occurrence counts): mundo 10.0, politica 17.0, sociedad 6.0
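The feature representation above can be sketched as follows; the trigram inventory and the context window size are assumptions, since the slides show only a handful of example features.

```python
from collections import Counter

def orthographic_features(word, n=3):
    """Character n-gram features with boundary markers (#), weight 1.0 each.
    The exact n-gram inventory is an assumption; the slide shows only a few
    example trigrams such as #st, tat, te# for "state"."""
    padded = f"#{word}#"
    return {padded[i:i + n]: 1.0 for i in range(len(padded) - n + 1)}

def context_features(word, sentences, window=2):
    """Co-occurrence counts of `word` with nearby words in a monolingual
    corpus; the window size here is an assumption."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                counts.update(sent[lo:i] + sent[i + 1:hi])
    return dict(counts)
```

The source- and target-side vectors are built the same way, each from its own monolingual corpus.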
Canonical Correlation Analysis
PCA projects each space independently onto its directions of greatest variance, so corresponding points in the source and target spaces need not line up.
CCA instead finds projections of the two spaces that are maximally correlated, mapping both into a shared canonical space where corresponding points land together.
[Figure: points 1, 2, 3 in the source and target spaces, projected by PCA vs. by CCA into the canonical space]
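As a concrete reference, here is a minimal regularized-CCA sketch via whitening and SVD; it illustrates the technique, not the authors' actual code.

```python
import numpy as np

def cca(X, Y, d, reg=1e-3):
    """Regularized CCA via whitening + SVD (a standard textbook sketch).
    Rows of X (n, p) and Y (n, q) are feature vectors of matched
    source/target word pairs; returns projection matrices A, B into a
    d-dimensional canonical space, plus the canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each space so its covariance becomes the identity...
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    # ...then the SVD of the whitened cross-covariance yields the
    # maximally correlated directions.
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :d], Wy @ Vt.T[:, :d], s[:d]
```

Projecting both vocabularies with A and B puts them in one space where matched words should be close.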
Generative Model
A matching m links source words s to target words t; a matched pair such as (state, estado) is generated jointly through a shared canonical space between the source and target spaces.
[Figure: matching m between source words {state, world, name, nation} and target words {estado, nombre, politica, mundo}]
Learning: EM?
E-Step: Obtain posterior over matchings
M-Step: Maximize CCA parameters
But the E-step must weight all matchings (0.2, 0.15, 0.30, 0.10, 0.30, ...), and getting expectations over matchings is #P-hard! See John DeNero's paper "The Complexity of Phrase Alignment Problems".

Inference: Hard EM
Hard E-Step: Find the best bipartite matching
M-Step: Solve CCA
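The hard E-step reduces to a maximum-weight bipartite matching, which the Hungarian algorithm solves in polynomial time. A minimal sketch using scipy's `linear_sum_assignment`, with an illustrative score matrix (not the model's actual CCA-based scores):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hard_e_step(score):
    """Hard E-step: rather than summing over all matchings (#P-hard), keep
    only the single maximum-weight bipartite matching. score[i, j] is the
    model's score for pairing source word i with target word j."""
    rows, cols = linear_sum_assignment(-score)  # negate to maximize
    return [(int(i), int(j)) for i, j in zip(rows, cols)]

# Illustrative scores for 3 source x 3 target words:
score = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.8, 0.1],
                  [0.0, 0.3, 0.7]])
matching = hard_e_step(score)  # → [(0, 0), (1, 1), (2, 2)]
```

The M-step then refits CCA on the word pairs selected by this matching.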
Experimental Setup
- Nouns only (for now)
- Seed lexicon: 100 translation pairs
- Induce a lexicon between the top 2k source and target word types
- Evaluation: precision and recall against a lexicon obtained from Wiktionary; report p0.33, the precision at recall 0.33
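One straightforward reading of the p0.33 metric, sketched in Python (the function name and the handling of an unreached recall target are assumptions):

```python
def precision_at_recall(ranked_pairs, gold, target_recall=0.33):
    """Walk down the model's confidence-ranked translation pairs and report
    precision at the first point where recall reaches target_recall."""
    correct = 0
    for k, pair in enumerate(ranked_pairs, start=1):
        if pair in gold:
            correct += 1
        if correct / len(gold) >= target_recall:
            return correct / k
    return None  # the recall target was never reached
```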
Feature Experiments (4k EN-ES Wikipedia Articles)

Precision (p0.33):
  Edit Dist (baseline)                   61.1
  Ortho   (MCCA, orthographic only)      80.1
  Context (MCCA, context only)           80.2
  MCCA    (orthographic + context)       89.0
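The Edit Dist baseline can be sketched as a nearest-neighbor lookup under Levenshtein distance (the exact tie-breaking and thresholding used in the talk are assumptions):

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_dist_lexicon(source_words, target_words):
    """Baseline lexicon: link each source word to its closest target word."""
    return {s: min(target_words, key=lambda t: edit_distance(s, t))
            for s in source_words}
```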
Feature Experiments
[Figure: precision vs. recall curves for the feature experiments]
Corpus Variation

Precision (p0.33):
  Identical corpora  (100k EN-ES Europarl sentences)      93.8
  Comparable corpora (4k EN-ES Wikipedia articles)        89.0
  Unrelated corpora  (100k English and Spanish Gigaword)  68.3
Seed Lexicon Source (4k EN-ES Wikipedia Articles)
Automatic seed: use edit distance to induce the seed lexicon, as in Koehn & Knight (2002).
  Auto Seed  91.8
  Gold Seed  93.8
Analysis
- Top non-cognates
- Interesting mistakes
- Language variation
- Orthography features vs. context features
Summary
- Learned a bilingual lexicon from monotext with a matching + CCA model
- Possible even from unaligned corpora
- Possible for non-related languages
- High precision, but much left to do!
Thank you!
http://nlp.cs.berkeley.edu
Error Analysis
Top 100 errors:
- 21 were correct translations not in the gold lexicon
- 30 were semantically related
- 15 were orthographically related (coast, costas)
- 30 were seemingly random
BLEU Experiment
- English-French, with only 1k parallel sentences
- Without lexicon: BLEU 13.61
- With lexicon: BLEU 15.22
More Numbers
Conclusion
- Three cases of unsupervised learning in NLP
- Unsupervised systems can be competitive with supervised systems
- Future problems: document summarization, building MindNet-like resources, discourse analysis
Generative Model
Generate matched words: a matched pair (state, estado) is generated jointly through the latent space, producing each word's orthographic features (e.g. #st, tat, te#, weight 1.0 each) and context features (e.g. world 20.0, politics 5.0, society 10.0).
Translation Lexicon Induction
[Figure: matching m between source words {state, world, name} and target words {estado, nombre, mundo}]
Generative Model
- For each matched word pair: generate the pair jointly through the canonical space
- For each unmatched source word: generate it from the source space alone
- For each unmatched target word: generate it from the target space alone
Results: Accuracy
Corpus Variation
Disjoint sentences: [Figure: precision for Parallel, Wiki, and Disjoint corpora]
Unrelated corpora: [Figure: precision for Parallel, Wiki, and Unrelated corpora]
Machine Translation
From Source Text and Target Text, induce a probabilistic translation lexicon:

  Source Word   Target Word   P(T | S)
  state         estado        0.98
  world         mundo         0.97
  name          nombre        0.99
What are we generating?
Inference: EM?
E-Step: Compute matching posteriors P(m | s, t)
M-Step: Estimate the CCA parameters
Generative Model
Generate matched word vectors (e.g. state and estado, jointly through the latent space); generate unmatched word vectors from their own space alone.
[Figure: matching m between source words {state, world, name, nation} and target words {estado, nombre, política, mundo}]
Results: Example Matches
[Table: top non-cognates and interesting mistakes]
Corpus Variation: Identical Corpora (100k EN-ES Europarl Sentences)
[Figure: precision vs. recall curve; p0.33 = 89.0]