Learning Techniques based on Parallel Corpora
Jonas Kuhn, Universität des Saarlandes, Saarbrücken
Heidelberg, 16. Juni 2005
Talk Outline

- Parallel Corpora
- The PTOLEMAIOS Project: Goals and Motivation
  - Specification of the intended prototype system
- Study 1: Unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: A chart-based parsing algorithm for bilingual parallel grammars based on a word alignment
- Conclusion: The PTOLEMAIOS research agenda
Parallel Corpora
- Collections of texts and their translations into different languages
- Alignment across languages at various levels: document, section, paragraph, sentence (not necessarily one-to-one), phrase, word
- Varying quality depending on the origin of the translation: translation is often not literal; parts of a document may occur in only one version
Examples of Parallel Corpora

- Hansards of the 36th Parliament of Canada (http://www.isi.edu/natural-language/download/hansard/): 1.3 million sentence pairs (19.8 million word forms English / 21.2 million word forms French)
- The Bible
- Europarl corpus (European Parliament Proceedings Parallel Corpus 1996-2003) (http://www.isi.edu/~koehn/europarl/): 11 languages (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish); up to 28 million word forms per language
- OPUS, an open source parallel corpus (http://logos.uio.no/opus/): includes the Europarl corpus and various manuals for open source software (parts of which have been translated into more than 20 languages, incl. e.g. Chinese, Hebrew, Japanese, Korean, Russian, and Turkish)
Uses for Parallel Corpora

- Building bilingual dictionaries
- Resource for Machine Translation research
  - "Classical" MT: data basis for transfer rules / dictionary entries
  - Statistical MT: training data for statistical alignment models (paragraph level, word level)
- Resource for (multilingual) NLP applications: "annotation projection" for training monolingual tools

Query interface

From the OPUS corpus website, http://logos.uio.no/cgi-bin/opus/opuscqp.pl (based on the Corpus Workbench/CQP from IMS Stuttgart)
Training Statistical Alignments
GIZA++ tool (http://www.fjoch.com/GIZA++.html)
- Implementation of the "IBM models" for Statistical MT
- Extension of the program GIZA (part of the SMT toolkit EGYPT) developed at a summer workshop in 1999 at Johns Hopkins University

Sample word alignment (from the Europarl corpus):
[Figure: word alignment between German "Dies war im Protokoll nachzulesen" and English "This was actually referred to in the Minutes"]
“Annotation Projection”
Yarowsky/Ngai/Wicentowski 2001
- A parallel corpus and word alignment are given
- A tagger/chunker for English exists
- The projected annotation is used as training data for a tagger/chunker in the target language
- Robust learning techniques based on confidence in the training data
[Figure: POS tags and brackets projected from English "[a significant producer] [for crude oil]" (DT JJ NN IN JJ NN) onto French "[un producteur important] [de petrole brut]"]
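The projection step itself can be sketched in a few lines. This is a deliberately naive illustration; the function name, the dictionary-based alignment format, and the example alignment are assumptions of this sketch, not the implementation of Yarowsky et al. 2001, who additionally apply robust, confidence-based filtering to the projected data.

```python
def project_tags(source_tags, alignment):
    """Project POS tags from source words to target words via a word
    alignment given as a dict {source_index: [target_indices]}."""
    target_tags = {}
    for s_idx, t_indices in alignment.items():
        for t_idx in t_indices:
            # Last projection wins in this naive sketch; a robust system
            # would weight projections by alignment confidence.
            target_tags[t_idx] = source_tags[s_idx]
    return target_tags

# English: a/DT significant/JJ producer/NN for/IN crude/JJ oil/NN
source_tags = ["DT", "JJ", "NN", "IN", "JJ", "NN"]
# French: un producteur important de petrole brut (note the JJ/NN swaps)
alignment = {0: [0], 1: [2], 2: [1], 3: [3], 4: [5], 5: [4]}
french_tags = project_tags(source_tags, alignment)
```

The projected tags then serve as (noisy) training data for a target-language tagger.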
Quality of projected information
Evaluation results for part-of-speech tagging (from Yarowsky et al. 2001)
Train grammars on parallel corpora?
New weakly supervised learning approach for (probabilistic) syntactic grammars:
- Training data: parallel corpora, i.e., collections of original texts and their translations into one or more languages
- Preparatory step: identification of word correspondences with known statistical techniques (word alignment from statistical machine translation)

[Figure (en/de/fr parallel text): German "Heute stellt sich die Lage jedoch völlig anders dar" aligned with English "The situation now however is radically different"]
Train grammars on parallel corpora?
- Beyond lexical information, patterns in the word correspondence relation contain rich implicit information about the grammars of the languages
- This implicit information about structure and meaning should be exploitable for grammar learning
- Little manual annotation effort should be required
- Combination of insights from linguistics and machine learning
The PTOLEMAIOS Project
[Image: Rosetta Stone]

Parallel Corpus-Based Grammar Induction: PTOLEMAIOS ("Parallel-Text-based Optimization for Language Learning – Exploiting Multilingual Alignment for the Induction Of Syntactic Grammars")

- Funded by the DFG (German Research Foundation) as an Emmy Noether research group
- Universität des Saarlandes (Saarbrücken), Department of Computational Linguistics
- Starting date: 1 April 2005
- Expected duration: 4 years (1-year extension possible)
Project Goals
Development of formalisms and algorithms to support grammar induction for arbitrary languages from parallel corpora

To make the goals tangible, an intended prototype: the PTOLEMAIOS I system for building grammars for new (sub-)languages
The PTOLEMAIOS I system
Resources required:
- Parallel corpus of language L and one or (ideally) more other languages
- No NLP tools for language L required

Preparatory work required:
- Manual annotation of a set of seed sentence pairs (e.g., 50-100 pairs)
- Phrasal correspondence across languages
- "Lean" bracketing: mark only full argument/modifier phrases (PPs, NPs) and full clauses
The PTOLEMAIOS I system
Training steps:
- (Sentence alignment on the parallel corpus)
- Word alignment on the parallel corpus, using standard techniques from Statistical Machine Translation (GIZA++ tool)
- Part-of-speech clustering for L
- Bootstrapping learning of syntactic grammars for L and the other language(s):
  - Starting from the annotated seed data
  - Exploiting large amounts of unannotated data, finding systematic patterns in phrasal correspondences
  - Assuming an implicit underlying representation ("pseudo meaning representation")
  - Relying on consensus across the grammars
The PTOLEMAIOS I system
Result:
- Robust probabilistic grammar for L
- Representation of predicate-argument and modifier relations
- Models predict probabilities for cross-linguistic argument/modifier links (these will be particularly useful in lexicalized models)

Applications:
- Multilingual Information Extraction, Question Answering
- Intermediate step for syntax-based MT
Motivation
Practical:
- Explore an alternative to standard treebank training of grammars
- For "smaller" languages, it is unrealistic to do the necessary manual resource annotation

Theoretical:
- Establish parallel corpora as an empirical basis for (crosslinguistic or monolingual) linguistic studies
- Frequency-related phenomena (like multi-word expressions/collocations) are otherwise hard to assess empirically at the level of syntax
- Learnability properties as a criterion for assessing formal models for natural language
Study 1
Unsupervised learning of a probabilistic context-free grammar (PCFG) exploiting partial information from a parallel corpus [Kuhn 2004 (ACL)]

Underlying consideration: the distribution of word correspondences in the translation of a string contains partial information about possible phrase boundaries

[Figure: word alignment between "Heute stellt sich die Lage jedoch völlig anders dar" and "The situation now however is radically different"]
Alignment-induced word blocks

[Figure: L1 "The situation is now different in Turkey" aligned with L2 "Heute ist die Lage in der Türkei anders"; the alignment maps (The)={die}, (situation)={Lage}, (NULL)={der}]
- Word alignments obtained with the standard model are asymmetrical [Brown et al. 1993]: each word from language L1 is mapped to a set of words from L2
- (For L2 words without an overt correspondent in L1, an artificial word NULL is assumed in each sentence of L1)

Definition: A word alignment mapping induces a word block w1...wn in language L1 iff the union of the correspondence sets of w1...wn forms a contiguous string in L2 [interrupted only by words corresponding to NULL]

- Call a word block maximal if adding a word to the left or right yields a non-block
- Maximal word blocks are possible constituents, but not necessary ones
- Conservative formulation: exclusion of impossible constituents ("distituents") whenever a word sequence crosses block boundaries without fully covering one of the adjacent blocks
Word blocks and constituents
[Figure: word blocks induced by the alignment of "Wir brauchen bald eine Entscheidung" with "We need a decision soon"]
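Under this definition, word blocks and maximal blocks can be computed directly from the alignment. A minimal sketch, assuming a set-valued alignment encoding and 0-based positions (the function names are illustrative, not from the cited work):

```python
def is_word_block(span, align, null_l2):
    """True iff the L1 span (i, j), inclusive, induces a word block: the
    union of its L2 correspondents is contiguous up to NULL-aligned
    L2 positions.  align maps each L1 position to a set of L2 positions;
    null_l2 is the set of L2 positions aligned only to NULL."""
    i, j = span
    covered = set()
    for k in range(i, j + 1):
        covered |= align.get(k, set())
    if not covered:
        return False
    lo, hi = min(covered), max(covered)
    # every gap inside the covered region must be a NULL word in L2
    return all(p in covered or p in null_l2 for p in range(lo, hi + 1))

def maximal_blocks(n, align, null_l2):
    """Enumerate word blocks over an L1 sentence of length n and keep the
    maximal ones (extending left or right yields a non-block)."""
    blocks = {(i, j) for i in range(n) for j in range(i, n)
              if is_word_block((i, j), align, null_l2)}
    return {(i, j) for (i, j) in blocks
            if (i - 1, j) not in blocks and (i, j + 1) not in blocks}

# Toy example from the slides: "Wir brauchen bald eine Entscheidung" (L1)
# vs. "We need a decision soon" (L2), with bald -> soon crossing.
align = {0: {0}, 1: {1}, 2: {4}, 3: {2}, 4: {3}}
print(maximal_blocks(5, align, set()))
```

Note that with this alignment, "bald eine" (span (2, 3)) maps to the non-contiguous L2 positions {4, 2} and is therefore not a block.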
Unsupervised Learning
- Grammar induction of an X-bar grammar
- Using a variant of standard PCFG induction with the Inside-Outside algorithm (an Expectation Maximization algorithm)
- All word spans are considered as phrase candidates, except the excluded distituents
- Automatic generalization based on patterns in the learning data (after part-of-speech tagging)
- Exclusion of distituents can reduce the effect of frequent non-phrasal word sequences
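The span filter applied before Inside-Outside training can be sketched as follows. This implements one simplified reading of the conservative formulation: a candidate span is dropped whenever it partially overlaps any maximal block, without the "fully covering an adjacent block" exception, so it is an assumption of this sketch rather than the exact filter of Kuhn 2004.

```python
def crosses(span, block):
    """True iff span and block overlap without one containing the other
    (both are inclusive (start, end) position pairs)."""
    i, j = span
    a, b = block
    overlap = max(i, a) <= min(j, b)
    nested = (a <= i and j <= b) or (i <= a and b <= j)
    return overlap and not nested

def is_distituent(span, blocks):
    """Conservative filter: exclude a candidate span from PCFG induction
    if it crosses any maximal word block boundary."""
    return any(crosses(span, blk) for blk in blocks)
```

All surviving spans remain phrase candidates; the Inside-Outside algorithm then reestimates rule probabilities over this restricted chart.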
Empirical Results
Comparative experiment [Kuhn 2004 (ACL)]:
- A: Grammar induction based on English corpus data only
- B: Induction including partial information from a parallel corpus (Europarl corpus) [Koehn 2002]
  - Statistical word alignment, trained with GIZA++ [Al-Onaizan et al. 1999; Och/Ney 2003]
  - Exclusion of "distituents" based on alignment-induced word blocks

Evaluation:
- Parsing sentences from the Penn Treebank (Wall Street Journal) with the induced grammars
- Comparison of the automatic analyses with the gold-standard treebank analyses created by linguists
Empirical Results
[Chart: Precision, Recall, and F-score (0-100%) for five settings: strictly left-branching structure; strictly right-branching structure; A: standard PCFG induction; B: induction using partial information from the parallel corpus; upper bound (oracle binary grammar)]

Precision = # correctly identified phrases / # proposed phrases
Recall = # correctly identified phrases / # gold standard phrases
F-score = mean of Precision and Recall
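These metrics are straightforward to compute from the proposed and gold bracketings. A small sketch (the function name is illustrative; the harmonic mean is used for the F-score, which is the standard choice, where the slide just says "mean"):

```python
def bracket_eval(proposed, gold):
    """Unlabeled bracketing precision, recall, and F-score for phrase
    spans given as (start, end) pairs."""
    correct = len(set(proposed) & set(gold))
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    # harmonic mean of precision and recall
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```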
[Figure: aligned sentence pair "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen"]
Study 2

Underlying consideration:
- It should be possible to learn more complex syntactic generalizations from parallel corpora if phrase correspondences can be observed systematically
- Cross-linguistic "consensus structure" serves as a poor man's meaning representation (pseudo-meaning representation)
- The form-meaning relation is important for Optimality Theory-style discriminative learning

[Figure: two parallel clause trees (S dominating XPs) linked by phrase correspondences]
Parallel Parsing

Prerequisites for structural learning from parallel corpora:
- A grammar formalism generating sentence pairs or tuples (bitexts or multitexts): a generalization of monolingual context-free grammars, generating two or more trees simultaneously (keeping track of phrase correspondences)
- Algorithms for parsing and learning (parallel parsing / synchronous parsing)

Problem: higher complexity than monolingual parsing; compare theoretical work on machine translation [Wu 1997, Melamed 2004]

Focus for this study: efficient parallel parsing based on a word alignment [Kuhn 2005 (IJCAI)]
Chart Parsing (review)

In chart parsing for context-free grammars, partial analyses covering the same string under the same category are merged:
- WP covering string positions j through m
- Two possible internal analyses, a single chart entry; the internal distinction is irrelevant from an external point of view
  - Reading 1: WP -> XP YP, with XP: j-k and YP: k-m, yielding WP: j-m
  - Reading 2: WP -> ZP XP, with ZP: j-n and XP: n-m, yielding WP: j-m
Data Structure for Parallel Parsing
[Figure: two-dimensional word array for "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen"]

- The input is not a one-dimensional word string as in classical parsing, but a two-dimensional word array
- Representation of the word alignment
“3-D” Parsing
[Figure: parallel trees (S over XPs) built jointly over the two-dimensional word array for "So we must look at the agricultural policy" and "Wir müssen deshalb die Agrarpolitik prüfen"]
Assumptions for Parallel Parsing
- A particular word alignment is given (or the n best alignments according to a statistical model)
- Note: both languages may contain words without a correspondent ("null words")
- Complete constituents must be continuous in both languages
- There may however be phrase-internal "discontinuities"

[Figure: alignment of L1 "We need a decision soon" with L2 "Wir brauchen bald eine Entscheidung"]
Generalized Earley Parsing
Main idea:
- One of the languages is the "master language"
  - The choice of master language is arbitrary in principle; for efficiency, pick the language with more null words
- Primary index for chart entries according to the master language: string span (from-to)
- Secondary index: bit vector for word coverage in the secondary language (cp. chart-based generation)

Example (L1 "1: We, 2: need, 3: a, 4: decision, 5: soon"; L2 "1: Wir, 2: brauchen, 3: bald, 4: eine, 5: Entscheidung"), active chart entry for a partial constituent:
PI: 1-3, SI: [01001]
Primary/Secondary Index in Parsing
Combination of chart entries (same example sentence pair):
  PI: 1-3, SI: [01001]
+ PI: 3-5, SI: [00110]
= PI: 1-5, SI: [01111]
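The index combination can be sketched with spans and integer bit vectors. This is a hedged sketch (the function names and integer encoding are illustrative, not from the cited implementation); the continuity check corresponds to the requirement that passive (complete) entries cover a contiguous L2 region.

```python
def combine(entry1, entry2):
    """Combine two chart entries, each ((from, to), si) with PI an L1
    string span and SI an integer bit vector of L2 word coverage.
    Succeeds only if the L1 spans are adjacent and the L2 coverage
    vectors are disjoint; returns None otherwise."""
    (i, j), si1 = entry1
    (k, m), si2 = entry2
    if j != k:            # primary spans must be adjacent
        return None
    if si1 & si2:         # overlapping L2 coverage is illegal
        return None
    return ((i, m), si1 | si2)

def contiguous(si):
    """Continuity check for passive (complete) entries: the set bits of
    si must form one contiguous run."""
    if si == 0:
        return False
    si >>= (si & -si).bit_length() - 1   # strip trailing zero bits
    return si & (si + 1) == 0            # remaining bits are all ones

# The combination from the slides: [01001] + [00110] -> [01111],
# written as binary literals in slide order.
print(combine(((1, 3), 0b01001), ((3, 5), 0b00110)))  # ((1, 5), 15)
```

An active entry like [01001] may have gaps, but an entry whose coverage fails `contiguous` cannot become a passive chart entry.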
Complexity

- With a given word alignment, the secondary index is fully determined by the primary index unless there are null words in the secondary language
- In the absence of null words, the worst-case parsing complexity is essentially the same as in monolingual parsing: the secondary index adds no free variables to the inference rules for parsing (bit-vector operations are very cheap and ignored here)
- In the average case, the search space is even smaller than in monolingual parsing

Example (L1 "We need a decision soon", L2 "Wir brauchen bald eine Entscheidung"): the (incorrect) combination of a chart entry with an incomplete constituent is excluded early:
  PI: 0-1, SI: [10000]
+ PI: 1-3, SI: [01001]
= PI: 0-3, SI: [11001]   (illegal as a passive chart entry)
Complexity and Null Words
[Figure: L1 "Wir müssen deshalb die Agrarpolitik prüfen", L2 "So we must look at the agricultural policy"; several secondary indices can combine with one primary index:
1. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-4, SI: [0,0,0,0,0,1,0,0] -> PI: 3-4, SI: [0,0,0,0,1,1,0,0]
2. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-5, SI: [0,0,0,0,0,1,1,1] -> PI: 3-5, SI: [0,0,0,0,1,1,1,1]
3. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-7, SI: [0,0,0,1,0,1,1,1] -> PI: 3-7, SI: [0,0,0,1,1,1,1,1]]

- Null words in the secondary language L2 create possible variations of the secondary index (for a single primary index)
- However, the effect is fairly local thanks to the continuity assumption

[Figure: alternative XP attachments of the L2 null words over "look at the agricultural policy"; an entry covering a null word only discontinuously forms no passive entry, i.e., can be used only locally]
Complexity and Null Words

- Variability due to null words increases the worst-case complexity, depending on the number of null words in L2
  - n: total number of words in L1
  - m: number of null words in L2 (note: typically clearly smaller than n)
- Complexity class for alignment-based parallel parsing (time complexity for non-lexicalized grammars): O(n^3 m^3)
- For comparison, the complexity of general parallel parsing without a fixed alignment is O(n^6) [cf. Wu 1997, Melamed 2004]

[Figure: growth curves for O(n^3), O(n^3 m^3), and O(n^6)]
Experimental Results
- Prototype implementation of the parser in SWI-Prolog (chart implemented as a hash function)
- Probabilistic variant: Viterbi parser (determining the most probable reading)
- Scripts for generation of training data (currently based on hand-labeled data using MMAX2) [http://mmax.eml-research.de]
- Comparison of "correspondence-guided synchronous parsing" (CGSP) with (simulated) monolingual parsing
Experimental Results
Parsing sentences without NIL words

[Chart: parsing time in seconds vs. number of words in L1 (4-10), comparing monolingual parsing of L1 with CGSP]
Experimental Results
Comparison wrt. number of NIL words

[Chart: parsing time in seconds vs. number of words in L1 (5-10), for CGSP with 0, 1, 2, and 3 L1 NILs and for monolingual parsing of L1]
Alignment-based Parallel Parsing
Conclusion:
- Alignment-based parsing of parallel corpora on a larger scale should be realistic
- Use in (weakly supervised) grammar learning should be possible
- Sentences with a large proportion of null words on both sides could be filtered out

Open questions:
- An effective heuristic for determining a single word alignment from a statistical alignment
- A tractable relaxation of the continuity assumption
Talk Summary
- Parallel Corpora
- The PTOLEMAIOS Project: Goals and Motivation
  - Specification of the intended prototype system
- Study 1: Unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: A chart-based parsing algorithm for bilingual parallel grammars based on a word alignment
- Conclusion: The PTOLEMAIOS research agenda
The PTOLEMAIOS research agenda
- Formal grammar model for parallel linguistic analysis
- Specific choice of linguistic representations/constraints
- Efficient parallel parsing algorithms
- Probability models for parallel structural representations
- Weakly supervised learning techniques for bootstrapping the grammars

[Diagram: grammar formalism, linguistic specification, algorithmic realization, probabilistic modeling, bootstrapping learning]
Conclusion
The planned PTOLEMAIOS architecture
Selected References
Al-Onaizan, Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical machine translation. Final report, JHU Workshop.
Brown, P.F., S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19:263–311.
Dubey, Amit, and Frank Keller. 2003. Probabilistic parsing for German using sister-head dependencies. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 96–103, Sapporo.
Koehn, Philipp. 2002. Europarl: A multilingual corpus for evaluation of machine translation. Ms., University of Southern California.
Kuhn, Jonas. 2004. Experiments in parallel-text based grammar induction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004). Barcelona.
Melamed, I. Dan. 2004. Multitext grammars and synchronous parsers. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004). Barcelona.
Och, Franz Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.
Yarowsky, D., G. Ngai, and R. Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT, 161–168.