Learning Techniques based on Parallel Corpora
Jonas Kuhn, Universität des Saarlandes, Saarbrücken
Heidelberg, 16. Juni 2005
Talk Outline

- Parallel Corpora
- The PTOLEMAIOS Project: Goals and Motivation
  - Specification of the intended prototype system
- Study 1: Unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: A chart-based parsing algorithm for bilingual parallel grammars based on a word alignment
- Conclusion: The PTOLEMAIOS research agenda
Parallel Corpora
- Collections of texts and their translations into different languages
- Alignment across languages at various levels: document, section, paragraph, sentence (not necessarily one-to-one), phrase, word
- Varying quality depending on the origin of the translation: translation is often not literal; parts of a document may occur in only one version
Examples of Parallel Corpora

- Hansards of the 36th Parliament of Canada (http://www.isi.edu/natural-language/download/hansard/): 1.3 million sentence pairs (19.8 million word forms English / 21.2 million word forms French)
- The Bible
- Europarl corpus (European Parliament Proceedings Parallel Corpus 1996-2003) (http://www.isi.edu/~koehn/europarl/): 11 languages (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish); up to 28 million word forms per language
- OPUS, an open source parallel corpus (http://logos.uio.no/opus/): includes the Europarl corpus and various manuals for open source software (parts of which have been translated into more than 20 languages, incl. e.g. Chinese, Hebrew, Japanese, Korean, Russian, and Turkish)
Uses for Parallel Corpora

- Building bilingual dictionaries
- Resource for Machine Translation research
  - "Classical" MT: data basis for transfer rules / dictionary entries
  - Statistical MT: training data for statistical alignment models (paragraph level, word level)
- Resource for (multilingual) NLP applications: "annotation projection" for training monolingual tools

Query interface

From the OPUS corpus website, http://logos.uio.no/cgi-bin/opus/opuscqp.pl (based on the Corpus Workbench/CQP from IMS Stuttgart)
Training Statistical Alignments
GIZA++ tool (http://www.fjoch.com/GIZA++.html)
- Implementation of the "IBM models" for Statistical MT
- Extension of the program GIZA (part of the SMT toolkit EGYPT) developed at a summer workshop in 1999 at Johns Hopkins University

Sample word alignment (from the Europarl corpus):
[Figure: word alignment between German "Dies war im Protokoll nachzulesen" and English "This was actually referred to in the Minutes"]
“Annotation Projection”
Yarowsky/Ngai/Wicentowski 2001
- A parallel corpus and word alignment are given
- A tagger/chunker for English exists
- The projected annotation is used as training data for a tagger/chunker in the target language
- Robust learning techniques based on confidence in the training data
[Figure: POS tags and brackets projected from English "[a significant producer] [for crude oil]" (DT JJ NN IN JJ NN) onto French "[un producteur important] [de petrole brut]"]
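The projection step itself can be sketched in a few lines. This is a deliberately naive illustration; the function name, the dictionary-based alignment format, and the example alignment are assumptions of this sketch, not the implementation of Yarowsky et al. 2001, who additionally apply robust, confidence-based filtering to the projected data.

```python
def project_tags(source_tags, alignment):
    """Project POS tags from source words to target words via a word
    alignment given as a dict {source_index: [target_indices]}."""
    target_tags = {}
    for s_idx, t_indices in alignment.items():
        for t_idx in t_indices:
            # Last projection wins in this naive sketch; a robust system
            # would weight projections by alignment confidence.
            target_tags[t_idx] = source_tags[s_idx]
    return target_tags

# English: a/DT significant/JJ producer/NN for/IN crude/JJ oil/NN
source_tags = ["DT", "JJ", "NN", "IN", "JJ", "NN"]
# French: un producteur important de petrole brut (note the JJ/NN swaps)
alignment = {0: [0], 1: [2], 2: [1], 3: [3], 4: [5], 5: [4]}
french_tags = project_tags(source_tags, alignment)
```

The projected tags then serve as (noisy) training data for a target-language tagger.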
Quality of projected information
Evaluation results for part-of-speech tagging (from Yarowsky et al. 2001)
Train grammars on parallel corpora?
New weakly supervised learning approach for (probabilistic) syntactic grammars:
- Training data: parallel corpora, i.e., collections of original texts and their translations into one or more languages
- Preparatory step: identification of word correspondences with known statistical techniques (word alignment from statistical machine translation)

[Figure (en/de/fr parallel text): German "Heute stellt sich die Lage jedoch völlig anders dar" aligned with English "The situation now however is radically different"]
Train grammars on parallel corpora?
- Beyond lexical information, patterns in the word correspondence relation contain rich implicit information about the grammars of the languages
- This implicit information about structure and meaning should be exploitable for grammar learning
- Little manual annotation effort should be required
- Combination of insights from linguistics and machine learning
The PTOLEMAIOS Project
[Image: Rosetta Stone]

Parallel Corpus-Based Grammar Induction: PTOLEMAIOS ("Parallel-Text-based Optimization for Language Learning – Exploiting Multilingual Alignment for the Induction Of Syntactic Grammars")

- Funded by the DFG (German Research Foundation) as an Emmy Noether research group
- Universität des Saarlandes (Saarbrücken), Department of Computational Linguistics
- Starting date: 1 April 2005
- Expected duration: 4 years (1-year extension possible)
Project Goals
Development of formalisms and algorithms to support grammar induction for arbitrary languages from parallel corpora

To make the goals tangible, an intended prototype: the PTOLEMAIOS I system for building grammars for new (sub-)languages
The PTOLEMAIOS I system
Resources required:
- Parallel corpus of language L and one or (ideally) more other languages
- No NLP tools for language L required

Preparatory work required:
- Manual annotation of a set of seed sentence pairs (e.g., 50-100 pairs)
- Phrasal correspondence across languages
- "Lean" bracketing: mark only full argument/modifier phrases (PPs, NPs) and full clauses
The PTOLEMAIOS I system
Training steps:
- (Sentence alignment on the parallel corpus)
- Word alignment on the parallel corpus, using standard techniques from Statistical Machine Translation (GIZA++ tool)
- Part-of-speech clustering for L
- Bootstrapping learning of syntactic grammars for L and the other language(s):
  - Starting from the annotated seed data
  - Exploiting large amounts of unannotated data, finding systematic patterns in phrasal correspondences
  - Assuming an implicit underlying representation ("pseudo meaning representation")
  - Relying on consensus across the grammars
The PTOLEMAIOS I system
Result:
- Robust probabilistic grammar for L
- Representation of predicate-argument and modifier relations
- Models predict probabilities for cross-linguistic argument/modifier links (these will be particularly useful in lexicalized models)

Applications:
- Multilingual Information Extraction, Question Answering
- Intermediate step for syntax-based MT
Motivation
Practical:
- Explore an alternative to standard treebank training of grammars
- For "smaller" languages, it is unrealistic to do the necessary manual resource annotation

Theoretical:
- Establish parallel corpora as an empirical basis for (crosslinguistic or monolingual) linguistic studies
- Frequency-related phenomena (like multi-word expressions/collocations) are otherwise hard to assess empirically at the level of syntax
- Learnability properties as a criterion for assessing formal models for natural language
Study 1
Unsupervised learning of a probabilistic context-free grammar (PCFG) exploiting partial information from a parallel corpus [Kuhn 2004 (ACL)]

Underlying consideration: the distribution of word correspondences in the translation of a string contains partial information about possible phrase boundaries

[Figure: word alignment between "Heute stellt sich die Lage jedoch völlig anders dar" and "The situation now however is radically different"]
Alignment-induced word blocks

[Figure: L1 "The situation is now different in Turkey" aligned with L2 "Heute ist die Lage in der Türkei anders"; the alignment maps (The)={die}, (situation)={Lage}, (NULL)={der}]
- Word alignments obtained with the standard model are asymmetrical [Brown et al. 1993]: each word from language L1 is mapped to a set of words from L2
- (For L2 words without an overt correspondent in L1, an artificial word NULL is assumed in each sentence of L1)

Definition: A word alignment mapping induces a word block w1...wn in language L1 iff the union of the correspondence sets of w1...wn forms a contiguous string in L2 [interrupted only by words corresponding to NULL]

- Call a word block maximal if adding a word to the left or right yields a non-block
- Maximal word blocks are possible constituents, but not necessary ones
- Conservative formulation: exclusion of impossible constituents ("distituents") whenever a word sequence crosses block boundaries without fully covering one of the adjacent blocks
Word blocks and constituents
[Figure: word blocks induced by the alignment of "Wir brauchen bald eine Entscheidung" with "We need a decision soon"]
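Under this definition, word blocks and maximal blocks can be computed directly from the alignment. A minimal sketch, assuming a set-valued alignment encoding and 0-based positions (the function names are illustrative, not from the cited work):

```python
def is_word_block(span, align, null_l2):
    """True iff the L1 span (i, j), inclusive, induces a word block: the
    union of its L2 correspondents is contiguous up to NULL-aligned
    L2 positions.  align maps each L1 position to a set of L2 positions;
    null_l2 is the set of L2 positions aligned only to NULL."""
    i, j = span
    covered = set()
    for k in range(i, j + 1):
        covered |= align.get(k, set())
    if not covered:
        return False
    lo, hi = min(covered), max(covered)
    # every gap inside the covered region must be a NULL word in L2
    return all(p in covered or p in null_l2 for p in range(lo, hi + 1))

def maximal_blocks(n, align, null_l2):
    """Enumerate word blocks over an L1 sentence of length n and keep the
    maximal ones (extending left or right yields a non-block)."""
    blocks = {(i, j) for i in range(n) for j in range(i, n)
              if is_word_block((i, j), align, null_l2)}
    return {(i, j) for (i, j) in blocks
            if (i - 1, j) not in blocks and (i, j + 1) not in blocks}

# Toy example from the slides: "Wir brauchen bald eine Entscheidung" (L1)
# vs. "We need a decision soon" (L2), with bald -> soon crossing.
align = {0: {0}, 1: {1}, 2: {4}, 3: {2}, 4: {3}}
print(maximal_blocks(5, align, set()))
```

Note that with this alignment, "bald eine" (span (2, 3)) maps to the non-contiguous L2 positions {4, 2} and is therefore not a block.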
Unsupervised Learning
- Grammar induction of an X-bar grammar
- Using a variant of standard PCFG induction with the Inside-Outside algorithm (an Expectation Maximization algorithm)
- All word spans are considered as phrase candidates, except the excluded distituents
- Automatic generalization based on patterns in the learning data (after part-of-speech tagging)
- Exclusion of distituents can reduce the effect of frequent non-phrasal word sequences
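The span filter applied before Inside-Outside training can be sketched as follows. This implements one simplified reading of the conservative formulation: a candidate span is dropped whenever it partially overlaps any maximal block, without the "fully covering an adjacent block" exception, so it is an assumption of this sketch rather than the exact filter of Kuhn 2004.

```python
def crosses(span, block):
    """True iff span and block overlap without one containing the other
    (both are inclusive (start, end) position pairs)."""
    i, j = span
    a, b = block
    overlap = max(i, a) <= min(j, b)
    nested = (a <= i and j <= b) or (i <= a and b <= j)
    return overlap and not nested

def is_distituent(span, blocks):
    """Conservative filter: exclude a candidate span from PCFG induction
    if it crosses any maximal word block boundary."""
    return any(crosses(span, blk) for blk in blocks)
```

All surviving spans remain phrase candidates; the Inside-Outside algorithm then reestimates rule probabilities over this restricted chart.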
Empirical Results
Comparative experiment [Kuhn 2004 (ACL)]:
- A: Grammar induction based on English corpus data only
- B: Induction including partial information from a parallel corpus (Europarl corpus) [Koehn 2002]
  - Statistical word alignment, trained with GIZA++ [Al-Onaizan et al. 1999; Och/Ney 2003]
  - Exclusion of "distituents" based on alignment-induced word blocks

Evaluation:
- Parsing sentences from the Penn Treebank (Wall Street Journal) with the induced grammars
- Comparison of the automatic analyses with the gold-standard treebank analyses created by linguists
Empirical Results
[Chart: Precision, Recall, and F-score (0-100%) for five settings: strictly left-branching structure; strictly right-branching structure; A: standard PCFG induction; B: induction using partial information from the parallel corpus; upper bound (oracle binary grammar)]

Precision = # correctly identified phrases / # proposed phrases
Recall = # correctly identified phrases / # gold standard phrases
F-score = mean of Precision and Recall
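These metrics are straightforward to compute from the proposed and gold bracketings. A small sketch (the function name is illustrative; the harmonic mean is used for the F-score, which is the standard choice, where the slide just says "mean"):

```python
def bracket_eval(proposed, gold):
    """Unlabeled bracketing precision, recall, and F-score for phrase
    spans given as (start, end) pairs."""
    correct = len(set(proposed) & set(gold))
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    # harmonic mean of precision and recall
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```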
[Figure: aligned sentence pair "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen"]
Study 2

Underlying consideration:
- It should be possible to learn more complex syntactic generalizations from parallel corpora if phrase correspondences can be observed systematically
- Cross-linguistic "consensus structure" serves as a poor man's meaning representation (pseudo-meaning representation)
- The form-meaning relation is important for Optimality Theory-style discriminative learning

[Figure: two parallel clause trees (S dominating XPs) linked by phrase correspondences]
Parallel Parsing

Prerequisites for structural learning from parallel corpora:
- A grammar formalism generating sentence pairs or tuples (bitexts or multitexts): a generalization of monolingual context-free grammars, generating two or more trees simultaneously (keeping track of phrase correspondences)
- Algorithms for parsing and learning (parallel parsing / synchronous parsing)

Problem: higher complexity than monolingual parsing; compare theoretical work on machine translation [Wu 1997, Melamed 2004]

Focus for this study: efficient parallel parsing based on a word alignment [Kuhn 2005 (IJCAI)]
Chart Parsing (review)

In chart parsing for context-free grammars, partial analyses covering the same string under the same category are merged:
- WP covering string positions j through m
- Two possible internal analyses, a single chart entry; the internal distinction is irrelevant from an external point of view
  - Reading 1: WP -> XP YP, with XP: j-k and YP: k-m, yielding WP: j-m
  - Reading 2: WP -> ZP XP, with ZP: j-n and XP: n-m, yielding WP: j-m
Data Structure for Parallel Parsing
[Figure: two-dimensional word array for "So we must look at the agricultural policy" / "Wir müssen deshalb die Agrarpolitik prüfen"]

- The input is not a one-dimensional word string as in classical parsing, but a two-dimensional word array
- Representation of the word alignment
“3-D” Parsing
[Figure: parallel trees (S over XPs) built jointly over the two-dimensional word array for "So we must look at the agricultural policy" and "Wir müssen deshalb die Agrarpolitik prüfen"]
Assumptions for Parallel Parsing
- A particular word alignment is given (or the n best alignments according to a statistical model)
- Note: both languages may contain words without a correspondent ("null words")
- Complete constituents must be continuous in both languages
- There may however be phrase-internal "discontinuities"

[Figure: alignment of L1 "We need a decision soon" with L2 "Wir brauchen bald eine Entscheidung"]
Generalized Earley Parsing
Main idea:
- One of the languages is the "master language"
  - The choice of master language is arbitrary in principle; for efficiency, pick the language with more null words
- Primary index for chart entries according to the master language: string span (from-to)
- Secondary index: bit vector for word coverage in the secondary language (cp. chart-based generation)

Example (L1 "1: We, 2: need, 3: a, 4: decision, 5: soon"; L2 "1: Wir, 2: brauchen, 3: bald, 4: eine, 5: Entscheidung"), active chart entry for a partial constituent:
PI: 1-3, SI: [01001]
Primary/Secondary Index in Parsing
Combination of chart entries (same example sentence pair):
  PI: 1-3, SI: [01001]
+ PI: 3-5, SI: [00110]
= PI: 1-5, SI: [01111]
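The index combination can be sketched with spans and integer bit vectors. This is a hedged sketch (the function names and integer encoding are illustrative, not from the cited implementation); the continuity check corresponds to the requirement that passive (complete) entries cover a contiguous L2 region.

```python
def combine(entry1, entry2):
    """Combine two chart entries, each ((from, to), si) with PI an L1
    string span and SI an integer bit vector of L2 word coverage.
    Succeeds only if the L1 spans are adjacent and the L2 coverage
    vectors are disjoint; returns None otherwise."""
    (i, j), si1 = entry1
    (k, m), si2 = entry2
    if j != k:            # primary spans must be adjacent
        return None
    if si1 & si2:         # overlapping L2 coverage is illegal
        return None
    return ((i, m), si1 | si2)

def contiguous(si):
    """Continuity check for passive (complete) entries: the set bits of
    si must form one contiguous run."""
    if si == 0:
        return False
    si >>= (si & -si).bit_length() - 1   # strip trailing zero bits
    return si & (si + 1) == 0            # remaining bits are all ones

# The combination from the slides: [01001] + [00110] -> [01111],
# written as binary literals in slide order.
print(combine(((1, 3), 0b01001), ((3, 5), 0b00110)))  # ((1, 5), 15)
```

An active entry like [01001] may have gaps, but an entry whose coverage fails `contiguous` cannot become a passive chart entry.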
Complexity

- With a given word alignment, the secondary index is fully determined by the primary index unless there are null words in the secondary language
- In the absence of null words, the worst-case parsing complexity is essentially the same as in monolingual parsing: the secondary index adds no free variables to the inference rules for parsing (bit-vector operations are very cheap and ignored here)
- In the average case, the search space is even smaller than in monolingual parsing

Example (L1 "We need a decision soon", L2 "Wir brauchen bald eine Entscheidung"): the (incorrect) combination of a chart entry with an incomplete constituent is excluded early:
  PI: 0-1, SI: [10000]
+ PI: 1-3, SI: [01001]
= PI: 0-3, SI: [11001]   (illegal as a passive chart entry)
Complexity and Null Words
[Figure: L1 "Wir müssen deshalb die Agrarpolitik prüfen", L2 "So we must look at the agricultural policy"; several secondary indices can combine with one primary index:
1. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-4, SI: [0,0,0,0,0,1,0,0] -> PI: 3-4, SI: [0,0,0,0,1,1,0,0]
2. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-5, SI: [0,0,0,0,0,1,1,1] -> PI: 3-5, SI: [0,0,0,0,1,1,1,1]
3. PI: 3-3, SI: [0,0,0,0,1,0,0,0] + PI: 3-7, SI: [0,0,0,1,0,1,1,1] -> PI: 3-7, SI: [0,0,0,1,1,1,1,1]]

- Null words in the secondary language L2 create possible variations of the secondary index (for a single primary index)
- However, the effect is fairly local thanks to the continuity assumption

[Figure: alternative XP attachments of the L2 null words over "look at the agricultural policy"; an entry covering a null word only discontinuously forms no passive entry, i.e., can be used only locally]
Complexity and Null Words

- Variability due to null words increases the worst-case complexity, depending on the number of null words in L2
  - n: total number of words in L1
  - m: number of null words in L2 (note: typically clearly smaller than n)
- Complexity class for alignment-based parallel parsing (time complexity for non-lexicalized grammars): O(n^3 m^3)
- For comparison, the complexity of general parallel parsing without a fixed alignment is O(n^6) [cf. Wu 1997, Melamed 2004]

[Figure: growth curves for O(n^3), O(n^3 m^3), and O(n^6)]
Experimental Results
- Prototype implementation of the parser in SWI-Prolog (chart implemented as a hash function)
- Probabilistic variant: Viterbi parser (determining the most probable reading)
- Scripts for generation of training data (currently based on hand-labeled data using MMAX2) [http://mmax.eml-research.de]
- Comparison of "correspondence-guided synchronous parsing" (CGSP) with (simulated) monolingual parsing
Experimental Results
Parsing sentences without NIL words

[Chart: parsing time in seconds vs. number of words in L1 (4-10), comparing monolingual parsing of L1 with CGSP]
Experimental Results
Comparison wrt. number of NIL words

[Chart: parsing time in seconds vs. number of words in L1 (5-10), for CGSP with 0, 1, 2, and 3 L1 NILs and for monolingual parsing of L1]
Alignment-based Parallel Parsing
Conclusion:
- Alignment-based parsing of parallel corpora on a larger scale should be realistic
- Use in (weakly supervised) grammar learning should be possible
- Sentences with a large proportion of null words on both sides could be filtered out

Open questions:
- An effective heuristic for determining a single word alignment from a statistical alignment
- A tractable relaxation of the continuity assumption
Talk Summary
- Parallel Corpora
- The PTOLEMAIOS Project: Goals and Motivation
  - Specification of the intended prototype system
- Study 1: Unsupervised learning of a probabilistic context-free grammar (PCFG) using partial information from a parallel corpus
- Study 2: A chart-based parsing algorithm for bilingual parallel grammars based on a word alignment
- Conclusion: The PTOLEMAIOS research agenda
The PTOLEMAIOS research agenda
- Formal grammar model for parallel linguistic analysis
- Specific choice of linguistic representations/constraints
- Efficient parallel parsing algorithms
- Probability models for parallel structural representations
- Weakly supervised learning techniques for bootstrapping the grammars

[Diagram: grammar formalism, linguistic specification, algorithmic realization, probabilistic modeling, bootstrapping learning]
Conclusion
The planned PTOLEMAIOS architecture
Selected References
Al-Onaizan, Yaser, Jan Curin, Michael Jahr, Kevin Knight, John Lafferty, Dan Melamed, Franz-Josef Och, David Purdy, Noah A. Smith, and David Yarowsky. 1999. Statistical machine translation. Final report, JHU Workshop.
Brown, P.F., S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19:263–311.
Dubey, Amit, and Frank Keller. 2003. Probabilistic parsing for German using sister-head dependencies. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 96–103, Sapporo.
Koehn, Philipp. 2002. Europarl: A multilingual corpus for evaluation of machine translation. Ms., University of Southern California.
Kuhn, Jonas. 2004. Experiments in parallel-text based grammar induction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004). Barcelona.
Melamed, I. Dan. 2004. Multitext grammars and synchronous parsers. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004). Barcelona.
Och, Franz Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.
Yarowsky, D., G. Ngai, and R. Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT, 161–168.