
SOUTHERN JOURNAL OF LINGUISTICS 29: 214-223 • © 2007 Southeastern Conference on Linguistics

Cross-Linguistic Knowledge Induction from Parallel Corpora

Dan Tufiş
Romanian Academy and University 'Al. I. Cuza,' Iaşi

Parallel corpora encode extremely valuable linguistic knowledge, whose extraction is facilitated by recent advances in multilingual corpus linguistics. The linguistic decisions made by human translators in order to faithfully convey the meaning of the source text can be traced, and they provide evidence of linguistic facts that, in a monolingual context, might be overlooked by a computer program. When linguistic annotations are available or easy to produce for one or more languages in a parallel corpus, but not for all, inductive learning methods provide powerful support for the systematic and consistent cross-lingual transfer of linguistic interpretations and allow for focused comparative studies of the languages of the parallel corpus.

1. Parallel corpora and corpus alignment. A bitext or a parallel text is an association between two texts (written or spoken) in different languages that are translations of each other. By extension, a parallel text may contain translations of the same (source) text into several languages. A collection of parallel texts, or even a single sufficiently large parallel text, is called a parallel corpus. A professional translation of a text represents a series of linguistic decisions made by the translator in order to convey the meaning of the original text as faithfully as possible and to produce a 'natural' text from the perspective of a native speaker of the target language. Given that the meaning is presumably preserved across the bitexts, a parallel corpus encodes extremely valuable linguistic knowledge about the paired languages, both in terms of vocabulary and syntax.

The key algorithmic issue in taking advantage of the translators' knowledge embedded in the translations is the ability to automatically and precisely identify the segments of text (translation units) that represent reciprocal translations. This problem, known as parallel corpus alignment, can be defined at various levels of granularity (paragraph, sentence, phrase, word) with different degrees of difficulty. The computational approaches to the alignment problem range from symbolic/rule-based to purely statistical, and one could easily find arguments for and against each of them. Lately, with more and more parallel data available, there is a visible tendency towards statistical approaches, but in general the best compromise between the expected accuracy and the necessary development effort is reached by a mixed approach. Irrespective of the translation-unit granularity and of whether the rationalistic or the statistical model dominates in a mixed approach, the larger the parallel corpus, the better the alignment accuracy (and for purely statistical approaches, mass data is even more critical). Depending on the alignment granularity, the required accuracy of the process, and the purpose of the alignment, the input textual data may need pre-processing in all languages of the parallel corpus (e.g. segmentation, that is, sentence-boundary recognition and tokenization, POS-tagging and lemmatization), or at least in one of them (e.g. chunking, dependency linking/parsing, and word sense disambiguation).

In the rest of this article we provide an overview of the major pre-processing steps that we implemented for the various granularity levels of alignment and briefly show how we exploited the alignments.

2. Preprocessing Steps

2.1. Text Tokenization. The first pre-processing step in most NLP systems is text segmentation. In our processing chain this step is carried out by a modified, much faster version of the multilingual segmenter MtSeg developed for the MULTEXT project. The segmenter comes with tokenization resources for many western European languages, further enhanced in the MULTEXT-EAST project (Dimitrova et al. 1998; Tufiş et al. 1998) with corresponding resources for Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene. The segmenter is able to recognize sentence and clause boundaries, dates, numbers and various fixed phrases, and to split clitics or contractions where applicable. We significantly updated the tokenization resources for Romanian and English (the languages we have been most interested in lately). Additionally, for bilingual contexts we used the feedback from the lexical alignment phase to build a language-pair dependent tokenization resource which stores multiword sequences in one language that are translated by a single word or by a multiword sequence in the other language (provided the sequences in the two languages cannot be aligned word by word).

2.2. POS-tagging and Lemmatization. It is generally known that POS-tagging accuracy depends on the quality of the language model underlying the morpho-lexical processing, which in turn is highly dependent on the quality and quantity of the training data and on the tagset of the language model. For languages with a productive inflectional morphology the morpho-lexical feature-value combinations may be very numerous, leading to very large tagsets and an unavoidable threat of training data sparseness. Insufficient training data affects the robustness of the language models, which consequently generate an increased number of tagging errors at run time. To cope with the tagset cardinality problem we developed the tiered-tagging methodology (Tufiş 1999) and implemented it using the TnT trigram HMM tagger (Brants 2000). The tiered tagging model is based on two different tagsets, related but of very different cardinalities. The first one, smaller and much better suited for statistical processing, is used internally, while the other one, significantly larger, more informative and linguistically motivated, is used in the tagger's output. We described elsewhere the relation between these tagsets and how they can be mapped to each other via a reduced set of contextual hand-written mapping rules (Tufiş 1999, 2000; Tufiş and Dragomirescu 2004). Recently, we re-implemented the tiered tagging methodology, relying on a combination of an HMM tagger, called TTL (Ion 2006; Tufiş et al. 2006c), which also produces the lemmatization, and a maximum-entropy tagger (Ceauşu 2006). The HMM tagger works with the reduced tagset, while the ME-tagger ensures the mapping of the first tagset onto the large one (the lexical tagset), dispensing with the hand-written mapping rules.
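As an illustration, the second tier of such a tagger can be sketched roughly as follows. This is a hypothetical toy, with invented lexicon entries and a trivial stand-in for the second-stage disambiguator, not the authors' actual code or tagsets:

```python
# A toy sketch of tiered-tagging recovery (hypothetical data and names).
# Stage 1 (assumed already done): an HMM tagger assigns tags from a small,
# tractable tagset. Stage 2: each reduced tag is expanded to a fine-grained
# morpho-syntactic descriptor (MSD) via a word-form lexicon; only ambiguous
# expansions need a second-stage disambiguator (the ME tagger in the article).

# (wordform, reduced tag) -> possible fine-grained MSDs
LEXICON = {
    ("copiii", "N"): ["Ncmpry"],           # unambiguous expansion
    ("casa", "N"): ["Ncfsry", "Ncfson"],   # ambiguous: needs a second pass
}

def recover_msd(word, reduced_tag, disambiguate):
    """Expand a reduced tag to a full MSD, deferring ambiguous cases."""
    candidates = LEXICON.get((word, reduced_tag), [reduced_tag])
    if len(candidates) == 1:
        return candidates[0]
    return disambiguate(word, candidates)

# Trivial stand-in for the second-stage classifier: take the first candidate.
first = lambda word, cands: cands[0]

print(recover_msd("copiii", "N", first))  # Ncmpry
```

The point of the design is that the expensive statistical model only ever sees the small tagset; the fine-grained information is restored deterministically wherever the lexicon permits.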

Lemmatization is in our case a straightforward process, since the monolingual lexicons developed within MULTEXT-EAST contain, for each word, its lemma and morpho-syntactic codes. Knowing the word-form and its associated tag, lemma extraction is simply a matter of lexicon lookup for those words that are in the lexicon. For the unknown words that are not tagged as proper names, a set of lemma candidates is generated by a set of suffix-stripping rules induced from the word-form lexicon. A four-gram letter Markov model (trained on the lemmas in the word-form dictionary) is used to choose the most likely lemma.

2.3. Sentence Alignment. Moore (2002) presents a state-of-the-art sentence aligner which uses a three-stage hybrid approach. In the first stage, the algorithm uses length-based methods for sentence alignment (Gale and Church 1991). In the second stage, a translation equivalence table is estimated from the corpus aligned in the first stage. The method used for estimating translation equivalents is based on IBM model 1 (Brown et al. 1993). The final stage uses a combination of length-based methods and word correspondence to find 1-1 sentence alignments. The aligner has excellent precision for one-to-one alignments, because it was meant for the acquisition of very accurate training data for machine translation experiments, but it produces no alignments other than 1-1. Another limitation is that it apparently cannot process more than 100,000 sentence pairs.
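The lemmatization back-off of section 2.2 might be sketched like this. The lexicon entries and suffix rules below are invented for illustration, and a trivial scoring function stands in for the letter four-gram model, which, like the rule induction, is assumed to be trained offline:

```python
# Hypothetical sketch of the lemmatization back-off (illustrative data only).

# Known word-forms: (form, tag) -> lemma, as in the MULTEXT-EAST lexicons
LEXICON = {("mergeam", "V"): "merge", ("case", "N"): "casă"}

# Induced suffix-stripping rules: (suffix to remove, replacement)
SUFFIX_RULES = [("ilor", "ă"), ("uri", ""), ("e", "ă"), ("i", "")]

def candidate_lemmas(word):
    """Generate lemma candidates for an unknown word via suffix rules."""
    cands = set()
    for old, new in SUFFIX_RULES:
        if word.endswith(old):
            cands.add(word[: len(word) - len(old)] + new)
    return cands or {word}

def lemmatize(word, tag, score):
    """Lexicon lookup first; otherwise pick the highest-scoring candidate."""
    if (word, tag) in LEXICON:
        return LEXICON[(word, tag)]
    return max(candidate_lemmas(word), key=score)

# Stand-in for the letter 4-gram Markov model: here, prefer shorter lemmas.
toy_score = lambda lemma: -len(lemma)

print(lemmatize("mergeam", "V", toy_score))  # merge (lexicon hit)
```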

We developed a sentence aligner (Ceauşu et al. 2006), inspired by Moore's program, which removes the 1-1 alignment restriction as well as the upper limit on the number of sentence pairs that can be aligned. It has comparable precision but better recall than Moore's aligner. Our aligner does not need a priori language-specific information, but setting its parameters requires training on a small amount of human-checked alignment data.
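Both Moore's aligner and ours rely on an IBM model-1 style translation equivalence table. A minimal EM sketch of its estimation, on toy data and, as in our approach, without null alignments, could look like this (a simplification for illustration, not the authors' implementation):

```python
# Simplified IBM model-1 EM estimation of t(target_word | source_word).
from collections import defaultdict

def model1_em(bitext, iterations=10):
    """bitext: list of (source_tokens, target_tokens) sentence pairs."""
    src_vocab = {s for ss, _ in bitext for s in ss}
    # Uniform initialization of the translation table
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # per-source normalizers
        for ss, ts in bitext:
            for tw in ts:
                norm = sum(t[(sw, tw)] for sw in ss)
                for sw in ss:
                    frac = t[(sw, tw)] / norm
                    count[(sw, tw)] += frac
                    total[sw] += frac
        for (sw, tw) in count:       # M-step: renormalize
            t[(sw, tw)] = count[(sw, tw)] / total[sw]
    return t

bitext = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = model1_em(bitext)
# After a few iterations, 'la' aligns to 'the' with high probability.
```

Even on these two sentence pairs, the co-occurrence evidence is enough for EM to separate the function word from the content words.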

The sentence aligner consists of a hypothesis generator, which creates a list of plausible sentence alignments from the parallel corpus, and a filter, which removes the improbable alignments. The filter is an SVM binary classifier (Chang and Lin 2001), initially trained on a Gold Standard (approx. 1000 hand-aligned sentence pairs). The features of the initial SVM model are: the word sentence length, the non-word sentence length, and the rank correlation for the first 25% of the most frequent words in the two parts of the training bitext. This model is used for a preliminary filtering of the alignment hypotheses generated from the parallel corpus. The set of the remaining aligned sentences is used as the input for an EM algorithm which builds a word translation equivalence table by an approach similar to the IBM model-1 procedure. Unlike IBM model 1, we did not consider the null alignments (words not translated in the other part of the bitext); we found that null word-alignments do not help the sentence alignment process. The SVM model is then rebuilt (from the Gold Standard), this time including as an additional feature the number of word translation equivalents found in the sentences of a candidate alignment pair. This new model is used by the SVM classifier for the final sentence alignment of the parallel corpus, as described in detail in Ceauşu et al. (2006).

2.4. Word and Phrase Alignment. The word alignment of a bitext is an explicit representation of the pairs of words <wL1, wL2> (called translation equivalence pairs) co-occurring in the aligned sentences. The general word alignment problem includes the cases where words in one part of the bitext are not translated in the other part (null alignments) and the cases where multiple words in one part of the bitext are translated as one or more words in the other part (phrase alignments). Our COWAL word aligner, described in detail in Tufiş et al. (2005, 2006b), is a statistical one, based on the link reification concept, which regards the translation equivalence relation between two words as a complex object represented as a feature-value structure. The major features used by COWAL are POS affinity, translation probability, orthographic similarity, and locality (Tufiş et al. 2006b).

3. Exploiting the Alignments

In this section we present a few applications exploiting the alignments.
There are several other ways to take advantage of the alignment technology, most notably in statistical machine translation, but we restrict the presentation to a few of our experiments.

3.1. Wordnet-based Sense Disambiguation. Once the translation equivalents are identified, it is reasonable to expect that the words of a translation pair <wi^L1, wj^L2> share at least one conceptual meaning stored in an interlingual sense inventory. In the Balkanet project (Tufiş et al. 2004) we used the Princeton WordNet (PWN) as an interlingual index. Based on the interlingually aligned wordnets, obtaining the sense labels for the words in a translation pair is straightforward: one has to identify for wi^L1 the synset Si^L1 and for wj^L2 the synset Sj^L2 such that Si^L1 and Sj^L2 are projected onto the same interlingual concept. The index of this common interlingual concept (ILI) is the sense label of the two words wi^L1 and wj^L2. However, it is possible that no common interlingual projection is found for the synsets to which wi^L1 and wj^L2 belong. In this case, the senses of the two words are given by the indexes of the most similar interlingual concepts corresponding to the synsets of the two words. The semantic-similarity score is computed as SYM(ILI1, ILI2) = 1/(1+k), where k is the number of PWN links from ILI1 to ILI2 or from both ILI1 and ILI2 to the nearest common ancestor.

3.2. Annotation Transfer. Another valuable use of word-aligned corpora is the transfer, from one language to the other, of syntactic/semantic annotations existing in the first language (the source language) but not available in the second one (the target language). When the same type of syntactic/semantic annotation exists in both languages, the annotation transfer turns into annotation validation. We addressed, with very promising results, the transfer/validation of the following types of annotation: word senses, dependency relations, valency and semantic frames.
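The semantic-similarity score SYM(ILI1, ILI2) = 1/(1+k) of section 3.1 can be illustrated over a toy fragment of hypernymy links; the concept identifiers below are invented, and only one link type is modeled:

```python
# Toy sketch of SYM(ILI1, ILI2) = 1/(1+k) over invented hypernymy links,
# where k counts the links to the other concept or, summed over both sides,
# to the nearest common ancestor.

HYPERNYM = {  # child ILI -> parent ILI
    "ili-dog": "ili-canine",
    "ili-canine": "ili-animal",
    "ili-cat": "ili-feline",
    "ili-feline": "ili-animal",
}

def ancestors(ili):
    """Return every concept on the chain to the root, with its distance."""
    chain, k = {ili: 0}, 0
    while ili in HYPERNYM:
        ili, k = HYPERNYM[ili], k + 1
        chain[ili] = k
    return chain

def sym(ili1, ili2):
    """Similarity 1/(1+k) via the nearest common ancestor; 0 if unrelated."""
    a1, a2 = ancestors(ili1), ancestors(ili2)
    shared = [a1[c] + a2[c] for c in a1 if c in a2]
    if not shared:
        return 0.0
    return 1.0 / (1.0 + min(shared))

print(sym("ili-dog", "ili-canine"))  # one link apart: 1/(1+1) = 0.5
print(sym("ili-dog", "ili-cat"))     # via ili-animal: 1/(1+4) = 0.2
```

Identical concepts score 1, and the score decays with the number of links, which matches the intended behavior of the back-off described above.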

Sense annotation transfer from one language to another is straightforward (Ion and Tufiş 2004) when the word alignment is provided: the sense label is simply imported from a word to its translation equivalent. When sense annotation is available in both languages of a bitext, the word alignment allows for validation. If the same sense inventory is used in both parts of the bitext, the validation reduces to a simple identity test on the sense identifiers of the translation equivalents. If different sense inventories are used, then a mapping is required from one sense inventory to the other, and the validation reduces to checking whether the sense labels of the translation equivalents map to each other.

For dependency relation transfer, the task reduces to checking whether the relations existing between the words in the source language can also be established between their translation equivalents in the target language. A quantitative analysis of dependency relation transfer from an English parsed text into its Romanian translation is given by Barbu-Mititelu and Ion (2005), who also describe the regular transformations required for some types of relations.
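A minimal sketch of such a transfer check, assuming the word alignment is given as a source-to-target index map (an invented example, not the procedure of Barbu-Mititelu and Ion):

```python
# Hypothetical dependency-transfer check: a source relation
# (head, dependent, label) is projected through the word alignment and
# kept only if both members have translation equivalents.

def transfer_dependencies(src_deps, alignment):
    """alignment: dict mapping source token index -> target token index."""
    transferred, unmatched = [], []
    for head, dep, label in src_deps:
        if head in alignment and dep in alignment:
            transferred.append((alignment[head], alignment[dep], label))
        else:
            unmatched.append((head, dep, label))  # needs a transformation rule
    return transferred, unmatched

# Toy example: "the cat sleeps" -> "pisica doarme" (the article is unaligned)
alignment = {1: 0, 2: 1}                     # cat->pisica, sleeps->doarme
src_deps = [(2, 1, "subj"), (1, 0, "det")]
ok, todo = transfer_dependencies(src_deps, alignment)
print(ok)    # [(1, 0, 'subj')]
print(todo)  # [(1, 0, 'det')]
```

The unmatched relations are exactly the cases where the regular transformations mentioned above would apply.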

The transfer of FrameNet annotations has been investigated for the English-Romanian language pair (we translated into Romanian almost half of the annotated texts from the FrameNet release) with very good results (Ion 2006).

The valency frames existing in the Czech wordnet of the Balkanet project were the subject of an interesting transfer exercise into the Romanian wordnet (Tufiş et al. 2006a). We used 601 valency frames, kindly offered by Karel Pala, the Czech-Romanian aligned sub-corpus of the '1984' multilingual parallel corpus (Erjavec 2004), and the Czech-Romanian aligned wordnets. The manual validation of the automatic transfer of the valency frames from the Czech verbs to their Romanian translation equivalents revealed a surprisingly high match rate (80%), given the differences between Slavic and Romance languages.

REFERENCES

BARBU-MITITELU, VERGINICA and RADU ION. 2005. Cross-language transfer of syntactic relations using parallel corpora. Proceedings of the Workshop on Cross-Language Knowledge Induction, EUROLAN 2005, ed. by Diana Inkpen and Carlo Strapparava, 46-51. Cluj-Napoca, Romania: "Al. I. Cuza" University Press.

BRANTS, THORSTEN. 2000. TnT: a statistical part-of-speech tagger. Proceedings of the 6th ANLP Conference, ed. by Sergei Nirenburg, 224-231. Seattle, WA: Association for Computational Linguistics.

BROWN, PETER F., STEPHEN A. DELLA PIETRA, VINCENT J. DELLA PIETRA and ROBERT L. MERCER. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19: 263-311.

CEAUŞU, ALEXANDRU. 2006. Maximum entropy tiered tagging. Proceedings of the Eleventh ESSLLI Student Session, ed. by Janneke Huitink and Sophia Katrenko, 173-179. Málaga, Spain: University of Nijmegen Publishing House.

CEAUŞU, ALEXANDRU, DAN ŞTEFĂNESCU and DAN TUFIŞ. 2006. Acquis Communautaire sentence alignment using support vector machines. Proceedings of the 5th LREC Conference, ed. by Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk and Daniel Tapias, 2134-2137. Genoa, Italy: ELRA Publications.

CHANG, CHIH-CHUNG and CHIH-JEN LIN. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

CHEN, STANLEY F. 1993. Aligning sentences in bilingual corpora using lexical information. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, ed. by Lenhart Schubert, 9-16. Columbus, OH: Association for Computational Linguistics.

DIMITROVA, LUDMILA, TOMAZ ERJAVEC, NANCY IDE, HEIKI J. KAALEP, VLADIMIR PETKEVIC and DAN TUFIŞ. 1998. Multext-East: Parallel and comparable corpora and lexicons for six central and eastern European languages. Proceedings of COLING, ed. by Christian Boitet and Pete Whitelock, 315-319. Montreal, Canada: University of Montreal.

ERJAVEC, TOMAZ. 2004. MULTEXT-East Version 3: Multilingual morphosyntactic specifications, lexicons and corpora. Proceedings of the 4th LREC Conference, ed. by Maria Teresa Lino, Maria Francisca Xavier, Fatima Ferreira, Rute Costa and Raquel Silva, 1535–1538. Lisbon, Portugal: ELRA Publications.

GALE, WILLIAM A. and KENNETH W. CHURCH. 1991. A program for aligning sentences in bilingual corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, ed. by Douglass E. Appelt, 177–184. Berkeley, CA: Association for Computational Linguistics.

ION, RADU. 2006. Metode de dezambiguizare semantică automată. Aplicaţii pentru limbile engleză şi română [Automatic word sense disambiguation methods. Applications for English and Romanian]. Ph.D. thesis, Romanian Academy, Bucharest, Romania.

ION, RADU and DAN TUFIŞ. 2004. Multilingual word sense disambiguation using aligned wordnets. Romanian Journal on Information Science and Technology (Special Issue on BalkaNet) 7.2-3: 198-214.

MOORE, ROBERT C. 2002. Fast and accurate sentence alignment of bilingual corpora. Machine Translation: From Research to Real Users. Proceedings of the 5th Conference of the Association for Machine Translation in the Americas, ed. by Stephen D. Richardson, 135-144. Heidelberg, Germany: Springer-Verlag.

TUFIŞ, DAN. 1999. Tiered tagging and combined classifiers. Text, Speech and Dialogue, Lecture Notes in Artificial Intelligence 1692, ed. by F. Jelinek and E. Nöth, 28-33. Heidelberg, Germany: Springer-Verlag.

TUFIŞ, DAN. 2000. Using a large set of EAGLES-compliant morpho-syntactic descriptors as a tagset for probabilistic tagging. Proceedings of the 2nd LREC Conference, ed. by Maria Gavrilidou, George Carayannis, Stella Markantonatou and Stelios Piperides, 1105-1112. Athens, Greece: ELRA Publications.

TUFIŞ, DAN, NANCY IDE and TOMAZ ERJAVEC. 1998. Standardised specifications, development and assessment of large morpho-lexical resources for six central and eastern European languages. Proceedings of the 1st LREC Conference, ed. by Antonio Rubio, Natividad Gallardo, Rosa Castro and Antonio Tejada, 233-240. Granada, Spain: ELRA Publications.

TUFIŞ, DAN and LIVIU DRAGOMIRESCU. 2004. Tiered tagging revisited. Proceedings of the 4th LREC Conference, ed. by Maria Teresa Lino, Maria Francisca Xavier, Fatima Ferreira, Rute Costa and Raquel Silva, 39-42. Lisbon, Portugal: ELRA Publications.

TUFIŞ, DAN, DAN CRISTEA and SOFIA STAMOU. 2004. BalkaNet: Aims, methods, results and perspectives: A general overview. Romanian Journal on Information Science and Technology (Special Issue on BalkaNet) 7.2-3: 9-34.

TUFIŞ, DAN, RADU ION, ALEXANDRU CEAUŞU and DAN ŞTEFĂNESCU. 2005. Combined aligners. Proceedings of the ACL 2005 Workshop on 'Building and Using Parallel Corpora: Data-driven Machine Translation and Beyond,' ed. by Philip Koehn, Joel Martin, Rada Mihalcea, Christof Monz and Ted Pedersen, 107-110. Ann Arbor, MI: Association for Computational Linguistics.

TUFIŞ, DAN, VERGINICA BARBU-MITITELU, LUIGI BOZIANU and CĂTĂLIN MIHĂILĂ. 2006a. Romanian WordNet: New developments and applications. Proceedings of the 3rd Conference of the Global WordNet Association, ed. by Petr Sojka, Key-Sun Choi, Christiane Fellbaum and Piek Vossen, 337-344. Jeju, Republic of Korea: Masaryk University in Brno.

TUFIŞ, DAN, RADU ION, ALEXANDRU CEAUŞU and DAN ŞTEFĂNESCU. 2006b. Improved lexical alignment by combining multiple reified alignments. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, ed. by Diana McCarthy and Shuly Wintner, 153-160. Trento, Italy: Association for Computational Linguistics.

TUFIŞ, DAN, RADU ION, ELENA IRIMIA, VERGINICA BARBU-MITITELU, ALEXANDRU CEAUŞU, DAN ŞTEFĂNESCU, LUIGI BOZIANU and CĂTĂLIN MIHĂILĂ. 2006c. Resources, tools and algorithms for the semantic web. Proceedings of the Workshop 'IST – Multidisciplinary Approaches', ed. by Angela Ioniţă, Cristina Niculescu, Cornel Lepădatu and Gabriela Dumitrescu, 13-22. Bucharest, Romania: Romanian Academy Library.


Dan Tufiş, 13 Septembrie 13, Bucharest 5, 050711, Romania [[email protected]]